khomin
Posted on April 9, 2021
One day I needed a solution that could parse Open Graph meta tags from a line of input and produce a title and an icon.
Of course, there are countless libraries built on jsoup, but that was not what I needed: I wanted to use Qt and C++.
I thought that as soon as I entered my query, "c++ parser meta tags", I would see all the solutions I was looking for.
But in reality, everything was a little more complicated.
What I did:
1) Preparation step: parse the input and decide whether it contains a valid URL or is just text
This step seems expensive (not as expensive as loading everything from the input,
but I think it is still extra work)
static bool checkIsContainsHyperlink(const QString &line) {
    // Compile the pattern once and reuse it on every call
    static QRegularExpression regex(web_pattern);
    QRegularExpressionMatch match = regex.match(line);
    return match.hasMatch();
}
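The post does not show `web_pattern`, so here is a minimal sketch of the same check in standard C++ with `std::regex` instead of `QRegularExpression`. The pattern `kWebPattern` and the function name `containsHyperlink` are my assumptions for illustration, not the values from the original project; a real pattern would probably be stricter.

```cpp
#include <regex>
#include <string>

// Hypothetical stand-in for web_pattern: "http://" or "https://"
// followed by at least one non-whitespace character.
static const char *const kWebPattern = R"(https?://\S+)";

// Returns true if the line contains something that looks like an http(s) URL.
bool containsHyperlink(const std::string &line) {
    // Compiled once and reused across calls, like the static regex above.
    static const std::regex re(kWebPattern, std::regex::icase);
    return std::regex_search(line, re);
}
```

Calling `containsHyperlink("see https://example.com/page")` returns true, while plain text without a scheme returns false, so only lines that pass this check need to go to the download step.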
2) Download, with the ability to handle redirects
Many sites do not provide the tags at the URL you request and often redirect somewhere else, for reasons I don't know
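// Call fileDownloaded() when the request finishes
connect(&m_WebCtrl, SIGNAL (finished(QNetworkReply*)), this, SLOT (fileDownloaded(QNetworkReply*)));
QNetworkRequest request(url);
// RedirectPolicyAttribute expects a QNetworkRequest::RedirectPolicy value, not a bool
request.setAttribute(QNetworkRequest::RedirectPolicyAttribute, QNetworkRequest::NoLessSafeRedirectPolicy);
m_WebCtrl.get(request);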
3) Saving the downloaded page. It may seem strange, and you are probably wondering why we save the page
The problem is that some sites can ban a specific IP that makes a lot of requests
For me, it was enough to change 3-5 characters in the URL line and I got banned for a few minutes
Caching downloaded pages solved this problem
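// Capture this and the path by value: the lambda runs later, after the enclosing scope has returned
connect(m_downloader_image, &FileDownloader::downloaded, this, [this, imagePathName]() {
    QByteArray array = m_downloader_image->downloadedData();
    if (!array.isEmpty()) {
        QFile imageFile(imagePathName);
        // Remember the local path only if the file could be opened for writing
        if (imageFile.open(QIODevice::WriteOnly)) {
            imageFile.write(array);
            m_result.og_image_local_path = imagePathName;
        }
    }
    emit signalParserDone(m_result);
});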
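The caching idea from this step can be sketched in standard C++ roughly like this: derive a stable file name from the URL, and look in the cache folder before making a request. The helper names (`fnv1a`, `cachePathFor`, `storePage`, `loadPage`) and the `.html` naming scheme are my assumptions for illustration; the original project may store pages differently. The sketch also assumes the cache folder already exists.

```cpp
#include <cstdint>
#include <fstream>
#include <iterator>
#include <string>

// FNV-1a 64-bit hash: gives the same file name for the same URL on every run.
std::uint64_t fnv1a(const std::string &s) {
    std::uint64_t h = 14695981039346656037ULL;
    for (unsigned char c : s) {
        h ^= c;
        h *= 1099511628211ULL;
    }
    return h;
}

// Maps a URL to a file inside cacheDir (the folder is assumed to exist).
std::string cachePathFor(const std::string &cacheDir, const std::string &url) {
    return cacheDir + "/" + std::to_string(fnv1a(url)) + ".html";
}

// Writes the downloaded page to the cache; returns false if writing failed.
bool storePage(const std::string &cacheDir, const std::string &url,
               const std::string &html) {
    std::ofstream out(cachePathFor(cacheDir, url),
                      std::ios::binary | std::ios::trunc);
    out << html;
    out.flush();
    return static_cast<bool>(out);
}

// Reads a cached page into html; returns false on a cache miss.
bool loadPage(const std::string &cacheDir, const std::string &url,
              std::string &html) {
    std::ifstream in(cachePathFor(cacheDir, url), std::ios::binary);
    if (!in) {
        return false;
    }
    html.assign(std::istreambuf_iterator<char>(in),
                std::istreambuf_iterator<char>());
    return true;
}
```

The intended use is to call `loadPage` first and go to the network only when it returns false, then call `storePage` with the result. A hand-written hash is used instead of `std::hash` because `std::hash` is not guaranteed to produce the same value between runs, which would make the files on disk useless after a restart.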
4) Parsing
So we have the web page in a local folder; it's time to parse it and get what we need
Unfortunately, gumbo-parser turned out to be very unfriendly for me
So to start with, I decided to use regex, hoping to replace it with something else in the future
QRegularExpression site_name_regex(og_site_name);
QRegularExpression title_regex(og_title);
QRegularExpression description_regex(og_description);
QRegularExpression url_regex(og_url);
QRegularExpression image_regex(og_image);
QRegularExpressionMatch match;

match = site_name_regex.match(html);
if (match.hasMatch()) {
    res.og_site_name = match.captured(1);
}
match = title_regex.match(html);
if (match.hasMatch()) {
    res.og_title = match.captured(1);
}
match = description_regex.match(html);
if (match.hasMatch()) {
    res.og_description = match.captured(1);
}
match = url_regex.match(html);
if (match.hasMatch()) {
    res.og_url = match.captured(1);
}
match = image_regex.match(html);
if (match.hasMatch()) {
    res.og_image = match.captured(1);
}
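The patterns themselves (`og_title`, `og_image` and so on) are not shown above, so here is a minimal sketch of what one of them might look like, written with `std::regex` from the standard library rather than `QRegularExpression`. The function `ogContent` and the pattern are my assumptions: it only handles double-quoted attributes with `property` placed before `content`, and it expects `name` to contain plain characters such as `title` or `site_name`.

```cpp
#include <regex>
#include <string>

// Returns the content of <meta property="og:NAME" content="...">,
// or an empty string if no such tag is found.
std::string ogContent(const std::string &html, const std::string &name) {
    // Group 1 captures everything between the quotes of the content attribute.
    const std::regex re(
        R"re(<meta\s+property="og:)re" + name + R"re("\s+content="([^"]*)")re",
        std::regex::icase);
    std::smatch m;
    return std::regex_search(html, m, re) ? m[1].str() : std::string();
}
```

With this helper, `ogContent(html, "title")` plays the role of the `title_regex` block above. Pages that swap the attribute order or use single quotes will not match, which is one of the reasons regex is only a first step here and a real HTML parser is the better long-term choice.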
Finally, we can enter a URL and enjoy the preview and title