How Facebook parses blogspot.com Open Graph properties

Some blogspot.com pages do not contain Open Graph tags, but the Facebook Object Debugger still parses their Open Graph properties correctly. How does it get the Open Graph information?

For example, I don't see any Open Graph meta tags in the source code of http://sushiwens.blogspot.com/, but Facebook parses it correctly: https://developers.facebook.com/tools/debug/og/object?q=http%3A%2F%2Fsushiwens.blogspot.com%2F

I need to implement a similar Open Graph parsing function in Python, so I need to know how it works.



Solution 1:[1]

I don't have access to Facebook's source code, so I can't be sure how Facebook does it, but this site may help you.
I used the author's ideas to develop a parser in Python. If it helps, the Python project is here.

To sum up, here is a strategy for getting the data when there are no og tags:

  • Title:
    • search the title tag
    • search h1 in body
    • search h2 in body ...
  • Description:
    • search in <meta name="description">
    • search in visible text in body (first <p> for instance)
    • searching <meta name="twitter:description"> is also an option, but I don't do that: that description is usually poor, more related to Twitter than to the true content of the link.
  • Domain name:
    • search <link rel="canonical">
    • search og:url
    • but I do something simpler: extract the domain from the target link (in Python: urlparse(url).netloc)
  • Last but not least: images:
    • search <link rel="image_src" href="image url" />
    • parse the HTML of the target link for all <img> tags and sort them into three groups:
      • small images: those with one dimension <= 50px
      • bad ratio images: of the remaining ones, those where longest side / shortest side > 3
      • good images: the remaining
    • Then choose the biggest of the good images. If there are no good images, take the biggest bad ratio image; failing that, the biggest small image. (biggest = largest width x height)
    • Fetching all the images can be time-consuming! The dimensions can be read from the first bytes of each image, but that's another story (see 2nd link)
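
The fallback strategy above can be sketched with the Python standard library alone. This is only an illustration of the heuristics listed in this answer, not Facebook's actual algorithm and not the code of the linked project. The names FallbackMetaParser, guess_metadata and pick_image are made up for this example, and pick_image assumes the image dimensions have already been obtained somehow (for instance by downloading the images). The <link rel="image_src"> and canonical lookups are left out to keep it short.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class FallbackMetaParser(HTMLParser):
    """Hypothetical helper: keeps the first <title>, <h1>, <h2>, <p> text
    and the content of <meta name="description">."""

    def __init__(self):
        super().__init__()
        self.found = {}        # tag name (or "meta_description") -> text
        self._capture = None   # tag whose text is currently being collected
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.found.setdefault("meta_description",
                                  (attrs.get("content") or "").strip())
        elif (tag in ("title", "h1", "h2", "p")
              and tag not in self.found and self._capture is None):
            self._capture = tag
            self._buffer = []

    def handle_data(self, data):
        if self._capture:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._capture:
            # Collapse runs of whitespace into single spaces.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.found[tag] = text
            self._capture = None


def guess_metadata(html, url):
    """Apply the title / description / domain fallbacks from the list above."""
    parser = FallbackMetaParser()
    parser.feed(html)
    parser.close()
    f = parser.found
    return {
        "title": f.get("title") or f.get("h1") or f.get("h2"),
        "description": f.get("meta_description") or f.get("p"),
        "domain": urlparse(url).netloc,
    }


def pick_image(images):
    """images: list of (url, width, height) tuples with known dimensions.
    Returns the URL of the preferred image, or None if the list is empty."""
    small, bad_ratio, good = [], [], []
    for img in images:
        _, w, h = img
        if min(w, h) <= 50:                 # one dimension <= 50px
            small.append(img)
        elif max(w, h) / min(w, h) > 3:     # longest / shortest side > 3
            bad_ratio.append(img)
        else:
            good.append(img)
    # Prefer good images, then bad ratio, then small; biggest = width x height.
    for bucket in (good, bad_ratio, small):
        if bucket:
            return max(bucket, key=lambda i: i[1] * i[2])[0]
    return None
```

For example, guess_metadata(page_html, "http://sushiwens.blogspot.com/") would return a dict whose "domain" entry is "sushiwens.blogspot.com", with "title" and "description" filled from whichever fallback matched first. A real page with unclosed <p> tags or other malformed markup would need a more forgiving parser than this sketch.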

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1