Saturday, December 14, 2024

Google BlogSearch, Ranking Blog Documents Patent

For a long time my blogs have performed amazingly well with Google Blog Search. I always appear in the relevant results quickly, and the results I obtain have some reasonable longevity, even when I am not the original source of a story.

Considering how much competition I often have for certain search terms which everyone seems to be writing about because of common interest, I must have been doing a number of things right.

Bill Slawski of SEO By The Sea broke the news a few days ago of Google’s Patent Application for Ranking Blog Documents.

SEO Round Table posted a synopsis lifted from the Cre8asite Forums that had been posted by Bill, and seems to be the easiest to understand.

I am going to do a little bit of mix and match here and inject my own commentary, but my interpretation of the patent is actually slightly different from those I have read so far.

It should be noted I am working my way through the patent itself, and not recompiling the summaries of others.

Relevancy & Quality – Blog | Blogpost

It should first of all be noted that in the patent Google doesn’t differentiate between individual blog posts and whole blogs.

The phrase “blog document,” as used hereinafter, is to be broadly interpreted to include a blog, a blog post, or both a blog and a blog post. It will be appreciated that the techniques described herein are equally applicable to blogs and blog posts.

Later on in the patent, they also mention that feeds are also included within the documents that are compared and rated.

two distinct sets of data are used to determine a score of a blog (or blog post) in response to a search query–the topical relevance of the blog (or blog post) to the terms in the search query and the quality of the blog (or blog post), which is independent of the query terms. The quality of the blog (or blog post) may positively or negatively affect the score of the blog (or blog post)

Relevancy – this applies to the search term, thus Google will analyse the blog page, and they will also in some way determine the relevance to the whole blog.
Quality – this is irrespective of the search term, so think about factors from outside your niche
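
To make the two-part score concrete, here is a minimal sketch. The patent does not say how relevance and quality are combined, so the multiplicative form, the function name `blog_score`, and the value ranges are all my own assumptions, chosen only to show how a query-independent quality value could push a result up or down.

```python
def blog_score(relevance: float, quality: float) -> float:
    """Combine a query-dependent relevance score with a query-independent quality score.

    Assumptions (not from the patent): relevance lies in [0, 1] and quality
    lies in [-1, 1], where a negative quality demotes the document and a
    positive quality promotes it.
    """
    if not 0.0 <= relevance <= 1.0:
        raise ValueError("relevance must be between 0 and 1")
    if not -1.0 <= quality <= 1.0:
        raise ValueError("quality must be between -1 and 1")
    # Quality scales the relevance score rather than replacing it.
    return relevance * (1.0 + quality)
```

With this shape, a neutral quality of 0 leaves the relevance score untouched, while the worst quality of -1 removes the document from contention however relevant it is.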

Google Blog Search – Positive Factors Affecting Search Quality | Relevancy

  • Popularity of the blog document
  • A number of news aggregator sites (commonly called “news readers” or “feed readers”) exist where individuals can subscribe to a blog document (through its feed). Such aggregators store information describing how many individuals have subscribed to given blog documents. A blog document having a high number of subscriptions implies a higher quality for the blog document

    The patent application was filed on 13 September 2005, and Google Blog Search launched on 13 September 2005. At that time they logically could not have been basing this on Google Reader subscriber numbers. The Google Reader blog was launched on October 21, 2005, with a post saying they had been up and running for 2 weeks.
    Maybe there is a coincidence between the 2 events.

    So which data were Google basing this part of their patent on? Some services such as Technorati and Bloglines do provide readership data, as does Feedburner, though most services report their readership figures, while they are collecting new blog posts, to a service like Feedburner, which aggregates the statistics.

    It seems there might be some value in collecting Technorati favorites (my reciprocation policy might be well worth it) beyond limited bragging rights. Google of course now have access to lots of usage data through Google Reader, so maybe other sources will eventually be phased out.

  • Implied popularity of the blog document
  • This implied popularity may be identified by, for example, examining the click stream of search results. For example, if a certain blog document is clicked more than other blog documents when the blog document appears in result sets, this may be an indication that the blog document is popular and, thus, a positive indicator of the quality of the blog document.

    Click data from search results, possibly from Google Toolbar users.

  • Existence of the blog document in blogrolls
  • The existence of the blog document in blogrolls may be a positive indication of the quality of the blog document. It will be appreciated that blog documents often contain not only recent entries (i.e., posts), but also “blogrolls,” which are a dense collection of links to external sites (usually other blogs) in which the author/blogger is interested. A blogroll link to a blog document is an indication of popularity of that blog document, so aggregated blogroll links to a blog document can be counted and used to infer magnitude of popularity for the blog document.

    Everything I have ever read has suggested that for normal search, blogroll links that are site wide carry diminishing value. Just because it is listed here as part of the calculation does not necessarily mean that everyone should start building up huge blogrolls… well unless they want to game Technorati and have a blog network.

  • Existence of the blog document in a high quality blogroll
  • The existence of the blog document in a high quality blogroll may be a positive indication of the quality of the blog document. A high quality blogroll is a blogroll that links to well-known or trusted bloggers. Therefore, a high quality blogroll that also links to the blog document is a positive indicator of the quality of the blog document.

    Another revelation, links on high quality pages are worth more than links on low quality pages.

    Remember that “blog document” can mean both blog page and blog site.

    Can “blogroll” just refer to a list of links on what is identified as a blog? If so, a column of links to related pages might also class as a blogroll, whether in the sidebar or below the content.
    Thus a list of links to related documents on the same site could be looked on as a blogroll on a blog document.

    Related links plugins are very powerful, especially if you also include them in content that gets syndicated by design, or by sploggers.

  • Tagging of the blog document
  • Tagging of the blog document may be a positive indication of the quality of the blog document. Some existing sites allow users to add “tags” to (i.e., to “categorize”) a blog document. These custom categorizations are an indicator that an individual has evaluated the content of the blog document and determined that one or more categories appropriately describe its content, and as such are a positive indicator of the quality of the blog document.

    Well some sites do allow you to tag in a meaningful way, maybe Google uses shared tags from Del.icio.us and other sites, but many of those use nofollow extensively.
    It is my own belief that self-tagging content heavily with plugins such as Ultimate Tag Warrior helps a huge amount. I have given lots of examples before, but more recent examples include:

    toolbar pagerank
    google reader feedburner
    feedburner google reader
    compete toolbar
    duplicate content supplemental results

    Yes, I am just going down the inbound traffic results looking for likely candidates that rank well in both blog and normal search and aren’t totally obscure. These are subjects that sites in my niche have also talked about, with the keywords in the title, and which you would expect to rank higher than my own content.

    This doesn’t just affect blogsearch; Google have been using it for some time with the main results as well.
    Here are my observations regarding tagging from back in November, especially how they could relate to LSI calculations.

  • References to the blog document by other sources
  • Wow, revelation again: good links are worth having, whether to individual pages or to the blog.

  • Pagerank of the blog document

Pagerank is still relevant, who knows for how long and how much.

It will be appreciated that other indicators may also be used.

What seems to be missing, at least at time of application?

  • Domain age?
  • Trustrank?
  • Page Titles?
  • URLs?
  • Growth rate of link popularity

Plus lots more that also factor into it, but general search patents probably also cover blog search.

Google Blog Search – Negative Factors Affecting Search Relevancy | Quality

  • Frequency of new posts on the blog document
  • The frequency at which new posts are added to the blog document may be a negative indication of the quality of that blog document. Feeds typically include only the most recent posts from a blog document. Spammers often generate new posts in spurts (i.e., many new posts appear within a short time period) or at predictable intervals (one post every 10 minutes, or a post every 3 hours at 32 minutes past the hour). Both behaviors are correlated with malicious intent and can be used to identify possible spammers. Therefore, if the frequency at which new posts are added to the blog document matches a predictable pattern, this may be a negative indication of the quality of the blog document.

    Make sure there is some variation when you publish your content for the day, especially with future dated posts.
    Most spamming tools are actually fairly sophisticated, thus I am not sure this measurement is very accurate. These days it most likely indicates a blogger who is very organised.
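
As a rough illustration of the "predictable intervals" idea, the sketch below measures how evenly spaced a set of post timestamps is. The patent names the signal but not the maths, so the coefficient-of-variation measure, the function names, and the 0.05 threshold are my assumptions. It does not attempt to detect posts arriving in spurts.

```python
from datetime import datetime, timedelta
from statistics import mean, pstdev


def interval_regularity(timestamps):
    """Return the coefficient of variation of the gaps between posts.

    0.0 means every gap is identical; larger values mean more variation.
    """
    if len(timestamps) < 3:
        raise ValueError("need at least three posts to compare gaps")
    ordered = sorted(timestamps)
    gaps = [(b - a).total_seconds() for a, b in zip(ordered, ordered[1:])]
    avg = mean(gaps)
    if avg == 0:
        return 0.0  # all posts share one timestamp
    return pstdev(gaps) / avg


def looks_scheduled(timestamps, threshold=0.05):
    """Flag a posting pattern whose gaps barely vary (threshold is an assumption)."""
    return interval_regularity(timestamps) < threshold


start = datetime(2005, 9, 13, 8, 0)
every_ten_minutes = [start + timedelta(minutes=10 * i) for i in range(6)]
print(looks_scheduled(every_ten_minutes))  # True: the gaps are identical
```

A measure like this would also flag a perfectly honest blogger who publishes at the same time every day, which is exactly why some variation in future-dated posts seems sensible.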

  • The content of the posts in the blog document
  • The content of the posts in the blog document may be a negative indication of the quality of that blog document. A feed typically contains some or all of the content of several posts from a given blog document. The blog document itself also includes the content of the posts. Spammers may put one version of content into a feed to improve their ranking in search results, while putting a different version on their blog document (e.g., links to irrelevant ads). This mismatch (between feed and blog document) can, therefore, be a negative indication of the quality of the blog document.

    This is actually a very significant and interestingly worded item. Google are stating that they are comparing the content of a feed with the content on your pages to ensure it matches.

    Based upon this:-

    • Don’t use a content spinner on your feeds to avoid duplicate content
    • Allow Google to index your feeds
    • If you use related links on your blog, make sure you use them in your feeds too
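
To show what a feed-versus-page comparison might look like, here is a minimal sketch that checks how many of the feed's distinct words also appear on the page. The patent only says a mismatch can count against you; the word-overlap measure and the function names are my assumptions, and a real system would surely be more sophisticated.

```python
import re


def word_set(text: str) -> set[str]:
    """Lower-case the text and return its distinct words."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def feed_page_overlap(feed_text: str, page_text: str) -> float:
    """Return the fraction of the feed's distinct words that also appear on the page.

    The page is allowed to contain extra material (sidebar, footer), so only
    the feed's words are checked, not the other way round.
    """
    feed_words = word_set(feed_text)
    if not feed_words:
        return 0.0
    return len(feed_words & word_set(page_text)) / len(feed_words)
```

A feed that matches its page scores close to 1.0 even when the page carries a sidebar and footer, while a feed that says one thing and a page that says another scores close to 0.0.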

  • Duplicate Content, especially in feeds
  • Also, in some instances, particular content may be duplicated in multiple posts in a blog document, resulting in multiple feeds containing the same content. Such duplication indicates the feed is low quality/spam and, thus, can be a negative indication of the quality of the blog document.

    I can’t say I have noticed a problem having a lot of straggling RSS feeds on categories and tags.
    This could also be referring to things like the large footer I have on each post, though I haven’t seen a problem with that either.

    After the last toolbar pagerank update I spent some time studying Matt Cutts’ blog, and also looking at how pagerank was being transferred around my own site. Pagerank is only slightly useful as a guide, and only immediately after an update.
    Rather than repeat myself, you can read about my organic garden approach to this site.

     

  • Collective Intelligence
  • The words/phrases used in the posts of a blog document may also be a negative indication of the quality of that blog document. For example, from a collection of blog documents and feeds that evaluators rate as spam, a list of words and phrases (bigrams, trigrams, etc.) that appear frequently in spam may be extracted. If a blog document contains a high percentage of words or phrases from the list, this can be a negative indication of quality of the blog document.

    Google invest a lot of research in analysing spam, detecting word patterns that are common in it, and using those patterns to identify other spam documents.
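
The phrase-list check described in the patent can be sketched in a few lines. The list of spam phrases here is supplied by the caller and purely hypothetical; the function names and the choice of bigrams as the default are my assumptions.

```python
def ngrams(words: list[str], n: int) -> list[str]:
    """Return every run of n consecutive words, joined with spaces."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def spam_phrase_fraction(text: str, spam_phrases: set[str], n: int = 2) -> float:
    """Return the fraction of the text's n-grams that appear in a spam phrase list."""
    grams = ngrams(text.lower().split(), n)
    if not grams:
        return 0.0  # text is shorter than n words
    hits = sum(1 for gram in grams if gram in spam_phrases)
    return hits / len(grams)
```

For example, "buy cheap pills online today" contains four bigrams, so a list holding "buy cheap" and "cheap pills" would give a fraction of 0.5.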

  • A size of the posts in the blog document
  • The size of the posts in a blog document may be a negative indication of quality of the blog document. Many automated post generators create numerous posts of identical or very similar length. As a result, the distribution of post sizes can be used as a reliable measure of spamminess. When a blog document includes numerous posts of identical or very similar length, this may be a negative indication of quality of the blog document.

    This might be of special interest to those who outsource their articles: you need to ensure the article length varies.
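
If you want to check your own posts for this, the sketch below reports what share of them sit within a few percent of the median word count. The patent talks about the distribution of post sizes without giving a measure, so the median-based check, the function name, and the 5% tolerance are my assumptions.

```python
from statistics import median


def near_median_share(posts: list[str], tolerance: float = 0.05) -> float:
    """Return the share of posts whose word count is within `tolerance` of the median."""
    if not posts:
        raise ValueError("need at least one post")
    lengths = [len(post.split()) for post in posts]
    mid = median(lengths)
    close = sum(1 for n in lengths if abs(n - mid) <= tolerance * mid)
    return close / len(lengths)
```

A value near 1.0 means almost every post is the same length, which is the pattern the patent associates with automated post generators.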

  • A link distribution of the blog document
  • A link distribution of the blog document may be a negative indication of quality of the blog document. As disclosed above, some posts are created to increase the pagerank of a particular blog document. In some cases, a high percentage of all links from the posts or from the blog document all point to either a single web page, or to a single external site. If the number of links to any single external site exceeds a threshold, this can be a negative indication of quality of the blog document.

    In some ways this contradicts the benefit of blogrolls mentioned earlier, but as previously quoted, Google use “blog document” in multiple contexts and compare those contexts, so this could just refer to multiple spam links within the content that always point to a single domain.
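
A quick way to see your own link distribution is to count outbound links per host. The sketch below returns the most-linked external host and its share of all external links. The patent mentions both a "high percentage" and a threshold on the number of links without giving values, so the share-based measure and the function name are my assumptions.

```python
from collections import Counter
from urllib.parse import urlparse


def top_external_host_share(links: list[str], own_host: str) -> tuple[str, float]:
    """Return the most-linked external host and its share of all external links."""
    hosts = [urlparse(url).netloc.lower() for url in links]
    # Drop relative links (empty host) and links back to the blog itself.
    external = [host for host in hosts if host and host != own_host.lower()]
    if not external:
        return ("", 0.0)
    host, count = Counter(external).most_common(1)[0]
    return (host, count / len(external))
```

If three of your four external links point at one domain, the share is 0.75, which is the kind of concentration the patent seems to have in mind.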

  • The presence of ads in the blog document
  • The presence of ads in the blog document may be a negative indication of quality of the blog document. If a blog document contains a large number of ads, this may be a negative indication of the quality of the blog document.

    Remember this is just a patent, and Google recently relaxed the rules about having ads from other networks along with Adsense. As long as a page is of a reasonable size to support the adverts, I don’t think there is a problem. If you just have a heading and 5 words, with 10 advertising blocks, you might want to add a few more words.

    However they go on to say this

    Moreover, blog documents typically contain three types of content: the content of recent posts, a blogroll, and blog metadata (e.g., author profile information and/or other information pertinent to the blog document or its author). Ads, if present, typically appear within the blog metadata section or near the blogroll. The presence of ads in the recent posts part of a blog document may be a negative indication of the quality of the blog document.

    Thus if you are using blocks in the content for all your ads, you might not rank as well, especially if you use multiple networks. You can probably get away with 3 in the content, or maybe 1 or 2 per post.

  • It will be appreciated that other indicators may also be used

Conclusion

The feed stats information is very useful. Looking at the timing, my conclusion is that, with Google Reader in its infancy, Google might have been using Bloglines and Technorati Favorites data. Alternatively, though this is less likely, they weren’t using that part of the patent when blog search was introduced.

For me the most significant information was tagging, but just linking through to Technorati with your tags isn’t a great idea.

Remember that Google have their own blogging system, and they have archives and labels, and they are not going to create a system to generate duplicate content and then penalise you for it. Google wouldn’t have added such a system unless they intended to benefit from the enhanced data.

You don’t have to build your blogs in a 1990s-era tree-like structure to rank well.

*Originally published at AndyBeard.eu
