A former Yandex employee has leaked the source code of the search engine and other services. This provides exciting insights into the inner workings of the search engine: ranking factors, weightings, and more.
Yandex is the search engine market leader in Russia and fifth in the world by page views. While Yandex is not Google, the essential workings of search engines are comparable. The following findings are not necessarily directly applicable to Google, but they do provide a fascinating insight.
There is an extensive list of 1,922 ranking factors shared and analyzed by many SEOs.
Thousands more ranking factors have yet to be discussed at this time.
Some factors are marked as TG_UNUSED (149) or TG_REMOVED (115).
Note: Several SEOs misread the factor tag TG_DEPRECATED and assume that these 242 of the first 1,922 factors are unused or removed. This tag just means that the factor shall not be used in new “Matrixnet formulas”; it is still calculated and may be present in existing ranking formulas.
That leaves 1,658 factors from the first batch of 1,922 to be active ranking factors, with many more in other parts of the code base.
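The arithmetic behind that number can be sketched in a few lines. The tag names and counts are the ones quoted above; the dictionary layout and the function name are only my illustration, not how the leaked code stores them, and the sketch assumes that no factor carries both TG_UNUSED and TG_REMOVED.

```python
# Counts quoted in the article for the first batch of factors.
TOTAL_FACTORS = 1922
TAG_COUNTS = {
    "TG_UNUSED": 149,
    "TG_REMOVED": 115,
    "TG_DEPRECATED": 242,
}

# Only unused and removed factors drop out of the active set.
# Deprecated factors are still computed, so they stay in.
INACTIVE_TAGS = ("TG_UNUSED", "TG_REMOVED")


def active_factor_count(total, tag_counts, inactive_tags=INACTIVE_TAGS):
    """Return how many factors remain after subtracting the inactive tags."""
    return total - sum(tag_counts[tag] for tag in inactive_tags)


active = active_factor_count(TOTAL_FACTORS, TAG_COUNTS)
print(active)  # 1658
```

Subtracting the 242 deprecated factors as well would give 1,416, which is the undercount that results from treating TG_DEPRECATED as "unused".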
That is substantially more than the approximately 200 ranking factors that Google has publicly mentioned so far.
Like Google, which has already confirmed this for its own search, Yandex uses different algorithms and weightings depending on the search query.
It differentiates queries by time of day, commercial intent, adult content, and many more criteria.
An initial list of ranking factor weights is also known.
To go into some detail, I picked some interesting link-related factors.
As explained, there are many link-related factors in the code, and each of them deserves days of research and analysis of the code. But the available comments, including those in the “common 1,922 factors”, are also very useful.
I’ve always “preached” that links need to age like good old wine. There was a short period (mid-2003, when I started in SEO) when links would work immediately in Google.
But shortly after that, a lot of delays, filters, and link damping factors in general were introduced.
The papers about combating web link spam were written about 20 years ago already, and we see these concepts clearly confirmed in the Yandex leak.
SEO practitioners know that anchor text matters for SEO big time, and the Yandex leak confirms that as well.
Even in the “common 1922,” there are 146 factors ONLY related to the Anchor Text, marked with TG_LINK_TEXT.
There are probably at least a couple dozen more in other areas like image search and video search.
So we’re talking about roughly 200 factors (actually “Features”) that Yandex has defined for the link text.
These factors are then multiplied by several aspects, like geolocation, topic, and many more, in what appears to be the “MatrixNet” system, which is also documented online.
My age-old saying that the rules for links and their effects are different per country, topic, language, etc., is confirmed.
These two aspects could take weeks to learn more about if conducted by highly experienced, professional software engineers. It is still very early, especially since many people don’t even look at the code and its comments.
For example, there’s the tag TG_DEPRECATED. The code clearly states what it means:
“the factor shall not be used anywhere, but it is still computed and may be present in existing formulas.”
It clearly states that these factors can still be used in some MatrixNet “formulas.” That is the typical meaning of “deprecation” for code elements.
Yet many SEOs discard those factors, claiming that they are unused. Factors that are really gone carry a different tag in the code, TG_REMOVED (“the factor is completely removed”).
Of course, a PageRank calculation is the first factor in the list. PageRank was, and still is, the “secret” of Google, initially published in their famous “BackRub” paper, and it is still used today in many variants at Google and at Yandex as well.
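For readers who have never seen it written down, here is a minimal sketch of the textbook PageRank calculation by power iteration. This is the classic published algorithm, not the variant found in the Yandex code; the damping value of 0.85, the iteration count, and the tiny example graph are my assumptions for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration.

    links maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Pass this page's rank on in equal parts to its link targets.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical four-page link graph.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(graph)
for page, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(page, round(score, 4))
```

In this toy graph, page "c" collects links from three pages and ends up on top, while "d" has no incoming links and only keeps the minimum share.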
There are Static and Dynamic factors. The Dynamic factors are evaluated on a per-query basis, like the FI_LINK_RELEV Link Relevance factor.
The same method applies to anchor text-related factors.
With the factor FI_PAGE_RANK_BONUS, a page rank bonus is applied for long-tail queries.
This page rank bonus is assigned dynamically to almost all two or more word requests, except for a tiny number of queries.
Giving links more weight for long-tail queries makes sense, since long-tail phrases occur much less often on the web.
The bonus must matter most when there is an exact match between a three- or four-word anchor text and the query.
We see in the initial weights that these factors are all active in Yandex.
Actually, two closely related factors assign different weights for exact matches and phrase matches between anchor text and the query.
This factor assignment must also be dynamic, so the value of links changes dynamically, depending on the query.
I think this is very interesting, because many link factors are usually considered static.
This means that links are interpreted as stronger or weaker, depending on the search query.
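To make the idea of query-dependent link value more concrete, here is a small sketch of how exact and phrase matches between anchor text and query could be told apart. Everything in it is hypothetical: the function names, the simple tokenizer, and especially the weights are my illustration and are not taken from the Yandex code.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def match_type(anchor_text, query):
    """Classify the anchor/query relationship as 'exact', 'phrase' or 'none'."""
    anchor = tokenize(anchor_text)
    q = tokenize(query)
    if not anchor or not q:
        return "none"
    if anchor == q:
        return "exact"
    # Phrase match: the query words appear in order and contiguously in the anchor.
    width = len(q)
    for start in range(len(anchor) - width + 1):
        if anchor[start:start + width] == q:
            return "phrase"
    return "none"


# Hypothetical weights, for illustration only.
ILLUSTRATIVE_WEIGHTS = {"exact": 1.0, "phrase": 0.5, "none": 0.0}


def link_weight_for_query(anchor_text, query):
    """Return the illustrative weight of one link for one query."""
    return ILLUSTRATIVE_WEIGHTS[match_type(anchor_text, query)]


anchor = "buy red running shoes online"
for query in ("buy red running shoes online", "red running shoes", "blue sandals"):
    print(query, "->", match_type(anchor, query), link_weight_for_query(anchor, query))
```

The same link comes out as an exact match, a phrase match, or no match at all, depending only on the query, which is the dynamic behavior described above.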
While many SEOs swear by TF*IDF, it turns out that Yandex actually uses a different approach of similar age for its term weights. BM25 is a classical method in information retrieval, but it differs from what the SEO on-page tools I know of use.
This BM25 method is applied to links in one factor and to body text and links combined in another factor.
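For those who want to see what BM25 looks like, here is a minimal sketch of the classic Okapi BM25 formula, applied to a handful of anchor texts as if each were a small document. The parameter values k1=1.2 and b=0.75 are common textbook defaults, and the anchor texts are made up; how Yandex parameterizes and combines BM25 in its factors is not shown here.

```python
import math
from collections import Counter


def bm25_scores(query_terms, documents, k1=1.2, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    if not documents:
        return []
    n_docs = len(documents)
    avg_len = sum(len(doc) for doc in documents) / n_docs

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))

    scores = []
    for doc in documents:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_terms:
            tf = term_freq[term]
            if tf == 0:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1.0 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            norm = tf + k1 * (1.0 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1.0) / norm
        scores.append(score)
    return scores


# Hypothetical anchor texts pointing to one page, already tokenized.
anchors = [
    ["cheap", "red", "shoes"],
    ["red", "shoes", "red", "shoes", "sale"],
    ["click", "here"],
]
bm25 = bm25_scores(["red", "shoes"], anchors)
print([round(score, 3) for score in bm25])
```

Unlike plain TF*IDF, repeating a term brings diminishing returns, and longer texts are scored down relative to the average length.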
Right now, some on-page tool vendors are rushing to implement BM25 in their software.
We’re just getting started. This gives you a rough overview of what’s in there.
There are so many more valuable insights ahead.
But we were quite right in many of our assumptions and interpretations, made from the outside, of how such an extensive search engine would work, at least regarding links.
All in all, the Yandex code leak offers a fascinating insight into the inner workings of a modern search engine.
Although not all of the findings can be directly applied to Google, many assumptions made in recent years about the general functioning of large Internet search engines are confirmed.
I assume the SEO industry still has a few interesting months ahead of it with new insights from this leak.
How useful do you think the Yandex ranking factors are in terms of SEO for Google? Do you think it's two different engines so not comparable?
I believe they are very comparable. In fact, it appears that Yandex has always actively hired Google engineers.
Someone said on LinkedIn that he could not imagine Google “documenting” ranking factors just like that. But that’s how a complex system like that needs to be built. This leak is from a very authoritative insider, and Google has similar code that could also be leaked.
The old, often-repeated statement that not even Google employees know the ranking factors always seemed absurd to a tech person like me. The number of people who have all the details will be very small, but the details must be somewhere in the code, because that’s what runs the search engine.
Around all that, there are many layers, as we’ve seen with the various methods in Google Panda and Google Penguin, which took the “core” results and filtered them. Only after many years were such methods integrated “into the core”. Guess why: because not many developers have the knowledge or skill to work on such an old, large, and impactful code base.
That being said, while we may not assume that we can project everything from the Yandex leak onto Google, it is very likely that similar solutions are found there, at least in the core. Yandex was always “famous” for wanting to clone everything from Google, so we may see that in parts of the code as well, coming from ex-Googlers working for Yandex.
In any case there are the same problems to solve, and a lot of smart solutions for these problems that can teach us.
Many of the assumptions about Google ranking factors can be found in the source code of Yandex. This is not a confirmation that Google also uses them, but it is a good indication.
Is the Yandex leak a big deal?
It is a very big deal. It is the biggest chance to learn how a search engine can be built and operated.
We’re only scratching the surface right now. It’s also questionable who will invest the resources needed to analyze such a huge code base, but I can imagine some players. It is certainly not a trivial task to perform in your spare time.