SPAMBrain - How Google Detects Spam Links, or From a YOU to a Whale to a LINK
Disclaimer
We only show parts of the link evaluation and spam detection. Most of it is current technology. If we at @FoxxUP can realize this technologically, then Google certainly can.
All Equal or Each Individually
In the early days of Google as a search engine, in the late 90s, its success was based on Pagerank. "A link is a recommendation!" was the starting point. It was a rating system that delivered great results, better than those of other search engines. Pagerank assigns a rating to each page, which means that every link on that page receives the same rating: if a PR 5 has been calculated for a page (as displayed in the toolbar), then the PR 5 applies to each of its links. Google still uses this rating system today, but it has lost importance because other methods are available.
New score systems
We are pretty sure that there are one or more additional scoring systems. One of these systems seems to be SPAMbrain, although not much is known about how it works. What is known is that it uses big data and machine learning. Some patents from Google show possible components or ways of working. However, just because a patent application has been filed does not mean that Google uses it in production.
Let's make a few changes of perspective.
Another change of perspective.
Let's pretend we are a link on a page. (In "The Hitchhiker's Guide to the Galaxy" by Douglas Adams, there is a famous scene in which a whale suddenly comes into existence in free fall above a planet and becomes aware of itself.
Within seconds, the whale develops a primitive but remarkable self-awareness and begins to explore the world around it until it finally hits the ground.)
Change in perspective - You are a LINK
Link: Where am I? It looks like I'm standing in the middle of a crowd of something.
What is around me? Around me are lots of letters, no, words.
Who am I? There's a sign on me that says "Buy me now!" Maybe that's my name.
What do I look like? Oh, there's something round on me, it could be a button, and there's also something like an arrow pointing somewhere.
What can I do? I can't walk, but maybe I can press this button.
Is there someone else there? A voice sounds from the button:
"Yes." Hello, I'm "Buy me now".
And who are you?
Link2: Uh, oh, there's a voice, "yeah, uh, here I am."
(Man, how embarrassing.) "My name is 'Click here'."
Buy me now: "Great, now I know someone. Uh, tell me, what's your situation?"
Click here: There are many letters around me and many others like me, with different names of course.
Buy me now: I also have words and I think it's about ....
Click here: "Mine too. ...."
And of course it could go on ...
Up to this point, it should be enough to understand through the change of perspective what Google technology looks at when it comes to links.
All the questions above can be answered with data, so here's a summary:
Who am I?
The name is in the HTML A tag, but target, size, etc. can also be determined programmatically.
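A minimal sketch of how the "Who am I?" data could be read from the HTML, using only Python's standard library. The class name `LinkCollector` and the helper `extract_links` are hypothetical names chosen for illustration; this is not a description of how Google or our own pipeline does it.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects anchor text and a few attributes for every <a href=...> tag."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current = None  # the link currently being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            attr = dict(attrs)
            if "href" in attr:
                self._current = {
                    "href": attr["href"],
                    "target": attr.get("target"),
                    "rel": attr.get("rel"),
                    "text": "",
                }

    def handle_data(self, data):
        # Text between <a> and </a> is the link's "name" (anchor text).
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            # Collapse runs of whitespace inside the anchor text.
            self._current["text"] = " ".join(self._current["text"].split())
            self.links.append(self._current)
            self._current = None


def extract_links(html):
    """Return a list of dicts, one per link found in the HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links
```

Calling `extract_links` on a page's HTML yields one record per link with its name, destination and target, which is the raw material for the questions that follow.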
Where am I?
First comes layout recognition (e.g. with machine learning), i.e. determining header, footer, content, sidebar, etc. Once the text has been extracted, the distance within the text can be measured, e.g. from the beginning of the text, the sentence or the paragraph.
Distance can be imagined here as in an Excel table: cell D9 in a spreadsheet is a marker relative to the starting cell A1, forming a diagonal that defines the exact location. (see picture)
What is directly around me?
This can be single words or the whole sentence (window).
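The distance and window ideas can be sketched on already extracted text. The function name `link_context` and the default window of three words are assumptions for illustration; a real system would work on the rendered layout rather than on a flat string.

```python
import re


def link_context(text, anchor, window=3):
    """Locate `anchor` in `text` and describe where it sits.

    Returns the word offset from the start of the text, the relative
    position (0.0 = start, close to 1.0 = end) and the neighbouring words.
    Returns None if the anchor text is empty or not found.
    """
    words = re.findall(r"\S+", text)
    anchor_words = anchor.split()
    n = len(anchor_words)
    if n == 0:
        return None
    for i in range(len(words) - n + 1):
        if words[i:i + n] == anchor_words:
            return {
                "word_offset": i,
                "relative_position": i / len(words),
                "before": words[max(0, i - window):i],
                "after": words[i + n:i + n + window],
            }
    return None
```

For the text "Our new running shoes are great. Buy me now! Free shipping today." and the anchor "Buy me now!", the link starts at word offset 6, and the window holds the three words on either side.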
What can I do? - I look, I change the text, the color, I open a new window and pop-up etc.
What do I look like? Am I just a link, a button, an image or something else? Am I underlined, or do I have the same color as the background?
Am I covered by something? This is technically detectable by rendering the entire page, but that is very cost-intensive and time-consuming.
Where do I link to or what do I recommend? - The linked page should match in terms of content, and therefore you look at the other page in the same detail, with text, topic, layout etc.
Which topic?
Categorization, i.e. sorting into a category, is rarely considered, but it is important because SPAM links, for example, often do not fit thematically. Natural language processing can be used to create a summary, or algorithms can be used to find the possible ranking keywords. The keywords can then be checked for similarity to each other.
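As a simple stand-in for the keyword comparison, the sketch below uses Jaccard similarity on two keyword sets. The keywords are assumed to have been extracted already, and the threshold of 0.2 in `fits_topic` is a hypothetical value; production systems would more likely compare embeddings or categories than raw word overlap.

```python
def jaccard(keywords_a, keywords_b):
    """Share of keywords the two sets have in common (0.0 to 1.0)."""
    a, b = set(keywords_a), set(keywords_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def fits_topic(source_keywords, target_keywords, threshold=0.2):
    """True if the linked page overlaps enough with the linking page.

    The threshold is an assumption for illustration, not a known value.
    """
    return jaccard(source_keywords, target_keywords) >= threshold
```

A page about running shoes that links to another page about running shoes shares most of its keywords, while a link to a casino page shares none and would stand out as thematically unrelated.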
Additional questions that can be answered but are not mentioned in the dialog above:
How many of my kind are there? - Counting the links sounds quite simple, but it is the foundation. The fewer links there are, the more each one is worth; the more links, the less value each one has.
Where are the others? - Find out in which part of the page layout the link is located and pay attention to the concentration. A link in a big dropdown menu with hundreds of other links is unlikely to receive the same attention or scoring as a link in the content.
Technically a combination of layout recognition and link recognition in the HTML code.
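A rough sketch of that combination, assuming the page uses HTML5 semantic tags (`nav`, `main`, `footer`, etc.) as a cheap substitute for real layout recognition. Many pages do not mark up their layout this way, which is exactly why an ML-based layout model is needed in practice. `RegionLinkCounter` and `links_per_region` are hypothetical names.

```python
from collections import Counter
from html.parser import HTMLParser

# Assumption: semantic tags stand in for detected layout regions.
REGION_TAGS = {"header", "nav", "main", "article", "aside", "footer"}


class RegionLinkCounter(HTMLParser):
    """Counts links per enclosing layout region."""

    def __init__(self):
        super().__init__()
        self.stack = []          # currently open region tags
        self.counts = Counter()  # region name -> number of links

    def handle_starttag(self, tag, attrs):
        if tag in REGION_TAGS:
            self.stack.append(tag)
        elif tag == "a" and dict(attrs).get("href"):
            region = self.stack[-1] if self.stack else "unknown"
            self.counts[region] += 1

    def handle_endtag(self, tag):
        if tag in REGION_TAGS and self.stack and self.stack[-1] == tag:
            self.stack.pop()


def links_per_region(html):
    """Return a dict mapping each region to its number of links."""
    parser = RegionLinkCounter()
    parser.feed(html)
    parser.close()
    return dict(parser.counts)
```

The per-region counts show the concentration directly: a region holding hundreds of links dilutes each of them far more than a content area with a handful.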
How do I feel or should I feel?
What references and familiar terms can be found?
Just 2 more questions answered in detail, and you could write several articles on this alone.
All these questions can be answered with data.
Let me repeat that.
If we can do it, then Google probably can too!
Scoring for each link individually?
The traditional, widely known Pagerank scoring, as well as all its derivatives, e.g. Page Authority (see DR and PR https://moz.com/learn/seo/domain-authority), is based on the idea of one value for all.
The above change of perspective has made it clear that with current technology, it is possible to calculate a score per link much more accurately.
This is how it could work:
Each link on a page receives an individual score, which is made up of position, number of links, link density, link name, etc. (see questions above)
Simplified example with a score from 0 to 100:
- Website with 500 links.
- Huge drop-down menu because it's so cool for the user.
- Footer with the important links, disclaimer and co.
- Sidebar with advertising for a friendly partner company.
- Link from the content area: Score = 15
- Link from the dropdown: Score = 0.2
- Link from the footer: Score = 0.1
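The simplified example can be turned into a toy calculation. The article gives no formula, so everything here is an assumption: the region weights, the split of the 500 links across regions, and the rule "region weight divided by the number of links in that region". It does not reproduce the example scores above; it only shows how position and link density can push scores apart.

```python
# Hypothetical region weights (assumption, not taken from any known system).
REGION_WEIGHTS = {"content": 1.0, "sidebar": 0.3, "dropdown": 0.1, "footer": 0.05}


def link_score(region, links_in_region, weights=REGION_WEIGHTS):
    """Score from 0 to 100: the region's weight shared among its links."""
    if links_in_region < 1:
        raise ValueError("a region that contains the link has at least one link")
    weight = weights.get(region, 0.0)
    return min(100.0, 100.0 * weight / links_in_region)


# Assumed distribution of the 500 links on the example page.
page = {"content": 5, "dropdown": 400, "footer": 50, "sidebar": 45}

scores = {
    region: round(link_score(region, count), 3)
    for region, count in page.items()
}
```

With these assumed numbers, a link in the content area scores several hundred times higher than a link buried in the drop-down menu, which mirrors the gap in the example.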
A machine (ML) is trained with enough positive examples of pages that have been approved by the quality reviewers.
The ML then provides a prediction of how likely it is that a link is a SPAM link.
As already described in other articles, a programmatic decision is made.
The ML can be created individually for each topic and language, for example.
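To make the prediction step concrete, here is a tiny logistic regression written from scratch. It is only a sketch: the two features (topic similarity and whether the link sits in the content area), the toy training rows and the labels are all invented for illustration, and a classifier of this kind needs spam examples as well as approved ones. Nothing here reflects SPAMbrain's actual model.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(samples, labels, lr=0.5, epochs=2000):
    """Fit a logistic regression with plain batch gradient descent."""
    n_features = len(samples[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(samples, labels):
            prediction = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            error = prediction - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g / len(samples) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(samples)
    return weights, bias


def spam_probability(x, weights, bias):
    """Predicted probability (0.0 to 1.0) that a link is a SPAM link."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


# Invented toy data: [topic_similarity, link_is_in_content_area]
X = [[0.8, 1], [0.7, 1], [0.6, 1], [0.9, 0],    # approved links
     [0.0, 0], [0.1, 0], [0.05, 1], [0.1, 0]]   # spam links
y = [0, 0, 0, 0, 1, 1, 1, 1]                     # 1 = spam

weights, bias = train(X, y)
```

After training, `spam_probability([0.02, 0], weights, bias)` gives a high value for an off-topic link outside the content, and a threshold on that probability would be the programmatic decision. One such model per topic and language simply means training on separate data sets.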
Scoring Systems - both or several in parallel?
The traditional Pagerank system is still used by Google, but with less importance in the overall ranking system.
If you take a look at the Pagerank algorithm, you will quickly realize that it answers a few questions reliably.
For example, the internal page structure of a website can be determined, as well as the top page and where an individual page is located within the page structure or categories can be identified. Understanding how Pagerank works should be basic SEO knowledge. Unfortunately, the reality is different. Perhaps this is due to the modest math lessons in schools.
Why does Google still use Pagerank?
"Never touch a running system": in this case, the Pagerank calculation had been working for over 10 years before the first ML models were even planned. A detailed evaluation of individual links is carried out in parallel to Pagerank.
Different goals suggest that there are several evaluation systems, including SPAMBrain for detecting unwanted links.
Another detailed evaluation could have been introduced with the "Helpful Content Update". Various movements in the collected data (visibility index or traffic estimates), similar to a ripple effect, suggest a change in link evaluation.
Technology - Can I do the same?
Everything we have described above is current technology, at least for us.
Pagerank:
Anyone with some programming experience can test a Pagerank calculation on a small scale using PHP and Javascript.
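A small-scale Pagerank test fits in a few lines. The sketch below uses Python rather than PHP or Javascript, with the commonly used damping factor of 0.85 and a fixed number of iterations; the three-page site is made up. Note how every link on a page passes on the same share of that page's rank, which is the "one value for all" idea.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Iterative Pagerank on a dict {page: [pages it links to]}.

    Every link target must itself be a key of `graph`.
    """
    pages = list(graph)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, targets in graph.items():
            if targets:
                # Each outgoing link receives the same share.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without links spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Made-up mini website for illustration.
site = {
    "home": ["about", "blog"],
    "about": ["home"],
    "blog": ["home", "about"],
}
ranks = pagerank(site)
```

On this mini site, "home" ends up with the highest rank because both other pages link to it, which is how the top page of a site structure can be identified.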
Finding links, counting etc. works in almost any programming language.
Layout detection and text extraction are our own ML models, which are regularly adapted, retrained and improved.
For text analysis with Natural Language Processing (NLP) there are various possibilities; we prefer Python and one model per language. Text statistics (number of words, letters, sentences, etc.) are of course generated using algorithms (see existing libraries, e.g. in PHP, JS, Typescript, etc.).
Potential keywords are extracted from the text with algorithms. It should be mentioned here that we do not use TF-IDF. That algorithm is often mentioned, but it performs far too poorly in practice. The same applies to the Textrank algorithm. We use a different approach, Transformers and N-Grams, which delivers better-quality keywords and combinations but is unfortunately not as performant. You can train an ML model to categorize a text, but our results were disappointing, which is why we developed our own solution that also works with multiple languages.
Downloading a page is web scraping and there are plenty of tutorials on the net. We operate a crawling infrastructure that processes a stream in real time and the data runs directly through the entire analysis process. This is not as performant as a crawler that only downloads and saves, but that is also not our focus as a data driven SEO solution.
Any enthusiastic programmer can try out the individual parts. If you have any questions, feel free to connect via LinkedIn and ask your question via chat. I'll be happy to help. However, it may take a few days for me to reply. Sorry in advance.
Question & Answers
What happens if I have an unwanted link? In principle, nothing. No negative consequences.
In the past, one offensive measure was the so-called Manual Penalty, e.g. if someone advertised the sale of links on their own site. These high-profile measures have not occurred for some years now. Spamming a competitor's project with garbage links (yes, that used to work) no longer works either. The scoring of links obviously works.
Should I disavow backlinks? This is a very personal question. At this point, a whole group of well-known SEOs will disagree. My opinion is that individual link scoring works for BIG G, so no, I think you can save yourself the work.
A cool change of perspective on the LINK "Buy me now" has shown you what you can and should know about a link. In addition to Pagerank, there are other link evaluations that are more accurate.