Post by amirmukaddas on Mar 11, 2024 5:09:01 GMT
Today I'm writing about a mistake I made on a client's website that resulted in around 3,000 spam URLs being indexed. I'm doing it for two reasons: the first is that I have nothing to hide, the second is that the phenomenon is very interesting to study and to show. I won't tell you which website it is, nor the market segment it operates in, because that isn't important here, but above all because I care a lot about the privacy of the companies I work with.

Scenario

I was asked to study the Italian and German versions of a multilingual website. In the index coverage of the Italian property I found around 3,000 paths corresponding to spam search queries, launched with the aim of getting links to online casino sites and similar delights crawled.

Spam queries

The resources in question all sat under the /search/ path and the ?s= parameter. I noticed that the last requests for these resources dated back to the previous month, so I took it for granted that they had stopped for the time being. To prevent these already excluded pages from being requested again in the future, I suggested blocking the paths through the robots.txt file. On top of that, the agency that runs the website made sure those requests automatically return a 404 status, so we considered ourselves covered.
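For reference, the blocking rules I suggested looked more or less like this (a sketch: the exact paths and rules on the client's site may have differed, and the ?s= lines rely on wildcard matching, which Googlebot supports):

User-agent: *
# WordPress internal search results
Disallow: /search/
Disallow: /?s=
Disallow: /*?s=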
Two days after blocking those paths, I was contacted by the marketing manager, who warned me of a spike in indexed pages on the English version of the website.

Indexed, though blocked by robots.txt

Not having studied the English version of the website, I rushed to open Search Console and discovered that there the spam queries excluded from the index were not 3,000, but about ten times as many. Of these, approximately 3,000 entered the index, reported as "Indexed, though blocked by robots.txt", immediately after the directive blocking the search paths was added. What's more, the pages in question all returned 404. How was this possible?

What had happened to us

A quick premise: search pages on WordPress normally return a 200 status (reachable page) with a robots meta tag set to noindex. In our case they returned a 404, so we felt safe... too safe, as it turned out. The indexing process consists of several phases, which we can summarize (and simplify) in three steps:

Discovery: Google notices that a page exists
Crawling: Google reads the page's content (and code)
Indexing: Google lists the page in search results

In our case, Google had discovered the 3,000 pages before they were put into 404 status and before the blocking directive was added to the robots.txt file.
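To make the 404 part of the premise concrete, here is a minimal sketch of how search requests can be forced to return a 404 in WordPress, hooked on the standard template_redirect action. This is my own reconstruction under assumptions; I don't know exactly how the agency implemented it on this site.

// Hypothetical reconstruction, not the agency's actual code.
add_action( 'template_redirect', function () {
    if ( is_search() ) {          // any /search/ or ?s= request
        global $wp_query;
        $wp_query->set_404();     // mark the main query as not found
        status_header( 404 );     // send the 404 HTTP status code
        nocache_headers();        // discourage caching of the response
    }
} );

With this in place the page answers 404, but its URL can still be discovered through the spam links pointing at it, which is exactly what happened next.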
When the pages moved from the discovery phase to the crawling phase, the search paths were already disallowed, so Google couldn't see anything. Usually in these circumstances indexing doesn't happen anyway, because Google files this kind of resource among the excluded pages as "Discovered - currently not indexed". Not this time: it put them in the index regardless. It is therefore a fairly rare situation, and that is exactly why I'm pleased to tell you about it. In this case it came down to the fact that the English-language version was under a heavy bombardment of spam queries, which were evidently also being requested very frequently. To solve the problem, it was enough to remove the blocking directives from the robots.txt file, so that Google could crawl the 404 pages again and drop them, hoping it would be as quick to exclude the resources in question as it had been to index them. You know how Google can make you wait months to index new pages even when you report them every day? The kind of thing that makes your hands itch, seriously.