Why are wget and curl essential when dealing with malware-infected URLs?

Official statement

Google recommends using tools like wget or curl to check sample URLs associated with 'error template' infections rather than opening them directly in a browser, so that your device does not get infected.

0:49

🎥 Source video

Extracted from a Google Search Central video

⏱ 1:37 💬 EN 📅 12/03/2013 ✂ 6 statements

✂ Other statements from this video 5 ▾

0:05 How does Google Search Console detect 'error template' malware infections on your site?
0:05 How does Google Search Console actually detect malware infections on your site?
0:35 How can 404 error pages become malware vectors on your site?
1:37 Why change the ErrorDocument directives in .htaccess after a malware infection?
1:37 How do you clean an infected .htaccess file without losing your SEO redirects?

What you need to understand

What exactly is an error template infection?

An error template infection involves the injection of malicious code into the error template files (404, 500, etc.) of a CMS. Attackers take advantage of the fact that these files are rarely monitored and can serve thousands of infected pages without immediately arousing suspicion.

The classic trap: these infected pages load differently depending on the user-agent. When you open them in Chrome or Firefox, the script detects a human visitor and triggers malicious redirects, cloaking, or worse. Wget and curl, on the other hand, retrieve the raw HTML code without executing JavaScript, allowing you to examine the actual structure without triggering the infection mechanisms.

Why does a standard browser expose your machine?

Modern browsers execute JavaScript, load iframes, follow redirects, and interpret the entire DOM. On an infected URL, this means that malicious code activates and might attempt to exploit vulnerabilities, install tracking scripts, or redirect to phishing pages.

Command-line tools like wget or curl simply retrieve the raw content of the HTTP response. No interpretation, no execution. You get exactly what the server returns, which helps identify SEO spam injections, suspicious 302 redirects, or hidden meta refresh tags.
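As a minimal sketch of this kind of raw inspection, the function below lists meta refresh tags in a response body that has already been saved to disk. The function name find_meta_refresh and the file name page.html are illustrative assumptions, not part of Google's guidance.

```shell
# List the lines of a saved HTTP response body that contain a meta refresh tag.
# The file is only read as text; nothing in it is rendered or executed.
find_meta_refresh() {
    # -i: case-insensitive, -n: prefix each hit with its line number,
    # -E: extended regular expressions
    grep -inE "<meta[^>]*http-equiv=[\"']?refresh" "$1"
}
```

For example, after saving a response with curl -sS "https://example.com/infected-page" > page.html, calling find_meta_refresh page.html prints each matching line with its line number. Inspect page.html in a text editor or pager, never in a browser.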

How do wget and curl fit into a SEO hacking audit?

When Google notifies you of a hacked content issue via Search Console, the first step is to identify the scope of the infection. Wget allows you to quickly crawl a sample of suspicious URLs and extract common patterns: presence of external links to pharmaceutical domains, modified title tags, injected scripts in the head.

Curl is ideal for inspecting HTTP headers and spotting conditional redirects, such as those triggered by a specific referer (IP-based conditions require testing from a different network). Combine the two with grep, and you can automate the detection of malicious signatures across hundreds of URLs in just a few minutes. This is a common workflow for professionals facing a large-scale hack.
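The header check can be sketched as a small function that prints only the status line and any Location header, with an optional Referer to probe referer-based redirects. The name redirect_summary and the example URLs are assumptions made for illustration.

```shell
# Print the status line and any Location header returned for a URL.
# Usage: redirect_summary URL [REFERER]
redirect_summary() {
    # -s: no progress meter, -S: still report errors, -I: headers only (HEAD request)
    # -e: send the given Referer header
    if [ -n "${2:-}" ]; then
        curl -sS -I -e "$2" "$1"
    else
        curl -sS -I "$1"
    fi |
        tr -d '\r' |                     # HTTP header lines end in CRLF
        grep -iE '^(HTTP/|location:)'
}
```

Running redirect_summary "https://example.com/infected-page" and then the same call with "https://www.google.com/" as second argument shows whether the server answers differently to visitors arriving from a search result. Some servers treat HEAD and GET requests differently, so a clean result here is not conclusive on its own.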

Wget/curl prevent the execution of malicious code when inspecting compromised URLs
Standard browsers trigger infection mechanisms (JavaScript, redirects, iframes)
These tools allow analysis of raw HTTP headers and source code without interpretation
They integrate into automated audit scripts to handle large volumes of suspicious URLs
Google officially recommends this approach to protect the machines of SEO professionals

SEO Expert opinion

Does this recommendation reflect the real practices of experienced SEOs?

Absolutely. Any professional who has managed a hacked site knows that the first rule is to never open suspicious URLs directly from your working machine. The use of wget or curl was widely documented in resources on cleaning SEO hacks long before Google formalized it.

What is interesting here is that Google explicitly mentions it in the context of error templates. This suggests that they have observed an increase in this specific attack vector and that too many webmasters continue to inspect these pages manually, contaminating their machines in the process. It remains to be verified whether this recommendation is accompanied by increased detection on the Search Console side.

What limitations should you keep in mind with this approach?

Wget and curl show what the server sends to a command-line user-agent. The issue: modern infections are often contextual. They serve clean content to Googlebot and malicious content to human visitors, or only to IPs outside the US, or only after the first click.

If the malware detects a curl/wget user-agent, it may serve a clean version. In these cases, you should force the Googlebot user-agent or use tools like Puppeteer in headless mode with varied proxies. Google's recommendation remains valid for an initial inspection, but it does not cover cases of advanced cloaking. A complete audit requires multiple combined approaches.

What to do if the infection persists despite visible cleaning?

This is the classic situation: you clean the templates, you run wget on 50 sample URLs, everything seems clean, and Google continues to report hacked content. The malware has probably spread to the database (custom fields, widgets, menus), to the .htaccess file with complex RewriteCond rules, or worse, to an obfuscated plugin or theme.

Wget/curl only detect what transits via HTTP. They do not scan your PHP files for base64-encoded backdoors. You must complement them with server-side scanners like maldet, file permission audits, and a manual code review. If the site is still flagged as hacked but wget returns clean code, the infection is hiding elsewhere.
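A first pass over the files themselves can be sketched with find and grep. The function name find_php_suspects and the pattern list are illustrative assumptions: the patterns are far from exhaustive, legitimate code can match them, and this does not replace a dedicated scanner.

```shell
# List PHP files under a directory that contain constructs often seen in
# obfuscated backdoors, such as eval() wrapped around a decoding function.
# Every hit needs manual review before anything is deleted.
find_php_suspects() {
    find "$1" -type f -name '*.php' \
        -exec grep -lE '(eval|assert)[[:space:]]*\([[:space:]]*(base64_decode|gzinflate|str_rot13)' {} +
}
```

Run it against a copy of the site files downloaded to a clean machine, for example find_php_suspects ./site-backup, rather than on the compromised server.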

Warning: some malware modifies wget or curl directly on the compromised server to hide the infection. Always use clean binaries from your local machine, never those installed on a suspicious server.

Practical impact and recommendations

How can you concretely use wget and curl to audit infected URLs?

Start by retrieving a list of sample URLs from Search Console (Security section or Hacked content issues). Export 20 to 50 suspicious URLs. Run wget with the --user-agent option to simulate Googlebot, then open output.html in a text editor, not in a browser: wget --user-agent="Mozilla/5.0 (compatible; Googlebot/2.1)" -O output.html "https://example.com/infected-page"

With curl, first check the HTTP headers: curl -I "https://example.com/infected-page". Look for suspicious 301/302 redirects, added X-Redirect headers, or unusual Set-Cookie headers. Then, retrieve the full body: curl -A "Googlebot" "https://example.com/infected-page" > page.html and grep for typical patterns (iframe, document.write, eval, pharma keywords).
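The grep step on the saved page.html could look like the sketch below, which counts matching lines per pattern. The function name count_suspicious and the keyword list are assumptions to adapt to the infection you are investigating.

```shell
# Print how many lines of a saved page match each suspicious pattern,
# one "pattern<TAB>count" pair per line.
count_suspicious() {
    for pattern in '<iframe' 'document\.write' 'eval\(' 'viagra|cialis|casino'; do
        # grep -c prints the number of matching lines (0 when there are none)
        printf '%s\t%s\n' "$pattern" "$(grep -ciE "$pattern" "$1")"
    done
}
```

A call such as count_suspicious page.html gives a quick overview; non-zero counts tell you which patterns to look at in the file with a text editor.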

What mistakes should you avoid during the investigation?

Never test directly from the client's network if the server is compromised. Some malware whitelists local IPs and only serves the infection to external visitors. Use a VPN or test from a third-party server to get a realistic view.

Another trap: relying on a single wget test. Conditional infections change behavior depending on time, referer, or visit count. Run multiple passes with varied user-agents (Desktop Chrome, Mobile Safari, Googlebot, Bingbot) and compare the results. If the responses differ in ways that legitimate dynamic content does not explain, you are probably dealing with active cloaking.
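One way to sketch the comparison is to fetch the same URL with several user-agent strings and print a checksum of each body. The function name compare_user_agents is hypothetical, and the browser user-agent string is a shortened, illustrative value.

```shell
# Fetch one URL with several user-agent strings and print a checksum of
# each response body, one "checksum<TAB>user-agent" pair per line.
compare_user_agents() {
    for agent in \
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0 Safari/537.36' \
        'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)' \
        'Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)'
    do
        # cksum prints "<checksum> <byte count>" for the data on stdin
        sum=$(curl -sS -A "$agent" "$1" | cksum)
        printf '%s\t%s\n' "$sum" "$agent"
    done
}
```

Identical checksums on every line mean the server returned the same bytes to each user-agent. Different checksums are only a signal: pages that embed timestamps or random tokens also differ between requests, so save the bodies and diff them before drawing conclusions.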

How to automate detection across hundreds of URLs?

Create a bash script that loops through your list of URLs, executes wget for each, and filters the results with grep. Example: search for all occurrences of "viagra" or "casino" in the titles. You can also parse the HTML with xmllint or pup to automatically extract suspicious tags.
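A minimal sketch of such a loop is shown below. The function name scan_url_list, the keyword list, and the 15-second timeout are assumptions; the title check also assumes the title tag sits on a single line.

```shell
# Read URLs (one per line) from a file, fetch each with wget, and print the
# URLs whose <title> contains a spam keyword.
scan_url_list() {
    while IFS= read -r url; do
        [ -n "$url" ] || continue        # skip empty lines
        # -q: quiet, -T: network timeout in seconds, -O -: write the body to stdout
        # Note: wget follows redirects by default, so the title may come
        # from the redirect target.
        if wget -q -T 15 -O - "$url" </dev/null |
            grep -iE '<title[^>]*>[^<]*(viagra|cialis|casino)' >/dev/null
        then
            printf 'SUSPICIOUS\t%s\n' "$url"
        fi
    done < "$1"
}
```

With the exported URLs saved one per line in a file such as urls.txt, scan_url_list urls.txt prints only the URLs that need a closer look.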

For large sites, combine with Screaming Frog in custom extraction mode: configure regex to detect malicious patterns in the source code. In its default text-only mode, Screaming Frog fetches the raw HTML without rendering JavaScript, so it's safer than a standard browser, though wget/curl remain the benchmark for critical cases.

Retrieve the list of suspicious URLs from Search Console (Security section)
Use wget/curl with Googlebot user-agent to obtain the raw server rendering
Compare responses with different user-agents to detect cloaking
Parse the HTML with grep, xmllint, or Python scripts to identify injections
Check HTTP headers (redirects, cookies, suspicious custom headers)
Never test from the network of the compromised server (its IP may be whitelisted)

Auditing infected URLs via wget/curl is an essential technical skill, but it only represents part of the diagnosis. Modern infections often involve multiple vectors (files, database, .htaccess, plugins) and sophisticated conditional logic. If the infection persists despite your cleaning attempts, or if the attack surface exceeds a few dozen URLs, it quickly becomes complex to manage alone without deep expertise. Engaging an SEO agency specialized in security helps ensure a comprehensive audit, methodical cleaning, and the implementation of lasting protections against recurrence.

❓ Frequently Asked Questions

Are wget and curl enough to detect all SEO infections?

No. They detect the content served over raw HTTP, but not infections in the database, obfuscated PHP files, or server backdoors. You need to complement them with file scanners and permission audits.

What should you do if wget returns a clean page while Google still reports hacked content?

The malware is probably cloaking and serving a clean version to command-line user-agents. Test with different user-agents, IPs, and referers, or use a headless browser with proxy rotation.

Can you use wget/curl from the infected server itself?

No, that is bad practice. The malware may have modified the system binaries or whitelisted the local IP. Always test from a clean external machine or through a VPN.

How do you force wget to behave exactly like Googlebot?

Use the --user-agent option with the full Googlebot string and add --header to simulate the other typical headers (Accept-Language, Accept-Encoding). Then compare with the URL Inspection tool in Search Console.
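A hedged sketch of such a request is shown below. The function name fetch_as_googlebot is hypothetical, and which extra headers to send is left to the caller, because the sketch makes no claim about the headers real Googlebot traffic carries.

```shell
# Fetch a URL with wget while presenting a Googlebot user-agent string,
# plus any extra request headers passed as additional arguments.
# Usage: fetch_as_googlebot URL ["Header-Name: value" ...]
fetch_as_googlebot() {
    url=$1
    shift
    # Rewrite each remaining argument into a --header=... option
    for header in "$@"; do
        set -- "$@" "--header=$header"
        shift
    done
    # -q: quiet, -O -: write the response body to stdout
    wget -q -O - \
        --user-agent='Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)' \
        "$@" "$url"
}
```

For example, fetch_as_googlebot "https://example.com/infected-page" "Accept-Language: en-US" > page.html. If you send an Accept-Encoding header, the body may come back compressed and need decompressing before you grep it. Matching the headers does not change your IP address, so malware that checks the visitor's IP range can still tell the request apart from the real crawler.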

How many sample URLs should you test to assess the extent of an infection?

Google recommends at least 20 to 50 URLs that are representative of the different sections of the site. If the infection affects the templates, test several types of pages (products, categories, articles, 404 errors) to map how far it has spread.
