this post was submitted on 18 Aug 2024
67 points (100.0% liked)

top 18 comments
[–] fubarx@lemmy.ml 16 points 2 months ago
[–] mox@lemmy.sdf.org 16 points 2 months ago (1 children)

This article lies to the reader, so it earns a -1 from me.

[–] CynicusRex@lemmy.ml 5 points 2 months ago* (last edited 2 months ago) (1 children)

Lies, as in it's not really “blocking” but merely an unenforceable request? If you meant something else, could you please point it out?

[–] dabaldeagul@feddit.nl 27 points 2 months ago (1 children)

That is what they meant, yes. The title promises a block, completely preventing crawlers from accessing the site. That is not what is delivered.

[–] JackbyDev@programming.dev 3 points 2 months ago (2 children)

Is it a lie or a simplification for beginners?

[–] thanks_shakey_snake@lemmy.ca 9 points 2 months ago (1 children)

Lie. Or at best, dangerously wrong. Like saying "Crosswalks make cars incapable of harming pedestrians who stay within them."

[–] JackbyDev@programming.dev 1 points 2 months ago (2 children)

It's better than saying something like "there's no point in robots.txt because bots can disobey it" though.

[–] thanks_shakey_snake@lemmy.ca 3 points 2 months ago

Maybe? But it's not like that's the only alternative thing to say, lol

[–] ReversalHatchery 2 points 2 months ago* (last edited 2 months ago)

Is it, though?

I mean, robots.txt is the Do Not Track of the other side of the connection.

[–] mox@lemmy.sdf.org 3 points 2 months ago

Assuring someone that they have control of something and the safety that comes with it, when in fact they do not, is well outside the realm of a simplification. It's just plain false. It can even be dangerous.

[–] vk6flab@lemmy.radio 16 points 2 months ago

This does not block anything at all.

It's a 1994 "standard" that relies on voluntary compliance, and the user-agent is just a string set by the operator of the tool used to access your site.

https://en.m.wikipedia.org/wiki/Robots.txt

https://en.m.wikipedia.org/wiki/User-Agent_header

In other words, the bot operator can ignore your robots.txt file, and since they can set their user-agent to whatever they like, checking your webserver logs won't tell you whether they are ignoring you.
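To illustrate the point about user-agents being client-controlled, here's a minimal sketch (Python standard library; `example.com` is a placeholder, not a site from this thread). Any scraper can present a browser-like string, and the server only ever sees that string:

```python
import urllib.request

# The User-Agent is whatever the client chooses to send. A scraper can
# present a browser-like string and be indistinguishable in access logs.
req = urllib.request.Request(
    "https://example.com/",
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/120.0 Safari/537.36"
    },
)
# Request stores header keys capitalized as "User-agent".
print(req.get_header("User-agent"))
```

Nothing about the request reveals that it came from a script rather than a browser, which is why log-based detection of dishonest bots is so hard.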

[–] nullPointer@programming.dev 13 points 2 months ago (1 children)

robots.txt will not block a bad bot, but you can use it to lure the bad bots into a "bot-trap" so you can ban them in an automated fashion.

[–] dgriffith@aussie.zone 7 points 2 months ago

I'm guessing something like:

Robots.txt: Do not index this particular area.

Main page: invisible link to particular area at top of page, with alt text of "don't follow this, it's just a bot trap" for screen readers and such.

Result: any access to said particular area equals insta-ban for that IP. Maybe just for 24 hours so nosy humans can get back to enjoying your site.
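The trap logic described above can be sketched in a few lines. This is an assumption-laden toy, not anyone's actual setup: `TRAP_PATH` and `BAN_SECONDS` are hypothetical names, and real deployments would use the webserver or a firewall rather than in-process state:

```python
import time

TRAP_PATH = "/bot-trap/"    # listed as Disallow in robots.txt, linked invisibly
BAN_SECONDS = 24 * 3600     # let nosy humans back in after a day

banned = {}  # ip -> time the ban expires

def handle_request(ip, path, now=None):
    """Return True if the request should be served, False if the IP is banned."""
    now = time.time() if now is None else now
    if banned.get(ip, 0) > now:
        return False                    # still within the ban window
    if path.startswith(TRAP_PATH):
        # Only a client that ignored robots.txt ends up here.
        banned[ip] = now + BAN_SECONDS
        return False
    return True
```

An honest crawler never requests `TRAP_PATH`, because robots.txt told it not to; anything that does is banned automatically.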

[–] digdilem@lemmy.ml 12 points 2 months ago

robots.txt does not work. I don't think it ever has - it's an honour system with no penalty for ignoring it.

I have a few low-traffic sites hosted at home, and when a crawler takes an interest it can totally flood my connection. I'm using Cloudflare and being incredibly aggressive with my filtering, but so many bots are ignoring robots.txt, as well as lying about who they are with humanesque UAs, that it's having a real impact on my ability to provide the sites for humans.

Over the past year it's got around ten times worse. I woke up this morning to find my connection at a crawl and on checking the logs, AmazonBot has been hitting one site 12000 times an hour, and that's one of the more well-behaved bots. But there's thousands and thousands of them.

[–] CynicusRex@lemmy.ml 8 points 2 months ago (1 children)

#TL;DR:

User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: Amazonbot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Omgilibot
Disallow: /
User-agent: FacebookBot
Disallow: /
User-agent: Applebot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: Claude-Web
Disallow: /
User-agent: Diffbot
Disallow: /
User-agent: ImagesiftBot
Disallow: /
User-agent: Omgili
Disallow: /
User-agent: YouBot
Disallow: /
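For what it's worth, this is how a compliant crawler would actually consume those rules, using Python's `urllib.robotparser` on a shortened excerpt (the URL is a placeholder). The sketch also makes the thread's point: enforcement only happens if the crawler bothers to ask:

```python
from urllib import robotparser

# An excerpt of the rules above, parsed the way a *compliant* crawler would.
rules = """\
User-agent: GPTBot
Disallow: /
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("GPTBot", "https://example.com/page"))        # False: asked to stay out
print(rp.can_fetch("SomeOtherBot", "https://example.com/page"))  # True: no rule matches
```

A bot that never calls `can_fetch` (or never fetches robots.txt at all) is constrained by nothing here.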
[–] mox@lemmy.sdf.org 4 points 2 months ago (1 children)

Of course, nothing stops a bot from picking a user agent field that exactly matches a web browser.

[–] JackbyDev@programming.dev 3 points 2 months ago (1 children)

Nothing stops a bot from choosing not to read robots.txt in the first place.

[–] mox@lemmy.sdf.org 2 points 2 months ago* (last edited 2 months ago)

Indeed, as has already been said repeatedly in other comments.