Privacy

Protect your privacy in the digital world

Welcome! This is a community for all those who are interested in protecting their privacy.

Rules

PS: Don't be a smartass and try to game the system, we'll know if you're breaking the rules when we see it!

Be nice, civil and no bigotry/prejudice.
No tankies/alt-right fascists. The former can be tolerated but the latter are banned.
Stay on topic.
Don't promote proprietary software.
No crypto, blockchain, etc.
No Xitter links (allowed only when there is no other way to fact-check something; use xcancel).
If in doubt, read rule 1

Scraping for Me, Not for Thee: Large Language Models, Web Data, and Privacy-Problematic Paradigms (epic.org)

submitted 2 days ago by Forumite@lemm.ee to c/privacy@lemmy.dbzer0.com

[–] obbeel@lemmy.eco.br 7 points 2 days ago (1 children)

I tried scraping arXiv before, and they blocked my access to the website, citing suspicious activity and telling me to contact the arXiv owners if I wanted to scrape it, even though arXiv exists to provide Open Access to scientific articles.

Scraping scientific articles is still in my plans, just not from arXiv anymore.

The gap between the common user and the big technology players is more than just a gap of knowledge. These agreements to keep everything in the hands of big companies are problematic, especially when they touch on important philosophical concepts that should guide some websites, like Open Science.

But it is what it is. There's no use sending an email to arXiv telling them they're wrong and should change their minds. I'll have to look for other options.

[–] e0qdk@reddthat.com 10 points 2 days ago

arXiv has bulk access methods -- you shouldn't need to scrape their website to get the data: https://info.arxiv.org/help/bulk_data.html

If you really want everything (5TB+), that's available from their S3 bucket if you're willing to cover the transfer costs: https://info.arxiv.org/help/bulk_data_s3.html
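If you only need metadata for a modest number of papers, something like the following might be enough. This is a rough sketch, not an official client: it assumes the public arXiv query API at `http://export.arxiv.org/api/query` (not mentioned above) and that it returns an Atom feed. The function names and the 3-second pause between requests are my own choices for illustration, so check arXiv's current usage guidelines before relying on them.

```python
import time
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Assumption: the public arXiv query API endpoint, which returns an Atom feed.
API_BASE = "http://export.arxiv.org/api/query"
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}


def build_query_url(search_query, start=0, max_results=10):
    """Build a query URL for the (assumed) arXiv API endpoint."""
    params = {
        "search_query": search_query,
        "start": start,
        "max_results": max_results,
    }
    return API_BASE + "?" + urllib.parse.urlencode(params)


def parse_entries(atom_xml):
    """Extract the id and title of every entry in an Atom feed (str or bytes)."""
    root = ET.fromstring(atom_xml)
    entries = []
    for entry in root.findall("atom:entry", ATOM_NS):
        title = entry.findtext("atom:title", default="", namespaces=ATOM_NS)
        entry_id = entry.findtext("atom:id", default="", namespaces=ATOM_NS)
        # Titles are often wrapped across lines, so collapse the whitespace.
        entries.append({"id": entry_id.strip(), "title": " ".join(title.split())})
    return entries


def fetch_entries(search_query, max_results=10, delay_seconds=3.0):
    """Fetch one page of results, pausing first to stay polite to the server."""
    # Hypothetical pause length; adjust it to whatever arXiv currently asks for.
    time.sleep(delay_seconds)
    url = build_query_url(search_query, max_results=max_results)
    with urllib.request.urlopen(url, timeout=30) as response:
        return parse_entries(response.read())
```

Calling `fetch_entries("all:electron", max_results=5)` would return a list of dicts with `id` and `title` keys. For full text or the whole corpus, the bulk options linked above are the better fit than looping over this.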