Data Hoarder

We are digital librarians. Among us are represented the various reasons to keep data -- legal requirements, competitive requirements, uncertainty of permanence of cloud services, distaste for transmitting your data externally (e.g. government or corporate espionage), cultural and familial archivists, internet collapse preppers, and people who do it themselves so they're sure it's done right. Everyone has their reasons for curating the data they have decided to keep (either forever or For A Damn Long Time (tm) ). Along the way we have sought out like-minded individuals to exchange strategies, war stories, and cautionary tales of failures.

Searching internet archive for URLs containing substring (alien.top)

submitted 11 months ago by A_Zythera@alien.top to c/datahoarder@selfhosted.forum

Hi all, not a data hoarder myself but have been digging into using the Wayback Machine to find old images and videos for the past few weeks. I've been trying to find a way to search all URLs on the archive for any containing particular substrings (typically video/img IDs) but haven't had much luck. Yesterday I was directed to the Wayback CDX API and its search functions, but I have some major issues regarding its usage for my desired outcome:

Using the search function via the CDX API requires a domain as input. I'm not looking for specific sites per se, just for URLs on any domain that contain the specific strings in question.
Even when searching within a large domain, the system seems to retrieve entries for the domain up to an upper limit before applying the search filters. This means that entries containing the desired substring may not be among those retrieved before filtering, and so will not be flagged.
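For readers following along, here is a minimal sketch of what a single filtered CDX query of this kind might look like when built in Python. The `url`, `matchType`, `filter`, `fl`, `output` and `limit` parameters are the ones described in the CDX server documentation; the domain `example.com`, the ID `abc123` and the helper name `build_cdx_url` are placeholders invented for illustration. It only builds the URL and does not contact the archive.

```python
import re
import urllib.parse

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def build_cdx_url(domain, substring, limit=1000):
    """Build a CDX query for captures under `domain` whose URL contains `substring`.

    Hypothetical helper: a domain is still required, as noted in the post.
    """
    params = {
        "url": domain,
        "matchType": "domain",  # the domain and its subdomains
        # The filter is a regex matched against the whole field, hence the .* on both sides.
        # re.escape is an assumption that Python's escaping is acceptable to the server's regex engine.
        "filter": "original:.*" + re.escape(substring) + ".*",
        "fl": "timestamp,original",  # only return these two fields
        "output": "json",
        "limit": str(limit),
    }
    return CDX_ENDPOINT + "?" + urllib.parse.urlencode(params)


if __name__ == "__main__":
    # Paste the printed URL into a browser to see the raw result.
    print(build_cdx_url("example.com", "abc123"))
```

This does not remove the need for a domain, and whether the filter sees every entry for a large domain is exactly the open question raised above.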

I have tried using the in-built Pagination API to retrieve all relevant domain entries by splitting them into blocks but, due to the way the filters are applied, this only tells me if the entry is in the current block and I have to search each one manually. I have basically no coding knowledge (sorry) so just figuring out how to use the CDX search properly was a bit of a challenge. I definitely don't have the ability to automate the search process for the paginated data.
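Checking each block by hand is the part a short script can take over. The following is a rough sketch, assuming the `showNumPages` and `page` parameters of the Pagination API behave as documented and that each JSON page may start with a header row. The names `page_url`, `find_in_all_pages` and `http_fetch` are made up for this example, and the substring match is done locally in Python rather than by the server-side filter.

```python
import json
import urllib.parse
import urllib.request

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"
HEADER = ["timestamp", "original"]


def page_url(domain, page=None):
    """URL for one page of results, or for the page count when page is None."""
    params = {"url": domain, "matchType": "domain"}
    if page is None:
        params["showNumPages"] = "true"
    else:
        params.update({"fl": ",".join(HEADER), "output": "json", "page": str(page)})
    return CDX_ENDPOINT + "?" + urllib.parse.urlencode(params)


def http_fetch(url):
    """Download a URL and return its body as text."""
    with urllib.request.urlopen(url, timeout=60) as resp:
        return resp.read().decode("utf-8")


def find_in_all_pages(fetch, domain, substring):
    """Yield (timestamp, original) for every capture whose URL contains `substring`."""
    num_pages = int(fetch(page_url(domain)).strip())
    for page in range(num_pages):
        text = fetch(page_url(domain, page))
        if not text.strip():
            continue  # an empty page has nothing to search
        for row in json.loads(text):
            if row == HEADER:
                continue  # skip the row of field names
            timestamp, original = row
            if substring in original:
                yield timestamp, original


if __name__ == "__main__":
    for ts, url in find_in_all_pages(http_fetch, "example.com", "abc123"):
        print(ts, url)
```

Passing the download function in as `fetch` keeps the search logic separate from the network code, so it can be tried against saved pages first.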

Maybe a long shot and sorry for my lack of understanding, but would anyone here know how I could go about solving my issue? It's possible you may have to explain to me like I'm 5 but I normally pick stuff up pretty quick.

Thanks for any help in advance!

WindowlessBasement@alien.top 1 points 11 months ago

> I have tried using the in-built Pagination API to retrieve all relevant domain entries by splitting them into blocks but, due to the way the filters are applied, this only tells me if the entry is in the current block and I have to search each one manually. I have basically no coding knowledge

Short answer: you're asking questions that will take a program requesting data (the whole internet archive?) non-stop for a month or more. You are gonna need to learn to code if you want to interact with that much data.

> I definitely don't have the ability to automate the search process for the paginated data.

You're going to need to automate it. A rate-limiter is going to kick in very quickly if you are just spamming the API.
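As a rough illustration of what automating politely can mean, the sketch below waits before every request and doubles the wait whenever a request fails. The one-second pause and five retries are arbitrary assumptions, not the archive's actual limits, and `polite_fetch` and `http_fetch` are hypothetical names.

```python
import time
import urllib.error
import urllib.request


def http_fetch(url):
    """Download a URL and return its body as text."""
    with urllib.request.urlopen(url, timeout=60) as resp:
        return resp.read().decode("utf-8")


def polite_fetch(url, fetch=http_fetch, pause=1.0, max_retries=5, sleep=time.sleep):
    """Fetch `url`, pausing before each attempt and backing off after failures."""
    delay = pause
    for _ in range(max_retries):
        sleep(delay)  # always wait, even before the first attempt
        try:
            return fetch(url)
        except urllib.error.URLError:
            # HTTPError (for example a 429 response) is a subclass of URLError.
            delay *= 2  # wait twice as long before the next attempt
    raise RuntimeError(f"giving up on {url} after {max_retries} attempts")
```

A loop over pages would call `polite_fetch` instead of fetching directly, which spaces requests out rather than spamming the API.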

> explain to me like I'm 5

You need to learn for yourself if this is a project you are tackling. You will also need to familiarize yourself with the terms of service of the archive, because most services would consider scraping every piece of data they have to be abusive and/or malicious behavior.