Hi all, not a data hoarder myself, but I've been digging into using the Wayback Machine to find old images and videos for the past few weeks. I've been trying to find a way to search all URLs on the archive for any containing particular substrings (typically video/img IDs) but haven't had much luck. Yesterday I was directed to the Wayback CDX API and its search functions, but I have some major issues regarding its usage for my desired outcome:
- Using the search function via the CDX API requires a domain input. I'm not looking for specific sites per se; I'm looking for URLs on any domain containing the specific strings in question (a sketch of this kind of query is shown after this list).
- Even when searching within a large domain, the system seems to retrieve as many entries for the domain as it can, up to an upper limit, before applying the search filters. This means the entries containing the desired substring may not be among the entries retrieved, and so will never be flagged.
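For reference, a single filtered CDX query looks roughly like the sketch below (Python with the requests library; it assumes the public endpoint at web.archive.org/cdx/search/cdx, and the domain and substring values are placeholders, not anything specific to my search). The filter is a regex applied server-side to the original URL field, but it only runs over whatever slice of the index the request actually pulls:

```python
import requests

CDX = "https://web.archive.org/cdx/search/cdx"

# Placeholder values for illustration only
DOMAIN = "example.com"
SUBSTRING = "IMG_1234"

params = {
    "url": DOMAIN,
    "matchType": "domain",                  # example.com plus all its subdomains
    "filter": f"original:.*{SUBSTRING}.*",  # regex filter on the original URL field
    "output": "json",
    "fl": "timestamp,original",             # only return these two fields
    "collapse": "urlkey",                   # drop duplicate captures of the same URL
    "limit": "1000",
}

resp = requests.get(CDX, params=params, timeout=60)
resp.raise_for_status()
rows = resp.json()

# With output=json the first row is a header, the rest are data rows
for timestamp, original in rows[1:]:
    print(f"https://web.archive.org/web/{timestamp}/{original}")
```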
I have tried using the built-in pagination API to retrieve all relevant domain entries by splitting them into blocks, but because the filters are applied within each block, this only tells me whether a matching entry is in the current block, and I have to search each block manually. I have basically no coding knowledge (sorry), so just figuring out how to use the CDX search properly was a bit of a challenge, and I definitely don't have the ability to automate the search process for the paginated data.
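In case it helps anyone following along, automating that pagination might look something like the rough sketch below (same assumptions as the earlier snippet; example.com and IMG_1234 are placeholders). It asks showNumPages how many index blocks the domain spans, then fetches each page with the filter applied per block, which is exactly why a match only surfaces when it happens to sit in the block being queried:

```python
import time
import requests

CDX = "https://web.archive.org/cdx/search/cdx"
DOMAIN = "example.com"   # placeholder
SUBSTRING = "IMG_1234"   # placeholder

# Ask the pagination API how many index blocks (pages) this domain spans
num_pages = int(requests.get(
    CDX,
    params={"url": DOMAIN, "matchType": "domain", "showNumPages": "true"},
    timeout=60,
).text)

matches = []
for page in range(num_pages):
    rows = requests.get(
        CDX,
        params={
            "url": DOMAIN,
            "matchType": "domain",
            "filter": f"original:.*{SUBSTRING}.*",  # filter runs within this block only
            "output": "json",
            "page": page,
        },
        timeout=60,
    ).json()
    matches.extend(rows[1:])   # skip the header row; empty pages return nothing
    time.sleep(1)              # stay polite between requests

print(f"{len(matches)} matching captures across {num_pages} pages")
```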
Maybe a long shot, and sorry for my lack of understanding, but would anyone here know how I could go about solving this? You may have to explain it to me like I'm 5, but I normally pick things up pretty quickly.
Thanks for any help in advance!
Short answer: you're asking for something that would take a program requesting data (the whole Internet Archive?) non-stop for a month or more. You're going to need to learn to code if you want to interact with that much data.
You're going to need to automate it. A rate-limiter is going to kick in very quickly if you are just spamming the API.
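Whatever you end up running, wrap the requests in something that backs off when the archive pushes back. A minimal sketch (the 429/503 status codes are my assumption about how the rate-limiter answers, not documented behavior):

```python
import time
import requests

def cdx_get(params, max_retries=5):
    """Fetch one CDX response, backing off when the server pushes back.
    Sketch only: the exact rate-limit status codes are an assumption."""
    url = "https://web.archive.org/cdx/search/cdx"
    for attempt in range(max_retries):
        resp = requests.get(url, params=params, timeout=60)
        if resp.status_code in (429, 503):   # assumed "slow down" responses
            time.sleep(2 ** attempt)         # wait 1s, 2s, 4s, 8s, ...
            continue
        resp.raise_for_status()
        return resp
    raise RuntimeError("gave up after repeated rate-limit responses")
```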
You'll need to learn it yourself if this is a project you're tackling. You'll also need to familiarize yourself with the Internet Archive's terms of service, because most services would consider scraping every piece of data they host to be abusive and/or malicious behavior.