Each site you want will be different. Wikipedia is... a wiki. YouTube is a video site. Reddit is a message board. There are blog-type sites and download sites, and Google Docs is highly interactive. They're all different, though a few may be same-ish.
And then, you can't really fit the Internet on a box in your house. And the cost to host it elsewhere will be astronomical.
Sounds like you have a few in mind? Find out what software they run, run it locally, and then download the site you want to mirror. You should be able to locate a crawler that will build the database(s) locally.
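For the "find out what software they run" part, one cheap first check is the generator meta tag in the page source. A lot of wiki and blog software announces itself there, though plenty of sites strip it, so treat this as a rough sketch and not a reliable detector. The class and function names below are made up for illustration, and the sample HTML is invented.

```python
from html.parser import HTMLParser


class GeneratorFinder(HTMLParser):
    """Collects the content of <meta name="generator"> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.generators = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # Attribute values can be None, so guard before lowercasing.
        name = (attr.get("name") or "").lower()
        if name == "generator" and attr.get("content"):
            self.generators.append(attr["content"])


def detect_generator(html):
    """Return every generator string found in the given HTML text."""
    finder = GeneratorFinder()
    finder.feed(html)
    return finder.generators


# Invented sample page source; in practice you'd paste in a saved page.
sample = '<html><head><meta name="generator" content="MediaWiki 1.39.0"></head></html>'
print(detect_generator(sample))
```

If it comes back empty, you're back to reading the page footer or poking at URL patterns by hand.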
Definitely legwork for a whole site/set of sites.
Much easier if it's a type of media: books, video, audio, PDFs, a database.
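To show why media is the easier case: if all you want is the files, you mostly need to pull the links out of a page and keep the ones with the right extension. Here's a minimal sketch of that filtering step, assuming you already have the page HTML saved. The extension list, the names, and the example.org URLs are all placeholders, and it does no downloading at all.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumed set of file types; adjust to whatever media you are after.
MEDIA_EXTENSIONS = (".pdf", ".epub", ".mp3", ".mp4")


class LinkCollector(HTMLParser):
    """Gathers absolute URLs from <a href="..."> tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, href))


def media_links(html, base_url):
    """Return links whose path ends in one of MEDIA_EXTENSIONS."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return [
        url
        for url in collector.links
        if urlparse(url).path.lower().endswith(MEDIA_EXTENSIONS)
    ]


# Invented page fragment for illustration.
page = '<a href="/files/book.PDF">book</a><a href="about.html">about</a><a href="talk.mp3?dl=1">talk</a>'
for url in media_links(page, "https://example.org/library/"):
    print(url)
```

Compare that with a wiki or a message board, where the "content" lives in a database behind the pages and there's no single file to grab.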