I need some help. How do you archive content?
-
Sleeping_in_Sleepy_Hollow — 9 years ago(February 04, 2017 09:10 PM)
Check your PM if you need more help.
http://www.my-diary.org/users/851091 -
MuggySphere — 9 years ago(February 05, 2017 03:58 PM)
So would a script or webcrawler be good for archiving the movie forums for movies you have rated and discussed?
That's the kind of thing I was wondering about, for all the movie boards I have participated in. -
timmyp-98035 — 9 years ago(February 05, 2017 04:43 PM)
IMDB should allow archive.org to do it
-
MuggySphere — 9 years ago(February 05, 2017 08:14 PM)
Hey there, that program works fast.
But it only seems to copy the first page of a forum, not the pages you reach when you click into a discussion. Or did I miss a setting that would copy all the pages linked from the main forum page? -
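The problem described above is crawl depth: saving the index page alone misses the per-thread pages it links to. As a rough illustration (not the tool mentioned in this thread; the `/thread/` path pattern here is hypothetical), a crawler first has to extract the discussion links from the index page before it can fetch them:

```python
from html.parser import HTMLParser

class ThreadLinkExtractor(HTMLParser):
    """Collect hrefs that look like links to individual discussion threads."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href", "")
            # Hypothetical URL pattern for thread pages; a real board
            # would need its own pattern here.
            if "/thread/" in href:
                self.links.append(href)

# Illustrative forum index markup, standing in for a real board page.
index_html = """
<ul>
  <li><a href="/board/bd0000001/thread/1">Topic one</a></li>
  <li><a href="/board/bd0000001/thread/2">Topic two</a></li>
  <li><a href="/help">Help</a></li>
</ul>
"""

parser = ThreadLinkExtractor()
parser.feed(index_html)
print(parser.links)  # the per-thread pages a crawler must also fetch
```

A one-page archiver stops after saving `index_html`; a full archive has to loop over `parser.links` (and any pagination links) and fetch each one too.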
!!!deleted!!! (61311691) — 9 years ago(February 07, 2017 05:12 PM)
Regardless of what you might have read, unless the site admins modify their robots.txt file, there's nothing we can do other than email them and request an SQL dump, and neither is likely to happen.
Read the file below:
http://www.imdb.com/robots.txt
You can plainly see that almost all engines and crawlers are banned, and unless the IMDb administrator changes that file, no program or script you try to use to download the forum data will work. You'll grab the main page, but that's about it.
More here:
http://www.imdb.com/board/bd0000001/nest/265784055?d=265829976#265829976 -
!!!deleted!!! (61311691) — 9 years ago(February 07, 2017 05:22 PM)
CORRECTION: It's not that the crawlers are banned (my apologies), it's that most of the directories are protected.
robots.txt for IMDb properties:

User-agent: *
Disallow: /board
Disallow: /boards
This means that ALL crawlers and scanners are prevented from downloading the board directory and every sub-directory beneath it. See the actual file for more.
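For anyone who wants to check what those Disallow rules actually block, here is a small sketch using Python's standard-library robots.txt parser, with the two rules quoted above fed in locally rather than fetched from the live site (the example URLs are just for illustration):

```python
from urllib.robotparser import RobotFileParser

# The Disallow rules quoted above, supplied as text instead of
# being downloaded from http://www.imdb.com/robots.txt.
robots_txt = """\
User-agent: *
Disallow: /board
Disallow: /boards
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# A board URL is blocked for every user agent...
print(rp.can_fetch("*", "http://www.imdb.com/board/bd0000001/"))  # False
# ...while a path outside /board and /boards is still allowed.
print(rp.can_fetch("*", "http://www.imdb.com/title/tt0000001/"))  # True
```

Any crawler that honors robots.txt will make the same check and refuse to descend into `/board`, which matches the behavior people in this thread are seeing.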