pouët.net

Backing up demos with ipfs - good idea or shit?

category: general [glöplog]
As one of the people who maintain the scene.org file area, I can only agree with the sentiments of several above - please, upload your files to us. We are at least currently the best option, with several mirrors around the world to make sure the files are as safe as they can be in the event of a catastrophic incident. In this day and age, you can be certain of very few things.

I for one would love it if more party organizers were vigilant about uploading the contributions in a timely manner. Some don't even bother at all, or put stuff on their own website, which then lapses (looking at you, Silesia party series). Thankfully, I managed to save the Silesia files, but they're the tip of the iceberg. I actively go out and mirror stuff off other hosts almost every day, but there are too many files for one or two people to catch them all.

I guess part of the problem is also that people want to be able to put their productions out quickly, and if they are not in the lucky position of knowing, or having contact information for, an admin who can approve their files quickly, they just throw up a Dropbox link or whatever, and then forget about it. Not sure how to "fix" that, but at the very least: anyone who has contact information for me and is planning a big release that they want hosted, let me know beforehand and I can probably help.

If anyone wants to contribute to making sure files are even safer, feel free to set up more mirrors! If you have the hardware and the bandwidth, we'd love to expand even more.

El Topo: ftp://ftp.scene.org/incoming/1st_readme
added on the 2016-12-11 08:24:42 by menace
@Gargaj: So, crawling is a hard problem indeed, but we can at least solve it to a certain approximation. Getting a mirror of 90% of the stuff that's still alive isn't so hard, so here we go:

http://pouet-mirror.sesse.net/ is filling up as we speak: 1697 successful downloads out of 2167 attempts so far. I have code to pull every URL in every Pouët prod, with the exception of certain hosts that are known-good or not useful to crawl (scene.org, capped.tv, youtube.com, and a few others). Every failed URL is retried once a week until it succeeds. It's rough, but the perfect is the enemy of the good.
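
The core of it is conceptually very simple; a rough sketch (not my actual code, and the on-disk bookkeeping here is just illustrative):

    import hashlib
    import time
    from urllib.parse import urlparse
    from urllib.request import urlopen

    # Hosts that are skipped because they are known-good or not useful to crawl.
    BLACKLIST = {"scene.org", "capped.tv", "youtube.com"}

    RETRY_INTERVAL = 7 * 24 * 3600  # failed URLs are retried once a week


    def should_crawl(url):
        host = (urlparse(url).hostname or "").lower()
        return not any(host == b or host.endswith("." + b) for b in BLACKLIST)


    def crawl(urls, state):
        """state maps URL -> {'ok': bool, 'last_try': unix time, 'sha256': hex digest}."""
        now = time.time()
        for url in urls:
            if not should_crawl(url):
                continue
            entry = state.get(url, {})
            if entry.get("ok") or now - entry.get("last_try", 0) < RETRY_INTERVAL:
                continue  # already mirrored, or failed too recently
            try:
                data = urlopen(url, timeout=60).read()
                digest = hashlib.sha256(data).hexdigest()
                with open(digest, "wb") as f:  # stored under the content hash
                    f.write(data)
                state[url] = {"ok": True, "last_try": now, "sha256": digest}
            except Exception:
                state[url] = {"ok": False, "last_try": now}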

The biggest problem right now is keeping my database of Pouët prods up-to-date. Is there a way, short of pulling one JSON request for every prod, of figuring out which prods were changed recently and need a re-crawl to get new URLs? Or even what the last valid prod ID is, as opposed to hardcoding 69000?
added on the 2016-12-11 20:51:19 by Sesse
Wow awesome Sesse :-)

are the file names content hashes? do you store the mappings (pouet id -> content hash) anywhere public?
added on the 2016-12-12 09:12:49 by skrebbel
The filenames are content hashes, indeed (SHA256). There's no mapping pouet ID → content hash, but there's a mapping URL → content hash (see the JSON file in the top-level directory).

I've added a reverse mapping to the JSON file just now (URL → Pouët ID), but it's a one-to-many thing, as there are multiple links/prods that fit the same SHA256 and even the same URL. (The shared-SHA256 case happens especially when people don't have the file anymore but don't respond with a 404 or similar, so it _looks_ like a file got successfully downloaded, when in reality it's only some banner or spam site.)
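
So resolving a dead link via the mirror boils down to something like this (a sketch only; the JSON filename and field layout here are made up, the actual file in the top-level directory is what counts):

    import json
    from urllib.request import urlopen

    MIRROR = "http://pouet-mirror.sesse.net/"

    # Hypothetical filename and layout; see the real JSON file in the
    # top-level directory for the authoritative mapping.
    url_to_sha = json.load(urlopen(MIRROR + "mapping.json"))


    def mirrored_copy(original_url):
        """Return the mirror URL for a (possibly dead) original URL, or None."""
        sha = url_to_sha.get(original_url)  # URL -> SHA256 content hash
        return MIRROR + sha if sha else None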
added on the 2016-12-12 09:35:58 by Sesse
so original filenames are lost in this process?
added on the 2016-12-12 14:49:10 by havoc
i don't really get this - why not just mirror the files with their original names?
added on the 2016-12-12 15:09:39 by dipswitch
dipswitch: there are plenty of dupe filenames, so you can't just put all those files in a single dir
added on the 2016-12-12 15:13:03 by havoc
and i guess the reason to go for that JSON stuff is so that if a link dies, it could be replaced by some automatic process instead of manual gloperator labour

which makes sense to me, but not if the files are not kept 100% intact

so ehh why not put the file with the original name in a dir that has the hash as name? (best of both worlds..)
added on the 2016-12-12 15:25:09 by havoc
The files are kept 100% intact. :-) The metadata is not; this goes for filenames, HTTP headers, and the like. The reason is a simple combination of a) it's not very important, and b) it allows deduplication (the same file sometimes exists under many different filenames, and, as others have pointed out, vice versa).

My plan is to add something that allows you to write something like http://pouet-mirror.sesse.net/blah/http://www.original.url/filename.zip and have it automatically send out the right file. This would allow both automagic replacement of dead URLs _and_ deduplication.
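
Conceptually it's a tiny handler; roughly along these lines (a sketch, not the real thing, and the /blah/ prefix and mapping filename are placeholders):

    import json
    from wsgiref.simple_server import make_server

    PREFIX = "/blah/"  # placeholder prefix, as in the example above
    url_to_sha = json.load(open("mapping.json"))  # hypothetical name for the URL -> SHA256 map


    def app(environ, start_response):
        # Everything after the prefix is the original (possibly dead) URL;
        # query strings are ignored here for brevity.
        original_url = environ["PATH_INFO"][len(PREFIX):]
        sha = url_to_sha.get(original_url)
        if sha is None:
            start_response("404 Not Found", [("Content-Type", "text/plain")])
            return [b"not mirrored\n"]
        with open(sha, "rb") as f:  # files are stored under their content hash
            data = f.read()
        start_response("200 OK", [("Content-Type", "application/octet-stream")])
        return [data]


    make_server("", 8000, app).serve_forever()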
added on the 2016-12-12 17:58:02 by Sesse
Quote:
Is there a way, short of pulling one JSON request for every prod, of figuring out which prods were changed recently and need a re-crawl to get new URLs? Or even what the last valid prod ID is, as opposed to hardcoding 69000?

1. No. 2. Yes if you can be arsed to parse the latest added prods RSS
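
Something along these lines would do (a sketch; the feed URL is a placeholder, point it at the actual latest-added-prods RSS):

    import re
    from urllib.request import urlopen
    from xml.etree import ElementTree

    FEED_URL = "https://www.pouet.net/..."  # placeholder for the latest-added-prods RSS


    def latest_prod_id():
        """Highest prod ID seen in the feed; prod links look like prod.php?which=NNNNN."""
        tree = ElementTree.parse(urlopen(FEED_URL))
        ids = []
        for link in tree.iter("link"):
            m = re.search(r"prod\.php\?which=(\d+)", link.text or "")
            if m:
                ids.append(int(m.group(1)))
        return max(ids) if ids else None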
added on the 2016-12-12 19:57:44 by Gargaj
This thread is the hosting equivalent of someone seeing a video of a new micro-controller and going "OMG, NEW DEMO PLATFORM?!"
added on the 2016-12-12 20:17:02 by gloom
Gargaj: I see. What would you recommend to make sure I don't miss any new or changed URLs? Periodic re-crawl of the entire range? Just ignoring the problem and sticking to new prods?
added on the 2016-12-12 20:59:58 by Sesse
Also, now you can fetch a file by URL. See the README.
added on the 2016-12-12 21:18:50 by Sesse
Great job Sesse, this is indeed a better solution :)

Quote:
El Topo: ftp://ftp.scene.org/incoming/1st_readme

Thanxz!
added on the 2016-12-12 22:46:57 by El Topo
Go Sesse!
added on the 2016-12-13 22:59:07 by numtek
I pulled all the missing URLs through archive.org's API; it turns out that of the 2879 files, they had 628 of them (a bit over 20%). So now there's actually a fair bit in there that's _not_ accessible on current Pouët.
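
For reference, the per-URL query is roughly this (a sketch of archive.org's availability API, which returns the closest snapshot if one exists):

    import json
    from urllib.parse import quote
    from urllib.request import urlopen


    def wayback_snapshot(url):
        """Return the URL of archive.org's closest snapshot of `url`, or None."""
        api = "https://archive.org/wayback/available?url=" + quote(url, safe="")
        reply = json.load(urlopen(api))
        closest = reply.get("archived_snapshots", {}).get("closest")
        if closest and closest.get("available"):
            return closest["url"]
        return None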
added on the 2016-12-15 00:29:29 by Sesse
Kinda worrying that http://cardboard.pouet.net/broken_links.php? reports a lot fewer missing files. Generally speaking though, the 20% seems consistent with experience from policing cardboard, and 628 is not really a worryingly high number given the amount of dead links we've already fixed since cardboard became available (it would probably have been 10x bigger if you had run this 2 years ago).
added on the 2016-12-15 04:12:25 by havoc
I think it might be a counting discrepancy; I'm mirroring every link (e.g. videos), not just the main download link.

On the other hand, there are some hosts I'm blacklisting, which accounts for a significant chunk of the total number of links, but those are mostly links that are unlikely to be broken (e.g. hosted on scene.org).
added on the 2016-12-15 11:58:28 by Sesse
Quote:
The reason is a simple combination of a) it's not very important,


sorry, but filenames are, in fact, EXTREMELY important, at least for 1990s and early 2000s releases. the filenaming scheme, together with file_id.diz and infofile, was, so to speak, the "face" of a release, a crucial part of the packaging and releasing process.
added on the 2016-12-15 14:00:03 by dipswitch
Quote:
If you want to do anything with sesse's archive, you can always get the filename back through that JSON file. It's not like it's lost forever.

Yeah indeed, the data you provide to Sesse's server to retrieve a backup file will be the broken URL, which of course ends with the original filename. So the filename is not really lost, it is just temporarily replaced for search purposes. I suppose it's not a big step to have files automatically get their original name back once a query is made.

Before I grasped the above, I had the same worry as Dipswitch. But with that worry out of the way, that backup archive is starting to sound like a damn useful and quite probably final solution to any deadlink problems on this site. :)
added on the 2016-12-15 17:03:57 by havoc
havoc: Also, if you give the URL (as opposed to going for the content hash directly), the server sends a Content-Disposition header that tells the browser what filename to download the file as.
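
Conceptually something like this (a sketch; the real server may derive the name a bit differently):

    import posixpath
    from urllib.parse import unquote, urlparse


    def content_disposition(original_url):
        """Download filename derived from the last path component of the original URL."""
        filename = posixpath.basename(unquote(urlparse(original_url).path)) or "download"
        return 'attachment; filename="%s"' % filename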
added on the 2016-12-15 18:18:40 by Sesse
Seemingly there's a fair number of files that both my site and cardboard count as a “success”, but which in reality are just HTML landing pages of expired domains, etc. As a random example, this prod was hosted on planet-d, which now just redirects to /gone/, which returns 200 OK.

I don't know who glöperates cardboard, but if I make a list of URLs that are fake-200 (based on SHA256 collisions, manually filtered), would you want to ingest it?
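
The detection is nothing fancy; essentially grouping downloads by hash and eyeballing groups that span many unrelated URLs, roughly:

    from collections import defaultdict


    def fake200_suspects(url_to_sha, min_urls=3):
        """Hashes shared by several unrelated URLs are usually an error page or
        parked-domain landing page served with 200 OK.  The output still needs
        manual filtering, since some duplicates are legitimate."""
        by_hash = defaultdict(list)
        for url, sha in url_to_sha.items():
            by_hash[sha].append(url)
        return {sha: urls for sha, urls in by_hash.items() if len(urls) >= min_urls}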
added on the 2016-12-18 16:26:13 by Sesse
I'd like to have a copy of that list. Currently I don't think we have anything beyond "deal with it when you happen to come across one" when it comes to battling false positives, and this would likely allow a more targeted approach :)
added on the 2016-12-18 18:33:36 by havoc
Linky linky

Also, prod IDs 20288, 20289, and 20294 through 20309 serve the exact same TrueType file under different names :-) Very confusing.
added on the 2016-12-18 19:06:28 by Sesse
