I saw this in the news and noted that it would likely affect the Cemetech forum as well as many others: Hosting site Imgur will remove explicit and anonymous content next month.
Although Imgur's comments are focused exclusively on removing adult content, they also seem to imply that they plan to delete items that were uploaded anonymously.

On our forum we've long encouraged users to share images in posts via imgur, but it seems those days may be at an end. I know that many of the images I've shared in posts were uploaded anonymously to Imgur (because why would I log in if I don't need to?), and I assume that others have mostly done the same.

Since we've got about a month before any changes are expected to be made, I figured it shouldn't be hard to find all the images embedded in posts that are hosted on imgur, and archive them elsewhere. Using my bbcode parsing library and public dumps of the forum that I've published previously, I parsed every public post on the site and extracted the [img]s that point to imgur.com or i.imgur.com.
After deduplicating those that appear more than once, we end up with 7303 links to images hosted on imgur that are at risk of deletion in the near future.
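The extraction and deduplication step can be approximated in plain shell. This is only a rough sketch: the actual extraction used a BBCode parsing library, whereas the hypothetical `extract_imgur_urls` function below uses regular expressions and will miss edge cases such as tags inside [code] blocks or malformed markup.

```shell
# Hypothetical helper (not the tool used for the real extraction):
# read post text on stdin and print the unique imgur image URLs it embeds.
# A regex is only an approximation of a real BBCode parser.
extract_imgur_urls() {
  # 1. pull out every [img]...[/img] span, case-insensitively
  # 2. strip the surrounding tags
  # 3. keep only URLs on imgur.com or i.imgur.com
  # 4. deduplicate
  grep -oiE '\[img\][^][]+\[/img\]' \
    | sed -E 's/^\[[iI][mM][gG]\]//; s/\[\/[iI][mM][gG]\]$//' \
    | grep -iE '^https?://(i\.)?imgur\.com/' \
    | sort -u
}

# Example usage, assuming posts.txt holds raw post text:
#   extract_imgur_urls < posts.txt > images-dedup.txt
```

Filtering on the host after stripping the tags keeps the patterns simple, at the cost of ignoring URLs with leading whitespace inside the tag.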

With that list, I was able to easily generate a WARC web archive using wget (where images-dedup.txt is the list of image URLs I extracted from post text):
Code:
wget --warc-file=images -i images-dedup.txt --random-wait --wait=1 --user-agent "Cemetech-ImageRescue" --delete-after


These images total about 2GB of data, which is modest but not negligible. Since it had been more than a year since I last published a dump of the forum, I published a new one and included the WARC of images alongside it.

Now that I've ensured that the images are saved somewhere, there's an interesting question of whether something more should be done with them and possibly even if the scope of archival should be expanded.

If we expect that many of the images in posts will be killed by imgur, it would be reasonable to rehost them and automatically update posts to point to the new location. It might even be reasonable to use tooling like replayweb.page (or components thereof) to transparently fix broken images. Detecting which ones are broken isn't as simple as looking for HTTP errors, though: by inspecting the captures I can see that imgur returns 301 redirects to a placeholder image, so you need to understand exactly how imgur handles deletion.
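To illustrate why checking for HTTP errors alone isn't enough, here is a minimal sketch of a classifier that treats a redirect to a placeholder as a deleted image. The function names are hypothetical, and the placeholder path (`/removed.png`) is an assumption that should be checked against real captures before being relied on.

```shell
# Hypothetical classifier: given an HTTP status code and the redirect target
# (empty if there is none), print "ok", "removed", "redirect", or "error".
# ASSUMPTION: imgur's placeholder for deleted images is a URL ending in
# /removed.png; confirm this against the captured responses.
classify_image_response() {
  case "$1" in
    2??) echo ok ;;
    3??)
      case "$2" in
        */removed.png) echo removed ;;
        *) echo redirect ;;
      esac ;;
    *) echo error ;;
  esac
}

# Probe one URL without following redirects. %{http_code} and
# %{redirect_url} are curl write-out variables.
probe_image() {
  curl -s -o /dev/null -w '%{http_code} %{redirect_url}\n' "$1" | {
    read -r status location
    classify_image_response "$status" "$location"
  }
}

# Example usage:
#   probe_image "https://i.imgur.com/12AQMOR.png"
```

Keeping the classification separate from the network request means the rule can be adjusted (or tested) without hitting imgur at all.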

As for expanding the scope of the images we archive, I'm sure there are many images embedded in posts that aren't hosted on imgur and that, over arbitrarily long timescales, are also at risk of deletion. It wouldn't be too hard to capture those and archive them in the same way, but at that point we're kind of becoming an image host, which is not really a function that we as a web site want to have. Perhaps it's worth exploring how we could allow users to host images directly alongside their posts?
It invariably seems to be the case that new image hosts appear and see a lot of use, then realize they can't allow everybody and their dog to hotlink images because that costs them a lot of money with no possibility of charging for it, so the only long-term solution might be to commit to hosting post-related images ourselves.

What do you think about these questions? Do you have an opinion on how we should try to keep images online (or not bother)? What about developing some ability for forum users to upload images to the site for posts?
TI-Planet also uses imgur a lot, including in news article illustrations, even though we also have our own hosting too, so this topic is relevant to me, as I planned to migrate all imgur images to the internal tool at some point...

The external wget-based approach is interesting but I suppose I will just get MySQL to spit out all the image IDs from the post content then take care of the replacement with a script... I suppose this would work as well for Cemetech but 🤷‍♂️
I think you may have misunderstood what wget is doing there: it's saving the actual images, for which the image IDs were extracted by parsing the posts. What you're proposing to do (modify your database semi-manually) doesn't address the issue of saving the images you link to and storing them somewhere.
Oh, I know that - it's just that wget may not see all the pages so instead of hoping it does, I would first get the IDs from the DB directly, then curl/wget the images, and use those local backups as replacements in the posts.

But of course something generic that works from the "public view" of any website is clearly nice too.
The first portion of my commentary leading up to wget was about getting the image URLs out of the forum dump. There's no hoping something saw every post, because the forum dumps come directly out of the database (and simply omit things that are not visible to the public, which seems fine for this application). The images-dedup.txt file wget is consuming is that list of images (and I've added a note to my original post to make that clearer).

Which is to say, what I've done is substantially what you said, but I haven't made a decision about what exactly to do with the saved images.
Are the 7303 images from just anonymous accounts or from proper accounts as well? If it doesn't break imgur's ToS, maybe you could re-upload anonymous images under an official Cemetech imgur account?

I'd like to see some sort of image preservation on Cemetech, even if you need to compress images to conserve space. It sucks looking through old threads just to see a dozen broken image links. Maybe Cemetech could automatically archive any image linked in a post then replace the image link with the Cemetech hosted link? This should definitely be communicated to the end user so they know what will happen to their images. I feel Cemetech is small enough that any abuse of image uploading could be feasibly moderated.
I think we've settled on something for TI-Planet, MyCalcs, and others:
we'll have something like cdn.xxx subdomains where we'll rehost original images and have some cached server-side dynamic resolution serving thing for thumbnails/medium/large like imgur had.
And then cdn/proxy this via cloudflare.
We'll see if that works sufficiently well, otherwise we might have to use CloudFlare Images, which seems pretty cool and not expensive.
TheLastMillennial wrote:
Are the 7303 images from just anonymous accounts or from proper accounts as well? If it doesn't break imgur's ToS, maybe you could re-upload anonymous images under an official Cemetech imgur account?
I don't think there's a way to identify the uploader of an image just given its URL, so it's impossible to say.

Quote:
Maybe Cemetech could automatically archive any image linked in a post then replace the image link with the Cemetech hosted link? This should definitely be communicated to the end user so they know what will happen to their images. I feel Cemetech is small enough that any abuse of image uploading could be feasibly moderated.
An approach like this has seemed like a reasonable solution to me as well; I've thought about some kind of time-delay option, where we archive the images embedded in posts a short time (perhaps a week) after the last change to the post. Not offering users any direct control over what images we store seems like the easiest way to prevent abuse, certainly.

To evaluate how feasible this is, I'd want to examine the current set of image embeds we have, see how much storage they'd consume to mirror, and estimate the growth rate.

Adriweb wrote:
otherwise we might have to use CloudFlare Images, which seems pretty cool and not expensive.
Their pricing does seem decent, though I see from the docs that images are limited to 10MB. That seems like a reasonable limit, but I noticed in the archive I made that there were a decent number of GIFs in excess of that size.
Is there a way I can download the archived images? I would like to host them on my own. You wouldn't have to use them, obviously, but they would be there if anyone needed them. If you are interested, I will offer to host for free; just email me at ceo@southnethosting.com. I don't think it would be too resource-intensive, I have a subdomain I can use, and storage is not an issue. :D
As I mentioned in my original post, the latest forum dump includes the WARC of images I grabbed from imgur.
Oh ok thanks. :D
Thanks for your proactivity about this, Tari. When I was reading about this a few days ago, I realized this was something we needed to worry about, but lacked the time to start investigating. I've always been very resistant to Cemetech becoming an image host for arbitrary user images (cf. the inevitable problems with posting of pornography and worse when Omnimaga became an image host). However, if we only mirror images that are visible in posts, make sure that deleting posts with inappropriate images also deletes the corresponding images, and impose some reasonable image size restrictions, then it makes sense to me to ensure that the completeness of our forum posts isn't dependent on third-party sites over which we have no control.
I could see requiring a minimum post count or possibly a manual title to allow posting images. And yeah, a size limit is likely a good idea as well.

On the subject of broken images, Kerm, the link to Geopipe in your signature has a broken image on it.
Tari wrote:
To evaluate how feasible this is, I'd want to examine the current set of image embeds we have, see how much storage they'd consume to mirror, and estimate the growth rate.
I went ahead and pulled all the non-imgur images as well, which totals 16742 unique URLs (of which only 9685 still work) totaling only about 1.5GB.
Tari wrote:
Tari wrote:
To evaluate how feasible this is, I'd want to examine the current set of image embeds we have, see how much storage they'd consume to mirror, and estimate the growth rate.
I went ahead and pulled all the non-imgur images as well, which totals 16742 unique URLs (of which only 9685 still work) totaling only about 1.5GB.


Wow, that's a much higher percentage of link rot than I was expecting - I wonder how many of the broken image links are still available on the Internet Archive?
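For reference, the figures quoted above (16742 unique URLs, of which 9685 still work) put the share of dead links at a bit over 42%. A minimal sketch of that arithmetic follows; the function name `link_rot_percent` is just for illustration.

```shell
# Hypothetical helper: percentage of dead links, given the total number of
# unique URLs and the number that still resolve.
link_rot_percent() {
  awk -v total="$1" -v working="$2" \
    'BEGIN { printf "%.1f\n", 100 * (total - working) / total }'
}

# Figures from the post above: 16742 unique URLs, 9685 still working.
link_rot_percent 16742 9685
```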

Regarding image hosting, I would definitely be in favor of both mirroring external images and allowing user-uploaded images, within reason. So long as it's not possible for someone to upload an image without also including it in a public post, I don't think that we would run into major moderation issues, even without restricting which accounts can use the feature. We might also want to include an option to disable image mirroring on a per-post or per-link basis, as there are a few cases where links point to dynamic images (e.g. status badges on projects, as are often seen on GitHub, or dynamic userbars).
Given the list of URLs, we can count the number of links that IA has archived with the magic of shell scripting and IA's availability API:

Code:
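# Build one availability-API query per URL: jq -R reads each input line as a
# raw string and @uri percent-encodes it. wget then fetches the queries
# slowly, and the second jq keeps only the snapshot URLs that exist.
wget --wait=5 --output-document=- --input-file=<(
  jq -rR '@uri "http://archive.org/wayback/available?url=\(.)"' \
    not-imgur-dedup.txt
) | jq --unbuffered -r ".archived_snapshots.closest.url" \
  | grep -v '^null$' \
  | tee not-imgur-dedup-ia.txt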
Unfortunately, there's some fairly aggressive rate-limiting on the API, so it'll take some time to run this to completion.
This is why I host my images myself. :)
I have put the archive onto this domain: https://pics.snh.cx/. If you want to see a broken imgur link on Cemetech, just replace the domain in the link with https://pics.snh.cx/, so https://i.imgur.com/12AQMOR.png corresponds to https://pics.snh.cx/12AQMOR.png. I might make a page to paste a link in and do it automatically if I get around to it.
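The domain swap described above can also be done mechanically. A minimal sketch, assuming one URL per line on stdin; the function name `to_mirror_url` is hypothetical, and only i.imgur.com links are rewritten because that is the only mapping stated above.

```shell
# Hypothetical helper: rewrite i.imgur.com image links (one per line on
# stdin) to the corresponding mirror URL, leaving other URLs untouched.
to_mirror_url() {
  sed -E 's#^https?://i\.imgur\.com/#https://pics.snh.cx/#'
}

# Example usage:
#   echo "https://i.imgur.com/12AQMOR.png" | to_mirror_url
```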

Edit: I have noticed that there seem to be 573 images missing. I will try to find them. :)
Nice work Invalid Jake, a good solution.

Hopefully there won't be another apocalypse for your host though lol :D
tr1p1ea wrote:
Nice work Invalid Jake, a good solution.

Hopefully there won't be another apocalypse for your host though 0x5 :D

Thanks, always happy to help. Hopefully there won't be any more apocalypses. :o
  