As we all know, there is a new privacy monster in Europe: the GDPR.
For all of us content creators and bloggers who provide services to Europeans, the hassle began: we had to adapt to the new regulation, always with the fear in mind that if we fuck it up we will face huge penalties.
So I went through my websites: remove the Facebook Like button, remove analytics, add a privacy statement, add cookie opt-in/opt-out … the list goes on.
As I have always been privacy-aware, most of my websites were not a big problem. I just removed some scripts and it worked out fine.
But there were also some websites that would have been a huge hassle to make GDPR compliant (a web app saving users’ PDFs and IPs, for example). For those projects it was not worth the work, so I shut them down.
One project I ditched was datenschutzhelden.org (main GDPR problem: the comment function). You may now be wondering: why is he linking to a dead page?
Actually, the page is not completely dead! I don’t host it anymore, and it only lives on in the Wayback Machine (WBM). All requests to the domain get forwarded to the WBM.
As I don’t own the WBM and none of the content is served from my servers, I see myself as compliant with the GDPR by handing the risk over to someone else. (Thanks to the Internet Archive <3)
By being offline and online at the same time, we have the Schrödinger’s website paradox :P
But joking aside: I’m not a lawyer and have no clue whether that really solves my problems with the GDPR, as I still own the domain. So see this article as an idea for yourself and not as a legally binding statement.
Important notice: datenschutzhelden.org was NOT closed because of the GDPR. We had already closed it earlier (February 2018), mostly for lack of time. But after the end of the project, the page was still up and running as an archive on my server (with me still being fully responsible for it). In March I decided to move it into the WBM so that I no longer have it on my servers (and hopefully keep legal issues away).
The first part told you why I decided to go down that path and what the end result looks like. In the second part you will get to know how you can do the same.
How to archive your website yourself
Because the WBM only saves pages that are requested directly via its search, normally only the few main pages of your blog get saved (or you have quite some manual work to do by searching for all your subpages).
That looks like a problem one could automate :D
As I didn’t want to add all my subpages manually, I created a tool for doing so:
By following a simple set of instructions (next block) you can automatically scrape your page for all subpages and add them to the WBM (optionally with all requests being proxied to circumvent rate limits).
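To give a rough idea of what "scrape your page for all subpages and add them to the WBM" means in practice, here is a minimal sketch. It is not the code of the actual tool: the function names collectLinks and saveURL are made up for illustration, the HTML is a hard-coded sample instead of a downloaded page, and a naive regular expression stands in for a real HTML parser such as goquery. It also assumes that pages are submitted through the Wayback Machine's "Save Page Now" address, https://web.archive.org/save/ followed by the page URL.

```go
package main

import (
	"fmt"
	"net/url"
	"regexp"
	"sort"
)

// hrefPattern is a deliberately naive matcher for href="..." attributes.
// Assumption: good enough for a sketch; a real crawler should parse the HTML.
var hrefPattern = regexp.MustCompile(`href="([^"]+)"`)

// collectLinks (hypothetical helper) returns the absolute URLs of all links
// in html that stay on the same host as baseURL, deduplicated and sorted.
func collectLinks(baseURL, html string) ([]string, error) {
	base, err := url.Parse(baseURL)
	if err != nil {
		return nil, err
	}
	seen := map[string]bool{}
	for _, m := range hrefPattern.FindAllStringSubmatch(html, -1) {
		ref, err := url.Parse(m[1])
		if err != nil {
			continue // skip links that are not valid URLs
		}
		abs := base.ResolveReference(ref) // turns /about into https://host/about
		abs.Fragment = ""                 // #anchors point to the same page
		if abs.Scheme != "http" && abs.Scheme != "https" {
			continue // skip mailto:, ftp: and similar
		}
		if abs.Host != base.Host {
			continue // skip links that leave the website
		}
		seen[abs.String()] = true
	}
	links := make([]string, 0, len(seen))
	for l := range seen {
		links = append(links, l)
	}
	sort.Strings(links)
	return links, nil
}

// saveURL (hypothetical helper) builds the assumed "Save Page Now" address
// that asks the Wayback Machine to archive the given page.
func saveURL(page string) string {
	return "https://web.archive.org/save/" + page
}

func main() {
	// Sample HTML standing in for a page that would normally be downloaded.
	page := `<a href="/about">About</a>
<a href="https://example.com/blog/post-1#top">Post</a>
<a href="https://other.org/">Elsewhere</a>`

	links, err := collectLinks("https://example.com/", page)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, l := range links {
		fmt.Println(saveURL(l))
	}
}
```

A real crawler would repeat this for every newly found subpage and then request each save address; the sketch only prints them so it can run without touching the network.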
Using the ‘Save to web.archive.org’-Tool
I assume you have already installed Go. (Go installation manual)
Download the dependencies by executing the following two commands:
go get -u github.com/simonfrey/proxyfy
go get -u github.com/PuerkitoBio/goquery
Then clone the git repo:
git clone https://github.com/simonfrey/save_to_web.archive.org.git
Navigate into the directory of the git repo.
In the following command, replace http[s]://[yourwebsite.com] with the URL of the website you want to scrape and save.
go run main.go http[s]://[yourwebsite.com]
Additional command-line arguments:
-p for proxying the requests
-i for also crawling internal URLs (e.g. /test/foo)
So if you want the tool to also crawl internal links and use a proxy while doing so, the command would be:
go run main.go -p -i http[s]://[yourwebsite.com]
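For the curious, two boolean switches followed by a URL like this can be handled with Go's standard flag package. The following is only a sketch under that assumption; the real main.go may parse its arguments differently, and the options struct and parseArgs function are hypothetical names.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"io"
)

// options holds the parsed command line (hypothetical type for this sketch).
type options struct {
	useProxy      bool   // -p: proxy the requests
	crawlInternal bool   // -i: also crawl internal URLs
	target        string // the website to scrape and save
}

// parseArgs (hypothetical helper) parses arguments such as
// "-p -i https://example.com" into an options value.
func parseArgs(args []string) (options, error) {
	var o options
	fs := flag.NewFlagSet("save_to_web.archive.org", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep the sketch quiet on bad input
	fs.BoolVar(&o.useProxy, "p", false, "proxy the requests")
	fs.BoolVar(&o.crawlInternal, "i", false, "also crawl internal URLs")
	if err := fs.Parse(args); err != nil {
		return o, err
	}
	// Exactly one positional argument is expected: the website URL.
	if fs.NArg() != 1 {
		return o, errors.New("expected exactly one website URL")
	}
	o.target = fs.Arg(0)
	return o, nil
}

func main() {
	o, err := parseArgs([]string{"-p", "-i", "https://example.com"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("proxy=%v internal=%v target=%s\n", o.useProxy, o.crawlInternal, o.target)
}
```

Note that the flag package stops parsing at the first non-flag argument, which is why the switches go before the URL, just as in the command above.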
[NGINX] Redirecting your domain
It’s as simple as adding a single magic line to your site config:
Please replace http[s]://[yourwebsite.com] with the URL of your own website.
return 301 https://web.archive.org/web/http[s]://[yourwebsite.com]$request_uri;
Now all requests to your website get forwarded to the WBM. It even keeps your old links alive, because $request_uri appends the originally requested path and query string to the archive URL.
I hope I was able to help you a bit with this article. See your pages soon in the WBM :D
- Simon Frey