Scraping and archiving the web
Video - yt-dlp
A video downloader that supports many sites besides YouTube
Windows
scoop update
scoop install yt-dlp
Update at any time with
scoop update *
Now type:
yt-dlp <link to your video or profile>
yt-dlp will detect existing files and skip them.
Filenames
Filenames include the video ID by default
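As a rough sketch of how the filename and the skip behaviour can be controlled, the wrapper below passes an output template and a download archive to yt-dlp. The function name fetch_videos, the YTDLP override variable, the archive filename and the folder-per-uploader layout are all assumptions for illustration, not part of the original notes.

```shell
# YTDLP is a hypothetical override so the command can be swapped out (e.g. for a dry run).
YTDLP="${YTDLP:-yt-dlp}"

# fetch_videos URL...: download each URL into a folder per uploader.
# --download-archive records finished video IDs so later runs skip them.
# -o sets the filename template; this one keeps the video ID in the name.
fetch_videos() {
    "$YTDLP" \
        --download-archive "yt-dlp-archive.txt" \
        -o "%(uploader)s/%(title)s [%(id)s].%(ext)s" \
        "$@"
}
```

Called as fetch_videos followed by one or more links, it should behave like the plain yt-dlp command above, only with the extra options applied.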
Images - gallery-dl Windows instructions
gallery-dl is the image equivalent of youtube-dl
Installation
scoop update
scoop install gallery-dl
Update at any time with
scoop update *
Cmder recommended (better interface for cmd/powershell)
https://scoop.sh/#/apps?q=cmder&s=0&d=1&o=true
Creating the config file
Create a config.json file in your appdata folder
C:\Users\yourname\AppData\Roaming\gallery-dl\config.json
MUST BE config.json ! DO NOT CONFUSE IT WITH gallery-dl.conf !
Barebones example config (for experienced JSON users)
Full completed config reference (Do not use)
On Windows, write the \ in directory paths as \\ or /
Example:
"base-directory": "E:/Home/Pictures/gallery-dl/", "cookies": "E:/Home/Pictures/gallery-dl/cookies.txt", "archive": "C:/Users/USERNAME/AppData/Roaming/gallery-dl/{category}.sqlite3",
Create a database of downloaded files
Get a major speed boost and future-proof your collection by writing saved files to a database in your config folder.
https://github.com/mikf/gallery-dl/blob/master/docs/configuration.rst#extractor-archive
Example:
"archive": "C:/Users/USERNAME/AppData/Roaming/gallery-dl/{category}.sqlite3",
This will create a file for each site, such as pixiv.sqlite3, twitter.sqlite3, etc.
It's not necessary, but sqlite files can be opened with sqlitebrowser (available on scoop)
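A minimal sketch of creating the config file from a bash terminal, using the three example settings above placed under gallery-dl's "extractor" section. The function name write_config is hypothetical, and the paths are the same placeholders used in the examples, so replace them with your own.

```shell
# write_config DIR: create DIR/config.json with a minimal gallery-dl config.
# Refuses to overwrite a config.json that already exists.
write_config() {
    dir="$1"
    if [ -e "$dir/config.json" ]; then
        echo "config.json already exists in $dir, leaving it alone" >&2
        return 1
    fi
    mkdir -p "$dir"
    # Placeholder paths copied from the examples above; edit before use.
    cat > "$dir/config.json" <<'EOF'
{
    "extractor": {
        "base-directory": "E:/Home/Pictures/gallery-dl/",
        "cookies": "E:/Home/Pictures/gallery-dl/cookies.txt",
        "archive": "C:/Users/USERNAME/AppData/Roaming/gallery-dl/{category}.sqlite3"
    }
}
EOF
}
```

In a bash terminal on Windows, something like write_config "$APPDATA/gallery-dl" should target the appdata folder named above, assuming APPDATA is set in that shell.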
Getting your cookies.txt
Make sure you can see the source code and trust these Chrome extensions or disable them immediately after use.
Recommendations
- Make sure fetching retweets is off
- Consider fetching replies if they might contain anything of value beyond "funny giphy reaction image" posts
Sites
Use cookies or the username/password in the config file
Pixiv
gallery-dl oauth:pixiv
Now follow the instructions in the terminal
DeviantArt
gallery-dl oauth:deviantart
Starting the download
Now when you use the command:
gallery-dl.exe <URL of user profile, gallery, folder>
it will download into your base-directory/<website>/<username>/, no matter which directory your terminal is in.
Mass download
Create a .sh file (for example gallery.sh)
Edit it as if it were a text file and start the file with "gallery-dl" followed by the URLs you would like to fetch. Separate them with spaces. Example:
gallery-dl.exe https://twitter.com/username https://www.deviantart.com/username https://www.pixiv.net/en/users/221515
Now in the terminal you can type
bash.exe E:\Home\Pictures\gallery-dl\gallery.sh (your own location)
and it will download every URL in the file, skipping existing files if they are in the database file. Some text editors (e.g. Notepad++) highlight every other occurrence of the text you select, which is useful for checking whether you've already added someone's profile to the script.
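The duplicate check can also be done from the terminal. This is a small sketch, assuming the script is laid out as described above (URLs separated by spaces or line breaks); the function name find_duplicate_urls is made up for the example.

```shell
# find_duplicate_urls FILE: print every URL that appears more than once in FILE.
find_duplicate_urls() {
    # Put each word on its own line (also dropping Windows \r line endings),
    # keep only the words that look like URLs, then list the repeated ones.
    tr -s ' \t\r' '\n' < "$1" | grep -E '^https?://' | sort | uniq -d
}
```

Running find_duplicate_urls on gallery.sh prints nothing when every profile is listed only once.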
Mass download alias
You can make a shortcut that runs your .sh file with
alias dlg=bash.exe E:\Home\Pictures\gallery-dl\gallery.sh
Now, just type "dlg" in the terminal to start mass downloading.
Additional tips
Viewing all images at once
Tip: you can search for "." in all subfolders to view all images at once, while keeping everything tidy in folders.
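A terminal counterpart to that search, as a sketch: list every image file under a folder, however deeply nested. The function name list_images and the set of extensions are assumptions; add whatever formats your collection contains.

```shell
# list_images [DIR]: print the path of every image file under DIR (default: current folder).
# -iname matches case-insensitively, so .JPG and .jpg are both found.
list_images() {
    find "${1:-.}" -type f \( \
        -iname '*.jpg' -o -iname '*.jpeg' -o -iname '*.png' \
        -o -iname '*.gif' -o -iname '*.webp' \)
}
```

Piping the result through wc -l gives a quick count of how many images have been collected.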
Filenames
Files from Twitter are named after their tweet IDs
Websites
Saving a web page as a single file
The browser's built-in "Save as" also creates a folder of assets. You can use an add-on or wget to save to a single .html file instead.
Extension
Chromium will not support this extension in the future due to internal addon changes, so stick with Firefox.
SingleFile embeds a timestamp in the saved page by default. It can be turned off in its settings.
wget
wget example.com
wget names the saved file after the URL, so a page whose URL has no file extension is saved without a filetype. You can still open it in your browser, or add .html to the name.
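One way to avoid the rename step, sketched here, is to choose the filename up front with wget's -O option. The helper names html_name and save_page are hypothetical.

```shell
# html_name NAME: print NAME, adding ".html" unless it already ends in .html or .htm.
html_name() {
    case "$1" in
        *.html|*.htm) printf '%s\n' "$1" ;;
        *)            printf '%s.html\n' "$1" ;;
    esac
}

# save_page URL NAME: download URL into a file called NAME, with .html added if missing.
save_page() {
    wget -O "$(html_name "$2")" "$1"
}
```

For example, save_page example.com frontpage would write the page to frontpage.html.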
Saving an entire website
Windows installation
scoop update
scoop install wget
Update at any time with
scoop update *
wget --random-wait -r -p -e robots=off -U mozilla http://www.example.com
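The same command spelled out with long option names, as a sketch to show what each flag does. The function name mirror_site and the WGET override variable are assumptions. The --wait=1 is an addition that is not in the original command: the wget manual describes --random-wait as varying the delay set by --wait, so it likely needs a base delay to have any effect.

```shell
# WGET is a hypothetical override so the command can be swapped out for a dry run.
WGET="${WGET:-wget}"

# mirror_site URL: recursively download URL together with the assets its pages need.
#   --recursive           follow links within the site (-r)
#   --page-requisites     also fetch images, CSS and scripts (-p)
#   --wait / --random-wait  pause a varying amount between requests
#   --execute robots=off  ignore robots.txt (-e robots=off)
#   --user-agent          identify as "mozilla" instead of wget (-U)
mirror_site() {
    "$WGET" \
        --recursive \
        --page-requisites \
        --wait=1 --random-wait \
        --execute robots=off \
        --user-agent=mozilla \
        "$1"
}
```

mirror_site http://www.example.com then saves the site into a folder named after the host.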