Antons blogg om elektronik och Linux

3 August 2010

4chan download script for Linux

Filed under: Uncategorized, Terminal — Anton @ 14:10

I usually write in Swedish, but since this post is aimed at an international audience I am taking the opportunity to write it in English.

The 4chan download script

This is an updated version of the 4chan download script for Linux originally written by Daniel Triendl: http://blog.pew.cc/blog/4chan+download+script/

The modified script downloads every image file in a 4chan thread, preserving the original file names (not the incrementing numbers given by 4chan). Perfect for downloading entire sets of pictures or other original content. Tested on a few different boards but should theoretically work on all.

Last update: August 2012 (after 4chan’s HTML5 redesign and switch to HTTPS by default). Known bugs and limitations:

  • If there are several files in the thread with the same original filename, only the first will be downloaded.
  • If an image file from another thread is linked to in a post, it will also be downloaded and the link-filename relationship will be messed up.
  • Network errors are treated like 404 errors.
  • Threads that have a slash (/) in the subject break the link-filename relationship, because the subject is treated as a filename. No known workaround at this time.
#!/bin/bash
# A bash script for downloading all images in a 4chan thread to their original
# filenames. Updates every 60 seconds until canceled or the thread disappears.
# 
# Copyright 2008, 2010, 2012 Daniel Triendl, Anton Eliasson
# http://blog.pew.cc/blog/4chan+download+script/
# https://antoneliasson.wordpress.com/2010/08/03/4chan-download-script/
# 
# 
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
# 
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
# 
# You should have received a copy of the GNU General Public License
# along with this program.  If not, see <http://www.gnu.org/licenses/>.
#

if [ "$1" = "" ]; then # no arguments
	echo "Usage: `basename $0` <4chan thread url> [optional: download directory]"
	exit 1
fi

if [ "$2" = "" ]; then # only one argument
	LOC=$(echo "$1" | egrep -o '([0-9]*)$' | sed 's/\.html//g' ) # find out the thread number
else
	LOC=$2 # use download dir specified by user
fi
echo "4chan downloader"
echo "Downloading to \"$LOC\" until canceled or 404'd"

if [ ! -d "$LOC" ]; then
	mkdir -- "$LOC"
fi

cd -- "$LOC" || exit 1 # new directory named after the thread number

while [ "1" = "1" ]; do
	thread=`mktemp` # thread is the html thread
	links=`mktemp` # links will be a list of all image addresses
	names=`mktemp` # names will be a list of all original file names

	# get thread
	echo "Updating..."
	wget -q -k -O "$thread" "$1"
	if [ "$?" != "0" ]; then
		echo "Update failed, exiting"
		date
		rm $thread $links $names
		exit 1
	fi

	# get file list, space separated
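	# (matched addresses look roughly like http://images.4chan.org/b/src/1343991822123.jpg)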
	grep -E -o 'http[s]?://images\.4chan\.org/[a-z0-9]+/src/([0-9]*)\.(jpg|png|gif)' "$thread" | uniq | tr "\n" " " > "$links"

	# get original file name list, space separated (spaces in filenames changed to underlines)
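	# (4chan keeps the uploader's original filename in the title attribute of a span element,
	# which is why the grep below looks for <span_title= after the space-to-underscore substitution)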
	sed 's/ /_/g' "$thread" | grep -E -o '<span_title="[^"]+' | awk -F \" '{print $2}' | tr "\n" " " > "$names"

	COUNT=`cat $names | wc -w` # total number of files/names
	for ((i=1; i<=$COUNT; i++)); do
		wget -nv -nc -O "$(cut -d ' ' -f "$i" "$names")" "$(cut -d ' ' -f "$i" "$links")" # now download all files, one by one
	done

	rm $thread $links $names

	echo "Waiting 60 seconds before next run"
	sleep 60
done;

This should run on any Linux-based OS using the bash shell. Feel free to contact me if you find any bugs and/or improve the script.
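
For reference, a typical run could look something like this (the script name 4chan.sh and the thread URL are just placeholders; the second argument is optional):

chmod +x 4chan.sh
./4chan.sh "https://boards.4chan.org/wg/res/1234567" wallpapers

The images then end up in ./wallpapers/; leave out the second argument and they are saved to a directory named after the thread number instead.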

5 comments

  1. This doesn't work on OS X, so just do this:

    TMP=`mktemp /tmp/$RANDOM` # TMP is the html thread
    TMP2=`mktemp /tmp/$RANDOM` # TMP2 will be a list of all image addresses
    TMP3=`mktemp /tmp/$RANDOM` # TMP3 will be a list of all original file names

    Comment by Anonymous — 14 July 2011 @ 01:08

  2. That fix probably still applies, but the variables have been renamed to thread, links and names respectively.

    Comment by Anton — 1 June 2012 @ 10:25

  3. I’m having some trouble getting it to work. It looks like maybe part of the grep command that makes the list of names is missing?

    Comment by zorblek — 4 June 2012 @ 03:08

  4. Yes, I accidentally posted the code without converting the < and > to &lt; and &gt;, so parts of the code were interpreted as HTML. Try it again now.

    Comment by Anton — 8 June 2012 @ 10:52

  5. Update for i.4cdn.org:

    # get original file name list, space separated (spaces in filenames changed to underlines)
    sed 's/ /_/g' "$thread" | \
    grep -E -o '[^<]*\)' | \
    cut -f 2 -d '>' | \
    cut -f 1 -d '<' > "$names"

    Probably also want to avoid filename collisions (untested code):

    # now download all files, one by one
    for ((i=1; i<=$COUNT; i++)); do
        # Get the source link and destination name.
        THISNAME="$(cut -d ' ' -f $i $names)"
        THISLINK="$(cut -d ' ' -f $i $links)"

        # If the target file already exists
        # (two different posters submit different images with the same filename)
        # then rename any duplicate names.
        if [ -e "${THISNAME}" ]; then
            NAMECTR=1
            NEWNAME="$(echo "${THISNAME}" | \
                sed "s/....$/.${NAMECTR}&/")"

            while [ -e "${NEWNAME}" ] ; do
                NAMECTR="$((NAMECTR+1))"
                NEWNAME="$(echo "${THISNAME}" | \
                    sed "s/....$/.${NAMECTR}&/")"
            done
            THISNAME="${NEWNAME}"
        fi

        wget -nv -nc -O "${THISNAME}" "${THISLINK}"
    done

    Comment by Anonymous — 6 January 2014 @ 00:28

