How to Scrape Mixergy for Audio Content

If you’re a nerd and care about what other nerds have to say (especially about tech companies), Mixergy is a good source of audio interviews. If you’re too lazy to click through every link hunting for the MP3 and just wanna download the “whole site”, here’s how:

1) get sitemap.xml

2) curl each page listed in the sitemap and search for “.mp3”

3) download every MP3 you find

OR if you are even lazier than that, here’s the script:

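#!/bin/sh

# Start with an empty list of MP3 URLs.
: > mixergy_files.txt

# Pull every <loc> URL out of the sitemap: strip the tags, then any whitespace.
for i in $(curl -s http://mixergy.com/sitemap.xml | grep "<loc>" | sed -e 's/<[^>]*>//g' -e 's/[[:space:]]//g')
do
    echo "$i"
    # Keep the lines that mention .mp3 and cut them down to the href value.
    curl -s "$i" | grep "\.mp3" | sed 's/^.*<a href="//' | sed 's/".*$//' >> mixergy_files.txt
done

# Download everything that was collected.
for i in $(cat mixergy_files.txt)
do
    wget "$i"
done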
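
One weak spot in the script: the two sed calls keep only the last link on each matching line of HTML, which may not even be the MP3 if several links share a line. Here’s a minimal sketch of a sturdier extraction step. It assumes your grep supports -o (GNU and BSD grep both do, but it isn’t POSIX) and that the pages link to their MP3s with absolute http(s) URLs. The function names extract_locs and extract_mp3_urls are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative, not part of the original script.

# Print every <loc> URL found in sitemap XML read from stdin, one per line.
extract_locs() {
    grep -o '<loc>[^<]*</loc>' | sed -e 's/<loc>//' -e 's#</loc>##'
}

# Print every absolute .mp3 URL found in HTML read from stdin, de-duplicated.
# Assumption: URLs contain no quotes, spaces, or angle brackets.
extract_mp3_urls() {
    grep -oE "https?://[^\"' <>]+\.mp3" | sort -u
}

# Possible usage (needs network access, so it is left commented out):
#   curl -s http://mixergy.com/sitemap.xml | extract_locs
#   curl -s "$page_url" | extract_mp3_urls >> mixergy_files.txt
```

Because both helpers read stdin, you can try them on a saved page or sitemap before pointing them at the live site.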

