Published: Fri 24 May 2024
By Chad Jemmett
In work.
tags: work algorithms
Depth-First Search for cataloging links on a site.
At my current job we use a content management system (CMS) to post content to the web. We also use the Google suite of
tools for sharing documents via the CMS. There were hundreds of links in the CMS that led to Google documents, and I wasn't sure where our CMS ended and where Google began. To
understand what content we had and where it lived, I decided to catalog all the outgoing links on our site.
I briefly looked for third-party tools, but I wanted to write my own scraper, so I wrote a little script to
do a depth-first search of all the site's links.
I had never used depth-first search for anything practical, just for searching and navigating test graphs. This was an actual
real-world application.
Tools
I used Python, Beautiful Soup, Deque, Requests and urllib.parse.
I used Beautiful Soup to search the pages for `a` tags, and `urllib.parse` to pull the domain and directory from each `a` tag. Deque was new to me. It's pronounced "deck" and it's short for double-ended queue. It's part of the Python standard library, in the `collections` module.
For each link, I used `requests` to get the HTML data.
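As a rough illustration of pulling the domain and directory out of a link, here is a minimal sketch using `urllib.parse`. The `split_link` helper and the `example.com` URL are hypothetical, made up for this example rather than taken from the real script.

```python
from posixpath import dirname
from urllib.parse import urlparse


def split_link(href):
    """Return (domain, directory) for an absolute URL. Hypothetical helper."""
    parts = urlparse(href)
    # netloc holds the domain; dirname drops the last segment of the path
    return parts.netloc, dirname(parts.path)


domain, directory = split_link("https://example.com/staff/forms/index.html")
print(domain)     # example.com
print(directory)  # /staff/forms
```

Comparing `domain` against the site's own domain is enough to tell an internal link from an outgoing one.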
`deque` is optimized for appending and popping at both ends. A Python `list` supports similar operations, but
it is only fast at the right end: `pop(0)` and `insert(0, v)` are O(n), because every other element has to move in memory.
A `deque` does both ends in constant time. For my purpose, a `list`
would work just fine, since a stack only ever touches one end and the website I was scraping had roughly 1000 external links. But I enjoy learning about new aspects
of Python.
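To make the comparison concrete, here is a small sketch of `deque` used as a stack. The paths are invented for the example. `append` and `pop` both work on the right end, which is all a stack needs; `popleft` is the kind of operation a `list` cannot do cheaply.

```python
from collections import deque

stack = deque()
stack.append("/about")
stack.append("/contact")
stack.append("/news")

newest = stack.pop()      # last in, first out: the stack behavior
oldest = stack.popleft()  # constant time on a deque, O(n) as list.pop(0)

print(newest)       # /news
print(oldest)       # /about
print(list(stack))  # ['/contact']
```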
I used `deque` to manage the stack data structure for the depth-first search. I used `urllib.parse` to figure
out which links were Google links and which were not. I also used it to build up new links to add to the stack.
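Putting those pieces together, the crawl might look roughly like the sketch below. It is not the actual script: `crawl`, `get_links`, `is_google`, `SITE`, and the `example.com` pages are all hypothetical, and `get_links` stands in for the Requests and Beautiful Soup step so the example runs without touching the network.

```python
from collections import deque
from urllib.parse import urljoin, urlparse

# Hypothetical site: page URL -> href values found in its <a> tags.
SITE = {
    "https://example.com/": ["/about", "https://docs.google.com/document/d/1"],
    "https://example.com/about": ["/", "team", "https://twitter.com/example"],
    "https://example.com/team": ["https://docs.google.com/document/d/2"],
}


def get_links(url):
    """Stand-in for fetching a page and collecting its href values."""
    return SITE.get(url, [])


def is_google(link):
    """True when the link's host is google.com or one of its subdomains."""
    host = urlparse(link).netloc
    return host == "google.com" or host.endswith(".google.com")


def crawl(start, get_links=get_links):
    """Depth-first walk of one site; returns (pages visited, external links)."""
    home = urlparse(start).netloc
    stack = deque([start])
    visited = set()
    external = set()
    while stack:
        url = stack.pop()  # take the most recently added page (LIFO)
        if url in visited:
            continue
        visited.add(url)
        for href in get_links(url):
            link = urljoin(url, href)  # resolve relative links against the page
            if urlparse(link).netloc == home:
                stack.append(link)     # internal: explore it later
            else:
                external.add(link)     # outgoing: just record it
    return visited, external


pages, outgoing = crawl("https://example.com/")
google = sorted(link for link in outgoing if is_google(link))
print(len(pages), len(outgoing), len(google))  # 3 3 2
```

The `visited` set is what keeps the search from looping forever when two pages link to each other, as the home and about pages do here.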
After some trial and error, I scraped all 70+ pages of our website and found nearly a thousand external links. Most
of them went to Google, as I figured, but a good percentage went to social media. It's satisfying to write a short
script to automate your way through hundreds of links to figure out what is what.