At work, a community group sent my department a directory of local businesses. It was in a spreadsheet, and it was a mess.
It looked something like this:
There were 5000 lines of this mess. There was some consistency, but not enough to make this an easy fix.
The first thing I did was download the Google Sheet as an HTML file. I combed through it to find some consistent class names to iterate over and collect the data. I turned to Beautiful Soup to loop through the doc and pick out pieces of it to write to a CSV file. That worked OK, but things like the email and phone number were on the same line, so they shared the same class name.
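Here is roughly what that first pass looked like. This is only a sketch: the file name and the biz-detail class are placeholders standing in for whatever the exported HTML actually used.

```python
# Sketch of the Beautiful Soup pass. "directory.html" and the
# "biz-detail" class are placeholders for the real export.
import csv

from bs4 import BeautifulSoup

with open("directory.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

rows = []
for cell in soup.find_all(class_="biz-detail"):
    text = cell.get_text(strip=True)
    if text:
        rows.append([text])

with open("directory.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)
```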
I fiddled with the HTML doc for a while, but I realized I needed another option. I asked some people in the Python Slack channel, and someone suggested I use openpyxl. I'd never used it before, which is why, being more familiar with Beautiful Soup, I had gone there first. I started reading the docs for openpyxl and threw together a few lines of practice code. It seemed to work pretty well. I was able to pick out rows and columns to get data from.
From there I could write in some logic to pick out the business names. In openpyxl, you can get font data. For example, worksheet[46][0].font.b grabs the first cell of the 46th row of the specified worksheet and checks its font.b property, which tells you whether the text is bold. If it came back True, I could do another little loop from there.
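Here's the kind of practice code I mean. Assume the workbook is saved locally as directory.xlsx; the file name is a placeholder.

```python
# Poking at a single cell with openpyxl.
from openpyxl import load_workbook

wb = load_workbook("directory.xlsx")
ws = wb.active

cell = ws[46][0]        # first cell of row 46
print(cell.value)       # the text in the cell
print(cell.font.b)      # True if the text is bold
print(cell.font.size)   # the font size, handy for filtering
```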
So my code would go down the rows. When it hit a cell with bold text and a font size under 14, it would run a short loop.
Every entry in the spreadsheet covered only 5 rows, with 4 columns in each row. So when the code reached a bold business name, it would do a short loop over those 5 rows and iterate through each column. If a cell had a value, I would pull that data and add it to a list. I ended up with a list about 7 or 8 elements long. Some of them were a None value, but that was OK. I was able to get all the correct data related to the business name.
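Putting it together, the whole pass looked roughly like this. It's a sketch under the same assumptions as above: bold names with a font size under 14 mark the start of an entry, each entry spans 5 rows of 4 columns, and the file names are placeholders.

```python
# Walk the sheet, find bold business names, and collect the 5x4 block
# of cells under each one into a CSV row.
import csv

from openpyxl import load_workbook

wb = load_workbook("directory.xlsx")
ws = wb.active

records = []
for row in ws.iter_rows():
    first = row[0]
    is_name = first.value and first.font.b and (first.font.size or 0) < 14
    if is_name:
        entry = []
        # Sweep the 5 rows and 4 columns that make up one entry.
        for r in range(first.row, first.row + 5):
            for c in range(1, 5):
                value = ws.cell(row=r, column=c).value
                if value is not None:
                    entry.append(value)
        records.append(entry)

with open("businesses.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(records)
```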
From there I wrote the list into a CSV file. It wasn't perfect; I had some columns mismatched. If a business didn't have a fax number, the line would sometimes have the physical address in that column instead. The next step required some manual work, but it was easy compared to what it would have been before I automated the collection of all the data. In the final spreadsheet I would filter a column by Fax: and a certain number of rows that were empty would rise to the top. In the next column was something like the physical address or the URL. All I had to do was cut large blocks of data from the mismatched column and move them to the right column. It took a little work, but I was high on the success of scraping all the data, so a little manual work didn't matter. Ultimately my boss was happy with the list. However, I found out later that the creator of the messy spreadsheet had sent an updated version with more organization. Oh well. Next time I'll work faster.