Hello readers! Once again, we find ourselves in the throes of the semester, but take heart! Fall is well upon us and we are now able to walk to class without dying of heat stroke. Please celebrate with copious helpings of sweater-wearing and coffee-drinking. In the midst of celebrating the arrival of fall, I have been learning how to use two new tools called Tableau Prep (TP) and Open Refine (OR). These two tools do essentially the same thing: allowing users to clean messy data so that it can be run through various programs such as Tableau (a data visualization software not to be confused with Tableau Prep). While TP and OR achieve the same result, they are rather different in ease of use. For this project, I learned how to use both TP and OR to see the pitfalls and the benefits of each program. In this blog post, I’ll be discussing the project that I am working on, how I used these tools in that project, and compare TP and OR side by side.
In the previous paragraph, I mentioned ‘messy’ data, which is an odd term to those who aren’t well-versed in using digital tools. ‘Messy’ data is data that is out of order or has been entered incorrectly. Running this messy data through a software will yield a skewed result, which is of absolutely no use to anyone. Let me give an example. In the 1940s, OSU was a training center for the Women Accepted for Voluntary Emergency Service (WAVES) program. Upon arrival, each WAVE filled out a registration card, which OSU holds in its collection today. There are over 10,000 cards which have been recorded in an Excel sheet. While this is a great resource, I couldn’t just enter data into Tableau, because there were flaws in the way that the data had been entered into the Excel sheet. Each WAVE had written her home state, but the people who entered that information into the Excel sheet had written many of the state names incorrectly. When this messy data is run through Tableau, Tableau creates a unique category for each misspelling. I hope you can these the issues that this creates.
To correct the state names, I uploaded that data into TP and OR. When cleaning data, it’s a big no-no to touch the original data source (in this case the Excel sheet) because it is too easy to mess up and change things that aren’t supposed to be changed. Moreover, the WAVES data had over 10,000 entries, so combing through each entry and cleaning it in Excel would be time-consuming and ineffective. Thankfully, TP and OR make it (somewhat) easy to do all the data cleaning that needs to get done! TP gathered all the state names and I corrected each individual spelling. I spent several hours on that before discovering that there was a way to group the misspelled states correct them all at once. After a fair amount of banging my head against the table, I decided to chalk that one up to experience. In retrospect, I should have known that there was an easier way to rename the states, as these softwares are all about making your life easier. I was not as ignorant when I cleaned the data in OR, and the process went more quickly.
But data cleaning isn’t just about labeling everything correctly, data cleaning is also about knowing what you want the end result to look like and changing the way the data is formatted to achieve that result. When starting the project, I knew that I wanted to be able to create a visualization of these women’s home states. My data included the not just the states but also the towns that these women were from. These two pieces of data were connected, and if I ran the data as it was through Tableau, I would have gotten over 10,000 unique categories, which would not have been terribly practical to show or use. To achieve the result that I wanted, I needed to separate the town and state names from each other. The technical term for this is separating the fields. After separating the fields, I was able to load the data into Tableau and create some neat visualizations!
TP and OR served their purpose and helped me create the project that I wanted to. I began with TP, which comes with around an hour’s worth of training videos. I’ve heard before that using Tableau is easy…if you know how to use it. And therein lies the issue. TP is not terribly intuitive, making it difficult to use without re-watching the videos. When using TP, I operated on the most basic level. When correcting the data, I found TP slow, however, I am working on a Mac laptop, which is probably not the most compatible with that software. Uploading the data from TP to Tableau proved yet another challenge. Not once, but twice, I exported my cleaned data and uploaded it into Tableau, only to find that it had uploaded my messy data! You would think that exporting and uploading from these two programs would be rather easy, considering that they were created by the same team, but that was not the case. While I eventually uploaded the correct data, it involved quite a bit more banging of my head against my desk.
OR proved much easier to contend with. While you download both TP and OR, OR is run in the computer’s web browser. I did not see any training videos for OR, but it proved to be much more intuitive that TP and I was able use it within a few minutes. When correcting state names, OR was efficient, fast-paced, and user friendly. Moreover, I had zero issues uploading my cleaned data to Tableau. Overall, OR was easy, intuitive, and effective. In future project, I would much rather use this software.
- Had to have training to be able to use
- Difficult to use
- More sophisticated
- Did not load into Tableau easily
- Did not need training
- Faster than Tableau Prep
- Easy to use/simple
- Loaded into Tableau easily
I hope that this blog post shows the benefits and downfalls of each software, and piqued your interest in how data is recorded and used. Till next time!