Cleaning Data

Digital humanities data is generally messy. Sometimes it looks like the floor of a teenage bedroom. Before we can analyse the data we need to clean it up and organise it. There are many ways of doing this. One important tool to use is a spreadsheet.

In this session I will run through some basic techniques in Excel that are often used for cleaning up data. This session should be useful for those who are starting out in digital humanities.

It would be good if someone who is proficient with OpenRefine (formerly known as Google Refine) could share this session so participants can get a taste for that tool as well.

I’ve put some simple Excel tips in a document which you can access here.

Categories: Session Proposals, Session: Teach

About perkinsy

I am a reformed accountant who moved into public relations, then morphed into historical research. I am interested in the rise of the secular in the nineteenth and early twentieth centuries and the tension between the secular and religious during this period. Currently I'm analysing WWI soldiers' diaries using the transcriptions of the diaries held by the State Library of NSW for a history I am writing about the personal beliefs of soldiers while on the front. The engine of my research is a simple Python program I have written to search the diaries for certain keywords. You can read more about this project on my blog, Stumbling Through the Past. In my digital humanities blog, Stumbling Through the Future, I give simple step-by-step explanations of how to use some useful tools. Most recently I did a series of posts on getting started with the Trove API. My first THATCamp was a great one - Canberra 2011 (see my post). In 2013 I helped to organise a THATCamp connected with the annual conference of the Australian Historical Association in Wollongong - THATCamp #OzHA2013 - and I attended THATCamp in Sydney last year.

3 Responses to Cleaning Data

  1. Luc Small says:

    Hi perkinsky,

    Happy to help out with the Open Refine part.

    Intersect regularly gives a course covering many aspects of the tool – materials are available for anyone to use at www.intersect.org.au/course-resources (scroll about half way down the page). Happy to talk about clustering, faceting, calling into APIs, and other things covered in the course.

    Cheers,
    Luc

    • perkinsy says:

      Thanks Luc, that would be great, and thanks for the link.

      Aside from demonstrating Open Refine basics for those participants who are new to it, I would be interested if you could demonstrate how it can be used with data that is not properly aligned, e.g. some data needs to be shifted to the right a couple of columns.
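To make the misalignment problem concrete, here is a minimal Python sketch of one way to shift such rows, using only the standard `csv` module. It is an illustration, not the Excel or Open Refine method discussed in this thread. The sample rows, the column names, the function names `shift_right` and `realign`, and the rule for spotting a misaligned row (its last two cells are empty) are all assumptions made for the example.

```python
import csv
import io

# Hypothetical sample data: the second data row starts two columns
# too far to the left, leaving its last two cells empty.
SAMPLE = """surname,given_name,date,place
Smith,John,1916-07-01,Pozieres
1917-10-04,Broodseinde,,
"""


def shift_right(row, offset):
    """Return the row with its cells moved `offset` columns to the right."""
    width = len(row)
    # Pad on the left, then trim so the row keeps its original width.
    return ([""] * offset + row)[:width]


def realign(rows, offset=2):
    """Shift rows whose last `offset` cells are all empty (an assumed rule)."""
    fixed = []
    for row in rows:
        if len(row) >= offset and all(cell == "" for cell in row[-offset:]):
            fixed.append(shift_right(row, offset))
        else:
            fixed.append(row)
    return fixed


rows = list(csv.reader(io.StringIO(SAMPLE)))
header, data = rows[0], rows[1:]
cleaned = realign(data)

for row in cleaned:
    print(row)
```

Real data will usually need a more careful test for which rows are misaligned, for example checking whether the first cell looks like a date rather than a surname.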

      I have a .csv file which will be the basis of what I show in Excel, but I’ve had problems getting it to upload properly in Open Refine. Even once some of the data is uploaded, operations in Open Refine take too long. I suspect my data is pushing the limits of Open Refine. Perhaps you could also mention what the limits of Open Refine are?

  2. Pingback: THATCamp Sydney 2013 – Less than 2 days to go! | THATCamp Sydney 2013
