Essential Resources: Tools for collecting and handling data

This is part of a series of posts to share with readers a useful collection of some of the most important, effective and practical data visualisation-related resources. This post presents a collection of useful tools, resources and references for gathering, cleaning and preparing your data for analysis and design.

Please note, I may not have personally used all the packages or tools presented but have seen sufficient evidence of their value from other sources. Whilst some inclusions may be contentious from a quality/best-practice perspective, they may still offer useful features and value to a certain audience. Finally, to avoid re-inventing the wheel, descriptive text may have been reproduced from the native websites if they provide the most articulate descriptions. Your feedback is most welcome to help curate this collection, keep it up to date and preserve its claim to be an essential list of resources!

Collecting and Scraping data


Typeform is a nimble, fast & surprisingly sexy way to ask questions to your users, customers & peers, on any device.



Google Docs

Google Docs offers two key functions to let you import/scrape data tables from websites: ImportHTML and ImportXML.
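
If you would rather script this kind of table import yourself, the idea can be sketched in a few lines of Python using only the standard library. This is an illustration of the concept, not how Google's functions work internally; the `TableExtractor` class, the `extract_first_table` helper and the sample HTML are all made up for this example, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the cell text of every row in the first <table> of a page."""

    def __init__(self):
        super().__init__()
        self.rows = []        # finished rows, each a list of cell strings
        self._row = None      # row being built, or None outside a <tr>
        self._cell = None     # text fragments of the current cell
        self._tables_seen = 0
        self._in_first_table = False

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._tables_seen += 1
            self._in_first_table = self._tables_seen == 1
        elif not self._in_first_table:
            return
        elif tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if tag == "table":
            self._in_first_table = False
        elif not self._in_first_table:
            return
        elif tag in ("td", "th"):
            if self._cell is not None and self._row is not None:
                self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._cell is not None:
            self._cell.append(data)


def extract_first_table(html):
    """Return the first table in the HTML as a list of rows."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


# Made-up page fragment standing in for a downloaded web page.
SAMPLE_HTML = """
<table>
  <tr><th>Item</th><th>Count</th></tr>
  <tr><td>Apples</td><td>3</td></tr>
  <tr><td>Pears</td><td>5</td></tr>
</table>
"""

rows = extract_first_table(SAMPLE_HTML)
for row in rows:
    print(row)
```

In practice you would fetch the page first and pass its text to `extract_first_table`; real-world pages are often messier than this sample, which is where the tutorials below come in.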


See also: OUseful Tutorial, Online Journalism Blog Tutorial 1, Online Journalism Blog Tutorial 2, Distilled Tutorial, School of Data Tutorial

This tool makes gathering data from the web as easy as copy & paste. Whether you’re a programmer, analyst, or just want to be informed, we make it simple to extract the data you need.




Liberate your data with ScraperWiki! ScraperWiki helps you do data science on the web. Get, clean, analyse, visualise and manage your data, with simple tools or custom-written code.




Python is a powerful, versatile and increasingly common programming language usually deployed as an automation tool on the data handling side of visualisation projects (e.g. scraping data, parsing it, formatting it).
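
To give a flavour of the parsing and formatting side, here is a minimal sketch using only Python's standard library. The helper names `parse_number` and `rows_to_csv` and the sample values are made up for illustration; a real project would need rules suited to its own data.

```python
import csv
import io


def parse_number(text):
    """Turn a string such as ' 1,250 ' into an int; return None if it is not numeric."""
    cleaned = text.strip().replace(",", "")
    try:
        return int(cleaned)
    except ValueError:
        return None


def rows_to_csv(header, rows):
    """Format a header and a list of rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up values standing in for scraped table cells.
raw_rows = [["Apples", " 1,250 "], ["Pears", "n/a"], ["Plums", "40"]]

# Parse: tidy the labels and convert the counts to numbers where possible.
parsed = [[name.strip(), parse_number(count)] for name, count in raw_rows]

# Format: write the cleaned rows out as CSV (missing values become empty cells).
csv_text = rows_to_csv(["item", "count"], parsed)
print(csv_text)
```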


See also: PyTables



OutWit Hub explores the depths of the Web for you, automatically collecting and organizing data and media from online sources. OutWit Hub breaks down Web pages into their different constituents. Navigating from page to page automatically, it extracts information elements and organizes them into usable collections.


See also: Poynter tutorial



GraphClick is graph digitizer software that lets you automatically retrieve the original (x,y) data from the image of a scanned graph or from a QuickTime movie. You have the picture of a graph but not the corresponding data? You want to retrieve the trajectory of an object from a QuickTime movie? GraphClick is then simply the best way to solve the problem! You just have to click on the image and the coordinates of the points obtained can be exported directly into any other application.




Pipes is a powerful composition tool to aggregate, manipulate, and mash up content from around the web. Like Unix pipes, simple commands can be combined to create output that meets your needs, such as combining many feeds into one, then sorting, filtering and translating them.


See also: Day Barr Tutorial, Video


PDF Extraction


If you’ve ever tried to do anything with data provided to you in PDFs, you know how painful this is: you can’t easily copy-and-paste rows of data out of PDF files. Tabula allows you to extract that data in CSV format, through a simple interface. And now you can download Tabula and run it on your own computer. Tabula is a development from ProPublica, La Nación DATA and Knight-Mozilla OpenNews.




Need image (scanned) PDF conversion to Excel, Word, and PowerPoint? Able2Extract Professional combines leading edge technology with our proprietary PDF conversion algorithm to deliver high quality conversions every time. This is great for people working with paper documents and wanting to access them electronically.




PDF to Excel allows you to easily convert PDF files to Excel, CSV and More. It is easy to use, accurate, fast and facilitates advanced editing & tweaking.



Nitro Pro 8

Amongst many other features, Nitro Pro 8 lets you easily reuse and repurpose text, images, or entire documents, with tools to accurately convert and extract PDF files and their content.



Cleaning/preparing data


Wrangler is an interactive tool for data cleaning and transformation. Spend less time formatting and more time analyzing your data.



Open Refine

Formerly known as Google Refine, Open Refine is a power tool for working with messy data, cleaning it up, transforming it from one format into another, extending it with web services, and linking it to databases like Freebase.
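
One common clean-up task with messy data is spotting entries that differ only in case, punctuation, accents or word order. The sketch below groups such entries by a normalised key. It is a simplified illustration of the general idea, not Open Refine's own code; the function names and sample company names are made up.

```python
import string
import unicodedata


def fingerprint(value):
    """Build a normalised key so that near-duplicate spellings collide."""
    # Strip accents by decomposing characters and dropping combining marks.
    decomposed = unicodedata.normalize("NFKD", value)
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Lowercase and drop ASCII punctuation.
    lowered = unaccented.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    # Sort the unique words so that word order does not matter.
    tokens = sorted(set(no_punct.split()))
    return " ".join(tokens)


def cluster(values):
    """Group values that share a fingerprint; keep only groups with duplicates."""
    groups = {}
    for value in values:
        groups.setdefault(fingerprint(value), []).append(value)
    return [group for group in groups.values() if len(group) > 1]


# Made-up entries of the kind found in a hand-typed spreadsheet column.
names = ["Acme Widgets Ltd.", "acme widgets ltd", "Widgets, Acme Ltd", "Beta Co"]

for group in cluster(names):
    print(group)
```

A key like this is deliberately blunt: it will merge anything that reduces to the same set of words, so the resulting groups are candidates to review by eye rather than automatic corrections.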


See also: Transition to Open Refine on GitHub


Mr Data Converter

Mr Data Converter is a very able fellow, who promises to convert your Excel data into one of several web-friendly formats, including HTML, JSON and XML.
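
The same kind of conversion can be scripted if you need to repeat it often. Here is a minimal Python sketch that turns tab-separated text with a header row into JSON; it assumes tab-delimited input (what you typically get when copying cells from a spreadsheet), and the `table_to_json` helper and sample data are made up for illustration rather than taken from Mr Data Converter.

```python
import csv
import io
import json


def table_to_json(text, delimiter="\t"):
    """Convert delimited text with a header row into a JSON array of objects."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return json.dumps(list(reader), indent=2)


# Made-up tab-separated text, as you might get by copying cells from a spreadsheet.
pasted = "item\tcount\nApples\t3\nPears\t5\n"

json_text = table_to_json(pasted)
print(json_text)
```

Note that every value comes through as a string; converting numbers or dates to proper types would be a further step.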



Working with qualitative data


Leximancer enables you to navigate the complexity of text in a uniquely automated fashion. Our software identifies ‘Concepts’ within the text – not merely keywords but focused clusters of related, defining terms as conceptualised by the Author. Not according to a predefined dictionary or thesaurus. Leximancer embraces the complexity of language allowing the true meaning to emerge from the text itself, without human bias – in minutes!




Lexalytics turns unstructured text into structured data, telling you “who” is being discussed, “what” is the context of the conversation, and is it positive or negative – so that you can look for trends, send alerts, perform predictive analysis as part of your BI system, and more. Lexalytics builds a multi-lingual text analytics engine, Salience, that immediately provides excellent results with the ability to tune and customize as deeply as you desire.


See also: Demos



Welcome to the online text analysis tool, which gives you detailed statistics of your text. It is perfect for translators (quoting), for webmasters (ranking) or for ordinary users who want to know the subject of a text. New features include the analysis of word groups, keyword density and the prominence of words or expressions.



Mr People

Mr People is a name converter/cleaner/standardiser, developed by Matt Ericson.




NVivo is software that supports qualitative and mixed methods research. It lets you collect, organize and analyze content from interviews, focus group discussions, surveys, audio – and now in NVivo 10 – social media data, YouTube videos and web pages.


See also: Demo



Bamboo DiRT

Bamboo DiRT is a registry of digital research tools for scholarly use. Developed by Project Bamboo, Bamboo DiRT makes it easy for digital humanists and others conducting digital research to find and compare resources ranging from content management systems to music OCR, statistical analysis packages to mindmapping software.



Data Science Toolkit

The Data Science Toolkit provides a range of open-source tools for data scientists assembled by Pete Warden.



DMI Tool Database

A collection of specialist data gathering, handling and manipulating tools and utilities from the Digital Methods Initiative, reworking methods for Internet research since 2007.



Q Research software

Q has all the tools and state-of-the-art techniques to quickly extract maximum insight from your surveys.




Query Tree

QueryTree lets you explore your data yourself with its easy-to-use drag and drop tools for exploring, analysing and visualizing data. There’s no code or formulas to write and it runs in your browser so there’s no software to install.



Office Reports

OfficeReports turns Microsoft Office® into a complete data analysis and reporting suite for surveys. Forget about switching between different analysis tools – now you can get Office to do all of it!



Mechanical Turk

A useful resource to consider if you have manual data tasks that need accomplishing and a bit of spare budget to outsource them. “Mechanical Turk is a marketplace for work. We give businesses and developers access to an on-demand, scalable workforce. Workers select from thousands of tasks and work whenever it’s convenient.”




PANDA is a tool for journalists to manage data within the newsroom. First and foremost PANDA is a “data library”, which means that it stores all the data you work with–voter registration records, police reports, water testing results, etc. When you upload your data to PANDA it is stored safely away so that it can be easily found again, either by yourself or by another reporter in your organization. PANDA is also a search engine. By uploading your datasets to PANDA you make them searchable by everyone in your organization. This search feature is designed to work like Google, so you don’t need to learn a new way of exploring the data.



Map Vectorizer

This project aims to automate a manual process: geographic polygon and attribute data extraction from maps including those from insurance atlases published in the 19th and early 20th centuries.




Databin helps you to share tabular data — a few rows from Excel or a result from a SQL prompt — with others.



‘ProPublica’ Scraping for Journalism

ProPublica have written a series of how-to guides explaining how they collected a sample dataset using a range of techniques. These recipes may be most helpful to journalists who are trying to learn programming and already know the basics. If you’re already an experienced programmer, you might learn about a new library or tool you haven’t tried yet.



eBook: Scraping for Journalists

Scraping for Journalists introduces you to a range of scraping techniques – from very simple scraping techniques which are no more complicated than a spreadsheet formula, to more complex challenges such as scraping databases or hundreds of documents. At every stage you’ll see results – but you’ll also be building towards more ambitious and powerful tools.

