Useful free datasets (Part 2)

Other sites also offer a great variety of free datasets:

Yahoo! Labs – Collection of datasets related to language, social, marketing and more. They’re well organized and most of them are hundreds of megabytes in size.

Awesome Public Datasets – A GitHub repository that lists publicly available datasets, organized by category.

Gapminder – Hundreds of datasets on world health, economics, population, etc. All of it is viewable online within Google Docs, and downloadable as spreadsheets.

The Info – Mostly large datasets. The site is losing momentum, but the data available here is still gold.

The Data Hub – Hosted by CKAN. Most of these datasets come from the government.

Datamob – List of public datasets.

Numbrary – Lists of datasets.

Kaggle – Kaggle is a site that hosts data mining competitions. Each competition provides a data set that’s free for download.

SNAP – Stanford’s Large Network Dataset Collection. This list has several datasets related to social networking. Lots of fun in here!

More available datasets at:

Useful free datasets (Part 1)

Here is a list of datasets that are free to download:


American Economic Association (AEA):
World Bank:

Data Science Practice

This section contains datasets used in the book “Doing Data Science” by Rachel Schutt and Cathy O’Neil (O’Reilly, 2014).
Datasets on the book site:
Enron Email Dataset:
GetGlue (timestamped events: users rating TV shows):
Titanic Survival Data Set:
Half a million Hubway rides:


CBOE Futures Exchange:
Google Finance: (R)
Google Trends:
St. Louis Fed: (R)
Yahoo Finance: (R)

To view more go to:



LikeAlyzer: Free Social Media Tool for Facebook

LikeAlyzer is an online tool for companies that want to be successful on Facebook. It helps you measure and analyze the potential and effectiveness of your Facebook Pages.


  • It provides daily updated Facebook statistics for your company or other Pages of interest.
  • It enables you to monitor and compare your efforts with those of the world’s popular brands or relevant companies, such as competitors.


Graph Visualization with Gephi

Gephi is an interactive visualization and exploration solution that supports dynamic and hierarchical graphs. It runs on Windows, Linux and Mac OS X. Gephi is open-source and free.


The goal is to help data analysts form hypotheses, intuitively discover patterns, and isolate structural singularities or faults during data sourcing. It is a complementary tool to traditional statistics, as visual thinking with interactive interfaces is now recognized to facilitate reasoning.

  • Real time visualization
  • Layout algorithms (force-based and multi-level)
  • Metrics (Betweenness, Closeness, Diameter, Clustering Coefficient, Average shortest path, PageRank, HITS, Community Detection, Random Generators)
  • Dynamic Network Analysis
  • Cartography creation
  • Clustering and hierarchical graphs
  • Dynamic Filtering
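
To make a few of the metrics in the list above concrete, here is a minimal sketch in plain Python. It is not Gephi's implementation (Gephi is an interactive application); the toy graph and the function names are made up for illustration, and the graph is assumed to be undirected, unweighted and connected.

```python
from collections import deque

# Hypothetical toy graph (undirected), stored as an adjacency dict.
GRAPH = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}


def bfs_distances(graph, source):
    """Return the hop count from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def closeness(graph, node):
    """(n - 1) divided by the sum of shortest-path distances from node."""
    total = sum(bfs_distances(graph, node).values())
    return (len(graph) - 1) / total if total else 0.0


def diameter(graph):
    """Longest shortest path between any two nodes."""
    return max(max(bfs_distances(graph, n).values()) for n in graph)


def average_shortest_path(graph):
    """Mean shortest-path length over all ordered pairs of distinct nodes."""
    n = len(graph)
    total = sum(sum(bfs_distances(graph, s).values()) for s in graph)
    return total / (n * (n - 1))


def clustering_coefficient(graph, node):
    """Fraction of pairs of neighbors of node that are themselves linked."""
    neighbors = list(graph[node])
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if neighbors[j] in graph[neighbors[i]]
    )
    return 2 * links / (k * (k - 1))


if __name__ == "__main__":
    print("diameter:", diameter(GRAPH))
    print("average shortest path:", average_shortest_path(GRAPH))
    for name in sorted(GRAPH):
        print(name, closeness(GRAPH, name), clustering_coefficient(GRAPH, name))
```

In this toy graph, node "c" touches every other node directly, so it gets the highest closeness, while "d" hangs off a single edge and has a clustering coefficient of zero. On real networks Gephi computes these statistics for you and lets you map them to node size or color.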



To learn more about it, go to


How artificial intelligence is transforming the financial industry

By Michelle Fleury, BBC business correspondent, New York


Your next stockbroker might just be a computer.

More and more, financial firms are turning to machines to do the job humans have done for decades.

Last spring, wealth management firm Charles Schwab launched a new service called Schwab Intelligent Portfolios. The service is unique in that it’s not a person who decides where to invest your money, it’s an algorithm – lines of code programmed into a computer.

“It’s lower cost for the investor,” says Tobin McDaniel, who leads the Schwab Intelligent Portfolios team.

“As opposed to working with a traditional advisor where you might pay up to 1%, here you get portfolio management at essentially no management fee.”



To learn more: How artificial intelligence is transforming the financial industry

The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data

Reading for this week: The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data (Feldman & Sanger, 2006)




Text mining tries to solve the crisis of information overload by combining techniques from data mining, machine learning, natural language processing, information retrieval, and knowledge management. In addition to providing an in-depth examination of core text mining and link detection algorithms and operations, this book examines advanced pre-processing techniques, knowledge representation considerations, and visualization approaches. Finally, it explores current real-world, mission-critical applications of text mining and link detection in such varied fields as M&A business intelligence, genomics research and counter-terrorism activities.
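
As a rough illustration of what a basic pre-processing step can look like, the sketch below tokenizes a string, drops a few stop words and counts term frequencies. It is not taken from the book; the stop-word list, the sample sentence and the function names are assumptions chosen for the example.

```python
import re
from collections import Counter

# Illustrative stop-word list (an assumption, not a standard one).
STOP_WORDS = {"the", "a", "of", "and", "to", "in", "is"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def term_frequencies(text):
    """Count how often each non-stop-word token occurs in the text."""
    return Counter(t for t in tokenize(text) if t not in STOP_WORDS)


if __name__ == "__main__":
    sample = "The mining of text and the mining of data"
    for term, count in term_frequencies(sample).most_common():
        print(term, count)
```

Counts like these are a typical starting point for the document representations that text mining algorithms then work on.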