Lately I’ve been thinking about starting a blog about my experience with programming tools and security. I guess today is D-Day!
I am doing a research project on knowledge extraction from the Twitter Streaming API. The goal is to analyze what people talk about in a few categories; here I’m using sports, politics, fashion, etc. (you can pick your own if you like). Here is what I’m using:
1. Python Programming Language
2. Tweepy Twitter Wrapper for Python
3. MongoDB (there are others; check them out here: http://nosql-database.org/)
4. Coke/Tea & Patience 🙂
Why MongoDB, a NoSQL database, and not a traditional database?
Simple: tweets arrive from the stream every second, asynchronously, and if you try to insert them into a traditional database like MySQL as they arrive, you start dropping tweets, which is not going to work for this project. MongoDB is a scalable, high-performance, open source, document-oriented database written in C++. Go to the MongoDB site to learn more.
There are some posts about inserting tweets into MySQL using PHP, but I didn’t try them since they were written before the new Streaming API was launched. If you get MySQL or any other SQL database working, please feel free to add it here. I know you can store the tweets in a SQL database if you want, e.g. by sending them to a message queue and then processing them however you like. You might also want to keep the raw JSON data you receive from the Twitter stream and process it later. I am too impatient to do all of that for this project, which focuses on knowledge extraction rather than on the architecture for collecting data.
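To make the message queue idea a bit more concrete, here is a minimal sketch, assuming Python 3 and using an in-process queue.Queue as a stand-in for a real message queue. The function names, the sample payloads, and the handle callback are all made up for illustration; this is not what the project itself uses.

```python
import json
import queue
import threading

def consume(raw_tweets, handle):
    """Pull raw JSON strings off the queue and pass parsed tweets to handle()."""
    while True:
        raw = raw_tweets.get()
        if raw is None:          # sentinel: the producer has finished
            break
        try:
            tweet = json.loads(raw)
        except ValueError:       # skip malformed or partial payloads
            continue
        handle(tweet)

def run_pipeline(raw_lines, handle):
    """Feed raw_lines through a queue to a worker thread; returns once drained."""
    raw_tweets = queue.Queue()
    worker = threading.Thread(target=consume, args=(raw_tweets, handle))
    worker.start()
    for line in raw_lines:       # in a real collector this would be the stream callback
        raw_tweets.put(line)
    raw_tweets.put(None)         # tell the worker there is nothing more to read
    worker.join()

# Example with two made-up payloads and one broken line.
stored = []
run_pipeline(
    ['{"id": 1, "text": "great match tonight"}',
     'not json',
     '{"id": 2, "text": "election results are in"}'],
    stored.append,
)
```

The point of the split is that the side reading the stream only does a cheap put(), so a slow database insert in handle() does not hold up the reader. Here handle is just stored.append; in practice it could be whatever writes to your database.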
Installing the Prerequisites:
You can download Python from http://www.python.org; I’m using Python 2.6 for this. Don’t forget to add the Python installation directory to your Windows PATH, or the easy_install commands that follow will not work.
Once Python is installed, install the setuptools package, which provides easy_install, and then install Tweepy from the command line by running easy_install tweepy.
Download MongoDB for your preferred OS and extract it. (I’m using Windows 7 64-bit, so the steps here are for Windows.) Then go to the MongoDB folder and start the server from the command line by running mongod.
Make sure you create the data directory first. The default data path is /data/db (\data\db on the current drive on Windows), so on Windows you will usually want to set the path yourself with the --dbpath option.
This should fire up the MongoDB server. You can check the connection by running the mongo shell from another terminal.
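As a rough alternative to checking by hand, the same check can be sketched in Python with nothing but the standard library. This assumes Python 3 and that the server listens on MongoDB's default port 27017; the function name is made up for illustration. It only tests that something accepts TCP connections on that port, not that it is really MongoDB.

```python
import socket

def mongo_is_reachable(host="localhost", port=27017, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        # create_connection raises OSError if nothing is listening or it times out
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling mongo_is_reachable() right after starting the server is a quick way to see whether it came up before moving on.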
I hope everything is running smoothly at this point. Now it’s time to install PyMongo, the MongoDB driver package for Python. Just run easy_install pymongo from the command line.
Now you can connect to your Mongo database from a Python application.
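Here is a minimal sketch of what that connection might look like, assuming Python 3 and a recent PyMongo (MongoClient and insert_one; the older PyMongo releases that go with Python 2.6 used different names). The database name "twitter", the collection name "tweets", the stored field names, and the helper functions are all assumptions for illustration.

```python
import json

def tweet_to_document(raw_json):
    """Parse a raw tweet and keep a few fields; the field names are illustrative."""
    tweet = json.loads(raw_json)
    return {
        "tweet_id": tweet.get("id"),
        "text": tweet.get("text", ""),
        "raw": tweet,            # keep the full payload for later analysis
    }

def store_tweet(collection, raw_json):
    """Insert one tweet into a PyMongo-style collection and return the document."""
    document = tweet_to_document(raw_json)
    collection.insert_one(document)
    return document

def open_collection(db_name="twitter", collection_name="tweets"):
    """Connect to a local mongod on the default port (needs PyMongo installed)."""
    from pymongo import MongoClient   # imported lazily so the helpers above work without it
    client = MongoClient("localhost", 27017)
    return client[db_name][collection_name]
```

With the server running, tweets = open_collection() followed by store_tweet(tweets, raw_json) would insert one document, and tweets.count_documents({}) should then report how many are stored. Keeping the whole payload under "raw" is one way to postpone decisions about which fields matter.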
to be continued …