
A replica set involves at least three MongoDB instances, so that the members can hold an election and any one server can take over as master if the current one fails. The first step, then, is to install MongoDB on three servers in your network, which also gives you redundancy. You can install it easily by following the instructions on the MongoDB documentation site. In this post we will deploy a MongoDB replica set in a Linux environment.

Once the installation is complete, a few configuration tweaks are needed before you start the databases. First, generate a key file on your master instance for authentication, so that the three mongod instances can talk to each other securely:

openssl rand -base64 741 > /etc/mongo.key

Copy the key file to the same location on the other two instances, and restrict its permissions with chmod 600 /etc/mongo.key on each server, since mongod refuses a key file that is group- or world-readable.

Now add the following to /etc/mongo.conf on all three instances:

keyFile=/etc/mongo.key
replSet=rs0 # a unique name for your set
rest=true # optional: enables the REST interface for viewing your mongod's status

Note: auth=true only works for single-instance authentication; keyFile is required for authentication in replica set or sharding setups.

Now start the mongod process on each server:

service mongod start

Then start an interactive mongo shell on the master to configure and initiate the replica set:

mongo
> rs.status()
# will report that no replica set configuration has been found yet
> config = {_id: 'rs0', members: [{_id: 0, host: 'ip:port'}, {_id: 1, host: 'ip:port'}, {_id: 2, host: 'ip:port'}]}
> rs.initiate(config)
# will return a success status
> exit

That's it! You have set up a MongoDB replica set with authentication.
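To check the set from Python, here is a minimal PyMongo sketch (assuming a reasonably recent PyMongo; the host names, database name, and credentials are placeholders for your own setup):

from pymongo import MongoClient

# List any members of the set; the driver discovers the rest.
client = MongoClient('server1:27017,server2:27017,server3:27017',
                     replicaSet='rs0')

db = client.twitter
# Since keyFile turns on auth enforcement, log in if you have created users:
# db.authenticate('user', 'password')
db.tweets.insert({'text': 'hello replica set'})  # writes go to the primary
print db.tweets.count()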

While I was collecting Twitter streaming data I noticed that plenty of tweets are not in English, and those seemed of no use to me since I can't read most other languages. So I had to translate those tweets to English, and for this I used the Google Translation API in Python. Google Translation not only translates but also detects the tweet's language in the process, which has been a great help for my tweet stream mining project.

Now my tweet data goes through the following steps before being saved as a MongoDB document:

1. Connect to the Twitter stream API using pycurl

2. Create a queue to process the Twitter data in another thread

3. Separate English tweets from all other languages using the NLTK English words vocabulary (steps 3, 5 and 6 are sketched in code after this list)

4. Translate the tweet if it is not in English (Google Translation also provides a confidence score for the translation, which can be used to weigh the translated text)

In some cases Google doesn't provide a translation, so those texts are not worth keeping for the moment and I'm removing them. The Google Translation API has a query limit of 4500 characters per query and 100,000 queries per day, so it is advisable to keep the translation calls under that limit each day.

5. Clean the stop words out of tweets using NLTK

6. Tokenize the tweet words using a regular expression.
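Here is a rough sketch of steps 3, 5 and 6, assuming the NLTK words and stopwords corpora are already downloaded; the 0.5 threshold and the token regular expression are placeholders of my own, and the translation call of step 4 is left out since the Google API details have changed over time:

import re

from nltk.corpus import stopwords, words

# Requires nltk.download('words') and nltk.download('stopwords') beforehand.
ENGLISH_VOCAB = set(w.lower() for w in words.words())
STOPWORDS = set(stopwords.words('english'))
TOKEN_RE = re.compile(r"[a-z']+")  # crude tokenizer, a placeholder of my own

def is_english(text, threshold=0.5):
    # Step 3: treat a tweet as English if enough tokens are dictionary words.
    tokens = TOKEN_RE.findall(text.lower())
    if not tokens:
        return False
    hits = sum(1 for t in tokens if t in ENGLISH_VOCAB)
    return float(hits) / len(tokens) >= threshold

def clean_and_tokenize(text):
    # Steps 5 and 6: tokenize with the regular expression, drop stop words.
    return [t for t in TOKEN_RE.findall(text.lower()) if t not in STOPWORDS]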

And that's it so far. I'll be posting updates.

I'm actually running the script while writing this post, and the most recent problem I had was losing all the data in my MongoDB because I wasn't an expert at configuring it. But now I can write with more confidence about what is good practice when setting up your MongoDB. After much trial and error and spending 24 hours on it, I'm actually impressed and excited to find out more about MongoDB.

Since I'll be receiving the Twitter stream at a fast rate, the possibility of write locks dropping data in the process is higher. To overcome this I'll be setting up a master and a slave server in MongoDB. In Mongo we can achieve this either with a replica set or with master-slave replication. I've done both and found master-slave quicker and easier to set up. But if you are running a data-centric server you are better off with a replica set, which mirrors the servers and gets you back up fastest if your master fails.

The following two commands set up your master and slave servers in two separate terminal windows.

Master server:

mongod --rest --master --dbpath c:\data\db

I'm assuming by now you have read the MongoDB manuals and know how to start mongod. By default MongoDB's web interface is turned off; --rest enables it. Don't compare this interface with phpMyAdmin or anything like that: it is simple, yet very informative about your DB server (it listens on port 28017, the server port plus 1000). The --master flag runs the server as the master on your machine; the server listens on port 27017 by default.

Slave server:

mongod --slave --source localhost:27017 --port 27018 --dbpath c:\data\sl

For the slave server you provide the source database server it will replicate from, and the db path where the slave will store its own copy of the data. With the --port flag you specify which port the slave server listens on.

That's it: you have started the master-slave connection for your MongoDB. Now it's time to write some code. The tweepy examples show how to get the Twitter stream into your application; after that, check out the PyMongo documentation for step-by-step instructions on connecting to MongoDB from your Python application.

Once you're done with this you can easily insert your Twitter stream data into MongoDB as it arrives. Enjoy!
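To give an idea of how the pieces fit together, here is a minimal sketch using tweepy's StreamListener together with PyMongo (the credentials and track keywords are placeholders, and tweepy's API has changed across versions, so treat this as an outline rather than a drop-in script):

import json

import tweepy
from pymongo import MongoClient

# Placeholders: fill in your own Twitter API credentials.
auth = tweepy.OAuthHandler('CONSUMER_KEY', 'CONSUMER_SECRET')
auth.set_access_token('ACCESS_TOKEN', 'ACCESS_SECRET')

db = MongoClient('localhost', 27017).twitter  # the master started above

class SaveListener(tweepy.StreamListener):
    def on_data(self, data):
        # Each stream message arrives as raw JSON; store it as a document.
        db.tweets.insert(json.loads(data))
        return True

    def on_error(self, status_code):
        print status_code
        return True  # returning True keeps the stream alive

stream = tweepy.Stream(auth, SaveListener())
stream.filter(track=['sports', 'politics', 'fashion'])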

Lately I've been thinking about starting a blog about my experience with programming tools and security. I guess today is D-DAY!

I am doing a research project on knowledge extraction from the Twitter Stream API. The goal is to analyze what people talk about in a given category; here I'm using sports, politics, fashion, etc. (you can pick your own if you like).

Prerequisites:

1. Python Programming Language

2. Tweepy, a Twitter wrapper for Python

3. MongoDB (there are others, check them out here: http://nosql-database.org/)

4. Coke/Tea & Patience 🙂

Why MongoDB, a NoSQL DB, and not a traditional DB?

Simple: the data comes in as a stream, every second, asynchronously, and while trying to insert it into a traditional DB like MySQL you start dropping tweets, which is not going to work for this project. MongoDB is a scalable, high-performance, open source, document-oriented database, written in C++. Go to the MongoDB site to learn more.

There are some posts about inserting into MySQL using PHP, but I didn't try them since they were written before the new Streaming API was launched. If you get MySQL or any SQL DB working, please feel free to add it here. I know you can store the tweets in a SQL DB if you want, e.g. send them to a message queue and then process them however you like; you might also want to keep the raw JSON data you receive from the Twitter stream and process it later. I'm too impatient to do all that for this project, which focuses on the knowledge extraction rather than the architecture of collecting data.

Installing the Prerequisites:

You can download any version of Python from http://www.python.org; I'm using Python 2.6 for this. Don't forget to add the Python installation directory to your Windows path for any of the following easy_install commands to work.

Once you have installed Python, install the setuptools package so you can install tweepy from the command line:

easy_install tweepy

Download MongoDB for your preferred OS and extract it. (I'm using Windows 7 64-bit; the following command is for Windows.) Then go to the mongodb folder and start the server from the command line:

mongod --dbpath=c:\DB_PATH_WHERE_TO_STORE_THE_DB

Make sure you create the directory first. On Windows you'll have to set the db path yourself, since the default is /data/db.

That should fire up the MongoDB server. You can check the connection by running the client from another terminal:

mongo

I hope everything is running smoothly at this point. Now it's time to install the MongoDB driver package for Python, which is PyMongo. Just type at the command line:

easy_install pymongo

Now you can connect to your Mongo database from a Python application.
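For example, here is a quick sanity check (a minimal sketch; the database and collection names are placeholders of my own):

from pymongo import MongoClient

client = MongoClient('localhost', 27017)  # the mongod we just started
db = client.test
db.things.insert({'hello': 'world'})
print db.things.find_one()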

to be continued …
