
I have been putting off finishing the Thor application example, so I'll go right to the application; it's nothing complicated. This application doesn't actually tweet to Twitter, it just demonstrates how Thor tasks work.

Picking up where I left off in the last post, add a new folder bin in your application root and create a file named tweet. This is going to be our executable, which will let us do something like:


tweet post hello

and will print You tweeted: hello.

Put the following in your bin/tweet file,


#!/usr/bin/env ruby

require "tweet"

Tweet.start # this registers the thor class as task

Now build and install the gem by executing `rake install` in the application root; it installs the executable as well. That's it. The code can be downloaded from https://github.com/mftaher/Tweet-Thor, and you can use it as a basic structure to create your own executable Thor tasks.

Note: Before you install or build the gem always commit your git changes.


Ruby is a wonderful language; you can never stop praising what it lets you do and how it helps you create great applications. To eliminate the need to copy my initial application structure every time, I decided to write a scaffolding application in Ruby that gives me an initial structure with the necessary files, much like the Rails generator (I use the Sinatra micro web framework for Ruby web development).

I researched many options to accomplish this, and Ruby gives you not only a wide range of choices but also the quickest turnaround time for a command-line application. To name a few: the core OptionParser class is a powerful place to begin, there is also Trollop, of course Rake, and there's Thor, which replaces rake, sake, and rubigen.

A great advantage of Thor is that it documents your command-line tool as you develop, making it easy to create several tasks with plenty of options while documenting them at the same time. That reason alone was good enough for me to build my application with Thor rather than the others.

Instead of doing a hello world application, I'm going to demonstrate a simple tweeting application using Thor. First you'll need a structure for your application, which you can easily get by running:

bundle gem tweet

You will get the default app structure provided by Bundler, which can also be used to publish your own gem. Then create and edit tweet.thor in your project's lib directory:

class Tweet < Thor
  desc "post MESSAGE", "Tweet message in your command line"
  def post(message)
    puts "You tweeted: #{message}"
  end
end

You should be able to see the task list by typing `thor -T`.

Making the command available in your terminal is something we'll cover in the next blog post, when we actually post a tweet.

Switching to Ruby from PHP wasn't as hard as deploying a Rails/Sinatra app on an existing Apache-PHP environment. It should have been fairly easy, since I chose to do it with Phusion Passenger instead of proxying to the Rails/Sinatra app, but who knew how ugly things could get had I not tested locally.

Problem Ughhh @#$@%@:

I actually deployed a Rails/Sinatra app in production first and then tried to configure it on my localhost. That's when I figured out that once the Passenger module is loaded, DirectoryIndex index.php stops working: Apache no longer recognizes the index.php of any web application by default, so you have to type it into the browser. Digging deeper, I found out that mod_dir is not compatible with Phusion Passenger, with no fix offered yet. You might want to check for Apache modules that conflict with Phusion Passenger before even trying to install it in your existing Apache-PHP environment.

Avoid Catastrophe:

The reason I did not notice it in my production environment at first was caching. Once I tried to deploy on my local machine, things weren't working as they were in production: I had to type in index.php to gain access. I checked AddHandler, AddType, and DirectoryIndex index.php; everything was in place and nothing seemed to work until I found the conflicting Apache modules.

Solution:

There's a workaround provided by the Passenger team: PassengerEnabled off. After the Passenger module is loaded you can turn it off so that mod_dir can do its job of setting the correct DirectoryIndex, and then enable it where Phusion Passenger is required, preferably inside the Rails/Sinatra app's <Directory> block. If you put it outside, it may apply globally, which will again keep mod_dir from working. Only when an agent requests the Rails/Sinatra app does Passenger get turned on for that Directory block, and there's no conflict anymore. A good example is provided in the Phusion Passenger Guide. This setting can also be done the other way around, depending on how many Rails/Sinatra and PHP applications you have running in your environment.
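A minimal sketch of how that can look (the server name and paths are illustrative, not from my setup): Passenger is disabled globally so mod_dir keeps resolving DirectoryIndex for the PHP apps, then re-enabled only inside the Rack app's Directory block.

```apache
# Disable Passenger globally so mod_dir can keep resolving
# DirectoryIndex index.php for the existing PHP applications.
PassengerEnabled off

# Re-enable Passenger only where the Rails/Sinatra app lives.
# myapp.example.com and /var/www/myapp are illustrative names.
<VirtualHost *:80>
    ServerName myapp.example.com
    DocumentRoot /var/www/myapp/public
    <Directory /var/www/myapp/public>
        PassengerEnabled on
        AllowOverride all
    </Directory>
</VirtualHost>
```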

Still waiting for something else to go wrong, just hoping it happens on my local machine earlier.

Local Environment:

Mac OS X Lion 10.7.2, built-in Apache 2, PHP 5.3.6, RVM 1.10.0, Ruby 1.9.2, Passenger 3.0.11

Google very recently released its +1 button similar to the Facebook Like button. It’s part of the Google+ project which seems to take on Facebook and Skype at the same time with social networking and online audio/video chat feature.

To add the Google +1 button to your website, you have to add the following script to your HTML:

<script type="text/javascript" src="https://apis.google.com/js/plusone.js"></script>

By default the included script will walk the DOM and render any +1 tags. The syntax is as follows:

 <g:plusone size="standard" count="true"></g:plusone>

That's about it! If you want to know more details, you can visit the link: The Google +1

While I was collecting Twitter streaming data, I noticed plenty of tweets that are not in English, which seemed of no use to me since I can't read most other languages. So I had to translate those tweets to English, and for this I have used the Google Translation API in Python. Google Translation not only translates but also detects the tweet's language in the process. This has been a great help for my tweet stream mining project.

Now my tweet data goes through the following process before being saved as a MongoDB document:

1. Connect to the Twitter streaming API using pycurl

2. Create a queue to process the Twitter data in another thread

3. Differentiate English tweets from all other languages using NLTK's English words vocabulary

4. Translate the tweet if it is not in English (Google Translation also provides a confidence score for the translation, which can be used to weight the translated text)

In some cases Google doesn't provide a translation, so some texts are not worth keeping for the moment and I'm removing them. The Google Translation API has query limits of 4,500 characters per query and 100,000 queries per day, so it is advisable to keep the translation calls under those limits each day.

5. Clean the stop words in the tweets using NLTK

6. Tokenize the tweet words using a regular expression.

And that's it so far. I'll be posting updates.
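The English-detection, stop-word, and tokenization steps above can be sketched roughly as follows. This is a minimal sketch, not my actual script: the small inline word sets stand in for NLTK's words and stopwords corpora, and the translation step is left as a placeholder rather than a real Google Translation API call.

```python
import re

# Stand-in vocabulary; the real pipeline uses NLTK's English words
# corpus (nltk.corpus.words) and its stopwords list instead.
ENGLISH_WORDS = {"the", "is", "a", "hello", "world", "good", "morning"}
STOP_WORDS = {"the", "is", "a", "an", "of", "to"}

TOKEN_RE = re.compile(r"[a-z']+")

def tokenize(text):
    # Step 6: split the tweet into lowercase word tokens.
    return TOKEN_RE.findall(text.lower())

def is_english(text, threshold=0.5):
    # Step 3: treat the tweet as English if enough of its tokens
    # appear in the English vocabulary.
    tokens = tokenize(text)
    if not tokens:
        return False
    known = sum(1 for t in tokens if t in ENGLISH_WORDS)
    return known / float(len(tokens)) >= threshold

def remove_stop_words(tokens):
    # Step 5: drop stop words before further mining.
    return [t for t in tokens if t not in STOP_WORDS]

def process(text):
    if not is_english(text):
        # Step 4 would call the translation API here; tweets that
        # cannot be translated are dropped, as described above.
        return None
    return remove_stop_words(tokenize(text))
```

Wiring this into the queue-consuming thread (step 2) is then just a matter of calling process() on each dequeued tweet and saving the non-None results.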

I'm actually running the script while writing this blog, and the most recent problem I had was losing all the data in my MongoDB, since I wasn't an expert at configuring it. But now I can write with more confidence about what's good to do while setting up your MongoDB. After much trial and error and spending 24 hours on it, I'm actually impressed and excited to find out more about MongoDB.

Since I'll be receiving the Twitter stream at a fast rate, the possibility of write locks dropping data in the process is higher. To overcome this I'll be setting up a master-slave configuration in MongoDB. In Mongo we can achieve this either with a replica set or with master-slave replication. I've done both and found master-slave quicker and easier to set up. But if you are running a data-centric server, you should run a replica set instead, which lets you mirror the servers and gives the fastest recovery if your master fails.

The following two commands set up your master and slave servers in two separate terminal windows.

Master server:

mongod --rest --master --dbpath c:\data\db

I'm assuming by now you have read the MongoDB manuals and know how to start up Mongo. By default MongoDB's web interface is turned off; with --rest you enable it. Don't compare this interface with phpMyAdmin or anything like that; it's a simple yet very informative interface for learning about your DB server. The --master option runs the server as the master on your machine, and the server listens on port 27017 by default.

Slave server:

mongod --slave --source localhost:27017 --port 27018 --dbpath c:\data\sl

For the slave server you'll have to provide the source database server it will be a slave of, and the db path where the slave server will save its own data. With the --port option you specify which port the slave server will listen on.

That's it, you have started the master-slave setup for your MongoDB. Now it's time to write some code. From the tweepy examples you can use the code to get the Twitter stream into your application; after that, check out the pymongo documentation for step-by-step instructions on connecting to MongoDB from your Python application.

Once you are done with this, you can easily insert your Twitter stream data into MongoDB as it comes in. Enjoy!
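To sketch how the pieces fit together, here is a rough outline of a stream listener that writes incoming tweets into the master. This is an assumption-laden sketch, not code from this post: it uses tweepy's then-current BasicAuthHandler/StreamListener interface and pymongo's Connection class, and tweetdb is just an example database name.

```python
import json

def tweet_to_doc(raw):
    # Keep only the fields we care about from the raw streaming JSON;
    # the field names follow Twitter's streaming payload.
    data = json.loads(raw)
    return {
        "text": data.get("text"),
        "user": data.get("user", {}).get("screen_name"),
        "created_at": data.get("created_at"),
    }

def run_stream(username, password):
    # Not executed here: needs a running master (see the mongod
    # commands above), plus the tweepy and pymongo packages.
    import tweepy
    from pymongo import Connection  # pymongo's entry point at the time

    tweets = Connection("localhost", 27017).tweetdb.tweets

    class MongoListener(tweepy.StreamListener):
        def on_data(self, data):
            # Insert each raw tweet as a document; the slave started
            # above replicates it automatically.
            tweets.insert(tweet_to_doc(data))
            return True  # keep the stream open

    stream = tweepy.Stream(tweepy.BasicAuthHandler(username, password),
                           MongoListener())
    stream.sample()  # read from the sample stream endpoint
```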

Lately I've been thinking about starting a blog on my experience with programming tools and security. I guess today is D-Day!

I am doing a research project on knowledge extraction from the Twitter Streaming API. The goal is to analyze what people talk about in certain categories; here I'm using sports, politics, fashion, etc. (you can pick your own if you like).

Prerequisites:

1. Python Programming Language

2. Tweepy Twitter Wrapper for Python

3. MongoDB (there are others checkout here: http://nosql-database.org/)

4. Coke/Tea & Patience 🙂

Why MongoDB, a noSQL DB and not traditional DB?

Simple: data comes in from the stream every second, asynchronously, and while trying to insert into a traditional DB like MySQL you start dropping tweets, which is not going to work for the project. MongoDB is a scalable, high-performance, open source, document-oriented database written in C++. Go to the MongoDB site to learn more.

There are some posts about inserting into MySQL using PHP, but I didn't try them since they were written before the new Streaming API was launched. If you get MySQL or any SQL DB working, please feel free to add it here. I know you can store the data in a SQL DB if you want, e.g. send the tweets to a message queue and then process them however you like, or keep the raw JSON data you receive from the Twitter stream and process it later. I'm too impatient to do all that for my project, which focuses on the knowledge extraction rather than the architecture of collecting data.

Installing the Prerequisites:

You can download any version of Python from http://www.python.org; I'm using Python 2.6 for this. Don't forget to add the Python installation directory to your Windows path for any of the following easy_install commands to work.

Once you install Python, you can use the setuptools package to install tweepy from the command line:

easy_install tweepy

Download MongoDB for your preferred OS and then extract it (I'm using Windows 7 64-bit; the following command is for Windows). Then go to the MongoDB folder and start the server from the command line:

mongod --dbpath=c:\DB_PATH_WHERE_TO_STORE_THE_DB

Make sure you create the directory first. On Windows you'll have to set the db path yourself, since the default is /data/db.

The command should fire up the MongoDB server. You can check the connection by running, from another terminal:

mongo

I hope at this point everything is running smoothly. Now it's time to install the Mongo driver package for Python, which is PyMongo. Just type at the command line:

easy_install pymongo

Now you can connect to your Mongo database from a Python application.

to be continued …
