Using JetS3t to upload a large number of files to S3

February 3rd, 2008

I was looking for a tool to upload a large number of files to S3. While I have been a great fan of the bash tools for browsing and accessing S3 objects and buckets and for managing a limited number of files, I could not find an easy way of uploading a large number of files (the first batch being around 800K).

Then I downloaded JetS3t. It has a nice GUI called Cockpit for managing the files on S3. The GUI is pretty neat. However, for simple uploads and downloads, S3 Organizer, a simple Firefox plugin, does the job. If you need to manage your files extensively, then JetS3t’s Cockpit is the way to go.

For uploading a large number of files, I was looking for something multi-threaded and configurable. The JetS3t suite has a “synchronize” application, which is meant to synchronize files between a local PC and S3. JetS3t allows you to configure the number of threads and connections to the S3 service. Without reinventing the wheel, I got what I wanted. However, one additional thing I needed was the ability to delete the local files once the upload was complete. After tinkering with the Java source, I modified Synchronize.java and added the following code fragments:

public void uploadLocalDirectoryToS3(FileComparerResults disrepancyResults, Map filesMap,
        Map s3ObjectsMap, S3Bucket bucket, String rootObjectPath, String aclString) throws Exception {
    ...
    // paths of the local files (not directories) queued for upload
    List filesToDelete = new ArrayList();
    ...
    if (!file.isDirectory()) {
        filesToDelete.add(file.getPath());
    }
    ...

    // delete the local files once the objects are in S3
    for (Iterator ite = filesToDelete.iterator(); ite.hasNext();) {
        String fName = (String) ite.next();
        File f = new File(fName);
        f.delete();
    }
}
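
The fragment above ignores the return value of File.delete(), so a file that could not be removed goes unnoticed. A minimal standalone sketch of the same cleanup step that reports failures is shown here. The class and method names (LocalCleanup, deleteUploadedFiles) are hypothetical and are not part of JetS3t, and the sketch assumes the caller only passes paths whose upload has already succeeded.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of JetS3t: removes local files after upload. */
final class LocalCleanup {

    /**
     * Deletes each path in uploadedPaths and returns the paths that could
     * not be deleted. Directories are skipped, as in the original fragment.
     * Assumption: every path in the list has already been uploaded.
     */
    static List<String> deleteUploadedFiles(List<String> uploadedPaths) {
        List<String> notDeleted = new ArrayList<String>();
        for (String path : uploadedPaths) {
            File f = new File(path);
            if (f.isDirectory()) {
                continue; // never remove directories here
            }
            // File.delete() returns false on failure instead of throwing
            if (!f.delete()) {
                notDeleted.add(path);
            }
        }
        return notDeleted;
    }
}
```

The returned list can be logged or retried, so a failed delete does not silently leave stale files behind.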

Amazon EC2 Disk Speeds of m1.small and m1.large

January 24th, 2008

I have been running a two-node cluster on EC2, and for the past week or so my database writes have been totally bogged down. After some tests it looked like we were hitting a disk I/O bottleneck. To my surprise, disk I/O was 5-6 times faster on the m1.large instance type.

I ran a cheap command to time the creation of a 1GB file. Here are the results. In both cases the test was run on /mnt, which is considered to be a dedicated spindle.

On m1.small

[root@]# time dd if=/dev/zero of=testfile count=1 bs=1024M
1+0 records in
1+0 records out
real 0m11.298s
user 0m0.000s
sys 0m3.390s

On m1.large


[root@]# time dd if=/dev/zero of=testfile count=1 bs=1024M
1+0 records in
1+0 records out
real 0m2.982s
user 0m0.000s
sys 0m2.350s
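
To turn the two timings into throughput figures, divide the 1024 MB written by the "real" time. The sketch below does only that arithmetic; the class and method names are made up for illustration. For these two runs it works out to roughly 91 MB/s on m1.small and 343 MB/s on m1.large, a ratio of about 3.8, so the 5-6 times figure quoted earlier presumably came from other tests.

```java
/** Hypothetical helper for converting the dd timings into MB/s. */
final class DiskThroughput {

    /** Throughput in MB/s for a write of the given size and wall-clock time. */
    static double megabytesPerSecond(double megabytes, double seconds) {
        if (seconds <= 0) {
            throw new IllegalArgumentException("seconds must be positive");
        }
        return megabytes / seconds;
    }

    /** How many times faster the second run was than the first. */
    static double speedup(double slowSeconds, double fastSeconds) {
        if (slowSeconds <= 0 || fastSeconds <= 0) {
            throw new IllegalArgumentException("times must be positive");
        }
        return slowSeconds / fastSeconds;
    }
}
```

For example, DiskThroughput.megabytesPerSecond(1024, 11.298) gives the m1.small figure and DiskThroughput.speedup(11.298, 2.982) gives the ratio between the two instance types.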

Read about the various EC2 instance types here

As Blogosphere explodes, Blog search implodes

January 20th, 2008

Blog search engines are tracking billions of blog posts. Some posts are mindless, some are fun and some are purely spam. As the blogosphere grows, the quality of the blog content discoverable through the search engines is falling way behind. We saw a similar problem with regular search before Google came along and wooed online users away from Altavista. I recently searched for blog posts tagged “digg”. The majority of them were spam. Some were good, but they were beyond the first few results pages.
mindless_technorati2.jpg

PHP-JSON on Amazon EC2 Fedora Core4 AMI

January 19th, 2008

I have been using a modified version of an old FC4 public AMI, which is too good to be deserted with all the software I have installed. For a project, I needed support for JSON. JavaScript Object Notation, or JSON, is a way of passing a string representation of JavaScript objects to user agents. Straight-through serialization and transmission of JavaScript objects is much more compact than passing XML and then converting it to JavaScript objects.

On the server side you have to create a string representation of a JavaScript object and then return it to the caller. This is very similar to formatting data in XML and then sending the XML back to the requesting clients. JSON takes away the overhead of XML (no parsing, DOM walking, etc.; instead you get a first-class JavaScript object). In PHP you can create a JSON string by passing a PHP variable into an encoder function. As of PHP 5.2.0, JSON support is compiled in natively; it was contributed by Omar Kilani, who wrote the php-json extension.
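
To make "a string representation of a JavaScript object" concrete, here is a minimal sketch of what an encoder does. It is written in Java, like the other code on this blog, and is not the php-json implementation; in PHP the equivalent is a single call to json_encode. The class name TinyJson is made up, and the sketch only handles flat maps of strings, numbers, booleans and nulls.

```java
import java.util.Map;

/** Hypothetical, deliberately small JSON encoder for flat key/value maps. */
final class TinyJson {

    /** Encodes a flat map as a JSON object, keeping the map's iteration order. */
    static String encode(Map<String, ?> object) {
        StringBuilder out = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, ?> e : object.entrySet()) {
            if (!first) {
                out.append(',');
            }
            first = false;
            out.append(quote(e.getKey())).append(':').append(encodeValue(e.getValue()));
        }
        return out.append('}').toString();
    }

    /** Numbers and booleans are written bare; everything else becomes a string. */
    private static String encodeValue(Object v) {
        if (v == null) {
            return "null";
        }
        if (v instanceof Number || v instanceof Boolean) {
            return v.toString(); // assumption: finite numbers only
        }
        return quote(v.toString());
    }

    /** Wraps a string in double quotes and escapes characters JSON forbids. */
    private static String quote(String s) {
        StringBuilder out = new StringBuilder("\"");
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '\n': out.append("\\n"); break;
                case '\r': out.append("\\r"); break;
                case '\t': out.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        out.append(String.format("\\u%04x", (int) c));
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.append('"').toString();
    }
}
```

The browser can then turn the returned string straight into an object, which is the saving over XML described above.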

You may skip the rest of the post if you already have PHP 5.2.0. Keep on reading if you have PHP 5.0/5.1 on Fedora Core 4 and want JSON functionality.

The good news is that php-json has an RPM available in fedora-extras. You don’t need to do any thingamagic of installing the RPM by hand. Instead, use yum to install it seamlessly from the Fedora Extras repository.

(Assuming that you have su privileges.)

Step 1. Navigate to the /etc/yum.repos.d directory. Check if you have a file called fedora-extras.repo. This file contains the information yum needs to look up the extras repository.

Step 2. If you do not have the file then create the file with the following text:

[extras]
name=Fedora Extras $releasever - $basearch
baseurl=http://download.fedora.redhat.com/pub/fedora/linux/extras/$releasever/$basearch/
mirrorlist=http://fedora.redhat.com/download/mirrors/fedora-extras-$releasever
enabled=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-fedora-extras
gpgcheck=1

If you have the file then make sure the “enabled” flag is set to 1. Extras are normally disabled.
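
The check described above can also be scripted. The following is a small sketch, not a yum tool: it takes the text of a .repo file and reports whether a given section carries an explicit enabled=1 line. The class and method names are hypothetical, and a section with no "enabled" line at all is simply reported as false here.

```java
/** Hypothetical checker for the "enabled" flag in a yum .repo file. */
final class YumRepoCheck {

    /**
     * Returns true only if the section [repoId] contains the line enabled=1.
     * Assumption: the file uses the simple INI layout shown above.
     */
    static boolean isExplicitlyEnabled(String repoFileText, String repoId) {
        boolean inSection = false;
        for (String raw : repoFileText.split("\\r?\\n")) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            if (line.startsWith("[") && line.endsWith("]")) {
                // a new section header starts; remember whether it is ours
                inSection = line.substring(1, line.length() - 1).trim().equals(repoId);
            } else if (inSection) {
                int eq = line.indexOf('=');
                if (eq > 0 && line.substring(0, eq).trim().equals("enabled")) {
                    return line.substring(eq + 1).trim().equals("1");
                }
            }
        }
        return false;
    }
}
```

Reading /etc/yum.repos.d/fedora-extras.repo into a string and passing it in with the section name "extras" would cover the file from Step 2.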

Step 3. Run a yum search with “yum search php-json”. If your fedora-extras.repo is set up correctly, you will see a matching result such as:

php-json.i386                            1.1.0-1.fc4            extras
Matched from:
php-json
php-json is an extremely fast PHP C extension for JSON (JavaScript Object
Notation) serialisation.
http://www.aurore.net/projects/php-json/

Step 4. If all looks good, run the install command as yum install php-json. That’s it. Have fun with JSON. Read the PHP Manual for usage.

Hello WordPress!

January 19th, 2008

For the past 4 years I have been blogging using MovableType. It is a fantastic product with a fantastic set of features. I have been thinking of migrating to WordPress due to the 500 Internal Server Errors, which have reduced my ability to blog. MT was a solid, stable platform until their 2.4.x release. Things have been going south since then. Another reason for the migration was my lack of Perl knowledge (I have tried, but can’t fathom the depth the language has to offer). The hacker in me has already tinkered around with WordPress, which allowed me to complete my import from MT to WP (esp. preserving the numerical IDs for the .htaccess forward from MT).

Magnetic “ropes” are causing Auroras

December 12th, 2007

Space.com is reporting that a fleet of NASA spacecraft launched less than eight months ago has revealed new insights into the forces that cause the northern lights, including giant magnetic “ropes” between Earth and the sun. Understanding the cause of auroras will help scientists solve the mystery of solar flares and what causes them, and get a handle on space weather. I don’t really understand the geophysics of why the solar wind slams into the earth’s magnetic field in the magnetosphere. I’m simply enamoured by the spectacular shows, gallery here and here. Chasing auroras? I could do this for a living!


Photograph by Phil Hoffman

Google engineers convey world dominance in encoded URL of search experiment

December 2nd, 2007

Several blogs are reporting about Google experimenting with Digg-style voting of search results. What I found intriguing is the URL of the experiment. Here it is unhyperlinked:
http://www.google.com/experimental/a840e102.html
Look at the last part. It reads “a840e102” which looks like some encoded message where the engineers are trying to convey their world dominance. Here is a screen shot of an analysis of probable meanings. I like the cannabis one :D)
a840e102.gif

You can read some of them as:

  1. Twice the dose of cannabis produces a 100 times better search experience
  2. A unifying, democratizing way of producing a 100 times better search experience, just like the Tibetan script Phagspa unified the various languages
  3. 840 is one of the smallest composite numbers with 32 divisors. It shows the technical prowess of one of the algorithms employed in the experiment, or maybe it is just a nod to a highly complex problem in maths, just like they used the digits of Pi in a stock offering
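
The divisor claim in the third reading is easy to check by brute force. The sketch below (class and method names are made up) counts the divisors of a number and searches for the smallest number with a given divisor count. 840 factors as 2^3 x 3 x 5 x 7, which gives (3+1)(1+1)(1+1)(1+1) = 32 divisors, and the search confirms that no smaller number has exactly 32.

```java
/** Hypothetical brute-force check of the "840 has 32 divisors" claim. */
final class DivisorCount {

    /** Counts the positive divisors of n by trial division up to sqrt(n). */
    static int countDivisors(int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be positive");
        }
        int count = 0;
        for (int d = 1; (long) d * d <= n; d++) {
            if (n % d == 0) {
                // d and n/d are both divisors; count a square root only once
                count += (d == n / d) ? 1 : 2;
            }
        }
        return count;
    }

    /** Returns the smallest positive integer with exactly target divisors. */
    static int smallestWithDivisorCount(int target) {
        for (int n = 1; ; n++) {
            if (countDivisors(n) == target) {
                return n;
            }
        }
    }
}
```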

The 3-day Middle-east peace process vs. the price at the pump

November 29th, 2007

Nov_07_3day_peace_process_price_at_the_pump.jpg
Read more at:
NY Times: Bush Promotes Middle East Peace Dialogue

Bush kick starts Middle East peace talks

Egypt: The Annapolis Peace Conference

Tragedy and Travesty at Annapolis

Tiny urls: Taking WWW towards a single point of failure

November 17th, 2007

Tinyurl, urltea and several other URL reducers provide an excellent service: they shrink sometimes very long URLs to a fraction of their original size. The short version of the URL provides relief to people on the phone (I can’t really think of anything else that could benefit from the service). Thanks to the growth of Twitter, the URL-abbreviating services have gained a lot of popularity recently, so much so that people have started replacing regular URLs on the web (e.g. look at the comment in this post). Charlene Li even considered having tiny URLs in her book. David Pogue carries the ecstatic side of finding a new service without evaluating the potential pitfalls.

I don’t understand why people want to mask the URL for the normal WWW. A lot of people click the URL after doing a hover and figuring out the actual target.

5 minutes ago, I clicked on a urltea link and I got a 503 HTTP Error:

A 503. Service is not available from urltea. Tinyurl claims to have abbreviated a billion urls. Imagine the impact of such a downtime.

The URL abbreviation services pose the following problems:

1. Single point of failure for billions of web urls. This totally defeats the distributed architecture of WWW.
2. Masked urls could be prone to deception by spammers and XSS exploiters. Quoting Wired Blog, “your audience has no clue where it will lead — could be a porn link, could be a virus laden site from Russia.”
3. A lot of browser security features work on the domain name and its associated attributes stored locally. A different URL masks the true domain.
4. It leads to even more problems in the text mining community — where a single domain pollutes the corpus of links, while hiding the actual target. Any link analyzer has to first resolve the actual target of the tiny url by performing an HTTP HEAD request.
5. What if tinyurl gets bought by a get-rich-quick advertising company and they start sending a pop-up along with the actual URL? That would be an idea for someone to make a lot of money from billions of tiny URLs!
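
The HTTP HEAD lookup mentioned in point 4 can be sketched as follows. This is a minimal illustration using the JDK's HttpURLConnection, with a made-up class name; it assumes the shortening service answers with a 3xx status and a Location header, and it resolves only one hop rather than following a chain of redirects.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

/** Hypothetical one-hop resolver for shortened URLs. */
final class TinyUrlResolver {

    /**
     * Sends a HEAD request to shortUrl without following redirects and
     * returns the Location header if the response is a 3xx redirect,
     * or null otherwise.
     */
    static String resolveTarget(String shortUrl) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(shortUrl).toURL().openConnection();
        try {
            conn.setInstanceFollowRedirects(false); // we want the redirect itself
            conn.setRequestMethod("HEAD");          // headers only, no body
            conn.setConnectTimeout(5000);
            conn.setReadTimeout(5000);
            int status = conn.getResponseCode();
            if (status >= 300 && status < 400) {
                return conn.getHeaderField("Location");
            }
            return null;
        } finally {
            conn.disconnect();
        }
    }
}
```

A link analyzer would have to make this extra round trip for every tiny URL in its corpus, and a 503 from the service, like the one above, leaves the real target unknown.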

The value provided by these services for mobile is great, but it’s a big problem when the tiny URLs start popping up on everybody’s webpages! I’m not alone in thinking there is something wrong with the service in the WWW context. Here’s Tom and here’s Scott Rosenberg of Salon.

Java Generics: Taking the fun away from writing code in Java

November 16th, 2007

Coding in Java was simple until Java 5 (or Java 1.5: 1.5 is the developer version and Java 5 is the marketing version, as Sun calls it!). Learning generics in Java 1.5 is like learning Microsoft COM programming; it takes at least 5 passes to absorb it right.

Look at this simple call, before generics:

List myIntList = new LinkedList();
myIntList.add(new Integer(0));
Integer x = (Integer) myIntList.iterator().next();

How about now:

List<Integer> myIntList = new LinkedList<Integer>();
myIntList.add(new Integer(0));
Integer x = myIntList.iterator().next();
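
What the extra angle brackets buy is easiest to see side by side. In this small sketch (the class and method names are made up) the raw list happily accepts a String and fails only at run time with a ClassCastException, while the generic list needs no cast and rejects the wrong type at compile time.

```java
import java.util.LinkedList;
import java.util.List;

/** Hypothetical side-by-side of a raw list and a generic list. */
final class RawVersusGeneric {

    /** Pre-generics style: compiles fine, the mistake surfaces at run time. */
    @SuppressWarnings({"rawtypes", "unchecked"})
    static boolean rawListFailsAtRuntime() {
        List raw = new LinkedList();
        raw.add("not a number"); // nothing stops this
        try {
            Integer x = (Integer) raw.iterator().next();
            return false; // not reached: the cast above throws
        } catch (ClassCastException e) {
            return true;
        }
    }

    /** Generic style: no cast, and the wrong type does not compile. */
    static int genericListNeedsNoCast() {
        List<Integer> typed = new LinkedList<Integer>();
        typed.add(Integer.valueOf(0));
        // typed.add("not a number"); // rejected by the compiler
        return typed.iterator().next();
    }
}
```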

Here is a source snippet of the java.util.Collection interface from the 1.5 version. Makes me wipe the sweat.

public interface Collection<E> extends Iterable<E> {
    <T> T[] toArray(T[] a);
    boolean containsAll(Collection<?> c);
    boolean addAll(Collection<? extends E> c);
    Iterator<E> iterator();
}
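
The wildcards in those signatures look worse than they are. The following sketch, with made-up class and method names, shows what "? extends E" in addAll allows: a List<Number> can take in both a List<Integer> and a List<Double>, and a method declared with Collection<? extends Number> accepts any of them.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

/** Hypothetical demo of the bounded wildcard used by Collection.addAll. */
final class WildcardDemo {

    /** Accepts a collection of any Number subtype (Integer, Double, ...). */
    static double sum(Collection<? extends Number> values) {
        double total = 0;
        for (Number n : values) {
            total += n.doubleValue();
        }
        return total;
    }

    /** addAll(Collection<? extends E>) lets both lists flow into List<Number>. */
    static List<Number> merge(List<Integer> ints, List<Double> doubles) {
        List<Number> all = new ArrayList<Number>();
        all.addAll(ints);
        all.addAll(doubles);
        return all;
    }

    /** The generic method <T> T[] toArray(T[] a) returns a typed array. */
    static String[] toStrings(Collection<String> names) {
        return names.toArray(new String[0]);
    }
}
```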

Yeah, yeah. The fans of C++ would love it: it looks like C++ templates, with compile-time type checking, etc. Well, programmers shall figure out other ways of making mistakes, like working on a null object :D)
No way out, I need to learn it ‘coz 1.6 is already out and I was still hanging on to 1.4 till yesterday!