Archive for the ‘S3’ Category

Using JetS3t to upload a large number of files to S3

Sunday, February 3rd, 2008

I was looking for a tool to upload a large number of files to S3. While I have been a great fan of the bash tools for browsing and accessing S3 objects and buckets, and for managing a limited number of files, I could not find an easy way to upload a large number of files (the first batch being around 800K).

Then I downloaded JetS3t. It comes with a nice GUI called Cockpit for managing the files on S3, and the GUI is pretty neat. However, for simple uploads and downloads, S3 Organizer, a simple Firefox plugin, does the job. If you need to manage your files extensively, then JetS3t's Cockpit is the way to go.

For uploading a large number of files, I was looking for something multi-threaded and configurable. The JetS3t suite includes a "synchronize" application meant to synchronize files between a local PC and S3, and JetS3t lets you configure the number of threads and connections to the S3 service. Without reinventing the wheel, I got what I wanted. However, one additional thing I needed was the ability to delete the local files once the upload was complete. After tinkering with the Java source, I modified Synchronize.java and added the following code fragments:

public void uploadLocalDirectoryToS3(FileComparerResults disrepancyResults, Map filesMap,
        Map s3ObjectsMap, S3Bucket bucket, String rootObjectPath, String aclString) throws Exception {
    ...
    // collect the paths of regular files as they are queued for upload
    List filesToDelete = new ArrayList();
    ...
    if (!file.isDirectory()) {
        filesToDelete.add(file.getPath());
    }
    ...

    // delete the local files once the objects have landed on S3
    for (Iterator ite = filesToDelete.iterator(); ite.hasNext();) {
        String fileName = (String) ite.next();
        new File(fileName).delete();
    }
}
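For reference, the thread and connection counts mentioned above are controlled through JetS3t's jets3t.properties file. The values below are illustrative, not recommendations:

```properties
# number of threads used for simultaneous object uploads/downloads
s3service.max-thread-count=10
# maximum simultaneous HTTP connections to the S3 service
httpclient.max-connections=10
```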

MySQL on EC2: Data backup strategy using replication and S3

Tuesday, May 1st, 2007

A few EC2 and S3 facts:
1. S3 (Amazon's storage-in-the-cloud infrastructure) cannot be natively mounted on EC2 (Amazon's cloud computing infrastructure).
2. The maximum size of an "object" (the atomic unit of stored data on S3) is 5 GB.
3. Multiple EC2 instances (each a virtual machine with a 1.7 GHz CPU, a 160 GB ephemeral HDD, and 1,500 MB of RAM) can be booted on demand.
I run a couple of EC2 instances in the cloud. The backup strategy (call it the layman's strategy, or the lame strategy!) so far has been: a) freeze the database, b) break the data files into 5 GB chunks, c) move the chunks (or objects) onto S3, d) unfreeze the database, e) repeat.
The above approach takes the database offline for at least 4-6 minutes every cycle. So, here's a new strategy I'm planning to test. The pseudo-algo is as follows:
1. Create an AMI which has a pre-configured mysql slave
2. Boot a new instance using the AMI created in #1 above, whenever a backup is desired
3. Read objects from S3 (if any) and coalesce them to rebuild the data file
4. Create SSH Tunnel to the master
5. Start slave to catch up with replication
6. Stop Slave after some time
7. Break the fattened data file into chunks, or objects (S3's 5 GB limit)
8. Move the objects to S3
9. Shutdown the instance
10. Go to 2
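Step 7 (and its inverse, step 3) is easy to script. Here is a minimal sketch of the chunking side in Java; the class name and method are my own, not part of any library, and you would tune the chunk size down when testing locally:

```java
import java.io.*;

public class Chunker {
    // S3's maximum object size; pass a smaller value when testing locally
    public static final long S3_MAX_OBJECT_SIZE = 5L * 1024 * 1024 * 1024;

    // splits src into files named <src>.part0, <src>.part1, ... inside destDir,
    // each at most chunkSize bytes; returns the number of chunks written
    public static int split(File src, File destDir, long chunkSize) throws IOException {
        byte[] buf = new byte[64 * 1024];
        int part = 0;
        InputStream in = new BufferedInputStream(new FileInputStream(src));
        try {
            int read = in.read(buf);
            while (read != -1) {
                File chunk = new File(destDir, src.getName() + ".part" + part++);
                OutputStream out = new BufferedOutputStream(new FileOutputStream(chunk));
                try {
                    long written = 0;
                    while (read != -1 && written < chunkSize) {
                        // never overshoot the chunk boundary
                        int n = (int) Math.min(read, chunkSize - written);
                        out.write(buf, 0, n);
                        written += n;
                        if (n < read) {
                            // carry the unwritten tail of the buffer into the next chunk
                            System.arraycopy(buf, n, buf, 0, read - n);
                            read = read - n;
                            break;
                        }
                        read = in.read(buf);
                    }
                } finally {
                    out.close();
                }
            }
        } finally {
            in.close();
        }
        return part;
    }
}
```

Rebuilding (step 3) is the reverse: open the parts in order and append them to a single output stream.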
The new algo requires quite a bit of automation and there are some unanswered questions, which I’m sure could be figured out after the first trial. The following areas need to be automated:
1. SSH Tunneling between slave and master EC2 instances. The trick is to figure out the host name of the newly booted instance and then tunnel from it.
2. Client scripting for booting and executing the scripts on the slave. I think the best way to address this could be by running a cron job on the master server, which initiates and completes the backup process.
3. Prevention of data corruption. Moving large objects to/from S3 could have its own issues. I need to figure out whether the REST API calls guarantee data consistency.
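One way to catch corruption in transit: S3's REST PUT accepts a Content-MD5 header, and for a simple PUT the ETag it returns is the hex MD5 of the object, so you can compare a locally computed digest against what S3 reports. A sketch of the local side (class and method names are mine):

```java
import java.io.*;
import java.security.MessageDigest;

public class Md5Check {
    // hex-encoded MD5 of a file, comparable to the ETag S3 returns for a simple PUT
    public static String md5Hex(File f) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        InputStream in = new BufferedInputStream(new FileInputStream(f));
        try {
            byte[] buf = new byte[64 * 1024];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        } finally {
            in.close();
        }
        StringBuilder sb = new StringBuilder();
        for (byte b : md.digest()) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }
}
```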

Cloud Computing Panel at TiE: Amazon, where are my candies?

Wednesday, April 18th, 2007

I was in the audience for a panel discussion on Cloud Computing hosted by TiE. The panel was moderated by Nimish Gupta of SAP and had people from Amazon Web Services, Google, Opus Capital, and SAP. The interesting thing to watch was how the panel agreed to disagree on the benefits and definition of Cloud Computing. Pavni Diwanji from Google mentioned that it's the tools on Google Apps and the API that matter to developers.
Dan Avida, a VC from Opus Capital, seemed to have in-depth knowledge of EC2 and mentioned that there are interesting opportunities waiting to be tapped on EC2. It may be interesting to look into those areas.
According to Vishal Sikka, CTO of SAP:

Cloud computing is suitable for smaller applications but not for large applications like SAP.

Adam Selipsky, who represented Amazon, agreed with that statement and said the current shape of Amazon EC2 & S3 is a first cut and is still in limited private beta. He further mentioned that Amazon's prime focus is on the stability of the platform, and that they haven't added any major feature to EC2 or S3 in the last 12 months.
On a question about competition for EC2, he joked, "There are rumors that the company on my left (referring to Google, as Pavni Diwanji of Google Apps was seated there) is working on something." He then turned serious and said that educating developers to jump onto EC2 is the hardest part, and he would love to have some competition so that they could spend millions of dollars on educating the customers.
On being asked whether Amazon is just utilizing the over capacity available in their data centers, Adam responded, “Amazon has invested around $2b for Amazon WebServices including EC2 and S3 and are fully committed.”
I took my turn from the audience, mentioned that not being able to natively mount S3 as a filesystem is a limitation of EC2, and asked about the oft-requested feature of supporting large databases on EC2. Adam quipped that he does not want to commit to a date, but they are working on it. Cool.
On a side note: Adam and his team (a couple of his colleagues were in the audience) were pitching people at the venue to sign up for their beta program, but did not bring any candies for existing customers like me. Too bad! After the meeting I even sold the idea of using EC2 to a gentleman who was still kicking the tires. Where's my referral fee? 🙂

Web Applications with Portable Data: The next generation of Web applications

Tuesday, November 7th, 2006

Data portability is a big issue. None of us wants to get locked in with a particular vendor. All the free web apps like GMail, JotSpot, Writely, et al. come with a price: your data lives in a proprietary data store. If you are not using POP3 and want to migrate from GMail to some other cool new email application, there is no easy way out. The vendors rely on the lock-in of this data. For example, Google is offering e-mail services for SMBs; what if you grow into a larger enterprise tomorrow and want to have your own e-mail environment? There is no easy migration. The same goes for other next-generation hosted applications like spreadsheets, wikis, and office applications. For a long time vendors rallied against Microsoft for its proprietary formats; talk about irony!
What’s the solution then? As Fred Wilson mentions:

I think anyone who provides a web app should give users options for where the data gets stored. The default option should always be to store the data on the web app provider’s servers. Most people will choose that option because they don’t care enough about this issue to do anything else.

I think we need a new breed of web applications with pluggable storage. For example, all you get from a next-generation GMail is a presentation and business-logic layer; you get the ability to choose your own data storage. It should work the way desktop applications work: your photo-organizing software does not come with a data store attached; all it has is a ton of logic, and it uses the file system. You can switch to another application and carry on with business.
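The "pluggable storage" idea sketches naturally as an interface the application codes against, with concrete backends supplied by the vendor, your own server, or something like S3. All names below are hypothetical, not from any real product:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical contract a web app would code against instead of a
// hard-wired, vendor-owned data store.
interface StorageProvider {
    void put(String key, byte[] data);
    byte[] get(String key);   // null if the key is absent
    void delete(String key);
}

// Trivial in-memory backend; a vendor-hosted store, a local filesystem,
// or an S3 bucket could implement the same contract, and the user picks one.
class InMemoryStorage implements StorageProvider {
    private final Map<String, byte[]> store = new HashMap<String, byte[]>();
    public void put(String key, byte[] data) { store.put(key, data); }
    public byte[] get(String key) { return store.get(key); }
    public void delete(String key) { store.remove(key); }
}
```

Switching vendors then means swapping the backend, not exporting and re-importing your data through whatever escape hatch the vendor deigns to provide.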