MySQL and mass inserts

The archive application that I have been building has been a fun change from the normal data-entry business application that I normally get tasked with. I get to do some image manipulation and multithreading, both of which I don’t normally get to do. The one thing that I have been working on the most is the mass insert portion of the application. We get a feed that supplies us with several thousand images a day and the application needs to process this feed by taking the image, extracting the metadata from it and store that metadata into a MySQL database, moving the image to the correct location, generate a thumbnail, then take the metadata and supply it to the search engine for indexing.

As I have mentioned before, the vendor’s app is horribly slow at this process. If their application has no users on it, it will process one image every 30 seconds; with a user, 1 every minute. Totally unacceptable to say the least. So I took the challenge to beat their times. I started by using the current production files, which is about 250,000 images, as my testbed. With an empty database, the processing would do inserts at about 50 images a second (without thumbnailing), but as the table got to rougly 50k rows, it would slow down horribly. The CPU would stay between 90% and 98% and the rate would drop between 2 and 5 images per second and stay thay way.

The first thing I did was to make sure it was the database causing the bottleneck, and it was. So naturally I started googling for performance tuning on inserts. Some tips I really couldn’t implement, such as insert via files, or doing multiple ‘values’ statements per insert. I tried changing from MyISAM tables to InnoDB tables and sometimes that helped, but not enough to make a big difference. So after several days of trying different things, such as multithreading, not indexing until all data was in the database, etc. it finally dawned on me that the problem was an architectural one. Being used to the typical business application, I had a DAO using a DataSource that got a connection and did the insert via a prepared statement for each insert, and that turned out to be the problem, the database was being saturated with having a separate connection for each insert, commiting the insert and moving to the next request. I tried changing MySQL variables, buffers, etc. with no real improvement.

The break came when I changed the DAO to use a single connection opening, and that made all the difference. The app now uses about 12% CPU and pushes about 20 images a second. The best part though is when I turn on the multithreading, the images per thread per second drops to about 8, but the database doesn’t seem to bog down and I’ve run as many as 12 threads and hit averages of up to 50 per second. Now the bottleneck seems to be reading the images (done over a network connection) rather than the database.

I tried a few other things, such as using the LOCK TABLES and UNLOCK TABLES commands, but they didn’t make too much difference for me in this particular test. They may help when there is also a read load on the database as well as an insert load however.

For the indexing, I have the option of indexing per image or as a mass after the mass inserts. For the time being I’m doing them after the database inserts. I am using SOLR for the indexing engine as it fits perfectly with what I need and is very fast (based on Lucene). The process I use reads 5000 row chunks from the database and posts 1000 items at a time to SORL, doing a commit every 10,000 with an optimize at the end of the processing. It processes  250k records or so in just a few minutes.

As for hardware, the database, SOLR (running on jetty per the examples), and my app running on Tomcat are all on my desktop which is a Dell pentium 4HT with a gig of ram running Red Hat linux.

As a sidebar, the vendor claims that the Java image libraries just aren’t scalable and can’t handle the load that is required for the application. They have recommended switching to using ImageMagick and using a different server for the image processing. What a shame. I am using the exact same metadata extraction library they are using and I am getting what, a 480% increase in performance over them? Even before the tuning and doing 2 images a second, that is still a good 120% increase. With regard to the thumbnailing, I we have tested their app with and without thumbnailing and it doesn’t make too big of a difference as their main bottleneck is their code. For me the thumbnailing adds about 300-400 milliseconds per image which is significant, but since I can run multithreaded and they can’t (due to serious concurrency issues in their app), I can make the hit more bearable.



Don’t miss anything, subscribe!

Did you enjoy this post? Why not leave a comment below and continue the conversation, or subscribe to my feed and get articles like this delivered automatically to your feed reader.

Comments

[...] In my continuing saga on my new favorite vendor… [...]

Cool review about and mass inserts at Thinking Outloud. Thoroughly enjoy this interesting posts.

Computer Maintenance Tips…

I couldn’t understand some parts of this article, but it sounds interesting…

Leave a comment

(required)

(required)