Finding slow database operations from application logs

A good number of the apps I write involve a database. In some cases, these applications don’t run as well as I had expected. Were I a database whiz kid, I would work my database magic and find which queries are slowing down my apps, but I’m not. Regardless, I don’t generally have access to my customers’ databases after install to run any such required magic.

I have recently been investigating the performance of a reporting application. It produces a daily report which runs a set of queries per user, per login session. When the application was created, the expectation was that the number of users would be up to around 500. Roll forward a few years and there is a site with 5,000 users whose daily reports are not arriving by the start of the business day. Even worse (or better, depending on your point of view :)), a customer with 24,000 users is preparing to deploy the application. While I could rewrite the application to query the database more efficiently, I would prefer not to, as:

  • It currently operates correctly, just inefficiently
  • The application logic is very clear and simple with the current queries
  • There are costs of rewriting the application in terms of time, money and risk

So instead I will try to improve the performance of the queries.

You can generally get a good idea of the very slow-running queries by eyeballing your application’s log files. With decent logging around your database operations, it’s obvious if a query is taking a long time to run. But what about the more subtly slow, yet frequently called queries? While you could go nuts and index every field you can think of, this impacts your performance at the other end when you are inserting data (a real concern in a real-time system such as the one under investigation) and wastes database resources. In my limited experience, you are better off spotting the slower queries from your application logs (which a customer can ship without engaging their DBA team).

What you need to look for is what is running at the end of each (say) second. By tracking this you get a granular breakdown of the queries (or other processing) that are delaying your application. You do have to consider how often each query runs over the second boundary (after all, SOMETHING has to be running as the second ticks over), but if it is consistently one of a small number of queries (or other activities), then those are good candidates for indexing or rewriting.

In the case of this app, each line of the log starts with an 8-character timestamp. In a text editor I could read through the file, looking at which query was running at the end of each second. Take the example below (note the lines have been truncated here):

12:19:12 Perf1_1_60 01  115045  235959       0       0       0       0       0  
12:19:12 db_get_agent_productivity_aht: [select count(*), CAST(AVG(TALK_TIME) as
12:19:12 db_get_agent_productivity_aht: [select sum(IDLE_LENGTH) from IDLE where 
12:19:13 Perf1_1_60 01  115045  235959       0       0       0       0       0  
12:19:13 loaded Perf1_1_60 (2267)

The following query [select sum(IDLE_LENGTH) from IDLE where END_TIME >= 1327539045 and END_TIME <= 1327582799 and AGENT_DBID = '2266';] was running over the second boundary. Given this query would need to be executed some 24,000 times, it needs to execute quickly. What I need to do is go through the whole log file (in this case some 40MB of logs) and see what is running at the end of each second.

Now to the point of this post, doing this manually is laborious to say the least. Thankfully awk comes to the rescue.

awk '{time = $1;  if (time != oldtime) { oldtime = time; print last_line} else {last_line = substr($0,10,60)}}' example.log
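
For anyone who doesn’t read one-liners fluently, here is the same script spread out with comments (the logic is unchanged, and it assumes awk’s default whitespace field splitting, so $1 is the timestamp):

awk '{
    time = $1                       # the 8-character timestamp at the start of the line
    if (time != oldtime) {
        # the second has ticked over: report whatever was remembered as
        # the last line of the previous second
        oldtime = time
        print last_line
    } else {
        # still inside the same second: remember this line, minus the
        # timestamp and trimmed to 60 characters
        last_line = substr($0, 10, 60)
    }
}' example.log

One quirk: the very first change of second prints an empty last_line, which is where the lone blank entry in the counts further down comes from.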

In my case this will output

db_get_agent_productivity_aht: [select count(*), CAST(AVG(TA
db_get_initial_login_state: [SELECT STATUS FROM STAFFED WHER
db_get_agent_productivity_aht: [select count(*), CAST(AVG(TA

SNIP

db_get_login_times: [SELECT STATUS, TIME FROM STAFFED WHERE
db_get_login_times: [SELECT STATUS, TIME FROM STAFFED WHERE
db_get_agent_productivity_aht: [select count(*), CAST(AVG(TA
db_get_agent_productivity_aht: [select count(*), CAST(AVG(TA

I have chosen the substring parameters to cut off the timestamp and give a line long enough that it is clear to me which query is being run, yet short enough that each instance of a query looks the same (the variable values in the WHERE clause fall towards the end of the line and get truncated away).
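
To make those numbers concrete: the 8-character timestamp plus its trailing space occupy the first 9 characters, so the query text starts at character 10. A quick check against one of the log lines above:

echo "12:19:12 db_get_agent_productivity_aht: [select sum(IDLE_LENGTH) from IDLE where END_TIME >= 1327539045 ..." | awk '{ print substr($0, 10, 60) }'

gives

db_get_agent_productivity_aht: [select sum(IDLE_LENGTH) from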

I can then pass that through sort and uniq to get a line count for each

awk '{time = $1;  if (time != oldtime) { oldtime = time; print last_line} else {last_line = substr($0,10,60)}}' example.log | sort | uniq -c

giving

      1
      3 db_get_agent_productivity_aht: [select CALL_TYPE, count(*),
      6 db_get_agent_productivity_aht: [select count(*), CAST(AVG(TA
     17 db_get_initial_login_state: [SELECT STATUS FROM STAFFED WHER
    334 db_get_login_times: [SELECT STATUS, TIME FROM STAFFED WHERE
    724 db_get_agent_productivity_aht: [select sum(IDLE_LENGTH) from
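
If the list gets long, an extra numeric sort (and a head) on the end of the same pipeline brings the worst offenders to the top; just an optional tweak:

awk '{time = $1;  if (time != oldtime) { oldtime = time; print last_line} else {last_line = substr($0,10,60)}}' example.log | sort | uniq -c | sort -rn | head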

In this case I went back to my logs, found the full queries that corresponded to the last two entries and applied some indexing to the relevant tables and columns. On my data set, this improved my query time from 20 minutes to a little under 4 minutes. After loading the extra indexes, a stress test confirmed the real-time (data loading) components of the solution were not impacted to any measurable degree. I think the problem is solved and it’s time to deploy to a real customer and measure the real world performance increase. Better still, it did not involve a change to the application.
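
As an illustration of the kind of indexing involved here, a composite index over the columns in the IDLE query’s WHERE clause would look something like this (illustrative only, with a made-up index name; the right choice depends on the schema, the data and the database engine):

CREATE INDEX IDX_IDLE_AGENT_END ON IDLE (AGENT_DBID, END_TIME);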

Happy Australia Day, and happy birthday to my Sister-In-Law too!