Ever wanted to have an easy way to get the most recent entry for any given key on a massive database? I ran into this problem a few days ago at work and I wanted to attempt to explain how I solved it.
We have a large database that contains the lifetime history of our trucks GPS locations. The data in these tables is entered once every 10 seconds per truck, given the fact that we have about 30 trucks on average there are about 31536000 entries into this table per year and we have been running this program for several years now. We need a quick way to get the most recent location for every truck on the fleet regardless of whether it was last reported 10 seconds ago or 10 days ago.
For a while we had created a simple SQL query that returned the MAX(date) record for each given truck and this seemed to work all right. That is until yesterday evening, at that time the query that I just mentioned was taking somewhere in the realm of 10-15 minutes to execute, just long enough for our program to time out. Our table now contains somewhere in the realm of two billion records and this query is just not efficient enough. We attempted to optimize the query and gained some performance but not nearly enough to make a sustainable difference in the future.
The program that relies on this query is a real time monitoring system and we cannot sit there and wait for this to process during several minutes. An idea I had was to intercept the incoming stream and tag it as “most-recent” before insertion thus allowing a simple query such as “SELECT * FROM TABLE WHERE MOST-RECENT=true” would work. The challenge with this approach was, the interception of the data. We knew that we could write a simple database trigger, but had no idea if would allow us to modify the incoming data stream.
Most of the database triggers happen after insertion, or update so this may have posed a problem. However it appears that Microsoft had thought about this for us and gave us “Instead of Insert” triggers. This triggers take the incoming data and allow you to do something with it other than actually inserting it into the database. We decided to take the incoming data, update the table to clear the “most-current” flag and then insert the data while updating the current records most-current flag. See the implementation below, we went from a 10 minute query to a milliseconds long query. And the overhead placed on the insert hasn’t posed a problem for the amount of data we receive.
SET ANSI_NULLS ON GO SET QUOTED_IDENTIFIER ON GO -- ============================================= -- Author: Jose C Gomez -- Create date: 5/10/2010 -- Description: Trigger that allows us to keep track of the Vehicles most recent location -- ============================================= CREATE TRIGGER InsteadofInsert ON dbo.current_location INSTEAD OF INSERT AS BEGIN -- SET NOCOUNT ON added to prevent extra result sets from -- interfering with SELECT statements. SET NOCOUNT ON; -- Insert statements for trigger here --CLEAR THE OLDER MOST-RECENT FLAG UPDATE current_location SET most_current=0 WHERE most_current =1 AND Device_ID in (SELECT Device_ID from Inserted); --INSERT THE RECORD WITH THE MOST-RECENT FLAF SET TO TRUE (1) INSERT INTO current_location SELECT Device_ID, Lat, Lon, date, ip_address, direction, speed ,1 FROM Inserted END GO