Subscribe to
Posts
Comments

I frequently need to update large amount of data, like tens of millions of rows.

Usually the update request comes as that there is a table (let’s call it new_table) with a record id column (record_id) and a few other columns, say date_field1, data_field2 and data_field3 that need to be updated into another table (old_table). The old_table also contains the record_id column and the data field columns. It also contains a lot of other fields. When you look at such request a straight forward solution is to issue an update statement to update the data fields into the old_table like this.

update o set o.data_field1=n.data_field1, o.data_field2=n.data_field2, o.data_field3=n.data_field from old_table o join new_table n on o.record_id=n.record_id

If you do this you will probably find your update query run for 24 hours and still can not finish. Even when you have indexes correctly built on both the new and old tables. I think a much faster approach is to create another table based on the old table and incorporate the updates into the new table while the table is built. To do that you need the help of a select … into query like this

select o.recordid,
case when n.data_field1 is not null then n.data_field1 else o.data_field1 end as data_field1,
case when n.data_field2 is not null then n.data_field2 else o.data_field2 end as data_field2,
case when n.data_field3 is not null then n.data_field3 else o.data_field3 end as data_field3,

o.other_fileds.....

into table_with_updates
from old_table o left join new_table n on o.record_id=n.record_id

A few notes about this query. First the old table is left join to new_table so that all records in the old table are in the result of the select statement. Second the three case statements are to use the data field values in the new table if it has one and use the value in the old table only when new table doesn’t have it. In effect this is an update of the data fields. Third “into table_with_updates” instructs SQL server to create a table called table_with_updates and dump the result set of the select statement into the table.

After the table with updates is built, you can drop the old table and rename the table with updates to the old table. If both the new table and old table have clustered indexes built on the record_id columns, the select…into query usually can finish within a couple hours. That is more than 10 times faster!


Related Posts:

  • MySQL Update Statement
  • T-SQL Query to Get Database Size
  • MySQL vs SQL Server in Data File Management
  • How to Solve Problem That Excel Data Imported Into SQL Server Via DTS Became Null
  • A Useful Script to Get The Disk Space Used by Every Table


  • 1 Star2 Stars3 Stars4 Stars5 Stars (4 votes, average: 3 out of 5)
    Loading ... Loading ...

    RSS feed | Trackback URI

    6 Comments »

    Comment by Scott
    2008-04-07 21:43:59
    MyAvatars 0.2

    While I appreciate this is a time saving approach, and your testing indicates it has improved speed 10 fold…it may be worth mentioning that “Select Into” clauses are very inefficient.

    I carried out some similar tests on 2 million rows, and found Select Into took a lot longer than an Insert Into…Select From query.

     
    Comment by Andrew Subscribed to comments via email
    2008-04-16 05:32:11
    MyAvatars 0.2

    That’s probably true but I didn’t have that observasion. If anyone can provide some input on what is the underlying difference between “select * into table from” and “insert into table select * from” it will be great.

     
    Comment by adam
    2009-09-11 14:31:10
    MyAvatars 0.2

    select into and insert into CAN be equivalent in speed. select into will create locks while determining the data structure though. for small datasets, insert into is less typing. for cases where you will need to keep locks to a minimum, you can do a select into where 1=0 and then an insert into. Allows you to not have to code out the table creation while getting everything structurally you need.

     
    Comment by Michael Brönnimann Subscribed to comments via email
    2011-05-07 03:02:04
    MyAvatars 0.2

    Instead of going thru the INSERT and sp_rename path, which also requires to rebuild any other table constraints and indexes, you may consider executing a regular UPDATE with parallel execution…

    Parallelization for queries (SELECT) is covered quite well by the SQL engine itself, but when it comes to large volume data modifications (UPDATE, INSERT, DELETE), the standard engine does parallelize towards best use of all available resources (disk, multiple cpu-cores, etc.).

    Therefore you may have a look into the approach of SQL Parallel Boost at
    http://www.ibax.ch/-IBX-/plain.simpleimagetitletextlinklist.en.solutions.products.parallelboost/default.aspx

    This approach can also be used to execute multiple SQL statements in parallel.

    A purely SQL engine related parallelisation solution takes advantage of minimized complexity and has no ‘external’ components like SSIS involved, Furthermore it’s the best performing solution regarding task splitting and synchronization, as it hasn’t potential connection and communication overhead. The overall performance gain thru parallelisation with SQL Parallel Boost is up to 10 !

    In case you don’t wan’t to rebuild your own solution, SQL Parallel Boost provides a self-contained pure T-SQL based solution, which can be easily embedded in existing applications and ETL process tasks.

    Comment by Michael Brönnimann Subscribed to comments via email
    2011-08-23 03:55:02
    MyAvatars 0.2

    A free Community Edition of SQL Parallel Boost can be downloaded at CodePlex

    :smile:

     
     
    Comment by Tech Helper
    2011-05-10 10:37:16
    MyAvatars 0.2

    You will need to delete in stages for best results, see this thread for more on deleting large amounts of sql data.. :arrow: http://www.techhelpcentre.com/showthread.php?t=242889

     
    Name (required)
    E-mail (required - never shown publicly)
    URI
    Subscribe to comments via email
    Your Comment (smaller size | larger size)
    You may use <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong> in your comment.