A Fast Way to Update Large Amount of Data in SQL Server With Select Into Statement
November 24th, 2007 by Andrew Chen
I frequently need to update large amount of data, like tens of millions of rows.
update o set o.data_field1=n.data_field1, o.data_field2=n.data_field2, o.data_field3=n.data_field from old_table o join new_table n on o.record_id=n.record_id
If you do this you will probably find your update query run for 24 hours and still can not finish. Even when you have indexes correctly built on both the new and old tables. I think a much faster approach is to create another table based on the old table and incorporate the updates into the new table while the table is built. To do that you need the help of a select … into query like this
select o.recordid,
case when n.data_field1 is not null then n.data_field1 else o.data_field1 end as data_field1,
case when n.data_field2 is not null then n.data_field2 else o.data_field2 end as data_field2,
case when n.data_field3 is not null then n.data_field3 else o.data_field3 end as data_field3,
o.other_fileds.....
into table_with_updates
from old_table o left join new_table n on o.record_id=n.record_id
A few notes about this query. First the old table is left join to new_table so that all records in the old table are in the result of the select statement. Second the three case statements are to use the data field values in the new table if it has one and use the value in the old table only when new table doesn’t have it. In effect this is an update of the data fields. Third “into table_with_updates” instructs SQL server to create a table called table_with_updates and dump the result set of the select statement into the table.
After the table with updates is built, you can drop the old table and rename the table with updates to the old table. If both the new table and old table have clustered indexes built on the record_id columns, the select…into query usually can finish within a couple hours. That is more than 10 times faster!



While I appreciate this is a time saving approach, and your testing indicates it has improved speed 10 fold…it may be worth mentioning that “Select Into” clauses are very inefficient.
I carried out some similar tests on 2 million rows, and found Select Into took a lot longer than an Insert Into…Select From query.
That’s probably true but I didn’t have that observasion. If anyone can provide some input on what is the underlying difference between “select * into table from” and “insert into table select * from” it will be great.
select into and insert into CAN be equivalent in speed. select into will create locks while determining the data structure though. for small datasets, insert into is less typing. for cases where you will need to keep locks to a minimum, you can do a select into where 1=0 and then an insert into. Allows you to not have to code out the table creation while getting everything structurally you need.
Instead of going thru the INSERT and sp_rename path, which also requires to rebuild any other table constraints and indexes, you may consider executing a regular UPDATE with parallel execution…
Parallelization for queries (SELECT) is covered quite well by the SQL engine itself, but when it comes to large volume data modifications (UPDATE, INSERT, DELETE), the standard engine does parallelize towards best use of all available resources (disk, multiple cpu-cores, etc.).
Therefore you may have a look into the approach of SQL Parallel Boost at
http://www.ibax.ch/-IBX-/plain.simpleimagetitletextlinklist.en.solutions.products.parallelboost/default.aspx
This approach can also be used to execute multiple SQL statements in parallel.
A purely SQL engine related parallelisation solution takes advantage of minimized complexity and has no ‘external’ components like SSIS involved, Furthermore it’s the best performing solution regarding task splitting and synchronization, as it hasn’t potential connection and communication overhead. The overall performance gain thru parallelisation with SQL Parallel Boost is up to 10 !
In case you don’t wan’t to rebuild your own solution, SQL Parallel Boost provides a self-contained pure T-SQL based solution, which can be easily embedded in existing applications and ETL process tasks.
A free Community Edition of SQL Parallel Boost can be downloaded at CodePlex
You will need to delete in stages for best results, see this thread for more on deleting large amounts of sql data..
http://www.techhelpcentre.com/showthread.php?t=242889