SQL Server MVP Deep Dives - P8

Tester_sp calls the tested procedure with four different search strings, and records the number of rows returned and the execution time in milliseconds. The procedure makes two calls for each search string, and before the first call for each string, tester_sp also executes the command DBCC DROPCLEANBUFFERS to flush the buffer cache. Thus, we measure the execution time both when reading from disk and when reading from memory. Of the four search strings, two are three-letter strings that appear in 10 and 25 email addresses respectively; one is a five-letter string that appears in 1978 email addresses; and the last string is a complete email address with a single occurrence.
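The real tester_sp ships in the chapter's download; the sketch below shows roughly how such a harness can be put together. The four search strings are taken from the test output quoted below, but the timing and printing details are my own assumptions, not the book's code:

CREATE PROCEDURE tester_sp @procname sysname AS
-- Runs the given search procedure twice for each of four search
-- strings, flushing the buffer cache before the first call, so that
-- we time one cold (disk) and one warm (cache) execution of each
DECLARE @words TABLE (id int IDENTITY(1,1) PRIMARY KEY, word varchar(80) NOT NULL)
INSERT @words (word) SELECT 'joy'
INSERT @words (word) SELECT 'aam'
INSERT @words (word) SELECT 'niska'
INSERT @words (word) SELECT 'omamo@petinosemdesetletnicah.com'

DECLARE @id int, @pass int, @word varchar(80), @d datetime, @rows int
SELECT @id = 1
WHILE @id <= 4
BEGIN
   SELECT @word = word FROM @words WHERE id = @id
   SELECT @pass = 1
   WHILE @pass <= 2
   BEGIN
      IF @pass = 1 DBCC DROPCLEANBUFFERS WITH NO_INFOMSGS
      SELECT @d = getdate()
      EXEC @procname @word                -- run the procedure under test
      SELECT @rows = @@rowcount           -- rows from its final SELECT
      PRINT ltrim(str(datediff(ms, @d, getdate()))) + ' ms, ' +
            ltrim(str(@rows)) + ' rows   Word = "' + @word + '"' +
            CASE @pass WHEN 2 THEN '   Data in cache' ELSE '' END
      SELECT @pass = @pass + 1
   END
   SELECT @id = @id + 1
END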
Here is how we test the plain_search procedure. (You can also find this script in the file 02_plain_search.sql.)

CREATE PROCEDURE plain_search @word varchar(50) AS
   SELECT person_id, first_name, last_name, birth_date, email
   FROM   persons
   WHERE  email LIKE '%' + @word + '%'
go
EXEC tester_sp 'plain_search'
go

The output when I ran it on my machine was as follows:

 6660 ms, 10 rows     Word = "joy"
 6320 ms, 10 rows     Word = "joy"    Data in cache
 7300 ms, 25 rows     Word = "aam"
 6763 ms, 25 rows     Word = "aam"    Data in cache
17650 ms, 1978 rows   Word = "niska"
 6453 ms, 1978 rows   Word = "niska"  Data in cache
 6920 ms, 1 rows      Word = "omamo@petinosemdesetletnicah.com"
 6423 ms, 1 rows      Word = "omamo@petinosemdesetletnicah.com"  Data in cache

These are the execution times we should try to beat.

Using the LIKE operator—an important observation

Consider this procedure:

CREATE PROCEDURE substring_search @word varchar(50) AS
   SELECT person_id, first_name, last_name, birth_date, email
   FROM   persons
   WHERE  substring(email, 2, len(email)) = @word

This procedure does not meet the user requirements for our search. Nevertheless, the performance data shows something interesting:

          Disk   Cache
joy       5006     296
aam       4726     296
niska     4896     296
omamo@    4673     296

The execution times for this procedure are better than those for plain_search, and when the data is in cache, the difference is dramatic. Yet this procedure, too, must scan either the table or the index on the email column. So why is it so much faster? The answer is that the LIKE operator is expensive. In the case of the substring function, SQL Server can examine whether the second character in the column matches the first letter of the search string, and move on if it doesn't. But for LIKE, SQL Server must examine every character at least once. On top of that, the collation in the test database is a Windows collation, so SQL Server applies the complex rules of Unicode. (The fact that the data type of the column is varchar does not matter.) This has an important ramification when designing our search routines: we should try to minimize the use of the LIKE operator.

Using a binary collation

One of the alternatives for improving the performance of the LIKE operator is to force a binary collation, as follows:

WHERE email COLLATE Latin1_General_BIN2 LIKE '%' + @word + '%'

With a binary collation, the complex Unicode rules are replaced by a simple byte comparison. In the file 02_plain_search.sql there is the procedure plain_search_binary. When I ran this procedure through tester_sp, I got these results:

          Disk   Cache
joy       4530     656
aam       4633     636
niska     4590     733
omamo@    4693     656

Obviously, it's not always feasible to use a binary collation, because many users expect searches to be case insensitive. However, I think it's workable for email addresses: they are largely restricted to ASCII characters, and you can convert them to lowercase when you store them. The solutions I present in this chapter aim at even better performance, but there are situations in which using a binary collation can be good enough.

NOTE  In English-speaking countries, particularly in the US, it's common to use a SQL collation. For varchar data, the rules of a SQL collation encompass only 255 characters. Using a binary collation gives only a marginal gain over a regular case-insensitive SQL collation.
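plain_search_binary itself is in the chapter's download; presumably it is plain_search with the collation forced, along these lines:

CREATE PROCEDURE plain_search_binary @word varchar(50) AS
   -- Same search as plain_search, but the binary collation replaces
   -- the Unicode comparison rules with a plain byte comparison
   SELECT person_id, first_name, last_name, birth_date, email
   FROM   persons
   WHERE  email COLLATE Latin1_General_BIN2 LIKE '%' + @word + '%'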
Fragments and persons

We will now look at the first solution, in which we build our own index to get good performance with searches using LIKE, even on tens of millions of rows. To achieve this, we first need to introduce a restriction for the user: the search string must contain at least three contiguous characters. Next we extract all three-letter sequences from the email addresses and store these fragments in a table, together with the person_id they belong to. When the user enters a search string, we split the search string into three-letter fragments as well, and look up which persons they map to. This way, we should be able to find the matching email addresses quickly. This is the strategy in a nutshell; we will now go on to implement it.

The fragments_persons table

The first thing we need is to create the table itself:

CREATE TABLE fragments_persons (
   fragment  char(3) NOT NULL,
   person_id int     NOT NULL,
   CONSTRAINT pk_fragments_persons PRIMARY KEY (fragment, person_id)
)

You find the script for this table in the file 03_fragments_persons.sql. This script also creates a second table that I will return to later; ignore it for now.

Next, we need a way to get all three-letter fragments from a string and return them in a table. To this end, we employ a table of numbers: a one-column table with all numbers from 1 to some limit. A table of numbers is good to have lying around, as you can solve more than one database problem with such a table. The script that builds the database for this chapter, 01_build_database.sql, created the table numbers with numbers up to one million. When we have this table, writing the function is easy:

CREATE FUNCTION wordfragments(@word varchar(50)) RETURNS TABLE AS
RETURN (SELECT DISTINCT frag = substring(@word, n, 3)
        FROM   numbers
        WHERE  n BETWEEN 1 AND len(@word) - 2)

Note the use of DISTINCT: if the same sequence appears multiple times in the same email address, we should store the mapping only once. You find the wordfragments function in the file 03_fragments_persons.sql.
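The chapter's build script has already created the numbers table for you. For reference, if you need one in a database of your own, here is a minimal SQL 2005-compatible way to build it; the loading technique is my own, not the book's:

CREATE TABLE digits (d int NOT NULL PRIMARY KEY)
INSERT digits (d)
SELECT 0 UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3
UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6
UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9

CREATE TABLE numbers (n int NOT NULL PRIMARY KEY)

-- Six cross-joined copies of the ten digits give 10^6 combinations,
-- which shifted by one yields the numbers 1 through 1,000,000
INSERT numbers (n)
SELECT a.d + 10*b.d + 100*c.d + 1000*e.d + 10000*f.d + 100000*g.d + 1
FROM   digits a, digits b, digits c, digits e, digits f, digits g

DROP TABLE digits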
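To get a feel for what the function returns, you can call it directly; the address here is made up:

SELECT frag
FROM   wordfragments('banana.man@example.com')
ORDER  BY frag
-- 'ana' occurs twice in the address but comes back only once,
-- thanks to the DISTINCT in the function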
Next, we need to load the table. The CROSS APPLY operator, introduced in SQL 2005, makes it possible to pass a column from a table as a parameter to a table-valued function. This permits us to load the entire table using a single SQL statement:

INSERT fragments_persons(fragment, person_id)
   SELECT w.frag, p.person_id
   FROM   persons p
   CROSS  APPLY wordfragments(p.email) AS w

This may not be optimal, though, as loading all rows in one go could cause the transaction log to grow excessively. The script 03_fragments_persons.sql includes the stored procedure load_fragments_persons, which runs a loop to load the fragments for 20,000 persons at a time. The demo database for this chapter is set to simple recovery, so no further precautions are needed; for a production database in full recovery, you would also have to arrange for log backups to be taken while the procedure is running, to avoid log growth. If you have created the database, you may want to run the procedure now. On my computer the procedure completes in 7–10 minutes.
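The actual load_fragments_persons is in the chapter's script; here is a minimal sketch of such a batching loop, assuming person_id values are reasonably dense (the chunking scheme is my assumption):

CREATE PROCEDURE load_fragments_persons AS
-- Loads fragments_persons in chunks of about 20,000 persons, so that
-- no single statement bloats the transaction log
DECLARE @from int, @to int, @maxid int
SELECT @from = MIN(person_id), @maxid = MAX(person_id) FROM persons

WHILE @from <= @maxid
BEGIN
   SELECT @to = @from + 19999
   INSERT fragments_persons (fragment, person_id)
      SELECT w.frag, p.person_id
      FROM   persons p
      CROSS  APPLY wordfragments(p.email) AS w
      WHERE  p.person_id BETWEEN @from AND @to
   SELECT @from = @to + 1
END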
Writing the search procedure

Although the principle for the table should be fairly easy to grasp, writing a search procedure that uses it is not as trivial as it may seem; I went through some trial and error until I arrived at a good solution. Before I go on, I should say that to keep things simple, I ignore the possibility that the search string may include wildcards like % or _, as well as range patterns like [a-d] or [^a-d]. The best place to deal with these would probably be in the wordfragments function. To handle range patterns correctly would probably call for an implementation in the CLR.

THE QUEST

The first issue I ran into was that the optimizer tried to use the index on the email column as the starting point, which entirely nullified the purpose of the new table. Thankfully, I found a simple solution: I replaced the LIKE expression with the logical equivalent

WHERE patindex('%' + @wild + '%', email) > 0

By wrapping the column in an expression, I prevented SQL Server from considering the index on the column.

My next mistake was that I applied the patindex expression as soon as an email address matched any fragment from the search string. This was not good at all when the search string was a .com address. When I gave it new thought, it seemed logical to find the persons for which the email address includes all the fragments of the search string. But this, too, proved to be expensive with a .com address: the query I wrote had to read all rows in fragments_persons for the fragments .co and com.

ENTER STATISTICS

I then said to myself: what if I look for the least common fragment of the search string? To be able to determine which fragment this is, I introduced a second table:

CREATE TABLE fragments_statistics (
   fragment char(3) NOT NULL,
   cnt      int     NOT NULL,
   CONSTRAINT pk_fragments_statistics PRIMARY KEY (fragment)
)

The script 03_fragments_persons.sql creates this table, and the stored procedure load_fragments_persons loads it in a straightforward way:

INSERT fragments_statistics(fragment, cnt)
   SELECT fragment, COUNT(*)
   FROM   fragments_persons
   GROUP  BY fragment

Not only do we have our own index, we now also have our own statistics!

Equipped with this table, I finally made progress, but I was still not satisfied with the performance for the test string omamo@petinosemdesetletnicah.com. When data was on disk, this search took several seconds, which can be explained by the fact that the least common fragment in this string maps to 2851 persons.

THE FINAL ANSWER

I did one final adjustment: look for persons that match both of the two least common fragments in the search string. The listing below shows the procedure I finally arrived at.

Listing 1  The procedure map_search_five

CREATE PROCEDURE map_search_five @wild varchar(80) AS
DECLARE @frag1 char(3), @frag2 char(3)

; WITH numbered_frags AS (
    SELECT fragment, rowno = row_number() OVER(ORDER BY cnt)
    FROM   fragments_statistics
    WHERE  fragment IN (SELECT frag FROM wordfragments(@wild))
)
SELECT @frag1 = MIN(fragment), @frag2 = MAX(fragment)
FROM   numbered_frags
WHERE  rowno <= 2

SELECT person_id, first_name, last_name, birth_date, email
FROM   persons p
WHERE  patindex('%' + @wild + '%', email) > 0
  AND  EXISTS (SELECT *
               FROM   fragments_persons fp
               WHERE  fp.person_id = p.person_id
                 AND  fp.fragment  = @frag1)
  AND  EXISTS (SELECT *
               FROM   fragments_persons fp
               WHERE  fp.person_id = p.person_id
                 AND  fp.fragment  = @frag2)

The common table expression (CTE) numbered_frags ranks the fragments by their frequency, and the condition rowno <= 2 together with the MIN and MAX aggregates picks out the two least common fragments.

When rows in persons are inserted, updated, or deleted, the fragments_persons and fragments_statistics tables must be kept in sync, and the chapter's download does this with a trigger on the persons table. In the SQL 2008 version, the statistics maintenance at the end of the trigger is a single MERGE statement, which closes like this:

WHEN MATCHED AND fs.cnt + d.cnt > 0 THEN UPDATE SET cnt = fs.cnt + d.cnt
WHEN MATCHED THEN DELETE
WHEN NOT MATCHED BY TARGET THEN INSERT (fragment, cnt)
   VALUES(d.fragment, d.cnt);
go
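The full trigger ships in the chapter's download and is not reproduced here, so below is my own sketch, assuming the SQL 2008 flavor and the wordfragments function above, of how a trigger along these lines can be structured. The trigger name and the exact queries are my assumptions, not the book's code; the callout points B through F match the discussion that follows.

CREATE TRIGGER persons_fragment_tri ON persons
AFTER INSERT, UPDATE, DELETE AS
-- B: quick exit if the statement did not affect any rows at all
-- (@@rowcount must be read before any other statement in the trigger)
IF @@rowcount = 0 RETURN
SET NOCOUNT ON

-- C: second quick exit: on UPDATE of some other column there is
-- nothing to do, but test UPDATE(email) only when there are rows in
-- "inserted", or the trigger would exit on every DELETE as well
IF NOT UPDATE(email) AND EXISTS (SELECT * FROM inserted) RETURN

DECLARE @changes TABLE (fragment  char(3) NOT NULL,
                        person_id int     NOT NULL,
                        cnt       int     NOT NULL,
                        PRIMARY KEY (fragment, person_id))

-- D: compute the net changes: fragments of new addresses weigh +1,
-- fragments of old addresses weigh -1; a zero sum means the fragment
-- appears in both the new and the old address and can be ignored
INSERT @changes (fragment, person_id, cnt)
   SELECT x.fragment, x.person_id, SUM(x.weight)
   FROM  (SELECT w.frag AS fragment, i.person_id, 1 AS weight
          FROM   inserted i
          CROSS  APPLY wordfragments(i.email) AS w
          UNION ALL
          SELECT w.frag, d.person_id, -1
          FROM   deleted d
          CROSS  APPLY wordfragments(d.email) AS w) AS x
   GROUP  BY x.fragment, x.person_id
   HAVING SUM(x.weight) <> 0

-- E: apply the changes to fragments_persons; in the SQL 2005 version
-- this would be one INSERT and one DELETE instead of a MERGE
MERGE fragments_persons AS fp
USING @changes AS c ON fp.fragment  = c.fragment
                   AND fp.person_id = c.person_id
WHEN MATCHED AND c.cnt < 0 THEN DELETE
WHEN NOT MATCHED BY TARGET AND c.cnt > 0 THEN
   INSERT (fragment, person_id) VALUES (c.fragment, c.person_id);

-- F: maintain the statistics table; this is the MERGE whose closing
-- lines are quoted above
MERGE fragments_statistics AS fs
USING (SELECT fragment, SUM(cnt) AS cnt
       FROM   @changes
       GROUP  BY fragment) AS d ON fs.fragment = d.fragment
WHEN MATCHED AND fs.cnt + d.cnt > 0 THEN UPDATE SET cnt = fs.cnt + d.cnt
WHEN MATCHED THEN DELETE
WHEN NOT MATCHED BY TARGET THEN
   INSERT (fragment, cnt) VALUES (d.fragment, d.cnt);
go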
The trigger starts with two quick exits. At B we handle the case that the statement did not affect any rows at all. In the case of an UPDATE operation, we don't want the trigger to run if the user updates some other column, and this is taken care of at C. Observe that we cannot use a plain IF UPDATE, as the trigger would then exit directly on any DELETE statement; thus, the condition on IF UPDATE is valid only if there are also rows in the virtual table inserted. At D we get the changes caused by the action that fired the trigger. Inserted fragments get a weight of 1 and deleted fragments get a weight of -1. If a fragment appears both in the new and the old email address, the sum will be 0, and we can ignore it; otherwise we insert a row into the table variable @changes. Next, at E, we use this table variable to insert and delete rows in the fragments_persons table. In SQL 2008 we can conveniently use a MERGE statement, whereas in the SQL 2005 version there is one INSERT statement and one DELETE statement. Finally, at F, we also update the fragments_statistics table. Because this is only a statistics table, this is not essential, but it's a simple task—especially with MERGE in SQL 2008. In SQL 2005, this is one INSERT, one UPDATE, and one DELETE.

To test the trigger, you can use the script in the file 06_map_trigger.sql. The script performs a few INSERT, UPDATE, and DELETE statements, mixed with some SELECT statements and invocations of map_search_five to check for correctness.

What is the overhead?

There is no such thing as a free lunch, and as you may expect, the fragments_persons table incurs overhead. To start with, run these commands:

EXEC sp_spaceused persons
EXEC sp_spaceused fragments_persons

The reserved space for the persons table is 187 MB, whereas the fragments_persons table takes up 375 MB—twice the size of the base table.

What about the overhead for updates? The file 07_trigger_volume_test.sql includes a stored procedure called volume_update_sp that measures the time to insert, update, and delete 20,000 rows in the persons table. You can run the procedure with the trigger enabled or disabled. I ran it this way:

EXEC volume_update_sp NULL    -- no trigger enabled
EXEC volume_update_sp 'map'   -- trigger for fragments_persons enabled

I got this output:

             SQL 2005               SQL 2008
             No trigger  Trigger    No trigger  Trigger
INSERT took     1773 ms  40860 ms       700 ms  22873 ms
UPDATE took     1356 ms  32073 ms      1393 ms  35180 ms
DELETE took      826 ms  30123 ms       610 ms  28690 ms

The overhead for the fragments_persons table is considerable, both in terms of space and update resources, far more than for a regular SQL Server index. For a table that holds persons, products, and similar base data, this overhead can still be acceptable, as such tables are typically moderate in size and not updated frequently. But you should think twice before you implement something like this on a busy transactional table.

Fragments and lists

The fragments_persons table takes up so much space because we store the same fragment many times. Could we avoid this by storing a fragment only once? Yes. Consider what we have in the following snippet:

fragment  person_id
--------  ---------
aam           19673
aam           19707
aam           43131
aan           83500
aan          192379

If we only wanted to save space, we could just as well store this as follows:

fragment  person_ids
--------  -----------------
aam       19673,19707,43131
aan       83500,192379

Most likely, the reader at this point gets a certain feeling of unease and starts to ask all sorts of questions in disbelief, such as:

- Doesn't this violate first normal form?
- How do we build these lists in the first place?
- And how would we use them efficiently?
- How do we maintain these lists? Aren't deletions going to be very painful?
- Aren't comma-separated lists going to take up space as well?

These questions are all valid, and I will cover them in the following sections. In the end you will find that this outline leads to a solution in which you can implement efficient wildcard searches with considerably less space than the fragments_persons table requires. There is no denying that this violates first normal form, and an even more fundamental principle in relational databases: no repeating groups. But keep in mind that, although we store these lists in something SQL Server calls a table, logically this is an index helping us to make things go faster. There is no data integrity at stake here.

Building the lists

Comma-separated lists would take up space, as we would have to convert the IDs to strings; that was only a conceptual illustration. It is better to store a list of integer values by putting them in a varbinary(MAX) column. Each integer value then takes up four bytes, just as in the fragments_persons table. To build such a list you need a user-defined aggregate (UDA), a capability that was added in SQL 2005. You cannot write a UDA in T-SQL; you must implement it in a CLR language such as C#. In SQL 2005 a UDA cannot return more than 8,000 bytes, a restriction that was removed in SQL 2008. Thankfully, in practice this restriction is insignificant, as we can work with the data in batches.

In the download archive you can find the files integerlist-2005.cs and integerlist-2008.cs with the code for the UDA, as well as the compiled assemblies. The assemblies were loaded by 01_build_database.sql, so all you need to do at this point is to define the UDA as follows:

CREATE AGGREGATE integerlist(@int int)
RETURNS varbinary(MAX)
EXTERNAL NAME integerlist.binconcat

This is the SQL 2008 version; for SQL 2005, replace MAX with 8000. Note that to be able to use the UDA, you need to make sure that the CLR is enabled on your server:

EXEC sp_configure 'clr enabled', 1
RECONFIGURE

You may have to restart SQL Server for the change to take effect.

Unwrapping the lists

The efficient way to use data in a relational database is in tables. Thus, to use these lists, we need to unpack them into tabular format. This can be done efficiently with the help of the numbers table we encountered earlier in this chapter:

CREATE FUNCTION binlist_to_table(@str varbinary(MAX)) RETURNS TABLE AS
RETURN (SELECT DISTINCT n = convert(int, substring(@str, 4 * (n - 1) + 1, 4))
        FROM   numbers
        WHERE  n <= datalength(@str) / 4)
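To tie the two halves together, here is a sketch of how the UDA could populate a per-fragment list table. Both the table fragments_person_lists and its layout are my own placeholders for illustration; the chapter's actual table may be named and defined differently:

-- Hypothetical list table: one row per fragment, with all matching
-- person_id values packed, four bytes apiece, into a varbinary(MAX)
CREATE TABLE fragments_person_lists (
   fragment   char(3)        NOT NULL,
   person_ids varbinary(MAX) NOT NULL,
   CONSTRAINT pk_fragments_person_lists PRIMARY KEY (fragment)
)

INSERT fragments_person_lists (fragment, person_ids)
   SELECT w.frag, dbo.integerlist(p.person_id)
   FROM   persons p
   CROSS  APPLY wordfragments(p.email) AS w
   GROUP  BY w.frag

As with fragments_persons, a production load would run in batches to keep the transaction log in check.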

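And here is a sketch of how a search could use binlist_to_table to unpack such a list back into rows and join it to persons, again against the hypothetical table above:

SELECT p.person_id, p.first_name, p.last_name, p.birth_date, p.email
FROM   fragments_person_lists fl
CROSS  APPLY binlist_to_table(fl.person_ids) AS b
JOIN   persons p ON p.person_id = b.n
WHERE  fl.fragment = 'aam'   -- one fragment of the user's search string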