Simon Sabin
SQL Down Under Show 44 - Guest: Simon Sabin - Published: 30 Aug 2009
In this show SQL Server MVP Simon Sabin discusses full text indexes in SQL Server 2008 along with thoughts on learning MDX and on technical book writing.
Details About Our Guest
Simon Sabin is an independent SQL Server consultant and trainer in the U.K. His website is at onarc.com. Simon has worked with SQL Server since 1998 and has focused on high-performance, reliable systems. He has particular expertise in the world of search, distributed architectures, business intelligence, and application development. He loves making websites and applications run faster through scalable technologies and database performance tuning. Simon was awarded an MVP for SQL Server in 2006. He runs user groups in London and Cambridge, in particular www.sqlsocial.com. In 2007 he founded sqlbits.com, which is currently the largest SQL Server conference in Europe. He’s a regular speaker at SQL Server events as well as writing for his blog, which is at sqlblogcasts.com/blogs/simons.
Show Notes And Links
Show Transcript
Greg Low: Introducing show 44 with guest Simon Sabin. Welcome. Our guest today is Simon Sabin. Simon is an independent SQL Server consultant and trainer in the U.K. His website is at onarc.com. Simon has worked with SQL Server since 1998 and has focused on high-performance, reliable systems. He has particular expertise in the world of search, distributed architectures, business intelligence, and application development. He loves making websites and applications run faster through scalable technologies and database performance tuning. Simon was awarded an MVP for SQL Server in 2006. He runs user groups in London and Cambridge, in particular www.sqlsocial.com. In 2007 he founded sqlbits.com, which is currently the largest SQL Server conference in Europe. He’s a regular speaker at SQL Server events as well as writing for his blog, which is at sqlblogcasts.com/blogs/simons.
So welcome Simon! It’s good to have you on the show finally. First up, how did you ever come to be involved with SQL Server in the first place?
Simon Sabin: I graduated from university in 1997 and joined a consultancy, which at that time was CMG; it has since been absorbed into another company. I joined as a graduate and became interested in working with systems. I was put on Oracle systems to start with, working on a small Oracle system running the deregulated energy market. As with many consultants, you go where the work is, and the next place I worked was with SQL Server. I went on a training course, picked up SQL Server 6.5 at that point, and went to work on a Microsoft SQL Server project converting a FoxPro claims system into SQL Server, which was amazing. It really got me into the whole performance aspect of databases. When you’re converting data that has millions of rows with x number of columns, you soon find how the smallest operation on converting a single column can impact your performance hugely. A million times a hundred is a large number. Even if you can make milliseconds worth of savings per operation, that starts impacting the performance. And so I went from there and haven’t looked back.
Greg Low: Tell us a little bit about your current role.
Simon Sabin: I’m currently an independent consultant in the U.K. I’m doing training at the moment, which is always fascinating to me, because as an MVP you get so ingrained in the MVP world, working with some of the best people in the world, that you soon forget some of the basics of SQL Server. A lot of people don’t understand a lot of the internal stuff. It’s really fascinating for me to try to pass that information on to other people so they understand indexes properly, and to try to dispel lots of little myths about indexing and database tuning.
Greg Low: It’s interesting that you mention that. I’ve been running a few sessions lately where again it’s just absolute basics, sort of indexing and index structure and index design and things like that. It’s easy to forget what’s just not common knowledge, unfortunately.
Simon Sabin: Yes, it is quite amazing. You come across so many scenarios where people know about indexes but they’ve indexed something the wrong way. In the classic case they have a table and a query where first name is Fred, so they put an index on the first name alone. Because the index doesn’t cover the query, the index doesn’t get used.
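A sketch of the scenario Simon describes (table, column, and index names are all illustrative): an index on first name alone often won’t be used because it doesn’t cover the query, whereas an index that includes the other queried columns serves it directly.

    -- Non-covering: the optimizer may ignore this for the query below
    CREATE INDEX IX_Person_FirstName ON dbo.Person (FirstName);

    -- Covering: includes every column the query needs
    CREATE INDEX IX_Person_FirstName_Covering
        ON dbo.Person (FirstName)
        INCLUDE (LastName, Phone);

    SELECT FirstName, LastName, Phone
    FROM dbo.Person
    WHERE FirstName = 'Fred';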
Greg Low: And it’s not very selective in the first place. It’s kind of interesting. I was mentioning on my blog that most of the time I only recommend books I really like, in the category of if you can’t say something nice, don’t say anything. I often have people say, “every time I read a review I know you’re going to recommend it at the end, because you only do ones you recommend”. So I broke with tradition recently and posted some criticism of one book. Database refactoring is a scenario that’s fascinated me for a long time. Scott Ambler’s book was the first one that really tackled the subject in a good way, I thought, not just refactoring in the normal dev world but in the database world, which I thought was kind of good. The thing is, I didn’t like the way he did things, because everything was done at the lowest common denominator of SQL. It was all done via triggers and things, and the way he did it you just wouldn’t necessarily do in SQL Server. I was always hoping there would be another book that came along. I broke one of my own rules in that I was on the Amazon site and it said if I enjoyed one book I’d probably like this one. It pointed me to a book called “Refactoring SQL Databases”, so I picked it up. It was a pre-order as well; it hadn’t even come out. Normally I only order books when people have said, “hey, Greg, this is a good book, go and get it”. So I went and got this book, and as I was reading it I described it like a train wreck of a book. I was reading it and could see it was going in the wrong direction and it wasn’t going to be a pretty outcome, but it was too late to do anything about it. It was the indexing that threw me the most. The author was building indexes that weren’t very good for the queries he was dealing with, then coming to the conclusion that indexing itself wasn’t very useful. It was really sad. But it was interesting that typically he was doing exactly the same thing: he would have some column he thought he was selecting on and then index that column only. It was just a horrible index.
Simon Sabin: Yes. The other thing is when people have two filters, so they put on two indexes and think SQL Server will use both of them. In some scenarios it will, but largely it won’t. One thing I try to get across to people when I’m training is that SQL Server isn’t magic. It doesn’t automatically merge your data and tables together. It’s code. It’s sqlservr.exe. It’s got loops in it. It’s got data structures in it. It is just that. If you can’t imagine how you would write the code to do it yourself, then ...
Greg Low: ... Guess what, it’s probably not going to work. It’s funny you use that analogy. I tend to use a similar one. I look at people who spend their life looking at execution plans, trying to work out what’s wrong. To me it always seems you should be able to work out how to do it yourself and you should almost be using the execution plan just to see that it ended up doing it the way you expected it to do it. I always like the mental picture of how I know I would actually answer that query if I had to, as a human sitting down and doing it.
Simon Sabin: Yeah. The area I find most fascinating is when you’re dealing with large volumes of data and you’re combining a table with 200 million rows with another table with 100 million rows and another table with 100 million rows; that’s when things go a bit off the wall. It’s interesting when you look at the features going into SQL Server: a lot of them are around those sorts of areas, where really it comes down to theoretical, mathematical algorithms the database engine has to come up with. The whole bitmap filter stuff in 2008, for instance. You mention books. I’m editing a book at the moment by Christian Bolton, on troubleshooting. I did some editing at the beginning of the year. I think it’s very easy to write something, but to write something as a book that reads well is very, very difficult.
Greg Low: It’s tough. What people underestimate is that even when you get the technical part right, the process the thing then goes through is often amazing. In fact, I’ve just finished a chapter for the next Inside SQL book. I’ve done extended events and auditing and change tracking and change data capture. But once you have the technical stuff down, there’s the whole process of tech editing: pulling it in, making it flow right, even little things like making sure that if somebody is following along with the code in the book, at the end you always undo all the things you did in the chapter. There’s all this stuff which is quite separate from getting across the technical concept. Other things I’ve found quite strange: even the names of features, and their capitalization. MS Press determines that based on what’s in Books Online, so if a particular feature is capitalized in Books Online, then the book will capitalize it. Interestingly, change data capture is lowercase c, d, and c, but Extended Events is capital E and capital E. I was going, whoa. It’s kind of an interesting exercise going through that. I think it’s interesting, the amount you have to focus your thinking to get it down to a readable chapter in a book, let alone a whole book.
Simon Sabin: I’ve been blogging a long time; I’m one of the longest-standing SQL bloggers. That’s partly how I got where I am in the first place. Sometimes I sit with books and wonder if books are really the thing, because trying to write in such a concise way to fit into a book, you’re often going to miss so much stuff, whereas in a blog you can write endless blog posts and change them. But books are still popular.
Greg Low: Yes, they are. I still love them. I love being able to pick up a book and take it with me. I do spend a lot of time on planes, and sitting on a plane reading a book still works for me. It really does.
Simon Sabin: That’s one thing that’s nice about books. I’ve got a room full of them, luckily. People send me books to review and stuff, so I’ve got a nice set of books. I was learning MDX the other day and it’s always nice to flip through a page.
Greg Low: There’s an interesting question. What book did you read for MDX?
Simon Sabin: I’ve actually come up with a rule: it’s the third time that you really learn something, I think. The first time you learn something, it’s difficult to pick it up and be able to carry on and run with it. The second time, you get a bit more familiar. The third time you go in, you start seeing stuff, you end up with this recollection, and you can then start moving things forward. The third time I picked up MDX I was reading two books. One was Chris Webb’s applied MDX book, the other was a technical consulting MDX book; I can’t remember exactly what it was, but it covered things nicely. I think the problem going from a relational perspective to MDX is that it’s a head thing. You’ve got to get it straight in your head to figure out the differences and the similarities between your SQL and your MDX.
Greg Low: I’ve also had one of my colleagues on, Mark Whitehorn, also from the U.K. Mark had a SQL 2000 book; it was called Fast Track to MDX, or something like that. I actually really enjoyed that book. I just wish he’d built a 2008 version of it. I did talk to him recently about that, and he said he might.
Simon Sabin: I remember meeting Mark at the 2005 or 2008 launch. He’s a writer, isn’t he?
Greg Low: Yeah. The three people involved in that book, one was the Product Manager, the other one was Mosha who basically came up with the language, and Mark was the professional writer and technologist who was the one putting the book together.
Simon Sabin: Yeah. I think one thing with MDX, as with SQL, is that rewriting queries can phenomenally change how you get things done. Amazing.
Greg Low: In terms of being able to change apps and things, today’s topic is search. One thing I’ve found is that whenever I do a session on full text search, which has been part of SQL Server for a pretty long time now, there are three groups of people in the room. Most of the room has never touched it. There are a few people who have come along to find out what’s different in the latest version. And the remaining handful have old pain associated with earlier versions. Do you strike similar things?
Simon Sabin: Yeah. There’s probably a fourth group as well: those that are really using it. They’ve decided to either keep it or go with something else like Lucene or one of the other open source search technologies. I was pushing it for a while; I’ve got a whole series of posts on full text in SQL 2008. You go along to a session with 100 people and two people put their hands up to say they’ve used full text. It’s a bit like replication. It’s got a big stigma attached to it. People either don’t know about it, or they think you can’t touch that, it’s really complicated, it doesn’t work. It really bugs me when people say that about replication as well.
Greg Low: Configured properly, it’s actually really good. And what intrigues me with replication is that people always presume they can roll their own and it’ll be better anyway.
Simon Sabin: Yeah. I remember meeting a Microsoft evangelist in the U.K. last year. He was off to a client who wanted to roll their own replication. Microsoft has been doing replication for ten years, so they’ve found out about a lot of those issues you’ll come across.
Greg Low: The odd few things you might not have thought of. Another one I’ve found is Service Broker. I was at a site just this week where people were building their own queuing system. You start asking things like how do you handle poison messages, and they go, what? There’s a reason somebody else has gone to a lot of trouble thinking about these things.
Simon Sabin: I’m building a session on Service Broker which is an interactive session. The hope is that we turn up with a router and a laptop, and people turn up with their own laptops. We have them download a little install which sets up Service Broker between their machine and our machine. It allows us to show how you can scale out using Service Broker. We start off using the local machine and getting so many transactions per second. Then we turn off processing on the local machine and just distribute all the requests out to the people in the audience and watch things pick up. What’s great is that you can get people to turn off their machines, and requests just start reaching other machines; it’s such a resilient mechanism.
Greg Low: It’s interesting that it’s underutilized. We did a podcast on that with Niels Berglund a few years ago; I think that’s another one I must get some material on. Whenever I talk to the product group, it always comes up, with people saying it’s underutilized. But then I say, hang on, you gave us no tools and no prescriptive guidance. You gave us good plumbing, but without those other things you can’t really expect there will be a lot of adoption. In terms of full text search, to get onto that one, maybe we could start with what you can do with full text search and why it should be of interest to anybody. I find if I put full text search in the title of a session, for say TechEd, people say I don’t use that, and they don’t come. But if I use a more meaningful title, like answering the queries your users ask, then you can start to get people to take another look at it, or realize this might be something of interest. So what is it we do with full text search that would be of interest to people? Why does it matter?
Simon Sabin: With full text search, what you have the ability to do is look for words within text. Most people start off with something like a product description on their product table: someone’s doing a search on the website and they want to find anything that is a car. So you can look within that product description for the word car. Without full text search, what you’re limited to is something like LIKE '%car%'. That would return everything that has the three letters c, a, r in it. That’s got a few limitations. One is that it will match anything with car in it; it doesn’t search for the specific word. More importantly, because you’re doing %car%, SQL Server can’t use an index, so what it’s going to do is essentially scan the whole table to get hold of that data. With full text search, what it does is essentially take the product description, split out all the words contained within it, and create its own little index, the full text index, that’s got those words in it. When you come along and say you want every product that contains the word car, it can go and find anything that has car in the full text index and return that data very efficiently. It enables you to search for words within your actual data. Essentially, that’s what full text search does.
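A minimal sketch of the difference, assuming an illustrative Product table with a unique key index named PK_Product (all names hypothetical):

    -- One-off setup: a catalog and a full text index on the description
    CREATE FULLTEXT CATALOG ProductCatalog;
    CREATE FULLTEXT INDEX ON dbo.Product (Description)
        KEY INDEX PK_Product
        ON ProductCatalog;

    -- Word-aware search: matches the word "car" via the full text index
    SELECT ProductID, Name FROM dbo.Product
    WHERE CONTAINS(Description, 'car');

    -- Character-pattern search: also matches "carpet", and forces a scan
    SELECT ProductID, Name FROM dbo.Product
    WHERE Description LIKE '%car%';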
Greg Low: It’s an interesting technology. A similar example I often use: say you go and search for pen, you’re going to find pencil and pendulum, and of course with the Internet you’re going to find anything to do with the male anatomy. There’s going to be stuff everywhere, everything but what you actually want to go looking for. But we also have the ability to do more than simple words; suppose we start with what you can do, for example.
Simon Sabin: Beyond looking for simple words, you’ve got the ability to find words near each other. If you want to find anything about SQL Server Integration Services, you could put it in quotes and look for the specific phrase, but you could also look for something where SQL Server is near Integration Services, and so you’re finding more relevant content. That’s the key aspect of full text search. Search, generally, is all about relevance. Search is not a precise technology. There’s no way it can take a search for something like Integration Services and give you back exactly the one document you want. So the aim of search is to provide the user the data in a relevant way. If you’re searching for pen and you’re using LIKE, what you’ll find is lots of irrelevant content, which means the search is a very poor experience. With full text search, you can essentially provide better relevance. What near does is expand beyond the exact phrase SQL Server Integration Services to other documents that have SQL Server and Integration Services together, but maybe not as a phrase. What you’re doing is getting more content which may be relevant to the user.
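A sketch of phrase search versus proximity search (table and column names illustrative):

    -- Exact phrase only
    SELECT ArticleID FROM dbo.Article
    WHERE CONTAINS(Body, '"SQL Server Integration Services"');

    -- Proximity: the terms near each other, not necessarily as one phrase
    SELECT ArticleID FROM dbo.Article
    WHERE CONTAINS(Body, '"SQL Server" NEAR "Integration Services"');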
Greg Low: I suppose as soon as you get into fuzzy things you have a concept of ranking: the fact that something is nearer to something else increases, perhaps, its relevancy to you. Most people have had the experience where they go searching for something in a system, they search for two words, and in what comes back one word is at the top of a 50 page article and the other word is at the bottom. It just has no relevance at all in terms of what you’re looking for. What I find interesting is that if I look at how IT people work, IT people love systems that are, I would say, nice and neat and precise, yet end users don’t. They love stuff that is soft and fuzzy. You’ve only got to look at the success of search engines to know that that’s how people actually want to search, yet in no systems do we give them that ability.
Simon Sabin: That’s an interesting comment, because there are definitely two different types of people in the world. I think of search as just finding content. There are those that like doing a keyword based search. In a previous job I ran the database architecture for cwjobs.com and other job sites in the U.K. It’s all about people finding jobs. What we found is that a certain number of people like doing keyword based searching, but having to put in complex keyword structures to try and find the jobs you want is maybe not what you want. Then there’s this other area, which I’d call discoverability, which is about people browsing for data. They know they live in London, so they’ll click on London. They know they want an IT job, so they’ll click on IT. They know they want something in databases, so they click on database. They want something using SQL 2008, so they click on SQL 2008. It means people can browse the data. So there are two areas there, and what I find is that search needs to encompass both of them. eBay, for instance, is very good at this, in that I can often start off with a search and then go and start clicking on subcategories of data. I was most recently looking for a bed for my son, so I put in a specific type of bed. You can’t really put every specific into a search; you want to see what’s there and then drill down into the data to find it. What that means is it provides the user a more relevant result, much easier and more efficient than a results page with 10,000 results. I can narrow that down further.
Greg Low: There’s another thing I notice with that too: I like the option in full text search where you can search for forms of words. If a human thinks of a word like drive, they think of drives and drove and driving and so on. They think they’re all the same word. But computers don’t.
Simon Sabin: Yes. Computers just do bits and bytes. They don’t understand sentiment and meaning and all that stuff. That’s why full text search does have thesaurus functionality, but it also has the ability to do inflectional forms, so you can search for drive and find those related words. One area where SQL Server full text doesn’t help you out of the box is where you have forms of words that aren’t actually linguistic forms. If I’m looking for content on Internet Explorer, then Internet Explorer as a phrase is pretty specific, and IE isn’t a language form of it. But if someone wants to search on IE, then IE and Internet Explorer should be treated the same. That’s what the thesaurus gives you: it allows people to search for one of those terms and get back anything that includes either of them. In 2008 you’ve got some nice thesaurus functionality. One thing it doesn’t help with is spell checking. If you’re doing anything with a website, spell checking is often very, very crucial.
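A sketch of both (table and column names illustrative; the IE expansion assumes a thesaurus entry like the one shown further below):

    -- Inflectional forms: matches drive, drives, drove, driving, driven
    SELECT DocumentID FROM dbo.Document
    WHERE CONTAINS(Body, 'FORMSOF(INFLECTIONAL, drive)');

    -- Thesaurus expansion: matches whatever the language thesaurus maps
    SELECT DocumentID FROM dbo.Document
    WHERE CONTAINS(Body, 'FORMSOF(THESAURUS, IE)');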
Greg Low: Have you tried using thesaurus to take care of common misspellings?
Simon Sabin: I have, but one of the things about the thesaurus is that it lives outside the database. Unfortunately, the ability to manage the thesaurus from within the database didn’t make it into SQL Server 2008.
Greg Low: I think it’s worth mentioning, too, that the file is an instance level file, so it applies across all the databases. To me, that’s an absolute killer, because in a hosted SQL environment you may want to modify the thesaurus when you only have access to one database. I think they really, really have to get the thesaurus into the database.
Simon Sabin: Yeah. It’s just an XML file. One thing is that if I put a replacement term in, it’s only expanded one way, whereas with spell checking you might want to expand it more than one way. My preference is to implement the spell checking and matching functionality in your application tier, which then passes the queries into the database. Essentially that’s what a search engine is doing when it asks “did you mean this?”: putting in a spell checked or inflected version of what you’ve typed. If you’ve spelled Microsoft wrong, it can work out that Microsoft is a common term and suggest the form that’s more common.
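For reference, the per-language thesaurus file (for example tsenu.xml for U.S. English, in the instance’s FTData folder) is plain XML along these lines; the entries here are illustrative:

    <XML ID="Microsoft Search Thesaurus">
      <thesaurus xmlns="x-schema:tsSchema.xml">
        <diacritics_sensitive>0</diacritics_sensitive>
        <!-- expansion: any term matches any of the others -->
        <expansion>
          <sub>IE</sub>
          <sub>Internet Explorer</sub>
        </expansion>
        <!-- replacement: the pattern is rewritten one way, as Simon notes -->
        <replacement>
          <pat>sql2k8</pat>
          <sub>SQL Server 2008</sub>
        </replacement>
      </thesaurus>
    </XML>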
Greg Low: What’s hilarious with the Google one is the insight it gives into how the industry has moved. Many years ago on the Internet, people would discuss protocols and things like that; it was all commonplace. It was quite funny: I went searching the other day for SMTP, and it came back and asked, did you mean smut? I thought that was an interesting insight into how the whole industry of the Internet has moved.
Simon Sabin: Yeah. One of the other things I was working on last year was a music download site. One thing I soon realized is that from a search perspective, if your name is Pink, you’ve got a much better chance of your content being found. If you’re Rihanna or someone else with a fairly obscurely spelled name, you rely on someone managing to spell that name correctly. I was really, really shocked when we did some analysis: something like 50% of searches didn’t match anything. The reason they didn’t match was because people had spelled things wrong, or because of what they had done instead. Imagine you have music. You have the artist, you’ve got the description of the track, you’ve got the album, you’ve got the track title. That’s four elements, and you’ve got genre as well. If someone does a search against the track title field but puts in Robbie Williams Angels, and you’re looking for an exact match, it’s not going to find it, because Robbie Williams doesn’t belong in the track title. What you need to do is combine all your data together into a single column which has got all the information, and then full text index that. That means you can put in Robbie Williams Angels and you’ll find the tracks relating to it. Interestingly, going back to your comment about ranking, one of the things people always lament about SQL Server ranking is that it’s a fairly fixed mechanism that you can’t change. One of the nice things about SQL Server 2008, and people don’t really appreciate this, is that you can do your own ranking. What we realized with this download site is that if you put in Robbie, the likelihood is you probably want Robbie Williams, based on popularity. In SQL Server 2008 and before, there’s no popularity based ranking directly. But what you can do very easily, because of the way they’ve changed things in SQL Server 2008, is add another table which lets you rank things whichever way you want: you just join to that table and order by sales ranking or popularity or whatever you want. So SQL Server does allow you to do your own custom ranking, but you have to stick it in another table and join to it. That allows you to return Robbie Williams rather than some obscure guy called Robbie who, because Robbie is an exact single-word match, would generally rank higher than Robbie Williams from a content perspective, based on how many times the word occurs and that sort of thing.
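A sketch of the pattern Simon describes: let the full text engine find the matches, then impose your own ordering from a popularity table (all object names illustrative):

    SELECT t.TrackID, t.Title, t.ArtistName
    FROM CONTAINSTABLE(dbo.Track, SearchText, 'Robbie') AS ft
    JOIN dbo.Track AS t ON t.TrackID = ft.[KEY]
    JOIN dbo.TrackPopularity AS p ON p.TrackID = t.TrackID
    ORDER BY p.SalesRank;   -- your ranking, not the engine's textual [RANK]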
Greg Low: That’s a good point to take a break and when we come back we’ll talk a bit on what on earth we have to do to implement this stuff.
-----------------------------
Greg Low: Welcome back from the break. So Simon, tell us. Is there a life outside of SQL Server?
Simon Sabin: Yeah, just about. At the moment, though, you mentioned the SQLBits conference earlier; I’m planning the next one in November. When that starts kicking in, which is happening now, it just takes over my life, which is interesting given that I’ve got a new baby coming in two or three weeks. I’m trying to organize myself now so that I have some spare time when that comes along.
Greg Low: That’s great. With full text search, then, given that we’ve said this is a desirable thing to be able to do, in 2008 there are some substantial changes to the architecture. If you look back at 2000 and earlier, it was pretty clunky to look after and not very fast at processing things. In 2005 they tried to fix that a bit: even though the catalog still lived out in the operating system, when you did a backup of a database it backed up the full text catalog too and restored it back; it was treated like an external file. But the good part of 2008 is that it’s all in the product.
Simon Sabin: It is. They’ve rewritten a lot of it and stuck it within the engine. This has been going on from a development perspective for quite a long time. They see search as being very important; you can see from Microsoft’s acquisitions and partnerships that search is key. There’s so much data now and people need to find it. There are a lot of changes in the way it works. Essentially, it’s now stored in structures within the database, which means backup works a lot more nicely, and moving and detaching and attaching becomes easy because it sits within the database. Anyone that’s tried to move a full text index in SQL Server 2005 and before will appreciate the pain. But there are some interesting changes if you’re moving from 2005 to 2008. I talked about the index that full text creates. As with any index in a database, the best index key on a table is a small, integer based column; that means lookups are more efficient. So the full text index generates its own integer ID as its key. What that means is that if the key on your table is, say, 5 or 20 bytes, or even a composite key, there has to be a mapping between those keys: between the full text index key and the key on your table.
Greg Low: You should probably note that you have to have a primary key on the table anyway.
Simon Sabin: Yes. And it has to be clustered as well, which is an interesting view. With an XML index you have to do the same, and I think the same with a spatial index as well. It’s a joy for clustered foreign keys. Anyway, that’s another religious debate which we won’t get into. In 2000 and 2005, part of the performance issue was the mapping between these two structures. In 2008 there can still be a mapping; it’s called the doc ID map, and it maps the doc ID, which is the full text index key, to the keys of your table. What you should be doing if you’re full text indexing data is to have a small integer key column on your table. If you have that, the full text index will use it as its own key as well, so you don’t have this mapping table. Imagine you have a query for finding stuff in the full text index, something like select from Person where contains forename, Simon; that is, your person table contains a forename of Simon. That looks fairly straightforward, and you’d think it just uses the index straight off that table. But if the key on your person table isn’t an integer key, what you’ll find is you’re actually joining three structures together: you’re joining the full text index results to the doc ID map, then joining to the person table to get the rest of the data. That can be very arduous and very poor from a performance perspective, especially if you’ve got large tables. Removing the doc ID map means you’re joining your full text index directly to your person table, so you get much better performing queries, because the optimizer has many more options about using nested loops and merge joins and those sorts of things. The reason I’m going into some depth is that if you’re migrating from 2005, you will have this doc ID map. When you upgrade your database, you’ll have a number of options for your full text indexes: you can have them rebuilt, you can have them imported as they are, or you can have them reset. Reset just empties them, so they can’t be used until repopulated; that doesn’t serve any purpose here. Import maintains the doc ID map, so even if you have an integer key, you don’t get these huge performance improvements. So you need to be sure you select the rebuild option. SQL Server by default chooses this, thankfully, I’ve been pestering them, and so your index will get rebuilt with the proper structure. That’s where you get the performance benefits. If you look at the plan for your full text queries and you find a doc ID map referenced, and the table you’re indexing has an integer key, that suggests your index was imported rather than rebuilt. What you should do is get the index rebuilt, and that will get rid of the doc ID map.
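A hedged sketch of checking and forcing this: Books Online documents an instance-wide upgrade_option for sp_fulltext_service (0 = import, 1 = rebuild, 2 = reset), and a full repopulation can be requested per index (table name illustrative):

    -- Instance-wide behavior for full text indexes during upgrade
    EXEC sp_fulltext_service @action = 'upgrade_option', @value = 1;  -- rebuild

    -- Rebuild the contents of one imported index
    ALTER FULLTEXT INDEX ON dbo.Person START FULL POPULATION;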
Greg Low: The other thing that I’ve found that makes a big difference to the performance of the query plan is there is now a single query plan that includes full text predicates as well as your normal relational predicates.
Simon Sabin: Yes, true. As with the XML index stuff, you can see operators in the plan being used against the full text index, extracting the index records from it; there’s essentially an operator similar to the spatial index sort of thing. The one thing I would say about query plans, and I’m writing a white paper now on when query plans go wrong, is that full text is an area where query plans can go very, very wrong. It all comes down to statistics. The query optimizer can’t cost things unless it has statistics; it doesn’t know if your table has one row or a million rows, so it has to come up with some very generic plans. In 2005 they introduced some statistics on your full text data, and in 2008 you’ve also got statistics. What this means is that if you search for, say, a forename of Simon, the optimizer can get a very rough idea of how many records are coming back. The downside is that if you parameterize your queries, you can really get into a bad state. If someone comes in and puts some obscure name in first, the optimizer says, oooh, I’m not going to match any rows, and you end up with a plan built for no rows. If the next query is for the surname Jones, which in the U.K. matches a lot of people, your query performance can really go through the floor. It’s an area to be really conscious of. Personally, I find that forcing a recompilation on full text queries can be better in the long run, because search type queries generally aren’t instantaneous anyway; they can take between hundreds of milliseconds and seconds, so taking the slight cost of query compilation might be better for you in the long run.
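A sketch of that approach, assuming an illustrative stored procedure: recompiling per execution lets the optimizer build a plan for each search term’s actual selectivity.

    CREATE PROCEDURE dbo.SearchPeople @term nvarchar(200)
    AS
    SELECT PersonID, Forename, Surname
    FROM dbo.Person
    WHERE CONTAINS(Surname, @term)
    OPTION (RECOMPILE);  -- trade a little compile time for a better plan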
Greg Low: It’s interesting. The other thing I’ve noticed is that it seems to be a lot cleverer now. Before, if I had two predicates, sort of a contains clause and then something like a document type, it would look through everything all the time. It seems a lot more clever now at not doing that.
Simon Sabin: Yeah. There’s a big thing about mixed predicate queries, where you’ve got a full text filter, like surname contains Jones, and a relational predicate, like person.sex is male. Those now work really well in 2008; in 2005 and before you had to do some interesting work with them. Because the full text structures are now within the database, the optimizer has got the full range of query plan and join options for them, so you can get nested loop joins and so on, and it can really work well: if the relational predicate only returns a few rows, you can end up with a nested loop join into your full text index. The downside with 2005 was that if your relational predicate appeared to be highly selective, the optimizer could end up with a nested loop join where on the inner side was the full text index remote scan; full text in 2005 and before was exposed as a remote search service. With a nested loop, for every row that comes in, you go off and execute that inner query. If the optimizer thought there was only one row being returned but you actually had 1,000 rows, the engine would go off and make a thousand calls to the external search service, and your search just grinds to a halt. I remember when we upgraded to 2005, the query optimizer tried to be clever and ended up shooting itself in the foot. Looking at Profiler, we ended up with queries that had like a trillion reads; the number in the reads column was so big. It was ridiculous. If you equate it into pages, it must have been billions; it was processing terabytes of data. It was bizarre. So keep an eye out for bad query plans.
Greg Low: One thing I quite like, in terms of performance: I built a graph for an event a couple of years ago. What I did was populate an index over about 20 million rows, each with 1K of data, on absolutely identical hardware. It was quite cute. In SQL Server 2008 it took about six minutes; in 2005 it used to be about 12 minutes. But notably, in SQL Server 2000, it was 20,000 minutes. The difference in performance is extraordinary. The bit I like is that I also drew the graph for SQL Server 7; of course, the people who did that are still waiting for it to complete today. Part of the old pain is that those who tried it around the SQL Server 2000 days found the performance of creating indexes and things dramatically slower than it is now.
Simon Sabin: Yes. The population aspect is so much quicker and so much more efficient. You still use the same technologies, essentially, which is the ifilter stuff that takes the content out of your column and breaks it down into words. That’s pretty much still the same, but it does things in batches rather than individual reads and those sorts of things.
Greg Low: In fact, it’s probably worth, for those who haven’t looked at this, just a quick wander through the main terminology. An ifilter is where we usually start. Maybe if you could just define that.
Simon Sabin: The ifilter is the technology which takes content, and it’s content-type specific, and extracts text from it. For example, you have a Word ifilter, an Excel ifilter, a pdf ifilter, a jpg ifilter. They extract the textual content out of that document, whether it’s an image or whatever. A lot of those come in the product. If you use SharePoint or Exchange or any of the Microsoft search technologies, they all use ifilters.
Greg Low: It’s worth noting, too, that the output of that, even when we say text, is really more a stream of characters at that point. So we’re not yet at the stage of talking about words.
Simon Sabin: No, it is just characters, because all you want out of these things is characters. One thing to know about ifilters is that if you have 64-bit servers, which most people are using, there is no 64-bit pdf ifilter in the product; you have to go off and get one. What’s also interesting is there’s no Office 2007 ifilter in the product either. You have to go and get it; there’s an ifilter pack, which is on my blog, or just search for Office ifilter and download it from Microsoft.com. So what you have at this point is just characters from the content. That needs to be broken up into words, and you have these things called word breakers. Word breakers are language specific. What they do is take the content and break it down into specific words. The reason they’re language specific is that some languages have compound words, so you want to split a word into multiple words; you have different word-boundary rules; and in some languages you need to break on certain characters. Japanese or Mandarin and all those sorts of languages have their own specific word breakers. That’s what breaks the content down into the individual terms in your document.
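A sketch of checking what the instance can index, and picking up newly installed filters (both sp_fulltext_service actions are documented; appropriate permissions assumed):

    -- Which document types have ifilters registered?
    SELECT document_type, path
    FROM sys.fulltext_document_types;

    -- After installing a filter pack, load the new OS filters and word breakers
    EXEC sp_fulltext_service @action = 'load_os_resources', @value = 1;
    EXEC sp_fulltext_service @action = 'update_languages';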
Greg Low: I always have to chuckle when I look at word breakers: they have English and English U.K. They’ve kind of taken over the language. It’s worth noting that in 2008 there are lots of languages supported out of the box. I’m looking at the list: there were quite a number unchanged from 2005, and a whole lot that were replaced with better versions. If you look at the ones they added just in 2008, they added Arabic, Bengali, Bulgarian, Catalan, Croatian, Cyrillic, Gujarati, Hebrew, Hindi, Icelandic, Indonesian, Latvian, Lithuanian, Malay, Malayalam, Norwegian, Punjabi, Romanian, Slovak, and so on. The list goes on and on, and those are just the ones added in 2008.
Simon Sabin: The thing about word breakers is that because there are a lot of new ones in 2008, if you’re indexing data in one of those languages, you’re better off rebuilding your index to get the benefit of the new word breakers. If you just upgrade, it won’t repopulate your index; to get a newly populated index based on the new word breakers, you have to specifically request a repopulation.
Greg Low: Even some of those already there were updated. Things like Brazilian, Dutch, French, German, Italian, Japanese, Portuguese, the neutral one, Spanish, Swedish: there are new versions of all of those. If you just do an upgrade, you’re not going to get the effect of the new word breakers, whereas if you do have the opportunity to rebuild, you get a much better result. Once we’ve done the word breaking, what happens after that?
Simon Sabin: Those terms then get compressed and put into the index. The synonyms we talked about earlier are a query time operation; the index is built on the raw terms. One of the really nice things about SQL Server 2008 is that you can actually get those terms back out of your index. If you want to do some sort of term extraction on your content, you can use full text to do that. There are two or three DMVs for getting that content out. The downside is that because the index is compressed, there’s no really efficient way of extracting all the content, and you can’t get the content for just a specific row in a table; you get the whole table’s content. But it does allow you to understand the distribution of terms within your index. So if you want to build that discoverability sort of thing, I like people to do automated tagging type stuff: you can find out which words are common within the indexed content, using the DMVs to get the content back out of the index.
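A sketch of the two keyword DMVs in 2008 (object names illustrative):

    -- Term distribution across the whole index
    SELECT TOP (20) display_term, document_count
    FROM sys.dm_fts_index_keywords(DB_ID(), OBJECT_ID('dbo.Product'))
    ORDER BY document_count DESC;

    -- Terms broken down per document (row)
    SELECT display_term, document_id, occurrence_count
    FROM sys.dm_fts_index_keywords_by_document(DB_ID(), OBJECT_ID('dbo.Product'));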
Greg Low: I notice there are two nice levels: the overall index, and a breakdown by document. One function I really do like is this DMV, sys.dm_fts_parser. I don’t know if you’ve played with that much, but that’s sweet.
Simon Sabin: Yeah. If you’re ever working in multiple languages, that is a godsend. If you’re working with languages, thesauruses, or you’re doing inflections on the terms you’re looking for, those words like drive and drove, then sys.dm_fts_parser is your friend. It takes the search string and returns all the terms the query will actually look for, after it’s done things like applying the thesaurus and stop lists. One thing we haven’t talked about is words you don’t look for by default. In English, words like the, at, it, on don’t really add any value to your search, so they’re ignored, essentially removed. The nice thing in SQL Server 2008 is that those are all managed in the database using TSQL; you can have multiple stop lists, and multiple stop lists per language. It’s much more flexible than SQL Server 2005 and before. If you’re using these and you find yourself wondering why a query is not returning the data you expect, the parser will show you things like the word breaking: if you put in multi-million with a hyphen in the middle, you’ll find it’s searching for multi-million as a single phrase, but also as multi and million. That’s a nice feature of SQL Server 2008.
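A sketch of calling it (per Books Online the arguments are the query string, an LCID, a stoplist ID, and an accent sensitivity flag; 1033 is English, and 0 here selects the system stoplist):

    SELECT display_term, special_term, expansion_type
    FROM sys.dm_fts_parser(
        '"multi-million" OR FORMSOF(INFLECTIONAL, drive)',
        1033,  -- language for word breaking
        0,     -- stoplist id
        0);    -- accent insensitivity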
Greg Low: I find that incredibly sweet. It strikes me as a little odd that it’s actually a DMV or DMF, but that’s another discussion. It almost seems to me like a string parsing function. The key thing they’re putting it there for is to help you debug or diagnose the behavior of the full text index.
Simon Sabin: Yes. It does the word breaking, and it does the whole query aspect of things. If you put complex queries in, with AND or OR and FORMSOF, it handles those, and the stop lists and the thesaurus. It does do a lot of stuff.
Greg Low: It’s probably worth mentioning, too, while we’re on stop lists: people ask why you would add your own words. A common example I’ve found is that for most companies, having their own company name in the stop list is pretty useful, because typically every single document you index will have it in it.
Simon Sabin: That’s interesting. I hadn’t thought about that. People take stuff out of stop lists too. For example, on the music site I talked about, the is quite an important word for many bands, especially when you’re called The The.
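A sketch of both directions with 2008’s T-SQL stop list management (names illustrative; 'contoso' stands in for the company name, and LCID 1033 is English):

    -- Start from the system stop list, then customize
    CREATE FULLTEXT STOPLIST CompanyStopList FROM SYSTEM STOPLIST;
    ALTER FULLTEXT STOPLIST CompanyStopList ADD 'contoso' LANGUAGE 1033;
    ALTER FULLTEXT STOPLIST CompanyStopList DROP 'the' LANGUAGE 1033;

    -- Point an index at the custom stop list
    ALTER FULLTEXT INDEX ON dbo.Document SET STOPLIST CompanyStopList;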
Greg Low: It’s a bit like when Google first appeared and people would go searching for C Sharp, and that wasn’t what you were going to find.
Simon Sabin: Yes. That’s one place where the parser DMF really comes into its own. If you put in C Sharp and you change case, you can see fundamental differences in what you’re going to be searching for. That’s something to be very wary of when you’re running IT job websites, where C Sharp and those sorts of words are very common.
Greg Low: Just to sort of finish up on this: there are plenty of resources around to let people get used to it. I’ve recorded a couple of things called Answering the Queries People Really Want to Ask, things like that. But in general, the approach is: we have a table with some textual data we want to index, we build a full text catalog as a container for the indexes, and then we build full text indexes over the top. Then we start using contains or freetext as predicates, and there are also table versions of those.
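A sketch of the table-valued form, which returns a KEY and a RANK you can join back to the base table (names illustrative):

    SELECT p.ProductID, p.Name, ft.[RANK]
    FROM FREETEXTTABLE(dbo.Product, Description, 'fast red sports car') AS ft
    JOIN dbo.Product AS p ON p.ProductID = ft.[KEY]
    ORDER BY ft.[RANK] DESC;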
Simon Sabin: Yes. That’s essentially it. One thing, if you’re coming from 2005: the full text catalog now really doesn’t do anything. In 2005 it was the storage mechanism; now your full text indexes are stored in filegroups within the database, so the catalog is not what it was before. Full text catalogs really don’t have any purpose any more.
Greg Low: Most people create one and then forget it.
Simon Sabin: Yes, very true. One thing I would say from the design perspective: you have this table with content, say your product table with a product description on it. One thing I talked about was wanting to combine multiple columns together into a single column. A nice thing in 2008 is that you can full text index a computed column. You can have a computed column which is a concatenation of all of your columns: product description, product title, your category. That’s a nice way of doing things.
Greg Low: You’re saying like a persisted calculated column. In 2000 you could have a calculated column, but it would be worked out every time you selected the value, whereas in 2005 we got the ability to persist the column, so when you update or insert it works out the value, the selects are very fast, and we also have data we can index very reliably.
Simon Sabin: Yes, but imagine your product description column takes a certain number of bytes. If you persist a computed column, you’re storing another copy of all those bytes, so your storage goes up and up and up. In 2008, full text indexing will let you index a non-persisted computed column, which is a really, really neat feature. Something else you can do: we talked about combining relational and full text predicates in your queries, and how in 2005 and before it really didn’t perform. One of the ways around that was to put your relational column into your full text data, so you could add those terms to your full text query and end up with only a full text query. You can still do that and use it for some nice things. For instance, if you have a one-to-many relationship, say a person and their job types, where a person might have multiple job types, it’s very difficult in a relational query to say give me all the people who have a job type of consultant and a job type of full time; you have to do two joins to do that. But in full text it’s very easy to say where it contains job type full time and job type consultant in one query. Interestingly, you can achieve some really nice performance improvements by flattening out, essentially denormalizing, your data into the full text index. On a final note, from the design perspective, I often prefer to store the data I’m indexing in a separate table, the reason being that it’s often large object data. That means I’ve got a separate table which has the full text index, with a 1:1 relationship: my person table and my person text search table. I can then do online index rebuilding on my person table, which you can’t do if it has large object data on it. So if you move your large object data into another table, which you can then full text index, you keep the ability to do online index rebuilds on your other tables.
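A sketch following Simon’s description (all names illustrative; per his comments, in 2008 the computed column does not need to be PERSISTED to be full text indexed):

    -- Combine the searchable columns into one computed column
    ALTER TABLE dbo.Track ADD SearchText AS
        ISNULL(Title, '') + ' ' + ISNULL(ArtistName, '') + ' ' + ISNULL(AlbumName, '');

    -- Full text index the combined column
    CREATE FULLTEXT INDEX ON dbo.Track (SearchText)
        KEY INDEX PK_Track ON TrackCatalog;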
Greg Low: Magic. One last thing: you can do full text indexing over the new filestream data as well, which I think is another sweet thing. But we’ll talk about filestream on another show, another day. Thanks, Simon. What’s coming up next in your world? Where will people see you?
Simon Sabin: As with many SQL Server people, conferences are coming up in September, October and November. I have a conference in Bulgaria in October, and a webcast in September for Quest. The big thing for me is SQLBits, coming at the end of November. We’ve got Donald Farmer coming over and doing a whole workshop on Gemini and R2. We’ve extended it to three days. And I don’t want to forget to mention my user group, sqlsocial.com, in London.
Greg Low: Are you going to get across for the PASS Summit this year?
Simon Sabin: I’m not sure. I do hope so, but with baby due... I’m probably going to be at TechEd, then SQL Bits.
Greg Low: That’s right, it’s the PASS Summit in November, then the week following is TechEd in Berlin, then for those really keen there’s PDC in LA the following week, then SQL Bits the same weekend as that.
Simon Sabin: I don’t know about three weeks away when a new baby is here. But we’ll see. I think I’ll go to PASS. PASS is great.
Greg Low: Well terrific. Thanks Simon, for your time today.
Simon Sabin: Thanks, Greg.