Mark Tabladillo
SQL Down Under Show 47 - Guest: Mark Tabladillo - Published: 4 Mar 2011
This show features data mining guru Dr Mark Tabladillo discussing data mining with SQL Server.
Details About Our Guest
Mark Tabladillo has worked with Business Intelligence applications and development since 1988. Mark has a doctorate in Industrial Engineering in Georgia. Today he provides consulting and training to clients in many industries across the United States and around the world. He promotes Data Mining and machine learning at MarkTab.net.
Show Notes And Links
Show Transcript
Greg Low: Introducing Show 47 with guest Mark Tabladillo.
Our guest today is Mark Tabladillo. Mark has worked with Business Intelligence applications and development since 1988. Mark has a doctorate in Industrial Engineering in Georgia. Today he provides consulting and training to clients in many industries across the United States and around the world. He promotes Data Mining and machine learning at MarkTab.net. So welcome Mark.
Mark Tabladillo: Thanks Greg.
Greg Low: First up, as I do with everyone I will get you to describe how on earth did you ever come to be involved with all of this?
Mark Tabladillo: It is a long story, so I will give the short version here. I became interested in statistics at UCLA when I was an undergraduate. I noticed in your introduction you introduced me as going to Georgia; actually I went to Georgia Tech. We are kind of rival schools here. I continued my studies there in applied statistics, and part of that was time series and some other techniques that were hot at the time. Today Data Mining has really come full circle, and it is becoming much more important for average users and people who may not have even been to graduate school. It is an exciting time and I am glad to see the many things that are happening and all the people that are involved.
Greg Low: Yes it is excellent. I find Data Mining is an interesting one. It is an area so many people haven’t touched or looked at at all. Even many of the people who have had a look at Analysis Services, because it ships as part of that in the product, still tend not to have looked at it. One of the things I normally say I love about Business Intelligence as a whole is that in most organisations I think there is a food chain, and the further up the food chain you are, the better off you are. I find many of the folk that typically work in the relational parts of the database are in areas that the business considers a cost of doing business. It is very hard to get good funding for projects if you are in the cost of doing business end of the world. The thing I like with Business Intelligence as a whole is that it appeals to the people who pay the bills. The further you can move up the food chain, away from being a cost of doing business and into something that provides significant value to the business, the better off your life is going to be within an organization. Data Mining is one of those things that has the potential to push you way up that tree.
Mark Tabladillo: I agree and of course I have heard you present on this topic: Data Mining specifically for Analysis Services. I know we share some of the beliefs in this space. In general the key point you made was being able to connect organisations, being able to connect with their values and helping organisations discover new information, insights into any problems or issues or initiatives that company may have.
Greg Low: Maybe for those who haven’t had a good look at Data Mining at all, briefly first up, maybe a quick description of how you perceive Data Mining to be today and the sort of value it adds to an organization?
Mark Tabladillo: Ok, well maybe I will start with my basic definition of what Data Mining is. I like to use the definition that Data Mining is the process of revealing patterns in data which help give insights toward actionable decisions. I make that distinction against machine learning, which is a respected topic, the focus of many computer science schools today, and a very hot topic for people wanting graduate degrees in this field. That area focuses more on algorithms and their specific performance against whatever metrics someone chooses. I view Data Mining the same way many people in industry do, and that is as a semi-automated process. It does use algorithms from machine learning, but it also helps towards actionable decisions. In the Microsoft world, as you mentioned earlier, this technology is inside Analysis Services. It is perhaps presented in an unconventional way, in the sense that the Data Mining offering from Microsoft is not an application; it is actually a service. It is a service because it runs with Analysis Services. It is intended for enterprise systems, but it is also intended to be as simple as someone opening Excel and doing analysis there from that well-known BI interface. It ranges from working with Excel all the way to working as a full-blown production service connected to Analysis Services, connected to the relational data engine.
Greg Low: Yes, one of the examples I normally use for people who haven’t really thought about what Data Mining is about is that many people now would have bought things from Amazon or sites like that. If I look back years ago, the types of suggestions it used to make to me about what else I might buy were fairly basic, but if I look at it today the suggestions that come up are razor sharp. When I am purchasing something it seems to know an amazing amount now about what I am interested in and what sorts of things I would buy. Most people have come across that as a simple example of the things that are involved in Data Mining. They must realize that it is something that must completely change the entire profitability of that company.
Mark Tabladillo: I love that example, as I have been an Amazon customer myself for many years and have also enjoyed the recommendations that they provide based on searches, or, when I log on as a user, based on my past purchase history. The idea there is that Data Mining is a probabilistic look into what someone is likely to want based on past actions. It is also based on the idea that on a screen there is only so much real estate. If we had a huge screen and the ability to see Amazon’s entire stock of all the books that they sell, then we would be able to absorb all the information at once. Of course that stock numbers in the millions, and as humans we have to have a reduced list to focus on. That is a good example of where Data Mining can provide value.
I want to make a short technical note here that sets apart Analysis Services and what it does for the specific example we are talking about, which is association for market basket analysis. The Analysis Services technology will allow someone to build a Data Mining model based on a nested relational structure, whereas many competing products require the data to be flattened or normalized first. I consider that one of the advantages of using Microsoft technologies.
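As an illustration of the nested idea, here is a minimal Python sketch of market basket analysis over nested cases, where each order carries its own list of items instead of being flattened to one row per item. The order data, the function name and the support threshold are hypothetical, and the pair counting is a simple stand-in, not the Microsoft Association Rules algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical nested data: one case per order, each with a nested item list.
orders = [
    {"order_id": 1, "items": ["book", "bookmark", "lamp"]},
    {"order_id": 2, "items": ["book", "bookmark"]},
    {"order_id": 3, "items": ["book", "lamp"]},
    {"order_id": 4, "items": ["lamp"]},
]

def association_rules(cases, min_support=2):
    """Count item pairs per case and derive simple 'A -> B' rule confidences."""
    item_counts = Counter()
    pair_counts = Counter()
    for case in cases:
        items = sorted(set(case["items"]))
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    rules = {}
    for (a, b), n in pair_counts.items():
        if n < min_support:
            continue  # pair seen too rarely to be worth reporting
        rules[(a, b)] = n / item_counts[a]  # confidence of a -> b
        rules[(b, a)] = n / item_counts[b]  # confidence of b -> a
    return rules

rules = association_rules(orders)
for (lhs, rhs), confidence in sorted(rules.items()):
    print(f"{lhs} -> {rhs}: {confidence:.2f}")
```

Each rule reads as "customers who bought the left item also bought the right item this fraction of the time", which is the kind of probabilistic suggestion discussed in the Amazon example.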
Greg Low: Yes in terms of being very flexible in that arrangement. A good starting point is probably something around the algorithms but just the types of things we can do with Data Mining.
Mark Tabladillo: Right. Ok, so with Data Mining I like to think about it in three different categories. The first category is the one I feel people would use the most, and that is time series or forecasting analysis. The basic idea there is that one column would hold evenly spaced time increments; they might be months or years or could even be minutes, but it is an evenly spaced time sequence. Then the predictive column would be some number, which could be an integer or a floating point number, and the idea would be to predict what the values would be in the future based on the history in the past. The Microsoft technology can consider any type of cyclical effects. A cyclical effect, at least in the United States, might be seasonal trends in retailing, where certain months in the year are known to be high on a 12 month cycle. That is something the model includes. Again, category one is time series. The two other categories that I like to talk about are the supervised and the unsupervised algorithms. Unsupervised algorithms generally are just trying to collect items together that are similar. That is basically the task of the clustering algorithm, for example. Supervised algorithms have a specific target in mind, and the question will be: of the other independent variables or attributes, what combinations lead to the best prediction of that output variable? These are very general terms, and I feel it is hard to understand them based on words alone. I know many times people want to see examples, and we will be talking through some of that. I know you mentioned my website MarkTab.net, where I have some videos that I have produced, along with slide decks and examples, to see how some of this works in action.
Greg Low: Yes that is excellent. Mark, basically what is provided in the box today with SQL Server? What capabilities do you think there are right out of the box and how useful are they?
Mark Tabladillo: Right, in the box with SQL Server. One of the selling points of the Data Mining technology, along with Analysis Services, is that it comes bundled with a SQL Server license. It just means that when someone purchases SQL Server at a certain level, they are able to use that technology across all the components which are included. What I am saying is that a SQL Server Standard or Enterprise license is each going to have an implementation of Data Mining. Personally I do recommend that people seriously look at the Enterprise edition. It does have some additional features, especially in the time series area for Data Mining, and can provide some excellent service there. For the front end, as part of that license package for SQL Server 2008, people use a product called Business Intelligence Development Studio, which people often abbreviate to BIDS. That is a portal for Visual Studio and it allows someone to develop a Data Mining structure or Data Mining model and then put that model into production. When I say put into production, what I mean is that when the model.
What is a structure? A Data Mining structure is just a list of attributes made available for Data Mining, and a Data Mining model takes the attributes from a structure and attaches them to a specific algorithm. Inside the Microsoft list there are what I call 10 algorithms, but let me be clear here: 9 of the 10 are in Analysis Services and 1 of the 10, which is text mining, is part of Integration Services. Inside Analysis Services there are 9 algorithms, and any one of those 9 could be used for a specific model. How many models can you make? It is actually a very large number, something like 2 to the 31. I have the specific number on my blog; I don’t remember the exact number, but there is a very large number of models that someone could theoretically make if they wanted to. The limitation at that point would be something like hard disk space or perhaps the memory of the computer. As far as the software itself is concerned, it is really made for Enterprise type situations. The technology is very powerful for analyzing very large systems and being able to leverage the processing speed of Analysis Services, which is where the performance tuning would happen if someone needed to do that to have a model in production.
Performance tuning is another important thing for someone just connecting it to Excel. If someone just wants to use it from Excel, in what is called an interactive session, they can do that. They would make either a temporary or a permanent Data Mining model, and they can do analysis on fairly small data tables. That is another option, and for someone to evaluate what the differences between the options are, they should see a demo. I have done some demos but I am not the only one. In fact I put links on my website to some of the Microsoft videos and demos that exist. There are a number of other people who are not with Microsoft who have done them too and who also showcase this technology. It is good to get in front of a screen and have someone show what the differences between some of these technologies are.
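To make the time series category concrete, the following Python sketch forecasts an evenly spaced monthly series by repeating the previous 12 month cycle and adding the average year-over-year change. This is a deliberately simple seasonal method chosen for illustration, not the algorithm Analysis Services uses, and the sales figures and function name are hypothetical.

```python
def seasonal_forecast(history, cycle=12, steps=3):
    """Forecast by repeating the last cycle plus the mean cycle-over-cycle change."""
    if len(history) < 2 * cycle:
        raise ValueError("need at least two full cycles of history")
    # Average change between each point and the same point one cycle earlier.
    diffs = [history[i] - history[i - cycle] for i in range(cycle, len(history))]
    drift = sum(diffs) / len(diffs)
    extended = list(history)
    for _ in range(steps):
        # The value one full cycle back carries the seasonal shape forward.
        extended.append(extended[-cycle] + drift)
    return extended[len(history):]

# Hypothetical monthly sales with a December peak, two years of history.
year1 = [10, 10, 11, 11, 12, 12, 12, 11, 11, 12, 15, 20]
year2 = [value + 2 for value in year1]
forecast = seasonal_forecast(year1 + year2, cycle=12, steps=12)
print(forecast)
```

The forecast keeps the December peak because each predicted month is anchored to the same month one cycle earlier, which is the cyclical effect described above.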
Greg Low: Yes indeed. Could you give an example of a specific business situation where you found that you have got good outcomes?
Mark Tabladillo: A specific business situation. Well, let me give an example of a client of ours that has used Data Mining to good advantage. This is a telecommunications client. I am not sure if I can share the name, so I will just say they are in telecommunications, they are in the news industry, and they have a system where users can leave comments on their website. So imagine a telecommunications and news organization that allows people to post comments onto a public website. They pay people to read the comments. Some comments are automatically filtered; other comments go into a “we are not sure” type of category. For those comments, moderators read them and decide whether they can or cannot be posted. Our team, which wanted to use this Data Mining technology, used Data Mining to help determine what some of the features of these posts were. They did it on one level based on the attributes of the person posting. What do I mean by that? I mean when did they sign up, how much time had passed between the time they signed up and the time they posted, what types of forum were they posting on, how frequently, and what was the size of each post. With these types of attribute variables they can work from history. They know from users’ past postings which types of posts were not approved by their moderators, and they can therefore determine a probability of whether a future post is likely to be filtered.
What is the business value of that? The business value is being able to have the machine look at posts first, and then the machine can assign each post a probability rating of being likely to have an issue according to their editorial guidelines. In that case the human moderators can prioritize their work based on the machine probabilities, and then the humans make the final decisions. I like the example too because I have often said that Data Mining does not make specific decisions for people. People are the ones who make the decisions. Yes, it is possible for a person to program a Data Mining algorithm to automatically make a decision for them. That is true, but I still like to say that in any situation it is always people who make the final call on what an algorithm does, even if something becomes fully automated, such as having a specific cut-off number or a specific result or trigger as a result of Data Mining.
People’s values still become part of the conversation and even in this case there wasn’t a specific person it was the organisation’s values. What they want on a website. That project was able to demonstrate some cost savings in the millions of dollars. I think in this case they were able to demonstrate that they were able to move from where they were and leveraging the power of what a machine can do. That is being able to repost or it actually wasn’t reposting in this case. We were not doing text mining in this specific case and that was not one of the earlier objectives.
The early objectives were to use straight supervised algorithms to evaluate the data and, based on past behavior, be able to prioritise any future posts and put them into a queue for a moderator.
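The moderation workflow described here can be sketched in Python as a supervised model that scores pending posts and orders the moderator queue by probability of rejection. Everything in this sketch is an assumption made for illustration: the attribute names, the tiny history, and the choice of a naive Bayes estimate with add-one smoothing (which assumes two possible values per attribute). The transcript does not say which algorithm the client used.

```python
from collections import defaultdict

def train(history):
    """history: list of (attributes dict, was_rejected bool). Returns count tables."""
    class_counts = {True: 0, False: 0}
    value_counts = defaultdict(lambda: {True: 0, False: 0})
    for attrs, rejected in history:
        class_counts[rejected] += 1
        for key, value in attrs.items():
            value_counts[(key, value)][rejected] += 1
    return class_counts, value_counts

def rejection_probability(model, attrs):
    """Naive Bayes estimate of P(rejected | attrs) with add-one smoothing."""
    class_counts, value_counts = model
    total = class_counts[True] + class_counts[False]
    scores = {}
    for label in (True, False):
        score = (class_counts[label] + 1) / (total + 2)
        for key, value in attrs.items():
            seen = value_counts[(key, value)][label]
            score *= (seen + 1) / (class_counts[label] + 2)
        scores[label] = score
    return scores[True] / (scores[True] + scores[False])

# Hypothetical moderation history: poster attributes and the moderator's decision.
history = [
    ({"account_age": "new", "post_length": "short"}, True),
    ({"account_age": "new", "post_length": "short"}, True),
    ({"account_age": "new", "post_length": "long"}, False),
    ({"account_age": "old", "post_length": "long"}, False),
    ({"account_age": "old", "post_length": "long"}, False),
    ({"account_age": "old", "post_length": "short"}, False),
]
model = train(history)

# Pending posts as (post id, attributes); the riskiest posts go to the front.
pending = [
    ("a", {"account_age": "old", "post_length": "long"}),
    ("b", {"account_age": "new", "post_length": "short"}),
]
queue = sorted(pending, key=lambda p: rejection_probability(model, p[1]), reverse=True)
print([post_id for post_id, _ in queue])
```

The model only orders the queue. The moderator still reads each post and makes the final call, which matches the point that people, not the algorithm, make the decisions.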
Greg Low: Yes that’s great. The main thing is the business value there is a reduction in the overall cost of what they would have had to have human wise otherwise to do that work. Yes that is excellent. One of the things I have had some questions that come at times about the level of dedication the team has in Microsoft to the product. There doesn’t seem to be a lot of changes lately to what is provided in the Data Mining tooling. One of the arguments is it is already at a fairly mature stage and so that is a point which you wouldn’t see a whole lot of things coming out. What is your take on where you think, the level of interest is within the product team and or the future of that part of the product?
Mark Tabladillo: That is a great question. There is a quote I have somewhere from Bill Gates from about 10 years ago, when he was talking generally about Microsoft Research. I think it still represents Microsoft’s general position. They are generally interested in Data Mining, and they are generally interested in machine learning across the entire product spectrum. Today we are focusing on Analysis Services and the Data Mining algorithms inside Analysis Services, and I also mentioned text mining inside Integration Services. I have links on my website MarkTab.net where I point to some of the machine learning work going on right now in Microsoft Research.
I will say, as far as my feel goes, and this is just a feel as I don’t have numbers behind this, that a lot of this technology is now being invested in their search engine, which is Bing.com, and will also be reflected in the FAST search technology, which recently got rolled into Microsoft and is also part of SharePoint now. It is going to be bundled that way.
To answer your specific question about the Analysis Services team, I do encourage people to find out as much as they can about this team. This team has not only produced a world class OLAP technology, it has also produced Data Mining, and it is also producing a new product called Power Pivot. Power Pivot, in its first implementation, is in one form an add-on to Microsoft Excel. I have described it as kind of a light version of Microsoft Access. Don’t take my word for it; you need to go to PowerPivot.com and see for yourself what it is and what it does. Microsoft has already announced that in the next release of SQL Server they will continue to develop Power Pivot and the technology behind it, which they call VertiPaq, and also integrate that more with the Analysis Services product. I don’t know if anyone really knows exactly what the next version of Analysis Services is going to look like.
I only mention that because the Data Mining piece inside Analysis Services is just part of what makes this whole enterprise solution work. Functionally, in an enterprise environment, no-one does Data Mining without bundling it into something else. They are not going to do it pragmatically without using Integration Services, which is my recommendation for rolling out Data Mining into the enterprise environment. So it is not just a matter of looking to Microsoft to have more algorithms inside Analysis Services, for example.
Then there are some of these other systemic changes. By the way, I want to also mention Apollo column store indexes, which are a new change in the relational engine. I wrote about VertiPaq and Apollo column store indexes on my blog, and it has been one of my most read blog posts of the year; people continue to read it every week. I wrote about these topics because my interest in Microsoft specifically is not necessarily about testing new or novel algorithms that might come out of the statistical or scientific literature. My focus is more on Data Mining for enterprise environments, and these new innovations from Microsoft, Power Pivot and Apollo column store indexes, will help with where most of the work in Data Mining is, and that is what is called ETL. Extract, Transform and Load is what the abbreviation stands for, and it reflects the data cleaning and data verification procedures that are important for preparing data before an algorithm is even applied to it.
Anything that Microsoft can do on that portion helps; I have heard people call it maybe 70% of the work. So in actual practice that is most of what Data Mining is in terms of actual work, if Data Mining were the entire focus of a project. We are not saying that it typically is. It typically is not; people are typically doing other things as well. But for the sake of argument, let’s just say that an organization is only doing Data Mining: most of the effort is going to be ETL. Anything Microsoft can do to help that process does have a direct effect on Data Mining. It is a supported technology, which Microsoft is still supporting through their technical support and user forums. Now I will say there was a time, a short window, when Data Mining was the new thing. Well, it is not the new thing anymore. VertiPaq, Apollo column store indexes and Power Pivot are now the new thing. Microsoft is standing behind the technology in Analysis Services. It is extensible, in the sense that someone can look at the product site, which is SQLServerDataMining.com, and look at ways of programming their own algorithms. If someone is so inclined, they can program their own algorithms on top of this technology and extend the product.
Overall, even if someone doesn’t want to do that, I think that the product right out of the box has incredible efficiency. Anyone who has done Data Mining will immediately recognize some of the outputs and names of the algorithms and even some of the tasks which you could do with them. That’s my basic take on that. Let me ask, what have you heard about some of these ETL technologies?
Greg Low: Yes, the thing with Data Mining that has been raising questions is that they have been a bit slow producing some of these things. For example, in Excel there is a 32 bit version of the Data Mining add-on but they haven’t produced a 64 bit version. There have been questions about whether there is some sort of lack of commitment in that area, and then you see that key people from the team have moved on. It tends to raise questions about whether this is going forward fairly fast.
The thing I do like is that what is already there is really substantial and probably at a mature level already. I do find the idea that it is extensible interesting. I often say to data people that even if you would never have any interest at all in extending these things, the fact that it is extensible is quite important. I look at things like Reporting Services: most people can’t imagine that they would ever build components or processing extensions that would drop into the toolbox like new controls. But it is really important that somebody can. That means it is possible for Dundas and other companies to build things for you to buy and add onto the product, whereas if it were all a completely closed arrangement then that would not be possible.
If anything, one of the things I tend to be critical of the SQL Server team about is extensibility; I think it is something that they don’t tend to get into much of the product. In this case they do have good extensibility points and I think that is important. The other place where it is very useful is in an academic setting: if somebody wants to try to create new algorithms and so on, it gives them a very good framework to plug into.
Mark Tabladillo: Right, and that is a good point. I feel that any graduate student who can program in a .NET language and has some knowledge of algorithms should be able to extend the product. I can say that because I have taught and am teaching graduate school; we didn’t mention that in my bio. I am teaching at the University of Phoenix, and I have taught graduate level statistics. I don’t feel that the challenge is beyond what a graduate student could do. Then again, a business user out there who hasn’t gone to graduate school for that purpose, and maybe went for an MBA or some other degree, can still find someone who can do this for them. If extensibility is a pressing need, that is certainly possible.
Greg Low: Welcome back so Mark is there a life outside SQL Server and Data Mining? What else do you get involved with?
Mark Tabladillo: I am glad there is; I am a believer in work life balance. I have had good advice from people much wiser than me to have a life beyond professional interests. I will say for your listeners that for a few months I have been on Facebook, and people can tap into my social life there. On that topic, I will just mention that I have collected most of my friends, most of whom I consider acquaintances, from my primary social hobby, which is partner dancing. By partner dancing I am referring to swing and Lindy Hop, which are done to American jazz music, and I also sometimes do salsa and other forms of ballroom dancing. It has been a passion of mine for probably over 10 years. It is something I discovered when I was out of college and something I wish I had done earlier, just because it is great social fun. People from around the world do the same thing too. At times when I have travelled in the United States or even internationally, I have found myself in my off hours, leisure time or tourist time, looking for social events and dancing events. It is something I enjoy and something I help organise here in the Atlanta area. I have encouraged other people to look at it or try it.
Greg Low: Actually that is good. I should mention that last time I saw you it was in Alicante, Spain not long ago. I remember one of the things you did one of the evenings was wander off or try and find a place with music and things like that.
Mark Tabladillo: Yes I did find a salsa club there in Alicante. Salsa is the more predominant dance in most cities around the world. In the United States it is easier to find swing dancing, either Lindy Hop or what is called West Coast Swing. Either of those forms is great.
Greg Low: Good, that’s good to see a life outside it. Back on the Data Mining though, we have talked about how you might have things like Excel or tools like that as a front end. Of course the other big thing is how you would integrate it into your own applications, so maybe just some mention of DMX and so on might be kind of enlightening for those who haven’t looked at that.
Mark Tabladillo: Ok. We will be a little bit more geeky here with these responses, and I am not going to completely explain these technologies. DMX stands for Data Mining Extensions and it refers to the SQL-like language that someone can use from SQL Server Management Studio. It allows someone to manage either Data Mining structures or Data Mining models. I will mention too that DMX is a core piece of some of the upcoming presentations that I have.
We will be talking about those at the end of this podcast. It is a SQL-like language that allows someone to manipulate the models. It is also a focus of the Microsoft certification questions in this area. They will ask questions about DMX in what I believe is the 452 exam, which is the current one. That is one of the topics.
You spoke of application development. The Data Mining technology is completely encapsulated in the Analysis Services class structure. On my blog I reviewed a book, which I do recommend for people interested in this topic, about Data Mining for the 2008 version. It was written by Jamie MacLennan and Bogdan Crivat. In my review I took some of the C# code that they had in their book and translated it into PowerShell. That post has become one of the most referenced posts on my blog.
I do consider PowerShell an application environment, if only for simpler applications. I feel it completely replaces the need to do console application programming in regular Visual Studio. It has complete access to the entire .NET class structure, so it is a great way for people to use what in the .NET world are called AMO and ADOMD.NET. AMO is for the management of Data Mining and Analysis Services. ADOMD.NET is for the querying of those models. Those two would be the areas in which to start.
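For readers who have not seen DMX, the sketch below uses Python only to assemble the text of a DMX singleton prediction query, the kind of statement that asks a trained model to predict one column for a single new case. The model name and the column names are hypothetical, the helper does no escaping of names or values, and nothing here connects to a server. The resulting text would be run from a client such as SQL Server Management Studio.

```python
def dmx_singleton_query(model, target, inputs):
    """Build the text of a DMX singleton prediction query for one new case.

    No escaping is performed; this is an illustration, not a safe query builder.
    """
    def quote(name):
        return "[" + name + "]"

    def literal(value):
        # Strings are quoted; numbers are written as-is.
        return "'" + value + "'" if isinstance(value, str) else str(value)

    columns = ", ".join(f"{literal(v)} AS {quote(k)}" for k, v in inputs.items())
    return (
        f"SELECT Predict({quote(target)}), PredictProbability({quote(target)})\n"
        f"FROM {quote(model)}\n"
        f"NATURAL PREDICTION JOIN\n"
        f"(SELECT {columns}) AS t"
    )

# Hypothetical mining model and input columns.
query = dmx_singleton_query(
    model="Comment Moderation Model",
    target="Rejected",
    inputs={"Account Age Days": 3, "Forum": "Sports"},
)
print(query)
```

The shape is deliberately close to SQL: the mining model takes the place of a table, and the prediction join supplies the new case whose outcome the model should predict.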
In addition to that there are a lot of application tips from the product team’s website which is SQLServerDataMining.com and included on that website are some controls which people can use to drop into their applications. You can look at those controls and see what they are and see if you like them. If you don’t then you could just make alternative viewers or controls for your application.
I believe going forward, I would like to see more people working on specific controls. I think there will be a market for controls. I talked about this on my blog and I am hoping specifically more people will want to make controls either for web applications which includes SharePoint or also for Silverlight. Which I feel is going to be fascinating way that people are going to want to look at Data Mining from that viewer perspective and again that also can be in SharePoint. These are some interesting times for people who want to take that next step and do some creative visualization.
Greg Low: What also intrigues me with business applications is the places that I see Data Mining should be used in Standard traditional business applications. It strikes a lot of people where on earth would you use this. The thing that intrigues me if you think about the tax department or somebody like that, in any country now they don’t have time to go through and look into detail at everybody’s tax returns and things like that. So you can imagine that they are endlessly looking at where do you sit in terms of what you have returned compared to the norm. Whatever that is perceived to be. In some way if you are an out liar in that area they are going to look further at what you are doing.
I see that with even normal business applications. It amazes me that sites that I go to which would ask me questions that I could say for example I am a 25 year old guy that I live in a really poor neighborhood and I earn $800,000 a year and I have 25 years education. I can go and fill all these details in and the systems don’t even blink. Each question on their own is completely reasonable but as a whole almost zero possibility that my answers as a whole makes sense together. I think one of the power of something like Data Mining is that we could make much better use of it to be able to look at when somebody even fills in a normal form in an application to go. Is there any real chance that data is correct?
Mark Tabladillo: I like your line of questioning and I feel speaks to generally what we expect from intelligence specifically. It is not just Business Intelligence but it is any type of business intelligence. Even the type of business intelligence that governments have to inform them about what is going on around the world. Again thinking about the concepts, what is the difference? I have talked on my blog about what is the difference between regular business intelligence.
What is the difference between drill down and Data Mining? Drill down is just a deterministic look at more detail in other words it just gives somebody the facts that are already there. Data Mining differs from drill down because it is a probabilistic look and will require somebody to make an evaluation judgment. Greg in your example about someone filling in a form and being perhaps a potential out liar. Data Mining would maybe just flag someone as being probable but it is not going to say for sure. Someone will have to come in and make a decision. Do an investigation and say what do you think? Do you think this person is an out liar or not? Is more investigation warranted? In some cases, this is a part that might bother people when they use Data Mining. Even human judges may not be able to make a conclusive decision based on what the question is without gathering more additional information. Data Mining might show something as being a potential problem but knowingly know whether issue is an out liar gaining additional information. Sometimes that is a stopping point right there. Other times there is a relationship between an organization and perhaps an individual or another group and it is possible to collect more data.
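The form example can be sketched in Python as a check on how often a record's answers occur together in past submissions. Each answer may be common on its own while the combination is rare. The field names, the banded values and the scoring rule (the lowest pairwise co-occurrence rate) are all assumptions chosen for illustration, and the result is a flag for a person to review, not a verdict.

```python
from itertools import combinations

def plausibility(history, record):
    """Return the lowest pairwise co-occurrence rate of the record's answers."""
    lowest = 1.0
    for first, second in combinations(sorted(record), 2):
        with_first = [h for h in history if h[first] == record[first]]
        if not with_first:
            return 0.0  # this answer has never been seen at all
        both = sum(1 for h in with_first if h[second] == record[second])
        lowest = min(lowest, both / len(with_first))
    return lowest

# Hypothetical past submissions, with answers already grouped into bands.
history = [
    {"age": "20s", "income": "low", "education": "short"},
    {"age": "20s", "income": "low", "education": "short"},
    {"age": "20s", "income": "mid", "education": "short"},
    {"age": "50s", "income": "high", "education": "long"},
    {"age": "50s", "income": "high", "education": "long"},
    {"age": "50s", "income": "mid", "education": "long"},
]

suspect = {"age": "20s", "income": "high", "education": "long"}
typical = {"age": "50s", "income": "high", "education": "long"}

for name, record in (("suspect", suspect), ("typical", typical)):
    score = plausibility(history, record)
    flag = "review" if score < 0.1 else "ok"
    print(name, round(score, 2), flag)
```

Every value in the suspect record appears somewhere in the history, so a field-by-field check would pass it. Only the combination is unusual, and the score leaves the final judgment to whoever investigates.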
Where does it have the highest potential impact? I think in cases where there is a continuing and ongoing relationship, such as patients in the health care system, customers of financial services institutions, or customers of services organisations. When I say services, it could be anything from a utility to a telecommunications organization or a media outlet.
Greg Low: What is your take on the product in terms of performance, in three areas? How long does it take to build models compared to the amount of data being processed? When you are making queries, how quick is it? If I send a DMX query that says go off and do a prediction join or something, what sort of performance do you find there? And overall, how do you think the product performs?
Mark Tabladillo: Performance, it is a fair question to ask. I will say it in two ways. First off, I feel that the product as it is now is fast enough to be a backend to the Excel interface. There is a Microsoft add-in, a free add-in for Excel, that does Data Mining. It is fast enough to provide a good interactive user experience for Excel. That says a lot right there, especially if it is working on a server. I typically have had computers and machines which tend to be ahead of the median, as I am sure you and a lot of other IT professionals do. I don’t have, for example, the absolute best laptop ever built, but I find that even on more average systems its performance is still very good.
I did have someone email me a few weekends ago about performance, and it is information I haven’t posted to my blog. The essence of the question was how you get insight into performance tuning for Data Mining. Part of the story has to do with Analysis Services: just look at Analysis Services and see how it is configured. Ideally Analysis Services runs on its own box and is doing just that. In some shops they may run, for example, the relational engine and Analysis Services on the same box. That choice has an impact on performance, but someone who wants better performance could put in a dedicated Analysis Services server or use one of the scale-up or scale-out solutions available. I will often refer people to SQLCat.com, which is a performance website. It gives tips across the whole of SQL Server on how to tweak or improve performance. What I am about to say is again a comparative statement: what makes this technology different from many of the competing solutions? If someone was looking at one of the many open source Data Mining software packages available, those packages were designed to run on one machine, and basically within the memory capacity of that one machine. With the Microsoft technology we are talking about the ability either to run on one machine, which I often do for demos, running Analysis Services on the same machine as my PowerPoint slides, or to put it into a client-server situation with Analysis Services running on one or perhaps more machines.
That can mean splitting load between multiple machines, in ways I don’t personally know very well. I will say that I am not a performance guru or expert. I know enough to say that I am not an expert in those areas. I do know people who are, and I do know the resources where I would go for answers to try to improve performance that way. To finish this topic: one side is improving the performance of SQL Server and Analysis Services themselves. Another way to tackle performance issues for Data Mining, which I will be blogging about, is to consider what the scope of the problem is. When I say that, I go back to my earlier statement about capacities. I am not going to go into detail about everything involved in the algorithms; there are different algorithms, and each has its own performance tuning parameters, which you can read about in the SQL Server documentation.
Beyond that, one of the general advice points, and this is based on decades of statistical analysis experience from the academic and business community, is to try to make models simple. Sometimes people will look at technologies such as the Microsoft technology and say it can handle hundreds or thousands of attributes, so we are going to put everything in there. Someone has to really ask: do you need 1000 attributes, even if 1000 are available? Or could you reduce that number and produce a simpler model? A simple model will be quicker to build. When we say build, we are talking about the word train. Train means being able to take some data, match it against attributes and produce a model. So a simpler model would be easier to train and faster to query. Simplicity itself is a major performance strategy, and it is a best practice to make the models as simple as possible.
That gives us the scope. That simplicity point is true across all Data Mining products; it doesn’t matter who wrote them. When I spoke earlier about the server performance aspects, those tips and efficiencies are only going to be available for server-based products like Analysis Services, and that is why this technology rose the way it did. Microsoft intended Data Mining to be built this way specifically for performance. If they had wanted it just for features, it would have been rolled out as another element of Microsoft Office, as a pure extension to Excel or some of the other products. As it is, it is really written for performance, and people who want to assess or analyse that can do better by applying best practices in performance and by continuing to improve hardware and maybe network elements as well. Maybe it is a SAN, or the physical pipe where the servers are located. There is a lot more tuning and configuration that can happen once a technology enters the server space.
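The advice above about keeping models simple can be sketched in a few lines of Python. This is only one possible way to cut down attributes before training, assuming numeric attributes and a numeric target; it ranks attributes by absolute Pearson correlation with the target and keeps the strongest few. The function names and the toy columns are hypothetical, and this is not the feature selection that Analysis Services itself performs.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists.
    Returns 0.0 when either list has no variation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def top_attributes(columns, target, keep):
    """Names of the `keep` attributes most strongly correlated with the target."""
    ranked = sorted(
        columns,
        key=lambda name: abs(pearson(columns[name], target)),
        reverse=True,
    )
    return ranked[:keep]

# Toy data: one informative attribute, one noisy one, one constant one.
columns = {
    "a": [1, 2, 3, 4],
    "noise": [5, 1, 4, 2],
    "const": [7, 7, 7, 7],
}
target = [2, 4, 6, 8]
selected = top_attributes(columns, target, keep=2)
```

Training on the selected attributes only is the kind of reduction that makes a model both quicker to train and faster to query.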
Greg Low: Now one thing that I do want to drag you back towards. You made a brief mention earlier about text mining in Integration Services. Maybe just your thoughts on what is the typical use case or where is the value added by that text mining?
Mark Tabladillo: Ok, well the text mining is a term lookup and term extraction technology inside Integration Services, and it is great technology. It works in any language so long as it is English, as I like to tell people.
Greg Low: Nice.
Mark Tabladillo: It is based on English. What it does is extract noun phrases, and then it enables someone to build a dictionary of those terms, attach it to a relational file and use that as a platform for applying some of the algorithms. Someone could actually program that and look for specific terms themselves; the advantage of using the built-in one is that you don’t have to tell it to look for a noun or noun phrase. It will look for them for you. That is the technology as it is. I would encourage people to use it. I see only primitive examples, and I have only produced primitive examples, again because it has a limited scope in what it can do and what it will do.
Greg Low: Maybe just one example of the sort of thing you can potentially do with it?
Mark Tabladillo: Ok, I will give you an example. I will retell the story of the standard example that is in the book that I mentioned earlier on Data Mining 2008. What they did was take American presidential addresses. Every year the American president is invited to speak before Congress in what is called the State of the Union. They took terms from those speeches and tried to predict what party the president was in, based on the terms used in the speech. It is an interesting example. Students of political history may be alert to the fact that they took 200 years’ worth of presidential addresses and tried to predict parties, when everyone knows very well that the names of the parties have changed over the last 200 years. The basic orientation of these parties has also changed, even where the names have remained the same. There is that caveat; however, it does produce some good results where certain terms are more associated with certain parties.
Again, here is the key. The word government, for example, is found in speeches going back 200 years in American history. It is not a distinctive factor among parties; it is not a term that one party is more likely to use than another. There may be other terms that are contrasting examples, and Data Mining will surface some of those distinct differences. That is really the advantage: if someone was just going by frequency of words, they might miss it. They might not be able to see that Presidents or leaders of one party may be using certain phrases or terms more than others. So that is an example where the text mining has an advantage, in being able to show contrast or difference.
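The contrast between raw frequency and distinctiveness described above can be illustrated with a small Python sketch. It scores each term by the log ratio of its smoothed usage rate in two groups of documents, so a word used equally by both groups scores zero however common it is. The two toy "speech" lists and the terms in them are invented for illustration, and this is a simplified stand-in, not the algorithm used in the book’s example.

```python
from collections import Counter
from math import log

def distinctiveness(docs_a, docs_b):
    """Log ratio of each term's smoothed rate in group A versus group B.
    Positive means more typical of A, negative of B, zero means no contrast."""
    count_a = Counter(w for d in docs_a for w in d.lower().split())
    count_b = Counter(w for d in docs_b for w in d.lower().split())
    total_a, total_b = sum(count_a.values()), sum(count_b.values())
    vocab = set(count_a) | set(count_b)
    size = len(vocab)
    scores = {}
    for term in vocab:
        # Add-one smoothing so a term missing from one group does not divide by zero.
        rate_a = (count_a[term] + 1) / (total_a + size)
        rate_b = (count_b[term] + 1) / (total_b + size)
        scores[term] = log(rate_a / rate_b)
    return scores

# Invented toy speeches for two hypothetical parties.
party_a = ["government tariff tariff", "government tariff union"]
party_b = ["government railroad railroad", "government railroad union"]
scores = distinctiveness(party_a, party_b)
```

In this toy data "government" is among the most frequent words yet scores zero, while the less universal terms carry the contrast.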
Greg Low: What about things like detecting tone? I mean, a good example is a company saying I would like to pick up everyone in the Twitterverse who is talking about us, and I would like to know whether what they say is negative or positive.
Mark Tabladillo: Well, I will mention again what you mentioned earlier. We were in Alicante as part of a conference where we were talking about text mining and some of the new research coming out of universities. What we are talking about now is called Sentiment Analysis: trying to figure out what the sentiment of a person is. Not just looking for key words, but trying to look at whether a term or a thought or an opinion is positive or negative. Companies find this advantageous because now, with all the forms of media out there, they want to have an idea of what some of the key influential people are saying about their product or service. That is something that large companies will pay for now. They will pay either to do it themselves or, as I suspect, they are outsourcing and having third parties do it for them.
Again, this is another entrepreneurial idea: to be able to find the blogs or news media or outlets that are more influential, and to monitor or track what they are saying about products or services. I have heard of companies doing this for stocks, where they will monitor postings about certain stocks, shared either in public or in private, and use that sentiment to advise their clients about investment decisions. Still, it is a topic that has enough fascination that I feel people will continue to work on it for many years. One example I have given is sarcasm. Sarcasm could trick a machine into thinking you think one thing when you actually think the other, which is the point of sarcasm. You say that you don’t like something but you really mean that you do. That topic right now is an example of where sentiment analysis might hit a wall. Again, I still go back to the need to have people involved at these different stages: either individuals who are well trained or, typically, groups who could look at the output and, for a critical issue, determine whether that person is being sarcastic. Was it a parody? There are some parody websites out there, and that is what they do. They are not necessarily for or against things; they are just putting it up as some sort of media relief. Part of doing this accurately is not just programming your computer to attach to Twitter feeds and scrape web pages, but also considering the source and some of these other known factors which might be involved.
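To make the sarcasm problem above concrete, here is a deliberately naive Python sketch of keyword-based sentiment scoring. The word lists and sample sentences are invented for the example, and real sentiment analysis systems are far more elaborate; the point is only to show how a sarcastic complaint can come out looking positive, which is why a person still needs to look at the output.

```python
# Tiny hypothetical lexicons; a real system would use far larger resources.
POSITIVE = {"great", "love", "excellent"}
NEGATIVE = {"terrible", "hate", "broken"}

def naive_sentiment(text):
    """Count positive words minus negative words.
    Above zero reads as positive, below zero as negative."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

sincere = naive_sentiment("I love this product")
complaint = naive_sentiment("The update is terrible and broken")
# A sarcastic complaint: the scorer only sees the word "great".
sarcastic = naive_sentiment("Oh great, another outage.")
```

The sarcastic sentence gets the same positive score as the sincere one, so a keyword count alone cannot separate them.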
Greg Low: It is kind of interesting, because with exactly the same phrase from different people you would need to know that this person tends to use sarcasm quite a bit, whereas when somebody else says that, they really mean it, and so on. That is really interesting. How do you imagine somebody should get started with Data Mining?
Mark Tabladillo: Great question. On MarkTab.net, under my about section, there is a page on getting started. It shows how to get started with the Microsoft technology. The first step is just identifying what role someone might be in. I identified the three roles that Microsoft recommends in their own literature. The first is the analyst, someone who is an expert in the data area and wants to have some more actionable information. The second is the architect, who would be concerned with the processing or data requirements. The third is the developer, who could be somebody extending the Data Mining algorithms in custom solutions. For those three descriptions I have outlines, and I have put up some pictures of backend and front end solutions for people to look at.
I will mention that SQL Server is available for a free trial and people can download it for free. Even Windows is available for free trial. I think that is true. Right now I don’t have a link in front of me.
Greg Low: Yes, I know they both definitely are. It is easy enough to try those things for free. What I do suggest to people who use SQL Server or are interested in building more knowledge is getting a copy of the Developer edition. For people in the US it is only about $49. It just gives you the entire toolset, everything. I see people try to learn SQL Server just using Express and things like that because it is free. I just say, look, go and pay your $49 or whatever. What you get for that is astonishing.
Mark Tabladillo: Right, also Data Mining is not in SQL Server Express. You will have to get the Developer edition. Developer Edition is a good way to go, that version does not expire and it can be installed on a single machine and again you can go onto Microsoft’s website and see what type of machine that will need to be.
Greg Low: You also mentioned Jamie MacLennan’s book earlier. He is a notable person who has left the team and gone off. I noticed that he has Predixion Software. Are there any other books that you think are notable in that area?
Mark Tabladillo: Other books: another book I reviewed has lead author Galit Shmueli. She is a professor and she did an interview with me. Her book is called Data Mining for Business Intelligence, if my memory is correct on that. I reviewed it on my website, and I feel her book shows how to take Data Mining and make decisions. I feel that process is more important for this technology than some of the surrounding technical questions we have been talking about in this podcast, such as performance tuning or application development. For someone to learn performance tuning, yes, go to SQLCat. For someone to learn application development, they would be better off learning from a .NET guru. Learning pure Data Mining is part of the puzzle. It is best taught by somebody who is teaching it, and the book I reviewed on that topic was very good. I am in the midst of reading another book, which I am not going to mention yet because the review is not out. I would encourage people to see what is generally available on my resource list. I have a list of websites and a list of blogs. The blogs and the websites are not just about Microsoft technologies, but they do include everything that is related to Microsoft. Between the two lists there are a lot of other resources out there, and communities where people can get involved and ask questions. I believe we are now in the social networking age. Yes, read books, please do, because there are a lot of great books out there, but in addition to that also find the communities, find the conferences, find the free resources that are in many cases available. Not just on my website, but also on the many other websites that I link to, where people can get free information.
Greg Low: Yes, it intrigues me actually how slowly many companies are reacting to the change to the social media structures. I thought the best example in recent times was when those panels blew off the side of the engine of the Qantas jet. What was intriguing was that the company was saying oh no, no, no, there is no evidence that those things came from us at all. Yet there were pictures immediately going around the Twitterverse: here are pictures of the engine with flying kangaroos on the side, and so on. The company was not able to deal with how quickly information spreads nowadays. Everybody is talking about it long before the company even really knows what is going on. Social media has completely changed a lot of this stuff; it is quite interesting.
Where will people see you, or what have you got coming up which might be of interest, Mark? Anything at all?
Mark Tabladillo: I do. First off, I am going to be presenting at an event in the US called SQL Saturday, in Columbia, South Carolina, in March, which is this month. Later on I am going to be presenting at SQL Rally, which is a PASS conference. It is going to be in Orlando, Florida, in May. In preparation for that I am also going to be doing some PowerShell Data Mining examples, and those are going to be shared on the Microsoft scripting guys’ website. I got attached to PowerShell because I saw a blog on the topic and it was filling a hole, and there are plenty of holes out there if you love PowerShell. They are looking for people to write scripting examples. So those are the two presentations: SQL Saturday this month and then SQL Rally, which will be in May.
Greg Low: Magic, so listen thanks so much for your time today Mark.
Mark Tabladillo: Thanks Greg for inviting me.
Greg Low: We should note also that Mark will be a speaker at TechEd Atlanta in the middle of the year.