Marco Russo
SQL Down Under Show 58 - Guest: Marco Russo - Published: 1 Apr 2013
SDU Show 58 features well-known SQL Server author and SSAS Maestro Marco Russo discussing SQL Server Analysis Services tabular data models.
Details About Our Guest
Marco Russo is a BI consultant and mentor. He has written several highly-regarded books on Analysis Services and PowerPivot and is a regular speaker at international conferences such as TechEd, the PASS Summit, SQLRally and SQLBits. He is an SSAS Maestro and a Microsoft Certified Trainer and holds a number of Microsoft certifications.
Show Notes And Links
Marco blogs at: http://sqlblog.com/blogs/marco_russo/
Hardware Sizing a Tabular Solution (whitepaper): http://msdn.microsoft.com/en-us/library/jj874401.aspx
Expert Cube Development with Microsoft SQL Server 2008 Analysis Services: http://www.amazon.com/Expert-Development-Microsoft-Analysis-Services/dp/1847197221/ref=sr_1_5
Microsoft® SQL Server® 2012 Analysis Services: The BISM Tabular Model: http://www.amazon.com/Microsoft-Server-2012-Analysis-Services/dp/0735658188/ref=sr_1_1
Microsoft® Excel® 2013: Building Data Models with PowerPivot: http://www.amazon.com/Microsoft-Excel-2013-Building-PowerPivot/dp/0735676348/ref=sr_1_3
The Many to Many Revolution (whitepaper): http://www.sqlbi.com/articles/many2many/
Show Transcript
Greg Low: Introducing Show 58 with guest Marco Russo.
Welcome, our guest today is Marco Russo. Marco is a BI consultant and mentor. He has written several highly-regarded books on Analysis Services and PowerPivot and is a regular speaker at international conferences such as TechEd, the PASS Summit, SQLRally and SQLBits. He is an SSAS Maestro and a Microsoft Certified Trainer and holds a number of Microsoft certifications. So welcome Marco!
Marco Russo: Hi Greg, nice to be here today.
Greg Low: Lovely, and we were just discussing that you are based near Turin?
Marco Russo: Yes, I'm close to Turin, just 25 km from the mountains, the Olympic mountains; a one-hour drive and I can go skiing. It is very good.
Greg Low: So listen, as I do with everyone when they're first on the show, how did you ever come to be involved with SQL Server and BI?
Marco Russo: Oh, it is a long story, because I was born a developer and I have always programmed on Microsoft Windows, and before Windows, the Spectrum and other things. When it came to putting data in a database, I started with a very early version of SQL Server. Then in '99, with the first version of OLAP Services, I started working with that, because at that time I was working with Oracle on a data warehouse project. Once Microsoft launched its own product it was interesting for me to jump in and look at what this product was about. In this way I started using what at the time was called OLAP Services and is now called Analysis Services. I have seen this baby grow along its entire life, and I started very early with the first beta of tabular, and before that with Power Pivot, which has the same engine as tabular.
Greg Low: Indeed, if I had to ask what sort of mixture of projects would you currently be doing, tabular versus multidimensional?
Marco Russo: This is a good question, because fundamentally 80% of the projects where you can use Analysis Services can be solved by using either multidimensional or tabular, and just 20% of the projects favor one over the other. Today we are using tabular especially when we can get advantages from the performance of tabular for leaf-level calculations, and when the development team has to complete the development. I am usually involved in the consulting of the initial phase of the project, where we define the architecture and the data model; then the development, especially the ETL and the completion of the project with all the measures, is usually performed by our customers, the customers who asked us to help them create these models.
When the development team does not have knowledge of MDX and multidimensional, and maybe they only know SQL, then DAX, which is the language for tabular, is much easier for them to learn. So when we are in the 80% that is good for both multidimensional and tabular, we suggest tabular for those people who have to learn everything about Analysis Services from the ground up. However, if we have a company with an internal team that is already confident in multidimensional, we consider using multidimensional, because maybe they already have good skills for that and they do not want to give up some features that are still present in multidimensional but not yet present in tabular. There are a few cases that we can dig into if you want.
Generally I would say there are a few scenarios where tabular is very good, and of course we have some constraints. Data has to be stored in memory, even in a compressed way, and this means that we have some limit on the size of the project that can be stored in tabular. I would say that today maybe 60 to 70% of new projects can be done in tabular and for the other 20-30% we suggest using multidimensional, even if probably more projects could be developed in tabular, because of the percentages I mentioned before.
Greg Low: I suppose one of the questions I should ask, and it is something I completely agree with, but from the listeners' point of view: why would you prefer to do those projects in tabular rather than doing them in multidimensional?
Marco Russo: Well one
Greg Low: I mean apart from the fact that the team might not be familiar with MDX.
Marco Russo: Okay, I will say this. Tabular has a very low cost of maintenance. This is not about the skill of the people; I know that I don't want to have problems in the future. For example, when creating new measures in the future, I don't want to be required to check aggregations, check performance and things like that. The response time of a query in tabular is very predictable: even if you know that the data will grow, you can easily predict the response time of a query, whereas in multidimensional, sometimes when some dimension grows you can start getting bad performance. Of course you have to test your model with a realistic size of data. Usually I see an advantage in tabular for the cost of the life cycle of the project.
When you do not have a clear idea about which measures you will want to write in the future, then tabular is very good. Another scenario: when you have to do a lot of distinct count measures, tabular is very good.
Greg Low: Yes, you do find performance differences in different areas and distinct count is absolutely one of those. I suppose the underlying data model sitting inside tabular is already a list of all the values, so distinct count becomes a really, really easy calculation for it to do.
Marco Russo: Yes, fundamentally you do not suffer from the performance issues that you have in multidimensional. In multidimensional you have to process the entire table for each distinct count measure, whereas in tabular you can take any column and evaluate the distinct count at query time without having to plan this in advance. This is a huge advantage in some scenarios. For example, in marketing sometimes you want to know how many new customers you got; you want to do basket analysis; you want to do many different analyses about distinct products against distinct customers, and things like that. You get very good performance, and if you think about a star schema, that is the good thing in tabular. Imagine you have a dimension that has product, subcategory and category, like the classical AdventureWorks: in tabular it is very easy to count the distinct products, distinct subcategories and distinct categories for any slice of data.
If you think about what you have to do in multidimensional to do the same thing, you have to change the cube structure in order to denormalize, for example, the subcategory ID into the fact table, just to be able to run the distinct count measure. Not to mention what you have to do about optimization, so you end up with a huge cube.
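As a rough illustration of the point above (this sketch is not from the interview; the table and column names are assumed from an AdventureWorks-style model), distinct counts in a tabular model are single-line DAX measures that can be added at any time and evaluated at query time:

    Distinct Products      := DISTINCTCOUNT ( 'Product'[ProductKey] )
    Distinct Subcategories := DISTINCTCOUNT ( 'Product'[SubcategoryKey] )
    Distinct Categories    := DISTINCTCOUNT ( 'Product'[CategoryKey] )
    -- Counting customers with sales in the current filter context works the same way:
    Customers with Sales   := DISTINCTCOUNT ( Sales[CustomerKey] )

No extra measure groups or cube redesign are required; each measure simply references an existing column.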
Greg Low: Yes.
Marco Russo: Just for this reason tabular has a great advantage.
Greg Low: I suppose the first thing we should really tackle: what are the missing features, or which features do you run into difficulties with because they are not present?
Marco Russo: I think that there are differences. The most common one is that in tabular you do not have a replacement for what you have in multidimensional with the MDX script. You do not have today the ability to apply a transformation over an existing measure.
For example, if you want to implement a Year To Date (YTD) calculation, and we can talk about very simple calculations: if you have five measures, say the quantity, the sales amount, the average price and so on, for each measure you have to create another measure just to run your YTD calculation.
Whereas in multidimensional, with the pattern that we call the DateTool dimension, you can apply a calculation over an existing measure, so you can dynamically change the slice of the data, applying the same measure to a different set of data. When you define a YTD calculation you can apply this YTD to any existing measure without having to write any other line of code. So when you have many measures in a tabular model and you want to apply different calculations over time, YTD, month to date, difference over a year and so on, and you want to implement these measures on the server side so that any client has a measure that is ready to use, this can become very expensive. Expensive in terms of time to code the measures, and also in terms of navigation of the measures, because in the end you end up having a long list of measures with all these combinations.
There are some techniques to simplify the presentation to the end user, but in the end, on the server side you have to create one measure for every combination of calculation that you want to perform. So today, not having something that is equivalent to the MDX script is one of the major issues that I have in complex projects. Even if I have to say that until now this was not such a big issue, because usually you do not need all the combinations of all the measures with all the calculations. When you look at the real world, it is not such a big issue in many projects. In some projects it is a big issue, but it depends on the requirements you have.
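To make the cost concrete, here is a minimal DAX sketch (illustrative only; measure, table and column names are assumptions, not from the interview). In the tabular versions discussed here, every base measure needs its own companion measure for every time calculation:

    Sales Amount YTD := TOTALYTD ( [Sales Amount], 'Date'[Date] )
    Quantity YTD     := TOTALYTD ( [Quantity], 'Date'[Date] )
    Sales Amount PY  := CALCULATE ( [Sales Amount], SAMEPERIODSLASTYEAR ( 'Date'[Date] ) )
    -- ...and so on for every measure x calculation combination,
    -- because there is no server-side equivalent of the MDX script.

With five base measures and four time calculations that is twenty extra measures, which is exactly the explosion Marco describes.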
Greg Low: What about areas such as many to many relationships?
Marco Russo: This is an interesting area, because I wrote a long white paper about many to many relationships in multidimensional, and last year we released version 2 of this white paper covering both multidimensional and tabular.
Greg Low: Yes this is the many to many revolution white paper. I will put a link to that in the show notes.
Marco Russo: This is still the white paper with a high number of downloads from our website, and every day tens of people download this paper. In tabular we still do not have the same experience that we built in multidimensional over more than 10 years. Many to many appeared in 2003, in the first beta of SQL Server 2005, so we now have 10 years of experience in multidimensional and just one year in tabular.
The good thing in tabular is that the performance is good. Not very, very good, but good, probably very good. We usually get good performance with tabular with many to many, certainly better than multidimensional; sometimes it is not enough, because you get used to good performance and you just want to be faster, faster and faster. A single many to many measure in tabular, because your data model is entirely in memory, is something that is pretty quick; a single cell can usually be evaluated in less than one second.
The problem is that when you have a large pivot table with many cells, or you have a report with some calculation written in the DAX language, you may have the issue that, because some calculations are not entirely parallel in the formula engine of Analysis Services tabular, you sometimes still get some slowness; it depends on the query. I am talking about complex calculations, with complex many to many relationships, with large databases.
I am talking about complex scenarios, but certainly we are now in a place where we can do the calculation, whereas in multidimensional it was impossible to run the calculation. For example, we usually show in our sessions a 4-billion-row fact table in a many to many relationship with a 2-million-row bridge table, and we usually get answers in just a few seconds, maybe two to five seconds, which is pretty good.
That example that we show was impossible to run in multidimensional; after minutes you shut down the server because you get no answer. However, we still want to go faster because we know something can be improved. So actually I am pretty confident that we have a good engine, but probably there is still space for improvement, especially at the higher level, in the formula engine, which could be parallel; some calculations could operate in a parallel mode in a query, whereas today some things are still in a single-threaded formula engine. There is space for improvement and I hope in the future we will see something better.
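For readers who have not seen the whitepaper, here is a minimal sketch of the classic DAX many-to-many pattern of that era (the table and measure names are illustrative assumptions: a Sales fact related to an Account dimension, and a Bridge table linking accounts to customers):

    Sales Amount M2M :=
        CALCULATE (
            [Sales Amount],
            SUMMARIZE ( Bridge, Account[AccountKey] )
        )
    -- A selection on Customer filters the Bridge rows through the
    -- Customer -> Bridge relationship; SUMMARIZE turns the surviving
    -- bridge rows into a filter over Account, which then filters Sales.

Because the whole model is in memory, this filter propagation is what makes the large fact table scenario Marco describes feasible.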
Greg Low: Are there any characteristics of a project, when you are first starting out, that you look at and think that's a red flag, I should just go straight to a multidimensional model instead?
Marco Russo: Yes, I think there are several, but the most important one is when you have heavy use of the MDX script for the business logic. We always say don't put too much calculation in the cube: try to put everything you can in the ETL, try to prepare data in the data mart, but sometimes you need to do that, for example for simulations, for comparisons with budgets, or something like that. We have seen projects where the MDX script part was very important and it would have been much more expensive to move those calculations into the ETL.
In those cases, where you apply calculations in a particular area of the multidimensional model, you really need a multidimensional model. When you really have calculations that are based on hierarchies, then multidimensional is better, because you can express your calculation in a much more consistent way. In a word, it is safer to test; it requires more skill to write these measures, but in the end I am more confident, because we have to manage a lower number of lines of code for the calculation. So at the end we are more confident about the numbers that we provide to the end user.
In tabular you can do whatever you want, but you end up writing a lot of lines of code, and so you end up getting some errors, also because in tabular you do not have any syntax in DAX that understands the concept of a hierarchy. So when you write something, when you say, for example, I want to compute a ratio between this number and its parent over this hierarchy, you have to write something in DAX that cannot take advantage of the data model. You move into the DAX code the notion of who is the parent, like you would do in SQL.
Greg Low: Yes.
Marco Russo: If someone changes the hierarchy tomorrow, this change is going to break any DAX formula you wrote, whereas in multidimensional, by changing the data model you immediately apply the changes across your calculations in a relatively easy way, and even if you have to change something, you have a small amount of code to manage, so you are quicker in applying the changes to your data model. When you have strong hierarchy-based calculations in the MDX script that cannot be moved into the ETL, then in this case of course we don't spend time trying tabular; multidimensional, just for the specificity of MDX, is better.
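As a hedged illustration of the hierarchy problem (names assumed, not from the interview), a ratio-to-parent measure in DAX has to hard-code which column plays the role of the parent level:

    Ratio to Category :=
        DIVIDE (
            [Sales Amount],
            CALCULATE ( [Sales Amount], ALL ( 'Product'[Subcategory] ) )
        )
    -- The denominator removes only the Subcategory filter, so it returns the
    -- total of the parent Category. If the hierarchy changes, every measure
    -- written this way has to be found and rewritten by hand.

In multidimensional, an equivalent calculation can be expressed against the hierarchy itself, which is the maintainability difference Marco is describing.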
Then there is the case of a huge amount of data, and by huge today we mean terabytes. You have several terabytes of data in SQL Server and you know that you cannot eliminate columns, you cannot reduce the data too much, and you end up needing pre-aggregated data because you cannot store everything in memory.
Greg Low: Do you have any guide that you use yourself? I mean this is one of the tricky things with tabular, deciding what will fit in memory. How do you go about making that assessment?
Marco Russo: This is not easy, because usually we make some tests. I mean, when you have a relational database under 1 TB, usually we know that we can store this data in memory, if we have enough resources in Analysis Services, because uncompressed data in SQL Server usually compresses between seven and ten times in tabular. We have also reached a factor of 100 of compression with particularly optimized fact tables.
Greg Low: Yes
Marco Russo: That is not the common condition; the average is 7 to 10. When you have more data than that, we try to understand how much of this data really has to move into Analysis Services, because often you don't need to get all the columns into your Analysis Services model. This would be the same in multidimensional, in reality.
Greg Low: Yes
Marco Russo: You are not worry too much even if you import some data you don’t really need to much. At the end the model works you can just optimize this later but it is not going to stop you from deploying the project. Whereas in tabular you have to understand that if you break the memory available the entire database is no longer available so. You have to be more careful with that so when we have large data warehouses the limits of the yellow warning is 1 TB and above that limit when we approach this limit we are careful. Above 10 TB we start to consider probably to tabular is still not good but we still evaluate how much of the data we have. It has to be imported into the data model.
Greg Low: Yes, actually it is interesting; another thing I will put a link to in the show notes is the hardware sizing guide for tabular models, which I was one of the reviewers on recently, and I remember the discussions that happened with the authors of that. I was saying, how realistic is it to say 1 TB of memory is a good starting point? This is something that people are not used to yet in terms of provisioning servers with very large amounts of memory.
Marco Russo: I still haven't seen companies with 1 TB of RAM on their servers. Today we usually request 128 GB to 256 GB; probably with the cost of the RAM and the cost of the server today, this is a good return on investment for the money you spend. The hardware that supports 1 TB is probably still expensive.
Greg Low: Actually it is surprisingly getting much better. I had one of my clients the other day, just a medium-sized client, and they were looking at a new server. I was intrigued to see that it was supporting 24 memory sockets. It is interesting to see servers starting to get to that number of memory sockets.
Marco Russo: Yes, yes, it is interesting. The problem is that with tabular we still have to be careful, because if you have many CPU sockets and you have NUMA architecture, tabular today is still not NUMA aware, so this means that you may get a surprise in terms of performance when you start using a lot of memory. So today you have to be careful with the memory, because even if the server supports lots of RAM, the best way to use your RAM is to have everything on a single node in terms of CPU sockets.
In general, if you have a dedicated Analysis Services machine, it is better for a single instance of Analysis Services if that instance runs on a single socket with all the cores available there, so that it allocates only memory that is local to the socket. The problem is that in a NUMA architecture, if you spread the allocation of your database across the memory of all the sockets, because the engine is not aware of where the thread should run to access the memory faster, this can be a problem for performance. You often see a smaller machine perform faster than a larger machine with a lot of RAM just because of this issue.
Greg Low: Yes.
Marco Russo: It is something that Microsoft is aware of, but in the current version we don't have a solution for that. If you have several services on the machine then you can of course partition your hardware; you can say I run Analysis Services here, I run SQL Server there, so it makes sense.
Greg Low: Listen, one of the decisions you take with a multidimensional model is usually whether to put everything in one cube or in multiple cubes, and I am interested in your take on what you did with multidimensional in that regard, and whether that changes now, given the memory limitations in tabular?
Marco Russo: I would say that the original idea of the UDM was to put everything into one single cube. Beyond the performance issues of that, because there are performance issues with that, there is another important reason that suggests not doing it: it is the fact that when you have so many different fact tables, you end up having not just one conformed dimension that is good for everyone, but different versions, different views over the same dimension for different departments of your company, for example.
Imagine that for the sales department it is important to look at customers as a slowly changing dimension type II, whereas for the marketing department it is important to look at the customer as a slowly changing dimension type I. The underlying data is exactly the same, but everyone wants to look at the data in a different way, and it is not easy to keep a single dimension that is good for everyone, because this usually ends up creating many attributes that you start hiding in some perspectives and showing in others.
Greg Low: Yes.
Marco Russo: This makes the complexity of the data model grow and grow, and in the end, usually after one or two years, the data model becomes so complex that nobody wants to touch it, and they start creating new cubes, or they stop improving the cube, just because they are scared of breaking something in an existing report.
There is another important factor, especially in multidimensional: refactoring the structure of the cube could mean that you break an existing report, and you don't have a way of fixing that other than changing the query. You cannot create a view on top of the cube, which is something that in SQL can solve this problem, because you define an interface for the report and you just have to comply with the existing interface. You cannot do that in multidimensional, so we have seen over the years that it is better if you consider multidimensional as a way to expose data marts, or department data marts. I don't care too much about creating two different cubes that show the dimensions in a different way, because I know that they have different purposes.
Then of course I need to have a single version of the truth, but this is a problem for the data warehouse, not for the cube. If I need to compare data across departments, when we identify this need we create cubes that get data from different star schemas, and in that case we have the conformed dimensions for that particular view of the cube. In any case, we have duplicated data across different cubes within the same company, and this is not a problem, because this is just a report of data coming from the same source. This is much more flexible in the end. We have a description of this in the SQLBI methodology, which also describes how to create the data mart layer with a star schema in order to support this in a better way. But again, this is not a philosophical discussion; it is just about being flexible with the requests you get from the customers without breaking existing systems. This is very important, usually.
Greg Low: On the topic of things that aren't present in that model at the moment, data mining is another area. What is your take? Do you think it will migrate into the new project styles, or will they keep it separate? Do you have any feel for it?
Marco Russo: This is a good question, but we would have to redirect it to Microsoft. Currently I don't have a huge number of customers that I have seen using the data mining tools, and when data mining has been used, even though it can be integrated with multidimensional cubes, the use of this integration that I have seen is limited. Today it is not a big issue having a multidimensional instance running just to satisfy the data mining queries and having the tabular instance running for the calculations.
Of course I hope that it will become easier to integrate tabular with data mining, because it is so natural: the result of a DAX query is a table, which is a data source for data mining processing, and the data mining add-in for Excel is good for exactly this reason. I expect some integration to happen in the future, but I don't know, and if I knew I could not say.
Greg Low: Yes, that is exactly right. It does lead to a situation where you might feel the need to have both types of Analysis Services servers present, but having both present on the same machine is usually not such a great idea given the memory requirements of tabular, I think. Is there anything else that you feel is totally missing, or are you getting more and more comfortable doing more and more projects with tabular? I am thinking maybe translations or some of those areas?
Marco Russo: Well, I have to say that translations are an important area for one reason that has not been a goal for Microsoft, but it is very important, and I will try to explain. Power Pivot is simple to use: you get a table and you present the same table, so the names of the columns that you import are the same names that you expose to the end user. If you rename a column, this information lives only within the data model, so for this reason we suggest that you create views on top of the tables that you have, so that you expose in the view the same names that you want to expose in Power Pivot. The problem is that, for example, you may have a naming convention for your tables and you may have some reasons to keep the same naming convention in your project. For example, one of the cases I discovered: a company in Northern Europe used English internally but presented the data, or at least the metadata, in the local language. They used translations not because they have users using different languages; they supported just one language for the end user, but the names they used internally were different from the names used externally. And the most important part: the names that they present to the user can change, and when they rename a column they do not want to rename the references to that column in any calculation or in any report that they have already built.
This is the main reason for using translations: you decouple the user interface from the internal naming convention, which you would never change over time because you create formulas using these names.
Greg Low: Yes, that is one of the challenges with these analysis-type tools: the fact that you don't have the ability to have a real stored proc or view layer, as you are saying, just sitting above that, which could then be used as the interface to the applications.
Marco Russo: Yes, there is a workaround for that today, because translations are not supported in tabular the way they are supported in multidimensional, even if we know that Microsoft knows this is high on the priority list. However, if you use BIDS Helper you can actually translate the metadata: you cannot translate the content of a table, but you can translate the name of a table and the name of a column that is exposed to the end user.
Greg Low: Yes.
Marco Russo: This is a workaround, because you can maintain the same sort of decoupling between the internal representation and the external representation by using one language for the end user and another language for the internal representation. For example, if your users use English, you could use US English and UK English for the two versions of the names.
Greg Low: Actually I will put a reference in the show notes to BIDS Helper. One of the things I haven't tried: I changed over the other day myself to using the new Visual Studio 2012 projects; do you know if BIDS Helper runs in that environment as well?
Marco Russo: This is a good question. In the last couple of weeks I was just able to install it, open it and look at it, but I haven't had time to do any tests. I don't remember.
Greg Low: Overall I have found it pretty good, although there are some things that I'm not that happy with; they are more in relation to Integration Services. One of the challenges with Visual Studio 2012 is they have gone to a very flat color scheme, a modern interface they would say, but the problem is that they have removed all the colors from everything. So, for example, if I have a task in Integration Services and I now disable it, there is almost no visual clue at all that it is actually disabled.
Marco Russo: As I have said, I am a developer, I was born as a developer, so I was one of the beta testers for Visual Studio 2012, and the interface in the first versions of the beta was so bad that they got very bad feedback. They had the same issue fundamentally, but they actually fixed it over time; there was a big difference between the first beta of Visual Studio 2012 and what they actually released in terms of user interface. I would expect that they will fix this too. They didn't release this as a preview, they just released it, so I hope they will fix it in a future version, because the feedback is very clear.
I remember I read something from the product manager of Visual Studio, who was probably just trying to justify the issue in the beta, saying: we put ourselves in a position where in the beta we keep everything black and white, then we add color only where it is required; we use color when we need to. So they started with black and white; in this way they didn't make any assumptions about what was necessary, because otherwise it would probably have been an arbitrary decision. This way they tried to get the feedback and to fix the use of colors in a way that was good.
Actually I haven't used Integration Services in Visual Studio 2012, so now that you mention this I will make some tests. Even if they don't have the...
Greg Low: The only difference is the name of a task changes from bold to slightly more grey. I would imagine for someone with some sort of visual problem that would be a real issue. I think also when I just glance at a package I used to be able to see disabled tasks immediately, but now it is not so immediately obvious. Anyway.
Marco Russo: Yes, I know. I have to say that I hope they will improve this according to the feedback, like they did in Visual Studio 2012 for the developers. I hope they will take care of us like they did for the developers.
Greg Low: Listen, I do want to ask you one thing around ETL though: for the underlying models, do you still prefer to have some form of dimensional model and/or star schemas and so on underneath? Because I read a lot of content that seems to come from Microsoft that sort of promotes the idea that you can just layer tabular models over the top of existing systems without doing that sort of thing. I must admit my experience is that that is not a good outcome, so I am interested in your thoughts.
Marco Russo: It is not a good idea, but it is a long discussion, because first of all there is a big difference between the use of the dimensional model for storing data in the data warehouse and the use of the dimensional model for reporting needs. For example, I am of the school which is a hybrid between Inmon and Kimball, so we tend to use a more normalized version of the database for the data warehouse, but we use pure star schemas for the data marts. We do not use the conformed dimension approach of Kimball for the data warehouse, for the same reason that I said before: in two or three years it tends to become unmanageable. I have seen in long-term projects that it is not a good idea; it is not as flexible as it can be with another approach.
When I say normalized I don't mean just take the OLTP system and put it into the data warehouse. We are talking about something different; we are talking about entities, which do normalize data but into a smaller number of tables, and which are entities, not dimensions or facts. Every entity can represent a fact or a dimension depending on the type of analysis you want to do.
The idea is that in a certain way we reduce the number of tables in the data warehouse, and then we transform these tables when we go into the data mart. For simple schemas the two schemas are very similar; in reality what changes is that in dimensional modeling, in Kimball, we use surrogate keys, whereas in the data warehouse we only use application keys. Sometimes, as an exception, we create keys that are internal to the data warehouse, because it is not possible to use an external application key as an entity key for integration reasons, because data comes from different sources or something like that. But it is an exception.
Now, tabular has the ability to use application keys instead of surrogate keys. Moreover, if you use Power Pivot and you have a star schema, it is better not to expose these surrogate keys to the end user. In an ideal world, the user of Power Pivot has access to some views over the data mart. Now, if your data mart is a star schema, you think, okay, I expose the dimension tables and the fact tables and I create the joins by using the surrogate keys; this is natural to anyone. If you do that, the end user will have a very good, simple Power Pivot data model with data coming from your data source, but he will get in trouble as soon as he needs to integrate this data with other data that may be coming from other sources. Maybe from another data mart, or the Azure data market, or whatever, or for example he extracted data from a campaign he ran on the website, he imported the IIS log, something like that. When you have to relate data, you will never ever use the surrogate keys in Power Pivot; it is a key that you can use only for data coming from that source, and outside of that data mart this number doesn't make any sense.
In a way it would be better for Power Pivot, both from a performance point of view and from a usability point of view, to have just the application keys. From a performance point of view because you reduce the number of columns, because in the end you will always expose the application keys when you have integrated data in Power Pivot, and the same of course is true for tabular. If you have the surrogate key in your fact table, you have to follow the relationship to the dimension just to get the application key, and then with this application key you create the relationship with another table. You are adding one relationship that would not be necessary if you had the application key in the fact table.
From the point of view of tabular, it doesn't matter if you have an application key or a surrogate key in the fact table; it is absolutely the same. Now of course, if you have a slowly changing dimension type II, you need the surrogate key, but usually not all your dimensions are slowly changing dimension type II. I am not saying that now we have to create the data mart by using application keys, absolutely not. But I have doubts about what we expose to the end user who imports data into Power Pivot, and the same is true for the views that you use to load the data into tabular: they could take advantage of exposing application keys instead of surrogate keys.
I know that in terms of processing time, if we need to create a join just to extract the application key from the dimension in order to expose it in the fact table, we are going to slow down the processing; I know that. I am just evaluating, especially for the Power Pivot user who has less knowledge about how to traverse relationships and things like that, that it could be easier to have something simpler to use; surrogate keys create more issues for the users of Power Pivot.
Greg Low: Yes, I must admit that is an area I haven't had a big issue with. Part of the reason why I still tend to like having surrogate keys, I suppose there are a number of reasons: one obviously is if you have some sort of slowly changing dimension, like a versioning thing. Often I have dimensions that have come from multiple source systems, in which case it gets really, really hard to know which one you would use anyway. I also find that many of the source systems have things like multi-column keys
Marco Russo: Yes.
Greg Low: Which aren't going to translate terribly well, and some of them are big alphabetic keys and so on. Actually, personally I like the idea of having a fairly clean data model that is laid out the same way in each of them.
Marco Russo: I agree with you; I am just raising one possible issue, because when I design a model today I still use surrogate keys, also for tabular; we usually still have surrogate keys internally. My point is that, in the end, if I have a surrogate key and I know that for a particular model it would be better to expose the application key, I can obtain this information in many ways. I can also create this information directly in the tabular model by creating a calculated column in the fact table. I have a lot of ways to obtain what I need.
My point is that I can do that in many ways, but the user of self-service BI has access to some views, gets data from these views, imports this data into a Power Pivot data model and starts browsing the data, and maybe starts to integrate this data with other data. In this specific scenario, exposing surrogate keys would not be a good idea, even if I know that we have reasons for them.
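The calculated-column option Marco mentions can be sketched in DAX like this (a minimal sketch; the table and column names are assumptions, not from the interview):

    -- Calculated column in the Sales fact table that denormalizes the
    -- application (business) key from the related Customer dimension:
    CustomerCode = RELATED ( Customer[CustomerCode] )
    -- Self-service users can then relate externally imported data
    -- (campaign lists, web logs, ...) to Sales on CustomerCode instead
    -- of on an internal surrogate key.

The model keeps its surrogate keys for loading and for slowly changing dimensions, while the exposed column is one an end user can actually recognize.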
Greg Low: I must admit, even in the data models I build like that I would not normally ever expose those keys. But anyway yeah that is interesting.
Marco Russo: I think we are just at the beginning of this journey into the self-service BI experience and we still have things to understand. It is just something that I have realized: when users have to manage these surrogate keys, they have more trouble.
Greg Low: Yes.
Marco Russo: I try to reduce the exposure to surrogate keys for the user of self-service BI; it is certainly better, but this does not mean, in my opinion, that we have to start creating star schemas in data marts using application keys. I am not saying that, because I know that we would get into much more trouble on the SQL side.
Greg Low: Indeed. Actually, one thing I should ask you there too: a lot of the messaging from Microsoft tends to talk about self-service, very much self-service. They look at Power Pivot and they say, oh, a user could do this. That is not my experience. What I find, even with Power Pivot, is that they still want an IT person to do the data model, even inside Power Pivot. Then they will happily build pivot charts, pivot tables, things like that over the top of it, but I find very, very few users recently want to tackle anything in the data model at all. What is your finding on this?
Marco Russo: Probably I am in a particular position: we have people around the world who send questions to us, and we are discovering that there are people who are not in IT, or not supposed to be data modelers, who are starting to use Power Pivot in the way Microsoft intended it to be used. If I look at the requests we receive, I would say they are probably still very few. They go to our website because we have articles and a lot of material, and they probably contact us and request something because they know that they need some help.
The point is that they try to create the model by themselves. These users are usually former, or actually currently still, Excel users, very advanced Excel users. They have a very deep knowledge of the domain model; they know their data very well. They know very well what they need to obtain from the data, and they realize it is much better if they try to create the model by themselves than to move their knowledge to someone else in order to create the model they want, with all the time that transferring that information requires.
The point is that in one scenario you have users who have always been used to asking someone else to create a report, and they limit their action to looking at the numbers. There is no hope that they will change and become data modelers. But if you go to people who maybe never used a BI system before, because they were used to getting the raw data in Excel worksheets, or a database like Access where they dump data that could not be stored in Excel, these users actually use Power Pivot in a good way. If you find an end user who has a big, very complex Excel workbook with many worksheets that takes minutes to update when they refresh it, talk to this user and try to present Power Pivot to him; you have a good chance that he will try to use it. But this user probably was not a user of your BI system before, because he is the type of user that says: okay, these are the numbers that you prepared for the people who just look at the numbers; I have to create new numbers, I have to create new analyses. For example, people in accounting, people who control the costs for a company: they try to understand where you can save some costs by finding inefficiencies in the company, looking at data, crossing financial data with production data in a manufacturing company, and they try to integrate data coming from different sources by themselves. This type of user sometimes asks IT: please, can you give me a text file with just this data. They don't specify what they will do with this data; they just ask for the raw data, or a summarization of the data that is available somewhere, and they integrate this data with other data in some way themselves.
Greg Low: I was going to say, I have a colleague who used to say the target person is someone who knows what a VLOOKUP is in Excel.
Marco Russo: Absolutely, yes. If an end user has never heard of VLOOKUP he is not a candidate user for Power Pivot. Absolutely! If he knows VLOOKUP, he will get engaged with Power Pivot in a few minutes. Just show him how to create relationships between tables and he will buy the product at that precise moment.
Greg Low: Yes, that is excellent. Listen, we are getting towards time, but Marco, is there life outside SQL Server and BI?
Marco Russo: Well, I have a wife and two children, so when I have some spare time I try to spend it with them. I still haven't forced my son to be a programmer, so I am proud of that.
Greg Low: Good.
Marco Russo: He knows that he can use a computer at any time, but he is 10 years old now and still hasn't started to program, and he plays Xbox very well now. As for my personal interests, I try to read something that is not just technical books sometimes, not very much though, to tell the truth; I try to stay updated with economics or technology that is not strictly computer related. I go skiing when I can, and I run a little bit just to stay in shape, because otherwise our work...
Greg Low: Yes very sedentary work. Absolutely.
Marco Russo: Absolutely. I don't have a hobby like being a musician, for example. Knowing that I am not able to learn such a thing in a good way, I abandoned it before beginning. But now my little daughter is starting to play the piano; I see her playing and I like it. I would like to do that, but if she is able to learn it, that is fine for me.
Greg Low: There is an amazing intersection between my friends who are musicians and those who are IT people, particularly developers; there is a very strong connection between developers and musicians.
Marco Russo: Probably, yes. There is the note and it has to be played this way; there are rules involved, some rationale. But music is not just that; not everything is rational.
Greg Low: Oh no but on the contrary I actually think it is the artistic side of it that appeals to the developers as well. I think good development is art as well.
Marco Russo: I have to say I am used to listening to classical music, and I prefer to listen to classical music played live in auditoriums. When I listen to this type of music I am horrified by anything that has something electric; everything has to be unamplified, absolutely natural. It is strange, because in the end I am a very technical guy, but when it comes to classical music, for example, there are some little auditoriums in Torino where they play every week, and when I am here I try to go to the concert on the Friday night. They don't have a real organ and they play an electronic one, and it is horrible. When there is a concert in a large place they have to amplify the music, but I am used to listening to the music played directly by the instruments; you stop hearing the direction of the music, because the music should come from the instruments, and it is very strange when you are used to listening to live music. So, since I travel a lot for my work, when I can I look to see if there are some interesting concerts and I try to go to a concert when I am in the city.
Greg Low: Very good.
Marco Russo: When I can.
Greg Low: You are also a fairly prolific writer in recent years, so have you got anything else that you’re working on book wise or?
Marco Russo: Sorry?
Greg Low: Writing books, anything else you are working on at present?
Marco Russo: We have just finished the Excel 2013 book, on Power Pivot for Excel 2013, and we are currently writing some content about DAX.
Greg Low: Is this in conjunction with Alberto?
Marco Russo: Yes, yes. Alberto and I started working together; in our activity we can split our jobs, so there is something that I do better than Alberto just because I like it more, and Alberto likes something else more. In reality, on Power Pivot and DAX we cover pretty much the same things, and being two, we are able to split the work of writing, for example. Which doesn't mean half the book is written by one and the other half by the other.
Greg Low: Yes no.
Marco Russo: We complement each other in some parts, and every book is reviewed by both of us, so in the end, in some books it is hard to say who wrote what. We work together because in this way we can also reuse more of our work. For example, when we travel to deliver a workshop or a course somewhere, we usually try to deliver some community session in the local area, and we try to prepare a certain number of sessions that we can recycle over one or two years. You cannot generate a new session every week, of course.
Greg Low: No.
Marco Russo: Also, in this way we can prepare better quality materials, for the slides and for the demos, and I think the experience is much better.
Greg Low: That is great. Well, speaking of conferences and things, where will people see you coming up?
Marco Russo: I will be in Chicago on April 10 for the PASS Business Analytics Conference. I will be at SQLBits in Nottingham in the United Kingdom from 2-4 May. I will be in New Orleans at TechEd North America at the beginning of June, and at TechEd Europe in Madrid at the end of June. Of course I will also be in Italy for the SQL Server and Business Intelligence conference at the end of May. It is an Italian conference, so if you love listening to the Italian language it is a nice conference in Milan, and there is also good food, but if you speak only English it is just nice to hear; you will not understand it.
Greg Low: Actually it is so funny; I did put a little effort into learning Italian at one stage. I was playing soccer for an Italian club in Brisbane; it was originally called Azzurri and it then became Brisbane City. It was interesting that the entire team was Italian apart from me, and there was one Greek guy in the team, which was good because they always picked on the Greek guy instead of me. That was great, but I must admit most of the Italian I learned there was conversational Italian at the soccer club. Not very polite.
Marco Russo: I always say: guys, in English you have so few bad words; in Italian we have so many bad words that an entire dictionary is not enough.
Greg Low: That is lovely.
Marco Russo: Yes the bad words.
Greg Low: Good. Well, listen, thank you so very much for your time today Marco, and we will talk to you again soon.
Marco Russo: Okay thank you very much.