Matthew Roche
SQL Down Under Show 25 - Guest: Matthew Roche - Published: 21 Nov 2007
In this show Matthew provides a collection of tips and tricks from his experiences using SQL Server Integration Services.
Details About Our Guest
Matthew Roche is a data architect with systems management innovator Configuresoft Inc, and is also chief software architect of Integral Thought & Memory, a business intelligence-focused consultancy. Matthew has worked with Microsoft SQL Server since the mid 1990s, starting with SQL Server 6 and 6.5, falling deeper in love with SQL Server each year. As a Microsoft Certified Trainer, Matthew has taught hundreds of database administrators and database developers how to better utilize SQL Server's capabilities, and as a software developer and architect, Matthew has built thousands of applications that use the SQL Server platform. Matthew has been using SQL Server 2005 since its early beta days, and has been learning SQL Server 2008 since June of this year. Matthew's current focus is on the SQL Server business intelligence stack, specifically SQL Server Integration Services.
Show Notes And Links
Show Transcript
Greg Low: Introducing Show 25 with guest Matthew Roche.
Our guest today is Matthew Roche, a data architect with systems management innovator Configuresoft Inc, who is also chief software architect of Integral Thought & Memory, a business intelligence-focused consultancy. Matthew has worked with Microsoft SQL Server since the mid 1990s, starting with SQL Server 6 and 6.5, falling deeper in love with SQL Server each year. As a Microsoft Certified Trainer, Matthew has taught hundreds of database administrators and database developers how to better utilize SQL Server's capabilities, and as a software developer and architect, Matthew has built thousands of applications that use the SQL Server platform. Matthew has been using SQL Server 2005 since its early beta days, and has been learning SQL Server 2008 since June of this year. Matthew's current focus is on the SQL Server business intelligence stack, specifically SQL Server Integration Services.
Welcome Matthew.
Matthew Roche: Good morning Greg, thank you for having me on the show.
Greg Low: Tell us how you got to be involved in SQL Server. Before we start, a shout out to Chris Randall, a dear friend, who sent an email saying, I demand an Integration Services show! We've been trying to get Integration Services shows in place for some time. He suggested Matthew; what a perfect choice. Tell us how you came to be involved with SQL Server.
Matthew Roche: A lot of it came through my training background, in the mid 1990s, when I got my Microsoft trainer certification. Started off doing the NT4 systems administrator/network administrator track, teaching people the knowledge they needed to manage Windows NT4 domains, help get their NT4 certifications and so on. Had a software development background from college, but this is where the demand was back then. The training center I worked for said they needed someone to teach a SQL Server administration class. I had never heard of SQL Server; I had done some database work with Oracle in college. Attended the class, prepped, did a whole bunch of reading and research to teach a class on this topic. Knew very little about it. If only I had known how little I knew! I fell in love with the details of the platform, the dance of data under the hood: database design, index tuning, performance tuning in those days, just compelling, interesting. Everything I had always wanted. Got more into the database development side. Over ten years I moved out of the classroom, further away from the administration side. These days I'm doing database development, BI development, some upper tier development. Happiest when I'm closest to data; they say if you're not happy at work, it's time to look for a new job.
Greg Low: If you were teaching SQL Server 6.5, did you end up doing the old performance tuning examinations they had back then?
Matthew Roche: I did not. That was always on my list; wasn't that a special trainer-only performance exam?
Greg Low: One of first exams I did was for the equivalent for SQL Server 4.2, which was performance tuning. Things were primitive then.
Matthew Roche: Not a lot of demand for the course. I was so busy then, some things don't change, and I had to look at where to devote my time: teaching the two core SQL Server 6.5 exams, then SQL Server 7.0. By the time SQL Server 2000 came out, I was doing much more hands-on development than training, focused on the development side.
Greg Low: Those old exams were intriguing; being a trainer-only exam, it didn't get debugged, so they were some of the worst exams for quality. For example, you needed to get 85 percent correct to pass, which was frustrating because, out of 50 questions, there were at least two to three that you could not answer. For example, pick two of the following; no way to do that because they had a radio button for the answer choice. You automatically got those wrong! There was only a single exam; if you retook it you'd get the same questions.
Matthew Roche: Real-world equivalent of walking into training center, it’s already set up, nothing you can do, yet still need to succeed.
Greg Low: Integration services is what we’re talking about today. Best practices is a good topic, what are they?
Matthew Roche: This is something I can't take credit for, but here's my list. Inspired by SQL Server MVP Jamie Thomson, who has a very active blog on Integration Services, one of the best and brightest voices. He has 30 or so best practices out there. For listeners, I recommend checking out his blog. First, metadata. Get metadata right the first time, not later! SQL Server Integration Services gets its performance through strong metadata-bound typing inside the data flow. When working with a relational data source, the Integration Services designers look at the data source, pull out the columns, sizes, all the metadata used to determine sizes downstream. A common mistake is having changes going on in the data source after the data flow is built. One of the things I'm doing in my work with Configuresoft is building a lot of SSIS connectors that will draw information from third party service desk systems like Mercury, Remedy, CA, pick a source. They tend to be hosted on Oracle, and it's an iterative process working with the source system experts to help find the data; there's no standard installation with any of that software. A true discovery process. We have a standard data flow we use, but have to tweak it each time. The most time-consuming part is getting the data types finalized, cleaned up. The designer tends to be inflexible about making those changes once there are components downstream that use the metadata. I have found (the pain I felt when we got into QA makes this stick in my mind) that if the designer offers to clean up metadata for you, you generally want to say yes, but check what it has cleaned. Hook a source up to a new view or select statement, the designer says 'metadata out of sync', and the cleanup removes the mappings in the data flow. No warnings or errors, it runs great, but no data! Not what you wanted.
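As a rough illustration of pinning metadata at the source (a sketch only; the table and column names below are hypothetical, and the real sources Matthew mentions are Oracle-hosted), the source query itself can cast every column to the exact type and size the data flow should see, so the designer never has to infer it:

    SELECT
        CAST(ticket_id   AS INT)           AS TicketID,
        CAST(summary     AS NVARCHAR(200)) AS Summary,
        CAST(opened_date AS DATETIME)      AS OpenedDate,
        CAST(priority    AS TINYINT)       AS Priority
    FROM dbo.ServiceDeskTickets;

Because the types are fixed in the query, a change in the underlying source shows up at the query rather than rippling silently through the downstream components.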
Greg Low: Big move; scripting in DTS was a variant of VBScript, very loose typing, anything goes. One of the things I find people struggle with is the precision required in dealing with data types. Another thing I want your thoughts on: the expression syntax, C-like, but not obvious.
Matthew Roche: First, SSIS expressions are the single most powerful feature of Integration Services as a whole. Having a place to put in your own logic at arbitrary places. If this were .NET development, it would be like developing your own property accessors (run my logic instead) for someone else's components. Love it. Huge capabilities. Can't build a real world package without them. The syntax has the worst of SQL, C, and others; an indiscriminate mix of language features from different languages. Functional once you get used to it, sooner or later, but not intuitive or easy to learn.
Greg Low: Your next one on the list; using template packages?
Matthew Roche: Any real world SSIS package tends to be complicated. The most complicated things about it are not the core things, like moving data from A to B, but tend to be the auditing and logistical infrastructure around it. Most of my packages start off with three Execute SQL tasks reading information in from auditing tables, updating them, performing the data flow, then going through to update records; basically housekeeping to have a paper trail, an audit trail, for what took place.
Greg Low: I see very strong mix with people who put error handling in control flow; others build error handling into package events. Doing that gives you cleaner view of control flow; do you have any strong preference?
Matthew Roche: The real advantage of using template packages is that whatever standard you come up with, you can reuse it consistently. You want to make sure you have all the things your package needs to have within that framework; all packages will have them from that point on. As far as where to put the housekeeping logic, I have no strong personal preference. When I was first working with SSIS, I would put it in the event handlers to keep the control flow cleaner; pre/post execute, not cluttering up the main design surface. But I found I would rarely be working on SSIS projects alone; I joined Configuresoft in July, and was independent for five years before that. I would always work with the development teams of consulting clients, and found that most on those teams were not experienced with the technologies. They would say, build something and build a team at the same time. The problem I saw: if the housekeeping was inside an event handler, no one knew it was there.
Greg Low: It’s not in your face, that’s the downside. Do you have any specific guidance either way?
Matthew Roche: What I've done is gone in the opposite direction; put everything in the control flow, so it is in your face. The tradeoff is visual cleanliness versus being self-documenting and discoverable; this is a big theme of mine.
Greg Low: Again, in any programming language with an eventing model, things are not so easy to follow, by comparison with having the entire control flow right there. For complicated packages, though, it removes a lot of clutter.
Matthew Roche: A few more things. I would love to see, in future versions of the SSIS tools, those events be more in your face. Look at a traditional development environment; the logic is often more complex than in an SSIS package, but with everything in text files you can easily skim through the code; it's a flatter surface to look at. SSIS gives you visual design. For the SSIS tools to keep up, there needs to be more effort in having them be discoverable, self-documenting, easier to work with for the developer. Most complaints are not related to the core functionality of the platform, but to the designers inside Visual Studio.
Greg Low: Fair point. Next on list, using OLE DB connections.
Matthew Roche: One of the things that comes up: which connection manager should I use? A lot of the questions come down to performance. From my experience, that's not the question. If you're working with billions of records, you need to spend lots of time doing performance tuning to figure out what works best. Most of the time, you're not working with billions of records. In those situations, the five to ten percent performance improvement that working with the SQL Server destination might give you is often outweighed by the productivity and consistency of using OLE DB connections consistently. The advantage is that when working with different data flow components, some only work with an OLE DB connection. You need OLE DB involved. If you have multiple connections, some use one, some use another, you not only have additional overhead against the data source, you also have additional maintenance in updating the package. You also see inconsistencies in parameterized queries: using the Execute SQL task in the control flow, if you use a SQL Server connection or an ADO.NET connection, you have named parameters to specify values and variables. If OLE DB, you have positional parameters. If you're using OLE DB throughout, you have one technology to worry about, and you can get great performance. I have packages that are pulling two to four million rows per minute using the OLE DB source and destination, with great performance all around.
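To make the parameter difference concrete, here is a hedged sketch of the same Execute SQL task statement written both ways (the audit table is hypothetical); an OLE DB connection uses positional question-mark markers, while an ADO.NET connection uses named parameters:

    -- OLE DB connection manager: positional markers, mapped by ordinal
    SELECT AuditID, StartTime
    FROM   dbo.PackageAudit
    WHERE  PackageName = ? AND RunDate = ?;

    -- ADO.NET connection manager: named parameters, mapped by name
    SELECT AuditID, StartTime
    FROM   dbo.PackageAudit
    WHERE  PackageName = @PackageName AND RunDate = @RunDate;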
Greg Low: I'm seeing very few issues with OLE DB connections. I have one client with a large set of tables, importing data at an extreme rate. In that case, maybe. In general, the difference doesn't make a noticeable difference to the package.
Matthew Roche: One more thing, a quick story. Back in the old days, "SQL Server destination gives best performance" was my mantra. We're all happy, we love SQL Server, so use it for all data destinations. I was able to get confirmation from the client: yes, the only deployment scenario we would ever need is the SSIS packages running on the destination server. Great, that's the requirement for the SQL Server destination. When we moved into full development, from a prototype on my machine to the larger organization, nothing worked. It took a couple of hours to figure out that the problem was that, in order to do real development, we needed a data set too large to work with on one machine. In the production environment the packages would run on the server, but in the development environment they would not. We had a disconnect, and had to go back to the drawing board. One more reason to use OLE DB consistently; you don't need to worry about some of those other details.
Greg Low: Next on list? Only configure package variables?
Matthew Roche: A technique that has served me well. What I love about SSIS is how easy it is to pull out the things that are not core to package functionality, the things you would think of as command line parameters; pull them out and store them in a configuration. Maybe an environment variable or a registry entry; I'm partial to XML configurations. You can have a strictly file system deployment, which makes it easy.
Greg Low: Agree. Easy to write a configuration application that writes out configuration if people want to modify it. If they’re XML-terrified.
Matthew Roche: This is a wheel that gets reinvented over and over again. I'm developing a multi-project version of the package installation wizard using that workflow. I've got SSIS installations that are made up of 20 to 30 different Visual Studio projects, and I want the ability to have a GUI, MSI-type installer where a systems administrator can pick a deployment location, enter values for those files, and so on. You see a common theme.
Greg Low: Most people weren't too bad at editing INI files years ago, but it's frustrating with an XML file; subtly change something and it's not a valid XML file any more. Easy to do. Causes people some panic. Once it's an XML file, it's easy to build a configuration tool for the app.
Matthew Roche: I felt the pain of having someone edit a configuration file and hit "s" at the end of the document instead of Ctrl+S. I built "pac-man", a package manager, for bulk updates using the object model. Also bulk package validation; when I hand packages over to QA, they usually screw things up: tech savvy in their minds, but they don't know specific technologies like XML. I give them a button to click to validate all the packages and the environment in which they're deployed. It allows us to find errors early on, versus letting it run four to eight hours before becoming aware of the errors.
Next on the list, only configure package variables. Look inside the XML configuration files and you'll notice a path expression saying what part of the package you want to configure. It starts with the package and goes down to whatever contains the value you want to configure. Example: if you're going against a variable defined at package scope, the path is always the same. If you have a configuration explicitly setting a property on a task, and the task gets moved inside a different container, like reworking the package logic and putting it inside a Sequence container, the path inside the configuration file can change. A subtle error; you wouldn't think the location on the design surface would impact functionality, but it does because of the package container hierarchy, tasks being containers themselves, and so on. I've found this works: have all configurations set variables defined at package scope. Anything else that needs that configuration information has its properties set with expressions that use those variables.
Greg Low: Yes, very obvious place.
Matthew Roche: 100 percent. This is one of my bullets in my PowerPoint.
Greg Low: Next, one target table per package?
Matthew Roche: I got this from the Kimball Group's book, The Microsoft Data Warehouse Toolkit, as a guideline for keeping things simple. When building a simple project, whether using VB, C#, Java, or SSIS, the initial thing people do is say, this will be simple, this is one thing I'll use once, and lump all the code in one place; then down the road they need to rip, replace, gut it and build it the way it should have been done. Lessons learned: I over-engineer the little pet projects that no one else will use. A .NET project with a DLL implementing the business logic; that always serves me well. I never regret that step. With SSIS, people say "there's only one thing we need to do, we'll lump it together". The package tends to grow and grow and become unmaintainable. If people step back: it's a development tool, let's work with it, do it right. Each package modular, set up with a single target table per package, and a master package using the Execute Package task to drive the logic and coordinate it all. Much more consistency. Easier to maintain. When editing, the amount of regression testing is smaller. Some have pushed back at this, but most of the time, following one target table per package will give you easier maintenance.
Greg Low: Often when building code I find, I wonder if I’m being anal-retentive, I build code to keep forever. Number of times I reuse it is amazing. Happens all the time. Run into sites where looking at code, someone used nonsensical names, thought ‘I was only going to use it once’.
Matthew Roche: Like I blogged yesterday, on importance of pain when learning. Real world experience important because of lessons learned, will never leave you.
Greg Low: Experience is you remember your last mistake, but it doesn’t stop you from making it again! Next, annotate like you mean it.
Matthew Roche: I'm called a "comment nazi". I tend to have two to three comments for every line of executable code I write. People talk about "test-first" development; I believe in comment-first development. If you can't write it in English, you can't write it in code. I outline my logic in my own words before putting it in the syntax of the day. SSIS packages are code. Executable logic. You need to annotate it. Describe it, add comments. I believe every task needs to have comments saying what it does. Some say, if it has a descriptive name it's self-documenting. Often true, but additional annotation right there in the design makes a huge difference. You're never the only one in the package. You don't want to be that guy that everyone complains about when you leave.
Greg Low: You’re preaching to the choir.
Matthew Roche: Wait, one more. In SSIS presentations recently, for demonstrations, people look at the data flow in packages I've built and say, you have the select statement as an annotation; what a great idea. Most of the time, for a complex package, the select statement is built by an expression, has different inputs coming in, and going through to find out what it actually is, is a multi-step process. What variable, blah blah blah. I will edit the variable, evaluate the expression, copy the output from the expression, paste it into an annotation, color it green, and drop it on the data flow surface. The advantage: self-documenting. Looks great. The downside: it's obsolete the moment you put it on the surface; there's no way to keep the two in sync. What do you think of that practice?
Greg Low: I like that. I find myself endlessly drilling in to find out what's going on. Is there some way in debugging to dynamically set an annotation? To expose quite a bit more while the package is executing.
Matthew Roche: Having an annotation based on an expression would be nice.
Greg Low: Where could you set that? Expression of any type, or fixed variable per component.
Matthew Roche: Maybe in 2010.
Greg Low: Or 2011! Avoiding row-based operations; is that the same sound advice as in SQL Server?
Matthew Roche: In training and consulting, you get used to "beating the set-based drum"; most don't listen, so you keep beating that drum. For SQL Server development, avoid cursors, do everything with set-based logic. The same applies in SSIS; people aren't as familiar with the environment, and it's not obvious what's set based and what's not. There is a set of data flow components that do row by row operations: non-cached lookups, which hit the database once for every row, and the OLE DB Command transformation. It's there to run an update, a delete, or a stored procedure for every row passing through. Powerful tools; but when used for anything but the most trivial flow, the performance will suck. Hitting the database once for each of 100 million rows every morning will blow out your execution time. I end up sending that work outside instead, with a temp table or worker table: bulk insert into the database, then a set-based update operation once back in the control flow.
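A minimal sketch of the worker-table pattern Matthew describes, with hypothetical table names: the data flow bulk loads a staging table, and a single Execute SQL task back in the control flow applies the changes set-based instead of issuing one OLE DB Command call per row:

    -- Step 1 (data flow): an OLE DB destination bulk loads dbo.CustomerChanges_Staging.
    -- Step 2 (control flow): one set-based statement applies all the changes at once.
    UPDATE c
    SET    c.Email        = s.Email,
           c.ModifiedDate = s.ModifiedDate
    FROM   dbo.Customer AS c
    JOIN   dbo.CustomerChanges_Staging AS s
           ON s.CustomerID = c.CustomerID;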
Greg Low: I do that all the time. I see lots of row by row logic; frustrating. It could be completely avoided. Interesting that in the SQL courseware they tended to remove discussion of cursors or put it in an appendix, trying to enforce that. Intriguing that every example of triggers would only work on single rows. So many examples: those who migrated from Oracle, writing cursors, tend to work with only a single row at a time. Another one, trying to sequence data via date/time. They realized date/time wasn't precise enough, multiple rows got the same value; so they literally put a WAITFOR in the trigger to force it to take longer; it added seven hours to processing time.
Matthew Roche: And people were saying “SQL Server is slow, Oracle never did this!”
Greg Low: That’s the winner of “trigger of the year!”
Matthew Roche: Excellent observation of old SQL courseware relegating cursors to appendix. Wonder if good practices I got into were because of that.
Greg Low: I'm sure it did. Most of the examples in the courseware with triggers have "SET @country =". Wow, ok.
Matthew Roche: Doing an update based on a join is not difficult, but it's more complex SQL syntax than people are used to. Those who write courseware should be aware of it. I've seen this in the real world, a trigger that only works on one row. Good developers will have the logic up front to check how many rows were modified.
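For contrast with the single-row "SET @country =" style, here is a sketch (table names hypothetical) of a trigger written against the inserted pseudo-table, which behaves the same whether the statement touched one row or a million:

    CREATE TRIGGER trg_Customer_Audit
    ON dbo.Customer
    AFTER UPDATE
    AS
    BEGIN
        SET NOCOUNT ON;
        -- Join against inserted rather than assuming a single modified row
        INSERT INTO dbo.CustomerAudit (CustomerID, Country, ChangedAt)
        SELECT i.CustomerID, i.Country, GETDATE()
        FROM   inserted AS i;
    END;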
Greg Low: Matthew, anything you can share with us about you?
Matthew Roche: I like to think of myself as an evangelist for the things I love. I speak a lot on SQL Server. I'm a dedicated fan of Manowar, a heavy metal band, kings of metal since the early 1980s: loud, powerful, inspirational music. When possible, I follow them around the world touring. I saw them in 2005 in Orlando, with about 500 people. I've seen them in Germany with 50,000 people. In the front row, it's a powerful experience. I would recommend them to anyone. The "Kings of Metal" album. Great stuff.
Greg Low: I’m always amazed at interests in music in relation to technical pursuits. Has to be connection between creative side of both. Quick shout out to Rob Whaley in Red Rock New Zealand, great to have a passion. He has Neil Young Friday. He works at home on Fridays, puts on Neil Young really, really loud the whole day.
Matthew Roche: I learned in college, the louder the music was, more open my mind was to accepting new information when I was studying, and outputting new information when I was developing. Very fortunate I work from home most days. Have big speakers in office, turn up volume, gets me in the zone, more productive.
Greg Low: I have a friend with a consulting business. One of rules in office, no working with headphones or music. Question, is that productive? I literally can’t write code well in a quiet room. Every little noise distracts you. Even if TV is on, I can concentrate better. Seems counterintuitive, honestly must have to do with small distractions.
Matthew Roche: I wouldn’t want to work for your friend. I would not be earning my wage in a quiet environment.
Greg Low: Next, to avoid asynchronous transforms. Let’s define what an asynchronous transform is.
Matthew Roche: In the SSIS data flow there are two, or two and a half, different types of transformations. A synchronous transform, to define it, produces the same records that it consumes; another way to think about it is the same row, but with the data modified. Look at the underlying data buffer with a derived column transformation: you might think, this has an additional column, so it must be asynchronous, but when SSIS creates the metadata for the data flow, any column added during execution of a given memory buffer has its space allocated up front, always set aside, so it stays synchronous. The advantage is that synchronous transforms do pointer math, walking a pointer over the records in the buffer rather than actually copying data, which is incredibly fast. An asynchronous transform produces new rows, separate from the ones it consumes: a dedicated memory buffer, with records copied from one location in memory to another when the transform executes. There are two types, fully blocking and partially blocking. A partially blocking transform starts producing new records as soon as it has consumed enough records to produce them; both input and output happen at the same time. A fully blocking transform does not produce any output until all input records are consumed; look at the Sort or Aggregate transforms. The downside: if it can't produce any records until all are consumed, they're all stored in memory or, worse, spooled to disk before the data flow continues. For large data flows, that gives unacceptable performance.
Greg Low: And memory usage
Matthew Roche: Absolutely can bury server very quickly. Run into memory limitations before anything else. Performance itself big danger sign. “Ran great in test”, of course, performance goes downhill very quickly.
Greg Low: I interviewed Donald Farmer recently; as an example of asynchronous transforms, the data sources themselves are asynchronous transforms.
Matthew Roche: Inherently. A data source must have its own memory buffer. Even the MSDN documentation is imprecise and inconsistent in describing the definitions of asynchronous and synchronous transforms. The best source for this is Donald Farmer's book on SSIS scripting, from back around the beta two timeframe.
Greg Low: I know it well, the "Rational Guide to Scripting SQL Server Integration Services". I've completed two books for Rational Press this year. I finally have a SQL CLR integration book; it started shipping today. The CLR guide is finally out! I really like the series, I like the format; it describes a topic well in a short format. No room for fluff. It has to be to the point.
Matthew Roche: I agree. I would say that's a book people should read who want to know more about the platform. The angle he looks at things from sheds light on the nature of the data flow. There aren't actually asynchronous or synchronous transforms, but outputs that are synchronous or asynchronous. When you write code, most coverage does not go into this.
Greg Low: Same series, Roger Wolter's book; he was a previous guest on the show. I love the descriptions of the differences between dialogs; it's the kind of conversation only the product manager is able to have. It gives you insight not otherwise available.
Matthew Roche: If you could convince Roger to come on and talk about the master data management stuff he's working on, I'd love to hear him. What he's working on is interesting and well done; the master data management space is more and more important to BI, with cleanliness and stewardship of data driving the quality of the BI platform. I'm keeping my ears open now; there's lots of chatter coming out.
Greg Low: Clients have been asking me about it; very little has been published. Next, really know your data.
Matthew Roche: Really really! The single most important thing. The things that will cause packages to break and projects to fail are NOT technical, but a lack of knowledge of the source system. Understanding what data comes from what fields, the dynamic nature of the data. What does an update look like? What user interfaces or batch processes are going against the data? All these questions are usually more interesting than the answers. The questions are things you need to have answers to before you start building the data flow; also know which questions you don't have answers for.
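A few simple profiling queries go a long way toward the "really know your data" point; a sketch against a hypothetical source table:

    SELECT  COUNT(*)                                       AS TotalRows,
            COUNT(DISTINCT CenterCode)                     AS DistinctCenterCodes,
            SUM(CASE WHEN Email IS NULL THEN 1 ELSE 0 END) AS NullEmails,
            MIN(OpenedDate)                                AS EarliestOpened,
            MAX(OpenedDate)                                AS LatestOpened,
            MAX(LEN(Summary))                              AS LongestSummary
    FROM    dbo.SourceTickets;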
Greg Low: Imprecise. Having domain knowledge is really important. Again, hassles with, for example, schools and so on. Rules like "a school has a center code; codes never change". Today, a guy said he had a school where the center code had changed.
Matthew Roche: That's the key thing. I have yet to be on a project where someone says "never" or "always" and it holds. We'll go back: are we sure? Yes, no exceptions, positive. Two months later, "this has changed". Or you run into it right away. It's the nature of a BI project, paying attention to the data more than anyone else. You find things the domain experts might not know. Maintain a risk list, like for a traditional development project. Be proactive in documenting assumptions, the things you're working on based on assumptions, so everyone understands how that information impacts the success of the project overall.
Greg Low: I really like upcoming tools for 2008 for data profiling. Really interesting. Get much better picture of what data looks like.
Matthew Roche: I'm waiting to see how the data profiling task matures. The UI and documentation were sparse; I haven't run it through its paces yet.
Greg Low: Sat through session with Donald Farmer, fairly impressive. Have high hopes for that. You’re also saying, do it in the data source.
Matthew Roche: Related to avoiding asynchronous transformations: wherever possible, I'm a firm believer that sorting, aggregations, joins, the things that would be costly, should not be done in the data flow; do them as part of the source query.
Greg Low: Couldn't agree more. Reporting Services: my preference is to watch out for formatting data in the report. I don't like complex logic inside elements of the report itself. There's no reason you can't have something you call that returns the data in the right format.
Matthew Roche: With SSIS, I see people who "want to be consistent", not realizing the nature of the transform. Also, if you're pulling information from a flat file, you can't do much in the source system; even though it's not a general rule, there are cases when it's actually faster to go from the flat file, dump it into a SQL table, then sort/group inside the select statement pulling out of that table, taking advantage of the optimization the SQL team has built into the platform over the years, rather than trying to do it all in memory. Use whatever tool works best for your scenario. Mostly, when pulling information from a relational database, you can get significant performance improvements by doing it in the source.
Greg Low: Often with reduced memory requirements. There may well be indexes that support the appropriate order. Very common. My general approach: I love to get it into some form of table, then process it in bulk inside the table.
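A sketch of the pattern being discussed, with hypothetical names: once the flat file has been bulk loaded into a table, the select statement feeding the data flow does the grouping and ordering, so the Sort and Aggregate transforms never run:

    SELECT   CustomerID,
             SUM(SaleAmount) AS TotalSales
    FROM     dbo.Staging_Sales
    GROUP BY CustomerID
    ORDER BY CustomerID;

If a downstream component such as a Merge Join needs sorted input, the source's output can then be marked as already sorted in the advanced editor instead of adding a Sort transform.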
Matthew Roche: Playing devil's advocate on myself, I do the same thing, and often question myself: am I doing it because SQL is the tool I'm most comfortable with, a crutch for me? When I look, is it what's correct, the best tool for the given scenario? Is a raw file the most appropriate place for staging information versus a relational database?
Greg Low: Another one, don’t use data sources.
Matthew Roche: The data sources I'm talking about are not the source components; I'm not saying performance would be better if you avoid those. I mean the data source objects included in SSIS projects inside Visual Studio. I did a presentation on SSIS for a local chapter, and going over this point: this is dangling there like candy. Here's a place to define a connection once and easily reuse it anywhere in the project. Auto updates, tool support, simple, candy... but poison candy. Data sources have nothing to do with Integration Services; they're a Visual Studio-only tool, so while they maintain consistency at design time, once you deploy, there's no mechanism to keep the connection strings up to date. In addition, because the implementation in the studio tools is poor, if you update the connection string inside the data source in the project, packages don't see the changes until you manually open up each package. If you have a master/child package set up, you can have inconsistent data throughout. Worse, the data source feature inside Visual Studio for Integration Services projects lets you define one common data source in one project and reference it in other projects. Glowing, super sweet poison candy! You think, "this is a tool just for what I need"; but then to update the common data source you need to edit the master, open each individual referencing project, and open each package. Huge waste of time. My time was wasted. Lesson learned: use configurations uniformly and consistently across development, test, production. Use one mechanism. Ignore the data source folders inside the project; they have nothing to do with what you want to accomplish.
Greg Low: Anything external needs to be configured. With DTS packages, the pain in migration and in editing previous packages is because of embedded connection strings.
Matthew Roche: I hated DTS projects. I wanted nothing to do with it other than the import/export wizard. Simple, to me, is boring; a simple tool. People talked to me: we want to use SSIS. The topic requested most regarding SSIS is deployment. As a developer, what could be more boring than deployment? But the key thing is a good deployment story. SSIS doesn't have the best deployment story; there aren't lots of robust tools.
Greg Low: Hoping for MSBuild integration?
Matthew Roche: The key thing to smooth deployment is to make packages location agnostic. Anywhere a package references an external resource, information not stored in the package, be it a connection string, an execution flag, etc., anything that's part of the logical interface between your package and the outside world, needs to be configured. Moving from environment to environment, you then have a seamless transition. The good thing is, there are tons of tools to support it. The bad thing is, they're not leaping out at you. You need to pay attention, know the platform, and spend the time to find out what works best.
Greg Low: Important. I see it as one of key enhancements, ability to easily do configurations. Could do it before, just not easy.
Matthew Roche: I've had experienced SQL Server people talking with me about SSIS; one thinks it's amusing that I'm in love with an ETL tool. I got into a mini-lecture on it. Even for people who spend 40 to 60 hours a week in the SQL Server world, unless they spend the time to see what the best practices are for them, they'll miss configurations, expressions, and the rest. There's very little you can't do with these tools.
Greg Low: Trainers tend to do certifications endlessly. What I've always liked about certifications is the nice list of topics you need to know; it's a very interesting list. Make sure you know about all of those. The BI exams are interesting; with the new MCITP exams, there are about 2,900 worldwide, 1,100 database, and only about 100 BI developers worldwide have done those exams. Such a small number.
Matthew Roche: The DBA and DBD exams have been around about two years longer than the BI exams.
Greg Low: True. 446 exam for BI was reasonable, I thought.
Matthew Roche: Perhaps with 2008 there will be dedicated IS, AS, or RS exams. For now I'm happy that the exams are out there. I use them as a focused study tool to get up to speed on a tool, and as a validation. A check: do I know enough to talk about this stuff?
Greg Low: Last, treat packages as code?
Matthew Roche: One of those things that people from the DTS environment tend to miss. With DTS, packages were this weird amorphous thing; you could save to a structured file, but people liked the live database version, where you go in and modify it with Enterprise Manager. A worst-practices tool end to end.
Greg Low: Because people building those packages didn’t have background in source code control?
Matthew Roche: Some building packages were DBAs, not developers. I’ve stood on both sides, not going to go there.
Greg Low: I could not imagine doing any development project without source code control system in place.
Matthew Roche: People used to that hate SSIS. DTS made it easy to do things the bad way; SSIS does the opposite. Treating your packages like code means always using source code control. I set up Team Foundation Server. I use source code control 100 percent of the time; I can't get around it.
Greg Low: Some aren't familiar with Team Foundation Server (TFS); it fits in nicely for developers with Team Suite in particular. A shout out to Maraid O'Donovan and others working on Visual Studio Team Edition for Database Professionals. TFS has great source code control. If it doesn't work for you, there are third party offerings like SVN, etc.; no excuse not to be using one.
Matthew Roche: A caution: anecdotally, I've heard of third party source control providers corrupting SSIS package code, specifically Subversion and Perforce. If this is happening, it's due to the repositories not properly handling the DTSX extension as Unicode files. Most source code repositories work so that, other than the original version, no subsequent version is stored in full, only deltas; when you get the latest version, it gets dynamically reconstructed on the fly. Things can get corrupted during that parsing.
Greg Low: I liked the comment about the number of people having issues with database corruption with their source code control. It's old pain, not where you should be.
Matthew Roche: That's old pain with SourceSafe; look at Team Foundation Server. Night and day difference. To treat packages like code: in a traditional development environment you have several environments; do the same thing when working with SSIS packages. Have a repeatable deployment process, either file copy or dtutil, to move things around, so that when you need to push out to production, you know 100 percent that when you click that button, it will work. That confidence makes a big difference. Test using real data. The data set that gets handed off with "this is what it will look like" then doesn't look like what you were given. One time a DBA wrote a script to produce sample data for production, and it had no relationship to actual production due to errors in the script. Development life cycle best practices all apply with SSIS, even though it's a different domain. Same basic tools, steps. The most important one: have that discipline. The SSIS tools give you the ability to work right out of the box. Take advantage of that.
Greg Low: The only downside of using real data for testing is privacy concerns. One of the real challenges is that test data in small quantities only goes so far, and while scrubbing real data works, it's intensive to do. Another option is in the DB Pro product: data generation options. Generate a wide variety of data in large quantities quickly. I prefer real data, but that's not always possible privacy-wise.
Matthew Roche: Not always possible, but there are steps you can take. Bits of data to start initial development are good only for the first couple of iterations. You can build an SSIS package that does some cleansing: identify the columns and tables, have lookup tables of first names, last names, social security numbers, pull in the data and merge it into a meaningful data stream. It's not 100 percent real data, but you're getting 80 percent of the value for 20 percent of the effort, and you've got the right volume. Once you de-identify the rows, you have more to work with. One concern I have: any time you deploy a package into a production environment without having first run real data through it, you don't know what will happen.
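As a sketch of the de-identification idea (all table names hypothetical), real names can be swapped for values from lookup tables keyed off the original rows, keeping the volume and distribution while removing the identifying data:

    UPDATE c
    SET    c.FirstName = fn.FirstName,
           c.LastName  = ln.LastName
    FROM   dbo.Customer AS c
    JOIN   dbo.Lookup_FirstNames AS fn ON fn.ID = (c.CustomerID % 1000) + 1
    JOIN   dbo.Lookup_LastNames  AS ln ON ln.ID = (c.CustomerID % 500)  + 1;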
Greg Low: Often if you have a true staging environment, you can do that. A few other items: SQL Server Integration Services myths.
Matthew Roche: A call to action for anyone listening: matthewmct@gmail.com. I encourage people to email me if you've heard people talking trash about SSIS. The message isn't out there about SSIS; things they say are missing are actually included, for example. People look for things in the simple way, and they're often hidden in SSIS. Send me your myths, and I'll do my best to de-bunk them. I have the 'bi-polar' blog, bi-polar23.blogspot.com. I would like to get input from as many people as possible, to kick start an SSIS myths series on my blog.
Greg Low: In the SQL MVP group, we've been building up SQL Server myths this session.
Matthew Roche: Most amazing product. Things SQL group have done, it’s so good.
Greg Low: Where will people see you next?
Matthew Roche: At the upcoming MCT summits in Berlin and Redmond, I'll be presenting on SQL Server 2008. I'm working with colleagues from Microsoft on a proposal for a comprehensive book on SQL Server 2008 BI; we should have final approval by October. I'm speaking at user groups in New York in November-December. Anyone in the Northeast US looking for a speaker on Integration Services, drop me an email.
Greg Low: Thank you so much for your time. Certainly been enlightening.
Matthew Roche: Thank you Greg.