Friday, March 9, 2012

Importing data from various data sources with non standard formats

Hi all :)

I'm wondering if SSIS will be the solution to the problem I'm working on.

Some of our customers give us an Excel sheet with data they want to insert or update in the database.

I've created a package that will take an Excel sheet, do some data conversion so the data types match up and after that I use a Slowly Changing Data component to create the insert/update commands.

This works great. If a customer adds a new row to the Excel sheet or updates an existing row changes are nicely reflected in the database.

But now I’ve got the following problem. The column names and the order of the columns in the Excel sheet are not standard and in the future it could happen a customer doesn't even use an Excel sheet but something totally different.

Can I use SSIS for this? Is it possible to let the user set the mappings trough some sort of user interface? I’ve looked at programmatically creating the package but I’ve got to say that’s quit hard to do… It would be easier to write the whole thing myself than to create the package trough code ;)

If not I thought about transforming the data in code before I pass it on to the SSIS package in something like XML. That way I can use standard column names and data types.

So how should I solve this problem? Use SSIS or not?

Thnx :)

Wouter de Kort

A question and comment:

1. Why are you using SCD for insert/updates on non-dimension data? It doesn't sound like you have a warehouse and a simple insert/update scenario should work (lookup data, if row not found, then insert, else update). But as long as it's working for you, that's ok.

2. You will have various problems with non-standard columns or file formats. You need to get your customer to send a defined format. One way around this is to create a staging package, loading the data into staging tables and then running the insert/updates on it from a standardized interface. There are numerous ways to set up staging, and I use VIEWS for mine as they are easy to replicate.

Wes|||

1. I already thought that SCD was a little bit overkill for this problem but it has the logic build in to check if a row is already present -> update it and otherwise it will delete it and that was just the thing I need. Would it be easier to 'duplicate' this functionality by using a lookup?

2. Off course it would be easiest to let the customer send a defined format... but we're not in the position the demand that from them (maybe in the future :))

I've started building code to convert the format the user sent to a standard xml format which I can feed to my package and the user can supply the mapping between the data and the XML schema. So they can send in whatever they want... the only ′problem′ is that I will have to build code that manually reads an Excel/Text/.... file and parse that to create the XML structure so I replicate some things that are already in SSIS :(

Could you tell me somewhat more about your staging package? Is that just an Insert into a custom build table that will hold the data? And then use a View to map it to the structure the SSIS package wants?

Wouter :)

|||

The lookup method I'm talking about does a similar function as the SCD. It just sounds like you are doing constant updates which violates the idea of "slowly" changing dimensions. I'm not too keen on the SCD stage in SSIS so I prefer to build my own lookup/update mechanism unless it truly is a type 2 SCD, but like I said whatever is working for you that meets the requirements is sufficient. (look at the sticky at the top of the forums referencing Jamie Thompson's blog and how he handles this)

Before I explain my staging, how often do the clients change their file formats? I don't want to lead you on the wrong track.

Thanks.

|||

Some of our clients send in an Excel sheet once a week with approximately 600 records. The strange thing is there system doesn't tell them if some stores are new or some are updated... it just gives the whole list.

But as I already said this could be changing in the future. Some other clients have requested Import functionality so they can connect their system/data to ours.

But what do you understand as slowly changing? My data has some modifications, some new records on each update so I thought this would be handy because it nicely encapsulates the logic?

Wouter :)

|||

Razorblade wrote:

Some

of our clients send in an Excel sheet once a week with approximately

600 records. The strange thing is there system doesn't tell them if

some stores are new or some are updated... it just gives the whole list.

But as I

already said this could be changing in the future. Some other clients

have requested Import functionality so they can connect their

system/data to ours.

But what

do you understand as slowly changing? My data has some modifications,

some new records on each update so I thought this would be handy

because it nicely encapsulates the logic?

Wouter :)

Your input list is fine if it's mixed data. As stated, you can

use the lookup function to process the list and sort into new/inserted

rows, and update or insert accordingly.

Nothing you have sounds slowly changing dimension(al), which is why I

questioned your use of the SCD wizard. SCD (in theory) is

for dimensions that don't change much, say a customers' home address or

an employees' position. But if the SCD wizard is working for you,

use it.

Go ahead and mark your question as answered if you are satisfied.

Thanks.

No comments:

Post a Comment