Friday, November 4, 2011

Increasing Likelyhood of Duplicate Values While Loading Data





I was working on a QlikView document last week that loaded several kinds of sales forecast data and also actual sales for a chemical manufacturer. The forecast data came from a statistical forecasting application. The charts on the document were simple and showed the forecast data rounded to integer numbers of tons.


The document contained a lot of data and the performance was a little slow when making selections so I looked for ways to improve performance. Examining the forecast data, I saw that although the quantities in the database were expressed in tons almost every quantity included a decimal fraction. Now, QlikView is very good at compressing data and especially good at compressing data with duplicate values since it stores each unique value for a field only once. This forecast data did not contain a lot of duplicates because of the decimal fractions. I saw that if I rounded the values as they were loaded then values like 1.001, 1.00032, 1.2, etc. would all be loaded as 1. After changing the loadscript to round values to integers as the data was loaded the report file size on disk was reduced by 40% and had a corresponding reduction in memory required. The charts on the report were not affected all.


This technique would not be useful in all situations. It would cause a lot of trouble if you rounded data values on some accounting documents! But, sometimes, altering data slightly as it is loaded to increase the likelyhood of duplicate values can save memory for a QlikView document. Besides rounding (not necessarily to integers, you can round to a specific number of decimal places), you can also increase the likelyhood of duplicate values by:



  • Truncating text data - keeping only the first ten characters, for example.



  • Trimming text data to remove any leading or trailing blanks.



  • Converting to all capital letters.


  • Converting non-abbreviated words to standard abbreviations, like state abbreviations in address data.



  • Removing the time stamp portion from date data



  • Converting date data to beginning-of-week or beginning of month dates.



  • Converting date data to month abbreviations if that is the level of data used in the document charts.



* * *

2 comments:

qlikview reporting said...

I think its a great work done as the qlikview is on the way for the next bridging society.

Medical billing services said...

I think the great work is really done.