Spreadsheets or Databases

To database, or not, that is the question.

So, let's say you are tasked with developing a test station that will automatically take data from devices or products you have manufactured.  Development time for the physical station and software takes, let us say, 4 weeks.  Your lot of devices (or products) to be tested is only a few dozen.  The plan is to have this station automatically measure and extract 15 key parameters needed for product development.  At this point you have all of your manager's attention.  Then you go and tell them it is going to take another 2 months to get the data into a database.  That is when you'll probably hear, "just stuff it into a spreadsheet".  But is it really worth not developing the database?

In a spreadsheet, an erroneous entry might be spotted and manually removed from the data population.  You could assume these extreme outliers were typed in error.  But what do you do when you are running multi-variate experiments where you are changing 3 or 4 input parameters of the device?  The extreme outliers may not be so obvious and could actually be real data.  And how do you know without wasting additional engineering hours manually checking the data?

A skilled typist, which I am certainly not, can enter data at a rate of around 10,000 characters an hour.  Large floating-point numbers would probably go in at a much slower rate.  Let's say this most skilled typist has an error rate of 0.1%.  For this example, that would mean 10 errors introduced per hour, or 80 errors per shift.  Now multiply that by the number of people entering data.  So even with an error rate of 0.1% you would still be doing significant damage to your data set.  You could try to bypass manual data entry by using Optical Character Recognition (OCR): just print the page of test data, scan it, and convert it so your spreadsheet can digest it.  However, even this comes with its own problems.  I would like to say the technology is 100% accurate, but the reality is it is not.  And any benefit to scanning in the data would be lost because the data would still need to be massaged for entry into your spreadsheet.
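Here is that arithmetic as a quick back-of-the-envelope calculation; the typing rate, error rate, shift length, and head count are all assumptions you can swap for your own numbers:

    # Rough estimate of manual-entry errors; all inputs are assumed, illustrative values.
    chars_per_hour = 10_000      # assumed typing rate
    error_rate = 0.001           # assumed 0.1% character error rate
    hours_per_shift = 8          # assumed shift length
    typists = 3                  # hypothetical number of people entering data

    errors_per_hour = chars_per_hour * error_rate
    errors_per_shift = errors_per_hour * hours_per_shift

    print(errors_per_hour)             # 10.0 errors per typist per hour
    print(errors_per_shift)            # 80.0 errors per typist per shift
    print(errors_per_shift * typists)  # 240.0 errors per shift across the team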

In addition to the problems of speed, quality, and cost related to manual data entry, you also lose the ability to cross-correlate data.  Yes, some spreadsheets do have sophisticated querying tools, but these spreadsheets tend to become single-use projects.  Use them and lose them, lost in the cloud or on a network drive forever.  How many times have you had to scan your network looking for data from a spreadsheet you generated 3 years ago, cut & paste it together with your current project spreadsheet, and then convert the data to common units?  Too much wasted time, too much opportunity to introduce error.

Years ago, I read an article which applied the 1-10-100 rule to manual data entry.  A company will spend $1 to verify the data it has entered, $10 correcting the mistakes it finds, and $100 fixing the damage caused by the errors that slip through.  Just one error strategically located in your spreadsheet could send a project off in the wrong direction, costing time, money, and manpower.  So, to database we go, but where do you begin?  You begin by organizing your thoughts, thinking ahead, and planning for future expansion.

There will be the temptation to just deal with the data from the project at hand.  This would give you the ability to query the parameters (data) which are important to you.  But what if you wanted or needed to cross-correlate data from a different lot?  You would need to query your database and then hand-manipulate the data so it could be entered into another database or spreadsheet.  And with data manipulation comes the potential for data corruption and error.

This is where the 'organizing of your thoughts' comes into play.  You want to start small but develop something that can be expanded and grown as the need arises.  For this, I would suggest starting with a flat database design.  This is basically a one-table database.  They are easy to set up but tend to get big fast.  However, according to Moore's Law, memory sizes double every 18 months, so there is no need to worry about that any time soon.  Starting with a flat database allows you to start populating the database while the database itself is still being developed.  The neat thing about databases is that the data can be reconfigured at any time; no efforts are lost.  As the need grows you can then grow your database.  You may find your engineering test database becomes useful to other departments in the company.  With that, these other departments may want to add tables of their own.  At this point, the database could be converted from a flat style database to a relational database.  In short, a relational database is a series of different tables that interconnect using common fields from each table.  And this is where the 'planning for future expansion' comes in.
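As a minimal sketch of what that flat, one-table starting point might look like, here is an example using SQLite in Python.  The table and field names are hypothetical (they loosely follow the example fields discussed below), and the SQLite types stand in for whatever 'short text', 'long integer', and 'double float' equivalents your database offers:

    import sqlite3

    # Flat (single-table) engineering test database - an illustrative sketch only.
    conn = sqlite3.connect("test_data.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS test_results (
            Key_Field      INTEGER PRIMARY KEY,  -- unique long-integer key for the data set
            Product_ID     TEXT,                 -- label (short text)
            Test_ID        INTEGER,              -- index (long integer)
            Test_Instance  INTEGER,              -- index (long integer)
            Test_Date      TEXT,                 -- date/time, stored as ISO-8601 text in SQLite
            Test_I2        REAL,                 -- measurement as a double float, in base units (amps)
            Test_I2_Units  TEXT                  -- secondary label describing the units of Test_I2
        )
    """)
    conn.commit()
    conn.close()

If other departments later add tables of their own, fields like Product_ID become the common keys that tie those tables back to this one.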

One of the big problems with databases, especially engineering databases, is 'definition collision'.  This is where similar field types are defined in different ways.  An example of this would be one field for voltage defined as a 'double float', and another voltage defined as a 'text' field.  Both work, but the data would need to be manipulated before it could be used in a correlation.  So, what field types do you use, and what units do you give the data fields?  Should the data be stored as '1mA', '0.001A', or '1' in a field defined as mA?  Should it be an integer, float, double, text, or whatever?  Should the units be pre-embedded into the number and the user left to figure it out?  This is where most database projects fail.  And having done this before I will say, there is nothing more annoying than seeing units pop up in reports like KuA (kilo-micro A) or muA (milli-micro A), or trying to plot data that was defined as text with embedded alphanumeric characters in it.  Look, a Number is a Number, and a Label is a Label.

Getting back to Moore's Law, memory is cheap, with sizes doubling every 18 months.  Unless your database is being used at the New York Stock Exchange, I wouldn't worry about data size.  Let Information Technologies (IT) worry about that.  So, to keep it simple, I just define labels as 'short text', indexes as 'long integers', date & time as 'date/time', and floating-point numbers as 'double floats'.  With that said, I never embed the units in the number.  A value like 1uA is entered into the field as 0.000001, and NOT as 1.0, leaving the user to assume the field is in uA.  Computers are good with small and large numbers; let them do the work.  As in the example table, I sometimes add a secondary label field to describe the units in the data field (ex. Test_I2_Units).
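To keep units from being smuggled into the number field in the first place, a small normalization helper can convert readings like '100uA' into plain base-unit floats before they are stored.  A minimal sketch, with the function name and prefix table purely illustrative:

    # Hedged sketch: normalize a reading like '100uA' into a plain float in base
    # SI units (amps here) before it ever reaches the database.
    SI_PREFIXES = {"p": 1e-12, "n": 1e-9, "u": 1e-6, "m": 1e-3, "": 1.0, "k": 1e3, "M": 1e6}

    def to_base_units(reading: str, unit: str = "A") -> float:
        """Convert '100uA' -> 0.0001, '3.3kV' (unit='V') -> 3300.0."""
        value = reading.strip()
        if value.endswith(unit):
            value = value[:-len(unit)]
        prefix = value[-1] if value and value[-1] in SI_PREFIXES else ""
        number = float(value[:-1] if prefix else value)
        return number * SI_PREFIXES[prefix]

    print(to_base_units("100uA"))       # 0.0001
    print(to_base_units("3.3kV", "V"))  # 3300.0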

The example table is a simple one for a flat style database.  The Key_Field is a unique long integer, possibly built up from other data fields, which identifies the data set.  It's not strictly necessary, but it will make it easier to cross-correlate data from other tables as they become available.  I'll usually do something like Key_Field = Product_ID (8 or more digits) + Test_ID (4 digits) + Test_Instance (2 digits).
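For what it's worth, here is one hypothetical way to compose such a key, concatenating the zero-padded pieces rather than adding them arithmetically; the digit widths follow the scheme above:

    # Hypothetical Key_Field builder: Product_ID (8+ digits) + Test_ID (4 digits)
    # + Test_Instance (2 digits), concatenated as fixed-width digit fields.
    def make_key(product_id: int, test_id: int, test_instance: int) -> int:
        return int(f"{product_id:08d}{test_id:04d}{test_instance:02d}")

    print(make_key(12345678, 42, 3))  # -> 12345678004203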

When entering data through the test station's graphical user interface (GUI), I disallow users from typing data directly into the database fields.  It is a lot easier to force convention than to go back and post-process the data to correct it.  So, generate a GUI which uses drop-down menus with predefined selections for the operators to choose from, as sketched below.  You'd be surprised how many variants there are of a label like 'Preamp_1234' (pre-amp1234, Pre-Amp 1234, pre_amp_1234, Preamp 1234, PA1234, etc.).  You get the idea.  My preference would be to barcode or QR code the device under test, the test station, and the operator ID tag.  A barcode scanner and tags take all the guesswork out of it.
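A bare-bones sketch of that kind of constrained entry form, here using Tkinter with read-only combo boxes so nothing free-typed ever reaches the database; the fixture and station names are made up:

    import tkinter as tk
    from tkinter import ttk

    # Illustrative entry form: operators pick from predefined lists, so label
    # variants like 'pre-amp1234' or 'PA1234' can never be typed in.
    FIXTURES = ["Preamp_1234", "Preamp_5678"]          # hypothetical labels
    STATIONS = ["Test_Station_01", "Test_Station_02"]  # hypothetical labels

    root = tk.Tk()
    root.title("Test Data Entry")

    ttk.Label(root, text="Device / Fixture").grid(row=0, column=0, padx=5, pady=5)
    fixture = ttk.Combobox(root, values=FIXTURES, state="readonly")  # no free typing
    fixture.grid(row=0, column=1, padx=5, pady=5)
    fixture.current(0)

    ttk.Label(root, text="Test Station").grid(row=1, column=0, padx=5, pady=5)
    station = ttk.Combobox(root, values=STATIONS, state="readonly")
    station.grid(row=1, column=1, padx=5, pady=5)
    station.current(0)

    root.mainloop()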

One last tip in closing.  The Holy Grail for engineering test stations would be for all of your stations to be directly tied to the database.  Been there, big mistake.  For starters, what happens when IT takes down the database for maintenance or repair?  I can tell you what happens: your production floor grinds to a halt and your boss starts throwing darts at your picture.  The same goes for lost communications connections, server errors, data corruption, and station errors.  Business stops until these problems are corrected.  So, what is the workaround?  The solution I came up with was to have all of the stations dump their data directly to a directory on the network.  A parsing program would then check this directory for files every 5 or 10 minutes and parse them into the database.  The parser could be written so that it distinguishes between the several different types of test station files and parses the data into the correct fields.  This method insulates your test stations from server and network crashes.  As systems come back online, the parser picks up where it left off and parses the waiting files into the database.
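A stripped-down sketch of that parser loop, assuming the stations drop one comma-separated file per test run into a shared directory and the flat table from earlier already exists; the paths, file format, and polling interval are all placeholders:

    import glob
    import os
    import sqlite3
    import time

    DROP_DIR = r"\\server\test_drops"            # assumed network drop directory
    DONE_DIR = os.path.join(DROP_DIR, "parsed")  # parsed files get moved here
    DB_PATH = "test_data.db"
    POLL_SECONDS = 300                           # check every 5 minutes

    def parse_file(path, conn):
        """Parse one station file (assumed format: key,product_id,current) into the flat table."""
        with open(path) as f:
            for line in f:
                key, product_id, current = line.strip().split(",")
                conn.execute(
                    "INSERT INTO test_results (Key_Field, Product_ID, Test_I2) VALUES (?, ?, ?)",
                    (int(key), product_id, float(current)),
                )
        conn.commit()

    os.makedirs(DONE_DIR, exist_ok=True)
    while True:
        conn = sqlite3.connect(DB_PATH)
        for path in glob.glob(os.path.join(DROP_DIR, "*.csv")):
            parse_file(path, conn)
            os.replace(path, os.path.join(DONE_DIR, os.path.basename(path)))  # don't parse twice
        conn.close()
        time.sleep(POLL_SECONDS)

A real version would also distinguish between the different station file types and handle malformed files, but the point stands: the stations never talk to the database directly.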

I hope this was helpful.  Good Luck.