Problem solve Get help with specific problems with your technologies, process and projects.

Normalize a column containing a list

I've just completed an assignment for my databases class in which we were given a database and told to put it in 1NF. Several of the columns contained multiple values (e.g. 1;2;3;4), and I wanted a new row for each value in the column. I ended up projecting out the columns with multiple values and then importing them back in split up by semicolons, copying them into the original database and then using a bunch of select statements to get it down to what I wanted. I'm dying to know what a better way to do this is. If you don't understand what I'm talking about, just try to explain how to split a column that contains multiple values separated by a semicolon into separate columns.

The obvious way to do this is with a scripting language. Both Microsoft SQL Server and Oracle have comprehensive languages (Transact-SQL and PL/SQL, respectively) in which this would be both easy and efficient.

The scripting approach involves a loop, in which the target column is searched for an occurrence of the list separator (a semicolon in your case, more often a comma). In pseudocode:

variable tempstring varying character
set tempstring = columnvalue
while position(';' in tempstring) > 0
      into normalizedtable 
         ( newcolumn )
         ( left( tempstring
               , position(';' in tempstring)-1) ) 
    set tempstring = 
         from position(';' in tempstring)+1
          for length(tempstring)
              - position(';' in tempstring) )               
  into normalizedtable
     ( newcolumn )
     ( tempstring )

The POSITION function finds the separator, or returns zero if one is not found. If a separator is found, then the substring, up to the position before the separator, is extracted with the LEFT function, and used in the insert statement to insert a new row. Then the substring just used is chopped out of the string, by setting the string equal to the remainder of the string starting one position beyond the separator, and looping continues. After looping has finished, the last part of the string is used to generate the last row.

In a database that has a built-in scripting language like SQL Server or Oracle, this script would be saved as a stored procedure and invoked with one call to the database. With an external scripting language, like PHP with MySQL for example, control would be bouncing back and forth between the scripting engine and the database engine for each insert, so an external script, while perhaps just as easy to write, is substantially less efficient, although still better than doing it manually.

But is there a straight SQL solution, one that does not involve scripting? If there is, it would have to involve a cross join of some sort, since it must be able to generate multiple rows for each row that contains at least one separator. It will more than likely involve a cross join with an integers table, so that if, for example, there are four separators in a given column value, then five rows would be generated, probably by the cross join with integers 0 through 4.

That said, I have seen an SQL solution, one written by Joe Celko. It was truly complex, involving CAST, MAX, COUNT, SUBSTRING, and DATALENGTH functions, a double cross join with two copies of the integers table, and a GROUP BY clause. Not for the faint of heart. My advice is to use a script, or even do it by hand. After all, you don't normalize on the fly all the time, right?

Dig Deeper on Oracle and SQL

Start the conversation

Send me notifications when other members comment.

Please create a username to comment.