Saturday, February 21, 2009

“SELECT DISTINCT” in Excel and Powershell

Just recently at work we had to import a large dataset into a database, but two of the columns were “category” and “subcategory”.  Naturally we wanted these in a separate table, so we needed a way of parsing through the existing data and get the unique categories, and the unique matching subcategories for each.


Firstly, there’s the Excel ‘07 way (most likely also possible in other versions too):

Select the two columns that hold your categories and subcategories. Go to the “Advanced” menu item in Filtering:


The presented dialog has the option of “Unique Records Only”


Done, it’s the equivalent of doing a “distinct” in SQL.  Now sort the data, and you’re ready to go.


But no discussion of parsing/manipulating data is complete without mentioning Powershell – surely we can do this in PS.  Let’s assume that the data was saved in csv form (you could always save your Excel spreadsheet as .csv).

Luckily, there’s already a command that is able to parse csv’s for us: “Import-Csv”.  This cmdlet will parse the csv and create objects that have properties named after the columns – in the below image you can see that the object has a “Category” and “SubCategory” property.


Now that we have this collection of objects, use your favourite way of flushing out duplicates.  I’m keen on building up a new list, checking on each insert that the same entry doesn’t exist:

  1. $temp = @{}  
  2. $data | foreach {  
  3.       if ($temp.ContainsKey($_.Category))  
  4.       {  
  5.         #check if subcategory exists, insert subcategory if doesnt  
  6.         if ($temp[$_.Category].ContainsValue($_.Subcategory) -eq $false)  
  7.         {  
  8.             $temp[$_.Category].Add($_.Subcategory, $_.Subcategory);  
  9.         }  
  10.     }  
  11.     else  
  12.     {  
  13.         #put it in:  
  14.         $temp.Add($_.Category, @{$_.Subcategory = $_.Subcategory});  
  15.     }  

This is all well and good, but there are issues.  The "Category” property names are hardcoded, and it depends on an existing $data variable.  Ideally we want a cmdlet that can support piping, and will let us specify the names of the columns.

Luckily for us, Powershell is able support the following constructs:


What happened there is by putting the variable in brackets, PS will evaluate the variable and put it in the script.  I was able to create a variable $somestring, which had a string value of “Subcategory”.  When I used it in brackets, it expanded out to “Subcategory” in the script, and functioned in the same way as $data[2].Subcategory.  Neat.  This is an invaluable feature in scripting.

So we replacing all places where we previously had “Category” and “Subcategory” with expansions, and add them as required parameters.

Last thing, we need to get rid of the assumption of pre-existing $data variable.  Ideally the input will be piped in.  No problem, the reserved $input variable is the piped in parameters.

So now we can call it like so:

import-csv c:\MyTest.csv | c:\categoryParser.ps1 "Category" "Subcategory"

The final script looks like this:

  1. param
  2. $categoryName = $(throw "Specify category column name" ), 
  3. $subcategoryName = $(throw "specify subcategory name"
  5. $temp = @{} 
  6. $input | foreach
  7.     if ($temp.ContainsKey($_.($categoryName))) 
  8.     { 
  10.         #check if subcategory exists, insert subcategory if doesnt 
  11.         if ($temp[$_.($categoryName)].ContainsValue($_.($subcategoryName)) -eq $false
  12.         { 
  13.             $temp[$_.($categoryName)].Add($_.($subcategoryName), $_.($subcategoryName)); 
  14.         } 
  15.     } 
  16.     else 
  17.     { 
  18.         #put it in: 
  19.         $temp.Add($_.($categoryName), @{$_.($subcategoryName) = $_.($subcategoryName)}); 
  20.     } 
  22. #dump out the output 
  23. $temp 

And produces something like this, a hashtable of objects each of which has the category name, and a hashtable of subcategories. 


From here it would be easy to traverse them all with two nested foreach loops, and do whatever with them, like generating INSERT statements.


Stephen MIlls said...

You could also have used the Sort-Object and Group-Object Cmdlets to do the same thing.

import-csv c:\mydata.csv | select Category, Subcategory | sort category, subcategory -Unique | Group-Object -Property category

Ilya said...

Oh boy, is my face red. I was completely unaware that "sort" had a -unique option (or the Get-Unique cmdlet). Live and learn I guess - thanks Stephen!