Cleaning up data for analysis with visualization software – Using MS Excel

This a is a quick tip and how-to, to help you cleanup data that contains special characters using Microsoft Excel a Macro with VBA ultimately helping you save time dealing with those files when wanting to use data visualization or related platforms.

At Robobo we work and support our clients with managing their data, this includes cleaning data and uploading to use with data visualization platforms such as Superset, STATA and others. Normally when working with data collection, especially dealing with languages with special characters (ã, ç, et. al.) that we have in the Portuguese language and others, we usually have a hard time pulling data with those characters as most packages will give you an error.

The are many options to clean data of those special characters, including for instance in STATA you can specify that the information is cleaned, but for the sakes of superset and related platforms, we can use Microsoft Excel and some good old Macro rule. For simplicity, I will assume that the reader knows how to work with Macros, nevertheless, the basic steps to use or active it can be found on: https://support.office.com/en-us/article/run-a-macro-5e855fd2-02d1-45f5-90a3-50e645fe3155 or https://www.datanumen.com/blogs/how-to-run-vba-code-in-your-excel/ provide clear instructions on how.

Code to use:

Sub CleanDataFromSpecialChars()
'
' CleanDataFromSpecialChars Macro
' Cleans or Strips all Special Characters from the worksheet
'

'
    Cells.Select
    Selection.Replace What:="Ç", Replacement:="C", LookAt:=xlPart, _
        SearchOrder:=xlByRows, MatchCase:=True, FormulaVersion:= _
        xlReplaceFormula2
    Selection.Replace What:="ç", Replacement:="c", LookAt:=xlPart, _
        SearchOrder:=xlByRows, MatchCase:=True, FormulaVersion:= _
        xlReplaceFormula2
    Selection.Replace What:="Â", Replacement:="A", LookAt:=xlPart, _
        SearchOrder:=xlByRows, MatchCase:=True, FormulaVersion:=xlReplaceFormula2
    Selection.Replace What:="Ã", Replacement:="a", LookAt:=xlPart, _
        SearchOrder:=xlByRows, MatchCase:=True, FormulaVersion:=xlReplaceFormula2
    Selection.Replace What:="Á", Replacement:="A", LookAt:=xlPart, _
        SearchOrder:=xlByRows, MatchCase:=True, FormulaVersion:=xlReplaceFormula2
    Selection.Replace What:="À", Replacement:="A", LookAt:=xlPart, _
        SearchOrder:=xlByRows, MatchCase:=True, FormulaVersion:=xlReplaceFormula2
    Selection.Replace What:="Ä", Replacement:="A", LookAt:=xlPart, _
        SearchOrder:=xlByRows, MatchCase:=True, FormulaVersion:=xlReplaceFormula2
    Selection.Replace What:="Å", Replacement:="A", LookAt:=xlPart, _
        SearchOrder:=xlByRows, MatchCase:=True, FormulaVersion:=xlReplaceFormula2
    Selection.Replace What:="É", Replacement:="E", LookAt:=xlPart, _
        SearchOrder:=xlByRows, MatchCase:=True, FormulaVersion:=xlReplaceFormula2
    Selection.Replace What:="È", Replacement:="E", LookAt:=xlPart, _
        SearchOrder:=xlByRows, MatchCase:=True, FormulaVersion:=xlReplaceFormula2
    Selection.Replace What:="È", Replacement:="E", LookAt:=xlPart, _
        SearchOrder:=xlByRows, MatchCase:=True, FormulaVersion:=xlReplaceFormula2
    Selection.Replace What:="Ê", Replacement:="E", LookAt:=xlPart, _
        SearchOrder:=xlByRows, MatchCase:=True, FormulaVersion:=xlReplaceFormula2
    Selection.Replace What:="Ë", Replacement:="E", LookAt:=xlPart, _
        SearchOrder:=xlByRows, MatchCase:=True, FormulaVersion:=xlReplaceFormula2
    Selection.Replace What:="Ì", Replacement:="I", LookAt:=xlPart, _
        SearchOrder:=xlByRows, MatchCase:=True, FormulaVersion:=xlReplaceFormula2
    Selection.Replace What:="Í", Replacement:="I", LookAt:=xlPart, _
        SearchOrder:=xlByRows, MatchCase:=True, FormulaVersion:=xlReplaceFormula2
    Selection.Replace What:="Î", Replacement:="I", LookAt:=xlPart, _
        SearchOrder:=xlByRows, MatchCase:=True, FormulaVersion:=xlReplaceFormula2
    Selection.Replace What:="Ï", Replacement:="I", LookAt:=xlPart, _
        SearchOrder:=xlByRows, MatchCase:=True, FormulaVersion:=xlReplaceFormula2
    Selection.Replace What:="Ñ", Replacement:="N", LookAt:=xlPart, _
        SearchOrder:=xlByRows, MatchCase:=True, FormulaVersion:=xlReplaceFormula2
    Selection.Replace What:="Ò", Replacement:="O", LookAt:=xlPart, _
        SearchOrder:=xlByRows, MatchCase:=True, FormulaVersion:=xlReplaceFormula2
    Selection.Replace What:="Ó", Replacement:="O", LookAt:=xlPart, _
        SearchOrder:=xlByRows, MatchCase:=True, FormulaVersion:=xlReplaceFormula2
    Selection.Replace What:="Ô", Replacement:="O", LookAt:=xlPart, _
        SearchOrder:=xlByRows, MatchCase:=True, FormulaVersion:=xlReplaceFormula2
    Selection.Replace What:="Õ", Replacement:="O", LookAt:=xlPart, _
        SearchOrder:=xlByRows, MatchCase:=True, FormulaVersion:=xlReplaceFormula2
    Selection.Replace What:="Ö", Replacement:="O", LookAt:=xlPart, _
        SearchOrder:=xlByRows, MatchCase:=True, FormulaVersion:=xlReplaceFormula2
    Selection.Replace What:="Ù", Replacement:="U", LookAt:=xlPart, _
        SearchOrder:=xlByRows, MatchCase:=True, FormulaVersion:=xlReplaceFormula2
    Selection.Replace What:="Ú", Replacement:="U", LookAt:=xlPart, _
        SearchOrder:=xlByRows, MatchCase:=True, FormulaVersion:=xlReplaceFormula2
    Selection.Replace What:="Û", Replacement:="U", LookAt:=xlPart, _
        SearchOrder:=xlByRows, MatchCase:=True, FormulaVersion:=xlReplaceFormula2
    Selection.Replace What:="Ü", Replacement:="U", LookAt:=xlPart, _
        SearchOrder:=xlByRows, MatchCase:=True, FormulaVersion:=xlReplaceFormula2
    Selection.Replace What:="â", Replacement:="a", LookAt:=xlPart, _
        SearchOrder:=xlByRows, MatchCase:=True, FormulaVersion:=xlReplaceFormula2
    Selection.Replace What:="ã", Replacement:="a", LookAt:=xlPart, _
        SearchOrder:=xlByRows, MatchCase:=True, FormulaVersion:=xlReplaceFormula2
    Selection.Replace What:="á", Replacement:="a", LookAt:=xlPart, _
        SearchOrder:=xlByRows, MatchCase:=True, FormulaVersion:=xlReplaceFormula2
    Selection.Replace What:="à", Replacement:="a", LookAt:=xlPart, _
        SearchOrder:=xlByRows, MatchCase:=True, FormulaVersion:=xlReplaceFormula2
    Selection.Replace What:="ä", Replacement:="a", LookAt:=xlPart, _
        SearchOrder:=xlByRows, MatchCase:=True, FormulaVersion:=xlReplaceFormula2
    Selection.Replace What:="å", Replacement:="a", LookAt:=xlPart, _
        SearchOrder:=xlByRows, MatchCase:=True, FormulaVersion:=xlReplaceFormula2
    Selection.Replace What:="ª", Replacement:="a", LookAt:=xlPart, _
        SearchOrder:=xlByRows, MatchCase:=True, FormulaVersion:=xlReplaceFormula2
    Selection.Replace What:="é", Replacement:="e", LookAt:=xlPart, _
        SearchOrder:=xlByRows, MatchCase:=True, FormulaVersion:=xlReplaceFormula2
    Selection.Replace What:="è", Replacement:="e", LookAt:=xlPart, _
        SearchOrder:=xlByRows, MatchCase:=True, FormulaVersion:=xlReplaceFormula2
    Selection.Replace What:="ê", Replacement:="e", LookAt:=xlPart, _
        SearchOrder:=xlByRows, MatchCase:=True, FormulaVersion:=xlReplaceFormula2
    Selection.Replace What:="ë", Replacement:="e", LookAt:=xlPart, _
        SearchOrder:=xlByRows, MatchCase:=True, FormulaVersion:=xlReplaceFormula2
    Selection.Replace What:="ì", Replacement:="i", LookAt:=xlPart, _
        SearchOrder:=xlByRows, MatchCase:=True, FormulaVersion:=xlReplaceFormula2
    Selection.Replace What:="í", Replacement:="i", LookAt:=xlPart, _
        SearchOrder:=xlByRows, MatchCase:=True, FormulaVersion:=xlReplaceFormula2
    Selection.Replace What:="î", Replacement:="i", LookAt:=xlPart, _
        SearchOrder:=xlByRows, MatchCase:=True, FormulaVersion:=xlReplaceFormula2
    Selection.Replace What:="ï", Replacement:="i", LookAt:=xlPart, _
        SearchOrder:=xlByRows, MatchCase:=True, FormulaVersion:=xlReplaceFormula2
    Selection.Replace What:="ñ", Replacement:="n", LookAt:=xlPart, _
        SearchOrder:=xlByRows, MatchCase:=True, FormulaVersion:=xlReplaceFormula2
    Selection.Replace What:="ò", Replacement:="o", LookAt:=xlPart, _
        SearchOrder:=xlByRows, MatchCase:=True, FormulaVersion:=xlReplaceFormula2
    Selection.Replace What:="ó", Replacement:="o", LookAt:=xlPart, _
        SearchOrder:=xlByRows, MatchCase:=True, FormulaVersion:=xlReplaceFormula2
    Selection.Replace What:="ô", Replacement:="o", LookAt:=xlPart, _
        SearchOrder:=xlByRows, MatchCase:=True, FormulaVersion:=xlReplaceFormula2
    Selection.Replace What:="õ", Replacement:="o", LookAt:=xlPart, _
        SearchOrder:=xlByRows, MatchCase:=True, FormulaVersion:=xlReplaceFormula2
    Selection.Replace What:="ö", Replacement:="o", LookAt:=xlPart, _
        SearchOrder:=xlByRows, MatchCase:=True, FormulaVersion:=xlReplaceFormula2
    Selection.Replace What:="º", Replacement:="o", LookAt:=xlPart, _
        SearchOrder:=xlByRows, MatchCase:=True, FormulaVersion:=xlReplaceFormula2
    Selection.Replace What:="ù", Replacement:="u", LookAt:=xlPart, _
        SearchOrder:=xlByRows, MatchCase:=True, FormulaVersion:=xlReplaceFormula2
    Selection.Replace What:="ú", Replacement:="u", LookAt:=xlPart, _
        SearchOrder:=xlByRows, MatchCase:=True, FormulaVersion:=xlReplaceFormula2
    Selection.Replace What:="û", Replacement:="u", LookAt:=xlPart, _
        SearchOrder:=xlByRows, MatchCase:=True, FormulaVersion:=xlReplaceFormula2
    Selection.Replace What:="ü", Replacement:="u", LookAt:=xlPart, _
        SearchOrder:=xlByRows, MatchCase:=True, FormulaVersion:=xlReplaceFormula2
    Selection.Replace What:="ÿ", Replacement:="y", LookAt:=xlPart, _
        SearchOrder:=xlByRows, MatchCase:=True, FormulaVersion:=xlReplaceFormula2
        
        
    
End Sub

Hope this helps you save time cleaning the entire csv or data file with special characters.

NOTE:

This approach does indeed have its drawbacks, for instance:

  • On large datasets, this can take a very long time to process and may render Excel irresponsive, but with patience it should process every line (we have dealt with a 26MB+ excel file) which ran for just under an hour to process
  • A second version of this script could focus on cleaning only select rows or columns, that way we can specify only columns we need.
Tags: