Have you ever encountered a perfectly constructed MATCH or a VLOOKUP function failing for no apparent reason? Have you had VBA tell you that a file does not exist even though it does? You may have non-printing characters to blame. There are notorious space-like characters that are invisible to the naked eye, which sneak in when you are importing data from Web Pages, Word Documents or PDFs. In this post we will learn more about these characters and how to remove them from your data.
A brief history of Character Encoding
Letters, digits and punctuations that we use in human languages are represented in a format that a computer can understand through a process known as character encoding. The most commonly used systems are ASCII and Unicode. The ASCII started off with 128 (7-bits) characters to represent American English. This character set was extended to 256 characters independently in many encoding standards to include accented characters in European languages. Finally Unicode absorbed ASCII, and extended it to include all human languages in the world, and we still have room for more!
The reason for my rambling on about character encoding is, in this post, we will confine ourselves to the first 256 characters – 0 to 255 – of the Unicode character set. Remember that the first 128 characters – 0 to 127 – in Unicode character set are the same as ASCII.
Getting back to Non printing characters…
The first 32 characters – 0 to 31 – are non printing. You can use Excel’s CLEAN function to remove all of them. However, there are a few more higher value non printing characters: 127, 129, 141, 143, 144, and 157. Excel does not have a built-in function to remove all of them in one shot. We have to use the SUBSTITUTE and CHAR function to remove them one after the other.
For instance, to remove Character-129 from your text, you have to set up the following formula:
=SUBSTITUTE("Your Text here",CHAR(129),"")
To remove multiple non printing characters, we have to nest substitute formulas within each other:
=SUBSTITUTE(SUBSTITUTE("Your Text here",CHAR(129),""),CHAR(141),"")
The character that gets inserted when you hit the Space-bar on your keyboard is 32. There is another type of space called a non-breaking space, which is represented as 160. This often turns up in HTML pages. It is best to convert the 160’s to 32’s. Note that the first 32 characters include the new-line characters; and multiline text will get converted to single line text, if you use the CLEAN function. I recommend replacing them with 32
Doesn’t this sound like a lot of work?
I created a Function to get rid of all the non-printing characters you’d normally encounter in everyday life. Here it is:
Function CleanS2XL(ByVal Value As String, _ Optional ReplaceWith As String = " ", _ Optional TrimResult As Boolean = True) As String 'List of Non-Printing Characters: '0 to 31 and 127, 129, 141, 143, 144, 157 '32 and 160 are space characters '32 is the Space we know and love 'And 160 is a weird space, which I will replace with 32 Dim NonPrint() As Variant, _ Counter As Long CleanS2XL = Value 'While the Clean function completely removes all the non-printing 'characters, I often found that I needed to replace them with Spaces 'Therefore, instead of using the clean function, I have replaced it with 'the following loop to explicitly replace the first 32 non printing 'characters, with the ReplaceWith argument. For Counter = 0 To 31 CleanS2XL = Replace(CleanS2XL, Chr(Counter), ReplaceWith) Next Counter 'Now we will replace the higher value non printing characters NonPrint = Array(127, 129, 141, 143, 144, 157) For Counter = LBound(NonPrint) To UBound(NonPrint) CleanS2XL = Replace(CleanS2XL, Chr(NonPrint(Counter)), ReplaceWith) Next Counter '160 is another type of space that often shows up when you are copying 'data from web pages, PDFs and word documents. I will replace it with 'the regular space, so the TRIM function can do its thing CleanS2XL = Replace(CleanS2XL, Chr(160), Chr(32)) 'I use Excel's TRIM Function here instead of VBA's because Excel's 'function converts multiple successive spaces to a Single Space If TrimResult Then CleanS2XL = Application.Trim(CleanS2XL) End Function
The ReplaceWith argument lets you replace non-printing character with any character you please. It defaults to the normal Space Character (32). I recommend using the Space character (32), along with the TrimResult argument set to TRUE, because, Excel’s TRIM Function, which I call in the end, removes double spaces.
Since this function removes the new-line characters (10 and 13), multiline text would get converted into single line text. Incase you’d like to retain the “multi-line-ness” of your string, you may want to call this function with the line-feed character (10).
Note that I use Excel’s TRIM Function instead of VBA’s TRIM Function. VBA’s function removes only leading and trailing spaces, while Excel’s function also replaces multiple spaces with a single space.
Download the following Workbook for all the illustrative code in this post:
- This post is based on the recommendations in the ‘Removing spaces and nonprinting characters from text’ section of the ‘Top ten ways to clean your data‘ Office support article. It has more tips on how one should approach the data cleanup process; it is a good read. I intend to incorporate more suggestions in this article into my code and post them here in the future.
- We have used Excel functions in our VBA code here. For more information on how to use Excel function in your VBA code, I recommend reading my ‘Is the WorksheetFunction property necessary in Excel VBA?‘ post.
- I also have a sub that automatically cleans up all the cells in a range. If you are interested in knowing more about it, check out my ‘Do You Hate The Numbers Stored As Text Error?‘ post.
- Check out the ‘The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)‘ post by Joel Spolsky , for an entertaining account on Character Encoding.
- If you are looking for a more technical take on character encoding, I’d recommend the ‘What every programmer absolutely, positively needs to know about encodings and character sets to work with text‘ post by David C Zentgraf