Detecting and fixing encoding problems with NSString

When you’re working with strings on iOS, it’s only a question of time before you start using stringWithContentsOfURL, either to download something from the web or to handle a file import into your app. One of the major pains of working with strings is the encoding issue: a string is stored as an array of bytes, and to make sense of it, you have to know what the bytes mean.

In the early days, one just used one byte for a character and came up with the famous ASCII encoding. But ASCII only defines 128 characters, and even the 256 values a full byte can hold are by far not enough to handle all the characters in the world (think of all the Asian languages). So different people invented different encodings until at one point, the Unicode people came around in an effort to propose a standard that contains all characters for all languages. Unfortunately, there is both UTF-8 and UTF-16, so there is not even a single Unicode encoding, but hey, that’s beside the point here. Unicode made a lot of stuff simpler and the world a better place.

Classes like NSString do a good job of hiding that problem. It hits you when you receive bytes from an external source (e.g. a web page) and have to figure out which encoding they are in. Let’s take the German character “Ä” for example: in Latin1 encoding, that’s just one byte with a value of 196 (0xC4). In UTF-8, it’s two bytes: 0xC3 0x84. So you download that list of bytes and have to figure out what’s what. If you have a UTF-8 encoded page and incorrectly assume it’s Latin1, each of the two bytes becomes a character of its own, and you end up with “Ã” followed by an invisible control character instead of a single “Ä”. Luckily, most modern formats like HTML or XML suggest that the encoding should be stated explicitly somewhere in the file.
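To make the byte values above concrete, here is a minimal sketch in plain C (it also compiles as part of an Objective-C file). The helper name utf8_encode_small is made up for this illustration and is not part of any Apple API. It only handles code points below U+0800, which is all we need for the Latin1 range.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper for illustration: encodes a code point below U+0800
   as UTF-8 and returns the number of bytes written to out (1 or 2). */
static size_t utf8_encode_small(unsigned int cp, unsigned char out[2])
{
    assert(cp < 0x800); /* larger code points need 3 or 4 bytes */
    if (cp < 0x80) {
        /* ASCII range: one byte, same value as the code point. */
        out[0] = (unsigned char)cp;
        return 1;
    }
    out[0] = (unsigned char)(0xC0 | (cp >> 6));   /* lead byte:    110xxxxx */
    out[1] = (unsigned char)(0x80 | (cp & 0x3F)); /* continuation: 10xxxxxx */
    return 2;
}
```

Feeding it U+00C4 (“Ä”) yields the two bytes 0xC3 and 0x84. Since Latin1 maps every byte to the code point with the same value, a reader that assumes Latin1 turns those two bytes into the two characters U+00C3 and U+0084.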

What’s that got to do with iOS, you may ask. Well, I ran into a couple of problems with Latin1 encoded data when trying to use Apple’s methods to detect the correct encoding automatically. So here is some source code to help others with the same problem:

StringUtils.h


#import <Foundation/Foundation.h>

@interface NSString (NSStringAdditions)
// Checks for UTF8 German umlauts being incorrectly interpreted as Latin1.
- (BOOL)containsUTF8Errors;
// Replaces the umlaut errors with the correct characters.
- (NSString*)stringByCleaningUTF8Errors;
// Tries several encodings in turn and works around common problems
// such as NSString failing to detect Latin1 correctly.
+ (NSString*)stringWithContentsOfURLDetectEncoding:(NSURL*)url error:(NSError**)error;
@end

StringUtils.m


#import "StringUtils.h"

@implementation NSString (NSStringAdditions)

- (BOOL)containsUTF8Errors
{
    // Check for a UTF8 byte order mark (EF BB BF) that was decoded as Latin1.
    // http://en.wikipedia.org/wiki/Byte_order_mark
    if ( [self rangeOfString:@"\u00ef\u00bb\u00bf"].location != NSNotFound )
    {
        return true;
    }
    // Now check for the two-character patterns that UTF8 encoded characters like
    // Ä ä Ö ö Ü ü ß turn into when they are decoded as Latin1.
    // We only cover the Latin-1 Supplement block, so
    // U+00A0 to U+00FF.
    for ( int index = 0; index < [self length]; ++index )
    {
        unichar const charInput = [self characterAtIndex:index];
        if ( ( charInput == 0xC2 ) && ( index + 1 < [self length] ) )
        {
            // Check for degree character and similar that are UTF8 but have incorrectly
            // been translated as Latin1 (ISO 8859-1) or ASCII.
            unichar const char2Input = [self characterAtIndex:index+1];
            if ( ( char2Input >= 0xa0 ) && ( char2Input <= 0xbf ) )
            {
                return true;
            }
        }
        if ( ( charInput == 0xC3 ) && ( index + 1 < [self length] ) )
        {
            // Check for german umlauts and french accents that are UTF8 but have incorrectly
            // been translated as Latin1 (ISO 8859-1) or ASCII.
            unichar const char2Input = [self characterAtIndex:index+1];
            if ( ( char2Input >= 0x80 ) && ( char2Input <= 0xbf ) )
            {
                return true;
            }
        }
    }
    return false;
}

- (NSString*)stringByCleaningUTF8Errors
{
    // For efficiency reasons, we don't use replaceOccurrencesOfString but scan
    // over the string ourselves. Each time we find a problematic character pattern,
    // we copy over all characters we have scanned over and then add the replacement.
    
    NSMutableString * result = [NSMutableString stringWithCapacity:[self length]];
    NSRange scanRange = NSMakeRange(0, 0);
    NSString * replacementString = nil;
    NSUInteger replacementLength = 0;
    for ( int index = 0; index < [self length]; ++index )
    {
        unichar const charInput = [self characterAtIndex:index];
        if ( ( charInput == 0xC2 ) && ( index + 1 < [self length] ) )
        {
            unichar const char2Input = [self characterAtIndex:index+1];
            if ( ( char2Input >= 0xa0 ) && ( char2Input <= 0xbf ) )
            {
                unichar charFixed = char2Input;
                replacementString = [NSString stringWithFormat:@"%C", charFixed];
                replacementLength = 2;
            }
        }
        else if ( ( charInput == 0xC3 ) && ( index + 1 < [self length] ) )
        {
            // Check for german umlauts and french accents that are UTF8 but have incorrectly
            // been translated as Latin1 (ISO 8859-1) or ASCII.
            unichar const char2Input = [self characterAtIndex:index+1];
            if ( ( char2Input >= 0x80 ) && ( char2Input <= 0xbf ) )
            {
                unichar charFixed = 0x40 + char2Input;
                replacementString = [NSString stringWithFormat:@"%C", charFixed];
                replacementLength = 2;
            }
        }
        else if ( ( charInput == 0xef ) && ( index + 2 < [self length] ) )
        {
            // Check for Unicode byte order mark, see:
            // http://en.wikipedia.org/wiki/Byte_order_mark
            unichar const char2Input = [self characterAtIndex:index+1];
            unichar const char3Input = [self characterAtIndex:index+2];
            if ( ( char2Input == 0xbb ) && ( char3Input == 0xbf ) )
            {
                replacementString = @"";
                replacementLength = 3;
            }
        }
        
        if ( replacementString == nil )
        {
            // No pattern detected, just keep scanning the next character.
            continue;
        }

        // First, copy over all chars we scanned over but have not copied yet. Then
        // append the replacement string and update the scan range.
        scanRange.length = index - scanRange.location;
        [result appendString:[self substringWithRange:scanRange]];
        [result appendString:replacementString];
        scanRange.location = index + replacementLength;
        
        replacementString = nil;
    }
    
    // Copy the rest
    scanRange.length = [self length] - scanRange.location;
    [result appendString:[self substringWithRange:scanRange]];
    
    return result;
}

+ (NSString*)stringWithContentsOfURLDetectEncoding:(NSURL*)url error:(NSError**)error
{
    NSError * errorBuffer = nil;
    NSStringEncoding encoding;
    NSString * result = [NSString stringWithContentsOfURL:url usedEncoding:&encoding error:&errorBuffer];
    // Cocoa convention: check the return value, not the error, to detect failure.
    if ( result == nil )
    {
        errorBuffer = nil;
        result = [NSString stringWithContentsOfURL:url encoding:NSUTF8StringEncoding error:&errorBuffer];
    }
    if ( result == nil )
    {
        // Latin1 maps every byte to a character, so UTF8 data read this way
        // shows up as the patterns that stringByCleaningUTF8Errors repairs.
        errorBuffer = nil;
        result = [NSString stringWithContentsOfURL:url encoding:NSISOLatin1StringEncoding error:&errorBuffer];
        if ( ( result != nil ) && ( [result containsUTF8Errors] ) )
        {
            result = [result stringByCleaningUTF8Errors];
        }
    }
    if ( result == nil )
    {
        errorBuffer = nil;
        result = [NSString stringWithContentsOfURL:url encoding:NSASCIIStringEncoding error:&errorBuffer];
    }
    
    // The caller may pass NULL if it is not interested in the error.
    if ( error != NULL )
    {
        *error = errorBuffer;
    }
    return result;
}

@end
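
In case you wonder where the 0x40 in stringByCleaningUTF8Errors comes from: a two-byte UTF8 sequence stores the upper bits of the code point in the lead byte and the lower six bits in the continuation byte. The following plain C sketch compares the general decoding rule with the shortcut from the code above. Both function names are made up for this illustration. As far as I can tell, the shortcut only holds for the lead bytes 0xC2 and 0xC3, which is exactly the range the category deals with.

```c
#include <assert.h>

/* Hypothetical helper: decodes a two-byte UTF-8 sequence
   (lead byte 110xxxxx, continuation byte 10xxxxxx) into a code point. */
static unsigned int utf8_decode_pair(unsigned char lead, unsigned char cont)
{
    assert((lead & 0xE0) == 0xC0 && (cont & 0xC0) == 0x80);
    return ((unsigned int)(lead & 0x1F) << 6) | (unsigned int)(cont & 0x3F);
}

/* The shortcut used in stringByCleaningUTF8Errors, written as a function.
   Only meaningful for lead bytes 0xC2 and 0xC3:
     0xC2: (0x02 << 6) + (cont - 0x80) = cont
     0xC3: (0x03 << 6) + (cont - 0x80) = 0x40 + cont */
static unsigned int shortcut_decode(unsigned char lead, unsigned char cont)
{
    return (lead == 0xC3) ? 0x40u + cont : cont;
}
```

For the “Ä” example, both functions turn 0xC3 0x84 into 0xC4, the Latin1 and Unicode value of “Ä”. Any lead byte above 0xC3 encodes characters beyond U+00FF and needs the general rule.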
