Thursday, October 25, 2012

Split

Split separates strings. Strings often have delimiter characters in their data. Delimiters include "\r\n" newline sequences and the comma and tab characters. In C#, the string type provides the Split method. This method handles splitting upon string and character delimiters.

Key point: Use Split to separate parts from a string. If your input string is "A B C", you split on the space to get an array of "A", "B" and "C".

Example

To begin, let's examine the simplest Split method overload. You already know the general way to do this, but it is good to see the basic syntax before we move on. This program splits on a single character.

Program that splits on spaces [C#]

using System;

class Program
{
    static void Main()
    {
        string s = "there is a cat";
        //
        // Split string on spaces.
        // ... This will separate all the words.
        //

        string[] words = s.Split(' ');
        foreach (string word in words)
        {
            Console.WriteLine(word);
        }
    }
}

Output

there
is
a
cat

The input string, which contains four words, is split on spaces. The result value from Split is a string array. The foreach-loop then loops over this array and displays each word.

Multiple characters

Next we use the Regex.Split method to separate a string on a multi-character delimiter, here the "\r\n" sequence. The string Split examples that follow instead pass a new char array or string array as the delimiter argument. Those overloads also accept a StringSplitOptions argument, which can be used to remove empty strings.

Program that splits on lines with Regex [C#]

using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        string value = "cat\r\ndog\r\nanimal\r\nperson";
        //
        // Split the string on line breaks.
        // ... The return value from Split is a string[] array.
        //

        string[] lines = Regex.Split(value, "\r\n");

        foreach (string line in lines)
        {
            Console.WriteLine(line);
        }
    }
}

Output

cat
dog
animal
person

RemoveEmptyEntries


The Regex type methods are used to Split strings effectively. But string Split is often faster. The next example specifies an array as the first argument to string Split. It uses the RemoveEmptyEntries enumerated constant.

Program that splits on multiple characters [C#]

using System;

class Program
{
    static void Main()
    {
        //
        // This string is also separated by Windows line breaks.
        //

        string value = "shirt\r\ndress\r\npants\r\njacket";

        //
        // Use a new char[] array of two characters (\r and \n) to break
        // lines into separate strings. Use "RemoveEmptyEntries"
        // to make sure no empty strings get put in the string[] array.
        //

        char[] delimiters = new char[] { '\r', '\n' };
        string[] parts = value.Split(delimiters,
            StringSplitOptions.RemoveEmptyEntries);
        for (int i = 0; i < parts.Length; i++)
        {
            Console.WriteLine(parts[i]);
        }

        //
        // Same as the previous example, but uses a new string of 2 characters.
        //

        parts = value.Split(new string[] { "\r\n" }, StringSplitOptions.None);
        for (int i = 0; i < parts.Length; i++)
        {
            Console.WriteLine(parts[i]);
        }
    }
}

Output
(Repeated two times)

shirt
dress
pants
jacket

One useful overload of Split receives char[] arrays. The string Split method receives a character array as the first parameter. Each char in the array designates a new block.

Using string arrays. Another overload of Split receives string[] arrays. This means a string array can also be passed to the Split method. The new string[] array is created inline with the Split call.

Explanation of StringSplitOptions. The RemoveEmptyEntries enumerated constant is specified in the first call. When two delimiters are adjacent, as '\r' and '\n' are here, Split returns an empty string between them. We can pass RemoveEmptyEntries as the second argument to avoid these empty results. The following screenshot shows the Visual Studio debugger.

Split string debug screenshot
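
The effect of adjacent delimiters can be checked with a short program. This is a minimal sketch; the SplitLines helper name is hypothetical and only wraps the char[] overload of Split shown earlier.

```csharp
using System;

class Program
{
    // Hypothetical helper: treats \r and \n as two separate one-char delimiters.
    public static string[] SplitLines(string text, StringSplitOptions options)
    {
        return text.Split(new char[] { '\r', '\n' }, options);
    }

    static void Main()
    {
        string value = "shirt\r\ndress";

        // With None, the empty string between \r and \n is kept: 3 entries.
        Console.WriteLine(SplitLines(value, StringSplitOptions.None).Length);

        // With RemoveEmptyEntries, that empty string is dropped: 2 entries.
        Console.WriteLine(SplitLines(value, StringSplitOptions.RemoveEmptyEntries).Length);
    }
}
```

The first call returns "shirt", an empty string, and "dress". The second call returns only the two words.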

Separate words


You can separate words with Split. Usually, the best way to separate words is to use a Regex that specifies non-word chars. This example separates words in a string based on non-word characters. It eliminates punctuation and whitespace from the return array.

Program that separates on non-word pattern [C#]

using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        string[] w = SplitWords("That is a cute cat, man");
        foreach (string s in w)
        {
            Console.WriteLine(s);
        }
        Console.ReadLine();
    }

    /// <summary>
    /// Take all the words in the input string and separate them.
    /// </summary>

    static string[] SplitWords(string s)
    {
        //
        // Split on all non-word characters.
        // ... Returns an array of all the words.
        //

        return Regex.Split(s, @"\W+");
        // @ special verbatim string syntax
        // \W+ one or more non-word characters together

    }
}

Output

That
is
a
cute
cat
man

In the example, we showed how to separate parts of your input string based on any character set or range with Regex. Overall, this provides more power than the string Split methods.
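
One caveat with the \W+ pattern: when the input begins or ends with punctuation, Regex.Split returns an empty string at that end of the array. The sketch below filters those out with LINQ. The SplitWordsNoEmpty name is hypothetical, and filtering this way is one possible approach rather than the only one.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

class Program
{
    // Hypothetical helper: splits on non-word characters, then drops empty strings.
    public static string[] SplitWordsNoEmpty(string s)
    {
        return Regex.Split(s, @"\W+")
            .Where(word => word.Length > 0)
            .ToArray();
    }

    static void Main()
    {
        // The trailing "!" would otherwise leave an empty string at the end.
        foreach (string word in SplitWordsNoEmpty("Hello, world!"))
        {
            Console.WriteLine(word);
        }
    }
}
```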

Text files

Here you have a text file containing comma-delimited lines of values, often called a CSV file. We use the File.ReadAllLines method here, but you may want StreamReader instead. This code reads in both lines of the file shown below and splits each one on commas.

Then: It displays each value prefixed by the zero-based index of its line. The output shows how the file was parsed into the strings.

Contents of input file: TextFile1.txt

Dog,Cat,Mouse,Fish,Cow,Horse,Hyena
Programmer,Wizard,CEO,Rancher,Clerk,Farmer

Program that splits lines in file [C#]

using System;
using System.IO;

class Program
{
    static void Main()
    {
        int i = 0;
        foreach (string line in File.ReadAllLines("TextFile1.txt"))
        {
            string[] parts = line.Split(',');
            foreach (string part in parts)
            {
                Console.WriteLine("{0}:{1}",
                    i,
                    part);
            }
            i++; // For demo only
        }
    }
}

Output

0:Dog
0:Cat
0:Mouse
0:Fish
0:Cow
0:Horse
0:Hyena
1:Programmer
1:Wizard
1:CEO
1:Rancher
1:Clerk
1:Farmer
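
For the StreamReader alternative mentioned above, a sketch might look like the following. The ReadRows helper is a hypothetical name; it accepts any TextReader so the same logic works with a StreamReader over a file. Like the program above, it splits naively on commas and does not handle quoted fields.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

class Program
{
    // Hypothetical helper: yields the fields of one line at a time,
    // so the whole file is never held in memory at once.
    public static IEnumerable<string[]> ReadRows(TextReader reader)
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            yield return line.Split(',');
        }
    }

    static void Main()
    {
        // Assumes TextFile1.txt exists in the working directory.
        using (StreamReader reader = new StreamReader("TextFile1.txt"))
        {
            int i = 0;
            foreach (string[] parts in ReadRows(reader))
            {
                foreach (string part in parts)
                {
                    Console.WriteLine("{0}:{1}", i, part);
                }
                i++;
            }
        }
    }
}
```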

Directory paths

You can Split the segments in a Windows local directory into separate strings. Please note that directory paths are complex and this may not handle all cases correctly. It is also platform-specific. You could use System.IO.Path.DirectorySeparatorChar for more flexibility.

Program that splits Windows directories [C#]

using System;

class Program
{
    static void Main()
    {
        // The directory from Windows
        const string dir = @"C:\Users\Sam\Documents\Perls\Main";
        // Split on directory separator
        string[] parts = dir.Split('\\');
        foreach (string part in parts)
        {
            Console.WriteLine(part);
        }
    }
}

Output

C:
Users
Sam
Documents
Perls
Main
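
A less platform-specific variant could split on Path.DirectorySeparatorChar and Path.AltDirectorySeparatorChar instead of a hard-coded backslash. This is a sketch only: the SplitPath helper name is hypothetical, and it still does not cover every kind of path (UNC paths, for instance).

```csharp
using System;
using System.IO;

class Program
{
    // Hypothetical helper: splits on whichever separators the current platform uses.
    public static string[] SplitPath(string path)
    {
        char[] separators =
        {
            Path.DirectorySeparatorChar,
            Path.AltDirectorySeparatorChar
        };
        return path.Split(separators, StringSplitOptions.RemoveEmptyEntries);
    }

    static void Main()
    {
        // Path.Combine builds the path with the platform's own separator.
        string dir = Path.Combine("Users", "Sam", "Documents");
        foreach (string part in SplitPath(dir))
        {
            Console.WriteLine(part);
        }
    }
}
```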

Internal logic

The logic internal to the .NET Framework for Split is implemented in managed code. The methods call into the overload with three parameters. The parameters are next checked for validity.

Next: It uses unsafe code to create the separator list, and then a for-loop combined with Substring to return the array.
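
The two-pass idea described here can be illustrated in ordinary managed code. The following is a simplified sketch for a single char delimiter, not the actual framework source: the SimpleSplit name is hypothetical, and it uses a List<int> where the real implementation builds its separator list with unsafe code.

```csharp
using System;
using System.Collections.Generic;

class Program
{
    // Hypothetical, simplified re-creation of the two-pass approach.
    public static string[] SimpleSplit(string input, char separator)
    {
        // Pass 1: record the index of every separator character.
        List<int> separatorIndexes = new List<int>();
        for (int i = 0; i < input.Length; i++)
        {
            if (input[i] == separator)
            {
                separatorIndexes.Add(i);
            }
        }

        // Pass 2: use Substring to copy out the text between separators.
        string[] result = new string[separatorIndexes.Count + 1];
        int start = 0;
        for (int i = 0; i < separatorIndexes.Count; i++)
        {
            result[i] = input.Substring(start, separatorIndexes[i] - start);
            start = separatorIndexes[i] + 1;
        }
        // The last piece runs from the final separator to the end of the string.
        result[separatorIndexes.Count] = input.Substring(start);
        return result;
    }

    static void Main()
    {
        foreach (string part in SimpleSplit("there is a cat", ' '))
        {
            Console.WriteLine(part);
        }
    }
}
```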

Benchmarks


I tested a long string and a short string, having 1200 and 40 chars, respectively. String splitting speed varies with the type of strings. The length of the blocks, number of delimiters, and total size of the string factor into performance. The Regex.Split option generally performed the worst.

And: I felt that the second or third methods would be the best, after observing performance problems with regular expressions in other situations.

Strings used in test [C#]

//
// Build long string.
//

_test = string.Empty;
for (int i = 0; i < 120; i++)
{
    _test += "01234567\r\n";
}
//
// Build short string.
//

_test = string.Empty;
for (int i = 0; i < 10; i++)
{
    _test += "ab\r\n";
}

Methods tested: 100000 iterations

static void Test1()
{
    string[] arr = Regex.Split(_test, "\r\n", RegexOptions.Compiled);
}

static void Test2()
{
    string[] arr = _test.Split(new char[] { '\r', '\n' },
        StringSplitOptions.RemoveEmptyEntries);
}

static void Test3()
{
    string[] arr = _test.Split(new string[] { "\r\n" },
        StringSplitOptions.None);
}

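Longer strings of 1200 chars. The benchmark for the methods on the long strings is more even: Regex.Split takes about 2.8 times as long as the char[] Split, compared with about 6.9 times on the short strings. It may be that for even longer strings, such as entire files, the Regex method is equivalent or even faster, but these measurements do not show that. In both tests Regex is the slowest method.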

Benchmark of Split on long strings

[1] Regex.Split: 3470 ms
[2] char[] Split: 1255 ms [fastest]
[3] string[] Split: 1449 ms

Benchmark of Split on short strings

[1] Regex.Split: 434 ms
[2] char[] Split: 63 ms [fastest]
[3] string[] Split: 83 ms

Short strings of 40 chars. This shows the three methods compared to each other on short strings. Method 1 is the Regex method. It is by far the slowest on the short strings. This may be because of the compilation time. Smaller is better.

Performance optimization

Performance recommendation. For programs that use shorter strings, the methods that split based on arrays are faster and simpler. They will avoid Regex compilation. For somewhat longer strings or files that contain more lines, Regex is appropriate.
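
To repeat measurements like these on your own machine, a harness along the following lines could be used. It is a sketch built on Stopwatch; the BuildInput and TimeMs helper names are hypothetical, and this is not necessarily the harness that produced the numbers above. Timings will differ by machine and runtime version.

```csharp
using System;
using System.Diagnostics;
using System.Text;
using System.Text.RegularExpressions;

class Program
{
    // Hypothetical helper: repeats one line a given number of times.
    public static string BuildInput(string line, int count)
    {
        StringBuilder builder = new StringBuilder();
        for (int i = 0; i < count; i++)
        {
            builder.Append(line);
        }
        return builder.ToString();
    }

    // Hypothetical helper: runs the action repeatedly and returns elapsed milliseconds.
    public static long TimeMs(Action action, int iterations)
    {
        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            action();
        }
        watch.Stop();
        return watch.ElapsedMilliseconds;
    }

    static void Main()
    {
        const int iterations = 100000;
        string test = BuildInput("01234567\r\n", 120); // 1200 chars
        char[] charDelimiters = { '\r', '\n' };
        string[] stringDelimiters = { "\r\n" };

        Console.WriteLine("Regex.Split: {0} ms", TimeMs(
            () => Regex.Split(test, "\r\n", RegexOptions.Compiled), iterations));
        Console.WriteLine("char[] Split: {0} ms", TimeMs(
            () => test.Split(charDelimiters, StringSplitOptions.RemoveEmptyEntries), iterations));
        Console.WriteLine("string[] Split: {0} ms", TimeMs(
            () => test.Split(stringDelimiters, StringSplitOptions.None), iterations));
    }
}
```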

Escaped characters


You can use Replace on your string input to substitute special characters in for any escaped characters. This solves lots of problems on parsing computer-generated code or data.
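
As an illustration, suppose the input uses the two-character sequence backslash-t where a real tab was intended. That input format is an assumption made for this sketch, and the SplitEscapedTabs name is hypothetical. Replace converts the escaped form first, and Split then works on real tab characters.

```csharp
using System;

class Program
{
    // Hypothetical helper for an assumed input format with escaped tabs.
    public static string[] SplitEscapedTabs(string input)
    {
        // Turn each two-character sequence (backslash, 't') into a real tab.
        string unescaped = input.Replace("\\t", "\t");
        return unescaped.Split('\t');
    }

    static void Main()
    {
        // The verbatim string keeps the backslashes as literal characters.
        string raw = @"cat\tdog\tbird";
        foreach (string part in SplitEscapedTabs(raw))
        {
            Console.WriteLine(part);
        }
    }
}
```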

Delimiter arrays


Let's focus on how you can specify delimiters to the Split method. My further research into Split shows that it is worthwhile to create the char[] array you are splitting on once, in a local variable, and reuse it. This reduces memory pressure. It improves runtime performance.

Note: We see that storing the array of delimiters separately is good. My measurements show the code below is less than 10% faster when the array is stored outside the loop.

Slow version, before [C#]

//
// Split on multiple characters using new char[] inline.
//

string t = "string to split, ok";

for (int i = 0; i < 10000000; i++)
{
    string[] s = t.Split(new char[] { ' ', ',' });
}

Fast version, after [C#]

//
// Split on multiple characters using new char[] already created.
//

string t = "string to split, ok";
char[] c = new char[] { ' ', ',' }; // <-- Cache this

for (int i = 0; i < 10000000; i++)
{
    string[] s = t.Split(c);
}
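
When the split happens inside a method that is called many times, rather than inside one loop, the same idea can be applied with a static readonly field. This is a sketch of that variation; the _delimiters and SplitCached names are hypothetical.

```csharp
using System;

class Program
{
    // Allocated once when the type is initialized, then reused by every call.
    static readonly char[] _delimiters = { ' ', ',' };

    // Hypothetical helper: no delimiter array is allocated per call.
    public static string[] SplitCached(string input)
    {
        return input.Split(_delimiters);
    }

    static void Main()
    {
        string t = "string to split, ok";
        for (int i = 0; i < 3; i++)
        {
            string[] s = SplitCached(t);
            Console.WriteLine(s.Length);
        }
    }
}
```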

StringSplitOptions

What effect does the StringSplitOptions argument have? It affects the behavior of the Split method. The two values of StringSplitOptions—None and RemoveEmptyEntries—are actually just integers that tell Split how to work.

Program that uses StringSplitOptions [C#]

using System;

class Program
{
    static void Main()
    {
        // Input string contains separators.
        string value1 = "man,woman,child,,,bird";
        char[] delimiter1 = new char[] { ',' }; // <-- Split on these

        // ... Use StringSplitOptions.None.
        string[] array1 = value1.Split(delimiter1,
            StringSplitOptions.None);

        foreach (string entry in array1)
        {
            Console.WriteLine(entry);
        }

        // ... Use StringSplitOptions.RemoveEmptyEntries.
        string[] array2 = value1.Split(delimiter1,
            StringSplitOptions.RemoveEmptyEntries);

        Console.WriteLine();
        foreach (string entry in array2)
        {
            Console.WriteLine(entry);
        }
    }
}

Output

man
woman
child


bird

man
woman
child
bird

The input string in the example contains five commas, which are the delimiters. However, two fields between commas are 0 characters long (empty). In the first call to Split, these fields are put into the result array. In the second call, where we specify StringSplitOptions.RemoveEmptyEntries, the two empty fields are not in the result array.

Tip: You can use the StringSplitOptions.RemoveEmptyEntries enumerated constant as the second parameter in the Split method. By removing empty entries, you can simplify some logic.

However: Sometimes empty fields are useful for maintaining the order of your fields.

StringReader

We can instead use the StringReader type to separate a string into lines. StringReader can additionally lead to performance improvements over using Split. This is because no arrays are allocated.
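
A minimal sketch of the StringReader approach follows. The CountLines helper name is hypothetical. ReadLine returns one line per call and null at the end of the input, so the lines can be processed without building a string[] array.

```csharp
using System;
using System.IO;

class Program
{
    // Hypothetical helper: walks the lines without allocating an array of them.
    public static int CountLines(string text)
    {
        int count = 0;
        using (StringReader reader = new StringReader(text))
        {
            while (reader.ReadLine() != null)
            {
                count++;
            }
        }
        return count;
    }

    static void Main()
    {
        string value = "cat\r\ndog\r\nanimal\r\nperson";
        using (StringReader reader = new StringReader(value))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                Console.WriteLine(line);
            }
        }
        Console.WriteLine(CountLines(value));
    }
}
```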

Summary

We saw several examples and benchmarks of the Split method in the C# programming language. You can use Split to divide or separate your strings while keeping your code as simple as possible.

Tip: Using IndexOf and Substring together to parse your strings can sometimes be more effective.
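
For example, when only the first field of a line is needed, IndexOf and Substring can extract it without creating an array for the remaining fields. This is a small sketch; the FirstField helper name is hypothetical.

```csharp
using System;

class Program
{
    // Hypothetical helper: returns the text before the first separator,
    // or the whole string when no separator is present.
    public static string FirstField(string line, char separator)
    {
        int index = line.IndexOf(separator);
        if (index < 0)
        {
            return line;
        }
        return line.Substring(0, index);
    }

    static void Main()
    {
        Console.WriteLine(FirstField("Dog,Cat,Mouse,Fish", ','));
    }
}
```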
