Wednesday, January 15, 2003

"Strings" (i.e. character arrays) (K&R § 1.9, 5.5)

C does not have a first-class string data type. Instead, a string is represented as an array of characters terminated by a nul byte, '\0' (ASCII 0).

Consider the following example:


#include	<stdio.h>

int
main()
{
	char	string1[] = { 'H', 'e', 'l', 'l', 'o', '\0' };
	char	string2[] = "World.";

	printf("%s, ", string1);
	puts(string2);

	string2[3] = 'k';
	string2[4] = '\0';
	/* sizeof yields a size_t, which is printed with %zu rather than %d */
	printf("Array containing \"%s\" has %zu bytes\n",
			string2, sizeof(string2));
	return 0;
}


string1 is initialized by specifying each individual character, in much the same way that the integer array was initialized earlier. Note that the nul byte is explicitly specified at the end of the string. Specifying the characters one at a time is tedious, so C allows you to initialize an array using a string literal (in this case "World."). When this is done, the compiler creates an array just large enough to hold each of the characters as well as the nul byte. At run time, the characters of the string literal, including its implicit nul byte at the end, are copied into the array.

If we had explicitly specified a dimension larger than the string requires (e.g. char string2[20] = "World."), then the unused elements would be filled with '\0'.

The printf() function call uses the %s conversion specifier in its format string to display string1 followed by a comma and a space. This specifier requires that the corresponding argument in printf()'s argument list be a pointer to a character. As we will see later, using string1 satisfies this requirement. The %s specifier will display the sequence of characters starting at the specified location until it encounters '\0'.

The puts() function (which is also declared in the stdio.h header file) is then called with string2. It writes the supplied string to standard output, followed by a newline. It is simpler to use than printf() and typically at least as fast, so it is a good choice when all you want to do is display a nul-terminated string with no formatting.

We can change the contents of the array of characters as we would any other array. For example, when the line string2[3] = 'k' is executed, the fourth character of the string2 array is changed from an 'l' to a 'k'. We can also shorten the string by writing a nul byte earlier in the array. For example, string2[4] = '\0' shortens the string to just "Work".

Finally, the above program displays string2 delimited by quotation marks and a count of the number of bytes (characters) that the string2 array can hold (including the nul byte). Note that we can display a double-quote character by escaping it with a backslash inside printf()'s format string.

The output of the program is:

Hello, World.
Array containing "Work" has 7 bytes

Note that making the string shorter does not actually change the size of the array that contains it.

Ultimately, when dealing with strings, there are a couple of very important points to remember:

  1. Ensure that all strings are terminated with a nul byte.
  2. Always make sure that any array to which a string is copied has enough room for the characters of the string as well as the terminating nul byte.

In some cases, when you forget to add the trailing nul byte or forget to leave enough space in your character array to accommodate it, your program may still appear to work. Unfortunately, the problems may not surface until much later, which is why nul byte bugs can be so hard to track down.

Standard string functions: strcpy(), strcat(), strcmp() and strlen() (K&R § 2.8, 5.3, 5.5)

The C standard library provides several functions for handling strings. These functions are all declared in string.h, so any source file that calls them should have #include <string.h>.

strcpy(dst,src) Copies string src to dst (including the nul byte).
strcat(dst,src) Concatenates src to the end of dst, overwriting dst's nul byte. The nul byte from src is placed at the end of the concatenated string.
strcmp(str1,str2) Compares the two strings character by character. Returns an integer less than 0 if str1 sorts before str2, an integer greater than 0 if str1 sorts after str2, and 0 if the strings are equal. The ordering is by character code, so it is not always alphabetical: in ASCII, for example, every uppercase letter sorts before every lowercase letter.
strlen(str) Returns the length of the string (this length does not include the nul byte).

The following program demonstrates their usage:


#include	<stdio.h>
#include	<string.h>

#define	MAX_LEN 10
#define	ALPHA_LEN 26

int
main()
{
	char	strings[][MAX_LEN] = {	"abcdefghi",
					"jklmnop",
					"",
					"qrstu",
					"vwxyz" };
	char	alpha[ALPHA_LEN + 1]; /* "+ 1" is for the nul byte */
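	size_t	i;	/* same unsigned type that sizeof yields */

	strcpy(alpha, "");
	for (i = 0; i < sizeof(strings)/sizeof(strings[0]); i++) {
		strcat(alpha, strings[i]);
	}
	/* strlen() returns a size_t, which is printed with %zu rather than %d */
	printf("\"%s\" has length %zu\n", alpha, strlen(alpha));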

	if (strcmp(alpha, "abcdefghijklmnopqrstuvwxyz") == 0) {
		puts("The resulting string forms the alphabet");
	}
	return 0;
}


This code creates a two-dimensional array (strings) to hold a collection of strings and a one-dimensional array (alpha) to hold the result of concatenating all the strings in the two-dimensional array. Note that we add one to the size of alpha. This is to explicitly accommodate the nul byte.

Internally, the two-dimensional array looks as follows:

              Col
              [0]  [1]  [2]  [3]  [4]  [5]  [6]  [7]  [8]  [9]
Row
strings[0]     a    b    c    d    e    f    g    h    i   \0
strings[1]     j    k    l    m    n    o    p   \0   \0   \0
strings[2]    \0   \0   \0   \0   \0   \0   \0   \0   \0   \0
strings[3]     q    r    s    t    u   \0   \0   \0   \0   \0
strings[4]     v    w    x    y    z   \0   \0   \0   \0   \0

Note that there is a lot of wasted space here: 24 of the array's 50 bytes, nearly half, are nul bytes. We'll see a more efficient way of storing an array of strings when we discuss pointers. Note also that all the strings have at least one nul byte at the end. Indeed, the row denoted by strings[2] consists entirely of nul bytes. This is perfectly valid: strings[2] is simply an empty string (i.e. a string of length 0).

The program copies an empty string into alpha using strcpy() because a local array that is not initialized has indeterminate contents. We must ensure alpha is an empty string because we are concatenating to it later on. Using strcpy() to copy an empty string isn't particularly efficient. Instead, we could have simply initialized alpha to "" when we defined it. We could also have said alpha[0] = '\0' instead of strcpy(alpha, ""); both have the same effect.

The for loop then executes once for each row of the array, i.e. once for each string. Because sizeof(strings) is 50 and sizeof(strings[0]) is 10 (strings[0] is a one-dimensional array of 10 characters), the loop executes five times. Each time through the loop, the next string from the strings array is concatenated onto the end of alpha. Note that the code does not check whether there is enough room in the destination array for the additional characters. If the concatenated string overflows the array's bounds, the behaviour of the program is undefined.

When the looping is completed, printf() is used to display the resulting string and its length (using strlen()). Note that the length of the string returned by strlen() does not include the trailing nul byte.

Finally, using strcmp() we compare alpha with a string literal representing the alphabet. If they are identical (i.e. strcmp() returns 0), then we display a simple message indicating so.

Note that these string functions could seriously misbehave if either of the string arguments is not nul terminated or (in the case of strcpy() and strcat()) if there is not enough room in the destination array for the result.

Last modified: Wed Jan 15 18:46:33 2003