Main

February 13, 2004 (Friday)

Chapter 5: Using sequential containers and analyzing strings (cont'd)

We've seen two types of containers, list and vector which can be used to hold data of any type. As mentioned, each container has its advantages and disadvantages and which one you choose depends largely upon how you intend to access and modify elements in the container. Some of the advantages and disadvantages of lists and vectors is given below.

Advantages of list containers

Disadvantages of list containers

Advantages of vector containers

Disadvantages of vector containers

In summary, if you need to be able to access the elements in the container randomly and you are only adding/deleting elements from the end of the container, then a vector is your best option. If you need to be able to efficiently insert/remove elements from anywhere in the container and do not want iterators to become invalidated during insert insert/remove operations, then a list container is your best option.

String Processing (K&M § 5.6, 5.7)

Using the string class as if it were a vector of char elements, we can do some fairly sophisticated string manipulation in C++. This section demonstrates one of the more popular string processing algorithms: tokenizing a string by breaking it up into its constituent words.

Splitting a string

One operation which is used frequently in the parsing text files is to break up the file into a collection of words. For the purposes of this example, we'll consider a word to be be any collection of characters separated by whitespace characters. Whitespace characters include characters like the space, tab, newline, etc. What we want is a function called split() that takes a string object and returns all the words in that string. In order to return all the words, we'll return a value of type vector<string>. We can declare such a function in a header file split.h:

#ifndef GUARD_split_h
#define GUARD_split_h

#include <vector>
#include <string>
std::vector<std::string> split(const std::string&);

#endif
split.h

Note the use of the guard macro to prevent multiple inclusion and the use of the std:: namespace specifier on all the standard library entities.

We can then define the split() function in our split.cpp file. The algorithm behind this function is to use two index variables i and j to delimit the start and end of words within the string.

#include <cctype>
#include <string>
#include <vector>

#include "split.h"

using std::vector;
using std::string;

vector<string> split(const string& s)
{
	vector<string> ret;
	typedef string::size_type string_size;
	string_size i = 0;

	// invariant: we have processed characters [original value of i, i)
	while (i != s.size()) {
		// ignore leading blanks
		// invariant: characters in range
		// [original i, current i) are all spaces
		while (i != s.size() && isspace(s[i]))
			++i;

		// find end of next word
		string_size j = i;
		// invariant: none of the characters in range
		// [original j, current j) is a space
		while (j != s.size() && !isspace(s[j]))
			++j;

		// if we found some nonwhitespace characters
		if (i != j) {
			// copy from s starting at i and taking j - i chars
			ret.push_back(s.substr(i, j - i));
			i = j;
		}

	}
	return ret;
}
split.cpp

The code will increment i until it indexes the first nonspace character in the string, then, using i as a starting point, it will increment j until it indexes a space character. Once the indices i and j have been set appropriately, we copy the word from the string and store it in our vector of words as follows:

ret.push_back(s.substr(i, j - i));

The substr() member function of the string class takes an index and a length and makes a copy of the characters in the range specified by the index and length. In the above example, the word starts at index i and its length can be determined by j - i. (Note that j indexes the first whitespace character after the word indexed by i. Therefore, j - i is the length of the word.) After the word has been copied from the original string, it is then stored onto the end of the ret vector. When the entire string has been processed, the ret vector, containing all the words in the string, is returned by the split() function.

Note that the code above uses the isspace() function which returns true if its argument is a whitespace character. This is very similar to the same function in C. To get the declaration for this function, you must include the <cctype> header.

We'll see a simpler implementation of the split() function in Chapter 6 which uses iterators and functions available in the <algorithm> header.

The main() function below will read lines from standard input into a string variable and then pass the string to the split() function for processing. It will then display all the words in the string.

#include <iostream>
#include <string>
#include <vector>

#include "split.h"

using std::cin;
using std::cout;
using std::endl;
using std::getline;
using std::string;
using std::vector;

int main()
{
	string s;

	// read and split each line of input
	while (getline(cin, s)) {
		vector<string> v = split(s);

		// write each word in `v'
		for (vector<string>::const_iterator it = v.begin();
		     it != v.end(); ++it)
			cout << *it << endl;
	}
	return 0;
}
split_main.cpp

In order to input entire lines from standard input, the main() function uses the global getline() function. This function is roughly analogous to the fgets() function in C. getline() takes two parameters: an istream, and a string. It returns a reference to the istream passed in after the input operation takes place. The getline() function itself reads the next line of input (including the newline character) from the input stream and stores it in the supplied string (without the newline character). Upon end of input, the istream object returned by getline() will return false when used in a boolean context. As a result, you can use getline() as the condition of a while loop to determine when to stop reading.

Character pictures

Although we will not be covering K&M § 5.8 in class, you should still read this section as it provides further examples of string manipulation using so-called character pictures. There isn't too much in the way of new ideas presented in this section -- it's basically applying ideas that have been introduced in earlier sections.

Chapter 6: Using library algorithms

The <algorithm> header, which we've seen before when sorting vectors, contains many useful functions for performing high-level operations on standard types such as vectors and strings. Effectively using these functions can reduce development time and make our C++ programs more robust. This chapter will introduce several algorithm functions which can make C++ programming a less tedious chore.

For example, consider the following (very contrived) program which creates three vectors containing letters of the alphabet and then uses the insert() method of the vector class and the generic copy() function in order to build up a fourth vector containing all the letters of the alphabet in the correct order.

#include	<algorithm>
#include	<vector>
#include	<iostream>

using std::endl;
using std::vector;
using std::cout;

int main()
{
	vector<char>	v1, v2, v3, v4;
	for (char c = 'a'; c < 'i'; c++)
		v1.push_back(c);	// abcdefgh

	for (char c = 'i'; c < 'r'; c++)
		v2.push_back(c);	// ijklmnopq

	for (char c = 'r'; c <= 'z'; c++)
		v3.push_back(c);	// rstuvwxyz

	// Make a vector containing |v2| blanks.
	vector<char> blanks(v2.size(), ' ');

	v4 = v1;
	v4.insert(v4.end(), blanks.begin(), blanks.end());
	copy(v3.begin(), v3.end(), back_inserter(v4));

	copy(v2.begin(), v2.end(), v4.begin() + v1.size());

	for (vector<char>::const_iterator it = v4.begin();
		it != v4.end(); it++) {
		cout << *it;
	}
	cout << endl;
}
inscopy.cpp

After creating the vectors v1, v2, v3 and blanks, we assign v1 to v4, essentially making v4 a copy of v1.

Then we use the insert() method of the vector class to append copies of all the elements of the blanks vector onto the end of the the v4 vector. The insert() method will copy elements denoted by the iterators given as the second and third arguments and insert them into the container immediately before the iterator denoted by the first iterator argument.

Next, we use the global copy() function to append elements onto the end of the v4 vector. The copy() function will copy all those elements within the range denoted by the the first two iterator arguments and copy them to the location denoted by the third argument. In the above example, we must use the iterator adapter back_inserter() in order to tell the copy() function that it should append the copied elements onto the back of the v4 vector. If we had just said:

copy(v3.begin(), v3.end(), v4.end());

Then copy will have attempted to write its first copied element into the location denoted by v4.end() but because v4.end() denotes the location one passed the end of the vector (and not an actual element within the vector itself), the program would likely crash. This is a subtle, but important difference between the insert() method and the copy() function. Note that instead of using copy(), we could have used the insert() method as we did in the previous statement to achieve the same result.

Finally, we copy all the elements of the v2 vector into the appropriate position of the v4 vector.

copy(v2.begin(), v2.end(), v4.begin() + v1.size());

This will cause all the blanks in the v4 vector to be overwritten with the elements of the v2 vector. Note that because all the elements that are being written to in the v4 vector actually exist, there is no need to use an iterator adapter like back_inserter() in this case.

Note that even though we are inserting and copying a lot of elements in this code, we are not using any for or while loops to do so. The only loop used is for displaying the elements of the v4 vector. By relying on iterators and generic functions in the <algorithm> header, we can let the standard library do all the low-level ``heavy lifting'' for us while we concentrate on the higher level aspects of the program. Just as long as we use the correct function with the correct iterator arguments, the standard library will do the rest for us.

Detecting palindromes

The program below will read words from standard input and display a message indicated whether or not the word is a palindrome. The core of the functionality is in the is_palindrome() function which calls a single <algorithm> function, namely equal(). equal() tests the equality of all the elements in the range specified by the first two iterator arguments with the elements denoted by the third iterator argument. This function assumes that the two sequences indicated by the first two iterators and the last iterator are of equal size.

#include <algorithm>
#include <cctype>
#include <iostream>
#include <string>

using std::cin;             using std::cout;
using std::endl;            using std::equal;
using std::string;          using std::transform;

bool is_palindrome(const string& s)
{
	return equal(s.begin(), s.end(), s.rbegin());
}

int main()
{
        string s;
        while (cin >> s) {
                if (is_palindrome(s))
                        cout << s << " is a palindrome" << endl;
                else
                        cout << s << " is not a palindrome" << endl;
        }
        return 0;
}
palin.cpp

The third iterator to the equal() function is s.rbegin() which denotes the last element in the (string) container. When this iterator is ``incremented,'' it denotes the previous element in the container. Hence, when the equal() function executes, it will test the first and the last character it the string for equality, then test the second character and the second last character and so on. If all the characters are equal, the function will return true

Another way to split strings

Earlier, we used indices in order to split a string into its constitute words. Using iterators and the find_if() function from the <algorithm> header, we can make the split() function simpler, as demonstrated below:

#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

#include "split.h"

using std::find_if;
using std::string;
using std::vector;

// `true' if the argument is whitespace, `false' otherwise
bool space(char c)
{
	return isspace(c);
}

// `false' if the argument is whitespace, `true' otherwise
bool not_space(char c)
{
	return !isspace(c);
}

vector<string> split(const string& str)
{
	typedef string::const_iterator iter;
	vector<string> ret;

	iter i = str.begin();
	while (i != str.end()) {

		// ignore leading blanks
		i = find_if(i, str.end(), not_space);

		// find end of next word
		iter j = find_if(i, str.end(), space);

		// copy the characters in [i, j)
		if (i != str.end())
			ret.push_back(string(i, j));
		i = j;
	}
	return ret;
}
split-alg.cpp

As mentioned, this split() function uses iterators (i and j) instead of indices to copy the words from the supplied string. We set up the initial iterator i to denote the beginning of the string then find the first non-space character in the string. To do this it uses the find_if() function. As with many of the <algorithm> functions, this function takes a pair of iterators denoting a range within a container. The function also takes a predicate function not_space. find_if() will then search all the elements denoted by the iterator range looking for the first element that satisfies the predicate. If it finds such an element, it will return an iterator that denotes that element. Otherwise it will return its second iterator argument.

After the first find_if() function executes, the iterator i will denote either the first character of the first word or, if there are no words in the string, i will be set to str.end() (i.e. the position one passed the end of the string). Then, we'll attempt to set the iterator j to denote the character immediately after the word. Again, we use the find_if() function but this time with the opposite predicate that we used before. Once we have i and j, we can copy the word from the string and save it to our word vector by writing:

ret.push_back(string(i, j));

Note that we do not use the substr() method of the string class to copy the word from the string. Instead, we can supply the two iterators as arguments to the string constructor. This will build a new string which consists of all the characters in the range [i, j) denoted by the iterators.

Also note that we can't just use isspace() as a predicate because it is overloaded for internationalization purposes and we cannot tell which isspace() function to execute based solely on the name, so we create our own space() function to eliminate the ambiguity.

Extracting URLs from a string

The following program demonstrates how we can extract URLs from a string. A URL would be a string of the form protocol-name://resource-name For example:

http://www.cs.mun.ca/~donald/comp3710/

We first define a header file which we include in any module that needs to extract URLs from a string. Again, note the use of the multiple inclusion guard and the use of the std:: namespace prefix on all the standard library elements.

#ifndef GUARD_urls_h
#define GUARD_urls_h

#include <vector>
#include <string>

std::vector<std::string> find_urls(const std::string& s);

#endif
urls.h

Next, we write the source module for the find_urls() function. This module defines several helper functions which help make the find_urls() function easier to write. The basic strategy behind the find_urls() function is to first search for the :// separator in the string. If we find that particular character sequence, we move backwards through the string searching for the actual beginning of a URL (i.e. the beginning of the protocol name). Then, we move forward from the location of the :// separator to search for the end of the URL. The beginning and end of each URL will be denoted by the iterators b and after. Once we have these iterators, we can simply copy the string representing the URL and store it in a vector of URLs in a manner similar to how we copied words from a string earlier.

#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

#include "urls.h"

using std::find;
using std::find_if;
using std::search;
using std::string;
using std::vector;

// Function declarations
bool not_url_char(char);

string::const_iterator
url_end(string::const_iterator, string::const_iterator);

string::const_iterator
url_beg(string::const_iterator, string::const_iterator);

// Function definitions
vector<string> find_urls(const string& s)
{
	vector<string> ret;
	typedef string::const_iterator iter;
	iter b = s.begin(), e = s.end();

	// look through the entire input
	while (b != e) {

		// look for one or more letters followed by `://'
		b = url_beg(b, e);

		// if we found it
		if (b != e) {
			// get the rest of the URL
			iter after = url_end(b, e);

			// remember the URL
			ret.push_back(string(b, after));

			// advance b and check for more URL on this line
			b = after;
		}
	}
	return ret;
}

string::const_iterator
url_end(string::const_iterator b, string::const_iterator e)
{
	return find_if(b, e, not_url_char);
}

bool not_url_char(char c)
{
	// characters, in addition to alphanumerics, that can appear in a URL
	static const string url_ch = "~;/?:@=&$-_.+!*'(),";

	// see whether `c' can appear in a URL and return the negative
	return !(isalnum(c) ||
	         find(url_ch.begin(), url_ch.end(), c) != url_ch.end());
}

string::const_iterator
url_beg(string::const_iterator b, string::const_iterator e)
{
	static const string sep = "://";

	typedef string::const_iterator iter;

	// `i' marks where the separator was found
	iter i = b;

	while ((i = search(i, e, sep.begin(), sep.end())) != e) {

		// make sure the separator isn't at the beginning or end of
		// the line
		if (i != b && i + sep.size() != e) {

			// `beg' marks the beginning of the protocol-name
			iter beg = i;
			while (beg != b && isalpha(beg[-1]))
				--beg;

			// is there at least one appropriate character before
			// and after the separator?
			if (beg != i && !not_url_char(i[sep.size()]))
				return beg;
		}

		// the separator we found wasn't part of a URL;
		// advance `i' past this separator
		i += sep.size();
	}
	return e;
}
urls.cpp

The find_urls() function makes use of two helper functions, url_beg() and url_end() which return iterators that denote the beginning and (one passed) the end of a URL found in the string.

The url_beg() function will search for the separator string (://) within the range denoted by the two iterators passed into the function. To do this, it uses the search() <algorithm> function:

while ((i = search(i, e, sep.begin(), sep.end())) != e) {

This function takes a pair of iterators, the first pair denotes the range of the string to be searched, the second pair denotes the string for which we are searching. If search() finds the string, it will return an iterator denoting the position at which the string was found. If the string was not found, then the second iterator, e, will be returned.

If the separator string was found, we then sanity check the location at which the string was found to make sure that it didn't occur at the very beginning or at the very end of the string in which we searched. If it didn't, then we create a new iterator beg and move it backwards through the string character by character until we encounter the beginning of the line or until we encounter a non-alphabetic character:

while (beg != b && isalpha(beg[-1]))
	--beg;

Note the use of the beg[-1] expression. This returns the character immediately before the position denoted by beg. We could have equivalently written this as *(beg - 1). Also note that we can decrement iterators as well as increment them as demonstrated by the --beg expression which updates beg to denote the character immediately before it. Also, note that because of the first test, beg != b, it is not possible for beg[-1] to denote a character outside the range of the original string.

Before we can actually return beg as the iterator that denotes the beginning of the URL, we must do a couple of additional sanity checks to ensure that this is the actual beginning of a URL. We must make sure that we successfully moved the beg iterator back at least one position. This means that there is at least one alphabetic character before the :// separator. We also check to make sure that there is at least one valid url character following the separator. These two tests are carried out by the following conditional:

if (beg != i && !not_url_char(i[sep.size()]))
	return beg;

Again, we use the indexing operation [ ] on the iterator i (which denotes the beginning position of the separator found in the string. By using an index value of sep.size(), we will look at the character immediately following the separator string. If this character is indeed a valid URL character (and the beg != i test conducted before was true), then we finally return the beg iterator from this function.

Note the use of the local static variable sep in the url_beg() function. This has the same semantics as in C -- the variable will be created and initialized once during the entire program and its value will persist between calls to the url_beg() function. Because it is also const, the value of sep can never change during the execution of the program.

The url_end() function is much simpler. It simply takes the beginning of the URL (returned by url_beg()) and the end of the string in which we are searching for URLs and calls the find_if() function to find the first non-URL character in the range denoted by the two iterators. In order to find the first non-URL character in the string, it passes a new predicate function called not_url_char to the find_if() function. This predicate simply defines a string which contains all the valid URL characters.

static const string url_ch = "~;/?:@=&$-_.+!*'(),";

It then returns the boolean result of a conditional test. A character is a valid URL character if it is alphanumeric or if the character exists in the url_ch string. We negate the value of this test thereby returning true if the character given to the predicate function is not a valid URL character and false otherwise:

return !(isalnum(c) ||
	 find(url_ch.begin(), url_ch.end(), c) != url_ch.end());

To determine whether or not the character is in the url_ch string, we use the find() function from the <algorithm> header. This function is very similar to find_if() except that instead of taking a predicate function as its third argument, it takes an actual value. The find() function will then scan the range denoted by its first two parameters searching for this value. If it finds it, it returns an iterator denoting the position of the value in the container. Otherwise, it returns the value of the second iterator (url_ch.end() in this case).

The main() function that calls the find_urls() function is relatively straighforward. It reads lines from standard input, passes each one to find_urls(). The main() function then displays the contents of the vector returned from this function using a const_iterator.

#include <iostream>
#include <string>
#include <vector>

#include "urls.h"

using std::cout;
using std::cin;
using std::endl;
using std::find_if;
using std::getline;
using std::string;
using std::vector;

int main() {
        string s;
        while (getline(cin, s)) {
                vector<string> v = find_urls(s);
                for (vector<string>::const_iterator i = v.begin();
                        i != v.end(); ++i)
                        cout << *i << endl;
        }
        return 0;
}
urls_main.cpp

If you compile and run this program on the Computer Science home page http://www.cs.mun.ca/, we get the following:

$ g++ -Wall -ansi -pedantic urls.cpp urls_main.cpp -o urls
$ wget http://www.cs.mun.ca/
13:52:43 (4.93 MB/s) - `index.html' saved [5172]
$ ./urls < index.html
http://www.cs.mun.ca
http://www.cs.mun.ca/webmail
http://www.cs.mun.ca/academics/helpctr.php
http://www.cs.mun.ca/~csclub
http://www.mun.ca
http://silveryear.cs.mun.ca
$

Last modified: April 17, 2004 12:38:48 NDT (Saturday)