February 11 (Wednesday) February 16 (Monday)
We've seen two types of containers, list
and vector
which can be used to hold data of any type. As mentioned, each container
has its advantages and disadvantages and which one you choose depends
largely upon how you intend to access and modify elements in the container.
Some of the advantages and disadvantages of list
s and
vector
s is given below.
Advantages of list
containers
list
container
is quick relative to doing similar operation on a vector
container.
list
invalidates only the erased
item, not iterators that refer to subsequent items.
list
cannot potentially
invalidate all the iterators that refer to other elements in the
list
.
Disadvantages of list
containers
list
.
(e.g. No indexing operator ([ ]
) is defined for
a list
.)
sort()
function cannot be used. (However, the list
class provides
its own sort()
member function to sort its items.)
Advantages of vector
containers
vector
.
(e.g. An indexing operator ([ ]
) is defined for
a vector
.)
sort()
function can be used to sort
vector items. There's no need for vector
to define its
own sort()
function.
Disadvantages of vector
containers
vector
container
is slow relative to doing similar operation on a list
container.
vector
invalidates the iterator
of the erased item as well as iterators that refer to subsequent items.
vector
can potentially
invalidate all the iterators that refer to other elements in
the vector
. This is because all elements in the
vector
may have to be copied and moved in memory to ensure
there is (contiguous) space for the new elements.
In summary, if you need to be able to access the elements in the container
randomly and you are only adding/deleting elements from the end of the
container, then a vector
is your best option. If you need
to be able to efficiently insert/remove elements from anywhere in the
container and do not want iterators to become invalidated during insert
insert/remove operations, then a list
container is your
best option.
Using the string
class as if it were a vector
of char
elements, we can do some fairly sophisticated
string manipulation in C++. This section demonstrates one of the
more popular string processing algorithms: tokenizing a string
by breaking it up into its constituent words.
One operation which is used frequently in the parsing text files is
to break up the file into a collection of words. For the purposes of
this example, we'll consider a word to be be any collection of
characters separated by whitespace characters. Whitespace characters
include characters like the space, tab, newline, etc. What we want
is a function called split()
that takes a string
object and returns all the words in that string. In order to return all
the words, we'll return a value of type vector<string>
.
We can declare such a function in a header file split.h
:
#ifndef GUARD_split_h
#define GUARD_split_h
#include <vector>
#include <string>
std::vector<std::string> split(const std::string&);
#endif
split.h
Note the use of the guard macro to prevent multiple inclusion and the
use of the std::
namespace specifier on all the standard
library entities.
We can then define the split()
function in our
split.cpp
file. The algorithm behind this function is to
use two index variables i
and j
to delimit
the start and end of words within the string
.
#include <cctype>
#include <string>
#include <vector>
#include "split.h"
using std::vector;
using std::string;
vector<string> split(const string& s)
{
vector<string> ret;
typedef string::size_type string_size;
string_size i = 0;
// invariant: we have processed characters [original value of i, i)
while (i != s.size()) {
// ignore leading blanks
// invariant: characters in range
// [original i, current i) are all spaces
while (i != s.size() && isspace(s[i]))
++i;
// find end of next word
string_size j = i;
// invariant: none of the characters in range
// [original j, current j) is a space
while (j != s.size() && !isspace(s[j]))
++j;
// if we found some nonwhitespace characters
if (i != j) {
// copy from s starting at i and taking j - i chars
ret.push_back(s.substr(i, j - i));
i = j;
}
}
return ret;
}
split.cpp
The code will increment i
until it indexes the first nonspace
character in the string, then, using i
as a starting point,
it will increment j
until it indexes a space character.
Once the indices i
and j
have been set
appropriately, we copy the word from the string and store it in our
vector of words as follows:
ret.push_back(s.substr(i, j - i));
The substr()
member function of the string
class takes an index and a length and makes a copy of the characters
in the range specified by the index and length. In the above example,
the word starts at index i
and its length can be determined
by j - i
. (Note that j
indexes the first
whitespace character after the word indexed by i
. Therefore,
j - i
is the length of the word.) After the word has been
copied from the original string, it is then stored onto the end of the
ret
vector. When the entire string has been processed,
the ret
vector, containing all the words in the string,
is returned by the split()
function.
Note that the code above uses the isspace()
function which
returns true
if its argument is a whitespace character.
This is very similar to the same function in C. To get the declaration
for this function, you must include the <cctype>
header.
We'll see a simpler implementation of the split()
function in
Chapter 6 which uses iterators and functions available in the
<algorithm>
header.
The main()
function below will read lines from standard input
into a string variable and then pass the string to the split()
function for processing. It will then display all the words in the
string.
#include <iostream>
#include <string>
#include <vector>
#include "split.h"
using std::cin;
using std::cout;
using std::endl;
using std::getline;
using std::string;
using std::vector;
int main()
{
string s;
// read and split each line of input
while (getline(cin, s)) {
vector<string> v = split(s);
// write each word in `v'
for (vector<string>::const_iterator it = v.begin();
it != v.end(); ++it)
cout << *it << endl;
}
return 0;
}
split_main.cpp
In order to input entire lines from standard input, the
main()
function uses the global getline()
function. This function is roughly analogous to the fgets()
function in C. getline()
takes two parameters:
an istream
, and a string
. It returns a
reference to the istream
passed in after the input operation
takes place. The getline()
function itself reads the next
line of input (including the newline character) from the
input stream and stores it in the supplied string (without the newline
character). Upon end of input, the istream
object returned
by getline()
will return false
when used in a
boolean context. As a result, you can use getline()
as the
condition of a while
loop to determine when to stop reading.
Although we will not be covering K&M § 5.8 in class, you should still read this section as it provides further examples of string manipulation using so-called character pictures. There isn't too much in the way of new ideas presented in this section -- it's basically applying ideas that have been introduced in earlier sections.
The <algorithm>
header, which we've seen
before when sorting vectors, contains many useful functions for
performing high-level operations on standard types such as
vector
s and string
s. Effectively
using these functions can reduce development time and make
our C++ programs more robust. This chapter will introduce
several algorithm functions which can make C++ programming a less
tedious chore.
For example, consider the following (very contrived) program which
creates three vectors containing letters of the alphabet and then uses
the insert()
method of the vector
class and
the generic copy()
function in order to build up a fourth
vector containing all the letters of the alphabet in the correct order.
#include <algorithm>
#include <vector>
#include <iostream>
using std::endl;
using std::vector;
using std::cout;
int main()
{
vector<char> v1, v2, v3, v4;
for (char c = 'a'; c < 'i'; c++)
v1.push_back(c); // abcdefgh
for (char c = 'i'; c < 'r'; c++)
v2.push_back(c); // ijklmnopq
for (char c = 'r'; c <= 'z'; c++)
v3.push_back(c); // rstuvwxyz
// Make a vector containing |v2| blanks.
vector<char> blanks(v2.size(), ' ');
v4 = v1;
v4.insert(v4.end(), blanks.begin(), blanks.end());
copy(v3.begin(), v3.end(), back_inserter(v4));
copy(v2.begin(), v2.end(), v4.begin() + v1.size());
for (vector<char>::const_iterator it = v4.begin();
it != v4.end(); it++) {
cout << *it;
}
cout << endl;
}
inscopy.cpp
After creating the vectors v1
, v2
,
v3
and blanks
, we assign v1
to v4
, essentially making v4
a copy
of v1
.
Then we use the insert()
method of the vector class to
append copies of all the elements of the blanks
vector onto
the end of the the v4
vector. The insert()
method will copy elements denoted by the iterators given as the second
and third arguments and insert them into the container immediately before
the iterator denoted by the first iterator argument.
Next, we use the global copy()
function to append elements
onto the end of the v4
vector. The copy()
function will copy all those elements within the range denoted by the
the first two iterator arguments and copy them to the location
denoted by the third argument. In the above example, we must use
the iterator adapter back_inserter()
in order to tell
the copy()
function that it should append the copied
elements onto the back of the v4
vector.
If we had just said:
copy(v3.begin(), v3.end(), v4.end());
Then copy will have attempted to write its first copied element
into the location denoted by v4.end()
but because
v4.end()
denotes the location one passed the end of the
vector (and not an actual element within the vector itself), the program
would likely crash. This is a subtle, but important difference between
the insert()
method and the copy()
function.
Note that instead of using copy()
, we could have used the
insert()
method as we did in the previous statement to
achieve the same result.
Finally, we copy all the elements of the v2
vector
into the appropriate position of the v4
vector.
copy(v2.begin(), v2.end(), v4.begin() + v1.size());
This will cause all the blanks in the v4
vector to be
overwritten with the elements of the v2
vector. Note that
because all the elements that are being written to in the v4
vector actually exist, there is no need to use an iterator adapter like
back_inserter()
in this case.
Note that even though we are inserting and copying a lot of elements
in this code, we are not using any for
or
while
loops to do so. The only loop used is for displaying
the elements of the v4
vector. By relying on iterators
and generic functions in the <algorithm>
header,
we can let the standard library do all the low-level ``heavy lifting''
for us while we concentrate on the higher level aspects of the program.
Just as long as we use the correct function with the correct iterator
arguments, the standard library will do the rest for us.
The program below will read words from standard input and display a
message indicated whether or not the word is a palindrome. The core
of the functionality is in the is_palindrome()
function
which calls a single <algorithm>
function, namely
equal()
. equal()
tests the equality of all
the elements in the range specified by the first two iterator arguments
with the elements denoted by the third iterator argument. This function
assumes that the two sequences indicated by the first two iterators
and the last iterator are of equal size.
#include <algorithm>
#include <cctype>
#include <iostream>
#include <string>
using std::cin; using std::cout;
using std::endl; using std::equal;
using std::string; using std::transform;
bool is_palindrome(const string& s)
{
return equal(s.begin(), s.end(), s.rbegin());
}
int main()
{
string s;
while (cin >> s) {
if (is_palindrome(s))
cout << s << " is a palindrome" << endl;
else
cout << s << " is not a palindrome" << endl;
}
return 0;
}
palin.cpp
The third iterator to the equal()
function is
s.rbegin()
which denotes the last element in the
(string
) container. When this iterator is ``incremented,''
it denotes the previous element in the container. Hence,
when the equal()
function executes, it will test the first
and the last character it the string for equality, then test the second
character and the second last character and so on. If all the characters
are equal, the function will return true
Earlier, we used indices in order to split a string into its constitute
words. Using iterators and the find_if()
function
from the <algorithm>
header, we can make the
split()
function simpler, as demonstrated below:
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>
#include "split.h"
using std::find_if;
using std::string;
using std::vector;
// `true' if the argument is whitespace, `false' otherwise
bool space(char c)
{
return isspace(c);
}
// `false' if the argument is whitespace, `true' otherwise
bool not_space(char c)
{
return !isspace(c);
}
vector<string> split(const string& str)
{
typedef string::const_iterator iter;
vector<string> ret;
iter i = str.begin();
while (i != str.end()) {
// ignore leading blanks
i = find_if(i, str.end(), not_space);
// find end of next word
iter j = find_if(i, str.end(), space);
// copy the characters in [i, j)
if (i != str.end())
ret.push_back(string(i, j));
i = j;
}
return ret;
}
split-alg.cpp
As mentioned, this split()
function uses iterators
(i
and j
) instead of indices to copy
the words from the supplied string. We set up the initial
iterator i
to denote the beginning of the string
then find the first non-space character in the string. To do
this it uses the find_if()
function. As with many
of the <algorithm>
functions, this function
takes a pair of iterators denoting a range within a container.
The function also takes a predicate function not_space
.
find_if()
will then search all the elements denoted
by the iterator range looking for the first element that satisfies
the predicate. If it finds such an element, it will return an
iterator that denotes that element. Otherwise it will return
its second iterator argument.
After the first find_if()
function executes, the iterator
i
will denote either the first character of the first word
or, if there are no words in the string, i
will be set to
str.end()
(i.e. the position one passed the end
of the string). Then, we'll attempt to set the iterator j
to denote the character immediately after the word. Again, we use the
find_if()
function but this time with the opposite predicate
that we used before. Once we have i
and j
,
we can copy the word from the string and save it to our word vector
by writing:
ret.push_back(string(i, j));
Note that we do not use the substr()
method of the
string
class to copy the word from the string. Instead,
we can supply the two iterators as arguments to the string
constructor. This will build a new string which consists of all the
characters in the range [i, j)
denoted by the iterators.
Also note that we can't just use isspace()
as a predicate
because it is overloaded for internationalization purposes and we cannot
tell which isspace()
function to execute based solely on
the name, so we create our own space()
function to eliminate
the ambiguity.
The following program demonstrates how we can extract URLs from a string. A URL would be a string of the form protocol-name://resource-name For example:
http://www.cs.mun.ca/~donald/comp3710/
We first define a header file which we include in any module that
needs to extract URLs from a string. Again, note the use of the
multiple inclusion guard and the use of the std::
namespace prefix on all the standard library elements.
#ifndef GUARD_urls_h
#define GUARD_urls_h
#include <vector>
#include <string>
std::vector<std::string> find_urls(const std::string& s);
#endif
urls.h
Next, we write the source module for the find_urls()
function. This module defines several helper functions which help make
the find_urls()
function easier to write. The basic strategy
behind the find_urls()
function is to first search for the
://
separator in the string. If we find that particular character
sequence, we move backwards through the string searching for the actual
beginning of a URL (i.e. the beginning of the protocol name).
Then, we move forward from the location of the ://
separator
to search for the end of the URL. The beginning and end of each URL
will be denoted by the iterators b
and after
.
Once we have these iterators, we can simply copy the string representing
the URL and store it in a vector
of URLs in a manner similar
to how we copied words from a string earlier.
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>
#include "urls.h"
using std::find;
using std::find_if;
using std::search;
using std::string;
using std::vector;
// Function declarations
bool not_url_char(char);
string::const_iterator
url_end(string::const_iterator, string::const_iterator);
string::const_iterator
url_beg(string::const_iterator, string::const_iterator);
// Function definitions
vector<string> find_urls(const string& s)
{
vector<string> ret;
typedef string::const_iterator iter;
iter b = s.begin(), e = s.end();
// look through the entire input
while (b != e) {
// look for one or more letters followed by `://'
b = url_beg(b, e);
// if we found it
if (b != e) {
// get the rest of the URL
iter after = url_end(b, e);
// remember the URL
ret.push_back(string(b, after));
// advance b and check for more URL on this line
b = after;
}
}
return ret;
}
string::const_iterator
url_end(string::const_iterator b, string::const_iterator e)
{
return find_if(b, e, not_url_char);
}
bool not_url_char(char c)
{
// characters, in addition to alphanumerics, that can appear in a URL
static const string url_ch = "~;/?:@=&$-_.+!*'(),";
// see whether `c' can appear in a URL and return the negative
return !(isalnum(c) ||
find(url_ch.begin(), url_ch.end(), c) != url_ch.end());
}
string::const_iterator
url_beg(string::const_iterator b, string::const_iterator e)
{
static const string sep = "://";
typedef string::const_iterator iter;
// `i' marks where the separator was found
iter i = b;
while ((i = search(i, e, sep.begin(), sep.end())) != e) {
// make sure the separator isn't at the beginning or end of
// the line
if (i != b && i + sep.size() != e) {
// `beg' marks the beginning of the protocol-name
iter beg = i;
while (beg != b && isalpha(beg[-1]))
--beg;
// is there at least one appropriate character before
// and after the separator?
if (beg != i && !not_url_char(i[sep.size()]))
return beg;
}
// the separator we found wasn't part of a URL;
// advance `i' past this separator
i += sep.size();
}
return e;
}
urls.cpp
The find_urls()
function makes use of two helper functions,
url_beg()
and url_end()
which return iterators
that denote the beginning and (one passed) the end of a URL found in
the string.
The url_beg()
function will search for the separator
string (://
) within the range denoted by the two iterators
passed into the function. To do this, it uses the search()
<algorithm>
function:
while ((i = search(i, e, sep.begin(), sep.end())) != e) {
This function takes a pair of iterators, the first pair denotes the
range of the string to be searched, the second pair denotes the string
for which we are searching. If search()
finds
the string, it will return an iterator denoting the position at
which the string was found. If the string was not found, then
the second iterator, e
, will be returned.
If the separator string was found, we then sanity check the location
at which the string was found to make sure that it didn't occur at the
very beginning or at the very end of the string in which we searched.
If it didn't, then we create a new iterator beg
and move it
backwards through the string character by character until we encounter the
beginning of the line or until we encounter a non-alphabetic character:
while (beg != b && isalpha(beg[-1])) --beg;
Note the use of the beg[-1]
expression. This returns the
character immediately before the position denoted by beg
.
We could have equivalently written this as *(beg - 1)
.
Also note that we can decrement iterators as well as increment them
as demonstrated by the --beg
expression which updates
beg
to denote the character immediately before it. Also,
note that because of the first test, beg != b
, it is not
possible for beg[-1]
to denote a character outside the
range of the original string.
Before we can actually return beg
as the iterator
that denotes the beginning of the URL, we must do a couple of
additional sanity checks to ensure that this is the actual
beginning of a URL. We must make sure that we successfully
moved the beg
iterator back at least one position.
This means that there is at least one alphabetic character before
the ://
separator. We also check to make sure that
there is at least one valid url character following the separator.
These two tests are carried out by the following conditional:
if (beg != i && !not_url_char(i[sep.size()])) return beg;
Again, we use the indexing operation [ ]
on the iterator
i
(which denotes the beginning position of the separator
found in the string. By using an index value of sep.size()
,
we will look at the character immediately following the separator
string. If this character is indeed a valid URL character (and the
beg != i
test conducted before was true
), then
we finally return the beg
iterator from this function.
Note the use of the local static
variable sep
in the url_beg()
function. This has the same semantics
as in C -- the variable will be created and initialized once during
the entire program and its value will persist between calls to the
url_beg()
function. Because it is also const
,
the value of sep
can never change during the execution of
the program.
The url_end()
function is much simpler. It simply
takes the beginning of the URL (returned by url_beg()
)
and the end of the string in which we are searching for URLs and calls
the find_if()
function to find the first non-URL character
in the range denoted by the two iterators. In order to find the first
non-URL character in the string, it passes a new predicate function
called not_url_char
to the find_if()
function.
This predicate simply defines a string
which contains all
the valid URL characters.
static const string url_ch = "~;/?:@=&$-_.+!*'(),";
It then returns the boolean result of a conditional test. A character
is a valid URL character if it is alphanumeric or if the character
exists in the url_ch
string. We negate the value of this
test thereby returning true
if the character given to the
predicate function is not a valid URL character and
false
otherwise:
return !(isalnum(c) || find(url_ch.begin(), url_ch.end(), c) != url_ch.end());
To determine whether or not the character is in the url_ch
string, we use the find()
function from the
<algorithm>
header. This function is very similar to
find_if()
except that instead of taking a predicate function
as its third argument, it takes an actual value. The find()
function will then scan the range denoted by its first two parameters
searching for this value. If it finds it, it returns an iterator denoting
the position of the value in the container. Otherwise, it returns the
value of the second iterator (url_ch.end()
in this case).
The main()
function that calls the find_urls()
function is relatively straighforward. It reads lines from standard
input, passes each one to find_urls()
. The main()
function then displays the contents of the vector
returned
from this function using a const_iterator
.
#include <iostream>
#include <string>
#include <vector>
#include "urls.h"
using std::cout;
using std::cin;
using std::endl;
using std::find_if;
using std::getline;
using std::string;
using std::vector;
int main() {
string s;
while (getline(cin, s)) {
vector<string> v = find_urls(s);
for (vector<string>::const_iterator i = v.begin();
i != v.end(); ++i)
cout << *i << endl;
}
return 0;
}
urls_main.cpp
If you compile and run this program on the Computer Science home page
http://www.cs.mun.ca/
, we get the following:
$ g++ -Wall -ansi -pedantic urls.cpp urls_main.cpp -o urls $ wget http://www.cs.mun.ca/ 13:52:43 (4.93 MB/s) - `index.html' saved [5172] $ ./urls < index.html http://www.cs.mun.ca http://www.cs.mun.ca/webmail http://www.cs.mun.ca/academics/helpctr.php http://www.cs.mun.ca/~csclub http://www.mun.ca http://silveryear.cs.mun.ca $
Last modified: April 17, 2004 12:38:48 NDT (Saturday)