The C++ library for WebMapReduce attempts to efficiently emulate map-reduce behavior as provided by Hadoop. It is composed of two namespaces, and makes heavy use of std::string. Much of the library is not intended to be exposed to users, and is reserved for background use in ensuring correct behavior. The key parts of the library that users should be familiar with are the public elements of wmr_common.h and datastream.h.
This library is designed so that users need to provide only two pieces of code: the mapper class and the reducer class. These classes must fit the following template:
class mapper
{
public:
    void map(std::string key, std::string value);
};

class reducer
{
public:
    void reduce(std::string key, wmr::datastream values);
};
The classes may also have constructors and destructors, but the constructor cannot take arguments. Once these classes are provided, the library instantiates them and calls the methods map() and reduce(). Note that library classes are only visible if the wmr namespace is brought into scope (for example with a using declaration) or if library types are fully qualified.
The wmr namespace is intended to be exposed to users, while the internal namespace is meant to be used behind the scenes. Technically, users may access the internal namespace, but doing so may cause unexpected results.
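The instantiate-and-call behavior described above can be pictured with a small driver. This is only a sketch, not the library's actual code: run_map_sketch and recording_mapper are hypothetical names, and splitting each input line on the first tab into key and value is an assumption made for illustration. It does show why the constructor cannot take arguments: the driver has nothing to pass when it default-constructs the class.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical driver (not part of WMR): default-constructs the user's
// mapper and calls map() once per input line.
template <typename Mapper>
Mapper run_map_sketch(std::istream &in)
{
    Mapper m;  // no arguments are available here, hence the restriction
    std::string line;
    while (std::getline(in, line))
    {
        // Assumption: text before the first tab is the key, the rest is
        // the value; a line without a tab becomes a key with an empty value.
        std::string::size_type tab = line.find('\t');
        std::string key = line.substr(0, tab);
        std::string value =
            (tab == std::string::npos) ? std::string() : line.substr(tab + 1);
        m.map(key, value);
    }
    return m;
}

// Hypothetical mapper that fits the required template and simply records
// the pairs it is given.
class recording_mapper
{
public:
    void map(std::string key, std::string value)
    {
        seen.push_back(key + "|" + value);
    }
    std::vector<std::string> seen;
};
```

Feeding the driver a std::istringstream holding "a\t1\nb\n" results in two calls to map(), the second with an empty value.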
File: datastream.h
class datastream
{
public:
    datastream(void);
    datastream(std::string);
    datastream(const datastream &);
    ~datastream(void);

    bool eof(void) const;
    const std::string & associated_key(void) const;

    template <typename T>
    datastream & operator>>(T &);
    template <typename T>
    datastream & get(T &);

    datastream & operator>>(std::stringstream &);
    datastream & get(std::stringstream &);

private:
    // ...
};
The datastream class is a one-way stream used to retrieve values in the reducer. Rather than load the entire list of values into memory at once, the datastream class works in conjunction with wmr::internal::key_value_stream to defer loading until a value is needed. Since values are fed through standard input to the program running as the reducer, there is no way to know in advance how many values a particular key has. Instead, each datastream is associated with a key, and the end of the list is detected when a key-value pair with a different key arrives. Because the shuffle stage provided by Hadoop guarantees that key-value pairs arrive at the reducer ordered by key, all values for a key are contiguous, which makes this check reliable.
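The deferred-loading idea can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: record_source and group_reader are hypothetical stand-ins for wmr::internal::key_value_stream and wmr::datastream, and the input is assumed to be tab-separated lines already sorted by key.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical stand-in for the internal key-value stream: it buffers
// exactly one "key<TAB>value" record at a time.
class record_source
{
public:
    explicit record_source(std::istream &in) : in_(in), has_(false) { load(); }
    bool has_record() const { return has_; }
    const std::string &key() const { return key_; }
    const std::string &value() const { return value_; }
    // Read the next line, if any, and split it on the first tab.
    void load()
    {
        std::string line;
        if (!std::getline(in_, line)) { has_ = false; return; }
        std::string::size_type tab = line.find('\t');
        key_ = line.substr(0, tab);
        value_ = (tab == std::string::npos) ? std::string() : line.substr(tab + 1);
        has_ = true;
    }
private:
    std::istream &in_;
    bool has_;
    std::string key_, value_;
};

// Hypothetical stand-in for datastream: bound to one key, it reports eof
// as soon as the buffered record belongs to a different key.
class group_reader
{
public:
    explicit group_reader(record_source &src) : src_(src), key_(src.key()) {}
    const std::string &associated_key() const { return key_; }
    bool eof() const { return !src_.has_record() || src_.key() != key_; }
    // Return the current value, then advance the underlying source.
    std::string next()
    {
        std::string v = src_.value();
        src_.load();
        return v;
    }
private:
    record_source &src_;
    std::string key_;
};

// Walk sorted input one key group at a time, never holding more than one
// record in memory.
std::string group_values(const std::string &sorted_input)
{
    std::istringstream in(sorted_input);
    record_source src(in);
    std::string out;
    while (src.has_record())
    {
        group_reader values(src);
        out += values.associated_key() + ":";
        while (!values.eof()) out += " " + values.next();
        out += "\n";
    }
    return out;
}
```

The sketch only works because the input is sorted: if records for the same key were not contiguous, the reader would report the end of a group too early.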
Note
datastream objects are supplied to the reducer by the library and should not be created by hand.
The templated operator>>(T &) calls datastream::get<T>(T &).
operator>>(std::stringstream &) calls datastream::get(std::stringstream &).
get(T &) sets the first argument to the current value, then advances the stream (fetches the next value in the stream).
get(std::stringstream &) adds the current value to the first argument, then advances the stream (fetches the next value in the stream).
File: wmr_common.h
class utility
{
public:
    template <typename T>
    static std::string toString(const T &);
    template <typename T>
    static T fromString(const std::string &);

    // Alternatives to fromString<T>() without template arguments
    static int stringToInt(const std::string &);
    static long stringToLong(const std::string &);
    static float stringToFloat(const std::string &);
    static double stringToDouble(const std::string &);
    static bool stringToBool(const std::string &);

    static std::vector<std::string> split(const std::string &, char);
    static std::vector<std::string> split_multi(const std::string &, const std::string &);
};

template <typename T, typename K>
static void emit(const T &, const K &);
The utility class provides a set of helper methods for users of this library. It focuses on string manipulation and conversion, since data is passed to and from the mapper and reducer as strings.
toString() converts data of type T to a std::string by using a std::stringstream object. Returns the string that is created.
fromString() converts the given std::string to type T by using a std::stringstream object. Returns the data that is created from the string.
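A stringstream-based conversion of this kind can be sketched in a few lines. The names to_string_sketch and from_string_sketch are hypothetical and stand in for the library methods; how the real methods handle unparseable input is not documented here, so the zero-initialized fallback below is an assumption.

```cpp
#include <sstream>
#include <string>

// Hypothetical counterpart of utility::toString: write the value into a
// string stream and return the accumulated text.
template <typename T>
std::string to_string_sketch(const T &data)
{
    std::ostringstream out;
    out << data;
    return out.str();
}

// Hypothetical counterpart of utility::fromString: read a value of type T
// back out of the text. Assumption: unparseable input yields T().
template <typename T>
T from_string_sketch(const std::string &text)
{
    std::istringstream in(text);
    T data = T();
    in >> data;
    return data;
}
```

Because T appears only in the return type of from_string_sketch, the compiler cannot deduce it, so the caller has to write it out, as in from_string_sketch<long>("17").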
split() splits the given string (argument 1) on every instance of the given delimiter (argument 2). Only a single delimiter may be passed to this method. Returns a std::vector of std::string objects resulting from the split.
split_multi() splits the given string (argument 1) on every instance of any of the delimiters passed in argument 2. If two delimiters occur side by side, they are skipped (no empty string is produced). Returns a std::vector of std::string objects resulting from the split.
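The multi-delimiter behavior, including the skipping of adjacent delimiters, can be sketched with std::string's find_first_of and find_first_not_of. The function split_multi_sketch is a hypothetical illustration and not the library's own code.

```cpp
#include <string>
#include <vector>

// Hypothetical counterpart of utility::split_multi: break text on any
// character in delims, producing no empty pieces for adjacent delimiters.
std::vector<std::string> split_multi_sketch(const std::string &text,
                                            const std::string &delims)
{
    std::vector<std::string> pieces;
    // Skip any leading delimiters to find the start of the first piece.
    std::string::size_type start = text.find_first_not_of(delims);
    while (start != std::string::npos)
    {
        // The piece runs up to the next delimiter (or the end of the text).
        std::string::size_type end = text.find_first_of(delims, start);
        pieces.push_back(text.substr(start, end - start));
        // Skip the whole run of delimiters before the next piece.
        start = text.find_first_not_of(delims, end);
    }
    return pieces;
}
```

For example, splitting "a, b;;c" on the delimiters ",; " gives the three pieces "a", "b" and "c", with nothing produced for the adjacent semicolons.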
emit() prints the first argument, followed by the delimiter specified by wmr_common (almost always a tab), followed by the second argument, to standard output. emit() can only be called with types that support the << operator with a std::ostream object (such as std::cout).
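The output format can be sketched with a function that writes to any std::ostream. The name emit_to is hypothetical, the delimiter parameter simply defaults to a tab, and the trailing newline is an assumption, since the description above does not say how records are terminated.

```cpp
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical counterpart of wmr::emit that takes the output stream as a
// parameter so the result can be inspected. Both types must support
// operator<< on std::ostream.
template <typename K, typename V>
void emit_to(std::ostream &out, const K &key, const V &value,
             char delimiter = '\t')
{
    // Assumption: one record per line, terminated by a newline.
    out << key << delimiter << value << '\n';
}
```

Calling emit_to(std::cout, word, 1) would mirror the mapper in the word count example, while passing a std::ostringstream makes the output easy to check in a test.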
Convenience methods to avoid templates:
Since the generic utility::fromString method returns data of type T, the template parameter must be given explicitly when calling it. Because WMR is aimed largely at beginning CS students, non-templated methods are provided in case supplying template arguments proves too confusing.
Each of these methods (stringToInt, stringToLong, stringToFloat, stringToDouble and stringToBool) calls utility::fromString with the appropriate template parameter.
This section presents the code to run a word count job using WebMapReduce and the C++ library.
#include <string>
#include <vector>

class mapper
{
public:
    void map(std::string key, std::string value)
    {
        using std::string;
        using std::vector;

        // split the line into a vector of words
        vector<string> words = wmr::utility::split(key, ' ');

        // loop through the vector, and emit each word with a 1
        for (size_t i = 0; i < words.size(); ++i)
        {
            wmr::emit(words.at(i), 1);
        }
    }
};
#include <string>

class reducer
{
public:
    void reduce(std::string key, wmr::datastream stm)
    {
        using std::string;

        long sum = 0;
        string value;

        // grab data from the stream until there are no more values
        while (!stm.eof())
        {
            stm >> value;
            // convert the value to a number, then add it to the
            // running total
            sum += wmr::utility::fromString<long>(value);
        }

        // emit the key with its total count
        wmr::emit(key, sum);
    }
};