File: datastream.h
class datastream
{
private:
    std::string m_key;
    std::string m_value;
    bool m_endOfKey;
public:
    datastream(void);
    datastream(std::string);
    datastream(const datastream &);
    ~datastream(void);
    bool eof(void) const { return m_endOfKey; }
    const std::string & associated_key(void) const { return m_key; }
    template <typename T>
    datastream & operator>>(T &);
    template <typename T>
    datastream & get(T &);
    datastream & operator>>(std::stringstream &);
    datastream & get(std::stringstream &);
private:
    std::string key(void) const { return m_key; }
    std::string value(void) const { return m_value; }
    void advance(void);
};
The datastream class is a one-way stream used to retrieve values in the reducer. Rather than loading the entire list of values into memory at once, datastream works with wmr::internal::key_value_stream to defer loading each value until it is needed. Because values are fed to the reducer program through standard input, there is no way to know in advance how many values a particular key has. Associating a datastream with a key solves this: the end of the list is detected when a key-value pair with a different key is read. The shuffle stage of Hadoop's map-reduce guarantees that key-value pairs reach the reducer sorted by key, so all pairs with the same key arrive consecutively, which is what makes this check reliable.
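The consumption pattern described above can be sketched as follows. This is an illustration only: mock_stream and drain are hypothetical names, and mock_stream is a stand-in backed by a vector rather than standard input. Only eof() and operator>> are taken from the interface shown in datastream.h, and the loop assumes that eof() becomes true once the last value for the key has been handed out.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for datastream, backed by a vector instead of
// standard input. Only eof() and operator>> mirror the documented interface.
class mock_stream
{
private:
    std::vector<std::string> m_values;
    std::size_t m_next;
public:
    explicit mock_stream(std::vector<std::string> values)
        : m_values(std::move(values)), m_next(0) {}

    bool eof() const { return m_next >= m_values.size(); }

    // Hands out the current value, then moves on to the next one.
    // Past the end, it yields an empty string.
    mock_stream & operator>>(std::string & out)
    {
        if (eof())
            out.clear();
        else
            out = m_values[m_next++];
        return *this;
    }
};

// Collect every value for one key without knowing the count in advance.
// Written as a template so any type with eof() and operator>> can be used.
template <typename Stream>
std::vector<std::string> drain(Stream & values)
{
    std::vector<std::string> result;
    std::string value;
    while (!values.eof())
    {
        values >> value;
        result.push_back(value);
    }
    return result;
}
```

A real reducer would normally process each value as it is read rather than collect them all, since avoiding that in-memory list is the purpose of the class; drain collects them here only so the result is easy to inspect.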
Note: datastream objects are provided by the reducer and should not be created by hand.
datastream(void)
The default constructor (which should never be used) sets the end-of-key member, m_endOfKey, so the stream cannot advance. Any call to operator>> or get therefore yields an empty string.
datastream(std::string)
This constructor associates the datastream with a key and allows it to advance. When a datastream object has to be constructed, this is the constructor to use.
datastream(const datastream &)
The copy constructor creates an exact copy of an existing datastream object. Use it to make copies of the objects passed to the reducer.
~datastream(void)
The destructor is empty, as the datastream class has no resources to manage.
datastream & operator>> <T>(T &)
Calls: datastream & get<T>(T &)
datastream & operator>>(std::stringstream &)
Calls: datastream & get(std::stringstream &)
datastream & get<T>(T &)
Sets the argument to the current value, then advances the stream to the next value.
datastream & get(std::stringstream &)
Appends the current value to the argument, then advances the stream to the next value.
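Since m_value is stored as a std::string, the templated get has to turn that string into a T somehow. The documentation does not say how, so the following is only one plausible sketch: convert_value and append_value are hypothetical helpers, and conversion through std::istringstream is an assumption, not the confirmed implementation.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: convert the stored string value into a T by
// extracting it from a temporary string stream. Returns false if the
// text cannot be parsed as a T.
template <typename T>
bool convert_value(const std::string & value, T & out)
{
    std::istringstream in(value);
    return static_cast<bool>(in >> out);
}

// Overload for std::string: copy the value whole, so that values
// containing spaces are not cut off at the first whitespace.
inline bool convert_value(const std::string & value, std::string & out)
{
    out = value;
    return true;
}

// Hypothetical helper for the std::stringstream overload: the value is
// added to whatever the stream already holds instead of replacing it.
inline void append_value(const std::string & value, std::stringstream & out)
{
    out << value;
}
```

The separate std::stringstream overload is useful when several values are to be accumulated into one buffer, or when the caller wants to parse a value with its own extraction logic.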
void advance(void)
The advance method is the workhorse of the datastream class. If the m_endOfKey member is set, it returns immediately, which prevents the stream from running past the current key. Otherwise, it reads the next line from standard input and splits it into a key and a value (see wmr::internal::key_value_stream), then stores the new line's value as the current value. If the current key is exhausted (that is, the new line carries a different key), m_endOfKey is set so that the stream cannot advance any further.
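The logic of advance can be sketched with a small stand-in class. Several details here are assumptions rather than documented behavior: key_bounded_stream is a hypothetical name, it reads from any std::istream instead of standard input, lines are assumed to have the form key<TAB>value, the first value is assumed to have been read by the caller, and reaching the end of the input is treated as the end of the key.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical stand-in for datastream that reads from any std::istream
// instead of standard input. Lines are assumed to be "key<TAB>value".
class key_bounded_stream
{
private:
    std::istream & m_in;
    std::string m_key;
    std::string m_value;
    bool m_endOfKey;

    void advance()
    {
        if (m_endOfKey)
            return;                       // never run past the current key

        std::string line;
        if (!std::getline(m_in, line))    // assumption: end of input ends the key
        {
            m_endOfKey = true;
            return;
        }

        // Split the line at the first tab into a key and a value.
        const std::string::size_type tab = line.find('\t');
        const std::string key = line.substr(0, tab);
        m_value = (tab == std::string::npos) ? std::string()
                                             : line.substr(tab + 1);

        // A different key means the current key has no more values.
        if (key != m_key)
            m_endOfKey = true;
    }

public:
    // Assumption: the caller has already read the first pair in order to
    // learn the key, and passes that pair's value in as first_value.
    key_bounded_stream(std::istream & in, std::string key, std::string first_value)
        : m_in(in), m_key(key), m_value(first_value), m_endOfKey(false) {}

    bool eof() const { return m_endOfKey; }

    // Hand out the current value, then look ahead to the next line.
    key_bounded_stream & get(std::string & out)
    {
        out = m_value;
        advance();
        return *this;
    }
};
```

For example, given a std::istringstream holding the lines "a<TAB>2", "a<TAB>3" and "b<TAB>9", a key_bounded_stream constructed with key "a" and first value "1" yields 1, 2 and 3, and eof() becomes true after the third call to get. Note that this sketch simply drops the pair that ended the key; a real implementation has to keep that pair available for the next key, which is presumably part of what wmr::internal::key_value_stream is for.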