I am trying to read in files for text processing and run them through a Hadoop pseudo-distributed setup on my virtual machine, using map-reduce code I am writing. The environment is Ubuntu Linux, and the installation runs Python 2.6. I need to use sys.stdin to read the files in, and sys.stdout to pass data from the mapper to the reducer. Here is my test code for the mapper:
#!/usr/bin/env python
import sys
import string
import glob
import os

files = glob.glob(sys.stdin)
for file in files:
    with open(file) as infile:
        txt = infile.read()
        txt = txt.split()
        print(txt)
I'm not sure how glob is supposed to work with sys.stdin, but this is not working. Testing it with a pipe:
[training@localhost data]$ cat test | ./mapper.py
I get this:
cat: test: Is a directory
Traceback (most recent call last):
File "./mapper.py", line 8, in <module>
files = glob.glob(sys.stdin)
File "/usr/lib64/python2.6/glob.py", line 16, in glob
return list(iglob(pathname))
File "/usr/lib64/python2.6/glob.py", line 24, in iglob
if not has_magic(pathname):
File "/usr/lib64/python2.6/glob.py", line 78, in has_magic
return magic_check.search(s) is not None
TypeError: expected string or buffer
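If I read the traceback right, glob.glob() wants a string pattern like '*.txt', and sys.stdin is a file object, which is why it raises TypeError. A minimal sketch of what I think would at least avoid that error, assuming the pattern itself is piped in as text (e.g. echo '*.txt' | ./mapper.py); the function name expand_pattern is just mine:

```python
import glob
import sys

def expand_pattern(stream):
    """Read a glob pattern (a string) from a text stream and expand it.

    glob.glob() expects a string such as '*.txt', not a file object,
    which is why glob.glob(sys.stdin) raises TypeError.
    """
    pattern = stream.read().strip()
    return glob.glob(pattern)

if __name__ == "__main__":
    # e.g.  echo '*.txt' | ./mapper.py
    for name in expand_pattern(sys.stdin):
        print(name)
```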
For the moment I am just trying to read in three small .txt files in one directory.
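From what I understand of Hadoop Streaming, the mapper receives the contents of the input files on stdin, one line at a time; the filenames never appear, so glob may not be needed at all. A minimal mapper sketch along those lines (the tab-separated key/value output is the streaming convention; the word-count style (word, 1) pairs are my guess at the intent):

```python
#!/usr/bin/env python
import sys

def map_lines(lines):
    """Yield a (word, 1) pair for every whitespace-separated token."""
    for line in lines:
        for word in line.split():
            yield word, 1

if __name__ == "__main__":
    # Hadoop Streaming feeds file contents on stdin and expects
    # tab-separated key/value pairs on stdout for the reducer.
    for word, count in map_lines(sys.stdin):
        print("%s\t%d" % (word, count))
```

This can be tested locally the same way, with cat data/*.txt | ./mapper.py.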
Thanks!
#python #bash #hadoop