Hadoop tutorial for Windows and Eclipse.
I just posted a tutorial on how to configure a Hadoop environment on Windows using Cygwin. The tutorial explains how to set up a Hadoop cluster in pseudo-distributed mode and how to get it working with Eclipse.
If you have any questions, comments, or suggestions about this tutorial, post them here.
Thanks for your excellent tutorial! I followed it this weekend and was able to get mostly up and running.
One question I had was how to use it with EC2. I set up on EC2 rather than on localhost, and I’m wondering what I need to do to make it run. I’m getting weird unknown-host errors when I run, despite having set up a proxy server.
Thanks for the very helpful tutorial!
Ben
No problem.
Setting up Hadoop correctly on EC2 can be tricky. I am going to post another tutorial about it in a few weeks.
Hey, this page on your tutorial (Unpacking Hadoop)
http://v-lad.org/Tutorials/Hadoop/09%20-%20unpack%20hadoop.html
is not working.
Strange. Works for me, can’t see what the problem is. Does anybody else have this problem?
Thanks for the tutorial… it would have saved me a few hours of frustration.
Have you tried it with other versions of Eclipse? The main distribution is 3.4 (Ganymede), with 3.5 due shortly, in May.
Jeff,
I tried other versions of Eclipse. It doesn’t work with 3.4 and probably won’t work with 3.5 until somebody fixes the Hadoop plug-in, because the plug-in API changed in the newer versions of Eclipse. You can use the plug-in with 3.4 to browse HDFS, but you won’t be able to start the project.
Vlad,
Thanks for the well-documented tutorial. It is good work.
Towards the last step I got the following error:
09/04/15 15:00:33 INFO mapred.JobClient: Task Id : attempt_200904151224_0004_m_000000_2, Status : FAILED
java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.Text, recieved org.apache.hadoop.io.LongWritable
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:558)
at org.apache.hadoop.mapred.lib.IdentityMapper.map(IdentityMapper.java:37)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
at org.apache.hadoop.mapred.Child.main(Child.java:158)
Kindly advise; any clue would help.
My code is as follows:
// TODO: specify input and output DIRECTORIES (not files)
//conf.setInputPath(new Path("src"));
//conf.setOutputPath(new Path("out"));
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path("In"));
FileOutputFormat.setOutputPath(conf, new Path("Out3"));
thanks and regards
Joseph
The error you are getting is actually correct. The Mappers / Reducers generated by the plug-in need some tweaking. I will post another tutorial about this sometime in May.
Hi Vlad,
Thanks for the excellent tutorial. In the last step, when I try to run the TestDriver class, I get this error.
Please help.
>>>>>>>>> START >>>>>>>
09/04/17 11:58:39 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
09/04/17 11:58:40 INFO hdfs.DFSClient: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /tmp/hadoop-ashwath_kannan/mapred/system/job_200904171117_0002/job.jar could only be replicated to 0 nodes, instead of 1
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1280)
at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:351)
at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
at org.apache.hadoop.ipc.Client.call(Client.java:697)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
at $Proxy0.addBlock(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at $Proxy0.addBlock(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2814)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2696)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1996)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)
09/04/17 11:58:40 WARN hdfs.DFSClient: NotReplicatedYetException sleeping /tmp/hadoop-ashwath_kannan/mapred/system/job_200904171117_0002/job.jar retries left 4
09/04/17 11:58:40 INFO hdfs.DFSClient: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /tmp/hadoop-ashwath_kannan/mapred/system/job_200904171117_0002/job.jar could only be replicated to 0 nodes, instead of 1
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1280)
at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:351)
at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
at org.apache.hadoop.ipc.Client.call(Client.java:697)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
at $Proxy0.addBlock(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at $Proxy0.addBlock(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2814)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2696)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1996)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)
09/04/17 11:58:40 WARN hdfs.DFSClient: NotReplicatedYetException sleeping /tmp/hadoop-ashwath_kannan/mapred/system/job_200904171117_0002/job.jar retries left 3
09/04/17 11:58:41 INFO hdfs.DFSClient: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /tmp/hadoop-ashwath_kannan/mapred/system/job_200904171117_0002/job.jar could only be replicated to 0 nodes, instead of 1
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1280)
at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:351)
at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
at org.apache.hadoop.ipc.Client.call(Client.java:697)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
at $Proxy0.addBlock(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at $Proxy0.addBlock(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2814)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2696)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1996)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)
09/04/17 11:58:41 WARN hdfs.DFSClient: NotReplicatedYetException sleeping /tmp/hadoop-ashwath_kannan/mapred/system/job_200904171117_0002/job.jar retries left 2
09/04/17 11:58:42 INFO hdfs.DFSClient: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /tmp/hadoop-ashwath_kannan/mapred/system/job_200904171117_0002/job.jar could only be replicated to 0 nodes, instead of 1
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1280)
at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:351)
at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
at org.apache.hadoop.ipc.Client.call(Client.java:697)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
at $Proxy0.addBlock(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at $Proxy0.addBlock(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2814)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2696)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1996)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)
09/04/17 11:58:42 WARN hdfs.DFSClient: NotReplicatedYetException sleeping /tmp/hadoop-ashwath_kannan/mapred/system/job_200904171117_0002/job.jar retries left 1
09/04/17 11:58:46 WARN hdfs.DFSClient: DataStreamer Exception: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /tmp/hadoop-ashwath_kannan/mapred/system/job_200904171117_0002/job.jar could only be replicated to 0 nodes, instead of 1
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1280)
at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:351)
at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
at org.apache.hadoop.ipc.Client.call(Client.java:697)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
at $Proxy0.addBlock(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at $Proxy0.addBlock(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2814)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2696)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1996)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)
09/04/17 11:58:46 WARN hdfs.DFSClient: Error Recovery for block null bad datanode[0] nodes == null
09/04/17 11:58:46 WARN hdfs.DFSClient: Could not get block locations. Source file “/tmp/hadoop-ashwath_kannan/mapred/system/job_200904171117_0002/job.jar” – Aborting…
org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /tmp/hadoop-ashwath_kannan/mapred/system/job_200904171117_0002/job.jar could only be replicated to 0 nodes, instead of 1
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1280)
at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:351)
at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
at org.apache.hadoop.ipc.Client.call(Client.java:697)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
at $Proxy0.addBlock(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at $Proxy0.addBlock(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2814)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2696)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1996)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)
>>>>>> END >>>>>
I have seen this error before. It is usually caused by not having enough free disk space on your workstation. Try freeing up some space and recreating HDFS. Also check for error messages in the DataNode and NameNode windows.
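A quick way to act on the disk-space advice is to check how much room is left on the drive that holds the HDFS data. The sketch below is only an illustration: the `/tmp` location is inferred from the paths in the log above, and the 1 GB threshold and the function names are assumptions, not values from the tutorial.

```shell
# free_kb DIR: print the free space, in kilobytes, of the filesystem holding DIR.
# Uses POSIX "df -P" output, where line 2, field 4 is the available space.
free_kb() {
  df -Pk "$1" | awk 'NR==2 { print $4 }'
}

# check_space DIR MIN_KB: report whether DIR has at least MIN_KB kilobytes free.
check_space() {
  avail=$(free_kb "$1")
  if [ "$avail" -lt "$2" ]; then
    echo "low: only ${avail} KB free on $1"
    return 1
  fi
  echo "ok: ${avail} KB free on $1"
}

# Assumed location and threshold (1 GB); adjust both for your machine.
check_space /tmp 1048576 || echo "free up space, then recreate HDFS"
```

If the datanode is running, `bin/hadoop dfsadmin -report` also shows the capacity that HDFS itself believes it has.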
Hi, I followed what you showed and also the quick start on the official Apache website.
There is a big problem when I execute the command:
“bin/hadoop namenode -format”
It reports that bin/hadoop, the “hadoop” script, contains errors.
When I installed it on Linux in a VM, it worked fine.
How can I run this hadoop script correctly in Cygwin?
thanks
Could you post the error message that you are getting?
Hi VLAD,
May I know how recent your tutorial is? Is it updated for the most recent versions of Hadoop and Eclipse?
Thank you,
Wen-Han
The tutorial was written in April using Hadoop 0.19.1, the most recent version at the time. As for Eclipse, the newest version (Ganymede) is not compatible with the Hadoop plug-in supplied with version 0.19.1, so you have to use the previous version of Eclipse (Europa).
I saw that Hadoop 0.20 has come out, so I will take a look at what has changed and update the tutorial if needed.
Hi Vlad, the tutorial is good.
I am setting it up on my Mandriva machine, and whenever I run
ssh localhost
I get::
[abc@localhost .ssh]$ ssh localhost
ssh: connect to host localhost port 22: Connection refused
Please Help me
Hmm,
This tutorial was written for Windows machines. To resolve your problem, check that you have sshd installed and running. Also check that you don’t have a firewall blocking port 22.
Hi, I am working with the Hadoop Eclipse plug-in on Linux. Everything was working fine until one day Hadoop started to ignore any code changes I made in my project. Instead it just ran an old copy of the code from somewhere. Looking at the mapred.local folder, where the temporary source files are jarred together to run the job, the source code had indeed changed. I created another dummy project in Eclipse and it ran just fine, with changes reflected every time. What could be the problem?
Sorry man, I have never seen that happen. Maybe somebody else on this board will comment.
Vlad,
Thank you so much for this tutorial. I am having a problem when running: bin/hadoop namenode -format
First it said “JAVA_HOME not set”, so I set my Windows environment variable to the correct path, which is c:\program files\Java\jdk1.6.0_06
Then I closed and re-opened Cygwin and tried again. This time it appeared to work, but the first line of the output was “bin/hadoop: line 234: C:\Program: command not found”. The rest of the output looked like your screenshot. Is this normal?
Thanks,
Joe
Hi vlad,
Thanks for your reply to my last question. I configured Eclipse Europa according to the Yahoo tutorial on Hadoop:
http://public.yahoo.com/gogate/hadoop-tutorial/html/module3.html
and its instructions for creating a new DFS Location say:
“…..Next, click on the “Advanced” tab. There are two settings here which must be changed.
Scroll down to hadoop.job.ugi. It contains your current Windows login credentials. Highlight the first comma-separated value in this list (your username) and replace it with hadoop-user.”
I can’t find this attribute (hadoop.job.ugi) in the Advanced list of the “Define Hadoop location” dialog in Eclipse. Do you have any idea?
Thank you, a fast reply would be much appreciated.
Wen-Han
P.S. The Yahoo tutorial on Hadoop has Hadoop installed in VMware, not on localhost under Cygwin.
Thanks,
hello!!
Thank you for the good Hadoop tutorial. I am setting up a Hadoop cluster of 4 systems. When I run the bin/start-dfs.sh command I get the error “error: JAVA_HOME NOT set”. Can you please let me know the solution, and also how to set the JAVA_HOME path in .bash_profile at the Cygwin prompt?
Thank you!
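For the JAVA_HOME question, one possible approach is to export the variable from `~/.bash_profile` so every new Cygwin shell picks it up. This is a sketch, not the tutorial's own method: the helper name is made up, and the JDK path is an assumption based on the space-free `C:\Java\JDK1.6` location mentioned elsewhere in this thread.

```shell
# set_java_home PROFILE JDK_DIR
# Append an export line to PROFILE unless the same line is already there.
set_java_home() {
  profile="$1"
  line="export JAVA_HOME=\"$2\""
  grep -qxF "$line" "$profile" 2>/dev/null || printf '%s\n' "$line" >> "$profile"
}

# Example (assumed JDK location, written as a Cygwin path):
#   set_java_home "$HOME/.bash_profile" "/cygdrive/c/Java/JDK1.6"
```

Open a new Cygwin window afterwards so the change takes effect. The Hadoop scripts also read JAVA_HOME from conf/hadoop-env.sh, so setting it there is an alternative.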
Hi
The tutorial is helpful. I want to know how to upload images or structured data to HDFS using Cygwin and Eclipse on Windows.
One more thing: after restarting my PC, Hadoop was not working well, but once I restarted the Cygwin sshd service it worked again. Does the service have to be restarted every time the PC is restarted?
Thanks.
First you have to ask yourself what you are planning to do with your data. Depending on the answer, you could use the hdfs cp command or use HBase.
Note that if you are planning to use binary data, you might have to write your own record readers.
As for your second comment, make sure that in the Services window your sshd service is set to start automatically.
The bin/start-dfs.sh script won’t work in the environment described in this tutorial. To start the DFS services, refer to section 10 of the tutorial. On the additional machines you only have to start the data node and task tracker processes.
Remember that on the worker machines you have to edit the hadoop-site file to configure the name of your namenode machine instead of localhost. Also make sure all necessary firewall ports are open.
That’s right. But this way you incur the penalty of running another operating system, and it is tricky to debug processes in VMware.
Not sure what could be causing this. Check the dates on the files.
It’s a problem with the scripts. Try installing your JDK in a directory whose path doesn’t contain a space. I use C:\Java\JDK1.6 for that.
This tutorial is great. Hadoop is running perfectly in VM (windows xp).
Just one question.
Is there any way that I can use “start-all.sh”, instead of initiating “hadoop namenode”, “hadoop jobtracker”, …. in multiple cygwin windows?
Thank you again for all your efforts.
Not in Windows XP. The hadoop start scripts are written for Linux machines and for debugging purposes it is just easier to run each of the hadoop components in its own window.
Hi vlad, the tutorial is great.
Currently I am facing a problem in the upload data step. In Eclipse I get localhost->2->error, and I am unable to see the user and “In” folders and so on. Please suggest what to do now.
I also get an error in Eclipse Europa while running TestDriver.java.
Please advise me. Help will be appreciated.
09/05/28 14:40:01 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://localhost:9100/user/charitha/Out already exists
at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:111)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:793)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)
at TestDriver.main(TestDriver.java:41)
Regards,
Charitha Reddy.
Looks like this is the second time you are trying to run the project. Every time you run the project it creates the “Out” directory to store the output. You have to delete that directory before you run your project again, or change the code to create a new directory on every run. Look at the Hadoop examples to see how to do the latter.
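Deleting the old output directory can be scripted so it happens before each run. A minimal sketch, assuming the output directory is named `Out` as in the error above and that you run it from the Hadoop install directory; the `clean_output` helper is hypothetical, while `fs -rmr` is the recursive remove of the Hadoop file system shell in this generation of Hadoop.

```shell
# clean_output HADOOP_CMD DIR
# Recursively remove DIR from HDFS so the next job run can create it again.
clean_output() {
  "$1" fs -rmr "$2"
}

# Example (run from the Hadoop install directory, with the cluster started):
#   clean_output bin/hadoop Out
```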
Do you see any activity in the Cygwin windows when you are trying to connect? It could be the firewall blocking incoming ports.
Run the following command from the command window and let me know what you get. Note that Hadoop has to be started first.
telnet localhost 9100
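If telnet is not installed, a similar check can be sketched with bash's built-in `/dev/tcp` redirection. This assumes bash is available and was built with `/dev/tcp` support; the `port_open` helper is hypothetical, and 9100 is the namenode port that appears in the error above.

```shell
# port_open HOST PORT: succeed if a TCP connection to HOST:PORT can be opened.
# /dev/tcp is a bash feature, so the check is delegated to bash explicitly.
port_open() {
  bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

if port_open localhost 9100; then
  echo "port 9100 is reachable"
else
  echo "port 9100 is closed or blocked"
fi
```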
Vlad
would like to know whether you have some update on the following
>>snip>>
The error you are getting is actually correct. The Mappers / Reducers generated by the plug-in need some tweaking. I will post another tutorial about this sometime in May.
vlad – April 15th, 2009 at 12:00 pm
>>end of snip>>
Sorry, been really busy lately.
Hello Vlad,
Thanks for the tutorial. I still have a problem with the TestDriver class. When I run it, I get this error message in Eclipse:
09/06/06 16:44:03 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
09/06/06 16:44:03 INFO mapred.FileInputFormat: Total input paths to process : 4
09/06/06 16:44:04 INFO mapred.JobClient: Running job: job_200906061639_0001
09/06/06 16:44:05 INFO mapred.JobClient: map 0% reduce 0%
09/06/06 16:44:14 INFO mapred.JobClient: Task Id : attempt_200906061639_0001_m_000000_0, Status : FAILED
java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.Text, recieved org.apache.hadoop.io.LongWritable
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:558)
at org.apache.hadoop.mapred.lib.IdentityMapper.map(IdentityMapper.java:37)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
at org.apache.hadoop.mapred.Child.main(Child.java:158)
09/06/06 16:44:18 INFO mapred.JobClient: Task Id : attempt_200906061639_0001_m_000000_1, Status : FAILED
java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.Text, recieved org.apache.hadoop.io.LongWritable
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:558)
at org.apache.hadoop.mapred.lib.IdentityMapper.map(IdentityMapper.java:37)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
at org.apache.hadoop.mapred.Child.main(Child.java:158)
09/06/06 16:44:22 INFO mapred.JobClient: Task Id : attempt_200906061639_0001_m_000000_2, Status : FAILED
java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.Text, recieved org.apache.hadoop.io.LongWritable
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:558)
at org.apache.hadoop.mapred.lib.IdentityMapper.map(IdentityMapper.java:37)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
at org.apache.hadoop.mapred.Child.main(Child.java:158)
java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
at TestDriver.main(TestDriver.java:40)
I have no idea what is wrong. I use all the programs you listed in the tutorial (Eclipse 3.3.2, Hadoop 0.19.1, etc.).
Thanks
Martinus
Hi vlad, the tutorial is great.
I followed your tutorial and met a problem in step 11, Setup Hadoop Location in Eclipse.
Step 6 says: in the Project Explorer tab on the left hand side of the Eclipse window, find the DFS Locations item. Open it using the “+” icon on its left. Inside, you should see the localhost location reference with the blue elephant icon. Keep opening the items below it until you see something like the image below.
I used the “+” icon on the left. Inside, there is a folder with an empty name, like in your image. When I keep opening, the next folder is not “tmp(1)” but “Error: null”.
thanks,
I solved the problem. It was because I had not set the Cygwin environment variable correctly.
Thanks,
Hello Vlad, your tutorial is very helpful.
I have only one problem, in step 11, Setup Hadoop Location in Eclipse.
At step 6, in the Project Explorer tab on the left side of the Eclipse window, I found the DFS Locations item and clicked the “+” icon. There is a folder named (1). When I keep opening, the next folder is not “tmp(1)” but “Error: call to localhost/127.0.0.1:9000 failed on connection exception: java.net.ConnectException: Connection refused: no further information”.
I think my Cygwin environment variable is right,
so I don’t know what’s wrong with it.
thanks
Just thought I would mention that for hadoop-0.20.0+ under Cygwin you also need to install rsync (under the “NET” section) for filesystem replication to work. Additionally, the XML configuration is split up into 3 different files. Details on that can be found here: http://hadoop.apache.org/common/docs/r0.20.0/cluster_setup.html
Cheers,
Wylie
Collective Medical Technologies
http://www.collectivemedicaltech.com
Check that your cluster is running (no error messages in the command windows). Also check whether you have a firewall installed that might be preventing the connections.
To get rid of
java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.Text, recieved org.apache.hadoop.io.LongWritable
Change the following lines in TestDriver.java as follows:
//conf.setOutputKeyClass(Text.class);
//conf.setOutputValueClass(IntWritable.class);
conf.setOutputKeyClass(LongWritable.class);
conf.setOutputValueClass(Text.class);
HTH,
Arun Jamwal
For those who have the “bin/hadoop: line 234: C:\Program: command not found” problem: this is caused by the whitespace in “Program Files”. In other words, if your JAVA_HOME is “c:\Program Files\java”, there is a space between “Program” and “Files”. One way to solve the problem is to put your JDK in a different folder. I put my JDK in c:\java\jdk, and then everything works pretty well. Hope it helps.
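The `C:\Program: command not found` message is what shell word splitting looks like: an unquoted value containing a space is broken into two words, and the first is then treated as the command. The snippet below only illustrates that effect with a made-up helper; it is not taken from the bin/hadoop script.

```shell
# count_args: print how many separate words the shell passed in.
count_args() {
  echo "$#"
}

jdk='C:\Program Files\Java\jdk1.6.0_06'

count_args $jdk     # unquoted: split at the space into 2 words
count_args "$jdk"   # quoted: stays 1 word
```

A path without spaces avoids the problem no matter how the script quotes its variables, which is why moving the JDK works.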
Hi All,
I was using the article to install Hadoop.
While running the command
$ bin/hadoop namenode -format
I found that there were errors because the installed JDK was in ‘C:\Program Files’ and the command was referring to it through the environment variable JAVA_HOME. Since there is a space between ‘Program’ and ‘Files’, it was dying.
I resolved it by creating a symbolic link
$ ln -s "/cygdrive/c/Program Files/java/jdk1.6.0_02" /java
inside the ‘/’ folder through Cygwin, and made an entry in <>/conf/hadoop-env.sh like
‘export JAVA_HOME=/java’
Regards
Charanjeet singh
Senior Engineer
Impetus infotech India Pvt. Ltd.
Extremely useful. I’m thinking of pointing a bunch of students at this. One detail: the tutorial has some stale links to hadoop-0.19.1 (as well as a number of references to that elsewhere in the text). It would be good to write the tutorial in such a way that the text doesn’t need to be updated with each new version.
Hi:
It is very helpful for me!
My problem is:
I uploaded the txt file by command, but I find the uploaded file is empty. Why?
If I don’t open five separate Cygwin windows and instead run start-all.sh, I get a “Could not obtain block” error.
But if I open five separate Cygwin windows as described in the tutorial, it does work.
What I found:
If you want to run start-all.sh instead of opening five separate Cygwin windows as this tutorial says, please do
hadoop fs -put before you run start-all.sh. If not, you will get the “Could not obtain block” error when you run your job.
I get this error when I open the Map/Reduce perspective in Eclipse, and I don’t see the files under localhost->1 in DFS Locations. The errors below were in the namenode window:
lVersion(org.apache.hadoop.dfs.ClientProtocol, 35) from 127.0.0.1:3282: error: java.io.IOException: Unknown protocol to name node: org.apache.hadoop.dfs.ClientProtocol
java.io.IOException: Unknown protocol to name node: org.apache.hadoop.dfs.ClientProtocol
at org.apache.hadoop.hdfs.server.namenode.NameNode.getProtocolVersion(NameNode.java:98)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
09/10/22 15:58:08 INFO ipc.Server: IPC Server handler 4 on 9100, call getProtocolVersion(org.apache.hadoop.dfs.ClientProtocol, 35) from 127.0.0.1:3282: error: java.io.IOException: Unknown protocol to name node: org.apache.hadoop.dfs.ClientProtocol
java.io.IOException: Unknown protocol to name node: org.apache.hadoop.dfs.ClientProtocol
at org.apache.hadoop.hdfs.server.namenode.NameNode.getProtocolVersion(NameNode.java:98)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
Hi there, your tutorial is excellent. Very good job, and I don’t say that often.
So I was trying to set up HBase using your Hadoop tutorial. I was able to follow up to step 12, but when I try to execute the following I get:
$ bin/hbase namenode -format
: No such file or directory
bin/hbase: line 45: $'\r': command not found
Can you tell me what I am missing?
Thanks
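A `$'\r': command not found` message usually means the script has Windows (CRLF) line endings, so the shell sees a stray carriage return at the end of each line. One way to remove them is sketched below; the `strip_cr` helper is hypothetical, and Cygwin's `dos2unix` tool, if installed, does the same job.

```shell
# strip_cr FILE: remove all carriage returns from FILE.
# The cleaned text is written back into the original file (rather than moved
# over it) so that the file keeps its permissions, including the execute bit.
strip_cr() {
  tr -d '\r' < "$1" > "$1.nocr" && cat "$1.nocr" > "$1" && rm -f "$1.nocr"
}

# Example: strip_cr bin/hbase
```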
Well, after a few internet searches and an hour later, I am able to execute it, but now I get this error:
$ bin/hbase namenode -format
Exception in thread "main" java.lang.NoClassDefFoundError: namenode
Is there an elegant way to stop DFS? Stopping it with Ctrl-C seems to corrupt it, and bin/hadoop/stop-dfs.sh doesn’t seem to work (I get an error message like localhost: cat: cannot open file /dev/fs/C/tmp/hadoop-sk-secondarynamenode.pid : No such file or directory).
Thanks!
Great tutorial!
I’ve almost got this working, but I’m having trouble connecting to localhost with ssh.
If I do:
ssh localhost -v
the last two lines are:
Offering public key: /home/user.name/.ssh/id_rsa
Connection closed by xxx.x.x.x
Any ideas what is going on?
I also had to manually add ssh_server to administrators and change the password in order to get the sshd service to run.
-Steve
Thanks for your excellent tutorial! However, in the last step I got the following error, and I noticed that two others posted the same error in their comments.
Would you please answer me?
09/11/10 12:53:01 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
09/11/10 12:53:01 INFO mapred.FileInputFormat: Total input paths to process : 4
09/11/10 12:53:02 INFO mapred.JobClient: Running job: job_200911101209_0003
09/11/10 12:53:03 INFO mapred.JobClient: map 0% reduce 0%
09/11/10 12:53:13 INFO mapred.JobClient: Task Id : attempt_200911101209_0003_m_000000_0, Status : FAILED
java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.Text, recieved org.apache.hadoop.io.LongWritable
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:558)
at org.apache.hadoop.mapred.lib.IdentityMapper.map(IdentityMapper.java:37)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
at org.apache.hadoop.mapred.Child.main(Child.java:158)
09/11/10 12:53:17 INFO mapred.JobClient: Task Id : attempt_200911101209_0003_m_000000_1, Status : FAILED
java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.Text, recieved org.apache.hadoop.io.LongWritable
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:558)
at org.apache.hadoop.mapred.lib.IdentityMapper.map(IdentityMapper.java:37)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
at org.apache.hadoop.mapred.Child.main(Child.java:158)
09/11/10 12:53:22 INFO mapred.JobClient: Task Id : attempt_200911101209_0003_m_000000_2, Status : FAILED
java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.Text, recieved org.apache.hadoop.io.LongWritable
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:558)
at org.apache.hadoop.mapred.lib.IdentityMapper.map(IdentityMapper.java:37)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
at org.apache.hadoop.mapred.Child.main(Child.java:158)
java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
at TestDriver.main(TestDriver.java:41)
I have a problem with the Eclipse plug-in.
—————————————————-
Cannot connect to the Map/Reduce location:localhost.
Failed to get the current user’s information.
—————————————————-
My Windows user account needs a password to log in.
Please help me, thank you!
The prohadoop website has a lot of information on Hadoop and Hadoop setup as well as a good community of people to ask and answer questions with.
This particular error, java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.Text, recieved org.apache.hadoop.io.LongWritable,
occurs because the input format for your job is TextInputFormat rather than KeyValueTextInputFormat, while the job is configured to expect Text keys from the map.
TextInputFormat provides a LongWritable key, which is the byte offset of the line within the input file (not the line number), and a Text value, which is the line’s contents.
KeyValueTextInputFormat provides a Text key, the portion of the input line up to the first TAB character, and a Text value, the portion of the input line after the first TAB character.
Alternatively, you can modify the definition of your Map class to accept a LongWritable as the input key type.
hello!
When I run the code I get the below error. I understand there is some change in the path where the job cache files are created; but I don’t know how to change it. Any clue??
Thanks in advance.
INFO mapred.JobClient: Task Id : attempt_201001041128_0006_m_000006_1, Status : FAILED
java.io.FileNotFoundException: File C:/tmp/hadoop-MBS/mapred/local/taskTracker/jobcache/job_201001041128_0006/attempt_201001041128_0006_m_000006_1/work/tmp does not exist.
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:361)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
at org.apache.hadoop.mapred.TaskRunner.setupWorkDir(TaskRunner.java:519)
at org.apache.hadoop.mapred.Child.main(Child.java:155)