Vlad’s thoughts

my thoughts on current state of things

Hadoop tutorial for Windows and Eclipse.

I just posted a tutorial on how to configure a Hadoop environment on Windows using Cygwin. The tutorial explains how to set up a Hadoop cluster in pseudo-distributed mode and how to get it working with Eclipse.

If you have any questions, comments, or suggestions about this tutorial, post them here.

The tutorial is located here.

59 comments

  1. Ben March 29th, 2009 10:19 am

    Thanks for your excellent tutorial! I followed it this weekend and was able to get mostly up and running.

    One question I had was how to use it with EC2 — I set up on EC2 rather than on localhost, and I’m wondering what I need to do in order to make it run… getting weird unknown host errors when I run, despite having set up a proxy server.

    Thanks for the very helpful tutorial!
    Ben

  2. vlad March 29th, 2009 11:14 am

    No problem.

    Setting up Hadoop correctly on EC2 can be tricky. I am going to post another tutorial about it in a few weeks.

  3. Rez March 31st, 2009 5:17 pm

    Hey, this page on your tutorial (Unpacking Hadoop)

    http://v-lad.org/Tutorials/Hadoop/09%20-%20unpack%20hadoop.html

    is not working.

  4. vlad April 9th, 2009 8:30 am

    Strange. Works for me, can’t see what the problem is. Does anybody else have this problem?

  5. Jeff April 9th, 2009 2:32 pm

    Thanks for the tutorial… it would have saved me a few hours of frustration.

    Have you tried it with other versions of Eclipse? The main distribution is 3.4 (Ganymede), and 3.5 is due shortly, in May.

  6. vlad April 9th, 2009 10:09 pm

    Jeff,

    I tried other versions of Eclipse. The plug-in doesn’t work with 3.4 and probably won’t work with 3.5 until somebody fixes it, because the plug-in API changed in the newer versions of Eclipse. You can use the plug-in with 3.4 to browse HDFS, but you won’t be able to start the project.

  7. Joseph April 15th, 2009 12:09 am

    Vlad,

    Thanks for the well-documented tutorial. It is good work.

    Towards the last step I got the following error:
    09/04/15 15:00:33 INFO mapred.JobClient: Task Id : attempt_200904151224_0004_m_000000_2, Status : FAILED
    java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.Text, recieved org.apache.hadoop.io.LongWritable
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:558)
    at org.apache.hadoop.mapred.lib.IdentityMapper.map(IdentityMapper.java:37)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
    at org.apache.hadoop.mapred.Child.main(Child.java:158)

    Kindly advise if you have any clue.

    my code is as follows:
    // TODO: specify input and output DIRECTORIES (not files)
    //conf.setInputPath(new Path("src"));
    //conf.setOutputPath(new Path("out"));

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path("In"));
    FileOutputFormat.setOutputPath(conf, new Path("Out3"));

    thanks and regards
    Joseph

  8. vlad April 15th, 2009 12:00 pm

    The error you are getting is actually correct. The Mappers / Reducers generated by the plug-in need some tweaking. I will post another tutorial about this sometime in May.

  9. ash April 16th, 2009 11:40 pm

    Hi Vlad,

    Thanks for the excellent tutorial. In the last step, when I try to run the TestDriver class, I get this error.

    Please help…

    >>>>>>>>> START >>>>>>>

    09/04/17 11:58:39 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
    09/04/17 11:58:40 INFO hdfs.DFSClient: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /tmp/hadoop-ashwath_kannan/mapred/system/job_200904171117_0002/job.jar could only be replicated to 0 nodes, instead of 1
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1280)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:351)
    at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)

    at org.apache.hadoop.ipc.Client.call(Client.java:697)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
    at $Proxy0.addBlock(Unknown Source)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
    at $Proxy0.addBlock(Unknown Source)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2814)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2696)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1996)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)

    09/04/17 11:58:40 WARN hdfs.DFSClient: NotReplicatedYetException sleeping /tmp/hadoop-ashwath_kannan/mapred/system/job_200904171117_0002/job.jar retries left 4
    09/04/17 11:58:40 INFO hdfs.DFSClient: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /tmp/hadoop-ashwath_kannan/mapred/system/job_200904171117_0002/job.jar could only be replicated to 0 nodes, instead of 1
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1280)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:351)
    at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)

    at org.apache.hadoop.ipc.Client.call(Client.java:697)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
    at $Proxy0.addBlock(Unknown Source)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
    at $Proxy0.addBlock(Unknown Source)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2814)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2696)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1996)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)

    09/04/17 11:58:40 WARN hdfs.DFSClient: NotReplicatedYetException sleeping /tmp/hadoop-ashwath_kannan/mapred/system/job_200904171117_0002/job.jar retries left 3
    09/04/17 11:58:41 INFO hdfs.DFSClient: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /tmp/hadoop-ashwath_kannan/mapred/system/job_200904171117_0002/job.jar could only be replicated to 0 nodes, instead of 1
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1280)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:351)
    at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)

    at org.apache.hadoop.ipc.Client.call(Client.java:697)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
    at $Proxy0.addBlock(Unknown Source)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
    at $Proxy0.addBlock(Unknown Source)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2814)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2696)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1996)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)

    09/04/17 11:58:41 WARN hdfs.DFSClient: NotReplicatedYetException sleeping /tmp/hadoop-ashwath_kannan/mapred/system/job_200904171117_0002/job.jar retries left 2
    09/04/17 11:58:42 INFO hdfs.DFSClient: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /tmp/hadoop-ashwath_kannan/mapred/system/job_200904171117_0002/job.jar could only be replicated to 0 nodes, instead of 1
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1280)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:351)
    at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)

    at org.apache.hadoop.ipc.Client.call(Client.java:697)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
    at $Proxy0.addBlock(Unknown Source)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
    at $Proxy0.addBlock(Unknown Source)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2814)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2696)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1996)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)

    09/04/17 11:58:42 WARN hdfs.DFSClient: NotReplicatedYetException sleeping /tmp/hadoop-ashwath_kannan/mapred/system/job_200904171117_0002/job.jar retries left 1
    09/04/17 11:58:46 WARN hdfs.DFSClient: DataStreamer Exception: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /tmp/hadoop-ashwath_kannan/mapred/system/job_200904171117_0002/job.jar could only be replicated to 0 nodes, instead of 1
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1280)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:351)
    at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)

    at org.apache.hadoop.ipc.Client.call(Client.java:697)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
    at $Proxy0.addBlock(Unknown Source)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
    at $Proxy0.addBlock(Unknown Source)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2814)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2696)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1996)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)

    09/04/17 11:58:46 WARN hdfs.DFSClient: Error Recovery for block null bad datanode[0] nodes == null
    09/04/17 11:58:46 WARN hdfs.DFSClient: Could not get block locations. Source file "/tmp/hadoop-ashwath_kannan/mapred/system/job_200904171117_0002/job.jar" - Aborting...
    org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /tmp/hadoop-ashwath_kannan/mapred/system/job_200904171117_0002/job.jar could only be replicated to 0 nodes, instead of 1
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1280)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:351)
    at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)

    at org.apache.hadoop.ipc.Client.call(Client.java:697)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
    at $Proxy0.addBlock(Unknown Source)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
    at $Proxy0.addBlock(Unknown Source)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2814)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2696)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1996)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)

    >>>>>> END >>>>>

  10. vlad April 17th, 2009 7:24 am

    I have seen this error before. Usually it is caused by not having enough disk space on your workstation. Try to free up some space and recreate HDFS. Also check for error messages in the DataNode and NameNode windows.
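
    A minimal sketch of the disk-space check suggested above, in plain Java. DiskSpaceCheck and hasAtLeast are hypothetical names, not part of Hadoop, and the directory inspected in main is an assumption: point it at wherever your Hadoop data actually lives.

```java
import java.io.File;

// Hypothetical helper, not part of Hadoop: reports whether the partition
// holding a directory has at least the requested number of usable bytes.
final class DiskSpaceCheck {
    static boolean hasAtLeast(File dir, long requiredBytes) {
        // getUsableSpace() returns 0 when the path does not name a partition.
        return dir.getUsableSpace() >= requiredBytes;
    }

    public static void main(String[] args) {
        // Assumption: the data directory is the JVM's temp directory.
        // Replace this with your real Hadoop data directory.
        File dataDir = new File(System.getProperty("java.io.tmpdir"));
        long oneGiB = 1024L * 1024L * 1024L;
        System.out.println(dataDir + " has at least 1 GiB usable: "
                + hasAtLeast(dataDir, oneGiB));
    }
}
```

    If this reports too little space, free some up before recreating HDFS; the DataNode window may also show the underlying error.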

  11. tony April 19th, 2009 12:18 am

    Hi, I followed what you showed and also the quick start on the official Apache website.
    There is a big problem when I execute the command:
    “bin/hadoop namenode -format”
    It reports that bin/hadoop, the “hadoop” script, contains certain errors.
    When I installed it on Linux in a VM, it was OK.
    How can I run this hadoop script in Cygwin correctly?
    Thanks

  12. vlad April 20th, 2009 2:59 pm

    Could you post the error message that you are getting?

  13. Wen-Han April 22nd, 2009 1:26 pm

    Hi VLAD,

    May I ask how recent your tutorial is? Is it up to date with the most recent versions of Hadoop and Eclipse?

    Thank you,

    Wen-Han

  14. vlad April 22nd, 2009 4:16 pm

    The tutorial was written in April using Hadoop 0.19.1, the most recent version at the time. As for Eclipse, the newest version (Ganymede) is not compatible with the Hadoop plug-in supplied with version 0.19.1, so you have to use the previous version of Eclipse (Europa).

    I saw that a new version, Hadoop 0.20, came out, so I will take a look at what has changed and update the tutorial if needed.

  15. Saurabh April 23rd, 2009 5:59 am

    Hi vlad, the tutorial is good.
    I am setting it up on my Mandriva machine, and whenever I run

    ssh localhost
    I get::

    [abc@localhost .ssh]$ ssh localhost
    ssh: connect to host localhost port 22: Connection refused

    Please Help me

  16. vlad April 23rd, 2009 6:23 am

    Hmm,

    This tutorial was written for Windows machines. To resolve your problem, check that you have sshd installed and running. Also check that you don’t have a firewall blocking port 22.

  17. Sid April 25th, 2009 12:38 pm

    Hi, I am working with Hadoop and Eclipse on Linux. Everything was working fine until one day Hadoop started to ignore any code changes I made in my project. Instead it just ran an old copy of the code from somewhere. Looking at the mapred.local folder, where the temporary source files are jarred together to run the job, the source code was indeed changed… I created another dummy project in Eclipse and ran it, and it ran just fine; changes were reflected every time… What could be the problem?

  18. vlad April 25th, 2009 6:20 pm

    Sorry man, I have never seen that happen. Maybe somebody else on this board will comment.

  19. Joe May 1st, 2009 4:44 am

    Vlad,
    Thank you so much for this tutorial. I am having a problem when running: bin/hadoop namenode -format

    First it said “JAVA_HOME not set”, so I set my Windows environment variable to the correct path, which is c:\program files\Java\jdk1.6.0_06

    Then I closed and re-opened Cygwin, and tried again. This time it appeared to work, but the first line of the output was “bin/hadoop: line 234: C:\Program: command not found”. The rest of the output looked like your screenshot. Is this normal?

    Thanks,
    Joe

  20. Wen-Han May 1st, 2009 11:12 am

    Hi vlad,

    Thanks for your reply to my last question. I configured Eclipse Europa according to the Yahoo tutorial on Hadoop:
    http://public.yahoo.com/gogate/hadoop-tutorial/html/module3.html

    and the instructions for creating a new DFS Location say:
    “…..Next, click on the “Advanced” tab. There are two settings here which must be changed.

    Scroll down to hadoop.job.ugi. It contains your current Windows login credentials. Highlight the first comma-separated value in this list (your username) and replace it with hadoop-user.”

    I can’t find this attribute (hadoop.job.ugi) in the Advanced list under “Define Hadoop location” in Eclipse. Do you have any idea?

    Thank you, a fast reply would be much appreciated.

    Wen-Han

  21. Wen-Han May 1st, 2009 11:15 am

    P.S. The Yahoo tutorial on Hadoop has Hadoop installed in VMware, not on localhost via Cygwin.

    Thanks,

  22. sneha May 2nd, 2009 8:52 am

    hello!!

    Thank you for the good Hadoop tutorial… I am setting up a Hadoop cluster of 4 systems… When I run the bin/start-dfs.sh command I get the error “JAVA_HOME NOT set”. Can you please let me know the solution, and also how to set the JAVA_HOME path in .bash_profile at the Cygwin prompt?
    Thank you!

  23. Muhammad Mudassar May 5th, 2009 11:36 pm

    Hi
    The tutorial is helpful. I want to know how to upload some images or some structured data to HDFS using Cygwin and Eclipse on Windows.
    One more thing: after restarting my PC, Hadoop was not working well, but once I restarted the CYGWIN sshd service it worked again. Does the service have to be restarted every time the PC is restarted?

    Thanks.

  24. vlad May 8th, 2009 7:48 am

    First you have to ask yourself what you are planning to do with your data. Depending on the answer, you could use the hdfs cp command or use HBase.

    Note that if you are planning to use binary data you might have to write your own record readers.

  25. vlad May 8th, 2009 7:51 am

    As for your second question, make sure that your sshd service is set to start automatically in the Services window.

  26. vlad May 8th, 2009 8:01 am

    The bin/start-dfs.sh script won’t work in the environment described in this tutorial. To start the DFS services, refer to section 10 of the tutorial. On the additional machines you only have to start the data node and task tracker processes.

    Remember that on the worker machines you have to edit the hadoop-site file to configure the name of your namenode machine instead of localhost. Also make sure all necessary firewall ports are open.

  27. vlad May 8th, 2009 8:02 am

    That’s right. But this way you will incur the penalties of running another operating system, and it is tricky to debug processes in vmware.

  28. vlad May 8th, 2009 8:04 am

    Not sure what could be causing this. Check the dates on the files.

  29. vlad May 8th, 2009 8:05 am

    It’s a problem with the scripts. Try installing your JDK in a directory whose path doesn’t contain a space. I use C:\Java\JDK1.6 for that.

  30. Kim May 13th, 2009 2:55 pm

    This tutorial is great. Hadoop is running perfectly in VM (windows xp).
    Just one question.
    Is there any way that I can use “start-all.sh”, instead of initiating “hadoop namenode”, “hadoop jobtracker”, …. in multiple cygwin windows?

    Thank you again for all your efforts.

  31. vlad May 13th, 2009 9:23 pm

    Not in Windows XP. The hadoop start scripts are written for Linux machines and for debugging purposes it is just easier to run each of the hadoop components in its own window.

  32. Mayank May 21st, 2009 4:35 am

    Hi vlad, the tutorial is great.
    Currently I am facing a problem in the upload data step: in Eclipse I get localhost->2->error. I am unable to see the user and “In” folders and so on… Please suggest what to do now.

  33. Charitha May 28th, 2009 2:12 am

    Error in Eclipse Europa while running TestDriver.java…

    Please advise me. Help will be appreciated.

    09/05/28 14:40:01 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
    org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://localhost:9100/user/charitha/Out already exists
    at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:111)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:793)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)
    at TestDriver.main(TestDriver.java:41)

    Regards,
    Charitha Reddy.

  34. vlad May 28th, 2009 8:27 am

    Looks like this is the second time you are running the project. Every time you run the project it creates the “Out” directory to store the output. You have to delete that directory before you run your project, or change the code to create a new directory every time you run. Look at the Hadoop examples to see how to do the latter.
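
    One way to get a new directory on every run, sketched in plain Java. This is not taken from the Hadoop examples; OutputDirNames and timestamped are hypothetical names, and the timestamp pattern is just one possible choice.

```java
import java.text.SimpleDateFormat;
import java.util.Date;

// Hypothetical helper, not taken from the tutorial or the Hadoop examples:
// builds an output directory name that differs from run to run.
final class OutputDirNames {
    static String timestamped(String prefix, Date now) {
        // Pattern gives names shaped like Out-yyyyMMdd-HHmmss-SSS.
        SimpleDateFormat format = new SimpleDateFormat("yyyyMMdd-HHmmss-SSS");
        return prefix + "-" + format.format(now);
    }

    public static void main(String[] args) {
        // Each run prints a different name, so the directory never pre-exists.
        System.out.println(timestamped("Out", new Date()));
    }
}
```

    The resulting string could then be used wherever the driver currently passes a fixed name such as “Out” as its output path.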

  35. vlad May 28th, 2009 8:36 am

    Do you see any activity in the Cygwin windows when you are trying to connect? It could be the firewall blocking incoming ports.
    Use the following command from the command window and let me know what you get. Note that you have to have Hadoop started.

    telnet localhost 9100
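
    If telnet is not available, roughly the same check can be sketched in plain Java. PortCheck and isPortOpen are hypothetical names, and the 2000 ms timeout is an arbitrary choice; port 9100 is the one used in this thread.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper: tries to open a TCP connection, much like telnet does.
final class PortCheck {
    // Returns true if a connection to host:port succeeds within timeoutMillis.
    static boolean isPortOpen(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused, timed out, or host unreachable.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("localhost:9100 reachable: "
                + isPortOpen("localhost", 9100, 2000));
    }
}
```

    A result of false while Hadoop is running would point to the same suspects as a failed telnet: the service is not listening on that port, or a firewall is in the way.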

  36. Joseph May 28th, 2009 8:59 pm

    Vlad

    would like to know whether you have some update on the following
    >>snip>>
    The error you are getting is actually correct. The Mappers / Reducers generated by the plug-in need some tweaking. I will post another tutorial about this sometime in May.
    vlad – April 15th, 2009 at 12:00 pm
    >>end of snip>>

  37. vlad May 28th, 2009 9:22 pm

    Sorry, been really busy lately.

  38. Martinus June 6th, 2009 7:48 am

    Hello Vlad,

    Thanks for the tutorial. I still have a problem with the TestDriver class. After I compile the class, I get this error message from Eclipse:

    09/06/06 16:44:03 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
    09/06/06 16:44:03 INFO mapred.FileInputFormat: Total input paths to process : 4
    09/06/06 16:44:04 INFO mapred.JobClient: Running job: job_200906061639_0001
    09/06/06 16:44:05 INFO mapred.JobClient: map 0% reduce 0%
    09/06/06 16:44:14 INFO mapred.JobClient: Task Id : attempt_200906061639_0001_m_000000_0, Status : FAILED
    java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.Text, recieved org.apache.hadoop.io.LongWritable
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:558)
    at org.apache.hadoop.mapred.lib.IdentityMapper.map(IdentityMapper.java:37)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
    at org.apache.hadoop.mapred.Child.main(Child.java:158)

    09/06/06 16:44:18 INFO mapred.JobClient: Task Id : attempt_200906061639_0001_m_000000_1, Status : FAILED
    java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.Text, recieved org.apache.hadoop.io.LongWritable
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:558)
    at org.apache.hadoop.mapred.lib.IdentityMapper.map(IdentityMapper.java:37)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
    at org.apache.hadoop.mapred.Child.main(Child.java:158)

    09/06/06 16:44:22 INFO mapred.JobClient: Task Id : attempt_200906061639_0001_m_000000_2, Status : FAILED
    java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.Text, recieved org.apache.hadoop.io.LongWritable
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:558)
    at org.apache.hadoop.mapred.lib.IdentityMapper.map(IdentityMapper.java:37)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
    at org.apache.hadoop.mapred.Child.main(Child.java:158)

    java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
    at TestDriver.main(TestDriver.java:40)

    I have no idea what is wrong. I use all the programs you listed in the tutorial (Eclipse 3.3.2, Hadoop 0.19.1, etc.).

    Thanks

    Martinus

  39. Carspar June 9th, 2009 12:49 am

    Hi vlad, the tutorial is great.

    I followed your tutorial and ran into a problem in step 11, Setup Hadoop Location in Eclipse.

    At the step 6, In the Project Explorer tab on the left hand side of the Eclipse window, find the DFS Locations item. Open it using the “+” icon on its left. Inside, you should see the localhost location reference with the blue elephant icon. Keep opening the items below it until you see something like the image below.

    I used the “+” icon on the left. Inside, there is a folder with an empty name, like in your image. When I keep opening, the next folder is not “tmp(1)” but “Error: null”.

    thanks,

  40. Carspar June 9th, 2009 1:41 am

    I solved the problem. It was because I had not set the Cygwin environment variable correctly.

    Thanks,

  41. kerenann July 22nd, 2009 12:37 am

    Hello vlad, your tutorial is very helpful.
    I have only one problem, in step 11, Setup Hadoop Location in Eclipse.
    At step 6, in the Project Explorer tab on the left side of the Eclipse window, I found the DFS Locations item and clicked the “+” icon. There is a folder named (1). When I keep opening, the next folder is not “tmp(1)” but “Error:call to localhost/127.0.0.1:9000 failed on connection exception:java.net.ConnectException: Connection refused: no further information”.
    I think my Cygwin environment variable is right,
    so I don’t know what’s wrong with it.
    Thanks

  42. Wylie van den Akker July 27th, 2009 11:07 am

    Just thought I would mention that for hadoop-0.20.0+ under Cygwin you also need to install rsync (under the “NET” section) for filesystem replication to work. Additionally, the XML configuration is split up into 3 different files. Details on that can be found here: http://hadoop.apache.org/common/docs/r0.20.0/cluster_setup.html

    Cheers,
    Wylie
    Collective Medical Technologies
    http://www.collectivemedicaltech.com

  43. vlad August 5th, 2009 6:05 am

    Check whether your cluster is running (no error messages in the command windows). Also check whether you have a firewall installed that might be preventing the connections.

  44. Arun Jamwal August 7th, 2009 4:55 pm

    To get rid of
    java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.Text, recieved org.apache.hadoop.io.LongWritable
    Change the following lines in TestDriver.java as

    //conf.setOutputKeyClass(Text.class);
    //conf.setOutputValueClass(IntWritable.class);
    conf.setOutputKeyClass(LongWritable.class);
    conf.setOutputValueClass(Text.class);

    HTH,
    Arun Jamwal
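
    One way to see why this change helps: the stack trace shows IdentityMapper doing the map step, so the keys coming from the text input (LongWritable, per the error message) pass through unchanged and have to match the declared output key class. The toy model below illustrates only that idea. Its LongWritable and Text classes are stand-ins defined in the sketch, not the Hadoop types, and collect is a hypothetical simplification of the check reported in the trace.

```java
// Toy model only: every class here is a stand-in, not a real Hadoop type.
final class TypeCheckSketch {
    static final class LongWritable {
        final long value;
        LongWritable(long value) { this.value = value; }
    }

    static final class Text {
        final String value;
        Text(String value) { this.value = value; }
    }

    // An identity map step forwards its input key unchanged.
    static Object identityMapKey(Object inputKey) {
        return inputKey;
    }

    // Simplified check: the runtime class of the emitted key must equal
    // the key class declared in the job configuration.
    static void collect(Object key, Class<?> declaredKeyClass) {
        if (key.getClass() != declaredKeyClass) {
            throw new IllegalStateException("Type mismatch in key from map: expected "
                    + declaredKeyClass.getSimpleName() + ", received "
                    + key.getClass().getSimpleName());
        }
    }

    public static void main(String[] args) {
        Object key = identityMapKey(new LongWritable(0L)); // offset-style key
        collect(key, LongWritable.class);                   // matches: no error
        try {
            collect(key, Text.class);                       // declared Text: fails
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

    Declaring the output key and value classes to match what the identity step actually emits, as in the lines above, makes the check pass; writing a real mapper that emits Text keys would be the other way out.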

  45. richilee August 22nd, 2009 1:08 pm

    For those who have the “bin/hadoop: line 234: C:\Program: command not found” problem: this is caused by the whitespace in “Program Files”. In other words, if your JAVA_HOME is “c:\Program Files\java”, there is a space between “Program” and “Files”. One way to solve the problem is to put your JDK in a different folder. I put my JDK in c:\java\jdk, and then everything worked well. Hope it helps.
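
    A quick way to check for this before running the scripts, sketched in plain Java. JavaHomeCheck and containsWhitespace are hypothetical names; the sketch only inspects the JAVA_HOME environment variable and does not change anything.

```java
// Hypothetical helper: flags a JAVA_HOME value that contains whitespace,
// which is what trips up the bin/hadoop script as described above.
final class JavaHomeCheck {
    // True if the path contains any whitespace character.
    static boolean containsWhitespace(String path) {
        for (int i = 0; i < path.length(); i++) {
            if (Character.isWhitespace(path.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String javaHome = System.getenv("JAVA_HOME");
        if (javaHome == null) {
            System.out.println("JAVA_HOME is not set");
        } else if (containsWhitespace(javaHome)) {
            System.out.println("JAVA_HOME contains whitespace: " + javaHome);
        } else {
            System.out.println("JAVA_HOME has no whitespace: " + javaHome);
        }
    }
}
```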

  46. Charanjeet September 16th, 2009 3:49 am

    Hi All,

    I was using the article to install Hadoop.

    While running the command
    $ bin/hadoop namenode -format

    I found that there were errors because the installed JDK was in ‘C:\Program Files’ and the command was referring to it through the environment variable JAVA_HOME. Since there is a space between ‘Program’ and ‘Files’, it was dying.

    I resolved it by creating a symbolic link as

    $ ln -s "/cygdrive/c/Program Files/java/jdk1.6.0_02" /java

    inside the ‘/’ folder through Cygwin, and made an entry in <>/conf/hadoop-env.sh like

    ‘export JAVA_HOME=/java’

    Regards
    Charanjeet singh
    Senior Engineer
    Impetus infotech India Pvt. Ltd.

  47. Ken Church September 20th, 2009 1:46 pm

    Extremely useful. I’m thinking of pointing a bunch of students at this. One detail: the tutorial has some stale links to hadoop-0.19.1 (as well as a number of references to that elsewhere in the text). It would be good to write the tutorial in such a way that the text doesn’t need to be updated with each new version.

  48. Deng Wanyu September 30th, 2009 1:17 am

    Hi:
    It is very helpful for me!
    My problem is:
    I uploaded the txt file from the command line, but the uploaded file is empty. Why?

  49. Azuryy October 14th, 2009 6:50 am

    If I don’t open five separate Cygwin windows and instead run start-all.sh, I get a “Could not obtain block” error.

    But if I open five separate Cygwin windows as described in the tutorial, it works.

  50. Azuryy October 14th, 2009 7:06 pm

    My findings:

    If you want to run start-all.sh instead of opening five separate Cygwin windows as this tutorial says, please run
    hadoop fs -put before you run start-all.sh. If you don’t, you will get a “Could not obtain block” error when you run your job.

  51. sam October 22nd, 2009 4:08 pm

    I get this error when I open the MapReduce perspective in Eclipse, and I don’t see the file after localhost->1 in DFS Locations. The errors below were in the namenode window:

    lVersion(org.apache.hadoop.dfs.ClientProtocol, 35) from 127.0.0.1:3282: error: java.io.IOException: Unknown protocol to name node: org.apache.hadoop.dfs.ClientProtocol
    java.io.IOException: Unknown protocol to name node: org.apache.hadoop.dfs.ClientProtocol
    at org.apache.hadoop.hdfs.server.namenode.NameNode.getProtocolVersion(NameNode.java:98)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
    09/10/22 15:58:08 INFO ipc.Server: IPC Server handler 4 on 9100, call getProtocolVersion(org.apache.hadoop.dfs.ClientProtocol, 35) from 127.0.0.1:3282: error: java.io.IOException: Unknown protocol to name node: org.apache.hadoop.dfs.ClientProtocol
    java.io.IOException: Unknown protocol to name node: org.apache.hadoop.dfs.ClientProtocol
    at org.apache.hadoop.hdfs.server.namenode.NameNode.getProtocolVersion(NameNode.java:98)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)

  52. Ravi October 25th, 2009 10:44 am

    Hi there, your tutorial is excellent. Very good job, and I don’t say that often.

    So I was trying to setup hbase using your hadoop tutorial. I was able to follow up to step 12 but when I try to execute

    $bin/hbase namenode -format
    : No such file or directory
    bin/hbase: line 45: $'\r': command not found

    Can you tell me what I am missing?
    Thanks
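
    An error of the form $'\r': command not found usually means that the script has Windows (CRLF) line endings, which bash under Cygwin treats as part of the command. Assuming that is the cause here, the following is a minimal sketch of detecting and removing the carriage returns. The file demo.sh is a hypothetical stand-in created in a temp directory, not a file from the hbase distribution.

```shell
#!/bin/sh
# Work in a temp directory so no real script is modified.
work=$(mktemp -d)
script="$work/demo.sh"

# Create a one-line script with a CRLF line ending, as a Windows tool might.
printf 'echo hello\r\n' > "$script"

# A literal carriage return, used to build the grep pattern below.
cr=$(printf '\r')

# Count the lines that end in a carriage return.
before=$(grep -c "$cr\$" "$script")

# Remove every carriage return character from the file.
tr -d '\r' < "$script" > "$script.unix"
after=$(grep -c "$cr\$" "$script.unix")

echo "CR-terminated lines before: $before, after: $after"

# The cleaned copy now runs without the stray \r being read as a command.
sh "$script.unix"
```

    Where it is installed, the dos2unix utility performs a similar conversion in place; whether it is available depends on which Cygwin packages were selected.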

  53. Ravi October 25th, 2009 12:10 pm

    Well, after a few internet searches and an hour later, I am able to execute it, but now I get this error:
    $ bin/hbase namenode -format
    Exception in thread “main” java.lang.NoClassDefFoundError: namenode

  54. Sharad October 29th, 2009 4:14 am

    Is there an elegant way to stop dfs? Stopping it with Ctrl-C seems to corrupt it, and bin/hadoop/stop-dfs.sh doesn’t seem to work (I get an error message like localhost: cat: cannot open file /dev/fs/C/tmp/hadoop-sk-secondarynamenode.pid : No such file or directory).

    Thanks!

  55. steve November 2nd, 2009 12:01 pm

    Great tutorial!
    I’ve almost got this working, but I’m having trouble connecting to localhost with ssh.
    If I do:
    ssh localhost -v
    the last two lines are:
    Offering public key: /home/user.name/.ssh/id_rsa
    Connection closed by xxx.x.x.x

    Any ideas what is going on?
    I also had to manually add ssh_server to administrators and change the password in order to get the sshd service to run.

    -Steve

  56. RezaMor November 9th, 2009 8:05 pm

    Thanks for your excellent tutorial! However, in the last
    step I got the following error, and I noticed that two others posted the same error in their comments.
    Would you please answer me?

    09/11/10 12:53:01 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
    09/11/10 12:53:01 INFO mapred.FileInputFormat: Total input paths to process : 4
    09/11/10 12:53:02 INFO mapred.JobClient: Running job: job_200911101209_0003
    09/11/10 12:53:03 INFO mapred.JobClient: map 0% reduce 0%
    09/11/10 12:53:13 INFO mapred.JobClient: Task Id : attempt_200911101209_0003_m_000000_0, Status : FAILED
    java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.Text, recieved org.apache.hadoop.io.LongWritable
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:558)
    at org.apache.hadoop.mapred.lib.IdentityMapper.map(IdentityMapper.java:37)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
    at org.apache.hadoop.mapred.Child.main(Child.java:158)

    09/11/10 12:53:17 INFO mapred.JobClient: Task Id : attempt_200911101209_0003_m_000000_1, Status : FAILED
    java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.Text, recieved org.apache.hadoop.io.LongWritable
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:558)
    at org.apache.hadoop.mapred.lib.IdentityMapper.map(IdentityMapper.java:37)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
    at org.apache.hadoop.mapred.Child.main(Child.java:158)

    09/11/10 12:53:22 INFO mapred.JobClient: Task Id : attempt_200911101209_0003_m_000000_2, Status : FAILED
    java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.Text, recieved org.apache.hadoop.io.LongWritable
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:558)
    at org.apache.hadoop.mapred.lib.IdentityMapper.map(IdentityMapper.java:37)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
    at org.apache.hadoop.mapred.Child.main(Child.java:158)

    java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
    at TestDriver.main(TestDriver.java:41)

  57. Rill November 24th, 2009 7:58 pm

    I have a problem with the Eclipse plugin.

    —————————————————-
    Cannot connect to the Map/Reduce location:localhost.
    Failed to get the current user’s information.
    —————————————————-

    My Windows user account requires a password to log in.

    Please help me~, thank you!

  58. Jason Venner January 1st, 2010 10:18 am

    The prohadoop website has a lot of information on Hadoop and Hadoop setup as well as a good community of people to ask and answer questions with.

    This particular error java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.Text, recieved org.apache.hadoop.io.LongWritable

    is because the input format for your job is TextInputFormat, rather than KeyValueTextInputFormat.
    TextInputFormat provides a LongWritable as the key, which is the byte offset of the line within the input file (not the line number), and a Text as the value, which is the input line data.

    KeyValueTextInputFormat provides a Text key, which is the portion of the input line up to the first TAB character, and a Text value, which is the portion of the input line after the first TAB character.

    Alternatively you can modify the definition of your Map class to accept a LongWritable as the input key type.
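
    To make the difference between the two record shapes concrete, here is a rough shell sketch that mimics what each input format hands to the mapper for a tab-separated file. It is an illustration only and not Hadoop code: the sample data and the function names text_records and kv_records are made up, and the offset arithmetic assumes ASCII input with newline-terminated lines.

```shell
#!/bin/sh
# Sample input: two tab-separated lines in a temp file (made-up data).
work=$(mktemp -d)
input="$work/sample.txt"
printf 'apple\t3\nbanana\t7\n' > "$input"

# Mimics TextInputFormat: key = offset of the line in the file, value = whole line.
text_records() {
    awk '
        BEGIN { offset = 0 }
        {
            printf "key=%d value=[%s]\n", offset, $0
            offset += length($0) + 1   # +1 for the newline terminator
        }
    ' "$1"
}

# Mimics KeyValueTextInputFormat: key = text before the first TAB,
# value = text after it. A line with no TAB becomes the key with an empty value.
kv_records() {
    awk '
        {
            i = index($0, "\t")
            if (i == 0) { key = $0; value = "" }
            else        { key = substr($0, 1, i - 1); value = substr($0, i + 1) }
            printf "key=%s value=[%s]\n", key, value
        }
    ' "$1"
}

echo "TextInputFormat-style records:"
text_records "$input"
echo "KeyValueTextInputFormat-style records:"
kv_records "$input"
```

    In the first listing the keys are numbers, which is why a mapper that passes its input key straight through emits a LongWritable where the job expects Text.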

  59. Swetha January 4th, 2010 1:17 am

    hello!
    When I run the code I get the error below. I understand there is some change in the path where the job cache files are created, but I don’t know how to change it. Any clue?
    Thanks in advance.

    INFO mapred.JobClient: Task Id : attempt_201001041128_0006_m_000006_1, Status : FAILED
    java.io.FileNotFoundException: File C:/tmp/hadoop-MBS/mapred/local/taskTracker/jobcache/job_201001041128_0006/attempt_201001041128_0006_m_000006_1/work/tmp does not exist.
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:361)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
    at org.apache.hadoop.mapred.TaskRunner.setupWorkDir(TaskRunner.java:519)
    at org.apache.hadoop.mapred.Child.main(Child.java:155)
