Tuesday, October 23, 2012

Hadoop Quest 1 : ERROR : IBM BIGINSIGHTS on Cloud not accessible from web links on control panel

Working through it with shell commands

1. From the control panel web page of the IBM master-node instance, I could not open the links to the NameNode status, JobTracker status, etc. Only one of the links, "BigInsights web console", takes me to the next web page. What should I do about this?

Diagnosis :
I started with an assumption: from the nature of the problem, I thought that restarting the instance would fix it. But first I needed to make sure which part of Hadoop was not working -> NameNode, JobTracker or TaskTracker? -> All are up and running.
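One quick way to check that on the master node is to look at which Hadoop daemons are actually running (a minimal sketch using ps and grep; jps works too if your JDK ships it):

  # List the Hadoop daemon processes running on this node
  ps -ef | grep -E 'NameNode|SecondaryNameNode|JobTracker|TaskTracker|DataNode' | grep -v grep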

Following is a snapshot of IBM's BigInsights console showing that the NameNode is also acting as the Secondary NameNode and the JobTracker:



So I started with the HDFS User Guide: http://hadoop.apache.org/docs/stable/hdfs_user_guide.html#Shell+Commands

The relevant sections read:

Shell Commands

Hadoop includes various shell-like commands that directly interact with HDFS and other file systems that Hadoop supports. The command bin/hdfs dfs -help lists the commands supported by Hadoop shell. Furthermore, the command bin/hdfs dfs -help command-name displays more detailed help for a command. These commands support most of the normal file system operations like copying files, changing file permissions, etc. It also supports a few HDFS specific operations like changing replication of files. For more information see File System Shell Guide.

DFSAdmin Command

The bin/hadoop dfsadmin command supports a few HDFS administration related operations. The bin/hadoop dfsadmin -help command lists all the commands currently supported. For e.g.:
  • -report : reports basic statistics of HDFS. Some of this information is also available on the NameNode front page.
  • -safemode : though usually not required, an administrator can manually enter or leave Safemode.
  • -finalizeUpgrade : removes previous backup of the cluster made during last upgrade.
  • -refreshNodes : Updates the set of hosts allowed to connect to the namenode. Re-reads the config file to update values defined by dfs.hosts and dfs.hosts.exclude and reads the entries (hostnames) in those files. Each entry not defined in dfs.hosts but in dfs.hosts.exclude is decommissioned. Each entry defined in dfs.hosts and also in dfs.hosts.exclude is stopped from decommissioning if it has already been marked for decommission. Entries not present in both the lists are decommissioned.
  • -printTopology : Print the topology of the cluster. Display a tree of racks and datanodes attached to the racks as viewed by the NameNode.
So I used hadoop dfsadmin -report to get a report on the HDFS name and data nodes. It showed me complete details of all the Hadoop nodes (live + dead) -> this is how I found that one of the nodes, "Data Node 3" with an I.P. like xxx.xxx.xxx.37, was dead. I removed it and provisioned another node in its place. Pretty cool stuff, right?
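For reference, a sketch of the commands (the grep filter is just one way to narrow the output down to the per-node status lines; the exact field names can differ slightly between Hadoop versions):

  # Full HDFS report: capacity plus one block of status per live/dead datanode
  hadoop dfsadmin -report
  # Narrow it down to the per-node lines to spot dead datanodes quickly
  hadoop dfsadmin -report | grep -E 'Name:|Decommission Status|Last contact'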

Now I stopped Hadoop with stop-all.sh and started all the nodes again with start-all.sh, but found out that Hive had not started, so I did the following (an equivalent scp sketch follows this list):

  1. I started one more NameNode instance.
  2. Installed WinSCP.
  3. Configured WinSCP for SSH between my laptop and the newly created NameNode.
  4. Copied the /mnt/biginsights/opt/ibm/biginsights/hive folder to a local directory on my laptop.
  5. Started one more WinSCP session for SSH between my laptop and the old NameNode.
  6. Copied the Hive files into the same location as before.
  7. Went to the hive directory.
  8. Went to its /conf subdirectory.
  9. Opened hive-site.xml with vi.
  10. In hive-site.xml, replaced the old NameNode address xxx.xxx.xxx.xxx with my NameNode's address xxx.xxx.xxx.xxx.
  11. Started Hadoop with start-all.sh.
  12. It started successfully.
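The same copy can also be done from a terminal with scp instead of WinSCP. A rough sketch, run on the old NameNode; the user name (biadmin) and host name (new-node) are placeholders, and the hive path is the one from the steps above:

  # Pull the hive directory from the fresh node onto the old NameNode
  scp -r biadmin@new-node:/mnt/biginsights/opt/ibm/biginsights/hive /mnt/biginsights/opt/ibm/biginsights/
  # Point Hive at the existing NameNode
  cd /mnt/biginsights/opt/ibm/biginsights/hive/conf
  vi hive-site.xml    # replace the old xxx.xxx.xxx.xxx with your NameNode's address
  # Restart Hadoop
  start-all.sh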


But the status pages linked from the control panel still would not open.

SO NOW I looked into the log files. By default $BIGINSIGHTS_VAR points at your BigInsights var folder, so I did the following (a grep sketch for scanning this log follows the stack trace below):

  1. cd $BIGINSIGHTS_VAR/console/log
  2. vi console-wasce.log 
  3. Found the following error:
2012-10-22 15:50:48,218 ERROR [[JobServlet]] Servlet.service() for servlet JobServlet threw exception
java.lang.NullPointerException
        at com.ibm.xap.console.job.JobUtil.jobOperationExceptionHandler(Unknown Source)
        at com.ibm.xap.console.job.JobUtil.createJob(Unknown Source)
        at com.ibm.xap.console.job.JobUtil.handleJobCmd(Unknown Source)
        at com.ibm.xap.console.job.JobOperationHandler.handleJobCmd(Unknown Source)
        at com.ibm.xap.console.job.JobOperationHandler.handleCmd(Unknown Source)
        at com.ibm.xap.console.servlet.JobServlet.doGet(Unknown Source)
        at com.ibm.xap.console.servlet.JobServlet.doPost(Unknown Source)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:713)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:806)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
        at org.apache.geronimo.tomcat.valve.DefaultSubjectValve.invoke(DefaultSubjectValve.java:56)
        at org.apache.geronimo.tomcat.GeronimoStandardContext$SystemMethodValve.invoke(GeronimoStandardContext.java:406)
        at org.apache.geronimo.tomcat.valve.GeronimoBeforeAfterValve.invoke(GeronimoBeforeAfterValve.java:47)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
        at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:555)
        at org.apache.geronimo.tomcat.valve.ThreadCleanerValve.invoke(ThreadCleanerValve.java:40)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
        at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:852)
        at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
        at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
        at java.lang.Thread.run(Thread.java:736)
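As mentioned above, a quick way to pull entries like this out of the console log is a grep with a few lines of context (a sketch; the file name is the one from the steps above):

  # Show every ERROR in the BigInsights console log with 5 lines of context
  grep -n -A 5 'ERROR' $BIGINSIGHTS_VAR/console/log/console-wasce.log | less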


This NullPointerException is the root cause of all the problems. Let's figure out what's happening in the Http11ConnectionHandler.process call...

I searched a lot for the JobUtil class file but could not find it. So ultimately I looked into two of the folders under /mnt/BI/opt/ibm/BI/ (the installation home directory for Hadoop on BigInsights):
  1. hadoop-conf -> all files pertaining to Hadoop -> hadoop.sh, hadoop-conf.xml, etc.
  2. conf -> biginsights-conf.sh -> all environment variables like $BIGINSIGHTS_VAR and $BIGINSIGHTS_HOME
Finally, I looked into the console folder under $BIGINSIGHTS_HOME, which has a folder wascs -> it contains information about the BIconsole.WAR file -> I think this might have got corrupted or something like that.
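One cheap way to test that suspicion is a zip integrity check, since a .war is just a zip archive (a sketch; the folder and file names are taken from the layout described above and may differ on your install):

  # Test every WAR under the console folder for archive corruption
  cd $BIGINSIGHTS_HOME/console
  find . -iname '*.war' -exec unzip -tq {} \;
  # "No errors detected" means the archive itself is not corrupt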

Also, as a sanity test I ran wordcount 2-3 times with no problems => Hadoop itself is fine; it's the web console WAR that is corrupt. DON'T KNOW WHAT TO DO NEXT....
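For the record, the smoke test was just the stock wordcount example (a sketch; the examples jar name and location vary by Hadoop/BigInsights version, so treat that path as a placeholder):

  # Put a small input file into HDFS and run the bundled wordcount job
  hadoop fs -mkdir /tmp/wc-in
  hadoop fs -put /etc/hosts /tmp/wc-in/
  hadoop jar $HADOOP_HOME/hadoop-examples-*.jar wordcount /tmp/wc-in /tmp/wc-out
  hadoop fs -cat /tmp/wc-out/part-* | head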

UPDATE ON 10/23/2012

SO I finally resolved the problem. I found the following exception in $BIGINSIGHTS_VAR, under /mnt/BI/var/ibm/BI/hadoop/logs/hadoop-<username>-namenode-<hostname>.log:

2012-10-23 13:39:48,297 INFO org.apache.hadoop.ipc.Server: IPC Server handler 5 on 9000, call delete(/hadoop/mapred/system/job_201210221816_0013, true) from 170.224.161.37:36870: error: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot delete /hadoop/mapred/system/job_201210221816_0013. Name node is in safe mode.
The ratio of reported blocks 1.0000 has reached the threshold 0.9990. Safe mode will be turned off automatically in 23 seconds.
org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot delete /hadoop/mapred/system/job_201210221816_0013. Name node is in safe mode.
The ratio of reported blocks 1.0000 has reached the threshold 0.9990. Safe mode will be turned off automatically in 23 seconds.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.deleteInternal(FSNamesystem.java:1700)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.delete(FSNamesystem.java:1680)
at org.apache.hadoop.hdfs.server.namenode.NameNode.delete(NameNode.java:517)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:60)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
at java.lang.reflect.Method.invoke(Method.java:611)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
at java.security.AccessController.doPrivileged(AccessController.java:284)
at javax.security.auth.Subject.doAs(Subject.java:573)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)

The solution is simple: this error occurs when the NameNode is still in safe mode, e.g. because some blocks were never reported in. So I had to force the NameNode to leave safe mode (hadoop dfsadmin -safemode leave) and then (optionally) run an fsck to delete missing files, as shown below.
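Concretely, the commands are the standard Hadoop admin ones (note that fsck -delete really does remove files with missing blocks, so it is the optional clean-up step, not something to run blindly):

  # Check the current safe mode state
  hadoop dfsadmin -safemode get
  # Force the NameNode to leave safe mode
  hadoop dfsadmin -safemode leave
  # Optionally: check HDFS health, then delete corrupted files with missing blocks
  hadoop fsck /
  hadoop fsck / -delete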

I only ran the safe mode leave command and then clicked through the links from the BigInsights web console. Everything is working now.








1 comment:

  1. This is a recurring error. Thanks for posting the solution here.
