Saturday, November 10, 2012

GOF : Creational 1. Factory Method Design pattern


What is a factory?

  • A factory is an object for creating other objects. 
  • It is an abstraction of a constructor, and 
  • it can be used to implement various allocation schemes. For example, under this definition, singletons implemented by the Singleton pattern are formal factories.
  • A factory object typically has a method for every kind of object it is capable of creating. These methods optionally accept parameters defining how the object is created, and then return the created object.
  • Advanced use: the factory object might decide to create the object's class (if applicable) dynamically, return it from an object pool, perform complex configuration on the object, or do other things.
  • These kinds of objects have proven useful and several design patterns have been developed to implement them in many languages. For example, several "GoF patterns", like the "Factory method pattern", the "Builder" or even the "Singleton" are implementations of this concept. The "Abstract factory pattern" instead is a method to build collections of factories.
  • Factory objects are common in toolkits and frameworks where library code needs to create objects of types which may be subclassed by applications using the framework. They are also used in test-driven development to allow classes to be put under test.

Use of factory : Where and When

  • Factories determine the actual concrete type of the object to be created;
  • it is inside the factory that the object is actually created.
  • Because the factory returns only an abstract pointer (or interface reference), the client code does not know - and is not burdened by - the actual concrete type of the object that was just created. However, the concrete type is known to the factory. In particular, this means:

  • The client code has no knowledge whatsoever of the concrete type, so it does not need to include any header files or class declarations relating to the concrete type. 
  • The client code deals only with the abstract type. 
  • Objects of a concrete type are indeed created by the factory, but the client code accesses such objects only through their abstract interface.
  • Adding new concrete types is done by modifying the client code to use a different factory, a modification which is typically one line in one file. This is significantly easier than modifying the client code to instantiate a new type, which would require changing every location in the code where a new object is created.

Factory Method Pattern 

Factory methods are static methods that return an instance of the class they are defined in (or of one of its subtypes). Examples in the JDK include LogManager.getLogManager(), Pattern.compile(String), Integer.valueOf(int) and Calendar.getInstance().
Factory methods :
  • have names, unlike constructors, which can clarify code.
  • do not need to create a new object upon each invocation - objects can be cached and reused, if necessary (see the caching sketch after this list).
  • can return a subtype of their return type - in particular, can return an object whose implementation class is unknown to the caller. This is a very valuable and widely used feature in many frameworks which use interfaces as the return type of static factory methods.
  • are useful when the creation of an object requires access to information or resources that should not be contained within the composing class.
  • Common names for factory methods include getInstance and valueOf. These names are not mandatory - choose whatever makes sense for each case.
  • When factory methods are used to disambiguate between construction paths that take identical parameter lists (as in the Complex example below), the constructor is often made private to force clients to use the factory methods.
  • Factory methods encapsulate the creation of objects. This can be useful if the creation process is very complex, for example if it depends on settings in configuration files or on user input.
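
To illustrate the caching point above, here is a minimal sketch (not from the original post; the Bool class is a made-up stand-in for the caching behaviour of java.lang.Boolean.valueOf(boolean)):

public final class Bool {

    // Pre-built instances; the factory hands these out instead of allocating new objects.
    private static final Bool TRUE = new Bool(true);
    private static final Bool FALSE = new Bool(false);

    private final boolean value;

    // Private constructor forces clients through the factory method.
    private Bool(boolean value) {
        this.value = value;
    }

    // Static factory method: no new object is created per call.
    public static Bool valueOf(boolean value) {
        return value ? TRUE : FALSE;
    }
}

The Complex examples that follow show factory methods in practice: the C# Complex class uses named factories to disambiguate Cartesian and polar construction, and the Java ComplexNumber class pairs a valueOf factory with a private constructor.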


public class Complex
{
    public double real;
    public double imaginary;

    // Named factory methods disambiguate two ways of building a Complex that would
    // otherwise require two constructors with identical parameter lists.
    public static Complex FromCartesianFactory(double real, double imaginary)
    {
        return new Complex(real, imaginary);
    }

    public static Complex FromPolarFactory(double modulus, double angle)
    {
        return new Complex(modulus * Math.Cos(angle), modulus * Math.Sin(angle));
    }

    // Private constructor: instances can only be created through the factories.
    private Complex(double real, double imaginary)
    {
        this.real = real;
        this.imaginary = imaginary;
    }
}
 
Complex product = Complex.FromPolarFactory(1, Math.PI);


public class ComplexNumber {

  /**
  * Static factory method returns an object of this class.
  */
  public static ComplexNumber valueOf(float aReal, float aImaginary) {
    return new ComplexNumber(aReal, aImaginary);
  }

  /**
  * Caller cannot see this private constructor.
  *
  * The only way to build a ComplexNumber is by calling the static 
  * factory method.
  */
  private ComplexNumber (float aReal, float aImaginary) {
    fReal = aReal;
    fImaginary = aImaginary;
  }

  private float fReal;
  private float fImaginary;

  //..elided
} 
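
Client code can then obtain instances only through the factory method, for example:

ComplexNumber c = ComplexNumber.valueOf(1.0f, 2.0f);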

Limitations

There are three limitations associated with the use of the factory method. The first relates to refactoring existing code; the other two relate to extending a class.
  • The first limitation is that refactoring an existing class to use factories breaks existing clients. For example, if class Complex were a standard class, it might have numerous clients with code like:
Complex c = new Complex(-1, 0);
Once we realize that two different factories are needed, we change the class (to the code shown earlier). But since the constructor is now private, the existing client code no longer compiles.
  • The second limitation is that, since the pattern relies on using a private constructor, the class cannot be extended. Any subclass must invoke the inherited constructor, but this cannot be done if that constructor is private.
  • The third limitation is that, if we do extend the class (e.g., by making the constructor protected, which is risky but feasible), the subclass must provide its own re-implementation of all factory methods with exactly the same signatures. For example, if class StrangeComplex extends Complex, then unless StrangeComplex provides its own version of all factory methods, the call
    StrangeComplex.FromPolarFactory(1, Math.PI);
    
    will yield an instance of Complex (the superclass) rather than the expected instance of the subclass; a sketch of the required re-implementation follows below. The reflection features of some languages can obviate this issue.
All three problems could be alleviated by altering the underlying programming language to make factories first-class class members (see also Virtual class).[4]
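
A minimal Java sketch of the third limitation (not from the original article; it assumes the constructor has been relaxed to protected and mirrors a simplified Complex with a single factory method, names adapted to Java conventions):

public class Complex {

    protected final double real;
    protected final double imaginary;

    // Relaxed from private to protected so that subclassing is possible at all.
    protected Complex(double real, double imaginary) {
        this.real = real;
        this.imaginary = imaginary;
    }

    public static Complex fromPolar(double modulus, double angle) {
        return new Complex(modulus * Math.cos(angle), modulus * Math.sin(angle));
    }
}

class StrangeComplex extends Complex {

    protected StrangeComplex(double real, double imaginary) {
        super(real, imaginary);
    }

    // Without this re-declaration, StrangeComplex.fromPolar(1, Math.PI) would resolve
    // to the inherited static method and return a plain Complex, not a StrangeComplex.
    public static StrangeComplex fromPolar(double modulus, double angle) {
        return new StrangeComplex(modulus * Math.cos(angle), modulus * Math.sin(angle));
    }
}

Every factory method has to be repeated this way in every subclass, which is exactly the maintenance burden the third limitation describes.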

      Thursday, November 8, 2012

      Quickies for Quick Results : Jar and JUnit

      1. Updating a Jar File

      The Jar tool provides a u option which you can use to update the contents of an existing JAR file by modifying its manifest or by adding files.

      The basic command for adding files has this format:
      jar uf jar-file input-file(s)
      
      In this command:
      • The u option indicates that you want to update an existing JAR file.
      • The f option indicates that the JAR file to update is specified on the command line.
      • jar-file is the existing JAR file that's to be updated.
      • input-file(s) is a space-delimited list of one or more files that you want to add to the JAR file.
      Also, please remember the following before executing the command :
      • Any files already in the archive having the same pathname as a file being added will be overwritten.
      • When creating or updating a JAR file, you can optionally use the -C option to indicate a change of directory.
      If you are using the Windows command line, put the class file you want to replace in the same directory hierarchy (relative to where you run the jar command) as it has inside the JAR file. Otherwise the file will either be added at some other location or will not replace the existing entry at all.
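
      For example, if the class to replace is org/apache/hadoop/mapred/abc.class (as in the note below) and that relative path exists under the current directory, the update command would look like this (the JAR file name here is just an example):
      jar uf hadoop-core.jar org/apache/hadoop/mapred/abc.class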

      Also, use Zip or 7-Zip to look inside the JAR file. This will help you to:
      • Obtain the fully qualified name of the class file you want to replace (using it you can recreate the matching folder structure, e.g. org/apache/hadoop/mapred/abc.class, for the file on disk).
      • Recheck the timestamp of the file in the updated JAR after running the jar command. If it is still the old one, the entry was not replaced and you need to redo the update.
      2. Running tests using JUnit (TestCase):

      If your test case class is packaged in a JAR file, use one of the following commands depending on your JUnit version.
      [test class name] is the fully qualified name of your test class.
      For JUnit 4.x:
      java -cp /usr/share/java/junit.jar:{any other jar files/ your jar file where your test case resides} org.junit.runner.JUnitCore [test class name]
      
      But if you are using JUnit 3.x, please note that the runner class name is different:
      java -cp /usr/share/java/junit.jar:{any other jar files/ your jar file where your test case resides} junit.textui.TestRunner [test class name]
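
      For reference, a minimal JUnit 4 test class that the JUnit 4.x command above could run (the package and class names are made up; you would pass com.example.SimpleTest as the test class name):

      package com.example;

      import static org.junit.Assert.assertEquals;

      import org.junit.Test;

      public class SimpleTest {

          // JUnitCore discovers and runs this method via the @Test annotation.
          @Test
          public void additionWorks() {
              assertEquals(4, 2 + 2);
          }
      }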

      Hadoop Releases, Projects and features in a nutshell

      Hadoop Features in HDFS and MR across releases



      News on releases : Oct 2012 : 

      9 October, 2012: Release 2.0.2-alpha available

      This is the second (alpha) version in the hadoop-2.x series.
      This delivers significant enhancements to HDFS HA. It also has a significantly more stable version of YARN which, at the time of release, had already been deployed on a 2,000-node cluster.
      Please see the Hadoop 2.0.2-alpha Release Notes for details.
      Latest Hadoop : http://hadoop.apache.org/docs/current/ : 2.0.2
      Latest Stable Release : http://hadoop.apache.org/docs/stable/ : 1.0.4

      Common
      A set of components and interfaces for distributed filesystems and general I/O
      (serialization, Java RPC, persistent data structures).


      Avro
      A serialization system for efficient, cross-language RPC and persistent data
      storage.

      MapReduce
      A distributed data processing model and execution environment that runs on large
      clusters of commodity machines.

      HDFS
      A distributed filesystem that runs on large clusters of commodity machines.

      Pig
      A data flow language and execution environment for exploring very large datasets.
      Pig runs on HDFS and MapReduce clusters.

      Hive
      A distributed data warehouse. Hive manages data stored in HDFS and provides a
      query language based on SQL (and which is translated by the run time engine to
      MapReduce jobs) for querying the data.

      HBase
      A distributed, column-oriented database. HBase uses HDFS for its underlying
      storage, and supports both batch-style computations using MapReduce and point
      queries (random reads).

      ZooKeeper
      A distributed, highly available coordination service. ZooKeeper provides primitives
      such as distributed locks that can be used for building distributed applications.


      Sqoop
      A tool for efficient bulk transfer of data between structured data stores (such as
      relational databases) and HDFS.

      Oozie
      A service for running and scheduling workflows of Hadoop jobs (including Map-
      Reduce, Pig, Hive, and Sqoop jobs).

      Note reference: Hadoop: The Definitive Guide (3rd Edition) & the Apache Hadoop site (hadoop.apache.org)

      2. SQL statements interview questions: a must know list

      The JOIN concept
      JOIN is a query clause that can be used with SELECT, UPDATE, and DELETE statements to simultaneously affect rows from multiple tables. There are several distinct types of JOIN that return different result sets.

      Joined tables must each include at least one field in both tables that contain comparable data. For example, if you want to join a Customer table and a Transaction table, they both must contain a common element, such as a CustomerID column, to serve as a key on which the data can be matched. Tables can be joined on multiple columns so long as the columns have the potential to supply matching information. Column names across tables don't have to be the same, although for readability this standard is generally preferred.

      Now that we’ve examined the basic theory, let’s take a look at the various types of joins and examples of each.

      The basic JOIN statement
      A basic JOIN statement has the following format:
      SELECT Customer.CustomerID, TransID, TransAmt
      FROM Customer JOIN Transaction
      ON Customer.CustomerID = Transaction.CustomerID;


      In practice, you would rarely use the example above because the type of join is not specified; in this case, SQL Server assumes an INNER JOIN. You can get the equivalent of this query by using the older comma-separated syntax (note that the WHERE clause is required, otherwise you get a cross product):
      SELECT Customer.CustomerID, TransID, TransAmt
      FROM Customer, Transaction
      WHERE Customer.CustomerID = Transaction.CustomerID;

      However, the example is useful to point out a few noteworthy concepts:
      • TransID and TransAmt do not require fully qualified names because they exist in only one of the tables. You can use fully qualified names for readability if you wish.
      • The Customer table is considered to be the “left” table because it was called first. Likewise, the Transaction table is the “right” table.
      • You can use more than two tables, in which case each one is “naturally” joined to the cumulative result in the order they are listed, unless controlled by other functionality such as “join hints” or parentheses.
      • You may use WHERE and ORDER BY clauses with any JOIN statement to limit the scope of your results. Note that these clauses are applied to the results of your JOIN statement.
      • SQL Server does not require the semicolon (;), but I use it in the included examples to denote the end of each statement, as would be expected by most other RDBMSs.
      The INNER JOIN drops rows
      When you perform an INNER JOIN, only rows that match up are returned. Any time a row from either table doesn’t have corresponding values from the other table, it is disregarded. Because stray rows aren’t included, you don’t have any of the “left” and “right” nonsense to deal with and the order in which you present tables matters only if you have more than two to compare. Since this is a simple concept, here’s a simple example:

      SELECT CustomerName, TransDate
      FROM Customer INNER JOIN Transaction
      ON Customer.CustomerID = Transaction.CustomerID;


      If a row in the Transaction table contains a CustomerID that’s not listed in the Customer table, that row will not be returned as part of the result set. Likewise, if the Customer table has a CustomerID with no corresponding rows in the Transaction table, the row from the Customer table won’t be returned.


      The OUTER JOIN can include mismatched rows
      OUTER JOINs, sometimes called “complex joins,” aren’t actually complicated. They are so called because SQL Server performs two functions for each OUTER JOIN.

      The first function performed is an INNER JOIN. The second function includes the rows that the INNER JOIN would have dropped. Which rows are included depends on the type of OUTER JOIN that is used and the order the tables were presented.

      There are three types of an OUTER JOIN: LEFT, RIGHT, and FULL. As you’ve probably guessed, the LEFT OUTER JOIN keeps the stray rows from the “left” table (the one listed first in your query statement). In the result set, columns from the other table that have no corresponding data are filled with NULL values. Similarly, the RIGHT OUTER JOIN keeps stray rows from the right table, filling columns from the left table with NULL values. The FULL OUTER JOIN keeps all stray rows as part of the result set. Here is your example:
      SELECT CustomerName, TransDate, TransAmt
      FROM Customer LEFT OUTER JOIN Transaction
      ON Customer.CustomerID = Transaction.CustomerID;

      Customer names that have no associated transactions will still be displayed. However, transactions with no corresponding customers will not, because we used a LEFT OUTER JOIN and the Customer table was listed first.

      In SQL Server, the word OUTER is actually optional. The clauses LEFT JOIN, RIGHT JOIN, and FULL JOIN are equivalent to LEFT OUTER JOIN, RIGHT OUTER JOIN, and FULL OUTER JOIN, respectively.
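
      For completeness, here is the same query rewritten as the FULL OUTER JOIN described above, which keeps stray rows from both tables (columns from the missing side are filled with NULL values):
      SELECT CustomerName, TransDate, TransAmt
      FROM Customer FULL OUTER JOIN Transaction
      ON Customer.CustomerID = Transaction.CustomerID;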

      1. SQL statements interview questions: a must know list


      1. Sometimes you will want to list only the different (distinct) values in a column of a table.
      The DISTINCT keyword can be used to return only distinct (different) values.

      SQL SELECT DISTINCT Syntax

      SELECT DISTINCT column_name(s)
      FROM table_name
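
      For example, using the Persons table shown further below, the following returns each city only once:

      SELECT DISTINCT City FROM Persons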

      2. SQL UNIQUE Constraint
      The UNIQUE constraint ensures that all values in a column (or combination of columns) are different, so it uniquely identifies each record in a database table.
      The UNIQUE and PRIMARY KEY constraints both provide a guarantee for uniqueness for a column or set of columns.
      A PRIMARY KEY constraint automatically has a UNIQUE constraint defined on it.
      Note that you can have many UNIQUE constraints per table, but only one PRIMARY KEY constraint per table.

      SQL UNIQUE Constraint on CREATE TABLE

      The following SQL creates a UNIQUE constraint on the P_Id and LastName columns when the "Persons" table is created:
      MySQL:
      CREATE TABLE Persons
      (
      P_Id int NOT NULL,
      LastName varchar(255) NOT NULL,
      FirstName varchar(255),
      Address varchar(255),
      City varchar(255),
      CONSTRAINT uc_PersonID UNIQUE (P_Id,LastName)
      )

      To add the same UNIQUE constraint when the "Persons" table already exists, use ALTER TABLE:
      ALTER TABLE Persons
      ADD CONSTRAINT uc_PersonID UNIQUE (P_Id,LastName)


      3. The ORDER BY Keyword

      The ORDER BY keyword is used to sort the result-set by a specified column.
      The ORDER BY keyword sorts the records in ascending order by default.
      If you want to sort the records in a descending order, you can use the DESC keyword.

      SQL ORDER BY Syntax

      SELECT column_name(s)
      FROM table_name
      ORDER BY column_name(s) ASC|DESC

      SELECT * FROM Persons
      ORDER BY LastName
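
      To sort the same result in descending order, append the DESC keyword mentioned above:

      SELECT * FROM Persons
      ORDER BY LastName DESC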


      4. The LIKE Operator
      The LIKE operator is used to search for a specified pattern in a column.

      The "Persons" table:
      P_Id  LastName    FirstName  Address       City
      1     Hansen      Ola        Timoteivn 10  Sandnes
      2     Svendson    Tove       Borgvn 23     Sandnes
      3     Pettersen   Kari       Storgt 20     Stavanger


      We use the following SELECT statement:
      SELECT * FROM Persons
      WHERE City LIKE '%s'
      The result-set will look like this:
      P_Id  LastName    FirstName  Address       City
      1     Hansen      Ola        Timoteivn 10  Sandnes
      2     Svendson    Tove       Borgvn 23     Sandnes



      We use the following SELECT statement:
      SELECT * FROM Persons
      WHERE City LIKE '%tav%'
      The result-set will look like this:
      P_Id  LastName    FirstName  Address       City
      3     Pettersen   Kari       Storgt 20     Stavanger



      It is also possible to select the persons living in a city that does NOT contain the pattern "tav" from the "Persons" table, by using the NOT keyword.
      We use the following SELECT statement:
      SELECT * FROM Persons
      WHERE City NOT LIKE '%tav%'
      The result-set will look like this:
      P_Id  LastName    FirstName  Address       City
      1     Hansen      Ola        Timoteivn 10  Sandnes
      2     Svendson    Tove       Borgvn 23     Sandnes


      Tuesday, October 23, 2012

      Hadoop Quest 1 : ERROR : IBM BIGINSIGHTS on Cloud not accessible from web links on control panel

      Using commands

      1. I could not get to the web pages for NameNode status, JobTracker status, etc. from the web page of the IBM master node instance. Only one of the links, "BigInsights web console", takes me to the next web page. What should I do in this regard?

      Diagnosis :
      I started with an assumption: from the nature of the problem I thought that restarting the instance would fix it. But first I needed to make sure which part of Hadoop was not working -> NameNode, JobTracker or TaskTracker? -> All were up and running.

      Following is a snapshot of IBM's BigInsights console showing that the NameNode is also acting as the Secondary NameNode and JobTracker:



      So I started with the UserGuide : http://hadoop.apache.org/docs/stable/hdfs_user_guide.html#Shell+Commands

      Info is like :

      Shell Commands

      Hadoop includes various shell-like commands that directly interact with HDFS and other file systems that Hadoop supports. The command bin/hdfs dfs -help lists the commands supported by the Hadoop shell. Furthermore, the command bin/hdfs dfs -help command-name displays more detailed help for a command. These commands support most of the normal file system operations like copying files, changing file permissions, etc. It also supports a few HDFS-specific operations like changing replication of files. For more information see the File System Shell Guide.

      DFSAdmin Command

      The bin/hadoop dfsadmin command supports a few HDFS administration related operations. The bin/hadoop dfsadmin -help command lists all the commands currently supported. For e.g.:
      • -report : reports basic statistics of HDFS. Some of this information is also available on the NameNode front page.
      • -safemode : though usually not required, an administrator can manually enter or leave Safemode.
      • -finalizeUpgrade : removes previous backup of the cluster made during last upgrade.
      • -refreshNodes : Updates the set of hosts allowed to connect to the namenode. Re-reads the config file to update values defined by dfs.hosts and dfs.hosts.exclude and reads the entries (hostnames) in those files. Each entry not defined in dfs.hosts but in dfs.hosts.exclude is decommissioned. Each entry defined in dfs.hosts and also in dfs.hosts.exclude is stopped from decommissioning if it has already been marked for decommission. Entries not present in both the lists are decommissioned.
      • -printTopology : Print the topology of the cluster. Display a tree of racks and datanodes attached to the racks as viewed by the NameNode.
      So I used hadoop dfsadmin -report to get the report on the HDFS name and data nodes. It showed me the complete details of all the Hadoop nodes (active + dead) -> this way I was able to diagnose that one of the nodes, "Data Node 3" (with an IP like xxx.xxx.xxx.37), was dead. I removed it and created another node in its place. "Pretty cool stuff right?"

      Now I stopped Hadoop with stop-all.sh and started all the nodes with start-all.sh, but found out that Hive was not started, so:

      1. I started one more name node 
      2. Installed winscp
      3. Configured winscp for ssh between my laptop and newly created name node.
      4. Copied the /mnt/biginsights/opt/ibm/biginsights/hive folder to a local directory on my laptop.
      5. Started one more winscp for ssh between my laptop and old name node.
      6. Copied the Hive files into the same location as they were earlier.
      7. Now go to the hive directory,
      8. then into its /conf subdirectory,
      9. and open hive-site.xml (vi hive-site.xml).
      10. In the site configuration, change the host IP xxx.xxx.xxx.xxx to your NameNode's IP xxx.xxx.xxx.xxx.
      11. Start hadoop by start-all.sh
      12. It starts successfully.


      But still the sites from the link are not opening up.

      SO NOW I looked into the log files. By default $BIGINSIGHTS_VAR points to your biginsights folder, so I did the following:

      1. cd $BIGINSIGHTS_VAR/console/log
      2. vi console-wasce.log 
      3. Found following error : 
      2012-10-22 15:50:48,218 ERROR [[JobServlet]] Servlet.service() for servlet JobServlet threw exception
      java.lang.NullPointerException
              at com.ibm.xap.console.job.JobUtil.jobOperationExceptionHandler(Unknown Source)
              at com.ibm.xap.console.job.JobUtil.createJob(Unknown Source)
              at com.ibm.xap.console.job.JobUtil.handleJobCmd(Unknown Source)
              at com.ibm.xap.console.job.JobOperationHandler.handleJobCmd(Unknown Source)
              at com.ibm.xap.console.job.JobOperationHandler.handleCmd(Unknown Source)
              at com.ibm.xap.console.servlet.JobServlet.doGet(Unknown Source)
              at com.ibm.xap.console.servlet.JobServlet.doPost(Unknown Source)
              at javax.servlet.http.HttpServlet.service(HttpServlet.java:713)
              at javax.servlet.http.HttpServlet.service(HttpServlet.java:806)
              at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
              at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
              at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
              at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
              at org.apache.geronimo.tomcat.valve.DefaultSubjectValve.invoke(DefaultSubjectValve.java:56)
              at org.apache.geronimo.tomcat.GeronimoStandardContext$SystemMethodValve.invoke(GeronimoStandardContext.java:406)
              at org.apache.geronimo.tomcat.valve.GeronimoBeforeAfterValve.invoke(GeronimoBeforeAfterValve.java:47)
              at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
              at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
              at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
              at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:555)
              at org.apache.geronimo.tomcat.valve.ThreadCleanerValve.invoke(ThreadCleanerValve.java:40)
              at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
              at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:852)
              at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
              at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
              at java.lang.Thread.run(Thread.java:736)


      This is the root cause of all the problems. Let's figure out what's happening in Http11ConnectionHandler.process...

      I searched a lot for the JobUtil source but could not find it. So ultimately I looked into two of the folders under /mnt/BI/opt/ibm/BI/ (the home directory of the BigInsights Hadoop installation; BI here is short for biginsights):
      1. hadoop-conf -> all files pertaining to Hadoop -> hadoop.sh, hadoop-conf.xml, etc.
      2. conf -> biginsights-conf.sh -> all environment variables like $BIGINSIGHTS_VAR and $BIGINSIGHTS_HOME
      Finally, I looked into the console folder under $BIGINSIGHTS_HOME, which has a folder wascs -> it contains information about the BIconsole.WAR file -> I thought this might have become corrupted or something like that. 

      Also I did tests and ran wordcount 2-3 times with no problem => Hadoop is fine; its web console WAR is what is corrupt. DON'T KNOW WHAT TO DO NEXT....

      UPDATE ON 10/23/2012

      SO I finally resolved the problem. Apparently I found the following exception in $BIGINSIGHTS_VAR or /mnt/bI/var/ibm/BI/hadoop/logs/hadoop-<username>-namenode-<hostname>.log

      2012-10-23 13:39:48,297 INFO org.apache.hadoop.ipc.Server: IPC Server handler 5 on 9000, call delete(/hadoop/mapred/system/job_201210221816_0013, true) from 170.224.161.37:36870: error: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot delete /hadoop/mapred/system/job_201210221816_0013. Name node is in safe mode.
      The ratio of reported blocks 1.0000 has reached the threshold 0.9990. Safe mode will be turned off automatically in 23 seconds.
      org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot delete /hadoop/mapred/system/job_201210221816_0013. Name node is in safe mode.
      The ratio of reported blocks 1.0000 has reached the threshold 0.9990. Safe mode will be turned off automatically in 23 seconds.
      at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.deleteInternal(FSNamesystem.java:1700)
      at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.delete(FSNamesystem.java:1680)
      at org.apache.hadoop.hdfs.server.namenode.NameNode.delete(NameNode.java:517)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:60)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
      at java.lang.reflect.Method.invoke(Method.java:611)
      at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
      at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
      at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
      at java.security.AccessController.doPrivileged(AccessController.java:284)
      at javax.security.auth.Subject.doAs(Subject.java:573)
      at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)

      The solution is simple: this error occurs when some blocks were never reported in. So I had to force the NameNode to leave safe mode (hadoop dfsadmin -safemode leave) and then (optionally) run an fsck to delete the missing files; the commands are shown below.
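
      For reference, the two commands (the fsck path here is the HDFS root; adjust it to the directory you care about):

      hadoop dfsadmin -safemode leave
      hadoop fsck / -delete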

      I only ran the leave-safemode command and then clicked on the console link from the BigInsights web console. Everything is working now.








      Friday, October 12, 2012

      Papers on Map Reduce

      Following are a few papers that might interest you if you are in the field of Machine Learning, Data Mining and Big Data.

      Atbrox is a startup company providing technology and services for Search and MapReduce/Hadoop. Our background is from Google, IBM and research. Contact us if you need help with algorithms for MapReduce.
      This posting is the May 2010 update to the similar posting from February 2010, with 30 new papers compared to the prior posting; new ones are marked with *.
      Motivation
      Learn from the academic literature about how the MapReduce parallel model and the Hadoop implementation are used to solve algorithmic problems.
      Which areas do the papers cover?
        Ads Analysis
        For an example of Parallel Machine Learning with Hadoop/Mapreduce, check out our previous blog post.

      Who wrote the above papers?
      Companies: China Mobile, eBay, Google, Hewlett Packard and Intel, Microsoft, Wikipedia, Yahoo and Yandex.
      Government Institutions and Universities: US National Security Agency (NSA), Carnegie Mellon University, TU Dresden, University of Pennsylvania, University of Central Florida, National University of Ireland, University of Missouri, University of Arizona, University of Glasgow, Berkeley University and National Tsing Hua University, University of California, Poznan University, Florida International University, Zhejiang University, Texas A&M University, University of California at Irvine, University of Illinois, Chinese Academy of Sciences, Vrije Universiteit, Engenharia University, State University of New York, Palacky University, University of Texas at Dallas