Quickly executing multiple HBase commands from the shell

We tend to set up our HBase test tables via shell scripts, which means executing many puts and deletes in a row. If you start a separate hbase shell call for each single operation, you will wait a long time until all commands are processed. A much faster approach is to put all commands into one single <<EOF … EOF (heredoc) construct, which basically passes a multi-line parameter to the shell.

hbase shell <<EOF
put 'shared:ITEMS','item0001','c:value','1.0'
put 'shared:ITEMS','item0002','c:value','2.0'
put 'shared:ITEMS','item0003','c:value','3.0'
put 'shared:ITEMS','item0004','c:value','4.0'
put 'shared:ITEMS','item0005','c:value','5.0'
put 'shared:ITEMS','item0006','c:value','6.0'
EOF

This way, only one HBase shell instance is started instead of a new one for every single command.

ClassNotFoundException in a Spark application using the KryoSerializer

We frequently encountered a ClassNotFoundException in our Java-based Spark applications for classes that we verifiably included in our application’s JAR. Furthermore, we used the KryoSerializer (org.apache.spark.serializer.KryoSerializer) for performance reasons.

After some very annoying debugging sessions we found out that we can get rid of the exception by registering the apparently missing classes with Kryo, i.e. by adding them to the Spark configuration property spark.kryo.classesToRegister. This property is a simple comma-separated list of fully qualified class names. After registering the classes, the ClassNotFoundException disappeared.
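
For illustration, here is a minimal sketch of how that registration can look on the SparkConf; the class names com.example.MyEvent and com.example.MyKey are just placeholders for whatever classes the exception complains about:

import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;

public class KryoRegistrationExample {

  public static void main(String[] args) {
    SparkConf conf = new SparkConf()
        .setAppName("KryoRegistrationExample")
        // Use Kryo for serialization ...
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        // ... and register the classes the ClassNotFoundException complained about.
        // com.example.MyEvent and com.example.MyKey are placeholders for your own classes.
        .set("spark.kryo.classesToRegister", "com.example.MyEvent,com.example.MyKey");

    SparkSession spark = SparkSession.builder().config(conf).getOrCreate();
    // ... run the actual job here ...
    spark.stop();
  }
}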

Calculating prime numbers using Spark

Hi, for a test I wrote a short Java application that calculates prime numbers between 0 and 1,000,000 on a distributed Spark cluster. Since Spark 2.x examples are rare on the internet, I'll just leave this here. The prime number code is by Oscar Sanchez.

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;
import scala.Tuple2;
 
public class Main {
 
  // Simple trial division up to sqrt(n); numbers below 2 are not prime.
  private static boolean isPrime(long n) {
    if (n < 2) {
      return false;
    }
    for (long i = 2; i * i <= n; i++) {
      if (n % i == 0) {
        return false;
      }
    }
    return true;
  }
 
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("PrimeApp").getOrCreate();
    // Map every number in [0, 1000000) to a (number, isPrime) tuple in parallel.
    Dataset<Tuple2<Long, Boolean>> rnd = spark.range(0L, 1000000L).map(
        (MapFunction<Long, Tuple2<Long, Boolean>>) x -> new Tuple2<>(x, isPrime(x)),
        Encoders.tuple(Encoders.LONG(), Encoders.BOOLEAN()));
    rnd.show(false);
    spark.stop();
  }
 
}

Reading Parquet Files from a Java Application

Recently I came across the requirement to read a Parquet file into a Java application, and I found out that it is neither well documented nor easy to do. As a consequence I wrote a short tutorial. The first task is to add the Maven dependencies.

<dependencies>
  <dependency>
    <groupId>org.apache.parquet</groupId>
    <artifactId>parquet-hadoop</artifactId>
    <version>1.9.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.7.0</version>
  </dependency>
</dependencies>

Writing the Java application is easy once you know how to do it. Instead of the AvroParquetReader or ParquetReader classes that you frequently find when searching for a way to read Parquet files, use the class ParquetFileReader. The basic approach is to iterate over all row groups and then read the records of each row group.
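
As a rough sketch of that approach (the path /tmp/example.parquet is just a placeholder), reading and printing every record could look like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.page.PageReadStore;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.convert.GroupRecordConverter;
import org.apache.parquet.format.converter.ParquetMetadataConverter;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;
import org.apache.parquet.io.ColumnIOFactory;
import org.apache.parquet.io.MessageColumnIO;
import org.apache.parquet.io.RecordReader;
import org.apache.parquet.schema.MessageType;

public class ParquetReadExample {

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path("/tmp/example.parquet"); // placeholder path

    // Read the footer first to obtain the schema and the row group metadata.
    ParquetMetadata footer =
        ParquetFileReader.readFooter(conf, path, ParquetMetadataConverter.NO_FILTER);
    MessageType schema = footer.getFileMetaData().getSchema();

    try (ParquetFileReader reader = new ParquetFileReader(conf, path, footer)) {
      PageReadStore rowGroup;
      // Iterate over all row groups ...
      while ((rowGroup = reader.readNextRowGroup()) != null) {
        long rows = rowGroup.getRowCount();
        MessageColumnIO columnIO = new ColumnIOFactory().getColumnIO(schema);
        RecordReader<Group> recordReader =
            columnIO.getRecordReader(rowGroup, new GroupRecordConverter(schema));
        // ... and read every record of the current row group.
        for (long i = 0; i < rows; i++) {
          Group record = recordReader.read();
          System.out.println(record);
        }
      }
    }
  }
}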

Continue reading “Reading Parquet Files from a Java Application”

Java client for the HBase REST API

Modern open source projects frequently lack documentation, and HBase is no exception regarding its REST API. Even though REST is covered in the official HBase reference book, the guidelines are not complete, as I’ll show in the listings below. A second source you might have come across is a set of three blog posts by Cloudera, which is quite good and much more detailed than the official HBase guidelines, but it still does not fully cover the traps you can run into when implementing a Java client.

I’d like to show three examples: two for a put and one for a get request. All further operations like delete or post should then be easy to implement once you know how to send requests to the REST server.
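
These are not the full listings from the post, but as a minimal sketch, a plain get request against the REST server could look like the following; host, port, table, row and column are placeholders that have to be adapted to your cluster:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class HBaseRestGetExample {

  public static void main(String[] args) throws Exception {
    // Placeholder host, port, table, row and column - adapt them to your setup.
    URL url = new URL("http://localhost:8080/shared:ITEMS/item0001/c:value");
    HttpURLConnection connection = (HttpURLConnection) url.openConnection();
    connection.setRequestMethod("GET");
    // The Accept header tells the REST server which representation to return;
    // with application/json the cell values come back Base64 encoded.
    connection.setRequestProperty("Accept", "application/json");

    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(connection.getInputStream()))) {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line);
      }
    }
    connection.disconnect();
  }
}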

Continue reading “Java client for the hbase REST API”

[Java EE] JSF visualization library VisualFaces

I am currently writing my master's thesis on Big Data visualization. To achieve stunning visuals I decided to use d3.js, which is a (nearly) perfect library for displaying data in an innovative and fancy way. To be able to use those diagrams easily with Java EE’s JSF 2 technology, I decided to write a tag library (which basically is a component collection) called VisualFaces that renders the most impressive example diagrams with just one tag.

So instead of writing hundreds of lines of JavaScript code, you simply add a single VisualFaces tag to your XHTML page and you'll get a wonderful word cloud.

To use the taglib, simply put the visualfaces.jar into the WEB-INF/lib directory of your Java EE web project, add the namespace xmlns:j="https://www.jofre.de/visualfaces", and all included diagrams can be used in an instant.

I will be adding some more diagrams to the library soon. At the moment I am struggling with putting several diagrams on one page, because some JavaScript events such as mouseover and some properties get overwritten.

The project can be downloaded from GitHub. To compile it, simply export it as a JAR from Eclipse. Below you find a precompiled beta version.

Setting up a Hadoop cluster

Some time ago I translated a tutorial by Michael Noll that showed how to set up a single Hadoop instance. A single node, however, mainly serves testing and learning purposes, and it is of course desirable to build a cluster, since only then can Hadoop, as a distributed system, really play to its strengths.

In this tutorial I therefore translate the second part, which deals with joining two servers into a cluster. The PDF can be downloaded here:

Have fun with it!

[Big Data] Visualize huge amounts of geo data using WebGL-Globe

For my master's thesis I'm evaluating several interesting frameworks for visualizing Big Data in the browser. Next to the fantastic D3.js I found WebGL quite helpful, even if it is not as mature as its 2D predecessors. Nevertheless, Google shows some impressive examples of what is possible with WebGL on their site ChromeExperiments.com. To give you a first impression of what can be done with those examples, here is a screenshot I took of a globe that visualizes German students who study in a foreign country.
[Screenshot: German students studying in foreign countries]

The source code is no more than ~50 lines, so it can easily be adapted to your personal needs.

Installing Hadoop 1.0.4 on Ubuntu Server 12.04.2

Installing Hadoop is not entirely trivial and unfortunately not particularly well documented either. However, there is a very good tutorial by Michael Noll in which he explains how to set up a simple single-node cluster. I have translated it more or less into German and am making it available here.

 

HDFS Explorer – Managing the Hadoop file system on a remote machine

Working with Hadoop means working with the Hadoop Distributed File System (HDFS), so you constantly have to read, write and delete files via the command line. That can be quite difficult and exhausting when you are not familiar with the common Unix and Hadoop commands. To handle this task, I wrote a small application that can work with an HDFS instance running on Ubuntu.

So far, the application is able to do the following (the underlying HDFS API calls are sketched after the list):

  • Browse the HDFS in a tree view
  • Upload/download files to/from the local machine
  • Delete files on the HDFS
  • Create directories on the HDFS
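
Under the hood these operations boil down to the Hadoop FileSystem API. The following is a minimal sketch of the corresponding calls; the NameNode URI and all paths are placeholders:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsOperationsExample {

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Placeholder NameNode address - adjust host and port to your cluster.
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

    // List a directory (the tree view is built from calls like this).
    for (FileStatus status : fs.listStatus(new Path("/user/hduser"))) {
      System.out.println((status.isDirectory() ? "[dir]  " : "[file] ") + status.getPath());
    }

    // Upload and download files between the local machine and HDFS.
    fs.copyFromLocalFile(new Path("/tmp/local.txt"), new Path("/user/hduser/local.txt"));
    fs.copyToLocalFile(new Path("/user/hduser/local.txt"), new Path("/tmp/copy.txt"));

    // Create a directory and delete a file (non-recursively).
    fs.mkdirs(new Path("/user/hduser/newdir"));
    fs.delete(new Path("/user/hduser/local.txt"), false);

    fs.close();
  }
}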

If there is a need (and if I get enough good feedback 😉), I'll add session management for several file systems as well as the ability to start MapReduce jobs from the application (as can be seen in the lower group box).

A download will follow soon!