ClassNotFoundException in Spark application using KryoSerializer

We frequently encountered a ClassNotFoundException in our Java-based Spark applications for classes that were verifiably included in our application’s JAR. Furthermore, we used the KryoSerializer (org.apache.spark.serializer.KryoSerializer) for performance reasons.

After some very annoying debugging sessions we found out that we could get rid of the exception by registering the apparently missing classes in the Spark configuration property spark.kryo.classesToRegister. This property is a simple comma-separated list of fully qualified class names. After adding each class, the ClassNotFoundException disappeared.
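For reference, the relevant entries in spark-defaults.conf might look like this (the two class names are placeholders for your own classes):

```
spark.serializer              org.apache.spark.serializer.KryoSerializer
spark.kryo.classesToRegister  com.example.MyEvent,com.example.MyResult
```

The same can also be set programmatically on the SparkConf, e.g. via conf.set("spark.kryo.classesToRegister", …) or conf.registerKryoClasses(…).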

Calculate prime numbers using Spark

Hi, for a test I wrote a short Java application that calculates the prime numbers between 0 and 1000000 on a distributed Spark cluster. Since Spark 2.x examples are rare on the internet, I’ll just leave this here. The prime number code is by Oscar Sanchez.

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;
import scala.Tuple2;
 
public class Main {
 
  private static boolean isPrime(long n) {
    if (n < 2) {
      return false;
    }
    for (long i = 2; i * i <= n; i++) {
      if (n % i == 0) {
        return false;
      }
    }
    return true;
  }
 
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("PrimeApp").getOrCreate();
    Dataset<Tuple2<Long, Boolean>> rnd = spark.range(0L, 1000000L).map(
      (MapFunction<Long, Tuple2<Long, Boolean>>) x -> new Tuple2<Long, Boolean>(x, isPrime(x)), Encoders.tuple(Encoders.LONG(), Encoders.BOOLEAN()));
    rnd.show(false);
    spark.stop();
  }
 
}

Reading Parquet Files from a Java Application

Recently I came across the requirement to read a Parquet file into a Java application, and I figured out that it is neither well documented nor easy to do. As a consequence I wrote a short tutorial. The first task is to add your Maven dependencies.

<dependencies>
  <dependency>
    <groupId>org.apache.parquet</groupId>
    <artifactId>parquet-hadoop</artifactId>
    <version>1.9.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.7.0</version>
  </dependency>
</dependencies>

Writing the Java application is easy once you know how to do it. Instead of using the AvroParquetReader or the ParquetReader class that you frequently find when searching for a solution to read Parquet files, use the class ParquetFileReader instead. The basic setup is to read all row groups and then read all groups recursively.
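The skeleton below sketches that setup (a sketch against parquet-hadoop 1.9.0’s example API with GroupRecordConverter; the file name data.parquet is a placeholder for your own file):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.page.PageReadStore;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.convert.GroupRecordConverter;
import org.apache.parquet.format.converter.ParquetMetadataConverter;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;
import org.apache.parquet.io.ColumnIOFactory;
import org.apache.parquet.io.MessageColumnIO;
import org.apache.parquet.io.RecordReader;
import org.apache.parquet.schema.MessageType;

public class ParquetReaderExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path("data.parquet"); // placeholder path

    // Read the footer first to obtain the schema and the row group metadata.
    ParquetMetadata footer = ParquetFileReader.readFooter(conf, path, ParquetMetadataConverter.NO_FILTER);
    MessageType schema = footer.getFileMetaData().getSchema();

    ParquetFileReader reader = new ParquetFileReader(conf, path, footer.getBlocks(), schema.getColumns());
    try {
      PageReadStore rowGroup;
      // Iterate over all row groups...
      while ((rowGroup = reader.readNextRowGroup()) != null) {
        long rows = rowGroup.getRowCount();
        MessageColumnIO columnIO = new ColumnIOFactory().getColumnIO(schema);
        RecordReader<Group> recordReader = columnIO.getRecordReader(rowGroup, new GroupRecordConverter(schema));
        // ...and read every record (Group) inside the current row group.
        for (long i = 0; i < rows; i++) {
          System.out.println(recordReader.read());
        }
      }
    } finally {
      reader.close();
    }
  }
}
```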


Java client for the HBase REST API

Modern open source projects frequently lack documentation, and HBase is no exception regarding its REST API. Even though REST is covered in the official HBase reference book, the guidelines are not complete, as I’ll show in the listings below. A second source you might have come across is a set of three blog posts by Cloudera, which is quite good and much more detailed than the official HBase guidelines – but it still does not fully cover the traps you might run into when implementing a Java client.

I’d like to show three examples, two for a put and one for a get request. All further operations like delete or post should then be easy to implement once you know how to send requests to the REST server.
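As a taste of what is to come: the REST API expects row keys, column names and cell values base64 encoded inside the JSON body. The self-contained helper below builds such a put body (the row, column and value used in main are made up for illustration; the JSON structure follows the REST API’s CellSet model, where "$" carries the cell value):

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class HBaseRestBody {

  // HBase's REST interface transports binary data base64 encoded.
  static String b64(String s) {
    return Base64.getEncoder().encodeToString(s.getBytes(StandardCharsets.UTF_8));
  }

  // Builds the JSON document for a put of a single cell.
  public static String putBody(String row, String column, String value) {
    return "{\"Row\":[{\"key\":\"" + b64(row) + "\",\"Cell\":[{\"column\":\""
        + b64(column) + "\",\"$\":\"" + b64(value) + "\"}]}]}";
  }

  public static void main(String[] args) {
    // Send this via PUT to http://<resthost>:8080/<table>/<rowkey>
    // with Content-Type: application/json.
    System.out.println(putBody("row1", "cf:name", "value1"));
  }
}
```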


Java / Spark – Printing RDDs from Spark Streaming

A common challenge when writing functional code in Spark is simply outputting logs the way we do it in ordinary applications, where lines are processed sequentially and a debug message can be printed at an arbitrary place using System.out.println(). Working with Spark’s RDDs is new to most of us and can be confusing at first, since it is difficult to track which object type we are currently working with (is it a JavaDStream, an RDD, or a simple type like String or int) – but in fact this tracking is important. If you use the print function offered by RDDs or DStreams, you’ll just see more or less useless array-like information in the console instead of the objects’ content.

What I commonly do is to resolve the RDD object down to a simple JavaRDD or JavaDStream<Object>, e.g. using flatMap(x -> x) to flatten streams or lists in a dataset, or map() to break complex objects down to a simple data type (e.g. via map(person -> person.getAddress().getStreet())). On RDDs you can then do a foreach(System.out::println). On a JavaDStream you can do the following:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class Main {

  public static void printRDD(final JavaRDD<String> s) {
    s.foreach(System.out::println);
  }

  public static void main(final String[] args) throws InterruptedException {
    SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("testApp");
    try (JavaStreamingContext sc = new JavaStreamingContext(conf, new Duration(5000))) {
      JavaDStream<String> stream = sc.textFileStream("C:/testnumbers/");
      stream.foreachRDD(Main::printRDD);
      sc.start();
      sc.awaitTermination();
    }
  }
}

Here, I simply break the JavaDStream down into RDDs using foreachRDD() and use System.out::println as a method reference to print each RDD’s content.

Java – Streams vs Collections vs Loops

Trying to figure out which method in Java serves best to get the maximum timestamp out of a list, I set up a small benchmark to evaluate the runtime of Java streams, Collections.max() and an ordinary for loop. My setup can be seen in the following listing.

public static void main(final String[] args) {
 
  List<Integer> list = new ArrayList<>();
  Random rnd = new Random();
  for (int i = 0; i < 10; i++) {
    list.add(rnd.nextInt(10000));
  }

  long t1 = System.currentTimeMillis();
  long largestNumber = 0;
  for (Integer t : list) {
    if (t > largestNumber) {
      largestNumber = t;
    }
  }
 
  long t2 = System.currentTimeMillis();
  System.out.println("Max (loop) is " + largestNumber + " in " + (t2 - t1) + " ms.");
 
  long t3 = System.currentTimeMillis();
  OptionalInt max = list.stream().mapToInt(Integer::intValue).max();
 
  long t4 = System.currentTimeMillis();
  if (max.isPresent()) {
    System.out.println("Max (Stream) is " + max.getAsInt() + " in " + (t4 - t3) + " ms.");
  }
 
  long t5 = System.currentTimeMillis();
  int collMax = Collections.max(list);
  long t6 = System.currentTimeMillis();
 
  System.out.println("Max (Collections) is " + collMax + " in " + (t6 - t5) + " ms.");
}

Here are the results:

Samples Streams Collections Loop (all times in ms)
10.000 68 3 3
50.000 70 5 4
100.000 71 6 5
500.000 76 16 10
1.000.000 108 18 12
1.500.000 89 20 17
5.000.000 94 29 19
10.000.000 103 41 30

Since all three approaches scale roughly linearly, the numbers are comparable. Although collections and streams look way more elegant, the runtimes show that the plain loop still seems to be the fastest way to determine the maximum value in a simple list.

Deploy a web application to an embedded Tomcat 8

I’ve found various tutorials on creating an embedded Tomcat programmatically and deploying a web application to it. Unfortunately, none of the tutorials was up to date using Tomcat 8, and actually none showed how to deploy anything but a servlet, so I decided to write a short tutorial on how to create the server and deploy either a WAR file or a folder containing a non-archived web application.

Create a simple Java project in Eclipse, enable Maven and add the following dependencies:

<dependencies>
  <dependency>
    <groupId>org.apache.tomcat</groupId>
    <artifactId>tomcat-catalina</artifactId>
    <version>8.0.21</version>
  </dependency>
  <dependency>
    <groupId>org.apache.tomcat</groupId>
    <artifactId>tomcat-util</artifactId>
    <version>8.0.21</version>
  </dependency>
  <dependency>
    <groupId>org.apache.tomcat.embed</groupId>
    <artifactId>tomcat-embed-core</artifactId>
    <version>8.0.21</version>
  </dependency>
  <dependency>
    <groupId>org.apache.tomcat.embed</groupId>
    <artifactId>tomcat-embed-jasper</artifactId>
    <version>8.0.21</version>
  </dependency>
</dependencies>

Optionally, you can enable the maven-assembly-plugin to package your application as a JAR at the end and have Maven include all the dependencies you specified. As build goal you’ll have to use assembly:single.

<plugin>
  <artifactId>maven-assembly-plugin</artifactId>
  <configuration>
    <archive>
      <manifest>
        <mainClass>de.jofre.embeddedtc.runtime.Main</mainClass>
      </manifest>
    </archive>
    <descriptorRefs>
      <descriptorRef>jar-with-dependencies</descriptorRef>
    </descriptorRefs>
  </configuration>
</plugin>

Now create the class de.jofre.embeddedtc.runtime.Main (the name is arbitrary) and write the code according to the next listing:

package de.jofre.embeddedtc.runtime;
 
import java.io.File;
import java.util.logging.Logger;
 
import org.apache.catalina.Context;
import org.apache.catalina.LifecycleException;
import org.apache.catalina.startup.Tomcat;
 
public class Main {
  private final static Logger LOGGER = Logger.getLogger(Main.class.getName());
  private final static String mWorkingDir = System.getProperty("java.io.tmpdir");
  private static Tomcat tomcat = null;
 
  public static void main(String[] args) {
 
    tomcat = new Tomcat();
    tomcat.setPort(8080);
    tomcat.setBaseDir(mWorkingDir);
    tomcat.getHost().setAppBase(mWorkingDir);
    tomcat.getHost().setAutoDeploy(true);
    tomcat.getHost().setDeployOnStartup(true);
 
    try {
      tomcat.start();
    } catch (LifecycleException e) {
      LOGGER.severe("Tomcat could not be started.");
      e.printStackTrace();
    }
    LOGGER.info("Tomcat started on " + tomcat.getHost());
 
    // Alternatively, you can specify a WAR file as the last parameter in the following call, e.g. "C:\\Users\\admin\\Desktop\\app.war"
    Context appContext = tomcat.addWebapp(tomcat.getHost(), "/app", "C:\\Users\\admin\\Desktop\\app\\");
    LOGGER.info("Deployed " + appContext.getBaseName() + " as " + appContext.getPath());
 
    tomcat.getServer().await();
  }
}

The last question is what the directory C:\Users\admin\Desktop\app\ (or the WAR file C:\Users\admin\Desktop\app.war, respectively) looks like. Well, it contains a simple HTML file…

<html><body>Test</body></html>

… and another folder called WEB-INF containing the web.xml with the following content:

<?xml version="1.0" encoding="ISO-8859-1"?>
 
<!DOCTYPE web-app 
    PUBLIC "-//Sun Microsystems, Inc.//DTD Web Application 2.3//EN" 
    "http://java.sun.com/dtd/web-app_2_3.dtd">
 
<web-app>
 
    <display-name>Test App</display-name>
    <description>A test app</description>
 
    <welcome-file-list>
        <welcome-file>index.html</welcome-file>
    </welcome-file-list>
 
    <session-config>
      <session-timeout>30</session-timeout>
    </session-config>
 
</web-app>

Now, if you start the Java application, you can call http://localhost:8080/app and the content of index.html should be displayed. Hope this helps you!

Java error – Archive for required library could not be read or is not a valid ZIP file

When a team works on a Java project that uses Maven, it can happen that certain dependencies are referenced without being present on disk. In this case, Eclipse (or the corresponding Maven plugin) tends to print the error message Archive for required library could not be read or is not a valid ZIP file. It does so based on the assumption that the referenced libraries are missing.

To solve this problem, you only have to delete all folders in your Maven repository; on my machine (I use the Maven plugin for Eclipse) it is located in C:\Users\padmalcom\.m2\repository. Afterwards, restart Eclipse and Maven will re-download the required dependencies on the next start. It may also be necessary to right-click the affected project and select Maven -> Update Project…. The error should then usually be gone.

[Java EE] JSF visualization library VisualFaces

I am currently writing my master’s thesis on big data visualization. To achieve stunning visuals, I decided to use d3.js, which is a (nearly) perfect library to display data in an innovative and fancy way. To be able to use those diagrams easily with Java EE’s JSF 2 technology, I decided to write a tag library (which is basically a component collection) called VisualFaces to use the most impressive example diagrams with just one tag.

So instead of writing hundreds of lines of JavaScript code, you simply add a single tag to your XHTML page and you’ll get a wonderful word cloud.

To use the taglib, simply put the visualfaces.jar in the WEB-INF directory of your Java EE web project, add the namespace xmlns:j=”https://www.jofre.de/visualfaces”, and then all included diagrams can be used in an instant.
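For illustration, a page using the taglib might look like the following sketch (the component name wordcloud is hypothetical; the real tag names ship with the library):

```xml
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:j="https://www.jofre.de/visualfaces">
  <head><title>VisualFaces demo</title></head>
  <body>
    <!-- hypothetical tag name, for illustration only -->
    <j:wordcloud />
  </body>
</html>
```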

I will be adding some more diagrams to the library soon. At the moment I am struggling with putting several diagrams on one page, because some JavaScript events like mouseover and some properties get overwritten.

The project can be downloaded from GitHub. To compile it, simply export it as a JAR from Eclipse. Below you find a precompiled beta version.

[Java] 100 common words in German, English and French

To exclude common words from word and tag clouds, and to filter them out in text mining procedures, it is necessary to determine them beforehand. The following Java class will help you find them in German, English and French. Thanks to the University of Leipzig, who created the collection.

Update: I added some common words to the German list that I found during my research and removed some duplicates.

package de.jofre.commonwords;
 
public class CommonWords {
 
	// By Jonas Freiknecht
 
	// Source: http://wortschatz.uni-leipzig.de/
 
	// There also exists a web service from the University of Leipzig
	// to determine common words:
	// http://wortschatz.uni-leipzig.de/axis/servlet/ServiceOverviewServlet
 
	public static boolean contains(String _str, String[] _list) {
		for (int i = 0; i < _list.length; i++) {
			if (_list[i].equalsIgnoreCase(_str)) {
				return true;
			}
		}
		return false;
	}
 
	public final static String[] COMMON_WORDS_GERMAN = { "der", "die", "und",
			"in", "den", "von", "zu", "das", "mit", "sich", "des", "auf",
			"für", "ist", "im", "dem", "nicht", "ein", "für", "eine", "als",
			"auch", "es", "an", "werden", "aus", "er", "hat", "daß", "sie",
			"nach", "wird", "bei", "einer", "du", "um", "am", "sind", "noch",
			"wie", "einem", "über", "einen", "ob", "so", "dessen", "zum", "war",
			"haben", "nur", "oder", "aber", "vor", "zur", "bis", "mehr",
			"durch", "man", "sein", "wurde", "sei", "RT", "bin", "hatte",
			"kann", "gegen", "vom", "können", "schon", "wenn", "habe", "seine",
			"Mark", "ihre", "dann", "unter", "wir", "soll", "ich", "eines",
			"ins", "Jahr", "zwei", "Jahren", "diese", "dieser", "wieder",
			"keine", "willst", "seiner", "worden", "Und", "will", "zwischen",
			"extra", "immer", "Millionen", "Ein", "was", "sagte", "ihr",
			"jetzt", "kennen", "sagen", "armer", "arme", "gerne", "kenne",
			"meine", "hoffe", "sehen", "achso", "reicht", "dabei", 
			"gehst", "alles", "selbst", "neuen", "neue", "liebe",
			"feiern", "letzte", "macht", "könnte", "keiner", "glaub", "glaube",
			"gehen", "euren", "passt", "passe", "passen", "findet",
			"eigentlich", "reden", "machen", "liebt", "halbe", "dieses",
			"finde", "sitze", "machen", "halbe", "sonst", "heute", "brauch",
			"drauf", "total", "meint", "denkt", "lässt", "hätte",
			"damals", "lange", "dachte", "wirst", "hören", "kennt",
			"bitte", "treffen", "würde", "fängt", "länger", "könnt",
			"sitzt"};
 
	public final static String[] COMMON_WORDS_ENGLISH = { "the", "of", "to",
			"and", "a", "in", "for", "is", "The", "that", "on", "said", "with",
			"be", "was", "by", "as", "are", "at", "from", "it", "has", "an",
			"have", "will", "or", "its", "he", "not", "were", "which", "this",
			"but", "can", "more", "his", "been", "would", "about", "their",
			"also", "they", "million", "had", "than", "up", "who", "In", "one",
			"you", "new", "A", "I", "other", "year", "all", "two", "S", "But",
			"It", "company", "into", "U", "Mr.", "system", "some", "when",
			"out", "last", "only", "after", "first", "time", "says", "He",
			"years", "market", "no", "over", "we", "could", "if", "people",
			"percent", "such", "This", "most", "use", "because", "any", "data",
			"there", "them", "government", "may", "software", "so", "New",
			"now", "many" };
 
	public final static String[] COMMON_WORDS_FRENCH = { "de", "la", "le",
			"et", "les", "des", "en", "un", "du", "une", "que", "est", "pour",
			"qui", "dans", "a", "par", "plus", "pas", "au", "sur", "ne", "se",
			"Le", "ce", "il", "sont", "La", "Les", "ou", "avec", "son", "Il",
			"aux", "d'un", "En", "cette", "d'une", "ont", "ses", "mais",
			"comme", "on", "tout", "nous", "sa", "Mais", "fait", "été",
			"aussi", "leur", "bien", "peut", "ces", "y", "deux", "A", "ans",
			"l", "encore", "n'est", "marché", "d", "Pour", "donc", "cours",
			"qu'il", "moins", "sans", "C'est", "Et", "si", "entre", "Un", "Ce",
			"faire", "elle", "c'est", "peu", "vous", "Une", "prix", "On",
			"dont", "lui", "également", "Dans", "effet", "pays", "cas", "De",
			"millions", "Belgique", "BEF", "mois", "leurs", "taux", "années",
			"temps", "groupe" };
 
}
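To show how such lists are typically applied, here is a small self-contained sketch (using a tiny made-up excerpt of the English list rather than the full arrays above) that strips common words from a token list before building a word or tag cloud:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class StopWordFilter {

  // Tiny excerpt of the English list, just enough to show the idea.
  static final String[] COMMON = { "the", "of", "to", "and", "a", "in" };

  // Keeps only tokens that are not in the common word list (case-insensitive).
  public static List<String> filter(List<String> tokens) {
    return tokens.stream()
        .filter(t -> Arrays.stream(COMMON).noneMatch(c -> c.equalsIgnoreCase(t)))
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    // prints [big, data, tomorrow]
    System.out.println(filter(Arrays.asList("The", "big", "data", "of", "tomorrow")));
  }
}
```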