Unit testing Spark 2.x applications in Java

When you search for answers to specific Spark questions, you might get the impression that Java is the black sheep among the supported languages: nearly all answers refer to Scala or Python. The same is true for unit testing, which is why I am writing this post. I will show how to create a local context (which is pretty well documented) and how to read Parquet files (or other formats like CSV or JSON; the process is the same) from a source directory within your project. This way you can unit test classes containing Spark functions without a connection to another file system or resource negotiator.

In the following example it is important to register the folder src/test/resources as a class path / source folder of your project.

The annotations @BeforeClass and @AfterClass mark static methods that are called once before the first test of the class runs and once after the last one has finished.

package de.jofre.spark.tests;
import static org.assertj.core.api.Assertions.assertThat;
import org.apache.spark.SparkConf;
import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.feature.SQLTransformer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.junit.AfterClass;
import org.junit.BeforeClass;
import org.junit.Test;
import org.junit.experimental.categories.Category;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import de.jofre.test.categories.UnitTest;
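public class SparkSessionAndFileReadTest {
	private static Dataset<Row> df;
	private static SparkSession spark;
	private static final Logger logger = LoggerFactory.getLogger(SparkSessionAndFileReadTest.class);

	// Runs once before the first test: start a local session and load the test data.
	@BeforeClass
	public static void beforeClass() {
		spark = SparkSession.builder().master("local[*]").config(new SparkConf().set("fs.defaultFS", "file:///"))
				.getOrCreate();
		df = spark.read().parquet("src/test/resources/tests/part-r-00001-myfile.snappy.parquet");
		logger.info("Created spark context and dataset with {} rows.", df.count());
	}

	@Test
	@Category(UnitTest.class)
	public void buildClippingTransformerTest() {
		logger.info("Testing the spark sorting function.");
		Dataset<Row> sorted = df.sort("id");
		// Illustrative check: sorting must neither add nor drop rows.
		assertThat(sorted.count()).isEqualTo(df.count());
	}

	// Runs once after the last test: shut the local session down again.
	@AfterClass
	public static void afterClass() {
		if (spark != null) {
			spark.stop();
		}
	}
}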
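
The test class loads its file through a path relative to the project root. Since src/test/resources is registered on the class path, the file could also be looked up through the class loader, which keeps the test independent of the working directory. The following is only a sketch under that assumption; the class TestResources and its methods are hypothetical helpers, not part of Spark or JUnit.

```java
import java.io.File;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.nio.file.Paths;

public class TestResources {

    // Looks up a file below src/test/resources via the class path
    // and returns its location as a plain file system path.
    public static String pathOf(String resourceName) {
        URL url = TestResources.class.getClassLoader().getResource(resourceName);
        if (url == null) {
            throw new IllegalArgumentException("Not on the class path: " + resourceName);
        }
        try {
            return toLocalPath(url.toURI());
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Unusable resource URL: " + url, e);
        }
    }

    // Converts a file: URI into a file system path (decodes e.g. %20 to a space).
    static String toLocalPath(URI uri) {
        return Paths.get(uri).toString();
    }

    public static void main(String[] args) {
        // Demonstrates the conversion only; in a test you would call pathOf(...) instead.
        System.out.println(toLocalPath(new File("src/test/resources").getAbsoluteFile().toURI()));
    }
}
```

In the test, the read could then look like spark.read().parquet(TestResources.pathOf("tests/part-r-00001-myfile.snappy.parquet")). This only works while the resources are plain files on disk, as they usually are when tests run from the build directory; it is not meant for resources packed inside a jar.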

Permanently add a proxy to MiKTeX

MiKTeX is a TeX distribution, which you need in order to translate LaTeX documents to target formats such as PDF. One task of such a distribution is managing the packages that are used in your document. MiKTeX downloads such packages when they are first used or updated and hence requires an internet connection, unless you have those packages on a portable medium.

If you sit behind a proxy, like me, you have to configure MiKTeX to use it. What I did for a long (annoying) time was to enter the proxy URL in the MiKTeX Update tool and check Authentication required. This forces the tool to ask for your proxy credentials every single time. A better way is to uncheck Authentication required and specify user and password directly in the URL (in the form http://user:password@host:port). If you do so, you should never be asked to enter your user/password combination again.
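
If the user name or password contains characters such as @ or :, the URL becomes ambiguous. A common remedy is to percent-encode the credentials before putting them into the URL. The sketch below shows how such a URL could be assembled; the class ProxyUrl, the sample credentials and the host are made up for illustration, and whether MiKTeX decodes percent-encoded credentials is an assumption you should verify with your own setup.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class ProxyUrl {

    // Percent-encodes one credential part; URLEncoder emits '+' for spaces,
    // which is replaced by %20 here because '+' is only valid in form data.
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // Builds a proxy URL of the form http://user:password@host:port.
    public static String build(String user, String password, String host, int port) {
        return "http://" + encode(user) + ":" + encode(password) + "@" + host + ":" + port;
    }

    public static void main(String[] args) {
        // prints http://user:p%40ss%3Aword@proxy.domain.de:8080
        System.out.println(build("user", "p@ss:word", "proxy.domain.de", 8080));
    }
}
```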

TensorFlow For Poets – Retrain Inception behind a Proxy

I tried to retrain Google’s Inception as described here. I failed because I am behind a proxy, which was not considered when the retrain.py script was implemented.

To solve this, I located the following line in the script:

filepath, _ = urllib.request.urlretrieve(DATA_URL, filepath, _progress)

… and added the following code above it:

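proxy = urllib.request.ProxyHandler({'http': r'http://user:password@proxy.domain.de:8080'})
auth = urllib.request.HTTPBasicAuthHandler()
opener = urllib.request.build_opener(proxy, auth, urllib.request.HTTPHandler)
# Install the opener globally so that urlretrieve() below goes through the proxy.
urllib.request.install_opener(opener)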

Now TensorFlow can download the required files through the proxy.

Java / Spark – Printing RDDs from Spark Streaming

A common challenge when writing functional code in Spark is simply printing log output the way we do in ordinary applications, where code is processed sequentially line by line and a debug message can be printed at an arbitrary place using System.out.println(). Working with Spark’s RDDs is new to most of us and can be confusing at first, since it is difficult to keep track of which object type we are currently working with (is it a DStream, an RDD or a simple type like String or int), but in fact this tracking is important. If you use the print function offered by RDDs or DStreams, you will just see more or less useless array-like information in the console instead of the objects’ content.

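What I commonly do is reduce the data to a JavaRDD or JavaDStream of a simple type, using e.g. flatMap() to flatten streams or lists in a dataset, or map() to break complex objects down to a simple data type (e.g. via map(person -> person.getAddress().getStreet())). On an RDD you can then print every element on the driver with collect().forEach(System.out::println). (Passing System.out::println directly to the RDD's own foreach() tends to fail with a "Task not serializable" error, because the method reference captures the non-serializable System.out stream; the lambda x -> System.out.println(x) avoids that.) On a JavaDStream you can do the following: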

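public static void printRDD(final JavaRDD<String> s) {
  s.collect().forEach(System.out::println); // fetch the batch to the driver and print it
}
public static void main(final String[] args) throws InterruptedException {
  SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("testApp");
  try (JavaStreamingContext sc = new JavaStreamingContext(conf, new Duration(5000))) {
    JavaDStream<String> stream = sc.textFileStream("C:/testnumbers/");
    stream.foreachRDD(rdd -> printRDD(rdd)); // called once per 5 second micro-batch
    sc.start();
    sc.awaitTermination();
  }
}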

Here, I simply break the JavaDStream down into RDDs using foreachRDD() and use System.out::println as a method reference to print each RDD's content.
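
The idea of flattening first and then mapping complex objects to a simple, printable type can be tried without a Spark installation. The sketch below uses plain Java streams as an analogy, not Spark's API; the classes Person and Address are hypothetical stand-ins for whatever complex objects your RDDs contain.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class FlattenAndMapDemo {

    static class Address {
        private final String street;
        Address(String street) { this.street = street; }
        String getStreet() { return street; }
    }

    static class Person {
        private final Address address;
        Person(String street) { this.address = new Address(street); }
        Address getAddress() { return address; }
    }

    // Flattens the nested lists, then reduces every Person to a printable String.
    static List<String> streets(List<List<Person>> batches) {
        return batches.stream()
                .flatMap(List::stream)
                .map(person -> person.getAddress().getStreet())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<List<Person>> batches = Arrays.asList(
                Arrays.asList(new Person("Main Street"), new Person("High Street")),
                Arrays.asList(new Person("Park Lane")));
        // Printing a Person directly would only show something like Person@1b6d3586;
        // after the mapping, each line is a readable street name.
        streets(batches).forEach(System.out::println);
    }
}
```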

Won €500 from PayPal… or maybe not?

I just received the following email from PayPal.

So I have won €500! Wow! You usually spot this kind of spam quickly by the outrageous conditions attached to the supposed prize, but here there are none to be found. Besides, the sentence “You can redeem your prize right away” really does say that I have won €500. On top of that, PayPal has supposedly already sent the money to my account (“Because that is where we have credited the €500 to you.”). I checked right away: nothing. What a surprise! Interestingly, even the official hotline is busy. Could it be that more people are calling to ask where their prize has gone?

Dear PayPal marketing team, are you already so desperate, because nobody looks at the great promotional offers you send out by email every two days anymore, that you have to resort to stunts like this to get attention? Are you not earning enough from your rather high sales commissions?

I am actually against posts like this, but since there is nothing about this email to be found on the net yet, someone has to make a start and write a little about it. Otherwise this will go on forever 🙂

P.S. Oh, of course it could also be a phishing email. However, the certificate of the linked pages matches PayPal's.