Java / Spark – Printing RDDs from Spark Streaming

A common challenge when writing functional code in Spark is simply producing log output the way we do in ordinary applications, where code is processed line by line and a debug message can be printed at an arbitrary place using System.out.println(). Working with Spark’s RDDs is new to most of us and can be confusing at first, because it is hard to keep track of which type we are currently working with (a DStream, an RDD, or a simple type like String or int), but in fact this tracking is important. If you use the print function that DStreams offer, you will see more or less useless array-like information in the console instead of the objects’ content.

What I commonly do is reduce the data to a JavaRDD or JavaDStream of a simple type, using e.g. flatMap(x -> x) to flatten streams or lists in a dataset, or map() to break complex objects down to a simple data type (e.g. via map(person -> person.getAddress().getStreet())). On RDDs you can then call foreach(System.out::println). On a JavaDStream you can do the following:

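import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class Main {

  // Prints every element of the RDD of one micro-batch.
  public static void printRDD(final JavaRDD<String> s) {
    s.foreach(System.out::println);
  }

  public static void main(final String[] args) throws InterruptedException {
    SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("testApp");
    try (JavaStreamingContext sc = new JavaStreamingContext(conf, new Duration(5000))) {
      // Watches the directory and emits the lines of newly added text files.
      JavaDStream<String> stream = sc.textFileStream("C:/testnumbers/");
      stream.foreachRDD(Main::printRDD);
      sc.start();
      sc.awaitTermination();
    }
  }
}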

Here, I simply break a JavaDStream down into RDDs using foreachRDD() and use System.out::println as a method reference to print each RDD’s content.
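
The flatten-and-map step mentioned above can be tried without a cluster. The following is only a sketch using plain java.util.stream as a stand-in for the Spark API, so it shows the idea rather than Spark's own calls. The classes FlattenExample, Person and Address are hypothetical and exist only for this illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class FlattenExample {

  // Hypothetical domain types, only here to illustrate the idea.
  static class Address {
    private final String street;
    Address(String street) { this.street = street; }
    String getStreet() { return street; }
  }

  static class Person {
    private final Address address;
    Person(Address address) { this.address = address; }
    Address getAddress() { return address; }
  }

  // Flattens the nested lists and reduces every Person to a plain String.
  static List<String> streets(List<List<Person>> nested) {
    return nested.stream()
        .flatMap(List::stream)                           // nested lists -> single Persons
        .map(person -> person.getAddress().getStreet())  // Person -> String
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<List<Person>> nested = Arrays.asList(
        Arrays.asList(new Person(new Address("Main St")), new Person(new Address("Oak Ave"))),
        Arrays.asList(new Person(new Address("Elm Rd"))));
    // Once only Strings are left, printing them is straightforward.
    streets(nested).forEach(System.out::println);
  }
}
```

Once the data has been reduced to Strings like this, printing each element shows readable content instead of object references.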


Java – Streams vs Collections vs Loops

While trying to figure out which approach works best in Java to get the maximum timestamp out of a list, I set up a small benchmark to compare the runtime of Java Streams, Collections and an ordinary for loop. My setup can be seen in the following listing.

 

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.OptionalInt;
import java.util.Random;

public class Main {

  public static void main(final String[] args) {

    // Fill the list with random samples.
    List<Integer> list = new ArrayList<>();
    Random rnd = new Random();
    for (int i = 0; i < 10; i++) {
      list.add(rnd.nextInt(10000));
    }

    // Variant 1: ordinary for loop.
    long t1 = System.currentTimeMillis();
    long largestNumber = 0;
    for (Integer t : list) {
      if (t > largestNumber) {
        largestNumber = t;
      }
    }
    long t2 = System.currentTimeMillis();
    System.out.println("Max (loop) is " + largestNumber + " in " + (t2 - t1) + " ms.");

    // Variant 2: Java Streams.
    long t3 = System.currentTimeMillis();
    OptionalInt max = list.stream().mapToInt(Integer::intValue).max();
    long t4 = System.currentTimeMillis();
    if (max.isPresent()) {
      System.out.println("Max (Stream) is " + max.getAsInt() + " in " + (t4 - t3) + " ms.");
    }

    // Variant 3: Collections.max().
    long t5 = System.currentTimeMillis();
    int collMax = Collections.max(list);
    long t6 = System.currentTimeMillis();
    System.out.println("Max (Collections) is " + collMax + " in " + (t6 - t5) + " ms.");
  }
}

Here are the results:

Samples       Streams (ms)   Collections (ms)   Loop (ms)
10,000        68             3                  3
50,000        70             5                  4
100,000       71             6                  5
500,000       76             16                 10
1,000,000     108            18                 12
1,500,000     89             20                 17
5,000,000     94             29                 19
10,000,000    103            41                 30

The expected linear runtime allows a comparison of these three approaches. Although the Collections and Streams variants look much nicer than the loop, the runtimes show that the loop still seems to be the fastest way to determine the maximum of the values in a simple list.
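
A variation of the benchmark, as a minimal sketch. It rests on an assumption that is not verified here: that part of the stream variant's roughly constant offset could be one-time start-up cost (class loading, JIT compilation) rather than per-element work. It therefore runs each variant a few times before measuring and uses System.nanoTime(). The class name MaxBenchmark, the helper methods, the sample size and the fixed seed are choices made for illustration only.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class MaxBenchmark {

  // Plain loop; returns Integer.MIN_VALUE for an empty list.
  static int maxLoop(List<Integer> list) {
    int largest = Integer.MIN_VALUE;
    for (int value : list) {
      if (value > largest) {
        largest = value;
      }
    }
    return largest;
  }

  // Stream variant; assumes a non-empty list (getAsInt() throws otherwise).
  static int maxStream(List<Integer> list) {
    return list.stream().mapToInt(Integer::intValue).max().getAsInt();
  }

  // Collections variant; also assumes a non-empty list.
  static int maxCollections(List<Integer> list) {
    return Collections.max(list);
  }

  public static void main(String[] args) {
    List<Integer> list = new ArrayList<>();
    Random rnd = new Random(42); // fixed seed so that runs are comparable
    for (int i = 0; i < 1_000_000; i++) {
      list.add(rnd.nextInt(10000));
    }

    // Warm-up: run every variant a few times before taking measurements.
    for (int i = 0; i < 20; i++) {
      maxLoop(list);
      maxStream(list);
      maxCollections(list);
    }

    long t1 = System.nanoTime();
    int loopMax = maxLoop(list);
    long t2 = System.nanoTime();
    int streamMax = maxStream(list);
    long t3 = System.nanoTime();
    int collMax = maxCollections(list);
    long t4 = System.nanoTime();

    System.out.println("Max (loop) is " + loopMax + " in " + (t2 - t1) / 1_000 + " us.");
    System.out.println("Max (Stream) is " + streamMax + " in " + (t3 - t2) / 1_000 + " us.");
    System.out.println("Max (Collections) is " + collMax + " in " + (t4 - t3) / 1_000 + " us.");
  }
}
```

For measurements that should hold up beyond a quick comparison, a dedicated harness such as JMH is probably the better tool.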
 


Apache Ranger 0.5 build errors – How to fix them

I recently came across some Maven errors while building Apache Ranger 0.5 that caused the build of the HDFS Security Plugin and the Storm Security Plugin to fail.

To get rid of the error just change the version of the maven-assembly-plugin in Ranger’s main pom.xml from

<version>2.2-beta-5</version>

to

<version>2.3</version>

…, add the following repository definitions to the main pom.xml

<repository>
<id>conjars</id>
<name>Concurrent Conjars repository</name>
<url>http://conjars.org/repo</url>
<layout>default</layout>
</repository>
<repository>
<id>clojars.org</id>
<url>http://clojars.org/repo</url>
</repository>

…, add the following dependency to the storm-agent/pom.xml:

<dependency>
<groupId>ring</groupId>
<artifactId>ring-jetty-adapter</artifactId>
<version>0.3.11</version>
</dependency>

Now you should be able to build Ranger. If you are facing issues with a missing pom.xml for the dependency tomcat:common-el, add the following repository to the main pom.xml:

<repository>
<id>conjars</id>
<name>Concurrent Conjars repository</name>
<url>http://conjars.org/repo</url>
<layout>default</layout>
</repository>

The absence of the pom.xml from the default Maven repository should only be a temporary issue, but it took me some time to figure it out. You can avoid it easily by adding a second repository that provides the missing file.


StayOnTop – Application to keep any application window always on top

For former Windows versions there was an application (I don’t remember the name) that was able to keep any window on top of the others. That was pretty helpful while copying text or watching movies during work. For Windows 7 I did not find a free one, so I wrote one myself. It is a .NET application, so you need the runtime. The application starts as a tray icon. When you right-click it, it shows all processes that have a visible interface. If you click an entry, its window is set on top; if you click it again, the window is sent to the background again. The application is not too smart and does not remember the topmost windows if you close it. Just set and reset the window again if you accidentally closed the application. I hope you like it and find it as helpful as I do.

You can download it here for free.


Wichtelwald – Contest entry for Devmania 2015

Last weekend we wrote a game for the overnight contest of the Devmania game jam in Mainz. Collect wood and snowballs during the day to be able to defend your hut against the evil goblins at night. The gameplay is a fun mixture of isometric collecting and first-person tower defence.

Download it here.


HDFS – setgid and setuid still not possible in Hadoop 2.7.1

I faced the requirement to set the setgid and/or setuid flag on directories in a customer’s HDFS, so I spent the last 10 minutes searching the internet for news on whether this is possible. Long story short: the results were bad. The last time I tried it was with Hadoop 2.2.0, without success; today I tried with 2.7.1, but nothing has changed at all. So commands like the following are NOT possible:

hdfs dfs -chmod g+s /tmp
hdfs dfs -chmod u+s /tmp

As a matter of fact, files created in HDFS receive the group of the parent directory and hence behave as if the setgid flag were set. The owner (uid) of a newly created file is always the user that created it.

Update: When mounting HDFS via the NFS Gateway the behavior is different, since the file system acts as if setgid was NOT set. This is strange, because the same file operations return different results depending on the interface through which they are executed. Furthermore, the NFS Gateway does not take secondary groups into account when evaluating permissions to write into the mounted HDFS. The only option to grant access is a user mapping as described at the bottom of the Hadoop Permissions Guide.

If you need these flags, think about using a POSIX-compliant alternative to HDFS such as Spectrum Scale/GPFS.

 



Short Eclipse proxy guidelines

If you are a developer and use Eclipse as your IDE, you have probably at least once been stuck behind a proxy in a customer’s network and needed to connect to the internet to download either Maven dependencies or plugins for your IDE. Since I struggle with the proxy configuration in nearly every project, I decided to write these few facts down in case they help you, too.

In Eclipse → Window → Preferences → General → Network Connections you find the following view. The combo box named Active Provider lists three options to configure a proxy: Native uses the native OS settings, Manual allows you to enter a proxy host, port, user name and password yourself, and Direct connects directly to the desired URL.

[Screenshot: Eclipse Network Connections preferences]

As a first step you might want to edit the proxy settings for a manual setup and enter the proxy server properties as provided by your network administrator. Then follow these two simple rules for your later work in Eclipse:

  • When downloading software from an online repository or the Eclipse marketplace – Set the proxy to Manual
  • When downloading Maven dependencies – Set the proxy to Native

That’s it!

 


Deploy a web application to an embedded Tomcat 8

I’ve found various tutorials on creating an embedded Tomcat programmatically and deploying a web application to it. Unfortunately, none of the tutorials was up to date with Tomcat 8, and none actually showed how to deploy anything but a servlet, so I decided to write a short tutorial on how to create the server and deploy either a WAR file or a folder containing a non-archived web application.

Create a simple Java project in Eclipse, enable Maven and add the following dependencies:

<dependencies>
  <dependency>
    <groupId>org.apache.tomcat</groupId>
    <artifactId>tomcat-catalina</artifactId>
    <version>8.0.21</version>
  </dependency>
  <dependency>
    <groupId>org.apache.tomcat</groupId>
    <artifactId>tomcat-util</artifactId>
    <version>8.0.21</version>
  </dependency>
  <dependency>
    <groupId>org.apache.tomcat.embed</groupId>
    <artifactId>tomcat-embed-core</artifactId>
    <version>8.0.21</version>
  </dependency>
  <dependency>
    <groupId>org.apache.tomcat.embed</groupId>
    <artifactId>tomcat-embed-jasper</artifactId>
    <version>8.0.21</version>
  </dependency>
</dependencies>

Optionally, you can enable the maven-assembly-plugin to package your application as a JAR at the end and have Maven include all dependencies you specified. As the build goal you’ll have to use assembly:single.

<plugin>
  <artifactId>maven-assembly-plugin</artifactId>
  <configuration>
    <archive>
      <manifest>
        <mainClass>de.jofre.embeddedtc.runtime.Main</mainClass>
      </manifest>
    </archive>
    <descriptorRefs>
      <descriptorRef>jar-with-dependencies</descriptorRef>
    </descriptorRefs>
  </configuration>
</plugin>

Now create the class de.jofre.embeddedtc.runtime.Main (the name is arbitrary) and write the code according to the next listing:

package de.jofre.embeddedtc.runtime;
 
import java.io.File;
import java.util.logging.Logger;
 
import org.apache.catalina.Context;
import org.apache.catalina.LifecycleException;
import org.apache.catalina.startup.Tomcat;
 
public class Main {
  private final static Logger LOGGER = Logger.getLogger(Main.class.getName());
  private final static String mWorkingDir = System.getProperty("java.io.tmpdir");
  private static Tomcat tomcat = null;
 
  public static void main(String[] args) {
 
    tomcat = new Tomcat();
    tomcat.setPort(8080);
    tomcat.setBaseDir(mWorkingDir);
    tomcat.getHost().setAppBase(mWorkingDir);
    tomcat.getHost().setAutoDeploy(true);
    tomcat.getHost().setDeployOnStartup(true);
 
    try {
      tomcat.start();
    } catch (LifecycleException e) {
      LOGGER.severe("Tomcat could not be started.");
      e.printStackTrace();
    }
    LOGGER.info("Tomcat started on " + tomcat.getHost());
 
    // Alternatively, you can pass a WAR file as the last parameter of the following call, e.g. "C:\\Users\\admin\\Desktop\\app.war"
    Context appContext = tomcat.addWebapp(tomcat.getHost(), "/app", "C:\\Users\\admin\\Desktop\\app\\");
    LOGGER.info("Deployed " + appContext.getBaseName() + " as " + appContext.getPath());
 
    tomcat.getServer().await();
  }
}

The last question is what the directory C:\Users\admin\Desktop\app\ or, respectively, the archive C:\Users\admin\Desktop\app.war looks like. Well, it contains a simple HTML file named index.html…

<html><body>Test</body></html>

… and another folder called WEB-INF containing the web.xml with the following content:

<?xml version="1.0" encoding="ISO-8859-1"?>
 
<!DOCTYPE web-app 
    PUBLIC "-//Sun Microsystems, Inc.//DTD Web Application 2.3//EN" 
    "http://java.sun.com/dtd/web-app_2_3.dtd">
 
<web-app>
 
    <display-name>Test App</display-name>
    <description>A test app</description>
 
    <session-config>
      <session-timeout>30</session-timeout>
    </session-config>

    <welcome-file-list>
      <welcome-file>index.html</welcome-file>
    </welcome-file-list>
 
</web-app>
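
If you would rather generate this layout than create it by hand, the following is a minimal sketch using java.nio.file. The class WebappSkeleton and its methods are hypothetical helpers, the descriptor is a shortened variant of the web.xml above, and the target location (a fresh temporary directory) is an assumption made for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class WebappSkeleton {

  // Shortened variant of the deployment descriptor shown above.
  private static final String WEB_XML = String.join("\n",
      "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>",
      "<!DOCTYPE web-app",
      "    PUBLIC \"-//Sun Microsystems, Inc.//DTD Web Application 2.3//EN\"",
      "    \"http://java.sun.com/dtd/web-app_2_3.dtd\">",
      "<web-app>",
      "  <display-name>Test App</display-name>",
      "  <welcome-file-list>",
      "    <welcome-file>index.html</welcome-file>",
      "  </welcome-file-list>",
      "</web-app>",
      "");

  // Creates <parent>/app/index.html and <parent>/app/WEB-INF/web.xml.
  static Path create(Path parent) throws IOException {
    Path app = parent.resolve("app");
    Path webInf = app.resolve("WEB-INF");
    Files.createDirectories(webInf);
    Files.write(app.resolve("index.html"),
        "<html><body>Test</body></html>".getBytes(StandardCharsets.ISO_8859_1));
    Files.write(webInf.resolve("web.xml"), WEB_XML.getBytes(StandardCharsets.ISO_8859_1));
    return app;
  }

  // Convenience wrapper: builds the skeleton inside a new temporary directory.
  static Path createInTempDir() {
    try {
      return create(Files.createTempDirectory("embeddedtc"));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    Path app = createInTempDir();
    System.out.println("Created web application skeleton in " + app);
  }
}
```

The printed path can then be used as the last argument of addWebapp() in place of the hard-coded desktop folder.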

Now, if you start the Java application and open http://localhost:8080/app, the content of index.html should be displayed. Hope this helps you!


Unix – id does not show group names from groups in LDAP

I recently faced the problem that group names were not displayed when calling e.g. id jonas while managing my groups in LDAP, which is connected via SSSD. My mistake was that I had forgotten to add the following lines to my domain section in /etc/sssd/sssd.conf:

ldap_group_object_class = posixGroup
ldap_group_search_base = ou=groups,dc=example,dc=com
ldap_group_name = cn
ldap_group_member = memberUid

Hope this hint helps here and there.


Resolving Oracle TNS names fails

For a few days now I have been working with Oracle, more or less voluntarily, and I almost went crazy when I failed to establish a simple connection to a database via sqlplus. The usual procedure is to create an alias for a connection in the file %ORACLE_HOME%/network/admin/tnsnames.ora and then specify this alias when connecting. Such an alias looks as follows:

myAlias=(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=myhost.jofre.de)(PORT=1234))(CONNECT_DATA=(SERVICE_NAME=mysrv)(SERVER=DEDICATED)))

To connect to the database from the command line using sqlplus, you run the following command:

sqlplus user/password@myAlias

My problem was that I kept getting the error message TNS-03505 Failed to resolve name. In other words, Oracle could not resolve the alias myAlias because the content of tnsnames.ora was apparently unknown. The solution to the puzzle was to set the environment variable TNS_ADMIN so that it points to the directory containing tnsnames.ora.

set TNS_ADMIN=%ORACLE_HOME%/network/admin

Of course, ORACLE_HOME must have been set beforehand as well. I hope this helps one or the other of you.
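
To check quickly which tnsnames.ora a machine is likely to use, a small helper can be sketched as follows. It deliberately simplifies the lookup and assumes only two rules: TNS_ADMIN names the directory that contains tnsnames.ora, and ORACLE_HOME/network/admin is used when TNS_ADMIN is not set. The class TnsNamesLocator and its method are hypothetical and not part of any Oracle tooling.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class TnsNamesLocator {

  // Returns the expected location of tnsnames.ora, or null if neither
  // variable is set. Simplified: TNS_ADMIN (a directory) takes precedence,
  // otherwise ORACLE_HOME/network/admin is assumed.
  static Path expectedTnsNames(String tnsAdmin, String oracleHome) {
    if (tnsAdmin != null && !tnsAdmin.isEmpty()) {
      return Paths.get(tnsAdmin, "tnsnames.ora");
    }
    if (oracleHome != null && !oracleHome.isEmpty()) {
      return Paths.get(oracleHome, "network", "admin", "tnsnames.ora");
    }
    return null;
  }

  public static void main(String[] args) {
    Path expected = expectedTnsNames(System.getenv("TNS_ADMIN"), System.getenv("ORACLE_HOME"));
    if (expected == null) {
      System.out.println("Neither TNS_ADMIN nor ORACLE_HOME is set.");
    } else if (Files.isRegularFile(expected)) {
      System.out.println("Found " + expected);
    } else {
      System.out.println("Expected tnsnames.ora at " + expected + " but it does not exist.");
    }
  }
}
```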
