Calculate prime numbers using spark

Hi, for a test I wrote a short Java application that calculates the prime numbers between 0 and 1000000 on a distributed Spark cluster. Since Spark 2.x examples are rare on the internet, I'll just leave this here. The prime number code is by Oscar Sanchez.

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;
import scala.Tuple2;
 
public class Main {
 
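  // Trial division: numbers below 2 are not prime; otherwise test every
  // divisor i with i * i <= n (this also rejects 4, which 2 * i < n let through).
  private static boolean isPrime(long n) {
    if (n < 2) {
      return false;
    }
    for (long i = 2; i * i <= n; i++) {
      if (n % i == 0) {
        return false;
      }
    }
    return true;
  }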
 
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("PrimeApp").getOrCreate();
    // Map every number in [0, 1000000) to a (number, isPrime) pair.
    Dataset<Tuple2<Long, Boolean>> rnd = spark.range(0L, 1000000L).map(
      (MapFunction<Long, Tuple2<Long, Boolean>>) x -> new Tuple2<Long, Boolean>(x, isPrime(x)),
      Encoders.tuple(Encoders.LONG(), Encoders.BOOLEAN()));
    rnd.show(false);
    spark.stop();
  }
 
}
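The primality check can also be tried locally, without a cluster, before it is shipped to Spark. The following is only a minimal sketch under stated assumptions: it uses trial division up to the square root, and the class name PrimeCheck and the helper countPrimes are hypothetical names introduced for illustration.

```java
import java.util.stream.LongStream;

class PrimeCheck {

  // Trial division: n is prime if it is at least 2 and no i with i * i <= n divides it.
  static boolean isPrime(long n) {
    if (n < 2) {
      return false;
    }
    for (long i = 2; i * i <= n; i++) {
      if (n % i == 0) {
        return false;
      }
    }
    return true;
  }

  // Counts the primes in the half-open range [0, limit), the same range spark.range(0, limit) covers.
  static long countPrimes(long limit) {
    return LongStream.range(0L, limit).filter(PrimeCheck::isPrime).count();
  }

  public static void main(String[] args) {
    // There are 25 primes below 100, which makes a handy smoke test.
    System.out.println(countPrimes(100L));
  }
}
```

Checking a few small values such as 0, 1 and 4 is worthwhile, because loop bounds in trial division are easy to get wrong by one.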

Java client for the hbase REST API

Modern open source projects frequently lack documentation, and HBase is no exception when it comes to its REST API. Even though REST is covered in the official HBase reference book, the guidelines are not complete, as I’ll show in the listings below. A second source that you might have come across is a set of three blog posts by Cloudera, which is quite good and much more detailed than the official HBase guidelines, but it still does not fully cover the traps you might run into when implementing a Java client.

I’d like to show three examples, two for a put and one for a get request. All further operations like delete or post should then be easy to implement once you know how to send requests to the REST server.
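To give a rough idea of what such a request looks like, here is a minimal sketch of a single-cell put, not the listings of this post. It assumes the JSON representation of the REST server, in which row key, column and value are Base64-encoded, and the default REST port 8080. The class HBaseRestPayload, its helper methods, the host, the table mytable and the column cf:col are hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class HBaseRestPayload {

  // Base64-encodes a string, as the JSON representation expects for key, column and value.
  static String b64(String s) {
    return Base64.getEncoder().encodeToString(s.getBytes(StandardCharsets.UTF_8));
  }

  // Builds the JSON body for a put of one cell into one row.
  static String putBody(String rowKey, String column, String value) {
    return "{\"Row\":[{\"key\":\"" + b64(rowKey) + "\",\"Cell\":[{\"column\":\""
        + b64(column) + "\",\"$\":\"" + b64(value) + "\"}]}]}";
  }

  // Builds the row URL; assumes table and row key need no URL escaping.
  static String rowUrl(String baseUrl, String table, String rowKey) {
    return baseUrl + "/" + table + "/" + rowKey;
  }

  // Sends the body with HTTP PUT and returns the status code of the response.
  static int put(String url, String body) throws IOException {
    HttpURLConnection con = (HttpURLConnection) new URL(url).openConnection();
    con.setRequestMethod("PUT");
    con.setRequestProperty("Content-Type", "application/json");
    con.setRequestProperty("Accept", "application/json");
    con.setDoOutput(true);
    try (OutputStream out = con.getOutputStream()) {
      out.write(body.getBytes(StandardCharsets.UTF_8));
    }
    return con.getResponseCode();
  }

  public static void main(String[] args) {
    // Only prints the request; call put(...) against a running REST server to send it.
    System.out.println(rowUrl("http://localhost:8080", "mytable", "row1"));
    System.out.println(putBody("row1", "cf:col", "value1"));
  }
}
```

A get request would go to the same row URL with the Accept header set to application/json, and the values in the response would have to be Base64-decoded again.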


Apache Ranger 0.5 build errors – How to fix them

I recently came across some Maven errors while building Apache Ranger 0.5, which caused the build of the HDFS Security Plugin and the Storm Security Plugin to fail.

To get rid of the error just change the version of the maven-assembly-plugin in Ranger’s main pom.xml from

<version>2.2-beta-5</version>

to

<version>2.3</version>

…, add the following repository definitions to the main pom.xml:

<repository>
  <id>conjars</id>
  <name>Concurrent Conjars repository</name>
  <url>http://conjars.org/repo</url>
  <layout>default</layout>
</repository>
<repository>
  <id>clojars.org</id>
  <url>http://clojars.org/repo</url>
</repository>

…, add the following dependency to the storm-agent/pom.xml:

<dependency>
  <groupId>ring</groupId>
  <artifactId>ring-jetty-adapter</artifactId>
  <version>0.3.11</version>
</dependency>

Now you should be able to build Ranger. If you run into a missing pom.xml for the dependency tomcat:common-el, add the following repository to the main pom.xml:

<repository>
  <id>conjars</id>
  <name>Concurrent Conjars repository</name>
  <url>http://conjars.org/repo</url>
  <layout>default</layout>
</repository>

The absence of the pom.xml on the default Maven repository should only be a temporary issue, but it took me some time to figure out. You can avoid it easily by adding a second repository that provides the missing file.

HDFS – setgid and setuid still not possible in Hadoop 2.7.1

I needed to set the setgid and/or setuid bit on directories in a customer’s HDFS, so I spent the last 10 minutes searching the internet for news on whether this has become possible. Long story short: the results were disappointing. The last time I tried, with Hadoop 2.2.0, it did not work; today I tried with 2.7.1 and nothing has changed. So commands like the following are NOT possible:

hdfs dfs -chmod g+s /tmp
hdfs dfs -chmod u+s /tmp

As a matter of fact, files created in HDFS receive the group of the parent directory and hence behave as if the setgid flag were set. The owner (uid) of a newly created file is always the user that created it.

Update: When HDFS is mounted through the NFS Gateway, the behavior is different: the file system acts as if setgid were NOT set. That is strange, since the same file operation returns different results depending on the interface it is executed through. Furthermore, the NFS Gateway does not take secondary groups into account when evaluating permissions to write into the mounted HDFS. The only way to grant access is a user mapping, as described at the bottom of the Hadoop Permissions Guide.

If you need these flags, think about using a POSIX-compliant alternative to HDFS such as Spectrum Scale/GPFS.

 


Setting up a Hadoop cluster

Some time ago I translated a tutorial by Michael Noll about setting up a Hadoop instance. A single node, however, mainly serves testing and learning purposes, and it is of course desirable to build a cluster in which the distributed system Hadoop can really play to its strengths.

In this tutorial I therefore translate the second part, which covers joining two servers into a cluster. The PDF can be downloaded here:

Have fun with it!

Installing Hadoop 1.0.4 on Ubuntu Server 12.04.2

Installing Hadoop is not entirely trivial and unfortunately not very well documented either. However, there is a very good tutorial by Michael Noll in which he explains how to set up a simple single-node cluster. I have more or less translated it into German and make it available here.

 

HDFS Explorer – Managing the Hadoop file system on a remote machine

Working with Hadoop means working with the Hadoop Distributed File System (HDFS), which in turn means reading, writing and deleting files via the command line. That can be difficult and tedious when you are not familiar with the common Unix and Hadoop commands. To make this task easier, I wrote a small application that can work with an HDFS running on Ubuntu.

So far, the application is able to:

  • Read the HDFS in a treeview
  • Upload / Download files to/from the local machine
  • Delete files on the HDFS
  • Create directories on the HDFS

If there is a need (and if I get enough good feedback 😉 ), I’ll add session management for several file systems as well as a function to start MapReduce jobs from the application (as can be seen in the lower group box).

A download will follow soon!

[Hadoop] hadoop.job.ugi missing from the location options

Another problem I stumbled over while setting up the Eclipse plugin is that the property hadoop.job.ugi, which according to the Yahoo tutorial should be set, is not present.

It only appears once a connection to the file system has been established successfully (if you have problems with that, please read on here), and not before!

[Hadoop] Accessing HDFS on a virtual machine from Eclipse

I have recently been looking into the topic of big data. Yahoo provides a good tutorial for this, but connecting from Eclipse to a virtual machine that hosts the file system (HDFS) does not work as described. Eclipse comments on this failure with a simple Error: null.

This error occurs when Eclipse runs on a Windows system, and the trick is: start Eclipse from Cygwin!

The command cygstart "C:\eclipse\eclipse.exe" should be enough, and the connection will be established.