[Java] 100 common words in German, English and French

To exclude common words from word and tag clouds, or to identify them in text mining procedures, you need a list of them beforehand. The following Java class provides such lists for German, English and French. Thanks to the University of Leipzig, which created the collection.

Update: I added some common words to the German list that I found while doing my research, and removed some duplicates.

package de.jofre.commonwords;
 
public class CommonWords {
 
	// By Jonas Freiknecht
 
	// Source: http://wortschatz.uni-leipzig.de/
 
	// There also exists a web service from the University of Leipzig
	// to determine common words:
	// http://wortschatz.uni-leipzig.de/axis/servlet/ServiceOverviewServlet
 
	// Returns true if _list contains _str, ignoring case.
	public static boolean contains(String _str, String[] _list) {
		for (int i = 0; i < _list.length; i++) {
			if (_list[i].equalsIgnoreCase(_str)) {
				return true;
			}
		}
		return false;
	}
 
	public final static String[] COMMON_WORDS_GERMAN = { "der", "die", "und",
			"in", "den", "von", "zu", "das", "mit", "sich", "des", "auf",
			"für", "ist", "im", "dem", "nicht", "ein", "eine", "als",
			"auch", "es", "an", "werden", "aus", "er", "hat", "daß", "sie",
			"nach", "wird", "bei", "einer", "du", "um", "am", "sind", "noch",
			"wie", "einem", "über", "einen", "ob", "so", "dessen", "zum", "war",
			"haben", "nur", "oder", "aber", "vor", "zur", "bis", "mehr",
			"durch", "man", "sein", "wurde", "sei", "RT", "bin", "hatte",
			"kann", "gegen", "vom", "können", "schon", "wenn", "habe", "seine",
			"Mark", "ihre", "dann", "unter", "wir", "soll", "ich", "eines",
			"ins", "Jahr", "zwei", "Jahren", "diese", "dieser", "wieder",
			"keine", "willst", "seiner", "worden", "Und", "will", "zwischen",
			"extra", "immer", "Millionen", "Ein", "was", "sagte", "ihr",
			"jetzt", "kennen", "sagen", "armer", "arme", "gerne", "kenne",
			"meine", "hoffe", "sehen", "achso", "reicht", "dabei", 
			"gehst", "alles", "selbst", "neuen", "neue", "liebe",
			"feiern", "letzte", "macht", "könnte", "keiner", "glaub", "glaube",
			"gehen", "euren", "passt", "passe", "passen", "findet",
			"eigentlich", "reden", "machen", "liebt", "halbe", "dieses",
			"finde", "sitze", "sonst", "heute", "brauch",
			"drauf", "total", "meint", "denkt", "lässt", "hätte",
			"damals", "lange", "dachte", "wirst", "hören", "kennt",
			"bitte", "treffen", "würde", "fängt", "länger", "könnt",
			"sitzt"};
 
	public final static String[] COMMON_WORDS_ENGLISH = { "the", "of", "to",
			"and", "a", "in", "for", "is", "The", "that", "on", "said", "with",
			"be", "was", "by", "as", "are", "at", "from", "it", "has", "an",
			"have", "will", "or", "its", "he", "not", "were", "which", "this",
			"but", "can", "more", "his", "been", "would", "about", "their",
			"also", "they", "million", "had", "than", "up", "who", "In", "one",
			"you", "new", "A", "I", "other", "year", "all", "two", "S", "But",
			"It", "company", "into", "U", "Mr.", "system", "some", "when",
			"out", "last", "only", "after", "first", "time", "says", "He",
			"years", "market", "no", "over", "we", "could", "if", "people",
			"percent", "such", "This", "most", "use", "because", "any", "data",
			"there", "them", "government", "may", "software", "so", "New",
			"now", "many" };
 
	public final static String[] COMMON_WORDS_FRENCH = { "de", "la", "le",
			"et", "les", "des", "en", "un", "du", "une", "que", "est", "pour",
			"qui", "dans", "a", "par", "plus", "pas", "au", "sur", "ne", "se",
			"Le", "ce", "il", "sont", "La", "Les", "ou", "avec", "son", "Il",
			"aux", "d'un", "En", "cette", "d'une", "ont", "ses", "mais",
			"comme", "on", "tout", "nous", "sa", "Mais", "fait", "été",
			"aussi", "leur", "bien", "peut", "ces", "y", "deux", "A", "ans",
			"l", "encore", "n'est", "marché", "d", "Pour", "donc", "cours",
			"qu'il", "moins", "sans", "C'est", "Et", "si", "entre", "Un", "Ce",
			"faire", "elle", "c'est", "peu", "vous", "Une", "prix", "On",
			"dont", "lui", "également", "Dans", "effet", "pays", "cas", "De",
			"millions", "Belgique", "BEF", "mois", "leurs", "taux", "années",
			"temps", "groupe" };
 
}
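
The class above only offers a linear, case-insensitive search over an array. As a rough sketch of how the lists might be used to filter tokens before building a tag cloud, the following example copies a list into a lower-cased `HashSet` and drops every token found in it. The class `StopWordFilterSketch`, its methods and the short `SAMPLE_COMMON_WORDS` excerpt are assumptions made for illustration and are not part of the original class; in practice you would pass one of the `CommonWords` arrays instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class StopWordFilterSketch {

    // Hypothetical short excerpt standing in for CommonWords.COMMON_WORDS_ENGLISH.
    public static final String[] SAMPLE_COMMON_WORDS =
            { "the", "of", "to", "and", "a", "in", "is", "The" };

    // Lower-cases every entry once, so lookups are case-insensitive and
    // case variants such as "the"/"The" collapse into a single entry.
    public static Set<String> toLookupSet(String[] words) {
        Set<String> set = new HashSet<>();
        for (String word : words) {
            set.add(word.toLowerCase(Locale.ROOT));
        }
        return set;
    }

    // Returns the tokens that are not common words, in their original order.
    public static List<String> removeCommonWords(List<String> tokens, Set<String> commonWords) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!commonWords.contains(token.toLowerCase(Locale.ROOT))) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> common = toLookupSet(SAMPLE_COMMON_WORDS);
        List<String> tokens = Arrays.asList("The", "size", "of", "a", "tag", "cloud");
        System.out.println(removeCommonWords(tokens, common)); // prints [size, tag, cloud]
    }
}
```

A set lookup avoids scanning the whole array for every token, which may matter for larger texts. Lower-casing with `Locale.ROOT` is one possible normalization and is not guaranteed to match `equalsIgnoreCase` for every character.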

2 thoughts on “[Java] 100 common words in German, English and French”

  1. Unfortunately, many entries are duplicated because of differing capitalization (even though the contains method is case-insensitive anyway)…

  2. You are right about that; maybe I will revise the list again and add a few more words.
