Publication details

Extracting Semantic Relations from Wikipedia using Spark (Hans Ole Hatzel), Bachelor's Thesis, School: Universität Hamburg, 2017-02-02

Abstract

In this work, the full text of both the German and the English Wikipedia was used for two subtasks: (1) finding compound words and (2) finding semantic associations of words. The approach to the first task was to find all nouns in Wikipedia and evaluate which of those form compounds with any other nouns that were found. PySpark was used to work through the whole Wikipedia dataset, and the performance of the part-of-speech tagging operation on the whole dataset was good. In this way, a huge list of nouns was created, which could then be checked for compound words. As this involved checking each noun against every other noun, the performance was not acceptable, with the analysis of the whole English Wikipedia taking over 200 hours. The data generated from the first subtask was then used for both generating and solving CRA tasks. CRA tasks could be generated at a large scale and were solved with an accuracy of up to 33%. The second subtask was able to cluster words based on their semantics. It was established that this clustering works to some extent and that the vectors representing the words therefore have some legitimacy. The second subtask's results could be used for further analysis of how the difficulty of CRA tasks depends on how the words are related to each other.
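
The compound search described in the abstract can be illustrated with a small sketch. This is not the thesis's implementation: it assumes a plain in-memory list of nouns rather than a PySpark dataset, and the names `find_compounds` and `min_part_len` are hypothetical. Instead of comparing each noun against every other noun, it splits each noun at every position and looks both halves up in a set, which is one common way to avoid the pairwise comparison the abstract reports as too slow.

```python
def find_compounds(nouns, min_part_len=3):
    """Map each noun to the (head, tail) splits whose parts are both known nouns.

    Illustrative only: `min_part_len` is an assumed cutoff that keeps very
    short fragments from counting as nouns.
    """
    vocab = {n.lower() for n in nouns}
    compounds = {}
    for word in sorted(vocab):
        # Try every split position that leaves both parts long enough.
        splits = [
            (word[:i], word[i:])
            for i in range(min_part_len, len(word) - min_part_len + 1)
            if word[:i] in vocab and word[i:] in vocab
        ]
        if splits:
            compounds[word] = splits
    return compounds


if __name__ == "__main__":
    sample = ["fire", "man", "fireman", "work", "firework", "house"]
    for compound, parts in find_compounds(sample).items():
        print(compound, parts)
```

The sketch only handles compounds that are a direct concatenation of two nouns. German compounds with linking elements (for example an inserted "s") would need extra handling that is not shown here.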

BibTeX

@misc{ESRFWUSH17,
	author	 = {Hans Ole Hatzel},
	title	 = {{Extracting Semantic Relations from Wikipedia using Spark}},
	advisors	 = {Julian Kunkel},
	year	 = {2017},
	month	 = {02},
	school	 = {Universität Hamburg},
	howpublished	 = {{Online \url{https://wr.informatik.uni-hamburg.de/_media/research:theses:hans_ole_hatzel_extracting_semantic_relations_from_wikipedia_using_spark.pdf}}},
	type	 = {Bachelor's Thesis},
	abstract	 = {In this work, the full text of both the German and the English Wikipedia was used for two subtasks: (1) finding compound words and (2) finding semantic associations of words. The approach to the first task was to find all nouns in Wikipedia and evaluate which of those form compounds with any other nouns that were found. PySpark was used to work through the whole Wikipedia dataset, and the performance of the part-of-speech tagging operation on the whole dataset was good. In this way, a huge list of nouns was created, which could then be checked for compound words. As this involved checking each noun against every other noun, the performance was not acceptable, with the analysis of the whole English Wikipedia taking over 200 hours. The data generated from the first subtask was then used for both generating and solving CRA tasks. CRA tasks could be generated at a large scale and were solved with an accuracy of up to 33\%. The second subtask was able to cluster words based on their semantics. It was established that this clustering works to some extent and that the vectors representing the words therefore have some legitimacy. The second subtask's results could be used for further analysis of how the difficulty of CRA tasks depends on how the words are related to each other.},
}