Designing a data pipeline that takes Big Data aspects into account
The project is carried out at the University of Bielsko-Biala as part of a BSc thesis. The aim of the thesis is to survey the available technologies for working with large data sets, referred to as Big Data, and to design a data pipeline based on specific assumptions. The pipeline covers obtaining, processing, analysing, applying and storing data. The author designed the pipeline with Google Cloud Platform tools: the Google Cloud Storage component provided access to the data storage container, and the Google Dataproc service was used to create and configure a Hadoop and Spark cluster. In the example built for the thesis, the author showed how data obtained from the Twitter social network is transformed to extract customers' opinions about the company under study. The next stage of the work was the analysis of the obtained data. For this purpose the Google Data Studio platform was used to create charts, reports and statistics that enable an in-depth analysis of the example. The obtained data was compared with information available on the market.
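
The stages named above (obtaining, processing, analysing and storing data) can be illustrated with a minimal sketch. It is not the thesis implementation: it uses plain Python and in-memory data instead of Spark on Dataproc and Google Storage, and the sample tweets, the company name "ACME", the keyword lists and all function names are assumptions made only for illustration.

```python
import json
import re
from collections import Counter

# Hypothetical records standing in for tweets obtained from Twitter.
RAW_TWEETS = [
    {"id": 1, "text": "Love the new ACME phone, great battery!"},
    {"id": 2, "text": "ACME support was terrible, very slow."},
    {"id": 3, "text": "Nice weather today."},
    {"id": 4, "text": "The acme store was okay."},
]

# Assumed keyword lexicons; a real analysis would need a proper sentiment model.
POSITIVE = {"love", "great", "good"}
NEGATIVE = {"terrible", "slow", "bad"}


def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z']+", text.lower())


def mentions_company(tweet, company):
    """Processing step: keep only tweets that mention the company."""
    return company.lower() in tokenize(tweet["text"])


def label_opinion(tweet):
    """Analysis step: label a tweet by counting positive and negative keywords."""
    words = set(tokenize(tweet["text"]))
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def run_pipeline(raw_tweets, company):
    """Filter, label and summarise tweets about one company."""
    relevant = [t for t in raw_tweets if mentions_company(t, company)]
    labelled = [{**t, "opinion": label_opinion(t)} for t in relevant]
    summary = dict(Counter(t["opinion"] for t in labelled))
    return labelled, summary


labelled, summary = run_pipeline(RAW_TWEETS, "ACME")

# Storage step: JSON Lines text that could be written to a storage container.
serialized = "\n".join(json.dumps(t) for t in labelled)
print(summary)
```

In a cluster setting the filtering and labelling functions would typically run as distributed transformations, and the serialized output would be written to the storage container for the reporting tool to read.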

Thesis supervisor: Marcin Bernaś (mbernas@ath.bielsko.pl)