Tag: hadoop

Erasmus+ project iBIGworld: Transnational Meeting M4 held at the University of Library Studies and Information Technologies in Bulgaria

On September 15-16, 2022, Transnational Meeting M4 was held in Sofia, Bulgaria. The meeting was hosted by the University of Library Studies and Information Technologies (https://www.unibit.bg/).

The fourth project meeting was organized by the University of Library Studies and Information Technologies (ULSIT) in Bulgaria. Its purpose was to summarize the project accomplishments, discuss the experience from the second training activity and the second multiplier event, and finalize the evaluations of how the project objectives were fulfilled. The plan for post-project output sustainability was confirmed and signed.

Participants from the University of Nis (UNi), the University of Bielsko-Biala (UBB), and the University of Library Studies and Information Technologies (ULSIT) took an active part in the meeting.

Photo: Participants of M4 at the University of Library Studies and Information Technologies.

The link to the agenda of the meeting is below:

https://docs.google.com/document/d/1HBQPKkZn4NRatXKeBlq-O3wKfKtiy0FF/edit?usp=sharing&ouid=101046534126028994548&rtpof=true&sd=true

First, the local dissemination reports were presented and confirmed by the Steering Team.

Then the task performance reports were presented and discussed, with attention paid to the sequential order of the intellectual outputs:

  • the requirements for the Big Data training course (O2) follow from the survey analysis in O1;
  • the framework of the Big Data training course (O3) is built on the competencies-topics matrix from O2;
  • the guidelines for teachers, students, and business (O4) are based on the training course from O3.

Prof. Vasyl Martsenyuk presented the current state of the project outcomes. He focused attention on the rules for reporting the intellectual outputs of the project: responsibility for preparing each intellectual output lies with the lead organizations, namely O1 – UBB, O2 – TSNUK, O3 – ULSIT, and O4 – UNi. He also noted the need to organize the project's financial reporting, especially the timesheets for the intellectual outputs. A substantial discussion was devoted to good practice in reporting the multiplier events E1, E2, E3, and E4.

He then presented the sustainability plan, whose purpose is to extend the dissemination and exploitation activities of the project into the period after EC financing ends. The link to the sustainability plan is below: https://docs.google.com/document/d/1AYHF9lYoeRAN0twpQ6eJA6Oty6bzkMtO/edit?usp=sharing&ouid=101046534126028994548&rtpof=true&sd=true

Prof. Georgi Dimitrov drew attention to the preparation and shaping of the final evaluations of the project results. The meeting participants noted the significant contribution of the ULSIT team and its leading role in developing the framework of the Big Data training course.

Prof. Dejan Rancic presented the activities carried out by the UNi team. The other participants highly appreciated the UNi team's contribution to the piloting stage of the project, especially in organizing the C2 and C3 training activities.

The financial management report was discussed. The participants agreed on the procedure for preparing and filling in the final financial report using the Mobility Tool platform.

Local dissemination results were also presented.

Finally, a to-do list was proposed and confirmed.

Photo: Coordinators of the project teams at M4: Prof. Dejan Rancic (UNi), Prof. Georgi Dimitrov (ULSIT), and Prof. Vasyl Martsenyuk (UBB).

Photo: Participants of M4 from UBB (Marcin Bernas and Vasyl Martsenyuk) and ULSIT (Eugenia Kovatcheva) on the grounds of the ULSIT rectorate.

Project: building a data pipeline that takes Big Data aspects into account

The University of Bielsko-Biala is carrying out this project as part of a BSc thesis. The subject of the thesis is to explore the available technologies for working on the large data sets referred to as Big Data and to design a data pipeline based on specific assumptions. The pipeline covers the processes of obtaining, processing, analysing, applying, and storing data.

When designing the data pipeline, the author used Google Cloud Platform tools. The Google Storage component provided access to the data storage container, and the Google Dataproc service made it possible to create and configure the Hadoop and Spark cluster. In the worked example, the author showed the transformation of data obtained from the Twitter social network, extracting from it the customers' opinions about the surveyed company. The next stage of the work concerned the analysis of the obtained data; for this purpose the Google Data Studio platform was used to create charts, reports, and statistics enabling an in-depth analysis of the studied example. The obtained results were then compared with the information available on the market.
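As an illustration of such a pipeline, here is a minimal PySpark sketch of the kind of transformation described above. It is not the thesis code: the bucket paths, column names, company keyword, and the naive word-list scoring are assumptions made for this example; a job in this shape could be submitted to a Dataproc cluster with gcloud dataproc jobs submit pyspark.

```python
# Minimal, illustrative PySpark sketch of a tweet-opinion pipeline.
# Bucket names, file layout, and column names are assumptions, not the thesis code.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("tweet-opinion-pipeline").getOrCreate()

# Ingest: tweets previously collected into a (hypothetical) Cloud Storage bucket.
tweets = spark.read.json("gs://example-bucket/tweets/*.json")

# Process: keep tweets mentioning the surveyed company (hypothetical keyword).
company = tweets.filter(F.lower(F.col("text")).contains("examplecorp"))

# Analyse: a crude lexicon-based opinion score, standing in for a real model.
positive = ["good", "great", "love", "excellent"]
negative = ["bad", "poor", "hate", "terrible"]
words = F.split(F.lower(F.col("text")), r"\s+")
score = (
    F.size(F.array_intersect(words, F.array(*[F.lit(w) for w in positive])))
    - F.size(F.array_intersect(words, F.array(*[F.lit(w) for w in negative])))
)
scored = company.withColumn("opinion_score", score)

# Store: aggregate per day and write back to Cloud Storage, e.g. as a source
# for charts and reports built in Google Data Studio.
daily = scored.groupBy(F.to_date("created_at").alias("day")).agg(
    F.avg("opinion_score").alias("avg_opinion"),
    F.count("*").alias("tweets"),
)
daily.write.mode("overwrite").csv("gs://example-bucket/opinion-daily", header=True)

spark.stop()
```

A real pipeline would swap the word-list scoring for a proper sentiment model, but the overall shape of the job (read from Storage, transform in Spark, write results back for visualization in Data Studio) stays the same.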

Work supervisor: Marcin Bernaś (mbernas@ath.bielsko.pl)

Big Data nowadays

Big Data is a relatively new area of research that merges fields such as cloud computing, data science, and Artificial Intelligence. Its definition was proposed in 2012 [1]: a large Volume of data of great Variety, processed with attention to its Velocity (hence 3V). The definition has since evolved to 5V [2], which additionally takes into account Veracity (the quality of the captured data) and Value (its usefulness).

Over the last decade, a Big Data processing pattern has become established, consisting of the following stages: ingest (the data collection stage), store (managing and storing data, also in real time), process (transforming the data), analyse (obtaining vital information), and insight (consuming the data as information or as input for further applications). A trivial sketch of these stages follows.
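As a plain illustration of the pattern itself, independent of any particular tool, the sketch below expresses the five stages as Python functions over a toy record set; each function stands in for a dedicated technology named in the next paragraph.

```python
# Illustrative skeleton of the ingest -> store -> process -> analyse -> insight
# pattern; each stage stands in for a dedicated tool (e.g., Kafka, HDFS, Spark).
from typing import Iterable

def ingest() -> Iterable[dict]:
    # Ingest: the collection stage; in practice a log, bulk, message,
    # or dataflow collector.
    yield {"sensor": "s1", "value": 21.5}
    yield {"sensor": "s1", "value": 23.0}
    yield {"sensor": "s2", "value": None}

def store(records: Iterable[dict]) -> list:
    # Store: persist the raw records (here simply materialized in memory).
    return list(records)

def process(records: list) -> list:
    # Process: clean and transform; done as batches or streams in practice.
    return [r["value"] for r in records if r["value"] is not None]

def analyse(values: list) -> float:
    # Analyse: statistics or machine learning over the processed data.
    return sum(values) / len(values)

def insight(result: float) -> None:
    # Insight: consume the result as a report, chart, or downstream input.
    print(f"mean sensor value: {result:.2f}")

insight(analyse(process(store(ingest()))))
```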

The process usually starts with data collection (the data meeting the 5V definition). The data are typically ingested as logs (e.g., Flume), bulk data (e.g., Sqoop), messages (e.g., Kafka), or dataflows (e.g., NiFi). Large data sets are then processed by a computing engine, either as batches (e.g., MapReduce) or as streams (e.g., Flink, Spark, Storm). Data (structured or not) are analysed using machine learning methods (e.g., Caffe, TensorFlow, Python) or statistical approaches (SparkR, R) and then visualized (e.g., Tableau, GraphX). It is worth keeping in mind that the resulting solution changes constantly and should be updated and orchestrated accordingly (e.g., Oozie, Kepler, Apache NiFi). The obtained data can be governed by various solutions, e.g., Apache Falcon, Apache Atlas, Apache Sentry, or Apache Hive. Data security is another important issue (e.g., Apache Metron or Apache Knox), as are new technologies that change the ways and types of data transfer, such as InfiniBand or 5G.
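To make the ingest and stream-processing stages concrete, below is a minimal Spark Structured Streaming sketch that consumes JSON events from Kafka and maintains a running aggregate. The broker address, topic name, and event schema are assumptions for illustration, and the job additionally needs the spark-sql-kafka connector package matching the installed Spark version.

```python
# Minimal sketch of Kafka ingest feeding Spark Structured Streaming;
# broker, topic, and event schema are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

# Assumed shape of the JSON events on the topic.
schema = StructType([
    StructField("device", StringType()),
    StructField("reading", DoubleType()),
])

# Ingest: subscribe to a (hypothetical) Kafka topic of JSON events.
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "events")
    .load()
)

# Process: decode the Kafka message payload and parse the JSON.
events = raw.select(F.from_json(F.col("value").cast("string"), schema).alias("e"))

# Analyse: maintain a running average per device.
stats = events.groupBy("e.device").agg(F.avg("e.reading").alias("avg_reading"))

# Insight: print the running aggregates to the console; a real job would
# write to a store that analytics and visualization tools read from.
query = stats.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```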

Big Data is now more than 10 years old and is reaching new heights, thanks to its wide adoption and to the companies delivering new tools. Looking at the summary in [3], the number of technologies and solutions is overwhelming.

In our research we look for the competencies required by the international and local markets. Based on our analysis and on current trends [4], we identified the classic open-source Hadoop, Spark, and Storm solutions, as well as technologies that are gaining popularity. Our research focuses on open-source solutions that can run on dedicated infrastructure or on Big Data cloud services provided by leading platforms such as AWS, Microsoft Azure, or Google BigQuery.

We also keep in mind that the market is constantly flooded with new mechanisms and pipelines that aim to tackle Big Data in a simpler and more unified way. Vendors tend to create solutions that simplify Big Data analysis and make it easier to use. Several solutions [3] illustrate the current trends:

• visual analytical tools, which let analysts focus on data analytics using simple calculations or a point-and-click approach while getting built-in support for big data storage, real-time management, and security; examples are Arcadia Enterprise 4.0, AtScale 5.0, and Dataguise DgSecure 6.0.5;
• frameworks for building applications on top of Big Data, with DevOps capabilities and big data transformation support, which let developers use familiar languages such as R, Python, or SQL; examples are Attunity Compose 3.0, Cazena Data Science Sandbox as a Service, and Lucidworks Fusion 3. Some solutions, like the Couchbase suite, target web, mobile, and Internet of Things (IoT) applications;
• solutions that help provide data as a service for applications; they build on pipelines such as Microsoft Azure or the Hadoop ecosystem and turn them into an information platform (Paxata Spring ’17, Pentaho 7.0, or Qubole Data Service).

References
[1] Wu, X., Zhu, X., Wu, G.-Q. and Ding, W. (2014) Data Mining with Big Data. IEEE Transactions on Knowledge and Data Engineering, 26, 97-107. https://doi.org/10.1109/TKDE.2013.109
[2] Nagorny, K., Lima-Monteiro, P., Barata, J. and Colombo, A.W. (2017) Big Data Analysis in Smart Manufacturing. International Journal of Communications, Network and System Sciences, 10, 31-58.
[3] The Big Data technology map: http://mattturck.com/wp-content/uploads/2020/09/2020-Data-and-AI-Landscape-Matt-Turck-at-FirstMark-v1.pdf
[4] Cui, Y., Kara, S. and Chan, K.C. (2020) Manufacturing Big Data Ecosystem: A Systematic Literature Review. Robotics and Computer-Integrated Manufacturing, 62, 101861.
[5] Article online: https://www.readitquik.com/articles/digital-transformation/10-big-data-advances-that-are-changing-the-game/