On 25 and 26 October, IRIS attended the Big Data Congress, held in Barcelona (Spain) and organized by Eurecat. The organizers combined real business case studies of implementations with talks by industry and academic (applied research) leaders, who explained Big Data and Machine Learning for the general public. At the congress, a range of large and small enterprises presented some of the latest technologies and approaches they are using for Big Data processing and Machine Learning analytics.
The increasing demand for online, high-volume, multi-format data and information processing is driving a new generation of data repositories, data processing and advanced analytics, led by the IT industry and end-user stakeholders. Key technologies include NoSQL databases and Machine Learning techniques, which drive the front-end dashboards of decision support systems. In addition, more companies are now gaining access to big data storage and advanced analytics capability through third-party cloud solutions such as those offered by Google and Amazon.
Some key takeaways from the Congress/expo were:
- Data Quality: data pre-processing may be boring, but it's absolutely necessary before we can get the best out of our analytics solutions.
- Cloud-based solutions: pay-as-you-go rental of ready-made big data storage and analytics (the photo shows a Google Cloud Platform schema from the presentation given by Albert Climent of Pervasive Technologies).
- Building in-house expertise vs support through external collaborations: there is currently a shortage of trained and experienced data scientists and software engineers.
- Machine learning: why use it and when to use it
- Use cases: examples of IT big data infrastructure and data processing (e.g. Zurich Insurance, Eurecat, UPC, BiGeek, Deloitte), machine learning and analytics approaches (e.g. Amazon, Wallapop, Glovo).
One issue is whether to build your own IT infrastructure for big data or to pay as you go with a Cloud provider such as Amazon or Google. With the cloud solutions, you get the big data platform underneath and also access to machine learning libraries and analytics decision support systems (DSS). Several presentations favoured the Cloud rental approach. However, the final decision between an in-house and a Cloud rental solution can be influenced by circumstances, for example when a technology company (e.g. IRIS) is subcontracted by a third party to build a customized Cloud infrastructure solution.
Another issue is how to form a team or department for big data processing. One key aspect is attracting talented people; another is structuring the big data team into (i) IT infrastructure specialists, (ii) machine learning specialists, and (iii) people dedicated to data pre-processing (making sure the right data is in the data lake) and processing (especially to guarantee data quality). However, it was also emphasized that putting people into ‘silos’ should be avoided, as cross-disciplinary collaboration is also a key success factor for big data processing.
Dr. Mark Torrent of Eurecat presented a framework for offering services to SMEs, which embodies an ecosystem of activities, partners and customer companies. Two initiatives organized by Eurecat that foster participation and networking between companies and entities are the ‘Data & Analytics Club’ and ‘Big Data Catalonia’.
The following are summaries of selected presentations given on two key themes: Big Data Infrastructures and online AI/Machine Learning.
Big Data Infrastructures
- Deloitte presented their comprehensive platform architecture for big data capture and real-time analytics, which at the first level includes Internet technology (e.g. face recognition, analytical Wi-Fi, Bluetooth beacons, geo-positioning, …) and ‘touch points’ (intelligent sales points, cell phone, wearables, web). Then an intelligent content player interfaces with machine learning algorithms (descriptive, classifiers, predictive) and AI (for processing of images, video and natural language). Finally, the Big Data and real-time level is divided into batch analytics and real-time analytics (Spark, Scala, …).
- Oscar Romero of the UPC-ESSI explained the drawbacks of building the analytics data representation up-front into the data structures of the big data repository. As an improvement, he proposed first loading the data as-is (data lake) and doing the data modelling later. Oscar further developed this idea by adding a Catalog to the NoSQL database.
- BIGeek presented the architecture they had developed for the 2getherbank financial platform. 2gether is a new concept of bank that is oriented towards social use and goes beyond purely financial transactions. Their system interfaces with Blockchain, uses Rabbit at the front end for processing the incoming big data stream, and has back-end data repositories including Cassandra NoSQL (big data), MongoDB (user management) and Neo4j (for graph processing), with Tableau as the analytics software.
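The load-first, model-later idea from the UPC-ESSI talk can be illustrated with a small sketch. This is not the design that was presented: the in-memory `data_lake` list, the `catalog` dictionary and the functions `ingest` and `read_as` are hypothetical names chosen for illustration, and a real system would sit on top of a NoSQL store rather than Python lists. The sketch only shows the principle: raw records are stored untouched, a catalog records which fields have been seen, and a model is applied when the data is read.

```python
import json

# Hypothetical in-memory "data lake": raw records are kept exactly as they arrive.
data_lake = []

# Hypothetical catalog: per source, the fields seen so far and their observed types.
catalog = {}


def ingest(source, raw_json):
    """Store the raw record without modelling it, and update the catalog."""
    record = json.loads(raw_json)
    data_lake.append({"source": source, "payload": record})
    fields = catalog.setdefault(source, {})
    for key, value in record.items():
        fields.setdefault(key, set()).add(type(value).__name__)


def read_as(source, wanted_fields):
    """Apply a model at read time: project each raw record onto the wanted fields."""
    return [
        {field: item["payload"].get(field) for field in wanted_fields}
        for item in data_lake
        if item["source"] == source
    ]


# Two records from the same source with different shapes are both accepted.
ingest("sensors", '{"id": 1, "temp": 21.5}')
ingest("sensors", '{"id": 2, "temp": "n/a", "humidity": 40}')

# The analytics view is decided only now, after loading.
rows = read_as("sensors", ["id", "humidity"])
```

Because nothing is rejected at load time, the catalog is what tells an analyst that `temp` has arrived both as a number and as text, which is the kind of information needed before a model is chosen.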
Online AI/Machine Learning
- Glovo presented their project ‘optimizing notification algorithms’, whose objective is to optimize the delivery logistics for a pizza delivery business. First, a distance-based algorithm selects a courier, then an order notification algorithm estimates the preparation and delivery (distance-based) time.
- Hugo Zaragoza of Amazon explained how the company is using machine learning to evaluate fresh product quality in real time in their warehouse for online delivery. Without machine learning, experts need to inspect each batch individually, which gives rise to subjectivity and high variance in quality evaluation. To evaluate whether fruit is fresh, humans use three senses: vision, touch and smell. With machine learning and a vision-assisted vendor technology, the system learns to emulate human decisions by trial and error and improves over time. In the photo, we see Dr Hugo Zaragoza of Amazon Barcelona explaining how Amazon is using machine learning in different application areas.
- Wallapop, on the other hand, presented their solution for automatically translating the texts of their online classified ads into different languages. In Wallapop’s system, two important aspects were text understanding (e.g. keyword extraction using Bayesian approaches in the face of poor text quality) and image analysis (deep learning using Caffe, MXNet and TensorFlow).
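The two steps described for the Glovo project (distance-based courier selection, then a time estimate) can be sketched as follows. This is only an illustration under stated assumptions, not Glovo's algorithm: the great-circle (haversine) distance, the constant courier speed, the fixed preparation time and the function names `select_courier` and `estimate_minutes` are all hypothetical choices made for this example.

```python
import math


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def select_courier(couriers, restaurant):
    """Step 1 (assumed rule): pick the courier closest to the restaurant."""
    return min(couriers, key=lambda name: haversine_km(couriers[name], restaurant))


def estimate_minutes(courier_pos, restaurant, customer, prep_minutes, speed_kmh):
    """Step 2 (assumed rule): the courier travels while the order is prepared,
    so pickup happens at the later of the two, followed by the delivery leg."""
    to_restaurant = haversine_km(courier_pos, restaurant) / speed_kmh * 60
    to_customer = haversine_km(restaurant, customer) / speed_kmh * 60
    return max(prep_minutes, to_restaurant) + to_customer


# Hypothetical positions in Barcelona, given as (latitude, longitude).
couriers = {"ana": (41.39, 2.16), "ben": (41.45, 2.25)}
restaurant = (41.40, 2.17)
customer = (41.41, 2.18)

chosen = select_courier(couriers, restaurant)
eta = estimate_minutes(couriers[chosen], restaurant, customer,
                       prep_minutes=10, speed_kmh=15)
```

A production system would presumably replace the straight-line distance with routing data and learn preparation times per restaurant; the sketch only fixes the order of the two steps.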
At IRIS we are also working on several Big Data and embedded in-line Machine Learning projects. For example, (i) a big data software and hardware cloud platform for processing high volume streaming multimedia data, (ii) an embedded machine learning algorithm for modelling grain silo loading and predictive indicators for grain quality monitoring and (iii) a machine learning automatic parameter selection system for customized production of on-demand medical devices (Openmind*).