Luca Canali Homepage

Blog

Blog at externaltable.blogspot.com, with contributions to db-blog.web.cern.ch

ATLAS DCS Analysis with Apache Spark
Kepler's Mars Orbit Analysis with Python Notebooks & AI-Assisted Coding
Building an Apache Spark Performance Lab: Tools and Techniques for Spark Optimization
Enhancing Apache Spark Performance with Flame Graphs: A Practical Example Using Grafana Pyroscope
Performance Comparison of 5 JDKs on Apache Spark
Building a Semantic Search Engine and RAG Applications with Vector Databases and Large Language Models
Exploratory Notebooks for Deep Learning and Data Tools: A Beginner's Guide
CPU Load Testing Exercises: Tools and Analysis for Oracle Database Servers
Making Histograms with Apache Spark and Other SQL Engines
Can High Energy Physics Analysis Profit from Apache Spark APIs?
Apache Spark 3.0 Memory Monitoring Improvements
Distributed Deep Learning for Physics with TensorFlow and Kubernetes
Machine Learning Pipelines for High Energy Physics Using Apache Spark with BigDL and Analytics Zoo
A Performance Dashboard for Apache Spark
SparkMeasure, a tool for performance troubleshooting of Apache Spark workloads
Performance Analysis of a CPU-Intensive Workload in Apache Spark
Apache Spark and CERN Open Data Analysis, an Example
Diving into Spark and Parquet Workloads, by Example
On Measuring Apache Spark Workload Metrics for Performance Troubleshooting
Spark notes (hosted on GitHub):
- Miscellaneous Spark commands, tips, info
- Spark performance dashboard config details
- Spark workload measurements with sparkMeasure
- Spark executor memory
- Spark and Parquet
- Apache Spark - HBase Connector
- Spark for High Energy Physics
- Spark DataFrame Histograms
- Flame Graph, tools on Linux for profiling Spark
- Read/analyze Spark EventLog with Spark SQL
- Tools for Linux memory performance measurements
- Spark SQL, a fun UDF example with Mandelbrot set
- Linux OS, CPU, disk, and network monitoring tools
- Tools for Apache Parquet diagnostics
- MapInArrow for Python UDF
- Spark and OpenSearch
- Example of a Scala project for Spark
Posters and reports:
- Spark Executors' Memory Configuration, Office Poster
- Data Analysis and Machine Learning at Scale with Oracle Cloud Infrastructure
- Machine learning pipelines with Apache Spark and Intel BigDL
- Physics data analysis and data reduction at scale with Apache Spark
- Physics data processing and machine learning in the cloud

Presentations, Talks and Videos

Building an Apache Spark Performance Lab: Tools and Techniques for Optimization, CERN, April 2024. [pptx | PDF | sparkMeasure demo | TPCDS-PySpark demo | Spark-Dashboard demo]
Introduction to Apache Spark APIs for Data Processing: Training course on Apache Spark, November 2022.
Basic Physics Analyses Implemented Using Apache Spark, PyHEP 2022, September 14, 2022. [pptx | PDF | Video]
Monitor Apache Spark 3 on Kubernetes using Metrics and Plugins, Data+AI Summit 2021, May 26, 2021. [pptx | PDF | Video | Demo (mp4)]
What is New with Apache Spark Performance Monitoring in Spark 3.0, Data+AI Summit Europe 2020, November 18, 2020. [pptx | PDF | Video]
Big Data Tools and Pipelines for Machine Learning in HEP, CERN EP-IT Data Science seminar, December 4, 2019. [pptx | PDF]
Performance Troubleshooting Using Apache Spark Metrics, Spark Summit Europe 2019, Amsterdam, October 17, 2019. [pptx | PDF | Video]
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distributed Keras on Analytics Zoo, Spark Summit Europe 2019, Amsterdam, October 16, 2019. [pptx | PDF | Video]
Big Data In HEP - Physics Data Analysis, Machine learning and Data Reduction at Scale with Apache Spark, IXPUG Annual Conference 2019, CERN, September 24, 2019. [pptx | PDF]
Apache Spark for RDBMS Practitioners, Spark Summit Europe 2018, London, October 4, 2018. [pptx | PDF | Video]
Data Analytics – Use Cases, Platforms, Services @ CERN IT, ITMM Meeting, CERN, March 5, 2018. [pptx | PDF]
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Methods, Spark Summit Europe 2017, Dublin, October 26, 2017. [pptx | PDF | Video]
Overview of Big Data Solutions and Services at CERN, CERN Knowledge Transfer Forum, CERN, September 29, 2017. [pptx | PDF]
Hadoop and Spark Ecosystem for Data Analytics, Experience and Outlook, WLCG GDB meeting, CERN, September 13, 2017. [pptx | PDF]
Data Analytics and CERN IT Hadoop Service, CERN openlab Technical Workshop, CERN, December 9, 2016. [pptx | PDF]
Apache Spark 2.0 Performance Improvements Investigated With Flame Graphs, Spark Summit Europe 2016, Brussels, October 26, 2016. [pptx | PDF | Video]
Integration of Oracle and Hadoop: hybrid databases affordable at scale, CHEP 2016, San Francisco, October 11, 2016. [pptx | PDF]
Stack Traces and Flame Graphs for Oracle Troubleshooting, UKOUG Tech15 Super Sunday, Birmingham, December 6, 2015. [pptx | PDF]
Modern Linux Tools for Oracle Troubleshooting, Swiss Oracle User Group (SOUG) event, Prangins (CH), May 21, 2015. [PDF]
Database Services During Run 2, WLCG Collaboration Workshop, Okinawa (JP), April 11, 2015. [pptx | PDF]
Modern Linux Tools for Oracle Troubleshooting, UKOUG Tech14, Liverpool, December 9, 2014. [pptx | PDF]
A Closer Look at CALIBRATE_IO, UKOUG Tech14, Liverpool, December 9, 2014. [pptx | PDF]
Introduction on Data for Physics at CERN and Deep Dive into Oracle ASM, Enkitec E4 2014, Dallas (TX), June 2014. [pptx | PDF]
A Latency Picture is Worth a Thousand Storage Metrics, Hotsos 2014, Dallas (TX), March 4, 2014. [pptx | PDF]
Lost Writes, a DBA's Nightmare?, UKOUG Tech13, Manchester, December 4, 2013. [pptx | PDF]
Storage Latency for Oracle DBAs, UKOUG Tech13, Manchester, December 2, 2013. [pptx | PDF]
Active Data Guard at CERN, UKOUG Conference 2012, Birmingham, December 4, 2012. [pptx | PDF]
Testing Storage for Oracle RAC 11g with NAS, ASM, and SSD Flash Cache, UKOUG Conference 2011, Birmingham, December 6, 2011. [pptx | PDF]
CERN IT-DB Deployment, Status, Outlook, ESA-GAIA DB Workshop, ISDC, Geneva, March 2011. [pptx | PDF]
Click here for a list including talks prior to 2011

Repositories

Repositories at https://github.com/LucaCanali

SparkMeasure
A tool for performance troubleshooting of Apache Spark.
Spark Performance Dashboard
Deploy an Apache Spark performance dashboard using container technology (Dockerfile and Helm chart).
SparkPlugins
Code and examples of how to use Spark Plugins, including plugins to extend Spark metrics systems with custom monitoring probes.
SparkTraining
Training material for course "Introduction to Apache Spark APIs for Data Processing": https://sparktraining.web.cern.ch/
Spark Notes
A collection of Apache Spark notes, tips, and techniques covering a variety of topics.
Spark for Physics
Jupyter notebooks with examples of High Energy Physics analyses using Spark.
SparkHistograms
Python and Scala packages for generating histograms with Spark.
CPU workload generator and test kit (Python)
A Python tool for CPU stress testing.
CPU and memory workload generator and test kit (Rust)
A Rust-based CPU stress testing tool.
Spark CPU/Memory load testing
Tools for conducting CPU and memory performance tests with Spark.
TPCDS-PySpark
A Python package to streamline running TPCDS benchmarks with PySpark.
Data_Analyses/Kepler
Notebooks demonstrating Kepler's Mars orbit analysis using the modern Python ecosystem and AI-assisted coding.
Notebook Examples
Example notebooks for Deep Learning, Data Tools, and AI Tools.
Linux tracing scripts
Scripts and tools for troubleshooting and performance analysis in Linux.
SparkDLTrigger
Code, notebooks, and links to the datasets accompanying the article "Machine Learning Pipelines with Modern Big Data Tools for High Energy Physics".
PerfSheet4
A tool to query and visualize Oracle AWR data using Excel pivot charts.
PerfSheet.js
A tool to extract and visualize Oracle AWR time series data in the browser using JavaScript and dynamic pivot charts.
PyLatencyMap
A tool for heat map visualization on the CLI.
Stack Profiling
Tools and scripts for stack profiling: Userspace, Kernel, OS state and optionally Oracle wait events.
Oracle DBA scripts
A collection of DBA scripts for old-school CLI Oracle troubleshooting and performance monitoring.
OraLatencyMap
A performance widget running on SQL*plus (Oracle's CLI) to collect and visualize latency histograms for Oracle wait events using heat maps.

Packages on PyPI

SparkMeasure – A Python API for the sparkMeasure tool, used for performance troubleshooting of Apache Spark workloads. It simplifies the collection and analysis of Spark performance metrics. The core is written in Scala, and this package integrates it with PySpark, Jupyter notebooks, and other Python environments. Learn more on GitHub.
SparkHistogram – Contains helper functions for generating data histograms using the Spark DataFrame API. This package offers an efficient way to visualize and analyze data distributions. Further details and source code are available on GitHub.
TPCDS_PySpark – A TPC-DS workload generator written in Python for PySpark. Designed to run at scale, it helps you build your own Apache Spark Performance Lab, run performance benchmarks, and learn troubleshooting techniques for Spark.
Test_CPU_parallel (Python) – Generates CPU-intensive load using parallel threads. It executes a CPU-burning loop concurrently with configurable parallelism, providing measurements of execution time under load.
For more details, see the source code and documentation on GitHub.
See also the Rust version of the package: test_cpu_parallel @ crates.io
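As an illustration of the idea behind Test_CPU_parallel, the following is a minimal sketch of a CPU-burning loop run concurrently with configurable parallelism, with the elapsed time measured under load. It is not the package's code: the names burn_cpu and run_load and the loop body are hypothetical, chosen only for this example. It assumes standard-library thread pools; in CPython, threads are serialized by the GIL, so passing ProcessPoolExecutor as executor_cls (from a guarded main block) is likely needed to load multiple cores.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def burn_cpu(iterations: int) -> int:
    """Hypothetical CPU-bound loop: pure arithmetic, no I/O."""
    acc = 0
    for i in range(iterations):
        acc = (acc + i * i) % 1_000_003
    return acc


def run_load(num_workers: int, iterations: int, executor_cls=ThreadPoolExecutor):
    """Run burn_cpu on num_workers concurrent workers.

    Returns (elapsed_seconds, results), where results has one entry per worker.
    """
    start = time.perf_counter()
    with executor_cls(max_workers=num_workers) as pool:
        # Each worker executes the same loop, so all results are identical.
        results = list(pool.map(burn_cpu, [iterations] * num_workers))
    elapsed = time.perf_counter() - start
    return elapsed, results


if __name__ == "__main__":
    # Measure wall-clock time at increasing levels of concurrency.
    for workers in (1, 2, 4):
        elapsed, _ = run_load(workers, 200_000)
        print(f"workers={workers} elapsed={elapsed:.3f}s")
```

Comparing the elapsed time as the number of workers grows gives a rough picture of how a machine scales under CPU load, which is the kind of measurement the package description refers to.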

Container Images on DockerHub

Spark-Dashboard – A container image designed to deploy an Apache Spark performance dashboard. It comes prepackaged with Grafana, InfluxDB, and the configurations required to ingest Spark metrics, along with prebuilt Grafana dashboards for visualizing Spark performance. Check out the Spark Performance Dashboard project for more details.
Test_cpu_parallel – A Rust-based container image that generates CPU-intensive load on Linux by executing a CPU-burning loop concurrently with configurable parallelism. For further documentation, see the Test_CPU_parallel_Rust project repository.
Test_cpu_parallel.py – The Python version for generating CPU-intensive load on Linux, offering functionality similar to its Rust counterpart. More information and source code can be found in the Test_CPU_parallel_Python GitHub repository.

Publications

Advancing ATLAS DCS Data Analysis with a Modern Data Platform, Luca Canali, Andrea Formica, Michelle Ann Solis, submitted to EPJ Web of Conferences (2025), arXiv:2501.13543
Towards a new conditions data infrastructure in ATLAS, Evgeny Alexandrov, Luca Canali, Davide Costanzo, Andrea Formica, Elizabeth J. Gallas, Mikhail Mineev, Nurcan Ozturk, Shaun Roe, Vakho Tsulaia and Marcelo Vogel, EPJ Web of Conferences 295, 01013 (2024)
The ATLAS Event Picking Service and Its Evolution, E. Alexandrov, I. Alexandrov, D. Barberis, L. Canali, E. Cherepanova, E. Gallas, S. Gonzalez de la Hoz, F. Prokoshin, G. Rybkin, J. Sal Cairols, J. Sanchez, M. Villaplana Perez and A. Yakovlev, Physics of Particles and Nuclei, Vol. 55, No. 3 (2024)
Deployment and Operation of the ATLAS EventIndex for LHC Run 3, Elizabeth J. Gallas, Evgeny Alexandrov, Igor Alexandrov, Dario Barberis, Luca Canali, Elizaveta Cherepanova, Alvaro Fernandez Casani, Carlos Garcia Montoro, Santiago Gonzalez de la Hoz, Alexander Iakovlev et al. (5 more), EPJ Web of Conferences 295, 01018 (2024)
The ATLAS EventIndex - A BigData Catalogue for All ATLAS Experiment Events, D. Barberis et al., Comput Softw Big Sci 7, 2 (2023)
Machine Learning Pipelines with Modern Big Data Tools for High Energy Physics, Matteo Migliorini, Riccardo Castellotti, Luca Canali and Marco Zanetti, Comput Softw Big Sci 4, 8 (2020)
ScienceBox: Converging to Kubernetes containers in production for on-premises and hybrid clouds for CERNBox, SWAN, and EOS, Enrico Bocchi, Luca Canali, Diogo Castro, Prasanth Kothuri, Hugo Gonzalez Labrador, Maciej Malawski, Jakub T. Moscicki and Piotr Mrowczynski, EPJ Web of Conferences 245, 07047 (2020)
Using Big Data Technologies for HEP Analysis, M. Cremonesi et al., EPJ Web of Conferences 214, 06030 (2019)
Evolution of the Hadoop Platform and Ecosystem for High Energy Physics, Z. Baranowski et al., EPJ Web of Conferences 214, 04058 (2019)
A prototype for the evolution of ATLAS EventIndex based on Apache Kudu storage, Z. Baranowski et al., EPJ Web of Conferences 214, 04057 (2019)
Big Data Tools and Cloud Services for High Energy Physics Analysis in TOTEM Experiment, V. Avati et al., 2018, Proceedings of the 2018 IEEE/ACM International Conference on Utility and Cloud Computing Companion (UCC Companion)
CMS Analysis and Data Reduction with Apache Spark, O. Gutsche et al., 2018, J. Phys.: Conf. Ser. 1085 042030
A study of data representation in Hadoop to optimize data storage and search performance for the ATLAS EventIndex, Zbigniew Baranowski, Luca Canali, Rainer Toebbicke, Julius Hrivnac and Dario Barberis, 2017, J. Phys.: Conf. Ser. 898 062020
Integration of Oracle and Hadoop: Hybrid Databases Affordable at Scale, Luca Canali, Zbigniew Baranowski and Prasanth Kothuri, 2017, J. Phys.: Conf. Ser. 898 042055
An Oracle-based event index for ATLAS, Elizabeth J. Gallas, Gancho Dimitrov, Petya Vasileva, Zbigniew Baranowski, Luca Canali, Andrei Dumitru and Andrea Formica, 2017, J. Phys.: Conf. Ser. 898 042033
Scale Out Databases for CERN Use Cases, Zbigniew Baranowski, Maciej Grzybek, Luca Canali, Daniel Lanza Garcia and Kacper Surdy, 2015, J. Phys.: Conf. Ser. 664(4) 042002
Evolution of Database Replication Technologies for WLCG, Zbigniew Baranowski, Lorena Lobato Pardavila, Marcin Blaszczyk, Gancho Dimitrov and Luca Canali, 2015, J. Phys.: Conf. Ser. 664(4) 042032
Sequential data access with Oracle and Hadoop: a performance comparison, Zbigniew Baranowski, Luca Canali and Eric Grancher, 2014, J. Phys.: Conf. Ser. 513 042001
ATLAS database application enhancements using Oracle 11g, G. Dimitrov, Luca Canali, M. Blaszczyk and R. Sorokoletov, 2012, J. Phys.: Conf. Ser. 396 052027
ATLAS Data Management Accounting with Hadoop Pig and HBase, Mario Lassnig, Vincent Garonne, Gancho Dimitrov and Luca Canali, 2012, J. Phys.: Conf. Ser. 396 052044
Structured storage in ATLAS Distributed Data Management: use cases and experiences, Mario Lassnig, Vincent Garonne, Angelos Molfetas, Thomas Beermann, Gancho Dimitrov, Luca Canali, Donal Zang and Lisa Azzurra Chinzer, 2012, J. Phys.: Conf. Ser. 396 052045
Advanced Technologies for Scalable ATLAS Conditions Database Access, R. Basset, L. Canali, G. Dimitrov, M. Girone, R. Hawkings, P. Nevski, A. Valassi, A. Vaniachine, F. Viegas, R. Walker and A. Wong, 2010, J. Phys.: Conf. Ser. 219 042025

Older Blog Entries

2016, 2015, 2014, 2013, 2012

Miscellaneous Resources

Contact details
YouTube channel @LucaDataEngineering
ASM_Internals (pdf) - ASM metadata and related X$ tables
ASM_Utilities (pdf) - ASM support utilities and metadata management (e.g. kfed, amdu)
ResearchGate
Zenodo
PGP public key: gpg2 --keyserver hkp://pool.sks-keyservers.net --recv-keys EF1D88DB