Blog
Blog at externaltable.blogspot.com
and contributing to db-blog.web.cern.ch
- ATLAS DCS Analysis with Apache Spark
- Kepler's Mars Orbit Analysis with Python Notebooks & AI-Assisted Coding
- Building an Apache Spark Performance Lab: Tools and Techniques for Spark Optimization
- Enhancing Apache Spark Performance with Flame Graphs: A Practical Example Using Grafana Pyroscope
- Performance Comparison of 5 JDKs on Apache Spark
- Building a Semantic Search Engine and RAG Applications with Vector Databases and Large Language Models
- Exploratory Notebooks for Deep Learning and Data Tools: A Beginner's Guide
- CPU Load Testing Exercises: Tools and Analysis for Oracle Database Servers
- Making Histograms with Apache Spark and Other SQL Engines
- Can High Energy Physics Analysis Profit from Apache Spark APIs?
- Apache Spark 3.0 Memory Monitoring Improvements
- Distributed Deep Learning for Physics with TensorFlow and Kubernetes
- Machine Learning Pipelines for High Energy Physics Using Apache Spark with BigDL and Analytics Zoo
- A Performance Dashboard for Apache Spark
- SparkMeasure, a tool for performance troubleshooting of Apache Spark workloads
- Performance Analysis of a CPU-Intensive Workload in Apache Spark
- Apache Spark and CERN Open Data Analysis, an Example
- Diving into Spark and Parquet Workloads, by Example
- On Measuring Apache Spark Workload Metrics for Performance Troubleshooting
-
Spark notes (hosted on GitHub):
- Miscellaneous Spark commands, tips, info
- Spark performance dashboard config details
- Spark workload measurements with sparkMeasure
- Spark executor memory
- Spark and Parquet
- Apache Spark - HBase Connector
- Spark for_High_Energy_Physics
- Spark DataFrame Histograms
- Flame Graph, tools on Linux for profiling Spark
- Read/analyze Spark EventLog with Spark SQL
- Tools for Linux memory_performance measurements
- Spark SQL, a fun UDF_example with Mandelbrot set
- Linux_OS_CPU_Disk_Network monitoring tools
- Tools_for Apache Parquet_diagnostics
- MapInArrow for Python UDF
- Spark and OpenSearch
- Example of a Scala project for Spark
-
Posters and reports:
- Spark Executors' Memory Configuration, Office Poster
- Data Analysis and Machine Learning at Scale with Oracle Cloud Infrastructure
- Machine learning pipelines with Apache Spark and Intel BigDL
- Physics data analysis and data reduction at scale with Apache Spark
- Physics data processing and machine learning in the cloud
Presentations, Talks and Videos
- Building an Apache Spark Performance Lab: Tools and Techniques for Optimization, CERN, April 2024. [pptx | PDF | sparkMeasure demo | TPCDS-PySpark demo | Spark-Dashboard demo]
- Introduction to Apache Spark APIs for Data Processing: Training course on Apache Spark, November 2022.
- Basic Physics Analyses Implemented Using Apache Spark, PyHEP 2022, September 14, 2022. [pptx | PDF | Video]
- Monitor Apache Spark 3 on Kubernetes using Metrics and Plugins, Data+AI Summit 2021, May 26, 2021. [pptx | PDF | Video | Demo (mp4)]
- What is New with Apache Spark Performance Monitoring in Spark 3.0, Data+AI Summit Europe 2020, November 18, 2020. [pptx | PDF | Video]
- Big Data Tools and Pipelines for Machine Learning in HEP, CERN EP-IT Data Science seminar, December 4, 2019. [pptx | PDF]
- Performance Troubleshooting Using Apache Spark Metrics, Spark Summit Europe 2019, Amsterdam, October 17, 2019. [pptx | PDF | Video]
- Deep Learning Pipelines for High Energy Physics using Apache Spark with Distributed Keras on Analytics Zoo, Spark Summit Europe 2019, Amsterdam, October 16, 2019. [pptx | PDF | Video]
- Big Data In HEP - Physics Data Analysis, Machine learning and Data Reduction at Scale with Apache Spark, IXPUG Annual Conference 2019, CERN September 24th, 2019. [pptx | PDF ]
- Apache Spark for RDBMS Practitioners, Spark Summit Europe 2018, London, October 4, 2018. [pptx | PDF | Video]
- Data Analytics – Use Cases, Platforms, Services @ CERN IT, ITMM Meeting, CERN, March 5th, 2018. [pptx | PDF]
- Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Methods, Spark Summit Europe 2017, Dublin, October 26th, 2017. [pptx | PDF | Video]
- Overview of Big Data Solutions and Services at CERN, CERN Knowledge Transfer Forum, CERN, September 29th, 2017. [pptx | PDF]
- Hadoop and Spark Ecosystem for Data Analytics, Experience and Outlook, WLCG GDB meeting, CERN, September 13th, 2017. [pptx | PDF]
- Data Analytics and CERN IT Hadoop Service, CERN openlab Technical Workshop, CERN, December 9th, 2016. [pptx | PDF]
- Apache Spark 2.0 Performance Improvements Investigated With Flame Graphs, Spark Summit Europe 2016, Brussels, October 26th, 2016. [pptx | PDF | Video]
- Integration of Oracle and Hadoop: hybrid databases affordable at scale, CHEP 2016, San Francisco, October 11th, 2016. [pptx | PDF]
- Stack Traces and Flame Graphs for Oracle Troubleshooting, UKOUG Tech15 Super Sunday, Birmingham, December 6th, 2015. [pptx | PDF]
- Modern Linux Tools for Oracle Troubleshooting, Swiss Oracle User Group (SOUG) event, Prangins (CH), May 21st, 2015. [PDF]
- Database Services During Run 2, WLCG Collaboration Workshop, Okinawa (JP), April 11th, 2015. [pptx | PDF]
- Modern Linux Tools for Oracle Troubleshooting, UKOUG Tech14, Liverpool, December 9th, 2014. [pptx | PDF]
- A Closer Look at CALIBRATE_IO, UKOUG Tech14, Liverpool, December 9th, 2014. [pptx | PDF]
- Introduction on Data for Physics at CERN and Deep Dive into Oracle ASM, Enkitec E4 2014, Dallas (TX), June 2014. [pptx | PDF]
- A Latency Picture is Worth a Thousand Storage Metrics, Hotsos 2014, Dallas (TX), March 4th, 2014. [pptx | PDF]
- Lost Writes, a DBA's Nightmare?, UKOUG Tech13, Manchester, December 4th, 2013. [pptx | PDF]
- Storage Latency for Oracle DBAs, UKOUG Tech13, Manchester, December 2nd, 2013. [pptx | PDF]
- Active Data Guard at CERN, UKOUG Conference 2012, Birmingham, December 4th, 2012. [pptx | PDF]
- Testing Storage for Oracle RAC 11g with NAS, ASM, and SSD Flash Cache, UKOUG Conference 2011, Birmingham, December 6th, 2011. [pptx | PDF]
- CERN IT-DB Deployment, Status, Outlook, ESA-GAIA DB Workshop, ISDC, Geneva, March 2011. [pptx | PDF]
- Click here for a list including talks prior to 2011
Repositories
Repositories at https://github.com/LucaCanali
-
SparkMeasure
A tool for performance troubleshooting of Apache Spark.
-
Spark Performance Dashboard
Deploy an Apache Spark performance dashboard using container technology (Dockerfile and Helm chart).
-
SparkPlugins
Code and examples of how to use Spark Plugins, including plugins to extend Spark metrics systems with custom monitoring probes.
-
SparkTraining
Training material for course "Introduction to Apache Spark APIs for Data Processing": https://sparktraining.web.cern.ch/
-
Spark Notes
A collection of Apache Spark notes, tips, and techniques covering a variety of topics.
-
Spark for Physics
Jupyter notebooks with examples of High Energy Physics analyses using Spark.
-
SparkHistograms
Python and Scala packages for generating histograms with Spark.
-
CPU Load Test Kit (Python)
A Python tool for CPU stress testing.
-
CPU Load Test Kit (Rust)
A Rust-based CPU stress testing tool.
-
Spark CPU/Memory Load Testing
Tools for conducting CPU and memory performance tests with Spark.
-
TPCDS-PySpark
A Python package to streamline running TPCDS benchmarks with PySpark.
-
Data_Analyses/Kepler
Notebooks demonstrating Kepler's Mars orbit analysis using the modern Python ecosystem and AI-assisted coding.
-
Notebook Examples
Example notebooks for Deep Learning, Data Tools, and AI Tools.
-
Linux tracing scripts
Scripts and tools for troubleshooting and performance analysis in Linux.
-
SparkDLTrigger
Code, notebooks, and links to the datasets accompanying the article "Machine Learning Pipelines with Modern Big DataTools for High Energy Physics".
-
PerfSheet4
A tool to query and visualize Oracle AWR data using Excel pivot charts.
-
PerfSheet.js
A tool to extract and visualize Oracle AWR time series data in the browser using JavaScript and dynamic pivot charts.
-
PyLatencyMap
A tool for heat map visualization on the CLI.
-
Stack Profiling
Tools and scripts for stack profiling: Userspace, Kernel, OS state and optionally Oracle wait events.
-
Oracle DBA scripts
A collection of DBA scripts for old-school CLI Oracle troubleshooting and performance monitoring.
-
OraLatencyMap
A performance widget running on SQL*plus (Oracle's CLI) to collect and visualize latency histograms for Oracle wait events using heat maps.
Packages on PyPi
- SparkMeasure – A Python API for the sparkMeasure tool used in performance troubleshooting of Apache Spark workloads. It simplifies the collection and analysis of Spark performance metrics, and—while the core is written in Scala—this package seamlessly integrates with PySpark, Jupyter Notebook, and other Python environments. Learn more on GitHub.
- SparkHistogram – Contains helper functions for generating data histograms using the Spark DataFrame API. This package offers an efficient way to visualize and analyze data distributions. Explore further details and source code on its GitHub documentation.
- TPCDS_PySpark – A TPC-DS workload generator written in Python for PySpark. Designed to run at scale, it helps you build your own Apache Spark Performance Lab, run performance benchmarks, and learn troubleshooting techniques for Spark.
- Test_CPU_parallel – Generates CPU-intensive load using parallel threads. It executes a CPU-burning loop concurrently with configurable parallelism, providing measurements of execution time under load. For more details, see the source code and documentation on GitHub.
Container Images on DockerHub
- Spark-Dashboard – A container image designed to deploy an Apache Spark performance dashboard. It comes prepackaged with Grafana, InfluxDB, and the configurations required to ingest Spark metrics, along with prebuilt Grafana dashboards for visualizing Spark performance. Check out the Spark Performance Dashboard project for more details.
- Test_cpu_parallel – A Rust-based container image that generates CPU-intensive load on Linux by executing a CPU-burning loop concurrently with configurable parallelism. For further documentation, see the Test_CPU_parallel_Rust project repository.
- Test_cpu_parallel.py – The Python version for generating CPU-intensive load on Linux, offering similar functionality as its Rust counterpart. More information and source code can be found on the Test_CPU_parallel_Python GitHub repository.
Publications
- Advancing ATLAS DCS Data Analysis with a Modern Data Platform, Luca Canali, Andrea Formica, Michelle Ann Solis, submitted to EPJ Web of Conferences (2025), arXiv:2501.13543
- Towards a new conditions data infrastructure in ATLAS, Evgeny Alexandrov, Luca Canali, Davide Costanzo, Andrea Formica, Elizabeth J. Gallas, Mikhail Mineev, Nurcan Ozturk, Shaun Roe, Vakho Tsulaia and Marcelo Vogel, EPJ Web of Conferences 295, 01013 (2024)
- The ATLAS Event Picking Service and Its Evolution, E. Alexandrov, I. Alexandrov, D. Barberis, L. Canali, E. Cherepanova, E. Gallas, S. Gonzalez de la Hoz, F. Prokoshin, G. Rybkin, J. Sal Cairols, J. Sanchez, M. Villaplana Perez and A. Yakovlev, Physics of Particles and Nuclei, Vol55, No. 3 (2024)
- Deployment and Operation of the ATLAS EventIndex for LHC Run 3, Elizabeth J. Gallas, Evgeny Alexandrov, Igor Alexandrov, Dario Barberis, Luca Canali, Elizaveta Cherepanova, Alvaro Fernandez Casani, Carlos Garcia Montoro, Santiago Gonzalez de la Hoz, Alexander Iakovlev et al. (5 more), EPJ Web of Conferences 295, 01018 (2024)
- The ATLAS EventIndex - A BigData Catalogue for All ATLAS Experiment Events, D. Barberis et al., Comput Softw Big Sci 7, 2 (2023)
- Machine Learning Pipelines with Modern Big Data Tools for High Energy Physics, Matteo Migliorini, Riccardo Castellotti, Luca Canali and Marco Zanetti, Comput Softw Big Sci 4, 8 (2020)
- ScienceBox: Converging to Kubernetes containers in production for on-premises and hybrid clouds for CERNBox, SWAN, and EOS, Enrico Bocchi, Luca Canali, Diogo Castro, Prasanth Kothuri, Hugo Gonzalez Labrador, Maciej Malawski, Jakub T. Moscicki and Piotr Mrowczynski, EPJ Web of Conferences 245, 07047 (2020)
- Using Big Data Technologies for HEP Analysis, M. Cremonesi et al., EPJ Web of Conferences 214, 06030 (2019)
- Evolution of the Hadoop Platform and Ecosystem for High Energy Physics, Z. Baranowski et al., EPJ Web of Conferences 214, 04058 (2019)
- A prototype for the evolution of ATLAS EventIndex based on Apache Kudu storage, Z. Baranowski et al., EPJ Web of Conferences 214, 04057 (2019)
- Big Data Tools and Cloud Services for High Energy Physics Analysis in TOTEM Experiment, V. Avati et al., 2018, Proceeding of: 2018 IEEE/ACM International Conference on Utility and Cloud Computing Companion (UCC Companion)
- CMS Analysis and Data Reduction with Apache Spark, O. Gutsche et al., 2018, J. Phys.: Conf. Ser. 1085 042030
- A study of data representation in Hadoop to optimize data storage and search performance for the ATLAS EventIndex, Zbigniew Baranowski, Luca Canali, Rainer Toebbicke, Julius Hrivnac and Dario Barberis, 2017, J. Phys.: Conf. Ser. 898 062020
- Integration of Oracle and Hadoop: Hybrid Databases Affordable at Scale, Luca Canali, Zbigniew Baranowski and Prasanth Kothuri, 2017, J. Phys.: Conf. Ser. 898 042055
- An Oracle-based event index for ATLAS, Elizabeth J. Gallas, Gancho Dimitrov, Petya Vasileva, Zbigniew Baranowski, Luca Canali, Andrei Dumitru and Andrea Formica, 2017, J. Phys.: Conf. Ser. 898 042033
- Scale Out Databases for CERN Use Cases, Zbigniew Baranowski, Maciej Grzybek, Luca Canali, Daniel Lanza Garcia and Kacper Surdy, 2015, J. Phys.: Conf. Ser. 664(4) 042002
- Evolution of Database Replication Technologies for WLCG, Zbigniew Baranowski, Lorena Lobato Pardavila, Marcin Blaszczyk, Gancho Dimitrov and Luca Canali, 2015, J. Phys.: Conf. Ser. 664(4) 042032
- Sequential data access with Oracle and Hadoop: a performance comparison, Zbigniew Baranowski, Luca Canali and Eric Grancher, 2014, J. Phys.: Conf. Ser. 513 042001
- ATLAS database application enhancements using Oracle 11g, G. Dimitrov, Luca Canali, M. Blaszczyk and R. Sorokoletov, 2012, J. Phys.: Conf. Ser. 396 052027
- ATLAS Data Management Accounting with Hadoop Pig and HBase, Mario Lassnig, Vincent Garonne, Gancho Dimitrov and Luca Canali, 2012, J. Phys.: Conf. Ser. 396 052044
- Structured storage in ATLAS Distributed Data Management: use cases and experiences, Mario Lassnig, Vincent Garonne, Angelos Molfetas, Thomas Beermann, Gancho Dimitrov, Luca Canali, Donal Zang and Lisa Azzurra Chinzer, 2012, J. Phys.: Conf. Ser. 396 052045
- Advanced Technologies for Scalable ATLAS Conditions Database Access, R. Basset, L. Canali, G. Dimitrov, M. Girone, R. Hawkings, P Nevski, A Valassi, A Vaniachine, F Viegas, R Walker and A Wong, 2010, J. Phys.: Conf. Ser. 219 042025
Older Blog Entries
2016
- IPython/Jupyter SQL Magic Functions for PySpark
- Apache Spark 2.0 Performance Improvements Investigated With Flame Graphs
- How to Build a Neural Network Scoring Engine in PL/SQL
- IPython/Jupyter Notebooks for Oracle
- Linux BPF/bcc for Oracle Tracing
- IPython Notebooks for Querying Apache Impala
- SystemTap Guru Mode and Oracle SQL Parsing
- PerfSheet.js: Oracle AWR Data Visualization in the Browser with JavaScript Pivot Charts
- Linux Perf Probes for Oracle Tracing
2015
- Extended Stack Profiling - Ideas, Tools and Comments
- Slides of the CERN Talks at UKOUG Tech15
- Oracle Wait Events Investigated with Extended Stack Profiling and Flame Graphs
- Linux Kernel Stack Profiling and Flame Graphs Applied to Oracle Investigations
- Add Color to Your SQL
- Diagnose High-Latency I/O Operations Using SystemTap
- Heat Map Visualization of Latency Histograms for NetApp C-Mode
- Event Histogram Metric and Oracle 12c
- Heat Map Visualization of I/O Latency with SystemTap and PyLatencyMap
- Latest Updates to PerfSheet4, a Tool for Oracle AWR Data Mining and Visualization
2014
- Talks at UKOUG TECH 2014 with CERN Speakers
- Life of an Oracle I/O: Tracing Logical and Physical I/O with SystemTap
- SystemTap into Oracle for Fun and Profit
- Scaling up Cardinality Estimates in 12.1.0.2
- ASM Metadata, Internals and Diagnostic Utilities
- Oracle Optimizer Investigated with Flame Graphs
- Flame Graphs for Oracle
- A Closer Look at CALIBRATE_IO
- Recent Updates of OraLatencyMap and PyLatencyMap
- Wait Event History Sampling, an Experiment in Oracle Performance Analysis
- Clusterware 12c and Restricted Service Registration for RAC
2013
- How to Recover Files from a Dropped ASM Disk Group
- UKOUG Tech13, Latency Investigations and Lost Writes
- Daylight Saving Time Change and AWR Data Mining
- Getting Started with PyLatencyMap: Latency Heat Maps for Oracle, DTrace and More Sources
- PyLatencyMap, a Performance Tool for Latency Data Visualization
- DTrace Explorations of Oracle Wait Events on Linux and Solaris
- OraLatencyMap v1.1 and Testing I/O with SLOB 2
- Oracle Events' Latency Visualization and Heat Maps in SQL*plus
- Testing Lost Writes with Oracle and Data Guard
- AWR Analytics and Oracle Performance Visualization with PerfSheet4
2012
- Active Data Guard and UKOUG 2012
- Command-Line DBA Scripts
- How to Turn Off Adaptive Cursor Sharing, Cardinality Feedback and Serial Direct Read
- Recursive Subquery Factoring, Oracle SQL and Physics
- Listener.ora and Oraagent in RAC 11gR2
- Purging Cursors From the Library Cache Using Full_hash_value
- Kerberos Authentication and Proxy Users
- Hash Collisions in Oracle: SQL Signature and SQL_ID
- SQL Signature, Text Normalization and MD5 Hash
- SQL Patch and Force Match
- V$EVENT_HISTOGRAM_METRIC
- Performance Metrics Views
- Of I/O Latency, Skew and Histograms 2/2
- Of I/O Latency, Skew and Histograms 1/2
Miscellaneous Resources
- Contact details
- YouTube channel @LucaDataEngineering
- ASM_Internals (pdf) - ASM metadata and related X$ tables
- ASM_Utilities (pdf) - ASM support utilities and metadata management (e.g. kfed, amdu)
- Research gate
- Zenodo:
- PGP public key:
gpg2 --keyserver hkp://pool.sks-keyservers.net --recv-keys EF1D88DB