Presentations, Talks and Videos:
- Building an Apache Spark Performance Lab: Tools and Techniques for Optimization, CERN, April 2024, pptx, PDF, sparkMeasure demo, TPCDS-PySpark demo, Spark-Dashboard demo
- Introduction to Apache Spark APIs for Data Processing, training course on Apache Spark, November 2022, PDFs_and_Videos, Notebooks
- Basic Physics Analyses Implemented Using Apache Spark, PyHEP 2022, September 14th, 2022, pptx, PDF, PDF_extended, Notebooks, Video
- Monitor Apache Spark 3 on Kubernetes using Metrics and Plugins, Data+AI Summit 2021, May 26th, 2021, pptx, PDF, Video, Demo (mp4)
- What is New with Apache Spark Performance Monitoring in Spark 3.0, Data+AI Summit Europe 2020, November 18th, 2020, pptx, PDF, Video
- Big Data Tools and Pipelines for Machine Learning in HEP, CERN EP-IT Data science seminar, December 4th, 2019, pptx, PDF
- Performance Troubleshooting Using Apache Spark Metrics, Spark Summit Europe 2019, Amsterdam, October 17th, 2019, pptx, PDF, Video
- Deep Learning Pipelines for High Energy Physics using Apache Spark with Distributed Keras on Analytics Zoo, Spark Summit Europe 2019, Amsterdam, October 16th, 2019, pptx, PDF, Video
- Big Data In HEP - Physics Data Analysis, Machine learning and Data Reduction at Scale with Apache Spark, IXPUG Annual Conference 2019, CERN, September 24th, 2019, pptx, PDF
- Apache Spark for RDBMS Practitioners, Spark Summit Europe 2018, London, October 4th, 2018, pptx, PDF, Video
- Data Analytics – Use Cases, Platforms, Services @ CERN IT, ITMM Meeting, CERN, March 5th, 2018, pptx, PDF
- Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Methods, Spark Summit Europe 2017, Dublin, October 26th, 2017, pptx, PDF, Video
- Overview of Big Data Solutions and Services at CERN, CERN Knowledge Transfer Forum, CERN, September 29th, 2017, slides: pptx, PDF
- Hadoop and Spark Ecosystem for Data Analytics, Experience and Outlook, WLCG GDB meeting, CERN, September 13th, 2017, slides: pptx, PDF
- Data Analytics and CERN IT Hadoop Service, CERN openlab Technical Workshop, CERN, December 9th, 2016, slides: pptx, PDF
- Apache Spark 2.0 Performance Improvements Investigated With Flame Graphs, Spark Summit Europe 2016, Brussels, October 26th, 2016, slides: pptx, PDF, Video
- Integration of Oracle and Hadoop: hybrid databases affordable at scale, CHEP 2016, San Francisco, October 11th, 2016, slides: pptx, PDF
- Stack Traces and Flame Graphs for Oracle Troubleshooting, UKOUG Tech15 Super Sunday, Birmingham, December 6th, 2015, slides: pptx, PDF
- Modern Linux Tools for Oracle Troubleshooting, Swiss Oracle User Group (SOUG) event, Prangins (CH), May 21st, 2015, slides: PDF
- Database Services During Run 2, WLCG Collaboration Workshop, Okinawa (JP), April 11th, 2015, slides: pptx
- Modern Linux Tools for Oracle Troubleshooting, UKOUG Tech14, Liverpool, December 9th, 2014, slides: pptx, PDF
- A Closer Look at CALIBRATE_IO, UKOUG Tech14, Liverpool, December 9th, 2014, slides: pptx, PDF
- Introduction on Data for Physics at CERN and Deep Dive into Oracle ASM, Enkitec E4 2014, Dallas (TX), June 2014, pptx
- A Latency Picture is Worth a Thousand Storage Metrics, Hotsos 2014, Dallas (TX), March 4th, 2014, pptx, PDF
- Lost Writes, a DBA's Nightmare?, UKOUG Tech13, Manchester, December 4th, 2013, pptx
- Storage Latency for Oracle DBAs, UKOUG Tech13, Manchester, December 2nd, 2013, pptx
- Active Data Guard at CERN, UKOUG Conference 2012, Birmingham, December 4th, 2012, pptx
- Testing Storage for Oracle RAC 11g with NAS, ASM, and SSD Flash Cache, UKOUG Conference 2011, Birmingham, December 6th, 2011, pptx
- CERN IT-DB Deployment, Status, Outlook, ESA-GAIA DB Workshop, ISDC, Geneva, March 2011, pptx
- Click here for a list including talks prior to 2011
- SparkMeasure
- A tool for performance
troubleshooting of Apache Spark workloads.
- Spark Performance Dashboard
- Notes and code for deploying an Apache Spark performance dashboard
using container technology (Dockerfile and Helm chart).
- SparkPlugins
- Code and examples of how to use Spark Plugin extensions with
Apache Spark 3.0 to extend the Spark metrics systems with custom
monitoring probes for OS, I/O and external applications.
- SparkTraining
- Miscellaneous
- Notes
on Apache Spark, with tips and techniques on and around using Spark
- Spark
for Physics, Jupyter notebooks with examples of High Energy Physics
analyses using Spark
- SparkHistograms,
Python and Scala packages for generating histograms with Spark
- Performance
testing, notes, scripts and resources dedicated to load testing and
performance measurements, includes
- Jupyter notebooks with examples of how to read from Oracle,
Trino/Presto, PostgreSQL, YugabyteDB
- SparkDLTrigger
- Notebook Examples
- Example notebooks for
Deep Learning, Data Tools, and AI Tools.
- Linux tracing scripts
- Scripts and tools for troubleshooting and performance analysis in
Linux.
- PerfSheet4
- PerfSheet4 is a tool to query
and visualize Oracle AWR data using Excel pivot charts
- PerfSheet.js
- PerfSheet.js is a tool
to extract and visualize Oracle AWR time series data in the browser using
JavaScript and dynamic pivot charts.
- PyLatencyMap
- PyLatencyMap is a tool
for heat map visualization on the CLI.
- Stack Profiling
- Tools and scripts for
stack profiling: Userspace, Kernel, OS state and optionally Oracle wait
events.
- Oracle DBA scripts
- A collection of DBA
scripts for old-school CLI Oracle troubleshooting and performance
monitoring.
- OraLatencyMap
- OraLatencyMap is a performance widget
running on SQL*plus (Oracle's CLI) to collect and visualize latency
histograms for Oracle wait events using heat maps.
Packages on PyPI
· SparkMeasure - SparkMeasure is a tool for performance troubleshooting of Apache Spark workloads.
o It simplifies the collection and analysis of Spark performance metrics. The bulk of sparkMeasure is written in Scala. This package contains the Python API for sparkMeasure and is intended to work in conjunction with PySpark. Use it from PySpark, in Jupyter notebook environments, or in general as a tool to instrument Spark jobs in your Python code. Link to sparkMeasure GitHub page and documentation
· SparkHistogram - SparkHistogram contains helper functions for generating data histograms with the Spark DataFrame API.
o Link to SparkHistogram source code and documentation
· TPCDS_PySpark – TPCDS_PySpark is a TPC-DS workload generator written in Python and designed to run at scale using Apache Spark. Use it to build your own Apache Spark Performance Lab, run performance benchmarks and learn about troubleshooting Spark.
· Test_CPU_parallel - Use test_CPU_parallel to generate CPU-intensive load on a system, running multiple threads in parallel.
o The tool runs a CPU-burning loop concurrently on the system, with configurable parallelism. The tool outputs a measurement of the CPU-burning loop execution time as a function of load. Link to test-CPU-parallel source code and documentation
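The idea described for test_CPU_parallel (a CPU-burning loop run at increasing levels of parallelism, timed as a function of load) can be illustrated with a minimal Python sketch. This is not the tool's actual code or interface: the names `burn_cpu` and `measure_load`, the arithmetic inside the loop, and the use of worker processes (chosen here because CPython threads share the GIL) are all assumptions made for illustration.

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def burn_cpu(iterations: int) -> float:
    """Run a fixed amount of CPU-bound work; return the elapsed time in seconds."""
    start = time.perf_counter()
    acc = 0
    for i in range(iterations):
        acc += i * i  # arbitrary arithmetic, only meant to keep one core busy
    return time.perf_counter() - start


def measure_load(max_workers, iterations=200_000, executor_cls=ProcessPoolExecutor):
    """Hypothetical helper: for each parallelism level 1..max_workers, run that
    many loops concurrently and record the mean per-loop execution time.

    executor_cls defaults to ProcessPoolExecutor so the loops can occupy
    several cores. Passing ThreadPoolExecutor gives a quick functional check
    only, since CPython threads do not execute this loop in parallel.
    """
    results = {}
    for n in range(1, max_workers + 1):
        with executor_cls(max_workers=n) as pool:
            timings = list(pool.map(burn_cpu, [iterations] * n))
        results[n] = sum(timings) / n
    return results


if __name__ == "__main__":
    # Print the mean loop time for 1 to 4 concurrent loops.
    for workers, mean_time in measure_load(max_workers=4).items():
        print(f"{workers} concurrent loops: {mean_time:.3f} s per loop")
```

On an otherwise idle machine, the mean loop time would be expected to stay roughly flat until the number of concurrent loops exceeds the available cores and to grow beyond that point, although the exact shape depends on the hardware and operating system scheduler.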
Images on DockerHub
· Spark-Dashboard – Spark-Dashboard is a container image to deploy an Apache Spark performance dashboard. It packages Grafana, InfluxDB, the configuration for ingesting Spark metrics, and prebuilt Grafana dashboards for Spark performance visualization. See the project home at Spark Performance Dashboard
· Test_cpu_parallel – Use test_cpu_parallel to generate CPU-intensive load on Linux, Rust version. See the project home at Test_CPU_parallel_Rust
· Test_cpu_parallel.py – Use test_cpu_parallel.py to generate CPU-intensive load on Linux, Python version. See the project home at Test_CPU_parallel_Python
· Building an Apache Spark Performance Lab: Tools and Techniques for Spark Optimization
· Enhancing Apache Spark Performance with Flame Graphs: A Practical Example Using Grafana Pyroscope
· Performance Comparison of 5 JDKs on Apache Spark
· Building a Semantic Search Engine and RAG Applications with Vector Databases and Large Language Models
· Exploratory Notebooks for Deep Learning and Data Tools: A Beginner's Guide
· CPU Load Testing Exercises: Tools and Analysis for Oracle Database Servers
· Making histograms with Apache Spark and other SQL engines
· Can High Energy Physics Analysis Profit from Apache Spark APIs?
· Apache Spark 3.0 Memory Monitoring Improvements
· Distributed Deep Learning for Physics with TensorFlow and Kubernetes
· Machine Learning Pipelines for High Energy Physics Using Apache Spark with BigDL and Analytics Zoo
· A Performance Dashboard for Apache Spark
· SparkMeasure, a tool for performance troubleshooting of Apache Spark workloads
· Performance Analysis of a CPU-Intensive Workload in Apache Spark
· Apache Spark and CERN Open Data Analysis, an Example
· Diving into Spark and Parquet Workloads, by Example
· On Measuring Apache Spark Workload Metrics for Performance Troubleshooting
·
Spark notes (hosted on GitHub):
o Miscellaneous Spark commands, tips, info
o Spark performance dashboard config details
o Spark workload measurements with sparkMeasure
o Spark executor memory
o Spark and Parquet
o Apache Spark – HBase Connector
o Spark for High Energy Physics
o Spark Histograms
o Flame Graph, tools on Linux for profiling Spark
o Read/analyze Spark EventLog with Spark SQL
o Tools for Linux memory performance measurements
o Spark SQL, a fun UDF example with Mandelbrot set
o Linux OS, CPU, disk and network monitoring tools
o Tools for Apache Parquet diagnostics
o MapInArrow for Python UDF
o Spark and OpenSearch
o Example of a Scala project for Spark
· Posters and reports:
o Spark Executors’ Memory Configuration, Office Poster
o Data Analysis and Machine Learning at Scale with Oracle Cloud Infrastructure
o Machine learning pipelines with Apache Spark and Intel BigDL
o Physics data analysis and data reduction at scale with Apache Spark
o Physics data processing and machine learning in the cloud
Older blog entries:
(2016) IPython/Jupyter SQL Magic Functions for PySpark, Apache Spark 2.0 Performance Improvements Investigated With Flame Graphs, How to Build a Neural Network Scoring Engine in PL/SQL, IPython/Jupyter Notebooks for Oracle, Linux BPF/bcc for Oracle Tracing, IPython Notebooks for Querying Apache Impala, SystemTap Guru Mode and Oracle SQL Parsing, PerfSheet.js: Oracle AWR Data Visualization in the Browser with JavaScript Pivot Charts, Linux Perf Probes for Oracle Tracing
(2015) Extended Stack Profiling - Ideas, Tools and Comments, Slides of the CERN Talks at UKOUG Tech15, Oracle Wait Events Investigated with Extended Stack Profiling and Flame Graphs, Linux Kernel Stack Profiling and Flame Graphs Applied to Oracle Investigations, Add Color to Your SQL, Diagnose High-Latency I/O Operations Using SystemTap, Heat Map Visualization of Latency Histograms for NetApp C-Mode, Event Histogram Metric and Oracle 12c, Heat Map Visualization of I/O Latency with SystemTap and PyLatencyMap, Latest Updates to PerfSheet4, a Tool for Oracle AWR Data Mining and Visualization
(2014) Talks at UKOUG TECH 2014 with CERN Speakers, Life of an Oracle I/O: Tracing Logical and Physical I/O with SystemTap, SystemTap into Oracle for Fun and Profit, Scaling up Cardinality Estimates in 12.1.0.2, ASM Metadata, Internals and Diagnostic Utilities, Oracle Optimizer Investigated with Flame Graphs, Flame Graphs for Oracle, A Closer Look at CALIBRATE_IO, Recent Updates of OraLatencyMap and PyLatencyMap, Wait Event History Sampling, an Experiment in Oracle Performance Analysis, Clusterware 12c and Restricted Service Registration for RAC
(2013) How to Recover Files from a Dropped ASM Disk Group, UKOUG Tech13, Latency Investigations and Lost Writes, Daylight Saving Time Change and AWR Data Mining, Getting Started with PyLatencyMap: Latency Heat Maps for Oracle, DTrace and More Sources, PyLatencyMap, a Performance Tool for Latency Data Visualization, DTrace Explorations of Oracle Wait Events on Linux and Solaris, OraLatencyMap v1.1 and Testing I/O with SLOB 2, Oracle Events' Latency Visualization and Heat Maps in SQL*plus, Testing Lost Writes with Oracle and Data Guard, AWR Analytics and Oracle Performance Visualization with PerfSheet4
(2012) Active Data Guard and UKOUG 2012, Command-Line DBA Scripts, How to Turn Off Adaptive Cursor Sharing, Cardinality Feedback and Serial Direct Read, Recursive Subquery Factoring, Oracle SQL and Physics, Listener.ora and Oraagent in RAC 11gR2, Purging Cursors From the Library Cache Using Full_hash_value, Kerberos Authentication and Proxy Users, Hash Collisions in Oracle: SQL Signature and SQL_ID, SQL Signature, Text Normalization and MD5 Hash, SQL Patch and Force Match, V$EVENT_HISTOGRAM_METRIC, Performance Metrics Views, Of I/O Latency, Skew and Histograms 2/2, Of I/O Latency, Skew and Histograms 1/2
Publications:
- Advancing ATLAS DCS Data
Analysis with a Modern Data Platform, Luca Canali, Andrea Formica,
Michelle Ann Solis, preprint (2025)
- Towards a new
conditions data infrastructure in ATLAS, Evgeny Alexandrov, Luca
Canali, Davide Costanzo, Andrea Formica, Elizabeth J. Gallas, Mikhail
Mineev, Nurcan Ozturk, Shaun Roe, Vakho Tsulaia and Marcelo Vogel, EPJ Web
of Conferences 295, 01013 (2024)
- The ATLAS Event Picking Service and Its
Evolution, E. Alexandrov, I. Alexandrov, D. Barberis, L. Canali, E.
Cherepanova, E. Gallas, S. Gonzalez de la Hoz, F. Prokoshin, G. Rybkin, J.
Sal Cairols, J. Sanchez, M. Villaplana Perez, A. Yakovlev, Physics of
Particles and Nuclei, Vol. 55, No. 3 (2024)
- Deployment and
Operation of the ATLAS EventIndex for LHC Run 3, Elizabeth J. Gallas,
Evgeny Alexandrov, Igor Alexandrov, Dario Barberis, Luca Canali, Elizaveta
Cherepanova, Alvaro Fernandez Casani, Carlos Garcia Montoro, Santiago
Gonzalez de la Hoz, Alexander Iakovlev et al. (5 more), EPJ Web of
Conferences 295, 01018 (2024)
- The ATLAS EventIndex - A BigData Catalogue
for All ATLAS Experiment Events, D. Barberis et al., Comput Softw Big Sci 7, 2 (2023)
- Machine Learning Pipelines with Modern Big
Data Tools for High Energy Physics, Matteo Migliorini, Riccardo
Castellotti, Luca Canali, Marco Zanetti, Comput Softw Big Sci 4, 8 (2020)
- ScienceBox
Converging to Kubernetes containers in production for on-premises and
hybrid clouds for CERNBox, SWAN, and EOS, Enrico Bocchi, Luca Canali,
Diogo Castro, Prasanth Kothuri, Hugo Gonzalez Labrador, Maciej Malawski,
Jakub T. Mościcki and Piotr Mrowczynski, EPJ Web of Conferences 245,
07047 (2020)
- Using Big Data
Technologies for HEP Analysis, M. Cremonesi et al., EPJ Web of Conferences 214, 06030 (2019)
- Evolution of the
Hadoop Platform and Ecosystem for High Energy Physics, Z. Baranowski et al., EPJ Web of Conferences 214,
04058 (2019)
- A prototype for the
evolution of ATLAS EventIndex based on Apache Kudu storage, Z.
Baranowski et al., EPJ Web of
Conferences 214, 04057 (2019)
- Big Data Tools and Cloud
Services for High Energy Physics Analysis in TOTEM Experiment, V.
Avati et al., 2018, Proceedings
of: 2018 IEEE/ACM International Conference on Utility and Cloud Computing
Companion (UCC Companion)
- CMS
Analysis and Data Reduction with Apache Spark, O. Gutsche et al. 2018 J. Phys.: Conf. Ser. 1085
042030
- A
study of data representation in Hadoop to optimize data storage and search
performance for the ATLAS EventIndex, Zbigniew Baranowski, Luca
Canali, Rainer Toebbicke, Julius Hrivnac and Dario Barberis, 2017 J.
Phys.: Conf. Ser. 898 062020
- Integration
of Oracle and Hadoop: Hybrid Databases Affordable at Scale, Luca
Canali, Zbigniew Baranowski and Prasanth Kothuri, 2017 J. Phys.: Conf.
Ser. 898 042055
- An
Oracle-based event index for ATLAS, Elizabeth J Gallas, Gancho
Dimitrov, Petya Vasileva, Zbigniew Baranowski, Luca Canali, Andrei
Dumitru, Andrea Formica, 2017 J. Phys.: Conf. Ser. 898
042033
- Scale
Out Databases for CERN Use Cases, Zbigniew Baranowski, Maciej Grzybek,
Luca Canali, Daniel Lanza Garcia, Kacper Surdy, 2015 J. Phys.: Conf.
Ser. 664(4) 042002
- Evolution
of Database Replication Technologies for WLCG, Zbigniew Baranowski,
Lorena Lobato Pardavila, Marcin Blaszczyk, Gancho Dimitrov, Luca Canali,
2015 J. Phys.: Conf. Ser. 664(4) 042032
- Sequential
data access with Oracle and Hadoop: a performance comparison, Zbigniew
Baranowski, Luca Canali and Eric Grancher, 2014 J.
Phys.: Conf. Ser. 513 042001
- ATLAS
database application enhancements using Oracle 11g, G Dimitrov, L
Canali, M Blaszczyk and R Sorokoletov, 2012 J.
Phys.: Conf. Ser. 396 052027
- ATLAS
Data Management Accounting with Hadoop Pig and HBase, Mario
Lassnig, Vincent Garonne, Gancho Dimitrov and Luca
Canali, 2012 J. Phys.: Conf. Ser. 396 052044
- Structured
storage in ATLAS Distributed Data Management: use cases and experiences,
Mario Lassnig, Vincent Garonne, Angelos Molfetas, Thomas Beermann, Gancho
Dimitrov, Luca Canali, Donal Zang and Lisa Azzurra Chinzer, 2012 J.
Phys.: Conf. Ser. 396 052045
- Advanced
Technologies for Scalable ATLAS Conditions Database Access, R
Basset, L Canali, G Dimitrov, M Girone, R
Hawkings, P Nevski, A Valassi, A Vaniachine, F
Viegas, R Walker and A Wong, 2010 J. Phys.: Conf. Ser.
219 042025
Miscellaneous:
Last updated, January 2025