Presentations, Talks and Videos:
- Building an Apache Spark Performance Lab: Tools and Techniques for
Optimization, CERN, April 2024, pptx, PDF, sparkMeasure demo, TPCDS-PySpark demo, Spark-Dashboard demo
- Introduction to Apache Spark APIs for Data Processing, training
course on Apache Spark, November 2022, PDFs_and_Videos, Notebooks
- Basic Physics Analyses
Implemented Using Apache Spark, PyHEP 2022, September 14th,
2022, pptx, PDF, PDF_extended, Notebooks, Video
- Monitor
Apache Spark 3 on Kubernetes using Metrics and Plugins, Data+AI Summit 2021, May 26th, 2021, pptx,
PDF,
Video, Demo (mp4)
- What
is New with Apache Spark Performance Monitoring in Spark 3.0, Data+AI Summit Europe 2020, November 18th,
2020, pptx,
PDF,
Video
- Big Data Tools and Pipelines
for Machine Learning in HEP, CERN EP-IT Data science seminar, December
4th, 2019, pptx,
PDF
- Performance
Troubleshooting Using Apache Spark Metrics, Spark Summit Europe 2019,
Amsterdam, October 17th, 2019, pptx,
PDF,
Video
- Deep
Learning Pipelines for High Energy Physics using Apache Spark with
Distributed Keras on Analytics Zoo, Spark Summit Europe 2019,
Amsterdam, October 16th, 2019, pptx,
PDF,
Video
- Big Data
In HEP - Physics Data Analysis, Machine learning and Data Reduction at
Scale with Apache Spark, IXPUG Annual Conference 2019, CERN September
24th, 2019, pptx,
PDF
- Apache
Spark for RDBMS Practitioners, Spark Summit Europe 2018, London,
October 4th, 2018, pptx,
PDF,
Video
- Data
Analytics – Use Cases, Platforms, Services @ CERN IT, ITMM Meeting, CERN,
March 5th, 2018, pptx, PDF
- Apache
Spark Performance Troubleshooting at Scale, Challenges, Tools, and
Methods, Spark Summit Europe 2017, Dublin, October 26th, 2017, pptx,
PDF,
Video
- Overview
of Big Data Solutions and Services at CERN, CERN Knowledge Transfer Forum,
CERN, September 29th, 2017, slides: pptx, PDF
- Hadoop
and Spark Ecosystem for Data Analytics, Experience and Outlook, WLCG GDB
meeting, CERN, September 13th, 2017, slides: pptx,
PDF
- Data
Analytics and CERN IT Hadoop Service, CERN
openlab Technical Workshop, CERN, December 9th, 2016,
slides pptx, PDF
- Apache
Spark 2.0 Performance Improvements Investigated With Flame Graphs, Spark Summit Europe 2016,
Brussels, October 26th, 2016, slides: pptx, PDF, Video
- Integration
of Oracle and Hadoop: hybrid databases affordable at scale, CHEP 2016, San Francisco, October 11th,
2016, slides: pptx,
PDF
- Stack
Traces and Flame Graphs for Oracle Troubleshooting, UKOUG
Tech15 Super Sunday, Birmingham, December 6th, 2015,
slides: pptx,
PDF
- Modern
Linux Tools for Oracle Troubleshooting, Swiss Oracle User Group (SOUG) event, Prangins (CH), May 21st,
2015, slides PDF
- Database
Services During Run 2, WLCG Collaboration Workshop, Okinawa (JP),
April 11th, 2015, slides pptx
- Modern
Linux Tools for Oracle Troubleshooting, UKOUG Tech14, Liverpool,
December 9th, 2014, slides pptx, PDF
- A
Closer Look at CALIBRATE_IO, UKOUG Tech14, Liverpool, December 9th,
2014, slides pptx, PDF
- Introduction
on Data for Physics at CERN and Deep Dive into Oracle ASM, Enkitec
E4 2014, Dallas (TX), June 2014, pptx
- A Latency
Picture is Worth a Thousand Storage Metrics, Hotsos
2014, Dallas (TX), March 4th, 2014, pptx, PDF
- Lost
Writes, a DBA's Nightmare?, UKOUG Tech13, Manchester, December 4th,
2013, pptx
- Storage
Latency for Oracle DBAs, UKOUG Tech13, Manchester, December 2nd, 2013,
pptx
- Active
Data Guard at CERN, UKOUG Conference 2012, Birmingham, December 4th,
2012, pptx
- Testing
Storage for Oracle RAC 11g with NAS, ASM, and SSD Flash Cache, UKOUG
Conference 2011, Birmingham, December 6th, 2011, pptx
- CERN
IT-DB Deployment, Status, Outlook, ESA-GAIA DB Workshop, ISDC,
Geneva, March 2011, pptx
- Click here for a list including talks prior to
2011
Projects and tools:
- SparkMeasure: a tool for performance troubleshooting of Apache Spark workloads.
- SparkTraining
- Spark Performance Dashboard: notes and code for deploying an Apache Spark performance dashboard using container technology (Dockerfile and Helm chart).
- SparkPlugins: code and examples of how to use Spark Plugin extensions with Apache Spark 3.0 to extend the Spark metrics system with custom monitoring probes for OS, I/O and external applications.
- SparkDLTrigger
- Miscellaneous
  - Notes on Apache Spark, with tips and techniques on and around using Spark
  - Spark for Physics, Jupyter notebooks with examples of High Energy Physics analyses using Spark
  - SparkHistograms, Python and Scala packages for generating histograms with Spark
  - Performance testing, notes, scripts and resources dedicated to load testing and performance measurements, includes
  - Jupyter notebooks with examples of how to read from Oracle, Trino/Presto, PostgreSQL, YugabyteDB
- Notebook Examples: example notebooks for Deep Learning, Data Tools, and AI Tools.
- Linux tracing scripts: scripts and tools for troubleshooting and performance analysis in Linux.
- PerfSheet4: a tool to query and visualize Oracle AWR data using Excel pivot charts.
- PerfSheet.js: a tool to extract and visualize Oracle AWR time series data in the browser using JavaScript and dynamic pivot charts.
- PyLatencyMap: a tool for heat map visualization on the CLI.
- Stack Profiling: tools and scripts for stack profiling (userspace, kernel, OS state and, optionally, Oracle wait events).
- Oracle DBA scripts: a collection of DBA scripts for old-school CLI Oracle troubleshooting and performance monitoring.
- OraLatencyMap: a performance widget running on SQL*plus (Oracle's CLI) to collect and visualize latency histograms for Oracle wait events using heat maps.
Packages on PyPI
- SparkMeasure: a tool for performance troubleshooting of Apache Spark workloads.
  - It simplifies the collection and analysis of Spark performance metrics. The bulk of sparkMeasure is written in Scala. This package contains the Python API for sparkMeasure and is intended to work in conjunction with PySpark. Use it from PySpark, in Jupyter notebook environments, or in general as a tool to instrument Spark jobs in your Python code. Link to sparkMeasure GitHub page and documentation
- SparkHistogram: helper functions for generating data histograms with the Spark DataFrame API.
  - Link to SparkHistogram source code and documentation
- TPCDS_PySpark: a TPC-DS workload generator written in Python and designed to run at scale using Apache Spark. Use it to build your own Apache Spark Performance Lab, run performance benchmarking and learn about troubleshooting Spark.
- Test_CPU_parallel: use test_CPU_parallel to generate CPU-intensive load on a system, running multiple threads in parallel.
  - The tool runs a CPU-burning loop concurrently on the system, with configurable parallelism. The tool outputs a measurement of the CPU-burning loop execution time as a function of load. Link to test-CPU-parallel source code and documentation
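The idea described for test_CPU_parallel (a CPU-burning loop run concurrently at a configurable parallelism, with the loop time reported as a function of load) can be sketched in a few lines of Python. This is an illustrative sketch only, not the tool's code: the names `burn_cpu`, `measure` and `sweep`, the SHA-256 workload and the 1 MiB buffer are all assumptions made for the example. It relies on the fact that `hashlib` releases the GIL while hashing large buffers, so the threads can keep several cores busy.

```python
import hashlib
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sketch; names and workload are assumptions, not the tool's API.
# hashlib releases the GIL for buffers larger than 2047 bytes, so hashing a
# 1 MiB buffer in several threads can load several cores at once.
BUFFER = b"\0" * (1 << 20)


def burn_cpu(rounds):
    """Hash BUFFER `rounds` times and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    digest = hashlib.sha256()
    for _ in range(rounds):
        digest.update(BUFFER)
    return time.perf_counter() - start


def measure(parallelism, rounds=20):
    """Run `parallelism` concurrent burn_cpu loops and return their mean duration."""
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        durations = list(pool.map(burn_cpu, [rounds] * parallelism))
    return sum(durations) / len(durations)


def sweep(max_parallelism, rounds=20):
    """Return (parallelism, mean seconds) pairs for 1..max_parallelism workers."""
    return [(n, measure(n, rounds)) for n in range(1, max_parallelism + 1)]


if __name__ == "__main__":
    for n, seconds in sweep(4):
        print(f"parallelism={n} mean_loop_time={seconds:.3f}s")
```

On an otherwise idle machine, the mean loop time would be expected to stay roughly flat while the parallelism is below the number of available cores and to grow once the cores are saturated; the actual figures depend entirely on the hardware.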
Images on DockerHub
- Spark-Dashboard: a container image to deploy an Apache Spark performance dashboard. It packages Grafana, InfluxDB, the configuration for ingesting Spark metrics, and prebuilt Grafana dashboards for Spark performance visualization. See the project home at Spark Performance Dashboard
- Test_cpu_parallel: use test_cpu_parallel to generate CPU-intensive load on Linux, Rust version. See the project home Test_CPU_parallel_Rust
- Test_cpu_parallel.py: use test_cpu_parallel.py to generate CPU-intensive load on Linux, Python version. See the project home Test_CPU_parallel_Python
Blog entries:
- Building an Apache Spark Performance Lab: Tools and Techniques for Spark Optimization
- Enhancing Apache Spark Performance with Flame Graphs: A Practical Example Using Grafana Pyroscope
- Performance Comparison of 5 JDKs on Apache Spark
- Building a Semantic Search Engine and RAG Applications with Vector Databases and Large Language Models
- Exploratory Notebooks for Deep Learning and Data Tools: A Beginner's Guide
- CPU Load Testing Exercises: Tools and Analysis for Oracle Database Servers
- Making histograms with Apache Spark and other SQL engines
- Can High Energy Physics Analysis Profit from Apache Spark APIs?
- Apache Spark 3.0 Memory Monitoring Improvements
- Distributed Deep Learning for Physics with TensorFlow and Kubernetes
- Machine Learning Pipelines for High Energy Physics Using Apache Spark with BigDL and Analytics Zoo
- A Performance Dashboard for Apache Spark
- SparkMeasure, a tool for performance troubleshooting of Apache Spark workloads
- Performance Analysis of a CPU-Intensive Workload in Apache Spark
- Apache Spark and CERN Open Data Analysis, an Example
- Diving into Spark and Parquet Workloads, by Example
- On Measuring Apache Spark Workload Metrics for Performance Troubleshooting
- Spark notes (hosted on GitHub):
  - Miscellaneous Spark commands, tips, info
  - Spark performance dashboard config details
  - Spark workload measurements with sparkMeasure
  - Spark executor memory
  - Spark and Parquet
  - Apache Spark – HBase Connector
  - Spark for High Energy Physics
  - Spark Histograms
  - Flame Graph, tools on Linux for profiling Spark
  - Read/analyze Spark EventLog with Spark SQL
  - Tools for Linux memory performance measurements
  - Spark SQL, a fun UDF example with Mandelbrot set
  - Linux OS, CPU, disk and network monitoring tools
  - Tools for Apache Parquet diagnostics
  - MapInArrow for Python UDF
  - Spark and OpenSearch
  - Example of a Scala project for Spark
- Posters and reports:
  - Spark Executors’ Memory Configuration, Office Poster
  - Data Analysis and Machine Learning at Scale with Oracle Cloud Infrastructure
  - Machine learning pipelines with Apache Spark and Intel BigDL
  - Physics data analysis and data reduction at scale with Apache Spark
  - Physics data processing and machine learning in the cloud
Older blog entries:
(2016)
- IPython/Jupyter SQL Magic Functions for PySpark
- Apache Spark 2.0 Performance Improvements Investigated With Flame Graphs
- How to Build a Neural Network Scoring Engine in PL/SQL
- IPython/Jupyter Notebooks for Oracle
- Linux BPF/bcc for Oracle Tracing
- IPython Notebooks for Querying Apache Impala
- SystemTap Guru Mode and Oracle SQL Parsing
- PerfSheet.js: Oracle AWR Data Visualization in the Browser with JavaScript Pivot Charts
- Linux Perf Probes for Oracle Tracing
(2015)
- Extended Stack Profiling - Ideas, Tools and Comments
- Slides of the CERN Talks at UKOUG Tech15
- Oracle Wait Events Investigated with Extended Stack Profiling and Flame Graphs
- Linux Kernel Stack Profiling and Flame Graphs Applied to Oracle Investigations
- Add Color to Your SQL
- Diagnose High-Latency I/O Operations Using SystemTap
- Heat Map Visualization of Latency Histograms for NetApp C-Mode
- Event Histogram Metric and Oracle 12c
- Heat Map Visualization of I/O Latency with SystemTap and PyLatencyMap
- Latest Updates to PerfSheet4, a Tool for Oracle AWR Data Mining and Visualization
(2014)
- Talks at UKOUG TECH 2014 with CERN Speakers
- Life of an Oracle I/O: Tracing Logical and Physical I/O with SystemTap
- SystemTap into Oracle for Fun and Profit
- Scaling up Cardinality Estimates in 12.1.0.2
- ASM Metadata, Internals and Diagnostic Utilities
- Oracle Optimizer Investigated with Flame Graphs
- Flame Graphs for Oracle
- A Closer Look at CALIBRATE_IO
- Recent Updates of OraLatencyMap and PyLatencyMap
- Wait Event History Sampling, an Experiment in Oracle Performance Analysis
- Clusterware 12c and Restricted Service Registration for RAC
(2013)
- How to Recover Files from a Dropped ASM Disk Group
- UKOUG Tech13, Latency Investigations and Lost Writes
- Daylight Saving Time Change and AWR Data Mining
- Getting Started with PyLatencyMap: Latency Heat Maps for Oracle, DTrace and More Sources
- PyLatencyMap, a Performance Tool for Latency Data Visualization
- DTrace Explorations of Oracle Wait Events on Linux and Solaris
- OraLatencyMap v1.1 and Testing I/O with SLOB 2
- Oracle Events' Latency Visualization and Heat Maps in SQL*plus
- Testing Lost Writes with Oracle and Data Guard
- AWR Analytics and Oracle Performance Visualization with PerfSheet4
(2012)
- Active Data Guard and UKOUG 2012
- Command-Line DBA Scripts
- How to Turn Off Adaptive Cursor Sharing, Cardinality Feedback and Serial Direct Read
- Recursive Subquery Factoring, Oracle SQL and Physics
- Listener.ora and Oraagent in RAC 11gR2
- Purging Cursors From the Library Cache Using Full_hash_value
- Kerberos Authentication and Proxy Users
- Hash Collisions in Oracle: SQL Signature and SQL_ID
- SQL Signature, Text Normalization and MD5 Hash
- SQL Patch and Force Match
- V$EVENT_HISTOGRAM_METRIC
- Performance Metrics Views
- Of I/O Latency, Skew and Histograms 2/2
- Of I/O Latency, Skew and Histograms 1/2
Publications:
- Towards a new
conditions data infrastructure in ATLAS, Evgeny Alexandrov, Luca
Canali, Davide Costanzo, Andrea Formica, Elizabeth J. Gallas, Mikhail Mineev, Nurcan Ozturk, Shaun Roe, Vakho Tsulaia and
Marcelo Vogel, EPJ Web of Conferences 295, 01013 (2024)
- The ATLAS Event Picking Service and Its
Evolution, E. Alexandrov, I. Alexandrov, D. Barberis, L. Canali, E.
Cherepanova, E. Gallas, S. Gonzalez de la Hoz, F. Prokoshin, G. Rybkin, J.
Sal Cairols, J. Sanchez, M. Villaplana Perez, A.
Yakovlev, Physics of Particles and Nuclei, Vol. 55, No. 3 (2024)
- Deployment and
Operation of the ATLAS EventIndex for LHC Run 3, Elizabeth J. Gallas,
Evgeny Alexandrov, Igor Alexandrov, Dario Barberis, Luca Canali, Elizaveta
Cherepanova, Alvaro Fernandez Casani, Carlos Garcia Montoro, Santiago
Gonzalez de la Hoz, Alexander Iakovlev et al. (5 more), EPJ Web of
Conferences 295, 01018 (2024)
- The ATLAS EventIndex - A BigData
Catalogue for All ATLAS Experiment Events, D. Barberis et al., Comput Softw
Big Sci 7, 2
(2023)
- Machine Learning Pipelines with Modern Big
Data Tools for High Energy Physics, Matteo Migliorini, Riccardo
Castellotti, Luca Canali, Marco Zanetti, Comput Softw
Big Sci 4, 8
(2020)
- ScienceBox Converging to Kubernetes containers in
production for on-premises and hybrid clouds for CERNBox,
SWAN, and EOS, Enrico Bocchi, Luca Canali, Diogo Castro, Prasanth
Kothuri, Hugo Gonzalez Labrador, Maciej Malawski, Jakub T. Mościcki and Piotr Mrowczynski, EPJ Web of
Conferences 245, 07047 (2020)
- Using Big Data
Technologies for HEP Analysis, M. Cremonesi et al., EPJ Web of Conferences 214, 06030 (2019)
- Evolution of the
Hadoop Platform and Ecosystem for High Energy Physics, Z. Baranowski et al., EPJ Web of Conferences 214,
04058 (2019)
- A prototype for the
evolution of ATLAS EventIndex based on Apache Kudu storage, Z.
Baranowski et al., EPJ Web of
Conferences 214, 04057 (2019)
- Big Data Tools and
Cloud Services for High Energy Physics Analysis in TOTEM Experiment,
V. Avati et al., 2018,
Proceedings of the 2018 IEEE/ACM International Conference on Utility and Cloud
Computing Companion (UCC Companion)
- CMS
Analysis and Data Reduction with Apache Spark, O. Gutsche et al., 2018 J. Phys.: Conf. Ser. 1085
042030
- A
study of data representation in Hadoop to optimize data storage and search
performance for the ATLAS EventIndex, Zbigniew Baranowski, Luca
Canali, Rainer Toebbicke, Julius Hrivnac and Dario Barberis, 2017 J.
Phys.: Conf. Ser. 898 062020
- Integration
of Oracle and Hadoop: Hybrid Databases Affordable at Scale, Luca
Canali, Zbigniew Baranowski and Prasanth Kothuri, 2017 J. Phys.: Conf.
Ser. 898 042055
- An
Oracle-based event index for ATLAS, Elizabeth J Gallas, Gancho
Dimitrov, Petya Vasileva, Zbigniew Baranowski, Luca Canali, Andrei
Dumitru, Andrea Formica, 2017 J. Phys.: Conf. Ser. 898
042033
- Scale
Out Databases for CERN Use Cases, Zbigniew Baranowski, Maciej Grzybek,
Luca Canali, Daniel Lanza Garcia, Kacper Surdy, 2015 J. Phys.: Conf.
Ser. 664(4) 042002
- Evolution
of Database Replication Technologies for WLCG, Zbigniew Baranowski,
Lorena Lobato Pardavila, Marcin Blaszczyk,
Gancho Dimitrov, Luca Canali, 2015 J. Phys.: Conf. Ser. 664(4)
042032
- Sequential
data access with Oracle and Hadoop: a performance comparison, Zbigniew
Baranowski, Luca Canali and Eric Grancher, 2014 J.
Phys.: Conf. Ser. 513 042001
- ATLAS
database application enhancements using Oracle 11g, G Dimitrov, L
Canali, M Blaszczyk and R Sorokoletov,
2012 J. Phys.: Conf. Ser. 396 052027
- ATLAS
Data Management Accounting with Hadoop Pig and HBase, Mario
Lassnig, Vincent Garonne, Gancho Dimitrov and Luca
Canali, 2012 J. Phys.: Conf. Ser. 396 052044
- Structured
storage in ATLAS Distributed Data Management: use cases and experiences,
Mario Lassnig, Vincent Garonne, Angelos Molfetas,
Thomas Beermann, Gancho Dimitrov, Luca Canali, Donal Zang and Lisa Azzurra
Chinzer, 2012 J. Phys.: Conf. Ser. 396
052045
- Advanced
Technologies for Scalable ATLAS Conditions Database Access, R
Basset, L Canali, G Dimitrov, M Girone, R
Hawkings, P Nevski, A Valassi, A Vaniachine, F Viegas, R
Walker and A Wong, 2010 J. Phys.: Conf. Ser. 219 042025
Miscellaneous:
Last updated, June 2024