Presentations and Talks:
- Introduction to Apache Spark APIs for Data Processing, training course on Apache Spark, November 2022, PDFs_and_Videos, Notebooks
- Basic Physics Analyses Implemented Using Apache Spark, PyHEP 2022, September 14th, 2022, pptx, PDF, PDF_extended, Notebooks, Video
- Monitor Apache Spark 3 on Kubernetes using Metrics and Plugins, Data+AI Summit 2021, May 26th, 2021, pptx, PDF, Video, Demo (mp4)
- What is New with Apache Spark Performance Monitoring in Spark 3.0, Data+AI Summit Europe 2020, November 18th, 2020, pptx, PDF, Video
- Big Data Tools and Pipelines for Machine Learning in HEP, CERN EP-IT Data science seminar, December 4th, 2019, pptx, PDF
- Performance Troubleshooting Using Apache Spark Metrics, Spark Summit Europe 2019, Amsterdam, October 17th, 2019, pptx, PDF, Video
- Deep Learning Pipelines for High Energy Physics using Apache Spark with Distributed Keras on Analytics Zoo, Spark Summit Europe 2019, Amsterdam, October 16th, 2019, pptx, PDF, Video
- Big Data In HEP - Physics Data Analysis, Machine learning and Data Reduction at Scale with Apache Spark, IXPUG Annual Conference 2019, CERN, September 24th, 2019, pptx, PDF
- Apache Spark for RDBMS Practitioners, Spark Summit Europe 2018, London, October 4th, 2018, pptx, PDF, Video
- Data Analytics – Use Cases, Platforms, Services @ CERN IT, ITMM Meeting, CERN, March 5th, 2018, pptx, PDF
- Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Methods, Spark Summit Europe 2017, Dublin, October 26th, 2017, pptx, PDF, Video
- Overview of Big Data Solutions and Services at CERN, CERN Knowledge Transfer Forum, CERN, September 29th, 2017, slides: pptx, PDF
- Hadoop and Spark Ecosystem for Data Analytics, Experience and Outlook, WLCG GDB meeting, CERN, September 13th, 2017, slides: pptx, PDF
- Data Analytics and CERN IT Hadoop Service, CERN openlab Technical Workshop, CERN, December 9th, 2016, slides: pptx, PDF
- Apache Spark 2.0 Performance Improvements Investigated With Flame Graphs, Spark Summit Europe 2016, Brussels, October 26th, 2016, slides: pptx, PDF, Video
- Integration of Oracle and Hadoop: hybrid databases affordable at scale, CHEP 2016, San Francisco, October 11th, 2016, slides: pptx, PDF
- Stack Traces and Flame Graphs for Oracle Troubleshooting, UKOUG Tech15 Super Sunday, Birmingham, December 6th, 2015, slides: pptx, PDF
- Modern Linux Tools for Oracle Troubleshooting, Swiss Oracle User Group (SOUG) event, Prangins (CH), May 21st, 2015, slides: PDF
- Database Services During Run 2, WLCG Collaboration Workshop, Okinawa (JP), April 11th, 2015, slides: pptx
- Modern Linux Tools for Oracle Troubleshooting, UKOUG Tech14, Liverpool, December 9th, 2014, slides: pptx, PDF
- A Closer Look at CALIBRATE_IO, UKOUG Tech14, Liverpool, December 9th, 2014, slides: pptx, PDF
- Introduction on Data for Physics at CERN and Deep Dive into Oracle ASM, Enkitec E4 2014, Dallas (TX), June 2014, pptx
- A Latency Picture is Worth a Thousand Storage Metrics, Hotsos 2014, Dallas (TX), March 4th, 2014, pptx
- Lost Writes, a DBA's Nightmare?, UKOUG Tech13, Manchester, December 4th, 2013, pptx
- Storage Latency for Oracle DBAs, UKOUG Tech13, Manchester, December 2nd, 2013, pptx
- Active Data Guard at CERN, UKOUG Conference 2012, Birmingham, December 4th, 2012, pptx
- Testing Storage for Oracle RAC 11g with NAS, ASM, and SSD Flash Cache, UKOUG Conference 2011, Birmingham, December 6th, 2011, pptx
- CERN IT-DB Deployment, Status, Outlook, ESA-GAIA DB Workshop, ISDC, Geneva, March 2011, pptx
- Click here for a list including talks prior to 2011
- SparkMeasure - A tool for performance troubleshooting of Apache Spark workloads.
- SparkTraining
- Spark Performance Dashboard - Dockerfile and Helm chart for deploying an Apache Spark performance dashboard.
- SparkPlugins - Code and examples of how to use Spark Plugin extensions with Apache Spark 3.0 to extend the Spark metrics system with custom monitoring probes for OS, I/O and external applications.
- SparkDLTrigger
- Miscellaneous
- Notes on Apache Spark and related tech
- Jupyter notebooks with examples of Spark for Physics.
- SparkHistograms, a Python package for generating histograms with Spark
- Jupyter notebooks with examples of how to read from Oracle, Trino/Presto, PostgreSQL, YugabyteDB
- Linux tracing scripts - Scripts and tools for troubleshooting and performance analysis in Linux.
- PerfSheet4 - PerfSheet4 is a tool to query and visualize Oracle AWR data using Excel pivot charts.
- PerfSheet.js - PerfSheet.js is a tool to extract and visualize Oracle AWR time series data in the browser using JavaScript and dynamic pivot charts.
- PyLatencyMap - PyLatencyMap is a tool for heat map visualization on the CLI.
- Stack Profiling - Tools and scripts for stack profiling: Userspace, Kernel, OS state and optionally Oracle wait events.
- Oracle DBA scripts - A collection of DBA scripts for old-school CLI Oracle troubleshooting and performance monitoring.
- OraLatencyMap - OraLatencyMap is a performance widget running on SQL*plus (Oracle's CLI) to collect and visualize latency histograms for Oracle wait events using heat maps.
· Making histograms with Apache Spark and other SQL engines
· Can High Energy Physics Analysis Profit from Apache Spark APIs?
· Apache Spark 3.0 Memory Monitoring Improvements
· Distributed Deep Learning for Physics with TensorFlow and Kubernetes
· Machine Learning Pipelines for High Energy Physics Using Apache Spark with BigDL and Analytics Zoo
· A Performance Dashboard for Apache Spark
· SparkMeasure, a tool for performance troubleshooting of Apache Spark workloads
· Performance Analysis of a CPU-Intensive Workload in Apache Spark
· Apache Spark and CERN Open Data Analysis, an Example
· Diving into Spark and Parquet Workloads, by Example
· On Measuring Apache Spark Workload Metrics for Performance Troubleshooting
· Spark notes (hosted on GitHub):
o Miscellaneous Spark commands, tips, info
o Spark performance dashboard config details
o Spark workload measurements with sparkMeasure
o Spark executor memory
o Apache Spark – HBase Connector
o Spark for_High_Energy_Physics
o Spark Histograms
o Flame Graph, tools on Linux for profiling Spark
o Read/analyze Spark EventLog with Spark SQL
o Tools for Linux memory_performance measurements
o Spark SQL, a fun UDF_example with Mandelbrot set
o Linux_OS_CPU_Disk_Network monitoring tools
o Tools_for Apache Parquet_diagnostics
o MapInArrow for Python UDF
o Example of a Scala project for Spark
· Posters and reports:
o Spark Executors’ Memory Configuration, Office Poster
o Data Analysis and Machine Learning at Scale with Oracle Cloud Infrastructure
o Machine learning pipelines with Apache Spark and Intel BigDL
o Physics data analysis and data reduction at scale with Apache Spark
o Physics data processing and machine learning in the cloud
Older blog entries:
(2016)
- IPython/Jupyter SQL Magic Functions for PySpark
- Apache Spark 2.0 Performance Improvements Investigated With Flame Graphs
- How to Build a Neural Network Scoring Engine in PL/SQL
- IPython/Jupyter Notebooks for Oracle
- Linux BPF/bcc for Oracle Tracing
- IPython Notebooks for Querying Apache Impala
- SystemTap Guru Mode and Oracle SQL Parsing
- PerfSheet.js: Oracle AWR Data Visualization in the Browser with JavaScript Pivot Charts
- Linux Perf Probes for Oracle Tracing
(2015)
- Extended Stack Profiling - Ideas, Tools and Comments
- Slides of the CERN Talks at UKOUG Tech15
- Oracle Wait Events Investigated with Extended Stack Profiling and Flame Graphs
- Linux Kernel Stack Profiling and Flame Graphs Applied to Oracle Investigations
- Add Color to Your SQL
- Diagnose High-Latency I/O Operations Using SystemTap
- Heat Map Visualization of Latency Histograms for NetApp C-Mode
- Event Histogram Metric and Oracle 12c
- Heat Map Visualization of I/O Latency with SystemTap and PyLatencyMap
- Latest Updates to PerfSheet4, a Tool for Oracle AWR Data Mining and Visualization
(2014)
- Talks at UKOUG TECH 2014 with CERN Speakers
- Life of an Oracle I/O: Tracing Logical and Physical I/O with SystemTap
- SystemTap into Oracle for Fun and Profit
- Scaling up Cardinality Estimates in 12.1.0.2
- ASM Metadata, Internals and Diagnostic Utilities
- Oracle Optimizer Investigated with Flame Graphs
- Flame Graphs for Oracle
- A Closer Look at CALIBRATE_IO
- Recent Updates of OraLatencyMap and PyLatencyMap
- Wait Event History Sampling, an Experiment in Oracle Performance Analysis
- Clusterware 12c and Restricted Service Registration for RAC
(2013)
- How to Recover Files from a Dropped ASM Disk Group
- UKOUG Tech13, Latency Investigations and Lost Writes
- Daylight Saving Time Change and AWR Data Mining
- Getting Started with PyLatencyMap: Latency Heat Maps for Oracle, DTrace and More Sources
- PyLatencyMap, a Performance Tool for Latency Data Visualization
- DTrace Explorations of Oracle Wait Events on Linux and Solaris
- OraLatencyMap v1.1 and Testing I/O with SLOB 2
- Oracle Events' Latency Visualization and Heat Maps in SQL*plus
- Testing Lost Writes with Oracle and Data Guard
- AWR Analytics and Oracle Performance Visualization with PerfSheet4
(2012)
- Active Data Guard and UKOUG 2012
- Command-Line DBA Scripts
- How to Turn Off Adaptive Cursor Sharing, Cardinality Feedback and Serial Direct Read
- Recursive Subquery Factoring, Oracle SQL and Physics
- Listener.ora and Oraagent in RAC 11gR2
- Purging Cursors From the Library Cache Using Full_hash_value
- Kerberos Authentication and Proxy Users
- Hash Collisions in Oracle: SQL Signature and SQL_ID
- SQL Signature, Text Normalization and MD5 Hash
- SQL Patch and Force Match
- V$EVENT_HISTOGRAM_METRIC, Performance Metrics Views
- Of I/O Latency, Skew and Histograms 2/2
- Of I/O Latency, Skew and Histograms 1/2
Publications:
- Machine Learning Pipelines with Modern Big Data Tools for High Energy Physics, Matteo Migliorini, Riccardo Castellotti, Luca Canali, Marco Zanetti, Comput Softw Big Sci 4, 8 (2020)
- ScienceBox Converging to Kubernetes containers in production for on-premises and hybrid clouds for CERNBox, SWAN, and EOS, Enrico Bocchi, Luca Canali, Diogo Castro, Prasanth Kothuri, Hugo Gonzalez Labrador, Maciej Malawski, Jakub T. Mościcki and Piotr Mrowczynski, EPJ Web of Conferences 245, 07047 (2020)
- Using Big Data Technologies for HEP Analysis, M. Cremonesi et al., EPJ Web of Conferences 214, 06030 (2019)
- Evolution of the Hadoop Platform and Ecosystem for High Energy Physics, Z. Baranowski et al., EPJ Web of Conferences 214, 04058 (2019)
- A prototype for the evolution of ATLAS EventIndex based on Apache Kudu storage, Z. Baranowski et al., EPJ Web of Conferences 214, 04057 (2019)
- Big Data Tools and Cloud Services for High Energy Physics Analysis in TOTEM Experiment, V. Avati et al., 2018, Proceedings of the 2018 IEEE/ACM International Conference on Utility and Cloud Computing Companion (UCC Companion)
- CMS Analysis and Data Reduction with Apache Spark, O. Gutsche et al., 2018 J. Phys.: Conf. Ser. 1085 042030
- A study of data representation in Hadoop to optimize data storage and search performance for the ATLAS EventIndex, Zbigniew Baranowski, Luca Canali, Rainer Toebbicke, Julius Hrivnac and Dario Barberis, 2017 J. Phys.: Conf. Ser. 898 062020
- Integration of Oracle and Hadoop: Hybrid Databases Affordable at Scale, Luca Canali, Zbigniew Baranowski and Prasanth Kothuri, 2017 J. Phys.: Conf. Ser. 898 042055
- An Oracle-based event index for ATLAS, Elizabeth J Gallas, Gancho Dimitrov, Petya Vasileva, Zbigniew Baranowski, Luca Canali, Andrei Dumitru, Andrea Formica, 2017 J. Phys.: Conf. Ser. 898 042033
- Scale Out Databases for CERN Use Cases, Zbigniew Baranowski, Maciej Grzybek, Luca Canali, Daniel Lanza Garcia, Kacper Surdy, 2015 J. Phys.: Conf. Ser. 664(4) 042002
- Evolution of Database Replication Technologies for WLCG, Zbigniew Baranowski, Lorena Lobato Pardavila, Marcin Blaszczyk, Gancho Dimitrov, Luca Canali, 2015 J. Phys.: Conf. Ser. 664(4) 042032
- Sequential data access with Oracle and Hadoop: a performance comparison, Zbigniew Baranowski, Luca Canali and Eric Grancher, 2014 J. Phys.: Conf. Ser. 513 042001
- ATLAS database application enhancements using Oracle 11g, G Dimitrov, L Canali, M Blaszczyk and R Sorokoletov, 2012 J. Phys.: Conf. Ser. 396 052027
- ATLAS Data Management Accounting with Hadoop Pig and HBase, Mario Lassnig, Vincent Garonne, Gancho Dimitrov and Luca Canali, 2012 J. Phys.: Conf. Ser. 396 052044
- Structured storage in ATLAS Distributed Data Management: use cases and experiences, Mario Lassnig, Vincent Garonne, Angelos Molfetas, Thomas Beermann, Gancho Dimitrov, Luca Canali, Donal Zang and Lisa Azzurra Chinzer, 2012 J. Phys.: Conf. Ser. 396 052045
- Advanced Technologies for Scalable ATLAS Conditions Database Access, R Basset, L Canali, G Dimitrov, M Girone, R Hawkings, P Nevski, A Valassi, A Vaniachine, F Viegas, R Walker and A Wong, 2010 J. Phys.: Conf. Ser. 219 042025
Miscellaneous:
Last updated, November 2022