Luca.Canali@cern.ch - @LucaCanaliDB - Home Page

Presentations, Talks and Videos:

Repositories at https://github.com/LucaCanali

   

Packages on PyPI

·         SparkMeasure - SparkMeasure is a tool for performance troubleshooting of Apache Spark workloads.

o   It simplifies the collection and analysis of Spark performance metrics. The bulk of sparkMeasure is written in Scala. This package contains the Python API for sparkMeasure and is intended to work in conjunction with PySpark. Use it from PySpark, in Jupyter notebook environments, or more generally to instrument Spark jobs in your Python code. Link to sparkMeasure GitHub page and documentation

·         SparkHistogram - SparkHistogram contains helper functions for generating data histograms with the Spark DataFrame API.

o   Link to SparkHistogram source code and documentation

·         TPCDS_PySpark - TPCDS_PySpark is a TPC-DS workload generator written in Python and designed to run at scale using Apache Spark. Use it to build your own Apache Spark Performance Lab, run performance benchmarks, and learn about troubleshooting Spark.

·         Test_CPU_parallel - Use test_CPU_parallel to generate CPU-intensive load on a system, running multiple threads in parallel.

o   The tool runs a CPU-burning loop concurrently on the system, with configurable parallelism. The tool outputs a measurement of the CPU-burning loop execution time as a function of load. Link to test-CPU-parallel source code and documentation
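As an illustration of the idea behind Test_CPU_parallel, here is a minimal sketch using only the Python standard library. It is not the tool's actual code: the function names `burn_cpu` and `run_load_test`, the loop body, and the default iteration count are all assumptions made for this example.

```python
import concurrent.futures as cf
import time


def burn_cpu(iterations: int) -> float:
    """Run a fixed amount of CPU-bound work; return its wall-clock duration in seconds."""
    start = time.perf_counter()
    total = 0
    for i in range(iterations):
        total += i * i  # arbitrary integer arithmetic, only meant to keep a core busy
    return time.perf_counter() - start


def run_load_test(max_workers, iterations=200_000, executor_cls=cf.ProcessPoolExecutor):
    """For each parallelism level 1..max_workers, run that many loops concurrently.

    Returns a dict mapping the parallelism level to the mean loop duration in seconds.
    executor_cls is a hypothetical knob for this sketch: processes give real CPU
    parallelism in CPython, while a thread pool is enough for a quick smoke test.
    """
    results = {}
    for workers in range(1, max_workers + 1):
        with executor_cls(max_workers=workers) as pool:
            durations = list(pool.map(burn_cpu, [iterations] * workers))
        results[workers] = sum(durations) / len(durations)
    return results


if __name__ == "__main__":
    # Print the mean loop time at each load level, from 1 to 4 concurrent workers.
    for workers, mean_seconds in run_load_test(max_workers=4).items():
        print(f"parallelism={workers} mean_loop_time={mean_seconds:.3f}s")
```

If the mean loop time stays roughly flat as parallelism grows, the system still has idle cores; a rising curve suggests contention for CPU resources.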

 

 

Images on DockerHub

 

·         Spark-Dashboard – Spark-dashboard is a container image to deploy an Apache Spark performance dashboard. It packages Grafana, InfluxDB, the configuration for ingesting Spark metrics, and prebuilt Grafana dashboards for Spark performance visualization. See the project home at Spark Performance Dashboard

·         Test_cpu_parallel – Use test_cpu_parallel to generate CPU-intensive load on Linux, Rust version. See the project home at Test_CPU_parallel_Rust

·         Test_cpu_parallel.py – Use test_cpu_parallel.py to generate CPU-intensive load on Linux, Python version. See the project home at Test_CPU_parallel_Python

 

Blog at http://externaltable.blogspot.com
             
and contributing to http://db-blog.web.cern.ch

·         Building an Apache Spark Performance Lab: Tools and Techniques for Spark Optimization

·         Enhancing Apache Spark Performance with Flame Graphs: A Practical Example Using Grafana Pyroscope

·         Performance Comparison of 5 JDKs on Apache Spark

·         Building a Semantic Search Engine and RAG Applications with Vector Databases and Large Language Models

·         Exploratory Notebooks for Deep Learning and Data Tools: A Beginner's Guide

·         CPU Load Testing Exercises: Tools and Analysis for Oracle Database Servers

·         Making histograms with Apache Spark and other SQL engines

·         Can High Energy Physics Analysis Profit from Apache Spark APIs?

·         Apache Spark 3.0 Memory Monitoring Improvements

·         Distributed Deep Learning for Physics with TensorFlow and Kubernetes

·         Machine Learning Pipelines for High Energy Physics Using Apache Spark with BigDL and Analytics Zoo

·         A Performance Dashboard for Apache Spark

·         SparkMeasure, a tool for performance troubleshooting of Apache Spark workloads

·         Performance Analysis of a CPU-Intensive Workload in Apache Spark

·         Apache Spark and CERN Open Data Analysis, an Example

·         Diving into Spark and Parquet Workloads, by Example

·         On Measuring Apache Spark Workload Metrics for Performance Troubleshooting

·         Spark notes (hosted on GitHub):

o    Miscellaneous Spark commands, tips, info

o    Spark performance dashboard config details

o    Spark workload measurements with sparkMeasure

o    Spark executor memory

o    Spark and Parquet

o    Apache Spark – HBase Connector

o    Spark for High Energy Physics

o    Spark Histograms

o    Flame Graph, tools on Linux for profiling Spark

o    Read/analyze Spark EventLog with Spark SQL

o    Tools for Linux memory performance measurements

o    Spark SQL, a fun UDF example with Mandelbrot set

o    Linux OS, CPU, disk, and network monitoring tools

o    Tools for Apache Parquet diagnostics

o    MapInArrow for Python UDF

o    Spark and OpenSearch

o    Example of a Scala project for Spark

·         Posters and reports:

o    Spark Executors’ Memory Configuration, Office Poster

o    Data Analysis and Machine Learning at Scale with Oracle Cloud Infrastructure

o    Machine learning pipelines with Apache Spark and Intel BigDL

o    Physics data analysis and data reduction at scale with Apache Spark

o    Physics data processing and machine learning in the cloud

Older blog entries:

(2016) IPython/Jupyter SQL Magic Functions for PySpark, Apache Spark 2.0 Performance Improvements Investigated With Flame Graphs, How to Build a Neural Network Scoring Engine in PL/SQL, IPython/Jupyter Notebooks for Oracle, Linux BPF/bcc for Oracle Tracing, IPython Notebooks for Querying Apache Impala, SystemTap Guru Mode and Oracle SQL Parsing, PerfSheet.js: Oracle AWR Data Visualization in the Browser with JavaScript Pivot Charts, Linux Perf Probes for Oracle Tracing

(2015) Extended Stack Profiling - Ideas, Tools and Comments, Slides of the CERN Talks at UKOUG Tech15, Oracle Wait Events Investigated with Extended Stack Profiling and Flame Graphs, Linux Kernel Stack Profiling and Flame Graphs Applied to Oracle Investigations, Add Color to Your SQL, Diagnose High-Latency I/O Operations Using SystemTap, Heat Map Visualization of Latency Histograms for NetApp C-Mode, Event Histogram Metric and Oracle 12c, Heat Map Visualization of I/O Latency with SystemTap and PyLatencyMap, Latest Updates to PerfSheet4, a Tool for Oracle AWR Data Mining and Visualization

(2014) Talks at UKOUG TECH 2014 with CERN Speakers, Life of an Oracle I/O: Tracing Logical and Physical I/O with SystemTap, SystemTap into Oracle for Fun and Profit, Scaling up Cardinality Estimates in 12.1.0.2, ASM Metadata, Internals and Diagnostic Utilities, Oracle Optimizer Investigated with Flame Graphs, Flame Graphs for Oracle, A Closer Look at CALIBRATE_IO, Recent Updates of OraLatencyMap and PyLatencyMap, Wait Event History Sampling, an Experiment in Oracle Performance Analysis, Clusterware 12c and Restricted Service Registration for RAC

(2013) How to Recover Files from a Dropped ASM Disk Group, UKOUG Tech13, Latency Investigations and Lost Writes, Daylight Saving Time Change and AWR Data Mining, Getting Started with PyLatencyMap: Latency Heat Maps for Oracle, DTrace and More Sources, PyLatencyMap, a Performance Tool for Latency Data Visualization, DTrace Explorations of Oracle Wait Events on Linux and Solaris, OraLatencyMap v1.1 and Testing I/O with SLOB 2, Oracle Events' Latency Visualization and Heat Maps in SQL*plus, Testing Lost Writes with Oracle and Data Guard, AWR Analytics and Oracle Performance Visualization with PerfSheet4

(2012) Active Data Guard and UKOUG 2012, Command-Line DBA Scripts, How to Turn Off Adaptive Cursor Sharing, Cardinality Feedback and Serial Direct Read, Recursive Subquery Factoring, Oracle SQL and Physics, Listener.ora and Oraagent in RAC 11gR2, Purging Cursors From the Library Cache Using Full_hash_value, Kerberos Authentication and Proxy Users, Hash Collisions in Oracle: SQL Signature and SQL_ID, SQL Signature, Text Normalization and MD5 Hash, SQL Patch and Force Match, V$EVENT_HISTOGRAM_METRIC, Performance Metrics Views, Of I/O Latency, Skew and Histograms 2/2, Of I/O Latency, Skew and Histograms 1/2


Publications:


Miscellaneous:


Last updated, June 2024