This project configures and optimizes a PostgreSQL database to handle both transactional and analytical workloads over a GitHub-based dataset. For the analytical queries, we also compared PostgreSQL's performance against DuckDB.
The project was developed using virtual machines on Google Cloud, where we automated the environment setup using machine images and custom scripts to ensure consistency across tests.
For the Transactional Workload, we analyzed execution plans to identify bottlenecks and implemented several improvements:
- Redundancy: Created specialized indices and materialized views to speed up common queries such as GetRepositoryInfo and SearchRepositories.
- Parameter Tuning: Systematically adjusted PostgreSQL parameters such as shared_buffers, work_mem, and effective_cache_size.
- Durability vs. Performance: Tested the impact of fsync and synchronous_commit to find the best balance between durability guarantees and throughput.
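As an illustration of the redundancy approach, here is a minimal sketch; the table, column, and view names are hypothetical, not the actual project schema:

```sql
-- Hypothetical schema: repositories(id, name, stars, owner_id)

-- Index to speed up name lookups of the kind SearchRepositories issues
CREATE INDEX idx_repositories_name ON repositories (name);

-- Materialized view precomputing per-owner aggregates,
-- trading storage and refresh cost for faster reads
CREATE MATERIALIZED VIEW repo_summary AS
SELECT owner_id,
       COUNT(*)   AS repo_count,
       SUM(stars) AS total_stars
FROM repositories
GROUP BY owner_id;

-- Unlike a plain view, this must be refreshed when the base table changes
REFRESH MATERIALIZED VIEW repo_summary;
```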
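The tuning and durability experiments amount to varying settings like these in postgresql.conf; the values below are illustrative only, not the project's final configuration:

```
# postgresql.conf fragment (illustrative values)
shared_buffers = 2GB            # memory dedicated to PostgreSQL's buffer cache
work_mem = 64MB                 # per-sort/hash memory before spilling to disk
effective_cache_size = 6GB      # planner hint: total cache (OS + PostgreSQL)

# Durability vs. performance trade-offs
fsync = on                      # turning this off boosts throughput but risks
                                # data corruption after a crash
synchronous_commit = off        # commits return before the WAL is flushed;
                                # the most recent transactions may be lost
```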
In the Analytical Workload, we tackled complex queries (Q1-Q4) by refactoring SQL code and using advanced indexing:
- SQL Refactoring: Simplified the relational algebra, replaced inefficient NOT IN clauses with LEFT JOIN anti-joins, and used LATERAL JOINs for better aggregation.
- Comparison: Extensively compared PostgreSQL results with DuckDB, noting how DuckDB's columnar storage often outperformed row-oriented PostgreSQL on large-scale analytical tasks.
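The two refactoring patterns look roughly like this; the tables and columns are hypothetical stand-ins for the project schema:

```sql
-- Hypothetical tables: repositories(id), forks(source_repo_id, forked_at)

-- Before: NOT IN often forces a per-row subplan, and silently returns
-- no rows if the subquery produces any NULL
SELECT r.id
FROM repositories r
WHERE r.id NOT IN (SELECT f.source_repo_id FROM forks f);

-- After: LEFT JOIN anti-join, which the planner can run as a hash join
SELECT r.id
FROM repositories r
LEFT JOIN forks f ON f.source_repo_id = r.id
WHERE f.source_repo_id IS NULL;

-- LATERAL join: a per-row subquery, e.g. the 3 most recent forks
-- of each repository
SELECT r.id, recent.forked_at
FROM repositories r
CROSS JOIN LATERAL (
  SELECT f.forked_at
  FROM forks f
  WHERE f.source_repo_id = r.id
  ORDER BY f.forked_at DESC
  LIMIT 3
) recent;
```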
We were very satisfied with the results: the final transactional configuration achieved a 17.6x increase in throughput and a 94% reduction in response time. In the analytical configuration, 3 of the 4 queries greatly surpassed DuckDB's performance.
For a full report and project files, check out the repository at https://github.com/luis25franca/dba-postgres-benchmark.