Google Summer of Code 2024

You are viewing an old version of this article. View the current version here.

This year we are again participating in the Google Summer of Code. We, joined with the MariaDB Foundation, believe we are making a better database that remains application compatible with MySQL. We also work on making LGPL connectors (currently C, C++, ODBC, Java, Node.js) and on MariaDB Galera Cluster, which allows you to scale your reads & writes. And we have MariaDB ColumnStore, which is a columnar storage engine, designed to process petabytes of data with real-time response to analytical queries.

Where to Start
List of Tasks
1. MariaDB Server
Suggest a Task

Where to Start

Please join us on Zulip to mingle with the community. You should also subscribe to the developers mailing list (this is the main list where we discuss development - there are also other mailing lists).

To improve your chances of being accepted, it is a good idea to submit a pull request with a bug fix to the server.

Also see the List of beginner friendly issues from the MariaDB Issue Tracker.

List of Tasks

MariaDB Server

Implement IVFFlat indexing strategy for MariaDB Vector and evaluate performance

Part-time (175h) or full-time project (350h) - depending on scope MariaDB Vector is coming to MariaDB Server to serve AI Workloads. The current indexing strategy will use HNSW, but IVFFlat is a possible alternative that costs fewer resources to create. Having it as an option is desirable.

MDEV-17398 Spatial (GIS) functions in MariaDB

Part-time (175h) or full-time project (350h) - depending on scope

Our GIS functionality is limitted compared to other DBMSes. Given that MariaDB looks to facilitate migration from MySQL, we should be on par. We have a list of functions that are missing in MariaDB compared to MySQL, as described in https://mariadb.com/kb/en/function-differences-between-mariadb-1010-and-mysql-80/.
Our goal is to have as many of these functions available within MariaDB. Some of the functionality can be ported from MySQL, while others might require implementation from scratch.

Skills needed: Understanding of C++ development. Ability to navigate a large codebase (with help from mentor).
Mentors: Anel Husakovic (primary) / Vicențiu Ciorbaru (secondary)

MDEV-16482 MariaDB Oracle mode misses Synonyms

Full-time project 350h

Synonyms are an important feature, particularly as it helps smooth migration from other databases. While the initial project scope seems straightforward, there are a number of aspects that must be considered:

Grammar extension
Where will the synonyms definitions be stored?
How do synonyms map to the underlying privilege system? Who can create a synonym? Who can access a synonym?
Do we enforce the underlying object to exists before creating a synonym? What if the underlying object gets dropped?
What kind of error messages do we present to the user in various corner cases?
How do synonyms interact with replication (row based vs statement based)
How do synonyms interact with views (and views execution)
How to present synonyms to users (as part of INFORMATION_SCHEMA for instance?)
Performance considerations for multiple connections to the database.

Skills needed: Understanding of C++ development. Able to write and discuss various tradeoffs such that we achieve a feature set that makes sense given the database's priorities.
Mentors: Vicențiu Ciorbaru (primary) / Michael Widenius (secondary)

MDEV-30645 CREATE TRIGGER FOR { STARTUP | SHUTDOWN }

Full-time project 350h

Support generalized triggers like

CREATE TRIGGER ... AFTER STARTUP ...
CREATE TRIGGER ... BEFORE SHUTDOWN ...
CREATE TRIGGER ... ON SCHEDULE ...

the latter being a synonym for CREATE EVENT.

should STARTUP/SHUTDOWN triggers run exclusively? that is, STARTUP trigger is run before any connection is allowed or in parallel with them? Same for SHUTDOWN.

MDEV-21978 make my_vsnprintf to use gcc-compatible format extensions

Part-time project 175h

my_vsnprintf() is used internally in the server as a portable printf replacement. And it's also exported to plugins as a service.

It supports a subset of printf formats and three extensions:

%`s means that a string should be quoted like an `identifier`
%b means that it's a binary string, not zero-terminated; printing won't stop at \0, so one should always specify the field width (like %.100b)
%M is used in error messages and prints the integer (errno) and the corresponding strerror() for it
%T takes string and print it like %s but if the string should be truncated puts "..." at the end

gcc knows printf formats and check whether actual arguments match the format string and issue a warning if they don't. Unfortunately there seems to be no easy way to teach gcc our extensions, so for now we have to disable printf format checks.

An better approach would be to use gcc compatible format extensions, like Linux kernel does. We should migrate to a different syntax for our extensions

%sI to mean "print as an identifier"
%sB to mean "print a binary string"
%uE to mean "print an errno"
%sT to put a "..." as truncation indicator

old formats can still be supported or they can be removed and in the latter case the major version of the service should be increased to signal an incompatible change.

All error messages and all usages of my_vsnprintf should be changed to use the new syntax. One way to do it is to disable old syntax conditionally, only in debug builds. All gcc printf format checks should be enabled.

Skills needed: Understanding of C development.
Mentors: Sergei Golubchik

MCOL-5688 Parallel CSV read leveraging Apache Arrow

Full-time project 350h

cpimport is a binary that ingests data into MCS in an efficient manner reducing ingest timings significantly whilst preserving transaction isolation levels.

cpimport is relatively complex facility that reads data from local file/S3 parses it, converts and put into MCS-specific files. cpimport is unable to read a big-sized single CSV file from disk in parallel. Apache Arrow has a CSV read faciilty that can do parallel CSV read. The goal of the project is to replace an existing homebrew CSV parser implemented in cpimport with the one from Apache Arrow.

Skills needed: modern C++.
Mentors: Leonid Fedorov

MCOL-4889 Manual vacuum cleaning for on-disk data empty records

Full-time project 350h

Here extent is a unit of group of columnar values and partition is a group of extents that stores all column values for a specific portion of a table. MCS has a notion of an empty value for columnar segment/token files and dictionaries. Empty values are marked with a bit in the special 1 byte auxiliary column that is created for every table. When DELETE removes records from a table the records are marked with empty bit in the auxiliary column. The deleted records become a wasted disk space. The goal of the project is to reclaim the wasted disk space either re-creating the whole partition or moving partition values.

Skills needed: modern C++
Mentors: Roman Nozdrin

Suggest a Task

Do you have an idea of your own, not listed above? Do let us know in the comments below (Click 'Login' on the top of the page first)!

Comments

Comments loading...

Content reproduced on this site is the property of its respective owners, and this content is not reviewed in advance by MariaDB. The views, information and opinions expressed by this content do not necessarily represent those of MariaDB or any other party.

Google Summer of Code 2024

Contents

Where to Start

List of Tasks

MariaDB Server

Implement IVFFlat indexing strategy for MariaDB Vector and evaluate performance

MDEV-17398 Spatial (GIS) functions in MariaDB

MDEV-16482 MariaDB Oracle mode misses Synonyms

MDEV-30645 CREATE TRIGGER FOR { STARTUP | SHUTDOWN }

MDEV-21978 make my_vsnprintf to use gcc-compatible format extensions

MCOL-5688 Parallel CSV read leveraging Apache Arrow

MCOL-4889 Manual vacuum cleaning for on-disk data empty records

Suggest a Task

Comments

Products

Services

Pricing

Resources

About Us

Download MariaDB