Google Summer of Code 2021

This year we are again participating in the Google Summer of Code. The MariaDB Foundation believes we are making a better database that remains application compatible with MySQL. We also work on making LGPL connectors (currently C, ODBC, Java, Node.js) and on MariaDB Galera Cluster, which allows you to scale your reads & writes. And we have MariaDB ColumnStore, which is a columnar storage engine, designed to process petabytes of data with real-time response to analytical queries.

Where to Start

Please join us on Zulip to mingle with the community. You should also subscribe to (this is the main list where we discuss development).

To improve your chances of being accepted, it is a good idea to submit a pull request with a bug fix to the server.

Also see the List of beginner friendly issues and issues labelled gsoc21 from the MariaDB Issue Tracker.

List of Tasks

Add autocompletion capabilities to the MariaDB Jupyter kernel

As part of the Jupyter Messaging protocol, the Jupyter frontend sends a complete_request message to the MariaDB kernel when the user invokes the code completer in a Jupyter notebook.

This message is handled in the do_complete function from the MariaDBKernel class. In simpler words, whenever the user hits the key shortcut for code autocompletion in a notebook, the MariaDB kernel's do_complete function is called with a number of arguments that help the kernel understand what the user wants to autocomplete.

So the autocompletion infrastructure in the MariaDB kernel is already kindly provided by Jupyter, we only need to send back to Jupyter a list of suggestions based on the arguments that do_complete receives :-).

Ideally we should aim to enable at least database, table and column name completion and also SQL keyword completion. But no worries, there are plenty of possibilities to extend the functionality even more if the accepted student turns out to be very productive :D

Details:Project Issue
Mentor:Robert Bindar

Implement interacting editing of result sets in the MariaDB Jupyter kernel

At this moment the MariaDB kernel is only capable of getting the results sets from the MariaDB client in HTML format and packing them in a Jupyter compatible format. Jupyter then displays them in notebooks like it would display Python Pandas dataframes.

Sure, the users can easily write SQL code to modify the content of a table like they would write in a classical command line database client. But we want to go a bit further, we would love to have the capability to edit a result set returned by a SELECT statement (i.e. double click on table cells and edit) and have a button that users can press to generate a SQL statement that will update the content of the table via the MariaDB server.

Apart from interacting with the Jupyter frontend for providing this UI capability, we also have to implement a field integrity functionality so that we make sure users can't enter data that is not compatible with the datatype of the column as it is seen by the MariaDB server.

The project should start with a fair bit of research to understand how we can play with the Jupyter Messaging protocol to create the UI functionality and also to check other Jupyter kernels and understand what's the right and best approach for tackling this.

Details:Project Issue
Mentor:Andreia Hendea

Make the MariaDB Jupyter kernel capable of dealing with huge SELECTs

Currently the MariaDB kernel doesn't impose any internal limits for the number of rows a user can SELECT in a notebook cell. Internally the kernel gets the result set from MariaDB and stores it in a pandas DataFrame, so users can use it with magic commands to chart data.

But this DataFrame is stored in memory, so if you SELECT a huge number of rows, say 500k or 1M, it's probably not a very good idea to create such a huge DataFrame. We tested with 500k rows, and the DataFrame itself is not the biggest problem, it consumed around 500MB of memory. The problem is the amount of rows the browser needs to render, for 500k rows the browser tab with the notebook consumes around 2GB of memory, so the Jupyter frontend (JupyterLab, Jupyter Notebook) slows down considerably.

A potential solution is to introduce a two new config options which would specify:

  • a limit for the number of rows the Jupyter notebook should render, a reasonable default value for this could 50 rows for instance (display_max_rows)
  • a limit for each SELECT statement, limit_max_rows, that the kernel would use to determine whether it should store the result set in memory in a DataFrame or store the result set on disk. A reasonable default value might be 100k rows.

The trickiest part of the project though is that, once the kernel writes a result set on disk, the charting magic commands need to detect that the data is not in memory, it is on disk, and they should find a smart mechanism for generating the chart from the disk data without loading the entire data in memory (which would defeat the whole purpose of the project). This might involve finding a new Python plotting library (instead of current matplotlib) that can accomplish the job.

Details:Project Issue
Mentor:Vlad Bogolin

Suggest a Task

Do you have an idea of your own, not listed above? Do let us know!


Comments loading...
Content reproduced on this site is the property of its respective owners, and this content is not reviewed in advance by MariaDB. The views, information and opinions expressed by this content do not necessarily represent those of MariaDB or any other party.