Projects
FlinkSketch - A Sketch Analytical System for Multidimensional Data Streams
August 2022 - Present
[To be open-sourced]
- Implemented a sketch analytical system for multidimensional data streams based on Apache Flink.
- Implemented the Sketch-of-sketches mechanism.
- Integrated the universal sketch algorithm.
- Implemented the hashed mini-batch mechanism.
- Fully deployable through common public network services, e.g., Flink, Kafka, PostgreSQL, and Grafana.
- Enabled easy management of sketches and queries through full-duplex client-engine communication.
Benchmarking Cloud-Native Storage Engines
April 2021 - June 2022
- Studied the characteristics of I/O workloads of the cloud-native storage engine at ByteDance.
- Leveraged ML models to predict I/O workload characteristics from compute workload characteristics in ByteDance's cloud-native database.
- Adapted YCSB to design and implement a learned benchmark to generate realistic I/O workloads.
- Improved the accuracy of tail latency measured by benchmark tools by orders of magnitude.
- Submitted an industry-track paper as the primary author.
Interactive Workflow System Texera
June 2021 - September 2021
- Implemented the set difference and set symmetric difference operators. [LINK]
- Designed and implemented storage backends and a workflow rewriter for operator materialization. [LINK]
GPU-integrated OLAP System GHive
April 2020 - April 2021
- Implemented the prototype of GHive, a GPU-integrated OLAP system based on Apache Hive.
- Designed and implemented operators, dataflow, and data processing logic.
- Implemented low-cost data transfer between JVM and native environment.
- Published a demo paper at SIGMOD’22 as a co-author.
- Published an industry-track paper at SoCC’22 as a co-author.
For a list of other (course) projects, go here.