Each instance launched Presto and Alluxio, co-locating the two services. For hardware, we used three h1.8xlarge AWS instances, each with 8TB of ephemeral disk mounted for Alluxio to cache data local to Presto. S3 was mounted to Alluxio as the underlying persistent file system.
Two catalogs were configured for Presto: one connected to our existing Hive metastore, referencing the benchmark datasets stored externally on S3; the other connected to a separate Hive metastore with the benchmark tables created in Alluxio.
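A Presto Hive catalog of this kind is defined by a small properties file. The sketch below is hypothetical (the metastore hostname, port, and file name are our own placeholders, not the actual deployment values) and only illustrates the shape of the second catalog, whose tables resolve to Alluxio locations:

```properties
# Hypothetical file: etc/catalog/hive_alluxio.properties
connector.name=hive-hadoop2
# Placeholder URI for the separate metastore holding the Alluxio-backed tables
hive.metastore.uri=thrift://metastore-alluxio.example.internal:9083
```

The tables registered in this metastore point at alluxio:// paths, so queries against this catalog read through the Alluxio cache rather than directly from S3.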
We used the same datasets on S3 for performance comparison and pre-loaded the data into Alluxio with alluxio fs distributedLoad /testDB.
The Alluxio cluster was started with the following configuration settings:
# Impersonation
alluxio.master.security.impersonation.presto.users=*
alluxio.master.mount.table.root.ufs=<s3://alluxio_path>
alluxio.security.authorization.permission.enabled=false
alluxio.security.authentication.type=SIMPLE
alluxio.security.authorization.permission.supergroup=*
# Alluxio worker tier configuration
alluxio.worker.block.heartbeat.interval=30sec
# Prevent disk thrashing
alluxio.user.file.passive.cache.enabled=false
# Increase thread pool concurrency for Presto
alluxio.user.block.master.client.pool.size.max=256
alluxio.user.file.master.client.pool.size.max=256
# Return the full list of blocks
alluxio.user.ufs.block.location.all.fallback.enabled=true
# Worker properties
alluxio.worker.tieredstore.levels=1
alluxio.worker.tieredstore.level0.dirs.mediumtype=HDD
alluxio.worker.tieredstore.level0.dirs.path=</path1>
alluxio.worker.tieredstore.level0.dirs.quota=<1000GB>
# File replicas
alluxio.user.file.replication.max=3
# User properties
alluxio.user.file.readtype.default=CACHE_PROMOTE
# Write data only to Alluxio before returning a successful write
alluxio.user.file.writetype.default=MUST_CACHE
The above shows the initial alluxio-site.properties.
We noticed that Alluxio did not perform as expected when processing large numbers of small files, so we enabled metadata caching to improve performance.
alluxio.user.metadata.cache.enabled=true
alluxio.user.metadata.cache.max.size=100000
alluxio.user.metadata.cache.expiration.time=10min
The above shows the alluxio-site.properties additions that enable metadata caching.
Four independent tests were used to benchmark the performance with and without Alluxio:
Test 1 runs our internal benchmark, which consists of synthesized snapshots of player in-game events. The datasets are in ORC format with total sizes of 1GB, 10GB, and 100GB. Each dataset is created with the same DDL and contains 49 columns: 40 varchars, 5 booleans, and 4 maps. The benchmark queries select all columns with a filter condition on one varchar field, a typical query for I/O-heavy use cases.
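A query of the shape described above might look like the following sketch; the table and column names are hypothetical stand-ins, not the actual benchmark DDL:

```sql
-- Hypothetical names; illustrates the I/O-heavy query shape only:
-- all 49 columns are read, filtered on a single varchar field.
SELECT *
FROM player_events_100gb
WHERE event_type = 'match_started';
```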
Result: Alluxio with metadata caching is 2x to 7x faster than S3.
Test 2 simulates data visualization with game metadata and user engagement records. We selected two commonly used datasets and queries frequently used in Tableau and Dundas, respectively. The queries select all columns with a date filtering condition, followed by GROUP BY and ORDER BY of the date. This is a typical query that stresses both CPU and I/O. In this test, we did not need metadata caching enabled in Alluxio, as Alluxio already showed significant improvement without it.
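The query pattern described above can be sketched as follows; the table, column names, and date literal are hypothetical and only illustrate the filter-then-group-by-date shape:

```sql
-- Hypothetical names; date-filtered scan grouped and ordered by date.
SELECT event_date, COUNT(*) AS engagements
FROM user_engagement
WHERE event_date >= DATE '2021-01-01'
GROUP BY event_date
ORDER BY event_date;
```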
Result: Without metadata caching, Presto with Alluxio is 2.75x faster than S3 with the Dundas dataset and 5.1x faster with the Tableau dataset.
Test 3 simulates our dashboarding use case using a dataset with a large number of small files. The datasets are batches of 2MB files, totaling 50, 500, and 5000 files. The query is a SELECT that aggregates the number of entries for each date.
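A minimal sketch of such an aggregation, with a hypothetical table name standing in for the small-file dataset:

```sql
-- Hypothetical name; counts entries per date across many small files,
-- so most of the cost is metadata and file-open overhead rather than scan volume.
SELECT event_date, COUNT(*) AS entries
FROM dashboard_events
GROUP BY event_date;
```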
Result: Alluxio with metadata caching is 1.2x to 5.9x faster than S3; without metadata caching, Alluxio is only 1x to 1.35x faster. Enabling metadata caching significantly reduces execution time by caching file metadata, recognizing hot data, and increasing the number of replicas for it.
Test 4 simulates our conversational bot use case. The dataset used was a snapshot of daily game performance. The query contains multiple stages of calculation to simulate a CPU-intensive workload: it converts an integer field into a HyperLogLog, merges the sketches, and selects the cardinality. The results are filtered by an integer and a varchar field.
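In Presto, this pipeline maps naturally onto the built-in approx_set, merge, and cardinality functions. The sketch below uses hypothetical table, column, and filter names to show the shape of such a multi-stage query, not the actual benchmark query:

```sql
-- Hypothetical names; approx_set builds a HyperLogLog per group,
-- merge combines the sketches, cardinality extracts the estimate.
SELECT cardinality(merge(hll)) AS distinct_players
FROM (
  SELECT approx_set(player_id) AS hll
  FROM daily_game_performance
  WHERE region_id = 7 AND platform = 'pc'
  GROUP BY game_id
);
```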
Result: Alluxio with metadata caching shortens the execution time from 85.2 seconds to 3 seconds, improving performance by as much as 27x.
This blog explores an innovative platform that uses Presto as the compute engine and Alluxio as a data orchestration layer between Presto and S3 storage, supporting online services in the gaming industry that require near-instantaneous responses. We evaluated this platform with real industrial examples of data visualization, dashboarding, and a conversational bot.
Our preliminary results show that Presto with Alluxio outperforms S3 significantly in all cases. In particular, Alluxio with metadata caching shows up to 5.9x performance gain when handling large numbers of small files.
Alluxio enables the separation of storage and compute by managing the allocated ephemeral disks to cache data from S3 local to Presto. Advanced cache management, with an asymmetric number of replicas for hot versus cold data, accounts for the performance gains in each scenario we tested.