Spark descriptive name for cached dataframes

Concise names for cached tables.

By Georg Heiler

Have you ever wondered where the cryptic names of cached DataFrames and RDDs in Spark’s web UI come from? Usually no specific name is set. When you call df.cache, Spark auto-generates the name as a snippet of the query plan. This is not very descriptive, especially when there are many cached tables or when the Spark cluster is shared by several users.
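
For illustration, a minimal sketch of what happens with a plain cache call (assuming an active SparkSession named spark; the DataFrame is just a placeholder):

// Plain caching: the Storage tab labels the entry with a snippet
// of the query plan rather than a human-readable name.
val df = spark.range(100).toDF("id")
df.cache()
df.count() // an action materializes the cache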

However, there is a better way:

import org.apache.spark.sql.DataFrame
import org.apache.spark.storage.StorageLevel
import org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK

// Cache the query of `df` under an explicit, human-readable name.
def namedCache(name: String, storageLevel: StorageLevel = MEMORY_AND_DISK)(
    df: DataFrame): DataFrame = {
  df.sparkSession.sharedState.cacheManager
    .cacheQuery(df, Some(name), storageLevel)
  df
}

With this helper one can explicitly pass a name, and that name then shows up in the web UI. This greatly simplifies debugging for me.
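
A minimal usage sketch (assuming an active SparkSession named spark; the DataFrame and the cache name are illustrative):

val users = spark.range(1000).toDF("user_id")
namedCache("users")(users)
users.count() // an action materializes the cache; "users" now appears in the Storage tab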

Original post published here.
