Understanding file paths in Databricks
October 06, 2023
When first getting started with Databricks, one of the confusing aspects can be figuring out how to specify file paths to read or write files. This can lead to frustrating FileNotFound errors or wandering the file system searching for that file you're pretty sure you just serialized... somewhere.
In this post, we'll dig into the mechanics of file paths in Databricks, discuss how to work with them, and hopefully get a better understanding of their nuances.
Before discussing file paths, it's helpful to know that there are two file systems in Databricks: the Local File System on the Driver Node and the Databricks File System (DBFS).
The Driver Node is the node on the cluster responsible for executing your code. The Driver node coordinates and manages the work across the other nodes in the cluster.
The Driver Node has its own file system, known as the Local File System. Files stored in the Local File System exist in volume storage attached to the cluster, which differs from object storage.
DBFS is the file system abstraction that sits over object storage (e.g., S3 or Blob). DBFS lets users interact with their object storage like a regular file system rather than using object URLs.
DBFS is also what we see when we click the Browse DBFS button in the Catalog area of the Databricks UI.
In Databricks, the code being executed (e.g., Python, Spark SQL) is associated with a default location on either the Local File System of the Driver Node or DBFS. Because of this, how we specify a file path depends on the code we're running and the file system we're accessing.
If we're accessing a location on a file system that is different from the default for our code, then we modify the target file path to reflect the non-default file system.
The following table, adapted from the Databricks documentation, shows the defaults for the different code types.
| Command/Code | Default Location | Prefix to access DBFS | Prefix to access Local File System |
|--------------|------------------|-----------------------|------------------------------------|
| Spark SQL | DBFS Root | Optional (dbfs:/) | file:/ |
| Python code, e.g., pandas | Local File System | /dbfs | Not needed (default) |
| %sh | Local File System | /dbfs | Not needed (default) |
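The table's rules can be sketched as a small Python helper. This is a hypothetical illustration for building path strings, not a Databricks API; the context names are assumptions made for the example.

```python
# Hypothetical helper (not a Databricks API) encoding the table above:
# given the kind of code being run and a DBFS-relative path, return the
# path string that code should use to reach the file on DBFS.
PREFIX_FOR_DBFS = {
    "spark_sql": "dbfs:/",   # DBFS is the default; the prefix is optional but explicit
    "dbutils_fs": "dbfs:/",  # DBFS is the default; the prefix is optional but explicit
    "python": "/dbfs/",      # local file APIs (os, pandas) default to the local FS
    "sh": "/dbfs/",          # %sh defaults to the local file system
}

def dbfs_path_for(context: str, relative_path: str) -> str:
    """Build the path string for a DBFS file as seen from a given context."""
    return PREFIX_FOR_DBFS[context] + relative_path

print(dbfs_path_for("python", "FileStore/my_csv.csv"))     # /dbfs/FileStore/my_csv.csv
print(dbfs_path_for("spark_sql", "FileStore/my_csv.csv"))  # dbfs:/FileStore/my_csv.csv
```

The point of the sketch is that the prefix depends only on the code's default file system, not on where the file actually lives.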
Let's look at a handful of commands to see how the default location behavior works.
The default file system location for the %fs magic command is DBFS. When we run %fs ls, we get the contents of the DBFS Root.
%fs ls /
| # | path | name | size | modificationTime |
|-----|-----------------|------------|------|------------------|
| 1 | dbfs:/FileStore | FileStore/ | 0 | 0 |
| 2 | dbfs:/mnt/ | mnt/ | 0 | 0 |
| 3 | dbfs:/tmp/ | tmp/ | 0 | 0 |
| ... | ... | ... | ... | ... |
If we want to list the contents of a directory on the Local File System, we need to include the file:/ prefix on the file path.
%fs ls file:/
| # | path | name | size | modificationTime |
|-----|------------|------|------|------------------|
| 1 | file:/bin | bin/ | 0 | 0 |
| 2 | file:/etc/ | etc/ | 0 | 0 |
| 3 | file:/lib/ | lib/ | 0 | 0 |
| ... | ... | ... | ... | ... |
We can also use regular Python code to list directory contents using the os module of the standard library. The default file system for Python code is the Local File System of the Driver Node. As expected, we see different contents from what's in DBFS.
import os
print(os.listdir("/"))
['bin', 'lib', 'databricks', 'var', 'tmp', 'etc', ... ]
If we want to list the directory contents of the DBFS Root, then we must add the /dbfs prefix to the file path.
import os
print(os.listdir("/dbfs/"))
['FileStore', 'tmp', 'mnt', 'databricks-datasets', ...]
Here's another example where we specify both file systems simultaneously. We can use dbutils.fs to copy a file from the Local File System to DBFS. The default for dbutils.fs is DBFS, so we add the file:/ prefix to indicate the Local File System.
dbutils.fs.cp("file:/tmp/my_file.txt", "/FileStore/my_file.txt")
Now, the same operation using the %sh (shell) magic command, which defaults to the Local File System.
%sh cp /tmp/my_file.txt /dbfs/FileStore/
Spark SQL also defaults to DBFS.
CREATE OR REPLACE TEMPORARY VIEW my_view
USING CSV
OPTIONS (path "/FileStore/my_csv.csv");
And because Spark SQL defaults to DBFS, the same rule applies; we add file:/ to read files from the Local File System.
CREATE OR REPLACE TEMPORARY VIEW my_view
USING CSV
OPTIONS (path "file:/tmp/my_csv.csv");
For brevity, I've omitted examples for Pyspark, but it follows the same pattern as the examples above.
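The Spark-side resolution rule is worth making explicit: in code where DBFS is the default, a path with no scheme resolves to DBFS, while an explicit dbfs:/ or file:/ scheme is respected as-is. A rough sketch of that rule (resolve_spark_path is a hypothetical illustration, not a Databricks or Spark API):

```python
# Hypothetical resolver mirroring the rule for DBFS-default code (Spark SQL,
# PySpark, dbutils.fs): paths without an explicit scheme resolve to DBFS.
def resolve_spark_path(path: str) -> str:
    if path.startswith(("dbfs:/", "file:/")):
        return path              # scheme already explicit; use it as-is
    return "dbfs:" + path        # no scheme: the default file system is DBFS

print(resolve_spark_path("/FileStore/my_csv.csv"))  # dbfs:/FileStore/my_csv.csv
print(resolve_spark_path("file:/tmp/my_csv.csv"))   # file:/tmp/my_csv.csv
```

This is why the dbfs:/ prefix is optional in the Spark SQL examples above, while file:/ is required to reach the Local File System.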
Lastly, another point of confusion is the use of dbfs:/ versus /dbfs in a file path. Both denote DBFS, but where each can be used is context-specific.
/dbfs is used with code where the default file system location is the Local File System. In this context, if we add dbfs:/ to a file path, Databricks will display a message suggesting we replace it with /dbfs.
dbfs:/, on the other hand, can be used for file paths in code where the default file system location is DBFS. Because DBFS is the default in this context, the prefix is optional. Some consider it good practice to always include dbfs:/ as it is more explicit about which file system is being accessed.
The key idea to remember is that the file path depends on the code being executed and the default file system location for that code. Once you know the default, it's a case of adding the proper prefix to the file path.