Read a CSV file into a Spark DataFrame

R/data_interface.R

spark_read_csv

Description

Read a tabular data file into a Spark DataFrame.

Usage

spark_read_csv( 
  sc, 
  name = NULL, 
  path = name, 
  header = TRUE, 
  columns = NULL, 
  infer_schema = is.null(columns), 
  delimiter = ",", 
  quote = "\"", 
  escape = "\\", 
  charset = "UTF-8", 
  null_value = NULL, 
  options = list(), 
  repartition = 0, 
  memory = TRUE, 
  overwrite = TRUE, 
  ... 
) 

Arguments

Arguments Description
sc A spark_connection.
name The name to assign to the newly generated table.
path The path to the file. Needs to be accessible from the cluster. Supports the "hdfs://", "s3a://" and "file://" protocols.
header Boolean; should the first row of data be used as a header? Defaults to TRUE.
columns A vector of column names or a named vector of column types. If specified, the elements can be "binary" for BinaryType, "boolean" for BooleanType, "byte" for ByteType, "integer" for IntegerType, "integer64" for LongType, "double" for DoubleType, "character" for StringType, "timestamp" for TimestampType and "date" for DateType.
infer_schema Boolean; should column types be automatically inferred? Requires one extra pass over the data. Defaults to is.null(columns).
delimiter The character used to delimit each column. Defaults to ','.
quote The character used as a quote. Defaults to '"'.
escape The character used to escape other characters. Defaults to '\'.
charset The character set. Defaults to "UTF-8".
null_value The character to use for null, or missing, values. Defaults to NULL.
options A list of strings with additional options.
repartition The number of partitions used to distribute the generated table. Use 0 (the default) to avoid partitioning.
memory Boolean; should the data be loaded eagerly into memory? (That is, should the table be cached?)
overwrite Boolean; overwrite the table with the given name if it already exists?
Optional arguments; currently unused.

Details

You can read data from HDFS (hdfs://), S3 (s3a://), as well as the local file system (file://).

If you are reading from a secure S3 bucket be sure to set the following in your spark-defaults.conf spark.hadoop.fs.s3a.access.key, spark.hadoop.fs.s3a.secret.key or any of the methods outlined in the aws-sdk documentation Working with AWS credentials

In order to work with the newer s3a:// protocol also set the values for spark.hadoop.fs.s3a.impl and spark.hadoop.fs.s3a.endpoint. In addition, to support v4 of the S3 api be sure to pass the -Dcom.amazonaws.services.s3.enableV4 driver options for the config key spark.driver.extraJavaOptions

For instructions on how to configure s3n:// check the hadoop documentation: s3n authentication properties

When header is FALSE, the column names are generated with a V prefix; e.g. V1, V2, ....

See Also

Other Spark serialization routines: collect_from_rds(), spark_insert_table(), spark_load_table(), spark_read_avro(), spark_read_binary(), spark_read_delta(), spark_read_image(), spark_read_jdbc(), spark_read_json(), spark_read_libsvm(), spark_read_orc(), spark_read_parquet(), spark_read_source(), spark_read_table(), spark_read_text(), spark_read(), spark_save_table(), spark_write_avro(), spark_write_csv(), spark_write_delta(), spark_write_jdbc(), spark_write_json(), spark_write_orc(), spark_write_parquet(), spark_write_source(), spark_write_table(), spark_write_text()