sparklyr
R Interface to Apache Spark
Spark Sessions
Function(s) | Description |
---|---|
spark_connect() spark_connection_is_open() spark_disconnect() spark_disconnect_all() spark_submit() | Manage Spark Connections |
spark_config() | Read Spark Configuration |
spark_install() spark_uninstall() spark_install_dir() spark_install_tar() spark_installed_versions() spark_available_versions() | Download and install various versions of Spark |
spark_log() | View Entries in the Spark Log |
spark_web() | Open the Spark web interface |
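A minimal session lifecycle, sketched for a local Spark installation (adjust `master` for your cluster):

```r
library(sparklyr)

spark_install()                 # one-time: download a local Spark distribution
sc <- spark_connect(master = "local")

spark_connection_is_open(sc)    # TRUE while the session is alive
spark_log(sc, n = 10)           # view the last 10 Spark log entries
spark_web(sc)                   # open the Spark web UI in a browser

spark_disconnect(sc)
```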
Spark Data
Function(s) | Description |
---|---|
spark_read() | Read file(s) into a Spark DataFrame using a custom reader |
spark_read_avro() | Read Apache Avro data into a Spark DataFrame. |
spark_read_binary() | Read binary data into a Spark DataFrame. |
spark_read_csv() | Read a CSV file into a Spark DataFrame |
spark_read_delta() | Read from Delta Lake into a Spark DataFrame. |
spark_read_image() | Read image data into a Spark DataFrame. |
spark_read_jdbc() | Read from JDBC connection into a Spark DataFrame. |
spark_read_json() | Read a JSON file into a Spark DataFrame |
spark_read_libsvm() | Read libsvm file into a Spark DataFrame. |
spark_read_orc() | Read an ORC file into a Spark DataFrame |
spark_read_parquet() | Read a Parquet file into a Spark DataFrame |
spark_read_source() | Read from a generic source into a Spark DataFrame. |
spark_read_table() | Reads from a Spark Table into a Spark DataFrame. |
spark_read_text() | Read a Text file into a Spark DataFrame |
spark_write() | Write Spark DataFrame to file using a custom writer |
spark_write_avro() | Serialize a Spark DataFrame into Apache Avro format |
spark_write_csv() | Write a Spark DataFrame to a CSV |
spark_write_delta() | Writes a Spark DataFrame into Delta Lake |
spark_write_jdbc() | Writes a Spark DataFrame into a JDBC table |
spark_write_json() | Write a Spark DataFrame to a JSON file |
spark_write_orc() | Write a Spark DataFrame to an ORC file |
spark_write_parquet() | Write a Spark DataFrame to a Parquet file |
spark_write_rds() | Write Spark DataFrame to RDS files |
spark_write_source() | Writes a Spark DataFrame into a generic source |
spark_write_table() | Writes a Spark DataFrame into a Spark table |
spark_write_text() | Write a Spark DataFrame to a Text file |
spark_insert_table() | Inserts a Spark DataFrame into a Spark table |
spark_save_table() | Saves a Spark DataFrame as a Spark table |
collect_from_rds() | Collect Spark data serialized in RDS format into R |
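A typical read/write round trip; the file paths are hypothetical:

```r
library(sparklyr)
sc <- spark_connect(master = "local")

# Read a CSV file into a Spark DataFrame
flights_tbl <- spark_read_csv(sc, name = "flights", path = "data/flights.csv")

# Write the same data back out in Parquet format
spark_write_parquet(flights_tbl, path = "data/flights_parquet")

spark_disconnect(sc)
```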
Spark Tables
Function(s) | Description |
---|---|
src_databases() | Show database list |
tbl_cache() | Cache a Spark Table |
tbl_change_db() | Use specific database |
tbl_uncache() | Uncache a Spark Table |
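Caching and database selection on an open connection `sc`; the table name is illustrative:

```r
src_databases(sc)             # list the databases Spark knows about
tbl_change_db(sc, "default")  # make "default" the active database
tbl_cache(sc, "flights")      # pin the "flights" table in memory
tbl_uncache(sc, "flights")    # release it again
```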
Spark DataFrames
Function(s) | Description |
---|---|
dplyr_hof | dplyr wrappers for Apache Spark higher order functions |
sdf_save_table() sdf_load_table() sdf_save_parquet() sdf_load_parquet() | Save / Load a Spark DataFrame |
sdf_predict() sdf_transform() sdf_fit() sdf_fit_and_transform() | Spark ML – Transform, fit, and predict methods (sdf_ interface) |
sdf_along() | Create DataFrame for along Object |
sdf_bind_rows() sdf_bind_cols() | Bind multiple Spark DataFrames by row and column |
sdf_broadcast() | Broadcast hint |
sdf_checkpoint() | Checkpoint a Spark DataFrame |
sdf_coalesce() | Coalesces a Spark DataFrame |
sdf_collect() | Collect a Spark DataFrame into R. |
sdf_copy_to() sdf_import() | Copy an Object into Spark |
sdf_crosstab() | Cross Tabulation |
sdf_debug_string() | Debug Info for Spark DataFrame |
sdf_describe() | Compute summary statistics for columns of a data frame |
sdf_dim() sdf_nrow() sdf_ncol() | Support for Dimension Operations |
sdf_distinct() | Invoke distinct on a Spark DataFrame |
sdf_drop_duplicates() | Remove duplicates from a Spark DataFrame |
sdf_expand_grid() | Create a Spark dataframe containing all combinations of inputs |
sdf_from_avro() | Convert column(s) from avro format |
sdf_is_streaming() | Spark DataFrame is Streaming |
sdf_last_index() | Returns the last index of a Spark DataFrame |
sdf_len() | Create DataFrame for Length |
sdf_num_partitions() | Gets number of partitions of a Spark DataFrame |
sdf_partition_sizes() | Compute the number of records within each partition of a Spark DataFrame |
sdf_persist() | Persist a Spark DataFrame |
sdf_pivot() | Pivot a Spark DataFrame |
sdf_project() | Project features onto principal components |
sdf_quantile() | Compute (Approximate) Quantiles with a Spark DataFrame |
sdf_random_split() sdf_partition() | Partition a Spark DataFrame |
sdf_rbeta() | Generate random samples from a Beta distribution |
sdf_rbinom() | Generate random samples from a binomial distribution |
sdf_rcauchy() | Generate random samples from a Cauchy distribution |
sdf_rchisq() | Generate random samples from a chi-squared distribution |
sdf_read_column() | Read a Column from a Spark DataFrame |
sdf_register() | Register a Spark DataFrame |
sdf_repartition() | Repartition a Spark DataFrame |
sdf_residuals() | Model Residuals |
sdf_rexp() | Generate random samples from an exponential distribution |
sdf_rgamma() | Generate random samples from a Gamma distribution |
sdf_rgeom() | Generate random samples from a geometric distribution |
sdf_rhyper() | Generate random samples from a hypergeometric distribution |
sdf_rlnorm() | Generate random samples from a log normal distribution |
sdf_rnorm() | Generate random samples from the standard normal distribution |
sdf_rpois() | Generate random samples from a Poisson distribution |
sdf_rt() | Generate random samples from a t-distribution |
sdf_runif() | Generate random samples from the uniform distribution U(0, 1). |
sdf_rweibull() | Generate random samples from a Weibull distribution. |
sdf_sample() | Randomly Sample Rows from a Spark DataFrame |
sdf_schema() | Read the Schema of a Spark DataFrame |
sdf_separate_column() | Separate a Vector Column into Scalar Columns |
sdf_seq() | Create DataFrame for Range |
sdf_sort() | Sort a Spark DataFrame |
sdf_sql() | Spark DataFrame from SQL |
sdf_to_avro() | Convert column(s) to avro format |
sdf_unnest_longer() | Unnest longer |
sdf_unnest_wider() | Unnest wider |
sdf_weighted_sample() | Perform Weighted Random Sampling on a Spark DataFrame |
sdf_with_sequential_id() | Add a Sequential ID Column to a Spark DataFrame |
sdf_with_unique_id() | Add a Unique ID Column to a Spark DataFrame |
hof_aggregate() | Apply Aggregate Function to Array Column |
hof_array_sort() | Sorts array using a custom comparator |
hof_exists() | Determine Whether Some Element Exists in an Array Column |
hof_filter() | Filter Array Column |
hof_forall() | Checks whether all elements in an array satisfy a predicate |
hof_map_filter() | Filters a map |
hof_map_zip_with() | Merges two maps into one |
hof_transform() | Transform Array Column |
hof_transform_keys() | Transforms keys of a map |
hof_transform_values() | Transforms values of a map |
hof_zip_with() | Combines 2 Array Columns |
transform_sdf() | Transform a subset of column(s) in a Spark DataFrame |
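A few of the sdf_* helpers in action, reusing the open connection `sc`:

```r
# Copy a local data frame into Spark
iris_tbl <- sdf_copy_to(sc, iris, overwrite = TRUE)

sdf_dim(iris_tbl)             # rows and columns
sdf_schema(iris_tbl)          # column names and Spark types
sdf_num_partitions(iris_tbl)  # current partition count

# Reproducible 80/20 train/test split
splits <- sdf_random_split(iris_tbl, training = 0.8, test = 0.2, seed = 1099)
splits$training
```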
Spark ML - Regression
Function(s) | Description |
---|---|
ml_linear_regression() | Spark ML – Linear Regression |
ml_aft_survival_regression() ml_survival_regression() | Spark ML – Survival Regression |
ml_isotonic_regression() | Spark ML – Isotonic Regression |
ml_generalized_linear_regression() | Spark ML – Generalized Linear Regression |
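A linear regression sketch on the `iris_tbl` created above (sparklyr replaces the dots in column names with underscores when copying data in):

```r
fit <- ml_linear_regression(iris_tbl, Petal_Width ~ Petal_Length)
summary(fit)
```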
Spark ML - Classification
Function(s) | Description |
---|---|
ml_naive_bayes() | Spark ML – Naive-Bayes |
ml_one_vs_rest() | Spark ML – OneVsRest |
ml_logistic_regression() | Spark ML – Logistic Regression |
ml_multilayer_perceptron_classifier() ml_multilayer_perceptron() | Spark ML – Multilayer Perceptron |
ml_linear_svc() | Spark ML – LinearSVC |
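A logistic regression sketch; `mtcars` has a binary `am` column that works as a label:

```r
mtcars_tbl <- sdf_copy_to(sc, mtcars, overwrite = TRUE)

fit <- ml_logistic_regression(mtcars_tbl, am ~ wt + hp)
ml_predict(fit, mtcars_tbl)
```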
Spark ML - Tree
Spark ML - Clustering
Function(s) | Description |
---|---|
ml_kmeans() ml_compute_cost() ml_compute_silhouette_measure() | Spark ML – K-Means Clustering |
ml_kmeans_cluster_eval | Evaluate a K-means clustering |
ml_bisecting_kmeans() | Spark ML – Bisecting K-Means Clustering |
ml_gaussian_mixture() | Spark ML – Gaussian Mixture clustering. |
ml_power_iteration() | Spark ML – Power Iteration Clustering |
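K-means on two iris features, reusing `iris_tbl`:

```r
kmeans_model <- ml_kmeans(iris_tbl, ~ Petal_Length + Petal_Width, k = 3)
ml_predict(kmeans_model, iris_tbl)
```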
Spark ML - Text
Function(s) | Description |
---|---|
ml_lda() ml_describe_topics() ml_log_likelihood() ml_log_perplexity() ml_topics_matrix() | Spark ML – Latent Dirichlet Allocation |
ml_chisquare_test() | Chi-square hypothesis testing for categorical data. |
ml_default_stop_words() | Default stop words |
ml_fpgrowth() ml_association_rules() ml_freq_itemsets() | Frequent Pattern Mining – FPGrowth |
ml_prefixspan() ml_freq_seq_patterns() | Frequent Pattern Mining – PrefixSpan |
ft_count_vectorizer() ml_vocabulary() | Feature Transformation – CountVectorizer (Estimator) |
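An LDA sketch over a toy two-document corpus; the column names and `k = 2` are illustrative:

```r
docs <- sdf_copy_to(
  sc,
  data.frame(text = c("spark r interface", "spark machine learning")),
  overwrite = TRUE
)

lda_model <- docs %>%
  ft_tokenizer("text", "tokens") %>%
  ft_count_vectorizer("tokens", "features") %>%
  ml_lda(k = 2)

ml_describe_topics(lda_model)
```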
Spark ML - Recommendations
Function(s) | Description |
---|---|
ml_als() ml_recommend() | Spark ML – ALS |
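A tiny ALS sketch; the rating triples are made up:

```r
ratings_tbl <- sdf_copy_to(
  sc,
  data.frame(user   = c(0L, 0L, 1L, 1L),
             item   = c(0L, 1L, 1L, 2L),
             rating = c(4, 3, 5, 1)),
  overwrite = TRUE
)

als_model <- ml_als(ratings_tbl, rating ~ user + item)
ml_recommend(als_model, type = "items", n = 1)   # top item per user
```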
Spark ML - Hyper-parameter tuning
Function(s) | Description |
---|---|
ml_sub_models() ml_validation_metrics() ml_cross_validator() ml_train_validation_split() | Spark ML – Tuning |
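A 3-fold cross-validation sketch over a small pipeline; the parameter grid is illustrative:

```r
pipeline <- ml_pipeline(sc) %>%
  ft_r_formula(am ~ wt + hp) %>%
  ml_logistic_regression()

cv <- ml_cross_validator(
  sc,
  estimator = pipeline,
  estimator_param_maps = list(
    logistic_regression = list(reg_param = c(0.01, 0.1))
  ),
  evaluator = ml_binary_classification_evaluator(sc),
  num_folds = 3
)

cv_model <- ml_fit(cv, mtcars_tbl)
ml_validation_metrics(cv_model)
```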
Spark ML - Evaluation
Function(s) | Description |
---|---|
ml_metrics_binary() | Extracts metrics from a fitted table |
ml_metrics_multiclass() | Extracts metrics from a fitted table |
ml_metrics_regression() | Extracts metrics from a fitted table |
ml_evaluate() | Evaluate the Model on a Validation Set |
ml_binary_classification_evaluator() ml_binary_classification_eval() ml_multiclass_classification_evaluator() ml_classification_eval() ml_regression_evaluator() | Spark ML - Evaluators |
ml_clustering_evaluator() | Spark ML - Clustering Evaluator |
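Scoring the logistic model from earlier; area under the ROC curve is the evaluator's default metric:

```r
pred <- ml_predict(fit, mtcars_tbl)
ml_binary_classification_evaluator(pred, label_col = "am")
```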
Spark ML - Operations
Spark Pipelines
Function(s) | Description |
---|---|
ml_pipeline() | Spark ML – Pipelines |
ml_stage() ml_stages() | Spark ML – Pipeline stage extraction |
ml_add_stage() | Add a Stage to a Pipeline |
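Assembling, inspecting, and fitting a pipeline, reusing `mtcars_tbl`:

```r
pipeline <- ml_pipeline(sc) %>%
  ft_binarizer("hp", "high_hp", threshold = 150) %>%
  ft_r_formula(am ~ wt + high_hp) %>%
  ml_logistic_regression()

ml_stages(pipeline)                     # inspect the stages
fitted <- ml_fit(pipeline, mtcars_tbl)
ml_transform(fitted, mtcars_tbl)
```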
Spark Feature Transformers
Function(s) | Description |
---|---|
ft_binarizer() | Feature Transformation – Binarizer (Transformer) |
ft_bucketizer() | Feature Transformation – Bucketizer (Transformer) |
ft_chisq_selector() | Feature Transformation – ChiSqSelector (Estimator) |
ft_count_vectorizer() ml_vocabulary() | Feature Transformation – CountVectorizer (Estimator) |
ft_dct() ft_discrete_cosine_transform() | Feature Transformation – Discrete Cosine Transform (DCT) (Transformer) |
ft_elementwise_product() | Feature Transformation – ElementwiseProduct (Transformer) |
ft_feature_hasher() | Feature Transformation – FeatureHasher (Transformer) |
ft_hashing_tf() | Feature Transformation – HashingTF (Transformer) |
ft_idf() | Feature Transformation – IDF (Estimator) |
ft_imputer() | Feature Transformation – Imputer (Estimator) |
ft_index_to_string() | Feature Transformation – IndexToString (Transformer) |
ft_interaction() | Feature Transformation – Interaction (Transformer) |
ft_bucketed_random_projection_lsh() ft_minhash_lsh() | Feature Transformation – LSH (Estimator) |
ml_approx_nearest_neighbors() ml_approx_similarity_join() | Utility functions for LSH models |
ft_max_abs_scaler() | Feature Transformation – MaxAbsScaler (Estimator) |
ft_min_max_scaler() | Feature Transformation – MinMaxScaler (Estimator) |
ft_ngram() | Feature Transformation – NGram (Transformer) |
ft_normalizer() | Feature Transformation – Normalizer (Transformer) |
ft_one_hot_encoder() | Feature Transformation – OneHotEncoder (Transformer) |
ft_one_hot_encoder_estimator() | Feature Transformation – OneHotEncoderEstimator (Estimator) |
ft_pca() ml_pca() | Feature Transformation – PCA (Estimator) |
ft_polynomial_expansion() | Feature Transformation – PolynomialExpansion (Transformer) |
ft_quantile_discretizer() | Feature Transformation – QuantileDiscretizer (Estimator) |
ft_r_formula() | Feature Transformation – RFormula (Estimator) |
ft_regex_tokenizer() | Feature Transformation – RegexTokenizer (Transformer) |
ft_robust_scaler() | Feature Transformation – RobustScaler (Estimator) |
ft_standard_scaler() | Feature Transformation – StandardScaler (Estimator) |
ft_stop_words_remover() | Feature Transformation – StopWordsRemover (Transformer) |
ft_string_indexer() ml_labels() ft_string_indexer_model() | Feature Transformation – StringIndexer (Estimator) |
ft_tokenizer() | Feature Transformation – Tokenizer (Transformer) |
ft_vector_assembler() | Feature Transformation – VectorAssembler (Transformer) |
ft_vector_indexer() | Feature Transformation – VectorIndexer (Estimator) |
ft_vector_slicer() | Feature Transformation – VectorSlicer (Transformer) |
ft_word2vec() ml_find_synonyms() | Feature Transformation – Word2Vec (Estimator) |
ft_sql_transformer() ft_dplyr_transformer() | Feature Transformation – SQLTransformer |
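Transformers apply eagerly to a tbl_spark and can also be appended to a pipeline; a quick chain on `mtcars_tbl`:

```r
mtcars_tbl %>%
  ft_binarizer("hp", "high_hp", threshold = 150) %>%
  ft_vector_assembler(c("wt", "high_hp"), "features")
```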
Extensions
Function(s) | Description |
---|---|
ml_supervised_pipeline() ml_clustering_pipeline() ml_construct_model_supervised() ml_construct_model_clustering() new_ml_model_prediction() new_ml_model() new_ml_model_classification() new_ml_model_regression() new_ml_model_clustering() | Constructors for ml_model Objects |
compile_package_jars() | Compile Scala sources into a Java Archive (jar) |
connection_config() | Read configuration values for a connection |
download_scalac() | Downloads default Scala Compilers |
find_scalac() | Discover the Scala Compiler |
spark_context() java_context() hive_context() spark_session() | Access the Spark API |
hive_context_config() | Runtime configuration interface for Hive |
invoke() invoke_static() invoke_new() | Invoke a Method on a JVM Object |
j_invoke() j_invoke_static() j_invoke_new() | Invoke a Java function. |
jarray() | Instantiate a Java array with a specific element type. |
jfloat() | Instantiate a Java float type. |
jfloat_array() | Instantiate an Array[Float]. |
register_extension() registered_extensions() | Register a Package that Implements a Spark Extension |
spark_compilation_spec() | Define a Spark Compilation Specification |
spark_default_compilation_spec() | Default Compilation Specification for Spark Extensions |
spark_context_config() | Runtime configuration interface for the Spark Context. |
spark_dataframe() | Retrieve a Spark DataFrame |
spark_dependency() | Define a Spark dependency |
spark_home_set() | Set the SPARK_HOME environment variable |
spark_jobj() | Retrieve a Spark JVM Object Reference |
spark_version() | Get the Spark Version Associated with a Spark Connection |
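Direct JVM calls through the invoke() family, assuming an open connection `sc`:

```r
# Construct a JVM object and call a method on it
big <- invoke_new(sc, "java.math.BigInteger", "1000000000")
invoke(big, "longValue")                            # 1e9 back in R

# Call a static method
invoke_static(sc, "java.lang.Math", "hypot", 3, 4)  # 5
```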
Distributed Computing
Function(s) | Description |
---|---|
spark_apply() | Apply an R Function in Spark |
spark_apply_bundle() | Create Bundle for Spark Apply |
spark_apply_log() | Log Writer for Spark Apply |
registerDoSpark() | Register a Parallel Backend |
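spark_apply() ships an R closure to each partition; registerDoSpark() turns the same machinery into a foreach backend:

```r
library(foreach)

# Apply an R function to every partition of a Spark DataFrame
sdf_len(sc, 10, repartition = 2) %>%
  spark_apply(function(df) df * 2)

# Use Spark as a parallel backend for foreach
registerDoSpark(sc)
foreach(i = 1:3) %dopar% sqrt(i)
```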
Livy
Function(s) | Description |
---|---|
livy_config() | Create a Spark Configuration for Livy |
livy_service_start() livy_service_stop() | Start Livy |
Streaming
dplyr integration
Function(s) | Description |
---|---|
copy_to() | Copy an R Data Frame to Spark |
distinct | Distinct |
filter | Filter |
full_join | Full join |
inner_join | Inner join |
inner_join() left_join() right_join() full_join() | Join Spark tbls. |
left_join | Left join |
mutate | Mutate |
right_join | Right join |
select | Select |
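The usual dplyr verbs translate to Spark SQL; this sketch reuses the hypothetical `flights_tbl` from the Spark Data example:

```r
library(dplyr)

flights_tbl %>%
  filter(distance > 1000) %>%
  mutate(speed = distance / air_time * 60) %>%
  select(carrier, speed)
```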
tidyr integration
Function(s) | Description |
---|---|
pivot_longer | Pivot longer |
pivot_wider | Pivot wider |
fill | Fill |
na.replace() | Replace Missing Values in Objects |
nest | Nest |
replace_na | Replace NA |
separate | Separate |
unite | Unite |
unnest | Unnest |
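tidyr verbs work on tbl_spark objects too; a small pivot sketch on toy data:

```r
library(tidyr)

wide_tbl <- sdf_copy_to(sc, data.frame(id = 1:2, a = c(1, 2), b = c(3, 4)),
                        overwrite = TRUE)
pivot_longer(wide_tbl, cols = c(a, b), names_to = "key", values_to = "value")
```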
tidymodels integration
Spark Operations
Function(s) | Description |
---|---|
get_spark_sql_catalog_implementation() | Retrieve the Spark connection’s SQL catalog implementation property |
connection_is_open() | Check whether the connection is open |
connection_spark_shinyapp() | A Shiny app that can be used to construct a spark_connect statement |
spark_session_config() | Runtime configuration interface for the Spark Session |
spark_set_checkpoint_dir() spark_get_checkpoint_dir() | Set/Get Spark checkpoint directory |
spark_connect() spark_connection_is_open() spark_disconnect() spark_disconnect_all() spark_submit() | Manage Spark Connections |
spark_table_name() | Generate a Table Name from Expression |
spark_install() spark_uninstall() spark_install_dir() spark_install_tar() spark_installed_versions() spark_available_versions() | Download and install various versions of Spark |
spark_version_from_home() | Get the Spark Version Associated with a Spark Installation |
spark_versions() | Retrieves a data frame of available Spark versions that can be installed. |
spark_config_kubernetes() | Kubernetes Configuration |
spark_config_settings() | Retrieve Available Settings |
spark_connection_find() | Find Spark Connection |
spark_dependency_fallback() | Fallback to Spark Dependency |
spark_extension() | Create Spark Extension |
spark_load_table() | Reads from a Spark Table into a Spark DataFrame. |
list_sparklyr_jars() | List all sparklyr-*.jar files that have been built |
spark_config_packages() | Creates Spark Configuration |
spark_connection() | Retrieve the Spark Connection Associated with an R Object |
spark_adaptive_query_execution() | Retrieves or sets status of Spark AQE |
spark_advisory_shuffle_partition_size() | Retrieves or sets advisory size of the shuffle partition |
spark_auto_broadcast_join_threshold() | Retrieves or sets the auto broadcast join threshold |
spark_coalesce_initial_num_partitions() | Retrieves or sets initial number of shuffle partitions before coalescing |
spark_coalesce_min_num_partitions() | Retrieves or sets the minimum number of shuffle partitions after coalescing |
spark_coalesce_shuffle_partitions() | Retrieves or sets whether coalescing contiguous shuffle partitions is enabled |
spark_connection-class | spark_connection class |
spark_jobj-class | spark_jobj class |
sparklyr_get_backend_port() | Return the port number of a sparklyr backend. |
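Inspecting and tuning a live connection; the checkpoint path is illustrative:

```r
spark_session_config(sc)                          # current session settings
spark_set_checkpoint_dir(sc, "/tmp/checkpoints")
spark_get_checkpoint_dir(sc)
spark_adaptive_query_execution(sc)                # is AQE enabled?
```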
Other
Function(s) | Description |
---|---|
spark_statistical_routines | Generate random samples from some distribution |
ensure | Enforce Specific Structure for R Objects |
random_string() | Random string generation |
%->% | Infix operator for composing a lambda expression |
[ | Subsetting operator for Spark dataframe |
generic_call_interface | Generic Call Interface |