library(sparklyr)
library(dplyr)
<- spark_connect(master = "local")
sc <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
iris_tbl
# aggregating by mean
%>%
iris_tbl mutate(Petal_Width = ifelse(Petal_Width > 1.5, "High", "Low")) %>%
sdf_pivot(Petal_Width ~ Species,
fun.aggregate = list(Petal_Length = "mean")
) #> # Source: spark<?> [?? x 4]
#> Petal_Width setosa versicolor virginica
#> <chr> <dbl> <dbl> <dbl>
#> 1 Low 1.46 4.20 5.23
#> 2 High NA 4.82 5.57
# aggregating all observations in a list
%>%
iris_tbl mutate(Petal_Width = ifelse(Petal_Width > 1.5, "High", "Low")) %>%
sdf_pivot(Petal_Width ~ Species,
fun.aggregate = list(Petal_Length = "collect_list")
) #> # Source: spark<?> [?? x 4]
#> Petal_Width setosa versicolor virginica
#> <chr> <list> <list> <list>
#> 1 Low <dbl [50]> <dbl [45]> <dbl [3]>
#> 2 High <dbl [0]> <dbl [5]> <dbl [47]>
Pivot a Spark DataFrame
R/sdf_wrapper.R
sdf_pivot
Description
Construct a pivot table over a Spark Dataframe, using a syntax similar to that from reshape2::dcast
.
Usage
sdf_pivot(x, formula, fun.aggregate = "count")
Arguments
Arguments | Description |
---|---|
x | A spark_connection , ml_pipeline , or a tbl_spark . |
formula | A two-sided R formula of the form x_1 + x_2 + ... ~ y_1 . The left-hand side of the formula indicates which variables are used for grouping, and the right-hand side indicates which variable is used for pivoting. Currently, only a single pivot column is supported. |
fun.aggregate | How should the grouped dataset be aggregated? Can be a length-one character vector, giving the name of a Spark aggregation function to be called; a named R list mapping column names to an aggregation method, or an R function that is invoked on the grouped dataset. |