R - How to replicate rows in a Spark dataframe using sparklyr

Is there a way to replicate the rows of a Spark dataframe using the functions of sparklyr/dplyr?
sc <- spark_connect(master = "spark://####:7077")
df_tbl <- copy_to(sc, data.frame(row1 = 1:3, row2 = LETTERS[1:3]), "df")
This is the desired output, saved into a new spark tbl:
> df2_tbl
row1 row2
<int> <chr>
1 1 A
2 1 A
3 1 A
4 2 B
5 2 B
6 2 B
7 3 C
8 3 C
9 3 C

With sparklyr you can use array and explode as suggested by @Oli:
df_tbl %>%
mutate(arr = explode(array(1, 1, 1))) %>%
select(-arr)
# # Source: lazy query [?? x 2]
# # Database: spark_connection
# row1 row2
# <int> <chr>
# 1 1 A
# 2 1 A
# 3 1 A
# 4 2 B
# 5 2 B
# 6 2 B
# 7 3 C
# 8 3 C
# 9 3 C
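and, generalized to an arbitrary number of copies:
library(rlang)
n <- 3  # number of copies of each row
df_tbl %>%
  # parse_expr() is used here because newer rlang versions require an explicit env for parse_quo()
  mutate(arr = !!rlang::parse_expr(
    paste("explode(array(", paste(rep(1, n), collapse = ","), "))")
  )) %>% select(-arr)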
# # Source: lazy query [?? x 2]
# # Database: spark_connection
# row1 row2
# <int> <chr>
# 1 1 A
# 2 1 A
# 3 1 A
# 4 2 B
# 5 2 B
# 6 2 B
# 7 3 C
# 8 3 C
# 9 3 C
where the number of copies of each row is easy to adjust.
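The string-building step of the generalized version can be isolated and checked locally, without a Spark connection. The sketch below assumes a hypothetical helper name, explode_expr, which is not part of sparklyr or rlang; it only builds the Spark SQL expression text that is then parsed and spliced into mutate() as in the answer above.

```r
# Hypothetical helper (not part of sparklyr): build the Spark SQL
# expression that explodes an array of n identical elements.
explode_expr <- function(n) {
  stopifnot(is.numeric(n), length(n) == 1, n >= 1)
  paste0("explode(array(", paste(rep(1, n), collapse = ","), "))")
}

# The expression for three copies of each row
expr_3 <- explode_expr(3)
print(expr_3)
```

Keeping the string construction in one small function makes it easy to see exactly what is sent to Spark before it is parsed.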

The idea that comes to mind first is to use the explode function (it is exactly what it is meant for in Spark). Yet arrays do not seem to be supported in SparkR (to the best of my knowledge).
> structField("a", "array")
Error in checkType(type) : Unsupported type for SparkDataframe: array
I can however propose two other methods:
A straightforward but not very elegant one:
head(rbind(df, df, df), n=30)
# row1 row2
# 1 1 A
# 2 2 B
# 3 3 C
# 4 1 A
# 5 2 B
# 6 3 C
# 7 1 A
# 8 2 B
# 9 3 C
Or with a for loop, for more generality:
df2 = df
for(i in 1:2) df2=rbind(df, df2)
Note that this would also work with union.
The second, more elegant method (because it involves only one Spark operation) is based on a cross join (Cartesian product) with a dataframe of size 3 (or any other number):
j <- as.DataFrame(data.frame(s=1:3))
head(drop(crossJoin(df, j), "s"), n=100)
# row1 row2
# 1 1 A
# 2 1 A
# 3 1 A
# 4 2 B
# 5 2 B
# 6 2 B
# 7 3 C
# 8 3 C
# 9 3 C
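The cross-join idea can be tried on a local data frame first, which may help when checking the expected row count. This is only a base R sketch of the same technique, not SparkR code: merge() with by = NULL returns the Cartesian product of its two inputs, and the explicit ordering step is there because the product is not returned grouped by the original rows.

```r
# Local (non-Spark) illustration of row replication through a cross join
df <- data.frame(row1 = 1:3, row2 = LETTERS[1:3], stringsAsFactors = FALSE)
j  <- data.frame(s = 1:3)            # one row per desired copy

crossed <- merge(df, j, by = NULL)   # Cartesian product: 3 x 3 = 9 rows

# Drop the helper column and group the copies of each original row together
crossed <- crossed[order(crossed$row1), c("row1", "row2")]
rownames(crossed) <- NULL
print(crossed)
```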

I am not aware of a cluster-side version of R's rep function. We can, however, use a join to emulate it cluster-side.
df_tbl <- copy_to(sc, data.frame(row1 = 1:3, row2 = LETTERS[1:3]), "df")
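replyr <- function(data, n, sc){
  # one-column helper table with n identical keys, copied to the cluster
  joiner_frame <- copy_to(sc, data.frame(joiner_index = rep(1,n)), "tmp_joining_frame", overwrite = TRUE)
  data %>%
    mutate(joiner_index = 1) %>%
    # every row matches all n helper rows, so each row is repeated n times
    left_join(joiner_frame, by = "joiner_index") %>%
    select(-joiner_index)
}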
df_tbl2 <- replyr(df_tbl, 3, sc)
# row1 row2
# <int> <chr>
# 1 1 A
# 2 1 A
# 3 1 A
# 4 2 B
# 5 2 B
# 6 2 B
# 7 3 C
# 8 3 C
# 9 3 C
It gets the job done, but it is a bit dirty since the tmp_joining_frame will persist. I'm not sure how well this will work given lazy evaluation on multiple calls to the function.
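For comparison, this is the local behaviour that the join is emulating. The sketch below is plain base R on an in-memory data frame, so it does not touch Spark; it is only meant as a reference result to compare the cluster-side output against.

```r
# Local reference: repeat each row n times by indexing with rep()
df <- data.frame(row1 = 1:3, row2 = LETTERS[1:3], stringsAsFactors = FALSE)
n  <- 3

df2 <- df[rep(seq_len(nrow(df)), each = n), ]
rownames(df2) <- NULL
print(df2)
```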

Related

Replacement of column values based on a named vector

Consider the following named vector vec and tibble df:
vec <- c("1" = "a", "2" = "b", "3" = "c")
df <- tibble(col = rep(1:3, c(4, 2, 5)))
df
# # A tibble: 11 x 1
# col
# <int>
# 1 1
# 2 1
# 3 1
# 4 1
# 5 2
# 6 2
# 7 3
# 8 3
# 9 3
# 10 3
# 11 3
I would like to replace the values in the col column with the corresponding named values in vec.
I'm looking for a tidyverse approach, that doesn't involve converting vec as a tibble.
I tried the following, without success:
df %>%
mutate(col = map(
vec,
~ str_replace(col, names(.x), .x)
))
Expected output:
# A tibble: 11 x 1
col
<chr>
1 a
2 a
3 a
4 a
5 b
6 b
7 c
8 c
9 c
10 c
11 c
You could index vec by the values of col, converted to character so that they match the names:
df$col1 <- vec[as.character(df$col)]
Or in mutate:
library(dplyr)
df %>% mutate(col1 = vec[as.character(col)])
# col col1
# <int> <chr>
# 1 1 a
# 2 1 a
# 3 1 a
# 4 1 a
# 5 2 b
# 6 2 b
# 7 3 c
# 8 3 c
# 9 3 c
#10 3 c
#11 3 c
We can also use data.table
library(data.table)
setDT(df)[, col1 := vec[as.character(col)]]
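One detail of the lookup is worth seeing in isolation: indexing a named vector by name returns NA for keys that are absent, and the result carries the names along. A minimal base R sketch, where the extra value 4 is added purely for illustration:

```r
vec <- c("1" = "a", "2" = "b", "3" = "c")
col <- c(1L, 1L, 2L, 3L, 4L)   # 4 has no entry in vec (illustrative)

# Look up by name; unname() drops the names carried over from vec
replaced <- unname(vec[as.character(col)])
print(replaced)
```

Values without a match come back as NA, so it may be worth checking with anyNA() before overwriting the original column.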

How should a function be applied by row on a dataframe to generate a new or expanded dataframe in r

I am trying to expand an existing dataset, which currently looks like this:
df <- tibble(
site = letters[1:3],
years = rep(4, 3),
tr = c(3, 6, 4)
)
tr is the total number of replicates for each site/year combination. I simply want to add in the replicates and later the response variable for each replicate. This was easy for a single site/year combination using the following function:
f <- function(site=NULL, years=NULL, t=NULL){
df <- tibble(
site = rep(site, each = t, times= years),
tr = rep(1:t, times = years),
year = rep(1:years, each = t)
)
df
}
# For one site:
f(site='a', years=4, t=3)
# Producing this:
# # A tibble: 12 x 3
# site tr year
# <chr> <int> <int>
# 1 a 1 1
# 2 a 2 1
# 3 a 3 1
# 4 a 1 2
# 5 a 2 2
# 6 a 3 2
# 7 a 1 3
# 8 a 2 3
# 9 a 3 3
# 10 a 1 4
# 11 a 2 4
# 12 a 3 4
How can the function be applied to each row of the input dataframe to produce the final dataframe? One of the apply functions in base R or pmap_df() in the purrr package would seem ideal, but being unfamiliar with how these functions work, all my efforts have only produced errors.
If we want to apply the same function, use pmap
library(purrr)
pmap_dfr(df, ~ f(..1, ..2, ..3))
# A tibble: 52 x 3
# site tr year
# * <chr> <int> <int>
# 1 a 1 1
# 2 a 2 1
# 3 a 3 1
# 4 a 1 2
# 5 a 2 2
# 6 a 3 2
# 7 a 1 3
# 8 a 2 3
# 9 a 3 3
#10 a 1 4
# … with 42 more rows
Another option is condense from the devel version of dplyr:
library(tidyr)
df %>%
group_by(rn = row_number()) %>%
condense(out = f(site, years, tr)) %>%
unnest(c(out))
Or in base R, we can also use do.call with Map
do.call(rbind, do.call(Map, c(f, unname(as.data.frame(df)))))
Well, in base R you could do:
do.call(rbind,do.call(Vectorize(f,SIMPLIFY = FALSE),unname(df)))
# A tibble: 52 x 3
site tr year
* <chr> <int> <int>
1 a 1 1
2 a 2 1
3 a 3 1
4 a 1 2
5 a 2 2
6 a 3 2
7 a 1 3
8 a 2 3
9 a 3 3
10 a 1 4
# ... with 42 more rows
do.call(rbind, lapply(split(df, df$site), function(x){
with(x, data.frame(site,
years = rep(sequence(years), each = tr),
tr = rep(sequence(tr), years)))
}))
We can use Map to apply f to every value of site, years and tr.
do.call(rbind, Map(f, df$site, df$years, df$tr))
# A tibble: 52 x 3
# site tr year
# * <chr> <int> <int>
# 1 a 1 1
# 2 a 2 1
# 3 a 3 1
# 4 a 1 2
# 5 a 2 2
# 6 a 3 2
# 7 a 1 3
# 8 a 2 3
# 9 a 3 3
#10 a 1 4
# … with 42 more rows
Akrun's answer worked well for me, so I modified it to make the function to be applied to each row of the dataframe a little more explicit:
df1 <- pmap_df(df, function(site, years, tr){
site = rep(site, each = tr, times=years)
year = rep(1:years, each = tr)
tr = rep(1:tr, times=years)
return(tibble(site, year, tr))
})
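The row count of 52 reported in the outputs above can be checked by hand: each input row expands to years * tr rows, so the total is 4*3 + 4*6 + 4*4 = 52. The sketch below reproduces this with base R only; expand_row is a hypothetical name for a data.frame version of the per-row function, used here so the example runs without the tidyverse.

```r
df <- data.frame(site = letters[1:3], years = rep(4, 3), tr = c(3, 6, 4),
                 stringsAsFactors = FALSE)

# Hypothetical base R version of the per-row expansion
expand_row <- function(site, years, tr) {
  data.frame(site = rep(site, times = years * tr),
             year = rep(seq_len(years), each = tr),
             tr   = rep(seq_len(tr), times = years),
             stringsAsFactors = FALSE)
}

# Apply to every row in parallel over the three columns, then stack
out <- do.call(rbind, Map(expand_row, df$site, df$years, df$tr))
rownames(out) <- NULL

# Each row contributes years * tr rows
print(nrow(out) == sum(df$years * df$tr))
```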

Add together 2 dataframes in R without losing columns

I have 2 data frames in R (df1, df2). df1 is
A C D
1 1 1
2 2 2
and df2 is
A B C
1 1 1
2 2 2
How can I merge these 2 dataframes to produce the following output?
A B C D
2 1 2 1
4 2 4 2
Columns are sorted and column values are added. Both DFs have same number of rows. Thank you in advance.
Code to create DF:
df1 <- data.frame("A" = 1:2, "C" = 1:2, "D" = 1:2)
df2 <- data.frame("A" = 1:2, "B" = 1:2, "C" = 1:2)
nm1 = names(df1)
nm2 = names(df2)
nm = intersect(nm1, nm2)
if (length(nm) == 0){ # if no column names in common
cbind(df1, df2)
} else { # if column names in common
cbind(df1[!nm1 %in% nm2], # columns only in df1
df1[nm] + df2[nm], # add columns common to both
df2[!nm2 %in% nm1]) # columns only in df2
}
# D A C B
#1 1 2 2 1
#2 2 4 4 2
You can try:
library(tidyverse)
list(df2, df1) %>%
map(rownames_to_column) %>%
bind_rows %>%
group_by(rowname) %>%
summarise_all(sum, na.rm = TRUE)
# A tibble: 2 x 5
rowname A B C D
<chr> <int> <int> <int> <int>
1 1 2 1 2 1
2 2 4 2 4 2
By using left_join() from dplyr you won't lose the column
library(tidyverse)
dat1 <- tibble(a = 1:10,
b = 1:10,
c = 1:10)
dat2 <- tibble(c = 1:10,
d = 1:10,
e = 1:10)
left_join(dat1, dat2, by = "c")
#> # A tibble: 10 x 5
#> a b c d e
#> <int> <int> <int> <int> <int>
#> 1 1 1 1 1 1
#> 2 2 2 2 2 2
#> 3 3 3 3 3 3
#> 4 4 4 4 4 4
#> 5 5 5 5 5 5
#> 6 6 6 6 6 6
#> 7 7 7 7 7 7
#> 8 8 8 8 8 8
#> 9 9 9 9 9 9
#> 10 10 10 10 10 10
Created on 2019-01-16 by the reprex package (v0.2.1)
allnames <- sort(unique(c(names(df1), names(df2))))
df3 <- data.frame(matrix(0, nrow = nrow(df1), ncol = length(allnames)))
names(df3) <- allnames
df3[,allnames %in% names(df1)] <- df3[,allnames %in% names(df1)] + df1
df3[,allnames %in% names(df2)] <- df3[,allnames %in% names(df2)] + df2
df3
A B C D
1 2 1 2 1
2 4 2 4 2
Here is a fun base R method with Reduce.
Reduce(cbind,
list(Reduce("+", list(df1[intersect(names(df1), names(df2))],
df2[intersect(names(df1), names(df2))])), # sum results
df1[setdiff(names(df1), names(df2))], # in df1, not df2
df2[setdiff(names(df2), names(df1))])) # in df2, not df1
This returns
A C D B
1 2 2 1 1
2 4 4 2 2
This assumes that both df1 and df2 have columns that are not present in the other. If this is not true, you'd have to adjust the list.
Note also that you could replace Reduce with do.call in both places and you'd get the same result.
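If one of the data frames might have no columns of its own, a version that loops over the union of column names avoids that assumption. This is a sketch only; add_frames is a hypothetical helper name, and it treats a column missing from one data frame as zero.

```r
df1 <- data.frame(A = 1:2, C = 1:2, D = 1:2)
df2 <- data.frame(A = 1:2, B = 1:2, C = 1:2)

# Hypothetical helper: sum two data frames column by column,
# treating a column that is absent from one of them as 0
add_frames <- function(x, y) {
  all_names <- sort(union(names(x), names(y)))
  out <- lapply(all_names, function(nm) {
    xv <- if (nm %in% names(x)) x[[nm]] else 0
    yv <- if (nm %in% names(y)) y[[nm]] else 0
    xv + yv
  })
  names(out) <- all_names
  as.data.frame(out)
}

res <- add_frames(df1, df2)
print(res)
```

Because the loop runs over sorted names, the columns come out in alphabetical order, as in the desired output.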

bind_rows to each group of tibble

Consider the following two tibbles:
library(tidyverse)
a <- tibble(time = c(-1, 0), value = c(100, 200))
b <- tibble(id = rep(letters[1:2], each = 3), time = rep(1:3, 2), value = 1:6)
So a and b have the same columns and b has an additional column called id.
I want to do the following: group b by id and then add tibble a on top of each group.
So the output should look like this:
# A tibble: 10 x 3
id time value
<chr> <int> <int>
1 a -1 100
2 a 0 200
3 a 1 1
4 a 2 2
5 a 3 3
6 b -1 100
7 b 0 200
8 b 1 4
9 b 2 5
10 b 3 6
Of course there are multiple workarounds to achieve this (like loops for example). But in my case I have a large number of IDs and a very large number of columns.
I would be thankful if anyone could point me towards the direction of a solution within the tidyverse.
Thank you
We can expand the data frame a with id from b and then bind_rows them together.
library(tidyverse)
a2 <- expand(a, id = b$id, nesting(time, value))
b2 <- bind_rows(a2, b) %>% arrange(id, time)
b2
# # A tibble: 10 x 3
# id time value
# <chr> <dbl> <dbl>
# 1 a -1 100
# 2 a 0 200
# 3 a 1 1
# 4 a 2 2
# 5 a 3 3
# 6 b -1 100
# 7 b 0 200
# 8 b 1 4
# 9 b 2 5
# 10 b 3 6
split from base R will divide a data frame into a list of subsets based on an index.
b %>%
split(b[["id"]]) %>%
lapply(bind_rows, a) %>%
lapply(select, -"id") %>%
bind_rows(.id = "id")
# # A tibble: 10 x 3
# id time value
# <chr> <dbl> <dbl>
# 1 a 1 1
# 2 a 2 2
# 3 a 3 3
# 4 a -1 100
# 5 a 0 200
# 6 b 1 4
# 7 b 2 5
# 8 b 3 6
# 9 b -1 100
# 10 b 0 200
An idea (via base R) is to split your data frame and create a new one with id + the other data frame and rbind, i.e.
df = do.call(rbind, lapply(split(b, b$id), function(i)rbind(data.frame(id = i$id[1], a), i)))
which gives
id time value
a.1 a -1 100
a.2 a 0 200
a.3 a 1 1
a.4 a 2 2
a.5 a 3 3
b.1 b -1 100
b.2 b 0 200
b.3 b 1 4
b.4 b 2 5
b.5 b 3 6
NOTE: You can remove the rownames by simply calling rownames(df) <- NULL
We can nest and add the relevant rows to each nested item :
library(tidyverse)
b %>%
  nest(data = -id) %>%
  mutate(data = map(data, ~ bind_rows(a, .x))) %>%
  unnest(data)
# # A tibble: 10 x 3
# id time value
# <chr> <dbl> <dbl>
# 1 a -1 100
# 2 a 0 200
# 3 a 1 1
# 4 a 2 2
# 5 a 3 3
# 6 b -1 100
# 7 b 0 200
# 8 b 1 4
# 9 b 2 5
# 10 b 3 6
Maybe not the most efficient way, but easy to follow:
library(tidyverse)
a <- tibble(time = c(-1, 0), value = c(100, 200))
b <- tibble(id = rep(letters[1:2], each = 3), time = rep(1:3, 2), value = 1:6)
# nrow(a), not length(a): length() of a tibble counts columns, not rows
a.a <- a %>% add_column(id = rep("a", nrow(a)))
a.b <- a %>% add_column(id = rep("b", nrow(a)))
joint <- bind_rows(b,a.a,a.b)
(joint <- arrange(joint,id))
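With many ids, writing one copy of a per id by hand does not scale. The same idea can be sketched in base R without naming the ids explicitly: repeat a once per distinct id, attach the id, and stack the result on b. The variable names a_rep and out are illustrative.

```r
a <- data.frame(time = c(-1, 0), value = c(100, 200))
b <- data.frame(id = rep(letters[1:2], each = 3), time = rep(1:3, 2),
                value = 1:6, stringsAsFactors = FALSE)

ids <- unique(b$id)

# One copy of a per id: rows of a repeated as a block, ids repeated per row
a_rep <- data.frame(id = rep(ids, each = nrow(a)),
                    a[rep(seq_len(nrow(a)), times = length(ids)), ],
                    stringsAsFactors = FALSE)

out <- rbind(a_rep, b)
out <- out[order(out$id, out$time), ]
rownames(out) <- NULL
print(out)
```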

Eliminate factors contributing less

There are hundreds of levels in a column and not all of them really add value: about 60% of the levels account for <80% (they do not occur many times in the dataframe) and are also not expected to influence the outcome. The objective is to eliminate those levels that do not contribute more than 80%.
Could someone help? Thanks in advance
Here is a simple process that finds the values falling outside the first 80% of the rows (by cumulative share) and groups them together under a new value. This process uses a character column and not a factor column.
library(dplyr)
# example dataset
dt = data.frame(type = c("A","A","A","B","B","B","c","D"),
                value = 1:8, stringsAsFactors = FALSE)
dt
# type value
# 1 A 1
# 2 A 2
# 3 A 3
# 4 B 4
# 5 B 5
# 6 B 6
# 7 c 7
# 8 D 8
# count number of rows for each type
dt %>% count(type)
# # A tibble: 4 x 2
# type n
# <chr> <int>
# 1 A 3
# 2 B 3
# 3 c 1
# 4 D 1
# add cumulative percentages (sort by frequency first so the cumulative share is meaningful)
dt %>%
  count(type, sort = TRUE) %>%
  mutate(Prc = n/sum(n),
         CumPrc = cumsum(Prc))
# # A tibble: 4 x 4
# type n Prc CumPrc
# <chr> <int> <dbl> <dbl>
# 1 A 3 0.375 0.375
# 2 B 3 0.375 0.750
# 3 c 1 0.125 0.875
# 4 D 1 0.125 1.000
# pick the types you want to group together (most frequent types first)
dt %>%
  count(type, sort = TRUE) %>%
  mutate(Prc = n/sum(n),
         CumPrc = cumsum(Prc)) %>%
  filter(CumPrc > 0.80) %>%
  pull(type) -> types_to_group
# group them
dt %>% mutate(type_upd = ifelse(type %in% types_to_group, "Rest", type))
# type value type_upd
# 1 A 1 A
# 2 A 2 A
# 3 A 3 A
# 4 B 4 B
# 5 B 5 B
# 6 B 6 B
# 7 c 7 Rest
# 8 D 8 Rest
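The same steps can be sketched in base R, which makes the ordering requirement explicit: the cumulative share is only meaningful if the levels are sorted from most to least frequent first. The names counts, cum_share and rare are illustrative, and the 0.80 threshold is the one used above.

```r
type <- c("A", "A", "A", "B", "B", "B", "c", "D")

# Frequency of each level, most frequent first
counts <- sort(table(type), decreasing = TRUE)

# Cumulative share of rows covered as levels are added in that order
cum_share <- cumsum(counts) / sum(counts)

# Levels that only appear once the cumulative share has passed 80%
rare <- names(cum_share)[cum_share > 0.80]

type_upd <- ifelse(type %in% rare, "Rest", type)
print(type_upd)
```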
