Run length ID in sparklyr


data.table provides a rleid function which I find invaluable - it acts as a ticker when a watched variable(s) changes, ordered by some other variable(s).
library(dplyr)
tbl = tibble(time = as.integer(c(1, 2, 3, 4, 5, 6, 7, 8)),
var = c("A", "A", "A", "B", "B", "A", "A", "A"))
> tbl
# A tibble: 8 × 2
time var
<int> <chr>
1 1 A
2 2 A
3 3 A
4 4 B
5 5 B
6 6 A
7 7 A
8 8 A
Desired result is
> tbl %>% mutate(rleid = data.table::rleid(var))
# A tibble: 8 × 3
time var rleid
<int> <chr> <int>
1 1 A 1
2 2 A 1
3 3 A 1
4 4 B 2
5 5 B 2
6 6 A 3
7 7 A 3
8 8 A 3
I was wondering if I could reproduce something similar using the tools provided by sparklyr. When testing, I found the best I could do was get to the point at which I needed to do a fill forward, but then couldn't achieve that.
library(sparklyr)
spark_install(version = "2.0.2")
sc <- spark_connect(master = "local",
spark_home = spark_home_dir())
spk_tbl = copy_to(sc, tbl, overwrite = TRUE)
spk_tbl %>%
mutate(var2 = (var != lag(var, 1L, order = time))) %>% # Thanks #JaimeCaffarel
mutate(var3 = if(var2) { paste0(time, var) } else { NA })
Source: query [8 x 4]
Database: spark connection master=local[4] app=sparklyr local=TRUE
time var var2 var3
<int> <chr> <lgl> <chr>
1 1 A TRUE 1A
2 2 A FALSE <NA>
3 3 A FALSE <NA>
4 4 B TRUE 4B
5 5 B FALSE <NA>
6 6 A TRUE 6A
7 7 A FALSE <NA>
8 8 A FALSE <NA>
I've tried using SparkR, however I much prefer the sparklyr interface and its ease of use, so I'd ideally be able to do this in Spark SQL.
I can of course, already do this by partitioning the data into small enough chunks, collecting it, running a function and sending it back.
For context, the reason I find the rleid to be useful is that I work with a lot of train data, and it's useful to be able to index what run it's on.
Thanks for any help
Akhil

A working solution in sparklyr would be this:
spk_tbl %>%
dplyr::arrange(time) %>%
dplyr::mutate(rleid = (var != lag(var, 1, order = time, default = FALSE))) %>%
dplyr::mutate(rleid = cumsum(as.numeric(rleid)))
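The pipeline above flags each row whose var differs from the previous row in time order, then takes a running sum of the flags. That logic can be checked locally without Spark. The following is a minimal base R sketch, assuming the data fits in memory; the helper name rleid_by is hypothetical and is not part of sparklyr or data.table. Here the first row is counted as a change so that numbering starts at 1, as data.table::rleid does.

```r
# Hypothetical helper: run-length id of `x` after ordering rows by `ord`.
rleid_by <- function(x, ord) {
  o <- order(ord)                 # row positions in `ord` order
  xs <- x[o]
  # TRUE where the value differs from the previous row; the first row
  # counts as a change so that numbering starts at 1.
  changed <- c(TRUE, xs[-1L] != xs[-length(xs)])
  ids <- integer(length(x))
  ids[o] <- cumsum(changed)       # write the ids back in the original row order
  ids
}

tbl <- data.frame(time = 1:8,
                  var = c("A", "A", "A", "B", "B", "A", "A", "A"),
                  stringsAsFactors = FALSE)
tbl$rleid <- rleid_by(tbl$var, tbl$time)
```

Because the ids are written back through the ordering index, the input rows do not need to be sorted by time beforehand.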

Try this:
tbl %>% mutate(run = c(0,cumsum(var[-1L] != var[-length(var)])))
# A tibble: 8 × 3
time var run
<int> <chr> <dbl>
1 1 A 0
2 2 A 0
3 3 A 0
4 4 B 1
5 5 B 1
6 6 A 2
7 7 A 2
8 8 A 2
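Note that the run column above starts at 0, whereas data.table::rleid starts at 1. For local data that is already sorted, a 1-based id can also be sketched with base R's rle(); the helper name rleid_rle is hypothetical.

```r
# Hypothetical helper: 1-based run id built from base R's rle().
rleid_rle <- function(x) {
  r <- rle(x)                                   # lengths of consecutive runs
  rep(seq_along(r$lengths), times = r$lengths)  # repeat each run number over its run
}

var <- c("A", "A", "A", "B", "B", "A", "A", "A")
run <- rleid_rle(var)

# The cumsum-based version from the answer, shifted to start at 1.
run_cumsum <- c(0, cumsum(var[-1L] != var[-length(var)])) + 1
```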

Related

Create new column based on previous column by group; if missing, use NA

I am trying to select a value by group from one column and pass it as the value of another column, extended across the whole group. This is similar to the question asked here. But some groups do not have this value; in that case, I need to fill the column with NAs. How can I do this?
Dummy example:
dd1 <- data.frame(type = c(1,1,1),
grp = c('a', 'b', 'd'),
val = c(1,2,3))
dd2 <- data.frame(type = c(2,2),
grp = c('a', 'b'),
val = c(8,2))
dd3 <- data.frame(type = c(3,3),
grp = c('b', 'd'),
val = c(7,4))
dd <- rbind(dd1, dd2, dd3)
Create new column:
dd %>%
group_by(type) %>%
mutate(#val_a = ifelse(grp == 'a', val , NA),
val_a2 = val[grp == 'a'])
Expected outcome:
type grp val val_a # pass in `val_a` value of the group 'a'
1 1 a 1 1
2 1 b 2 1
3 1 d 3 1
4 2 a 8 8
5 2 b 2 8
6 3 b 7 NA
7 3 d 4 NA # value for 'a' is missing from group 3

You were close with your first approach; use any to apply the condition to all observations in the group:
dd %>%
group_by(type) %>%
mutate(val_a = ifelse(any(grp == "a"), val[grp == "a"] , NA))
type grp val val_a
<dbl> <chr> <dbl> <dbl>
1 1 a 1 1
2 1 b 2 1
3 1 d 3 1
4 2 a 8 8
5 2 b 2 8
6 3 b 7 NA
7 3 d 4 NA

Try this:
dd %>%
group_by(type) %>%
mutate(val_a2 = val[which(c(grp == 'a'))[1]])
# # A tibble: 7 x 4
# # Groups: type [3]
# type grp val val_a2
# <dbl> <chr> <dbl> <dbl>
# 1 1 a 1 1
# 2 1 b 2 1
# 3 1 d 3 1
# 4 2 a 8 8
# 5 2 b 2 8
# 6 3 b 7 NA
# 7 3 d 4 NA
This also controls against the possibility that there could be more than one match, which may cause bad results (with or without a warning).
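The reason this works is that which() returns a zero-length vector when nothing matches, and taking element [1] of that gives NA, which in turn indexes val to NA. A small base R sketch of the mechanism, assuming the same dd as in the question; the helper name first_or_na is hypothetical.

```r
# Hypothetical helper: value of `val` at the first position where grp == key,
# or NA when there is no such position.
first_or_na <- function(val, grp, key) val[which(grp == key)[1]]

no_match  <- first_or_na(c(7, 4), c("b", "d"), "a")          # nothing matches
two_match <- first_or_na(c(8, 2, 5), c("a", "b", "a"), "a")  # first match is used

dd <- data.frame(type = c(1, 1, 1, 2, 2, 3, 3),
                 grp  = c("a", "b", "d", "a", "b", "b", "d"),
                 val  = c(1, 2, 3, 8, 2, 7, 4))

# Apply the helper per `type` by passing row indices through ave();
# the single value returned for each group is recycled over its rows.
dd$val_a <- ave(seq_len(nrow(dd)), dd$type,
                FUN = function(i) first_or_na(dd$val[i], dd$grp[i], "a"))
```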

Adding column if it does not exist inside purrr language

I've been struggling to add a new column if it does not exist. I found an answer here: Adding column if it does not exist.
However, in my problem I must use it inside a purrr pipeline. I tried to adapt the above answer, but it doesn't fit my needs.
Here is an example of what I'm dealing with:
Suppose I have a list of two data.frames:
library(tibble)
A = tibble(
x = 1:5, y = 1, z = 2
)
B = tibble(
x = 5:1, y = 3, z = 3, w = 7
)
dt_list = list(A, B)
The column I'd like to add is w:
cols = c(w = NA_real_)
Separately, if I want to add a column if it does not exist, I could do the following:
For B, w already exists, so no column is added:
B %>% tibble::add_column(!!!cols[!names(cols) %in% names(.)])
# A tibble: 5 x 4
x y z w
<int> <dbl> <dbl> <dbl>
1 5 3 3 7
2 4 3 3 7
3 3 3 3 7
4 2 3 3 7
5 1 3 3 7
In this case, since it does not exist, w is added:
A %>% tibble::add_column(!!!cols[!names(cols) %in% names(.)])
# A tibble: 5 x 4
x y z w
<int> <dbl> <dbl> <dbl>
1 1 1 2 NA
2 2 1 2 NA
3 3 1 2 NA
4 4 1 2 NA
5 5 1 2 NA
I tried the following to replicate it using purrr (I'd prefer not to use a for loop):
dt_list_2 = dt_list %>%
purrr::map(
~dplyr::select(., -starts_with("x")) %>%
~tibble::add_column(!!!cols[!names(cols) %in% names(.)])
)
But the output is not the same as doing it separately.
Note: This is an example of my real problem. In fact, I'm using purrr to read many *.csv files and then apply some data transformation. Something like this:
re_file <- list.files(path = dir_path, pattern = "*.csv")
cols_add = c(UCI = NA_real_)
file_list = re_file %>%
purrr::map(function(file_name){ # iterate through each file name
read_csv(file = paste0(dir_path, "//",file_name), skip = 2)
}) %>%
purrr::map(
~dplyr::select(., -starts_with("Textbox")) %>%
~dplyr::tibble(!!!cols[!names(cols) %in% names(.)])
)

You can use :
dt_list %>%
purrr::map(
~tibble::add_column(., !!!cols[!names(cols) %in% names(.)])
)
#[[1]]
# A tibble: 5 x 4
# x y z w
# <int> <dbl> <dbl> <dbl>
#1 1 1 2 NA
#2 2 1 2 NA
#3 3 1 2 NA
#4 4 1 2 NA
#5 5 1 2 NA
#[[2]]
# A tibble: 5 x 4
# x y z w
# <int> <dbl> <dbl> <dbl>
#1 5 3 3 7
#2 4 3 3 7
#3 3 3 3 7
#4 2 3 3 7
#5 1 3 3 7
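If the tidyverse helpers are not required, the same idea (add only the columns that are missing, then map over the list) can be sketched in base R. This is an illustrative alternative and not the method from the answer above; the helper name add_missing_cols is hypothetical, and plain data frames stand in for the tibbles.

```r
# Hypothetical helper: add each column named in `cols` that `df` lacks,
# filled with the default value given in `cols`.
add_missing_cols <- function(df, cols) {
  for (nm in setdiff(names(cols), names(df))) {
    df[[nm]] <- cols[[nm]]   # a length-1 default is recycled to every row
  }
  df
}

A <- data.frame(x = 1:5, y = 1, z = 2)
B <- data.frame(x = 5:1, y = 3, z = 3, w = 7)
cols <- c(w = NA_real_)

# lapply() plays the role of purrr::map() here.
dt_list_2 <- lapply(list(A, B), add_missing_cols, cols = cols)
```

Any other per-file step, such as dropping unwanted columns, could go inside the same function before the loop.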

Add column to grouped data that assigns 1 to individuals and randomly assigns 1 or 0 to pairs

I have a dataframe...
df <- tibble(
id = 1:7,
family = c("a","a","b","b","c", "d", "e")
)
Families will only contain 2 members at most (so they're either individuals or pairs).
I need a new column 'random' that assigns the number 1 to families where there is only one member (e.g. c, d and e) and randomly assigns 0 or 1 to families containing 2 members (a and b in the example).
By the end the data should look like the following (depending on the random assignment of 0/1)...
df <- tibble(
id = 1:7,
family = c("a","a","b","b","c", "d", "e"),
random = c(1, 0, 0, 1, 1, 1, 1)
)
I would like to be able to do this with a combination of group_by and mutate since I am mostly using Tidyverse.
I tried the following (but this didn't randomly assign 0/1 within families)...
df %>%
  group_by(family) %>%
  mutate(
    random = if_else(
      condition = n() == 1,
      true = 1,
      false = as.double(sample(0:1, 1, replace = TRUE))
    )
  )

You could sample along the sequence length of the family group and take the answer modulo 2:
df %>%
group_by(family) %>%
mutate(random = sample(seq(n())) %% 2)
#> # A tibble: 7 x 3
#> # Groups: family [5]
#> id family random
#> <int> <chr> <dbl>
#> 1 1 a 0
#> 2 2 a 1
#> 3 3 b 0
#> 4 4 b 1
#> 5 5 c 1
#> 6 6 d 1
#> 7 7 e 1
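The attempt in the question draws a single value per family with sample(0:1, 1), and that one value is then recycled to both members of a pair. Shuffling the positions 1..n and taking them modulo 2 avoids this: a singleton always gets 1, and a pair gets one 1 and one 0 in random order. A base R sketch of the same modulo trick, using ave() in place of group_by and mutate:

```r
set.seed(42)  # only to make the sketch reproducible

family <- c("a", "a", "b", "b", "c", "d", "e")

# For each family, shuffle 1..n and take the result modulo 2.
# n = 1 always yields 1; n = 2 yields (1, 0) or (0, 1).
random <- ave(seq_along(family), family,
              FUN = function(i) sample(length(i)) %% 2)

df <- data.frame(id = 1:7, family = family, random = random)
```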

We can use if/else
library(dplyr)
df %>%
group_by(family) %>%
mutate(random = if(n() == 1) 1 else sample(rep(0:1, length.out = n())))
# A tibble: 7 x 3
# Groups: family [5]
# id family random
# <int> <chr> <dbl>
#1 1 a 0
#2 2 a 1
#3 3 b 1
#4 4 b 0
#5 5 c 1
#6 6 d 1
#7 7 e 1

Another option
df %>%
group_by(family) %>%
mutate(random = 2 - sample(1:n()))
# A tibble: 7 x 3
# Groups: family [5]
# id family random
# <int> <chr> <dbl>
# 1 1 a 1
# 2 2 a 0
# 3 3 b 1
# 4 4 b 0
# 5 5 c 1
# 6 6 d 1
# 7 7 e 1

Expanding a data.frame based on (group) values from the data.frame

Let's say I have the following data frame:
tibble(user = c('A', 'B'), first = c(1,4), last = c(6, 9))
# A tibble: 2 x 3
user first last
<chr> <dbl> <dbl>
1 A 1 6
2 B 4 9
And want to create a tibble that looks like:
bind_rows(tibble(user = 'A', weeks = 1:6),
tibble(user = 'B', weeks = 4:9))
# A tibble: 12 x 2
user weeks
<chr> <int>
1 A 1
2 A 2
3 A 3
4 A 4
5 A 5
6 A 6
7 B 4
8 B 5
9 B 6
10 B 7
11 B 8
12 B 9
How could I go about doing this? I have tried:
tibble(user = c('A', 'B'), first = c(1,4), last = c(6, 9)) %>%
group_by(user) %>%
mutate(weeks = first:last)
I wonder if I should try a combination of complete, map, or nest?

One option is unnest after creating a sequence
library(dplyr)
library(purrr)
library(tidyr) # for unnest()
df1 %>%
transmute(user, weeks = map2(first, last, `:`)) %>%
unnest(weeks)
# A tibble: 12 x 2
# user weeks
# <chr> <int>
# 1 A 1
# 2 A 2
# 3 A 3
# 4 A 4
# 5 A 5
# 6 A 6
# 7 B 4
# 8 B 5
# 9 B 6
#10 B 7
#11 B 8
#12 B 9
Or another option is rowwise
df1 %>%
rowwise %>%
transmute(user, weeks = list(first:last)) %>%
unnest(weeks)
Or without any packages
stack(setNames(Map(`:`, df1$first, df1$last), df1$user))
Or otherwise written as
stack(setNames(do.call(Map, c(f = `:`, df1[-1])), df1$user))
data
df1 <- tibble(user = c('A', 'B'), first = c(1,4), last = c(6, 9))

One option involving dplyr and tidyr could be:
df %>%
uncount(last - first + 1) %>%
group_by(user) %>%
transmute(weeks = first + 1:n() - 1)
user weeks
<chr> <dbl>
1 A 1
2 A 2
3 A 3
4 A 4
5 A 5
6 A 6
7 B 4
8 B 5
9 B 6
10 B 7
11 B 8
12 B 9
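The uncount approach repeats each row last - first + 1 times and then offsets first by the position within each user. The same two steps can be sketched in base R with rep() and sequence(), assuming the df1 defined in the other answer:

```r
df1 <- data.frame(user = c("A", "B"), first = c(1, 4), last = c(6, 9),
                  stringsAsFactors = FALSE)

n   <- df1$last - df1$first + 1             # rows needed for each user
idx <- rep(seq_len(nrow(df1)), times = n)   # repeat each row index n times

# sequence(n) counts 1..n within each user, so first + position - 1
# walks from `first` to `last`.
out <- data.frame(user  = df1$user[idx],
                  weeks = df1$first[idx] + sequence(n) - 1,
                  stringsAsFactors = FALSE)
```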

r-How to add column in r

I have a data table:
Name Score
A 5
A 6
B 9
B 1
B 0
...
I want to add a column Fscore that holds the maximum Score for each Name.
My expected result:
Name Score Fscore
A 5 6
A 6 6
B 9 9
B 1 9
B 0 9
Thanks.

We can use the base R option ave
df$Fscore <- ave(df$Score, df$Name, FUN = max)
df
# Name Score Fscore
#1 A 5 6
#2 A 6 6
#3 B 9 9
#4 B 1 9
#5 B 0 9
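ave() applies FUN within each group and returns a vector as long as the input, so the group maximum is repeated on every row of its group. A self-contained sketch, assuming df holds the table from the question, with an equivalent lookup through tapply() for comparison (the column name Fscore2 is only for illustration):

```r
df <- data.frame(Name = c("A", "A", "B", "B", "B"),
                 Score = c(5, 6, 9, 1, 0),
                 stringsAsFactors = FALSE)

# Group maximum broadcast to every row of the group.
df$Fscore <- ave(df$Score, df$Name, FUN = max)

# Same result in two steps: one maximum per Name, then a lookup by name.
group_max <- tapply(df$Score, df$Name, max)
df$Fscore2 <- as.vector(group_max[df$Name])
```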

If you are trying to find the maximum score for each Name value, you can use data.table as below.
library(data.table)
# example data
d <- data.table(Name = c("A", "A", "B", "B", "B"),
Score = c(5, 6, 9, 1, 0))
# find max for each Name and save the value in a new column, Fscore
d[ , Fscore := max(Score), by=Name]
Result:
> print(d)
Name Score Fscore
1: A 5 6
2: A 6 6
3: B 9 9
4: B 1 9
5: B 0 9

Another option using dplyr could be:
df = data.frame(Name = c('a', 'a', 'b','b','b'), Score = c(5,6,9,1,0))
df %>% group_by(Name) %>% mutate(Fscore = max(Score))
Source: local data frame [5 x 3]
Groups: Name [2]
Name Score Fscore
<fctr> <dbl> <dbl>
1 a 5 6
2 a 6 6
3 b 9 9
4 b 1 9
5 b 0 9
