Unnest (seperate) multiple column values into new rows using Sparklyr - r

I am trying to split column values separated by comma(,) into new rows based on id's. I know how to do this in R using dplyr and tidyr. But I am looking to solve same problem in sparklyr.
id <- c(1,1,1,1,1,2,2,2,3,3,3)
name <- c("A,B,C","B,F","C","D,R,P","E","A,Q,W","B,J","C","D,M","E,X","F,E")
value <- c("1,2,3","2,4,43,2","3,1,2,3","1","1,2","26,6,7","3,3,4","1","1,12","2,3,3","3")
dt <- data.frame(id,name,value)
R solution:
separate_rows(dt, name, sep=",") %>%
separate_rows(value, sep=",")
Desired Output from sparkframe(sparklyr package)-
> final_result
id name value
1 1 A 1
2 1 A 2
3 1 A 3
4 1 B 1
5 1 B 2
6 1 B 3
7 1 C 1
8 1 C 2
9 1 C 3
10 1 B 2
11 1 B 4
12 1 B 43
13 1 B 2
14 1 F 2
15 1 F 4
16 1 F 43
17 1 F 2
18 1 C 3
19 1 C 1
20 1 C 2
21 1 C 3
22 1 D 1
23 1 R 1
24 1 P 1
25 1 E 1
26 1 E 2
27 2 A 26
28 2 A 6
29 2 A 7
30 2 Q 26
31 2 Q 6
32 2 Q 7
33 2 W 26
34 2 W 6
35 2 W 7
36 2 B 3
37 2 B 3
38 2 B 4
39 2 J 3
40 2 J 3
41 2 J 4
42 2 C 1
43 3 D 1
44 3 D 12
45 3 M 1
46 3 M 12
47 3 E 2
48 3 E 3
49 3 E 3
50 3 X 2
51 3 X 3
52 3 X 3
53 3 F 3
54 3 E 3
Note-
I have approx 1000 columns with nested values. so, I need a function which can loop in for each column.
I know we have sdf_unnest() function from package sparklyr.nested. But, I am not sure how to split strings of multiple columns and apply this function. I am quite new in sparklyr.
Any help would be much appreciated.

You have to combine explode and split
sdt %>%
mutate(name = explode(split(name, ","))) %>%
mutate(value = explode(split(value, ",")))
# Source: lazy query [?? x 3]
# Database: spark_connection
id name value
<dbl> <chr> <chr>
1 1.00 A 1
2 1.00 A 2
3 1.00 A 3
4 1.00 B 1
5 1.00 B 2
6 1.00 B 3
7 1.00 C 1
8 1.00 C 2
9 1.00 C 3
10 1.00 B 2
# ... with more rows
Please note that lateral views have be to expressed as separate subqueries, so this:
sdt %>%
mutate(
name = explode(split(name, ",")),
value = explode(split(value, ",")))
won't work

Related

Assign value to data based on more than two conditions and on other data

I have a data frame that looks like this
> df
name time count
1 A 10 9
2 A 12 17
3 A 24 19
4 A 3 15
5 A 29 11
6 B 31 14
7 B 7 7
8 B 30 18
9 C 29 13
10 C 12 12
11 C 3 16
12 C 4 6
and for each name group (A, B, C) I would need to assign a category following the rules below:
if time<= 10 then category = 1
if 10 <time<= 20 then category = 2
if 20 <time<= 30 then category = 3
if time> 30 then category = 4
to have a data frame that looks like this:
> df_final
name time count category
1 A 10 9 1
2 A 12 17 2
3 A 24 19 3
4 A 3 15 1
5 A 29 11 3
6 B 31 14 4
7 B 7 7 1
8 B 30 18 3
9 C 29 13 3
10 C 12 12 2
11 C 3 16 1
12 C 4 6 1
after that I would need to sum the value in count based on their category. The ultimate data frame should loo like this:
> df_ultimate
name count category
1 A 24 1
2 A 17 2
3 A 30 3
4 A NA 4
5 B 7 1
6 B NA 2
7 B 18 3
8 B 14 4
9 C 22 1
10 C 12 2
11 C 13 3
12 C NA 4
I have tried to play around with summarise and group_by but without much success.
Thanks for your help
With cut + complete:
library(dplyr)
library(tidyr)
dat %>%
group_by(name, category = cut(time, breaks = c(-Inf, 10, 20, 30, Inf), labels = 1:4)) %>%
summarise(count = sum(count)) %>%
complete(category)
# # Groups: name [3]
# name category count
# 1 A 1 24
# 2 A 2 17
# 3 A 3 30
# 4 A 4 NA
# 5 B 1 7
# 6 B 2 NA
# 7 B 3 18
# 8 B 4 14
# 9 C 1 22
# 10 C 2 12
# 11 C 3 13
# 12 C 4 NA

How to add a column with repeating but changing sequence?

I'm trying to add a column with repeating sequence but one that changes for each group. In the example data, the group is the id column.
data <- tibble::expand_grid(id = 1:12, condition = c("a", "b", "c"))
data
id condition
1 a
1 b
1 c
2 a
2 b
2 c
3 a
3 b
3 c
... and so on
I'd like to add a column called order to repeat various combinations like 1 2 3 2 3 1 3 1 2 1 3 2 2 1 3 3 2 1 for each id.
In the end, the desired output will look like this
id condition order
1 a 1
1 b 2
1 c 3
2 a 2
2 b 3
2 c 1
3 a 3
3 b 1
3 c 2
... and so on
I'm looking for a simple mutate solution or base R solution. I tried generating a list of combinations but I'm not sure how to create a variable from that.
You can use perms from package pracma to generate all permutations, e.g.,
data %>%
cbind(order = c(t(pracma::perms(1:3))))
which gives
id condition order
1 1 a 3
2 1 b 2
3 1 c 1
4 2 a 3
5 2 b 1
6 2 c 2
7 3 a 2
8 3 b 3
9 3 c 1
10 4 a 2
11 4 b 1
12 4 c 3
13 5 a 1
14 5 b 2
15 5 c 3
16 6 a 1
17 6 b 3
18 6 c 2
19 7 a 3
20 7 b 2
21 7 c 1
22 8 a 3
23 8 b 1
24 8 c 2
25 9 a 2
26 9 b 3
27 9 c 1
28 10 a 2
29 10 b 1
30 10 c 3
31 11 a 1
32 11 b 2
33 11 c 3
34 12 a 1
35 12 b 3
36 12 c 2

Sum values incrementally for panel data

I have a very basic question as I am relatively new to R. I was wondering how to add a value in a particular column to the previous one for each cross-sectional unit in my data separately. My data looks like this:
firm date value
A 1 10
A 2 15
A 3 20
A 4 0
B 1 0
B 2 1
B 3 5
B 4 10
C 1 3
C 2 2
C 3 10
C 4 1
D 1 7
D 2 3
D 3 6
D 4 9
And I want to achieve the data below. So I want to sum values for each cross-sectional unit incrementally.
firm date value cumulative value
A 1 10 10
A 2 15 25
A 3 20 45
A 4 0 45
B 1 0 0
B 2 1 1
B 3 5 6
B 4 10 16
C 1 3 3
C 2 2 5
C 3 10 15
C 4 1 16
D 1 7 7
D 2 3 10
D 3 6 16
D 4 9 25
Below is a reproducible example code. I tried lag() but couldn't figure out how to repeat it for each firm.
firm <- c("A","A","A","A","B","B","B","B","C","C","C", "C","D","D","D","D")
date <- c("1","2","3","4","1","2","3","4","1","2","3","4", "1", "2", "3", "4")
value <- c(10, 15, 20, 0, 0, 1, 5, 10, 3, 2, 10, 1, 7, 3, 6, 9)
data <- data.frame(firm = firm, date = date, value = value)
Does this work:
library(dplyr)
df %>% group_by(firm) %>% mutate(cumulative_value = cumsum(value))
# A tibble: 16 x 4
# Groups: firm [4]
firm date value cumulative_value
<chr> <int> <int> <int>
1 A 1 10 10
2 A 2 15 25
3 A 3 20 45
4 A 4 0 45
5 B 1 0 0
6 B 2 1 1
7 B 3 5 6
8 B 4 10 16
9 C 1 3 3
10 C 2 2 5
11 C 3 10 15
12 C 4 1 16
13 D 1 7 7
14 D 2 3 10
15 D 3 6 16
16 D 4 9 25
Using base R with ave
data$cumulative_value <- with(data, ave(value, firm, FUN = cumsum))
-output
> data
firm date value cumulative_value
1 A 1 10 10
2 A 2 15 25
3 A 3 20 45
4 A 4 0 45
5 B 1 0 0
6 B 2 1 1
7 B 3 5 6
8 B 4 10 16
9 C 1 3 3
10 C 2 2 5
11 C 3 10 15
12 C 4 1 16
13 D 1 7 7
14 D 2 3 10
15 D 3 6 16
16 D 4 9 25

Creating groups based on running totals against a value

I have data which is unique at one variable Y. Another variable Z tells me how many people are in each of Y. My problem is that I want to create groups of 45 from these Y and Z. I mean that whenever the running total of Z touches 45, one group is made and the code moves on to create the next group.
My data looks something like this
ID X Y Z
1 A A 1
2 A B 5
3 A C 2
4 A D 42
5 A E 10
6 A F 2
7 A G 0
8 A H 3
9 A I 0
10 A J 8
11 A K 19
12 A L 3
13 A M 1
14 A N 1
15 A O 2
16 A P 0
17 A Q 1
18 A R 2
What is want is something like this
ID X Y Z CumSum Group
1 A A 1 1 1
2 A B 5 6 1
3 A C 2 8 1
4 A D 42 50 1
5 A E 10 10 2
6 A F 2 12 2
7 A G 0 12 2
8 A H 3 15 2
9 A I 0 15 2
10 A J 8 23 2
11 A K 19 42 2
12 A L 3 45 2
13 A M 1 1 3
14 A N 1 2 3
15 A O 2 4 3
16 A P 0 4 3
17 A Q 1 5 3
18 A R 2 7 3
Please let me know how I can achieve this with R.
EDIT: I extended the minimum reproducible example for more clarity
EDIT 2: I have one extra question on this topic. What if, the variable X which is A only right now is also changing. For example, it can be B for a while then can go to being C. How can I prevent the code from generating groups that are not within two categories of X. For example if Group = 3, then how can I make sure that 3 is not in category A and B?
A function for this is available in the MESS-package...
library(MESS)
library(data.table)
DT[, Group := MESS::cumsumbinning(Z, 50)][, Cumsum := cumsum(Z), by = .(Group)][]
output
ID X Y Z Group Cumsum
1: 1 A A 1 1 1
2: 2 A B 5 1 6
3: 3 A C 2 1 8
4: 4 A D 42 1 50
5: 5 A E 10 2 10
6: 6 A F 2 2 12
7: 7 A G 0 2 12
8: 8 A H 3 2 15
9: 9 A I 0 2 15
10: 10 A J 8 2 23
11: 11 A K 19 2 42
12: 12 A L 3 2 45
sample data
DT <- fread("ID X Y Z
1 A A 1
2 A B 5
3 A C 2
4 A D 42
5 A E 10
6 A F 2
7 A G 0
8 A H 3
9 A I 0
10 A J 8
11 A K 19
12 A L 3")
Define Accum which adds x to acc resetting to x if acc is 45 or more. Use Reduce to apply that to Z giving r (which is the cumulative sum column). The values greater than or equal to 45 are the group ends so attach a unique group id to them in g by using a cumsum starting from the end and going backwards toward the beginning giving g which has unique values for each group. Finally modify the group id's in g so that they start from 1. We run this with the input in the Note at the end which duplicates the last line several times so that 3 groups can be shown. No packages are used.
Accum <- function(acc, x) if (acc < 45) acc + x else x
applyAccum <- function(x) Reduce(Accum, x, accumulate = TRUE)
cumsumr <- function(x) rev(cumsum(rev(x))) # reverse cumsum
GroupNo <- function(x) {
y <- cumsumr(x >= 45)
max(y) - y + 1
}
transform(transform(DF, Cumsum = ave(Z, ID, FUN = applyAccum)),
Group = ave(Cumsum, ID, FUN = GroupNo))
giving:
ID X Y Z Cumsum Group
1 1 A A 1 1 1
2 2 A B 5 6 1
3 3 A C 2 8 1
4 4 A D 42 50 1
5 5 A E 10 10 2
6 6 A F 2 12 2
7 7 A G 0 12 2
8 8 A H 3 15 2
9 9 A I 0 15 2
10 10 A J 8 23 2
11 11 A K 19 42 2
12 12 A L 3 45 2
13 12 A L 3 3 3
14 12 A L 3 6 3
Note
The input in reproducible form:
Lines <- "ID X Y Z
1 A A 1
2 A B 5
3 A C 2
4 A D 42
5 A E 10
6 A F 2
7 A G 0
8 A H 3
9 A I 0
10 A J 8
11 A K 19
12 A L 3
12 A L 3
12 A L 3"
DF <- read.table(text = Lines, as.is = TRUE, header = TRUE)
One tidyverse possibility could be:
df %>%
mutate(Cumsum = accumulate(Z, ~ if_else(.x >= 45, .y, .x + .y)),
Group = cumsum(Cumsum >= 45),
Group = if_else(Group > lag(Group, default = first(Group)), lag(Group), Group) + 1)
ID X Y Z Cumsum Group
1 1 A A 1 1 1
2 2 A B 5 6 1
3 3 A C 2 8 1
4 4 A D 42 50 1
5 5 A E 10 10 2
6 6 A F 2 12 2
7 7 A G 0 12 2
8 8 A H 3 15 2
9 9 A I 0 15 2
10 10 A J 8 23 2
11 11 A K 19 42 2
12 12 A L 3 45 2
Not a pretty solution, but functional.
df$Group<-0
group<-1
while (df$Group[nrow(df)]==0) {
df$ww[df$Group==0]<-cumsum(df$Z[df$Group==0])
df$Group[df$Group==0 & (lag(df$ww)<=45 | is.na(lag(df$ww)) | lag(df$Group!=0))]<-group
group=group+1
}
df
ID X Y Z ww Group
1 1 A A 1 1 1
2 2 A B 5 6 1
3 3 A C 2 8 1
4 4 A D 42 50 1
5 5 A E 10 10 2
6 6 A F 2 12 2
7 7 A G 0 12 2
8 8 A H 3 15 2
9 9 A I 0 15 2
10 10 A J 8 23 2
11 11 A K 19 42 2
12 12 A L 3 45 2
OK, yeah, #tmfmnk 's solution is vastly better:
Unit: milliseconds
expr min lq mean median uq max neval
tm 2.224536 2.805771 6.76661 3.221449 3.990778 303.7623 100
iod 19.198391 22.294222 30.17730 25.765792 35.768616 110.2062 100
Or using data.table:
library(data.table)
n <- 45L
DT[, cs := Reduce(function(tot, z) if (tot+z > n) z else tot+z, Z, accumulate=TRUE)][,
Group := .GRP, by=cumsum(c(1L, diff(cs))<0L)]
output:
ID X Y Z cs Group
1: 1 A A 1 1 1
2: 2 A B 5 6 1
3: 3 A C 2 8 1
4: 4 A D 42 42 1
5: 5 A E 10 10 2
6: 6 A F 2 12 2
7: 7 A G 0 12 2
8: 8 A H 3 15 2
9: 9 A I 0 15 2
10: 10 A J 8 23 2
11: 11 A K 19 42 2
12: 12 A L 3 45 2
13: 13 A M 1 1 3
14: 14 A N 1 2 3
15: 15 A O 2 4 3
16: 16 A P 0 4 3
17: 17 A Q 1 5 3
18: 18 A R 2 7 3
data:
library(data.table)
DT <- fread("ID X Y Z
1 A A 1
2 A B 5
3 A C 2
4 A D 42
5 A E 10
6 A F 2
7 A G 0
8 A H 3
9 A I 0
10 A J 8
11 A K 19
12 A L 3
13 A M 1
14 A N 1
15 A O 2
16 A P 0
17 A Q 1
18 A R 2")

Relabel samples in kmean results considering the order of centers

I am using kmeans to cluster my data, for the produced result I have a plan.
I wanted to relabel the samples based on ordered centres. Consider following example :
a = c("a","b","c","d","e","F","i","j","k","l","m","n")
b = c(1,2,3,20,21,21,40,41,42,4,23,50)
mydata = data.frame(id=a,amount=b)
result = kmeans(mydata$amount,3,nstart=10)
Here is the result :
clus$cluster
2 2 2 3 3 3 1 1 1 2 3 1
clus$centers
1 43.25
2 2.50
3 21.25
mydata = data.frame(mydata,label =clus$cluster)
mydata
id amount label
1 a 1 2
2 b 2 2
3 c 3 2
4 d 20 3
5 e 21 3
6 F 21 3
7 i 40 1
8 j 41 1
9 k 42 1
10 l 4 2
11 m 23 3
12 n 50 1
What I am looking for is sorting the centres and producing the labels accordingly:
1 2.50
2 21.25
3 43.25
and label the samples going to:
1 1 1 2 2 2 3 3 3 1 2 3
and the result should be :
id amount label
1 a 1 1
2 b 2 1
3 c 3 1
4 d 20 2
5 e 21 2
6 F 21 2
7 i 40 3
8 j 41 3
9 k 42 3
10 l 4 1
11 m 23 2
12 n 50 3
I think it is possible to do it by, order the centres and for each sample taking the index of minimum distance of samples with centres as the label of that cluster.
Is there another way that R can do it automatically ?
One idea is to create a named vector by matching your centers with the sorted centers. Then match the vector with mydata$label and replace with the names of the vector, i.e.
i1 <- setNames(match(sort(result$centers), result$centers), rownames(result$centers))
as.numeric(names(i1)[match(mydata$label, i1)])
# [1] 1 1 1 2 2 2 3 3 3 1 2 3
You can use for loop, if you don't mind loops
cls <- result$cluster
for (i in 1 : length(result$cluster))
result$cluster[cls == order(result$centers)[i]] <- i
result$cluster
#[1] 1 1 1 2 2 2 3 3 3 1 2 3

Resources