I have data that is unique at one variable Y. Another variable Z tells me how many people are in each value of Y. My problem is that I want to create groups of 45 from these Y and Z. I mean that whenever the running total of Z reaches 45, one group is closed and the code moves on to create the next group.
My data looks something like this
ID X Y Z
1 A A 1
2 A B 5
3 A C 2
4 A D 42
5 A E 10
6 A F 2
7 A G 0
8 A H 3
9 A I 0
10 A J 8
11 A K 19
12 A L 3
13 A M 1
14 A N 1
15 A O 2
16 A P 0
17 A Q 1
18 A R 2
What I want is something like this
ID X Y Z CumSum Group
1 A A 1 1 1
2 A B 5 6 1
3 A C 2 8 1
4 A D 42 50 1
5 A E 10 10 2
6 A F 2 12 2
7 A G 0 12 2
8 A H 3 15 2
9 A I 0 15 2
10 A J 8 23 2
11 A K 19 42 2
12 A L 3 45 2
13 A M 1 1 3
14 A N 1 2 3
15 A O 2 4 3
16 A P 0 4 3
17 A Q 1 5 3
18 A R 2 7 3
Please let me know how I can achieve this with R.
EDIT: I extended the minimal reproducible example for more clarity.
EDIT 2: I have one extra question on this topic. What if the variable X, which is only A right now, also changes? For example, it can be B for a while and then go to C. How can I prevent the code from generating groups that span two categories of X? For example, if Group = 3, how can I make sure that group 3 does not contain rows from both category A and category B?
A function for this is available in the MESS package:
library(MESS)
library(data.table)
# cumsumbinning() starts a new bin when the running sum would exceed 50
DT[, Group := MESS::cumsumbinning(Z, 50)][
   , Cumsum := cumsum(Z), by = .(Group)][]
output
ID X Y Z Group Cumsum
1: 1 A A 1 1 1
2: 2 A B 5 1 6
3: 3 A C 2 1 8
4: 4 A D 42 1 50
5: 5 A E 10 2 10
6: 6 A F 2 2 12
7: 7 A G 0 2 12
8: 8 A H 3 2 15
9: 9 A I 0 2 15
10: 10 A J 8 2 23
11: 11 A K 19 2 42
12: 12 A L 3 2 45
sample data
DT <- fread("ID X Y Z
1 A A 1
2 A B 5
3 A C 2
4 A D 42
5 A E 10
6 A F 2
7 A G 0
8 A H 3
9 A I 0
10 A J 8
11 A K 19
12 A L 3")
Define Accum, which adds x to acc, resetting to x once acc is 45 or more. Use Reduce to apply it along Z, giving the cumulative-sum column. The values greater than or equal to 45 mark the group ends, so attach a unique group id to each group by taking a cumulative sum that starts from the end and runs backwards toward the beginning; finally renumber the group ids so that they start from 1. We run this with the input in the Note at the end, which duplicates the last line several times so that 3 groups can be shown. No packages are used.
Accum <- function(acc, x) if (acc < 45) acc + x else x
applyAccum <- function(x) Reduce(Accum, x, accumulate = TRUE)
cumsumr <- function(x) rev(cumsum(rev(x)))  # reverse cumsum

GroupNo <- function(x) {
  y <- cumsumr(x >= 45)
  max(y) - y + 1
}

transform(transform(DF, Cumsum = ave(Z, X, FUN = applyAccum)),
          Group = ave(Cumsum, X, FUN = GroupNo))
giving:
ID X Y Z Cumsum Group
1 1 A A 1 1 1
2 2 A B 5 6 1
3 3 A C 2 8 1
4 4 A D 42 50 1
5 5 A E 10 10 2
6 6 A F 2 12 2
7 7 A G 0 12 2
8 8 A H 3 15 2
9 9 A I 0 15 2
10 10 A J 8 23 2
11 11 A K 19 42 2
12 12 A L 3 45 2
13 12 A L 3 3 3
14 12 A L 3 6 3
Note
The input in reproducible form:
Lines <- "ID X Y Z
1 A A 1
2 A B 5
3 A C 2
4 A D 42
5 A E 10
6 A F 2
7 A G 0
8 A H 3
9 A I 0
10 A J 8
11 A K 19
12 A L 3
12 A L 3
12 A L 3"
DF <- read.table(text = Lines, as.is = TRUE, header = TRUE)
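EDIT 2 asks for groups that never span two categories of X. One way is to run the same accumulate-and-label logic within each level of X via split/lapply. A minimal base-R sketch, not from the original answer: group_within, DF2, and the toy values are made up for illustration, and 45 is the threshold from the question.

```r
# Sketch: restart the grouping whenever X changes, so no group ever
# spans two categories of X (EDIT 2). Base R only.
Accum <- function(acc, x) if (acc < 45) acc + x else x

group_within <- function(z) {
  cs   <- Reduce(Accum, z, accumulate = TRUE)  # running sum, resets at >= 45
  ends <- rev(cumsum(rev(cs >= 45)))           # reverse cumsum of group ends
  list(Cumsum = cs, Group = max(ends) - ends + 1)
}

DF2 <- data.frame(X = c("A", "A", "A", "B", "B"), Z = c(40, 10, 3, 44, 2))
res <- do.call(rbind, lapply(split(DF2, DF2$X), function(d) {
  g <- group_within(d$Z)
  d$Cumsum <- g$Cumsum
  d$Group  <- g$Group   # group ids restart at 1 within each X category
  d
}))
```

Within category A the 40 + 10 = 50 run closes group 1 and the trailing 3 starts group 2; category B gets its own group 1 and never mixes with A.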
One tidyverse possibility could be:
library(dplyr)
library(purrr)

df %>%
  mutate(Cumsum = accumulate(Z, ~ if_else(.x >= 45, .y, .x + .y)),
         Group = cumsum(Cumsum >= 45),
         Group = if_else(Group > lag(Group, default = first(Group)),
                         lag(Group), Group) + 1)
ID X Y Z Cumsum Group
1 1 A A 1 1 1
2 2 A B 5 6 1
3 3 A C 2 8 1
4 4 A D 42 50 1
5 5 A E 10 10 2
6 6 A F 2 12 2
7 7 A G 0 12 2
8 8 A H 3 15 2
9 9 A I 0 15 2
10 10 A J 8 23 2
11 11 A K 19 42 2
12 12 A L 3 45 2
Not a pretty solution, but functional.
library(dplyr)  # for lag()

df$Group <- 0
group <- 1
while (df$Group[nrow(df)] == 0) {
  df$ww[df$Group == 0] <- cumsum(df$Z[df$Group == 0])
  df$Group[df$Group == 0 &
             (lag(df$ww) <= 45 | is.na(lag(df$ww)) | lag(df$Group != 0))] <- group
  group <- group + 1
}
df
ID X Y Z ww Group
1 1 A A 1 1 1
2 2 A B 5 6 1
3 3 A C 2 8 1
4 4 A D 42 50 1
5 5 A E 10 10 2
6 6 A F 2 12 2
7 7 A G 0 12 2
8 8 A H 3 15 2
9 9 A I 0 15 2
10 10 A J 8 23 2
11 11 A K 19 42 2
12 12 A L 3 45 2
OK, yeah, @tmfmnk's solution is vastly better:
Unit: milliseconds
expr min lq mean median uq max neval
tm 2.224536 2.805771 6.76661 3.221449 3.990778 303.7623 100
iod 19.198391 22.294222 30.17730 25.765792 35.768616 110.2062 100
Or using data.table:
library(data.table)
n <- 45L
DT[, cs := Reduce(function(tot, z) if (tot + z > n) z else tot + z,
                  Z, accumulate = TRUE)][
   , Group := .GRP, by = cumsum(c(1L, diff(cs)) < 0L)]
output:
ID X Y Z cs Group
1: 1 A A 1 1 1
2: 2 A B 5 6 1
3: 3 A C 2 8 1
4: 4 A D 42 42 1
5: 5 A E 10 10 2
6: 6 A F 2 12 2
7: 7 A G 0 12 2
8: 8 A H 3 15 2
9: 9 A I 0 15 2
10: 10 A J 8 23 2
11: 11 A K 19 42 2
12: 12 A L 3 45 2
13: 13 A M 1 1 3
14: 14 A N 1 2 3
15: 15 A O 2 4 3
16: 16 A P 0 4 3
17: 17 A Q 1 5 3
18: 18 A R 2 7 3
data:
library(data.table)
DT <- fread("ID X Y Z
1 A A 1
2 A B 5
3 A C 2
4 A D 42
5 A E 10
6 A F 2
7 A G 0
8 A H 3
9 A I 0
10 A J 8
11 A K 19
12 A L 3
13 A M 1
14 A N 1
15 A O 2
16 A P 0
17 A Q 1
18 A R 2")
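The grouping key in the data.table answer deserves a word: a reset of the capped running sum shows up as a negative first difference, so cumulative-summing those drops yields one key per group. A small base-R illustration, using the first six cs values from the answer's own output:

```r
cs  <- c(1, 6, 8, 42, 10, 12)          # capped running sums from the output
key <- cumsum(c(1L, diff(cs)) < 0L)    # TRUE (i.e. 1) wherever cs dropped
key                                    # 0 0 0 0 1 1 -- .GRP then renumbers these 1, 2
```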
I am trying to split column values separated by a comma (,) into new rows based on id. I know how to do this in R using dplyr and tidyr, but I am looking to solve the same problem in sparklyr.
id <- c(1,1,1,1,1,2,2,2,3,3,3)
name <- c("A,B,C","B,F","C","D,R,P","E","A,Q,W","B,J","C","D,M","E,X","F,E")
value <- c("1,2,3","2,4,43,2","3,1,2,3","1","1,2","26,6,7","3,3,4","1","1,12","2,3,3","3")
dt <- data.frame(id,name,value)
R solution:
separate_rows(dt, name, sep=",") %>%
separate_rows(value, sep=",")
Desired output from the Spark data frame (sparklyr package):
> final_result
id name value
1 1 A 1
2 1 A 2
3 1 A 3
4 1 B 1
5 1 B 2
6 1 B 3
7 1 C 1
8 1 C 2
9 1 C 3
10 1 B 2
11 1 B 4
12 1 B 43
13 1 B 2
14 1 F 2
15 1 F 4
16 1 F 43
17 1 F 2
18 1 C 3
19 1 C 1
20 1 C 2
21 1 C 3
22 1 D 1
23 1 R 1
24 1 P 1
25 1 E 1
26 1 E 2
27 2 A 26
28 2 A 6
29 2 A 7
30 2 Q 26
31 2 Q 6
32 2 Q 7
33 2 W 26
34 2 W 6
35 2 W 7
36 2 B 3
37 2 B 3
38 2 B 4
39 2 J 3
40 2 J 3
41 2 J 4
42 2 C 1
43 3 D 1
44 3 D 12
45 3 M 1
46 3 M 12
47 3 E 2
48 3 E 3
49 3 E 3
50 3 X 2
51 3 X 3
52 3 X 3
53 3 F 3
54 3 E 3
Note:
I have approx. 1000 columns with nested values, so I need a function that can loop over each column.
I know there is the sdf_unnest() function from the sparklyr.nested package, but I am not sure how to split the strings of multiple columns and then apply it. I am quite new to sparklyr.
Any help would be much appreciated.
You have to combine explode and split:
sdt %>%
mutate(name = explode(split(name, ","))) %>%
mutate(value = explode(split(value, ",")))
# Source: lazy query [?? x 3]
# Database: spark_connection
id name value
<dbl> <chr> <chr>
1 1.00 A 1
2 1.00 A 2
3 1.00 A 3
4 1.00 B 1
5 1.00 B 2
6 1.00 B 3
7 1.00 C 1
8 1.00 C 2
9 1.00 C 3
10 1.00 B 2
# ... with more rows
Please note that lateral views have to be expressed as separate subqueries, so this:
sdt %>%
mutate(
name = explode(split(name, ",")),
value = explode(split(value, ",")))
won't work
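As a side note, the two chained explode() calls form a cross join within each input row: every split name is paired with every split value, which is exactly what the desired output shows for id 1. The same semantics can be illustrated in plain base R, no Spark needed (rec and pairs are made-up names for this sketch):

```r
# Cross-join semantics of two chained explode(split(...)) calls, base R.
rec   <- list(id = 1, name = "A,B,C", value = "1,2,3")
pairs <- expand.grid(value = strsplit(rec$value, ",")[[1]],
                     name  = strsplit(rec$name, ",")[[1]],
                     stringsAsFactors = FALSE)[, c("name", "value")]
pairs  # 9 rows: A-1, A-2, A-3, B-1, ..., C-3
```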
I have 15 columns and I want to count the values in each column, grouped as 0, 1, or NA.
my dataset
A,B,C,D,E,F,G,H,I,J,K,L,M,N,O
0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0
1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0
1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0
NA,1.0,0.0,0.0,NA,0.0,0.0,0.0,NA,NA,NA,NA,NA,NA,NA
1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0
1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0
1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,NA,NA,NA,NA,NA
1.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,NA,0.0,NA,NA,NA,NA,NA
1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0
1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0
1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0
1.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0
0.0,0.0,0.0,0.0,0.0,NA,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0
1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0
0.0,1.0,1.0,0.0,0.0,0.0,NA,NA,NA,NA,NA,NA,NA,NA,NA
1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0
NA,NA,1.0,NA,NA,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0
0.0,1.0,0.0,0.0,0.0,0.0,0.0,NA,0.0,0.0,NA,NA,NA,NA,NA
I want output to be like:
A B C D E F G H I J K L M N O
0 5 6 2 3 5 0 1 2 3 4 1 2 0 0 1
1 5 6 2 3 5 0 1 2 3 4 1 2 0 0 1
NA 5 6 2 3 5 0 1 2 3 4 1 2 0 0 1
We can loop through the columns of the dataset and apply table with useNA = "always":
sapply(df1, table, useNA="always")
If a column contains only one particular value, say only 1, convert it to a factor with levels specified as 0 and 1 so that all three counts appear:
sapply(df1, function(x) table(factor(x, levels = 0:1), useNA = "always"))
# A B C D E F G H I J K L M N O
#0 4 3 8 7 17 15 14 11 14 12 12 10 8 11 9
#1 19 21 17 17 6 9 10 12 8 11 8 10 12 9 11
#<NA> 2 1 0 1 2 1 1 2 3 2 5 5 5 5 5
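A small self-contained example of why the factor conversion matters: without fixed levels, a column that never contains 0 simply has no 0 row in its table (the vector x below is made up for illustration).

```r
x <- c(1, 1, NA)
tab1 <- table(x, useNA = "always")                        # only rows 1 and <NA>
tab2 <- table(factor(x, levels = 0:1), useNA = "always")  # rows 0, 1, <NA>: 0 2 1
```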
I've this data set
id <- c(0,0,1,1,2,2,3,3,4,4)
gender <- c("m","m","f","f","f","f","m","m","m","m")
x1 <-c(1,1,1,1,2,2,3,3,10,10)
x2 <- c(3,7,5,6,9,15,10,15,12,20)
alldata <- data.frame(id,gender,x1,x2)
which looks like:
id gender x1 x2
0 m 1 3
0 m 1 7
1 f 1 5
1 f 1 6
2 f 2 9
2 f 2 15
3 m 3 10
3 m 3 15
4 m 10 12
4 m 10 20
Notice that for each unique id the x1 values are the same, but the x2 values differ. I need to sort the data by id and x2 (from smallest to largest),
and then, for each unique id, set x1 (for the second record) = x2 (for the first record).
The data would look like:
id gender x1 x2
0 m 1 3
0 m 3 7
1 f 1 5
1 f 5 6
2 f 2 9
2 f 9 15
3 m 3 10
3 m 10 15
4 m 10 12
4 m 12 20
I found this easier using data.table
> library(data.table)
> dt = data.table(alldata)
> setkey(dt, id, x2) #sort the data
This next line says: within each id, keep the first value of x1, then fill every remaining position from x2 as needed.
> dt[,x1 := c(x1[1], x2)[1:.N],keyby=id]
> dt
id gender x1 x2
1: 0 m 1 3
2: 0 m 3 7
3: 1 f 1 5
4: 1 f 5 6
5: 2 f 2 9
6: 2 f 9 15
7: 3 m 3 10
8: 3 m 10 15
9: 4 m 10 12
10: 4 m 12 20
Here's another possible solution using the seq command to select every other record:
alldata <- alldata[order(alldata$id, alldata$x2), ]
alldata$x1[seq(2, nrow(alldata), 2)] <- alldata$x2[seq(1, nrow(alldata) - 1, 2)]
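Note that this approach assumes exactly two rows per id once sorted, since it pairs every even-positioned row with the odd-positioned row before it. A self-contained run on the question's data, under that assumption:

```r
# Rebuild the question's sample data and apply the every-other-row assignment.
id     <- c(0, 0, 1, 1, 2, 2, 3, 3, 4, 4)
gender <- c("m", "m", "f", "f", "f", "f", "m", "m", "m", "m")
x1     <- c(1, 1, 1, 1, 2, 2, 3, 3, 10, 10)
x2     <- c(3, 7, 5, 6, 9, 15, 10, 15, 12, 20)
alldata <- data.frame(id, gender, x1, x2)

alldata <- alldata[order(alldata$id, alldata$x2), ]
# each even row's x1 takes the preceding odd row's x2
alldata$x1[seq(2, nrow(alldata), 2)] <- alldata$x2[seq(1, nrow(alldata) - 1, 2)]
```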
Here is a dplyr solution.
library(dplyr)
arrange(alldata, id, x2) %>%
  group_by(id) %>%
  mutate(x1 = c(first(x1), first(x2)))
Source: local data frame [10 x 4]
Groups: id
id gender x1 x2
1 0 m 1 3
2 0 m 3 7
3 1 f 1 5
4 1 f 5 6
5 2 f 2 9
6 2 f 9 15
7 3 m 3 10
8 3 m 10 15
9 4 m 10 12
10 4 m 12 20
`rownames<-`(do.call(rbind, by(alldata, alldata$id, function(g) {
  o <- order(g$x2)
  g$x1[o[2]] <- g$x2[o[1]]
  g
})), NULL)
## id gender x1 x2
## 1 0 m 1 3
## 2 0 m 3 7
## 3 1 f 1 5
## 4 1 f 5 6
## 5 2 f 2 9
## 6 2 f 9 15
## 7 3 m 3 10
## 8 3 m 10 15
## 9 4 m 10 12
## 10 4 m 12 20