I would like to make a spaghetti plot of the data below. Treatment C should be set as the reference (1), compared to treatments A and B. Does anyone have a suggestion for how to do that? Thanks in advance! :)
id <- rep(c(300,450), each=6)
trt <- rep(c("A","B","C"),2)
q1 <- c(100,89, 60,85,40,10)
df <- data.frame(id,trt,q1)
df
id trt q1
1 300 A 100
2 300 B 89
3 300 C 60
4 300 A 85
5 300 B 40
6 300 C 10
7 450 A 100
8 450 B 89
9 450 C 60
10 450 A 85
11 450 B 40
12 450 C 10
I've got a very large dataset (millions of rows that I need to loop through thousands of times), and during the loop I have to do a conditional sum that appears to be taking a very long time. Is there a way of making this more efficient?
Datatable format as follows:
DT <- data.table('A' = c(1,1,1,2,2,3,3,3,3,4),
'B' = c(500,510,540,500,540,500,510,519,540,500),
'C' = c(10,20,10,20,10,50,20,50,20,10))
A   B  C
1 500 10
1 510 20
1 540 10
2 500 20
2 540 10
3 500 50
3 510 20
3 519 50
3 540 20
4 500 10
I need the sum of column C (in a new column, D) subject to A == A, and B >= B & B < B + 20 (by row). So the output table would look like the following:
A   B  C   D
1 500 10  30
1 510 20  30
1 540 10  10
2 500 20  20
2 540 10  10
3 500 50 120
3 510 20 120
3 519 50 120
3 540 20  20
4 500 10  10
The code I'm currently using:
DT[,D:= sum(DT$C[A == DT$A & ((B >= DT$B) & (B < DT$B + 20))]), by=c('A', 'B')]
This takes a very long time to actually run, as well as giving me the wrong answer. The output I get looks like this:
A   B  C   D
1 500 10  10
1 510 20  30
1 540 10  10
2 500 20  20
2 540 10  10
3 500 50  50
3 510 20  70
3 519 50 120
3 540 20  20
4 500 10  10
(i.e. D only appears to increase cumulatively).
I'm less concerned with the cumulative thing, more about speed. Ultimately what I'm trying to get to is the largest sum of C, by A, subject to the B values being within 20 of each other. I would really appreciate any help on this! Thanks in advance.
If I understand correctly, this can be solved by a non-equi self join:
DT[, Bp20 := B + 20][
DT, on = .(A, B >= B, B < Bp20), mult = "last"][
, .(B, C = i.C, D = sum(i.C)), by = .(A, Bp20)][
, Bp20 := NULL][]
A B C D
1: 1 500 10 30
2: 1 510 20 30
3: 1 540 10 10
4: 2 500 20 20
5: 2 540 10 10
6: 3 500 50 120
7: 3 510 20 120
8: 3 519 50 120
9: 3 540 20 20
10: 4 500 10 10
# logic for B
DT[, g := B >= shift(B) & B < shift(B, 1) + 20, by = A]
# creating index column
DT[, gi := !g]
DT[is.na(gi), gi := T]
DT[, gi := cumsum(gi)]
DT[, D := sum(C), by = gi] # summing by new groups
DT
# A B C g gi D
# 1: 1 500 10 NA 1 30
# 2: 1 510 20 TRUE 1 30
# 3: 1 540 10 FALSE 2 10
# 4: 2 500 20 NA 3 20
# 5: 2 540 10 FALSE 4 10
# 6: 3 500 50 NA 5 120
# 7: 3 510 20 TRUE 5 120
# 8: 3 519 50 TRUE 5 120
# 9: 3 540 20 FALSE 6 20
# 10: 4 500 10 NA 7 10
You might need to adjust the logic for B, as not all edge cases are clear from the question. For example, if for one A value we have B = c(30, 40, 50, 60), should all of those rows be in one group?
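To pin down one reading of the stated end goal (the largest sum of C per A, with B within 20 of each other), here is a minimal base R sketch. It assumes a forward window, i.e. for each row it sums C over rows with the same A and B in [B, B + 20), and then takes the maximum per A. The names dat, win and best are made up for illustration. It is quadratic in the number of rows, so it is only meant as a small reference to check a faster data.table version against, not as the fast solution itself.

```r
# Same values as DT in the question, as a plain data.frame (hypothetical name)
dat <- data.frame(
  A = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4),
  B = c(500, 510, 540, 500, 540, 500, 510, 519, 540, 500),
  C = c(10, 20, 10, 20, 10, 50, 20, 50, 20, 10)
)

# Assumption: forward window. For row i, sum C over rows with the same A
# and B in [B_i, B_i + 20).
dat$win <- vapply(seq_len(nrow(dat)), function(i) {
  in_window <- dat$A == dat$A[i] &
    dat$B >= dat$B[i] &
    dat$B < dat$B[i] + 20
  sum(dat$C[in_window])
}, numeric(1))

# Largest window sum per A
best <- aggregate(win ~ A, data = dat, FUN = max)
best
```

Under this reading the row with A = 3, B = 510 gets 70 rather than 120, which differs from the desired output in the question, so it is worth deciding which definition is actually wanted before optimising.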
I want to set column q2 to zero in every row where column q1 is 2. Does anyone have a smart solution?
a <- rep(c(300,450), each=3, times=2)
q1 <- rep(c(1,1,2,1,1,2),2)
q2 <- c(100,40,"",80,30,"" , 45,78,"",20,58,"")
df <- cbind(a,q1,q2)
df <- as.data.frame(df)
Original input data :
> df
a q1 q2
1 300 1 100
2 300 1 40
3 300 2
4 450 1 80
5 450 1 30
6 450 2
7 300 1 45
8 300 1 78
9 300 2
10 450 1 20
11 450 1 58
12 450 2
Desired output :
> df
a q1 q2
1 300 1 100
2 300 1 40
3 300 2 0
4 450 1 80
5 450 1 30
6 450 2 0
7 300 1 45
8 300 1 78
9 300 2 0
10 450 1 20
11 450 1 58
12 450 2 0
An option would be to create a logical vector based on the column 'q1' and assign 0 to the matching elements of 'q2'
df$q2[df$q1 == 2] <- 0
df
# a q1 q2
#1 300 1 100
#2 300 1 40
#3 300 2 0
#4 450 1 80
#5 450 1 30
#6 450 2 0
#7 300 1 45
#8 300 1 78
#9 300 2 0
#10 450 1 20
#11 450 1 58
#12 450 2 0
Another option is replace
transform(df, q2 = replace(q2, q1 == 2, 0))
cbind converts its arguments to a matrix first, so a single character element anywhere makes the whole matrix character. It is better to use data.frame directly
Or in data.table
library(data.table)
setDT(df)[q1== 2, q2 := '0']
data
df <- data.frame(a, q1, q2, stringsAsFactors = FALSE)
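To illustrate the point about cbind, here is a small sketch using the vectors from the question. The names m and df2 are made up for illustration, and the conversion of q2 to numeric at the end is an extra step that assumes the empty strings should be treated as missing values.

```r
a  <- rep(c(300, 450), each = 3, times = 2)
q1 <- rep(c(1, 1, 2, 1, 1, 2), 2)
q2 <- c(100, 40, "", 80, 30, "", 45, 78, "", 20, 58, "")

# cbind builds a matrix, and a matrix has a single type:
# the character q2 turns every column into character
m <- cbind(a, q1, q2)
is.character(m)

# data.frame keeps one type per column
df2 <- data.frame(a, q1, q2, stringsAsFactors = FALSE)
sapply(df2, class)

# Assumption: "" means missing. Converting q2 to numeric turns "" into NA,
# after which the rows with q1 == 2 can be set to a numeric 0
df2$q2 <- as.numeric(df2$q2)
df2$q2[df2$q1 == 2] <- 0
df2
```

With q2 numeric, later arithmetic on the column works without converting from character each time.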
I have two tables as follows:
A<-data.frame("Task"=c("a","b","c","d","e"),"FC"=(c(100,120,200,300,400)))
B<-data.frame("Task"=c("a","b","c"),"FC"=(c(20,50,30)))
Task FC
1 a 100
2 b 120
3 c 200
4 d 300
5 e 400
Task FC
1 a 20
2 b 50
3 c 30
How can I create a table C that contains the sum of FC for each corresponding Task in A and B?
Task FC
1 a 120
2 b 170
3 c 230
Merge the data frames:
df <- merge(A, B, by = "Task", all = FALSE)
Then summarise the data:
df$sum <- apply(df[, 2:3], 1, sum) # sum, sd, min, max or ...
> df
Task FC.x FC.y sum
1 a 100 20 120
2 b 120 50 170
3 c 200 30 230
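An alternative sketch, in case there are more than two tables or the long format is preferred: stack the tables, keep only the tasks that appear in both, and aggregate. This is not part of the answer above, just another way to get the same three rows. The name both is made up for illustration.

```r
A <- data.frame(Task = c("a", "b", "c", "d", "e"),
                FC = c(100, 120, 200, 300, 400))
B <- data.frame(Task = c("a", "b", "c"),
                FC = c(20, 50, 30))

# Stack the two tables and keep only tasks present in both
both <- rbind(A, B)
both <- both[both$Task %in% intersect(A$Task, B$Task), ]

# Sum FC per task
C <- aggregate(FC ~ Task, data = both, FUN = sum)
C
```

Dropping the intersect line would keep tasks d and e as well, with their FC values from A unchanged.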
Here is my data :
class x1 x2
c 6 90
b 5 50
c 3 70
b 9 40
a 5 30
b 1 60
a 7 20
c 4 80
a 2 10
I first want to order it by class (increasing or decreasing doesn't really matter) and then by x1 (decreasing), so I do the following :
df <- df[with(df, order(class, x1, decreasing = TRUE))]
class x1 x2
c 6 90
c 4 80
c 3 70
b 9 40
b 5 50
b 1 60
a 7 20
a 5 30
a 2 10
And then I would like the cumulative sum of x2 for each class :
class x1 x2 cumsum
c 6 90 90
c 4 80 170 # 90+80
c 3 70 240 # 90+80+70
b 9 40 40
b 5 50 90 # 40+50
b 1 60 150 # 40+50+60
a 7 20 20
a 5 30 50 # 20+30
a 2 10 60 # 20+30+10
Following this answer, I did this :
df$cumsum <- unlist(by(df$x2, df$class, cumsum))
# (Also tried this, same result)
df$cumsum <- unlist(by(df[,x2], df[,class], cumsum))
But what I get is a cumulative sum over the whole set, and it is misordered. To be more specific, here is what I get :
class x1 x2 cumsum
c 6 90 20 # this cumsum
c 4 80 50 # and this cumsum
c 3 70 60 # and this cumsum are the cumsum of the lines of class a,
b 9 40 100 # then it adds the 'x2' values of class b : 60 ('cumsum' from the previous line) + 40
b 5 50 150 # and keeps doing so : 100 + 50
b 1 60 210 # 150 + 60
a 7 20 300 # 210 + 90
a 5 30 380 # 300 + 80
a 2 10 450 # 380 + 70
Any idea on how I could solve this ? Thanks
dplyr can work here too
library(dplyr)
df %>%
group_by(class) %>%
arrange(desc(x1)) %>%
mutate(cumsum=cumsum(x2))
## class x1 x2 cumsum
## (fctr) (int) (int) (int)
## 1 a 7 20 20
## 2 a 5 30 50
## 3 a 2 10 60
## 4 b 9 40 40
## 5 b 5 50 90
## 6 b 1 60 150
## 7 c 6 90 90
## 8 c 4 80 170
## 9 c 3 70 240
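As described here (https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html) and elsewhere, in older dplyr versions group_by in conjunction with arrange sorted the data by the grouping variable first, which gives the row order shown above. In current dplyr versions arrange() ignores grouping by default; use arrange(desc(x1), .by_group = TRUE) to get the same row order. The per-class cumulative sums are the same either way.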
We can use data.table
library(data.table)
setDT(df)[, x2:= cumsum(x2) , class]
df
# class x1 x2
#1: c 6 90
#2: c 4 170
#3: c 3 240
#4: b 9 40
#5: b 5 90
#6: b 1 150
#7: a 7 20
#8: a 5 50
#9: a 2 60
NOTE: In the above I used the ordered data
If we need to order also,
setorder(setDT(df), -class, -x1)[, x2:=cumsum(x2), class]
You can use base R transform and ave to compute the cumulative sum of x2 within each class, after ordering by class and x1 (both decreasing)
transform(df[order(df$class, df$x1, decreasing = TRUE), ], cumsum = ave(x2, class, FUN=cumsum))
# class x1 x2 cumsum
#1 c 6 90 90
#8 c 4 80 170
#3 c 3 70 240
#4 b 9 40 40
#2 b 5 50 90
#6 b 1 60 150
#7 a 7 20 20
#5 a 5 30 50
#9 a 2 10 60
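As for why the unlist(by(...)) attempt in the question comes out misordered, here is a small sketch using the question's data. by() returns one result per class in level order (a, b, c), so after unlist the values for class a come first, while the rows of the sorted data start with class c. ave() instead returns values in the original row order. The names by_result and ave_result are made up for illustration.

```r
df <- data.frame(
  class = c("c", "b", "c", "b", "a", "b", "a", "c", "a"),
  x1 = c(6, 5, 3, 9, 5, 1, 7, 4, 2),
  x2 = c(90, 50, 70, 40, 30, 60, 20, 80, 10)
)

# Sort by class, then x1, both decreasing (rows now run c, b, a)
df <- df[order(df$class, df$x1, decreasing = TRUE), ]

# by() groups in level order a, b, c, so the unlisted vector starts with
# the cumulative sums for class a and no longer lines up with the rows
by_result <- unlist(by(df$x2, df$class, cumsum))

# ave() returns one value per row, in the same order as the rows
ave_result <- ave(df$x2, df$class, FUN = cumsum)

cbind(df, by_result = unname(by_result), ave_result)
```

Sorting with class increasing would make the by() result line up by coincidence, but ave (or the dplyr and data.table versions above) does not depend on the row order matching the level order.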
I ran into a big problem when trying to apply my small-scale solution at a larger scale. I want to write a function that will let me automate adding together all the values of specific data frames.
First, I have created list of all data frames:
> lst
$data001
A B C D E
X 10 30 50 70
Y 20 40 60 80
$data002
A B C D E
X 10 30 50 70
Y 20 40 60 80
$data003
A B C D E
X 10 30 50 70
Y 20 40 60 80
Z 20 40 60 80
$data004
A B C D E
X 10 30 50 70
Y 20 40 60 80
Z 20 40 60 80
V 20 40 60 80
$data005
A B C D E
Q 10 30 50 70
$data006
A B C D E
X 10 30 50 70
Y 20 40 60 80
$data007
A B C D E
X 10 30 50 70
Y 20 40 60 80
$data008
A B C D E
X 10 30 50 70
Y 20 40 60 80
$data09
A B C D E
X 11 33 55 77
Y 22 44 66 88
$data010
A B C D E
X 10 30 50 70
Y 20 40 60 80
Second, I have determined which data frames I would like to add together (add 1 to 1 and 2 to 2 etc.). In this example there are 10 data frames organized in the following order, within lst:
[1] 1 1 2 2 2 2 2 2 3 2
Manually adding all the "ones" would look something like this:
> ddply(rbind(lst[[1]],lst[[2]]), "A", numcolwise(sum))
A B C D E
X 20 60 100 140
Y 40 80 120 160
Manually adding all the "twos" would look something like this:
A B C D E
X 60 180 300 420
Y 120 240 360 480
Z 40 80 120 160
V 20 40 60 80
Q 10 30 50 70
However, I just cannot figure out how to write a loop that will create a list with, in this example, 3 data frames that are the result of summing up the selected data frames.
Thank you in advance!
We may use data.table
library(data.table)
lapply(split(seq_along(lst), v1), function(i)
rbindlist(lst[i], fill=TRUE)[
, lapply(.SD, sum), A, .SDcols= B:E])
#$`1`
# A B C D E
#1: X 20 60 100 140
#2: Y 40 80 120 160
#$`2`
# A B C D E
#1: X 60 180 300 420
#2: Y 120 240 360 480
#3: Z 40 80 120 160
#4: V 20 40 60 80
#5: Q 10 30 50 70
#$`3`
# A B C D E
#1: X 11 33 55 77
#2: Y 22 44 66 88
data
v1 <- c(1, 1, 2, 2, 2, 2, 2, 2, 3, 2)
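If data.table is not available, roughly the same idea can be sketched in base R with rbind and aggregate. The example below is an assumption-laden miniature: it rebuilds only the first three data frames from the question and uses a shortened grouping vector (v1_small, a made-up name) so that it is self-contained.

```r
# First three data frames from the question, rebuilt by hand
lst <- list(
  data001 = data.frame(A = c("X", "Y"),
                       B = c(10, 20), C = c(30, 40),
                       D = c(50, 60), E = c(70, 80)),
  data002 = data.frame(A = c("X", "Y"),
                       B = c(10, 20), C = c(30, 40),
                       D = c(50, 60), E = c(70, 80)),
  data003 = data.frame(A = c("X", "Y", "Z"),
                       B = c(10, 20, 20), C = c(30, 40, 40),
                       D = c(50, 60, 60), E = c(70, 80, 80))
)

# Hypothetical shortened grouping vector: first two go together, third alone
v1_small <- c(1, 1, 2)

# For each group: stack its data frames, then sum every numeric column by A
res <- lapply(split(seq_along(lst), v1_small), function(i) {
  stacked <- do.call(rbind, lst[i])
  aggregate(. ~ A, data = stacked, FUN = sum)
})
res
```

The result is a list named by group, like the data.table version, with one summed data frame per group. Note that aggregate sorts the rows by A, so the row order can differ from the data.table output.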