R Dataframe: Add new column as sum of certain other columns - r

Name Trial# Result ResultsSoFar
1 Bob 1 14 14
2 Bob 2 22 36
3 Bob 3 3 39
4 Bob 4 18 57
5 Nancy 2 33 33
6 Nancy 3 87 120
Hello, say I have the dataframe above. What's the best way to generate the "ResultsSoFar" column which is a sum of that person's results up to and including that trial (Bob's results do not include Nancy's and vice versa).

With data.table you can do:
library(data.table)
setDT(df)[, ResultsSoFar:=cumsum(Result), by=Name]
df
Name Trial. Result ResultsSoFar
1: Bob 1 14 14
2: Bob 2 22 36
3: Bob 3 3 39
4: Bob 4 18 57
5: Nancy 2 33 33
6: Nancy 3 87 120
Note:
If Trial# is not sorted, you can do setDT(df)[, ResultsSoFar:=cumsum(Result[order(Trial.)]), by=Name] to get the right order for the cumsum

Related

Pasted tables from Jupyter to StackOverflow get ill-formatted

Consider the code:
df = pd.DataFrame({
"name":["john","jim","eric","jim","john","jim","jim","eric","eric","john"],
"category":["a","b","c","b","a","b","c","c","a","c"],
"amount":[100,200,13,23,40,2,43,92,83,1]
})
df
When I copy the output, I get a not nicely formatted table here on StackOverflow:
name category amount sum count
0 john a 100 140 2
1 jim b 200 225 3
2 eric c 13 105 2
3 jim b 23 225 3
4 john a 40 140 2
5 jim b 2 225 3
6 jim c 43 43 1
7 eric c 92 105 2
8 eric a 83 83 1
9 john c 1 1 1
How to fix this problem?
from an ipython terminal session I can paste:
In [73]: df
Out[73]:
name category amount
0 john a 100
1 jim b 200
2 eric c 13
3 jim b 23
4 john a 40
5 jim b 2
6 jim c 43
7 eric c 92
8 eric a 83
9 john c 1
From a notebook, it looks like print(df) gives a better result:
name category amount
0 john a 100
1 jim b 200
2 eric c 13
3 jim b 23
4 john a 40
5 jim b 2
6 jim c 43
7 eric c 92
8 eric a 83
9 john c 1
The copy selection should be a solid color block.

Split data frame based on group into list in defined order in R

I have simple data.frame
ts_df <- data.frame(val=c(20,30,40,50,21,26,11,41,47,41),
cycle=c(3,3,3,3,2,2,2,1,1,1),
date=as.Date(c("1985-06-30","1985-09-30","1985-12-31","1986-03-31","1986-06-30","1986-09-30","1986-12-31","1987-03-31","1987-06-30","1987-09-30")))
and I need split ts_df based on group but keep order inside resulted list based on date.
list_ts_df <- split(ts_df, ts_df$cycle)
So instead of
> list_ts_df
$`1`
val cycle date
8 41 1 1987-03-31
9 47 1 1987-06-30
10 41 1 1987-09-30
$`2`
val cycle date
5 21 2 1986-06-30
6 26 2 1986-09-30
7 11 2 1986-12-31
$`3`
val cycle date
1 20 3 1985-06-30
2 30 3 1985-09-30
3 40 3 1985-12-31
4 50 3 1986-03-31
I need desired output as
> list_ts_df
$`1`
val cycle date
1 20 3 1985-06-30
2 30 3 1985-09-30
3 40 3 1985-12-31
4 50 3 1986-03-31
$`2`
val cycle date
5 21 2 1986-06-30
6 26 2 1986-09-30
7 11 2 1986-12-31
$`3`
val cycle date
8 41 1 1987-03-31
9 47 1 1987-06-30
10 41 1 1987-09-30
Is there any simple solution how to achieve desired output? Thank you very much for any of your advises.
We can do an order of the dataset first and then do the split on the 'cycle' by creating a factor with the levels specified as unique elements
t1 <- ts_df[order(ts_df$date),]
split(t1, factor(t1$cycle, levels = unique(t1$cycle)) )

Randomly draw rows from dataframe based on unique values and column values

I have a dataframe with many descriptor variables (trt, individual, session). I want to be able to randomly select a fraction of the possible trt x individual combinations but control for the session variable such that no random pull has the same session number. Here is what my dataframe looks like:
trt <- c(rep(c(rep("A", 3), rep("B", 3), rep("C", 3)), 9))
individual <- rep(c("Bob", "Nancy", "Tim"), 27)
session <- rep(1:27, each = 3)
data <- rnorm(81, mean = 4, sd = 1)
df <- data.frame(trt, individual, session, data))
df
trt individual session data
1 A Bob 1 3.72013685581385
2 A Nancy 1 3.97225419000673
3 A Tim 1 4.44714175686225
4 B Bob 2 5.00024599458127
5 B Nancy 2 3.43615965145765
6 B Tim 2 6.7920094635501
7 C Bob 3 4.36315054477571
8 C Nancy 3 5.07117348146375
9 C Tim 3 4.38503325758969
10 A Bob 4 4.30677162933005
11 A Nancy 4 1.89311687510669
12 A Tim 4 3.09084920968413
13 B Bob 5 3.10436190897144
14 B Nancy 5 3.59454992439722
15 B Tim 5 3.40778069131207
16 C Bob 6 4.00171937800892
17 C Nancy 6 0.14578811080644
18 C Tim 6 4.20754733296227
19 A Bob 7 3.69131009783284
20 A Nancy 7 4.7025756891679
21 A Tim 7 4.46196017363017
22 B Bob 8 3.97573281432736
23 B Nancy 8 4.5373185942686
24 B Tim 8 2.40937847038141
25 C Bob 9 4.57519884980087
26 C Nancy 9 5.19143914630448
27 C Tim 9 4.83144732833874
28 A Bob 10 3.01769965527235
29 A Nancy 10 5.17300616827746
30 A Tim 10 4.65432284571663
31 B Bob 11 4.50892032922527
32 B Nancy 11 3.38082717995663
33 B Tim 11 4.92022245677209
34 C Bob 12 4.54149796547394
35 C Nancy 12 3.21992774137179
36 C Tim 12 3.74507360931023
37 A Bob 13 3.39524949548056
38 A Nancy 13 4.17518916890901
39 A Tim 13 3.02932375225388
40 B Bob 14 3.59660910672907
41 B Nancy 14 2.08784850191654
42 B Tim 14 3.98446125755258
43 C Bob 15 4.01837496797085
44 C Nancy 15 3.40610126858125
45 C Tim 15 4.57107635588582
46 A Bob 16 3.15839276840723
47 A Nancy 16 2.19932140340504
48 A Tim 16 4.77588798035668
49 B Bob 17 4.3524768657397
50 B Nancy 17 4.49071625925856
51 B Tim 17 4.02576463486266
52 C Bob 18 3.74783360762117
53 C Nancy 18 2.84123227236184
54 C Tim 18 3.2024114782253
55 A Bob 19 4.93837445490921
56 A Nancy 19 4.7103051496802
57 A Tim 19 6.22083635045134
58 B Bob 20 4.5177747677824
59 B Nancy 20 1.78839270771153
60 B Tim 20 5.07140678136995
61 C Bob 21 3.47818616035335
62 C Nancy 21 4.28526474048439
63 C Tim 21 4.22597602946575
64 A Bob 22 1.91700925257901
65 A Nancy 22 2.96317997587458
66 A Tim 22 2.53506974227672
67 B Bob 23 5.52714403395316
68 B Nancy 23 3.3618513551059
69 B Tim 23 4.85869007113978
70 C Bob 24 3.4367068543959
71 C Nancy 24 4.47769879000349
72 C Tim 24 5.77340483757836
73 A Bob 25 4.78524317734622
74 A Nancy 25 3.55373702554664
75 A Tim 25 2.88541465503637
76 B Bob 26 4.62885302019139
77 B Nancy 26 3.59430293369092
78 B Tim 26 2.29610255924296
79 C Bob 27 4.38433001299722
80 C Nancy 27 3.77825207859976
81 C Tim 27 2.12163194694365
How do I pull out 2 of each trt x individual combinations with a unique session number? This is an example what I want the dataframe to look like:
trt individual session data
1 A Bob 1 3.72013685581385
5 B Nancy 2 3.43615965145765
7 C Bob 3 4.36315054477571
12 A Tim 4 3.09084920968413
15 B Tim 5 3.40778069131207
17 C Nancy 6 0.14578811080644
19 A Bob 7 3.69131009783284
29 A Nancy 10 5.17300616827746
31 B Bob 11 4.50892032922527
34 C Bob 12 4.54149796547394
39 A Tim 13 3.02932375225388
40 B Bob 14 3.59660910672907
47 A Nancy 16 2.19932140340504
51 B Tim 17 4.02576463486266
54 C Tim 18 3.2024114782253
59 B Nancy 20 1.78839270771153
71 C Nancy 24 4.47769879000349
81 C Tim 27 2.12163194694365
I have tried a couple things with no luck.
I have tried to just randomly select two trt x individual combinations, but I end up with duplicate session values:
setDT((df))
df[ , .SD[sample(.N, 2)] , keyby = .(trt, individual)]
trt individual session data
1: A Bob 25 2.7560788894668
2: A Bob 19 4.12040841647523
3: A Nancy 4 5.35362338127901
4: A Nancy 19 5.51636882737692
5: A Tim 19 5.10553640201998
6: A Tim 1 2.77380671625473
7: B Bob 23 3.50585105164409
8: B Bob 8 3.58167259470814
9: B Nancy 23 2.85301307507985
10: B Nancy 8 2.85179395539781
11: B Tim 26 2.40666507132474
12: B Tim 20 3.31276311351286
13: C Bob 24 3.19076007024549
14: C Bob 3 3.59146613276121
15: C Nancy 9 4.46606667880457
16: C Nancy 15 2.25405252536256
17: C Tim 12 4.43111661206133
18: C Tim 27 4.23868848646589
I have tried randomly selecting one of each session number and then pulling 2 trt x individual combinations, but it typically comes back with an error since the random selection doesnt grab an equal number of trt x individual combinations:
ind <- sapply( unique(df$session ) , function(x) sample( which(df$session == x) , 1) )
df.unique <- df[ind, ]
df.sub <- df.unique[, .SD[sample(.N, 2)] , by = .(trt, individual)]
Error in `[.data.frame`(df.unique, , .SD[sample(.N, 2)], by = .(trt, individual)) :
unused argument (by = .(trt, individual))
Thanks in advance for your help!
Perhaps there is a clever way to sample, but here's a straightforward idea to get you started in the meanwhile:
setDT(df)
setkey(df, session)
usedsessions = 0 # some value that's not a session number
df[, {
res = .SD[!.(usedsessions)][sample(.N, 2)]
usedsessions = c(usedsessions, res$session)
res
}
, by = .(trt, individual)]
# trt individual session data
# 1: A Bob 7 4.256668
# 2: A Bob 25 2.431821
# 3: A Nancy 16 4.785859
# 4: A Nancy 19 4.865248
# 5: A Tim 4 3.303689
# 6: A Tim 13 3.550261
# 7: B Bob 26 3.987136
# 8: B Bob 17 3.283055
# 9: B Nancy 14 3.177226
#10: B Nancy 2 3.639542
#11: B Tim 8 2.168447
#12: B Tim 5 3.521123
#13: C Bob 21 3.284245
#14: C Bob 12 5.773098
#15: C Nancy 24 4.624428
#16: C Nancy 9 3.235467
#17: C Tim 18 4.001395
#18: C Tim 27 5.002110
You'll probably need to add corner case processing (e.g. if there is no such sampling).

reshape / restructure the data frame in R [duplicate]

This question already has answers here:
Reshaping multiple sets of measurement columns (wide format) into single columns (long format)
(8 answers)
Closed 6 years ago.
I'm cleaning a dataset, but the frame is not ideal, I have to reshape it, but I don't know how. The following are the original data frame:
Rater Rater ID Ratee1 Ratee2 Ratee3 Ratee1.item1 Ratee1.item2 Ratee2.item1 Ratee2.item2 Ratee3.item1 Ratee3.item2
A 12 701 702 800 1 2 3 4 5 6
B 23 45 46 49 3 3 3 3 3 3
C 24 80 81 28 2 3 4 5 6 9
Then I am wondering how to reshape it as the below:
Rater Rater ID Ratee item1 item2
A 12 701 1 2
A 12 702 3 4
A 12 800 5 6
B 23 45 3 3
B 23 46 3 3
B 23 49 3 3
C 24 80 2 3
C 24 81 4 5
C 24 28 6 9
This reshaping is a little bit different from this one (Reshaping data.frame from wide to long format). As I have three parts in the original data.
First part is about the rater's ID (Rater and Rater ID).
The second is about retee's ID (Ratee1, Ratee2, Ratee3).
The Third part is about Rater's rating on each retee (retee*.item1(or2)).
To make it more clear, let me brief the data collecting process.
First, a rater types in his own name and ID,
then nominates three persons (Ratee1 to Ratee3),
and then rates the questions regarding each retee (for each retee, there are two questions).
Does anyone know how to reshape this? Thanks!
We can use melt from data.table
library(data.table)
melt(setDT(df1), measure = patterns("^Ratee\\d+$", "^Ratee\\d+\\.item1",
"^Ratee\\d+\\.item2"), value.name = c("Ratee", "item1", "item2"))[,
variable := NULL][order(Rater)]
# Rater RaterID Ratee item1 item2
#1: A 12 701 1 2
#2: A 12 702 3 4
#3: A 12 800 5 6
#4: B 23 45 3 3
#5: B 23 46 3 3
#6: B 23 49 3 3
#7: C 24 80 2 3
#8: C 24 81 4 5
#9: C 24 28 6 9

Select row meeting condition and all subsequent rows by group

Let's assume I have a data frame consisting of a categorical variable and a numerical one.
df <- data.frame(group=c(1,1,1,1,1,2,2,2,2,2),days=floor(runif(10, min=0, max=101)))
df
group days
1 1 54
2 1 61
3 1 31
4 1 52
5 1 21
6 2 22
7 2 18
8 2 50
9 2 46
10 2 35
I would like to select the row corresponding to the maximum number of days by group as well as all the following/subsequent group rows. For the example above, my subset df2 should look as follows:
df2
group days
2 1 61
3 1 31
4 1 52
5 1 21
8 2 50
9 2 46
10 2 35
Please note that the groups could have different lengths.
For a base R solution, aggregate days by group using a function that keeps the elements with index greater than or equal to the maximum, and then reshape as a long data.frame
df0 = aggregate(days ~ group, df, function(x) x[seq_along(x) >= which.max(x)])
data.frame(group=rep(df0$group, lengths(df0$days)),
days=unlist(df0$days, use.names=FALSE)))
leading to
group days
1 1 84
2 1 31
3 1 65
4 1 23
5 2 94
6 2 69
7 2 45
You can use which.max to find out the index of the maximum of the days and then use slice from dplyr to select all the rows after that, where n() gives the number of rows in each group:
library(dplyr)
df %>% group_by(group) %>% slice(which.max(days):n())
#Source: local data frame [7 x 2]
#Groups: group [2]
# group days
# <int> <int>
#1 1 61
#2 1 31
#3 1 52
#4 1 21
#5 2 50
#6 2 46
#7 2 35
data.table syntax would be similar, .N is similar to n() in dplyr and gives the number of rows in each group:
library(data.table)
setDT(df)[, .SD[which.max(days):.N], group]
# group days
#1: 1 61
#2: 1 31
#3: 1 52
#4: 1 21
#5: 2 50
#6: 2 46
#7: 2 35
We can use a faster option with data.table where we find the row index (.I) and then subset the rows based on that.
library(data.table)
setDT(df)[df[ , .I[which.max(days):.N], by = group]$V1]
# group days
#1: 1 61
#2: 1 31
#3: 1 52
#4: 1 21
#5: 2 50
#6: 2 46
#7: 2 35

Resources