Subsetting dataframe based on unique values and other column data - r

I have a dataframe with a series of ID columns (trt, individual, and session):
trt<-c(rep("A",3),rep("B",3),rep("C",3),rep("A",3),rep("B",3),rep("C",3),rep("A",3),rep("B",3),rep("C",3))
individual<-rep(c("Bob","Nancy","Tim"),9)
session<-c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,6,6,6,7,7,7,8,8,8,9,9,9)
data<-rnorm(27,mean=4,sd=1)
df<-as.data.frame(cbind(trt,individual,session,data))
df
trt individual session data
1 A Bob 1 4.36604594311893
2 A Nancy 1 3.29568979189961
3 A Tim 1 3.55849387209243
4 B Bob 2 5.41661201729216
5 B Nancy 2 4.7158873476798
6 B Tim 2 5.34401708530548
7 C Bob 3 4.54277206331273
8 C Nancy 3 3.53976115781019
9 C Tim 3 3.7954788384957
10 A Bob 4 4.75145309337952
11 A Nancy 4 4.7995601464568
12 A Tim 4 3.17821205815185
13 B Bob 5 3.62379779744325
14 B Nancy 5 4.07387328854209
15 B Tim 5 5.60156909861945
16 C Bob 6 4.06727142161431
17 C Nancy 6 4.59940289933985
18 C Tim 6 3.07543217234973
19 A Bob 7 2.63468285023662
20 A Nancy 7 3.22650587327078
21 A Tim 7 6.31062631711196
22 B Bob 8 4.69047076193906
23 B Nancy 8 4.79190101388308
24 B Tim 8 1.61906440409175
25 C Bob 9 2.85180524036416
26 C Nancy 9 3.43304058627408
27 C Tim 9 4.89263600498695
I want to create a new dataframe in which one row has been randomly drawn for each trt x individual combination, under the constraint that each unique session number is selected only once.
This is what I want my dataframe to look like:
trt individual session data
2 A Nancy 1 3.29568979189961
4 B Bob 2 5.41661201729216
9 C Tim 3 3.7954788384957
10 A Bob 4 4.75145309337952
15 B Tim 5 5.60156909861945
17 C Nancy 6 4.59940289933985
21 A Tim 7 6.31062631711196
23 B Nancy 8 4.79190101388308
25 C Bob 9 2.85180524036416
I know how to randomly select one row for each trt x individual combination:
library(data.table)
setDT(df)
newdf<-df[, .SD[sample(.N, 1)] , by=.(trt, individual)]
newdf
trt individual session data
1: A Bob 4 4.75145309337952
2: A Nancy 1 3.29568979189961
3: A Tim 7 6.31062631711196
4: B Bob 8 4.69047076193906
5: B Nancy 2 4.7158873476798
6: B Tim 2 5.34401708530548
7: C Bob 6 4.06727142161431
8: C Nancy 9 3.43304058627408
9: C Tim 3 3.7954788384957
But I don't know how to restrict the draws so that each session is pulled only once (i.e. no duplicates, unlike session 2 above, which appears twice).
Thanks in advance for your help!

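This needs to iterate through the data.table row by row and might not be quick, but it doesn't require setting any parameters for the fields of interest.
library(data.table)
set.seed(7)
setDT(df)
# shuffle the rows so that the greedy pass below visits them in random order
dt1 <- df[, .SD[sample(.N)]]
# i = row position after shuffling
dt1[, i := .I]
# flag: NA = undecided, TRUE = selected, FALSE = rejected
dt1[, flag := NA]
setkey(dt1, flag)
# visit each row; if it is still undecided, select it and reject every other
# row that shares its trt x individual combination or its session
invisible(lapply(dt1$i, function(x) {
dt1[is.na(flag[x]) & (trt == trt[x] & individual == individual[x] | session == session[x]), flag := i == x]
}))
dt1[flag == TRUE, ]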
trt individual session data i flag
1: C Tim 9 3.63712332100071 1 TRUE
2: A Nancy 4 4.54908662150973 2 TRUE
3: A Tim 1 5.84217708521442 3 TRUE
4: B Tim 2 2.37343483362789 5 TRUE
5: C Nancy 3 2.87792051390258 7 TRUE
6: A Bob 7 3.45471592963754 12 TRUE
7: B Nancy 8 4.54792567807183 15 TRUE
8: C Bob 6 4.45667777212948 24 TRUE
9: B Bob 5 2.33285598638319 27 TRUE
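
The same greedy idea can be sketched in base R, which may make the logic easier to follow and lets you check the constraints afterwards. This is only an illustration under assumptions: the data frame is rebuilt here with data.frame() so that the columns keep their types, and the names shuffled, keep and result are made up for the example.

```r
set.seed(7)

# Rebuild the example data (same layout as in the question)
df <- data.frame(
  trt = rep(rep(c("A", "B", "C"), each = 3), 3),
  individual = rep(c("Bob", "Nancy", "Tim"), 9),
  session = rep(1:9, each = 3),
  data = rnorm(27, mean = 4, sd = 1)
)

# Visit the rows in random order
shuffled <- df[sample(nrow(df)), ]
keep <- logical(nrow(shuffled))

for (r in seq_len(nrow(shuffled))) {
  chosen <- shuffled[keep, ]
  same_combo <- chosen$trt == shuffled$trt[r] &
    chosen$individual == shuffled$individual[r]
  same_session <- chosen$session == shuffled$session[r]
  # Keep the row only if no already-chosen row shares its combination or session
  if (!any(same_combo | same_session)) keep[r] <- TRUE
}

result <- shuffled[keep, ]
result <- result[order(result$session), ]
result
```

For this particular layout (each session belongs to one trt and contains all three individuals) the greedy pass always ends with one row per combination; for other layouts it could return fewer rows, so it is worth checking nrow(result) and anyDuplicated(result$session).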

Related

Pasted tables from Jupyter to StackOverflow get ill-formatted

Consider the code:
import pandas as pd
df = pd.DataFrame({
"name":["john","jim","eric","jim","john","jim","jim","eric","eric","john"],
"category":["a","b","c","b","a","b","c","c","a","c"],
"amount":[100,200,13,23,40,2,43,92,83,1]
})
df
When I copy the output, the table I paste here on Stack Overflow is badly formatted:
name category amount sum count
0 john a 100 140 2
1 jim b 200 225 3
2 eric c 13 105 2
3 jim b 23 225 3
4 john a 40 140 2
5 jim b 2 225 3
6 jim c 43 43 1
7 eric c 92 105 2
8 eric a 83 83 1
9 john c 1 1 1
How to fix this problem?
From an IPython terminal session I can paste:
In [73]: df
Out[73]:
name category amount
0 john a 100
1 jim b 200
2 eric c 13
3 jim b 23
4 john a 40
5 jim b 2
6 jim c 43
7 eric c 92
8 eric a 83
9 john c 1
From a notebook, it looks like print(df) gives a better result:
name category amount
0 john a 100
1 jim b 200
2 eric c 13
3 jim b 23
4 john a 40
5 jim b 2
6 jim c 43
7 eric c 92
8 eric a 83
9 john c 1
The copy selection should be a solid color block.

R Dataframe: Add new column as sum of certain other columns

Name Trial# Result ResultsSoFar
1 Bob 1 14 14
2 Bob 2 22 36
3 Bob 3 3 39
4 Bob 4 18 57
5 Nancy 2 33 33
6 Nancy 3 87 120
Hello, say I have the dataframe above. What's the best way to generate the "ResultsSoFar" column, which is the sum of that person's results up to and including that trial? (Bob's results do not include Nancy's and vice versa.)
With data.table you can do:
library(data.table)
setDT(df)[, ResultsSoFar:=cumsum(Result), by=Name]
df
Name Trial. Result ResultsSoFar
1: Bob 1 14 14
2: Bob 2 22 36
3: Bob 3 3 39
4: Bob 4 18 57
5: Nancy 2 33 33
6: Nancy 3 87 120
Note:
If Trial# is not sorted, cumsum(Result) adds the results in row order rather than trial order. Ordering in i fixes this while still writing each total back to its own row: setDT(df)[order(Trial.), ResultsSoFar:=cumsum(Result), by=Name]
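
A minimal sketch of the unsorted case, assuming a small made-up table in which the trial column is simply called Trial. Putting order(Trial) in i makes the cumulative sum run in trial order within each Name, and := assigns each total back to the row it belongs to, so the original row order is left untouched.

```r
library(data.table)

# Made-up data: same values as in the question, but with shuffled trials
df <- data.table(
  Name = c("Bob", "Bob", "Bob", "Bob", "Nancy", "Nancy"),
  Trial = c(3, 1, 4, 2, 3, 2),
  Result = c(3, 14, 18, 22, 87, 33)
)

# Running total per person, accumulated in trial order
df[order(Trial), ResultsSoFar := cumsum(Result), by = Name]
df
```

Bob's trial 3, for example, should end up with 39 (14 + 22 + 3) even though it is the first row of the table.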

Randomly draw rows from dataframe based on unique values and column values

I have a dataframe with many descriptor variables (trt, individual, session). I want to be able to randomly select a fraction of the possible trt x individual combinations but control for the session variable such that no random pull has the same session number. Here is what my dataframe looks like:
trt <- c(rep(c(rep("A", 3), rep("B", 3), rep("C", 3)), 9))
individual <- rep(c("Bob", "Nancy", "Tim"), 27)
session <- rep(1:27, each = 3)
data <- rnorm(81, mean = 4, sd = 1)
df <- data.frame(trt, individual, session, data)
df
trt individual session data
1 A Bob 1 3.72013685581385
2 A Nancy 1 3.97225419000673
3 A Tim 1 4.44714175686225
4 B Bob 2 5.00024599458127
5 B Nancy 2 3.43615965145765
6 B Tim 2 6.7920094635501
7 C Bob 3 4.36315054477571
8 C Nancy 3 5.07117348146375
9 C Tim 3 4.38503325758969
10 A Bob 4 4.30677162933005
11 A Nancy 4 1.89311687510669
12 A Tim 4 3.09084920968413
13 B Bob 5 3.10436190897144
14 B Nancy 5 3.59454992439722
15 B Tim 5 3.40778069131207
16 C Bob 6 4.00171937800892
17 C Nancy 6 0.14578811080644
18 C Tim 6 4.20754733296227
19 A Bob 7 3.69131009783284
20 A Nancy 7 4.7025756891679
21 A Tim 7 4.46196017363017
22 B Bob 8 3.97573281432736
23 B Nancy 8 4.5373185942686
24 B Tim 8 2.40937847038141
25 C Bob 9 4.57519884980087
26 C Nancy 9 5.19143914630448
27 C Tim 9 4.83144732833874
28 A Bob 10 3.01769965527235
29 A Nancy 10 5.17300616827746
30 A Tim 10 4.65432284571663
31 B Bob 11 4.50892032922527
32 B Nancy 11 3.38082717995663
33 B Tim 11 4.92022245677209
34 C Bob 12 4.54149796547394
35 C Nancy 12 3.21992774137179
36 C Tim 12 3.74507360931023
37 A Bob 13 3.39524949548056
38 A Nancy 13 4.17518916890901
39 A Tim 13 3.02932375225388
40 B Bob 14 3.59660910672907
41 B Nancy 14 2.08784850191654
42 B Tim 14 3.98446125755258
43 C Bob 15 4.01837496797085
44 C Nancy 15 3.40610126858125
45 C Tim 15 4.57107635588582
46 A Bob 16 3.15839276840723
47 A Nancy 16 2.19932140340504
48 A Tim 16 4.77588798035668
49 B Bob 17 4.3524768657397
50 B Nancy 17 4.49071625925856
51 B Tim 17 4.02576463486266
52 C Bob 18 3.74783360762117
53 C Nancy 18 2.84123227236184
54 C Tim 18 3.2024114782253
55 A Bob 19 4.93837445490921
56 A Nancy 19 4.7103051496802
57 A Tim 19 6.22083635045134
58 B Bob 20 4.5177747677824
59 B Nancy 20 1.78839270771153
60 B Tim 20 5.07140678136995
61 C Bob 21 3.47818616035335
62 C Nancy 21 4.28526474048439
63 C Tim 21 4.22597602946575
64 A Bob 22 1.91700925257901
65 A Nancy 22 2.96317997587458
66 A Tim 22 2.53506974227672
67 B Bob 23 5.52714403395316
68 B Nancy 23 3.3618513551059
69 B Tim 23 4.85869007113978
70 C Bob 24 3.4367068543959
71 C Nancy 24 4.47769879000349
72 C Tim 24 5.77340483757836
73 A Bob 25 4.78524317734622
74 A Nancy 25 3.55373702554664
75 A Tim 25 2.88541465503637
76 B Bob 26 4.62885302019139
77 B Nancy 26 3.59430293369092
78 B Tim 26 2.29610255924296
79 C Bob 27 4.38433001299722
80 C Nancy 27 3.77825207859976
81 C Tim 27 2.12163194694365
How do I pull out 2 rows for each trt x individual combination such that every selected row has a unique session number? This is an example of what I want the dataframe to look like:
trt individual session data
1 A Bob 1 3.72013685581385
5 B Nancy 2 3.43615965145765
7 C Bob 3 4.36315054477571
12 A Tim 4 3.09084920968413
15 B Tim 5 3.40778069131207
17 C Nancy 6 0.14578811080644
19 A Bob 7 3.69131009783284
29 A Nancy 10 5.17300616827746
31 B Bob 11 4.50892032922527
34 C Bob 12 4.54149796547394
39 A Tim 13 3.02932375225388
40 B Bob 14 3.59660910672907
47 A Nancy 16 2.19932140340504
51 B Tim 17 4.02576463486266
54 C Tim 18 3.2024114782253
59 B Nancy 20 1.78839270771153
71 C Nancy 24 4.47769879000349
81 C Tim 27 2.12163194694365
I have tried a couple things with no luck.
I have tried simply selecting two rows at random for each trt x individual combination, but I end up with duplicate session values:
setDT(df)
df[ , .SD[sample(.N, 2)] , keyby = .(trt, individual)]
trt individual session data
1: A Bob 25 2.7560788894668
2: A Bob 19 4.12040841647523
3: A Nancy 4 5.35362338127901
4: A Nancy 19 5.51636882737692
5: A Tim 19 5.10553640201998
6: A Tim 1 2.77380671625473
7: B Bob 23 3.50585105164409
8: B Bob 8 3.58167259470814
9: B Nancy 23 2.85301307507985
10: B Nancy 8 2.85179395539781
11: B Tim 26 2.40666507132474
12: B Tim 20 3.31276311351286
13: C Bob 24 3.19076007024549
14: C Bob 3 3.59146613276121
15: C Nancy 9 4.46606667880457
16: C Nancy 15 2.25405252536256
17: C Tim 12 4.43111661206133
18: C Tim 27 4.23868848646589
I have tried randomly selecting one row for each session number and then pulling 2 rows per trt x individual combination, but it typically comes back with an error, since the random selection doesn't grab an equal number of rows for each trt x individual combination:
ind <- sapply( unique(df$session ) , function(x) sample( which(df$session == x) , 1) )
df.unique <- df[ind, ]
df.sub <- df.unique[, .SD[sample(.N, 2)] , by = .(trt, individual)]
Error in `[.data.frame`(df.unique, , .SD[sample(.N, 2)], by = .(trt, individual)) :
unused argument (by = .(trt, individual))
Thanks in advance for your help!
Perhaps there is a clever way to sample, but here's a straightforward idea to get you started in the meantime:
setDT(df)
setkey(df, session)
usedsessions = 0 # some value that's not a session number
df[, {
res = .SD[!.(usedsessions)][sample(.N, 2)]
usedsessions = c(usedsessions, res$session)
res
}
, by = .(trt, individual)]
# trt individual session data
# 1: A Bob 7 4.256668
# 2: A Bob 25 2.431821
# 3: A Nancy 16 4.785859
# 4: A Nancy 19 4.865248
# 5: A Tim 4 3.303689
# 6: A Tim 13 3.550261
# 7: B Bob 26 3.987136
# 8: B Bob 17 3.283055
# 9: B Nancy 14 3.177226
#10: B Nancy 2 3.639542
#11: B Tim 8 2.168447
#12: B Tim 5 3.521123
#13: C Bob 21 3.284245
#14: C Bob 12 5.773098
#15: C Nancy 24 4.624428
#16: C Nancy 9 3.235467
#17: C Tim 18 4.001395
#18: C Tim 27 5.002110
You'll probably need to add corner-case handling (e.g. for when a group has fewer than 2 unused sessions left, so that no valid sample can be drawn).
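
One way such corner-case handling could look is sketched below in base R. It is an illustration, not part of the answer above: sample_unique_sessions and its arguments n and max_tries are hypothetical names, and the function assumes a plain data.frame with the columns trt, individual and session. If a combination runs out of unused sessions, the whole draw is thrown away and retried, and an error is raised when no attempt succeeds.

```r
# Hypothetical helper: draw n rows per trt x individual combination
# without ever reusing a session number.
sample_unique_sessions <- function(df, n, max_tries = 100) {
  combos <- unique(df[c("trt", "individual")])
  for (attempt in seq_len(max_tries)) {
    used <- c()          # sessions taken so far in this attempt
    rows <- integer(0)   # row numbers picked so far in this attempt
    ok <- TRUE
    # Visit the combinations in random order
    for (g in sample(nrow(combos))) {
      cand <- which(df$trt == combos$trt[g] &
                      df$individual == combos$individual[g] &
                      !(df$session %in% used))
      if (length(cand) < n) {
        ok <- FALSE      # not enough unused sessions left: retry from scratch
        break
      }
      pick <- cand[sample.int(length(cand), n)]
      rows <- c(rows, pick)
      used <- c(used, df$session[pick])
    }
    if (ok) return(df[sort(rows), ])
  }
  stop("no valid sample found after ", max_tries, " attempts")
}

set.seed(1)
df <- data.frame(
  trt = rep(rep(c("A", "B", "C"), each = 3), 9),
  individual = rep(c("Bob", "Nancy", "Tim"), 27),
  session = rep(1:27, each = 3),
  data = rnorm(81, mean = 4, sd = 1)
)
res <- sample_unique_sessions(df, n = 2)
res
```

With the example data every attempt succeeds, because each trt has 9 sessions and only 6 are needed; the retry loop matters only for tighter data.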

Data.table selecting columns by name, e.g. using grepl

Say I have the following data.table:
dt <- data.table("x1"=c(1:10), "x2"=c(1:10),"y1"=c(10:1),"y2"=c(10:1), desc = c("a","a","a","b","b","b","b","b","c","c"))
I want to sum the columns starting with an 'x', and sum the columns starting with a 'y', by desc. At the moment I do this by:
dt[,.(Sumx=sum(x1,x2), Sumy=sum(y1,y2)), by=desc]
which works, but I would like to refer to all columns with "x" or "y" by their column names, e.g. using grepl().
Please could you advise me how to do so? I think I need to use with=FALSE, but I cannot get it to work in combination with by=desc.
One-liner:
melt(dt, id="desc", measure.vars=patterns("^x", "^y"), value.name=c("x","y"))[,
lapply(.SD, sum), by=desc, .SDcols=x:y]
Long version (by @Frank):
First, you probably don't want to store your data like that. Instead...
m = melt(dt, id="desc", measure.vars=patterns("^x", "^y"), value.name=c("x","y"))
desc variable x y
1: a 1 1 10
2: a 1 2 9
3: a 1 3 8
4: b 1 4 7
5: b 1 5 6
6: b 1 6 5
7: b 1 7 4
8: b 1 8 3
9: c 1 9 2
10: c 1 10 1
11: a 2 1 10
12: a 2 2 9
13: a 2 3 8
14: b 2 4 7
15: b 2 5 6
16: b 2 6 5
17: b 2 7 4
18: b 2 8 3
19: c 2 9 2
20: c 2 10 1
Then you can do...
setnames(m[, lapply(.SD, sum), by=desc, .SDcols=x:y], 2:3, paste0("Sum", c("x", "y")))[]
# desc Sumx Sumy
#1: a 12 54
#2: b 60 50
#3: c 38 6
For more on improving the data structure you're working with, read about tidying data.
Using mget with grep is another option: grep("^x", ...) returns the names of the columns starting with x, mget fetches the data of those columns, and unlist flattens the result so that you can calculate the sum:
dt[,.(Sumx=sum(unlist(mget(grep("^x", names(dt), value = TRUE)))),
Sumy=sum(unlist(mget(grep("^y", names(dt), value = TRUE))))), by=desc]
# desc Sumx Sumy
#1: a 12 54
#2: b 60 50
#3: c 38 6
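
A variation on the same idea, sketched under the assumption that the column names are looked up once beforehand (xcols and ycols are names invented for this example). Inside j, .SD holds the current group's columns, and with = FALSE selects the ones named in the character vector.

```r
library(data.table)

dt <- data.table(x1 = 1:10, x2 = 1:10, y1 = 10:1, y2 = 10:1,
                 desc = c("a", "a", "a", "b", "b", "b", "b", "b", "c", "c"))

# Look up the matching column names once
xcols <- grep("^x", names(dt), value = TRUE)
ycols <- grep("^y", names(dt), value = TRUE)

# Sum all values of the matching columns within each desc group
res <- dt[, .(Sumx = sum(unlist(.SD[, xcols, with = FALSE])),
              Sumy = sum(unlist(.SD[, ycols, with = FALSE]))),
          by = desc]
res
```

Keeping the name lookup outside the grouped call avoids repeating the grep for every group.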

Remove duplicates column combinations from a dataframe in R

I want to remove duplicate combinations of sessionid, qf and qn from the following data
sessionid qf qn city
1 9cf571c8faa67cad2aa9ff41f3a26e38 cat biddix fresno
2 e30f853d4e54604fd62858badb68113a caleb amos
3 2ad41134cc285bcc06892fd68a471cd7 daniel folkers
4 2ad41134cc285bcc06892fd68a471cd7 daniel folkers
5 63a5e839510a647c1ff3b8aed684c2a5 charles pierce flint
6 691df47f2df12f14f000f9a17d1cc40e j franz prescott+valley
7 691df47f2df12f14f000f9a17d1cc40e j franz prescott+valley
8 b3a1476aa37ae4b799495256324a8d3d carrie mascorro brea
9 bd9f1404b313415e7e7b8769376d2705 fred morales las+vegas
10 b50a610292803dc302f24ae507ea853a aurora lee
11 fb74940e6feb0dc61a1b4d09fcbbcb37 andrew price yorkville
I read in the data as a data.frame and call it mydata. Here is the code I have so far, but I need to know how to first sort the data.frame correctly, secondly remove the duplicate combinations of sessionid, qf, and qn, and lastly plot a histogram of the number of characters in the column qf.
sortDATA<-function(name)
{
#sort the code by session Id, first name, then last name
sort1.name <- name[order("sessionid","qf","qn") , ]
#create a vector of length of first names
sname<-nchar(sort1.name$qf)
hist(sname)
}
thanks!
duplicated() has a method for data.frames, which is designed for just this sort of task:
df <- data.frame(a = c(1:4, 1:4),
b = c(4:1, 4:1),
d = LETTERS[1:8])
df[!duplicated(df[c("a", "b")]),]
# a b d
# 1 1 4 A
# 2 2 3 B
# 3 3 2 C
# 4 4 1 D
In your example the repeated rows were entirely repeated. unique works with data.frames.
udf <- unique( my.data.frame )
As for sorting... joran just posted the answer.
To address your sorting problems, first reading in your example data:
dat <- read.table(text = " sessionid qf qn city
1 9cf571c8faa67cad2aa9ff41f3a26e38 cat biddix fresno
2 e30f853d4e54604fd62858badb68113a caleb amos NA
3 2ad41134cc285bcc06892fd68a471cd7 daniel folkers NA
4 2ad41134cc285bcc06892fd68a471cd7 daniel folkers NA
5 63a5e839510a647c1ff3b8aed684c2a5 charles pierce flint
6 691df47f2df12f14f000f9a17d1cc40e j franz prescott+valley
7 691df47f2df12f14f000f9a17d1cc40e j franz prescott+valley
8 b3a1476aa37ae4b799495256324a8d3d carrie mascorro brea
9 bd9f1404b313415e7e7b8769376d2705 fred morales las+vegas
10 b50a610292803dc302f24ae507ea853a aurora lee NA
11 fb74940e6feb0dc61a1b4d09fcbbcb37 andrew price yorkville ",sep = "",header = TRUE)
and then you can use arrange from plyr,
arrange(dat,sessionid,qf,qn)
or using base functions,
with(dat,dat[order(sessionid,qf,qn),])
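
Putting the three steps from the question together (sort, drop duplicate combinations, histogram of first-name lengths) could look roughly like this in base R. The session ids below are shortened, made-up stand-ins for the real hashes, and the variable names sorted, deduped and name_lengths are only illustrative.

```r
# Small made-up sample with the same columns as in the question
mydata <- data.frame(
  sessionid = c("s3", "s1", "s2", "s2", "s1"),
  qf = c("daniel", "cat", "j", "j", "cat"),
  qn = c("folkers", "biddix", "franz", "franz", "biddix"),
  city = c(NA, "fresno", "prescott+valley", "prescott+valley", "fresno"),
  stringsAsFactors = FALSE
)

# 1. Sort by session id, then first name, then last name
sorted <- mydata[order(mydata$sessionid, mydata$qf, mydata$qn), ]

# 2. Keep the first row of each sessionid/qf/qn combination
deduped <- sorted[!duplicated(sorted[c("sessionid", "qf", "qn")]), ]

# 3. Histogram of the number of characters in the first names
name_lengths <- nchar(deduped$qf)
hist(name_lengths, main = "Length of first names", xlab = "Characters in qf")
```

Note that order() is given the columns themselves rather than their names as strings; order("sessionid", "qf", "qn") would sort the three strings, not the data.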
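If you want to drop every row whose combination occurs more than once (keeping none of the copies), it works if you use duplicated twice: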
> df
a b c d
1 1 2 A 1001
2 2 4 B 1002
3 3 6 B 1002
4 4 8 C 1003
5 5 10 D 1004
6 6 12 D 1004
7 7 13 E 1005
8 8 14 E 1006
> df[!(duplicated(df[c("c","d")]) | duplicated(df[c("c","d")], fromLast = TRUE)), ]
a b c d
1 1 2 A 1001
4 4 8 C 1003
7 7 13 E 1005
8 8 14 E 1006
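
The difference between the two uses of duplicated() can be seen side by side in the short sketch below, which rebuilds the example data frame from this answer (keep_first and drop_all are names chosen for the illustration).

```r
df <- data.frame(
  a = 1:8,
  b = c(2, 4, 6, 8, 10, 12, 13, 14),
  c = c("A", "B", "B", "C", "D", "D", "E", "E"),
  d = c(1001, 1002, 1002, 1003, 1004, 1004, 1005, 1006)
)

key <- df[c("c", "d")]

# Keep the first occurrence of each c/d combination
keep_first <- df[!duplicated(key), ]

# Drop every row whose c/d combination occurs more than once
drop_all <- df[!(duplicated(key) | duplicated(key, fromLast = TRUE)), ]

keep_first$a
drop_all$a
```

The first variant keeps one copy of the repeated B/1002 and D/1004 rows, while the second removes both copies.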
