Consider the code:
import pandas as pd
df = pd.DataFrame({
"name":["john","jim","eric","jim","john","jim","jim","eric","eric","john"],
"category":["a","b","c","b","a","b","c","c","a","c"],
"amount":[100,200,13,23,40,2,43,92,83,1]
})
df
When I copy the output, the table is badly formatted here on Stack Overflow:
name category amount sum count
0 john a 100 140 2
1 jim b 200 225 3
2 eric c 13 105 2
3 jim b 23 225 3
4 john a 40 140 2
5 jim b 2 225 3
6 jim c 43 43 1
7 eric c 92 105 2
8 eric a 83 83 1
9 john c 1 1 1
How to fix this problem?
From an IPython terminal session I can paste:
In [73]: df
Out[73]:
   name category  amount
0  john        a     100
1   jim        b     200
2  eric        c      13
3   jim        b      23
4  john        a      40
5   jim        b       2
6   jim        c      43
7  eric        c      92
8  eric        a      83
9  john        c       1
From a notebook, it looks like print(df) gives a better result:
   name category  amount
0  john        a     100
1   jim        b     200
2  eric        c      13
3   jim        b      23
4  john        a      40
5   jim        b       2
6   jim        c      43
7  eric        c      92
8  eric        a      83
9  john        c       1
When you select the printed output to copy it, the selection should appear as one solid block of color.
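As a small sketch of the print(df) approach: DataFrame.to_string() returns the same fixed-width text that print(df) shows, and prefixing every line with four spaces makes it paste as a code block in Markdown. The helper name as_code_block is hypothetical, not part of pandas.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["john", "jim", "eric", "jim", "john", "jim", "jim", "eric", "eric", "john"],
    "category": ["a", "b", "c", "b", "a", "b", "c", "c", "a", "c"],
    "amount": [100, 200, 13, 23, 40, 2, 43, 92, 83, 1],
})


def as_code_block(frame: pd.DataFrame, indent: int = 4) -> str:
    """Return the fixed-width text of `frame`, each line indented for a Markdown code block.

    Hypothetical helper for illustration; only `DataFrame.to_string` comes from pandas.
    """
    pad = " " * indent
    return "\n".join(pad + line for line in frame.to_string().splitlines())


block = as_code_block(df)
print(block)  # copy this text from the terminal or notebook output
```

Copying the printed text rather than the rendered HTML table keeps the column alignment intact.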
Name Trial# Result ResultsSoFar
1 Bob 1 14 14
2 Bob 2 22 36
3 Bob 3 3 39
4 Bob 4 18 57
5 Nancy 2 33 33
6 Nancy 3 87 120
Say I have the data frame above. What's the best way to generate the "ResultsSoFar" column, which is the running sum of that person's results up to and including that trial? (Bob's results do not include Nancy's, and vice versa.)
With data.table you can do:
library(data.table)
setDT(df)[, ResultsSoFar:=cumsum(Result), by=Name]
df
Name Trial. Result ResultsSoFar
1: Bob 1 14 14
2: Bob 2 22 36
3: Bob 3 3 39
4: Bob 4 18 57
5: Nancy 2 33 33
6: Nancy 3 87 120
Note:
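If the rows are not sorted by Trial#, order them in i so that the cumulative sum follows trial order and each row still receives its own running total: setDT(df)[order(Trial.), ResultsSoFar := cumsum(Result), by = Name]. (R's default name checking turns Trial# into Trial., which is why the column appears under that name.)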
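For readers working in pandas, as in the first question, a rough counterpart of the grouped cumulative sum is sketched below. It is an illustration under assumptions, not part of the data.table answer: the column is named Trial here instead of Trial#, and the frame is rebuilt by hand from the table in the question.

```python
import pandas as pd

# Rows deliberately shuffled to show that sorting by trial comes first.
df = pd.DataFrame({
    "Name": ["Bob", "Nancy", "Bob", "Bob", "Nancy", "Bob"],
    "Trial": [3, 3, 1, 4, 2, 2],
    "Result": [3, 87, 14, 18, 33, 22],
})

# Sort so that the running total follows trial order within each person.
df = df.sort_values(["Name", "Trial"]).reset_index(drop=True)

# cumsum() on a grouped column restarts the running total for every Name.
df["ResultsSoFar"] = df.groupby("Name")["Result"].cumsum()
print(df)
```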
I have a data frame with a series of ID columns (trt, individual, and session):
trt<-c(rep("A",3),rep("B",3),rep("C",3),rep("A",3),rep("B",3),rep("C",3),rep("A",3),rep("B",3),rep("C",3))
individual<-rep(c("Bob","Nancy","Tim"),9)
session<-c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,6,6,6,7,7,7,8,8,8,9,9,9)
data<-rnorm(27,mean=4,sd=1)
df<-as.data.frame(cbind(trt,individual,session,data))
df
trt individual session data
1 A Bob 1 4.36604594311893
2 A Nancy 1 3.29568979189961
3 A Tim 1 3.55849387209243
4 B Bob 2 5.41661201729216
5 B Nancy 2 4.7158873476798
6 B Tim 2 5.34401708530548
7 C Bob 3 4.54277206331273
8 C Nancy 3 3.53976115781019
9 C Tim 3 3.7954788384957
10 A Bob 4 4.75145309337952
11 A Nancy 4 4.7995601464568
12 A Tim 4 3.17821205815185
13 B Bob 5 3.62379779744325
14 B Nancy 5 4.07387328854209
15 B Tim 5 5.60156909861945
16 C Bob 6 4.06727142161431
17 C Nancy 6 4.59940289933985
18 C Tim 6 3.07543217234973
19 A Bob 7 2.63468285023662
20 A Nancy 7 3.22650587327078
21 A Tim 7 6.31062631711196
22 B Bob 8 4.69047076193906
23 B Nancy 8 4.79190101388308
24 B Tim 8 1.61906440409175
25 C Bob 9 2.85180524036416
26 C Nancy 9 3.43304058627408
27 C Tim 9 4.89263600498695
I am looking to create a new data frame in which I have randomly pulled one row for each trt x individual combination, under the constraint that each unique session number is selected only once.
This is what I want my dataframe to look like:
trt individual session data
2 A Nancy 1 3.29568979189961
4 B Bob 2 5.41661201729216
9 C Tim 3 3.7954788384957
10 A Bob 4 4.75145309337952
15 B Tim 5 5.60156909861945
17 C Nancy 6 4.59940289933985
21 A Tim 7 6.31062631711196
23 B Nancy 8 4.79190101388308
25 C Bob 9 2.85180524036416
I know how to randomly select one row for each trt x individual combination:
setDT(df)
newdf<-df[, .SD[sample(.N, 1)] , by=.(trt, individual)]
newdf
trt individual session data
1: A Bob 4 4.75145309337952
2: A Nancy 1 3.29568979189961
3: A Tim 7 6.31062631711196
4: B Bob 8 4.69047076193906
5: B Nancy **2** 4.7158873476798
6: B Tim **2** 5.34401708530548
7: C Bob 6 4.06727142161431
8: C Nancy 9 3.43304058627408
9: C Tim 3 3.7954788384957
But I don't know how to restrict the pulls so that each session is pulled only once (i.e. no duplicate sessions like the ones marked above).
Thanks in advance for your help!
This needs to iterate through the data.table, so it might not be quick, but it doesn't require setting any parameters for the fields of interest:
library(data.table)
set.seed(7)
setDT(df)
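dt1 <- df[, .SD[sample(.N)]]  # shuffle the rows
dt1[, i := .I]                # remember each row's position in the shuffled table
dt1[, flag := NA]             # NA means "not decided yet"
setkey(dt1, flag)
# Walk the rows in shuffled order: an undecided row is kept (TRUE), and every other
# row sharing its trt x individual combination or its session is rejected (FALSE).
invisible(lapply(dt1$i, function(x) {
  dt1[is.na(flag[x]) & (trt == trt[x] & individual == individual[x] | session == session[x]), flag := i == x]
}))
dt1[flag == TRUE, ]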
trt individual session data i flag
1: C Tim 9 3.63712332100071 1 TRUE
2: A Nancy 4 4.54908662150973 2 TRUE
3: A Tim 1 5.84217708521442 3 TRUE
4: B Tim 2 2.37343483362789 5 TRUE
5: C Nancy 3 2.87792051390258 7 TRUE
6: A Bob 7 3.45471592963754 12 TRUE
7: B Nancy 8 4.54792567807183 15 TRUE
8: C Bob 6 4.45667777212948 24 TRUE
9: B Bob 5 2.33285598638319 27 TRUE
I have a dataframe with many descriptor variables (trt, individual, session). I want to be able to randomly select a fraction of the possible trt x individual combinations but control for the session variable such that no random pull has the same session number. Here is what my dataframe looks like:
trt <- c(rep(c(rep("A", 3), rep("B", 3), rep("C", 3)), 9))
individual <- rep(c("Bob", "Nancy", "Tim"), 27)
session <- rep(1:27, each = 3)
data <- rnorm(81, mean = 4, sd = 1)
df <- data.frame(trt, individual, session, data)
df
trt individual session data
1 A Bob 1 3.72013685581385
2 A Nancy 1 3.97225419000673
3 A Tim 1 4.44714175686225
4 B Bob 2 5.00024599458127
5 B Nancy 2 3.43615965145765
6 B Tim 2 6.7920094635501
7 C Bob 3 4.36315054477571
8 C Nancy 3 5.07117348146375
9 C Tim 3 4.38503325758969
10 A Bob 4 4.30677162933005
11 A Nancy 4 1.89311687510669
12 A Tim 4 3.09084920968413
13 B Bob 5 3.10436190897144
14 B Nancy 5 3.59454992439722
15 B Tim 5 3.40778069131207
16 C Bob 6 4.00171937800892
17 C Nancy 6 0.14578811080644
18 C Tim 6 4.20754733296227
19 A Bob 7 3.69131009783284
20 A Nancy 7 4.7025756891679
21 A Tim 7 4.46196017363017
22 B Bob 8 3.97573281432736
23 B Nancy 8 4.5373185942686
24 B Tim 8 2.40937847038141
25 C Bob 9 4.57519884980087
26 C Nancy 9 5.19143914630448
27 C Tim 9 4.83144732833874
28 A Bob 10 3.01769965527235
29 A Nancy 10 5.17300616827746
30 A Tim 10 4.65432284571663
31 B Bob 11 4.50892032922527
32 B Nancy 11 3.38082717995663
33 B Tim 11 4.92022245677209
34 C Bob 12 4.54149796547394
35 C Nancy 12 3.21992774137179
36 C Tim 12 3.74507360931023
37 A Bob 13 3.39524949548056
38 A Nancy 13 4.17518916890901
39 A Tim 13 3.02932375225388
40 B Bob 14 3.59660910672907
41 B Nancy 14 2.08784850191654
42 B Tim 14 3.98446125755258
43 C Bob 15 4.01837496797085
44 C Nancy 15 3.40610126858125
45 C Tim 15 4.57107635588582
46 A Bob 16 3.15839276840723
47 A Nancy 16 2.19932140340504
48 A Tim 16 4.77588798035668
49 B Bob 17 4.3524768657397
50 B Nancy 17 4.49071625925856
51 B Tim 17 4.02576463486266
52 C Bob 18 3.74783360762117
53 C Nancy 18 2.84123227236184
54 C Tim 18 3.2024114782253
55 A Bob 19 4.93837445490921
56 A Nancy 19 4.7103051496802
57 A Tim 19 6.22083635045134
58 B Bob 20 4.5177747677824
59 B Nancy 20 1.78839270771153
60 B Tim 20 5.07140678136995
61 C Bob 21 3.47818616035335
62 C Nancy 21 4.28526474048439
63 C Tim 21 4.22597602946575
64 A Bob 22 1.91700925257901
65 A Nancy 22 2.96317997587458
66 A Tim 22 2.53506974227672
67 B Bob 23 5.52714403395316
68 B Nancy 23 3.3618513551059
69 B Tim 23 4.85869007113978
70 C Bob 24 3.4367068543959
71 C Nancy 24 4.47769879000349
72 C Tim 24 5.77340483757836
73 A Bob 25 4.78524317734622
74 A Nancy 25 3.55373702554664
75 A Tim 25 2.88541465503637
76 B Bob 26 4.62885302019139
77 B Nancy 26 3.59430293369092
78 B Tim 26 2.29610255924296
79 C Bob 27 4.38433001299722
80 C Nancy 27 3.77825207859976
81 C Tim 27 2.12163194694365
How do I pull out 2 rows for each trt x individual combination so that every session number is unique? This is an example of what I want the data frame to look like:
trt individual session data
1 A Bob 1 3.72013685581385
5 B Nancy 2 3.43615965145765
7 C Bob 3 4.36315054477571
12 A Tim 4 3.09084920968413
15 B Tim 5 3.40778069131207
17 C Nancy 6 0.14578811080644
19 A Bob 7 3.69131009783284
29 A Nancy 10 5.17300616827746
31 B Bob 11 4.50892032922527
34 C Bob 12 4.54149796547394
39 A Tim 13 3.02932375225388
40 B Bob 14 3.59660910672907
47 A Nancy 16 2.19932140340504
51 B Tim 17 4.02576463486266
54 C Tim 18 3.2024114782253
59 B Nancy 20 1.78839270771153
71 C Nancy 24 4.47769879000349
81 C Tim 27 2.12163194694365
I have tried a couple things with no luck.
I have tried to just randomly select two trt x individual combinations, but I end up with duplicate session values:
setDT(df)
df[ , .SD[sample(.N, 2)] , keyby = .(trt, individual)]
trt individual session data
1: A Bob 25 2.7560788894668
2: A Bob 19 4.12040841647523
3: A Nancy 4 5.35362338127901
4: A Nancy 19 5.51636882737692
5: A Tim 19 5.10553640201998
6: A Tim 1 2.77380671625473
7: B Bob 23 3.50585105164409
8: B Bob 8 3.58167259470814
9: B Nancy 23 2.85301307507985
10: B Nancy 8 2.85179395539781
11: B Tim 26 2.40666507132474
12: B Tim 20 3.31276311351286
13: C Bob 24 3.19076007024549
14: C Bob 3 3.59146613276121
15: C Nancy 9 4.46606667880457
16: C Nancy 15 2.25405252536256
17: C Tim 12 4.43111661206133
18: C Tim 27 4.23868848646589
I have also tried randomly selecting one row per session number and then pulling 2 trt x individual combinations, but that typically fails with an error, since the random selection doesn't grab an equal number of trt x individual combinations:
ind <- sapply( unique(df$session ) , function(x) sample( which(df$session == x) , 1) )
df.unique <- df[ind, ]
df.sub <- df.unique[, .SD[sample(.N, 2)] , by = .(trt, individual)]
Error in `[.data.frame`(df.unique, , .SD[sample(.N, 2)], by = .(trt, individual)) :
unused argument (by = .(trt, individual))
Thanks in advance for your help!
Perhaps there is a clever way to sample, but here's a straightforward idea to get you started in the meantime:
setDT(df)
setkey(df, session)
usedsessions = 0 # some value that's not a session number
df[, {
res = .SD[!.(usedsessions)][sample(.N, 2)]
usedsessions = c(usedsessions, res$session)
res
}
, by = .(trt, individual)]
# trt individual session data
# 1: A Bob 7 4.256668
# 2: A Bob 25 2.431821
# 3: A Nancy 16 4.785859
# 4: A Nancy 19 4.865248
# 5: A Tim 4 3.303689
# 6: A Tim 13 3.550261
# 7: B Bob 26 3.987136
# 8: B Bob 17 3.283055
# 9: B Nancy 14 3.177226
#10: B Nancy 2 3.639542
#11: B Tim 8 2.168447
#12: B Tim 5 3.521123
#13: C Bob 21 3.284245
#14: C Bob 12 5.773098
#15: C Nancy 24 4.624428
#16: C Nancy 9 3.235467
#17: C Tim 18 4.001395
#18: C Tim 27 5.002110
You'll probably need to add corner-case handling (e.g. for when no valid sample exists).
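One possible way to handle that corner case is to detect a dead end (a group with too few unused sessions left) and retry with a fresh random order, giving up after a fixed number of attempts. The sketch below shows the idea in plain Python rather than data.table. The function name sample_unique_sessions, its parameters, and the tuple layout are all assumptions made for illustration.

```python
import random


def sample_unique_sessions(rows, per_group=2, max_tries=100, seed=None):
    """Pick `per_group` rows per (trt, individual) so that no session repeats.

    rows: list of (trt, individual, session) tuples (hypothetical layout).
    Raises ValueError if no valid draw is found within `max_tries` attempts.
    """
    rng = random.Random(seed)
    groups = {}
    for row in rows:
        groups.setdefault(row[:2], []).append(row)

    for _ in range(max_tries):
        used, picked, ok = set(), [], True
        keys = list(groups)
        rng.shuffle(keys)  # vary the group order between attempts
        for key in keys:
            candidates = groups[key][:]
            rng.shuffle(candidates)
            chosen = []
            for row in candidates:
                if row[2] not in used:  # skip sessions that are already taken
                    chosen.append(row)
                    used.add(row[2])
                    if len(chosen) == per_group:
                        break
            if len(chosen) < per_group:  # corner case: this attempt is a dead end
                ok = False
                break
            picked.extend(chosen)
        if ok:
            return picked
    raise ValueError("no valid sample found; the constraints may be unsatisfiable")


# Same layout as the question: 27 sessions, one trt per session, three individuals.
rows = [("ABC"[(s - 1) % 3], person, s)
        for s in range(1, 28) for person in ("Bob", "Nancy", "Tim")]
result = sample_unique_sessions(rows, per_group=2, seed=7)
print(len(result), "rows,", len({r[2] for r in result}), "distinct sessions")
```

Retrying is a simple choice, not an exhaustive search: it reports failure after max_tries attempts even if a valid sample exists but was not found.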
This is my input data:
Name type cnt
Jay A 1
Jay B 2
John A 3
John B 6
How can I reshape the data like this?
Name A B
Jay 1 2
John 3 6
data.frame() is not working as I expect. Any ideas?
Here is my real data:
Name type cnt
1 Anne A 30
2 Barbara A 15
3 Ben A 21
4 Cindy A 100
5 Edd A 105
6 Eric A 22
7 Jacky A 17
8 John A 97
9 Lex A 22
10 Nick A 18
11 Paul A 100
12 DoIt B 66
13 FixIt B 66
14 Anne C 185
15 Barbara C 88
16 Ben C 4
17 Eric C 4
18 Jacky C 92
19 Lex C 7
20 Nick C 3
Using tidyr's spread
library(tidyr)
spread(df, type, cnt, fill = 0)
Using reshape2's dcast
library(reshape2)
dcast(df, Name ~ type, value.var = "cnt", fill = 0)
# Name A B C
#1 Anne 30 0 185
#2 Barbara 15 0 88
#3 Ben 21 0 4
#4 Cindy 100 0 0
#5 DoIt 0 66 0
#6 Edd 105 0 0
#7 Eric 22 0 4
#8 FixIt 0 66 0
#9 Jacky 17 0 92
#10 John 97 0 0
#11 Lex 22 0 7
#12 Nick 18 0 3
#13 Paul 100 0 0
Or using base R
reshape(df, idvar='Name', timevar='type', direction='wide')
Or
xtabs(cnt~Name+type, df)
Or
with(df, tapply(cnt, list(Name, type), FUN=I))
I'm new to R. I've tried Googling, but I'm failing to find a solution.
Here's my data frame:
Name Value
Bob 50
Mary 55
John 51
Todd 50
Linda 56
Tom 55
I've sorted it, but I need to add a rank column so that it looks like this:
Name Value Rank
Bob 50 1
Todd 50 1
John 51 2
Mary 55 3
Tom 55 3
Linda 56 4
So what I found is:
resultset$Rank <- ave(resultset$Name, resultset$Value, FUN = rank)
But this gives me:
Name Value Rank
Bob 50 1
Todd 50 2
John 51 1
Mary 55 1
Tom 55 2
Linda 56 1
So close but yet so far...
Here's a base-R solution:
uv <- unique(df$Value)
merge(df,data.frame(uv,r=rank(uv)),by.x="Value",by.y="uv")
which gives
Value Name r
1 50 Bob 1
2 50 Todd 1
3 51 John 2
4 55 Mary 3
5 55 Tom 3
6 56 Linda 4
This is memory-inefficient and has the side effect of re-sorting your data. Alternatively, you could do:
require(data.table)
DT <- data.table(df)
DT[order(Value),r:=.GRP,by=Value]
which gives
Name Value r
1: Bob 50 1
2: Mary 55 3
3: John 51 2
4: Todd 50 1
5: Linda 56 4
6: Tom 55 3
No need to sort... You can use dense_rank from "dplyr":
library(dplyr)
mydf %>% mutate(rank = dense_rank(Value))
Name Value rank
1 Bob 50 1
2 Mary 55 3
3 John 51 2
4 Todd 50 1
5 Linda 56 4
6 Tom 55 3
I think your rank variable can be obtained from 1:length(unique(df$value)), matched against the sorted unique values. Below is my attempt.
df <- data.frame(name = c("Bob", "Mary", "John", "Todd", "Linda", "Tom"),
value = c(50, 55, 51, 50, 56, 55))
# rank by lengths of unique values
rank <- data.frame(rank = 1:length(unique(df$value)), value = sort(unique(df$value)))
merge(df, rank, by="value")
value name rank
1 50 Bob 1
2 50 Todd 1
3 51 John 2
4 55 Mary 3
5 55 Tom 3
6 56 Linda 4
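The answers above all compute a dense rank: a value's rank is its 1-based position among the sorted unique values. For readers working in pandas, as in the first question, the sketch below shows that definition next to Series.rank(method="dense"). The frame is rebuilt by hand from the question, and the column name RankByLookup is an assumption for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Bob", "Mary", "John", "Todd", "Linda", "Tom"],
    "Value": [50, 55, 51, 50, 56, 55],
})

# Built-in dense rank: ties share a rank and no rank numbers are skipped.
df["Rank"] = df["Value"].rank(method="dense").astype(int)

# The same idea spelled out: position of each value among the sorted unique values.
position = {value: i for i, value in enumerate(sorted(set(df["Value"])), start=1)}
df["RankByLookup"] = df["Value"].map(position)

print(df.sort_values(["Rank", "Name"]))
```

Neither approach requires sorting the data frame itself; the final sort_values call is only there to display the rows in rank order.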