How to make a data frame in R [duplicate]

This question already has answers here:
How to reshape data from long to wide format
(14 answers)
Closed 7 years ago.
This is my data before reshaping:
Name type cnt
Jay A 1
Jay B 2
John A 3
John B 6
How can I reshape it into this format?
Name A B
Jay 1 2
John 3 6
data.frame() is not working as I expect. Any ideas?
Here is my actual data:
Name type cnt
1 Anne A 30
2 Barbara A 15
3 Ben A 21
4 Cindy A 100
5 Edd A 105
6 Eric A 22
7 Jacky A 17
8 John A 97
9 Lex A 22
10 Nick A 18
11 Paul A 100
12 DoIt B 66
13 FixIt B 66
14 Anne C 185
15 Barbara C 88
16 Ben C 4
17 Eric C 4
18 Jacky C 92
19 Lex C 7
20 Nick C 3
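The answers below use this data as a data frame df. A minimal sketch reconstructing it (values transcribed from the listing above):
df <- data.frame(
  Name = c("Anne", "Barbara", "Ben", "Cindy", "Edd", "Eric", "Jacky", "John",
           "Lex", "Nick", "Paul", "DoIt", "FixIt", "Anne", "Barbara", "Ben",
           "Eric", "Jacky", "Lex", "Nick"),
  type = c(rep("A", 11), rep("B", 2), rep("C", 7)),
  cnt  = c(30, 15, 21, 100, 105, 22, 17, 97, 22, 18, 100, 66, 66,
           185, 88, 4, 4, 92, 7, 3),
  stringsAsFactors = FALSE
)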

Using tidyr's spread
library(tidyr)
spread(df, type, cnt, fill = 0)
Using reshape2's dcast
library(reshape2)
dcast(df, Name ~ type, fill = 0)
# Name A B C
#1 Anne 30 0 185
#2 Barbara 15 0 88
#3 Ben 21 0 4
#4 Cindy 100 0 0
#5 DoIt 0 66 0
#6 Edd 105 0 0
#7 Eric 22 0 4
#8 FixIt 0 66 0
#9 Jacky 17 0 92
#10 John 97 0 0
#11 Lex 22 0 7
#12 Nick 18 0 3
#13 Paul 100 0 0
Or using base R
reshape(df, idvar='Name', timevar='type', direction='wide')
Or
xtabs(cnt~Name+type, df)
Or
with(df, tapply(cnt, list(Name, type), FUN=I))
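In recent versions of tidyr (1.1.0 or later, where values_fill accepts a single value), spread() is superseded by pivot_wider(). With tidyr already loaded as above, a hedged equivalent of the spread() call is:
pivot_wider(df, names_from = type, values_from = cnt, values_fill = 0)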

Related

Pasted tables from Jupyter to StackOverflow get ill-formatted

Consider the code:
df = pd.DataFrame({
"name":["john","jim","eric","jim","john","jim","jim","eric","eric","john"],
"category":["a","b","c","b","a","b","c","c","a","c"],
"amount":[100,200,13,23,40,2,43,92,83,1]
})
df
When I copy the output and paste it here on Stack Overflow, the table loses its formatting:
name category amount sum count
0 john a 100 140 2
1 jim b 200 225 3
2 eric c 13 105 2
3 jim b 23 225 3
4 john a 40 140 2
5 jim b 2 225 3
6 jim c 43 43 1
7 eric c 92 105 2
8 eric a 83 83 1
9 john c 1 1 1
How can I fix this problem?
From an IPython terminal session I can paste:
In [73]: df
Out[73]:
name category amount
0 john a 100
1 jim b 200
2 eric c 13
3 jim b 23
4 john a 40
5 jim b 2
6 jim c 43
7 eric c 92
8 eric a 83
9 john c 1
From a notebook, it looks like print(df) gives a better result:
name category amount
0 john a 100
1 jim b 200
2 eric c 13
3 jim b 23
4 john a 40
5 jim b 2
6 jim c 43
7 eric c 92
8 eric a 83
9 john c 1
When you copy it, the selection should appear as one solid color block.

Randomly draw rows from dataframe based on unique values and column values

I have a data frame with several descriptor variables (trt, individual, session). I want to randomly select a fraction of the possible trt x individual combinations while controlling for session, so that no two sampled rows share the same session number. Here is what my data frame looks like:
trt <- c(rep(c(rep("A", 3), rep("B", 3), rep("C", 3)), 9))
individual <- rep(c("Bob", "Nancy", "Tim"), 27)
session <- rep(1:27, each = 3)
data <- rnorm(81, mean = 4, sd = 1)
df <- data.frame(trt, individual, session, data)
df
trt individual session data
1 A Bob 1 3.72013685581385
2 A Nancy 1 3.97225419000673
3 A Tim 1 4.44714175686225
4 B Bob 2 5.00024599458127
5 B Nancy 2 3.43615965145765
6 B Tim 2 6.7920094635501
7 C Bob 3 4.36315054477571
8 C Nancy 3 5.07117348146375
9 C Tim 3 4.38503325758969
10 A Bob 4 4.30677162933005
11 A Nancy 4 1.89311687510669
12 A Tim 4 3.09084920968413
13 B Bob 5 3.10436190897144
14 B Nancy 5 3.59454992439722
15 B Tim 5 3.40778069131207
16 C Bob 6 4.00171937800892
17 C Nancy 6 0.14578811080644
18 C Tim 6 4.20754733296227
19 A Bob 7 3.69131009783284
20 A Nancy 7 4.7025756891679
21 A Tim 7 4.46196017363017
22 B Bob 8 3.97573281432736
23 B Nancy 8 4.5373185942686
24 B Tim 8 2.40937847038141
25 C Bob 9 4.57519884980087
26 C Nancy 9 5.19143914630448
27 C Tim 9 4.83144732833874
28 A Bob 10 3.01769965527235
29 A Nancy 10 5.17300616827746
30 A Tim 10 4.65432284571663
31 B Bob 11 4.50892032922527
32 B Nancy 11 3.38082717995663
33 B Tim 11 4.92022245677209
34 C Bob 12 4.54149796547394
35 C Nancy 12 3.21992774137179
36 C Tim 12 3.74507360931023
37 A Bob 13 3.39524949548056
38 A Nancy 13 4.17518916890901
39 A Tim 13 3.02932375225388
40 B Bob 14 3.59660910672907
41 B Nancy 14 2.08784850191654
42 B Tim 14 3.98446125755258
43 C Bob 15 4.01837496797085
44 C Nancy 15 3.40610126858125
45 C Tim 15 4.57107635588582
46 A Bob 16 3.15839276840723
47 A Nancy 16 2.19932140340504
48 A Tim 16 4.77588798035668
49 B Bob 17 4.3524768657397
50 B Nancy 17 4.49071625925856
51 B Tim 17 4.02576463486266
52 C Bob 18 3.74783360762117
53 C Nancy 18 2.84123227236184
54 C Tim 18 3.2024114782253
55 A Bob 19 4.93837445490921
56 A Nancy 19 4.7103051496802
57 A Tim 19 6.22083635045134
58 B Bob 20 4.5177747677824
59 B Nancy 20 1.78839270771153
60 B Tim 20 5.07140678136995
61 C Bob 21 3.47818616035335
62 C Nancy 21 4.28526474048439
63 C Tim 21 4.22597602946575
64 A Bob 22 1.91700925257901
65 A Nancy 22 2.96317997587458
66 A Tim 22 2.53506974227672
67 B Bob 23 5.52714403395316
68 B Nancy 23 3.3618513551059
69 B Tim 23 4.85869007113978
70 C Bob 24 3.4367068543959
71 C Nancy 24 4.47769879000349
72 C Tim 24 5.77340483757836
73 A Bob 25 4.78524317734622
74 A Nancy 25 3.55373702554664
75 A Tim 25 2.88541465503637
76 B Bob 26 4.62885302019139
77 B Nancy 26 3.59430293369092
78 B Tim 26 2.29610255924296
79 C Bob 27 4.38433001299722
80 C Nancy 27 3.77825207859976
81 C Tim 27 2.12163194694365
How do I pull out 2 rows of each trt x individual combination, each with a unique session number? This is an example of what I want the data frame to look like:
trt individual session data
1 A Bob 1 3.72013685581385
5 B Nancy 2 3.43615965145765
7 C Bob 3 4.36315054477571
12 A Tim 4 3.09084920968413
15 B Tim 5 3.40778069131207
17 C Nancy 6 0.14578811080644
19 A Bob 7 3.69131009783284
29 A Nancy 10 5.17300616827746
31 B Bob 11 4.50892032922527
34 C Bob 12 4.54149796547394
39 A Tim 13 3.02932375225388
40 B Bob 14 3.59660910672907
47 A Nancy 16 2.19932140340504
51 B Tim 17 4.02576463486266
54 C Tim 18 3.2024114782253
59 B Nancy 20 1.78839270771153
71 C Nancy 24 4.47769879000349
81 C Tim 27 2.12163194694365
I have tried a couple of things with no luck.
First, I tried randomly selecting two rows per trt x individual combination, but I end up with duplicate session values:
setDT(df)
df[ , .SD[sample(.N, 2)] , keyby = .(trt, individual)]
trt individual session data
1: A Bob 25 2.7560788894668
2: A Bob 19 4.12040841647523
3: A Nancy 4 5.35362338127901
4: A Nancy 19 5.51636882737692
5: A Tim 19 5.10553640201998
6: A Tim 1 2.77380671625473
7: B Bob 23 3.50585105164409
8: B Bob 8 3.58167259470814
9: B Nancy 23 2.85301307507985
10: B Nancy 8 2.85179395539781
11: B Tim 26 2.40666507132474
12: B Tim 20 3.31276311351286
13: C Bob 24 3.19076007024549
14: C Bob 3 3.59146613276121
15: C Nancy 9 4.46606667880457
16: C Nancy 15 2.25405252536256
17: C Tim 12 4.43111661206133
18: C Tim 27 4.23868848646589
I have also tried randomly selecting one row per session number and then pulling 2 rows per trt x individual combination, but this typically fails with an error, since the random selection doesn't grab an equal number of trt x individual combinations:
ind <- sapply( unique(df$session ) , function(x) sample( which(df$session == x) , 1) )
df.unique <- df[ind, ]
df.sub <- df.unique[, .SD[sample(.N, 2)] , by = .(trt, individual)]
Error in `[.data.frame`(df.unique, , .SD[sample(.N, 2)], by = .(trt, individual)) :
unused argument (by = .(trt, individual))
Thanks in advance for your help!
Perhaps there is a clever way to sample, but here's a straightforward idea to get you started in the meantime:
setDT(df)
setkey(df, session)
usedsessions = 0 # some value that's not a session number
df[, {
res = .SD[!.(usedsessions)][sample(.N, 2)]
usedsessions = c(usedsessions, res$session)
res
}
, by = .(trt, individual)]
# trt individual session data
# 1: A Bob 7 4.256668
# 2: A Bob 25 2.431821
# 3: A Nancy 16 4.785859
# 4: A Nancy 19 4.865248
# 5: A Tim 4 3.303689
# 6: A Tim 13 3.550261
# 7: B Bob 26 3.987136
# 8: B Bob 17 3.283055
# 9: B Nancy 14 3.177226
#10: B Nancy 2 3.639542
#11: B Tim 8 2.168447
#12: B Tim 5 3.521123
#13: C Bob 21 3.284245
#14: C Bob 12 5.773098
#15: C Nancy 24 4.624428
#16: C Nancy 9 3.235467
#17: C Tim 18 4.001395
#18: C Tim 27 5.002110
You'll probably need to add some corner-case handling (e.g., for when no valid sample exists).
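If that grouped pass ever produces a repeated session anyway, one blunt fallback (a sketch of my own, not part of the answer above) is rejection sampling: redraw until no session number repeats.
repeat {
  samp <- df[, .SD[sample(.N, 2)], by = .(trt, individual)]  # 2 rows per trt x individual
  if (!anyDuplicated(samp$session)) break                    # accept only draws with unique sessions
}
samp
This loop can take many iterations when valid draws are rare, but for a data set of this size it finishes quickly.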

Conditionally mapping values from one data frame to another one R [duplicate]

This question already has answers here:
How to join (merge) data frames (inner, outer, left, right)
(13 answers)
Closed 6 years ago.
I have two dataframes:
> SubObj
sNumber runningTrialNo wordTar SubObj_ind
1 34 nerd 3
1 32 hooligan 1
1 7 villager 3
2 32 oak 2
2 8 deer 2
3 8 mammal 3
> df
sNumber runningTrialNo wordTar
1 34 nerd
1 34 nerd
1 34 nerd
1 32 hooligan
1 32 hooligan
1 7 villager
2 32 oak
2 32 oak
2 8 deer
3 8 mammal
3 8 mammal
I want to map the values from SubObj$SubObj_ind into a new column df$SubObj_ind, so that the values line up by sNumber (subject number) and runningTrialNo (trial number). The result should look something like this:
> df
sNumber runningTrialNo wordTar SubObj_ind
1 34 nerd 3
1 34 nerd 3
1 34 nerd 3
1 32 hooligan 1
1 32 hooligan 1
1 7 villager 3
2 32 oak 2
2 32 oak 2
2 8 deer 2
3 8 mammal 3
3 8 mammal 3
I wrote code that should in principle do this, but it doesn't map over trial and subject number correctly:
df$SubObj_indO <- array(0, nrow(df))
for(i in 1:nrow(SubObj)) {
index <- df$runningTrialNo == SubObj[i,"runningTrialNo"] &
df$sNumber == SubObj[i,"sNumber"]
df$SubObj_ind[index] <- SubObj[index, "SubObj_ind"]
}
What is wrong with this piece of code?
We can use match
df$SubObj_ind <- with(df, SubObj$SubObj_ind[match(wordTar, SubObj$wordTar)])
df
# sNumber runningTrialNo wordTar SubObj_ind
#1 1 34 nerd 3
#2 1 34 nerd 3
#3 1 34 nerd 3
#4 1 32 hooligan 1
#5 1 32 hooligan 1
#6 1 7 villager 3
#7 2 32 oak 2
#8 2 32 oak 2
#9 2 8 deer 2
#10 3 8 mammal 3
#11 3 8 mammal 3
Or use data.table
library(data.table)
setDT(df)[SubObj[c("wordTar", "SubObj_ind")], on = "wordTar"]
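Since the linked duplicate covers joins in general, here is a hedged base-R alternative that joins on both identifying columns instead of on wordTar alone (column names as given in the question):
merge(df, SubObj[, c("sNumber", "runningTrialNo", "SubObj_ind")],
      by = c("sNumber", "runningTrialNo"), all.x = TRUE)
Note that merge() sorts the result by the join columns, so reorder afterwards if the original row order matters.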

Using do() with names of list elements

I am trying to take the names of list elements and use do() to apply a function over them all, then bind the results into a single data frame.
require(XML)
require(magrittr)
require(dplyr)  # group_by() and do() below come from dplyr
url <- "http://gd2.mlb.com/components/game/mlb/year_2016/month_05/day_21/gid_2016_05_21_milmlb_nynmlb_1/boxscore.xml"
box <- xmlParse(url)
xml_data <- xmlToList(box)
end <- length(xml_data[[2]]) - 1
x <- seq(1:end)
away_pitchers_names <- paste0("xml_data[[2]][", x, "]")
away_pitchers_names <- as.data.frame(away_pitchers_names)
names(away_pitchers_names) <- "elements"
away_pitchers_names$elements %<>% as.character()
listTodf <- function(x) {
df <- as.data.frame(x)
tdf <- as.data.frame(t(df))
row.names(tdf) <- NULL
tdf
}
test <- away_pitchers_names %>% group_by(elements) %>% do(listTodf(.$elements))
When I run the listTodf function on a list element it works fine:
listTodf(xml_data[[2]][1])
id name name_display_first_last pos out bf er r h so hr bb np s w l sv bs hld s_ip s_h s_r s_er s_bb
1 605200 Davies Zach Davies P 16 22 4 4 5 5 2 2 86 51 1 3 0 0 0 36.0 41 24 23 15
s_so game_score era
1 25 45 5.75
But when I try to loop through the names of the elements with the do() function I get the following:
Warning message:
In rbind_all(out[[1]]) : Unequal factor levels: coercing to character
And here is the output:
> test
Source: local data frame [5 x 2]
Groups: elements [5]
elements V1
(chr) (chr)
1 xml_data[[2]][1] xml_data[[2]][1]
2 xml_data[[2]][2] xml_data[[2]][2]
3 xml_data[[2]][3] xml_data[[2]][3]
4 xml_data[[2]][4] xml_data[[2]][4]
5 xml_data[[2]][5] xml_data[[2]][5]
I am sure it is something extremely simple, but I can't figure out where things are getting tripped up.
To evaluate the strings, eval(parse()) can be used:
library(dplyr)
lapply(away_pitchers_names$elements,
function(x) as.data.frame.list(eval(parse(text=x))[[1]], stringsAsFactors=FALSE)) %>%
bind_rows()
# id name name_display_first_last pos out bf er r h so hr bb np s w l
#1 605200 Davies Zach Davies P 16 22 4 4 5 5 2 2 86 51 1 3
#2 430641 Boyer Blaine Boyer P 2 4 0 0 2 0 0 0 8 7 1 0
#3 448614 Torres, C Carlos Torres P 3 4 0 0 0 1 0 2 21 11 0 1
#4 592804 Thornburg Tyler Thornburg P 3 3 0 0 0 1 0 0 14 8 2 1
#5 518468 Blazek Michael Blazek P 1 5 1 1 2 0 0 2 23 10 1 1
# sv bs hld s_ip s_h s_r s_er s_bb s_so game_score era loss note
#1 0 0 0 36.0 41 24 23 15 25 45 5.75 <NA> <NA>
#2 0 1 0 21.1 22 4 4 5 7 48 1.69 <NA> <NA>
#3 0 0 2 22.1 22 9 9 14 21 52 3.63 <NA> <NA>
#4 1 2 8 18.2 13 8 8 7 29 54 3.86 <NA> <NA>
#5 0 1 8 21.1 23 6 6 14 18 41 2.53 true (L, 1-1)
However, it is easier and faster to just do
lapply(xml_data[[2]][1:5], function(x)
as.data.frame.list(x, stringsAsFactors=FALSE)) %>%
bind_rows()
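A roughly equivalent formulation with purrr, assuming the same xml_data object, would be:
library(purrr)
# build one single-row data frame per pitcher element, then row-bind them
map_dfr(xml_data[[2]][1:5], ~ as.data.frame.list(.x, stringsAsFactors = FALSE))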

cumulative variable construction in longitudinal data set

The problem:
I would like to construct a variable that measures cumulative work experience within a person-year longitudinal data set. The problem applies to all sorts of longitudinal data sets, and many variables might be constructed in this cumulative way (e.g., number of children, cumulative education, cumulative dollars spent on vacations, etc.).
The case:
I have a large longitudinal data set in which every row constitutes a person year. The data set contains thousands of persons (variable “ID”) followed through their lives (variable “age”), resulting in a data frame with about 1.2 million rows. One variable indicates how many months a person has worked in each person year (variable “work”). For example, when Dan was 15 years old he worked 3 months.
ID age work
1 Dan 10 0
2 Dan 11 0
3 Dan 12 0
4 Dan 13 0
5 Dan 14 0
6 Dan 15 3
7 Dan 16 5
8 Dan 17 8
9 Dan 18 5
10 Dan 19 12
11 Jeff 20 0
12 Jeff 16 0
13 Jeff 17 0
14 Jeff 18 0
15 Jeff 19 0
16 Jeff 20 0
17 Jeff 21 8
18 Jeff 22 10
19 Jeff 23 12
20 Jeff 24 12
21 Jeff 25 12
22 Jeff 26 12
23 Jeff 27 12
24 Jeff 28 12
25 Jeff 29 12
I now want to construct a cumulative work experience variable that carries the running total from year x forward into year x+1. The goal is to know, at each age, how many months a person has worked over their entire career so far. The variable should look like “cumwork” below.
ID age work cumwork
1 Dan 10 0 0
2 Dan 11 0 0
3 Dan 12 0 0
4 Dan 13 0 0
5 Dan 14 0 0
6 Dan 15 3 3
7 Dan 16 5 8
8 Dan 17 8 16
9 Dan 18 5 21
10 Dan 19 12 33
11 Jeff 20 0 0
12 Jeff 16 0 0
13 Jeff 17 0 0
14 Jeff 18 0 0
15 Jeff 19 0 0
16 Jeff 20 0 0
17 Jeff 21 8 8
18 Jeff 22 10 18
19 Jeff 23 12 30
20 Jeff 24 12 42
21 Jeff 25 12 54
22 Jeff 26 12 66
23 Jeff 27 12 78
24 Jeff 28 12 90
25 Jeff 29 12 102
A poor solution: I can construct such a cumulative variable using the following simple loop:
# Generate test data set
x = data.frame(ID = c(rep("Dan", times = 10), rep("Jeff", times = 15)),
               age = c(10:20, 16:29),
               work = c(rep(0, times = 5), 3, 5, 8, 5, 12, rep(0, times = 6), 8, 10, rep(12, times = 7)),
               stringsAsFactors = FALSE)
# Generate cumulative work experience variable
x$cumwork=x$work
for(r in 2:nrow(x)){
if(x$ID[r]==x$ID[r-1]){
x$cumwork[r]=x$cumwork[r-1]+x$cumwork[r]
}
}
However, my data set has 1.2 million rows; looping through each row is highly inefficient, and running this loop would take hours. Does any brilliant programmer have a suggestion for how to construct this cumulative measure more efficiently?
Many thanks in advance!
Best,
Raphael
ave is convenient for these types of tasks. The function you want to use with it is cumsum:
x$cumwork <- ave(x$work, x$ID, FUN = cumsum)
x
# ID age work cumwork
# 1 Dan 10 0 0
# 2 Dan 11 0 0
# 3 Dan 12 0 0
# 4 Dan 13 0 0
# 5 Dan 14 0 0
# 6 Dan 15 3 3
# 7 Dan 16 5 8
# 8 Dan 17 8 16
# 9 Dan 18 5 21
# 10 Dan 19 12 33
# 11 Jeff 20 0 0
# 12 Jeff 16 0 0
# 13 Jeff 17 0 0
# 14 Jeff 18 0 0
# 15 Jeff 19 0 0
# 16 Jeff 20 0 0
# 17 Jeff 21 8 8
# 18 Jeff 22 10 18
# 19 Jeff 23 12 30
# 20 Jeff 24 12 42
# 21 Jeff 25 12 54
# 22 Jeff 26 12 66
# 23 Jeff 27 12 78
# 24 Jeff 28 12 90
# 25 Jeff 29 12 102
However, given the scale of your data, I would strongly suggest the "data.table" package, which gives you convenient syntax for the same operation:
library(data.table)
DT <- data.table(x)
DT[, cumwork := cumsum(work), by = ID]
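For completeness, a hedged dplyr equivalent of the same grouped cumulative sum:
library(dplyr)
x %>%
  group_by(ID) %>%
  mutate(cumwork = cumsum(work)) %>%
  ungroup()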
