moving results from one data frame to a data set - r

I am working with two different data sets, and I'd like to move data from one to the other. I'm thinking of it this way: One contains the results, paired with the correct factor (HTm), and I want to spread those out over another frame. Here is the first frame:
head(five)
Week Game.ID VTm VPts HTm HPts HDifferential VDifferential
1 1 NFL_20050908_OAK#NE OAK 20 NE 30 10 -10
2 1 NFL_20050911_ARI#NYG ARI 19 NYG 42 23 -23
3 1 NFL_20050911_CHI#WAS CHI 7 WAS 9 2 -2
4 1 NFL_20050911_CIN#CLE CIN 27 CLE 13 -14 14
5 1 NFL_20050911_DAL#SD DAL 28 SD 24 -4 4
6 1 NFL_20050911_DEN#MIA DEN 10 MIA 34 24 -24
VTm.f HTm.f average
1 OAK NE 19.4375
2 ARI NYG 19.4375
3 CHI WAS 19.4375
4 CIN CLE 19.4375
5 DAL SD 19.4375
6 DEN MIA 19.4375
> tail(five)
Week Game.ID VTm VPts HTm HPts HDifferential VDiff
262 19 NFL_20060114_WAS#SEA WAS 10 SEA 20 10 -10
263 19 NFL_20060115_CAR#CHI CAR 29 CHI 21 -8 8
264 19 NFL_20060115_PIT#IND PIT 21 IND 18 -3 3
265 20 NFL_20060122_CAR#SEA CAR 14 SEA 34 20 -20
266 20 NFL_20060122_PIT#DEN PIT 34 DEN 17 -17 17
267 21 NFL_20060205_SEA#PIT SEA 10 PIT 21 11 -11
VTm.f HTm.f average
262 WAS SEA 0
263 CAR CHI 0
264 PIT IND 0
265 CAR SEA 0
266 PIT DEN 0
267 SEA PIT 0
and here is the other (aggregated means from the first frame).
head(fiveINFO)
HTm HPts VPts average
1 ARI 19.87500 19.00000 19.43750
2 ATL 24.75000 19.12500 21.93750
3 BAL 19.37500 13.75000 16.56250
4 BUF 16.50000 17.37500 16.93750
5 CAR 25.12500 23.27273 24.19886
6 CHI 18.77778 14.00000 16.38889
tail(fiveINFO)
VTm HPts VPts average
27 SEA 21.00 25.000 23.0000
28 SF 30.75 12.625 21.6875
29 STL 28.00 22.000 25.0000
30 TB 15.75 15.375 15.5625
31 TEN 28.00 14.750 21.3750
32 WAS 20.60 18.800 19.7000
For reference, this data is looking at NFL scores. I want to take the averages in fiveINFO, frame two, and move them to the corresponding team in the first frame. five is 266 rows long, while fiveINFO is 32 rows — fiveINFO contains each HTm only once, while five contains each one 8-10 times, depending on the number of home games each team plays. I found several answers that seemed similar, but with much smaller data sets. I don't want to merge the two; I want the averages data from the second frame to be spread across the appropriate HTm values in the first frame.
I'm imagining I'll need to use some kind of for loop for this, but everything I'm doing is striking out. Help?

total<-merge(five, fiveINFO, by="HTm")
Here total is the data frame that holds the merged columns from five and fiveINFO, matched on the HTm column. By default, rows whose HTm value does not appear in both frames are dropped. To keep them, with NA in the unmatched columns, pass all=TRUE (or all.x=TRUE / all.y=TRUE) to merge().
You can also remove extra columns that you don't want after merging.
total <- subset(total, select = -c(HPts, VPts)) # remove the HPts and VPts columns from the merged data frame
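As a minimal sketch of the lookup itself, the toy frames below stand in for five and fiveINFO (the names games and team_info, and all values, are made up for illustration). It shows two base R options: indexing with match(), and merging only the columns that are needed so that columns shared by both frames do not come back with .x/.y suffixes.

```r
# Toy stand-ins for `five` (one row per game) and `fiveINFO` (one row per team).
games <- data.frame(
  Game = 1:5,
  HTm  = c("NE", "NYG", "NE", "WAS", "NYG"),
  stringsAsFactors = FALSE
)
team_info <- data.frame(
  HTm     = c("NE", "NYG", "WAS"),
  average = c(21.5, 19.0, 17.25),
  stringsAsFactors = FALSE
)

# Option 1: look up each game's home team in the per-team table.
# match() gives, for every games$HTm, the row of team_info holding that team.
games$average <- team_info$average[match(games$HTm, team_info$HTm)]

# Option 2: merge only the key and the wanted column; all.x = TRUE keeps
# every game even if its team is missing from team_info (filled with NA).
merged <- merge(games[, c("Game", "HTm")], team_info, by = "HTm", all.x = TRUE)
merged <- merged[order(merged$Game), ]  # merge() sorts by key; restore game order
```

Neither option needs a for loop: each team's average is repeated for every row in which that team is the home team.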

Related

For Loop Alternative on large data frame that runs a different filter with each iteration

I'm running a loop that, for each row i, takes the ranking in R1[i] and filters the data frame to rows whose rankings fall within a range around it, while also filtering on R2[i], the opponent's ranking. The result is a new data frame that only includes matches between players in those two ranking ranges, so that I can take the mean of a column over just those matches.
For example: Player 1 is ranked 10th and Player 2 is ranked 34th. The following code takes every match that involves players ranked within 5 places of 10 (5-15), widened by a further 20% of 10, and players ranked within 5 places of 34 (29-39), widened by a further 20% of 34.
It then takes the mean of Data_Dif, writes it back to row i of the original data frame, and repeats this for every row.
This code works, but it is a bit messy and takes 4 hours to run over 57,000 matches. Does anyone have a faster solution? I have to run this every day.
library(dplyr)
for (i in seq_len(nrow(Data))) { # loop header was missing from the post; reconstructed from the closing brace and the use of i
  Rank <- Data %>%
    filter(between(R1, Data$R1[i]-5-(Data$R1[i]*0.2), Data$R1[i]+5+(Data$R1[i]*0.2)) |
           between(R1, Data$R2[i]-5-(Data$R2[i]*0.2), Data$R2[i]+5+(Data$R2[i]*0.2))) %>%
    filter(between(R2, Data$R1[i]-5-(Data$R1[i]*0.2), Data$R1[i]+5+(Data$R1[i]*0.2)) |
           between(R2, Data$R2[i]-5-(Data$R2[i]*0.2), Data$R2[i]+5+(Data$R2[i]*0.2)))
  Rank_Difference <- Data$Rank_Dif[i]
  Rank <- Rank %>% filter(Rank_Dif >= Rank_Difference-5)
  Data$Rank_Adv[i] <- mean(Rank$Data_Dif)
}
Data
R1 R2 Rank_Dif Data_Dif Rank_Adv
1 2 1 1 -0.272 0.037696970
2 10 34 24 0.377 0.146838617
3 10 29 19 0.373 0.130336232
4 2 5 3 0.134 0.076242424
5 34 17 17 -0.196 0.094226519
6 1 18 17 0.144 0.186158879
7 17 25 8 0.264 0.036212219
8 42 18 24 0.041 0.102343915
9 5 13 8 -0.010 0.091952381
10 34 21 13 -0.226 0.060790576
11 2 14 12 0.022 0.122350649
12 10 158 148 0.330 0.184901961
13 11 1 10 -0.042 0.109918367
14 29 52 23 0.463 0.054469108
15 10 1000 990 0.628 0.437600000
16 17 329 312 0.445 0.307750000
17 11 20 9 0.216 0.072621875
18 417 200 217 -0.466 0.106737401
19 5 53 48 0.273 0.243890710
20 14 7 7 -0.462 0.075739414
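The window arithmetic used in the filter above can be written out as a small helper. This is only a sketch: rank_window is a hypothetical name, not something from the post, and it reproduces the bounds r - 5 - 0.2*r and r + 5 + 0.2*r for a given rank r.

```r
# Hypothetical helper: the ranking window used in the filter, for a rank r.
rank_window <- function(r) {
  margin <- 5 + 0.2 * r          # fixed 5 places plus 20% of the rank
  c(lower = r - margin, upper = r + margin)
}

rank_window(10)  # window around the 10th-ranked player
rank_window(34)  # window around the 34th-ranked player
```

For the example in the question, this gives a window of 3 to 17 for rank 10 and roughly 22.2 to 45.8 for rank 34.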

Best way to split a dataset every Nth observation

Currently, I have 3 datasets, each with 1368 rows of data points.
a <- sample(0:10000,1368, rep=TRUE)
Df <- data.frame(obs=c(1:1368),
var1=a)
df2<-data.frame(col1=Df$var1[1:90],
col2=Df$var1[91:180],
col3=Df$var1[181:270])
Dataset 1
col1 col2 col3
1 7878 8130 3924
2 5781 4375 6232
3 9324 9066 1734
4 9754 8796 2047
5 3462 4930 7381
6 7379 8103 3404
7 7355 5212 4505
dataset 2
col1 col2 col3
1 7878 8130 3924
2 5781 4375 6232
3 9324 9066 1734
4 9754 8796 2047
5 3462 4930 7381
6 7379 8103 3404
7 7355 5212 4505
8 5599 6887 5775
9 2321 7948 3553
10 3717 1248 5818
11 6276 5528 206
12 1328 1158 8681
13 4470 3009 1332
14 6472 9018 606
Above is an example of one of the datasets being used, together with the expected outcome; I left out the excess rows.
My intention is to split each dataset sequentially into subsets, each with 90 observations. I am aware that 1368 is not evenly divisible by 90, but the last subset having more entries isn't a problem. The main concern is just splitting the observations into either different datasets or different columns, so that I can perform specific statistical tests, such as a t-test, on each subset of data. The end result should be a data frame with 14 columns.
The end goal is to have all 3 datasets of 1368 observations split into equal subsets.
What would be the best way to split the dataset into these subsets?
This should get you started, but without reproducible data, it is impossible to adapt a general method to your specific data:
n <- 1368 # rows
subsets <- n %/% 90 # 15 subsets
extra <- n %% 90 # 18 extra
grp <- c(rep(1:subsets, each=90), rep(subsets, extra)) # group numbers for each row assuming the extra goes in the last group
table(grp)
# grp
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
# 90 90 90 90 90 90 90 90 90 90 90 90 90 90 108
Then use grp to split() your data frame into a list of groups.
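A minimal sketch of that last step, using simulated data in the shape of the question's Df. The per-subset t-test against mu = 5000 is purely an illustrative assumption; substitute whatever test is actually needed.

```r
set.seed(1)
Df <- data.frame(obs = 1:1368, var1 = sample(0:10000, 1368, replace = TRUE))

n       <- nrow(Df)
size    <- 90
subsets <- n %/% size                       # number of groups
# One group number per row; leftover rows are folded into the last group.
grp <- c(rep(seq_len(subsets), each = size), rep(subsets, n %% size))

pieces <- split(Df, grp)                    # list of data frames, one per group
sapply(pieces, nrow)                        # group sizes

# Example: run a one-sample t-test on each subset (mu = 5000 is an assumption).
p_values <- sapply(pieces, function(d) t.test(d$var1, mu = 5000)$p.value)
```

Keeping the subsets in a list, rather than as separate columns, avoids the problem of the last group being longer than the others.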

How to use NSE inside fct_reorder() in ggplot2

I would like to know how to use NSE (Non-Standard Evaluation) expression in fct_reorder() in ggplot2 to replicate charts for different data frames.
This is an example of data frame that I use to draw a chart:
travel_time_br30 travel_time_br30_int time_reduction shift not_shift total
1 0-30 0 10 2780 3268 6048
2 0-30 0 20 2779 3269 6048
3 0-30 0 30 2984 3064 6048
4 0-30 0 40 3211 2837 6048
5 30-60 30 10 2139 2007 4146
6 30-60 30 20 2159 1987 4146
7 30-60 30 30 2363 1783 4146
8 30-60 30 40 2478 1668 4146
9 60-90 60 10 764 658 1422
10 60-90 60 20 721 701 1422
11 60-90 60 30 782 640 1422
12 60-90 60 40 801 621 1422
13 90-120 90 10 296 224 520
14 90-120 90 20 302 218 520
15 90-120 90 30 317 203 520
16 90-120 90 40 314 206 520
17 120-150 120 10 12 10 22
18 120-150 120 20 10 12 22
19 120-150 120 30 10 12 22
20 120-150 120 40 13 9 22
21 150-180 150 10 35 21 56
22 150-180 150 20 40 16 56
23 150-180 150 30 40 16 56
24 150-180 150 40 35 21 56
share
1 45.96561
2 45.94907
3 49.33862
4 53.09193
5 51.59190
6 52.07429
7 56.99469
8 59.76845
9 53.72714
10 50.70323
11 54.99297
12 56.32911
13 56.92308
14 58.07692
15 60.96154
16 60.38462
17 54.54545
18 45.45455
19 45.45455
20 59.09091
21 62.50000
22 71.42857
23 71.42857
24 62.50000
These are the scripts to draw a chart from above data frame:
g.var <- "travel_time_br30"
go.var <- "travel_time_br30_int"
test %>% ggplot(.,aes_(x=as.name(x.var),y=as.name("share"),group=as.name(g.var))) +
geom_line(size=1.4, aes(
color=fct_reorder(travel_time_br30,order(travel_time_br30_int))))
As I have several data frames that have different fields, such as access_time_br30 and access_time_br30_int instead of travel_time_br30 and travel_time_br30_int, I set two variables (g.var and go.var) so that I can easily replicate multiple charts with the same scripts.
I need to reorder the factor groups numerically, in particular changing the order of travel_time_br30 by travel_time_br30_int, so I am using the fct_reorder() function inside ggplot(., aes_(...)). However, if I use aes_() with fct_reorder() in geom_line(), as in the following script, it returns the error "Error: f must be a factor (or character vector)".
geom_line(size=1.4, aes_(color=fct_reorder(as.name(g.var),order(as.name(go.var)))))
fct_reorder() does not seem to have an NSE version like fct_reorder_().
Is it impossible to use both aes_ and fct_reorder() in a sequence of scripts or are there any other solutions?
Based on my novice working knowledge of tidy-eval, you could transform your factor order in mutate() before passing the data into ggplot() and achieve your result.
Sorry, I couldn't easily read in your table above because of the line wrapping, so I made a new example based on mtcars that I think captures your intent (let me know if it doesn't).
library(dplyr)   # mutate(), mutate_at(), %>%
library(forcats) # fct_reorder()
library(ggplot2)
library(rlang)   # sym(), quo_name(), !!, :=
mtcars2 <- mutate(mtcars,
                  gear_int = 6 - gear,
                  gear_intrev = rev(gear_int)) %>%
  mutate_at(vars(cyl, gear), as.factor)
gg_reorder <- function(data, col_var, col_order) {
  eq_var <- sym(col_var) # sym is flexible and my novice preference
  eq_ord <- sym(col_order)
  data %>% mutate(!!quo_name(eq_var) := fct_reorder(!!eq_var, !!eq_ord) ) %>%
    ggplot(aes_(~mpg, ~hp, color = eq_var)) +
    geom_line()
}
And now put it to use plotting...
gg_reorder(mtcars2, "gear", "gear_int")
gg_reorder(mtcars2, "gear", "gear_intrev")
I didn't specify all of the aes_() variables as strings but you could pass those as text and use the as.name() pattern. If you want more tidy-eval patterns Edwin Thoen wrote up a bunch of common cases.
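The same idea, reordering the factor before plotting with the column names held in strings, can also be sketched in base R without any tidy-eval. This is an alternative illustration, not the approach used above: it swaps fct_reorder() for an explicit factor(levels = ...) call, and the small test data frame only imitates the two relevant columns from the question.

```r
# Cut-down imitation of the question's data: a label column and its numeric key.
test <- data.frame(
  travel_time_br30     = rep(c("0-30", "30-60", "60-90", "90-120", "120-150", "150-180"), each = 2),
  travel_time_br30_int = rep(c(0, 30, 60, 90, 120, 150), each = 2),
  stringsAsFactors = FALSE
)
g.var  <- "travel_time_br30"      # column to turn into an ordered-level factor
go.var <- "travel_time_br30_int"  # numeric column that defines the order

# [[ ]] accepts a column name as a string, so no NSE is involved.
# Sort the labels by the numeric key and use that order as the factor levels.
lvls <- unique(test[[g.var]][order(test[[go.var]])])
test[[g.var]] <- factor(test[[g.var]], levels = lvls)

levels(test[[g.var]])
```

Because the levels are now stored in the data, any later plot that maps this column to colour picks up the numeric ordering instead of the default alphabetical one.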

Error in sort.list(y) while using 'strata()' in R

When I run the command:
H <-length(table(data$Team))
n.h <- rep(5,H)
strata(data, stratanames=data$Team,size=n.h,method="srswor"),
I get the error statement:
'Error in sort.list(y) : 'x' must be atomic for 'sort.list' Have you called 'sort' on a list?'
How can I get this stratified sample? The variable 'Team' is of type factor.
Data is as below:
zz <- "Team League.ID Player Salary POS G GS InnOuts PO A
ANA AL molinjo0 335000 C 73 57 1573 441 37
ANA AL percitr0 7833333 P 3 0 149 1 3
ARI NL bautida0 4000000 RF 141 135 3536 265 8
ARI NL estalbo0 550000 C 7 3 92 19 2
ARI NL finlest0 7000000 CF 104 102 2689 214 5
ARI NL koplomi0 330000 P 72 0 260 6 23
ARI NL sparkst0 500000 P 27 18 362 8 21
ARI NL villaos0 325000 P 17 0 54 0 4
ARI NL webbbr01 335000 P 33 35 624 13 41
ATL NL francju0 750000 1B 125 71 1894 627 48
ATL NL hamptmi0 14625000 P 35 29 517 13 37
ATL NL marreel0 3000000 LF 90 42 1125 80 4
ATL NL ortizru0 6200000 P 32 34 614 7 38
BAL AL surhobj0 800000 LF 100 31 805 69 0"
data <- read.table(text=zz, header=T)
This should work:
library(sampling)
H <- length(levels(data$Team))
n.h <- rep(5, H)
strata(data, stratanames=c("Team"), size=n.h, method="srswor")
stratanames should be a character vector of column names, not a reference to the actual column data.
Update:
Now that example data is available, I see another problem: you are sampling without replacement (wor), but the requested sample sizes are bigger than the available data. You need to sample with replacement in this case:
smpl <- strata(data, stratanames=c("Team"), size=n.h, method="srswr")
BTW, you get the actual data with:
sampledData <- getdata(data, smpl)
This doesn't really answer your question, but a long time ago, I wrote a function called stratified that might be of use to you.
I've posted it here as a GitHub Gist.
Notice that when you have asked for samples that are bigger than your data, it just returns all of the relevant rows.
output <- stratified(data, "Team", 5)
# Some groups
# ---ANA, ATL, BAL---
# contain fewer observations than desired number of samples.
# All observations have been returned from those groups.
table(output$Team)
#
# ANA ARI ATL BAL
# 2 5 4 1
output
# Team League.ID Player Salary POS G GS InnOuts PO A
# 1 ANA AL molinjo0 335000 C 73 57 1573 441 37
# 2 ANA AL percitr0 7833333 P 3 0 149 1 3
# 9 ARI NL webbbr01 335000 P 33 35 624 13 41
# 7 ARI NL sparkst0 500000 P 27 18 362 8 21
# 8 ARI NL villaos0 325000 P 17 0 54 0 4
# 3 ARI NL bautida0 4000000 RF 141 135 3536 265 8
# 6 ARI NL koplomi0 330000 P 72 0 260 6 23
# 12 ATL NL marreel0 3000000 LF 90 42 1125 80 4
# 13 ATL NL ortizru0 6200000 P 32 34 614 7 38
# 10 ATL NL francju0 750000 1B 125 71 1894 627 48
# 11 ATL NL hamptmi0 14625000 P 35 29 517 13 37
# 14 BAL AL surhobj0 800000 LF 100 31 805 69 0
I'll add official documentation to the function at some point, but here's a summary to help you get the best use out of it:
The arguments to stratified are:
df: The input data.frame
group: A character vector of the column or columns that make up the "strata".
size: The desired sample size.
If size is a value less than 1, a proportionate sample is taken from each stratum.
If size is a single integer of 1 or more, that number of samples is taken from each stratum.
If size is a vector of integers, the specified number of samples is taken for each stratum. It is recommended that you use a named vector. For example, if you have two strata, "A" and "B", and you wanted 5 samples from "A" and 10 from "B", you would enter size = c(A = 5, B = 10).
select: This allows you to subset the groups in the sampling process. This is a list. For instance, if your group variable was "Group", and it contained three strata, "A", "B", and "C", but you only wanted to sample from "A" and "C", you can use select = list(Group = c("A", "C")).
replace: For sampling with replacement.
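For readers who only need the "return everything when a group is too small" behaviour, the following is a minimal base R sketch. It is not the stratified function from the Gist: sample_per_group is a hypothetical helper, and the toy data only mimics the team counts in the question (2, 7, 4 and 1 rows).

```r
# Toy data with the same group sizes as the question's example.
data <- data.frame(
  Team   = rep(c("ANA", "ARI", "ATL", "BAL"), times = c(2, 7, 4, 1)),
  Player = paste0("p", 1:14),
  stringsAsFactors = FALSE
)

# Hypothetical helper: draw up to `n` rows per group, without replacement.
# Groups with `n` or fewer rows are returned whole.
sample_per_group <- function(df, group, n) {
  idx <- split(seq_len(nrow(df)), df[[group]])   # row numbers per group
  keep <- lapply(idx, function(i) {
    # sample.int() on the length avoids sample()'s surprise with length-1 input
    if (length(i) <= n) i else i[sample.int(length(i), n)]
  })
  df[unlist(keep, use.names = FALSE), , drop = FALSE]
}

set.seed(42)
out <- sample_per_group(data, "Team", 5)
table(out$Team)
```

With a requested size of 5, only ARI is actually subsampled; the other teams come back with all of their rows.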

How to make a pie chart that only labels the top n values

I haven't been using pie graph a lot in r, is there a way to make a pie graph and only show the top 10 names with percentage?
For example, here's a simple version of my data:
> data
count METRIC_ID
1 8 71
2 2 1035
3 5 1219
4 4 1277
5 1 1322
6 3 1444
7 5 1462
8 17 1720
9 6 2019
10 2 2040
11 1 2413
12 11 2489
13 24 2610
14 29 2737
15 1 2907
16 1 2930
17 2 2992
18 1 2994
19 2 3020
20 4 3045
21 35 3222
22 2 3245
23 5 3306
24 2 3348
25 2 3355
26 2 3381
27 3 3383
28 4 3389
29 6 3404
30 1 3443
31 22 3465
32 3 3558
33 15 3600
34 3 3730
35 6 3750
36 1 3863
37 1 3908
38 5 3913
39 3 3968
40 9 3972
41 2 3978
42 5 4077
43 4 4086
44 3 4124
45 2 4165
46 3 4205
47 8 4206
48 4 4210
49 12 4222
50 4 4228
and I want to see the count of each METRIC_ID's distribution:
pie(data$count, data$METRIC_ID)
But this chart labels every single METRIC_ID, and with over 100 METRIC_IDs it looks like a mess. How can I label only the top n (for example, n=5) METRIC_IDs on the graph, and show the count for those n METRIC_IDs only?
Thank you for your help!!!
To suppress plotting of some labels, set them to NA. Try this:
labls <- data$METRIC_ID
labls[data$count < 3] <- NA
pie(data$count, labels = labls) # pass labls directly; paste(labls) would turn NA into the string "NA"
Simply subset your data before creating the pie chart. I'd do something like:
Sort your datasets using order.
Select the first ten rows.
Create the pie chart from the resulting data.
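The three steps above can be sketched as follows. The data frame is a shortened, made-up sample in the shape of the question's data, and the percentage labels are computed against the total of all rows, which is one possible reading of "top n with percentage".

```r
# Shortened sample in the shape of the question's data.
data <- data.frame(
  count     = c(8, 2, 5, 17, 11, 24, 29, 35, 22, 15, 12, 9),
  METRIC_ID = c(71, 1035, 1219, 1720, 2489, 2610, 2737, 3222, 3465, 3600, 4222, 3972)
)
n <- 5

sorted <- data[order(-data$count), ]   # 1. sort by count, largest first
top    <- head(sorted, n)              # 2. keep the first n rows

# Share of each top METRIC_ID relative to the total over all rows.
pct    <- round(100 * top$count / sum(data$count), 1)
labels <- paste0(top$METRIC_ID, " (", pct, "%)")

pie(top$count, labels = labels)        # 3. draw the pie from the subset
```

Note that the slices then show only the top n relative to each other, while the labels report each one's share of the full data.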
Pie charts are not the best way to visualize your data; just google pie chart problems, e.g. this link. I'd go for something like:
library(ggplot2)
dat = data # the question's data frame
dat = dat[order(-dat$count),]
dat = within(dat, {METRIC_ID = factor(METRIC_ID, levels = METRIC_ID)})
ggplot(dat, aes(x = METRIC_ID, y = count)) + geom_point()
Here I just plot all the data, which I think still leads to a readable graph. This graph is more formally known as a dotplot, and is heavily used in Cleveland's book on graphics. Here the height is linked to count, which is much easier to interpret than linking count to the fraction of the area of a circle, as in the case of the pie chart.
Find a better type of chart for your data.
Here is a possibility to create the chart you want:
data2 <- data[data$count %in% tail(sort(data$count),5),]
pie(data2$count, data2$METRIC_ID)
Slightly better:
data3 <- data2
data3$METRIC_ID <- as.character(data3$METRIC_ID)
data3 <- rbind(data3,data.frame(count=sum(data[! data$count %in% tail(sort(data$count),5),"count"]),METRIC_ID="others"))
pie(data3$count, data3$METRIC_ID)

Resources