I have two sets of panel data that I would like to merge. The problem is that, for each respective time interval, the variable which links the two data sets appears more frequently in the first data frame than in the second. My objective is to add each row from the second data set to its corresponding row in the first data set, even if that means copying a row multiple times within the same time interval. Specifically, I am working with basketball data from the NBA. The first data set is a panel of Player and Date, while the second is a panel of Team (Tm) and Date. Thus, each Team entry should be copied multiple times per date, once for each player on that team who played that day. I could do this easily in Excel, but the data frames are too large.
The merge returns 0 observations of 52 variables. I've experimented with bind, match, and different versions of merge, and I've searched for everything I can think of, but nothing seems to address this issue specifically. Disclaimer: I am very new to R.
Here is my code up to my roadblock:
HGwd = "~/Documents/Fantasy/Basketball"
library(plm)
library(mice)
library(VIM)
library(nnet)
library(tseries)
library(foreign)
library(ggplot2)
library(truncreg)
library(boot)
Pdata = read.csv("2015-16PlayerData.csv", header = T)
attach(Pdata)
Pdata$Age = as.numeric(as.character(Pdata$Age))
Pdata$Date = as.Date(Pdata$Date, '%m/%e/%Y')
names(Pdata)[8] = "OppTm"
Pdata$GS = as.factor(as.character(Pdata$GS))
# convert the stat columns from factor to numeric
p.num.cols = c("MP", "FG", "FGA", "X2P", "X2PA", "X3P", "X3PA", "FT", "FTA",
               "ORB", "DRB", "TRB", "AST", "STL", "BLK", "TOV", "PF", "PTS")
Pdata[p.num.cols] = lapply(Pdata[p.num.cols], function(x) as.numeric(as.character(x)))
PdataPD = plm.data(Pdata, index = c("Player", "Date"))
attach(PdataPD)
Tdata = read.csv("2015-16TeamData.csv", header = T)
attach(Tdata)
Tdata$Date = as.Date(Tdata$Date, '%m/%e/%Y')
names(Tdata)[3] = "OppTm"
# convert the stat columns from factor to numeric
t.num.cols = c("MP", "FG", "FGA", "X2P", "X2PA", "X3P", "X3PA", "FT", "FTA", "PTS",
               "Opp.FG", "Opp.FGA", "Opp.2P", "Opp.2PA", "Opp.3P", "Opp.3PA",
               "Opp.FT", "Opp.FTA", "Opp.PTS")
Tdata[t.num.cols] = lapply(Tdata[t.num.cols], function(x) as.numeric(as.character(x)))
TdataPD = plm.data(Tdata, index = c("OppTm", "Date"))
attach(TdataPD)
PD = merge(PdataPD, TdataPD, by = "OppTm", all.x = TRUE)
attach(PD)
Any help on how to do this would be greatly appreciated!
EDIT
I've tweaked it a little since last night, but still nothing seems to do the trick. See the updated code above for what I am currently using.
Here is the output for head(PdataPD):
Player Date Rk Pos Tm X..H OppTm W.L GS MP FG FGA FG. X2P
22408 Aaron Brooks 2015-10-27 817 G CHI CLE W 0 16 3 9 0.333 3
22144 Aaron Brooks 2015-10-28 553 G CHI # BRK W 0 16 5 9 0.556 3
21987 Aaron Brooks 2015-10-30 396 G CHI # DET L 0 18 2 6 0.333 1
21456 Aaron Brooks 2015-11-01 4687 G CHI ORL W 0 16 3 11 0.273 3
21152 Aaron Brooks 2015-11-03 4383 G CHI # CHO L 0 17 5 8 0.625 1
20805 Aaron Brooks 2015-11-05 4036 G CHI OKC W 0 13 4 8 0.500 3
X2PA X2P. X3P X3PA X3P. FT FTA FT. ORB DRB TRB AST STL BLK TOV PF PTS GmSc
22408 8 0.375 0 1 0.000 0 0 NA 0 2 2 0 0 0 2 1 6 -0.9
22144 3 1.000 2 6 0.333 0 0 NA 0 1 1 3 1 0 1 4 12 8.5
21987 2 0.500 1 4 0.250 0 0 NA 0 4 4 4 0 0 0 1 5 5.2
21456 6 0.500 0 5 0.000 0 0 NA 2 1 3 1 1 1 1 4 6 1.0
21152 3 0.333 4 5 0.800 0 0 NA 0 0 0 4 1 0 0 4 14 12.6
20805 5 0.600 1 3 0.333 0 0 NA 1 1 2 0 0 0 0 1 9 5.6
FPTS H.A
22408 7.50 H
22144 20.25 A
21987 16.50 A
21456 14.75 H
21152 24.00 A
20805 12.00 H
And for head(TdataPD):
OppTm Date Rk X Opp Result MP FG FGA FG. X2P X2PA X2P. X3P X3PA
2105 ATL 2015-10-27 71 DET L 94-106 240 37 82 0.451 29 55 0.527 8 27
2075 ATL 2015-10-29 41 # NYK W 112-101 240 42 83 0.506 32 59 0.542 10 24
2047 ATL 2015-10-30 13 CHO W 97-94 240 36 83 0.434 28 60 0.467 8 23
2025 ATL 2015-11-01 437 # CHO W 94-92 240 37 88 0.420 30 59 0.508 7 29
2001 ATL 2015-11-03 413 # MIA W 98-92 240 37 90 0.411 30 69 0.435 7 21
1973 ATL 2015-11-04 385 BRK W 101-87 240 37 76 0.487 29 54 0.537 8 22
X3P. FT FTA FT. PTS Opp.FG Opp.FGA Opp.FG. Opp.2P Opp.2PA Opp.2P. Opp.3P
2105 0.296 12 15 0.800 94 37 96 0.385 25 67 0.373 12
2075 0.417 18 26 0.692 112 38 93 0.409 32 64 0.500 6
2047 0.348 17 22 0.773 97 36 88 0.409 24 58 0.414 12
2025 0.241 13 14 0.929 94 32 86 0.372 18 49 0.367 14
2001 0.333 17 22 0.773 98 38 86 0.442 33 58 0.569 5
1973 0.364 19 24 0.792 101 36 83 0.434 31 62 0.500 5
Opp.3PA Opp.3P. Opp.FT Opp.FTA Opp.FT. Opp.PTS
2105 29 0.414 20 26 0.769 106
2075 29 0.207 19 21 0.905 101
2047 30 0.400 10 13 0.769 94
2025 37 0.378 14 15 0.933 92
2001 28 0.179 11 16 0.688 92
1973 21 0.238 10 13 0.769 87
If there is a way to truncate the output from dput(head(___)), I am not familiar with it. It appears that simply erasing the excess characters would remove entire variables from the dataset.
It would help if you posted your data (or a working subset of it) and a little more detail on how you are trying to merge. But if I understand correctly, you want each final record to contain a player's individual stats for a particular date, followed by that player's team's stats for the same date. In that case, you should have a team column in the Player table that identifies the player's team, and then join the two tables on the composite key of Date and Team by setting the by= argument in merge:
merge(PData, TData, by=c("Date", "Team"))
The fact that the data frames have different numbers of rows doesn't matter; this is exactly what join/merge operations are for.
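As a concrete sketch against the data frames in the question (assuming, since I can't see the data, that the player's own team is in Pdata$Tm and that the renamed OppTm column of Tdata identifies the team each row describes):
# by.x/by.y let the key columns have different names in the two tables;
# every player-game row picks up its team's row for the same date
PD <- merge(Pdata, Tdata, by.x = c("Date", "Tm"), by.y = c("Date", "OppTm"), all.x = TRUE)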
For an alternative to merge(), you might check out the dplyr package join functions at https://cran.r-project.org/web/packages/dplyr/vignettes/two-table.html
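For instance, a minimal left_join() sketch under the same column-name assumption:
library(dplyr)
# keep every player row; team stats are duplicated across teammates automatically
PD <- Pdata %>% left_join(Tdata, by = c("Date", "Tm" = "OppTm"))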
I suspect something similar has been asked before; however, I could only find answers for Python and SQL. So please let me know in the comments if this has also been asked for R!
Data
Let's say we have a dataframe like this:
set.seed(1); df <- data.frame( position = 1:20,value = sample(seq(1,100), 20))
# In case you do not get the same data frame, see the comment by @Ian Campbell - thanks!
position value
1 1 27
2 2 37
3 3 57
4 4 89
5 5 20
6 6 86
7 7 97
8 8 62
9 9 58
10 10 6
11 11 19
12 12 16
13 13 61
14 14 34
15 15 67
16 16 43
17 17 88
18 18 83
19 19 32
20 20 63
Goal
I'm interested in calculating the average value for n positions and subtracting it from the average value of the next n positions; let's say n = 5 for now.
What I tried
I currently use the method below, but when I apply it to a bigger data frame it takes a huge amount of time, so I wonder whether there is a faster approach.
library(dplyr)

calc <- function(pos) {
  this.five <- df %>% slice(pos:(pos + 4))
  next.five <- df %>% slice((pos + 5):(pos + 9))
  differ = mean(this.five$value) - mean(next.five$value)
  data.frame(dif = differ)
}

df %>%
  group_by(position) %>%
  do(calc(.$position))
That produces the following table:
position dif
<int> <dbl>
1 1 -15.8
2 2 9.40
3 3 37.6
4 4 38.8
5 5 37.4
6 6 22.4
7 7 4.20
8 8 -26.4
9 9 -31
10 10 -35.4
11 11 -22.4
12 12 -22.3
13 13 -0.733
14 14 15.5
15 15 -0.400
16 16 NaN
17 17 NaN
18 18 NaN
19 19 NaN
20 20 NaN
I suspect a data.table approach may be faster.
library(data.table)
setDT(df)

# left-aligned rolling mean over a window of 5 for each column of .SD:
# rollmean[i] is mean(value[i:(i + 4)])
df[, c("roll.position", "rollmean") := lapply(.SD, frollmean, n = 5, fill = NA, align = "left")]

# subtract the mean of the next five positions from the mean of the current five
df[, result := rollmean[.I] - rollmean[.I + 5]]

df[, .(position, value, rollmean, result)]
# position value rollmean result
# 1: 1 27 46.0 -15.8
# 2: 2 37 57.8 9.4
# 3: 3 57 69.8 37.6
# 4: 4 89 70.8 38.8
# 5: 5 20 64.6 37.4
# 6: 6 86 61.8 22.4
# 7: 7 97 48.4 4.2
# 8: 8 62 32.2 -26.4
# 9: 9 58 32.0 -31.0
#10: 10 6 27.2 -35.4
#11: 11 19 39.4 -22.4
#12: 12 16 44.2 NA
#13: 13 61 58.6 NA
#14: 14 34 63.0 NA
#15: 15 67 62.6 NA
#16: 16 43 61.8 NA
#17: 17 88 NA NA
#18: 18 83 NA NA
#19: 19 32 NA NA
#20: 20 63 NA NA
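If you prefer to skip the helper columns, the same result can be computed in one step; this is just a sketch assuming the same df as above:
# rollmean[i] - rollmean[i + 5], with rollmean the left-aligned 5-window mean
df[, result := frollmean(value, 5, align = "left") -
               shift(frollmean(value, 5, align = "left"), 5, type = "lead")]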
Data
RNGkind(sample.kind = "Rounding")
set.seed(1); df <- data.frame( position = 1:20,value = sample(seq(1,100), 20))
RNGkind(sample.kind = "default")
I am trying to move data from long format to wide format in order to do some correlation analyses.
But dcast seems to create two rows for the first subject, splitting the data across those two rows and filling the newly created empty cells with NA.
The first two subjects were duplicated when I was using alphanumeric subject codes; switching to numeric subject numbers cut that down to only the first subject being duplicated.
The first few lines of the long format data frame:
Subject Age Gender R_PTA L_PTA BE_PTA Avg_PTA L_Aided_SII R_Aided_SII Best_Aided_SII L_Unaided_SII R_Unaided_SII Best_Unaided_SII L_SII_Diff R_SII_Diff
1 1 74 M 48.33 53.33 48.33 50.83 31 42 42 14 25 25 17 17
2 2 77 F 36.67 36.67 36.67 36.67 73 67 73 44 43 44 29 24
3 3 72 F 45.00 41.67 41.67 43.33 42 34 42 35 28 35 7 6
4 4 66 F 36.67 36.67 36.67 36.67 66 76 76 44 44 44 22 32
5 5 38 F 41.67 46.67 41.67 44.17 48 58 58 23 29 29 25 29
6 6 65 M 35.00 43.33 35.00 39.17 46 60 60 32 46 46 14 14
Best_SII_Diff rSII MoCA_Vis MoCA_Nam MoCA_Attn MoCA_Lang MoCA_Abst MoCA_Del_Rec MoCA_Ori MoCA_Tot PNT Semantic Aided PNT_Prop PNT_Prop_Mod
1 17 -0.4231157 5 3 6 2 2 2 6 26 0.971 0.029 Unaided 0.971 0.983
2 29 1.2739255 3 3 5 0 2 2 5 20 0.954 0.046 Unaided 0.960 0.966
3 7 -1.2777889 4 2 5 2 2 5 6 26 0.966 0.034 Unaided 0.960 0.982
4 32 1.5959701 5 3 6 3 2 5 6 30 0.983 0.017 Unaided 0.983 0.994
5 29 0.9492167 4 2 6 3 1 3 6 25 0.983 0.017 Unaided 0.983 0.994
6 14 -0.2936395 4 2 6 2 2 2 6 24 0.989 0.011 Unaided 0.989 0.994
PNT_S_Wt PNT_P_Wt
1 0.046 0.041
2 0.073 0.033
3 0.045 0.074
4 0.049 0.057
5 0.049 0.057
6 0.049 0.057
Creating varlist:
varlist <- list(colnames(subset(PNT_Data_All2, ,c(18:27,29:33))))
My dcast command:
Data_Wide <- dcast(as.data.table(PNT_Data_All2),Subject + Age + Gender + R_PTA + L_PTA + BE_PTA + Avg_PTA + L_Aided_SII + R_Aided_SII + Best_Aided_SII + L_Unaided_SII + R_Unaided_SII + Best_Unaided_SII + L_SII_Diff + R_SII_Diff + Best_SII_Diff + rSII ~ Aided, value.var=varlist)
The resulting first few lines of the wide format:
Subject Age Gender R_PTA L_PTA BE_PTA Avg_PTA L_Aided_SII R_Aided_SII Best_Aided_SII L_Unaided_SII R_Unaided_SII Best_Unaided_SII L_SII_Diff R_SII_Diff
1: 1 74 M 48.33 53.33 48.33 50.83 31 42 42 14 25 25 17 17
2: 1 74 M 48.33 53.33 48.33 50.83 31 42 42 14 25 25 17 17
3: 2 77 F 36.67 36.67 36.67 36.67 73 67 73 44 43 44 29 24
4: 3 72 F 45.00 41.67 41.67 43.33 42 34 42 35 28 35 7 6
5: 4 66 F 36.67 36.67 36.67 36.67 66 76 76 44 44 44 22 32
6: 5 38 F 41.67 46.67 41.67 44.17 48 58 58 23 29 29 25 29
Notice that Subject 1 has two entries; all of the other subjects seem correct.
Is this a problem with my command/arguments, or a bug in dcast?
Edit 1: Through the process of elimination, the extra entries only appear when I include the "rSII" variable. This is a variable that is calculated from a previous step in the script:
PNT_Data_All$rSII <- stdres(lm(Best_Aided_SII ~ Best_Unaided_SII, data=PNT_Data_All))
PNT_Data_All <- PNT_Data_All[, colnames(PNT_Data_All)[c(1:17,34,18:33)]]
Is there something about that calculated variable that would mess up dcast for some subjects?
Edit 2 to add my workaround:
I ended up rounding the calculated variable to 3 digits after the decimal and that solved the problem. Everything is casting correctly now with no duplicates.
PNT_Data_All$rSII <- format(round(stdres(lm(Best_Aided_SII ~ Best_Unaided_SII, data=PNT_Data_All)),3),nsmall=3)
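For what it's worth, that workaround points to the likely cause: rSII appears on the left-hand side of the dcast formula, so it acts as an id variable, and if the Aided and Unaided rows for a subject carry rSII values that differ only at machine precision, dcast sees two distinct ids and emits two rows. Note also that format() returns a character vector; if rSII should stay numeric, rounding alone is enough. A sketch, assuming stdres() comes from MASS as in the original script:
library(MASS)  # for stdres()
# round() collapses the floating-point noise while keeping rSII numeric
PNT_Data_All$rSII <- round(stdres(lm(Best_Aided_SII ~ Best_Unaided_SII, data = PNT_Data_All)), 3)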
I've been working on an R function to filter a large data frame of baseball team batting stats by game id (e.g. "2016/10/11/chnmlb-sfnmlb-1") to create a list of past team matchups by season.
When I use some combinations of teams the output is correct, but for others it is not (the output contains a variety of ids).
I'm not very familiar with grep, and I assume that is the problem. I patched my grep line and list output together by searching Stack Overflow and thought I had it until testing proved otherwise.
matchup.func <- function(home, away, df) {
  matchups <- grep(paste('[0-9]{4}/[0-9]{2}/[0-9]{2}/[', home, '|', away, 'mlb]{6}-[', away, '|', home, 'mlb]{6}-[0-9]{1}', sep = ''), df$game.id, value = TRUE)
  df <- df[df$game.id %in% matchups, c(1, 3:ncol(df))]
  out <- list()
  for (n in 1:length(unique(df$season))) {
    for (s in unique(df$season)[n]) {
      out[[s]] <- subset(df, season == s)
    }
  }
  return(out)
}
sample of data frame:
bat.stats[sample(nrow(bat.stats), 3), ]
date game.id team wins losses flag ab r h d t hr rbi bb po da so lob avg obp slg ops roi season
1192 2016-04-11 2016/04/11/texmlb-seamlb-1 sea 2 5 away 38 7 14 3 0 0 7 2 27 8 11 15 0.226 0.303 0.336 0.639 0.286 R
764 2016-03-26 2016/03/26/wasmlb-slnmlb-1 sln 8 12 away 38 7 9 2 1 1 5 2 27 8 11 19 0.289 0.354 0.474 0.828 0.400 S
5705 2016-09-26 2016/09/26/oakmlb-anamlb-1 oak 67 89 home 29 2 6 1 0 1 2 2 27 13 4 12 0.260 0.322 0.404 0.726 0.429 R
sample of errant output:
matchup.func('tex', 'sea', bat.stats)
$S
date team wins losses flag ab r h d t hr rbi bb po da so lob avg obp slg ops roi season
21 2016-03-02 atl 1 0 home 32 4 7 0 0 2 3 2 27 19 2 11 0.203 0.222 0.406 0.628 1.000 S
22 2016-03-02 bal 0 1 away 40 11 14 3 2 2 11 10 27 13 4 28 0.316 0.415 0.532 0.947 0.000 S
47 2016-03-03 bal 0 2 home 41 10 17 7 0 2 10 0 27 9 3 13 0.329 0.354 0.519 0.873 0.000 S
48 2016-03-03 tba 1 1 away 33 3 5 0 1 0 3 2 24 10 8 13 0.186 0.213 0.343 0.556 0.500 S
141 2016-03-05 tba 2 2 home 35 6 6 2 0 0 5 3 27 11 5 15 0.199 0.266 0.318 0.584 0.500 S
142 2016-03-05 bal 0 5 away 41 10 17 5 1 0 10 4 27 9 10 13 0.331 0.371 0.497 0.868 0.000 S
sample of good:
matchup.func('bos', 'bal', bat.stats)
$S
date team wins losses flag ab r h d t hr rbi bb po da so lob avg obp slg ops roi season
143 2016-03-06 bal 0 6 home 34 8 14 4 0 0 8 5 27 5 8 22 0.284 0.330 0.420 0.750 0.000 S
144 2016-03-06 bos 3 2 away 38 7 10 3 0 0 7 7 24 7 13 25 0.209 0.285 0.322 0.607 0.600 S
209 2016-03-08 bos 4 3 home 37 1 12 1 1 0 1 4 27 15 8 26 0.222 0.292 0.320 0.612 0.571 S
210 2016-03-08 bal 0 8 away 36 5 12 5 0 1 4 4 27 9 4 27 0.283 0.345 0.429 0.774 0.000 S
On the good call it gives a list of matchups by season (i.e. S, R, F, D), as it should. On the bad call it also splits the output by season, but the rows seem to be matched only by date, not by team. Not sure what to think.
I think the issue is that a regex inside [] behaves differently than you might expect: a character class matches any single one of the listed characters, in any order. Instead, you might try
matchups <- grep(paste0("(", home, "|", away, ")mlb-(", home, "|", away, ")mlb"),
                 df$game.id, value = TRUE)
That should give you either the home or the away team, followed by either the home or away team. Without more sample data though, I am not sure if this will catch edge cases.
You should also note that you don't have to match the entire string, so the date-finding regex at the beginning is likely superfluous.
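To make the difference concrete, here is a small illustration of my own using the tex/sea pair from the question:
# a character class matches any single character from the set {t, e, x, |, s, a},
# in any order, which is why unrelated ids slip through
grepl("^[tex|sea]{3}$", c("tex", "sea", "ate", "axe"))
# [1] TRUE TRUE TRUE TRUE

# the grouped alternation only matches the intended team codes
grepl("^(tex|sea)$", c("tex", "sea", "ate", "axe"))
# [1]  TRUE  TRUE FALSE FALSE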
I have a data.frame named df.ordered that looks like:
labels gvs order color pvals
1 Adygei -2.3321916 1 1 0.914
2 Basque -0.8519079 2 1 0.218
3 French -0.9298674 3 1 0.000
4 Italian -2.8859587 4 1 0.024
5 Orcadian -1.4996229 5 1 0.148
6 Russian -1.5597359 6 1 0.626
7 Sardinian -1.4494841 7 1 0.516
8 Tuscan -2.4279528 8 1 0.420
9 Bedouin -3.1717421 9 2 0.914
10 Druze -0.5058627 10 2 0.220
11 Mozabite -2.6491331 11 2 0.200
12 Palestinian -0.7819299 12 2 0.552
13 Balochi -1.4095947 13 3 0.158
14 Brahui -1.2534511 14 3 0.162
15 Burusho 1.7958170 15 3 0.414
16 Hazara 2.2810477 16 3 0.152
17 Kalash -0.9258497 17 3 0.974
18 Makrani -0.9007551 18 3 0.226
19 Pathan 2.5543214 19 3 0.112
20 Sindhi 2.6614486 20 3 0.338
21 Uygur -1.2207974 21 3 0.652
22 Cambodian 2.3706977 22 4 0.118
23 Dai -0.9441980 23 4 0.686
24 Daur -1.0325107 24 4 0.932
25 Han -0.7381369 25 4 0.794
26 Hezhen -2.7590587 26 4 0.182
27 Japanese -0.5644325 27 4 0.366
28 Lahu -0.8449225 28 4 0.560
29 Miao -0.7237586 29 4 0.194
30 Mongola -0.9452944 30 4 0.768
31 Naxi -0.1625003 31 4 0.554
32 Oroqen -1.2035258 32 4 0.782
33 She -2.7758460 33 4 0.912
34 Tu -0.7703779 34 4 0.254
35 Tujia -1.0265275 35 4 0.912
36 Xibo -1.1163019 36 4 0.292
37 Yakut -3.2102686 37 4 0.030
38 Yi -0.9614190 38 4 0.838
39 Colombian -1.9659984 39 5 0.166
40 Karitiana -0.9195156 40 5 0.660
41 Maya 2.1239768 41 5 0.818
42 Pima -3.0895998 42 5 0.818
43 Surui -0.9377928 43 5 0.536
44 Melanesian -1.6961014 44 6 0.414
45 Papuan -0.7037952 45 6 0.386
46 BantuKenya -1.9311354 46 7 0.484
47 BantuSouthAfrica -1.8515908 47 7 0.016
48 BiakaPygmy -1.7657017 48 7 0.538
49 Mandenka -0.5423822 49 7 0.076
50 MbutiPygmy -1.6244801 50 7 0.054
51 San -0.9049735 51 7 0.478
52 Yoruba 2.0949378 52 7 0.904
I have made the following graph
I used the code:
jpeg("test3.jpg", 700,700)
df.ordered$color <- as.factor(df.ordered$color)
levels(df.ordered$color) <- c("blue","yellow3","red","pink","purple","green","orange")
plot(df.ordered$gvs, pch = 19, cex=2, col = as.character(df.ordered$color), xaxt="n")
axis(1, at=1:52, col=as.character(df.ordered$color),labels=df.ordered$labels, las=2)
dev.off()
I now want to scale the dots of the graph by the pvals column: low p-values should get larger dots, and high p-values smaller dots. One issue is that some p-values are 0; I was thinking of changing all pvals values of 0.000 to 0.001 to fix this. Does anyone know how to do this? I want the graph to look similar to the graph in Figure 5 here: http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1004412
The cex argument is vectorized, i.e., you can pass in a vector (of the same length as your data) to plot. Take this as a simple example:
plot(1:5, cex = 1:5)
Now, it is completely up to you to define a relationship between cex and pvals. How about a + (1 - pvals) * (b - a)? This will map 1-pvals from [0,1] to [a,b]. For example, with a = 1, b = 5, you can try:
cex <- 1 + (1 - df.ordered$pvals) * (5 - 1)
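Plugging that vector into the plotting call from the question (a usage sketch):
plot(df.ordered$gvs, pch = 19, cex = cex,
     col = as.character(df.ordered$color), xaxt = "n")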
I'm looking to have p-values between 0.000 and 0.001 get cex = ~10, p-values between 0.001 and 0.20 get cex = ~5, and p-values from 0.20 to 1.00 get cex = ~0.5.
I recommend using cut():
fac <- cut(df.ordered$pvals, breaks = c(0, 0.001, 0.2, 1),
           labels = c(10, 5, 0.5), right = FALSE)
cex <- c(10, 5, 0.5)[as.integer(fac)]
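Since the factor labels here are the cex values themselves, an equivalent way to recover them is:
# cut() returns a factor; converting its labels back to numeric gives the sizes
cex <- as.numeric(as.character(fac))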
Adding to @zheyuan-li's answer, here is a normalization that gives points with p-values "equal" to 0 a size of 2, and points with p-values "equal" to 1 a size of 0:
plot(df.ordered$gvs, pch = 19,
     cex = 2 * (1 - df.ordered$pvals) / (df.ordered$pvals + 1),
     col = as.character(df.ordered$color), xaxt = "n")
I have a data frame like this:
distance exclude
1.1 F
1.5 F
3 F
2 F
1 F
5 T
3 F
63 F
32 F
21 F
15 F
1 T
I want to get the four boxplot stats for each segment of data in the distance column, where the segments are separated by "T" in the exclude column; here "T" serves as the separator.
Can anyone help? Thanks so much!
First, let's create some fake data:
library(dplyr)
# Fake data
set.seed(49349)
dat = data.frame(distance = rnorm(500, 50, 10),
                 exclude = sample(c("T","F"), 500, replace = TRUE, prob = c(0.03, 0.95)))
Now create a new group each time exclude = "T". Then, for each group, calculate whatever statistics you wish and return the results in a data frame:
box.stats = dat %>%
  mutate(group = cumsum(exclude == "T")) %>%
  group_by(group) %>%
  do(data.frame(n = length(.$distance),
                out_90 = sum(.$distance > quantile(.$distance, 0.9)),
                out_10 = sum(.$distance < quantile(.$distance, 0.1)),
                MEAN = round(mean(.$distance), 2),
                SD = round(sd(.$distance), 2),
                out_2SD_high = sum(.$distance > mean(.$distance) + 2*sd(.$distance)),
                round(t(quantile(.$distance, probs = c(0, 0.1, 0.25, 0.5, 0.75, 0.9, 1))), 2)))

names(box.stats) = gsub("X(.*)\\.$", "p\\1", names(box.stats))
box.stats
group n out_90 out_10 MEAN SD out_2SD_high p0 p10 p25 p50 p75 p90 p100
1 0 15 2 2 46.21 8.78 0 28.66 36.03 41.88 46.04 52.33 56.30 61.98
2 1 36 4 4 50.03 10.01 0 21.71 38.78 44.63 51.13 56.66 61.58 67.84
3 2 80 8 8 50.36 9.00 1 20.30 38.10 45.95 51.28 56.51 61.74 70.44
4 3 9 1 1 55.62 8.58 0 42.11 47.10 49.19 54.54 63.63 65.84 67.88
5 4 16 2 2 47.70 7.79 0 29.03 39.89 43.60 49.26 52.92 56.97 58.02
6 5 66 7 7 49.86 9.93 2 24.84 36.00 45.05 50.51 55.65 61.41 75.27
7 6 44 5 5 50.35 10.39 1 31.72 36.36 43.49 50.95 55.78 64.88 73.64
8 7 80 8 8 49.18 9.24 1 27.62 37.86 42.06 50.34 56.60 59.66 72.13
9 8 31 3 3 52.56 11.18 0 25.78 39.94 44.10 51.32 62.02 66.35 70.40
10 9 60 6 6 50.31 9.82 1 25.43 37.44 44.53 50.31 56.78 62.36 71.77
11 10 33 4 4 49.99 9.78 2 32.74 38.72 42.56 49.60 55.75 62.86 72.20
12 11 30 3 3 48.26 11.47 1 30.03 37.68 40.24 45.65 55.42 60.18 79.36
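If you literally want the output of boxplot.stats() for each segment, the same cumsum() grouping works in base R; a minimal sketch using the dat created above:
# a new group starts at every "T" row; split() then applies boxplot.stats() per segment
dat$group <- cumsum(dat$exclude == "T")
stats.by.group <- lapply(split(dat$distance, dat$group), boxplot.stats)
str(stats.by.group[[1]])  # $stats (five hinges), $n, $conf, $out for the first segment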