I have 2 data frame. Codes are: year, pd, treatm and rep.
Variablea are LAI in the first data frame, cimer, himv, nőv are in the second.
I would like to add variable LAI to the other variables/ columns.
I am not sure how to set the correct ordeing of LAI data, while 1 data has 4 codes to define.
Could You help me to solve this problem, please?
Thank You very much!
Data frames are:
> sample1
year treatm pd rep LAI
1 2020 1 A 1 2.58
2 2020 1 A 2 2.08
3 2020 1 A 3 2.48
4 2020 1 A 4 2.98
5 2020 2 A 1 3.34
6 2020 2 A 2 3.11
7 2020 2 A 3 3.20
8 2020 2 A 4 2.56
9 2020 1 B 1 2.14
10 2020 1 B 2 2.17
11 2020 1 B 3 2.24
12 2020 1 B 4 2.29
13 2020 2 B 1 3.41
14 2020 2 B 2 3.12
15 2020 2 B 3 2.81
16 2020 2 B 4 2.63
17 2021 1 A 1 2.15
18 2021 1 A 2 2.25
19 2021 1 A 3 2.52
20 2021 1 A 4 2.57
21 2021 2 A 1 2.95
22 2021 2 A 2 2.82
23 2021 2 A 3 3.11
24 2021 2 A 4 3.04
25 2021 1 B 1 3.25
26 2021 1 B 2 2.33
27 2021 1 B 3 2.75
28 2021 1 B 4 3.09
29 2021 2 B 1 3.18
30 2021 2 B 2 2.75
31 2021 2 B 3 3.21
32 2021 2 B 4 3.57
> sample2
year.pd.treatm.rep.cimer.himv.nőv
1 2020,A,1,1,92,93,94
2 2020,A,2,1,91,92,93
3 2020,B,1,1,72,73,75
4 2020,B,2,1,73,74,75
5 2020,A,1,2,95,96,100
6 2020,A,2,2,90,91,94
7 2020,B,1,2,74,76,78
8 2020,B,2,2,71,72,74
9 2020,A,1,3,94,95,96
10 2020,A,2,3,92,93,96
11 2020,B,1,3,76,77,77
12 2020,B,2,3,74,75,76
13 2020,A,1,4,90,91,97
14 2020,A,2,4,90,91,94
15 2020,B,1,4,74,75,NA
16 2020,B,2,4,73,75,NA
17 2021,A,1,1,92,93,94
18 2021,A,2,1,91,92,93
19 2021,B,1,1,72,73,75
20 2021,B,2,1,73,74,75
21 2021,A,1,2,95,96,100
22 2021,A,2,2,90,91,94
23 2021,B,1,2,74,76,78
24 2021,B,2,2,71,72,74
25 2021,A,1,3,94,95,96
26 2021,A,2,3,92,93,96
27 2021,B,1,3,76,77,77
28 2021,B,2,3,74,75,76
29 2021,A,1,4,90,91,97
30 2021,A,2,4,90,91,94
31 2021,B,1,4,74,75,NA
32 2021,B,2,4,73,75,NA
You can use inner_join from dply
library(tidyverse)
inner_join(sample2,sample1, by=c("year","pd", "treatm", "rep"))
Output (first six lines)
year pd treatm rep cimer himv nov LAI
1: 2020 A 1 1 92 93 94 2.58
2: 2020 A 2 1 91 92 93 3.34
3: 2020 B 1 1 72 73 75 2.14
4: 2020 B 2 1 73 74 75 3.41
5: 2020 A 1 2 95 96 100 2.08
6: 2020 A 2 2 90 91 94 3.11
You can also use data.table
sample2[sample1, on=.(year,pd,treatm,rep)]
I've been working on a r function to filter a large data frame of baseball team batting stats by game id, (i.e."2016/10/11/chnmlb-sfnmlb-1"), to create a list of past team matchups by season.
When I use some combinations of teams, output is correct, but others are not. (output contains a variety of ids)
I'm not real familiar with grep, and assume that is the problem. I patched my grep line and list output together by searching stack overflow and thought I had it till testing proved otherwise.
matchup.func <- function (home, away, df) {
matchups <- grep(paste('[0-9]{4}/[0-9]{2}/[0-9]{2}/[', home, '|', away, 'mlb]{6}-[', away, '|', home, 'mlb]{6}-[0-9]{1}', sep = ''), df$game.id, value = TRUE)
df <- df[df$game.id %in% matchups, c(1, 3:ncol(df))]
out <- list()
for (n in 1:length(unique(df$season))) {
for (s in unique(df$season)[n]) {
out[[s]] <- subset(df, season == s)
}
}
return(out)
}
sample of data frame:
bat.stats[sample(nrow(bat.stats), 3), ]
date game.id team wins losses flag ab r h d t hr rbi bb po da so lob avg obp slg ops roi season
1192 2016-04-11 2016/04/11/texmlb-seamlb-1 sea 2 5 away 38 7 14 3 0 0 7 2 27 8 11 15 0.226 0.303 0.336 0.639 0.286 R
764 2016-03-26 2016/03/26/wasmlb-slnmlb-1 sln 8 12 away 38 7 9 2 1 1 5 2 27 8 11 19 0.289 0.354 0.474 0.828 0.400 S
5705 2016-09-26 2016/09/26/oakmlb-anamlb-1 oak 67 89 home 29 2 6 1 0 1 2 2 27 13 4 12 0.260 0.322 0.404 0.726 0.429 R
sample of errant output:
matchup.func('tex', 'sea', bat.stats)
$S
date team wins losses flag ab r h d t hr rbi bb po da so lob avg obp slg ops roi season
21 2016-03-02 atl 1 0 home 32 4 7 0 0 2 3 2 27 19 2 11 0.203 0.222 0.406 0.628 1.000 S
22 2016-03-02 bal 0 1 away 40 11 14 3 2 2 11 10 27 13 4 28 0.316 0.415 0.532 0.947 0.000 S
47 2016-03-03 bal 0 2 home 41 10 17 7 0 2 10 0 27 9 3 13 0.329 0.354 0.519 0.873 0.000 S
48 2016-03-03 tba 1 1 away 33 3 5 0 1 0 3 2 24 10 8 13 0.186 0.213 0.343 0.556 0.500 S
141 2016-03-05 tba 2 2 home 35 6 6 2 0 0 5 3 27 11 5 15 0.199 0.266 0.318 0.584 0.500 S
142 2016-03-05 bal 0 5 away 41 10 17 5 1 0 10 4 27 9 10 13 0.331 0.371 0.497 0.868 0.000 S
sample of good:
matchup.func('bos', 'bal', bat.stats)
$S
date team wins losses flag ab r h d t hr rbi bb po da so lob avg obp slg ops roi season
143 2016-03-06 bal 0 6 home 34 8 14 4 0 0 8 5 27 5 8 22 0.284 0.330 0.420 0.750 0.000 S
144 2016-03-06 bos 3 2 away 38 7 10 3 0 0 7 7 24 7 13 25 0.209 0.285 0.322 0.607 0.600 S
209 2016-03-08 bos 4 3 home 37 1 12 1 1 0 1 4 27 15 8 26 0.222 0.292 0.320 0.612 0.571 S
210 2016-03-08 bal 0 8 away 36 5 12 5 0 1 4 4 27 9 4 27 0.283 0.345 0.429 0.774 0.000 S
On the good it gives a list of matchups as it should, (i.e. S, R, F, D), on the bad it outputs by season, but seems to only give matchups by date and not team. Not sure what to think.
I think that the issue is that regex inside [] behaves differently than you might expect. Specifically, it is looking for any matches to those characters, and in any order. Instead, you might try
matchups <- grep(paste0("(", home, "|", away, ")mlb-(", home, "|", away, ")mlb")
, df$game.id, value = TRUE)
That should give you either the home or the away team, followed by either the home or away team. Without more sample data though, I am not sure if this will catch edge cases.
You should also note that you don't have to match the entire string, so the date-finding regex at the beginning is likely superfluous.
I have two sets of panel data that I would like to merge. The problem is that, for each respective time interval, the variable which links the two data sets appears more frequently in the first data frame than the second. My objective is to add each row from the second data set to its corresponding row in the first data set, even if that necessitates copying said row multiple times in the same time interval. Specifically, I am working with basketball data from the NBA. The first data set is a panel of Player and Date while the second is one of Team (Tm) and Date. Thus, each Team entry should be copied multiple times per date, once for each player on that team who played that day. I could do this easily in excel, but the data frames are too large.
The result is 0 observations of 52 variables. I've experimented with bind, match, different versions of merge, and I've searched for everything I can think of; but, nothing seems to address this issue specifically. Disclaimer, I am very new to R.
Here is my code up until my road block:
HGwd = "~/Documents/Fantasy/Basketball"
library(plm)
library(mice)
library(VIM)
library(nnet)
library(tseries)
library(foreign)
library(ggplot2)
library(truncreg)
library(boot)
Pdata = read.csv("2015-16PlayerData.csv", header = T)
attach(Pdata)
Pdata$Age = as.numeric(as.character(Pdata$Age))
Pdata$Date = as.Date(Pdata$Date, '%m/%e/%Y')
names(Pdata)[8] = "OppTm"
Pdata$GS = as.factor(as.character(Pdata$GS))
Pdata$MP = as.numeric(as.character(Pdata$MP))
Pdata$FG = as.numeric(as.character(Pdata$FG))
Pdata$FGA = as.numeric(as.character(Pdata$FGA))
Pdata$X2P = as.numeric(as.character(Pdata$X2P))
Pdata$X2PA = as.numeric(as.character(Pdata$X2PA))
Pdata$X3P = as.numeric(as.character(Pdata$X3P))
Pdata$X3PA = as.numeric(as.character(Pdata$X3PA))
Pdata$FT = as.numeric(as.character(Pdata$FT))
Pdata$FTA = as.numeric(as.character(Pdata$FTA))
Pdata$ORB = as.numeric(as.character(Pdata$ORB))
Pdata$DRB = as.numeric(as.character(Pdata$DRB))
Pdata$TRB = as.numeric(as.character(Pdata$TRB))
Pdata$AST = as.numeric(as.character(Pdata$AST))
Pdata$STL = as.numeric(as.character(Pdata$STL))
Pdata$BLK = as.numeric(as.character(Pdata$BLK))
Pdata$TOV = as.numeric(as.character(Pdata$TOV))
Pdata$PF = as.numeric(as.character(Pdata$PF))
Pdata$PTS = as.numeric(as.character(Pdata$PTS))
PdataPD = plm.data(Pdata, index = c("Player", "Date"))
attach(PdataPD)
Tdata = read.csv("2015-16TeamData.csv", header = T)
attach(Tdata)
Tdata$Date = as.Date(Tdata$Date, '%m/%e/%Y')
names(Tdata)[3] = "OppTm"
Tdata$MP = as.numeric(as.character(Tdata$MP))
Tdata$FG = as.numeric(as.character(Tdata$FG))
Tdata$FGA = as.numeric(as.character(Tdata$FGA))
Tdata$X2P = as.numeric(as.character(Tdata$X2P))
Tdata$X2PA = as.numeric(as.character(Tdata$X2PA))
Tdata$X3P = as.numeric(as.character(Tdata$X3P))
Tdata$X3PA = as.numeric(as.character(Tdata$X3PA))
Tdata$FT = as.numeric(as.character(Tdata$FT))
Tdata$FTA = as.numeric(as.character(Tdata$FTA))
Tdata$PTS = as.numeric(as.character(Tdata$PTS))
Tdata$Opp.FG = as.numeric(as.character(Tdata$Opp.FG))
Tdata$Opp.FGA = as.numeric(as.character(Tdata$Opp.FGA))
Tdata$Opp.2P = as.numeric(as.character(Tdata$Opp.2P))
Tdata$Opp.2PA = as.numeric(as.character(Tdata$Opp.2PA))
Tdata$Opp.3P = as.numeric(as.character(Tdata$Opp.3P))
Tdata$Opp.3PA = as.numeric(as.character(Tdata$Opp.3PA))
Tdata$Opp.FT = as.numeric(as.character(Tdata$Opp.FT))
Tdata$Opp.FTA = as.numeric(as.character(Tdata$Opp.FTA))
Tdata$Opp.PTS = as.numeric(as.character(Tdata$Opp.PTS))
TdataPD = plm.data(Tdata, index = c("OppTm", "Date"))
attach(TdataPD)
PD = merge(PdataPD, TdataPD, by = "OppTm", all.x = TRUE)
attach(PD)
Any help on how to do this would be greatly appreciated!
EDIT
I've tweaked it a little from last night, but still nothing seems to do the trick. See the above, updated code for what I am currently using.
Here is the output for head(PdataPD):
Player Date Rk Pos Tm X..H OppTm W.L GS MP FG FGA FG. X2P
22408 Aaron Brooks 2015-10-27 817 G CHI CLE W 0 16 3 9 0.333 3
22144 Aaron Brooks 2015-10-28 553 G CHI # BRK W 0 16 5 9 0.556 3
21987 Aaron Brooks 2015-10-30 396 G CHI # DET L 0 18 2 6 0.333 1
21456 Aaron Brooks 2015-11-01 4687 G CHI ORL W 0 16 3 11 0.273 3
21152 Aaron Brooks 2015-11-03 4383 G CHI # CHO L 0 17 5 8 0.625 1
20805 Aaron Brooks 2015-11-05 4036 G CHI OKC W 0 13 4 8 0.500 3
X2PA X2P. X3P X3PA X3P. FT FTA FT. ORB DRB TRB AST STL BLK TOV PF PTS GmSc
22408 8 0.375 0 1 0.000 0 0 NA 0 2 2 0 0 0 2 1 6 -0.9
22144 3 1.000 2 6 0.333 0 0 NA 0 1 1 3 1 0 1 4 12 8.5
21987 2 0.500 1 4 0.250 0 0 NA 0 4 4 4 0 0 0 1 5 5.2
21456 6 0.500 0 5 0.000 0 0 NA 2 1 3 1 1 1 1 4 6 1.0
21152 3 0.333 4 5 0.800 0 0 NA 0 0 0 4 1 0 0 4 14 12.6
20805 5 0.600 1 3 0.333 0 0 NA 1 1 2 0 0 0 0 1 9 5.6
FPTS H.A
22408 7.50 H
22144 20.25 A
21987 16.50 A
21456 14.75 H
21152 24.00 A
20805 12.00 H
And for head(TdataPD):
OppTm Date Rk X Opp Result MP FG FGA FG. X2P X2PA X2P. X3P X3PA
2105 ATL 2015-10-27 71 DET L 94-106 240 37 82 0.451 29 55 0.527 8 27
2075 ATL 2015-10-29 41 # NYK W 112-101 240 42 83 0.506 32 59 0.542 10 24
2047 ATL 2015-10-30 13 CHO W 97-94 240 36 83 0.434 28 60 0.467 8 23
2025 ATL 2015-11-01 437 # CHO W 94-92 240 37 88 0.420 30 59 0.508 7 29
2001 ATL 2015-11-03 413 # MIA W 98-92 240 37 90 0.411 30 69 0.435 7 21
1973 ATL 2015-11-04 385 BRK W 101-87 240 37 76 0.487 29 54 0.537 8 22
X3P. FT FTA FT. PTS Opp.FG Opp.FGA Opp.FG. Opp.2P Opp.2PA Opp.2P. Opp.3P
2105 0.296 12 15 0.800 94 37 96 0.385 25 67 0.373 12
2075 0.417 18 26 0.692 112 38 93 0.409 32 64 0.500 6
2047 0.348 17 22 0.773 97 36 88 0.409 24 58 0.414 12
2025 0.241 13 14 0.929 94 32 86 0.372 18 49 0.367 14
2001 0.333 17 22 0.773 98 38 86 0.442 33 58 0.569 5
1973 0.364 19 24 0.792 101 36 83 0.434 31 62 0.500 5
Opp.3PA Opp.3P. Opp.FT Opp.FTA Opp.FT. Opp.PTS
2105 29 0.414 20 26 0.769 106
2075 29 0.207 19 21 0.905 101
2047 30 0.400 10 13 0.769 94
2025 37 0.378 14 15 0.933 92
2001 28 0.179 11 16 0.688 92
1973 21 0.238 10 13 0.769 87
If there is way to truncate the output from dput(head(___)), I am not familiar with it. It appears that simply erasing the excess characters would remove entire variables from the dataset.
It would help if you posted your data (or a working subset of it) and a little more detail on how you are trying to merge, but if I understand what you are trying to do, you want each final data record to have individual stats for each player on a particular date followed by the player's team's stats for that date. In this case, you should have a team column in the Player table that identifies the player's team, and then join the two tables on the composite key Date and Team by setting the by= attribute in merge:
merge(PData, TData, by=c("Date", "Team"))
The fact that the data frames are of different lengths doesn't matter--this is exactly what join/merge operations are for.
For an alternative to merge(), you might check out the dplyr package join functions at https://cran.r-project.org/web/packages/dplyr/vignettes/two-table.html
rate len ADT trks sigs1 slim shld lane acpt itg lwid hwy
1 4.58 4.99 69 8 0.20040080 55 10 8 4.6 1.20 12 FAI
2 2.86 16.11 73 8 0.06207325 60 10 4 4.4 1.43 12 FAI
3 3.02 9.75 49 10 0.10256410 60 10 4 4.7 1.54 12 FAI
4 2.29 10.65 61 13 0.09389671 65 10 6 3.8 0.94 12 FAI
5 1.61 20.01 28 12 0.04997501 70 10 4 2.2 0.65 12 FAI
6 6.87 5.97 30 6 2.00750419 55 10 4 24.8 0.34 12 PA
7 3.85 8.57 46 8 0.81668611 55 8 4 11.0 0.47 12 PA
8 6.12 5.24 25 9 0.57083969 55 10 4 18.5 0.38 12 PA
9 3.29 15.79 43 12 1.45333122 50 4 4 7.5 0.95 12 PA
I got a question in adding a new column, my data frame is called highway1,and i want to add a column named S/N, as slim divided by acpt, what can I do?
Thanks
> mydf$SN <- mydf$slim/mydf$acpt
> mydf
rate len ADT trks sigs1 slim shld lane acpt itg lwid hwy SN
1 4.58 4.99 69 8 0.20040080 55 10 8 4.6 1.20 12 FAI 11.956522
2 2.86 16.11 73 8 0.06207325 60 10 4 4.4 1.43 12 FAI 13.636364
3 3.02 9.75 49 10 0.10256410 60 10 4 4.7 1.54 12 FAI 12.765957
4 2.29 10.65 61 13 0.09389671 65 10 6 3.8 0.94 12 FAI 17.105263
5 1.61 20.01 28 12 0.04997501 70 10 4 2.2 0.65 12 FAI 31.818182
6 6.87 5.97 30 6 2.00750419 55 10 4 24.8 0.34 12 PA 2.217742
7 3.85 8.57 46 8 0.81668611 55 8 4 11.0 0.47 12 PA 5.000000
8 6.12 5.24 25 9 0.57083969 55 10 4 18.5 0.38 12 PA 2.972973
9 3.29 15.79 43 12 1.45333122 50 4 4 7.5 0.95 12 PA 6.666667
I hope an explanation is not necessary for the above.
While $ is the preferred route, you can also consider cbind.
First, create the numeric vector and assign it to SN:
SN <- Data[,6]/Data[,9]
Now you use cbind to append the numeric vector as a column to the existing data frame:
Data <- cbind(Data, SN)
Again, using the dollar operator $ is preferred, but it doesn't hurt seeing what an alternative looks like.