Matching unequal data frames on a unique identifier - r

So I have all this data from the EPL and I'm currently trying to create a column for a teams "form" based on their last five games. A win counts at 1 point, a draw as .5, and a loss zero. I have a loop to do this for a single teams at a time but when I try to create one to merge them all together I can not get it to work for some reason. My data comes from: http://www.football-data.co.uk/englandm.php
For the purpose for making it more simple I will use only the data from the 2013/2014 season for the Premier League.
I import the data from the excel sheet and labeled it PL_20
library(RCurl)
URL <- "www.football-data.co.uk/mmz4281/1415/E0.csv"
x <- getURL(URL, ssl.verifypeer = FALSE)
PL_20 <- read.csv(textConnection(x))
# cleaning data, getting rid of the betting odds
data <- PL_20[,-c(1,10,11,24:65)]
#Get dates all in same date format
data$Date<-as.Date(data$Date, guess_formats(data$Date, "dmy"))
#sorting (used for all the seasons)
sorted <- data[order(data$Date,decreasing=TRUE),]
sorted$index <- seq(1:nrow(sorted))
# making a column for form
teams<-as.matrix(unique(data$HomeTeam))
test<-sorted
z<-1 # controls which team the form is being found for. Ideally I would have this cycle
# through all of the teams
current<-subset(sorted, HomeTeam==as.character(teams[z]) | AwayTeam==as.character(teams[z]))
current$h.form<-0
current$a.form<-0
current$recent<-0
for (i in 1:nrow(current)){
if((as.character(current[i,2])==as.character(teams[z]) && as.character(current[i,6])=="H") || (as.character(current[i,3])==as.character(teams[z]) && as.character(current[i,6])=="A")){
# current[i,7]<- "W"
current[i,24]<- 1
}else{
if((as.character(current[i,2])==as.character(teams[z]) && as.character(current[i,6])=="D") || (as.character(current[i,3])==as.character(teams[z]) && as.character(current[i,6])=="D"))
{
#current[i,7]<- "D"
current[i,24]<- .5
}else{
if((as.character(current[i,2])==as.character(teams[z]) && as.character(current[i,6])=="A") || (as.character(current[i,3])==as.character(teams[z]) && as.character(current[i,6])=="H"))
{
# current[i,7]<- "L"
current[i,24]<- 0
}
}
}
}
d<-0
for (d in 0:(nrow(current)-6))
{
if (as.character(current[nrow(current)-(5+d),2])==as.character(teams[z])){
current[(nrow(current)-(5+d)),22]<-as.numeric(sum(current[(nrow(current)-(4+d)):(nrow(current)-d),24]))
}else{
if(as.character(current[nrow(current)-(5+d),3])==as.character(teams[z]))
{
current[(nrow(current)-(5+d)),23]<-sum(current[(nrow(current)-(4+d)):(nrow(current)-d),24])
}
}
}
Now these ugly loops create a data frame called current which has three columns at the end: h.form, a.form, and recent. h.form is the form for the designated home team for that game and a.form is the form for the designated away team for that game. Recent is just the outcome of that game.
I would like to be able to combine all of the teams together so there is one observation per game and both h.form and a.form are populated with the correct values for their corresponding teams.
Your help is appreciated and if you have suggestions on how to clean up these loops that would be helpful as well.

Related

Attribute labels to scored items [function]

I don't know if the subject has already been find but here my problem :
I have a dataset from behaviors personality items scored from 1 to 8 and I would like to convert each scored according a range (e.g. 1-2 = Rare ; 3-5 = Occasionally ; 6-8 = Frequent).
I succeed to create new columns and put labels in it but I don't understand why I have same repetition in others columns :
Beh_data[,c(2,3,4,32,33,34)
enter image description here
You can see that columns with "_class" had the same outputs, and there are mistakes about correct match between labels and scores (e.g. row4 -- 8 put as Occasionally)
Here the function code :
l = unlist(names(Beh_data[,2:28]))
for (j in 1:length(l)) {
cl[j] = list(paste(l[j],"class",sep="_"))
for (k in 1:length(cl)) {
Beh_data[,cl[[k]] ] <- cl[[k]]
for(i in 1:nrow(Beh_data)){
Beh_data[,cl[[k]] ][i] <-ifelse(Beh_data[,l[j] ][i]<3, "Rare", Beh_data[,cl[[k]] ][i])
Beh_data[,cl[[k]] ][i] <-ifelse(Beh_data[,l[j] ][i]>2 & Beh_data[,l[j] ][i]<6, "Occasionally", Beh_data[,cl[[k] ] ][i])
Beh_data[,cl[[k]] ][i] <-ifelse(Beh_data[,l[j] ][i]>5, "Frequent", Beh_data[,cl[[k]] ][i])
}
}
}
I tried to see if it's could from a wrong annotation as cl[[k]] ] or something like this but it steels doesn't work
Do you have any ideas please ?
If you're open to a dplyr solution, I think its across and case_when functions are helpful here. It should also run faster since it's vectorized. This will create new columns like aff_sum_class which use the categorization you've specified.
library(dplyr)
Beh_data |>
mutate(across(aff_sum:qui_sum,
~case_when(. >= 6 ~ "Frequent",
. >= 3 ~ "Occasionally",
TRUE ~ "Rare"),
.names = "{.col}_class"))

How to create a loop that changes part of a column name in a data frame

I am trying to find Cronbach's Alpha for survey data containing a series of multi-item measures. Rather than have to manually write out every single multi-item measure, it looks like something a loop should be able to manage far more effectively, but it needs to change only part of the column name, according to the question number.
The basic idea as it currently sits in my head would be...
for (N in 4:22) {
ytqN <- data.frame(YT_Data$QNa, YT_Data$QNb, YT_Data$QNc)
alpha(ytqN)
}
The loop would then create new data frames for each multi item measure and run Cronbach's Alpha as it goes.
This doesn't work though. :(
ytq4 <- data.frame(YT_Data$Q4a, YT_Data$Q4b, YT_Data$Q4c)
alpha(ytq4)
ytq5 <- data.frame(YT_Data$Q5a, YT_Data$Q5b, YT_Data$Q5c)
alpha(ytq5)
ytq6 <- data.frame(YT_Data$Q6a, YT_Data$Q6b, YT_Data$Q6c)
alpha(ytq6)
ytq7 <- data.frame(YT_Data$Q7a, YT_Data$Q7b, YT_Data$Q7c)
alpha(ytq7)
ytq8 <- data.frame(YT_Data$Q8a, YT_Data$Q8b, YT_Data$Q8c)
alpha(ytq8)
ytq9 <- data.frame(YT_Data$Q9a, YT_Data$Q9b, YT_Data$Q9c)
alpha(ytq9)
ytq10 <- data.frame(YT_Data$Q10a, YT_Data$Q10b, YT_Data$Q10c)
alpha(ytq10)
ytq11 <- data.frame(YT_Data$Q11a, YT_Data$Q11b, YT_Data$Q11c)
alpha(ytq11)
ytq12 <- data.frame(YT_Data$Q12a, YT_Data$Q12b, YT_Data$Q12c)
alpha(ytq12)
ytq13 <- data.frame(YT_Data$Q13a, YT_Data$Q13b, YT_Data$Q13c)
alpha(ytq13)
ytq14 <- data.frame(YT_Data$Q14a, YT_Data$Q14b, YT_Data$Q14c)
alpha(ytq14)
ytq15 <- data.frame(YT_Data$Q15a, YT_Data$Q15b, YT_Data$Q15c)
alpha(ytq15)
ytq16 <- data.frame(YT_Data$Q16a, YT_Data$Q16b, YT_Data$Q16c)
alpha(ytq16)
ytq17 <- data.frame(YT_Data$Q17a, YT_Data$Q17b, YT_Data$Q17c)
alpha(ytq17)
ytq18 <- data.frame(YT_Data$Q18a, YT_Data$Q18b, YT_Data$Q18c)
alpha(ytq18)
ytq19 <- data.frame(8 - YT_Data$Q19a, YT_Data$Q19b, YT_Data$Q19c)
# Reverse code Q19a
alpha(ytq19)
ytq20 <- data.frame(YT_Data$Q20a, YT_Data$Q20b, YT_Data$Q20c)
alpha(ytq20)
ytq21 <- data.frame(YT_Data$Q21a, YT_Data$Q21b, YT_Data$Q21c)
alpha(ytq21)
ytq22 <- data.frame(YT_Data$Q22a, YT_Data$Q22b, YT_Data$Q22c)
alpha(ytq22)
The desired results would be a single output containing all the Cronbach's Alphas for the multi item measures for questions 4-22 in the data set I am currently working on executed via a single piece of code, rather than have to go question by question.
It's easier to help if you include your data, but I guess this should work:
alpha_list = list()
for(N in 4:22){
ytq = data.frame(YT_Data[paste0("Q",N,"a")],
YT_Data[paste0("Q",N,"b")],
YT_Data[paste0("Q",N,"c")])
alpha_list[[N]] = alpha(ytq)
}
We are using paste0() to create the column names while looping on N. alpha_list will be a list with the results given by alpha()

Function to iterate over list, merging results into one data frame

I've completed the first couple R courses on DataCamp and in order to build up my skills I've decided to use R to prep for fantasy football this season, thus I have began playing around with the nflscrapR package.
With the nflscrapR package, one can pull Game Information using the season_games() function which simply returns a data frame with the gameID, game date, the home and away team abbreviations.
Example:
games.2012 = season_games(2012)
head(games.2012)
GameID date home away season
1 2012090500 2012-09-05 NYG DAL 2012
2 2012090900 2012-09-09 CHI IND 2012
3 2012090908 2012-09-09 KC ATL 2012
4 2012090907 2012-09-09 CLE PHI 2012
5 2012090906 2012-09-09 NO WAS 2012
6 2012090905 2012-09-09 DET STL 2012
Initially I copy and pasted the original function and changed the last digit manually for each season, then rbinded all the seasons into one data frame, games.
games.2012 <- season_games(2012)
games.2013 <- season_games(2013)
games.2014 <- season_games(2014)
games.2015 <- season_games(2015)
games = rbind(games2012,games2013,games2014,games2015)
I'd like to write a function to simplify this process.
My failed attempt:
gameID <- function(years) {
for (i in years) {
games[i] = season_games(years[i])
}
}
With years = list(2012, 2013) for testing purposes, produced the following:
Error in strsplit(headers, "\r\n") : non-character argument Called
from: strsplit(headers, "\r\n")
Thanks in advance!
While #Gregor has an apparent solution, he didn't run it because this wasn't a minimal example. I googled, found, and tried to use this code, and it doesn't work, at least in a non-trivial amount of time.
On the other hand, I took this code from Vivek Patil's blog.
library(XML)
weeklystats = as.data.frame(matrix(ncol = 14)) # Initializing our empty dataframe
names(weeklystats) = c("Week", "Day", "Date", "Blank",
"Win.Team", "At", "Lose.Team",
"Points.Win", "Points.Lose",
"YardsGained.Win", "Turnovers.Win",
"YardsGained.Lose", "Turnovers.Lose",
"Year") # Naming columns
URLpart1 = "http://www.pro-football-reference.com/years/"
URLpart3 = "/games.htm"
#### Our workhorse function ####
getData = function(URLpart1, URLpart3) {
for (i in 2012:2015) {
URL = paste(URLpart1, as.character(i), URLpart3, sep = "")
tablefromURL = readHTMLTable(URL)
table = tablefromURL[[1]]
names(table) = c("Week", "Day", "Date", "Blank", "Win.Team", "At", "Lose.Team",
"Points.Win", "Points.Lose", "YardsGained.Win", "Turnovers.Win",
"YardsGained.Lose", "Turnovers.Lose")
table$Year = i # Inserting a value for the year
weeklystats = rbind(table, weeklystats) # Appending happening here
}
return(weeklystats)
}
I posted this because, it works, you might learn something about web scraping you didn't know, and it runs in 11 seconds.
system.time(weeklystats <- getData(URLpart1, URLpart3))
user system elapsed
0.870 0.014 10.926
You should probably take a look at some popular answers for working with lists, specifically How do I make a list of data frames? and What's the difference between [ and [[?.
There's no reason to put your years in a list. They're just integers, so just do a normal vector.
years = 2012:2015
Then we can get your function to work (we'll need to initialize an empty list before the for loop):
gameID <- function(years) {
games = list()
for (i in years) {
games[[i]] = season_games(years[i])
}
return(games)
}
Read my link above for why we're using [[ with the list and [ with the vector. And we could run it like this:
game_list = gameID(2012:2015)
But this is such a simple function that it's easier to use lapply. Your function is just a wrapper around a for loop that returns a list, and that's precisely what lapply is too. But where your function has season_games hard-coded in, lapply can work with any function.
game_list = lapply(2012:2015, season_games)
# should be the same result as above
In either case, we have the list of data frames and want to combine it into one big data frame. The base R way is rbind with do.call, but dplyr and data.table have more efficient versions.
# pick your favorite
games = do.call(rbind, args = game_list) # base
games = dplyr::bind_rows(game_list)
games = data.table::rbindlist(game_list)

R & xml2: Locate elements by specific text value, store all children values in data.frame

I work with regularly refreshed XML reports and I would like to automate the munging process using R & xml2.
Here's a link to an entire example file.
Here's a sample of the XML:
<?xml version="1.0" ?>
<riDetailEnrolleeReport xmlns="http://vo.edge.fm.cms.hhs.gov">
<includedFileHeader>
<outboundFileIdentifier>f2e55625-e70e-4f9d-8278-fc5de7c04d47</outboundFileIdentifier>
<cmsBatchIdentifier>RIP-2015-00096</cmsBatchIdentifier>
<cmsJobIdentifier>16220</cmsJobIdentifier>
<snapShotFileName>25032.BACKUP.D03152016T032051.dat</snapShotFileName>
<snapShotFileHash>20d887c9a71fa920dbb91edc3d171eb64a784dd6</snapShotFileHash>
<outboundFileGenerationDateTime>2016-03-15T15:20:54</outboundFileGenerationDateTime>
<interfaceControlReleaseNumber>04.03.01</interfaceControlReleaseNumber>
<edgeServerVersion>EDGEServer_14.09_01_b0186</edgeServerVersion>
<edgeServerProcessIdentifier>8</edgeServerProcessIdentifier>
<outboundFileTypeCode>RIDE</outboundFileTypeCode>
<edgeServerIdentifier>2800273</edgeServerIdentifier>
<issuerIdentifier>25032</issuerIdentifier>
</includedFileHeader>
<calendarYear>2015</calendarYear>
<executionType>P</executionType>
<includedInsuredMemberIdentifier>
<insuredMemberIdentifier>ARS001</insuredMemberIdentifier>
<memberMonths>12.13</memberMonths>
<totalAllowedClaims>1000.00</totalAllowedClaims>
<totalPaidClaims>100.00</totalPaidClaims>
<moopAdjustedPaidClaims>100.00</moopAdjustedPaidClaims>
<cSRMOOPAdjustment>0.00</cSRMOOPAdjustment>
<estimatedRIPayment>0.00</estimatedRIPayment>
<coinsurancePercentPayments>0.00</coinsurancePercentPayments>
<includedPlanIdentifier>
<planIdentifier>25032VA013000101</planIdentifier>
<includedClaimIdentifier>
<claimIdentifier>CADULT4SM00101</claimIdentifier>
<claimPaidAmount>100.00</claimPaidAmount>
<crossYearClaimIndicator>N</crossYearClaimIndicator>
</includedClaimIdentifier>
</includedPlanIdentifier>
</includedInsuredMemberIdentifier>
<includedInsuredMemberIdentifier>
<insuredMemberIdentifier>ARS002</insuredMemberIdentifier>
<memberMonths>9.17</memberMonths>
<totalAllowedClaims>0.00</totalAllowedClaims>
<totalPaidClaims>0.00</totalPaidClaims>
<moopAdjustedPaidClaims>0.00</moopAdjustedPaidClaims>
<cSRMOOPAdjustment>0.00</cSRMOOPAdjustment>
<estimatedRIPayment>0.00</estimatedRIPayment>
<coinsurancePercentPayments>0.00</coinsurancePercentPayments>
<includedPlanIdentifier>
<planIdentifier>25032VA013000101</planIdentifier>
<includedClaimIdentifier>
<claimIdentifier></claimIdentifier>
<claimPaidAmount>0</claimPaidAmount>
<crossYearClaimIndicator>N</crossYearClaimIndicator>
</includedClaimIdentifier>
</includedPlanIdentifier>
</includedInsuredMemberIdentifier>
</riDetailEnrolleeReport>
I would like to:
Read in the XML into R
Locate a specific insuredMemberIdentifier
Extract the planIdentifier and all claimIdentifier data associated with the member ID in (2)
Store all text and values for insuredMemberIdentifier, planIdentifier, claimIdentifier, and claimPaidAmount in a data.frame with a row for each unique claim ID (member ID to claim ID is a 1 to many)
So far, I have accomplished 1 and I'm in the ballpark on 2:
## Step 1 ##
ride <- read_xml("/Users/temp/Desktop/RIDetailEnrolleeReport.xml")
## Step 2 -- assume the insuredMemberIdentifier of interest is 'ARS001' ##
memID <- xml_find_all(ride, "//d1:insuredMemberIdentifier[text()='ARS001']", xml_ns(ride))
[I know that I can then use xml_text() to extract the text of the element.]
After the code in Step 2 above, I've tried using xml_parent() to locate the parent node of the insuredMemberIdentifier, saving that as a variable, and then repeating Step 2 for claim info on that saved variable node.
node <- xml_parent(memID)
xml_find_all(node, "//d1:claimIdentifier", xml_ns(ride))
But this just results in pulling all claimIdentifiers in the global file.
Any help/information on how to get to step 4, above, would be greatly appreciated. Thank you in advance.
Apologies for the late response, but for posterity, import data as above using xml2, then parse the xml file by ID, as hinted by har07.
# output object to collect all claims
res <- data.frame(
insuredMemberIdentifier = rep(NA, 1),
planIdentifier = NA,
claimIdentifier = NA,
claimPaidAmount = NA)
# vector of ids of interest
ids <- c('ARS001')
# indexing counter
starti <- 1
# loop through all ids
for (ii in seq_along(ids)) {
# find ii-th id
## Step 2 -- assume the insuredMemberIdentifier of interest is 'ARS001' ##
memID <- xml_find_all(x = ride,
xpath = paste0("//d1:insuredMemberIdentifier[text()='", ids[ii], "']"))
# find node for
node <- xml_parent(memID)
# as har07's comment find claim id within this node
cid <- xml_find_all(node, ".//d1:claimIdentifier", xml_ns(ride))
pid <- xml_find_all(node, ".//d1:planIdentifier", xml_ns(ride))
cpa <- xml_find_all(node, ".//d1:claimPaidAmount", xml_ns(ride))
# add invalid data handling if necessary
if (length(cid) != length(cpa)) {
warning(paste("cid and cpa do not match for", ids[ii]))
next
}
# collect outputs
res[seq_along(cid) + starti - 1, ] <- list(
ids[ii],
xml_text(pid),
xml_text(cid),
xml_text(cpa))
# adjust counter to add next id into correct row
starti <- starti + length(cid)
}
res
# insuredMemberIdentifier planIdentifier claimIdentifier claimPaidAmount
# 1 ARS001 25032VA013000101 CADULT4SM00101 100.00

Ordering Merged data frames

As a fairly new R programmer I seem to have run into a strange problem - probably my inexperience with R
After reading and merging successive files into a single data frame, I find that order does not sort the data as expected.
I have multiple references in each file but each file refers to measurement data obtained at a different time.
Here's the code
library(reshape)
# Enter file name to Read & Save data
FileName=readline("Enter File name:\n")
# Find first occurance of file
for ( round1 in 1 : 6) {
ReadFile=paste(round1,"C_",FileName,"_Stats.csv", sep="")
if (file.exists(ReadFile))
break
}
x = data.frame(read.csv(ReadFile, header=TRUE),rnd=round1)
for ( round2 in (round1+1) : 6) {
#
ReadFile=paste(round2,"C_",FileName,"_Stats.csv", sep="")
if (file.exists(ReadFile)) {
y = data.frame(read.csv(ReadFile, header=TRUE),rnd = round2)
if (round2 == (round1 +1))
z=data.frame(merge(x,y,all=TRUE))
z=data.frame(merge(y,z,all=TRUE))
}
}
ordered = order(z$lab_id)
results = z[ordered,]
res = data.frame( lab=results[,"lab_id"],bw=results[,"ZBW"],wi=results[,"ZWI"],pf_zbw=0,pf_zwi=0,r = results[,"rnd"])
#
# Establish no of samples recorded
nsmpls = length(res[,c("lab")])
# Evaluate Z_scores for Between Lab Results
for ( i in 1 : nsmpls) {
if (res[i,"bw"] > 3 | res[i,"bw"] < -3)
res[i,"pf_zbw"]=1
}
# Evaluate Z_scores for Within Lab Results
for ( i in 1 : nsmpls) {
if (res[i,"wi"] > 3 | res[i,"wi"] < -3)
res[i,"pf_zwi"]=1
}
dd = melt(res, id=c("lab","r"), "pf_zbw")
b = cast(dd, lab ~ r)
If anyone could see why the ordering only works for about 55 of 70 records and could steer me in the right direction I would be obliged
Thanks very much
Check whether z$lab_id is a factor (with is.factor(z$lab_id)).
If it is, try
z$lab_id <- as.character(z$lab_id)
if it is supposed to be a character vector; or
z$lab_id <- as.numeric(as.character(z$lab_id))
if it is supposed to be a numeric vector.
Then order it again.
Ps. I had previously put these in the comments.

Resources