Creating new columns from information contained in other row/column combinations

I've got a data frame with matchup information (Team, Opponent), as well as the betting spread for each game at different sportsbooks. There is one row for each team, so there are two rows per game. As an example, see the data frame below:
example <- data.frame(Team = c("Tennessee","Vanderbilt"),
                      Opponent = c("Vanderbilt","Tennessee"),
                      PointsBet = c(-13, 13),
                      DraftKings = c(-12.5, 12.5))
        Team   Opponent PointsBet DraftKings
1  Tennessee Vanderbilt       -13      -12.5
2 Vanderbilt  Tennessee        13       12.5
What I'm trying to do is create "Opponent_PointsBet" and "Opponent_DraftKings" columns, so that each row carries not only the team's betting spread but also the opponent's. In a small example like this, it's easy to do manually, but my actual data set has hundreds of rows and about 25 other columns, each of which I'd like to copy.
Is it possible to take one row of data for a specific "Team", and apply those columns as new columns in the row of data where that team is being identified as the "Opponent"? My output would look like this:
        Team   Opponent PointsBet DraftKings Opp_PointsBet Opp_DraftKings
1  Tennessee Vanderbilt       -13      -12.5            13           12.5
2 Vanderbilt  Tennessee        13       12.5           -13          -12.5
One thing to note: the columns I'd like to duplicate aren't always opposites, so I can't simply multiply the value by -1 to get the Opp_ column.

We can create the two columns in base R: build a position index matching each 'Team' to its row as 'Opponent', and use it to rearrange the column values in 'PointsBet' and 'DraftKings' into the new columns.
nm1 <- names(example)[3:4]
i1 <- with(example,match(Team, Opponent))
example[paste0("Opp_", nm1)] <- lapply(example[nm1], function(x) x[i1])
example
#        Team   Opponent PointsBet DraftKings Opp_PointsBet Opp_DraftKings
#1  Tennessee Vanderbilt       -13      -12.5            13           12.5
#2 Vanderbilt  Tennessee        13       12.5           -13          -12.5
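If you prefer the tidyverse, a self-join sketch along the same lines (assuming the dplyr package; like the match() approach, it assumes each team appears in exactly one row). Note that dplyr's suffix argument appends rather than prepends, so the new columns come out as PointsBet_Opp rather than Opp_PointsBet:
library(dplyr)

example %>%
  left_join(example, by = c("Opponent" = "Team"), suffix = c("", "_Opp")) %>%
  select(-Opponent_Opp)   # drop the duplicated opponent-of-opponent column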


How to use group_by() to display earliest date

I have a tibble called master_table that is 488 rows by 9 variables. The two relevant variables are wrestler_name and reign_begin. Certain wrestler_name values repeat. I want to reduce the rows to one instance of each unique wrestler_name value, chosen by the earliest reign_begin date.
In the sample slice of the tibble I'm working from, the end goal would be to have just five rows instead of eleven, and the single Ric Flair row, for example, would have a reign_begin date of 9/17/1981, the earliest of his four reign_begin values.
I thought that group_by would make the most sense, but the more I think about it and try to use it, the more I think it might not be the right path. Here are some things I tried:
master_table %>%
  group_by(wrestler_name) %>%
  tibbletime::filter_time(reign_begin, 'start' ~ 'start')
# Trying to get it to just keep the first date it finds for each wrestler_name group, but this did not work.
master_table_2 <- master_table %>%
  group_by(wrestler_name) %>%
  filter(reign_begin)
# I know that this would not work, but it's the place where I'm basically stuck.
edit: Per request, here is head(master_table). It contains slightly different data, but it still illustrates the issue:
1 Ric Flair NWA World Heavyweight Championship 40 8 69 1991-01-11 1991-03-21
2 Sting NWA World Heavyweight Championship 39 1 188 1990-07-07 1991-01-11
3 Ric Flair NWA World Heavyweight Championship 38 7 426 1989-05-07 1990-07-07
4 Ricky Steamboat NWA World Heavyweight Championship 37 1 76 1989-02-20 1989-05-07
5 Ric Flair NWA World Heavyweight Championship 36 6 452 1987-11-26 1989-02-20
6 Ronnie Garvin NWA World Heavyweight Championship 35 1 62 1987-09-25 1987-11-26
city_state country
1 East Rutherford, New Jersey USA
2 Baltimore, Maryland USA
3 Nashville, Tennessee USA
4 Chicago, Illinois USA
5 Chicago, Illinois USA
6 Detroit, Michigan USA
The common way to do this for databases involves a join:
earliest <- master_table %>%
  group_by(wrestler_name) %>%
  summarise(reign_begin = min(reign_begin))

master_table_2 <- master_table %>%
  inner_join(earliest, by = c("wrestler_name", "reign_begin"))
No filter is required, as an inner join only keeps matching rows.
The approach above is often required for databases because of how they compute summaries. But as @Martin_Gal suggests, R can handle this differently because it holds the data in memory:
master_table_2 <- master_table %>%
  group_by(wrestler_name) %>%
  filter(reign_begin == min(reign_begin))
You may also find the lubridate package helpful when working with dates.
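If you have dplyr 1.0.0 or later, slice_min() is a convenient shortcut for the same per-group filter; a sketch, assuming reign_begin is stored as a Date (or anything else that sorts chronologically):
library(dplyr)

master_table_2 <- master_table %>%
  group_by(wrestler_name) %>%
  slice_min(reign_begin, n = 1, with_ties = FALSE) %>%  # keep the earliest reign per wrestler
  ungroup()
Setting with_ties = FALSE guarantees exactly one row per wrestler even if two reigns share the same earliest date.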

Subsetting to Output only certain CHAR names in a single column

I want to keep all rows where the "conm" column contains certain bank names. As you can tell from the code below, I have tried using subset() to do this, but to no avail.
CMPSTPRFT12 <- subset(CMPSPRFT11, conm = MORGUARD CORP | conm = LEHMAN BROTHERS HOLDINGS INC)
I expect the output in RStudio to show only the rows where the bank-name column matches certain banks, not all of them. I want SunTrust, Lehman Brothers, Morgan Stanley, Goldman Sachs, PennyMac, Bank of America, and Fannie Mae.
Please see other posts on how to phrase your questions more helpfully for others, e.g. How to make a great R reproducible example.
You can use dplyr and filter.
df <- data.frame(bank = letters[1:10],
                 value = 10:19)
df %>% filter(bank == 'a' | bank == 'b')
  bank value
1    a    10
2    b    11
banks <- c('d','g','j')
df %>% filter(bank %in% banks)
  bank value
1    d    13
2    g    16
3    j    19
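The same idea works in base R, which is where the original attempt went wrong: subset() needs quoted strings, == (not =) for comparison, and %in% for a set of values. A sketch with hypothetical conm values -- check how the names are actually spelled in your data:
# The bank names below are guesses at the conm spellings; adjust them to your data.
banks <- c("SUNTRUST BANKS INC", "LEHMAN BROTHERS HOLDINGS INC", "MORGAN STANLEY")
CMPSTPRFT12 <- subset(CMPSPRFT11, conm %in% banks)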

How do I replace values in an R dataframe column with a corresponding value?

Ok, so I have a dataframe that I downloaded from Pew Research Center. One of the columns (called 'cregion') contains a series of numbers from 1-56, with each number corresponding to a geographic location in the U.S. Most of these locations are states, and the additional 6 are at the sub-state level. So, for example, the number '1' corresponds to 'Alabama', and '11' corresponds to the 'District Of Columbia'.
What I'd like to do is replace each of those numbers in the 'cregion' column with the ACTUAL name of the region it corresponds to. Unfortunately, there is no column in this data frame that I can use to swap the values, as the key for which number corresponds to which region exists completely separately (word document). I'm new to R and while I've been searching for a few hours for the best way to go about this, I can't seem to find a method that would work (or I just don't understand the explanation). Can anybody suggest a method to me?
If you have a vector of the state names as strings called statevec, whose ith element corresponds to cregion i, and your data frame is named dat, just do:
dat <- data.frame(cregion = sample(1:50), stuff = runif(50))
head(dat)
#  cregion       stuff
#1      25 0.665843896
#2      11 0.144631131
#3      13 0.691616240
#4      28 0.507454243
#5       9 0.416535139
#6      30 0.004196311
statevec <- state.name
dat$cregion <- statevec[dat$cregion]
head(dat)
#     cregion       stuff
#1   Missouri 0.665843896
#2     Hawaii 0.144631131
#3   Illinois 0.691616240
#4     Nevada 0.507454243
#5    Florida 0.416535139
#6 New Jersey 0.004196311
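One caveat: state.name covers only the 50 states, and positional indexing assumes code i is the ith state alphabetically. Since the Pew codes run 1-56 and include six sub-state regions, a safer sketch is a named lookup vector keyed by the code itself; only the two pairs quoted in the question are filled in here, and the rest would come from the Word-document codebook:
# Hypothetical lookup -- fill in all 56 code/name pairs from the codebook.
region_lookup <- c("1"  = "Alabama",
                   "11" = "District Of Columbia")
dat$cregion <- unname(region_lookup[as.character(dat$cregion)])
Indexing by name rather than by position means gaps or out-of-order codes can't silently shift the mapping.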

How to optimize for loops in extremely large dataframe

I have a dataframe "x" with 5.9 million rows; the relevant columns are idnumber (integer), compdate (integer), and judge (character), representing individual cases completed in an administrative court. The data was imported from a Stata dataset, and the date field came in as an integer, which is fine for my purposes. I want to create a caseload variable by calculating the number of cases completed by the same judge within the 30-day window ending on the completion date of the case at issue.
Here are the first 34 rows of data:
idnumber compdate judge
1 9615 JVC
2 15316 BAN
3 15887 WLA
4 11968 WFN
5 15001 CLR
6 13914 IEB
7 14760 HSD
8 11063 RJD
9 10948 PPL
10 16502 BAN
11 15391 WCP
12 14587 LRD
13 10672 RTG
14 11864 JCW
15 15071 GMR
16 15082 PAM
17 11697 DLK
18 10660 ADP
19 13284 ECC
20 13052 JWR
21 15987 MAK
22 10105 HEA
23 14298 CLR
24 18154 MMT
25 10392 HEA
26 10157 ERH
27 9188 RBR
28 12173 JCW
29 10234 PAR
30 10437 ADP
31 11347 RDW
32 14032 JTZ
33 11876 AMC
34 11470 AMC
Here's what I came up with. For each record I take a subset of the data for that particular judge, then subset the cases decided in the 30-day window, and assign the number of rows of that subsetted dataframe to the caseload variable for the subject case, as follows:
for(i in 1:length(x$idnumber)){
  e <- x$compdate[i]
  f <- e - 29
  a <- x[x$judge==x$judge[i] & !is.na(x$compdate),]
  b <- a[a$compdate<=e & a$compdate>=f,]
  x$caseload[i] <- length(b$idnumber)
}
It is working, but it is taking extremely long to complete. How can I optimize this or do it more simply? Sorry, I'm very new to R and to programming -- I'm a law professor trying to analyze court data. Your help is appreciated. Thanks.
Ken
You don't have to loop through every row. You can do operations on the entire column at once. First, create some data:
# Create some data.
n <- 6e6                                                  # cases
judges <- apply(combn(LETTERS,3), 2, paste0, collapse='') # about 2600 judges
set.seed(1)
x <- data.frame(idnumber = 1:n,
                judge    = sample(judges, n, replace=TRUE),
                compdate = Sys.Date() + round(runif(n, 1, 120)))
Now, you can make a rolling window function, and run it on each judge.
# Sort by judge, then date.
x <- x[order(x$judge, x$compdate),]
# A little rolling-window function: findInterval(y - window, y) counts the
# dates at or before y[i] - window, so subtracting it from i counts the
# cases falling inside the 30-day window ending at y[i].
rolling.window <- function(y, window=30) seq_along(y) - findInterval(y - window, y)
# Run the little function on each judge.
x$workload <- unlist(by(x$compdate, x$judge, rolling.window))
I don't have much experience with rolling calculations, but...
Calculate this per day, not per case (since it will be the same for all cases completed on the same day).
Calculate a cumulative sum of the number of cases, then take the difference between its current value and its value 31 days ago (or min{daysAgo : daysAgo > 30}, since cases are not resolved every day).
It's probably fastest to use a data.table. This is my attempt, using @nograpes's simulated data. Comments start with #.
require(data.table)
DT <- data.table(x)
DT[, compdate := as.integer(compdate)]
setkey(DT, judge, compdate)
# count cases for each day
ldt <- DT[, .N, by='judge,compdate']
# cumulative sum of counts
ldt[, nrun := cumsum(N), by=judge]
# see how far to look back
ldt[, lookbk := sapply(1:.N, function(i){
  z     <- compdate[i] - compdate[i:1]
  older <- which(z > 30)
  if (length(older)) min(older) - 1L else NA_integer_
}), by=judge]
# compute cumsum(today) - cumsum(more than 30 days ago)
ldt[, wload := sapply(1:.N, function(i)
  nrun[i] - ifelse(is.na(lookbk[i]), 0, nrun[i - lookbk[i]])
)]
On my laptop, this takes under a minute. Run this command to see the output for one judge:
print(ldt['XYZ'],nrow=120)
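A further option worth sketching (my own illustration, not from either answer) is a data.table non-equi join, available in data.table 1.9.8+. It reuses the DT built above, where compdate is already an integer, and counts each case's window directly:
# For each case, count same-judge cases completed in [compdate-29, compdate].
win <- DT[, .(judge, lo = compdate - 29L, hi = compdate)]
cnt <- DT[win, on = .(judge, compdate >= lo, compdate <= hi), .N, by = .EACHI]
# cnt has one row per row of win, in the same order, so:
DT[, caseload := cnt$N]
Every case matches at least itself, so there are no empty groups to worry about.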

What's the smart way to aggregate data?

Suppose there is a dataset of different regions, each region a subset of a state, and some outcome variable:
regions <- c("Michigan, Eastern",
             "Michigan, Western",
             "Minnesota",
             "Mississippi, Northern",
             "Mississippi, Southern",
             "Missouri, Eastern",
             "Missouri, Western")
set.seed(123)
outcome <- rpois(7, 12)
testset <- data.frame(regions, outcome)
                regions outcome
1     Michigan, Eastern      10
2     Michigan, Western      11
3             Minnesota      17
4 Mississippi, Northern      12
5 Mississippi, Southern      12
6     Missouri, Eastern      17
7     Missouri, Western      13
A useful tool would aggregate the regions within each state, taking the sum, mean, maximum, etc. of outcome, and generate a new data frame indexed by state. A sum, for example, would output this:
        state outcome
1    Michigan      21
3   Minnesota      17
4 Mississippi      24
6    Missouri      30
The aggregate() function won't solve this problem on its own. Is there something else in R built for this? It seems like grep could be used to generate the new "states" column as part of an application-specific program. It seems like this would already be out there somewhere, though.
The reason this isn't straightforward is that the structure of your data is not consistent, so you couldn't build a library function around it.
Your region column is essentially an index column, and you want to index across only part of it. tapply is designed for this, and there's no reason for a built-in function that handles this specific scenario automatically. You can do it without creating a new column, though:
tapply(outcome, gsub(",.*$", "", testset$regions), sum)
The gsub() simply strips the comma and everything after it, leaving the state name to serve as the index.
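If you do want an explicit state column, for instance to keep using aggregate(), a minimal sketch along the same lines:
# Derive the state, then aggregate outcome by it.
testset$state <- gsub(",.*$", "", testset$regions)
aggregate(outcome ~ state, data = testset, FUN = sum)
#         state outcome
# 1    Michigan      21
# 2   Minnesota      17
# 3 Mississippi      24
# 4    Missouri      30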
