Arrange dataframe for pairwise correlations - r

I am working with data in the following form:
Country Player Goals
"USA" "Tim" 0
"USA" "Tim" 0
"USA" "Dempsey" 3
"USA" "Dempsey" 5
"Brasil" "Neymar" 6
"Brasil" "Neymar" 2
"Brasil" "Hulk" 5
"Brasil" "Luiz" 2
"England" "Rooney" 4
"England" "Stewart" 2
Each row represents the number of goals that a player scored per game, and also contains that player's country. I would like to have the data in the form such that I can run pairwise correlations to see whether being from the same country has some association with the number of goals that a player scores. The data would look like this:
Player_1 Player_2
0 8 # Tim Dempsey
8 5 # Neymar Hulk
8 2 # Neymar Luiz
5 2 # Hulk Luiz
4 2 # Rooney Stewart
(You can ignore the comments, they are there simply to clarify what each row contains).
How would I do this?
table(df$player)
gets me the number of goals per player, but then how do I generate these pairwise combinations?

This is a pretty classic self-join problem. I'm gonna start by summarizing your data to get the total goals for each player. I like dplyr for this, but aggregate or data.table work just fine too.
library(dplyr)
df <- df %>% group_by(Player, Country) %>% dplyr::summarize(Goals = sum(Goals))
> df
Source: local data frame [7 x 3]
Groups: Player
Player Country Goals
1 Dempsey USA 8
2 Hulk Brasil 5
3 Luiz Brasil 2
4 Neymar Brasil 8
5 Rooney England 4
6 Stewart England 2
7 Tim USA 0
Then, using good old merge, we join it to itself on Country. So that we don't get each pair twice (Dempsey, Tim and Tim, Dempsey, not to mention Dempsey, Dempsey), we keep only the rows where Player.x comes alphabetically before Player.y. Since I already loaded dplyr I'll use filter, but subset would do the same thing.
df2 <- merge(df, df, by.x = "Country", by.y = "Country")
df2 <- filter(df2, as.character(Player.x) < as.character(Player.y))
> df2
Country Player.x Goals.x Player.y Goals.y
2 Brasil Hulk 5 Luiz 2
3 Brasil Hulk 5 Neymar 8
6 Brasil Luiz 2 Neymar 8
11 England Rooney 4 Stewart 2
15 USA Dempsey 8 Tim 0
The self-join could be done in dplyr if we made a little copy of the data and renamed the Player and Goals columns so they wouldn't be joined on. Since merge is pretty smart about the renaming, it's easier in this case.
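For readers without dplyr, the same aggregate, self-join, and filter steps can be sketched in base R. This is only a sketch under assumptions: the data frame is rebuilt by hand from the sample in the question, and the names totals, pairs_df, and result are illustrative, not part of the original answer.

```r
# Sample data rebuilt from the question (one row per player per game)
df <- data.frame(
  Country = c("USA", "USA", "USA", "USA", "Brasil", "Brasil",
              "Brasil", "Brasil", "England", "England"),
  Player = c("Tim", "Tim", "Dempsey", "Dempsey", "Neymar", "Neymar",
             "Hulk", "Luiz", "Rooney", "Stewart"),
  Goals = c(0, 0, 3, 5, 6, 2, 5, 2, 4, 2),
  stringsAsFactors = FALSE
)

# Total goals per player
totals <- aggregate(Goals ~ Player + Country, data = df, FUN = sum)

# Self-join on Country; merge() adds the .x and .y suffixes itself
pairs_df <- merge(totals, totals, by = "Country")

# Keep each unordered pair once and drop self-pairs
pairs_df <- pairs_df[pairs_df$Player.x < pairs_df$Player.y, ]

# Two-column layout requested in the question
result <- data.frame(Player_1 = pairs_df$Goals.x, Player_2 = pairs_df$Goals.y)
result
```

From here, cor(result$Player_1, result$Player_2) would give the pairwise correlation the question is after.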

There is probably a smarter way to get from the aggregated data to the pairs, but assuming your data is not too big (national soccer data), you can always do something like:
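A <- aggregate(df$Goals ~ df$Player + df$Country, data = df, sum)  # total goals per player
players_in_c <- table(A[, 2])  # number of players per country
dat <- NULL
for (i in names(players_in_c)) {  # names() also works when Country is not a factor
  count <- players_in_c[[i]]
  if (count < 2) next  # a single player cannot form a pair
  pair <- combn(count, m = 2)  # index pairs within the country
  B <- A[A[, 2] == i, ]
  dat <- rbind(dat, cbind(B[pair[1, ], ], B[pair[2, ], ]))
}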
dat
> dat
df$Player df$Country df$Goals df$Player df$Country df$Goals
1 Hulk Brasil 5 Luiz Brasil 2
1.1 Hulk Brasil 5 Neymar Brasil 8
2 Luiz Brasil 2 Neymar Brasil 8
4 Rooney England 4 Stewart England 2
6 Dempsey USA 8 Tim USA 0

Related

put the resulting values from for loop into a table in r [duplicate]

I'm trying to calculate the total number of matches played by each team in the year 2019 and put them in a table along with the corresponding team names
teams<-c("Sunrisers Hyderabad", "Mumbai Indians", "Gujarat Lions", "Rising Pune Supergiants",
"Royal Challengers Bangalore","Kolkata Knight Riders","Delhi Daredevils",
"Kings XI Punjab", "Deccan Chargers","Rajasthan Royals", "Chennai Super Kings",
"Kochi Tuskers Kerala", "Pune Warriors", "Delhi Capitals", " Gujarat Lions")
for (j in teams) {
  print(j)
  ipl_table %>%
    filter(season == 2019 & (team1 == j | team2 == j)) %>%
    summarise(match_count = n()) -> kl
  print(kl)
  match_played <- data.frame(Teams = teams, Match_count = kl)
}
The match count for the last team (i.e. Gujarat Lions) is 0, and it's filling in 0's for all the other teams as well.
The output match_played can be found on the link given below.
I'd be really glad if someone could help me regarding this error as I'm very new to R.
Filter for the particular season, get the data in long format, and then count the number of matches.
library(dplyr)
matches %>%
filter(season == 2019) %>%
tidyr::pivot_longer(cols = c(team1, team2), values_to = 'team_name') %>%
count(team_name) -> result
result
# team_name n
# <chr> <int>
#1 Chennai Super Kings 17
#2 Delhi Capitals 16
#3 Kings XI Punjab 14
#4 Kolkata Knight Riders 14
#5 Mumbai Indians 16
#6 Rajasthan Royals 14
#7 Royal Challengers Bangalore 14
#8 Sunrisers Hyderabad 15
Here is an example
library(tidyr)
df_2019 <- matches[matches$season == 2019, ] # get the season you need
df_long <- gather(df_2019, Team_id, Team_Name, team1:team2) # Make it long format
final_count <- data.frame(t(table(df_long$Team_Name)))[-1] # count the number of matches
names(final_count) <- c("Team", "Matches")
Team Matches
1 Chennai Super Kings 17
2 Delhi Capitals 16
3 Kings XI Punjab 14
4 Kolkata Knight Riders 14
5 Mumbai Indians 16
6 Rajasthan Royals 14
7 Royal Challengers Bangalore 14
8 Sunrisers Hyderabad 15
Or by using base R
final_count <- data.frame(t(table(c(df_2019$team1, df_2019$team2))))[-1]
names(final_count) <- c("Team", "Matches")
final_count
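As for why the loop in the question fills in zeros: match_played is rebuilt on every pass from the current team's count alone, so after the last pass every row carries the last team's value. If you want to keep the per-team approach, a minimal base R sketch that stores one count per team might look like this. The matches rows below are made up for illustration and teams is shortened.

```r
# Hypothetical toy data, not the real IPL table
matches <- data.frame(
  season = c(2019, 2019, 2019, 2018),
  team1 = c("Mumbai Indians", "Delhi Capitals", "Mumbai Indians", "Mumbai Indians"),
  team2 = c("Delhi Capitals", "Chennai Super Kings", "Chennai Super Kings", "Delhi Daredevils"),
  stringsAsFactors = FALSE
)
teams <- c("Mumbai Indians", "Delhi Capitals", "Chennai Super Kings", "Gujarat Lions")

# One count per team, collected into a vector instead of being overwritten
counts <- vapply(teams, function(team) {
  sum(matches$season == 2019 & (matches$team1 == team | matches$team2 == team))
}, integer(1))

# Build the result once, after all counts are known
match_played <- data.frame(Teams = teams, Match_count = unname(counts))
match_played
```

A team with no 2019 matches (here Gujarat Lions) still gets a row with a count of 0, which the count-based answers above would leave out.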

Mutate DF1 based on DF2 with a check

Newbie here with a dataframe/mutate question. I want to update a dataframe (df1) based on data in another dataframe (df2). For one-offs I've used mutate, so I figure this is the way to go. Additionally, I would like a check column added (TRUE/FALSE) to indicate whether the field in df1 was updated.
For Example..
df1-
State
<chr>
1 N.Y.
2 FL
3 AL
4 MS
5 IL
6 WS
7 WA
8 N.J.
9 N.D.
10 S.D.
11 CALL
df2
State New_State
<chr> <chr>
1 N.Y. New York
2 FL Florida
3 AL Alabama
4 MS Mississippi
5 IL Illinois
6 WS Wisconsin
7 WA Washington
8 N.J. New Jersey
9 N.D. North Dakota
10 S.D. South Dakota
11 CAL California
I want the output to look like this
df3
New_State Test
<chr>
1 New York TRUE
2 Florida TRUE
3 Alabama TRUE
4 Mississippi TRUE
5 Illinois TRUE
6 Wisconsin TRUE
7 Washington TRUE
8 New Jersey TRUE
9 North Dakota TRUE
10 South Dakota TRUE
11 CALL FALSE
In essence, I want R to read the data in df1 and replace each abbreviation with the full state name from its match in df2. Lastly, if the value in df1 was updated, mark it as TRUE (N.Y. to New York), and as FALSE if it was not (CALL vs CAL).
Thanks in advance for any and all help.
This should give you the result you're looking for:
match_vec <- match(df1$State, table = df2$State)
This vector matches each abbreviated state name in df1 to its row position in df2. Where there's no match, you end up with a missing value (NA).
Then the following code using dplyr should produce the df3 you requested.
library(dplyr)
df3 <- df1 %>%
  mutate(New_State = df2$New_State[match_vec]) %>%
  mutate(Test = !is.na(match_vec)) %>%
  mutate(New_State = ifelse(is.na(New_State),
                            State, New_State)) %>%
  select(New_State, Test)
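To see what match() returns and how the NA drives both columns, here is a small base R sketch on a three-row excerpt of the question's data. The excerpt is an assumption made for brevity, and the logic is the same as in the dplyr pipeline.

```r
# Three-row excerpt of df1 and df2 from the question
df1 <- data.frame(State = c("N.Y.", "FL", "CALL"), stringsAsFactors = FALSE)
df2 <- data.frame(
  State = c("N.Y.", "FL", "CAL"),
  New_State = c("New York", "Florida", "California"),
  stringsAsFactors = FALSE
)

# Position of each df1 state in df2; NA where there is no exact match
match_vec <- match(df1$State, df2$State)

df3 <- data.frame(
  # Keep the original value when no replacement was found
  New_State = ifelse(is.na(match_vec), df1$State, df2$New_State[match_vec]),
  # TRUE when the value was replaced
  Test = !is.na(match_vec),
  stringsAsFactors = FALSE
)
df3
```

Because match() compares strings exactly, "CALL" does not match "CAL" and stays unchanged with Test set to FALSE.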

How to select specific rows from a split list in R based on column condition

I am new to R and to programming in general and am looking for feedback on how to approach what is probably a fairly simple problem in R.
I have the following dataset:
df <- data.frame(county = rep(c("QU","AN","GY"), 3),
park = (c("Downtown","Queens", "Oakville","Squirreltown",
"Pinhurst", "GarbagePile","LottaTrees","BigHill",
"Jaynestown")),
hectares = c(12,42,6,18,92,6,4,52,12))
df<-transform(df, parkrank = ave(hectares, county,
FUN = function(x) rank(x, ties.method = "first")))
Which returns a dataframe looking like this:
county park hectares parkrank
1 QU Downtown 12 2
2 AN Queens 42 1
3 GY Oakville 6 1
4 QU Squirreltown 18 3
5 AN Pinhurst 92 3
6 GY GarbagePile 6 2
7 QU LottaTrees 4 1
8 AN BigHill 52 2
9 GY Jaynestown 12 3
I want to use this to create a two-column data frame that lists each county and the park name corresponding to a specific rank (e.g. if when I call my function I add "2" as a variable, shows the second biggest park in each county).
I am very new to R and programming and have spent hours looking over the built in R help files and similar questions here on stack overflow but I am clearly missing something. Can anyone give a simple example of where to begin? It seems like I should be using split then lapply or maybe tapply, but everything I try leaves me very confused :(
Thanks.
Try,
df2 <- function(A, x) {
  # A is the data.frame and x is the rank number
  df <- A[A[, 4] == x, ]
  return(df)
}
> df2(df,2)
county park hectares parkrank
1 QU Downtown 12 2
6 GY GarbagePile 6 2
8 AN BigHill 52 2
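The function above returns all four columns. If only county and park are wanted, as the question asks, a variant that selects columns by name could look like the sketch below. The name park_at_rank is hypothetical, and the data is rebuilt exactly as in the question.

```r
# Data and ranking as in the question
df <- data.frame(
  county = rep(c("QU", "AN", "GY"), 3),
  park = c("Downtown", "Queens", "Oakville", "Squirreltown", "Pinhurst",
           "GarbagePile", "LottaTrees", "BigHill", "Jaynestown"),
  hectares = c(12, 42, 6, 18, 92, 6, 4, 52, 12),
  stringsAsFactors = FALSE
)
df <- transform(df, parkrank = ave(hectares, county,
  FUN = function(x) rank(x, ties.method = "first")))

# Hypothetical helper: rows with the requested rank, county and park columns only
park_at_rank <- function(data, rank) {
  data[data$parkrank == rank, c("county", "park")]
}

park_at_rank(df, 2)
```

Referring to parkrank by name instead of by position (A[, 4]) keeps the function working if the column order changes.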

How to sort alphabetically rows of a data frame? [duplicate]

I am trying to sort c alphabetically if x[i] == x[i+1]. I used the order() function, but it changes the x column as well. I want to order the entire row:
best <- function(state) {
  HospitalName <- vector()
  StateName <- vector()
  HeartAttack <- vector()
  k <- 1
  outcome <- read.csv("outcome-of-care-measures.csv", colClasses = "character")
  temp <- (outcome[, c(2, 7, 11, 17, 23)])
  for (i in 1:nrow(temp)) {
    if (identical(state, temp[i, 2]) == TRUE) {
      HospitalName[k] <- temp[i, 1]
      StateName[k] <- temp[i, 2]
      HeartAttack[k] <- as.numeric(temp[i, 4])
      k <- k + 1
    }
  }
  frame <- data.frame(cbind(HospitalName, StateName, HeartAttack))
  library(dplyr)
  frame %>%
    group_by(as.numeric(as.character(frame[, 3]))) %>%
    arrange(frame[, 1])
}
Output:
HospitalName StateName HeartAttack
1 FORT DUNCAN MEDICAL CENTER TX 8.1
2 TOMBALL REGIONAL MEDICAL CENTER TX 8.5
3 CYPRESS FAIRBANKS MEDICAL CENTER TX 8.7
4 DETAR HOSPITAL NAVARRO TX 8.7
5 METHODIST HOSPITAL,THE TX 8.8
6 MISSION REGIONAL MEDICAL CENTER TX 8.8
7 BAYLOR ALL SAINTS MEDICAL CENTER AT FW TX 8.9
8 SCOTT & WHITE HOSPITAL-ROUND ROCK TX 8.9
9 THE HEART HOSPITAL BAYLOR PLANO TX 9
10 UT SOUTHWESTERN UNIVERSITY HOSPITAL TX 9
.. ... ... ...
Variables not shown: as.numeric(as.character(frame[, 3])) (dbl)
The output does not contain the HeartAttack column, and I do not understand why.
One solution with dplyr:
library(dplyr)
df %>%
  group_by(x) %>%
  arrange(c, .by_group = TRUE)  # recent dplyr versions ignore grouping in arrange() unless .by_group = TRUE
Or, as @Akrun mentions in the comments, just
df %>%
arrange(x,c)
if you are not interested in grouping. Depends on what you want.
Output:
Source: local data frame [5 x 2]
Groups: x
x c
1 2 A
2 2 D
3 3 B
4 3 C
5 5 E
There is another solution in base R but it will only work if your x column is ordered as is, or if you don't mind changing the order it has:
> df[order(df$x, df$c), , drop = FALSE]
x c
2 2 A
1 2 D
4 3 B
3 3 C
5 5 E
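For a self-contained version of the base R approach, the sketch below rebuilds a small df by reading the row names and values back from the output above. That reconstruction is an assumption, since the original data frame is not shown.

```r
# Data reconstructed from the printed output (row names 1 to 5 in original order)
df <- data.frame(
  x = c(2, 2, 3, 3, 5),
  c = c("D", "A", "C", "B", "E"),
  stringsAsFactors = FALSE
)

# order() returns row positions sorted by x, with ties broken by c;
# indexing with them moves whole rows, so x and c stay paired
sorted <- df[order(df$x, df$c), ]
sorted
```

The row names of sorted record where each row came from, which is why the printed result starts with row 2 rather than row 1.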

Using dplyr to subset a data based on common values in one column from two data frames

I have never really used dplyr and am wondering how I can use it in the following context. I have the following two data frames:
trainData <- read.csv("train.csv", header = TRUE, stringsAsFactors = FALSE)
subscriptionData <- read.csv("subscriptions.csv", header = TRUE, stringsAsFactors = FALSE)
> head(trainData)
account.id total
1 001i000000NuOGY 0
2 001i000000NuS8r 0
3 001i000000NuPGw 0
4 001i000000NuO7a 0
5 001i000000NuQ2f 0
6 001i000000NuOQz 0
> head(subscriptionData)
account.id season package no.seats location section price.level total multiple.subs
1 001i000000LhyR3 2009-2010 Quartet 2 San Francisco Premium Orchestra 1 1.0 no
2 001i000000NuOeY 2000-2001 Full 2 San Francisco Orchestra 2 2.0 no
3 001i000000NuNvb 2001-2002 Full 2 Berkeley Saturday Balcony Front 3 2.0 no
4 001i000000NuOIz 1993-1994 Quartet 1 Contra Costa Orchestra 2 0.5 no
5 001i000000NuNVE 1998-1999 Full 2 Berkeley Sunday Balcony Rear 4 2.0 no
Now I want to take a subset of subscriptionData based on the account.id of trainData. Basically, I want the rows of subscriptionData whose account.id is also present in trainData.
I know it's a very basic question, but I am totally new to dplyr and have no clue.
You want a semi join:
subscriptionData %>% semi_join(trainData, by = "account.id")
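A semi join keeps the rows of the first table that have a match in the second, without adding columns or duplicating rows. As a rough base R equivalent, the same filter can be written with %in%. The account IDs and seasons below are made up for illustration.

```r
# Hypothetical toy versions of the two data frames
trainData <- data.frame(
  account.id = c("a1", "a2", "a3"),
  total = c(0, 0, 0),
  stringsAsFactors = FALSE
)
subscriptionData <- data.frame(
  account.id = c("a2", "a9", "a3", "a2"),
  season = c("2009-2010", "2000-2001", "2001-2002", "1993-1994"),
  stringsAsFactors = FALSE
)

# Keep subscription rows whose account.id also appears in trainData
matched <- subscriptionData[subscriptionData$account.id %in% trainData$account.id, ]
matched
```

Unlike merge() or an inner join, this keeps each subscriptionData row at most once no matter how many times its account.id appears in trainData.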
