Creating a function to supply parameters to another function that exists - r

Right now, I have a main function (let's call it performance()) that has as its arguments player1, player2, and team_of_interest.
I have a data set that looks like this:
> head(roster_van, 3)
team_name team venue num_first_last
1 VANCOUVER CANUCKS VAN Home 5 SBISA, LUCA
2 VANCOUVER CANUCKS VAN Home 8 TANEV, CHRISTOPHER
3 VANCOUVER CANUCKS VAN Home 14 BURROWS, ALEXANDRE
game_date game_id season session player_number
1 2016-10-15 2016020029 20162017 R 5
2 2016-10-15 2016020029 20162017 R 8
3 2016-10-15 2016020029 20162017 R 14
team_num first_name last_name player_name
1 VAN5 LUCA SBISA LUCA.SBISA
2 VAN8 CHRISTOPHER TANEV CHRIS.TANEV
3 VAN14 ALEXANDRE BURROWS ALEX.BURROWS
name_match player_position
1 LUCASBISA D
2 CHRISTOPHERTANEV D
3 ALEXANDREBURROWS L
This is the roster data for the hockey games played in a season.
I want to create another function (let's call it players()) that loops through every unique pair of players in a hockey team and provides their names and team to the player1, player2, and team_of_interest arguments inside the performance() function.
I've started off with this, but don't know what to do next:
name_pairs <- function(x,y) {
x <- seq(1,19, by = 2)
y <- x+1
}

merge can make quick work of generating a Cartesian join of your data frame with itself.
Here is a shortened version of your sample data frame, with a guess at the team_of_interest column:
library(tidyverse)
roster_van <- tibble(team = "VAN",
team_num = c(5, 8, 14),
player_name = c("LUCA.SBISA", "CHRIS.TANEV", "ALEX.BURROWS"),
player_position = c("D", "D", "L"),
team_of_interest = c("SL BLUES", "BOS BRUINS", "CGY FLAMES")
)
roster_van
> roster_van
# A tibble: 3 x 5
team team_num player_name player_position team_of_interest
<chr> <dbl> <chr> <chr> <chr>
1 VAN 5 LUCA.SBISA D SL BLUES
2 VAN 8 CHRIS.TANEV D BOS BRUINS
3 VAN 14 ALEX.BURROWS L CGY FLAMES
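If you only want a few of the columns repeated, select and rename just those columns before merging them back onto the original data frame, then filter out the rows where a player is paired with himself. Note that != keeps both orderings of each pair; filtering on player_name < player_name_paired instead would keep each unique pair once.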
roster_van_pairs <-
roster_van %>%
merge(roster_van %>%
select(team,
team_num_paired = team_num,
player_name_paired = player_name
)
) %>%
filter(player_name != player_name_paired)
roster_van_pairs
> roster_van_pairs
team team_num player_name player_position team_of_interest team_num_paired player_name_paired
1 VAN 5 LUCA.SBISA D SL BLUES 8 CHRIS.TANEV
2 VAN 5 LUCA.SBISA D SL BLUES 14 ALEX.BURROWS
3 VAN 8 CHRIS.TANEV D BOS BRUINS 5 LUCA.SBISA
4 VAN 8 CHRIS.TANEV D BOS BRUINS 14 ALEX.BURROWS
5 VAN 14 ALEX.BURROWS L CGY FLAMES 5 LUCA.SBISA
6 VAN 14 ALEX.BURROWS L CGY FLAMES 8 CHRIS.TANEV
If you want to go with a bulk approach which will join all the columns in again, you can execute a full rename of all the columns with the code below:
roster_van_copy <- roster_van
# suffix every column name so the copy's columns can be told apart after the join
colnames(roster_van_copy) <- colnames(roster_van_copy) %>% paste0(., "_paired")
This makes the cross join code more concise, too:
roster_van_all_columns_paired <-
roster_van %>%
merge(roster_van_copy) %>%
filter(player_name != player_name_paired)
I imagine this will leave you with more columns than necessary, but they are easy to remove afterwards with select(-c(col_x:col_y)).
roster_van_all_columns_paired
> roster_van_all_columns_paired
team team_num player_name player_position team_of_interest team_paired team_num_paired player_name_paired
1 VAN 8 CHRIS.TANEV D BOS BRUINS VAN 5 LUCA.SBISA
2 VAN 14 ALEX.BURROWS L CGY FLAMES VAN 5 LUCA.SBISA
3 VAN 5 LUCA.SBISA D SL BLUES VAN 8 CHRIS.TANEV
4 VAN 14 ALEX.BURROWS L CGY FLAMES VAN 8 CHRIS.TANEV
5 VAN 5 LUCA.SBISA D SL BLUES VAN 14 ALEX.BURROWS
6 VAN 8 CHRIS.TANEV D BOS BRUINS VAN 14 ALEX.BURROWS
player_position_paired team_of_interest_paired
1 D SL BLUES
2 D SL BLUES
3 D BOS BRUINS
4 D BOS BRUINS
5 L CGY FLAMES
6 L CGY FLAMES
A base R approach could look like this:
roster.van.all.copy.baseR <- merge(roster_van, roster_van_copy)
roster.van.all.baseR <- roster.van.all.copy.baseR[ which(roster.van.all.copy.baseR$player_name != roster.van.all.copy.baseR$player_name_paired), ]
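To close the loop on the original question, here is a minimal sketch of a players() wrapper. It assumes performance() takes two player names and a team code as character values; the performance() below is only a hypothetical stub, since the real one isn't shown. It uses combn() rather than the self-join so that each unordered pair appears exactly once, and it assumes the team has at least two players.

```r
# Hypothetical stand-in for the asker's performance(); the real body is not shown.
performance <- function(player1, player2, team_of_interest) {
  paste(player1, player2, team_of_interest, sep = " | ")
}

# players(): call performance() once for every unordered pair of players on a team.
players <- function(roster, team_of_interest) {
  team_players <- unique(roster$player_name[roster$team == team_of_interest])
  pairs <- combn(team_players, 2)  # 2-row matrix, one column per unique pair
  Map(performance,
      player1 = pairs[1, ],
      player2 = pairs[2, ],
      team_of_interest = team_of_interest)
}

# Shortened roster, as in the answer above
roster_van <- data.frame(
  team = "VAN",
  player_name = c("LUCA.SBISA", "CHRIS.TANEV", "ALEX.BURROWS"),
  stringsAsFactors = FALSE
)

results <- players(roster_van, "VAN")
length(results)  # three players give three unique pairs
```

Map() returns a list with one element per pair, so whatever performance() really returns can be collected or row-bound afterwards.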

Related

How to create a table with multiple classification in R

I am having problems creating a report table with a data frame like this:
id sex age location
1 m 0-17 Miami
2 f 18-64 Los Angeles
3 f over64 Ontario
4 m 18-64 Paris
5 m 18-64 Ontario
6 m over64 Miami
7 f over64 Miami
8 f 18-64 Los Angeles
9 m 18-64 Other
10 m over64 Other
my desired table should look like this:
Desired Table
Any idea how to do it?
I think if you look through the packages gt and gtsummary, you'll find what you're looking for. For example, this is close to the example you provided as your desired output.
library(gtsummary)
library(gt)
# notice I capitalized the column names
df1 <- read.table(header = TRUE,
text = "Id Sex Age Location
1 m 0-17 Miami
2 f 18-64 Los.Angeles
3 f over64 Ontario
4 m 18-64 Paris
5 m 18-64 Ontario
6 m over64 Miami
7 f over64 Miami
8 f 18-64 Los.Angeles
9 m 18-64 Other
10 m over64 Other", sep = " ")
tbl_summary(df1[, 2:4]) %>% add_n() %>% bold_labels()
This is the output table.
If you want to customize it, pipe in as_gt() and then use any of the functions in the gt package to customize it further.

Looping over a data frame and adding a new column in R with certain logic

I have a data frame which contains information about sales branches, customers and sales.
branch <- c("Chicago","Chicago","Chicago","Chicago","Chicago","Chicago","LA","LA","LA","LA","LA","LA","LA","Tampa","Tampa","Tampa","Tampa","Tampa","Tampa","Tampa","Tampa")
customer <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21)
sales <- c(33816,24534,47735,1467,39389,30659,21074,20195,45165,37606,38967,41681,47465,3061,23412,22993,34738,19408,11637,36234,23809)
data <- data.frame(branch, customer, sales)
What I need to accomplish is to iterate over each branch, take each customer in the branch, and divide the sales for that customer by the total for the branch, to find out how much each customer contributes to the total sales of the corresponding branch. E.g. for customer 1 I would like to divide 33816/177600 and store this value in a new column (177600 is the total for the Chicago branch).
I have tried to write a function to iterate over each row in a for loop but I am not sure how to do it at a branch level. Any guidance is appreciated.
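Consider base R's ave, which computes a grouped aggregate inline and returns a vector aligned with the original rows, so it can be assigned straight to a new column. Summing by customer in the numerator also handles a customer who appears in more than one row: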
data$customer_contribution <- ave(data$sales, data$customer, FUN=sum) /
ave(data$sales, data$branch, FUN=sum)
data
# branch customer sales customer_contribution
# 1 Chicago 1 33816 0.190405405
# 2 Chicago 2 24534 0.138141892
# 3 Chicago 3 47735 0.268778153
# 4 Chicago 4 1467 0.008260135
# 5 Chicago 5 39389 0.221784910
# 6 Chicago 6 30659 0.172629505
# 7 LA 7 21074 0.083576241
# 8 LA 8 20195 0.080090263
# 9 LA 9 45165 0.179117441
# 10 LA 10 37606 0.149139610
# 11 LA 11 38967 0.154537126
# 12 LA 12 41681 0.165300433
# 13 LA 13 47465 0.188238887
# 14 Tampa 14 3061 0.017462291
# 15 Tampa 15 23412 0.133560003
# 16 Tampa 16 22993 0.131169705
# 17 Tampa 17 34738 0.198172193
# 18 Tampa 18 19408 0.110718116
# 19 Tampa 19 11637 0.066386372
# 20 Tampa 20 36234 0.206706524
# 21 Tampa 21 23809 0.135824795
Or less wordy:
data$customer_contribution <- with(data, ave(sales, customer, FUN=sum) /
ave(sales, branch, FUN=sum))
We can use dplyr::group_by and dplyr::mutate to calculate fractional sales of total by branch.
library(dplyr)
library(magrittr)
data %>%
group_by(branch) %>%
mutate(sales.norm = sales / sum(sales))
## A tibble: 21 x 4
## Groups: branch [3]
# branch customer sales sales.norm
# <fct> <dbl> <dbl> <dbl>
# 1 Chicago 1. 33816. 0.190
# 2 Chicago 2. 24534. 0.138
# 3 Chicago 3. 47735. 0.269
# 4 Chicago 4. 1467. 0.00826
# 5 Chicago 5. 39389. 0.222
# 6 Chicago 6. 30659. 0.173
# 7 LA 7. 21074. 0.0836
# 8 LA 8. 20195. 0.0801
# 9 LA 9. 45165. 0.179
#10 LA 10. 37606. 0.149
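As a quick sanity check on either approach, the shares within each branch should add up to 1. The sketch below rebuilds the question's data and uses only base R; the column name share is just an illustrative choice, and the per-row division assumes each customer appears once, as in the sample data.

```r
# Rebuild the data from the question
data <- data.frame(
  branch = rep(c("Chicago", "LA", "Tampa"), times = c(6, 7, 8)),
  customer = 1:21,
  sales = c(33816, 24534, 47735, 1467, 39389, 30659,
            21074, 20195, 45165, 37606, 38967, 41681, 47465,
            3061, 23412, 22993, 34738, 19408, 11637, 36234, 23809)
)

# Each row's share of its branch total; ave() applies FUN within each branch
data$share <- ave(data$sales, data$branch, FUN = function(x) x / sum(x))

# Every branch should total 1, up to floating-point rounding
branch_totals <- tapply(data$share, data$branch, sum)
branch_totals
```

If any branch total differs noticeably from 1, the grouping variable passed to ave() is probably not the one intended.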

How to create an index in a data frame based on another data frame in R

I am trying to create an index in DF1 based off of DF2. In DF2 I have a column called ID, and what I want to do is search DF1$Name and if it contains a value from DF2$Wine then to fill in the ID from DF2$ID to DF1$ID.
DF1 = allwines
a <- c("Malbec", "Syrah", "Cabernet Sauvignon", "Merlot")
b <- c(1, 2, 3, 4)
allwines <- data.frame(a, b)
> allwines
a b
1 Malbec 1
2 Syrah 2
3 Cabernet Sauvignon 3
4 Merlot 4
DF2 = wines
c <- c("Charles Smith", "K Vintners", "K Vintners", "Two Vintners", "K Vintners", "Kerloo", "Betz Family", "Efeste" )
d <- c("Royal City Syrah", "Cattle King Syrah", "Klein Syrah", "Make Haste Cinsault", "The Hidden Syrah", "Stone Tree Malbec", "Le Parrain Cabernet Sauvignon", "Big Papa Cabernet Sauvignon")
wines <- data.frame(c, d)
> wines
c d
1 Charles Smith Royal City Syrah
2 K Vintners Cattle King Syrah
3 K Vintners Klein Syrah
4 Two Vintners Make Haste Cinsault
5 K Vintners The Hidden Syrah
6 Kerloo Stone Tree Malbec
7 Betz Family Le Parrain Cabernet Sauvignon
8 Efeste Big Papa Cabernet Sauvignon
Desired Output
> desired
c d ID
1 Charles Smith Royal City Syrah 2
2 K Vintners Cattle King Syrah 2
3 K Vintners Klein Syrah 2
4 Two Vintners Make Haste Cinsault NA
5 K Vintners The Hidden Syrah 2
6 Kerloo Stone Tree Malbec 1
7 Betz Family Le Parrain Cabernet Sauvignon 3
8 Efeste Big Papa Cabernet Sauvignon 3
My attempts have just been generating an ID column full of NA.
The idea is to search through the wine names in wines$d and match them with the wines from allwines. For example, Syrah from allwines$a would match Royal City Syrah, Cattle King Syrah, and Klein Syrah in wines$d.
If the names match up exactly between df2$Wine and df1$Name, you can simply join on those columns to obtain what you want.
Before you create the list of all NAs, try this:
library(dplyr)
newdf <- left_join(df1, df2, by = c('Name' = 'Wine'))
newdf should now contain all of the original rows from df1, and the corresponding ID if it is found in df2.
This is assuming, of course, that everything is formatted correctly and that the names match.
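If the names do not match exactly, as in the sample data where allwines$a only appears as part of wines$d, a join will not find anything. A minimal base R sketch of substring matching follows; match_id is a hypothetical helper name, and taking the first matching varietal is an assumption for the case where several could match.

```r
allwines <- data.frame(
  a = c("Malbec", "Syrah", "Cabernet Sauvignon", "Merlot"),
  b = c(1, 2, 3, 4),
  stringsAsFactors = FALSE
)
wines <- data.frame(
  c = c("Charles Smith", "K Vintners", "K Vintners", "Two Vintners",
        "K Vintners", "Kerloo", "Betz Family", "Efeste"),
  d = c("Royal City Syrah", "Cattle King Syrah", "Klein Syrah",
        "Make Haste Cinsault", "The Hidden Syrah", "Stone Tree Malbec",
        "Le Parrain Cabernet Sauvignon", "Big Papa Cabernet Sauvignon"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: ID of the first varietal in allwines$a found inside `name`,
# or NA when none of them occurs in it.
match_id <- function(name) {
  hit <- which(vapply(allwines$a, grepl, logical(1), x = name, fixed = TRUE))
  if (length(hit) == 0) NA_real_ else allwines$b[hit[1]]
}

wines$ID <- vapply(wines$d, match_id, numeric(1), USE.NAMES = FALSE)
wines
```

fixed = TRUE treats each varietal as a literal string rather than a regular expression, which is the safer choice for names that may contain punctuation.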

Arrange dataframe for pairwise correlations

I am working with data in the following form:
Country Player Goals
"USA" "Tim" 0
"USA" "Tim" 0
"USA" "Dempsey" 3
"USA" "Dempsey" 5
"Brasil" "Neymar" 6
"Brasil" "Neymar" 2
"Brasil" "Hulk" 5
"Brasil" "Luiz" 2
"England" "Rooney" 4
"England" "Stewart" 2
Each row represents the number of goals that a player scored per game, and also contains that player's country. I would like to have the data in the form such that I can run pairwise correlations to see whether being from the same country has some association with the number of goals that a player scores. The data would look like this:
Player_1 Player_2
0 8 # Tim Dempsey
8 5 # Neymar Hulk
8 2 # Neymar Luiz
5 2 # Hulk Luiz
4 2 # Rooney Stewart
(You can ignore the comments, they are there simply to clarify what each row contains).
How would I do this?
table(df$player)
gets me the number of goals per player, but then how do I generate these pairwise combinations?
This is a pretty classic self-join problem. I'm gonna start by summarizing your data to get the total goals for each player. I like dplyr for this, but aggregate or data.table work just fine too.
library(dplyr)
df <- df %>% group_by(Player, Country) %>% dplyr::summarize(Goals = sum(Goals))
> df
Source: local data frame [7 x 3]
Groups: Player
Player Country Goals
1 Dempsey USA 8
2 Hulk Brasil 5
3 Luiz Brasil 2
4 Neymar Brasil 8
5 Rooney England 4
6 Stewart England 2
7 Tim USA 0
Then, using good old merge, we join it to itself based on country, and then so we don't get each row twice (Dempsey, Tim and Tim, Dempsey---not to mention Dempsey, Dempsey), we'll subset it so that Player.x is alphabetically before Player.y. Since I already loaded dplyr I'll use filter, but subset would do the same thing.
df2 <- merge(df, df, by.x = "Country", by.y = "Country")
df2 <- filter(df2, as.character(Player.x) < as.character(Player.y))
> df2
Country Player.x Goals.x Player.y Goals.y
2 Brasil Hulk 5 Luiz 2
3 Brasil Hulk 5 Neymar 8
6 Brasil Luiz 2 Neymar 8
11 England Rooney 4 Stewart 2
15 USA Dempsey 8 Tim 0
The self-join could be done in dplyr if we made a little copy of the data and renamed the Player and Goals columns so they wouldn't be joined on. Since merge is pretty smart about the renaming, it's easier in this case.
There is probably a smarter way to get from the aggregated data to the pairs, but assuming your data is not too big (national soccer data), you can always do something like:
A<-aggregate(df$Goals~df$Player+df$Country,data=df,sum)
players_in_c<-table(A[,2])
dat<-NULL
for(i in levels(df$Country)) {
count<-players_in_c[i]
pair<-combn(count,m=2)
B<-A[A[,2]==i,]
dat<-rbind(dat, cbind(B[pair[1,],],B[pair[2,],]) )
}
dat
> dat
df$Player df$Country df$Goals df$Player df$Country df$Goals
1 Hulk Brasil 5 Luiz Brasil 2
1.1 Hulk Brasil 5 Neymar Brasil 8
2 Luiz Brasil 2 Neymar Brasil 8
4 Rooney England 4 Stewart England 2
6 Dempsey USA 8 Tim USA 0
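Once the pairs exist, the correlation the question is after can be computed directly from the two goal columns. The sketch below is a base R version of the self-join approach, rebuilt from the question's data; the names totals and pairs are illustrative. Keep in mind that which player lands in the .x column is decided by alphabetical order, which is an arbitrary choice and can affect the coefficient.

```r
df <- data.frame(
  Country = c("USA", "USA", "USA", "USA", "Brasil", "Brasil",
              "Brasil", "Brasil", "England", "England"),
  Player = c("Tim", "Tim", "Dempsey", "Dempsey", "Neymar", "Neymar",
             "Hulk", "Luiz", "Rooney", "Stewart"),
  Goals = c(0, 0, 3, 5, 6, 2, 5, 2, 4, 2),
  stringsAsFactors = FALSE
)

# Total goals per player
totals <- aggregate(Goals ~ Player + Country, data = df, FUN = sum)

# Self-join on Country, then keep each unordered pair once
pairs <- merge(totals, totals, by = "Country")
pairs <- pairs[pairs$Player.x < pairs$Player.y, ]

# Correlation between the two players' totals across same-country pairs
cor(pairs$Goals.x, pairs$Goals.y)
```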

Getting "raw" data from frequency table

I've been looking around for some data about naming trends in the USA. I managed to get the top 1000 names for babies born in 2008. The data is formatted in this manner:
male.name n.male female.name n.female
Jacob 22272 Emma 18587
Michael 20298 Isabella 18377
Ethan 20004 Emily 17217
Joshua 18924 Madison 16853
Daniel 18717 Ava 16850
Alexander 18423 Olivia 16845
Anthony 18158 Sophia 15887
William 18149 Abigail 14901
Christopher 17783 Elizabeth 11815
Matthew 17337 Chloe 11699
I want to get a data.frame with 2 variables: name and gender.
This can be done with looping, but I consider it a rather inefficient way of solving this problem. I reckon that some reshape function will suit my needs.
Let's presuppose that this tab-delimited data is saved into a data.frame named bnames. Looping can be done with function:
tmp <- character()
for (i in 1:nrow(bnames)) {
tmp <- c(tmp, rep(bnames[i,1], bnames[i,2]))
}
But I want to achieve this with vector-based approach. Any suggestions?
So one quick version would be to transform the data.frame and use the rbind() function to get what you want.
dataNEW <- data.frame(bnames[,1],c("m"), bnames[,c(2,3)], c("f"), bnames[,4])
colnames(dataNEW) <- c("name", "gender", "value", "name", "gender", "value")
This will give you:
name gender value name gender value
1 Jacob m 22272 Emma f 18587
2 Michael m 20298 Isabella f 18377
3 Ethan m 20004 Emily f 17217
4 Joshua m 18924 Madison f 16853
5 Daniel m 18717 Ava f 16850
6 Alexander m 18423 Olivia f 16845
7 Anthony m 18158 Sophia f 15887
8 William m 18149 Abigail f 14901
9 Christopher m 17783 Elizabeth f 11815
10 Matthew m 17337 Chloe f 11699
Now you can use rbind():
dataNGV <- rbind(dataNEW[1:3],dataNEW[4:6])
which leads to:
name gender value
1 Jacob m 22272
2 Michael m 20298
3 Ethan m 20004
4 Joshua m 18924
5 Daniel m 18717
6 Alexander m 18423
7 Anthony m 18158
8 William m 18149
9 Christopher m 17783
10 Matthew m 17337
11 Emma f 18587
12 Isabella f 18377
13 Emily f 17217
14 Madison f 16853
15 Ava f 16850
16 Olivia f 16845
17 Sophia f 15887
18 Abigail f 14901
19 Elizabeth f 11815
20 Chloe f 11699
A direct vector-based solution (replacing the loop) would be:
# your data:
bnames <- read.table(textConnection(
"male.name n.male female.name n.female
Jacob 22272 Emma 18587
Michael 20298 Isabella 18377
Ethan 20004 Emily 17217
Joshua 18924 Madison 16853
Daniel 18717 Ava 16850
Alexander 18423 Olivia 16845
Anthony 18158 Sophia 15887
William 18149 Abigail 14901
Christopher 17783 Elizabeth 11815
Matthew 17337 Chloe 11699
"), sep=" ", header=TRUE, stringsAsFactors=FALSE)
# how to avoid loop
bnames$male.name[ rep(1:nrow(bnames), times=bnames$n.male) ]
It relies on the fact that rep is vectorized over times, so it does in one call what your loop does one row at a time.
But for the final result you should combine mropa's and gd047's answers.
Or with my solution:
data_final <- data.frame(
name = c(
bnames$male.name[ rep(1:nrow(bnames), times=bnames$n.male) ],
bnames$female.name[ rep(1:nrow(bnames), times=bnames$n.female) ]
),
gender = rep(
c("m", "f"),
times = c(sum(bnames$n.male), sum(bnames$n.female))
),
stringsAsFactors = FALSE
)
[EDIT] Simplify:
data_final <- data.frame(
name = rep(
c(bnames$male.name, bnames$female.name),
times = c(bnames$n.male, bnames$n.female)
),
gender = rep(
c("m", "f"),
times = c(sum(bnames$n.male), sum(bnames$n.female))
),
stringsAsFactors = FALSE
)
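To see what the simplified version does without expanding millions of rows, here is the same pattern on a tiny made-up table. The toy counts are invented purely for illustration; only the column names follow the question.

```r
# Toy frequency table with the same layout as bnames (counts are made up)
toy <- data.frame(
  male.name = c("Jacob", "Michael"), n.male = c(2, 1),
  female.name = c("Emma", "Isabella"), n.female = c(1, 3),
  stringsAsFactors = FALSE
)

# rep() with a vector `times` repeats each name by its own count
expanded <- data.frame(
  name = rep(c(toy$male.name, toy$female.name),
             times = c(toy$n.male, toy$n.female)),
  gender = rep(c("m", "f"),
               times = c(sum(toy$n.male), sum(toy$n.female))),
  stringsAsFactors = FALSE
)

nrow(expanded)  # 7, the sum of all counts
# Tabulating the expanded rows recovers the original frequencies
table(expanded$name, expanded$gender)
```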
I think (if I have understood correctly) that mropa's solution needs one more step to get what you want
library(plyr)
data <- ddply(dataNGV, .(name,gender),
function(x) data.frame(name=rep(x[,1],x[,3]),gender=rep(x[,2],x[,3])))
Alternatively, download the full (cleaned up) baby names dataset from http://github.com/hadley/data-baby-names.
