How would you loop this in R? - r

I have a fairly simple task that I'm having a hard time looping.
So, let's assume I have this tibble:
library(tidyverse)
dat <- tibble(player1 = c("aa","bb","cc"), player2 = c("cc","aa","bb"))
My goal here is to make three new columns (one for each unique "player" I have) and assign the value 1 to the column if the player is "player1", -1 if the player is "player2", and 0 otherwise.
Previously, I have been doing it like this:
dat %>% mutate( aa = ifelse(player1 == "aa",1,ifelse(player2 == "aa",-1,0)),
bb = ifelse(player1 == "bb",1,ifelse(player2 == "bb",-1,0)),
cc = ifelse(player1 == "cc",1,ifelse(player2 == "cc",-1,0)))
This works, but now I have hundreds of different "players", so it would seem silly to do this manually like that. I have tried and read about loops in R, but I just can't get this one right.

Using model.matrix() from base R:
dat[unique(dat$player1)] <-
model.matrix(~0+ player1, data = dat) - model.matrix(~0+ player2, data = dat)
dat
player1 player2 aa bb cc
<chr> <chr> <dbl> <dbl> <dbl>
1 aa cc 1 0 -1
2 bb aa -1 1 0
3 cc bb 0 -1 1
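This assumes every player appears in both columns, and that unique(dat$player1) returns the players in the same (alphabetical) order as the columns produced by model.matrix(), as it does here. Otherwise you would need to convert both columns to factors with the same levels and replace unique with levels.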
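When a player appears in only one of the two columns, the factor-based variant mentioned above could look roughly like this. The data frame dat2 and the helper vector players are made up for illustration; the idea is to give both columns the same levels so that the two indicator matrices line up column by column.

```r
# Hypothetical data: "dd" only ever appears as player2
dat2 <- data.frame(player1 = c("aa", "bb", "cc"),
                   player2 = c("cc", "aa", "dd"),
                   stringsAsFactors = FALSE)

# Shared, sorted set of levels covering both columns
players <- sort(unique(c(dat2$player1, dat2$player2)))
f1 <- factor(dat2$player1, levels = players)
f2 <- factor(dat2$player2, levels = players)

# One indicator column per level (unused levels give all-zero columns),
# so both matrices have the same shape and column order
ind <- model.matrix(~ 0 + f1) - model.matrix(~ 0 + f2)
colnames(ind) <- players

dat2[players] <- ind
dat2
```

Because the column names come from players rather than from unique(), the result no longer depends on the order in which players first appear.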

We could go from initial structure to "long"(-er) format, with one row per (game, player), recode to 1/-1, and then go wide again with the desired output:
dat %>%
mutate(game_id = row_number()) %>%
gather("role", "player", -game_id) %>%
mutate(role = recode(role, "player1" = 1L, "player2" = -1L)) %>%
spread(player, role, fill = 0L)
#> # A tibble: 3 x 4
#> game_id aa bb cc
#> <int> <int> <int> <int>
#> 1 1 1 0 -1
#> 2 2 -1 1 0
#> 3 3 0 -1 1

You can use pivot_longer() to stack those columns starting with "player" and then pivot it to wide. The advantage is that you can do recoding within pivot_wider() by the argument values_fn.
library(tidyverse)
dat %>%
rowid_to_column("id") %>%
pivot_longer(starts_with("player")) %>%
pivot_wider(names_from = value, names_sort = TRUE,
values_from = name, values_fill = 0,
values_fn = function(x) c(1, -1)[match(x, c("player1", "player2"))])
# # A tibble: 3 x 4
# id aa bb cc
# <int> <dbl> <dbl> <dbl>
# 1 1 1 0 -1
# 2 2 -1 1 0
# 3 3 0 -1 1
Note: Development on gather()/spread() is complete, and for new code we recommend switching to pivot_longer()/_wider(), which is easier to use, more featureful, and still under active development.
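For completeness, the explicit loop the question asks about can be sketched in base R as below. This is only one possible way to write it; the vector players is a helper name introduced here, and a plain data.frame stands in for the tibble.

```r
dat <- data.frame(player1 = c("aa", "bb", "cc"),
                  player2 = c("cc", "aa", "bb"),
                  stringsAsFactors = FALSE)

# Every player that occurs in either column
players <- sort(unique(c(dat$player1, dat$player2)))

for (p in players) {
  # TRUE/FALSE become 1/0, so the difference is 1, -1 or 0
  dat[[p]] <- as.integer(dat$player1 == p) - as.integer(dat$player2 == p)
}
dat
```

The loop runs once per player rather than once per row, so it stays short even with hundreds of players.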

Related

In R, indicate whether another column has more than one unique value per ID

Background
Here's a dataset, d:
d <- data.frame(ID = c("a","a","b","b"),
product_code = c("B78","X31","C12","C12"),
stringsAsFactors=FALSE)
It looks like this:
  ID product_code
1  a          B78
2  a          X31
3  b          C12
4  b          C12
The Problem and Desired Output
I'm trying to make an indicator column multiple_products that's marked 1 for IDs which have more than one unique product_code and 0 for those that don't. Here's what I'm looking for:
  ID product_code multiple_products
1  a          B78                 1
2  a          X31                 1
3  b          C12                 0
4  b          C12                 0
My attempts haven't worked yet, though.
What I've Tried
Here's my current code:
d <- d %>%
group_by(ID) %>%
mutate(multiple_products = if_else(length(unique(d$product_code)) > 1, 1, 0)) %>%
ungroup()
And this is the result: every row gets multiple_products = 1, including both rows for ID b.
Any thoughts?
The d$ should be removed: d$product_code extracts the whole column and ignores the grouping. There is also n_distinct() for counting unique values. In addition, there is no need for ifelse or if_else, because logical values (TRUE/FALSE) are stored as 1/0 and can be coerced directly with as.integer or a unary +.
library(dplyr)
d %>%
group_by(ID) %>%
mutate(multiple_products = +(n_distinct(product_code) > 1)) %>%
ungroup()
-output
# A tibble: 4 x 3
ID product_code multiple_products
<chr> <chr> <int>
1 a B78 1
2 a X31 1
3 b C12 0
4 b C12 0
A solution with data.table:
library(data.table)
setDT(d)
d[,multiple_products:=rleid(product_code),by=ID][
,multiple_products:=ifelse(max(multiple_products)>1,1,0),by=ID]
d
Output:
ID product_code multiple_products
<chr> <chr> <int>
1 a B78 1
2 a X31 1
3 b C12 0
4 b C12 0
A base R option using ave
transform(
d,
multiple_products = +(ave(match(product_code, unique(product_code)), ID, FUN = var) > 0
)
)
gives
ID product_code multiple_products
1 a B78 1
2 a X31 1
3 b C12 0
4 b C12 0
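One caveat with the var-based version: var() of a single value is NA, so an ID with only one row would get NA instead of 0. A possible base R variant that counts distinct codes directly is sketched below; d2 (with an extra one-row ID "c") and n_codes are names made up for this illustration.

```r
# Hypothetical data with a one-row group "c"
d2 <- data.frame(ID = c("a", "a", "b", "b", "c"),
                 product_code = c("B78", "X31", "C12", "C12", "Z99"),
                 stringsAsFactors = FALSE)

# var() of a single value is NA, which would propagate into the indicator
var(1L)

# Pass row indices through ave() and count the distinct codes in each group
n_codes <- ave(seq_along(d2$product_code), d2$ID,
               FUN = function(i) length(unique(d2$product_code[i])))
d2$multiple_products <- +(n_codes > 1)
d2
```

Passing indices rather than the codes themselves keeps ave() working on an integer vector, so the result is not coerced to character.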

Taking a column and converting to counts per id in R

I have these two datasets, related to a call center:
my_call <- data.frame(id = c(1:6),
call_id = c(rep(200,3),rep(300,3)),
result = c("answering machine","call back","call_back",
"still workable","transfer call","do not call"),
code_result = c("am","cb","cb","sw","tc","dc"))
my_lead <- data.frame(lead_id = c(200,300),
lead_source = c("bpos","zeta"))
> my_call
id call_id result code_result
1 1 200 answering machine am
2 2 200 call back cb
3 3 200 call_back cb
4 4 300 still workable sw
5 5 300 transfer call tc
6 6 300 do not call dc
> my_lead
lead_id lead_source
1 200 bpos
2 300 zeta
I want to join these two datasets by call_id and lead_id, but I want code_result pivoted wider so that it counts the results per id. This is the expected result:
lead_id lead_source am cb sw tc dc
1 200 bpos 1 2 0 0 0
2 300 zeta 0 0 1 1 1
I think a left join could work, but I'm stuck on how to do it, and on whether I have to type out all the results (am, cb, sw, tc, dc) or whether R can do it automatically. Any help will be greatly appreciated.
Join and cast the data to wide format -
library(dplyr)
library(tidyr)
left_join(my_call, my_lead, by = c('call_id' = 'lead_id')) %>%
pivot_wider(names_from = code_result, values_from = code_result,
values_fn = length, values_fill = 0,
id_cols = c(call_id, lead_source))
# call_id lead_source am cb sw tc dc
# <dbl> <chr> <int> <int> <int> <int> <int>
#1 200 bpos 1 2 0 0 0
#2 300 zeta 0 0 1 1 1
Does this work:
library(dplyr)
library(tidyr)
my_call %>% inner_join(my_lead, by = c('call_id' = 'lead_id')) %>%
group_by(call_id, code_result,lead_source) %>% summarise(Count = n()) %>%
pivot_wider(id_cols = c(call_id,lead_source), names_from = code_result, values_from = Count ) %>%
mutate(across(everything(), ~ replace_na(., 0)))
`summarise()` has grouped output by 'call_id', 'code_result'. You can override using the `.groups` argument.
# A tibble: 2 x 7
# Groups: call_id [2]
call_id lead_source am cb dc sw tc
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 200 bpos 1 2 0 0 0
2 300 zeta 0 0 1 1 1
You have a couple of good answers already, but you can also use count() rather than group_by() and summarise(), together with the values_fill argument of pivot_wider():
my_lead %>% inner_join(my_call, by = c("lead_id" = "call_id")) %>%
count(lead_id,code_result, lead_source) %>%
pivot_wider(names_from = "code_result", values_from = "n", values_fill = 0)
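If the tidyverse is not available, roughly the same count-then-join idea can be sketched in base R with table() and merge(). The intermediate name counts is introduced here for illustration, and the result column is dropped from my_call because it is not needed for counting.

```r
my_call <- data.frame(id = 1:6,
                      call_id = c(rep(200, 3), rep(300, 3)),
                      code_result = c("am", "cb", "cb", "sw", "tc", "dc"))
my_lead <- data.frame(lead_id = c(200, 300),
                      lead_source = c("bpos", "zeta"))

# Cross-tabulate: one row per call_id, one column per result code
counts <- as.data.frame.matrix(table(my_call$call_id, my_call$code_result))
counts$lead_id <- as.numeric(rownames(counts))

# Attach the lead information
result <- merge(my_lead, counts, by = "lead_id")
result
```

table() creates the code columns automatically (in alphabetical order) and fills missing combinations with 0, so none of the codes has to be typed by hand.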

R/dplyr: Mutate based on multiple dynamic variable names

I have a list of data frames each of which contains multiple variables that contain surface area values (ending in "_area"). For each surface area variable there is corresponding conversion factor (ending in “_unit”) that I want to use to calculate a third variable that contains the area in a standard unit of measurement. I want these variables to end in “_area_ha”.
Below are my sample data frames:
a <- tibble(a1_area = c(1,1,1), a2_area_unit = c(1,1,0.5), a2_area = c(1,1,1),
a1_area_unit = c(1,0.5,0.5), abc = c(1,2,3))
b <- tibble(b1_area = c(1,1,1), b1_area_unit = c(1,1,0.5), b2_area = c(1,1,1),
b2_area_unit = c(1,0.5,0.5), abc = c(1,2,3))
ab_list <- list(a, b)
names(ab_list) <- c("a", "b")
I know how to do this with a loop, but I would like to understand how it can be done with tidyverse/dplyr logic. My loop (which gives me the desired output) looks like this:
df_names <- names(ab_list)
for (d in df_names) {
df <- ab_list[[d]]
var_names <- names(select(df, matches("_area$")))
for (v in var_names) {
int <- df %>% select(all_of(v),)
int2 <- df %>% select(matches(paste0(names(int), "_unit")))
int3 <- int*int2
names(int3) <- paste0(names(int), "_ha")
df <- cbind(df, int3)
rm(int, int2, int3)
}
ab_list[[d]] <- tibble(df)
rm(df)
}
> ab_list
$`a`
# A tibble: 3 x 7
a1_area a2_area_unit a2_area a1_area_unit abc a1_area_ha a2_area_ha
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 1 1 1 1
2 1 1 1 0.5 2 0.5 1
3 1 0.5 1 0.5 3 0.5 0.5
$b
# A tibble: 3 x 7
b1_area b1_area_unit b2_area b2_area_unit abc b1_area_ha b2_area_ha
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 1 1 1 1
2 1 1 1 0.5 2 1 0.5
3 1 0.5 1 0.5 3 0.5 0.5
I have tried using lapply and mutate_at but my approach does not work. If I understand correctly, this is because my environment is nested and I cannot access x in the function that calculates the variable "ha".
ab_list %>%
lapply(function(x) mutate_at(x, vars(matches("_area$")), list(ha = ~.*x[[paste0(names(.),"_unit")]])))
Error: Column `a1_area_ha` must be length 3 (the number of rows) or one, not 0
Is there a way to get the function within mutate_at to access a variable from the parent data frame based on the name of initial variable within the function?
I would of course be happy about any other suggestion for a tidyverse approach to calculate the "_ha" variables based on dynamic variable names.
Great question. Below is a base R solution. I am sure it can be adapted to a tidyverse solution (e.g., with purrr::map2()). Here I built a function that does a basic test and then used it with lapply(). Note: the answer is tailored for your example, so you'll need to adapt it if you have different column names for the value / units. Hope this helps!!
val_by_unit <- function(data) {
df <- data[order(names(data))]
# Selecting columns for values and units
val <- df[endsWith(names(df), "area")]
unit <- df[endsWith(names(df), "unit")]
# Check names are multiplying correctly
if(!all(names(val) == sub("_unit", "", names(unit)))) {
stop("Not all areas have a corresponding unit")
}
# Multiplying corresponding columns
output <- Map(`*`, val, unit)
# Renaming output and adding columns
data[paste0(names(output), "_ha")] <- output
data
}
Results:
lapply(ab_list, val_by_unit)
$a
# A tibble: 3 x 7
a1_area a2_area_unit a2_area a1_area_unit abc a1_area_ha a2_area_ha
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 1 1 1 1
2 1 1 1 0.5 2 0.5 1
3 1 0.5 1 0.5 3 0.5 0.5
$b
# A tibble: 3 x 7
b1_area b1_area_unit b2_area b2_area_unit abc b1_area_ha b2_area_ha
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 1 1 1 1
2 1 1 1 0.5 2 1 0.5
3 1 0.5 1 0.5 3 0.5 0.5
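A dplyr-only version is also possible. The sketch below assumes dplyr >= 1.0.0, where across() and cur_column() are available; cur_column() gives the name of the column currently being processed, and get() looks up its "_unit" partner by name. The result name res is introduced here for illustration.

```r
library(dplyr)

a <- tibble(a1_area = c(1, 1, 1), a2_area_unit = c(1, 1, 0.5), a2_area = c(1, 1, 1),
            a1_area_unit = c(1, 0.5, 0.5), abc = c(1, 2, 3))
b <- tibble(b1_area = c(1, 1, 1), b1_area_unit = c(1, 1, 0.5), b2_area = c(1, 1, 1),
            b2_area_unit = c(1, 0.5, 0.5), abc = c(1, 2, 3))
ab_list <- list(a = a, b = b)

res <- lapply(ab_list, function(df) {
  mutate(df, across(
    matches("_area$"),
    # Multiply each area column by the column named "<area column>_unit"
    ~ .x * get(paste0(cur_column(), "_unit")),
    .names = "{.col}_ha"
  ))
})
res
```

Unlike the sorting-based approach, this pairs columns purely by name, so the order of the columns in each data frame does not matter.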
The tidyverse functions work best with 'long' formatted data where each of your rows represents a unique data point. To do this, you will want to use the tidyr::pivot_longer function:
# Convert each data frame to long format: one row per (abc, site),
# with the area and its matching unit side by side
ab_list %>%
  lapply(function(df) {
    tidyr::pivot_longer(
      df,
      cols = -abc,
      # Split e.g. "a1_area_unit" into site = "a1" and value column "area_unit"
      names_to = c('site', '.value'),
      names_pattern = '^([^_]+)_(area_unit|area)$'
    )
  }) %>%
  # Stack the data frames, recording which one each row came from
  dplyr::bind_rows(.id = 'df') %>%
  dplyr::rename(unit = area_unit) %>%
  # Calculate area in HA
  dplyr::mutate(
    area_ha = area * unit
  )
Once your data is in long format, you can just use dplyr::mutate to multiply your area column by the unit column to get an area_ha column. If you want to convert your data back to its original format, you can use tidyr::pivot_wider to convert the data back to a wide format, which would give you columns with names a1_area_ha, a2_area_ha, etc.
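The step back to wide format might look like the sketch below. The small long data frame is a hand-built stand-in (values taken from data frame a in the question) rather than the output of any particular pipeline, and names_glue is used so that the site prefix comes first in the new column names.

```r
library(tidyr)

# Hypothetical long-format input: one row per (abc, site)
long <- data.frame(abc = rep(c(1, 2, 3), each = 2),
                   site = rep(c("a1", "a2"), times = 3),
                   area_ha = c(1, 1, 0.5, 1, 0.5, 0.5))

# One column per site, named "<site>_area_ha"
wide <- pivot_wider(long,
                    names_from = site,
                    values_from = area_ha,
                    names_glue = "{site}_{.value}")
wide
```

The wide result can then be joined back to the original data frame on abc if the other columns are needed.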

How to merge multiple variables and create a new data set?

https://www.kaggle.com/nowke9/ipldata ----- Contains the IPL Data.
This is an exploratory study of the IPL data set (link above). After merging both files on "id" and "match_id", I created four more variables, namely total_extras, total_runs_scored, total_fours_hit and total_sixes_hit. Now I wish to combine these newly created variables into a single data frame. When I combine them into one variable, batsman_aggregate, and select only the required columns, I get an error message.
library(tidyverse)
deliveries_tbl <- read.csv("deliveries_edit.csv")
matches_tbl <- read.csv("matches.csv")
combined_matches_deliveries_tbl <- deliveries_tbl %>%
left_join(matches_tbl, by = c("match_id" = "id"))
# Add team score and team extra columns for each match, each inning.
total_score_extras_combined <- combined_matches_deliveries_tbl%>%
group_by(id, inning, date, batting_team, bowling_team, winner)%>%
mutate(total_score = sum(total_runs, na.rm = TRUE))%>%
mutate(total_extras = sum(extra_runs, na.rm = TRUE))%>%
group_by(total_score, total_extras, id, inning, date, batting_team, bowling_team, winner)%>%
select(id, inning, total_score, total_extras, date, batting_team, bowling_team, winner)%>%
distinct(total_score, total_extras)%>%
glimpse()%>%
ungroup()
# Batsman Aggregate (Runs Balls, fours, six , Sr)
# Batsman score in each match
batsman_score_in_a_match <- combined_matches_deliveries_tbl %>%
group_by(id, inning, batting_team, batsman)%>%
mutate(total_batsman_runs = sum(batsman_runs, na.rm = TRUE))%>%
distinct(total_batsman_runs)%>%
glimpse()%>%
ungroup()
# Number of deliveries played .
balls_faced <- combined_matches_deliveries_tbl %>%
filter(wide_runs == 0)%>%
group_by(id, inning, batsman)%>%
summarise(deliveries_played = n())%>%
ungroup()
# Number of 4 and 6s by a batsman in each match.
fours_hit <- combined_matches_deliveries_tbl %>%
filter(batsman_runs == 4)%>%
group_by(id, inning, batsman)%>%
summarise(fours_hit = n())%>%
glimpse()%>%
ungroup()
sixes_hit <- combined_matches_deliveries_tbl %>%
filter(batsman_runs == 6)%>%
group_by(id, inning, batsman)%>%
summarise(sixes_hit = n())%>%
glimpse()%>%
ungroup()
batsman_aggregate <- c(batsman_score_in_a_match, balls_faced, fours_hit, sixes_hit)%>%
select(id, inning, batsman, total_batsman_runs, deliveries_played, fours_hit, sixes_hit)
The error message is displayed as:-
Error: `select()` doesn't handle lists.
The required output is a data set containing the newly constructed variables.
You'll have to join those four tables, not combine using c.
The join type is left_join so that all batsmen are included in the output. Those who didn't face any balls or hit any boundaries will have NA, but you can easily replace these with 0.
I've ignored the by since dplyr will assume you want c("id", "inning", "batsman"), the only 3 common columns in all four data sets.
batsman_aggregate <- left_join(batsman_score_in_a_match, balls_faced) %>%
left_join(fours_hit) %>%
left_join(sixes_hit) %>%
select(id, inning, batsman, total_batsman_runs, deliveries_played, fours_hit, sixes_hit) %>%
replace(is.na(.), 0)
# A tibble: 11,335 x 7
id inning batsman total_batsman_runs deliveries_played fours_hit sixes_hit
<int> <int> <fct> <int> <dbl> <dbl> <dbl>
1 1 1 DA Warner 14 8 2 1
2 1 1 S Dhawan 40 31 5 0
3 1 1 MC Henriques 52 37 3 2
4 1 1 Yuvraj Singh 62 27 7 3
5 1 1 DJ Hooda 16 12 0 1
6 1 1 BCJ Cutting 16 6 0 2
7 1 2 CH Gayle 32 21 2 3
8 1 2 Mandeep Singh 24 16 5 0
9 1 2 TM Head 30 22 3 0
10 1 2 KM Jadhav 31 16 4 1
# ... with 11,325 more rows
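With more than a handful of tables, the chain of joins can also be written as a fold over a list. The sketch below uses base R's Reduce() and merge() on three tiny made-up tables (runs, fours and sixes are illustrative stand-ins, not the real IPL data); merge() joins on the columns the tables share, here id and batsman.

```r
# Hypothetical miniature versions of the summary tables
runs  <- data.frame(id = c(1, 1, 2), batsman = c("A", "B", "A"),
                    total_batsman_runs = c(14, 40, 0))
fours <- data.frame(id = c(1, 1), batsman = c("A", "B"), fours_hit = c(2, 5))
sixes <- data.frame(id = 1, batsman = "A", sixes_hit = 1)

# Left-join each table onto the running result, keeping every row of runs
batsman_aggregate <- Reduce(function(x, y) merge(x, y, all.x = TRUE),
                            list(runs, fours, sixes))

# Batsmen with no boundaries come back as NA; treat those as 0
batsman_aggregate[is.na(batsman_aggregate)] <- 0
batsman_aggregate
```

The same pattern should carry over to dplyr by folding left_join over the list, for example with purrr::reduce().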
There are also 2 batsmen who didn't face any delivery:
batsman_aggregate %>% filter(deliveries_played==0)
# A tibble: 2 x 7
id inning batsman total_batsman_runs deliveries_played fours_hit sixes_hit
<int> <int> <fct> <int> <dbl> <dbl> <dbl>
1 482 2 MK Pandey 0 0 0 0
2 7907 1 MJ McClenaghan 2 0 0 0
One of which apparently scored 2 runs! So I think the batsman_runs column has some errors. The game is here and clearly says that on the second last delivery of the first innings, 2 wides were scored, not runs to the batsman.

Use a specific value in summarise (dplyr) without filtering it out

I am trying to compare a new algorithm result versus an old one. I need to know approximately how many days of a difference the new algorithm has in predicting a "D" versus the old one.
I can't figure out how to point to the first row (day) that contains a 'D' (min(day) and new == 'D') without filtering. (I was able to grab the row using a double filter because of the grouping, but not to use it.) I want to use it in summarise with dplyr, which is why I have included pseudocode similar to where I currently am with my own dataset.
In my data there are groups of varying length (number of days) for each ID, which is why I made groups of different lengths in the example.
library(dplyr)
id = c(123,123,123,123,123,456,456,456,456)
old = c('S','S','S','S','D','S','S','D','D')
new = c('S','S','D','D','D','S','D','D','D')
day = c(1,2,3,4,5,1,2,3,4)
data = data.frame(id,old,new,day)
data
#> id old new day
#> 1 123 S S 1
#> 2 123 S S 2
#> 3 123 S D 3
#> 4 123 S D 4
#> 5 123 D D 5
#> 6 456 S S 1
#> 7 456 S D 2
#> 8 456 D D 3
#> 9 456 D D 4
d = data %>%
group_by(id)%>%
arrange(day,.by_group=T)%>%
add_tally(new=='S',name='S')%>%
add_tally(new=='D',name='D')%>%
group_by(id,S,D)
# summarise(diff = (day of 1st old D) - (day of 1st new D) )
#Expected Outcome
ido = c(123,456)
S = c(2,1)
D = c(3,3)
diff = c(2,1)
outcome = data.frame(ido,S,D,diff)
outcome
#> ido S D diff
#> 1 123 2 3 2
#> 2 456 1 3 1
Created on 2019-12-26 by the reprex package (v0.3.0)
We can group_by id and count the occurrence of 'S' and 'D' and the difference between first occurrence of old and new 'D'.
library(dplyr)
data %>%
group_by(id) %>%
summarise(S = sum(new == 'S'),
D = sum(new == 'D'),
diff = which.max(old == 'D') - which.max(new == 'D'))
#OR if there could be id without D use
#diff = which(old == 'D')[1] - which(new == 'D')[1])
# A tibble: 2 x 4
# id S D diff
# <dbl> <int> <int> <int>
#1 123 2 3 2
#2 456 1 3 1
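The commented-out alternative matters because which.max() on an all-FALSE vector still returns 1. A small base R illustration, using a made-up id whose old algorithm never predicts 'D':

```r
# Hypothetical group with no "D" in the old column
old <- c("S", "S", "S")
day <- c(1, 2, 3)

which.max(old == "D")     # all FALSE, so the first position (1) is returned
which(old == "D")[1]      # NA, because there is no match
day[match("D", old)]      # also NA, and based on day rather than row position
```

Note also that which.max() and which() return row positions within the group, which equal the day only when days run 1, 2, 3, ... without gaps; indexing day with match() avoids that assumption.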
We can use pivot_wider after summarising to get the frequency counts, after first adding a grouping column for the difference in 'day' between the first occurrence of 'D' in the 'old' and 'new' columns.
library(dplyr)
library(tidyr)
data %>%
group_by(id) %>%
group_by(diff = day[match("D", old)] - day[match("D", new)],
new, .add = TRUE) %>% # .add replaces the deprecated add argument (dplyr >= 1.0.0)
summarise(n = n()) %>%
ungroup %>%
pivot_wider(names_from = new, values_from = n)
# A tibble: 2 x 4
# id diff D S
# <dbl> <dbl> <int> <int>
#1 123 2 3 2
#2 456 1 3 1

Resources