Taking a column and converting to counts per id in R - r

I have these two datasets, related to a call center:
my_call <- data.frame(id = c(1:6),
call_id = c(rep(200,3),rep(300,3)),
result = c("answering machine","call back","call_back",
"still workable","transfer call","do not call"),
code_result = c("am","cb","cb","sw","tc","dc"))
my_lead <- data.frame(lead_id = c(200,300),
lead_source = c("bpos","zeta"))
> my_call
id call_id result code_result
1 1 200 answering machine am
2 2 200 call back cb
3 3 200 call_back cb
4 4 300 still workable sw
5 5 300 transfer call tc
6 6 300 do not call dc
> my_lead
lead_id lead_source
1 200 bpos
2 300 zeta
I want to join these two datasets on call_id and lead_id, and pivot code_result wider so that each result code becomes a column holding its count per id. This is the expected result:
lead_id lead_source am cb sw tc dc
1 200 bpos 1 2 0 0 0
2 300 zeta 0 0 1 1 1
I think a left join could work, but I'm stuck on how to do it, and I don't know whether I have to type out all the result codes (am, cb, sw, tc, dc) or whether R can pick them up automatically. Any help will be greatly appreciated.

Join the data and cast it to wide format:
library(dplyr)
library(tidyr)
left_join(my_call, my_lead, by = c('call_id' = 'lead_id')) %>%
pivot_wider(names_from = code_result, values_from = code_result,
values_fn = length, values_fill = 0,
id_cols = c(call_id, lead_source))
# call_id lead_source am cb sw tc dc
# <dbl> <chr> <int> <int> <int> <int> <int>
#1 200 bpos 1 2 0 0 0
#2 300 zeta 0 0 1 1 1

Does this work:
library(dplyr)
library(tidyr)
my_call %>% inner_join(my_lead, by = c('call_id' = 'lead_id')) %>%
group_by(call_id, code_result,lead_source) %>% summarise(Count = n()) %>%
pivot_wider(id_cols = c(call_id,lead_source), names_from = code_result, values_from = Count ) %>%
mutate(across(everything(), ~ replace_na(., 0)))
`summarise()` has grouped output by 'call_id', 'code_result'. You can override using the `.groups` argument.
# A tibble: 2 x 7
# Groups: call_id [2]
call_id lead_source am cb dc sw tc
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 200 bpos 1 2 0 0 0
2 300 zeta 0 0 1 1 1

You have a couple of good answers already, but you can also use count() rather than group_by() and summarise(), together with the values_fill argument of pivot_wider():
my_lead %>% inner_join(my_call, by = c("lead_id" = "call_id")) %>%
count(lead_id,code_result, lead_source) %>%
pivot_wider(names_from = "code_result", values_from = "n", values_fill = 0)
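
For reference, here is a self-contained sketch of the count() approach that can be pasted into a fresh session. It rebuilds the question's data (without the result column, which is not needed for the counts) and stores the result in wide_counts, a name chosen only for this illustration.

```r
library(dplyr)
library(tidyr)

# Data from the question, minus the free-text `result` column.
my_call <- data.frame(id = 1:6,
                      call_id = c(rep(200, 3), rep(300, 3)),
                      code_result = c("am", "cb", "cb", "sw", "tc", "dc"))
my_lead <- data.frame(lead_id = c(200, 300),
                      lead_source = c("bpos", "zeta"))

wide_counts <- my_lead %>%
  inner_join(my_call, by = c("lead_id" = "call_id")) %>%
  # one row per (lead, code) pair with its frequency in column `n`
  count(lead_id, lead_source, code_result) %>%
  # one column per code; combinations that never occur become 0
  pivot_wider(names_from = code_result, values_from = n, values_fill = 0)

wide_counts
```

No result code is typed by hand: the column names come from whatever values appear in code_result.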

Related

Summarize one variable/column over all possible values of other variables/columns

I need to summarize one variable/column of a long table after aggregating (group_by()) by another variable/column, and I need the summarized value for every value of the other variables/columns.
Here is test data:
library(tidyverse)
set.seed(123)
Site <- str_c("S", 1:5)
Species <- str_c("Sps", 1:6)
print(Species_tbl <- bind_cols(Species = Species,
Exotic = rbinom(length(Species), 1, .3),
Migrant = rbinom(length(Species), 2, .3)))
Data_tbl <- expand.grid(Site = Site,
Species = Species) %>%
left_join(Species_tbl)
Data_tbl$Presence <- rbinom(nrow(Data_tbl), 1, .5)
And here is my best effort:
print(Data_tbl %>%
group_by(Site) %>%
summarise(N_sp = sum(Presence),
N_sp_Exo = sum(Presence[Exotic == 1]),
N_sp_Nat = sum(Presence[Exotic == 0]),
N_sp_M0 = sum(Presence[Migrant == 0]),
N_sp_M1 = sum(Presence[Migrant == 1]),
N_sp_M2 = sum(Presence[Migrant == 2])))
You can reshape the columns of interest, c(Exotic, Migrant), into long format and take the sum of the Presence column for each combination of column name and value. This can then be merged with the total for each Site.
library(dplyr)
library(tidyr)
data1 <- Data_tbl %>%
group_by(Site) %>%
summarise(N_sp = sum(Presence))
data2 <- Data_tbl %>%
pivot_longer(cols = c(Exotic, Migrant)) %>%
group_by(Site, name, value) %>%
summarise(result = sum(Presence), .groups = "drop") %>%
pivot_wider(names_from = c(name, value), values_from = result)
inner_join(data1, data2, by = 'Site')
# Site N_sp Exotic_0 Exotic_1 Migrant_0 Migrant_1 Migrant_2
# <fct> <int> <int> <int> <int> <int> <int>
#1 S1 4 2 2 1 2 1
#2 S2 3 2 1 0 2 1
#3 S3 2 1 1 0 2 0
#4 S4 4 2 2 1 3 0
#5 S5 4 1 3 1 2 1
The answer has been divided in two steps for ease of readability. If you would like to do this in a single chain without creating temporary variables that can be done as well.
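
One possible single-chain variant is sketched below. It is not the answer's own code: it carries the per-site total along as a grouping column instead of joining it back afterwards. The small toy_tbl is a hypothetical stand-in for Data_tbl so that the example runs on its own.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for Data_tbl (2 sites x 3 species).
toy_tbl <- tibble(
  Site     = rep(c("S1", "S2"), each = 3),
  Species  = rep(c("Sps1", "Sps2", "Sps3"), times = 2),
  Exotic   = rep(c(0, 1, 0), times = 2),
  Migrant  = rep(c(0, 1, 2), times = 2),
  Presence = c(1, 1, 0, 0, 1, 1)
)

single_chain <- toy_tbl %>%
  group_by(Site) %>%
  mutate(N_sp = sum(Presence)) %>%   # per-site total, kept as a column
  ungroup() %>%
  pivot_longer(cols = c(Exotic, Migrant)) %>%
  group_by(Site, N_sp, name, value) %>%
  summarise(result = sum(Presence), .groups = "drop") %>%
  pivot_wider(names_from = c(name, value), values_from = result,
              values_fill = 0)

single_chain
```

Because N_sp is constant within a Site, adding it to group_by() does not change the groups; it only keeps the column in the output.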

Adding a Proportion Column with Dplyr

Let's say I had the following data frame, that was also altered to include counts of a,b, and c, based on whether or not they are classified by Z = 0 or 1
X <- (1:10)
Y<- c('a','b','a','c','b','b','a','a','c','c')
Z <- c(0,1,1,1,0,1,0,1,1,1)
test_df <- data.frame(X,Y,Z)
(the code below was provided by a stack exchange member, thank you!)
res <- test_df %>% group_by(Y,Z) %>% summarise(N=n()) %>%
pivot_wider(names_from = Z,values_from=N,
values_fill = 0)
How might I add a column on the right that gives, for each letter, the proportion of its appearances for which Z = 1? It would seem that a basic summary statement should work, but I can't figure out how...
My expected output would be something like
Z=0 Z=1 PropZ=1
a 2 2 .5
b 1 2 .66
c 0 3 1
Perhaps this helps
library(dplyr)
library(tidyr)
test_df %>%
group_by(Y, Z) %>%
summarise(N = n(), .groups = 'drop') %>%
left_join(test_df %>%
group_by(Y) %>%
summarise(Prop = mean(Z == 1), .groups = 'drop')) %>%
pivot_wider(names_from = Z, values_from = N, values_fill = 0)
-output
# A tibble: 3 x 4
# Y Prop `0` `1`
# <chr> <dbl> <int> <int>
#1 a 0.5 2 2
#2 b 0.667 1 2
#3 c 1 0 3
test_df %>% group_by(Y) %>%
summarise( z0 = sum(Z == 0), z1 = sum(Z == 1) , PropZ = z1/n())
I am not sure what your expected output is, but below are some options
u <- xtabs(q ~ Y + Z, cbind(test_df, q = 1))
> u
Z
Y 0 1
a 2 2
b 1 2
c 0 3
or
> prop.table(u)
Z
Y 0 1
a 0.2 0.2
b 0.1 0.2
c 0.0 0.3
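
Note that prop.table(u) divides by the grand total. If the per-letter proportion from the question is wanted, the margin argument of prop.table() normalises within each row instead. A minimal sketch, with the data rebuilt so it runs on its own (row_prop and prop_z1 are names chosen for this illustration):

```r
test_df <- data.frame(X = 1:10,
                      Y = c('a','b','a','c','b','b','a','a','c','c'),
                      Z = c(0,1,1,1,0,1,0,1,1,1))

# Contingency table of letter (rows) by Z (columns).
u <- xtabs(~ Y + Z, data = test_df)

# margin = 1 makes each row sum to 1, so column "1" is the
# share of Z == 1 among all appearances of that letter.
row_prop <- prop.table(u, margin = 1)
prop_z1  <- row_prop[, "1"]
prop_z1
```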
To calculate proportions of 1 for each letter you can use rowSums.
transform(res, prop_1 = `1`/rowSums(res[-1]))
In dplyr :
library(dplyr)
res %>%
ungroup %>%
mutate(prop_1 = `1`/rowSums(.[-1]))
# Y `0` `1` prop_1
# <chr> <int> <int> <dbl>
#1 a 2 2 0.5
#2 b 1 2 0.667
#3 c 0 3 1

How would you loop this in R?

I have this kinda simple task I'm having hard time looping.
So, lets assume I have this tibble:
library(tidyverse)
dat <- tibble(player1 = c("aa","bb","cc"), player2 = c("cc","aa","bb"))
My goal here, is to make three new columns ( for each unique "player" I have) and assign value of 1 to the column, if the player is "player1", -1 if the player is "player2" and 0 otherwise.
Previously, I have been doing it like this:
dat %>% mutate( aa = ifelse(player1 == "aa",1,ifelse(player2 == "aa",-1,0)),
bb = ifelse(player1 == "bb",1,ifelse(player2 == "bb",-1,0)),
cc = ifelse(player1 == "cc",1,ifelse(player2 == "cc",-1,0)))
This works, but now I have hundreds of different "players", so it would seem silly to do this manually like that. I have tried and read about loops in R, but I just can't get this one right.
Using model.matrix() from base R:
dat[unique(dat$player1)] <-
model.matrix(~0+ player1, data = dat) - model.matrix(~0+ player2, data = dat)
dat
player1 player2 aa bb cc
<chr> <chr> <dbl> <dbl> <dbl>
1 aa cc 1 0 -1
2 bb aa -1 1 0
3 cc bb 0 -1 1
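This assumes that every player appears in both columns and that unique(dat$player1) returns the names in the same alphabetical order as the model.matrix() columns. Otherwise you would need to convert both columns to factors with the appropriate levels and replace unique with levels.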
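
A sketch of that factor-based variant follows, assuming a hypothetical data set in which player "dd" only ever appears as player2. The helper names players, p1, p2 and scores are introduced here for illustration only.

```r
# Hypothetical data: "dd" never appears in player1.
dat <- data.frame(player1 = c("aa", "bb", "cc"),
                  player2 = c("cc", "aa", "dd"),
                  stringsAsFactors = FALSE)

# Shared, sorted set of levels taken from both columns.
players <- sort(unique(c(dat$player1, dat$player2)))
p1 <- factor(dat$player1, levels = players)
p2 <- factor(dat$player2, levels = players)

# With identical levels, both indicator matrices have one column per
# player in the same order, so they can be subtracted directly.
scores <- model.matrix(~ 0 + p1) - model.matrix(~ 0 + p2)
colnames(scores) <- players

result <- cbind(dat, as.data.frame(scores))
result
```

Unused factor levels still get an all-zero indicator column, which is why "dd" ends up with a column even though it never occurs as player1.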
We could go from initial structure to "long"(-er) format, with one row per (game, player), recode to 1/-1, and then go wide again with the desired output:
dat %>%
mutate(game_id = row_number()) %>%
gather("role", "player", -game_id) %>%
mutate(role = recode(role, "player1" = 1L, "player2" = -1L)) %>%
spread(player, role, fill = 0L)
#> # A tibble: 3 x 4
#> game_id aa bb cc
#> <int> <int> <int> <int>
#> 1 1 1 0 -1
#> 2 2 -1 1 0
#> 3 3 0 -1 1
You can use pivot_longer() to stack those columns starting with "player" and then pivot it to wide. The advantage is that you can do recoding within pivot_wider() by the argument values_fn.
library(tidyverse)
dat %>%
rowid_to_column("id") %>%
pivot_longer(starts_with("player")) %>%
pivot_wider(names_from = value, names_sort = TRUE,
values_from = name, values_fill = 0,
values_fn = function(x) c(1, -1)[match(x, c("player1", "player2"))])
# # A tibble: 3 x 4
# id aa bb cc
# <int> <dbl> <dbl> <dbl>
# 1 1 1 0 -1
# 2 2 -1 1 0
# 3 3 0 -1 1
Note: Development on gather()/spread() is complete, and for new code we recommend switching to pivot_longer()/_wider(), which is easier to use, more featureful, and still under active development.

How to merge multiple variables and create a new data set?

https://www.kaggle.com/nowke9/ipldata ----- Contains the IPL Data.
This is an exploratory study performed on the IPL data set (link to the data above). After merging both files on "id" and "match_id", I created four more variables, namely total_extras, total_runs_scored, total_fours_hit and total_sixes_hit. Now I wish to combine these newly created variables into one single data frame. When I combine them into a single variable named batsman_aggregate and select only the required columns, I get an error message.
library(tidyverse)
deliveries_tbl <- read.csv("deliveries_edit.csv")
matches_tbl <- read.csv("matches.csv")
combined_matches_deliveries_tbl <- deliveries_tbl %>%
left_join(matches_tbl, by = c("match_id" = "id"))
# Add team score and team extra columns for each match, each inning.
total_score_extras_combined <- combined_matches_deliveries_tbl%>%
group_by(id, inning, date, batting_team, bowling_team, winner)%>%
mutate(total_score = sum(total_runs, na.rm = TRUE))%>%
mutate(total_extras = sum(extra_runs, na.rm = TRUE))%>%
group_by(total_score, total_extras, id, inning, date, batting_team, bowling_team, winner)%>%
select(id, inning, total_score, total_extras, date, batting_team, bowling_team, winner)%>%
distinct(total_score, total_extras)%>%
glimpse()%>%
ungroup()
# Batsman Aggregate (Runs Balls, fours, six , Sr)
# Batsman score in each match
batsman_score_in_a_match <- combined_matches_deliveries_tbl %>%
group_by(id, inning, batting_team, batsman)%>%
mutate(total_batsman_runs = sum(batsman_runs, na.rm = TRUE))%>%
distinct(total_batsman_runs)%>%
glimpse()%>%
ungroup()
# Number of deliveries played .
balls_faced <- combined_matches_deliveries_tbl %>%
filter(wide_runs == 0)%>%
group_by(id, inning, batsman)%>%
summarise(deliveries_played = n())%>%
ungroup()
# Number of 4 and 6s by a batsman in each match.
fours_hit <- combined_matches_deliveries_tbl %>%
filter(batsman_runs == 4)%>%
group_by(id, inning, batsman)%>%
summarise(fours_hit = n())%>%
glimpse()%>%
ungroup()
sixes_hit <- combined_matches_deliveries_tbl %>%
filter(batsman_runs == 6)%>%
group_by(id, inning, batsman)%>%
summarise(sixes_hit = n())%>%
glimpse()%>%
ungroup()
batsman_aggregate <- c(batsman_score_in_a_match, balls_faced, fours_hit, sixes_hit)%>%
select(id, inning, batsman, total_batsman_runs, deliveries_played, fours_hit, sixes_hit)
The error message is displayed as:-
Error: `select()` doesn't handle lists.
The required output is a data set containing the newly constructed variables.
You'll have to join those four tables, not combine using c.
And the join type is left_join so that all batsmen are included in the output. Those who didn't face any balls or hit any boundaries will have NA, but you can easily replace these with 0.
I've ignored the by since dplyr will assume you want c("id", "inning", "batsman"), the only 3 common columns in all four data sets.
batsman_aggregate <- left_join(batsman_score_in_a_match, balls_faced) %>%
left_join(fours_hit) %>%
left_join(sixes_hit) %>%
select(id, inning, batsman, total_batsman_runs, deliveries_played, fours_hit, sixes_hit) %>%
replace(is.na(.), 0)
# A tibble: 11,335 x 7
id inning batsman total_batsman_runs deliveries_played fours_hit sixes_hit
<int> <int> <fct> <int> <dbl> <dbl> <dbl>
1 1 1 DA Warner 14 8 2 1
2 1 1 S Dhawan 40 31 5 0
3 1 1 MC Henriques 52 37 3 2
4 1 1 Yuvraj Singh 62 27 7 3
5 1 1 DJ Hooda 16 12 0 1
6 1 1 BCJ Cutting 16 6 0 2
7 1 2 CH Gayle 32 21 2 3
8 1 2 Mandeep Singh 24 16 5 0
9 1 2 TM Head 30 22 3 0
10 1 2 KM Jadhav 31 16 4 1
# ... with 11,325 more rows
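
The joins above rely on dplyr picking the common columns by itself. If you prefer to spell the keys out, something like the following sketch may be clearer. It uses small made-up tables (runs, fours, sixes) in place of the IPL data, so the values are purely illustrative.

```r
library(dplyr)
library(tidyr)

# Made-up stand-ins for the per-batsman summary tables.
runs  <- tibble(id = c(1, 1, 2), inning = c(1, 1, 1),
                batsman = c("A", "B", "A"),
                total_batsman_runs = c(14, 0, 30))
fours <- tibble(id = c(1, 2), inning = c(1, 1),
                batsman = c("A", "A"), fours_hit = c(2L, 3L))
sixes <- tibble(id = 1, inning = 1, batsman = "A", sixes_hit = 1L)

keys <- c("id", "inning", "batsman")

agg <- runs %>%
  left_join(fours, by = keys) %>%   # keeps every row of `runs`
  left_join(sixes, by = keys) %>%
  # batsmen with no boundaries have NA after the joins; turn those into 0
  mutate(across(c(fours_hit, sixes_hit), ~ replace_na(.x, 0L)))

agg
```

Restricting replace_na() to the count columns avoids touching NA values in other columns that may be genuinely missing.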
There are also 2 batsmen who didn't face any delivery:
batsman_aggregate %>% filter(deliveries_played==0)
# A tibble: 2 x 7
id inning batsman total_batsman_runs deliveries_played fours_hit sixes_hit
<int> <int> <fct> <int> <dbl> <dbl> <dbl>
1 482 2 MK Pandey 0 0 0 0
2 7907 1 MJ McClenaghan 2 0 0 0
One of which apparently scored 2 runs! So I think the batsman_runs column has some errors. The game is here and clearly says that on the second last delivery of the first innings, 2 wides were scored, not runs to the batsman.

summaries extracted from data frame info

data <-
STUDY ID BASE CYCLE1 DIED PROG
1 1 100 30 No Yes
1 2 NA 20 Yes No
1 3 16 NA Yes Yes
1 4 15 10 Yes Yes
I wanted to make a summary of the following:
1. How many subjects have both a BASE and a CYCLE1 value?
2. Of those in 1, how many had DIED?
3. Of those in 1, how many had DIED or PROG?
Answers:
1. 2 subjects (50% of subjects) ==> subjects 1 & 4
2. 1 subject (25%) ==> subject 4
3. 2 subjects (50%) ==> subjects 1 & 4
A summary table by STUDY for this would be great (showing the number and percentage).
I am using Rstudio.
If it is based on the first filter
library(dplyr)
library(stringr)
data %>%
group_by(STUDY) %>%
filter(!is.na(BASE) & !is.na(CYCLE1)) %>%
summarise(ID = str_c(ID, collapse=", "),
n1 = n(),
n2 = sum(DIED== "Yes"),
n3 = sum(DIED == "Yes"|PROG == "Yes"))
# A tibble: 1 x 5
# STUDY ID n1 n2 n3
# <int> <chr> <int> <int> <int>
#1 1 1, 4 2 1 2
if we need the percentage as well
out <- data %>%
group_by(STUDY) %>%
mutate(i1 = !is.na(BASE) & !is.na(CYCLE1),
perc1 = 100 * mean(i1),
n1 = sum(i1),
i2 = DIED == "Yes" & i1,
perc2 = 100 * mean(i2),
n2 = sum(i2),
i3 = (DIED == "Yes"|PROG == "Yes") & i1,
perc3 = 100 * mean(i3),
n3 = sum(i3)) %>%
filter(i1) %>%
select(STUDY, ID, matches("perc"), matches("n")) %>%
mutate(ID = toString(ID)) %>%
slice(1)
# A tibble: 1 x 8
# Groups: STUDY [1]
# STUDY ID perc1 perc2 perc3 n1 n2 n3
# <int> <chr> <dbl> <dbl> <dbl> <int> <int> <int>
#1 1 1, 4 50 25 50 2 1 2
It can be further modified to format the output
library(tidyr) # 0.8.3.9000
out %>%
pivot_longer(cols = perc1:n3, names_to = c( "perc", "n"),
names_sep = "(?<=[a-z])(?=[0-9])") %>%
group_by(STUDY, ID, n) %>%
summarise(value = sprintf("%d (%d%%)", last(value), first(value))) %>%
select(-n)
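
As an alternative sketch that does not depend on a development version of tidyr, the "number (percentage)" strings can be built directly in summarise() and mutate(). The data frame below is typed in from the question. The column names total, n_both, n_died and n_died_or_prog are made up for this example, and percentages are taken relative to all subjects in the study, as in the question's expected answers.

```r
library(dplyr)

data <- tibble(STUDY  = 1,
               ID     = 1:4,
               BASE   = c(100, NA, 16, 15),
               CYCLE1 = c(30, 20, NA, 10),
               DIED   = c("No", "Yes", "Yes", "Yes"),
               PROG   = c("Yes", "No", "Yes", "Yes"))

summary_tbl <- data %>%
  # TRUE for subjects with both a baseline and a cycle 1 value
  mutate(both = !is.na(BASE) & !is.na(CYCLE1)) %>%
  group_by(STUDY) %>%
  summarise(total          = n(),
            n_both         = sum(both),
            n_died         = sum(both & DIED == "Yes"),
            n_died_or_prog = sum(both & (DIED == "Yes" | PROG == "Yes")),
            .groups = "drop") %>%
  # format each count as "count (percent of all subjects in the study)"
  mutate(across(starts_with("n_"),
                ~ sprintf("%d (%.0f%%)", .x, 100 * .x / total)))

summary_tbl
```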
