Count multi-response answers against a vector in R - r

I have a multi-response question from a survey.
The data look like this:
| respondent | friend           |
|------------|------------------|
| 001        | John, Mary       |
| 002        | Sue, John, Peter |
Then, I want to count, for each respondent, how many male and female friends they have.
I imagine I need to create separate vectors of male and female names, then check each cell in the friend column against these vectors and count.
Any help is appreciated.

This should be heavily caveated, because many common names are used by more than one gender. Here I use the sex recorded in American Social Security data, as provided by the babynames package, as a proxy. I then merge that with my data and compute a weighted count based on likelihood. In the dataset, fairly common names including Casey, Riley, Jessie, Jackie, Peyton, Jaime, Kerry, and Quinn are almost evenly split between genders, so in my approach each of those adds about half a female friend and half a male friend. That seems to me the most sensible approach when the name alone doesn't add much information about gender.
library(tidyverse) # using dplyr, tidyr
gender_freq <- babynames::babynames %>%
  filter(year >= 1930) %>% # limiting to people <= 92 y.o.
  count(name, sex, wt = n) %>%
  group_by(name) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()
tribble(
  ~respondent, ~friend,
  "001", "John, Mary, Riley",
  "002", "Sue, John, Peter") %>%
  separate_rows(friend, sep = ", ") %>%
  left_join(gender_freq, by = c("friend" = "name")) %>%
  count(respondent, sex, wt = share)
## A tibble: 4 x 3
# respondent sex n
# <chr> <chr> <dbl>
#1 001 F 1.53
#2 001 M 1.47
#3 002 F 1.00
#4 002 M 2.00
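To make the weighting idea concrete without the babynames package, here is a minimal base R sketch. The shares in it are made up for illustration (they are not taken from babynames), and the object names `shares`, `friends`, and `weighted` are hypothetical.

```r
# Made-up shares for illustration only (NOT taken from babynames):
# each name's shares sum to 1 across sexes.
shares <- data.frame(
  name  = c("John", "Mary", "Riley", "Riley"),
  sex   = c("M", "F", "F", "M"),
  share = c(1, 1, 0.5, 0.5)
)

# One row per (respondent, friend) pair
friends <- data.frame(
  respondent = "001",
  friend     = c("John", "Mary", "Riley")
)

# Attach the shares to each friend, then sum the shares per respondent and sex
joined   <- merge(friends, shares, by.x = "friend", by.y = "name")
weighted <- aggregate(share ~ respondent + sex, data = joined, FUN = sum)
weighted
```

With these invented shares, an evenly split name such as Riley contributes half a friend to each sex, which is the behaviour described above.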

Assuming you have a list that links a name with gender, you can split up your friend column, merge the result with your list and summarise on the gender:
library(tidyverse)
df <- tibble(
respondent = c('001', '002'),
friend = c('John, Mary', 'Sue, John, Peter')
)
names_df <- tibble(
name = c('John', 'Mary', 'Sue','Peter'),
gender = c('M', 'F', 'F', 'M')
)
df %>%
mutate(friend = strsplit(as.character(friend), ", ")) %>%
unnest(friend) %>%
left_join(names_df, by = c('friend' = 'name')) %>%
group_by(respondent) %>%
summarise(male_friends = sum(gender == 'M'),
female_friends = sum(gender == 'F'))
resulting in
# A tibble: 2 x 3
respondent male_friends female_friends
* <chr> <int> <int>
1 001 1 1
2 002 2 1
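The approach the question itself imagines (separate vectors of male and female names, checked against each cell) can also be sketched in base R. The vectors `male_names` and `female_names` are hypothetical and would have to come from your own name list.

```r
# Hypothetical lookup vectors; replace with your own name lists
male_names   <- c("John", "Peter")
female_names <- c("Mary", "Sue")

df <- data.frame(
  respondent = c("001", "002"),
  friend     = c("John, Mary", "Sue, John, Peter"),
  stringsAsFactors = FALSE
)

# Split each cell into individual names (comma plus optional spaces)
friends <- strsplit(df$friend, ",\\s*")

# Count how many names in each cell appear in each vector
df$male_friends   <- vapply(friends, function(x) sum(x %in% male_names), integer(1))
df$female_friends <- vapply(friends, function(x) sum(x %in% female_names), integer(1))
df
```

Names that appear in neither vector are simply not counted, so it may be worth checking for unmatched names separately.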

Related

Create a function to get summary statistics of a data frame in R

I have below data frame df3.
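| City   | Income | Cost | Age |
|--------|--------|------|-----|
| NY     | 1237   | 2432 | 43  |
| NY     | 6352   | 8632 | 32  |
| Boston | 6487   | 2846 | 54  |
| NJ     | 6547   | 7353 | 42  |
| Boston | 7564   | 7252 | 21  |
| NY     | 9363   | 7563 | 35  |
| Boston | 3262   | 7352 | 54  |
| NY     | 9473   | 8667 | 76  |
| NJ     | 6234   | 4857 | 31  |
| Boston | 5242   | 7684 | 39  |
| NJ     | 7483   | 4748 | 47  |
| NY     | 9273   | 6573 | 53  |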
I need to create a function 'ST' that returns the mean and standard deviation for a given city. For example, ST(NY) should return a table like the one below.
| variable | Mean | SD |
|----------|------|----|
| Income   | XX   | XX |
| Cost     | XX   | XX |
| Age      | XX   | XX |
XX are the values rounded to 2 decimal places. I wrote some code, but I am struggling to combine the pieces into one function. My code is below.
library(dplyr)
df3 %>%
group_by(City) %>%
summarise_at(vars("Income","Cost","Age"), median,2)
ST <- function(c) {
if (df3$City == s)
dataframe (
library(dplyr)
df3 %>%
group_by(City) %>%
summarise_at(vars("Income","Cost","Age"), mean,2),
library(dplyr)
df3 %>%
group_by(City) %>%
summarise_at(vars("Income","Cost","Age"), sd,2)
else {
"NA"
}
}
ST(NJ)
No need to call library(dplyr) multiple times, and doing so in the middle of a data.frame(..) expression is not right. Candidly, even if that were syntactically correct code (it could be with {...} bracing), it is generally considered better to put things like that at the beginning of the function, organizing the code. Put it at the beginning of your function, ST <- function(c) { library(dplyr); ... }.
From ?summarize_at,
Scoped verbs (_if, _at, _all) have been superseded by the use of across() in an existing verb. See vignette("colwise") for details.
I'll demo the use of across.
summarize can be given multiple (named) functions at once, I'll show that, too.
Your if (df3$City == .) is wrong for a few reasons, notably because if requires its condition to be exactly length 1 (anything else is an error, a warning, and/or a logic failure), but the test returns a logical vector as long as the number of rows in df3. A better tactic is to use dplyr::filter.
Your function uses objects that were neither passed to it nor defined within it, which is bad practice. Best practice is to pass the data and arguments in the function call.
ST <- function(X, city, na.rm = TRUE) {
  library(dplyr)
  library(tidyr) # pivot_longer
  filter(X, City %in% city) %>%
    summarize(across(c("Income", "Cost", "Age"),
                     list(mu = ~ mean(., na.rm = na.rm),
                          sigma = ~ sd(., na.rm = na.rm)))) %>%
    pivot_longer(everything(), names_pattern = "(.*)_(.*)",
                 names_to = c("variable", ".value"))
}
ST(df3, "NY")
# # A tibble: 3 x 3
# variable mu sigma
# <chr> <dbl> <dbl>
# 1 Income 7140. 3550.
# 2 Cost 6773. 2576.
# 3 Age 47.8 17.7
Notice that I used City %in% city instead of ==; in most cases this is identical, but there are two benefits to this:
NA inclusion works. Note that NA == NA returns NA (which derails a lot of conditional processing if not handled correctly), whereas NA %in% NA returns TRUE, which seems more intuitive (to me at least).
It allows for city (the function argument) to be length other than 1, such as ST(df3, c("NY", "Boston")). While that may not be a necessary thing for this function, it can be a handy utility in other function definitions, and can be a good thing to consider. Said differently and in CS-speak, it's good to think about a function handling not just "1" or "2" static things, but perhaps "1 or more" or "0 or more" (relatively unlimited number of arguments). (For this, I'll rename the function argument from city to cities, suggesting it can take more than one.)
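The two points about %in% versus == can be checked directly in base R. This is a small illustration only; the vector `x` is made up.

```r
x <- c("NY", NA, "Boston")

eq_result <- x == "NY"     # the NA element stays NA
in_result <- x %in% "NY"   # the NA element becomes FALSE
eq_result
in_result

na_eq <- NA == NA          # NA: comparison with a missing value is unknown
na_in <- NA %in% NA        # TRUE: %in% uses match(), which matches NA to NA
na_eq
na_in

# %in% also accepts several values on the right-hand side
multi <- x %in% c("NY", "Boston")
multi
```

Because %in% never returns NA, it is safe to use inside filter() or if() without an extra is.na() check.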
From this use of %in%, it might make sense to include the city name in the output; this can be done by adding a group_by after the filter, as in
ST <- function(X, cities, digits = 2, na.rm = TRUE) {
  library(dplyr)
  library(tidyr) # pivot_longer
  filter(X, City %in% cities) %>%
    group_by(City) %>%
    summarize(across(c("Income", "Cost", "Age"),
                     list(mu = ~ mean(., na.rm = na.rm),
                          sigma = ~ sd(., na.rm = na.rm)))) %>%
    pivot_longer(-City, names_pattern = "(.*)_(.*)",
                 names_to = c("variable", ".value")) %>%
    mutate(across(c(mu, sigma), ~ round(., digits)))
}
ST(df3, c("NY", "Boston"))
# # A tibble: 6 x 4
# City variable mu sigma
# <chr> <chr> <dbl> <dbl>
# 1 Boston Income 5639. 1847.
# 2 Boston Cost 6284. 2299.
# 3 Boston Age 42 15.7
# 4 NY Income 7140. 3550.
# 5 NY Cost 6773. 2576.
# 6 NY Age 47.8 17.7
Edit: I added the rounding.
library(dplyr)
library(tidyr) # pivot_longer
ST <- function(city_name) {
  df3 %>% # df3 is the data frame from the question
    filter(City == city_name) %>%
    pivot_longer(cols = Income:Age, names_to = "variable") %>%
    group_by(City, variable) %>%
    summarise(mean = mean(value),
              sd = sd(value), .groups = "drop")
}
ST("Boston")
# A tibble: 3 × 4
City variable mean sd
<chr> <chr> <dbl> <dbl>
1 Boston Age 42 15.7
2 Boston Cost 6284. 2299.
3 Boston Income 5639. 1847.
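To reproduce the answers above, the table from the question can be typed in as a data frame. The sketch below also computes the same summary in base R, as one possible alternative; `ST_base` is a hypothetical name, not something from the answers.

```r
# df3 built from the table in the question
df3 <- data.frame(
  City   = c("NY", "NY", "Boston", "NJ", "Boston", "NY",
             "Boston", "NY", "NJ", "Boston", "NJ", "NY"),
  Income = c(1237, 6352, 6487, 6547, 7564, 9363, 3262, 9473, 6234, 5242, 7483, 9273),
  Cost   = c(2432, 8632, 2846, 7353, 7252, 7563, 7352, 8667, 4857, 7684, 4748, 6573),
  Age    = c(43, 32, 54, 42, 21, 35, 54, 76, 31, 39, 47, 53),
  stringsAsFactors = FALSE
)

# Hypothetical base R helper: mean and SD of the numeric columns for the given city/cities
ST_base <- function(X, city, digits = 2) {
  rows <- X[X$City %in% city, c("Income", "Cost", "Age")]
  data.frame(
    variable = names(rows),
    Mean     = round(unname(vapply(rows, mean, numeric(1))), digits),
    SD       = round(unname(vapply(rows, sd, numeric(1))), digits),
    row.names = NULL
  )
}

res_ny <- ST_base(df3, "NY")
res_ny
```

Passing the data frame as an argument, rather than relying on a global `df3`, follows the advice given in the first answer.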

R: How to modify a CreateTable() function for reiterated observations and with the wrongs index?

I'm trying to create a table from the following dataset, of which I report the first fifty observations here.
There are some typos in the age and gender variables, which I suggest fixing as follows:
colnames(d)[8] <- 'COND'
d$gender = ifelse(tolower(substr(d$gender,1,1)) == "k", "F", "M")
library(libr)
d <- datastep(d, {
if (is.na(age)) {
age <- 21
}}
)
I'm trying to create a summary table by using the following code:
CreateTableOne(
vars = c('TASK', 'COND', 't1.key', 'T1.response', 'age', 'T1.ACC'),
strata = c('ID'),
factorVars = c('gender'),
argsApprox = list(correct = FALSE),
smd = TRUE,
addOverall = TRUE,
test = TRUE) %>%
na.omit() %>%
kableone()
obtaining this table
However, as you can see from this output, because I have many observations for the same subject (there are only 54 IDs), the numbers of females and males are incorrect.
length(unique(d$ID))
[1] 54
Does anyone know how to fix this? Furthermore, since 'age' and 'T1.ACC' are not normally distributed, does anyone know how I could report them with the median, Q1, and Q3 instead, for example?
I would like to help you. However, there are the following problems with the data you provide:
The variable COND is missing
Only one unique value of the TASK variable (the CreateTableOne function does not accept variables with one unique value).
Only one unique value for the variable age.
The variable ID is repeated several times.
However, even without changing your data, you can see what your problem is. If you have data in this form, you cannot use CreateTableOne! This is because it counts every occurrence of the value m and every occurrence of the value k. And since you have multiple entries for one person, the CreateTableOne function will count each occurrence separately.
Please take a look at the solution I have proposed here How to describe unique values of grouped observations for several vars?.
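The core issue (counting rows versus counting people) can be shown with a small base R sketch. The data frame `d` below is toy data invented for illustration, not the asker's dataset.

```r
# Toy data (hypothetical): three participants with repeated trials
d <- data.frame(
  ID     = c("P1", "P1", "P1", "P2", "P2", "P3"),
  gender = c("k", "k", "k", "m", "m", "k"),
  age    = c(19, 19, 19, 21, 21, 20),
  stringsAsFactors = FALSE
)

# Counting on the raw data counts trials, not people
trial_counts <- table(d$gender)

# Keep the first row of each ID, then count: this counts people
per_person    <- d[!duplicated(d$ID), ]
person_counts <- table(per_person$gender)

trial_counts
person_counts
```

Person-level variables such as gender and age are best summarised on the one-row-per-ID data, while trial-level variables need a different summary.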
Update 1
OKAY. Let's try to face your data.
You have 54 patients with different IDs.
data_Confidence_in_Action %>% distinct(ID) %>% nrow()
#[1] 54
However, note that one ID appears to be incorrect.
data_Confidence_in_Action %>% distinct(ID) %>%
mutate(lenID = str_length(ID)) %>% filter(lenID!=5)
# A tibble: 1 x 2
# ID lenID
# <chr> <int>
#1 P1419 dots 10
However, we can leave it as it is. Correct it yourself if you have to.
However, remember that you have as many as 8 different genders. Be careful because in our country the gender ideology is not well received ;-)
data_Confidence_in_Action %>% distinct(gender)
# A tibble: 8 x 1
# gender
# <chr>
#1 k
#2 kobieta
#3 M
#4 K
#5 mężczyzna
#6 21
#7 m
#8 Mężczyzna
This, unfortunately, needs to be fixed.
Unfortunately, patient P1440 has an age entered in the gender field. So what is the gender of P1440?
data_Confidence_in_Action %>% filter(gender==21) %>% distinct(ID, gender, age)
# A tibble: 1 x 3
# ID gender age
# <chr> <chr> <dbl>
#1 P1440 21 NA
data_Confidence_in_Action %>% distinct(ID, gender) %>%
group_by(gender) %>% summarise(n = n())
# A tibble: 8 x 2
# gender n
# <chr> <int>
#1 21 1
#2 k 36
#3 K 3
#4 kobieta 9
#5 m 1
#6 M 1
#7 mężczyzna 2
#8 Mężczyzna 1
As you can see, you have more women. So let P1440 be a woman. Is that OK?
Finally, notice that two variables have inconvenient names: Condition (whether a person responded) and Go/Nogo (whether a person should respond).
Let's fix it all in one go.
data_Confidence_in_Action = data_Confidence_in_Action %>%
  mutate(
    # values starting with k/K, or the stray "21" (P1440), become "k"; the rest "m"
    gender = ifelse(str_detect(gender, "^[kK]|^21$"), "k", "m"),
    age = ifelse(is.na(age), 21, age)
  ) %>%
  rename(Condition = `Condition (whether a person responded)`,
         Go.Nogo = `Go/Nogo (whether a person should respond)`)
Finally, let's change some of the variables from chr to factor, taking care to keep sensible level orders. I hope I chose them wisely.
data_Confidence_in_Action = data_Confidence_in_Action %>%
mutate(
ID = ID %>% fct_inorder(),
gender = gender %>% fct_infreq(),
t1.key = t1.key %>% fct_infreq(),
Condition = Condition %>% fct_infreq(),
CR.key = CR.key %>% fct_infreq(),
TASK = TASK %>% fct_infreq(),
Go.Nogo = Go.Nogo %>% fct_infreq(),
difficulty = difficulty %>% factor(c("easy", "medium", "hard"))
)
With the data organized this way, let's get to the heart of the problem: what do you really want to analyze? Note that for variables such as TASK, Condition, and t1.key, every participant has both valid values.
data_Confidence_in_Action %>% group_by(ID) %>% summarise(
nunique.TASK = length(unique(TASK)),
nunique.Condition = length(unique(Condition)),
nunique.t1.key = length(unique(t1.key))
) %>% distinct(nunique.TASK, nunique.Condition, nunique.t1.key)
# A tibble: 1 x 3
# nunique.TASK nunique.Condition nunique.t1.key
# <int> <int> <int>
#1 2 2 2
However, if we look at the proportions of the occurrence of different values in these variables, they are different in each patient.
data_Confidence_in_Action %>% group_by(ID) %>% summarise(
prop.TASK = sum(TASK=="left")/sum(TASK=="right")) %>%
distinct()
data_Confidence_in_Action %>% group_by(ID) %>% summarise(
prop.Condition = sum(Condition=="NR")/sum(Condition=="R"))%>%
distinct()
data_Confidence_in_Action %>% group_by(ID) %>% summarise(
prop.t1.key = sum(t1.key=="None")/sum(t1.key=="space"))%>%
distinct()
So write clearly what and how you want to summarize because it is not clear to me what you want to get.
Update 2
OKAY. I can see that you are beginning to understand something.
Still, I don't know what you want to sum up.
Look below. First, let's collect all the code to prepare the data
library(tidyverse)
library(readxl)
library(tableone)
data_Confidence_in_Action <- read_excel("data_Confidence in Action.xlsx")
data_Confidence_in_Action = data_Confidence_in_Action %>%
  mutate(
    # values starting with k/K, or the stray "21" (P1440), become "k"; the rest "m"
    gender = ifelse(str_detect(gender, "^[kK]|^21$"), "k", "m"),
    age = ifelse(is.na(age), 21, age)
  ) %>%
  rename(Condition = `Condition (whether a person responded)`,
         Go.Nogo = `Go/Nogo (whether a person should respond)`)
data_Confidence_in_Action = data_Confidence_in_Action %>%
mutate(
ID = ID %>% fct_inorder(),
gender = gender %>% fct_infreq(),
t1.key = t1.key %>% fct_infreq(),
Condition = Condition %>% fct_infreq(),
CR.key = CR.key %>% fct_infreq(),
TASK = TASK %>% fct_infreq(),
Go.Nogo = Go.Nogo %>% fct_infreq(),
difficulty = difficulty %>% factor(c("easy", "medium", "hard"))
)
And now the summary. If we do this:
CreateTableOne(
data = data_Confidence_in_Action,
vars = c('TASK', 'Condition', 't1.key', 'T1.response', 'age', 'T1.ACC'),
strata = 'gender',
factorVars = c('TASK', 'Condition', 't1.key'),
argsApprox = list(correct = FALSE),
smd = TRUE,
addOverall = TRUE,
test = TRUE) %>%
kableone()
output
| |Overall |k |m |p |test |
|:-----------------------|:------------|:------------|:------------|:------|:----|
|n |41713 |37823 |3890 | | |
|TASK = right (%) |20832 (49.9) |18889 (49.9) |1943 (49.9) |0.992 | |
|Condition = R (%) |20033 (48.0) |18130 (47.9) |1903 (48.9) |0.241 | |
|t1.key = space (%) |20033 (48.0) |18130 (47.9) |1903 (48.9) |0.241 | |
|T1.response (mean (SD)) |0.48 (0.50) |0.48 (0.50) |0.49 (0.50) |0.241 | |
|age (mean (SD)) |20.74 (2.67) |20.75 (2.70) |20.60 (2.33) |0.001 | |
|T1.ACC (mean (SD)) |0.70 (0.46) |0.70 (0.46) |0.73 (0.45) |<0.001 | |
we get a summary for all observations that is n == 41713. And since there are many observations for each patient, such a summary is of little use. At least I think so.
However, we can summarize for a few selected patients.
CreateTableOne(
data = data_Confidence_in_Action %>%
filter(ID %in% c('P1323', 'P1403', 'P1404')) %>%
mutate(ID = ID %>% fct_drop()),
vars = c('TASK', 'Condition', 't1.key', 'T1.response', 'age', 'T1.ACC'),
strata = c('ID'),
factorVars = c('TASK', 'Condition', 't1.key'),
argsApprox = list(correct = FALSE),
smd = TRUE,
addOverall = TRUE,
test = TRUE) %>%
kableone()
output
| |Overall |P1323 |P1403 |P1404 |p |test |
|:-----------------------|:------------|:------------|:------------|:------------|:------|:----|
|n |2323 |775 |776 |772 | | |
|TASK = right (%) |1164 (50.1) |390 (50.3) |386 (49.7) |388 (50.3) |0.969 | |
|Condition = R (%) |1168 (50.3) |385 (49.7) |435 (56.1) |348 (45.1) |<0.001 | |
|t1.key = space (%) |1168 (50.3) |385 (49.7) |435 (56.1) |348 (45.1) |<0.001 | |
|T1.response (mean (SD)) |0.50 (0.50) |0.50 (0.50) |0.56 (0.50) |0.45 (0.50) |<0.001 | |
|age (mean (SD)) |19.66 (0.94) |19.00 (0.00) |19.00 (0.00) |21.00 (0.00) |<0.001 | |
|T1.ACC (mean (SD)) |0.70 (0.46) |0.67 (0.47) |0.77 (0.42) |0.65 (0.48) |<0.001 | |
This makes more sense now, but is separate for each patient.
Alternatively, you can do this summary without using CreateTableOne, e.g. like this:
data_Confidence_in_Action %>%
  group_by(gender, ID) %>%
  summarise(age = min(age)) %>% # one row per participant
  group_by(gender) %>%
  summarise(
    n = n(),
    Min = min(age),
    # quantile()'s third positional argument is na.rm, so the original
    # quantile(age, 1/4, 8) meant na.rm = TRUE; write type = 8 to pick a method
    Q1 = quantile(age, 1/4, na.rm = TRUE),
    mean = mean(age),
    median = median(age),
    Q3 = quantile(age, 3/4, na.rm = TRUE),
    Max = max(age),
    IQR = IQR(age),
    Kurt = e1071::kurtosis(age),
    skew = e1071::skewness(age),
    SD = sd(age))
output
# A tibble: 2 x 12
gender n Min Q1 mean median Q3 Max IQR Kurt skew SD
<fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 k 49 19 19 20.8 20 21 32 2 7.47 2.79 2.73
2 m 5 19 19 20.6 19 21 25 2 -1.29 0.823 2.61
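One detail worth knowing when computing quartiles this way: the third positional argument of quantile() is na.rm, not type, so a bare number in that position does not select the quantile algorithm. A small sketch with a made-up vector `x`:

```r
x <- c(1, 2, 3, 4, 10)   # made-up values for illustration

q_positional <- quantile(x, 1/4, 8)          # 8 is taken as na.rm (truthy), type stays 7
q_default    <- quantile(x, 1/4)             # default algorithm, type = 7
q_type8      <- quantile(x, 1/4, type = 8)   # explicitly request type 8

q_positional
q_default
q_type8
```

Naming the argument (`type = 8`) is the safe way to pick a specific quantile definition.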
Think carefully and write down what you really expect, provided, of course, that this topic still interests you.

Is there a dplyr function to determine the most commonly encountered categorical value within a group?

I am looking to summarize a customer transactional dataframe to a single row per customer using dplyr. For continuous variables this is simple - use sum / mean etc. For categorical variables I would like to choose the "Mode" - i.e. the most commonly encountered value within the group and do this across multiple columns e.g.:
For example, take the table Cus:
Cus <- data.frame(Customer = c("C-01", "C-01", "C-02", "C-02", "C-02", "C-02", "C-03", "C-03"),
Product = c("COKE", "COKE", "FRIES", "SHAKE", "BURGER", "BURGER", "CHICKEN", "FISH"),
Store = c("NYC", "NYC", "Chicago", "Chicago", "Detroit", "Detroit", "LA", "San Fran")
)
And generate the table Cus_Summary:
Cus_Summary <- data.frame(Customer = c("C-01", "C-02", "C-03"),
Product = c("COKE", "BURGER", "CHICKEN"),
Store = c("NYC", "Chicago", "LA")
)
Are there any packages that can provide this function? Or has anyone a function that can be applied across multiple columns within a dplyr step?
I am not worried about smart ways to handle ties - any output for a tie will suffice (although any suggestions as to how to best handle ties would be interesting and appreciated).
How about this?
Cus %>%
group_by(Customer) %>%
summarise(
Product = first(names(sort(table(Product), decreasing = TRUE))),
Store = first(names(sort(table(Store), decreasing = TRUE))))
## A tibble: 3 x 3
# Customer Product Store
# <fct> <chr> <chr>
#1 C-01 COKE NYC
#2 C-02 BURGER Chicago
#3 C-03 CHICKEN LA
Note that in the case of ties this selects the first entry in alphabetical order.
Update
To randomly select an entry from tied top frequency entries we could define a custom function
top_random <- function(x) {
  tbl <- sort(table(x), decreasing = TRUE)
  top <- tbl[tbl == max(tbl)]        # keep only the entries tied for the top count
  return(sample(names(top), 1))      # pick one of them at random
}
Then the following randomly selects one of the tied top entries:
Cus %>%
group_by(Customer) %>%
summarise(
Product = top_random(Product),
Store = top_random(Store))
In my solution, if there is more than one most frequent value, all of them are presented:
library(tidyverse)
Cus %>%
gather('type', 'value', -Customer) %>%
group_by(Customer, type, value) %>%
count() %>%
group_by(Customer) %>%
filter(n == max(n)) %>%
nest() %>%
mutate(
Product = map_chr(data, ~str_c(filter(.x, type == 'Product') %>% pull(value), collapse = ', ')),
Store = map_chr(data, ~str_c(filter(.x, type == 'Store') %>% pull(value), collapse = ', '))
) %>%
select(-data)
Result is:
# A tibble: 3 x 3
Customer Product Store
<fct> <chr> <chr>
1 C-01 COKE NYC
2 C-02 BURGER Chicago, Detroit
3 C-03 CHICKEN, FISH LA, San Fran
If you have many columns and want to find out maximum occurrence in all the columns you could use gather to convert the data in long format, count the occurrence for each column, group_by Customer and column and keep only the rows with maximum count and then spread it back to wide format.
library(tidyverse)
Cus %>%
gather(key, value, -Customer) %>%
count(Customer, key, value) %>%
group_by(Customer, key) %>%
slice(which.max(n)) %>%
ungroup() %>%
spread(key, value) %>%
select(-n)
# Customer Product Store
# <fct> <chr> <chr>
#1 C-01 COKE NYC
#2 C-02 BURGER Chicago
#3 C-03 CHICKEN LA
EDIT
In case of ties if we want to randomly select ties we can filter all the max values and then use sample_n function to select random rows.
Cus %>%
gather(key, value, -Customer) %>%
count(Customer, key, value) %>%
group_by(Customer, key) %>%
filter(n == max(n)) %>%
sample_n(1) %>%
ungroup() %>%
spread(key, value) %>%
select(-n)
# Customer Product Store
# <fct> <chr> <chr>
#1 C-01 COKE NYC
#2 C-02 BURGER Chicago
#3 C-03 FISH San Fran
Using SO's favourite Mode function (though you could use any):
Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
In base R
aggregate(. ~ Customer, lapply(Cus,as.character), Mode)
# Customer Product Store
# 1 C-01 COKE NYC
# 2 C-02 BURGER Chicago
# 3 C-03 CHICKEN LA
using dplyr
library(dplyr)
Cus %>%
group_by(Customer) %>%
summarise_all(Mode)
# # A tibble: 3 x 3
# Customer Product Store
# <fctr> <fctr> <fctr>
# 1 C-01 COKE NYC
# 2 C-02 BURGER Chicago
# 3 C-03 CHICKEN LA
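On the question of ties: the Mode function above returns the tied value that appears first in the data, not the first alphabetically. The sketch below shows this alongside a variant that returns every tied value; `all_modes` is a hypothetical helper, not part of any package.

```r
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Hypothetical helper: return all values tied for the highest count
all_modes <- function(x) {
  ux <- unique(x)
  counts <- tabulate(match(x, ux))
  ux[counts == max(counts)]
}

stores <- c("Detroit", "Chicago", "Chicago", "Detroit")

first_mode <- Mode(stores)       # which.max() picks the first maximum, in order of appearance
tied_modes <- all_modes(stores)  # both tied values

first_mode
tied_modes
```

Collapsing `all_modes()` with `paste(..., collapse = ", ")` would give one string per group, similar to the output of the answer that presents all tied values.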

R: Generate a table of win/loss records against specific players

Let's say I have the following data:
dat <- read.table(text="p1 p2 outcome
jon joe 1-0
jon james 0-1
james ken 1-0
ken jon 1-0", header=T)
I'm trying to use dplyr to output a summary table of some specific player's (e.g. jon's) statistics against every other player in the dataframe. So, the output should be:
joe: 1-0
james: 0-1
ken: 0-1
I want to use 'group_by' to work with a corpus of joe games, but don't know how to implement conditional group_by's (e.g. group_by joe if p1 or p2 == joe). I could mutate to create a dummy column that is equal to 1 if either of those conditions are true, and group_by that, but was hoping there was a more parsimonious strategy. And then, the only way I can see of counting a 'win' for Joe is to use an ifelse statement whereby if p1 == Joe and outcome == 1-0 or p2 == Joe and outcome == 0-1, then count that as a win for Joe. However, not sure how to do these if statements within dplyr piping.
This would be a dplyr solution that allows for multiple games between jon and the other players (not just one game). It basically filters all games that jon was part of and extracts the opponent via mutate and ifelse. It then summarizes the number of wins and losses after grouping by opponent. In the end I paste the overall result for each opponent and only select this pasted column:
dat %>%
  mutate(p1 = as.character(p1), p2 = as.character(p2)) %>%
  filter((p1 == "jon") | (p2 == "jon")) %>%
  mutate(opponent = ifelse(p1 == "jon", p2, p1)) %>%
  group_by(opponent) %>%
  summarize(Wins = sum((outcome == "1-0" & p1 == "jon") |
                       (outcome == "0-1" & p2 == "jon")),
            Losses = n() - Wins) %>%
  # paste0() avoids the extra spaces that paste() would insert between pieces
  mutate(Outcome = paste0(opponent, ": ", Wins, "-", Losses)) %>%
  select(Outcome)
I had to add the as.character mutate to properly return the opponents in the ifelse. Otherwise the variables p1 and p2 would still be factor and the numbers would be returned instead of the labels (i.e. names of the players).
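The factor issue described here is easy to reproduce in base R. The vectors below are made up for illustration.

```r
p1 <- factor(c("jon", "ken"))
p2 <- factor(c("joe", "jon"))

# With factors, ifelse() returns the underlying integer codes, not the labels
codes <- ifelse(p1 == "jon", p2, p1)

# Converting to character first returns the player names
labels <- ifelse(p1 == "jon", as.character(p2), as.character(p1))

codes
labels
```

Reading the data with `stringsAsFactors = FALSE` (the default since R 4.0.0) avoids the problem at the source.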
Here's an alternative tidyverse solution:
# example data
dat <- read.table(text="
p1 p2 outcome
jon joe 1-0
jon james 0-1
james ken 1-0
ken jon 1-0", header=T, stringsAsFactors=F)
library(tidyverse)
# reshape your dataset
dat2 = dat %>%
mutate(game_id = row_number()) %>% # add game id
unite(p, p1, p2, sep="-") %>% # combine player names
separate_rows(p, outcome) # separate rows using name and scores
# get summary stats for jon
dat2 %>%
group_by(game_id) %>% # for each game id
filter("jon" %in% p) %>% # keep games that jon played
summarise(pl = p[p != "jon"], # get the name of the other player
outcome = paste0(outcome[p=="jon"], "-", outcome[p!="jon"])) # combine the scores (jon vs. other)
# # A tibble: 3 x 3
# game_id pl outcome
# <int> <chr> <chr>
# 1 1 joe 1-0
# 2 2 james 0-1
# 3 4 ken 0-1
Assuming you can reshape your original dataset once, at the beginning, you can create a function using the second part:
GetSummaryStats = function(x) {
dat2 %>%
group_by(game_id) %>%
filter(x %in% p) %>%
summarise(pl = p[p != x],
outcome = paste0(outcome[p==x], "-", outcome[p!=x])) }
and call it like this:
GetSummaryStats("jon")
for any player you like.
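For comparison, the same win/loss logic can be sketched in base R without reshaping. `record_against` is a hypothetical helper name, and the sketch assumes every outcome is either "1-0" or "0-1" (draws are not handled and would be counted as losses).

```r
dat <- read.table(text = "p1 p2 outcome
jon joe 1-0
jon james 0-1
james ken 1-0
ken jon 1-0", header = TRUE, stringsAsFactors = FALSE)

# Hypothetical helper: win-loss record of `player` against each opponent
record_against <- function(dat, player) {
  games    <- dat[dat$p1 == player | dat$p2 == player, ]
  opponent <- ifelse(games$p1 == player, games$p2, games$p1)
  # A win is "1-0" when listed first, or "0-1" when listed second
  won <- (games$p1 == player & games$outcome == "1-0") |
         (games$p2 == player & games$outcome == "0-1")
  wins   <- tapply(won, opponent, sum)
  losses <- tapply(!won, opponent, sum)
  paste0(names(wins), ": ", wins, "-", losses)
}

jon_record <- record_against(dat, "jon")
jon_record
```

Because the counts are summed per opponent, this also works when two players have met more than once.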

R: sum row based on several conditions

I am working on my thesis with little knowledge of R, so the answer to this question may be pretty obvious.
I have a dataset that looks like this:
county<-c('1001','1001','1001','1202','1202','1303','1303')
naics<-c('423620','423630','423720','423620','423720','423550','423720')
employment<-c(5,6,5,5,5,6,5)
data<-data.frame(county,naics,employment)
For every county, I want to sum the value of employment of rows with naics '423620' and '423720'. (So two conditions: 1. same county code 2. those two naics codes) The row in which they are added should be the first one ('423620'), and the second one ('423720') should be removed
The final dataset should look like this:
county2<-c('1001','1001','1202','1303','1303')
naics2<-c('423620','423630','423620','423550','423720')
employment2<-c(10,6,10,6,5)
data2<-data.frame(county2,naics2,employment2)
I have tried to do it myself with aggregate and rowSum, but because of the two conditions, I have failed thus far. Thank you very much.
We can do
library(dplyr)
data$naics <- as.character(data$naics)
data %>%
filter(naics %in% c(423620, 423720)) %>% group_by(county) %>%
summarise(naics = "423620", employment = sum(employment)) %>%
bind_rows(., filter(data, !naics %in% c(423620, 423720)))
# A tibble: 5 x 3
# county naics employment
# <fctr> <chr> <dbl>
#1 1001 423620 10
#2 1202 423620 10
#3 1303 423620 5
#4 1001 423630 6
#5 1303 423550 6
With such a condition, I'd first write a small helper and then pass it on to dplyr mutate:
# replace 423720 by 423620 only if both exist
onlyThoseNAICS <- function(v){
  # && is the scalar "and", appropriate for a length-1 if() condition
  if( ("423620" %in% v) && ("423720" %in% v) ) v[v == "423720"] <- "423620"
  v
}
data %>%
dplyr::group_by(county) %>%
dplyr::mutate(naics = onlyThoseNAICS(naics)) %>%
dplyr::group_by(county, naics) %>%
dplyr::summarise(employment = sum(employment)) %>%
dplyr::ungroup()
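Since the question mentions trying aggregate, here is a base R sketch of the same idea: relabel 423720 as 423620 only within counties that contain both codes, then sum. `merge_naics` is a hypothetical helper name.

```r
data <- data.frame(
  county     = c("1001", "1001", "1001", "1202", "1202", "1303", "1303"),
  naics      = c("423620", "423630", "423720", "423620", "423720", "423550", "423720"),
  employment = c(5, 6, 5, 5, 5, 6, 5),
  stringsAsFactors = FALSE
)

# Hypothetical helper: relabel 423720 as 423620 only if both codes occur in the group
merge_naics <- function(v) {
  if (all(c("423620", "423720") %in% v)) v[v == "423720"] <- "423620"
  v
}

# Apply the helper within each county, then sum employment per county and code
data$naics <- ave(data$naics, data$county, FUN = merge_naics)
result <- aggregate(employment ~ county + naics, data = data, FUN = sum)
result <- result[order(result$county, result$naics), ]
result
```

County 1303 has only 423720, so that row is left untouched, which matches the desired output in the question.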
