How to append text to a column based on conditions? - r

I have an empty column designated for categorising entries in my data frame. Categories are not exclusive, i.e. one entry can have multiple categories.
animals categories
1 monkey
2 humpback whale
3 river trout
4 seagull
The categories column should have categories based on the animal's properties. I know the properties based on vectors. The elements in the vectors aren't necessarily a perfect match.
mammals <- c("whale", "monkey", "dog")
swimming <- c("whale", "trout", "dolphin")
How do I get the following result, ideally without looping?
animals categories
1 monkey mammal
2 humpback whale mammal,swimming
3 river trout swimming
4 seagull

This can be done with fuzzyjoin after building a key/value dataset. lst from dplyr returns a named list, which enframe converts to a two-column dataset. We then unnest the list column, group by 'animals', paste the 'categories' into a single string, and join the result to the original dataset with regex_left_join.
library(fuzzyjoin)
library(dplyr)
library(tidyr)
library(tibble)
keydat <- lst(mammals, swimming) %>%
  enframe(name = 'categories', value = 'animals') %>%
  unnest(animals) %>%
  group_by(animals) %>%
  summarise(categories = toString(categories))
regex_left_join(df1, keydat, by = 'animals', ignore_case = TRUE) %>%
  transmute(animals = animals.x, categories)
# A tibble: 4 × 2
animals categories
<chr> <chr>
1 monkey mammals
2 humpback whale mammals, swimming
3 river trout swimming
4 seagull <NA>
data
df1 <- tibble(animals = c('monkey', 'humpback whale', 'river trout', 'seagull'))

A base R option using stack + aggregate + grepl
lut <- aggregate(
  . ~ values,
  type.convert(
    stack(list(mammals = mammals, swimming = swimming)),
    as.is = TRUE
  ),
  toString
)
p <- sapply(
  lut$values,
  grepl,
  x = df$animals
)
df$categories <- lut$ind[replace(rowSums(p * col(p)), rowSums(p) == 0, NA)]
which gives
> df
animals categories
1 monkey mammals
2 humpback whale mammals, swimming
3 river trout swimming
4 seagull <NA>
Data
df <- data.frame(animals = c("monkey", "humpback whale", "river trout", "seagull"))
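The lookup above picks a single column position per row, so it relies on each animal matching at most one keyword. A minimal base R sketch that does not depend on that assumption tests each category separately and pastes the names of the categories that match. The names keys, hits and categories are illustrative only, not part of either answer.

```r
# Toy inputs copied from the question
animals  <- c("monkey", "humpback whale", "river trout", "seagull")
mammals  <- c("whale", "monkey", "dog")
swimming <- c("whale", "trout", "dolphin")

# One keyword vector per category; the list names become the labels
keys <- list(mammals = mammals, swimming = swimming)

# For each category, TRUE where any of its keywords occurs in the animal name
hits <- sapply(keys, function(k) grepl(paste(k, collapse = "|"), animals))

# Paste the matching category names per row; NA when nothing matches
categories <- apply(hits, 1, function(r) {
  if (any(r)) toString(names(keys)[r]) else NA_character_
})

data.frame(animals, categories)
```

Because the keywords are collapsed into a regular expression, any keyword containing regex metacharacters would need escaping first.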

Related

Is there a R function to group the categorical values to type of values

I have a data frame that includes all different types of goods, for example, apples, bananas, potatoes, tuna, salmon, oranges, and many more.
All these goods are under a variable "Item".
I am looking for a solution in R that can create a new variable as "Item Category" with Fruits, Vegetables, Seafood and assign all the items according to their category.
You may prepare a list for each category and match them in a case_when statement.
library(dplyr)
df <- data.frame(item = c('apples', 'bananas', 'potatoes', 'tuna', 'salmon', 'oranges'))
df <- df %>%
  mutate(item_category = case_when(item %in% c('apples', 'bananas', 'oranges') ~ 'Fruits',
                                   item %in% c('potatoes') ~ 'Vegetables',
                                   item %in% c('tuna', 'salmon') ~ 'SeaFood'))
df
# item item_category
#1 apples Fruits
#2 bananas Fruits
#3 potatoes Vegetables
#4 tuna SeaFood
#5 salmon SeaFood
#6 oranges Fruits
You can use fct_collapse() in forcats to collapse factor levels into manually defined groups:
library(dplyr)
# Refer to @RonakShah's example
df <- data.frame(item = c('apples', 'bananas', 'potatoes', 'tuna', 'salmon', 'oranges'))
df %>%
  mutate(
    item_category = forcats::fct_collapse(item,
      'Fruits' = c('apples', 'bananas', 'oranges'),
      'Vegetables' = c('potatoes'),
      'SeaFood' = c('tuna', 'salmon'))
  )
or passing a named list to rename levels with !!!:
lev <- list('Fruits' = c('apples', 'bananas', 'oranges'),
            'Vegetables' = c('potatoes'),
            'SeaFood' = c('tuna', 'salmon'))
df %>%
  mutate(item_category = forcats::fct_collapse(item, !!!lev))
# item item_category
# 1 apples Fruits
# 2 bananas Fruits
# 3 potatoes Vegetables
# 4 tuna SeaFood
# 5 salmon SeaFood
# 6 oranges Fruits
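If no extra packages are wanted, the same named list can also drive a plain lookup. The following is a base R sketch rather than part of either answer: stack turns the list into a two-column table (values, ind) and match finds each item's row in it. The name lookup is illustrative.

```r
df <- data.frame(
  item = c("apples", "bananas", "potatoes", "tuna", "salmon", "oranges"),
  stringsAsFactors = FALSE
)

# Category -> items, in the same shape as the list passed to fct_collapse()
lev <- list(Fruits     = c("apples", "bananas", "oranges"),
            Vegetables = c("potatoes"),
            SeaFood    = c("tuna", "salmon"))

# stack() yields one row per item: 'values' is the item, 'ind' its category
lookup <- stack(lev)

# match() returns the lookup row for each item (NA for unlisted items)
df$item_category <- as.character(lookup$ind[match(df$item, lookup$values)])
df
```

Items that appear in no category come back as NA, which makes gaps in the list easy to spot.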

Count multi-response answers against a vector in R

I have a multi-response question from a survey.
The data look like this:
| respondent | friend           |
|------------|------------------|
| 001        | John, Mary       |
| 002        | Sue, John, Peter |
Then, I want to count, for each respondent, how many male and female friends they have.
I imagine I need to create separate vectors of male and female names, then check each cell in the friend column against these vectors and count.
Any help is appreciated.
This should be heavily caveated, because many common names are used by more than one gender. Here I use the sex recorded in the American Social Security data in the babynames package as a proxy. I then join that to my data and compute a weighted count based on likelihood. In the dataset, fairly common names including Casey, Riley, Jessie, Jackie, Peyton, Jaime, Kerry, and Quinn are almost evenly split between genders, so in my approach each of those adds about half a female friend and half a male friend, which seems to me the most sensible approach when the name alone doesn't add much information about gender.
library(tidyverse) # using dplyr, tidyr
gender_freq <- babynames::babynames %>%
  filter(year >= 1930) %>% # limiting to people <= 92 y.o.
  count(name, sex, wt = n) %>%
  group_by(name) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()
tribble(
  ~respondent, ~friend,
  "001", "John, Mary, Riley",
  "002", "Sue, John, Peter") %>%
  separate_rows(friend, sep = ", ") %>%
  left_join(gender_freq, by = c("friend" = "name")) %>%
  count(respondent, sex, wt = share)
## A tibble: 4 x 3
# respondent sex n
# <chr> <chr> <dbl>
#1 001 F 1.53
#2 001 M 1.47
#3 002 F 1.00
#4 002 M 2.00
Assuming you have a list that links a name with gender, you can split up your friend column, merge the result with your list and summarise on the gender:
library(tidyverse)
df <- tibble(
  respondent = c('001', '002'),
  friend = c('John, Mary', 'Sue, John, Peter')
)
names_df <- tibble(
  name = c('John', 'Mary', 'Sue','Peter'),
  gender = c('M', 'F', 'F', 'M')
)
df %>%
  mutate(friend = strsplit(as.character(friend), ", ")) %>%
  unnest(friend) %>%
  left_join(names_df, by = c('friend' = 'name')) %>%
  group_by(respondent) %>%
  summarise(male_friends = sum(gender == 'M'),
            female_friends = sum(gender == 'F'))
resulting in
# A tibble: 2 x 3
respondent male_friends female_friends
* <chr> <int> <int>
1 001 1 1
2 002 2 1
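In the lookup-table approach, a friend whose name is missing from names_df gets an NA gender after the join, and sum(gender == 'M') then returns NA for that respondent. A base R sketch of the same count that tolerates unknown names is shown here. The friend "Alex" and the names gender_of and unknown_friends are made up for illustration.

```r
df <- data.frame(
  respondent = c("001", "002"),
  friend     = c("John, Mary", "Sue, John, Peter, Alex"),  # "Alex" is not in the lookup
  stringsAsFactors = FALSE
)

# Named vector used as a lookup table: name -> gender
gender_of <- c(John = "M", Mary = "F", Sue = "F", Peter = "M")

# Split each cell into a character vector of names
friends <- strsplit(df$friend, ", ", fixed = TRUE)

# Count per respondent; na.rm = TRUE skips names missing from the lookup
df$male_friends    <- sapply(friends, function(f) sum(gender_of[f] == "M", na.rm = TRUE))
df$female_friends  <- sapply(friends, function(f) sum(gender_of[f] == "F", na.rm = TRUE))
df$unknown_friends <- sapply(friends, function(f) sum(is.na(gender_of[f])))
df
```

Reporting the unknown count separately keeps missing lookups visible instead of silently dropping them.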

Filtering dataframe to only show 1 pair of two variables

I have information on physicians working in different hospitals at different points in time. I would like to output a data frame that lists each pair of physicians in each hospital. I would like each pair to appear only once in the dataset: if physicians A and B work together in the same hospital, I would like to see either the pair A-B or the pair B-A, but not both.
Consider the very simple example of hospitals x-y-w, periods 1-2 and physicians A-B-C-D.
mydf <- data.frame(hospital = c("x","x","x","x","x","y","y","y","w","w","w","w"),
                   period = c(1,1,1,2,2,1,2,2,1,1,2,2),
                   physician = c("A","B","C","A","B","A","A","C","C","D","A","D"))
Below I manage to get all pairs, but each pair shows up twice (with from and to swapped). How could I get each pair to show up only once in the output?
library(dplyr)
pairs_df <- mydf %>%
  rename(from = physician) %>%
  left_join(mydf, by = c("hospital", "period")) %>%
  rename(to = physician) %>%
  filter(from != to)
We can use pmin/pmax to sort the 'from' and 'to' values within each row, apply duplicated to the sorted pairs, and negate (!) the result inside filter to keep only the first occurrence of each pair.
library(dplyr)
pairs_df %>%
  filter(!duplicated(cbind(pmax(from, to), pmin(from, to))))
Or use base R
subset(pairs_df, !duplicated(cbind(pmax(from, to), pmin(from, to))))
-output
hospital period from to
1 x 1 A B
2 x 1 A C
4 x 1 B C
11 w 1 C D
13 w 2 A D
NOTE: Here we assume that the columns are of class character, as in the input data: data.frame uses stringsAsFactors = FALSE by default from R 4.0.0 onward, whereas the default was previously TRUE. If the columns are factors, we can convert them to character with type.convert
pairs_df <- type.convert(pairs_df, as.is = TRUE)
Or before the filter convert those factor to character
pairs_df %>%
  mutate(across(where(is.factor), as.character)) %>%
  filter(!duplicated(cbind(pmax(from, to), pmin(from, to))))
Another option is using igraph
library(igraph)
get.data.frame(
  simplify(
    graph_from_data_frame(
      pairs_df[c("from", "to", "hospital", "period")],
      directed = FALSE
    ),
    edge.attr.comb = "first"
  )
)
which gives
from to hospital period
1 A B x 1
2 A C x 1
3 A D w 2
4 B C x 1
5 C D w 1
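Both approaches above keep each pair once across the whole data frame, so A-B in hospital x appears for period 1 only. If one row per pair is wanted within every hospital and period instead, a base R sketch under that assumption is to self-join and keep the rows where from sorts before to. The names all_pairs and unique_pairs, and the _from/_to suffixes, are chosen here for illustration.

```r
mydf <- data.frame(
  hospital  = c("x","x","x","x","x","y","y","y","w","w","w","w"),
  period    = c(1,1,1,2,2,1,2,2,1,1,2,2),
  physician = c("A","B","C","A","B","A","A","C","C","D","A","D"),
  stringsAsFactors = FALSE
)

# Self-join on hospital and period: every ordered pair of co-workers,
# with columns physician_from and physician_to
all_pairs <- merge(mydf, mydf, by = c("hospital", "period"),
                   suffixes = c("_from", "_to"))

# Keeping only from < to drops self-pairs and one of each mirrored pair
unique_pairs <- all_pairs[all_pairs$physician_from < all_pairs$physician_to, ]
unique_pairs
```

The string comparison replaces the pmin/pmax step: a pair survives only in its alphabetically ordered orientation.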

R: Generate a table of win/loss records against specific players

Let's say I have the following data:
dat <- read.table(text="p1 p2 outcome
jon joe 1-0
jon james 0-1
james ken 1-0
ken jon 1-0", header=T)
I'm trying to use dplyr to output a summary table of some specific player's (e.g. jon's) statistics against every other player in the dataframe. So, the output should be:
joe: 1-0
james: 1-0
ken: 0-1
I want to use 'group_by' to work with a corpus of joe games, but don't know how to implement conditional group_by's (e.g. group_by joe if p1 or p2 == joe). I could mutate to create a dummy column that is equal to 1 if either of those conditions are true, and group_by that, but was hoping there was a more parsimonious strategy. And then, the only way I can see of counting a 'win' for Joe is to use an ifelse statement whereby if p1 == Joe and outcome == 1-0 or p2 == Joe and outcome == 0-1, then count that as a win for Joe. However, not sure how to do these if statements within dplyr piping.
This would be a dplyr solution that allows for multiple games between jon and the other players (not just one game). It basically filters all games that jon was part of and extracts the opponent via mutate and ifelse. It then summarizes the number of wins and losses after grouping by opponent. In the end I paste the overall result for each opponent and only select this pasted column:
library(dplyr)
dat %>%
  mutate(p1 = as.character(p1), p2 = as.character(p2)) %>%
  filter((p1 == "jon") | (p2 == "jon")) %>%
  mutate(opponent = ifelse(p1 == "jon", p2, p1)) %>%
  group_by(opponent) %>%
  summarize(Wins = sum((outcome == "1-0" & p1 == "jon") |
                       (outcome == "0-1" & p2 == "jon")),
            Losses = n() - Wins) %>%
  # paste0 avoids the extra spaces that paste would insert between the parts
  mutate(Outcome = paste0(opponent, ": ", Wins, "-", Losses)) %>%
  select(Outcome)
I had to add the as.character mutate to properly return the opponents in the ifelse. Otherwise the variables p1 and p2 would still be factor and the numbers would be returned instead of the labels (i.e. names of the players).
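The same filter, opponent and win logic can be sketched in base R as a small function that works for any player. This is an illustration, not part of the answer: record_against is a hypothetical helper, and it assumes every outcome is either "1-0" or "0-1" (a draw would be counted as a loss).

```r
dat <- read.table(text = "p1 p2 outcome
jon joe 1-0
jon james 0-1
james ken 1-0
ken jon 1-0", header = TRUE, stringsAsFactors = FALSE)

# Hypothetical helper: win-loss record of `player` against each opponent
record_against <- function(dat, player) {
  # Keep only the games the player took part in
  games <- dat[dat$p1 == player | dat$p2 == player, ]
  # The opponent is whichever side is not the player
  opponent <- ifelse(games$p1 == player, games$p2, games$p1)
  # A win is 1-0 as first player or 0-1 as second player
  won <- (games$outcome == "1-0" & games$p1 == player) |
         (games$outcome == "0-1" & games$p2 == player)
  # Sum wins and losses per opponent (sorted alphabetically by tapply)
  wins   <- tapply(won, opponent, sum)
  losses <- tapply(!won, opponent, sum)
  paste0(names(wins), ": ", wins, "-", losses)
}

record_against(dat, "jon")
```

Because the counts are summed per opponent, repeated games between the same two players accumulate into one record.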
Here's an alternative tidyverse solution:
# example data
dat <- read.table(text="
p1 p2 outcome
jon joe 1-0
jon james 0-1
james ken 1-0
ken jon 1-0", header=T, stringsAsFactors=F)
library(tidyverse)
# reshape your dataset
dat2 = dat %>%
  mutate(game_id = row_number()) %>% # add game id
  unite(p, p1, p2, sep="-") %>% # combine player names
  separate_rows(p, outcome) # separate rows using name and scores
# get summary stats for jon
dat2 %>%
  group_by(game_id) %>% # for each game id
  filter("jon" %in% p) %>% # keep games that jon played
  summarise(pl = p[p != "jon"], # get the name of the other player
            outcome = paste0(outcome[p=="jon"], "-", outcome[p!="jon"])) # combine the scores (jon vs. other)
# # A tibble: 3 x 3
# game_id pl outcome
# <int> <chr> <chr>
# 1 1 joe 1-0
# 2 2 james 0-1
# 3 4 ken 0-1
Assuming you can reshape your original dataset once, at the beginning, you can create a function from the second part:
GetSummaryStats = function(x) {
  dat2 %>%
    group_by(game_id) %>%
    filter(x %in% p) %>%
    summarise(pl = p[p != x],
              outcome = paste0(outcome[p==x], "-", outcome[p!=x]))
}
and call it like this:
GetSummaryStats("jon")
for any player you like.

Duplicated rows when aggregating data in dplyr()

I'm trying to create a set of cross-linguistic data by joining three datasets together in dplyr(). Two of the datasets are 'dictionaries' of sorts - they are word lists that I want to attach to speakers. There are 15 speakers and so a number of repetitions throughout the data, while each word only appears once in each of the dictionaries.
When I join two using left_join(), I get replicated cells. I know I can remove the duplicated cells, but I sense that there must be something simple that I'm doing wrong to create this issue.
Example data is as follows:
French <- c("un", "deux", "trois", "chien")
English <- c("one", "two", "three", "dog")
type <- c("number", "number", "number", "animal")
speaker <- c(1, 1, 1, 4)
df.fr = data.frame(speaker, French)
df.en = data.frame(speaker, English)
df.type = data.frame(English, type)
I want to create a new dataset, new.data, by joining df.en and df.fr by speaker, and then joining that to df.type by English.
Preferably I would use dplyr() to do this. When I do the following, I get duplicated rows:
new.data <- df.fr %>% left_join(df.en)
which generates
speaker French English
1 1 un one
2 1 un two
3 1 un three
4 1 deux one
5 1 deux two
6 1 deux three
7 1 trois one
8 1 trois two
9 1 trois three
10 4 chien dog
When really I just want it to join 'un' to 'one', 'deux' to 'two', etc:
speaker French English type
1 1 un one number
2 1 deux two number
3 1 trois three number
4 4 chien dog animal
Aside from cbinding the three datasets, you can create a unique id for each speaker for both df.fr and df.en and join on speaker + id:
library(dplyr)
df.fr %>%
  group_by(speaker) %>%
  mutate(id = 1:n()) %>%
  left_join(df.en %>% group_by(speaker) %>% mutate(id = 1:n()),
            by = c("speaker", "id")) %>%
  left_join(df.type) %>%
  select(-id)
If you have more than two language datasets, you can also write a more general solution using map and reduce from purrr:
library(purrr)
list(df.fr, df.en) %>%
  map(~ group_by(., speaker) %>% mutate(id = 1:n())) %>%
  reduce(left_join, by = c("speaker", "id")) %>%
  left_join(df.type) %>%
  select(-id)
Result:
# A tibble: 4 x 4
# Groups: speaker [2]
speaker French English type
<dbl> <fctr> <fctr> <fctr>
1 1 un one number
2 1 deux two number
3 1 trois three number
4 4 chien dog animal
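The duplication arises because speaker 1 has three rows in each table, so joining on speaker alone pairs every French row with every English row (3 x 3 = 9). The within-speaker id idea can also be sketched in base R with ave and merge. As in the answer above, this assumes the words are listed in the same order in both tables.

```r
French  <- c("un", "deux", "trois", "chien")
English <- c("one", "two", "three", "dog")
type    <- c("number", "number", "number", "animal")
speaker <- c(1, 1, 1, 4)
df.fr   <- data.frame(speaker, French,  stringsAsFactors = FALSE)
df.en   <- data.frame(speaker, English, stringsAsFactors = FALSE)
df.type <- data.frame(English, type,    stringsAsFactors = FALSE)

# Row position within each speaker: 1, 2, 3 for speaker 1 and 1 for speaker 4
df.fr$id <- ave(seq_along(df.fr$speaker), df.fr$speaker, FUN = seq_along)
df.en$id <- ave(seq_along(df.en$speaker), df.en$speaker, FUN = seq_along)

# Joining on speaker + id pairs rows one-to-one instead of all-to-all
new.data <- merge(df.fr, df.en, by = c("speaker", "id"))
new.data <- merge(new.data, df.type, by = "English")
new.data$id <- NULL
new.data
```

merge sorts its result by the join key, so the row order may differ from the dplyr output even though the pairing is the same.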
