Combining two variables to create new variable - r

I would like to combine two variables that have only one answer each into a single variable that has both answers.
Example
IPV_YES only has answers that are 1
IPV_NO only has answers that are 2
I would like to combine them into a single variable named IPV that holds the 1 and 2 values from both individual columns.
I have tried using the ifelse command, but it only shows me the values of IPV_YES.
Dataset I have
My desired outcome

my answer
library(dplyr)
df %>% mutate(across(everything(), ~ifelse(. == "", NA, as.numeric(.)))) %>%
  group_by(ID) %>%
  rowwise() %>%
  transmute(IPV = sum(c_across(everything()), na.rm = TRUE))
# A tibble: 4 x 2
# Rowwise: ID
ID IPV
<dbl> <dbl>
1 1 1
2 2 2
3 3 1
4 4 2
data
df <- data.frame(ID = 1:4, IPV_YES = c(1,"",1,""), IPV_NO = c("",2,"",2))

We can use coalesce after converting the '' to NA
library(dplyr)
df <- df %>%
transmute(ID, IPV = coalesce(na_if(IPV_YES, ""), na_if(IPV_NO, ""))) %>%
type.convert(as.is = TRUE)
data
df <- data.frame(ID = 1:4, IPV_YES = c(1,"",1,""), IPV_NO = c("",2,"",2))
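To see what coalesce() does in isolation, here is a minimal sketch on two plain vectors. The names yes and no are made up for illustration, and the sketch assumes the blanks are empty strings, as in the data above.

```r
library(dplyr)

yes <- c("1", "", "1", "")
no  <- c("", "2", "", "2")

# na_if() turns the empty strings into NA; coalesce() then takes, position by
# position, the first value that is not NA.
combined <- coalesce(na_if(yes, ""), na_if(no, ""))
combined
#> [1] "1" "2" "1" "2"

# The values are still character; convert explicitly if numbers are needed.
as.integer(combined)
#> [1] 1 2 1 2
```

Because coalesce() works position by position, both inputs keep their full length, so no recycling is involved.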

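df$IPV <- ifelse(df$IPV_YES != "", df$IPV_YES, df$IPV_NO)
Here, we specify an ifelse statement; it can be glossed thus: if the value in df$IPV_YES is not blank, then give the value in df$IPV_YES, else give the value from the same row of df$IPV_NO. Both branches must keep the full column length: subsetting IPV_NO to its non-blank entries would shorten the vector, and ifelse would then recycle it, which gives the right answer only when all non-blank values happen to be identical. The result is a character column; wrap it in as.numeric() if you need numbers.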
If you want to remove the IPV_* columns:
df[,2:3] <- NULL
Result:
df
ID IPV
1 1 1
2 2 2
3 3 1
4 4 2
Data:
df <- data.frame(ID = 1:4, IPV_YES = c(1,"",1,""), IPV_NO = c("",2,"",2))

Maybe you can try the code below
library(dplyr)
replace(df, df == "", NA) %>%
  mutate(IPV = coalesce(IPV_YES, IPV_NO)) %>%
  select(ID, IPV) %>%
  type.convert(as.is = TRUE)
which gives
ID IPV
1 1 1
2 2 2
3 3 1
4 4 2

Related

R: conditionally mutate a variable when columns match in different dataframes

I am attempting to write some R code that assesses whether or not two dataframes have any matches in their columns. If there are matches, one of the columns in the second dataframe should assign a "link" (via the links variable) to the first dataframe using the id column of the first dataframe.
In the event that there are multiple matches, I am trying to get the "link" variable to randomly select one of the matching id's.
Some reproducible code:
library(dplyr)
df1 = data.frame(ids = c(1:5),
var = c("a","a","c","b","b"))
df2 = data.frame(var = c('c','a','b','b','d'),
links = 0)
Ideally, I would like a resulting dataframe that looks like:
var links
1 c 3
2 a 1 or 2
3 b 4 or 5
4 b 4 or 5
5 d 0
where observations in the links column randomly select ids from df1 when df1$var matches df2$var. In the dataframe above, this is denoted by "or".
Note 1: The links column should be a numeric, I only made it character to allow to write the word "or".
Note 2: If there is not a match between df1$var and df2$var, the links column should remain a 0.
So far, I've gone this route, but I'm unsure about what to put after the ~:
linked_df = df2 %>%
  mutate(links = case_when(links == 0 & var %in% df1$var ~
                             sample(c(df1$ids), n(), replace = TRUE), # unsure about this line
                           TRUE ~ links))
I think this is what you want. I've left the ids column in the result, but
it can be removed when the sampling is complete.
library(dplyr)
library(tidyr)
df1_nest = df1 %>%
group_by(var) %>%
summarize(ids = list(ids))
safe_sample = function(x, ...) {
if(length(x) == 1) return(x)
sample(x, ...)
}
set.seed(47)
df2 %>%
left_join(df1_nest) %>%
mutate(
links = sapply(ids, \(x) if(is.null(x)) 0L else safe_sample(x, size = 1))
)
# Joining, by = "var"
# var links ids
# 1 c 3 3
# 2 a 1 1, 2
# 3 b 4 4, 5
# 4 b 5 4, 5
# 5 d 0 NULL
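The safe_sample() wrapper above guards against a documented quirk of base R's sample(): when x is a single number, sample(x, ...) draws from 1:x rather than returning x. A minimal base R sketch of the difference (the vectors naive and safe are names chosen here for illustration):

```r
set.seed(47)

# With a single number n, sample(n, ...) draws from 1:n, not from the value n.
naive <- replicate(200, sample(3, size = 1))
all(naive %in% 1:3)   # every draw lies in 1:3, so values other than 3 can appear

safe_sample <- function(x, ...) {
  if (length(x) == 1) return(x)  # a single candidate is returned unchanged
  sample(x, ...)
}

safe <- replicate(200, safe_sample(3, size = 1))
all(safe == 3)
#> [1] TRUE

# With more than one candidate the wrapper behaves like sample().
pick <- safe_sample(c(4, 5), size = 1)
```

This matters in the example data because var "c" matches exactly one id (3), so an unguarded sample() could return 1 or 2 for it.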
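Something like this could do the trick: map over df2$var, look up the matching ids in the first dataframe, then keep one randomly chosen id per row. Sampling with slice_sample() after unnesting avoids calling sample() on a length-one vector, which would draw from 1:x instead of returning x.
library(dplyr)
library(tidyr)
library(purrr)
df2 %>%
  as_tibble() %>%
  mutate(links = map(var, ~ filter(df1, var == .x)$ids),
         index = row_number()) %>%
  unnest(links, keep_empty = TRUE) %>%
  group_by(index) %>%
  slice_sample(n = 1) %>%
  ungroup() %>%
  mutate(links = coalesce(links, 0L)) %>% # no match: keep 0
  select(-index)
# One possible result (the draws for a and b are random):
# # A tibble: 5 × 2
# var links
# <chr> <int>
# 1 c 3
# 2 a 1
# 3 b 4
# 4 b 5
# 5 d 0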

Summarising data from when groups are not the same

I have the following dataframe:
df <- data.frame(
ID = c(1,1,1,1,1,1,2,2,2,2,2,2),
group = c("S_1","G_1","G_2","G_3","M_1","M_2","G_1","G_2","S_1","S_2","M_1","M_2"),
CODE = c(0,1,0,0,1,1,0,1,0,0,1,1)
)
ID group CODE
1 1 S_1 0
2 1 G_1 1
3 1 G_2 0
4 1 G_3 0
5 1 M_1 1
6 1 M_2 1
7 2 G_1 0
8 2 G_2 1
9 2 S_1 0
10 2 S_2 0
11 2 M_1 1
12 2 M_2 1
I would like to summarize the CODE column such that for each ID, I end up with one row:
ID CODE
1 1 100,11,0
2 2 01,11,00
For ID==1, I would like to paste G_1, G_2, G_3 without a delimiter (in numeric order). The same goes for M_1 and M_2, and then S_1. Lastly, I would like to combine the summarized G, M, and S into one row, separated by commas (in alphabetic order).
I could potentially remove the numbers and do group_by(group) %>% summarise(CODE=paste(CODE, collapse="")) for the first step. Though I would like the final string to be in alphabetic order.
We can use tidyr::separate to get data in group in different columns based on delimiter (_) and then summarise first by ID and group1 and then by ID to get one string for each ID.
library(dplyr)
df %>%
arrange(ID,group) %>%
tidyr::separate(group, into = c('group1', 'group2'), sep = "_") %>%
group_by(ID, group1) %>%
summarise(CODE = paste(CODE, collapse = "")) %>%
summarise(CODE = toString(CODE))
# A tibble: 2 x 2
# ID CODE
# <dbl> <chr>
#1 1 100, 11, 0
#2 2 01, 11, 00
Without using separate, we can remove everything after "_" and use it as group.
df %>%
arrange(ID,group) %>%
mutate(group = sub('_.*', '', group)) %>%
group_by(ID, group) %>%
summarise(CODE = paste(CODE, collapse = "")) %>%
summarise(CODE = toString(CODE))
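Two details are worth checking against the desired output: arrange(ID, group) sorts the labels as text, so a label such as G_10 would sort before G_2, and toString() joins with ", " rather than ",". The following is a minimal sketch of a variant that sorts the suffix numerically and joins with a bare comma. The data frame toy is hypothetical, and the sketch assumes the part after "_" is always an integer.

```r
library(dplyr)
library(tidyr)

# Hypothetical data with a two-digit suffix, to show the ordering issue.
toy <- data.frame(
  ID    = c(1, 1, 1, 1),
  group = c("G_10", "G_2", "G_1", "S_1"),
  CODE  = c(1, 0, 1, 0)
)

res <- toy %>%
  # convert = TRUE turns the suffix into an integer so it sorts numerically
  separate(group, into = c("prefix", "index"), sep = "_", convert = TRUE) %>%
  arrange(ID, prefix, index) %>%
  group_by(ID, prefix) %>%
  summarise(CODE = paste(CODE, collapse = ""), .groups = "drop_last") %>%
  summarise(CODE = paste(CODE, collapse = ","))

res$CODE
#> [1] "101,0"
```

With the question's data, where every suffix is a single digit, text sorting and numeric sorting give the same order.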
Base R solution:
# Order the dataframe and genericise the group vector:
ordered_df <- within(df[with(df, order(ID, group)), ], {
group <- gsub("_.*", "", group)
}
)
# Summarise the dataframe:
aggregate(CODE~ID, do.call("rbind", lapply(split(ordered_df, paste0(ordered_df$ID, ordered_df$group)),
function(x){
data.frame(ID = unique(x$ID), CODE = paste0(x$CODE, collapse = ""))
}
)
), paste, collapse = ",")

map over columns and apply custom function

Missing something small here and struggling to pass columns to function. I just want to map (or lapply) over columns and perform a custom function on each of the columns. Minimal example here:
library(tidyverse)
set.seed(10)
df <- data.frame(id = c(1,1,1,2,3,3,3,3),
r_r1 = sample(c(0,1), 8, replace = T),
r_r2 = sample(c(0,1), 8, replace = T),
r_r3 = sample(c(0,1), 8, replace = T))
df
# id r_r1 r_r2 r_r3
# 1 1 0 0 1
# 2 1 0 0 1
# 3 1 1 0 1
# 4 2 1 1 0
# 5 3 1 0 0
# 6 3 0 0 1
# 7 3 1 1 1
# 8 3 1 0 0
A function that just filters and counts the unique ids remaining in the dataset:
cnt_un <- function(var) {
df %>%
filter({{var}} == 1) %>%
group_by({{var}}) %>%
summarise(n_uniq = n_distinct(id)) %>%
ungroup()
}
It works outside of map:
cnt_un(r_r1)
# A tibble: 1 x 2
r_r1 n_uniq
<dbl> <int>
1 1 3
I want to apply the function over all r_r columns to get something like:
df2
# y n_uniq
# 1 r_r1 3
# 2 r_r2 2
# 3 r_r3 2
I thought the following would work, but it doesn't:
map(dplyr::select(df, matches("r_r")), ~ cnt_un(.x))
Any suggestions? Thanks.
I'm not sure if there's a direct tidyeval way to do this with something like map. The issue you're running into is that in calling map(df, *whatever_function*), the function is being called on each column of df as a vector, whereas your function expects a bare column name in the tidyeval style. To verify that:
map(df, class)
will return "numeric" for each column.
An alternative is to iterate over column names as strings, and convert those to symbols; this takes just one additional line in the function.
library(dplyr)
library(tidyr)
library(purrr)
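cnt_un_name <- function(varname) {
  # sym() converts the column name (a string) to a symbol; ensym() would
  # instead capture the expression that map() passes in, not the string itself
  var <- sym(varname)
  df %>%
    filter((!!var) == 1) %>%
    group_by(!!var) %>%
    summarise(n_uniq = n_distinct(id)) %>%
    ungroup()
}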
Calling the function is a little awkward because it keeps only the relevant column names (calling on "r_r1" gets columns "r_r1" and "n_uniq", etc). One way is to get the vector of column names you want, name it so you can add an ID column in map_dfr, and drop the extra columns, since they'll be mostly NA.
grep("^r_r\\d+", names(df), value = TRUE) %>%
set_names() %>%
map_dfr(cnt_un_name, .id = "y") %>%
select(y, n_uniq)
#> # A tibble: 3 x 2
#> y n_uniq
#> <chr> <int>
#> 1 r_r1 3
#> 2 r_r2 2
#> 3 r_r3 2
A better way is to call the function, then bind after reshaping.
grep("^r_r\\d+", names(df), value = TRUE) %>%
map(cnt_un_name) %>%
map_dfr(pivot_longer, 1, names_to = "y") %>%
select(y, n_uniq)
# same output as above
Alternatively (and maybe better and more scalable), do the column renaming inside the function definition.
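One way the idea of handling the column name inside the function could look is sketched below. The helper cnt_un_labelled and its data argument are hypothetical names introduced here, not part of the original answer; the sketch uses the .data pronoun to look up a column by its string name.

```r
library(dplyr)
library(purrr)

# The data frame as printed in the question, typed out so the result does not
# depend on the random number generator.
df <- data.frame(
  id   = c(1, 1, 1, 2, 3, 3, 3, 3),
  r_r1 = c(0, 0, 1, 1, 1, 0, 1, 1),
  r_r2 = c(0, 0, 0, 1, 0, 0, 1, 0),
  r_r3 = c(1, 1, 1, 0, 0, 1, 1, 0)
)

# Hypothetical helper: takes the column name as a string and returns a
# one-row data frame that already carries that name in column y.
cnt_un_labelled <- function(varname, data) {
  data %>%
    filter(.data[[varname]] == 1) %>%
    summarise(y = varname, n_uniq = n_distinct(id))
}

cols <- grep("^r_r\\d+$", names(df), value = TRUE)
result <- bind_rows(map(cols, cnt_un_labelled, data = df))
result
#>      y n_uniq
#> 1 r_r1      3
#> 2 r_r2      2
#> 3 r_r3      2
```

Since every call returns the same two columns, the results bind cleanly and no extra columns need to be dropped afterwards.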
Here's a base R solution that uses lapply. The tricky bit is that your function isn't actually running on single columns; it's using id, too, so you can't use canned functions that iterate column-wise.
do.call(rbind, lapply(grep("r_r", colnames(df), value = TRUE), function(i) {
X <- subset(df, df[,i] == 1)
row <- data.frame(y = i, n_uniq = length(unique(X$id)), stringsAsFactors = FALSE)
}))
y n_uniq
1 r_r1 2
2 r_r2 3
3 r_r3 2
Here is another solution. I changed the syntax of your function. Now you supply the pattern of the columns you want to select.
cnt_un <- function(var_pattern) {
df %>%
pivot_longer(cols = contains(var_pattern), values_to = "vals", names_to = "y") %>%
filter(vals == 1) %>%
group_by(y) %>%
summarise(n_uniq = n_distinct(id)) %>%
ungroup()
}
cnt_un("r_r")
#> # A tibble: 3 x 2
#> y n_uniq
#> <chr> <int>
#> 1 r_r1 2
#> 2 r_r2 3
#> 3 r_r3 2

Rename a dataframe Column with text from within the column itself

Given a (simplified) dataframe with format
df <- data.frame(a = c(1,2,3,4),
b = c(4,3,2,1),
temp1 = c("-","-","-","foo: 3"),
temp2 = c("-","bar: 10","-","bar: 4")
)
a b temp1 temp2
1 4 - -
2 3 - bar: 10
3 2 - -
4 1 foo: 3 bar: 4
I need to rename all temp columns with the names contained within the columns themselves. My end goal is to end up with this:
a b foo bar
1 4 - -
2 3 - 10
3 2 - -
4 1 3 4
The df column names and the data contained within them will be unknown; however, the columns that need changing will contain temp and the delimiter will always be a ":".
As such I can easily remove the name from within the columns using dplyr like this:
df <- df %>%
mutate_at(vars(contains("temp")), ~(substr(., str_locate(., ":")+1,str_length(.))))
but first I need to rename the columns based on some function method, that scans the column and returns the value(s) within it, ie.
rename_at(vars(contains("temp")), ~(...some function.....))
As per the example given there's no guarantee that specific rows will have data so I can't simply grab value from row 1
Any ideas welcome.
Thanks in advance
One possibility involving dplyr and tidyr could be:
df %>%
pivot_longer(names_to = "variables", values_to = "values", -c(a:b)) %>%
mutate(values = replace(values, values == "-", NA_character_)) %>%
separate(values, into = c("variables2", "values"), sep = ": ") %>%
group_by(variables) %>%
fill(variables2, .direction = "downup") %>%
ungroup() %>%
select(-variables) %>%
pivot_wider(names_from = "variables2", values_from = "values")
a b foo bar
<dbl> <dbl> <chr> <chr>
1 1 4 <NA> <NA>
2 2 3 <NA> 10
3 3 2 <NA> <NA>
4 4 1 3 4
If you want to further replace the NAs with -:
df %>%
pivot_longer(names_to = "variables", values_to = "values", -c(a:b)) %>%
mutate(values = replace(values, values == "-", NA_character_)) %>%
separate(values, into = c("variables2", "values"), sep = ": ") %>%
group_by(variables) %>%
fill(variables2, .direction = "downup") %>%
ungroup() %>%
select(-variables) %>%
pivot_wider(names_from = "variables2", values_from = "values") %>%
mutate_at(vars(-a, -b), ~ replace_na(., "-"))
a b foo bar
<dbl> <dbl> <chr> <chr>
1 1 4 - -
2 2 3 - 10
3 3 2 - -
4 4 1 3 4
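This will do the job of renaming (it assumes each temp column contains exactly one distinct name, and it leaves the values themselves untouched):
temp_cols <- grepl("temp", colnames(df))
colnames(df)[temp_cols] <- unique(unlist(sapply(df[, temp_cols, drop = FALSE], function(x) {
  # keep only entries that contain text, then drop everything from the colon on
  gsub("[:].*", "", grep("\\w+", x, value = TRUE))
})))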
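For completeness, here is a minimal base R sketch that renames the temp columns and strips the "name: " prefix from the values in one pass. It assumes, as stated in the question, that the delimiter is always ":" and that each temp column contains exactly one distinct name; the loop stops with an error otherwise.

```r
df <- data.frame(a = c(1, 2, 3, 4),
                 b = c(4, 3, 2, 1),
                 temp1 = c("-", "-", "-", "foo: 3"),
                 temp2 = c("-", "bar: 10", "-", "bar: 4"),
                 stringsAsFactors = FALSE)

temp_cols <- grep("temp", names(df), value = TRUE)

for (col in temp_cols) {
  values <- df[[col]]
  has_label <- grepl(":", values, fixed = TRUE)

  # The text before the colon is the new column name.
  label <- unique(sub(":.*", "", values[has_label]))
  stopifnot(length(label) == 1)

  # Keep only the text after the colon, without surrounding whitespace.
  df[[col]][has_label] <- trimws(sub("^[^:]*:", "", values[has_label]))
  names(df)[names(df) == col] <- label
}

names(df)
#> [1] "a"   "b"   "foo" "bar"
df$bar
#> [1] "-"  "10" "-"  "4"
```

Rows without a colon (the "-" placeholders) are left as they are, which matches the desired output in the question.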

how to count repetitions of first occurring value with dplyr

I have a dataframe with groups that essentially looks like this
DF <- data.frame(state = c(rep("A", 3), rep("B",2), rep("A",2)))
DF
state
1 A
2 A
3 A
4 B
5 B
6 A
7 A
My question is how to count the number of consecutive rows where the first value is repeated in its first "block". So for DF above, the result should be 3. The first value can appear any number of times, with other values in between, or it may be the only value appearing.
The following naive attempt fails in general, as it counts all occurrences of the first value.
DF %>% mutate(is_first = as.integer(state == first(state))) %>%
summarize(count = sum(is_first))
The result in this case is 5. So, hints on a (preferably) dplyr solution to this would be appreciated.
You can try:
rle(as.character(DF$state))$lengths[1]
[1] 3
In your dplyr chain that would just be:
DF %>% summarize(count_first = rle(as.character(state))$lengths[1])
# count_first
# 1 3
Or to be overzealous with piping, using dplyr and magrittr:
library(dplyr)
library(magrittr)
DF %>% summarize(count_first = state %>%
as.character %>%
rle %$%
lengths %>%
first)
# count_first
# 1 3
Works also for grouped data:
DF <- data.frame(group = c(rep(1,4),rep(2,3)),state = c(rep("A", 3), rep("B",2), rep("A",2)))
# group state
# 1 1 A
# 2 1 A
# 3 1 A
# 4 1 B
# 5 2 B
# 6 2 A
# 7 2 A
DF %>% group_by(group) %>% summarize(count_first = rle(as.character(state))$lengths[1])
# # A tibble: 2 x 2
# group count_first
# <dbl> <int>
# 1 1 3
# 2 2 1
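To make clear why indexing $lengths[1] answers the question, here is a small base R sketch of what rle() returns for the example vector (the variable names state and runs are chosen here for illustration):

```r
state <- c("A", "A", "A", "B", "B", "A", "A")

runs <- rle(state)
runs$lengths   # length of each consecutive block
#> [1] 3 2 2
runs$values    # the value repeated in each block
#> [1] "A" "B" "A"

# The first element of lengths is the size of the first block.
runs$lengths[1]
#> [1] 3

# If the first value is the only one present, there is a single run.
only_a <- rle(c("A", "A"))$lengths[1]
only_a
#> [1] 2
```

The later block of "A" values gets its own entry in lengths, so it never adds to the count for the first block.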
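There is no need for dplyr here, but you can adapt this example to use it with dplyr. The key is the function rle:
state = c(rep("A", 3), rep("B",2), rep("A",2))
x = rle(state)
DF = data.frame(len = x$lengths, state = x$values)
DF
# length of the first run, i.e. how often the first value repeats in its first block
DF$len[1]
# note: max(DF[DF$state == "A",]$len) gives the longest run of "A" instead,
# which equals the first run only by coincidence in this example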

Resources