Looping and concatenating based on a condition in R

I'm new to R and still struggling with loops.
I'm trying to create a loop where, based on a condition (variable_4 == 1), it will concatenate the contents of variable_5, separated by commas.
data1 <- data.frame(
  ID = c(123:127),
  agent_1 = c('James', 'Lucas', 'Yousef', 'Kyle', 'Marisa'),
  agent_2 = c('Sophie', 'Danielle', 'Noah', 'Alex', 'Marcus'),
  agent_3 = c('Justine', 'Adrienne', 'Olivia', 'Janice', 'Josephine'),
  Flag_1 = c(1,0,1,0,1),
  Flag_2 = c(0,1,0,0,1),
  Flag_3 = c(1,0,1,0,1)
)
data1$new_var <- ""
for(i in 2:10){
  variable_4 <- paste0("flag_", i)
  variable_5 <- paste0("agent_", i)
  data1 <- data1 %>%
    mutate(!! new_var = case_when(variable_4 == 1, paste(new_var, variable_5, sep = ",")))
}
I created new_var in a previous step because the code was giving me an error that the variable was not found. Ideally, the loop would accumulate the contents of variable_5 only when variable_4 equals 1, and the result would be one big string separated by commas.
The loop should paste into new_var only the names of the agents whose flags equal 1: if Flag_1 == 1, paste the agent's name into new_var, otherwise ignore it; if Flag_2 == 1, concatenate that agent's name into new_var, separated by a comma, otherwise ignore it; and so on.
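To illustrate with the sample data above, the expected new_var would be:
# ID 123: "James,Justine"            (Flag_1 and Flag_3 are 1)
# ID 124: "Danielle"                 (only Flag_2 is 1)
# ID 125: "Yousef,Olivia"
# ID 126: ""                         (no flag is 1)
# ID 127: "Marisa,Marcus,Josephine"  (all flags are 1)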

You shouldn't need a loop for this. The data is in wide format, which makes this harder, but if we convert to long format we can easily find a vectorized solution instead of a loop.
The pivot_longer function is useful here; it requires tidyr version >= 1.0.0.
library(tidyr)
library(dplyr)
pivot_longer(data1,
             cols = -ID,
             names_to = c(".value", "group"),
             names_sep = "_") %>%
  group_by(ID) %>%
  mutate(new_var = paste0(agent[Flag == 1], collapse = ',')) %>%
  pivot_wider(names_from = c("group"),
              values_from = c('agent', 'Flag'),
              names_sep = '_') %>%
  ungroup() %>%
  select(ID, starts_with('agent'), starts_with('Flag'), new_var)
# A tibble: 5 x 8
# ID agent_1 agent_2 agent_3 Flag_1 Flag_2 Flag_3 new_var
# <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#1 123 James Sophie Justine 1 0 1 James,Justine
#2 124 Lucas Danielle Adrienne 0 1 0 Danielle
#3 125 Yousef Noah Olivia 1 0 1 Yousef,Olivia
#4 126 Kyle Alex Janice 0 0 0 ""
#5 127 Marisa Marcus Josephine 1 1 1 Marisa,Marcus,Josephine
Details:
pivot_longer puts our data into a more natural format where each row represents one observation of the variables agent and flag, rather than several:
pivot_longer(data1,
             cols = -ID,
             names_to = c(".value", "group"),
             names_sep = "_")
# A tibble: 15 x 4
# ID group agent Flag
# <int> <chr> <chr> <chr>
# 1 123 1 James 1
# 2 123 2 Sophie 0
# 3 123 3 Justine 1
# 4 124 1 Lucas 0
# 5 124 2 Danielle 1
# 6 124 3 Adrienne 0
# ...
For each ID, we can then paste together the agents which have Flag values of 1. This is easy now that our variables are contained in single columns.
Lastly, we revert to the wide format with pivot_wider, ungroup the data we grouped earlier, and re-order the columns into the desired format.
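For completeness, if you do want to keep a loop, here is a minimal base-R sketch in the spirit of the original attempt; it assumes the column names are exactly those in data1 above (Flag_ with a capital F) and only three agent/flag pairs:
data1$new_var <- ""
for (i in 1:3) {
  flag  <- data1[[paste0("Flag_", i)]]
  agent <- data1[[paste0("agent_", i)]]
  # append the agent name only where the matching flag is 1
  data1$new_var <- ifelse(flag == 1,
                          ifelse(data1$new_var == "", agent,
                                 paste(data1$new_var, agent, sep = ",")),
                          data1$new_var)
}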

There are a few different ways to do this in base R or the tidyverse, or a combination of both. If you stick to the tidyverse, then consider this:
I have used mtcars as the example data frame instead.
#load dplyr or tidyverse
library(tidyverse)
# create data as mtcars
df <- mtcars
# create two new columns, flag and agent, from row numbers
df <- df %>%
  mutate(flag = paste0("flag", row_number())) %>%
  mutate(agent = paste0("agent", row_number()))
# using ifelse in a mutate statement
df2 <- df %>%
  mutate(new_column = ifelse(flag == "flag1", yes = paste0(agent, " this is a new variable"), no = flag))
print(df2)
An ifelse statement might be more appropriate if you have one case, but if you have many then use case_when instead.
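For instance, a sketch of the same idea with case_when() and more than one condition (the extra "flag2" branch here is purely illustrative, not from the original data):
df2 <- df %>%
  mutate(new_column = case_when(
    flag == "flag1" ~ paste0(agent, " this is a new variable"),
    flag == "flag2" ~ paste0(agent, " another illustrative case"),
    TRUE            ~ flag  # fallback when no condition matches
  ))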

Related

Replacing NA values with mode from multiple imputation in R

I ran 5 imputations on a data set with missing values. For my purposes, I want to replace missing values with the mode from the 5 imputations. Let's say I have the following data sets, where df is my original data, ID is a grouping variable to identify each case, and imp is my imputed data:
df <- data.frame(ID = c(1,2,3,4,5),
                 var1 = c(1,NA,3,6,NA),
                 var2 = c(NA,1,2,6,6),
                 var3 = c(NA,2,NA,4,3))
imp <- data.frame(ID = c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4,5,5,5,5,5),
                  var1 = c(1,2,3,3,2,5,4,5,6,6,7,2,3,2,5,6,5,6,6,6,3,1,2,3,2),
                  var2 = c(4,3,2,3,2,4,6,5,4,4,7,2,4,2,3,6,5,6,4,5,3,3,4,3,2),
                  var3 = c(7,6,5,6,6,2,3,2,4,2,5,4,5,3,5,1,2,1,3,2,1,2,1,1,1))
I have a method that works, but it involves a ton of manual coding as I have ~200 variables total (I'm doing this on 3 different data sets with different variables). My code looks like this for one variable:
library(dplyr)
mode <- function(codes){
  which.max(tabulate(codes))
}
var1 <- imp %>% group_by(ID) %>% summarise(var1 = mode(var1))
df3 <- df %>%
  left_join(var1, by = "ID") %>%
  mutate(var1 = coalesce(var1.x, var1.y)) %>%
  select(-var1.x, -var1.y)
Thus, the original value in df is replaced with the mode only if the value was NA.
It is taking forever to keep manually coding this for every variable. I'm hoping there is an easier way of calculating the mode from the imputed data set for each variable by ID and then replacing the NAs with that mode in the original data. I thought maybe I could put the variable names in a vector and somehow iterate through them with one code where i changes to each variable name, but I didn't know where to go with that idea.
x <- colnames(df)
# Attempting to iterate through variables names using i
i = as.factor(x[[2]])
This is where I am stuck. Any help is much appreciated!
Here is one option using the tidyverse. Essentially, we can pivot both dataframes long, then join together and coalesce in one step rather than column by column. The Mode function below is the usual tabulate/match idiom for the statistical mode.
library(tidyverse)
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
imp_long <- imp %>%
  group_by(ID) %>%
  summarise(across(everything(), Mode)) %>%
  pivot_longer(-ID)
df %>%
  pivot_longer(-ID) %>%
  left_join(imp_long, by = c("ID", "name")) %>%
  mutate(var1 = coalesce(value.x, value.y)) %>%
  select(-c(value.x, value.y)) %>%
  pivot_wider(names_from = "name", values_from = "var1")
Output
# A tibble: 5 × 4
ID var1 var2 var3
<dbl> <dbl> <dbl> <dbl>
1 1 1 3 6
2 2 5 1 2
3 3 3 2 5
4 4 6 6 4
5 5 3 6 3
You can use -
library(dplyr)
mode_data <- imp %>%
  group_by(ID) %>%
  summarise(across(starts_with('var'), Mode))
df %>%
  left_join(mode_data, by = 'ID') %>%
  transmute(ID,
            across(matches('\\.x$'),
                   function(x) coalesce(x, .[[sub('x$', 'y', cur_column())]]),
                   .names = '{sub(".x$", "", .col)}'))
# ID var1 var2 var3
#1 1 1 3 6
#2 2 5 1 2
#3 3 3 2 5
#4 4 6 6 4
#5 5 3 6 3
mode_data has the Mode value for each of the var columns.
Join df and mode_data by ID.
Since every pair of columns is named name.x and name.y, we can take each name.x column and replace the x with y to get the corresponding column (.[[sub('x$', 'y', cur_column())]]).
Use coalesce to select the non-NA value in each pair.
Change the column name by removing .x from it ({sub(".x$", "", .col)}), so var1.x becomes just var1.
where Mode is the same tabulate/match idiom as above:
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
Another option: compute the per-ID modes, bind the original rows underneath, and let coalesce prefer the original (last) value over the mode (first).
library(dplyr, warn.conflicts = FALSE)
imp %>%
  group_by(ID) %>%
  summarise(across(everything(), Mode)) %>%
  bind_rows(df) %>%
  group_by(ID) %>%
  summarise(across(everything(), ~ coalesce(last(.x), first(.x))))
#> # A tibble: 5 × 4
#> ID var1 var2 var3
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 3 6
#> 2 2 5 1 2
#> 3 3 3 2 5
#> 4 4 6 6 4
#> 5 5 3 6 3
Created on 2022-01-03 by the reprex package (v2.0.1)
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
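Side note: the loop idea sketched in the question can also be completed directly with a plain for-loop over the column names, reusing the question's mode() helper (a rough sketch only; it assumes dplyr is loaded):
library(dplyr)
vars <- setdiff(colnames(df), "ID")
for (i in vars) {
  # per-ID mode of the imputed values for this variable
  m <- imp %>% group_by(ID) %>% summarise(mode_val = mode(.data[[i]]))
  df <- df %>% left_join(m, by = "ID")
  # keep the original value where present, otherwise use the mode
  df[[i]] <- coalesce(df[[i]], df$mode_val)
  df$mode_val <- NULL
}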

Mutate column based on list of lists in R

I have a dataframe that I want to gather so that it is in tall format, and then mutate on another column with values based on membership of a string from another column in a list of lists. For example, I have the following data frame and list of lists:
dummy_data <- data.frame("id" = 1:20,"test1_10" = sample(1:100, 20),"test2_11" = sample(1:100, 20),
"test3_12" = sample(1:100, 20),"check1_20" = sample(1:100, 20),
"check2_21" = sample(1:100, 20),"sound1_30" = sample(1:100, 20),
"sound2_31" = sample(1:100, 20),"sound3_32" = sample(1:100, 20))
dummylist <- list(c('test1_','test2_','test3_'),c('check1_','check2_'),c('sound1_','sound2_','sound3_'))
names(dummylist) <- c('shipments','arrivals','departures')
And then I gather the data frame like so:
dummy_data <- dummy_data %>%
  gather("part", "number", 2:ncol(.))
What I want to do is add a column that has the name of the list found in dummylist where the string before the underscore in the part column is a member. And I can do that like this:
dummy_data <- dummy_data %>%
  mutate(Group = case_when(
    str_extract(part, '.*_') %in% dummylist[[1]] ~ names(dummylist[1]),
    str_extract(part, '.*_') %in% dummylist[[2]] ~ names(dummylist[2]),
    str_extract(part, '.*_') %in% dummylist[[3]] ~ names(dummylist[3])
  ))
However, this requires a separate str_extract line for each list/group within dummylist, and my real data has far more than 3 lists/groups. So I'm wondering if there is a more efficient way to do this mutate step and get the list names in.
Any help is much appreciated, thanks!
It may be easier with a regex_left_join after converting dummylist to a two-column dataset.
library(fuzzyjoin)
library(dplyr)
library(tidyr)
library(tibble)
dummy_data %>%
  # // reshape to long format - pivot_longer instead of gather
  pivot_longer(cols = -id, names_to = 'part', values_to = 'number') %>%
  # // join with the tibble/data.frame converted dummylist
  regex_left_join(dummylist %>%
                    enframe(name = 'Group', value = 'part') %>%
                    unnest(part)) %>%
  rename(part = part.x) %>%
  select(-part.y)
-output
# A tibble: 160 × 4
id part number Group
<int> <chr> <int> <chr>
1 1 test1_10 72 shipments
2 1 test2_11 62 shipments
3 1 test3_12 17 shipments
4 1 check1_20 89 arrivals
5 1 check2_21 54 arrivals
6 1 sound1_30 39 departures
7 1 sound2_31 94 departures
8 1 sound3_32 95 departures
9 2 test1_10 77 shipments
10 2 test2_11 4 shipments
# … with 150 more rows
If you prepare your lookup table beforehand, you don't need any extra libraries beyond dplyr and tidyr:
lookup <- sapply(
  names(dummylist),
  \(nm) { setNames(rep(nm, length(dummylist[[nm]])), dummylist[[nm]]) }
) |>
  setNames(nm = NULL) |>
  unlist()
lookup
# test1_ test2_ test3_ check1_ check2_ sound1_ sound2_ sound3_
# "shipments" "shipments" "shipments" "arrivals" "arrivals" "departures" "departures" "departures"
Now you can just gsub on the fly, translating your parts within the usual mutate() verb:
dummy_data |>
  pivot_longer(-id, names_to = 'part', values_to = 'number') |>
  mutate(group = lookup[gsub('^(\\w+_).*$', '\\1', part)])
# # A tibble: 160 × 4
# id part number group
# <int> <chr> <int> <chr>
# 1 1 test1_10 91 shipments
# 2 1 test2_11 74 shipments
# 3 1 test3_12 46 shipments
# 4 1 check1_20 62 arrivals
# 5 1 check2_21 7 arrivals
# 6 1 sound1_30 35 departures
# 7 1 sound2_31 23 departures
# 8 1 sound3_32 84 departures
# 9 2 test1_10 59 shipments
# 10 2 test2_11 73 shipments
# # … with 150 more rows

How to use slice in dplyr to keep the rows with NA values in R

I have the following dataset, and I want to know the minimum word for each group; if there is no minimum (it is NA), I still want to display it:
df = data.frame(
  key = c("A","A","B","B","C"),
  word = c(1,2,3,5,NA))
df %>% group_by(key) %>% slice(which.min(word))
This excludes key=C, word=NA which I would want:
df_out = data.frame(
  key = c("A","B","C"),
  word = c(1,3,NA))
We can create a logical condition with is.na in filter and return the NA rows as well after grouping by 'key':
library(dplyr)
df %>%
  group_by(key) %>%
  filter(word == min(word) | is.na(word))
Or using slice; we don't need any if/else condition:
df %>%
  group_by(key) %>%
  slice(which(word == min(word) | is.na(word)))
# A tibble: 3 x 2
# Groups: key [3]
# key word
# <chr> <dbl>
#1 A 1
#2 B 3
#3 C NA
Or more compactly
df %>%
  group_by(key) %>%
  slice(match(min(word), word))
# A tibble: 3 x 2
# Groups: key [3]
# key word
# <chr> <dbl>
#1 A 1
#2 B 3
#3 C NA
NOTE: Using match returns the index of the first match.
which.min removes the NA
which.min(c(NA, 1, 3))
#[1] 2
We can check the condition with if: if all word values in a group are NA, we return the first row; otherwise we return the row with the minimum.
library(dplyr)
df %>%
  group_by(key) %>%
  slice(if(all(is.na(word))) 1L else which.min(word))
# key word
# <chr> <dbl>
#1 A 1
#2 B 3
#3 C NA
Another option is to arrange the data by word and select the 1st row in each group.
df %>% arrange(key, word) %>% group_by(key) %>% slice(1L)
You can create a modified slice function using the tidyverse package which returns NAs:
slice_uneven = function(.data, .idx) {
  .data_ = .data %>% add_row()                          # Add an extra row
  .idx_ = .idx %>% c(NA) %>% replace_na(nrow(.data_))   # Replace NA with the index of the extra row
  .data_[.idx_, ] %>% head(-1) %>% remove_rownames() %>% return()  # Subset, remove the extra row, and reset rownames before returning the data
}
slice_uneven(cars, c(1, 2, 3, NA, NA, 3, 2))
You can also arrange by word and use distinct from dplyr to get the desired output.
library(dplyr)
df %>%
  arrange(word) %>%
  distinct(key, .keep_all = TRUE)
# key word
#1 A 1
#2 B 3
#3 C NA

Rename a dataframe Column with text from within the column itself

Given a (simplified) dataframe with format
df <- data.frame(a = c(1,2,3,4),
b = c(4,3,2,1),
temp1 = c("-","-","-","foo: 3"),
temp2 = c("-","bar: 10","-","bar: 4")
)
a b temp1 temp2
1 4 - -
2 3 - bar: 10
3 2 - -
4 1 foo: 3 bar: 4
I need to rename all temp columns with the names contained within the column, My end goal is to end up with this:
a b foo bar
1 4 - -
2 3 - 10
3 2 - -
4 1 3 4
The df column names and the data contained within them will be unknown; however, the columns that need changing will contain temp, and the delimiter will always be a ":".
As such I can easily remove the name from within the columns using dplyr like this:
df <- df %>%
  mutate_at(vars(contains("temp")), ~(substr(., str_locate(., ":") + 1, str_length(.))))
but first I need to rename the columns based on some function that scans the column and returns the value(s) within it, i.e.
rename_at(vars(contains("temp")), ~(...some function.....))
As per the example given there's no guarantee that specific rows will have data so I can't simply grab value from row 1
Any ideas welcome.
Thanks in advance
One possibility involving dplyr and tidyr could be:
df %>%
  pivot_longer(names_to = "variables", values_to = "values", -c(a:b)) %>%
  mutate(values = replace(values, values == "-", NA_character_)) %>%
  separate(values, into = c("variables2", "values"), sep = ": ") %>%
  group_by(variables) %>%
  fill(variables2, .direction = "downup") %>%
  ungroup() %>%
  select(-variables) %>%
  pivot_wider(names_from = "variables2", values_from = "values")
a b foo bar
<dbl> <dbl> <chr> <chr>
1 1 4 <NA> <NA>
2 2 3 <NA> 10
3 3 2 <NA> <NA>
4 4 1 3 4
If you want to further replace the NAs with -:
df %>%
  pivot_longer(names_to = "variables", values_to = "values", -c(a:b)) %>%
  mutate(values = replace(values, values == "-", NA_character_)) %>%
  separate(values, into = c("variables2", "values"), sep = ": ") %>%
  group_by(variables) %>%
  fill(variables2, .direction = "downup") %>%
  ungroup() %>%
  select(-variables) %>%
  pivot_wider(names_from = "variables2", values_from = "values") %>%
  mutate_at(vars(-a, -b), ~ replace_na(., "-"))
a b foo bar
<dbl> <dbl> <chr> <chr>
1 1 4 - -
2 2 3 - 10
3 3 2 - -
4 4 1 3 4
This will do the job:
colnames(df)[which(grepl("temp", colnames(df)))] <- unique(unlist(
  sapply(df[, grepl("temp", colnames(df))],
         function(x) {
           gsub("[:].*", "", grep("\\w+", x, value = TRUE))
         })
))
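A rename_with() variant, closer to the rename_at(...some function...) shape asked about in the question, could also be sketched; this assumes dplyr >= 1.0 and that every temp column has at least one non "-" entry, and the helper name_from_col is just illustrative:
library(dplyr)
library(stringr)
# helper: take the first non "-" entry in a column and keep only the text before ":"
name_from_col <- function(col) str_remove(col[col != "-"][1], ":.*")
df %>%
  rename_with(~ sapply(df[.x], name_from_col), .cols = contains("temp"))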

How to pass a variable name to dplyr's group_by()

I can calculate the rank of the values (val) in my dataframe df within the group name1 with the code:
res <- df %>% arrange(val) %>% group_by(name1) %>% mutate(RANK=row_number())
Instead of writing the column name "name1" in the code, I want to pass it as a variable, e.g. crit1 = "name1". However, the code below does not work since crit1 is assumed to be the column name instead of a variable name.
res <- df %>% arrange(val) %>% group_by(crit1) %>% mutate(RANK=row_number())
How can I pass crit1 in the code?
Thanks.
We can use group_by_
library(dplyr)
df %>%
  arrange(val) %>%
  group_by_(.dots = crit1) %>%
  mutate(RANK = row_number())
#Source: local data frame [10 x 4]
#Groups: name1, name2 [7]
# val name1 name2 RANK
# <dbl> <chr> <chr> <int>
#1 -0.848370044 b c 1
#2 -0.583627199 a a 1
#3 -0.545880758 a a 2
#4 -0.466495124 b b 1
#5 0.002311942 a c 1
#6 0.266021979 c a 1
#7 0.419623149 c b 1
#8 0.444585270 a c 2
#9 0.536585304 b a 1
#10 0.847460017 a c 3
Update
group_by_ is deprecated in recent versions (as of dplyr 0.8.1), so we can use group_by_at, which takes a vector of strings as input variables:
df %>%
  arrange(val) %>%
  group_by_at(crit1) %>%
  mutate(RANK = row_number())
Or another option is to convert to symbols (syms from rlang) and evaluate (!!!)
df %>%
  arrange(val) %>%
  group_by(!!! rlang::syms(crit1)) %>%
  mutate(RANK = row_number())
data
set.seed(24)
df <- data.frame(val = rnorm(10),
                 name1 = sample(letters[1:3], 10, replace = TRUE),
                 name2 = sample(letters[1:3], 10, replace = TRUE),
                 stringsAsFactors = FALSE)
crit1 <- c("name1", "name2")
Update with dplyr 1.0.0
The new across syntax eliminates the need for !!! rlang::syms(). So you can now simplify the code by:
df %>%
  arrange(val) %>%
  group_by(across(all_of(crit1))) %>%
  mutate(RANK = row_number())
Facing a similar task, I could successfully work with these two options.
Use across():
for (crit in names(df)) {
  print(df |>
          # all_of() is not needed here
          group_by(across(crit)) |>
          count())
}
Use syms() and !!:
crits = syms(names(df))
for (crit in crits) {
  print(df |>
          # the use of !! instead of !!! is now encouraged
          group_by(!!crit) |>
          count())
}
