Updating Values in Old Data with New Data - R

I have the following data frame:
library(dplyr)
old_data = data.frame(id = c(1,2,3), var1 = c(11,12,13))
> old_data
id var1
1 1 11
2 2 12
3 3 13
I want to replace the values in the 2nd row of "old_data" with data from "new_data" (i.e. the rows in "old_data" where the id variable matches):
new_data = data.frame(id = c(4,2,5), var1 = c(11,15,13))
> new_data
id var1
1 4 11
2 2 15
3 5 13
Using the answer found here (Update rows of data frame in R), I tried to do this with the "dplyr" library:
update = old_data %>%
rows_update(new_data, by = "id")
But this gave me the following error:
Error: Attempting to update missing rows.
Run `rlang::last_error()` to see where the error occurred.
This is what I am trying to get:
id var1
1 1 11
2 2 15
3 3 13
Can someone please tell me what I am doing wrong?
Thanks!

A little bit messy but this works (on this sample data at least)
old_data %>%
  left_join(new_data, by = "id") %>%
  mutate(var1 = if_else(!is.na(var1.y), var1.y, var1.x)) %>%
  select(id, var1)
# id var1
#1 1 11
#2 2 15
#3 3 13

A base R approach using match -
inds <- match(old_data$id, new_data$id)
old_data$var1[!is.na(inds)] <- na.omit(new_data$var1[inds])
old_data
# id var1
#1 1 11
#2 2 15
#3 3 13

A data.table approach (converting the data.table back into a data frame):
library(data.table)
as.data.frame(setDT(old_data)[new_data, var1 := .(i.var1), on = "id"])
Output
id var1
1 1 11
2 2 15
3 3 13
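One caveat worth noting about the data.table approach: setDT() converts old_data to a data.table by reference and := updates it in place, so old_data itself ends up modified even though the result is wrapped in as.data.frame(). A minimal sketch that leaves the original data frame untouched, in case that matters for you:
library(data.table)
dt <- as.data.table(old_data)            # as.data.table() makes a copy; old_data stays a data.frame
dt[new_data, var1 := i.var1, on = "id"]  # update matching ids on the copy only
as.data.frame(dt)
#   id var1
# 1  1   11
# 2  2   15
# 3  3   13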
An alternative tidyverse option using rows_update. You can filter new_data to only have ids that appear in old_data. Then, you can update those values, like you had previously tried. Essentially, new_data must only have id values that appear in old_data.
library(tidyverse)
old_data %>%
rows_update(., new_data %>% filter(id %in% old_data$id), by = "id")
Data
old_data <-
  structure(list(id = c(1, 2, 3), var1 = c(11, 12, 13)),
            class = "data.frame",
            row.names = c(NA, -3L))
new_data <-
  structure(list(id = c(4, 2, 5), var1 = c(11, 15, 13)),
            class = "data.frame",
            row.names = c(NA, -3L))

We can use dplyr::rows_update if we first use a semi_join on new_data to filter only those ids that are included in old_data.
library(dplyr)
old_data %>%
  rows_update(new_data %>%
                semi_join(old_data, by = "id"),
              by = "id")
#> id var1
#> 1 1 11
#> 2 2 15
#> 3 3 13
Created on 2021-12-29 by the reprex package (v0.3.0)
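Side note: if your installed dplyr is version 1.1.0 or later (an assumption about your setup), rows_update() takes an unmatched argument, so the pre-filtering step can be dropped entirely; and if you would rather insert the unmatched ids (4 and 5) than ignore them, rows_upsert() does that. A minimal sketch, assuming dplyr >= 1.1.0 for the unmatched argument:
library(dplyr)
# silently skip ids in new_data that do not exist in old_data (dplyr >= 1.1.0)
old_data %>%
  rows_update(new_data, by = "id", unmatched = "ignore")

# or keep the unmatched ids as extra rows instead of dropping them
old_data %>%
  rows_upsert(new_data, by = "id")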

Related

Get column names into a new variable based on conditions

I have a data frame like this in R. My problem can be divided into two steps.
SUBID   ABC  BCD  DEF
192838    4   -3    2
193928   -6   -2    6
205829    4   -5    9
201837    3    4    4
I want to make a new variable that contains, for each SUBID, a list of the column names that have a negative value. The output should look something like this:
SUBID   ABC  BCD  DEF  output
192838    4   -3    2  "BCD"
193928   -6   -2    6  "ABC","BCD"
205829    4   -5    9  "BCD"
201837    3    4    4  " "
And then, in the second step, I would like to collapse the SUBID into a more general ID and get the number of unique strings from the output variable for each ID (I just need the number; the specific strings in parentheses are just for illustration).
SUBID  output
19     2 ("ABC","BCD")
20     1 ("BCD")
Those are the two steps that I think should be done, but maybe there is a way to skip the first step and go directly to the second, which I don't know about.
I would appreciate any help since right now I am not sure where to start on this. Thank you!
Another way:
library(dplyr)
library(tidyr)
df <- df %>% pivot_longer(-SUBID)
df1 <- df %>%
  group_by(SUBID) %>%
  summarise(output = paste(name[value < 0L], collapse = ','))
df2 <- df %>%
  group_by(SUBID = substr(SUBID, 1, 2)) %>%
  summarise(output_count = n_distinct(name[value < 0L]),
            output = paste0(output_count, ' (', paste(name[value < 0L], collapse = ','), ')'))
Outputs (two columns are created in the second case, one with just the count and another following your example):
df1
# A tibble: 4 x 2
SUBID output
<int> <chr>
1 192838 "BCD"
2 193928 "ABC,BCD"
3 201837 ""
4 205829 "BCD"
df2
# A tibble: 2 x 3
SUBID output_count output
<chr> <int> <chr>
1 19 2 2 (BCD,ABC,BCD)
2 20 1 1 (BCD)
This answers the first part of your question; the second one I didn't understand.
df$output <-apply(df[,-1], 1, function(x) paste(names(df)[-1][x<0], collapse = ","))
df
SUBID ABC BCD DEF output
1 192838 4 3 -2 DEF
2 193928 -6 -2 6 ABC,BCD
3 205829 4 -5 9 BCD
4 201837 3 4 4
For the second part, try this:
id <- sapply(strsplit(sub("\\W+", "", df$output), split = ""), function(x){
  sum(!(duplicated(x) | duplicated(x, fromLast = TRUE)))
})
data.frame(SUBID = substr(df$SUBID, 1,2), output = id, string = df$output)
SUBID output string
1 19 3 DEF
2 19 2 ABC,BCD
3 20 3 BCD
4 20 0
I added the variable string so you can make sure your count of unique values is OK.
One option is to take advantage of dplyr::cur_data() to access the names() of the data and subset based on your criteria. Then you can take advantage of tibble list-columns to hold on to a set of column names of arbitrary length and finally calculate the number of unique values in that list.
library(tidyverse)
d <- structure(list(SUBID = c(192838, 193928, 205829, 201837), ABC = c(4, -6, 4, 3), BCD = c(-3, -2, -5, 4), DEF = c(2, 6, 9, 4)), row.names = c(NA, -4L), class = "data.frame")
d %>%
  rowwise() %>%
  mutate(neg_col_names = list(names(cur_data())[cur_data() < 0])) %>%
  group_by(ID_grp = str_sub(SUBID, 1, 2)) %>%
  summarize(neg_col_count = n_distinct(unlist(c(neg_col_names))))
#> # A tibble: 2 × 2
#> ID_grp neg_col_count
#> <chr> <int>
#> 1 19 2
#> 2 20 1
Created on 2022-11-22 with reprex v2.0.2
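If you are on dplyr 1.1.0 or later (an assumption about your version), note that cur_data() is deprecated there in favour of pick(). A sketch of the same rowwise step written with pick(), which should behave identically on this data:
library(tidyverse)
d %>%
  rowwise() %>%
  # pick(everything()) returns the current row as a one-row tibble
  mutate(neg_col_names = list(names(pick(everything()))[pick(everything()) < 0])) %>%
  group_by(ID_grp = str_sub(SUBID, 1, 2)) %>%
  summarize(neg_col_count = n_distinct(unlist(c(neg_col_names))))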

Add a column with single value per group

I have a grouped tibble with several columns. I now want to add a new column that has the same value for every row within a group but a different value for each group, basically giving the groups names. These per-group values are supplied from a vector.
Ideally I want to do this in a generic way, so it works in a function regardless of the number of groups the input has.
Any help would be much appreciated. Here is a very basic and reduced example of the tibble and vector (the original tibble has character, int, and dbl columns):
df <- tibble(a = c(1,2,3,1,3,2)) %>% group_by(a)
names <- c("owl", "newt", "zag")
desired_output <- tibble(a = c(1, 2, 3, 1, 3, 2),
                         name = c("owl", "newt", "zag", "owl", "zag", "newt"))
As the output I would like to have the same tibble, just with another column where all rows in group 1 = owl, 2 = newt, and 3 = zag.
Just take a as indices:
library(dplyr)
df %>%
mutate(name = names[a])
# # A tibble: 6 × 2
# a name
# <dbl> <chr>
# 1 1 owl
# 2 2 newt
# 3 3 zag
# 4 1 owl
# 5 3 zag
# 6 2 newt
You can also use recode() if a cannot be used as indices.
df %>%
mutate(name = recode(a, !!!setNames(names, 1:3)))
Data
df <- tibble(a = c(1,2,3,1,3,2))
names <- c("owl", "newt", "zag")
Something like this?
library(dplyr)
names = c("owl", "newt", "zag")
df %>%
  group_by(a) %>%
  mutate(new_col = case_when(a == 1 ~ names[1],
                             a == 2 ~ names[2],
                             a == 3 ~ names[3]))
a new_col
<dbl> <chr>
1 1 owl
2 2 newt
3 3 zag
4 1 owl
5 2 newt
6 3 zag
7 2 newt
8 3 zag
9 1 owl
10 2 newt
11 1 owl
12 3 zag
13 2 newt
14 3 zag
data:
df <- structure(list(a = c(1, 2, 3, 1, 2, 3, 2, 3, 1, 2, 1, 3, 2, 3
)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-14L))
You could use factor() with mutate():
names = c("owl", "newt", "zag")
dat = data.frame(a = c(1,2,3,1,2,3,2,3,1,2,1,3,2,3))
dat %>% mutate(label = factor(a, levels = c(1,2,3), labels = names))
Just make sure the order in levels corresponds to the order in labels (i.e. 1 = "owl").

Combining two variables to create new variable

I would like to combine two variables that have only one answer each into a single variable that has both answers.
Example
IPV_YES only has answers that are 1
IPV_NO only has answers that are 2
I would like to combine them into a single variable named IPV that would have the 1 and 2 results from both individual category.
I have tried using the ifelse command, but it only shows me the value of IPV_YES.
Dataset I have
My desired outcome
My answer:
df %>% mutate(across(everything(), ~ifelse(. == "", NA, as.numeric(.)))) %>%
  group_by(ID) %>%
  rowwise() %>%
  transmute(IPV = sum(c_across(everything()), na.rm = T))
# A tibble: 4 x 2
# Rowwise: ID
ID IPV
<dbl> <dbl>
1 1 1
2 2 2
3 3 1
4 4 2
data
df <- data.frame(ID = 1:4, IPV_YES = c(1,"",1,""), IPV_NO = c("",2,"",2))
We can use coalesce after converting the '' to NA
library(dplyr)
df <- df %>%
  transmute(ID, IPV = coalesce(na_if(IPV_YES, ""), na_if(IPV_NO, ""))) %>%
  type.convert(as.is = TRUE)
data
df <- data.frame(ID = 1:4, IPV_YES = c(1,"",1,""), IPV_NO = c("",2,"",2))
df$IPV <- ifelse(df$IPV_YES != "", df$IPV_YES, df$IPV_NO[!df$IPV_NO==""])
Here, we specify an ifelse statement; it can be glossed thus: if the value in df$IPV_YES is not blank, then give the value in df$IPV_YES, else give those values from df$IPV_NO that are not blank.
If you want to remove the IPV_* columns:
df[,2:3] <- NULL
Result:
df
ID IPV
1 1 1
2 2 2
3 3 1
4 4 2
Data:
df <- data.frame(ID = 1:4, IPV_YES = c(1,"",1,""), IPV_NO = c("",2,"",2))
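A small caveat about the subsetted ifelse() above: df$IPV_NO[!df$IPV_NO == ""] has only two elements, so ifelse() recycles it across all four rows. That happens to give the right answer for this data, but with other arrangements of blanks the recycled values can land in the wrong rows. A sketch that avoids relying on recycling (same idea, no subsetting):
# at the positions where IPV_YES is blank, IPV_NO already holds the value we want
df$IPV <- ifelse(df$IPV_YES != "", df$IPV_YES, df$IPV_NO)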
Maybe you can try the code below
replace(df, df == "", NA) %>%
  mutate(IPV = coalesce(IPV_YES, IPV_NO)) %>%
  select(ID, IPV) %>%
  type.convert(as.is = TRUE)
which gives
ID IPV
1 1 1
2 2 2
3 3 1
4 4 2

how to count repetitions of first occurring value with dplyr

I have a dataframe with groups that essentially looks like this
DF <- data.frame(state = c(rep("A", 3), rep("B",2), rep("A",2)))
DF
state
1 A
2 A
3 A
4 B
5 B
6 A
7 A
My question is how to count the number of consecutive rows where the first value is repeated in its first "block". So for DF above, the result should be 3. The first value can appear any number of times, with other values in between, or it may be the only value appearing.
The following naive attempt fails in general, as it counts all occurrences of the first value.
DF %>% mutate(is_first = as.integer(state == first(state))) %>%
summarize(count = sum(is_first))
The result in this case is 5. So, hints on a (preferably) dplyr solution to this would be appreciated.
You can try:
rle(as.character(DF$state))$lengths[1]
[1] 3
In your dplyr chain that would just be:
DF %>% summarize(count_first = rle(as.character(state))$lengths[1])
# count_first
# 1 3
Or to be overzealous with piping, using dplyr and magrittr:
library(dplyr)
library(magrittr)
DF %>% summarize(count_first = state %>%
                   as.character %>%
                   rle %$%
                   lengths %>%
                   first)
# count_first
# 1 3
Works also for grouped data:
DF <- data.frame(group = c(rep(1,4),rep(2,3)),state = c(rep("A", 3), rep("B",2), rep("A",2)))
# group state
# 1 1 A
# 2 1 A
# 3 1 A
# 4 1 B
# 5 2 B
# 6 2 A
# 7 2 A
DF %>% group_by(group) %>% summarize(count_first = rle(as.character(state))$lengths[1])
# # A tibble: 2 x 2
# group count_first
# <dbl> <int>
# 1 1 3
# 2 2 1
No need for dplyr here, but you can modify this example to use it with dplyr. The key is the function rle.
state = c(rep("A", 3), rep("B",2), rep("A",2))
x = rle(state)
DF = data.frame(len = x$lengths, state = x$values)
DF
# get the longest run of consecutive "A"
max(DF[DF$state == "A",]$len)
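If you prefer to stay entirely inside dplyr without rle(), a cumulative-sum trick also works: a row belongs to the first block as long as the running count of "state differs from its first value" is still zero. A minimal sketch (it also works per group after group_by()):
library(dplyr)
DF %>%
  summarize(count_first = sum(cumsum(state != first(state)) == 0))
#   count_first
# 1           3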

mutate_at does not create variable suffixes in some cases?

I have been playing with dplyr::mutate_at to create new variables by applying the same function to some of the columns. When I name my function in the .funs argument, the mutate call creates new columns with a suffix instead of replacing the existing ones, which is a cool option that I discovered in this thread.
df = data.frame(var1=1:2, var2=4:5, other=9)
df %>% mutate_at(vars(contains("var")), .funs=funs('sqrt'=sqrt))
#### var1 var2 other var1_sqrt var2_sqrt
#### 1 1 4 9 1.000000 2.000000
#### 2 2 5 9 1.414214 2.236068
However, I noticed that when the vars argument used to select my columns returns only one column instead of several, the resulting new column drops the original name: it gets named sqrt instead of other_sqrt here:
df %>% mutate_at(vars(contains("other")), .funs=funs('sqrt'=sqrt))
#### var1 var2 other sqrt
#### 1 1 4 9 3
#### 2 2 5 9 3
I would like to understand why this behaviour happens, and how to avoid it, because I don't know in advance how many columns contains() will return.
EDIT:
The newly created columns must inherit the name of the original columns, plus the suffix 'sqrt' at the end.
Thanks
Here is another idea. We can add setNames(sub("^sqrt$", "other_sqrt", names(.))) after the mutate_at call. The idea is to replace the column name sqrt with other_sqrt. The pattern ^sqrt$ should only match the derived column sqrt if there is only one column named other, which is demonstrated in Example 1. If there is more than one column containing other, as in Example 2, setNames would not change the column names.
library(dplyr)
# Example 1
df <- data.frame(var1 = 1:2, var2 = 4:5, other = 9)
df %>%
  mutate_at(vars(contains("other")), funs("sqrt" = sqrt(.))) %>%
  setNames(sub("^sqrt$", "other_sqrt", names(.)))
# var1 var2 other other_sqrt
# 1 1 4 9 3
# 2 2 5 9 3
# Example 2
df2 <- data.frame(var1 = 1:2, var2 = 4:5, other1 = 9, other2 = 16)
df2 %>%
  mutate_at(vars(contains("other")), funs("sqrt" = sqrt(.))) %>%
  setNames(sub("^sqrt$", "other_sqrt", names(.)))
# var1 var2 other1 other2 other1_sqrt other2_sqrt
# 1 1 4 9 16 3 4
# 2 2 5 9 16 3 4
Or we can design a function to check how many columns contain the string other before manipulating the data frame.
mutate_sqrt <- function(df, string){
  string_col <- grep(string, names(df), value = TRUE)
  df2 <- df %>% mutate_at(vars(contains(string)), funs("sqrt" = sqrt(.)))
  if (length(string_col) == 1){
    df2 <- df2 %>% setNames(sub("^sqrt$", paste(string_col, "sqrt", sep = "_"), names(.)))
  }
  return(df2)
}
mutate_sqrt(df, "other")
# var1 var2 other other_sqrt
# 1 1 4 9 3
# 2 2 5 9 3
mutate_sqrt(df2, "other")
# var1 var2 other1 other2 other1_sqrt other2_sqrt
# 1 1 4 9 16 3 4
# 2 2 5 9 16 3 4
I just figured out a (not so clean) way to do it;
I add an extra dummy variable to the dataset, with a name that ensures it will be selected and that we don't fall into the 1-variable case, and after the calculation I remove the two dummy columns, like this:
df %>% mutate(other_fake = NA) %>%
  mutate_at(vars(contains("other")), .funs = funs('sqrt' = sqrt)) %>%
  select(-contains("other_fake"))
#### var1 var2 other other_sqrt
#### 1 1 4 9 3
#### 2 2 5 9 3
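For completeness: in dplyr 1.0.0 and later, mutate_at() and funs() are superseded by across(), whose .names argument makes the suffixing explicit, so the single-column case no longer loses the original name. A sketch, assuming a reasonably recent dplyr:
library(dplyr)
df <- data.frame(var1 = 1:2, var2 = 4:5, other = 9)
# .names = "{.col}_sqrt" keeps the original column name even when only one column matches
df %>% mutate(across(contains("other"), sqrt, .names = "{.col}_sqrt"))
#   var1 var2 other other_sqrt
# 1    1    4     9          3
# 2    2    5     9          3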
