I would like to convert repeating values in a vector into NA's, such that I keep the position of the first occurrence of each new value.
I can find lots of posts on how to solve the removal of duplicate rows, but no posts that solve this issue.
Can you help me convert the column "problem" into the values in the column "desire"?
dplyr solutions are preferred.
library(tidyverse)
df <- tribble(
~frame, ~problem, ~desire,
1, NA, NA,
2, "A", "A",
3, NA, NA,
4, "B", "B",
5, "B", NA,
6, NA, NA,
7, "C", "C",
8, "C", NA,
9, NA, NA,
10, "E", "E")
df
# A tibble: 10 x 3
frame problem desire
<dbl> <chr> <chr>
1 1 NA NA
2 2 A A
3 3 NA NA
4 4 B B
5 5 B NA
6 6 NA NA
7 7 C C
8 8 C NA
9 9 NA NA
10 10 E E
_____EDIT with "Base R"/ "dplyr" solution___
Ronak Shah's solution works. Here it is within a dplyr workflow in case anyone is interested:
df %>%
mutate(
solved = replace(problem, duplicated(problem), NA))
# A tibble: 10 x 4
frame problem desire solved
<dbl> <chr> <chr> <chr>
1 1 NA NA NA
2 2 A A A
3 3 NA NA NA
4 4 B B B
5 5 B NA NA
6 6 NA NA NA
7 7 C C C
8 8 C NA NA
9 9 NA NA NA
10 10 E E E
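A run-based variant that stays inside dplyr can be sketched as follows. This assumes dplyr 1.1.0 or later, where `consecutive_id()` is available; the column name `solved` follows the edit above, and the data are retyped from the question.

```r
library(dplyr)

# Data retyped from the question
df <- tibble(
  frame   = 1:10,
  problem = c(NA, "A", NA, "B", "B", NA, "C", "C", NA, "E"),
  desire  = c(NA, "A", NA, "B", NA, NA, "C", NA, NA, "E")
)

# consecutive_id() numbers each run of identical adjacent values, so
# duplicated() on it flags every element after the first one in a run.
out <- df %>%
  mutate(solved = replace(problem, duplicated(consecutive_id(problem)), NA))

# Compare the result with the desired column
identical(out$solved, out$desire)
```

Unlike `duplicated(problem)`, this should leave a value intact when it reappears later after a gap.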
Using rleid from data.table, we can replace the repeated values within each run with NA.
library(data.table)
df$answer <- replace(df$problem, duplicated(rleid(df$problem)), NA)
# frame problem desire answer
# <dbl> <chr> <chr> <chr>
# 1 1 NA NA NA
# 2 2 A A A
# 3 3 NA NA NA
# 4 4 B B B
# 5 5 B NA NA
# 6 6 NA NA NA
# 7 7 C C C
# 8 8 C NA NA
# 9 9 NA NA NA
#10 10 E E E
For a complete base R option, we can use rle instead of rleid to create the run sequence:
df$answer <- replace(df$problem, duplicated(with(rle(df$problem),
rep(seq_along(values), lengths))), NA)
If, as in the example shown, all identical values are always adjacent, duplicated alone is enough:
df$problem <- replace(df$problem, duplicated(df$problem), NA)
We can use data.table
library(data.table)
setDT(df)[duplicated(rleid(problem)), problem := NA][]
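The difference between the run-based and the plain `duplicated()` approach only shows up when a value reappears after a gap. A small base R sketch with a made-up vector `x` (not from the question) illustrates it:

```r
# Hypothetical vector in which "A" appears in two separate runs
x <- c(NA, "A", NA, "B", "B", NA, "A", "A")

# Run ids built from rle(): 1 2 3 4 4 5 6 6
run_id <- with(rle(x), rep(seq_along(values), lengths))

# Keeps the first element of every run, so the second "A" run survives
by_run <- replace(x, duplicated(run_id), NA)

# Keeps only the first occurrence of each value overall
by_value <- replace(x, duplicated(x), NA)

by_run    # NA "A" NA "B" NA NA "A" NA
by_value  # NA "A" NA "B" NA NA NA  NA
```

Which one is wanted depends on whether a later, separate run of the same value counts as a "new" value.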
I have created the following dataframe in R
library(tidyr)
library(dplyr)
DF11<- data.frame("ID"= c("A", "A", "A", "B", "B", "B", "B", "B"))
DF11$X_F<-c(5, 7,9,6,7,8,9,10)
DF11$X_A<-c(7, 8,9,3,6,7,9,10)
The dataframe looks as follows
ID X_F X_A
A 5 7
A 7 8
A 9 9
B 6 3
B 7 6
B 8 7
B 9 9
B 10 10
ID is the grouping variable. I would like to use dplyr to create the following dataframe.
ID X_F X_A
A 0 NA
A 1 NA
A 2 NA
A 3 NA
A 4 NA
A 5 7
A 7 8
A 9 9
A 10 NA
A 11 NA
A 12 NA
B 0 NA
B 1 NA
B 2 NA
B 3 NA
B 4 NA
B 5 NA
B 6 3
B 7 6
B 8 7
B 9 9
B 10 10
B 11 NA
B 12 NA
B 13 NA
The resultant dataframe should take DF11 and then group the X_F column using ID column. Next it should complete X_F group-wise from 0 to the minimum value of X_F by group, and then from the maximum value of X_F to maximum value X_F +3.
I tried the following code and was able to solve it partially.
DF112<-DF11%>%group_by(ID)%>%complete(X_F=seq(0, max(X_F)+3, by =1))
ID X_F X_A
A 0 NA
A 1 NA
A 2 NA
A 3 NA
A 4 NA
A 5 7
A 6 NA
A 7 8
A 8 NA
A 9 9
A 10 NA
A 11 NA
A 12 NA
B 0 NA
B 1 NA
B 2 NA
B 3 NA
B 4 NA
B 5 NA
B 6 3
B 7 6
B 8 7
B 9 9
B 10 10
B 11 NA
B 12 NA
B 13 NA
How do I get the desired output shown above? I would appreciate any guidance.
It would work to pass two vectors into your complete function call, one to do the lower values and one to do the upper:
library(tidyr)
library(dplyr)
DF11 <- data.frame("ID" = c("A", "A", "A", "B", "B", "B", "B", "B"))
DF11$X_F <- c(5, 7, 9, 6, 7, 8, 9, 10)
DF11$X_A <- c(7, 8, 9, 3, 6, 7, 9, 10)
DF11 %>%
group_by(ID) %>%
complete(X_F = c(seq(0, min(X_F) - 1 , by = 1), seq(max(X_F) + 1, max(X_F) + 3, by = 1))) |>
arrange(ID, X_F)
# A tibble: 25 × 3
# Groups: ID [2]
ID X_F X_A
<chr> <dbl> <dbl>
1 A 0 NA
2 A 1 NA
3 A 2 NA
4 A 3 NA
5 A 4 NA
6 A 5 7
7 A 7 8
8 A 9 9
9 A 10 NA
10 A 11 NA
11 A 12 NA
12 B 0 NA
13 B 1 NA
14 B 2 NA
15 B 3 NA
16 B 4 NA
17 B 5 NA
18 B 6 3
19 B 7 6
20 B 8 7
21 B 9 9
22 B 10 10
23 B 11 NA
24 B 12 NA
25 B 13 NA
Created on 2022-11-01 with reprex v2.0.2
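One edge case worth keeping in mind: if a group's minimum is already 0, `seq(0, min(X_F) - 1, by = 1)` becomes `seq(0, -1, by = 1)`, which R rejects with a wrong-sign error. A hedged sketch of the padding logic in base R, using `seq_len()` so the lower part is simply empty in that case; `pad_range` and its `extra` argument are hypothetical names introduced here for illustration:

```r
# Hypothetical helper: values 0 .. min(x) - 1, the observed values,
# and max(x) + 1 .. max(x) + extra. Assumes x holds non-negative whole numbers.
pad_range <- function(x, extra = 3) {
  lower <- seq_len(min(x)) - 1      # empty when min(x) is 0
  upper <- max(x) + seq_len(extra)  # the `extra` values above the maximum
  sort(unique(c(lower, x, upper)))
}

pad_range(c(5, 7, 9))  # 0 1 2 3 4 5 7 9 10 11 12
pad_range(c(0, 2))     # 0 2 3 4 5
```

The first call reproduces the X_F values wanted for group A in the question.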
Question
In R, can I use a vector that holds the names of data frame columns to avoid repeated code?
vec_columns <- c("col1", "col2", "col8", "col54")
Background
I started off looking to solve the problem that is answered in this question, which works on its own terms:
Coalesce columns and create another column to specify source
But in my specific use case, I have many columns and the ones that I want to coalesce() are not adjacent to each other, so the <tidy-select> used in that solution doesn't work for me.
Modified Example from Original Question
In the original question, the OP was using contiguous columns that began in column #1, but I have leading columns that are not part of the coalesce(), plus the columns I want to coalesce() are separated from each other.
df_2 <-
data.frame(
Name = c("A", "B", "C", "D", "E"), #Adding a name column not in original posted question
group_1 = c(NA, NA, NA, NA, 2),
group_2 = c(NA, 4, NA, NA, 1),
group_3 = c(NA, NA, 5, NA, NA),
group_4 = c(1, NA, NA, 2, NA),
group_5 = c(NA, 3, NA, NA, NA)
)
> df_2
Name group_1 group_2 group_3 group_4 group_5
1 A NA NA NA 1 NA
2 B NA 4 NA NA 3
3 C NA NA 5 NA NA
4 D NA NA NA 2 NA
5 E 2 1 NA NA NA
The solution below, which I wrote, produces exactly the output I want:
df_2 %>%
mutate(one_col = coalesce(group_2, group_3, group_5)) %>%
rowwise() %>%
mutate(group_col = c("group_2", "group_3", "group_5")[!is.na(c_across(c(group_2, group_3, group_5)))][1])
# A tibble: 5 x 8
# Rowwise:
Name group_1 group_2 group_3 group_4 group_5 one_col group_col
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 A NA NA NA 1 NA NA NA
2 B NA 4 NA NA 3 4 group_2
3 C NA NA 5 NA NA 5 group_3
4 D NA NA NA 2 NA NA NA
5 E 2 1 NA NA NA 1 group_2
The Problem
But as you can see, I have to repeat those column names 3 times. To future-proof this, I can envision wanting 10 columns in the coalesce(). I want to set a variable holding a vector of column names just once, but doing the obvious myvec <- c("group_2", "group_3", "group_5") and inserting it doesn't work.
EDIT
Reading the comments, I got to an answer that meets the need, but I'm reluctant to answer it myself because credit goes to the commenter(s).
This achieves what I wanted:
myvec <- c("group_2", "group_3","group_5")
df_2 %>%
mutate(one_col = coalesce(!!!select(.,myvec))) %>% rowwise() %>%
mutate(group_col = myvec[!is.na(c_across(myvec))][1])
# A tibble: 5 x 8
# Rowwise:
Name group_1 group_2 group_3 group_4 group_5 one_col group_col
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 A NA NA NA 1 NA NA NA
2 B NA 4 NA NA 3 4 group_2
3 C NA NA 5 NA NA 5 group_3
4 D NA NA NA 2 NA NA NA
5 E 2 1 NA NA NA 1 group_2
The main goal here is really to get the group_col column which records which column was used in the coalesce(), but with the flexibility to handle non-adjacent columns, and also to change the order of the columns in the coalesce(). For example, below I reverse the order of the myvec so that row 2, "B" selects 3 and group_5, instead of 4 and group_2 in the above example.
> myvec2 <- c("group_5", "group_3","group_2")
> df_2 %>%
+ mutate(one_col = coalesce(!!!select(.,myvec2))) %>% rowwise() %>%
+ mutate(group_col = myvec2[!is.na(c_across(myvec2))][1])
Note: Using an external vector in selections is ambiguous.
i Use `all_of(myvec2)` instead of `myvec2` to silence this message.
i See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
This message is displayed once per session.
# A tibble: 5 x 8
# Rowwise:
Name group_1 group_2 group_3 group_4 group_5 one_col group_col
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 A NA NA NA 1 NA NA NA
2 B NA 4 NA NA 3 3 group_5
3 C NA NA 5 NA NA 5 group_3
4 D NA NA NA 2 NA NA NA
5 E 2 1 NA NA NA 1 group_2
I am mildly concerned about the message saying that using an external vector in selections is ambiguous, so perhaps I'll need to use the all_of() function, but this did work.
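For reference, the same pipeline with the vector wrapped in `all_of()` in both places might look like the sketch below. It assumes a dplyr version that provides `c_across()` (1.0.0 or later) and reuses `df_2` and `myvec` from above; `res` and the final `ungroup()` are additions for illustration.

```r
library(dplyr)

df_2 <- data.frame(
  Name    = c("A", "B", "C", "D", "E"),
  group_1 = c(NA, NA, NA, NA, 2),
  group_2 = c(NA, 4, NA, NA, 1),
  group_3 = c(NA, NA, 5, NA, NA),
  group_4 = c(1, NA, NA, 2, NA),
  group_5 = c(NA, 3, NA, NA, NA)
)

myvec <- c("group_2", "group_3", "group_5")

res <- df_2 %>%
  # all_of() states explicitly that myvec is an external character vector
  mutate(one_col = coalesce(!!!select(., all_of(myvec)))) %>%
  rowwise() %>%
  # name of the first non-NA column, in the order given by myvec
  mutate(group_col = myvec[!is.na(c_across(all_of(myvec)))][1]) %>%
  ungroup()

res
```

Reordering `myvec` changes the coalesce priority in the same way as in the `myvec2` example.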
This must be easy but my brain is blocked!
I have this dataframe:
col1
<chr>
1 A
2 B
3 NA
4 C
5 D
6 NA
7 NA
8 E
9 NA
10 F
df <- structure(list(col1 = c("A", "B", NA, "C", "D", NA, NA, "E",
NA, "F")), row.names = c(NA, -10L), class = c("tbl_df", "tbl",
"data.frame"))
I want to add a column with uniqueID only for values that are not NA with tidyverse.
Expected output:
col1 uniqueID
<chr> <dbl>
1 A 1
2 B 2
3 NA NA
4 C 3
5 D 4
6 NA NA
7 NA NA
8 E 5
9 NA NA
10 F 6
I have tried: n(), row_number(), cur_group_id ....
We could do this easily in data.table. Specify the condition in i, i.e. the non-NA elements in 'col1', and create the column 'uniqueID' as a sequence over those elements by assignment (:=).
library(data.table)
setDT(df)[!is.na(col1), uniqueID := seq_len(.N)]
-output
df
col1 uniqueID
1: A 1
2: B 2
3: <NA> NA
4: C 3
5: D 4
6: <NA> NA
7: <NA> NA
8: E 5
9: <NA> NA
10: F 6
In dplyr, we can use replace
library(dplyr)
df %>%
mutate(uniqueID = replace(col1, !is.na(col1),
seq_len(sum(!is.na(col1)))))
-output
# A tibble: 10 x 2
col1 uniqueID
<chr> <chr>
1 A 1
2 B 2
3 <NA> <NA>
4 C 3
5 D 4
6 <NA> <NA>
7 <NA> <NA>
8 E 5
9 <NA> <NA>
10 F 6
Another approach:
library(dplyr)
df %>%
mutate(UniqueID = cumsum(!is.na(col1)),
UniqueID = if_else(is.na(col1), NA_integer_, UniqueID))
# A tibble: 10 x 2
col1 UniqueID
<chr> <int>
1 A 1
2 B 2
3 NA NA
4 C 3
5 D 4
6 NA NA
7 NA NA
8 E 5
9 NA NA
10 F 6
A base R option using match + na.omit + unique
transform(
df,
uniqueID = match(col1, na.omit(unique(col1)))
)
gives
col1 uniqueID
1 A 1
2 B 2
3 <NA> NA
4 C 3
5 D 4
6 <NA> NA
7 <NA> NA
8 E 5
9 <NA> NA
10 F 6
A weird tidyverse solution:
library(dplyr)
df %>%
mutate(id = ifelse(is.na(col1), 0, 1),
id = cumsum(id == 1),
id = ifelse(is.na(col1), NA, id))
# A tibble: 10 x 2
col1 id
<chr> <int>
1 A 1
2 B 2
3 NA NA
4 C 3
5 D 4
6 NA NA
7 NA NA
8 E 5
9 NA NA
10 F 6
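The answers above agree on the example because every non-NA value occurs once. If values can repeat, the cumsum-style and match-style approaches diverge. A small base R sketch with a made-up vector `x` (not from the question):

```r
# Hypothetical input in which "A" occurs twice
x <- c("A", "B", NA, "A", "C")

# Running count of non-NA entries: every non-NA row gets its own id
id_running <- ifelse(is.na(x), NA_integer_, cumsum(!is.na(x)))

# Position among the distinct non-NA values: repeated values share an id
id_by_value <- match(x, na.omit(unique(x)))

id_running   # 1 2 NA 3 4
id_by_value  # 1 2 NA 1 3
```

Which behaviour counts as a "unique ID" depends on whether the id should identify the row or the value.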
I have a large dataframe with responses to a questionnaire. My minimal working example (below) has the responses to 3 questions as well as the delay in responding to the questionnaire from which the answers are drawn
df <- data.frame(ID = LETTERS[1:10],
Q1 = sample(0:10, 10, replace=T),
Q2 = sample(0:10, 10, replace=T),
Q3 = sample(0:10, 10, replace=T),
Delay = 1:10
)
I'd like to change the responses with a delay > 3 to NA's. I can accomplish this easily enough for a single question:
df %>%
mutate(Q1 = ifelse(Delay >3, NA, Q1))
which gives me
ID Q1 Q2 Q3 Delay
1 A 5 6 9 1
2 B 8 1 5 2
3 C 8 4 6 3
4 D NA 7 1 4
5 E NA 8 10 5
6 F NA 9 4 6
7 G NA 1 6 7
8 H NA 8 9 8
9 I NA 9 1 9
10 J NA 5 7 10
I'd like instead to do this for all three questions with one statement (in my real life problem, I have over 20 questions, so it's tedious to do each question separately). I therefore create a vector of questions:
q_vec <- c("Q1", "Q2", "Q3")
and then tried variants of my earlier code such as
df %>%
mutate(all_of(q_vec) = ifelse(Delay >3, NA, ~))
but nothing worked.
What is the correct syntax for this?
Many thanks in advance
Thomas Philips
We can use across :
library(dplyr)
q_vec <- c("Q1", "Q2", "Q3")
df %>% mutate(across(all_of(q_vec), ~ifelse(Delay >3, NA, .)))
# ID Q1 Q2 Q3 Delay
#1 A 1 5 0 1
#2 B 9 9 6 2
#3 C 5 7 1 3
#4 D NA NA NA 4
#5 E NA NA NA 5
#6 F NA NA NA 6
#7 G NA NA NA 7
#8 H NA NA NA 8
#9 I NA NA NA 9
#10 J NA NA NA 10
Or in base R :
df[q_vec][df$Delay > 3, ] <- NA
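A variant of the across() call using `replace()` instead of `ifelse()` is sketched below; it is an alternative spelling rather than the answer's own code. The `set.seed()` call and the name `masked` are additions so the example is reproducible.

```r
library(dplyr)

set.seed(1)  # added here only to make the sample reproducible
df <- data.frame(ID = LETTERS[1:10],
                 Q1 = sample(0:10, 10, replace = TRUE),
                 Q2 = sample(0:10, 10, replace = TRUE),
                 Q3 = sample(0:10, 10, replace = TRUE),
                 Delay = 1:10)

q_vec <- c("Q1", "Q2", "Q3")

# For every column named in q_vec, blank the rows where Delay exceeds 3.
# replace() keeps each column's original type (integer here).
masked <- df %>%
  mutate(across(all_of(q_vec), ~ replace(.x, Delay > 3, NA)))

masked
```

Rows with Delay of 3 or less are left untouched, and Delay itself is not modified because it is not listed in `q_vec`.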
I have two data frames, let's say A and B.
The table A is constructed like this :
ID a b c d
1 NA NA NA NA
2 NA NA NA NA
3 NA NA NA NA
4 NA NA NA NA
And the table B is constructed like this :
A B
a 1
a 2
a 3
b 2
b 6
b 8
b 9
c 1
c 6
c 11
d 5
d 4
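Basically, for each ID in table A, I'd like to change the NA to 1 in a given column if table B associates that ID (column B) with that column's letter (column A). For example, table B pairs a with 1, so the cell for ID 1 in column a should become 1.
I'm not sure this is the best way to do it; maybe using a matrix would be simpler.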
I think what you want is to convert B to a dense table with a 1 wherever that combination is present in B. You can do that by recognising that B is the same data but with the value of the cells left out. We need to add that in and then spread to convert from long to wide:
library(tidyverse)
tbl_b <- tibble(
A = c("a", "a", "a", "b", "b", "b", "b", "c", "c", "c", "d", "d"),
B = c(1, 2, 3, 2, 6, 8, 9, 1, 6, 11, 5, 4)
)
tbl_b %>%
mutate(value = 1) %>%
spread(A, value)
#> # A tibble: 9 x 5
#> B a b c d
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 NA 1 NA
#> 2 2 1 1 NA NA
#> 3 3 1 NA NA NA
#> 4 4 NA NA NA 1
#> 5 5 NA NA NA 1
#> 6 6 NA 1 1 NA
#> 7 8 NA 1 NA NA
#> 8 9 NA 1 NA NA
#> 9 11 NA NA 1 NA
Created on 2019-02-22 by the reprex package (v0.2.1)
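In tidyr 1.0.0 and later, `spread()` is superseded by `pivot_wider()`. A sketch of the same reshaping under that assumption; `wide` is a name introduced here, and the `arrange(B)` step is added only to make the row order explicit:

```r
library(dplyr)
library(tidyr)

tbl_b <- tibble(
  A = c("a", "a", "a", "b", "b", "b", "b", "c", "c", "c", "d", "d"),
  B = c(1, 2, 3, 2, 6, 8, 9, 1, 6, 11, 5, 4)
)

wide <- tbl_b %>%
  mutate(value = 1) %>%                             # add the cell value left out of B
  pivot_wider(names_from = A, values_from = value) %>%  # one column per letter
  arrange(B)

wide
```

Combinations that never occur in B stay NA, matching the spread() output above.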