I have two columns holding the IDs of participants in my study. The column ID contains sequential numbers, assigned as though all subjects were distinct people. The second column new_ID records which IDs actually correspond to the same person; unfortunately, its values are not in sequential order.
ID <- c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6)
new_ID <- c(8, 8, 8, 8, 10, 10, 10, 10, 10, 10, 8, 8, 8, 8, 8, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 10, 10, 10, 10, 10, 10, 10)
data.frame(ID, new_ID)
# ID new_ID
#1 1 8
#2 1 8
#3 1 8
#4 1 8
#5 2 10
#6 2 10
#7 2 10
#8 2 10
#9 2 10
#10 2 10
#11 3 8
#12 3 8
#13 3 8
#14 3 8
#15 3 8
#16 4 4
#17 4 4
#18 4 4
#19 4 4
#20 4 4
#21 4 4
#22 5 5
#23 5 5
#24 5 5
#25 5 5
#26 6 10
#27 6 10
#28 6 10
#29 6 10
#30 6 10
#31 6 10
#32 6 10
Below is what I would like to achieve, i.e. assigning a final ID (ID_final) based on the information in the first two columns. Any help would be appreciated (ideally using dplyr)!
# ID new_ID ID_final
#1 1 8 1
#2 1 8 1
#3 1 8 1
#4 1 8 1
#5 2 10 2
#6 2 10 2
#7 2 10 2
#8 2 10 2
#9 2 10 2
#10 2 10 2
#11 3 8 1
#12 3 8 1
#13 3 8 1
#14 3 8 1
#15 3 8 1
#16 4 4 4
#17 4 4 4
#18 4 4 4
#19 4 4 4
#20 4 4 4
#21 4 4 4
#22 5 5 5
#23 5 5 5
#24 5 5 5
#25 5 5 5
#26 6 10 2
#27 6 10 2
#28 6 10 2
#29 6 10 2
#30 6 10 2
#31 6 10 2
#32 6 10 2
Here's a data.table solution as well.
EDIT: added a dplyr solution too at the request of the OP.
library(data.table)
ID <- c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6)
new_ID <- c(8, 8, 8, 8, 10, 10, 10, 10, 10, 10, 8, 8, 8, 8, 8, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 10, 10, 10, 10, 10, 10, 10)
d <- data.table(ID, new_ID)
d[, ID_final := min(ID), by = new_ID]
d
#> ID new_ID ID_final
#> 1: 1 8 1
#> 2: 1 8 1
#> 3: 1 8 1
#> 4: 1 8 1
#> 5: 2 10 2
#> 6: 2 10 2
#> 7: 2 10 2
#> 8: 2 10 2
#> 9: 2 10 2
#> 10: 2 10 2
#> 11: 3 8 1
#> 12: 3 8 1
#> 13: 3 8 1
#> 14: 3 8 1
#> 15: 3 8 1
#> 16: 4 4 4
#> 17: 4 4 4
#> 18: 4 4 4
#> 19: 4 4 4
#> 20: 4 4 4
#> 21: 4 4 4
#> 22: 5 5 5
#> 23: 5 5 5
#> 24: 5 5 5
#> 25: 5 5 5
#> 26: 6 10 2
#> 27: 6 10 2
#> 28: 6 10 2
#> 29: 6 10 2
#> 30: 6 10 2
#> 31: 6 10 2
#> 32: 6 10 2
#> ID new_ID ID_final
library(dplyr)
df <- data.frame(ID, new_ID)
df <- df %>% group_by(new_ID) %>%
mutate(ID_final = min(ID))
df
#> # A tibble: 32 x 3
#> # Groups: new_ID [4]
#> ID new_ID ID_final
#> <dbl> <dbl> <dbl>
#> 1 1 8 1
#> 2 1 8 1
#> 3 1 8 1
#> 4 1 8 1
#> 5 2 10 2
#> 6 2 10 2
#> 7 2 10 2
#> 8 2 10 2
#> 9 2 10 2
#> 10 2 10 2
#> # ... with 22 more rows
Created on 2019-09-30 by the reprex package (v0.3.0)
What you want to do is find the correct ID for each new_ID, and then join to that mapping.
final_id_map <- df %>% group_by(new_ID) %>% summarise(ID_final=min(ID))
> final_id_map
# A tibble: 4 x 2
new_ID ID_final
<dbl> <dbl>
1 4 4
2 5 5
3 8 1
4 10 2
Then you can just do
df %>% left_join(final_id_map, by = "new_ID")
to produce the desired output. (Note that dplyr's join verbs are left_join(), inner_join(), and so on; there is no plain join().)
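Putting both steps together, here is a minimal end-to-end sketch of this summarise-then-join approach:

```r
library(dplyr)

ID <- c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6)
new_ID <- c(8, 8, 8, 8, 10, 10, 10, 10, 10, 10, 8, 8, 8, 8, 8, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 10, 10, 10, 10, 10, 10, 10)
df <- data.frame(ID, new_ID)

# Step 1: one row per new_ID, mapped to the smallest original ID in that group
final_id_map <- df %>%
  group_by(new_ID) %>%
  summarise(ID_final = min(ID))

# Step 2: join the mapping back onto every row of the original data
result <- df %>% left_join(final_id_map, by = "new_ID")
```

Because `final_id_map` has exactly one row per `new_ID`, the join cannot duplicate rows of `df`; it only attaches the `ID_final` column.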
Hi, I have a dataframe as such:
df= structure(list(a = c(1, 3, 4, 6, 3, 2, 5, 1), b = c(1, 3, 4,
2, 6, 7, 2, 6), c = c(6, 3, 6, 5, 3, 6, 5, 3), d = c(6, 2, 4,
5, 3, 7, 2, 6), e = c(1, 2, 4, 5, 6, 7, 6, 3), f = c(2, 3, 4,
2, 2, 7, 5, 2)), .Names = c("a", "b", "c", "d", "e", "f"), row.names = c(NA,
8L), class = "data.frame")
df$total = apply(df, 1, sum)
df$row = seq_len(nrow(df))
so the dataframe looks like this.
> df
a b c d e f total row
1 1 1 6 6 1 2 17 1
2 3 3 3 2 2 3 16 2
3 4 4 6 4 4 4 26 3
4 6 2 5 5 5 2 25 4
5 3 6 3 3 6 2 23 5
6 2 7 6 7 7 7 36 6
7 5 2 5 2 6 5 25 7
8 1 6 3 6 3 2 21 8
What I want to do is find, for each row, the first later row whose total is greater than the current row's total. For example, for row 1 the total is 17, and the nearest later row with a larger total is row 3 (total 26).
I could loop through each row, but it gets really messy. Is this possible?
Thanks in advance.
We can do this in two steps with dplyr. First we set the grouping to rowwise(), which applies each operation per row (essentially making it work like an apply() loop over the rows). Then, for each row, we find all the rows whose total is larger than that row's total, drop those at or before the current row, and pick the first of what remains (the nearest later row):
library(dplyr)
df %>%
rowwise() %>%
mutate(nxt = list(which(.$total > total)),
nxt = nxt[nxt > row][1])
# A tibble: 8 × 9
# Rowwise:
a b c d e f total row nxt
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int>
1 1 1 6 6 1 2 17 1 3
2 3 3 3 2 2 3 16 2 3
3 4 4 6 4 4 4 26 3 6
4 6 2 5 5 5 2 25 4 6
5 3 6 3 3 6 2 23 5 6
6 2 7 6 7 7 7 36 6 NA
7 5 2 5 2 6 5 25 7 NA
8 1 6 3 6 3 2 21 8 NA
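The same logic can also be written in base R without rowwise(), looping explicitly with sapply (a sketch, using the same strict `>` comparison as the dplyr answer above):

```r
df <- structure(list(a = c(1, 3, 4, 6, 3, 2, 5, 1), b = c(1, 3, 4, 2, 6, 7, 2, 6),
                     c = c(6, 3, 6, 5, 3, 6, 5, 3), d = c(6, 2, 4, 5, 3, 7, 2, 6),
                     e = c(1, 2, 4, 5, 6, 7, 6, 3), f = c(2, 3, 4, 2, 2, 7, 5, 2)),
                .Names = c("a", "b", "c", "d", "e", "f"),
                row.names = c(NA, 8L), class = "data.frame")
df$total <- apply(df, 1, sum)

# for each row i: index of the first later row whose total exceeds row i's total
df$nxt <- sapply(seq_len(nrow(df)), function(i) {
  later <- which(df$total > df$total[i])  # all rows with a strictly larger total
  later <- later[later > i]               # keep only rows after the current one
  if (length(later)) later[1] else NA_integer_
})
```

This avoids the per-row grouping machinery at the cost of an explicit loop; for a data frame of this size the difference is negligible.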
my_df <- tibble(
b1 = c(2, 6, 3, 6, 4, 2, 1, 9, NA),
b2 = c(NA, 4, 6, 2, 6, 6, 1, 1, 7),
b3 = c(5, 9, 8, NA, 2, 3, 9, 5, NA),
b4 = c(NA, 6, NA, 10, 12, 8, 3, 6, 2),
b5 = c(2, 12, 1, 7, 8, 5, 5, 6, NA),
b6 = c(9, 2, 4, 6, 7, 6, 6, 7, 9),
b7 = c(1, 3, 7, 7, 4, 2, 2, 9, 5),
b8 = c(NA, 8, 4, 5, 1, 4, 1, 3, 6),
b9 = c(4, 5, 7, 9, 5, 1, 1, 2, NA),
b10 = c(14, 2, 4, 2, 1, 1, 1, 1, 5))
Hi guys,
Hope you are all well. I have a df like this (a very big one), and I want to tell R to add 10 to the values in b1 if there is a 2 in any of b6, b7, b8 or b9.
Thanks once again in anticipation.
We can create a logical condition in case_when by taking the row sums over the subset of columns b6:b9: if a row contains at least one 2 in those columns, add 10 to b1; otherwise, return the original value.
library(dplyr)
my_df <- my_df %>%
mutate(b1 = case_when(rowSums(select(cur_data(), b6:b9) == 2,
na.rm = TRUE) > 0 ~ b1 + 10, TRUE ~ b1))
-output
my_df
# A tibble: 9 x 10
b1 b2 b3 b4 b5 b6 b7 b8 b9 b10
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 NA 5 NA 2 9 1 NA 4 14
2 16 4 9 6 12 2 3 8 5 2
3 3 6 8 NA 1 4 7 4 7 4
4 6 2 NA 10 7 6 7 5 9 2
5 4 6 2 12 8 7 4 1 5 1
6 12 6 3 8 5 6 2 4 1 1
7 11 1 9 3 5 6 2 1 1 1
8 19 1 5 6 6 7 9 3 2 1
9 NA 7 NA 2 NA 9 5 6 NA 5
Or we may also use if_any:
my_df %>%
mutate(b1 = case_when(if_any(b6:b9, `%in%`, 2) ~ b1 + 10, TRUE ~ b1))
-output
# A tibble: 9 x 10
b1 b2 b3 b4 b5 b6 b7 b8 b9 b10
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 NA 5 NA 2 9 1 NA 4 14
2 16 4 9 6 12 2 3 8 5 2
3 3 6 8 NA 1 4 7 4 7 4
4 6 2 NA 10 7 6 7 5 9 2
5 4 6 2 12 8 7 4 1 5 1
6 12 6 3 8 5 6 2 4 1 1
7 11 1 9 3 5 6 2 1 1 1
8 19 1 5 6 6 7 9 3 2 1
9 NA 7 NA 2 NA 9 5 6 NA 5
Or the same in base R
i1 <- rowSums(my_df[6:9] == 2, na.rm = TRUE) > 0
my_df$b1[i1] <- my_df$b1[i1] + 10
Or with Reduce/lapply and %in%
i1 <- Reduce(`|`, lapply(my_df[6:9], `%in%`, 2))
my_df$b1[i1] <- my_df$b1[i1] + 10
You can also use the following solution:
library(dplyr)
library(purrr)
my_df %>%
pmap_df(~ {x <- c(...)[6:9];
y <- c(...)[1]
if(any(2 %in% x[!is.na(x)])) {
y + 10
} else {
y
}
}) %>%
bind_cols(my_df[-1])
# A tibble: 9 x 10
b1 b2 b3 b4 b5 b6 b7 b8 b9 b10
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 NA 5 NA 2 9 1 NA 4 14
2 16 4 9 6 12 2 3 8 5 2
3 3 6 8 NA 1 4 7 4 7 4
4 6 2 NA 10 7 6 7 5 9 2
5 4 6 2 12 8 7 4 1 5 1
6 12 6 3 8 5 6 2 4 1 1
7 11 1 9 3 5 6 2 1 1 1
8 19 1 5 6 6 7 9 3 2 1
9 NA 7 NA 2 NA 9 5 6 NA 5
Or we can use this thanks to a great suggestion by dear #akrun:
my_df %>%
mutate(b1 = ifelse(pmap_lgl(select(cur_data(), b6:b9), ~ 2 %in% c(...)), b1 + 10, b1))
Like your previous question, you can also use rowwise() here
my_df %>% rowwise() %>%
mutate(b1 = ifelse(any(c_across(b6:b9) == 2, na.rm = T), b1 + 10, b1))
# A tibble: 9 x 10
# Rowwise:
b1 b2 b3 b4 b5 b6 b7 b8 b9 b10
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 NA 5 NA 2 9 1 NA 4 14
2 16 4 9 6 12 2 3 8 5 2
3 3 6 8 NA 1 4 7 4 7 4
4 6 2 NA 10 7 6 7 5 9 2
5 4 6 2 12 8 7 4 1 5 1
6 12 6 3 8 5 6 2 4 1 1
7 11 1 9 3 5 6 2 1 1 1
8 19 1 5 6 6 7 9 3 2 1
9 NA 7 NA 2 NA 9 5 6 NA 5
I have a dataset with the following layout:
ABC1a_1 <- c(1, 5, 3, 4, 3, 4, 5, 2, 2, 1)
ABC1b_1 <- c(4, 2, 1, 1, 5, 3, 2, 1, 1, 5)
ABC1a_2 <- c(4, 5, 5, 4, 2, 5, 5, 1, 2, 4)
ABC1b_2 <- c(2, 3, 3, 2, 2, 3, 2, 1, 4, 2)
ABC2a_1 <- c(2, 5, 3, 5, 3, 4, 5, 3, 2, 3)
ABC2b_1 <- c(1, 2, 2, 4, 5, 3, 2, 4, 1, 4)
ABC2a_2 <- c(2, 5, 5, 1, 2, 1, 5, 1, 3, 4)
ABC2b_2 <- c(2, 3, 3, 2, 1, 3, 1, 1, 2, 2)
df <- data.frame(ABC1a_1, ABC1b_1, ABC1a_2, ABC1b_2, ABC2a_1, ABC2b_1, ABC2a_2, ABC2b_2)
I want to collapse all of the ABC[N][x]_[n] variables into a single ABC[N]_[n] variable like this:
ABC1_1 <- c(1, 5, 3, 4, 3, 4, 5, 2, 2, 1, 4, 2, 1, 1, 5, 3, 2, 1, 1, 5)
ABC1_2 <- c(4, 5, 5, 4, 2, 5, 5, 1, 2, 4, 2, 3, 3, 2, 2, 3, 2, 1, 4, 2)
ABC2_1 <- c(2, 5, 3, 5, 3, 4, 5, 3, 2, 3, 1, 2, 2, 4, 5, 3, 2, 4, 1, 4)
ABC2_2 <- c(2, 5, 5, 1, 2, 1, 5, 1, 3, 4, 2, 3, 3, 2, 1, 3, 1, 1, 2, 2)
df2 <- data.frame(ABC1_1, ABC1_2, ABC2_1, ABC2_2)
What's the best way to achieve this, ideally with a tidyverse solution?
You could also use pivot_longer:
df %>%
rename_with(~str_replace(.x, "(.)(_\\d)", "\\2:\\1")) %>%
pivot_longer(everything(), names_sep = ':', names_to = c(".value", "group")) %>%
arrange(group)
# A tibble: 20 x 5
group ABC1_1 ABC1_2 ABC2_1 ABC2_2
<chr> <dbl> <dbl> <dbl> <dbl>
1 a 1 4 2 2
2 a 5 5 5 5
3 a 3 5 3 5
4 a 4 4 5 1
5 a 3 2 3 2
6 a 4 5 4 1
7 a 5 5 5 5
8 a 2 1 3 1
9 a 2 2 2 3
10 a 1 4 3 4
11 b 4 2 1 2
12 b 2 3 2 3
13 b 1 3 2 3
14 b 1 2 4 2
15 b 5 2 5 1
16 b 3 3 3 3
17 b 2 2 2 1
18 b 1 1 4 1
19 b 1 4 1 2
20 b 5 2 4 2
If you prefer the base R way, you could do:
reshape(df, split(names(df), sub("._", "_", names(df))), dir="long")
time ABC1a_1 ABC1a_2 ABC2a_1 ABC2a_2 id
1.1 1 1 4 2 2 1
2.1 1 5 5 5 5 2
3.1 1 3 5 3 5 3
4.1 1 4 4 5 1 4
5.1 1 3 2 3 2 5
6.1 1 4 5 4 1 6
7.1 1 5 5 5 5 7
8.1 1 2 1 3 1 8
9.1 1 2 2 2 3 9
10.1 1 1 4 3 4 10
1.2 2 4 2 1 2 1
2.2 2 2 3 2 3 2
3.2 2 1 3 2 3 3
4.2 2 1 2 4 2 4
5.2 2 5 2 5 1 5
6.2 2 3 3 3 3 6
7.2 2 2 2 2 1 7
8.2 2 1 1 4 1 8
9.2 2 1 4 1 2 9
10.2 2 5 2 4 2 10
Then you can change the names.
If you care about the names from the very beginning:
df1 <- setNames(df, gsub("(.)(_\\d)", "\\2.\\1", names(df)))
reshape(df1, names(df1), dir = "long")
time ABC1_1 ABC1_2 ABC2_1 ABC2_2 id
1 a 1 4 2 2 1
2 a 5 5 5 5 2
3 a 3 5 3 5 3
4 a 4 4 5 1 4
5 a 3 2 3 2 5
6 a 4 5 4 1 6
7 a 5 5 5 5 7
8 a 2 1 3 1 8
9 a 2 2 2 3 9
10 a 1 4 3 4 10
11 b 4 2 1 2 1
12 b 2 3 2 3 2
13 b 1 3 2 3 3
14 b 1 2 4 2 4
15 b 5 2 5 1 5
16 b 3 3 3 3 6
17 b 2 2 2 1 7
18 b 1 1 4 1 8
19 b 1 4 1 2 9
20 b 5 2 4 2 10
A base R solution to collapse it:
res <- as.data.frame(lapply(split.default(df, sub('._', '_', names(df))), unlist))
rownames(res) <- NULL
# >res
# ABC1_1 ABC1_2 ABC2_1 ABC2_2
# 1 1 4 2 2
# 2 5 5 5 5
# 3 3 5 3 5
# 4 4 4 5 1
# 5 3 2 3 2
# 6 4 5 4 1
# 7 5 5 5 5
# 8 2 1 3 1
# 9 2 2 2 3
# 10 1 4 3 4
# 11 4 2 1 2
# 12 2 3 2 3
# 13 1 3 2 3
# 14 1 2 4 2
# 15 5 2 5 1
# 16 3 3 3 3
# 17 2 2 2 1
# 18 1 1 4 1
# 19 1 4 1 2
# 20 5 2 4 2
identical(df2, res)
# [1] TRUE
If instead you want to combine each pair of columns by summing them (rather than stacking them), rowSums can serve as the combining function:
> as.data.frame(lapply(split.default(df, sub('._', '_', names(df))), rowSums))
ABC1_1 ABC1_2 ABC2_1 ABC2_2
1 5 6 3 4
2 7 8 7 8
3 4 8 5 8
4 5 6 9 3
5 8 4 8 3
6 7 8 7 4
7 7 7 7 6
8 3 2 7 2
9 3 6 3 5
10 6 6 7 6
You could do:
library(tidyverse)
map_dfr(c("a", "b"),
~df %>%
select(contains(.x, ignore.case = FALSE)) %>%
rename_all(funs(str_remove_all(., .x))))
#ABC1_1 ABC1_2 ABC2_1 ABC2_2
#1 1 4 2 2
#2 5 5 5 5
#3 3 5 3 5
#4 4 4 5 1
# ..
Depending on your actual data, you could replace c("a", "b") with letters[1:2] or unique(str_extract(colnames(df), "[a-z]")).
library(tidyr)
library(dplyr)
df %>%
pivot_longer(everything()) %>%
arrange(name) %>%
mutate(name = gsub("[a-z]_", "_", name)) %>%
pivot_wider(values_fn = list) %>%
unchop(everything())
pivot_longer puts all of the column names into a single name column, which you can then edit by removing the lowercase letter preceding the underscore.
When you pivot back to a wide format, columns with the same edited name are grouped automatically. pivot_wider produces list-columns here, and unchop unnests those lists into a longer data frame.
Output
ABC1_1 ABC1_2 ABC2_1 ABC2_2
<dbl> <dbl> <dbl> <dbl>
1 1 4 2 2
2 5 5 5 5
3 3 5 3 5
4 4 4 5 1
5 3 2 3 2
6 4 5 4 1
7 5 5 5 5
8 2 1 3 1
9 2 2 2 3
10 1 4 3 4
11 4 2 1 2
12 2 3 2 3
13 1 3 2 3
14 1 2 4 2
15 5 2 5 1
16 3 3 3 3
17 2 2 2 1
18 1 1 4 1
19 1 4 1 2
20 5 2 4 2
I have a dataframe:
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20)
y <- c(2, 2, 2, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2)
df <- data.frame(x, y)
Now I want to change the values in x, but only for 10% of all values in x where y equals 2. For example:
set.seed(999)
df[sample(which(df$y == 2), round(0.1 * length(which(df$y == 2)))), ]
x y
11 11 2
14 14 2
For exactly these cases I want to add 1000. The result should look like:
x y
1 1 2
2 2 2
3 3 2
4 4 0
5 5 0
6 6 0
7 7 0
8 8 0
9 9 2
10 10 2
11 1011 2
12 12 2
13 13 2
14 1014 2
15 15 2
16 16 2
17 17 2
18 18 2
19 19 2
20 20 2
I am able to select the sub-sample, but I don't know how to write the result back into the dataframe "df" in a neat way. I am grateful for any help!
One simple way using base R could be
#Get indices when y = 2
inds <- df$y == 2
#set.seed(123)
#Get random indices whose value you need to change
inds_to_change <- sample(which(inds), round(0.1 * sum(inds)))
#Change the value
df$x[inds_to_change] <- df$x[inds_to_change] + 1000
df
# x y
#1 1 2
#2 2 2
#3 3 2
#4 4 0
#5 5 0
#6 6 0
#7 7 0
#8 8 0
#9 9 2
#10 1010 2
#11 11 2
#12 12 2
#13 13 2
#14 14 2
#15 15 2
#16 16 2
#17 1017 2
#18 18 2
#19 19 2
#20 20 2
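If you prefer a dplyr phrasing of the same idea, here is a sketch; row_number() flags the sampled rows inside mutate:

```r
library(dplyr)

x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20)
y <- c(2, 2, 2, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2)
df <- data.frame(x, y)

set.seed(999)
# sample 10% of the row indices where y == 2
picked <- sample(which(df$y == 2), round(0.1 * sum(df$y == 2)))

# add 1000 to x only on the sampled rows
df <- df %>%
  mutate(x = ifelse(row_number() %in% picked, x + 1000, x))
```

Which specific rows get picked depends on the seed and R version, but the count (here, 2 of the 15 rows with y == 2) is always the same.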
I am attempting to create a df with a new variable called 'epi' (short for episode), based on the 'days.since.last' variable: whenever the value of 'days.since.last' is greater than 90, I want the episode counter to increase by 1.
Here is the original df
deid session.number days.since.last
1 1 1 0
2 1 2 7
3 1 3 12
4 5 1 0
5 5 2 7
6 5 3 14
7 5 4 93
8 5 5 5
9 5 6 102
10 12 1 0
11 12 2 21
12 12 3 104
13 12 4 4
Created from
help <- data.frame(deid = c(1, 1, 1, 5, 5, 5, 5, 5, 5, 12, 12, 12, 12),
session.number = c(1, 2, 3, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4),
days.since.last = c(0, 7, 12, 0, 7, 14, 93, 5, 102, 0, 21, 104, 4))
This is the output I am hoping to achieve
deid session.number days.since.last epi
1 1 1 0 1
2 1 2 7 1
3 1 3 12 1
4 5 1 0 1
5 5 2 7 1
6 5 3 14 1
7 5 4 93 2
8 5 5 5 2
9 5 6 102 3
10 12 1 0 1
11 12 2 21 1
12 12 3 104 2
13 12 4 4 2
My best attempt is the code below; however, it does not change the first value of each new episode (they remain at 0)...
help$epi <- as.numeric(0)
library(nlme)  # provides gapply()
tmp <- gapply(help, form = ~ deid, FUN = function(x)
{
spanSeq <- rle(x$days.since.last <= 90)$lengths[rle(x$days.since.last <= 90)$values == TRUE]
x$epi[x$days.since.last <= 90] <- rep(seq_along(spanSeq), times = spanSeq)
rm(spanSeq)
x
})
help2 <- do.call("rbind", tmp)
rownames(help2)<-c(1:length(help2$deid))
Any assistance is greatly appreciated!
You could do this with dplyr like this:
library(dplyr)
help %>% group_by(deid) %>% mutate(epi = cumsum(ifelse(days.since.last>90,1,0))+1)
deid session.number days.since.last epi
1 1 1 0 1
2 1 2 7 1
3 1 3 12 1
4 5 1 0 1
5 5 2 7 1
6 5 3 14 1
7 5 4 93 2
8 5 5 5 2
9 5 6 102 3
10 12 1 0 1
11 12 2 21 1
12 12 3 104 2
13 12 4 4 2
Essentially, group_by makes every subsequent operation work per group of your 'deid' variable. We assign a 1 or a 0 to each 'days.since.last' value over 90, then take the cumulative sum of those 1s and 0s. Adding one to that sum gives your desired result.
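For completeness, the same per-group cumulative sum can be written in base R with ave(), a sketch:

```r
help <- data.frame(deid = c(1, 1, 1, 5, 5, 5, 5, 5, 5, 12, 12, 12, 12),
                   session.number = c(1, 2, 3, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4),
                   days.since.last = c(0, 7, 12, 0, 7, 14, 93, 5, 102, 0, 21, 104, 4))

# within each deid: count the days.since.last values over 90 seen so far, plus 1
help$epi <- ave(help$days.since.last, help$deid,
                FUN = function(d) cumsum(d > 90) + 1)
```

ave() applies the function within each deid group and returns the results aligned to the original row order, so no join or rbind is needed afterwards.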