I am analyzing a very large survey in which I want to combine four parts of the survey by merging several sets of 4 questions. Below I have created a small example. A little background: each respondent answered exactly one of q2, q5, q8 or q9, because they only filled in 1 of the 4 parts of the survey, depending on their answer to q1 (not shown here). Therefore, only one of the four columns contains an answer (1 or 2), while the others contain NAs. q2, q5, q8 and q9 are similar questions with the same answer options, which is why I want to combine them: it makes my dataset less wide and easier to analyze further.
q2_1 <- c(NA, NA, NA, NA, NA, NA, rep(c(1:2), 1))
q5_1 <- c(NA, NA, NA, NA, rep(c(1:2), 1), NA, NA)
q8_1 <- c(NA, NA, rep(c(1:2), 1), NA, NA, NA, NA)
q9_1 <- c(rep(c(1:2), 1), NA, NA, NA, NA, NA, NA)
q2_2 <- c(NA, NA, NA, NA, NA, NA, rep(c(1:2), 1))
q5_2 <- c(NA, NA, NA, NA, rep(c(1:2), 1), NA, NA)
q8_2 <- c(NA, NA, rep(c(1:2), 1), NA, NA, NA, NA)
q9_2 <- c(rep(c(1:2), 1), NA, NA, NA, NA, NA, NA)
df <- data.frame(q2_1, q5_1, q8_1, q9_1, q2_2, q5_2, q8_2, q9_2)
df
# running df shows:
q2_1 q5_1 q8_1 q9_1 q2_2 q5_2 q8_2 q9_2
1 NA NA NA 1 NA NA NA 1
2 NA NA NA 2 NA NA NA 2
3 NA NA 1 NA NA NA 1 NA
4 NA NA 2 NA NA NA 2 NA
5 NA 1 NA NA NA 1 NA NA
6 NA 2 NA NA NA 2 NA NA
7 1 NA NA NA 1 NA NA NA
8 2 NA NA NA 2 NA NA NA
My desired end result would be a dataframe with only the columns for questions starting with q2_ (so, in the example that would be q2_1 and q2_2; in reality there are about 20 for this question), but with the NAs replaced by the answers from the corresponding q5_, q8_, and q9_ columns.
# desired end result
q2_1 q2_2
1 1 1
2 2 2
3 1 1
4 2 2
5 1 1
6 2 2
7 1 1
8 2 2
For single questions, I've done this using the code below, but this is very manual, and because q2, q5, q8, and q9 all go up to _20, I'm looking for a way to automate this more.
# example single question
library(tidyverse)
df <- df %>%
mutate(q2_1 = case_when(!is.na(q2_1) ~ q2_1,
!is.na(q5_1) ~ q5_1,
!is.na(q8_1) ~ q8_1,
!is.na(q9_1) ~ q9_1))
I hope I explained myself well enough, and I'm looking forward to some directions!
Here's one way, using coalesce:
df %>%
mutate(q2_1 = do.call(coalesce, across(ends_with('_1'))),
q2_2 = do.call(coalesce, across(ends_with('_2')))) %>%
select(q2_1, q2_2)
#> q2_1 q2_2
#> 1 1 1
#> 2 2 2
#> 3 1 1
#> 4 2 2
#> 5 1 1
#> 6 2 2
#> 7 1 1
#> 8 2 2
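Since the real survey goes up to _20, the same coalesce idea can be looped over the suffixes instead of writing one mutate line per item. The following is only a sketch under assumptions: combine_suffix and n_items are hypothetical names that are not in the question, and it assumes every column ending in a given suffix belongs to the same item.

```r
library(dplyr)

# Example data from the question (two items instead of twenty)
df <- data.frame(
  q2_1 = c(NA, NA, NA, NA, NA, NA, 1:2),
  q5_1 = c(NA, NA, NA, NA, 1:2, NA, NA),
  q8_1 = c(NA, NA, 1:2, NA, NA, NA, NA),
  q9_1 = c(1:2, NA, NA, NA, NA, NA, NA),
  q2_2 = c(NA, NA, NA, NA, NA, NA, 1:2),
  q5_2 = c(NA, NA, NA, NA, 1:2, NA, NA),
  q8_2 = c(NA, NA, 1:2, NA, NA, NA, NA),
  q9_2 = c(1:2, NA, NA, NA, NA, NA, NA)
)

n_items <- 2  # assumption: set this to 20 for the real survey
suffixes <- paste0("_", seq_len(n_items))

# Hypothetical helper: coalesce all columns whose name ends in `suffix`
combine_suffix <- function(data, suffix) {
  cols <- data[endsWith(names(data), suffix)]
  do.call(coalesce, unname(as.list(cols)))
}

# One combined column per suffix, named q2_1, q2_2, ...
combined <- as.data.frame(
  lapply(setNames(suffixes, paste0("q2", suffixes)),
         function(s) combine_suffix(df, s))
)
combined
```

Because endsWith() compares the whole suffix, "_1" does not accidentally pick up columns such as q2_11.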
q2_1 <- c(NA, NA, NA, NA, NA, NA, rep(c(1:2), 1))
q5_1 <- c(NA, NA, NA, NA, rep(c(1:2), 1), NA, NA)
q8_1 <- c(NA, NA, rep(c(1:2), 1), NA, NA, NA, NA)
q9_1 <- c(rep(c(1:2), 1), NA, NA, NA, NA, NA, NA)
q2_2 <- c(NA, NA, NA, NA, NA, NA, rep(c(1:2), 1))
q5_2 <- c(NA, NA, NA, NA, rep(c(1:2), 1), NA, NA)
q8_2 <- c(NA, NA, rep(c(1:2), 1), NA, NA, NA, NA)
q9_2 <- c(rep(c(1:2), 1), NA, NA, NA, NA, NA, NA)
df <- data.frame(q2_1, q5_1, q8_1, q9_1, q2_2, q5_2, q8_2, q9_2)
df
#> q2_1 q5_1 q8_1 q9_1 q2_2 q5_2 q8_2 q9_2
#> 1 NA NA NA 1 NA NA NA 1
#> 2 NA NA NA 2 NA NA NA 2
#> 3 NA NA 1 NA NA NA 1 NA
#> 4 NA NA 2 NA NA NA 2 NA
#> 5 NA 1 NA NA NA 1 NA NA
#> 6 NA 2 NA NA NA 2 NA NA
#> 7 1 NA NA NA 1 NA NA NA
#> 8 2 NA NA NA 2 NA NA NA
library(tidyverse)
suffix <- str_c("_", 1:2)
map_dfc(.x = suffix,
.f = ~ transmute(df, !!str_c("q2", .x) := rowSums(across(ends_with(.x)), na.rm = TRUE)))
#> q2_1 q2_2
#> 1 1 1
#> 2 2 2
#> 3 1 1
#> 4 2 2
#> 5 1 1
#> 6 2 2
#> 7 1 1
#> 8 2 2
Created on 2022-04-04 by the reprex package (v2.0.1)
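One caveat worth checking before relying on rowSums here: it only matches coalesce when every row has exactly one non-NA answer. A small sketch with made-up toy columns (a_1 and b_1 are not from the survey) shows where the two differ:

```r
library(dplyr)

# Toy data (hypothetical): row 3 has no answer, row 4 has two answers
toy <- data.frame(a_1 = c(1, NA, NA, 2),
                  b_1 = c(NA, 2, NA, 1))

by_sum      <- rowSums(toy, na.rm = TRUE)   # all-NA row becomes 0, double answers add up
by_coalesce <- coalesce(toy$a_1, toy$b_1)   # all-NA row stays NA, first answer wins

data.frame(toy, by_sum, by_coalesce)
```

With the survey described in the question (one part per respondent) both give the same result, so this only matters if some respondents skipped the item or answered more than one part.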
Related
I am analyzing panel data with R now, and the data format is as follows.
pid wave edu marri rela age apt sido dongy urban stat1 stat2 exer dep3 bmi mmse
1 3122 1 2 <NA> NA NA <NA> NA <NA> <NA> <NA> <NA> <NA> <NA> NA <NA>
2 3122 1 NA 1 NA NA <NA> NA <NA> <NA> <NA> <NA> <NA> <NA> NA <NA>
3 3122 1 NA <NA> 3 NA <NA> NA <NA> <NA> <NA> <NA> <NA> <NA> NA <NA>
4 3122 1 NA <NA> NA 71 <NA> NA <NA> <NA> <NA> <NA> <NA> <NA> NA <NA>
5 3122 1 NA <NA> NA NA 1 NA <NA> <NA> <NA> <NA> <NA> <NA> NA <NA>
6 3122 1 NA <NA> NA NA <NA> 11 <NA> <NA> <NA> <NA> <NA> <NA> NA <NA>
The data are repeated measurements, and there are many missing values. If I keep only the complete observations for every year, I lose too many rows, so I want to select and analyze only those subjects whose 'mmse' variable has been measured more than once.
I tried to check the change of the variable of interest through the following code, but it didn't work.
df %>%
arrange(pid, wave) %>%
group_by(pid) %>%
mutate(
mmse_change = mmse - lag(mmse),
mmse_increase = mmse_change > 0,
mmse_decrease = mmse_change < 0
)
I need the above object to analyze the baseline characteristic. How can I extract subjects with this condition?
We could do something like this:
df %>%
filter(!is.na(mmse)) %>% # just keep rows with non-NA in mmse
count(pid) %>% # count how many observations per pid
filter(n > 1) %>% # keep those pid's appearing more than once
select(pid) %>% # just keep the pid column
left_join(df) # get `df` for just those pid's
Another approach without a join is to group_by(pid) and then keep only the groups where max(row_number()) > 1, i.e. groups with more than one row.
Below I changed your initial data so that it can be used for this problem (your original data has only NAs in mmse; next time, please provide your data as reproducible code).
library(tidyverse)
# initial data slightly changed:
df <- tribble(~pid, ~wave, ~edu, ~marri, ~rela, ~age, ~apt, ~sido, ~dongy, ~urban, ~stat1, ~stat2, ~exer, ~dep3, ~bmi, ~mmse,
3122 , 1, 2, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1,
3122 , 1, NA, 1, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
3122 , 1, NA, NA, 3, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2,
3122 , 1, NA, NA, NA, 71, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
3122 , 1, NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, NA, NA, NA, 3,
3124 , 1, NA, NA, NA, NA, NA, 11, NA, NA, NA, NA, NA, NA, NA, 5)
df %>%
filter(!is.na(mmse)) %>%
group_by(pid) %>%
filter(max(row_number()) > 1) %>%
ungroup()
#> # A tibble: 3 x 16
#> pid wave edu marri rela age apt sido dongy urban stat1 stat2 exer
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <lgl> <lgl> <lgl> <lgl>
#> 1 3122 1 2 NA NA NA NA NA NA NA NA NA NA
#> 2 3122 1 NA NA 3 NA NA NA NA NA NA NA NA
#> 3 3122 1 NA NA NA NA 1 NA NA NA NA NA NA
#> # ... with 3 more variables: dep3 <lgl>, bmi <lgl>, mmse <dbl>
Created on 2022-09-21 by the reprex package (v2.0.1)
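Note that filtering on !is.na(mmse) first also drops the rows of a qualifying subject where mmse is missing. If all rows of those subjects should be kept (as the left_join version does), the count can go inside the grouped filter. This is a sketch on made-up toy data, not the original panel:

```r
library(dplyr)

# Toy panel (hypothetical values): pid 1 has two mmse measurements, pid 2 and 3 have one
toy <- tibble(
  pid  = c(1, 1, 1, 2, 2, 3),
  wave = c(1, 2, 3, 1, 2, 1),
  mmse = c(25, NA, 27, NA, 30, 28)
)

kept <- toy %>%
  group_by(pid) %>%
  filter(sum(!is.na(mmse)) > 1) %>%  # subjects with more than one non-missing mmse
  ungroup()

kept
```

Here only pid 1 survives, and its wave-2 row with the missing mmse is still present for the baseline description.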
I have the following data frame:
data <- structure(list(Date = structure(c(-17897, -17896, -17895, -17894,
-17893, -17892, -17891, -17890, -17889, -17888, -17887, -17887,
-17886, -17885, -17884, -17883, -17882, -17881, -17880, -17879,
-17878, -17877, -17876, -17875, -17874, -17873, -17872, -17871,
-17870, -17869, -17868, -17867, -17866, -17865, -17864), class = "Date"),
duration = c(NA, NA, NA, 5, NA, NA, NA, 5, NA, NA, 1, 1,
NA, NA, 3, NA, 3, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, 4, NA, NA, 4, NA, NA), name = c(NA, NA, NA, "Date_beg",
NA, NA, NA, "Date_end", NA, NA, "Date_beg", "Date_end", NA,
NA, "Date_beg", NA, "Date_end", NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, "Date_beg", NA, NA, "Date_end", NA, NA
)), row.names = c(NA, -35L), class = c("tbl_df", "tbl", "data.frame"
))
And looks like:
Date duration name
<date> <dbl> <chr>
1 1921-01-01 NA NA
2 1921-01-02 NA NA
3 1921-01-03 NA NA
4 1921-01-04 5 Date_beg
5 1921-01-05 NA NA
6 1921-01-06 NA NA
7 1921-01-07 NA NA
8 1921-01-08 5 Date_end
9 1921-01-09 NA NA
10 1921-01-10 NA NA
...
I want to replace the NA values in column name that are between rows with Date_beg and Date_end with the word "event".
I have tried this:
data %<>% mutate(name = ifelse(((lag(name) == 'Date_beg')|(lag(name) == 'event')) &
But only the first row after Date_beg changes. It is quite easy with a for-loop, but I wanted to use a more R-like method.
There is probably a better way using data.table::nafill, but as you're using tidyverse functions, I would do it by creating an extra event column using tidyr::fill and then pulling it through to the name column where name is NA:
library(dplyr)
library(tidyr)
data %>%
mutate(
events = ifelse(
fill(data, name)$name == "Date_beg",
"event",
NA),
name = coalesce(name, events)
) %>%
select(-events)
You can do it by looking at the indices where there have been more "Date_beg" than "Date_end" so far (lag() here is dplyr::lag, so dplyr needs to be loaded):
data$name[lag(cumsum(data$name == "Date_beg" & !is.na(data$name))) -
cumsum(data$name == "Date_end" & !is.na(data$name)) >0] <- "event"
print(data, n=20)
# # A tibble: 35 x 3
# Date duration name
# <date> <dbl> <chr>
# 1 1921-01-01 NA NA
# 2 1921-01-02 NA NA
# 3 1921-01-03 NA NA
# 4 1921-01-04 5 Date_beg
# 5 1921-01-05 NA event
# 6 1921-01-06 NA event
# 7 1921-01-07 NA event
# 8 1921-01-08 5 Date_end
# 9 1921-01-09 NA NA
# 10 1921-01-10 NA NA
# 11 1921-01-11 1 Date_beg
# 12 1921-01-11 1 Date_end
# 13 1921-01-12 NA NA
# 14 1921-01-13 NA NA
# 15 1921-01-14 3 Date_beg
# 16 1921-01-15 NA event
# 17 1921-01-16 3 Date_end
# 18 1921-01-17 NA NA
# 19 1921-01-18 NA NA
# 20 1921-01-19 NA NA
# # ... with 15 more rows
Lagging the first index by one is required so that you don't overwrite the "Date_beg" at the start of each run.
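For anyone who wants the same logic without dplyr, the lag can be written by hand. This is a sketch on a short made-up vector rather than the full data; inside and lag_opened are hypothetical names:

```r
# Short toy version of the name column (not the full data from the question)
name <- c(NA, "Date_beg", NA, NA, "Date_end", NA, "Date_beg", "Date_end", NA)

is_beg <- !is.na(name) & name == "Date_beg"
is_end <- !is.na(name) & name == "Date_end"

# Number of events opened before the current row (manual lag, starting at 0)
lag_opened <- c(0, head(cumsum(is_beg), -1))
# Number of events closed up to and including the current row
closed <- cumsum(is_end)

# Rows strictly inside an open event
inside <- lag_opened - closed > 0
name[inside] <- "event"
name
```

Starting the lagged count at 0 instead of NA also avoids NA values in the logical index.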
Another dplyr approach using the cumsum function.
If the value in the name column is NA, it adds 0 to the cumsum, otherwise it adds 1. Therefore the values from a Date_beg row onwards will always be odd numbers (0 + 1) and the values from a Date_end row onwards will always be even numbers (0 + 1 + 1). Then replace the rows that are odd in the ref column AND NA in the name column with "event".
library(dplyr)
data %>%
mutate(ref = cumsum(ifelse(is.na(name), 0, 1)),
name = ifelse(ref %% 2 == 1 & is.na(name), "event", name)) %>%
select(-ref)
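To see why the parity trick works, it can help to print ref next to name on a short vector. This is a sketch on made-up values, and it assumes Date_beg and Date_end always alternate, starting with a Date_beg:

```r
# Short toy version of the name column (not the full data from the question)
name <- c(NA, "Date_beg", NA, NA, "Date_end", NA, "Date_beg", "Date_end", NA)

# Running count of non-NA markers: odd while an event is open, even otherwise
ref <- cumsum(!is.na(name))

# Fill only the NA rows that fall in an odd stretch
filled <- ifelse(ref %% 2 == 1 & is.na(name), "event", name)

data.frame(name, ref, filled)
```

If a Date_end were ever missing, every later stretch would flip parity, so it is worth checking that the number of markers is even before trusting the result.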
I am using R.
I have 4 different databases. Each one has values for my variables. Some of the databases have more values than others, so I want to use the one that has the most values first and the one that has the fewest values last. The data looks like this...
Variables A B C D
John 2 4
Mike 6
Walter 7
Jennifer 9 8
Amanda 3
Carlos 9
Michael 3
James 5
Kevin 4
Dennis 7
Frank
Steven
Joseph
Elvis 2
Maria 1
So, in order to fill the data, I need to create a new column that first uses the data of column B, because it is the one that contains the most values, then A, then C and then D; the ones that are still missing need to be NA. I also need to add another column that gives me the source of the data. In other words, if I am using column B for John's value, I need a column that tells me that the value comes from column B.
The column should look like this...
Variables E D
John 4 B
Mike 6 B
Walter 7 B
Jennifer 9 B
Amanda 3 A
Carlos 9 A
Michael 3 B
James 5 D
Kevin 4 A
Dennis 7 C
Frank NA NA
Steven NA NA
Joseph NA NA
Elvis 2 B
Maria 1 B
With tidyverse you can do the following...
Use pivot_longer to put into long form. Make name an ordered factor by "B", "A", "C", and "D". Then when you arrange, you can get the first value by this order within each person's name.
This assumes your missing data are NA. If they are instead blank character values, you can filter those out with filter(value != "") instead of drop_na(value).
library(tidyverse)
df %>%
pivot_longer(cols = -Variables) %>%
mutate(name = ordered(name, levels = c('B', 'A', 'C', 'D'))) %>%
group_by(Variables) %>%
drop_na(value) %>%
arrange(name) %>%
summarise(E = first(value),
New_D = first(name)) %>%
right_join(df)
Output
Variables E New_D A B C D
<chr> <dbl> <ord> <dbl> <dbl> <dbl> <dbl>
1 Amanda 3 A 3 NA NA NA
2 Carlos 9 A 9 NA NA NA
3 Dennis 7 C NA NA 7 NA
4 Elvis 2 B NA 2 NA NA
5 James 5 D NA NA NA 5
6 Jennifer 9 B NA 9 8 NA
7 John 4 B 2 4 NA NA
8 Kevin 4 A 4 NA NA NA
9 Maria 1 B NA 1 NA NA
10 Michael 3 B NA 3 NA NA
11 Mike 6 B NA 6 NA NA
12 Walter 7 B NA 7 NA NA
13 Frank NA NA NA NA NA NA
14 Steven NA NA NA NA NA NA
15 Joseph NA NA NA NA NA NA
Data
df <- structure(list(Variables = c("John", "Mike", "Walter", "Jennifer",
"Amanda", "Carlos", "Michael", "James", "Kevin", "Dennis", "Frank",
"Steven", "Joseph", "Elvis", "Maria"), A = c(2, NA, NA, NA, 3,
9, NA, NA, 4, NA, NA, NA, NA, NA, NA), B = c(4, 6, 7, 9, NA,
NA, 3, NA, NA, NA, NA, NA, NA, 2, 1), C = c(NA, NA, NA, 8, NA,
NA, NA, NA, NA, 7, NA, NA, NA, NA, NA), D = c(NA, NA, NA, NA,
NA, NA, NA, 5, NA, NA, NA, NA, NA, NA, NA)), class = "data.frame", row.names = c(NA,
-15L))
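If reshaping to long form feels heavy, the same priority rule can also be sketched with coalesce plus max.col. This is only an illustration on a few made-up rows; priority, E and Source are names chosen here, not something from the question's real data:

```r
library(dplyr)

# A few toy rows in the same layout as the question
toy <- data.frame(
  Variables = c("John", "Amanda", "Dennis", "James", "Frank"),
  A = c(2, 3, NA, NA, NA),
  B = c(4, NA, NA, NA, NA),
  C = c(NA, NA, 7, NA, NA),
  D = c(NA, NA, NA, 5, NA)
)

priority <- c("B", "A", "C", "D")  # assumption: most complete column first
vals <- toy[priority]

# First non-NA value in priority order
toy$E <- do.call(coalesce, unname(as.list(vals)))

# Column that value came from; rows with no value at all get NA
first_hit <- max.col(!is.na(vals), ties.method = "first")
toy$Source <- ifelse(is.na(toy$E), NA_character_, priority[first_hit])

toy[c("Variables", "E", "Source")]
```

The priority vector is hard-coded here; it could instead be derived from the data, for example by sorting the columns on their number of non-NA values.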
I have a df that looks something like this:
id <- c(1:8)
born.swis <- c(0, 1, NA, NA, NA, 2, NA, NA)
born2005 <- c(NA, NA, 2, NA, NA, NA, NA, NA)
born2006 <- c(NA, NA, NA, 1, NA, NA, NA, NA)
born2007 <- c(NA, NA, NA, NA, NA, NA, NA, 1)
born2008 <- c(NA, NA, NA, NA, NA, NA, 2, NA)
born2009 <- c(NA, NA, NA, NA, NA, NA, NA, NA)
df <- data.frame(id, born.swis, born2005, born2006, born2007, born2008, born2009)
I'm trying to mutate born.swis based on the values of the other variables. Basically, I want born.swis to be filled with the value of one of the other variables IF born.swis is NA and IF that other variable is not NA. Something like this:
id <- c(1:8)
born.swis <- c(0, 1, 2, 1, NA, 2, 2,1)
df.desired <- data.frame(id, born.swis)
I tried several things with mutate and ifelse, like this:
df <- df%>%
mutate(born.swis = ifelse(is.na(born.swis), born2005, NA,
ifelse(is.na(born.swis), born2006, NA,
ifelse(is.na(born.swis), born2007, NA,
ifelse(is.na(born.swis), born2008, NA,
ifelse(is.na(born.swis), born2009, NA,)
)))))
and similar things, but I'm not able to reach my desired outcome.
Any ideas?
Many thanks!
One dplyr option could be:
df %>%
mutate(born.swis_res = coalesce(!!!select(., starts_with("born"))))
id born.swis born2005 born2006 born2007 born2008 born2009 born.swis_res
1 1 0 NA NA NA NA NA 0
2 2 1 NA NA NA NA NA 1
3 3 NA 2 NA NA NA NA 2
4 4 NA NA 1 NA NA NA 1
5 5 NA NA NA NA NA NA NA
6 6 2 NA NA NA NA NA 2
7 7 NA NA NA NA 2 NA 2
8 8 NA NA NA 1 NA NA 1
Or with dplyr 1.0.0:
df %>%
mutate(born.swis_res = Reduce(coalesce, across(starts_with("born"))))
In base R, you can use max.col :
df[cbind(1:nrow(df), max.col(!is.na(df[-1])) + 1 )]
#[1] 0 1 2 1 NA 2 2 1
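max.col gives the column position of a non-NA value in each row (excluding the first column); we then create a matrix of row and column indices and use it to subset df. Here each row has at most one non-NA value, so the default ties.method = "random" is harmless; use ties.method = "first" if a row can contain several.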
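When a row can hold more than one non-NA value, the tie-breaking rule of max.col matters. A small sketch on made-up data (m, pos and res are hypothetical names) with ties.method = "first":

```r
# Toy data (hypothetical): row 1 has two non-NA values, row 3 has none
m <- data.frame(id = 1:3,
                a = c(NA, 5, NA),
                b = c(7, 6, NA),
                c = c(8, NA, NA))

# Position of the first non-NA column among a, b, c
# (for an all-NA row this falls back to the first column, which holds NA)
pos <- max.col(!is.na(m[-1]), ties.method = "first")

# +1 because the id column was excluded when computing pos
res <- m[cbind(seq_len(nrow(m)), pos + 1)]
res
```

With the default ties.method = "random", row 1 could return either 7 or 8 from run to run.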
base R
df$born.swis <- apply(df[-1], 1, function(x) ifelse(all(is.na(x)), NA, sum(x, na.rm = T)))
My data is a bit unusual, and I need to find the order of fields/columns that follows a certain pattern.
For instance, one field (say Sub3) has values for the first few rows, followed by NULL values; then another field (like Sub1) continues with some values, again followed by NULL values.
In some cases multiple fields have values in the same rows (like Sub2 and Sub4).
In the case below, the solution is the vector of field names that follows the pattern: c(Sub3,Sub1,c(Sub2,Sub4),Sub5)
Here is the data in reproducible format.
structure(list(RollNo = 1:10, Sub1 = c(NA, NA, NA, NA, 3L, 2L,
NA, NA, NA, NA), Sub2 = c(NA, NA, NA, NA, NA, NA, "A", "B", NA,
NA), Sub3 = c(4L, 3L, 5L, 6L, NA, NA, NA, NA, NA, NA), Sub4 = c(NA,
NA, NA, NA, NA, NA, 2L, 5L, NA, NA), Sub5 = c(NA, NA, NA, NA,
NA, NA, NA, NA, 7L, NA)), .Names = c("RollNo", "Sub1", "Sub2",
"Sub3", "Sub4", "Sub5"), row.names = c(NA, -10L), class = c("data.table",
"data.frame"))
Sounds like you are sorting on the order of first non-NA data. If df is your data:
sapply(df, function(x) min(Inf, head(which(!is.na(x)),n=1)))
# RollNo Sub1 Sub2 Sub3 Sub4 Sub5
# 1 5 7 1 7 9
This gives the first non-NA row for each column. Ordering on these values should be a stable sort, meaning tied columns keep their original relative order.
There are a couple of ways to use this, one such:
order(sapply(df, function(x) min(Inf, head(which(!is.na(x)),n=1))))
# [1] 1 4 2 3 5 6
df[,order(sapply(df, function(x) min(Inf, head(which(!is.na(x)),n=1))))]
# RollNo Sub3 Sub1 Sub2 Sub4 Sub5
# 1 1 4 NA <NA> NA NA
# 2 2 3 NA <NA> NA NA
# 3 3 5 NA <NA> NA NA
# 4 4 6 NA <NA> NA NA
# 5 5 NA 3 <NA> NA NA
# 6 6 NA 2 <NA> NA NA
# 7 7 NA NA A 2 NA
# 8 8 NA NA B 5 NA
# 9 9 NA NA <NA> NA 7
# 10 10 NA NA <NA> NA NA
I'm inferring from the column names that RollNo should always be first, so:
df[,c(1, 1 + order(sapply(df[-1], function(x) min(Inf, head(which(!is.na(x)),n=1)))))]
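If the goal is the grouped vector from the question rather than a reordered data frame, the first non-NA row can also be used as a grouping key. This is a base R sketch; it rebuilds the question's data as a plain data.frame and assumes columns belong together exactly when they start on the same row:

```r
# The question's data as a plain data.frame
df <- data.frame(
  RollNo = 1:10,
  Sub1 = c(NA, NA, NA, NA, 3L, 2L, NA, NA, NA, NA),
  Sub2 = c(NA, NA, NA, NA, NA, NA, "A", "B", NA, NA),
  Sub3 = c(4L, 3L, 5L, 6L, NA, NA, NA, NA, NA, NA),
  Sub4 = c(NA, NA, NA, NA, NA, NA, 2L, 5L, NA, NA),
  Sub5 = c(NA, NA, NA, NA, NA, NA, NA, NA, 7L, NA)
)

# First non-NA row of every Sub column
first_row <- sapply(df[-1], function(x) which(!is.na(x))[1])

# Group column names by their starting row; groups come out in increasing row order
groups <- split(names(first_row), first_row)
pattern <- unname(sapply(groups, paste, collapse = ","))
pattern
```

A column that is entirely NA would get first_row = NA and be silently dropped by split(), which may or may not be what you want.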
Using:
DT[, nms := paste0(names(.SD)[!is.na(.SD)], collapse = ','), 1:nrow(DT), .SDcols = 2:6]
will get you:
> DT
RollNo Sub1 Sub2 Sub3 Sub4 Sub5 nms
1: 1 NA NA 4 NA NA Sub3
2: 2 NA NA 3 NA NA Sub3
3: 3 NA NA 5 NA NA Sub3
4: 4 NA NA 6 NA NA Sub3
5: 5 3 NA NA NA NA Sub1
6: 6 2 NA NA NA NA Sub1
7: 7 NA A NA 2 NA Sub2,Sub4
8: 8 NA B NA 5 NA Sub2,Sub4
9: 9 NA NA NA NA 7 Sub5
10: 10 NA NA NA NA NA
If you just want the specified vector:
unique(DT[, paste0(names(.SD)[!is.na(.SD)], collapse = ','), 1:nrow(DT), .SDcols = 2:6][V1!='']$V1)
which returns:
[1] "Sub3" "Sub1" "Sub2,Sub4" "Sub5"
As @Frank pointed out in the comments, you can also use:
melt(DT, id=1, na.rm = TRUE)[, toString(unique(variable)), by = RollNo][order(RollNo)]