I'm a fairly new R user, trying to teach myself from forums, videos, and trial and error. I have a very large dataset and would like to calculate the number of household members who are considered children (aged under 18). I have a column for the number of household members, as well as 11 columns for each household member's age. My initial thought was to select those who are under 18 and subtract from the total household members. I've tried a few different lines of code unsuccessfully and I'm not sure how best to go about this. Any help is greatly appreciated!
There are a few ways to do this. I'm using something called a datastep from the libr package.
First, here is your data:
df <- data.frame(num_hhmem = c(6, 4, 4, 5, 4, NA, 8, NA),
ChildAge = c(9, 8, 10, 10, 9, NA, 8, NA),
hhm2_Age = c(36, 44, 52, 40, 33, NA, 37, NA),
hhm3_Age = c(34, 16, 53, 15, 15, NA, 39, NA),
hhm4_Age = c(15, 10, 92, 17, 11, NA, NA, NA),
hhm5_Age = c(7, NA, NA, 20, NA, NA, 10, NA),
hhm6_Age = c(11, NA, NA, NA, NA, NA, 6, NA),
hhm7_Age = c(NA, NA, NA, NA, NA, NA, 68, NA),
hhm8_Age = c(NA, NA, NA, NA, NA, NA, 78, NA),
hhm9_Age = c(NA, NA, NA, NA, NA, NA, NA, NA))
Then I set up the datastep with an array for the columns you want to iterate over. I also set up a childCount variable with a starting value of 0. The datastep loops through the data frame row by row, so you just iterate through the array and add any children to the childCount variable.
library(libr)
res <- datastep(df,
arrays = list(ages = dsarray("ChildAge", "hhm2_Age", "hhm3_Age",
"hhm4_Age", "hhm5_Age", "hhm6_Age",
"hhm7_Age", "hhm8_Age", "hhm9_Age")),
calculate = { childCount <- 0 },
drop = "age",
{
for(age in ages) {
if (!is.na(ages[age])) {
if (ages[age] < 18)
childCount <- childCount + 1
}
}
})
Here are the results:
res
# num_hhmem ChildAge hhm2_Age hhm3_Age hhm4_Age hhm5_Age hhm6_Age hhm7_Age hhm8_Age hhm9_Age childCount
# 1 6 9 36 34 15 7 11 NA NA NA 4
# 2 4 8 44 16 10 NA NA NA NA NA 3
# 3 4 10 52 53 92 NA NA NA NA NA 1
# 4 5 10 40 15 17 20 NA NA NA NA 3
# 5 4 9 33 15 11 NA NA NA NA NA 3
# 6 NA NA NA NA NA NA NA NA NA NA 0
# 7 8 8 37 39 NA 10 6 68 78 NA 3
# 8 NA NA NA NA NA NA NA NA NA NA 0
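For comparison, the same count can be sketched in base R without a row loop: compare the age columns to 18 in one step and sum the TRUE values per row. This is only a minimal sketch on a cut-down copy of the data; the names `ages` and `age_cols` are illustrative and not part of the answer above.

```r
# Cut-down copy of the example data (three households, three age columns)
ages <- data.frame(num_hhmem = c(6, 4, NA),
                   ChildAge  = c(9, 8, NA),
                   hhm2_Age  = c(36, 44, NA),
                   hhm3_Age  = c(34, 16, NA))

# Every column except the household-size column holds an age
age_cols <- setdiff(names(ages), "num_hhmem")

# ages[age_cols] < 18 is a logical matrix; na.rm = TRUE skips missing ages
ages$childCount <- rowSums(ages[age_cols] < 18, na.rm = TRUE)

# Children per household: 1, 2 and 0
ages$childCount
```

Because the comparison is vectorised over the whole table, this tends to scale better on a very large dataset than processing one row at a time.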
Here is another potential solution using tidyverse functions and the data formatted by @David J. Bosak:
df1 <- data.frame(num_hhmem = c(6, 4, 4, 5, 4, NA, 8, NA),
ChildAge = c(9, 8, 10, 10, 9, NA, 8, NA),
hhm2_Age = c(36, 44, 52, 40, 33, NA, 37, NA),
hhm3_Age = c(34, 16, 53, 15, 15, NA, 39, NA),
hhm4_Age = c(15, 10, 92, 17, 11, NA, NA, NA),
hhm5_Age = c(7, NA, NA, 20, NA, NA, 10, NA),
hhm6_Age = c(11, NA, NA, NA, NA, NA, 6, NA),
hhm7_Age = c(NA, NA, NA, NA, NA, NA, 68, NA),
hhm8_Age = c(NA, NA, NA, NA, NA, NA, 78, NA),
hhm9_Age = c(NA, NA, NA, NA, NA, NA, NA, NA))
library(dplyr)
df2 <- df1 %>%
  rowwise() %>%
  # count ages strictly under 18, ignoring missing ages
  mutate(total_kids = rowSums(across(-c(num_hhmem), ~sum(.x < 18, na.rm = TRUE))))
df2
#> # A tibble: 8 × 11
#> # Rowwise:
#> num_hhmem ChildAge hhm2_Age hhm3_Age hhm4_Age hhm5_Age hhm6_Age hhm7_Age hhm8_Age hhm9_Age total_kids
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <dbl>
#> 1 6 9 36 34 15 7 11 NA NA NA 4
#> 2 4 8 44 16 10 NA NA NA NA NA 3
#> 3 4 10 52 53 92 NA NA NA NA NA 1
#> 4 5 10 40 15 17 20 NA NA NA NA 3
#> 5 4 9 33 15 11 NA NA NA NA NA 3
#> 6 NA NA NA NA NA NA NA NA NA NA 0
#> 7 8 8 37 39 NA 10 6 68 78 NA 3
#> 8 NA NA NA NA NA NA NA NA NA NA 0
Or, if you just want the counts in a dataframe on their own:
df3 <- df1 %>%
  rowwise() %>%
  summarise(total_kids = rowSums(across(-c(num_hhmem), ~sum(.x < 18, na.rm = TRUE))))
df3
#> # A tibble: 8 × 1
#> total_kids
#> <dbl>
#> 1 4
#> 2 3
#> 3 1
#> 4 3
#> 5 3
#> 6 0
#> 7 3
#> 8 0
I am analyzing panel data with R now, and the data format is as follows.
pid wave edu marri rela age apt sido dongy urban stat1 stat2 exer dep3 bmi mmse
1 3122 1 2 <NA> NA NA <NA> NA <NA> <NA> <NA> <NA> <NA> <NA> NA <NA>
2 3122 1 NA 1 NA NA <NA> NA <NA> <NA> <NA> <NA> <NA> <NA> NA <NA>
3 3122 1 NA <NA> 3 NA <NA> NA <NA> <NA> <NA> <NA> <NA> <NA> NA <NA>
4 3122 1 NA <NA> NA 71 <NA> NA <NA> <NA> <NA> <NA> <NA> <NA> NA <NA>
5 3122 1 NA <NA> NA NA 1 NA <NA> <NA> <NA> <NA> <NA> <NA> NA <NA>
6 3122 1 NA <NA> NA NA <NA> 11 <NA> <NA> <NA> <NA> <NA> <NA> NA <NA>
The data are repeated measurements, and there are many missing values. If I keep only the rows that are fully observed in every year, too many observations are lost, so I want to select and analyze only the subjects whose 'mmse' variable has been measured more than once.
I tried to check the change of the variable of interest through the following code, but it didn't work.
df %>%
arrange(pid, wave) %>%
group_by(pid) %>%
mutate(
mmse_change = mmse - lag(mmse),
mmse_increase = mmse_change > 0,
mmse_decrease = mmse_change < 0
)
I need the above object to analyze the baseline characteristic. How can I extract subjects with this condition?
We could do something like this:
df %>%
filter(!is.na(mmse)) %>% # just keep rows with non-NA in mmse
count(pid) %>% # count how many observations per pid
filter(n > 1) %>% # keep those pid's appearing more than once
select(pid) %>% # just keep the pid column
left_join(df) # get `df` for just those pid's
Another approach without join is to group_by(pid) and then filter all groups where max(row_number()) > 1.
Below I changed your initial data so that it can be used for this problem (your original data has only NAs in mmse; next time, please provide your data as reproducible code).
library(tidyverse)
# initial data slightly changed:
df <- tribble(~pid, ~wave, ~edu, ~marri, ~rela, ~age, ~apt, ~sido, ~dongy, ~urban, ~stat1, ~stat2, ~exer, ~dep3, ~bmi, ~mmse,
3122 , 1, 2, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1,
3122 , 1, NA, 1, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
3122 , 1, NA, NA, 3, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2,
3122 , 1, NA, NA, NA, 71, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
3122 , 1, NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, NA, NA, NA, 3,
3124 , 1, NA, NA, NA, NA, NA, 11, NA, NA, NA, NA, NA, NA, NA, 5)
df %>%
filter(!is.na(mmse)) %>%
group_by(pid) %>%
filter(max(row_number()) > 1) %>%
ungroup()
#> # A tibble: 3 x 16
#> pid wave edu marri rela age apt sido dongy urban stat1 stat2 exer
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <lgl> <lgl> <lgl> <lgl>
#> 1 3122 1 2 NA NA NA NA NA NA NA NA NA NA
#> 2 3122 1 NA NA 3 NA NA NA NA NA NA NA NA
#> 3 3122 1 NA NA NA NA 1 NA NA NA NA NA NA
#> # ... with 3 more variables: dep3 <lgl>, bmi <lgl>, mmse <dbl>
Created on 2022-09-21 by the reprex package (v2.0.1)
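Note that both answers drop the rows where mmse is NA. If the remaining rows of the selected subjects are still needed (for baseline characteristics, say), one option is to count the non-missing mmse values per pid and filter on that count instead. A minimal base R sketch, using made-up toy data and a hypothetical helper column `n_mmse`:

```r
# Toy panel: subject 1 has two mmse measurements, subjects 2 and 3 have one each
panel <- data.frame(pid  = c(1, 1, 1, 2, 2, 3),
                    wave = c(1, 2, 3, 1, 2, 1),
                    mmse = c(28, NA, 27, NA, 25, 30))

# Number of non-missing mmse measurements per subject, repeated on every row
panel$n_mmse <- ave(as.integer(!is.na(panel$mmse)), panel$pid, FUN = sum)

# Keep all rows (including the NA rows) of subjects measured more than once
kept <- panel[panel$n_mmse > 1, ]
unique(kept$pid)
```

In dplyr terms the equivalent idea would be a grouped filter on `sum(!is.na(mmse)) > 1`.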
I am analyzing a very large survey in which I want to combine four parts of the survey, through several combinations of 4 questions. Below I have created a small example. A little background: a respondent answered either q2, q5, q8 or q9, because they only filled in 1 of 4 parts of the survey based on their answer to q1 (not shown here). Therefore, only one of the four columns contains an answer (1 or 2), while the others contain NAs. q2, q5, q8 and q9 are similar questions with the same answer options, which is why I want to combine them to make my dataset less wide and easier to analyze further.
q2_1 <- c(NA, NA, NA, NA, NA, NA, rep(c(1:2), 1))
q5_1 <- c(NA, NA, NA, NA, rep(c(1:2), 1), NA, NA)
q8_1 <- c(NA, NA, rep(c(1:2), 1), NA, NA, NA, NA)
q9_1 <- c(rep(c(1:2), 1), NA, NA, NA, NA, NA, NA)
q2_2 <- c(NA, NA, NA, NA, NA, NA, rep(c(1:2), 1))
q5_2 <- c(NA, NA, NA, NA, rep(c(1:2), 1), NA, NA)
q8_2 <- c(NA, NA, rep(c(1:2), 1), NA, NA, NA, NA)
q9_2 <- c(rep(c(1:2), 1), NA, NA, NA, NA, NA, NA)
df <- data.frame(q2_1, q5_1, q8_1, q9_1, q2_2, q5_2, q8_2, q9_2)
df
# running df shows:
q2_1 q5_1 q8_1 q9_1 q2_2 q5_2 q8_2 q9_2
1 NA NA NA 1 NA NA NA 1
2 NA NA NA 2 NA NA NA 2
3 NA NA 1 NA NA NA 1 NA
4 NA NA 2 NA NA NA 2 NA
5 NA 1 NA NA NA 1 NA NA
6 NA 2 NA NA NA 2 NA NA
7 1 NA NA NA 1 NA NA NA
8 2 NA NA NA 2 NA NA NA
My desired end result would be a dataframe with only columns for questions starting with q2_ (so, in the example that would be q2_1 and q2_2; in reality there are about 20 for this question), but with the NAs replaced by the answers from the corresponding q5_, q8_, and q9_ columns.
# desired end result
q2_1 q2_2
1 1 1
2 2 2
3 1 1
4 2 2
5 1 1
6 2 2
7 1 1
8 2 2
For single questions, I've done this using the code below, but this is very manual, and because q2, q5, q8, and q9 all go up to _20, I'm looking for a way to automate this more.
# example single question
library(tidyverse)
df <- df %>%
mutate(q2_1 = case_when(!is.na(q2_1) ~ q2_1,
!is.na(q5_1) ~ q5_1,
!is.na(q8_1) ~ q8_1,
!is.na(q9_1) ~ q9_1))
I hope I explained myself well enough, and I'm looking forward to some directions!
Here's one way, using coalesce:
df %>%
mutate(q2_1 = do.call(coalesce, across(ends_with('_1'))),
q2_2 = do.call(coalesce, across(ends_with('_2')))) %>%
select(q2_1, q2_2)
#> q2_1 q2_2
#> 1 1 1
#> 2 2 2
#> 3 1 1
#> 4 2 2
#> 5 1 1
#> 6 2 2
#> 7 1 1
#> 8 2 2
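Since the real survey has about 20 suffixes, writing one coalesce line per suffix gets tedious. The same "first non-NA value" idea can be looped over the suffixes. The following is a minimal base R sketch on toy data; `survey`, `prefixes`, `suffixes` and `first_non_na` are illustrative names, and it assumes the columns follow the q2_/q5_/q8_/q9_ naming shown in the question.

```r
# Toy data in the same layout as the question (four rows, two suffixes)
survey <- data.frame(q2_1 = c(NA, NA, NA, 1), q5_1 = c(NA, NA, 2, NA),
                     q8_1 = c(NA, 1, NA, NA), q9_1 = c(2, NA, NA, NA),
                     q2_2 = c(NA, NA, NA, 2), q5_2 = c(NA, NA, 1, NA),
                     q8_2 = c(NA, 2, NA, NA), q9_2 = c(1, NA, NA, NA))

prefixes <- c("q2", "q5", "q8", "q9")
suffixes <- 1:2   # would be 1:20 for the real survey

# Keep a where it is present, otherwise fall back to b
first_non_na <- function(a, b) ifelse(is.na(a), b, a)

# For each suffix, fold the four matching columns into one
combined <- as.data.frame(
  lapply(setNames(suffixes, paste0("q2_", suffixes)), function(s) {
    cols <- paste0(prefixes, "_", s)
    Reduce(first_non_na, survey[cols])
  })
)
combined
```

Anchoring the column names on an explicit prefix and suffix also avoids a pitfall of pattern matching on the ending alone: with 20 questions, a pattern such as "ends with _1" would also match columns ending in _11.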
q2_1 <- c(NA, NA, NA, NA, NA, NA, rep(c(1:2), 1))
q5_1 <- c(NA, NA, NA, NA, rep(c(1:2), 1), NA, NA)
q8_1 <- c(NA, NA, rep(c(1:2), 1), NA, NA, NA, NA)
q9_1 <- c(rep(c(1:2), 1), NA, NA, NA, NA, NA, NA)
q2_2 <- c(NA, NA, NA, NA, NA, NA, rep(c(1:2), 1))
q5_2 <- c(NA, NA, NA, NA, rep(c(1:2), 1), NA, NA)
q8_2 <- c(NA, NA, rep(c(1:2), 1), NA, NA, NA, NA)
q9_2 <- c(rep(c(1:2), 1), NA, NA, NA, NA, NA, NA)
df <- data.frame(q2_1, q5_1, q8_1, q9_1, q2_2, q5_2, q8_2, q9_2)
df
#> q2_1 q5_1 q8_1 q9_1 q2_2 q5_2 q8_2 q9_2
#> 1 NA NA NA 1 NA NA NA 1
#> 2 NA NA NA 2 NA NA NA 2
#> 3 NA NA 1 NA NA NA 1 NA
#> 4 NA NA 2 NA NA NA 2 NA
#> 5 NA 1 NA NA NA 1 NA NA
#> 6 NA 2 NA NA NA 2 NA NA
#> 7 1 NA NA NA 1 NA NA NA
#> 8 2 NA NA NA 2 NA NA NA
library(tidyverse)
suffix <- str_c("_", 1:2)
map_dfc(.x = suffix,
.f = ~ transmute(df, !!str_c("q2", .x) := rowSums(across(ends_with(.x
)), na.rm = T)))
#> q2_1 q2_2
#> 1 1 1
#> 2 2 2
#> 3 1 1
#> 4 2 2
#> 5 1 1
#> 6 2 2
#> 7 1 1
#> 8 2 2
Created on 2022-04-04 by the reprex package (v2.0.1)
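One caveat with the rowSums() variant: it relies on at most one column per row being non-NA, and with na.rm = TRUE a row in which every column is NA comes back as 0 rather than NA. A small base R illustration on toy data (the object names are made up for this example):

```r
# Third respondent answered none of the questions
toy <- data.frame(q2_1 = c(1, NA, NA), q5_1 = c(NA, 2, NA))

sums <- rowSums(toy, na.rm = TRUE)   # the all-NA row becomes 0

# Restore NA for rows that had no answer at all
all_missing <- rowSums(!is.na(toy)) == 0
fixed <- ifelse(all_missing, NA, sums)
fixed
```

If every respondent is guaranteed to have answered exactly one of the four questions, this correction is unnecessary.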
I have the following data frame:
data <- structure(list(Date = structure(c(-17897, -17896, -17895, -17894,
-17893, -17892, -17891, -17890, -17889, -17888, -17887, -17887,
-17886, -17885, -17884, -17883, -17882, -17881, -17880, -17879,
-17878, -17877, -17876, -17875, -17874, -17873, -17872, -17871,
-17870, -17869, -17868, -17867, -17866, -17865, -17864), class = "Date"),
duration = c(NA, NA, NA, 5, NA, NA, NA, 5, NA, NA, 1, 1,
NA, NA, 3, NA, 3, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, 4, NA, NA, 4, NA, NA), name = c(NA, NA, NA, "Date_beg",
NA, NA, NA, "Date_end", NA, NA, "Date_beg", "Date_end", NA,
NA, "Date_beg", NA, "Date_end", NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, "Date_beg", NA, NA, "Date_end", NA, NA
)), row.names = c(NA, -35L), class = c("tbl_df", "tbl", "data.frame"
))
And looks like:
Date duration name
<date> <dbl> <chr>
1 1921-01-01 NA NA
2 1921-01-02 NA NA
3 1921-01-03 NA NA
4 1921-01-04 5 Date_beg
5 1921-01-05 NA NA
6 1921-01-06 NA NA
7 1921-01-07 NA NA
8 1921-01-08 5 Date_end
9 1921-01-09 NA NA
10 1921-01-10 NA NA
...
I want to replace the NA values in column name that are between rows with Date_beg and Date_end with the word "event".
I have tried this:
data %<>% mutate(name = ifelse(((lag(name) == 'Date_beg')|(lag(name) == 'event')) &
But only the first row after Date_beg changes. It is quite easy with a for-loop, but I wanted to use a more R-like method.
There is probably a better way using data.table::nafill, but as you're using tidyverse functions, I would do it by creating an extra event column using tidyr::fill and then pulling it through to the name column where name is NA:
library(dplyr)
library(tidyr)
data %>%
mutate(
events = ifelse(
fill(data, name)$name == "Date_beg",
"event",
NA),
name = coalesce(name, events)
) %>%
select(-events)
You can do it by looking at the indices where there have been more "Date_beg" than "Date_end" so far (lag here is dplyr::lag, so dplyr needs to be loaded):
data$name[lag(cumsum(data$name == "Date_beg" & !is.na(data$name))) -
cumsum(data$name == "Date_end" & !is.na(data$name)) >0] <- "event"
print(data, n=20)
# # A tibble: 35 x 3
# Date duration name
# <date> <dbl> <chr>
# 1 1921-01-01 NA NA
# 2 1921-01-02 NA NA
# 3 1921-01-03 NA NA
# 4 1921-01-04 5 Date_beg
# 5 1921-01-05 NA event
# 6 1921-01-06 NA event
# 7 1921-01-07 NA event
# 8 1921-01-08 5 Date_end
# 9 1921-01-09 NA NA
# 10 1921-01-10 NA NA
# 11 1921-01-11 1 Date_beg
# 12 1921-01-11 1 Date_end
# 13 1921-01-12 NA NA
# 14 1921-01-13 NA NA
# 15 1921-01-14 3 Date_beg
# 16 1921-01-15 NA event
# 17 1921-01-16 3 Date_end
# 18 1921-01-17 NA NA
# 19 1921-01-18 NA NA
# 20 1921-01-19 NA NA
# # ... with 15 more rows
Lagging the first index by one is required so that you don't overwrite the "Date_beg" at the start of each run.
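The counting idea can also be written without lag by requiring the row itself to be NA, which protects the "Date_beg" rows in the same way. A minimal base R sketch on a short made-up vector (the names `is_beg`, `is_end` and `open` are illustrative):

```r
name <- c(NA, "Date_beg", NA, NA, "Date_end", NA, "Date_beg", "Date_end", NA)

# Marker flags; the !is.na() guard turns NA comparisons into FALSE
is_beg <- !is.na(name) & name == "Date_beg"
is_end <- !is.na(name) & name == "Date_end"

# Number of events that have started but not yet ended at each row
open <- cumsum(is_beg) - cumsum(is_end)

# Fill only the empty rows that fall inside an open event
name[open > 0 & is.na(name)] <- "event"
name
```

Here only positions 3 and 4 are filled; the back-to-back "Date_beg"/"Date_end" pair has no NA rows in between, so nothing changes there.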
Another dplyr approach using the cumsum function.
If the row in the name column is NA, it adds 0 to the cumsum; otherwise it adds 1. Therefore the values from a Date_beg row onward are always odd (0 + 1) and the values from a Date_end row onward are always even (0 + 1 + 1). Then replace the rows that are odd in the ref column AND NA in the name column with "event".
library(dplyr)
data %>%
mutate(ref = cumsum(ifelse(is.na(name), 0, 1)),
name = ifelse(ref %% 2 == 1 & is.na(name), "event", name)) %>%
select(-ref)
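To see why the parity trick works, it can help to look at the ref column next to name. A small base R trace on made-up data (the data frame `toy` is hypothetical); note that it assumes the markers strictly alternate Date_beg, Date_end, Date_beg, and so on.

```r
toy <- data.frame(
  name = c(NA, "Date_beg", NA, NA, "Date_end", NA, "Date_beg", "Date_end", NA)
)

# Running count of non-NA markers: odd while an event is open, even otherwise
toy$ref <- cumsum(!is.na(toy$name))

# Fill rows that are inside an open event (odd ref) and currently empty
toy$name[toy$ref %% 2 == 1 & is.na(toy$name)] <- "event"
toy
```

The ref values are 0, 1, 1, 1, 2, 2, 3, 4, 4, so only the two NA rows following the first Date_beg are odd and empty at the same time.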
I am trying to move data from one column to another, due to the underlying forms being filled out incorrectly.
The form asks for information on a household, including the age (AGE) and gender (SEX) of each member, allowing up to 5 people per household. However, some users have filled in information for persons 1, 3 and 4 but nothing for person 2, because they filled out person 2 incorrectly, crossed out the details, and entered person 2's details in the person 3 boxes, and so on.
The data looks like this (ref 1 and 5 are correct in this data, all others are incorrect)
df <- data.frame(
ref = c(1, 2, 3, 4, 5, 6),
AGE1 = c(45, 36, 26, 47, 24, NA),
AGE2 = c(NA, 24, NA, 13, 57, 28),
AGE3 = c(NA, NA, 35, NA, NA, 26),
AGE4 = c(NA, NA, 15, 11, NA, NA),
AGE5 = c(NA, 15, NA, NA, NA, NA),
SEX1 = c("M", "F", "M", "M", "M", NA),
SEX2 = c(NA, "M", NA, "F", "F", "F"),
SEX3 = c(NA, NA, "M", NA, NA, "M"),
SEX4 = c(NA, NA, "F", "F", NA, NA),
SEX5 = c(NA, "F", NA, NA, NA, NA)
)
This is what the table looks like currently
(I have replaced NA with - to make reading easier)
ref  AGE1  AGE2  AGE3  AGE4  AGE5  SEX1  SEX2  SEX3  SEX4  SEX5
1    45    -     -     -     -     M     -     -     -     -
2    36    24    -     -     15    F     M     -     -     F
3    26    -     35    15    -     M     -     M     F     -
4    47    13    -     11    -     M     F     -     F     -
5    24    57    -     -     -     M     F     -     -     -
6    -     28    26    -     -     -     F     M     -     -
but I would like it to look like this
ref  AGE1  AGE2  AGE3  AGE4  AGE5  SEX1  SEX2  SEX3  SEX4  SEX5
1    45    -     -     -     -     M     -     -     -     -
2    36    24    15    -     -     F     M     F     -     -
3    26    35    15    -     -     M     M     F     -     -
4    47    13    11    -     -     M     F     F     -     -
5    24    57    -     -     -     M     F     -     -     -
6    28    26    -     -     -     F     M     -     -     -
Is there a way of correcting this using dplyr? If not, is there another way in R of correcting the data?
Here is a way using dplyr and tidyr. The approach involves pivoting the data to longer format, sorting the NA values to the end, renumbering the column names, and then pivoting to wide form again.
library(dplyr)
library(tidyr)
# `df` is the data frame defined in the question
df %>%
mutate(across(starts_with("AGE"), as.character)) %>%
pivot_longer(2:11) %>%
separate(name, into = c("cat", "num"), 3) %>%
arrange(is.na(value)) %>%
group_by(ref, cat) %>%
mutate(num = seq_along(value)) %>%
ungroup() %>%
arrange(cat) %>%
unite(name, cat, num, sep = "") %>%
pivot_wider(id_cols = ref) %>%
mutate(across(starts_with("AGE"), as.numeric))
# A tibble: 6 x 11
ref AGE1 AGE2 AGE3 AGE4 AGE5 SEX1 SEX2 SEX3 SEX4 SEX5
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr>
1 1 45 NA NA NA NA M NA NA NA NA
2 2 36 24 15 NA NA F M F NA NA
3 3 26 35 15 NA NA M M F NA NA
4 4 47 13 11 NA NA M F F NA NA
5 5 24 57 NA NA NA M F NA NA NA
6 6 28 26 NA NA NA F M NA NA NA
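The "push the NAs to the end" step can also be sketched directly in base R, one block of columns at a time. This is a rough sketch rather than a replacement for the pivot-based answer: `shift_left`, `age_cols` and `sex_cols` are illustrative names, and it assumes AGE and SEX are missing in the same positions for each person, so that shifting both blocks keeps each age paired with the right sex.

```r
df <- data.frame(
  ref  = c(1, 2, 3, 4, 5, 6),
  AGE1 = c(45, 36, 26, 47, 24, NA), AGE2 = c(NA, 24, NA, 13, 57, 28),
  AGE3 = c(NA, NA, 35, NA, NA, 26), AGE4 = c(NA, NA, 15, 11, NA, NA),
  AGE5 = c(NA, 15, NA, NA, NA, NA),
  SEX1 = c("M", "F", "M", "M", "M", NA), SEX2 = c(NA, "M", NA, "F", "F", "F"),
  SEX3 = c(NA, NA, "M", NA, NA, "M"),   SEX4 = c(NA, NA, "F", "F", NA, NA),
  SEX5 = c(NA, "F", NA, NA, NA, NA)
)

# Move the non-missing values of one row to the front, keeping their order
shift_left <- function(x) c(x[!is.na(x)], x[is.na(x)])

age_cols <- paste0("AGE", 1:5)
sex_cols <- paste0("SEX", 1:5)

# apply() works row by row and returns the rows as columns, hence t()
df[age_cols] <- t(apply(df[age_cols], 1, shift_left))
df[sex_cols] <- t(apply(df[sex_cols], 1, shift_left))
df
```

Unlike sorting by value, this keeps household members in the order in which they were entered on the form.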
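Here's a way using the dplyr and tidyr libraries. Note that arrange(ref, AGE, SEX) sorts each household's members by age, so the original order of the members is not preserved.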
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = -ref,
names_to = c('.value', 'num'),
names_pattern = '([A-Z]+)(\\d+)') %>%
arrange(ref, AGE, SEX) %>%
group_by(ref) %>%
mutate(num = row_number()) %>%
ungroup %>%
pivot_wider(names_from = num, values_from = c(AGE, SEX))
# ref AGE_1 AGE_2 AGE_3 AGE_4 AGE_5 SEX_1 SEX_2 SEX_3 SEX_4 SEX_5
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr>
#1 1 45 NA NA NA NA M NA NA NA NA
#2 2 15 24 36 NA NA F M F NA NA
#3 3 15 26 35 NA NA F M M NA NA
#4 4 11 13 47 NA NA F F M NA NA
#5 5 24 57 NA NA NA M F NA NA NA
#6 6 26 28 NA NA NA M F NA NA NA
Try the base code below
u1 <- reshape(
setNames(df, sub("(\\d)", ".\\1", names(df))),
direction = "long",
idvar = "ref",
varying = -1
)
u2 <- reshape(
transform(
u1[with(u1, order(is.na(AGE), is.na(SEX))), ],
time = ave(time, ref, FUN = seq_along)
),
direction = "wide",
idvar = "ref"
)
out <- u2[match(names(df),sub("\\.","",names(u2)))]
and you will get
> out
ref AGE.1 AGE.2 AGE.3 AGE.4 AGE.5 SEX.1 SEX.2 SEX.3 SEX.4 SEX.5
1.1 1 45 NA NA NA NA M <NA> <NA> <NA> <NA>
2.1 2 36 24 15 NA NA F M F <NA> <NA>
3.1 3 26 35 15 NA NA M M F <NA> <NA>
4.1 4 47 13 11 NA NA M F F <NA> <NA>
5.1 5 24 57 NA NA NA M F <NA> <NA> <NA>
6.2 6 28 26 NA NA NA F M <NA> <NA> <NA>
data
df <- data.frame(
ref = c(1, 2, 3, 4, 5, 6),
AGE1 = c(45, 36, 26, 47, 24, NA),
AGE2 = c(NA, 24, NA, 13, 57, 28),
AGE3 = c(NA, NA, 35, NA, NA, 26),
AGE4 = c(NA, NA, 15, 11, NA, NA),
AGE5 = c(NA, 15, NA, NA, NA, NA),
SEX1 = c("M", "F", "M", "M", "M", NA),
SEX2 = c(NA, "M", NA, "F", "F", "F"),
SEX3 = c(NA, NA, "M", NA, NA, "M"),
SEX4 = c(NA, NA, "F", "F", NA, NA),
SEX5 = c(NA, "F", NA, NA, NA, NA)
)
Here is a solution using package dedupewider:
library(dedupewider)
df <- data.frame(
ref = c(1, 2, 3, 4, 5, 6),
AGE1 = c(45, 36, 26, 47, 24, NA),
AGE2 = c(NA, 24, NA, 13, 57, 28),
AGE3 = c(NA, NA, 35, NA, NA, 26),
AGE4 = c(NA, NA, 15, 11, NA, NA),
AGE5 = c(NA, 15, NA, NA, NA, NA),
SEX1 = c("M", "F", "M", "M", "M", NA),
SEX2 = c(NA, "M", NA, "F", "F", "F"),
SEX3 = c(NA, NA, "M", NA, NA, "M"),
SEX4 = c(NA, NA, "F", "F", NA, NA),
SEX5 = c(NA, "F", NA, NA, NA, NA)
)
age_moved <- na_move(df, cols = names(df)[grepl("^AGE\\d$", names(df))]) # 'right' direction is by default
sex_moved <- na_move(age_moved, cols = names(df)[grepl("^SEX\\d$", names(df))])
sex_moved
#> ref AGE1 AGE2 AGE3 AGE4 AGE5 SEX1 SEX2 SEX3 SEX4 SEX5
#> 1 1 45 NA NA NA NA M <NA> <NA> NA NA
#> 2 2 36 24 15 NA NA F M F NA NA
#> 3 3 26 35 15 NA NA M M F NA NA
#> 4 4 47 13 11 NA NA M F F NA NA
#> 5 5 24 57 NA NA NA M F <NA> NA NA
#> 6 6 28 26 NA NA NA F M <NA> NA NA
I am using R.
I have 4 different databases. Each one has values for my variables. Some of the databases have more values than others, so I want to use the one that has the most values first and the one that has the fewest values last. The data looks like this...
Variables  A  B  C  D
John       2  4
Mike          6
Walter        7
Jennifer      9  8
Amanda     3
Carlos     9
Michael       3
James               5
Kevin      4
Dennis           7
Frank
Steven
Joseph
Elvis         2
Maria         1
So, in order to fill the data I need to create a new column that first uses the data from column B, because it is the one that contains the most values, then A, then C and then D; the ones that are still missing need to be NAs. I also need to add another column that gives me the source of the data. In other words, if I am using column B for John's value, I need a column that tells me that the value comes from column B.
The column should look like this...
Variables E D
John 4 B
Mike 6 B
Walter 7 B
Jennifer 9 B
Amanda 3 A
Carlos 9 A
Michael 3 B
James 5 D
Kevin 4 A
Dennis 7 C
Frank NA NA
Steven NA NA
Joseph NA NA
Elvis 2 B
Maria 1 B
With tidyverse you can do the following...
Use pivot_longer to put into long form. Make name an ordered factor by "B", "A", "C", and "D". Then when you arrange, you can get the first value by this order within each person's name.
This assumes your missing data are NA. If they are instead blank character values, you can filter those out with filter(value != "") instead of drop_na(value).
library(tidyverse)
df %>%
pivot_longer(cols = -Variables) %>%
mutate(name = ordered(name, levels = c('B', 'A', 'C', 'D'))) %>%
group_by(Variables) %>%
drop_na(value) %>%
arrange(name) %>%
summarise(E = first(value),
New_D = first(name)) %>%
right_join(df)
Output
Variables E New_D A B C D
<chr> <dbl> <ord> <dbl> <dbl> <dbl> <dbl>
1 Amanda 3 A 3 NA NA NA
2 Carlos 9 A 9 NA NA NA
3 Dennis 7 C NA NA 7 NA
4 Elvis 2 B NA 2 NA NA
5 James 5 D NA NA NA 5
6 Jennifer 9 B NA 9 8 NA
7 John 4 B 2 4 NA NA
8 Kevin 4 A 4 NA NA NA
9 Maria 1 B NA 1 NA NA
10 Michael 3 B NA 3 NA NA
11 Mike 6 B NA 6 NA NA
12 Walter 7 B NA 7 NA NA
13 Frank NA NA NA NA NA NA
14 Steven NA NA NA NA NA NA
15 Joseph NA NA NA NA NA NA
Data
df <- structure(list(Variables = c("John", "Mike", "Walter", "Jennifer",
"Amanda", "Carlos", "Michael", "James", "Kevin", "Dennis", "Frank",
"Steven", "Joseph", "Elvis", "Maria"), A = c(2, NA, NA, NA, 3,
9, NA, NA, 4, NA, NA, NA, NA, NA, NA), B = c(4, 6, 7, 9, NA,
NA, 3, NA, NA, NA, NA, NA, NA, 2, 1), C = c(NA, NA, NA, 8, NA,
NA, NA, NA, NA, 7, NA, NA, NA, NA, NA), D = c(NA, NA, NA, NA,
NA, NA, NA, 5, NA, NA, NA, NA, NA, NA, NA)), class = "data.frame", row.names = c(NA,
-15L))
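The same priority rule (B, then A, then C, then D) can also be sketched in base R by locating, for each row, the first non-missing column in priority order. This is a minimal sketch on a five-row subset of the data; the object name `people` and the column name `Source` are made up for illustration.

```r
people <- data.frame(Variables = c("John", "Amanda", "Dennis", "James", "Frank"),
                     A = c(2, 3, NA, NA, NA),
                     B = c(4, NA, NA, NA, NA),
                     C = c(NA, NA, 7, NA, NA),
                     D = c(NA, NA, NA, 5, NA))

priority <- c("B", "A", "C", "D")

# Value columns as a matrix, reordered by priority
m <- as.matrix(people[priority])

# Position of the first non-missing value in each row (NA if the row is empty)
idx <- unname(apply(!is.na(m), 1, function(r) which(r)[1]))

# Look up the chosen value and remember which column it came from
people$E      <- m[cbind(seq_len(nrow(m)), idx)]
people$Source <- priority[idx]
people
```

Rows with no value in any of the four columns, such as Frank, end up with NA in both new columns.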