I have yearly observations for individuals on different variables from 2008-2020. I have data on family (25 variables), income (15 variables), and schooling (22 variables).
Right now, I have 'cleaned' every single dataset so that every column of every category has the same column name. For context, this is what my R environment looks like now.
The thing is, I would like to have one big dataset with all of the individuals and years in one dataframe. I know that I should/could use the inner_join or merge function, first of all sorting by 'HouseholdMember', and that I could use the gather function, but I am truly struggling with the order in which to do this and where to start. I've been trying a lot of things, but considering the number of dataframes, it's hard to keep track of what I'm doing. I also created lists of every category for every year because this was recommended in one method, but that did not work out...
I want to end up with a dataframe that looks similar to this:
Individual  Year  Var1   Var2
1           2008  value  value
1           2009  value  value
1           2010  value  value
2           2008  value  value
2           2009  value  value
2           2010  value  value
What should I do as a first step? If I merge the dataframes, I don't think R knows which values correspond to which year...
> head(fam08)
# A tibble: 6 x 25
HouseholdMember RandomChild YearBirthRandom Gender Age FatherBirth FatherAlive MotherBirth MotherAlive Divorce SeeFather SeeMother
<dbl> <dbl+lbl> <dbl> <dbl+l> <dbl> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+l> <dbl+lbl> <dbl+lbl>
1 800033 16 [not ap… NA 1 [mal… 16 1952 1 [yes] 1961 1 [yes] 1 [yes] 7 [ever… 7 [ever…
2 800042 16 [not ap… NA 2 [fem… 32 1946 1 [yes] 1948 1 [yes] 2 [no] 4 [at l… 4 [at l…
3 800045 16 [not ap… NA 1 [mal… 65 1913 2 [no] 1915 2 [no] 2 [no] NA NA
4 800057 16 [not ap… NA 1 [mal… 33 1939 1 [yes] 1945 1 [yes] 1 [yes] 4 [at l… 4 [at l…
5 800076 16 [not ap… NA 2 [fem… 22 1955 1 [yes] 1955 1 [yes] 1 [yes] 5 [at l… 3 [a fe…
6 800119 16 [not ap… NA 2 [fem… 57 1908 2 [no] 1918 2 [no] 2 [no] NA NA
# … with 13 more variables: Married <dbl+lbl>, Child <dbl+lbl>, NumChild <dbl>, SchoolCH1 <dbl+lbl>, SchoolCH2 <dbl+lbl>,
# SchoolCH3 <dbl+lbl>, SchoolCH4 <dbl+lbl>, BirthCH1 <dbl>, BirthCH2 <dbl>, BirthCH3 <dbl>, BirthCH4 <dbl>, FamSatisfaction <dbl+lbl>,
# Year <dbl>
> head(fam09)
# A tibble: 6 x 25
HouseholdMember RandomChild YearBirthRandom Gender Age FatherBirth FatherAlive MotherBirth MotherAlive Divorce SeeFather SeeMother
<dbl> <dbl+lbl> <dbl> <dbl+l> <dbl> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+l> <dbl+lbl> <dbl+lbl>
1 800033 16 [not ap… NA 1 [mal… 17 1952 1 [yes] 1961 1 [yes] NA 5 [at l… 7 [ever…
2 800042 16 [not ap… NA 2 [fem… 33 1946 1 [yes] 1948 1 [yes] NA 4 [at l… 4 [at l…
3 800057 16 [not ap… NA 1 [mal… 34 1939 1 [yes] 1945 1 [yes] NA 3 [a fe… 3 [a fe…
4 800076 16 [not ap… NA 2 [fem… 23 1955 1 [yes] 1955 1 [yes] NA 5 [at l… 3 [a fe…
5 800119 16 [not ap… NA 2 [fem… 58 NA NA NA NA NA NA NA
6 800125 16 [not ap… NA 2 [fem… 50 NA NA 1928 1 [yes] NA NA 1 [neve…
# … with 13 more variables: Married <dbl+lbl>, Child <dbl+lbl>, NumChild <dbl>, SchoolCH1 <dbl+lbl>, SchoolCH2 <dbl+lbl>,
# SchoolCH3 <dbl+lbl>, SchoolCH4 <dbl+lbl>, BirthCH1 <dbl>, BirthCH2 <dbl>, BirthCH3 <dbl>, BirthCH4 <dbl>, FamSatisfaction <dbl+lbl>,
# Year <dbl>
dput(head(fam09,10))
structure(list(HouseholdMember = c(800033, 800042, 800057, 800076,
800119, 800125, 800170, 800186, 800201, 800204), RandomChild = structure(c(16,
16, 16, 16, 16, 16, 3, 16, 16, 16), label = "Randomly chosen child", labels = c(`child 1` = 1,
`child 2` = 2, `child 3` = 3, `child 4` = 4, `child 5` = 5, `child 6` = 6,
`child 7` = 7, `child 8` = 8, `child 9` = 9, `child 10` = 10,
`child 11` = 11, `child 12` = 12, `child 13` = 13, `child 14` = 14,
`child 15` = 15, `not applicable` = 16), class = "haven_labelled"),
YearBirthRandom = c(NA, NA, NA, NA, NA, NA, 1999, NA, NA,
NA), Gender = structure(c(1, 2, 1, 2, 2, 2, 2, 2, 1, 1), label = "Gender respondent", labels = c(male = 1,
female = 2), class = "haven_labelled"), Age = c(17, 33, 34,
23, 58, 50, 50, 69, 35, 67), FatherBirth = structure(c(1952,
1946, 1939, 1955, NA, NA, 1926, NA, 1948, NA), label = "What is the year of birth of your father?", labels = c(`I don't know` = 99999), class = "haven_labelled"),
FatherAlive = structure(c(1, 1, 1, 1, NA, NA, 1, NA, 1, NA
), label = "Is your father still alive?", labels = c(yes = 1,
no = 2, `I don't know` = 99), class = "haven_labelled"),
MotherBirth = structure(c(1961, 1948, 1945, 1955, NA, 1928,
1931, NA, 1950, NA), label = "What is the year of birth of your mother?", labels = c(`I don't know` = 99999), class = "haven_labelled"),
MotherAlive = structure(c(1, 1, 1, 1, NA, 1, 1, NA, 1, NA
), label = "Is your mother still alive?", labels = c(yes = 1,
no = 2, `I don't know` = 99), class = "haven_labelled"),
Divorce = structure(c(NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_
), label = "Did your own parents ever divorce?", labels = c(yes = 1,
no = 2, `my parents never had a relationship` = 3, `I don't know` = 99
), class = "haven_labelled"), SeeFather = structure(c(5,
4, 3, 5, NA, NA, 6, NA, 3, NA), label = "How often did you see your father over the past 12 months?", labels = c(never = 1,
once = 2, `a few times` = 3, `at least every month` = 4,
`at least every week` = 5, `a few times per week` = 6, `every day` = 7
), class = "haven_labelled"), SeeMother = structure(c(7,
4, 3, 3, NA, 1, 6, NA, 3, NA), label = "How often did you see your mother over the past 12 months?", labels = c(never = 1,
once = 2, `a few times` = 3, `at least every month` = 4,
`at least every week` = 5, `a few times per week` = 6, `every day` = 7
), class = "haven_labelled"), Married = structure(c(NA, 1,
2, 2, 1, 2, 1, 1, 1, 1), label = "Are you married to this partner?", labels = c(yes = 1,
no = 2), class = "haven_labelled"), Child = structure(c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), label = "Have you had any children?", labels = c(yes = 1,
no = 2), class = "haven_labelled"), NumChild = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), SchoolCH1 = structure(c(NA,
NA, NA, NA, NA, NA, 4, NA, NA, NA), label = "What school does child 1 (born in the years 1991 through 2004) attend?", labels = c(`primary school` = 1,
`school for special primary education` = 2, `secondary school` = 3,
other = 4), class = "haven_labelled"), SchoolCH2 = structure(c(NA,
NA, NA, NA, NA, NA, 3, NA, NA, NA), label = "What school does child 2 (born in the years 1991 through 2004) attend?", labels = c(`primary school` = 1,
`school for special primary education` = 2, `secondary school` = 3,
other = 4), class = "haven_labelled"), SchoolCH3 = structure(c(NA,
NA, NA, NA, NA, NA, 1, NA, NA, NA), label = "What school does child 3 (born in the years 1991 through 2004) attend?", labels = c(`primary school` = 1,
`school for special primary education` = 2, `secondary school` = 3,
other = 4), class = "haven_labelled"), SchoolCH4 = structure(c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), label = "What school does child 4 (born in the years 1991 through 2004) attend?", labels = c(`primary school` = 1,
`school for special primary education` = 2, `secondary school` = 3,
other = 4), class = "haven_labelled"), BirthCH1 = c(NA, 2005,
2007, NA, 1983, NA, 1991, 1964, NA, 1974), BirthCH2 = c(NA,
2007, NA, NA, 1985, NA, 1994, 1966, NA, 1976), BirthCH3 = c(NA,
NA, NA, NA, NA, NA, 1999, 1970, NA, NA), BirthCH4 = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), FamSatisfaction = structure(c(NA,
8, 9, NA, 8, NA, 8, NA, NA, NA), label = "How satisfied are you with your family life?", labels = c(`entirely dissatisfied` = 0,
`entirely satisfied` = 10, `I don’t know` = 999), class = "haven_labelled"),
Year = c(2009, 2009, 2009, 2009, 2009, 2009, 2009, 2009,
2009, 2009)), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))
I believe you could do something along these lines:
fam = bind_rows(fam_list)
inc = bind_rows(inc_list)
ws = bind_rows(ws_list)
result = fam %>%
left_join(inc, by=c("HouseholdMember", "Year")) %>%
left_join(ws, by=c("HouseholdMember", "Year"))
Output:
HouseholdMember Year fam_v1 fam_v2 fam_v3 inc_v1 inc_v2 inc_v3 ws_v1 ws_v2 ws_v3
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 8001 2008 0.609 -0.253 -1.30 0.0147 0.719 -0.765 0.120 0.974 -0.764
2 8002 2008 0.395 1.73 -0.503 0.119 -3.33 -0.798 0.325 0.664 1.65
3 8003 2008 0.562 0.157 0.243 -1.18 -0.260 0.105 1.09 0.855 1.19
4 8004 2008 1.32 0.737 -1.18 0.725 -1.82 0.356 0.362 2.04 1.76
5 8005 2008 -0.497 -0.444 -0.632 -0.534 1.63 0.984 1.29 0.614 0.576
6 8006 2008 -1.70 -0.989 -1.32 0.868 0.0979 0.468 -0.0146 1.11 0.957
7 8007 2008 -2.19 -0.419 1.69 1.34 -0.404 -1.43 -0.156 0.648 -0.186
8 8008 2008 1.48 0.350 -0.595 0.785 -0.609 1.28 -1.01 1.04 0.845
9 8009 2008 -0.315 -0.530 0.419 0.390 -0.0951 -0.755 0.135 0.696 -1.97
10 8010 2008 -0.882 1.38 2.06 -0.0757 1.53 -0.494 -1.03 1.14 1.87
Note:
I manufactured the data for this example by creating lists of tibbles; I believe fam_list, inc_list, and ws_list are similar to the list objects in your image. Each is a list of data frames / tibbles with the same structure. I then use bind_rows to stack the tibbles in each list, so that I have three large tibbles.
I then use left_join twice to join inc and ws to fam.
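To see the two steps on something small, here is a minimal sketch with made-up data (the tiny fam and inc tables and their values are assumptions for illustration only). It stacks the yearly tibbles and then joins on both HouseholdMember and Year, so a person who is missing from one table in a given year simply gets NA.

```r
library(dplyr)

# Two hypothetical yearly "family" tables; person 3 only appears in 2009
fam <- bind_rows(
  tibble(HouseholdMember = c(1, 2), Year = 2008, fam_v1 = c(10, 20)),
  tibble(HouseholdMember = c(1, 3), Year = 2009, fam_v1 = c(11, 31))
)

# Two hypothetical yearly "income" tables; person 3 has no income record
inc <- bind_rows(
  tibble(HouseholdMember = c(1, 2), Year = 2008, inc_v1 = c(100, 200)),
  tibble(HouseholdMember = 1, Year = 2009, inc_v1 = 110)
)

# Joining on both keys keeps each person-year on its own row
panel <- left_join(fam, inc, by = c("HouseholdMember", "Year"))
panel
```

If a person-year can exist in inc or ws but not in fam, full_join keeps those rows as well, whereas left_join drops them.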
Input Data:
library(tidyverse)
fam_list = lapply(8:20, function(x) {
tibble(HouseholdMember = c(8000+seq(1:100)),
Year=2000+x,
fam_v1=rnorm(100),
fam_v2=rnorm(100),
fam_v3=rnorm(100)
)
})
names(fam_list) = paste0("fam_", 2008:2020)
inc_list = lapply(8:20, function(x) {
tibble(HouseholdMember = c(8000+seq(1:100)),
Year=2000+x,
inc_v1=rnorm(100),
inc_v2=rnorm(100),
inc_v3=rnorm(100)
)
})
names(inc_list) = paste0("inc_", 2008:2020)
ws_list = lapply(8:20, function(x) {
tibble(HouseholdMember = c(8000+seq(1:100)),
Year=2000+x,
ws_v1=rnorm(100),
ws_v2=rnorm(100),
ws_v3=rnorm(100)
)
})
names(ws_list) = paste0("ws_", 2008:2020)
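If the yearly tables did not already carry a Year column, the list names could supply it. This is only a sketch under that assumption: the two-element list waves, its names, and its values are hypothetical, and the year is parsed from names of the form fam_2008.

```r
library(dplyr)

# Hypothetical named list of yearly tables without a Year column
waves <- list(
  fam_2008 = tibble(HouseholdMember = 1:2, fam_v1 = c(0.1, 0.2)),
  fam_2009 = tibble(HouseholdMember = 1:2, fam_v1 = c(0.3, 0.4))
)

# .id stores each list name in a column; the year is then parsed from it
fam_long <- bind_rows(waves, .id = "wave") %>%
  mutate(Year = as.integer(sub("^fam_", "", wave))) %>%
  select(HouseholdMember, Year, everything(), -wave)
fam_long
```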
Input
I have a dataframe...
df <- tibble(
id = 1:10,
family = c("a","a","b","b","c", "d", "e", "f", "g", "h"),
col1_a = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
col1_b = c(1, 2, 3, 4, NA, NA, NA, NA, NA, NA),
col2_a = c(11, 12, 13, 14, 15, 16, 17, 18, 19, 20),
col2_b = c(11, 12, 13, 14, NA, NA, NA, NA, NA, NA),
)
Families will only contain 2 members at most (so they're either individuals or pairs).
For individuals (families with only one row, i.e. id = 5:10), I want to randomly move 50% of the data from columns ending in 'a' to columns ending in 'b'.
By the end, the data should look like the following (depending on which 50% of rows are used)...
df <- tibble(
id = 1:10,
family = c("a","a","b","b","c", "d", "e", "f", "g", "h"),
col1_a = c(1, 2, 3, 4, 5, NA, 7, NA, 9, NA),
col1_b = c(1, 2, 3, 4, NA, 6, NA, 8, NA, 10),
col2_a = c(11, 12, 13, 14, NA, NA, 17, 18, NA, 20),
col2_b = c(11, 12, 13, 14, 15, 16, NA, NA, 19, NA),
)
I would like to be able to do this with a combination of group_by and mutate since I am mostly using Tidyverse.
Update: I forgot to mention that values in columns ending 'a' should be replaced with NA if they are moved across to 'b'.
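I would do this in two main steps. First, create the fam_count column to determine which families have only one person. Then create two rand columns that decide, for each single-person row, whether the value is moved to the b columns. Note that sample() with replace = TRUE moves each row with probability 0.5, so the share moved is 50% on average rather than exactly 50%.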
library(tidyverse)
set.seed(1)
df %>% group_by(family) %>%
mutate(fam_count = n()) %>%
ungroup() %>%
mutate(
rand1 = sample(c(NA, 1), nrow(.), replace = TRUE),
rand2 = sample(c(NA, 1), nrow(.), replace = TRUE),
col1_b = ifelse(fam_count == 1, rand1 * col1_a, col1_b),
col2_b = ifelse(fam_count == 1, rand2 * col2_a, col2_b)
) %>%
mutate(
col1_a = ifelse(fam_count == 1 & !is.na(col1_b), NA, col1_a),
col2_a = ifelse(fam_count == 1 & !is.na(col2_b), NA, col2_a)
) %>%
select(-rand1, -rand2, - fam_count)
# A tibble: 10 x 6
id family col1_a col1_b col2_a col2_b
<int> <chr> <int> <dbl> <int> <dbl>
1 1 a 1 1 11 11
2 2 a 2 2 12 12
3 3 b 3 3 13 13
4 4 b 4 4 14 14
5 5 c 5 NA NA 15
6 6 d 6 NA NA 16
7 7 e NA 7 17 NA
8 8 f 8 NA NA 18
9 9 g NA 9 19 NA
10 10 h 10 NA 20 NA
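If exactly half of the single-member rows must be moved, one option is to sample row positions without replacement. The sketch below is a base R variant under that assumption; move_half is a hypothetical helper, not part of any package, and each column pair is sampled independently.

```r
set.seed(1)

df <- data.frame(
  id = 1:10,
  family = c("a", "a", "b", "b", "c", "d", "e", "f", "g", "h"),
  col1_a = 1:10,
  col1_b = c(1:4, rep(NA, 6)),
  col2_a = 11:20,
  col2_b = c(11:14, rep(NA, 6))
)

# TRUE for rows whose family has exactly one member
single <- ave(seq_along(df$family), df$family, FUN = length) == 1

# Hypothetical helper: move exactly half of the single-member rows from a to b
move_half <- function(a, b, single) {
  idx <- which(single)
  # index into idx so a single candidate row is handled correctly
  pick <- idx[sample.int(length(idx), size = floor(length(idx) / 2))]
  b[pick] <- a[pick]
  a[pick] <- NA
  list(a = a, b = b)
}

col1 <- move_half(df$col1_a, df$col1_b, single)
df$col1_a <- col1$a
df$col1_b <- col1$b

col2 <- move_half(df$col2_a, df$col2_b, single)
df$col2_a <- col2$a
df$col2_b <- col2$b
df
```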
I want to select a number of variables based on their names to transform them. The variable names all start with inq and end with 7, 8, 10, 13:15. This is not working for me... Apologies if this is obvious, but I cannot get it to work. Am I using the wrong functions, putting my functions and arguments together wrong, or something else?
A reproducible example:
structure(list(inq1_1 = c(NA, 7, 5, 1, 1, 6, 5, 2, NA, NA), inq1_2 = c(NA,
7, 5, 1, 1, 6, 5, 5, NA, NA), inq1_3 = c(NA, 6, 4, 2, 1, 5, 2,
1, NA, NA), inq1_4 = c(NA, 6, 6, 1, 1, 6, 5, 1, NA, NA), inq1_5 = c(NA,
7, 3, 1, 1, 6, 2, 1, NA, NA), inq1_6 = c(NA, 7, 4, 4, 2, 7, 2,
4, NA, NA), inq1_7 = c(NA, 2, 4, 6, 7, 3, 1, 7, NA, NA), inq1_8 = c(NA,
1, NA, 2, 7, 2, 1, 4, NA, NA), inq1_9 = c(NA, 4, 6, 3, 1, 3,
7, 1, NA, NA), inq1_10 = c(NA, 3, 5, 7, 4, 4, 2, 7, NA, NA),
inq1_11 = c(NA, 5, 4, 7, 1, 6, 7, 6, NA, NA), inq1_12 = c(NA,
7, 5, 7, 4, 6, 7, 2, NA, NA), inq1_13 = c(NA, 3, 4, 6, 4,
3, 4, 4, NA, NA), inq1_14 = c(NA, 3, 2, 4, 4, 2, 1, 4, NA,
NA), inq1_15 = c(NA, 2, 2, 3, 5, 2, 4, 4, NA, NA), inqfinal_1 = c(5,
NA, 3, NA, NA, NA, NA, NA, NA, NA), inqfinal_2 = c(5, NA,
3, NA, NA, NA, NA, NA, NA, NA), inqfinal_3 = c(6, NA, 3,
NA, NA, NA, NA, NA, NA, NA), inqfinal_4 = c(5, NA, 3, NA,
NA, NA, NA, NA, NA, NA), inqfinal_5 = c(5, NA, 3, NA, NA,
NA, NA, NA, NA, NA), inqfinal_6 = c(6, NA, 3, NA, NA, NA,
NA, NA, NA, NA), inqfinal_7 = c(4, NA, 3, NA, NA, NA, NA,
NA, NA, NA), inqfinal_8 = c(2, NA, 3, NA, NA, NA, NA, NA,
NA, NA), inqfinal_9 = c(5, NA, 3, NA, NA, NA, NA, NA, NA,
NA), inqfinal_10 = c(4, NA, 3, NA, NA, NA, NA, NA, NA, NA
), inqfinal_11 = c(6, NA, 4, NA, NA, NA, NA, NA, NA, NA),
inqfinal_12 = c(6, NA, 4, NA, NA, NA, NA, NA, NA, NA), inqfinal_13 = c(4,
NA, 3, NA, NA, NA, NA, NA, NA, NA), inqfinal_14 = c(2, NA,
2, NA, NA, NA, NA, NA, NA, NA), inqfinal_15 = c(2, NA, 2,
NA, NA, NA, NA, NA, NA, NA)), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))
I am trying to become tidy and utilising dplyr as per the code below:
# select specific columns
sf_df %>% select(starts_with("inq"),
ends_with(7, 8, 10, 13:15)) %>% view(title = "test")
Alas, I get the following error:
Error in ends_with(7, 8, 10, 13:15) : unused argument (13:15)
14. .f(.x[[i]], ...)
13. map(.x[sel], .f, ...)
12. map_if(ind_list, is_helper, eval_tidy)
11. vars_select_eval(.vars, quos)
10. tidyselect::vars_select(names(.data), !!!quos(...))
9. select.data.frame(., starts_with("inq"), ends_with(7, 8, 10, 13:15))
8. select(., starts_with("inq"), ends_with(7, 8, 10, 13:15))
7. function_list[[i]](value)
6. freduce(value, `_function_list`)
5. `_fseq`(`_lhs`)
4. eval(quote(`_fseq`(`_lhs`)), env, env)
3. eval(quote(`_fseq`(`_lhs`)), env, env)
2. withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
1. sf_df %>% select(starts_with("inq"), ends_with(7, 8, 10, 13:15)) %>% view(title = "test")
Any help would be greatly appreciated! Thank you in advance.
Cheers,
Atanas.
A better option would be matches, which matches a regex pattern in the column name. Here, it matches the pattern 'inq' at the beginning (^) of the column name and one of the listed numbers at the end ($) of the column name
sf_df %>%
select(matches('^inq.*(7|8|10|13|14|15)$'))
# A tibble: 10 x 12
# inq1_7 inq1_8 inq1_10 inq1_13 inq1_14 inq1_15 inqfinal_7 inqfinal_8 inqfinal_10 inqfinal_13 inqfinal_14 inqfinal_15
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 NA NA NA NA NA NA 4 2 4 4 2 2
# 2 2 1 3 3 3 2 NA NA NA NA NA NA
# 3 4 NA 5 4 2 2 3 3 3 3 2 2
# 4 6 2 7 6 4 3 NA NA NA NA NA NA
# 5 7 7 4 4 4 5 NA NA NA NA NA NA
# 6 3 2 4 3 2 2 NA NA NA NA NA NA
# 7 1 1 2 4 1 4 NA NA NA NA NA NA
# 8 7 4 7 4 4 4 NA NA NA NA NA NA
# 9 NA NA NA NA NA NA NA NA NA NA NA NA
#10 NA NA NA NA NA NA NA NA NA NA NA NA
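The pattern can also be built from the numbers rather than typed by hand. The sketch below is a base R illustration on hypothetical column names; it additionally requires an underscore before the number, which is an assumption about the naming scheme. Without it, a name such as inq1_17 would also match, because it ends in 7.

```r
# Suffixes to keep, written the way the question lists them
suffixes <- c(7, 8, 10, 13:15)

# Require "_" directly before the number so that "_17" does not match "7"
pattern <- paste0("^inq.*_(", paste(suffixes, collapse = "|"), ")$")

# Hypothetical column names to test the pattern against
col_names <- c("inq1_7", "inq1_17", "inq1_8", "inqfinal_10", "inqfinal_1", "other_7")
kept <- col_names[grepl(pattern, col_names)]
kept
```

The same string can be passed to matches() inside select().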
Note that combining starts_with and ends_with in this way may not give the result you expect. The OP's dataset has 30 columns, and all of the column names start with 'inq'. So starts_with returns all columns, and adding ends_with as a second argument gives an OR match (a union), e.g.
sf_df %>%
select(starts_with("inq"), ends_with("5")) %>%
ncol
#[1] 30 # returns 30 columns
It does not remove the columns that have no '5' at the end of the name.
Nor does this depend on the order of the arguments, as
sf_df %>%
select(ends_with("5"), starts_with("inq")) %>%
ncol
#[1] 30
Now, if we use only ends_with
sf_df %>%
select(ends_with("5")) %>%
ncol
#[1] 4
Based on the example, all columns start with 'inq', so ends_with alone would be sufficient for a single string match, as the documentation for ?ends_with specifies
match - A string.
and not multiple strings
where the Usage is
ends_with(match, ignore.case = TRUE, vars = peek_vars())
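For completeness, newer releases handle this differently. As far as I know, tidyselect 1.0.0 and later allow selection helpers to be combined with & and accept a character vector in ends_with; the sketch below assumes such a version is installed and uses a hypothetical toy tibble.

```r
library(dplyr)

# Hypothetical toy data: four "inq" columns and one unrelated column
toy <- tibble(inq1_5 = 1, inq1_6 = 2, inq1_15 = 3, inqfinal_5 = 4, other_5 = 5)

# & takes the intersection instead of the union
both <- toy %>% select(starts_with("inq") & ends_with("5"))
names(both)

# Several suffixes at once; "_5" does not match "inq1_15"
several <- toy %>% select(starts_with("inq") & ends_with(c("_5", "_6")))
names(several)
```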
I am sure I am not the only person who has asked this but after hours of searching with no luck I need to ask the question myself.
I have a df (rp) like so:
rp <- structure(list(agec1 = c(7, 16, 11, 11, 17, 17),
agec2 = c(6, 12, 9, 9, 16, 15),
agec3 = c(2, 9, 9, 9, 14, NA),
agec4 = c(NA, 7, 9, 9, 13, NA),
agec5 = c(NA, 4, 7, 7, 10, NA),
agec6 = c(NA, NA, 6, 6, 9, NA),
agec7 = c(NA, NA, NA, NA, 7, NA),
agec8 = c(NA, NA, NA, NA, 5, NA)),
row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))
Where each obs in agecX refers to the age of a parent's children up to 8 children. I would like to create a new column "agec5_12" that contains the age of the oldest child aged 5-12. So my df would look like this:
rpage <- structure(list(agec1 = c(7, 16, 11, 11, 17, 17),
agec2 = c(6, 12, 9, 9, 16, 15),
agec3 = c(2, 9, 9, 9, 14, NA),
agec4 = c(NA, 7, 9, 9, 13, NA),
agec5 = c(NA, 4, 7, 7, 10, NA),
agec6 = c(NA, NA, 6, 6, 9, NA),
agec7 = c(NA, NA, NA, NA, 7, NA),
agec8 = c(NA, NA, NA, NA, 5, NA),
agec5_12 = c(7, 12, 11, 11, 10, NA)),
row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))
Notes about my data:
Ages are not always in the same chronological order i.e. youngest to oldest or oldest to youngest
It is possible for a row to have no children aged within this range (in which case I would like NA to be returned)
I have tried writing a function and applying it using rowwise and mutate:
fun.age5_12 <- function(x){
x[which(x == max(x[(x > 4) & (x < 13)], na.rm = TRUE))]
}
rpage <- rp %>%
select(-c(20:21, 199:200)) %>%
rowwise() %>%
mutate(agec5_12 = fun.age5_12(c(1:8)))
However, this returns all obs as "12". Ideally I would like to do this using dplyr. Any suggestions using mutate or ifelse and not necessarily with functions are fine.
Thank you
I know you wanted tidyverse but here's one base R way:
data.frame(
agec1 = c(7, 16, 11, 11, 17, 17),
agec2 = c(6, 12, 9, 9, 16, 15),
agec3 = c(2, 9, 9, 9, 14, NA),
agec4 = c(NA, 7, 9, 9, 13, NA),
agec5 = c(NA, 4, 7, 7, 10, NA),
agec6 = c(NA, NA, 6, 6, 9, NA),
agec7 = c(NA, NA, NA, NA, 7, NA),
agec8 = c(NA, NA, NA, NA, 5, NA),
stringsAsFactors = FALSE
) -> rp
for (i in 1:nrow(rp)) {
agec5_12 <- unlist(rp[i,], use.names = FALSE)
agec5_12 <- agec5_12[agec5_12 >= 5 & agec5_12 <= 12 & !is.na(agec5_12)]
rp[i, "agec5_12"] <- if (length(agec5_12)) max(agec5_12) else NA_integer_
}
rp
## agec1 agec2 agec3 agec4 agec5 agec6 agec7 agec8 agec5_12
## 1 7 6 2 NA NA NA NA NA 7
## 2 16 12 9 7 4 NA NA NA 12
## 3 11 9 9 9 7 6 NA NA 11
## 4 11 9 9 9 7 6 NA NA 11
## 5 17 16 14 13 10 9 7 5 10
## 6 17 15 NA NA NA NA NA NA NA
The for loop shows the idiom, but an sapply() solution is a lot faster:
rp$agec5_12 <- sapply(1:nrow(rp), function(i) {
agec5_12 <- unlist(rp[i,], use.names = FALSE)
agec5_12 <- agec5_12[agec5_12 >= 5 & agec5_12 <= 12 & !is.na(agec5_12)]
if (length(agec5_12)) max(agec5_12) else NA_integer_
})
I think an apply solution for such a problem will always be simpler and more readable than a dplyr (I am assuming you meant tidyverse) solution, but since you asked, here is one way -
library(dplyr)
library(tidyr)
library(tibble) # provides rownames_to_column()
rp %>%
rownames_to_column("parent_id") %>%
gather(variable, value, -parent_id) %>%
group_by(parent_id) %>%
arrange(parent_id, desc(value)) %>%
mutate(
agec5_12 = value[between(value, 5, 12)][1]
) %>%
ungroup() %>%
spread(variable, value) %>%
select(3:10, 2)
# A tibble: 6 x 9
agec1 agec2 agec3 agec4 agec5 agec6 agec7 agec8 agec5_12
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 7 6 2 NA NA NA NA NA 7
2 16 12 9 7 4 NA NA NA 12
3 11 9 9 9 7 6 NA NA 11
4 11 9 9 9 7 6 NA NA 11
5 17 16 14 13 10 9 7 5 10
6 17 15 NA NA NA NA NA NA NA
Another base R solution. We can use replace to set numbers outside the range of 5 to 12 to NA, and then use apply with function(x) ifelse(all(is.na(x)), NA, max(x, na.rm = TRUE)) to find the maximum for each row. You could also use max directly, but for rows where all elements are NA, max(x, na.rm = TRUE) returns -Inf (with a warning).
rp$agec5_12 <- apply(replace(rp, rp > 12 | rp < 5, NA), 1,
function(x) ifelse(all(is.na(x)), NA, max(x, na.rm = TRUE)))
Or use do.call and pmax.
rp$agec5_12 <- do.call(pmax, c(replace(rp, rp > 12 | rp < 5, NA), na.rm = TRUE))
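The difference between the two functions on all-NA input is easy to check in isolation. This small sketch uses a hypothetical vector, all_missing, to show why the apply version needs the all(is.na(x)) guard while the pmax version does not.

```r
# A "row" in which every age fell outside the range and became NA
all_missing <- c(NA_real_, NA_real_)

# max() over an empty set of values gives -Inf (and warns)
from_max <- suppressWarnings(max(all_missing, na.rm = TRUE))

# pmax() returns NA when every value at a position is NA
from_pmax <- pmax(NA_real_, NA_real_, na.rm = TRUE)

from_max
from_pmax
```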
Here is a performance comparison of the three base R methods so far. do.call with pmax seems to be the fastest one.
library(microbenchmark)
perf <- microbenchmark(
m1 = {sapply(1:nrow(rp), function(i) {
agec5_12 <- unlist(rp[i,], use.names = FALSE)
agec5_12 <- agec5_12[agec5_12 >= 5 & agec5_12 <= 12 & !is.na(agec5_12)]
if (length(agec5_12)) max(agec5_12) else NA_integer_
})},
m2 = {
apply(replace(rp, rp > 12 | rp < 5, NA), 1,
function(x) ifelse(all(is.na(x)), NA, max(x, na.rm = TRUE)))
},
m3 = {rp$agec5_12 <- do.call(pmax, c(replace(rp, rp > 12 | rp < 5, NA), na.rm = TRUE))
}, times = 1000L)
perf
# Unit: microseconds
# expr min lq mean median uq max neval cld
# m1 505.318 559.2935 860.3941 608.386 1231.937 9844.699 1000 b
# m2 526.394 568.0325 831.6851 629.205 1207.262 4748.342 1000 b
# m3 384.514 425.1250 635.3154 465.736 918.362 8992.393 1000 a
DATA
rp <- data.frame(
agec1 = c(7, 16, 11, 11, 17, 17),
agec2 = c(6, 12, 9, 9, 16, 15),
agec3 = c(2, 9, 9, 9, 14, NA),
agec4 = c(NA, 7, 9, 9, 13, NA),
agec5 = c(NA, 4, 7, 7, 10, NA),
agec6 = c(NA, NA, 6, 6, 9, NA),
agec7 = c(NA, NA, NA, NA, 7, NA),
agec8 = c(NA, NA, NA, NA, 5, NA)
)
Since you asked for it, here's a pure dplyr way to do this -
max5_12 <- function(x) {
a <- sort(x, decreasing = T)
a[a >= 5 & a <= 12][1]
}
rp %>%
t() %>%
as.data.frame() %>%
bind_rows(
summarise_all(., max5_12)
) %>%
t() %>%
as.data.frame() %>%
setNames(c(names(rp), "agec5_12"))
agec1 agec2 agec3 agec4 agec5 agec6 agec7 agec8 agec5_12
V1 7 6 2 NA NA NA NA NA 7
V2 16 12 9 7 4 NA NA NA 12
V3 11 9 9 9 7 6 NA NA 11
V4 11 9 9 9 7 6 NA NA 11
V5 17 16 14 13 10 9 7 5 10
V6 17 15 NA NA NA NA NA NA NA
The most straightforward way I can think of to accomplish this uses dplyr, purrr and tidyr:
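library(dplyr)
library(purrr)
library(tidyr)
rp %>%
# rp has no id column, so create one to nest by
mutate(id = row_number()) %>%
group_by(id) %>%
nest() %>%
# for each row, keep the ages in 5..12 and take the largest (NA if none)
mutate(agec5_12 = map_dbl(data, function(d) {
ages <- unlist(d)
ages <- ages[!is.na(ages) & ages >= 5 & ages <= 12]
if (length(ages)) max(ages) else NA_real_
})) %>%
unnest(cols = data) %>%
ungroup() %>%
select(-id)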
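Since the question asked for rowwise and mutate, the following is a minimal sketch of that route. It assumes dplyr 1.0.0 or later (for c_across), and oldest_in_range is a hypothetical helper name introduced here for illustration.

```r
library(dplyr)

rp <- data.frame(
  agec1 = c(7, 16, 11, 11, 17, 17),
  agec2 = c(6, 12, 9, 9, 16, 15),
  agec3 = c(2, 9, 9, 9, 14, NA),
  agec4 = c(NA, 7, 9, 9, 13, NA),
  agec5 = c(NA, 4, 7, 7, 10, NA),
  agec6 = c(NA, NA, 6, 6, 9, NA),
  agec7 = c(NA, NA, NA, NA, 7, NA),
  agec8 = c(NA, NA, NA, NA, 5, NA)
)

# Hypothetical helper: largest age within [lo, hi], or NA if there is none
oldest_in_range <- function(ages, lo = 5, hi = 12) {
  ok <- ages[!is.na(ages) & ages >= lo & ages <= hi]
  if (length(ok)) max(ok) else NA_real_
}

# c_across() collects the age columns of the current row into one vector
rpage <- rp %>%
  rowwise() %>%
  mutate(agec5_12 = oldest_in_range(c_across(agec1:agec8))) %>%
  ungroup()
rpage
```

Unlike the attempt in the question, the helper receives the actual values of the row rather than the column positions 1:8.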
I'm working on a table that contains a lot of NAs and numerically coded answers, and it looks like this:
structure(list(ID = c(101, 102, 103, 104, 105, 106, 107, 108, 109, 110), a = c(NA, 9, NA, NA, NA, NA, NA, NA, NA, NA), b = c(NA, 10, 9, 9, NA, NA, 2, NA, NA,NA), c = c(NA, NA, NA, 9, 1, NA, NA, 4, 11, 9), d = c(NA, NA, NA, NA, 8, NA, NA, 7, 9, 9), e = c(NA, NA, NA, NA, 9, NA, NA, 8, NA, 9), f = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), g = c(NA, NA, NA, NA, NA, NA, NA, 9, NA, NA)), .Names = c("ID", "a", "b", "c", "d", "e", "f", "g"), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"))
What I am trying to do is delete rows that contain only the number 9 (apart from NAs).
In this case, IDs 103, 104, and 110 qualify. I want those 3 rows to be removed.
I tried the code below
df1[rowSums(df1[-1]==9)==0,]
But because of the NAs in the table, it only returns a table of NAs.
Please help :( !
You can use apply to check for the whole row:
df1[apply(df1[,-1], 1, function(x) !all(na.omit(x) == 9) | all(is.na(x))), ]
# ID a b c d e f g
# 1 101 NA NA NA NA NA NA NA
# 2 102 9 10 NA NA NA NA NA
# 5 105 NA NA 1 8 9 NA NA
# 6 106 NA NA NA NA NA NA NA
# 7 107 NA 2 NA NA NA NA NA
# 8 108 NA NA 4 7 8 NA 9
# 9 109 NA NA 11 9 NA NA NA
I use na.omit to get rid of the NA-values in each row and then check if all the remaining values are equal to 9.
There's probably a much more efficient way, but the following works, I believe:
df1[!(apply(df1[-1] == 9, 1, prod, na.rm = TRUE) * !apply(is.na(df1[-1]), 1, prod)), ]
You can use the na.rm argument to ignore the NAs:
df1[rowSums(df1[-1]==9, na.rm = TRUE) == 0, ]
But also note that this code will only keep the rows that don't have any 9, which isn't exactly what you are asking for in the question.
edit after comment:
in that case simply flip:
df1[rowSums(df1[-1]!=9, na.rm = TRUE) > 0, ]
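One thing to check with the flipped version is what happens to rows that are entirely NA: they contain no non-9 value, so they are dropped too, whereas the apply answer keeps them. If such rows should stay, a sketch along these lines may help; the small df1 here is hypothetical and only mimics the structure in the question.

```r
# Hypothetical miniature of the question's table
df1 <- data.frame(
  ID = 101:104,
  a = c(NA, 9, NA, 9),
  b = c(NA, 10, 9, NA)
)

vals <- df1[-1]

# TRUE if the row has at least one non-NA value other than 9
has_non9 <- rowSums(vals != 9, na.rm = TRUE) > 0

# TRUE if the row has no answers at all
all_na <- rowSums(!is.na(vals)) == 0

# Keep rows with some other answer, plus rows that are entirely NA
kept_rows <- df1[has_non9 | all_na, ]
kept_rows
```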