I am trying to deal with some aggregated data. I would like to have the data in a tidy format, but I am not sure how to do this without ending up with several value columns. What is the correct way to organize this data? I have searched around but can't find anything.
Here is an example:
#create the dataframe
df <- data.frame('date' = seq(as.Date('2019-01-15'), as.Date('2019-04-15'), 'months'),
                 'total' = c(2, 4, 1, 6),
                 'age.0_6' = c(1, 4, 0, 3),
                 'age.7_12' = c(1, 0, 1, 3),
                 'race.white' = c(1, 2, 0, 2),
                 'race.black' = c(1, 2, 1, 2),
                 'race.other' = c(0, 0, 1, 2))
#print the dataframe
df
date total age.0_6 age.7_12 race.white race.black race.other
1 2019-01-15 2 1 1 1 1 0
2 2019-02-15 4 4 0 2 2 0
3 2019-03-15 1 0 1 0 1 1
4 2019-04-15 6 3 3 2 2 2
The problem here is that I don't know the individual categories, as the data is all aggregated. For example, for April 2019, I don't know if the races for ages 0-6 are:
2 other and 1 white; or
2 white and 1 black; or
1 black, 1 white and 1 other.
Because of this I can't get unique columns for each variable with one value per outcome, so I can't tidy in the usual way.
Instead, I can pivot age and race separately and keep a value column for each. The first, easy problem is renaming the value variable; the bigger problem is that I still end up with lots of variables, each with its own value column.
Here is a quick example:
df %>%
pivot_longer(c(age.0_6, age.7_12), names_to = 'age') %>% #pivot age data
mutate(age = gsub('[a-z]+\\.', '', age)) %>% #clean the age variable
pivot_longer(c(race.white, race.black, race.other), names_to = 'race', values_to = 'count') %>% #pivot the race data (use 'count' instead of 'value')
mutate(race = gsub('[a-z]+\\.', '', race)) #clean the race data
# A tibble: 24 x 6
date total age value race count
<date> <dbl> <chr> <dbl> <chr> <dbl>
1 2019-01-15 2 0_6 1 white 1
2 2019-01-15 2 0_6 1 black 1
3 2019-01-15 2 0_6 1 other 0
4 2019-01-15 2 7_12 1 white 1
5 2019-01-15 2 7_12 1 black 1
6 2019-01-15 2 7_12 1 other 0
7 2019-02-15 4 0_6 4 white 2
8 2019-02-15 4 0_6 4 black 2
9 2019-02-15 4 0_6 4 other 0
10 2019-02-15 4 7_12 0 white 2
# ... with 14 more rows
This is clearly not a tidy format, and the data is pretty unmanageable. The problem rapidly becomes huge when I have a large number of age brackets, a large number of race categories, and a host of other aggregated characteristics: gender, disability, income bracket, and so on.
Any thoughts on the best way to organize data of this sort? I am assuming it is common enough and there is best practice.
I think you have a few options that might make sense, depending on how you want to use the data. For visualizing the data, I think it's enough to pivot the whole thing longer (#1 below). For analysis within each dimension, it might be safest and least presumptuous to keep them as separate tables (#2), since, as you noted, there are a huge number of ways the dimensions could conceivably relate to each other. If you want to show all the dimensions together, you will need to make assumptions about how they relate. In #3 I assume the dimensions are completely uncorrelated, but in real samples this is rarely the case and may lead to incorrect conclusions (see, e.g., examples of Simpson's paradox).
1. Make dimension a variable in a longer table
Here we just make the dimension of data (total / race / age) one column, and the value another.
library(tidyverse)
long_all <- df %>%
pivot_longer(-date) %>%
separate(name, c("dimension", "category"),
fill = "right", extra = "merge")
This might make sense if you want to go right to visualization, where you could either filter by dimension or assign them to facets:
ggplot(long_all, aes(category, value)) +
geom_col() +
facet_wrap(~dimension, scales = "free_x" )
2. Make into multiple tables
You don't know how the dimensions relate to each other, so one clean method would be to keep them distinct. Then we could analyze each separately with a table focused on that dimension.
race <- df %>%
select(date, contains("race")) %>%
pivot_longer(-date) %>%
separate(name, c("dimension", "category"),
fill = "right", extra = "merge")
age <- df %>%
select(date, contains("age")) %>%
pivot_longer(-date) %>%
separate(name, c("dimension", "category"),
fill = "right", extra = "merge")
3. Impute hypothetical individuals
If you need to include both dimensions, you will have to make assumptions about how they relate. You might posit, for instance, that race and age are perfectly independent of each other in the sample (this is likely a faulty assumption, so it should be noted). To create hypothetical crosstabs this way, you could create hypothetical individuals and have each sample without replacement from the various ages and races. The result will be one possibility of how the original summary data could have arisen, but it might well omit patterns that exist in the true underlying data.
set.seed(42)
shuffle_step <- function(df) {
  df %>%
    uncount(value) %>%                          # one row per counted individual
    slice_sample(prop = 1, replace = FALSE) %>% # shuffle the rows
    group_by(date) %>%
    mutate(row_in_date = row_number()) %>%      # index individuals within each date
    ungroup()
}
imputed_individuals <- full_join(
age %>%
shuffle_step %>%
select(date, row_in_date, age = category),
race %>%
shuffle_step %>%
select(date, row_in_date, race = category),
by = c("date", "row_in_date"))
Here, I make a row for each individual within each date with a possible category value, either for race or age. Then we join the two resulting data sets together, giving one possible set of individuals who would produce the same summary stats we started with, assuming the dimensions are uncorrelated.
We see here that one more individual was assigned a race than were counted in the age or total dimensions; they show up with an NA age at the bottom of the list. It's likely a typo in the source data, but such misalignment is common in real-world data collection, so it's good practice to accommodate the possibility of inconsistent values.
> imputed_individuals
# A tibble: 14 x 4
date row_in_date age race
<date> <int> <chr> <chr>
 1 2019-02-15 1 0_6  black
 2 2019-04-15 1 0_6  black
 3 2019-01-15 1 0_6  black
 4 2019-04-15 2 7_12 black
 5 2019-04-15 3 0_6  other
 6 2019-02-15 2 0_6  white
 7 2019-04-15 4 7_12 white
 8 2019-04-15 5 0_6  other
 9 2019-01-15 2 7_12 white
10 2019-02-15 3 0_6  white
11 2019-02-15 4 0_6  black
12 2019-03-15 1 7_12 other
13 2019-04-15 6 7_12 white
14 2019-03-15 2 NA black
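The mismatch is also visible directly in the summary data: summing each dimension by date shows race counting one more individual than age on 2019-03-15. A quick check using long_all from above:
long_all %>%
  filter(dimension != "total") %>%
  group_by(date, dimension) %>%
  summarise(n = sum(value), .groups = "drop")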
We can confirm that this hypothetical scenario is consistent with our original data:
long_all %>%
filter(dimension == "age") %>%
left_join(
imputed_individuals %>% count(date, age),
by = c("date", "category" = "age"))
# A tibble: 8 x 5
date dimension category value n
<date> <chr> <chr> <dbl> <int>
1 2019-01-15 age 0_6  1 1
2 2019-01-15 age 7_12 1 1
3 2019-02-15 age 0_6  4 4
4 2019-02-15 age 7_12 0 NA
5 2019-03-15 age 0_6  0 NA
6 2019-03-15 age 7_12 1 1
7 2019-04-15 age 0_6  3 3
8 2019-04-15 age 7_12 3 3
long_all %>%
filter(dimension == "race") %>%
left_join(
imputed_individuals %>% count(date, race),
by = c("date", "category" = "race"))
# A tibble: 12 x 5
date dimension category value n
<date> <chr> <chr> <dbl> <int>
1 2019-01-15 race white 1 1
2 2019-01-15 race black 1 1
3 2019-01-15 race other 0 NA
4 2019-02-15 race white 2 2
5 2019-02-15 race black 2 2
6 2019-02-15 race other 0 NA
7 2019-03-15 race white 0 NA
8 2019-03-15 race black 1 1
9 2019-03-15 race other 1 1
10 2019-04-15 race white 2 2
11 2019-04-15 race black 2 2
12 2019-04-15 race other 2 2
Related
I have a system which records sanctions against clients' names.
There should only ever be one active sanction, yet there are some cases where there are multiple active sanctions.
I would like to know how I can count how many people had two or more simultaneously-active sanctions over the past three years (sample data ranges from 2019-2022, so this won't need to be filtered in the solution).
The way I would work this out is to detect those cases where start_date2 occurs before end_date1.
Sample data (note that the end_date values are random, so some may fall before their corresponding start_date values; this is just sample data, so take it with a pinch of salt):
set.seed(147)
sanc <-
data.frame(
client = rep(1:200, each = 5),
start_date = sample(seq(as.Date("2019-01-01"), as.Date("2022-01-01"), by = "day"), 1000),
end_date = sample(seq(as.Date("2019-01-01"), as.Date("2022-01-01"), by = "day"), 1000)
)
sanc$start_month_year = format(as.Date(sanc$start_date, "%Y-%m-%d"), "%Y-%m")
The algorithm in my mind goes like this:
for each client
check if there was more than one active sanction at any one time
look for cases where start_date2/start_date3/start_dateY occurs before end_date1/end_date2/end_dateX
group by month-year (using month_year column)
The output I am looking for is a monthly breakdown, indicating how many simultaneous sanctions occurred per month. Something like this:
01-2020: 10
02-2020: 35
03-2020: 29
...
01-2022: 5
I believe I have covered everything, but I am happy to clarify anything if required.
Updated, given clarifications in comment section
If we do this without regard to client, then we have something like this:
library(dplyr)

sanc %>%
  arrange(start_date) %>%
  mutate(same_as_prev = start_date < lag(end_date) |
           (row_number() == 1 & end_date > lead(start_date))) %>%
  group_by(start_month_year) %>%
  summarize(simActive = sum(same_as_prev))
Output:
# A tibble: 37 x 2
start_month_year simActive
<chr> <int>
1 2019-01 29
2 2019-02 26
3 2019-03 30
4 2019-04 26
5 2019-05 25
6 2019-06 19
7 2019-07 19
8 2019-08 26
9 2019-09 21
10 2019-10 23
# ... with 27 more rows
It seems that in your sample data all the clients had only one row, so I've adjusted it so that each of 200 clients has 5 rows. I then do something rather simple:
sanc %>%
  as_tibble() %>%
  # episode counter: increments whenever a sanction starts after the previous row's ended
  group_by(client, active = cumsum(start_date > lag(end_date) & row_number() > 1)) %>%
  # keep only episodes containing overlapping sanctions
  filter(n() > 1) %>%
  ungroup() %>%
  distinct(client, active) %>%
  # count the overlap episodes per client
  count(client, name = "simActive")
This returns a list of clients, along with the number of times the client had simultaneous active sanctions.
Output:
# A tibble: 193 x 2
client simActive
<int> <int>
1 1 1
2 2 1
3 3 2
4 4 1
5 5 2
6 6 2
7 7 1
8 8 1
9 9 1
10 10 1
# ... with 183 more rows
So for client 1, there was one time when there were 2 or more active sanctions. The data for client 1 (see input below) looks like this; this client had rows 3 and 4 active at the same time.
client start_date end_date start_month_year
1 1 2019-03-18 2019-09-25 2019-03
2 1 2020-10-19 2019-12-03 2020-10
3 1 2021-03-11 2019-11-26 2021-03
4 1 2020-07-06 2021-09-03 2020-07
5 1 2021-05-11 2019-09-06 2021-05
Input:
set.seed(147)
sanc <-
data.frame(
client = rep(1:200, each = 5),
start_date = sample(seq(as.Date("2019-01-01"), as.Date("2022-01-01"), by = "day"), 1000),
end_date = sample(seq(as.Date("2019-01-01"), as.Date("2022-01-01"), by = "day"), 1000)
)
sanc$start_month_year = format(as.Date(sanc$start_date, "%Y-%m-%d"), "%Y-%m")
Here is another way to do it. It might not be very performant, but the approach should yield correct results; see my inline comments for how it works. Note also that I adjusted your sample data: you sampled random start and end dates without ensuring that start_date < end_date. I changed this so that each start_date is earlier than its end_date.
set.seed(147)
library(dplyr)
library(lubridate)

sanc <-
  tibble(
    client = sample(1:500, 1000, replace = TRUE),
    start_date = sample(seq(as.Date("2019-01-01"), as.Date("2022-06-01"), by = "day"), 1000),
    end_date = round(runif(1000, min = 1, max = 150), 0) + start_date
  )
sanc %>%
# make each sanction a `lubridate::interval`
mutate(int = interval(start_date, end_date)) %>%
# group_by month and client
group_by(month = format(start_date, "%Y-%m"), client) %>%
# use `lubridate::int_overlaps` to compare all intervals
summarise(overlap = list(outer(int, int, int_overlaps))) %>%
# apply to each row ...
rowwise() %>%
# to get only the lower triangle of each matrix and sum it up
mutate(overlap = sum(overlap[lower.tri(overlap)])) %>%
# now group by month
group_by(month) %>%
# and count the overlapping sanction pairs in each month
summarise(overlap = sum(overlap))
#> `summarise()` has grouped output by 'month'. You can override using the `.groups` argument.
#> # A tibble: 42 x 2
#> month overlap
#> <chr> <int>
#> 1 2019-01 0
#> 2 2019-02 0
#> 3 2019-03 0
#> 4 2019-04 0
#> 5 2019-05 0
#> 6 2019-06 1
#> 7 2019-07 1
#> 8 2019-08 2
#> 9 2019-09 1
#> 10 2019-10 3
#> # ... with 32 more rows
Created on 2022-03-09 by the reprex package (v2.0.1)
In R, I need to find which treatments are occurring concurrently and work out what the dose for that day would be. I need to do this by patient, so presumably using a group_by statement in dplyr.
user_id  treatment  dosage  treatment_start  treatment_end
1        1          3       01/28/2019       07/30/2019
1        1          2       05/26/2019       11/25/2019
1        2          1       08/13/2019       02/12/2020
1        1          2       12/06/2019       04/07/2020
1        2          1       12/09/2019       06/10/2020
Ideally the final form of it will be the user id, the treatments they're on, the sum of the dosage of all treatments, and the dates that they're on all of those treatments. I've made an example results table with a few rows below.
user_id  treatments  total_dosage  treatment_start  treatment_end
1        1           3             01/28/2019       05/25/2019
1        1           5             05/26/2019       07/30/2019
1        1           2             07/31/2019       08/12/2019
1        1,2         3             08/13/2019       11/25/2019
I worked out how to find whether an event overlaps with other events, but it doesn't get the resulting dates and doesn't sum the dosages, so I don't know if it's usable. In this case, course is a combination of the treatment and dosage columns.
DF %>%
  group_by(user_id) %>%
  mutate(overlap = purrr::map2_chr(
    treatment_start, treatment_end,
    ~ toString(course[.x >= treatment_start & .x < treatment_end |
                        .y > treatment_start & .y < treatment_end]))) %>%
  ungroup()
This is an interesting question. One way is to expand the dataframe to have one row for each day, and then summarise the data by date:
library(tidyverse)
library(lubridate)
dat %>%
# Convert dates to date format
mutate(across(treatment_start:treatment_end, ~ mdy(.x))) %>%
# Expand the dataframe
group_by(user_id, treatment_start, treatment_end) %>%
mutate(date = list(seq(treatment_start, treatment_end, by = "day"))) %>%
unnest(date) %>%
# Summarise by day
group_by(user_id, date) %>%
summarise(dosage = sum(dosage),
treatment = toString(unique(treatment))) %>%
# Summarise by different dosage (and create periods)
group_by(user_id, treatment, dosage) %>%
summarise(treatment_start = min(date),
treatment_ends = max(date)) %>%
arrange(treatment_start)
Output:
user_id treatment dosage treatment_start treatment_ends
<int> <chr> <int> <date> <date>
1 1 1 3 2019-01-28 2019-05-25
2 1 1 5 2019-05-26 2019-07-30
3 1 1 2 2019-07-31 2019-08-12
4 1 1, 2 3 2019-08-13 2020-04-07
5 1 2 1 2019-11-26 2020-06-10
6 1 2, 1 3 2019-12-06 2019-12-08
7 1 2, 1 4 2019-12-09 2020-02-12
I have a data frame containing data that looks something like this:
df <- data.frame(
group1 = c("High","High","High","Low","Low","Low"),
group2 = c("male","female","male","female","male","female"),
one = c("yes","yes","yes","yes","no","no"),
two = c("no","yes","no","yes","yes","yes"),
three = c("yes","no","no","no","yes","yes")
)
I want to summarise the counts of yes/no in the variables one, two, and three, which normally I would do with df %>% group_by(group1, group2, one) %>% summarise(n()). Is there any way I can summarise all three columns and bind them into one output data frame without having to repeat the code manually for each column? I've tried a for loop, but I can't get group_by() to recognize the column name I give it as input.
Get the data in long format and count:
library(dplyr)
library(tidyr)
df %>% pivot_longer(cols = one:three) %>% count(group1, group2, value)
# group1 group2 value n
# <chr> <chr> <chr> <int>
#1 High female no 1
#2 High female yes 2
#3 High male no 3
#4 High male yes 3
#5 Low female no 2
#6 Low female yes 4
#7 Low male no 1
#8 Low male yes 2
This can be done in dplyr alone (no need for tidyr::pivot_*), though it gives a slightly different output format. (It works even without rowwise(), presumably because c_across() concatenates the selected columns within each group, so the comparison runs over every value in the group.)
df <- data.frame(
group1 = c("High","High","High","Low","Low","Low"),
group2 = c("male","female","male","female","male","female"),
one = c("yes","yes","yes","yes","no","no"),
two = c("no","yes","no","yes","yes","yes"),
three = c("yes","no","no","no","yes","yes")
)
library(dplyr)
df %>%
group_by(group1, group2) %>%
summarise(yes_count = sum(c_across(everything()) == 'yes'),
no_count = sum(c_across(one:three) == 'no'), .groups = 'drop')
#> # A tibble: 4 x 4
#> group1 group2 yes_count no_count
#> <chr> <chr> <int> <int>
#> 1 High female 2 1
#> 2 High male 3 3
#> 3 Low female 4 2
#> 4 Low male 2 1
Created on 2021-05-12 by the reprex package (v2.0.0)
Using data.table
library(data.table)
melt(setDT(df), id.vars = c('group1', 'group2'))[, .(n = .N),
    .(group1, group2, value)]
Output:
group1 group2 value n
1: High male yes 3
2: High female yes 2
3: Low female yes 4
4: Low male no 1
5: Low female no 2
6: High male no 3
7: Low male yes 2
8: High female no 1
With base R, we can use by and table
by(df[3:5], df[1:2], function(x) table(unlist(x)))
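The printed result is one yes/no table per group1/group2 combination, roughly like this (first two groups shown):
group1: High
group2: female

 no yes 
  1   2 
------------------------------------------------------------
group1: Low
group2: female

 no yes 
  2   4 
...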
I have a dataframe in R that looks like:
ID <- c(1,1,1,2,2,3,3)
times <- c("2021-02-01", "2021-02-02", "2021-02-05","2021-02-01","2021-02-02", "2021-02-05", "2021-02-09")
dat <- data.frame(times=times, ID=ID)
> dat
times ID
1 2021-02-01 1
2 2021-02-02 1
3 2021-02-05 1
4 2021-02-01 2
5 2021-02-02 2
6 2021-02-05 3
7 2021-02-09 3
I would like to sort this into a tally by date: for each date, count how many users appeared on that date and appeared again within a 2-day interval. Since ID 1 appears on 2021-02-01 and reappears on 2021-02-02, ID 1 is counted in. ID 2 is also counted, as it appears on 2021-02-01 and again on 2021-02-02. The resulting data frame would look like:
> dat_result
times counts
1 2021-02-01 2
2 2021-02-02 0
3 2021-02-05 0
4 2021-02-09 0
Is there a way to achieve this with data.table or dplyr? Thanks.
A dplyr approach:
library(dplyr)
dat %>%
mutate(times1 = as.Date(times),
times = factor(times)) %>%
arrange(ID, times1) %>%
group_by(ID) %>%
filter(lead(times1) - times1 == 1) %>%
ungroup %>%
count(times, .drop = FALSE) %>%
mutate(times = as.Date(times))
# times n
# <date> <int>
#1 2021-02-01 2
#2 2021-02-02 0
#3 2021-02-05 0
#4 2021-02-09 0
For each ID, keep only those rows where the next date is exactly 1 day later, then count such rows per date (times is converted to a factor first so that, with .drop = FALSE, dates with zero matches still appear in the count).
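Since the question also asks about data.table, here is a rough equivalent under the same next-day criterion (a sketch, not benchmarked):
library(data.table)

dt <- as.data.table(dat)
dt[, times := as.Date(times)]
setorder(dt, ID, times)
# flag rows whose next appearance within the same ID is exactly 1 day later
dt[, hit := (shift(times, type = "lead") - times) == 1, by = ID]
# count flagged rows per date; dates with no matches yield 0
dt[, .(counts = sum(hit, na.rm = TRUE)), by = times]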
This is a question for all the Tidyverse experts out there. I have a dataset with lots of different classes (datetime, integer, factor, etc.) and want to use tidyr to gather multiple variables at the same time. In the reproducible example below I would like to gather time_, factor_ and integer_ at once, while id and gender remain untouched.
I am looking for the current best practice solution using any of the Tidyverse functions.
(I'd prefer if the solution isn't too "hacky" as I have a dataset with dozens of different key variables and around five hundred thousand rows).
Example data:
library("tidyverse")
data <- tibble(
id = c(1, 2, 3),
gender = factor(c("Male", "Female", "Female")),
time1 = as.POSIXct(c("2014-03-03 20:19:42", "2014-03-03 21:53:17", "2014-02-21 12:13:06")),
time2 = as.POSIXct(c("2014-05-28 15:26:49 UTC", NA, "2014-05-24 10:53:01 UTC")),
time3 = as.POSIXct(c(NA, "2014-09-26 00:52:40 UTC", "2014-09-27 07:08:47 UTC")),
factor1 = factor(c("A", "B", "C")),
factor2 = factor(c("B", NA, "C")),
factor3 = factor(c(NA, "A", "B")),
integer1 = c(1, 3, 2),
integer2 = c(1, NA, 4),
integer3 = c(NA, 5, 2)
)
Desired outcome:
# A tibble: 9 x 5
id gender Time Integer Factor
<dbl> <fct> <dttm> <dbl> <fct>
1 1 Male 2014-03-03 20:19:42 1 A
2 2 Female 2014-03-03 21:53:17 3 B
3 3 Female 2014-02-21 12:13:06 2 C
4 1 Male 2014-05-28 15:26:49 1 B
5 2 Female NA NA NA
6 3 Female 2014-05-24 10:53:01 4 C
7 1 Male NA NA NA
8 2 Female 2014-09-26 00:52:40 5 A
9 3 Female 2014-09-27 07:08:47 2 B
P.S. I did find a couple of threads that scratch the surface of gathering multiple variables, but none deal with gathering different classes or describe the current state-of-the-art Tidyverse solution.
Probably too repetitive for what you want, but recoding the classes with mutate at the end may be an option when dealing with a large number of variables. Changing them all to character at the start preserves the time data, which then needs to be converted back to datetime at the end:
data %>%
mutate_all(as.character) %>%
gather(key = variable, value = value, -id, -gender, convert = T) %>%
mutate(wave = readr::parse_number(variable),
variable = gsub("\\d","", x = variable)) %>%
spread(variable, value, convert = T) %>%
mutate(time = as.POSIXct(time),
factor = factor(factor),
gender = factor(gender)) %>%
select(1, 2, 6, 5, 4)
# A tibble: 9 x 5
id gender time integer factor
<chr> <fct> <dttm> <int> <fct>
1 1 Male 2014-03-03 20:19:42 1 A
2 1 Male 2014-05-28 15:26:49 1 B
3 1 Male NA NA NA
4 2 Female 2014-03-03 21:53:17 3 B
5 2 Female NA NA NA
6 2 Female 2014-09-26 00:52:40 5 A
7 3 Female 2014-02-21 12:13:06 2 C
8 3 Female 2014-05-24 10:53:01 4 C
9 3 Female 2014-09-27 07:08:47 2 B
(I'm rewriting basically all of my previous answer, but keeping it as this post to preserve the comments.)
You can use some of the tidyselect helper functions, namely starts_with, to select batches of columns to gather, and then drop the superfluous ones. This handles some of the data-type issues with gathering, because you're gathering sets of columns of the same type together, but it still requires re-coercing Factor into a factor, because different factor levels are present when gathering (see the warning message).
What I had trouble grasping was how the gathered columns would "move" while keeping some pattern with the id and gender columns. Doing a series of gather calls doesn't keep the pattern you want, but you can do each gather call separately and join the results back together.
Here's one:
library(tidyverse)
data %>%
select(id, gender, starts_with("time")) %>%
gather(key = key_time, value = Time, starts_with("time"))
#> # A tibble: 9 x 4
#> id gender key_time Time
#> <dbl> <fct> <chr> <dttm>
#> 1 1 Male time1 2014-03-03 20:19:42
#> 2 2 Female time1 2014-03-03 21:53:17
#> 3 3 Female time1 2014-02-21 12:13:06
#> 4 1 Male time2 2014-05-28 15:26:49
#> 5 2 Female time2 NA
#> 6 3 Female time2 2014-05-24 10:53:01
#> 7 1 Male time3 NA
#> 8 2 Female time3 2014-09-26 00:52:40
#> 9 3 Female time3 2014-09-27 07:08:47
To do all of these, you can map over the prefixes—"time," "factor," and "integer"—and reduce-join them together. The trick is that you need some unique identifier for each row in order to join properly; for this, I added a column with row_number, use it as a joining column, then drop it.
map(c("time", "factor", "integer"), function(p) {
val_name <- str_to_title(p)
data %>%
select(id, gender, starts_with(p)) %>%
gather(key = key, value = !!val_name, starts_with(p)) %>%
select(-key) %>%
mutate(row = row_number())
}) %>%
reduce(left_join) %>%
select(-row)
#> Warning: attributes are not identical across measure variables;
#> they will be dropped
#> Joining, by = c("id", "gender", "row")
#> Joining, by = c("id", "gender", "row")
#> # A tibble: 9 x 5
#> id gender Time Factor Integer
#> <dbl> <fct> <dttm> <chr> <dbl>
#> 1 1 Male 2014-03-03 20:19:42 A 1
#> 2 2 Female 2014-03-03 21:53:17 B 3
#> 3 3 Female 2014-02-21 12:13:06 C 2
#> 4 1 Male 2014-05-28 15:26:49 B 1
#> 5 2 Female NA <NA> NA
#> 6 3 Female 2014-05-24 10:53:01 C 4
#> 7 1 Male NA <NA> NA
#> 8 2 Female 2014-09-26 00:52:40 A 5
#> 9 3 Female 2014-09-27 07:08:47 B 2
It's a little ugly, and won't fit well in a piped workflow already underway, but you could easily enough wrap it in a function:
gather_by_prefix <- function(.data, prefix) {
map(prefix, function(p) {
val_name <- str_to_title(p)
.data %>%
select(id, gender, starts_with(p)) %>%
gather(key = key, value = !!val_name, starts_with(p)) %>%
select(-key) %>%
mutate(row = row_number())
}) %>%
reduce(left_join) %>%
select(-row)
}
Calling it like so gets the same output as above:
data %>%
gather_by_prefix(c("time", "factor", "integer"))
As for keeping factor levels, I think you'll unfortunately need to coerce them back afterwards. There are other questions on possible ways around it.
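For instance, a minimal sketch (result stands in for the joined tibble above, and the factor levels are assumed known):
result %>%
  mutate(Factor = factor(Factor, levels = c("A", "B", "C")))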
It's also worth noting that the tidyr GitHub repo has several issues filed on work toward a multi_gather-type function, likely for use cases like yours. I'm not sure whether those would cover factor conversion.