This is a question for all the Tidyverse experts out there. I have a dataset with many different classes (datetime, integer, factor, etc.) and want to use tidyr to gather multiple variables at the same time. In the reproducible example below I would like to gather time_, factor_ and integer_ at once, while id and gender remain untouched.
I am looking for the current best practice solution using any of the Tidyverse functions.
(I'd prefer if the solution isn't too "hacky" as I have a dataset with dozens of different key variables and around five hundred thousand rows).
Example data:
library("tidyverse")
data <- tibble(
id = c(1, 2, 3),
gender = factor(c("Male", "Female", "Female")),
time1 = as.POSIXct(c("2014-03-03 20:19:42", "2014-03-03 21:53:17", "2014-02-21 12:13:06")),
time2 = as.POSIXct(c("2014-05-28 15:26:49 UTC", NA, "2014-05-24 10:53:01 UTC")),
time3 = as.POSIXct(c(NA, "2014-09-26 00:52:40 UTC", "2014-09-27 07:08:47 UTC")),
factor1 = factor(c("A", "B", "C")),
factor2 = factor(c("B", NA, "C")),
factor3 = factor(c(NA, "A", "B")),
integer1 = c(1, 3, 2),
integer2 = c(1, NA, 4),
integer3 = c(NA, 5, 2)
)
Desired outcome:
# A tibble: 9 x 5
id gender Time Integer Factor
<dbl> <fct> <dttm> <dbl> <fct>
1 1 Male 2014-03-03 20:19:42 1 A
2 2 Female 2014-03-03 21:53:17 3 B
3 3 Female 2014-02-21 12:13:06 2 C
4 1 Male 2014-05-28 15:26:49 1 B
5 2 Female NA NA NA
6 3 Female 2014-05-24 10:53:01 4 C
7 1 Male NA NA NA
8 2 Female 2014-09-26 00:52:40 5 A
9 3 Female 2014-09-27 07:08:47 2 B
P.S. I did find a couple of threads that scratch the surface of gathering multiple variables, but none deal with the issue of gathering different classes and describe the current state of the art Tidyverse solution.
Probably too repetitive for what you want, but when dealing with a large number of variables, using mutate_at to recode them all at the end may be an option.
Changing everything to character at the start preserves the time data; it then needs to be converted back to datetime at the end.
data %>%
mutate_all(funs(as.character)) %>%
gather(key = variable, value = value, -id, -gender, convert = T) %>%
mutate(wave = readr::parse_number(variable),
variable = gsub("\\d","", x = variable)) %>%
spread(variable, value, convert = T) %>%
mutate(time = as.POSIXct(time),
factor = factor(factor),
gender = factor(gender)) %>%
select(1, 2, 6, 5, 4)
# A tibble: 9 x 5
id gender time integer factor
<chr> <fct> <dttm> <int> <fct>
1 1 Male 2014-03-03 20:19:42 1 A
2 1 Male 2014-05-28 15:26:49 1 B
3 1 Male NA NA NA
4 2 Female 2014-03-03 21:53:17 3 B
5 2 Female NA NA NA
6 2 Female 2014-09-26 00:52:40 5 A
7 3 Female 2014-02-21 12:13:06 2 C
8 3 Female 2014-05-24 10:53:01 4 C
9 3 Female 2014-09-27 07:08:47 2 B
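If there are many columns to re-coerce at the end, the mutate_at idea mentioned above can batch those coercions. A minimal sketch of what that might look like (same idea as the pipeline above; mutate_at()/vars() is the older dplyr interface, and the column selections are based on this example data):
library(tidyverse)
data %>%
  mutate_all(as.character) %>%
  gather(key = variable, value = value, -id, -gender, convert = TRUE) %>%
  mutate(wave = readr::parse_number(variable),
         variable = gsub("\\d", "", x = variable)) %>%
  spread(variable, value, convert = TRUE) %>%
  # batch the re-coercions instead of one mutate() line per column
  mutate_at(vars(time), as.POSIXct) %>%
  mutate_at(vars(factor, gender), as.factor) %>%
  select(id, gender, time, integer, factor)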
(I'm rewriting basically all of my previous answer, but keeping it as this post to preserve the comments.)
You can use some of the tidyselect helper functions, namely starts_with, to select batches of columns to gather, and then drop the superfluous ones. This handles some of the data-type issues with gathering, because you're gathering sets of columns of the same type together, but it still requires re-coercing Factor into a factor afterwards because of the different factor levels present when gathering (see the warning message).
What I had trouble grasping was how the gathered columns would "move" while keeping some pattern with the id and gender columns. Doing a series of gather calls doesn't keep the pattern you want, but you can do each gather call and join them back together.
Here's one:
library(tidyverse)
data %>%
select(id, gender, starts_with("time")) %>%
gather(key = key_time, value = Time, starts_with("time"))
#> # A tibble: 9 x 4
#> id gender key_time Time
#> <dbl> <fct> <chr> <dttm>
#> 1 1 Male time1 2014-03-03 20:19:42
#> 2 2 Female time1 2014-03-03 21:53:17
#> 3 3 Female time1 2014-02-21 12:13:06
#> 4 1 Male time2 2014-05-28 15:26:49
#> 5 2 Female time2 NA
#> 6 3 Female time2 2014-05-24 10:53:01
#> 7 1 Male time3 NA
#> 8 2 Female time3 2014-09-26 00:52:40
#> 9 3 Female time3 2014-09-27 07:08:47
To do all of these, you can map over the prefixes ("time", "factor", and "integer") and reduce-join the results together. The trick is that you need a unique identifier for each row in order to join properly; for this, I add a column with row_number(), use it as a joining column, then drop it.
map(c("time", "factor", "integer"), function(p) {
val_name <- str_to_title(p)
data %>%
select(id, gender, starts_with(p)) %>%
gather(key = key, value = !!val_name, starts_with(p)) %>%
select(-key) %>%
mutate(row = row_number())
}) %>%
reduce(left_join) %>%
select(-row)
#> Warning: attributes are not identical across measure variables;
#> they will be dropped
#> Joining, by = c("id", "gender", "row")
#> Joining, by = c("id", "gender", "row")
#> # A tibble: 9 x 5
#> id gender Time Factor Integer
#> <dbl> <fct> <dttm> <chr> <dbl>
#> 1 1 Male 2014-03-03 20:19:42 A 1
#> 2 2 Female 2014-03-03 21:53:17 B 3
#> 3 3 Female 2014-02-21 12:13:06 C 2
#> 4 1 Male 2014-05-28 15:26:49 B 1
#> 5 2 Female NA <NA> NA
#> 6 3 Female 2014-05-24 10:53:01 C 4
#> 7 1 Male NA <NA> NA
#> 8 2 Female 2014-09-26 00:52:40 A 5
#> 9 3 Female 2014-09-27 07:08:47 B 2
It's a little ugly, and won't fit well in a piped workflow already underway, but you could easily enough wrap it in a function:
gather_by_prefix <- function(.data, prefix) {
map(prefix, function(p) {
val_name <- str_to_title(p)
.data %>%
select(id, gender, starts_with(p)) %>%
gather(key = key, value = !!val_name, starts_with(p)) %>%
select(-key) %>%
mutate(row = row_number())
}) %>%
reduce(left_join) %>%
select(-row)
}
Calling it like so gets the same output as above:
data %>%
gather_by_prefix(c("time", "factor", "integer"))
As for keeping factor levels, I think unfortunately you'll need to coerce it back afterwards. There are other questions on possible ways around it; here's one.
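For example, a minimal sketch of that re-coercion, tacked onto the pipeline above (it simply rebuilds the factor levels from whatever values are present, so any original level ordering is not preserved):
data %>%
  gather_by_prefix(c("time", "factor", "integer")) %>%
  mutate(Factor = factor(Factor))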
It's also worth noting that the tidyr GitHub repo has several issues filed on work to implement a multi_gather-type of function, likely for use cases like yours. Not sure if those would cover factor conversion.
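Since this answer was written, tidyr 1.0 introduced pivot_longer(), whose ".value" sentinel covers this multi-column case in one call. A hedged sketch, assuming your real columns follow the same <prefix><number> naming as the example (the factor columns still get their levels unioned when they are combined):
library(tidyverse)
data %>%
  pivot_longer(
    cols = -c(id, gender),
    names_to = c(".value", "wave"),
    names_pattern = "([a-z]+)(\\d+)"
  )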
Related
I have some duplicate IDs in my df, but I want only 1 row per ID. I cannot use unique() or distinct() because then some data would be erased, as the rows are not identical.
Please see this example:
# The style of df I have
df <- data.frame(IDs=c(1,1,2,3,4,4,4,5),
Intervention=c("Progesterone", "Stitch", NA, "Stitch", "Progesterone", "Stitch", "Pessary", "Progesterone"),
Other_data1= c(22,22,32,44,24,24,24,NA),
Other_data2=c("a","a","b","c","d","d","d","e"))
df
# IDs Intervention Other_data1 Other_data2
# 1 1 Progesterone 22 a
# 2 1 Stitch 22 a
# 3 2 <NA> 32 b
# 4 3 Stitch 44 c
# 5 4 Progesterone 24 d
# 6 4 Stitch 24 d
# 7 4 Pessary 24 d
# 8 5 Progesterone NA e
So if I used unique() I would lose the full information in the df$Intervention column.
Please could someone let me know how I can get the df into this format:
# The style of df I want
df_I_want <- data.frame(IDs=c(1,2,3,4,5),
Progesterone=c("Yes", NA, "No", "Yes", "Yes"),
Stitch=c("Yes", NA, "Yes", "Yes", "No"),
Pessary=c("No", NA, "No", "Yes", "No"),
Other_data1= c(22,32,44,24,NA),
Other_data2=c("a","b","c","d","e"))
df_I_want
# IDs Progesterone Stitch Pessary Other_data1 Other_data2
# 1 1 Yes Yes No 22 a
# 2 2 <NA> <NA> <NA> 32 b
# 3 3 No Yes No 44 c
# 4 4 Yes Yes Yes 24 d
# 5 5 Yes No No NA e
My real df contains thousands of rows x hundreds of columns, so I have many Other_data columns and cannot manually type out each one to exclude when reshaping the df. But there is only 1 column where the data differs by row within an ID, as in the above example with df$Intervention.
Here is another pivot_wider solution, but this one uses mutate and case_when to fill in the corresponding values under the newly expanded columns.
If all of the three newly expanded columns are NA, they should remain NA. Otherwise, treat NA as "No" and non-NA as "Yes".
Note that within across(), you should input the column positions (or column names) of the newly expanded columns (e.g. Progesterone, Stitch and Pessary are newly created and sit in positions 4 to 6, therefore 4:6).
Edit: Added length(unique(na.omit(df$Intervention))) when calculating and comparing the number of NAs across the newly expanded columns so that it's more dynamic
library(tidyverse)
df %>%
pivot_wider(names_from = Intervention, values_from = Intervention) %>%
select(-"NA") %>%
mutate(across(4:6, ~case_when(rowSums(is.na(across(4:6))) == length(unique(na.omit(df$Intervention))) ~ NA_character_,
is.na(.x) ~ "No",
!is.na(.x) ~ "Yes")))
# A tibble: 5 × 6
IDs Other_data1 Other_data2 Progesterone Stitch Pessary
<dbl> <dbl> <chr> <chr> <chr> <chr>
1 1 22 a Yes Yes No
2 2 32 b NA NA NA
3 3 44 c No Yes No
4 4 24 d Yes Yes Yes
5 5 NA e Yes No No
Updated: I have now updated the code so that you have "Yes" and "No" for each intervention.
You can use the function pivot_wider() to achieve this:
library(tidyverse)
df <- data.frame(IDs=c(1,1,2,3,4,4,4,5),
Intervention=c("Progesterone", "Stitch", NA, "Stitch", "Progesterone", "Stitch", "Pessary", "Progesterone"),
Other_data1= c(22,22,32,44,24,24,24,NA),
Other_data2=c("a","a","b","c","d","d","d","e"))
df %>%
pivot_wider(names_from = Intervention,
values_from = Intervention) %>%
select(-c(`NA`)) %>%
mutate(across(.cols = Progesterone:Pessary,
.fns = ~case_when(is.na(.) ~ "No",
TRUE ~ "Yes")))
#> # A tibble: 5 × 6
#> IDs Other_data1 Other_data2 Progesterone Stitch Pessary
#> <dbl> <dbl> <chr> <chr> <chr> <chr>
#> 1 1 22 a Yes Yes No
#> 2 2 32 b No No No
#> 3 3 44 c No Yes No
#> 4 4 24 d Yes Yes Yes
#> 5 5 NA e Yes No No
Created on 2022-08-19 with reprex v2.0.2
I have a data frame containing data that looks something like this:
df <- data.frame(
group1 = c("High","High","High","Low","Low","Low"),
group2 = c("male","female","male","female","male","female"),
one = c("yes","yes","yes","yes","no","no"),
two = c("no","yes","no","yes","yes","yes"),
three = c("yes","no","no","no","yes","yes")
)
I want to summarise the counts of yes/no in the variables one, two, and three, which I would normally do with df %>% group_by(group1, group2, one) %>% summarise(n()). Is there any way to summarise all three columns and then bind them into one output df without having to repeat the code manually for each column? I've tried a for loop, but I can't get group_by() to recognize the column name I am giving it as input.
Get the data in long format and count:
library(dplyr)
library(tidyr)
df %>% pivot_longer(cols = one:three) %>% count(group1, group2, value)
# group1 group2 value n
# <chr> <chr> <chr> <int>
#1 High female no 1
#2 High female yes 2
#3 High male no 3
#4 High male yes 3
#5 Low female no 2
#6 Low female yes 4
#7 Low male no 1
#8 Low male yes 2
This can be done in dplyr alone (no need for tidyr::pivot_*), though it gives a slightly different output format. (It works even without rowwise(), though I am not sure of the exact reason.)
df <- data.frame(
group1 = c("High","High","High","Low","Low","Low"),
group2 = c("male","female","male","female","male","female"),
one = c("yes","yes","yes","yes","no","no"),
two = c("no","yes","no","yes","yes","yes"),
three = c("yes","no","no","no","yes","yes")
)
library(dplyr)
df %>%
group_by(group1, group2) %>%
summarise(yes_count = sum(c_across(everything()) == 'yes'),
no_count = sum(c_across(one:three) == 'no'), .groups = 'drop')
#> # A tibble: 4 x 4
#> group1 group2 yes_count no_count
#> <chr> <chr> <int> <int>
#> 1 High female 2 1
#> 2 High male 3 3
#> 3 Low female 4 2
#> 4 Low male 2 1
Created on 2021-05-12 by the reprex package (v2.0.0)
Using data.table
library(data.table)
melt(setDT(df), id.var = c('group1', 'group2'))[, .(n = .N),
.(group1, group2, value)]
Output:
group1 group2 value n
1: High male yes 3
2: High female yes 2
3: Low female yes 4
4: Low male no 1
5: Low female no 2
6: High male no 3
7: Low male yes 2
8: High female no 1
With base R, we can use by and table
by(df[3:5], df[1:2], function(x) table(unlist(x)))
I am trying to deal with some aggregated data. I would like to have the data in a tidy format, but I am not sure how to do this without ending up with a number of value variables. What is the correct way to organize this data? I have searched around but can't find anything.
Here is an example:
#create the dataframe
df <- data.frame('date' = seq(as.Date('2019-01-15'), as.Date('2019-04-15'), 'months'),
'total' = c(2, 4, 1, 6),
'age.0-6' = c(1, 4, 0, 3),
'age.7-12' = c(1, 0, 1, 3),
'race.white' = c(1, 2, 0, 2),
'race.black' = c(1, 2, 1, 2),
'race.other' = c(0, 0, 1, 2))
#print the dataframe
df
date total age.0_6 age.7_12 race.white race.black race.other
1 2019-01-15 2 1 1 1 1 0
2 2019-02-15 4 4 0 2 2 0
3 2019-03-15 1 0 1 0 1 1
4 2019-04-15 6 3 3 2 2 2
The problem here is that I don't know the individual categories, as the data is all aggregated. For example, for April 2019, I don't know if the races for ages 0-6 are:
2 other and 1 white; or
2 white and 1 black; or
1 black, 1 white and 1 other.
Because of this I can't get unique columns for each variable with one value for each outcome. So I can't tidy in the usual way.
Instead, I can tidy age and race and have value columns for each. The easy part is changing the name of the value variable, but the bigger problem remains: I end up with lots of variables, each with its own value column.
Here is a quick example:
df %>%
pivot_longer(c(age.0_6, age.7_12), names_to = 'age') %>% #pivot age data
mutate(age = gsub('[a-z]+\\.', '', age)) %>% #clean the age variable
pivot_longer(c(race.white, race.black, race.other), names_to = 'race', values_to = 'count') %>% #pivot the race data (use 'count' instead of 'value')
mutate(race = gsub('[a-z]+\\.', '', race)) #clean the race data
# A tibble: 24 x 6
date total age value race count
<date> <dbl> <chr> <dbl> <chr> <dbl>
1 2019-01-15 2 0_6 1 white 1
2 2019-01-15 2 0_6 1 black 1
3 2019-01-15 2 0_6 1 other 0
4 2019-01-15 2 7_12 1 white 1
5 2019-01-15 2 7_12 1 black 1
6 2019-01-15 2 7_12 1 other 0
7 2019-02-15 4 0_6 4 white 2
8 2019-02-15 4 0_6 4 black 2
9 2019-02-15 4 0_6 4 other 0
10 2019-02-15 4 7_12 0 white 2
# ... with 14 more rows
This is clearly not a tidy format and the data is pretty unmanageable. The problem rapidly becomes huge when I have a large number of age brackets, a large number of race categories, and a host of other aggregated characteristics: gender, disability, income bracket etc. etc.
Any thoughts on the best way to organize data of this sort? I am assuming it is common enough and there is best practice.
I think you have a few options that might make sense, depending on how you want to use the data. For visualizing the data, I think it's enough to just pivot the whole thing longer (#1 below). For analysis within each dimension, it might be safest and least presumptuous to keep them as separate tables (#2), since as you noted there are a huge number of ways the dimensions could conceivably relate to each other. If you want to show all the dimensions together, you will need to make assumptions about how the dimensions relate to each other. In #3 I assume the dimensions are completely uncorrelated, but in real samples this is rarely the case, and may lead to incorrect conclusions. (e.g. see examples of Simpson's Paradox)
1. Make the dimension a variable in a longer table
Here we just make the dimension of data (total / race / age) one column, and the value another.
library(tidyverse)
long_all <- df %>%
pivot_longer(-date) %>%
separate(name, c("dimension", "category"),
fill = "right", extra = "merge")
This might make sense if you want to go right to visualization, where you could either filter by dimension or assign them to facets:
ggplot(long_all, aes(category, value)) +
geom_col() +
facet_wrap(~dimension, scales = "free_x" )
2. Make into multiple tables
You don't know how the dimensions relate to each other, so one clean method would be to keep them distinct. Then we could analyze each separately with a table focused on that dimension.
race <- df %>%
select(date, contains("race")) %>%
pivot_longer(-date) %>%
separate(name, c("dimension", "category"),
fill = "right", extra = "merge")
age <- df %>%
select(date, contains("age")) %>%
pivot_longer(-date) %>%
separate(name, c("dimension", "category"),
fill = "right", extra = "merge")
3. Impute hypothetical individuals
If you need to include both dimensions, you will have to make assumptions about how they relate. You might posit, for instance, that race and age are perfectly independent of each other in the sample (this is likely a faulty assumption, so should be noted). To create hypothetical crosstabs this way, you could create hypothetical individuals and have each sample without replacement from the various ages and races. The result will be one possibility of how the original summary data could have arisen, but might well omit patterns that exist in the true underlying data.
set.seed(42)
shuffle_step <- function(df) {
df %>%
uncount(value) %>%
slice_sample(prop = 1, replace = FALSE) %>%
group_by(date) %>%
mutate(row_in_date = row_number()) %>%
ungroup()
}
imputed_individuals <- full_join(
age %>%
shuffle_step %>%
select(date, row_in_date, age = category),
race %>%
shuffle_step %>%
select(date, row_in_date, race = category),
by = c("date", "row_in_date"))
Here, I make a row for each individual within each date with a possible category value, either for race or age. Then we join the two resulting data sets together, giving one possible set of individuals who would produce the same summary stats we started with, assuming the dimensions are uncorrelated.
We see here that one more individual was assigned a race than were counted in the age or total dimensions; they show up with an NA age at the bottom of the list. It's likely a typo, but such data misalignment is common in real-world data collection, so it's good practice to accommodate the possibility of inconsistent values.
> imputed_individuals
# A tibble: 14 x 4
date row_in_date age race
<date> <int> <chr> <chr>
1 2019-02-15 1 0.6 black
2 2019-04-15 1 0.6 black
3 2019-01-15 1 0.6 black
4 2019-04-15 2 7.12 black
5 2019-04-15 3 0.6 other
6 2019-02-15 2 0.6 white
7 2019-04-15 4 7.12 white
8 2019-04-15 5 0.6 other
9 2019-01-15 2 7.12 white
10 2019-02-15 3 0.6 white
11 2019-02-15 4 0.6 black
12 2019-03-15 1 7.12 other
13 2019-04-15 6 7.12 white
14 2019-03-15 2 NA black
We can confirm that this hypothetical scenario is consistent with our original data:
long_all %>%
filter(dimension == "age") %>%
left_join(
imputed_individuals %>% count(date, age),
by = c("date", "category" = "age"))
# A tibble: 8 x 5
date dimension category value n
<date> <chr> <chr> <dbl> <int>
1 2019-01-15 age 0.6 1 1
2 2019-01-15 age 7.12 1 1
3 2019-02-15 age 0.6 4 4
4 2019-02-15 age 7.12 0 NA
5 2019-03-15 age 0.6 0 NA
6 2019-03-15 age 7.12 1 1
7 2019-04-15 age 0.6 3 3
8 2019-04-15 age 7.12 3 3
long_all %>%
filter(dimension == "race") %>%
left_join(
imputed_individuals %>% count(date, race),
by = c("date", "category" = "race"))
# A tibble: 12 x 5
date dimension category value n
<date> <chr> <chr> <dbl> <int>
1 2019-01-15 race white 1 1
2 2019-01-15 race black 1 1
3 2019-01-15 race other 0 NA
4 2019-02-15 race white 2 2
5 2019-02-15 race black 2 2
6 2019-02-15 race other 0 NA
7 2019-03-15 race white 0 NA
8 2019-03-15 race black 1 1
9 2019-03-15 race other 1 1
10 2019-04-15 race white 2 2
11 2019-04-15 race black 2 2
12 2019-04-15 race other 2 2
My data came to me like this (but with 4000+ records). The following is data for 4 patients. Every time you see surgery OR age reappear, it is referring to a new patient.
col1 = c("surgery", "age", "weight","albumin","abiotics","surgery","age", "weight","BAPPS", "abiotics","surgery", "age","weight","age","weight","BAPPS","albumin")
col2 = c("yes","54","153","normal","2","no","65","134","yes","1","yes","61","210", "46","178","no","low")
testdat = data.frame(col1,col2)
So to say again, every time surgery or age appears (surgery isn't always there, but age always is), those records and the ones after pertain to the same patient, until surgery or age appears again.
Thus I somehow need to add an ID column with this data:
ID = c(1,1,1,1,1,2,2,2,2,2,3,3,3,4,4,4,4)
testdat$ID = ID
I know how to transpose and melt and all that to put the data into regular format, but how can I create that ID column?
Advice on relevant tags to use is helpful!
Assuming that surgery and age will be the first two pieces of information for each patient, and that each patient will have information that is not age or surgery afterward, this is a solution.
col1 = c("surgery", "age", "weight","albumin","abiotics","surgery","age", "weight","BAPPS", "abiotics","surgery", "age","weight","age","weight","BAPPS","albumin")
col2 = c("yes","54","153","normal","2","no","65","134","yes","1","yes","61","210", "46","178","no","low")
testdat = data.frame(col1,col2)
library(dplyr)
library(tidyr)
# Use a tibble and get rid of factors.
dfTest = as_tibble(testdat) %>%
mutate_all(as.character)
# A little dplyr magic to find the start of each new patient, then give them an id.
dfTest = dfTest %>%
mutate(couldBeStart = if_else(col1 == "surgery" | col1 == "age", T, F)) %>%
mutate(isStart = couldBeStart & !lag(couldBeStart, default = FALSE)) %>%
mutate(patientID = cumsum(isStart)) %>%
select(-couldBeStart, -isStart)
# # A tibble: 17 x 3
# col1 col2 patientID
# <chr> <chr> <int>
# 1 surgery yes 1
# 2 age 54 1
# 3 weight 153 1
# 4 albumin normal 1
# 5 abiotics 2 1
# 6 surgery no 2
# 7 age 65 2
# 8 weight 134 2
# 9 BAPPS yes 2
# 10 abiotics 1 2
# 11 surgery yes 3
# 12 age 61 3
# 13 weight 210 3
# 14 age 46 4
# 15 weight 178 4
# 16 BAPPS no 4
# 17 albumin low 4
# Get the data to a wide workable format.
dfTest %>% spread(col1, col2)
# # A tibble: 4 x 7
# patientID abiotics age albumin BAPPS surgery weight
# <int> <chr> <chr> <chr> <chr> <chr> <chr>
# 1 1 2 54 normal NA yes 153
# 2 2 1 65 NA yes no 134
# 3 3 NA 61 NA NA yes 210
# 4 4 NA 46 low no NA 178
Using dplyr:
library(dplyr)
testdat = testdat %>%
mutate(patient_counter = cumsum(col1 == 'surgery' | (col1 == 'age' & lag(col1 != 'surgery'))))
This works by checking whether the col1 value is either 'surgery' or 'age', provided 'age' is not preceded by 'surgery'. It then uses cumsum() to get the cumulative sum of the resulting logical vector.
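A quick sanity check (assuming testdat from the question is in scope): the counter should reproduce the ID vector the question asks for, so the following should return TRUE.
all(testdat$patient_counter == c(1,1,1,1,1, 2,2,2,2,2, 3,3,3, 4,4,4,4))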
You can try the following
keywords <- c('surgery', 'age')
lgl <- testdat$col1 %in% keywords
testdat$ID <- cumsum(c(0, diff(lgl)) == 1) + 1
col1 col2 ID
1 surgery yes 1
2 age 54 1
3 weight 153 1
4 albumin normal 1
5 abiotics 2 1
6 surgery no 2
7 age 65 2
8 weight 134 2
9 BAPPS yes 2
10 abiotics 1 2
11 surgery yes 3
12 age 61 3
13 weight 210 3
14 age 46 4
15 weight 178 4
16 BAPPS no 4
17 albumin low 4
I have a dataset in R with two columns of numerical data and one with an identifier. Some of the rows share the same identifier (i.e. they are the same individual) but contain different data. I want to use the identifier to move the rows that share an identifier into columns. There are currently 600 rows, but there should be 400.
Can anyone share R code that might do this? I am new to R, and have tried the reshape (cast) package, but I can't really follow it, and am not sure it's exactly what I'm trying to do.
Any help gratefully appreciated.
UPDATE:
Current
ID Age Sex
1 3 1
1 5 1
1 6 1
1 7 1
2 1 2
2 12 2
2 5 2
3 3 1
Expected output
ID Age Sex Age2 Sex2 Age3 Sex3 Age4 Sex4
1 3 1 5 1 6 1 7 1
2 1 2 12 2 5 2
3 3 1
UPDATE 2:
So far I have tried using the melt and dcast commands from reshape2. I am getting there, but it still doesn't look quite right. Here is my code:
x <- melt(example, id.vars = "ID")
x$time <- ave(x$ID, x$ID, FUN = seq_along)
example2 <- dcast (x, ID ~ time, value.var = "value")
and here is the output using that code:
ID A B C D E F G H (for clarity I have labelled these)
1 3 5 6 7 1 1 1 1
2 1 12 5 2 2 2
3 3 1
So, as you can probably see, it is mixing up the 'sex' and 'age' variables and combining them in the same column. For example column D has the value '7' for person 1 (age4), but '2' for person 2 (Sex). I can see that my code is not instructing where the numerical values should be cast to, but I do not know how to code that part. Any ideas?
Here's an approach using gather, spread and unite from the tidyr package:
suppressPackageStartupMessages(library(tidyverse))
x <- tribble(
~ID, ~Age, ~Sex,
1, 3, 1,
1, 5, 1,
1, 6, 1,
1, 7, 1,
2, 1, 2,
2, 12, 2,
2, 5, 2,
3, 3, 1
)
x %>% group_by(ID) %>%
mutate(grp = 1:n()) %>%
gather(var, val, -ID, -grp) %>%
unite("var_grp", var, grp, sep ='') %>%
spread(var_grp, val, fill = '')
#> # A tibble: 3 x 9
#> # Groups: ID [3]
#> ID Age1 Age2 Age3 Age4 Sex1 Sex2 Sex3 Sex4
#> * <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 1 3 5 6 7 1 1 1 1
#> 2 2 1 12 5 2 2 2
#> 3 3 3 1
If you prefer to keep the columns numeric then just remove the fill='' argument from spread(var_grp, val, fill = '').
Other questions which might help with this include:
R spreading multiple columns with tidyr
How can I spread repeated measures of multiple variables into wide format?
I have recently come across a similar issue in my data and wanted to provide an update using the tidyr 1.0 functions, as gather and spread have been retired. The new pivot_longer and pivot_wider are currently much slower than gather and spread, especially on very large datasets, but this is supposedly fixed in the next update of tidyr, so I hope this updated solution is useful to people.
library(tidyr)
library(dplyr)
x %>%
group_by(ID) %>%
mutate(grp = 1:n()) %>%
pivot_longer(-c(ID, grp), names_to = "var", values_to = "val") %>%
unite("var_grp", var, grp, sep = "") %>%
pivot_wider(names_from = var_grp, values_from = val)
#> # A tibble: 3 x 9
#> # Groups: ID [3]
#> ID Age1 Sex1 Age2 Sex2 Age3 Sex3 Age4 Sex4
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 3 1 5 1 6 1 7 1
#> 2 2 1 2 12 2 5 2 NA NA
#> 3 3 3 1 NA NA NA NA NA NA