I have a dataset in R with two columns of numerical data and one with an identifier. Some of the rows share the same identifier (i.e. they are the same individual) but contain different data. I want to use the identifier to move the rows that share an identifier into additional columns of a single row. There are currently 600 rows, but there should be 400.
Can anyone share R code that might do this? I am new to R, and have tried the reshape package (cast), but I can't really follow it, and am not sure it's exactly what I'm trying to do.
Any help gratefully appreciated.
UPDATE:
Current
ID Age Sex
1 3 1
1 5 1
1 6 1
1 7 1
2 1 2
2 12 2
2 5 2
3 3 1
Expected output
ID Age Sex Age2 Sex2 Age3 Sex3 Age4 Sex4
1 3 1 5 1 6 1 7 1
2 1 2 12 2 5 2
3 3 1
UPDATE 2:
So far I have tried using the melt and dcast functions from reshape2. I am getting there, but it still doesn't look quite right. Here is my code:
x <- melt(example, id.vars = "ID")
x$time <- ave(x$ID, x$ID, FUN = seq_along)
example2 <- dcast(x, ID ~ time, value.var = "value")
and here is the output using that code:
ID A B C D E F G H (for clarity I have labelled these columns)
1 3 5 6 7 1 1 1 1
2 1 12 5 2 2 2
3 3 1
So, as you can probably see, it is mixing up the 'Sex' and 'Age' variables and combining them in the same column. For example, column D has the value '7' for person 1 (Age4) but '2' for person 2 (Sex1). I can see that my code is not telling dcast where each numerical value should be cast to, but I do not know how to code that part. Any ideas?
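For reference, one way to tell dcast where each value belongs is to number the repeats within each ID and variable (rather than within ID alone) and then cast on both; a minimal sketch using the data shown above:
library(reshape2)
example <- data.frame(ID  = c(1, 1, 1, 1, 2, 2, 2, 3),
                      Age = c(3, 5, 6, 7, 1, 12, 5, 3),
                      Sex = c(1, 1, 1, 1, 2, 2, 2, 1))
x <- melt(example, id.vars = "ID")
# index the repeats within each ID *and* variable
x$time <- ave(seq_along(x$ID), x$ID, x$variable, FUN = seq_along)
# cast on variable and time so Age and Sex land in separate columns
dcast(x, ID ~ variable + time, value.var = "value")
# gives columns ID, Age_1 ... Age_4, Sex_1 ... Sex_4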
Here's an approach using gather, spread and unite from the tidyr package:
suppressPackageStartupMessages(library(tidyverse))
x <- tribble(
~ID, ~Age, ~Sex,
1, 3, 1,
1, 5, 1,
1, 6, 1,
1, 7, 1,
2, 1, 2,
2, 12, 2,
2, 5, 2,
3, 3, 1
)
x %>% group_by(ID) %>%
mutate(grp = 1:n()) %>%
gather(var, val, -ID, -grp) %>%
unite("var_grp", var, grp, sep ='') %>%
spread(var_grp, val, fill = '')
#> # A tibble: 3 x 9
#> # Groups: ID [3]
#> ID Age1 Age2 Age3 Age4 Sex1 Sex2 Sex3 Sex4
#> * <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 1 3 5 6 7 1 1 1 1
#> 2 2 1 12 5 2 2 2
#> 3 3 3 1
If you prefer to keep the columns numeric, just remove the fill = '' argument from spread(var_grp, val, fill = ''); the missing combinations will then be NA.
Other questions which might help with this include:
R spreading multiple columns with tidyr
How can I spread repeated measures of multiple variables into wide format?
I have recently come across a similar issue in my data and wanted to provide an update using the tidyr 1.0 functions, as gather and spread have been superseded. The new pivot_longer and pivot_wider are currently much slower than gather and spread, especially on very large datasets, but this is supposedly fixed in the next update of tidyr, so I hope this updated solution is useful to people.
library(tidyr)
library(dplyr)
x %>%
group_by(ID) %>%
mutate(grp = 1:n()) %>%
pivot_longer(-c(ID, grp), names_to = "var", values_to = "val") %>%
unite("var_grp", var, grp, sep = "") %>%
pivot_wider(names_from = var_grp, values_from = val)
#> # A tibble: 3 x 9
#> # Groups: ID [3]
#> ID Age1 Sex1 Age2 Sex2 Age3 Sex3 Age4 Sex4
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 3 1 5 1 6 1 7 1
#> 2 2 1 2 12 2 5 2 NA NA
#> 3 3 3 1 NA NA NA NA NA NA
I have a data.frame with a group variable and an integer variable, with missing data.
df <- data.frame(group = c(1, 1, 2, 2, 3, 3), a = as.integer(c(1, 2, NA, NA, 1, NA)))
I want to compute the maximum available value of variable a within each group: in my example, I should get 2 for group 1, NA for group 2 and 1 for group 3.
library(dplyr)
df %>%
  group_by(group) %>%
  mutate(max.a = case_when(sum(!is.na(a)) == 0 ~ NA_integer_,
                           TRUE ~ max(a, na.rm = TRUE)))
The above code generates an error, seemingly because in group 2 all values of a are missing, so max(a, na.rm = TRUE) returns -Inf, which is not an integer.
Why is this branch computed for group 2 even though the condition is false, as the following verification confirms?
df %>% group_by(group) %>% mutate(test=sum(!is.na(a))==0)
I found a workaround by converting a to double, but I still get a warning, and I am not satisfied that I could not find a better solution.
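That workaround might look something like this (a sketch of what is described above; the warning from max() remains):
df %>%
  group_by(group) %>%
  mutate(a = as.double(a),
         max.a = case_when(sum(!is.na(a)) == 0 ~ NA_real_,
                           TRUE ~ max(a, na.rm = TRUE)))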
case_when evaluates all the RHS expressions irrespective of whether their conditions are satisfied, hence you get an error. You may use hablar::max_, which returns NA if all the values are NA.
library(dplyr)
df %>%
group_by(group) %>%
mutate(max.a= hablar::max_(a)) %>%
ungroup
# group a max.a
# <dbl> <int> <int>
#1 1 1 2
#2 1 2 2
#3 2 NA NA
#4 2 NA NA
#5 3 1 1
#6 3 NA 1
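To see that eager evaluation directly, here is a small demonstration (a sketch, not from the original answer): the RHS of the FALSE branch is still computed, as its message shows.
noisy <- function(x) { message("RHS evaluated"); x }
case_when(TRUE ~ 1, FALSE ~ noisy(2))
#> RHS evaluated
#> [1] 1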
Instead of making use of case_when, I would suggest using an if () statement like so:
library(dplyr)
df <- data.frame(group = c(1, 1, 2, 2, 3, 3), a = as.integer(c(1, 2, NA, NA, 1, NA)))
df %>%
group_by(group) %>%
mutate(max.a = if (all(is.na(a))) NA_real_ else max(a, na.rm = T))
#> # A tibble: 6 x 3
#> # Groups: group [3]
#> group a max.a
#> <dbl> <int> <dbl>
#> 1 1 1 2
#> 2 1 2 2
#> 3 2 NA NA
#> 4 2 NA NA
#> 5 3 1 1
#> 6 3 NA 1
This code gives a warning, and it returns -Inf rather than NA for groups in which all values are missing, but otherwise it works.
library(dplyr)
df %>%
group_by(group) %>%
dplyr::summarise(max.a = max(a, na.rm=TRUE))
Output:
group max.a
<dbl> <dbl>
1 1 2
2 2 -Inf
3 3 1
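If NA is preferred over -Inf, one option (a sketch building on the code above) is to convert the infinite values back to NA afterwards:
df %>%
  group_by(group) %>%
  summarise(max.a = max(a, na.rm = TRUE)) %>%
  mutate(max.a = ifelse(is.infinite(max.a), NA, max.a))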
Here is the data frame:
df <- data.frame(number = c(1,1,2,2,2,3,3),
heahache = c(1,1,na,na,na,1,na),
pain = c(na,1,1,na,1,na,na),
futigue = c(na,na,1,na,1,1,1))
number headache pain futigue
1 1 na na
1 1 1 na
2 na 1 1
2 na na na
2 na 1 1
3 1 na 1
3 na na 1
The first result that I want is a count of how many times each symptom appeared, like this:
number headache pain futigue
1 2 1 0
2 0 2 2
3 1 0 2
The second result is to calculate how many symptoms each person got, like this:
number symptoms
1 2
2 2
3 2
Since the real data set has 50+ columns describing different symptoms, does anyone have ideas for managing a large data set? Thank you.
First, tidy your data (note the corrections of typos: na should be NA, heahache should be headache and futigue should be fatigue):
library(tidyverse)
df <- data.frame(number = c(1,1,2,2,2,3,3),
headache = c(1,1,NA,NA,NA,1,NA),
pain = c(NA,1,1,NA,1,NA,NA),
fatigue = c(NA,NA,1,NA,1,1,1))
longDF <- df %>%
pivot_longer(
cols=c(headache, pain, fatigue),
names_to="Symptom",
values_to="Present"
) %>%
replace_na(list(Present=0))
Then to count appearances:
longDF %>%
group_by(number, Symptom) %>%
summarise(Count=sum(Present)) %>%
pivot_wider(
names_from=Symptom,
values_from=Count
)
# A tibble: 3 x 4
# Groups: number [3]
number fatigue headache pain
<dbl> <dbl> <dbl> <dbl>
1 1 0 2 1
2 2 2 0 2
3 3 2 1 0
and the number of symptoms experienced by each number:
longDF %>%
filter(Present == 1) %>%
group_by(number) %>%
summarise(symptoms=length(unique(Symptom)))
# A tibble: 3 x 2
number symptoms
* <dbl> <int>
1 1 2
2 2 2
3 3 2
Note that this final calculation will omit numbers who do not experience any symptoms. Handling that case requires a little more work. To show the problem, add a number who experienced no symptoms:
newDF <- longDF %>%
add_row(number=4, Symptom="headache", Present=0) %>%
add_row(number=4, Symptom="fatigue", Present=0) %>%
add_row(number=4, Symptom="pain", Present=0)
Demonstrate the problem:
newDF %>%
filter(Present == 1) %>%
group_by(number) %>%
summarise(symptoms=length(unique(Symptom)))
# A tibble: 3 x 2
number symptoms
* <dbl> <int>
1 1 2
2 2 2
3 3 2
And solve it:
newDF %>%
filter(Present == 1) %>%
group_by(number) %>%
summarise(symptoms=length(unique(Symptom))) %>%
right_join(newDF %>% distinct(number), by="number") %>%
replace_na(list(symptoms=0))
# A tibble: 4 x 2
number symptoms
<dbl> <dbl>
1 1 2
2 2 2
3 3 2
4 4 0
We can just use summarise from dplyr and don't need any additional packages. For a larger dataset, reshaping could be costly. I would recommend summarising first and using rowSums (vectorized and efficient) to create the 'Symptoms' column.
library(dplyr)
df %>%
group_by(number) %>%
summarise(across(everything(), ~ sum(!is.na(.))))
-output
# A tibble: 3 x 4
number headache pain fatigue
* <dbl> <int> <int> <int>
1 1 2 1 0
2 2 0 2 2
3 3 1 0 2
If we need the symptoms column
df %>%
group_by(number) %>%
summarise(across(everything(), ~ sum(!is.na(.)))) %>%
mutate(Symptoms = rowSums(.[-1] > 0))
# A tibble: 3 x 5
# number headache pain fatigue Symptoms
#* <dbl> <int> <int> <int> <dbl>
#1 1 2 1 0 2
#2 2 0 2 2 2
#3 3 1 0 2 2
data
df <- structure(list(number = c(1, 1, 2, 2, 2, 3, 3), headache = c(1,
1, NA, NA, NA, 1, NA), pain = c(NA, 1, 1, NA, 1, NA, NA), fatigue = c(NA,
NA, 1, NA, 1, 1, 1)), class = "data.frame", row.names = c(NA,
-7L))
So my problem is as follows, I have a small data frame like this:
test_df <- data.frame(id=c(1,1,2,2,2), ttype=c("D", "C", "D", "D", "C"), val=c(1, 5, 10, 5, 100))
test_df
  id ttype val
1  1     D   1
2  1     C   5
3  2     D  10
4  2     D   5
5  2     C 100
Now I want to make it wider to end up like this:
  id   C  D n
1  1   5  1 2
2  2 100 15 3
So I want to replace ttype with a column for each of its values, grouped by id, containing the summed val. But my problem is that I still want to keep track of how many rows of either type (C or D) occurred in total for each id, which is n in this case.
Now I found a way to do this, but it is very ugly. But this way works:
test_df %>%
  group_by(id, ttype) %>%
  summarise(val = sum(val), n = n()) %>%
  pivot_wider(names_from = ttype, values_from = c(val, n), values_fill = 0) %>%
  mutate(n = n_C + n_D) %>%
  select(-n_C, -n_D)
results in:
# A tibble: 2 x 4
# Groups:   id [2]
     id val_C val_D     n
  <dbl> <dbl> <dbl> <int>
1     1     5     1     2
2     2   100    15     3
So here the counts of C and D are included separately, after which I sum them and remove both extra columns. But this means I have to hardcode column names, which makes it not really doable when there are more than two values in ttype.
I feel like there must be a simple way to do this, but I can't figure it out.
You can add a count of id rows as a new column and get the data in wide format with pivot_wider, taking the sum of the val values.
library(dplyr)
library(tidyr)
test_df %>%
add_count(id) %>%
pivot_wider(names_from = ttype, values_from = val, values_fn = sum)
# id n D C
# <dbl> <int> <dbl> <dbl>
#1 1 2 1 5
#2 2 3 15 100
I have a data frame:
df = tibble(a=c(7,6,10,12,12), b=c(3,5,8,8,7), c=c(4,4,12,15,20), week=c(1,2,3,4,5))
# A tibble: 5 x 4
a b c week
<dbl> <dbl> <dbl> <dbl>
1 7 3 4 1
2 6 5 4 2
3 10 8 12 3
4 12 8 15 4
5 12 7 20 5
and I want, for every column a, b and c, the first week in which the observation equals or exceeds 10.
I.e. for column a it would be week 3, for column b it would be NA, and for column c it would be week 3 as well.
A desired outcome could look like this:
tibble(abc=c("a", NA, "b"), value=c(10, NA, 12), week=c(3, NA, 3))
# A tibble: 3 x 3
abc value week
<chr> <dbl> <dbl>
1 a 10 3
2 b NA NA
3 c 12 3
One way would be to get the data in long format and, for each column name, select the first value that is greater than or equal to 10. We fill the missing combinations with complete.
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = -week, names_to = 'abc') %>%
group_by(abc) %>%
slice(which(value >= 10)[1]) %>%
ungroup %>%
complete(abc = names(df)[-4])
# A tibble: 3 x 3
# abc week value
# <chr> <dbl> <dbl>
#1 a 3 10
#2 b NA NA
#3 c 3 12
Another way is to first calculate what we want and then transform the dataset into long format.
df %>%
summarise(across(a:c, list(week = ~week[which(. >= 10)[1]],
value = ~.[. >= 10][1]))) %>%
pivot_longer(cols = everything(),
names_to = c('abc', '.value'),
names_sep = "_")
I need to calculate the Euclidean distance between the first and current row in a data frame. Each row is keyed by (group, month) and has a list of values. In the toy example below the key is c(month, student) and the values are in c(A, B). I want to create a distance column C that's equal to sqrt((A_i - A_1)^2 + (B_i - B_1)^2).
So far I managed to spread my data and pull each group's first values into new columns. While I could create the formula by hand in the toy example, in my actual data I have very many columns instead of just 2. I believe I could create the squared differences within the mutate_all, and then do a row sum and take the square root of that, but no luck so far.
df <- data.frame(month=rep(1:3,2),
student=rep(c("Amy", "Bob"), each=3),
A=c(9, 6, 6, 8, 6, 9),
B=c(6, 2, 8, 5, 6, 7))
# Pull in each column's first values for each group
df %>%
group_by(student) %>%
mutate_all(list(first = first)) %>%
# TODO: Calculate the distance, i.e. SQRT(sum_i[(x_i - x_1)^2]).
#Output:
month student A B month_first A_first B_first
1 1 Amy 9 6 1 9 6
2 2 Amy 6 2 1 9 6
...
Desired output:
#Output:
month student A B month_first A_first B_first dist_from_first
1 1 Amy 9 6 1 9 6 0
2 2 Amy 6 2 1 9 6 5
...
Here is another way using compact dplyr code; it can be used for any number of columns.
library(dplyr)
df %>%
  select(-month) %>%
  group_by(student) %>%
  mutate(across(everything(), ~ (first(.x) - .x)^2)) %>%
  ungroup() %>%
  mutate(euc.dist = sqrt(rowSums(select(., -1))))
# A tibble: 6 x 4
student A B euc.dist
<chr> <dbl> <dbl> <dbl>
1 Amy 0 0 0
2 Amy 9 16 5
3 Amy 9 4 3.61
4 Bob 0 0 0
5 Bob 4 1 2.24
6 Bob 1 4 2.24
Edit: added alternative formulation using a join. I expect that approach will be much faster for a very wide data frame with many columns to compare.
Approach 1: To get the Euclidean distance for a large number of columns, one way is to rearrange the data so each row shows one month, one student, and one original column (e.g. A or B in the OP), but with two columns representing the current month's value and the first value. Then we can square the differences and, grouping by student and month, sum them across all the original columns to get the Euclidean distance (the root of the summed squared differences, labelled RMS in the code below) for each student-month.
library(tidyverse)
df %>%
group_by(student) %>%
mutate_all(list(first = first)) %>%
ungroup() %>%
# gather into long form; make col show variant, col2 show orig column
gather(col, val, -c(student, month, month_first)) %>%
mutate(col2 = col %>% str_remove("_first")) %>%
mutate(col = if_else(col %>% str_ends("_first"),
"first",
"comparison")) %>%
spread(col, val) %>%
mutate(square_dif = (comparison - first)^2) %>%
group_by(student, month) %>%
summarize(RMS = sqrt(sum(square_dif)))
# A tibble: 6 x 3
# Groups: student [2]
student month RMS
<fct> <int> <dbl>
1 Amy 1 0
2 Amy 2 5
3 Amy 3 3.61
4 Bob 1 0
5 Bob 2 2.24
6 Bob 3 2.24
Approach 2. Here, a long version of the data is joined to a version that is just the earliest month for each student.
library(tidyverse)
df_long <- gather(df, col, val, -c(month, student))
df_long %>% left_join(df_long %>%
group_by(student) %>%
top_n(-1, wt = month) %>%
rename(first_val = val) %>%
select(-month),
by = c("student", "col")) %>%
mutate(square_dif = (val - first_val)^2) %>%
group_by( student, month) %>%
summarize(RMS = sqrt(sum(square_dif)))
# A tibble: 6 x 3
# Groups: student [2]
student month RMS
<fct> <int> <dbl>
1 Amy 1 0
2 Amy 2 5
3 Amy 3 3.61
4 Bob 1 0
5 Bob 2 2.24
6 Bob 3 2.24
Instead of the mutate_all call, it'd be easier to directly calculate the dist_from_first. The only thing I'm unclear about is whether month should be included in the group_by() statement.
library(tidyverse)
df <- tibble(month=rep(1:3,2),
student=rep(c("Amy", "Bob"), each=3),
A=c(9, 6, 6, 8, 6, 9),
B=c(6, 2, 8, 5, 6, 7))
df %>%
  group_by(student) %>%
  mutate(dist_from_first = sqrt((A - first(A))^2 + (B - first(B))^2)) %>%
  ungroup()
# A tibble: 6 x 5
# month student A B dist_from_first
# <int> <chr> <dbl> <dbl> <dbl>
#1 1 Amy 9 6 0
#2 2 Amy 6 2 5
#3 3 Amy 6 8 3.61
#4 1 Bob 8 5 0
#5 2 Bob 6 6 2.24
#6 3 Bob 9 7 2.24
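Since the real data has many more columns than A and B, a possible generalisation of the same idea (a sketch, assuming every column other than month and student holds values to compare) squares the differences across all of those columns and takes rowSums:
df %>%
  group_by(student) %>%
  # square the difference from each group's first row, column by column,
  # then sum across columns and take the square root
  mutate(dist_from_first = sqrt(rowSums(across(-month, ~ (.x - first(.x))^2)))) %>%
  ungroup()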