This question already has answers here:
Reshape multiple values at once
(2 answers)
Closed 3 years ago.
Following the interaction in the previous post Reshaping database using reshape package I create this one to ask other question.
Briefly: I have a database with some rows that it replicate for Id column, I would like to transpose it. The following cose show a little example of my database.
test<-data.frame(Id=c(1,1,2,3),
St=c(20,80,80,20),
gap=seq(0.02,0.08,by=0.02),
gip=c(0.23,0.60,0.86,2.09),
gat=c(0.0107,0.989,0.337,0.663))
I would like a final database like this figure that I attached:
With one row for each Id value and the different columns attached.
Can you give me any suggestions?
You can use dcast from data.table. This function allows to spread multiple value variables.
library(data.table)
setDT(test) # convert test to a data.table
test1 <- dcast(test, Id ~ rowid(Id),
value.var = c('St', 'gap', 'gip', 'gat'), fill = 0)
test1
# Id St_1 St_2 gap_1 gap_2 gip_1 gip_2 gat_1 gat_2
#1: 1 20 80 0.02 0.04 0.23 0.6 0.0107 0.989
#2: 2 80 0 0.06 0.00 0.86 0.0 0.3370 0.000
#3: 3 20 0 0.08 0.00 2.09 0.0 0.6630 0.000
If you want to continue with a data.frame call setDF(test1) at the end.
A dplyr/tidyr alternative is to first gather to long format, group_by Id and key and create a sequential row identifier for each group (new_key) and finally spread it back to wide form.
library(dplyr)
library(tidyr)
test %>%
gather(key, value, -Id) %>%
group_by(Id, key) %>%
mutate(new_key = paste0(key, row_number())) %>%
ungroup() %>%
select(-key) %>%
spread(new_key, value, fill = 0)
# Id gap1 gap2 gat1 gat2 gip1 gip2 St1 St2
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 0.02 0.04 0.0107 0.989 0.23 0.6 20 80
#2 2 0.06 0 0.337 0 0.86 0 80 0
#3 3 0.08 0 0.663 0 2.09 0 20 0
Related
I have a long-formatted set of longitudinal data with two variables that change over time (cognitive_score and motor_score), subject id (labeled subjid) and age of each subject in days at the moment of measurement(labeled agedays). Measurements were taken twice.
I want to transform it to wide-formatted longitudinal dataset.
The problem is that agedays measurements are unique for each subject, and the only way to see which measurement entry was the first, and which was the second, is to check where the agedays is higher (agedays higher than in the other entry means second measurement, lower agedays means first measurement).
We thus have this dataset:
subjid agedays cognitive_score motor_score
<int> <int> <dbl> <dbl>
1 4900001 457 0.338 0.176
2 4900001 1035 0.191 0.216
3 4900002 639 0.25 0.176
4 4900002 1248 0.176 0.353
5 4900003 335 0.103 0.196
6 4900003 913 0.176 0.196
And what I tried was using reshape:
reshape(dataset_col, direction = "wide", idvar = "subjid", timevar = "agedays", v.names = c("cognitive_score", "motor_score"))
Where dataset_col is the name of the dataset.
What it does, however, is adding these two columns:
The numbers in the name of the columns seem to be the values of agedays variable.
Any advice on how I can do this?
With libraries dplyr and tidyr:
You can use group_by and mutate to determine which of the ids is the first measurement and which is the second measurement. In this case I used an ifelse statement to determine which measurement is the first, this only works if you have exactly 2 measurements for each subjid.
Then you can use pivot_wider to pivot your data to wide format based on your newly created names column.
library(dplyr)
library(tidyr)
df %>%
group_by(subjid) %>%
mutate(names = ifelse(agedays == min(agedays),"first","second")) %>%
pivot_wider(names_from = names, values_from = c(agedays,cognitive_score,motor_score))
# A tibble: 3 × 7
# Groups: subjid [3]
# subjid agedays_first agedays_second cognitive_score_first cognitive_score_second motor_score_first motor_score_second
# <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
# 1 4900001 457 1035 0.338 0.191 0.176 0.216
# 2 4900002 639 1248 0.25 0.176 0.176 0.353
# 3 4900003 335 913 0.103 0.176 0.196 0.196
I have a huge amount of DFs in R (>50), which correspond to different filtering I've performed, here's an example of 7 of them:
Steps_Day1 <- filter(PD2, Gait_Day == 1)
Steps_Day2 <- filter(PD2, Gait_Day == 2)
Steps_Day3 <- filter(PD2, Gait_Day == 3)
Steps_Day4 <- filter(PD2, Gait_Day == 4)
Steps_Day5 <- filter(PD2, Gait_Day == 5)
Steps_Day6 <- filter(PD2, Gait_Day == 6)
Steps_Day7 <- filter(PD2, Gait_Day == 7)
Each of the dataframes contains 19 variables, however I'm only interested in their speed (to calculate mean) and their subjectID, as each subject has multiple observations of speed in the same DF.
An example of the data we're interested in, in dataframe - Steps_Day1:
Speed SubjectID
0.6 1
0.7 1
0.7 2
0.8 2
0.1 2
1.1 3
1.2 3
1.5 4
1.7 4
0.8 4
The data goes up to 61 pts. and each particpants number of observations is much larger than this.
Now what I want to do, is create a code that automatically cycles through each of 50 dataframes (taking the 7 above as an example) and calculates the mean speed for each participant and stores this and saves it in a new dataframe, alongside the variables containing to mean for each participant in the other DFs.
An example of Steps day 1 (Values not accurate)
Speed SubjectID
0.6 1
0.7 2
1.2 3
1.7 4
and so on... Before I end up with a final DF containing in column vectors the means for each participant from each of the other data frames, which may look something like:
Steps_Day1 StepsDay2 StepsDay3 StepsDay4 SubjectID
0.6 0.8 0.5 0.4 1
0.7 0.9 0.6 0.6 2
1.2 1.1 0.4 0.7 3
1.7 1.3 0.3 0.8 4
I could do this through some horrible, messy long code - but looking to see if anyone has more intuitive ideas please!
:)
To add to the previous answer, I agree that it is much easier to do this without creating a new data frame for each day. Using some generated data, you can achieve your desired results as follows:
# Generate some data
df <- data.frame(
day = rep(1:5, 1, 100),
subject = rep(5:10, 1, 100),
speed = runif(500)
)
df %>%
group_by(day, subject) %>%
summarise(avg_speed = mean(speed)) %>%
pivot_wider(names_from = day,
names_prefix = "Steps_Day",
values_from = avg_speed)
# A tibble: 6 × 6
subject Steps_Day1 Steps_Day2 Steps_Day3 Steps_Day4 Steps_Day5
<int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 5 0.605 0.416 0.502 0.516 0.517
2 6 0.592 0.458 0.625 0.531 0.460
3 7 0.475 0.396 0.586 0.517 0.449
4 8 0.430 0.435 0.489 0.512 0.548
5 9 0.512 0.645 0.509 0.484 0.566
6 10 0.530 0.453 0.545 0.497 0.460
You don't include a MCVE of your dataset so I can't test out a solution, but it seems like a pretty simple problem using tidyverse solutions.
First, why do you split PD2 into separate dataframes? If you skip that, you can just use group and summarize to get the average for groups:
PD2 %>%
group_by(Gait_Day, SubjectID) %>%
summarize(Steps = mean(Speed))
This will give you a "long-form" data.frame with 3 variables: Gait_Day, SubjectID, and Steps, which has the mean speed for that subject and day. If you want it in the format you show at the end, just pivot into "wide-form" using pivot_wider. You can see this question for further explaination on that: How to reshape data from long to wide format
I need to prepare a table that includes the means and standards deviations for each level of several demographic variables and for many variables.
Consider the following data:
df <- tibble(place=c("London","Paris","London","Rome","Rome","Madrid","Madrid"),gender=c("m","f","f","f","m","m","f"), education = c(1,1,2,3,5,5,3), var1 = c(2.2,3.1,4.5,1,5,1.4,2.3),var2 = c(4.2,2.1,2.5,4,5,4.4,1.3),var3 = c(0.2,0.1,3.5,3,5,2.4,4.3))
I would like to get a dataframe that contains the grouping variables (place, gender, education) and their levels (e.g., London, Paris, etc.) in the first column and their means and standard deviations for each variable starting with var (var1, var2, var3) in additional columns.
I know how to do this for one group and several variables at a time. However, since I need to repeat this dozens of times I am looking for a way to automate this process. It would be great to have a function to which I simply need to pass (a) the names of the grouping variables (e.g., gender, education) and (b) the variables from which to get the M / SD (e.g. var1, var2).
The solution I look for should look like this (the stats are not correct in the example below):
my_results <- tibble(grouping_vars = c("place_London","place_Paris","place_Rome","place_Madrid","gender_m","gender_f","last_element"),mean_var1=c(1.3,2.5,4.5,1.7,2.5,3.6,4.0),sd_var1=c(0.01,0.41,0.21,0.12,0.02,0.38,0.28),mean_var2=c(4.3,4.5,4.0,1.2,2.5,1.6,2.3),sd_var2=c(0.21,0.1,0.1,0.32,0.22,0.18,0.08),mean_var3=c(2.3,2.5,2.0,3.2,3.5,0.6,5),sd_var3=c(0.51,0.15,0.51,0.52,0.52,0.15,0.48))
grouping_vars mean_var1 sd_var1 mean_var2 sd_var2 mean_var3 sd_var3
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 place_London 1.3 0.01 4.3 0.21 2.3 0.51
2 place_Paris 2.5 0.41 4.5 0.1 2.5 0.15
3 place_Rome 4.5 0.21 4 0.1 2 0.51
4 place_Madrid 1.7 0.12 1.2 0.32 3.2 0.52
5 gender_m 2.5 0.02 2.5 0.22 3.5 0.52
6 gender_f 3.6 0.38 1.6 0.18 0.6 0.15
7 last_element 4 0.28 2.3 0.08 5 0.48
Since I typically work with tidyverse, I would particularly appreciate solutions that use these packages (probably dplyr or purrr?).
EDIT:
I thought there would be an elegant way to do this using map(). Maybe there is but I haven't found it yet. For the mean time, I figured out a way that simply restructures the data into an appropriate long format and then computes the statistics.
df %>%
# all grouping vars need to be of the same type, here "factor" is most appropriate
mutate_at(grouping_vars, list(factor)) %>%
# pivot longer, so that each row is a unique combination of grouping variable and grouping level
pivot_longer(
cols = one_of(grouping_vars),
names_to = "group_var",
values_to = "group_level"
) %>%
# merge grouping variable and group level into a single column
unite(var_level,group_var,group_level, sep="_") %>%
# group by group level
group_by(var_level) %>%
# compute means and sd for each test variable
summarise_at(test_vars, list(~mean(., na.rm = TRUE), ~sd(., na.rm = TRUE)))
The result seems fine, e.g., the mean of var1 of the two people who live in London (2.2 + 4.5) is 3.35.
# A tibble: 10 x 7
var_level var1_mean var2_mean var3_mean var1_sd var2_sd var3_sd
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 education_1 2.65 3.15 0.15 0.636 1.48 0.0707
2 education_2 4.5 2.5 3.5 NA NA NA
3 education_3 1.65 2.65 3.65 0.919 1.91 0.919
4 education_5 3.2 4.7 3.7 2.55 0.424 1.84
5 gender_f 2.72 2.48 2.72 1.47 1.13 1.83
6 gender_m 2.87 4.53 2.53 1.89 0.416 2.40
7 place_London 3.35 3.35 1.85 1.63 1.20 2.33
8 place_Madrid 1.85 2.85 3.35 0.636 2.19 1.34
9 place_Paris 3.1 2.1 0.1 NA NA NA
10 place_Rome 3 4.5 4 2.83 0.707 1.41
Any thoughts on possible risks of this approach or how this could be improved?
One option is the describeBy function from psych:
library(psych)
describeBy(df,group = c("gender","education"), mat= TRUE)
Then subset what you want from there.
Another, surprisingly simple option with dplyr:
library(dplyr)
group.vars <- c("gender","education")
measure.vars <- c("var1","var2")
df %>%
group_by_at(group.vars) %>%
summarize_at(measure.vars,
list(mean =~ mean(.),sd =~ sd(.)))
# A tibble: 5 x 6
# Groups: gender [2]
gender education var1_mean var2_mean var1_sd var2_sd
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 f 1 3.1 2.1 NA NA
2 f 2 4.5 2.5 NA NA
3 f 3 1.65 2.65 0.919 1.91
4 m 1 2.2 4.2 NA NA
5 m 5 3.2 4.7 2.55 0.424
You can continue adding additional function to that list. For every element, the name will be appended to the variable and the result will be come the column values. Recall that ~ is shorthand for function(x).
I am trying to calculate the means of this data frame and group them in a tale! I know how to do that using averageifs in excel but because I want to eventually get the standard deviation and coefficient of variation (CV) I need to learn that in R.
For now I just need the mean. Here are my conditions:
I need a table where the "stim_ending_t" which is my time intervals are arranged from 1.0 to 3.5 in a rows. For time intervals I need these three conditions to be met while calculating the mean which is "key_resp_2.rt"
only image Visibility and soundvolume(V=1 & s=0)
Only Sound (V=0 & s=1)
Blank (V=0 & s=0)
The data frame
Expected out come
This will compute stim_ending_t (6) x modality (3) = 18 group means.
First I generate some data like your analysis_v or analysis_a data frames:
library(dplyr)
library(tidyr)
analysis_v <- data.frame(stim_ending_t = rep(seq(1, 3.5, 0.5), each = 30),
visbility = rep(c(1, 0, 0), 60),
soundvolume = rep(c(0, 1, 0), 60),
key_resp_2.rt = runif(180, 1, 5))
Then I pipe the object into the code block:
analysis_v %>%
group_by(stim_ending_t, visbility, soundvolume) %>%
summarize(average = mean(key_resp_2.rt)) %>%
ungroup() %>%
mutate(key = case_when(visbility == 0 & soundvolume == 0 ~ "blank",
visbility == 0 & soundvolume == 1 ~ "only_sound",
visbility == 1 & soundvolume == 0 ~ "only_images")) %>%
select(-visbility, -soundvolume) %>%
spread(key, average)
Which results in the requested output format:
# A tibble: 6 x 4
stim_ending_t blank only_images only_sound
<dbl> <dbl> <dbl> <dbl>
1 1 3.28 3.55 2.84
2 1.5 2.64 3.11 2.32
3 2 3.27 3.72 2.42
4 2.5 2.14 3.01 2.30
5 3 2.47 3.03 3.02
6 3.5 2.93 2.92 2.78
You would need to repeat the code block using analysis_a to get those means.
Thank you for your help #Matthew Schuelke, but with your code I was getting different results every time I run the code.
Here is how I solved the problem with this code:
name of the new data = (name of the data frame without the parentheses) %>%
group_by(stim_ending_t, visbility, soundvolume, Opening_text) %>%
summarize(m = mean(key_resp_2.rt),
sd = sd(key_resp_2.rt),
coefVar = cv(key_resp_2.rt))
The result as I wanted:
stim_ending_t visbility soundvolume Opening_text m sd coefVar
<dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 1 0 0 Now focus on the Image 1.70 1.14 0.670
2 1 0 0 Now focus on the Sound 1.57 0.794 0.504
3 1 0 1 Now focus on the Image 1.62 1.25 0.772
4 1 0 1 Now focus on the Sound 1.84 1.17 0.637
5 1 1 0 Now focus on the Image 3.19 17.2 5.38
6 1 1 0 Now focus on the Sound 1.59 0.706 0.444
To give a small working example, suppose I have the following data frame:
library(dplyr)
country <- rep(c("A", "B", "C"), each = 6)
year <- rep(c(1,2,3), each = 2, times = 3)
categ <- rep(c(0,1), times = 9)
pop <- rep(c(NA, runif(n=8)), each=2)
money <- runif(18)+100
df <- data.frame(Country = country,
Year = year,
Category = categ,
Population = pop,
Money = money)
Now the data I'm actually working with has many more repetitions, namely for every country, year, and category, there are many repeated rows corresponding to various sources of money, and I want to sum these all together. However, for now it's enough just to have one row for each country, year, and category, and just trivially apply the sum() function on each row. This will still exhibit the behavior I'm trying to get rid of.
Notice that for country A in year 1, the population listed is NA. Therefore when I run
aggregate(Money ~ Country+Year+Category+Population, df, sum)
the resulting data frame has dropped the rows corresponding to country A and year 1. I'm only using the ...+Population... bit of code because I want the output data frame to retain this column.
I'm wondering how to make the aggregate() function not drop things that have NAs in the columns by which the grouping occurs--it'd be nice if, for instance, the NAs themselves could be treated as values to group by.
My attempts: I tried turning the Population column into factors but that didn't change the behavior. I read something on the na.action argument but neither na.action=NULL nor na.action=na.skip changed the behavior. I thought about trying to turn all the NAs to 0s, and I can't think of what that would hurt but it feels like a hack that might bite me later on--not sure. But if I try to do it, I'm not sure how I would. When I wrote a function with the is.na() function in it, it didn't apply the if (is.na(x)) test in a vectorized way and gave the error that it would just use the first element of the vector. I thought about perhaps using lapply() on the column and coercing it back to a vector and sticking that in the column, but that also sounds kind of hacky and needlessly round-about.
The solution here seemed to be about keeping the NA values out of the data frame in the first place, which I can't do: Aggregate raster in R with NA values
As you have already mentioned dplyr before your data, you can use dplyr::summarise function. The summarise function supports grouping on NA values.
library(dplyr)
df %>% group_by(Country,Year,Category,Population) %>%
summarise(Money = sum(Money))
# # A tibble: 18 x 5
# # Groups: Country, Year, Category [?]
# Country Year Category Population Money
# <fctr> <dbl> <dbl> <dbl> <dbl>
# 1 A 1.00 0 NA 101
# 2 A 1.00 1.00 NA 100
# 3 A 2.00 0 0.482 101
# 4 A 2.00 1.00 0.482 101
# 5 A 3.00 0 0.600 101
# 6 A 3.00 1.00 0.600 101
# 7 B 1.00 0 0.494 101
# 8 B 1.00 1.00 0.494 101
# 9 B 2.00 0 0.186 100
# 10 B 2.00 1.00 0.186 100
# 11 B 3.00 0 0.827 101
# 12 B 3.00 1.00 0.827 101
# 13 C 1.00 0 0.668 100
# 14 C 1.00 1.00 0.668 101
# 15 C 2.00 0 0.794 100
# 16 C 2.00 1.00 0.794 100
# 17 C 3.00 0 0.108 100
# 18 C 3.00 1.00 0.108 100
Note: The OP's sample data doesn't have multiple rows for same groups. Hence, number of summarized rows will be same as actual rows.