Multiple gathering in R to create tidy dataset

I have a complicated untidy dataset; a dummy version of it can be reproduced with the code below.
studentID <- 1:250
score2018 <- runif(250)
score2019 <- runif(250)
score2020 <- runif(250)
payment2018 <- runif(250, min=10000, max=12000)
payment2019 <- runif(250, min=11000, max=13000)
payment2020 <- runif(250, min=12000, max=14000)
attendance2018 <- runif(250, min=0.75, max=1)
attendance2019 <- runif(250, min=0.75, max=1)
attendance2020 <- runif(250, min=0.75, max=1)
untidy_df <- data.frame(studentID, score2018, score2019, score2020, payment2018, payment2019, payment2020, attendance2018, attendance2019, attendance2020)
I would like to gather this data frame so that we only have 5 columns: studentID, year, score, payment, attendance. I know how to gather at a basic level, but I have 3 sets to gather here, and I can't see how to do this in one go.
Thanks in advance!

With tidyr you can use pivot_longer:
library(tidyr)
untidy_df %>%
pivot_longer(cols = -studentID, names_to = c(".value", "year"), names_pattern = "(\\w+)(\\d{4})")
Output
# A tibble: 750 x 5
studentID year score payment attendance
<int> <chr> <dbl> <dbl> <dbl>
1 1 2018 0.432 10762. 0.786
2 1 2019 0.948 11340. 0.909
3 1 2020 0.122 12837. 0.944
4 2 2018 0.422 11515. 0.950
5 2 2019 0.0639 12968. 0.828
6 2 2020 0.611 13645. 0.901
7 3 2018 0.489 11281. 0.784
8 3 2019 0.00337 12250. 0.753
9 3 2020 0.711 12898. 0.803
10 4 2018 0.0596 10526. 0.842
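For readers new to the ".value" sentinel, here is a minimal sketch on a made-up frame (toy and its values are hypothetical, not the asker's data). It uses "\\D+" (non-digits) for the first capture group instead of "\\w+", which is an assumption on my part that the measure names never contain digits; it makes the split between name and year explicit rather than relying on regex backtracking.

```r
library(tidyr)

# Hypothetical toy data: one student, two years, two measures
toy <- data.frame(studentID = 1L,
                  score2018 = 0.5, score2019 = 0.7,
                  payment2018 = 10500, payment2019 = 11500)

# ".value": the first capture group becomes the name of an output column;
# the second capture group is stored in the "year" column.
long <- pivot_longer(toy,
                     cols = -studentID,
                     names_to = c(".value", "year"),
                     names_pattern = "(\\D+)(\\d{4})")
long
```

Each distinct prefix (score, payment) ends up as its own column, with one row per student and year.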

Using pure R:
tidy_df <- reshape(untidy_df, direction="long", idvar="studentID", varying=2:10, sep="")
head(tidy_df)
studentID time score payment attendance
1.2018 1 2018 0.86743970 10995.45 0.9473540
2.2018 2 2018 0.53204701 11152.74 0.8167776
3.2018 3 2018 0.90072918 10631.06 0.9335316
4.2018 4 2018 0.89154492 11889.23 0.9098399
5.2018 5 2018 0.06320442 10973.20 0.8118909
6.2018 6 2018 0.67519166 11751.67 0.8328860
If you want "year" instead of the default "time", add timevar="year"
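As a small illustration of the timevar suggestion, the sketch below runs the same call on a made-up two-student frame (toy is hypothetical). Resetting the row names is optional and only removes the "1.2018"-style labels.

```r
# Hypothetical toy data following the same naming scheme (name + 4-digit year)
toy <- data.frame(studentID = 1:2,
                  score2018 = c(0.5, 0.6), score2019 = c(0.7, 0.8),
                  payment2018 = c(10500, 10600), payment2019 = c(11500, 11600))

# Same reshape call as above, with the time column named "year"
tidy_toy <- reshape(toy, direction = "long", idvar = "studentID",
                    varying = 2:5, sep = "", timevar = "year")

# Optional: drop the "1.2018"-style row names
rownames(tidy_toy) <- NULL
tidy_toy
```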

We could try:
library(dplyr)
library(tidyr)
untidy_df %>%
pivot_longer(cols = -studentID) %>%
separate(col = name, sep = "(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)", into = c("measure", "year")) %>%
pivot_wider(names_from = measure, values_from = value )
Which returns:
studentID year score payment attendance
<int> <chr> <dbl> <dbl> <dbl>
1 1 2018 0.807 10179. 0.974
2 1 2019 0.599 11601. 0.785
3 1 2020 0.515 12347. 0.760
4 2 2018 0.474 11154. 0.983
5 2 2019 0.409 11682. 0.864
6 2 2020 0.688 13756. 0.812
7 3 2018 0.509 11746. 0.870
8 3 2019 0.867 12851. 0.801
9 3 2020 0.878 12710. 0.955
10 4 2018 0.621 11165. 0.975

Related

how can I make a new data frame where the columns are the unique values with corresponding observations from an old data frame? [duplicate]

My data frame has one row per observation, each with a date. Every unique date occurs approximately 500 times. I want to make a new data frame where every column is a unique date and the rows are all the observations for that date from my old dataset. So for every column that represents a certain date, I should have approximately 500 rows that each hold a rel_spread from that day.
You can use pivot_wider from tidyr:
library(tidyr)
pivot_wider(df, names_from = date, values_from = rel_spread, values_fn = list) %>%
unnest(everything())
#> # A tibble: 2 x 17
#> `20000103` `20000104` `20000105` `20000106` `20000107` `20000108` `20000109`
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 -0.0234 -0.0128 0.00729 0.0408 -0.0298 0.0398 0.0445
#> 2 0.0492 -0.0120 0.0277 0.0435 -0.0288 0.0152 -0.0374
#> # ... with 10 more variables: `20000110` <dbl>, `20000111` <dbl>,
#> # `20000112` <dbl>, `20000113` <dbl>, `20000114` <dbl>, `20000115` <dbl>,
#> # `20000116` <dbl>, `20000117` <dbl>, `20000118` <dbl>, `20000119` <dbl>
Note that we don't have your data (and I wasn't about to transcribe a picture of your data), but I created a little reproducible data set which should match the structure of your data set, except it only has two values per date for demo purposes:
set.seed(1)
df <- data.frame(date = rep(as.character(20000103:20000119), 2),
rel_spread = runif(34, -0.05, 0.05))
df
#> date rel_spread
#> 1 20000103 -0.0234491337
#> 2 20000104 -0.0127876100
#> 3 20000105 0.0072853363
#> 4 20000106 0.0408207790
#> 5 20000107 -0.0298318069
#> 6 20000108 0.0398389685
#> 7 20000109 0.0444675269
#> 8 20000110 0.0160797792
#> 9 20000111 0.0129114044
#> 10 20000112 -0.0438213730
#> 11 20000113 -0.0294025425
#> 12 20000114 -0.0323443247
#> 13 20000115 0.0187022847
#> 14 20000116 -0.0115896282
#> 15 20000117 0.0269841420
#> 16 20000118 -0.0002300758
#> 17 20000119 0.0217618508
#> 18 20000103 0.0491906095
#> 19 20000104 -0.0119964821
#> 20 20000105 0.0277445221
#> 21 20000106 0.0434705231
#> 22 20000107 -0.0287857479
#> 23 20000108 0.0151673766
#> 24 20000109 -0.0374444904
#> 25 20000110 -0.0232779331
#> 26 20000111 -0.0113885907
#> 27 20000112 -0.0486609667
#> 28 20000113 -0.0117612043
#> 29 20000114 0.0369690846
#> 30 20000115 -0.0159651003
#> 31 20000116 -0.0017919885
#> 32 20000117 0.0099565825
#> 33 20000118 -0.0006458693
#> 34 20000119 -0.0313782399
The pivot_wider answer above works well if you have the same number of rows for each date. If this isn’t the case, the following should work:
library(tidyr)
library(dplyr)
data_wide <- data_long %>%
group_by(date) %>%
mutate(daterow = row_number()) %>%
ungroup() %>%
pivot_wider(names_from = date, values_from = rel_spread) %>%
select(!daterow)
data_wide
Output:
# A tibble: 6 x 4
`20000103` `20000104` `20000105` `20000106`
<dbl> <dbl> <dbl> <dbl>
1 -0.626 0.184 -0.836 -0.621
2 1.60 0.330 -0.820 -2.21
3 0.487 0.738 0.576 1.12
4 -0.305 1.51 0.390 -0.0449
5 NA NA NA -0.0162
6 NA NA NA 0.944
Example data:
set.seed(1)
data_long <- data.frame(
date = c(rep(20000103:20000105, 4), rep(20000106, 6)),
rel_spread = rnorm(18)
)
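For comparison, the same padding idea can be sketched in base R: split the values by date, then extend every piece to the length of the longest one so the short ones are filled with NA. This is an alternative technique, not the answer's method, and the small data_long below is made up for illustration.

```r
# Hypothetical data: three dates with 2 observations, one date with 1
data_long <- data.frame(
  date = c(rep(20000103:20000105, 2), 20000106),
  rel_spread = c(1, 2, 3, 4, 5, 6, 7)
)

# One vector of observations per date
cols <- split(data_long$rel_spread, data_long$date)

# Pad every vector with NA up to the length of the longest one
n <- max(lengths(cols))
padded <- lapply(cols, `length<-`, n)

# check.names = FALSE keeps the numeric-looking column names as they are
data_wide <- do.call(data.frame, c(padded, check.names = FALSE))
data_wide
```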

Choose dataframe variables by name and multiply with a vector elementwise

I have a data frame and a vector as follows:
my_df <- as.data.frame(
list(year = c(2001, 2001, 2001, 2001, 2001, 2001), month = c(1,
2, 3, 4, 5, 6), Pdt_d0 = c(0.379045935402736, 0.377328817455841,
0.341158889847019, 0.36761990427443, 0.372442657083218, 0.382702189949558
), Pdt_d1 = c(0.146034519173855, 0.166289573095497, 0.197787188740911,
0.137071647982617, 0.162103042313547, 0.168566518193772), Pdt_d2 = c(0.126975939811326,
0.107708783271871, 0.14096203677089, 0.142228236885706, 0.115542396064519,
0.106935751726809), Pdt_tot = c(2846715, 2897849.5, 2935406.25,
2850649, 2840313.75, 3087993.5))
)
my_vec <- 1:3
I want to multiply Pdt_d0:Pdt_d2 with the corresponding element from my_vec, while keeping the other columns untouched. I can get the desired multiplication with dplyr::select(my_df, num_range("Pdt_d", 0:2)) %>% mapply(`*`, ., my_vec) but I lose the year, month, Pdt_tot columns in the process. I tried to achieve my goal with dplyr::select(my_df, num_range("Pdt_d", 0:2)) <- dplyr::select(my_df, num_range("Pdt_d", 0:2)) %>% mapply(`*`, ., my_vec) which returns the error 'select<-' is not an exported object. Is there an obvious trick I am not seeing?
I don't think my question is a duplicate; I have seen the answers to similar questions, but none of them lets me choose variables by name.
You can use the same Map/mapply logic you tried, but outside of the tidy world, assigning the right-hand-side result back to the columns selected on the left-hand side:
vars <- paste0("Pdt_d", 0:2)
my_df[vars] <- Map(`*`, my_df[vars], my_vec)
my_df
# year month Pdt_d0 Pdt_d1 Pdt_d2 Pdt_tot
#1 2001 1 0.3790459 0.2920690 0.3809278 2846715
#2 2001 2 0.3773288 0.3325791 0.3231263 2897850
#3 2001 3 0.3411589 0.3955744 0.4228861 2935406
#4 2001 4 0.3676199 0.2741433 0.4266847 2850649
#5 2001 5 0.3724427 0.3242061 0.3466272 2840314
#6 2001 6 0.3827022 0.3371330 0.3208073 3087994
This works because [<- exists as a function in R: it handles assignment to a selection made with square brackets on the left-hand side, like my_df[vars].
The error that was returned is because the code has a select() call on the left-hand side, and there is no 'select<-' function. I.e., you can't assign to a select()-ion because it isn't set up to work like that. The tidy functions are usually expected to be piped, like my_df %>% select() %>% etc, without overwriting the original input.
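To make the point about replacement functions concrete, here is a small sketch on made-up data (toy, toy2 and w are hypothetical names). It shows that the assignment form and an explicit call to `[<-` give the same result, which is the mechanism that select() lacks.

```r
# Hypothetical data: two numeric columns to scale, one id column to keep
toy <- data.frame(id = 1:3, a = c(1, 2, 3), b = c(10, 20, 30))
toy2 <- toy
w <- c(2, 3)            # one multiplier per selected column
vars <- c("a", "b")

# Assignment form: R rewrites this as a call to `[<-` behind the scenes
toy[vars] <- Map(`*`, toy[vars], w)

# The same thing written as an explicit call to the replacement function
toy2 <- `[<-`(toy2, vars, value = Map(`*`, toy2[vars], w))

toy
```

The id column is untouched in both versions because only the columns named in vars are replaced.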
I don't think that you want to do this mess, but it does work.
library(dplyr)
library(tidyr)
my_df %>%
gather(variable, value, -year,-month,-Pdt_tot) %>%
group_by(year, month, Pdt_tot) %>%
mutate(value = value * my_vec) %>%
spread(variable,value)
year month Pdt_tot Pdt_d0 Pdt_d1 Pdt_d2
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2001 1 2846715 0.379 0.292 0.381
2 2001 2 2897850. 0.377 0.333 0.323
3 2001 3 2935406. 0.341 0.396 0.423
4 2001 4 2850649 0.368 0.274 0.427
5 2001 5 2840314. 0.372 0.324 0.347
6 2001 6 3087994. 0.383 0.337 0.321
To avoid naming year, month, and Pdt_tot explicitly, select the Pdt_d columns with num_range instead:
my_df %>%
gather(variable, value, num_range("Pdt_d", 0:2)) %>%
group_by(across(c(-variable, -value))) %>%
mutate(value = value * my_vec) %>%
spread(variable, value)
year month Pdt_tot Pdt_d0 Pdt_d1 Pdt_d2
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2001 1 2846715 0.379 0.292 0.381
2 2001 2 2897850. 0.377 0.333 0.323
3 2001 3 2935406. 0.341 0.396 0.423
4 2001 4 2850649 0.368 0.274 0.427
5 2001 5 2840314. 0.372 0.324 0.347
6 2001 6 3087994. 0.383 0.337 0.321

reshaping rows of data to two columns

We have data on school districts where the columns are the local-specific information (e.g., free and reduced price lunch %) and the corresponding statewide values.
dat <- tribble(
~state.poverty, ~state.EL, ~state.disability, ~state.frpl, ~local.poverty, ~local.frpl, ~local.disability, ~local.EL,
12.50592, 0.08342419, 0.12321831, 0.4495395, 25.23731, 0.6415712, 0.140739, 0.1469898)
dat
# A tibble: 1 x 8
state.poverty state.EL state.disability state.frpl local.poverty local.frpl local.disability local.EL
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 12.5 0.0834 0.123 0.450 25.2 0.642 0.141 0.147
We want to reshape that so that it looks like this.
demog state local
<chr> <dbl> <dbl>
1 poverty 12.5 25.2
2 EL 0.0834 0.147
3 disability 0.123 0.141
4 frpl 0.450 0.642
It seems like something that pivot_longer should be able to handle, but I haven't had much success so far. Any suggestions?
We can use pivot_longer
library(dplyr)
library(tidyr)
dat %>%
pivot_longer(cols = everything(),
names_to = c(".value", "demog"), names_sep = "\\.")
Output
# A tibble: 4 x 3
# demog state local
# <chr> <dbl> <dbl>
#1 poverty 12.5 25.2
#2 EL 0.0834 0.147
#3 disability 0.123 0.141
#4 frpl 0.450 0.642
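A base R option using reshape. Note that the state.* and local.* columns are not in the same order in dat, so letting reshape guess the pairing with varying = 1:ncol(dat) matches them by position and swaps the local EL and frpl values (local.frpl lands on the EL row and local.EL on the frpl row). Listing the columns explicitly avoids this:
demogs <- c("poverty", "EL", "disability", "frpl")
reshape(
as.data.frame(dat),
direction = "long",
varying = list(paste0("state.", demogs), paste0("local.", demogs)),
v.names = c("state", "local"),
timevar = "demog",
times = demogs
)
This gives one row per demog with state and local correctly paired (for example EL: 0.0834 and 0.147; frpl: 0.450 and 0.642), plus the id column that reshape adds.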

Aggregating by fixed date range R

Given a simplification of my dataset like:
df <- data.frame("ID"= c(1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2),
"ForestType" = c("oak","oak","oak","oak","oak","oak","oak","oak","oak","oak","oak","oak",
"pine","pine","pine","pine","pine","pine","pine","pine","pine","pine","pine","pine"),
"Date"= c("1987.01.01","1987.06.01","1987.10.01","1987.11.01",
"1988.01.01","1988.03.01","1988.04.01","1988.06.01",
"1989.03.01","1989.05.01","1989.07.01","1989.08.01",
"1987.01.01","1987.06.01","1987.10.01","1987.11.01",
"1988.01.01","1988.03.01","1988.04.01","1988.06.01",
"1989.03.01","1989.05.01","1989.07.01","1989.08.01"),
"NDVI"= c(0.1,0.2,0.3,0.55,0.31,0.26,0.34,0.52,0.41,0.45,0.50,0.7,
0.2,0.3,0.4,0.53,0.52,0.54,0.78,0.73,0.72,0.71,0.76,0.9),
check.names = FALSE, stringsAsFactors = FALSE)
I would like to obtain the means of NDVI values by a certain period of time, in this case by year. Take into account that in my real dataset I would need it for seasons, so it should be adaptable.
These means should consider:
Trimming outliers: for example 25% of the highest values and 25% of the lowest values.
They should be by class, in this case by the ID field.
So the output should look something like:
> desired_df
ID ForestType Date meanNDVI
1 1 oak 1987 0.250
2 1 oak 1988 0.325
3 1 oak 1989 0.475
4 2 pine 1987 0.350
5 2 pine 1988 0.635
6 2 pine 1989 0.740
In this case, for example, 0.250 is the mean NDVI for ID=1 in 1987: the mean of that year's 4 values after taking the lowest and the highest out.
Thanks a lot!
library(tidyverse)
library(lubridate)
df %>%
mutate(Date = as.Date(Date, format = "%Y.%m.%d")) %>%
group_by(ID, ForestType, Year = year(Date)) %>%
filter(NDVI > quantile(NDVI, .25) & NDVI < quantile(NDVI, .75)) %>%
summarise(meanNDVI = mean(NDVI))
Output
# A tibble: 6 x 4
# Groups: ID, ForestType [2]
ID ForestType Year meanNDVI
<dbl> <chr> <dbl> <dbl>
1 1 oak 1987 0.25
2 1 oak 1988 0.325
3 1 oak 1989 0.475
4 2 pine 1987 0.35
5 2 pine 1988 0.635
6 2 pine 1989 0.74
The classical base R approach using aggregate. The year can be obtained using substr.
res <- with(df, aggregate(list(meanNDVI=NDVI),
by=list(ID=ID, ForestType=ForestType, date=substr(Date, 1, 4)),
FUN=mean))
res[order(res$ID), ]
# ID ForestType date meanNDVI
# 1 1 oak 1987 0.2875
# 3 1 oak 1988 0.3575
# 5 1 oak 1989 0.5150
# 2 2 pine 1987 0.3575
# 4 2 pine 1988 0.6425
# 6 2 pine 1989 0.7725
Trimmed version
Trims 25% of the values from each end as outliers; extra arguments such as trim are passed on to mean.
res2 <- with(df, aggregate(list(meanNDVI=NDVI),
by=list(ID=ID, ForestType=ForestType, date=substr(Date, 1, 4)),
FUN=mean, trim=.25))
res2[order(res2$ID), ]
# ID ForestType date meanNDVI
# 1 1 oak 1987 0.250
# 3 1 oak 1988 0.325
# 5 1 oak 1989 0.475
# 2 2 pine 1987 0.350
# 4 2 pine 1988 0.635
# 6 2 pine 1989 0.740
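As a quick check of what trim = .25 does, here is the arithmetic for one group, using the four oak values from 1987 in the question. With 4 values, mean drops floor(4 * 0.25) = 1 value from each end after sorting.

```r
x <- c(0.1, 0.2, 0.3, 0.55)   # NDVI values for ID = 1 (oak), 1987

plain   <- mean(x)               # (0.1 + 0.2 + 0.3 + 0.55) / 4 = 0.2875
trimmed <- mean(x, trim = 0.25)  # drops 0.1 and 0.55, leaving (0.2 + 0.3) / 2 = 0.25

c(plain = plain, trimmed = trimmed)
```

For group sizes that are not a multiple of 4, the number dropped per end is rounded down, so a group of 5 also loses only one value from each end.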
Using data.table package, you could proceed as follows:
library(data.table)
setDT(df)[, Date := as.Date(Date, format = "%Y.%m.%d")][]
df[, .(meanNDVI = base::mean(NDVI, trim = 0.25)), by = .(ID, ForestType, year = year(Date))]
# ID ForestType year meanNDVI
# 1: 1 oak 1987 0.250
# 2: 1 oak 1988 0.325
# 3: 1 oak 1989 0.475
# 4: 2 pine 1987 0.350
# 5: 2 pine 1988 0.635
# 6: 2 pine 1989 0.740
Another option: set the trim argument of mean directly.
library(tidyverse)
library(lubridate)
df %>%
mutate(Date = ymd(Date) %>% year()) %>%
group_by(ID, ForestType, Date) %>%
summarise(mean = mean(NDVI, trim = 0.25, na.rm = TRUE))
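Since the question mentions needing seasons rather than years, one way the grouping could be adapted is sketched below in base R. The month-to-season mapping (season_of) and the small df_s frame are my own assumptions for illustration; in particular, December is assigned to the winter of its own calendar year here, which may not be what a real analysis wants.

```r
# Hypothetical subset: the four oak observations from 1987
df_s <- data.frame(ID = 1,
                   Date = c("1987.01.01", "1987.06.01", "1987.10.01", "1987.11.01"),
                   NDVI = c(0.1, 0.2, 0.3, 0.55),
                   stringsAsFactors = FALSE)

# Assumed mapping from month number (1-12) to season
season_of <- c("winter", "winter", "spring", "spring", "spring", "summer",
               "summer", "summer", "autumn", "autumn", "autumn", "winter")

month_num   <- as.integer(format(as.Date(df_s$Date, format = "%Y.%m.%d"), "%m"))
df_s$Season <- season_of[month_num]
df_s$Year   <- substr(df_s$Date, 1, 4)

# Same aggregate pattern as the year-based version, with Season added to the grouping
res <- with(df_s, aggregate(list(meanNDVI = NDVI),
                            by = list(ID = ID, Year = Year, Season = Season),
                            FUN = mean, trim = 0.25))
res
```

Only the grouping variable changes; the trimmed mean itself is computed exactly as in the year-based versions.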

Summarizing by group of two rows

I have a data frame that I want to group by two variables, and then summarize the total and average.
I tried this on my data, and the result is correct.
df %>%
group_by(date, group) %>%
summarise(
weight = sum(ind_weigh) ,
total_usage = sum(total_usage_min) ,
Avg_usage = total_usage / weight) %>%
ungroup()
It returns this data frame:
df <- tibble::tribble(
~date, ~group, ~weight, ~total_usage, ~Avg_usage,
20190201, 0, 450762, 67184943, 149,
20190201, 1, 2788303, 385115718, 138,
20190202, 0, 483959, 60677765, 125,
20190202, 1, 2413699, 311226351, 129,
20190203, 0, 471189, 59921762, 127,
20190203, 1, 2143811, 277425186, 129,
20190204, 0, 531020, 83695977, 158,
20190204, 1, 2640087, 403200829, 153
)
I am wondering how I can add another variable in my script to get Avg_usage_total (computed over both group 0 and group 1) as well.
Expected result:
e.g., first row --> 67184943 / (450762 + 2788303) = 20.7
date group rech total_usage Avg_usage Avg_usage_total
20190201 0 450762 67184943 149 20.7
20190201 1 2788303 385115718 138 118.9
You can do that using mutate and group_by if necessary.
library(tidyverse)
# generate dataset
(df <- tibble(
date = c(rep(Sys.Date(), 10), rep(Sys.Date() - 1, 10)),
group = rbinom(20, 1, 0.5),
rech = runif(20),
weight = runif(20),
total_usage = runif(20)
))
# A tibble: 20 x 5
date group rech weight total_usage
<date> <int> <dbl> <dbl> <dbl>
1 2019-03-10 0 0.985 0.831 0.963
2 2019-03-10 1 0.178 0.990 0.676
3 2019-03-10 1 0.505 0.697 0.152
4 2019-03-10 1 0.416 0.165 0.824
5 2019-03-10 0 0.554 0.790 0.974
# step 1 of analysis
(df <- df %>%
group_by(date, group) %>%
summarise(rech = sum(rech),
weight = sum(weight),
total_usage = sum(total_usage)) %>%
mutate(Avg_usage = total_usage / weight))
# A tibble: 4 x 6
# Groups: date [2]
date group rech weight total_usage Avg_usage
<date> <int> <dbl> <dbl> <dbl> <dbl>
1 2019-03-09 0 3.29 4.82 3.03 0.628
2 2019-03-09 1 1.45 1.22 1.16 0.954
3 2019-03-10 0 1.54 1.62 1.94 1.20
4 2019-03-10 1 3.15 4.55 4.63 1.02
# step 2 of analysis
df %>%
group_by(date) %>% # only necessary if you want to compute Avg_usage_total by date
mutate(Avg_usage_total = total_usage / sum(rech)) %>% # total_usage is taken by row, sum is taken for the entire column
ungroup()
# A tibble: 4 x 7
date group rech weight total_usage Avg_usage Avg_usage_total
<date> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2019-03-09 0 3.29 4.82 3.03 0.628 0.639
2 2019-03-09 1 1.45 1.22 1.16 0.954 0.246
3 2019-03-10 0 1.54 1.62 1.94 1.20 0.413
4 2019-03-10 1 3.15 4.55 4.63 1.02 0.986
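To check the approach against the numbers in the question, here is a base R sketch using the first four rows of the asker's summarised data. It assumes the denominator is the per-date sum of the weight column, which is what the question's worked arithmetic uses (the expected table labels that column rech). The frame name usage is hypothetical.

```r
# First four rows of the summarised data from the question
usage <- data.frame(date = c(20190201, 20190201, 20190202, 20190202),
                    group = c(0, 1, 0, 1),
                    weight = c(450762, 2788303, 483959, 2413699),
                    total_usage = c(67184943, 385115718, 60677765, 311226351))

# ave() returns the per-date sum of weight, repeated on every row of that date
weight_by_date <- ave(usage$weight, usage$date, FUN = sum)

# Row-wise usage divided by the date-level total weight
usage$Avg_usage_total <- usage$total_usage / weight_by_date
round(usage$Avg_usage_total[1:2], 1)   # 20.7 and 118.9, as in the expected result
```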
