reshaping rows of data to two columns - r

We have data on school districts where the columns are the local-specific information (e.g., free and reduced price lunch %) and the corresponding statewide values.
dat <- tribble(
~state.poverty, ~state.EL, ~state.disability, ~state.frpl, ~local.poverty, ~local.frpl, ~local.disability, ~local.EL,
12.50592, 0.08342419, 0.12321831, 0.4495395, 25.23731, 0.6415712, 0.140739, 0.1469898)
dat
# A tibble: 1 x 8
state.poverty state.EL state.disability state.frpl local.poverty local.frpl local.disability local.EL
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 12.5 0.0834 0.123 0.450 25.2 0.642 0.141 0.147
We want to reshape that so that it looks like this.
demog state local
<chr> <dbl> <dbl>
1 poverty 12.5 25.2
2 EL 0.0834 0.147
3 disability 0.123 0.141
4 frpl 0.450 0.642
It seems like something that pivot_longer should be able to handle, but I haven't had much success so far. Any suggestions?

We can use pivot_longer
library(dplyr)
library(tidyr)
dat %>%
pivot_longer(cols = everything(),
names_to = c(".value", "demog"), names_sep = "\\.")
-output
# A tibble: 4 x 3
# demog state local
# <chr> <dbl> <dbl>
#1 poverty 12.5 25.2
#2 EL 0.0834 0.147
#3 disability 0.123 0.141
#4 frpl 0.450 0.642
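As a rough illustration of what ".value" does in names_to (a sketch on a cut-down copy of the data, with hypothetical object names, not part of the original answer): each column name is split at the dot, the part before the dot becomes an output column, and the part after it becomes a value of demog. The same result can be reached the long way round by pivoting fully long, splitting the names, and widening again.

```r
library(tidyr)

# Cut-down copy of the question's data (two measures only; values rounded)
dat_small <- data.frame(
  state.poverty = 12.5, state.EL = 0.0834,
  local.poverty = 25.2, local.EL = 0.147
)

# One step: ".value" turns the prefix (state/local) into output columns
one_step <- pivot_longer(
  dat_small,
  cols = everything(),
  names_to = c(".value", "demog"),
  names_sep = "\\."
)

# Long way round: fully long, split the names, then widen by prefix
long <- pivot_longer(dat_small, cols = everything())
long <- separate(long, name, into = c("level", "demog"), sep = "\\.")
two_step <- pivot_wider(long, names_from = level, values_from = value)

one_step
two_step
```

Both objects should hold one row per demog with a state and a local column; the ".value" form just skips the intermediate long table.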

A base R option using reshape
reshape(
dat,
direction = "long",
varying = 1:ncol(dat)
)
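gives the output below. Note that reshape() pairs the state.* and local.* columns by position, and here the local.* columns are in a different order, so the local values for EL and frpl come out swapped relative to the desired result: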
# A tibble: 4 x 4
time state local id
<chr> <dbl> <dbl> <int>
1 poverty 12.5 25.2 1
2 EL 0.0834 0.642 1
3 disability 0.123 0.141 1
4 frpl 0.450 0.147 1
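One way to avoid depending on column order is to spell out the pairing explicitly. The following is a minimal sketch, assuming a plain data.frame copy of the data; the names demogs and long_dat are made up for illustration. Passing varying as a list, together with v.names and times, tells reshape() exactly which wide column belongs to which demog.

```r
# Plain data.frame copy of the question's data
dat <- data.frame(
  state.poverty = 12.50592, state.EL = 0.08342419,
  state.disability = 0.12321831, state.frpl = 0.4495395,
  local.poverty = 25.23731, local.frpl = 0.6415712,
  local.disability = 0.140739, local.EL = 0.1469898
)

# The order we want the demog values in (hypothetical helper vector)
demogs <- c("poverty", "EL", "disability", "frpl")

long_dat <- reshape(
  dat,
  direction = "long",
  # one list element per output column, each listed in the same demog order
  varying = list(paste0("state.", demogs), paste0("local.", demogs)),
  v.names = c("state", "local"),
  timevar = "demog",
  times   = demogs
)

long_dat
```

Because both column sets are built from the same demogs vector, the EL row should now carry local.EL rather than local.frpl.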

Related

Choose dataframe variables by name and multiply with a vector elementwise

I have a data frame and a vector as follows:
my_df <- as.data.frame(
list(year = c(2001, 2001, 2001, 2001, 2001, 2001), month = c(1,
2, 3, 4, 5, 6), Pdt_d0 = c(0.379045935402736, 0.377328817455841,
0.341158889847019, 0.36761990427443, 0.372442657083218, 0.382702189949558
), Pdt_d1 = c(0.146034519173855, 0.166289573095497, 0.197787188740911,
0.137071647982617, 0.162103042313547, 0.168566518193772), Pdt_d2 = c(0.126975939811326,
0.107708783271871, 0.14096203677089, 0.142228236885706, 0.115542396064519,
0.106935751726809), Pdt_tot = c(2846715, 2897849.5, 2935406.25,
2850649, 2840313.75, 3087993.5))
)
my_vec <- 1:3
I want to multiply Pdt_d0:Pdt_d2 with the corresponding element from my_vec, while keeping the other columns untouched. I can get the desired multiplication with dplyr::select(my_df, num_range("Pdt_d", 0:2)) %>% mapply(`*`, ., my_vec) but I lose the year, month, Pdt_tot columns in the process. I tried to achieve my goal with dplyr::select(my_df, num_range("Pdt_d", 0:2)) <- dplyr::select(my_df, num_range("Pdt_d", 0:2)) %>% mapply(`*`, ., my_vec) which returns an error: 'select<-' is not an exported object. Is there an obvious trick I am not seeing?
I don't think my question is a duplicate; I have seen the answers in here and here but neither question allows me to choose variables by name
You can use the left-hand-side overwritten by the right-hand-side Map/mapply logic, which you tried, outside of the tidy world:
vars <- paste0("Pdt_d", 0:2)
my_df[vars] <- Map(`*`, my_df[vars], my_vec)
my_df
# year month Pdt_d0 Pdt_d1 Pdt_d2 Pdt_tot
#1 2001 1 0.3790459 0.2920690 0.3809278 2846715
#2 2001 2 0.3773288 0.3325791 0.3231263 2897850
#3 2001 3 0.3411589 0.3955744 0.4228861 2935406
#4 2001 4 0.3676199 0.2741433 0.4266847 2850649
#5 2001 5 0.3724427 0.3242061 0.3466272 2840314
#6 2001 6 0.3827022 0.3371330 0.3208073 3087994
This works because [<- exists as a function in R, for assigning to a left-hand-side selection made with square brackets, like my_df[vars].
The error was returned because the code has a select() call on the left-hand side, and there is no 'select<-' function. I.e., you can't assign to a select()-ion because it isn't set up to work like that. The tidy functions are usually expected to be piped like my_df %>% select() %>% etc without overwriting the original input.
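If you would rather stay inside a dplyr pipeline, one possible alternative (a sketch, assuming dplyr 1.0 or later; the shortened data frame and the name result are made up for illustration) is mutate() with across(), using cur_column() to look up which element of my_vec belongs to the column currently being processed.

```r
library(dplyr)

# Shortened stand-in for the question's data frame
my_df <- data.frame(
  year    = c(2001, 2001, 2001),
  month   = c(1, 2, 3),
  Pdt_d0  = c(0.379, 0.377, 0.341),
  Pdt_d1  = c(0.146, 0.166, 0.198),
  Pdt_d2  = c(0.127, 0.108, 0.141),
  Pdt_tot = c(2846715, 2897849.5, 2935406.25)
)
my_vec <- 1:3
vars <- paste0("Pdt_d", 0:2)

result <- my_df %>%
  mutate(across(
    all_of(vars),
    # match() finds the position of the current column within vars,
    # which is also the position of its multiplier in my_vec
    ~ .x * my_vec[match(cur_column(), vars)]
  ))

result
```

The other columns (year, month, Pdt_tot) pass through unchanged because across() only touches the columns named in vars.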
I don't think that you want to do this mess, but it does work.
library(dplyr)
library(tidyr)
my_df %>%
gather(variable, value, -year,-month,-Pdt_tot) %>%
group_by(year, month, Pdt_tot) %>%
mutate(value = value * my_vec) %>%
spread(variable,value)
year month Pdt_tot Pdt_d0 Pdt_d1 Pdt_d2
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2001 1 2846715 0.379 0.292 0.381
2 2001 2 2897850. 0.377 0.333 0.323
3 2001 3 2935406. 0.341 0.396 0.423
4 2001 4 2850649 0.368 0.274 0.427
5 2001 5 2840314. 0.372 0.324 0.347
6 2001 6 3087994. 0.383 0.337 0.321
To avoid naming year, month, and Pdt_tot explicitly, select the Pdt_d columns with num_range instead:
my_df %>%
gather(variable, value, num_range("Pdt_d", 0:2)) %>%
group_by(across(c(-variable, -value))) %>%
mutate(value = value * my_vec) %>%
spread(variable, value)
year month Pdt_tot Pdt_d0 Pdt_d1 Pdt_d2
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2001 1 2846715 0.379 0.292 0.381
2 2001 2 2897850. 0.377 0.333 0.323
3 2001 3 2935406. 0.341 0.396 0.423
4 2001 4 2850649 0.368 0.274 0.427
5 2001 5 2840314. 0.372 0.324 0.347
6 2001 6 3087994. 0.383 0.337 0.321

tidyselect::where() inconsistencies: where is where()?

Summary: You can do rename(A=1, B=2), can you do the same using rename_with()? my ~str_replace(... paste0()) works, I don't need to change that. But it only works for one variable at a time. Tidyselect suggests wrapping where(~str_replace...) but then complains it can't find it even though I can get where() to work in other instances.
I want to implement rename_with for more than one variable, but I get the error: Formula shorthand must be wrapped in `where()`.
# Bad
data %>% select(~str_replace(., "Var_2_", paste0("Issue: Time")))
# Good
data %>% select(where(~str_replace(., "Var_2_", paste0("Issue: time"))))
Example original:
test%>% rename_with( ~str_replace(., "Var_2_", paste0("Issue: Time")), ~str_replace(., "Var_3_", paste0("Issue: Time")))
when I run
test%>% rename_with(where( ~str_replace(., "Var_2_", paste0("Issue: Time")), ~str_replace(., "Var_3_", paste0("Issue: Time"))))
and
test%>% rename_with( where(~str_replace(., "Var_2_", paste0("Issue: Time"))), where(~str_replace(., "Var_3_", paste0("Issue: Time"))))
I get
Error in where(~str_replace(., "Var_1_", paste0("Gov't surveillance: video wave")), : could not find function "where"
And I can't find it tabbing through tidyselect::
But I can run
test%>% select(where(is.numeric)) %>% map(sd, na.rm = TRUE)
without any issue so it does exist. What am I doing wrong?
Example data:
x <- c("_1_1",
"_1_2",
"_1_3",
"_2_1",
"_2_2",
"_2_3",
"_3_1",
"_3_2",
"_3_3",
"_4_3")
paste0("Var",x)
test <- t(as_tibble(rnorm(10, 5.5, .35)))
colnames(test) <- paste0("Var",x)
Note that the argument order in rename_with is switched compared to rename_at: the function comes first and the column selection second. It is not entirely clear which columns the question intends to rename, especially with str_replace passed in both arguments. A typical use, replacing 'Var_2' with 'Issue: Time' in the column names that start with 'Var_2', would be
library(dplyr)
library(stringr)
data <- data %>%
rename_with(~ str_replace(., 'Var_2', 'Issue: Time'),
starts_with('Var_2'))
-output
data
# A tibble: 1 x 10
# Var_1_1 Var_1_2 Var_1_3 `Issue: Time_1` `Issue: Time_2` `Issue: Time_3` Var_3_1 Var_3_2 Var_3_3 Var_4_3
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 5.68 5.18 5.34 5.38 5.47 5.82 5.93 5.35 5.20 5.62
If we need to change multiple column patterns, use matches
data %>%
rename_with(~ str_replace(., '(Var_2|Var_3)', '\\1_Issue: Time'),
matches('Var_2|Var_3'))
# A tibble: 1 x 10
# Var_1_1 Var_1_2 Var_1_3 `Var_2_Issue: Tim… `Var_2_Issue: Tim… `Var_2_Issue: Tim… `Var_3_Issue: Ti… `Var_3_Issue: Ti… `Var_3_Issue: Ti… Var_4_3
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 5.68 5.18 5.34 5.38 5.47 5.82 5.93 5.35 5.20 5.62
Or if each pattern should get its own replacement, pass a named vector (pattern = replacement) to str_replace_all
library(purrr) # for set_names
data1 <- data %>%
set_names(str_replace_all(names(.), c("Var_1" = "Issue 1 wave", "Var_2" = "Issue 2 Wave")))
compare the output
data1
# A tibble: 1 x 10
`Issue 1 wave_1` `Issue 1 wave_2` `Issue 1 wave_3` `Issue 2 Wave_1` `Issue 2 Wave_2` `Issue 2 Wave_3` Var_3_1 Var_3_2 Var_3_3 Var_4_3
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 5.68 5.18 5.34 5.38 5.47 5.82 5.93 5.35 5.20 5.62
with original data
data
# A tibble: 1 x 10
Var_1_1 Var_1_2 Var_1_3 Var_2_1 Var_2_2 Var_2_3 Var_3_1 Var_3_2 Var_3_3 Var_4_3
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 5.68 5.18 5.34 5.38 5.47 5.82 5.93 5.35 5.20 5.62
where is generally used to test the column values rather than the column names, e.g. to select the columns that are numeric, use select(where(is.numeric)). To find columns by name there are the select helpers starts_with, ends_with, and contains for substrings, or matches for a regex pattern. A use case of where would be
data %>%
rename_with(~ str_replace(., 'Var_2', 'Issue: Time'), where(~ all(. > 5.5)))
# A tibble: 1 x 10
# Var_1_1 Var_1_2 Var_1_3 Var_2_1 Var_2_2 `Issue: Time_3` Var_3_1 Var_3_2 Var_3_3 Var_4_3
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 5.68 5.18 5.34 5.38 5.47 5.82 5.93 5.35 5.20 5.62
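To make the name-versus-value distinction concrete, here is a small sketch on a made-up tibble (toy, by_name, and by_value are hypothetical names; base sub() stands in for str_replace to keep it self-contained). In the question, where() was passed as the first (function) argument of rename_with, which is evaluated as ordinary R code rather than as a column selection; that is presumably why it could not be found there.

```r
library(dplyr)

# Made-up one-row tibble
toy <- tibble(Var_2_1 = 5.1, Var_2_2 = 5.9, Var_3_1 = 6.2)

# matches() selects by column NAME: both Var_2 columns are renamed
by_name <- toy %>%
  rename_with(~ sub("Var_2", "Issue: Time", .x), matches("^Var_2"))

# where() selects by column CONTENTS: only columns whose values all exceed 5.5
by_value <- toy %>%
  rename_with(~ paste0(.x, "_high"), where(~ all(.x > 5.5)))

names(by_name)
names(by_value)
```

In both calls the renaming function is the first argument after the data and the selection is the second, which is the order rename_with expects.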
In the OP's code, select/map can be replaced with summarise/across
data %>%
summarise(across(where(is.numeric), ~ sd(.x, na.rm = TRUE)))
data
data <- as_tibble(test)

Multiple gathering in R to create tidy dataset

I have a complicated untidy dataset which a dummy version of can be replicated below.
studentID <- seq(1:250)
score2018 <- runif(250)
score2019 <- runif(250)
score2020 <- runif(250)
payment2018 <- runif(250, min=10000, max=12000)
payment2019 <- runif(250, min=11000, max=13000)
payment2020 <- runif(250, min=12000, max=14000)
attendance2018 <- runif(250, min=0.75, max=1)
attendance2019 <- runif(250, min=0.75, max=1)
attendance2020 <- runif(250, min=0.75, max=1)
untidy_df <- data.frame(studentID, score2018, score2019, score2020, payment2018, payment2019, payment2020, attendance2018, attendance2019, attendance2020)
I would like to gather this data frame so that we only have 5 columns: studentID, year, score, payment, attendance. I know how to gather at a basic level, but I have 3 sets to gather here, and I can't see how to do this in one go.
Thanks in advance!
With tidyr you can use pivot_longer:
library(tidyr)
untidy_df %>%
pivot_longer(cols = -studentID, names_to = c(".value", "year"), names_pattern = "(\\w+)(\\d{4})")
Output
# A tibble: 750 x 5
studentID year score payment attendance
<int> <chr> <dbl> <dbl> <dbl>
1 1 2018 0.432 10762. 0.786
2 1 2019 0.948 11340. 0.909
3 1 2020 0.122 12837. 0.944
4 2 2018 0.422 11515. 0.950
5 2 2019 0.0639 12968. 0.828
6 2 2020 0.611 13645. 0.901
7 3 2018 0.489 11281. 0.784
8 3 2019 0.00337 12250. 0.753
9 3 2020 0.711 12898. 0.803
10 4 2018 0.0596 10526. 0.842
Using pure R:
tidy_df <- reshape(untidy_df, direction="long", idvar="studentID", varying=2:10, sep="")
head(tidy_df)
studentID time score payment attendance
1.2018 1 2018 0.86743970 10995.45 0.9473540
2.2018 2 2018 0.53204701 11152.74 0.8167776
3.2018 3 2018 0.90072918 10631.06 0.9335316
4.2018 4 2018 0.89154492 11889.23 0.9098399
5.2018 5 2018 0.06320442 10973.20 0.8118909
6.2018 6 2018 0.67519166 11751.67 0.8328860
If you want "year" instead of the default "time", add timevar="year"
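For instance, on a smaller made-up version of the data (untidy_small and tidy_small are hypothetical names, and only two years and two measures are kept), the call with timevar would look roughly like this:

```r
set.seed(1)

# Smaller stand-in for untidy_df: 3 students, 2 years, 2 measures
untidy_small <- data.frame(
  studentID   = 1:3,
  score2018   = runif(3),
  score2019   = runif(3),
  payment2018 = runif(3, min = 10000, max = 12000),
  payment2019 = runif(3, min = 11000, max = 13000)
)

tidy_small <- reshape(
  untidy_small,
  direction = "long",
  idvar     = "studentID",
  varying   = 2:5,     # every column except studentID
  sep       = "",      # names split where a letter meets a digit
  timevar   = "year"   # name the time column "year" instead of "time"
)

tidy_small
```

With sep = "" the year part of each name is converted to a number, so year should come back as numeric rather than character.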
We could try:
library(dplyr)
library(tidyr)
untidy_df %>%
pivot_longer(cols = -studentID) %>%
separate(col = name, sep = "(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)", into = c("measure", "year")) %>%
pivot_wider(names_from = measure, values_from = value )
Which returns:
studentID year score payment attendance
<int> <chr> <dbl> <dbl> <dbl>
1 1 2018 0.807 10179. 0.974
2 1 2019 0.599 11601. 0.785
3 1 2020 0.515 12347. 0.760
4 2 2018 0.474 11154. 0.983
5 2 2019 0.409 11682. 0.864
6 2 2020 0.688 13756. 0.812
7 3 2018 0.509 11746. 0.870
8 3 2019 0.867 12851. 0.801
9 3 2020 0.878 12710. 0.955
10 4 2018 0.621 11165. 0.975

Summarizing by group of two rows

I have a data frame that I want to group by two variables, and then summarize the total and average.
I ran this on my data, and the result is correct.
df %>%
group_by(date, group) %>%
summarise(
weight = sum(ind_weigh) ,
total_usage = sum(total_usage_min) ,
Avg_usage = total_usage / weight) %>%
ungroup()
It returns this data frame:
df <- tibble::tribble(
~date, ~group, ~weight, ~total_usage, ~Avg_usage,
20190201, 0, 450762, 67184943, 149,
20190201, 1, 2788303, 385115718, 138,
20190202, 0, 483959, 60677765, 125,
20190202, 1, 2413699, 311226351, 129,
20190203, 0, 471189, 59921762, 127,
20190203, 1, 2143811, 277425186, 129,
20190204, 0, 531020, 83695977, 158,
20190204, 1, 2640087, 403200829, 153
)
I am wondering how I can add another variable in my script to get Avg_usage_total (computed over both group 0 and group 1) as well.
Expected result:
e.g., first row: 67184943 / (450762 + 2788303) = 20.7
date group rech total_usage Avg_usage Avg_usage_total
20190201 0 450762 67184943 149 20.7
20190201 1 2788303 385115718 138 118.9
You can do that using mutate and group_by if necessary.
library(tidyverse)
# generate dataset
(df <- tibble(
date = c(rep(Sys.Date(), 10), rep(Sys.Date() - 1, 10)),
group = rbinom(20, 1, 0.5),
rech = runif(20),
weight = runif(20),
total_usage = runif(20)
))
# A tibble: 20 x 5
date group rech weight total_usage
<date> <int> <dbl> <dbl> <dbl>
1 2019-03-10 0 0.985 0.831 0.963
2 2019-03-10 1 0.178 0.990 0.676
3 2019-03-10 1 0.505 0.697 0.152
4 2019-03-10 1 0.416 0.165 0.824
5 2019-03-10 0 0.554 0.790 0.974
# step 1 of analysis
(df <- df %>%
group_by(date, group) %>%
summarise(rech = sum(rech),
weight = sum(weight),
total_usage = sum(total_usage)) %>%
mutate(Avg_usage = total_usage / weight))
# A tibble: 4 x 6
# Groups: date [2]
date group rech weight total_usage Avg_usage
<date> <int> <dbl> <dbl> <dbl> <dbl>
1 2019-03-09 0 3.29 4.82 3.03 0.628
2 2019-03-09 1 1.45 1.22 1.16 0.954
3 2019-03-10 0 1.54 1.62 1.94 1.20
4 2019-03-10 1 3.15 4.55 4.63 1.02
# step 2 of analysis
df %>%
group_by(date) %>% # only necessary if you want to compute Avg_usage_total by date
mutate(Avg_usage_total = total_usage / sum(rech)) %>% # total_usage is taken by row; sum(rech) is taken within each date group
ungroup()
# A tibble: 4 x 7
date group rech weight total_usage Avg_usage Avg_usage_total
<date> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2019-03-09 0 3.29 4.82 3.03 0.628 0.639
2 2019-03-09 1 1.45 1.22 1.16 0.954 0.246
3 2019-03-10 0 1.54 1.62 1.94 1.20 0.413
4 2019-03-10 1 3.15 4.55 4.63 1.02 0.986
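Applied to the first two dates of the asker's own summary table, the same idea can be sketched as follows. This assumes the denominator is the weight column (shown as "rech" in the expected result); summary_df and result are names made up for illustration.

```r
library(dplyr)

# First four rows of the summarised data from the question
summary_df <- data.frame(
  date        = c(20190201, 20190201, 20190202, 20190202),
  group       = c(0, 1, 0, 1),
  weight      = c(450762, 2788303, 483959, 2413699),
  total_usage = c(67184943, 385115718, 60677765, 311226351)
)

result <- summary_df %>%
  group_by(date) %>%
  # each row's usage divided by the combined weight of both groups on that date
  mutate(Avg_usage_total = total_usage / sum(weight)) %>%
  ungroup()

round(result$Avg_usage_total[1:2], 1)
# 20.7 118.9
```

The two values for 20190201 agree with the expected result in the question: 67184943 / 3239065 and 385115718 / 3239065.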

Conditional replacement of column name in tibble using dplyr

I have the following tibble:
df <- structure(list(gene_symbol = c("0610005C13Rik", "0610007P14Rik",
"0610009B22Rik", "0610009L18Rik", "0610009O20Rik", "0610010B08Rik"
), foo.control.cv = c(1.16204038288333, 0.120508045270669, 0.205712615954009,
0.504508040948641, 0.333956330117591, 0.543693011377001), foo.control.mean = c(2.66407458486012,
187.137728870855, 142.111269303428, 16.7278587043453, 69.8602872478098,
4.77769028710622), foo.treated.cv = c(0.905769898934564, 0.186441944401973,
0.158552512842753, 0.551955061149896, 0.15743983656006, 0.290447431974039
), foo.treated.mean = c(2.40658723367692, 180.846795140269, 139.054032348287,
11.8584348984435, 76.8141734599118, 2.24088124240385)), .Names = c("gene_symbol",
"foo.control.cv", "foo.control.mean", "foo.treated.cv", "foo.treated.mean"
), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
6L))
Which looks like this:
# A tibble: 6 × 5
gene_symbol foo.control.cv foo.control.mean foo.treated.cv foo.treated.mean
* <chr> <dbl> <dbl> <dbl> <dbl>
1 0610005C13Rik 1.1620404 2.664075 0.9057699 2.406587
2 0610007P14Rik 0.1205080 187.137729 0.1864419 180.846795
3 0610009B22Rik 0.2057126 142.111269 0.1585525 139.054032
4 0610009L18Rik 0.5045080 16.727859 0.5519551 11.858435
5 0610009O20Rik 0.3339563 69.860287 0.1574398 76.814173
6 0610010B08Rik 0.5436930 4.777690 0.2904474 2.240881
What I want to do is to replace all column names with mean in it into mean_expr. Resulting in
gene_symbol foo.control.cv foo.control.mean_expr foo.treated.cv foo.treated.mean_expr
1 0610005C13Rik 1.1620404 2.664075 0.9057699 2.406587
2 0610007P14Rik 0.1205080 187.137729 0.1864419 180.846795
3 0610009B22Rik 0.2057126 142.111269 0.1585525 139.054032
4 0610009L18Rik 0.5045080 16.727859 0.5519551 11.858435
5 0610009O20Rik 0.3339563 69.860287 0.1574398 76.814173
6 0610010B08Rik 0.5436930 4.777690 0.2904474 2.240881
How can I achieve that?
With dplyr versions before 1.0, you can use rename_at (since superseded by rename_with; funs() is deprecated as well, so a formula is used here instead):
library(dplyr)
df %>% rename_at(vars(contains('mean')), ~ sub('mean', 'mean_expr', .x))
#> # A tibble: 6 × 5
#> gene_symbol foo.control.cv foo.control.mean_expr foo.treated.cv
#> * <chr> <dbl> <dbl> <dbl>
#> 1 0610005C13Rik 1.1620404 2.664075 0.9057699
#> 2 0610007P14Rik 0.1205080 187.137729 0.1864419
#> 3 0610009B22Rik 0.2057126 142.111269 0.1585525
#> 4 0610009L18Rik 0.5045080 16.727859 0.5519551
#> 5 0610009O20Rik 0.3339563 69.860287 0.1574398
#> 6 0610010B08Rik 0.5436930 4.777690 0.2904474
#> # ... with 1 more variables: foo.treated.mean_expr <dbl>
Really, you could use rename_all, as well, as names that don't match would be unaffected anyway. Further, you can use a quosure or anything that can be coerced to a function by rlang::as_function for .funs, so you can use purrr-style notation:
df %>% rename_all(~sub('mean', 'mean_expr', .x))
Since a data frame is a list, purrr's set_names can do the same thing:
library(purrr) # or library(tidyverse)
df %>% set_names(~sub('mean', 'mean_expr', .x))
All return the same thing.
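For readers on dplyr 1.0 or later, where the scoped rename_at/rename_all verbs are superseded, a roughly equivalent call with rename_with might look like the sketch below. The small tibble df_small is a made-up stand-in for df, and renamed is a hypothetical name.

```r
library(dplyr)

# Made-up stand-in with the same column naming scheme as df
df_small <- tibble(
  gene_symbol      = c("0610005C13Rik", "0610007P14Rik"),
  foo.control.cv   = c(1.16, 0.12),
  foo.control.mean = c(2.66, 187.14),
  foo.treated.cv   = c(0.91, 0.19),
  foo.treated.mean = c(2.41, 180.85)
)

renamed <- df_small %>%
  # function first, column selection second
  rename_with(~ sub("mean$", "mean_expr", .x), ends_with("mean"))

names(renamed)
```

Only the two columns ending in "mean" are passed to the renaming function; the rest keep their names.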
Another option is to append the suffix with sprintf in rename_at (using the devel version of dplyr)
library(dplyr)
df %>%
rename_at(vars(matches('mean')), ~ sprintf('%s_expr', .x))
# A tibble: 6 × 5
# gene_symbol foo.control.cv foo.control.mean_expr foo.treated.cv foo.treated.mean_expr
#* <chr> <dbl> <dbl> <dbl> <dbl>
#1 0610005C13Rik 1.1620404 2.664075 0.9057699 2.406587
#2 0610007P14Rik 0.1205080 187.137729 0.1864419 180.846795
#3 0610009B22Rik 0.2057126 142.111269 0.1585525 139.054032
#4 0610009L18Rik 0.5045080 16.727859 0.5519551 11.858435
#5 0610009O20Rik 0.3339563 69.860287 0.1574398 76.814173
#6 0610010B08Rik 0.5436930 4.777690 0.2904474 2.240881
Or using rename_if
df %>%
rename_if(grepl("mean", names(.)), ~ sprintf("%s_expr", .x))
Here is a non-dplyr base R method:
names(df) <- sub("mean$", "mean_expr", names(df))
# or names(df) <- sub("mean", "mean_expr", names(df)) if the mean doesn't have to be at the
# end of the string
names(df)
#[1] "gene_symbol" "foo.control.cv" "foo.control.mean_expr"
#[4] "foo.treated.cv" "foo.treated.mean_expr"
If you want it to be a part of the pipe, you can make use of setNames function:
df %>% setNames(sub("mean", "mean_expr", names(.))) %>% names(.)
#[1] "gene_symbol" "foo.control.cv" "foo.control.mean_expr"
#[4] "foo.treated.cv" "foo.treated.mean_expr"
Another option is dplyr::select_all():
df %>% select_all(~gsub("mean", "mean_expr", .))
And with the use of magrittr you can have
library(magrittr)
names(df)[df %>% names %>% grep(pattern = "mean")] %<>% paste0("_expr")
df
# A tibble: 6 x 5
gene_symbol foo.control.cv foo.control.mean_expr foo.treated.cv foo.treated.mean_expr
* <chr> <dbl> <dbl> <dbl> <dbl>
1 0610005C13Rik 1.16 2.66 0.906 2.41
2 0610007P14Rik 0.121 187. 0.186 181.
3 0610009B22Rik 0.206 142. 0.159 139.
4 0610009L18Rik 0.505 16.7 0.552 11.9
5 0610009O20Rik 0.334 69.9 0.157 76.8
6 0610010B08Rik 0.544 4.78 0.290 2.24
