Choose dataframe variables by name and multiply with a vector elementwise - r

I have a data frame and a vector as follows:
my_df <- as.data.frame(
list(year = c(2001, 2001, 2001, 2001, 2001, 2001), month = c(1,
2, 3, 4, 5, 6), Pdt_d0 = c(0.379045935402736, 0.377328817455841,
0.341158889847019, 0.36761990427443, 0.372442657083218, 0.382702189949558
), Pdt_d1 = c(0.146034519173855, 0.166289573095497, 0.197787188740911,
0.137071647982617, 0.162103042313547, 0.168566518193772), Pdt_d2 = c(0.126975939811326,
0.107708783271871, 0.14096203677089, 0.142228236885706, 0.115542396064519,
0.106935751726809), Pdt_tot = c(2846715, 2897849.5, 2935406.25,
2850649, 2840313.75, 3087993.5))
)
my_vec <- 1:3
I want to multiply Pdt_d0:Pdt_d2 with the corresponding element from my_vec, while keeping the other columns untouched. I can get the desired multiplication with dplyr::select(my_df, num_range("Pdt_d", 0:2)) %>% mapply(``*``, ., my_vec) but I lose the year, month, Pdt_tot columns in the process. I tried to achieve my goal with dplyr::select(my_df, num_range("Pdt_d", 0:2)) <- dplyr::select(my_df, num_range("Pdt_d", 0:2)) %>% mapply(``*``, ., my_vec) which returns an error 'select<-' is not an exported object. Is there an obvious trick I am not seeing?
I don't think my question is a duplicate; I have seen the answers in here and here but neither question allows me to choose variables by name

You can use the left-hand-side overwritten by the right-hand-side Map/mapply logic, which you tried, outside of the tidy world:
vars <- paste0("Pdt_d", 0:2)
my_df[vars] <- Map(`*`, my_df[vars], my_vec)
my_df
# year month Pdt_d0 Pdt_d1 Pdt_d2 Pdt_tot
#1 2001 1 0.3790459 0.2920690 0.3809278 2846715
#2 2001 2 0.3773288 0.3325791 0.3231263 2897850
#3 2001 3 0.3411589 0.3955744 0.4228861 2935406
#4 2001 4 0.3676199 0.2741433 0.4266847 2850649
#5 2001 5 0.3724427 0.3242061 0.3466272 2840314
#6 2001 6 0.3827022 0.3371330 0.3208073 3087994
This works because [<- exists as a function in R, for assigning to a left-hand-side selection by the square brackets, like my_df[].
The error that was returned is because the code has a select() function on the left-hand-side, and there is no 'select<-' function. I.e., you can't assign to a select()-ion because it isn't setup to work like that. The tidy functions are usually expected to be piped like my_df %>% select() %>% etc without overwriting the original input.

I don't think that you want to do this mess, but it does work.
library(dplyr)
library(tidyr)
my_df %>%
gather(variable, value, -year,-month,-Pdt_tot) %>%
group_by(year, month, Pdt_tot) %>%
mutate(value = value * my_vector) %>%
spread(variable,value)
year month Pdt_tot Pdt_d0 Pdt_d1 Pdt_d2
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2001 1 2846715 0.379 0.292 0.381
2 2001 2 2897850. 0.377 0.333 0.323
3 2001 3 2935406. 0.341 0.396 0.423
4 2001 4 2850649 0.368 0.274 0.427
5 2001 5 2840314. 0.372 0.324 0.347
6 2001 6 3087994. 0.383 0.337 0.321
Not specifying year, month, and Pdt_tot is,
my_df %>%
gather(variable, value, - !num_range("Pdt_d", 0:2)) %>%
group_by(across(c(-variable, -value))) %>%
mutate(value = value * my_vector) %>%
spread(variable, value)
year month Pdt_tot Pdt_d0 Pdt_d1 Pdt_d2
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2001 1 2846715 0.379 0.292 0.381
2 2001 2 2897850. 0.377 0.333 0.323
3 2001 3 2935406. 0.341 0.396 0.423
4 2001 4 2850649 0.368 0.274 0.427
5 2001 5 2840314. 0.372 0.324 0.347
6 2001 6 3087994. 0.383 0.337 0.321

Related

In R , there are `actual` and `budget` values,how to add new variable and calculate the variable values

In variable type ,there are actual and budget values,how to add new variable and calculate the variable value ? Current code can work, but a little bording. Anyone can help? Thanks!
ori_data <- data.frame(
category=c("A","A","A","B","B","B"),
year=c(2021,2022,2022,2021,2022,2022),
type=c("actual","actual","budget","actual","actual","budget"),
sales=c(100,120,130,70,80,90),
profit=c(3.7,5.52,5.33,2.73,3.92,3.69)
)
Add sales inc%
ori_data$sales_inc_or_budget_acheved[category=='A'&year=='2022'&type=='actual'] <-
ori_data$sales[category=='A'&year=='2022'&type=='actual']/
ori_data$sales[category=='A'&year=='2021'&type=='actual']-1
Add budget acheved%
ori_data$sales_inc_or_budget_acheved[category=='A'&year=='2022'&type=='budget'] <-
ori_data$sales[category=='A'&year=='2022'&type=='actual']/
ori_data$sales[category=='A'&year=='2022'&type=='budget']
Using a group_by and an if_elseyou could do:
library(dplyr)
ori_data |>
group_by(category) |>
arrange(category, type, year) |>
mutate(sales_inc_or_budget_achieved = if_else(type == "actual",
sales / lag(sales) - 1,
lag(sales) / sales)) |>
ungroup()
#> # A tibble: 6 × 6
#> category year type sales profit sales_inc_or_budget_achieved
#> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
#> 1 A 2021 actual 100 3.7 NA
#> 2 A 2022 actual 120 5.52 0.2
#> 3 A 2022 budget 130 5.33 0.923
#> 4 B 2021 actual 70 2.73 NA
#> 5 B 2022 actual 80 3.92 0.143
#> 6 B 2022 budget 90 3.69 0.889
And using across you could do the same for both sales and profit:
ori_data |>
group_by(category) |>
arrange(category, type, year) |>
mutate(across(c(sales, profit), ~ if_else(type == "actual",
.x / lag(.x) - 1,
lag(.x) / .x),
.names = "{.col}_inc_or_budget_achieved")) |>
ungroup()
#> # A tibble: 6 × 7
#> category year type sales profit sales_inc_or_budget_achie… profit_inc_or_b…
#> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 A 2021 actual 100 3.7 NA NA
#> 2 A 2022 actual 120 5.52 0.2 0.492
#> 3 A 2022 budget 130 5.33 0.923 1.04
#> 4 B 2021 actual 70 2.73 NA NA
#> 5 B 2022 actual 80 3.92 0.143 0.436
#> 6 B 2022 budget 90 3.69 0.889 1.06
Answer from stefan suits perfectly well, however, I would suggest you rearrange your data first.
In my opinion sales and profit are types of measures (aka observations) and actual and budget are the measurements here:
library(tidyr)
library(dplyr)
ori_data2 <-
ori_data %>%
pivot_longer(c(sales, profit)) %>%
pivot_wider(names_from = type, values_from = value) %>%
group_by(category, name) %>%
arrange(year, .by_group = TRUE)
then your calculations become much more easier:
ori_data2 %>%
mutate(increase = actual / lag(actual) - 1, # compare to the year before
budget_acheved = actual / budget) %>% # compare actual vs. budget
filter(year == 2022) # you can filter for year of interest
mutate(across(c(increase, budget_acheved), scales::percent)) # and format as percent

Reordering rows alphabetically with specific exception(s) in R

This is my first Stack Overflow question so bear with me please. I'm trying to create dataframes that are ordered alphabetically based on a "Variable" field, with exceptions made for rows of particular values (e.g. "Avg. Temp" at the top of the dataframe and "Intercept" at the bottom of the dataframe). The starting dataframe might look like this, for example:
Variable Model 1 Estimate
Year=2009 0.026
Year=2010 -0.04
Year=2011 -0.135***
Age 0.033***
Avg Temp. -0.001***
Intercept -3.772***
Sex -0.073***
Year=2008 0.084***
Year=2012 -0.237***
Year=2013 -0.326***
Year=2014 -0.431***
Year=2015 -0.589***
And I want to reorder it as such:
Variable Model 1 Estimate
Avg Temp. -0.001***
Age 0.033***
Sex -0.073***
Year=2008 0.084***
Year=2009 0.026
Year=2010 -0.04
Year=2011 -0.135***
Year=2012 -0.237***
Year=2013 -0.326***
Year=2014 -0.431***
Year=2015 -0.589***
Intercept -3.772***
Appreciate any help on this.
You can use the fct_relevel() function from {forcats}. Its first call put Avg Temp., Age and Sex at the beginning (after = 0 by default). The second call will put Intercept at the end (n() refers to the numbers of line in the data frame).
library(tidyverse)
df <-
tribble(~Variable, ~Model,
"Year=2009", 0.026,
"Year=2010", -0.04,
"Year=2011", -0.135,
"Age", 0.033,
"Avg Temp.", -0.001,
"Intercept", -3.772,
"Sex", -0.073,
"Year=2008", 0.084,
"Year=2012", -0.237,
"Year=2013", -0.326,
"Year=2014", -0.431,
"Year=2015", -0.589)
df %>%
mutate(Variable = as.factor(Variable),
Variable = fct_relevel(Variable, "Avg Temp.", "Age", "Sex"),
Variable = fct_relevel(Variable, "Intercept", after = n())) %>%
arrange(Variable)
# A tibble: 12 × 2
Variable Model
<fct> <dbl>
1 Avg Temp. -0.001
2 Age 0.033
3 Sex -0.073
4 Year=2008 0.084
5 Year=2009 0.026
6 Year=2010 -0.04
7 Year=2011 -0.135
8 Year=2012 -0.237
9 Year=2013 -0.326
10 Year=2014 -0.431
11 Year=2015 -0.589
12 Intercept -3.77
Another option, in case the dataframes contain a variety of different variable names besides year and intercept, is something like this:
library(tidyverse)
# Sample data
df <- tribble(
~variable, ~model_1_estimate,
"Year=2009", "0.026",
"Year=2010", "-0.04",
"Year=2011", "-0.135***",
"Age", "0.033***",
"Avg Temp.", "-0.001***",
"Intercept", "-3.772***",
"Sex", "-0.073***",
"Year=2008", "0.084***",
"Year=2012", "-0.237***",
"Year=2013", "-0.326***",
"Year=2014", "-0.431***",
"Year=2015", "-0.589***"
)
# Possible solution
df |>
separate(variable, c("term", "year"), sep = "=") |>
mutate(intercept = if_else(term == "Intercept", 1, 0)) |>
arrange(intercept, term, year) |>
select(-intercept)
#> # A tibble: 12 × 3
#> term year model_1_estimate
#> <chr> <chr> <chr>
#> 1 Age <NA> 0.033***
#> 2 Avg Temp. <NA> -0.001***
#> 3 Sex <NA> -0.073***
#> 4 Year 2008 0.084***
#> 5 Year 2009 0.026
#> 6 Year 2010 -0.04
#> 7 Year 2011 -0.135***
#> 8 Year 2012 -0.237***
#> 9 Year 2013 -0.326***
#> 10 Year 2014 -0.431***
#> 11 Year 2015 -0.589***
#> 12 Intercept <NA> -3.772***
Created on 2022-06-28 by the reprex package (v2.0.1)

Multiple gathering in R to create tidy dataset

I have a complicated untidy dataset which a dummy version of can be replicated below.
studentID <- seq(1:250)
score2018 <- runif(250)
score2019 <- runif(250)
score2020 <- runif(250)
payment2018 <- runif(250, min=10000, max=12000)
payment2019 <- runif(250, min=11000, max=13000)
payment2020 <- runif(250, min=12000, max=14000)
attendance2018 <- runif(250, min=0.75, max=1)
attendance2019 <- runif(250, min=0.75, max=1)
attendance2020 <- runif(250, min=0.75, max=1)
untidy_df <- data.frame(studentID, score2018, score2019, score2020, payment2018, payment2019, payment2020, attendance2018, attendance2019, attendance2020)
I would like to gather this data frame so that we only have 5 columns: studentID, year, score, payment, attendance. I know how to gather at a basic level, but I have 3 sets to gather here, and I can't see how to do this in one go.
Thanks in advance!
With tidyr you can use pivot_longer:
library(tidyr)
untidy_df %>%
pivot_longer(cols = -studentID, names_to = c(".value", "year"), names_pattern = "(\\w+)(\\d{4})")
Output
# A tibble: 750 x 5
studentID year score payment attendance
<int> <chr> <dbl> <dbl> <dbl>
1 1 2018 0.432 10762. 0.786
2 1 2019 0.948 11340. 0.909
3 1 2020 0.122 12837. 0.944
4 2 2018 0.422 11515. 0.950
5 2 2019 0.0639 12968. 0.828
6 2 2020 0.611 13645. 0.901
7 3 2018 0.489 11281. 0.784
8 3 2019 0.00337 12250. 0.753
9 3 2020 0.711 12898. 0.803
10 4 2018 0.0596 10526. 0.842
Using pure R:
tidy_df <- reshape(untidy_df, direction="long", idvar="studentID", varying=2:10, sep="")
head(tidy_df)
studentID time score payment attendance
1.2018 1 2018 0.86743970 10995.45 0.9473540
2.2018 2 2018 0.53204701 11152.74 0.8167776
3.2018 3 2018 0.90072918 10631.06 0.9335316
4.2018 4 2018 0.89154492 11889.23 0.9098399
5.2018 5 2018 0.06320442 10973.20 0.8118909
6.2018 6 2018 0.67519166 11751.67 0.8328860
If you want "year" instead of the default "time", add timevar="year"
We could try:
library(dplyr)
library(tidyr)
untidy_df %>%
pivot_longer(cols = -studentID) %>%
separate(col = name, sep = "(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)", into = c("measure", "year")) %>%
pivot_wider(names_from = measure, values_from = value )
Which returns:
studentID year score payment attendance
<int> <chr> <dbl> <dbl> <dbl>
1 1 2018 0.807 10179. 0.974
2 1 2019 0.599 11601. 0.785
3 1 2020 0.515 12347. 0.760
4 2 2018 0.474 11154. 0.983
5 2 2019 0.409 11682. 0.864
6 2 2020 0.688 13756. 0.812
7 3 2018 0.509 11746. 0.870
8 3 2019 0.867 12851. 0.801
9 3 2020 0.878 12710. 0.955
10 4 2018 0.621 11165. 0.975

Aggregating by fixed date range R

Given a simplification of my dataset like:
df <- data.frame("ID"= c(1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2),
"ForestType" = c("oak","oak","oak","oak","oak","oak","oak","oak","oak","oak","oak","oak",
"pine","pine","pine","pine","pine","pine","pine","pine","pine","pine","pine","pine"),
"Date"= c("1987.01.01","1987.06.01","1987.10.01","1987.11.01",
"1988.01.01","1988.03.01","1988.04.01","1988.06.01",
"1989.03.01","1989.05.01","1989.07.01","1989.08.01",
"1987.01.01","1987.06.01","1987.10.01","1987.11.01",
"1988.01.01","1988.03.01","1988.04.01","1988.06.01",
"1989.03.01","1989.05.01","1989.07.01","1989.08.01"),
"NDVI"= c(0.1,0.2,0.3,0.55,0.31,0.26,0.34,0.52,0.41,0.45,0.50,0.7,
0.2,0.3,0.4,0.53,0.52,0.54,0.78,0.73,0.72,0.71,0.76,0.9),
check.names = FALSE, stringsAsFactors = FALSE)
I would like to obtain the means of NDVI values by a certain period of time, in this case by year. Take into account that in my real dataset I would need it for seasons, so it should be adaptable.
These means should consider:
Trimming outliers: for example 25% of the highest values and 25% of the lowest values.
They should be by class, in this case by the ID field.
So the output should look something like:
> desired_df
ID ForestType Date meanNDVI
1 1 oak 1987 0.250
2 1 oak 1988 0.325
3 1 oak 1989 0.430
4 2 pine 1987 0.350
5 2 pine 1988 0.635
6 2 pine 1989 0.740
In this case, for example, 0.250 corresponds to mean NDVI on 1987 of ID=1 and it is the mean of the 4 values of that year taking the lowest and the highest out.
Thanks a lot!
library(tidyverse)
library(lubridate)
df %>%
mutate(Date = as.Date(Date, format = "%Y.%m.%d")) %>%
group_by(ID, ForestType, Year = year(Date)) %>%
filter(NDVI > quantile(NDVI, .25) & NDVI < quantile(NDVI, .75)) %>%
summarise(meanNDVI = mean(NDVI))
Output
# A tibble: 6 x 4
# Groups: ID, ForestType [2]
ID ForestType Year meanNDVI
<dbl> <chr> <dbl> <dbl>
1 1 oak 1987 0.25
2 1 oak 1988 0.325
3 1 oak 1989 0.475
4 2 pine 1987 0.35
5 2 pine 1988 0.635
6 2 pine 1989 0.74
The classical base R approach using aggregate. The year can be obtained using substr.
res <- with(df, aggregate(list(meanNDVI=NDVI),
by=list(ID=ID, ForestType=ForestType, date=substr(Date, 1, 4)),
FUN=mean))
res[order(res$ID), ]
# ID ForestType date meanNDVI
# 1 1 oak 1987 0.2875
# 3 1 oak 1988 0.3575
# 5 1 oak 1989 0.5150
# 2 2 pine 1987 0.3575
# 4 2 pine 1988 0.6425
# 6 2 pine 1989 0.7725
Trimmed version
Trimmed for 25% outlyers.
res2 <- with(df, aggregate(list(meanNDVI=NDVI),
by=list(ID=ID, ForestType=ForestType, date=substr(Date, 1, 4)),
FUN=mean, trim=.25))
res2[order(res2$ID), ]
# ID ForestType date meanNDVI
# 1 1 oak 1987 0.250
# 3 1 oak 1988 0.325
# 5 1 oak 1989 0.475
# 2 2 pine 1987 0.350
# 4 2 pine 1988 0.635
# 6 2 pine 1989 0.740
Using data.table package, you could proceed as follows:
library(data.table)
setDT(df)[, Date := as.Date(Date, format = "%Y.%m.%d")][]
df[, .(meanNDVI = base::mean(NDVI, trim = 0.25)), by = .(ID, ForestType, year = year(Date))]
# ID ForestType year meanNDVI
# 1: 1 oak 1987 0.250
# 2: 1 oak 1988 0.325
# 3: 1 oak 1989 0.475
# 4: 2 pine 1987 0.350
# 5: 2 pine 1988 0.635
# 6: 2 pine 1989 0.740
Another option. You can set trim in mean
library(tidyverse)
library(lubridate)
df %>%
mutate(Date = ymd(Date) %>% year()) %>%
group_by(ID, ForestType, Date) %>%
summarise(mean = mean(NDVI, trim = 0.25, na.rm = T))

loop to run model on subset dataframe

I am not very experienced with loops so I am not sure where I went wrong here...
I have a dataframe that looks like:
month year day mean.temp mean.temp.year.month
1 1961 1 4.85 4.090323
1 1961 2 4.90 4.090323
1 1961 3 2.95 4.090323
1 1961 4 3.40 4.090323
1 1961 5 2.90 4.090323
dataset showing 3 months for 2 years can be found here:
https://drive.google.com/file/d/1w7NVeoEh8b7cAkU3cu1sXx6yCh75Inqg/view?usp=sharing
and I want to subset this dataframe by year and month so that I can run one nls model per year and month. Since my dataset contains 56 years (and each year has 12 months), that will give 672 models. Then I want to store the parameter estimates in a separate table.
I've created this code, but I can't work out why it is only giving me the parameter estimates for month 12 (all 56 years, but just month 12):
table <- matrix(99999, nrow=672, ncol=4)
YEARMONTHsel <- unique(df_weather[c("year", "month")])
YEARsel <- unique(df_weather$year)
MONTHsel <- unique(df_weather$month)
for (i in 1:length(YEARsel)) {
for (j in 1:length(MONTHsel)) {
temp2 <- df_weather[df_weather$year==YEARsel[i] & df_weather$month==MONTHsel[j],]
mn <- nls(mean.temp~mean.temp.year.month+alpha*sin(day*pi*2/30+phi),
data = temp2, control=nlc,
start=list(alpha=-6.07043, phi = -10))
cr <- as.vector(coef(mn))
nv <-length(coef(mn))
table[i,1:nv] <- cr
table[i,nv+1]<- YEARsel[i]
table[i,nv+2]<- MONTHsel[j]
}
}
I've tried several options (i.e. without using nested loop) but I'm not getting anywhere.
Any help would be greatly appreciated!Thanks.
Based on your loop, it looks like you want to run the regression grouped by year and month and then extract the coefficients in a new dataframe (correct me if thats wrong)
library(readxl)
library(tidyverse)
df <- read_excel("~/Downloads/df_weather.xlsx")
df %>% nest(-month, -year) %>%
mutate(model = map(data, ~nls(mean.temp~mean.temp.year.month+alpha*sin(day*pi*2/30+phi),
data = .x, control= "nlc",
start=list(alpha=-6.07043, phi = -10))),
coeff = map(model, ~coefficients(.x))) %>%
unnest(coeff %>% map(broom::tidy)) %>%
spread(names, x) %>%
arrange(year)
#> # A tibble: 6 x 4
#> month year alpha phi
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1961 0.561 -10.8
#> 2 2 1961 -1.50 -10.5
#> 3 3 1961 -2.06 -9.77
#> 4 1 1962 -3.35 -5.48
#> 5 2 1962 -2.27 -9.97
#> 6 3 1962 0.959 -10.8
First we nest the data based on your groups (in this case year and month), then we map the model for each group, then we map the coefficients for each group, lastly we unnest the coefficients and spread the data from long to wide.

Resources