R: using customised function in dplyr

Sample data:
library(tidyverse)
set.seed(123)
dat <- tibble(
year = rep(1980:2015, each = 100),
day = rep(200:299, times = 36),
rain = sample(0:17, size = 100*36,replace = T),
PETc = sample(rnorm(100*36)),
ini.t = rep(10:45, each = 100 ))
I have a function that operates on a data frame:
my.func <- function(df, initial, thres, upper.limit){
df$paw <- rep(NA, nrow(df))
df$aetc <- rep(NA, nrow(df))
df$sw <- rep(NA, nrow(df))
for(n in 1:nrow(df)){
df$paw[n] <- df$rain[n] + initial
df$aetc[n] <- ifelse(df$paw[n] >= thres, df$PETc[n], (df$paw[n]/thres) * df$PETc[n])
df$aetc[n] <- ifelse(df$aetc[n] > df$paw[n], df$paw[n], df$aetc[n])
df$sw[n] <- initial + df$rain[n] - df$aetc[n]
df$sw[n] <- ifelse(df$sw[n] > upper.limit,upper.limit,ifelse(df$sw[n] < 0, 0,df$sw[n]))
initial <- df$sw[n]
}
return(df)
}
thres <- 110
upper.limit <- 200
Applying the above function for a single year:
dat.1980 <- dat[dat$year == 1980,]
my.func(dat.1980, initial = dat.1980$ini.t[1], thres, upper.limit)
How do I apply this function to each year? I thought of using dplyr:
dat %>% group_by(year) %>% # run my function on each year
Also, since there are 36 years, 36 data frames will be returned. How do I bind these data frames row-wise?

You were on the right track. do() lets you run a function on each group and binds the results row-wise.
dat %>%
group_by(year) %>%
do(my.func(., initial = head(.$ini.t, 1), thres, upper.limit))
# Groups: year [36]
# year day rain PETc ini.t paw aetc sw
# <int> <int> <int> <dbl> <int> <dbl> <dbl> <dbl>
# 1 1980 200 5 0.968 10 15.0 0.132 14.9
# 2 1980 201 14 0.413 10 28.9 0.108 28.8
# 3 1980 202 7 -0.912 10 35.8 -0.296 36.1
# 4 1980 203 15 -0.337 10 51.1 -0.156 51.2
# 5 1980 204 16 0.412 10 67.2 0.252 67.0
# 6 1980 205 0 -0.923 10 67.0 -0.562 67.5
# 7 1980 206 9 1.17 10 76.5 0.813 75.7
# 8 1980 207 16 0.0542 10 91.7 0.0452 91.7
# 9 1980 208 9 -0.293 10 101 -0.268 101
# 10 1980 209 8 0.0788 10 109 0.0781 109
# ... with 3,590 more rows
purrr::map functions are the method du jour, but I think in this case it's a stylistic choice.
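Note that do() is superseded in dplyr 1.0.0 and later, and group_modify() is one of the suggested replacements. A minimal sketch of the same per-group pattern, using a toy running_balance() helper and a toy data set (both are hypothetical stand-ins for my.func and dat, not part of the question):

```r
library(dplyr)

# Hypothetical stand-in for my.func: a running balance within one group
running_balance <- function(df, initial) {
  df$balance <- initial + cumsum(df$rain)
  df
}

# Toy data: two years, three days each (made up for illustration)
toy <- tibble(
  year  = rep(c(1980, 1981), each = 3),
  rain  = c(1, 2, 3, 4, 5, 6),
  ini.t = rep(c(10, 20), each = 3)
)

res <- toy %>%
  group_by(year) %>%
  # .x is the data for one group, without the grouping column
  group_modify(~ running_balance(.x, initial = .x$ini.t[1])) %>%
  ungroup()

res$balance
# 11 13 16 24 29 35
```

group_modify() adds the grouping column back and row-binds the per-group results, so no separate binding step is needed.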

We can split by 'year' and then use map_df to apply my.func to each of the split datasets in the list and row-bind the results:
library(purrr)
dat %>%
split(.$year) %>%
map_df(~my.func(.x, initial = .x$ini.t[1], thres, upper.limit))

Related

R: how to average every 7th row

I want to take the average of each column (except the date) after every seven rows. I tried the approach below, but I was getting incorrect values. This method also seems really long. Is there a way to shorten it?
bankamerica = read.csv('https://raw.githubusercontent.com/bandcar/Examples/main/bankamerica.csv')
library(tidyverse)
GroupLabels <- 0:(nrow(bankamerica) - 1)%/% 7
bankamerica$Group <- GroupLabels
Avgs <- bankamerica %>%
group_by(bankamerica$Group) %>%
summarize(Avg = mean(bankamerica$tr))
EDITED: Just realized this code provides the incorrect values
I think you're on the right path.
bankamerica %>%
mutate(group = cumsum(row_number() %% 7 == 1)) %>%
group_by(group) %>%
summarise(caldate = first(caldate), across(-caldate, mean)) %>%
select(-group)
# A tibble: 144 × 3
# caldate tr var
# <chr> <dbl> <dbl>
# 1 1/2/01 28.9 -50.6
# 2 1/11/01 23.6 -45.4
# 3 1/23/01 20.9 -45
# 4 2/1/01 17.4 -48
# 5 2/12/01 14.4 -48
# 6 2/21/01 17 -48.9
# 7 3/2/01 19.1 -56
# 8 3/13/01 19.4 -56.9
# 9 3/22/01 23.3 -55.7
#10 4/2/01 7.71 -58.3
This averages every 7 rows, not every 7 days, because there are missing days in the data.
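As a side note on why the code in the question gave wrong values: referring to a column as bankamerica$tr inside summarise() bypasses the grouping, so every group gets the mean of the whole column. A small sketch on made-up data (the toy tibble below is an assumption for illustration, not the bankamerica file):

```r
library(dplyr)

# Toy data: 10 rows, so groups of 7 give one full and one partial group
toy <- tibble(caldate = paste0("d", 1:10), tr = 1:10)

# Grouped correctly: bare column names are evaluated per group
res <- toy %>%
  group_by(grp = (row_number() - 1) %/% 7) %>%
  summarise(caldate = first(caldate), tr = mean(tr))
res$tr
# 4 9

# Grouping ignored: toy$tr is the full column in every group
wrong <- toy %>%
  group_by(grp = (row_number() - 1) %/% 7) %>%
  summarise(tr = mean(toy$tr))
wrong$tr
# 5.5 5.5
```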

Is there any function that gives the changes between columns?

I have a df that looks like this.
head(dfhigh)
rownames 2015Y 2016Y 2017Y 2018Y 2019Y 2020Y 2021Y
1 Australia 29583.7403 48397.383 45220.323 68461.941 39218.044 20140.351 29773.188
2 Austria* 1294.5092 -8400.973 14926.164 5511.625 2912.795 -14962.963 5855.014
3 Belgium* -24013.3111 68177.596 -3057.153 27119.084 -9208.553 13881.481 22955.298
4 Canada 43852.7732 36061.859 22764.156 37653.521 50141.784 23174.006 59693.992
5 Chile* 20507.8407 12249.294 6128.716 7735.778 12499.238 8385.907 15251.538
6 Czech Republic 465.2137 9814.496 9517.948 11010.423 10108.914 9410.576 5805.084
I want to calculate the changes between years, so instead of the values, the table has the percentage of change (obviously deleting 2015Y).
Try this, using (current - previous) / previous * 100:
lst <- list()
nm <- names(dfhigh)[-1]
for(i in 1:(length(nm) - 1)){
lst[[i]] <- (dfhigh[[nm[i+1]]] - dfhigh[[nm[i]]]) / dfhigh[[nm[i]]] * 100
}
ans <- do.call(cbind , lst)
colnames(ans) <- paste("ch_of" , nm[-1])
ans
You can change the formula to calculate the percentage however you want.
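The same formula can also be applied without a loop by treating the year columns as a matrix. A minimal sketch on a made-up two-row data frame (toy and its values are assumptions chosen so the result is easy to check by hand):

```r
# Toy data in the same shape as dfhigh: one label column, then year columns
toy <- data.frame(
  rownames = c("A", "B"),
  `2015Y` = c(100, 200),
  `2016Y` = c(110, 150),
  `2017Y` = c(121, 300),
  check.names = FALSE
)

vals <- as.matrix(toy[-1])                 # numeric year columns only
prev <- vals[, -ncol(vals), drop = FALSE]  # every year except the last
curr <- vals[, -1, drop = FALSE]           # every year except the first

pct <- (curr - prev) / prev * 100          # (current - previous) / previous * 100
colnames(pct) <- paste("ch_of", colnames(curr))
pct
# row A: 10, 10; row B: -25, 100
```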
You could also use a tidyverse solution.
library(tidyverse)
dfhigh %>%
pivot_longer(!rownames) %>%
group_by(rownames) %>%
mutate(value = 100*value/lag(value)-100) %>%
ungroup() %>%
pivot_wider(names_from = name, values_from = value)
# # A tibble: 6 × 8
# rownames `2015Y` `2016Y` `2017Y` `2018Y` `2019Y` `2020Y` `2021Y`
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 Australia NA 63.6 -6.56 51.4 -42.7 -48.6 47.8
# 2 Austria* NA -749. -278. -63.1 -47.2 -614. -139.
# 3 Belgium* NA -384. -104. -987. -134. -251. 65.4
# 4 Canada NA -17.8 -36.9 65.4 33.2 -53.8 158.
# 5 Chile* NA -40.3 -50.0 26.2 61.6 -32.9 81.9
# 6 CzechRepublic NA 2010. -3.02 15.7 -8.19 -6.91 -38.3

Compare one group to the rest of the groups as a whole in R

Here is some sample data:
movie_df <- data.frame("ID" = c(1,2,3,4,5,6,7,8,9,10),
"movie_type" = c("Action", "Horror", "Comedy", "Thriller", "Comedy",
"Action","Thriller", "Horror", "Action", "Comedy"),
"snack_type" = c("Chocolate", "Popcorn", "Candy", "Popcorn", "Popcorn",
"Candy","Chocolate", "Candy", "Popcorn", "Chocolate"),
"event_type" = c("Solo", "Family", "Date", "Friends", "Solo",
"Family","Date", "Date", "Friends", "Friends"),
"total_cost" = c(50, 35, 20, 50, 30,
60, 25, 35, 20, 50))
What I want to do is go through each column and compare each group to the rest of the groups on total_cost. For example, I want to see how movie_type == 'Action' compares to movie_type != 'Action' for total_cost. I want to do that for every type in movie_type then every type in snack_type and event_type.
What I ultimately want to get to is this where sd = Standard Deviation. Ideally this will be done by a tidyverse method in R (e.g. dplyr or tidyr):
> results_df
# A tibble: 11 x 11
Group Grp_1 Grp_2 Grp_1_mean Grp_2_mean Grp_1_sd Grp_2_sd Grp_1_n Grp_2_n Mean_Diff `t-test`
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 movie_type Action Rest of group 43.3 35 20.8 11.5 3 7 8.33 2.84
2 movie_type Horror Rest of group 35 38.1 0 16.0 2 8 -3.12 -2.21
3 movie_type Thriller Rest of group 37.5 37.5 17.7 14.6 2 8 0 0
4 movie_type Comedy Rest of group 33.3 39.3 15.3 14.6 3 7 -5.95 -2.22
5 snack_type Chocolate Rest of group 41.7 35.7 14.4 14.8 3 7 5.95 2.26
6 snack_type Candy Rest of group 38.3 37.1 20.2 12.9 3 7 1.19 0.407
7 snack_type Popcorn Rest of group 33.8 40 12.5 15.8 4 6 -6.25 -2.60
8 event_type Date Rest of group 26.7 42.1 7.64 14.1 3 7 -15.5 -7.25
9 event_type Family Rest of group 47.5 35 17.7 13.4 2 8 12.5 3.86
10 event_type Friends Rest of group 40 36.4 17.3 14.1 3 7 3.57 1.28
11 event_type Solo Rest of group 40 36.9 14.1 15.1 2 8 3.12 1.04
It's the same logic as Daniel's answer, using purrr::map and purrr::map2.
library(dplyr)
library(tibble)
library(purrr)
library(stringr)
needed_cols <- c("movie_type", "snack_type", "event_type")
new_names <- 1:2 %>%
map(~str_c(c("group", "mean", "sd", "n"), "_", .x)) %>%
unlist()
my_data <- needed_cols %>%
map(function(df_c)
map(unique(movie_df[[df_c]]),
function(v){
df <- movie_df %>%
mutate(group = ifelse(get(df_c) == v, v, "rest_of_group")) %>%
group_by(group) %>%
summarize(mean = mean(total_cost), sd = sd(total_cost), n = n()) %>%
.[match(.$group, c(v, "rest_of_group")),]
df <- bind_cols(df[1, ], df[2,])
names(df) <- new_names
df
}
)
) %>%
map2(needed_cols, ~bind_rows(.x) %>% mutate(group = .y)) %>%
bind_rows() %>%
select(
str_subset(names(.), "group") %>% sort(),
str_subset(names(.), "mean"),
str_subset(names(.), "sd"),
str_subset(names(.), "n")
) %>%
mutate(mean_diff = mean_1 - mean_2)
Sorry it's not in pipes, but in base R we can do:
results_df <- do.call(rbind,unlist(
apply(movie_df[,2:4],2,function(u)
lapply(unique(u), function(x)
data.frame(
group1 = as.character(x),
group2 = "rest",
grp1_mean = mean(movie_df$total_cost[u == x]),
grp2_mean = mean(movie_df$total_cost[u != x]),
grp1_sd = sd(movie_df$total_cost[u == x]),
grp2_sd = sd(movie_df$total_cost[u != x])
)
)
),recursive=F)
)
#add mean differences
results_df$meandiff <- with(results_df, grp1_mean - grp2_mean)
> results_df
group1 group2 grp1_mean grp2_mean grp1_sd grp2_sd meandiff
movie_type1 Action rest 43.33333 35.00000 20.816660 11.54701 8.333333
movie_type2 Horror rest 35.00000 38.12500 0.000000 16.02175 -3.125000
movie_type3 Comedy rest 33.33333 39.28571 15.275252 14.55695 -5.952381
movie_type4 Thriller rest 37.50000 37.50000 17.677670 14.63850 0.000000
snack_type1 Chocolate rest 41.66667 35.71429 14.433757 14.84042 5.952381
snack_type2 Popcorn rest 33.75000 40.00000 12.500000 15.81139 -6.250000
snack_type3 Candy rest 38.33333 37.14286 20.207259 12.86375 1.190476
event_type1 Solo rest 40.00000 36.87500 14.142136 15.10381 3.125000
event_type2 Family rest 47.50000 35.00000 17.677670 13.36306 12.500000
event_type3 Date rest 26.66667 42.14286 7.637626 14.09998 -15.476190
event_type4 Friends rest 40.00000 36.42857 17.320508 14.05770 3.571429
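Neither answer above computes the t-test column from the question. One way to sketch it is with stats::t.test(), which by default runs Welch's unequal-variance test. Whether that is the test the asker had in mind is an assumption, so the statistic need not match the values in the question's table. The helper compare_to_rest() is a hypothetical name, and only the two columns needed are reproduced here:

```r
movie_df <- data.frame(
  movie_type = c("Action", "Horror", "Comedy", "Thriller", "Comedy",
                 "Action", "Thriller", "Horror", "Action", "Comedy"),
  total_cost = c(50, 35, 20, 50, 30, 60, 25, 35, 20, 50)
)

# Hypothetical helper: compare one level of a column against all other rows
compare_to_rest <- function(df, col, level) {
  in_grp <- df[[col]] == level
  tt <- t.test(df$total_cost[in_grp], df$total_cost[!in_grp])  # Welch by default
  data.frame(
    Group     = col,
    Grp_1     = level,
    Grp_2     = "Rest of group",
    Mean_Diff = unname(tt$estimate[1] - tt$estimate[2]),
    t_stat    = unname(tt$statistic),
    p_value   = tt$p.value
  )
}

res <- do.call(rbind, lapply(unique(movie_df$movie_type),
                             function(v) compare_to_rest(movie_df, "movie_type", v)))
res
```

The same helper can be mapped over snack_type and event_type once those columns are included in the data frame.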

Why does R throw an error on iterative calculation

I'm looking at covid-19 data to calculate estimates for the reproductive number R0.
library(ggplot2)
library(dplyr)
library(tidyr)
library(stringr)
library(TTR)
# Get COVID cases, available from:
url <- "https://static.usafacts.org/public/data/covid-19/covid_confirmed_usafacts.csv"
DoubleCOV <- read.csv(url, stringsAsFactors = FALSE)
names(DoubleCOV)[1] <- "countyFIPS"
DoubleCovid <- pivot_longer(DoubleCOV, cols=starts_with("X"),
values_to="cases",
names_to=c("X","date_infected"),
names_sep="X") %>%
mutate(infected = as.Date(date_infected, format="%m.%d.%y"),
countyFIPS = str_pad(as.character(countyFIPS), 5, pad="0"))
#data is by county, summarise for the state of interest
stateData <- DoubleCovid %>% filter(State == "AL") %>% filter(cases != 0) %>%
group_by(infected) %>% summarise(sum(cases)) %>%
mutate(DaysSince = infected - min(infected))
names(stateData)[2] <- "cumCases"
#3 day moving average to smooth a little
stateData <- stateData %>% mutate(MA = runMean(cumCases,3))
#calculate doubling rate (DR) and then R0 infectious period/doubling rate
for(j in 4:nrow(stateData)){
stateData$DR[j] <- log(2)/log(stateData$MA[j]/stateData$MA[j-1])
stateData$R0[j] <- 14/stateData$DR[j]
}
CDplot <- stateData %>%
ggplot(mapping = aes(x = as.numeric(DaysSince), y = R0)) +
geom_line(color = "firebrick")
print(CDplot)
So in the above the state of interest is Alabama, hence filter(State == "AL") and this works.
But if I change the state to "NY" I get
Error in `$<-.data.frame`(`*tmp*`, "DR", value = c(NA, NA, NA, 0.733907206043719 :
replacement has 4 rows, data has 39
head(stateData) yields
infected cumCases DaysSince MA
<date> <int> <drtn> <dbl>
1 2020-03-02 1 0 days NA
2 2020-03-03 2 1 days NA
3 2020-03-04 11 2 days 4.67
4 2020-03-05 23 3 days 12
5 2020-03-06 25 4 days 19.7
6 2020-03-07 77 5 days 41.7
The moving average values in rows 3 and 4 (12 and 4.67) would yield a doubling rate of 0.734 which aligns with the value in the error message value = c(NA, NA, NA, 0.733907206043719 but why does it throw an error after that?
Bonus question: I know loops are frowned upon in R...is there a way to get the moving average and R0 calculation without one?
You have to initialise the new variables before you can assign to them using the j index. Because DR does not exist yet, the first assignment (j = 4) creates a vector of length 4, which R then tries to recycle across the whole column. Alabama has 28 rows (divisible by 4), so recycling succeeds and you only get warnings about uninitialised columns. New York, however, has 39 rows, which is not divisible by 4, so recycling fails and R returns an error. Warnings are occasionally harmless, but ignoring them is not a good idea.
Try this to see what R (you) is trying to do:
stateData[4]
You should get all rows of the 4th column, not the 4th row.
Solution: initialise your DR and R0 columns first.
stateData$DR <- NA
stateData$R0 <- NA
for(j in 4:nrow(stateData)){
stateData$DR[j] <- log(2)/log(stateData$MA[j]/stateData$MA[j-1])
stateData$R0[j] <- 14/stateData$DR[j]
}
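The recycling behaviour behind the error can be reproduced on a plain base R data.frame. This is a minimal sketch with made-up data. A tibble additionally emits the uninitialised-column warning, and recent tibble versions may behave differently from the data.frame method shown in the error message.

```r
# 8 rows: the length-4 vector c(NA, NA, NA, 1) is silently recycled twice
ok <- data.frame(x = 1:8)
ok$y[4] <- 1
ok$y
# NA NA NA  1 NA NA NA  1

# 7 rows: 7 is not a multiple of 4, so the assignment fails
bad <- data.frame(x = 1:7)
msg <- tryCatch({
  bad$y[4] <- 1
  "no error"
}, error = function(e) conditionMessage(e))
msg
# "replacement has 4 rows, data has 7"
```

Note that even in the case without an error, the recycled column holds a stray value in row 8, which is another reason to initialise the column first.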
For the bonus question, you can use lag in the same mutate with MA:
stateData <- stateData %>% mutate(MA = runMean(cumCases,3),
DR = log(2)/log(MA/lag(MA)),
R0 = 14 / DR)
stateData
# A tibble: 28 x 6
infected cumCases DaysSince MA DR R0
<date> <int> <drtn> <dbl> <dbl> <dbl>
1 2020-03-13 5 0 days NA NA NA
2 2020-03-14 11 1 days NA NA NA
3 2020-03-15 22 2 days 12.7 NA NA
4 2020-03-16 29 3 days 20.7 1.42 9.89
5 2020-03-17 39 4 days 30 1.86 7.53
6 2020-03-18 51 5 days 39.7 2.48 5.64
7 2020-03-19 78 6 days 56 2.01 6.96
8 2020-03-20 106 7 days 78.3 2.07 6.78
9 2020-03-21 131 8 days 105 2.37 5.92
10 2020-03-22 167 9 days 135. 2.79 5.03
# ... with 18 more rows
I'm using Alabama's data.
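As a quick sanity check of the formula, here is a sketch on a made-up series that doubles every day (the values are an assumption for illustration, not real case data). The doubling time should come out as 1 day, and R0 as 14 / 1 = 14:

```r
library(dplyr)

# Toy moving-average series that doubles at every step
toy <- tibble(MA = c(1, 2, 4, 8))

out <- toy %>%
  mutate(DR = log(2) / log(MA / dplyr::lag(MA)),  # doubling time in days
         R0 = 14 / DR)                            # 14-day infectious period, as in the question

out$DR
# NA 1 1 1
out$R0
# NA 14 14 14
```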

Subsetting data set to only retain the mean

Please see attached image of dataset.
What are the different ways to only retain a single value for each 'Month'? I've got a bunch of data points and would only need to retain, say, the mean value.
Many thanks
A different way of using the aggregate() function.
> aggregate(Temp ~ Month, data=airquality, FUN = mean)
Month Temp
1 5 65.54839
2 6 79.10000
3 7 83.90323
4 8 83.96774
5 9 76.90000
library(tidyverse)
library(lubridate)
#example data from airquality:
aq <- as_tibble(airquality)  # as_data_frame() is deprecated in favour of as_tibble()
aq$mydate<-lubridate::ymd(paste0(2018, "-", aq$Month, "-", aq$Day))
> aq
# A tibble: 153 x 7
Ozone Solar.R Wind Temp Month Day mydate
<int> <int> <dbl> <int> <int> <int> <date>
1 41 190 7.40 67 5 1 2018-05-01
2 36 118 8.00 72 5 2 2018-05-02
3 12 149 12.6 74 5 3 2018-05-03
aq %>%
group_by("Month" = month(mydate)) %>%
summarize("Mean_Temp" = mean(Temp, na.rm=TRUE))
summarize() can return several summaries at once:
aq %>%
group_by("Month" = month(mydate)) %>%
summarize("Mean_Temp" = mean(Temp, na.rm=TRUE),
"Num" = n(),
"SD" = sd(Temp, na.rm=TRUE))
# A tibble: 5 x 4
Month Mean_Temp Num SD
<dbl> <dbl> <int> <dbl>
1 5.00 65.5 31 6.85
2 6.00 79.1 30 6.60
3 7.00 83.9 31 4.32
4 8.00 84.0 31 6.59
5 9.00 76.9 30 8.36
Lubridate Cheatsheet
A data.table answer:
# load libraries
library(data.table)
library(lubridate)
setDT(dt)
dt[, .(meanValue = mean(value, na.rm =TRUE)), by = .(monthDate = floor_date(dates, "month"))]
Where dt has at least columns value and dates.
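For a version that runs as-is, here is a sketch that uses the built-in airquality data in place of dt. The year 2018 is an arbitrary assumption (airquality has no year column), and Temp plays the role of value:

```r
library(data.table)
library(lubridate)

dt <- as.data.table(airquality)

# Build a Date column from Month and Day; the year is an arbitrary choice
dt[, dates := as.Date(sprintf("2018-%02d-%02d", Month, Day))]

# One mean per calendar month
res <- dt[, .(meanValue = mean(Temp, na.rm = TRUE)),
          by = .(monthDate = floor_date(dates, "month"))]
res
```

The result has one row per month, and the means agree with the aggregate() output shown earlier (65.54839 for May).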
We can group by the index of dataset, use that in aggregate (from base R) to get the mean
aggregate(dat, index(dat), FUN = mean)
NB: Here, we assumed that the dataset is in xts or zoo format. If the dataset has a month column, then use
aggregate(dat, list(dat$Month), FUN = mean)
