Calculating repeated means from columns using R

This is hopefully a simple question about loops in R. I have a dataset that is made up of results from a simulation. Each column contains the results from a single cow, taken each day for a month, then repeated 100 times, so the total length of the column is 3000.
I would like to calculate the mean of the simulated results for each day, to get a single value for each day, for each cow. So I need to calculate the mean of the first entry, the 31st entry, the 61st entry and so on, and then the mean of the second entry, the 32nd entry, the 62nd entry and so on. I would like to end up with a 30-entry column for each cow. I have been trying to do it using a loop in R but can't work out how. Any advice would be greatly appreciated.
Here is some example data:
a <- seq(from = 1, by = 1, length = 30)
b <- seq(from = 1, by = 0.5, length = 30)
c <- seq(from = 1, by = 2, length = 30)
cow1 <- rep(a, 100)
cow2 <- rep(b, 100)
cow3 <- rep(c, 100)
dat <- as.data.frame(cbind(cow1, cow2, cow3))
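For reference, one loop-free way to get the desired 30 values per cow in base R is to fold each column into a 30 x 100 matrix and average across rows; a sketch, assuming every simulation run is exactly 30 days as above:
# Each column repeats a 30-day pattern 100 times, so matrix() stacks the
# runs side by side and rowMeans() averages the 100 values for each day.
day_means <- sapply(dat, function(col) rowMeans(matrix(col, nrow = 30)))
head(day_means) # 30 rows (days) x 3 columns (cows)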

I think it is better to construct a "day" column and then use it with tapply. As Xi'an said, there is no need for a loop; a loop would be slower and less clean. In code this gives us:
a <- seq(from = 1, by = 1, length = 30)
b <- seq(from = 1, by = 0.5, length = 30)
c <- seq(from = 1, by = 2, length = 30)
day <- seq(from = 1, by = 1, length = 30)
day <- rep(day,100)
cow1 <- rep(a,100)
cow2 <- rep(b,100)
cow3 <- rep(c,100)
# Construct a data frame; I find this way is better as it gives names to the columns.
dat <- data.frame(day,cow1,cow2,cow3)
# Here are the results
tapply(dat$cow1, dat$day, mean)
tapply(dat$cow2, dat$day, mean)
tapply(dat$cow3, dat$day, mean)
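If you want the results for all cows at once, the same tapply can be applied over every cow column; a small sketch building on the data frame above:
# One row per day, one column per cow (a 30 x 3 matrix).
sapply(dat[, -1], function(x) tapply(x, dat$day, mean))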

I agree with TMat; including a day column is useful.
Here is my working example using the tidyverse:
library(tidyverse)
a <- seq(from = 1, by = 1, length = 30)
b <- seq(from = 1, by = 0.5, length = 30)
c <- seq(from = 1, by = 2, length = 30)
day <- seq(from = 1, by = 1, length = 30)
day <- rep(day,100)
cow1 <- rep(a,100)
cow2 <- rep(b,100)
cow3 <- rep(c,100)
dat <- data.frame(day, cow1, cow2, cow3) %>%
  pivot_longer(cols = 2:4) %>%
  group_by(day, name) %>%
  summarize(mean = mean(value))
#> `summarise()` regrouping output by 'day' (override with `.groups` argument)
dat
#> # A tibble: 90 x 3
#> # Groups:   day [30]
#>      day name   mean
#>    <dbl> <chr> <dbl>
#>  1     1 cow1    1
#>  2     1 cow2    1
#>  3     1 cow3    1
#>  4     2 cow1    2
#>  5     2 cow2    1.5
#>  6     2 cow3    3
#>  7     3 cow1    3
#>  8     3 cow2    2
#>  9     3 cow3    5
#> 10     4 cow1    4
#> # ... with 80 more rows
ggplot(dat, aes(x = day, y = mean, fill = name)) +
geom_col(position = "dodge")
Created on 2020-07-08 by the reprex package (v0.3.0)
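If you would rather end up with one 30-row column per cow, as described in the question, the summarised long data can be widened again; a sketch using tidyr's pivot_wider:
dat %>%
  pivot_wider(names_from = name, values_from = mean)
# a tibble with columns day, cow1, cow2, cow3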

Related

Multiply a set of columns in an R data frame by another set of columns

I have a data.frame with 156 variables and I would like to multiply a subset of those variables by another subset of the 156. How can I do that? The variables are of the following form:
ID || quantity_1 || quantity_2 || ... || quantity_156 || priceperunit_q1 || ... || priceperunit_q156
Essentially I would like to multiply each quantity by its priceperunit.
Given that you first have all the quantity columns and then all the priceperunit columns in the correct order, you can make two different data sets and multiply them using mapply, like this. The result set will contain the product of each pair of columns; rename the columns if needed:
quantity_1 <- c(1, 2, 3)
quantity_2 <- c(1, 2, 3)
quantity_3 <- c(1, 2, 3)
quantity_40 <- c(1, 2, 3)
priceperunit_1 <- c(20, 20, 20)
priceperunit_2 <- c(30, 30, 30)
priceperunit_3 <- c(15, 15, 15)
priceperunit_40 <- c(1.25, 1.25, 1.65)
df <- data.frame(quantity_1, quantity_2, quantity_3, quantity_40,
                 priceperunit_1, priceperunit_2, priceperunit_3, priceperunit_40)
df
#>   quantity_1 quantity_2 quantity_3 quantity_40 priceperunit_1 priceperunit_2
#> 1          1          1          1           1             20             30
#> 2          2          2          2           2             20             30
#> 3          3          3          3           3             20             30
#>   priceperunit_3 priceperunit_40
#> 1             15            1.25
#> 2             15            1.25
#> 3             15            1.65
as.data.frame(mapply("*",
                     df[, grep("^quantity", names(df))],
                     df[, grep("^priceperunit", names(df))]))
#>   quantity_1 quantity_2 quantity_3 quantity_40
#> 1         20         30         15        1.25
#> 2         40         60         30        2.50
#> 3         60         90         45        4.95
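Note that mapply() relies on the two sets of columns appearing in matching order. If that is not guaranteed, one option is to align the pairs by their numeric suffix; a sketch, where the volume_* names are made up for illustration:
# Derive the priceperunit column names from the quantity column names,
# so each quantity is matched with the price that shares its suffix.
q_cols <- grep("^quantity_", names(df), value = TRUE)
p_cols <- sub("^quantity", "priceperunit", q_cols)
df[paste0("volume_", sub("^quantity_", "", q_cols))] <- df[q_cols] * df[p_cols]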
A simple solution would be to reshape your data from wide to long format; see ?reshape. This will also help you keep your original data format.
The trick is to store all your price data and all your quantity data in the same two columns; records are differentiated by the newly created time variable of reshape (1 to 156 according to your post).
You can then compute the volume (price x quantity) and go back to the wide format if desired.
See the reproducible example below.
df <- data.frame(price1 = 1:10,
                 price2 = 11:20,
                 quantity1 = c(1:5, 1:5),
                 quantity2 = c(5:9, 5:9))
df_long <- reshape(df,
                   direction = "long",
                   v.names = c("price", "quantity"),
                   varying = list(paste0("price", 1:2), paste0("quantity", 1:2)),
                   times = c("1", "2"))
df_long$volume <- df_long$price * df_long$quantity
df_wide <- reshape(df_long,
                   direction = "wide",
                   v.names = c("price", "quantity", "volume"),
                   timevar = "time",
                   sep = "")

rollapply how to "ignore" certain observations and use variable width

I am trying to calculate means for some data along a non-regular date sequence. For example, I have minute-level data for specific periods of time during the day, and I am interested in calculating 5-minute averages. However, I am not sure how the width parameter in rollapply works when it is specified as a list.
library(tidyverse)
library(zoo)
length = 16
set.seed(10)
dxf <- data.frame(
  date = seq(Sys.time(), by = "59 sec", length.out = length),
  value = runif(length)
)
# Create a "discontinuity"
dxf$date[8:length] <- dxf$date[8:length] + 3600*24
# Add some noise
dxf$date <- dxf$date + runif(length, 0, 1)
diff(dxf$date)
dxf %>%
  arrange(date) %>%
  mutate(
    diff = c(as.numeric(diff(date)), NA),
    mean = rollapply(value, width = 5, mean, partial = TRUE, align = "left")
  )
# This is what I need. Therefore, I need a variable width that adjusts to the discontinuity in the rows.
mean1 <- mean(dxf$value[1:5])
mean2 <- mean(dxf$value[2:6])
mean3 <- mean(dxf$value[3:7])
mean4 <- NA # Only have 4 values mean(dxf$value[4:7])
mean5 <- NA # Only have 3 values mean(dxf$value[5:7])
mean6 <- NA # Only have 2 values mean(dxf$value[6:7])
mean7 <- NA # Only have 1 value mean(dxf$value[7:7])
mean8 <- mean(dxf$value[7:11])
etc.
I think this is a tricky problem. Here is one approach:
1. Generate a 1-minute sequence from the first to the last datetime.
2. Interpolate so we have a value at each minute. This includes interpolating across the discontinuity.
3. Calculate the running 5-minute mean based on the 1-minute interpolated values.
4. Remove the values where the gap in the original datetime values is too large.
Also, take care with time zones; it is best to set these to some deliberately chosen value, or to UTC, which the lubridate functions use by default.
library(tidyverse)
library(RcppRoll)
library(lubridate)
dxf <- tibble(
  date = seq(from = ymd_hms('2019-08-14 09:06:05'), by = "59 sec", length.out = 30),
  value = runif(30)
)
dxf$date[15:30] <- dxf$date[15:30] + 3600*24 # discontinuity
dxf$date <- dxf$date + round(runif(30))      # noise
dxf <- dxf %>%
  mutate(date = ymd_hms(date),
         date_num = as.numeric(date),
         diff = date_num - lag(date_num))
discontinuity <- which(dxf$diff > 70)
n <- nrow(dxf)
# create a 1 min sequence
date_seq <- seq(from = dxf$date_num[1], to = dxf$date_num[n], by = 60)
# interpolate values onto the 1 min sequence
value_interp <- approx(x = dxf$date_num, y = dxf$value, xout = date_seq)
df <- tibble(
  date = as_datetime(date_seq),
  mean_value = RcppRoll::roll_mean(value_interp$y, n = 5, fill = NA, align = 'left'))
df %>%
  filter(date < dxf$date[discontinuity - 1] | date > dxf$date[discontinuity])
We could extract the date, group by it, and then use rollmean:
library(dplyr)
dxf %>%
  mutate(d1 = as.Date(date)) %>%
  group_by(d1) %>%
  mutate(mean = zoo::rollmean(value, 5, align = "left", fill = NA)) %>%
  ungroup %>%
  select(-d1)
#    date                 value   mean
#    <dttm>               <dbl>  <dbl>
#  1 2019-08-14 12:49:09 0.507   0.404
#  2 2019-08-14 12:50:08 0.307   0.347
#  3 2019-08-14 12:51:07 0.427   0.341
#  4 2019-08-14 12:52:07 0.693  NA
#  5 2019-08-14 12:53:06 0.0851 NA
#  6 2019-08-14 12:54:05 0.225  NA
#  7 2019-08-14 12:55:04 0.275  NA
#  8 2019-08-15 12:56:02 0.272   0.507
#  9 2019-08-15 12:57:01 0.616   0.476
# 10 2019-08-15 12:58:01 0.430   0.472
# 11 2019-08-15 12:59:00 0.652   0.457
# 12 2019-08-15 12:59:58 0.568   0.413
# 13 2019-08-15 13:00:58 0.114  NA
# 14 2019-08-15 13:01:56 0.596  NA
# 15 2019-08-15 13:02:56 0.358  NA
# 16 2019-08-15 13:03:54 0.429  NA
data
set.seed(10)
length <- 16
dxf <- data.frame(
  date = seq(Sys.time(), by = "59 sec", length.out = length),
  value = runif(length)
)
dxf$date[8:length] <- dxf$date[8:length] + 3600*24
dxf$date <- dxf$date + runif(length, 0, 1)
Here w[i] is the number of elements of date that are less than or equal to date[i] + 300, minus (i - 1); note that 300 refers to 300 seconds (5 minutes).
date <- dxf$date
w <- findInterval(date + 300, date) - seq_along(date) + 1
rollapply(dxf$value, w, mean, align = "left") * ifelse(w < 5, NA, 1)
# same
sapply(seq_along(w), function(i) mean(dxf$value[seq(i, length = w[i])])) *
ifelse(w < 5, NA, 1)

Replacing NA's with LOCF using Sparklyr

My aim is to replace NA's in a Spark data frame using the Last Observation Carried Forward (LOCF) method. I wrote the following code, and it works. However, it seems to take longer than expected for a larger dataset.
It would be great if someone can recommend a better approach or improve the code.
Example and Code with Sparklyr
In the following example, NA's are replaced after ordering by time and grouping by grp.
df_with_nas <- data.frame(time = seq(as.Date('2001/01/01'),
                                     as.Date('2010/01/01'),
                                     length.out = 10),
                          grp = c(rep(1, 5), rep(2, 5)),
                          v1 = c(1, rep(NA, 3), 5, rep(NA, 5)),
                          v2 = c(NA, NA, 3, rep(NA, 4), 3, NA, NA))
tbl <- copy_to(sc, df_with_nas, overwrite = TRUE)
tbl %>%
  spark_apply(function(df) {
    library(dplyr)
    # Last observation carried forward: index the non-NA values by the
    # running count of non-NAs seen so far (a count of 0 keeps a leading NA).
    na_locf <- function(x) {
      v <- !is.na(x)
      c(NA, x[v])[cumsum(v) + 1]
    }
    df %>% arrange(time) %>% group_by(grp) %>%
      mutate_at(vars(-v1, -grp), funs(na_locf(.)))
  })
# # Source: spark<?> [?? x 4]
#     time   grp    v1    v2
#    <dbl> <dbl> <dbl> <dbl>
#  1 11323     1     1   NaN
#  2 11688.    1   NaN   NaN
#  3 12053.    1   NaN     3
#  4 12419.    1   NaN     3
#  5 12784.    1     5     3
#  6 13149.    2   NaN   NaN
#  7 13514.    2   NaN   NaN
#  8 13880.    2   NaN     3
#  9 14245.    2   NaN     3
# 10 14610     2   NaN     3
data.table
The following approach with data.table works quite fast for the data I have. I expect the size of the data to increase soon, and then I may have to rely on sparklyr.
library(data.table)
setDT(df_with_nas)
df_with_nas <- df_with_nas[order(time)]
cols <- c("v1", "v2")
df_with_nas[, (cols) := zoo::na.locf(.SD, na.rm = FALSE),
by = grp, .SDcols = cols]
I did this sort of loop; it is quite slow...
df_with_nas <- df_with_nas %>% mutate(row = 1:nrow(df_with_nas))
for (n in 1:50) {
  df_with_nas <- df_with_nas %>%
    arrange(row) %>%
    mutate_all(~ if_else(is.na(.), lag(., 1), .))
}
Run it until no NA's remain; then
collect(df_with_nas)
will execute the plan.
You can leverage the spark_apply() function and run the na.locf function on each of your cluster nodes.
Install the R runtime on each of your cluster nodes.
Install the zoo R package on each node as well.
Then run spark_apply this way:
data_filled <- spark_apply(data_with_holes, function(df) zoo::na.locf(df))
You can do this quite quickly using SQL, with the added benefit that you can easily apply LOCF on a grouped basis. The pattern you want to use is LAST_VALUE(column, true) OVER (window) - this searches over the window for the most recent column value which is not NA (passing "true" to LAST_VALUE tells it to ignore NAs). Since you want to look backwards from the current value, the window should be
ORDER BY time
ROWS BETWEEN UNBOUNDED PRECEDING AND -1 FOLLOWING
Of course, if the first value in the group is NA it will remain NA.
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
test_table <- data.frame(
  v1 = c(1, 2, NA, 3, NA, 5, NA, 6, NA),
  v2 = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
  time = c(1, 2, 3, 4, 5, 2, 1, 3, 4)
) %>%
  sdf_copy_to(sc, ., "test_table")
spark_session(sc) %>%
  sparklyr::invoke("sql", "SELECT *, LAST_VALUE(v1, true)
                   OVER (PARTITION BY v2
                         ORDER BY time
                         ROWS BETWEEN UNBOUNDED PRECEDING AND -1 FOLLOWING)
                   AS last_non_na
                   FROM test_table") %>%
  sdf_register() %>%
  mutate(v1 = ifelse(is.na(v1), last_non_na, v1))
#> # Source: spark<?> [?? x 4]
#>       v1    v2  time last_non_na
#>    <dbl> <dbl> <dbl>       <dbl>
#>  1     1     1     1         NaN
#>  2     2     1     2           1
#>  3     2     1     3           2
#>  4     3     1     4           2
#>  5     3     1     5           3
#>  6   NaN     2     1         NaN
#>  7     5     2     2         NaN
#>  8     6     2     3           5
#>  9     6     2     4           6
Created on 2019-08-27 by the reprex package (v0.3.0)

Create a new tibble using the previous row value of a column as a parameter of the current row value

I want to manually create a tibble where one column's values are calculated from the previous value of the same column.
For example:
tibble(
  x = 1:5,
  y = x + lag(y, default = 0)
)
I expect the following result:
# A tibble: 5 x 2
#       x     y
#   <int> <dbl>
# 1     1     1
# 2     2     3
# 3     3     6
# 4     4    10
# 5     5    15
But I obtain the error:
Error in lag(y, default = 0) : object 'y' not found
Update, with a more realistic example:
tibble(
  years = 1:5,
  salary = 20000 * (1.01) ^ lag(years, default = 0),
  qta = salary * 0.06
) %>%
  mutate(
    total = ifelse(row_number() == 1,
                   (qta + 50000) * (1.02),
                   (qta + lag(total, default = 0)) * (1.02))
  )
In this example I have a tibble, and I want to add a column 'total' that is defined depending on its previous value, but the lag(total, default = 0) doesn't work.
We can use accumulate
library(tidyverse)
tibble(x = 1:5, y = accumulate(x, `+`))
# A tibble: 5 x 2
#       x     y
#   <int> <int>
# 1     1     1
# 2     2     3
# 3     3     6
# 4     4    10
# 5     5    15
For a general function, it would be
tibble(x = 1:5, y = accumulate(x, ~ .x + .y))
We can also specify the initialization value
tibble(x = 1:5, y = accumulate(x[-1], ~ .x + .y, .init = x[1]))
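The same pattern handles the updated salary example, where each total depends on the previous one; a sketch, using the opening balance of 50000 from the question:
tibble(
  years = 1:5,
  salary = 20000 * (1.01) ^ lag(years, default = 0),
  qta = salary * 0.06
) %>%
  mutate(
    # .x is the running total, .y the current qta; .init seeds the first step,
    # and [-1] drops the seed so the result has one value per row.
    total = accumulate(qta, ~ (.x + .y) * 1.02, .init = 50000)[-1]
  )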
You need x instead of y in the lag() function to run without an error:
tibble(
  x = 1:5,
  y = x + lag(x, default = 0)
)
But as per @Ronak Shah's comment, you need the cumsum() function to get the same result as your example:
tibble(
  x = 1:5,
  y = cumsum(x)
)

Finding the average of last observed result (on different days)

dx = data.frame(
  Day = c(3, 3, 4, 4, 5, 6, 7, 7, 7),
  Name = c("Amy", "Bob", "Jim", "Amy", "Bob", "Jim", "Amy", "Bob", "Jim"),
  Result = c(11:19)
)
Day Name Result
  3  Amy     11
  3  Bob     12
  4  Jim     13
  4  Amy     14
  5  Bob     15
  6  Jim     16
  7  Amy     17
  7  Bob     18
  7  Jim     19
Here is a table that shows the trial results of 3 subjects on specific days. How can I create a summary table that shows the average latest-observed result before each day (as a comparison for the upcoming results)? For example, the average latest-observed result for Day 6 would combine Amy's Day 4, Bob's Day 5, and Jim's Day 4 results. For Day 7, it would be Amy's Day 4, Bob's Day 5, and Jim's Day 6.
This is the solution I have but a for loop seems inefficient if I were to have thousands of days and subjects.
output = data.frame(Day = unique(dx$Day)) # Extract unique days
for (i in 1:nrow(output)) {
  dfTemp = dx[dx$Day < output[i, "Day"], ] # Find all results prior to the day
  dfTemp = dfTemp[with(dfTemp, order(Name, -Day)), ] # Sort descending by day
  dfTemp = dfTemp[match(unique(dfTemp$Name), dfTemp$Name), ] # Keep only the 1st row per person (the latest result, due to the sorting above)
  output[i, "AvgLatestResult"] = mean(dfTemp[, "Result"]) # Find mean
}
Day AvgLatestResult
  3             NaN
  4            11.5
  5            13.0
  6            14.0
  7            15.0
You may cast from long to wide (dcast), fill NAs with the last previous values (na.locf), take row means of all but the last row (rowMeans(head(..., -1))), and pad with a leading NA.
library(zoo)
library(data.table)
c(NA, rowMeans(head(na.locf(dcast(dx, Day ~ Name, value.var = "Result")[ , -1]), -1), na.rm = TRUE))
# [1] NA 11.5 13.0 14.0 15.0
Or expand 'dx' with one row per 'Day' and 'Name' (CJ), fill missing values by 'Name' (na.locf), create a "day index", and calculate the mean over the previous day.
setDT(dx)
d2 <- dx[CJ(Day = unique(dx$Day), Name = unique(dx$Name)), on = .(Day, Name)]
d2[ , Result2 := na.locf(Result, na.rm = FALSE), by = Name]
d2[ , ix := .GRP, by = Day]
d2[ , .(avg = mean(d2[ix == .GRP - 1, Result2], na.rm = TRUE)), by = Day]
# Day avg
# 1: 3 NaN
# 2: 4 11.5
# 3: 5 13.0
# 4: 6 14.0
# 5: 7 15.0
Read dx into a zoo object z, splitting on Name, to create a 5 x 3 wide-form object with Days as rows and Names as columns. Then use na.locf to fill in NA values, and rollapply with list(-1) (meaning use the prior value) and mean. This gives a zoo object AvgLatest, which we can optionally convert to a data frame using fortify.zoo. (If a zoo object is OK as the result, omit the fortify.zoo line.)
library(zoo)
z <- read.zoo(dx, split = "Name")
z.fill <- na.locf(z, na.rm = FALSE)
AvgLatest <- rollapply(z.fill, list(-1), mean, na.rm = TRUE, by.column = FALSE, fill = NA)
fortify.zoo(AvgLatest, names = "Day") # optional
giving:
  Day AvgLatest
1   3        NA
2   4      11.5
3   5      13.0
4   6      14.0
5   7      15.0
Variation
This code could be expressed using a magrittr pipeline like this:
library(zoo)
library(magrittr)
dx %>%
  read.zoo(split = "Name") %>%
  na.locf(na.rm = FALSE) %>%
  rollapply(list(-1), mean, na.rm = TRUE, by.column = FALSE, fill = NA) %>%
  fortify.zoo(names = "Day") # optional
For larger dataframes, this could be a bit faster:
# https://stackoverflow.com/questions/7735647/replacing-nas-with-latest-non-na-value
# Carry the last non-NA value forward (a fast base R LOCF).
repeat.before <- function(x) {
  ind <- which(!is.na(x))  # positions of non-NA values
  if (is.na(x[1]))
    ind <- c(1, ind)       # keep a leading NA as-is
  rep(x[ind], times = diff(c(ind, length(x) + 1)))
}
day_seq <- data.frame(Day = seq(min(dx$Day), max(dx$Day)))
out <- c(NA,
         rowMeans(
           do.call(
             cbind, by(dx, dx$Name, function(x) {
               out <- merge(x, day_seq, by.x = "Day", by.y = "Day",
                            all.x = TRUE, all.y = TRUE)
               out$Name <- x$Name[1]
               out$Result <- repeat.before(out$Result)
             })),
           na.rm = TRUE))
out[-length(out)]
[1]   NA 11.5 13.0 14.0 15.0
