Finding the average of last observed result (on different days) - r

dx = data.frame(
  Day    = c(3, 3, 4, 4, 5, 6, 7, 7, 7),
  Name   = c("Amy", "Bob", "Jim", "Amy", "Bob", "Jim", "Amy", "Bob", "Jim"),
  Result = 11:19
)
Day Name Result
  3  Amy     11
  3  Bob     12
  4  Jim     13
  4  Amy     14
  5  Bob     15
  6  Jim     16
  7  Amy     17
  7  Bob     18
  7  Jim     19
Here is a table showing the trial results of three subjects on specific days. How can I create a summary table that shows, for each day, the average of the latest-observed result before that day (as a baseline for the upcoming results)? For example, the average latest-observed result for Day 6 would use Amy's Day 4, Bob's Day 5, and Jim's Day 4 results. For Day 7, it would use Amy's Day 4, Bob's Day 5, and Jim's Day 6 results.
This is the solution I have, but a for loop seems inefficient if I were to have thousands of days and subjects.
output = data.frame(Day = unique(dx$Day)) # Extract unique days
for (i in 1:nrow(output)) {
  dfTemp = dx[dx$Day < output[i, "Day"], ]           # Find all results prior to the day
  dfTemp = dfTemp[with(dfTemp, order(Name, -Day)), ] # Sort by name, descending by day
  dfTemp = dfTemp[match(unique(dfTemp$Name), dfTemp$Name), ] # Keep only the 1st row per person (their latest result, given the sort)
  output[i, "AvgLatestResult"] = mean(dfTemp[, "Result"])    # Find the mean
}
Day AvgLatestResult
  3             NaN
  4            11.5
  5            13.0
  6            14.0
  7            15.0

You may cast from long to wide (dcast), fill NAs with the last previous value (na.locf), take row means of all but the last row (rowMeans(head(..., -1))), and pad the front with NA.
library(zoo)
library(data.table)
c(NA, rowMeans(head(na.locf(dcast(dx, Day ~ Name, value.var = "Result")[ , -1]), -1), na.rm = TRUE))
# [1] NA 11.5 13.0 14.0 15.0
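For clarity, here is the same one-liner split into named steps (the intermediate names are just for illustration):
wide <- dcast(dx, Day ~ Name, value.var = "Result")[, -1] # one column per name, NA where unobserved
filled <- na.locf(wide)                                   # carry each name's last result forward
c(NA, rowMeans(head(filled, -1), na.rm = TRUE))           # drop the last day, pad the front with NA
# [1]   NA 11.5 13.0 14.0 15.0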
Or expand 'dx' to one row per 'Day'/'Name' combination (CJ), fill missing values within each 'Name' (na.locf), create a day index, and take the mean of the previous day's filled values.
setDT(dx)
d2 <- dx[CJ(Day = unique(dx$Day), Name = unique(dx$Name)), on = .(Day, Name)]
d2[ , Result2 := na.locf(Result, na.rm = FALSE), by = Name]
d2[ , ix := .GRP, by = Day]
d2[ , .(avg = mean(d2[ix == .GRP - 1, Result2], na.rm = TRUE)), by = Day]
#    Day  avg
# 1:   3  NaN
# 2:   4 11.5
# 3:   5 13.0
# 4:   6 14.0
# 5:   7 15.0
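To see what the expansion does, here are the first rows of d2 (values derived from the data above; Jim has no result on or before Day 3, so his filled value is still NA there):
d2[1:6]
#    Day Name Result Result2 ix
# 1:   3  Amy     11      11  1
# 2:   3  Bob     12      12  1
# 3:   3  Jim     NA      NA  1
# 4:   4  Amy     14      14  2
# 5:   4  Bob     NA      12  2
# 6:   4  Jim     13      13  2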

Read dx into a zoo object z, splitting on Name, to create a 5 x 3 wide-form object with days as rows and names as columns. Then use na.locf to fill in NA values, and rollapply with list(-1) (meaning: use the prior value) and mean. This gives a zoo object AvgLatest, which we can optionally convert to a data frame using fortify.zoo. (If a zoo object is OK as the result, omit the fortify.zoo line.)
library(zoo)
z <- read.zoo(dx, split = "Name")
z.fill <- na.locf(z, na.rm = FALSE)
AvgLatest <- rollapply(z.fill, list(-1), mean, na.rm = TRUE, by.column = FALSE, fill = NA)
fortify.zoo(AvgLatest, names = "Day") # optional
giving:
  Day AvgLatest
1   3        NA
2   4      11.5
3   5      13.0
4   6      14.0
5   7      15.0
Variation
This code could be expressed using a magrittr pipeline like this:
library(zoo)
library(magrittr)
dx %>%
  read.zoo(split = "Name") %>%
  na.locf(na.rm = FALSE) %>%
  rollapply(list(-1), mean, na.rm = TRUE, by.column = FALSE, fill = NA) %>%
  fortify.zoo(names = "Day") # optional

For larger data frames, this could be a bit faster:
# Fill NAs with the latest non-NA value; see
# https://stackoverflow.com/questions/7735647/replacing-nas-with-latest-non-na-value
repeat.before = function(x) {
  ind = which(!is.na(x))  # positions of non-NA values
  if (is.na(x[1]))        # a leading NA has nothing to carry forward, so keep it
    ind = c(1, ind)
  rep(x[ind], times = diff(c(ind, length(x) + 1)))  # repeat each value until the next one
}
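A quick check of the helper (leading NAs are kept, since there is nothing to carry forward):
repeat.before(c(NA, 1, NA, NA, 2))
# [1] NA  1  1  1  2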
day_seq <- data.frame(Day = seq(min(dx$Day), max(dx$Day)))
out <- c(NA,
         rowMeans(
           do.call(
             cbind, by(dx, dx$Name, function(x) {
               out <- merge(x, day_seq, by = "Day", all = TRUE)  # one row per day
               out$Name <- x$Name[1]
               out$Result <- repeat.before(out$Result)  # the filled vector is the return value
             })),
           na.rm = TRUE))
out[-length(out)]
# [1]   NA 11.5 13.0 14.0 15.0
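For completeness, the same logic can be written as a dplyr/tidyr pipeline (a sketch, not from the answers above; it assumes tidyr's complete() and fill() and reproduces the same output):
library(dplyr)
library(tidyr)
dx %>%
  complete(Day, Name) %>%                                  # one row per Day/Name combination
  group_by(Name) %>%
  fill(Result) %>%                                         # carry each subject's last result forward
  group_by(Day) %>%
  summarise(AvgLatestResult = mean(Result, na.rm = TRUE)) %>%
  mutate(AvgLatestResult = lag(AvgLatestResult))           # shift so each day only sees earlier results
# Day 3 is NA; Days 4-7 give 11.5, 13.0, 14.0, 15.0 as above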

Related

Multiply a set of columns in data frame R to another set of columns

I have a data.frame with 156 variables and I would like to multiply a subset of those variables by another subset of the 156. The variables are of the following form:
ID || quantity_1 || quantity_2 || ... || quantity_156 || priceperunit_q1 || ... || priceperunit_q156
Essentially, I would like to multiply each quantity by its price per unit.
Given that you first have all the quantity columns and then all the priceperunit columns in the same order, you can make two different data sets and multiply them using mapply, like this. The result set will contain the product of each pair of columns; rename them if needed:
quantity_1 <- c(1, 2, 3)
quantity_2 <- c(1, 2, 3)
quantity_3 <- c(1, 2, 3)
quantity_40 <- c(1, 2, 3)
priceperunit_1 <- c(20, 20, 20)
priceperunit_2 <- c(30, 30, 30)
priceperunit_3 <- c(15, 15, 15)
priceperunit_40 <- c(1.25, 1.25, 1.65)
df <- data.frame(quantity_1, quantity_2, quantity_3, quantity_40,
                 priceperunit_1, priceperunit_2, priceperunit_3, priceperunit_40)
df
#>   quantity_1 quantity_2 quantity_3 quantity_40 priceperunit_1 priceperunit_2
#> 1          1          1          1           1             20             30
#> 2          2          2          2           2             20             30
#> 3          3          3          3           3             20             30
#>   priceperunit_3 priceperunit_40
#> 1             15            1.25
#> 2             15            1.25
#> 3             15            1.65
as.data.frame(mapply("*",
                     df[, grep("^quantity", names(df))],
                     df[, grep("^priceperunit", names(df))]))
#>   quantity_1 quantity_2 quantity_3 quantity_40
#> 1         20         30         15        1.25
#> 2         40         60         30        2.50
#> 3         60         90         45        4.95
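If the two blocks of columns are not guaranteed to be in matching order, you can pair them by name instead of by position (a sketch, assuming the names follow the quantity_N / priceperunit_N pattern used above):
q_cols <- grep("^quantity_", names(df), value = TRUE)
p_cols <- sub("^quantity_", "priceperunit_", q_cols)  # matching price column for each quantity
df[q_cols] * df[p_cols]                               # element-wise products, same result as above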
A simple solution would be to reshape your data from wide to long format; see ?reshape. This also lets you keep your original data format.
The trick is to store all the price data and all the quantity data in the same two columns; records are differentiated by the time variable that reshape creates (1 to 156, according to your post).
You can then compute the volume (price x quantity), and go back to the wide format if desired.
See the reproducible example below.
df <- data.frame(price1 = 1:10,
                 price2 = 11:20,
                 quantity1 = c(1:5, 1:5),
                 quantity2 = c(5:9, 5:9))
df_long <- reshape(df,
                   direction = "long",
                   v.names = c("price", "quantity"),
                   varying = list(paste0("price", 1:2), paste0("quantity", 1:2)),
                   times = c("1", "2"))
df_long$volume <- df_long$price * df_long$quantity
df_wide <- reshape(df_long,
                   direction = "wide",
                   v.names = c("price", "quantity", "volume"),
                   timevar = "time",
                   sep = "")
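As a quick check on the round trip (reshape names the reconstructed columns price1, quantity1, volume1, and so on, because sep = ""):
df_wide$volume1[1] == df$price1[1] * df$quantity1[1]
# [1] TRUE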

Calculating repeated means from columns using R

This is hopefully a simple question about loops in R. I have a dataset made up of results from a simulation. Each column holds the results from a single cow, taken each day for a month and then repeated 100 times, so each column is 3000 entries long.
I would like to calculate the mean of the simulated results for each day, to get a single value per day for each cow. So I need the mean of the 1st, 31st, 61st, ... entries, then the mean of the 2nd, 32nd, 62nd, ... entries, and so on, ending up with a 30-entry column for each cow. I have been trying to do it with a loop in R but can't work out how. Any advice would be greatly appreciated.
Here is some example data:
a <- seq(from = 1, by = 1, length = 30)
b <- seq(from = 1, by = 0.5, length = 30)
c <- seq(from = 1, by = 2, length = 30)
cow1 <- rep(a, 100)
cow2 <- rep(b, 100)
cow3 <- rep(c, 100)
dat <- as.data.frame(cbind(cow1, cow2, cow3))
I think it is better to construct a "day" column and then use it with tapply. As Xi'an said, there is no need for a loop here; a loop would be slower and less clean. In code this gives us:
a <- seq(from = 1, by = 1, length = 30)
b <- seq(from = 1, by = 0.5, length = 30)
c <- seq(from = 1, by = 2, length = 30)
day <- seq(from = 1, by = 1, length = 30)
day <- rep(day,100)
cow1 <- rep(a,100)
cow2 <- rep(b,100)
cow3 <- rep(c,100)
# Construct a data frame; this way the columns get names.
dat <- data.frame(day,cow1,cow2,cow3)
# Here are the results
tapply(dat$cow1, dat$day, mean)
tapply(dat$cow2, dat$day, mean)
tapply(dat$cow3, dat$day, mean)
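To avoid repeating the tapply call for each cow, you can apply it across all cow columns at once (a small sketch using the data frame above):
sapply(dat[, -1], function(col) tapply(col, dat$day, mean))
# returns a 30 x 3 matrix: one row per day, one column per cow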
I agree with TMat, including a column with day is useful.
Here is my working example using tidyverse
library(tidyverse)
a <- seq(from = 1, by = 1, length = 30)
b <- seq(from = 1, by = 0.5, length = 30)
c <- seq(from = 1, by = 2, length = 30)
day <- seq(from = 1, by = 1, length = 30)
day <- rep(day,100)
cow1 <- rep(a,100)
cow2 <- rep(b,100)
cow3 <- rep(c,100)
dat <- data.frame(day, cow1, cow2, cow3) %>%
  pivot_longer(cols = 2:4) %>%
  group_by(day, name) %>%
  summarize(mean = mean(value))
#> `summarise()` regrouping output by 'day' (override with `.groups` argument)
dat
#> # A tibble: 90 x 3
#> # Groups:   day [30]
#>      day name   mean
#>    <dbl> <chr> <dbl>
#>  1     1 cow1    1
#>  2     1 cow2    1
#>  3     1 cow3    1
#>  4     2 cow1    2
#>  5     2 cow2    1.5
#>  6     2 cow3    3
#>  7     3 cow1    3
#>  8     3 cow2    2
#>  9     3 cow3    5
#> 10     4 cow1    4
#> # ... with 80 more rows
ggplot(dat, aes(x = day, y = mean, fill = name)) +
  geom_col(position = "dodge")
Created on 2020-07-08 by the reprex package (v0.3.0)

In R, apply a function separately between columns with same names in different data frames

I have two data frames:
require(tidyverse)
set.seed(42)
df1 = data_frame(x = c(4, 3), y = c(0, 0), z = c(NA, 3))
df2 = data_frame(x = sample(1:4, 100, replace = T),
                 y = sample(c(-3, 0, 3), 100, replace = T),
                 z = c(NA, NA, rep(3, 98))) %>%
  mutate(Tracking = row_number())
I would like, separately for each row of df1 AND for each column of df1, to find the indices of df2 where df2 equals df1. If I used a loop, each iteration would look like:
for (i in 1:nrow(df1)) {
  for (j in 1:ncol(df1)) {
    L[[i]][j] = inner_join(df1[i, j], df2)
  }
}
For example, the first element of the list is:
inner_join(df1[1,1], df2)
Joining, by = "x"
# A tibble: 26 x 4
       x     y     z Tracking
   <dbl> <dbl> <dbl>    <int>
 1    4.    0.   NA         1
 2    4.   -3.   NA         2
 3    4.    0.    3.        4
 4    4.    3.    3.       13
 5    4.    0.    3.       16
 6    4.   -3.    3.       17
 7    4.    0.    3.       21
 8    4.    0.    3.       23
 9    4.    0.    3.       24
10    4.    3.    3.       28
# ... with 16 more rows
However I am sure there's a more efficient way to do this. Possibly dplyr + purrr? I don't have much experience with purrr, but I have a feeling the map function can come in handy. I just don't know how to call the columns separately.
You could do something like
L <- map(names(df1),
         function(.) {
           out <- inner_join(x = df1[, ., drop = FALSE],
                             y = df2,
                             by = .)
           split(out, out[[.]])
         })
but I'm not sure if this is better or more efficient than the for loop you started with.

Calculate value of all numeric variables for empty rows based on previous and next row of same id

I have a very large data.frame (thousands of variables), each row having an identifier and a year. An identifier might appear for several years, or first appear in the middle of the data. One year (1997) is missing, and I want to interpolate the values of all numeric variables in a certain way:
Copy all rows from the previous year (1996) whose identifier exists in the next year (1998).
For all numeric variables, calculate the mean of the previous year's value and the next year's value, taken from the two corresponding rows with the same identifier.
As this is a very large data set, I am eager to avoid using loops.
Example data:
test_df <- data.frame(id = c(1, 2, 3, 1, 3), year = c(96, 96, 96, 98, 98),
                      state = c("MA", "MD", "NY", "MA", "NY"),
                      num1 = c(10, 11, 22, 9, 27),
                      num2 = c(11566, 32340, 97555, 14200, 100025))
> test_df
  id year state num1   num2
1  1   96    MA   10  11566
2  2   96    MD   11  32340
3  3   96    NY   22  97555
4  1   98    MA    9  14200
5  3   98    NY   27 100025
And the final data should be:
  id year state num1   num2
1  1   96    MA   10  11566
2  2   96    MD   11  32340
3  3   96    NY   22  97555
4  1   97    MA  9.5  12883
5  3   97    NY 24.5  98790
6  1   98    MA    9  14200
7  3   98    NY   27 100025
What I did so far was to subset the rows from the previous year that have the same ids as the next year, and to select the numeric variables. After the computation, I would just rbind them back onto the main data.
common_ids <- test_df[test_df$year == 96, ]
common_ids <- common_ids[common_ids$id %in% test_df[test_df$year == 98, ]$id, ]
numeric_vars <- sapply(common_ids, is.numeric)
common_ids[, numeric_vars] <- lapply(common_ids[, numeric_vars], function(x) ???)
Final question: will the solution also work if two years in a row are missing?
Thanks!
Using data.table and zoo you could start with something like this
library(data.table)
library(zoo)
test_df <- data.table(id = c(1, 2, 3, 1, 3), year = c(96, 96, 96, 98, 98),
                      state = c("MA", "MD", "NY", "MA", "NY"),
                      num1 = c(10, 11, 22, 9, 27),
                      num2 = c(11566, 32340, 97555, 14200, 100025))
test_df <- test_df[order(id, year)]
# ids that occur more than once, i.e. have a row on both sides of the missing year
missing.ids <- test_df[, c(NA, id[-.N]), by = id][!is.na(V1), V1]
temp_df <- data.table(id = missing.ids, year = rep(97, length(missing.ids)),
                      state = NA, num1 = NA, num2 = NA)
new.test_df <- rbind(test_df, temp_df)[order(id, year)]
new.test_df[, state := na.locf(state, na.rm = FALSE), by = id]  # carry state forward
new.test_df[, `:=` (num1 = na.approx(num1, na.rm = FALSE),
                    num2 = na.approx(num2, na.rm = FALSE)), by = id]  # linear interpolation
EDIT
Without naming specific variables
library(data.table)
library(zoo)
test_df <- data.table(id = c(1, 2, 3, 1, 3), year = c(96, 96, 96, 98, 98),
                      state = c("MA", "MD", "NY", "MA", "NY"),
                      num1 = c(10, 11, 22, 9, 27),
                      num2 = c(11566, 32340, 97555, 14200, 100025))
test_df <- test_df[order(id, year)]
mynum.cols <- names(test_df)[!(names(test_df) %in% c("id", "year", "state"))]
missing.ids <- test_df[, c(NA, id[-.N]), by = id][!is.na(V1), V1]
temp_df <- data.table(id = missing.ids, year = rep(97, length(missing.ids)), state = NA,
                      data.table(matrix(NA, nrow = length(missing.ids), ncol = length(mynum.cols),
                                        dimnames = list(rep(NA, length(missing.ids)), mynum.cols))))
new.test_df <- rbind(test_df, temp_df)[order(id, year)]
new.test_df[, state := na.locf(state, na.rm = FALSE), by = id]
new.test_df[, (mynum.cols) := lapply(.SD, function(x) na.approx(x, na.rm = FALSE)),
            by = id, .SDcols = mynum.cols]
new.test_df <- new.test_df[order(year, id)]
new.test_df
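On the final question: na.approx interpolates linearly across longer gaps as well, so as long as a row is inserted for each missing year, two consecutive missing years are handled the same way. A minimal illustration:
na.approx(c(10, NA, NA, 13))
# [1] 10 11 12 13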

Replacing certain values in data.frame in R

I am trying to replace the NAs in "test" with the forecast values in "forecast". I am trying to use match, but I can't figure it out. Keep in mind that id and time together form a two-part unique key, and that my real data set is much larger than this example (32000 rows). Any suggestions?
test = data.frame(id = c(1, 1, 1, 2, 2, 2), time = c(89, 99, 109, 89, 99, 109),
                  data = c(3, 4, NA, 5, 2, NA))
forecast = data.frame(id = c(1, 2), time = c(109, 109), data = c(5, 1))
Desired output:
out = data.frame(id = c(1, 1, 1, 2, 2, 2), time = c(89, 99, 109, 89, 99, 109),
                 data = c(3, 4, 5, 5, 2, 1))
Here is the data.table solution:
test <- data.table(test, key = c('id', 'time'))
forecast <- data.table(forecast, key = c('id', 'time'))
forecast[test][, data := ifelse(is.na(data), data.1, data)]
EDIT. Benchmarking tests: data.table is ~2.6x faster even for this small dataset.
library(rbenchmark)
f_merge <- function() {
  out2 <- merge(test, forecast, by = c("id", "time"), all.x = TRUE)
  out2 <- transform(out2,
                    newdata = ifelse(is.na(data.x), data.y, data.x),
                    data.x = NULL, data.y = NULL)
  return(out2)
}
f_dtable <- function() {
  test <- data.table(test, key = c('id', 'time'))
  forecast <- data.table(forecast, key = c('id', 'time'))
  test <- forecast[test][, data := ifelse(is.na(data), data.1, data)]
  test$data.1 <- NULL
  return(test)
}
benchmark(f_merge(), f_dtable(), order = 'relative',
          columns = c('test', 'elapsed', 'relative'))
        test elapsed relative
2 f_dtable()    0.86     1.00
1  f_merge()    2.26     2.63
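With more recent data.table versions, the same replacement can also be written as an update join, which fills only the NA rows in place and avoids the data.1 helper column (a sketch, not part of the original benchmark):
setDT(test)
setDT(forecast)
test[is.na(data), data := forecast[.SD, on = .(id, time), x.data]]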
I would use merge to join the data together and then compute your new column in two steps:
out2 <- merge(test, forecast, by = c("id", "time"), all.x = TRUE)
> out2
  id time data.x data.y
1  1   89      3     NA
2  1   99      4     NA
3  1  109     NA      5
4  2   89      5     NA
5  2   99      2     NA
6  2  109     NA      1
#Compute new variable and clean up old ones:
out2 <- transform(out2, newdata = ifelse(is.na(data.x), data.y, data.x),
                  data.x = NULL, data.y = NULL)
> out2
  id time newdata
1  1   89       3
2  1   99       4
3  1  109       5
4  2   89       5
5  2   99       2
6  2  109       1
Try this:
test$data[is.na(test$data)] <- forecast[((forecast$id %in% test$id) & (forecast$time %in% test$time)),]$data
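Note that %in% tests id and time independently, so this can pick unintended rows when the same times occur under several ids. Since the question mentions match, here is a sketch of that approach using a combined key (it assumes each id/time pair is unique):
key_test <- paste(test$id, test$time)
key_fc <- paste(forecast$id, forecast$time)
miss <- is.na(test$data)
test$data[miss] <- forecast$data[match(key_test[miss], key_fc)]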