Maintain data frame rows after subset - r

I am trying to calculate a % yield of some data based on a subset:
library(plyr)    # for ddply
library(ggplot2) # for ggplot
# example data set
set.seed(10)
Measurement <- rnorm(1000, 5, 2)
ID <- rep(c(1:100), each=10)
Batch <- rep(c(1:10), each=100)
df <- data.frame(Batch, ID, Measurement)
df$ID <- factor(df$ID)
df$Batch <- factor(df$Batch)
# Subset data based on measurement range
pass <- subset(df, Measurement > 6 & Measurement < 7)
# Calculate number of rows in data frame (by Batch then ID)
ac <- ddply(df, c("Batch", "ID"), nrow)
colnames(ac) <- c("Batch", "ID", "Total")
# Calculate number of rows in subset (by Batch then ID)
bc <- ddply(pass, c("Batch", "ID"), nrow)
colnames(bc) <- c("Batch", "ID", "Pass")
# Calculate yield
bc$Yield <- (bc$Pass / ac$Total) * 100
# plot yield
ggplot(bc, aes(ID, Yield, colour=Batch)) + geom_point()
My problem is that, because of my filter range (between 6 and 7), my subset (pass) has fewer rows than my data frame (df):
nrow(ac)
[1] 100
nrow(bc)
[1] 83
Therefore I cannot use
bc$Yield <- (bc$Pass / ac$Total) * 100
Or I get the error
replacement has 100 rows, data has 83
The reason I am trying to keep it generic is that my real data has varying numbers of batches and IDs (otherwise I could just divide by a constant in my yield calculation). Can anyone tell me how to put a 0 in my subset when the data falls outside the limits (6 to 7 in this case), or point out a more elegant way of calculating yield? Thank you.
Update:
str(df)
'data.frame': 1000 obs. of 3 variables:
$ Batch : Factor w/ 10 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
$ ID : Factor w/ 100 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Measurement: num 5.04 4.63 2.26 3.8 5.59 ...

I think this is what you want. I've done it using dplyr's group_by and summarize here.
For each Batch/ID it calculates the number of observations, the number of observations where measurement is between 6 and 7 and the ratio of those two.
library(dplyr)
# example data set
set.seed(10)
Measurement <- rnorm(1000, 5, 2)
ID <- rep(c(1:100), each=10)
Batch <- rep(c(1:10), each=100)
df <- data.frame(Batch, ID, Measurement)
df$ID <- factor(df$ID)
df$Batch <- factor(df$Batch)
# Count the values that fall inside the measurement range
countFunc <- function(x) sum((x > 6) & (x < 7))
# Calculate number of rows, rows that meet criteria, and yield.
totals <- df %>% group_by(Batch, ID) %>%
summarize(total = length(Measurement), x = countFunc(Measurement)) %>%
mutate(yield = x/total) %>%
as.data.frame()
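For comparison, here is a minimal base R sketch of the same idea: flag each row as pass/fail instead of subsetting, so groups with no passing rows keep a count of 0. The column name Pass and the object name yield are my own choices for illustration, not something from the question.

```r
# Rebuild the example data from the question
set.seed(10)
df <- data.frame(
  Batch = factor(rep(1:10, each = 100)),
  ID = factor(rep(1:100, each = 10)),
  Measurement = rnorm(1000, 5, 2)
)

# Flag rows inside the range rather than dropping the rest
df$Pass <- df$Measurement > 6 & df$Measurement < 7

# Total rows per Batch/ID
ac <- aggregate(Measurement ~ Batch + ID, data = df, FUN = length)
names(ac)[3] <- "Total"

# Passing rows per Batch/ID; groups with no passes get 0, not a missing row
bc <- aggregate(Pass ~ Batch + ID, data = df, FUN = sum)

# Join on the keys so the rows line up, then compute percent yield
yield <- merge(ac, bc, by = c("Batch", "ID"))
yield$Yield <- 100 * yield$Pass / yield$Total
```

Because both summaries come from the full data frame, they always have one row per Batch/ID, and the merge on the keys guards against any ordering mismatch.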

Related

How to calculate overall-mean for subsets in R

Assume I have longitudinal data (e.g. 21 countries' GDP over 5 years) with 2 variables on the same scale. The data is sorted as follows:
Country-year   x1    x2
USA-1          50    2
USA-2          47    1.5
...            ...   ...
USA-5          52    1.9
UK-1           63    2
...            ...   ...
UK-5           41    3
...            ...   ...
n = 21
m = 5
N = n * m
set.seed(123)
x <- MASS::mvrnorm(N, rep(0,2), diag(2))
I want to do two things:
First, calculate the mean country-wise
Second, calculate the mean year-wise
For the first task, I have the following code
x_barbar <- NULL
k = 1
for(i in 1:n){
x_barbar[i] <- mean(x[k:(k + (m - 1)), ])
k = k + m
}
How can I optimize my code for the first task and figure out a code to do the second task?
Your help is appreciated.
If you want to calculate mean of odd/even rows you can use vector recycling.
mean_odd <- mean(x[c(TRUE, FALSE), ])
mean_even <- mean(x[c(FALSE, TRUE), ])
For the updated dataset we can separate country and year column and calculate mean.
library(dplyr)
library(tidyr)
data <- df %>% separate(`Country-year`, c('Country', 'Year'), sep = '-')
country_mean <- data %>% group_by(Country) %>% summarise(across(c(x1, x2), mean))
year_mean <- data %>% group_by(Year) %>% summarise(across(c(x1, x2), mean))
All you have to do is use tapply
For example the first case
tapply(x, (row(x)+1)%/%2, mean)
For the second case:
tapply(x, row(x)%%2, mean)
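For the simulated matrix in the question (rows sorted by country, then year), a hedged sketch of both tasks is to build explicit country and year index vectors and group on those. The index names country and year are assumptions of mine, and rnorm is used here as a stand-in for MASS::mvrnorm with an identity covariance.

```r
n <- 21  # countries
m <- 5   # years per country
set.seed(123)
# Stand-in for MASS::mvrnorm(N, rep(0, 2), diag(2)): two independent N(0, 1) columns
x <- matrix(rnorm(n * m * 2), ncol = 2)

country <- rep(seq_len(n), each = m)   # rows 1 to m belong to country 1, and so on
year    <- rep(seq_len(m), times = n)  # years cycle 1..m within each country

# Pooled mean over both columns, matching the loop in the question
country_mean <- tapply(rowMeans(x), country, mean)
year_mean    <- tapply(rowMeans(x), year, mean)

# Per-column means instead: one row per country or per year
country_mean_by_col <- rowsum(x, country) / m
year_mean_by_col    <- rowsum(x, year) / n
```

The pooled version works because every row has the same number of columns, so the mean of the row means equals the mean of all entries in the block.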

Add the number in every row and take the sum

Having a dataframe like this
data.frame(id = c(1,2), num = c("30, 4, -2,","10, 20"))
How is it possible to take the sum of every row of the column num, taking the minus signs into account?
Example of expected output:
data.frame(id = c(1,2), sum = c(32, 30))
Using Base R you could do the following:
# data
df <- data.frame(id = c(1,2), num = c("30, 4, -2,","10, 20"))
# split by ",", convert to numeric and then sum
df[, 2] <- sapply(strsplit(as.character(df$num), ","), function(x){
sum(as.numeric(x))
})
# result
df
# id num
# 1 1 32
# 2 2 30
If you can use packages, the tidyverse packages make this easy. They follow tidy data principles, which are quick and easy to work with once you get used to thinking this way.
library(tidyr)
library(dplyr)
df %>%
# Convert the string of numbers to a tidy dataframe
# with one number per row with the id column for grouping
separate_rows(num,sep = ",") %>%
# Convert the text to a number so we can sum
mutate(num = as.numeric(num)) %>%
# Perform the calculation for each id
group_by(id) %>%
# Sum the number
summarise(sum = sum(num,na.rm = TRUE)) %>%
# Ungroup for further use of the data
ungroup()
# A tibble: 2 x 2
# id sum
# <dbl> <dbl>
# 1 1 32
# 2 2 30
library(stringr)
df <- data.frame(id = c(1,2), num = c("30, 4, -2","10, 20"))
df$sum <- NA
for (i in 1:nrow(df)) {
temp <- as.character(df[i,2])
n_num <- str_count(temp, '[0-9.]+')
total <- 0
for (j in 1:n_num) {
digit <- strsplit(temp, ',')[[1]][j]
total <- total + as.numeric(digit)
temp <- sub(digit, '', temp)
}
df[i, 'sum'] <- total
}
print(df)
id num sum
1 1 30, 4, -2 32
2 2 10, 20 30
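Another option, sketched here in base R, is to extract the signed numbers directly with a regular expression rather than splitting on commas, which sidesteps the trailing comma in the first row. The pattern is my own assumption about what counts as a number (an optional minus sign, digits, and an optional decimal part).

```r
df <- data.frame(id = c(1, 2), num = c("30, 4, -2,", "10, 20"),
                 stringsAsFactors = FALSE)

# Pull out every signed number from each string
tokens <- regmatches(df$num, gregexpr("-?[0-9]+\\.?[0-9]*", df$num))

# Convert to numeric and sum per row
df$sum <- vapply(tokens, function(x) sum(as.numeric(x)), numeric(1))
df$sum
# [1] 32 30
```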

Prophet forecasting by id and populating a data frame with one month ahead forecasts

I have a data frame containing multiple (thousands of) unequal-length monthly time series, separated by a non-sequential id variable. The data set looks like this:
id1 <- rep(12, 60)
ds1 <- seq(as.Date("2014-01-01"), as.Date("2018-12-31"), by = "month")
value1 <- sample(60)
id2 <- rep(132, 48)
ds2 <- seq(as.Date("2015-01-01"), as.Date("2018-12-31"), by = "month")
value2 <- sample(48)
id3 <- rep(210, 72)
ds3 <- seq(as.Date("2013-01-01"), as.Date("2018-12-31"), by = "month")
value3 <- sample(72)
id <- c(id1, id2, id3)
ds <- c(ds1, ds2, ds3)
y <- c(value1, value2, value3)
df <- data.frame(id, ds, y)
> head(df)
id ds y
1 12 2014-01-01 51
2 12 2014-02-01 22
3 12 2014-03-01 34
4 12 2014-04-01 53
5 12 2014-05-01 26
6 12 2014-06-01 56
I want to run the prophet forecasting model on every time series, separated by id, and generate a data frame with the one-month-ahead forecast and one or two diagnostic statistics. The rows of that data frame should start with the id variable, i.e. the first column should be id.
For a single id case, the procedure looks like this,
library(prophet)
set.seed(1234)
id <- rep(23, 60)
ds <- seq(as.Date("2014-01-01"), as.Date("2018-12-31"), by = "month")
y <- sample(60)
df <- data.frame(ds, y)
m <- prophet(df, seasonality.mode = 'multiplicative')
future <- make_future_dataframe(m, periods = 1)
fcst <- predict(m, future)
last_fcst <- fcst[61,]
mse <- mean((df$y - fcst$yhat[c(1:60)])^2)
mae <- mean(abs((df$y - fcst$yhat[c(1:60)])))
final <- cbind(last_fcst, mse, mae)
final
> final
ds trend multiplicative_terms multiplicative_terms_lower multiplicative_terms_upper yearly
61 2018-12-02 27.19465 -0.1401155 -0.1401155 -0.1401155 -0.1401155
yearly_lower yearly_upper additive_terms additive_terms_lower additive_terms_upper yhat_lower yhat_upper
61 -0.1401155 -0.1401155 0 0 0 3.689257 42.66293
trend_lower trend_upper yhat mse mae
61 27.19465 27.19465 23.38425 242.4414 12.80532
I want to repeat this procedure and create a dataset with each one-month forecast with their corresponding row ids. Any idea what's the best way to do that?
As I said in the comments, it's best to split() by id into a list(). This way you can use lapply() (or purrr::map()) to make predictions and calculate the metrics for each id.
library(prophet)
library(dplyr) # for data wrangling
library(purrr) # for map/map2, equivalents are lapply/mapply from base R
# preparations
l_df <- df %>%
split(.$id)
m_list <- map(l_df, prophet) # prophet call
future_list <- map(m_list, make_future_dataframe, periods = 1, freq = "month") # one month ahead; the default freq is "day"
forecast_list <- map2(m_list, future_list, predict) # map2 because we have two inputs
So, forecast_list will contain the output from the predict, again divided by id.
You can "merge" them back in a data.frame by using bind_rows(forecast_list), as long as each df is equal (same structure).
For the metrics I'd follow the same principle:
# to evaluate the model: create a new list
eval_list <- map2(forecast_list, l_df, function(x,z) {
# x is the single dataframe of predictions
# z is the original dataframe with actuals
x <- x[1:(nrow(x)-1), ] # drop the last row, which is the one true (out-of-sample) forecast
x <- x %>% mutate(y_true = (z %>% select(y) %>% pull()) ) # add the column of actual values
})
# metrics evaluation:
eval_list <- map(eval_list, function(x) {
x <- x %>%
summarise(mse = mean((y_true - yhat)^2)) # add more scores
})
# $`12`
# mse
# 1 199.1829
#
# $`132`
# mse
# 1 156.6394
#
# $`210`
# mse
# 1 415.9659
You can use map2() like I did for eval_list to bind the true forecast with the metrics if you want.
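To show the overall split / apply / bind shape without requiring prophet, here is a rough base R sketch that returns one row per id with id as the first column. Note that lm() on a time index is only a stand-in for the prophet fit and predict steps, and make_series() and forecast_one() are hypothetical helpers I made up for illustration.

```r
set.seed(1)
# Hypothetical helper: build one monthly series for a given id
make_series <- function(id, start, n) {
  data.frame(id = id,
             ds = seq(as.Date(start), by = "month", length.out = n),
             y = sample(n))
}
df <- rbind(make_series(12, "2014-01-01", 60),
            make_series(132, "2015-01-01", 48),
            make_series(210, "2013-01-01", 72))

# Hypothetical stand-in for prophet(): a linear trend fitted with lm()
forecast_one <- function(d) {
  d$t <- seq_len(nrow(d))
  fit <- lm(y ~ t, data = d)
  in_sample <- fitted(fit)
  data.frame(
    id = d$id[1],
    ds = seq(max(d$ds), by = "month", length.out = 2)[2],  # next month
    yhat = unname(predict(fit, newdata = data.frame(t = nrow(d) + 1))),
    mse = mean((d$y - in_sample)^2),
    mae = mean(abs(d$y - in_sample))
  )
}

# Split by id, forecast each piece, and bind the one-row results together
result <- do.call(rbind, lapply(split(df, df$id), forecast_one))
rownames(result) <- NULL
```

Swapping the body of forecast_one() for the prophet calls from the single-id example should give the same one-row-per-id layout, though I have not run that variant here.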

Vectorising linear interpolation function for use with mutate

I have a data frame that looks like this:
# Set RNG
set.seed(33550336)
# Create toy data frame
df <- expand.grid(day = 1:10, dist = seq(0, 100, by = 10))
df1 <- df %>% mutate(region = "Here")
df2 <- df %>% mutate(region = "There")
df3 <- df %>% mutate(region = "Everywhere")
df_ref <- do.call(rbind, list(df1, df2, df3))
df_ref$value <- runif(nrow(df_ref))
# > head(df_ref)
# day dist region value
# 1 1 0 Here 0.39413117
# 2 2 0 Here 0.44224203
# 3 3 0 Here 0.44207487
# 4 4 0 Here 0.08007335
# 5 5 0 Here 0.02836093
# 6 6 0 Here 0.94475814
This represents a reference data frame and I'd like to compare observations against it. My observations are taken on a specific day that is found in this reference data frame (i.e., day is an integer from 1 to 10) in a region that is also found in this data frame (i.e., Here, There, or Everywhere), but the distance (dist) is not necessarily an integer between 0 and 100. For example, my observation data frame (df_obs) might look like this:
# Observations
df_obs <- data.frame(day = sample(1:10, 3, replace = TRUE),
region = sample(c("Here", "There", "Everywhere")),
dist = runif(3, 0, 100))
# day region dist
# 1 6 Everywhere 68.77991
# 2 7 There 57.78280
# 3 10 Here 85.71628
Since dist is not an integer, I can't just lookup the value corresponding to my observations in df_ref like this:
df_ref %>% filter(day == 6, region == "Everywhere", dist == 68.77991)
So, I created a lookup function that uses the linear interpolation function approx:
lookup <- function(re, di, da){
# Filter to day and region
df_tmp <- df_ref %>% filter(region == re, day == da)
# Approximate answer from distance
approx(unlist(df_tmp$dist), unlist(df_tmp$value), xout = di)$y
}
Applying this to my first observation gives,
lookup("Everywhere", 68.77991, 6)
#[1] 0.8037013
Nevertheless, when I apply the function using mutate I get a different answer.
df_obs %>% mutate(ref = lookup(region, dist, day))
# day region dist ref
# 1 6 Everywhere 68.77991 0.1881132
# 2 7 There 57.78280 0.1755198
# 3 10 Here 85.71628 0.1730285
I suspect that this is because lookup is not vectorised correctly. Why am I getting different answers and how do I fix my lookup function to avoid this?
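A likely explanation is that mutate passes whole columns to lookup, so region == re and day == da compare against length-3 vectors with recycling, and approx then interpolates over a mixed set of rows. One possible fix, sketched below in base R, is to call the scalar lookup once per observation with mapply. The name lookup_vec is my own, and the base R subsetting replaces the dplyr filter only to keep the sketch self-contained.

```r
# Rebuild the reference data from the question
set.seed(33550336)
df <- expand.grid(day = 1:10, dist = seq(0, 100, by = 10))
df_ref <- do.call(rbind, lapply(c("Here", "There", "Everywhere"),
                                function(r) transform(df, region = r)))
df_ref$value <- runif(nrow(df_ref))

# Scalar lookup: one region, one distance, one day
lookup <- function(re, di, da) {
  df_tmp <- df_ref[df_ref$region == re & df_ref$day == da, ]
  approx(df_tmp$dist, df_tmp$value, xout = di)$y
}

# Vectorised wrapper: apply the scalar lookup to each observation in turn
lookup_vec <- function(re, di, da) {
  mapply(lookup, as.character(re), di, da, USE.NAMES = FALSE)
}

# Observations copied from the question's printed output
df_obs <- data.frame(day = c(6, 7, 10),
                     region = c("Everywhere", "There", "Here"),
                     dist = c(68.77991, 57.78280, 85.71628),
                     stringsAsFactors = FALSE)
df_obs$ref <- lookup_vec(df_obs$region, df_obs$dist, df_obs$day)
```

With dplyr loaded, the same wrapper should drop into the original pipeline as df_obs %>% mutate(ref = lookup_vec(region, dist, day)).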

Iteratively rbind 10% of the data from data frame and plotting

I have three data frames, each with one column but a different number of rows: 100, 100, and 1000 for df1, df2, and df3 respectively. I want to rbind them iteratively and compute measures such as the mean for small chunks, taking 10% of each data frame at a time. In the first iteration I need 10 rows from df1, 10 from df2, and 100 from df3, and I need the mean of that set; the process should repeat 10 times. I then need to plot the mean (y-axis) against the iteration and get an overall mean from this procedure. Any suggestions?
df1<- data.frame(A=c(1:100))
df2<- data.frame(A=c(1:100))
df3<- data.frame(A=c(1:1000))
library(dplyr)
for i in (1:10)
{ df[i]<- rbind_list(df1,df2,df3)
mean=mean(df$A)}
You're making things complicated by trying to keep separate data frames. Add a "group" column---call it "iteration" if you prefer---and get your data in one data frame:
df1$group = rep(1:10, each = nrow(df1) / 10)
df2$group = rep(1:10, each = nrow(df2) / 10)
df3$group = rep(1:10, each = nrow(df3) / 10)
df = rbind(df1, df2, df3)
means = group_by(df, group) %>% summarize(means = mean(A))
means
# Source: local data frame [10 x 2]
#
# group means
# 1 1 43
# 2 2 128
# 3 3 213
# 4 4 298
# 5 5 383
# 6 6 468
# 7 7 553
# 8 8 638
# 9 9 723
# 10 10 808
Your overall mean is mean(df$A). You can plot with with(means, plot(group, means)).
Edits:
If the groups don't come out exactly, here's how I'd assign the group column. Make sure your dplyr is up to date: this uses the .id argument of bind_rows(), which was new this month in version 0.4.3.
library(dplyr)
# dplyr >= 0.4.3
df = bind_rows(df1, df2, df3, .id = "id")
df = df %>% group_by(id) %>%
mutate(group = (0:(n() - 1)) %/% (n() / 10) + 1)
The id column tells you which data frame the row came from, and the group column splits it into 10 groups. The rest of the code from above should work just fine.
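To see what the integer-division trick does when the row count is not a multiple of 10, here is a small base R sketch. The helper name assign_groups and the example size of 25 rows are assumptions for illustration only.

```r
# Same formula as in the mutate() call, wrapped in a hypothetical helper
assign_groups <- function(n, k = 10) (0:(n - 1)) %/% (n / k) + 1

# 25 rows do not divide evenly into 10 groups
g <- assign_groups(25)
table(g)

# With 100 rows this reduces to the simple rep() assignment used earlier
all(assign_groups(100) == rep(1:10, each = 10))
```

The group sizes alternate rather than being exactly equal, but the labels stay ordered and never exceed 10.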

Resources