Simulating samples from a Gamma distribution in R

I'm having trouble with a programming assignment.
From the previous questions, I have a list of 49 elements, each containing a sample of size 10000.
For the last question, I have to calculate the mean of the first n sample values, for n between 1 and 10000, within each dataset.
I then have to plot these running averages for each dataset.
I've been trying to create lists/vectors of the running averages, but it's not working out.
Is there anything I can do?

Function for running average:
run_avg <- function(x, n_max){
  # for each k in 1..n_max, take the mean of the first k values of x
  a <- c(1:n_max)
  r_avg <- sapply(a, FUN = function(y) mean(x[1:y]))
  return(r_avg)
}
In your case, n_max should equal 10000.
This function creates the running averages for one dataset.
This then has to be applied to all datasets. You can use lapply for this if your datasets are stored in a list; a plain loop would also work.
Edit: I see that your datasets are in a list, so simply use:
lapply(my_list, run_avg, n_max = 10000)

The running averages can be computed with the following.
res <- lapply(x, function(y){
  sapply(seq_along(y), function(k) mean(y[1:k]))
})
Then, to get the resulting list into a format better suited for plotting with ggplot2, convert it to a data frame first, with the row names as a column.
df_res <- do.call(cbind.data.frame, res)
names(df_res) <- paste("Mean", seq_len(ncol(df_res)), sep = ".")
df_res <- cbind(df_res, id = as.integer(row.names(df_res)))
Now reshape from wide to long and plot.
library(tidyverse)
df_res %>%
  pivot_longer(
    cols = starts_with("Mean"),
    names_to = "Vector",
    values_to = "Mean"
  ) %>%
  ggplot(aes(id, Mean, colour = Vector)) +
  geom_point() +
  geom_line()
Test data.
set.seed(1234)
list_size <- 4 # 49 in the question
samp_size <- 20 # 10000 in the question
x <- lapply(seq.int(list_size), function(i) rgamma(samp_size, shape = i))
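As a side note (not part of the original answers), the running mean can also be computed without the inner loop, because the mean of the first k values is just the cumulative sum divided by k. A minimal sketch, assuming x is the list of sample vectors as in the test data above:
run_avg_fast <- function(x) {
  # running mean: cumulative sum divided by the count of values so far
  cumsum(x) / seq_along(x)
}
res_fast <- lapply(x, run_avg_fast)
This avoids calling mean() 10000 times per dataset and is considerably faster.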

Related

R: apply function per group using dplyr

I have a function which detrends data:
detrender <- function(Data, Vars, timevar){
  Data_detrend <- Data
  for (v in seq_along(Vars)){
    ff <- as.formula(paste0(Vars[[v]], " ~ ", timevar))
    fit <- lm(ff, data = Data_detrend)
    if (anova(fit)$P[1] < 0.05){
      message(paste("Detrending variable", v))
      Data_detrend[[Vars[v]]][!is.na(Data_detrend[[Vars[[v]]]])] <- residuals(fit)
    }
  }
  Data_detrend
}
useVar <- c("depression_sum")
inertia <- detrender(inertia,useVar, "timestamp")
This works fine and detrends my data. However, as my datafile has many groups (Rid), the detrending is applied across all groups combined. I want to detrend per group (Rid) and have tried to apply my function using group_by in dplyr. However, this code seems to add thousands of new rows of data. This is wrong; the number of rows should stay the same while the data is detrended.
inertia_2 <- inertia %>%
  group_by(Rid) %>%
  do(detrender(inertia, useVar, "timestamp")) %>%
  ungroup()
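A possible fix (a sketch, not taken from the original post): inside do(), refer to the current group's rows with the . placeholder rather than the full inertia data frame, so detrender() only ever sees one Rid at a time.
inertia_2 <- inertia %>%
  group_by(Rid) %>%
  do(detrender(., useVar, "timestamp")) %>%  # `.` is the current group's data
  ungroup()
With more recent dplyr versions, group_modify(~ detrender(.x, useVar, "timestamp")) should achieve the same result.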

How to create a vector that sums random numbers generated repeatedly?

Goal
I want to see how random profits/losses can accumulate over time.
library(dplyr)
names <- seq(1:1000)
wealth <- rep(0, 1000)
df <- as.data.frame(bind_cols(names, wealth))
colnames(df) <- c("households", "wealth")
View(df)
Attempt
df2 <- df %>%
  mutate(new_wealth = runif(1000, -10, 10),
         new_wealth2 = runif(1000, -10, 10),
         final_wealth = wealth + new_wealth + new_wealth2) %>%
  select(households, final_wealth)
Solution Wanted
Instead of creating loads of "new_wealth" columns, I want to add a vector of random numbers to the final_wealth column, say 100 times, and then see the result. Ideally, I would do it 1000 times.
I don't care about the new_wealth columns, and just want the final_wealth column. Can I use lapply to do this? If not, do any of you have a better solution?
Thanks!
We can use the tidyverse:
library(dplyr)
library(purrr)
df1 <- df %>%
  mutate(final_wealth = rerun(100, runif(n(), -10, 10)) %>%
           reduce(`+`))
We can use replicate to repeat the runif code any number of times. Use rowSums to perform the sum and add a new column.
n <- nrow(df)
n_repeat <- 100
df$final_wealth <- rowSums(replicate(n_repeat, runif(n,-10,10)))
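To answer the "Can I use lapply?" part directly, here is a minimal base-R sketch (equivalent in spirit to the two answers above, with hypothetical object names): build a list of 100 random vectors and add them together with Reduce.
draws <- lapply(seq_len(100), function(i) runif(nrow(df), -10, 10))  # 100 random draws per household
df$final_wealth <- Reduce(`+`, draws)                                # element-wise sum of all draws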

lapply strings to subset via brackets[]

There is almost certainly an easier way to go about this, but perhaps I've just been awake too long. I want to use the following vector of strings:
lap_list <- paste0(seq(1,length(mpg[[1]]),10), ":", seq(10,length(mpg[[1]]),10))
and use the vector to subset such as mpg[lap_list[1], ]. Alternatively, I could use dplyr for something with slice:
mpg %>%
  slice(lap_list[1])
Both methods give the same error, and beyond eval(parse()) or as.numeric() I'm having a hard time wording my question for Google.
The ultimate goal is to have a function that I can lapply over to produce the graph outputs. Say:
barchart <- function(data_slice) {
  mpg %>%
    slice(data_slice) %>%
    ggplot(aes(x = model)) + geom_bar()
}
lapply(lap_list, barchart)
If you paste together the sequence of rows you want to subset using paste0, you don't have much option but to use eval(parse()) in some way or another.
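For illustration (a hypothetical sketch, not code from the original post), the eval(parse()) route would look roughly like this, reusing lap_list from the question:
# assumes dplyr and ggplot2 are loaded, for slice() and the mpg dataset
idx <- eval(parse(text = lap_list[1]))  # turns the string "1:10" into the vector 1:10
mpg %>% slice(idx)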
An alternative is to create the sequences of rows you want to subset and store them in vectors. Pass them to Map to slice the data and then plot.
library(dplyr)
library(ggplot2)
n <- nrow(mpg)
start <- seq(1,n,10)
#Added an extra `n` here to make the length of start and end equal
end <- c(seq(10,n,10), n)
barchart <- function(data, start, end) {
  data %>%
    slice(start:end) %>%
    ggplot(aes(x = model)) + geom_bar()
}
list_of_plots <- Map(barchart, start, end, MoreArgs = list(data = mpg))
You can access each individual plot using list_of_plots[[1]], list_of_plots[[2]], etc.
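If you then want the plots written to disk rather than just printed, one possibility (hypothetical file names, not part of the original answer) is to pass them to ggsave() with Map:
# save each plot in the list to its own numbered PNG file
Map(function(p, i) ggsave(sprintf("barchart_%02d.png", i), plot = p, width = 6, height = 4),
    list_of_plots, seq_along(list_of_plots))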
Alternatively, you can create groups of 10 rows and store the plots in the dataframe:
mpg %>%
  group_by(grp = ceiling(row_number() / 10)) %>%
  summarise(plot = list(ggplot(cur_data(), aes(x = model)) + geom_bar()))

Group R table entries by month

I have a CSV file with two columns, a Unix timestamp and a version string. What I finally want to achieve is to group the data by month and plot it, so that the individual months are the entries on the x axis and a line is plotted for each unique version string, where the y-axis values represent the number of hits in that month.
Here is a small example CSV:
timestamp,version
1434974143,1.0.0
1435734004,1.1.0
1435734304,1.0.0
1435735386,1.2.0
I'm new to R, so I encountered several problems. First I successfully read the csv with
mydata <- read.csv("data.csv")
and figured out an ugly function that converts a single timestamp into an R date:
as_time <- function(val){
  return(head(as.POSIXct(as.numeric(as.character(val)), origin = "1970-01-01", tz = "GMT")))
}
But none of the several apply functions seemed to work on the table column.
So how do I create a data structure, that groups the version hits by month, and can be plotted later?
It's easier than you think!
You are essentially looking for the hist function.
# Let's make some mock data
# Set the random seed for reproducibility
set.seed(12345)
my.data <- data.frame(timestamp = runif(1000, 1420000000, 1460000000),
                      version = sample(1:5, 1000, replace = T))
my.data$timestamp <- as.POSIXct(my.data$timestamp, origin = "1970-01-01")
# Histogram of the data, irrespective of version
hist(my.data$timestamp, "month")
# If you want to see the version then split the data first...
my.data.split <- split(my.data, my.data$version)
# Then apply hist
counts <- sapply(my.data.split, function(x) {
  h <- hist(x$timestamp, br = "month", plot = FALSE)
  h$counts
})
# Transform into a matrix and plot
counts <- do.call("rbind", counts)
barplot(counts, beside = T)
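If you also want a legend mapping the bars to versions, a small addition (not in the original answer) is barplot's legend.text argument:
barplot(counts, beside = T, legend.text = rownames(counts))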
You can use the as.yearmon function from the zoo package to get year/month formats:
library(zoo)
dat$yearmon <- as.yearmon(as.POSIXct(dat$timestamp, origin = "1970-01-01", tz = "GMT"))
Then it depends what you want to do with your data. For example, the number of version hits per month (thanks to @Frank for fixing):
library(dplyr)
dat %>%
  group_by(yearmon, version) %>%
  summarise(hits = n())
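The question also asks for a plot with one line per version; a possible final step (a sketch, not in the original answer, with monthly_hits as a hypothetical intermediate name) is to save that summary and feed it to ggplot2:
library(ggplot2)
monthly_hits <- dat %>%
  group_by(yearmon, version) %>%
  summarise(hits = n(), .groups = "drop")
ggplot(monthly_hits, aes(x = as.Date(yearmon), y = hits, colour = factor(version))) +
  geom_line() +
  labs(x = "Month", y = "Hits", colour = "Version")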

How to add up data frames from a loop within a function?

I have a function whose purpose is to predict revenue from cost. The twist is that I have many different dataframes as inputs, and many different corresponding models to predict with: the function loops through each dataframe, predicts on it with its corresponding model, and outputs a prediction with confidence intervals. Now I need to find a way to add all of these prediction outputs up.
Here's a simplified example of what I'm doing; feel free to skip it if you don't need it to answer the question (it might be hard to read), but read on if it helps. Note that each prediction output isn't a dataframe of cost and revenue, but a summary of the revenues you can expect from a varying cost.
predictions <- function(df_list, model_list) {
  for (i in 1:length(df_list)) {
    sapply(seq(1, 2, .25), function(x) {
      df_list[[i]]$cost <- df_list[[i]]$cost * x
      predictions <- predict(model_list[[i]], df_list[[i]], interval = "confidence")
      temp <- cbind(df_list[[i]]$cost, predictions)
      output <- summarise(temp, Cost = sum(cost), Low = sum(lwr), Fit = sum(fit), Upper = sum(upr))
      output
    }) -> output
    output %>% t %>% as.data.frame -> output
  }
}
With the output for each index looking like this:
Cost Lower_Rev Fit_Rev Upper_Rev
1 2048884 18114566 20898884 24145077
2 2561105 21684691 25085853 29064495
3 3073326 25092823 29122421 33853693
4 3585547 28369901 33038060 38539706
5 4097768 31537704 36853067 43140547
I need some way to add together each output into one master output, whose cost and revenue values will be the sum of all others. Any ideas?
Your output simply gets replaced on every iteration. You need to assign it to something, or simply use lapply/sapply. I also added i as an argument to the inner function instead of relying on R's scoping rules to pick up i from the enclosing environment.
L <- lapply(seq_along(df_list), function(i) {
  sapply(seq(1, 2, .25), function(x, i) {
    df_list[[i]]$cost <- df_list[[i]]$cost * x
    predictions <- predict(model_list[[i]], df_list[[i]], interval = "confidence")
    temp <- cbind(df_list[[i]]$cost, predictions)
    output <- summarise(temp, Cost = sum(cost), Low = sum(lwr), Fit = sum(fit), Upper = sum(upr))
    output %>% t %>% as.data.frame
  }, i = i)  # pass i through sapply's ... so the inner function receives it
})
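To get the single master output the question asks for, one option (a sketch, assuming each element of L comes out as a plain numeric data frame with matching columns; as the poster's own solution below suggests, an unlist/as.data.frame step may be needed first) is to add the list elements cell by cell with Reduce:
master_output <- Reduce(`+`, L)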
Found a solution: it's kind of wonky, but it works.
At the beginning of the function, I establish the following vectors:
costs <- c()
lows <- c()
fits <- c()
highs <- c()
Those are in accordance with the four columns of my output. Then, at the end of the loop, after the -> output statement, I run this:
costs[i] <- output[1]
lows[i] <- output[2]
fits[i] <- output[3]
highs[i] <- output[4]
For some reason it wouldn't work to simply assign each full df to a list of dfs; I had to do it vector by vector. Then, once each vector is stored with the full extent of my index, I ran this:
costs <- sapply(costs, unlist) %>% rowSums %>% as.data.frame
lows <- sapply(lows, unlist) %>% rowSums %>% as.data.frame
fits <- sapply(fits, unlist) %>% rowSums %>% as.data.frame
highs <- sapply(highs, unlist) %>% rowSums %>% as.data.frame
output <- cbind(costs, lows, fits, highs)
output
...to end the function. So really weird all in all, but it works.