How to add up data frames from a loop within a function? - r

I have a function whose purpose is to predict revenue from cost. The twist is that I have many different data frames as inputs, each with its own corresponding model. The function loops through each data frame, predicts on it with its corresponding model, and outputs a prediction with confidence intervals. Now I need to find a way to add all of these prediction outputs up.
Here's a simplified example of what I'm doing. Feel free to skip it if you don't need it to answer the question (it might be hard to read). Note that each prediction output isn't a data frame of cost and revenue, but a summary of the revenues you can expect as the cost varies.
predictions <- function(df_list, model_list) {
  for(i in 1:length(df_list)) {
    sapply(seq(1, 2, .25), function(x) {
      df_list[[i]]$cost <- df_list[[i]]$cost * x
      predictions <- predict(model_list[[i]], df_list[[i]], interval = "confidence")
      temp <- cbind(df_list[[i]]$cost, predictions)
      output <- summarise(temp, Cost = sum(cost), Low = sum(lwr), Fit = sum(fit), Upper = sum(upr))
      output
    }) -> output
    output %>% t %>% as.data.frame -> output
  }
}
With the output for each index looking like this:
     Cost Lower_Rev  Fit_Rev Upper_Rev
1 2048884  18114566 20898884  24145077
2 2561105  21684691 25085853  29064495
3 3073326  25092823 29122421  33853693
4 3585547  28369901 33038060  38539706
5 4097768  31537704 36853067  43140547
I need some way to add together each output into one master output, whose cost and revenue values will be the sum of all others. Any ideas?

Your output simply gets overwritten on every iteration. You need to store each result somewhere, or simply use lapply/sapply, which collects the results for you. I also pass i as an explicit argument to the inner function instead of relying on R's scoping rules to find i outside it.
L <- lapply(seq_along(df_list), function(i) {
  out <- sapply(seq(1, 2, .25), function(x, i) {
    df <- df_list[[i]]
    df$cost <- df$cost * x  # scale cost by the current multiplier
    predictions <- predict(model_list[[i]], df, interval = "confidence")
    temp <- data.frame(cost = df$cost, predictions)  # named column so summarise() can find cost
    summarise(temp, Cost = sum(cost), Low = sum(lwr), Fit = sum(fit), Upper = sum(upr))
  }, i = i)  # hand i to the inner function explicitly
  out %>% t %>% as.data.frame
})

Found a solution: Kind of wonky, but it works.
At the beginning of the function, I establish the following vectors:
costs <- c()
lows <- c()
fits <- c()
highs <- c()
Those are in accordance with the four columns of my output. Then, at the end of the loop, after the -> output statement, I run this:
costs[i] <- output[1]
lows[i] <- output[2]
fits[i] <- output[3]
highs[i] <- output[4]
For some reason it wouldn't work to simply assign each full df to a list of dfs; I had to do it vector by vector. Then, once each vector is stored with the full extent of my index, I ran this:
costs <- sapply(costs, unlist) %>% rowSums %>% as.data.frame
lows <- sapply(lows, unlist) %>% rowSums %>% as.data.frame
fits <- sapply(fits, unlist) %>% rowSums %>% as.data.frame
highs <- sapply(highs, unlist) %>% rowSums %>% as.data.frame
output <- cbind(costs, lows, fits, highs)
output
...to end the function. So really weird all in all, but it works.
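As an alternative to the vector-by-vector bookkeeping, the per-model outputs can be kept in a list and summed element-wise with Reduce(). This is only a sketch: the toy data frames below are made up, and it assumes every output is a plain numeric data frame with the same dimensions and column order (not the list-columns that t() on an sapply() result can produce).

```r
# Hypothetical per-model outputs with identical shape and column names
out1 <- data.frame(Cost = c(10, 20), Low = c(1, 2), Fit = c(3, 4), Upper = c(5, 6))
out2 <- data.frame(Cost = c(30, 40), Low = c(2, 3), Fit = c(4, 5), Upper = c(6, 7))
outputs <- list(out1, out2)

# `+` on two numeric data frames adds them cell by cell,
# so Reduce() folds the whole list into one master data frame
master <- Reduce(`+`, outputs)
master
```

Inside a loop, each result would be stored with double brackets, as in outputs[[i]] <- output, before the final Reduce() call.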

Related

Access dataframe from inside a function from the outside

I wrote a function that increases a randomly chosen c% of the data by i%. I need to do this on multiple data frames. The function does that, but I am unable to access the processed value from outside the function.
library(tidyverse)
value <- iris[1:120,]
iris1 <- value[2:95,]
set.seed(42)
attackfunc <- function(day, dataattack, howmuchattack) {
  shuffled <- day[sample(1:nrow(day)), ]
  n <- as.integer((dataattack/100)*nrow(day)) # select percentage of data to be changed
  extracted <- shuffled[1:n, ]
  extracted$changedload <- extracted[,1]*((howmuchattack/100)+1) # how much the data changes
  pertubeddata <- shuffled %>% mutate(Sepal.Length = ifelse(row_number() <= n, extracted$changedload, Sepal.Length))
  reshuffled <- pertubeddata[order(as.numeric(rownames(pertubeddata))),]
  reshuffled
}
I would like to access reshuffled from outside the function so that I can use that to do some more calculations.
Thank you.
Set reshuffled as an empty object before your function, then inside the function change <- to <<-, as in:
reshuffled <<- pertubeddata[order(as.numeric(rownames(pertubeddata))),]
Hope this is useful.
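Since the function already ends with reshuffled, that last expression is what it returns, so the result can also be captured by assigning the call to a name, without <<-. A minimal sketch with a simplified stand-in function (scale_first_column is hypothetical, not the asker's attackfunc):

```r
# Hypothetical stand-in: scales the first column and returns the modified copy
scale_first_column <- function(df, pct) {
  df[[1]] <- df[[1]] * (pct / 100 + 1)
  df  # the last evaluated expression is the function's return value
}

iris1 <- iris[2:95, ]
result <- scale_first_column(iris1, 10)  # capture the returned data frame

# `result` is now available outside the function; `iris1` is unchanged
head(result$Sepal.Length)
```

The same pattern applied to the function in the question would be something like result <- attackfunc(iris1, 20, 10), where the argument values are placeholders.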

Extending an sapply to apply list of variables and saving output as list of data frames in R

I have a data set similar to the example below (complex sample data). Thanks to SO user IRTFM, I was able to adapt the code and save the results (I'm only interested in the total proportions, not the confidence intervals) as a reshaped object for further processing. What I would like to do is extend this sapply to generate results for 20 other variables. Ideally, I would like to save the results as data frames in a list, since I think this is the most efficient way. My struggle is how to extend the sapply so that I can process multiple variables at once. I thought about a for loop over a list that holds the names of the variables and started to make this list (var_list below), but this does not seem to be the way forward. I'd rather take advantage of the apply family, since I would like the results to be stored in a list.
library(survey) # using the `dclus1` object that is standard in the examples.
library(reshape)
library(tidyverse)
data(api)
stype_t <- sapply( levels(dclus1$variables$stype),
function(x){
form <- as.formula( substitute( ~I(stype %in% x), list(x=x)))
z <- svyciprop(form, dclus1, method="me", df=degf(dclus1))
c( z, c(attr(z,"ci")) )} ) %>%
as.data.frame() %>% slice(1) %>% reshape::melt() %>% dplyr::mutate(value = round(value, digits = 4)*100)
Let's say you then wanted to repeat the above using the variable awards. You could copy the lines and do it that way, but it would be better to be more efficient. So I started by making a list of the names of the two variables in this example data, but I am stumped as to how to apply this list to the code above and retain the results in a list of data frames. I tried wrapping the sapply in an lapply, but this did not work, presumably because I set it up incorrectly. Any advice or thoughts would be appreciated.
var_list <- list("stype", "awards")
Instead of using $ to reference named elements, consider the [[ extractor, which accepts names as strings. Also, extend substitute to insert the variable name dynamically:
# DEFINED METHOD
df_build <- function(var) {
  sapply(levels(dclus1$variables[[var]]), function(x) {
    form <- as.formula(substitute(~I(var %in% x),
                                  list(var=as.name(var), x=x)))
    z <- svyciprop(form, dclus1, method="me", df=degf(dclus1))
    c(z, c(attr(z,"ci")))
  }) %>%
    as.data.frame() %>%
    slice(1) %>%
    reshape::melt() %>%
    dplyr::mutate(value = round(value, digits = 4)*100)
}
# ITERATE THROUGH LIST OF VARIABLE NAMES AND CALL METHOD
var_list <- list("stype", "awards")
df_list <- lapply(var_list, df_build)
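To see what the substitute() call builds without needing the survey package, the formula construction can be tried on its own. In this sketch build_formula is a hypothetical helper name; as.name() turns the string into a symbol so that the formula refers to the column rather than to a literal string.

```r
# Hypothetical helper isolating the formula-building step
build_formula <- function(var, level) {
  as.formula(substitute(~ I(var %in% level),
                        list(var = as.name(var), level = level)))
}

f <- build_formula("awards", "Yes")
f             # a one-sided formula testing awards %in% "Yes"
all.vars(f)   # the only variable referenced is the awards column
```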

Simulating samples from Gamma distribution in R

I'm having trouble with a programming assignment.
From the previous questions, I have a list of 49 elements.
Each element is sample data of size=10000. For the last question, I have to calculate the mean of the first n sample values.
With n between one and ten thousand, within each dataset.
I then have to plot these running averages for each data set.
I've been trying to create lists/vectors of the running averages but it's not working out.
Is there anything I can do?
Function for running average:
run_avg <- function(x, n_max){
a <- c(1:n_max)
r_avg <- sapply(a, FUN = function(y) mean(x[1:y]))
return(r_avg)
}
In your case, n_max should equal 10000.
This function creates the running averages for one dataset.
It then has to be applied to all datasets. You could use lapply for this if your datasets are stored in a list. Another approach would be a for loop.
Edit: I see that your datasets are in a list, so simply use:
lapply(my_list, run_avg, n_max = 10000)
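The sapply-based function above recomputes mean(x[1:y]) from scratch for every y, which gets slow for 10000 values across 49 datasets. The same running averages can be obtained in a single pass by dividing the cumulative sum by the position. A small sketch, where run_avg_fast is a hypothetical name and the sample is made up:

```r
# Running mean of the first k values, for k = 1..length(x), in one pass
run_avg_fast <- function(x) cumsum(x) / seq_along(x)

set.seed(1)
x <- rgamma(20, shape = 2)  # made-up sample, much smaller than the 10000 in the question

fast <- run_avg_fast(x)
slow <- sapply(seq_along(x), function(k) mean(x[1:k]))
all.equal(fast, slow)  # compares the two approaches up to floating-point tolerance
```

With the datasets in a list, this would be applied as lapply(my_list, run_avg_fast).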
The running averages can be computed with the following.
res <- lapply(x, function(y){
sapply(seq_along(y), function(k) mean(y[1:k]))
})
Then in order to have the resulting list in a format better for plotting with package ggplot2, format it as a data frame first, with the row names as a column.
df_res <- do.call(cbind.data.frame, res)
names(df_res) <- paste("Mean", seq_len(ncol(df_res)), sep = ".")
df_res <- cbind(df_res, id = as.integer(row.names(df_res)))
Now reshape from wide to long and plot.
library(tidyverse)
df_res %>%
pivot_longer(
cols = starts_with("Mean"),
names_to = "Vector",
values_to = "Mean"
) %>%
ggplot(aes(id, Mean, colour = Vector)) +
geom_point() +
geom_line()
Test data.
set.seed(1234)
list_size <- 4 # 49 in the question
samp_size <- 20 # 10000 in the question
x <- lapply(seq.int(list_size), function(i) rgamma(samp_size, shape = i))

Creating a z-test over rows using dplyr mutate

I have a dataframe called clusters with various counts and totals and want to perform a simple two-prop z-test over rows to create a new column called pvals. When I use this code, pvals ends up being zero. When I replace the z-test with a simple sum() function, pvals becomes a summation for all values. This should be quite trivial, and I'm aware I could just alternatively iterate through, but I can't seem to figure out why this doesn't work. Thanks for the help.
clusters<- read.csv(file="clusters.csv", header=TRUE, sep=",")
clusters <- clusters %>% mutate(
pvals = prop.test(x=c(Charter.Count,Other.Count),n=c(Charter.Total,Other.Total), alternative="two.sided")$p.value
)
I can't test this as I don't have your clusters dataset, but it sounds like you need to use rowwise():
clusters<- read.csv(file="clusters.csv", header=TRUE, sep=",")
clusters <- clusters %>%
rowwise() %>%
mutate(
pvals = prop.test(x=c(Charter.Count,Other.Count),n=c(Charter.Total,Other.Total), alternative="two.sided")$p.value
)
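Without rowwise(), c(Charter.Count, Other.Count) concatenates the entire columns, so prop.test() runs once on all rows together and its single p-value is recycled down the column. A base R way to get one test per row is mapply(). The sketch below uses a made-up clusters data frame with the same column names as in the question.

```r
# Made-up stand-in for clusters.csv
clusters <- data.frame(
  Charter.Count = c(30, 45, 60),
  Charter.Total = c(100, 100, 100),
  Other.Count   = c(40, 45, 20),
  Other.Total   = c(100, 120, 100)
)

# One two-sample proportion test per row
clusters$pvals <- mapply(
  function(x1, x2, n1, n2) {
    prop.test(x = c(x1, x2), n = c(n1, n2), alternative = "two.sided")$p.value
  },
  clusters$Charter.Count, clusters$Other.Count,
  clusters$Charter.Total, clusters$Other.Total
)
clusters
```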

XML2-Package: How to treat empty Nodes?

I am trying to extract some data from an HTML page. I have 500 nodes, each of which should contain a date, a title and a summary. By using
url <- "https://www.bild.de/suche.bild.html?type=article&query=Migration&resultsPerPage=1000"
html_raw <- xml2::read_html(url)
main_node <- xml_find_all(html_raw, "//section[@class='query']/ol") %>%
  xml_children()
xml_find_all(main_node, ".//time") # time
xml_find_all(main_node, ".//span[@class='headline']") # title
xml_find_all(main_node, ".//p[@class='entry-content']") # summary
I get three vectors with dates, titles and summaries, which can then be combined. At least in theory. Unfortunately, my code finds 500 dates and 500 titles but only 499 summaries, because one of the summary nodes is simply missing.
This leaves me with the problem that I cannot bind the results into a data frame because of the difference in length. The summaries wouldn't line up with the correct dates and titles.
An easy solution would be to loop through the nodes and replace the empty node with a placeholder like "NA".
dates <- c()
titles <- c()
summaries <- c()
for(i in 1:length(main_node)){
  date_temp <- xml_find_all(main_node[i], ".//time") %>%
    xml_text(trim = TRUE) %>%
    as.Date(format = "%d.%m.%Y")
  title_temp <- xml_find_all(main_node[i], ".//span[@class='headline']") %>%
    xml_text(trim = TRUE)
  summary_temp <- xml_find_all(main_node[i], ".//p[@class='entry-content']") %>%
    xml_text(trim = TRUE)
  if(length(summary_temp) == 0) summary_temp <- "NA"
  dates <- c(dates, date_temp)
  titles <- c(titles, title_temp)
  summaries <- c(summaries, summary_temp)
}
But this makes a simple three-line snippet unnecessarily long. So my question, I guess, is: is there a more sophisticated approach than a loop?
You could use the purrr library to avoid the explicit loop:
library(purrr)
dates <- main_node %>% map_chr(. %>% xml_find_first(".//time") %>% xml_text())
titles <- main_node %>% map_chr(. %>% xml_find_first(".//span[@class='headline']") %>% xml_text())
summaries <- main_node %>% map_chr(. %>% xml_find_first(".//p[@class='entry-content']") %>% xml_text())
This uses the fact that xml_find_first will return NA if an element is not found, as pointed out by @Dave2e.
But also in general growing a list by appending each iteration in a loop is very inefficient in R. It's better to pre-allocate the vector (since it will be of a known length) and then assign values each iteration to the proper slot (out[i] <- val). There's not really anything wrong with loops themselves in R; it's really just memory reallocation that can slow things down.
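The pre-allocation pattern described above can be sketched as follows. The length and the values are made up for illustration. In the scraping case the length would be length(main_node).

```r
n <- 5                                  # known number of iterations (made up here)
summaries <- character(n)               # pre-allocate the result vector once

for (i in seq_len(n)) {
  summaries[i] <- paste("summary", i)   # fill slot i instead of growing with c()
}
summaries
```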
