I wrote a function that randomly increases c% of the data by i%. I need to do this on multiple data frames. The function does that, but I am unable to access the processed value from outside the function.
library(tidyverse)
value <- iris[1:120,]
iris1 <- value[2:95,]
set.seed(42)
attackfunc <- function(day, dataattack, howmuchattack) {
  shuffled <- day[sample(1:nrow(day)), ]
  n <- as.integer((dataattack / 100) * nrow(day)) # select percentage of data to be changed
  extracted <- shuffled[1:n, ]
  extracted$changedload <- extracted[, 1] * ((howmuchattack / 100) + 1) # how much the data changes
  pertubeddata <- shuffled %>% mutate(Sepal.Length = ifelse(row_number() <= n, extracted$changedload, Sepal.Length))
  reshuffled <- pertubeddata[order(as.numeric(rownames(pertubeddata))), ]
  reshuffled
}
I would like to access reshuffled from outside the function so that I can use that to do some more calculations.
Thank you.
Define reshuffled as an empty object before your function, then inside the function change <- to <<-, as in:
reshuffled <<- pertubeddata[order(as.numeric(rownames(pertubeddata))),]
Hope this is useful
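As an alternative to <<-, the function in the question already returns its last expression, so the result can simply be assigned at the call site. The following is a minimal sketch, not the asker's exact function: attack() is a hypothetical base-R variant that perturbs Sepal.Length in place, and frames is made-up example input.

```r
# Hypothetical base-R variant: returns the modified data frame
# instead of writing to a global variable.
attack <- function(day, pct_rows, pct_increase) {
  n <- as.integer(pct_rows / 100 * nrow(day))   # number of rows to perturb
  idx <- sample(nrow(day), n)                   # pick those rows at random
  day$Sepal.Length[idx] <- day$Sepal.Length[idx] * (1 + pct_increase / 100)
  day                                           # last expression is the return value
}

set.seed(42)
# Made-up list of data frames to process in one go
frames <- list(a = iris[1:50, ], b = iris[51:100, ])
results <- lapply(frames, attack, pct_rows = 20, pct_increase = 10)
```

Each element of results can then be used for further calculations, for example results$a, without any global state.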
I have a data set similar to the example below: complex sample data. Thanks to SO user IRTFM, I was able to adapt the code and save the results (I'm only interested in the total proportions, not the confidence intervals) as a reshaped object for further processing. What I would like to do is extend this sapply to generate results for 20 other variables. Ideally, I would like to save the results as data frames in a list, since I think this is the most efficient way. My struggle is how to extend the sapply so that I can process multiple variables at once. I thought about a for loop over a list that holds the names of the variables and started to make this list, var_list below, but this does not seem to be the way forward. I'd rather take advantage of the apply family, since I would like the results to be stored in a list.
library(survey) # using the `dclus1` object that is standard in the examples.
library(reshape)
library(tidyverse)
data(api)
stype_t <- sapply(levels(dclus1$variables$stype),
                  function(x) {
                    form <- as.formula(substitute(~I(stype %in% x), list(x = x)))
                    z <- svyciprop(form, dclus1, method = "me", df = degf(dclus1))
                    c(z, c(attr(z, "ci")))
                  }) %>%
  as.data.frame() %>%
  slice(1) %>%
  reshape::melt() %>%
  dplyr::mutate(value = round(value, digits = 4) * 100)
Let's say you then wanted to repeat the above using the variable awards. You could copy the lines and do it that way, but it would be better to be more efficient. So I started by making a list of the names of the two variables in this example data, but I am stumped as to how to apply this list to the code above and retain the results in a list of data frames. I tried wrapping the sapply in an lapply, but this did not work, probably because I set it up incorrectly. Any advice or thoughts would be appreciated.
var_list <- list("stype", "awards")
Instead of $ to reference named elements, consider [[ extractor to reference names by string. Also, extend substitute for dynamic variable:
# DEFINED METHOD
df_build <- function(var) {
  sapply(levels(dclus1$variables[[var]]), function(x) {
    form <- as.formula(substitute(~I(var %in% x),
                                  list(var = as.name(var), x = x)))
    z <- svyciprop(form, dclus1, method = "me", df = degf(dclus1))
    c(z, c(attr(z, "ci")))
  }) %>%
    as.data.frame() %>%
    slice(1) %>%
    reshape::melt() %>%
    dplyr::mutate(value = round(value, digits = 4) * 100)
}
# ITERATE THROUGH CHARACTER VECTOR AND CALL METHOD
var_list <- list("stype", "awards")
df_list <- lapply(var_list, df_build)
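To see what the substitute() call inside df_build produces, it may help to build the formula on its own, without the survey design. This is a small base-R sketch; make_form is a hypothetical helper name, and "stype" and "E" are just example inputs.

```r
# Hypothetical helper: builds ~I(<var> %in% <x>) from a variable name
# given as a string and a level value.
make_form <- function(var, x) {
  as.formula(substitute(~I(var %in% x),
                        list(var = as.name(var), x = x)))
}

f <- make_form("stype", "E")
print(f)              # shows the formula with the symbol substituted in
vars_used <- all.vars(f)
```

The key point is as.name(var): it turns the string into a symbol, so the formula refers to the column itself rather than to a character constant.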
Hello I want to apply dplyr arrange function on a column within a for loop, but for some reason it does not work. Here is a minimal example:
for (j in colnames(df1)[3:ncol(df1)]) {
  # create data frame for each column
  t <- select(df1, all_of(j))
  t %>% arrange(j)
  var_list[[j]] <- t
  for (i in var_list[[j]]$Timestep)
    ### arrange each timestep df by each column once
    scenario[[i+1]] <- var_list[[j]][min:max, ]
  # subset data to the scenarios of interest
}
I guess the problem is that j holds a character string such as "variable", whereas dplyr's arrange expects a bare column name without quotes. I have tried as.name(), paste() and eval(parse()), but none of them worked. Any ideas? Thank you!
This seems to work:
df1 <- mtcars
var_list <- list()
for (j in colnames(df1)[3:ncol(df1)]) {
  # create data frame for each column
  t <- select(df1, all_of(j))
  var_list[[j]] <- t %>% arrange_at(1)
  for (i in var_list[[j]]$Timestep)
    ### arrange each timestep df by each column once
    scenario[[i+1]] <- var_list[[j]][min:max, ]
  # subset data to the scenarios of interest
}
var_list
Unfortunately, that does not work. It should sort column j in ascending order. Maybe arrange is not the best function to use; however, order and sort do not work either. I can't figure out what I am doing wrong. However, I found a workaround using the long data format:
for (j in colnames(df1)[3:ncol(df1)]) {
  # create data frame for each timestep
  t <- select(df1, Trial, Timestep, all_of(j))
  # melt() is assumed to come from reshape2 (or reshape)
  var_list[[j]] <- melt(t, id.vars = c("Trial", "Timestep"))
  var_list[[j]] <- var_list[[j]] %>% arrange(value)
}
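For sorting by a column whose name is held in a string, one option that stays in wide format is the .data pronoun, which dplyr verbs accept inside their arguments. The sketch below uses mtcars as stand-in data (an assumption, since the original df1 is not shown) and leaves out the Timestep and scenario steps.

```r
library(dplyr)

df1 <- mtcars          # stand-in data; the real df1 is not shown in the question
var_list <- list()

for (j in colnames(df1)[3:ncol(df1)]) {
  # .data[[j]] looks up the column named by the string in j,
  # and the sorted result is assigned rather than discarded
  var_list[[j]] <- df1 %>%
    select(all_of(j)) %>%
    arrange(.data[[j]])
}
```

Note that the original loop called t %>% arrange(j) without assigning the result, so even a correct arrange call would have had no visible effect there.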
I'm having trouble with a programming assignment.
From the previous questions, I have a list of 49 elements.
Each element is sample data of size=10000. For the last question, I have to calculate the mean of the first n sample values.
With n between one and ten thousand, within each dataset.
I then have to plot these running averages for each data set.
I've been trying to create lists/vectors of the running averages but it's not working out.
Is there anything I can do?
Function for running average:
run_avg <- function(x, n_max) {
  a <- c(1:n_max)
  r_avg <- sapply(a, FUN = function(y) mean(x[1:y]))
  return(r_avg)
}
In your case, n_max should equal 10000.
For a single dataset, this function returns the vector of running averages.
It then has to be applied to all datasets. If your datasets are stored in a list, you can use lapply for this; a for loop would be another option.
Edit: I see that your datasets are in a list, so simply use:
lapply(my_list, run_avg, n_max = 10000)
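With 10000 values per dataset, recomputing mean(x[1:y]) for every y does a lot of repeated work. A running mean can also be written as the cumulative sum divided by the position, which is a single vectorised pass. The sketch below uses a hypothetical name, run_avg_fast, and small made-up test data, and compares the result against the sapply version.

```r
# Running average as cumulative sum divided by the number of values so far
run_avg_fast <- function(x) cumsum(x) / seq_along(x)

set.seed(1234)
x <- rgamma(20, shape = 2)   # small made-up sample for checking

# Reference result using the straightforward definition
slow <- sapply(seq_along(x), function(k) mean(x[1:k]))
fast <- run_avg_fast(x)

same <- isTRUE(all.equal(fast, slow))
```

For a list of datasets, this would be applied as lapply(my_list, run_avg_fast).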
The running averages can be computed with the following.
res <- lapply(x, function(y) {
  sapply(seq_along(y), function(k) mean(y[1:k]))
})
Then in order to have the resulting list in a format better for plotting with package ggplot2, format it as a data frame first, with the row names as a column.
df_res <- do.call(cbind.data.frame, res)
names(df_res) <- paste("Mean", seq_len(ncol(df_res)), sep = ".")
df_res <- cbind(df_res, id = as.integer(row.names(df_res)))
Now reshape from wide to long and plot.
library(tidyverse)
df_res %>%
  pivot_longer(
    cols = starts_with("Mean"),
    names_to = "Vector",
    values_to = "Mean"
  ) %>%
  ggplot(aes(id, Mean, colour = Vector)) +
  geom_point() +
  geom_line()
Test data.
set.seed(1234)
list_size <- 4 # 49 in the question
samp_size <- 20 # 10000 in the question
x <- lapply(seq.int(list_size), function(i) rgamma(samp_size, shape = i))
I have a data frame, and I want to pass a column name to a function and then filter on that column.
I have read a few tutorials on this, and it seems to be related to non-standard evaluation in R.
I can't seem to wrap my head around the examples in the blog posts I have read.
For simplicity, I have taken the iris dataset, and I want to pass a column to a function that filters the dataset to the rows where that column's value is at least one.
mydf <- iris
filter_measurements <- function(mydf, measurement){
  mydf <- filter(measurement >= 1)
  mydf
}
mydf %>%
filter_measurements(measurement = Petal.Width)
Do I have to add something to my function so that R knows I want a column, rather than treating it as the string 'Petal.Width', for example?
I have seen "Passing a variable name to a function in R", which I was unable to adapt to my example.
Thank you all for your time
A great resource for this is Programming with dplyr.
mydf <- iris
filter_measurements <- function(mydf, measurement){
  measurement <- enquo(measurement)
  mydf <- filter(mydf, (!!measurement) >= 1)
  mydf
}
mydf %>%
filter_measurements(measurement = Petal.Width)
You have to tell the function that measurement is given as a bare variable name. To do this, first use enquo to capture the expression supplied in the measurement argument, without evaluating it, and store it as a quosure. Then !! in front of measurement unquotes that quosure inside filter, so filter evaluates the captured column name against the data frame instead of looking for an object called measurement.
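In more recent versions of dplyr and rlang, the enquo and !! pair can be written in one step with the embrace operator {{ }}. The following is a minimal sketch of the same function in that style, assuming a dplyr version that supports {{ }} (1.0 or later is a safe assumption).

```r
library(dplyr)

filter_measurements <- function(mydf, measurement) {
  # {{ measurement }} captures the bare column name and unquotes it in place
  filter(mydf, {{ measurement }} >= 1)
}

res <- iris %>%
  filter_measurements(measurement = Petal.Width)
```

The call site is unchanged: the column is still passed as a bare name.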
Alternative
You can also pass the column you want to filter on as a string and use filter_ (note that filter_ and the other underscore verbs are deprecated in current dplyr releases):
filter_measurements <- function(mydf, measurement){
  mydf <- filter_(mydf, paste0(measurement, " >= 1"))
  mydf
}
mydf %>%
filter_measurements(measurement = "Petal.Width")
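Since the underscore verbs are deprecated, a string column name can instead be looked up with the .data pronoun inside the regular filter. This is a short sketch of that approach; filter_by_name is a hypothetical function name used to keep it separate from the versions above.

```r
library(dplyr)

filter_by_name <- function(mydf, measurement) {
  # measurement is a string; .data[[...]] retrieves the column of that name
  filter(mydf, .data[[measurement]] >= 1)
}

res_str <- iris %>%
  filter_by_name(measurement = "Petal.Width")
```

This avoids pasting code together as text, which is what the filter_ version does.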
You have to pass the column name either as character or integer index of the column. Also, the line
mydf <- filter(measurement >= 1)
within your function never states what is being filtered and will expect the "measurement" to be a stand-alone object, not a part of a data frame.
Try this:
filter_measurements <- function(mydf, measurement)
{
  # [[ ]] extracts the column as a vector, for data frames and tibbles alike
  mydf <- filter(mydf, mydf[[measurement]] >= 1)
  mydf
}
iris %>% filter_measurements("Petal.Width")
A more convoluted invocation of the function would also work:
iris %>% filter_measurements(which(names(.)=="Petal.Width"))
I have a function whose purpose is to predict revenue from cost. The twist is that I have many different data frames as inputs, and many different corresponding models to predict with. The function loops through each data frame, predicts on it with its corresponding model, and outputs a prediction with confidence intervals. Now I need to find a way to add all of these prediction outputs up.
Here's a simplified example of what I'm doing. Feel free to skip it if you don't need it to answer the question (it might be hard to read). Note that each prediction output isn't a data frame of cost and revenue, but a summary of what revenues you can expect from a variable cost.
predictions <- function(df_list, model_list) {
  for(i in 1:length(df_list)) {
    sapply(seq(1, 2, .25), function(x) {
      df_list[[i]]$cost <- df_list[[i]]$cost * x
      predictions <- predict(model_list[[i]], df_list[[i]], interval = "confidence")
      temp <- cbind(df_list[[i]]$cost, predictions)
      output <- summarise(temp, Cost = sum(cost), Low = sum(lwr), Fit = sum(fit), Upper = sum(upr))
      output
    }) -> output
    output %>% t %>% as.data.frame -> output
  }
}
With the output for each index looking like this:
     Cost Lower_Rev  Fit_Rev Upper_Rev
1 2048884  18114566 20898884  24145077
2 2561105  21684691 25085853  29064495
3 3073326  25092823 29122421  33853693
4 3585547  28369901 33038060  38539706
5 4097768  31537704 36853067  43140547
I need some way to add together each output into one master output, whose cost and revenue values will be the sum of all others. Any ideas?
Your output is simply overwritten on every iteration of the loop. You need to store each result somewhere, or just use lapply/sapply. I also pass i explicitly as an argument, rather than relying on R's scoping rules to pick up i from an enclosing environment.
L <- lapply(seq_along(df_list), function(i) {
  sapply(seq(1, 2, .25), function(x, i) {
    df_list[[i]]$cost <- df_list[[i]]$cost * x
    predictions <- predict(model_list[[i]], df_list[[i]], interval = "confidence")
    # build a data frame so that summarise() can refer to the columns by name
    temp <- data.frame(cost = df_list[[i]]$cost, predictions)
    output <- summarise(temp, Cost = sum(cost), Low = sum(lwr), Fit = sum(fit), Upper = sum(upr))
    output %>% t %>% as.data.frame
  }, i = i)  # forward i to the inner function
})
Found a solution: Kind of wonky, but it works.
At the beginning of the function, I establish the following vectors:
costs <- c()
lows <- c()
fits <- c()
highs <- c()
Those are in accordance with the four columns of my output. Then, at the end of the loop, after the -> output statement, I run this:
costs[i] <- output[1]
lows[i] <- output[2]
fits[i] <- output[3]
highs[i] <- output[4]
For some reason it wouldn't work to simply assign each full df to a list of dfs; I had to do it vector by vector. Then, once each vector is stored with the full extent of my index, I ran this:
costs <- sapply(costs, unlist) %>% rowSums %>% as.data.frame
lows <- sapply(lows, unlist) %>% rowSums %>% as.data.frame
fits <- sapply(fits, unlist) %>% rowSums %>% as.data.frame
highs <- sapply(highs, unlist) %>% rowSums %>% as.data.frame
output <- cbind(costs, lows, fits, highs)
output
...to end the function. So really weird all in all, but it works.
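If each per-model output is a numeric data frame with the same rows and columns, the vector-by-vector bookkeeping can likely be avoided: collect the outputs in a list and add them element-wise with Reduce. The sketch below uses two small made-up data frames (out1 and out2 are hypothetical, not the real prediction outputs) to show the idea.

```r
# Made-up per-model outputs with identical shape and column names
out1 <- data.frame(Cost = c(1, 2), Low = c(10, 20), Fit = c(11, 22), Upper = c(12, 24))
out2 <- data.frame(Cost = c(3, 4), Low = c(30, 40), Fit = c(33, 44), Upper = c(36, 48))

outputs <- list(out1, out2)

# `+` on two data frames adds them cell by cell; Reduce folds it over the list
master <- Reduce(`+`, outputs)
print(master)
```

Inside the original function, this would mean storing each output in a list slot, for example outputs[[i]] <- output, and returning Reduce(`+`, outputs) after the loop.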