How do I iterate over variables in R from a Python perspective? - r

Let's say I have this variable in R, data:
data.orange.lm = lm(...)
data.orange.avg = mean(...)
data.orange.sd = sd(...)
data.pear.lm = lm(...)
data.pear.sd = sd(...)
...
data.plum.sd = sd(...)
data.plum.summary = summary(lm(...))
How can I programmatically iterate over data? In Python, iteritems on a dictionary gives you the keys and their respective values. Is there an R equivalent?

Store everything in a nested list (create the empty list first, otherwise data refers to the built-in data() function):
data = list()
data[["orange"]][["lm"]] = lm(...)
data[["orange"]][["avg"]] = mean(...)
data[["orange"]][["sd"]] = sd(...)
data[["pear"]][["lm"]] = lm(...)
data[["pear"]][["sd"]] = sd(...)
Then use the apply family of commands.
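As a rough illustration of the nested-list idea, the sketch below builds a small list and walks it by name, which is about the closest base R analogue to looping over a Python dictionary's keys and values. The numbers are made-up stand-ins (the original lm(...), mean(...) and sd(...) calls are elided), and only the "avg" and "sd" entries are shown.

```r
# Made-up stand-in values; the real lm(...)/mean(...)/sd(...) inputs are elided.
orange_values <- c(1, 2, 3, 4, 5)
pear_values   <- c(2, 4, 6, 8, 10)

data <- list(
  orange = list(avg = mean(orange_values), sd = sd(orange_values)),
  pear   = list(sd = sd(pear_values))
)

# Key/value style iteration: loop over the names, then look up each value.
for (fruit in names(data)) {
  for (stat in names(data[[fruit]])) {
    cat(fruit, stat, format(data[[fruit]][[stat]]), "\n")
  }
}

# The apply family does the same without an explicit loop,
# here pulling the "sd" entry out of every fruit.
sds <- sapply(data, function(entry) entry[["sd"]])
```

The result sds is a numeric vector named by fruit, so the keys travel with the values.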

For data that is a bit like this:
data <- data.frame(
fruit = rep(c("orange", "pear", "plum"), each = 10),
value = rnorm(30)
)
You can use tapply to get stats by fruit, and sapply to loop over functions.
fns <- c("mean", "sd")
stats <- sapply(fns, function(f) with(data, tapply(value, fruit, f)))
(If some of your functions don't return single numbers, then use lapply rather than sapply.)
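A minimal sketch of the lapply variant mentioned above, assuming the same simulated data frame. The choice of range and quantile is only an example of functions that return more than one number.

```r
set.seed(1)
data <- data.frame(
  fruit = rep(c("orange", "pear", "plum"), each = 10),
  value = rnorm(30)
)

# range() returns 2 numbers and quantile() returns 5, so sapply cannot
# simplify the result into a matrix; lapply keeps one list entry per function.
fns <- c("range", "quantile")
stats <- lapply(fns, function(f) with(data, tapply(value, fruit, f)))
names(stats) <- fns

# Each entry is indexed by fruit, for example:
orange_range <- stats$range[["orange"]]
```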

Related

How to evaluate a function with different arguments without having to keep writing it out in R [duplicate]

This question already has answers here:
Grouping functions (tapply, by, aggregate) and the *apply family
I have a function for which I would like to keep everything fixed apart from a single argument.
ls <- score_model_compound(data, pred, tmp$Prediction, score= "log")
bs <- score_model_compound(data, pred, tmp$Prediction, score="Brier")
ss <- score_model_compound(data, pred, score="spherical")
What I would like is something like
ls = data.frame()
ls <- score_model_compound(data, pred, score= c("log", "Brier", "spherical"))
Is there a function I can use, like apply(), which lets me do this?
Thank you
You can create some kind of wrapping function with only the first argument being the one you want to vary and then pass it to lapply:
## Creating the wrapping function
my.wrapping.function <- function(score, data, pred, tmp) {
return(score_model_compound(data = data,
pred = pred,
tmp = tmp,
score = score))
}
## Making the list of variables
my_variables <- as.list(c("log", "Brier", "spherical"))
## Running the function for all the variables (with the set specific arguments)
result_list <- lapply(my_variables,
my.wrapping.function,
data = data, pred = pred, tmp = tmp$Prediction)
And finally, to transform it into a data.frame (or matrix), you can use the do.call(cbind, ...) function on the results:
## Combining the results into a table
my_results_table <- do.call(cbind, result_list)
Does that answer your question?
mapply() to the rescue:
score_v = c('spherical', 'log', 'Brier')
l = mapply(
score_model_compound, # function
score = score_v, # variable argument vector
MoreArgs = list(data = data, # fixed arguments
pred = pred),
SIMPLIFY = FALSE # don't simplify
)
You will probably have to tweak it a little yourself, since you didn't provide a reproducible example. mapply(SIMPLIFY = FALSE) returns a list. If the function returns data.frames, the resulting list of data.frames can subsequently be bound with e.g. data.table::rbindlist().
Alternatively you could just use a loop:
l = list()
for (i in seq_along(score_v)) {
sc = score_v[i]
message('Modeling: ', sc)
smc = score_model_compound(data, pred, score = sc)
l[[i]] = smc
names(l)[i] = sc
}
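Since score_model_compound() is not available here, the following sketch uses a hypothetical stand-in (score_stub, with made-up inputs) just to show the shape of the pattern: vary one argument with lapply, keep the rest fixed, then bind the pieces once at the end.

```r
# Hypothetical stand-in for score_model_compound(); it is NOT the real function.
score_stub <- function(data, pred, score) {
  data.frame(score = score, value = mean(abs(data - pred)))
}

obs  <- c(1, 0, 1, 1)          # made-up observations
pred <- c(0.8, 0.1, 0.6, 0.9)  # made-up predictions
score_v <- c("log", "Brier", "spherical")

# setNames() labels the result list by score, so no names(l)[i] bookkeeping is needed.
res_list <- lapply(setNames(score_v, score_v),
                   function(s) score_stub(obs, pred, score = s))

# Row-bind the per-score data.frames into one table.
res_tbl <- do.call(rbind, res_list)
```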

How do you use map() to apply function to data frame, when function calls for specific column input in R?

My goal is to apply wavelet analysis and image construction to a large data set of time series data, to be eventually used in a pipeline for time series clustering. The function for the first step is from WaveletComp, and I am using purrr's map() from the tidyverse. Ideally the output is a list labeled by column that I can then pass to other functions in the pipeline.
library(WaveletComp)
The data set has 3 columns and 6000 values
df <- data.frame(replicate(3,sample(-5:5,6000,rep=TRUE)))
wave_emg <- function(df) {
analyze.wavelet(my.data = df, my.series = "X1", loess.span =50,
dt=1, dj=1/250,
lowerPeriod = 32,
upperPeriod = 512,
make.pval = TRUE, n.sim = 100)
}
Solution <- mutate(model = map(df, wave_emg))
I get the following error: Error in my.data[, ind] : incorrect number of dimensions
It appears to me that the my.series argument of analyze.wavelet expects a single column to be specified. Is there a way to tell it to take each column in turn?
You could write a function which takes two inputs: the dataframe and a column name or position.
library(WaveletComp)
library(purrr)
ave_emg <- function(df, col) {
analyze.wavelet(my.data = df, my.series = col, loess.span =50,
dt=1, dj=1/250,
lowerPeriod = 32,
upperPeriod = 512,
make.pval = TRUE, n.sim = 100)
}
The analyze.wavelet function accepts either a column name or a column index for my.series, so you could use either of these versions:
#column names
result <- map(names(df), ave_emg, df = df)
#column index
result <- map(seq_along(df), ave_emg, df = df)
You can also replace map with lapply to get the same output.
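The same calling pattern can be tried without WaveletComp installed. In the sketch below, series_mean is a hypothetical stand-in for the two-argument wrapper, and setNames() is one way to get a result list labeled by column, as the question asked for.

```r
# Small made-up data frame standing in for the 3-column series data.
df <- data.frame(X1 = c(1, 2, 3), X2 = c(4, 5, 6), X3 = c(7, 8, 9))

# Hypothetical stand-in for ave_emg(df, col): same (df, col) signature,
# but it only computes a mean instead of running a wavelet analysis.
series_mean <- function(df, col) {
  mean(df[, col])
}

# The varying argument (col) comes from the vector being iterated over;
# the fixed argument (df) is passed by name.
result <- lapply(setNames(names(df), names(df)), series_mean, df = df)
```

Because the input vector is named, result carries the column names X1, X2 and X3.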
It looks like df needs to be split into single-column data frames before being passed to the function, to avoid the error from analyze.wavelet(). This code seems to work with this function, but @Ronak's code works with other functions.
library(tidyverse)
library(WaveletComp)
wave_emg <- function(df) {
# each split piece has exactly one column, so refer to it by position
analyze.wavelet(my.data = df, my.series = 1, loess.span =50,
dt=1, dj=1/250,
lowerPeriod = 32,
upperPeriod = 512,
make.pval = TRUE, n.sim = 100)
}
Solution <- df %>% split.default(.,seq_along(.)) %>% map(., wave_emg)

Is there an easy way to simplify this code using a loop in r?

I am working in RStudio and have a series of code blocks just like these. There are 34 in total, and I am wondering if there is an easy way to write the code once and have it loop over the defined variables rsqRow.a{#} and combineddfs.a{#} and the internally used variables s_{## 'State'}.
# s_WA.train.lr.Summary
rsqRow.a32 = summary(s_WA.train.lr)$r.squared
# rsqRow.a32
Coef = summary(s_WA.train.lr)$coef[,1] # row, column
CoefRows = data.frame(Coef)
Pval = summary(s_WA.train.lr)$coef[,4]
PvalRows = data.frame(Pval)
combineddfs.a32 <- merge(CoefRows, PvalRows, by=0, all=TRUE)
# combineddfs.a32
# s_WI.train.lr.Summary
rsqRow.a33 = summary(s_WI.train.lr)$r.squared
# rsqRow.a33
Coef = summary(s_WI.train.lr)$coef[,1] # row, column
CoefRows = data.frame(Coef)
Pval = summary(s_WI.train.lr)$coef[,4]
PvalRows = data.frame(Pval)
combineddfs.a33 <- merge(CoefRows, PvalRows, by=0, all=TRUE)
# combineddfs.a33
# s_WY.train.lr.Summary
rsqRow.a34 = summary(s_WY.train.lr)$r.squared
# rsqRow.a34
Coef = summary(s_WY.train.lr)$coef[,1] # row, column
CoefRows = data.frame(Coef)
Pval = summary(s_WY.train.lr)$coef[,4]
PvalRows = data.frame(Pval)
combineddfs.a34 <- merge(CoefRows, PvalRows, by=0, all=TRUE)
# combineddfs.a34
As already mentioned in the comments, you should store the objects in a list to avoid this kind of repetition.
Find a common pattern that matches all of your object names in the global environment and pass it as the pattern argument of ls to get a character vector of their names. You can then use mget to collect the objects into a list.
list_data <- mget(ls(pattern = 's_W.*train\\.lr'))
Once you have the list, you can use lapply to iterate over it and return the values that you want from the function. Note that there might be a simpler way to write what you have in your attempt; however, as I don't have the data, I am not going to risk shortening your code. Here I am returning rsqRow and combineddfs for each object. You can add or remove objects according to your preference.
all_values <- lapply(list_data, function(x) {
rsqRow = summary(x)$r.squared
Coef = summary(x)$coef[,1]
CoefRows = data.frame(Coef)
Pval = summary(x)$coef[,4]
PvalRows = data.frame(Pval)
combineddfs <- merge(CoefRows, PvalRows, by=0, all=TRUE)
list(rsqRow, combineddfs)
})
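To show how such a result list can be used afterwards, here is a small self-contained sketch. It assumes two toy models fitted on the built-in mtcars data (not the original s_*.train.lr objects) and names the returned elements so they can be extracted by name.

```r
# Toy models on built-in data, standing in for the s_*.train.lr objects.
models <- list(
  four = lm(mpg ~ wt, data = mtcars[mtcars$cyl == 4, ]),
  six  = lm(mpg ~ wt, data = mtcars[mtcars$cyl == 6, ])
)

all_values <- lapply(models, function(m) {
  s <- summary(m)  # compute the summary once and reuse it
  coefs <- data.frame(
    term = rownames(s$coefficients),
    Coef = s$coefficients[, 1],  # estimates
    Pval = s$coefficients[, 4],  # p-values
    row.names = NULL
  )
  list(rsq = s$r.squared, coefs = coefs)
})

# Pull one element out of every entry, e.g. all R-squared values at once.
rsq <- sapply(all_values, `[[`, "rsq")
```

Naming the returned elements (rsq, coefs) is a design choice for readability; the positional list(rsqRow, combineddfs) form works the same way with index 1 and 2.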

How to speed up parallel foreach in R

I want to calculate a series of approx 1.000.000 wilcox.tests in R:
result <- foreach(i = 1:ncol(data), .combine=bind_rows, .multicombine= TRUE, .maxcombine = 1000 ) %do% {
w = wilcox.test(data[,i]~as.factor(groups),exact = FALSE)
df <- data.frame(Characters=character(),
Doubles=double(),
Doubles=double(),
stringsAsFactors=FALSE)
df[1,] = c(colnames(data)[i], w$statistic, w$p.value)
rownames(df) = colnames(beta_t1)[i]
colnames(df) = c("cg", "statistic", "p.value")
return(df)
}
If I do it with %dopar% and 15 cores it is slower than with single core %do%.
I suspect it is a memory access problem. My processors are hardly used to capacity either. Is it possible to split the data dataframe into chunks and then have each processor calculate 100K and then add them together? How can I speed up this foreach loop?
One thing that’s immediately striking is that you use eight lines to create and return a data.frame where a single expression is sufficient:
data.frame(
cg = colnames(data)[i],
statistic = w$statistic,
p.value = w$p.value,
row.names = colnames(beta_t1)[i],
stringsAsFactors = FALSE
)
However, the upshot is that after the loop is run, foreach has to row-bind all these data.frames, and that operation is slow. It’s more efficient to return a list of the p-values and statistics and forget about the row and column names (these can be provided afterwards, and then don’t require subsetting and re-concatenation).
That is, change your code to
result = foreach(col = data) %do% {
w = wilcox.test(col ~ as.factor(groups), exact = FALSE)
list(w$statistic, w$p.value)
}
# Combine result and transform it into a data.frame:
results = data.frame(
cg = colnames(data),
statistic = vapply(result, `[[`, double(1L), 1L),
p.value = vapply(result, `[[`, double(1L), 2L),
row.names = colnames(beta_t1),
stringsAsFactors = FALSE # only necessary for R < 4.0!
)
(I never use foreach so I’m not exactly sure how to use it here but the above should roughly work; otherwise try mclapply from the ‘parallel’ package, it does the same, just using the familiar syntax of lapply.)
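A minimal sketch of the mclapply route, assuming simulated data in place of the real matrix (20 samples, 50 columns, two groups) and an arbitrary choice of 2 worker processes. Each worker returns a plain numeric vector and the table is assembled once at the end.

```r
library(parallel)

set.seed(1)
# Simulated stand-in for the real data: 20 samples, 50 columns to test.
data <- as.data.frame(matrix(rnorm(20 * 50), nrow = 20))
groups <- rep(c("a", "b"), each = 10)

# One test per column; return only the two numbers that are needed.
test_column <- function(col) {
  w <- wilcox.test(col ~ as.factor(groups), exact = FALSE)
  c(statistic = unname(w$statistic), p.value = w$p.value)
}

# mc.cores > 1 relies on forking, which is not available on Windows.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L
res <- mclapply(data, test_column, mc.cores = n_cores)

# Bind once after the loop instead of once per iteration.
results <- data.frame(cg = names(data), do.call(rbind, res), row.names = NULL)
```

Whether this beats the single-core version depends on how expensive each test is relative to the cost of shipping results between processes, so it is worth timing on a subset first.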

Using an apply function to subset, match, and correlate multiple sets of variables in R

I'd like to use one of the apply functions to compactly subset, match, and correlate only the set of variables with the following strings: "hpi", "cpi", "eh".
Specifically, I'd like to apply all of the lines below the third comment (which currently only handle "hpi") to each of the other strings.
Can you please advise?
MWE:
#Dataset 1
alias <- paste("v", seq( from = 1, to = 25 ), sep="" )
df1 = data.frame(replicate(25,sample(0:100,1000,rep=TRUE)))
colnames(df1) = alias
#Dataset 2
df2 = data.frame(replicate(3,sample(0:1,25,rep=TRUE)))
colnames(df2) = c("hpi","cpi","eh")
df2$alias = alias
df2$name = rep ( c("hpi housig", "cpi inflation", "eh econhealth", "unem unemployment", "inc personal income"), 5)
#I would like to use an apply function to do this to each of "hpi", "cpi", "eh"
df2$hpi = grepl("hpi", df2$name)
hpisub = df2[df2$hpi == 1, ]
hpisubvar = hpisub$alias
hpidf = df1[, hpisubvar]
corrhpi = cor(hpidf)
The *apply functions apply a function FUN to each element of their first argument, so you need to write a function to be applied.
fun <- function(x){
dfsub = df2[grepl(x, df2$name), ]
cor(df1[, dfsub$alias])
}
Now we test it against your result, corrhpi.
identical(corrhpi, fun("hpi"))
[1] TRUE
And, finally, apply it to the vector you need.
vec <- c("hpi", "cpi", "eh")
result <- lapply(vec, fun)
names(result) <- vec

Resources