I have a data frame with a list column that contains linear models (lm). When I access a specific element of this column, e.g. [[2]], I get the summary output of that model. That works perfectly, but since the column has 959 rows, I want to write a for loop to run an anova on each of these regressions. How do I address all the objects in that list in a for loop?
To make the problem clearer, here is an MWE:
Dataframe:
structure(list(Week = 7:17, Category = c("2", "2", "2", "2",
"2", "2", "2", "2", "2", "2", "2"), Brand = c("3", "3", "3",
"3", "3", "3", "3", "3", "3", "3", "3"), Display = c(0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0), Sales = c(0, 0, 0, 0, 13.440948, 40.097397,
32.01384, 382.169189, 2830.748779, 4524.460938, 1053.590576),
Price = c(0, 0, 0, 0, 5.949999, 5.95, 5.950003, 4.87759,
3.787015, 3.205987, 4.898724), Distribution = c(0, 0, 0,
0, 1.394019, 1.386989, 1.621416, 8.209759, 8.552915, 9.692097,
9.445554), Advertising = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0), lnSales = c(11.4945151554497, 11.633214247508, 11.5862944141137,
11.5412559646132, 11.4811122484454, 11.4775106999991, 11.6333660772506,
11.4859819773102, 11.5232680456161, 11.5572670584292, 11.5303686934256
), IntrayearCycles = c(4.15446534315765, 3.62757053512638,
2.92387946552647, 2.14946414386239, 1.40455011205262, 0.768856938870769,
0.291497141953598, -0.0131078404184544, -0.162984144025091,
-0.200882782749248, -0.182877633924882), `Competitor Advertising` = c(10584.87063,
224846.3243, 90657.72553, 0, 0, 0, 2396.54212, 0, 0, 0, 40343.49444
), `Competitor Display` = c(0.385629, 2.108133, 2.515806,
4.918288, 3.81749, 3.035847, 2.463194, 3.242594, 1.850399,
1.751096, 1.337943), `Competitor Prices` = c(5.30989, 5.372752,
5.3717245, 5.3295525, 5.298393, 5.319466, 5.1958415, 5.2941095,
5.296757, 5.294059, 5.273578), ZeroSales = c(1, 1, 1, 1,
0, 0, 0, 0, 0, 0, 0)), .Names = c("Week", "Category", "Brand",
"Display", "Sales", "Price", "Distribution", "Advertising", "lnSales",
"IntrayearCycles", "Competitor Advertising", "Competitor Display",
"Competitor Prices", "ZeroSales"), row.names = 1255:1265, class = "data.frame")
Then I estimate an Error Correction Model (with the ecm package) for each group; this produces linear model output. In total, 959 separate regressions are estimated.
f <- function(.) {
xeq <- as.data.frame(select(., lnPrice, lnAdvertising, lnDisplay, IntrayearCycles, lnCompetitorPrices, lnCompADV, lnCompDISP, ADVxDISP, ADVxCYC, DISPxCYC, ADVxDISPxCYC))
xtr <- as.data.frame(select(., lnPrice, lnAdvertising, lnDisplay, IntrayearCycles, lnCompetitorPrices, lnCompADV, lnCompDISP, ADVxDISP, ADVxCYC, DISPxCYC, ADVxDISPxCYC))
print(xeq)
print(xtr)
summary(ecm(.$lnSales, xeq, xtr, includeIntercept = TRUE))
}
Models <- DatasetThesisSynergyClean %>%
group_by(Category, Brand) %>%
do(Model = f(.))
To see the summary of a specific model (here model 2), you can use:
Models$Model[[2]]
Next, I want to extract specific values from this summary output. But first I want to extract the residual sum of squares (RSS) to do an anova. I do this for one list object as follows:
anova_output_Unitmodels <- anova(Models$Model[[2]])
RSS_Unit <- anova_output_Unitmodels$`Sum Sq`[nrow(anova_output_Unitmodels)] #saving the RSS
Now I want to loop this across all the list objects, from object [[1]] to [[959]]. Each RSS value has to be saved, and eventually I need to sum all of them.
Furthermore, once this works, I need to extract the coefficients, t-values, and p-values of all variables from all models. For that I also need to address the specific objects in the list and put $coefficients behind them, but I was not able to manage this either.
Here is how I implemented @Roman Lustrik's answer.
extractRSS <- function(x) {
an <- anova(x)
RSS_Unit <- an$`Sum Sq`[nrow(an)]
return(RSS_Unit)
}
sapply(Model, FUN = extractRSS)
I also tried to do it for one specific model, but this gives me an error:
SapplyRSS <- sapply(Models$Model, FUN = extractRSS)
I had another idea and tried to write the loop differently. It did not work out well, but it's a start. If you do
RSS2<- sum(Models$Model[[2]]$residuals^2)
you get the RSS of a single model, so I thought I could replicate this in a for loop:
for(i in residuals.lm){
AllRSS<- as.matrix(c(1:949))
AllRSS <- as.data.frame(AllRSS)
SumRSS <- sum(Models$Model[[i]]$residuals^2)
SumRSS <- as.data.frame(SumRSS)
TotalRSS <- cbind(SumRSS, AllRSS)}
TotalRSS <- SumRSS[NULL,]
It starts with specifying the i in the for statement; I do not know if this is right. In the end it leaves me with an empty data frame, or a data frame containing the value of the same brand every time.
@MichaelChirico probably had something like this in mind.
extractRSS <- function(x) {
an <- anova(x)
RSS_Unit <- an$`Sum Sq`[nrow(an)]
return(RSS_Unit)
}
sapply(Models$Model, FUN = extractRSS)
sapply will traverse every Models$Model[[i]] object and extract its RSS. You can modify this function to return other pieces of information as well. By default sapply tries to simplify the result to a vector or matrix; you can prevent this with sapply(..., simplify = FALSE).
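As a rough sketch of the same idea on data anyone can run, the snippet below fits one lm per cylinder group of the built-in mtcars data (a stand-in for the 959 ecm fits, not the original data), extracts each RSS with sapply, and sums them. It assumes the list holds fitted lm objects rather than summary() objects, since anova() is written for fitted models. The names fits, extract_rss and coef_tables are made up for illustration.

```r
# Stand-in for Models$Model: a list of fitted lm objects, one per group
fits <- lapply(split(mtcars, mtcars$cyl), function(d) lm(mpg ~ wt, data = d))

# RSS is the "Sum Sq" entry of the last (Residuals) row of the anova table
extract_rss <- function(fit) {
  an <- anova(fit)
  an$`Sum Sq`[nrow(an)]
}

rss <- sapply(fits, extract_rss)   # named numeric vector, one RSS per model
total_rss <- sum(rss)              # pooled RSS across all models

# Coefficient tables (estimate, std. error, t value, p value) for every model
coef_tables <- lapply(fits, function(fit) coef(summary(fit)))
```

Keeping the fitted models in the list and calling summary() only when needed makes both the anova table and the coefficient table available from the same object.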
A different way of doing this is to export all the list objects as separate objects in the global environment. You do this through:
names(Models$Model) <- paste0("C", Models$Category, "B", Models$Brand)
list2env(Models$Model, .GlobalEnv)
Then I wrote a for loop that fetches these objects by name and appends the values from each one to a data frame. The data frame has to exist before the loop starts, so it is created empty first:
TotalRSS2 <- data.frame(RSS = numeric(0), DF = numeric(0))
for (X in c("0", "1", "3")) {
ModelX <- get(paste0("C", X, "B2"))
RSS <- sum(ModelX$residuals^2)
DF <- ModelX$df[2]
TotalRSS2 <- rbind(TotalRSS2, data.frame(RSS = RSS, DF = DF))
}
I would like to create a loop in which the index is given by the column names of a dataframe. The idea is to select one column at a time and create a map based on the data in that column. I need i to be the column name, as it identifies the variable and I'll use it as part of the title of the map. However, I do not seem to be able to associate my index i with the name of the column. My code goes as follows:
# random data
x <- rep(c("AT130", "DEA1A", "DEA2C", "SE125", "SE232"), 4)
y <- c(1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0 ,1, 0, 1, 0, 1)
z <- c(0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ,0, 0, 0, 0, 0)
w <- c(0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1 ,0, 1, 0, 1, 0)
d <- as.data.frame(cbind(x,y,z,w))
colnames(d) <- c("id", "typeA", "typeB", "typeC")
for (i in colnames(d[,2:ncol(d)])) {
var_to_map <- d[,c(1,i)]
## do stuff
}
I get the following error at the first line:
Error: Can't subset columns that don't exist.
x Columns `1`, `2`, and `3` don't exist.
Run `rlang::last_error()` to see where the error occurred.
However, if I just run colnames(d[,2:ncol(d)]), it works properly
colnames(d[,2:ncol(d)])
[1] "typeA" "typeB" "typeC"
I could find a workaround by using column numbers, but I would like to keep the column names, since I am printing (10+) maps within the loop and I am using i to insert the title of each map, as follows:
# I use geodata files from the library `Eurostat`.
geodata <- get_eurostat_geospatial(resolution = "60", nuts_level = "3", year = 2013)
for (i in colnames(d[,2:ncol(d)])) {
var_to_map <- d[,c(1,i)]
colnames(var_to_map)[1] <- "geo"
# Joining, by = "geo"
map_data <- merge(var_to_map, geodata, by=c("geo"), all.y=T, all.x=T)
## creating ranges
map_data$cat <- with(map_data, cut(value,
breaks= qu <- unique(quantile(value,
probs=c(0, 0.2, 0.5, 0.8,
0.9, 0.95, 0.99, 1),
na.rm=TRUE, include.lowest=T )),
labels=qu[-1]),include.lowest=TRUE )
# Map
print(ggplot(data=map_data) + geom_sf(aes(fill=cat), size=.1) +
scale_fill_brewer(palette = "Darkred", na.value= "grey") + aes(geometry = geometry) +
guides(fill = guide_legend(reverse=T, title = "Percentiles")) +
labs(title = paste("The name of this graph is the column name", i) ## here is where I use the index
)+
theme_minimal() + theme(legend.position=c(.8,.6)) +
coord_sf(xlim=c(-12,44), ylim=c(35,70)) +
theme( axis.text.x=element_blank(), axis.text.y=element_blank()))
}
I could also use column numbers for i and create another object with column names to refer to when pasting the title of the map, but I am wondering why the above approach fails and what I could do to make it work in that setting.
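In base R you can select columns either by position or by name, but you can't combine both in one index vector: c(1, i) mixes a number with a string, so R coerces the whole vector to character and looks for a column literally named "1". If you use dplyr::select you can select columns by name and position in the same command.
So here are your options: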
cols <- colnames(d)
for (i in cols[-1]) {
#Select columns by position
var_to_map <- d[,c(1,match(i, cols))]
#OR select column by name
var_to_map <- d[,c(cols[1],i)]
#OR select column by position and name (all_of() marks i as a variable holding a column name)
var_to_map <- dplyr::select(d, 1, dplyr::all_of(i))
#...rest of the code
}
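To see why the original d[,c(1,i)] fails, here is a small sketch of the coercion on a made-up two-row data frame (the values are illustrative, not the question's data):

```r
d <- data.frame(id = c("AT130", "DEA1A"), typeA = c(1, 0), typeB = c(0, 1))
i <- "typeA"

idx <- c(1, i)      # mixing a number and a string
class(idx)          # "character": the 1 has become the string "1"

# Indexing by name only works as intended
var_to_map <- d[, c(names(d)[1], i)]
names(var_to_map)   # "id" "typeA"
```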
There's a lot going on in this question, but perhaps this minimal example will help:
library(tidyverse)
# random data
d <- data.frame(x = rep(c("AT130", "DEA1A", "DEA2C", "SE125", "SE232"), 4),
y = sample(1:10, 20, replace = TRUE),
z = sample(1:10, 20, replace = TRUE),
w = sample(1:10, 20, replace = TRUE))
colnames(d) <- c("id", "typeA", "typeB", "typeC")
for (i in colnames(d[,2:ncol(d)])) {
type <- rlang::sym(i) # sym() turns the column name (a string) into a symbol; ensym() is meant for function arguments
p <- ggplot(d, aes(y = !!type, x = id, fill = id)) +
geom_boxplot() +
ggtitle(i)
print(p)
}
I need to calculate the sum of some variables with imputed values. I did this with complete --> as.mids --> with --> do.call
I needed to do the same thing but in a survey context. Therefore, I did: update --> with --> MIcombine
The means of the variables calculated both ways do not match. Which one is correct?
You can check this different behavior with this toy dataset:
library(tidyverse)
library(mice)
library(mitools)
library(survey)
mydata <- structure(list(dis1 = c(NA, NA, 1, 0, 0, 1, 1, 1, 1, 0),
dis2 = c(0, 1, 0, 1, NA, 1, 1, 1, 1, 0),
dis3 = c(1, 1, 0, 0, NA, 1, 1, 1, 1, 0),
sex = c(0,0,0,1,0,1,1,1,1,0),
clus = c(1,1,1,1,1,2,2,2,2,2)),
row.names = c(NA, 10L),
class = c("tbl_df", "tbl", "data.frame") )
imp <- mice::mice(mydata, m = 5, seed = 237856)
# calculating numenf with mice::complete
long <- mice::complete(imp, action = "long", include = TRUE)
long$numenf <- long$dis1 + long$dis2 + long$dis3
imp2 <- mice::as.mids(long)
res <- with(imp2, mean(numenf))
do.call(mean, res$analyses) # mean = 2.1
#calculating numenf with update (from survey)
imp1 <- mice::complete(imp)
imp2 <- mice::complete(imp, 2)
imp3 <- mice::complete(imp, 3)
imp4 <- mice::complete(imp, 4)
imp5 <- mice::complete(imp, 5)
listimp <- mitools::imputationList(list(imp1, imp2, imp3, imp4, imp5))
clus <- survey::svydesign(id = ~clus, data = listimp)
clus <- stats::update(clus, numenf = dis1 + dis2 + dis3)
res <- with(clus, survey::svymean(~numenf))
summary(mitools::MIcombine(res)) # mean = 1.98
Answer
Replace do.call(mean, res$analyses) with mean(unlist(res$analyses)).
Rationale
In the first code snippet, res$analyses is a list. When you pass it to do.call, each list element becomes a separate argument, so you are essentially calling:
mean(res$analyses[[1]], res$analyses[[2]], res$analyses[[3]], res$analyses[[4]], res$analyses[[5]])
mean only averages the vector in its first argument. The remaining values are matched to other parameters (the second to trim, the third to na.rm; see ?mean) instead of being averaged. Hence you just get 2.1 back, since that is the value of the first analysis.
We can make a vector out of the list by using unlist(res$analyses). Then, we can just feed it to mean as an argument:
mean(unlist(res$analyses))
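The difference is easy to reproduce without mice. In this sketch the list analyses is a hypothetical stand-in for res$analyses, with made-up values:

```r
analyses <- list(2.1, 1.9, 2.0, 1.8, 2.2)  # hypothetical per-imputation means

first_only <- do.call(mean, analyses)  # equivalent to mean(2.1, 1.9, 2.0, 1.8, 2.2)
pooled <- mean(unlist(analyses))       # averages all five values

first_only  # 2.1, the first element only
pooled      # 2
```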
I am using the CRAN package DMwR to deal with the problem of imbalanced data.
The code is the following:
require(DMwR)
dm = read.table("C:/data/exampleData.txt", sep=",")
ncols<-ncol(dm)
dm<-cbind(dm[2:ncols],dm[1])
dmSmote<-SMOTE(target ~ . , dm,k=5,perc.over = 1400,perc.under=140)
dm<-cbind(dmSmote[ncols],dmSmote[1:ncols-1])
Data :
5.901487,5.176487,1
6.917943,3.979710,0
5.247007,3.628324,1
5.157673,6.212658,0
4.836749,3.978392,0
4.590970,5.547353,0
3.895904,5.350865,0
4.312977,3.853151,0
5.844978,5.450767,0
4.009195,5.108031,0
Column 1 = variable 1, column 2 = variable 2, column 3 = Class
I am getting the following error: attempt to change an attribute to NULL
Link to library : http://cran.fhcrc.org/web/packages/DMwR/DMwR.pdf
What am I not getting right?
The classifier variable (target in your code) needs to be a factor.
require(DMwR)
## data
dm = structure(
c(5.901487, 6.917943, 5.247007, 5.157673, 4.836749,
4.59097, 3.895904, 4.312977, 5.844978, 4.009195, 5.176487, 3.97971,
3.628324, 6.212658, 3.978392, 5.547353, 5.350865, 3.853151, 5.450767,
5.108031, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0),
.Dim = c(10L, 3L),
.Dimnames = list(NULL, NULL))
dm = data.frame(dm)
## column names
colnames(dm) = c("var1", "var2", "target")
## you must convert the classifier variable to a factor
dm$target = factor(dm$target)
## SMOTE algorithm
dmSmote <- SMOTE(target ~ ., data = dm, k = 5,perc.over = 1400, perc.under = 140)
Using debug() on the function in question is a good starting point for diagnosing errors.
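Before calling SMOTE, a quick check of the column's class can confirm whether the conversion is needed. This is only a sketch on a three-row data frame with rounded, made-up values, and it does not call DMwR itself:

```r
dm <- data.frame(var1 = c(5.90, 6.92, 5.25),
                 var2 = c(5.18, 3.98, 3.63),
                 target = c(1, 0, 1))

class(dm$target)                 # "numeric" when the column holds plain numbers
dm$target <- factor(dm$target)   # convert the class column
levels(dm$target)                # "0" "1"
```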
Sample data:
pp.inc <- structure(list(has.di.rec.pp = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0), m.dist.km2 = c(-34.4150009155273, 6.80600023269653, -6.55499982833862,
-61.7700004577637, 15.6840000152588, -11.2869997024536, -26.9729995727539,
0, 81.9940032958984, -35.1459999084473, -12.5179996490479, 0,
21.5919990539551, 81.9940032958984, -20.7770004272461, 85.9469985961914,
-15.2959995269775, -75.5879974365234, 81.9940032958984, 3.04999995231628,
-17.1490001678467, -25.806999206543, -16.0060005187988, -14.91100025177,
-12.9020004272461, -16.0060005187988, 5.44000005722046, -34.4150009155273,
81.9940032958984, 3.61400008201599, 13.7379999160767, 2.71300005912781,
4.31300020217896), treated = c(0, 1, 0, 0, 1, 0, 0, 1, 1, 0,
0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1,
1, 1)), .Names = c("has.di.rec.pp", "m.dist.km2", "treated"), row.names = c(NA,
-33L), class = c("data.table", "data.frame"))
Code:
library(data.table)
library(ggplot2)
rddplot <- function(data, outcome, runvar, treatment = treated, span, bw, ...){
data <- data.table(data)
data.span <- data[abs(runvar) <= span, ]
data.span <- data.span[ , bins := cut(runvar,
seq(-span, span, by = bw),
include.lowest = TRUE, right = FALSE)]
data.span.plot <- data.span[ , list(avg.outcome = mean(outcome),
avg.runvar = mean(runvar),
treated = max(treatment),
n.iid = length(outcome)), keyby = bins]
data.span.plot <- data.span.plot[ , runvar := head(seq(-span, span, by = bw), -1)]
bp <- ggplot(data = data.span.plot, aes(x = runvar, y = avg.outcome))
bp <- bp + geom_point(aes(colour = n.iid))
bp <- bp + stat_smooth(data = data.span, aes(x = runvar, y = outcome,
group = factor(treatment)), ...)
bp
return(bp)
}
rddplot(pp.inc, has.di.rec.pp, m.dist.km2, treated, 50, 5)
This code runs perfectly if I do not wrap it in a function. I am a novice in R and only use it infrequently. What am I doing wrong? Am I missing something obvious, or does it have to do with data.table or ggplot2? I thought it might be something with ggplot, as other questions mention there is an issue and that aes_string should be used. I can rewrite the data.table parts to use base functions. But I think the error already occurs before that, on the second line. How do I make this work?
EDIT:
[Original title:
R function returns Error in eval(expr, envir, enclos) : object 'name' not found]
I had some time to look at this again and have worked out a solution, hence I also modified the title a bit. Using eval() didn't really work out for me, so I went the [['columnname']] selection route. I've ditched data.table (and plyr as well), so that this only uses base functions except for ggplot2. I am happy for any comments on how to improve it. Please let me know if there are some essential flaws. If not, I will add an answer with my solution later.
I have changed the bin calculation so that there is always a breakpoint at zero, which is necessary. The default binwidth is determined by the Silverman rule. I am thinking of calculating the model fit separately and returning it, as the model choice within ggplot is limited; however, I can't think of a nice way to incorporate this for a variety of diverse models such as lm or loess, and it's not strictly necessary. I actually wanted to overlay a thin bar plot displaying the number of observations in each bin, but found out this is impossible in ggplot (I know this is generally a bad idea, but there are several well-published papers which use similar graphs). I don't find the size aesthetic too appealing here, but these are really minor gripes.
Thanks for getting me on the right path.
My solution:
rddplot <- function(data, outcome, runvar, treatment = "treated",
span, bw = bw.nrd0(data[[runvar]]), ...){
breaks <- c(sort(-seq(0, span, by = bw)[-1]), seq(0, span, by = bw))
data.span <- data[abs(data[[runvar]]) <= max(breaks), ]
data.span$bins <- cut(data.span[[runvar]], breaks,
include.lowest = TRUE, right = FALSE)
data.span.plot <- as.data.frame(cbind(tapply(data.span[[outcome]], data.span$bins, mean),
tapply(data.span[[runvar]], data.span$bins, mean),
tapply(data.span[[treatment]], data.span$bins, max),
tapply(data.span[[outcome]], data.span$bins, length),
tapply(data.span[[outcome]], data.span$bins, sum)))
colnames(data.span.plot) <- c("avg.outcome", "avg.runvar", "treated", "n.iid", "n.rec")
data.span.plot$runvar <- head(breaks, -1)
print(data.span.plot)
bp <- ggplot(data = data.span.plot, aes(x = runvar, y = avg.outcome))
bp <- bp + geom_point(aes(size = n.iid))
bp <- bp + stat_smooth(data = data.span, aes_string(x = runvar, y = outcome,
group = treatment), ...)
print(bp)
}
Call:
rddplot(pp.inc, "has.di.rec.pp", "m.dist.km2", "treated", 50,
method = lm, formula = y ~ poly(x, 4, raw = TRUE))
I have an approach using data.table and some deparse(substitute()) and setnames trickery....
rddplot <- function(data, outcome, runvar, treatment = treated, span, bw, ...){
# convert to data.table
data <- data.table(data)
# get the column names as defined in the call to rddplot
outname <- deparse(substitute(outcome))
runname <- deparse(substitute(runvar))
treatname <- deparse(substitute(treatment))
# rename these columns with the argument names
setnames(data, old = c(outname,runname,treatname), new = c('outcome','runvar', 'treatment'))
# breaks as defined in the second example
breaks <- c(sort(-seq(0, span, by = bw)[-1]), seq(0, span, by = bw))
# the stuff you were doing before
data.span <- data[abs(runvar) <= span, ]
data.span <- data.span[ , bins := cut(runvar,
breaks,
include.lowest = TRUE, right = FALSE)]
data.span.plot <- data.span[ , list(avg.outcome = mean(outcome),
avg.runvar = mean(runvar),
treated = max(treatment),
n.iid = length(outcome)), keyby = bins]
# note I've removed trying to add the `runvar` column to data.span.plot
bp <- ggplot(data = data.span.plot, aes(x = avg.runvar, y = avg.outcome))
bp <- bp + geom_point(aes(colour = n.iid))
bp <- bp + stat_smooth(data = data.span, aes(x = runvar, y = outcome,
group = treatment), ...)
bp
}
rddplot(pp.inc, has.di.rec.pp, m.dist.km2, treated, 50, 5)
Note that if you didn't convert to data.table within the function, and assumed the data argument was a data.table, then you could use on.exit() to revert the names changed by reference.
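The deparse(substitute()) step is the core of this approach, so here is a minimal base R sketch of just that part. The helper col_mean is hypothetical and uses the built-in mtcars data rather than pp.inc:

```r
# Turn an unquoted column argument into a string, then index with [[ ]]
col_mean <- function(data, column) {
  col_name <- deparse(substitute(column))  # e.g. "mpg"
  mean(data[[col_name]])
}

col_mean(mtcars, mpg)  # same result as mean(mtcars$mpg)
```

Once the name is a plain string, everything downstream can use ordinary [[ ]] indexing and no longer depends on how the caller wrote the argument.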
I haven't quite got my head around R and how to rearrange data. I have an old SPSS data file that needs rearranging so I can conduct an ANOVA in R.
My current data file has this format:
ONE <- matrix(c(1, 2, 777.75, 609.30, 700.50, 623.45, 701.50, 629.95, 820.06, 651.95,"nofear","nofear"), nr=2,dimnames=list(c("1", "2"), c("SUBJECT","AAYY", "BBYY", "AAZZ", "BBZZ", "XX")))
And I need to rearrange it to this:
TWO <- matrix(c(1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 777.75, 701.5, 700.5, 820.06, 609.3, 629.95, 623.95, 651.95), nr=8, dimnames=list(c("1", "1", "1", "1", "2", "2", "2", "2"), c("SUBJECT","AA", "ZZ", "XX", "RT")))
I am sure that there is an easy way of doing it, rather than hand coding. Thanks for the consideration.
This should do it. You can tweak it a bit, but this is the idea:
library(reshape)
THREE <- melt(as.data.frame(ONE),id=c("SUBJECT","XX"))
THREE$AA <- grepl("AA",THREE$variable)
THREE$ZZ <- grepl("ZZ",THREE$variable)
THREE$variable <- NULL
# cleanup
THREE$XX <- as.factor(THREE$XX)
THREE$AA <- as.numeric(THREE$AA)
THREE$ZZ <- as.numeric(THREE$ZZ)
The reshape package and base reshape() both help with this kind of task, but in this simple case, where you have to generate the indicator variables anyway, hand coding is pretty easy; just take advantage of automatic recycling in R.
TWO <- data.frame(SUBJECT = rep(1:2,each = 4),
AA = rep(1:0, each = 2),
ZZ = 0:1,
XX = 1,
# columns reordered so RT lines up with the AA/ZZ pattern above
RT = as.numeric(t(ONE[, c("AAYY", "AAZZ", "BBYY", "BBZZ")])))
That gives the TWO you asked for, but it doesn't generalize easily to a larger ONE. I think this makes more sense:
n <- nrow(ONE)
TWO <- data.frame(SUBJECT = rep(ONE$SUBJECT, 4),
AB = rep(1:0, each = n),
YZ = rep(0:1, each = 2*n),
fear = ONE$XX,
RT = unlist(ONE[,2:5]))
This latter one gives more representative variable names and handles the likely case that your data is actually much bigger, with XX (fear) varying and more subjects. Also, given that you're reading it in from an SPSS data file, ONE is actually a data frame with numeric columns and character columns stored as factors. The reshaping itself is only this part of the code:
TWO <- data.frame(SUBJECT = rep(ONE$SUBJECT, 4),
fear = ONE$XX,
RT = unlist(ONE[,2:5]))
You could add in other variables afterward.
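For comparison, here is a sketch of the same rearrangement with base reshape(). It assumes ONE is a data frame (as it would be after reading the SPSS file) and rebuilds it from the values in the question; the names conds, cond and long are made up for illustration:

```r
ONE <- data.frame(SUBJECT = 1:2,
                  AAYY = c(777.75, 609.30), BBYY = c(700.50, 623.45),
                  AAZZ = c(701.50, 629.95), BBZZ = c(820.06, 651.95),
                  XX = c("nofear", "nofear"))

conds <- c("AAYY", "BBYY", "AAZZ", "BBZZ")

# Stack the four condition columns into one RT column, keeping the label
long <- reshape(ONE, direction = "long",
                varying = conds, v.names = "RT",
                timevar = "cond", times = conds,
                idvar = "SUBJECT")

# Derive the indicator columns from the condition label
long$AA <- as.numeric(grepl("AA", long$cond))
long$ZZ <- as.numeric(grepl("ZZ", long$cond))
```

Because the indicators are derived from the label rather than from row position, this version does not depend on the order of the condition columns.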