Use Master dataframe to aggregate regression loop using rbind

All, I'm very new to R, and can't find anything in the existing questions database that fits my exact issue. I'm running a loop of several regressions (200), and am trying to bind the results/coefficients into a single dataframe that I can export to Excel, with one set of headers. All variables in each regression are identical. The regression part of my loop looks like,
getreg<-OutChg~USInput
stepreg<-lm(getreg,data=mydata)
I'm trying to use a "master" dataframe to bind everything together, such as,
master<-data.frame()
master<-rbind(master,stepreg$coefficients)
But I get the response Error in stepreg$coefficients : $ operator is invalid for atomic vectors. Ideally, I'd like to have something where I don't even have to define master<-data.frame().
Any advice is much appreciated. Thank you!

Try using getreg <- as.formula(OutChg ~ USInput), or just put the formula directly in the lm() call.
If you use str(stepreg) you will probably find that it is not a list but some other data type (in this case an atomic vector).
In order to use rbind(), the variable "master" has to exist already (as something).

Using data.table
library(data.table)
datandfit <- function(x) {
  USInput <- rnorm(100, 0, 5)
  OutChg <- USInput*5 + 10 + rnorm(100, 0, 5)
  mydata <- data.table(USInput, OutChg)
  stepreg <- lm(OutChg ~ USInput, data = mydata)
  data.table(t(stepreg$coefficients))
}
This will generate some random data, fit a model to it, and return a data.table of the results. You would skip the first three lines of the function, since you would already have data. Then, you can lapply over the function, which will return a list of 200 data.tables, and use rbindlist to combine them all into one data.table.
rbindlist(lapply(1:200, datandfit))
(Intercept) USInput
1: 9.979968 4.909842
2: 10.086159 5.083225
3: 10.285307 4.873432
4: 10.457751 4.905266
5: 9.108176 5.005555
---
196: 10.715356 4.846002
197: 9.938905 4.966180
198: 9.968473 5.073163
199: 10.098703 5.065169
200: 9.538539 4.946085

All, I finally figured this out! As a new user and non-programmer, figuring out how the different R objects work together is cumbersome, but using master<-list() before doing any of the "binding" got it to work...took me a minute or two to realize I can't have the loop designate the master as a list every time or it erases previous aggregation, too...thanks all for your help!
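The pattern described above can be sketched as follows. This is only an illustration on simulated data: the names n_runs and results, and the simulated columns, are assumptions rather than the asker's actual code. The list is created once before the loop, each iteration stores one coefficient vector, and the rows are bound together afterwards.

```r
set.seed(1)                          # reproducible simulated data
n_runs <- 5                          # hypothetical; the question uses 200
results <- vector("list", n_runs)    # created once, outside the loop

for (i in seq_len(n_runs)) {
  # simulated stand-in for the real data of run i
  mydata <- data.frame(USInput = rnorm(50))
  mydata$OutChg <- 10 + 5 * mydata$USInput + rnorm(50)
  stepreg <- lm(OutChg ~ USInput, data = mydata)
  results[[i]] <- coef(stepreg)      # named numeric vector of length 2
}

# one row per regression, one column per coefficient
master <- as.data.frame(do.call(rbind, results))
```

From here, write.csv(master, "master.csv", row.names = FALSE) would produce a file with a single header row that Excel can open.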

Related

Performing HCPC on the columns (i.e. variables) instead of the rows (i.e. individuals) after (M)CA

I would like to perform a HCPC on the columns of my dataset, after performing a CA. For some reason I also have to specify at the start, that all of my columns are of type 'factor', just to loop over them afterwards again and convert them to numeric. I don't know why exactly, because if I check the type of each column (without specifying them as factor) they appear to be numeric... When I don't load and convert the data like this, however, I get an error like the following:
Error in eigen(crossprod(t(X), t(X)), symmetric = TRUE) : infinite or
missing values in 'x'
Could this be due to the fact that there are columns in my dataset that only contain 0's? If so, how come that it works perfectly fine by reading everything in first as factor and then converting it to numeric before applying the CA, instead of just performing the CA directly?
The original issue with the HCPC, then, is the following:
library(FactoMineR)
# read in data; 40 x 267 data frame
data_for_ca <- read.csv("./data/data_clean_CA_complete.csv",row.names=1,colClasses = c(rep('factor',267)))
# loop over first 267 columns, converting them to numeric
for(i in 1:267)
  data_for_ca[[i]] <- as.numeric(data_for_ca[[i]])
# perform CA
data.ca <- CA(data_for_ca,graph = F)
# perform HCPC for rows (i.e. individuals); up until here everything works just fine
data.hcpc <- HCPC(data.ca,graph = T)
# now I start having trouble
# perform HCPC for columns (i.e. variables); use their coordinates that are stored in the CA object that was created earlier
data.cols.hcpc <- HCPC(data.ca$col$coord,graph = T)
The code above shows me a dendrogram in the last case and even lets me cut it into clusters, but then I get the following error:
Error in catdes(data.clust, ncol(data.clust), proba = proba, row.w =
res.sauv$call$row.w.init) : object 'data.clust' not found
It's worth noting that when I perform MCA on my data and try to perform HCPC on my columns in that case, I get the exact same error. Would anyone have any clue as how to fix this or what I am doing wrong exactly? For completeness I insert a screenshot of the upper-left corner of my dataset to show what it looks like:
Thanks in advance for any possible help!

I know this is old, but because I've been troubleshooting this problem for a while today:
HCPC says that it accepts a data frame, but any time I try to simply pass it $col$coord or $colcoord from a standard ca object, it returns this error. My best guess is that there's some metadata it actually needs/is looking for that isn't in a data frame of coordinates, but I can't figure out what that is or how to pass it in.
The current version of FactoMineR will actually just allow you to give HCPC the whole CA object and tell it whether to cluster the rows or columns. So your last line of code should be:
data.cols.hcpc <- HCPC(data.ca, cluster.CA = "columns", graph = T)

R - Assign the mean of a column sub-sector to each row of that sub-sector

I am trying to create a column which has the mean of a variable according to subsectors of my data set. In this case, the mean is the crime rate of each state calculated from county observations, and then assigning this number to each county relative to the state they are located in. Here is the function I wrote.
Create the new column
Data.Final$state_mean <- 0
Then calculate and assign the mean.
for (j in range[1:3136])
{
state <- Data.Final[j, "state"]
Data.Final[j, "state_mean"] <- mean(Data.Final$violent_crime_2009-2014,
which(Data.Final[, "state"] == state))
}
Here is the following error
Error in range[1:3137] : object of type 'builtin' is not subsettable
It would be very much appreciated if you could take a few minutes to help a beginner out.

You've got a few problems:
range[1:3136] isn't valid syntax. range(1:3136) is valid syntax, but the range() function just returns the minimum and maximum. You don't need anything more than 1:3136, just use
for (j in 1:3136) instead.
Because of the dash, violent_crime_2009-2014 isn't a standard column name. You'll need to use it in backticks, Data.Final$`violent_crime_2009-2014`, or in quotes with [: Data.Final[["violent_crime_2009-2014"]] or Data.Final[, "violent_crime_2009-2014"].
Also, your code is very inefficient: you re-calculate the mean on every single iteration. Try having a look at the Mean by Group R-FAQ. There are many faster and easier methods to get grouped means.
Without using extra packages, you could do
Data.Final$state_mean = ave(x = Data.Final[["violent_crime_2009-2014"]],
Data.Final$state,
FUN = mean)
For friendlier syntax and greater efficiency, the data.table and dplyr packages are popular. You can see examples using them at the link above.
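As a small check of the ave() approach, here is a sketch on a made-up five-row stand-in for Data.Final (the values are invented for illustration; only the column names follow the question). Each row receives the mean of its own state.

```r
# toy stand-in for Data.Final; values are made up for illustration
Data.Final <- data.frame(
  state = c("A", "A", "B", "B", "B"),
  `violent_crime_2009-2014` = c(1, 3, 2, 4, 6),
  check.names = FALSE              # keep the non-syntactic column name
)

# ave() returns a vector as long as the input, with the group mean
# repeated for every row of the group
Data.Final$state_mean <- ave(Data.Final[["violent_crime_2009-2014"]],
                             Data.Final$state,
                             FUN = mean)
Data.Final$state_mean
# [1] 2 2 4 4 4
```

No loop and no pre-created state_mean column are needed, because ave() already returns one value per row.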

Here is one of many ways this can be done (I'm sure someone will post a tidyverse answer soon if not before I manage to post):
# Data for my example:
data(InsectSprays)
# Note I have a response column and a column I could subset on
str(InsectSprays)
# Take the averages with the by var:
mn <- with(InsectSprays,aggregate(x=list(mean=count),by=list(spray=spray),FUN=mean))
# Map the means back to your data using the by var as the key to map on:
InsectSprays <- merge(InsectSprays,mn,by="spray",all=TRUE)
Since you mentioned you're a beginner, I'll just mention that whenever you can, avoid looping in R. Vectorize your operations when you can. The nice thing about using aggregate, and merge, is that you don't have to worry about errors in your mapping because you get an index shift while looping and something weird happens.
Cheers!

Unable to Convert Chi-Squared Values into a Numeric Column in R

I've been working on a project for a little bit for a homework assignment and I've been stuck on a logistical problem for a while now.
What I have at the moment is a list that returns 10000 values in the format:
[[10000]]
X-squared
0.1867083
(This is the 10000th value of the list)
What I really would like is to just have the chi-squared value alone so I can do things like create a histogram of the values.
Is there any way I can do this? I'm fine with repeating the test from the start if necessary.
My current code is:
nsims = 10000
for (i in 1:nsims) {cancer.cells <- c(rep("M",24),rep("B",13))
malig[i] <- sum(sample(cancer.cells,21)=="M")}
benign = 21 - malig
rbenign = 13 - benign
rmalig = 24 - malig
for (i in 1:nsims) {test = cbind(c(rbenign[i],benign[i]),c(rmalig[i],malig[i]))
cancerchi[i] = chisq.test(test,correct=FALSE) }
It gives me all I need, I just cannot perform follow-up analysis on it such as creating a histogram.
Thanks for taking the time to read this!

I'll provide an answer at the suggestion of @Dr. Mike.
hist requires a numeric vector as input. The reason that hist(cancerchi) will not work is that cancerchi is a list, not a vector.
There are several ways to convert cancerchi from a list into a format that hist can work with. Here are 3 ways, starting with the most direct:
hist(unlist(cancerchi))
Note that if you do not reassign cancerchi it will still be a list and cannot be passed directly to hist.
# i.e
class(cancerchi)
hist(cancerchi) # will still give you an error
If you reassign, it can be another type of object:
(class(cancerchi2 <- unlist(cancerchi)))
(class(cancerchi3 <- as.data.frame(unlist(cancerchi))))
# using the ldply function in the plyr package
library(plyr)
(class(cancerchi4 <- ldply(cancerchi)))
These new objects can be passed to hist (for the data frames, select the column first):
hist(cancerchi2)
hist(cancerchi3[,1]) # specify column because cancerchi3 is a data frame, not a vector
hist(cancerchi4[,1]) # specify column because cancerchi4 is a data frame, not a vector
A little extra information: other useful commands for looking at your objects include str and attributes.
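An alternative that avoids the conversion altogether is to store only the test statistic in a numeric vector while simulating. The following is a sketch under assumptions: it uses 100 simulations instead of 10000 to keep it quick, and the names chi_values and tab are hypothetical, not from the question.

```r
set.seed(42)
nsims <- 100                                   # smaller than the question's 10000
cancer.cells <- c(rep("M", 24), rep("B", 13))

chi_values <- numeric(nsims)                   # preallocated numeric vector
for (i in seq_len(nsims)) {
  malig  <- sum(sample(cancer.cells, 21) == "M")
  benign <- 21 - malig
  tab <- cbind(c(13 - benign, benign), c(24 - malig, malig))
  # keep only the statistic, dropping its "X-squared" name
  chi_values[i] <- unname(chisq.test(tab, correct = FALSE)$statistic)
}

hist(chi_values)   # works directly, since chi_values is a numeric vector
```

Storing chisq.test(...)$statistic rather than the whole test object means nothing has to be unlisted later.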

perform function on pairs of columns

I am trying to run some Monte Carlo simulations on animal position data. So far, I have sampled 100 X and Y coordinates, 100 times. This results in a list of 200. I then convert this list into a dataframe that is more conducive to eventual functions I want to run for each sample (kernel.area).
Now I have a data frame with 200 columns, and I would like to perform the kernel.area function using each successive pair of columns.
I can't reproduce my own data here very well, so I've tried to give a basic example just to show the structure of the data frame I'm working with. I've included the for loop I've tried so far, but I am still an R novice and would appreciate any suggestions.
# generate dataframe representing X and Y positions
df <- data.frame(x=seq(1:200),y=seq(1:200))
# 100 replications of sampling 100 "positions"
resamp <- replicate(100,df[sample(nrow(df),100),])
# convert to data frame (kernel.area needs an xy dataframe)
df2 <- do.call("rbind", resamp[1:2,])
# xy positions need to be in columns for kernel.area
df3 <- t(df2)
#edit: kernel.area requires you have an id field, but I am only dealing with one individual, so I'll construct a fake one of the same length as the positions
id=replicate(100,c("id"))
id=data.frame(id)
Here is the structure of the for loop I've tried (edited since first post):
for (j in seq(1,ncol(df3)-1,2)) {
kud <- kernel.area(df3[,j:(j+1)],id=id,kern="bivnorm",unin=c("m"),unout=c("km2"))
print(kud)
}
My end goal is to calculate kernel.area for each resampling event (ie rows 1:100 for every pair of columns up to 200), and be able to combine the results in a dataframe. However, after running the loop, I get this error message:
Error in df[, 1] : incorrect number of dimensions
Edit: I realised my id format was not the same as my data frame, so I change it and now have the error:
Error in kernelUD(xy, id, h, grid, same4all, hlim, kern, extent) :
id should have the same length as xy

First, a disclaimer: I have never worked with the package adehabitat, which has a function kernel.area, which I assume you are using. Perhaps you could confirm which package contains the function in question.
I think there are a couple suggestions I can make that are independent of knowledge of the specific package, though.
The first lies in the creation of df3. This should probably be df3 <- t(df2), but this is most likely correct in your actual code and just a typo in your post.
The second suggestion has to do with the way you subset df3 in the loop. j:j+1 is just a single number, since : has a higher precedence than + (see ?Syntax for the order in which operators are evaluated in R). To get the desired two columns, use j:(j+1) instead.
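The precedence point can be checked at the console. This is a small sketch with an arbitrary j and a toy matrix m standing in for df3 (both are assumptions for illustration):

```r
j <- 3
a <- j:j + 1       # parsed as (j:j) + 1, i.e. 3 + 1
a
# [1] 4
b <- j:(j + 1)     # the intended pair of column indices
b
# [1] 3 4

m <- matrix(1:12, nrow = 2)    # toy 2 x 6 matrix
ncol(m[, j:(j + 1)])           # two columns are selected
# [1] 2
```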
EDIT:
When loading adehabitat, I was warned to "Be careful" and use the related new packages, among which is adehabitatHR, which also contains a function kernel.area. This function has slightly different syntax and behavior, but perhaps it would be worthwhile examining. Using adehabitatHR (I had to install from source since the package is not available for R 2.15.0), I was able to do the following.
library(adehabitatHR)
for (j in seq(1,ncol(df3)-1,2)) {
kud <-kernelUD(SpatialPoints(df3[,j:(j+1)]),kern="bivnorm")
kernAr<-kernel.area(kud,unin=c("m"),unout=c("km2"))
print(kernAr)
}
detach(package:adehabitatHR, unload=TRUE)
This prints something, and as is mentioned in a comment below, kernelUD() is called before kernel.area().

Return multiple data frames from function R

I am trying to put together a function that will loop thru a given data frame in blocks and return a new data frame containing stuff calculated from the original. The length of x will be different each time and the actual problem will have more loops in the function. New-ish to R and have not been able to find anything helpful (I don't think using a list will help)
func<-function(x){
tmp # need to declare this here?
for (i in 1:dim(x)[1]){
tmp[i]<-ave(x[i,]) # add things to it
}
return(tmp)
}
df<-cbind(rnorm(10),rnorm(10))
means<-func(df)
This code does not work but I hope it gets across what I want to do. thanks!

Do you mean you want to loop through each row of df and return a data frame with the calculated values?
You may want to look into the apply function:
df <- cbind(rnorm(10),rnorm(10))
# apply(df,1,FUN) does FUN(df[i,])
# e.g. mean of each row:
apply(df,1,mean)
For more complicated looping like performing some operation on a per-factor basis, I strongly recommend package plyr, and function ddply within. Quick example:
df <- data.frame( gender=c('M','M','F','F'), height=c(183,176,157,168) )
# find mean height *per gender*
ddply(df,.(gender), function(x) c(height=mean(x$height)))
# returns:
gender height
1 F 162.5
2 M 179.5
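If the explicit loop from the question is still wanted, one possible repair is sketched below. It is an illustration rather than the only way to do it: the result vector is declared before the loop, mean() replaces ave() so that each row yields a single value, and a named list is used to return more than one data frame (the element names means and input are hypothetical).

```r
func <- function(x) {
  n <- nrow(x)
  tmp <- numeric(n)                  # declare the result before the loop
  for (i in seq_len(n)) {
    tmp[i] <- mean(x[i, ])           # one value per row
  }
  # return several results at once as a named list
  list(means = data.frame(row = seq_len(n), mean = tmp),
       input = as.data.frame(x))
}

set.seed(1)
df <- cbind(rnorm(10), rnorm(10))
out <- func(df)
out$means      # 10 x 2 data frame of row means
```

Each element of the returned list is an ordinary data frame, so out$means and out$input can be used independently.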
