undefined columns selected (Bayesian analysis) - r

I am replicating R code for a Bayesian analysis, but I get an error that I have not been able to solve, even after reading other questions here.
I use the same dataset and same variables (from OECD). Can anyone tell me why it does not work?
My code is this:
rm(list=ls())
# Name of variables to be extracted
v.resp=c("pv1math") # Response Variable
v.treat=c("IC02Q01","IC02Q02","IC02Q03") # Treatment variable(s)
# Student Confoundings
v.student.conf=c("Age", "Gender", "isced_0", "IMMIG", "HEDRES", "WEALTH", "ESCS","FAMSTRUC","hisced","hisei","HOMEPOS", "TIMEINT")
# School Confoundings
v.school.conf=c("CLSIZE","SCMATEDU","STRATIO","SMRATIO","PublicPrivate")
## LOAD DATA
library(foreign) # provides read.dta
dat <- read.dta("name.dta")
## Weighted sample with weights in the w vector
w=dat$W_FSTUWT
# Subset data in R
dat=dat[c(v.resp,v.treat,v.student.conf,v.school.conf)]
names(dat)[names(dat)==v.resp]="y"
w=w[complete.cases(dat)]
w=w/sum(w)
nw=function(w) w/sum(w)
dat=dat[complete.cases(dat),]
dim(dat)
When I run the line
dat=dat[c(v.resp,v.treat,v.student.conf,v.school.conf)]
I get the error
Error in `[.data.frame`(dat, c(v.resp, v.treat, v.student.conf, v.school.conf)) : undefined columns selected
I have 25000 observations and 900 variables, but I want to subset my data to 21 variables and the observations that are complete for them (fewer than 25000 for sure). I tried putting a comma before the closing ], but that did not help: when I then run the remaining lines I lose all the data.
I also ran this code from the Quick-R website, but I get the same error message:
# select variables v1, v2, v3
myvars <- c("v1", "v2", "v3")
newdata <- mydata[myvars]
I would like to understand why it does not work. I am copying and pasting this code from a paper that used it for the same dataset.
Thank you.

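The message says: undefined columns selected. That is exactly the situation here: at least one of the names in your vector is not a column of dat. With a data frame, [ ] with a single index and no comma selects columns, so your line is already a column selection and its syntax is fine. Column names are case-sensitive, so compare your vectors against names(dat); setdiff(c(v.resp,v.treat,v.student.conf,v.school.conf), names(dat)) lists the names R cannot find. If you prefer the form with a comma, the comma goes before the vector, because rows come first and columns second. A comma before the closing ] asks for rows with those names instead, which returns rows of NA, and that is why you lose all your data afterwards. The explicit column form:
dat=dat[,c(v.resp,v.treat,v.student.conf,v.school.conf)]
The only difference from your line is the comma after the opening [. It raises the same error until every name matches a column of dat.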
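A minimal sketch of how one might track down the names that trigger "undefined columns selected". The small data frame and the `wanted` vector are invented for illustration; the assumption is that the real mismatch is something like a difference in letter case, which may or may not be the cause in the asker's file.

```r
# Toy stand-in for the real data (hypothetical values and names)
dat <- data.frame(PV1MATH = c(500, 520), IC02Q01 = c(1, 2), Age = c(15.5, 15.8))

# Names we would like to select; "pv1math" differs from the column only by case
wanted <- c("pv1math", "IC02Q01", "Age")

# Names that are requested but not present in the data frame
missing_cols <- setdiff(wanted, names(dat))
print(missing_cols)

# One possible repair: match names while ignoring case, then select by position
idx <- match(tolower(wanted), tolower(names(dat)))
sub <- dat[, idx, drop = FALSE]
print(names(sub))
```

If `missing_cols` is empty, the selection `dat[wanted]` succeeds as written; otherwise the listed names are the ones to correct.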

Related

Performing HCPC on the columns (i.e. variables) instead of the rows (i.e. individuals) after (M)CA

I would like to perform a HCPC on the columns of my dataset, after performing a CA. For some reason I also have to specify at the start, that all of my columns are of type 'factor', just to loop over them afterwards again and convert them to numeric. I don't know why exactly, because if I check the type of each column (without specifying them as factor) they appear to be numeric... When I don't load and convert the data like this, however, I get an error like the following:
Error in eigen(crossprod(t(X), t(X)), symmetric = TRUE) : infinite or
missing values in 'x'
Could this be due to the fact that there are columns in my dataset that only contain 0's? If so, how come that it works perfectly fine by reading everything in first as factor and then converting it to numeric before applying the CA, instead of just performing the CA directly?
The original issue with the HCPC, then, is the following:
library(FactoMineR) # provides CA() and HCPC()
# read in data; 40 x 267 data frame
data_for_ca <- read.csv("./data/data_clean_CA_complete.csv",row.names=1,colClasses = c(rep('factor',267)))
# loop over first 267 columns, converting them to numeric
for(i in 1:267) {
  data_for_ca[[i]] <- as.numeric(data_for_ca[[i]])
}
# perform CA
data.ca <- CA(data_for_ca,graph = FALSE)
# perform HCPC for rows (i.e. individuals); up until here everything works just fine
data.hcpc <- HCPC(data.ca,graph = TRUE)
# now I start having trouble
# perform HCPC for columns (i.e. variables); use their coordinates that are stored in the CA object that was created earlier
data.cols.hcpc <- HCPC(data.ca$col$coord,graph = TRUE)
The code above shows me a dendrogram in the last case and even lets me cut it into clusters, but then I get the following error:
Error in catdes(data.clust, ncol(data.clust), proba = proba, row.w =
res.sauv$call$row.w.init) : object 'data.clust' not found
It's worth noting that when I perform MCA on my data and try to perform HCPC on my columns in that case, I get the exact same error. Would anyone have any clue as how to fix this or what I am doing wrong exactly? For completeness I insert a screenshot of the upper-left corner of my dataset to show what it looks like:
Thanks in advance for any possible help!
I know this is old, but because I've been troubleshooting this problem for a while today:
HCPC says that it accepts a data frame, but any time I try to simply pass it $col$coord or $colcoord from a standard ca object, it returns this error. My best guess is that there's some metadata it actually needs/is looking for that isn't in a data frame of coordinates, but I can't figure out what that is or how to pass it in.
The current version of FactoMineR will actually just allow you to give HCPC the whole CA object and tell it whether to cluster the rows or columns. So your last line of code should be:
data.cols.hcpc <- HCPC(data.ca, cluster.CA = "columns", graph = TRUE)
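On the side question about reading everything as factor and converting to numeric: as.numeric() applied to a factor returns the internal level codes, not the original values. The sketch below uses made-up values to show the difference. It is only a guess that this is why the CA stops failing, since an all-zero column would turn into a column of ones.

```r
# A factor whose labels look like numbers (invented example values)
f <- factor(c("0", "10", "5", "0"))

# Level codes: positions in the sorted level set "0", "10", "5"
codes <- as.numeric(f)

# Original values: go through character first
values <- as.numeric(as.character(f))

print(codes)
print(values)

# A column that contains only zeros becomes a column of ones as codes
zeros <- factor(c("0", "0", "0"))
zero_codes <- as.numeric(zeros)
print(zero_codes)
```

If the original values are wanted, `as.numeric(as.character(x))` is the usual conversion.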

min() does not work as expected

I am trying to get the minimum of a column.
The data has been split into groups using the "abbr" factor. My objective is to return the data in column 2 for the rows where the column whose number is passed as an argument reaches its minimum. If it helps, this is part of the Coursera R programming introductory course.
The minimum is supposed to be somewhere around 8, but it shows 10.
Please help me here.
Here's the link to the csv file, which I read with read.csv:
https://drive.google.com/file/d/0Bxkj3-FNtxqrLW14MFZCeEl6UGc/view?usp=sharing
best <- function(abbr, outvar){
## outcome is a dataframe consisting of a column labelled "State" (one of many)
## outvar is the desired column number
statecol <- split(outcome, outcome$State) ##state is a factor which will be inputted as abbr
dislist <- statecol[[abbr]][,2][statecol[[abbr]][, outvar] ==
min(statecol[[abbr]][, outvar])] ##continuation of prev line
dislist
}
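In my opinion the problem is the missing values. The file marks them with the text "Not Available", so if the column is read as character, min() compares strings and "10" sorts before "8". Declare "Not Available" as the NA string when reading the file and pass na.rm=TRUE to min():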
filedata<-read.table(file.choose(),quote='"',sep=",",dec=".",header=TRUE,stringsAsFactors=FALSE, na.strings="Not Available")
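f<-function(df,abbr,outVar,na.rm=TRUE){
# split the data frame into one data frame per state
outlist<-split(df,df$State)
# extract the outcome column as a plain vector
tempCol<-outlist[[abbr]][[outVar]]
# return the column-2 entries of the rows that attain the minimum
outlist[[abbr]][,2][which(tempCol==min(tempCol,na.rm=na.rm))]
}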
f(filedata,"AK",44)
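A small sketch of why the minimum can come out as 10 instead of 8, assuming the column was read as character. The three values are invented for illustration and are not taken from the linked file.

```r
# Values as they might arrive when the column is read as text
x <- c("10.1", "8.5", "Not Available")

# min() on character data compares strings, so "10.1" sorts before "8.5"
min_as_text <- min(x)
print(min_as_text)

# Convert to numeric; "Not Available" becomes NA (the coercion warning is expected)
num <- suppressWarnings(as.numeric(x))

# With the NA dropped, the numeric minimum is the intended one
min_as_number <- min(num, na.rm = TRUE)
print(min_as_number)
```

Reading the file with na.strings="Not Available", as in the answer, avoids the manual conversion step.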

R: "undefined columns selected" error after check.names=FALSE?

Brand new to R; trying to get my data read in and reshaped properly. File format has seven columns of "id"-ish data, then about sixty columns of annual growth values, columns labelled by year. First pass was:
> firstData <- read.csv("~/theData.csv")
> nuData <- melt(firstData, id=1:7)
That made the right arrangement but read.csv() had prepended an X to all the years ("X1983", e.g.), so now they don't work as values. I get that, so:
> firstData <- read.csv("~/theData.csv",check.names = FALSE)
> nuData <- melt(firstData, id=1:7)
Error in `[.data.frame`(data, , x) : undefined columns selected
The Xs were kept away (plain "1983", etc.), but now it won't melt(). Many retries; lots of reference-consulting; hard to figure out the right way to find the answer. It seems to think the structure is okay:
> is.data.frame(firstData)
[1] TRUE
> ncol(firstData)
[1] 71
I suspect that something about the bare-number column labels for 8-71 is throwing it. How do I reassure it that everything's fine?
EDIT
Didn't want to dump the data-mess if someone could answer offhand, but here's a sample. I thought I'd figured it out when I found spaces in column labels... but I fixed them and still get the same error. Is it a problem that the rows don't all have values in the 2016 column?
Tree,Gap,TransX,TransY,DBH,Nodes,Ht,1953,1954,1955,1956,1957,1958,1959,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,
1,1,3,0,4.4,23,366,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,7,3,3,7,3,4,4,13,7,23,17,34,25,30,23,19,25,22,29,28,20,14,6,
2,1,4,0,3.3,24,398,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,12,11,16,10,7,7,16,13,16,12,25,14,24,21,20,22,20,24,15,27,18,17,15,16,
3,1,5,2,2.8,24,325,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,5,7,16,8,6,16,18,10,17,7,21,10,14,12,16,14,23,15,21,20,14,14,12,9,
4,1,5,2.5,3.5,22,388,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,6,6,5,5,15,9,12,13,29,16,20,13,17,19,27,25,13,31,32,26,26,23,
5,1,10.2,0,9.5,43,739,,,,,,,,,,,,,,,,,,,,,16,18,9,14,18,13,14,10,6,8,8,10,12,11,13,11,6,6,7,8,8,9,11,13,20,27,17,23,11,38,21,29,27,31,29,19,23.1,22,33,40,24,22,24,
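One possible workaround, sketched under assumptions: leave check.names at its default, reshape, and strip the X prefix afterwards. The tiny CSV below is invented, and base stack() stands in for melt() so the example runs without extra packages. After melt(), the same sub() call could be applied to the variable column.

```r
# Invented miniature of the file: two id columns and two year columns
csv <- "Tree,Gap,1983,1984\n1,1,7,3\n2,1,12,11"

# Default check.names = TRUE turns 1983 into X1983
d <- read.csv(text = csv)
print(names(d))

# Year columns are the ones that look like X followed by four digits
year_cols <- grep("^X[0-9]{4}$", names(d), value = TRUE)

# Long format: repeat the id columns once per year column, then stack the values
ids <- d[rep(seq_len(nrow(d)), times = length(year_cols)), c("Tree", "Gap")]
long <- cbind(ids, stack(d[year_cols]))

# Recover the year as an integer by dropping the leading X
long$year <- as.integer(sub("^X", "", as.character(long$ind)))
long$ind <- NULL
rownames(long) <- NULL
print(long)
```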

Recoding over multiple data frames in R

(edited to reflect help...I'm not doing great with formatting, but appreciate the feedback)
I'm a bit stuck on what I suspect is an easy enough problem. I have multiple different data sets that I have loaded into R, all of which have different numbers of observations, but all of which have three variables named "A1," "A2," and "A3". I want to create a new variable in each of the data frames that contains the value held in "A1" if A3 contains a value greater than zero, and the value held in "A2" if A3 contains a value less than zero. Seems simple enough, right?
My attempt at this code uses this faux-data:
set.seed(1)
A1=seq(1,100,length=100)
A2=seq(-100,-1,length=100)
A3=runif(100,-1,1)
df1=cbind(A1,A2,A3)
A3=runif(100,-1,1)
df2=cbind(A1,A2,A3)
I'm about a thousand percent sure that R has some functionality for creating the same named variable in multiple data frames, but I have tried doing this with lapply:
mylist=list(df1,df2)
lapply(mylist,function(x){
x$newVar=x$A1
x$newVar[x$A3>0]=x$A2[x$A3>0]
return(x)
})
But the newVar is not available for me once I leave the lapply loop. For example, if I ask for the mean of the new variable:
mean(df1$newVar)
[1] NA
Warning message:
In mean.default(df1$newVar) :
argument is not numeric or logical: returning NA
Any help would be appreciated.
Thank you.
Well first of all, df1 and df2 are not data.frames but matrices (the dollar syntax doesn't work on matrices).
In fact, if you do:
set.seed(1)
A1=seq(1,100,length=100)
A2=seq(-100,-1,length=100)
A3=runif(100,-1,1)
df1=as.data.frame(cbind(A1,A2,A3))
A3=runif(100,-1,1)
df2=as.data.frame(cbind(A1,A2,A3))
mylist=list(df1,df2)
lapply(mylist,function(x){
x$newVar=x$A1
x$newVar[x$A3>0]=x$A2
})
the code almost works but gives some warnings. In fact, there's still an error in the last line of the function called by lapply. If you change it like this, it works as expected:
lapply(mylist,function(x){
x$newVar=x$A1
x$newVar[x$A3>0]=x$A2[x$A3>0] # you need to subset x$A2 otherwise it's too long
return(x) # better to state explicitly what's the return value
})
EDIT (as per comment):
As is almost always the case in R, functions do not modify existing objects; they return new ones.
So in this case df1 and df2 are unchanged, but lapply returns a list with the two expected new data.frames, i.e.:
resultList <- lapply(mylist,function(x){
x$newVar=x$A1
x$newVar[x$A3>0]=x$A2[x$A3>0]
return(x)
})
newDf1 <- resultList[[1]]
newDf2 <- resultList[[2]]
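A compact variant of the same idea, offered as a sketch rather than the only way to do it: build the data frames with data.frame(), keep them in a named list, and compute the new column with ifelse(). The list name `dfs` is an assumption introduced here. The rule follows the code above (A2 where A3 > 0, A1 otherwise).

```r
set.seed(1)
A1 <- seq(1, 100, length.out = 100)
A2 <- seq(-100, -1, length.out = 100)

# data.frame() gives real data frames, so the $ syntax works
df1 <- data.frame(A1, A2, A3 = runif(100, -1, 1))
df2 <- data.frame(A1, A2, A3 = runif(100, -1, 1))

# A named list keeps the data frames together and addressable by name
dfs <- list(df1 = df1, df2 = df2)

# lapply returns new data frames; assign the result to keep the new column
dfs <- lapply(dfs, function(x) {
  x$newVar <- ifelse(x$A3 > 0, x$A2, x$A1)
  x
})

print(mean(dfs$df1$newVar))
```

Because the result is assigned back to `dfs`, the new column is available afterwards as `dfs$df1$newVar` and `dfs$df2$newVar`.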

perform function on pairs of columns

I am trying to run some Monte Carlo simulations on animal position data. So far, I have sampled 100 X and Y coordinates, 100 times. This results in a list of 200. I then convert this list into a dataframe that is more conducive to the functions I eventually want to run for each sample (kernel.area).
Now I have a data frame with 200 columns, and I would like to perform the kernel.area function using each successive pair of columns.
I can't reproduce my own data here very well, so I've tried to give a basic example just to show the structure of the data frame I'm working with. I've included the for loop I've tried so far, but I am still an R novice and would appreciate any suggestions.
# generate dataframe representing X and Y positions
df <- data.frame(x=seq(1:200),y=seq(1:200))
# 100 replications of sampling 100 "positions"
resamp <- replicate(100,df[sample(nrow(df),100),])
# convert to data frame (kernel.area needs an xy dataframe)
df2 <- do.call("rbind", resamp[1:2,])
# xy positions need to be in columns for kernel.area
df3 <- t(df2)
#edit: kernel.area requires you have an id field, but I am only dealing with one individual, so I'll construct a fake one of the same length as the positions
id=replicate(100,c("id"))
id=data.frame(id)
Here is the structure of the for loop I've tried (edited since first post):
for (j in seq(1,ncol(df3)-1,2)) {
kud <- kernel.area(df3[,j:(j+1)],id=id,kern="bivnorm",unin=c("m"),unout=c("km2"))
print(kud)
}
My end goal is to calculate kernel.area for each resampling event (ie rows 1:100 for every pair of columns up to 200), and be able to combine the results in a dataframe. However, after running the loop, I get this error message:
Error in df[, 1] : incorrect number of dimensions
Edit: I realised my id format was not the same as my data frame, so I change it and now have the error:
Error in kernelUD(xy, id, h, grid, same4all, hlim, kern, extent) :
id should have the same length as xy
First, a disclaimer: I have never worked with the package adehabitat, which has a function kernel.area, which I assume you are using. Perhaps you could confirm which package contains the function in question.
I think there are a couple suggestions I can make that are independent of knowledge of the specific package, though.
The first lies in the creation of df3. This should probably be df3 <- t(df2), but this is most likely correct in your actual code and just a typo in your post.
The second suggestion has to do with the way you subset df3 in the loop. j:j+1 is just a single number, since : has a higher precedence than + (see ?Syntax for operator precedence in R). To get the desired two columns, use j:(j+1) instead.
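The precedence point can be checked directly at the console. The value of j below is arbitrary.

```r
j <- 3

# ":" binds tighter than "+", so this is (j:j) + 1, a single number
a <- j:j+1
print(a)

# Parentheses give the intended pair of consecutive indices
b <- j:(j+1)
print(b)
```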
EDIT:
When loading adehabitat, I was warned to "Be careful" and use the related new packages, among which is adehabitatHR, which also contains a function kernel.area. This function has slightly different syntax and behavior, but perhaps it would be worthwhile examining. Using adehabitatHR (I had to install from source since the package is not available for R 2.15.0), I was able to do the following.
library(adehabitatHR)
for (j in seq(1,ncol(df3)-1,2)) {
kud <-kernelUD(SpatialPoints(df3[,j:(j+1)]),kern="bivnorm")
kernAr<-kernel.area(kud,unin=c("m"),unout=c("km2"))
print(kernAr)
}
detach(package:adehabitatHR, unload=TRUE)
This prints something, and as is mentioned in a comment below, kernelUD() is called before kernel.area().
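To collect one result per resample in a data frame instead of printing, something along these lines might work. This is only a sketch: `bbox_area` is a hypothetical stand-in (a bounding-box area) used so the example runs without adehabitatHR, and the call to it would be replaced by the kernelUD() and kernel.area() steps above.

```r
set.seed(42)
df <- data.frame(x = 1:200, y = 1:200)

# 100 resamples of 100 positions, columns arranged x1, y1, x2, y2, ...
xy <- do.call(cbind, lapply(1:100, function(i) {
  as.matrix(df[sample(nrow(df), 100), ])
}))

# Hypothetical stand-in for the home-range computation: bounding-box area
bbox_area <- function(m) {
  as.numeric(diff(range(m[, 1]))) * as.numeric(diff(range(m[, 2])))
}

# First column index of every (x, y) pair
starts <- seq(1, ncol(xy) - 1, by = 2)

# One row per resample, gathered into a single data frame
res <- data.frame(
  resample = seq_along(starts),
  area = vapply(starts, function(j) bbox_area(xy[, j:(j + 1)]), numeric(1))
)
print(head(res))
```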

Resources