I have some code in S-PLUS that I have to convert to R, which should not be a big job. However, I am very new to both languages. This is the line I am struggling with:
name.x<-name.cols(x)
x is a matrix of independent variables whose first length(keep1) columns correspond to variables that are always kept in BMA (Bayesian Model Averaging -- this isn't important; essentially, x is a matrix).
R does not recognize this command. What does name.cols do, and how can I do the same thing in R? How do I modify this line?
The function colnames returns the column names of an object in R:
name.x <- colnames(x)
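For example, a quick sketch with a small named matrix (the matrix m and its column names here are made up for illustration):

```r
# toy matrix with column names (made up for illustration)
m <- matrix(1:6, nrow = 2,
            dimnames = list(NULL, c("a", "b", "c")))
name.x <- colnames(m)
name.x  # "a" "b" "c"
```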
I have a dataset "res.sav" that I read in via haven. It contains 20 columns named "Genes1_Acc4", "Genes2_Acc4", and so on. I am trying to compute the correlation coefficient between each of those columns and another column called "Condition", and to list all the coefficients separately.
I created two functions, cor.condition.cols and cor.func, to do that. The first iterates through the column names and works just fine. The second was supposed to give me the correlations but didn't work at all. I also created a new object, cor.condition.Genes, which I would like to fill with the correlations, ideally as a matrix or data frame.
I have tried to iterate through the columns with these two functions. However, when I run them, I get the error "NAs introduced by coercion". This wouldn't be the end of the world (I also tried suppressWarnings()). The bigger problem is that my function does not seem to convert those columns to the numeric type that cor() needs: I get the error "'y' must be numeric" when calling cor(). I tried putting several arguments within and without '' or "" without success.
When I ran str(cor.condition.cols) I only got character strings, which makes me think my function somehow interferes with as.numeric. Any suggestions for another way to iterate through these columns and convert them?
Thanks guys :)
cor.condition.cols <- lapply(1:20, function(x){paste0("res$Genes", x, "_Acc4")})
#save acc_4 columns as numeric columns and calculate correlations
res <- (as.numeric("cor.condition.cols"))
cor.func <- function(x){
  cor(res$Condition, x, use="complete.obs", method="pearson")
}
cor.condition.Genes <- cor.func(cor.condition.cols)
You can do:
cor.condition.cols <- paste0("Genes", 1:20, "_Acc4")
res2 <- as.numeric(as.matrix(res[cor.condition.cols]))
cor.condition.Genes <- cor(res2, res$Condition, use="complete.obs", method="pearson")
or, as a shorter variant:
cor.condition.cols <- paste0("Genes", 1:20, "_Acc4")
cor.condition.Genes <- cor(res[cor.condition.cols], res$Condition, use="complete.obs")
Here is an example with other data:
cor(iris[-(4:5)], iris[[4]])
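To make the shape of the result concrete: cor() between a data frame and a single vector returns a one-column matrix with one row per column of the data frame, as this built-in-data sketch shows:

```r
# one correlation per numeric iris column (columns 4 and 5 excluded)
# against Petal.Width, mirroring the res/Condition case above
out <- cor(iris[-(4:5)], iris[[4]])
dim(out)  # 3 1: three predictor columns, one response
```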
I've been experimenting with calling R through SPSS.
I have figured out how to pull SPSS data into an R data frame, create a variable, and pass the data frame with the new variable back to an SPSS data set.
What I cannot figure out how to do is pass back variables that are additional transformations of the first variable created using R.
Specifically, I first create the variable
index <- c("INDX","label",0,"F8.2","scale")
by scaling the variable B from 0 to 1 and create the dataframe casedata using the code below:
casedata <- data.frame(casedata, ave(casedata$B, casedata$Patient_Type,
FUN = function(x) (x- min(x))/(max(x)- min(x))))
I can successfully pass the new dataframe back to SPSS and everything's fine. But in the same call to R, I would like to create a new variable
indexave <- c("INDX_Ave","label",0,"F8.2","scale")
which indexes INDX to the average of itself using the code below:
casedata <- data.frame(casedata, casedata$INDX/mean(casedata$INDX))
I cannot figure out how to pass INDX_Ave back to SPSS.
I suspect that it has to do with the way SPSS assigns names to new variables. You'll notice that
ave(casedata$B, casedata$Patient_Type, FUN = function(x) (x- min(x))/(max(x) - min(x))
doesn't have casedata$INDX= in front of it. SPSS apparently knows from this line of code
index <- c("INDX","label",0,"F8.2","scale")
to pass the name INDX to the first variable created. I believe this disjointedness of the variable name from the variable itself is preventing the additional variable INDX_Ave from being created.
Below is my entire program block:
BEGIN PROGRAM R.
dict <- spssdictionary.GetDictionaryFromSPSS()
casedata <- spssdata.GetDataFromSPSS(factorMode="labels")
catdict <- spssdictionary.GetCategoricalDictionaryFromSPSS()
index <- c("INDX","Level Importance Index",0,"F8.2","scale")
indexave <- c("INDX_Ave","Level importance indexed to average importance",0,"F8.2","scale")
dict<-data.frame(dict,index,indexave)
casedata <- data.frame(casedata, ave(casedata$B, casedata$Patient_Type,
FUN = function(x) (x- min(x))/(max(x)- min(x))))
casedata <- data.frame(casedata, casedata$INDX/mean(casedata$INDX)) # doesn't work
spssdictionary.SetDictionaryToSPSS("BWOverallBetas2",dict,categoryDictionary=catdict)
spssdata.SetDataToSPSS("BWOverallBetas2",casedata,categoryDictionary=catdict)
spssdictionary.EndDataStep()
END PROGRAM.
See the section "Writing Results to a New IBM SPSS Statistics Dataset" in the R Programmability doc. The names in the dictionary you pass govern the names on the SPSS side, but note that the rules for legal variable names in SPSS and R are different, although that isn't an issue here. Also, you can't create a dataset if SPSS is in procedure state (also not an issue with this code).
Your code adds INDX to the SPSS dictionary and computes it via ave, but it never assigns the name INDX in the casedata data frame. It then adds another variable without adding it to the dictionary to be sent to SPSS, so the sizes of the dictionary and the data frame don't match.
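A minimal sketch of the fix described above, using toy data in place of the SPSS case data (the column names B and Patient_Type come from the question; the values are invented):

```r
# toy stand-in for the SPSS case data (values are invented)
casedata <- data.frame(B = c(2, 5, 9, 1, 7, 3),
                       Patient_Type = rep(c("A", "B"), each = 3))
# assign the new columns by name, so they match the dictionary entries
casedata$INDX <- ave(casedata$B, casedata$Patient_Type,
                     FUN = function(x) (x - min(x)) / (max(x) - min(x)))
casedata$INDX_Ave <- casedata$INDX / mean(casedata$INDX)
names(casedata)  # "B" "Patient_Type" "INDX" "INDX_Ave"
```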
Note also that you can omit the factorMode argument in GetDataFromSPSS and then not bother with the categorical dictionary, because the values will be unchanged.
HTH
I have a matrix of sites. I want to build a UPGMA agglomerative cluster, using R and the vegan package. My matrix has sites in which not all the variables were measured.
Here is a similar matrix of data:
Variable 1;Variable 2;Variable 3;Variable 4;Variable 5
0.5849774671338231;0.7962161133598957;0.3478909861199184;0.8027122599553912;0.5596553797833573
0.5904142034898171;0.18185393432022612;0.5503250366728479;NA;0.05657408486342197
0.2265148074206368;0.6345513807275411;0.8048128547418062;0.3303602674038131;0.8924461773052935
0.020429460126217602;0.18850489885886157;0.26412619465769416;0.8020472793070729;NA
0.006945970735023677;0.8404983401121199;0.058385134042814646;0.5750066564897788;0.737599672122899
0.9909722313946067;0.22356808747617019;0.7290078902086897;0.5621006367587756;0.3387823531518016
0.5932907022602052;0.899773235815933;0.5441346748937264;0.8045695319247985;0.6183003409599681
0.6520679140573288;0.5419713133237936;NA;0.7890033752744002;0.8561828607592286
0.31285906479192593;0.3396351688936058;0.5733594373520889;0.03867689654415574;0.1975784885854912
0.5045966366726562;0.6553489439611587;0.029929403932252963;0.42777351534900676;0.8787135401098227
I am planning to do it with the following code:
library(vegan)
# env <- read.csv("matrix_of_sites.csv")
env.norm <- decostand(env, method = "normalize") # Normalizing data here
env.ch <- vegdist(env.norm, method = "euclidean")
env.ch.UPGMA <- hclust(env.ch, method="average")
plot(env.ch.UPGMA)
After I run the second line, I get this error:
Error in x^2 : non-numeric argument to binary operator
I am not familiar with R, so I am not sure if this is due to the cells with no data. How can I solve this?
R does not think the data in your matrix are numeric: at least some of them were interpreted as character variables and changed to factors. Inspect your data after reading them into R. If all your data are numbers, then sum(env) gives a numeric result. Use the str() or summary() functions for detailed inspection.
From R's point of view, your data file has mixed formatting. The function read.csv assumes that items are separated by a comma (,) and the decimal separator is a period (.), while read.csv2 assumes that items are separated by a semicolon (;) and the decimal separator is a comma (,). You mix these two conventions. You can still read data formatted like that, but you have to give both the sep and dec arguments.
If you get your data correctly into R, decostand will stop with an error: it does not accept missing values unless you add na.rm = TRUE. The same applies to the next vegdist command: it also needs na.rm = TRUE to analyse your data.
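A minimal sketch of reading such a file, fed here from an inline string rather than a file on disk: semicolon separators combined with period decimals need both sep and dec given explicitly, since neither read.csv nor read.csv2 assumes that combination.

```r
# inline stand-in for matrix_of_sites.csv (only two columns shown)
txt <- "Variable1;Variable2
0.5849;0.7962
0.5904;NA"
env <- read.csv(text = txt, sep = ";", dec = ".")
str(env)  # both columns numeric, with the NA preserved
# decostand() and vegdist() then need na.rm = TRUE, e.g.:
# env.norm <- decostand(env, method = "normalize", na.rm = TRUE)
```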
New to R and having problems with a very simple task! I have read a few columns of .csv data into R; the contents consist of variables in the natural numbers plus zero, with missing values. After trying to use the np (non-parametric) package, I have two problems: first, if I use the simple command bw=npregbw(ydat=y, xdat=x, na.omit), where x and y are column vectors, I get the error "number of regression data and response data do not match". Why do I get this, when I have the same number of elements in each vector?
Second, I would like to declare the data ordered and tell npregbw this, using the command bw=npregbw(ydat=y, xdat=ordered(x)). When I do that, I get the error that x must be atomic for sort.list. But how is x not atomic? It is just a vector of natural numbers and NAs.
Any clarifications would be greatly appreciated!
1) You probably have a different number of NAs in y and x.
2) I can't be sure about this, since there is no example. If it is of the following type:
x <- c(3,4,NA,2)
Then ordered(x) should work fine. Please provide an example of your case.
EDIT: You did of course try bw=npregbw(ydat=y, xdat=x)? ordered() makes your vector an ordered factor (see ?ordered), which is not an atomic vector (see section 2.1.1 of the R Language Definition, and ?factor).
EDIT2: So the problem was the way the data were subset. Note the difference between the various ways of subsetting: data$x and data[, i] (where i is the column number of column x) give you vectors, while data[c("x")] and data[i] give a data frame. Functions expect vectors, unless they take a data = (your data) argument; in that case they work with column names.
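A short sketch of that difference (the vector values are arbitrary):

```r
d <- data.frame(x = c(3, 4, NA, 2))
is.vector(d$x)         # TRUE: $ extracts the column as a plain vector
is.vector(d[, 1])      # TRUE: [ , i] also drops down to a vector
is.data.frame(d["x"])  # TRUE: single-bracket by name keeps a data frame
is.data.frame(d[1])    # TRUE: same for single-bracket by position
```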
I am trying to run some Monte Carlo simulations on animal position data. So far, I have sampled 100 X and Y coordinates, 100 times. This results in a list of 200. I then convert this list into a data frame that is more conducive to the eventual functions I want to run on each sample (kernel.area).
Now I have a data frame with 200 columns, and I would like to perform the kernel.area function using each successive pair of columns.
I can't reproduce my own data here very well, so I've tried to give a basic example just to show the structure of the data frame I'm working with. I've included the for loop I've tried so far, but I am still an R novice and would appreciate any suggestions.
# generate dataframe representing X and Y positions
df <- data.frame(x=seq(1:200),y=seq(1:200))
# 100 replications of sampling 100 "positions"
resamp <- replicate(100,df[sample(nrow(df),100),])
# convert to data frame (kernel.area needs an xy dataframe)
df2 <- do.call("rbind", resamp[1:2,])
# xy positions need to be in columns for kernel.area
df3 <- t(df2)
#edit: kernel.area requires you have an id field, but I am only dealing with one individual, so I'll construct a fake one of the same length as the positions
id=replicate(100,c("id"))
id=data.frame(id)
Here is the structure of the for loop I've tried (edited since first post):
for (j in seq(1,ncol(df3)-1,2)) {
kud <- kernel.area(df3[,j:(j+1)],id=id,kern="bivnorm",unin=c("m"),unout=c("km2"))
print(kud)
}
My end goal is to calculate kernel.area for each resampling event (i.e. rows 1:100 for every pair of columns up to 200) and to be able to combine the results in a data frame. However, after running the loop, I get this error message:
Error in df[, 1] : incorrect number of dimensions
Edit: I realised my id format was not the same as my data frame, so I changed it and now get the error:
Error in kernelUD(xy, id, h, grid, same4all, hlim, kern, extent) :
id should have the same length as xy
First, a disclaimer: I have never worked with the package adehabitat, which has a function kernel.area, which I assume you are using. Perhaps you could confirm which package contains the function in question.
I think there are a couple of suggestions I can make that are independent of knowledge of the specific package, though.
The first lies in the creation of df3. This should probably be df3 <- t(df2), but that is most likely correct in your actual code and just a typo in your post.
The second suggestion has to do with the way you subset df3 in the loop. j:j+1 is just a single number, since : has a higher precedence than + (see ?Syntax for the order in which mathematical operations are conducted in R). To get the desired two columns, use j:(j+1) instead.
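A quick illustration of the precedence point (using j <- 3 as an arbitrary value):

```r
j <- 3
j:j + 1    # 4: parsed as (j:j) + 1, a single number
j:(j + 1)  # 3 4: the two column indices you actually want
```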
EDIT:
When loading adehabitat, I was warned to "Be careful" and use the related new packages, among which is adehabitatHR, which also contains a function kernel.area. This function has slightly different syntax and behavior, but perhaps it would be worthwhile examining. Using adehabitatHR (I had to install from source since the package is not available for R 2.15.0), I was able to do the following.
library(adehabitatHR)
for (j in seq(1, ncol(df3) - 1, 2)) {
  kud <- kernelUD(SpatialPoints(df3[, j:(j + 1)]), kern = "bivnorm")
  kernAr <- kernel.area(kud, unin = c("m"), unout = c("km2"))
  print(kernAr)
}
detach(package:adehabitatHR, unload=TRUE)
This prints something, and as is mentioned in a comment below, kernelUD() is called before kernel.area().