I'm very new to R.
I am trying to normalize multiple variables in a matrix, except the last column, which holds a categorical factor variable (in this case good/notgood).
Is there any way to normalize the data without affecting the categorical column? I have tried normalizing with the categorical column left out, but I can't work out how to add it back afterwards.
minimum <- apply(mywines[,-12],2,min)
maximum <- apply(mywines[,-12],2,max)
mywinesNorm <- scale(mywines[,-12],center=minimum,scale=(maximum-minimum))
I still need the 12th column to build supervised models.
The short version is that you can simply reattach the column using cbind. However, it is just a little more complicated than that. scale returns a matrix not a data frame. In order to mix numbers and factors, you need a data.frame, not a matrix. So before the cbind, you will want to convert the scaled matrix back to a data.frame.
mywinesNorm = cbind(as.data.frame(mywinesNorm), mywines[ ,12])
A different approach would be to just change the data in place:
mywines[ ,-12] = scale(mywines[ ,-12], center=minimum, scale=(maximum-minimum))
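As a minimal sketch of the in-place idea, the numeric columns can also be picked by type instead of by position. The data frame `wines` below is a made-up stand-in for `mywines`, and its column names are assumptions for illustration only:

```r
# Hypothetical stand-in for mywines: two numeric columns plus a factor label
wines <- data.frame(
  acidity = c(7.4, 7.8, 11.2, 6.9),
  alcohol = c(9.4, 9.8, 10.0, 12.5),
  quality = factor(c("good", "notgood", "good", "notgood"))
)

# Identify numeric columns by type rather than by position
is_num <- sapply(wines, is.numeric)

# Min-max scale each numeric column to [0, 1]; the factor is left alone
wines_norm <- wines
wines_norm[is_num] <- lapply(wines[is_num], function(col) {
  (col - min(col)) / (max(col) - min(col))
})

str(wines_norm)
```

Because the result stays a data frame throughout, the factor column never has to be detached and reattached.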
I am experimenting with the mice package in R and am curious about how I can leave columns out of the imputation.
If I want to run a mean imputation on just one column, the
mice.impute.mean(y, ry, x = NULL, ...) function seems to be what I would use. I'm struggling to understand what I need to include as the third argument to get this to work.
If I have a data set that includes categorical data such as name, ID, birth date, etc., which should not affect the calculation of other columns and should not be filled in when missing, how do I tell mice to exclude these columns from its calculation?
I've been using the mice dataset
nhanes for my exploration.
Thanks
I don't know your data, so I can't create an example for you, but these two parameters of the mice() function are exactly what you are looking for:
predictorMatrix
A numeric matrix of length(blocks) rows and ncol(data) columns, containing 0/1 data specifying the set of predictors to be used for each target column. Each row corresponds to a variable block, i.e., a set of variables to be imputed. A value of 1 means that the column variable is used as a predictor for the target block (in the rows). By default, the predictorMatrix is a square matrix of ncol(data) rows and columns with all 1's, except for the diagonal. Note: For two-level imputation models (which have "2l" in their names) other codes (e.g, 2 or -2) are also allowed.
With this parameter you can define which columns are used to impute a specific column.
where
A data frame or matrix with logicals of the same dimensions as data indicating where in the data the imputations should be created. The default, where = is.na(data), specifies that the missing data should be imputed. The where argument may be used to overimpute observed data, or to skip imputations for selected missing values.
Here you can define which entries, and therefore which columns, get imputed.
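A rough sketch of how this might look on the nhanes data mentioned in the question. Treating hyp as the column to leave alone is an arbitrary choice for illustration, and the sketch switches imputation off through the method vector (an empty method string) rather than through where; it assumes a mice version that provides make.predictorMatrix() and make.method():

```r
library(mice)

# Start from the defaults, then switch one column off.
# "hyp" is an arbitrary stand-in for a column such as name or ID.
pred <- make.predictorMatrix(nhanes)
pred[, "hyp"] <- 0          # never use hyp to predict other columns

meth <- make.method(nhanes)
meth["hyp"] <- ""           # an empty method means: do not impute hyp

imp <- mice(nhanes, predictorMatrix = pred, method = meth,
            m = 1, maxit = 5, seed = 1, printFlag = FALSE)
completed <- complete(imp)

# hyp keeps its missing values; the other columns are filled in
colSums(is.na(completed))
```

Setting the where entries of that column to FALSE is another way to skip its imputations, as the quoted documentation describes.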
I've got a 1000x1000 matrix consisting of a random distribution of the letters a - z, and I need to be able to plot the data in a rank abundance distribution plot; however I'm having a lot of trouble with it due to a) it all being in character format, b) it being as a matrix and not a vector (though I have changed it to a vector in one attempt to sort it), and c) I seem to have no idea how to summarise the data so that I get species abundance, let alone then be able to rank it.
My code for the matrix is:
##Create Species Vector
species.b<-letters[1:26]
#Matrix creation (Random)
neutral.matrix2<- matrix(sample(species.b,1000000,replace=TRUE),
                         nrow=1000,
                         ncol=1000)
##Turn Matrix into Vector
neutral.b<-as.character(neutral.matrix2)
##Loop
lo.op <- 2
neutral.v3 <- neutral.matrix2
neutral.c<-as.character(neutral.v3)
repeat {
  neutral.v3[sample(length(neutral.v3),1)]<-as.character(sample(neutral.c,1))
  neutral.c<-as.character(neutral.v3)
  lo.op <- lo.op+1
  if(lo.op > 10000) {
    break
  }
}
Which creates a matrix, 1000x1000, then replaces 10,000 elements randomly (I think, I don't know how to check it until I can check the species abundances/rank distribution).
I've run it a couple of times to get neutral.v2, neutral.v3, and neutral.b, neutral.c, so I should theoretically have two matrices/vectors that I can plot and compare - I just have no idea how to do so on a purely character dataset.
I also created a matrix of the two vectors:
abundance.matrix<-matrix(c(neutral.b,neutral.c),
                         nrow=1000000,
                         ncol=2)
I did this because a later requirement is for sites, and each repeat of my code (neutral.v2 to neutral.v11 eventually) could be considered a separate site. However, this didn't change the fact that I have no idea how to treat the character data set in the first place.
I think I need to calculate the abundance of each species in the matrix/vectors, then run it through either radfit (vegan) or some form of the rankabundance/rankabun plot (biodiversityR). However the requirements for those functions:
rankabundance(x,y="",factor="",level,digits=1,t=qt(0.975,df=n-1))
x       Community data frame with sites as rows, species as columns and species abundance as cell values.
y       Environmental data frame.
factor  Variable of the environment
aren't available in the data I have, as for all intents and purposes I just have a "map" of 1,000,000 species locations, and no idea how to analyse it at all.
Any help would be appreciated. I don't feel like I've explained it very well, though, so sorry about that!
I'm not sure exactly what you want, but this will summarise the data and make it into a data.frame for rankabundance
counts <- as.data.frame(as.list(table(neutral.matrix2)))
BiodiversityR::rankabundance(counts)
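If it helps, here is a base R sketch of the same idea that also builds the sites-by-species layout described in the question. The two small matrices `site1` and `site2` are made-up stand-ins for the repeated runs, and the plot is just one plausible way to draw a rank abundance curve:

```r
set.seed(42)
species <- letters[1:26]

# Two small hypothetical "sites", each a character matrix of species labels
site1 <- matrix(sample(species, 100 * 100, replace = TRUE), nrow = 100)
site2 <- matrix(sample(species, 100 * 100, replace = TRUE), nrow = 100)

# Abundance of each species at one site, ranked from most to least common
abundance <- sort(table(site1), decreasing = TRUE)
rank_abundance <- data.frame(
  rank      = seq_along(abundance),
  species   = names(abundance),
  abundance = as.integer(abundance)
)

plot(rank_abundance$rank, rank_abundance$abundance, type = "b", log = "y",
     xlab = "Species rank", ylab = "Abundance (log scale)")

# Sites-by-species table: one row per site, one column per species
site_labels <- rep(c("site1", "site2"), each = length(site1))
all_species <- factor(c(site1, site2), levels = species)
community <- as.data.frame.matrix(table(site = site_labels, species = all_species))
dim(community)
```

The `community` data frame has sites as rows and species as columns, which is the shape the rankabundance documentation asks for.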
Dear friends, I would appreciate it if someone could help me with a question in R.
I have a data frame with 8 variables, let's say (v1,v2,...,v8). I would like to produce groups of datasets based on all possible combinations of these variables. That is, with a set of 8 variables I am able to produce 2^8-1=255 subsets of variables like {v1},{v2},...,{v8}, {v1,v2},....,{v1,v2,v3},....,{v1,v2,...,v8}
My goal is to produce a specific statistic based on these groupings and then compare which subset produces a better statistic. My problem is how to produce these combinations.
Thanks in advance
You need the function combn. It creates all the combinations of a vector that you provide it. For instance, in your example:
names(yourdataframe) <- c("V1","V2","V3","V4","V5","V6","V7","V8")
varnames <- names(yourdataframe)
combn(x = varnames,m = 3)
This gives you all combinations of V1-V8 taken 3 at a time. To cover every subset size, repeat the call for m = 1 through 8.
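To collect every non-empty subset in one list, combn can be called once per subset size. The sketch below uses a made-up data frame `df`, and the mean is only a placeholder for whatever statistic you actually want to compare:

```r
# Hypothetical data frame with eight numeric variables
set.seed(1)
df <- as.data.frame(matrix(rnorm(50 * 8), ncol = 8))
names(df) <- paste0("V", 1:8)
varnames <- names(df)

# Every non-empty subset: combinations of size 1, 2, ..., 8
all_subsets <- unlist(
  lapply(seq_along(varnames), function(m) combn(varnames, m, simplify = FALSE)),
  recursive = FALSE
)
length(all_subsets)  # 255, i.e. 2^8 - 1

# Placeholder statistic: the mean of all values in each subset of columns
stats <- sapply(all_subsets, function(vars) mean(as.matrix(df[, vars, drop = FALSE])))
names(stats) <- sapply(all_subsets, paste, collapse = ",")

# Name of the subset with the largest statistic
names(stats)[which.max(stats)]
```

Using drop = FALSE keeps single-variable subsets as data frames, so the same statistic function works for every subset size.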
I'll use data.table instead of data.frame;
I'll include an extraneous variable for robustness.
This will get you your subsetted data frames:
library(data.table)
nn<-8L
dt<-setnames(as.data.table(cbind(1:100,matrix(rnorm(100*nn),ncol=nn))),
             c("id",paste0("V",1:nn)))
#should be a smarter (read: more easily generalized) way to produce this,
# but it's eluding me for now...
#basically, this generates the indices to include when subsetting
x<-cbind(rep(c(0,1),each=128),
rep(rep(c(0,1),each=64),2),
rep(rep(c(0,1),each=32),4),
rep(rep(c(0,1),each=16),8),
rep(rep(c(0,1),each=8),16),
rep(rep(c(0,1),each=4),32),
rep(rep(c(0,1),each=2),64),
rep(c(0,1),128)) *
t(matrix(rep(1:nn),2^nn,nrow=nn))
#now get the correct column names for each subset
# by subscripting the nonzero elements
incl<-lapply(1:(2^nn),function(y){paste0("V",1:nn)[x[y,][x[y,]!=0]]})
#now subset the data.table for each subset
ans<-lapply(1:(2^nn),function(y){dt[,incl[[y]],with=F]})
You said you wanted some statistics from each subset, in which case it may be more useful to instead specify the last line as:
ans2<-lapply(1:(2^nn),function(y){unlist(dt[,incl[[y]],with=F])})
#exclude the first row, which is null
means<-lapply(2:(2^nn),function(y){mean(ans2[[y]])})
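One possibly more general way to build the inclusion indices is expand.grid, sketched here in base R. The names `include` and `varnames` are introduced only for this illustration, and the rows come out in a different order than in the hand-built matrix x:

```r
nn <- 8L
varnames <- paste0("V", 1:nn)

# One row per subset, one logical column per variable (TRUE = include)
include <- as.matrix(expand.grid(rep(list(c(FALSE, TRUE)), nn)))

# Column names for each subset; the first row is the empty subset
incl <- lapply(seq_len(nrow(include)), function(i) varnames[include[i, ]])

length(incl)  # 256, i.e. 2^nn
```

Changing nn is then enough to handle a different number of variables.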
I am new to R and trying to use wilcox.test on my data : I have a dataframe 36021X246 with rownames as probeIDs and the last row is a label which indicates which group the samples belong to - "control" for the first 140 and "treated" for the last 106.
I would greatly appreciate knowing how to define the two groups when I perform the test....I am unable to find much information on the "formula" argument online except that -
"formula
a formula of the form lhs ~ rhs where lhs is a numeric variable giving the data values and rhs a factor with two levels giving the corresponding groups."
If someone could explain what lhs~rhs means and how to define this formula I would really appreciate it.
Thanks!
R typically assumes that each row is a case and the columns are associated variables. If the cases from both your samples occur in the same data frame, one column would be an indicator variable for sample membership. Let's call it IndSample. The Wilcoxon is a univariate test, so you would have another column containing the response values you are testing on. Let's call it Y. In the formula, the left-hand side (lhs) is the response and the right-hand side (rhs) is the grouping factor, so you write
wilcox.test(Y ~ IndSample, data=MyData, .....)
and the rest of your parameters for the test: is it two-sided? Do you want an exact statistic? (Probably not, in your case.)
It looks to me as if your data is on its side. That's problematic with a data frame, since you can't just pull out a row from a data frame, the way you would with a matrix.
You need to grab the last row and turn it into a factor - something like
IndSample <- factor(unlist(MyData[lastrow,]))
Then pull out the row that contains your response:
Y <- as.numeric(unlist(MyData[ResponseRow,]))
Then run the Wilcoxon test on Y and that factor.
However, I am not sure that I have properly understood your situation. That seems to be a very large data matrix for a modest wilcoxon test.
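If the goal is one test per probe, a small sketch along these lines may be closer to what you need. The matrix `expr`, the group sizes and the probe names are invented for illustration; the real data would have 140 control and 106 treated samples:

```r
set.seed(1)

# Hypothetical "sideways" data: 3 probes (rows) by 10 samples (columns)
expr <- matrix(rnorm(30), nrow = 3,
               dimnames = list(paste0("probe", 1:3), paste0("s", 1:10)))
group <- factor(rep(c("control", "treated"), times = c(6, 4)))

# One probe, in the usual layout: a row per sample, a column per variable
one_probe <- data.frame(Y = expr["probe1", ], IndSample = group)
wilcox.test(Y ~ IndSample, data = one_probe)

# The same test for every probe, collecting the p-values
p_values <- apply(expr, 1, function(v) wilcox.test(v ~ group)$p.value)
p_values
```

With tens of thousands of probes, the resulting p-values would normally need a multiple-testing correction, for example with p.adjust.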
Hopefully this has an easy answer I just haven't been able to find:
I am trying to write a simulation that will compare a number of statistical procedures on different subsets of rows (subjects) and columns (variables) of a large matrix.
Subsets of rows was fairly easy using a sample() of the subject ID numbers, but I am running into a little more trouble with columns.
Essentially, what I'd like to be able to do is create a random sample of column index numbers which will then be used to create a new matrix. What's got me the closest so far is:
testmat <- matrix(rnorm(100000),nrow=1000,ncol=100)
column.ind <- sample(3:100,20)
teststr <- paste("testmat[,",column.ind,"]",sep="",collapse=",")
which gives me a string that has a testmat[,column.ind] for every sampled index number. Is there any way to easily plug that into a cbind() function to make a new matrix? Is there any other obvious way I'm missing?
I've been able to do it using a loop (i.e. cbind(matrix,newcolumn) over and over), but that's fairly slow as the matrix I'm using is quite large and I will be doing this many times. I'm hoping there's a couple-line solution that's more elegant and quicker.
Have you tried testmat[, column.ind]?
Rows and columns can be indexed in the same way with logical vectors, a set of names, or numbers for indexes.
See here for an example: http://ideone.com/EtuUN.
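For completeness, a short sketch of sampling rows and columns in a single indexing step. The name `row.ind` and the sample sizes are assumptions added for illustration:

```r
set.seed(1)
testmat <- matrix(rnorm(100000), nrow = 1000, ncol = 100)

# Sample 20 of columns 3..100 and 50 of the 1000 rows
column.ind <- sample(3:100, 20)
row.ind <- sample(nrow(testmat), 50)

# A single indexing call builds the new matrix; no loop or cbind() needed
submat <- testmat[row.ind, column.ind]
dim(submat)  # 50 20
```

The columns of `submat` appear in the order given by `column.ind`, so the sampled indices can be kept to trace each column back to the original matrix.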