Select multiple observations in a matrix based on a specific condition

I am very new to R but need to use it to run the analyses for my clinical doctorate thesis, so apologies in advance if this is a novice question.
I have a matrix of beta methylation values with dimensions 485577 x 894. The row names are CpG site IDs, which are non-numeric and in no particular order (e.g. "cg00000029" "cg00000108" "cg00000109" "cg00000165"), and the column names are participant IDs, which are also in no particular order (e.g. "11209" "14140" "1260" "5414").
I would like to identify which beta methylation values are > 0.5 so that I can exclude them from further analyses. The data need to stay in matrix format. All of my attempts so far have returned integer variables rather than the data in matrix format.
I would be so grateful if someone could advise me on the code to run this analysis.
Thank you for your time.
Cheers,
Alicia

set.seed(1) # so example is reproducible
m <- matrix(runif(1000,0,0.6),nrow=100) # 100 rows X 10 cols, data in U[0,0.6]
m[m>0.5]<-NA # anything > 0.5 set to NA
z <- na.omit(m) # remove all rows with any NA's
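If the goal is only to mark the high values while keeping every row, something like the sketch below may be closer to what is needed. The small matrix and its dimnames are made up for illustration (the real data are 485577 x 894). It shows that m[m > 0.5] on its own collapses to a plain vector, while assigning NA through the same logical index leaves the matrix shape and names intact.

```r
set.seed(1)
# Made-up stand-in for the real data: 5 CpG sites x 4 participants
m <- matrix(runif(20, 0, 0.6), nrow = 5,
            dimnames = list(paste0("cg", 1:5),
                            c("11209", "14140", "1260", "5414")))

high <- m > 0.5                        # logical matrix, same dimensions as m
hits <- which(high, arr.ind = TRUE)    # row/column positions of values > 0.5

values_only <- m[high]                 # extraction drops to a plain vector

masked <- m                            # work on a copy
masked[high] <- NA                     # assignment keeps dim and dimnames

dim(masked)                            # still 5 x 4
hits                                   # which cells were masked
```

Note that na.omit() drops every row containing at least one NA. With 894 columns that could remove most CpG sites, so keeping the NA-masked matrix and using na.rm = TRUE in later steps may be the safer choice.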

Related

OasisR- RStudio Duncan index

I'm very new to R and RStudio and not great with programming and stats. I need to calculate a dissimilarity index, and I'm trying to use the OasisR package. The function ISDuncan(x) computes an index for every population group, but it does so for the entire data.frame at once. I need a calculation for each observation and each population group. According to:
# https://github.com/cran/OasisR/blob/419f40ff60eb1756a2b8ed0960c5c9e8cb90368d/R/SegFunctions.R
ISDuncan <- function(x) {
x <- segdataclean(as.matrix(x))$x
result <- vector(length = ncol(x))
for (i in 1:ncol(x))
result[i] <- 0.5 * sum(abs((x[,i]/sum(x[,i])) - ((rowSums(x)-x[,i])/sum((rowSums(x)-x[,i])))))
return(round(result, 4))
}
Can anyone help me? Thanks!!!
Jor
Thanks SOOOO much! The results look good, but it only works for the first observation. How can I get a matrix or data frame with every observation I have in the data.frame? I'm planning to do this for a large number of observations.
Anyway, this is VERY helpful!!!

Thank you for your interest and help, I'm very lost. I have an Excel sheet where every row is a district of a province, and I want to know the ISDuncan value for social groups A, B and C in each district.
What I hope to get is a matrix where the rows represent the districts and the columns the social groups, so that every district has its own ISDuncan values. For now I'm trying with a small dataset, but I will need to do this analysis for a large number of spatial units. Thanks!
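One possible reading of the request is a table of each district's term in the sum that ISDuncan() computes, so that the column totals give back the per-group index. The sketch below is base R only and makes several assumptions: the counts are invented, is_duncan_terms() is a hypothetical helper (not part of OasisR), and it skips the segdataclean() step. It is not a per-district segregation index in its own right, since the index is defined across all spatial units.

```r
# Invented counts: rows are districts, columns are social groups A, B, C
x <- matrix(c(10, 20, 30,
              40, 10,  5,
              25, 25, 25),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("d1", "d2", "d3"), c("A", "B", "C")))

# Hypothetical helper: the per-district terms of the sum inside ISDuncan()
is_duncan_terms <- function(x) {
  x <- as.matrix(x)
  rest <- rowSums(x) - x                            # people outside each group, per district
  own_share  <- sweep(x, 2, colSums(x), "/")        # district share of each group
  rest_share <- sweep(rest, 2, colSums(rest), "/")  # district share of everyone else
  0.5 * abs(own_share - rest_share)
}

terms <- is_duncan_terms(x)
terms            # districts in rows, groups in columns
colSums(terms)   # per-group index, before the rounding ISDuncan() applies
```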

How to label CCA-Plot with row.names in R

I've been trying to solve the following problem which I am sure is an easy one (I am just not able to find a solution). I am using the package vegan and want to perform a cca that shows the actual row names as labels (instead of the default "sit1", "sit2", ...).
I created a data frame (ls_Treat1) with cast(), with plot treatments (AB, DB, DL etc.) as row names and species occurrences as columns. The data frame looks as follows:
     species 1  species 2  species 3
AB           0          3          1
DB           1          6          0
DL           3          4          2
I created the data frame with the following code to set the treatments (AB, DB, DL, ...) as row names:
ls_Treat1 <- cast(fungi_ls, Treatment ~ species)
row.names(ls_Treat1)<- ls_Treat1$Treatment
ls_Treat1 <- ls_Treat1[,-1]
When I perform a cca with the following code:
ca <- cca(ls_Treat1)
plot(ca,display="sites")
R puts the default labels "sit1", "sit2", ... into the plot, instead of the actual row names, even though I have performed it this way before and the plots normally showed the right labels. Does this have anything to do with my creating the data frame? I tried to change the treatments (characters) into numbers (integers or factors) but still, the plot won't be labelled with my row names.
Can anyone help me with this?
Thank you very very much!!

The problem is that reshape::cast() does not produce a plain data.frame. Its result claims to be a data.frame, but it is something else. We do matrix algebra in cca and therefore convert the input to a matrix. That works for a standard data.frame, but not for the object you supplied. In particular, when you remove the first column with ls_Treat1 <- ls_Treat1[,-1], you also remove the attributes that allow the names to be preserved. It would have worked without removing this column (if the reshape package was still loaded). Upgrading to the reshape2 package and using reshape2::acast() may be a solution.
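If installing reshape2 is not an option, a base R stand-in for acast() might look like the sketch below. It uses xtabs() rather than acast(), so it is a swapped-in technique, and the long-format layout of fungi_ls (including the count column and the short species names) is an assumption made for illustration. The result is an ordinary data.frame whose row names are the treatments.

```r
# Assumed long format: one row per treatment/species pair with a count
fungi_ls <- data.frame(
  Treatment = rep(c("AB", "DB", "DL"), each = 3),
  species   = rep(c("sp1", "sp2", "sp3"), times = 3),
  count     = c(0, 3, 1,
                1, 6, 0,
                3, 4, 2)
)

# Cross-tabulate, then convert the table into a plain data.frame
tab <- xtabs(count ~ Treatment + species, data = fungi_ls)
ls_Treat1 <- as.data.frame.matrix(tab)

class(ls_Treat1)       # "data.frame"
rownames(ls_Treat1)    # "AB" "DB" "DL"

# With vegan loaded, the plot should then pick up the row names:
# ca <- cca(ls_Treat1)
# plot(ca, display = "sites")
```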

sample() command is too slow in R

I want to create a random subset of a data.table df that is very large (around 2 million lines).
The data table has a weight column, wgt, that indicates how many observations each line represents.
To generate the vector of row numbers I want to extract, I proceed as follows:
I get the exact number of observations:
ns<- length(df$wgt)
I get the number of desired lines (30% of the sample):
lines<-round(0.3*ns)
I compute the vector of probabilities:
pr<-df$wgt/sum(df$wgt)
And then I compute the vector of line numbers to get the subsample:
ssout<-sample(1:ns, size=lines, prob=pr)
The final aim is to subset the data using df[ssout,]. However, R gets stuck when computing ssout.
Is there a faster/more efficient way to do this?
Thank you!

I'm guessing that df is a summary description of a data set that has repeated observations (with wgt being the count of repetitions). In that case, the only useful way to sample from it would be with replacement; and a proper 30% sample would be 30% of the real population, .3*sum(wgt):
# example data
wgt <- sample(10,2e6,replace=TRUE)
nobs<- sum(wgt)
pr <- wgt/sum(wgt)
# select rows
system.time(x <- sample.int(2e6,size=.3*nobs,prob=pr,replace=TRUE))
# user system elapsed
# 0.20 0.02 0.22
Sampling rows without replacement takes forever on my computer, but is also something that I don't think one needs to do here.
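To go from the sampled row numbers back to a subset of the table, one option is to count how often each row was drawn and use that count as the new weight. This is only a sketch: it uses a plain data.frame with made-up data rather than the asker's data.table, and the id column is hypothetical.

```r
set.seed(42)
# Made-up stand-in for the real table: 1000 rows with integer weights
df <- data.frame(id = 1:1000, wgt = sample(10, 1000, replace = TRUE))

# 30% of the underlying population, not 30% of the rows
n_draw <- round(0.3 * sum(df$wgt))

# Weighted sampling with replacement, as in the answer above
idx <- sample.int(nrow(df), size = n_draw, replace = TRUE,
                  prob = df$wgt / sum(df$wgt))

# Collapse repeated draws into one row each, with the draw count as weight
draws <- tabulate(idx, nbins = nrow(df))
sub <- df[draws > 0, ]
sub$wgt <- draws[draws > 0]

sum(sub$wgt) == n_draw   # the subsample represents n_draw observations
```

Collapsing the draws keeps the result in the same "one line per pattern, with a weight" layout as the input, which avoids materialising millions of duplicated rows.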

Add a vector as a single observation to a data.frame

I'm trying to save a number of spectral measurements in a data.frame. Each measurement has a number of attributes as well as two channels of spectral data, each with 2048 data points. I would like to have each channel be a single point of data in the data frame.
Something like this:
timestamp type integration channel1 channel2
1 2011-10-02 02:00:01 D 2000 (spec) (spec)
2 2011-10-02 02:00:07 D 2000 (spec) (spec)
Where each (spec) is a vector of 2048 values. I simply cannot get it to work, and I now turn to you guys for help.
Thanks in advance.

You can add a matrix as one of the data.frame's columns, so you have to store each vector as a row of that matrix.
DF <- data.frame(timestamp=1:3, type=LETTERS[1:3], integration=rep(2000, 3))
DF$channel1 <- matrix(rnorm(3*2048), nrow=3)
DF$channel2 <- matrix(rnorm(3*2048), nrow=3)
ncol(DF)# == 5
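As a rough illustration of working with such a matrix column, the sketch below rebuilds a similar data frame with only one channel and pulls single spectra back out. The values are random, as in the answer.

```r
DF <- data.frame(timestamp = 1:3, type = LETTERS[1:3], integration = rep(2000, 3))
DF$channel1 <- matrix(rnorm(3 * 2048), nrow = 3)   # one spectrum per row

spec2 <- DF$channel1[2, ]      # spectrum of the second measurement
length(spec2)                  # 2048

# Row subsetting of the data frame carries the matrix column along
sub <- DF[DF$type == "B", ]
dim(sub$channel1)              # 1 x 2048
```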

I think what you want is doable but I may not be fully understanding your question. Heed Joris's suggestion though as this may be a better way of storing your data. You can accomplish what you want by storing the vectors of 2048 values in a list that you then add to the data frame as a column. Your table wasn't easily imported (for me anyway) with read.table so I made up my own data frame and example.
DF <- data.frame(timestamp=1:3, type=LETTERS[1:3], integration=rep(2000, 3))
DF$channel1 <- list(c(rnorm(2048)), c(rnorm(2048)), c(rnorm(2048)))
DF$channel2 <- list(c(rnorm(2048)), c(rnorm(2048)), c(rnorm(2048)))
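With the list-column layout, each spectrum comes back out with [[ ]], and per-spectrum summaries can be computed with sapply(). The sketch below assumes the same made-up data frame and uses replicate() only as a shorter way to build the list. The mean1 column is hypothetical.

```r
DF <- data.frame(timestamp = 1:3, type = LETTERS[1:3], integration = rep(2000, 3))
# A list of three numeric vectors, one per row of the data frame
DF$channel1 <- replicate(3, rnorm(2048), simplify = FALSE)

spec2 <- DF$channel1[[2]]                # spectrum of the second measurement
length(spec2)                            # 2048

DF$mean1 <- sapply(DF$channel1, mean)    # one summary value per spectrum
```

Unlike the matrix approach, a list column also copes with spectra of different lengths.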

Error when using mshapiro.test: "U[] is not a matrix with number of columns (sample size) between 3 and 5000"

I am trying to perform a multivariate test for normality on some density data from five sites, using mshapiro.test from the mvnormtest package. Each site is a column, and densities are below. It is 5 columns and 5 rows, with the top row as the header (site names). Here is how I loaded my data:
datafilename="/Users/megsiesiple/Documents/Lisa/lisadensities.csv"
data.nc5=read.csv(datafilename,header=T)
attach(data.nc5)
The data look like this:
B07 B08 B09 B10 M
1 72571.43 17714.29 3142.86 22571.43 8000.00
2 44571.43 46857.14 49142.86 16857.14 7142.86
3 54571.43 44000.00 26571.43 6571.43 17714.29
4 57714.29 38857.14 32571.43 2000.00 5428.57
When I call mshapiro.test() for data.nc5 I get this message: Error in mshapiro.test(data.nc5) :
U[] is not a matrix with number of columns (sample size) between 3 and 5000
I know that to perform a Shapiro-Wilk test using mshapiro.test(), the data has to be in a numeric matrix, with a number of columns between 3 and 5000. However, even when I make the .csv a matrix with only numbers (i.e., when I omit the Site names), I still get the error. Do I need to set up the matrix differently? Has anyone else had this problem?
Thanks!

You need to transpose the data in a matrix, so that your variables are in rows, and observations in columns. The command will be :
M <- t(data.nc5[1:4,1:5])
mshapiro.test(M)
It works for me this way. The labels in the first row should be recognized during the import, so the data will start from row 1. Otherwise, there will be a "missing value" error.

If you read the numeric matrix into R via read.csv(), using code similar to what you show, it will be read in as a data frame, and that is not a matrix.
Try
mat <- data.matrix(data.nc5)
mshapiro.test(mat)
(Not tested as you don't give a reproducible example and it is late-ish in my time zone now ;-)
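The two answers can be combined into a small check of what is actually being passed to the test. The sketch below re-creates the posted values with read.table(text = ...) instead of the CSV file, which is an assumption about how the file looks after import, and leaves the mshapiro.test() call commented out because it needs the mvnormtest package. As far as I understand it, the error message means the function treats columns as observations, which is why the first answer transposes.

```r
txt <- "B07 B08 B09 B10 M
1 72571.43 17714.29 3142.86 22571.43 8000.00
2 44571.43 46857.14 49142.86 16857.14 7142.86
3 54571.43 44000.00 26571.43 6571.43 17714.29
4 57714.29 38857.14 32571.43 2000.00 5428.57"

data.nc5 <- read.table(text = txt, header = TRUE)
is.matrix(data.nc5)            # FALSE: read.table/read.csv return a data frame

mat <- data.matrix(data.nc5)   # numeric matrix, 4 observations x 5 sites
M <- t(mat)                    # sites in rows, observations in columns
dim(M)                         # 5 x 4

# library(mvnormtest)
# mshapiro.test(M)
```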
