how to find common columns between 2 data set in r? - r

I have two data sets: "datExprSTLMS" which its dimension is 53*17237 and "datExprSTF" which its dimension is 99*22144. In two data sets, some columns(gene_names) are common. Based on using match() between colnames of two data sets I have founded 15711(TRUE) gene_name as intersecting genes between them. Now, I would like to provide a subset of "datExprSTLMS" so that the dimension of "datExprSTLMS" will be 53*15711. For this purpose I wrote below code:
dim(datExprSTF)
#[1] 99 22144
dim(datExprSTLMS)
#[1] 53 17237
TCGA2STF <- match(colnames(datExprSTLMS), colnames(datExprSTF))
table(is.finite(TCGA2STF))
#FALSE TRUE
#1526 15711
#delete NA(mismatch gene_names which in my case are 1526)
TCGA2STF_final <- Filter(function(x)!all(is.na(x)), TCGA2STF)
datExprSTLMS_final <- as.data.frame(datExprSTLMS[,TCGA2STF_final])
but after running the last line of my code I get below Error:
Error in datExprSTLMS[, TCGA2STF_final] : subscript out of bounds
I write my code in the R language. I need to guide

We can use intersect to find common columns between two data sets and then use them to subset datExprSTLMS
datExprSTLMS[, intersect(colnames(datExprSTLMS), colnames(datExprSTF))]

Related

Performing HCPC on the columns (i.e. variables) instead of the rows (i.e. individuals) after (M)CA

I would like to perform a HCPC on the columns of my dataset, after performing a CA. For some reason I also have to specify at the start, that all of my columns are of type 'factor', just to loop over them afterwards again and convert them to numeric. I don't know why exactly, because if I check the type of each column (without specifying them as factor) they appear to be numeric... When I don't load and convert the data like this, however, I get an error like the following:
Error in eigen(crossprod(t(X), t(X)), symmetric = TRUE) : infinite or
missing values in 'x'
Could this be due to the fact that there are columns in my dataset that only contain 0's? If so, how come that it works perfectly fine by reading everything in first as factor and then converting it to numeric before applying the CA, instead of just performing the CA directly?
The original issue with the HCPC, then, is the following:
# read in data; 40 x 267 data frame
data_for_ca <- read.csv("./data/data_clean_CA_complete.csv",row.names=1,colClasses = c(rep('factor',267)))
# loop over first 267 columns, converting them to numeric
for(i in 1:267)
data_for_ca[[i]] <- as.numeric(data_for_ca[[i]])
# perform CA
data.ca <- CA(data_for_ca,graph = F)
# perform HCPC for rows (i.e. individuals); up until here everything works just fine
data.hcpc <- HCPC(data.ca,graph = T)
# now I start having trouble
# perform HCPC for columns (i.e. variables); use their coordinates that are stocked in the CA-object that was created earlier
data.cols.hcpc <- HCPC(data.ca$col$coord,graph = T)
The code above shows me a dendrogram in the last case and even lets me cut it into clusters, but then I get the following error:
Error in catdes(data.clust, ncol(data.clust), proba = proba, row.w =
res.sauv$call$row.w.init) : object 'data.clust' not found
It's worth noting that when I perform MCA on my data and try to perform HCPC on my columns in that case, I get the exact same error. Would anyone have any clue as how to fix this or what I am doing wrong exactly? For completeness I insert a screenshot of the upper-left corner of my dataset to show what it looks like:
Thanks in advance for any possible help!
I know this is old, but because I've been troubleshooting this problem for a while today:
HCPC says that it accepts a data frame, but any time I try to simply pass it $col$coord or $colcoord from a standard ca object, it returns this error. My best guess is that there's some metadata it actually needs/is looking for that isn't in a data frame of coordinates, but I can't figure out what that is or how to pass it in.
The current version of FactoMineR will actually just allow you to give HCPC the whole CA object and tell it whether to cluster the rows or columns. So your last line of code should be:
data.cols.hcpc <- HCPC(data.ca, cluster.CA = "columns", graph = T)

Is there a way to run a wilcoxon test for variables with different lengths?

I am trying to run a wilcox.test() on two subsets of data from a data frame. They are not of equal length (48 vs. 260). I want to see if there is a difference between the dbh (diameter at breast height) of live oak trees and water oak trees.
Pine_stand <- read.csv("Pine_stand.csv")
live_oaks <- subset(Pine_stand,Species=="live oak",select=c("dbh"));live_oaks
water_oaks <- subset(Pine_stand,Species=="water oak",select=c("dbh"));water_oaks
wilcox.test(live_oaks~water_oaks,conf.int=T,correct=F)
Error in model.frame.default(formula = live_oaks ~ water_oaks) :
invalid type (list) for variable 'live_oaks'
that was my first attempt then I tried this
Pine_stand <- read.csv("Pine_stand.csv")
live_dbh <- subset(Pine_stand,Species=="live oak",select=c("dbh"));live_oaks
water_dbh <- subset(Pine_stand,Species=="water oak",select=c("dbh"));water_oaks
oaks<-c(live_dbh,water_dbh)
wilcox.test(dbh~Species,data=oaks)
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, :
arguments imply differing number of rows: 48, 260
>
and received that error. I have tried vectorizing the two groups and appending and tapply ... I know there is a simple answer I am overlooking, I just can't get it to work. All of the examples I am reading are comparing two vectors with the same length. I know I can do the Wilcoxon test by hand when there are different numbers, so there should be a way. Any advice is welcome.
Yes, you can run a wilcox.test for variables of different length. As stated in http://www.r-tutor.com/elementary-statistics/non-parametric-methods/mann-whitney-wilcoxon-test
“Using the Mann-Whitney-Wilcoxon Test, we can decide whether the
population distributions are identical without assuming them to follow
the normal distribution.”
Therefore it’s a non-parametric equivalent of the t-test that we can use, when the assumptions for the t-test are not met (for example distribution is not normal or variances in two samples are not equal).
The problem in your code is that with these two statements:
live_dbh <- subset(Pine_stand,Species=="live oak",select=c("dbh"))
water_dbh <- subset(Pine_stand,Species=="water oak",select=c("dbh"))
you are creating two vectors that contain only dph values, but you lose information about the labels (Species). Therefore you should write:
live_dbh <- subset(Pine_stand,Species=="live oak",select=c("dbh", “Species”))
water_dbh <- subset(Pine_stand,Species=="water oak",select=c("dbh", “Species”))
Secondly when you are trying two merge the two sets with this code:
oaks<-c(live_dbh,water_dbh)
instead of creating a data frame you create a list. Why is that happening? First, as we can read from documentation for c(), its name stands for “Combine Values into a Vector or List”. Probably you have already used it to merge two vectors into one. However in case of subset function it actually gives as a result one column data-frame and not a vector. Therefore our live_dbh and water_dbh sets are data frames (and now with the label they even have two columns).
In case of one column data-frame you can always use c() function with recursive parameter set to TRUE to merge them:
total<-c(one_column_df1, one_column_df2, recursive=TRUE)
However it’s usually safer to use rbind function (and it’s also the only function that will work in case we are merging data frames with more than one column). Rbind stands for row bind.
oaks<-rbind(live_dbh,water_dbh)
Now you should be able to run a wilcox.test:
wilcox.test(dbh~Species,data=oaks)
How about
wilcox.test(dbh~Species, data=Pine_stand,
subset=(Species %in% c("live oak", "water oak"))
? (If these are the only two species in your data set, you don't need the subset argument.)

R unique function vs eliminating duplicated values

I have a data frame and an info file (also in table format) that describes the data within the data frame. The row names of the data frame need to be relabelled according to information within the info file. The problem is that the information corresponding to the data frame row names, in the info file, contains lots of duplicated values. Hence it is necessary to convert the df to a matrix such that the row names can have duplicate values.
matrix1<-as.matrix(df)
ptr<-match(rownames(matrix1), info_file$Array_Address_Id)
rownames(matrix1)<-info_file$ILMN_Gene[ptr]
matrix1<-matrix1[!duplicated(rownames(E.rna_conditions_cleaned)), ]
The above is my own code however a friend gave me some code with a similar goal but different results:
u.genes <- unique(info_file$ILMN_Gene)
ptr.u.genes <- match( u.genes, info_file$ILMN_Gene )
matrix2 <- as.matrix(df[ptr.u.genes,])
rownames(matrix2) <- u.genes
The problem is that these two strategies output different results:
> dim(matrix1)
[1] 30783 565
> dim(matrix2[,ptr.use])
[1] 34694 565
See above matrix2 has ~4000 more rows than the other.
As you can see the row names of the below output are indeed unique but that doesn't tell why the two methods selected different rows but which method is better and why is the output different?
U.95 JIA.65 DV.93 KD.76 HC.54 KD.77
7A5 5.136470 5.657738 5.122299 5.195540 5.378040 4.997210
A1BG 6.166210 6.210373 6.382051 6.494048 5.888900 5.914070
A1CF 5.222130 4.940529 4.715292 5.182658 4.510937 5.060749
A26C3 5.410403 5.148601 5.122299 3.967419 4.780758 4.868472
A2BP1 5.725115 4.817920 5.483607 5.444427 5.503358 5.121951
A2LD1 6.505271 6.558276 5.494096 4.833267 6.988192 6.082662
I need to know this because I wan the row values that will yield the most accurate downstream analysis by having the row values that are best.

Bandwidth selection using NP package

New to R and having problem with a very simple task! I have read a few columns of .csv data into R, the contents of which contains of variables that are in the natural numbers plus zero, and have missing values. After trying to use the non-parametric package, I have two problems: first, if I use the simple command bw=npregbw(ydat=y, xdat=x, na.omit), where x and y are column vectors, I get the error that "number of regression data and response data do not match". Why do I get this, as I have the same number of elements in each vector?
Second, I would like to call the data ordered and tell npregbw this, using the command bw=npregbw(ydat=y, xdat=ordered(x)). When I do that, I get the error that x must be atomic for sort.list. But how is x not atomic, it is just a vector with natural numbers and NA's?
Any clarifications would be greatly appreciated!
1) You probably have a different number of NA's in y and x.
2) Can't be sure about this, since there is no example. If it is of following type:
x <- c(3,4,NA,2)
Then ordered(x) should work fine. Please provide an example of your case.
EDIT: You of course tried bw=npregbw(ydat=y, xdat=x)? ordered() makes your vector an ordered factor (see ?ordered), which is not an atomic vector (see 2.1.1 link and ?factor)
EDIT2: So the problem was the way of subsetting data. Note the difference in various ways of subsetting. data$x and data[,i] (where i = column number of column x) give you vectors, while data[c("x")] and data[i] give a data frame. Functions expect vectors, unless they call for data = (your data). In that case they work with column names

perform function on pairs of columns

I am trying to run some Monte Carlo simulations on animal position data. So far, I have sampled 100 X and Y coordinates, 100 times. This results in a list of 200. I then convert this list into a dataframe that is more condusive to eventual functions I want to run for each sample (kernel.area).
Now I have a data frame with 200 columns, and I would like to perform the kernel.area function using each successive pair of columns.
I can't reproduce my own data here very well, so I've tried to give a basic example just to show the structure of the data frame I'm working with. I've included the for loop I've tried so far, but I am still an R novice and would appreciate any suggestions.
# generate dataframe representing X and Y positions
df <- data.frame(x=seq(1:200),y=seq(1:200))
# 100 replications of sampling 100 "positions"
resamp <- replicate(100,df[sample(nrow(df),100),])
# convert to data frame (kernel.area needs an xy dataframe)
df2 <- do.call("rbind", resamp[1:2,])
# xy positions need to be in columns for kernel.area
df3 <- t(df2)
#edit: kernel.area requires you have an id field, but I am only dealing with one individual, so I'll construct a fake one of the same length as the positions
id=replicate(100,c("id"))
id=data.frame(id)
Here is the structure of the for loop I've tried (edited since first post):
for (j in seq(1,ncol(df3)-1,2)) {
kud <- kernel.area(df3[,j:(j+1)],id=id,kern="bivnorm",unin=c("m"),unout=c("km2"))
print(kud)
}
My end goal is to calculate kernel.area for each resampling event (ie rows 1:100 for every pair of columns up to 200), and be able to combine the results in a dataframe. However, after running the loop, I get this error message:
Error in df[, 1] : incorrect number of dimensions
Edit: I realised my id format was not the same as my data frame, so I change it and now have the error:
Error in kernelUD(xy, id, h, grid, same4all, hlim, kern, extent) :
id should have the same length as xy
First, a disclaimer: I have never worked with the package adehabitat, which has a function kernel.area, which I assume you are using. Perhaps you could confirm which package contains the function in question.
I think there are a couple suggestions I can make that are independent of knowledge of the specific package, though.
The first lies in the creation of df3. This should probably be
df3 <- t(df2), but this is most likely correct in your actual code
and just a typo in your post.
The second suggestion has to do with the way you subset df3 in the
loop. j:j+1 is just a single number, since the : has a higher
precedence than + (see ?Syntax for the order in which
mathematical operations are conducted in R). To get the desired two
columns, use j:(j+1) instead.
EDIT:
When loading adehabitat, I was warned to "Be careful" and use the related new packages, among which is adehabitatHR, which also contains a function kernel.area. This function has slightly different syntax and behavior, but perhaps it would be worthwhile examining. Using adehabitatHR (I had to install from source since the package is not available for R 2.15.0), I was able to do the following.
library(adehabitatHR)
for (j in seq(1,ncol(df3)-1,2)) {
kud <-kernelUD(SpatialPoints(df3[,j:(j+1)]),kern="bivnorm")
kernAr<-kernel.area(kud,unin=c("m"),unout=c("km2"))
print(kernAr)
}
detach(package:adehabitatHR, unload=TRUE)
This prints something, and as is mentioned in a comment below, kernelUD() is called before kernel.area().

Resources