R unique() function vs eliminating duplicated values

I have a data frame and an info file (also in table format) that describes the data within the data frame. The row names of the data frame need to be relabelled according to information within the info file. The problem is that the information corresponding to the data frame row names, in the info file, contains lots of duplicated values. Hence it is necessary to convert the df to a matrix such that the row names can have duplicate values.
matrix1<-as.matrix(df)
ptr<-match(rownames(matrix1), info_file$Array_Address_Id)
rownames(matrix1)<-info_file$ILMN_Gene[ptr]
matrix1<-matrix1[!duplicated(rownames(matrix1)), ]
The above is my own code however a friend gave me some code with a similar goal but different results:
u.genes <- unique(info_file$ILMN_Gene)
ptr.u.genes <- match( u.genes, info_file$ILMN_Gene )
matrix2 <- as.matrix(df[ptr.u.genes,])
rownames(matrix2) <- u.genes
The problem is that these two strategies output different results:
> dim(matrix1)
[1] 30783 565
> dim(matrix2[,ptr.use])
[1] 34694 565
See above matrix2 has ~4000 more rows than the other.
As you can see from the output below, the row names are indeed unique, but that does not explain why the two methods selected different rows. Which method is better, and why do the outputs differ?
U.95 JIA.65 DV.93 KD.76 HC.54 KD.77
7A5 5.136470 5.657738 5.122299 5.195540 5.378040 4.997210
A1BG 6.166210 6.210373 6.382051 6.494048 5.888900 5.914070
A1CF 5.222130 4.940529 4.715292 5.182658 4.510937 5.060749
A26C3 5.410403 5.148601 5.122299 3.967419 4.780758 4.868472
A2BP1 5.725115 4.817920 5.483607 5.444427 5.503358 5.121951
A2LD1 6.505271 6.558276 5.494096 4.833267 6.988192 6.082662
I need to know this because I want to keep the rows that will yield the most accurate downstream analysis.
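For what it's worth, the gap can be reproduced on a toy example. The sketch below (all probe and gene names invented) shows that !duplicated() de-duplicates among the rows actually present in the matrix, while unique() on the full annotation file also counts genes that have no row in the data at all, which is one plausible source of the ~4000-row difference:

```r
# Toy annotation table: gene "B" is duplicated, gene "D" has no probe in the data
info <- data.frame(Array_Address_Id = c("p1", "p2", "p3", "p4"),
                   ILMN_Gene        = c("A",  "B",  "B",  "D"),
                   stringsAsFactors = FALSE)

# Toy expression matrix: only probes p1-p3 are present
m <- matrix(1:6, nrow = 3, dimnames = list(c("p1", "p2", "p3"), c("s1", "s2")))

# Strategy 1: relabel rows, then drop duplicated row names (keeps first occurrence)
ptr <- match(rownames(m), info$Array_Address_Id)
rownames(m) <- info$ILMN_Gene[ptr]
m1 <- m[!duplicated(rownames(m)), , drop = FALSE]
nrow(m1)         # 2 genes: A and B

# Strategy 2: unique genes taken from the whole annotation file
u.genes <- unique(info$ILMN_Gene)          # includes "D", which has no row in m
ptr.u   <- match(u.genes, info$ILMN_Gene)  # positions in info, not in m!
length(u.genes)  # 3 genes: A, B, D
```

So strategy 2 can both over-count (annotation-only genes) and mis-index (ptr.u.genes are row positions in the info file, not in df), which is worth checking before trusting either row count.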


Create a histogram of specific columns and rows from a `data.frame` in R

## my data frame
crime = read.csv("url")
## specific columns that need to be represented
property_crime = crime$Burglary + crime$Theft + crime$`Motor Vehical Theft`
## the rows that I am looking for have the name "harris" within the column named "county_name"
## my attempt
with(crime, hist(harris))
## Error in hist(harris) : object 'harris' not found
Not sure why I am getting object 'harris' not found as that is the name under the county_name column. I'm new to R, could someone walk me through the process of displaying a histogram only including the values of specific columns and specific rows?
the rows that I am looking for have the name "harris" within the column named "county_name"
You have to tell R the same logic that you are telling us.
There are several ways of doing this in R, but here is the base R way.
We can index a data frame like data.frame[rows, columns]. To get the Harris rows, build a logical index: crime$county_name == "harris" (note the quotes; without them R looks for an object named harris, which is exactly the error you got). hist() then needs a numeric column, so select one of the crime counts for those rows, for example:
hist(crime[crime$county_name == "harris", "Burglary"])
You don't provide a reproducible example, but you can check the same logic with the built-in mtcars dataset. Here I am making a histogram of the cars with mpg > 15:
hist(mtcars[mtcars$mpg > 15, "mpg"])
# this is another option that produces the same result
# hist(mtcars$mpg[mtcars$mpg >15])
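Applying the same logic to the question's property-crime sum, a minimal sketch (the crime data frame below is invented; only the column names are taken from the question):

```r
# Made-up data mimicking the structure described in the question
crime <- data.frame(county_name = c("harris", "travis", "harris", "bexar"),
                    Burglary = c(10, 5, 7, 3),
                    Theft = c(20, 8, 15, 6))
crime$`Motor Vehical Theft` <- c(4, 2, 3, 1)

# Keep only the Harris rows, then sum the three numeric columns
harris <- crime[crime$county_name == "harris", ]
property_crime <- harris$Burglary + harris$Theft + harris$`Motor Vehical Theft`
property_crime        # 34 25
# hist(property_crime)  # plot once there are enough rows to be meaningful
```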

Dynamically change part of variable name in R

I am trying to automate some post-hoc analysis, and I will explain what I am trying to do with a simplified example.
Suppose I have two character vectors, one of names and one of adjectives:
list1 <- c("apt", "farm", "basement", "lodge")
list2 <- c("tiny", "noisy")
Let's also suppose I have a data frame whose columns are named something like this, as they are the results of a previous linear analysis:
> head(df)
qt[apt_tiny,Intercept] qt[apt_noisy,Intercept] qt[farm_tiny,Intercept]
1 4.196321 -0.4477012 -1.0822793
2 3.231220 -0.4237787 -1.1433449
3 2.304687 -0.3149331 -0.9245896
4 2.768691 -0.1537728 -0.9925387
5 3.771648 -0.1109647 -0.9298861
6 3.370368 -0.2579591 -1.0849262
and so on...
Now I want to loop over the combinations so that the strings from the two vectors are substituted dynamically. I have made a data frame, distinct, with all the distinct combinations, and I am trying to do something like this:
for (i in 1:nrow(distinct)){
var1[[i]] <- list1[[i]]
var2[[i]] <- list2[[i]]
#this being the insertable name part for the rest of the variables and parts of variable,
#i'll put it inside %var[[i]]% for the sake of the explanation.
%var1[[i]]%_%var2[[i]]%_INT <- df$`qt[%var1[[i]]%_%var2[[i]]%,Intercept]`+ df$`qt[%var1[[i]]%,Intercept]`
}
The difficult thing for me here is %var1[[i]]% is at the same time inside a variable and as the name of a column inside a data frame.
Any help would be much appreciated.
You cannot use $ to extract column values with a character variable. So df$`qt[%var1[[i]]%_%var2[[i]]%,Intercept]` will not work.
Create the name of the column using sprintf and use [[ to extract it. For example, to construct "qt[apt_tiny,Intercept]" as the column name you can do:
i <- 1
sprintf('qt[%s_%s,Intercept]', list1[i], list2[i])
#[1] "qt[apt_tiny,Intercept]"
Now use [[ to subset that column from df
df[[sprintf('qt[%s_%s,Intercept]', list1[i], list2[i])]]
You can do the same for other columns.
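Putting the pieces together, the whole loop might look like the sketch below. The data frame, its values, and the result list are invented to match the question's naming pattern:

```r
list1 <- c("apt", "farm")
list2 <- c("tiny", "noisy")
distinct <- expand.grid(name = list1, adj = list2, stringsAsFactors = FALSE)

# Toy data frame carrying the bracketed column names from the question
df <- data.frame(matrix(rnorm(6 * 5), nrow = 5))
names(df) <- c(sprintf("qt[%s_%s,Intercept]", distinct$name, distinct$adj),
               sprintf("qt[%s,Intercept]", list1))

# Build each "<name>_<adj>_INT" sum with sprintf() and [[ indexing
results <- list()
for (i in seq_len(nrow(distinct))) {
  int_col  <- sprintf("qt[%s_%s,Intercept]", distinct$name[i], distinct$adj[i])
  main_col <- sprintf("qt[%s,Intercept]", distinct$name[i])
  results[[sprintf("%s_%s_INT", distinct$name[i], distinct$adj[i])]] <-
    df[[int_col]] + df[[main_col]]
}
names(results)
```

Storing the sums in a named list avoids creating loose variables with assign(), which is generally easier to work with downstream.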

How to find common columns between 2 data sets in R?

I have two data sets: datExprSTLMS, with dimensions 53*17237, and datExprSTF, with dimensions 99*22144. Some columns (gene names) are common to both data sets. Using match() on the colnames of the two data sets, I found 15711 (TRUE) gene names intersecting between them. Now I would like to subset datExprSTLMS so that its dimensions become 53*15711. For this purpose I wrote the code below:
dim(datExprSTF)
#[1] 99 22144
dim(datExprSTLMS)
#[1] 53 17237
TCGA2STF <- match(colnames(datExprSTLMS), colnames(datExprSTF))
table(is.finite(TCGA2STF))
#FALSE TRUE
#1526 15711
#delete NA(mismatch gene_names which in my case are 1526)
TCGA2STF_final <- Filter(function(x)!all(is.na(x)), TCGA2STF)
datExprSTLMS_final <- as.data.frame(datExprSTLMS[,TCGA2STF_final])
but after running the last line of my code I get below Error:
Error in datExprSTLMS[, TCGA2STF_final] : subscript out of bounds
I am writing my code in R and would appreciate some guidance.
We can use intersect to find the column names common to both data sets and then use them to subset datExprSTLMS. (Your match() call returns positions in datExprSTF, which has more columns than datExprSTLMS, which is why subsetting with those positions runs out of bounds.)
datExprSTLMS[, intersect(colnames(datExprSTLMS), colnames(datExprSTF))]
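A minimal illustration with two toy matrices (invented names and values) of why intersect() avoids the out-of-bounds problem: it only ever returns names that exist in both objects, so they are always valid subscripts:

```r
a <- matrix(1:6, nrow = 2, dimnames = list(NULL, c("g1", "g2", "g3")))
b <- matrix(1:8, nrow = 2, dimnames = list(NULL, c("g2", "g3", "g4", "g5")))

# Names present in both matrices, in the order they appear in a
common <- intersect(colnames(a), colnames(b))
common        # "g2" "g3"

a_sub <- a[, common, drop = FALSE]
dim(a_sub)    # 2 2
```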

undefined columns selected (Bayesian analysis)

I am replicating some R code for a Bayesian analysis, but I get an error that I have tried to solve, including by reading other questions here, and it still does not work.
I use the same dataset and same variables (from OECD). Can anyone tell me why it does not work?
My code is this:
rm(list=ls())
# Name of variables to be extracted
v.resp=c("pv1math") # Response Variable
v.treat=c("IC02Q01","IC02Q02","IC02Q03") # Treatment variable(s)
# Student Confoundings
v.student.conf=c("Age", "Gender", "isced_0", "IMMIG", "HEDRES", "WEALTH", "ESCS","FAMSTRUC","hisced","hisei","HOMEPOS", "TIMEINT")
# School Confoundings
v.school.conf=c("CLSIZE","SCMATEDU","STRATIO","SMRATIO","PublicPrivate")
## LOAD DATA
library(foreign) # read.dta() comes from the foreign package
dat <- read.dta("name.dta")
## Weighted sample with weights in the w vector
w=dat$W_FSTUWT
## Subset data in R
dat=dat[c(v.resp,v.treat,v.student.conf,v.school.conf)]
names(dat)[names(dat)==v.resp]="y"
w=w[complete.cases(dat)]
w=w/sum(w)
nw=function(w) w/sum(w)
dat=dat[complete.cases(dat),]
dim(dat)
When I run the line
dat=dat[c(v.resp,v.treat,v.student.conf,v.school.conf)]
I got the error
Error in `[.data.frame`(dat, c(v.resp, v.treat, v.student.conf, v.school.conf)) : undefined columns selected
I have 25000 observations and 900 variables, but I want to subset my data to 21 variables and the observations related to them (fewer than 25000 for sure). I tried putting a comma before the closing )], but nothing changed, and when I run the other lines I lose all my data.
I also run this code from "Quick-R website" but again the same error message
# select variables v1, v2, v3
myvars <- c("v1", "v2", "v3")
newdata <- mydata[myvars]
I would like to understand why it does not work. I am copying and pasting these codes from a paper that used them for the same dataset.
Thank you.
The message undefined columns selected means exactly what it says: at least one of the names you are asking for is not a column of dat. With a single bracket and a character vector, dat[c(v.resp,v.treat,v.student.conf,v.school.conf)] is already a valid column selection, so the problem is not the syntax but the names themselves: column names in R are case-sensitive and must match exactly. Compare your vectors against the actual column names to find the offenders:
setdiff(c(v.resp,v.treat,v.student.conf,v.school.conf), names(dat))
Every name printed by that line is absent from dat and needs to be corrected before the subset will work.
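One way to track this error down is to compare the requested names with names(dat): any name that is not an exact, case-sensitive match triggers precisely this message. A self-contained sketch (the data frame and the deliberate case mismatch are invented):

```r
# Toy data standing in for the OECD data
dat <- data.frame(pv1math = 1:3, Age = 4:6, Gender = 7:9)
wanted <- c("pv1math", "age", "Gender")   # "age" has the wrong case on purpose

# Names requested but not present in dat
missing_cols <- setdiff(wanted, names(dat))
missing_cols    # "age"

# Subsetting only by the columns that really exist avoids the error
dat_sub <- dat[intersect(wanted, names(dat))]
names(dat_sub)  # "pv1math" "Gender"
```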

SPIA package (R) applied to Illumina expression microarray data

I've been experimenting with alternatives to GSEA for annotating expression (mRNA) data.
SPIA (Signaling Pathway Impact Analysis) looks interesting, but it seems to have exactly one error message for everything:
Error in spia(de = sigGenes, all = allGenes, organism = "hsa", plots = TRUE, :
de must be a vector of log2 fold changes. The names of de should be
included in the reference array!
The input requires a single vector of log2 fold changes (my vector is named sigGenes), with Entrez IDs as the associated names, and an integer vector of Entrez IDs included in the microarray (allGenes):
head(sigGenes)
6144 115286 23530 10776 83933 6232
0.368 0.301 0.106 0.234 -0.214 0.591
head(allGenes)
6144 115286 23530 10776 83933 6232
I've already removed values whose EntrezID annotations are NA.
I've also subset my data from the Illumina microarray to only those genes found in the Affymetrix array using the example provided in the site I list below. I still get the same error.
Here is the full bit of R code:
library(Biobase)
library(limma)
library(SPIA)
sigGenes <- subset(full_table, P.Value<0.01)$logFC
names(sigGenes) <- subset(full_table, P.Value<0.01)$EntrezID
sigGenes<-sigGenes[!is.na(names(sigGenes))] # remove NAs
allGenes <- unique(full_table$EntrezID[!is.na(full_table$EntrezID)])
spiaOut <- spia(de=sigGenes, all=allGenes, organism="hsa", plots=TRUE, data.dir="./")
Any ideas of what else I could try?
Apologies if off topic (still new here). Happy to move the question elsewhere if needed.
Example of SPIA applied to Affymetrix platform data here: http://www.gettinggeneticsdone.com/2012/03/pathway-analysis-for-high-throughput.html
Removing the duplicates did help.
As a workaround, I chose the median value (only because the values were close) among each set of duplicates as follows:
dups<-unique(names(sigGenes[which(duplicated(names(sigGenes)))])) # determine which are duplicates
dupID<-names(sigGenes) %in% dups # flag the location of all duplicates
sigGenes_dup<-vector(); j=0; # determine the median value for each duplicate
for (i in dups){j=j+1; sigGenes_dup[j]<- median(sigGenes[names(sigGenes)==i]) }
names(sigGenes_dup)<-dups
sigGenes<-sigGenes[!(names(sigGenes) %in% dups)] # remove duplicates from sigGenes
sigGenes<-c(sigGenes,sigGenes_dup) # append the median values of the duplicates
Alternatively just removing the duplicates works:
dups<-unique(names(sigGenes[which(duplicated(names(sigGenes)))]))
sigGenes<-sigGenes[!(names(sigGenes) %in% dups)] # remove duplicates from sigGenes
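For what it's worth, the median-collapsing loop above can be written more compactly with tapply(), sketched here on a toy named vector (invented values):

```r
# Toy log2 fold changes with a duplicated Entrez ID ("6144")
sigGenes <- c("6144" = 0.368, "115286" = 0.301, "6144" = 0.420, "23530" = 0.106)

# One value per unique name; median resolves each set of duplicates at once
collapsed <- tapply(sigGenes, names(sigGenes), median)
collapsed[["6144"]]   # median(c(0.368, 0.420)) = 0.394
```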
Based on our discussion, I would suggest removing the duplicated entries in sigGenes. Without additional information it is hard to say where the duplicates originate from, or which one to delete.
