I have two dataframes, one original and one that should be the original plus several additional columns of data after processing. I would like to make sure that the correspondence between original columns was preserved between dataframes (i.e., all subject identifiers still match up to the original vectors of data in each row.)
If the original (orig) is 5000 x 50 and the post-processed version (pp) is 5000 x 100, and the first 50 columns should be the same in both, how can I check? Is there something like setdiff() that can compare full dataframes?
SETDIFF <- setdiff(orig[,c(1:50)], pp[,c(1:50)])
In reply to comment above: to find the row and column indices where values are not equal, use which(orig[,1:50] != pp[,1:50], arr.ind = TRUE).
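For a single yes/no answer before hunting for individual cells, all.equal() can compare the shared columns in one call. A minimal sketch, assuming the shared columns have the same names in both dataframes; the small `orig` and `pp` built here are made-up stand-ins for the real data:

```r
# Made-up stand-ins for `orig` and `pp` (the real ones are 5000 rows)
set.seed(1)
orig <- data.frame(id = 1:5, a = rnorm(5), b = runif(5))
pp <- cbind(orig, extra = orig$a * 2)   # processed copy with one added column

shared <- names(orig)                   # columns that should be unchanged

# TRUE when the shared columns agree (up to numeric tolerance)
same_before <- isTRUE(all.equal(orig, pp[, shared]))

# Introduce one deliberate mismatch, then locate it
pp$a[3] <- pp$a[3] + 1
same_after <- isTRUE(all.equal(orig, pp[, shared]))
mismatch <- which(orig != pp[, shared], arr.ind = TRUE)  # row/col of each difference
```

Note that `!=` yields NA wherever either side is NA, and which() skips those cells, so columns with missing values may need a separate is.na() comparison.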
I have a matrix called LungData with gene names and patient samples. There are ~26,000 genes and for each gene there are 41 samples. The gene names are in the first column, and the samples in the subsequent columns.
> dim(LungData)
[1] 26002 42
I have a subset of ~2,000 genes that I'm interested in. This subset is a list called GeneSubset.
> dim(GeneSubset)
[1] 1999 1
How can I get the 2000x42 sub-matrix which only contains the genes from GeneSubset? I'm not interested in the other genes, and dealing with a smaller sub-matrix will make the computations go a lot faster.
We can use either %in% or match. If the first column of 'LungData' is the 'genenames' and the dataset is a matrix, we use %in% to get a logical vector of TRUE/FALSE by comparing with 'GeneSubset' and this can be used for filtering the rows of 'LungData'.
LungData[LungData[,1] %in% GeneSubset[,1],]
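The match alternative mentioned above might look like the sketch below. The tiny character matrix is a made-up stand-in for LungData, and "g9" is a hypothetical gene that is absent from it. The practical difference is that %in% keeps the row order of LungData, while match returns rows in the order of GeneSubset.

```r
# Made-up miniature of LungData: gene names in column 1, samples after
LungData <- cbind(gene = c("g1", "g2", "g3", "g4"),
                  s1 = c("0.1", "0.2", "0.3", "0.4"),
                  s2 = c("1.1", "1.2", "1.3", "1.4"))
GeneSubset <- matrix(c("g3", "g1", "g9"), ncol = 1)   # "g9" is not in LungData

# %in% keeps the row order of LungData
by_in <- LungData[LungData[, 1] %in% GeneSubset[, 1], , drop = FALSE]

# match() gives positions in the order of GeneSubset; NA marks genes not found
idx <- match(GeneSubset[, 1], LungData[, 1])
by_match <- LungData[idx[!is.na(idx)], , drop = FALSE]
```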
Store the subset of desired genes in a vector instead of a list.
GeneSubset <- as.vector(GeneSubset)
Also,
rownames(LungData) <- LungData[,1] #assigning row names to the original matrix
LungData <- LungData[,-1] #removing 1st column since we already assigned rownames
ReqdData <- LungData[GeneSubset,] #subsetting the data on the basis of rownames
You might also want to use the subset function in base R or the code given by @akrun.
I'm trying to update a bunch of columns by adding and subtracting SD to each value of the column. The SD is for the given column.
Below is the reproducible code that I came up with, but I feel this is not the most efficient way to do it. Could someone suggest a better way to do this?
Essentially, there are 20 rows and 9 columns. I just need two separate dataframes: one with the values of each column adjusted by adding the SD of that column, and the other with the SD subtracted from each value of the column.
##Example
##data frame containing 9 columns and 20 rows
Hi<-data.frame(replicate(9,sample(0:20,20,rep=TRUE)))
##Standard deviation calculated for each column and stored in an object - I don't know what this object is (vector, list, dataframe?)
Hi_SD<-apply(Hi,2,sd)
#data frame converted to matrix to allow addition of SD to each value
Hi_Matrix<-as.matrix(Hi,rownames.force=FALSE)
#a new object created that will store values(original+1SD) for each variable
Hi_SDValues<-NULL
#variable re-created -contains sum of first column of matrix and first element of list. I have only done this for 2 columns for the purposes of this example. however, all columns would need to be recreated
Hi_SDValues$X1<-Hi_Matrix[,1]+Hi_SD[1]
Hi_SDValues$X2<-Hi_Matrix[,2]+Hi_SD[2]
#convert the object back to a dataframe
Hi_SDValues<-as.data.frame(Hi_SDValues)
##Repeat for one SD less
Hi_SDValues_Less<-NULL
Hi_SDValues_Less$X1<-Hi_Matrix[,1]-Hi_SD[1]
Hi_SDValues_Less$X2<-Hi_Matrix[,2]-Hi_SD[2]
Hi_SDValues_Less<-as.data.frame(Hi_SDValues_Less)
This is a job for sweep (type ?sweep in R for the documentation)
Hi <- data.frame(replicate(9,sample(0:20,20,rep=TRUE)))
Hi_SD <- apply(Hi,2,sd)
Hi_SD_subtracted <- sweep(Hi, 2, Hi_SD)
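Since the question asks for both an "SD added" and an "SD subtracted" dataframe, here is a sketch of how sweep could produce the two. The FUN argument of sweep defaults to "-", so the addition only needs FUN = "+". The seed is an assumption added here for reproducibility.

```r
set.seed(42)  # assumption: fixed seed so the example is reproducible
Hi <- data.frame(replicate(9, sample(0:20, 20, replace = TRUE)))
Hi_SD <- sapply(Hi, sd)                      # named numeric vector, one SD per column

# MARGIN = 2 applies the i-th SD to the i-th column
Hi_SDValues      <- sweep(Hi, 2, Hi_SD, FUN = "+")   # each value plus its column SD
Hi_SDValues_Less <- sweep(Hi, 2, Hi_SD, FUN = "-")   # each value minus its column SD
```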
You don't need to convert the dataframe to a matrix in order to add the SD
Hi<-data.frame(replicate(9,sample(0:20,20,rep=TRUE)))
Hi_SD<-apply(Hi,2,sd) # Hi_SD is a named numeric vector
Hi_SDValues<-Hi # Creating a new dataframe that we will add the SDs to
# Loop through all columns (there are many ways to do this)
for (i in 1:9){
Hi_SDValues[,i]<-Hi_SDValues[,i]+Hi_SD[i]
}
# Do pretty much the same thing for the next dataframe
Hi_SDValues_Less <- Hi
for (i in 1:9){
Hi_SDValues_Less[,i]<-Hi_SDValues_Less[,i]-Hi_SD[i]
}
I have a huge dataframe of around 1M rows and want to split the dataframe based on one column & different ranges.
Example dataframe:
length<-sample(rep(1:400),100)
var1<-rnorm(1:100)
var2<-sample(rep(letters[1:25],4))
test<-data.frame(length,var1,var2)
I want to split the dataframe based on length at different ranges (ex: all rows for length between 1 and 50).
range_length<-list(1:50,51:100,101:150,151:200,201:250,251:300,301:350,351:400)
I can do this by subsetting from the dataframe, ex: test1<-test[test$length>1 &test$length<50,]
But I am looking for a more efficient way using "split" (just one line).
range = seq(0,400,50)
split(test, cut(test$length, range))
But do heed Justin's suggestion and look into using data.table instead of data.frame. I'll also add that it's very unlikely that you actually need to split the data.frame/table.
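To see what the cut/split combination returns, here is a small self-contained sketch. It rebuilds the example dataframe with a fixed seed (an assumption added for reproducibility) and uses the name `breaks` rather than `range` to avoid masking the base function. By default cut() produces intervals that are open on the left and closed on the right, so a length of 50 falls in (0,50].

```r
set.seed(7)  # assumption: fixed seed so the example is reproducible
test <- data.frame(length = sample(1:400, 100),
                   var1 = rnorm(100),
                   var2 = sample(rep(letters[1:25], 4)))

breaks <- seq(0, 400, by = 50)
bins <- cut(test$length, breaks = breaks)   # factor with levels (0,50], (50,100], ...
pieces <- split(test, bins)                 # list of 8 dataframes, one per interval

names(pieces)        # interval labels
sapply(pieces, nrow) # number of rows in each piece
```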
I am having a problem. I have two data frames with a lot of columns, and they have different numbers of rows: one has many rows and the second has only one row. Both data frames have some columns with the same names. I want to multiply the matching columns with each other, but I have not been able to work out how. Please help.
The command
mapply("*", DataFrame1, DataFrame2)
should work if you want to multiply all columns. If the relevant columns are only a subset of all columns in the data frames, we first need to identify the columns that are present in both data frames.
mapply("*", DataFrame1[intersect(names(DataFrame1), names(DataFrame2))],
DataFrame2[intersect(names(DataFrame1), names(DataFrame2))])
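A small worked example of the intersect approach, with made-up data frames (the column names a, b, c, d are purely illustrative). Because both sides are indexed by the same vector of names, the columns are paired by name even when they appear in a different order in the two data frames.

```r
# Made-up data: DataFrame1 has several rows, DataFrame2 has a single row
DataFrame1 <- data.frame(a = 1:3, b = 4:6, c = 7:9)
DataFrame2 <- data.frame(b = 10, a = 2, d = 5)

common <- intersect(names(DataFrame1), names(DataFrame2))   # "a" "b"

# Each shared column of DataFrame1 is multiplied by the matching single value
result <- mapply("*", DataFrame1[common], DataFrame2[common])
```

Here mapply returns a matrix with one column per shared name; wrap it in as.data.frame() if a data frame is needed.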
Two questions about R:
1.) If I have a data set with multiple columns and one of the columns is 'test_score', how can I exclude the rows with blank values (and/or non-numeric values) in that column? (For use with pie(), hist(), or cor().)
2.) If the dataset has a column named 'Teachers', how might I graph the column 'test_score' only for the rows where Teachers = Jones?
Creating separate vectors without the missing data:
dat.nomissing <- tenthgrade[!is.nan(Score),]
seems problematic as the two columns must remain paired.
I was thinking something such as:
hist(!is.nan(tenthgrade$Score)[tenthgrade$Teacher=='Jones'])
However, is.nan is creating a list of TRUE, FALSE values (as it should).
Use subscripting. For example:
dat[!is.na(dat$test_score),]
hist(dat$test_score[dat$Teachers=='Jones'])
And a more complete example with artificial data:
# Create artificial dataset
dat <- data.frame('test_score'=rnorm(500), 'Teachers'=sample(c('Jones', 'Smith', 'Clark'), 500, replace=TRUE))
# Introduce some random missingness
dat$test_score[sample(1:500, 50)] <- NA
# Keep if test_score is valid
dat.nomissing <- dat[!is.na(dat$test_score),]
# Plot subset of data
hist(dat.nomissing$test_score[dat.nomissing$Teachers=='Jones'])
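The question also mentions blank and non-numeric values, which is.na() alone will not catch if the column was read in as text. One possible approach, sketched here with made-up data, is to convert with as.numeric() first: anything that cannot be parsed becomes NA and can then be dropped while the score and teacher columns stay paired.

```r
# Made-up data: scores stored as text, with a blank and a non-numeric entry
dat <- data.frame(test_score = c("71", "", "abc", "88", "93"),
                  Teachers = c("Jones", "Jones", "Smith", "Jones", "Smith"),
                  stringsAsFactors = FALSE)

# "" and "abc" cannot be parsed, so they become NA (the warning is suppressed)
score_num <- suppressWarnings(as.numeric(dat$test_score))

# One logical index keeps only valid scores for teacher Jones
keep <- !is.na(score_num) & dat$Teachers == "Jones"
jones_scores <- score_num[keep]

# hist(jones_scores)   # uncomment to draw the histogram
```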