Pairwise Comparison of Rows in R - r
I have a dataset that contains results for many tests across many samples. The samples are replicated within the dataset, and I would like to compare the test results between replicates within each group of replicated samples. I thought it might be easiest to first split my data frame by SampleID so that I have a list of data frames, one per SampleID. A sample can have 2, 3, 4, or even 5 replicates, so the number of unique pairs of rows to compare differs between sample groups. My intended logic is laid out below. I want to run a function on the list of data frames and output the match results. The function would compare each unique pair of rows within a group of replicated samples and return "Match", "Mismatch", or NA (if one or both values for a test are missing) for every test. It would also return the number of tests that overlapped between the two compared replicates, the number of matches, and the number of mismatches. Lastly, it would include a column in which the sample names are pasted together with their row numbers, so I know which two samples were compared (e.g. Sample1.1_Sample1.2). Could anyone point me in the right direction?
#Input data structure
data = as.data.frame(cbind(rbind("Sample1","Sample1","Sample2","Sample2","Sample2"),rbind("A","A","C","C","C"), rbind("A","T","C","C","C"),
rbind("A",NA,"C","C","C"), rbind("A","A","C","C","C"), rbind("A","T","C","C",NA), rbind("A","A","C","C","C"),
rbind("A","A","C","C","C"), rbind("A",NA,"C","T","T"), rbind("A","A","C","C","C"), rbind("A","A","C","C","C")))
colnames(data) = c("SampleID", "Test1","Test2","Test3","Test4","Test5","Test6","Test7","Test8","Test9","Test10")
data
data.split = split(data, data$SampleID)
##Row comparison function
#Input is a list of data frames. Each data frame contains results for replicates of the same sample.
RowCompare = function(x){
rowcount = nrow(x)
##ifelse(rowcount==2,
##compare row 1 to row 2
##paste sample names being compared together
##how many non-NA values overlap, keep value
##of those that overlap, how many match, keep value
##of those that overlap, how many do not match, keep value
#ifelse(rowcount==3,
##compare row 1 to row 2
##paste sample names being compared together
##how many non-NA values overlap, keep value
##of those that overlap, how many match, keep value
##of those that overlap, how many do not match, keep value
##compare row 1 to row 3
##paste sample names being compared together
##how many non-NA values overlap, keep value
##of those that overlap, how many match, keep value
##of those that overlap, how many do not match, keep value
##compare row 2 to row 3
##paste sample names being compared together
##how many non-NA values overlap, keep value
##of those that overlap, how many match, keep value
##of those that overlap, how many do not match, keep value
return(results)
}
#Output is a list of data frames - one for sample name
out = lapply(names(data.split), function(x) RowCompare(data.split[[x]]))
#Row bind the list of data frames back together to one large data frame
out.merge = do.call(rbind.data.frame, out)
head(out.merge)
#Desired output
out.merge = as.data.frame(cbind(rbind("Sample1.1_Sample1.2","Sample2.1_Sample2.2","Sample2.1_Sample2.3","Sample2.2_Sample2.3"),rbind("Match","Match","Match","Match"),
rbind("Mismatch","Match","Match","Match"), rbind(NA,"Match","Match","Match"), rbind("Match","Match","Match","Match"), rbind("Mismatch","Match",NA,NA),
rbind("Match","Match","Match","Match"), rbind("Match","Match","Match","Match"), rbind(NA,"Mismatch","Mismatch","Match"), rbind("Match","Match","Match","Match"),
rbind("Match","Match","Match","Match"), rbind(8,10,9,9), rbind(6,9,8,9), rbind(2,1,1,0)))
colnames(out.merge) = c("SampleID", "Test1","Test2","Test3","Test4","Test5","Test6","Test7","Test8","Test9","Test10", "Num_Overlap", "Num_Match","Num_Mismatch")
out.merge
One thing I did see on another post that I thought might be useful is the line below which would create a data frame of unique row combinations that could then be used to define which rows to compare in each group of replicated samples. Not sure how to implement it though.
t(combn(nrow(data),2))
Thank you.
You are on the right track with t(combn(nrow(data),2)). See below for how I would do it.
testCols <- which(grepl("^Test\\d+", colnames(data)))
TestsCompare <- function(x, y) {
  ## how many non-NA values overlap
  overlaps <- sum(!is.na(x) & !is.na(y))
  ## of those that overlap, how many match
  matches <- sum(x == y, na.rm = TRUE)
  ## of those that overlap, how many do not match
  non_matches <- overlaps - matches # complement of matches
  c(overlaps, matches, non_matches)
}
RowCompare <- function(x) {
  comp <- NULL
  ## every unique pair of row positions within this group
  pairs <- t(combn(nrow(x), 2))
  for (i in seq_len(nrow(pairs))) {
    row_a <- pairs[i, 1]
    row_b <- pairs[i, 2]
    a_tests <- x[row_a, testCols]
    b_tests <- x[row_b, testCols]
    comp <- rbind(comp, c(row_a, row_b, TestsCompare(a_tests, b_tests)))
  }
  colnames(comp) <- c("row_a", "row_b", "overlaps", "matches", "non_matches")
  return(comp)
}
out <- lapply(data.split, RowCompare)
Produces:
> out
$Sample1
     row_a row_b overlaps matches non_matches
[1,]     1     2        8       6           2

$Sample2
     row_a row_b overlaps matches non_matches
[1,]     1     2       10       9           1
[2,]     1     3        9       8           1
[3,]     2     3        9       9           0
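The code above returns only the counts. As a rough sketch of how the per-test "Match"/"Mismatch"/NA labels and the pasted pair name from the question could be added on top of the same combn idea, something like the following might work. LabelPairs is a hypothetical helper name, and the data is a cut-down toy version with only three tests, so treat it as an illustration and not a drop-in solution.

```r
# Toy data mirroring the question's structure (only three tests, for brevity)
data <- data.frame(
  SampleID = c("Sample1", "Sample1", "Sample2", "Sample2", "Sample2"),
  Test1 = c("A", "A", "C", "C", "C"),
  Test2 = c("A", "T", "C", "C", NA),
  Test3 = c("A", NA, "C", "T", "T"),
  stringsAsFactors = FALSE
)
testCols <- grep("^Test\\d+$", colnames(data))

# Hypothetical helper: label every unique pair of rows within one sample group
LabelPairs <- function(x) {
  if (nrow(x) < 2) return(NULL)           # nothing to compare for a singleton
  pairs <- t(combn(nrow(x), 2))
  rows <- lapply(seq_len(nrow(pairs)), function(i) {
    a <- unlist(x[pairs[i, 1], testCols])
    b <- unlist(x[pairs[i, 2], testCols])
    same <- a == b                         # NA when either value is missing
    labels <- ifelse(same, "Match", "Mismatch")
    data.frame(
      SampleID = paste0(x$SampleID[pairs[i, 1]], ".", pairs[i, 1], "_",
                        x$SampleID[pairs[i, 2]], ".", pairs[i, 2]),
      t(labels),                           # one column per test
      Num_Overlap = sum(!is.na(same)),
      Num_Match = sum(same, na.rm = TRUE),
      Num_Mismatch = sum(!same, na.rm = TRUE),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

out.merge <- do.call(rbind, lapply(split(data, data$SampleID), LabelPairs))
rownames(out.merge) <- NULL
out.merge
```

The row numbers in the pasted name are positions within each sample group, which is what the desired output in the question appears to use.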
Related
How do I aggregate results to get the most common number in each row of a dataset in R Studio
I'm doing some consensus clustering, and it returns a set called "consensus_imouted" with 3000 rows, each holding ten repetitions of the cluster number (ranging from 1 to 6). I want to return just one column holding the most common cluster number for each row. For example, the first row is 3 3 3 3 3 3 3 3 6 3, so I would want it to be 3, and so on. Any help?
You can use the apply function as follows:
sampledata <- matrix(sample(1:6, 30000, replace = TRUE), ncol = 10, nrow = 3000)
sampledata <- data.frame(sampledata)
sampledata$mostCounts <- apply(sampledata, 1, function(row0) {
  as.numeric(names(which.max(table(row0))))
})
To get the most frequent value, count the values in the row with table, then pick the entry with the highest count using which.max. In a table, the values that were counted are stored as the names of the table, so use names to extract the original value. Since you know it is a number, convert the character to numeric with as.numeric.
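To see the individual steps on the row from the question, here is a small sketch. The matrix m and its second row are made up for illustration; that row has a tie, which shows that which.max keeps the first maximum, i.e. the smallest tied value, because table sorts its entries.

```r
# The example row from the question: nine 3s and one 6
row0 <- c(3, 3, 3, 3, 3, 3, 3, 3, 6, 3)
counts <- table(row0)                       # counts per distinct value
most_common <- as.numeric(names(which.max(counts)))

# Hypothetical second row with a tie: 1 and 2 both occur four times
m <- rbind(row0, c(1, 1, 2, 2, 5, 1, 2, 1, 2, 4))
per_row <- unname(apply(m, 1, function(r) as.numeric(names(which.max(table(r))))))
per_row
```

If ties need different handling (for example picking at random), that would have to be added explicitly.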
Identify duplicated rows based on multiple columns and specific value in another column in very large matrix with for loop
I have a large matrix called data with 10,864 rows and 134 columns. The first 4 columns are parameters that make every row unique. Columns 5 to 134 hold numbers between 1 and 20 in every row. I am running a for loop over the matrix to insert NA into certain cells. This needs to be done on the basis of duplicated values in the columns OrgID, rank and score i, but only if the value in the same row of column score(i+12) is not 1. Hence, I loop from column 5 to 134, and wherever there is duplication based on these three columns and the value in column score(i+12) is not equal to 1, I insert NA into that cell of the matrix.
for(i in 5:ncol(data)){
  data[which(duplicated(data[,c(1,4,i)]) & (data[,i+12])!=1),i] <- "NA"
}
This code, however, gives the wrong output: it inserts NA wherever there is a duplicated value on the basis of the 1st, 4th and ith columns, i.e. the result is equivalent to running the following code:
for(i in 5:ncol(data)){
  data[which(duplicated(data[,c(1,4,i)])),i] <- "NA"
}
How do I make it perform the required operation only when the value in column score(i+12) is not 1 in the duplicated rows? To make the failed output simpler to see, I have highlighted a few rows and the relevant columns to show how this works when applied to column 118, i.e. i = 118. For example, based on the logic explained above, there is duplication in OrgID = 5659. The duplication based on OrgID, rank and score118 identifies 2 rows, one with score130 = 1 and the other with score130 = 16. Hence, score130 in the row with score130 = 16 should now be NA according to the logic, but it remains unchanged at 16.
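Maybe you can try the following. It assumes data is a data frame, because data[c(1,4,i)] and data[[i + 12]] select whole columns only for a data frame. Calling duplicated() in both directions flags every row of a duplicated group, not just the later occurrences.
for(i in 5:(ncol(data) - 12)) {
  inds <- duplicated(data[c(1,4,i)]) | duplicated(data[c(1,4,i)], fromLast = TRUE)
  data[inds & data[[i + 12]] != 1, i + 12] <- NA
}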
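A tiny made-up example may help show why the fromLast = TRUE part matters. The column names and values below are hypothetical, with a single "paired" column standing in for score(i+12). Note that the non-1 value sits in the first row of the duplicated group, which duplicated() on its own would not flag.

```r
# Hypothetical toy data: key columns plus the paired column to blank out
data <- data.frame(
  OrgID  = c(1, 1, 2, 3, 3),
  rank   = c(1, 1, 1, 2, 2),
  score  = c(7, 7, 7, 4, 4),
  paired = c(16, 1, 9, 1, 1)
)
keys <- data[c("OrgID", "rank", "score")]

# duplicated() alone flags only the second and later occurrences
later_only <- duplicated(keys)
# OR-ing with fromLast = TRUE flags every member of a duplicated group
inds <- duplicated(keys) | duplicated(keys, fromLast = TRUE)

# Blank the paired value only in duplicated rows where it is not 1
data[inds & data$paired != 1, "paired"] <- NA
data
```

Row 3 keeps its value of 9 because its key combination is not duplicated, and row 1 becomes NA even though it is the first occurrence.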
In R, match data from a string variable across two data frames, and when match is found, merge corresponding rows
I have two data frames, df1 (4x4) and df2 (4x1). In each, the first variable (i.e. Original_items and Reordered) is a string. In df1, V2:V4 are numeric. You can see that the data in the first variable is arranged in a different order in df1 and df2. I need to do the following. Take the 1st element of the df2 'Reordered' variable (i.e. "Enjoy holidays."), then search through the elements of the df1 'Original_items' variable to find the exact match. When the match is found, I need to take the entire row of data associated with the matched element in df1 'Original_items' (i.e. "Enjoy holidays.", 4, 1, 3) and append it beside the same element of the df2 'Reordered' variable (i.e. "Enjoy holidays."). I need this output in a new data frame called df_desired, whose first row should be: "Enjoy holidays.", "Enjoy holidays.", 4, 1, 3. Please see the illustration of this example below. When this is done, I would like to repeat the process for each element of the df2 'Reordered' variable, so the final result looks like the df_desired table below.
Context of the problem: I have around 2,000 items and 1,000 data points associated with each item. As I need to match items and append data in a predefined way, I am trying to think of an efficient solution.
EDIT: It was suggested that I could simply rename the items in the "Original_items" variable. While this is true, it is inconvenient for a data frame of more than 2,000 items. It was also mentioned that this question may only be related to merging. I believe merging is needed here only for elements that have been identified as identical across df1 and df2. Therefore, there are two key questions: 1) how to match string variables in this particular case, and 2) how to merge/append rows conditionally, i.e. only if they have been matched. Thank you for your input; I would be grateful for your help.
I will mention what I have tried and figured out so far. I realised that
df1[,1] == df2[,1] # gives me true or false if rows in column 1 are the same in both data frames.
I tried to set up a double loop, but without success:
for (i in 1:nrow(df1)) {
  for (j in 1:nrow(df2)) {
    if (i == j) {
      c <- merge(a, b)
    } else print("no result")
  }
}
I feel that in the loop I am not able to specify that I am only working with row values from the single variable "Original_items" in df1.
# df1 (4x4 matrix)
Original_items       V2  V3  V4
Love birds.           1   5   3
Eat a lot of food.    2   5   5
Love birthdays.       2   2   4
Enjoy holidays.       4   1   3
# df2 (4x1 matrix)
Reordered
Enjoy holidays.
Eat a lot of food.
Love birds.
Love birthdays.
# df_desired (4x5 matrix)
Reordered            Original_items       V2  V3  V4
Enjoy holidays.      Enjoy holidays.       4   1   3
Eat a lot of food.   Eat a lot of food.    2   5   5
Love birds.          Love birds.           1   5   3
Love birthdays.      Love birthdays.       2   2   4
If I understand correctly, you first want to sort df1$Original_items into the same order as df2$Reordered, and then apply that same ordering to the rest of the df1 variables. First get the vector of row indices of df1 in the order in which you want those rows to end up.
# initialize an object to capture the indices
indices <- integer(nrow(df2))
for (i in seq_len(nrow(df2))) {
  indices[i] <- which(df1$Original_items == df2$Reordered[i])
}
Then just use this vector of indices to reorder all the rows of df1 and create the new data frame.
df_desired <- cbind(Reordered = df2$Reordered, df1[indices, ])
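As an alternative to the loop, base R's match() returns the same index vector in one vectorised call, which may matter with around 2,000 items. This is only a sketch on the example tables from the question; note that match() returns NA for any item in df2 that has no exact counterpart in df1, so those rows would come out as NA.

```r
# Example tables from the question
df1 <- data.frame(
  Original_items = c("Love birds.", "Eat a lot of food.",
                     "Love birthdays.", "Enjoy holidays."),
  V2 = c(1, 2, 2, 4),
  V3 = c(5, 5, 2, 1),
  V4 = c(3, 5, 4, 3),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Reordered = c("Enjoy holidays.", "Eat a lot of food.",
                "Love birds.", "Love birthdays."),
  stringsAsFactors = FALSE
)

# Position in df1 of each df2 item (exact string match)
indices <- match(df2$Reordered, df1$Original_items)

# Reorder df1 by those positions and put the df2 column in front
df_desired <- data.frame(Reordered = df2$Reordered, df1[indices, ],
                         row.names = NULL)
df_desired
```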
create lists that contain the rownumbers for which column i contains the maximum value of that row
In a data frame of 4 columns, I'm looking for an elegant way to get 3 lists containing the names from column 1 for the rows whose maximum is in column 2, 3 or 4 respectively.
Column 1 contains the parameter names.
Column 2 contains the Shapiro test outcome on the raw data of parameter x.
Column 3 contains the Shapiro test outcome on the log10-transformed data of parameter x.
Column 4 contains the Shapiro test outcome on a custom transformation given by the user for parameter x.
If this is the data:
        Parameter       xval xlog10val xcustomval
1       FWS.Range 0.62233371 0.9741614  0.9619065
2    FL.Red.Range 0.48195980 0.9855781  0.9643206
3 FL.Orange.Range 0.43338087 0.9727243  0.8239867
4 FL.Yellow.Range 0.53554943 0.9022795  0.9223407
5 FL.Red.Gradient 0.35194524 0.9905047  0.5718224
6       SWS.Range 0.46932823 0.9487955  0.9825318
7      SWS.Length 0.02927791 0.4565962  0.7309313
8 FWS.Fill.factor 0.93764311 0.8039806  0.0000000
9    FL.Red.Total 0.22437754 0.9655873  0.9923307
QUESTION: how do I get a list of all parameter names where xlog10val is the highest of the three columns (xval, xlog10val, xcustomval)?
Detailed explanation (can perhaps be ignored):
List 1, the rows where xval is the highest value, should look like this: 'FWS.Fill.factor', since that is the only row where xval has the highest score.
List 2 is the list of all rows where xlog10val is the maximum value, and thus should contain the names of the parameters where xlog10val is the maximum of that row: 'FWS.Range', 'FL.Red.Range', 'FL.Orange.Range', 'FL.Red.Gradient', 'FWS.Fill.factor'.
List 3 contains the rest of the names.
I tried something like
df$Parameter[which(df$xval == max(df[ ,2:4]))]
but this gives integer(0) results.
EDIT to clarify: Let's start by looking at column 2 (xval). PER row, I need to test whether xval is the maximum of the 3 columns xval, xlog10val and xcustomval. If this is the case, add the parameter in THAT row to the xval_is_the_max_of_3_columns list. Then we do the same PER row for xlog10val: IF xlog10val in row i is the max of columns 2:4, add the name of that ROW to the xlog10val_is_the_max_of_3_columns list.
To make the DF:
df <- data.frame(
  Parameter = c('FWS.Range', 'FL.Red.Range', 'FL.Orange.Range', 'FL.Yellow.Range', 'FL.Red.Gradient', 'SWS.Range', 'SWS.Length', 'FWS.Fill.factor', 'FL.Red.Total'),
  xval = c(0.622333705577588, 0.481959800402278, 0.433380866119736, 0.535549430820635, 0.351945244290616, 0.469328232931424, 0.0292779051823701, 0.93764311477813, 0.224377540663707),
  xlog10val = c(0.974161367853916, 0.985578135386898, 0.97272429360688, 0.902279501804112, 0.990504657326703, 0.94879549470406, 0.45659620937997, 0.803980592920426, 0.965587334461157),
  xcustomval = c(0.961906534164457, 0.964320569400919, 0.823986745004031, 0.922340716468745, 0.571822393107348, 0.982531798077881, 0.73093132928955, 0, 0.992330722386105))
We can use max.col to get the index of the maximum value in each row, and with that we subset the 'Parameter' column:
i1 <- max.col(df[-1], 'first')
split(df$Parameter, i1)
EDIT: Based on the discussion with @Mark
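For reference, here is a sketch of the max.col approach run on the question's data (values rounded as in the printed table), with the groups labelled by column name instead of by index. The factor step is my own addition, there only to keep the groups in column order; it is not part of the answer above.

```r
df <- data.frame(
  Parameter = c("FWS.Range", "FL.Red.Range", "FL.Orange.Range",
                "FL.Yellow.Range", "FL.Red.Gradient", "SWS.Range",
                "SWS.Length", "FWS.Fill.factor", "FL.Red.Total"),
  xval = c(0.62233371, 0.48195980, 0.43338087, 0.53554943, 0.35194524,
           0.46932823, 0.02927791, 0.93764311, 0.22437754),
  xlog10val = c(0.9741614, 0.9855781, 0.9727243, 0.9022795, 0.9905047,
                0.9487955, 0.4565962, 0.8039806, 0.9655873),
  xcustomval = c(0.9619065, 0.9643206, 0.8239867, 0.9223407, 0.5718224,
                 0.9825318, 0.7309313, 0.0000000, 0.9923307),
  stringsAsFactors = FALSE
)

# Column position of the row-wise maximum among the three value columns
i1 <- max.col(as.matrix(df[-1]), ties.method = "first")

# Label each row by the winning column, keeping the original column order
winner <- factor(names(df)[-1][i1], levels = names(df)[-1])
by_max <- split(df$Parameter, winner)
by_max
```

On this data FWS.Fill.factor ends up only in the xval group, since 0.9376 is larger than its other two values.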
I'm not sure exactly how you're selecting the parameters for lists two and three; however, you can try something like this as well:
df$Parameter <- as.character(df$Parameter)
par.xval.max <- df[which.max(df$xval), "Parameter"]
par.col3.gt.max <- df[df$xlog10val > max(df$xval), "Parameter"]
par.rem <- df$Parameter[! df$Parameter %in% c(par.xval.max, par.col3.gt.max)]
In this case, the values selected from column three are those greater than max(df$xval), and the remaining parameters are taken by negative selection using %in%.
Set values less than threshold to zero, with column-specific thresholds
I have two data frames. One of them contains 165 columns (species names) and almost 193,000 rows. Each cell holds a number from 0 to 1, which is the probability that the species is present in that cell.
POINTID Abie_Xbor Acer_Camp Acer_Hyrc Acer_Obtu Acer_Pseu Achi_Gran
      2 0.0279037  0.604687 0.0388309 0.0161980 0.0143966  0.240152
      3 0.0294101  0.674846 0.0673055 0.0481405 0.0397423  0.231308
      4 0.0292839  0.603869 0.0597947 0.0526606 0.0463431  0.188875
      6 0.0331264  0.541165 0.0470451 0.0270871 0.0373348  0.256662
      8 0.0393825  0.672371 0.0715808 0.0559353 0.0565391  0.230833
      9 0.0376557  0.663732 0.0747417 0.0445794 0.0602539  0.229265
The second data frame contains 164 columns (species names, as in the first data frame) and one row, which is the threshold above which we assume that the species is present and below which the species is absent.
Abie_Xbor Acer_Camp Acer_Hyrc Acer_Obtu Acer_Pseu Achi_Gran Acta_Spic
   0.3155    0.2816    0.2579    0.2074    0.3007    0.3513    0.3514
What I want to do is make a new data frame that contains, for every species in the presence probabilities (my.data), the probability itself if it is above the threshold (thres) and zero if it is below the threshold. I know that it would involve a for loop and an if statement, but I am new to R and I don't know how to do this. Please help me.
I think you want something like this.
Make up a small reproducible example:
set.seed(101)
speciesdat <- data.frame(pointID=1:10,
                         matrix(runif(100), ncol=10,
                                dimnames=list(NULL, LETTERS[1:10])))
threshdat <- rbind(seq(0.1, 1, by=0.1))
Now process:
thresh <- unlist(threshdat)  ## make data frame into a vector
## 'sweep' runs the function column-by-column if MARGIN=2
ss2 <- sweep(as.matrix(speciesdat[,-1]), MARGIN=2, STATS=thresh,
             FUN=function(x,y) ifelse(x<y, 0, x))
## recombine results with the first column
speciesdat2 <- data.frame(pointID=speciesdat$pointID, ss2)
It's simpler to have the same number of columns (with the same meanings, of course).
frame2 = data.frame(POINTID=0, frame2)
R works with vectors, so a row of frame1 can be compared directly to frame2:
frame1[1,] < frame2
You could use an explicit loop over every row of frame1, but it's common to use the implicit loop of "apply":
answer = apply(frame1, 1, function(x) x < frame2)
This is a rather sloppy solution (especially changing frame2), but it hopefully demonstrates some basic R. Also, I'd generally prefer arrays and matrices when possible (they can still use labels but are generally faster).
This produces a logical matrix which can be used to generate assignments with "[<-" (assuming the name of the multi-row data frame is "cols" and the named vector is "vec"):
sweep(cols[-1], 2, vec, ">") # identifies the items to keep
cols[-1][ sweep(cols[-1], 2, vec, "<") ] <- 0
Your example produced a warning about the mismatch between the number of columns and the length of the vector, but presumably you can adjust the length of the vector to the correct number of entries.
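One way to deal with the column-count mismatch mentioned above might be to line the thresholds up with the data by species name before calling sweep. The sketch below uses made-up numbers for three species plus one extra threshold (Acta_Spic) that has no column in the data. The names my.data and thres follow the question; everything else is illustrative.

```r
# Hypothetical data: three species, four points
my.data <- data.frame(
  POINTID   = c(2, 3, 4, 6),
  Abie_Xbor = c(0.03, 0.40, 0.02, 0.35),
  Acer_Camp = c(0.60, 0.67, 0.10, 0.54),
  Acer_Hyrc = c(0.04, 0.07, 0.30, 0.05)
)
# One threshold per species, plus one species that is absent from my.data
thres <- c(Abie_Xbor = 0.3155, Acer_Camp = 0.2816,
           Acer_Hyrc = 0.2579, Acta_Spic = 0.3514)

# Keep only species present in both, in the data's column order
species <- intersect(names(my.data)[-1], names(thres))
vals <- as.matrix(my.data[species])

# TRUE where a value is below its column's threshold
below <- sweep(vals, 2, thres[species], "<")
vals[below] <- 0

result <- data.frame(POINTID = my.data$POINTID, vals)
result
```

Because the thresholds are indexed by name, their order in thres does not need to match the column order of my.data.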