Intersecting tables in R based on rownames string

I have 2 datasets: one has my actual data, and the other is a list of my KOs of interest. I'm trying to intersect the data to select only the KOs of interest.
The row names in my data also include the associated taxa. I've intersected these tables previously without the taxa data:
foi <- read.csv("krakened/biogeochemical.csv")
new <- intersect(rownames(kegg.f),foi$genefamily)
kegg.df.select <- kegg.f[new,]
but I'd really like to have the taxa in the row names. Is it possible to intersect the tables by only comparing the "KOxxxx" part of my rownames?

We may use trimws to extract the substring, %in% to find the matches in the genefamily column of 'foi', and then subset with the resulting logical vector:
kegg.f[trimws(rownames(kegg.f), whitespace = "\\|.*") %in% foi$genefamily,]
It can also be done with sub:
kegg.f[sub("^(K\\d+)\\|.*", "\\1", rownames(kegg.f)) %in% foi$genefamily, ]
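For context, here is a minimal sketch on made-up data, assuming row names of the form "KOxxxxx|taxon" with a "|" separator (adjust the patterns if your separator differs):
kegg.f <- data.frame(sample1 = 1:3, sample2 = 4:6)
rownames(kegg.f) <- c("K00001|g__Escherichia", "K00002|g__Bacillus", "K00010|g__Vibrio")
foi <- data.frame(genefamily = c("K00001", "K00010"))
## trimws() with a custom 'whitespace' pattern strips everything from "|" onwards
kegg.f[trimws(rownames(kegg.f), whitespace = "\\|.*") %in% foi$genefamily, ]
## sub() keeps only the leading "K" plus digits
kegg.f[sub("^(K\\d+)\\|.*", "\\1", rownames(kegg.f)) %in% foi$genefamily, ]
Both keep the full row names (taxa included) in the result, since only the comparison is done on the trimmed strings.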

Related

Is there an R methodology to select the columns from a dataframe that are listed in a separate array

I have a dataframe with over 100 columns. Post implementation of certain conditions, I need a subset of the dataframe with the columns that are listed in a separate array.
The array has 50 entries with 2 columns. The first column has the selected variable names and the second column has some associated values.
I wish to build a new data frame with just the variables mentioned in the first column of the separate array. Could you please point me to how to proceed?
Try this:
library(dplyr)
iris <- iris %>% select(contains(dataframe_with_names$names))
In R you can use square brackets [rows, columns] to select specific rows or specific columns. (Leaving either blank selects all).
If you had a vector of column names you wanted to keep called important_columns you could select only those columns with:
myData[,important_columns]
In your case the vector of column names is actually a column in your array. So you select that column and use it as your vector:
myData[, array$names]
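A self-contained sketch, using iris and a hypothetical lookup table named selection whose first column holds the column names to keep:
selection <- data.frame(names  = c("Sepal.Length", "Species"),
                        values = c(1, 2))
## base R: square brackets with the vector of names
iris_base <- iris[, selection$names]
## dplyr alternative: any_of() keeps the listed columns that exist in the data
library(dplyr)
iris_dplyr <- iris %>% select(any_of(selection$names))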

Find whether a row in a data table contains at least one word from the list

I am quite new to R and data tables, so probably my question will sound obvious, but I searched through questions here for similar issues and couldn't find a solution anyway.
So, initially, I have a data table, and one of the columns contains fields with many values (in fact, these values are all separate words) joined together by &&&&. I also have a list of words (list). This list is big and has 38,000 different words, but for the purpose of the example let's say that it is small.
list <- c('word1', 'word2', 'word3')
What I need is to filter the data table so that I only have rows that contain at least one word from the list of words.
I unjoined the data by &&&& and created a list:
fields_with_words <- strsplit(data_final$fields_with_words, "&&&&")
But I don't know which function I should use to check whether a row from my data table has at least one word from the list. Can you give me some clues?
Try:
data_final[sapply(strsplit(data_final$fields_with_words, "&&&&"),
                  function(x) any(x %in% word_list)), ]
I have used word_list instead of list here since list is a built-in function in R.
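Here is that approach on a toy data.table (made-up values), showing which rows survive the filter:
library(data.table)
word_list  <- c("word1", "word2", "word3")
data_final <- data.table(id = 1:3,
                         fields_with_words = c("word1&&&&other", "nothing&&&&here", "word3"))
keep <- sapply(strsplit(data_final$fields_with_words, "&&&&"),
               function(x) any(x %in% word_list))
data_final[keep, ]   # rows 1 and 3 remain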
Assuming you want to scan the x variable in df against the list of words lw <- c("word1", "word2", "word3") (a character vector of words), you can use
df[grepl(paste0("(",paste(lw, collapse = "|"), ")"), x)]
if you want regular-expression matching. In particular, you will also get a match when your word occurs inside a longer sentence. However, with 38k words, I don't know whether this solution scales.
If your x column contains only words and you want exact matching, the problem is simpler. You can do:
df[x %chin% lw]
%chin% is data.table's special %in% operator for character vectors (%in% can also be used, but it will not be as performant). You can get better performance still by using merge and transforming lw into a data.table:
merge(df, data.table(x = lw), by = "x")
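A small sketch of the exact-match case (made-up data, one word per row in x):
library(data.table)
lw <- c("word1", "word2", "word3")
df <- data.table(id = 1:4, x = c("word1", "foo", "word3", "bar"))
df[x %chin% lw]                           # rows where x is one of the words
merge(df, data.table(x = lw), by = "x")   # the same rows via a join on x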

Why does a dataframe allow 2-dimensional selection without a comma?

My understanding is that you can select from a dataframe in two ways. If you use [] and do not include a comma, it is a list-style selection. It works that way because a dataframe is built on a list, so you are really just pulling out top-level components.
And if you include a comma, then you are doing matrix-style selection and you get the [rows, columns] syntax.
If that's true, then why can I select from a dataframe with an array?
df <- as.data.frame(state.x77)
df2 <- cbind(df, rep(NA, nrow(df)))
df2[is.na(df2)]
is.na(df2) is an array with dim attributes for 50 rows and 9 columns.
How does it know to select against every value instead of doing the typical selection amongst columns?
is.na(df2) produces a logical matrix, with the same dimensions as the data.frame, df2.
Subsetting a data.frame by a matrix of the same dimensions is a standard operation. See ?'[.data.frame' for more information.
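A short illustration of the difference, with a hypothetical df2 containing some NAs:
df2 <- data.frame(a = c(1, NA, 3), b = c(NA, 5, 6))
m <- is.na(df2)    # logical matrix with the same dimensions as df2
df2[m]             # returns the matching values as a vector, not whole columns
df2[m] <- 0        # the common use: replace every NA in a single assignment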

create a vector containing row IDs of dataframe

I am using one of Rs built in datasets called USArrests. It looks like instead of the rows having a numeric ID, they have a State as the row ID. Now how do I create a vector containing all of these state names?
I would generally use myvec <- c(USArrests$colname), but I am not sure how to access the states, as they are not stored as a normal column.
data("USArrests")
head(USArrests)
vector_of_names <- rownames(USArrests)
## if you want to append the names to the data frame
USArrests$state_name <- rownames(USArrests)
USArrests

Is there a more elegant way to find duplicated records?

I've got 81,000 records in my test frame, and duplicated is showing me that 2039 are identical matches. One answer to Find duplicated rows (based on 2 columns) in Data Frame in R suggests a method for creating a smaller frame of just the duplicate records. This works for me, too:
dup <- data.frame(as.numeric(duplicated(df$var))) #creates df with binary var for duplicated rows
colnames(dup) <- c("dup") #renames column for simplicity
df2 <- cbind(df, dup) #bind to original df
df3 <- subset(df2, dup == 1) #subsets df using binary var for duplicated
But it seems, as the poster noted, inelegant. Is there a cleaner way to get the same result: a view of just those records that are duplicates?
In my case I'm working with scraped data and I need to figure out whether the duplicates exist in the original or were introduced by me scraping.
duplicated(df) will give you a logical vector (all values either TRUE or FALSE), which you can then use as an index into your dataframe rows.
# indx will contain TRUE values wherever in df$var there is a duplicate
indx <- duplicated(df$var)
df[indx, ] #note the comma
You can put it all together in one line
df[duplicated(df$var), ] # again, the comma, to indicate we are selecting rows
doops <- which(duplicated(df$var)==TRUE)
uniques <- df[-doops,]
duplicates <- df[doops,]
This is the logic I generally use when I am trying to remove duplicate entries from a data frame.
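A toy illustration (made-up data) of the one-liner, plus a variant that also shows the first occurrence of each duplicated value:
df <- data.frame(var = c("a", "b", "a", "c", "b"), val = 1:5)
df[duplicated(df$var), ]                                         # only the 2nd and later occurrences
df[duplicated(df$var) | duplicated(df$var, fromLast = TRUE), ]   # all copies of duplicated values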
