Text matching loop in R

I have 10,000 or more texts in one column of a CSV file (file_1).
In another CSV file (file_2) I have some words that I need to search for in file_1, and I need to record in the next column which of those words each text contains.
All the words must be searched in all the texts; a single text can often contain multiple words from file_2, and I want all of them listed next to the text, comma separated.
Case matching can also be a challenge, and I want exact matches only.
Example:
file_1: a column of texts (not reproduced here)
file_2: Disney, Hollywood
Desired output: each text with its matched words from file_2, comma separated, in the next column

I assume you will read the files into two separate data frames such as df1 and df2.
You can subset your search values from df2 as needed, or turn it into one large vector to search through using:
df2 <- as.vector(t(df2))
Then create a new column "Match" on df1 containing the items matched from df2:
for (i in 1:nrow(df1)) {
  # keep the df2 entries that match this row's text and paste them together
  df1$Match[i] <- paste0(df2[which(df2 %in% df1$SearchColumn[i])], collapse = ",")
}
This loops from row 1 to the last row of df1, finds the indices of the matches in df2 using the which function, and then pastes those values together separated by commas. I'm sure someone else can find a way to achieve this without a loop, but I hope this works for you.
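For what it's worth, here is a rough sketch of one loop-free alternative: build a logical matrix with grepl (one column per search word), then paste together the matching words for each row. The \b word boundaries keep the match exact, and grepl is case sensitive by default; the sketch assumes df2 is a plain character vector and that the search words contain no regex metacharacters.
# one grepl test per search word; hits[i, j] is TRUE if text i contains word j exactly
hits <- sapply(df2, function(w) grepl(paste0("\\b", w, "\\b"), df1$SearchColumn))
# collapse the matching words for each row into a comma separated string
df1$Match <- apply(hits, 1, function(row) paste(df2[row], collapse = ","))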

Related

Intersecting tables in R based on rownames string

I have two datasets: one has my actual data, and the other is a list of the KOs I'm interested in; I'm trying to intersect them to select only those KOs.
As you can see, the row names also have the associated taxa. I've intersected these tables previously without the taxa data:
foi <- read.csv("krakened/biogeochemical.csv")
new <- intersect(rownames(kegg.f),foi$genefamily)
kegg.df.select <- kegg.f[new,]
but I'd really like to have the taxa in the row names. Is it possible to intersect the tables by only comparing the "KOxxxx" part of my rownames?
We may use trimws to extract the substring, %in% to find the matches in the genefamily column of 'foi', and then subset with the resulting logical vector:
kegg.f[trimws(rownames(kegg.f), whitespace = "\\|.*") %in% foi$genefamily,]
It can also be done with sub:
kegg.f[sub("^(K\\d+)\\|.*", "\\1", rownames(kegg.f)) %in% foi$genefamily,]
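To make this concrete, here is a small sketch with made-up rownames in the "KOxxxx|taxon" form described in the question; both expressions reduce them to the bare KO IDs:
rn <- c("K00001|Escherichia_coli", "K00002|Bacillus_subtilis")  # hypothetical rownames
sub("^(K\\d+)\\|.*", "\\1", rn)
#> [1] "K00001" "K00002"
trimws(rn, whitespace = "\\|.*")
#> [1] "K00001" "K00002"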

Best way to extract a single letter from each row and create a new column in R?

Below is an excerpt of the data I'm working with. I am having trouble finding a way to extract the last letter from the sbp.id column and using the result to add a new column, called "sex", to the data frame below. I initially tried grepl to separate the rows ending in F from the ones ending in M, but I couldn't figure out how to use that to create a new column containing just M or F, depending on the last letter of each row in the sbp.id column.
sbp.id newID
125F 125
13000M 13000
13120M 13120
13260M 13260
13480M 13480
Another way, if you know you need the last letter: this works regardless of what the other characters are, and even if the elements all have different lengths, as long as you just need the last character of the string in every row:
df$sex <- substr(df$sbp.id, nchar(df$sbp.id), nchar(df$sbp.id))
This works because all of the functions are vectorized by default.
Using regex, you can extract the last letter from sbp.id:
df$sex <- sub('.*([A-Z])$', '\\1', df$sbp.id)
#Also
#df$sex <- sub('.*([MF])$', '\\1', df$sbp.id)
Or another way would be to remove all the digits:
df$sex <- sub('\\d+', '', df$sbp.id)
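For reference, all three approaches give the same result on the sample rows shown above; a quick check with the substr() version (the data frame is assumed to be named df):
# rebuild the sample data frame from the question
df <- data.frame(sbp.id = c("125F", "13000M", "13120M", "13260M", "13480M"),
                 newID  = c(125, 13000, 13120, 13260, 13480))
df$sex <- substr(df$sbp.id, nchar(df$sbp.id), nchar(df$sbp.id))
df$sex
#> [1] "F" "M" "M" "M" "M"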

R function for simple lookup replacement of Excel

I want to extract the values from file2 into file1 by matching the values in the indicated columns. It is a simple lookup function in Excel.
Many of the solutions I've found are based on matching column names, which I don't want to change in my data set.
The two files share a matching column, and a column from file2 is to be inserted into file1.
As your column names are different in the two data.frames you need to tell merge which columns correspond to each other:
merge(file1, unique(file2[, c("Symbol", "GeneID")]), by.x="UniprotBlastGeneSymbol", by.y="Symbol")
Your result column will be called GeneID, not Column4, of course. If file2 contains gene IDs that are not found in file1, they will be dropped by default (all.y = FALSE).
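As a rough illustration with made-up values (only the UniprotBlastGeneSymbol, Symbol, and GeneID column names come from the question; the Value column and the numbers are placeholders):
file1 <- data.frame(UniprotBlastGeneSymbol = c("TP53", "BRCA1", "EGFR"),
                    Value = c(1.2, 3.4, 5.6))
file2 <- data.frame(Symbol = c("TP53", "EGFR", "EGFR"),
                    GeneID = c(7157, 1956, 1956))
merge(file1, unique(file2[, c("Symbol", "GeneID")]),
      by.x = "UniprotBlastGeneSymbol", by.y = "Symbol")
#   UniprotBlastGeneSymbol Value GeneID
# 1                   EGFR   5.6   1956
# 2                   TP53   1.2   7157
Note that BRCA1 is dropped here because it has no match in file2; add all.x=TRUE if you want to keep such rows.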

How do I select columns by name while ignoring certain characters?

I'm trying to pull data from a file, but only pull certain columns based on the column name.
I have this bit of code:
filepath <- ([my filepath])
files <- list.files(filepath, full.names=T)
newData <- fread(file,select=c(selectCols))
selectCols contains a list of column names (as strings). But in the data I'm pulling, there may be underscores placed differently in each file for the same data.
Here's an example:
PERIOD_ID
PERIOD_ID_
_PERIOD_ID_
And so on. I know I can use gsub to change the column names once the data is already pulled:
colnames(newData) <- gsub("_", "", colnames(newData))
Then I can select by column name, but given that it's a lot of data I'm not sure this is the most efficient idea.
Is there a way to do ignore underscores or other characters within the fread function?
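As far as I know fread has no built-in way to ignore characters in select, but the gsub idea can be applied to just the header before reading the full file, which avoids pulling unwanted columns. A rough sketch (the helper name read_clean and the matching rule are my own):
library(data.table)

read_clean <- function(path, selectCols) {
  hdr  <- names(fread(path, nrows = 0))                           # read only the header row
  keep <- hdr[gsub("_", "", hdr) %in% gsub("_", "", selectCols)]  # compare names with "_" stripped
  out  <- fread(path, select = keep)
  setnames(out, keep, gsub("_", "", keep))                        # keep the cleaned names
  out
}

newData <- read_clean(files[1], selectCols)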

How to separate one column into many columns in a .txt file?

I've been given a data set for a project that I need to reformat in order to work with it.
The problem is that all of the column names and corresponding values are mashed into one column in the file. As shown in the picture.
I'm new to R so I hardly know how to work with complex commands.
My Questions:
Is there a simple way to separate this from one column into 12 columns?
Desired output: the 12 column names as headers, with the corresponding values split into their own columns.
I'll also need to remove the periods between the column names and the semicolons between the values.
I just need to be able to do basic statistical analysis on the table.
Thanks
Although your data is in one column, it is semicolon separated. The read.csv function accepts a column separator:
df <- read.csv(file="path/to/your/file.txt", skip=1, header=FALSE, sep=";")
The above call will generate columns based on the ; separator. I skip the first line and ignore the header because it is a single string. You may manually assign the column names via:
names(df) <- c("name1", "name2", ..., "name12")
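If you would rather not type the twelve names by hand, and assuming the skipped first line really is just the column names separated by periods, a possible sketch for recovering them is:
header <- readLines("path/to/your/file.txt", n = 1)     # the line skipped above
names(df) <- strsplit(header, ".", fixed = TRUE)[[1]]   # split it on the periods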
