Finding similar strings in each row of two different data frames

I would like to compare two data sets. One has many columns (df1; this example has two) and the other has a single column (df2).
First, I want to check each row of the first column of df1 against every row of df2. If any matching part is found, the row numbers in df1 and df2 should be written out.
For example:
Column 1 of df1 shares parts with rows of df2: Q9Y6Q9 in row 3 of df1 matches Q9Y6Q9 in row 4 of df2, so the output is 3-4, and likewise for the others.

Maybe you should normalize your data first. For instance, you could do:
normalize <- function(x, delim) {
x <- gsub(")", "", x, fixed=TRUE)
x <- gsub("(", "", x, fixed=TRUE)
idx <- rep(seq_len(length(x)), times=nchar(gsub(sprintf("[^%s]",delim), "", as.character(x)))+1)
names <- unlist(strsplit(as.character(x), delim))
return(setNames(idx, names))
}
This function can be applied to each column of df1 as well as to the lookup table df2:
s1 <- normalize(df1[,1], ";")
s2 <- normalize(df1[,2], ";")
lookup <- normalize(df2[,1], ",")
With this normalized data, it is easy to generate the output you are looking for:
process <- function(s) {
lookup_try <- lookup[names(s)]
found <- which(!is.na(lookup_try))
pos <- lookup_try[names(s)[found]]
return(paste(s[found], pos, sep="-"))
#change the last line to "return(as.character(pos))" to get only the result as in the comment
}
process(s1)
# [1] "3-4" "4-1" "5-4"
process(s2)
# [1] "2-4" "3-15" "7-16"
The output is not exactly the same as in the question, but this may be due to manual lookup errors.
In order to iterate over all columns of df1, you could use lapply:
res <- lapply(colnames(df1), function(x) process(normalize(df1[,x], ";")))
names(res) <- colnames(df1)
Now, res is a list indexed by the column names of df1:
res[["sample_1"]]
# [1] "4" "1" "4"
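Since the original tables are not shown, here is a minimal self-contained sketch of the same idea on made-up data. The contents of df1 and df2 below are hypothetical, and normalize is a slightly simplified variant of the function above (it uses lengths() to count the parts per row), so treat it as an illustration rather than the exact code from the answer.

```r
# Hypothetical toy data (not the asker's real tables)
df1 <- data.frame(sample_1 = c("P11111", "Q22222;Q9Y6Q9", "(A33333)"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(ids = c("A33333,B44444", "Q9Y6Q9"),
                  stringsAsFactors = FALSE)

# Map every delimited part to the row it came from
normalize <- function(x, delim) {
  x <- gsub("[()]", "", as.character(x))       # drop parentheses
  parts <- strsplit(x, delim, fixed = TRUE)    # one character vector per row
  setNames(rep(seq_along(parts), lengths(parts)), unlist(parts))
}

s1 <- normalize(df1$sample_1, ";")
lookup <- normalize(df2$ids, ",")

# Keep the parts of df1 that also occur in df2 and pair up the row numbers
hit <- names(s1) %in% names(lookup)
matches <- paste(s1[hit], lookup[names(s1)[hit]], sep = "-")
matches
# [1] "2-2" "3-1"
```

If an identifier occurs in several rows of df2, name-based indexing returns only the first of them, so duplicates would need separate handling.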

Related

Filter out all data frames which don't have the column Z in a list of data frames?

I have a list of six data frames, five of which have a column "Z". To proceed with my script, I need to remove the data frame that doesn't have column Z, so I tried the following code:
for(i in 1:length(df)){
if(!("Z" %in% colnames(df[[i]])))
{
df[[i]] = NULL
}
}
This seemed to do the job (it removed the one data frame without column Z from the list), but I still got the message "Error in df[[i]] : subscript out of bounds". Why is that, and how can I get around the error?

The base Filter function works well here:
df <- Filter(\(x) "Z" %in% names(x), df)
As to why your method doesn't work, for(i in 1:length(df)) iterates over each item in the original length(df). As soon as df[[i]] = NULL happens once, then df is shorter than it was when the loop started, so the last iteration will be out of bounds. And you'll also skip some items: if df[[2]] is removed then the original df[[3]] is now df[[2]], and the current df[[3]] was originally df[[4]], so you hop over the original df[[3]] without checking it. Lesson: don't change the length of objects in the midst of iterating over them.
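If a loop is still preferred, one way around the shifting positions is to walk the indices backwards, so that removing an element never moves the ones that have not been visited yet. This is only a sketch on a hypothetical three-element list named dfs; Filter remains the simpler option.

```r
# Hypothetical list: only the second element lacks column Z
dfs <- list(a = data.frame(Z = 1), b = data.frame(A = 1), c = data.frame(Z = 2))

# Visit the indices from last to first; deleting element i only shifts
# elements after i, and those have already been checked.
for (i in rev(seq_along(dfs))) {
  if (!("Z" %in% colnames(dfs[[i]]))) {
    dfs[[i]] <- NULL
  }
}

names(dfs)
# [1] "a" "c"
```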

If df is your list of 6 dataframes, you can do this:
df <- df[sapply(df, \(i) "Z" %in% colnames(i))]
The reason you get the error is that your loop will reduce the length of df, such that i will eventually be beyond the (new) length of df. There will be no error if the only frame in df without column Z is the last frame.

Using discard:
list_df <- list(df1, df2)
purrr::discard(list_df, ~any(colnames(.x) == "Z"))
Output:
[[1]]
A B
1 1 3
2 3 4
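As you can see, it removed the first data frame, which had column Z. Note that discard drops the elements for which the predicate is TRUE; to keep only the data frames that have column Z, as the question asks, use purrr::keep with the same predicate.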
data
df1 <- data.frame(A = c(1,2),
Z = c(1,4))
df2 <- data.frame(A = c(1,3),
B = c(3,4))

Iteratively adding a row containing characters and numbers to a dataframe

I have a list containing named elements. I am iterating over the list names, performing the computation for each corresponding element, "encapsulating" the results and the name in a vector and finally adding the vector to a table. The row or vector after each iteration contains a mix of characters and numbers.
The first row is getting added but from the second row onwards there is a problem.
In this example, there is supposed to be one column (first) containing alphanumeric names. All rows after the first one contain NAs.
x <- list(a_1=c(1,2,3), b_2=c(3,4,5), c_3=c(5,1,9))
df <- data.frame()
for(name in names(x))
{
tmp <- x[[name]]
m <- mean(tmp)
s <- sum(tmp)
df <- rbind(df, c(name,m,s))
}
df <- as.data.frame(df)
I know there are possibly more efficient ways, but for the moment this is more intuitive for me because it ensures that each computation is associated with a particular name. There can be several columns and rows, and the names are extremely helpful for joining tables, querying, comparing and so on. They make it easier to trace results back to a particular element in my original list.
Additionally, I would be glad to know other ways in which the element names are always retained while transforming.
Thank you!

You have to set stringsAsFactors = FALSE in rbind. With stringsAsFactors = TRUE (the default before R 4.0.0), the first iteration of the loop converts the string values into factors whose levels are the values seen so far, so values in later rows that are not among those levels become NA.
x <- list(a_1=c(1,2,3), b_2=c(3,4,5), c_3=c(5,1,9))
df <- data.frame()
for(name in names(x))
{
tmp <- x[[name]]
m <- mean(tmp)
s <- sum(tmp)
df <- rbind(df, c(name,m,s), stringsAsFactors = FALSE)
}
An easier solution would be to use sapply().
x <- list(a_1=c(1,2,3), b_2=c(3,4,5), c_3=c(5,1,9))
df <- data.frame(name = names(x), m = sapply(x, mean), s = sapply(x, sum))
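For the follow-up about keeping each result tied to its name, another possible pattern is to build a one-row data frame per element and bind them at the end. The sketch below reuses the x from the question; the names rows and res are just illustrative. Unlike rbind with c(name, m, s), it keeps m and s numeric, because each column is created with its own type.

```r
x <- list(a_1 = c(1, 2, 3), b_2 = c(3, 4, 5), c_3 = c(5, 1, 9))

# One single-row data frame per list element, carrying its name along
rows <- lapply(names(x), function(name) {
  data.frame(name = name,
             m = mean(x[[name]]),
             s = sum(x[[name]]),
             stringsAsFactors = FALSE)
})

# Bind all rows once instead of growing the data frame inside a loop
res <- do.call(rbind, rows)
res
#   name m  s
# 1  a_1 2  6
# 2  b_2 4 12
# 3  c_3 5 15
```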

How to group column names and add suffixes to them?

I kindly appreciate if someone could help me with the task described below.
I have an R data frame with the following columns:
id
cols_len.max.(1,5]
cols_len.max.(1,55]
cols_width.min.(1,55]
cols_width.min.(2,15]
cols_width.upper.(1,15]
I want to rename these columns to get the following column names:
id
cols_len.max_1
cols_len.max_2
cols_width.min_1
cols_width.min_2
cols_width.upper
This is my current code:
colnames(df) <- gsub("\\(.*\\]*-*.","",colnames(df))
colnames(df) <- gsub("\\.","",colnames(df))
colnames(df) <- gsub("-","",colnames(df))
colnames(df) <- gsub("\\_","",colnames(df))
But this gives me duplicate column names (cols_len.max and cols_width.min):
id
cols_len.max
cols_len.max
cols_width.min
cols_width.min
cols_width.upper
How can I append _N to them, where N is assigned as shown above? I am looking for an automated approach because my real data frame contains hundreds of columns.

An option is to remove the substring at the end and wrap the result in make.unique (note that make.unique appends .1, .2, ... to repeated names rather than _1, _2):
v2 <- make.unique(sub("\\.\\(.*", "", v1))
Or another option is to use the sub output as a grouping variable and then append the sequence at the end
tmp <- sub("\\.\\(.*", "", v1)
t1 <- ave(seq_along(tmp), tmp, FUN = function(x)
if(length(x) == 1) "" else seq_along(x))
and paste it at the end of 'tmp'
i1 <- nzchar(t1)
tmp[i1] <- paste(tmp[i1], t1[i1], sep="_")
tmp
#[1] "id" "cols_len.max_1" "cols_len.max_2" "cols_width.min_1" "cols_width.min_2" "cols_width.upper"
data
v1 <- c("id", "cols_len.max.(1,5]", "cols_len.max.(1,55]", "cols_width.min.(1,55]",
"cols_width.min.(2,15]", "cols_width.upper.(1,15]")
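The grouping approach can also be wrapped in a small helper so it can be applied directly to colnames(df). This is a sketch only; add_group_suffix is a hypothetical name, and it assumes that every interval suffix starts with ".(" as in the example data.

```r
# Strip the ".(a,b]" tail and number the names that occur more than once
add_group_suffix <- function(nms) {
  base <- sub("\\.\\(.*$", "", nms)
  n_in_group   <- ave(seq_along(base), base, FUN = length)     # group size
  pos_in_group <- ave(seq_along(base), base, FUN = seq_along)  # 1, 2, ... within group
  ifelse(n_in_group > 1, paste(base, pos_in_group, sep = "_"), base)
}

v1 <- c("id", "cols_len.max.(1,5]", "cols_len.max.(1,55]", "cols_width.min.(1,55]",
        "cols_width.min.(2,15]", "cols_width.upper.(1,15]")
add_group_suffix(v1)
# [1] "id" "cols_len.max_1" "cols_len.max_2" "cols_width.min_1" "cols_width.min_2" "cols_width.upper"
```

With a real data frame, the call would be colnames(df) <- add_group_suffix(colnames(df)).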

Merging two data.frames by two columns each

I have a huge data.frame that I want to reorder. The idea was to split it in half (as the first half contains different information than the second half) and create a third data frame which would be the combination of the two. As I always need the first two columns of the first data frame followed by the first two columns of the second data frame, I need help.
new1<-all_cont_video_algo[,1:826]
new2<-all_cont_video_algo[,827:length(all_cont_video_algo)]
df3<-data.frame()
The new data frame should look like the following:
new3[new1[1],new1[2],new2[1],new2[2],new1[3],new1[4],new2[3],new2[4],new1[5],new1[6],new2[5],new2[6], etc.].
Pseudoalgorithmically, cbind 2 columns from data frame new1 then cbind 2 columns from data frame new2 etc.
I tried the following now (thanks to Akrun):
new1<-all_cont_video_algo[,1:826]
new2<-all_cont_video_algo[,827:length(all_cont_video_algo)]
new1<-as.data.frame(new1, stringsAsFactors =FALSE)
new2<-as.data.frame(new2, stringsAsFactors =FALSE)
df3<-data.frame()
f1 <- function(Ncol, n) {
as.integer(gl(Ncol, n, Ncol))
}
lst1 <- split.default(new1, f1(ncol(new1), 2))
lst2 <- split.default(new2, f1(ncol(new2), 2))
lst3 <- Map(function(x, y) df3[unlist(cbind(x, y))], lst1, lst2)
However, this gives me an "undefined columns selected" error.

See whether the code below helps:
library(tidyverse)
# Two sample data frames of equal number of columns and rows
df1 = mtcars %>% select(-1)
df2 = diamonds %>% slice(1:32)
# get the column names
dn1 = names(df1)
dn2 = names(df2)
# create new ordered list
neworder = map(seq(1,length(dn1),2), # sequence with interval 2
~c(dn1[.x:(.x+1)], dn2[.x:(.x+1)])) %>% # a vector of two columns each
unlist %>% # flatten the list
na.omit # remove NAs arising from odd number of columns
# Get the data frame ordered
df3 = bind_cols(df1, df2) %>%
select(all_of(neworder)) # all_of() marks neworder as an external vector of column names

It is not clear without a reproducible example. Based on the description, we can split the dataset columns into a list of datasets and use Map to cbind the columns of corresponding datasets, unlist and use that to order the third dataset
1) Create a function to return a grouping column for splitting the dataset
f1 <- function(Ncol, n) {
as.integer(gl(Ncol, n, Ncol))
}
2) split the datasets into a list
lst1 <- split.default(df1, f1(ncol(df1), 2))
lst2 <- split.default(df2, f1(ncol(df2), 2))
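3) Map through the corresponding list elements, cbind each pair of chunks, and then cbind the resulting pieces into 'df3'. (The earlier form, df3[unlist(cbind(x, y))], passes cell values where column names are expected, which raises "undefined columns selected".)
lst3 <- Map(cbind, lst1, lst2)
df3 <- do.call(cbind, unname(lst3))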
data
df1 <- as.data.frame(matrix(letters[1:10], 2, 5), stringsAsFactors = FALSE)
df2 <- as.data.frame(matrix(1:10, 2, 5))
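The same interleaving can also be expressed purely through column positions, which avoids splitting the data frames at all. The following is a sketch under the assumption that both halves have the same number of columns; interleave_chunks, a and b are hypothetical names used only for illustration.

```r
# Interleave the columns of a and b in chunks of n columns each:
# a[1:n], b[1:n], a[(n+1):(2n)], b[(n+1):(2n)], ...
interleave_chunks <- function(a, b, n = 2) {
  stopifnot(ncol(a) == ncol(b))
  chunk <- ceiling(seq_len(ncol(a)) / n)   # chunk id of every column
  # Sort the columns of cbind(a, b) by chunk first, then by source (a before b);
  # order() keeps ties in their original order, so columns stay in sequence.
  idx <- order(c(chunk, chunk), rep(1:2, each = ncol(a)))
  cbind(a, b)[, idx, drop = FALSE]
}

a <- data.frame(a1 = 1:2, a2 = 3:4, a3 = 5:6, a4 = 7:8)
b <- data.frame(b1 = 11:12, b2 = 13:14, b3 = 15:16, b4 = 17:18)
names(interleave_chunks(a, b))
# [1] "a1" "a2" "b1" "b2" "a3" "a4" "b3" "b4"
```

For the data in the question this would be called as interleave_chunks(new1, new2), provided both halves have equally many columns.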

How to convert single column data into two-column matrix using conditional/for loop in R

I have a single column data frame - example data:
1 >PROKKA_00002 Alpha-ketoglutarate permease
2 MTESSITERGAPELADTRRRIWAIVGASSGNLVEWFDFYVYSFCSLYFAHIFFPSGNTTT
3 QLLQTAGVFAAGFLMRPIGGWLFGRIADRRGRKTSMLISVCMMCFGSLVIACLPGYAVIG
4 >PROKKA_00003 lipoprotein
5 MRTIIVIASLLLTGCSHMANDAWSGQDKAQHFLASAMLSAAGNEYAQHQGYSRDRSAAIG
Each sequence of letters is associated with the ">" line above it. I need a two-column data frame with lines starting in ">" in the first column, and the respective lines of letters concatenated as one sequence in the second column. This is what I've tried so far:
y <- matrix(0,5836,2) #empty matrix with 5836 rows and two columns
z <- 0
for(i in 1:nrow(df)){
if((grepl(pattern = "^>", x = df)) == TRUE){ #tried to set the conditional "if a line starts with ">", execute code"
z <- z + 1
y[z,1] <- paste(df[i])
} else{
y[z,2] <- paste(df[i], collapse = "")
}
}
I would eventually convert the matrix y back to a data.frame using as.data.frame, but my loop keeps getting Error: unexpected '}' in "}". I'm also not sure if my conditional is right. Can anyone help? It would be greatly appreciated!

Although I would stick with packages for this, here is a base R solution.
Initialize the data:
mydf <- data.frame(x=c(">PROKKA_00002 Alpha-ketoglutarate","MTESSITERGAPEL", "MTESSITERGAPEL",">PROKKA_00003 lipoprotein", "MTESSITERGAPEL" ,"MRTIIVIASLLLT"), stringsAsFactors = FALSE)
Process:
ind <- grep(">", mydf$x)
temp<-data.frame(ind=ind, from=ind+1, to=c((ind-1)[-1], nrow(mydf)))
seqs<-rep(NA, length(ind))
for(i in seq_along(ind)) {
seqs[i]<-paste(mydf$x[temp$from[i]:temp$to[i]], collapse="")
}
fastatable<-data.frame(name=gsub(">", "", mydf[ind,1]), sequence=seqs)
> fastatable
name sequence
1 PROKKA_00002 Alpha-ketoglutarate MTESSITERGAPELMTESSITERGAPEL
2 PROKKA_00003 lipoprotein MTESSITERGAPELMRTIIVIASLLLT

Try creating a logical index of the header rows (those containing the target symbol ">") and then splitting the data on that index. cumsum(ind1) coerces the logical vector to numeric and gives every row the running count of headers seen so far, which works as a record id; mydf$x[ind1][cumsum(ind1)] looks up the matching header for every row, and [!ind1,] then drops the header rows themselves.
ind1 <- grepl(">", mydf$x)
#split data on the index created
newdf <- data.frame(mydf$x[ind1][cumsum(ind1)], mydf$x)[!ind1,]
#Add names
names(newdf) <- c("Name", "Value")
newdf
# Name Value
# 2 >PROKKA_00002 Alpha-ketoglutarate
# 3 >PROKKA_00002 MTESSITERGAPEL
# 5 >PROKKA_00003 lipoprotein
# 6 >PROKKA_00003 MRTIIVIASLLLT
Data
mydf <- data.frame(x=c(">PROKKA_00002","Alpha-ketoglutarate","MTESSITERGAPEL", ">PROKKA_00003", "lipoprotein" ,"MRTIIVIASLLLT"))

You can use plyr to accomplish this if you are able to assign a section number to your rows appropriately:
library(plyr)
df <- data.frame(v1=c(">PROKKA_00002 Alpha-ketoglutarate permease",
"MTESSITERGAPELADTRRRIWAIVGASSGNLVEWFDFYVYSFCSLYFAHIFFPSGNTTT",
"QLLQTAGVFAAGFLMRPIGGWLFGRIADRRGRKTSMLISVCMMCFGSLVIACLPGYAVIG",
">PROKKA_00003 lipoprotein",
"MRTIIVIASLLLTGCSHMANDAWSGQDKAQHFLASAMLSAAGNEYAQHQGYSRDRSAAIG"))
df$hasMark <- ifelse(grepl(">",df$v1,fixed=TRUE),1, 0)
df$section <- cumsum(df$hasMark)
t <- ddply(df, "section", function(x){
data.frame(v2=head(x,1),v3=paste(x$v1[2:nrow(x)], collapse=''))
})
t <- subset(t, select=-c(section,v2.hasMark,v2.section)) #drop the extra columns
If you then view 't', I believe this is what you were looking for in your original post.
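The header index and cumsum idea used in the answers above can also be combined with tapply to concatenate the sequence lines without an explicit loop. This is a minimal sketch on shortened, made-up sequence lines; it assumes that every header line is followed by at least one sequence line, otherwise names and sequences would no longer line up.

```r
# Shortened, hypothetical input lines in FASTA layout
fasta_lines <- c(">PROKKA_00002 Alpha-ketoglutarate permease",
                 "MTESS",
                 "QLLQT",
                 ">PROKKA_00003 lipoprotein",
                 "MRTII")

is_header <- startsWith(fasta_lines, ">")   # TRUE for the ">" lines
record <- cumsum(is_header)                 # record id for every line

fasta <- data.frame(
  name = sub("^>", "", fasta_lines[is_header]),
  # paste together all sequence lines that share a record id
  sequence = as.vector(tapply(fasta_lines[!is_header], record[!is_header],
                              paste, collapse = "")),
  stringsAsFactors = FALSE
)
fasta$sequence
# [1] "MTESSQLLQT" "MRTII"
```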
