How to add the title of a gene to the output in R?

I have strings of length 9 and a list of longer strings with titles.
Example data:
String <- c("ABCDEFGHI", "ACBDGHIEF")
The text file contains the longer strings with titles, like
>name
ABCDEFGHIJKLMNOPQRSTUVWXYX
>name1
TUVWXYACBDGHIEFXGHIJKLMIJK
>name2
ABFNOCDEPQRXYXGSTUVWHIMJKL
I use library(stringr) to locate the position of each string.
Code in R:
loc <- str_locate(textfile, pattern = String)
write.csv(loc, "locate.csv")
EXPECTED Output:
string | locate | source of longer string
1 | 1-9 | name
2 | 7-15 | name1
3 | NA | NA
QUESTION:
I would like to add the name of the longer string in which each string is located. How can I do this in R? I want the last column shown in the expected output.
Thank you for help
Venkata

Here is an option with the tidyverse. After reading the data with readLines, the title and value lines alternate, so one option is to separate them into two columns with a recycling logical vector ('i1'), apply str_locate only to the values ('col2'), then create a row-number column and the 'source_longer_string' column by checking whether 'locate' is NA.
library(dplyr)
library(stringr)
i1 <- c(TRUE, FALSE)
df1 <- tibble(col1 = textfile[i1], col2 = textfile[!i1])
str_locate(df1$col2, str_c(String, collapse = "|")) %>%
  as.data.frame %>%
  transmute(string = row_number(),
            locate = str_c(start, end, sep = "-"),
            source_longer_string = case_when(is.na(locate) ~ NA_character_,
                                             TRUE ~ df1$col1))
# string locate source_longer_string
#1 1 1-9 >name
#2 2 7-15 >name1
#3 3 <NA> <NA>
data
textfile <- readLines(textConnection(">name
ABCDEFGHIJKLMNOPQRSTUVWXYX
>name1
TUVWXYACBDGHIEFXGHIJKLMIJK
>name2
ABFNOCDEPQRXYXGSTUVWHIMJKL"))
String <- c("ABCDEFGHI", "ACBDGHIEF")
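The answer above joins both patterns with | and returns one row per sequence. If one row per search string is preferred instead, a base R sketch could look like the following. It assumes the same textfile and String values as in the data block; locate_one is a hypothetical helper name, and only the first sequence containing each string is reported.

```r
# Sample input, copied from the data block above
textfile <- c(">name",  "ABCDEFGHIJKLMNOPQRSTUVWXYX",
              ">name1", "TUVWXYACBDGHIEFXGHIJKLMIJK",
              ">name2", "ABFNOCDEPQRXYXGSTUVWHIMJKL")
String <- c("ABCDEFGHI", "ACBDGHIEF")

# Title lines start with ">"; strip that marker and keep the sequences apart
is_header <- startsWith(textfile, ">")
titles <- sub("^>", "", textfile[is_header])
seqs   <- textfile[!is_header]

# locate_one (hypothetical helper): first hit of one string across all sequences
locate_one <- function(s) {
  pos <- regexpr(s, seqs, fixed = TRUE)   # -1 where the string is absent
  hit <- which(pos > 0)[1]                # index of first matching sequence, NA if none
  if (is.na(hit)) {
    return(data.frame(string = s, locate = NA_character_,
                      source = NA_character_, stringsAsFactors = FALSE))
  }
  start <- pos[hit]
  data.frame(string = s,
             locate = paste(start, start + nchar(s) - 1, sep = "-"),
             source = titles[hit],
             stringsAsFactors = FALSE)
}

result <- do.call(rbind, lapply(String, locate_one))
result
```

Because the match is done with fixed = TRUE, the search strings are treated as literal text rather than regular expressions.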

Related

How can I obtain the distinct values for a "|" delimited column? [duplicate]

This question already has answers here:
How do keep only unique words within each string in a vector
(3 answers)
Closed 1 year ago.
I have a dataframe that looks like this:
+--+---------------------------+
|id|grids |
+--+---------------------------+
|c1|21257a|75589y|21257a|77589y|
|c2|21257a|21257a|21257a|21257a|
|c3|21257a|75589y|75589y|33421v|
However, since there are duplicate values in the grids column, I'd like to keep only the distinct values so that the dataframe looks like this:
+--+---------------------------+
|id|grids |
+--+---------------------------+
|c1|21257a|75589y |
|c2|21257a |
|c3|21257a|75589y|33421v |
Any help would be appreciated!
Using sapply, split each string on |, keep only the unique values in each row, and paste them back together.
df$grids <- sapply(strsplit(df$grids, '|', fixed = TRUE), function(x)
  paste0(unique(x), collapse = '|'))
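To see what each step of the split/unique/paste approach produces, here is a small self-contained sketch. It uses the sample data given further down the page, and swaps sapply for vapply, which is my own substitution to guarantee a character result; the variable name parts is only illustrative.

```r
df <- data.frame(id = c("c1", "c2", "c3"),
                 grids = c("21257a|75589y|21257a|75589y",
                           "21257a|21257a|21257a|21257a",
                           "21257a|75589y|75589y|33421v"),
                 stringsAsFactors = FALSE)

# Step 1: split on the literal "|" (fixed = TRUE avoids regex alternation)
parts <- strsplit(df$grids, "|", fixed = TRUE)

# Step 2: drop repeats within each row, keeping first-seen order
# Step 3: glue the remaining values back together with "|"
df$grids <- vapply(parts,
                   function(x) paste(unique(x), collapse = "|"),
                   character(1))
df
```

unique() keeps the first occurrence of each value, so the original left-to-right order of the grid codes is preserved.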
Here is a base R regex based approach:
df$grids <- gsub("\\b(.+?)(?=\\|.*\\1)", "", df$grids, perl=TRUE)
df$grids <- gsub("^\\|+|\\|+$", "", df$grids)
df$grids <- gsub("\\|{2,}", "|", df$grids)
df
id grids
1 c1 21257a|75589y
2 c2 21257a
3 c3 21257a|75589y|33421v
Data:
df <- data.frame(id=c("c1", "c2", "c3"),
grids=c("21257a|75589y|21257a|75589y",
"21257a|21257a|21257a|21257a",
"21257a|75589y|75589y|33421v"))
The regex \b(.+?)(?=\|.*\1) matches any pipe-separated term for which the same term appears later in the grid string. Each such match is stripped by replacing it with an empty string. The cleanup steps then remove any dangling pipes left behind in the middle or at the beginning/end of the grid string.
Using the data from @Tim:
library(tidyverse)
df <- data.frame(id=c("c1", "c2", "c3"),
grids=c("21257a|75589y|21257a|75589y",
"21257a|21257a|21257a|21257a",
"21257a|75589y|75589y|33421v"))
df %>% mutate(grids = map_chr(str_split(grids, '\\|'),
                              ~paste(unique(.x), collapse = '|')))
#> id grids
#> 1 c1 21257a|75589y
#> 2 c2 21257a
#> 3 c3 21257a|75589y|33421v
Created on 2021-05-27 by the reprex package (v2.0.0)

Find distinct elements based only on a part of the sentence in R

I have a data.frame that looks like this
name=c("PFLU_00001_gene", "PFLU_00001_mRNA", "PFLU_00001",
"PFLU_00002_gene", "PFLU_00002_mRNA", "PFLU_00002",
"PFLU_00003_gene", "PFLU_00003_mRNA", "PFLU_00003")
type=c("gene", "mRNA","CDS","gene", "mRNA","CDS","gene", "mRNA","NA")
df <- data.frame(name, type)
name type
1 PFLU_00001_gene gene
2 PFLU_00001_mRNA mRNA
3 PFLU_00001 CDS
4 PFLU_00002_gene gene
5 PFLU_00002_mRNA mRNA
6 PFLU_00002 CDS
7 PFLU_00003_gene gene
8 PFLU_00003_mRNA mRNA
9 PFLU_00003 NA
I would like to extract the unique names from the column "name",
based only on the first part of the string (e.g., PFLU_00001).
I would like my data to look like this.
name
PFLU_00001
PFLU_00002
PFLU_00003
Any help and guidance are highly appreciated.
with best wishes,
LDT
A base R option using unique + gsub
unique(
  transform(
    df["name"],
    name = gsub("_\\D+$", "", name)
  )
)
gives
name
1 PFLU_00001
4 PFLU_00002
7 PFLU_00003
We can use str_remove to remove the _ followed by one or more characters that are not _ ([^_]+) at the end ($) of the string, with a regex lookbehind ((?<=[0-9])) so that it matches only a _ that follows a digit.
library(dplyr)
library(stringr)
df %>%
  transmute(name = str_remove(name, "(?<=[0-9])_[^_]+$")) %>%
  distinct(name)
-output
# name
#1 PFLU_00001
#2 PFLU_00002
#3 PFLU_00003
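The two patterns used in the answers give the same result on the question's data, but they are not equivalent. The sketch below compares them in base R (the stringr pattern is run through sub() with perl = TRUE, which is my substitution). The third input, with a digit in its suffix, is a hypothetical name that does not appear in the question.

```r
# First two names are from the question; the third is a hypothetical
# name whose suffix contains a digit
x <- c("PFLU_00001_gene", "PFLU_00001", "PFLU_00001_tRNA2")

# Pattern from the base R answer: "_" followed only by non-digits up to the end
a <- sub("_\\D+$", "", x)

# Pattern from the str_remove answer: "_" preceded by a digit,
# followed by any non-underscore characters up to the end
b <- sub("(?<=[0-9])_[^_]+$", "", x, perl = TRUE)

data.frame(x, non_digit_suffix = a, lookbehind = b)
```

Under these assumptions the non-digit pattern leaves the third name unchanged, while the lookbehind pattern strips its suffix; which behaviour is wanted depends on the naming scheme.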

Make a new column in a data frame containing text extracted from another based on a list or vector

I have an R data frame containing a text string column. I want to add a new column that holds the word from a target list that matches each string. I understand how to do this for one specific text target, as in the reproducible example below:
#make a data frame
library(tidyverse)
d=c("Buy apples here","Pears are cheap","Oranges for sale", "Potatoes are not fruit")
df<-as.data.frame(d)
#extract 'Orange' into a new column called 'fruit'
df<-df%>%mutate(fruit = str_extract(d, "Orange"))
However, how do I vectorise this by using a list of words as my targets?
#target words
f=c("orange", "apple","pear")
dfa<-as.data.frame(f)
And how do I ignore case, so that 'apple' and 'Orange' both produce a match and the correct fruit description is put into the new column?
#desired output
f1=c("apple","pear","orange","<NA>")
dfb<-as.data.frame(cbind(d,f1))
dfb
Many thanks.
You could build a regex from your vector of strings to be matched, pasting them together and separating each with the alternation operator |. You can eliminate case as a concern by converting both d and f to uppercase (or lowercase) during the matching:
df %>%
  mutate(fruit = str_extract(toupper(d),
                             toupper(paste(unique(dfa$f), collapse = "|"))))
#> d fruit
#> 1 Buy apples here APPLE
#> 2 Pears are cheap PEAR
#> 3 Oranges for sale ORANGE
#> 4 Potatoes are not fruit <NA>
In base R, we can use regmatches/regexpr
v1 <- regexpr(toupper(paste(unique(dfa$f), collapse = "|")), toupper(d))
out <- character(length(d))
out[v1 >0] <- regmatches(toupper(d), v1)
out
#[1] "APPLE" "PEAR" "ORANGE" ""
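Both answers above return uppercase matches, and the base R one returns "" rather than NA when nothing matches. If the lowercase target words and NA from the desired output are wanted, one possible variation is sketched below. It relies on the ignore.case argument of regexpr instead of toupper; the variable names pattern, m and fruit are illustrative.

```r
d <- c("Buy apples here", "Pears are cheap", "Oranges for sale",
       "Potatoes are not fruit")
f <- c("orange", "apple", "pear")

# One alternation pattern built from the target words
pattern <- paste(f, collapse = "|")

# Case-insensitive search; m is -1 where no target word is found
m <- regexpr(pattern, d, ignore.case = TRUE)

# Start from NA so rows without a match stay missing
fruit <- rep(NA_character_, length(d))
fruit[m > 0] <- tolower(regmatches(d, m))   # regmatches returns matches only

data.frame(d, fruit)
```

Lowercasing the extracted text works here because the targets in f are all lowercase; mixed-case targets would need a lookup back into f instead.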

Create a new column based on matched string in another column using grep function

I'm trying to use the grep function to create a new column based on a matched string (no match = 0, match = 1), but I'm not getting the expected results.
#my data
data<- data.frame(col1 = c("name","no_match","123","0.19","rand?m","also_no_match"))
#string to match
match_txt <- "[0-9]\\.[0-9][0-9]|name|name2|wh??|[0-9]|\\?" #used "|" to match multiple strings
#create the new column using for loop
for (i in 1:nrow(data))
{
data$col2[i] <- grep(match_txt , data$col1[i])
}
# I get the error below:
# Error in data$col2[i] <- grep(match_txt, data$col1[i], ignore.case = TRUE) : replacement has length zero
#this is expected correct results:
expected_data <- data.frame(col1 = c("name","no_match","123","0.19","rand?m","also_no_match"),
col2 = c(1,0,1,1,1,0))
grep/grepl are vectorised, so we can use them directly on the column with the pattern. Here grepl returns TRUE/FALSE for each element; wrapping it in as.integer converts those logical values to 1/0 respectively.
data$col2 <- as.integer(grepl("[0-9]\\.[0-9][0-9]|name|name2|wh??|[0-9]|\\?", data$col1))
data
# col1 col2
#1 name 1
#2 no_match 0
#3 123 1
#4 0.19 1
#5 rand?m 1
#6 also_no_match 0
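The error in the original loop comes from what grep returns: the indices of matching elements, which is a zero-length vector when nothing matches. A short sketch of the difference, reusing the pattern and values from the question (the names no_hit, hit and col2 are illustrative):

```r
col1 <- c("name", "no_match", "123", "0.19", "rand?m", "also_no_match")
match_txt <- "[0-9]\\.[0-9][0-9]|name|name2|wh??|[0-9]|\\?"

# grep() returns positions of matches, so a miss yields integer(0),
# which cannot be assigned into a single cell
no_hit <- grep(match_txt, col1[2])
hit    <- grep(match_txt, col1[1])
length(no_hit)   # zero-length result is what triggers the error

# grepl() always returns one logical per input element
col2 <- as.integer(grepl(match_txt, col1))
col2
```

Because grepl always returns exactly one value per element, the loop is unnecessary and the whole column can be filled in a single call.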

Separate column into three columns with grouping [duplicate]

This question already has answers here:
How to strsplit different number of strings in certain column by do function
(1 answer)
tidyr separate only first n instances [duplicate]
(2 answers)
Closed 5 years ago.
I have a column of full names that should be separated into three columns on spaces. The problem is that some full names contain more than three words; the fourth and later words shouldn't be dropped but added to the third part.
For instance, "Abdullaeva Mehseti Nuraddin Kyzy" should be separated as:
| Abdullaeva | Mehseti | Nuraddin Kyzy |
I tried to split the column with the tidyr package as follows, but this way the third part contains only the one word after the second space.
df <- df %>%
  separate('FULL_NAME', c("1st_part","2d_part","3d_part"), sep=" ")
Any help will be appreciated.
Use the extra argument:
# dummy data
df1 <- data.frame(x = c(
  "some name1",
  "justOneName",
  "some three name",
  "Abdullaeva Mehseti Nuraddin Kyzy"))
library(tidyr)
library(dplyr)
df1 %>%
  separate(x, c("a1", "a2", "a3"), extra = "merge")
#            a1      a2            a3
# 1        some   name1          <NA>
# 2 justOneName    <NA>          <NA>
# 3        some   three          name
# 4  Abdullaeva Mehseti Nuraddin Kyzy
# Warning message:
# Too few values at 2 locations: 1, 2
From manual:
extra
If sep is a character vector, this controls what happens when
there are too many pieces. There are three valid options:
- "warn" (the default): emit a warning and drop extra values.
- "drop": drop any extra values without a warning.
- "merge": only splits at most length(into) times
Since you said that this dataset only has a first name, a second name, and a last name, you can also use str_split_fixed from stringr:
setNames(data.frame(stringr::str_split_fixed(df1$x, ' ', 3)), paste0('a', 1:3))
Which gives,
a1 a2 a3
1 some name1
2 justOneName
3 some three name
4 Abdullaeva Mehseti Nuraddin Kyzy
Note that you can fill the empty slots with NA as per usual
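One way to do that fill in base R is sketched below. To keep the example free of package dependencies, the data frame res is typed in by hand to mirror the str_split_fixed output shown above (empty strings where a piece is missing); the name res is illustrative.

```r
# Hand-typed copy of the str_split_fixed result above,
# with "" standing in for the missing pieces
res <- data.frame(a1 = c("some", "justOneName", "some", "Abdullaeva"),
                  a2 = c("name1", "", "three", "Mehseti"),
                  a3 = c("", "", "name", "Nuraddin Kyzy"),
                  stringsAsFactors = FALSE)

# res == "" is a logical matrix; use it to overwrite empty cells with NA
res[res == ""] <- NA
res
```

After this step the result has NA in the same cells where the separate(extra = "merge") output reports missing values.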
