I have a data.frame with a single column "Term". Each entry is a string of at least two words, with no upper limit.
From this column I would like to extract the last word and store it in a new column "Last".
# load library
library(dplyr)
library(stringi)
# read csv
df <- read.csv("filename.txt", stringsAsFactors = FALSE)
# show df
head(df)
# Term
# 1 this is for the
# 2 thank you for
# 3 the following
# 4 the fact that
# 5 the first
I have written a function LastWord which works well when a single string is given.
However, when a vector of strings is given, it only processes the first string in the vector. This has forced me to call it through mapply inside mutate to add the column, as seen below.
LastWord <- function(InputWord) {
stri_sub(InputWord,stri_locate_last(str=InputWord, fixed=" ")[1,1]+1, stri_length(InputWord))
}
df <- mutate(df, Last=mapply(LastWord, df$Term))
Using mapply makes the process very slow. I generally need to process around 10 to 15 million lines or terms at a time. It takes hours.
Could anyone suggest a way to write the LastWord function so that it works on a whole vector rather than a single string?
You can try:
df$LastWord <- gsub(".* ([^ ]+)$", "\\1", df$Term)
df
# Term LastWord
# 1 this is for the the
# 2 thank you for for
# 3 the following following
# 4 the fact that that
# 5 the first first
In the gsub call, the expression between the parentheses, [^ ]+, matches one or more characters that are not a space ([a-zA-Z]+ could work too instead of [^ ]+), anchored at the end of the string ($). Because it is wrapped in parentheses, it is captured and can be referred to as \\1 in the replacement. The pattern matches the whole string, so gsub replaces it with only the captured last word.
EDIT:
As @akrun mentioned in the comments, sub can be used instead of gsub in this case.
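For illustration, the same idea can be wrapped in a small vectorized helper. This is only a sketch: last_word is a hypothetical name, and stripping surrounding whitespace with trimws first is an assumption about how terms with trailing spaces should be treated.

```r
# Hypothetical helper: last whitespace-separated word of each element.
last_word <- function(x) {
  # Drop leading/trailing whitespace, then delete everything up to and
  # including the last whitespace character. Strings without any
  # whitespace are returned unchanged.
  sub(".*\\s", "", trimws(x))
}

terms <- c("this is for the", "thank you for", "single", "trailing space ")
last_word(terms)
# [1] "the"    "for"    "single" "space"
```

Because sub is already vectorized, the helper can be applied to a whole column at once, with no mapply.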
To extract the last word only, you can use a vectorized function from stringi directly which should be very fast
library(stringi)
df$LastWord <- stri_extract_last_words(df$Term)
Now if you want two new columns, one containing all words but the last and another one containing the last words, you can use some regular expression like
stri_match(df$Term, regex= "([\\w*\\s]*)\\s(\\w*)")
# [,1] [,2] [,3]
# [1,] "this is for the" "this is for" "the"
# [2,] "thank you for" "thank you" "for"
# [3,] "the following" "the" "following"
# [4,] "the fact that" "the fact" "that"
# [5,] "the first" "the" "first"
So what you want is
df[c("ExceptLast", "LastWord")] <-
stri_match(df$Term, regex= "([\\w*\\s]*)\\s(\\w*)")[, 2:3]
(Note that this won't work if df$Term contains only one word. In that case you will need to modify the regular expression, depending on which column you want it to be included in.)
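One possible way to cope with single-word terms is sketched below in base R. It assumes that a lone word should go into the LastWord column and leave ExceptLast empty; the sample terms and the split_last name are made up for the example.

```r
terms <- c("this is for the", "the first", "single")

split_last <- data.frame(
  Term = terms,
  # Remove the last word together with the whitespace in front of it.
  ExceptLast = sub("\\s*\\S+$", "", terms),
  # Remove everything up to and including the last whitespace character.
  LastWord = sub(".*\\s", "", terms),
  stringsAsFactors = FALSE
)
split_last
```

If the single word should land in ExceptLast instead, the two patterns would need to be adjusted accordingly.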
Suppose I want to extract all letters between the letter a and c. I've been so far using the stringr package which gives a clear idea of the full matches and the groups. The package for example would give the following.
library(stringr)
str_match_all("abc", "a([a-z])c")
# [[1]]
# [,1] [,2]
# [1,] "abc" "b"
Suppose I only want to replace the group, and not the full match---in this case the letter b. The following would, however, replace the full match.
str_replace_all("abc", "a([a-z])c", "z")
# [1] "z"
# Desired result: "azc"
Would there be any good way to replace only the capture group? Suppose I also wanted to do this for multiple matches.
str_match_all("abcdef", "a([a-z])c|d([a-z])f")
# [[1]]
# [,1] [,2] [,3]
# [1,] "abc" "b" NA
# [2,] "def" NA "e"
str_replace_all("abcdef", "a([a-z])c|d([a-z])f", "z")
# [1] "zz"
# Desired result: "azcdzf"
Matching groups was easy enough, but I haven't found a solution when a replacement is desired.
This is not how regular expressions were designed. Capturing is a mechanism for extracting the parts of a string you need; in a replacement, capture groups are used to keep parts of the match, not to discard them.
Thus, the natural solution is to wrap what you want to keep in capturing groups.
In this case, use
str_replace_all("abc", "(a)[a-z](c)", "\\1z\\2")
Or with lookarounds (if the lookbehind is a fixed/known width pattern):
str_replace_all("abc", "(?<=a)[a-z](?=c)", "z")
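For the two-pattern case from the question, the lookaround idea can be combined with alternation. The sketch below uses base R's gsub with perl = TRUE rather than stringr, purely so that it runs without extra packages.

```r
# Replace only the middle letter of "a?c" or "d?f"; the surrounding
# letters are checked by lookarounds and are not part of the match.
pattern <- "(?<=a)[a-z](?=c)|(?<=d)[a-z](?=f)"
result <- gsub(pattern, "z", "abcdef", perl = TRUE)
result
# [1] "azcdzf"
```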
Usually when I want to replace a certain pattern of characters in a string, I use the grep family of functions, which work with regular expressions.
You can use the sub function from that family to make replacements in strings.
Example:
sub("b","z","abc")
[1] "azc"
You may face more challenging replacements; for those, the grep family offers a lot of functionality:
replacing a run of characters other than a and c:
sub("[^ac]+","z","abBbbbc")
[1] "azc"
replacing the first run of two consecutive lowercase b's (the leading b is followed by B, so it does not match):
sub("b{2}","z","abBbbbc")
[1] "abBzbc"
replacing all characters after the pattern:
sub("b.*","z","abc")
[1] "az"
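the same, but requiring a final character other than c; nothing after b qualifies here, so there is no match and the string is returned unchanged: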
sub("b.*[^c]","z","abc")
[1] "abc"
And so on.
You can search the internet for "regular expressions in R using grep" to find many more ways to work with regular expressions.
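One detail worth seeing side by side is that sub replaces only the first match, while gsub replaces every match. A minimal sketch with a made-up input string:

```r
x <- "abcabc"

first_only <- sub("b", "z", x)   # only the first match is replaced
first_only
# [1] "azcabc"

all_matches <- gsub("b", "z", x) # every match is replaced
all_matches
# [1] "azcazc"

# ignore.case = TRUE makes the pattern match "B" as well.
any_case <- gsub("b", "z", "aBc", ignore.case = TRUE)
any_case
# [1] "azc"
```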
Is there a function for counting the number of times a particular keyword is contained in a dataset?
For example, if dataset <- c("corn", "cornmeal", "corn on the cob", "meal") the count would be 3.
Let's for the moment assume you wanted the number of elements containing "corn":
length(grep("corn", dataset))
[1] 3
After you get the basics of R down better you may want to look at the "tm" package.
EDIT: I realize that this time around you wanted any-"corn" but in the future you might want to get word-"corn". Over on r-help Bill Dunlap pointed out a more compact grep pattern for gathering whole words:
grep("\\<corn\\>", dataset)
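To make the difference between the two patterns concrete, here is a small sketch that counts matching elements both ways on the example vector from the question:

```r
dataset <- c("corn", "cornmeal", "corn on the cob", "meal")

# Elements containing "corn" anywhere, including inside longer words.
any_corn <- sum(grepl("corn", dataset, fixed = TRUE))
any_corn
# [1] 3

# Elements containing "corn" as a whole word only.
word_corn <- length(grep("\\<corn\\>", dataset))
word_corn
# [1] 2
```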
Another quite convenient and intuitive way to do it is to use the str_count function of the stringr package:
library(stringr)
dataset <- c("corn", "cornmeal", "corn on the cob", "meal")
# for mere occurrences of the pattern:
str_count(dataset, "corn")
# [1] 1 1 1 0
# for occurrences of the word alone:
str_count(dataset, "\\bcorn\\b")
# [1] 1 0 1 0
# summing it up
sum(str_count(dataset, "corn"))
# [1] 3
You can also count only the elements that are exactly equal to "corn" (for the example above this gives 1, not 3):
length(dataset[which(dataset=="corn")])
I'd just do it with string division like:
library(roperators)
dataset <- c("corn", "cornmeal", "corn on the cob", "meal")
# for each vector element:
dataset %s/% 'corn'
# for everything:
sum(dataset %s/% 'corn')
You can use the str_count function from the stringr package to get the number of times a keyword occurs in a given character vector.
The pattern argument of the str_count function accepts a regular expression that can be used to specify the keyword.
The regular expression syntax is very flexible and allows matching whole words as well as character patterns.
For example the following code will count all occurrences of the string "corn" and will return 3:
sum(str_count(dataset, regex("corn")))
To match complete words use:
sum(str_count(dataset, regex("\\bcorn\\b")))
The "\b" is used to specify a word boundary. With str_count, the default definition of a word boundary treats the apostrophe as a boundary, so if your dataset contains the string "corn's", it would be matched and included in the result.
To prevent words containing an apostrophe from being counted, use the regex function with the parameter uword = TRUE. This makes the regular expression engine use the Unicode TR 29 definition of word boundaries (see http://unicode.org/reports/tr29/tr29-4.html), which does not consider the apostrophe a word boundary.
The following code gives the number of times the word "corn" occurs. Words such as "corn's" will not be included.
sum(str_count(dataset, regex("\\bcorn\\b", uword = TRUE)))
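If stringr is not available, a similar effect can be approximated in base R. The sketch below is a different technique from the uword option: it counts matches with gregexpr and excludes "corn's" with a negative lookahead. The count_matches helper and the extra "corn's silk" element are made up for the example.

```r
dataset <- c("corn", "cornmeal", "corn on the cob", "meal", "corn's silk")

# Hypothetical helper: number of regex matches in each element.
count_matches <- function(x, pattern) {
  lengths(regmatches(x, gregexpr(pattern, x, perl = TRUE)))
}

# Whole-word matches; "corn's" counts because "'" ends the word.
whole_word <- count_matches(dataset, "\\bcorn\\b")
whole_word
# [1] 1 0 1 0 1

# Same, but reject matches directly followed by an apostrophe.
no_apostrophe <- count_matches(dataset, "\\bcorn\\b(?!')")
no_apostrophe
# [1] 1 0 1 0 0
```

Summing either vector gives the total count over the whole dataset.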
I'm trying to concatenate multiple rows into one.
Each row is either a line starting with ">" (the gene identifier) or sequence information:
>Zfyve21|ENSMUSG00000021286|ENSMUST00000021714
GCGGGCGGGGCGGGGTGGCGCCTTGTGTGGGCTCAGCGCGGGCGGTGGCGTGAGGGGCTC
AGGCGGAGA
>Laptm4a|ENSMUSG00000020585|ENSMUST00000020909
GCAGTGACAAAGACAACGTGGCGAAAGACAGCGCCAAAAATCTCCGTGCCCGCTGTCTGC
CACCAACTCCGTCTTGTTTCACCCTTCTCCTCCTTGCGGAGCTCGTCTGGGAGACGGTGA
ATTACCGAGTTACCCTCAATTCCTACAGCCCCCGACAGCGAGCCCAGCCACGCGCACCGC
GGTCAAACAGCGCCGGAGAGAGTTGAACTTTTGATTGGGCGTGATCTGTTTCAATCTCCA
CATCTTCTCCAATCAGAAGCCAGGTAGCCCGGCCTTCCGCTCTTCGTTGGTCTGT
Here I show just two genes, but hundreds more follow.
I want to leave the gene identifier lines as they are and concatenate the sequence only where it is split over multiple rows.
The final result should look like this, with each sequence combined into one row, without any spaces in between:
>Zfyve21|ENSMUSG00000021286|ENSMUST00000021714
GCGGGCGGGGCGGGGTGGCGCCTTGTGTGGGCTCAGCGCGGGCGGTGGCGTGAGGGGCTCAGGCGGAGA
>Laptm4a|ENSMUSG00000020585|ENSMUST00000020909
GCAGTGACAAAGACAACGTGGCGAAAGACAGCGCCAAAAATCTCCGTGCCCGCTGTCTGCCACCAACTCCGTCTTGTTTCACCCTTCTCCTCCTTGCGGAGCTCGTCTGGGAGACGGTGAATTACCGAGTTACCCTCAATTCCTACAGCCCCCGACAGCGAGCCCAGCCACGCGCACCGCGGTCAAACAGCGCCGGAGAGAGTTGAACTTTTGATTGGGCGTGATCTGTTTCAATCTCCACATCTTCTCCAATCAGAAGCCAGGTAGCCCGGCCTTCCGCTCTTCGTTGGTCTGT
By using the paste function in R, I was able to achieve this manually,
i.e. paste(dat[2,1], dat[3,1], sep="")
However, I have a list of hundreds of genes, so I need a way to concatenate the rows automatically.
I was thinking of a for loop: if a row starts with ">", skip it; otherwise, concatenate it.
But I'm not an expert in bioinformatics or R, and it is hard for me to write a script to achieve it.
Any help would be greatly appreciated!
Read the lines, flag the identifier lines (the ones containing "|"), and paste the sequence lines of each group together:
Lines <-
readLines(textConnection(">Zfyve21|ENSMUSG00000021286|ENSMUST00000021714
GCGGGCGGGGCGGGGTGGCGCCTTGTGTGGGCTCAGCGCGGGCGGTGGCGTGAGGGGCTC
AGGCGGAGA
>Laptm4a|ENSMUSG00000020585|ENSMUST00000020909
GCAGTGACAAAGACAACGTGGCGAAAGACAGCGCCAAAAATCTCCGTGCCCGCTGTCTGC
CACCAACTCCGTCTTGTTTCACCCTTCTCCTCCTTGCGGAGCTCGTCTGGGAGACGGTGA
ATTACCGAGTTACCCTCAATTCCTACAGCCCCCGACAGCGAGCCCAGCCACGCGCACCGC
GGTCAAACAGCGCCGGAGAGAGTTGAACTTTTGATTGGGCGTGATCTGTTTCAATCTCCA
CATCTTCTCCAATCAGAAGCCAGGTAGCCCGGCCTTCCGCTCTTCGTTGGTCTGT"))
geneIdx <- grepl("\\|", Lines)
grp <- cumsum(geneIdx)
grp
#[1] 1 1 1 2 2 2 2 2 2
tapply(Lines, grp, FUN=function(x) c(x[1], paste(x[-1], collapse="") ) )
#----------------------
$`1`
[1] ">Zfyve21|ENSMUSG00000021286|ENSMUST00000021714"
[2] "GCGGGCGGGGCGGGGTGGCGCCTTGTGTGGGCTCAGCGCGGGCGGTGGCGTGAGGGGCTCAGGCGGAGA"
$`2`
[1] ">Laptm4a|ENSMUSG00000020585|ENSMUST00000020909"
[2] "GCAGTGACAAAGACAACGTGGCGAAAGACAGCGCCAAAAATCTCCGTGCCCGCTGTCTGCCACCAACTCCGTCTTGTTTCACCCTTCTCCTCCTTGCGGAGCTCGTCTGGGAGACGGTGAATTACCGAGTTACCCTCAATTCCTACAGCCCCCGACAGCGAGCCCAGCCACGCGCACCGCGGTCAAACAGCGCCGGAGAGAGTTGAACTTTTGATTGGGCGTGATCTGTTTCAATCTCCACATCTTCTCCAATCAGAAGCCAGGTAGCCCGGCCTTCCGCTCTTCGTTGGTCTGT"
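If a flat character vector (identifier, sequence, identifier, sequence, ...) is more convenient than a list, the grouping idea can be sketched as follows. The short input lines are made up, headers are detected by a leading ">" rather than by "|", and the sketch assumes every identifier is followed by at least one sequence line.

```r
# Made-up miniature input in the same layout as the question.
lines <- c(">geneA|id1", "ACGT", "AC", ">geneB|id2", "GG", "TT", "A")

is_header <- startsWith(lines, ">")
grp <- cumsum(is_header)  # group number of each line

# Paste the sequence lines of each group into one string.
seqs <- as.vector(tapply(lines[!is_header], grp[!is_header],
                         paste, collapse = ""))

# Interleave headers and joined sequences: header 1, sequence 1, ...
out <- as.vector(rbind(lines[is_header], seqs))
out
# [1] ">geneA|id1" "ACGTAC"     ">geneB|id2" "GGTTA"
```

The result could then be written back to a file with writeLines(out, ...).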
Would regular expressions do the trick? The regular expression below deletes newlines (\\n) not followed by > ((?!>) being a negative lookahead).
text <-">Zfyve21|ENSMUSG00000021286|ENSMUST00000021714
GCGGGCGGGGCGGGGTGGCGCCTTGTGTGGGCTCAGCGCGGGCGGTGGCGTGAGGGGCTC
AGGCGGAGA
>Laptm4a|ENSMUSG00000020585|ENSMUST00000020909
GCAGTGACAAAGACAACGTGGCGAAAGACAGCGCCAAAAATCTCCGTGCCCGCTGTCTGC
CACCAACTCCGTCTTGTTTCACCCTTCTCCTCCTTGCGGAGCTCGTCTGGGAGACGGTGA
ATTACCGAGTTACCCTCAATTCCTACAGCCCCCGACAGCGAGCCCAGCCACGCGCACCGC
GGTCAAACAGCGCCGGAGAGAGTTGAACTTTTGATTGGGCGTGATCTGTTTCAATCTCCA
CATCTTCTCCAATCAGAAGCCAGGTAGCCCGGCCTTCCGCTCTTCGTTGGTCTGT"
cat(text)
cat(gsub("\\n(?!>)", "", text, perl=TRUE))
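Result (note that the newline after each identifier line is not followed by > either, so it is removed as well and the identifier ends up on the same line as its sequence):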
>Zfyve21|ENSMUSG00000021286|ENSMUST00000021714GCGGGCGGGGCGGGGTGGCGCCTTGTGTGGGCTCAGCGCGGGCGGTGGCGTGAGGGGCTCAGGCGGAGA
>Laptm4a|ENSMUSG00000020585|ENSMUST00000020909GCAGTGACAAAGACAACGTGGCGAAAGACAGCGCCAAAAATCTCCGTGCCCGCTGTCTGCCACCAACTCCGTCTTGTTTCACCCTTCTCCTCCTTGCGGAGCTCGTCTGGGAGACGGTGAATTACCGAGTTACCCTCAATTCCTACAGCCCCCGACAGCGAGCCCAGCCACGCGCACCGCGGTCAAACAGCGCCGGAGAGAGTTGAACTTTTGATTGGGCGTGATCTGTTTCAATCTCCACATCTTCTCCAATCAGAAGCCAGGTAGCCCGGCCTTCCGCTCTTCGTTGGTCTGT
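Since the pattern above also removes the newline after each identifier line, one possible variation is to delete a newline only when it directly follows a sequence character. This is just a sketch on made-up input, and it assumes that identifier lines never end in one of the letters A, C, G, T or N (the Ensembl-style identifiers in the question end in digits).

```r
# Made-up miniature input in the same layout as the question.
text <- ">g1|x1\nACGT\nAC\n>g2|x2\nGG\nTT"

# Delete a newline only if it is preceded by a sequence letter and
# not followed by ">", so the newline after an identifier is kept.
joined <- gsub("(?<=[ACGTN])\\n(?!>)", "", text, perl = TRUE)
cat(joined)
```

Here cat prints four lines: each identifier on its own line, followed by its joined sequence.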
I am very new to R, and I could not find a simple example online of how to remove the last n characters from every element of a vector (array?)
I come from a Java background, so what I would like to do is to iterate over every element of a$data and remove the last 3 characters from every element.
How would you go about it?
Here is an example of what I would do. I hope it's what you're looking for.
char_array = c("foo_bar","bar_foo","apple","beer")
a = data.frame("data"=char_array,"data2"=1:4)
a$data = substr(a$data,1,nchar(a$data)-3)
a should now contain:
data data2
1 foo_ 1
2 bar_ 2
3 ap 3
4 b 4
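The same idea can be wrapped in a small reusable function. This is a sketch with a hypothetical name, drop_last. The as.character call is an assumption to cover factor columns, which data.frame creates by default for strings in R versions before 4.0 and which nchar does not accept.

```r
# Hypothetical helper: remove the last n characters of each element.
drop_last <- function(x, n) {
  x <- as.character(x)  # nchar() does not accept factors
  # For strings shorter than n the stop position is below 1 and
  # substr returns "".
  substr(x, 1, nchar(x) - n)
}

trimmed <- drop_last(c("foo_bar", "bar_foo", "apple", "beer", "so"), 3)
trimmed
# [1] "foo_" "bar_" "ap"   "b"    ""
```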
Here's a way with gsub:
cs <- c("foo_bar","bar_foo","apple","beer")
gsub('.{3}$', '', cs)
# [1] "foo_" "bar_" "ap" "b"
Although this is mostly the same as the answer by @nfmcclure, I prefer using the stringr package, as it provides a set of functions whose names are more consistent and descriptive than those in base R (in fact I always google "how to get the number of characters in R" because I can't remember the name nchar()).
library(stringr)
str_sub(iris$Species, end=-4)
#or
str_sub(iris$Species, 1, str_length(iris$Species)-3)
This removes the last 3 characters from each value in the Species column.
The same may be achieved with the stringi package:
library('stringi')
char_array <- c("foo_bar","bar_foo","apple","beer")
a <- data.frame("data"=char_array, "data2"=1:4)
(a$data <- stri_sub(a$data, 1, -4)) # from the first to the fourth-from-last character
## [1] "foo_" "bar_" "ap" "b"
Similar to @Matthew_Plourde's answer using gsub, but with a pattern that trims down to zero characters, i.e. returns "" if the original string is shorter than the number of characters to cut:
cs <- c("foo_bar","bar_foo","apple","beer","so","a")
gsub('.{0,3}$', '', cs)
# [1] "foo_" "bar_" "ap" "b" "" ""
The difference is that the {0,3} quantifier allows 0 to 3 matches, whereas {3} requires exactly 3 matches; otherwise no match is found, in which case gsub returns the original, unmodified string.
N.B. using {,3} would be equivalent to {0,3}, I simply prefer the latter notation.
See here for more information on regex quantifiers:
https://www.regular-expressions.info/refrepeat.html
A friendly hint when cutting off or replacing n characters of a string:
--> be aware of whitespace in your strings!
Use base::gsub(' ', '', x, fixed = TRUE) to get rid of unwanted spaces in your strings. I spent quite some time finding out why the great solutions provided above did not work for me, and thought this might be useful for others as well ;)
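A short sketch of the whitespace pitfall, using made-up strings. It uses trimws, which only strips leading and trailing whitespace, as an alternative to removing every space. Whether inner spaces should also go depends on your data.

```r
x <- c("foo_bar ", " apple")

# The stray spaces count as characters, so the wrong part is cut.
raw_cut <- substr(x, 1, nchar(x) - 3)
raw_cut
# [1] "foo_b" " ap"

# Strip surrounding whitespace first, then cut.
clean <- trimws(x)
clean_cut <- substr(clean, 1, nchar(clean) - 3)
clean_cut
# [1] "foo_" "ap"
```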
I have a CSV file. I want to read the file in R but split each line on only the first 2 commas, i.e. if there is a line like this in the file,
1,1000,I, am done, with you
In R I want this to become a row of a data frame with three columns, like this
> df <- data.frame("Id"="1","Count" ="1000", "Comment" = "I, am done, with you")
> df
Id Count Comment
1 1 1000 I, am done, with you
A regular expression will work.
For example, suppose str holds the rows you want to parse. Here suppose your csv file looks like
1,1000,I, am done, with you
2,500, i don't know
If you want to read from a file, just call readLines() to read all lines of the file into a character vector, just like str.
The technique is very simple. Here I use the {stringr} package to match the text and extract the information I need.
str <- c("1,1000,I, am done, with you", "2,500, i don't know")
library(stringr)
# match the strings by pattern integer,integer,anything
matches <- str_match(str,pattern="(\\d+),(\\d+),\\s*(.+)")
Here I briefly explain the pattern (\\d+),(\\d+),\\s*(.+). \\d represents a digit character, \\s represents a whitespace character, and . represents any character. + means one or more, * means zero or more. () groups parts of the pattern so that the function knows what we regard as one piece of information.
If you look at matches, it looks like
[,1] [,2] [,3] [,4]
[1,] "1,1000,I, am done, with you" "1" "1000" "I, am done, with you"
[2,] "2,500, i don't know" "2" "500" "i don't know"
As you can see, the str_match function has split each text into the columns of a matrix according to the pattern. All that is left is to transform the matrix into a data frame with the correct data types.
df <- data.frame(matches[,-1],stringsAsFactors=F)
colnames(df) <- c("Id","Count","Comment")
df <- transform(df,Id=as.integer(Id),Count=as.integer(Count))
df is our target:
Id Count Comment
1 1 1000 I, am done, with you
2 2 500 i don't know
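For comparison, roughly the same steps can be sketched in base R with regexec and regmatches. This is not the stringr approach above, just an illustration; it assumes every line matches the pattern (a non-matching line yields an empty match and would be silently dropped by rbind), and the names rows, parts and parsed are made up.

```r
rows <- c("1,1000,I, am done, with you", "2,500, i don't know")

# Capture two integer fields, then everything after the second comma.
m <- regmatches(rows, regexec("^([0-9]+),([0-9]+),(.*)$", rows))
parts <- do.call(rbind, m)  # one row per line: full match + 3 groups

parsed <- data.frame(
  Id = as.integer(parts[, 2]),
  Count = as.integer(parts[, 3]),
  Comment = trimws(parts[, 4]),  # drop the space after the comma
  stringsAsFactors = FALSE
)
parsed
```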