split a string including punctuation in R

file_name <- 'I am a good boy who went to Africa, Brazil and India'
strsplit(file_name, ' ')
[[1]]
[1] "I" "am" "a" "good" "boy" "who" "went" "to" "Africa," "Brazil"
[11] "and" "India"
In the above implementation, I want all the strings returned individually. However, strsplit returns 'Africa,' as a single element, whereas I want the , returned separately as well.
The expected output should be as below, with the , appearing as a separate element:
[[1]]
[1] "I" "am" "a" "good" "boy" "who" "went" "to" "Africa" "," "Brazil"
[11] "and" "India"

Perhaps this helps
strsplit(file_name, '\\s+|(?<=[a-z])(?=[[:punct:]])', perl = TRUE)
#[[1]]
#[1] "I" "am" "a" "good" "boy" "who" "went"
#[8] "to" "Africa" "," "Brazil" "and" "India"
Or use an extraction method
regmatches(file_name, gregexpr("[[:alnum:]]+|,", file_name))
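For completeness, running the extraction approach on the example string gives the same result as the strsplit one, with the comma as its own element:

```r
file_name <- "I am a good boy who went to Africa, Brazil and India"
# Extract either runs of alphanumeric characters or a literal comma
tokens <- regmatches(file_name, gregexpr("[[:alnum:]]+|,", file_name))
tokens[[1]]
# [1] "I"      "am"     "a"      "good"   "boy"    "who"    "went"
# [8] "to"     "Africa" ","      "Brazil" "and"    "India"
```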


how to delete a word in R from a string using regex

Hi, I would like to delete the blanks (see element no. 170) and many others from the vector below. Any idea how I can go about it?
words2
[116] "been" "any" "reasonable" "cause" "for"
[121] "such" "apprehension" "Indeed" "the" "most"
[126] "ample" "evidence" "to" "the" "contrary"
[131] "has" "all" "the" "while" "existed"
[136] "and" "been" "open" "to" "their"
[141] "inspection" "It" "is" "found" "in"
[146] "nearly" "all" "the" "published" "speeches"
[151] "of" "him" "who" "now" "addresses"
[156] "you" "I" "do" "but" "quote"
[161] "from" "one" "of" "those" "speeches"
[166] "when" "I" "declare" "that" ""
[171] "I" "have" "no" "purpose" "directly"
[176] "or" "indirectly" "to" "interfere" "with"
[181] "the" "institution" "of" "slavery" "in"
[186] "the" "States" "where" "it" "exists"
[191] "I" "believe" "I" "have" "no"
If you had a vector x = c(1, 2, 3, 2, 1) and you wanted to remove all 2s, you might do this: x[x != 2]. Similarly, you have a vector words2 and you want to remove the blanks "", so you can do this: words2[words2 != ""].
Of course, to remove them from words2 and save the result, you need to use <- or = to overwrite words2, as in
words2 = words2[words2 != ""] ## remove blanks
words2 = words2[nchar(words2) > 0] ## keep only strings with more than 0 characters
## remove blank and "bad string" strings
words2 = words2[! words2 %in% c("", "bad string")]
Regex is useful if you are looking inside strings (e.g., remove strings that contain an "a"), or if you are using patterns (e.g., remove strings that have a number at the end). When you are looking for exact matches of a whole string, you don't need regex.
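As a sketch of those two pattern-based cases (the example strings here are made up for illustration):

```r
words <- c("apple", "berry", "cherry", "fig2", "plum")

# Remove strings that contain an "a" anywhere
no_a <- words[!grepl("a", words)]
no_a
# [1] "berry"  "cherry" "fig2"   "plum"

# Remove strings that end in a digit
no_digit_end <- words[!grepl("[0-9]$", words)]
no_digit_end
# [1] "apple"  "berry"  "cherry" "plum"
```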

Regular expressions, extract specific parts of pattern

I haven't worked with regular expressions for quite some time, so I'm not sure if what I want to do can be done "directly" or if I have to work around.
My expressions look like the following two:
crb_gdp_g_100000_16_16_ftv_all.txt
crt_r_g_25000_20_40_flin_g_2.txt
Only the parts replaced by an asterisk vary; the rest is constant (or irrelevant, as with everything after the "f*_" part):
cr*_*_g_*_*_*_f*_
Is there a straightforward way to get only the values of the asterisk parts? E.g. in the case of "r" or "gdp" I have to include the underscores, otherwise I also match the r at the beginning of the expression. Including the underscores gives "_r_" or "_gdp_", but I only want "r" or "gdp".
Or in short: I know a lot about my expressions but I only want to extract the varying parts. (How) Can I do that?
You can use sub with captures and then strsplit to get a list of the separated elements:
str <- c("crb_gdp_g_100000_16_16_ftv_all.txt", "crt_r_g_25000_20_40_flin_g_2.txt")
strsplit(sub("cr([[:alnum:]]+)_([[:alnum:]]+)_g_([[:alnum:]]+)_([[:alnum:]]+)_([[:alnum:]]+)_f([[:alnum:]]+)_.+", "\\1.\\2.\\3.\\4.\\5.\\6", str), "\\.")
#[[1]]
#[1] "b" "gdp" "100000" "16" "16" "tv"
#[[2]]
#[1] "t" "r" "25000" "20" "40" "lin"
Note: I replaced \\w with [[:alnum:]] to avoid inclusion of the underscore.
We can also use regmatches and regexec to extract these values like this:
regmatches(str, regexec("^cr([^_]+)_([^_]+)_g_([^_]+)_([^_]+)_([^_]+)_f([^_]+)_.*$", str))
[[1]]
[1] "crb_gdp_g_100000_16_16_ftv_all.txt" "b"
[3] "gdp" "100000"
[5] "16" "16"
[7] "tv"
[[2]]
[1] "crt_r_g_25000_20_40_flin_g_2.txt" "t" "r"
[4] "25000" "20" "40"
[7] "lin"
Note that the first element in each vector is the full string, so to drop that, we can use lapply and "["
lapply(regmatches(str,
regexec("^cr([^_]+)_([^_]+)_g_([^_]+)_([^_]+)_([^_]+)_f([^_]+)_.*$", str)),
"[", -1)
[[1]]
[1] "b" "gdp" "100000" "16" "16" "tv"
[[2]]
[1] "t" "r" "25000" "20" "40" "lin"

Grouping elements of a list into sublists when a row is blank

I want to use the blank row represented by "" that exists in my list so I can group all the rows in between into sublists.
For example I have a long list that looks like this:
> data
[1] "data science"
[2] "big data"
[3] "machine learning"
[4] "BI"
[5] "analytics"
[6] ""
[7] "SAS"
[8] "R"
[9] "Python"
[10] "Spark"
[11] ""
[12] "Hive"
[13] "PIG"
[14] "IMPALA"
....
And I want something like this:
> output
[[1]]
[1] "data science" "big data" "machine learning" "BI" "analytics"
[[2]]
[1] "SAS" "R" "Python" "Spark"
[[3]]
[1] "Hive" "PIG" "IMPALA"
The indexing in my output may be wrong, but overall this is what I want.
Maybe something with split would do it.
You are correct that split can help you. Taking the cumsum of a logical vector gives a group ID that increments at every blank, and split uses those IDs to break your original vector into groups. You then have to drop the first element of each group because it is "". That's what tail does in the lapply:
set.seed(201)
x <- sample(letters, 20, replace = T)
x[c(6,12)] <- ""
> lapply(split(x, cumsum(x == "")), tail, -1)
$`0`
[1] "p" "p" "q" "r"
$`1`
[1] "v" "n" "g" "l" "t"
$`2`
[1] "p" "p" "n" "e" "t" "c" "j" "m"

split each character in R

I have song.txt file
*****
[1]"The snow glows white on the mountain tonight
Not a footprint to be seen."
[2]"A kingdom of isolation,
and it looks like I'm the Queen"
[3]"The wind is howling like this swirling storm inside
Couldn't keep it in;
Heaven knows I've tried"
*****
[4]"Don't let them in,
don't let them see"
[5]"Be the good girl you always have to be
Conceal, don't feel,
don't let them know"
[6]"Well now they know"
*****
I would like to loop over the lyrics and fill a list so that each element of the list is a character vector whose elements are the words of one verse, like
[1] "The" "snow" "glows" "white" "on" "the" "mountain" "tonight" "Not" "a" "footprint"
"to" "be" "seen." "A" "kingdom" "of" "isolation," "and" "it" "looks" "like" "I'm" "the"
"Queen" "The" "wind" "is" "howling" "like" "this" "swirling" "storm" "inside"
"Couldn't" "keep" "it" "in" "Heaven" "knows" "I've" "tried"
[2]"Don't" "let" "them" "in,""don't" "let" "them" "see" "Be" "the" "good" "girl" "you"
"always" "have" "to" "be" "Conceal," "don't" "feel," "don't" "let" "them" "know"
"Well" "now" "they" "know"
First I made an empty list with words <- vector("list", 2).
I think I should first put the text into one long character vector and locate the ***** delimiters that mark where each verse starts and stops, with
star="\\*{5}"
pindex = grep(star, page)
After this, what should I do?
It sounds like what you want is strsplit, run (effectively) twice. So, starting from the point of "a single long character string with verses separated by ***** and words separated by spaces" (which I assume is what you have?):
list_of_vectors <- lapply(strsplit(song, split = "\\*{5}"), function(x) {
  # Split each verse by spaces
  split_verse <- strsplit(x, split = " ")
  # Then return it as a vector
  return(unlist(split_verse))
})
The result should be a list of the verses, with each element consisting of a vector of each word in that verse. If you're not dealing with a single character string in the read-in object, show us the file and how you're reading it in ;).
To get it into the format you want, maybe give this a shot. Also, please update your post with more information so we can definitively solve your problem; a few areas of your question need clarification. Hope this helps.
## writeLines(text <- "*****
## The snow glows white on the mountain tonight
## Not a footprint to be seen.
## A kingdom of isolation,
## and it looks like I'm the Queen
## The wind is howling like this swirling storm inside
## Couldn't keep it in;
## Heaven knows I've tried
## *****
## Don't let them in,
## don't let them see
## Be the good girl you always have to be Conceal,
## don't feel,
## don't let them know
## Well now they know
## *****", "song.txt")
> read.song <- readLines("song.txt")
> split.song <- unlist(strsplit(read.song, "\\s"))
> star.index <- grep("\\*{5}", split.song)
> word.index <- sapply(2:length(star.index), function(i){
(star.index[i-1]+1):(star.index[i]-1)
})
> lapply(seq(word.index), function(i) split.song[ word.index[[i]] ])
## [[1]]
## [1] "The" "snow" "glows" "white" "on" "the" "mountain"
## [8] "tonight" "Not" "a" "footprint" "to" "be" "seen."
## [15] "A" "kingdom" "of" "isolation," "and" "it" "looks"
## [22] "like" "I'm" "the" "Queen" "The" "wind" "is"
## [29] "howling" "like" "this" "swirling" "storm" "inside" "Couldn't"
## [36] "keep" "it" "in;" "Heaven" "knows" "I've" "tried"
## [[2]]
## [1] "Don't" "let" "them" "in," "don't" "let" "them" "see" "Be"
## [10] "the" "good" "girl" "you" "always" "have" "to" "be" "Conceal,"
## [19] "don't" "feel," "don't" "let" "them" "know" "Well" "now" "they"
## [28] "know"

What 1-2 letter object names conflict with existing R objects?

To make my code more readable, I like to avoid names of objects that already exist when creating new objects. Because of the package-based nature of R, and because functions are first-class objects, it is easy to overwrite common functions that are not in base R: a package might use a short function name, and without knowing which packages to load there is no way to check for the conflict. Objects such as the built-in T and F (preset to TRUE and FALSE) also cause trouble.
Some examples that come to mind are:
One letter
c
t
T/F
J
Two letters
df
A better solution might be to avoid using short names altogether in favor of more descriptive ones, and I generally try to do that as a matter of habit. Yet "df" for a function which manipulates a generic data.frame is plenty descriptive and a longer name adds little, so short names have their uses. In addition, for SO questions where the larger context isn't necessarily known, coming up with descriptive names is well-nigh impossible.
What other one- and two-letter variable names conflict with existing R objects? Which among those are sufficiently common that they should be avoided? If they are not in base, please list the package as well. The best answers will involve at least some code; please provide it if used.
Note that I am not asking whether or not overwriting functions that already exist is advisable or not. That question is addressed on SO already:
In R, what exactly is the problem with having variables with the same name as base R functions?
For visualizations of some answers here, see this question on CV:
https://stats.stackexchange.com/questions/13999/visualizing-2-letter-combinations
apropos is ideal for this:
apropos("^[[:alpha:]]{1,2}$")
With no packages loaded, this returns:
[1] "ar" "as" "by" "c" "C" "cm" "D" "de" "df" "dt" "el" "F" "gc" "gl"
[15] "I" "if" "Im" "is" "lh" "lm" "ls" "pf" "pi" "pt" "q" "qf" "qr" "qt"
[29] "Re" "rf" "rm" "rt" "sd" "t" "T" "ts" "vi"
The exact contents will depend upon the search list. Try loading a few packages and re-running it if you care about conflicts with packages that you commonly use.
I loaded all the (>200) packages installed on my machine with this:
lapply(rownames(installed.packages()), require, character.only = TRUE)
And reran the call to apropos, wrapping it in unique, since there were a few duplicates.
one_or_two <- unique(apropos("^[[:alpha:]]{1,2}$"))
This returned:
[1] "Ad" "am" "ar" "as" "bc" "bd" "bp" "br" "BR" "bs" "by" "c" "C"
[14] "cc" "cd" "ch" "ci" "CJ" "ck" "Cl" "cm" "cn" "cq" "cs" "Cs" "cv"
[27] "d" "D" "dc" "dd" "de" "df" "dg" "dn" "do" "ds" "dt" "e" "E"
[40] "el" "ES" "F" "FF" "fn" "gc" "gl" "go" "H" "Hi" "hm" "I" "ic"
[53] "id" "ID" "if" "IJ" "Im" "In" "ip" "is" "J" "lh" "ll" "lm" "lo"
[66] "Lo" "ls" "lu" "m" "MH" "mn" "ms" "N" "nc" "nd" "nn" "ns" "on"
[79] "Op" "P" "pa" "pf" "pi" "Pi" "pm" "pp" "ps" "pt" "q" "qf" "qq"
[92] "qr" "qt" "r" "Re" "rf" "rk" "rl" "rm" "rt" "s" "sc" "sd" "SJ"
[105] "sn" "sp" "ss" "t" "T" "te" "tr" "ts" "tt" "tz" "ug" "UG" "UN"
[118] "V" "VA" "Vd" "vi" "Vo" "w" "W" "y"
You can see where they came from with
lapply(one_or_two, find)
Been thinking about this more. Here's a list of one-letter object names in base R:
> var.names <- c(letters,LETTERS)
> var.names[sapply(var.names,exists)]
[1] "c" "q" "t" "C" "D" "F" "I" "T" "X"
And one- and two-letter object names in base R:
one.letter.names <- c(letters,LETTERS)
N <- length(one.letter.names)
first <- rep(one.letter.names,N)
second <- rep(one.letter.names,each=N)
two.letter.names <- paste(first,second,sep="")
var.names <- c(one.letter.names,two.letter.names)
> var.names[sapply(var.names,exists)]
[1] "c" "d" "q" "t" "C" "D" "F" "I" "J" "N" "T" "X" "bc" "gc"
[15] "id" "sd" "de" "Re" "df" "if" "pf" "qf" "rf" "lh" "pi" "vi" "el" "gl"
[29] "ll" "cm" "lm" "rm" "Im" "sp" "qq" "ar" "qr" "tr" "as" "bs" "is" "ls"
[43] "ns" "ps" "ts" "dt" "pt" "qt" "rt" "tt" "by" "VA" "UN"
That's a much bigger list than I initially suspected, although I would never think of naming a variable "if", so to a certain degree it makes sense.
Still doesn't capture object names not in base, or give any sense of which functions are best avoided. I think a better answer would either use expert opinion to figure out which functions are important (e.g. using c is probably worse than using qf) or use a data mining approach on a bunch of R code to see what short-named functions get used the most.
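One crude way to act on all this is to check a candidate name before binding it. is_taken below is a hypothetical helper, not a standard function; it just wraps base R's exists, which searches the calling environment and the attached search path:

```r
# Hypothetical helper: does `name` already resolve to something visible
# from here (local environments plus the attached search path)?
is_taken <- function(name) exists(name)

is_taken("df")             # TRUE: stats::df, density of the F distribution
is_taken("my_data_frame")  # FALSE in a fresh session
```

This only sees packages that are currently attached, so it has the same blind spot discussed above: a package you haven't loaded yet can still introduce a conflict later.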