I have a song.txt file:
*****
[1]"The snow glows white on the mountain tonight
Not a footprint to be seen."
[2]"A kingdom of isolation,
and it looks like I'm the Queen"
[3]"The wind is howling like this swirling storm inside
Couldn't keep it in;
Heaven knows I've tried"
*****
[4]"Don't let them in,
don't let them see"
[5]"Be the good girl you always have to be
Conceal, don't feel,
don't let them know"
[6]"Well now they know"
*****
I would like to loop over the lyrics and fill a list so that each element of the list contains a character vector, where each element of the vector is a word in the song, like this:
[1] "The" "snow" "glows" "white" "on" "the" "mountain" "tonight" "Not" "a" "footprint"
"to" "be" "seen." "A" "kingdom" "of" "isolation," "and" "it" "looks" "like" "I'm" "the"
"Queen" "The" "wind" "is" "howling" "like" "this" "swirling" "storm" "inside"
"Couldn't" "keep" "it" "in" "Heaven" "knows" "I've" "tried"
[2]"Don't" "let" "them" "in,""don't" "let" "them" "see" "Be" "the" "good" "girl" "you"
"always" "have" "to" "be" "Conceal," "don't" "feel," "don't" "let" "them" "know"
"Well" "now" "they" "know"
First I made an empty list with words <- vector("list", 2).
I think that I should first read the text into one long character vector and find where the ***** delimiters start and stop, with
star="\\*{5}"
pindex = grep(star, page)
After this what should I do?
It sounds like what you want is strsplit, run (effectively) twice. So, starting from the point of "a single long character string separated by ***** and spaces" (which I assume is what you have?):
list_of_vectors <- lapply(strsplit(song, split = "\\*{5}")[[1]], function(x) {
#Split each verse by spaces
split_verse <- strsplit(x, split = " ")
#Then return it as a vector
return(unlist(split_verse))
})
The result should be a list of each verse, with each element consisting of a vector of each word in that verse. If you're not dealing with a single character string in the read-in object, show us the file and how you're reading it in ;).
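If you are starting from the file itself rather than a single string, one way to get there (a sketch, assuming the file matches the excerpt above and is read in with readLines) is to collapse the lines into one long string first:
# read the file and collapse all lines into one long character string
song <- paste(readLines("song.txt"), collapse = " ")
Any empty pieces produced by a leading or trailing ***** line can then be dropped from the split result before the word split, e.g. by keeping only the elements for which trimws() is non-empty.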
To get it into the format you want, maybe give this a shot. Also, please update your post with more information so we can definitively solve your problem. There are a few areas of your posted question that need some clarification. Hope this helps.
## writeLines(text <- "*****
## The snow glows white on the mountain tonight
## Not a footprint to be seen.
## A kingdom of isolation,
## and it looks like I'm the Queen
## The wind is howling like this swirling storm inside
## Couldn't keep it in;
## Heaven knows I've tried
## *****
## Don't let them in,
## don't let them see
## Be the good girl you always have to be Conceal,
## don't feel,
## don't let them know
## Well now they know
## *****", "song.txt")
> read.song <- readLines("song.txt")
> split.song <- unlist(strsplit(read.song, "\\s"))
> star.index <- grep("\\*{5}", split.song)
> word.index <- sapply(2:length(star.index), function(i){
(star.index[i-1]+1):(star.index[i]-1)
})
> lapply(seq(word.index), function(i) split.song[ word.index[[i]] ])
## [[1]]
## [1] "The" "snow" "glows" "white" "on" "the" "mountain"
## [8] "tonight" "Not" "a" "footprint" "to" "be" "seen."
## [15] "A" "kingdom" "of" "isolation," "and" "it" "looks"
## [22] "like" "I'm" "the" "Queen" "The" "wind" "is"
## [29] "howling" "like" "this" "swirling" "storm" "inside" "Couldn't"
## [36] "keep" "it" "in;" "Heaven" "knows" "I've" "tried"
## [[2]]
## [1] "Don't" "let" "them" "in," "don't" "let" "them" "see" "Be"
## [10] "the" "good" "girl" "you" "always" "have" "to" "be" "Conceal,"
## [19] "don't" "feel," "don't" "let" "them" "know" "Well" "now" "they"
## [28] "know"
Related
file_name <- 'I am a good boy who went to Africa, Brazil and India'
strsplit(file_name, ' ')
[[1]]
[1] "I" "am" "a" "good" "boy" "who" "went" "to" "Africa," "Brazil"
[11] "and" "India"
In the above implementation, I want to return all the strings individually. However, the function returns 'Africa,' as a single element, whereas I want the , returned separately as well.
The expected output should be as below, with the , appearing as a separate element:
[[1]]
[1] "I" "am" "a" "good" "boy" "who" "went" "to" "Africa" "," "Brazil"
[11] "and" "India"
Perhaps this helps
strsplit(file_name, '\\s+|(?<=[a-z])(?=[[:punct:]])', perl = TRUE)
#[[1]]
#[1] "I" "am" "a" "good" "boy" "who" "went"
#[8] "to" "Africa" "," "Brazil" "and" "India"
Or use an extraction method
regmatches(file_name, gregexpr("[[:alnum:]]+|,", file_name))
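Both suggestions are base R, so no extra package is needed. For the example string, the extraction call should return the comma as its own element (a sketch of the expected result; exact print wrapping may differ):
regmatches(file_name, gregexpr("[[:alnum:]]+|,", file_name))
#[[1]]
# [1] "I" "am" "a" "good" "boy" "who" "went" "to" "Africa" "," "Brazil" "and" "India"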
R Package: stringr::words
I want to know how many words in stringr::words are exactly three letters long after applying the following regular expression:
x <- str_view(words, "^...$", match = TRUE)
While the code was able to extract the words that are exactly three letters long, it does not tell me how many there are. So, I thought the length function would be appropriate for finding that number.
length(x)
The code returns 8, which cannot be right, as the output clearly shows more than 8 matching words.
What is the proper syntax to calculate the number of words after matching with the regular expression, in this case, x?
Also, can anyone explain to me why length(x) returns 8 in the above example?
Thank you in advance.
str_view returns an HTML widget that is meant for viewing the matches, not a character vector of them.
x <- str_view(words, "^...$", match = TRUE)
class(x)
#[1] "str_view" "htmlwidget"
The 8 components that you see are
names(x)
#[1] "x" "width" "height" "sizingPolicy" "dependencies"
#[6] "elementId" "preRenderHook" "jsHooks"
Instead of str_view, use str_subset:
library(stringr)
x <- str_subset(words, "^...$")
x
# [1] "act" "add" "age" "ago" "air" "all" "and" "any" "arm" "art" "ask" "bad" "bag"
# [14] "bar" "bed" "bet" "big" "bit" "box" "boy" "bus" "but" "buy" "can" "car" "cat"
# [27] "cup" "cut" "dad" "day" "die" "dog" "dry" "due" "eat" "egg" "end" "eye" "far"
# [40] "few" "fit" "fly" "for" "fun" "gas" "get" "god" "guy" "hit" "hot" "how" "job"
# [53] "key" "kid" "lad" "law" "lay" "leg" "let" "lie" "lot" "low" "man" "may" "mrs"
# [66] "new" "non" "not" "now" "odd" "off" "old" "one" "out" "own" "pay" "per" "put"
# [79] "red" "rid" "run" "say" "see" "set" "sex" "she" "sir" "sit" "six" "son" "sun"
# [92] "tax" "tea" "ten" "the" "tie" "too" "top" "try" "two" "use" "war" "way" "wee"
#[105] "who" "why" "win" "yes" "yet" "you"
length(x)
#[1] 110
Another option is str_count, illustrated here on a small sample vector:
library(stringr)
x <- c("abc", "abcd", "ab", "abc", "abcsd", "edf")
sum(str_count(x, "^...$"))
[1] 3
I'd suggest using grep with length:
length(grep("^.{3}$", words))
# => [1] 110
With grep, you actually get a subset of the words and length will return the count of the found matches.
stringr::str_view can be used to view an HTML rendering of the regular expression matches, but it does not actually return the matches themselves. Besides grep, you may use stringr::str_subset.
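If you only need the count, str_detect gives a logical vector that can be summed directly (a small sketch using the same three-letter pattern):
library(stringr)
# TRUE for every word that is exactly three characters long
sum(str_detect(words, "^.{3}$"))
# [1] 110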
I am trying to select a sub-item in a list where each list item is a different paragraph, for example:
> psw_list$p1
[1] "For" "more" "than" "five" "years," "William" "Sencion"
[8] "did" "the" "same" "task" "over" "and" "over."
[15] "He" "signed" "onto" "the" "New" "York" "City’s"
[22] "housing" "lottery" "site" "and" "applied" "for" "one"
[29] "of" "the" "city’s" "highly" "coveted," "below-market" "apartments."
[36] "Each" "time," "he" "got" "the" "same" "response:"
[43] "silence."
This paragraph is the first item in the list, and I'd like to go through it and select one word to change; however, when I use this code to try to do that, it doesn't actually replace the word in the list:
str_replace(psw_list$p1[5], psw_list$p1[5], 'coersion')
I've tried this with the sub() and gsub() functions as well, and neither changes the word in the list; they just return the 'substituted' word, but when I print the first list item it is exactly the same. Any advice would be appreciated.
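A minimal sketch of how to make the change stick, assuming you want to overwrite the fifth word of the first paragraph: str_replace (like sub and gsub) returns a modified copy rather than changing its input, so the result has to be assigned back into the list:
library(stringr)
# assign the returned value back into the list element to store the change
psw_list$p1[5] <- str_replace(psw_list$p1[5], psw_list$p1[5], 'coersion')
# or, since the whole word is being replaced anyway, overwrite it directly
psw_list$p1[5] <- 'coersion'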
Hi, I would like to delete the blanks (see no. 170 for one example) and the many others like it from the vector below. Any idea how I can go about it?
words2
[116] "been" "any" "reasonable" "cause" "for"
[121] "such" "apprehension" "Indeed" "the" "most"
[126] "ample" "evidence" "to" "the" "contrary"
[131] "has" "all" "the" "while" "existed"
[136] "and" "been" "open" "to" "their"
[141] "inspection" "It" "is" "found" "in"
[146] "nearly" "all" "the" "published" "speeches"
[151] "of" "him" "who" "now" "addresses"
[156] "you" "I" "do" "but" "quote"
[161] "from" "one" "of" "those" "speeches"
[166] "when" "I" "declare" "that" ""
[171] "I" "have" "no" "purpose" "directly"
[176] "or" "indirectly" "to" "interfere" "with"
[181] "the" "institution" "of" "slavery" "in"
[186] "the" "States" "where" "it" "exists"
[191] "I" "believe" "I" "have" "no"
If you had a vector x = c(1, 2, 3, 2, 1) and you wanted to remove all 2s, you might do this: x[x != 2]. Similarly, you have a vector words2 and you want to remove the blanks "", so you can do this: words2[words2 != ""].
Of course, to remove them from words2 and save the result, you need to use <- or = to overwrite words2, as in
words2 = words2[words2 != ""] ## remove blanks
words2 = words2[nchar(words2) > 0] ## keep only strings with more than 0 characters
## remove blank and "bad string" strings
words2 = words2[! words2 %in% c("", "bad string")]
Regex is useful if you are looking inside strings (e.g., remove strings that contain an "a"), or if you are using patterns (e.g., remove strings that have a number at the end). When you are looking for exact matches of a whole string, you don't need regex.
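For the pattern-based cases mentioned above, grepl works the same way as the exact-match subsetting (a sketch with illustrative patterns, not something your data necessarily needs):
words2 = words2[!grepl("a", words2)]      ## remove strings that contain an "a"
words2 = words2[!grepl("[0-9]$", words2)] ## remove strings that end with a digit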
I'm having a hard time extracting elements between a / and a blank space. I can do this when I have two delimiting characters like < and >, for instance, but the space is throwing me. I'd like the most efficient way to do this in base R, as this will be lapplied to thousands of vectors.
I'd like to turn this:
x <- "This/DT is/VBZ a/DT short/JJ sentence/NN consisting/VBG of/IN some/DT nouns,/JJ verbs,/NNS and/CC adjectives./VBG"
Into this:
[1] "DT" "VBZ" "DT" "JJ" "NN" "VBG" "IN" "DT" "JJ" "NNS" "CC" "VBG"
EDIT:
Thank you all for the answers. I'm going for speed, so Andres' code wins out. DWin's code wins for the shortest amount of code. Dirk, yours was the second fastest. The stringr solution was the slowest (I figured it would be) and isn't in base, but it is pretty understandable (which really is the intent of the stringr package, I think, as this seems to be Hadley's philosophy with most things).
I appreciate your assistance. Thanks again.
I thought I'd include the benchmarking since this will be lapplied over several thousand vectors:
test replications elapsed relative user.self sys.self
1 ANDRES 10000 1.06 1.000000 1.05 0
3 DIRK 10000 1.29 1.216981 1.20 0
2 DWIN 10000 1.56 1.471698 1.43 0
4 FLODEL 10000 8.46 7.981132 7.70 0
Similar but a bit more succinct:
#1- Separate the elements by the blank space
y=unlist(strsplit(x,' '))
#2- extract just what you want from each element:
sub('^.*/([^ ]+).*$','\\1',y)
Here ^ and $ are the beginning and end anchors respectively, .* matches any characters, [^ ]+ matches the non-blank characters after the last /, and \\1 refers back to the first captured group.
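The two steps can also be collapsed into a single expression (same logic, just nested); applied to the example x it returns the target vector from the question:
sub('^.*/([^ ]+).*$', '\\1', unlist(strsplit(x, ' ')))
#[1] "DT" "VBZ" "DT" "JJ" "NN" "VBG" "IN" "DT" "JJ" "NNS" "CC" "VBG"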
Use a regex pattern that matches a forward slash or whitespace:
strsplit(x, "/|\\s" )
[[1]]
[1] "This" "DT" "is" "VBZ" "a" "DT" "short"
[8] "JJ" "sentence" "NN" "consisting" "VBG" "of" "IN"
[15] "some" "DT" "nouns," "JJ" "verbs," "NNS" "and"
[22] "CC" "adjectives." "VBG"
Didn't read the Q closely enough. One could use that result to extract the even-numbered elements:
strsplit(x, "/|\\s")[[1]][seq(2, 24, by=2)]
[1] "DT" "VBZ" "DT" "JJ" "NN" "VBG" "IN" "DT" "JJ" "NNS" "CC" "VBG"
Here is a one-liner:
R> x <- paste("This/DT is/VBZ a/DT short/JJ sentence/NN consisting/VBG",
+             "of/IN some/DT nouns,/JJ verbs,/NNS and/CC adjectives./VBG")
R> matrix(do.call(c, strsplit(gsub("[a-zA-Z.,]*/", " ", x), " ")),
+ ncol=2, byrow=TRUE)[,2]
[1] "DT" "VBZ" "DT" "JJ" "NN" "VBG" "IN" "DT" "JJ" "NNS" "CC" "VBG"
R>
The key is to get rid of 'text before slash':
R> gsub("[a-zA-Z.,]*/", " ", x)
[1] " DT VBZ DT JJ NN VBG IN DT JJ NNS CC VBG"
R>
after which it is just a matter of splitting the string
R> strsplit(gsub("[a-zA-Z.,]*/", " ", x), " ")
[[1]]
[1] "" "DT" "" "VBZ" "" "DT" "" "JJ" "" "NN"
[11] "" "VBG" "" "IN" "" "DT" "" "JJ" "" "NNS"
[21] "" "CC" "" "VBG"
and filtering the "". There may well be more compact ways for the last bit.
R>
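One compact way to do that filtering, in the same transcript style (a sketch; the empty strings are dropped by logical subsetting):
R> s <- strsplit(gsub("[a-zA-Z.,]*/", " ", x), " ")[[1]]
R> s[s != ""]
[1] "DT" "VBZ" "DT" "JJ" "NN" "VBG" "IN" "DT" "JJ" "NNS" "CC" "VBG"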
The stringr package has nice functions for working with strings, with very intuitive names. Here you can use str_extract_all to get all matches (including the leading slash), then str_sub to remove the slashes:
str_extract_all(x, "/\\w*")
# [[1]]
# [1] "/DT" "/VBZ" "/DT" "/JJ" "/NN" "/VBG" "/IN" "/DT" "/JJ" "/NNS"
# [11] "/CC" "/VBG"
str_sub(str_extract_all(x, "/\\w*")[[1]], start = 2)
# [1] "DT" "VBZ" "DT" "JJ" "NN" "VBG" "IN" "DT" "JJ" "NNS" "CC" "VBG"