I'm having a hard time extracting elements between a / and a blank space. I can do this when I have two delimiting characters, like < and > for instance, but the space is throwing me. I'd like the most efficient way to do this in base R, as this will be lapplied to thousands of vectors.
I'd like to turn this:
x <- "This/DT is/VBZ a/DT short/JJ sentence/NN consisting/VBG of/IN some/DT nouns,/JJ verbs,/NNS and/CC adjectives./VBG"
Into this:
[1] "DT" "VBZ" "DT" "JJ" "NN" "VBG" "IN" "DT" "JJ" "NNS" "CC" "VBG"
EDIT:
Thank you all for the answers. I'm going for speed, so Andres's code wins out. DWin's code wins for the shortest amount of code. Dirk, yours was the second fastest. The stringr solution was the slowest (I figured it would be) and isn't base R, but it is pretty understandable (which really is the intent of the stringr package, I think, as that seems to be Hadley's philosophy with most things).
I appreciate your assistance. Thanks again.
I thought I'd include the benchmarking since this will be lapplied over several thousand vectors:
test replications elapsed relative user.self sys.self
1 ANDRES 10000 1.06 1.000000 1.05 0
3 DIRK 10000 1.29 1.216981 1.20 0
2 DWIN 10000 1.56 1.471698 1.43 0
4 FLODEL 10000 8.46 7.981132 7.70 0
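The original benchmarking code was not posted, but the columns above look like output from the rbenchmark package. A hedged sketch of how such a comparison might be set up (the wrapper names ANDRES, DWIN, DIRK and FLODEL are placeholders for the answers below, not the OP's code):
# a sketch only -- assumes x from above and the rbenchmark + stringr packages
library(rbenchmark)
ANDRES <- function() sub('^.*/([^ ]+).*$', '\\1', unlist(strsplit(x, ' ')))
DWIN   <- function() { y <- strsplit(x, "/|\\s")[[1]]; y[seq(2, length(y), by = 2)] }
DIRK   <- function() matrix(do.call(c, strsplit(gsub("[a-zA-Z.,]*/", " ", x), " ")),
                            ncol = 2, byrow = TRUE)[, 2]
FLODEL <- function() stringr::str_sub(stringr::str_extract_all(x, "/\\w*")[[1]], start = 2)
benchmark(ANDRES(), DWIN(), DIRK(), FLODEL(), replications = 10000)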
Similar but a bit more succinct:
# 1 - Separate the elements by the blank space
y <- unlist(strsplit(x, ' '))
# 2 - Extract just what you want from each element:
sub('^.*/([^ ]+).*$', '\\1', y)
Here ^ and $ are the beginning and end anchors respectively, .* matches any characters, [^ ]+ takes the non-blank characters, and \\1 is a back-reference to the first captured group.
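Putting the two steps together as a small helper (a sketch; extract_tags is a hypothetical name, not part of the original answer):
# a minimal sketch wrapping the two steps above
extract_tags <- function(x) {
  y <- unlist(strsplit(x, ' '))     # 1: split on the blank space
  sub('^.*/([^ ]+).*$', '\\1', y)   # 2: keep the part after the slash
}
extract_tags(x)
# [1] "DT"  "VBZ" "DT"  "JJ"  "NN"  "VBG" "IN"  "DT"  "JJ"  "NNS" "CC"  "VBG"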
Use a regex pattern that matches a forward slash or a space:
strsplit(x, "/|\\s" )
[[1]]
[1] "This" "DT" "is" "VBZ" "a" "DT" "short"
[8] "JJ" "sentence" "NN" "consisting" "VBG" "of" "IN"
[15] "some" "DT" "nouns," "JJ" "verbs," "NNS" "and"
[22] "CC" "adjectives." "VBG"
Didn't read the Q closely enough. One could use that result to extract the even-numbered elements:
strsplit(x, "/|\\s")[[1]][seq(2, 24, by=2)]
[1] "DT" "VBZ" "DT" "JJ" "NN" "VBG" "IN" "DT" "JJ" "NNS" "CC" "VBG"
Here is a one-liner:
R> x <- paste("This/DT is/VBZ a/DT short/JJ sentence/NN consisting/VBG",
+             "of/IN some/DT nouns,/JJ verbs,/NNS and/CC adjectives./VBG")
R> matrix(do.call(c, strsplit(gsub("[a-zA-Z.,]*/", " ", x), " ")),
+ ncol=2, byrow=TRUE)[,2]
[1] "DT" "VBZ" "DT" "JJ" "NN" "VBG" "IN" "DT" "JJ" "NNS" "CC" "VBG"
R>
The key is to get rid of 'text before slash':
R> gsub("[a-zA-Z.,]*/", " ", x)
[1] " DT VBZ DT JJ NN VBG IN DT JJ NNS CC VBG"
R>
after which it is just a matter of splitting the string
R> strsplit(gsub("[a-zA-Z.,]*/", " ", x), " ")
[[1]]
[1] "" "DT" "" "VBZ" "" "DT" "" "JJ" "" "NN"
[11] "" "VBG" "" "IN" "" "DT" "" "JJ" "" "NNS"
[21] "" "CC" "" "VBG"
and filtering out the "" entries. There may well be more compact ways for the last bit.
R>
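One way to do that last filtering step, using nzchar to keep only the non-empty strings (a sketch, not necessarily the most compact):
# a sketch of the filtering step; nzchar() is TRUE for non-empty strings
Filter(nzchar, strsplit(gsub("[a-zA-Z.,]*/", " ", x), " ")[[1]])
# [1] "DT"  "VBZ" "DT"  "JJ"  "NN"  "VBG" "IN"  "DT"  "JJ"  "NNS" "CC"  "VBG"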
The stringr package has nice functions for working with strings, with very intuitive names. Here you can use str_extract_all to get all matches (including the leading slash), then str_sub to remove the slashes:
str_extract_all(x, "/\\w*")
# [[1]]
# [1] "/DT" "/VBZ" "/DT" "/JJ" "/NN" "/VBG" "/IN" "/DT" "/JJ" "/NNS"
# [11] "/CC" "/VBG"
str_sub(str_extract_all(x, "/\\w*")[[1]], start = 2)
# [1] "DT" "VBZ" "DT" "JJ" "NN" "VBG" "IN" "DT" "JJ" "NNS" "CC" "VBG"
Related
I have strings like these:
turns <- c("does him good to stir him up now and again .",
"when , when I see him he w's on the settees .",
"yes it 's been eery for a long time .",
"blissful timing , indeed it was ")
What I'm trying to do is extract those words that have exactly one vowel. I do get the correct result with this:
library(stringr)
str_extract_all(turns, "\\b[b-df-hj-np-tv-z]*[aeiou][b-df-hj-np-tv-z]*\\b")
[[1]]
[1] "him" "to" "stir" "him" "up" "now" "and"
[[2]]
[1] "when" "when" "i" "him" "he" "on" "the"
[[3]]
[1] "yes" "it" "for" "a" "long"
[[4]]
[1] "it" "was"
However, it feels cumbersome to define a consonant class. Is there a more elegant and more concise way?
We can use str_count on the words after splitting 'turns' at the spaces:
library(stringr)
lapply(strsplit(turns, "\\s+"), function(x) x[str_count(x, '[aeiou]') == 1])
-output
#[[1]]
#[1] "him" "to" "stir" "him" "up" "now" "and"
#[[2]]
#[1] "when" "when" "him" "he" "on" "the"
#[[3]]
#[1] "yes" "it" "for" "a" "long"
#[[4]]
#[1] "it" "was"
You can use a PCRE regex with character classes containing double negation:
turns <- c("does him good to stir him up now and again .",
"when , when I see him he w's on the settees .",
"yes it 's been eery for a long time .",
"blissful timing , indeed it was ")
rx <- "\\b[^[:^alpha:]aeiou]*[aeiou][^[:^alpha:]aeiou]*\\b"
regmatches(turns, gregexpr(rx, turns, perl=TRUE, ignore.case=TRUE))
See the R demo online. The result is as in the question.
See the regex demo. Details:
\b - word boundary
[^[:^alpha:]aeiou]* - zero or more chars other than letters and aeiou chars
[aeiou] - a vowel
[^[:^alpha:]aeiou]* - zero or more chars other than letters and aeiou chars
\b - word boundary.
An equivalent expression:
(?i)\b[^\P{L}aeiou]*[aeiou][^\P{L}aeiou]*\b
See this regex demo. \P{L} matches any char but a letter. (?i) is the equivalent of ignore.case=TRUE.
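For completeness, the (?i) form can be used directly from base R as well (a sketch mirroring the call above):
# a sketch: inline (?i) flag instead of ignore.case = TRUE
rx2 <- "(?i)\\b[^\\P{L}aeiou]*[aeiou][^\\P{L}aeiou]*\\b"
regmatches(turns, gregexpr(rx2, turns, perl = TRUE))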
Here is a base R option using strsplit + nchar + gsub
lapply(
strsplit(turns, "\\s"),
function(v) v[nchar(gsub("[^aeiou]", "", v)) == 1]
)
which gives
[[1]]
[1] "him" "to" "stir" "him" "up" "now" "and"
[[2]]
[1] "when" "when" "him" "he" "on" "the"
[[3]]
[1] "yes" "it" "for" "a" "long"
[[4]]
[1] "it" "was"
I have a function that was suggested by a user as an answer to my previous question:
word_string <- function(x) {
  inds  <- seq_len(nchar(x))
  start <- inds[-length(inds)]
  stop  <- inds[-1]
  substring(x, start, stop)
}
The function works as expected and breaks a given word down into its component parts as per my specifications:
word_string('microwave')
[1] "mi" "ic" "cr" "ro" "ow" "wa" "av" "ve"
What I now want to be able to do is have the function applied to all rows of a specified column in a dataframe.
Here's a dataframe for purposes of illustration:
word <- c("House", "Motorcar", "Boat", "Dog", "Tree", "Drink")
some_value <- c("2","100","16","999", "65","1000000")
my_df <- data.frame(word, some_value, stringsAsFactors = FALSE )
my_df
word some_value
1 House 2
2 Motorcar 100
3 Boat 16
4 Dog 999
5 Tree 65
6 Drink 1000000
Now, if I use lapply to apply the function to my dataframe, not only do I get incorrect results but also a warning message.
lapply(my_df['word'], word_string)
$word
[1] "Ho" "ot" "at" "" "Tr" "ri"
Warning message:
In seq_len(nchar(x)) : first element used of 'length.out' argument
So you can see that the function is being applied, but it's being applied such that it's evaluating each row partially.
The desired output would be something like:
[1] "ho" "ou" "us" "se
[2] "mo" "ot" "to" "or" "rc" "ca" "ar"
[3] "bo" "oa" "at"
[4] "do" "og"
[5] "tr" "re" "ee"
[6] "dr" "ri" "in" "nk"
Any guidance greatly appreciated.
The reason is that [ (without a comma) still returns a data.frame with one column, so the unit lapply works on here is that single column.
str(my_df['word'])
# 'data.frame': 6 obs. of 1 variable:
#  $ word: chr "House" "Motorcar" "Boat" "Dog" ...
The lapply loops over that single column instead of each of the elements in that column.
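A quick check makes the difference visible (illustrative only):
# what lapply is actually iterating over in each case
length(my_df['word'])    # 1 -- a one-column data.frame, so a single iteration
length(my_df[['word']])  # 6 -- the character vector itself, one word per element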
We need either $ or [[:
lapply(my_df[['word']], word_string)
#[[1]]
#[1] "Ho" "ou" "us" "se"
#[[2]]
#[1] "Mo" "ot" "to" "or" "rc" "ca" "ar"
#[[3]]
#[1] "Bo" "oa" "at"
#[[4]]
#[1] "Do" "og"
#[[5]]
#[1] "Tr" "re" "ee"
#[[6]]
#[1] "Dr" "ri" "in" "nk"
How can I split a string in R in the following way? Look at the example, please.
For example, turning "example" into:
c("ex", "xa", "am", "mp", "pl", "le") ?
x = "example"
substring(x, first = 1:(nchar(x) - 1), last = 2:nchar(x))
# [1] "ex" "xa" "am" "mp" "pl" "le"
You could, of course, wrap it into a function, maybe omit non-letters (I don't know if the colon was supposed to be part of your string or not), etc.
To do this to a vector of strings, you can use it as an anonymous function with lapply:
lapply(month.name, function(x) substring(x, first = 1:(nchar(x) - 1), last = 2:nchar(x)))
# [[1]]
# [1] "Ja" "an" "nu" "ua" "ar" "ry"
#
# [[2]]
# [1] "Fe" "eb" "br" "ru" "ua" "ar" "ry"
#
# [[3]]
# [1] "Ma" "ar" "rc" "ch"
# ...
Or make it into a named function and use it by name. This would make sense if you'll use it somewhat frequently.
str_split_pairs = function(x) {
substring(x, first = 1:(nchar(x) - 1), last = 2:nchar(x))
}
lapply(month.name, str_split_pairs)
## same result as above
Here's another option (though it's slower than @Gregor's answer). Note that lead() comes from dplyr:
library(dplyr)  # for lead()
x <- c("example", "stackoverflow", "programming")
lapply(x, function(i) {
  i <- unlist(strsplit(i, ""))
  paste0(i, lead(i))[-length(i)]
})
[[1]]
[1] "ex" "xa" "am" "mp" "pl" "le"
[[2]]
[1] "st" "ta" "ac" "ck" "ko" "ov" "ve" "er" "rf" "fl" "lo" "ow"
[[3]]
[1] "pr" "ro" "og" "gr" "ra" "am" "mm" "mi" "in" "ng"
I am facing an issue in R where I want to split strings on commas and then further split on semicolons, but only keep the first item before each semicolon (i.e. "ee" and "jj" below). I have tried a bunch of things, but nested lists seem too convoluted!
Here's what I am doing:
d <- c("aa,bb,cc,dd,ee;e,ff",
"gg,hh,ii,jj;j")
e=strsplit(d,",")
myfun2 <- function(x,arg1) {
strsplit(x,";")
}
f=lapply(e,myfun2)
f=
[[1]]
[[1]][[1]]
[1] "aa"
[[1]][[2]]
[1] "bb"
[[1]][[3]]
[1] "cc"
[[1]][[4]]
[1] "dd"
[[1]][[5]]
[1] "ee" "e"
[[1]][[6]]
[1] "ff"
[[2]]
[[2]][[1]]
[1] "gg"
[[2]][[2]]
[1] "hh"
[[2]][[3]]
[1] "ii"
[[2]][[4]]
[1] "jj" "j"
Here's the output that I want
Correct output=
[[1]]
[1] "aa" "bb" "cc" "dd" "ee" "ff"
[[2]]
[1] "gg" "hh" "ii" "jj"
I have tried a bunch of things using lapply to the nested list "f" and used "[[" and "[" but with no success.
Any help is greatly appreciated. (I know that I am missing something silly, but just can't figure it out right now!)
This is your code
d <- c("aa,bb,cc,dd,ee;e,ff", "gg,hh,ii,jj;j")
e <- strsplit(d,",")
myfun2 <- function(x,arg1) { strsplit(x,";") }
f <- lapply(e,myfun2)
If we start from your f, then the next step would be
lapply(f,function(x) mapply(`[`,x,1))
[[1]]
[1] "aa" "bb" "cc" "dd" "ee" "ff"
[[2]]
[1] "gg" "hh" "ii" "jj"
Basically, you need an inner and outer type apply function to go down the two levels of nesting.
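Written out directly, the same inner/outer pattern without building f first might look like this (a sketch):
# a sketch: outer lapply over the comma-split, inner sapply takes the first
# piece before any semicolon
lapply(strsplit(d, ","), function(x) sapply(strsplit(x, ";"), `[`, 1))
# [[1]]
# [1] "aa" "bb" "cc" "dd" "ee" "ff"
# [[2]]
# [1] "gg" "hh" "ii" "jj"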
We can use gsub to match the pattern ; followed by one or more alphabetic characters, replace it with '', and then split (strsplit) on ,.
strsplit(gsub(';[a-z]+', '', d), ',')
#[[1]]
#[1] "aa" "bb" "cc" "dd" "ee" "ff"
#[[2]]
#[1] "gg" "hh" "ii" "jj"
I have a song.txt file:
*****
[1]"The snow glows white on the mountain tonight
Not a footprint to be seen."
[2]"A kingdom of isolation,
and it looks like I'm the Queen"
[3]"The wind is howling like this swirling storm inside
Couldn't keep it in;
Heaven knows I've tried"
*****
[4]"Don't let them in,
don't let them see"
[5]"Be the good girl you always have to be
Conceal, don't feel,
don't let them know"
[6]"Well now they know"
*****
I would like to loop over the lyrics and fill in the elements of a list so that each element of the list contains a character vector, where each element of the vector is a word in the song, like this:
[1] "The" "snow" "glows" "white" "on" "the" "mountain" "tonight" "Not" "a" "footprint"
"to" "be" "seen." "A" "kingdom" "of" "isolation," "and" "it" "looks" "like" "I'm" "the"
"Queen" "The" "wind" "is" "howling" "like" "this" "swirling" "storm" "inside"
"Couldn't" "keep" "it" "in" "Heaven" "knows" "I've" "tried"
[2]"Don't" "let" "them" "in,""don't" "let" "them" "see" "Be" "the" "good" "girl" "you"
"always" "have" "to" "be" "Conceal," "don't" "feel," "don't" "let" "them" "know"
"Well" "now" "they" "know"
First I made an empty list with words <- vector("list", 2).
I think that I should first put the text into one long character vector and find where the ***** delimiters start and stop, with:
star="\\*{5}"
pindex = grep(star, page)
After this what should I do?
It sounds like what you want is strsplit, run (effectively) twice. So, starting from the point of "a single long character string separated by ***** and spaces" (which I assume is what you have?):
list_of_vectors <- lapply(strsplit(song, split = "\\*{5}"), function(x) {
  # Split each verse by spaces
  split_verse <- strsplit(x, split = " ")
  # Then return it as a vector
  return(unlist(split_verse))
})
The result should be a list of each verse, with each element consisting of a vector of each word in that verse. If you're not dealing with a single character string in the read-in object, show us the file and how you're reading it in ;).
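If the lyrics are read from song.txt with readLines rather than already being one long string, collapsing and then splitting per verse might look like this (a sketch, assuming the file layout quoted in the question):
# a sketch only -- assumes "song.txt" looks like the block quoted above
song   <- paste(readLines("song.txt"), collapse = " ")
verses <- strsplit(song, "\\*{5}")[[1]]
verses <- verses[nzchar(trimws(verses))]   # drop the empty pieces around the delimiters
lapply(strsplit(verses, " +"), function(w) w[nzchar(w)])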
To get it into the format you want, maybe give this a shot. Also, please update your post with more information so we can definitively solve your problem. There are a few areas of your posted question that need some clarification. Hope this helps.
## writeLines(text <- "*****
## The snow glows white on the mountain tonight
## Not a footprint to be seen.
## A kingdom of isolation,
## and it looks like I'm the Queen
## The wind is howling like this swirling storm inside
## Couldn't keep it in;
## Heaven knows I've tried
## *****
## Don't let them in,
## don't let them see
## Be the good girl you always have to be Conceal,
## don't feel,
## don't let them know
## Well now they know
## *****", "song.txt")
> read.song <- readLines("song.txt")
> split.song <- unlist(strsplit(read.song, "\\s"))
> star.index <- grep("\\*{5}", split.song)
> word.index <- sapply(2:length(star.index), function(i){
(star.index[i-1]+1):(star.index[i]-1)
})
> lapply(seq(word.index), function(i) split.song[ word.index[[i]] ])
## [[1]]
## [1] "The" "snow" "glows" "white" "on" "the" "mountain"
## [8] "tonight" "Not" "a" "footprint" "to" "be" "seen."
## [15] "A" "kingdom" "of" "isolation," "and" "it" "looks"
## [22] "like" "I'm" "the" "Queen" "The" "wind" "is"
## [29] "howling" "like" "this" "swirling" "storm" "inside" "Couldn't"
## [36] "keep" "it" "in;" "Heaven" "knows" "I've" "tried"
## [[2]]
## [1] "Don't" "let" "them" "in," "don't" "let" "them" "see" "Be"
## [10] "the" "good" "girl" "you" "always" "have" "to" "be" "Conceal,"
## [19] "don't" "feel," "don't" "let" "them" "know" "Well" "now" "they"
## [28] "know"