Error using lapply to pass dataframe variable through custom function - r

I have a function that was suggested by a user as an answer to my previous question:
word_string <- function(x) {
inds <- seq_len(nchar(x))
start = inds[-length(inds)]
stop = inds[-1]
substring(x, start, stop)
}
The function works as expected and breaks down a given word into component parts as per my specifications:
word_string('microwave')
[1] "mi" "ic" "cr" "ro" "ow" "wa" "av" "ve"
What I now want to be able to do is have the function applied to all rows of a specified column in a dataframe.
Here's a dataframe for purposes of illustration:
word <- c("House", "Motorcar", "Boat", "Dog", "Tree", "Drink")
some_value <- c("2","100","16","999", "65","1000000")
my_df <- data.frame(word, some_value, stringsAsFactors = FALSE )
my_df
word some_value
1 House 2
2 Motorcar 100
3 Boat 16
4 Dog 999
5 Tree 65
6 Drink 1000000
Now, if I use lapply to apply the function to my dataframe, not only do I get incorrect results but also a warning message.
lapply(my_df['word'], word_string)
$word
[1] "Ho" "ot" "at" "" "Tr" "ri"
Warning message:
In seq_len(nchar(x)) : first element used of 'length.out' argument
So you can see that the function is being applied, but it's being applied such that it's evaluating each row partially.
The desired output would be something like:
[1] "ho" "ou" "us" "se"
[2] "mo" "ot" "to" "or" "rc" "ca" "ar"
[3] "bo" "oa" "at"
[4] "do" "og"
[5] "tr" "re" "ee"
[6] "dr" "ri" "in" "nk"
Any guidance greatly appreciated.

The reason is that [ (without a ,) still returns a data.frame with one column, so here the unit being looped over is the single column itself.
str(my_df['word'])
'data.frame': 6 obs. of 1 variable:
# $ word: chr "House" "Motorcar" "Boat" "Dog" ...
The lapply loops over that single column instead of each of the elements in that column.
We need either $ or [[ to extract the underlying character vector:
lapply(my_df[['word']], word_string)
#[[1]]
#[1] "Ho" "ou" "us" "se"
#[[2]]
#[1] "Mo" "ot" "to" "or" "rc" "ca" "ar"
#[[3]]
#[1] "Bo" "oa" "at"
#[[4]]
#[1] "Do" "og"
#[[5]]
#[1] "Tr" "re" "ee"
#[[6]]
#[1] "Dr" "ri" "in" "nk"
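A minimal sketch (restating the word_string function above) showing that $ extracts the same underlying vector, and that the per-word results can be kept alongside the dataframe as a list-column:

```r
word_string <- function(x) {
  inds <- seq_len(nchar(x))
  substring(x, inds[-length(inds)], inds[-1])
}

my_df <- data.frame(word = c("House", "Boat"), stringsAsFactors = FALSE)

# $ also returns the bare character vector, so lapply iterates
# over each word rather than over the whole column at once
pairs <- lapply(my_df$word, word_string)
pairs[[2]]
# [1] "Bo" "oa" "at"

# a list-column keeps each word's pairs next to its row
my_df$pairs <- pairs
```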

Related

split string including punctuations in R

file_name <- 'I am a good boy who went to Africa, Brazil and India'
strsplit(file_name, ' ')
[[1]]
[1] "I" "am" "a" "good" "boy" "who" "went" "to" "Africa," "Brazil"
[11] "and" "India"
In the above implementation, I want all the strings returned individually. However, the function returns 'Africa,' as a single entity, whereas I want the , returned separately as well.
The expected output should be as follows, with the , appearing as a separate element:
[[1]]
[1] "I" "am" "a" "good" "boy" "who" "went" "to" "Africa" "," "Brazil"
[11] "and" "India"
Perhaps this helps
strsplit(file_name, '\\s+|(?<=[a-z])(?=[[:punct:]])', perl = TRUE)
#[[1]]
#[1] "I" "am" "a" "good" "boy" "who" "went"
#[8] "to" "Africa" "," "Brazil" "and" "India"
Or use an extraction method
regmatches(file_name, gregexpr("[[:alnum:]]+|,", file_name))
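For reference, running the extraction approach on the same file_name gives the tokens directly (a quick sketch):

```r
file_name <- 'I am a good boy who went to Africa, Brazil and India'
# match either a run of alphanumerics or a literal comma
tokens <- regmatches(file_name, gregexpr("[[:alnum:]]+|,", file_name))[[1]]
tokens
# [1] "I"      "am"     "a"      "good"   "boy"    "who"    "went"
# [8] "to"     "Africa" ","      "Brazil" "and"    "India"
```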

What is the syntax in R for returning the number of words matched in regular expression?

R Package: stringr::words
I want to know the number of words that are exactly three letters long in the stringr::words file after applying the following regular expression:
x <- str_view(words, "^...$", match = TRUE)
While the code was able to extract words that are exactly three letters long, it does not tell me how many words there are. So, I thought the length function would be appropriate for finding the number.
length(x)
The code returns 8, which cannot be correct, as x clearly contains more than 8 words.
What is the proper syntax to calculate the number of words after matching with the regular expression, in this case, x?
Also, can anyone explain to me why length(x) returns 8 in the above example?
Thank you in advance.
str_view returns an HTML object which is used for viewing.
x <- str_view(words, "^...$", match = TRUE)
class(x)
#[1] "str_view" "htmlwidget"
The 8 components that you see are
names(x)
#[1] "x" "width" "height" "sizingPolicy" "dependencies"
#[6] "elementId" "preRenderHook" "jsHooks"
Instead of str_view use str_subset :
library(stringr)
x <- str_subset(words, "^...$")
x
# [1] "act" "add" "age" "ago" "air" "all" "and" "any" "arm" "art" "ask" "bad" "bag"
# [14] "bar" "bed" "bet" "big" "bit" "box" "boy" "bus" "but" "buy" "can" "car" "cat"
# [27] "cup" "cut" "dad" "day" "die" "dog" "dry" "due" "eat" "egg" "end" "eye" "far"
# [40] "few" "fit" "fly" "for" "fun" "gas" "get" "god" "guy" "hit" "hot" "how" "job"
# [53] "key" "kid" "lad" "law" "lay" "leg" "let" "lie" "lot" "low" "man" "may" "mrs"
# [66] "new" "non" "not" "now" "odd" "off" "old" "one" "out" "own" "pay" "per" "put"
# [79] "red" "rid" "run" "say" "see" "set" "sex" "she" "sir" "sit" "six" "son" "sun"
# [92] "tax" "tea" "ten" "the" "tie" "too" "top" "try" "two" "use" "war" "way" "wee"
#[105] "who" "why" "win" "yes" "yet" "you"
length(x)
#[1] 110
Another option is str_count, illustrated here on a small example vector:
library(stringr)
x <- c("abc", "abcd", "ab", "abc", "abcsd", "edf")
sum(str_count(x, "^...$"))
#[1] 3
I'd suggest using grep with length:
length(grep("^.{3}$", words))
# => [1] 110
With grep, you actually get a subset of the words and length will return the count of the found matches.
stringr::str_view can be used to view HTML rendering of regular expression match, and it does not actually return the list of matches. Beside grep, you may use stringr::str_subset.
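As a related sketch: grepl returns a logical vector, so its sum counts matches directly without materializing the subset (shown here on a small hypothetical vector rather than stringr::words):

```r
w <- c("act", "tree", "dog", "a", "sun")  # hypothetical sample vector
# grepl gives TRUE/FALSE per element; summing counts the matches
sum(grepl("^.{3}$", w))
# [1] 3
```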

Dictionary of words separated by their length

I have a character vector of words like so:
chr "ABC" "ABM" "AG" "AGB" "AGP" "AD"
I would like to convert it into a list (dictionary) of character vectors (of words), divided by length:
$ : NULL
$ : chr [1:2] "AD" "AG"
$ : chr [1:4] "ABC" "ABM" "AGB" "AGP"
You can use split:
split(words, nchar(words)) # split the words vector by the number of characters
# $`2`
# [1] "AG" "AD"
# $`3`
# [1] "ABC" "ABM" "AGB" "AGP"
Data:
words <- c("ABC", "ABM", "AG", "AGB", "AGP", "AD")
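If an entry is wanted for every length up to the maximum (including empty ones, as the desired output above suggests), one possible sketch is to split against a factor with explicit levels:

```r
words <- c("ABC", "ABM", "AG", "AGB", "AGP", "AD")
# explicit factor levels force an entry for every length 1..max,
# even lengths with no words (those come back as character(0))
by_len <- split(words, factor(nchar(words), levels = 1:max(nchar(words))))
by_len[["1"]]
# character(0)
by_len[["3"]]
# [1] "ABC" "ABM" "AGB" "AGP"
```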

R, split string to pairs of character

How can I split a string in R into pairs of adjacent characters? For example, "example" should become:
c("ex", "xa", "am", "mp", "pl", "le")
x = "example"
substring(x, first = 1:(nchar(x) - 1), last = 2:nchar(x))
# [1] "ex" "xa" "am" "mp" "pl" "le"
You could, of course, wrap it into a function, maybe omit non-letters (I don't know if the colon was supposed to be part of your string or not), etc.
To do this to a vector of strings, you can use it as an anonymous function with lapply:
lapply(month.name, function(x) substring(x, first = 1:(nchar(x) - 1), last = 2:nchar(x)))
# [[1]]
# [1] "Ja" "an" "nu" "ua" "ar" "ry"
#
# [[2]]
# [1] "Fe" "eb" "br" "ru" "ua" "ar" "ry"
#
# [[3]]
# [1] "Ma" "ar" "rc" "ch"
# ...
Or make it into a named function and use it by name. This would make sense if you'll use it somewhat frequently.
str_split_pairs = function(x) {
substring(x, first = 1:(nchar(x) - 1), last = 2:nchar(x))
}
lapply(month.name, str_split_pairs)
## same result as above
Here's another option, using dplyr::lead (though it's slower than @Gregor's answer):
library(dplyr)
x = c("example", "stackoverflow", "programming")
lapply(x, function(i) {
i = unlist(strsplit(i, ""))
paste0(i, lead(i))[-length(i)]
})
[[1]]
[1] "ex" "xa" "am" "mp" "pl" "le"
[[2]]
[1] "st" "ta" "ac" "ck" "ko" "ov" "ve" "er" "rf" "fl" "lo" "ow"
[[3]]
[1] "pr" "ro" "og" "gr" "ra" "am" "mm" "mi" "in" "ng"

Extract elements between a character and space

I'm having a hard time extracting elements between a / and a blank space. I can do this when I have two delimiting characters like < and >, for instance, but the space is throwing me. I'd like the most efficient way to do this in base R, as this will be lapplied to thousands of vectors.
I'd like to turn this:
x <- "This/DT is/VBZ a/DT short/JJ sentence/NN consisting/VBG of/IN some/DT nouns,/JJ verbs,/NNS and/CC adjectives./VBG"
This:
[1] "DT" "VBZ" "DT" "JJ" "NN" "VBG" "IN" "DT" "JJ" "NNS" "CC" "VBG"
EDIT:
Thank you all for the answers. I'm going for speed, so Andres' code wins out. DWin's code wins for the shortest amount of code. Dirk, yours was the second fastest. The stringr solution was the slowest (I figured it would be) and isn't base R, but it is very understandable (which really is the intent of the stringr package, I think, as this seems to be Hadley's philosophy with most things).
I appreciate your assistance. Thanks again.
I thought I'd include the benchmarking since this will be lapplied over several thousand vectors:
test replications elapsed relative user.self sys.self
1 ANDRES 10000 1.06 1.000000 1.05 0
3 DIRK 10000 1.29 1.216981 1.20 0
2 DWIN 10000 1.56 1.471698 1.43 0
4 FLODEL 10000 8.46 7.981132 7.70 0
Similar but a bit more succinct:
#1- Separate the elements by the blank space
y=unlist(strsplit(x,' '))
#2- extract just what you want from each element:
sub('^.*/([^ ]+).*$','\\1',y)
where ^ and $ are the beginning and end anchors respectively, .* matches any characters, [^ ]+ captures the run of non-blank characters, and \\1 refers back to the first captured group.
Use regex pattern that is fwd-slash or space:
strsplit(x, "/|\\s" )
[[1]]
[1] "This" "DT" "is" "VBZ" "a" "DT" "short"
[8] "JJ" "sentence" "NN" "consisting" "VBG" "of" "IN"
[15] "some" "DT" "nouns," "JJ" "verbs," "NNS" "and"
[22] "CC" "adjectives." "VBG"
Didn't read the Q closely enough. One could use that result to extract the even numbered elements:
strsplit(x, "/|\\s")[[1]][seq(2, 24, by=2)]
[1] "DT" "VBZ" "DT" "JJ" "NN" "VBG" "IN" "DT" "JJ" "NNS" "CC" "VBG"
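To avoid hard-coding the 24, the even positions can be taken relative to the split's actual length — a small sketch on a shortened example string:

```r
x <- "This/DT is/VBZ a/DT short/JJ"  # shortened example
v <- strsplit(x, "/|\\s")[[1]]
# the tags sit at the even positions of the alternating word/tag sequence
v[seq(2, length(v), by = 2)]
# [1] "DT"  "VBZ" "DT"  "JJ"
```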
Here is a one-liner:
R> x <- paste("This/DT is/VBZ a/DT short/JJ sentence/NN consisting/VBG",
+ "of/IN some/DT nouns,/JJ verbs,/NNS and/CC adjectives./VBG")
R> matrix(do.call(c, strsplit(gsub("[a-zA-Z.,]*/", " ", x), " ")),
+ ncol=2, byrow=TRUE)[,2]
[1] "DT" "VBZ" "DT" "JJ" "NN" "VBG" "IN" "DT" "JJ" "NNS" "CC" "VBG"
R>
The key is to get rid of 'text before slash':
R> gsub("[a-zA-Z.,]*/", " ", x)
[1] " DT VBZ DT JJ NN VBG IN DT JJ NNS CC VBG"
R>
after which it is just a matter of splitting the string
R> strsplit(gsub("[a-zA-Z.,]*/", " ", x), " ")
[[1]]
[1] "" "DT" "" "VBZ" "" "DT" "" "JJ" "" "NN"
[11] "" "VBG" "" "IN" "" "DT" "" "JJ" "" "NNS"
[21] "" "CC" "" "VBG"
and filtering the "". There may well be more compact ways for the last bit.
R>
The stringr package has nice functions for working with strings, with very intuitive names. Here you can use str_extract_all to get all matches (including the leading slash), then str_sub to remove the slashes:
str_extract_all(x, "/\\w*")
# [[1]]
# [1] "/DT" "/VBZ" "/DT" "/JJ" "/NN" "/VBG" "/IN" "/DT" "/JJ" "/NNS"
# [11] "/CC" "/VBG"
str_sub(str_extract_all(x, "/\\w*")[[1]], start = 2)
# [1] "DT" "VBZ" "DT" "JJ" "NN" "VBG" "IN" "DT" "JJ" "NNS" "CC" "VBG"
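Since the original question asked for base R, a lookbehind can also grab the word after each slash directly, avoiding the two-step extract-then-trim — a sketch on a shortened example string:

```r
x <- "This/DT is/VBZ a/DT short/JJ"  # shortened example
# (?<=/) asserts a preceding slash without consuming it (perl = TRUE required)
regmatches(x, gregexpr("(?<=/)\\w+", x, perl = TRUE))[[1]]
# [1] "DT"  "VBZ" "DT"  "JJ"
```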