Chopping a string every time a ; occurs in R

I have a string in which the entries are delimited by a ;.
abc;def;tyu;poi;asf;ghl
Is there a function in R which allows me to split up this string into
abc
def
tyu
all being separate objects, so that I can separately access and index them?
Is there a way to do this without a character argument, by directly indexing a cell in a data frame? The cell looks like the string shown above, but if I try
strsplit(k[1,8], split=';')
I get the error "non-character argument".

One option is scan, which reads the delimited text straight into a character vector:
items <- scan(text="abc;def;tyu;poi;asf;ghl", sep=";", what="")
Read 6 items
as.matrix(items)
#----------
[,1]
[1,] "abc"
[2,] "def"
[3,] "tyu"
[4,] "poi"
[5,] "asf"
[6,] "ghl"
If these are items in a data.frame, you probably should have read the file with read.delim using sep=";" and stringsAsFactors=FALSE in the first place. You can still extract from a factor value with as.character:
df <- data.frame(a="abc;def;tyu;poi;asf;ghl")
items <- scan(text=df[1,1], sep=";", what="")
Error in textConnection(text) : invalid 'text' argument
# Use as.character instead
items <- scan(text=as.character(df[1,1]), sep=";", what="")
Read 6 items
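A minimal sketch of that read.delim route, assuming a hypothetical file semi.txt whose fields are separated by ";":

# hypothetical file name; sep=";" splits each field into its own column,
# and stringsAsFactors=FALSE keeps the values as plain character strings
df <- read.delim("semi.txt", sep = ";", header = FALSE,
                 stringsAsFactors = FALSE)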

For this you can use the strsplit function.
> strsplit('abc;def;ghi', split = ';')
[[1]]
[1] "abc" "def" "ghi"
Note that strsplit is vectorized: it returns a list of results, even if you feed it just one string. To get the result for that one string:
strsplit('abc;def;ghi', split = ';')[[1]]
[1] "abc" "def" "ghi"
The advantage of the vectorization is that you can feed strsplit a whole vector of strings:
> strsplit(rep('abc;def;ghi', 10), split = ';')
[[1]]
[1] "abc" "def" "ghi"

[[2]]
[1] "abc" "def" "ghi"

... (eight more identical elements, [[3]] through [[10]])

Related

ft_tokenizer converts words to lowercase; I want them left as they are

I am using ft_tokenizer on a Spark data frame in R. It tokenizes each word and converts everything to lowercase, but I want the words to keep their original case.
text_data <- data_frame(
  x = c("This IS a sentence", "So is this")
)
# text_data_tbl is the Spark copy of text_data (e.g. created with copy_to)
tokenized <- text_data_tbl %>%
  ft_tokenizer("x", "word")
tokenized$word
## [[1]]
## [[1]][[1]]
## [1] "this"
##
## [[1]][[2]]
## [1] "is"
##
## [[1]][[3]]
## [1] "a"
I want:
tokenized$word
## [[1]]
## [[1]][[1]]
## [1] "This"
##
## [[1]][[2]]
## [1] "IS"
##
## [[1]][[3]]
## [1] "a"
I guess it is not possible with ft_tokenizer. From ?ft_tokenizer:
"A tokenizer that converts the input string to lowercase and then splits it by white spaces."
So its basic behavior is to convert the string to lowercase and split on whitespace, which I guess cannot be changed. Consider doing
text_data$new_x <- lapply(strsplit(text_data$x, "\\s+"), as.list)
which gives the expected output, and you can continue your process from there.
text_data$new_x
#[[1]]
#[[1]][[1]]
#[1] "This"
#[[1]][[2]]
#[1] "IS"
#[[1]][[3]]
#[1] "a"
#[[1]][[4]]
#[1] "sentence"
#[[2]]
#[[2]][[1]]
#[1] "So"
#[[2]][[2]]
#[1] "is"
#[[2]][[3]]
#[1] "this"

Filter a list in R to keep elements with nchar > 1

I have a list of names
> x <- c("Test t", "Cuma Nama K", "O", "Test satu dua t")
> name <- strsplit(x, " ")
> name
[[1]]
[1] "Test" "t"
[[2]]
[1] "Cuma" "Nama" "K"
[[3]]
[1] "O"
[[4]]
[1] "Test" "satu" "dua" "t"
How can I filter the list so that it becomes like the following? I want to keep only the strings with nchar > 1.
> name
[[1]]
[1] "Test"
[[2]]
[1] "Cuma" "Nama"
[[4]]
[1] "Test" "satu" "dua"
lapply(name, function(x) x[nchar(x)>1])
Results in:
[[1]]
[1] "Test"
[[2]]
[1] "Cuma" "Nama"
[[3]]
character(0)
[[4]]
[1] "Test" "satu" "dua"
We can loop over the list elements, subset the strings that have nchar greater than 1, and use Filter to drop the elements that are left with zero length:
Filter(length,lapply(name, function(x) x[nchar(x) >1 ]))
#[[1]]
#[1] "Test"
#[[2]]
#[1] "Cuma" "Nama"
#[[3]]
#[1] "Test" "satu" "dua"
If we want to remove the one-character words from the original strings, we can also do this without splitting; the pattern matches any single character bounded by spaces or the string edges:
setdiff(gsub("(^| ).( |$)", "", x), "")
#[1] "Test" "Cuma Nama" "Test satu dua"

Append a list element-wise to elements of a nested list in R

I'm new to R and still trying to get my head around the apply family instead of using loops.
I have a nested list and a character vector, both composed of character strings:
>lst1 <- list(c("ABC", "DEF", "GHI"), c("JKL", "MNO", "PQR"))
>lst2 <- c("abc", "def")
I would like to create a third list such that each element of lst2 is appended as the last element of the respective sublist in lst1. The desired output looks like this:
>lst3
[[1]]
[1] "ABC" "DEF" "GHI" "abc"
[[2]]
[1] "JKL" "MNO" "PQR" "def"
My experience thus far in R tells me there likely is a way of doing this without writing a loop explicitly.
You can use Map, which does exactly what mapply(..., SIMPLIFY = FALSE) does:
Map(c, lst1, lst2)
[[1]]
[1] "ABC" "DEF" "GHI" "abc"
[[2]]
[1] "JKL" "MNO" "PQR" "def"
You can definitely use lapply if you apply your function over the indices of lst1. This works:
lapply(seq_along(lst1), function(i) append(lst1[[i]], lst2[[i]]))
[[1]]
[1] "ABC" "DEF" "GHI" "abc"
[[2]]
[1] "JKL" "MNO" "PQR" "def"
You can also use an explicit loop with append:
list1 <- list(c("ABC","DEF","GHI"),c("JKL","MNO","PQR"))
list2 <- c("abc","def")
listcomplete <- list(c("ABC","DEF","GHI","abc"), c("JKL","MNO","PQR","def"))  # the desired result
for (i in seq_along(list2)) {
  list1[[i]] <- append(list1[[i]], list2[i])
}
Results:
> list1
[[1]]
[1] "ABC" "DEF" "GHI" "abc"
[[2]]
[1] "JKL" "MNO" "PQR" "def"

Return only the matched pattern from grep

Given the following example in R:
my.list<-list(a='ivw_2014_abc.pdf',b='ivw_2014_def.pdf',c='ivw_2014_ghi.pdf')
grep('(?<=ivw_2014_)[a-z]*',my.list,perl=T,value=T)
returns
a b c
"ivw_2014_abc.pdf" "ivw_2014_def.pdf" "ivw_2014_ghi.pdf"
I would like to make it return only
[1] 'abc' 'def' 'ghi'
In bash I would use grep's -o option. How do I achieve this in R?
Without using any capturing groups,
> my.list<-list(a='ivw_2014_abc.pdf',b='ivw_2014_def.pdf',c='ivw_2014_ghi.pdf')
> gsub("^.*_|\\..*$", "", my.list, perl=T)
[1] "abc" "def" "ghi"
For example :
sub('.*_(.*)[.].*','\\1',my.list)
[1] "abc" "def" "ghi"
The following (admittedly convoluted) approach may be of interest:
as.character(unlist(data.frame(strsplit(as.character(unlist(data.frame(strsplit(as.character(my.list),'\\.'))[1,])),'_'))[3,]))
[1] "abc" "def" "ghi"
The same call, indented so it is easier to read:
as.character(
  unlist(data.frame(strsplit(as.character(
    unlist(data.frame(strsplit(as.character(
      my.list), '\\.'))[1,])), '_'))[3,]))
[1] "abc" "def" "ghi"
Another option would be:
library(stringi)
stri_extract_first_regex(unlist(my.list), "[A-Za-z]+(?=\\.)")
#[1] "abc" "def" "ghi"
Look at the regmatches function. It works with regexpr rather than grep, but returns just the matched part of the string.
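A sketch of that combination, reusing the lookbehind pattern from the question:

v <- unlist(my.list)
regmatches(v, regexpr("(?<=ivw_2014_)[a-z]+", v, perl = TRUE))
# [1] "abc" "def" "ghi"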

R strsplit with multiple unordered split arguments?

Given a character string
test_1<-"abc def,ghi klm"
test_2<-"abc, def ghi klm"
I wish to obtain
"abc"
"def"
"ghi"
However, with strsplit one must apparently know the order of the splitting values in the string, since a vector of split values is recycled across the input strings rather than applied jointly to each one. None of these attempts works:
strsplit(test_1, c(",", " "))
strsplit(test_2, c(" ", ","))
strsplit(test_2, split=c("[:punct:]","[:space:]"))[[1]]
I am looking to split the string wherever any of my splitting values occurs, in a single step.
Actually strsplit accepts regex patterns as well. (A comma is not really a regex metacharacter, so the double-backslash escapes below are harmless rather than required; likewise "\\s" is used more to improve readability than out of necessity):
> strsplit(test_1, "\\, |\\,| ") # three possibilities OR'ed
[[1]]
[1] "abc" "def" "ghi" "klm"
> strsplit(test_2, "\\, |\\,| ")
[[1]]
[1] "abc" "def" "ghi" "klm"
Without using both "\\, " (comma followed by a space) and "\\," you would have gotten some empty strings in the result. It might have been clearer to write:
> strsplit(test_2, "\\,\\s|\\,|\\s")
[[1]]
[1] "abc" "def" "ghi" "klm"
@Fojtasek is right: using a character class often simplifies the task because it creates an implicit logical OR:
> strsplit(test_2, "[, ]+")
[[1]]
[1] "abc" "def" "ghi" "klm"
> strsplit(test_1, "[, ]+")
[[1]]
[1] "abc" "def" "ghi" "klm"
In case you don't like regular expressions, you can call strsplit() multiple times:
strsplits <- function(x, splits, ...)
{
for (split in splits)
{
x <- unlist(strsplit(x, split, ...))
}
return(x[!x == ""]) # Remove empty values
}
strsplits(test_1, c(" ", ","))
# "abc" "def" "ghi" "klm"
strsplits(test_2, c(" ", ","))
# "abc" "def" "ghi" "klm"
Updated for the added example
strsplits(test_1, c("[[:punct:]]","[[:space:]]"))
# "abc" "def" "ghi" "klm"
strsplits(test_2, c("[[:punct:]]","[[:space:]]"))
# "abc" "def" "ghi" "klm"
But if you are going to use regular expressions, you might as well go with @DWin's approach:
strsplit(test_1, "[[:punct:][:space:]]+")[[1]]
# "abc" "def" "ghi" "klm"
strsplit(test_2, "[[:punct:][:space:]]+")[[1]]
# "abc" "def" "ghi" "klm"
You could go with strsplit(test_1, "\\W+"); "\\W" matches any non-word character, and the "+" prevents empty strings where delimiters are adjacent (as in ", ").
test_1<-"abc def,ghi klm"
test_2<-"abc, def ghi klm"
key_words <- c("abc","def","ghi")
matches <- str_c(key_words, collapse ="|")
str_extract_all(test_1, matches)
str_extract_all(test_2, matches)
