Split string in R

I am trying to split the output of the "ls -lrt" command from Linux, but strsplit treats every single space as a delimiter. Where there are two consecutive spaces, the second one produces an empty string in the result. So I think I need to collapse multiple spaces into one. Does anybody have any idea how to do this?
> a <- try(system("ls -lrt | grep -i .rds", intern = TRUE))
> a
[1] "-rw-r--r-- 1 u7x9573 sashare  2297 Jun  9 16:10 abcde.RDS"
[2] "-rw-r--r-- 1 u7x9573 sashare 86704 Jun  9 16:10 InputSource2.rds"
> str(a)
chr [1:6] "-rw-r--r-- 1 u7x9573 sashare  2297 Jun  9 16:10 abcde.RDS" ...
>
>c = strsplit(a," ")
>c
[[1]]
[1] "-rw-r--r--" "1" "u7x9573" "sashare" ""
[6] "2297" "Jun" "" "9" "16:10"
[11] "abcde.RDS"
[[2]]
[1] "-rw-r--r--" "1" "u7x9573" "sashare"
[5] "86704" "Jun" "" "9"
[9] "16:10" "InputSource2.rds"
In the next step I needed just the file names, and I used the following code, which worked fine:
mtrl_name <- try(system("ls | grep -i .rds", intern = TRUE))

This returns that info in a data frame for the indicated files:
file.info(list.files(pattern = "[.]rds$", ignore.case = TRUE))
or if we knew the extensions were lower case:
file.info(Sys.glob("*.rds"))
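As a rough sketch of how the file.info result could stand in for parsing ls output, the following builds a small table of name, size, and modification time. The temporary directory, the demo files, and the variable names (demo_dir, rds_files, listing) are assumptions made purely for illustration.

```r
# Create a throwaway directory with two .rds files and one unrelated file
demo_dir <- file.path(tempdir(), "rds_demo")
dir.create(demo_dir, showWarnings = FALSE)
saveRDS(1:3, file.path(demo_dir, "abcde.RDS"))
saveRDS(letters, file.path(demo_dir, "InputSource2.rds"))
writeLines("not an rds file", file.path(demo_dir, "notes.txt"))

# Match the extension case-insensitively, as `grep -i` did in the question
rds_files <- list.files(demo_dir, pattern = "[.]rds$",
                        ignore.case = TRUE, full.names = TRUE)

# file.info returns one row per file; keep only the columns of interest
info <- file.info(rds_files)
listing <- data.frame(name = basename(rds_files),
                      size = info$size,
                      modified = info$mtime,
                      stringsAsFactors = FALSE)

# Oldest first, similar in spirit to the ordering of `ls -lrt`
listing <- listing[order(listing$modified), ]
print(listing)
```

Because the sizes and timestamps arrive as numeric and date-time columns, no string splitting is needed at all.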

strsplit takes a regular expression so we can use those to help out. For more info read ?regex
> x <- "Spaces   everywhere right?  "
> # Not what we want
> strsplit(x, " ")
[[1]]
[1] "Spaces" "" "" "everywhere" "right?"
[6] ""
> # Use " +" to tell it to split on one or more spaces
> strsplit(x, " +")
[[1]]
[1] "Spaces" "everywhere" "right?"
> # If we want to be more explicit and catch the possibility of tabs, new lines, ...
> strsplit(x, "[[:space:]]+")
[[1]]
[1] "Spaces" "everywhere" "right?"
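Applied to the listing from the question, the same idea might look like the sketch below. The two sample lines are retyped from the question (with the double spaces implied by its strsplit output), and the names listing_lines, file_names, and file_sizes are assumptions for illustration. File names that themselves contain spaces would break this approach.

```r
# Two lines of `ls -lrt` output, retyped from the question
listing_lines <- c(
  "-rw-r--r-- 1 u7x9573 sashare  2297 Jun  9 16:10 abcde.RDS",
  "-rw-r--r-- 1 u7x9573 sashare 86704 Jun  9 16:10 InputSource2.rds"
)

# Split on runs of whitespace so double spaces do not create empty fields
fields <- strsplit(listing_lines, "[[:space:]]+")

# The file name is the last field; the size is the fifth
file_names <- vapply(fields, function(f) f[length(f)], character(1))
file_sizes <- as.numeric(vapply(fields, `[`, character(1), 5))

print(file_names)
print(file_sizes)
```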

Related

How can I use two lists to create a Table (Columns and Rows)

I want to write a script in R that allows me to import MSG files and store the information in a table. The fields may vary by course, so the column names are defined based on the first MSG file being imported.
The import and extraction are already working (special thanks to the user "January")
What does not work is the filling in the table, which consists of two steps. Add column names and fill in rows.
I've tried using unlist to prepare the contents of the lists so that I can add them as columns and rows to a table.
Anmeldung <- gsub("^\\s+", "", Anmeldung) # remove spaces at the beginning and end
Anmeldung <- gsub("\\s+$", "", Anmeldung)
words <- strsplit(Anmeldung, " *[\n\r]+ *")[[1]]
fields <- as.list(words[seq(1, length(words), 2)])
information <- as.list(words[seq(2, length(words), 2)])
resTab1 = data.frame(t(unlist(fields)))
resTab2 = data.frame(t(unlist(information)))
colnames(resTab2) = c(resTab1)
variable.names(resTab2)
When I try to create the table, this error appears:
colnames(resTab2) = c(resTab1)
Error in names(x) <- value :
'names' attribute [22] must be the same length as the vector [21]
This is what the lists fields and information look like:
Fields
> fields
[[1]]
[1] "Anrede"
[[2]]
[1] "Vorname"
[[3]]
[1] "Name"
[[4]]
[1] "Email (für Kontaktaufnahme)"
[[5]]
[1] "Telefon/Mobile (geschäftlich)"
[[6]]
[1] "Telefon/Mobile (privat)"
[[7]]
[1] "Strasse/Nr."
Information:
> information
[[1]]
[1] "Herr"
[[2]]
[1] "James"
[[3]]
[1] "Bond"
[[4]]
[1] "james.bond@email.com"
[[5]]
[1] "007 000 77 07"
[[6]]
[1] "007 000 77 07"
[[7]]
[1] "Lampenstrasse 8"
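The error occurs because you are assigning 22 names (taken from resTab1) to resTab2, which has only 21 columns. The names must have the same length as the object they are assigned to.
For example: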
x <- c(1,2)
y <- c("a","b","c")
names(x) <- y
#Error in names(x) <- y :
#'names' attribute [3] must be the same length as the vector [2]
EDIT:
Use unlist to flatten both lists, then use the field names as the names of the values:
information <- unlist(information)
fields <- unlist(fields)
names(information) <- fields
information
#OUTPUT
#Anrede 'Herr'
#Vorname 'James'
#Name 'Bond'
#Email (für Kontaktaufnahme) 'james.bond@email.com'
#Telefon/Mobile (geschäftlich) '007 000 77 07'
#Telefon/Mobile (privat) '007 000 77 07'
#Strasse/Nr. 'Lampenstrasse 8'
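If a one-row table is the goal, a minimal sketch along the lines of the question's own code could check the lengths first and only then assign the column names. The shortened sample values and the names field_names, values, and res_tab are assumptions for illustration. A mismatch such as 22 versus 21 may indicate that one field in the message had no value, but that is a guess.

```r
# Shortened sample data in the same shape as the lists in the question
fields <- list("Anrede", "Vorname", "Name", "Strasse/Nr.")
information <- list("Herr", "James", "Bond", "Lampenstrasse 8")

field_names <- unlist(fields)
values <- unlist(information)

# Fail early with a readable message if a field has no matching value
if (length(field_names) != length(values)) {
  stop("Got ", length(field_names), " field names but ",
       length(values), " values")
}

# t() turns the values into a 1 x n matrix, which becomes a one-row data frame
res_tab <- data.frame(t(values), stringsAsFactors = FALSE)
colnames(res_tab) <- field_names
print(res_tab)
```

Rows built this way from further messages could then be combined with rbind, provided they share the same column names.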

How to turn a table with strings into a list of vectors in R?

I have a dataset that looks like this:
> data.frame("letter" = letters, "words" = paste0(1:26,letters, letters,",", rev(letters),letters,5:26, ",", letters, 1:24, rev(letters)))
letter words
1 a 1aa,za5,a1z
2 b 2bb,yb6,b2y
3 c 3cc,xc7,c3x
4 d 4dd,wd8,d4w
5 e 5ee,ve9,e5v
...
And I would like to turn this table into
[[a]]
[1] "1aa" "za5" "a1z"
[[b]]
[1] "2bb" "yb6" "b2y"
[[c]]
[1] "3cc" "xc7" "c3x"
[[d]]
[1] "4dd" "wd8" "d4w"
[[e]]
[1] "5ee" "ve9" "e5v"
...
I have tried a for loop, which works, but it gets slower as the number of rows in the data frame increases. Is there a cleaner way to do this?
Your answer is much appreciated.
Thank you very much!!
The function strsplit is what you are looking for. Try:
df = data.frame("letter" = letters, "words" = paste0(1:26,letters, letters,",", rev(letters),letters,5:26, ",", letters, 1:24, rev(letters)))
strsplit(as.character(df$words),',',fixed= TRUE)
[[1]]
[1] "1aa" "za5" "a1z"
[[2]]
[1] "2bb" "yb6" "b2y"
[[3]]
[1] "3cc" "xc7" "c3x"
[[4]]
[1] "4dd" "wd8" "d4w"
[[5]]
[1] "5ee" "ve9" "e5v"
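The result above is an unnamed list. To get elements named after the letter column, as in the desired output, one possible sketch wraps the call in setNames. The three-row data frame and the name word_list are assumptions for illustration.

```r
# A small stand-in for the data frame in the question
df <- data.frame(letter = c("a", "b", "c"),
                 words = c("1aa,za5,a1z", "2bb,yb6,b2y", "3cc,xc7,c3x"),
                 stringsAsFactors = FALSE)

# strsplit is vectorised over rows, so no explicit loop is needed;
# setNames attaches the letter column as the list names
word_list <- setNames(strsplit(as.character(df$words), ",", fixed = TRUE),
                      df$letter)

print(word_list$b)
```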

Issue with strsplit not storing searched field

I am running a regex query using R
df<- c("955 - 959 Fake Street","95-99 Fake Street","4-9 M4 Ln","95 - 99 Fake Street","99 Fake Street")
955 - 959 Fake Street
95-99 Fake Street
4-9 M4 Ln
95 - 99 Fake Street
99 Fake Street
I am attempting to sort these addresses into two columns
I expected:
strsplit(df, "\\d+(\\s*-\\s*\\d+)?", perl=T)
would split up the numbers on the left and the rest of the address on the right.
The result I am getting is:
[1] "" " Fake Street"
[1] "" " Fake Street"
[1] "" " M" " Ln"
[1] "" " Fake Street"
[1] "" " Fake Street"
The strsplit function appears to delete the text that matched the split pattern. Is there any way I can preserve it?
Thanks
You are almost there: append \\K\\s* to your regex and prepend ^, the start-of-string anchor:
df<- c("955 - 959 Fake Street","95-99 Fake Street","4-9 M4 Ln","95 - 99 Fake Street","99 Fake Street")
strsplit(df, "^\\d+(\\s*-\\s*\\d+)?\\K\\s*", perl=T)
\K is a match reset operator that discards the text matched so far. After the pattern matches one or more digits at the start of the string, optionally followed by a hyphen (with optional whitespace around it) and more digits, that text is dropped from the match. Only the zero or more whitespace characters that follow remain in the match, and the string is split on them.
The output is:
[[1]]
[1] "955 - 959" "Fake Street"
[[2]]
[1] "95-99" "Fake Street"
[[3]]
[1] "4-9" "M4 Ln"
[[4]]
[1] "95 - 99" "Fake Street"
[[5]]
[1] "99" "Fake Street"
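Since the goal in the question was two columns, the list returned by strsplit could be bound into a data frame, as in this sketch. It assumes every address splits into exactly two parts; the names addresses, parts, and split_df are chosen for illustration.

```r
addresses <- c("955 - 959 Fake Street", "95-99 Fake Street", "4-9 M4 Ln",
               "95 - 99 Fake Street", "99 Fake Street")

# Same pattern as above: keep the leading number (or range), split on the
# whitespace that follows it
parts <- strsplit(addresses, "^\\d+(\\s*-\\s*\\d+)?\\K\\s*", perl = TRUE)

# Guard the assumption that each address produced exactly two pieces
stopifnot(all(lengths(parts) == 2))

# Stack the pieces row by row and label the columns
split_mat <- do.call(rbind, parts)
colnames(split_mat) <- c("numberPart", "rest")
split_df <- as.data.frame(split_mat, stringsAsFactors = FALSE)
print(split_df)
```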
You could use lookbehinds and lookaheads to split at the space between a number and the character:
strsplit(df, "(?<=\\d)\\s(?=[[:alpha:]])", perl = TRUE)
# [[1]]
# [1] "955 - 959" "Fake Street"
#
# [[2]]
# [1] "95-99" "Fake Street"
#
# [[3]]
# [1] "4-9" "M4" "Ln"
#
# [[4]]
# [1] "95 - 99" "Fake Street"
#
# [[5]]
# [1] "99" "Fake Street"
This, however, also splits at the space between "M4" and "Ln". If your addresses are always of the format "number (possibly a range) followed by the rest of the address", you could extract the two parts separately (as @d.b suggested):
splitDf <- data.frame(
numberPart = sub("(\\d+(\\s*-\\s*\\d+)?)(.*)", "\\1", df),
rest = trimws(sub("(\\d+(\\s*-\\s*\\d+)?)(.*)", "\\3", df)))
splitDf
# numberPart rest
# 1 955 - 959 Fake Street
# 2 95-99 Fake Street
# 3 4-9 M4 Ln
# 4 95 - 99 Fake Street
# 5 99 Fake Street
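The two sub calls above run the same pattern twice. A possible variant captures both parts in one pass with regexec and regmatches. This is only a sketch: the pattern is adapted from the one above, the names addresses, m, and split_df2 are assumptions, and addresses that do not start with a number would yield NA in both columns.

```r
addresses <- c("955 - 959 Fake Street", "95-99 Fake Street", "4-9 M4 Ln",
               "95 - 99 Fake Street", "99 Fake Street")

# Group 1: leading number or range; group 2: everything after the whitespace
pattern <- "^(\\d+(?:\\s*-\\s*\\d+)?)\\s*(.*)$"
m <- regmatches(addresses, regexec(pattern, addresses, perl = TRUE))

# Each element of m holds: full match, group 1, group 2
split_df2 <- data.frame(
  numberPart = vapply(m, `[`, character(1), 2),
  rest       = vapply(m, `[`, character(1), 3),
  stringsAsFactors = FALSE
)
print(split_df2)
```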

Count number of times a word-wildcard appears in text (in R)

I have a vector of either regular words ("activated") or wildcard words ("activat*"). I want to:
1) Count the number of times each word appears in a given text (i.e., if "activated" appears in text, "activated" frequency would be 1).
2) Count the number of times each word wildcard appears in a text (i.e., if "activated" and "activation" appear in text, "activat*" frequency would be 2).
I'm able to achieve (1), but not (2). Can anyone please help? thanks.
library(tm)
library(qdap)
text <- "activation has begun. system activated"
text <- Corpus(VectorSource(text))
words <- c("activation", "activated", "activat*")
# Using termco to search for the words in the text
apply_as_df(text, termco, match.list=words)
# Result:
# docs word.count activation activated activat*
# 1 doc 1 5 1(20.00%) 1(20.00%) 0
Could this be related to package versions? I ran the same code (see below) and got the result you expected:
> text <- "activation has begunm system activated"
> text <- Corpus(VectorSource(text))
> words <- c("activation", "activated", "activat")
> apply_as_df(text, termco, match.list=words)
docs word.count activation activated activat
1 doc 1 5 1(20.00%) 1(20.00%) 2(40.00%)
Below is the output of R.Version(). I am running this in RStudio Version 0.99.491 on Windows 10.
> R.Version()
$platform
[1] "x86_64-w64-mingw32"
$arch
[1] "x86_64"
$os
[1] "mingw32"
$system
[1] "x86_64, mingw32"
$status
[1] ""
$major
[1] "3"
$minor
[1] "2.3"
$year
[1] "2015"
$month
[1] "12"
$day
[1] "10"
$`svn rev`
[1] "69752"
$language
[1] "R"
$version.string
[1] "R version 3.2.3 (2015-12-10)"
$nickname
[1] "Wooden Christmas-Tree"
Hope this helps
Maybe consider a different approach using the stringi package?
text <- "activation has begun. system activated"
words <- c("activation", "activated", "activat*")
library(stringi)
counts <- unlist(lapply(words,function(word)
{
newWord <- stri_replace_all_fixed(word,"*", "\\p{L}")
stri_count_regex(text, newWord)
}))
ratios <- counts/stri_count_words(text)
names(ratios) <- words
ratios
Result is:
activation activated activat*
0.2 0.2 0.4
In the code I convert * into \p{L}, which matches any single letter in a regex pattern. After that I count the regex matches.
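For comparison, a base R sketch of the same counting idea is shown below. It makes two assumptions that go beyond the answer above: * is taken to mean zero or more letters, and each term must match a whole word. The helper count_term is hypothetical, and any other regex metacharacters in a term would be interpreted as regex syntax.

```r
text <- "activation has begun. system activated"
words <- c("activation", "activated", "activat*")

# Hypothetical helper: turn a wildcard term into a whole-word regex and
# count its matches in the text
count_term <- function(term, text) {
  pattern <- paste0("\\b", gsub("*", "[[:alpha:]]*", term, fixed = TRUE), "\\b")
  hits <- gregexpr(pattern, text, perl = TRUE)[[1]]
  sum(hits > 0)  # gregexpr returns -1 when there is no match
}

counts <- vapply(words, count_term, integer(1), text = text)
print(counts)
```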

How to write the proper regular expression to extract value from the string?

> str=" 9.48 12.89 13.9 6.79 "
> strsplit(str,split="\\s+")
[[1]]
[1] "" "9.48" "12.89" "13.9" "6.79"
> unlist(strsplit(str,split="\\s+"))->y
> y[y!=""]
[1] "9.48" "12.89" "13.9" "6.79"
How can I get this result with a single regular expression in strsplit, without the extra step
y[y!=""]?
I would just trim the string before splitting it:
strsplit(gsub("^\\s+|\\s+$", "", str), "\\s+")[[1]]
# [1] "9.48" "12.89" "13.9" "6.79"
Alternatively, it is pretty direct to use scan in this case:
scan(text=str)
# Read 4 items
# [1] 9.48 12.89 13.90 6.79
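The trimming step can also be written with trimws, which is part of base R (since version 3.2.0). The sketch below uses the variable name s instead of str to avoid masking the str function; the sample string and names are assumptions for illustration.

```r
s <- "  9.48  12.89 13.9   6.79  "

# Remove leading and trailing whitespace, then split on runs of whitespace
tokens <- strsplit(trimws(s), "[[:space:]]+")[[1]]

# Convert to numbers if numeric values are what is ultimately needed
values <- as.numeric(tokens)

print(tokens)
print(values)
```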
If you want to extract just the numbers, the following regex may do:
regmatches(str, gregexpr("[0-9.]+", text = str))[[1]]
## [1] "9.48" "12.89" "13.9" "6.79"
To capture negative numbers as well, you can use the following:
str = " 9.48 12.89 13.9 --6.79 "
regmatches(str, gregexpr("\\-{0,1}[0-9.]+", text = str))[[1]]
## [1] "9.48" "12.89" "13.9" "-6.79"
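Note that the character class [0-9.]+ also matches a lone dot or a run such as "1.2.3". A somewhat stricter sketch, assuming plain decimal numbers with an optional leading minus sign, is shown below. The sample string and the names s, number_pattern, and nums are assumptions for illustration.

```r
s <- " 9.48 12.89 13.9 --6.79 . "

# Optional minus, digits, then optionally a dot followed by more digits
number_pattern <- "-?[0-9]+(?:\\.[0-9]+)?"
matches <- regmatches(s, gregexpr(number_pattern, s, perl = TRUE))[[1]]

# Convert the extracted strings to numeric values
nums <- as.numeric(matches)

print(matches)
print(nums)
```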
