splitting filename text by underscores using R

In R I'd like to take a collection of file names in the format below and return the number to the right of the second underscore (this will always be a number) and the text string to the right of the third underscore (this will be combinations of letters and numbers).
I have file names in this format:
HELP_PLEASE_4_ME
I want to extract the number 4 and the text ME
I'd then like to create a new field within my data frame where these two types of data can be stored. Any suggestions?

Here is an option using regexec and regmatches to pull out the patterns:
matches <- regmatches(df$a, regexec("^.*?_.*?_([0-9]+)_([[:alnum:]]+)$", df$a))
df[c("match.1", "match.2")] <- t(sapply(matches, `[`, -1)) # first result for each match is full regular expression so need to drop that.
Produces:
a match.1 match.2
1 HELP_PLEASE_4_ME 4 ME
2 SOS_WOW_3_Y34OU 3 Y34OU
This will break if any rows don't have the expected structure, but I think that is what you want to happen (i.e. be alerted that your data is not what you think it is). strsplit based approaches will require additional checking to ensure that your data is what you think it is.
And the data:
df <- data.frame(a=c("HELP_PLEASE_4_ME", "SOS_WOW_3_Y34OU"), stringsAsFactors=F)
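If you would rather flag malformed names than have the assignment fail, something along these lines might work. This is only a sketch: the third file name is a made-up bad row, the `out` data frame and its column names are placeholders, and I swapped `.*?` for `[^_]*` (my own choice, not part of the answer above) so that each leading field cannot swallow an underscore.

```r
# Hypothetical input: the last name does not follow the expected structure
files <- c("HELP_PLEASE_4_ME", "SOS_WOW_3_Y34OU", "BAD_NAME")

# Assumed pattern: two underscore-free fields, then a number, then an alphanumeric field
pattern <- "^[^_]*_[^_]*_([0-9]+)_([[:alnum:]]+)$"

# Check the structure first, then extract only from rows that match
ok <- grepl(pattern, files)
m <- regmatches(files[ok], regexec(pattern, files[ok]))
parts <- do.call(rbind, lapply(m, `[`, -1))  # drop the full match, keep the two groups

# Rows that fail the check keep NA in both new columns
out <- data.frame(a = files, match.1 = NA_integer_, match.2 = NA_character_,
                  stringsAsFactors = FALSE)
out$match.1[ok] <- as.integer(parts[, 1])
out$match.2[ok] <- parts[, 2]
out
```

Rows with NA in the new columns are the ones that need a closer look.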

The obligatory stringr version of @BrodieG's quite spiffy answer:
library(stringr)
df[c("match.1", "match.2")] <-
t(sapply(str_match_all(df$a, "^.*?_.*?_([0-9]+)_([[:alnum:]]+)$"), "[", 2:3))
Put here for context only. You should accept BrodieG's answer.

Since you already know that you want the text that comes after the second and third underscore, you could use strsplit and take the third and fourth result.
> x <- "HELP_PLEASE_4_ME"
> spl <- unlist(strsplit(x, "_"))[3:4]
> data.frame(string = x, under2 = spl[1], under3 = spl[2])
## string under2 under3
## 1 HELP_PLEASE_4_ME 4 ME
Then for longer vectors, you could do something like the last two lines here.
## set up some data
> word1 <- c("HELLO", "GOODBYE", "HI", "BYE")
> word2 <- c("ONE", "TWO", "THREE", "FOUR")
> nums <- 20:23
> word3 <- c("ME", "YOU", "THEM", "US")
> XX <-paste0(word1, "_", word2, "_", nums, "_", word3)
> XX
## [1] "HELLO_ONE_20_ME" "GOODBYE_TWO_21_YOU"
## [3] "HI_THREE_22_THEM" "BYE_FOUR_23_US"
## ------------------------------------------------
## process it
> spl <- do.call(rbind, strsplit(XX, "_"))[, 3:4]
> data.frame(cbind(XX, spl))
## XX V2 V3
## 1 HELLO_ONE_20_ME 20 ME
## 2 GOODBYE_TWO_21_YOU 21 YOU
## 3 HI_THREE_22_THEM 22 THEM
## 4 BYE_FOUR_23_US 23 US
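Since the question asks for new fields in an existing data frame, the same strsplit idea could be written roughly as below. It is a sketch that assumes every name has exactly four underscore-separated pieces; the column names `number` and `label` are placeholders I made up.

```r
df <- data.frame(a = c("HELP_PLEASE_4_ME", "SOS_WOW_3_Y34OU"),
                 stringsAsFactors = FALSE)

# Split on a literal underscore; fixed = TRUE skips regex interpretation
pieces <- strsplit(df$a, "_", fixed = TRUE)

# Assumption: four pieces per name; stop early if that does not hold
stopifnot(all(lengths(pieces) == 4))

# Third piece is the number, fourth piece is the text
df$number <- as.numeric(vapply(pieces, `[`, character(1), 3))
df$label <- vapply(pieces, `[`, character(1), 4)
df
```

Using vapply instead of rbind keeps the number column numeric rather than character.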

Related

How to turn one symbol into another

I'm trying to find a way to replace the "|" symbol in this dataset with the "/" symbol
df<-c(2|4,5|6,3|4,4|7,5|8)
So that way the final would look like
df<-c(2/4,5/6,3/4,4/7,5/8)
Any help would be great, thank you.

If it is a single string (or a vector of elements) use chartr from base R
chartr("|", "/", df)
[1] "2/4,5/6,3/4,4/7,5/8"
df2$col1 <- chartr("|", "/", df2$col1)
> df2
col1
1 2/4
2 5/6
3 3/4
4 4/7
5 5/8
data
df <- "2|4,5|6,3|4,4|7,5|8"
df2 <- data.frame(col1 = c("2|4", "5|6", "3|4", "4|7", "5|8"))
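chartr swaps single characters one for one. If the replacement ever needs to be longer than one character, gsub is an alternative. A small sketch, with a shortened copy of the sample data; note that "|" is a regex metacharacter, so fixed = TRUE is used to treat it literally.

```r
# Shortened copy of the sample data
df2 <- data.frame(col1 = c("2|4", "5|6", "3|4"), stringsAsFactors = FALSE)

# fixed = TRUE makes "|" a literal character instead of regex alternation
df2$col1 <- gsub("|", "/", df2$col1, fixed = TRUE)
df2
```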

How to return rows of a df that contain strings from a character list

I have a character list. I would like to return rows in a df that contain any of the strings in the list in a given column.
I have tried things like:
hits <- df %>%
filter(column, any(strings))
strings <- c("ape", "bat", "cat")
head(df$column)
[1] "ape and some other text here"
[2] "just some random text"
[3] "Something about cats"
I would like only rows 1 and 3 returned
Thanks in advance for the help.

Use grepl() with a regular expression matching any of the strings in your strings vector:
strings <- c("ape", "bat", "cat")
Firstly, you can collapse the strings vector to the regex you need:
regex <- paste(strings, collapse = "|")
Which gives:
> regex <- paste(strings, collapse = "|")
> regex
[1] "ape|bat|cat"
The pipe symbol | acts as an or operator, so this regex ape|bat|cat will match ape or bat or cat.
If your data.frame df looks like this:
> df
# A tibble: 3 x 1
column
<chr>
1 ape and some other text here
2 just some random text
3 something about cats
Then you can run the following line of code to return just the rows matching your desired strings:
df[grepl(regex, df$column), ]
The output is as follows:
> df[grepl(regex, df$column), ]
# A tibble: 2 x 1
column
<chr>
1 ape and some other text here
2 something about cats
Note that the above example is case-sensitive: it will only match the lower case strings exactly as specified. You can overcome this easily using the ignore.case parameter of grepl() (note the upper case Cats):
> df[grepl(regex, df$column, ignore.case = TRUE), ]
# A tibble: 2 x 1
column
<chr>
1 ape and some other text here
2 something about Cats
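One caveat with collapsing into a regex: if any of the search strings contain regex metacharacters (".", "(", "|" and so on), the pattern will not mean what you expect. A possible workaround, sketched here with a plain character vector standing in for df$column, is to test each string literally and combine the results.

```r
strings <- c("ape", "bat", "cat")
column <- c("ape and some other text here",
            "just some random text",
            "Something about cats")

# One literal (fixed = TRUE) test per search string, combined with element-wise OR
hits <- Reduce(`|`, lapply(strings, grepl, x = column, fixed = TRUE))

column[hits]
```

The same logical vector can be used to index the rows of a data frame, as in df[hits, ]. Keep in mind that fixed = TRUE cannot be combined with ignore.case = TRUE.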

This can be accomplished with a regular expression.
aColumn <- c("ape and some other text here","just some random text","Something about cats")
aColumn[grepl("ape|bat|cat",aColumn)]
...and the output:
> aColumn[grepl("ape|bat|cat",aColumn)]
[1] "ape and some other text here" "Something about cats"
>
One can also set up the regular expression in an R object, as follows.
# use with a variable
strings <- "ape|cat|bat"
aColumn[grepl(strings,aColumn)]

How to split strings and numbers in R?

I have character vector of the following form (this is just a sample):
R1Ng(10)
test(0)
n.Ex1T(34)
where as can be seen above, the first part is always some combination of alphanumeric and punctuation marks, then there are parentheses with a number inside. I want to create a numeric vector which will store the values inside the parentheses, and each number should have name attribute, and the name attribute should be the string before the number. So, for example, I want to store 10, 0, 34, inside a numeric vector and their name attributes should be, R1Ng, test, n.Ex1T, respectively.
I can always do something like this to get the numbers and create a numeric vector:
counts <- regmatches(data, gregexpr("[[:digit:]]+", data))
as.numeric(unlist(counts))
But how can I extract the first string part and store it as the name attribute of that numeric vector?

How about this:
x <- c("R1Ng(10)", "test(0)", "n.Ex1T(34)")
data.frame(Name = gsub( "\\(.*", "", x),
Count = as.numeric(gsub(".*?\\((.*?)\\).*", "\\1", x)))
# Name Count
# 1 R1Ng 10
# 2 test 0
# 3 n.Ex1T 34
Or alternatively as a vector
setNames(as.numeric(gsub(".*?\\((.*?)\\).*", "\\1", x)),
gsub( "\\(.*", "", x ))
# R1Ng test n.Ex1T
# 10 0 34

Here is another variation using the same expression and capturing parentheses:
temp <- c("R1Ng(10)", "test(0)", "n.Ex1T(34)")
data.frame(Name=gsub("^(.*)\\((\\d+)\\)$", "\\1", temp),
count=as.numeric(gsub("^(.*)\\((\\d+)\\)$", "\\2", temp))) # convert the captured digits to numeric
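The two gsub calls run the same pattern twice. A single pass with regexec and regmatches can pull both groups at once and also shows whether any element failed to match. This is a sketch, not a drop-in replacement; `counts` is just a name I picked.

```r
x <- c("R1Ng(10)", "test(0)", "n.Ex1T(34)")

# Group 1: everything before the parentheses; group 2: the digits inside them
m <- regmatches(x, regexec("^(.*)\\(([0-9]+)\\)$", x))

# A successful match yields 3 pieces: full match, name, number
stopifnot(all(lengths(m) == 3))

parts <- do.call(rbind, m)
counts <- setNames(as.numeric(parts[, 3]), parts[, 2])
counts
```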

We can use str_extract_all
library(stringr)
lst <- str_extract_all(x, "[^()]+")
Or with strsplit from base R
lst <- strsplit(x, "[()]")
If we need to store as a named vector
sapply(lst, function(x) setNames(as.numeric(x[2]), x[1]))
# R1Ng test n.Ex1T
# 10 0 34
data
x <- c("R1Ng(10)", "test(0)", "n.Ex1T(34)")

R function to input character vector

I currently have 10 vectors that look like the following:
string1 <- c("house", "home", "cabin")
string2 <-c("hotel", "hostel", "motel")
and so on for 10 strings.
I'm an R newbie learning functions. I have the following code that I want to run across these 10 strings and turn into a function. The code takes one of these strings, searches for matches, and creates a new variable:
a$string.i <- (1:nrow(a) %in% c(sapply(string1, grep, a$Contents, fixed = TRUE))) +0
As I am new to R, I'm stumped on how to turn this into a function. Do I need to first define the number of strings, then set 'string1' in the above code to x? How do I set the name of the variable = to the name of the string?
Some sample data:
a <- read.table(text='Contents other
1 "a house a home" "111"
2 "cabin in the woods" "121"', header=TRUE)

If you need a function, maybe you can try:
fun1 <- function(namePrefix, dat){ #assuming that the datasets have a common prefix i.e. `string`
pat <- paste0("^", namePrefix, "\\d")
nm1 <- ls(pattern=pat, envir=.GlobalEnv)
lst <- mget(nm1, envir=.GlobalEnv)
lst2 <- lapply(lst, function(x)
(1:nrow(dat) %in% c(sapply(x, grep, dat$Contents, fixed=TRUE)))+0) #your code
dat[names(lst2)] <- lst2
dat
}
fun1("string", a)
# Contents other string1 string2
#1 a house a home 111 1 0
#2 cabin in the woods 121 1 0
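If you would rather not look the vectors up in the global environment, another way to set this up is to keep them in a named list and pass that list in. A rough sketch; `flag_matches`, `string_sets` and the `column` argument are names I made up for illustration.

```r
a <- data.frame(Contents = c("a house a home", "cabin in the woods"),
                other = c("111", "121"), stringsAsFactors = FALSE)

# Hypothetical container: one named entry per vector of search terms
string_sets <- list(string1 = c("house", "home", "cabin"),
                    string2 = c("hotel", "hostel", "motel"))

flag_matches <- function(dat, sets, column = "Contents") {
  for (nm in names(sets)) {
    # TRUE where any term of this set occurs literally in the row
    hit <- Reduce(`|`, lapply(sets[[nm]], grepl, x = dat[[column]], fixed = TRUE))
    dat[[nm]] <- as.integer(hit)  # new column named after the list entry
  }
  dat
}

res <- flag_matches(a, string_sets)
res
```

The list names become the new column names, which covers the "name of the variable = name of the string" part of the question.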

How to read user input into the subset command

I have some R command like this
subset(
(aggregate(cbind(var1,var2)~Ei+Mi+hours,a, FUN=mean)),
(aggregate(cbind(var1,var2)~Ei+Mi+hours,a, FUN=mean))$Ei == c(1:EXP)
)
I want to do
1) Ask the user to input the var1 and var2
2) Get those variables into the subset command line as shown above and
continue with other things.
Note: for reading the user input I have variables like
c(ax,bx,cx,dx,ex,fx,gx,hx,ix,jx,kx,lx,mx,nx,ox) = c(1:15) and each
variable is mapped to number 1 to 15. So displaying this for user and
asking the user to select any number between 1 to 15 and then
checking the corresponding variable for the entered number and
reading this into the command line is, I think, the best method.
So how can I implement this?
Regarding the answer:
Just wondering about one more scenario: what if the user wants to enter multiple numbers in one go (e.g. 1,2,3)? How would that be read with readline, as in the answer below using
v1 <- quote(var1 <- as.numeric(readline('Enter Variable 1: ')))
eval(v1)
xx <- paste0(letters[1:15], 'x')
xx[var1]
How to read multiple variables in this case?

Here's a rough example of the readline interactive prompt. When v1 is evaluated, the user will be prompted to enter a value. That value is then stored as var1.
> v1 <- quote(var1 <- as.numeric(readline('Enter Variable 1: ')))
> eval(v1)
Enter Variable 1: 1000 ## user enters 1000, for example
> 100 + var1 + 50 ## example to show captured output as object
## [1] 1150
So in your case it might go something like
> v1 <- quote(var1 <- as.numeric(readline('Enter a number from 1 to 15: ')))
> eval(v1)
Enter a number from 1 to 15: 7
> var1
## [1] 7
> xx <- paste0(letters[1:15], 'x')
> xx
## [1] "ax" "bx" "cx" "dx" "ex" "fx" "gx" "hx" "ix" "jx" "kx" "lx" "mx" "nx" "ox"
> xx[var1]
## [1] "gx"
I borrowed this idea for a function from this older SO post. You can return the output invisibly and it will still take in the user values.
input.fun <- function(){
v1 <- readline("var1: ")
v2 <- readline("var2: ")
v3 <- readline("var3: ")
v4 <- readline("var4: ")
v5 <- readline("var5: ")
out <- sapply(c(v1, v2, v3, v4, v5), as.numeric, USE.NAMES = FALSE)
invisible(out)
}
> x <- input.fun()
var1: 7
var2: 4
var3: 8
var4: 5
var5: 2
> x
[1] 7 4 8 5 2
In response to your edit: I'm not sure if this is the standard method for reading multiple numbers in one line, but it works.
> xx <- readline('Enter numbers separated by a space: ')
Enter numbers separated by a space: 4 12 67 9 2
> as.numeric(strsplit(xx, ' ')[[1]])
## [1] 4 12 67 9 2
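To accept commas as well as spaces (as in your 1,2,3 example) and reject anything outside 1 to 15, the parsing step could be wrapped in a small helper. This is a sketch; `parse_choices` and `max_choice` are hypothetical names, and the helper takes the raw text so it can be tried without an interactive prompt.

```r
# Hypothetical helper: turn "1, 2,3" or "4 12 9" into an integer vector
parse_choices <- function(txt, max_choice = 15) {
  # Split on any run of commas and/or whitespace
  tokens <- strsplit(trimws(txt), "[,[:space:]]+")[[1]]
  nums <- suppressWarnings(as.integer(tokens))
  # Reject empty input, non-numeric tokens, and out-of-range values
  if (length(nums) == 0 || anyNA(nums) || any(nums < 1 | nums > max_choice)) {
    stop("enter numbers between 1 and ", max_choice)
  }
  nums
}

xx <- paste0(letters[1:15], "x")
xx[parse_choices("1, 2,3")]
```

In an interactive session it would be called as parse_choices(readline("Enter numbers: ")).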

Here's a possibility using scan()
#sample data
df<-data.frame(
ax=runif(50),
bx=runif(50),
cx=runif(50),
dx=runif(50),
Ei=sample(letters[1:5], 50, replace=T)
)
#get vars
vars<-c(NA,NA)
while(any(is.na(vars))) {
cat(paste("enter var number", sum(!is.na(vars))+1),"\n")
cat(paste(seq_along(names(df)), ":", names(df)), sep="\n")
try(n<-scan(what=integer(), nmax=1), silent=T)
vars[min(which(is.na(vars)))]<-n
}
#--pause
#use vars
subset(aggregate(df[,vars], df[,c("Ei"), drop=F], FUN=mean), Ei=="a")
It's not super robust, but if you copy the first half (before the pause) it will ask you for two variable numbers, and then if you run the second half, it will use those two values. I've adjusted the aggregate and subset to be more appropriate for variable usage which means not using the formula syntax.
I did not do any error checking. That's left as an exercise for the asker.
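If you do want to keep the formula syntax from the question, the formula can also be assembled from the chosen variable names. A sketch under a few assumptions: `chosen` stands in for whatever the user picked, and the sample data is a cut-down version of the data above.

```r
set.seed(1)  # only so the sample data is reproducible
df <- data.frame(ax = runif(50), bx = runif(50), cx = runif(50),
                 Ei = sample(letters[1:5], 50, replace = TRUE))

# Hypothetical user selection, e.g. resolved from the numbers they entered
chosen <- c("ax", "cx")

# Build cbind(ax, cx) ~ Ei as text, then convert it to a formula
f <- as.formula(paste0("cbind(", paste(chosen, collapse = ", "), ") ~ Ei"))

agg <- aggregate(f, data = df, FUN = mean)
subset(agg, Ei == "a")
```

Computing the aggregate once and storing it also avoids running it twice, as the subset call in the question does.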
