R substring wildcard search to find text

I have a data.frame column with values like the ones below. I want to split each cell into two new columns, num1 and num2, where num1 is everything before the "-" and num2 is everything between the "-" and the ".".
I am thinking of using the gregexpr function as shown here and writing a for loop to iterate over each row. Is there a faster way to do this?
60-150.PNG
300-12.PNG
employee <- c('60-150.PNG','300-12.PNG')
employ.data <- data.frame(employee)
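For reference, the row-by-row loop I have in mind would look roughly like this (just a sketch of the baseline I would like to avoid):
num1 <- num2 <- numeric(nrow(employ.data))
for (i in seq_len(nrow(employ.data))) {
  # extract the digit runs in each cell, e.g. "60" and "150" from "60-150.PNG"
  parts <- regmatches(as.character(employ.data$employee[i]),
                      gregexpr("[0-9]+", as.character(employ.data$employee[i])))[[1]]
  num1[i] <- as.numeric(parts[1])
  num2[i] <- as.numeric(parts[2])
}
data.frame(num1, num2)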

Try
library(tidyr)
extract(employ.data, employee, into=c('num1', 'num2'),
'([^-]*)-([^.]*)\\..*', convert=TRUE)
# num1 num2
#1 60 150
#2 300 12
Or
library(data.table)#v1.9.5+
setDT(employ.data)[, tstrsplit(employee, '[-.]', type.convert=TRUE)[-3]]
# V1 V2
#1: 60 150
#2: 300 12
Or based on @rawr's comment
read.table(text=gsub('-|.PNG', ' ', employ.data$employee),
col.names=c('num1', 'num2'))
# num1 num2
#1 60 150
#2 300 12
Update
To keep the original column
extract(employ.data, employee, into=c('num1', 'num2'), remove=FALSE,
'([^-]*)-([^.]*)\\..*', convert=TRUE)
# employee num1 num2
#1 60-150.PNG 60 150
#2 300-12.PNG 300 12
Or
setDT(employ.data)[, paste0('num', 1:2) := tstrsplit(employee,
'[-.]', type.convert=TRUE)[-3]]
# employee num1 num2
#1: 60-150.PNG 60 150
#2: 300-12.PNG 300 12
Or
cbind(employ.data, read.table(text=gsub('-|.PNG', ' ',
employ.data$employee),col.names=c('num1', 'num2')))
# employee num1 num2
#1 60-150.PNG 60 150
#2 300-12.PNG 300 12

You can try cSplit from my "splitstackshape" package:
library(splitstackshape)
cSplit(employ.data, "employee", "-|.PNG", fixed = FALSE)
# employee_1 employee_2
# 1: 60 150
# 2: 300 12
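If the generated names are not wanted, cSplit returns a data.table, so setnames can rename the columns afterwards (a small assumed follow-up, not part of the original call):
out <- cSplit(employ.data, "employee", "-|.PNG", fixed = FALSE)
setnames(out, c("employee_1", "employee_2"), c("num1", "num2"))
out
# num1 num2
# 1: 60 150
# 2: 300 12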
Since you mention gregexpr, you can probably try something like:
do.call(rbind,
regmatches(as.character(employ.data$employee),
gregexpr("-|.PNG", employ.data$employee),
invert = TRUE))[, -3]
[,1] [,2]
[1,] "60" "150"
[2,] "300" "12"

Another option using stringi
library(stringi)
data.frame(type.convert(stri_split_regex(employee, "[-.]", simplify = TRUE)[, -3]))
# X1 X2
# 1 60 150
# 2 300 12

Or with plain gsub:
gsub("-.*", "", employ.data$employee)              # drop everything from the "-" onwards
gsub(".*-(.*)\\..*", "\\1", employ.data$employee)  # keep only what lies between "-" and "."

The strsplit function will give you what you're looking for, with the output in a list.
employee <- c('60-150.PNG','300-12.PNG')
strsplit(employee, "[-]")
##Output:
[[1]]
[1] "60" "150.PNG"
[[2]]
[1] "300" "12.PNG"
Note that the second argument to strsplit is a regular expression, not just a literal character to split on, so more complicated patterns can be used.
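For example, splitting on the character class [-.] separates the extension into its own element as well:
strsplit(employee, "[-.]")
[[1]]
[1] "60"  "150" "PNG"
[[2]]
[1] "300" "12"  "PNG"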

Regular expression in R to remove pairs

I have a code that outputs a pair of integers as "(1, 21)", as a string. The integers are always between 1 and 99.
I want to extract the integers into an array as numeric. How can I do this? I've done some research and it seems regex is the way to go, but I'm unsure exactly how to do this here.
Here are several base R one-liners. These each produce a data frame. Use as.matrix(...) on that if you want a matrix/array. (2) seems particularly compact.
1) trimws/read.table Trim non-digits off the ends using trimws and then use read.table to read the result, giving the data frame shown.
x <- c("(1, 21)", "(2, 22)", "(3, 33)") # input
read.table(text = trimws(x, white = "\\D"), sep = ",")
## V1 V2
## 1 1 21
## 2 2 22
## 3 3 33
2) gsub/read.table Another approach is to convert each non-digit to a space and then use read.table:
read.table(text = gsub("\\D", " ", x))
## V1 V2
## 1 1 21
## 2 2 22
## 3 3 33
3) strcapture Define a regular expression with captures to use with strcapture.
strcapture("(\\d+), (\\d+)", x, data.frame(V1 = integer(0), V2 = integer(0)))
## V1 V2
## 1 1 21
## 2 2 22
## 3 3 33
4) chartr/read.table Use chartr to replace ( with a space and then use read.table defining the comment character as ).
read.table(text = chartr("(", " ", x), sep = ",", comment.char = ")")
## V1 V2
## 1 1 21
## 2 2 22
## 3 3 33
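As mentioned above, wrapping any of these in as.matrix(...) gives a matrix/array rather than a data frame; for example, applied to (2):
as.matrix(read.table(text = gsub("\\D", " ", x)))
##      V1 V2
## [1,]  1 21
## [2,]  2 22
## [3,]  3 33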
You can use gsub to remove "(" and ")" with the class [()], then strsplit to split on ", ". unlist the returned list, convert it with as.integer, and build a matrix or array.
matrix(as.integer(unlist(strsplit(gsub("[()]", "", x), ", ", TRUE))), 2)
# [,1] [,2]
#[1,] 1 3
#[2,] 21 31
Data:
x <- c("(1, 21)", "(3, 31)")
Tidyverse way
x <- c("(1, 21)", "(33, 99)", "(1, 7)")
library(tidyverse)
map_dfr(str_split(str_replace(x, '\\((\\d+)\\,\\s(\\d+)\\)', '\\1 \\2'), ' '), ~ set_names(.x, c('A', 'B')))
#> # A tibble: 3 x 2
#> A B
#> <chr> <chr>
#> 1 1 21
#> 2 33 99
#> 3 1 7
Created on 2021-06-02 by the reprex package (v2.0.0)

Extracting digits before and after a forward slash /

I have trouble with extracting the string before and after /.
x <- c("maximusa/b=5/1","maximusa/b=-4/1","maximusa/b=3/-2")
before_slash=sub(".*=(\\d+).*","\\1", x, perl = TRUE)
gives
"5" "maximusa/b=-4/1" "3"
then
after_slash=sub("^.*\\/(d+)","\\1", x, perl = TRUE)
gives
"maximusa/b=5/1" "maximusa/b=-4/1" "maximusa/b=3/-2"
OTOH, the expected output is
before slash: 5 -4 3
after slash: 1 1 -2
How can I get the expected output?
Thanks for any answers.
I would like to add one more condition for extracting the strings. Assume we have strings like the ones below. As in the original question, how could we extract values with a + sign as well, while ignoring the parentheses? The current solution of @mob gives
x <- c("maximusa/b=(5/+1)","maximusa/b=(-4/1)","maximusa/b=(+3/-2)")
after_slash=sub("^.*/(\\d+)","\\1", x, perl = TRUE)
> after_slash
[1] "maximusa/b=(5/+1)" "1)" "maximusa/b=(+3/-2)"
and
before_slash=sub(".*=(-?\\d+).*","\\1", x, perl = TRUE)
> before_slash
[1] "maximusa/b=(5/+1)" "maximusa/b=(-4/1)" "maximusa/b=(+3/-2)"
I tried some but no luck!
One problem is that
after_slash=sub("^.*\\/(d+)","\\1", x, perl = TRUE)
should be
after_slash=sub("^.*/(\\d+)","\\1", x, perl = TRUE)
To capture negative integers as well, you'll want to use
before_slash=sub(".*=(-?\\d+).*","\\1", x, perl = TRUE)
after_slash=sub("^.*/(-?\\d+)","\\1", x, perl = TRUE)
The token -? means "the - character, 0 or 1 times".
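With both fixes applied to the original x (no parentheses), a quick check should give the values asked for:
x <- c("maximusa/b=5/1","maximusa/b=-4/1","maximusa/b=3/-2")
sub(".*=(-?\\d+).*", "\\1", x, perl = TRUE)
# [1] "5"  "-4" "3"
sub("^.*/(-?\\d+)", "\\1", x, perl = TRUE)
# [1] "1"  "1"  "-2"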
We can use str_extract_all to match a - (if any) followed by one or more digits ([0-9]+), then convert the matches to numeric:
library(tidyverse)
map_dfc(str_extract_all(x, "-?[0-9]+"), as.numeric)
# A tibble: 2 x 3
# V1 V2 V3
# <dbl> <dbl> <dbl>
#1 5 -4 3
#2 1 1 -2
Or use read.table after extracting the substring with sub, specifying sep = "/" to create a two-column data.frame:
read.table(text= sub(".*=", "", x), sep="/")
# V1 V2
#1 5 1
#2 -4 1
#3 3 -2
Or another option is strsplit
sapply(strsplit(x, "[=/]"), `[`, 3:4)
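With the original x, this should return a two-row character matrix (the values come back as strings):
#      [,1] [,2] [,3]
# [1,] "5"  "-4" "3"
# [2,] "1"  "1"  "-2"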
Update
If the OP's strings have () as well, the first option should still work, but in the second option we can change it to:
x1 <- c("maximusa/b=(5/1)","maximusa/b=(-4/1)","maximusa/b=(3/-2)")
read.table(text= gsub(".*=|[()]", "", x1), sep="/")
# V1 V2
#1 5 1
#2 -4 1
#3 3 -2
This should work, too.
matrix(as.numeric(unlist(strsplit(
gsub("(^\\w*\\/)(b=)(-?\\d)(\\/)(-?\\d$)", "\\3 \\5", x), " "))), 2)
# [,1] [,2] [,3]
# [1,] 5 -4 3
# [2,] 1 1 -2

How to split a character value properly

I have a data frame which contains some composite information. I would like to split the vector a into vectors "a" and "d", where "a" contains only the numeric IDs 898, 3467, 234, 222 and "d" contains the corresponding character values.
Data:
a<-c("898_Me","3467_You or ", "234_Hi-hi", "222_what")
b<-c(1,8,3,8)
c<-c(2,4,6,2)
df<-data.frame(a,b,c)
What I tried so far:
a<-str(df$a)
a<-strsplit(df$a, split)
But that just doesn't work out with my regular expression skills.
The required output table might have the form:
a d b c
898 Me 1 2
3467 You or 8 4
234 Hi-hi 3 6
222 what 8 2
library(tidyr)
a<-c("898_Me","3467_You or ", "234_Hi-hi", "222_what")
b<-c(1,8,3,8)
c<-c(2,4,6,2)
df <-data.frame(a,b,c)
final_df <- separate(df, a, c("a", "d"), sep = "_")
# a d b c
#1 898 Me 1 2
#2 3467 You or 8 4
#3 234 Hi-hi 3 6
#4 222 what 8 2
final_df$d
# [1] "Me" "You or " "Hi-hi" "what"
strsplit is right, but you need to pass the character to split on:
do.call(rbind, strsplit(as.character(df$a), "_"))
# [,1] [,2]
# [1,] "898" "Me"
# [2,] "3467" "You or "
# [3,] "234" "Hi-hi"
# [4,] "222" "what"
Or
library(stringi)
stri_split_fixed(df$a, "_", simplify = TRUE)
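Either way you get a two-column character matrix; a minimal sketch (not part of the original answer) of rebuilding the requested a/d/b/c layout from it:
pieces <- do.call(rbind, strsplit(as.character(df$a), "_"))
data.frame(a = as.numeric(pieces[, 1]), d = pieces[, 2], b = df$b, c = df$c)
# a d b c
#1 898 Me 1 2
#2 3467 You or 8 4
#3 234 Hi-hi 3 6
#4 222 what 8 2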
With your example, here is my solution in base R:
df$a2 <- gsub("[^0-9]", "", a)
df$d <- gsub("[0-9]", "", a)
That gives:
> df
a b c a2 d
1 898_Me 1 2 898 _Me
2 3467_You or 8 4 3467 _You or
3 234_Hi-hi 3 6 234 _Hi-hi
4 222_what 8 2 222 _what
Not elegant, but it preserves the original data and is easy to apply.
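If the leading underscore in d is unwanted, one more substitution cleans it up (assumed tweak):
df$d <- sub("^_", "", df$d)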

R data frame format group

I have a data frame in this format:
ABC 2
ABC 4
ABC 6
DEF 10
DEF 20
How can I get it into this form:
ABC 2 4 6
DEF 10 20
I tried the aggregate function, but it needs functions like mean/sum as parameters. How can I just display the values directly in the row?
df <- read.table(sep=" ", header=F, text="
ABC 2
ABC 4
ABC 6
DEF 10
DEF 20")
unstack(df, form=V2~V1)
# $ABC
# [1] 2 4 6
#
# $DEF
# [1] 10 20
unstack produces a list in this case because the groups don't have the same length. When they do have the same length:
df <- read.table(sep=" ", header=F, text="
ABC 2
ABC 4
ABC 6
DEF 10
DEF 20
DEF 20")
t(unstack(df, form=V2~V1))
# [,1] [,2] [,3]
# ABC 2 4 6
# DEF 10 20 20
Well, what are the observations? Are they supposed to measure the same thing for each category?
You can't actually get a data frame exactly as you have posted, because the number of observations per category differs. But you could do it if you add an NA to "DEF".
Like this:
ABC 2 4 6
DEF 10 20 NA
If that is what you want, you could just use reshape2's dcast.
But you have to name the observations:
library(reshape2)
df <- data.frame(obs =c(1:3, 1:2),
categories = c(rep("ABC", 3), rep("DEF",2)),
values=c(2,4,6,10,20), stringsAsFactors=FALSE)
df2 <- dcast(df, categories~obs)
df2
# categories 1 2 3
# 1 ABC 2 4 6
# 2 DEF 10 20 NA
To add to your alternatives:
This seems to be a basic "long to wide" reshape problem, but it is missing a "time" variable. It's easy to recreate one by using ave:
ave(as.character(df$V1), df$V1, FUN = seq_along)
# [1] "1" "2" "3" "1" "2"
df$time <- ave(as.character(df$V1), df$V1, FUN = seq_along)
Once you have a "time" variable, using reshape is pretty straightforward:
reshape(df, idvar="V1", timevar="time", direction = "wide")
# V1 V2.1 V2.2 V2.3
# 1 ABC 2 4 6
# 4 DEF 10 20 NA
If, instead, you wanted a list, there is no need for the time variable. Just use split:
split(df$V2, df$V1)
# $ABC
# [1] 2 4 6
#
# $DEF
# [1] 10 20
#
Similarly, if your data were balanced, split plus rbind could get you what you need. Using the sample data from @lukeA:
df <- read.table(sep=" ", header=F, text="
ABC 2
ABC 4
ABC 6
DEF 10
DEF 20
DEF 20")
do.call(rbind, split(df$V2, df$V1))
# [,1] [,2] [,3]
# ABC 2 4 6
# DEF 10 20 20
You want to obtain a sparse matrix? The two rows in your example have different lengths. Try a function producing a list:
mat <- cbind(
  c("ABC", "ABC", "ABC", "DEF", "DEF"),
  c(2, 4, 6, 10, 20)
)
count <- function(mat) {
  values <- unique(mat[, 1])
  outlist <- list()
  for (v in values) {
    outlist[[v]] <- mat[mat[, 1] == v, 2]
  }
  return(outlist)
}
count(mat)
Which will give you this result:
$ABC
[1] "2" "4" "6"
$DEF
[1] "10" "20"

How to create a new data frame with original data separated by ; and with different counts per category?

I have a table with the following format.
df1 <- data.frame (A=c("aaa", "bbb", "ccc", "ddd"),
B=c("111; 222", "333", "444; 555; 666; 777", "888; 999"))
A B
1 aaa 111; 222
2 bbb 333
3 ccc 444; 555; 666; 777
4 ddd 888; 999
I want to have a dataframe like this:
aaa 111
aaa 222
bbb 333
ccc 444
ccc 555
ccc 666
ccc 777
ddd 888
ddd 999
I found a wonderful solution for converting a similar list to a data frame in a previous Stack Overflow question. However, it is difficult for me to adapt it to a data frame where each cell holds multiple entries. How can I do this?
Here is a simple base R solution (explanation below):
spl <- with(df1, strsplit(as.character(B), split = "; ", fixed = TRUE))
lens <- sapply(spl, length)
out <- with(df1, data.frame(A = rep(A, lens), B = unlist(spl)))
Which gives us:
R> out
A B
1 aaa 111
2 aaa 222
3 bbb 333
4 ccc 444
5 ccc 555
6 ccc 666
7 ccc 777
8 ddd 888
9 ddd 999
What is the code doing? Line 1:
spl <- with(df1, strsplit(as.character(B), split = "; ", fixed = TRUE))
breaks apart each of the strings in B using "; " as the characters to split on. We use fixed = TRUE (as suggested by @Marek in the comments) to speed up the matching and splitting as in this case we do not need to match using a regular expression, we simply want to match on the stated string. This gives us a list with the various elements split out:
R> spl
[[1]]
[1] "111" "222"
[[2]]
[1] "333"
[[3]]
[1] "444" "555" "666" "777"
[[4]]
[1] "888" "999"
The next line simply counts how many elements there are in each component of the list spl
lens <- sapply(spl, length)
which gives us a vector of lengths:
R> lens
[1] 2 1 4 2
The final line of the solution plugs the outputs from the two previous steps into a new data frame. The trick is to repeat each element of df1$A lens times, for which we use the rep() function. We also need to unwrap the list spl into a vector, which we do with unlist():
out <- with(df1, data.frame(A = rep(A, lens), B = unlist(spl)))
Literally the same as step one in my answer to your previous question:
library(reshape)
x <- melt((strsplit(as.character(df1$B), "; ")))
x <- data.frame("A"=df1[x$L1,1],"B"=x$value)
