I'm still learning my way around regex, so help is much appreciated. I'm trying to extract the string at the beginning of the file name, as well as the last two characters inside the first pair of square brackets of "File" below, to generate the "Image" and "ID" variables with mutate, as shown in data.out.
data<- data.frame("File"= c("TA1317_Scan3_Core[1,2,A]_[7473,42737]_component_data",
"TA 2654_Scan1_Core[1,3,A]_[6700,36673]_component_data"))
data.out<- data %>% data.frame("Image"= c("TA1317", "TA2654"), "ID" = c("2A", "3A"))
File Image ID
1 TA1317_Scan3_Core[1,2,A]_[7473,42737]_component_data TA1317 2A
2 TA 2654_Scan1_Core[1,3,A]_[6700,36673]_component_data TA2654 3A
Another alternative is strcapture, with only one regex pattern instead of two:
out <- strcapture("^([^_]*).*?\\[[^,]*,([^,]*,[^,*])\\].*", data$File, list(Image = "", ID = ""))
out$ID <- gsub(",", "", out$ID, fixed = TRUE)
out
# Image ID
# 1 TA1317 2A
# 2 TA 2654 3A
cbind(data, out)
# File Image ID
# 1 TA1317_Scan3_Core[1,2,A]_[7473,42737]_component_data TA1317 2A
# 2 TA 2654_Scan1_Core[1,3,A]_[6700,36673]_component_data TA 2654 3A
Within a dplyr pipe, you can still use it:
library(dplyr)
data %>%
bind_cols(strcapture("^([^_]*).*?\\[[^,]*,([^,]*,[^,*])\\].*", .$File, list(Image = "", ID = ""))) %>%
mutate(ID = gsub(",", "", ID, fixed = TRUE))
# File Image ID
# 1 TA1317_Scan3_Core[1,2,A]_[7473,42737]_component_data TA1317 2A
# 2 TA 2654_Scan1_Core[1,3,A]_[6700,36673]_component_data TA 2654 3A
You can try :
transform(data, Image = sub('([A-Z0-9\\s]+)_.*', '\\1', File),
ID = sub('.*\\[.*(\\d+),([A-Z])\\].*', '\\1\\2', File))
# File Image ID
#1 TA1317_Scan3_Core[1,2,A]_[7473,42737]_component_data TA1317 2A
#2 TA 2654_Scan1_Core[1,3,A]_[6700,36673]_component_data TA 2654 3A
where Image captures one or more occurrence of A-Z, 0-9 or whitespace.
and ID consists of a number followed by a comma and a letter between square brackets.
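The outputs above keep the space in "TA 2654", whereas the question asks for "TA2654". A minimal base R sketch of one way to close that gap, assuming the ID always sits in the first bracketed group and that any whitespace in the prefix can simply be dropped (the vector name `files` is made up for illustration):

```r
# Hypothetical input vector mirroring the "File" column from the question
files <- c("TA1317_Scan3_Core[1,2,A]_[7473,42737]_component_data",
           "TA 2654_Scan1_Core[1,3,A]_[6700,36673]_component_data")

# Image: drop everything from the first underscore on, then remove whitespace
image <- gsub("[[:space:]]", "", sub("_.*", "", files))

# ID: inside the first [...] group, skip the first field and join the next two
id <- sub("^[^[]*\\[[^,]*,([^,]*),([^]]*)\\].*$", "\\1\\2", files)

data.frame(File = files, Image = image, ID = id)
```

Splitting the work into two small patterns keeps each one easy to check against a sample file name.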
So I have a string that I'm attempting to parse through and then create 3 columns with the data I extract. From what I've seen, stringr doesn't really cover this case and the gsub I've used so far is excessive and involves me making multiple columns, parsing from those new columns, and then removing them and that seems really inefficient.
The format is this:
"blah, grabbed by ???-??-?????."
I need this:
???-??-?????
I've used placeholders here, but this is how the string typically looks
"blah, grabbed by PHI-80-J.Matthews."
or
"blah, grabbed by NE-5-J.Mills."
and sometimes there is text after the name like this:
"blah, grabbed by KC-10-T.Hill. Blah blah blah."
This is what I would like the end result to be:
Place Number Name
PHI   80     J.Matthews
NE    5      J.Mills
KC    10     T.Hill
Edit for further explanation:
Most strings include other people in the same format, so "grabbed by" needs to be incorporated in some way to make sure the right name is captured.
Ex.
"Throw by OAK-4-D.Carr, snap by PHI-62-J.Kelce, grabbed by KC-10-T.Hill. Penalty on OAK-4-D.Carr"
Desired Output:
Place Number Name
KC    10     T.Hill
This solution simply extracts the components based on the logic the OP mentioned, i.e. it captures the needed characters as three groups: 1) one or more upper case letters ([A-Z]+) followed by a dash (-), 2) one or more digits (\\d+) followed by a dash, and 3) the non-whitespace characters (\\S+) that follow, up to the closing dot.
library(tidyr)
extract(df1, col1, into = c("Place", "Number", "Name"),
".*grabbed by\\s([A-Z]+)-(\\d+)-(\\S+)\\..*", convert = TRUE)
-output
# A tibble: 4 x 3
Place Number Name
<chr> <int> <chr>
1 PHI 80 J.Matthews
2 NE 5 J.Mills
3 KC 10 T.Hill
4 KC 10 T.Hill
Or do this in base R
read.table(text = sub(".*grabbed by\\s((\\w+-){2}\\S+)\\..*", "\\1",
df1$col1), header = FALSE, col.names = c("Place", "Number", "Name"), sep='-')
Place Number Name
1 PHI 80 J.Matthews
2 NE 5 J.Mills
3 KC 10 T.Hill
data
df1 <- structure(list(col1 = c("blah, grabbed by PHI-80-J.Matthews.",
"blah, grabbed by NE-5-J.Mills.", "blah, grabbed by KC-10-T.Hill. Blah blah blah.",
"Throw by OAK-4-D.Carr, snap by PHI-62-J.Kelce, grabbed by KC-10-T.Hill. Penalty on OAK-4-D.Carr"
)), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"
))
This solution actually does what you say in the title, namely first remove the text around the target substring, then split it into columns:
library(dplyr)
library(tidyr)
library(stringr)
df1 %>%
mutate(col1 = str_extract(col1, "\\w+-\\w+-\\w\\.\\w+")) %>%
separate(col1,
into = c("Place", "Number", "Name"),
sep = "-")
# A tibble: 3 x 3
Place Number Name
<chr> <chr> <chr>
1 PHI 80 J.Matthews
2 NE 5 J.Mills
3 KC 10 T.Hill
Here, we make use of the fact that the character class \\w is for letters irrespective of case and for digits (and also for the underscore).
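One caveat: str_extract returns only the first match, so for a string that names several people, such as the "Throw by OAK-4-D.Carr, ..." example from the question, it would pick up the first person rather than the one after "grabbed by". A minimal base R sketch that anchors on that phrase, assuming every string contains exactly one "grabbed by" clause (the names `x`, `hits`, `parts` and `out` are made up for illustration):

```r
# Sample strings taken from the question
x <- c("blah, grabbed by PHI-80-J.Matthews.",
       "blah, grabbed by NE-5-J.Mills.",
       "blah, grabbed by KC-10-T.Hill. Blah blah blah.",
       "Throw by OAK-4-D.Carr, snap by PHI-62-J.Kelce, grabbed by KC-10-T.Hill. Penalty on OAK-4-D.Carr")

# regexec() returns the full match followed by the three capture groups
hits <- regmatches(x, regexec("grabbed by ([A-Z]+)-([0-9]+)-([A-Z]\\.[A-Za-z]+)", x))

# Drop the full match and stack the groups row by row
# (assumes every string matched; a non-matching string would need handling)
parts <- do.call(rbind, lapply(hits, function(hit) hit[-1]))

out <- data.frame(Place  = parts[, 1],
                  Number = as.integer(parts[, 2]),
                  Name   = parts[, 3],
                  stringsAsFactors = FALSE)
out
```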
Here is an alternative way: sub with the regex "([A-Za-z]+\\.[A-Za-z]+).*" and the replacement "\\1" removes everything after the name, i.e. from the second dot onward. A first separate then splits the string at " by ", and a second separate produces the desired columns.
library(dplyr)
library(tidyr)
df1 %>%
mutate(test1 = sub("([A-Za-z]+\\.[A-Za-z]+).*", "\\1", col1)) %>%
separate(test1, c('remove', 'keep'), sep = " by ") %>%
separate(keep, c("Place", "Number", "Name"), sep = "-") %>%
select(Place, Number, Name)
Output:
Place Number Name
<chr> <chr> <chr>
1 PHI 80 J.Matthews
2 NE 5 J.Mills
3 KC 10 T.Hill
I'm reading a text file containing data with a "problematic" line. The last line, which starts with *NOTE, has to be removed (the number of rows in the text file is not always the same):
ColumnA ColumnB ColumnC
A2 17 14
B2 20 -1
C2 21 36
*NOTE: -1 = data do not exist
This is my line to read the text file (I have to select the text file since its location is not constant):
my_data <- read.delim(file.choose(), header = TRUE, sep = "", quote = "",
dec = ".", fill = TRUE, comment.char = "")
I have tried :
my_data[- grep("*NOTE:", my_data$ColumnA),]
But it does not seem to work.
Any simple solutions to this?
You could call read.delim with comment.char = "*":
my_data <- read.delim(file.choose(), header = TRUE, sep = "", quote = "",
dec = ".", fill = TRUE, comment.char = "*")
This will remove the final line when you are reading it in because it starts with *.
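To try this without a file dialogue, here is a small sketch that feeds the sample from the question through the `text` argument instead of a file. It also shows a second route, filtering the raw lines with startsWith before parsing; the object names `txt`, `lines` and `kept` are made up for illustration:

```r
# Sample file content from the question, held in a string
txt <- "ColumnA ColumnB ColumnC
A2 17 14
B2 20 -1
C2 21 36
*NOTE: -1 = data do not exist"

# Option 1: treat '*' as the comment character so the note line is ignored
my_data <- read.delim(text = txt, header = TRUE, sep = "", quote = "",
                      dec = ".", fill = TRUE, comment.char = "*")

# Option 2: drop the note line from the raw lines, then parse what is left
lines <- strsplit(txt, "\n", fixed = TRUE)[[1]]
kept  <- lines[!startsWith(lines, "*NOTE")]
my_data2 <- read.table(text = kept, header = TRUE)

my_data
```

The first option only works if no legitimate data value contains an asterisk, since everything after a `*` on any line is discarded.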
Another option is fread from data.table. fread has a fancy autostart feature which automagically drops lines without the expected number of columns:
library(data.table)
fread(file.choose())
There is another way to handle this, which is to write a short function that takes the regexes that you want to filter out. You can feed it the file name, but if this is missing it will give you the file dialogue:
read_broken <- function(file_path, filter_out = "^[*]NOTE:")
{
if(missing(file_path)) file_path <- file.choose()
x <- suppressWarnings(readLines(file_path))
x <- x[nzchar(x)]
x <- x[!apply(sapply(filter_out, grepl, x), 1, any)]
read.delim(text = x, header = TRUE, sep = "", quote = "", dec = ".", fill = TRUE)
}
So you can do:
read_broken("myfile.txt")
#> ColumnA ColumnB ColumnC
#> 1 A2 17 14
#> 2 B2 20 -1
#> 3 C2 21 36
Or
read_broken("myfile.txt", filter_out = c("^[*]NOTE:", "A2"))
#> ColumnA ColumnB ColumnC
#> 1 B2 20 -1
#> 2 C2 21 36
I have a column called 'WFBS' that has over a million rows of strings of different lengths that look like this:
WFBS <- c("M010203", "S01020304", "N104509")
and I need an output that looks like this:
WFBS1 <- c("M01", "S01", "N10")
WFBS2 <- c("02", "02", "45")
WFBS3 <- c("03", "03", "09")
WFBS4 <- c(NA, "04", NA)
So I need to separate each string in:
first column: 3 characters (ie the letter followed by 2 digits)
rest of the columns: 2 characters per column until I have no characters left
I tried using the function strsplit, but it says that my variables are not characters, so then I created a vector x as follows:
x <- as.character(WFBS)
but then I don't know how to separate the string into columns with the function strsplit.
An option in base R is to insert a delimiter (,) with sub and then read the result with read.csv to create a 4-column data.frame:
read.csv(text = sub("^(...)(..)(..)(.*)", "\\1,\\2,\\3,\\4", WFBS),
header = FALSE, colClasses = rep("character", 4), na.strings = "",
col.names =paste0("WFBS", 1:4), stringsAsFactors = FALSE)
# WFBS1 WFBS2 WFBS3 WFBS4
#1 M01 02 03 <NA>
#2 S01 02 03 04
#3 N10 45 09 <NA>
This might be a useful starting point:
library(tidyr)
df <- data.frame(WFBS = c("M010203", "S01020304", "N104509"),
stringsAsFactors = FALSE)
df %>% separate(col = WFBS,
into = c("WFBS1","WFBS2","WFBS3","WFBS4"),
sep = c(3,5,7))
WFBS1 WFBS2 WFBS3 WFBS4
1 M01 02 03
2 S01 02 03 04
3 N10 45 09
This leaves you with empty strings rather than NAs in the remainder spots, which you'd have to convert.
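Both answers above hard-code four columns. A minimal base R sketch, assuming the layout from the question (a 3-character first piece followed by 2-character pieces), that derives the number of columns from the longest string and converts empty pieces to NA; the helper names `n_cols`, `starts`, `stops`, `parts` and `out` are made up for illustration:

```r
WFBS <- c("M010203", "S01020304", "N104509")

# One 3-character column plus as many 2-character columns as the longest string needs
n_cols <- 1 + ceiling((max(nchar(WFBS)) - 3) / 2)
starts <- c(1, seq(4, by = 2, length.out = n_cols - 1))
stops  <- c(3, starts[-1] + 1)

# substr() is vectorised over the strings, so each call yields one column
parts <- sapply(seq_len(n_cols), function(j) substr(WFBS, starts[j], stops[j]))

# Positions past the end of a string come back as "", which we turn into NA
parts[!nzchar(parts)] <- NA

out <- as.data.frame(parts, stringsAsFactors = FALSE)
names(out) <- paste0("WFBS", seq_len(n_cols))
out
```

Because only substr is involved, this should scale reasonably to the million-row column mentioned in the question, though that has not been measured here.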
My dataset looks like this below
Id Col1
--------------------
133 Mary 7E
281 Feliz 2D
437 Albert 4C
What I am trying to do is to take the 1st two characters from the 1st word in Col1 and all the whole second word and then merge them.
My final expected dataset should look like this below
Id Col1
--------------------
133 MA7E
281 FE2D
437 AL4C
Any suggestions on how to accomplish this is much appreciated.
You can do
my_data$Col1 <- sub("(\\w{2})(\\w* )(\\b\\w+\\b)", "\\1\\3", my_data$Col1)
my_data$Col1 <- toupper(my_data$Col1)
my_data
# Id Col1
# 1 133 MA7E
# 2 281 FE2D
# 3 437 AL4C
The parentheses delimit the groups that are matched; only the first and the third are retained in the replacement. \\w matches letters, digits and the underscore, and \\b matches a word boundary.
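To see what each group captures, the groups can be printed with regexec and regmatches. This is only an illustrative sketch: it uses a slightly simplified, anchored variant of the pattern (no \\b), and the names `x`, `groups` and `result` are made up:

```r
x <- c("Mary 7E", "Feliz 2D", "Albert 4C")

# Full match first, then group 1 (first two characters),
# group 2 (rest of the first word plus the space) and group 3 (second word)
groups <- regmatches(x, regexec("^(\\w{2})(\\w* )(\\w+)$", x))
groups[[1]]

# Keeping groups 1 and 3 and upper-casing gives the requested code
result <- toupper(sub("^(\\w{2})(\\w* )(\\w+)$", "\\1\\3", x))
result
```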
We can also paste0 together the output of substr and a string split within a dplyr pipe chain. The second word has to be taken row by row (here with str_split_fixed), not only from the first element of the split:
df <- data.frame(id = c(133,281,437),
Col1 = c("Mary 7E", "Feliz 2D", "Albert 4C"))
library(dplyr)
library(stringr)
df %>%
mutate(Col1 = toupper(paste0(substr(Col1, 1, 2),
str_split_fixed(Col1, ' ', 2)[, 2])))
You can do this in several steps. First split by space, subset the first two letters of the name and capitalize them. Paste that together with the second part. The result is in column final. You could keep all these intermediate steps or chain the commands into fewer statements, whatever floats your boat.
xy <- data.frame(id = c(133, 281, 437),
name = c("Mary 7E", "Feliz 2D", "Albert 4C"),
stringsAsFactors = FALSE)
xy$first <- sapply(strsplit(xy$name, " "), "[", 1)
xy$second <- sapply(strsplit(xy$name, " "), "[", 2)
xy$first_upper <- toupper(substr(x = xy$first, start = 1, stop = 2))
xy$final <- paste(xy$first_upper, xy$second, sep = "")
xy
id name first second first_upper final
1 133 Mary 7E Mary 7E MA MA7E
2 281 Feliz 2D Feliz 2D FE FE2D
3 437 Albert 4C Albert 4C AL AL4C
Here is another variation using sub. We can use lookarounds in Perl mode to selectively remove everything except for the first two, and last two, characters. Then, make a call to toupper() to capitalize all letters.
df$Col1 <- toupper(sub("(?<=^..).*(?=..$)", "", df$Col1, perl = TRUE))
[1] "MA7E" "FE2D" "AL4C"
Rather than a one-line solution, this version is easier to interpret and modify:
library(dplyr)
library(stringi)
xx_df <- data.frame(id = c(133,281,437),
Col1 = c("Mary 7E", "Feliz 2D", "Albert 4C"))
xx_df %>%
mutate(xpart1 = stri_split_fixed(Col1, " ", simplify = T)[,1]) %>%
mutate(xpart2 = stri_split_fixed(Col1, " ", simplify = T)[,2]) %>%
mutate(Col1_new = paste0(substr(xpart1,1,2), substr(xpart2, 1, 2))) %>%
select(id, Col1 = Col1_new) %>%
mutate(Col1 = toupper(Col1))
result is
id Col1
1 133 MA7E
2 281 FE2D
3 437 AL4C
This solution uses substr to take the first 2 and the last 2 characters of each string. To select the last 2 we need the string lengths from nchar, computed here via sapply. The two pieces are then joined with paste0, with toupper applied to the first piece to get capital letters.
l2 <- sapply(df$Col1, function(x) nchar(x))
paste0(toupper(substr(df$Col1,1,2)), substr(df$Col1, l2-1, l2))
[1] "MA7E" "FE2D" "AL4C"
I have a data set in RStudio (Aud) that looks like the following. ID is of type character, and Function is of type character as well.
ID Function
F04 FZ000TTY WB002FR088DR011
F05 FZ000AGH WZ004ABD
F06 FZ0005ABD
My goal is to extract only the codes "FZ", "TTY", "WB", "FR", "WZ" and "ABD" from all the rows in the data set and place each in its own new column, so that I have something like the following as an example:
ID Function SUBFUN1 SUBFUN2 SUBFUN3 SUBFUN4 SUBFUN5
F04 FZ000TTY WB002FR088DR011 FZ TTY WB FR DR
I want to individualize the functions since they represent a certain behavior, and that way I can plot, per ID, the behaviors or functions which occur the most over a course of time.
I tried the following:
Aud$Subfun1<-
ifelse(grepl("FZ",Aud$Functions.NO.)==T,"FZ", "Other")
Aud$Subfun2<-
ifelse(grepl("TTY",Aud$Functions.NO.)==T,"TTY","Other")
I get the error message below in my attempts for subfun1 & subfun2:
Error in `$<-.data.frame`(`*tmp*`, Subfun1, value = logical(0)) :
replacement has 0 rows, data has 343456
Error in `$<-.data.frame`(`*tmp*`, Subfun2, value = logical(0)) :
replacement has 0 rows, data has 343456
I also tried substring() but substring seems to require a start and an end for the character range that needs to be captured in the new column. This is not ideal as the codes FZ, TTY, WB, FR, WZ and ABD all appear at different parts of the function string
Any help would be greatly appreciated with this
Using data.table:
library(data.table)
Aud <- data.frame(
ID = c("F04", "F05", "F06"),
Function = c("FZ000TTY WB002FR088DR011", "FZ000AGH WZ004ABD", "FZ0005ABD"),
stringsAsFactors = FALSE
)
setDT(Aud)
cbind(Aud, Aud[, tstrsplit(Function, "[0-9]+| ")])
ID Function V1 V2 V3 V4 V5
1: F04 FZ000TTY WB002FR088DR011 FZ TTY WB FR DR
2: F05 FZ000AGH WZ004ABD FZ AGH WZ ABD <NA>
3: F06 FZ0005ABD FZ ABD <NA> <NA> <NA>
Staying in base R one could do something like the following:
our_split <- strsplit(Aud$Function, "[0-9]+| ")
cbind(
Aud,
do.call(rbind, lapply(our_split, "length<-", max(lengths(our_split))))
)
One can use tidyr::separate to divide Function column in multiple columns using regex as separator.
library(tidyverse)
df %>%
separate(Function, into = paste("V",1:5, sep=""),
sep = "([^[:alpha:]]+)", fill="right", extra = "drop")
# ID V1 V2 V3 V4 V5
# 1 F04 FZ TTY WB FR DR
# 2 F05 FZ AGH WZ ABD <NA>
# 3 F06 FZ ABD <NA> <NA> <NA>
([^[:alpha:]]+) : separate on any run of non-alphabetic characters
Data:
df <- read.table(text=
"ID Function
F04 'FZ000TTY WB002FR088DR011'
F05 'FZ000AGH WZ004ABD'
F06 FZ0005ABD",
header = TRUE, stringsAsFactors = FALSE)
A tidyverse way that makes use of stringr::str_extract_all to get a nested list of all occurrences of the search terms, then spreads into the wide format you have as your desired output. If you were extracting any sets of consecutive capital letters, you could use "[A-Z]+" as your search term, but since you said it was these specific IDs, you need a more specific search term. If putting the regex becomes cumbersome, say if you have a vector of many of these IDs, you could paste it together and collapse by |.
library(tidyverse)
Aud <- data_frame(
ID = c("F04", "F05", "F06"),
Function = c("FZ000TTY WB002FR088DR011", "FZ000AGH WZ004ABD", "FZ0005ABD")
)
search_terms <- "(FZ|TTY|WB|FR|WZ|ABD)"
Aud %>%
mutate(code = str_extract_all(Function, search_terms)) %>%
select(-Function) %>%
unnest(code) %>%
group_by(ID) %>%
mutate(subfun = row_number()) %>%
spread(key = subfun, value = code, sep = "")
#> # A tibble: 3 x 5
#> # Groups: ID [3]
#> ID subfun1 subfun2 subfun3 subfun4
#> <chr> <chr> <chr> <chr> <chr>
#> 1 F04 FZ TTY WB FR
#> 2 F05 FZ WZ ABD <NA>
#> 3 F06 FZ ABD <NA> <NA>
Created on 2018-07-11 by the reprex package (v0.2.0).
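The suggestion above about pasting a vector of IDs together and collapsing by | can be sketched in base R as follows. This is an illustration only; the names `codes`, `pattern`, `fun` and `hits` are made up, and it assumes no code is a prefix of another, since alternation tries the alternatives from left to right:

```r
# Codes of interest, kept as a plain character vector
codes <- c("FZ", "TTY", "WB", "FR", "WZ", "ABD")

# Build "(FZ|TTY|WB|FR|WZ|ABD)" from the vector
pattern <- paste0("(", paste(codes, collapse = "|"), ")")

fun <- c("FZ000TTY WB002FR088DR011", "FZ000AGH WZ004ABD", "FZ0005ABD")

# gregexpr() finds every match; regmatches() pulls them out as a list
hits <- regmatches(fun, gregexpr(pattern, fun))
hits
```

The resulting list has one character vector per row, which can then be padded to a common length or unnested as in the answer above.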