I would like to split text in a column by '\' using the separate function in tidyr. Given this example data...
library(tidyr)
df1 <- structure(list(Parent.objectId = 1:2, Attachment.path = c("photos_attachments\\photos_image-20220602-192146.jpg",
"photos_attachments\\photos_image-20220602-191635.jpg")), row.names = 1:2, class = "data.frame")
And I've tried multiple variations of this...
df2 <- df1 %>%
separate(Attachment.path,c("a","b","c"),sep="\\",remove=FALSE,extra="drop",fill="right")
Which doesn't result in an error, but it doesn't split the string into two columns, likely because I'm not using the correct regular expression for the single backslash.
We need to escape the backslash twice: once for the R string and once for the regular expression, which gives "\\\\"
library(tidyr)
separate(df1, Attachment.path,c("a","b","c"),
sep= "\\\\", remove=FALSE, extra="drop", fill="right")
According to ?separate
sep - ... The default value is a regular expression that matches any sequence of non-alphanumeric values.
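A minimal base R sketch of why four backslashes are needed. It uses strsplit() rather than separate() only so that it runs without extra packages; the variable names are illustrative.

```r
# In R source code, "\\" is a string holding ONE backslash
path <- "photos_attachments\\photos_image-20220602-192146.jpg"
n_chars <- nchar("\\")

# As a regex, that single backslash must itself be escaped, hence "\\\\"
parts_regex <- strsplit(path, "\\\\")[[1]]

# With fixed = TRUE the pattern is taken literally, so "\\" is enough
parts_fixed <- strsplit(path, "\\", fixed = TRUE)[[1]]

print(parts_regex)
```

Both calls should give the same two pieces, the folder and the file name.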
If you are splitting on \ to get folder and file names, try these two functions instead. Note that basename() and dirname() treat \ as a path separator only on Windows:
#get filenames
basename(df1$Attachment.path)
# [1] "photos_image-20220602-192146.jpg" "photos_image-20220602-191635.jpg"
#get foldernames
basename(dirname(df1$Attachment.path))
# [1] "photos_attachments" "photos_attachments"
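On a non-Windows system the backslash is not treated as a path separator, so a portable variant is to convert the backslashes to forward slashes first. This is a sketch of that idea; the variable names are illustrative.

```r
paths <- c("photos_attachments\\photos_image-20220602-192146.jpg",
           "photos_attachments\\photos_image-20220602-191635.jpg")

# Replace every literal backslash with a forward slash
normalized <- gsub("\\", "/", paths, fixed = TRUE)

files   <- basename(normalized)            # file names
folders <- basename(dirname(normalized))   # immediate parent folder

print(files)
print(folders)
```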
I have a data frame with a string column "disease". I want to filter the rows with a partial match on "trauma" or "Trauma". So far I have done the following using dplyr and stringr:
trauma_set <- df %>% filter(str_detect(disease, "trauma|Trauma"))
But the result also includes "Nontraumatic" and "nontraumatic". How can I filter only "trauma, Trauma, traumatic or Traumatic" without including nontrauma or Nontrauma? Also, is there a way I can define the string to detect without having to specify both uppercase and lowercase version of the string (as in both trauma and Trauma)?
To require a word boundary, put \\b at the start of the pattern. To match regardless of case, wrap the pattern in regex() with ignore_case = TRUE instead of listing both spellings
library(dplyr)
library(stringr)
out <- df %>%
filter(str_detect(disease, regex("\\btrauma", ignore_case = TRUE)))
sum(str_detect(out$disease, regex("^Non", ignore_case = TRUE)))
#[1] 0
data
set.seed(24)
df <- data.frame(disease = sample(c("Nontraumatic", "Trauma",
"Traumatic", "nontraumatic", "traumatic", "trauma"), 50 ,
replace = TRUE), value = rnorm (50))
You were very close to a correct solution; you just needed to add the "start of string" anchor ^ (this assumes the value begins with the word), as follows:
trauma_set <- df %>% filter(str_detect(disease, "^trauma|^Trauma"))
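The word-boundary and start-of-string approaches agree on the example data but differ when the word appears later in the string. The base R sketch below shows the difference; "head trauma" is a hypothetical value added for illustration and is not in the original data.

```r
# "head trauma" is a hypothetical extra value, not part of the original data
disease <- c("Nontraumatic", "Trauma", "Traumatic", "nontraumatic",
             "traumatic", "trauma", "head trauma")

# Word boundary: "trauma" must start a word anywhere in the string
boundary_hit <- grepl("\\btrauma", disease, ignore.case = TRUE, perl = TRUE)

# Start-of-string anchor: "trauma" must start the whole string
anchor_hit <- grepl("^trauma", disease, ignore.case = TRUE)

print(disease[boundary_hit])
print(disease[anchor_hit])
```

Neither version matches "Nontraumatic" or "nontraumatic"; only the word-boundary version keeps "head trauma".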
For any entries of the column "district" that match regex("[:alpha:]{2}AL"), I would like to replace the "AL" with "01".
For example:
df <- tibble(district = c("NY14", "MT01", "MTAL", "PA10", "KS02", "NDAL", "ND01", "AL02", "AL01"))
I tried:
df %>% mutate(district=replace(district,
str_detect(district, regex("[:alpha:]{2}AL")),
str_replace(district,"AL","01")))
and
df %>% mutate(district=replace(district,
str_detect(district, regex("[:alpha:]{2}AL")),
paste(str_sub(district, start = 1, end = 2),"01",sep = ""))
but there is a vectorization problem.
Is this ok?
str_replace_all(string=df$district,
pattern="(\\w{2})AL",
replacement="\\101")
I replaced [:alpha:] with \\w, which matches a word character (letter, digit, or underscore): https://www.regular-expressions.info/shorthand.html
In the replacement, \\1 refers to the first captured group, here the two characters matched by (\\w{2}), so the result keeps those two characters and appends 01
You can change the replace to ifelse
ifelse( str_detect(df$district, regex("[:alpha:]{2}AL")),
str_replace(df$district,"AL","01"),df$district)
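The vectorization problem in the question arises because replace(x, list, values) expects values to hold only the replacements for the selected positions, whereas the attempts pass a full-length vector. Here is a base R sketch of two ways around that, using grepl() and sub() in place of the stringr functions; the anchors ^ and $ are an added assumption that the code is exactly four characters long.

```r
district <- c("NY14", "MT01", "MTAL", "PA10", "KS02",
              "NDAL", "ND01", "AL02", "AL01")

# Which entries are two letters followed by "AL"?
hit <- grepl("^[[:alpha:]]{2}AL$", district)

# Option 1: replace(), passing only the values for the selected positions
fixed_replace <- replace(district, hit, sub("AL$", "01", district[hit]))

# Option 2: one sub() call; \\1 is the captured two-letter prefix
fixed_sub <- sub("^([[:alpha:]]{2})AL$", "\\101", district)

print(fixed_sub)
```

Entries such as "AL02" and "AL01" are left alone because "AL" is not preceded by two letters there.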
I want to extract everything but a pattern and return the result concatenated into a single string.
I tried to combine str_extract_all together with sapply and cat
x = c("a_1","a_20","a_40","a_30","a_28")
data <- tibble(age = x)
# extracting just the first pattern is easy
data %>%
mutate(age_new = str_extract(age,"[^a_]"))
# combining str_extract_all and sapply doesn't work
data %>%
mutate(age_new = sapply(str_extract_all(x,"[^a_]"),function(x) cat(x,sep="")))
class(str_extract_all(x,"[^a_]"))
sapply(str_extract_all(x,"[^a_]"),function(x) cat(x,sep=""))
Returns NULL instead of concatenated patterns
cat prints to the console and returns NULL, so there is nothing for sapply to collect. Instead of cat, we can use paste. With the tidyverse, we can also use map_chr and str_c (the stringr counterpart of paste)
library(tidyverse)
data %>%
mutate(age_new = map_chr(str_extract_all(x, "[^a_]+"), ~ str_c(.x, collapse="")))
Using the OP's code:
data %>%
mutate(age_new = sapply(str_extract_all(x,"[^a_]"),
function(x) paste(x,collapse="")))
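The difference between cat and paste can be checked directly. This base R sketch uses regmatches() and gregexpr() as stand-ins for str_extract_all, so that it runs without stringr; the variable names are illustrative.

```r
x <- c("a_1", "a_20", "a_40", "a_30", "a_28")

# cat() writes to the console; the value it returns is NULL
cat_result <- cat("2", "0", sep = "")
cat("\n")

# Extract every character that is not "a" or "_" (one list element per string)
pieces <- regmatches(x, gregexpr("[^a_]", x))

# paste() returns the collapsed string, so vapply() has something to collect
age_new <- vapply(pieces, paste, character(1), collapse = "")

print(age_new)
```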
If the intention is to get the numbers
library(readr)
data %>%
mutate(age_new = parse_number(x))
Here is a non tidyverse solution, just using stringr.
apply(str_extract_all(column,regex_command,simplify = TRUE),1,paste,collapse="")
simplify = TRUE makes str_extract_all return a character matrix (one row per input string), and apply then iterates over its rows. I got the idea from https://stackoverflow.com/a/4213674/8427463
Example: extract all 'r' in rownames(mtcars) and concatenate them, giving one string per row name
library(stringr)
apply(str_extract_all(rownames(mtcars),"r",simplify = TRUE),1,paste,collapse="")
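Extracting every non-matching character and pasting the pieces back together gives the same result as deleting the pattern, so a shorter route may be a single gsub() call. This is a base R sketch of that alternative, not a replacement for the answers above.

```r
x <- c("a_1", "a_20", "a_40", "a_30", "a_28")

# Delete "a" and "_" instead of extracting everything else
age_new <- gsub("[a_]", "", x)

# Same idea for the 'r' example: delete everything that is not a lowercase r
only_r <- gsub("[^r]", "", c("Mazda RX4", "Hornet Sportabout"))

print(age_new)
print(only_r)
```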
One of the strings in my vector (df$location1) is the following:
Potomac, MD 20854\n(39.038266, -77.203413)
Rest of the data in the vector follow same pattern. I want to separate each component of the string into a separate data element and put it in new columns like: df$city, df$state, etc.
So far I have been able to isolate the lat. long. data into a separate column by doing the following:
df$lat.long <- gsub('.*\\n\\((.*)\\)','\\1',df$location1)
I was able to make it work by looking at other codes online but I don't fully understand it. I understand the regex pattern but don't understand the "\\1" part. Since I don't understand it in full I have been unable to use it to subset other parts of this same string.
What's the best way to subset data like this?
Is using regex a good way to do this? What other ways should I be looking into?
I have looked into splitting the string after a comma, subsetting using regex, using the scan() function, and too many other variations. Now I am thoroughly confused. Thanks.
We can also use the separate function from the tidyr package (part of the tidyverse package).
library(tidyverse)
# Create example data frame
dat <- data.frame(Data = "Potomac, MD 20854\n(39.038266, -77.203413)",
stringsAsFactors = FALSE)
dat
# Data
# 1 Potomac, MD 20854\n(39.038266, -77.203413)
# Separate the Data column
dat2 <- dat %>%
separate(Data, into = c("City", "State", "Zip", "Latitude", "Longitude"),
sep = ", |\\\n\\(|\\)|[[:space:]]")
dat2
# City State Zip Latitude Longitude
# 1 Potomac MD 20854 39.038266 -77.203413
You can try strsplit or data.table::tstrsplit (strsplit + transpose):
> x <- 'Potomac, MD 20854\n(39.038266, -77.203413)'
> data.table::tstrsplit(x, ', |\\n\\(|\\)')
[[1]]
[1] "Potomac"
[[2]]
[1] "MD 20854"
[[3]]
[1] "39.038266"
[[4]]
[1] "-77.203413"
More generally, you can do this:
library(data.table)
df[c('city', 'state', 'lat', 'long')] <- tstrsplit(df$location1, ', |\\n\\(|\\)')
The pattern ', |\\n\\(|\\)' tells tstrsplit to split by ", ", "\n(" or ")".
In case you want to separate state and zip, and city names may contain spaces, you can try a two-step approach:
# original split (keep city names with space intact)
df[c('city', 'state', 'lat', 'long')] <- tstrsplit(df$location1, ', |\\n\\(|\\)')
# split state and zip
df[c('state', 'zip')] <- tstrsplit(df$state, ' ')
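The same two-step idea can be sketched in base R with strsplit() and rbind(), which may help show what tstrsplit does. The second input string is hypothetical, added only to show a city name containing a space, and perl = TRUE is used so that \\n is read as a newline.

```r
# The second element is a hypothetical row, not from the original data
location1 <- c("Potomac, MD 20854\n(39.038266, -77.203413)",
               "Silver Spring, MD 20910\n(38.99, -77.03)")

# Step 1: split on ", ", "\n(" or ")"; each string yields four pieces
parts <- strsplit(location1, ", |\\n\\(|\\)", perl = TRUE)
m <- do.call(rbind, parts)   # one row per input string

# Step 2: split "MD 20854" into state and zip on the single space
state_zip <- do.call(rbind, strsplit(m[, 2], " ", fixed = TRUE))

out <- data.frame(city  = m[, 1],
                  state = state_zip[, 1],
                  zip   = state_zip[, 2],
                  lat   = as.numeric(m[, 3]),
                  long  = as.numeric(m[, 4]),
                  stringsAsFactors = FALSE)
print(out)
```

Keeping zip as character is a deliberate choice here, since zip codes can start with a zero.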
Here is an option using base R
read.table(text= trimws(gsub(",+", " ", gsub("[, \n()]", ",", dat$Data))),
header = FALSE, col.names = c("City", "State", "Zip", "Latitude", "Longitude"),
stringsAsFactors = FALSE)
# City State Zip Latitude Longitude
#1 Potomac MD 20854 39.03827 -77.20341
So this process might be a little longer, but for me it makes things clear. As opposed to using breaks, below I identify values by using a specific regex for each value I want. I make a vector of regex to extract each value, a vector for the variable names, then use a loop to extract and create the dataframe from those vectors.
library(stringi)
library(dplyr)
library(purrr)
rgexVec <- c("[\\w\\s-]+(?=,)",
"[A-Z]{2}",
"\\d+(?=\\n)",
"[\\d-\\.]+(?=,)",
"[\\d-\\.]+(?=\\))")
varNames <- c("city",
"state",
"zip",
"lat",
"long")
# `value` is assumed to be the character vector of location strings, e.g. df$location1
map2_dfc(varNames, rgexVec, function(vn, rg) {
extractedVal <- stri_extract_first_regex(value, rg) %>% as.list()
names(extractedVal) <- vn
extractedVal %>% as_tibble()
})
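\\1 is a back reference. In the replacement argument of gsub it stands for the text matched by the first parenthesised group in the pattern, so your call replaces the whole string with just the part captured between the literal parentheses. \\2, \\3 and so on refer to the second, third, and later groups.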
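A short base R sketch of back references on the example string. The multi-group pattern is an illustration written for this one format (city, two-letter state, five-digit zip, then coordinates in parentheses) and would need adjusting for other layouts.

```r
x <- "Potomac, MD 20854\n(39.038266, -77.203413)"

# One group: keep only what sits between the literal parentheses
lat_long <- gsub(".*\\n\\((.*)\\)", "\\1", x, perl = TRUE)

# Five groups: city, state, zip, latitude, longitude
pattern <- "^(.*), ([A-Z]{2}) (\\d{5})\\n\\((.*), (.*)\\)$"
city  <- sub(pattern, "\\1", x, perl = TRUE)
state <- sub(pattern, "\\2", x, perl = TRUE)

# regexec() + regmatches() return all groups in one step
# (the first element is the full match, so it is dropped)
groups <- regmatches(x, regexec(pattern, x, perl = TRUE))[[1]][-1]

print(lat_long)
print(groups)
```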
My dataframe, dat, has two columns which look like this:
value condition
2 learning/cat
4 learning/dog
1 naming/cat
6 naming/dog
I would like to 'trim' the data frame to only include rows in which condition contains "naming".
I've tried to do this with grep:
dat = dat[grep("naming", dat$condition, value = T)]
which causes the following error:
Error in `[.data.frame`(dat, grep("naming", dat$condition, value = T)) :
undefined columns selected
Can anyone suggest a fix? Any help would be greatly appreciated!
You can split up condition using separate from tidyr:
df = dat %>% separate( condition, into = c("condition1", "condition2"), sep = "/")
Then just use filter:
only_naming_df = df %>% filter(condition1 == "naming")
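For comparison, the same split-then-filter idea can be sketched in base R without creating new columns: strip everything from the first "/" onwards and compare what is left. The variable names are illustrative.

```r
dat <- data.frame(value = c(2, 4, 1, 6),
                  condition = c("learning/cat", "learning/dog",
                                "naming/cat", "naming/dog"),
                  stringsAsFactors = FALSE)

# Part of `condition` before the first "/"
prefix <- sub("/.*$", "", dat$condition)

# Keep rows whose prefix is exactly "naming"
only_naming <- dat[prefix == "naming", ]
print(only_naming)
```

Unlike a plain substring search, this only keeps rows where "naming" is the whole first component.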
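The error is easy to fix by adding a comma after the closing parenthesis, so that grep's result selects rows instead of columns (and by dropping value = T, as explained below). I also wanted a list of the available options for this task, so below are solutions and comments from others and from me.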
Use grep or grepl
grep returns the indices (row numbers) of the matches, while grepl returns a logical vector (TRUE or FALSE). Notice that value = T should not be used with grep here, because grep then returns the matching strings rather than row indices, which is not helpful for subsetting.
dat[grep("naming", dat$condition), ]
dat[grepl("naming", dat$condition), ]
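A runnable sketch of what each call returns on the example data, to make the difference between the three forms concrete. The variable names are illustrative.

```r
dat <- data.frame(value = c(2, 4, 1, 6),
                  condition = c("learning/cat", "learning/dog",
                                "naming/cat", "naming/dog"),
                  stringsAsFactors = FALSE)

idx  <- grep("naming", dat$condition)                # integer positions
lgl  <- grepl("naming", dat$condition)               # logical vector
vals <- grep("naming", dat$condition, value = TRUE)  # the matching strings

# Both of these select rows; note the comma before the closing bracket
by_index   <- dat[idx, ]
by_logical <- dat[lgl, ]

print(by_index)
```

Without the comma, dat[...] is treated as column selection, which is where the "undefined columns selected" error comes from.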
Functions from dplyr and stringr
str_detect is equivalent to grepl(pattern, x), while str_which is equivalent to grep(pattern, x).
library(dplyr)
library(stringr)
dat %>% filter(str_detect(condition, "naming"))
dat %>% slice(str_which(condition, "naming"))
Data Preparation
# Create example dataframes
dat <- read.table(text = "value condition
2 learning/cat
4 learning/dog
1 naming/cat
6 naming/dog",
header = TRUE, stringsAsFactors = FALSE)