Separate column on the last digit - r

Say I have a df such as this:
x <- data.frame("SN" = 1:3, "Age" = c(21,15,2), "Name" = c("Q62yes","Q44_1_1Maybe", "Q2Some times"))
I would like separate out the Name column such that:
x_out <- data.frame("SN" = 1:3, "Age" = c(21,15,2), "Name" = c("Q62","Q44_1_1","Q2"), "New" = c("yes", 'Maybe', 'some times'))
I tried this, but I don't think my regex is not separating it into two groups as expected. Any suggestions?
x %>%
tidyr::separate(Name,c("name",'new'), sep = "(Q[[:digit:]]*_[[:digit:]])*([[:alpha:]]*\\s*)")

You can use
x %>%
tidyr::extract(Name,c("name",'new'), "(.*?\\d)([[:alpha:]].*)")
The regex means:
(.*?\d) - Group 1: any zero or more chars as few as possible till the digit that is followed with the subsequent subpatterns
([[:alpha:]].*) - Group 2: a letter and then the rest of the string.
See the regex demo.
R test with output:
> x %>%
+ tidyr::extract(Name,c("name",'new'), "(.*?\\d)([[:alpha:]].*)")
SN Age name new
1 1 21 Q62 yes
2 2 15 Q44_1_1 Maybe
3 3 2 Q2 Some times

We can use a regex lookaround to split between a digit (\\d) and non-digit ([A-Za-z]) in separate
library(tidyr)
library(dplyr)
x %>%
separate(Name, into = c("Name", "New"), sep="(?<=\\d)(?=[A-Za-z])")
-output
SN Age Name New
1 1 21 Q62 yes
2 2 15 Q44_1_1 Maybe
3 3 2 Q2 Some times
Note that this will also work when we have no digits as well compared to extract
x$Name[3] <- "hello"

Related

R: explode a character-string and get the last element (row-wise)

I have the following data-frame
df <- data.frame(var1 = c("f253.02.ds.a01", "f253.02.ds.a02", "f253.02.ds.x.a01", "f253.02.ds.x.a02", "f253.02.ds.a10", "test"))
df
What's the easiest way to extract the last two digits of the variable var1? (e.g. 1, 2, 10, NA) I was experimenting with separate(), but the number of points in the characters is not always the same. Maybe with regular expressions?
With separate, we can use a regex lookaround
library(dplyr)
library(tidyr)
df %>%
separate(var1, into = c('prefix', 'suffix'),
sep="(?<=[a-z])(?=\\d+$)", remove = FALSE, convert = TRUE)
-output
# var1 prefix suffix
#1 f253.02.ds.a01 f253.02.ds.a 1
#2 f253.02.ds.a02 f253.02.ds.a 2
#3 f253.02.ds.x.a01 f253.02.ds.x.a 1
#4 f253.02.ds.x.a02 f253.02.ds.x.a 2
#5 f253.02.ds.a10 f253.02.ds.a 10
#6 test test NA
The expected output shown in the question has 4 elements but the input has 6 rows so we assume that the expected output shown in the question is erroneous and that the correct output is that shown below. tes).
Now assuming that the 2 digits are preceded by a non-digit and note that \D means non-digit (backslash must be doubled within double quo
df %>% mutate(last2 = as.numeric(sub(".*\\D", "", var1)))
giving:
var1 last2
1 f253.02.ds.a01 1
2 f253.02.ds.a02 2
3 f253.02.ds.x.a01 1
4 f253.02.ds.x.a02 2
5 f253.02.ds.a10 10
6 test NA

Remove prefix letter from column variables

I have all column names that start with 'm'. Example: mIncome, mAge. I want to remove the prefix. So far, I have tried the following:
df %>%
rename_all(~stringr::str_replace_all(.,"m",""))
This removes all the column names that has the letter 'm'. I just need it removed from from the start. Any suggestions?
You can use sub in base R to remove "m" from the beginning of the column names.
names(df) <- sub('^m', '', names(df))
We need to specify the location. The ^ matches the start of the string (or here the column name). So, if we use ^m, it will only match 'm' at the beginning or start of the string and not elsewhere.
library(dplyr)
library(stringr)
df %>%
rename_all(~stringr::str_replace(.,"^m",""))
# ba Mbgeg gmba cfor
#1 1 2 4 6
#2 2 3 5 7
#3 3 4 6 8
Also, if the case should be ignored, wrap with regex and specify ignore_case = TRUE
df %>%
rename_all(~ stringr::str_replace(., regex("^m", ignore_case = TRUE), ""))
# ba bgeg gmba cfor
#1 1 2 4 6
#2 2 3 5 7
#3 3 4 6 8
Another option is word boundary (\\bm), but this could match the beginning of words where there are multi word column names
NOTE: str_replace_all is used when we want to replace multiple occurrence of the pattern. Here, we just need to replace the first instance and for that str_replace is enough.
data
df <- data.frame(mba = 1:3, Mbgeg = 2:4, gmba = 4:6, cfor = 6:8)
Another way you can try
library(tidyverse)
df <- data.frame(mma = 1:2, mbapbe = 1:2)
df2 <- df %>%
rename_at(vars(c("mma", "mbapbe")) ,function(x) gsub("^m", "", x))
# ma bapbe
# 1 1 1
# 2 2 2

Remove part of a string based on overlapping patterns

I have the following data:
dat <- data.frame(x = c("this is my example text", "and here is my other text example", "my other text is short"),
some_other_cols = c(1, 2, 2))
Further, I have the following vector of patterns:
my_patterns <- c("my example", "is my", "my other text")
What I want to achieve is to remove any text of my_patterns that occurs in dat$x.
I tried the solution below, but the problem is that as soon as I remove the first pattern from the text (here: "my example"), my solution is not able to detect the occurence of the second (here: "is my") or third pattern anymore.
Wrong solution:
library(tidyverse)
my_patterns_c <- str_c(my_patterns, collapse = "|")
dat_new <- dat %>%
mutate(short_x = str_replace_all(x, pattern = my_patterns_c, replacement = ""))
I guess I could do sth. like looping through all patterns, collect the string positions in dat$x that match my patterns, then combine them into a range and delete that range from the text. E.g. I add columns to my dat data frame like start_pattern_1 and end_pattern_1 and so on. So for the first row 1 I get 9 (start) and 18 (end) for the first pattern, 6/10 for the second pattern. I then need to check if any end position overlaps with any start position (here start 9 and end 10) and combine them into a range 6-18 and remove this range from the text.
Problem is that I potentially have many new start/end columns then (could be a few hundred patterns in my case) and if I need to pairwise compare the overlapping ranges, my computer will probably crash.
So I'm wondering how I could get it work or how I should best approach this solution. Maybe (and I hope so) there's a better/more elegant/easy solution.
Desired Output of dat would be:
x some_other_cols short_x
this is my example text 1 this text
and here is my other text example 2 and here example
my other text is short 2 is short
Appreciate your help! Thanks.
New option with str_locate_all mentionned by Uwe in a comment under the question which greatly simplify the code:
library(stringr)
# Create function to remove matching part of text
# First argument is text, second argument is a list of start and length
remove_matching_parts <- function(text, positions) {
if (nrow(positions) == 0) return(text)
ret <- strsplit(text,"")[[1]]
lapply(1:nrow(positions), function(x) { ret[ positions[x,1]:positions[x,2] ] <<- NA } )
paste0(ret[!is.na(ret)],separator="",collapse="")
}
# Loop over the data to apply the pattern
# row = length of vector, columns = length of pattern
matches <- lapply(dat$x, function(x) {
do.call(rbind,str_locate_all(x, my_patterns)) # transform the list output of str_locate in a table of start/end
})
# Avoid growing a vector in a for loop, create it beforehand, it will be the same length as teh vector we work against
dat$result <- vector("character",length(dat$x))
# Loop on each value to remove the matching parts
for (i in 1:length(dat$x)) {
dat$result[i] <- remove_matching_parts(as.character(dat$x[i]),matches[[i]])
}
If you have control over the pattern definition and can create it by hand then it can be achieved with a regex solution:
> gsub("(is )?my (other text|example)?","",dat$x)
[1] "this text" "and here example" " is short"
The idea is to create the pattern with optional parts (the ? after the grouping parentheses.
So we have roughly:
(is )? <= optional "is" followed by space
my <= literal "my" followed by space
(other text|example)? <= Optional text after "my ", either "other text" or (the |) "example"
If you don't have control, things gets messy, I hope I've commented enough for it to be understandable, according to the number of loops included don't expect it to be quick:
# Given datas
dat <- data.frame(x = c("this is my example text", "and here is my other text example", "my other text is short","yet another text"),
some_other_cols = c(1, 2, 2, 4))
my_patterns <- c("my example", "is my", "my other text")
# Create function to remove matching part of text
# First argument is text, second argument is a list of start and length
remove_matching_parts <- function(text, positions) {
ret <- strsplit(text,"")[[1]]
lapply(positions, function(x) { ifelse(is.na(x),,ret[ x[1]:x[2] ] <<- NA ) } )
paste0(ret[!is.na(ret)],separator="",collapse="")
}
# Create the matches between a vector and a pattern
# First argument is the pattern to match, second is the vector of charcaters
match_pat_to_vector <- function(pattern,vector) {
sapply(regexec(pattern,vector),
function(x) {
if(x>-1) {
c(start=as.numeric(x), end=as.numeric(x+attr(x,"match.length")) ) # Create a start/end vector from the index and length of the match
}
})
}
# Loop over the patterns to create a dataframe of matches
# row = length of vector, columns = length of pattern
matches <- sapply(my_patterns,match_pat_to_vector,vector=dat$x)
# Avoid growing a vector in a for loop, create it beforehand, it will be the same length as teh vector we work against
dat$result <- vector("character",length(dat$x))
# Loop on each value to remove the matching parts
for (i in 1:length(dat$x)) {
dat$result[i] <- remove_matching_parts(as.character(dat$x[i]),matches[i,])
}
Result after run:
> dat
x some_other_cols result
1 this is my example text 1 this text
2 and here is my other text example 2 and here example
3 my other text is short 2 is short
4 yet another text 4 yet another text
There are two crucial points here:
The patterns to remove from a string may overlap
There may be multiple non-overlapping patterns to remove from the string
The solution below tries to address both issues using my favorite tools
library(data.table)
setDT(dat)[, rn := .I] # add row numbers to join on later
library(stringr)
library(magrittr) # piping used to improve readability
pos <-
# find start and end positions for each pattern
lapply(my_patterns, function(pat) str_locate_all(dat$x, pat) %>%
lapply(as.data.table) %>%
rbindlist(idcol = "rn")) %>%
rbindlist() %>%
# collapse overlapping positions
setorder(rn, start, end) %>%
.[, grp := cumsum(cummax(shift(end, fill = 0)) < start), by = rn] %>%
.[, .(start = min(start), end = max(end)), by = .(rn, grp)]
Now, pos has become:
rn grp start end
1: 1 1 6 18
2: 2 1 10 25
3: 3 1 1 13
4: 5 1 6 10
5: 5 2 24 28
6: 6 1 1 13
7: 6 2 15 27
8: 7 1 3 7
9: 8 1 1 10
10: 8 2 12 16
11: 8 3 22 34
12: 9 1 1 10
13: 9 2 19 31
# remove patterns from strings from back to front
dat[, short_x := x]
for (g in rev(seq_len(max(pos$grp)))) {
# update join
dat[pos[grp == g], on = .(rn), short_x := `str_sub<-`(short_x, start, end, value = "")]
}
dat[, rn := NULL][ #remove row number
, short_x := str_squish(short_x)][] # remove whitespace
x some_other_cols short_x
1: this is my example text 1 this text
2: and here is my other text example 2 and here example
3: my other text is short 2 is short
4: yet another text 4 yet another text
5: this is my text where 'is my' appears twice 5 this text where '' appears twice
6: my other text is my example 6
7: This myself 7 Thself
8: my example is my not my other text 8 not
9: my example is not my other text 9 is not
The code to collapse overlapping positions is modified from this answer.
The intermediate result
lapply(my_patterns, function(pat) str_locate_all(dat$x, pat) %>%
lapply(as.data.table) %>%
rbindlist(idcol = "rn"))
[[1]]
rn start end
1: 1 9 18
2: 6 18 27
3: 8 1 10
4: 9 1 10
[[2]]
rn start end
1: 1 6 10
2: 2 10 14
3: 5 6 10
4: 5 24 28
5: 6 15 19
6: 7 3 7
7: 8 12 16
[[3]]
rn start end
1: 2 13 25
2: 3 1 13
3: 6 1 13
4: 8 22 34
5: 9 19 31
shows that patterns 1 and 2 overlap in row 1 and patterns 2 and 3 overlap in row 2. Rows 5, 8, and 9 have non-overlapping patterns. Row 7 is to show that patterns are extracted regardless of word boundaries.
EDIT: dplyr version
The OP has mentioned that he/she has "successfully avoided data.table so far". So, I felt challenged to add a dplyr version:
library(dplyr)
library(stringr)
pos <-
# find start end end positions for each pattern
lapply(my_patterns, function(pat) str_locate_all(dat$x, pat) %>%
lapply(as_tibble) %>%
bind_rows(.id = "rn")) %>%
bind_rows() %>%
# collapse overlapping positions
arrange(rn, start, end) %>%
group_by(rn) %>%
mutate(grp = cumsum(cummax(lag(end, default = 0)) < start)) %>%
group_by(rn, grp) %>%
summarize(start = min(start), end = max(end))
# remove patterns from strings from back to front
dat <- dat %>%
mutate(rn = row_number() %>% as.character(),
short_x = x %>% as.character())
for (g in rev(seq_len(max(pos$grp)))) {
dat <- dat %>%
left_join(pos %>% filter(grp == g), by = "rn") %>%
mutate(short_x = ifelse(is.na(grp), short_x, `str_sub<-`(short_x, start, end, value = ""))) %>%
select(-grp, -start, -end)
}
# remove row number
dat %>%
select(-rn) %>%
mutate(short_x = str_squish(short_x))
x some_other_cols short_x
1 this is my example text 1 this text
2 and here is my other text example 2 and here example
3 my other text is short 2 is short
4 yet another text 4 yet another text
5 this is my text where 'is my' appears twice 5 this text where '' appears twice
6 my other text is my example 6
7 This is myself 7 This self
8 my example is my not my other text 8 not
9 my example is not my other text 9 is not
The algorithm is essentially the same. However, there are two challenges here where dplyr differs from data.table:
dplyr requires explicit coersion from factor to character
there is no update join available in dplyr, so the for loop has become more verbose than the data.table counterpart (Perhaps, someone knows a fancy purrr function or a map-reduce trick to accomplish the same?)
EDIT 2
There are some bug fixes and improvements to above codes:
Collapsing positions has been corrected to work also for some edge case I have added to dat.
seq() has been replaced by seq_len().
str_squish() reduces repeated whitespace inside a string and removes whitespace from start and end of a string.
Data
I have added some use cases to test for non-overlapping patterns and complete removal, e.g.:
dat <- data.frame(
x = c(
"this is my example text",
"and here is my other text example",
"my other text is short",
"yet another text",
"this is my text where 'is my' appears twice",
"my other text is my example",
"This myself",
"my example is my not my other text",
"my example is not my other text"
),
some_other_cols = c(1, 2, 2, 4, 5, 6, 7, 8, 9)
)
my_patterns <- c("my example", "is my", "my other text")

String manipulation in r where contents of interest are in a different order

I have dataframe where I am attempting to extract content of a column and then append it to the dataframe as a new column.
For example my dataframe looks like:
> head(df)
id event_params
1 {"type":"L","maximumangle":-87.618,"duration":25}
2 {"type":"L","maximumangle":1.62,"duration":25}
3 {"maximumangle":-29.661,"type":"L","duration":20}
I wish to extract the maximum angle, and then append this to the existing dataframe as a new column titled maximumangle. My initial thought was to use the grep function. However, since maximumangle does not appear in the same order in each row, this will not work.
What can I do to achieve what I want?
1) Parse the last column using fromJSON in the rjson package. This adds all the JSON data.
library(rjson)
L <- lapply(as.character(DF$event_params), fromJSON)
cbind(DF, do.call("rbind", lapply(L, as.data.frame, stringsAsFactors = FALSE)))
giving:
id event_params type maximumangle duration
1 1 {"type":"L","maximumangle":-87.618,"duration":25} L -87.618 25
2 2 {"type":"L","maximumangle":1.62,"duration":25} L 1.620 25
3 3 {"maximumangle":-29.661,"type":"L","duration":20} L -29.661 20
2) We can simplify this slightly if you really only need maximumangle:
maximumangle <- function(x) fromJSON(as.character(x))$maximumangle
transform(DF, maximumangle = sapply(DF$event_params, maximumangle, USE.NAMES = FALSE))
giving:
id event_params maximumangle
1 1 {"type":"L","maximumangle":-87.618,"duration":25} -87.618
2 2 {"type":"L","maximumangle":1.62,"duration":25} 1.620
3 3 {"maximumangle":-29.661,"type":"L","duration":20} -29.661
Note
We assumed that the input in reproducible form is given by:
Lines <- '
id event_params
1 {"type":"L","maximumangle":-87.618,"duration":25}
2 {"type":"L","maximumangle":1.62,"duration":25}
3 {"maximumangle":-29.661,"type":"L","duration":20}'
DF <- read.table(text = Lines, header = TRUE, as.is = TRUE)
1) We can use str_extract from stringr by using a regex lookaround to match the string 'maximumangle' followed by a quote (") and colon (:) and extract the pattern the follows it i.e. zero or more - (-*) followed by numbers with digits ([0-9.]+)
library(dplyr)
library(stringr)
df %>%
mutate(maximumangle = as.numeric(str_extract(event_params,
'(?<=maximumangle":)-*[0-9.]+')))
# id event_params maximumangle
#1 1 {"type":"L","maximumangle":-87.618,"duration":25} -87.618
#2 2 {"type":"L","maximumangle":1.62,"duration":25} 1.620
#3 3 {"maximumangle":-29.661,"type":"L","duration":20} -29.661
2) Or the same can be done with base R using regexpr/regmatches
df$maximumangle <- as.numeric(regmatches(df$event_params,
regexpr('(?<=maximumangle":)-*[0-9.]+', df$event_params, perl = TRUE)))
data
df <- structure(list(id = 1:3, event_params = c("{\"type\":\"L\",\"maximumangle\":-87.618,\"duration\":25}",
"{\"type\":\"L\",\"maximumangle\":1.62,\"duration\":25}", "{\"maximumangle\":-29.661,\"type\":\"L\",\"duration\":20}"
)), .Names = c("id", "event_params"), class = "data.frame", row.names = c(NA,
-3L))

Extract data after a string one each appearance R

I have a data like this (named spectra):
#Milk spectra: 1234
##XYDATA=(X++(Y..Y))
649.025085449219
667.675231457819
686.325377466418
##XYDATA=(X++(Y..Y))
723.625669483618
742.275815492218
760.925961500818
##XYDATA=(X++(Y..Y))
872.826837552417
891.476983561017
910.127129569617
928.777275578216
In this data, each time the string ##XYDATA=(X++(Y..Y)), that is the data for each different animal.
So, I want to have the code that can help extract this sample into 3 pieces of data.
Animal 1: 3 lines after 1st ' ##XYDATA=(X++(Y..Y))'
Animal 2: 3 lines after 2nd ' ##XYDATA=(X++(Y..Y))'
And so on.
I tried this line of code but it only help to extract line 1 of all times the string '##XYDATA=(X++(Y..Y))' appeared together. Thus, it did not meet my expect to have three lines and to have a separate pieces of data after each appearance of the string.
bo<-data.frame(spectra$V1[which(spectra$V1 == '##XYDATA=(X++(Y..Y))')+1])
Okay I think you could do something along these lines. I'm sure this could be much better and more efficient but read it in as a character vector.
Then loop through to spread it out. However this assumes there are always the same number of measures and you have a way to identify the character values.
c_data<- c("split", 1, 2, 3,
"split", 4, 5, 6)
y<- c_data == "split"
df_wide <- data.frame("animal"= character(), "v1" = numeric(), "v2" = numeric(), "v3" = numeric(),
stringsAsFactors = FALSE)
names(df_wide)<- c("animal", "v1", "v2", "v3")
x <- 0
for (i in 1:length(c_data)){
if (y[i] == TRUE){
x <- x +1
df_wide[x,] <- rbind(c(c_data[i], c_data[i+1], c_data[i+2], c_data[i+3]))
}
}
yields
animal v1 v2 v3
1 split 1 2 3
2 split 4 5 6
If it is a one time thing, it may not be worth trying to write something nicer. If it is an ongoing thing then you may want to look at using an apply function that you could have to write a function for.
You can do either of the following with split and map:
library(dplyr)
library(purrr)
df %>%
mutate(Animal = cumsum(grepl("##XYDATA=(X++(Y..Y))", V1, fixed = TRUE))) %>%
split(.$Animal) %>%
map(~slice(., -1) %>% mutate(V1 = as.numeric(V1))) %>%
'['(-1)
This creates an indicator variable Animal, split by that indicator, remove the first row for each dataframe, convert V1 to numeric, and finally remove the first element of the list.
You can also do the following:
df %>%
mutate(Animal = cumsum(grepl("##XYDATA=(X++(Y..Y))", V1, fixed = TRUE))) %>%
filter(!grepl("^#.*$", V1)) %>%
mutate(V1 = as.numeric(V1)) %>%
split(.$Animal)
This also creates the indicator Animal, but it intead, filters out all rows with # signs in it and converts V1 to numeric before splitting into separate dataframes.
Result:
$`1`
# A tibble: 3 x 2
V1 Animal
<dbl> <int>
1 649.0251 1
2 667.6752 1
3 686.3254 1
$`2`
# A tibble: 3 x 2
V1 Animal
<dbl> <int>
1 723.6257 2
2 742.2758 2
3 760.9260 2
$`3`
# A tibble: 4 x 2
V1 Animal
<dbl> <int>
1 872.8268 3
2 891.4770 3
3 910.1271 3
4 928.7773 3
Note:
Here I assumed #Milk spectra: 1234 is also a row in your column, hence the subsetting at the end.
Data:
df = read.table(textConnection("'#Milk spectra: 1234'
##XYDATA=(X++(Y..Y))
649.025085449219
667.675231457819
686.325377466418
##XYDATA=(X++(Y..Y))
723.625669483618
742.275815492218
760.925961500818
##XYDATA=(X++(Y..Y))
872.826837552417
891.476983561017
910.127129569617
928.777275578216"),comment.char = "", stringsAsFactors = FALSE)

Resources