I need to extract a specific number from strings in a vector that look like this:
V1 V2 info
XX YY AB=414312;CD=0.5555;EF=1234;GH=2346;IJ=551;AA_CD=0.4633
VV ZZ AB=1093;CD=0.4444,0.78463;EF=1654;GH=6546;IJ=1241;AA_CD=0.4366
I only want to extract the number from "CD=XXX" (notice there is also an "AA_CD=XXXX" in every row).
I currently have:
df$info <- as.numeric(gsub("^.*;CD=[0-9, ],?|;.*$", "", df$info))
This grabs the number after "CD=", but only when there is a single number rather than a comma-separated list.
I need it to also handle rows in which there is more than one number separated by commas.
The values in that spot can look like any of these:
0.5555
0.4444,0.78463
0.0123
0.34,0.54,0.765
I know it is probably a silly mistake I am making... Thanks in advance!!!
Here is an approach
lapply(strsplit(gsub("^.*;CD=(0\\.[0-9]),?|;.*$", "\\1", vec), ","), as.numeric)
gsub("^.*;CD=(0\\.[0-9]),?|;.*$", "\\1", vec) #extracts the numbers
#output
1] "0.5555" "0.4444,0.78463"
These are then split at "," with strsplit, producing a list; as.numeric then converts the list elements via lapply.
If you do not need to keep track of which vector element had which numbers:
as.numeric(unlist(strsplit(gsub("^.*;CD=(0\\.[0-9]),?|;.*$", "\\1", vec), ",")))
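An alternative sketch, in case the single-regex approach feels fragile (assuming vec holds the info strings shown above): capture everything between ";CD=" and the next ";" with sub, then split on the comma.
cd <- sub(".*;CD=([^;]*).*", "\\1", vec)  # keep only the value(s) following ";CD="
lapply(strsplit(cd, ","), as.numeric)     # split at "," and convert to numeric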
How do I keep only rows that contain one of a given list of strings? In other words, I don't want to use grepl() and hardcode the values I would like to keep. Let's assume that I want to keep only records that contain abc or bbc or bcc or 20 more options in one of the columns, and I have x <- c("abc", "bbc", ...).
What can I do to keep only the records containing values of x in the data frame?
You can use %in%:
df_out <- df[df$v1 %in% x, ]
Or, you could form a regex alternation with the values in x and then use grepl:
regex <- paste0("^(?:", paste(x, collapse="|"), ")$")
df_out <- df[grepl(regex, df$v1, perl = TRUE), ]
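A small sketch with made-up data (df and x here are stand-ins for your own objects):
df <- data.frame(v1 = c("abc", "zzz", "bcc"), v2 = 1:3)
x  <- c("abc", "bbc", "bcc")
df[df$v1 %in% x, ]                                      # exact matches via %in%
regex <- paste0("^(?:", paste(x, collapse = "|"), ")$")
df[grepl(regex, df$v1, perl = TRUE), ]                  # same rows via the alternation
Both keep rows 1 and 3; %in% matches whole values, while the anchored regex does the same but lets you relax the anchors if partial matches are wanted.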
The stringi package has good functions for extracting string pattern matches:
newdat <- stringi::stri_extract_all(str, regex = pattern)
https://rdrr.io/cran/stringi/man/stri_extract.html
You can even pass the function a vector of patterns to match.
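A rough sketch of that idea (str, x, and the alternation pattern are placeholders for your own objects):
library(stringi)
str <- c("abc", "zzz", "bcc")
x   <- c("abc", "bbc", "bcc")
stri_extract_all(str, regex = paste(x, collapse = "|"))  # list of matches, NA where nothing matched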
I have a dataframe that looks like this:
Col1
4000
2.333.422
1,000,000
0.1
As you can see, I have some numbers that use dots as thousand separators and some that use commas. I want to replace the dots with commas, but only if a value has more than one dot in it, so that the last value, which uses a dot as a decimal separator, does not get lost.
Any idea how to do this? Much appreciate any help.
We can use a combination of grepl and gsub for a base R option:
x <- c("4000", "2.333.422", "1,000,000", "0.1")
output <- ifelse(grepl("\\..*\\.", x), gsub("\\.", ",", x), x)
output
[1] "4000" "2,333,422" "1,000,000" "0.1"
I have two datasets, each with around 100 variables that have similar names with some minor differences. The variable names in dataset 1 look like CHILD1xxx or child1xxx, and the variable names in dataset 2 look like CHILD2xxx or child2xxx.
For each of the datasets, I want to systematically get rid of the number (i.e., 1 or 2) so that the variable names are all CHILDxxx or childxxx.
I was thinking about using str_replace or str_replace_all but wasn't sure what kind of regular expression I would use to capture the above criteria. I would greatly appreciate any insights on this.
UPDATES 11/28/22
The final working code for replacing the names in the entire dataset, as suggested by @Josh White, looks like this:
colnames(DATASET) <- gsub("^(child)\\d+(.*)", "\\1\\2", colnames(DATASET), ignore.case = TRUE)
Here's one approach using gsub().
It captures the word "child" (ignoring case) and any combination of characters (or none) after a number (\\d+ matches one or more consecutive digits, so the number can be of any length). Using the capture groups (the parts in parentheses), we return what comes before and after the digits, but not the digits themselves: "\\1\\2".
x <- c("CHILD1xxx", "child2yyy", "Child23hello")
gsub("^(child)\\d+(.*)", "\\1\\2", x, ignore.case = TRUE)
[1] "CHILDxxx" "childyyy" "Childhello"
Another approach could be to remove all numbers, but this could be problematic if other numbers come up later in the string.
gsub("\\d", "", x)
[1] "CHILDxxx" "childyyy" "Childhello"
To remove a substring from a string, you can conveniently use str_remove. Since the substring to be removed is one or more digits, define \\d+ as the pattern for the removal:
library(stringr)
str_remove(x, "\\d+")
[1] "CHILDxxx" "childyyy" "Childhello"
Data:
x <- c("CHILD1xxx", "child2yyy", "Child23hello")
EDIT:
If the replacements should be applied to the column (variable) names of a data frame, then you could use str_remove together with rename_with (from dplyr):
library(dplyr)
df %>%
  rename_with(~str_remove(., "\\d+"))
CHILDxxx childyyy Childhello SomeOther
1 NA NA NA NA
Data:
df <- data.frame(
CHILD1xxx = NA,
child2yyy = NA,
Child23hello = NA,
SomeOther = NA
)
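The same renaming can also be done in base R if you prefer not to load dplyr (a sketch using the df defined above; sub removes the first run of digits in each name):
names(df) <- sub("\\d+", "", names(df))
names(df)
[1] "CHILDxxx" "childyyy" "Childhello" "SomeOther"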
I have a column whose values mix numbers and strings. I'd like to find only those rows that have a particular string and not the others. In this case, I only need rows that have SE and none of the others.
df :
names
SE123, FE43, SA67
SE167, SE24, SE56, SE34
SE23
FE36, KE90, LS87
DG20, SE34, LP47
SE57, SE39
Result df
names
SE167, SE24, SE56, SE34
SE23
SE57, SE39
My code
df[grep("^SE", as.character(df$names)),]
But this selects every row that has SE. Would somebody please help in achieving the result df? Thanks.
Looking at your expected output, it seems you want to select the rows where every element starts with "SE", where an element is a word between two commas.
Using base R, one method would be to split the strings on "," and select rows where every element startsWith "SE"
df[sapply(strsplit(df$names, ","), function(x)
all(startsWith(trimws(x), "SE"))), , drop = FALSE]
# names
#2 SE167, SE24, SE56, SE34
#3 SE23
#6 SE57, SE39
If you want to check for the presence of "SE" irrespective of its position within each element, grepl may be a better choice.
df[sapply(strsplit(df$names, ","), function(x)
all(grepl("SE", trimws(x)))), , drop = FALSE]
Make sure names is a character column before calling strsplit, or run
df$names <- as.character(df$names)
names[!grepl("[A-Z]", gsub("SE", "", names))]
[1] "SE167, SE24, SE56, SE34" "SE23" "SE57, SE39"
You can remove "SE" from all strings and then look for any remaining capital letter. Strings containing only SE codes will have no other capital letters left and are thus kept by the filter.
(This also works for codes like 25SE, where SE is not at the start.)
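Applied to the data frame rather than a bare vector (assuming the column is called names, as above), this returns the same three rows as the startsWith approach:
df[!grepl("[A-Z]", gsub("SE", "", df$names)), , drop = FALSE]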
I'd like to insert an underscore after the first three characters of all variable names in a data frame. Any help would be much appreciated.
Current data frame:
df1 <- data.frame("genCrc_b1" = c(1,1,1), "genprd" = c(1,1,1), "genopr_b1_b2" = c(1,1,1))
Desired data frame:
df2 <- data.frame("gen_Crc_b1" = c(1,1,1), "gen_prd" = c(1,1,1), "gen_opr_b1_b2" = c(1,1,1))
My attempts:
gsub('^(.{3})(.*)$', "_", names(df1))
gsub('^(.{3})(.*)$', '\\_\\2', names(df1))
We can use sub to capture the first 3 characters as a group ((.{3})) and, in the replacement, specify the backreference to that group (\\1) followed by an underscore:
names(df1) <- sub("^(.{3})", "\\1_", names(df1))
names(df1)
#[1] "gen_Crc_b1" "gen_prd" "gen_opr_b1_b2"
In the OP's attempts, especially the last one, there were two capture groups, but only the second (\\2) was referenced in the replacement; it should be
gsub('^(.{3})(.*)$', '\\1_\\2', names(df1))
BTW, gsub is not needed, as we are replacing at only a single position rather than multiple times.
In the first attempt, no backreference to the captured groups was used in the replacement at all.
If your variable names all begin with gen, we can also do the following.
colnames(df1) <- gsub("gen", "gen_", colnames(df1), fixed = TRUE)
You can also use regmatches<- to replace the sub-expressions.
regmatches(names(df1), regexpr("gen", names(df1), fixed=TRUE)) <- "gen_"
Now, check that the values have been properly changed.
names(df1)
[1] "gen_Crc_b1" "gen_prd" "gen_opr_b1_b2"
Here, regexpr finds the first position in each element of the character vector that matches the subexpression, "gen". These positions are fed to regmatches and the substitution is performed.
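If dplyr is already loaded, rename_with gives the same result (a sketch, starting again from the original df1 and reusing the sub call from above):
library(dplyr)
df1 %>% rename_with(~ sub("^(.{3})", "\\1_", .x))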