I would like to rename columns that have the following pattern:
x1_test_thing
x2_test_thing
into:
test_thing_1
test_thing_2
Essentially moving the number to the end while removing the string (x) before it.
If a solution using dplyr and using rename_at() could be suggested that would be great.
If there is a better way to do it i'd definitely love to see it.
Thanks!
Using dplyr::rename_at function to rename the name of columns:
first parameter is your datafame.
second parameter is selecting the columns matching your requirements.
third parameter is choosing the function to processing the name of columns, and the parameters of function to processing strings put after comma.
For example, gsub is a function to processing strings. Originally, the usage of the function is gsub(x=c("x1_test_thing","x2_test_thing"),pattern = "^.(.)_(test_thing)",replacement = "\\2_\\1"), but the correct usage is gsub,pattern = "^.(.)_(test_thing)",replacement = "\\2_\\1" when you use this function at dplyr::rename_at.
pattern = "^.(.)_(test_thing)" means using the pair of parentheses to capture the second character, such as "1", and the characters after underline to the end of string, such as "test_thing" ,from the name of columns.
replacement = "\\2_\\1" means concatenating strings at the second pair of parentheses (test_thing) ,such as "test_thing", a underline"_" ,with strings at the first pair of parentheses (.), such as "1", to get desired output ,and finally replace the name of columns with the string processed.
library(dplyr)
# using test data for example
test <- data.frame(x1_test_thing=c(0),x2_test_thing=c(0))
rename_at(test, vars(contains("test_thing")),gsub,pattern = "^.(.)_(test_thing)",replacement = "\\2_\\1")
We can use readr::parse_number to extract the number from the string.
library(dplyr)
df <- data.frame(x1_test_thing= 1:5, x2_test_thing= 5:1)
df %>%
rename_with(~paste0('test_thing_', readr::parse_number(.)))
# test_thing_1 test_thing_2
#1 1 5
#2 2 4
#3 3 3
#4 4 2
#5 5 1
To rename only those column that have 'test_thing' in them -
df %>%
rename_with(~paste0('test_thing_', readr::parse_number(.)),
contains('test_thing'))
In base R,
names(df) <- sub('x(\\d+)_.*', 'test_thing_\\1', names(df))
df
Related
I am trying this for long now. My data frame looks like this:
type=c("ID=gene:PFLU_4201;
biotype=protein_coding;description=putative filamentous adhesin;gene_id=PFLU_4201;
logic_name=ena",
"ID=gene:PFLU_5927;Name=algP1;biotype=protein_coding;
description=transcriptional regulatory protein algp (alginate regulatory protein algr3);
gene_id=PFLU_5927;logic_name=ena")
SNP=c(1, 2)
data=data.frame(type, SNP)
I would like to isolate from the type column only the string PFLU_*** and my data to look like this
type SNP
PFLU_4201 1
PFLU_5927 2
Any help is more than welcome
Assuming that ID=gene:PFLU_*** and gene_id=PFLU_*** are always the same you can use the mutate and str_extract functions from the dplyr and stringr packages, Both are part of tidyverse.
require(tidyverse)
data<-data %>%
mutate(type = str_extract(type,"\\bPFLU_[:digit:]+\\b"))
This results in:
type SNP
1 PFLU_4201 1
2 PFLU_5927 2
If there are times when they are not the same you can use str_extract_all, map_chr, str_c and unique. map_chr is found in the purrr package, which is also part of tidyverse.
require(tidyverse)
data<-data %>%
mutate(type = map_chr(str_extract_all(type,"\\b(PFLU_[:digit:]+)+\\b"), ~ str_c(unique(.x), collapse=", ")))
This will create comma separated string with all instances that match PFLU_ followed by a the adjacent number for each type string.
Changing the second PFLU_5927 to PFLU_0000 would result in:
type SNP
1 PFLU_4201 1
2 PFLU_5927, PFLU_0000 2
We can just use sub here for a base R option:
data$type <- sub("^.*\\b(PFLU_\\d+)\\b.*$", "\\1", data$type)
data
type SNP
1 PFLU_4201 1
2 PFLU_5927 2
The sample data used was the same you provided in your original question.
You can try This:
library(stringr)
new_data <- data %>% mutate(
type = substr(type,str_locate(type,"PFLU_[0-9][0-9][0-9][0-9]")[,"start"],
str_locate(type,"PFLU_[0-9][0-9][0-9][0-9]")[,"end"]))
If you want to get more than one PFLU _ **** per line, you can use the str_locate_all function.
Have a CSV file which has a column which has a variable list of items separated by a |.
I use the code below:
violations <- inspections %>% head(100) %>%
select(`Inspection ID`,Violations) %>%
separate_rows(Violations,sep = "|")
but this only creates a new row for each character in the field (including spaces)
What am I missing here on how to separate this column?
It's hard to help without a better description of your data and an example of what the correct output would look like. That said, I think part of your confusion is due to the documentation in separate_rows. A similar function, separate, documents its sep argument as:
If character, sep is interpreted as a regular expression. The default value is a regular expression that matches any sequence of non-alphanumeric values.
but the documentation for the sep argument in separate_rows doesn't say the same thing though I think it has the same behavior. In regular expressions, | has special meaning so it must be escaped as \\|.
df <- tibble(
Inspection_ID = c(1, 2, 3),
Violations = c("A", "A|B", "A|B|C"))
separate_rows(df, Violations, sep = "\\|")
Yields
# A tibble: 6 x 2
Inspection_ID Violations
<dbl> <chr>
1 1 A
2 2 A
3 2 B
4 3 A
5 3 B
6 3 C
Not sure what your data looks like, but you may want to replace sep = "|" with sep = "\\|". Good luck!
Using sep=‘\|’ with the separate_rows function allowed me to separate pipe delimited values
I need to extract substrings from some strings,for example:
My data is a vector: c("Shigella dysenteriae","PREDICTED: Ceratitis")
a = "Shigella dysenteriae"
b = "PREDICTED: Ceratitis"
I hope that if the string starts with "PREDICTED:", it can be extracted to the subsequent word(maybe "Ceratitis"), and if the string doesn't start with "PREDICTED", it can be extracted to the first word(maybe Shigella);
In this example, the result would be:
result_of_a = "Shigella"
result_of_b = "Ceratitis"
Well,it is a typical conditional regular expression.I tried,but always failed;
I used R which can compatible perl's regular expression.
I know R supports perl's regular expression so I tried to use regexpr and regmatches, two functions to extract the substrings that I want.
The code is :
pattern = "(?<=PREDICTED:)?(?(1)(\\s+\\w+\\b)|(\\w+\\b))"
a = c("Shigella dysenteriae")
m_a = regexpr(pattern,a,perl = TRUE)
result_a = regmatches(a,m_a)
b = c("PREDICTED: Ceratitis")
m_b = regexpr(pattern,a,perl = TRUE)
result_b = regmatches(b,m_b)
Finaly,the result is :
# result_a = "Shigella"
# result_b = "PREDICTED"
It is not the result I expect,result_a is right,result_b is wrong.
WHY???Its seem that the condition didn't work...
PS:
I tried to read some details of conditional reg-expresstion. this is the web I tried to read : https://www.regular-expressions.info/conditional.html and I try to imitate "pattern" from this web ,and also tried to use "RegexBuddy" software to find the reason.
EDIT:
To use the function below on a vector, one can do:
Vector: myvec<-c("Shigella dysenteriae","PREDICTED: Ceratitis")
lapply(myvec,extractor)
[[1]]
[1] "Shigella"
[[2]]
[1] "Ceratitis"
Or:
unlist(lapply(myvec,extractor))
[1] "Shigella" "Ceratitis"
This assumes that the strings are always in the format shown above:
extractor<- function(string){
if(grepl("^PREDICTED",string)){
strsplit(string,": ")[[1]][2]
}
else{
strsplit(string," ")[[1]][1]
}
}
extractor(b)
#[1] "Ceratitis"
extractor(a)
#[1] "Shigella"
I think the reason it does not work is because (1) checks if a numbered capture group has been set but there is no first capturing group set yet, also not in the positive lookbehind (?<=PREDICTED:)?.
There are a first and second capturing group in the parts that follow. The if clause will check for group 1, it is not set so it will match group 2.
If you would make it the only capturing group (?<=(PREDICTED: )?) and omit the other 2 then the if clause will be true but you will get an error because the lookbehind assertion is not fixed length.
Instead of using a conditional pattern, to get both words you might use a capturing group and make PREDICTED: optional:
^(?:PREDICTED: )?(\w+)
Regex demo | R demo
If I understand correctly, the OP wants to extract
the first word after "PREDICTED:" if the strings starts with "PREDICTED:"
the first word of the string if the string does not start with "PREDICTED:".
So, if there is no specific requirement to use only one regex, this is what I would do:
Remove any leading "PREDICTED:" (if any)
Extract the first word from the intermediate result.
For working with regex, I prefer to use Hadley Wickham's stringr package:
inp <- c("Shigella dysenteriae", "PREDICTED: Ceratitis")
library(magrittr) # piping used to improve readability
inp %>%
stringr::str_replace("^PREDICTED:\\s*", "") %>%
stringr::str_extract("^\\w+")
[1] "Shigella" "Ceratitis"
To be on the safe side, I would remove any leading spaces beforehand:
inp %>%
stringr::str_trim() %>%
stringr::str_replace("^PREDICTED:\\s*", "") %>%
stringr::str_extract("^\\w+")
My dataframe has a column named Code of the type char which goes like b,b1,b110-b139,b110,b1100,b1101,... (1602 entries)
I am trying to select all the entries that match the strings in a vector and all the ones that start with the same string.
So lets say I have the vector
Selection=c("b114","d2")
then i want all codes like b114, b1140, b1141, b1142, ... as well as d2, d200, d2000, d2001, d2002, d2003 etc...
what does work in principle is to create a new dataframe like this:
bTable <- TreeMapTable[substr(TreeMapTable$Code,1,4)=="b114"|substr(TreeMapTable$Code,1,2)=="d2",]
which gives me all the data i want, but requires me to manually type the condition for each entry and i just want to give the script a vector with the strings.
I tried to do it like this:
SelectionL=nchar(Selection)
Beispieltable <- TreeMapTable[substr(TreeMapTable$Code,1,AuswahlL)==Auswahl1,]
but this gives me somehow only half of the required entries and i confess i don't really know what it is doing. I know i could use a for loop but from everything i read so far, loops should be avoided and the problem should be solveable by use of vectors.
sample data
df <- data.frame( Code = c("b114", "b115", "b11456", "d2", "d12", "d200", "db114"),
stringsAsFactors = FALSE)
Selection=c("b114","d2")
answer
library( dplyr )
#create a regex pattern to filter on
pattern <- paste0( "^", Selection, collapse = "|" )
#filter out all rows wher 'Code' dows not start with the entries from 'Selection'
df %>% filter( grepl( pattern, Code, perl = TRUE ) )
# Code
# 1 b114
# 2 b11456
# 3 d2
# 4 d200
I have a very large dataset with some columns formatted as currency, some numeric, some character. When reading in the data all currency columns are identified as factor and I need to convert them to numeric. The dataset it too wide to manually identify the columns. I am trying to find a programmatic way to identify if a column contains currency data (ex. starts with '$') and then pass that list of columns to be cleaned.
name <- c('john','carl', 'hank')
salary <- c('$23,456.33','$45,677.43','$76,234.88')
emp_data <- data.frame(name,salary)
clean <- function(ttt){
as.numeric(gsub('[^a-zA-z0-9.]','', ttt))
}
sapply(emp_data, clean)
The issue in this example is that this sapply works on all columns resulting in the name column being replaced with NA. I need a way to programmatically identify just the columns that the clean function needs to be applied to.. in this example salary.
Using dplyr and stringr packages, you can use mutate_if to identify columns that have any string starting with a $ and then change the accordingly.
library(dplyr)
library(stringr)
emp_data %>%
mutate_if(~any(str_detect(., '^\\$'), na.rm = TRUE),
~as.numeric(str_replace_all(., '[$,]', '')))
Taking advantage of the powerful parsers the readr package offers out of the box:
my_parser <- function(col) {
# Try first with parse_number that handles currencies automatically quite well
res <- suppressWarnings(readr::parse_number(col))
if (is.null(attr(res, "problems", exact = TRUE))) {
res
} else {
# If parse_number fails, fall back on parse_guess
readr::parse_guess(col)
# Alternatively, we could simply return col without further parsing attempt
}
}
library(dplyr)
emp_data %>%
mutate(foo = "USD13.4",
bar = "£37") %>%
mutate_all(my_parser)
# name salary foo bar
# 1 john 23456.33 13.4 37
# 2 carl 45677.43 13.4 37
# 3 hank 76234.88 13.4 37
A base R option is to use startsWith to detect the dollar columns, and gsub to remove "$" and "," from the columns.
doll_cols <- sapply(emp_data, function(x) any(startsWith(as.character(x), '$')))
emp_data[doll_cols] <- lapply(emp_data[doll_cols],
function(x) as.numeric(gsub('\\$|,', '', x)))