Select a specific string from a complicated text column in R using dplyr - r

I have been trying this for a long time now. My data frame looks like this:
type=c("ID=gene:PFLU_4201;
biotype=protein_coding;description=putative filamentous adhesin;gene_id=PFLU_4201;
logic_name=ena",
"ID=gene:PFLU_5927;Name=algP1;biotype=protein_coding;
description=transcriptional regulatory protein algp (alginate regulatory protein algr3);
gene_id=PFLU_5927;logic_name=ena")
SNP=c(1, 2)
data=data.frame(type, SNP)
I would like to isolate only the PFLU_*** string from the type column, so that my data looks like this:
type SNP
PFLU_4201 1
PFLU_5927 2
Any help is more than welcome

Assuming that ID=gene:PFLU_*** and gene_id=PFLU_*** are always the same, you can use the mutate and str_extract functions from the dplyr and stringr packages. Both are part of the tidyverse.
require(tidyverse)
data<-data %>%
mutate(type = str_extract(type,"\\bPFLU_[:digit:]+\\b"))
This results in:
type SNP
1 PFLU_4201 1
2 PFLU_5927 2
If there are times when they are not the same you can use str_extract_all, map_chr, str_c and unique. map_chr is found in the purrr package, which is also part of tidyverse.
require(tidyverse)
data<-data %>%
mutate(type = map_chr(str_extract_all(type,"\\b(PFLU_[:digit:]+)+\\b"), ~ str_c(unique(.x), collapse=", ")))
This will create a comma-separated string with all instances that match PFLU_ followed by the adjacent number for each type string.
Changing the second PFLU_5927 to PFLU_0000 would result in:
type SNP
1 PFLU_4201 1
2 PFLU_5927, PFLU_0000 2
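If only the value of the gene_id field is wanted, one possibility is to anchor the match on the field name rather than on the PFLU_ prefix. The following is a minimal sketch, assuming every record contains a gene_id=...; field. The single-line sample strings and the name genes are made up for illustration.

```r
library(dplyr)
library(stringr)

# Illustrative data: same structure as the question, one line per record
type <- c(
  "ID=gene:PFLU_4201;biotype=protein_coding;gene_id=PFLU_4201;logic_name=ena",
  "ID=gene:PFLU_5927;Name=algP1;biotype=protein_coding;gene_id=PFLU_5927;logic_name=ena"
)
genes <- data.frame(type, SNP = c(1, 2))

# (?<=gene_id=) is a lookbehind: the match starts right after "gene_id="
# [^;]+ then takes everything up to the next semicolon
genes <- genes %>%
  mutate(type = str_extract(type, "(?<=gene_id=)[^;]+"))

genes$type
```

This variant does not depend on the PFLU_ prefix at all, which may or may not be what you want if other identifier schemes appear in the same column.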

We can just use sub here for a base R option:
data$type <- sub("^.*\\b(PFLU_\\d+)\\b.*$", "\\1", data$type)
data
type SNP
1 PFLU_4201 1
2 PFLU_5927 2
The sample data used was the same you provided in your original question.
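One thing to keep in mind with sub is that it returns the input unchanged when the pattern does not match. A hedged sketch of an alternative using regexpr and regmatches, which yields NA for such rows, is shown here. The second sample string is invented to show the no-match case.

```r
type <- c(
  "ID=gene:PFLU_4201;gene_id=PFLU_4201;logic_name=ena",
  "ID=gene:unknown;logic_name=ena"  # invented row without a PFLU id
)

# regexpr() gives the position of the first match, or -1 when there is none
m <- regexpr("PFLU_[0-9]+", type)

# Start from NA and fill in only the rows that matched
ids <- rep(NA_character_, length(type))
ids[m != -1] <- regmatches(type, m)

ids
```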

You can try this:
library(dplyr)   # provides %>% and mutate
library(stringr)
new_data <- data %>% mutate(
type = substr(type,str_locate(type,"PFLU_[0-9][0-9][0-9][0-9]")[,"start"],
str_locate(type,"PFLU_[0-9][0-9][0-9][0-9]")[,"end"]))
If you want to get more than one PFLU_**** per line, you can use the str_locate_all function.
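As a rough sketch of the str_locate_all variant mentioned above (the sample string is invented, and the pattern assumes the id is PFLU_ followed by digits):

```r
library(stringr)

x <- "ID=gene:PFLU_5927;gene_id=PFLU_0000;logic_name=ena"

# str_locate_all() returns a list with one start/end matrix per input string
loc <- str_locate_all(x, "PFLU_[0-9]+")[[1]]

# Cut out every located span
found <- str_sub(x, loc[, "start"], loc[, "end"])
found
```

If only the matched text is needed and not the positions, str_extract_all gets there in one step.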

Related

Renaming column but capturing number

I would like to rename columns that have the following pattern:
x1_test_thing
x2_test_thing
into:
test_thing_1
test_thing_2
Essentially moving the number to the end while removing the string (x) before it.
If a solution using dplyr and using rename_at() could be suggested that would be great.
If there is a better way to do it, I'd definitely love to see it.
Thanks!
You can use the dplyr::rename_at function to rename the columns:
The first argument is your data frame.
The second argument selects the columns that match your requirements.
The third argument is the function applied to the column names; any further arguments for that function follow it, separated by commas.
For example, gsub is a function for processing strings. Called on its own it would be gsub(x=c("x1_test_thing","x2_test_thing"),pattern = "^.(.)_(test_thing)",replacement = "\\2_\\1"), but inside dplyr::rename_at you pass gsub,pattern = "^.(.)_(test_thing)",replacement = "\\2_\\1" and the column names are supplied as x for you.
pattern = "^.(.)_(test_thing)" uses two pairs of parentheses to capture parts of the column name: the first captures the second character, such as "1", and the second captures the text "test_thing" after the underscore.
replacement = "\\2_\\1" joins the second captured group ("test_thing"), an underscore "_", and the first captured group (such as "1") to give the desired output, which then replaces the column name.
library(dplyr)
# using test data for example
test <- data.frame(x1_test_thing=c(0),x2_test_thing=c(0))
rename_at(test, vars(contains("test_thing")),gsub,pattern = "^.(.)_(test_thing)",replacement = "\\2_\\1")
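To see the capture groups in isolation, here is a small sketch that applies the same pattern to plain strings. The third name is invented to show that ^.(.)_ captures exactly one character, so a name with a two-digit number is left unchanged.

```r
old_names <- c("x1_test_thing", "x2_test_thing", "x10_test_thing")

# \\1 is the single character after the first one, \\2 is "test_thing"
new_names <- gsub("^.(.)_(test_thing)", "\\2_\\1", old_names)

new_names
# "test_thing_1" "test_thing_2" "x10_test_thing"
```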
We can use readr::parse_number to extract the number from the string.
library(dplyr)
df <- data.frame(x1_test_thing= 1:5, x2_test_thing= 5:1)
df %>%
rename_with(~paste0('test_thing_', readr::parse_number(.)))
# test_thing_1 test_thing_2
#1 1 5
#2 2 4
#3 3 3
#4 4 2
#5 5 1
To rename only those columns that have 'test_thing' in them -
df %>%
rename_with(~paste0('test_thing_', readr::parse_number(.)),
contains('test_thing'))
In base R,
names(df) <- sub('x(\\d+)_.*', 'test_thing_\\1', names(df))
df
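If the suffix is not always test_thing, a sketch along these lines keeps whatever follows the number instead of hard-coding it. It assumes the columns to rename look like x, then digits, then an underscore and the rest. The data frame wide and its columns are illustrative.

```r
library(dplyr)

wide <- data.frame(x1_test_thing = 1:2, x12_other_thing = 3:4, other_col = 5:6)

renamed <- wide %>%
  rename_with(
    # group 1 = the digits, group 2 = everything after the first underscore
    ~ sub("^x([0-9]+)_(.*)$", "\\2_\\1", .x),
    .cols = matches("^x[0-9]+_")
  )

names(renamed)
```

Columns that do not match the selection, such as other_col here, keep their names.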

Using glue in new column name in rename() in R

I have a date variable, that can be set at the beginning of my script. Later in the script this date variable is used as part of a new column name using glue.
df <- df %>%
rename("Date" = glue("{date_variable}"),
glue("Change since {date_variable}") = change)
This is what setting the date variable looks like:
date_variable <- "2020-04-15"
Now the first rename, where the glue is in the old column name, works perfectly. The second part, where the glue is in the new column name, does not. It returns:
Error: unexpected '=' in:
"df <- df %>%
rename(glue("Test {date_variable}") ="
Is it simply not possible to use a glue command in a new variable name?
The following is also possible, without explicit use of glue, but with the walrus operator :=, like in @Ronak Shah's answer:
df %>% rename("Change since {date_variable}" := change)
If you want to use glue with rename you can do :
library(dplyr)
library(glue)
df %>% rename("Date" = glue("{date_variable}"),
!!glue("Change since {date_variable}") := change)
# Date Change since 2020-04-15
#1 1 2
#2 2 3
#3 3 4
#4 4 5
Another option is to use rename_with which will not require glue.
df %>%
rename_with(~c('Date', paste('Change since', date_variable)),
c(date_variable, 'change'))
data
df <- data.frame("2020-04-15" = 1:4, change = 2:5, check.names = FALSE)
date_variable <- "2020-04-15"
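Another possibility, sketched here on the assumption that your dplyr version accepts all_of() with a named character vector in rename() (names are the new column names, values the old ones), is to build the names as ordinary strings first. The name lookup is illustrative.

```r
library(dplyr)

date_variable <- "2020-04-15"
df <- data.frame("2020-04-15" = 1:4, change = 2:5, check.names = FALSE)

# Values are the current column names, names are the new ones
lookup <- c(date_variable, "change")
names(lookup) <- c("Date", paste("Change since", date_variable))

renamed <- df %>% rename(all_of(lookup))
names(renamed)
```

Because the new names are plain strings, no function call ever appears on the left-hand side of =, which is what triggered the parse error in the question.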

use dplyr to combine columns of data.frame when column names are not known

Given a tibble:
library(tibble)
myTibble <- tibble(a = letters[1:3], b = c(T, F, T), c = 1:3)
I can use transmute to paste the columns, separated by '.':
> library(dplyr)
> transmute(myTibble, concat = paste(a, b, c, sep = "."))
# A tibble: 3 x 1
concat
<chr>
1 a.TRUE.1
2 b.FALSE.2
3 c.TRUE.3
If I want to use the above transmute statement in a function that receives a tibble, I won't know the names of the tibble or the number of columns ahead of time. What dplyr syntax would allow me to paste all columns in a tibble separated by a '.'?
Please note, I can do this with something like:
> apply(myTibble, 1, paste, collapse = ".")
[1] "a.TRUE.1" "b.FALSE.2" "c.TRUE.3"
but I am trying to understand dplyr better. So, yes, this is a specific problem I am trying to solve, but I am also stumped as to why I can't solve it with dplyr, which means there is something key about dplyr column selection I don't yet understand, and I'd like to learn, so that is why I'm asking specifically about a dplyr solution.
With a little trial and error:
colNames_as_symbols <- syms(names(myTibble))
transmute(myTibble, concat = paste(!!!colNames_as_symbols, sep = '.'))
Here was the hint that put me on to the solution... From the documentation for !!!:
The big-bang operator !!! forces-splice a list of objects. The
elements of the list are spliced in place, meaning that they each
become one single argument.
vars <- syms(c("height", "mass"))
Force-splicing is equivalent to supplying the elements separately:
starwars %>% select(!!!vars)
starwars %>% select(height, mass)
In fact, the entire documentation entitled "Force parts of an expression" is fascinating reading. It can be accessed by issuing ?qq_show
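For comparison, tidyr (not dplyr) has unite(), which pastes a selection of columns without needing their names in advance. A minimal sketch, wrapped in a hypothetical helper called paste_all_columns:

```r
library(tibble)
library(tidyr)

# Hypothetical helper: collapse every column of `tbl` into one string column
paste_all_columns <- function(tbl, sep = ".") {
  unite(tbl, "concat", everything(), sep = sep)
}

myTibble <- tibble(a = letters[1:3], b = c(TRUE, FALSE, TRUE), c = 1:3)
result <- paste_all_columns(myTibble)
result$concat
```

This swaps the quasiquotation approach for a tidyselect helper, so it says less about how !!! works but avoids building symbols by hand.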

R - identify which columns contain currency data $

I have a very large dataset with some columns formatted as currency, some numeric, some character. When reading in the data all currency columns are identified as factor and I need to convert them to numeric. The dataset is too wide to manually identify the columns. I am trying to find a programmatic way to identify if a column contains currency data (e.g. starts with '$') and then pass that list of columns to be cleaned.
name <- c('john','carl', 'hank')
salary <- c('$23,456.33','$45,677.43','$76,234.88')
emp_data <- data.frame(name,salary)
clean <- function(ttt){
as.numeric(gsub('[^a-zA-Z0-9.]','', ttt))
}
sapply(emp_data, clean)
The issue in this example is that this sapply works on all columns, resulting in the name column being replaced with NA. I need a way to programmatically identify just the columns that the clean function needs to be applied to (in this example, salary).
Using the dplyr and stringr packages, you can use mutate_if to identify columns that have any string starting with a $ and then convert them accordingly.
library(dplyr)
library(stringr)
emp_data %>%
mutate_if(~any(str_detect(., '^\\$'), na.rm = TRUE),
~as.numeric(str_replace_all(., '[$,]', '')))
Taking advantage of the powerful parsers the readr package offers out of the box:
my_parser <- function(col) {
# Try first with parse_number that handles currencies automatically quite well
res <- suppressWarnings(readr::parse_number(col))
if (is.null(attr(res, "problems", exact = TRUE))) {
res
} else {
# If parse_number fails, fall back on parse_guess
readr::parse_guess(col)
# Alternatively, we could simply return col without further parsing attempt
}
}
library(dplyr)
emp_data %>%
mutate(foo = "USD13.4",
bar = "£37") %>%
mutate_all(my_parser)
# name salary foo bar
# 1 john 23456.33 13.4 37
# 2 carl 45677.43 13.4 37
# 3 hank 76234.88 13.4 37
A base R option is to use startsWith to detect the dollar columns, and gsub to remove "$" and "," from the columns.
doll_cols <- sapply(emp_data, function(x) any(startsWith(as.character(x), '$')))
emp_data[doll_cols] <- lapply(emp_data[doll_cols],
function(x) as.numeric(gsub('\\$|,', '', x)))
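mutate_if() still works, but newer dplyr releases express the same idea with across() and where(). The sketch below assumes a column counts as currency only when every non-missing value starts with $. The predicate is_currency is a made-up helper, not part of any package.

```r
library(dplyr)

emp_data <- data.frame(
  name = c("john", "carl", "hank"),
  salary = c("$23,456.33", "$45,677.43", "$76,234.88")
)

# Made-up predicate: TRUE when all non-missing values start with "$"
is_currency <- function(x) {
  x <- as.character(x)   # also copes with factor columns
  x <- x[!is.na(x)]
  length(x) > 0 && all(startsWith(x, "$"))
}

# Strip "$" and "," from the detected columns, then convert to numeric
cleaned <- emp_data %>%
  mutate(across(where(is_currency),
                ~ as.numeric(gsub("[$,]", "", as.character(.x)))))

str(cleaned)
```

Requiring all values to match is stricter than the any() test used above. Which rule is appropriate depends on how clean the currency columns are.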

How to search for a string in one column in other columns of a data frame

I have a table, call it df, with 3 columns, the 1st is the title of a product, the 2nd is the description of a product, and the third is a one word string. What I need to do is run an operation on the entire table, creating 2 new columns (call them 'exists_in_title' and 'exists_in_description') that have either a 1 or 0 indicating if the 3rd column exists in either the 1st or 2nd column. I need it to simply be a 1:1 operation, so for example, calling row 1 'A', I need to check if the cell A3 exists in A1, and use that data to create the column exists_in_title, and then check if A3 exists in A2, and use that data to create the column exists_in_description.
Then move on to row B and go through the same operation. I have thousands of rows of data so it's not realistic to do these in a 1 at a time fashion, writing individual functions for each row; I definitely need a function or method that will run through every row in the table in one shot.
I've played around with grepl, pmatch, str_count but none seem to really do what I need. I think grepl is probably the closest to what I need, here's an example of 2 lines of code I wrote that logically do what I would want them to, but didn't seem to work:
df$exists_in_title <- grepl(df$A3, df$A1)
df$exists_in_description <- grepl(df$A3, df$A2)
However when I run those I get the following message, which leads me to believe it did not work properly: "argument 'pattern' has length > 1 and only the first element will be used"
Any help on how to do this would be greatly appreciated. Thanks!
grepl will work with mapply:
Sample data frame:
title <- c('eggs and bacon','sausage biscuit','pancakes')
description <- c('scrambled eggs and thickcut bacon','homemade biscuit with breakfast pattie', 'stack of sourdough pancakes')
keyword <- c('bacon','sausage','sourdough')
df <- data.frame(title, description, keyword, stringsAsFactors=FALSE)
Searching for matches using grepl:
df$exists_in_title <- mapply(grepl, pattern=df$keyword, x=df$title)
df$exists_in_description <- mapply(grepl, pattern=df$keyword, x=df$description)
And the results:
title description keyword exists_in_title exists_in_description
1 eggs and bacon scrambled eggs and thickcut bacon bacon TRUE TRUE
2 sausage biscuit homemade biscuit with breakfast pattie sausage TRUE FALSE
3 pancakes stack of sourdough pancakes sourdough FALSE TRUE
Update I
You could also do this with dplyr and stringr:
library(dplyr)
df %>%
rowwise() %>%
mutate(exists_in_title = grepl(keyword, title),
exists_in_description = grepl(keyword, description))
library(stringr)
df %>%
rowwise() %>%
mutate(exists_in_title = str_detect(title, keyword),
exists_in_description = str_detect(description, keyword))
Update II
Map is also an option, or, using more from the tidyverse, another option could be purrr with stringr:
library(tidyverse)
df %>%
mutate(exists_in_title = unlist(Map(function(x, y) grepl(x, y), keyword, title))) %>%
mutate(exists_in_description = map2_lgl(description, keyword, str_detect))
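Two details may be worth noting: grepl and str_detect treat the keyword as a regular expression, and the question asked for 1/0 rather than TRUE/FALSE. The following minimal sketch assumes literal matching is what is wanted. It uses stringr's fixed() and relies on str_detect being vectorised over both the string and the pattern, so no rowwise() is needed.

```r
library(stringr)

df <- data.frame(
  title = c("eggs and bacon", "sausage biscuit", "pancakes"),
  description = c("scrambled eggs and thickcut bacon",
                  "homemade biscuit with breakfast pattie",
                  "stack of sourdough pancakes"),
  keyword = c("bacon", "sausage", "sourdough"),
  stringsAsFactors = FALSE
)

# Element i of keyword is matched literally against element i of the column
df$exists_in_title <- as.integer(str_detect(df$title, fixed(df$keyword)))
df$exists_in_description <- as.integer(str_detect(df$description, fixed(df$keyword)))

df[, c("keyword", "exists_in_title", "exists_in_description")]
```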

Resources