Using glue in new column name in rename() in R - r

I have a date variable that can be set at the beginning of my script. Later in the script, this date variable is used as part of a new column name using glue.
df <- df %>%
rename("Date" = glue("{date_variable}"),
glue("Change since {date_variable}") = change)
This is what setting the date variable looks like:
date_variable <- "2020-04-15"
Now the first rename, where the glue is in the old column name, works perfectly. The second part, where the glue is in the new column name, does not. It returns:
Error: unexpected '=' in:
"df <- df %>%
rename(glue("Test {date_variable}") ="
Is it simply not possible to use a glue command in a new variable name?

The following is also possible without explicit use of glue, using the walrus operator :=, as in @Ronak Shah's answer:
df %>% rename("Change since {date_variable}" := change)

If you want to use glue with rename, you can do:
library(dplyr)
library(glue)
df %>% rename("Date" = glue("{date_variable}"),
!!glue("Change since {date_variable}") := change)
# Date Change since 2020-04-15
#1 1 2
#2 2 3
#3 3 4
#4 4 5
Another option is to use rename_with which will not require glue.
df %>%
rename_with(~c('Date', paste('Change since', date_variable)),
c(date_variable, 'change'))
data
df <- data.frame("2020-04-15" = 1:4, change = 2:5, check.names = FALSE)
date_variable <- "2020-04-15"
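To summarise why the original call fails: in R, the left-hand side of = in a function call must be a bare name or a string literal, so a call such as glue(...) cannot appear there, while := accepts a glue-style string. Below is a minimal sketch, assuming dplyr 1.0 or later; the all_of() wrapper is my own addition and is not taken from the answers above.

```r
library(dplyr)

# Same sample data as in the answer above
df <- data.frame("2020-04-15" = 1:4, change = 2:5, check.names = FALSE)
date_variable <- "2020-04-15"

renamed <- df %>%
  rename(
    # old name on the right: all_of() marks date_variable as a column name
    Date = all_of(date_variable),
    # new name on the left: := allows a glue-style string
    "Change since {date_variable}" := change
  )

names(renamed)
```

The new names come out as "Date" and "Change since 2020-04-15", the same result as the !!glue(...) := form.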

Related

Select a specific string from a complicated txt in R using dplyr

I have been trying this for a long time now. My data frame looks like this:
type=c("ID=gene:PFLU_4201;
biotype=protein_coding;description=putative filamentous adhesin;gene_id=PFLU_4201;
logic_name=ena",
"ID=gene:PFLU_5927;Name=algP1;biotype=protein_coding;
description=transcriptional regulatory protein algp (alginate regulatory protein algr3);
gene_id=PFLU_5927;logic_name=ena")
SNP=c(1, 2)
data=data.frame(type, SNP)
I would like to isolate only the string PFLU_*** from the type column, so that my data looks like this:
type SNP
PFLU_4201 1
PFLU_5927 2
Any help is more than welcome.
Assuming that ID=gene:PFLU_*** and gene_id=PFLU_*** are always the same, you can use the mutate and str_extract functions from the dplyr and stringr packages. Both are part of the tidyverse.
require(tidyverse)
data<-data %>%
mutate(type = str_extract(type,"\\bPFLU_[:digit:]+\\b"))
This results in:
type SNP
1 PFLU_4201 1
2 PFLU_5927 2
If there are times when they are not the same you can use str_extract_all, map_chr, str_c and unique. map_chr is found in the purrr package, which is also part of tidyverse.
require(tidyverse)
data<-data %>%
mutate(type = map_chr(str_extract_all(type,"\\b(PFLU_[:digit:]+)+\\b"), ~ str_c(unique(.x), collapse=", ")))
This will create a comma-separated string with all unique instances that match PFLU_ followed by the adjacent number, for each type string.
Changing the second PFLU_5927 to PFLU_0000 would result in:
type SNP
1 PFLU_4201 1
2 PFLU_5927, PFLU_0000 2
We can just use sub here for a base R option:
data$type <- sub("^.*\\b(PFLU_\\d+)\\b.*$", "\\1", data$type)
data
type SNP
1 PFLU_4201 1
2 PFLU_5927 2
The sample data used was the same you provided in your original question.
You can try this:
library(dplyr) # provides %>% and mutate
library(stringr)
new_data <- data %>% mutate(
type = substr(type,str_locate(type,"PFLU_[0-9][0-9][0-9][0-9]")[,"start"],
str_locate(type,"PFLU_[0-9][0-9][0-9][0-9]")[,"end"]))
If you want to get more than one PFLU_**** per line, you can use the str_locate_all function.
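For the several-matches-per-line case, here is a base R sketch that uses gregexpr and regmatches instead of str_locate_all. The two input strings are shortened, made-up variants of the question's data, with the second gene_id changed to PFLU_0000 for illustration.

```r
# Hypothetical, shortened input: the second string holds two different IDs
type <- c("ID=gene:PFLU_4201;gene_id=PFLU_4201",
          "ID=gene:PFLU_5927;gene_id=PFLU_0000")

# gregexpr() finds every match per string; regmatches() extracts them as a list
matches <- regmatches(type, gregexpr("PFLU_[0-9]+", type))

# Keep each distinct ID once and join multiple IDs with ", "
ids <- vapply(matches, function(m) paste(unique(m), collapse = ", "),
              character(1))
ids
```

This yields one string per row: a single ID where all matches agree, and a comma-separated list where they differ.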

use dplyr to combine columns of data.frame when column names are not known

Given a tibble:
library(tibble)
myTibble <- tibble(a = letters[1:3], b = c(T, F, T), c = 1:3)
I can use transmute to paste the columns, separated by '.':
> library(dplyr)
> transmute(myTibble, concat = paste(a, b, c, sep = "."))
# A tibble: 3 x 1
concat
<chr>
1 a.TRUE.1
2 b.FALSE.2
3 c.TRUE.3
If I want to use the above transmute statement in a function that receives a tibble, I won't know the names of the tibble or the number of columns ahead of time. What dplyr syntax would allow me to paste all columns in a tibble separated by a '.'?
Please note, I can do this with something like:
> apply(myTibble, 1, paste, collapse = ".")
[1] "a.TRUE.1" "b.FALSE.2" "c.TRUE.3"
but I am trying to understand dplyr better. So, yes, this is a specific problem I am trying to solve, but I am also stumped as to why I can't solve it with dplyr, which means there is something key about dplyr column selection I don't yet understand, and I'd like to learn, so that is why I'm asking specifically about a dplyr solution.
With a little trial and error:
colNames_as_symbols <- syms(names(myTibble))
transmute(myTibble, concat = paste(!!!colNames_as_symbols, sep = '.'))
Here was the hint that put me on to the solution... From the documentation for !!!:
The big-bang operator !!! force-splices a list of objects. The
elements of the list are spliced in place, meaning that they each
become one single argument.
vars <- syms(c("height", "mass"))
Force-splicing is equivalent to supplying the elements separately:
starwars %>% select(!!!vars)
starwars %>% select(height, mass)
In fact, the entire documentation page titled "Force parts of an expression" is fascinating reading. It can be accessed by running ?qq_show.
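Since the question asked for a function that receives an arbitrary tibble, the same idea can be wrapped up as follows. This is only a sketch: the function name paste_all_columns is made up, and it assumes the tibble has no column called sep, which would shadow the argument inside transmute.

```r
library(dplyr)
library(tibble)

# Hypothetical helper: paste every column of tbl into a single column
paste_all_columns <- function(tbl, sep = ".") {
  cols <- rlang::syms(names(tbl))           # column names -> list of symbols
  transmute(tbl, concat = paste(!!!cols, sep = sep))
}

myTibble <- tibble(a = letters[1:3], b = c(TRUE, FALSE, TRUE), c = 1:3)
result <- paste_all_columns(myTibble)
result$concat
```

The column names are read from the tibble at call time, so nothing about them needs to be known when the function is written.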

Can "assign()" and "get()" be written more concisely?

Below is my code. I use an extra variable "tmp" to clean the "ABC_Chla". Because the "Location_name" can change, I use the "assign()" and "get()" functions.
Location_name <- "ABC_"
tmp <- get(paste(Location_name,"DO",sep = "")) %>% filter(log.DO != -Inf)
assign(paste(Location_name,"DO",sep = ""), tmp)
My code achieves this goal, but it does not seem concise (it introduces a temporary variable). Is there a better way?
Assuming the inputs shown reproducibly in the Note at the end (next time please make sure your question includes complete reproducible code including inputs) we can make the following changes:
use paste0 instead of paste
create a variable locname to hold the name of the data frame and a variable e to be the environment where our data frame is located
use e[[...]] instead of get and assign
use the magrittr %<>% assignment pipe
possibly use filter(is.finite(log.DO)) -- not shown below
giving this code:
library(dplyr)
library(magrittr)
e <- .GlobalEnv # change if our data frame is in some other environment
locname <- paste0(Location_name, "DO")
e[[locname]] %<>%
filter(log.DO != -Inf)
The result is:
get(locname, e)
## log.DO
## 1 1
## 2 2
Alternative
This alternative only uses ordinary pipes. We use e and locname from above.
library(dplyr)
e[[locname]] <- e[[locname]] %>%
filter(log.DO != -Inf)
Note
Test input:
ABC_DO <- data.frame(log.DO = c(1, -Inf, 2))
Location_name <- "ABC_"
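The is.finite() variant mentioned in the list above was not shown, so here is a base R sketch of it. It uses a fresh environment instead of .GlobalEnv so that it runs in isolation, and the test input is extended with NA and Inf (my addition) to show the difference: is.finite() drops those rows too, whereas != -Inf removes only -Inf.

```r
# A separate environment stands in for .GlobalEnv in this sketch
e <- new.env()
e$ABC_DO <- data.frame(log.DO = c(1, -Inf, 2, NA, Inf))

Location_name <- "ABC_"
locname <- paste0(Location_name, "DO")

# Keep only rows with a finite log.DO (drops -Inf, Inf, NA and NaN)
keep <- is.finite(e[[locname]]$log.DO)
e[[locname]] <- e[[locname]][keep, , drop = FALSE]

e[[locname]]$log.DO
```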
You only have a temporary variable because you store the data in tmp; I don't see it as a problem. But in this case, the only thing I can see you doing is passing the code for tmp directly to assign, like:
assign(
paste(Location_name,"DO",sep = ""),
get(paste(Location_name,"DO",sep = "")) %>% filter(log.DO != -Inf)
)

How to clean data where the variable name and property are in the same cell?

I need to clean data where the variable property and answer associated with a location are together in a single cell. The only thing consistent in my dataset is that they are separated by a colon (:).
I need to remap the data so that the variable property becomes the column header and the data is mapped for each Location.
I've attached an example:
There can also be a bunch of other symbols that are irrelevant. I just need to extract the string before the colon and the string or integer after the colon and it is mapped correctly for each location.
How do I do this in R? What functions should I be using?
Example data:
Example1 Sunny:"TRUE" NearCoast:False Schools:{"13"} 2
Example2 NearCoast:False Schools:{"6"} Sunny:"FALSE" 3
Example3 Schools:{"2"} Sunny:"TRUE" NearCoast:TRUE Transport:5
Also, would it be possible to add exceptions to this process? For example, if the cell is simply a number alone, it is ignored. Or, if the property name is a specific thing such as "transport", the cell is ignored too.
Try this example, as mentioned in comments, we can reshape wide-to-long, then string split on :, then again reshape long-to-wide.
df1 <- read.table(text = '
Example1 Sunny:"TRUE" NearCoast:False Schools:{"13"} 2
Example2 NearCoast:False Schools:{"6"} Sunny:"FALSE" 3
Example3 Schools:{"2"} Sunny:"TRUE" NearCoast:TRUE Transport:5',
header = FALSE, stringsAsFactors = FALSE)
library(tidyverse)
gather(df1, key = "k", value = "v", -V1) %>%
separate(v, into = c("type", "value"), sep = ":") %>%
filter(!is.na(value)) %>%
select(-k) %>%
spread(key = type, value = value)
# V1 NearCoast Schools Sunny Transport
# 1 Example1 False {"13"} "TRUE" <NA>
# 2 Example2 False {"6"} "FALSE" <NA>
# 3 Example3 TRUE {"2"} "TRUE" 5
In the absence of a reproducible example, I can only provide guidelines. Assuming you can read in the data in tabular form as shown in your 2nd image, you can do it in a few "simple" steps with the packages dplyr and tidyr:
library(dplyr)
library(tidyr)
df <- read.table(...)
# gather() takes the key column name first, then the value column name
df %>% gather(column, keypair, 2:4) %>%
separate(keypair, into=c('key','value'), sep=':') %>%
# strip the quotes and braces around the values
mutate(value=gsub('["{}]', '', value)) %>%
select(-column) %>%
spread(key, value)
Go through the code line by line, and try to understand what each step does before running the next.
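The question also asked about exceptions (cells that are a bare number, or properties such as Transport). Here is a sketch of one way to handle them with pivot_longer and pivot_wider, the newer tidyr replacements for gather and spread. The data frame is typed in by hand from the example data, and the ignored vector is a made-up name for the list of properties to skip.

```r
library(dplyr)
library(tidyr)

# Example data from the question, entered by hand
df1 <- data.frame(
  V1 = c("Example1", "Example2", "Example3"),
  V2 = c('Sunny:"TRUE"', "NearCoast:False", 'Schools:{"2"}'),
  V3 = c("NearCoast:False", 'Schools:{"6"}', 'Sunny:"TRUE"'),
  V4 = c('Schools:{"13"}', 'Sunny:"FALSE"', "NearCoast:TRUE"),
  V5 = c("2", "3", "Transport:5"),
  stringsAsFactors = FALSE
)

ignored <- c("Transport")  # hypothetical list of properties to skip

result <- df1 %>%
  pivot_longer(-V1, names_to = "column", values_to = "cell") %>%
  filter(grepl(":", cell, fixed = TRUE)) %>%   # skip cells without a colon
  separate(cell, into = c("key", "value"), sep = ":") %>%
  filter(!key %in% ignored) %>%                # skip unwanted properties
  mutate(value = gsub('["{}]', "", value)) %>% # strip quotes and braces
  select(-column) %>%
  pivot_wider(names_from = key, values_from = value)

result
```

The values stay character here; they could be converted afterwards, for example with as.integer on the Schools column.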

R - identify which columns contain currency data $

I have a very large dataset with some columns formatted as currency, some numeric, some character. When reading in the data, all currency columns are identified as factor and I need to convert them to numeric. The dataset is too wide to manually identify the columns. I am trying to find a programmatic way to identify if a column contains currency data (e.g. starts with '$') and then pass that list of columns to be cleaned.
name <- c('john','carl', 'hank')
salary <- c('$23,456.33','$45,677.43','$76,234.88')
emp_data <- data.frame(name,salary)
clean <- function(ttt){
as.numeric(gsub('[^a-zA-Z0-9.]','', ttt))
}
sapply(emp_data, clean)
The issue in this example is that sapply works on all columns, resulting in the name column being replaced with NA. I need a way to programmatically identify just the columns that the clean function needs to be applied to (in this example, salary).
Using the dplyr and stringr packages, you can use mutate_if to identify columns that have any string starting with a $ and then change them accordingly.
library(dplyr)
library(stringr)
emp_data %>%
mutate_if(~any(str_detect(., '^\\$'), na.rm = TRUE),
~as.numeric(str_replace_all(., '[$,]', '')))
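In dplyr 1.0 and later, mutate_if is superseded by across() combined with where(). A sketch of the same approach in that style follows; the predicate name is_currency is made up, and it converts its input with as.character so that factor columns are handled as well.

```r
library(dplyr)

emp_data <- data.frame(
  name   = c("john", "carl", "hank"),
  salary = c("$23,456.33", "$45,677.43", "$76,234.88"),
  stringsAsFactors = FALSE
)

# Hypothetical predicate: TRUE if any value in the column starts with "$"
is_currency <- function(x) {
  any(startsWith(as.character(x), "$"), na.rm = TRUE)
}

cleaned <- emp_data %>%
  mutate(across(where(is_currency),
                ~ as.numeric(gsub("[$,]", "", as.character(.x)))))

str(cleaned)
```

Only the columns for which the predicate returns TRUE are touched, so name stays as it was.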
Taking advantage of the powerful parsers the readr package offers out of the box:
my_parser <- function(col) {
# Try first with parse_number that handles currencies automatically quite well
res <- suppressWarnings(readr::parse_number(col))
if (is.null(attr(res, "problems", exact = TRUE))) {
res
} else {
# If parse_number fails, fall back on parse_guess
readr::parse_guess(col)
# Alternatively, we could simply return col without further parsing attempt
}
}
library(dplyr)
emp_data %>%
mutate(foo = "USD13.4",
bar = "£37") %>%
mutate_all(my_parser)
# name salary foo bar
# 1 john 23456.33 13.4 37
# 2 carl 45677.43 13.4 37
# 3 hank 76234.88 13.4 37
A base R option is to use startsWith to detect the dollar columns, and gsub to remove "$" and "," from the columns.
doll_cols <- sapply(emp_data, function(x) any(startsWith(as.character(x), '$')))
emp_data[doll_cols] <- lapply(emp_data[doll_cols],
function(x) as.numeric(gsub('\\$|,', '', x)))

Resources