Removing and replacing observations with the stringr package in R

I have two datasets that I'm trying to join. The columns I am joining by do not match exactly. In the first file the column looks like this: 00:01:54:2145, with a leading 00: on every single observation. I want to change all the observations in this column to this format: 01/54/2145.
I have tried several things with the stringr package, but can't get it to work.
df1 <- df %>%
str_replace_all("00:")
I'm getting this error, but don't think that's the only problem:
argument is not an atomic vector; coercing
Thank you

library(stringr)
library(dplyr)
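# str_replace() is already vectorized, so no Vectorize() wrapper is needed
my_conversion <- function(str) {
  str_replace(str, "^00:", "") %>%  # drop the leading "00:"
    str_replace_all(":", "/")       # turn the remaining colons into slashes
}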
df <- data.frame(
a_column = 1:3, key_column = c("00:01:54:2145", "00:01:54:2145", "00:01:54:2145"))
df %>% mutate(key_column = my_conversion(key_column))

Related

dplyr: how to pass strings to dplyr's mutate argument

I want to write a helper function that summarizes the percentage change for column A, B and C in one shot. I want to pass a string to the "mutate" argument of dplyr with the help of rlang. Unfortunately, I get an error saying that I have an unexpected ",". Could you please take a look? Thanks in advance!
library(rlang) #read text inputs and return vars
library(dplyr)
set.seed(10)
dat <- data.frame(A=rnorm(10,0,1),
B=rnorm(10,0,1),
C=rnorm(10,0,1),
D=2001:2010)
calc_perct_chg <- function(input_data,
target_Var_list,
year_Var_name){
#create new variable names
mutate_varNames <- paste0(target_Var_list,rep("_pct_chg = ",length(target_Var_list)))
#generate text for formula
mutate_formula <- lapply(target_Var_list,function(x){output <- paste0("(",x,"-lag(",x,"))/lag(",x,")");return(output)})
mutate_formula <- unlist(mutate_formula) #convert list to a vector
#generate arguments for mutate
mutate_args <<- paste0(mutate_varNames,collapse=",",mutate_formula)
#data manipulation
output <- input_data %>%
arrange(!!parse_quo(year_Var_name,env=caller_env())) %>%
mutate(!!parse_quo(mutate_args,env=caller_env()))
#output data frame
return(output)
}
# error: unexpected ','
calc_perct_chg(input_data =dat,
target_Var_list=list("A","B","C"),
year_Var_name="D")
I don't think it's a good idea to evaluate strings as code, and I think you are over-complicating this. Using across should be easier.
library(dplyr)
calc_perct_chg <- function(input_data,
target_Var_list,
year_Var_name){
input_data %>%
arrange(across(all_of(year_Var_name))) %>%
mutate(across(all_of(target_Var_list), ~(.x - lag(.x))/lag(.x)))
}
calc_perct_chg(input_data = dat,
target_Var_list = c("A","B","C"),
year_Var_name = "D")
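Note that the across version above overwrites A, B and C, whereas the question asked for new columns ending in _pct_chg. A minimal sketch of one way to keep the originals, assuming dplyr 1.0.0 or later (for across and its .names argument); the function name calc_perct_chg_named and its argument names are made up for this example:

```r
library(dplyr)

set.seed(10)
dat <- data.frame(A = rnorm(10), B = rnorm(10), C = rnorm(10), D = 2001:2010)

# Hypothetical variant: keep A, B, C and add <name>_pct_chg columns
calc_perct_chg_named <- function(input_data, target_vars, year_var) {
  input_data %>%
    arrange(across(all_of(year_var))) %>%
    mutate(across(all_of(target_vars),
                  ~ (.x - lag(.x)) / lag(.x),
                  .names = "{.col}_pct_chg"))
}

res <- calc_perct_chg_named(dat, c("A", "B", "C"), "D")
names(res)
```

The first row of each new column is NA, because lag() has no previous value to compare against.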

How to make purrr recognize colnames of df in a list?

I am trying to unite first and last names in each dataframe in a list of dataframes. The problem is that purrr doesn't seem to recognize colnames within each df.
Each df in data$authors_list looks something like
authid
surname
given-name
12345
Smith
John
85858
Scott
Jane
I want to unite the "surname" and "given-name" columns into a column called AuN.
data <- data %>%
mutate(authors_list = map(authors_list,
unite(col=AuN,
c(`given-name`,
surname),
sep = " ")))
However, I get the following error.
Error in unite(col = AuN, c(`given-name`, surname), sep = " ") :
object 'given-name' not found
I am new to using purrr, and I haven't been able to find solutions to a similar problem online. Any help would be appreciated!
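I think this is what you're after. map expects a function as its second argument, but your code calls unite directly, so it is evaluated once without any data frame and the columns are not found. Wrap the call in a formula with ~ and put .x in the unite call to stand in for each data frame in the list. For each one, it will unite with the parameters you specified.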
library(tidyverse)
#Set up the data (but please in the future give us data so we don't have to set it up)
df <- tibble(authid = c(12345, 85858), surname = c("Smith", "Scott"), `given-name` = c("John","Jane"))
list_df <- list(df, df, df)
list_df_unite <- map(list_df, ~ unite(.x, AuN, c(`given-name`,surname), sep = " "))
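The same pattern should carry over to the list-column in the question. A sketch, assuming data is a tibble whose authors_list column holds one data frame per row; the paper_id column and the sample values are invented for illustration:

```r
library(dplyr)
library(purrr)
library(tidyr)

authors <- tibble(authid = c(12345, 85858),
                  surname = c("Smith", "Scott"),
                  `given-name` = c("John", "Jane"))

# Hypothetical stand-in for the asker's data: one author table per row
data <- tibble(paper_id = 1:2, authors_list = list(authors, authors))

data <- data %>%
  mutate(authors_list = map(authors_list,
                            ~ unite(.x, "AuN", c(`given-name`, surname), sep = " ")))

data$authors_list[[1]]$AuN
```

By default unite drops the two input columns; pass remove = FALSE to keep them.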

R mutate & replace string with empty pattern or empty string

I am trying to remove some pattern (to_remove) from another string column (entry) inside mutate().
The problem is that both my string and pattern columns contain some empty strings, so vectorized functions such as stringr::str_remove() produce warnings and slow the process down a lot.
I noticed that without the empty strings and patterns (i.e. if they are replaced with some values) it takes less than 1 second to process about 1e5 rows. With the warnings it takes over 10 seconds.
Is there any way to use stringr::str_remove() inside mutate() while skipping those empty rows, so that I still get the speed benefit of vectorization?
Note that I can also use dplyr::rowwise() + gsub(), but rowwise() slows things down a lot as well.
Example code:
library(tidyverse)
library(stringr)
set.seed(123)
temp <- data.frame(
entry = c('A12','JW13','C','')
,to_remove = c('A','W','','D')
) %>%
sample_n(1e5,replace = T)
temp <- temp %>%
mutate(
removed = str_remove(entry,to_remove)
)
Try replacing the blank values with NA :
library(dplyr)
library(stringr)
temp %>%
mutate(to_remove = na_if(to_remove, ''),
removed = str_remove(entry,to_remove))
We can do
library(dplyr)
library(stringr)
temp %>%
mutate(removed = str_remove(entry, replace(to_remove, to_remove == "", NA)))
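With either answer, rows whose pattern is NA come back as NA in removed. If those rows should keep their original entry instead, one option is to wrap the call in coalesce. This is only a sketch on a four-row sample (temp_small is a made-up name):

```r
library(dplyr)
library(stringr)

temp_small <- data.frame(entry = c("A12", "JW13", "C", ""),
                         to_remove = c("A", "W", "", "D"))

res <- temp_small %>%
  mutate(removed = coalesce(
    str_remove(entry, na_if(to_remove, "")),  # NA where the pattern was empty
    entry                                     # fall back to the untouched entry
  ))

res$removed
```

Everything stays vectorized, so no rowwise() is needed.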

dplyr mutate inside for loop - Issue

I am performing data analysis and cleaning in R using the tidyverse.
I have a data frame with 23 columns containing the values 'NO','STEADY','UP' and 'down'.
I want to change all the values in these 23 columns to 0 in case of 'NO' or 'STEADY' and to 1 otherwise.
What I did was create a vector named keys that holds all my column names. After that I used a for loop, ifelse and mutate.
Please have a look at the code below.
# Column names are kept in the list by name keys
keys = c('metformin', 'repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride',
'glipizide', 'glyburide', 'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol',
'insulin', 'glyburide-metformin', 'tolazamide', 'metformin-pioglitazone',
'metformin-rosiglitazone', 'glimepiride-pioglitazone', 'glipizide-metformin',
'troglitazone', 'tolbutamide', 'acetohexamide')
After that, I used the following code to get the desired result:
for (col in keys){
Dataset = Dataset %>%
mutate(col = ifelse(col %in% c('No','Steady'),0,1)) }
I was expecting this to make the changes I require, but nothing happens. (NO ERROR MESSAGE AND NO DESIRED RESULT)
After that, I researched further and executed the following code:
for (col in keys){
print(col)}
It gives me the elements of keys as character strings, like "metformin".
So I thought this might be the issue. Hence, I used the code below to cast the keys as symbols:
keys_new = sym(keys)
After that i again ran the same code:
for (col in keys_new){
Dataset = Dataset %>%
mutate(col = ifelse(col %in% c('No','Steady'),0,1))}
It gives me following Error -
Error in match(x, table, nomatch = 0L) :
'match' requires vector arguments
After all this, I also tried to create a function to get the desired result, but that didn't work either:
change = function(name){
Dataset = Dataset %>%
mutate(name = ifelse(name %in% c('No','Steady'),0,1),
name = as.factor(name))
return(Dataset)}
for (col in keys){
change(col)}
This didn't perform any action. (NO ERROR MESSAGE AND NO DESIRED RESULT)
When keys_new is placed in this code:
for (col in keys_new){
change(col)}
I got the same Error :
Error in match(x, table, nomatch = 0L) :
'match' requires vector arguments
PLEASE GUIDE
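There's no need to loop or keep track of column names. You can use mutate_all (note that it converts every column in Dataset, so it only fits if the data frame holds nothing but these columns) -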
Dataset %>%
mutate_all(~ifelse(. %in% c('No','Steady'), 0, 1))
Another way, thanks to Rui Barradas -
Dataset %>%
mutate_all(~as.integer(!. %in% c('No','Steady')))
There's a simpler way using mutate_at and case_when.
Dataset %>% mutate_at(keys, ~case_when(. %in% c("NO", "STEADY") ~ 0, TRUE ~ 1))
mutate_at will only mutate the columns specified in the keys variable. case_when then lets you replace one value by another by some condition.
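The loop in the question does nothing useful because mutate(col = ...) creates a column literally named col, and the string held in col (for example "metformin") is never 'No' or 'Steady'. In dplyr 1.0.0 and later, mutate_at and mutate_all are superseded by across. A sketch of the same idea on a small invented data frame (Dataset_demo, keys_demo and the patient column are placeholders, and the 'No'/'Steady' spelling follows the code in the question):

```r
library(dplyr)

# Hypothetical miniature of the asker's data
Dataset_demo <- data.frame(patient = 1:4,
                           metformin = c("No", "Steady", "Up", "Down"),
                           insulin = c("Up", "No", "No", "Steady"))
keys_demo <- c("metformin", "insulin")

res <- Dataset_demo %>%
  mutate(across(all_of(keys_demo),
                ~ as.integer(!.x %in% c("No", "Steady"))))

res
```

Matching with %in% is case sensitive, so the values in the vector have to be spelled exactly as they appear in the data.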
This answer shows how to use mutate inside a for loop.
I don't have your data, so I made my own: I turned the keys into a tibble using enframe, spread it into columns with the row number as the value of each column, and then checked whether each value is higher than 10.
To use a column name stored in a variable, you have to use !! and := inside mutate.
df <- enframe(c('metformin', 'repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride',
'glipizide', 'glyburide', 'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol',
'insulin', 'glyburide-metformin', 'tolazamide', 'metformin-pioglitazone',
'metformin-rosiglitazone', 'glimepiride-pioglitazone', 'glipizide-metformin',
'troglitazone', 'tolbutamide', 'acetohexamide')
) %>% spread(key = value,value = name)
keys = c('metformin', 'repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride',
'glipizide', 'glyburide', 'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol',
'insulin', 'glyburide-metformin', 'tolazamide', 'metformin-pioglitazone',
'metformin-rosiglitazone', 'glimepiride-pioglitazone', 'glipizide-metformin',
'troglitazone', 'tolbutamide', 'acetohexamide')
for (col in keys){
  df = df %>%
    # !!col := names the target column; .data[[col]] reads it as a plain vector
    mutate(!!col := ifelse(.data[[col]] > 10, 0, 100))
}

Initialize an empty tibble with column names and 0 rows

I have a vector of column names called tbl_colnames.
I would like to create a tibble with 0 rows and length(tbl_colnames) columns.
The best way I've found of doing this is...
tbl <- as_tibble(data.frame(matrix(nrow=0,ncol=length(tbl_colnames))))
and then I want to name the columns, so:
colnames(tbl) <- tbl_colnames
My question: Is there a more elegant way of doing this?
something like tbl <- tibble(colnames=tbl_colnames)
my_tibble <- tibble(
var_name_1 = numeric(),
var_name_2 = numeric(),
var_name_3 = numeric(),
var_name_4 = numeric(),
var_name_5 = numeric()
)
Haven't tried, but I guess it works too if instead of initiating numeric vectors of length 0 you do it with other classes (for example, character()).
This SO question explains how to do it with other R libraries.
According to this tidyverse issue, this won't be a feature for tribbles.
If what you want is to combine a list of tibbles, you can just assign NULL to the variable and then bind_rows with the other tibbles.
res = NULL
for(i in tibbleList)
res = bind_rows(res,i)
However, a much more efficient way to do this is
bind_rows(tibbleList) # combine all tibbles in the list
For anyone still interested in an elegant way to create a 0-row tibble with column names given by a character vector tbl_colnames:
tbl_colnames %>% purrr::map_dfc(setNames, object = list(logical()))
or:
tbl_colnames %>% purrr::map_dfc(~tibble::tibble(!!.x := logical()))
or:
tbl_colnames %>% rlang::rep_named(list(logical())) %>% tibble::as_tibble()
This, of course, results in each column being of type logical.
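If another column type is wanted, the same idea seems to work with a different zero-length vector. A sketch using base R's setNames and tibble::as_tibble, with character columns as an arbitrary choice and made-up column names:

```r
library(tibble)

tbl_colnames <- c("id", "name", "score")  # hypothetical column names

# Named list of zero-length character vectors, one per column name
empty_cols <- setNames(rep(list(character()), length(tbl_colnames)), tbl_colnames)
tbl <- as_tibble(empty_cols)

tbl
```

Swapping character() for numeric(), integer() or logical() changes the type of every column at once.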
The following command will create a tibble with 0 rows and variables (columns) named with the contents of tbl_colnames:
tbl <- tibble::tibble(!!!tbl_colnames, .rows = 0)
You could abuse readr::read_csv, which can read from a string. You can control names and types, e.g.:
tbl_colnames <- c("one", "two", "three", "c4", "c5", "last")
read_csv("\n", col_names = tbl_colnames) # all character type
read_csv("\n", col_names = tbl_colnames, col_types = "lcniDT") # various types
I'm a bit late to the party, but for future readers:
as_tibble(matrix(nrow = 0, ncol = length(tbl_colnames)), .name_repair = ~ tbl_colnames)
.name_repair allows you to name your columns within the same function call.
