Using select and stringr together in R

I'm trying
qual %>% select(reasons_code) %>% str_replace('\\+.*',replacement = '')
but I get the following warning:
Warning message: In stri_replace_first_regex(string, pattern, fix_replacement(replacement), : argument is not an atomic vector; coercing.
However, when I do the following, the replacement works fine.
str_replace(qual$reasons_code,'\\+.*',replacement = '')
Does anyone know why this is happening?

According to ?str_replace, the string argument is:
string - Input vector. Either a character vector, or something coercible to one.
The output of select, however, is a data.frame with the single selected column; it is not converted to a vector. Instead of select, we can pull the column as a vector, and it should work:
library(dplyr)
library(stringr)
qual %>%
pull(reasons_code) %>%
str_replace('\\+.*',replacement = '')
Or if we prefer to use the OP's code with select, there are several ways to convert to vector - unlist is one of them
qual %>%
select(reasons_code) %>%
unlist %>%
str_replace('\\+.*',replacement = '')
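To make the difference concrete, here is a minimal sketch assuming a small made-up `qual` data frame (the values are hypothetical, not the OP's data). It shows that select() keeps a data.frame while pull() returns a character vector, and that wrapping the call in mutate() is another option when the result should stay in the data frame.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for the OP's data
qual <- data.frame(
  reasons_code = c("A1+B2", "C3", "D4+E5+F6"),
  stringsAsFactors = FALSE
)

as_df  <- qual %>% select(reasons_code)  # still a data.frame
as_vec <- qual %>% pull(reasons_code)    # a plain character vector

is.data.frame(as_df)
is.character(as_vec)

# Inside mutate(), the column is already seen as a vector
cleaned <- qual %>%
  mutate(reasons_code = str_replace(reasons_code, "\\+.*", ""))
cleaned$reasons_code
```

Using mutate() keeps the other columns of the data frame, whereas pull() discards everything except the cleaned vector.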

Related

R mutate & replace string with empty pattern or empty string

I am trying to remove some pattern (to_remove) from another string column (entry) inside mutate().
The problem is that both my string and pattern columns contain some empty strings, so vectorized functions such as stringr::str_remove() produce warnings and slow the process down considerably.
I notice that without the empty strings and patterns (i.e. if I replace them with some values), about 1e5 rows take less than 1 second to complete. With the warnings, it takes over 10 seconds.
Is there any way to use stringr::str_remove() inside mutate() while skipping those empty rows, so that I still get the speed benefit of vectorization?
Note that I could also use dplyr::rowwise() + gsub(), but rowwise() slows things down a lot as well.
Example code:
library(tidyverse)
library(stringr)
set.seed(123)
temp <- data.frame(
entry = c('A12','JW13','C','')
,to_remove = c('A','W','','D')
) %>%
sample_n(1e5, replace = TRUE)
temp <- temp %>%
mutate(
removed = str_remove(entry,to_remove)
)
Try replacing the blank values with NA :
library(dplyr)
library(stringr)
temp %>%
mutate(to_remove = na_if(to_remove, ''),
removed = str_remove(entry,to_remove))
Alternatively, we can do the replacement inline with replace():
library(dplyr)
library(stringr)
temp %>%
mutate(removed = str_remove(entry, replace(to_remove, to_remove == "", NA)))
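One thing to watch with both NA-based answers: str_remove() returns NA wherever the pattern is NA, so rows with an empty pattern lose their original entry. The sketch below is one possible way around that, assuming the original entry should be kept for those rows; the helper column `pattern` is a name introduced here for illustration only.

```r
library(dplyr)
library(stringr)

# Small version of the example data from the question
temp <- data.frame(
  entry     = c("A12", "JW13", "C", ""),
  to_remove = c("A", "W", "", "D"),
  stringsAsFactors = FALSE
)

result <- temp %>%
  mutate(
    pattern = na_if(to_remove, ""),                        # empty pattern -> NA
    removed = coalesce(str_remove(entry, pattern), entry)  # NA result -> original entry
  ) %>%
  select(-pattern)

result$removed
```

Everything stays vectorized, so this should scale to the 1e5-row case in the same way as the answers above.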

Errors converting Character to Numeric R

I want to convert a character column to numeric so that I can calculate the mean of BasePay. However, I keep getting different errors.
I use this code:
dataset <- read.csv("Wagegap.csv")
SFWage <- dataset %>%
as.numeric(dataset$BasePay)%>%
group_by(gender,JobTitle, Year) %>%
summarise(averageBasePay = mean(BasePay, na.rm=TRUE)) %>%
select(gender, JobTitle, averageBasePay, Year)
clean <- SFWage %>% filter(gender != "")
If I don't use $, it won't recognize my BasePay column, and if I use $, it shows
Error in function_list[i] :
'list' object cannot be coerced to type 'double'
The BasePay column shows numbers with a "." instead of a ",", so I shouldn't have to use gsub(), should I?
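The pipe passes the whole data frame as the first argument to as.numeric(), which is why the 'list' error appears. Try this before all the piping: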
dataset$BasePay <- as.numeric(dataset$BasePay)
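If the conversion should live inside the pipeline instead, it can go in a mutate() step. This is a minimal sketch on a made-up data frame (the rows are hypothetical; only the column names follow the question), not the OP's actual file.

```r
library(dplyr)

# Hypothetical stand-in for read.csv("Wagegap.csv")
dataset <- data.frame(
  gender   = c("F", "F", "M", ""),
  JobTitle = c("Clerk", "Clerk", "Clerk", "Clerk"),
  Year     = c(2014, 2014, 2014, 2014),
  BasePay  = c("100.5", "200.5", "300", "50"),
  stringsAsFactors = FALSE
)

SFWage <- dataset %>%
  mutate(BasePay = as.numeric(BasePay)) %>%   # convert the column, not the data frame
  filter(gender != "") %>%
  group_by(gender, JobTitle, Year) %>%
  summarise(averageBasePay = mean(BasePay, na.rm = TRUE), .groups = "drop")

SFWage
```

Inside mutate(), BasePay refers to the column directly, so no $ is needed.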

dplyr mutate inside for loop - Issue

I am performing data analysis and cleaning in R using the tidyverse.
I have a data frame with 23 columns containing the values 'NO', 'STEADY', 'UP' and 'down'.
I want to change all the values in these 23 columns to 0 for 'NO' and 'STEADY', and to 1 otherwise.
I created a vector named keys holding all my column names, and then used a for loop with ifelse and mutate.
Please have a look at the code below:
# Column names are kept in a character vector named keys
keys = c('metformin', 'repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride',
'glipizide', 'glyburide', 'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol',
'insulin', 'glyburide-metformin', 'tolazamide', 'metformin-pioglitazone',
'metformin-rosiglitazone', 'glimepiride-pioglitazone', 'glipizide-metformin',
'troglitazone', 'tolbutamide', 'acetohexamide')
After that, I used the following code to get the desired result:
for (col in keys){
Dataset = Dataset %>%
mutate(col = ifelse(col %in% c('No','Steady'),0,1)) }
I expected this to make the changes I need, but nothing happens (no error message and no desired result).
I then researched further and ran the following code:
for (col in keys){
print(col)}
It prints the elements as character strings, like "metformin".
So I thought this might be the issue, and used the code below to cast the keys to symbols:
keys_new = sym(keys)
After that, I ran the same code again:
for (col in keys_new){
Dataset = Dataset %>%
mutate(col = ifelse(col %in% c('No','Steady'),0,1))}
It gives me following Error -
Error in match(x, table, nomatch = 0L) :
'match' requires vector arguments
After all this, I also tried writing a function to get the desired result, but that didn't work either:
change = function(name){
Dataset = Dataset %>%
mutate(name = ifelse(name %in% c('No','Steady'),0,1),
name = as.factor(name))
return(Dataset)}
for (col in keys){
change(col)}
This didn't do anything (no error message and no desired result).
When keys_new is used in this code:
for (col in keys_new){
change(col)}
I got the same Error :
Error in match(x, table, nomatch = 0L) :
'match' requires vector arguments
PLEASE GUIDE
There's no need to loop or keep track of column names. You can use mutate_all -
Dataset %>%
mutate_all(~ifelse(. %in% c('No','Steady'), 0, 1))
Another way, thanks to Rui Barradas -
Dataset %>%
mutate_all(~as.integer(!. %in% c('No','Steady')))
There's a simpler way using mutate_at and case_when.
Dataset %>% mutate_at(keys, ~case_when(. %in% c("NO", "STEADY") ~ 0, TRUE ~ 1))
mutate_at will only mutate the columns specified in the keys variable. case_when then replaces values according to the conditions given.
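In dplyr 1.0 and later, mutate_all and mutate_at are superseded by across(). The following is a sketch of the same recoding with across() and all_of(), assuming a tiny made-up Dataset with two of the drug columns and the 'No'/'Steady' spelling used in the question's code.

```r
library(dplyr)

# Hypothetical miniature of the OP's data
Dataset <- data.frame(
  metformin = c("No", "Steady", "Up", "Down"),
  insulin   = c("Up", "No", "No", "Steady"),
  age       = c(50, 60, 70, 80),
  stringsAsFactors = FALSE
)
keys <- c("metformin", "insulin")

recoded <- Dataset %>%
  # Only the columns named in keys are touched; age is left alone
  mutate(across(all_of(keys), ~ as.integer(!.x %in% c("No", "Steady"))))

recoded
```

all_of() makes it explicit that keys is a character vector of column names, and it raises an error if any of them is missing from the data.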
This answer shows how to use mutate inside a for loop.
I don't have your data, so I made my own: I turned the keys into a tibble with enframe, spread it into columns using the row number as the value of each column, and then checked whether each value is higher than 10.
To use a column name stored in a variable inside mutate, you have to use !! and := in the mutate call:
library(tidyverse)
df <- enframe(c('metformin', 'repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride',
'glipizide', 'glyburide', 'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol',
'insulin', 'glyburide-metformin', 'tolazamide', 'metformin-pioglitazone',
'metformin-rosiglitazone', 'glimepiride-pioglitazone', 'glipizide-metformin',
'troglitazone', 'tolbutamide', 'acetohexamide')
) %>% spread(key = value,value = name)
keys = c('metformin', 'repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride',
'glipizide', 'glyburide', 'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol',
'insulin', 'glyburide-metformin', 'tolazamide', 'metformin-pioglitazone',
'metformin-rosiglitazone', 'glimepiride-pioglitazone', 'glipizide-metformin',
'troglitazone', 'tolbutamide', 'acetohexamide')
for (col in keys){
df = df %>%
mutate(!!as.character(col) := ifelse(df[[col]] > 10, 0, 100))
}
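Applied to the original problem, the loop in the question fails because mutate(col = ...) creates (or overwrites) a column literally named col, and col %in% ... is not looking at the drug column. A sketch of a working loop, assuming a small made-up Dataset, uses !! and := on the left-hand side and the .data pronoun on the right:

```r
library(dplyr)

# Hypothetical miniature of the OP's data
Dataset <- data.frame(
  metformin = c("No", "Steady", "Up", "Down"),
  insulin   = c("Up", "No", "No", "Steady"),
  stringsAsFactors = FALSE
)
keys <- c("metformin", "insulin")

for (col in keys) {
  Dataset <- Dataset %>%
    # !!col := names the output column after the string held in col;
    # .data[[col]] reads the existing column with that name
    mutate(!!col := ifelse(.data[[col]] %in% c("No", "Steady"), 0, 1))
}

Dataset
```

Note the assignment back to Dataset inside the loop; without it, each iteration's result is thrown away.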

Avoiding backtick characters with dplyr

How can I write the argument of select without backtick characters? I would like to do this so that I can pass in this argument from a variable as a character string.
df <- dat[["__Table"]] %>% select(`__ID` ) %>% mutate(fk_table = "__Table", val = 1)
Changing the argument of select to "__ID" gives this error:
Error: All select() inputs must resolve to integer column positions.
The following do not:
* "__ID"
Unfortunately, the _ characters in column names cannot be avoided since the data is downloaded from a relational database (FileMaker) via ODBC and needs to be written back to the database while preserving the column names.
Ideally, I would like to be able to do the following:
colName <- "__ID"
df <- dat[["__Table"]] %>% select(colName) %>% mutate(fk_table = "__Table", val = 1)
I've also tried eval(parse()):
df <- dat[["__Table"]] %>% select( eval(parse(text="__ID")) ) %>% mutate(fk_table = "__Table", val = 1)
It throws this error:
Error in parse(text = "__ID") : <text>:1:1: unexpected input
1: _
^
By the way, the following does work, but then I'm back to square one (still with backtick symbol).
eval(parse(text="`__ID`"))
References about backtick characters in R:
Removing backticks in R output
What do backticks do in R?
R encoding ASCII backtick
You can use as.name() with select_():
colName <- "__ID"
df <- data.frame(`__ID` = c(1,2,3), `123` = c(4,5,6), check.names = FALSE)
select_(df, as.name(colName))
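select_() has been deprecated since dplyr 0.7. In current dplyr, a character vector of column names can be passed through all_of(). A sketch using the same toy data frame as above:

```r
library(dplyr)

colName <- "__ID"
df <- data.frame(`__ID` = c(1, 2, 3), `123` = c(4, 5, 6), check.names = FALSE)

out <- df %>%
  select(all_of(colName)) %>%                 # no backticks needed
  mutate(fk_table = "__Table", val = 1)

names(out)
```

The column keeps its original name, so it can be written back to the database unchanged.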

Convert character to numeric without NA in r

I know this question has been asked many times (Converting Character to Numeric without NA Coercion in R, Converting Character\Factor to Numeric without NA Coercion in R, etc.), but I cannot figure out what is going on in this one particular case (Warning message: NAs introduced by coercion). Here is some reproducible data I'm working with.
#dependencies
library(rvest)
library(dplyr)
library(pipeR)
library(stringr)
library(translateR)
#scrape data from website
url <- "http://irandataportal.syr.edu/election-data"
ir.pres2014 <- url %>%
read_html() %>%
html_nodes(xpath='//*[@id="content"]/div[16]/table') %>%
html_table(fill = TRUE)
ir.pres2014<-ir.pres2014[[1]]
colnames(ir.pres2014)<-c("province","Rouhani","Velayati","Jalili","Ghalibaf","Rezai","Gharazi")
ir.pres2014<-ir.pres2014[-1,]
#Get rid of unnecessary rows
ir.pres2014<-ir.pres2014 %>%
subset(province!="Votes Per Candidate") %>%
subset(province!="Total Votes")
#Get rid of commas
clean_numbers = function (x) str_replace_all(x, '[, ]', '')
ir.pres2014 = ir.pres2014 %>% mutate_each(funs(clean_numbers), -province)
#remove any possible whitespace in string
no_space = function (x) gsub(" ","", x)
ir.pres2014 = ir.pres2014 %>% mutate_each(funs(no_space), -province)
This is where things start going wrong for me. I tried each of the following lines of code but I got all NA's each time. For example, I begin by trying to convert the second column (Rouhani) to numeric:
#First check class of vector
class(ir.pres2014$Rouhani)
#convert character to numeric
ir.pres2014$Rouhani.num<-as.numeric(ir.pres2014$Rouhani)
Above returns a vector of all NA's. I also tried:
as.numeric.factor <- function(x) {seq_along(levels(x))[x]}
ir.pres2014$Rouhani2<-as.numeric.factor(ir.pres2014$Rouhani)
And:
ir.pres2014$Rouhani2<-as.numeric(levels(ir.pres2014$Rouhani))[ir.pres2014$Rouhani]
And:
ir.pres2014$Rouhani2<-as.numeric(paste(ir.pres2014$Rouhani))
All those return NA's. I also tried the following:
ir.pres2014$Rouhani2<-as.numeric(as.factor(ir.pres2014$Rouhani))
That created a list of single digit numbers so it was clearly not converting the string in the way I have in mind. Any help is much appreciated.
The reason is what looks like a leading space before the numbers:
> ir.pres2014$Rouhani
[1] " 1052345" " 885693" " 384751" " 1017516" " 519412" " 175608" …
Just remove that as well before the conversion. The situation is complicated by the fact that this character isn’t actually a space, it’s something else:
mystery_char = substr(ir.pres2014$Rouhani[1], 1, 1)
charToRaw(mystery_char)
# [1] c2 a0
These two bytes are the UTF-8 encoding of a non-breaking space (U+00A0), which often shows up in scraped HTML. It needs to be replaced:
str_replace_all(x, rawToChar(as.raw(c(0xc2, 0xa0))), '')
Furthermore, you can simplify your code by applying the same transformation to all your columns at once:
mystery_char = rawToChar(as.raw(c(0xc2, 0xa0)))
to_replace = sprintf('[,%s]', mystery_char)
clean_numbers = function (x) as.numeric(str_replace_all(x, to_replace, ''))
ir.pres2014 = ir.pres2014 %>% mutate_each(funs(clean_numbers), -province)
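The same cleaning can be written with the Unicode escape "\u00a0" instead of raw bytes. The sketch below assumes a few made-up vote strings rather than the scraped table; the pattern removes commas, ordinary whitespace and the non-breaking space before converting.

```r
library(stringr)

# Hypothetical values with a leading non-breaking space, as in the scraped data
raw_votes <- c("\u00a01,052,345", "\u00a0885,693", " 384,751")

# Strip commas, whitespace and U+00A0, then convert to numeric
clean_numbers <- function(x) {
  as.numeric(str_replace_all(x, "[,\\s\u00a0]", ""))
}

votes <- clean_numbers(raw_votes)
votes
```

With a current dplyr, this function could be applied to every column except province via mutate(across(-province, clean_numbers)), since mutate_each() and funs() are deprecated.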
