I need to create a line ID column within a dataframe for further pre-processing steps. The code worked fine up until yesterday. Today, however, I am facing the following error message:
Error in mutate():
ℹ In argument: line_id = (function (x, y) ...
Caused by error:
! Can't convert y to match type of x.
Here is my code - the dataframe consists of two character columns:
split_text <- raw_text %>%
  mutate(text = enframe(strsplit(text, split = "\n"))) %>%
  unnest(cols = c(text)) %>%
  unnest(cols = c(value)) %>%
  rename(text_raw = value) %>%
  select(-name) %>%
  mutate(doc_id = str_remove(doc_id, ".txt")) %>%
  # removing empty rows + add line_id
  mutate(line_id = row_number())
Besides row_number(), I also tried rowid_to_column(), and even c(1:1000) (the number of rows in the dataframe). The error message stays the same.
Try explicitly specifying the data type of the "line_id" column as an integer using the as.integer() function, like this:
mutate(line_id = as.integer(row_number()))
This works, but it is not fully satisfying, since I have to break the pipe:
split_text$line_id <- as.integer(1:nrow(split_text))
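For what it's worth, a minimal self-contained sketch (with made-up toy data standing in for split_text) showing that the as.integer(row_number()) fix can stay inside the pipe:

```r
library(dplyr)
library(tibble)

# toy stand-in for the real split_text data (values are invented)
raw <- tibble(
  doc_id   = c("a.txt", "a.txt", "b.txt"),
  text_raw = c("first line", "second line", "other doc")
)

split_text <- raw %>%
  mutate(line_id = as.integer(row_number()))  # integer line index, pipe intact
```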
Here is the sample data:
sample,fit_result,Site,Dx_Bin,dx,Hx_Prev,Hx_of_Polyps,Age,Gender,Smoke,Diabetic,Hx_Fam_CRC,Height,Weight,NSAID,Diabetes_Med,stage
2003650,0,U Michigan,High Risk Normal,normal,0,1,64,m,,0,1,182,120,0,0,0
2005650,0,U Michigan,High Risk Normal,normal,0,1,61,m,0,0,0,167,78,0,0,0
2007660,26,U Michigan,High Risk Normal,normal,0,1,47,f,0,0,1,170,63,0,0,0
2009650,10,Toronto,Adenoma,adenoma,0,1,81,f,1,0,0,168,65,1,0,0
2013660,0,U Michigan,Normal,normal,0,0,44,f,0,0,0,170,72,1,0,0
2015650,0,Dana Farber,High Risk Normal,normal,0,1,51,f,1,0,0,160,67,0,0,0
2017660,7,Dana Farber,Cancer,cancer,1,1,78,m,1,1,0,172,78,0,1,3
2019651,19,U Michigan,Normal,normal,0,0,59,m,0,0,0,177,65,0,0,0
2023680,0,Dana Farber,High Risk Normal,normal,1,1,63,f,1,0,0,154,54,0,0,0
2025653,1509,U Michigan,Cancer.,cancer,1,1,67,m,1,0,0,167,58,0,0,4
2027653,0,Toronto,Normal,normal,0,0,65,f,0,0,0,167,60,0,0,0
Below is the R code:
library(tidyverse)
h <- 'Height'
w <- 'Weight'
data %>% select(h) %>% filter(h > 180)
I can see only the Height column in the output, but the filter is not applied. I don't get any error when I run the code. Similarly, the code below also does not work:
s <- 'Site'
data %>% select(s) %>% mutate(s = str_replace(s," ","_"))
Output:
Site s
1 U Michigan Site
2 U Michigan Site
3 U Michigan Site
4 Toronto Site
I want to replace the space in the Site column, but obviously it is not recognizing s and instead creates a new column named s.
I tried running the code below and still face the same issue.
exp <- substitute(s <- 'Site')
r <- eval(exp,data)
data %>% select(r) %>% mutate(r = str_replace(s," ","_"))
I searched everywhere and could not find a solution; any help would be great. Thanks in advance. (I know the normal way to do it; I just want to be able to pass variables to the function.)
We can convert the string to a symbol with rlang::sym() and evaluate it with !!. Also, to assign on the lhs of the operator, use := instead of = and unquote the name with !!:
library(dplyr)
library(stringr)
data %>%
  select(all_of(s)) %>%
  mutate(!!s := str_replace(!!rlang::sym(s), " ", "_"))
Similarly for the filter:
data %>%
  select(all_of(h)) %>%
  filter(!!rlang::sym(h) > 180)
Yet another option is to pass the variables to across() (for filter, you can also use if_any()/if_all()), which loops over one or more columns:
data %>%
  select(all_of(s)) %>%
  mutate(across(all_of(s), ~ str_replace(.x, " ", "_")))
Or use .data
data %>%
  select(all_of(s)) %>%
  mutate(!!s := str_replace(.data[[s]], " ", "_"))
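Putting the approaches together, a small self-contained check on toy data (the Site and Height values below are invented, mimicking the sample above):

```r
library(dplyr)
library(stringr)

data <- tibble(
  Site   = c("U Michigan", "Toronto"),
  Height = c(182, 168)
)

s <- "Site"
h <- "Height"

# mutate via the .data pronoun, assigning back to the column named by s
sites <- data %>% mutate(!!s := str_replace(.data[[s]], " ", "_"))

# filter via sym() + unquoting
tall <- data %>% filter(!!rlang::sym(h) > 180)
```

Both calls now operate on the column named by the string, not on the string itself.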
I am trying to follow a vignette, "How to make a Markov Chain" (http://datafeedtoolbox.com/attribution-theory-the-two-best-models-for-algorithmic-marketing-attribution-implemented-in-apache-spark-and-r/).
This tutorial is interesting because it uses the same data source as I do. But part of the code uses Spark SQL (which I got from my previous question, "Concat_ws() function in Sparklyr is missing").
My question: I googled a lot and tried to solve this by myself, but I have no idea how, since I don't know exactly what the data should look like (the author didn't give an example of his dataframe before and after the function).
How can I transform this piece of code into "normal" R code (without using Spark)? Especially the concat_ws and collect_list functions are causing trouble.
He is using this line of code:
channel_stacks = data_feed_tbl %>%
  group_by(visitor_id, order_seq) %>%
  summarize(
    path = concat_ws(" > ", collect_list(mid_campaign)),
    conversion = sum(conversion)
  ) %>%
  ungroup() %>%
  group_by(path) %>%
  summarize(
    conversion = sum(conversion)
  ) %>%
  filter(path != "") %>%
  collect()
From my previous question, I know that we can replace part of the code:
concat_ws() can be replaced by the paste() function.
But another part of the code is jumping in:
collect_list() # description: aggregate function that returns a list of objects, with duplicates.
I hope that I described this question as clear as possible.
paste() can collapse a string vector into a single string, using the separator provided via its collapse parameter.
This acts as a drop-in replacement for concat_ws(" > ", collect_list(mid_campaign)):
channel_stacks = data_feed_tbl %>%
  group_by(visitor_id, order_seq) %>%
  summarize(
    path = paste(mid_campaign, collapse = " > "),
    conversion = sum(conversion)
  ) %>%
  ungroup() %>%
  group_by(path) %>%
  summarize(
    conversion = sum(conversion)
  ) %>%
  filter(path != "")
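As a sanity check, a toy example (invented visitor and campaign values, not the author's data) showing that paste(collapse = ...) reproduces the concat_ws/collect_list pattern:

```r
library(dplyr)

feed <- tibble(
  visitor_id   = c(1, 1, 2),
  order_seq    = c(1, 1, 1),
  mid_campaign = c("email", "search", "display"),
  conversion   = c(0, 1, 0)
)

channel_stacks <- feed %>%
  group_by(visitor_id, order_seq) %>%
  summarize(
    path = paste(mid_campaign, collapse = " > "),  # joins each visitor's campaigns
    conversion = sum(conversion),
    .groups = "drop"
  ) %>%
  group_by(path) %>%
  summarize(conversion = sum(conversion)) %>%
  filter(path != "")
```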
If I read in a Stata or SAS dataset with labels using haven, it will be (at least in haven 0.2.0) read with the following format:
library(dplyr)
df1 <- data_frame(
  fips = structure(c(1001, 1001, 1001, 1001, 1001), label = "FIPS (numeric)"),
  id = structure(letters[1:5], label = "ID")
)
df2 <- data_frame(
  fips = structure(c(1001, 1003, 1005, 1007, 1009), label = "FIPS (numeric)"),
  state = structure("AL", label = "State Abbreviation")
)
(If necessary, I can post some Stata data that produces this, but this should be easy to verify using any labeled Stata/SAS dataset.)
When I try to use any of the dplyr join functions to join on a labeled column, I am sorely disappointed:
df1 %>% inner_join(df2)
returns the error
Error in eval(expr, envir, enclos) : cannot join on columns 'fips' x 'fips':
  Can't join on 'fips' x 'fips' because of incompatible types (numeric / numeric)
The only way to avoid it seems to be to remove the labels on the join variables:
df1 %>%
  mutate(fips = `attr<-`(fips, 'label', NULL)) %>%
  inner_join(df2 %>% mutate(fips = `attr<-`(fips, 'label', NULL)))
which raises the question of why the labels were read in the first place. (The join also obliterates the labels in df2.)
This would seem to be a bug in the way haven and dplyr interact. Is there a better solution?
Try converting the columns to character strings. This seems to work:
df1$fips<-as.character(df1$fips)
df2$fips<-as.character(df2$fips)
df1 %>% inner_join(df2)
The help page for inner_join does state that by should be "a character vector of variables to join by".
When dplyr joins on a variable that is a factor in one dataset and a character in the other, it emits a warning but completes the join. Numeric and character vectors, however, are not compatible classes, so it errors out. By converting both columns to character, the join works fine:
library(dplyr)
df1 %>%
  mutate(fips = as.character(fips)) %>%
  inner_join(
    df2 %>%
      mutate(fips = as.character(fips))
  )
This was fixed at some point, and works in dplyr 0.7.4. I can't track down the exact version where it was fixed.
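If you are stuck on an older dplyr, another option is to strip the labels explicitly before joining. A sketch, assuming a haven version that provides zap_label() (which removes the variable-label attribute from every column of a data frame); the frames below are small invented stand-ins for df1/df2:

```r
library(dplyr)
library(haven)

# small labelled frames like the ones above (values invented)
df1 <- tibble(
  fips = structure(c(1001, 1003), label = "FIPS (numeric)"),
  id   = c("a", "b")
)
df2 <- tibble(
  fips  = structure(c(1001, 1005), label = "FIPS (numeric)"),
  state = c("AL", "AL")
)

joined <- df1 %>%
  zap_label() %>%
  inner_join(df2 %>% zap_label(), by = "fips")
```

Unlike the as.character() workaround, this keeps the join column numeric.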