I need to create a line ID column within a dataframe for further pre-processing steps. The code worked fine up until yesterday. Today, however, I am facing the following error message:
Error in mutate():
ℹ In argument: line_id = (function (x, y) ...
Caused by error:
! Can't convert y to match type of x.
Here is my code - the dataframe consists of two character columns:
split_text <- raw_text %>%
  mutate(text = enframe(strsplit(text, split = "\n"))) %>%
  unnest(cols = c(text)) %>%
  unnest(cols = c(value)) %>%
  rename(text_raw = value) %>%
  select(-name) %>%
  mutate(doc_id = str_remove(doc_id, ".txt")) %>%
  # removing empty rows + add line_id
  mutate(line_id = row_number())
Besides row_number(), I also tried rowid_to_column(), and even c(1:1000) (the number of rows in the dataframe). The error message stays the same.
Try explicitly specifying the data type of the "line_id" column as an integer using the as.integer() function, like this:
mutate(line_id = as.integer(row_number()))
This works, but it is not fully satisfying, since I have to break the pipe:
split_text$line_id <- as.integer(1:nrow(split_text))
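For what it's worth, a minimal self-contained sketch (with made-up toy data standing in for split_text) showing that the as.integer(row_number()) fix can stay inside the pipe:

```r
library(dplyr)
library(tibble)

# toy stand-in for the real split_text data (values are invented)
raw <- tibble(
  doc_id   = c("a.txt", "a.txt", "b.txt"),
  text_raw = c("first line", "second line", "other doc")
)

split_text <- raw %>%
  mutate(line_id = as.integer(row_number()))  # integer line index, pipe intact
```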
Here is the sample data:
sample,fit_result,Site,Dx_Bin,dx,Hx_Prev,Hx_of_Polyps,Age,Gender,Smoke,Diabetic,Hx_Fam_CRC,Height,Weight,NSAID,Diabetes_Med,stage
2003650,0,U Michigan,High Risk Normal,normal,0,1,64,m,,0,1,182,120,0,0,0
2005650,0,U Michigan,High Risk Normal,normal,0,1,61,m,0,0,0,167,78,0,0,0
2007660,26,U Michigan,High Risk Normal,normal,0,1,47,f,0,0,1,170,63,0,0,0
2009650,10,Toronto,Adenoma,adenoma,0,1,81,f,1,0,0,168,65,1,0,0
2013660,0,U Michigan,Normal,normal,0,0,44,f,0,0,0,170,72,1,0,0
2015650,0,Dana Farber,High Risk Normal,normal,0,1,51,f,1,0,0,160,67,0,0,0
2017660,7,Dana Farber,Cancer,cancer,1,1,78,m,1,1,0,172,78,0,1,3
2019651,19,U Michigan,Normal,normal,0,0,59,m,0,0,0,177,65,0,0,0
2023680,0,Dana Farber,High Risk Normal,normal,1,1,63,f,1,0,0,154,54,0,0,0
2025653,1509,U Michigan,Cancer.,cancer,1,1,67,m,1,0,0,167,58,0,0,4
2027653,0,Toronto,Normal,normal,0,0,65,f,0,0,0,167,60,0,0,0
Below is the R code:
library(tidyverse)
h <- 'Height'
w <- 'Weight'
data %>% select(h) %>% filter(h > 180)
I can see only the Height column in the output, but the filter is not applied. I don't get any error when I run the code. Similarly, the code below also does not work:
s <- 'Site'
data %>% select(s) %>% mutate(s = str_replace(s," ","_"))
Output:
Site s
1 U Michigan Site
2 U Michigan Site
3 U Michigan Site
4 Toronto Site
I want to replace the space in the Site column, but obviously it is not recognizing s and instead creates a new column named s.
I tried running the code below and still face the same issue.
exp <- substitute(s <- 'Site')
r <- eval(exp,data)
data %>% select(r) %>% mutate(r = str_replace(s," ","_"))
I searched everywhere and could not find a solution; any help would be great. Thanks in advance. (I know the normal way to do it; I just want to be able to pass variables to the function.)
We can convert the string to a symbol with rlang::sym() and evaluate it with !!. Also, to assign on the lhs of the operator, use := instead of = and unquote the name with !!:
library(dplyr)
library(stringr)
data %>%
  select(all_of(s)) %>%
  mutate(!!s := str_replace(!!rlang::sym(s), " ", "_"))
Similarly for the filter:
data %>%
  select(all_of(h)) %>%
  filter(!!rlang::sym(h) > 180)
Yet another option is to pass the variables to across() (for filter, you can also use if_any()/if_all()), which loops over one or more columns:
data %>%
  select(all_of(s)) %>%
  mutate(across(all_of(s), ~ str_replace(.x, " ", "_")))
Or use .data
data %>%
  select(all_of(s)) %>%
  mutate(!!s := str_replace(.data[[s]], " ", "_"))
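Putting the approaches together, a small self-contained check on toy data (the Site and Height values below are invented, mimicking the sample above):

```r
library(dplyr)
library(stringr)

data <- tibble(
  Site   = c("U Michigan", "Toronto"),
  Height = c(182, 168)
)

s <- "Site"
h <- "Height"

# mutate via the .data pronoun, assigning back to the column named by s
sites <- data %>% mutate(!!s := str_replace(.data[[s]], " ", "_"))

# filter via sym() + unquoting
tall <- data %>% filter(!!rlang::sym(h) > 180)
```

Both calls now operate on the column named by the string, not on the string itself.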
I am trying to follow a vignette, "How to make a Markov Chain" (http://datafeedtoolbox.com/attribution-theory-the-two-best-models-for-algorithmic-marketing-attribution-implemented-in-apache-spark-and-r/).
This tutorial is interesting because it uses the same data source as I do. But part of the code uses Spark SQL (which I got from my previous question, "Concat_ws() function in Sparklyr is missing").
My question: I googled a lot and tried to solve this by myself, but I have no idea how, since I don't know exactly what the data should look like (the author didn't give an example of his dataframe before and after the function).
How can I transform this piece of code into "normal" R code (without using Spark)? Especially the concat_ws and collect_list functions are causing trouble.
He is using this line of code:
channel_stacks = data_feed_tbl %>%
  group_by(visitor_id, order_seq) %>%
  summarize(
    path = concat_ws(" > ", collect_list(mid_campaign)),
    conversion = sum(conversion)
  ) %>%
  ungroup() %>%
  group_by(path) %>%
  summarize(
    conversion = sum(conversion)
  ) %>%
  filter(path != "") %>%
  collect()
From my previous question, I know that we can replace part of the code:
concat_ws() can be replaced by the paste() function.
But another part of the code is jumping in:
collect_list() # description: aggregate function that returns a list of objects, with duplicates.
I hope that I described this question as clear as possible.
paste() can collapse a string vector into a single string, using the separator provided via its collapse parameter.
This acts as a drop-in replacement for concat_ws(" > ", collect_list(mid_campaign)):
channel_stacks = data_feed_tbl %>%
  group_by(visitor_id, order_seq) %>%
  summarize(
    path = paste(mid_campaign, collapse = " > "),
    conversion = sum(conversion)
  ) %>%
  ungroup() %>%
  group_by(path) %>%
  summarize(
    conversion = sum(conversion)
  ) %>%
  filter(path != "")
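As a sanity check, a toy example (invented visitor and campaign values, not the author's data) showing that paste(collapse = ...) reproduces the concat_ws/collect_list pattern:

```r
library(dplyr)

feed <- tibble(
  visitor_id   = c(1, 1, 2),
  order_seq    = c(1, 1, 1),
  mid_campaign = c("email", "search", "display"),
  conversion   = c(0, 1, 0)
)

channel_stacks <- feed %>%
  group_by(visitor_id, order_seq) %>%
  summarize(
    path = paste(mid_campaign, collapse = " > "),  # joins each visitor's campaigns
    conversion = sum(conversion),
    .groups = "drop"
  ) %>%
  group_by(path) %>%
  summarize(conversion = sum(conversion)) %>%
  filter(path != "")
```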
If I read in a Stata or SAS dataset with labels using haven, it will be (at least in haven 0.2.0) read with the following format:
library(dplyr)
df1 <- data_frame(
  fips = structure(c(1001, 1001, 1001, 1001, 1001), label = "FIPS (numeric)"),
  id = structure(letters[1:5], label = "ID")
)
df2 <- data_frame(
  fips = structure(c(1001, 1003, 1005, 1007, 1009), label = "FIPS (numeric)"),
  state = structure("AL", label = "State Abbreviation")
)
(If necessary, I can post some Stata data that produces this, but this should be easy to verify using any labeled Stata/SAS dataset.)
When I try to use any of the dplyr join functions to join on a labeled column, I am sorely disappointed:
df1 %>% inner_join(df2)
returns the error
Error in eval(expr, envir, enclos) : cannot join on columns 'fips' x 'fips':
  Can't join on 'fips' x 'fips' because of incompatible types (numeric / numeric)
The only way to avoid it seems to be to remove the labels on the join variables:
df1 %>%
  mutate(fips = `attr<-`(fips, 'label', NULL)) %>%
  inner_join(df2 %>% mutate(fips = `attr<-`(fips, 'label', NULL)))
which raises the question of why the labels were read in the first place. (The join also obliterates the labels in df2.)
This would seem to be a bug in the way haven and dplyr interact. Is there a better solution?
Try converting the columns to character strings. This seems to work:
df1$fips<-as.character(df1$fips)
df2$fips<-as.character(df2$fips)
df1 %>% inner_join(df2)
The help page for inner_join does state that by should be "a character vector of variables to join by".
When dplyr joins on a variable that is a factor in one dataset and a character in the other, it emits a warning but completes the join. Numeric and character vectors, however, are not compatible classes, so it errors out. By converting both columns to character, the join works fine:
library(dplyr)
df1 %>%
  mutate(fips = as.character(fips)) %>%
  inner_join(
    df2 %>%
      mutate(fips = as.character(fips))
  )
This was fixed at some point, and works in dplyr 0.7.4. I can't track down the exact version where it was fixed.
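If you are stuck on an older dplyr, another option is to strip the labels explicitly before joining. A sketch, assuming a haven version that provides zap_label() (which removes the variable-label attribute from every column of a data frame); the frames below are small invented stand-ins for df1/df2:

```r
library(dplyr)
library(haven)

# small labelled frames like the ones above (values invented)
df1 <- tibble(
  fips = structure(c(1001, 1003), label = "FIPS (numeric)"),
  id   = c("a", "b")
)
df2 <- tibble(
  fips  = structure(c(1001, 1005), label = "FIPS (numeric)"),
  state = c("AL", "AL")
)

joined <- df1 %>%
  zap_label() %>%
  inner_join(df2 %>% zap_label(), by = "fips")
```

Unlike the as.character() workaround, this keeps the join column numeric.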