sparklyr: create new column with mutate function - r

I would be very surprised if this kind of problem could not be solved with sparklyr:
iris_tbl <- copy_to(sc, aDataFrame)
# date_vector is a character vector of element
# in this format: YYYY-MM-DD (year, month, day)
for (d in date_vector) {
...
aDataFrame %>% mutate(newValue=gsub("-","",d))
...
}
I receive this error:
Error: org.apache.spark.sql.AnalysisException: Undefined function: 'GSUB'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 2 pos 86
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.failFunctionLookup(SessionCatalog.scala:787)
at org.apache.spark.sql.hive.HiveSessionCatalog.lookupFunction0(HiveSessionCatalog.scala:200)
at org.apache.spark.sql.hive.HiveSessionCatalog.lookupFunction(HiveSessionCatalog.scala:172)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$13$$anonfun$applyOrElse$6$$anonfun$applyOrElse$39.apply(Analyzer.scala:884)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$13$$anonfun$applyOrElse$6$$anonfun$applyOrElse$39.apply(Analyzer.scala:884)
at org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$13$$anonfun
But with this line:
aDataFrame %>% mutate(newValue=toupper("hello"))
things work. Some help?

It may be worth adding that the available documentation states:
Hive Functions
Many of Hive’s built-in functions (UDF) and built-in aggregate functions (UDAF) can be called inside dplyr’s mutate and summarize. The Language Reference UDF page provides the list of available functions.
Hive
As stated in the documentation, a viable solution should be achievable with use of regexp_replace:
Returns the string resulting from replacing all substrings in
INITIAL_STRING that match the java regular expression syntax defined
in PATTERN with instances of REPLACEMENT. For example,
regexp_replace("foobar", "oo|ar", "") returns 'fb.' Note that some
care is necessary in using predefined character classes: using '\s' as
the second argument will match the letter s; '\\s' is necessary to
match whitespace, etc.
sparklyr approach
Considering the above, it should be possible to combine a sparklyr pipeline with
regexp_replace to achieve an effect equivalent to applying gsub to the desired column. Tested code that removes the - character from the variable d within sparklyr could be built as follows:
aDataFrame %>%
mutate(clnD = regexp_replace(d, "-", "")) %>%
# ...
where class(aDataFrame ) returns: "tbl_spark" ....

I would strongly recommend you read the sparklyr documentation before proceeding. In particular, you're going to want to read the section on how R is translated to SQL (http://spark.rstudio.com/dplyr.html#sql_translation). In short, only a very limited subset of R functions is available for use on sparklyr dataframes, and gsub is not one of those functions (but toupper is). If you really need gsub you're going to have to collect the data into a local dataframe, then gsub it (you can still use mutate), then copy_to back to Spark.
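The local step of that round trip can be sketched in base R. This is only an illustration: local_df is a hypothetical stand-in for what collect() would return, and the collect() and copy_to() calls themselves are left out because they need a live Spark connection.

```r
# Hypothetical local data frame standing in for the result of collect()
local_df <- data.frame(
  d = c("2017-01-31", "2017-02-01"),
  stringsAsFactors = FALSE
)

# gsub runs in plain R here, so no R-to-SQL translation is involved;
# fixed = TRUE treats "-" as a literal string rather than a regex
local_df$newValue <- gsub("-", "", local_df$d, fixed = TRUE)

print(local_df)
```

After this step the data frame would be sent back with copy_to(), as described above. For large tables the round trip can be expensive, since all rows have to fit in local memory.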

Related

Rename a column with R

I'm trying to rename a specific column in my R script using the colnames function but with no success so far.
I'm kinda new around programming so it may be something simple to solve.
Basically, I'm trying to rename a column called Reviewer Overall Notes and name it Nota Final in a data frame called notas with the codes:
colnames(notas$`Reviewer Overall Notes`) <- `Nota Final`
and it returns to me:
> colnames(notas$`Reviewer Overall Notes`) <- `Nota Final`
Error: object 'Nota Final' not found
I also found in [this post][1] a code that goes:
colnames(notas) [13] <- `Nota Final`
But it also returns the same message.
What am I doing wrong?
P.S. Sorry for any misspelling, English is not my primary language.
You probably want
colnames(notas)[colnames(notas) == "Reviewer Overall Notes"] <- "Nota Final"
(@Whatif's answer shows how you can do this with the numeric index, but it is probably better practice to do it this way: working with strings rather than column indices makes your code both easier to read [you can see what you're renaming] and more robust [in case the order of columns changes in the future].)
Alternatively,
notas <- notas %>% dplyr::rename(`Nota Final` = `Reviewer Overall Notes`)
Here you do use back-ticks, because tidyverse (of which dplyr is a part) prefers its arguments to be passed as symbols rather than strings.
Why use backticks? Use normal quotation marks.
colnames(notas)[13] <- 'Nota Final'
This seems to matter:
df <- data.frame(a = 1:4)
colnames(df)[1] <- `b`
Error: object 'b' not found
You should not use single or double quotes in naming:
I have learned that we should not use spaces in names. A name with spaces works, but it is called a non-syntactic name. According to Hadley Wickham's description in the Advanced R book, creating such names with quotes is possible for historical reasons:
"You can also create non-syntactic bindings using single or double quotes (e.g. "_abc" <- 1) instead of backticks, but you shouldn’t, because you’ll have to use a different syntax to retrieve the values. The ability to use strings on the left hand side of the assignment arrow is an historical artefact, used before R supported backticks."
For an overview of what syntactic names are, see ?make.names:
make.names("Nota Final")
[1] "Nota.Final"
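As a small sketch of the quoting rules discussed in this thread, the following uses a made-up two-column notas data frame (the real one is not shown in the question). The new name is passed as a string, and backticks are only needed when reading the non-syntactic column back.

```r
# Hypothetical stand-in for the notas data frame from the question;
# check.names = FALSE keeps the spaces in the column name
notas <- data.frame(
  id = 1:3,
  `Reviewer Overall Notes` = c(7, 8, 9),
  check.names = FALSE
)

# The replacement name is a character string, so it goes in quotes
colnames(notas)[colnames(notas) == "Reviewer Overall Notes"] <- "Nota Final"

# Backticks refer to an existing object or column with a non-syntactic name
nota_final <- notas$`Nota Final`

print(colnames(notas))
```

Matching on the old name rather than on position 13 means the rename keeps working if columns are added or reordered.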

Why am I having errors with order of functions using %>% in R?

This is the code I am trying to run:
data_table<-data_table%>%
merge(new_table, by = 'Sample ID')%>%
mutate(Normalized_value = ((1.8^(data_table$Ubb - data_table$Ct_adj))*10000))
I want to first add the new column ("Ubb") from "new_table" and then add a calculated column using that new column. However, I get an error saying that Ubb column does not exist. So it's not performing merge before running mutate? When I separate the functions everything works fine:
data_table<-data_table%>%
merge(new_table, by = 'Sample ID')
data_table<-data_table%>%
mutate(Normalized_value = ((1.8^(data_table$Ubb - data_table$Ct_adj))*10000))
I would like to keep everything together just for style, but I'm also just curious, shouldn't R perform merge first and then mutate? How does order of operation during piping work?
Thank you!
You don't need to refer to the column names with the $ sign, i.e. use Normalized_value = ((1.8^(Ubb - Ct_adj))*10000)
because the column is already merged at that point. With the $ sign, data_table still refers to the original data frame, which has no Ubb column, even though the merge has been done. The assignment operator has not run yet: the assignment only takes place after all operations in the pipeline have finished.
Try running the code like this:
data_table<-data_table%>%
merge(new_table, by = 'Sample ID')%>%
mutate(Normalized_value = ((1.8^(Ubb - Ct_adj))*10000))
Notice that I'm not using the table name with the $ within the pipe. Your way tells mutate to look up a vector in the original data_table, which does not have the Ubb column yet because the result of the merge has not been assigned to it. Just call the variable name within the pipe. It's easiest to think of the %>% as 'and then': take data_table, and then merge(), and then mutate(). You might also want to think about a left_join instead of a merge.
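The behaviour can be reproduced with a toy example. This is a sketch under assumptions: the two small tables and the id key are invented, and base R's native pipe |> (R 4.1 or later) and transform() are used as stand-ins for %>% and mutate() so that nothing beyond base R is needed.

```r
# Hypothetical stand-ins for the tables in the question
data_table <- data.frame(id = c("s1", "s2"), Ct_adj = c(20, 22))
new_table  <- data.frame(id = c("s1", "s2"), Ubb = c(24, 25))

# Before the pipeline result is assigned, the name data_table still points
# at the unmerged data frame, so data_table$Ubb does not exist
ubb_missing <- is.null(data_table$Ubb)

# Inside the pipeline, bare column names refer to the merged intermediate result
result <- data_table |>
  merge(new_table, by = "id") |>
  transform(Normalized_value = 1.8^(Ubb - Ct_adj) * 10000)

print(ubb_missing)
print(names(result))
```

The steps do run in order; the only issue in the original code is which object the name data_table refers to while the pipeline is still being evaluated.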

Switching the order of paste() in piping in R

I am fairly new to R and I would like to paste the string "exampletext" in front of each file name within the path.
csvList <- list.files(path = "./csv_by_subject") %>%
paste0("*exampletext")
Currently this code renders things like "csv*exampletext" and I want it to be "*exampletextcsv". I would like to continue using dplyr and piping - help appreciated!
As others pointed out, the pipe is not necessary here. But if you do want to use it, you just have to specify that the second argument to paste0 is "the thing you are piping", which you do using a period (.).
list.files(path = "./csv_by_subject") %>%
paste0("*exampletext", .)
paste0('exampletext', csvList) should do the trick. It's not necessarily using dplyr and piping, but it's taking advantage of the vectorization features that R provides.
If you'd like to paste *exampletext before all of the file names, you can reverse the order of what you're doing now using paste0 and passing the second argument as list.files. paste0 can handle vectors as the second argument and will apply the paste to each element.
csvList <- paste0("*exampletext", list.files(path = "./csv_by_subject"))
This returns a few examples from a local folder on my machine:
csvList
[1] "*exampletext_error_metric.m"
[2] "*exampletext_get_K_clusters.m"
...
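To try the argument order without touching the file system, the same call can be run on a hard-coded vector. The file names below are invented for illustration and stand in for the output of list.files.

```r
# Hypothetical file names standing in for list.files(path = "./csv_by_subject")
files <- c("subject01.csv", "subject02.csv")

# paste0 is vectorized: the prefix is recycled across every element of files
prefixed <- paste0("*exampletext", files)

# Swapping the arguments appends instead of prepends
suffixed <- paste0(files, "*exampletext")

print(prefixed)
print(suffixed)
```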

Use LaF and grepl together

I would like to read in a possibly large text file and filter the relevant lines on the fly based on a regular expression. My first approach was using the package LaF which supports chunkwise reading and then grepl to filter. However, this seems not to work:
library(LaF)
fh <- laf_open_csv("myfile.txt", column_types="string", sep="°")
# would be nice to declare *no* separator
fh[grepl("abc", fh[[1]]), ]
This returns an error in as.character.default(x) -- no method to convert this S4 to character. It seems like grepl is applied to the S4 object itself and not to the chunks.
Is there a nice way to read text lines from a large file and filter them selectively?
OK, I just discovered process_blocks:
regfilter <- function(df, result) c(result, df[grepl("1745", df[[1]]),1])
process_blocks(fh, regfilter)
This works. Now I only need to find a way to ignore separators.
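One possible way around the separator question, offered as a sketch rather than a LaF feature, is to skip CSV parsing entirely and read raw lines in chunks from a base R connection. The helper name filter_lines and the chunk size are assumptions made for this example.

```r
# Read a text file chunk by chunk and keep only lines matching a pattern.
# filter_lines is a hypothetical helper, not part of LaF.
filter_lines <- function(path, pattern, chunk_size = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  hits <- character(0)
  repeat {
    # readLines on an open connection continues where the last call stopped
    chunk <- readLines(con, n = chunk_size)
    if (length(chunk) == 0L) break
    hits <- c(hits, chunk[grepl(pattern, chunk)])
  }
  hits
}

# Small demonstration on a temporary file
tmp <- tempfile(fileext = ".txt")
writeLines(c("abc one", "nothing here", "two abc", "xyz"), tmp)
matches <- filter_lines(tmp, "abc", chunk_size = 2L)
print(matches)
```

Because each line is treated as a single string, no separator has to be declared. Only the matching lines are accumulated in memory, though growing hits with c() may become slow if very many lines match.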

Trouble interpolating column name for subset in R function

I have (fbodata) 'data.frame': 6181090 obs. of 41 variables:
I want to subset it and save the portion that pertains to a specific subset (like a zip). My approach seems to work when it is not in a function, but I ultimately want to use sapply.
nmakedir <- function(item, ccol) {
snipped a bunch of code that works
trim<- fbodata[ which(paste(ccol)==item),]
trim%>% drop_na(paste(ccol))
trim<- droplevels(trim
save(trim, file = paste(item, "rda", sep="."))
}
The line that doesn't work is the one where I am creating the subset with which. If I hardcode the line using fbodata$zip instead of paste(ccol) it works fine. Eventually, I plan to call it with something like:
sapply(unique(fbodata$zip),zip, FUN = nmakedir)
I appreciate any clues, I have been on this for a good long while.
A few things going on:
ccol is a string. paste(ccol) is the same string. You never need to call paste with only one argument. (You can use paste to coerce non-strings to strings, but in that case you should use as.character() to be clear.)
Keeping in mind that ccol is a string, what is fbodata$zip? It's a column! What is the equivalent using ccol and brackets? fbodata[[ccol]] or fbodata[, ccol]. You can use either of those interchangeably with fbodata$zip. So, this bad line
fbodata[ which(paste(ccol)==item),]
# should be this:
fbodata[which(fbodata[[ccol]] == item), ]
drop_na, like most dplyr functions, expects (quoting from the help) "bare variable names", not strings. Also from the help, "See Also: drop_na_ for a version that uses regular evaluation and is suitable for programming with". In this case, I don't think you need to do anything more than replace drop_na with drop_na_.
You are missing a right parenthesis on your droplevels command.
There might be more, but this is as much as I can see without any sample data. Your sapply call looks funny to me because I thought zip is supposed to be a column name, but when you call sapply(unique(fbodata$zip),zip, FUN = nmakedir) it needs to be an object in your global environment. I would think sapply(unique(fbodata$zip), 'zip', FUN = nmakedir) makes more sense, but without a reproducible example there's no way to know.
It also seems like you're coding your own version of split. I would probably start this off with fbo_split = split(fbodata, fbodata$zip) and then use lapply to drop_na_, droplevels, and save, but maybe your snipped code makes that a less good idea.
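The column lookup and the split idea can be sketched on a tiny made-up data frame. The fbodata below is a four-row stand-in, subset_by is a hypothetical helper name, and the save step is left out so that the example writes no files.

```r
# Hypothetical four-row stand-in for fbodata
fbodata <- data.frame(
  zip = c("10001", "10001", "94105", NA),
  amount = c(1, 2, 3, 4),
  stringsAsFactors = TRUE
)

# ccol is a string, so [[ ]] is used to fetch the column it names
subset_by <- function(data, item, ccol) {
  trim <- data[which(data[[ccol]] == item), ]
  droplevels(trim)
}

trim <- subset_by(fbodata, "10001", "zip")

# The split-based alternative: one data frame per zip, with unused levels dropped
pieces <- lapply(split(fbodata, fbodata$zip), droplevels)

print(trim)
print(names(pieces))
```

Here which() discards the row where zip is NA, and split() likewise leaves it out of every piece, so a separate NA-dropping step may not be needed for the grouping column itself.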

Resources