I am using the sparklyr library.
I have a variable, wtd, which I copied to Spark:
copy_to(sc,wtd)
colnames(wtd) <- c("a","b","c","d","e","f","g")
Then I want to do a computation and store the result in Spark, not in my R environment.
When I try:
sdf_register(wtd %>% group_by(c, b) %>% filter(row_number() == 1) %>% count(d), "wtd2")
Error in UseMethod("sdf_register") :
no applicable method for 'sdf_register' applied to an object of class "c('tbl_df', 'tbl', 'data.frame')"
The command wtd2 = wtd %>% group_by(c, b) %>% filter(row_number() == 1) %>% count(d) works correctly, but that stores the result in my environment, not in Spark.
The first argument in your sequence of operations should be a "tbl_spark", not a regular data.frame. Your command,
wtd2 = wtd %>% group_by(c,b) %>% filter(row_number()==1) %>%count(d)
works because you are not using Spark at all, just normal R data.frames.
If you want to use it with Spark, first store the tbl_spark object that copy_to() returns when you copy your data.frame:
colnames(wtd) <- c("a","b","c","d","e","f","g")
wtd_tbl <- copy_to(sc, wtd)
Then, you can execute your data pipeline using sdf_register(wtd_tbl %>% ..., "wtd2").
If you execute the pipeline as defined, you will get an exception saying:
Error: org.apache.spark.sql.AnalysisException: Window function rownumber() requires window to be ordered
This is because, in order to use row_number() in Spark, you first need to provide an ordering, which you can do with arrange(). I assume you want your rows ordered by the columns "c" and "b", so your final pipeline would be something like this:
sdf_register(wtd_tbl %>%
               dplyr::group_by(c, b) %>%
               arrange(c, b) %>%
               dplyr::filter(row_number() == 1) %>%
               dplyr::count(d),
             "wtd2")
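Once registered, "wtd2" lives on the Spark side and can be queried lazily from R without collecting it (this assumes the connection sc from above):

```r
# Get a dplyr reference to the Spark-side table; nothing is pulled into R yet
wtd2 <- dplyr::tbl(sc, "wtd2")
```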
I hope this helps.
When trying to retrieve a linear regression equation I keep getting this error:
Error in eval(predvars, data, env) : object 'OBP' not found
Here is my chunk,
teams_oak %>%
  select(G:FP) %>%
  mutate(OBP = H + BB + HBP/AB + BB + HBP + SF) %>%
  ggplot(aes(x = OBP, y = W)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title = "OBP vs. Wins in MLB 1982-2002 ex:1994", x = "On Base Percentage", y = "Wins") %>%
  teams_oak %>%
  select(G:FP) %>%
  mutate(OBP = H + BB + HBP/AB + BB + HBP + SF) %>%
  m1 <- lm(data = teams_oak, formula = W ~ OBP)
I've tried piping all the functions together to make sure the code could see the mutated variable, but it seems it cannot find it.
When you do this:
teams_oak %>%
  select(G:FP) %>%
  mutate(OBP = H + BB + HBP/AB + BB + HBP + SF) %>%
  m1 <- lm(data = teams_oak, formula = W ~ OBP)
the last two lines combine the pipe operator with an assignment. Within that assignment you name teams_oak as the data, but because this happens inside the pipe and the pipe hasn't finished yet, teams_oak doesn't have the OBP variable, so the call fails.
I think simply omitting the last pipe operator and turning it into two assignment statements should work, with a blank line and indentation to indicate the end of the pipe:
teams_oak <- teams_oak %>%
  select(G:FP) %>%
  mutate(OBP = H + BB + HBP/AB + BB + HBP + SF)

m1 <- lm(data = teams_oak, formula = W ~ OBP)
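Separately from the pipe issue, note that the OBP formula as written relies on operator precedence: / binds tighter than +, so only HBP/AB is divided. If the standard on-base percentage was intended, the numerator and denominator need explicit parentheses. A quick base-R sketch with made-up season totals:

```r
# Hypothetical numbers, purely to illustrate operator precedence:
H <- 150; BB <- 60; HBP <- 5; AB <- 500; SF <- 4

obp_as_written <- H + BB + HBP/AB + BB + HBP + SF        # only HBP/AB is divided
obp_intended   <- (H + BB + HBP) / (AB + BB + HBP + SF)  # usual OBP definition

print(obp_as_written)  # 279.01
print(obp_intended)    # ~0.378
```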
Personally I don't use pipes, because they make it easy to break code this way; they are not always the best way to write clear code.
I have added a variable that is the sum of all policies for each customer:
mhomes %>% mutate(total_policies = rowSums(select(., starts_with("num"))))
However, when I now want to use this total_policies variable in plots or when using summary() it says: Error in summary(total_policies) : object 'total_policies' not found.
I don't understand what I did wrong or what I should do differently here.
This may be slightly roundabout, but I feel it solves the purpose. Assuming df is the dataset and it has customer_id, policy_id and policy_amount as variables, the command below should work:
req_output <- df %>%
  group_by(customer_id) %>%
  summarise(total_policies = sum(policy_amount))
If you still face the issue, convert the result to a data frame and try plotting again:
req_output <- as.data.frame(req_output)
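Note also that the likely root cause in the question is that the result of mutate() was never assigned back to mhomes, so total_policies exists only in the printed output. The same pattern can be reproduced in base R (the num_* column names here are invented for illustration):

```r
# Toy data standing in for `mhomes`:
mhomes <- data.frame(num_auto = c(1, 0), num_home = c(2, 1))

# Computing a column without assigning it does not change `mhomes`;
# assign the result so the new column is kept:
mhomes$total_policies <- rowSums(mhomes[, startsWith(names(mhomes), "num")])

print(mhomes$total_policies)
summary(mhomes$total_policies)
```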
I'm trying to create a simulation model that describes the dispensing process in a hospital pharmacy. I consider 3 main activities, i.e. verifying, dispensing and final checking. How can I define a "pooled resource", i.e. allow certain activities to draw on not just one but two types of resources if needed? In other words, if no final checkers are available (because they are busy in the process), how could I allow pharmacists to do this task, if they are available?
See the example code below. I did not manage to access the number of resources available at the current simulation time with get_server_count inside the trajectory in any way. I usually get error messages along the lines of:
Error in UseMethod("get_server_count") :
no applicable method for 'get_server_count' applied to an object of class "character"
I also tried an if statement in the trajectory to allow for a back-up resource to be used if the primary resource is not available. This got me the same message and an additional one:
In if (.) get_server_count("dispenser") > 0 else { :
the condition has length > 1 and only the first element will be used
See example code:
library(simmer)
library(dplyr)
set.seed(42)
#Defining Simmer environment:
pharmacy <- simmer("Dispensing Process")
#Defining 3 activities, i.e. verifying, dispensing, and final checking, and their
#durations:
dispProcess <- trajectory("dispensing process") %>%
  seize("pharmacist", 1) %>%
  timeout(5) %>%
  release("pharmacist", 1) %>%
  log_(get_server_count("dispenser")) %>%
  seize("dispenser", 1) %>%
  timeout(15) %>%
  release("dispenser", 1) %>%
  seize("final checker", 1) %>%
  timeout(5) %>%
  release("final checker", 1)
#Defining number of resources (i.e. staff) available:
pharmacy %>%
  add_resource("pharmacist", 2) %>%
  add_resource("dispenser", 4) %>%
  add_resource("final checker", 2) %>%
  add_generator("prescription", dispProcess, function() {10}, mon = 2)
#Defining length of simulation run:
pharmacy %>% run(until = 400)
pharmacy %>% get_mon_arrivals() %>% print()
With the code above, I would have expected the number of free "dispenser" resources at that point in simulation-time to be shown. This did not happen, as described above.
How can I access this information in the trajectory? Can I use if statement there to make seizing of certain types of resources dependent on their availability?
Exchanging the trajectory code to
dispProcess <- trajectory("dispensing process") %>%
  seize("pharmacist", 1) %>%
  timeout(5) %>%
  release("pharmacist", 1) %>%
  log_(get_server_count("dispenser")) %>%
  seize("dispenser", 1) %>%
  timeout(15) %>%
  release("dispenser", 1) %>%
  select(resources = c("pharmacist", "final checker"), policy = "shortest-queue") %>%
  seize_selected(amount = 1) %>%
  timeout(5) %>%
  release_selected(amount = 1)
got me the following error message:
Error in UseMethod("select_") : no applicable method for 'select_' applied to an object of class "c('trajectory', 'R6')"
This is strange, as I believe I have used the select command as described in the Advanced Trajectory Usage tutorial (https://cran.r-project.org/web/packages/simmer/vignettes/simmer-03-trajectories.html).
About the relevant line:
log_(get_server_count("dispenser")) %>%
See the help page for get_server_count: it requires two arguments, (1) the simulation environment and (2) the name of the resource, but you provided only the name of the resource.
Also, written this way the line executes the function immediately, whereas log_ needs to be given a function (or an anonymous function) to be evaluated later, during the simulation.
Finally, the output of get_server_count is numeric, as documented, but the log_ activity requires a string, so you need to convert it.
Putting these three points together, the line should read as follows:
log_(function() paste(get_server_count(pharmacy, "dispenser"))) %>%
About the if statement, control statements cannot be used at the activity level. If you need different paths depending on some condition, see the branch activity.
Finally, about select, both simmer and dplyr export a select function. If you load dplyr after simmer, select refers to the dplyr function, and you should call simmer's version as simmer::select.
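Putting the three fixes together with the namespace-qualified select, the trajectory from the question becomes (this is just the question's own pipeline with the corrections applied; it still assumes the pharmacy environment defined there):

```r
dispProcess <- trajectory("dispensing process") %>%
  seize("pharmacist", 1) %>%
  timeout(5) %>%
  release("pharmacist", 1) %>%
  # a function, evaluated at simulation time, with the result converted to a string:
  log_(function() paste(get_server_count(pharmacy, "dispenser"))) %>%
  seize("dispenser", 1) %>%
  timeout(15) %>%
  release("dispenser", 1) %>%
  # simmer::select avoids the clash with dplyr::select:
  simmer::select(resources = c("pharmacist", "final checker"),
                 policy = "shortest-queue") %>%
  seize_selected(amount = 1) %>%
  timeout(5) %>%
  release_selected(amount = 1)
```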
I'm working on a script for a swirl lesson on using the tidyr package, and I'm having some trouble with the %>% operator. I've got a data frame called passed that contains the name, class number, and final grade of 4 students. I want to add a new column called status and populate it with a character vector that says "passed". Before that, I used select to grab some columns from a data frame called students4 and stored them in a data frame called gradebook:
gradebook <- students4 %>%
  select(id, class, midterm, final) %>%
  passed <- passed %>% mutate(status = "passed")
Swirl problems build on each other, and the last one just had me running the first two lines of code, so I think those two are correct. The third line was what was suggested after a couple of wrong attempts, so I think there's something about %>% that I'm not understanding. When I run the code I get an error that says:
Error in students4 %>% select(id, class, midterm, final) %>% passed <- passed %>% :
could not find function "%>%<-"
I found another user who asked about the "could not find function "%>%" who was able to resolve the issue by installing the magrittr package, but that didn't do the trick for me. Any input on the issues in my code would be super appreciated!
It’s not a problem with the package or the operator; you’re trying to pipe into a new line that starts a new assignment. The %>% passes the previous data frame into the next function as that function’s data argument.
Instead of doing all of this:
Gradebook <- select(students4, id, class, midterm, final)
Gradebook2 <- mutate(Gradebook, test4 = 100)
Gradebook3 <- arrange(Gradebook2, desc(final))
You can use the pipe operator to feed the result into the next function if you’re working on the same dataframe.
Gradebook <- students4 %>%
  select(id, class, midterm, final) %>%
  mutate(test4 = 100) %>%
  arrange(desc(final))
Much cleaner and easier to read.
At the end of your second line you’re trying to pipe into a new function, but instead of a function you’re suddenly defining a variable. I don’t know the exercise you’re doing, but you should remove that second pipe operator:
gradebook <- students4 %>%
  select(id, class, midterm, final)

passed <- passed %>% mutate(status = "passed")
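The error message itself shows what went wrong: because the pipe chain ends in an assignment, R treats the whole piped expression as the assignment target and looks for a replacement function named `%>%<-`, which doesn't exist. Replacement functions are a base-R mechanism, illustrated here:

```r
# R rewrites `f(x) <- value` into a call to the replacement function `f<-`,
# e.g. `names(x) <- ...` really calls `names<-`:
x <- c(1, 2, 3)
names(x) <- c("a", "b", "c")   # equivalent to: x <- `names<-`(x, c("a", "b", "c"))
print(names(x))
```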
I am new to sparklyr and haven't had any formal training - which will become obvious after this question. I'm also more on the statistician side of the spectrum, which isn't helping. I'm getting an error after subsetting a Spark DataFrame.
Consider the following example:
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local[*]")
iris_tbl <- copy_to(sc, iris, name="iris", overwrite=TRUE)
#check column names
colnames(iris_tbl)
#subset so only a few variables remain
subdf <- iris_tbl %>%
  select(Sepal_Length, Species)
subdf <- spark_dataframe(subdf)
#error happens when I try this operation
spark_session(sc) %>%
  invoke("table", "subdf")
The error I'm getting is:
Error: org.apache.spark.sql.catalyst.analysis.NoSuchTableException
at org.apache.spark.sql.hive.client.ClientInterface$$anonfun$getTable$1.apply(ClientInterface.scala:122)
at org.apache.spark.sql.hive.client.ClientInterface$$anonfun$getTable$1.apply(ClientInterface.scala:122)
There are several other lines of the error.
I don't understand why I'm getting this error. "subdf" is a Spark DataFrame.
To understand why this doesn't work, you have to understand what happens when you copy_to. Internally, sparklyr registers a temporary table using the Spark metastore and treats it more or less like just another database table. This is why:
spark_session(sc) %>% invoke("table", "iris")
can find the "iris" table:
<jobj[32]>
class org.apache.spark.sql.Dataset
[Sepal_Length: double, Sepal_Width: double ... 3 more fields]
subdf, on the other hand, is just a plain local object. It is not registered in the metastore, hence it cannot be accessed using the Spark catalog. To make it work you can register the Spark DataFrame:
subdf <- iris_tbl %>%
  select(Sepal_Length, Species)

spark_dataframe(subdf) %>%
  invoke("createOrReplaceTempView", "subdf")
or use copy_to if the data is small enough to be handled by the driver:
subdf <- iris_tbl %>%
  select(Sepal_Length, Species) %>%
  copy_to(sc, ., name = "subdf", overwrite = TRUE)
If you work with Spark 1.x createOrReplaceTempView should be replaced with registerTempTable.
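With the temporary view registered, the catalog lookup from the question should now succeed (assuming the same connection sc and the createOrReplaceTempView call above):

```r
# Previously threw NoSuchTableException; now resolves via the catalog:
spark_session(sc) %>%
  invoke("table", "subdf")

# Or stay at the dplyr level:
dplyr::tbl(sc, "subdf")
```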