Alias for count in PySpark

I am new to PySpark and am trying to use an alias with the count function. If I wrap count in agg, the alias works, but without agg the alias gives me an error.
.(count("firstName").alias("cnt"))
doesn't work;
.agg(count("firstName").alias("cnt"))
works.
I would like to understand what is wrong with the first query.

You can try this:
.count().withColumnRenamed("count","cnt")
The count function cannot be aliased directly here, so rename the resulting "count" column instead.

Related

Why am I having errors with order of functions using %>% in R?

This is the code I am trying to run:
data_table<-data_table%>%
merge(new_table, by = 'Sample ID')%>%
mutate(Normalized_value = ((1.8^(data_table$Ubb - data_table$Ct_adj))*10000))
I want to first add the new column ("Ubb") from "new_table" and then add a calculated column using that new column. However, I get an error saying that Ubb column does not exist. So it's not performing merge before running mutate? When I separate the functions everything works fine:
data_table<-data_table%>%
merge(new_table, by = 'Sample ID')
data_table<-data_table%>%
mutate(Normalized_value = ((1.8^(data_table$Ubb - data_table$Ct_adj))*10000))
I would like to keep everything together just for style, but I'm also just curious, shouldn't R perform merge first and then mutate? How does order of operation during piping work?
Thank you!
You don't need to refer to the column name with the $ sign, i.e. use Normalized_value = ((1.8^(Ubb - Ct_adj))*10000),
because the column is available once the tables are merged. With the $ sign, data_table$Ubb still refers to the original data_table, even though the merge has already run, because the assignment with <- has not happened yet. The assignment takes place only after all operations in the pipe have finished.
Try running the code like this:
data_table<-data_table%>%
merge(new_table, by = 'Sample ID')%>%
mutate(Normalized_value = ((1.8^(Ubb - Ct_adj))*10000))
Notice that I'm not using the table name with the $ within the pipe. Your way tells mutate to look at a vector from data_table as it exists outside the pipe, which has not been merged yet, so it has no Ubb column. Just call the variable name within the pipe. It's easiest to think of %>% as 'and then': data_table, and then merge(), and then mutate(). You might also want to think about a left_join instead of a merge.

How do I write a dplyr pipe-friendly function where a new column name is provided from a function argument?

I'm hoping to produce a pipe-friendly function where a user specifies the "name of choice" for a new column produced by the function as one of the function arguments.
In the function below, I'd like name_for_elective to be something that the user can set at will, and afterwards, the user could expect that there will be a new column in their data with the name that they provided here.
I've looked at https://dplyr.tidyverse.org/reference/dplyr_data_masking.html, the mutate() function documentation, searched here, and tried working with https://dplyr.tidyverse.org/reference/rename.html, but to no avail.
elective_open<-function(.data,name_for_elective,course,tiebreaker){
name_for_elective<-rlang::ensym(name_for_elective)
course<-rlang::ensym(course)
tiebreaker<-rlang::ensym(tiebreaker)
.data%>%
mutate(!!name_for_elective =ifelse(!!tiebreaker==max(!!tiebreaker),1,0))%>%mutate(!!name_for_elective=ifelse(!!name_for_elective==0,!!course[!!name_for_elective==1],""))%>%
filter(!(!!course %in% !!name_for_elective))
}
I've included this example function because there are several references to the desired new column name, and I'm unsure if the context in which the reference is made changes syntax.
As you can see, I was hoping !!name_for_elective would let me name our new column, but no. I've played with {{}}, not using rlang::ensym, and still haven't got this figured out.
Any solution would be greatly appreciated.
This: Use dynamic variable names in `dplyr` may be helpful, but I can't seem to figure out how to extend this to work in the case where multiple references are made to the name argument.
Example data, per a good suggestion by @MrFlick, takes the form below:
dat<-tibble(ID=c("AA","BB","AA","BB","AA","BB"),Class=c("A_Class","B_Class","C_Class","D_Class","E_Class","F_Class"),
randomNo=c(.75,.43,.97,.41,.27,.38))
The user could then run something like:
dat2<-dat%>%
elective_open(MyChosenName,Class,randomNo)
A desired result, if using the function a single time, would be:
desired_result_1<-tibble(ID=c("AA","BB","AA","BB"),
Class=c("A_Class","D_Class","E_Class","F_Class"),
randomNo=c(.75,.41,.27,.38),
MyChosenName=c("C_Class","B_Class"))
The goal would be to allow this function to be used again if so desired, with a different name specified.
In the case where a user runs:
dat3<-dat%>%
elective_open(MyChosenName,Class,randomNo)%>%
mutate(Just_Another_One=1)%>%
elective_open(SecondName,Class,randomNo)
The output would be:
desired_result_2<-tibble(ID=c("AA","BB"),
Class=c("E_Class","F_Class"),
randomNo=c(.27,.38),
MyChosenName=c("C_Class","B_Class"),
Just_Another_One=c(1,1),
SecondName=c("A_Class","D_Class"))
In reality, there may be any number of records with the same ID, and any number of Class-es.
In this case you can just stick to using the embrace {{}} option for your variables. If you want to dynamically create column names, you're going to still need to use :=. The difference here is that you can use the glue-style syntax with the embrace operator to get the name of the symbol. This works with the data provided.
elective_open <- function(.data, name_for_elective, course, tiebreaker){
.data%>%
mutate("{{name_for_elective}}" := ifelse({{tiebreaker}}==max({{tiebreaker}}),1,0)) %>%
mutate("{{name_for_elective}}" := ifelse({{name_for_elective}}==0,{{course}}[{{name_for_elective}}==1],"")) %>%
filter(!({{course}} %in% {{name_for_elective}}))
}

Error in using contains() in filter command of dplyr

I am trying to filter on the values of the 2010 column, whose actual name is "Y2010". I know the easy way to get the output, but I am trying to use the contains() function to fetch the values of column Y2010 that are greater than 150000.
The code I used is:
filter(HistData, contains("2010")>150000)
This is not working. I am getting the following error:
Error in filter_impl(.data, quo) :
Evaluation error: No tidyselect variables were registered.
I couldn't understand what I am doing wrong.
contains() works fine when I use the select command:
select(HistData, contains("2010"))
Can anyone please explain what I am missing in the filter command?
This is because contains() is a "select" helper. Helpers such as ends_with() and contains() only work inside select-style functions, and I was using them in the filter command.

How to get the expression used in an Assign activity

In the line below, "ThenActivity" is an Assign activity nested inside the Then part of an If activity. I'm trying to get at the expression, but this snippet isn't working.
((Assign)ThenActivity).To.Expression.ToString();
This returns "1.13: CSharpReference" when it should read R = 44.5M, which is the expression text. How do I get at it?
The statement should read something like this
((CSharpValue)(((Assign)ThenActivity).Value.Expression)).ExpressionText
Note: You need to get the assignment, then its expression, cast that as a CSharpValue, then finally you can get the ExpressionText.

How to pass R variables as input to SearchES method in ElasticSearch?

I am working with the RElasticSearch package in R and am able to connect to the proper index in ElasticSearch. Suppose my index contains two fields, id and name. Two of my R variables, say rid and rname, contain the values I want to search for. How should I use the searchES method to accomplish this? I have tried:
searchES(server=es.index,query="id":rid & "name":rname)
but it keeps throwing an error! Can someone please help me out?
You need to build your query as a single character value for this to work. To concatenate strings in R, use paste() or paste0(). For example:
searchES(server=es.index,query=paste0("id:", rid, " AND name:", rname))
