When using ARRAY_AGG it removes my record - u-sql

I am trying to use Array_agg to help combine some records, but it seems to be removing the record when I try to use it. The format of what I have written is below. All the records which are individual stay.
SELECT
columnA,
columnB,
columnC,
string.Join(", ", ARRAY_AGG(column1)) AS col1,
string.Join(", ", ARRAY_AGG(column2)) AS col2,
SUM(column3) AS col3,
SUM(column4) AS col4
FROM table
GROUP BY columnA, columnB,columnC

Related

Left join without returning NA values

I am using R and the library dplyr.
I want to join a larger database with a smaller database (in terms of rows).
I use left join because I want to have a final database that has the same number of rows as the larger one.
This naturally returns NA values when the smaller database does not have a value corresponding to the joining key.
What I want to achieve is sort of copying the previous values of the smaller database into the rows where NA is returned by the left join.
In other words:
if is.na(columnvalue[j]) == TRUE then
columnvalue[j] = columnvalue[j-1]
where columnvalue is a joined column from the smaller database and j = 1,..., nrow(largerdataset).
A loop with that if statement should work, but it is a bit cumbersome. Is there any other smarter solution?
Thank you.
If you update the question with some sample data, I could provide full code for this. The general solution is to use fill() from the tidyr package, possibly after a group_by() on the key if needed. You would just write it as:
library(tidyverse)
data %>%
# group_by(key) %>%
tidyr::fill(var1, var2, var3, .direction = "down")
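Since the question has no sample data, here is a minimal sketch on made-up data frames (`large`, `small`, `key` and `value` are all hypothetical names). Copying the previous value into each NA corresponds to filling downward, and the result depends on row order, so the sketch sorts by the key first.

```r
library(dplyr)
library(tidyr)

# Hypothetical sample data: `large` has every key, `small` only some of them
large <- data.frame(key = 1:6)
small <- data.frame(key = c(1L, 4L), value = c("a", "b"))

joined <- large %>%
  left_join(small, by = "key") %>%   # unmatched keys get NA in `value`
  arrange(key) %>%                   # filling depends on row order
  fill(value, .direction = "down")   # copy the last non-NA value into each NA

joined$value
```

Keys 2 and 3 take the value from key 1, and keys 5 and 6 take the value from key 4. Any NA that comes before the first matched row stays NA, because there is no previous value to copy.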

Order a dataframe primarily by the count of times a value appears in a column and secondarily by a second column

I have the data frame below, which I would like to order primarily by the number of times a value appears in the first column (larger counts first) and secondarily alphabetically (A-Z) by the second column.
Name<-c("jack","jack","bob","david","mary")
Surname<-c("arf","dfg","hjk","dfgg","bn")
n1<-data.frame(Name, Surname)
It should be something like:
n1<-n1[
order( n1[,1], n1[,2] ),
]
but I do not know how to order numerically based on count of values.
As suggested by @thelatemail, you can do this in base R using:
n1[order(-table(n1$Name)[n1$Name], n1$Surname), ]
To sort by surname first, swap the arguments to order() around.
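A variant of the same base R idea, as a rough sketch: compute the per-row count with ave() instead of indexing into table(), then pass it to order(). The helper name `name_count` is just for illustration.

```r
Name <- c("jack", "jack", "bob", "david", "mary")
Surname <- c("arf", "dfg", "hjk", "dfgg", "bn")
n1 <- data.frame(Name, Surname, stringsAsFactors = FALSE)

# For each row, how many times its Name occurs in the whole column
name_count <- ave(seq_along(n1$Name), n1$Name, FUN = length)

# Larger counts first, then Surname A-Z within the same count
sorted <- n1[order(-name_count, n1$Surname), ]
sorted
```

With this data the two jack rows come first (arf, then dfg), followed by the names that appear once, ordered by surname: mary bn, david dfgg, bob hjk.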
Using sqldf, like the following:
library(sqldf)
n2 <- sqldf('SELECT * FROM
n1 JOIN (SELECT Name, COUNT(*) as C FROM n1 GROUP BY Name) as T
on n1.Name = T.Name
ORDER BY C DESC, Surname')
This first groups by name to get the counts, then sorts by the count in descending order and by Surname alphabetically.
Using dplyr like the following:
library(dplyr)
n1 %>%
as_tibble() %>%
count(Name) %>%
right_join(n1, by = "Name") %>% # join the count back to your original frame
arrange(desc(n), Surname) %>% # order by highest count first, then Surname
select(-n) # just return the original frame, with the sorting you asked for

How to get distinct values of a column in rquery?

Exploring the rquery package of John Mount, Win-Vector LLC, is there a way that I could get the distinct values of a column from a SQL table using the rquery package functions? (WITHOUT writing the appropriate SQL query but using the rquery functions since I need to use my code in Oracle, MSSQL and Postgres).
So I do not need:
rq_get_query(db, "SELECT DISTINCT (COL1) FROM TABLE1")
but I am looking for something similar to unique of base R.
I would use the sqldf package. It is very accessible, and I think you would benefit from it.
install.packages("sqldf")
library(sqldf)
df = sqldf("SELECT DISTINCT COL1 FROM TABLE1")
View(df)
The following returns the distinct values of Col1 and Col2. It can of course be any number of columns.
db_td(connection, "table") %.>%
project(., groupby = c("Col1", "Col2"), one = 0) %.>%
execute(connection, .)
The assignment of 0 to a new column is currently necessary. This is supposed to be fixed in the next update of rquery, so that it will work like this:
project(., groupby = c("Col1", "Col2"))

Using multiple columns in dplyr window functions?

Coming from SQL, I would expect to be able to do something like the following in dplyr. Is this possible?
# R
tbl %>% mutate(n = dense_rank(Name, Email))
-- SQL
SELECT Name, Email, DENSE_RANK() OVER (ORDER BY Name, Email) AS n FROM tbl
Also, is there an equivalent for PARTITION BY?
I did struggle with this problem and here is my solution:
In case you can't find any function which supports ordering by multiple variables, I suggest that you concatenate them by their priority level from left to right using paste().
Below is the code sample:
tbl %>%
mutate(n = dense_rank(paste(Name, Email))) %>%
arrange(Name, Email) %>%
view()
Moreover, I guess group_by is the equivalent of PARTITION BY in SQL.
The drawback of this solution is that it only works when all of the variables are ordered in the same direction. If you need to order by multiple columns with different directions, say one ascending and one descending, I suggest you try this:
Calculate rank with ties based on more than one variable
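For the mixed-direction case, one possible base R sketch (not necessarily what the linked answer does) is to sort with order(), then number the distinct (Name, Email) pairs in sorted order. The sample rows are made up, and xtfrm() is used only to get a numeric sort key that can be negated for the descending column.

```r
# Hypothetical sample data
tbl <- data.frame(
  Name  = c("bob", "ann", "bob", "ann"),
  Email = c("b@x.org", "a@x.org", "a@x.org", "a@x.org"),
  stringsAsFactors = FALSE
)

# Name ascending, Email descending
ord <- order(tbl$Name, -xtfrm(tbl$Email))
sorted <- tbl[ord, ]

# In sorted order, each new (Name, Email) pair starts the next rank
rank_sorted <- cumsum(!duplicated(sorted[, c("Name", "Email")]))

# Write the ranks back in the original row order
n <- integer(nrow(tbl))
n[ord] <- rank_sorted
tbl$n <- n
tbl
```

Rows with identical Name and Email share a rank and there are no gaps between ranks, which is the behaviour of DENSE_RANK(). To emulate PARTITION BY, the same steps could be run separately within each group.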

From SAS to R - proc sort nodupkey

I'm translating a SAS script to R, but I don't know how SAS works...
I have this piece of code:
proc sort data=table
(keep= Field1 Field2 Field3 Field4 Field5)
out=table_nodup nodupkey;
by Field1 Field2 Field4;
run;
I don't know what the code will do and then I don't know how to translate it to R...any help? :)
According to this paper, I'd say it can be written with dplyr as follows:
library(dplyr)
table %>%
select(Field1, Field2, Field3, Field4, Field5) %>%
group_by(Field1, Field2, Field4) %>%
slice(1)
select stands in for SAS's keep, and nodupkey can be translated as grouping by the "by" variables and taking the first occurrence in each group. A good thing is that slice returns a data frame that is already sorted by the groups that were used, so arrange is not needed.
Given the data frame table:
table <- table[, c("Field1", "Field2", "Field3", "Field4", "Field5")] # keep specific columns
table <- table[with(table, order(Field1, Field2, Field4)), ] # sort by the three key columns
table_nodup <- table[!duplicated(table[, c("Field1", "Field2", "Field4")]), ] # keep the first row for each key
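To see what nodupkey does to the rows, here is a small base R sketch on made-up data (`tab` and its values are hypothetical). It assumes SAS keeps, for each combination of the by variables, the first row in input order, and that rows differing only in Field3 or Field5 still count as duplicates.

```r
# Hypothetical sample data with repeated (Field1, Field2, Field4) keys
tab <- data.frame(
  Field1 = c(2, 1, 1, 2),
  Field2 = c("x", "y", "y", "x"),
  Field3 = c(10, 20, 30, 40),
  Field4 = c("p", "q", "q", "p"),
  Field5 = c("a", "b", "c", "d"),
  stringsAsFactors = FALSE
)
keys <- c("Field1", "Field2", "Field4")

# Sort by the key columns; order() keeps tied rows in their input order
tab_sorted <- tab[order(tab$Field1, tab$Field2, tab$Field4), ]

# Drop every row whose key has already been seen
tab_nodup <- tab_sorted[!duplicated(tab_sorted[, keys]), ]
tab_nodup
```

Four input rows collapse to two: the rows with Field3 equal to 30 and 40 are dropped because their keys already appeared, even though their other columns differ.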
