Sum and count of grouped records - sqlite

Let's say I have a table:
Col1  Col2  Col3
R1    C1    5
R2    C3    8
R1    C1    2
R1    C2    4
R2    C5    3
R2    C2    4
I need to get:
1. A count of the same values of Col2 with the corresponding Col1, and the SUM of Col3.
2. A sum and count of the grouped results.
To achieve #1, my code looks like this:
SELECT Col1, Col2, COUNT(*), SUM(Col3)
FROM myTable
GROUP BY Col1, Col2
I get the result (and it is OK):

Col1  Col2  Count  Sum
R1    C1    2      7
R1    C2    1      4
R2    C2    1      4
R2    C3    1      8
R2    C5    1      3
For #2 I need to know the SUM of the values of column Count and the SUM of the values of column Sum, where the values of column Col1 are equal. How can I extend my query?
The desired result would be something like this:
Col1  Col2  Count  Sum
R1    C1    2      7
R1    C2    1      4
            3      11
R2    C2    1      4
R2    C3    1      8
R2    C5    1      3
            3      15

You can simulate rollup records by adding records that aggregate only by "Col1" to your initial result set, using a UNION ALL operation.
SELECT Col1, Col2, COUNT(*) AS cnt, SUM(Col3) AS total
FROM myTable
GROUP BY Col1, Col2
UNION ALL
SELECT Col1, NULL, COUNT(*), SUM(Col3)
FROM myTable
GROUP BY Col1
ORDER BY Col1
Output:
Col1  Col2  cnt  total
R1    C1    2    7
R1    C2    1    4
R1    null  3    11
R2    C2    1    4
R2    C3    1    8
R2    C5    1    3
R2    null  3    15
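As a cross-check, the same rollup can be reproduced with Python's built-in sqlite3 module. This is a sketch, not code from the answer above: the union is wrapped in a subquery with an extra `Col2 IS NULL` ordering term (an addition) so the rollup rows are guaranteed to sort after their group.

```python
import sqlite3

# In-memory table mirroring the question's data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE myTable (Col1 TEXT, Col2 TEXT, Col3 INTEGER)")
conn.executemany(
    "INSERT INTO myTable VALUES (?, ?, ?)",
    [("R1", "C1", 5), ("R2", "C3", 8), ("R1", "C1", 2),
     ("R1", "C2", 4), ("R2", "C5", 3), ("R2", "C2", 4)],
)

# Per-(Col1, Col2) aggregates plus per-Col1 rollup rows via UNION ALL.
# The subquery lets us order by an expression (compound SELECTs restrict
# what ORDER BY may reference).
rows = conn.execute("""
    SELECT * FROM (
        SELECT Col1, Col2, COUNT(*) AS cnt, SUM(Col3) AS total
        FROM myTable GROUP BY Col1, Col2
        UNION ALL
        SELECT Col1, NULL, COUNT(*), SUM(Col3)
        FROM myTable GROUP BY Col1
    )
    ORDER BY Col1, Col2 IS NULL, Col2
""").fetchall()

for row in rows:
    print(row)
```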

Related

Replace NA in row with value in adjacent row "ROW" not column [duplicate]

This question already has answers here:
Replacing NAs with latest non-NA value (21 answers)
Collapse / concatenate / aggregate a column to a single comma separated string within each group (6 answers)
Closed 1 year ago.
Raw data:
V1 V2
1 c1 a
2 c2 b
3 <NA> c
4 <NA> d
5 c3 e
6 <NA> f
7 c4 g
Reproducible Sample Data
V1 = c('c1','c2',NA,NA,'c3',NA,'c4')
V2 = c('a','b','c','d','e','f','g')
data.frame(V1,V2)
Expected output
V1_after V2_after
1 c1 a
2 c2 b c d
3 c3 e f
4 c4 g
V1_after <- c('c1','c2','c3','c4')
V2_after <- c('a',paste('b','c','d'),paste('e','f'),'g')
data.frame(V1_after,V2_after)
This is sample data. In the real data, the rows where V1 is NA do not follow a regular pattern, which makes this too difficult for me.
You could make use of zoo::na.locf for this. It takes the most recent non-NA value and fills all following NA values with it:
library(dplyr)
library(zoo)
df %>%
  mutate(V1 = zoo::na.locf(V1)) %>%
  group_by(V1) %>%
  summarise(V2 = paste0(V2, collapse = " "))
# A tibble: 4 x 2
  V1    V2
  <chr> <chr>
1 c1    a
2 c2    b c d
3 c3    e f
4 c4    g
A base R option using na.omit + cumsum + aggregate
aggregate(
  V2 ~ .,
  transform(
    df,
    V1 = na.omit(V1)[cumsum(!is.na(V1))]
  ),
  c
)
gives
V1 V2
1 c1 a
2 c2 b, c, d
3 c3 e, f
4 c4 g
You can fill the NA with the previous non-NA values and summarise the data.
library(dplyr)
library(tidyr)
df %>%
  fill(V1) %>%
  group_by(V1) %>%
  summarise(V2 = paste(V2, collapse = ' '))
# V1 V2
# <chr> <chr>
#1 c1 a
#2 c2 b c d
#3 c3 e f
#4 c4 g
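The forward-fill-then-collapse idea behind all three answers can be sketched in plain Python (an illustration, not code from the answers; `None` stands in for NA):

```python
from itertools import groupby

# Data mirroring the question: V1 has gaps (None = NA).
v1 = ["c1", "c2", None, None, "c3", None, "c4"]
v2 = ["a", "b", "c", "d", "e", "f", "g"]

# Step 1: forward-fill V1 (the zoo::na.locf / tidyr::fill step).
filled, last = [], None
for v in v1:
    last = v if v is not None else last
    filled.append(last)

# Step 2: group consecutive equal V1 keys and collapse the matching
# V2 values into one space-separated string (the summarise step).
result = [(key, " ".join(b for _, b in grp))
          for key, grp in groupby(zip(filled, v2), key=lambda p: p[0])]
print(result)
```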

Reverse spread using gather

We have a rating matrix:
df <- data.frame(Customer.ID=c("c1",'c1','c1','c2','c2','c3'),
Movie.ID=c("m1", "m3", "m5", "m1", "m5", "m7"),
Rating=c(1,2,1,3,3,1))
df
Customer.ID Movie.ID Rating
1 c1 m1 1
2 c1 m3 2
3 c1 m5 1
4 c2 m1 3
5 c2 m5 3
6 c3 m7 1
When I spread and change row names like this:
df1 <- df %>% spread(key = 'Movie.ID', value = 'Rating')
df1 <- data.frame(df1, row.names = 'Customer.ID')
I get:
> df1
m1 m3 m5 m7
c1 1 2 1 NA
c2 3 NA 3 NA
c3 NA NA NA 1
I want to make df1 look like df again.
I have tried:
df2 <-setDT(df1, keep.rownames = TRUE)[]
df2 <- gather(df2, Video.ID, Rating, 2:4)
But it returns me:
> df2
rn m7 Video.ID Rating
1 c1 NA m1 1
2 c2 NA m1 3
3 c3 1 m1 NA
4 c1 NA m3 2
5 c2 NA m3 NA
6 c3 1 m3 NA
7 c1 NA m5 1
8 c2 NA m5 3
9 c3 1 m5 NA
While I am not certain why you are doing this (see @Jack Brookes' comment), you can do this pretty readily with dplyr functions:
library(tibble)  # for rownames_to_column

df1 %>%
  rownames_to_column('Customer.ID') %>%
  gather(m1:m7, key = 'Movie.ID', value = 'Rating') %>%
  filter(!is.na(Rating))
Which gives us:
Customer.ID Movie.ID Rating
1 c1 m1 1
2 c2 m1 3
3 c1 m3 2
4 c1 m5 1
5 c2 m5 3
6 c3 m7 1
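The wide-to-long "gather and drop NA" step is easy to mirror in plain Python (a sketch, not the answer's code; `None` stands in for NA):

```python
# Wide rating matrix mirroring df1: one dict of ratings per customer.
wide = {
    "c1": {"m1": 1, "m3": 2, "m5": 1, "m7": None},
    "c2": {"m1": 3, "m3": None, "m5": 3, "m7": None},
    "c3": {"m1": None, "m3": None, "m5": None, "m7": 1},
}

# gather + filter(!is.na(Rating)): emit one (customer, movie, rating)
# row per non-missing cell.
long = [(cust, movie, rating)
        for cust, row in wide.items()
        for movie, rating in row.items()
        if rating is not None]
print(long)
```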

Finding duplicates in a dataframe and returning count of each duplicate record

I have a dataframe like
col1 col2 col3
A B C
A B C
A B B
A B B
A B C
B C A
I want to get output in the format below (each duplicated row with its count):

col1 col2 col3 Count
A    B    C    3
A    B    B    2
I don't want to pass any specific column to the function that finds the duplicates; that is the reason for not using add_count from dplyr.
Using duplicated I get

  col1 col2 col3 count
2 A    B    C    3
3 A    B    B    2
5 A    B    C    3

which is not the desired output.
We can use group_by_all to group by all columns and then remove the ones which are not duplicates by selecting rows which have count > 1.
library(dplyr)
df %>%
  group_by_all() %>%
  count() %>%
  filter(n > 1)
# col1 col2 col3 n
# <fct> <fct> <fct> <int>
#1 A B B 2
#2 A B C 3
We can use data.table
library(data.table)
setDT(df1)[, .(n =.N), names(df1)][n > 1]
# col1 col2 col3 n
#1: A B C 3
#2: A B B 2
Or with base R
subset(aggregate(n ~ ., transform(df1, n = 1), FUN = sum), n > 1)
# col1 col2 col3 n
#2 A B B 2
#3 A B C 3
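The "group by all columns and keep counts above one" technique maps directly onto a Counter over whole-row tuples in plain Python (an illustration, not code from the answers):

```python
from collections import Counter

# Rows from the question, each row as one hashable tuple.
rows = [("A", "B", "C"), ("A", "B", "C"), ("A", "B", "B"),
        ("A", "B", "B"), ("A", "B", "C"), ("B", "C", "A")]

# Count complete rows (all columns at once, like group_by_all) and
# keep only those appearing more than once.
dupes = {row: n for row, n in Counter(rows).items() if n > 1}
print(dupes)
```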

remove duplicate rows in R based on values in all columns

I have the following dataset
col1 col2 col3
a b 1
a b 2
a b 3
unique(dataset) returns
col1 col2 col3
a b 1
dataset[!duplicated(1:3),] returns
col1 col2 col3
a b 1
a b 2
a b 3
But the same thing fails to work with the following dataset2:
col1 col2 col3
a b 1
a b 1
unique(dataset2) returns
col1 col2 col3
a b 1
dataset2[!duplicated(1:3),] returns
col1 col2 col3
a b 1
a b 1
NA NA NA
Use !duplicated on the data frame itself. Note that duplicated(1:3) checks the vector 1:3 for duplicates (there are none), so you are indexing with TRUE TRUE TRUE rather than with the rows' duplicate status; that is why dataset2, which has only two rows, gains a recycled NA row.
dataset[!duplicated(dataset[c("col1", "col2", "col3")]),]
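For reference, "keep the first occurrence of each full row" is the same order-preserving deduplication you would write by hand in plain Python (a sketch, not the answer's code):

```python
# dataset2 from the question, plus one distinct row, as tuples.
rows = [("a", "b", 1), ("a", "b", 1), ("a", "b", 2)]

# Keep the first occurrence of each complete row, mirroring
# unique(dataset) / dataset[!duplicated(dataset), ].
seen, deduped = set(), []
for row in rows:
    if row not in seen:
        seen.add(row)
        deduped.append(row)
print(deduped)
```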

un-intersect values in R

I have two data sets of at least 420,500 observations each, e.g.
dataset1 <- data.frame(col1=c("microsoft","apple","vmware","delta","microsoft"),
col2=paste0(c("a","b","c",4,"asd"),".exe"),
col3=rnorm(5))
dataset2 <- data.frame(col1=c("apple","cisco","proactive","dtex","microsoft"),
col2=paste0(c("a","b","c",4,"asd"),".exe"),
col3=rnorm(5))
> dataset1
col1 col2 col3
1 microsoft a.exe 2
2 apple b.exe 1
3 vmware c.exe 3
4 delta 4.exe 4
5 microsoft asd.exe 5
> dataset2
col1 col2 col3
1 apple a.exe 3
2 cisco b.exe 4
3 vmware d.exe 1
4 delta 5.exe 5
5 microsoft asd.exe 2
I would like to print all the observations in dataset1 that do not intersect one in dataset2, comparing both col1 and col2. In this case that would be everything except the last observation: observations 1 & 2 match on col2 but not col1, and observations 3 & 4 match on col1 but not col2, i.e.:
col1 col2 col3
1: apple b.exe 1
2: delta 4.exe 4
3: microsoft a.exe 2
4: vmware c.exe 3
You could use anti_join from dplyr
library(dplyr)
anti_join(df1, df2, by = c('col1', 'col2'))
# col1 col2 col3
#1 delta 4.exe -0.5836272
#2 vmware c.exe 0.4196231
#3 apple b.exe 0.5365853
#4 microsoft a.exe -0.5458808
data
set.seed(24)
df1 <- data.frame(col1 = c('microsoft', 'apple', 'vmware', 'delta',
'microsoft'), col2= c('a.exe', 'b.exe', 'c.exe', '4.exe', 'asd.exe'),
col3=rnorm(5), stringsAsFactors=FALSE)
set.seed(22)
df2 <- data.frame(col1 = c( 'apple', 'cisco', 'proactive', 'dtex',
'microsoft'), col2= c('a.exe', 'b.exe', 'c.exe', '4.exe', 'asd.exe'),
col3=rnorm(5), stringsAsFactors=FALSE)
data.table solution inspired by this:
library(data.table) #1.9.5+
setDT(dataset1,key=c("col1","col2"))
setDT(dataset2,key=key(dataset1))
dataset1[!dataset2]
col1 col2 col3
1: apple b.exe 1
2: delta 4.exe 4
3: microsoft a.exe 2
4: vmware c.exe 3
You could also try without keying:
library(data.table) #1.9.5+
setDT(dataset1); setDT(dataset2)
dataset1[!dataset2,on=c("col1","col2")]
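The anti-join both answers perform reduces to a set-membership test on the (col1, col2) key pairs, sketched here in plain Python (an illustration with the question's data, not code from the answers):

```python
# The two frames as lists of (col1, col2, col3) rows.
dataset1 = [("microsoft", "a.exe", 2), ("apple", "b.exe", 1),
            ("vmware", "c.exe", 3), ("delta", "4.exe", 4),
            ("microsoft", "asd.exe", 5)]
dataset2 = [("apple", "a.exe", 3), ("cisco", "b.exe", 4),
            ("vmware", "d.exe", 1), ("delta", "5.exe", 5),
            ("microsoft", "asd.exe", 2)]

# Anti-join on (col1, col2): keep dataset1 rows whose key pair has
# no match anywhere in dataset2.
keys2 = {(c1, c2) for c1, c2, _ in dataset2}
result = [row for row in dataset1 if (row[0], row[1]) not in keys2]
print(result)
```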