How to group by two columns at the same time? - sqlite

I'm trying to group by two columns at the same time and return the average of their values. For example, this table:
Col A  Col B  Col C
P1     P2     10
P1     P3     15
P2     P1     20
P3     P2     30
should return:
P1: 15 ((10 + 15 + 20) / 3)
P2: 20
P3: 22.5
I've tried using union and group by one of the columns but it returns a separate value for each P based on its col A and col B average.

Try this:
SELECT t.col1, AVG(t.col2)
FROM
(
SELECT Col_A AS [col1], Col_C AS [col2] FROM YourTable
UNION ALL -- keep every row; plain UNION drops repeated (name, value) pairs and skews the average
SELECT Col_B AS [col1], Col_C AS [col2] FROM YourTable
) as t
GROUP BY t.col1
And here is another way, using a temporary table (SQLite syntax):
CREATE TEMP TABLE Tabletemp
(
col1 TEXT,
col2 REAL
);
INSERT INTO Tabletemp SELECT Col_A, Col_C FROM YourTable;
INSERT INTO Tabletemp SELECT Col_B, Col_C FROM YourTable;
SELECT col1, AVG(col2) FROM Tabletemp GROUP BY col1;
P.S. By "YourTable", I mean the original table with 3 columns.
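As a quick check of the stacking idea, here is a minimal sketch that runs the query against an in-memory SQLite database from Python's standard sqlite3 module. The helper name average_per_label and the second data set are made up for illustration. This version uses UNION ALL so that repeated (label, value) pairs are kept; the last call shows how plain UNION changes the average when such pairs exist.

```python
import sqlite3

def average_per_label(rows, union_op="UNION ALL"):
    """Average Col_C per label, where a label may sit in Col_A or Col_B."""
    if union_op not in ("UNION", "UNION ALL"):
        raise ValueError("union_op must be 'UNION' or 'UNION ALL'")
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE YourTable (Col_A TEXT, Col_B TEXT, Col_C REAL)")
    conn.executemany("INSERT INTO YourTable VALUES (?, ?, ?)", rows)
    # Stack both label columns into one column, then group on it.
    query = f"""
        SELECT t.col1, AVG(t.col2)
        FROM (
            SELECT Col_A AS col1, Col_C AS col2 FROM YourTable
            {union_op}
            SELECT Col_B AS col1, Col_C AS col2 FROM YourTable
        ) AS t
        GROUP BY t.col1
        ORDER BY t.col1
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result

# The table from the question.
sample = [("P1", "P2", 10), ("P1", "P3", 15), ("P2", "P1", 20), ("P3", "P2", 30)]
print(average_per_label(sample))

# Hypothetical data in which the pair (P1, 10) occurs twice.
repeated = [("P1", "P2", 10), ("P1", "P3", 10), ("P1", "P4", 40)]
print(dict(average_per_label(repeated, "UNION ALL"))["P1"])  # averages 10, 10, 40
print(dict(average_per_label(repeated, "UNION"))["P1"])      # averages 10, 40 only
```

On the question's table both operators give the same answer, because no (label, value) pair repeats there; the difference only shows up on data like the second set.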

Related

How to fetch data in batches using R

I have a dataframe in R with the following structure:
ID Date
ID-1 2020-02-10 13:12:04
ID-2 2020-02-12 15:02:24
ID-3 2020-02-14 12:25:32
I am using the following query to fetch the data from MySQL. This is where I run into a problem, because I have a large number of IDs (about 90K). Passing 500-1000 IDs works fine, but passing all 90K IDs throws an error.
Data_frame<-paste0("
SELECT c.ID, e.name,d.output
FROM Table1 c
left outer join Table2 d ON d.ID=c.ID
LEFT outer JOIN Table1 e ON e.ID_2=d.ID_2
WHERE e.name in ('Name1','Name2')
AND c.ID IN (", paste(shQuote(DF$ID, type = "sh"),collapse = ', '), ")
;")
The query returns output in the following form, which I need to combine with DF by ID.
Query_Output<-
ID Name output
ID-1 Name1 23
ID-1 Name2 20
ID-2 Name1 40
ID-2 Name2 97
ID-3 Name1 34
ID-3 Name2 53
Required Output:
ID Date Name1 Name2
ID-1 2020-02-10 13:12:04 23 20
ID-2 2020-02-12 15:02:24 40 97
ID-3 2020-02-14 12:25:32 34 53
I have tried the below-mentioned code:
createIDBatchVector <- function(x, batchSize){
paste0(
"'"
, sapply(
split(x, ceiling(seq_along(x) / batchSize))
, paste
, collapse = "','"
)
, "'"
)
}
# second helper function
createQueries <- function(IDbatches){
  # each element of IDbatches is already a quoted, comma-separated list of IDs,
  # so paste0() returns one query per batch
  paste0("
SELECT c.ID, e.name,d.output
FROM Table1 c
left outer join Table2 d ON d.ID=c.ID
LEFT outer JOIN Table1 e ON e.ID_2=d.ID_2
WHERE e.name in ('Name1','Name2')
AND c.ID IN (", IDbatches, ")
;")
}
# ------------------------------------------------------------------
# and now the actual script
# first we create a vector that contains one batch per element
IDbatches <- createIDBatchVector(DF$ID, 2)
# It looks like this:
# [1] "'ID-1','ID-2'" "'ID-3','ID-4'" "'ID-5'"
# now we create a vector of SQL-queries out of that
queries <- createQueries(IDbatches)
df_final <- data.frame() # initialize a dataframe
conn <- database # open a connection
for (query in queries){ # iterate over the queries
  df_final <- rbind(df_final, dbGetQuery(conn,query))
}
I'm surprised that 90K IDs kill your SQL, but such is life.
I'm not sure I understand why you are doing it this way rather than looping with a plain for:
for (batches in 0:90) {
b = batches*1000
SELECT ...
... WHERE ID > b AND ID <= b+1000
rbind(myData, result)
}
(That's not the solution, just the method.)
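The batching method can be sketched in runnable form as follows. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 as a stand-in for the MySQL connection, a made-up table named results in place of the three-way join, and hypothetical helper names (chunked, fetch_in_batches). The "?" placeholder style is sqlite3's; MySQL drivers typically use a different placeholder such as %s.

```python
import sqlite3

def chunked(ids, batch_size):
    """Yield consecutive slices holding at most batch_size IDs each."""
    for start in range(0, len(ids), batch_size):
        yield ids[start:start + batch_size]

def fetch_in_batches(conn, ids, batch_size=500):
    """Run one parameterized IN (...) query per batch and collect all rows."""
    rows = []
    for batch in chunked(ids, batch_size):
        placeholders = ", ".join("?" for _ in batch)
        query = f"SELECT ID, name, output FROM results WHERE ID IN ({placeholders})"
        rows.extend(conn.execute(query, batch).fetchall())
    return rows

# Stand-in data: the hypothetical table "results" holds the query output
# shown in the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (ID TEXT, name TEXT, output INTEGER)")
conn.executemany("INSERT INTO results VALUES (?, ?, ?)", [
    ("ID-1", "Name1", 23), ("ID-1", "Name2", 20),
    ("ID-2", "Name1", 40), ("ID-2", "Name2", 97),
    ("ID-3", "Name1", 34), ("ID-3", "Name2", 53),
])

ids = ["ID-1", "ID-2", "ID-3"]
batches = list(chunked(ids, 2))
all_rows = fetch_in_batches(conn, ids, batch_size=2)
print(batches)
print(len(all_rows))
```

Passing the IDs as query parameters instead of pasting them into the SQL string avoids quoting problems, and keeping each batch small keeps every statement well under the server's size limits.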
But if your method is working, then what you want is tidyr::pivot_wider().
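To make the reshaping step concrete, here is a small standard-library sketch of the same long-to-wide idea, applied to the rows from the question. The function name long_to_wide is hypothetical, and this is a plain-Python illustration of the concept rather than the R call itself.

```python
def long_to_wide(rows, names=("Name1", "Name2")):
    """Turn (ID, name, value) rows into {ID: {name: value}} with one entry per ID."""
    wide = {}
    for id_, name, value in rows:
        # Start each ID with every expected name set to None, then fill in values.
        wide.setdefault(id_, dict.fromkeys(names))[name] = value
    return wide

# Dates and query output copied from the question.
dates = {
    "ID-1": "2020-02-10 13:12:04",
    "ID-2": "2020-02-12 15:02:24",
    "ID-3": "2020-02-14 12:25:32",
}
long_rows = [
    ("ID-1", "Name1", 23), ("ID-1", "Name2", 20),
    ("ID-2", "Name1", 40), ("ID-2", "Name2", 97),
    ("ID-3", "Name1", 34), ("ID-3", "Name2", 53),
]

wide = long_to_wide(long_rows)
# Join by ID: IDs without any query result keep None in both value columns.
table = [
    (id_, date, wide.get(id_, {}).get("Name1"), wide.get(id_, {}).get("Name2"))
    for id_, date in dates.items()
]
for row in table:
    print(row)
```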

Power BI relativity measure with multiple groups

I have some sample data that I created:
Col1 Col2 Col3 Col4
2014/1/1 A Y 10
2014/4/1 A Y 15
2015/1/1 A Z 15
2015/4/1 A Z 30
2014/1/1 B Y 20
2014/4/1 B Y 30
2015/1/1 B Z 40
2015/4/1 B Z 80
I want to create a measure in Power BI so that I can build an interactive visualization. The data above is only an example, so assume that Col2 and Col3 can take many different values.
The measure I want is relativity: the value in Col4 divided by the first Col4 value (by Col1 date) within the same Col2 group.
This is the result I expect. I do not need it as columns in the data table, though, because the visualization will also be filtered on other columns (Col5, Col6, etc.) that are not shown in this example:
Col1 Col2 Col3 Col4 relativity_Col3ALL relativity_Col3EqualsYorZ
2014/1/1 A Y 10 1 1
2014/4/1 A Y 15 1.5 1.5
2015/1/1 A Z 15 1.5 1
2015/4/1 A Z 30 3 2
2014/1/1 B Y 20 1 1
2014/4/1 B Y 30 1.5 1.5
2015/1/1 B Z 40 2 1
2015/4/1 B Z 80 4 2
I will then plot it and add filters beside the plot. When I select Y in the Col3 filter, the plot should update automatically.
Here is the idea expressed in R code:
dt <- data.table::as.data.table(dt)
dt[, relativity := Col4 / Col4[1], by = .(Col1, Col2)]
But the above code is incorrect because it does not consider Col3. I just want to illustrate the idea of Col4 / Col4[1] or Col4 / first(Col4).
I tried this measure in Power BI:
relativity = CALCULATE(DIVIDE(dt[Col4], dt[AnotherMeasure]), MIN(dt[Col1]))
I know this is wrong.
Can anyone help?
UPDATE
I tried @Alexis Olson's code and modified it as:
relativity =
VAR YR = MIN(dt[Col1].[Year])
VAR QT = MIN(dt[Col1].[Quarter])
VAR PF = CALCULATE(TOTALQTD(SUM(dt[Col4]), dt[Col1].[Date]), dt[Col1].[Year] = YR, dt[Col1].[Quarter] = QT)
RETURN
DIVIDE(SUM(dt[Col4]), PF)
However, when I visualize it in the report, every value shows as 1.
I also tried this:
relativity =
VAR YR = CALCULATE(MIN(dt[Col1].[Year]), ALLEXCEPT(dt, dt[Col2]))
VAR QT = CALCULATE(MIN(dt[Col1].[Quarter]), ALLEXCEPT(dt, dt[Col2]))
VAR PFQTD = TOTALQTD(SUM(dt[Col4]), dt[Col1].[Date])
VAR MPFQTD = CALCULATE(MAX(PFQTD), FILTER(dt, dt[Col1].[Year] = YR), FILTER(dt, dt[Col1].[Quarter] = QT))
RETURN
MPFQTD
This failed as well.
Using the logic from this Q&A, you can create a calculated column as follows:
relativity =
VAR FirstCol1 = CALCULATE ( MIN ( dt[Col1] ), ALLEXCEPT ( dt, dt[Col2], dt[Col3] ) )
VAR FirstCol4 = CALCULATE ( VALUES ( dt[Col4] ), dt[Col1] = FirstCol1 )
RETURN
DIVIDE ( dt[Col4], FirstCol4 )
This looks up the first date among the rows that share the current row's Col2 and Col3 values, then finds the Col4 value on that first date, and finally divides the current Col4 value by that first Col4 value.
ALLEXCEPT removes every filter on the table except those on the columns you specify, so the MIN is taken over all rows with the same Col2 and Col3. If you want relativity_Col3ALL, simply remove dt[Col3] from the ALLEXCEPT arguments.
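The same logic can be traced outside Power BI. The sketch below is plain Python, not DAX, and the function name relativity is made up: for each row it finds the earliest-dated row in the same group and divides by that row's Col4. Grouping on Col2 and Col3 mirrors the ALLEXCEPT call with both columns; grouping on Col2 alone mirrors the version with Col3 removed.

```python
from datetime import date

# Sample data from the question: (Col1, Col2, Col3, Col4).
rows = [
    (date(2014, 1, 1), "A", "Y", 10),
    (date(2014, 4, 1), "A", "Y", 15),
    (date(2015, 1, 1), "A", "Z", 15),
    (date(2015, 4, 1), "A", "Z", 30),
    (date(2014, 1, 1), "B", "Y", 20),
    (date(2014, 4, 1), "B", "Y", 30),
    (date(2015, 1, 1), "B", "Z", 40),
    (date(2015, 4, 1), "B", "Z", 80),
]

def relativity(rows, group_cols):
    """Divide each Col4 by the Col4 of the earliest-dated row in its group.

    group_cols holds the tuple positions that define a group,
    e.g. (1, 2) for Col2 and Col3, or (1,) for Col2 only.
    """
    first = {}
    for row in rows:
        key = tuple(row[i] for i in group_cols)
        # Keep the row with the smallest Col1 date seen so far for this group.
        if key not in first or row[0] < first[key][0]:
            first[key] = row
    return [row[3] / first[tuple(row[i] for i in group_cols)][3] for row in rows]

by_col2_col3 = relativity(rows, (1, 2))  # relativity_Col3EqualsYorZ
by_col2 = relativity(rows, (1,))         # relativity_Col3ALL
print(by_col2_col3)
print(by_col2)
```

Both lists reproduce the two expected columns in the question's result table.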

Combining two gene counts in one single plot using ggplot2?

I want to display the counts of two genes in the same graph across different conditions. The normalized counts for these genes were obtained from DESeq2 using the plotCounts function. I want to plot the two genes in the same plot with a shared x-axis, which has three conditions (ctrl, T1, T2), and different y-axes (based on counts). One extra variable is the replicate (PAT1 to PAT5), which I want to distinguish by shape, with genes "x" and "y" shown in two different colors. I tried something like this, based on the link mentioned, but it has not really worked so far:
geneX
genecounts <- plotCounts(dds, gene = paste(geneX),
intgroup = c("timepoint","patient"),returnData = TRUE)
# count timepoint patient
# PAT1.ctrl 19.975535 ctrl PAT1
# PAT2.ctrl 15.095701 ctrl PAT2
# PAT3.ctrl 31.067328 ctrl PAT3
# PAT4.ctrl 23.507453 ctrl PAT4
# PAT5.ctrl 64.955803 ctrl PAT5
# PAT1.T1 25.087863 T1 PAT1
# PAT2.T1 12.265661 T1 PAT2
# PAT3.T1 21.514517 T1 PAT3
# PAT4.T1 12.853989 T1 PAT4
# PAT5.T1 29.887820 T1 PAT5
# PAT1.T2 16.234911 T2 PAT1
# PAT2.T2 7.620990 T2 PAT2
# PAT3.T2 36.834481 T2 PAT3
# PAT4.T2 7.085464 T2 PAT4
# PAT5.T2 13.330165 T2 PAT5
Second gene (geneY), plotCounts output:
# count timepoint patient
# PAT1.ctrl 156949.94 ctrl PAT1
# PAT2.ctrl 164856.70 ctrl PAT2
# PAT3.ctrl 258139.79 ctrl PAT3
# PAT4.ctrl 103669.21 ctrl PAT4
# PAT5.ctrl 434170.02 ctrl PAT5
# PAT1.T1 128839.83 T1 PAT1
# PAT2.T1 98877.64 T1 PAT2
# PAT3.T1 198419.57 T1 PAT3
# PAT4.T1 97918.21 T1 PAT4
# PAT5.T1 306861.69 T1 PAT5
# PAT1.T2 124161.91 T2 PAT1
# PAT2.T2 92150.86 T2 PAT2
# PAT3.T2 265243.35 T2 PAT3
# PAT4.T2 90364.91 T2 PAT4
# PAT5.T2 399177.04 T2 PAT5
So far I have used this code to generate individual ggplots:
# geom_beeswarm() comes from the ggbeeswarm package
ggplot(genecounts, aes(x = timepoint, y = count, color = patient)) + geom_beeswarm(cex = 3)
Any help or suggestions would be highly appreciated.
The first step is to add a column for the gene name to each data frame, then combine them.
You could start with geom_point: I would use color for patients and shape for genes. You will want to use a log scale, since the counts differ by orders of magnitude. Assuming that your data frames are named geneX and geneY:
library(dplyr)
library(ggplot2)
geneX %>%
mutate(gene = "X") %>%
bind_rows(mutate(geneY, gene = "Y")) %>%
ggplot(aes(timepoint, count)) +
geom_point(aes(color = patient, shape = gene)) +
scale_y_log10()
You can try geom_jitter instead of geom_point to avoid point overlap.
If you want to connect the points, you will need to group by both gene and patient, which is a little more work:
geneX %>%
mutate(gene = "X") %>%
bind_rows(mutate(geneY, gene = "Y")) %>%
ggplot(aes(timepoint, count)) +
geom_line(aes(color = patient, group = interaction(patient, gene))) +
geom_point(aes(color = patient, shape = gene)) +
scale_y_log10()
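The data preparation behind both plots (label each table with its gene, stack them, then group by patient and gene) can be sketched without any plotting library. The snippet below is a plain-Python illustration using only the PAT1 and PAT2 rows from the question; the helper names label_and_stack and lines_by_patient_and_gene are hypothetical. It also applies log10 to the counts, which shows why a log scale is needed to see both genes on one axis.

```python
import math
from collections import defaultdict

TIMEPOINTS = ["ctrl", "T1", "T2"]  # x-axis order

# (patient, timepoint, count) rows for PAT1 and PAT2, copied from the question.
gene_x = [
    ("PAT1", "ctrl", 19.975535), ("PAT2", "ctrl", 15.095701),
    ("PAT1", "T1", 25.087863), ("PAT2", "T1", 12.265661),
    ("PAT1", "T2", 16.234911), ("PAT2", "T2", 7.620990),
]
gene_y = [
    ("PAT1", "ctrl", 156949.94), ("PAT2", "ctrl", 164856.70),
    ("PAT1", "T1", 128839.83), ("PAT2", "T1", 98877.64),
    ("PAT1", "T2", 124161.91), ("PAT2", "T2", 92150.86),
]

def label_and_stack(tables):
    """Add a gene label to every row and stack all tables into one list."""
    return [
        {"gene": gene, "patient": patient, "timepoint": timepoint, "count": count}
        for gene, table in tables.items()
        for patient, timepoint, count in table
    ]

def lines_by_patient_and_gene(records):
    """One series per (patient, gene) pair, ordered by timepoint, on a log10 scale."""
    series = defaultdict(dict)
    for rec in records:
        series[(rec["patient"], rec["gene"])][rec["timepoint"]] = math.log10(rec["count"])
    return {key: [values[tp] for tp in TIMEPOINTS] for key, values in series.items()}

records = label_and_stack({"X": gene_x, "Y": gene_y})
lines = lines_by_patient_and_gene(records)
for key in sorted(lines):
    print(key, [round(v, 2) for v in lines[key]])
```

Each key of lines corresponds to one connected line in the second plot, which is what grouping on the patient and gene interaction achieves there.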

How to get a new column in a data frame which has only elements which appear in the set more than once in R

Data:
DB1 <- data.frame(orderItemID = c(1,2,3,4,5,6,7,8,9,10),
orderDate = c("1.1.12","1.1.12","1.1.12","1.1.12","1.1.12", "1.1.12","1.1.12","1.1.12","2.1.12","2.1.12"),
itemID = c(2,3,2,5,12,4,2,3,1,5),
size = factor(c("l", "s", "xl", "xs","m", "s", "l", "m", "xxs", "xxl")),
color = factor(c("blue", "black", "blue", "orange", "red", "navy", "red", "purple", "white", "black")),
customerID = c(33, 15, 1, 33, 14, 55, 33, 78, 94, 23))
Expected output:
selection_order = c("yes","no","no","no","no","no","yes","no","no","no")
In the data set there are items with the same size, the same color, or the same itemID. Every registered user has a unique customerID.
I want to identify when a user orders more than one product with the same itemID (in different sizes or colors; for example, the user with customerID = 33 orders the same item (itemID = 2) in two different colors) and mark it in a new column, named something like "selection order", with "yes" or "no". It should NOT show "yes" when he or she orders an item with another ID. I just want a "yes" when the same ID is ordered more than once (on the same day or in the past), regardless of other IDs (other products).
I've tried a lot already, but nothing works. There are a few thousand different user IDs and item IDs, so I can't subset for every ID. I tried the duplicated function, but it does not lead to a satisfactory solution:
The problem is that if the same person orders more than one object (so customerID is duplicated) and another person (customerID) orders an item with the same ID (so itemID is duplicated), it gives me a "yes", and it must be a "no" in this case. (In the example, the duplicated function gives me a "yes" at orderItemID 4 instead of a "no".)
I think I understand what your desired output is now. Try:
library(data.table)
setDT(DB1)[, selection_order := .N > 1, by = list(customerID, itemID)]
DB1
#     orderItemID orderDate itemID size  color customerID selection_order
#  1:           1    1.1.12      2    l   blue         33            TRUE
#  2:           2    1.1.12      3    s  black         15           FALSE
#  3:           3    1.1.12      2   xl   blue          1           FALSE
#  4:           4    1.1.12      5   xs orange         33           FALSE
#  5:           5    1.1.12     12    m    red         14           FALSE
#  6:           6    1.1.12      4    s   navy         55           FALSE
#  7:           7    1.1.12      2    l    red         33            TRUE
#  8:           8    1.1.12      3    m purple         78           FALSE
#  9:           9    2.1.12      1  xxs  white         94           FALSE
# 10:          10    2.1.12      5  xxl  black         23           FALSE
To convert back to a data.frame, use DB1 <- as.data.frame(DB1) (for older versions) or setDF(DB1) in the latest data.table version.
You can do it (less efficiently) with base R too:
transform(DB1, selection_order = ave(itemID, customerID, itemID, FUN = length) > 1)
Or using the dplyr package
library(dplyr)
DB1 %>%
group_by(customerID, itemID) %>%
mutate(selection_order = n() > 1)
The following code appends a new column selection.order to your data frame, flagging whether each row belongs to a duplicated (customerID, itemID) tuple.
# First merge together the table to itself
m<- merge(x=DB1,y=DB1,by=c("customerID","itemID"))
# Now find duplicate instances of orderItemID, note this is assumed to be UNIQUE
m$selection.order<-sapply(m$orderItemID.x,function(X) sum(m$orderItemID.x==X)) > 1
m <- m[,c("orderItemID.x","selection.order")]
# Merge the two together
DB1<- merge(DB1, unique(m), by.x="orderItemID",by.y="orderItemID.x",all.x=TRUE,all.y=FALSE)
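If you just want the subset, as you say in the title, then do this (the second duplicated() call, with fromLast = TRUE, also catches the first occurrence of each repeated pair):
key <- DB1[c("itemID", "customerID")]
DB1[duplicated(key) | duplicated(key, fromLast = TRUE),]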
If you want the column, then:
f <- interaction(DB1$itemID, DB1$customerID)
DB1$multiple <- table(f)[f] > 1L
Note that it is also easy to get the actual count by simplifying the last line above.
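All of the answers above share one idea: count how often each (customerID, itemID) pair occurs and flag the rows whose pair occurs more than once. A minimal standard-library sketch of that idea, using the two relevant columns from the question, is shown below. The variable names are made up for illustration. The last part shows why a check that only looks at earlier rows misses the first occurrence of a repeated pair.

```python
from collections import Counter

# customerID and itemID columns copied from the question's DB1.
customer_id = [33, 15, 1, 33, 14, 55, 33, 78, 94, 23]
item_id = [2, 3, 2, 5, 12, 4, 2, 3, 1, 5]

pairs = list(zip(customer_id, item_id))
counts = Counter(pairs)  # occurrences of each (customerID, itemID) pair

# Flag every row whose pair appears more than once, first occurrence included.
selection_order = ["yes" if counts[pair] > 1 else "no" for pair in pairs]
print(selection_order)

# A check that only looks backwards flags the later occurrences only.
seen = set()
later_only = []
for pair in pairs:
    later_only.append(pair in seen)
    seen.add(pair)
print(later_only)
```

The counts object also holds the actual number of orders per pair, for cases where a count is more useful than a yes/no flag.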

