Item LOCATION
R11565 D11
R11565 D12
R11565 D14
R11565 D15
I want the output as:
Item Location1 Location2 Location3 Location4
R11565 D11 D12 D14 D15
You could use pivoting logic here with the help of ROW_NUMBER():
WITH cte AS (
    SELECT Item, LOCATION,
           ROW_NUMBER() OVER (PARTITION BY Item ORDER BY LOCATION) AS rn
    FROM yourTable
)
SELECT
    Item,
    MAX(CASE WHEN rn = 1 THEN LOCATION END) AS Location1,
    MAX(CASE WHEN rn = 2 THEN LOCATION END) AS Location2,
    MAX(CASE WHEN rn = 3 THEN LOCATION END) AS Location3,
    MAX(CASE WHEN rn = 4 THEN LOCATION END) AS Location4
FROM cte
GROUP BY Item;
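For reference, a minimal setup to try this out (the DDL is an assumption; the table and column names are taken from the question):

CREATE TABLE yourTable (Item VARCHAR(10), LOCATION VARCHAR(10));

INSERT INTO yourTable (Item, LOCATION) VALUES
    ('R11565', 'D11'),
    ('R11565', 'D12'),
    ('R11565', 'D14'),
    ('R11565', 'D15');

With this data the query returns a single row: R11565, D11, D12, D14, D15.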
COUNT(CASE WHEN a = 'apple' THEN 1 END) AS '1',
COUNT(CASE WHEN a = 'orange' THEN 1 END) AS '2',
COUNT(CASE WHEN a = 'mango' THEN 1 END) AS '3',
COUNT(CASE WHEN a = 'melon' THEN 1 END) AS '4',
COUNT(CASE WHEN a = 'grape' THEN 1 END) AS '5',
COUNT(CASE WHEN a = 'lemon' THEN 1 END) AS '6',
COUNT(CASE WHEN a = 'watermelon' THEN 1 END) AS '7',
I have a category column, but instead of grouping I am counting each value with a separate COUNT(CASE WHEN ...) expression. The performance is so poor that I don't know how to solve it.
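If one count per category is all that is needed, a single GROUP BY pass is usually much cheaper than one CASE expression per category. A minimal sketch, assuming the rows live in a table (hypothetically named fruits here) with the category column a:

SELECT a, COUNT(*) AS cnt
FROM fruits      -- hypothetical table name
GROUP BY a;

An index on a can typically satisfy this from the index alone; if the one-row pivoted shape is still required, it can be rebuilt on top of this with the MAX(CASE WHEN ...) pattern shown above.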
Is there a way to transpose a dataframe with different column names? For example:
Col A Col B
Table1 Date
Table1 Country
Table2 Name
Table2 Date
Table3 ID
Table3 Place
Required output (columns with the same name should be aligned in the same column, like Date):
Col A    Col1    Col2       Col3
Table1   Date    Country
Table2   Date    Name
Table3   ID      Place
It seems like, to get the desired output, you have to address the cases where there is more than one instance of a ColB value separately from the cases where there is only one.
Option 1:
library(data.table)
setDT(df)
# flag ColB values that occur only once
df[, single := .N == 1L, ColB]
# dense rank of ColB, so repeated values share one column id
df[, b_id := frank(ColB, ties.method = 'dense')]
out <-
  merge(
    # duplicated ColB values: spread by their dense rank
    dcast(df[single == FALSE], ColA ~ b_id, value.var = 'ColB'),
    # singleton ColB values: spread by within-ColA row id
    dcast(df[single == TRUE], ColA ~ rowid(ColA), value.var = 'ColB'),
    by = 'ColA',
    all = TRUE
  )
setnames(out, replace(paste0('Col', seq(0, ncol(out) - 1)), 1, names(out)[1]))
out
# ColA Col1 Col2 Col3
# 1: Table1 Date Country <NA>
# 2: Table2 Date Name <NA>
# 3: Table3 <NA> ID Place
Option 2:
library(data.table)
setDT(df)
df[, single := .N == 1L, ColB]
df[, b_id :=
interaction(single, fifelse(single, rowid(ColA), frank(ColB, ties.method = 'dense')))]
dcast(df, ColA ~ paste0('Col', as.integer(b_id)), value.var = 'ColB')
# ColA Col2 Col3 Col4
# 1: Table1 <NA> Date Country
# 2: Table2 Name Date <NA>
# 3: Table3 ID <NA> Place
Input data:
df <- fread('
ColA ColB
Table1 Date
Table1 Country
Table2 Name
Table2 Date
Table3 ID
Table3 Place
')
I want to keep empty groups (with a default value like NA or 0) when grouping by multiple conditions.
dt = data.table(user = c("A", "A", "B"), date = c("t1", "t2", "t1"), duration = c(1, 2, 1))
dt[, .("total" = sum(duration)), by = .(date, user)]
Result:
date user total
1: t1 A 1
2: t2 A 2
3: t1 B 1
Desired result:
date user total
1: t1 A 1
2: t2 A 2
3: t1 B 1
4: t2 B NA
One solution could be to add rows with 0 values before grouping, but that would require creating the Cartesian product of many columns and manually checking whether a value already exists for each combination; I would prefer a built-in / simpler approach.
You can try:
dt[CJ(user = user, date = date, unique = TRUE), on = .(user, date)]
user date duration
1: A t1 1
2: A t2 2
3: B t1 1
4: B t2 NA
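Note that this join returns the raw duration column. If the aggregated total from the question is wanted as well, the grouping can be chained onto the join (a sketch using the same data; sum() over the missing combination yields NA):

library(data.table)
dt <- data.table(user = c("A", "A", "B"), date = c("t1", "t2", "t1"), duration = c(1, 2, 1))

# cross join of all user/date combinations, then aggregate per group
dt[CJ(user = user, date = date, unique = TRUE), on = .(user, date)][
  , .(total = sum(duration)), by = .(date, user)]
#    date user total
# 1:   t1    A     1
# 2:   t2    A     2
# 3:   t1    B     1
# 4:   t2    B    NA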
Here is an option with complete from tidyr
library(tidyr)
library(dplyr)
dt1 <- dt[, .("total" = sum(duration)), by = .(date, user)]
dt1 %>%
complete(user, date)
#   user  date  total
#   <chr> <chr> <dbl>
# 1 A     t1        1
# 2 A     t2        2
# 3 B     t1        1
# 4 B     t2       NA
Or using dcast/melt
melt(dcast(dt, user ~ date, value.var = 'duration', sum),
id.var = 'user', variable.name = 'date', value.name = 'total')
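One caveat, as far as I can tell: with fun.aggregate = sum, dcast fills the missing user/date combination with 0 rather than NA, because sum() of an empty vector is 0. Passing fill = NA keeps the NA of the desired output:

melt(dcast(dt, user ~ date, value.var = 'duration', fun.aggregate = sum, fill = NA),
     id.var = 'user', variable.name = 'date', value.name = 'total')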
Using the following reproducible example:
ID1<-c("a1","a4","a6","a6","a5", "a1" )
ID2<-c("b8","b99","b5","b5","b2","b8" )
Value1<-c(2,5,6,6,2,7)
Value2<- c(23,51,63,64,23,23)
Year<- c(2004,2004,2004,2004,2005,2004)
df<-data.frame(ID1,ID2,Value1,Value2,Year)
I want to select rows where ID1, ID2, and Year have the same values in their respective columns. For these rows I want to compare Value1 and Value2 of the duplicated rows, and if the values are not the same, erase the row with the smaller values.
Expected result:
ID1 ID2 Value1 Value2 Year new
2 a4 b99 5 51 2004 a4_b99_2004
4 a6 b5 6 64 2004 a6_b5_2004
5 a5 b2 2 23 2005 a5_b2_2005
6 a1 b8 7 23 2004 a1_b8_2004
I tried the following:
First, find a unique identifier for the conditions I am interested in:
df$new<-paste(df$ID1,df$ID2, df$Year, sep="_")
I can use the unique identifier to find the rows of the data frame that contain duplicates:
IND<-which(duplicated(df$new) | duplicated(df$new, fromLast = TRUE))
In a for loop, if a unique identifier has duplicates, compare the values and erase the rows. But the loop is too complicated and I cannot solve it:
for (i in df$new) {
  if (sum(df$new == i) > 1) {
    ind <- which(df$new == i)
    m <- min(df$Value1[ind])
    df <- df[-which.min(df$Value1[ind]), ]
    m <- min(df$Value2[ind])
    df <- df[-which.min(df$Value2[ind]), ]
  }
}
Some different possibilities. Using dplyr:
library(dplyr)

df %>%
  group_by(ID1, ID2, Year) %>%
  filter(Value1 == max(Value1) & Value2 == max(Value2))
Or:
df %>%
  rowwise() %>%
  mutate(max_val = sum(Value1, Value2)) %>%
  ungroup() %>%
  group_by(ID1, ID2, Year) %>%
  filter(max_val == max(max_val)) %>%
  select(-max_val)
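As an aside, since Value1 and Value2 are plain numeric columns, the rowwise()/sum() step can be replaced by ordinary vectorised addition; a simplified sketch of the same idea:

df %>%
  mutate(max_val = Value1 + Value2) %>%
  group_by(ID1, ID2, Year) %>%
  filter(max_val == max(max_val)) %>%
  select(-max_val)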
Using data.table:
setDT(df)[df[, .I[Value1 == max(Value1) & Value2 == max(Value2)], by = list(ID1, ID2, Year)]$V1]
Or:
setDT(df)[, max_val := sum(Value1, Value2), by = 1:nrow(df)
][, filter := max_val == max(max_val), by = list(ID1, ID2, Year)
][filter != FALSE
][, -c("max_val", "filter")]
Or:
subset(setDT(df)[, max_val := sum(Value1, Value2), by = 1:nrow(df)
][, filter := max_val == max(max_val), by = list(ID1, ID2, Year)], filter != FALSE)[, -c("max_val", "filter")]
Consider aggregate to retrieve the max values by your grouping (ID1, ID2, and Year). Note that aggregate takes the maximum of each value column independently within a group; that matches the expected output here because the larger Value1 and Value2 always occur on the same row:
df_new <- aggregate(.~ID1 + ID2 + Year, df, max)
df_new
# ID1 ID2 Year Value1 Value2
# 1 a6 b5 2004 6 64
# 2 a1 b8 2004 7 23
# 3 a4 b99 2004 5 51
# 4 a5 b2 2005 2 23
Solution without loading libraries:
ID1 ID2 Value1 Value2 Year
a6.b5.2004 a6 b5 6 64 2004
a1.b8.2004 a1 b8 7 23 2004
a4.b99.2004 a4 b99 5 51 2004
a5.b2.2005 a5 b2 2 23 2005
Code
do.call(rbind, lapply(split(df, list(df$ID1, df$ID2, df$Year)), # make identifiers
function(x) {return(x[which.max(x$Value1 + x$Value2),])})) # take max of sum
How can I reduce this nested query so that X, Y, Z are filtered prior to checking AA?
The query below works, but it is expensive since it evaluates the X, Y, Z conditions in every subquery;
only the AA range needs to differ in each.
SELECT 3*b3.bin3 + 2*b2.bin2 + b1.bin1 FROM
(SELECT count(*) AS bin1 FROM `TD` WHERE
`X` = 1 AND
`Y` >= 2 AND
`Z` >= 2 AND
`AA` >= 1 AND `AA` <= 2) b1
JOIN
(SELECT count(*) AS bin2 FROM `TD` WHERE
`X` = 1 AND
`Y` >= 2 AND
`Z` >= 2 AND
`AA` >= 2.01 AND `AA` <= 3) b2
JOIN
(SELECT count(*) AS bin3 FROM `TD` WHERE
`X` = 1 AND
`Y` >= 2 AND
`Z` >= 2 AND
`AA` >= 3.01 AND `AA` <= 4) b3;
Are you on SQL Server 2008 or later? You might be able to use a common table expression (WITH ... AS). Try this:
WITH b AS
(
    SELECT *
    FROM TD
    WHERE
        x = 1 AND
        y >= 2 AND
        z >= 2
)
SELECT 3*b3.bin3 + 2*b2.bin2 + b1.bin1 FROM
    (SELECT COUNT(*) AS bin1 FROM b
     WHERE AA >= 1 AND AA <= 2) b1
JOIN
    (SELECT COUNT(*) AS bin2 FROM b
     WHERE AA >= 2.01 AND AA <= 3) b2
JOIN
    (SELECT COUNT(*) AS bin3 FROM b
     WHERE AA >= 3.01 AND AA <= 4) b3
-- REDUCED FORM from Golden Ratio's hint.
WITH `v` AS
(SELECT `AA` FROM `TD` WHERE
    `X` = 1 AND
    `Y` >= 2 AND
    `Z` >= 2)
SELECT 3*bin3 + 2*bin2 + bin1 FROM
    (SELECT COUNT(*) AS bin1 FROM `v` WHERE
        `AA` >= 1 AND `AA` <= 2) b1
JOIN
    (SELECT COUNT(*) AS bin2 FROM `v` WHERE
        `AA` >= 2.01 AND `AA` <= 3) b2
JOIN
    (SELECT COUNT(*) AS bin3 FROM `v` WHERE
        `AA` >= 3.01 AND `AA` <= 4) b3;
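For what it's worth, the three bins can also be counted in a single pass over the filtered rows with conditional aggregation; this is a sketch of an alternative, not part of the original answers:

SELECT
    3 * SUM(CASE WHEN `AA` >= 3.01 AND `AA` <= 4 THEN 1 ELSE 0 END)
  + 2 * SUM(CASE WHEN `AA` >= 2.01 AND `AA` <= 3 THEN 1 ELSE 0 END)
  +     SUM(CASE WHEN `AA` >= 1    AND `AA` <= 2 THEN 1 ELSE 0 END)
FROM `TD`
WHERE `X` = 1 AND `Y` >= 2 AND `Z` >= 2;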