Hi all, I want to find the unique duplicates and their number of occurrences.
Table 1
Id  name  Amount (in and out)  Day
1   ram    100                 Sunday
2   ram   -100                 Sunday
3   ram    100                 Monday
4   ram   -100                 Monday
5   ram    100                 Wednesday
6   ram    100                 Wednesday
Ram got 100 from the company on Sunday (id = 1, amount = 100), and on the same day he gave the money back to the company (id = 2, amount = -100).
The same applies to id = 3 and id = 4.
But id = 5 and id = 6 are duplicates, as the amount is not reversed and both occurred on the same day.
I want to display:
Count  name  Amount
2      ram   100
Count is the number of occurrences of the duplicated value.
I have tried many approaches without success. Please help me. Thanks.
Note: "duplicate" means two sequential positive (or two sequential negative) values on the same day.
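You can try the query below (assuming the table is named tbl):
Logic: a reversed pair has one positive and one negative value (100 and -100 in your case). Because the query groups by amount as well as by name and day, each half of a reversed pair falls into its own group with a count of 1, so COUNT(id) > 1 in the HAVING clause ignores it. SUM(amount) > 0 then keeps only the positive amounts.
SELECT COUNT(id), name, amount
FROM tbl
GROUP BY name, day, amount
HAVING COUNT(id) > 1 AND SUM(amount) > 0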
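To check the query against the sample rows, here is a minimal sketch in Python using the standard-library sqlite3 module. The in-memory database and the column types are assumptions made for illustration; the table name tbl and the data come from the question.

```python
import sqlite3

# In-memory SQLite database (an assumption; the question does not name a DBMS).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (id INTEGER, name TEXT, amount INTEGER, day TEXT)")

# Sample rows from Table 1 in the question.
rows = [
    (1, "ram", 100, "Sunday"),
    (2, "ram", -100, "Sunday"),
    (3, "ram", 100, "Monday"),
    (4, "ram", -100, "Monday"),
    (5, "ram", 100, "Wednesday"),
    (6, "ram", 100, "Wednesday"),
]
conn.executemany("INSERT INTO tbl VALUES (?, ?, ?, ?)", rows)

# Same query as in the answer: groups with more than one row and a positive amount.
duplicates = conn.execute(
    """
    SELECT COUNT(id), name, amount
    FROM tbl
    GROUP BY name, day, amount
    HAVING COUNT(id) > 1 AND SUM(amount) > 0
    """
).fetchall()

print(duplicates)  # [(2, 'ram', 100)]
```

Note that the SUM(amount) > 0 condition means two repeated negative amounts on the same day would not be reported; dropping that condition would include them.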
I am trying to determine the on-time delivery rate of orders:
The column of interest is the on-time delivery flag, which contains either 0 (not on time) or 1 (on time). How can I calculate the on-time rate for each person in SQL? Basically, I want the count of 0s divided by the total count (0s and 1s) for each person, and likewise the count of 1s divided by the total count for the on-time rate.
Here's a data example:
Week  Delivery on time  Person
1     0                 sARAH
1     0                 sARAH
1     1                 sARAH
2     1                 vIC
2     0                 Vic
You may aggregate by person, and then take the average of the on time statistic:
SELECT Person, AVG(1.0*DeliveryOnTime) AS OnTime,
AVG(1.0 - DeliveryOnTime) AS NotOnTime
FROM yourTable
GROUP BY Person;
Demo
The demo given is for SQL Server, and the above syntax might have to change slightly depending on your actual database, which you did not reveal to us.
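As a quick check of the averaging approach, here is a minimal sketch that runs the same query in SQLite through Python's sqlite3 module. The choice of SQLite, the column name DeliveryOnTime, and the normalised spellings "Sarah" and "Vic" (the sample mixes upper and lower case, which a case-sensitive GROUP BY would treat as different people) are assumptions; an ORDER BY is added only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yourTable (Week INTEGER, DeliveryOnTime INTEGER, Person TEXT)")

# Sample data from the question, with person names normalised (assumption).
conn.executemany(
    "INSERT INTO yourTable VALUES (?, ?, ?)",
    [(1, 0, "Sarah"), (1, 0, "Sarah"), (1, 1, "Sarah"), (2, 1, "Vic"), (2, 0, "Vic")],
)

# The mean of a 0/1 flag is the share of 1s; the mean of (1 - flag) is the share of 0s.
rates = conn.execute(
    """
    SELECT Person,
           AVG(1.0 * DeliveryOnTime) AS OnTime,
           AVG(1.0 - DeliveryOnTime) AS NotOnTime
    FROM yourTable
    GROUP BY Person
    ORDER BY Person
    """
).fetchall()

for person, on_time, not_on_time in rates:
    print(person, round(on_time, 3), round(not_on_time, 3))
```

With this data Sarah is on time for one delivery out of three and Vic for one out of two.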
Firstly: I have seen other posts about translating AVERAGEIF from Excel into R, but none of them worked for my specific case and I couldn't adapt one to make it work.
I have a dataset which contains the daily prices of a large number of listings.
It looks like this:
  listing_id date     price
1 1000       1/2/2015 $100
2 1200       2/4/2016 $150
Sample of the dataset (and desired outcome) at https://send.firefox.com/download/228f31e39d18738d/#rlMmm6UeGxgbkzsSD5OsQw
The dataset I would like to end up with has only the date and the average price of all listings on that date. The goal is to get a (different) data frame which looks something like this, so I can work with it:
  Date      Average Price
1 4/5/2015  204.5438
2 4/6/2015  182.6439
3 4/7/2015  176.553
4 4/8/2015  182.0448
5 4/9/2015  183.3617
6 4/10/2015 205.0997
7 4/11/2015 197.0118
8 4/12/2015 172.2943
I created this in Excel using the AVERAGEIF function (and copy-pasting by value) from the sample provided above.
I first tried to prepare the data in Excel, where I could use AVERAGEIF to take the average of the rows matching a specific date. The problem is that the dataset consists of 30 million rows and Excel only allows about 1 million, so it didn't work.
What I have done so far: I created a data frame in R (where I want the average prices to go) using
Avg = data.frame("Date" =1:2, "Average Price"=1:2)
Avg[nrow(Avg) + 2036,] = list("v1","v2")
Avg$Date = seq(from = as.Date("2015-04-05"), to = as.Date("2020-11-01"), by = 'day')
I tried to create an AVERAGEIF-like function by following this article and another one, but could not get it to work.
I hope this is enough information to go on; otherwise I would be more than happy to provide more.
If your question is how to replicate the AVERAGEIF function, you can use logical indexing:
R code:
> df
  Dates Prices
1     1    100
2     2    120
3     3    150
4     1    320
5     2    250
6     3    210
7     1    102
8     2    180
9     3    150
idx <- df$Dates == 1 # Positions where condition is true
mean(df$Prices[idx]) # Prints same output as Excel
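The logical-indexing answer computes the average for one date at a time. Since the question asks for the average for every date, the same idea can be extended to a grouped average. Below is a minimal sketch in plain Python (not R) that keeps a running sum and count per date, so the rows can be processed in one pass; the variable names are illustrative and the data is the small df shown above.

```python
from collections import defaultdict

# The example data frame from the answer, as (date, price) pairs.
rows = [
    (1, 100), (2, 120), (3, 150),
    (1, 320), (2, 250), (3, 210),
    (1, 102), (2, 180), (3, 150),
]

# Running totals per date: one pass over the data, no need to hold groups in memory.
totals = defaultdict(float)
counts = defaultdict(int)
for date, price in rows:
    totals[date] += price
    counts[date] += 1

# Average price per date, the equivalent of AVERAGEIF evaluated for each date.
averages = {date: totals[date] / counts[date] for date in totals}

for date in sorted(averages):
    print(date, round(averages[date], 4))
```

For date 1 this gives (100 + 320 + 102) / 3 = 174, the same value as mean(df$Prices[idx]) in the R snippet.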
I have a data.table that looks like this:
dt
   id month balance
1:  1     4     100
2:  1     5      50
3:  2     4     200
4:  2     5     135
5:  3     4     100
6:  3     5     100
7:  4     5     300
"id" is the client's ID, "month" indicates what month it is, and "balance" indicates the account balance of a client. In a sense, this is longitudinal data where, say, element (2,3) indicates that Client #1 has an account balance of 50 at the end of month 5.
I want to generate a column that gives the difference in a client's balance between month 4 and month 5, so that I know the transactions carried out from one month to the next.
This new variable should tell me that Client 1 drew 50, Client 2 drew 65, and Client 3 didn't do anything in aggregate terms between April and May. Client 4 is a new client who joined in May.
I thought of the following code:
dt$transactions <- dt$balance - shift(dt$balance, 1, "up")
However, it does not work properly because it's telling me that Client 4 made a 200 dollar deposit (but Client 4 is new!). Therefore, I want to be able to introduce the argument "by=id" to this somehow.
I know the solution lies in using the following notation:
dt[, transactions := balance - shift(balance, ??? ), by=id]
I just need to figure out how to make the aforementioned code work properly.
Thanks in advance.
Given that I only have two observations (at most), the following code gives me an elegant solution:
dt[, transaction := balance - first(balance), by = id]
This prevents any NAs from entering the variable transaction.
However, if I had more observations per id, I would do the following:
dt[,transaction := balance - shift(balance,1), by = id]
Big thanks to @Ryan and @Onyambu for helping.
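To make the per-client logic of the lagged difference explicit, here is a minimal sketch in plain Python (not data.table). It assumes the rows are already sorted by id and month, as in the example, and uses None where R would give NA for a client's first observation.

```python
# Rows from the example table: (id, month, balance), sorted by id then month.
rows = [
    (1, 4, 100), (1, 5, 50),
    (2, 4, 200), (2, 5, 135),
    (3, 4, 100), (3, 5, 100),
    (4, 5, 300),
]

last_balance = {}   # most recent balance seen for each client id
transactions = []   # balance minus the previous balance of the same client

for client_id, month, balance in rows:
    previous = last_balance.get(client_id)
    # The first row of a client has no previous balance, so there is no difference.
    transactions.append(None if previous is None else balance - previous)
    last_balance[client_id] = balance

print(transactions)  # [None, -50, None, -65, None, 0, None]
```

Because the previous balance is looked up per client, Client 4's single row yields no transaction instead of being compared with Client 3's balance.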
I need to find the sector with the lowest frequency in my data frame. Using min gives the minimum number of occurrences, but I would like to obtain the name of the sector with the lowest number of occurrences. So in this case, I would like it to print "Consumer Staples". I keep getting the frequency and not the actual sector name. Is there a way to do this?
Thank you.
sector_count <- count(portfolio, "Sector")
sector_count
                  Sector freq
1 Consumer Discretionary    5
2       Consumer Staples    1
3            Health Care    2
4            Industrials    3
5 Information Technology    4
min(sector_count$freq)
[1] 1
You want
sector_count$Sector[which.min(sector_count$freq)]
which.min(sector_count$freq) returns the index (row) at which the minimum value is found. The sector_count$Sector vector is then subset at that index to give the corresponding sector name.
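The same "index of the minimum, then look up the label" idea can be sketched in plain Python. The dictionary below is just the frequency table from the question; using a dict is an assumption made for illustration.

```python
# Frequency table from the question: sector name -> number of occurrences.
sector_count = {
    "Consumer Discretionary": 5,
    "Consumer Staples": 1,
    "Health Care": 2,
    "Industrials": 3,
    "Information Technology": 4,
}

# min() over the keys, compared by their counts, returns the label rather than the count.
rarest_sector = min(sector_count, key=sector_count.get)

print(rarest_sector)  # Consumer Staples
```

Like which.min, this returns only the first sector if several share the minimum frequency.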
I have two very large datasets for demand and returns of products (about 4 million entries per dataset, but unequal length). The first dataset gives [1] the date of demand, [2] the id of the customer and [3] the id of the product. The second dataset gives the [1] date of return, [2] the id of the customer and [3] the id of the product.
Now I would like to match all demands for given customers and products with the returns of the same customer and product. Pairs of product types and customers are not unique, because a customer can demand a product multiple times. Therefore, I want to match a demand for a product with the earliest return in the dataset. It can also happen that some products are not returned, or that some products are returned which have not been demanded (because customers return items that were demanded before the starting date of the dataset).
To that end I've written the following code:
transactionNumber = 1:nrow(demandSet)    # transaction numbers for the demandSet
matchedNumber = rep(0, nrow(demandSet))  # which rows in the returnSet correspond to the transactions in the demandSet
for (transaction in transactionNumber){
  indices <- which(returnSet[,2]==demandSet[transaction,2] & returnSet[,3]==demandSet[transaction,3])
  if (length(indices)>0){
    matchedNumber[transaction] <- indices[which.min(returnSet[indices,][,1])]  # select the index of the return with the minimum date
  }
}
However, this takes around a day to compute. Does anyone have a better suggestion? Note that the suggestions from "match two columns with two other columns" do not work here, since match() overflows memory.
As a working example consider
demandDates = c(1,1,1,5,6,6,8,8)
demandCustIds = c(1,1,1,2,3,3,1,1)
demandProdIds = c(1,2,3,4,1,5,2,6)
demandSet = data.frame(demandDates,demandCustIds,demandProdIds)
returnDates = c(1,1,4,4,4)
returnCustIds = c(4,4,1,1,1)
returnProdIds = c(5,7,1,2,3)
returnSet = data.frame(returnDates,returnCustIds,returnProdIds)
(This actually doesn't work completely correctly, since transaction 7 is incorrectly matched with return 4. However, for the sake of the question, let's assume this is what I want; I can fix this later.)
require(data.table)
DD<-data.table(demandSet,key="demandCustIds,demandProdIds")
DR<-data.table(returnSet,key="returnCustIds,returnProdIds")
DD[DR,mult="first"]
   demandCustIds demandProdIds demandDates returnDates
1:             1             1           1           4
2:             1             2           1           4
3:             1             3           1           4
4:             4             5          NA           1
5:             4             7          NA           1
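The keyed join avoids rescanning the return set for every demand. The same idea can be sketched in plain Python (not data.table) with a dictionary keyed by (customer id, product id): one pass over the returns records the earliest return for each pair, and one pass over the demands looks it up. This sketch reproduces what the loop in the question computes, using 0-based row indices and None for "no match" (both assumptions), and it keeps the loop's behaviour of matching several demands to the same return.

```python
# Working example from the question, as (date, customer id, product id) tuples.
demand_set = [(1, 1, 1), (1, 1, 2), (1, 1, 3), (5, 2, 4),
              (6, 3, 1), (6, 3, 5), (8, 1, 2), (8, 1, 6)]
return_set = [(1, 4, 5), (1, 4, 7), (4, 1, 1), (4, 1, 2), (4, 1, 3)]

# One pass over the returns: remember the row index of the earliest return per key.
earliest_return = {}
for index, (date, customer, product) in enumerate(return_set):
    key = (customer, product)
    best = earliest_return.get(key)
    # Strict comparison keeps the first row on ties, like which.min in R.
    if best is None or date < return_set[best][0]:
        earliest_return[key] = index

# One pass over the demands: constant-time dictionary lookup instead of a full scan.
matched = [earliest_return.get((customer, product))
           for _, customer, product in demand_set]

print(matched)  # [2, 3, 4, None, None, None, 3, None]
```

The total work is proportional to the number of demand rows plus return rows, rather than their product as in the original loop.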