sorting data to provide top entries - r

I have a table like the one below. Each row has a store ID and the discount % for one of that store's coupons. Each store can have multiple coupons, but (store + discount %) is a primary key. I would like to find the top 10 coupons (in decreasing order of discount %), but take at most 2 coupons from the same store. What is the most efficient way to do this? My logic involves sorting the data multiple times. Is there a better, more efficient way? I would like to do this in R.
Sample data:
df <- data.frame(Store=c("Lowes","Lowes","Lowes","Lowes","HD","HD","HD","ACE",
"ACE","Misc","Misc","Other","Other","Last","Last","Last"),
`discount_%`=c("60%","50%","40%","30%","60%","50%","40%","30%",
"20%","50%","30%","20%","10%","10%","5%","3%"),
check.names = FALSE)
My solution is to ignore the store and sort the table by discount, then create an ID, where ID ranks all coupons in descending order of discount. Then, by store and discount, create ID2, which ranks the coupons within each store. Then drop all rows where ID2 > 2, sort the table by ID, and take the top 10 rows (a base-R sketch of these steps is shown below).
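A minimal base-R sketch of that multi-step logic, assuming the discount column is first converted to numeric:
df$`discount_%` <- as.numeric(gsub("%", "", df$`discount_%`))  # "60%" -> 60
df <- df[order(-df$`discount_%`), ]                  # sort by discount, descending
df$ID <- seq_len(nrow(df))                           # overall coupon rank
df$ID2 <- ave(df$ID, df$Store, FUN = seq_along)      # within-store rank
df <- df[df$ID2 <= 2, ]                              # keep at most 2 coupons per store
head(df[order(df$ID), ], 10)                         # top 10 overall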

Try this:
df$`discount_%` <- as.numeric(gsub("%", "", df$`discount_%`))  # "60%" -> 60
require(data.table)
# Keep the 2 highest discounts per store, then take the 10 highest of those:
setDT(df)[order(-`discount_%`), .SD[1:2], by = Store][order(-`discount_%`)[1:10], ]
Output:
Store discount_%
1: Lowes 60
2: HD 60
3: Lowes 50
4: HD 50
5: Misc 50
6: Misc 30
7: ACE 30
8: ACE 20
9: Other 20
10: Other 10
Data is easier to work with in R without special characters, but if you need to add the percent sign back, try something like this:
paste0(df$`discount_%`,"%")
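An equivalent dplyr approach might look like this (a sketch, assuming dplyr 1.0+ for slice_max and the numeric discount column from above; with_ties = FALSE guarantees exactly n rows):
library(dplyr)
df %>%
  group_by(Store) %>%
  slice_max(`discount_%`, n = 2, with_ties = FALSE) %>%   # top 2 per store
  ungroup() %>%
  slice_max(`discount_%`, n = 10, with_ties = FALSE)      # top 10 overall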

Related

Is there a way I can use R code to calculate the average price for specific days? (AVERAGEIF function)

Firstly: I have seen other posts about translating AVERAGEIF from Excel into R, but none of them worked for my specific case and I couldn't manage to adapt one myself.
I have a dataset which contains the daily prices of a bunch of listings.
It looks like this:
listing_id date price
1 1000 1/2/2015 $100
2 1200 2/4/2016 $150
Sample of the dataset (and desired outcome) # https://send.firefox.com/download/228f31e39d18738d/#rlMmm6UeGxgbkzsSD5OsQw
The dataset I would like to have contains only the date and the average price of all listings on that date. The goal is to get a (different) dataframe which would look something like this, so I can work with it:
Date Average Price
1 4/5/2015 204.5438
2 4/6/2015 182.6439
3 4/7/2015 176.553
4 4/8/2015 182.0448
5 4/9/2015 183.3617
6 4/10/2015 205.0997
7 4/11/2015 197.0118
8 4/12/2015 172.2943
I created this in Excel using the AVERAGEIF function (and copy-pasting by value) from the sample provided above.
I first tried to format the data in Excel so I could use AVERAGEIF to take the average if the row matches a specific date. The problem with this is that the dataset consists of 30 million rows and Excel only allows about 1 million, so it didn't work.
What I have done so far: I created a data frame in R (where I want the average prices to go) using:
Avg = data.frame("Date" =1:2, "Average Price"=1:2)
Avg[nrow(Avg) + 2036,] = list("v1","v2")
Avg$Date = seq(from = as.Date("2015-04-05"), to = as.Date("2020-11-01"), by = 'day')
I tried to create an AVERAGEIF-like function following this article and another, but could not get it to work.
I hope this is enough information to go on otherwise I would be more than happy to provide more.
If your question is how to replicate the AVERAGEIF function, you can use logical indexing:
R code:
> df
Dates Prices
1 1 100
2 2 120
3 3 150
4 1 320
5 2 250
6 3 210
7 1 102
8 2 180
9 3 150
idx <- df$Dates == 1 # Positions where condition is true
mean(df$Prices[idx]) # Prints same output as Excel
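To build the full per-date table the question asks for (the average price for every date at once), a grouped mean avoids looping over each date. A minimal sketch on the question's dataset, assuming the price column holds strings like "$100" that need cleaning first:
df$price <- as.numeric(gsub("[$,]", "", df$price))      # "$100" -> 100
avg <- aggregate(price ~ date, data = df, FUN = mean)   # one average per date
names(avg) <- c("Date", "Average Price")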

Count in group_concat

I have this situation in a MySQL table.
-----------------
code gr. state
-----------------
10 a available
10 a sold
10 b available
10 a available
10 a sold
10 a printed
10 b available
10 b sold
10 b available
------------------
I need to group these data by group, getting something like
group a -> available(3), sold(2), printed(1)
group b -> available(2), sold(1), printed(0)
I tried combining group_concat() and count() but can't get the result I need.
My goal is to have a single row per group (GROUP BY is fine).
The states are always these 3 (available, sold, printed).
Thanks for any help.
SUM with IF should give you the right answer:
SELECT gr,
       SUM(IF(state = 'available', 1, 0)) AS available,
       SUM(IF(state = 'sold', 1, 0)) AS sold,
       SUM(IF(state = 'printed', 1, 0)) AS printed
FROM your_table
GROUP BY gr
If you need the literal available(3), sold(2), printed(1) text in a single column, these sums can be wrapped in CONCAT().

Using "shift" function in R to subtract one row from another by group

I have a data.table that looks like this:
dt
id month balance
1: 1 4 100
2: 1 5 50
3: 2 4 200
4: 2 5 135
5: 3 4 100
6: 3 5 100
7: 4 5 300
"id" is the client's ID, "month" indicates what month it is, and "balance" indicates the account balance of a client. In a sense, this is longitudinal data where, say, element (2,3) indicates that Client #1 has an account balance of 50 at the end of month 5.
I want to generate a column that gives the difference between a client's balance in month 5 and month 4, to capture the transactions carried out from one month to the next.
This new variable should show that Client 1 withdrew 50, Client 2 withdrew 65, and Client 3 did nothing in aggregate terms between April and May. Client 4 is a new client who joined in May.
I thought of the following code:
dt$transactions <- dt$balance - shift(dt$balance, 1, "up")
However, it does not work properly because it's telling me that Client 4 made a 200 dollar deposit (but Client 4 is new!). Therefore, I want to be able to introduce the argument "by=id" to this somehow.
I know the solution lies in using the following notation:
dt[, transactions := balance - shift(balance, ??? ), by=id]
I just need to figure out how to make the aforementioned code work properly.
Thanks in advance.
Given that I only have two observations (at most), the following code gives me an elegant solution:
dt[, transaction := balance - first(balance), by = id]
This prevents any NAs from entering the variable transaction.
However, if I had more observations per id, I would do the following:
dt[,transaction := balance - shift(balance,1), by = id]
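As a quick check, on the sample data above this should produce the following (NA marks each client's first observed month):
   id month balance transaction
1:  1     4     100          NA
2:  1     5      50         -50
3:  2     4     200          NA
4:  2     5     135         -65
5:  3     4     100          NA
6:  3     5     100           0
7:  4     5     300          NA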
Big thanks to @Ryan and @Onyambu for helping.

How to extract all the values based on a match of a column in two dataframes using R?

I have a dataframe, say n, like this:
id subject
-------------
1 discount less
2 product good
3 product good
4 wonderful service
5 discount less
and another dataframe, say p, like this:
Subject Rate
----------------
product good 20
wonderful service 30
discount less 10
I want the output to be:
id subject rate
--------------------
1,5 discount less 10
2,3 product good 20
4 wonderful service 30
If I match like p$id <- n$id[match(p$Subject, n$subject)] then only the first matched element is shown, but I want all the IDs.
Can anyone guide me on this?
How about something like this (note that p's column is capital-S "Subject", so the merge needs by.x/by.y):
n$subject <- as.character(n$subject)
id <- sapply(unique(n$subject), function(x) paste(as.character(n[n$subject == x, ]$id), collapse = ", "))
subject <- unique(n$subject)
df1 <- data.frame(id = id, subject = subject)
df2 <- merge(df1, p, by.x = "subject", by.y = "Subject")
df2 <- df2[c("id", "subject", "Rate")]

Conditionally matching elements in multiple columns of two large datasets with each other

I have two very large datasets for demand and returns of products (about 4 million entries per dataset, but unequal length). The first dataset gives [1] the date of demand, [2] the id of the customer and [3] the id of the product. The second dataset gives the [1] date of return, [2] the id of the customer and [3] the id of the product.
Now I would like to match all demands for given customers and products with the returns of the same customer and product. Pairs of product types and customers are not unique, because a customer can demand a product multiple times. Therefore, I want to match a demand for a product with the earliest return in the dataset. It can also happen that some products are not returned, or that some products are returned which have not been demanded (because customers return items that were demanded before the starting date in the dataset).
To that end I've written the following code:
transactionNumber <- 1:nrow(demandSet)    # transaction numbers for the demandSet
matchedNumber <- rep(0, nrow(demandSet))  # which returnSet row corresponds to each demandSet transaction
for (transaction in transactionNumber) {
  indices <- which(returnSet[, 2] == demandSet[transaction, 2] &
                   returnSet[, 3] == demandSet[transaction, 3])
  if (length(indices) > 0) {
    # Select the index of the matching return with the minimum (earliest) date
    matchedNumber[transaction] <- indices[which.min(returnSet[indices, 1])]
  }
}
However, this takes around a day to compute. Does anyone have a better suggestion? Note that the suggestions from "match two columns with two other columns" do not work here, since match() overflows memory.
As a working example consider
demandDates = c(1,1,1,5,6,6,8,8)
demandCustIds = c(1,1,1,2,3,3,1,1)
demandProdIds = c(1,2,3,4,1,5,2,6)
demandSet = data.frame(demandDates,demandCustIds,demandProdIds)
returnDates = c(1,1,4,4,4)
returnCustIds = c(4,4,1,1,1)
returnProdIds = c(5,7,1,2,3)
returnSet = data.frame(returnDates,returnCustIds,returnProdIds)
(This actually doesn't work completely correctly, since transaction 7 is incorrectly matched with return 4; however, for the sake of the question let's assume this is what I want... I can fix this later.)
require(data.table)
DD <- data.table(demandSet, key = "demandCustIds,demandProdIds")
DR <- data.table(returnSet, key = "returnCustIds,returnProdIds")
DD[DR, mult = "first"]  # for each return, join the first matching demand
demandCustIds demandProdIds demandDates returnDates
1: 1 1 1 4
2: 1 2 1 4
3: 1 3 1 4
4: 4 5 NA 1
5: 4 7 NA 1
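To match in the direction the question asks (each demand to its earliest return), one possible sketch is to sort the returns by date before keying; data.table's key sort is stable, so within each customer/product pair the earliest return comes first and mult = "first" picks it (note this still allows one return to match several demands):
DR <- data.table(returnSet[order(returnSet$returnDates), ],
                 key = "returnCustIds,returnProdIds")
DR[DD, mult = "first"]  # each demand joined to its earliest matching return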
