I have a data set which has products and their quantity sold. I want to write R code that tells me the best-selling product.
Products Quantity
Laminated 520
Laminated 150
Laminated 639
Laminated 702
SUPERSTAR 3
TAMAX 500
TAMAX 20
TAMAX 40
GreenDragon 40
GreenDragon 50
XPLODE 40
XPLODE 20
EXPERT 40
KHANJARBIOSL 40
Here, just by looking at the data set, we can say Laminated is the best-selling product in terms of quantity sold. Can we write R code for this?
Thanks
There could be multiple ways to do this. One way, using tapply, is to get the sum of Quantity for each Product and then get the name of the maximum value.
names(which.max(tapply(df$Quantity, df$Products, sum, na.rm = TRUE)))
#[1] "Laminated"
You can use the data.table package. First do the sum, then sort in descending order based on the aggregated value, and then fetch the first row.
tb = data.frame("Products" = c("Laminated", "Laminated", "Laminated", "Laminated", "SUPERSTAR",
                               "TAMAX", "TAMAX", "TAMAX", "GreenDragon", "GreenDragon",
                               "XPLODE", "XPLODE", "EXPERT", "KHANJARBIOSL"),
                "Quantity" = c(520, 150, 639, 702, 3, 500, 20, 40, 40, 50, 40, 20, 40, 40))
library(data.table)
tb = data.table(tb)
tb[,sum(Quantity), by="Products"][order(-V1)][1]
Related
I am currently working on the so-called "Moneyball" problem. I am basically trying to select the best combination of three baseball players (based on certain baseball-relevant statistics) for the least amount of money.
I have the following dataset (OBP, SLG, and AB are statistics that describe the performance of a player):
# the table has about 100 observations;
# the data frame is called "batting.2001"
playerID OBP SLG AB salary
giambja01 0.3569001 0.6096154 20 410333
heltoto01 0.4316547 0.4948382 57 4950000
berkmla01 0.2102326 0.6204506 277 305000
gonzalu01 0.4285714 0.3880131 409 9200000
martied01 0.4234079 0.5425532 100 5500000
My goal is to pick three players who in combination have the highest possible sum of OBP, SLG, and AB, but at the same time do not exceed a total salary of 15,000,000 dollars.
My approach so far has been rather simple: I just arrange the columns OBP, SLG, and AB in descending order and then pick the three players at the top whose combined salary does not exceed the restriction of 15 million dollars:
library(dplyr)

batting.2001 %>%
  arrange(desc(OBP), desc(SLG), desc(AB))
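To make that approach concrete, here is a rough greedy sketch (assuming dplyr and the column names from the sample above); note that a greedy pick like this does not necessarily find the best possible combination:
library(dplyr)

# greedy sketch: after sorting, keep the leading players whose running
# salary total stays within the 15,000,000 budget, then take three of them
batting.2001 %>%
  arrange(desc(OBP), desc(SLG), desc(AB)) %>%
  filter(cumsum(salary) <= 15000000) %>%
  slice(1:3)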
Can any of you think of a better solution? Also, what if I wanted to get the best combination of three players for the least amount of money? What approach would you use in that scenario?
Thanks in advance, and looking forward to reading your solutions.
I have a table like the one below. Each row has a store id and the discount % for one of that store's coupons. Each store can have multiple coupons, but (store + discount %) is a primary key. I would like to find the top 10 coupons (in decreasing order of discount %), but I want at most 2 coupons from the same store. What is the most efficient way to do this? My logic involves sorting the data multiple times. Is there a better and more efficient way? I would like to do this in R.
Sample data:
df <- data.frame(Store=c("Lowes","Lowes","Lowes","Lowes","HD","HD","HD","ACE",
"ACE","Misc","Misc","Other","Other","Last","Last","Last"),
`discount_%`=c("60%","50%","40%","30%","60%","50%","40%","30%",
"20%","50%","30%","20%","10%","10%","5%","3%"),
check.names = FALSE)
My solution is:
1. Ignore the store, sort the table by discount, and create an ID. The ID represents the coupons in descending order of discount.
2. Then, by Store and discount, create ID2, which holds the ranking of coupons within each store.
3. Then filter out all rows where ID2 > 2.
4. Then sort the table by ID.
5. Take the top 10 rows.
(A rough sketch of these steps is shown right after this list.)
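As a hedged illustration of those steps with data.table (the numeric conversion of discount_% mirrors the answer below; dt, ID, and ID2 are just names taken from the description above):
library(data.table)

dt <- as.data.table(df)
dt[, `discount_%` := as.numeric(gsub("%", "", `discount_%`))]  # "60%" -> 60
setorderv(dt, "discount_%", order = -1L)   # sort by discount, descending
dt[, ID := .I]                             # overall coupon rank
dt[, ID2 := seq_len(.N), by = Store]       # within-store coupon rank
head(dt[ID2 <= 2][order(ID)], 10)          # at most 2 per store, then top 10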
Try this:
# convert "60%", "50%", ... to numeric values
df$`discount_%` <- as.numeric(gsub("%", "", df$`discount_%`))
require(data.table)
# take the top 2 coupons per store, then the 10 highest discounts overall
setDT(df)[order(-`discount_%`), .SD[1:2], by = Store][order(-`discount_%`)[1:10], ]
Output:
Store discount_%
1: Lowes 60
2: HD 60
3: Lowes 50
4: HD 50
5: Misc 50
6: Misc 30
7: ACE 30
8: ACE 20
9: Other 20
10: Other 10
Data is easier to work with in R without special characters, but if you need to add the percent sign back, try something like this:
paste0(df$`discount_%`,"%")
I am trying to do some market basket analysis using the arules package, but when I use the summary() function on an itemMatrix object to check which are the most frequent items, the numbers do not add up.
If I do:
library(arules)
x <- read.transactions("Supermarket2014-15.csv")
summary(x)
I get:
transactions as itemMatrix in sparse format with
5001 rows (elements/itemsets/transactions) and
997 columns (items) and a density of 0.003557162
most frequent items:
45 28 42 35 22 (Other)
503 462 444 440 413 15474
But if I check with a for loop, or even in Excel, the count for the product 45 is 513 and not 503. The same for 28, which should be 499, and so on.
The odd thing is that if I sum up all the totals (15474 + 413 + 440 + 444 + 462 + 503), I get the correct total number of transacted products.
The data has several NA values and products are factors.
And here is the raw data (Day ranges from 1 to 28, Product ranges from 1 to 50):
If you look at the result of your str(x) call, you will see under @itemInfo and $labels that some items have labels like "1;1", etc. This means the items were not correctly separated when the file was read in. The default separator in read.transactions() is white space, but you seem to have (some) semicolons in your file. Try sep = ";" in read.transactions().
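As a minimal sketch of that fix (file name taken from the question; all other arguments left at their defaults):
library(arules)

# re-read the transactions with ";" as the item separator so that
# entries like "1;1" are split into separate items
x <- read.transactions("Supermarket2014-15.csv", sep = ";")
summary(x)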
I want to create a table of the 10 most frequent reasons people discontinue a course. There are around 2,000 responses to my discontinuation survey, in a dataset entitled 'Discontinued'. There are 35 categories describing the 'Reason'. Currently I have been using the code below, but it gives me the frequency for all 35 discontinuation codes.
Discontinued[,list(Count= .N), by = reason][order(-Count)]
The data.table way to sort is setorder. So instead of
Discontinued[,list(Count= .N), by = reason][order(-Count)][1:10]
it should be faster to use
setorder(Discontinued[, list(Count= .N), by = reason], -Count)[1L:10L]
I am trying to get this working with some simple method.
Say there is a table of cars sold, with the name of the car model and the price each car was sold for.
E.g.,
CarName Price
AcuraLegend 30000
AcuraTSX 40000
HondaCivic 20000
HondaCivic 22000
HondaCivic 22000
ToyotaCamry 18000
and then 2900 more entries
What I need is to find the maximum price each car was sold for and the number of cars of that type sold for that maximum price. So, using the above data frame, and assuming that the max price paid for a HondaCivic in the entire data frame was 22000 and only 2 cars were sold for this price, for HondaCivic I would have:
CarName MaxPricePaidForCar NumberofCarsSoldforMaxPrice
HondaCivic 22000 2
Now, I have managed to put this together in a rather tedious way using tapply, merge, etc.
Any suggestions on a simpler method would be very helpful.
To do this for each unique type of car, you can use ddply in the plyr package:
library(plyr)
ddply(carList, .(carName), .fun = summarise, maxPrice = max(Price),
      numCars = sum(Price == max(Price)))
Here is another approach using data.table. If your data frame is large and speed is of concern, this should give you approximately a 4x speedup.
library(data.table)
carT = data.table(carList)
carT[,list(maxPrice = max(Price), numCars = sum(Price == max(Price))),'carName']
I quite like cast from the reshape package for these little tasks:
library(reshape)
cast(df, CarName ~ ., c(function(x) sum(x == max(x)), max))