id product id2 year cost
1 biscuits 202-55-3041 2017 2
2 biscuits 903-36-9457 2014 2
3 biscuits 938-33-7254 2014 2
4 biscuits 739-29-5963 2017 2
5 biscuits 731-49-5483 2017 2
6 biscuits 892-15-2567 2018 2
7 biscuits 518-79-7674 2017 2
8 biscuits 305-63-7908 2017 2
This is my current data set; it is called 'total1'.
I am a beginner in R and I was wondering if there is a way to add up the cost of the product by year, for example:
In 2017 there were 10 biscuits sold
In 2018 there were 2 biscuits sold
I am trying to determine which is the least profitable year in terms of biscuits sold.
I apologise if this has been answered elsewhere; if so, please direct me there. Thank you.
Assuming that the number of sold items is stored in column cost, here's a simple solution using tapply:
tapply(total1$cost, total1$year, sum)
2014 2017 2018
4 10 2
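To answer the original question (finding the least profitable year), you can take the minimum of the summed table; a minimal base-R sketch, reconstructing the example data from the question:

```r
# Reconstruct the example data from the question
total1 <- data.frame(
  product = "biscuits",
  year = c(2017, 2014, 2014, 2017, 2017, 2018, 2017, 2017),
  cost = 2
)

# Total cost per year
totals <- tapply(total1$cost, total1$year, sum)
totals
# 2014 2017 2018
#    4   10    2

# Year with the lowest total
names(totals)[which.min(totals)]
# [1] "2018"
```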
Another simple solution uses aggregate:
Edit:
thanks to @Darren Tsai's comment, the code here is simplified:
aggregate(cost ~ year, total1, sum)
  year cost
1 2014    4
2 2017   10
3 2018    2
I have three columns in Excel: year, month, value.
I want to average value grouped by year and month. In R this would be done with group_by(). How can this be done in Excel?
year month value
2019 1 12
2019 1 34
2019 2 56
2019 2 15
2020 1 16
2020 3 67
2020 4 89
2018 6 123
2018 6 45
2018 7 98
2019 3 53
2019 1 23
2020 1 12
2020 3 1
If you have Office 365, you can use:
=LET(
y,A2:A15,
m,B2:B15,
v,C2:C15,
u,SORT(UNIQUE(CHOOSE({1,2},y,m)),{1,2}),
CHOOSE({1,1,2},u,AVERAGEIFS(v,y,INDEX(u,0,1),m,INDEX(u,0,2))))
Put this in the first cell and it will spill the results.
Once HSTACK is released, we can replace CHOOSE with it:
=LET(
y,A2:A15,
m,B2:B15,
v,C2:C15,
u,SORT(UNIQUE(HSTACK(y,m)),{1,2}),
HSTACK(u,AVERAGEIFS(v,y,INDEX(u,0,1),m,INDEX(u,0,2))))
AVERAGEIFS will do what you want, but you may also want to look at the FILTER function to replicate the group_by() approach for other, similar procedures. Once grouped, you can sum, average, sort, etc.
Averageifs:
=AVERAGEIFS(C:C,A:A,2018,B:B,6)
Filter:
=filter(C:C,(A:A=2018)*(B:B=6))
=Average(filter(C:C,(A:A=2018)*(B:B=6)))
See this spreadsheet for examples of both. I realize you're using Excel, but these formulas should work in both Excel and Google Sheets (though the two implementations are not identical).
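For comparison with the group_by() approach the asker mentions, here is the same year/month averaging in R (a sketch using base R's aggregate() rather than dplyr, with the data from the question):

```r
# The data from the question
df <- data.frame(
  year  = c(2019, 2019, 2019, 2019, 2020, 2020, 2020,
            2018, 2018, 2018, 2019, 2019, 2020, 2020),
  month = c(1, 1, 2, 2, 1, 3, 4, 6, 6, 7, 3, 1, 1, 3),
  value = c(12, 34, 56, 15, 16, 67, 89, 123, 45, 98, 53, 23, 12, 1)
)

# Mean of value for every year/month combination
means <- aggregate(value ~ year + month, data = df, FUN = mean)
means
```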
I have a dataframe named 'reviews' like this:
score_phrase title score release_year release_month release_day
1 Amazing LittleBigPlanet PS Vita 9 2012 9 12
2 Amazing LittleBigPlanet PS Vita -- Marvel Super Hero Edition 9 2012 9 12
3 Great Splice: Tree of Life 8.5 2012 9 12
4 Great NHL 13 8.5 2012 9 11
5 Great NHL 13 8.5 2012 9 11
6 Good Total War Battles: Shogun 7 2012 9 11
7 Awful Double Dragon: Neon 3 2012 9 11
8 Amazing Guild Wars 2 9 2012 9 11
9 Awful Double Dragon: Neon 3 2012 9 11
10 Good Total War Battles: Shogun 7 2012 9 11
Objective: slight mismatches/typos in column values cause duplicated records. Here rows 1 and 2 are duplicates, and row 2 should be dropped during de-duplication.
I used the dedup() function from the 'scrubr' package to perform de-duplication, but on a large dataset I get an incorrect number of duplicates when I adjust the tolerance level for string matching.
For example:
partial_dup_data <- reviews[1:100,] %>% dedup(tolerance = 0.7)
#count w/o duplicates: 90
attr(partial_dup_data, "dups")
# count of identified duplicates: 16
Could somebody suggest what I am doing incorrectly? Is there another approach to achieve the objective?
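Not an answer about scrubr's internals, but one possible alternative approach (a sketch, assuming the near-duplicates show up as partial title matches, as in rows 1 and 2): base R's adist() with partial = TRUE returns 0 when one string is contained in the other, which can be used to flag fuzzy duplicates:

```r
# A few titles from the question; row 2 is a near-duplicate of row 1
reviews <- data.frame(
  title = c("LittleBigPlanet PS Vita",
            "LittleBigPlanet PS Vita -- Marvel Super Hero Edition",
            "Splice: Tree of Life"),
  stringsAsFactors = FALSE
)

# Partial edit distance: d[i, j] is 0 when title i occurs inside title j
d <- adist(reviews$title, reviews$title, partial = TRUE)

# Flag a row as a duplicate if any earlier title partially matches it
dup <- sapply(seq_len(nrow(reviews)),
              function(i) any(d[seq_len(i - 1), i] == 0))

deduped <- reviews[!dup, ]
```

This only catches exact containment; in practice you would compare a normalized distance against a threshold, analogous to scrubr's tolerance parameter.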
customer_id transaction_id month year
1 3 7 2014
1 4 7 2014
2 5 7 2014
2 6 8 2014
1 7 8 2014
3 8 9 2015
1 9 9 2015
4 10 9 2015
5 11 9 2015
2 12 9 2015
I am familiar with R basics. Any help will be appreciated.
The expected output should look like the following:
month year number_unique_customers_added
7 2014 2
8 2014 0
9 2015 3
In month 7 of 2014, only customer_ids 1 and 2 are present, so the number of customers added is two. In month 8 of 2014, no new customer_ids are added, so there should be zero customers added in this period. Finally, in month 9 of 2015, customer_ids 3, 4 and 5 are new, so the number of customers added in this period is 3.
Using data.table:
require(data.table)
dt <- as.data.table(df)
dt[, new := !duplicated(customer_id)][, .(number_unique_customers_added = sum(new)), by = .(year, month)]
Explanation: we flag each customer's first transaction (the one where she is a "new customer") and then sum those flags within each combination of year and month. Summing flags, rather than keeping only first transactions, preserves groups with zero new customers, such as month 8 of 2014.
Using dplyr, we can first create a column indicating whether each row is a customer's first appearance, and then group by month and year to count the new customers in each group.
library(dplyr)
df %>%
mutate(unique_customers = !duplicated(customer_id)) %>%
group_by(month, year) %>%
summarise(unique_customers = sum(unique_customers))
# month year unique_customers
# <int> <int> <int>
#1 7 2014 2
#2 8 2014 0
#3 9 2015 3
I have this data set:
month Year Rain
10 2010 376.8
11 2010 282.78
12 2010 324.58
1 2011 73.51
2 2011 225.89
3 2011 22.96
I used
df2prnext <- aggregate(Rain ~ Year, data = subdataprnext, mean)
but I need the mean value of 217.53.
I am not getting the expected result. Thank you for your help.
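For what it's worth, a sketch of the likely mix-up (assuming the data frame is named subdataprnext as in the question): aggregate(Rain ~ Year, ...) returns one mean per Year, while a single overall mean comes from mean() on the column itself. The overall mean of the six values shown is about 217.75, close to the 217.53 the asker quotes; presumably the full data set differs slightly.

```r
# The data from the question
subdataprnext <- data.frame(
  month = c(10, 11, 12, 1, 2, 3),
  Year  = c(2010, 2010, 2010, 2011, 2011, 2011),
  Rain  = c(376.8, 282.78, 324.58, 73.51, 225.89, 22.96)
)

# One mean per Year (what aggregate returns): 2010 and 2011 separately
aggregate(Rain ~ Year, data = subdataprnext, mean)

# A single overall mean across all rows
mean(subdataprnext$Rain)
```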
I have data organized by two ID variables, Year and Country, like so:
Year Country VarA VarB
2015 USA 1 3
2016 USA 2 2
2014 Canada 0 10
2015 Canada 6 5
2016 Canada 7 8
I'd like to keep Year as an ID variable, but create multiple columns for VarA and VarB, one for each value of Country (I'm not picky about column order), to make the following table:
Year VarA.Canada VarA.USA VarB.Canada VarB.USA
2014 0 NA 10 NA
2015 6 1 5 3
2016 7 2 8 2
I managed to do this with the following code:
require(data.table)
require(reshape2)
data <- as.data.table(read.table(header=TRUE, text='Year Country VarA VarB
2015 USA 1 3
2016 USA 2 2
2014 Canada 0 10
2015 Canada 6 5
2016 Canada 7 8'))
molten <- melt(data, id.vars=c('Year', 'Country'))
molten[,variable:=paste(variable, Country, sep='.')]
recast <- dcast(molten, Year ~ variable)
But this seems a bit hacky (especially editing the default-named variable field). Can I do it with fewer function calls? Ideally I could just call one function, specifying the columns to drop as IDs and the formula for creating new variable names.
Using dcast you can cast multiple value.vars at once (from data.table v1.9.6 on). Try:
dcast(data, Year ~ Country, value.var = c("VarA","VarB"), sep = ".")
# Year VarA.Canada VarA.USA VarB.Canada VarB.USA
#1: 2014 0 NA 10 NA
#2: 2015 6 1 5 3
#3: 2016 7 2 8 2
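If you prefer to avoid extra packages, base R's reshape() can also widen both value columns in one call (a sketch; it produces the same cells, though row and column order differ from the dcast output):

```r
# The data from the question
data <- data.frame(
  Year    = c(2015, 2016, 2014, 2015, 2016),
  Country = c("USA", "USA", "Canada", "Canada", "Canada"),
  VarA    = c(1, 2, 0, 6, 7),
  VarB    = c(3, 2, 10, 5, 8)
)

# One column per Country for each of VarA and VarB, e.g. VarA.USA, VarB.Canada
wide <- reshape(data, idvar = "Year", timevar = "Country", direction = "wide")
wide
```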