Ntile and decile function depending on two columns in R [duplicate] - r

This question already has answers here:
Aggregate / summarize multiple variables per group (e.g. sum, mean)
(10 answers)
Closed 4 years ago.
I would like to add a new column "Ntile" that gives the ntile number of column 2, "mileage", computed separately within each value of column 1, "year".
   year mileage
  <dbl>   <dbl>
1  2011    7413
2  2011   10926
3  2011    7351
4  2011   11613
5  2012    8367
6  2010   25125
mydata$Ntile <- ntile(mydata$mileage, 10)
I know the easy-to-use function ntile, but I do not know how to make it depend on two columns. I would like the ntiles of mileage to be calculated separately for each year (2010, 2011 and 2012) and stored in a new column "Ntile".
PS: I know there is not enough data to calculate Ntiles for 2011 and 2012, it is just an example.

I like the data.table approach:
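library(data.table)
library(dplyr)  # ntile() comes from dplyr
mydata <- as.data.table(mydata)
# compute the deciles of mileage separately within each year
mydata[, Ntile := ntile(mileage, 10), by = year]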
Best!
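If you would rather not load any packages, the same per-group idea can be sketched in base R with ave(). Note that ntile_base below is a hypothetical helper written for this example, not a function from any package; it uses a simple rank-based rule, so its bucket sizes may not match dplyr::ntile exactly in every case.

```r
# Hypothetical helper (not from any package): split x into n rank-based buckets.
ntile_base <- function(x, n) {
  r <- rank(x, ties.method = "first", na.last = "keep")  # NA stays NA
  len <- sum(!is.na(x))
  as.integer(floor((r - 1) * n / len) + 1)
}

# Sample data from the question
mydata <- data.frame(
  year    = c(2011, 2011, 2011, 2011, 2012, 2010),
  mileage = c(7413, 10926, 7351, 11613, 8367, 25125)
)

# ave() applies the function within each year and returns a vector
# aligned with the original rows, so no rows are reordered or dropped.
mydata$Ntile <- ave(mydata$mileage, mydata$year,
                    FUN = function(m) ntile_base(m, 10))
print(mydata)
```

Because ave() keeps the original row order, the result can be assigned straight back as a new column.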

Related

I need to classify by categories without mixing the data of different columns [duplicate]

This question already has answers here:
Aggregate / summarize multiple variables per group (e.g. sum, mean)
(10 answers)
Closed 4 years ago.
I have the following dataset:
Year Company Product Sales
2017 X A 10
2017 Y A 20
2017 Z B 20
2017 X B 10
2018 X B 20
2018 Y B 30
2018 X A 10
2018 Z A 10
I want to obtain the following summary:
Year Product Sales
2017 A 30
B 30
2018 A 50
B 20
and also the following summary:
Year Company Sales
2017 X 20
Y 20
Z 20
2018 X 50
Y 10
Z 10
Is there any way to do it without using loops?
I know I could do something with the function aggregate, but I don't know how to proceed with it without mixing the data of company, product and year. For example, I get the total sales of product A and B, but it's mixing the sales of both years instead of giving A and B in 2017, and separated in 2018.
Do you have any suggestions?
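Let's say your dataframe is called df. With pandas (Python), pass the grouping columns as a list:
# one total per (Year, Product) pair and per (Year, Company) pair
df1 = df.groupby(['Year', 'Product'])['Sales'].sum()
df2 = df.groupby(['Year', 'Company'])['Sales'].sum()
I believe this would give you the two summaries without mixing anything :) ! Each result is a pandas Series indexed by the grouping columns; call .reset_index() on it if you need a DataFrame.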
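Since the question mentions aggregate, here is a minimal base R sketch of the same grouping, assuming the eight sample rows above are loaded into a data frame called df. The names by_product and by_company are only illustrative. The totals are computed from the sample rows exactly as listed.

```r
# Sample rows from the question
df <- data.frame(
  Year    = c(2017, 2017, 2017, 2017, 2018, 2018, 2018, 2018),
  Company = c("X", "Y", "Z", "X", "X", "Y", "X", "Z"),
  Product = c("A", "A", "B", "B", "B", "B", "A", "A"),
  Sales   = c(10, 20, 20, 10, 20, 30, 10, 10)
)

# Grouping by two variables at once keeps the years separate.
by_product <- aggregate(Sales ~ Year + Product, data = df, FUN = sum)
by_company <- aggregate(Sales ~ Year + Company, data = df, FUN = sum)

# Sort by year first so the output reads like the summaries in the question.
by_product <- by_product[order(by_product$Year, by_product$Product), ]
by_company <- by_company[order(by_company$Year, by_company$Company), ]

print(by_product)
print(by_company)
```

Putting both grouping variables on the right-hand side of the formula is what prevents the sales of different years from being mixed.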

How to sum a variable by group but do not aggregate the data frame in R? [duplicate]

This question already has answers here:
Count number of rows per group and add result to original data frame
(11 answers)
Calculate group mean, sum, or other summary stats. and assign column to original data
(4 answers)
Closed 4 years ago.
Although I have found a lot of ways to calculate the sum of a variable by group, all the approaches end up creating a new data set which collapses the duplicated cases into one row per group.
To be more precise, if I have a data frame:
id year
1 2010
1 2015
1 2017
2 2011
2 2017
3 2015
and I want to count the number of times I have the same ID by the different years, there are a lot of ways (using aggregate, tapply, dplyr, sqldf etc) which use a "group by" kind of functionality that in the end will give something like:
id count
1 3
2 2
3 1
I haven't managed to find a way to calculate the same thing but keep my original data frame, in order to obtain:
id year count
1 2010 3
1 2015 3
1 2017 3
2 2011 2
2 2017 2
3 2015 1
and therefore not collapse my duplicated cases.
Has anybody already figured this out?
Thank you in advance
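One possible way to do this in base R is ave(), which computes a per-group value and returns it aligned with the original rows. This is only a sketch, assuming the data frame is called df as below.

```r
# Sample data from the question
df <- data.frame(
  id   = c(1, 1, 1, 2, 2, 3),
  year = c(2010, 2015, 2017, 2011, 2017, 2015)
)

# length() is evaluated once per id; ave() repeats the result for
# every row of that id, so the data frame keeps all six rows.
df$count <- ave(df$year, df$id, FUN = length)
print(df)
```

The first argument only supplies the values that are split by group; with FUN = length, any column without grouping gaps would do.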

Aggregate in multiple columns [duplicate]

This question already has answers here:
Aggregate / summarize multiple variables per group (e.g. sum, mean)
(10 answers)
Closed 5 years ago.
I have many stations (more than 1000), each with more than 50 years of data, so I set up my df with the stations in the columns and the dates in the rows, as in the example.
Now I need to sum my parameter for each year in each column of data, but I also need to know how many non-NA values went into each sum, so that I get a specific value for each year in each station.
I hope you guys can help me, and sorry about the syntax and language.
year<-c(rep(2000,12),rep(2001,12),rep(2002,12), rep(2003,12))
data <- data.frame( year, month=rep(1:12,4),est1=rnorm(12*4,2,1),est2=rnorm(12*4,2,1),est3=rnorm(12*4,2,1))
data[3,3]<-NA
Sums:
> apply(data[,-(1:2)], 2, tapply, data$year, sum, na.rm=T)
est1 est2 est3
2000 23.46997 21.36984 28.24381
2001 27.32517 28.84098 24.11784
2002 23.41737 25.47548 23.82606
2003 24.63551 24.51148 28.17723
Non NA's:
> apply(!is.na(data[,-(1:2)]), 2, tapply, data$year, sum)
est1 est2 est3
2000 11 12 12
2001 12 12 12
2002 12 12 12
2003 12 12 12
And a version without apply (see @r2evans's comment):
sapply(data[,-(1:2)], tapply, data$year, sum, na.rm=T)
sapply(data.frame(!is.na(data[,3:5])), tapply, data$year, sum)
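As an alternative sketch, aggregate() with a by list returns the year as a regular column instead of row names. The small data set below is made up so that the results are reproducible, and the names value_cols, sums and n_valid are illustrative only.

```r
# Small made-up data set with a few NAs (not the data from the question)
data <- data.frame(
  year  = rep(c(2000, 2001), each = 3),
  month = rep(1:3, 2),
  est1  = c(1, NA, 3, 4, 5, 6),
  est2  = c(2, 2, 2, NA, NA, 1)
)

# Every column except year and month is treated as a station.
value_cols <- setdiff(names(data), c("year", "month"))

# Yearly sums per station, ignoring NAs
sums <- aggregate(data[value_cols], by = list(year = data$year),
                  FUN = sum, na.rm = TRUE)

# Number of non-NA values that went into each sum
n_valid <- aggregate(data[value_cols], by = list(year = data$year),
                     FUN = function(x) sum(!is.na(x)))

print(sums)
print(n_valid)
```

The data-frame interface is used here rather than the formula interface because the formula method's default na.action drops every row that has an NA in any station, which would change the counts.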

R counter, counting frequency in a table [duplicate]

This question already has answers here:
Numbering rows within groups in a data frame
(10 answers)
Add column with order counts
(2 answers)
Closed 6 years ago.
I have following data set
id year
2 20332 2005
3 6383 2005
14 20332 2006
15 6806 2006
16 23100 2006
I would like to have an additional column, which counts the number of years the id variable is already available:
id year Counter
2 20332 2005 1
3 6383 2005 1
14 20332 2006 2
15 6806 2006 1
16 23100 2006 1
The dataset is currently not sorted according to the year. I thought about mutate rather than a function.
Any ideas? Thanks!
We can use ave from base R
df1$Counter <- with(df1, ave(id, id, FUN = seq_along))
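Because seq_along numbers the rows in the order they appear, and the question says the data is not sorted by year, it may be safer to sort first. A minimal sketch, assuming the data frame is called df1 and using the sample rows in a deliberately shuffled order:

```r
# Sample rows from the question, shuffled on purpose
df1 <- data.frame(
  id   = c(20332, 6806, 6383, 23100, 20332),
  year = c(2006, 2006, 2005, 2006, 2005)
)

# Sort by year so the counter follows chronological order within each id.
df1 <- df1[order(df1$year), ]

# seq_along() numbers the rows 1, 2, ... within each id.
df1$Counter <- ave(df1$id, df1$id, FUN = seq_along)
print(df1)
```

Without the sort, the 2006 row of id 20332 would be numbered 1 here, since it comes first in the shuffled input.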

Aggregates by group and including counts across rows [duplicate]

This question already has answers here:
Apply several summary functions (sum, mean, etc.) on several variables by group in one call
(7 answers)
Closed 6 years ago.
I have this data frame:
YEAR NATION VOTE
2015 NOR 1
2015 USA 0
2015 CAN 1
2015 RUS 1
2014 USA 1
2014 USA 1
2014 USA 0
2014 NOR 1
2014 NOR 0
2014 CAN 1
...and it goes on and on with more years, nations and votes. VOTE is binary, yes (1) or no (0). I am trying to code an output table that aggregates on year and nation, and that also gives the total number of votes for each nation together with the number of 1's, like the table sketched below (sumVOTES being the total number of votes for that nation that year, i.e. the count of all 1s and 0s):
YEAR NATION VOTE-1 sumVOTES %-1s
2015 USA 8 17 47.1
2015 NOR 7 13 53.8
2015 CAN 3 11 27.3
2014 etc.
etc.
You are not providing your data.frame in a reproducible manner.
But this should work...
library(data.table)
# assuming 'df' is your data.frame
setDT(df)[, .('VOTE-1'   = sum(VOTE == 1),
              'sumVOTES' = .N,
              '%-1s'     = 1e2 * sum(VOTE == 1) / .N),
          by = .(YEAR, NATION)]
setDT converts data.frame to data.table by reference.
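For comparison, here is a base R sketch of the same summary, built from the ten sample rows in the question. The column names VOTE_1 and pct_1 are substitutes chosen for this example because they are easier to type than 'VOTE-1' and '%-1s'.

```r
# Sample rows from the question
df <- data.frame(
  YEAR   = c(2015, 2015, 2015, 2015, 2014, 2014, 2014, 2014, 2014, 2014),
  NATION = c("NOR", "USA", "CAN", "RUS", "USA", "USA", "USA", "NOR", "NOR", "CAN"),
  VOTE   = c(1, 0, 1, 1, 1, 1, 0, 1, 0, 1)
)

# Number of 1s and total number of votes per year and nation
ones  <- aggregate(VOTE ~ YEAR + NATION, data = df, FUN = sum)
total <- aggregate(VOTE ~ YEAR + NATION, data = df, FUN = length)

# merge() puts the key columns first, then the value column of each input.
out <- merge(ones, total, by = c("YEAR", "NATION"))
names(out) <- c("YEAR", "NATION", "VOTE_1", "sumVOTES")

# Percentage of 1s, rounded to one decimal
out$pct_1 <- round(100 * out$VOTE_1 / out$sumVOTES, 1)
print(out)
```

Summing VOTE counts the 1s only because the column is strictly 0/1; with other codings, sum(VOTE == 1) as in the data.table answer is the safer form.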
