calculate mean for subgroups [duplicate] - r

I want to calculate the month value by calculating the mean of the weeksums per month.
E.g. for July (07) and distance 10 I have the weeksums 1 (2017_28), 6 (2017_29) and 1 (2017_31). I want to summarise these weeks to get the total monthsum 8 and the mean value 2.6667 (8/3).
I got the monthsum but I don't know how to calculate the mean.
df %>%
  group_by(year_month, distance) %>%
  mutate(monthsum = sum(weeksum))
year year_month month year_week distance weeksum
1 2017 2017_05 05 2017_21 15 4
2 2017 2017_05 05 2017_21 10 1
3 2017 2017_05 05 2017_22 5 5
4 2017 2017_05 05 2017_22 0 1
5 2017 2017_06 06 2017_22 0 11
6 2017 2017_06 06 2017_23 20 7
7 2017 2017_06 06 2017_23 0 6
8 2017 2017_07 07 2017_28 10 1
9 2017 2017_07 07 2017_28 0 1
10 2017 2017_07 07 2017_29 10 6
11 2017 2017_07 07 2017_29 5 3
12 2017 2017_07 07 2017_30 0 12
13 2017 2017_07 07 2017_31 10 1
14 2017 2017_07 07 2017_31 0 7
This is what I want:
year year_month month year_week distance monthsum mean
1 2017 2017_05 05 2017_21 15 4 4
2 2017 2017_05 05 2017_21 10 1 1
3 2017 2017_05 05 2017_22 5 5 5
4 2017 2017_05 05 2017_22 0 1 1
5 2017 2017_06 06 2017_22 0 17 8.5
6 2017 2017_06 06 2017_23 20 7 7
7 2017 2017_07 07 2017_28 10 8 2.6667
8 2017 2017_07 07 2017_28 0 20 6.6667
9 2017 2017_07 07 2017_29 5 3 3

First off, I hope you are using dplyr rather than plyr, to be up to date.
Simply extend your statement with a mean() call like this:
df %>%
  group_by(year_month, distance) %>%
  mutate(monthsum = sum(weeksum), monthmean = mean(weeksum))
In your case, use summarize instead of mutate to get one row per group, which gives a clearer view:
df %>%
  group_by(year_month, distance) %>%
  summarize(monthsum = sum(weeksum), monthmean = mean(weeksum))
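To check the result against the numbers in the question, here is a minimal sketch that rebuilds the sample data (only the columns the grouping needs) and runs the summarize version. The name `monthly` and the `.groups = "drop"` argument are additions for illustration and assume dplyr 1.0.0 or later.

```r
library(dplyr)

# Sample data copied from the question (year and month columns omitted).
df <- data.frame(
  year_month = c(rep("2017_05", 4), rep("2017_06", 3), rep("2017_07", 7)),
  year_week  = c("2017_21", "2017_21", "2017_22", "2017_22",
                 "2017_22", "2017_23", "2017_23",
                 "2017_28", "2017_28", "2017_29", "2017_29",
                 "2017_30", "2017_31", "2017_31"),
  distance   = c(15, 10, 5, 0, 0, 20, 0, 10, 0, 10, 5, 0, 10, 0),
  weeksum    = c(4, 1, 5, 1, 11, 7, 6, 1, 1, 6, 3, 12, 1, 7)
)

# One row per month/distance pair with the sum and mean of the weekly sums.
monthly <- df %>%
  group_by(year_month, distance) %>%
  summarize(monthsum = sum(weeksum), monthmean = mean(weeksum), .groups = "drop")

# July, distance 10: weeks 28, 29 and 31 contribute 1 + 6 + 1.
filter(monthly, year_month == "2017_07", distance == 10)
```

For July and distance 10 this should give a monthsum of 8 and a monthmean of 8/3, matching the values asked for.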

Related

Repeating annual values multiple times to form a monthly dataframe

I have an annual dataset as below:
year <- c(2016,2017,2018)
xxx <- c(1,2,3)
yyy <- c(4,5,6)
df <- data.frame(year,xxx,yyy)
print(df)
year xxx yyy
1 2016 1 4
2 2017 2 5
3 2018 3 6
The values in columns xxx and yyy are the values for that year.
I would like to expand this dataframe (or create a new one) so that it keeps the same column names but repeats each row 12 times, once for each month of that year, with the year repeated 12 times in the first column.
The code below mocks up the result I am after:
year <- rep(2016:2018,each=12)
xxx <- rep(1:3,each=12)
yyy <- rep(4:6,each=12)
df2 <- data.frame(year,xxx,yyy)
print(df2)
year xxx yyy
1 2016 1 4
2 2016 1 4
3 2016 1 4
4 2016 1 4
5 2016 1 4
6 2016 1 4
7 2016 1 4
8 2016 1 4
9 2016 1 4
10 2016 1 4
11 2016 1 4
12 2016 1 4
13 2017 2 5
14 2017 2 5
15 2017 2 5
16 2017 2 5
17 2017 2 5
18 2017 2 5
19 2017 2 5
20 2017 2 5
21 2017 2 5
22 2017 2 5
23 2017 2 5
24 2017 2 5
25 2018 3 6
26 2018 3 6
27 2018 3 6
28 2018 3 6
29 2018 3 6
30 2018 3 6
31 2018 3 6
32 2018 3 6
33 2018 3 6
34 2018 3 6
35 2018 3 6
36 2018 3 6
Any help would be greatly appreciated!
I'm new to R and I can see how I would do this with a loop statement but was wondering if there was an easier solution.
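Convert df to a matrix, take the Kronecker product with a vector of 12 ones and then convert back to a data.frame. The Kronecker product drops the column names, so setNames restores them. If a matrix result is ok, the as.data.frame and setNames wrappers can be omitted.
setNames(as.data.frame(as.matrix(df) %x% rep(1, 12)), names(df))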
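As an alternative to the matrix route, a sketch using plain row indexing: repeat each row index 12 times and subset the data frame with it. This is a different technique from the Kronecker product above. It keeps the column names and should also work when some columns are not numeric.

```r
# Annual data from the question.
df <- data.frame(year = c(2016, 2017, 2018),
                 xxx  = c(1, 2, 3),
                 yyy  = c(4, 5, 6))

# Repeat every row index 12 times (1,1,...,1,2,2,...), then subset by it.
df2 <- df[rep(seq_len(nrow(df)), each = 12), ]

# Subsetting leaves row names such as "1.1", "1.2"; reset them to 1..36.
rownames(df2) <- NULL

head(df2, 13)
```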

How to find maximum value from dataframe with specific condition? [duplicate]

I have a dataframe named employee with 100 rows like this :
Date Name ride food income bonus sallary
1 01 Jan 2020 Ludociel 10 6 330000 0 330000
2 01 Jan 2020 Estarossa 15 8 465000 100000 565000
3 01 Jan 2020 Tarmiel 8 10 420000 100000 520000
4 01 Jan 2020 Sariel 5 8 315000 0 315000
5 01 Jan 2020 Escanor 15 7 435000 100000 535000
6 01 Jan 2020 Ban 13 9 465000 100000 565000
7 01 Jan 2020 Meliodas 6 15 540000 100000 640000
8 01 Jan 2020 King 15 12 585000 100000 685000
9 01 Jan 2020 Zeldris 15 11 555000 100000 655000
10 01 Jan 2020 Rugal 15 6 405000 100000 505000
11 02 Jan 2020 Ludociel 14 6 390000 100000 490000
12 02 Jan 2020 Estarossa 12 14 600000 100000 700000
...
100 10 Jan 2020 Rugal 13 10 495000 100000 595000
I want to find which employee has the highest total sallary from 1 Jan to 10 Jan. My expected output is just a vector like this:
[1] "varName" is the highest with total sallary "varTotal_sallary"
I have tried a for loop with an if clause, but it only returns the total for one name, so every name would need its own function.
function_ludociel<-function(name, date, sallary){
  total=integer()
  for(i in 100){
    if(date[i]=="01 Jan 2020" & name[i]=="Ludociel"){
      total=sum(sallary)
    }
  }
  return(total)
}
ludociel=function_ludociel(employee$name,employee$date,employee$sallary)
After that I planned to combine the results in one variable and use max(), but I know that is a silly way to code it.
Does anyone have a solution for this? Thank you very much.
1. Convert Date to an actual Date class.
2. Use aggregate to calculate the total salary per name from 1st Jan to 10th Jan.
3. Select the row with the maximum total salary.
4. Print the result.
employee$Date <- as.Date(employee$Date, '%d %b %Y')
sub_data <- aggregate(sallary~Name, employee,
                      subset = Date >= as.Date('2020-01-01') &
                               Date <= as.Date('2020-01-10'), sum)
max_data <- sub_data[which.max(sub_data$sallary), ]
sprintf('%s has the highest salary %d', max_data$Name, max_data$sallary)
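The same steps can be tried on a cut-down data frame. The sketch below uses only four rows taken from the question (two employees, two days) and builds Date from ISO strings so that it does not depend on the locale's month abbreviations. The names `in_range`, `totals`, `top` and `msg` are made up for illustration.

```r
# Four rows from the question's table; Date is already a Date here.
employee <- data.frame(
  Date    = as.Date(c("2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02")),
  Name    = c("Ludociel", "Estarossa", "Ludociel", "Estarossa"),
  sallary = c(330000, 565000, 490000, 700000)
)

# Keep only rows inside the requested date range.
in_range <- employee$Date >= as.Date("2020-01-01") &
            employee$Date <= as.Date("2020-01-10")

# Total sallary per employee, then the row with the largest total.
totals <- aggregate(sallary ~ Name, data = employee[in_range, ], FUN = sum)
top    <- totals[which.max(totals$sallary), ]

msg <- sprintf("%s is the highest with total sallary %.0f",
               as.character(top$Name), top$sallary)
print(msg)
```

With these four rows Estarossa has the larger total (565000 + 700000). Note that which.max returns only the first maximum, so ties would need separate handling.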

Summarizing percentage by subgroups

I want to summarize the distance categories and get the percentage for each distance per month. In my table each week adds up to 100%, and now I want to calculate the same for the month, using the weekly percentages.
Something like sum(percent) / number of weeks in the month.
This is what I have:
year month year_week distance object_remarks weeksum percent
1 2017 05 2017_21 15 ctenolabrus_rupestris 3 0.75
2 2017 05 2017_21 10 ctenolabrus_rupestris 1 0.25
3 2017 05 2017_22 5 ctenolabrus_rupestris 5 0.833
4 2017 05 2017_22 0 ctenolabrus_rupestris 1 0.167
5 2017 06 2017_22 0 ctenolabrus_rupestris 9 1
6 2017 06 2017_23 20 ctenolabrus_rupestris 6 0.545
7 2017 06 2017_23 0 ctenolabrus_rupestris 5 0.455
I want to have an output like this:
year month distance object_remarks weeksum percent percent_month
1 2017 05 15 ctenolabrus_rupestris 3 0.75 0.375
2 2017 05 10 ctenolabrus_rupestris 1 0.25 0.125
3 2017 05 5 ctenolabrus_rupestris 5 0.833 0.4165
4 2017 05 0 ctenolabrus_rupestris 1 0.167 0.0835
5 2017 06 0 ctenolabrus_rupestris 14 1.455 0.7275
6 2017 06 20 ctenolabrus_rupestris 6 0.545 0.2725
Thanks a lot!
You may need to use group_by() twice.
df %>%
  select(-year_week) %>%
  group_by(month, distance) %>%
  mutate(percent = sum(percent), weeksum = sum(weeksum)) %>%
  distinct() %>%
  group_by(month) %>%
  mutate(percent_month = percent/sum(percent))
# A tibble: 6 x 7
# Groups: month [2]
# year month distance object_remarks weeksum percent percent_month
# <int> <int> <int> <chr> <int> <dbl> <dbl>
# 1 2017 5 15 ctenolabrus_rupestris 3 0.75 0.375
# 2 2017 5 10 ctenolabrus_rupestris 1 0.25 0.125
# 3 2017 5 5 ctenolabrus_rupestris 5 0.833 0.416
# 4 2017 5 0 ctenolabrus_rupestris 1 0.167 0.0835
# 5 2017 6 0 ctenolabrus_rupestris 14 1.46 0.728
# 6 2017 6 20 ctenolabrus_rupestris 6 0.545 0.272
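The answer divides by sum(percent) within each month, which equals the number of weeks only because every week adds up to 1. A sketch that divides by the week count explicitly, as the question phrases it, could look like this. The helper column `n_weeks` and the name `monthly` are assumptions for illustration, and `.groups = "drop"` assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Sample data from the question (object_remarks omitted, it is constant).
df <- data.frame(
  year      = 2017,
  month     = c(5, 5, 5, 5, 6, 6, 6),
  year_week = c("2017_21", "2017_21", "2017_22", "2017_22",
                "2017_22", "2017_23", "2017_23"),
  distance  = c(15, 10, 5, 0, 0, 20, 0),
  weeksum   = c(3, 1, 5, 1, 9, 6, 5),
  percent   = c(0.75, 0.25, 0.833, 0.167, 1, 0.545, 0.455)
)

monthly <- df %>%
  group_by(month) %>%
  mutate(n_weeks = n_distinct(year_week)) %>%   # weeks observed in this month
  group_by(year, month, distance, n_weeks) %>%
  summarise(weeksum = sum(weeksum), percent = sum(percent), .groups = "drop") %>%
  mutate(percent_month = percent / n_weeks) %>% # sum(percent) / number of weeks
  select(-n_weeks)

monthly
```

Both months contain two distinct weeks here, so each monthly percentage is the summed weekly percentage divided by 2. Unlike the mutate and distinct approach, summarise returns the rows sorted by the grouping columns.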

replace NA with previous 2 years values

I have two data frames. df1 has NA values in p1 that need to be replaced with the mean of Average_f1 from df2 over the previous 2 years for the same bin.
E.g. in df1, row 5 has year 2015 and bin 5, so its NA should be replaced with the mean of the df2 values for bin 5 in 2013 and 2014. For row 7 only one previous year has a value, so that value is used.
df1
year p1 bin
2013 20 1
2013 24 1
2014 10 2
2014 11 2
2015 NA 5
2016 10 3
2017 NA 7
df2
year bin_p1 Average_f1
2013 5 29.5
2014 5 16.5
2015 NA 30
2016 7 12
output
df1
year p1 bin
2013 20 1
2013 24 1
2014 10 2
2014 11 2
2015 **23** 5
2016 10 3
2017 **12** 7
Thanks in advance
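One possible way to do this in base R, as a sketch rather than a definitive answer: for every NA row in df1, look up the df2 rows with the same bin and a year one or two years earlier, and take the mean of whatever is found. The helper `fill_value` and the variable `na_rows` are hypothetical names, and the code assumes that df2's bin_p1 column is what should be matched against df1's bin.

```r
df1 <- data.frame(year = c(2013, 2013, 2014, 2014, 2015, 2016, 2017),
                  p1   = c(20, 24, 10, 11, NA, 10, NA),
                  bin  = c(1, 1, 2, 2, 5, 3, 7))
df2 <- data.frame(year       = c(2013, 2014, 2015, 2016),
                  bin_p1     = c(5, 5, NA, 7),
                  Average_f1 = c(29.5, 16.5, 30, 12))

# Mean of Average_f1 for the same bin over the two years before row i.
fill_value <- function(i) {
  prev_years <- df1$year[i] - 1:2
  vals <- df2$Average_f1[df2$bin_p1 %in% df1$bin[i] & df2$year %in% prev_years]
  if (length(vals) == 0) NA_real_ else mean(vals)  # stay NA if nothing matches
}

na_rows <- which(is.na(df1$p1))
df1$p1[na_rows] <- vapply(na_rows, fill_value, numeric(1))

df1
```

For row 5 this averages 29.5 and 16.5, and for row 7 it uses the single 2016 value, which matches the expected output in the question.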

Manipulating csv spreadsheet using R

I want to write a basic loop that does the following:
Import a spreadsheet as a data frame.
Scan each variable in the header for missing data points ("NA") and remove all data for that calendar month for that variable, i.e.:
Here var 'X' has 'NA' in the second January row, so I want to remove all January values of 'X'.
X Y Z
jan 3 3 3
jan NA 4 5
jan 2 6 2
feb 1 8 NA
feb 4 2 3
feb 9 4 1
mar 5 NA 5
mar 8 7 4
mar 9 7 5
Create new data frames that look like:
X
feb 1
feb 4
feb 9
mar 5
mar 8
mar 9
Y
jan 3
jan 4
jan 6
feb 8
feb 2
feb 4
Z
jan 3
jan 5
jan 2
mar 5
mar 4
mar 5
Save the remaining 'complete months' (in this case 'X' feb-mar, 'Y' jan-feb, 'Z' jan & mar) in new data frames to export as new .csv files.
Any help getting started would be huge. If this has already been asked, please direct me to the source; I wasn't sure exactly how to search for this.
Try:
ddf2 = ddf[,c(1,2)]
xdf = ddf[!ddf$month %in% ddf2$month[is.na(ddf2$X)], c(1,2)]
xdf
month X
4 feb 1
5 feb 4
6 feb 9
7 mar 5
8 mar 8
9 mar 9
ddf2 = ddf[,c(1,3)]
ydf = ddf[!ddf$month %in% ddf2$month[is.na(ddf2[,2])], c(1,3)]
ydf
month Y
1 jan 3
2 jan 4
3 jan 6
4 feb 8
5 feb 2
6 feb 4
ddf2 = ddf[,c(1,4)]
zdf = ddf[!ddf$month %in% ddf2$month[is.na(ddf2[,2])], c(1,4)]
zdf
month Z
1 jan 3
2 jan 5
3 jan 2
7 mar 5
8 mar 4
9 mar 5
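The three near-identical blocks above can be folded into one loop over the variables. The sketch below rebuilds the example data inline instead of importing it with read.csv, and writes the results to a temporary directory. The names `vars`, `complete_months`, `out_file` and the "_complete.csv" file suffix are assumptions for illustration.

```r
# Example data from the question; in practice this would come from read.csv().
ddf <- data.frame(
  month = rep(c("jan", "feb", "mar"), each = 3),
  X = c(3, NA, 2, 1, 4, 9, 5, 8, 9),
  Y = c(3, 4, 6, 8, 2, 4, NA, 7, 7),
  Z = c(3, 5, 2, NA, 3, 1, 5, 4, 5)
)

vars <- c("X", "Y", "Z")

# For each variable, drop every month that contains at least one NA.
complete_months <- setNames(lapply(vars, function(v) {
  bad_months <- unique(ddf$month[is.na(ddf[[v]])])
  ddf[!ddf$month %in% bad_months, c("month", v)]
}), vars)

# Export one .csv file per variable.
for (v in vars) {
  out_file <- file.path(tempdir(), paste0(v, "_complete.csv"))
  write.csv(complete_months[[v]], out_file, row.names = FALSE)
}

complete_months$X
```

Using %in% rather than != means a variable with NAs in several months is handled as well, since every affected month is removed.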
