Select lowest values which sum up to 10% of total - R

I'm new to this place and I'm not super experienced with R, but I need it at work and I really hope you can support me.
I have a huge data set, but I will explain the issue using a small sample.
I have already grouped my data set to achieve the layout I want.
Basically, I have multiple EXCPosOutlet and EXCMonth names, and I need to remove the lowest values per EXCPosOutlet per EXCMonth that sum up to 10% of the total for that individual group.
So let's say the total of AveragePrice for a sampleName for month 612 is $1000; I need to remove all rows with the lowest values of AveragePrice that sum up to $100.
If removing is messy, even creating an extra column (mutate), using ifelse for example, that would just tell me whether a row falls under my criteria would be totally enough.
I have tried the ntile and quantile functions, but I'm not getting what I need.
Thank you so much in advance.
Let me know if I should provide more details.

One possibility is to use the dplyr package and, for legibility, the pipe operator %>%. There are other ways to reach the same result, but you might want to give this a try:
library(dplyr)
## generate example data:
data.frame(
  EXCPosOutlet = gl(3, 12),
  AveragePrice = runif(36) * 100
) %>%
  ## sort dataframe by outlet and (increasing) price:
  arrange(EXCPosOutlet, AveragePrice) %>%
  ## group by outlet:
  group_by(EXCPosOutlet) %>%
  ## calculate the cumulative price per outlet:
  mutate(cumAveragePrice = cumsum(AveragePrice)) %>%
  ## keep rows which, per outlet, total less than the threshold of $100:
  filter(cumAveragePrice <= 100)
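If the cutoff should be 10% of each group's own total rather than a fixed $100, the threshold can be computed inside the grouped mutate. A minimal sketch along the same lines, which also covers the fallback request of flagging rows instead of removing them (belowCutoff is just an illustrative column name):
library(dplyr)
data.frame(
  EXCPosOutlet = gl(3, 12),
  AveragePrice = runif(36) * 100
) %>%
  arrange(EXCPosOutlet, AveragePrice) %>%
  group_by(EXCPosOutlet) %>%
  ## sum(AveragePrice) inside a grouped mutate is the group total,
  ## so the flag marks the lowest rows adding up to at most 10% of it:
  mutate(
    cumAveragePrice = cumsum(AveragePrice),
    belowCutoff     = cumAveragePrice <= 0.10 * sum(AveragePrice)
  ) %>%
  ungroup()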

Related

3-way Contingency Table R: How to get marginal sum, percentages per group

I have been trying to create a contingency table in R with the percentage distribution of education for husbands (6 categories) and wives (6 categories) BY marriage cohort (4 cohorts in total). My ideal is something like this: IdealTable.
However, what I have been able to get at most is: CurrentTable.
I am not able to figure out how to convert my row and column sums to percentages (similar to the ideal). The current code that I am using is:
three.table = addmargins(xtabs(~MarriageCohort + HerEdu + HisEdu, data = mydata))
ftable(three.table)
Is there a way I can turn the row and column sums into percentages for each marriage cohort?
How can I add labels to this and export the ftable?
I am relatively new to R and tried to find solutions to the questions above on Google, but haven't been successful. I am posting my query on this platform for the first time, and any help will be greatly appreciated. Thank you!
One approach would be to create separate xtabs runs for each MarriageCohort:
Cohorts <- lapply( split(mydata, mydata["MarriageCohort"]),
                   function(z) xtabs( ~HerEdu + HisEdu, data = z) )
Then get the total of each Cohorts item, divide each cohort's addmargins(.) result by that total, and multiply by 100 to get percent values:
divCohorts <- lapply(Cohorts, function(tbl) 100*addmargins(tbl)/sum(tbl) )
Then you will need to clean those items up to your liking. You have not included data, so the cleanup remains your responsibility. (I did not use sapply because that could give you a big matrix that might be difficult to manage, but you could try it and see if you are satisfied with that approach in the second step.)
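Since the question includes no data, here is a quick illustration on mock data; everything other than the column names from the question is made up. For the export question, stats::write.ftable can write an ftable to a text file:
## mock data, purely for illustration:
set.seed(42)
mydata <- data.frame(
  MarriageCohort = factor(sample(1:4, 200, replace = TRUE)),
  HerEdu = factor(sample(1:6, 200, replace = TRUE)),
  HisEdu = factor(sample(1:6, 200, replace = TRUE))
)
Cohorts <- lapply(split(mydata, mydata["MarriageCohort"]),
                  function(z) xtabs(~HerEdu + HisEdu, data = z))
divCohorts <- lapply(Cohorts, function(tbl) 100 * addmargins(tbl) / sum(tbl))
round(divCohorts[["1"]], 1)   # percentages for the first cohort
## exporting the original three-way ftable to a text file:
three.table <- addmargins(xtabs(~MarriageCohort + HerEdu + HisEdu, data = mydata))
write.ftable(ftable(three.table), file = "three_way_table.txt")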

How to eloquently calculate the mean of means using group by function in R?

I am trying to group by year and then calculate the average of means, but I don't know the best way to do it, and the way I tried gives me an error.
First I calculate how many rows per year the table has:
avg_awarded_moves_year <- imdb_globes %>% group_by(year_film) %>%
tally()
And then I use the transmute function to add the average per year to the table.
avg_awarded_moves_year <- imdb_globes %>% group_by(year_film) %>%
transmute(average_per_year =
sum(averageRating)/avg_awarded_moves_year$n)
The error I encounter: Error: Column "average_per_year" must be length 12 (the group size) or one, not 76
I can bet that there is a faster and more elegant way to do this. I also tried dividing the sum by n(), but that didn't work either. I don't want to use the mean function because the sample already consists of means.
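A minimal sketch of one way to avoid the length mismatch: compute everything inside a single grouped summarise, where n() is available. The data frame below is a mock stand-in for the question's imdb_globes, for illustration only:
library(dplyr)
## mock stand-in for imdb_globes:
imdb_globes <- data.frame(
  year_film     = rep(c(2019, 2020), each = 3),
  averageRating = c(7.1, 6.8, 8.0, 7.5, 6.9, 7.2)
)
avg_awarded_moves_year <- imdb_globes %>%
  group_by(year_film) %>%
  ## sum()/n() inside summarise returns one row per year;
  ## transmute() kept one row per input row, hence the length error:
  summarise(average_per_year = sum(averageRating) / n())
avg_awarded_moves_year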

Apply if function to identify the value of variable based on the value of another variable

I am trying to identify the value of a variable in an R data frame conditional on the value of another variable, but I am unable to do it.
Basically, I am giving 3 different doses of vaccine to three groups of animals (5 animals per group, Total) and recording the results as Protec, the number of protected animals in each group. From Protec, I calculate the proportion of protection (Protec/Total) as Prop for each Dose group. For example:
library(dplyr)
Dose <- c(1, 0.25, 0.0625)   # Dose or Dilution
Protec <- c(5, 4, 3)
Total <- rep(5, 3)
df <- data.frame(Dose, Protec, Total)
df <- df %>% mutate(Prop = Protec / Total)
df
The question is: what is the log10 of the minimum value of Dose for which Prop == 1? That can be found using the following code:
X0 <- log10(min(df$Dose[df$Prop == 1.0])); X0
The result should be X0 = 0.
If Protec = c(5, 5, 3), Prop becomes c(1.0, 1.0, 0.6), and then X0 should be -0.60206.
If Protec = c(5, 5, 5), Prop becomes c(1.0, 1.0, 1.0), for which I want X0 = 0.
If Protec = c(5, 4, 5), Prop becomes c(1.0, 0.8, 1.0); then I also want X0 = 0, because I consider the values unordered and take the highest dose for calculating X0.
I think this requires an if function, but I don't know how to write the conditions.
Can someone explain how to do this in R? Thanking you in advance.
We can use mutate_at to apply the calculation to multiple columns whose names start with 'Protec':
library(dplyr)
df1 <- df %>%
mutate_at(vars(starts_with("Protec")), list(Prop = ~./Total))
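The answer above builds the Prop columns, but the conditional X0 itself still needs the rules from the question. Here is a rough sketch of one way to encode them; calc_X0 is a hypothetical helper, and the "unordered" test simply checks whether Prop stops decreasing as the dose decreases:
## hypothetical helper implementing the X0 rules described in the question:
calc_X0 <- function(Dose, Prop) {
  ord <- order(Dose, decreasing = TRUE)   # highest dose first
  d <- Dose[ord]
  p <- Prop[ord]
  if (all(p == 1) || is.unsorted(rev(p))) {
    ## all groups fully protected, or Prop is "unordered":
    ## fall back to the highest dose
    log10(max(d))
  } else {
    ## otherwise, the smallest dose that still gives full protection
    log10(min(d[p == 1]))
  }
}
df <- data.frame(Dose = c(1, 0.25, 0.0625), Protec = c(5, 5, 3), Total = 5)
df$Prop <- df$Protec / df$Total
calc_X0(df$Dose, df$Prop)   # -0.60206, matching the second example above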

Summarizing Data across age groups in R

I have data for customer purchases across different products. I calculated amount_spent by multiplying Item Numbers by the respective Price.
I used the cut function to segregate people into different age bins. Now how can I find the aggregate amount spent by each age group, i.e. the contribution of each age group in terms of dollars spent?
Please let me know if you need any more info.
I am really sorry that I can't paste the data here due to remote desktop constraints. I am actually concerned with the result I got from the summarise step:
library(dplyr)
customer_transaction %>%
  group_by(age_gr) %>%
  summarise(amount_spent = sum(amount_spent))
Though I am not sure whether you want the contribution to the whole pie or just the sum within each age group; a sketch of the percentage version follows below.
If your data is of class data.table you could go with
customer_transaction[,sum(amount_spent),by=age_gr]
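If it is the contribution to the whole pie that matters, one more mutate on top of the grouped sums gives each group's percentage share. The table and column names are taken from the question; the data below is a mock stand-in, for illustration only:
library(dplyr)
## mock stand-in for customer_transaction:
customer_transaction <- data.frame(
  age_gr       = c("18-25", "18-25", "26-35", "36-45"),
  amount_spent = c(20, 35, 80, 65)
)
customer_transaction %>%
  group_by(age_gr) %>%
  summarise(amount_spent = sum(amount_spent)) %>%
  ## each group's share of the overall total, as a percentage:
  mutate(pct_of_total = 100 * amount_spent / sum(amount_spent))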

How to group data to minimize the variance while preserving the order of the data in R

I have a data frame (760 rows) with two columns, named Price and Size. I would like to put the data into 4 or 5 groups based on Price so as to minimize the variance within each group while preserving the order of Size (which is ascending). Jenks natural breaks optimization would be an ideal function, but it does not take the order of Size into consideration.
Basically, I have data similar to the following (but with more rows):
Price=c(90,100,125,100,130,182,125,250,300,95)
Size=c(10,10,10.5,11,11,11,12,12,12,12.5)
mydata=data.frame(Size,Price)
I would like to group the data to minimize the variance of Price within each group while respecting 1) the Size values: for example, the first two prices, 90 and 100, cannot be in different groups, since they have the same Size; and 2) the order of Size: for example, if group one includes observations 1-2 and group two includes observations 3-9, observation 10 can only go into group two or three.
Can someone please give me some advice? Maybe there is already such a function that I just can't find?
Is this what you are looking for? With the dplyr package, grouping is quite easy. The %>% can be read as "then do", so you can chain multiple actions if you like.
See http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html for further information.
library("dplyr")
Price <– c(90,100,125,100,130,182,125,250,300,95)
Size <- c(10,10,10.5,11,11,11,12,12,12,12.5)
mydata <- data.frame(Size,Price) %>% # "then"
group_by(Size) # group data by Size column
mydata_mean_sd <- mydata %>% # "then"
summarise(mean = mean(Price), sd = sd(Price)) # calculate grouped
#mean and sd for illustration
I had a similar problem with optimally splitting a day into 4 "load blocks"; adjacent time periods must stick together, of course.
It is not an elegant solution, but I wrote my own function that first splits a sorted series at specified break points, then calculates the sum(SDCM) for those break points (using the algorithm underlying the Jenks approach described on Wikipedia).
I then iterated through all valid combinations of break points and selected the set that produced the minimum sum(SDCM); a sketch of this idea is below.
This quickly becomes unmanageable as the number of possible break-point combinations increases, but it worked for my data set.
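For what it's worth, here is a bare-bones sketch of that brute-force idea applied to the question's example. Break points are only allowed where Size changes, which keeps equal sizes together and preserves the ordering; k and the helper name sdcm are made up for this sketch:
Price <- c(90, 100, 125, 100, 130, 182, 125, 250, 300, 95)
Size <- c(10, 10, 10.5, 11, 11, 11, 12, 12, 12, 12.5)
mydata <- data.frame(Size, Price)

k <- 4                                  # desired number of groups
cand <- which(diff(mydata$Size) != 0)   # valid break positions: where Size changes

## sum of squared deviations from class means (SDCM) for a grouping:
sdcm <- function(groups) {
  sum(tapply(mydata$Price, groups, function(x) sum((x - mean(x))^2)))
}

combos <- combn(cand, k - 1)            # all valid sets of k - 1 break points
best_val <- Inf
best <- NULL
for (i in seq_len(ncol(combos))) {
  breaks <- combos[, i]
  ## rows 1..breaks[1] form group 1, breaks[1]+1..breaks[2] group 2, etc.:
  groups <- findInterval(seq_len(nrow(mydata)), breaks + 1) + 1
  val <- sdcm(groups)
  if (val < best_val) {
    best_val <- val
    best <- groups
  }
}
mydata$group <- best
mydata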
