Summarizing Data across age groups in R - r

I have data on customer purchases across different products. I calculated amount_spent by multiplying Item Numbers by the respective Price.
I used the cut function to segregate people into different age bins. Now, how can I find the aggregate amount spent by each age group, i.e. the contribution of each age group in terms of dollars spent?
Please let me know if you need any more info.
I am really sorry that I can't paste the data here due to remote desktop constraints. I am mainly concerned with the result I got after the summarize function.

library(dplyr)
## summarise_each() and funs() are deprecated; summarise() gives the same per-group sum
customer_transaction %>%
  group_by(age_gr) %>%
  summarise(amount_spent = sum(amount_spent))
Though I am not sure if you want the contribution to the whole pie or just the sum in each age group.
If your data is of class data.table you could go with
customer_transaction[,sum(amount_spent),by=age_gr]
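If the contribution to the whole pie is what you are after, a minimal sketch could look like the following. The data frame, the age values and the cut breaks are made up for illustration; only the column names amount_spent and age_gr come from the question.

```r
library(dplyr)

# Hypothetical sample data (the real data was not posted)
customer_transaction <- data.frame(
  age          = c(22, 27, 35, 41, 58, 63),
  amount_spent = c(10, 30, 20, 40, 60, 40)
)

# Assumed age bins; adjust the breaks to your own grouping
customer_transaction$age_gr <- cut(customer_transaction$age,
                                   breaks = c(18, 30, 50, 70))

by_age <- customer_transaction %>%
  group_by(age_gr) %>%
  summarise(amount_spent = sum(amount_spent)) %>%
  # after summarise() the result is no longer grouped,
  # so sum(amount_spent) here is the grand total
  mutate(share = amount_spent / sum(amount_spent))

by_age
```

The share column then holds each age group's fraction of the total dollars spent.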

Related

select lowest values which sum up to 10% of total

I'm new to this place and not very experienced with R, but I need it at work and I really hope you can help me.
I have a huge data set, but I will explain the issue using a small sample.
I have already grouped my data set to get the layout I want.
Basically, I have multiple EXCPosOutlet and EXCPPMonth names, and I need to remove the lowest values per EXCPosOutlet per EXCMonth which sum up to 10% of the total for that individual group.
So let's say that the total of AveragePrice for a sampleName for month 612 is $1000. I need to remove all rows with the lowest values of AveragePrice which sum up to $100.
If removing is messy, even creating an extra column (mutate) using ifelse, for example, which would just tell me whether a row falls under my criteria, would be totally enough.
I have tried the ntile and quantile functions, but I'm not getting what I need.
Thank you so much in advance.
Let me know if I should provide more details.
One possibility is to use the dplyr package and, for legibility, the pipe operator %>%. There are other ways to get the same result, but you might want to give it a try:
library(dplyr)
## generate example data:
data.frame(
  EXCPosOutlet = gl(3, 12),
  AveragePrice = runif(36) * 100
) %>%
  ## sort dataframe by outlet and (increasing) price:
  arrange(EXCPosOutlet, AveragePrice) %>%
  ## group by outlet:
  group_by(EXCPosOutlet) %>%
  ## calculate cumulative price:
  mutate(cumAveragePrice = cumsum(AveragePrice)) %>%
  ## keep the lowest-priced rows which, per outlet, total no more than the threshold of $100:
  filter(cumAveragePrice <= 100)
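The fixed $100 cutoff above matches the question's example of a $1000 total. A minimal sketch of the relative version, where the cutoff is 10% of each group's own total and the lowest rows are flagged and then dropped, might look like this. The column in_lowest_10pct and the objects prices, flagged and trimmed are names made up for illustration, and the example groups by outlet only; a month column could be added to group_by in the same way.

```r
library(dplyr)

set.seed(1)  # reproducible illustrative data
prices <- data.frame(
  EXCPosOutlet = gl(3, 12),
  AveragePrice = runif(36) * 100
)

flagged <- prices %>%
  arrange(EXCPosOutlet, AveragePrice) %>%
  group_by(EXCPosOutlet) %>%
  mutate(
    cumAveragePrice = cumsum(AveragePrice),
    # TRUE for the lowest rows whose running total stays within
    # 10% of this outlet's total (assumed reading of the question)
    in_lowest_10pct = cumAveragePrice <= 0.10 * sum(AveragePrice)
  ) %>%
  ungroup()

# Drop the flagged rows, keeping everything else
trimmed <- filter(flagged, !in_lowest_10pct)
```

Keeping the flag as its own column also covers the "extra column via mutate" variant mentioned in the question.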

Drop all rows besides the largest number per observation in R [duplicate]

I am trying to merge two datasets for my senior thesis on corporate political activity. One shows all of the data I have on each company, which is made up of several previously merged datasets, and the other shows the year, the company's ticker, and a variable called "dirnbr". "dirnbr" shows how many people were on the board in a given year, except it is showing it like this:
Basically, it is creating several entries per year, one for each person on the board, going from 1 to the total number on the board (which is the only number I really care about). I just want my dataset to show total number of people on the board in a given year, year, and ticker. This would then allow me to merge them using an inner_join command and then see what percentage of people on a board of directors in a given year were formerly involved in politics. (I have that information in my larger dataset).
Basically, I would like to drop every observation besides the largest "dirnbr" entry per year and ticker. Is there a way to do this, or to achieve the same result in another way?
Please let me know, any help is very appreciated.
You could use
library(dplyr)
df %>%
  group_by(ticker, year) %>%
  filter(dirnbr == max(dirnbr))
or
df %>%
  group_by(ticker, year) %>%
  slice_max(dirnbr)
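Since only the board size per ticker and year is needed for the merge, another option is to collapse straight to one row per group with summarise and then join. This is only a sketch: the data frames board and companies, and the columns board_size, former_politicians and share_political, are hypothetical stand-ins for the real data.

```r
library(dplyr)

# Hypothetical board data: one row per director, numbered 1..n per ticker/year
board <- data.frame(
  ticker = c("AAA", "AAA", "AAA", "AAA", "AAA", "BBB", "BBB"),
  year   = c(2010, 2010, 2010, 2011, 2011, 2010, 2010),
  dirnbr = c(1, 2, 3, 1, 2, 1, 2)
)

# One row per ticker/year, holding the largest dirnbr (the board size)
board_size <- board %>%
  group_by(ticker, year) %>%
  summarise(board_size = max(dirnbr), .groups = "drop")

# Hypothetical company-level data with a count of former politicians
companies <- data.frame(
  ticker             = c("AAA", "AAA", "BBB"),
  year               = c(2010, 2011, 2010),
  former_politicians = c(1, 0, 1)
)

merged <- inner_join(companies, board_size, by = c("ticker", "year")) %>%
  mutate(share_political = former_politicians / board_size)
```

Unlike filter or slice_max, summarise keeps only the grouping columns and the new summary column, which is usually what you want before a join.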

Trouble with summing specific rows and columns

I have a problem and was wondering if there was a code that would let me solve my problem faster than doing it manually.
So for my example, I have 100 different funds with numerous securities in the fund. Within each fund, I have the Name of each type of security in the fund, the Date which shows the given quarter, the State where the security is issued, and the Weighting of each security of the total fund. The Name is not important, just the State from where it was issued is.
I was wondering if there was a way that would allow me to add up the Weighting from each different fund based on the specific State I want for each quarter. So let's say from Fund1, I need the sum of the Weighting just for the states SC and AZ in 16-1Q. The sum would be (.18 + .001). I do not need to include the weighting for KS because I am not interested in that specific state. I would only be interested in the states SC and AZ for every FundId. However, in my real problem I am interested in ~30 states. I would then do the same task for Fund1 for 16-2Q and so on until 17-4Q. My end goal is to find the sum of every portfolio weighting for the states I'm interested in and see how it changes over time. I can do this manually for each fund, but is there a way to automatically sum up the Weighting for each FundId based on the State I want and for each Date (16-1Q, 16-2Q, etc.)?
In the end I would like a table such as:
(.XX) is the sum of portfolio weight
Example of Data
The Example of Data link you sent has a much better data format than the "XX is the sum of portfolio weight" example; only in Excel would you prefer that other kind of format.
So, using the example data frame, do this operation:
library(dplyr)
example_data <- example_data %>%
  group_by(Fund_Id) %>%
  summarize(sum = sum(Weighting))
We can use aggregate in base R
aggregate(Weighting ~ Fund_id, example_data, sum)
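To restrict the sum to the states of interest and get one total per fund and quarter, as the question asks, a sketch along these lines may help. The column names Fund_Id, Date, State and Weighting are assumed from the question's description, and the sample rows (including the KS weighting and Fund2) are invented.

```r
library(dplyr)

# Hypothetical data; only the .18 and .001 values come from the question
example_data <- data.frame(
  Fund_Id   = c("Fund1", "Fund1", "Fund1", "Fund1", "Fund2"),
  Date      = c("16-1Q", "16-1Q", "16-1Q", "16-2Q", "16-1Q"),
  State     = c("SC", "AZ", "KS", "SC", "AZ"),
  Weighting = c(0.18, 0.001, 0.05, 0.20, 0.30)
)

# Extend this vector to the ~30 states you care about
states_of_interest <- c("SC", "AZ")

state_weights <- example_data %>%
  filter(State %in% states_of_interest) %>%   # drop states such as KS
  group_by(Fund_Id, Date) %>%
  summarise(Weighting = sum(Weighting), .groups = "drop")

state_weights
```

The result is in long format, one row per fund and quarter, which is convenient for tracking how the weights change over time.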

summarizing multiple variables in group by

I need to find the mean of a variable and the number of times a particular combination occurs for that mean value in r.
In the example I have grouped by variables cli, cus and ron and need to summarize to find the mean of age and frequency of cash for this combination:
df%>% group_by(.dots=c("cli","cus","ron")) %>% summarise_all(mean(age),length(cash))
This doesn't work; is there another way out?
Maybe it is just me, as I seem to have overcomplicated this one. Plain summarise gets me what I needed:
df %>% group_by(cli, cus, ron) %>% summarise(mean(age), length(cash))
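For reference, here is a small self-contained sketch of the same idea with named output columns. The sample values and the output names mean_age and n_cash are made up; n() counts the rows in each group, which gives the same number as length(cash).

```r
library(dplyr)

# Hypothetical data using the column names from the question
df <- data.frame(
  cli  = c("a", "a", "a", "b"),
  cus  = c("x", "x", "x", "y"),
  ron  = c(1, 1, 1, 2),
  age  = c(20, 30, 40, 50),
  cash = c(5, 5, 7, 9)
)

result <- df %>%
  group_by(cli, cus, ron) %>%
  summarise(
    mean_age = mean(age),  # mean of age per combination
    n_cash   = n(),        # number of rows per combination
    .groups  = "drop"
  )

result
```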

How to aggregate count data into a specific geographic location

I have a dataset called 'model_data', in which the unit of observation is a geographic cell (gid) taken from the UCDP PRIO-GRID data. This is simply a standardised spatial grid structure that allows for finely-grained analysis at a very local level. I am researching the effect of power balance between actors in civil wars on their use of violence against civilians i.e. if actors perform well (operationalised as inflicting a majority of the battle deaths in any one gid) will they target more or less civilians in the same gid. To this end, I have merged my dataset using an inner_join (by gid) with a dataset containing all individual incidents of armed violence (UCDP Georeferenced Events Dataset).
When I merge, the resulting dataset consists of duplicate gid observations for each individual incident of violence from the GED dataset. I need to find a way of aggregating all civilian deaths, all side_a deaths, and all side_b deaths in each specific gid, so that each observation in the dataset is a unique gid with all data on the various types of deaths from that gid.
model_data <- inner_join(grid, ged, by = c("year", "gid" = "priogrid_gid", "xcoord" = "longitude", "ycoord" = "latitude"))
As you can see from the first column, there are multiple observations with the same gid. I would like to aggregate all the data from the observations with the same gid into one observation.
I've done a lot of research on the best way to do this, but have been unsuccessful so far. From what I gather, the aggregate() function from the "sp" package would be my best bet, but I cannot work out how to use it in the way I need! Thank you for any help that may come my way.
How about this?
library(dplyr)
model_data %>%
  select(-id) %>%
  distinct()
This assumes that dropping "id" and keeping only "gid" gets you where you want to go.
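If the death counts differ between incidents in the same cell, distinct() will not add them up. A sketch of summing them per gid could look like the following; the column names deaths_a, deaths_b and deaths_civilians are assumptions about how the death counts are stored, and the sample rows are invented.

```r
library(dplyr)

# Hypothetical merged data: one row per incident, several rows per gid
model_data <- data.frame(
  gid              = c(1, 1, 2),
  deaths_a         = c(3, 2, 0),
  deaths_b         = c(1, 0, 4),
  deaths_civilians = c(0, 5, 2)
)

# One row per gid, with each deaths_* column summed over its incidents
gid_totals <- model_data %>%
  group_by(gid) %>%
  summarise(across(starts_with("deaths_"), sum), .groups = "drop")

gid_totals
```

If the analysis is per cell and year, the year column can be added to group_by in the same way.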
