Average of a variable over different levels of another variable in R - r

I have a very large dataset with employers, employees and salaries.
Each employee has a salary and is linked to an employer.
Employers can have hundreds, even thousands of employees working for them.
I want to find the average salary per employer, i.e., I want to return an output with just one line per employer, showing the average salary across all of that employer's employees.
Thanks

You can use aggregate for this:
aggregate(salaries, by = list(employers), FUN = mean)
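As a minimal sketch, assuming the data sit in a data frame with columns named employer and salary (hypothetical names, since the question does not show the data), the formula interface of aggregate returns one row per employer:

```r
# Hypothetical toy data; the column names are assumptions
df <- data.frame(
  employer = c("A", "A", "B", "B", "B"),
  salary   = c(100, 200, 300, 400, 500)
)

# One row per employer, holding the mean salary of its employees
avg_salary <- aggregate(salary ~ employer, data = df, FUN = mean)
print(avg_salary)
```

Note that the formula interface drops rows with missing values by default, which may or may not be what you want for incomplete salary records.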

Related

Counting overlapping prescriptions in R

Firstly, I'm new to R and I apologize. So I'm working with data involving prescriptions. Since it's on a secure VM, I can't copy and paste, but the data structure looks like this:
Patient ID | Medication | Start Date | End Date
There are multiple rows for each patient, since each patient has been prescribed more than one medication.
What I want to do is the following:
Find out how many medications (and which ones) each patient is on that overlap in time, and then return the number of overlapping prescriptions each patient has. Is there a way to do this in R?
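One possible approach, sketched here on made-up data: pair every prescription with every other prescription of the same patient, then keep the pairs whose date ranges intersect. Two ranges overlap when each one starts on or before the day the other ends. The data frame rx and its column names are assumptions, since the real data cannot be shown.

```r
# Hypothetical toy data; column names are assumptions
rx <- data.frame(
  patient_id = c(1, 1, 1, 2, 2),
  medication = c("A", "B", "C", "A", "D"),
  start_date = as.Date(c("2020-01-01", "2020-01-15", "2020-03-01",
                         "2020-01-01", "2020-02-01")),
  end_date   = as.Date(c("2020-01-31", "2020-02-15", "2020-03-31",
                         "2020-01-10", "2020-02-28"))
)
rx$row_id <- seq_len(nrow(rx))

# Pair every prescription with every prescription of the same patient
pairs <- merge(rx, rx, by = "patient_id", suffixes = c("_1", "_2"))

# Keep each unordered pair once, and only if the date ranges intersect
overlaps <- pairs[
  pairs$row_id_1 < pairs$row_id_2 &
    pairs$start_date_1 <= pairs$end_date_2 &
    pairs$start_date_2 <= pairs$end_date_1,
  c("patient_id", "medication_1", "medication_2")
]

# Number of overlapping pairs per patient (patients with none are absent)
overlap_counts <- table(overlaps$patient_id)
print(overlaps)
```

The self-join grows quadratically with the number of prescriptions per patient, so for very large data a dedicated interval join such as data.table's foverlaps may be worth a look.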

Counting rows with unique ID in R studio with length(unique)

I have a dataset with people's IDs (some IDs appear in multiple rows) and risks (coded as a number between 1 and 7, with some NA), and I want to count the number of people in each risk group without counting the same person twice.
When creating a subset containing only 1 row/person, I obtain a certain number of people for each group. However, when I use this function (for each risk group):
length(unique(data$person_id[data$RISK == 1]))
it seems like I get one extra person in each risk group (so 7 extra people in total).
Does someone have an explanation for this? Do I have to do -1 each time I use this function?
Thanks in advance!
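A likely explanation, given that RISK contains NA values: comparing NA with == returns NA rather than FALSE, and indexing a vector with NA yields an NA element, which unique() then counts as one extra value. The toy data below are hypothetical and only mirror the column names from the question.

```r
# Hypothetical toy data mirroring the question's column names
data <- data.frame(
  person_id = c(1, 1, 2, 3, 4),
  RISK      = c(1, 1, 1, NA, 2)
)

# RISK == 1 is NA where RISK is missing, and indexing with NA yields NA
with_na <- unique(data$person_id[data$RISK == 1])
print(length(with_na))

# which() keeps only the positions that are TRUE, so the NA rows drop out
without_na <- unique(data$person_id[which(data$RISK == 1)])
print(length(without_na))
```

If that is the cause, wrapping the condition in which() is safer than subtracting 1, because the correction would be wrong for any data without missing risks.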

Connecting data from 2 data sources to group as one

I'm currently working with CSVs containing data about test participants on a diet program.
One CSV has the participants' information: 'subject id', 'chosen group', 'extra calories', etc.
The second set of CSVs (10 in total) describes the different diet groups: meals, calories, etc.
My task was to find the total calories of each group then find the total calories of each participants chosen group.
To get the total calories of each group I just made a variable that summed the total calories of each group
one <- sum(groupOne$calories)
Then I cleaned the data up a bit in the participants file by removing the 'g' in the row name.
Ideally, I would like some output that has each participant's subjectID and the total calories of their group. Something like below:
SubjectID | Group | Group's Total Calories
1 | G3 | 100cal
2 | G6 | 200cal
After that I'm kind of stuck. I don't quite know how to join the two together and produce output that matches the participants to the groups, giving a clean display of each participant's subjectID, their group and the total calories of that group.
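One way to approach this, sketched on made-up data: collect the per-group totals in a small lookup table, then merge it with the participants on the group label. All object names, column names and values below are assumptions, and only two of the ten groups are shown.

```r
# Hypothetical toy data; names and values are assumptions
participants <- data.frame(
  subject_id = c(1, 2, 3),
  group      = c("G3", "G6", "G3")
)
groupThree <- data.frame(meal = c("breakfast", "lunch"), calories = c(40, 60))
groupSix   <- data.frame(meal = c("breakfast", "lunch"), calories = c(120, 80))

# One total per diet group, collected in a lookup table
group_totals <- data.frame(
  group          = c("G3", "G6"),
  total_calories = c(sum(groupThree$calories), sum(groupSix$calories))
)

# Attach each participant's group total by matching on the group label
result <- merge(participants, group_totals, by = "group")
result <- result[order(result$subject_id),
                 c("subject_id", "group", "total_calories")]
print(result)
```

The group labels must be written the same way in both tables for the merge to match, so any cleaning (such as removing the 'g') has to be applied consistently to both.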

Calculating new columns in PowerBI

I've got this table I've defined in PowerBI:
I'd like to define a new table which has the percentage of medals won by USA from the total of medals that were given that year for each sport.
An example:
Year | Sport | Percentage
1986 | Aquatics | 0.0%
How could I do it?
You can use SUMMARIZE() to calculate a new table:
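NewTable =
SUMMARIZE(
    yourDataTable;
    yourDataTable[Year];
    yourDataTable[Sports];
    "Pct";
    DIVIDE(
        // Medals won by USA in the current Year/Sports combination
        CALCULATE(
            COUNTROWS(yourDataTable);
            yourDataTable[Nat] = "USA"
        );
        // All medals in the current Year/Sports combination
        CALCULATE(
            COUNTROWS(yourDataTable);
            ALLEXCEPT(
                yourDataTable;
                yourDataTable[Year];
                yourDataTable[Sports]
            )
        );
        0
    )
)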
I know that an answer has already been accepted, but I feel that I should provide my suggested solution to utilize all of Power BI's capabilities.
By creating a calculated table, you are limited in what you can do with the data: it is hard-coded to be filtered to USA and is based only on Year and Sport. While those are the current requirements, what if they change? Then you have to recode your table or make another one.
My suggestion is to use measures to accomplish this task, and here's how...
First, here is my set of sample data.
With that data, I created a simple measure that counts the rows to get the count of medals.
Medal Count = COUNTROWS(Olympics)
Throwing together a basic matrix with that measure we can see the data like this.
A second measure can then be created to get a percentage for a specific country.
Country Medal Percentage = DIVIDE([Medal Count], CALCULATE([Medal Count], ALL(Olympics[Country])), BLANK())
Adding that measure to the matrix we can start to see our percentages.
From that matrix, we can see that USA won 25% of all medals in 2000. And their 2 medals in Sport B made up 33.33% of all medals that year.
With this you can utilize slicers and the layout of the matrix to get the desired percentage. Here's a small example with a country and year slicer that shows the same numbers.
From here you are able to cut the data by any sport or year and see the percentage of any selected country (or countries).

Summarizing Data across age groups in R

I have data for customer purchases across different products. I calculated amount_spent by multiplying the number of items by the respective price.
I used the cut function to segregate people into different age bins. Now, how can I find the aggregate amount spent by each age group, i.e., the contribution of each age group in terms of dollars spent?
Please let me know if you need any more info.
I am really sorry that I can't paste the data here due to remote desktop constraints. I am actually concerned with the result I got from the summarize function.
library(dplyr)
# summarise_each() and funs() are deprecated in current dplyr; summarise() gives the same per-group sum
customer_transaction %>% group_by(age_gr) %>% summarise(amount_spent = sum(amount_spent))
Though I am not sure if you want the contribution to the whole pie or just the sum in each age group.
If your data is of class data.table you could go with
customer_transaction[,sum(amount_spent),by=age_gr]
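If the goal is each age group's contribution to the whole pie rather than just the sum, one option is to divide each group's total by the grand total. This is a base R sketch on made-up data. The column names follow the code above, while the values and the age bin labels are assumptions.

```r
# Hypothetical toy data; column names follow the answer's code
customer_transaction <- data.frame(
  age_gr       = c("18-30", "18-30", "31-50", "51+"),
  amount_spent = c(10, 30, 40, 20)
)

# Total spent per age group
totals <- aggregate(amount_spent ~ age_gr, data = customer_transaction, FUN = sum)

# Each group's share of overall spending, in percent
totals$share_pct <- 100 * totals$amount_spent / sum(totals$amount_spent)
print(totals)
```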
