This question already has answers here:
Select the row with the maximum value in each group
(19 answers)
Closed 1 year ago.
I am trying to merge two datasets for my senior thesis on corporate political activity. One shows all of the data I have on each company, which is made up of several previously merged datasets, and the other shows the year, the company's ticker, and a variable called "dirnbr". "dirnbr" shows how many people were on the board in a given year, except it is showing it like this:
Basically, it is creating several entries per year, one for each person on the board, going from 1 to the total number on the board (which is the only number I really care about). I just want my dataset to show total number of people on the board in a given year, year, and ticker. This would then allow me to merge them using an inner_join command and then see what percentage of people on a board of directors in a given year were formerly involved in politics. (I have that information in my larger dataset).
Basically, I would like to drop every observation besides the largest "dirnbr" entry per year and ticker. Is there a way to do this, or to achieve the same result in another way?
Please let me know, any help is very appreciated.
You could use
library(dplyr)
df %>%
group_by(ticker, year) %>%
filter(dirnbr == max(dirnbr))
or
df %>%
group_by(ticker, year) %>%
slice_max(dirnbr)
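As a minimal sketch of the full workflow described in the question, the example below collapses the board data to one row per ticker and year and then joins it to the company data. The data frames `board` and `companies` and the columns `n_political` and `share_political` are made up for illustration; only `ticker`, `year` and `dirnbr` come from the question.

```r
library(dplyr)

# Hypothetical board data: one row per director, numbered 1..N within each ticker-year
board <- data.frame(
  ticker = c("AAA", "AAA", "AAA", "AAA", "AAA", "BBB", "BBB"),
  year   = c(2010, 2010, 2010, 2011, 2011, 2010, 2010),
  dirnbr = c(1, 2, 3, 1, 2, 1, 2)
)

# One row per ticker-year, keeping only the board size
board_size <- board %>%
  group_by(ticker, year) %>%
  summarise(board_size = max(dirnbr), .groups = "drop")

# Hypothetical company-level data (n_political is an assumed column)
companies <- data.frame(
  ticker      = c("AAA", "AAA", "BBB"),
  year        = c(2010, 2011, 2010),
  n_political = c(1, 0, 1)
)

# Join on both keys and compute the share of politically connected directors
merged <- inner_join(companies, board_size, by = c("ticker", "year")) %>%
  mutate(share_political = n_political / board_size)
```

Using `summarise()` here rather than `filter()` guarantees exactly one row per group, which keeps the later join from duplicating company rows.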
Related
I have a question regarding the filtering of a loan dataset for my upcoming thesis.
My dataset consists of loan data which is reported for 5 years on a quarterly basis. The column of interest is the 'Loan Identifier' as well as the 'Cut-Off-Date'. I just want to observe the loans (via Loan Identifier) that exist at the first reporting date (first quarter) for every upcoming quarter (cut-off-date).
For example, if there are the loans with the identifier c("1001","1002","1003") in the first cut-off-date and the second cut-off date, one quarter later, has loans with identifiers ("1002","1003","1004"), R should filter for only the identifiers that existed in the first quarter ("1002","1003"). So that new loans during the analysis are completely ignored.
Is there also the possibility to do that all in one file? Or should I extract the data of each cut-off-date in a new table?
Thanks and best regards!
I am thinking about assigning each loan in the first quarter as a vector. After that, I should split up the loan dataset for each cut-off-date and merge the vector with the new tables via left_join. So that every loan that does not match with the vector is disregarded.
As I have multiple loan pools with 15 pool-cut-off dates, this seems very impractical for me. Maybe there is a smarter and more effective solution.
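One way this could be done in a single table, without splitting by cut-off date, is sketched below. The column names `loan_id` and `cut_off_date` and the sample values are assumptions standing in for the real 'Loan Identifier' and 'Cut-Off-Date' columns.

```r
library(dplyr)

# Hypothetical loan data covering two quarterly cut-off dates
loans <- data.frame(
  loan_id = c("1001", "1002", "1003", "1002", "1003", "1004"),
  cut_off_date = as.Date(c("2020-03-31", "2020-03-31", "2020-03-31",
                           "2020-06-30", "2020-06-30", "2020-06-30"))
)

# Identifiers present at the earliest cut-off date
first_ids <- loans %>%
  filter(cut_off_date == min(cut_off_date)) %>%
  pull(loan_id)

# Keep only those loans at every later date; loan 1004 is dropped
loans_initial <- loans %>%
  filter(loan_id %in% first_ids)
```

With several loan pools in one table, the same idea should carry over by adding a `group_by()` on a pool column (assumed to exist) and filtering with `loan_id %in% loan_id[cut_off_date == min(cut_off_date)]`.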
I am very new to R so please bear with me!
I have a dataset with moth species, names of people who recorded the moths (Recorders), the year in which they were recorded, etc.
I would like to create a new table in which I have the number of different moth recorders per year. So far I have managed to make a table that gives me the total recordings made per year, but it's not quite what I need.
Here is the code I have used, would anybody be able to offer amendments or perhaps alternative ways to go about this?
#create table with number of moth recorders per year
library(plyr)
diversity <- ddply(mydata4, c("Year"), summarise,
N = length(Recorder))
diversity
Thank you!
As you are new to R and, by the sounds of it, actively learning, I'll give you a nudge in the right direction. I've always found that things stick best when I've figured them out myself, and I don't want to rob you of that.
So: It sounds like what you want is to have a count of the distinct recorders grouped by year. (Hint hint)
I suggest having a look at the dplyr and tidyr packages (for which there is a handy cheatsheet) as they are very useful for this sort of manipulation of data frames.
Also, as you are just picking up R, another useful thing worth taking a look at (though not relevant to your immediate problem) is the Tidyverse Code Style Guide.
For those looking to have the answer spelled out, see below. Look away now if you want to figure it out yourself.
The original question states that there is a data set with the following properties:
Moth Species
Name of person who Recorded it
Year the moth was Recorded in.
The code provided in the question was reported to produce a table of the total number of recordings made per year. From this we can infer that the original table has one row per recording.
The question also refers to two specific columns: Year and Recorder. From this information and the fact that the question mentioned the data set included moth species we can infer that the data set has at least three columns:
Species
Recorder
Year
So, let's make up some sample data:
mydata4 <- data.frame(
Species = c("Red", "Blue", "Red", "Blue", "Green"),
Year = c("2019", "2019", "2019", "2018", "2018"),
Recorder = c("Alice", "Alice", "Bob", "Alice", "Alice")
)
Now, as I mentioned above, we desire a count of distinct Recorders grouped by year... so:
library(dplyr)
mydata4 %>% group_by(Year) %>% distinct(Recorder) %>% count()
We group by year, we make sure that the rows in each group are distinct by Recorder and finally we count the rows in each group, as by this point we have made sure that each group only has one row per Recorder who recorded at least one moth in that year.
# A tibble: 2 x 2
# Groups: Year [2]
Year n
<fct> <int>
1 2018 1
2 2019 2
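An alternative that may read more directly is to count distinct values inside `summarise()` with dplyr's `n_distinct()`. This is a sketch on the same made-up sample data; the result name `recorders_per_year` is just for illustration.

```r
library(dplyr)

# Same made-up sample data as above
mydata4 <- data.frame(
  Species  = c("Red", "Blue", "Red", "Blue", "Green"),
  Year     = c("2019", "2019", "2019", "2018", "2018"),
  Recorder = c("Alice", "Alice", "Bob", "Alice", "Alice")
)

# One row per year with the number of distinct recorders in that year
recorders_per_year <- mydata4 %>%
  group_by(Year) %>%
  summarise(n = n_distinct(Recorder))
```

This returns an ungrouped table with one row per year, so there is no leftover grouping to remove before plotting or joining.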
This question already has answers here:
How to sum a variable by group
(18 answers)
Closed 2 years ago.
This is my first time posting a question, so may not have the correct info to start, apologies in advance. Am new to R. Prefer to use dplyr or tidyverse because those are the packages we've used so far. I did search for a similar question, but most gender/sex related questions are around separating the data, or performing operations on each separately.
I have a table of population counts, with variables (factors) Age Range, Year and Sex, with Population as the dependent variable. I want to create a plot to show if the population is aging, that is, showing how the relative proportion of different age groups changes over time. But gender is not relevant, so I want to add together the population counts for males and females, for each year and age range.
I don't know how to provide a copy of the raw data .csv file, so if you have any suggestions, please let me know.
This is a sample of the data(output table):
And here is the code so far:
library(dplyr)
file_name <- "AusPopDemographics.csv"
AusDemo_df <- read.table(file_name, sep = ",", header = TRUE)
(grp_AusDemo_df <- AusDemo_df %>% group_by(Year, Age))
I am guessing it may be something like pivot_wider() to bring male and female up as column headings, then transmute() to sum them and create a new population column.
Thanks for your help.
With dplyr you could do something like this
library(dplyr)
grp_AusDemo_df <- AusDemo_df %>%
group_by(Year, Age) %>%
summarise(Population = sum(Population, na.rm = TRUE))
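To connect this to the stated goal of showing how age groups change over time, here is a small sketch that sums over sex and then computes each age group's share within its year. The sample values and the `Share` column are invented for illustration; the column names `Year`, `Age`, `Sex` and `Population` follow the question.

```r
library(dplyr)

# Hypothetical sample standing in for AusPopDemographics.csv
AusDemo_df <- data.frame(
  Year       = c(2000, 2000, 2000, 2000, 2010, 2010, 2010, 2010),
  Age        = c("0-14", "0-14", "65+", "65+", "0-14", "0-14", "65+", "65+"),
  Sex        = rep(c("Male", "Female"), 4),
  Population = c(100, 110, 40, 50, 90, 100, 60, 70)
)

age_share <- AusDemo_df %>%
  group_by(Year, Age) %>%
  # Add males and females together within each year and age range
  summarise(Population = sum(Population, na.rm = TRUE), .groups = "drop") %>%
  group_by(Year) %>%
  # Proportion of each age group within its year
  mutate(Share = Population / sum(Population)) %>%
  ungroup()
```

The `Share` column could then be plotted against `Year`, coloured by `Age`, to see whether older groups make up a growing proportion.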
So I have one dataset (DF1) that includes baseball players, the year, and their stats in that year. I have another (DF2) that lists the players, the year, and their salary in that year.
I would like to add the salary column information to DF1 when player name AND year match in both datasets.
I tried
DF1$Salary <- DF2$salary[match(Pitching$playerID, Salaries$playerID)]
But realized that if I did this the information was only correct for the first year. I need to only make the match if year and player ID are the same. Can someone help me? Thanks!
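One possible approach is a join on both keys instead of `match()` on the player ID alone. The sketch below assumes the key columns are called `playerID` and `yearID` in both tables; the sample values and the `ERA` column are made up.

```r
library(dplyr)

# Hypothetical stats table: one row per player and year
DF1 <- data.frame(
  playerID = c("a01", "a01", "b02"),
  yearID   = c(2001, 2002, 2001),
  ERA      = c(3.1, 2.9, 4.0)
)

# Hypothetical salary table with the same keys
DF2 <- data.frame(
  playerID = c("a01", "a01", "b02"),
  yearID   = c(2001, 2002, 2001),
  salary   = c(500000, 750000, 300000)
)

# Keep every row of DF1 and attach salary where player AND year both match
DF1 <- left_join(DF1, DF2, by = c("playerID", "yearID"))
```

Rows in `DF1` with no matching player and year in `DF2` would get `NA` in `salary` rather than a value from the wrong season.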
I have data for customer purchases across different products. I calculated amount_spent by multiplying the item numbers by the respective price.
I used the cut function to segregate people into different age bins. Now, how can I find the aggregate amount spent by each age group, i.e. the contribution of each age group in terms of dollars spent?
Please let me know if you need any more info.
I am really sorry that I can't paste the data here due to remote desktop constraints. I am actually concerned with the result I got after the summarize function.
library(dplyr)
customer_transaction %>%
group_by(age_gr) %>%
summarise(amount_spent = sum(amount_spent))  # replaces the deprecated summarise_each(funs(sum))
Though I am not sure if you want the contribution to the whole pie or just the sum in each age group.
If your data is of class data.table you could go with
customer_transaction[,sum(amount_spent),by=age_gr]
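If the contribution to the whole is what is wanted, a sketch along these lines would give both the total and the share per age group. The sample data and the names `total_spent` and `share` are assumptions; `age_gr` and `amount_spent` follow the question.

```r
library(dplyr)

# Hypothetical transactions already binned into age groups
customer_transaction <- data.frame(
  age_gr       = c("18-30", "18-30", "31-50", "51+"),
  amount_spent = c(20, 30, 40, 10)
)

spend_by_age <- customer_transaction %>%
  group_by(age_gr) %>%
  # Dollars spent per age group
  summarise(total_spent = sum(amount_spent), .groups = "drop") %>%
  # Each group's share of overall spending
  mutate(share = total_spent / sum(total_spent))
```

Because the grouping is dropped before `mutate()`, `sum(total_spent)` is the grand total across all age groups.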