How to add together duplicate values in columns? [duplicate] - r

This question already has answers here:
How to sum a variable by group
(18 answers)
Closed 3 years ago.
I have three columns: loan_id, amount, and date. I have 1,048,575 entries, and I need to add together all the duplicates in the loan_id column (there are different payments on the same loan_id). In the resulting table, the amount values should be summed for each matching loan_id.
This is a sample of what my data looks like:

Try
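# Sum amount within each loan_id; naming the list element labels the group column
aggregate(df$amount, by = list(loan_id = df$loan_id), FUN = sum)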

So you want the total amount for each loan_id irrespective of date?
One way to do aggregate functions like this in R is by using the data.table package.
library(data.table)
# assuming you start with a data.frame
mydata <- as.data.table(mydata)
# total amount per loan_id, returned in a column named amount
mydata[, .(amount = sum(amount)), by = loan_id]
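To check the result on something small, here is a minimal sketch in base R. The loan IDs, amounts, and dates are made up for illustration, and the data frame name df is an assumption.

```r
# Toy data: hypothetical payments, several per loan_id
df <- data.frame(
  loan_id = c("A1", "A1", "B2", "C3", "C3", "C3"),
  amount  = c(100, 50, 200, 10, 20, 30),
  date    = as.Date(c("2020-01-01", "2020-02-01", "2020-01-15",
                      "2020-03-01", "2020-04-01", "2020-05-01"))
)

# One row per loan_id with the summed amount; the date column is ignored
totals <- aggregate(df$amount, by = list(loan_id = df$loan_id), FUN = sum)
names(totals)[2] <- "amount"  # aggregate() calls the summed column x by default
print(totals)
```

The result has one row per distinct loan_id, so on the real data it should have as many rows as there are unique loans.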

Related

How to delete rows in r [duplicate]

This question already has answers here:
How do I delete rows in a data frame?
(10 answers)
Closed 1 year ago.
So I've been trying to subset and remove the observations of a country from my data frame (ESS6). I have been able to remove certain variables with this function, -c(variable), but that is not useful since I only want to remove certain rows from the variable countries (cntry).
Thank you for your help :)
Try using dplyr and the "filter" function
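A minimal sketch of what that could look like, assuming the country codes are stored as strings in cntry. The toy ESS6 data frame, the score column, and the codes "DE" and "FR" are invented for illustration.

```r
library(dplyr)

# Toy stand-in for the ESS6 data frame; the values are made up
ESS6 <- data.frame(
  cntry = c("DE", "FR", "DE", "ES", "FR"),
  score = c(1, 2, 3, 4, 5)
)

# Keep every row whose country is not "DE"
ESS6_no_de <- filter(ESS6, cntry != "DE")

# Drop several countries at once
ESS6_subset <- filter(ESS6, !cntry %in% c("DE", "FR"))

# Base R equivalent of the first call
ESS6_base <- ESS6[ESS6$cntry != "DE", ]
```

If cntry contains NA values, the two forms differ: filter() drops those rows, while the base R index returns them as rows of NA.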

How to Extract Two Columns of Data in R [duplicate]

This question already has answers here:
Extracting specific columns from a data frame
(10 answers)
Closed 4 years ago.
I have to extract two columns from this data set (Cars93 on MASS) and create a separate folder consisting only of the two columns MPG.highway and EngineSize. How do I go about doing this?
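You can look at Cars93 in the MASS package and print just the first ten rows to see it.
You can create the subset by indexing with the column names directly or, alternatively, with the subset function:
library(MASS)  # provides the Cars93 data set
new_df <- Cars93[, c("MPG.highway", "EngineSize")]
# or, with subset(), whose select argument names the columns to keep
new_df <- subset(Cars93, select = c("MPG.highway", "EngineSize"))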

Filtering Column by Multiple values [duplicate]

This question already has answers here:
Filter multiple values on a string column in dplyr
(6 answers)
Closed 2 years ago.
I would like to filter rows based on multiple values in one column.
For example, one data.frame has S&P 500 tickers; I have to pick 20 of them and their associated closing prices. How can I do it?
If I understand your question correctly, you can do it with dplyr:
library(dplyr)
target <- c("Ticker1", "Ticker2", "Ticker3")
filter(df, Ticker %in% target)
The answer can be found in https://stackoverflow.com/a/25647535/9513536
Cheers !

Sorting the data in the dataset [duplicate]

This question already has answers here:
Sort (order) data frame rows by multiple columns
(19 answers)
Closed 6 years ago.
I want to sort a variable in the dataset.
la3 <- order(la1$Id)
I am getting the output as an index. How do I get the output as the actual values in the dataset?
la3 <- la1[order(la1$Id), ]
order() returns a vector of row positions, the same length as the column, that would put the column in sorted order.
Indexing the original data with that vector therefore returns the rows in that order.
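A small sketch of the difference between the index and the sorted rows. The toy la1 data frame and its value column are assumptions made for illustration.

```r
# Toy data frame standing in for la1
la1 <- data.frame(Id = c(3, 1, 2), value = c("c", "a", "b"))

idx <- order(la1$Id)  # row positions in sorted order: 2, 3, 1
la3 <- la1[idx, ]     # rows of la1 rearranged so Id runs 1, 2, 3

# Descending order uses the same indexing pattern
la_desc <- la1[order(la1$Id, decreasing = TRUE), ]
```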
Using dplyr
library(dplyr)
la1 %>%
  arrange(Id)

With data.table, is SD[which.max(Var1)] the fastest way to find the max of a group? [duplicate]

This question already has answers here:
finding the index of a max value in R
(4 answers)
Subset rows corresponding to max value by group using data.table
(1 answer)
Closed 8 years ago.
If needed I can put together a dataset, but my question is somewhat general.
accts <- accts[, .SD[which.max(EE)], by=DnB.Name]
I've got a DT of about 350k rows, and some of the DnB.Name values (Dun & Bradstreet company name) are duplicates with differing employee counts (EE). I only care about the highest count for each name and can disregard the rest.
Anyway, DT is usually lightning quick, so I figure I must be doing something wrong?
Sort by EE, then take the first row for each group using a self join:
ordered <- accts[order(-EE)]  # descending order of EE
setkey(ordered, DnB.Name)     # a key must be set before the join
ordered[J(unique(DnB.Name)), mult = "first"]  # first (largest EE) row per name
For reference, check out this post on Cross Validated: https://stats.stackexchange.com/questions/7884/fast-ways-in-r-to-get-the-first-row-of-a-data-frame-grouped-by-an-identifier
EDIT: even faster, but weird syntax:
accts[accts[, .I[which.max(EE)], by = DnB.Name]$V1]
For reference, check this post with a similar question:
Subset by group with data.table
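To unpack the .I syntax, here is a minimal sketch in two steps. The toy table is invented for illustration. As I understand it, .I holds each row's position in the full table, so the inner call collects one row number per group and the outer call subsets once, rather than building .SD for every group.

```r
library(data.table)

# Toy table; company names and employee counts are made up
accts <- data.table(
  DnB.Name = c("Acme", "Acme", "Bolt", "Bolt", "Bolt"),
  EE       = c(10, 25, 7, 3, 5)
)

# Step 1: per group, the row number in accts holding the largest EE
idx <- accts[, .I[which.max(EE)], by = DnB.Name]$V1

# Step 2: pull those rows out in a single subset
top <- accts[idx]
print(top)
```

Note that which.max() returns only the first maximum, so ties within a name yield a single row.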
