Summarise a set of dates contained in a column [duplicate] - r

This question already has answers here:
Count number of unique levels of a variable
(7 answers)
Closed 2 years ago.
I have a dataset with transactions made from 2018/07/01 to 2019/06/30 and I want to find how many unique dates are in the "DATE" column (it has over 260k rows, so a date can be repeated several times).
I have tried the following but it just lists all the dates contained in the "DATE" column:
numberofdates <- dplyr::summarize(transactionData, DATE)
Thanks for your help!

Try one of these approaches:
In dplyr:
library(dplyr)
#Code 1
numberofdates <- transactionData %>% summarize(n_distinct(DATE))
Or in base R:
#Code 2
length(unique(transactionData$DATE))

With data.table we can use
library(data.table)
numberofdates <- uniqueN(transactionData$DATE)
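To check any of these approaches on something small first, here is a minimal base R sketch. The toy table and its SALES column are made up for illustration and are not the asker's data.

```r
# Toy stand-in for transactionData (hypothetical values)
transactionData <- data.frame(
  DATE  = as.Date(c("2018-07-01", "2018-07-01", "2018-07-02",
                    "2019-06-30", "2019-06-30", "2019-06-30")),
  SALES = c(5, 3, 8, 2, 7, 1)
)

# Number of distinct dates in the column
numberofdates <- length(unique(transactionData$DATE))
numberofdates
# [1] 3

# How many rows fall on each date
table(transactionData$DATE)
```

If the count is lower than the number of days in the period, some days have no transactions at all.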

Related

Dplyr Version of ColSum or Dynamic Group_By in R [duplicate]

This question already has answers here:
Group by multiple columns and sum other multiple columns
(7 answers)
Closed 2 years ago.
I'm having a hard time working through something simple. I have a data frame where the first column is "Cat" and contains 3 different categories which I would like to group_by and summarize. Columns 2-6 are months, so 1 is the first month, 2 is the second month, etc. What I'm trying to do is group_by Cat and sum up the individual columns. I've tried working with colSums and aggregate. Any help would be greatly appreciated! Thanks
dff<-data.frame(Cat=c('A','B','C','A','A','A','B','C'),
'1'=c(10,20,30,80,10,15,20,15),
'2'=c(15,10,20,30,60,45,50,65),
'3'=c(10,20,30,80,20,25,27,85),
'4'=c(90,70,50,30,10,15,20,15),
'5'=c(1,120,3,8,7,10,25,30))
Using aggregate in base R
aggregate(. ~ Cat, dff, sum)
Or with dplyr
library(dplyr)
dff %>%
group_by(Cat) %>%
summarise(across(everything(), sum))
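One thing worth knowing when running this on the example data: data.frame() repairs the non-syntactic names '1' to '5', so the month columns end up as X1 to X5. A short sketch with the data from the question, using only base R:

```r
dff <- data.frame(Cat = c('A','B','C','A','A','A','B','C'),
                  '1' = c(10,20,30,80,10,15,20,15),
                  '2' = c(15,10,20,30,60,45,50,65),
                  '3' = c(10,20,30,80,20,25,27,85),
                  '4' = c(90,70,50,30,10,15,20,15),
                  '5' = c(1,120,3,8,7,10,25,30))

# The numeric names were prefixed with "X" by data.frame()
names(dff)
# [1] "Cat" "X1"  "X2"  "X3"  "X4"  "X5"

# One row per category, every other column summed
sums <- aggregate(. ~ Cat, data = dff, FUN = sum)
sums$X1
# [1] 115  40  45
```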

Subsetting a dataframe based on a vector of strings [duplicate]

This question already has answers here:
Subset data frame based on number of rows per group
(4 answers)
Closed 3 years ago.
I have a large dataset called genetics which I need to break down. There are 4 columns, the first one is patientID that is sometimes duplicated, and 3 columns that describe the patients.
As said before, some of the patient IDs are duplicated and I want to know which ones, without losing the remaining columns.
dedupedGenID<- unique(Genetics$ID)
This gives me only the unique IDs, without the other columns.
In order to subset the df by those unique IDs I did
dedupedGenFull <- Genetics[str_detect(Genetics$patientID, pattern = dedupedGenID), ]
This gives me an error of "longer object length is not a multiple of shorter object length" and the dedupedGenFull has only 55 rows, while dedupedGenID is a character vector of 1837.
My questions are: how do I perform that subsetting step correctly? And how do I do the same for the IDs that occur more than once, i.e. how do I subset the df so that I get the IDs and other columns of the patients that repeat?
Any thoughts would be appreciated.
We can use duplicated to find the IDs that occur more than once and use them to subset the data:
subset(Genetics, ID %in% unique(ID[duplicated(ID)]))
Another approach is to count the number of rows per ID and keep the rows whose ID occurs more than once.
This can be done in base R:
subset(Genetics, ave(seq_along(ID), ID, FUN = length) > 1)
with dplyr
library(dplyr)
Genetics %>% group_by(ID) %>% filter(n() > 1)
and data.table
library(data.table)
setDT(Genetics)[, .SD[.N > 1], ID]
library(data.table)
genetics <- data.table(genetics)
genetics[,':='(is_duplicated = duplicated(ID))]
This chunk makes a data.table from your data and adds a new column which is TRUE if the ID is duplicated and FALSE if not. Note that duplicated() flags only the second and later occurrences of an ID, so the first occurrence is marked FALSE.
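To see why the original attempt failed and how the two subsets differ, here is a small base R sketch. The toy table and its gene column are invented for illustration. str_detect() treats its pattern argument as a vector of regular expressions to pair up with the strings, not as a set to look values up in, which is where the length complaint comes from; %in% does the set lookup.

```r
# Toy stand-in for the Genetics data (hypothetical values)
Genetics <- data.frame(
  ID   = c("p1", "p2", "p1", "p3", "p2", "p4"),
  gene = c("a", "b", "c", "d", "e", "f"),
  stringsAsFactors = FALSE
)

# One row per ID (the first occurrence), all columns kept
first_rows <- Genetics[!duplicated(Genetics$ID), ]

# IDs that occur more than once
dup_ids <- unique(Genetics$ID[duplicated(Genetics$ID)])
dup_ids
# [1] "p1" "p2"

# Every row belonging to a repeated ID, including its first occurrence
repeated <- Genetics[Genetics$ID %in% dup_ids, ]
nrow(repeated)
# [1] 4
```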

How to add together duplicate values in columns? [duplicate]

This question already has answers here:
How to sum a variable by group
(18 answers)
Closed 3 years ago.
I have three columns: loan_id, amount, date. I have 1,048,575 entries and I need to add together all the duplicates in the loan_id column (there are different payments on the same loan_id). In the resulting table, the amount values should be summed for each loan_id.
A sample of my data was shown here.
Try
aggregate(df$amount,list(df$loan_id),sum)
So you want the total amount for each loan_id irrespective of date?
One way to do aggregate functions like this in R is by using the data.table package.
library(data.table)
# assuming you start with a data.frame
mydata = data.table(mydata)
mydata[,sum(amount), by=loan_id]
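For a quick check without any extra packages, here is a minimal sketch with made-up loan data. The values are hypothetical; only the column names come from the question. The formula interface of aggregate() keeps the original column names in the result.

```r
# Toy stand-in for the loan table (hypothetical values)
df <- data.frame(
  loan_id = c(101, 101, 102, 103, 103, 103),
  amount  = c(100, 50, 200, 10, 20, 30),
  date    = as.Date(c("2019-01-05", "2019-02-05", "2019-01-10",
                      "2019-01-15", "2019-02-15", "2019-03-15"))
)

# Total amount per loan_id, ignoring the date column
totals <- aggregate(amount ~ loan_id, data = df, FUN = sum)
totals$amount
# [1] 150 200  60
```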

Dropping dates except specific year from panel data [duplicate]

This question already has answers here:
Filter data.frame rows by a logical condition
(9 answers)
Closed 4 years ago.
I have panel data from 2000 to 2017. I want to select the rows where the year is 2005.
Amazingly
mydata <- subset(mydata, select= c(mydata$Year>="2005"))
did not work. Any suggestion?
This assumes the Year column is numeric.
To keep 2005 and all later years:
library(dplyr)
mydata %>%
filter(Year >= 2005)
Only 2005:
library(dplyr)
mydata %>%
filter(Year == 2005)
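The original subset() call failed because the select argument chooses columns; the row condition belongs in the second argument. A base R sketch, assuming a numeric Year column (the id and value columns are hypothetical):

```r
# Toy panel data (hypothetical values)
mydata <- data.frame(
  id    = rep(c("a", "b"), each = 3),
  Year  = rep(2004:2006, times = 2),
  value = c(1.2, 1.5, 1.9, 3.1, 3.0, 2.8)
)

# Rows for 2005 only: the condition is the second argument of subset()
only_2005 <- subset(mydata, Year == 2005)
only_2005$value
# [1] 1.5 3.0

# Rows from 2005 onwards, using plain logical indexing
from_2005 <- mydata[mydata$Year >= 2005, ]
nrow(from_2005)
# [1] 4
```

If Year is stored as character, converting it with as.numeric() first avoids string comparison.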

Filtering Column by Multiple values [duplicate]

This question already has answers here:
Filter multiple values on a string column in dplyr
(6 answers)
Closed 2 years ago.
I would like to filter values based on one column with multiple values.
For example, one data.frame has S&P 500 tickers; I have to pick 20 of them and their associated closing prices. How do I do it?
If I understand your question correctly, you can do it with dplyr:
library(dplyr)
target <- c("Ticker1", "Ticker2", "Ticker3")
filter(df, Ticker %in% target)
The answer can be found in https://stackoverflow.com/a/25647535/9513536
Cheers !
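The same %in% test also works without dplyr. A minimal base R sketch; the tickers, prices, and the Close column name are placeholders, not real data:

```r
# Toy price table (hypothetical tickers and prices)
prices <- data.frame(
  Ticker = c("AAA", "BBB", "CCC", "AAA", "DDD"),
  Close  = c(10.5, 20.1, 30.2, 10.9, 40.0)
)

# Tickers to keep
target <- c("AAA", "CCC")

# %in% returns TRUE for every row whose Ticker is in target
picked <- prices[prices$Ticker %in% target, ]
picked$Close
# [1] 10.5 30.2 10.9
```

Using == instead of %in% here would recycle target along the column and silently miss rows, so %in% is the safer choice when matching against several values.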
