Use of aggregate function in R [duplicate] - r

This question already has answers here:
Add count of unique / distinct values by group to the original data
(3 answers)
Closed 5 months ago.
I have data like this:
ID <- c(1001, 1001, 1001, 1002, 1002, 1002)
activity <- c(123, 123, 123, 456, 456, 789)
df <- data.frame(ID, activity)
I want to count the number of unique activity values within ID to end up with a dataframe like this:
N <- c(1, 1, 1, 2, 2, 2)
data.frame(df, N)
So we can see that person 1001 did only 1 activity while person 1002 did two.
I think it can be done with aggregate but am happy to use another approach.

dplyr option:
library(dplyr)

sum_df <- df %>%
  group_by(ID) %>%
  summarize(count_distinct = n_distinct(activity)) %>%
  left_join(df, by = 'ID')
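Since the question explicitly mentions aggregate, a base-R sketch is also possible. ave() applies a function within groups and returns a vector aligned with the original rows, so no join is needed:

```r
# Base R alternative: ave() returns one value per original row
ID <- c(1001, 1001, 1001, 1002, 1002, 1002)
activity <- c(123, 123, 123, 456, 456, 789)
df <- data.frame(ID, activity)

# Count distinct activities within each ID
df$N <- ave(df$activity, df$ID, FUN = function(x) length(unique(x)))
df$N
# 1 1 1 2 2 2
```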

Conditional sum of a column according to the value of another column when grouping [duplicate]

This question already has answers here:
How to sum a variable by group
(18 answers)
Closed 1 year ago.
I'm trying to sum PONDERA when ESTADO==1 and then group by AGLOMERADO
new <- recorte %>% group_by(AGLOMERADO) %>%
summarise(TOTocupied=sum(recorte[recorte$ESTADO==1,"PONDERA"]))
The sum is working correctly, but I can't get the result to be grouped by AGLOMERADO, it gives me back the same result for each AGLOMERADO:
AGLOMERADO TOTocupied
1 100
2 100
3 100
What am I doing wrong?
Don't use $ inside a dplyr pipe, and there is no need to refer to the dataframe again, since the pipe already passes it along. You can try:
library(dplyr)
new <- recorte %>%
  group_by(AGLOMERADO) %>%
  summarise(TOTocupied = sum(PONDERA[ESTADO == 1], na.rm = TRUE))
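Since recorte itself isn't shown in the question, here is a minimal sketch with made-up data (the column names are taken from the question, the values are hypothetical) confirming that the grouped conditional sum varies by group:

```r
library(dplyr)

# Hypothetical stand-in for `recorte` (values invented for illustration)
recorte <- data.frame(
  AGLOMERADO = c(1, 1, 2, 2),
  ESTADO     = c(1, 0, 1, 1),
  PONDERA    = c(10, 5, 20, 30)
)

# Sum PONDERA only where ESTADO == 1, within each AGLOMERADO
recorte %>%
  group_by(AGLOMERADO) %>%
  summarise(TOTocupied = sum(PONDERA[ESTADO == 1], na.rm = TRUE))
# AGLOMERADO 1 -> 10, AGLOMERADO 2 -> 50
```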

Sorting Column in R [duplicate]

This question already has answers here:
Calculate the mean by group
(9 answers)
Closed 3 years ago.
I have data that includes a treatment group, which is indicated by a 1, and a control group, which is indicated by a 0. This is all contained under the variable treat_invite. How can I separate these and take the mean of pct_missing for the 1's and 0's? I've attached an image for clarification.
Assuming your data frame is called df:
library(dplyr)
df <- df %>%
  group_by(treat_invite) %>%
  mutate(MeanPCTMissing = mean(pct_missing))
Or, if you just want the summary table (rather than the original table with an additional column):
df <- df %>%
  group_by(treat_invite) %>%
  summarise(MeanPCTMissing = mean(pct_missing))
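The same group mean can also be computed in base R with aggregate, if you'd rather avoid a dplyr dependency. A sketch with toy data (the real values are only known from the question's description, so these are invented):

```r
# Hypothetical toy data matching the question's description
df <- data.frame(
  treat_invite = c(1, 1, 0, 0),
  pct_missing  = c(0.2, 0.4, 0.1, 0.3)
)

# aggregate() computes mean(pct_missing) within each treat_invite group
aggregate(pct_missing ~ treat_invite, data = df, FUN = mean)
#   treat_invite pct_missing
# 1            0         0.2
# 2            1         0.3
```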

Padding multiple columns in a data frame or data table [duplicate]

This question already has answers here:
Filling missing dates by group
(3 answers)
Fastest way to add rows for missing time steps?
(4 answers)
Closed 5 years ago.
I have a data frame like the following and would like to pad the dates.
Notice that four days are missing for id 3.
df = data.frame(
  id = c(1, 1, 1, 2, 2, 3, 3, 3),
  date = lubridate::ymd("2017-01-01", "2017-01-02", "2017-01-03",
                        "2017-05-10", "2017-05-11", "2017-01-03",
                        "2017-01-08", "2017-01-09"),
  type = c("A", "A", "A", "B", "B", "C", "C", "C"),
  val1 = rnorm(8),
  val2 = rnorm(8))
df
I tried the padr package as I wanted a quick solution, but this doesn't seem to work.
?pad
padr::pad(df)
library(dplyr)
df %>% padr::pad(group = c('id'))
df %>% padr::pad(group = c('id','date'))
Any ideas on tools or other packages to pad a dataset across multiple columns and based on groupings?
EDIT:
Id 3 has only three dates in my df:
"2017-01-03","2017-01-08","2017-01-09"
Thus, I want the final data to include four extra rows containing
"2017-01-04","2017-01-05","2017-01-06","2017-01-07"

data mining: subset based on maximum criteria of several observations [duplicate]

This question already has answers here:
Select the row with the maximum value in each group
(19 answers)
Closed 6 years ago.
Consider the example data
Zip_Code <- c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4)
Political_pref <- c('A','A','B','A','B','B','A','A','B','B','A','A')
income <- c(60, 120, 100, 90, 80, 60, 100, 90, 200, 200, 90, 110)
df1 <- data.frame(Zip_Code, Political_pref, income)
I want to group by each $Zip_Code and obtain the maximum $income within each $Political_pref factor.
The desired output is a df with 8 obs. of 3 variables: for each $Zip_Code, the A row and the B row with the greatest income.
I am playing with dplyr, but happy for a solution using any package (possibly with data.table)
library(dplyr)
df2 <- df1 %>%
  group_by(Zip_Code) %>%
  filter(....)
We can use slice with which.max
library(dplyr)
df1 %>%
  group_by(Zip_Code, Political_pref) %>%
  slice(which.max(income))
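Since the question also mentions data.table, an equivalent sketch there uses .SD with which.max per group:

```r
library(data.table)

Zip_Code <- c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4)
Political_pref <- c('A','A','B','A','B','B','A','A','B','B','A','A')
income <- c(60, 120, 100, 90, 80, 60, 100, 90, 200, 200, 90, 110)
dt <- as.data.table(data.frame(Zip_Code, Political_pref, income))

# .SD is the subset of rows for each group; pick the max-income row.
# Note: ties (e.g. the two 200s for zip 3, pref B) keep only the first row,
# so zip 4, which has no B rows, contributes a single row.
dt[, .SD[which.max(income)], by = .(Zip_Code, Political_pref)]
```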

how to use dcast on a dataframe that has only 1 row in R [duplicate]

This question already has answers here:
Faster ways to calculate frequencies and cast from long to wide
(4 answers)
Closed 4 years ago.
Hi I have a dataframe like this
Start <- c("A")
End <- c("C")
Days <- c("Day1")
df2 <- data.frame(Start,End,Days)
I am trying to use dcast (from reshape2):
library(reshape2)
df2 <- dcast(df2, Days ~ End, value.var = "Days")
but what it returns is
Days C
1 Day1 Day1
My desired output is the count
Days C
1 Day1 1
What am I missing here? Kindly provide some inputs on this. Is there a better way to do this using dplyr?
We can create a sequence column of 1 and then use dcast
dcast(transform(df2, i1=1), Days~End, value.var='i1')
# Days C
#1 Day1 1
Or another option is using the fun.aggregate
dcast(df2, Days~End, length)
# Days C
#1 Day1 1
As the OP asked about dplyr, which has no fun.aggregate equivalent, that route uses the first method, i.e. creating the indicator column:
library(dplyr)
df2 %>%
  mutate(C = 1) %>%
  select(Days:C)
Hi, you are on the right track. What you need when you cast your data frame is a function that is applied to the aggregation during the casting.
In this case, you want something that counts the occurrences in each group, so you use the function length:
dcast(df2, Days ~ End, length) # or dcast(df2, Days ~ End, table)
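For a pure dplyr/tidyr route (a hypothetical alternative, since the OP asked about dplyr), counting first and then pivoting wide gives the same result and generalizes to data frames with more than one row:

```r
library(dplyr)
library(tidyr)

df2 <- data.frame(Start = "A", End = "C", Days = "Day1")

# count() tallies rows per Days/End combination;
# pivot_wider() then spreads End values into columns
df2 %>%
  count(Days, End) %>%
  pivot_wider(names_from = End, values_from = n, values_fill = 0)
# one row: Days = "Day1", C = 1
```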
