Proportion of non-zero values by multiple columns/groups

Proportion of non-zero values by multiple columns/groups - r

I have a question similar to Calculate proportion of positives values by group, but for grouping the average fraction by many columns. I'd like to get the proportion of non-zero values in "num" by "Year" and "season". Something that works for n# of columns, no matter where they are in the df in relation to each other.
My data:
> head(df)
# A tibble: 6 x 6
Year Month Day Station num season
<fct> <dbl> <dbl> <dbl> <dbl> <fct>
1 2017 1 3 266 4 DRY
2 2018 1 3 270 2 DRY
3 2018 1 3 301 1 DRY
4 2018 1 4 314 0 DRY
5 2018 2 4 402 0 DRY
6 2018 1 4 618 0 WET
I thought something like this would work, but I get a warning message:
> aggregate(df$num>0~df[,c(1,6)],FUN=mean) # Average proportion of num > 0 per year & season
Error in model.frame.default(formula = env_subset$den > 0 ~ env_subset[, :
invalid type (list) for variable 'env_subset[, c(1, 6)]'

With dplyr, I think this is what you want:
library(dplyr)
df %>% group_by(Year, season) %>%
summarize(prop_gt_0 = mean(num > 0), .groups = "drop")
# # A tibble: 3 × 3
# Year season prop_gt_0
# <int> <chr> <dbl>
# 1 2017 DRY 1
# 2 2018 DRY 0.5
# 3 2018 WET 0
It's usually better to refer to columns by name rather than by number, so, as you say it works "no matter where they are in the df".
You can still use aggregate--I prefer the formula interface for working with column names:
aggregate(num ~ Year + season, data = df, FUN = \(x) mean(x > 0))
# Year season num
# 1 2017 DRY 1.0
# 2 2018 DRY 0.5
# 3 2018 WET 0.0

Related

One row time lag using R

I have a data df as follows:
df <- data.frame(id = c(1,1,1,2,2,2,2,3,3),
year=c(2011,2012,2013,2010,2011,2012,2013,2012,2013),
points=c(45,69,79,53,13,12,11,89,91),
result = c(2,3,5,4,6,1,2,3,4))
But I want to make df as below:
df <- data.frame(id = c(1,1,2,2,2,3),
year=c(2011,2012,2010,2011,2012,2012),
points=c(45,69,53,13,12,89),
result = c(3,5,6,1,2,4))
Here, I want to do some regression with the response variable result. Since I want to estimate result, I have to delay the response variable result and leave the other dependent variable points. So, for my regression setting, result is the response variable and points is the dependent variable. In summary, I want to do time lag for result. Within each id, each last row should be removed because, there are no next result.
I simplified my problem for demonstration purpose. Is there any way to achieve this using R?

Tidyverse solution:
library(tidyverse)
df %>% group_by(id) %>% mutate(lead_result = lead(result)) %>% na.exclude
# A tibble: 6 x 5
# Groups: id [3]
id year points result lead_result
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2011 45 2 3
2 1 2012 69 3 5
3 2 2010 53 4 6
4 2 2011 13 6 1
5 2 2012 12 1 2
6 3 2012 89 3 4

A data.table solution:
library(data.table)
na.omit(setDT(df)[, result := shift(result, type = "lead"), by = "id"], "result")
Output
id year points result
<num> <num> <num> <num>
1: 1 2011 45 3
2: 1 2012 69 5
3: 2 2010 53 6
4: 2 2011 13 1
5: 2 2012 12 2
6: 3 2012 89 4

'no applicable method' for applying dbplyr's sql_render to a data.frame

I was testing an example from RStudio about "render SQL code" from dbplyr:
library(nycflights13)
ranked <- flights %>%
group_by(year, month, day) %>%
select(dep_delay) %>%
mutate(rank = rank(desc(dep_delay)))
dbplyr::sql_render(ranked)
But when run, it returns the following error message:
Error in UseMethod("sql_render") :
no applicable method for 'sql_render' applied to an object of class "c('grouped_df', 'tbl_df', 'tbl', 'data.frame')"
Can someone explain why?

When you are working on a "normal" data.frame, then it returns a frame, in which case sql_render is inappropriate (and will be very confused). If we work with just your initial code, then we can see that SQL has nothing to do with it:
library(dplyr)
library(nycflights)
ranked <- flights %>%
group_by(year, month, day) %>%
select(dep_delay) %>%
mutate(rank = rank(desc(dep_delay)))
ranked
# # A tibble: 336,776 x 5
# # Groups: year, month, day [365]
# year month day dep_delay rank
# <int> <int> <int> <dbl> <dbl>
# 1 2013 1 1 2 313
# 2 2013 1 1 4 276
# 3 2013 1 1 2 313
# 4 2013 1 1 -1 440
# 5 2013 1 1 -6 742
# 6 2013 1 1 -4 633
# 7 2013 1 1 -5 691
# 8 2013 1 1 -3 570
# 9 2013 1 1 -3 570
# 10 2013 1 1 -2 502.
# # ... with 336,766 more rows
But dbplyr won't be able to do something with that:
library(dbplyr)
sql_render(ranked)
# Error in UseMethod("sql_render") :
# no applicable method for 'sql_render' applied to an object of class "c('grouped_df', 'tbl_df', 'tbl', 'data.frame')"
If, however, we have that same flights data in a database, then we can do what you are expecting, with some minor changes.
# pgcon <- DBI::dbConnect(odbc::odbc(), ...) # to my local postgres instance
copy_to(pgcon, flights, name = "flights_table") # go get some coffee
flights_db <- tbl(pgcon, "flights_table")
ranked_db <- flights_db %>%
group_by(year, month, day) %>%
select(dep_delay) %>%
mutate(rank = rank(desc(dep_delay)))
# Adding missing grouping variables: `year`, `month`, `day`
We can see some initial data, showing the top 10 rows of what the query will eventually return:
ranked_db
# # Source: lazy query [?? x 5]
# # Database: postgres [postgres#localhost:/]
# # Groups: year, month, day
# year month day dep_delay rank
# <int> <int> <int> <dbl> <int64>
# 1 2013 1 1 NA 1
# 2 2013 1 1 NA 1
# 3 2013 1 1 NA 1
# 4 2013 1 1 NA 1
# 5 2013 1 1 853 5
# 6 2013 1 1 379 6
# 7 2013 1 1 290 7
# 8 2013 1 1 285 8
# 9 2013 1 1 260 9
# 10 2013 1 1 255 10
# # ... with more rows
and we can see what the real SQL query looks like:
sql_render(ranked_db)
# <SQL> SELECT "year", "month", "day", "dep_delay", RANK() OVER (PARTITION BY "year", "month", "day" ORDER BY "dep_delay" DESC) AS "rank"
# FROM "flights_table"
Realizing that, due to the way dbplyr operates, we don't know how many rows will be returned until we collect it:
nrow(ranked_db)
# [1] NA
res <- collect(ranked_db)
nrow(res)
# [1] 336776
res
# # A tibble: 336,776 x 5 # <--- no longer 'Source: lazy query [?? x 5]'
# # Groups: year, month, day [365]
# year month day dep_delay rank
# <int> <int> <int> <dbl> <int64>
# 1 2013 1 1 NA 1
# 2 2013 1 1 NA 1
# 3 2013 1 1 NA 1
# 4 2013 1 1 NA 1
# 5 2013 1 1 853 5
# 6 2013 1 1 379 6
# 7 2013 1 1 290 7
# 8 2013 1 1 285 8
# 9 2013 1 1 260 9
# 10 2013 1 1 255 10
# # ... with 336,766 more rows

Check the documentation of the package. So you can render a code with the SQL syntax.
Maybe the chunk of code below helps you:
library(dplyr)
library(SqlRender)
library(nycflights13)
ranked <- flights %>%
group_by(year, month, day) %>%
select(dep_delay) %>%
mutate(rank = rank(desc(dep_delay))) %>%
ungroup()
sql <- "SELECT * FROM #x WHERE month = #a;"
render(sql, x = ranked, a = 2)

Determine percentage of rows with missing values in a dataframe in R

I have a data frame with three variables and some missing values in one of the variables that looks like this:
subject <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2)
part <- c(0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3,0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3)
sad <- c(1,7,7,4,NA,NA,2,2,NA,2,3,NA,NA,2,2,1,NA,5,NA,6,6,NA,NA,3,3,NA,NA,5,3,NA,7,2)
df1 <- data.frame(subject,part,sad)
I have created a new data frame with the mean values of 'sad' per subject and part using a loop, like this:
columns<-c("sad.m",
"part",
"subject")
df2<-matrix(data=NA,nrow=1,ncol=length(columns))
df2<-data.frame(df2)
names(df2)<-columns
tn<-unique(df1$subject)
row=1
for (s in tn){
for (i in 0:3){
TN<-df1[df1$subject==s&df1$part==i,]
df2[row,"sad.m"]<-mean(as.numeric(TN$sad), na.rm = TRUE)
df2[row,"part"]<-i
df2[row,"subject"]<-s
row=row+1
}
}
Now I want to include an additional variable 'missing' in that indicates the percentage of rows per subject and part with missing values, so that I get df3:
subject <- c(1,1,1,1,2,2,2,2)
part<-c(0,1,2,3,0,1,2,3)
sad.m<-df2$sad.m
missing <- c(0,50,50,25,50,50,50,25)
df3 <- data.frame(subject,part,sad.m,missing)
I'd really appreciate any help on how to go about this!

It's best to try and avoid loops in R where possible, since they can get messy and tend to be quite slow. For this sort of thing, dplyr library is perfect and well worth learning. It can save you a lot of time.
You can create a data frame with both variables by first grouping by subject and part, and then performing a summary of the grouped data frame:
df2 = df1 %>%
dplyr::group_by(subject, part) %>%
dplyr::summarise(
sad_mean = mean(na.omit(sad)),
na_count = (sum(is.na(sad) / n()) * 100)
)
df2
# A tibble: 8 x 4
# Groups: subject [2]
subject part sad_mean na_count
<dbl> <dbl> <dbl> <dbl>
1 1 0 4.75 0
2 1 1 2 50
3 1 2 2.5 50
4 1 3 1.67 25
5 2 0 5.5 50
6 2 1 4.5 50
7 2 2 4 50
8 2 3 4 25

For each subject and part you can calculate mean of sad and calculate ratio of NA value using is.na and mean.
library(dplyr)
df1 %>%
group_by(subject, part) %>%
summarise(sad.m = mean(sad, na.rm = TRUE),
perc_missing = mean(is.na(sad)) * 100)
# subject part sad.m perc_missing
# <dbl> <dbl> <dbl> <dbl>
#1 1 0 4.75 0
#2 1 1 2 50
#3 1 2 2.5 50
#4 1 3 1.67 25
#5 2 0 5.5 50
#6 2 1 4.5 50
#7 2 2 4 50
#8 2 3 4 25
Same logic with data.table :
library(data.table)
setDT(df1)[, .(sad.m = mean(sad, na.rm = TRUE),
perc_missing = mean(is.na(sad)) * 100), .(subject, part)]

Try this dplyr approach to compute df3:
library(dplyr)
#Code
df3 <- df1 %>% group_by(subject,part) %>% summarise(N=100*length(which(is.na(sad)))/length(sad))
Output:
# A tibble: 8 x 3
# Groups: subject [2]
subject part N
<dbl> <dbl> <dbl>
1 1 0 0
2 1 1 50
3 1 2 50
4 1 3 25
5 2 0 50
6 2 1 50
7 2 2 50
8 2 3 25
And for full interaction with df2 you can use left_join():
#Left join
df3 <- df1 %>% group_by(subject,part) %>%
summarise(N=100*length(which(is.na(sad)))/length(sad)) %>%
left_join(df2)
Output:
# A tibble: 8 x 4
# Groups: subject [2]
subject part N sad.m
<dbl> <dbl> <dbl> <dbl>
1 1 0 0 4.75
2 1 1 50 2
3 1 2 50 2.5
4 1 3 25 1.67
5 2 0 50 5.5
6 2 1 50 4.5
7 2 2 50 4
8 2 3 25 4

Dplyr solution using slice and group

Ciao, Here is my replicating example.
a=c(1,2,3,4,5,6)
a1=c(15,17,17,16,14,15)
a2=c(0,0,1,1,1,0)
b=c(1,0,NA,NA,0,NA)
c=c(2010,2010,2010,2010,2010,2010)
d=c(1,1,0,1,0,NA)
e=c(2012,2012,2012,2012,2012,2012)
f=c(1,0,0,0,0,NA)
g=c(2014,2014,2014,2014,2014,2014)
h=c(1,1,0,1,0,NA)
i=c(2010,2012,2014,2012,2014,2014)
mydata = data.frame(a,a1,a2,b,c,d,e,f,g,h,i)
names(mydata) = c("id","age","gender","drop1","year1","drop2","year2","drop3","year3","drop4","year4")
mydata2 <- reshape(mydata, direction = "long", varying = list(c("year1","year2","year3","year4"), c("drop1","drop2","drop3","drop4")),v.names = c("year", "drop"), idvar = "X", timevar = "Year", times = c(1:4))
x1 = mydata2 %>%
group_by(id) %>%
slice(which(drop==1)[1])
x2 = mydata2 %>%
group_by(id) %>%
slice(which(drop==0)[1])
I have data "mydata2" which is tall such that every ID has many rows.
I want to make new data set "x" such that every ID has one row that is based on if they drop or not.
The first of drop1 drop2 drop3 drop4 that equals to 1, I want to take the year of that and put that in a variable dropYEAR. If none of drop1 drop2 drop3 drop4 equals to 1 I want to put the last data point in year1 year2 year3 year4 in the variable dropYEAR.
Ultimately every ID should have 1 row and I want to create 2 new columns: didDROP equals to 1 if the ID ever dropped or 0 if the ID did not ever drop. dropYEAR equals to the year of drop if didDROP equals to 1 or equals to the last reported year1 year2 year3 year4 if the ID did not ever drop. I try to do this in dplyr but this gives part of what I want only because it gets rid of ID values that equals to 0.
This is desired output, thank you to #Wimpel

First mydata2 %>% arrange(id) to understand the dataset, then using dplyr first and lastwe can pull the first year where drop==1 and the last year in case of drop never get 1 where drop is not null. Usingcase_when to check didDROP as it has a nice magic in dealing with NAs.
library(dplyr)
mydata2 %>% group_by(id) %>%
mutate(dropY=first(year[!is.na(drop) & drop==1]),
dropYEAR=if_else(is.na(dropY), last(year[!is.na(drop)]),dropY)) %>%
slice(1)
#Update
mydata2 %>% group_by(id) %>%
mutate(dropY=first(year[!is.na(drop) & drop==1]),
dropYEAR=if_else(is.na(dropY), last(year),dropY),
didDROP=case_when(any(drop==1) ~ 1, #Return 1 if there is any drop=1 o.w it will return 0
TRUE ~ 0)) %>%
select(-dropY) %>% slice(1)
# A tibble: 6 x 9
# Groups: id [6]
id age gender Year year drop X dropYEAR didDROP
<dbl> <dbl> <dbl> <int> <dbl> <dbl> <int> <dbl> <dbl>
1 1 15 0 1 2010 1 1 2010 1
2 2 17 0 1 2010 0 2 2012 1
3 3 17 1 1 2010 NA 3 2014 0
4 4 16 1 1 2010 NA 4 2012 1
5 5 14 1 1 2010 0 5 2014 0
6 6 15 0 1 2010 NA 6 2014 0
I hope this what you're looking for.

You can sort by id, drop and year, conditionally on dropping or not:
library(dplyr)
mydata2 %>%
mutate(drop=ifelse(is.na(drop),0,drop)) %>%
arrange(id,-drop,year*(2*drop-1)) %>%
group_by(id) %>%
slice(1) %>%
select(id,age,gender,didDROP=drop,dropYEAR=year)
# A tibble: 6 x 5
# Groups: id [6]
id age gender didDROP dropYEAR
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 15 0 1 2010
2 2 17 0 1 2012
3 3 17 1 0 2014
4 4 16 1 1 2012
5 5 14 1 0 2014
6 6 15 0 0 2014

Identify repeats in r

Hi I have a dataframe and it looks like this:
test = data.frame("Year" = c("2015","2015","2016","2017","2018"),
"UserID" = c(1,2,1,1,3), "PurchaseValue" = c(1,5,3,3,5))
where "Year" is the time of purchase and "UserID" is the buyer.
I want to create a variable "RepeatedPurchase" that gives "1" if it is a repeated purchase and else 0 (if it is the only purchase/ if it is the first time purchase).
Thus, the desired output would look like this:
I tried to achieve this by first creating a variable "Se" that tells if that purchase is the 1st/ 2nd/ 3rd... purchase of that buyer but my code didn't work. Wondering what's wrong with my code or is there a better way I can identify repeated purchase? Thanks!
library(dplyr)
df %>% arrange(UserID, Year) %>% group_by(UserID) %>% mutate(Se = seq(n())) %>% ungroup()

You do not need dplyr. You can use duplicated() as following:
test=data.frame("Year" = c("2015","2015","2016","2017","2018"), "UserID" = c(1,2,1,1,3), "PurchaseValue" = c(1,5,3,3,5))
repeated<-duplicated(test$UserID)
# [1] FALSE FALSE TRUE TRUE FALSE
test$RepeatedPurchase<-ifelse(repeated==T,1,0)
test
# Year UserID PurchaseValue RepeatedPurchase
# 1 2015 1 1 0
# 2 2015 2 5 0
# 3 2016 1 3 1
# 4 2017 1 3 1
# 5 2018 3 5 0
Cheers!,

We can start by counting the number of purchases for each UserID and assign 1 when it exceeds 1
test %>% group_by(UserID) %>% mutate(RepeatedPurchase = ifelse(1:n()>1, 1, 0))
# A tibble: 5 x 4
# Groups: UserID [3]
Year UserID PurchaseValue Repeatedpurchase
<fct> <dbl> <dbl> <dbl>
1 2015 1.00 1.00 0
2 2015 2.00 5.00 0
3 2016 1.00 3.00 1.00
4 2017 1.00 3.00 1.00
5 2018 3.00 5.00 0

Here is another dplyr solution. We can group_by the UserID and PurchaseValue, and then use as.integer(n() > 1) to evaluate if the count is larger than 1.
library(dplyr)
test2 <- test %>%
group_by(UserID, PurchaseValue) %>%
mutate(RepeatedPurchase = as.integer(n() > 1)) %>%
ungroup()
test2
# # A tibble: 5 x 4
# Year UserID PurchaseValue RepeatedPurchase
# <fct> <dbl> <dbl> <int>
# 1 2015 1 1 0
# 2 2015 2 5 0
# 3 2016 1 3 1
# 4 2017 1 3 1
# 5 2018 3 5 0

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Proportion of non-zero values by multiple columns/groups - r

Related

One row time lag using R

'no applicable method' for applying dbplyr's sql_render to a data.frame

Determine percentage of rows with missing values in a dataframe in R

Dplyr solution using slice and group

Identify repeats in r

Categories

Resources