Discrepancy in dplyr in R: n() and length(variable_name) giving different answers after group_by

I thought that, for any data frame, using group_by and then calling n(), or using group_by and then calling length(<any variable in the data frame>), should give the same answer.
However, today I noticed that this is not the case.
I am not allowed to post the data, but here is the code. Can someone explain why total_count and c2 are not the same?
Please note that in this data frame, WAVE_NO and REF_PERIOD_WAVE give rise to the same groups; I grouped by both just so the result prints nicely. Also, DATE_OF_INTERVIEW is all NA for WAVE_NO 1 to 24.
library(dplyr)
library(RMySQL)
con <- dbConnect(dbDriver("MySQL"), host = Sys.getenv("mydb"), db = "hhd", user = Sys.getenv("MY_USER"), password = Sys.getenv("MY_PASSWORD"))
dbListTables(con)
asp <- tbl(con,"my_table")
> asp %>% group_by(WAVE_NO, REF_PERIOD_WAVE) %>%
    summarise(total_count = n(), c2 = length(DATE_OF_INTERVIEW)) %>% as.data.frame
`summarise()` has grouped output by 'WAVE_NO'. You can override using the `.groups` argument.
WAVE_NO REF_PERIOD_WAVE total_count c2
1 1 W1 2014 166744 NA
2 2 W2 2014 160705 NA
3 3 W3 2014 157442 NA
4 4 W1 2015 158443 NA
5 5 W2 2015 158666 NA
6 6 W3 2015 158624 NA
7 7 W1 2016 158624 NA
8 8 W2 2016 159778 NA
9 9 W3 2016 160511 NA
10 10 W1 2017 161167 NA
11 11 W2 2017 160847 NA
12 12 W3 2017 168165 NA
13 13 W1 2018 169215 NA
14 14 W2 2018 172365 NA
15 15 W3 2018 173181 NA
16 16 W1 2019 174405 NA
17 17 W2 2019 174405 NA
18 18 W3 2019 174405 NA
19 19 W1 2020 174405 NA
20 20 W2 2020 174405 NA
21 21 W3 2020 174405 NA
22 22 W1 2021 176661 NA
23 23 W2 2021 178677 NA
24 24 W3 2021 178677 NA
25 25 W1 2022 178677 11
26 26 W2 2022 178677 11

The problem is that while n() translates to COUNT(*) in MySQL, length is translated literally: it becomes the SQL length() function, which returns the length of a string, not the number of rows in the group:
library(dbplyr)
library(dplyr)
md <- lazy_frame(a = gl(5, 3), b = rnorm(15), con = simulate_mysql())
md %>%
  group_by(a) %>%
  summarize(n = n(), len = length(b))
# <SQL>
# SELECT `a`, COUNT(*) AS `n`, length(`b`) AS `len`
# FROM `df`
# GROUP BY `a`
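If you want the per-group count of non-missing values computed on the database side, one workaround (a sketch, not the only option) is to ask for it explicitly with sum(!is.na(...)), which dbplyr can translate to SQL (roughly SUM(NOT(`b` IS NULL)); the exact rendering may vary by dbplyr version). Alternatively, collect() the result locally first, after which length() behaves as usual:
md %>%
  group_by(a) %>%
  # count rows where b is not NULL; na.rm = TRUE silences dbplyr's NA note
  summarize(n = n(), n_not_na = sum(!is.na(b), na.rm = TRUE))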

Related

How to compare two or more lines in a long dataset to create a new variable?

I have a long-format dataset like this:
ID year Address Classification
1  2020 A       NA
1  2021 A       NA
1  2022 B       B_
2  2020 C       NA
2  2021 D       NA
2  2022 F       F_
3  2020 G       NA
3  2021 G       NA
3  2022 G       G_
4  2020 H       NA
4  2021 I       NA
4  2022 H       H_
I have a Classification for each subject in year 2022, based on their address in 2022. This classification was not made in the other years, but I would like to generalize it to those years: if a subject's address in another year is the same as the address they hold in 2022, the NA in the 'Classification' variable for that year should be replaced with the value they got in 2022.
I have been trying to convert to wide data and compare the lines more directly with dplyr, but it is not working properly because of the NA values, and it doesn't seem like a smart way to reach the final dataset I want. I would like to get the 'Aim' column in my dataset as shown below:
ID year Address Classification Aim
1  2020 A       NA             NA
1  2021 A       NA             NA
1  2022 B       B_             B_
2  2020 C       NA             NA
2  2021 D       NA             NA
2  2022 F       F_             F_
3  2020 G       NA             G_
3  2021 G       NA             G_
3  2022 G       G_             G_
4  2020 H       NA             H_
4  2021 I       NA             NA
4  2022 H       H_             H_
You can use tidyr::fill with dplyr::group_by for this. Note that you need to specify the direction: the default is "down", which would just propagate the leading NAs (NA is the first value in each group), so fill "up" to pull the 2022 value back instead.
library(dplyr)
library(tidyr)
df %>%
  group_by(ID, Address) %>%
  tidyr::fill(Classification, .direction = "up")
Output:
# ID year Address Classification
# <int> <int> <chr> <chr>
# 1 1 2020 A NA
# 2 1 2021 A NA
# 3 1 2022 B B_
# 4 2 2020 C NA
# 5 2 2021 D NA
# 6 2 2022 F F_
# 7 3 2020 G G_
# 8 3 2021 G G_
# 9 3 2022 G G_
#10 4 2020 H H_
#11 4 2021 I NA
#12 4 2022 H H_
Data
df <- read.table(text = "ID year Address Classification
1 2020 A NA
1 2021 A NA
1 2022 B B_
2 2020 C NA
2 2021 D NA
2 2022 F F_
3 2020 G NA
3 2021 G NA
3 2022 G G_
4 2020 H NA
4 2021 I NA
4 2022 H H_", header = TRUE)

R inserting rows between dates by group based on second column

I have a df that looks like this
ID FINAL_DT START_DT
23 NA 2020-03-20
25 NA 2020-04-10
29 2020-02-02 2020-01-23
30 NA 2020-01-02
What I would like to do is, for each ID, add a row for every month starting from START_DT and ending at whichever comes first: FINAL_DT or the current date. The expected output would be the following:
ID FINAL_DT START_DT ACTIVE_MONTH
23 NA 2020-03-20 2020-03
23 NA NA 2020-04
23 NA NA 2020-05
25 NA 2020-04-10 2020-04
25 NA NA 2020-05
29 2020-02-02 2020-01-23 2020-01
29 2020-02-02 NA 2020-02
30 NA 2020-01-02 2020-01
30 NA NA 2020-02
30 NA NA 2020-03
30 NA NA 2020-04
30 NA NA 2020-05
I have the following code, which works but does not account for FINAL_DT:
current_date = as.Date(Sys.Date())
enroll <- enroll %>%
  group_by(ID) %>%
  complete(START_DT = seq(START_DT, current_date, by = "month"))
I have tried the following, but I get an error, I believe due to the NAs:
current_date = as.Date(Sys.Date())
enroll <- enroll %>%
  group_by(ID) %>%
  complete(START_DT = seq(START_DT, min(FINAL_DT, current_date), by = "month"))
The day of the month also does not matter; I am not sure whether it would be easier to drop it before or after.
Here is another approach. You can use floor_date to get the first day of the month to use in your sequence of months. Then, you can include the full sequence to today's date, and filter based on FINAL_DT. You can use as.yearmon from zoo if you'd like a month/year object for month.
library(zoo)
library(tidyr)
library(dplyr)
library(lubridate)
current_date = as.Date(Sys.Date())
enroll %>%
  mutate(ACTIVE_MONTH = floor_date(START_DT, unit = "month")) %>%
  group_by(ID) %>%
  complete(ACTIVE_MONTH = seq.Date(floor_date(START_DT, unit = "month"), current_date, by = "month")) %>%
  filter(ACTIVE_MONTH <= first(FINAL_DT) | is.na(first(FINAL_DT))) %>%
  ungroup() %>%
  mutate(ACTIVE_MONTH = as.yearmon(ACTIVE_MONTH))
Output
# A tibble: 12 x 4
ID ACTIVE_MONTH FINAL_DT START_DT
<dbl> <yearmon> <date> <date>
1 23 Mar 2020 NA 2020-03-20
2 23 Apr 2020 NA NA
3 23 May 2020 NA NA
4 25 Apr 2020 NA 2020-04-10
5 25 May 2020 NA NA
6 29 Jan 2020 2020-02-02 2020-01-23
7 29 Feb 2020 NA NA
8 30 Jan 2020 NA 2020-01-02
9 30 Feb 2020 NA NA
10 30 Mar 2020 NA NA
11 30 Apr 2020 NA NA
12 30 May 2020 NA NA
Here is an approach that returns rows for each MONTH with the help of lubridate.
library(dplyr)
library(tidyr)
library(lubridate)
current_date = as.Date(Sys.Date())
enroll %>%
  mutate(MONTH = month(START_DT)) %>%
  group_by(ID) %>%
  complete(MONTH = seq(MONTH, min(month(FINAL_DT)[!is.na(FINAL_DT)], month(current_date))))
# A tibble: 12 x 4
# Groups: ID [4]
# ID MONTH FINAL_DT START_DT
# <int> <dbl> <fct> <fct>
# 1 23 3 NA 2020-03-20
# 2 23 4 NA NA
# 3 23 5 NA NA
# 4 25 4 NA 2020-04-10
# 5 25 5 NA NA
# 6 29 1 2020-02-02 2020-01-23
# 7 29 2 NA NA
# 8 30 1 NA 2020-01-02
# 9 30 2 NA NA
#10 30 3 NA NA
#11 30 4 NA NA
#12 30 5 NA NA
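Data
For reference, a reproducible version of enroll, reconstructed from the table in the question (the question did not provide code, so the Date column types are an assumption):
enroll <- data.frame(
  ID = c(23, 25, 29, 30),
  FINAL_DT = as.Date(c(NA, NA, "2020-02-02", NA)),
  START_DT = as.Date(c("2020-03-20", "2020-04-10", "2020-01-23", "2020-01-02"))
)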

how to get previous matching value

How do I get the value of "a" when the value of b matches the most recent previous value of b? E.g. row 3 of b matches the earlier row 1, and row 6 matches row 4.
df <- data.frame(year = c(2013,2013,2014,2014,2014,2015,2015,2015,2016,2016,2016),
                 a = c(10,11,NA,13,22,NA,19,NA,10,15,NA),
                 b = c(30.133,29,30.1223,33,17,33,11,17,14,13.913,14))
year a b *NEW*
2013 10 30.133 NA
2013 11 29 NA
2014 NA 30.1223 10
2014 13 33 NA
2014 22 17 NA
2015 NA 33 13
2015 19 11 NA
2015 NA 17 22
2016 10 14 NA
2016 15 13.913 10
2016 NA 14 15
Thanks
For the OP's example case
One way is to use the duplicated() function.
# Input dataframe
df <- data.frame(year = c(2013,2013,2014,2014,2014,2015,2015,2015,2016,2016,2016),
                 a = c(10,11,NA,13,22,NA,19,NA,10,15,NA),
                 b = c(30,29,30,33,17,33,11,17,14,14,14))
# creating a new column with default values
df$NEW <- NA
# updating the value using the previous matching position
df$NEW[duplicated(df$b)] <- df$a[duplicated(df$b, fromLast = TRUE)]
# expected output
df
# year a b NEW
# 1 2013 10 30 NA
# 2 2013 11 29 NA
# 3 2014 NA 30 10
# 4 2014 13 33 NA
# 5 2014 22 17 NA
# 6 2015 NA 33 13
# 7 2015 19 11 NA
# 8 2015 NA 17 22
# 9 2016 10 14 NA
# 10 2016 15 14 10
# 11 2016 NA 14 15
General purpose usage
The above solution fails when the duplicates are not in sequential order. Following #DavidArenburg's advice, I have changed the fourth element to df$b[4] <- 14. The general solution requires another handy function, order(), and should work for the different possible cases.
# Input dataframe
df <- data.frame(year = c(2013,2013,2014,2014,2014,2015,2015,2015,2016,2016,2016),
                 a = c(10,11,NA,13,22,NA,19,NA,10,15,NA),
                 b = c(30,29,30,14,17,33,11,17,14,14,14))
# creating a new column with default values
df$NEW <- NA
# sort the matching column
df <- df[order(df$b),]
# updating the value using the previous matching position
df$NEW[duplicated(df$b)] <- df$a[duplicated(df$b, fromLast = TRUE)]
# To original order
df <- df[order(as.integer(rownames(df))),]
# expected output
df
# year a b NEW
# 1 2013 10 30 NA
# 2 2013 11 29 NA
# 3 2014 NA 30 10
# 4 2014 13 14 NA
# 5 2014 22 17 NA
# 6 2015 NA 33 NA
# 7 2015 19 11 NA
# 8 2015 NA 17 22
# 9 2016 10 14 13
# 10 2016 15 14 10
# 11 2016 NA 14 15
Here, the solution is based on the base package's functions. I am sure there are other ways of doing this using other packages.
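For instance, a compact dplyr sketch of the same idea (one possibility, checked only against the example data): rows with equal b form a group in their original order, so the "most recent previous matching value" of a is simply lag(a) within the group. This also handles non-sequential duplicates:
library(dplyr)
df %>%
  group_by(b) %>%            # rows with the same b, kept in row order
  mutate(NEW = lag(a)) %>%   # the previous occurrence's a within each group
  ungroup()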

how to replace missing values with previous year's binned mean

I have a data frame as below
Bin_p1 and Bin_f1 are calculated by me with the cut function:
Bins <- function(x) cut(x, breaks = c(0, seq(1, 1000, by = 5)), labels = 1:200)
binned <- as.data.frame(sapply(df[,-1], Bins))
colnames(binned) <- paste("Bin", colnames(binned), sep = "_")
df <- cbind(df, binned)
Now, how do I calculate the mean for the previous two years within the same bin and use it to replace the NA values?
For example: at row 5, p1 is NA and f1 is 30 with corresponding bin 7. The NA should be replaced with the previous two years' mean of p1 in that same bin (7), i.e.
df
ID year p1 f1 Bin_p1 Bin_f1
1 2013 20 30 5 7
2 2013 24 29 5 7
3 2014 10 16 2 3
4 2014 11 17 2 3
5 2015 NA 30 NA 7
6 2016 10 NA 2 NA
df1
ID year p1 f1 Bin_p1 Bin_f1
1 2013 20 30 5 7
2 2013 24 29 5 7
3 2014 10 16 2 3
4 2014 11 17 2 3
5 2015 **22** 30 NA 7
6 2016 10 **16.5** 2 NA
Thanks in advance
I believe the following code produces the desired output. There's probably a more elegant way than using mean(rev(lag(f1))[1:2]) to get the average of the last two preceding values, but this should do the trick anyway.
library(dplyr)
df %>%
  arrange(year) %>%
  mutate_at(c("p1", "f1"), "as.double") %>%
  group_by(Bin_p1) %>%
  mutate(f1 = ifelse(is.na(f1), mean(rev(lag(f1))[1:2]), f1)) %>%
  group_by(Bin_f1) %>%
  mutate(p1 = ifelse(is.na(p1), mean(rev(lag(p1))[1:2]), p1)) %>%
  ungroup
and the output is:
# A tibble: 6 x 6
ID year p1 f1 Bin_p1 Bin_f1
<int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2013 20 30.0 5 7
2 2 2013 24 29.0 5 7
3 3 2014 10 16.0 2 3
4 4 2014 11 17.0 2 3
5 5 2015 22 30.0 NA 7
6 6 2016 10 16.5 2 NA
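A possibly clearer way to express "mean of the last two observed values" (a sketch that assumes, as in the example, that the missing value belongs to the most recent year within its bin) is tail() on the non-NA values:
library(dplyr)
df %>%
  arrange(year) %>%
  mutate_at(c("p1", "f1"), "as.double") %>%
  group_by(Bin_p1) %>%
  # average of the last two non-missing f1 values in the bin
  mutate(f1 = ifelse(is.na(f1), mean(tail(f1[!is.na(f1)], 2)), f1)) %>%
  group_by(Bin_f1) %>%
  mutate(p1 = ifelse(is.na(p1), mean(tail(p1[!is.na(p1)], 2)), p1)) %>%
  ungroup()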

Removing rows of data frame if number of NA in a column is larger than 3

I have a data frame (panel data). The Ctry column indicates the name of the country in my data frame. If in any column (for example Carx) the number of NAs is larger than 3, I want to drop the related country from my data frame. For example,
Country A has 2 NA
Country B has 4 NA
Country C has 3 NA
I want to drop country B from my data frame. My data frame looks like this (this is for illustration; my actual data frame is very large):
Ctry year Carx
A 2000 23
A 2001 18
A 2002 20
A 2003 NA
A 2004 24
A 2005 18
B 2000 NA
B 2001 NA
B 2002 NA
B 2003 NA
B 2004 18
B 2005 16
C 2000 NA
C 2001 NA
C 2002 24
C 2003 21
C 2004 NA
C 2005 24
I want to create a data frame like this:
Ctry year Carx
A 2000 23
A 2001 18
A 2002 20
A 2003 NA
A 2004 24
A 2005 18
C 2000 NA
C 2001 NA
C 2002 24
C 2003 21
C 2004 NA
C 2005 24
A fairly straightforward way in base R is to use sum(is.na(.)) along with ave, to do the counting, like this:
with(mydf, ave(Carx, Ctry, FUN = function(x) sum(is.na(x))))
# [1] 1 1 1 1 1 1 4 4 4 4 4 4 3 3 3 3 3 3
Once you have that, subsetting is easy:
mydf[with(mydf, ave(Carx, Ctry, FUN = function(x) sum(is.na(x)))) <= 3, ]
# Ctry year Carx
# 1 A 2000 23
# 2 A 2001 18
# 3 A 2002 20
# 4 A 2003 NA
# 5 A 2004 24
# 6 A 2005 18
# 13 C 2000 NA
# 14 C 2001 NA
# 15 C 2002 24
# 16 C 2003 21
# 17 C 2004 NA
# 18 C 2005 24
You can use the by() function to group by Ctry and count the NAs of each group:
DF <- read.csv(
text='Ctry,year,Carx
A,2000,23
A,2001,18
A,2002,20
A,2003,NA
A,2004,24
A,2005,18
B,2000,NA
B,2001,NA
B,2002,NA
B,2003,NA
B,2004,18
B,2005,16
C,2000,NA
C,2001,NA
C,2002,24
C,2003,21
C,2004,NA
C,2005,24',
stringsAsFactors=F)
res <- by(data = DF$Carx, INDICES = DF$Ctry, FUN = function(x) sum(is.na(x)))
validCtry <- names(res)[res <= 3]
DF[DF$Ctry %in% validCtry, ]
# Ctry year Carx
#1 A 2000 23
#2 A 2001 18
#3 A 2002 20
#4 A 2003 NA
#5 A 2004 24
#6 A 2005 18
#13 C 2000 NA
#14 C 2001 NA
#15 C 2002 24
#16 C 2003 21
#17 C 2004 NA
#18 C 2005 24
EDIT:
If you have more columns to check, you could adapt the previous code as follows:
res <- by(data = DF, INDICES = DF$Ctry,
          FUN = function(x) {
            sum(is.na(x$Carx)) <= 3 &&
              sum(is.na(x$Barx)) <= 3 &&
              sum(is.na(x$Tarx)) <= 3
          })
validCtry <- names(res)[res]
DF[DF$Ctry %in% validCtry, ]
where, of course, you may change the condition in FUN according to your needs.
Since you mention that your data is "very huge" (whatever that means exactly), you could try a solution with dplyr and see if it's perhaps faster than the solutions in base R. If the other solutions are fast enough, just ignore this one.
library(dplyr)
newdf <- df %>% group_by(Ctry) %>% filter(sum(is.na(Carx)) <= 3)
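If, as in the EDIT above, several columns (Carx, Barx, Tarx) must all satisfy the rule, a dplyr sketch of the same filter using the newer if_all() helper (available from dplyr 1.0.4 onwards) could look like this:
library(dplyr)
df %>%
  group_by(Ctry) %>%
  # keep a country only if every listed column has at most 3 NAs
  filter(if_all(c(Carx, Barx, Tarx), ~ sum(is.na(.x)) <= 3)) %>%
  ungroup()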
