How to compare two or more lines in a long dataset to create a new variable?

I have a long format dataset like that:
ID year Address Classification
 1 2020 A       NA
 1 2021 A       NA
 1 2022 B       B_
 2 2020 C       NA
 2 2021 D       NA
 2 2022 F       F_
 3 2020 G       NA
 3 2021 G       NA
 3 2022 G       G_
 4 2020 H       NA
 4 2021 I       NA
 4 2022 H       H_
Each subject has a Classification for 2022, based on their address in 2022; the classification was not made for other years. I would like to extend it backwards: whenever a subject's address in an earlier year is the same as the address they hold in 2022, the NA in 'Classification' for that year should be replaced with their 2022 value.
I have tried converting to wide format and comparing the rows directly with dplyr, but the NA values get in the way, and it does not seem like a smart route to the dataset I want. I would like to get the 'Aim' column shown below:
ID year Address Classification Aim
 1 2020 A       NA             NA
 1 2021 A       NA             NA
 1 2022 B       B_             B_
 2 2020 C       NA             NA
 2 2021 D       NA             NA
 2 2022 F       F_             F_
 3 2020 G       NA             G_
 3 2021 G       NA             G_
 3 2022 G       G_             G_
 4 2020 H       NA             H_
 4 2021 I       NA             NA
 4 2022 H       H_             H_

I use tidyr::fill with dplyr::group_by for this. Note that you need to specify the direction: the default is "down", which would fill everything with NA here, since NA is the first value in each group.
library(dplyr)
library(tidyr)
df %>%
group_by(ID, Address) %>%
tidyr::fill(Classification, .direction = "up")
Output:
# ID year Address Classification
# <int> <int> <chr> <chr>
# 1 1 2020 A NA
# 2 1 2021 A NA
# 3 1 2022 B B_
# 4 2 2020 C NA
# 5 2 2021 D NA
# 6 2 2022 F F_
# 7 3 2020 G G_
# 8 3 2021 G G_
# 9 3 2022 G G_
#10 4 2020 H H_
#11 4 2021 I NA
#12 4 2022 H H_
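If you prefer to keep the original Classification column untouched and add a separate Aim column, as in the desired output, a small variation works (a sketch, assuming the same df as below):
library(dplyr)
library(tidyr)
df %>%
  group_by(ID, Address) %>%            # fill only within the same ID + Address
  mutate(Aim = Classification) %>%     # copy first, so Classification itself stays as-is
  fill(Aim, .direction = "up") %>%     # propagate the 2022 value to earlier matching years
  ungroup()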
Data
df <- read.table(text = "ID year Address Classification
1 2020 A NA
1 2021 A NA
1 2022 B B_
2 2020 C NA
2 2021 D NA
2 2022 F F_
3 2020 G NA
3 2021 G NA
3 2022 G G_
4 2020 H NA
4 2021 I NA
4 2022 H H_", header = TRUE)

Related

Pivot/merge some columns in a dataset while keeping remaining columns

I have a dataset that resembles this (but with more columns):
table <- "year site square triangle circle
1 2019 A 3 9 5
2 2019 A 5 NA 34
3 2019 B 0 0 69
4 2019 B NA 111 2
5 2020 C 0 45 3
6 2020 C 29 0 NA
7 2020 D NA 0 1
8 2021 D 3 NA 4
9 2021 D 158 5 0
10 2021 D 2 9 0"
df <- read.table(text=table, header = TRUE)
df
I want to pivot a portion of the table so that it resembles this:
year site type count
1 2019 A square 3
2 2019 A triangle 9
3 2019 A circle 5
4 2019 A square 5
5 2019 A triangle NA
6 2019 A circle 34
7 2019 B square 0
8 2019 B triangle 0
9 2019 B circle 69
(and so on)
I've tried solutions from here, but they don't deal with the counts, so I lose those values.
For example, the code below leaves me with NAs in each column, and I lose the count values:
df2 <- df[1:2]
df2$type <- apply(df[3:5], 1, function(k) names(df[3:5])[k])
df2
year site type
1 2019 A circle, NA, NA
2 2019 A NA, NA, NA
3 2019 B NA
4 2019 B NA, NA, triangle
5 2020 C NA, circle
6 2020 C NA, NA
7 2020 D NA, square
8 2021 D circle, NA, NA
9 2021 D NA, NA
10 2021 D triangle, NA
I've also tried tidyr's gather() function, but this won't allow me to keep multiple columns.
library(tidyr)
df3 <- gather(df, year, site, `square`:`circle`)
head(df3)
year site
1 square 3
2 square 5
3 square 0
4 square NA
5 square 0
6 square 29
My only idea is to make a new column of unique numbers (1-X) in my dataframe, use that with gather(), then merge the original dataframe and the new dataframe by that unique ID, then remove the unwanted columns. This would work, but I'm wondering if there's a better, cleaner solution?
How about tidyr::pivot_longer:
library(tidyr)
tidyr::pivot_longer(df, -c(year, site))
#> # A tibble: 30 x 4
#> year site name value
#> <int> <chr> <chr> <int>
#> 1 2019 A square 3
#> 2 2019 A triangle 9
#> 3 2019 A circle 5
#> 4 2019 A square 5
#> 5 2019 A triangle NA
#> 6 2019 A circle 34
#> 7 2019 B square 0
#> 8 2019 B triangle 0
#> 9 2019 B circle 69
#> 10 2019 B square NA
#> # … with 20 more rows

How to find first non-NA leading or lagging value?

I have rows grouped by ID and I want to calculate how much time passes until the next event occurs (if it does occur for that ID).
Here is example code:
year <- c(2015, 2016, 2017, 2018, 2015, 2016, 2017, 2018, 2015, 2016, 2017, 2018)
id <- c(rep("A", times = 4), rep("B", times = 4), rep("C", times = 4))
event_date <- c(NA, 2016, NA, 2018, NA, NA, NA, NA, 2015, NA, NA, 2018)
df<- as.data.frame(cbind(id, year, event_date))
df
id year event_date
1 A 2015 <NA>
2 A 2016 2016
3 A 2017 <NA>
4 A 2018 2018
5 B 2015 <NA>
6 B 2016 <NA>
7 B 2017 <NA>
8 B 2018 <NA>
9 C 2015 2015
10 C 2016 <NA>
11 C 2017 <NA>
12 C 2018 2018
Here is what I want the output to look like:
id year event_date years_till_next_event
1 A 2015 <NA> 1
2 A 2016 2016 0
3 A 2017 <NA> 1
4 A 2018 2018 0
5 B 2015 <NA> <NA>
6 B 2016 <NA> <NA>
7 B 2017 <NA> <NA>
8 B 2018 <NA> <NA>
9 C 2015 2015 0
10 C 2016 <NA> 2
11 C 2017 <NA> 1
12 C 2018 2018 0
Person B does not have the event, so it is not calculated. For the others, I want to calculate the difference between the leading event_date (ignoring NAs, if it exists) and the year.
I want to calculate years_till_next_event such that 1) if there is an event_date for a row, event_date - year. 2) If not, then return the first non-NA leading value - year. I'm having difficulty with the 2nd part of the logic, keeping in mind the event could occur not at all or every year, by ID.
Using zoo with dplyr. na.locf0() with fromLast = TRUE fills each NA with the next non-NA value within the group ("next observation carried backward"), so subtracting year gives the distance to the next event:
library(dplyr)
library(zoo)
df %>%
  group_by(id) %>%
  mutate(years_till_next_event = na.locf0(event_date, fromLast = TRUE) - year)
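To see what the fill step does before the subtraction, here is a small standalone illustration (assuming zoo is loaded):
library(zoo)
# next observation carried backward: each NA becomes the next non-NA value
na.locf0(c(NA, 2016, NA, 2018), fromLast = TRUE)
# [1] 2016 2016 2018 2018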
Here is a data.table option using nafill() with type = "nocb" (next observation carried backward):
library(data.table)
setDT(df)[, years_till_next_event := nafill(event_date, type = "nocb") - year, id]
which gives
id year event_date years_till_next_event
1: A 2015 NA 1
2: A 2016 2016 0
3: A 2017 NA 1
4: A 2018 2018 0
5: B 2015 NA NA
6: B 2016 NA NA
7: B 2017 NA NA
8: B 2018 NA NA
9: C 2015 2015 0
10: C 2016 NA 2
11: C 2017 NA 1
12: C 2018 2018 0
You can create a new column that holds the row number within each id wherever event_date is not NA, fill the NA values upwards from the next non-NA row, and then subtract the current row number from it.
library(dplyr)
df %>%
  group_by(id) %>%
  mutate(years_till_next_event = replace(row_number(), is.na(event_date), NA)) %>%
  tidyr::fill(years_till_next_event, .direction = 'up') %>%
  mutate(years_till_next_event = years_till_next_event - row_number()) %>%
  ungroup
# id year event_date years_till_next_event
# <chr> <dbl> <dbl> <int>
# 1 A 2015 NA 1
# 2 A 2016 2016 0
# 3 A 2017 NA 1
# 4 A 2018 2018 0
# 5 B 2015 NA NA
# 6 B 2016 NA NA
# 7 B 2017 NA NA
# 8 B 2018 NA NA
# 9 C 2015 2015 0
#10 C 2016 NA 2
#11 C 2017 NA 1
#12 C 2018 2018 0
data
# data.frame() keeps year and event_date numeric; the question's
# as.data.frame(cbind(...)) would coerce every column to character
df <- data.frame(id, year, event_date)

Merging two dataframes creates new missing observations

I have two dataframes with the following matching keys: year, region and province. They each have a set of variables (in this illustrative example I use x1 for df1 and x2 for df2) and both variables have several missing values on their own.
df1                               df2
year region province x1 ... xn    year region province x2 ... xn
2019      1        5 NA           2019      1        5 NA
2019      2        4 NA           2019      2        4 NA
2019      2        4 NA           2019      2        4 NA
2018      3        7 13           2018      3        7 13
2018      3        7 15           2018      3        7 15
2018      3        7 17           2018      3        7 17
I want to merge both dataframes such that they end up like this:
year region province x1 x2
2019 1 5 3 NA
2019 2 4 27 NA
2019 2 4 15 NA
2018 3 7 12 13
2018 3 7 NA 15
2018 3 7 NA 17
2017 4 9 NA 12
2017 4 9 19 30
2017 4 9 20 10
However, when doing so with merged_df <- merge(df1, df2, by=c("year","region","province"), all.x=TRUE), R seems to create a lot of additional missing values in the variable columns (x1 and x2) that were not there before. What is happening here? I have tried sorting both with df1 %>% arrange(province, -year) and df2 %>% arrange(province, -year), which is enough to give the two data frames a matching row order, only to find the same issue when running the merge command. I've tried a bunch of other things too, but nothing seems to work. R's output looks roughly like this:
year region province x1 x2
2019 1 5 NA NA
2019 2 4 NA NA
2019 2 4 NA NA
2018 3 7 NA NA
2018 3 7 NA NA
2018 3 7 NA NA
2017 4 9 15 NA
2017 4 9 19 30
2017 4 9 20 10
I have done this before; in fact, one of the data frames is itself the result of a merge in which I did not encounter this issue.
Perhaps the concept of merge() is not clear: merge() matches rows by the values of the key columns, not by row position, so any key combination present in one data frame but not in the other yields NA in the other data frame's columns. I include two examples with example data; I hope they help.
#Data
set.seed(123)
DF1 <- data.frame(year = rep(c(2017, 2018, 2019), 3),
                  region = rep(c(1, 2, 3), 3),
                  province = round(runif(9, 1, 5), 0),
                  x1 = rnorm(9, 3, 1.5))
DF2 <- data.frame(year = rep(c(2016, 2018, 2019), 3),
                  region = rep(c(1, 2, 3), 3),
                  province = round(runif(9, 1, 5), 0),
                  x2 = rnorm(9, 3, 1.5))
#Merge keeping all rows of DF1 (left join)
Merged1 <- merge(DF1,DF2,by=intersect(names(DF1),names(DF2)),all.x=T)
Merged1
year region province x1 x2
1 2017 1 2 2.8365510 NA
2 2017 1 3 3.7557187 NA
3 2017 1 5 4.9208323 NA
4 2018 2 4 2.8241371 NA
5 2018 2 5 6.7925048 1.460993
6 2018 2 5 0.4090941 1.460993
7 2019 3 1 5.5352765 NA
8 2019 3 3 3.8236451 4.256681
9 2019 3 3 3.2746239 4.256681
#Merge including all rows even without a match between ids (full join)
Merged2 <- merge(DF1,DF2,by=intersect(names(DF1),names(DF2)),all = T)
Merged2
year region province x1 x2
1 2016 1 3 NA 4.052034
2 2016 1 4 NA 2.062441
3 2016 1 5 NA 2.673038
4 2017 1 2 2.8365510 NA
5 2017 1 3 3.7557187 NA
6 2017 1 5 4.9208323 NA
7 2018 2 1 NA 0.469960
8 2018 2 2 NA 2.290813
9 2018 2 4 2.8241371 NA
10 2018 2 5 6.7925048 1.460993
11 2018 2 5 0.4090941 1.460993
12 2019 3 1 5.5352765 NA
13 2019 3 2 NA 1.398264
14 2019 3 3 3.8236451 4.256681
15 2019 3 3 3.2746239 4.256681
16 2019 3 4 NA 1.906663
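To see ahead of time why a left merge will produce NAs, it helps to check which key combinations in one data frame have no counterpart in the other before merging. A quick diagnostic sketch with dplyr, assuming the DF1/DF2 above:
library(dplyr)
# rows of DF1 whose (year, region, province) combination never occurs in DF2;
# after merge(..., all.x = TRUE) these rows are guaranteed to get NA in x2
anti_join(DF1, DF2, by = c("year", "region", "province"))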

R losing data with distinct()

I am using distinct() to remove duplicates within a combined dataset; however, I'm losing data because distinct() only keeps the first entry.
Example data frame "a"
SiteID PYear Habitat num.1
000901W 2011 W NA
001101W 2007 W NA
001801W 2005 W NA
002001W 2017 W NA
002401F 2006 F NA
002401F 2016 F NA
004001F 2006 F NA
004001W 2006 W NA
004101W 2007 W NA
004101W 2007 W 16
004701F 2017 F NA
006201F 2008 F NA
006501F 2009 F NA
006601W 2007 W 2
006601W 2007 W NA
006803F 2009 F NA
007310F 2018 F NA
007602W 2017 W NA
008103W 2011 W NA
008203F 2007 F 1
Coding:
a<-distinct(a,SiteID, .keep_all = TRUE)
I would like to know how to remove duplicates based on SiteID and num.1 removing duplicates however I dont want to get rid of duplicates that have number values in the num.1 column. For example, in the dataframe a 004101W and 006601W have multiple entries but I want to keep the integer rather than the NA.
(Thank you for updating with more representative sample data!)
a now has 20 rows, with 17 different SiteID values.
Three of those SiteIDs have multiple rows:
library(tidyverse)
a %>%
  add_count(SiteID) %>%
  filter(n > 1)
## A tibble: 6 x 5
# SiteID PYear Habitat num.1 n
# <chr> <int> <chr> <int> <int>
#1 002401F 2006 F NA 2 # Both have NA for num.1
#2 002401F 2016 F NA 2 # ""
#3 004101W 2007 W NA 2 # Drop
#4 004101W 2007 W 16 2 # Keep this one
#5 006601W 2007 W 2 2 # Keep this one
#6 006601W 2007 W NA 2 # Drop
If we want to prioritize the rows without NA in num.1, we can arrange by num.1 within each SiteID so that NAs come last for each SiteID; distinct() will then keep the row with a non-NA num.1.
(An alternative is also provided in case you want to keep the original sorting of a while still moving NA values in num.1 to the end: in the is.na(num.1) term, NAs evaluate to TRUE and therefore sort after provided values, which evaluate to FALSE.)
a %>%
arrange(SiteID, num.1) %>%
#arrange(SiteID, is.na(num.1)) %>% # Alternative to preserve orig order
distinct(SiteID, .keep_all = TRUE)
SiteID PYear Habitat num.1
1 000901W 2011 W NA
2 001101W 2007 W NA
3 001801W 2005 W NA
4 002001W 2017 W NA
5 002401F 2006 F NA # Kept first appearing row, since both NA num.1
6 004001F 2006 F NA
7 004001W 2006 W NA
8 004101W 2007 W 16 # Kept non-NA row
9 004701F 2017 F NA
10 006201F 2008 F NA
11 006501F 2009 F NA
12 006601W 2007 W 2 # Kept non-NA row
13 006803F 2009 F NA
14 007310F 2018 F NA
15 007602W 2017 W NA
16 008103W 2011 W NA
17 008203F 2007 F 1
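An equivalent grouped formulation, in case you prefer to make the per-SiteID choice explicit (a sketch, assuming the same a as below):
library(dplyr)
a %>%
  group_by(SiteID) %>%
  arrange(is.na(num.1), .by_group = TRUE) %>%   # within each SiteID, non-NA num.1 first
  slice(1) %>%                                  # keep exactly one row per SiteID
  ungroup()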
Import of sample data
a <- read.table(header = T, stringsAsFactors = F,
text = " SiteID PYear Habitat num.1
000901W 2011 W NA
001101W 2007 W NA
001801W 2005 W NA
002001W 2017 W NA
002401F 2006 F NA
002401F 2016 F NA
004001F 2006 F NA
004001W 2006 W NA
004101W 2007 W NA
004101W 2007 W 16
004701F 2017 F NA
006201F 2008 F NA
006501F 2009 F NA
006601W 2007 W 2
006601W 2007 W NA
006803F 2009 F NA
007310F 2018 F NA
007602W 2017 W NA
008103W 2011 W NA
008203F 2007 F 1")

Removing rows of data frame if number of NA in a column is larger than 3

I have a data frame (panel data); the Ctry column indicates the country name. If the number of NAs in any column (for example, Carx) is larger than 3, I want to drop that country from my data frame. For example:
Country A has 2 NAs
Country B has 4 NAs
Country C has 3 NAs
so I want to drop country B. I have a data frame like this (for illustration; my actual data frame is very large):
Ctry year Carx
A 2000 23
A 2001 18
A 2002 20
A 2003 NA
A 2004 24
A 2005 18
B 2000 NA
B 2001 NA
B 2002 NA
B 2003 NA
B 2004 18
B 2005 16
C 2000 NA
C 2001 NA
C 2002 24
C 2003 21
C 2004 NA
C 2005 24
I want to create a data frame like this:
Ctry year Carx
A 2000 23
A 2001 18
A 2002 20
A 2003 NA
A 2004 24
A 2005 18
C 2000 NA
C 2001 NA
C 2002 24
C 2003 21
C 2004 NA
C 2005 24
A fairly straightforward way in base R is to use sum(is.na(.)) along with ave() to do the counting, like this:
with(mydf, ave(Carx, Ctry, FUN = function(x) sum(is.na(x))))
# [1] 1 1 1 1 1 1 4 4 4 4 4 4 3 3 3 3 3 3
Once you have that, subsetting is easy:
mydf[with(mydf, ave(Carx, Ctry, FUN = function(x) sum(is.na(x)))) <= 3, ]
# Ctry year Carx
# 1 A 2000 23
# 2 A 2001 18
# 3 A 2002 20
# 4 A 2003 NA
# 5 A 2004 24
# 6 A 2005 18
# 13 C 2000 NA
# 14 C 2001 NA
# 15 C 2002 24
# 16 C 2003 21
# 17 C 2004 NA
# 18 C 2005 24
You can use the by() function to group by Ctry and count the NAs of each group:
DF <- read.csv(
text='Ctry,year,Carx
A,2000,23
A,2001,18
A,2002,20
A,2003,NA
A,2004,24
A,2005,18
B,2000,NA
B,2001,NA
B,2002,NA
B,2003,NA
B,2004,18
B,2005,16
C,2000,NA
C,2001,NA
C,2002,24
C,2003,21
C,2004,NA
C,2005,24',
stringsAsFactors=F)
res <- by(data=DF$Carx,INDICES=DF$Ctry,FUN=function(x)sum(is.na(x)))
validCtry <-names(res)[res <= 3]
DF[DF$Ctry %in% validCtry, ]
# Ctry year Carx
#1 A 2000 23
#2 A 2001 18
#3 A 2002 20
#4 A 2003 NA
#5 A 2004 24
#6 A 2005 18
#13 C 2000 NA
#14 C 2001 NA
#15 C 2002 24
#16 C 2003 21
#17 C 2004 NA
#18 C 2005 24
EDIT: if you have more columns to check, you could adapt the previous code as follows:
res <- by(data=DF,INDICES=DF$Ctry,
FUN=function(x){
return(sum(is.na(x$Carx)) <= 3 &&
sum(is.na(x$Barx)) <= 3 &&
sum(is.na(x$Tarx)) <= 3)
})
validCtry <- names(res)[res]
DF[DF$Ctry %in% validCtry, ]
where, of course, you may change the condition in FUN according to your needs.
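If the list of columns to check is long, writing each condition by hand gets tedious. A more general sketch, assuming every column other than Ctry and year should satisfy the threshold:
res <- by(data = DF[, setdiff(names(DF), c("Ctry", "year")), drop = FALSE],
          INDICES = DF$Ctry,
          FUN = function(x) all(colSums(is.na(x)) <= 3))  # TRUE only if every column passes
validCtry <- names(res)[res]
DF[DF$Ctry %in% validCtry, ]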
Since you mention that your data is "very huge" (whatever that means exactly), you could try a solution with dplyr and see if it's perhaps faster than the base R solutions. If those are fast enough, just ignore this one.
library(dplyr)
# the old %.% pipe from early dplyr versions is defunct; use %>%
newdf <- df %>% group_by(Ctry) %>% filter(sum(is.na(Carx)) <= 3)
