R losing data with distinct()

I'm using distinct() to remove duplicates within a combined dataset; however, I'm losing data because distinct() only keeps the first entry.
Example data frame "a"
SiteID PYear Habitat num.1
000901W 2011 W NA
001101W 2007 W NA
001801W 2005 W NA
002001W 2017 W NA
002401F 2006 F NA
002401F 2016 F NA
004001F 2006 F NA
004001W 2006 W NA
004101W 2007 W NA
004101W 2007 W 16
004701F 2017 F NA
006201F 2008 F NA
006501F 2009 F NA
006601W 2007 W 2
006601W 2007 W NA
006803F 2009 F NA
007310F 2018 F NA
007602W 2017 W NA
008103W 2011 W NA
008203F 2007 F 1
Coding:
a<-distinct(a,SiteID, .keep_all = TRUE)
I would like to know how to remove duplicates based on SiteID without dropping rows that have a number in the num.1 column. For example, in data frame a, 004101W and 006601W have multiple entries, and I want to keep the row with the integer rather than the one with NA.

(Thank you for updating with more representative sample data!)
a now has 20 rows, with 17 different SiteID values.
Three of those SiteIDs have multiple rows:
library(tidyverse)
a %>%
add_count(SiteID) %>%
filter(n > 1)
## A tibble: 6 x 5
# SiteID PYear Habitat num.1 n
# <chr> <int> <chr> <int> <int>
#1 002401F 2006 F NA 2 # Both have NA for num.1
#2 002401F 2016 F NA 2 # ""
#3 004101W 2007 W NA 2 # Drop
#4 004101W 2007 W 16 2 # Keep this one
#5 006601W 2007 W 2 2 # Keep this one
#6 006601W 2007 W NA 2 # Drop
If we want to prioritize the rows without NA in num.1, we can arrange by num.1 within each SiteID so that NAs come last. distinct() keeps the first row for each SiteID, so it will then keep a non-NA row whenever one exists.
(An alternative is also provided in case you want to keep the original order of the rows within each SiteID while still moving NA values in num.1 to the end. In the is.na(num.1) term, NAs evaluate to TRUE and sort after non-missing values, which evaluate to FALSE.)
a %>%
arrange(SiteID, num.1) %>%
#arrange(SiteID, is.na(num.1)) %>% # Alternative to preserve orig order
distinct(SiteID, .keep_all = TRUE)
SiteID PYear Habitat num.1
1 000901W 2011 W NA
2 001101W 2007 W NA
3 001801W 2005 W NA
4 002001W 2017 W NA
5 002401F 2006 F NA # Kept first appearing row, since both NA num.1
6 004001F 2006 F NA
7 004001W 2006 W NA
8 004101W 2007 W 16 # Kept non-NA row
9 004701F 2017 F NA
10 006201F 2008 F NA
11 006501F 2009 F NA
12 006601W 2007 W 2 # Kept non-NA row
13 006803F 2009 F NA
14 007310F 2018 F NA
15 007602W 2017 W NA
16 008103W 2011 W NA
17 008203F 2007 F 1
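As a quick check of the arrange-then-distinct idea, here is a minimal sketch on a small made-up table (the `toy` data frame and its values are hypothetical, not the asker's data). It uses the `is.na(num.1)` variant, so rows with a value sort ahead of rows with NA inside each SiteID.

```r
library(dplyr)

# Hypothetical toy data: A and B each have one NA row and one row with a value
toy <- data.frame(
  SiteID = c("B", "A", "A", "B", "C"),
  num.1  = c(NA, NA, 16, 2, NA),
  stringsAsFactors = FALSE
)

dedup <- toy %>%
  arrange(SiteID, is.na(num.1)) %>%   # non-NA rows first within each SiteID
  distinct(SiteID, .keep_all = TRUE)  # keep the first row per SiteID

dedup
```

A SiteID whose rows are all NA (C here) still keeps one row, so no site is lost.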
Import of sample data
a <- read.table(header = TRUE, stringsAsFactors = FALSE,
text = " SiteID PYear Habitat num.1
000901W 2011 W NA
001101W 2007 W NA
001801W 2005 W NA
002001W 2017 W NA
002401F 2006 F NA
002401F 2016 F NA
004001F 2006 F NA
004001W 2006 W NA
004101W 2007 W NA
004101W 2007 W 16
004701F 2017 F NA
006201F 2008 F NA
006501F 2009 F NA
006601W 2007 W 2
006601W 2007 W NA
006803F 2009 F NA
007310F 2018 F NA
007602W 2017 W NA
008103W 2011 W NA
008203F 2007 F 1")

How to compare two or more lines in a long dataset to create a new variable?

I have a long-format dataset like this:
ID year Address Classification
1 2020 A NA
1 2021 A NA
1 2022 B B_
2 2020 C NA
2 2021 D NA
2 2022 F F_
3 2020 G NA
3 2021 G NA
3 2022 G G_
4 2020 H NA
4 2021 I NA
4 2022 H H_
Each subject has a Classification for 2022, based on their address in 2022. No classification was made for the other years. I would like to extend it to those years: wherever a subject's address in another year is the same as their 2022 address, the NA in 'Classification' for that year should be replaced with the value they received in 2022.
I have been trying to convert the data to wide format and compare the rows directly with dplyr, but it does not work properly because of the NA values. It also doesn't look like a smart way to reach the final dataset. I would like to get the 'Aim' column shown below:
ID year Address Classification Aim
1 2020 A NA NA
1 2021 A NA NA
1 2022 B B_ B_
2 2020 C NA NA
2 2021 D NA NA
2 2022 F F_ F_
3 2020 G NA G_
3 2021 G NA G_
3 2022 G G_ G_
4 2020 H NA H_
4 2021 I NA NA
4 2022 H H_ H_
I would use tidyr::fill with dplyr::group_by for this. You need to specify the direction: the default, "down", fills nothing here, because in each group the NAs come before the known value.
library(dplyr)
library(tidyr)
df %>%
group_by(ID, Address) %>%
tidyr::fill(Classification, .direction = "up")
Output:
# ID year Address Classification
# <int> <int> <chr> <chr>
# 1 1 2020 A NA
# 2 1 2021 A NA
# 3 1 2022 B B_
# 4 2 2020 C NA
# 5 2 2021 D NA
# 6 2 2022 F F_
# 7 3 2020 G G_
# 8 3 2021 G G_
# 9 3 2022 G G_
#10 4 2020 H H_
#11 4 2021 I NA
#12 4 2022 H H_
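The "up" direction relies on the 2022 row coming last within each ID and Address group. If the rows might not be sorted by year, one option is `.direction = "downup"`, which fills in both directions. The sketch below uses a small made-up, deliberately unsorted table (`toy` is hypothetical) and also calls `ungroup()` so that later steps are not affected by the grouping.

```r
library(dplyr)
library(tidyr)

# Hypothetical unsorted data: the classified 2022 row comes first
toy <- data.frame(
  ID = c(4, 4, 4),
  year = c(2022, 2020, 2021),
  Address = c("H", "H", "I"),
  Classification = c("H_", NA, NA),
  stringsAsFactors = FALSE
)

filled <- toy %>%
  group_by(ID, Address) %>%
  fill(Classification, .direction = "downup") %>%  # fill both ways within a group
  ungroup()

filled
```

The 2021 row keeps NA because its address (I) differs from the 2022 address (H).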
Data
df <- read.table(text = "ID year Address Classification
1 2020 A NA
1 2021 A NA
1 2022 B B_
2 2020 C NA
2 2021 D NA
2 2022 F F_
3 2020 G NA
3 2021 G NA
3 2022 G G_
4 2020 H NA
4 2021 I NA
4 2022 H H_", header = TRUE)

R Modify dataframes to the same length

I've got a list containing multiple data frames with two columns (Year and Area).
The problem is that some data frames only contain information from 2002-2015 or 2003-2017, others from 2001-2018, and so on. So they differ in length.
list:
list(data.frame(Year = c(2001, 2002, 2004, 2005), Area = c(1, 2, 3, 4)),
data.frame(Year = c(2001, 2004, 2018), Area = c(1, 2, 4)),
data.frame(Year = c(2008, 2009, 2014, 2015, 2016), Area = c(1, 2, 3, 4, 5)))
How can I modify them all to the same length (covering 2001-2018) by adding NA, or better 0, for Area when there is no area information for that year?
Let
A = data.frame(Year= c(2001,2002,2004,2005), Area=c(1,2,3,4))
B = data.frame(Year= c(2001,2004,2018), Area=c(1,2,4))
C = list(A, B)
Then we have
Ref = data.frame(Year = 2001:2018)
New.List = lapply(C, function(x) dplyr::left_join(Ref, x, by = "Year"))
with the desired result
[[1]]
Year Area
1 2001 1
2 2002 2
3 2003 NA
4 2004 3
5 2005 4
6 2006 NA
7 2007 NA
8 2008 NA
9 2009 NA
10 2010 NA
11 2011 NA
12 2012 NA
13 2013 NA
14 2014 NA
15 2015 NA
16 2016 NA
17 2017 NA
18 2018 NA
[[2]]
Year Area
1 2001 1
2 2002 NA
3 2003 NA
4 2004 2
5 2005 NA
6 2006 NA
7 2007 NA
8 2008 NA
9 2009 NA
10 2010 NA
11 2011 NA
12 2012 NA
13 2013 NA
14 2014 NA
15 2015 NA
16 2016 NA
17 2017 NA
18 2018 4
To make sure that all data frames in the list share the same spelling of Year, run this before the join:
C = lapply(C, function(x) {colnames(x)[1] = "Year"; x})
This assumes the first column is always the Year column.
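The question asks for 0 rather than NA where a year has no area information. A minimal sketch of that variant, assuming the same left join and using `dplyr::coalesce()` to replace the missing values (the name `padded` is just illustrative):

```r
library(dplyr)

A <- data.frame(Year = c(2001, 2002, 2004, 2005), Area = c(1, 2, 3, 4))
B <- data.frame(Year = c(2001, 2004, 2018), Area = c(1, 2, 4))
Ref <- data.frame(Year = 2001:2018)

padded <- lapply(list(A, B), function(x) {
  left_join(Ref, x, by = "Year") %>%
    mutate(Area = coalesce(Area, 0))  # years with no match get 0 instead of NA
})

padded[[2]]
```

Each element now has 18 rows, one per year from 2001 to 2018.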

Create a variable based on values of two years from another variable in R

It looks simple, but I couldn't find the answer online. I have panel data with city characteristics over the years 1995-2015. For some variables, I only have data for the years 2000 and 2010. Therefore, I want to create new variables where I impute the missing data for the years 1995-2004 with the 2000 values and for the years 2005-2015 with the 2010 values.
My data set looks like this example:
cities idhm year
1 B NA 1995
2 C NA 1996
3 D NA 1997
4 E NA 1998
5 F NA 1999
6 G 24599 2000
7 H NA 2001
8 I NA 2002
9 J NA 2003
10 K NA 2004
11 L NA 2005
12 M NA 2006
13 N NA 2007
14 O NA 2008
15 P NA 2009
16 Q 5598 2010
17 R NA 2011
18 S NA 2012
19 T NA 2013
20 U NA 2014
21 V NA 2015
I want to have a data set like this one:
cities idhm year newvar
1 B NA 1995 24599
2 C NA 1996 24599
3 D NA 1997 24599
4 E NA 1998 24599
5 F NA 1999 24599
6 G 24599 2000 24599
7 H NA 2001 24599
8 I NA 2002 24599
9 J NA 2003 24599
10 K NA 2004 24599
11 L NA 2005 5598
12 M NA 2006 5598
13 N NA 2007 5598
14 O NA 2008 5598
15 P NA 2009 5598
16 Q 5598 2010 5598
17 R NA 2011 5598
18 S NA 2012 5598
19 T NA 2013 5598
20 U NA 2014 5598
21 V NA 2015 5598
Any help is welcomed.
I suspect your data may be larger than this example, so a more general case would be to use a rolling join. I find that easiest with data.table.
First, make a dictionary of complete data to join on.
library(data.table)
setDT(data1)
dictionary <- data1[!is.na(idhm),.(year,idhm)]
dictionary
# year idhm
#1: 2000 24599
#2: 2010 5598
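Then perform the join on = "year" and roll = "nearest". Note that 2005 is equidistant from 2000 and 2010; as the output below shows, the join assigns it the 2000 value, whereas the question asks for the 2010 value from 2005 onward.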
result <- dictionary[data1,on = "year",roll="nearest"]
result[,.(cities,year,idhm)]
# cities year idhm
# 1: B 1995 24599
# 2: C 1996 24599
# 3: D 1997 24599
# 4: E 1998 24599
# 5: F 1999 24599
# 6: G 2000 24599
# 7: H 2001 24599
# 8: I 2002 24599
# 9: J 2003 24599
#10: K 2004 24599
#11: L 2005 24599
#12: M 2006 5598
#13: N 2007 5598
#14: O 2008 5598
#15: P 2009 5598
#16: Q 2010 5598
#17: R 2011 5598
#18: S 2012 5598
#19: T 2013 5598
#20: U 2014 5598
#21: V 2015 5598
# cities year idhm
Data
data1 <- structure(list(cities = structure(1:21, .Label = c("B", "C",
"D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P",
"Q", "R", "S", "T", "U", "V"), class = "factor"), idhm = c(NA,
NA, NA, NA, NA, 24599L, NA, NA, NA, NA, NA, NA, NA, NA, NA, 5598L,
NA, NA, NA, NA, NA), year = 1995:2015), class = "data.frame", row.names = c(NA,
-21L))
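If the periods must follow the exact cut-offs stated in the question (1995-2004 and 2005-2015) rather than the nearest year, one base R option is `findInterval()`. This is a sketch under that assumption; `period_start`, `ref_year` and `newvar` are illustrative names, and the data frame is rebuilt here so the block runs on its own.

```r
# Same values as the sample data, rebuilt so the example is self-contained
data1 <- data.frame(
  cities = LETTERS[2:22],
  idhm = c(rep(NA, 5), 24599L, rep(NA, 9), 5598L, rep(NA, 5)),
  year = 1995:2015
)

period_start <- c(1995, 2005)  # first year of each period (assumed cut-offs)
ref_year     <- c(2000, 2010)  # year whose value fills that period

# Period index (1 or 2) for every row, then look up the reference year's value
period <- findInterval(data1$year, period_start)
data1$newvar <- data1$idhm[match(ref_year[period], data1$year)]

data1
```

With more reference years, only the two vectors `period_start` and `ref_year` need to be extended.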
We can do the following (here the data frame is called df):
df$new_var <- NA
df$new_var[df$year >= 1995 & df$year <= 2004] <- df$idhm[df$year == 2000]
df$new_var[df$year >= 2005 & df$year <= 2015] <- df$idhm[df$year == 2010]
Or using dplyr:
library(dplyr)
df %>%
mutate(new_var = case_when(between(year, 1995, 2004) ~idhm[year == 2000],
between(year, 2005, 2015) ~idhm[year == 2010]))
# cities idhm year new_var
#1 B NA 1995 24599
#2 C NA 1996 24599
#3 D NA 1997 24599
#4 E NA 1998 24599
#5 F NA 1999 24599
#6 G 24599 2000 24599
#7 H NA 2001 24599
#8 I NA 2002 24599
#9 J NA 2003 24599
#10 K NA 2004 24599
#11 L NA 2005 5598
#12 M NA 2006 5598
#13 N NA 2007 5598
#14 O NA 2008 5598
#15 P NA 2009 5598
#16 Q 5598 2010 5598
#17 R NA 2011 5598
#18 S NA 2012 5598
#19 T NA 2013 5598
#20 U NA 2014 5598
#21 V NA 2015 5598

How to make all elements of all columns align by creating empty spaces so that it stays in the same pattern

I have a temporal dataset, but it is incomplete, so I cannot reconstruct the series accurately. These are the data:
df<-data.frame(year=c(2006,2007,2008,2009,2010,2011,2012,2013,2014,2015),
sample1=c("D","D","DDD","D","U","UU","UUU","U","D","DDD"),
sample2=c("U","UU","D","D","DDD","D","U","UU","UUU","U"),
sample3=c("D","DDD","D","U","UU","UUU","U","D","DDD","D"),
sample4=c("D","D","UUU","U","D","DDD","D","U","U",NA),
sample5=c(NA,"UU","D","U","UU","UUU","U","D","U",NA))
I need it to end up like this:
df2<-data.frame(year=c(2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,
2015,2016,2017,2018),
sample1=c(NA,NA,"D","D","DDD","D","U","UU","UUU","U","D","DDD",NA,NA,NA),
sample2=c("U","UU","D","D","DDD","D","U","UU","UUU","U",NA,NA,NA,NA,NA),
sample3=c(NA,NA,NA,"D","DDD","D","U","UU","UUU","U","D","DDD","D",NA,NA),
sample4=c(NA,NA,"D","D",NA,NA,NA,NA,"UUU","U","D","DDD","D","U","U"),
sample5=c(NA,"UU","D",NA,NA,NA,"U","UU","UUU","U",NA,NA,"D","U",NA))
I need all the columns aligned to the same pattern. The best result so far came from DNA alignment functions, but to find the best alignment they sometimes invert the order of elements, which must not happen in my case.
I have no idea how to do this.
dplyr's add_row function makes this pretty easy, once the initial dataframe exists.
library(dplyr)
df<-data.frame(year=c(2006,2007,2008,2009,2010,2011,2012,2013,2014,2015),
sample1 = c("D","D","DDD","D","U","UU","UUU","U","D","DDD"),
sample2 = c("U","UD","D","D","DDD","D","U","UU","UUU","U"),
sample3 = c("D","DDD","D","U","UU","UUU","U","D","DDD","D"),
sample4 = c("D","D","UUU","U","D","DDD","D","U","U",NA),
sample5 = c(NA,"UU","D","U","UU","UUU","U","D","U",NA))
df2 <- df %>%
add_row(year = 2016:2018)
Rows can also be added at the start with .before (tibble() replaces the deprecated data_frame()):
df <- tibble(year=c(2006,2007,2008,2009,2010,2011,2012,2013,2014,2015),
sample1=c("D","D","DDD","D","U","UU","UUU","U","D","DDD"),
sample2=c("U","UD","D","D","DDD","D","U","UU","UUU","U"),
sample3=c("D","DDD","D","U","UU","UUU","U","D","DDD","D"),
sample4=c("D","D","UUU","U","D","DDD","D","U","U",NA),
sample5=c(NA,"UU","D","U","UU","UUU","U","D","U",NA)) %>%
add_row(year = c(2004, 2005), .before = 1) %>%
add_row(year = c(2016:2018))
Result:
# A tibble: 15 x 6
year sample1 sample2 sample3 sample4 sample5
<dbl> <chr> <chr> <chr> <chr> <chr>
1 2004 NA NA NA NA NA
2 2005 NA NA NA NA NA
3 2006 D U D D NA
4 2007 D UD DDD D UU
5 2008 DDD D D UUU D
6 2009 D D U U U
7 2010 U DDD UU D UU
8 2011 UU D UUU DDD UUU
9 2012 UUU U U D U
10 2013 U UU D U D
11 2014 D UUU DDD U U
12 2015 DDD U D NA NA
13 2016 NA NA NA NA NA
14 2017 NA NA NA NA NA
15 2018 NA NA NA NA NA
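When the years to add are not known in advance, the same padding can be done by joining against the full year range. The sketch below uses base R `merge()` with only one sample column to stay short; `full_years` and `padded` are illustrative names. Like `add_row`, it only pads missing years with NA and does not shift values within a column.

```r
# One sample column from the question, for brevity
df <- data.frame(
  year = 2006:2015,
  sample1 = c("D", "D", "DDD", "D", "U", "UU", "UUU", "U", "D", "DDD"),
  stringsAsFactors = FALSE
)

full_years <- data.frame(year = 2004:2018)

# Keep every year in full_years; years missing from df get NA
padded <- merge(full_years, df, by = "year", all.x = TRUE)

padded
```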

Removing rows of data frame if number of NA in a column is larger than 3

I have a data frame (panel data) in which the Ctry column gives the country name. For any column (for example, Carx), if the number of NAs for a country is larger than 3, I want to drop that country from my data frame. For example,
Country A has 2 NA
Country B has 4 NA
Country C has 3 NA
I want to drop country B from my data frame. I have a data frame like this (this is for illustration; my actual data frame is very large):
Ctry year Carx
A 2000 23
A 2001 18
A 2002 20
A 2003 NA
A 2004 24
A 2005 18
B 2000 NA
B 2001 NA
B 2002 NA
B 2003 NA
B 2004 18
B 2005 16
C 2000 NA
C 2001 NA
C 2002 24
C 2003 21
C 2004 NA
C 2005 24
I want to create a data frame like this:
Ctry year Carx
A 2000 23
A 2001 18
A 2002 20
A 2003 NA
A 2004 24
A 2005 18
C 2000 NA
C 2001 NA
C 2002 24
C 2003 21
C 2004 NA
C 2005 24
A fairly straightforward way in base R is to use sum(is.na(.)) along with ave, to do the counting, like this:
with(mydf, ave(Carx, Ctry, FUN = function(x) sum(is.na(x))))
# [1] 1 1 1 1 1 1 4 4 4 4 4 4 3 3 3 3 3 3
Once you have that, subsetting is easy:
mydf[with(mydf, ave(Carx, Ctry, FUN = function(x) sum(is.na(x)))) <= 3, ]
# Ctry year Carx
# 1 A 2000 23
# 2 A 2001 18
# 3 A 2002 20
# 4 A 2003 NA
# 5 A 2004 24
# 6 A 2005 18
# 13 C 2000 NA
# 14 C 2001 NA
# 15 C 2002 24
# 16 C 2003 21
# 17 C 2004 NA
# 18 C 2005 24
You can use the by() function to group by Ctry and count the NAs in each group:
DF <- read.csv(
text='Ctry,year,Carx
A,2000,23
A,2001,18
A,2002,20
A,2003,NA
A,2004,24
A,2005,18
B,2000,NA
B,2001,NA
B,2002,NA
B,2003,NA
B,2004,18
B,2005,16
C,2000,NA
C,2001,NA
C,2002,24
C,2003,21
C,2004,NA
C,2005,24',
stringsAsFactors=F)
res <- by(data=DF$Carx,INDICES=DF$Ctry,FUN=function(x)sum(is.na(x)))
validCtry <-names(res)[res <= 3]
DF[DF$Ctry %in% validCtry, ]
# Ctry year Carx
#1 A 2000 23
#2 A 2001 18
#3 A 2002 20
#4 A 2003 NA
#5 A 2004 24
#6 A 2005 18
#13 C 2000 NA
#14 C 2001 NA
#15 C 2002 24
#16 C 2003 21
#17 C 2004 NA
#18 C 2005 24
EDIT :
if you have more columns to check, you could adapt the previous code as follows:
res <- by(data=DF,INDICES=DF$Ctry,
FUN=function(x){
return(sum(is.na(x$Carx)) <= 3 &&
sum(is.na(x$Barx)) <= 3 &&
sum(is.na(x$Tarx)) <= 3)
})
validCtry <- names(res)[res]
DF[DF$Ctry %in% validCtry, ]
where, of course, you may change the condition in FUN according to your needs.
Since you mention that your data is "very huge" (whatever that means exactly), you could try a solution with dplyr and see whether it is faster than the base R solutions. If the other solutions are fast enough, just ignore this one.
library(dplyr)
newdf <- df %>% group_by(Ctry) %>% filter(sum(is.na(Carx)) <= 3)
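To check the grouped filter against the sample data from the question, here is a self-contained sketch (the data frame is rebuilt as `DF`, and `kept` is an illustrative name). `ungroup()` is added at the end so the result is a plain, ungrouped table.

```r
library(dplyr)

DF <- data.frame(
  Ctry = rep(c("A", "B", "C"), each = 6),
  year = rep(2000:2005, times = 3),
  Carx = c(23, 18, 20, NA, 24, 18,   # A: 1 NA
           NA, NA, NA, NA, 18, 16,   # B: 4 NAs
           NA, NA, 24, 21, NA, 24),  # C: 3 NAs
  stringsAsFactors = FALSE
)

kept <- DF %>%
  group_by(Ctry) %>%
  filter(sum(is.na(Carx)) <= 3) %>%  # keep countries with at most 3 NAs
  ungroup()

kept
```

Country B, with four NAs, is the only one dropped.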
