R: Modify dataframes to the same length

I've got a list containing multiple dataframes with two columns (Year and Area).
The problem is that some dataframes only contain information from 2002-2015 or 2003-2017, others from 2001-2018, and so on. So they differ in length.
list:
list(structure(list(Year = c(2001, 2002, 2004, 2005), Area = c(1, 2, 3, 4)),
               class = "data.frame", row.names = c(NA, -4L)),
     structure(list(Year = c(2001, 2004, 2018), Area = c(1, 2, 4)),
               class = "data.frame", row.names = c(NA, -3L)),
     structure(list(Year = c(2008, 2009, 2014, 2015, 2016), Area = c(1, 2, 3, 4, 5)),
               class = "data.frame", row.names = c(NA, -5L)))
How can I modify them all to the same length (from 2001-2018) by adding NA, or better 0, for Area where there is no area information for that year?

Let
A <- data.frame(Year = c(2001, 2002, 2004, 2005), Area = c(1, 2, 3, 4))
B <- data.frame(Year = c(2001, 2004, 2018), Area = c(1, 2, 4))
C <- list(A, B)
Then we have
Ref <- data.frame(Year = 2001:2018)
New.List <- lapply(C, function(x) dplyr::left_join(Ref, x, by = "Year"))
with the desired result
[[1]]
Year Area
1 2001 1
2 2002 2
3 2003 NA
4 2004 3
5 2005 4
6 2006 NA
7 2007 NA
8 2008 NA
9 2009 NA
10 2010 NA
11 2011 NA
12 2012 NA
13 2013 NA
14 2014 NA
15 2015 NA
16 2016 NA
17 2017 NA
18 2018 NA
[[2]]
Year Area
1 2001 1
2 2002 NA
3 2003 NA
4 2004 2
5 2005 NA
6 2006 NA
7 2007 NA
8 2008 NA
9 2009 NA
10 2010 NA
11 2011 NA
12 2012 NA
13 2013 NA
14 2014 NA
15 2015 NA
16 2016 NA
17 2017 NA
18 2018 4
To make sure that all data.frames in the list share the same spelling of Year, do
C <- lapply(C, function(x) {colnames(x)[1] <- "Year"; x})
provided the first column is always the Year column.
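Since the question prefers 0 over NA, here is a minimal follow-up sketch (my addition, assuming the joined column is named Area as above) that replaces the NAs after the join:
New.List.Zero <- lapply(New.List, function(x) {
  x$Area[is.na(x$Area)] <- 0  # years with no area information become 0
  x
})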

Related

Merging two dataframes creates new missing observations

I have two dataframes with the following matching keys: year, region and province. They each have a set of variables (in this illustrative example, x1 for df1 and x2 for df2), and both variables have several missing values of their own.
df1                              df2
year region province x1 ... xn   year region province x2 ... xn
2019 1      5        NA          2019 1      5        NA
2019 2      4        NA          2019 2      4        NA
2019 2      4        NA          2019 2      4        NA
2018 3      7        13          2018 3      7        13
2018 3      7        15          2018 3      7        15
2018 3      7        17          2018 3      7        17
I want to merge both dataframes such that they end up like this:
year region province x1 x2
2019 1 5 3 NA
2019 2 4 27 NA
2019 2 4 15 NA
2018 3 7 12 13
2018 3 7 NA 15
2018 3 7 NA 17
2017 4 9 NA 12
2017 4 9 19 30
2017 4 9 20 10
However, when doing so with merged_df <- merge(df1, df2, by = c("year", "region", "province"), all.x = TRUE), R seems to create a lot of additional missing values in the variable columns (x1 and x2) that were not there before. What is happening here? I have tried sorting both with df1 %>% arrange(province, -year) and df2 %>% arrange(province, -year), which is enough to have matching order in both dataframes, only to find the same issue when running the merge command. I've tried a bunch of other things too, but nothing seems to work. R's output looks roughly like this:
year region province x1 x2
2019 1 5 NA NA
2019 2 4 NA NA
2019 2 4 NA NA
2018 3 7 NA NA
2018 3 7 NA NA
2018 3 7 NA NA
2017 4 9 15 NA
2017 4 9 19 30
2017 4 9 20 10
I have done this before; in fact, one of the dataframes is an already merged dataframe in which I did not encounter this issue.
Perhaps the concept of merge() is not clear; I include two examples with sample data. I hope it helps.
#Data
set.seed(123)
DF1 <- data.frame(year = rep(c(2017, 2018, 2019), 3),
                  region = rep(c(1, 2, 3), 3),
                  province = round(runif(9, 1, 5), 0),
                  x1 = rnorm(9, 3, 1.5))
DF2 <- data.frame(year = rep(c(2016, 2018, 2019), 3),
                  region = rep(c(1, 2, 3), 3),
                  province = round(runif(9, 1, 5), 0),
                  x2 = rnorm(9, 3, 1.5))
#Merge keeping only the rows of DF1 (left join)
Merged1 <- merge(DF1, DF2, by = intersect(names(DF1), names(DF2)), all.x = TRUE)
Merged1
year region province x1 x2
1 2017 1 2 2.8365510 NA
2 2017 1 3 3.7557187 NA
3 2017 1 5 4.9208323 NA
4 2018 2 4 2.8241371 NA
5 2018 2 5 6.7925048 1.460993
6 2018 2 5 0.4090941 1.460993
7 2019 3 1 5.5352765 NA
8 2019 3 3 3.8236451 4.256681
9 2019 3 3 3.2746239 4.256681
#Merge including all rows, even where ids have no match (full outer join)
Merged2 <- merge(DF1, DF2, by = intersect(names(DF1), names(DF2)), all = TRUE)
Merged2
year region province x1 x2
1 2016 1 3 NA 4.052034
2 2016 1 4 NA 2.062441
3 2016 1 5 NA 2.673038
4 2017 1 2 2.8365510 NA
5 2017 1 3 3.7557187 NA
6 2017 1 5 4.9208323 NA
7 2018 2 1 NA 0.469960
8 2018 2 2 NA 2.290813
9 2018 2 4 2.8241371 NA
10 2018 2 5 6.7925048 1.460993
11 2018 2 5 0.4090941 1.460993
12 2019 3 1 5.5352765 NA
13 2019 3 2 NA 1.398264
14 2019 3 3 3.8236451 4.256681
15 2019 3 3 3.2746239 4.256681
16 2019 3 4 NA 1.906663
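A rough diagnostic sketch (my addition, not part of the answer above): the extra NAs come from key combinations that exist in one dataframe but not the other, and duplicated keys additionally multiply rows. Both causes can be checked directly, assuming dplyr is available for anti_join:
# rows of DF1 whose (year, region, province) key has no match in DF2
nrow(dplyr::anti_join(DF1, DF2, by = c("year", "region", "province")))
# duplicated key combinations in DF1, which multiply rows on merge
sum(duplicated(DF1[c("year", "region", "province")]))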

Fill in blanks from the previous cell multiplied by the current cell in a different column in R

I have the below data:
year <- 2015:2030
actual <- c(NA, NA, NA, 3170.620936, rep(NA, 12))
delta <- c(0.276674282, 0.23515258, 0.133083622, 0.236098022,
           0.399974342, 0.385942573, 0.165095681, 0.163945346,
           0.155695778, 0.147270755, 0.146505261, 0.133997582,
           0.123100693, 0.119131947, 0.115589755, 0.103675414)
df <- data.frame(year, actual, delta)
df
year actual delta
1 2015 NA 0.2766743
2 2016 NA 0.2351526
3 2017 NA 0.1330836
4 2018 3170.621 0.2360980
5 2019 NA 0.3999743
6 2020 NA 0.3859426
7 2021 NA 0.1650957
8 2022 NA 0.1639453
9 2023 NA 0.1556958
10 2024 NA 0.1472708
11 2025 NA 0.1465053
12 2026 NA 0.1339976
13 2027 NA 0.1231007
14 2028 NA 0.1191319
15 2029 NA 0.1155898
16 2030 NA 0.1036754
What I am trying to do is to replace the NAs after the last valid data point with the previous (filled) value multiplied by the current delta. So, in this case, I want to multiply "actual" in 2018 by "delta" in 2019 to fill in the 2019 value for "actual", and so on down the series. I have tried the below code with no success:
df$actual_filled <- df$actual
df
library(dplyr)
df <- df %>%
  mutate(actual_filled = lag(actual_filled, 1) * delta)
df
year actual delta actual_filled
1 2015 NA 0.2766743 NA
2 2016 NA 0.2351526 NA
3 2017 NA 0.1330836 NA
4 2018 3170.621 0.2360980 NA
5 2019 NA 0.3999743 1268.167
6 2020 NA 0.3859426 NA
7 2021 NA 0.1650957 NA
8 2022 NA 0.1639453 NA
9 2023 NA 0.1556958 NA
10 2024 NA 0.1472708 NA
11 2025 NA 0.1465053 NA
12 2026 NA 0.1339976 NA
13 2027 NA 0.1231007 NA
14 2028 NA 0.1191319 NA
15 2029 NA 0.1155898 NA
16 2030 NA 0.1036754 NA
As you can see, the filling process ends in 2019. I thought it would populate the new data to the end of the series. The code I wrote is acting as if I am using the "actual" data rather than "actual_filled". Could someone tell me what I am doing wrong and how I can fix it?
The problem is that lag() is vectorized: mutate() computes lag(actual_filled, 1) * delta from the column as it stood before the call, so values filled earlier in the same mutate() never cascade forward. Here's a solution that works via a loop, updating the column row by row:
df$actual_filled <- df$actual
for (row in 2:nrow(df)) {
  if (!is.na(df$actual_filled[row - 1])) {
    df$actual_filled[row] <- df$delta[row] * df$actual_filled[row - 1]
  }
}
I'm new to R so it may not be the best solution!
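For a loop-free variant, a minimal sketch using base R's Reduce with accumulate = TRUE (my addition, not from the answer above) carries the last filled value forward; since NA * delta stays NA, rows before the first valid point remain NA:
df$actual_filled <- Reduce(
  function(prev, i) {
    # keep a real observation; otherwise grow the previous filled value
    if (is.na(df$actual[i])) prev * df$delta[i] else df$actual[i]
  },
  seq_len(nrow(df)),
  init = NA_real_,
  accumulate = TRUE
)[-1]  # drop the initial NA seed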

R losing data with distinct()

I'm using distinct() to remove duplicates within a combined dataset; however, I'm losing data because distinct() only keeps the first entry.
Example data frame "a"
SiteID PYear Habitat num.1
000901W 2011 W NA
001101W 2007 W NA
001801W 2005 W NA
002001W 2017 W NA
002401F 2006 F NA
002401F 2016 F NA
004001F 2006 F NA
004001W 2006 W NA
004101W 2007 W NA
004101W 2007 W 16
004701F 2017 F NA
006201F 2008 F NA
006501F 2009 F NA
006601W 2007 W 2
006601W 2007 W NA
006803F 2009 F NA
007310F 2018 F NA
007602W 2017 W NA
008103W 2011 W NA
008203F 2007 F 1
Coding:
a <- distinct(a, SiteID, .keep_all = TRUE)
I would like to remove duplicates based on SiteID, but I don't want to lose rows that have actual values in the num.1 column. For example, in dataframe a, 004101W and 006601W each have multiple entries, and I want to keep the row with the integer rather than the NA.
(Thank you for updating with more representative sample data!)
a now has 20 rows, with 17 different SiteID values.
Three of those SiteIDs have multiple rows:
library(tidyverse)
a %>%
  add_count(SiteID) %>%
  filter(n > 1)
# A tibble: 6 x 5
# SiteID PYear Habitat num.1 n
# <chr> <int> <chr> <int> <int>
#1 002401F 2006 F NA 2 # Both have NA for num.1
#2 002401F 2016 F NA 2 # ""
#3 004101W 2007 W NA 2 # Drop
#4 004101W 2007 W 16 2 # Keep this one
#5 006601W 2007 W 2 2 # Keep this one
#6 006601W 2007 W NA 2 # Drop
If we want to prioritize the rows without NA in num.1, we can arrange by num.1 within each SiteID so that NAs come last for each SiteID; distinct() then keeps the first, non-NA, row per SiteID.
(An alternative is also provided in case you want to keep the original sorting in a while still moving NA values in num.1 to the end. In the is.na(num.1) term, NAs evaluate to TRUE and sort after provided values, which evaluate to FALSE.)
a %>%
  arrange(SiteID, num.1) %>%
  # arrange(SiteID, is.na(num.1)) %>% # Alternative to preserve orig order
  distinct(SiteID, .keep_all = TRUE)
SiteID PYear Habitat num.1
1 000901W 2011 W NA
2 001101W 2007 W NA
3 001801W 2005 W NA
4 002001W 2017 W NA
5 002401F 2006 F NA # Kept first appearing row, since both NA num.1
6 004001F 2006 F NA
7 004001W 2006 W NA
8 004101W 2007 W 16 # Kept non-NA row
9 004701F 2017 F NA
10 006201F 2008 F NA
11 006501F 2009 F NA
12 006601W 2007 W 2 # Kept non-NA row
13 006803F 2009 F NA
14 007310F 2018 F NA
15 007602W 2017 W NA
16 008103W 2011 W NA
17 008203F 2007 F 1
Import of sample data
a <- read.table(header = T, stringsAsFactors = F,
text = " SiteID PYear Habitat num.1
000901W 2011 W NA
001101W 2007 W NA
001801W 2005 W NA
002001W 2017 W NA
002401F 2006 F NA
002401F 2016 F NA
004001F 2006 F NA
004001W 2006 W NA
004101W 2007 W NA
004101W 2007 W 16
004701F 2017 F NA
006201F 2008 F NA
006501F 2009 F NA
006601W 2007 W 2
006601W 2007 W NA
006803F 2009 F NA
007310F 2018 F NA
007602W 2017 W NA
008103W 2011 W NA
008203F 2007 F 1")
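Another route is a sketch with grouped slicing (my addition, assumes dplyr >= 1.0): slice_min() on is.na(num.1) keeps one row per SiteID and prefers non-NA num.1, since FALSE sorts before TRUE.
a %>%
  group_by(SiteID) %>%
  slice_min(is.na(num.1), n = 1, with_ties = FALSE) %>% # non-NA rows win ties
  ungroup()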

How to get the previous matching value

How can I get the value of a from the most recent previous row whose b value matches the current row's b? E.g., b in row 3 matches the earlier row 1, and b in row 6 matches row 4.
df <- data.frame(year = c(2013, 2013, 2014, 2014, 2014, 2015, 2015, 2015, 2016, 2016, 2016),
                 a = c(10, 11, NA, 13, 22, NA, 19, NA, 10, 15, NA),
                 b = c(30.133, 29, 30.1223, 33, 17, 33, 11, 17, 14, 13.913, 14))
year a b *NEW*
2013 10 30.133 NA
2013 11 29 NA
2014 NA 30.1223 10
2014 13 33 NA
2014 22 17 NA
2015 NA 33 13
2015 19 11 NA
2015 NA 17 22
2016 10 14 NA
2016 15 13.913 10
2016 NA 14 15
Thanks
For the OP's example case
One way is to use the duplicated() function.
# Input dataframe
df <- data.frame(year = c(2013, 2013, 2014, 2014, 2014, 2015, 2015, 2015, 2016, 2016, 2016),
                 a = c(10, 11, NA, 13, 22, NA, 19, NA, 10, 15, NA),
                 b = c(30, 29, 30, 33, 17, 33, 11, 17, 14, 14, 14))
# creating a new column with default values
df$NEW <- NA
# updating the value using the previous matching position
df$NEW[duplicated(df$b)] <- df$a[duplicated(df$b, fromLast = TRUE)]
# expected output
df
# year a b NEW
# 1 2013 10 30 NA
# 2 2013 11 29 NA
# 3 2014 NA 30 10
# 4 2014 13 33 NA
# 5 2014 22 17 NA
# 6 2015 NA 33 13
# 7 2015 19 11 NA
# 8 2015 NA 17 22
# 9 2016 10 14 NA
# 10 2016 15 14 10
# 11 2016 NA 14 15
General-purpose usage
The above solution fails when the duplicates are not in sequential order. As per @DavidArenburg's advice, I have changed the fourth element with df$b[4] <- 14. The general solution requires another handy function, order(), and should work for the different possible cases.
# Input dataframe
df <- data.frame(year = c(2013, 2013, 2014, 2014, 2014, 2015, 2015, 2015, 2016, 2016, 2016),
                 a = c(10, 11, NA, 13, 22, NA, 19, NA, 10, 15, NA),
                 b = c(30, 29, 30, 14, 17, 33, 11, 17, 14, 14, 14))
# creating a new column with default values
df$NEW <- NA
# sort by the matching column so duplicates become adjacent
df <- df[order(df$b), ]
# updating the value using the previous matching position
df$NEW[duplicated(df$b)] <- df$a[duplicated(df$b, fromLast = TRUE)]
# restore the original row order
df <- df[order(as.integer(rownames(df))), ]
# expected output
df
# expected output
df
# year a b NEW
# 1 2013 10 30 NA
# 2 2013 11 29 NA
# 3 2014 NA 30 10
# 4 2014 13 14 NA
# 5 2014 22 17 NA
# 6 2015 NA 33 NA
# 7 2015 19 11 NA
# 8 2015 NA 17 22
# 9 2016 10 14 13
# 10 2016 15 14 10
# 11 2016 NA 14 15
Here, the solution is based on base R functions. I am sure there are other ways of doing this using other packages.
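One such way is a short dplyr sketch (my addition, not from the original answer): grouping by b makes lag(a) pull a from the previous occurrence of the same b value, and it handles non-sequential duplicates without re-sorting.
library(dplyr)
df %>%
  group_by(b) %>%          # rows sharing a b value form one group
  mutate(NEW = lag(a)) %>% # a from the previous occurrence of this b
  ungroup()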

Removing rows of a data frame if the number of NAs in a column is larger than 3

I have a data frame (panel data): the Ctry column indicates the name of each country in my data frame. If, in a given column (for example, Carx), the number of NAs for a country is larger than 3, I want to drop that country from my data frame. For example,
Country A has 2 NA
Country B has 4 NA
Country C has 3 NA
I want to drop country B from my data frame. I have a data frame like this (this is for illustration; my actual data frame is very large):
Ctry year Carx
A 2000 23
A 2001 18
A 2002 20
A 2003 NA
A 2004 24
A 2005 18
B 2000 NA
B 2001 NA
B 2002 NA
B 2003 NA
B 2004 18
B 2005 16
C 2000 NA
C 2001 NA
C 2002 24
C 2003 21
C 2004 NA
C 2005 24
I want to create a data frame like this:
Ctry year Carx
A 2000 23
A 2001 18
A 2002 20
A 2003 NA
A 2004 24
A 2005 18
C 2000 NA
C 2001 NA
C 2002 24
C 2003 21
C 2004 NA
C 2005 24
A fairly straightforward way in base R is to use sum(is.na(.)) along with ave to do the counting, like this:
with(mydf, ave(Carx, Ctry, FUN = function(x) sum(is.na(x))))
# [1] 1 1 1 1 1 1 4 4 4 4 4 4 3 3 3 3 3 3
Once you have that, subsetting is easy:
mydf[with(mydf, ave(Carx, Ctry, FUN = function(x) sum(is.na(x)))) <= 3, ]
# Ctry year Carx
# 1 A 2000 23
# 2 A 2001 18
# 3 A 2002 20
# 4 A 2003 NA
# 5 A 2004 24
# 6 A 2005 18
# 13 C 2000 NA
# 14 C 2001 NA
# 15 C 2002 24
# 16 C 2003 21
# 17 C 2004 NA
# 18 C 2005 24
You can use the by() function to group by Ctry and count the NAs of each group:
DF <- read.csv(
text='Ctry,year,Carx
A,2000,23
A,2001,18
A,2002,20
A,2003,NA
A,2004,24
A,2005,18
B,2000,NA
B,2001,NA
B,2002,NA
B,2003,NA
B,2004,18
B,2005,16
C,2000,NA
C,2001,NA
C,2002,24
C,2003,21
C,2004,NA
C,2005,24',
stringsAsFactors=F)
res <- by(data = DF$Carx, INDICES = DF$Ctry, FUN = function(x) sum(is.na(x)))
validCtry <- names(res)[res <= 3]
DF[DF$Ctry %in% validCtry, ]
# Ctry year Carx
#1 A 2000 23
#2 A 2001 18
#3 A 2002 20
#4 A 2003 NA
#5 A 2004 24
#6 A 2005 18
#13 C 2000 NA
#14 C 2001 NA
#15 C 2002 24
#16 C 2003 21
#17 C 2004 NA
#18 C 2005 24
EDIT :
if you have more columns to check, you could adapt the previous code as follows:
res <- by(data = DF, INDICES = DF$Ctry,
          FUN = function(x) {
            return(sum(is.na(x$Carx)) <= 3 &&
                   sum(is.na(x$Barx)) <= 3 &&
                   sum(is.na(x$Tarx)) <= 3)
          })
validCtry <- names(res)[res]
DF[DF$Ctry %in% validCtry, ]
validCtry <- names(res)[res]
DF[DF$Ctry %in% validCtry, ]
where, of course, you may change the condition in FUN according to your needs.
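If the set of columns grows, a compact generalization (a sketch of my own; Barx and Tarx are the question's hypothetical column names) checks them all at once with colSums:
cols <- c("Carx", "Barx", "Tarx") # columns to check; adjust to your data
res <- by(DF[cols], DF$Ctry, function(x) all(colSums(is.na(x)) <= 3))
validCtry <- names(res)[unlist(res)]
DF[DF$Ctry %in% validCtry, ]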
Since you mention that your data is "very huge" (whatever that means exactly), you could try a solution with dplyr and see if it is faster than the solutions in base R. If the others are fast enough, just ignore this one.
require(dplyr)
newdf <- df %>% group_by(Ctry) %>% filter(sum(is.na(Carx)) <= 3)
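For genuinely large data, a data.table sketch (my addition, assuming the data.table package) performs the same grouped filter:
library(data.table)
DT <- as.data.table(DF)
DT[, if (sum(is.na(Carx)) <= 3) .SD, by = Ctry]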
