Avoid padding with NA when using dplyr::left_join - r

I am joining two data frames using left_join from dplyr. Here is an MWE:
library(dplyr)
dfOne <- data.frame(1:10,
8*(1:10),
c(2,4,6,8,10,12,14,16,18,20) )
colnames(dfOne)<-c("one", "two", "three")
dfTwo <- data.frame(1:6,
8*(1:6),
c(2,4,6,8,10,12) )
colnames(dfTwo)<-c("one", "two", "three")
left_join(dfOne[c("one", "two")], dfTwo[c("two", "three")], by="two")
This gives the following output (as expected)
one two three
1 1 8 2
2 2 16 4
3 3 24 6
4 4 32 8
5 5 40 10
6 6 48 12
7 7 56 NA
8 8 64 NA
9 9 72 NA
10 10 80 NA
Column three is padded with NA in all rows where the value of dfOne$two does not exist in dfTwo$two. However, is it possible to use left_join in such a way that the NA values are avoided and the cells are empty (NULL) instead?

I'm not sure I understand your question correctly, but if I do, it might help to know that NA in R plays the same role as NULL in SQL: it marks a missing value. If you want NA to appear as "", simply assign the result of the left join to a name (for example lj_df) and replace all the NAs. Instead of "" you could use 0, "Null", or anything else you like. Note that replacing with a string such as "" converts the affected column to character.
lj_df <- left_join(dfOne[c("one", "two")], dfTwo[c("two", "three")], by="two")
lj_df[is.na(lj_df)] <- ""
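A minimal sketch of what the replacement does to column types, using base R's merge(..., all.x = TRUE) as a stand-in for left_join so that it runs without dplyr. The names blank_df and csv_lines are introduced here for illustration only. It also shows one possible alternative, assuming the goal is only to have empty cells in an exported file: write.csv takes an na argument, so the data can stay numeric.

```r
# Same data as in the question; merge(all.x = TRUE) keeps every row of the left table
dfOne <- data.frame(one = 1:10, two = 8 * (1:10))
dfTwo <- data.frame(two = 8 * (1:6), three = seq(2, 12, by = 2))
lj_df <- merge(dfOne, dfTwo, by = "two", all.x = TRUE)

class(lj_df$three)                # numeric, with NA in the unmatched rows

# Replacing NA with "" coerces the affected column to character
blank_df <- lj_df
blank_df[is.na(blank_df)] <- ""
class(blank_df$three)             # character

# Alternative: keep NA in the data and blank it only when writing out
csv_lines <- capture.output(write.csv(lj_df, na = "", row.names = FALSE))
```

Keeping NA in the data frame and blanking it only on output means arithmetic on column three still works afterwards.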


Binding rows from list with meaningful duplicates in R [duplicate]

This question already has an answer here:
How to collapse many records into one while removing NA values
(1 answer)
Closed 2 years ago.
I need to merge different data frames from a list by row while keeping some of the information contained in the duplicate rows. Each row contains daily observations of some variables (stock prices), and each data frame covers a different time span (years). From one data frame to the next, some variables may change (the columns are the stocks inside the index). bind_rows from dplyr does a great job of simply adding columns for the new variables and leaving NAs elsewhere.
The problem is that some of the data frames contain the last day of the previous period (which has therefore already been bound from the previous data frame), but they differ slightly in the variables shown (columns). I don't want to eliminate one of the duplicate rows completely, because both contain information I need; I would rather merge them. The duplicate rows contain either the same value (because they refer to the same day) or one NA and one value (because they refer to different variables in the set). How can I do this?
The problem can be summarized in the following example:
library(dplyr)
df_1 <- data.frame(Date=c(1:4),A=c(20,30,20,30),B=c(15,16,15,16))
df_2 <- data.frame(Date=c(4:7),A=c(30,35,60,40),C=c(15,18,25,20))
dfs<-list(df_1,df_2)
bind_rows(dfs)
Outcome:
Date A B C
1 1 20 15 NA
2 2 30 16 NA
3 3 20 15 NA
4 4 30 16 NA
5 4 30 NA 15
6 5 35 NA 18
7 6 60 NA 25
8 7 40 NA 20
Desired outcome:
Date A B C
1 1 20 15 NA
2 2 30 16 NA
3 3 20 15 NA
4 4 30 16 15
5 5 35 NA 18
6 6 60 NA 25
7 7 40 NA 20
Instead of binding rows, you can do a full join on the Date and A columns.
library(dplyr)
full_join(df_1, df_2, by = c('Date', 'A'))
# Thanks to @duckmayr for the suggestion.
# Date A B C
#1 1 20 15 NA
#2 2 30 16 NA
#3 3 20 15 NA
#4 4 30 16 15
#5 5 35 NA 18
#6 6 60 NA 25
#7 7 40 NA 20
In base R, this can be done as:
merge(df_1, df_2, by = c('Date', 'A'), all = TRUE)
If the data frames are in a list, we can use purrr::reduce
purrr::reduce(dfs, full_join, by = c('Date', 'A'))
Or, in base R, Reduce
Reduce(function(x, y) merge(x, y, by = c('Date', 'A'), all = TRUE), dfs)
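As a rough illustration of how the Reduce pattern extends beyond two data frames, here is a base R sketch with a third, hypothetical data frame df_3 (not part of the question). It assumes Date and A are the only columns shared between the data frames; if other columns overlapped, merge would create suffixed duplicates such as B.x and B.y.

```r
df_1 <- data.frame(Date = 1:4, A = c(20, 30, 20, 30), B = c(15, 16, 15, 16))
df_2 <- data.frame(Date = 4:7, A = c(30, 35, 60, 40), C = c(15, 18, 25, 20))
# df_3 is a made-up third period that repeats the last day of df_2
df_3 <- data.frame(Date = 7:8, A = c(40, 45), D = c(1, 2))
dfs <- list(df_1, df_2, df_3)

# Reduce folds the list pairwise: merge(merge(df_1, df_2), df_3)
merged <- Reduce(
  function(x, y) merge(x, y, by = c("Date", "A"), all = TRUE),
  dfs
)
merged
```

The anonymous function must merge its own arguments x and y; merging fixed data frames inside it would ignore the list entirely.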

Count unique values in Raster Data in R

I have these Raster Datasets, which look like this
1 2 3 4 5
1 NA NA NA 10 NA
2 7 3 7 10 10
3 NA 3 7 3 3
4 9 9 NA 3 7
5 3 NA 7 NA NA
via
MyRaster1 <- raster("MyRaster_EUNIS1.tif")
head(MyRaster1)
I created that table.
Using unique(MyRaster1) I get 3 7 9 10.
What I need are the counts of these unique values in the raster dataset.
I have tried quite a few approaches. One of them works, but it is a lot of trouble, and I can't get a loop to work for all the raster datasets I have.
Classes1 <- as.factor(unique(values(MyRaster1)))[!is.na(unique(values(MyRaster1)))]
val1 <- unique(MyRaster1)
Tab1 <- matrix(nrow = length(values(MyRaster1)), ncol = length(val1))
colnames(Tab1) <- levels(unique(Classes1))
Tab1 <- Tab1[!is.na(Tab1[,1]),]
colSums(Tab1)
It seems to work properly, until I try to delete the NA values. When I use colSums before that, I get NA as result for each column, after I delete the NA values, I get 0.
This is my first time using R, so I'm a real novice. I've researched quite a lot, but since I hardly understand the language at all, this is the furthest I have gotten.
Thank you for your help.
Edit:
table(MyRaster1)
gives me this: Error in unique.default(x, nmax = nmax) :
unique() applies only to vectors
The best result would be:
3 7 9 10
6 5 2 3
But I'd also be ok with a different format which I could use in Excel.
Use raster::freq()
Here's an example for the first two rows of your data:
library(raster)
r <- raster(matrix(c(NA,NA,NA,10,NA,7,3,7,10,10), nrow = 2, ncol = 5, byrow = TRUE))
> freq(r)
value count
[1,] 3 1
[2,] 7 2
[3,] 10 3
[4,] NA 4
Note that freq rounds values by default (its digits argument defaults to 0) unless explicitly told not to:
https://www.rdocumentation.org/packages/raster/versions/3.0-7/topics/freq
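For comparison, a base R sketch of the same count using table() on a plain numeric vector. Here vals is a hand-typed stand-in for values(MyRaster1), built from the 5 x 5 extract shown in the question, and counts_df and the file name in the comment are illustrative assumptions. table() fails on the raster object itself (as in the question's edit) but works on the vector of cell values.

```r
# Cell values from the 5 x 5 extract in the question, row by row
vals <- c(NA, NA, NA, 10, NA,
           7,  3,  7, 10, 10,
          NA,  3,  7,  3,  3,
           9,  9, NA,  3,  7,
           3, NA,  7, NA, NA)

counts <- table(vals)   # NA values are dropped by default
counts

# Two-column layout that can be written to a file and opened in Excel
counts_df <- as.data.frame(counts, stringsAsFactors = FALSE)
names(counts_df) <- c("value", "count")
# write.csv(counts_df, "counts.csv", row.names = FALSE)  # hypothetical file name
```

For several rasters, the same steps could be wrapped in a function and applied to a list of file names with lapply.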

Data.table: rbind a list of data tables with unequal columns [duplicate]

This question already has answers here:
rbindlist data.tables with different number of columns
(1 answer)
Rbind with new columns and data.table
(5 answers)
Closed 4 years ago.
I have a list of data tables that are of unequal lengths. Some of the data tables have 35 columns and others have 36.
I have this line of code, but it generates an error
> lst <- unlist(full_data.lst, recursive = FALSE)
> model_dat <- do.call("rbind", lst)
Error in rbindlist(l, use.names, fill, idcol) :
Item 1362 has 35 columns, inconsistent with item 1 which has 36 columns. If instead you need to fill missing columns, use set argument 'fill' to TRUE.
Any suggestions on how I can modify this so that it works properly?
Here's a minimal example of what you are trying to do. There is no need for any other package: just set fill=TRUE in rbindlist.
library(data.table)
df1 <- data.table(m1 = c(1,2,3))
df2 <- data.table(m1 = c(1,2,3), m2=c(3,4,5))
df3 <- rbindlist(list(df1, df2), fill=TRUE)
print(df3)
m1 m2
1: 1 NA
2: 2 NA
3: 3 NA
4: 1 3
5: 2 4
6: 3 5
If I understood your question correctly, I can see only two options for appending your data tables (below, table stands for one data table in the list).
Option A: Drop the extra variable from the data tables that have it
table$column_Name <- NULL
Option B: Create the variable, filled with missing values, in each data table that lacks it
table$column_Name <- NA
Then call rbind.
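Option B could be sketched in base R roughly as follows, using plain data frames and two small made-up inputs. The names all_cols, padded, and combined are illustrative and do not come from the question.

```r
df1 <- data.frame(m1 = c(1, 2, 3))
df2 <- data.frame(m1 = c(1, 2, 3), m2 = c(3, 4, 5))
lst <- list(df1, df2)

# Union of all column names that appear anywhere in the list
all_cols <- unique(unlist(lapply(lst, names)))

# Add each missing column as NA, then put the columns in a common order
padded <- lapply(lst, function(d) {
  for (col in setdiff(all_cols, names(d))) d[[col]] <- NA
  d[all_cols]
})

combined <- do.call(rbind, padded)
combined
```

This is essentially what fill=TRUE in rbindlist and rbind.fill from plyr do for you.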
Try rbind.fill from the plyr package.
Input data: 3 data frames with different numbers of columns
df1<-data.frame(a=c(1,2,3,4,5),b=c(1,2,3,4,5))
df2<-data.frame(a=c(1,2,3,4,5,6),b=c(1,2,3,4,5,6),c=c(1,2,3,4,5,6))
df3<-data.frame(a=c(1,2,3),d=c(1,2,3))
full_data.lst<-list(df1,df2,df3)
The solution
library("plyr")
rbind.fill(full_data.lst)
a b c d
1 1 1 NA NA
2 2 2 NA NA
3 3 3 NA NA
4 4 4 NA NA
5 5 5 NA NA
6 1 1 1 NA
7 2 2 2 NA
8 3 3 3 NA
9 4 4 4 NA
10 5 5 5 NA
11 6 6 6 NA
12 1 NA NA 1
13 2 NA NA 2
14 3 NA NA 3

Skip NA values using "FUN=first"

There's probably a really simple explanation for what I'm doing wrong, but I've been working on this for quite some time today and I still cannot get it to work. I thought this would be a walk in the park; however, my code isn't working as expected.
So for this example, let's say I have a data frame as follows.
df
Row# user columnB
1 1 NA
2 1 NA
3 1 NA
4 1 31
5 2 NA
6 2 NA
7 2 15
8 3 18
9 3 16
10 3 NA
Basically, I would like to create a new column that uses the first (as well as last) function (within the TTR library package) to obtain the first non-NA value for each user. So my desired data frame would be this.
df
Row# user columnB firstValue
1 1 NA 31
2 1 NA 31
3 1 NA 31
4 1 31 31
5 2 NA 15
6 2 NA 15
7 2 15 15
8 3 18 18
9 3 16 18
10 3 NA 18
I've looked around mainly using google, but I couldn't really find my exact answer.
Here's some of my code that I've tried, but I didn't get the results that I wanted (note, I'm bringing this from memory, so there are quite a few more variations of these, but these are the general forms that I've been trying).
df$firstValue<-ave(df$columnB,df$user,FUN=first,na.rm=True)
df$firstValue<-ave(df$columnB,df$user,FUN=function(x){x,first,na.rm=True})
df$firstValue<-ave(df$columnB,df$user,FUN=function(x){first(x,na.rm=True)})
df$firstValue<-by(df,df$user,FUN=function(x){x,first,na.rm=True})
These failed: they just give the first value of each group, which would be NA.
Again, these are just a few examples from the top of my head, I played around with na.rm, using na.exclude, na.omit, na.action(na.omit), etc...
Any help would be greatly appreciated. Thanks.
A data.table solution
require(data.table)
DT <- data.table(df, key="user")
DT[, firstValue := na.omit(columnB)[1], by=user]
Here is a solution with plyr :
library(plyr)
ddply(df, .(user), transform, firstValue=na.omit(columnB)[1])
Which gives :
Row user columnB firstValue
1 1 1 NA 31
2 2 1 NA 31
3 3 1 NA 31
4 4 1 31 31
5 5 2 NA 15
6 6 2 NA 15
7 7 2 15 15
8 8 3 18 18
9 9 3 16 18
10 10 3 NA 18
If you want to capture the last value, you can do :
ddply(df, .(user), transform, lastValue=tail(na.omit(columnB),1))
Using data.table
library (data.table)
DT <- data.table(df, key="user")
DT <- setnames(DT[unique(DT[!is.na(columnB), list(columnB), by="user"])], "columnB.1", "first")
Using a very small helper function
finite <- function(x) x[is.finite(x)]
here is a one-liner using only standard R functions:
df <- cbind(df, firstValue = unlist(sapply(unique(df[,1]), function(user) rep(finite(df[df[,1] == user,2])[1], sum(df[,1] == user))))
For a better overview, here is the one-liner unfolded into a "multi-liner":
# for each user, find the first finite (in this case non-NA) value of the second column and replicate it as many times as the user has rows
# then, the results of all users are joined into one vector (unlist) and appended to the data frame as column
df <- cbind(
df,
firstValue = unlist(
sapply(
unique(df[,1]),
function(user) {
rep(
finite(df[df[,1] == user,2])[1],
sum(df[,1] == user)
)
}
)
)
)
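Since the question's attempts were built on ave, here is a sketch of how that approach could be made to work in base R: the function passed to FUN drops the NAs itself instead of relying on an na.rm argument. The data frame is retyped from the question, and lastValue is a column name chosen here for illustration.

```r
df <- data.frame(
  user    = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3),
  columnB = c(NA, NA, NA, 31, NA, NA, 15, 18, 16, NA)
)

# ave() applies FUN to columnB within each user and recycles the result over the group
df$firstValue <- ave(df$columnB, df$user, FUN = function(x) na.omit(x)[1])
df$lastValue  <- ave(df$columnB, df$user, FUN = function(x) tail(na.omit(x), 1))
df
```

A group whose values are all NA would get NA from the first-value version; the last-value version would need an extra guard for that case, since tail() then returns a zero-length vector.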

Removing NAs when multiplying columns

This is a really simple question, but I am hoping someone will be able to help me avoid extra lines of unnecessary code. I have a simple dataframe:
Df.1 <- data.frame(A = c(5,4,7,6,8,4),B = (c(1,5,2,4,9,1)),C=(c(2,3,NA,5,NA,9)))
What I want to do is produce an extra column which is the multiplication of A, B and C, which I will then cbind to the original dataframe.
So, I would normally use:
attach(Df.1)
D<-A*B*C
But obviously where the NAs are in column C, I get an NA in variable D. I don't want to exclude all the NA rows, but rather just ignore the NA values in this column, so that the value in D would simply be the product of A and B or, where C is available, A*B*C.
I know I could simply replace the NAs with 1s, so the calculation remains unchanged, or use if statements, but I was wondering what the simplest way of doing this is.
Any ideas?
You can use prod which has an na.rm argument. To do it by row use apply:
apply(Df.1,1,prod,na.rm=TRUE)
[1] 10 60 14 120 72 36
As @James said, prod and apply will work, but you don't need to waste memory storing the result in a separate variable, or even cbind it:
Df.1$D = apply(Df.1, 1, prod, na.rm=T)
Assigning the new variable in the data frame directly will work.
> Df.1 <- data.frame(A = c(5,4,7,6,8,4),B = (c(1,5,2,4,9,1)),C=(c(2,3,NA,5,NA,9)))
> Df.1
A B C
1 5 1 2
2 4 5 3
3 7 2 NA
4 6 4 5
5 8 9 NA
6 4 1 9
> Df.1$D = apply(Df.1, 1, prod, na.rm=T)
> Df.1$D
[1] 10 60 14 120 72 36
> Df.1
A B C D
1 5 1 2 10
2 4 5 3 60
3 7 2 NA 14
4 6 4 5 120
5 8 9 NA 72
6 4 1 9 36
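One caveat, shown as a small sketch: once D has been added, running apply over the whole data frame again would multiply D in as well, so it can be safer to name the input columns explicitly. The names cols and D2 are illustrative; D2 is the "replace NA with 1" variant mentioned in the question, included only to check that both approaches agree.

```r
Df.1 <- data.frame(A = c(5, 4, 7, 6, 8, 4),
                   B = c(1, 5, 2, 4, 9, 1),
                   C = c(2, 3, NA, 5, NA, 9))

# Restrict the row-wise product to the input columns
cols <- c("A", "B", "C")
Df.1$D <- apply(Df.1[cols], 1, prod, na.rm = TRUE)

# Vectorised variant: treat a missing C as 1 so it leaves A * B unchanged
Df.1$D2 <- with(Df.1, A * B * ifelse(is.na(C), 1, C))
Df.1
```

The vectorised form avoids the row-by-row apply call, which may matter on larger data frames.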
