Add columns in vector but not in df - r

I am trying to do the following and was wondering if there is an easier way to use dplyr to achieve this (I'm sure there is):
I want to compare the columns of a dataframe to a vector of names, and if the df does not contain a column corresponding to one of the names in the name vector, add that column to the df and populate its values with NAs.
E.g., in the MWE below:
df <- data.frame(cbind(c(1:6),c(11:16),c(10:15)))
colnames(df) <- c("A","B","C")
names <- c("A","B","C","D","E")
How do I use dplyr to create the two columns D and E (which are in names but not in df) and populate them with NAs?

There's no need for dplyr; this is a basic operation in base R. (By the way, avoid masking built-in functions such as names with your own objects. names(df) still works here because, when a name is used in a function call, R skips non-function objects while looking it up, but it is still bad practice.)
df[setdiff(names, names(df))] <- NA
df
# A B C D E
# 1 1 11 10 NA NA
# 2 2 12 11 NA NA
# 3 3 13 12 NA NA
# 4 4 14 13 NA NA
# 5 5 15 14 NA NA
# 6 6 16 15 NA NA
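If this comes up often, the same idea can be wrapped in a small function. This is only a sketch: add_missing_cols and wanted are hypothetical names chosen for illustration (wanted also avoids masking names), and the guard simply skips the assignment when nothing is missing.

```r
# Hypothetical helper: add every column named in `wanted` that `df` lacks,
# filled with `fill` (NA by default).
add_missing_cols <- function(df, wanted, fill = NA) {
  missing_cols <- setdiff(wanted, names(df))
  if (length(missing_cols) > 0) {
    df[missing_cols] <- fill
  }
  df
}

df <- data.frame(A = 1:6, B = 11:16, C = 10:15)
wanted <- c("A", "B", "C", "D", "E")

out <- add_missing_cols(df, wanted)
names(out)
```

Calling it again on out leaves the data frame unchanged, since no columns are missing any more.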

Related

copy values from different columns based on conditions (r code)

I have data like the one in the picture, with two columns (Cday, Dday) that have some missing values.
There can't be a row where there are values for both columns; there's a value on either one column or the other or in neither.
I want to create the column "new" that has copied values from whichever column there was a number.
Really appreciate any help!
Since no row has a value in both columns, you can just sum the two existing columns. Assume your data frame is called df.
df$new <- rowSums(df[, 2:3], na.rm = TRUE)
This sums each row while ignoring NAs. Note that a row where both columns are NA gets 0 rather than NA, and you may need to adjust the column numbers if you have more columns than what you've shown.
The dplyr package has the coalesce function.
library(dplyr)
df <- data.frame(id=1:8, Cday=c(1,2,NA,NA,3,NA,2,NA), Dday=c(NA,NA,NA,3,NA,2,NA,1))
# coalesce() takes the first non-missing value per row; it has no na.rm argument
new <- df %>% mutate(new = coalesce(Dday, Cday))
new
# id Cday Dday new
#1 1 1 NA 1
#2 2 2 NA 2
#3 3 NA NA NA
#4 4 NA 3 3
#5 5 3 NA 3
#6 6 NA 2 2
#7 7 2 NA 2
#8 8 NA 1 1
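The two answers differ on rows where both columns are missing. A base R sketch of that difference, using the same example data; ifelse stands in for coalesce here so that no package is needed, and summed and picked are illustrative names:

```r
df <- data.frame(id = 1:8,
                 Cday = c(1, 2, NA, NA, 3, NA, 2, NA),
                 Dday = c(NA, NA, NA, 3, NA, 2, NA, 1))

# Summing with na.rm = TRUE turns an all-NA row into 0
summed <- rowSums(df[, c("Cday", "Dday")], na.rm = TRUE)

# Picking the non-missing value keeps NA when both are missing
picked <- ifelse(is.na(df$Cday), df$Dday, df$Cday)

summed[3]  # 0
picked[3]  # NA
```

Which behavior is right depends on whether 0 is a meaningful value in your data.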

R - counting with NA in dataframe [duplicate]

Let's say I have this data frame in R:
df <- read.table(text="
id a b c
1 42 3 2 NA
2 42 NA 6 NA
3 42 1 NA 7", header=TRUE)
I'd like to sum the columns into a single new column, so the result should look like this:
id a b c d
1 42 3 2 NA 5
2 42 NA 6 NA 6
3 42 1 NA 7 8
My code below doesn't work because of the NA values. Please note that I have to choose which columns to sum, since my real data frame has some columns that I don't want to include.
df %>%
mutate(d = a + b + c)
You can use rowSums for this which has an na.rm parameter to drop NA values.
df %>% mutate(d=rowSums(tibble(a,b,c), na.rm=TRUE))
or without dplyr using just base R.
df$d <- rowSums(subset(df, select=c(a,b,c)), na.rm=TRUE)
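Since the columns to sum have to be chosen explicitly, it may be convenient to keep their names in a character vector. A minimal base R sketch; cols is an illustrative name, and the data frame is rebuilt with data.frame rather than read.table:

```r
df <- data.frame(id = c(42, 42, 42),
                 a = c(3, NA, 1),
                 b = c(2, 6, NA),
                 c = c(NA, NA, 7))

# Only the columns listed here take part in the sum; id is left out
cols <- c("a", "b", "c")
df$d <- rowSums(df[cols], na.rm = TRUE)
df$d  # 5 6 8
```

Adding or dropping a column from the sum then only means editing cols.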

Data.table: rbind a list of data tables with unequal columns [duplicate]

I have a list of data tables with unequal numbers of columns. Some of the data tables have 35 columns and others have 36.
I have this line of code, but it generates an error
> lst <- unlist(full_data.lst, recursive = FALSE)
> model_dat <- do.call("rbind", lst)
Error in rbindlist(l, use.names, fill, idcol) :
Item 1362 has 35 columns, inconsistent with item 1 which has 36 columns. If instead you need to fill missing columns, use set argument 'fill' to TRUE.
Any suggestions on how I can modify that so that it works properly?
No other package is needed for this. Just set fill=TRUE in rbindlist.
Here's a minimal example of what you are trying to do:
library(data.table)
df1 <- data.table(m1 = c(1,2,3))
df2 <- data.table(m1 = c(1,2,3), m2=c(3,4,5))
# fill = TRUE pads columns missing from a table with NA
df3 <- rbindlist(list(df1, df2), fill=TRUE)
print(df3)
m1 m2
1: 1 NA
2: 2 NA
3: 3 NA
4: 1 3
5: 2 4
6: 3 5
If I understood your question correctly, I can see only two options for appending your data tables.
Option A: Drop the extra variable from the dataset that has it.
table$column_Name <- NULL
Option B: Create the variable with missing values in the incomplete dataset (the table itself, not the list that holds the tables).
table$column_Name <- NA
Then call rbind.
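Option B can be applied to a whole list at once. This is a base R sketch using plain data frames as stand-ins for the data tables; tables, all_cols and padded are illustrative names, not anything from the question:

```r
df1 <- data.frame(m1 = c(1, 2, 3))
df2 <- data.frame(m1 = c(1, 2, 3), m2 = c(3, 4, 5))
tables <- list(df1, df2)

# Union of all column names that appear anywhere in the list
all_cols <- unique(unlist(lapply(tables, names)))

# Add each missing column as NA, then put the columns in a common order
padded <- lapply(tables, function(d) {
  for (col in setdiff(all_cols, names(d))) d[[col]] <- NA
  d[all_cols]
})

combined <- do.call(rbind, padded)
combined
```

Once every element has the same columns, the plain rbind call from the question no longer fails.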
Try rbind.fill from the plyr package.
Input data: three data frames with different numbers of columns
df1<-data.frame(a=c(1,2,3,4,5),b=c(1,2,3,4,5))
df2<-data.frame(a=c(1,2,3,4,5,6),b=c(1,2,3,4,5,6),c=c(1,2,3,4,5,6))
df3<-data.frame(a=c(1,2,3),d=c(1,2,3))
full_data.lst<-list(df1,df2,df3)
The solution
library("plyr")
rbind.fill(full_data.lst)
a b c d
1 1 1 NA NA
2 2 2 NA NA
3 3 3 NA NA
4 4 4 NA NA
5 5 5 NA NA
6 1 1 1 NA
7 2 2 2 NA
8 3 3 3 NA
9 4 4 4 NA
10 5 5 5 NA
11 6 6 6 NA
12 1 NA NA 1
13 2 NA NA 2
14 3 NA NA 3

Convert entire data frame into one long column (vector)

I want to turn the entire content of a numeric (incl. NA's) data frame into one column. What would be the smartest way of achieving the following?
>df <- data.frame(C1=c(1,NA,3),C2=c(4,5,NA),C3=c(NA,8,9))
>df
C1 C2 C3
1 1 4 NA
2 NA 5 8
3 3 NA 9
>x <- mysterious_operation(df)
>x
[1] 1 NA 3 4 5 NA NA 8 9
I want to calculate the mean of this vector, so ideally I'd want to remove the NA's within the mysterious_operation - the data frame I'm working on is very large so it will probably be a good idea.
Here's a couple ways with purrr:
# using invoke, a wrapper around do.call
purrr::invoke(c, df, use.names = FALSE)
# similar to unlist, reduce list of lists to a single vector
purrr::flatten_dbl(df)
Both return:
[1] 1 NA 3 4 5 NA NA 8 9
The mysterious operation you are looking for is called unlist:
> df <- data.frame(C1=c(1,NA,3),C2=c(4,5,NA),C3=c(NA,8,9))
> unlist(df, use.names = F)
[1] 1 NA 3 4 5 NA NA 8 9
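Since the end goal is the mean, the NAs can be handled either by dropping them from the vector or by letting mean skip them. A short sketch (x_clean is an illustrative name):

```r
df <- data.frame(C1 = c(1, NA, 3), C2 = c(4, 5, NA), C3 = c(NA, 8, 9))

x <- unlist(df, use.names = FALSE)

# Option 1: drop the NAs from the vector itself
x_clean <- x[!is.na(x)]

# Option 2: keep the vector as is and let mean() ignore the NAs
mean(x, na.rm = TRUE)  # 5
```

Both routes give the same mean; the first is only needed if the NA-free vector is used again later.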
We can use unlist and create a single column data.frame
df1 <- data.frame(col =unlist(df))
Just for fun. Of course unlist is the most appropriate function.
alternative
stack(df)[,1]
alternative
do.call(c,df)
do.call(c,c(df,use.names=F)) #unnamed version
Maybe they are more mysterious.

Avoid padding with NA when using dplyr::left_join

I am joining two dataframes using left_join from dplyr. Here is a MWE:
library(dplyr)
dfOne <- data.frame(1:10,
8*(1:10),
c(2,4,6,8,10,12,14,16,18,20) )
colnames(dfOne)<-c("one", "two", "three")
dfTwo <- data.frame(1:6,
8*(1:6),
c(2,4,6,8,10,12) )
colnames(dfTwo)<-c("one", "two", "three")
left_join(dfOne[c("one", "two")], dfTwo[c("two", "three")], by="two")
This gives the following output (as expected)
one two three
1 1 8 2
2 2 16 4
3 3 24 6
4 4 32 8
5 5 40 10
6 6 48 12
7 7 56 NA
8 8 64 NA
9 9 72 NA
10 10 80 NA
Column three is padded with NA in all rows where the value of dfOne$two has no match in dfTwo$two. However, is it possible to use left_join so that those cells are empty (NULL) instead of NA?
I'm not sure I understand your question correctly, but it may help to know that NA in R plays the same role as NULL in SQL. If you want NA to appear as "", assign the result of the left join to a name (for example lj_df) and replace all the NAs. You could replace them with "", 0, "Null", or anything else you like; note that replacing with a string converts the affected column to character.
lj_df <- left_join(dfOne[c("one", "two")], dfTwo[c("two", "three")], by="two")
lj_df[is.na(lj_df)] <- ""
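One side effect worth checking is the column type after the replacement. The sketch below uses base merge with all.x = TRUE as a stand-in for left_join so that it runs without dplyr; unlike left_join, merge sorts by the key and puts the key column first, and the data frames are cut down to the columns actually joined.

```r
dfOne <- data.frame(one = 1:10, two = 8 * (1:10))
dfTwo <- data.frame(two = 8 * (1:6), three = c(2, 4, 6, 8, 10, 12))

# Left join: keep every row of dfOne, pad unmatched rows with NA
lj_df <- merge(dfOne, dfTwo, by = "two", all.x = TRUE)
class(lj_df$three)  # "numeric"

lj_df[is.na(lj_df)] <- ""
class(lj_df$three)  # "character"
```

If the column has to stay numeric for later calculations, it is probably better to keep the NAs and only blank them out when printing or exporting.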
