Appending Datasets in R

I have 2 datasets:
Data1:
Var1 Var2 Var3 Var4
10 10 2 3
9 2 8 3
6 4 4 8
7 3 10 8
Data2:
Var1 Var5 Var3 Var6
3 6 6 4
1 2 5 1
9 2 2 9
2 6 3 2
Now I want to append these 2 datasets:
Final Data:
Var1 Var2 Var3 Var4 Var5 Var6
10 10 2 3
9 2 8 3
6 4 4 8
7 3 10 8
3 4 6 6
1 1 2 5
9 9 2 2
2 2 6 3
I can't use rbind to create this dataset. Can anybody please tell me the method to create this dataset? Also, suppose I want to append multiple (more than 2) datasets. What's the procedure?

I recommend the function rbind.fill of the plyr package:
library(plyr)
rbind.fill(Data1, Data2)
# Var1 Var2 Var3 Var4 Var5 Var6
#1 10 10 2 3 NA NA
#2 9 2 8 3 NA NA
#3 6 4 4 8 NA NA
#4 7 3 10 8 NA NA
#5 3 NA 6 NA 6 4
#6 1 NA 5 NA 2 1
#7 9 NA 2 NA 2 9
#8 2 NA 3 NA 6 2
The major advantage of this technique is that it's not limited to two data frames, but allows combining any number of data frames.
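For instance, a quick sketch with a hypothetical third data frame (DataX below is made up purely for illustration):
library(plyr)
# DataX is a made-up third data frame for illustration
DataX <- data.frame(Var1 = c(2, 10), Var5 = c(4, 6), Var6 = c(1, 1))
# every column missing from an input is padded with NA
rbind.fill(Data1, Data2, DataX)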
If the data still needs to be read from disk, you can do something like:
file_list = list.files()
data_list = lapply(file_list, read.table)
data_combined = do.call("rbind.fill", data_list)
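If the files live in a different directory than the working directory, presumably the full paths need to be kept so read.table can find them:
# full.names = TRUE keeps the (hypothetical) directory path in each file name
file_list = list.files("path/to/data", full.names = TRUE)
data_list = lapply(file_list, read.table)
data_combined = do.call("rbind.fill", data_list)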

merge(Data1, Data2, all=TRUE, sort=FALSE)
Var1 Var3 Var2 Var4 Var5 Var6
1 10 2 10 3 NA NA
2 9 8 2 3 NA NA
3 6 4 4 8 NA NA
4 7 10 3 8 NA NA
5 3 6 NA NA 6 4
6 1 5 NA NA 2 1
7 9 2 NA NA 2 9
8 2 3 NA NA 6 2
EDIT: A way to combine multiple frames
As detailed below.
Combining more than 2 frames
Data3
Var1 Var3 Var5 Var6
1 2 6 4 1
2 10 1 6 1
3 1 6 3 1
4 9 5 5 7
We'll need to put your data into a list and use a nice package called reshape.
datalist <- list(Data1, Data2, Data3)
library(reshape)
merge_recurse(datalist)
Var1 Var3 Var2 Var4 Var5 Var6
1 10 2 10 3 NA NA
2 9 8 2 3 NA NA
3 6 4 4 8 NA NA
4 7 10 3 8 NA NA
5 3 6 NA NA 6 4
6 1 5 NA NA 2 1
7 9 2 NA NA 2 9
8 2 3 NA NA 6 2
9 2 6 NA NA 4 1
10 10 1 NA NA 6 1
11 1 6 NA NA 3 1
12 9 5 NA NA 5 7
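If you prefer to avoid the extra dependency, the same result can presumably be obtained in base R by folding merge over the list with Reduce (a sketch, not part of the original answer; the row order may differ from merge_recurse):
# base-R alternative: repeatedly full-join the data frames in the list
Reduce(function(x, y) merge(x, y, all = TRUE, sort = FALSE), datalist)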

# Put only the data files to be combined in a dedicated directory and set it as the working directory
combinedfiles <- function(){
  # readTab: read one space-separated file with a header row
  readTab <- function(y) {
    read.table(y, header = TRUE, sep = " ")
  }
  # start with NULL; rbind(NULL, df) simply returns df on the first pass
  objectcontent <- NULL
  for (y in list.files(getwd())) {
    # note: plain rbind requires every file to have the same columns
    objectcontent <- rbind(objectcontent, readTab(y))
  }
  return(objectcontent)
}
#Then type the following in the console
combinedfiles()
A version using an apply loop (which avoids the slowdown of growing a data frame with rbind inside a loop):
combined_files = function(file_path, extension = "csv") {
  require(plyr)
  # full.names = TRUE keeps the directory path so read.table can locate the files
  file_list = list.files(file_path, pattern = extension, full.names = TRUE)
  data_list = lapply(file_list, read.table, header = TRUE, sep = " ")
  combined_data = do.call("rbind.fill", data_list)
  return(combined_data)
}
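Usage would then look something like this (the directory name is purely hypothetical):
# combine all files whose names match "csv" in the (hypothetical) directory
combined_data <- combined_files("path/to/data/dir")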

Try this:
data1 <- read.table("data1", header = TRUE, sep = " ")   # read.table already returns a data frame
data2 <- read.table("data2", header = TRUE, sep = " ")
merge(data1, data2, all = TRUE)                          # all = TRUE implies both all.x and all.y

Related

conditionally adding columns to a list of dataframes

I have a list of dataframes with either 2 or 4 columns.
a <- data.frame(a = 1:10,
                b = 1:10,
                c = 1:10,
                d = 1:10)
b <- data.frame(a = 1:10,
                b = 1:10)
list_of_df <- list(a, b)
I want to add 2 empty columns to each dataframe with only 2 columns.
I've tried this lapply approach:
lapply(list_of_df, function(x) ifelse(ncol(x) < 4,x%>%add_column(empty=NA),x <- x))
Which does not work unfortunately. How can I fix this?
I came up with something similar:
add_col <- function(x){
  col_to_add <- 4 - ncol(x)
  if (col_to_add == 0) return(x)
  z <- rep(NA, nrow(x))
  for (i in 1:col_to_add){
    x <- cbind(x, z)
  }
  x
}
lapply(list_of_df, add_col)
I would use a for loop to avoid copying the whole list:
for (i in seq_along(list_of_df)) {
  n_columns = ncol(list_of_df[[i]])
  if (n_columns == 2L) {
    list_of_df[[i]][c('empty1', 'empty2')] <- NA
  }
}
Result:
> list_of_df
[[1]]
a b c d
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 4 4 4
5 5 5 5 5
6 6 6 6 6
7 7 7 7 7
8 8 8 8 8
9 9 9 9 9
10 10 10 10 10
[[2]]
a b empty1 empty2
1 1 1 NA NA
2 2 2 NA NA
3 3 3 NA NA
4 4 4 NA NA
5 5 5 NA NA
6 6 6 NA NA
7 7 7 NA NA
8 8 8 NA NA
9 9 9 NA NA
10 10 10 NA NA
We could use bind_rows and then group_split and map from purrr to remove the id_Group column:
library(dplyr)
library(purrr)
bind_rows(list_of_df) %>%
  group_split(id_Group = cumsum(a == 1)) %>%
  map(., ~ (.x %>% ungroup() %>%
              select(-id_Group)))
[[1]]
# A tibble: 10 x 4
a b c d
<int> <int> <int> <int>
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 4 4 4
5 5 5 5 5
6 6 6 6 6
7 7 7 7 7
8 8 8 8 8
9 9 9 9 9
10 10 10 10 10
[[2]]
# A tibble: 10 x 4
a b c d
<int> <int> <int> <int>
1 1 1 NA NA
2 2 2 NA NA
3 3 3 NA NA
4 4 4 NA NA
5 5 5 NA NA
6 6 6 NA NA
7 7 7 NA NA
8 8 8 NA NA
9 9 9 NA NA
10 10 10 NA NA

How can I make some row values NA if other is NA in R?

I have a dataframe with three columns: Time, an observed value (Obs.Value), and an interpolated value (Interp.Value). If Obs.Value is NA then Interp.Value should also be NA. I can make the whole row NA, but I need to keep the Time value.
Here is the reprex:
dat <- data.frame(matrix(ncol = 3, nrow = 10))
x <- c("Time", "Obs.Value", "Interp.Value")
colnames(dat) <- x
dat$Time <- seq(1,10,1)
dat$Obs.Value <- c(5,6,7,NA,NA,5,4,3,NA,2)
interp <- approx(dat$Time,dat$Obs.Value,dat$Time)
dat$Interp.Value <- round(interp$y,1)
Here is the code that makes the whole row NA
dat[with(dat, is.na(Obs.Value)|is.na("Interp.Value")),] <- NA
Here is what the output should look like:
Time Obs.Value Interp.Value
1 1 5 5
2 2 6 6
3 3 7 7
4 4 NA NA
5 5 NA NA
6 6 5 5
7 7 4 4
8 8 3 3
9 9 NA NA
10 10 2 2
dat$Interp.Value[is.na(dat$Obs.Value)] <- NA
dat
# Time Obs.Value Interp.Value
# 1 1 5 5
# 2 2 6 6
# 3 3 7 7
# 4 4 NA NA
# 5 5 NA NA
# 6 6 5 5
# 7 7 4 4
# 8 8 3 3
# 9 9 NA NA
# 10 10 2 2
Or if either column being NA is sufficient, then
dat[!complete.cases(dat[,-1]),-1] <- NA
If there is only one column to change, @r2evans' answer is pretty straightforward and the way to go. If there is more than one column that you want to change, you can use across in dplyr:
library(dplyr)
dat %>%
  mutate(across(-c(Time, Obs.Value), ~ replace(., is.na(Obs.Value), NA)))
# Time Obs.Value Interp.Value
#1 1 5 5
#2 2 6 6
#3 3 7 7
#4 4 NA NA
#5 5 NA NA
#6 6 5 5
#7 7 4 4
#8 8 3 3
#9 9 NA NA
#10 10 2 2

Trying to calculate row sums in a data frame having NA values

I am trying to sum each row whenever at least one of the columns has a value, but the following is not working for me:
df = data.frame(
  x3 = c(2, NA, 3, 5, 4, 6, NA, NA, 3, 3),
  x4 = c(0, NA, NA, 6, 5, 6, NA, 0, 4, 2))
df$summ <- ifelse(is.na(c(df[,"x3"] & df[,"x4"])),NA,rowSums(df[,c("x3","x4")], na.rm=TRUE))
The output should have a summ column that is NA only when both x3 and x4 are NA (as shown in the answers below).
An alternative solution:
library(data.table)
setDT(df)[!(is.na(x3) & is.na(x4)), summ := rowSums(.SD, na.rm = TRUE)]
You can do:
df <- transform(df, summ = ifelse(is.na(x3) & is.na(x4), NA,
                                  rowSums(df, na.rm = TRUE)))
df
# x3 x4 summ
#1 2 0 2
#2 NA NA NA
#3 3 NA 3
#4 5 6 11
#5 4 5 9
#6 6 6 12
#7 NA NA NA
#8 NA 0 0
#9 3 4 7
#10 3 2 5
In general, for any number of columns:
cols <- c('x3', 'x4')
df <- transform(df, summ = ifelse(rowSums(is.na(df[cols])) == length(cols),
                                  NA, rowSums(df[cols], na.rm = TRUE)))
Try the code below with rowSums + replace
df$summ <- replace(rowSums(df, na.rm = TRUE), rowSums(is.na(df)) == 2, NA)
which gives
> df
x3 x4 summ
1 2 0 2
2 NA NA NA
3 3 NA 3
4 5 6 11
5 4 5 9
6 6 6 12
7 NA NA NA
8 NA 0 0
9 3 4 7
10 3 2 5
This is not much different from the answers already posted; however, it uses some handy functions:
library(dplyr)
df %>%
  rowwise() %>%
  mutate(Count = ifelse(all(is.na(cur_data())), NA,
                        sum(c_across(everything()), na.rm = TRUE)))
# A tibble: 10 x 3
# Rowwise:
x3 x4 Count
<dbl> <dbl> <dbl>
1 2 0 2
2 NA NA NA
3 3 NA 3
4 5 6 11
5 4 5 9
6 6 6 12
7 NA NA NA
8 NA 0 0
9 3 4 7
10 3 2 5

Impute missing variables but not at the beginning and the end?

Consider the following working example:
library(data.table)
library(imputeTS)
DT <- data.table(
  time = c(1:10),
  var1 = c(1:5, NA, NA, 8:10),
  var2 = c(NA, NA, 1:4, NA, 6, 7, 8),
  var3 = c(1:6, rep(NA, 4))
)
time var1 var2 var3
1: 1 1 NA 1
2: 2 2 NA 2
3: 3 3 1 3
4: 4 4 2 4
5: 5 5 3 5
6: 6 NA 4 6
7: 7 NA NA NA
8: 8 8 6 NA
9: 9 9 7 NA
10: 10 10 8 NA
I want to impute the missing values at different points within the time series using na_interpolation from the imputeTS package. However, I do not want to impute missing values at the beginning or the end of the series, which can be of varying length (in my application, replacing those values would not make sense).
When I run the following code to impute the series, however, all the NAs get replaced (cols_to_impute_example is the vector of value columns):
cols_to_impute_example <- c("var1", "var2", "var3")
DT[, (cols_to_impute_example) := lapply(.SD, na_interpolation), .SDcols = cols_to_impute_example]
> DT
time var1 var2 var3
1: 1 1 1 1
2: 2 2 1 2
3: 3 3 1 3
4: 4 4 2 4
5: 5 5 3 5
6: 6 6 4 6
7: 7 7 5 6
8: 8 8 6 6
9: 9 9 7 6
10: 10 10 8 6
What I want to achieve is:
time var1 var2 var3
1: 1 1 NA 1
2: 2 2 NA 2
3: 3 3 1 3
4: 4 4 2 4
5: 5 5 3 5
6: 6 6 4 6
7: 7 7 5 NA
8: 8 8 6 NA
9: 9 9 7 NA
10: 10 10 8 NA
A dplyr implementation:
We select the middle part of the df, do the NA interpolation there, and then bind everything back together.
library(imputeTS)
library(dplyr)
DT <- data_frame(
  time = c(1:10),
  var1 = c(1:5, NA, NA, 8:10),
  var2 = c(NA, NA, 1:4, NA, 6, 7, 8),
  var3 = c(1:6, rep(NA, 4))
)
na_inter_middle <- function(row_start, row_end){
  # extract the first part of the df, where no NAs need to be replaced
  DT[1:row_start, ] -> start
  # middle part, interpolating NA values
  DT[(row_start + 1):(nrow(DT) - row_end), ] -> middle
  # end part
  DT[(nrow(DT) - (row_end - 1)):nrow(DT), ] -> end
  start %>%
    bind_rows(
      middle %>%
        mutate_all(na.interpolation)
    ) %>%
    bind_rows(end)
}
na_inter_middle(2, 3)
# A tibble: 10 x 4
time var1 var2 var3
<int> <dbl> <dbl> <dbl>
1 1 1 NA 1
2 2 2 NA 2
3 3 3 1 3
4 4 4 2 4
5 5 5 3 5
6 6 5 4 6
7 7 5 4 6
8 8 8 6 NA
9 9 9 7 NA
10 10 10 8 NA
Maybe not so well known: you can also pass additional parameters of approx to the na_interpolation function of imputeTS.
This one could be solved with:
library(imputeTS)
DT[, (2:4) := lapply(.SD, na_interpolation, yleft = NA, yright = NA), .SDcols = 2:4]
Here, with yleft and yright you specify what should happen with the leading / trailing NAs.
Which leads to the desired output:
time var1 var2 var3
1: 1 1 NA 1
2: 2 2 NA 2
3: 3 3 1 3
4: 4 4 2 4
5: 5 5 3 5
6: 6 6 4 6
7: 7 7 5 NA
8: 8 8 6 NA
9: 9 9 7 NA
10: 10 10 8 NA
Basically, nearly all parameters that you find in the approx function documentation can also be given to na_interpolation as additional parameters for fine-tuning.
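For instance, a minimal sketch on a toy vector, assuming (as the answer states) that the extra argument is handed straight to approx; the expected results in the comments follow from approx's behavior:
library(imputeTS)
x <- c(1, NA, NA, 4, NA, 6)   # toy series, just for illustration
na_interpolation(x)                      # linear (default), expected: 1 2 3 4 5 6
na_interpolation(x, method = "constant") # approx's step interpolation, expected: 1 1 1 4 4 6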
The zoo library offers an interpolation function, na.approx, that allows more customization:
library(zoo)
DT[,(2:4) := lapply(.SD, na.approx, x = time, na.rm = FALSE), .SDcols = 2:4]
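For example, na.approx's na.rm and maxgap arguments give finer control; a small sketch on var2 from the question (maxgap = 1 is an arbitrary choice):
library(zoo)
v <- c(NA, NA, 1, 2, 3, 4, NA, 6, 7, 8)   # DT$var2 from the question
# na.rm = FALSE keeps the leading NAs; maxgap = 1 only fills runs of at most one NA
na.approx(v, na.rm = FALSE, maxgap = 1)
# expected: NA NA 1 2 3 4 5 6 7 8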

Insert NA-rows in data frame according to rownames of other data frame

I have 2 data frames with different rownames, e.g.:
df1 <- data.frame(A = c(1,3,7,1,5), B = c(5,2,9,5,5), C = c(1,1,3,4,5))
df2 <- data.frame(A = c(4,3,2), B = c(4,4,9), C = c(3,9,3))
rownames(df2) <- c(1, 3, 6)
> df1
A B C
1 1 5 1
2 3 2 1
3 7 9 3
4 1 5 4
5 5 5 5
> df2
A B C
1 4 4 3
3 3 4 9
6 2 9 3
I need to insert NA-rows in both data frames for each row that exists in only one of the data frames. In the given example:
> df1
A B C
1 1 5 1
2 3 2 1
3 7 9 3
4 1 5 4
5 5 5 5
6 NA NA NA
> df2
A B C
1 4 4 3
2 NA NA NA
3 3 4 9
4 NA NA NA
5 NA NA NA
6 2 9 3
I will have to perform this operation many times with different data frames, so I need an automated way to do this. I tried to solve the issue with various if/else loops, but I am sure there must be a much cleaner, more automated way.
We can use the functions union, %in% or intersect to find the full set of rownames, build an all-NA data frame with those rownames, and then assign the rows of each original dataset into the positions where the rownames match:
un1 <- union(rownames(df1), rownames(df2))
d1 <- as.data.frame(matrix(NA, ncol = ncol(df1), nrow = length(un1),
                           dimnames = list(un1, names(df1))))
d2 <- d1
d1[rownames(d1) %in% rownames(df1), ] <- df1
d2[rownames(d2) %in% rownames(df2), ] <- df2
d2
d2
# A B C
#1 4 4 3
#2 NA NA NA
#3 3 4 9
#4 NA NA NA
#5 NA NA NA
#6 2 9 3
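Since this has to be repeated for many pairs of data frames, the same idea can be wrapped in a small helper; pad_to_union is a made-up name, and like the code above it assumes the rownames of each frame appear in a consistent (sorted) order:
# pad_to_union: return a copy of df whose rows span all rownames in `un`,
# with the rows df does not have filled with NA (hypothetical helper)
pad_to_union <- function(df, un) {
  out <- as.data.frame(matrix(NA, ncol = ncol(df), nrow = length(un),
                              dimnames = list(un, names(df))))
  out[rownames(out) %in% rownames(df), ] <- df
  out
}
un1 <- union(rownames(df1), rownames(df2))
df1_padded <- pad_to_union(df1, un1)
df2_padded <- pad_to_union(df2, un1)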
