R dataframe: how to "populate" missing data in df1 using df2

I am trying to populate the missing values of df1 with df2.
Whenever there is a valid value for the same cell in both df, I need to keep the value as in df1.
If there is a column in df2 that is not present in df1, this new column (z) has to be added to df1.
This would be a simple example:
id <- c (1, 2, 3, 4, 5)
x <- c (10, NA, 20, 50, 70)
y <- c (3, 5, NA, 6, 9)
df1 <- data.frame(id, x, y)
id <- c ( 2, 3, 5)
x <- c (10, NA, NA)
z <- c (NA, 6, 7)
df2 <- data.frame(id, x, z)
I would like to obtain "df3":
id x y z
1 1 10 3 NA
2 2 10 5 NA
3 3 20 6 6
4 4 50 6 NA
5 5 70 9 7
I tried several "merge" options that didn't work.

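A 'merge' option, after a few extract-and-replace steps, could be:
# Rows of df1 that match df2 (this assumes each id equals its row number in df1)
idx <- is.na(df1[df2$id,])
# Fill by cell position: column 3 of df2 (z) lands in column 3 of df1 (y)
df1[df2$id,][idx] <- df2[idx]
# Add the new column z from df2
out <- merge(df1, df2[, c("id", "z")], by = "id", all.x = TRUE)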
Result
out
# id x y z
#1 1 10 3 NA
#2 2 10 5 NA
#3 3 20 6 6
#4 4 50 6 NA
#5 5 70 9 7
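The replacement above works by cell position, which is why y for id 3 picks up the value 6 from df2's z column. As an alternative, here is a minimal sketch that matches rows by id and columns by name; the loop and the helper names (pos, from_df2) are illustrative and not part of the original answer. Under the rules as stated in the question, y for id 3 stays NA here because df2 has no y column, so this result differs from the output above in that one cell.

```r
df1 <- data.frame(id = c(1, 2, 3, 4, 5),
                  x  = c(10, NA, 20, 50, 70),
                  y  = c(3, 5, NA, 6, 9))
df2 <- data.frame(id = c(2, 3, 5),
                  x  = c(10, NA, NA),
                  z  = c(NA, 6, 7))

# Row of df2 that corresponds to each row of df1 (NA when the id is absent)
pos <- match(df1$id, df2$id)

df3 <- df1
for (col in setdiff(names(df2), "id")) {
  from_df2 <- df2[[col]][pos]
  if (col %in% names(df3)) {
    # Shared column: keep df1's value, use df2's only where df1 is NA
    df3[[col]] <- ifelse(is.na(df3[[col]]), from_df2, df3[[col]])
  } else {
    # Column present only in df2: add it to the result
    df3[[col]] <- from_df2
  }
}
df3
```

Because rows are matched with match(), this does not rely on id being equal to the row number, and it does not depend on the column order of either data frame.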

Related

Replacing values in data frame column based on another column

I have a data frame in R :
a b c d e
1 2 3 23 1
4 5 6 -Inf 2
7 8 9 2 8
10 11 12 -Inf NaN
and I'd like to replace all the values in column e with NA if the corresponding value in column d is -Inf
like this:
a b c d e
1 2 3 23 1
4 5 6 -Inf NA
7 8 9 2 8
10 11 12 -Inf NA
Any help is appreciated. I haven't been able to do it without loops, and it's taking a long time for the full data frame.
ifelse is vectorized, so we can use it without a loop.
dat$e <- ifelse(dat$d == -Inf, NA, dat$e)
DATA
dat <- read.table(text = "a b c d e
1 2 3 23 1
4 5 6 -Inf 2
7 8 9 2 8
10 11 12 -Inf NaN", header = TRUE)
Using data.table
library(data.table)
setDT(dat)[is.infinite(d), e := NA]
A solution with dplyr:
library(tidyverse)
df <- tribble(
~a, ~b, ~c, ~d, ~e,
1, 2, 3, 23, 1,
4, 5, 6, -Inf, 2,
7, 8, 9, 2, 8,
10, 11, 12, -Inf, NaN)
df1 <- df %>%
dplyr::mutate(e = case_when(d == -Inf ~ NA_real_,
TRUE ~ e)
)
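For comparison, the same replacement can be written with plain logical indexing in base R. This is only a sketch; the name neg_inf is illustrative. Note that is.infinite(), used in the data.table answer, is also TRUE for +Inf, while the comparison below targets -Inf only.

```r
dat <- data.frame(a = c(1, 4, 7, 10),
                  b = c(2, 5, 8, 11),
                  c = c(3, 6, 9, 12),
                  d = c(23, -Inf, 2, -Inf),
                  e = c(1, 2, 8, NaN))

# TRUE where d is exactly -Inf; the !is.na() guard keeps NA out of the index
neg_inf <- !is.na(dat$d) & dat$d == -Inf

# Overwrite e only in those rows
dat$e[neg_inf] <- NA
dat
```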

Finding only unique value in each column in a d

I have the below data frame df1. (Edited to have different numbers of repeated value in the data frame.)
> dput(df1)
structure(list(...1 = c("a", "b", "c", "d", "e"), x = c(5, 10,
20, 20, 25), y = c(2, 6, 6, 6, 10), z = c(6, 2, 1, 8, 1)), row.names = c(NA,
-5L), class = c("tbl_df", "tbl", "data.frame"))
>df1
x y z
a 5 2 6
b 10 6 2
c 20 6 1
d 20 6 8
e 25 10 1
I would like to get a df2 which only has the unique values from each column 'x','y' and 'z'.
I tried:
df2<-apply(df1,2, unique)
df2 <- do.call(cbind, df2)
df2 <- as.data.frame(df2)
Desired output:
>df2
 x  y  z
 5  2  6
10  6  2
20 10  1
25     8
Tibbles can't have row names, so the row names end up as a new column (...1) in your data. You can delete that first column and then apply unique to all remaining columns. This works only when every column has the same number of unique values.
library(dplyr)
df1$...1 <- NULL
df1 %>% summarise(across(.fns = unique))
# x y z
# <dbl> <dbl> <dbl>
#1 5 2 6
#2 10 6 2
#3 20 8 1
#4 25 10 8
Or in base R :
df2 <- data.frame(sapply(df1, unique))
If the columns have different numbers of unique values, you could pad the shorter ones with NA:
tmp <- lapply(df1, unique)
data.frame(sapply(tmp, `[`, 1:max(lengths(tmp))))
# x y z
#1 5 2 6
#2 10 6 2
#3 20 10 1
#4 25 NA 8

R: Merge two data frames based on value in column and return all values of both data frames

Let's say I have the following dfs
df1:
a b c d
1 2 3 4
4 3 3 4
9 7 3 4
df2:
a b c d
1 2 3 4
2 2 3 4
3 2 3 4
Now I want to merge both dfs conditional on column "a" to give me the following df
a b c d
1 2 3 4
4 3 3 4
9 7 3 4
2 2 3 4
3 2 3 4
In my dataset I tried using
merge <- merge(x = df1, y = df2, by = "a", all = TRUE)
However, although df1 has 50,000 entries, df2 has 100,000 entries, and there are definitely matching values in column a, the merged df has over one million entries. I do not understand this. As I understand it, there should be at most 150,000 entries in the merged df, and that maximum should occur only when no values in column a are shared between the two dfs.
I think what you want to do is not merge but rather rbind the two dataframes and remove the duplicated rows:
DATA:
df1 <- data.frame(a = c(1,4,9),
b = c(2,3,7),
c = c(3,3,3),
d = c(4,4,4))
df2 <- data.frame(a = c(1,2,3),
b = c(2,2,2),
c = c(3,3,3),
d = c(4,4,4))
SOLUTION:
Row-bind df1 and df2:
df3 <- rbind(df1, df2)
Remove the duplicate rows:
df3 <- df3[!duplicated(df3), ]
RESULT:
df3
a b c d
1 1 2 3 4
2 4 3 3 4
3 9 7 3 4
5 2 2 3 4
6 3 2 3 4
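As for why merge(..., all = TRUE) can return far more than 150,000 rows: one likely cause, assuming column a contains repeated values in both data frames (the real data are not shown), is that merge pairs every matching row on one side with every matching row on the other. The toy data frames left and right below are made up purely to illustrate this.

```r
# Hypothetical data: the key a = 1 appears twice on the left and three times on the right
left  <- data.frame(a = c(1, 1, 2), b = c("l1", "l2", "l3"))
right <- data.frame(a = c(1, 1, 1, 3), c = c("r1", "r2", "r3", "r4"))

merged <- merge(left, right, by = "a", all = TRUE)

# a = 1 contributes 2 * 3 = 6 rows; a = 2 and a = 3 contribute one row each
nrow(merged)

# A quick check for repeated keys before merging
anyDuplicated(left$a) > 0
anyDuplicated(right$a) > 0
```

Here the merged result has more rows than the two inputs combined (3 + 4), and the effect grows quickly when many keys repeat.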
With tidyverse, we can do bind_rows and distinct
library(dplyr)
bind_rows(df1, df2) %>%
distinct()
data
df1 <- structure(list(a = c(1, 4, 9), b = c(2, 3, 7), c = c(3, 3, 3),
d = c(4, 4, 4)), class = "data.frame", row.names = c(NA,
-3L))
df2 <- structure(list(a = c(1, 2, 3), b = c(2, 2, 2), c = c(3, 3, 3),
d = c(4, 4, 4)), class = "data.frame", row.names = c(NA,
-3L))
Another option is dplyr::union, which also drops duplicate rows:
dplyr::union(df1, df2)
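Here is another base R solution using rbind + %in%. Unlike the approaches above, it drops rows of df2 based on column a alone rather than on whole-row duplicates: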
dfout <- rbind(df1,subset(df2,!a %in% df1$a))
such that
> rbind(df1,subset(df2,!a %in% df1$a))
a b c d
1 1 2 3 4
2 4 3 3 4
3 9 7 3 4
21 2 2 3 4
31 3 2 3 4

How to populate a column using multiple conditionals across 2 dataframes?

I'm trying to populate a column with values based on two conditionals across two separate dataframes. So,
df1$day == df2$day & df1$hour == df2$hour then fill df1$X with df2$depth
I struggle because I am not asking it to populate the column with a generic value (i.e. if x==y, then y2=1). I am trying to get it to select values across multiple rows. A mock example:
df1            df2
day hour X     day hour depth
1   10   NA    1   10   50
1   11   NA    1   11   10
2   5    NA    1   3    100
5   9    NA    5   9    50
6   20   NA    7   17   80
7   17   NA    10  4    65
Any help would be greatly appreciated.
An easier option is an update join from data.table
library(data.table)
setDT(df1)[df2, X := depth, on = .(day, hour)]
df1
# day hour X
#1: 1 10 50
#2: 1 11 10
#3: 2 5 NA
#4: 5 9 50
#5: 6 20 NA
#6: 7 17 80
In base R, we can use match
df1$X <- with(df1, df2$depth[match(paste(day, hour), paste(df2$day, df2$hour))])
data
df1<- data.frame(day = c(1, 1, 2, 5:7), hour = c(10:11, 5, 9, 20, 17),
X = NA_integer_)
df2 <- data.frame(day = c(1, 1, 1, 5, 7, 10), hour = c(10, 11, 3, 9,
17, 4), depth = c(50, 10, 100, 50, 80, 65))
Using dplyr, we can do a left_join and then rename the depth column as X
library(dplyr)
left_join(df1, df2, by = c("day", "hour")) %>%
select(-X) %>%
rename(X = depth)
# day hour X
#1 1 10 50
#2 1 11 10
#3 2 5 NA
#4 5 9 50
#5 6 20 NA
#6 7 17 80
If the X column is not always NA you could use coalesce.
left_join(df1, df2, by = c("day", "hour")) %>%
mutate(X = coalesce(depth, X)) %>%
select(names(df1))
Or in base R :
merge(df1, df2, all.x = TRUE)[-3]
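The coalesce idea can also be sketched in base R with match, updating X only in rows that have a partner in df2 and leaving any existing X values alone elsewhere. The value 99 in df1 below is made up to show that an existing entry survives when there is no match; pos and found are illustrative names.

```r
df1 <- data.frame(day  = c(1, 1, 2, 5, 6, 7),
                  hour = c(10, 11, 5, 9, 20, 17),
                  X    = c(NA, NA, 99, NA, NA, NA))  # 99 is a hypothetical existing value
df2 <- data.frame(day   = c(1, 1, 1, 5, 7, 10),
                  hour  = c(10, 11, 3, 9, 17, 4),
                  depth = c(50, 10, 100, 50, 80, 65))

# Position in df2 of each (day, hour) pair from df1; NA when there is no match
pos <- match(paste(df1$day, df1$hour), paste(df2$day, df2$hour))
found <- !is.na(pos)

# Overwrite X only where a matching row exists in df2
df1$X[found] <- df2$depth[pos[found]]
df1
```

Unlike merge, this keeps the rows of df1 in their original order.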

Unequal rows in list from unstack() - how to create a dataframe

I am (trying) to do a Robust ANOVA analysis in R. This requires that my two variables are in a very specific format. Basically, the requirement is to unstack two columns in my current dataframe and form an outcome frequency dataframe based on the predictor (categorical variable). This would usually happen automatically using the unstack() function i.e.
newDataFrame <- unstack(oldDataFrame, scores ~ columns)
However, the list returned has unequal rows for each category. Here is an example:
$A
[1] 2 4 2 3 3
$B
[1] 3 3
$C
[1] 5
$D
[1] 4 4 3
A, B, C and D are my categories, and the numbers are the outcome. The outcome has to be 1, 2, 3, 4, 5 or 6.
What I am working towards is the category as the 'header' and the outcome as a reference column, with the frequencies as the other columns, such that the dataframe looks like this:
A B C D
1 NA NA NA NA
2 2 NA NA NA
3 2 2 NA 1
4 1 NA NA 2
5 NA NA 1 NA
6 NA NA NA NA
What I have tried:
On another SO post, I found this -
library(stringi)
res <- as.data.frame(t(stri_list2matrix(myUnstackedList)))
colnames(res) <- unique(unlist(sapply(myUnstackedList, names)))
Outcome:
res
1 2 4 2 3 3
2 3 3 <NA> <NA> <NA>
3 5 <NA> <NA> <NA> <NA>
4 4 4 3 <NA> <NA>
Note that the categories A, B, C, D have been changed to 1, 2, 3, 4
Also tried this (another SO post):
df <- as.data.frame(plyr::ldply(myUnstackedList, rbind))
Outcome:
df
outcome group score
2 A 2
3 A 2
4 A 1
3 B 2
etc
Any tips?
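This gets you most of the way to your answer:
library(dplyr)  # bind_rows, %>%
library(tidyr)  # spread
test <- list(A=c(2,4,2,3,3),
B=c(3,3),
C=c(5),
D=c(4,4,3))
# One two-column data frame (ID, Value) per list element
test <- lapply(seq_along(test), function(i){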
x <- data.frame(names(test)[i], test[i],
stringsAsFactors=FALSE)
names(x) <- c("ID", "Value")
x})
test <- bind_rows(test) %>% table %>% as.data.frame
test <- spread(test, key=ID, value=Freq)
replace(test, test==0, NA)
I'm not sure what the issue was with your previous dplyr attempt, however, I offer
library(tidyr)
library(dplyr)
df <- tibble(
outcome = c(1:5, 1:2, 1, 1:3),
group = c(rep("A", 5), rep("B", 2), "C", rep("D", 3)),
score = c(2, 4, 2, 3, 3, 3, 3, 5, 4, 4, 3)
)
df %>%
group_by(outcome) %>%
spread(group, score) %>%
ungroup() %>%
select(-outcome)
# # A tibble: 5 x 4
# A B C D
# * <dbl> <dbl> <dbl> <dbl>
# 1 2 3 5 4
# 2 4 3 NA 4
# 3 2 NA NA 3
# 4 3 NA NA NA
# 5 3 NA NA NA
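For the frequency layout requested in the question, a base R route is also possible. The following is a minimal sketch, assuming the unstacked list looks like the example shown in the question and that the possible outcomes are 1 to 6; long and freq are illustrative names.

```r
test <- list(A = c(2, 4, 2, 3, 3),
             B = c(3, 3),
             C = 5,
             D = c(4, 4, 3))

# Back to long format: one row per score, with columns 'values' and 'ind'
long <- stack(test)

# Declare all six possible outcomes so that unused ones still get a row
long$values <- factor(long$values, levels = 1:6)

# Cross-tabulate outcome by category and convert to a data frame
freq <- as.data.frame.matrix(table(long$values, long$ind))

# Show zero counts as NA, as in the desired output
freq[freq == 0] <- NA
freq
```

The row names of freq are the outcomes 1 to 6 and the columns are the categories A to D.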
