Merging data frame and filling missing values [duplicate] - r

I want to merge the following 3 data frames and fill the missing values with -1. I think I should use the function merge(), but I don't know exactly how to do it.
> df1
Letter Values1
1 A 1
2 B 2
3 C 3
> df2
Letter Values2
1 A 0
2 C 5
3 D 9
> df3
Letter Values3
1 A -1
2 D 5
3 B -1
The desired output would be:
Letter Values1 Values2 Values3
1 A 1 0 -1
2 B 2 -1 -1 # fill missing values with -1
3 C 3 5 -1
4 D -1 9 5
code:
> dput(df1)
structure(list(Letter = structure(1:3, .Label = c("A", "B", "C"
), class = "factor"), Values1 = c(1, 2, 3)), class = "data.frame", row.names = c(NA,
-3L))
> dput(df2)
structure(list(Letter = structure(1:3, .Label = c("A", "C", "D"
), class = "factor"), Values2 = c(0, 5, 9)), class = "data.frame", row.names = c(NA,
-3L))
> dput(df3)
structure(list(Letter = structure(c(1L, 3L, 2L), .Label = c("A",
"B", "D"), class = "factor"), Values3 = c(-1, 5, -1)), class = "data.frame", row.names = c(NA,
-3L))

You can put the data frames in a list and fold merge() over it with Reduce(); all = TRUE keeps the rows that are missing from some of the data frames. Missing values in the new data frame can then be replaced with -1.
new_df <- Reduce(function(x, y) merge(x, y, all = TRUE), list(df1, df2, df3))
new_df[is.na(new_df)] <- -1
new_df
# Letter Values1 Values2 Values3
#1 A 1 0 -1
#2 B 2 -1 -1
#3 C 3 5 -1
#4 D -1 9 5
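To make the Reduce() step less opaque, here is a minimal sketch showing that it is just nested merge() calls. The data frames are rebuilt with character keys for brevity (an assumption; the question uses factors), and filling only the numeric columns is a cautious variation, not part of the answer above.

```r
df1 <- data.frame(Letter = c("A", "B", "C"), Values1 = c(1, 2, 3))
df2 <- data.frame(Letter = c("A", "C", "D"), Values2 = c(0, 5, 9))
df3 <- data.frame(Letter = c("A", "D", "B"), Values3 = c(-1, 5, -1))

# Reduce() folds merge() over the list from left to right ...
folded <- Reduce(function(x, y) merge(x, y, by = "Letter", all = TRUE),
                 list(df1, df2, df3))

# ... which is the same as nesting the calls by hand
nested <- merge(merge(df1, df2, by = "Letter", all = TRUE),
                df3, by = "Letter", all = TRUE)
same <- identical(folded, nested)

# Fill NAs in the numeric columns only, leaving the key column untouched
num_cols <- vapply(folded, is.numeric, logical(1))
folded[num_cols] <- lapply(folded[num_cols],
                           function(v) replace(v, is.na(v), -1))
folded
```

Restricting the fill to numeric columns avoids writing -1 into a character or factor key if that column ever contains NA.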
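A tidyverse way with the same logic:
library(dplyr)
library(purrr)
library(tidyr)  # for replace_na()
list(df1, df2, df3) %>%
  reduce(full_join, by = "Letter") %>%
  # restrict to numeric columns so the Letter key is never filled with -1
  mutate(across(where(is.numeric), ~ replace_na(.x, -1)))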

Here's a dplyr solution (replace_na() comes from tidyr):
library(dplyr)
library(tidyr)
df1 %>%
  full_join(df2, by = "Letter") %>%
  full_join(df3, by = "Letter") %>%
  mutate_if(is.numeric, function(x) replace_na(x, -1))
output:
Letter Values1 Values2 Values3
<chr> <dbl> <dbl> <dbl>
1 A 1 0 -1
2 B 2 -1 -1
3 C 3 5 -1
4 D -1 9 5

Related

Rename columns of a dataframe based on another dataframe except columns not in that dataframe in R

Given two dataframes df1 and df2 as follows:
df1:
df1 <- structure(list(A = 1L, B = 2L, C = 3L, D = 4L, G = 5L), class = "data.frame", row.names = c(NA,
-1L))
Out:
A B C D G
1 1 2 3 4 5
df2:
df2 <- structure(list(Col1 = c("A", "B", "C", "D", "X"), Col2 = c("E",
"Q", "R", "Z", "Y")), class = "data.frame", row.names = c(NA,
-5L))
Out:
Col1 Col2
1 A E
2 B Q
3 C R
4 D Z
5 X Y
I need to rename the columns of df1 using df2, except column G, since it is not in df2's Col1.
I use df2$Col2[match(names(df1), df2$Col1)] based on the answer from here, but it returns "E" "Q" "R" "Z" NA: as you can see, column G becomes NA. I would like it to keep its original name.
The expected result:
E Q R Z G
1 1 2 3 4 5
How could I deal with this issue? Thanks.
Using na.omit (it's a little bit messy):
colnames(df1)[na.omit(match(names(df1), df2$Col1))] <- df2$Col2[na.omit(match(names(df1), df2$Col1))]
df1
E Q R Z G
1 1 2 3 4 5
I was able to reproduce your error with
df2 <- data.frame(
Col1 = c("H","I","K","A","B","C","D"),
Col2 = c("a1","a2","a3","E","Q","R","Z")
)
The problem is the order of df2$Col1 and names(df1) in match().
na.omit(match(names(df1), df2$Col1))
gives [1] 4 5 6 7. These are positions in df2; positions 6 and 7 do not exist in df1, which has only 5 columns.
To index df1, we should swap the arguments of match(): na.omit(match(df2$Col1, names(df1))) gives [1] 1 2 3 4.
colnames(df1)[na.omit(match(df2$Col1, names(df1)))] <- df2$Col2[na.omit(match(names(df1), df2$Col1))]
This works, provided the matching names appear in the same order in df1 and df2.
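A variant that does not depend on the order of the rows in df2 can be sketched by computing match() once and reusing the same index on both sides of the assignment. The shuffled df2 below is made up for illustration, and idx and found are hypothetical helper names.

```r
df1 <- data.frame(A = 1L, B = 2L, C = 3L, D = 4L, G = 5L)
df2 <- data.frame(
  Col1 = c("H", "I", "K", "D", "C", "B", "A"),  # deliberately out of order
  Col2 = c("a1", "a2", "a3", "Z", "R", "Q", "E"),
  stringsAsFactors = FALSE
)

# Position in df2$Col1 of each column name of df1 (NA when there is no match)
idx <- match(names(df1), df2$Col1)
found <- !is.na(idx)

# Rename only the matched columns; unmatched ones (G) keep their names
names(df1)[found] <- df2$Col2[idx[found]]
names(df1)
```

Because the left side is indexed by found and the right side by idx[found], each new name lands on the column it was matched from.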
A solution using the rename_with function from the dplyr package.
library(dplyr)
df3 <- df2 %>%
filter(Col1 %in% names(df1))
df4 <- df1 %>%
rename_with(.cols = df3$Col1, .fn = function(x) df3$Col2[df3$Col1 %in% x])
df4
# E Q R Z G
# 1 1 2 3 4 5

Verifying whether at least two columns have the same value in a specific row

I have a data frame and I want to check whether all my variables have unique values in a specific row.
Let's say I want to analyze row D.
My data:
Name F S T
A 1 2 3
B 2 3 4
C 3 4 5
D 4 5 6
> TRUE (because all three variables have unique values)
Second example
Name F S T
A 1 2 3
B 2 3 4
C 3 4 5
D 4 5 4
> FALSE (because F and T have the same value in row D)
In base R do
f1 <- function(dat, ind) {
tmp <- unlist(dat[ind, -1])
length(unique(tmp)) == length(tmp)
}
-testing
> f1(df, 4)
[1] TRUE
> f1(df1, 4)
[1] FALSE
data
df <- structure(list(Name = c("A", "B", "C", "D"), F = 1:4, S = 2:5,
T = 3:6), class = "data.frame", row.names = c(NA, -4L))
df1 <- structure(list(Name = c("A", "B", "C", "D"), F = 1:4, S = 2:5,
T = c(3L, 4L, 5L, 4L)), class = "data.frame", row.names = c(NA,
-4L))
You can use dplyr for this, comparing the number of distinct values in row D with the number of value columns:
library(dplyr)
df %>%
  filter(Name == "D") %>%
  select(-Name) %>%
  unlist() %>%
  n_distinct() == ncol(df) - 1
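The same check can be sketched in base R with anyDuplicated(), selecting the row by name instead of by position and, if needed, testing every row at once. row_is_unique is a hypothetical helper name, and the data is the second example from the question.

```r
df1 <- data.frame(Name = c("A", "B", "C", "D"),
                  F = 1:4, S = 2:5, T = c(3L, 4L, 5L, 4L),
                  stringsAsFactors = FALSE)

# TRUE when no value repeats among the non-Name columns of the chosen row
row_is_unique <- function(dat, name) {
  vals <- unlist(dat[dat$Name == name, -1])
  anyDuplicated(vals) == 0
}

row_is_unique(df1, "A")  # TRUE
row_is_unique(df1, "D")  # FALSE: F and T are both 4

# The same test applied to every row
all_rows <- apply(df1[-1], 1, function(r) anyDuplicated(r) == 0)
names(all_rows) <- df1$Name
all_rows
```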

Assign a value to a column in R based on a percentage within each group

I need to create column C in a data frame where 30% of the rows within each group (column B) get the value 0.
How do I do this in R?
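We may use rbinom after grouping by the 'category' column. Set prob to 0.7, the probability of drawing a 1, so that each row is 0 with probability 0.3. The 30% share then holds on average, not exactly within each group.
library(dplyr)
df1 %>%
  group_by(category) %>%
  mutate(value = rbinom(n(), 1, 0.7)) %>%
  ungroup()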
-output
# A tibble: 9 x 3
sno category value
<int> <chr> <int>
1 1 A 1
2 2 A 0
3 3 A 1
4 4 B 1
5 5 B 0
6 6 B 1
7 7 C 1
8 8 C 0
9 9 C 0
data
df1 <- structure(list(sno = 1:9, category = c("A", "A", "A", "B", "B",
"B", "C", "C", "C")), class = "data.frame", row.names = c(NA,
-9L))
If your data already exist (assuming this is a simplified example), and if you want the values to be randomly assigned within each group:
library(dplyr)
d <- data.frame(sno = 1:9,
category = rep(c("A", "B", "C"), each = 3))
d %>%
group_by(category) %>%
mutate(value = sample(c(rep(1, floor(n()*.7)), rep(0, n() - floor(n()*.7)))))
Base R
set.seed(42)
d$value <- ave(
rep(0, nrow(d)), d$category,
FUN = function(z) sample(0:1, size = length(z), prob = c(0.3, 0.7), replace = TRUE)
)
d
# sno category value
# 1 1 A 0
# 2 2 A 0
# 3 3 A 1
# 4 4 B 0
# 5 5 B 1
# 6 6 B 1
# 7 7 C 0
# 8 8 C 1
# 9 9 C 1
Data copied from Brigadeiro's answer:
d <- structure(list(sno = 1:9, category = c("A", "A", "A", "B", "B", "B", "C", "C", "C")), class = "data.frame", row.names = c(NA, -9L))
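The answers above differ in one respect: drawing each row independently (rbinom, or sample with prob) gives 30% zeros only in expectation, while shuffling a fixed number of zeros gives an exact count per group. A minimal base R sketch of the fixed-count approach follows, with a check of the resulting shares. assign_zeros is a hypothetical helper, the group size of 10 is chosen so that 30% is a whole number, and rounding is one possible convention (the answer above uses floor on the share of ones).

```r
set.seed(1)
d <- data.frame(sno = 1:20, category = rep(c("A", "B"), each = 10))

# Return a shuffled vector of length n with round(frac * n) zeros, rest ones
assign_zeros <- function(n, frac = 0.3) {
  n_zero <- round(frac * n)
  x <- rep(c(0, 1), times = c(n_zero, n - n_zero))
  x[sample.int(length(x))]  # permute without sample()'s length-1 special case
}

# ave() applies the function within each category and keeps the row order
d$value <- ave(rep(0, nrow(d)), d$category,
               FUN = function(z) assign_zeros(length(z)))

# Share of zeros per group
tapply(d$value == 0, d$category, mean)
```

With 10 rows per group, each group gets exactly 3 zeros regardless of the seed; only their positions vary.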

How to swap row values in the same column of a data frame?

I have a data frame that looks like the following:
ID Loc
1 N
2 A
3 N
4 H
5 H
I would like to swap A and H in the column Loc while not touching rows that have values of N, such that I get:
ID Loc
1 N
2 H
3 N
4 A
5 A
This dataframe is the result of a pipe so I'm looking to see if it's possible to append this operation to the pipe.
You could try:
df$Loc <- chartr("AH", "HA", df$Loc)
df
ID Loc
1 1 N
2 2 H
3 3 N
4 4 A
5 5 A
We can try chaining together two calls to ifelse, for a base R option:
df <- data.frame(ID=c(1:5), Loc=c("N", "A", "N", "H", "H"), stringsAsFactors=FALSE)
df$Loc <- ifelse(df$Loc=="A", "H", ifelse(df$Loc=="H", "A", df$Loc))
df
ID Loc
1 1 N
2 2 H
3 3 N
4 4 A
5 5 A
If you have a factor, you could simply swap those two levels:
l <- levels(df$Loc)
l[l %in% c("A", "H")] <- c("H", "A")
levels(df$Loc) <- l
df
# ID Loc
# 1 1 N
# 2 2 H
# 3 3 N
# 4 4 A
# 5 5 A
Data:
df <- structure(list(ID = 1:5, Loc = structure(c(3L, 1L, 3L, 2L, 2L
), .Label = c("A", "H", "N"), class = "factor")), .Names = c("ID",
"Loc"), class = "data.frame", row.names = c(NA, -5L))
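Since the question asks for something that can be appended to an existing pipe, one option is to wrap the swap in a small function that takes the data frame as its first argument. This is a sketch; swap_values is a hypothetical helper, not a function from any package.

```r
df <- data.frame(ID = 1:5, Loc = c("N", "A", "N", "H", "H"),
                 stringsAsFactors = FALSE)

# Swap values a and b in column col, leaving every other value unchanged.
# The column is returned as character, even if it was a factor.
swap_values <- function(data, col, a, b) {
  x <- as.character(data[[col]])
  data[[col]] <- ifelse(x == a, b, ifelse(x == b, a, x))
  data
}

swapped <- swap_values(df, "Loc", "A", "H")
swapped
```

Because the data frame comes first, the call can close a pipe, for example `... %>% swap_values("Loc", "A", "H")` with magrittr or `... |> swap_values("Loc", "A", "H")` with the native pipe.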

R - Merge list of three dataframes into single dataframe with ID in first column, next three columns show values [duplicate]

Here's my list of data frames:
[[1]]
ID Value
A 1
B 1
C 1
[[2]]
ID Value
A 1
D 1
E 1
[[3]]
ID Value
B 1
C 1
I'm after a single data frame with unique (non-redundant) IDs in the left-hand column, one column per replicate, and missing values filled with 0:
ID [1]Value [2]Value [3]Value
A 1 1 0
B 1 0 1
C 1 0 1
D 0 1 0
E 0 1 0
I've tried:
Reduce(function(x, y) merge(x, y, by=ID), datahere)
This provides a single list but without regards to where the original values come from, and duplicate IDs are repeated in new rows.
rbindlist(datahere, use.names=TRUE, fill=TRUE, idcol="Replicate")
This provides a single list with the [x]Value number as a new column called Replicate, but still it isn't in the structure I want as the ID column has redundancies.
What about something like this using dplyr/purrr:
require(tidyverse);
reduce(lst, full_join, by = "ID");
# ID Value.x Value.y Value
# 1 A 1 1 NA
# 2 B 1 NA 1
# 3 C 1 NA 1
# 4 D NA 1 NA
# 5 E NA 1 NA
Or with the NAs replaced with 0s:
reduce(lst, full_join, by = "ID") %>% replace(., is.na(.), 0);
# ID Value.x Value.y Value
#1 A 1 1 0
#2 B 1 0 1
#3 C 1 0 1
#4 D 0 1 0
#5 E 0 1 0
Sample data
options(stringsAsFactors = FALSE);
lst <- list(
data.frame(ID = c("A", "B", "C"), Value = c(1, 1, 1)),
data.frame(ID = c("A", "D", "E"), Value = c(1, 1, 1)),
data.frame(ID = c("B", "C"), Value = c(1, 1)))
You already have a nice answer but the typical way to do this is with tidyr::spread
Your data
A <- data.frame(ID=LETTERS[1:3], Value=1, stringsAsFactors=FALSE)
B <- data.frame(ID=LETTERS[c(1,4,5)], Value=1, stringsAsFactors=FALSE)
C <- data.frame(ID=LETTERS[c(2:3)], Value=1, stringsAsFactors=FALSE)
L <- list(A, B, C)
Solution
dplyr::bind_rows(L, .id="G") %>%
tidyr::spread(G, Value, fill=0)
# ID 1 2 3
# 1 A 1 1 0
# 2 B 1 0 1
# 3 C 1 0 1
# 4 D 0 1 0
# 5 E 0 1 0
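In recent tidyr versions spread() is superseded by pivot_wider(). The same long-to-wide reshaping can also be sketched in base R with xtabs(), shown here under the assumption that each ID occurs at most once per data frame (xtabs sums Value, so repeated IDs would be added up).

```r
A <- data.frame(ID = LETTERS[1:3], Value = 1, stringsAsFactors = FALSE)
B <- data.frame(ID = LETTERS[c(1, 4, 5)], Value = 1, stringsAsFactors = FALSE)
C <- data.frame(ID = LETTERS[2:3], Value = 1, stringsAsFactors = FALSE)
L <- list(A, B, C)

# Stack the frames, tagging each row with the index of its source frame
long <- do.call(rbind, Map(function(d, i) cbind(d, G = i), L, seq_along(L)))

# Cross-tabulate: one row per ID, one column per source, absent pairs become 0
tab <- xtabs(Value ~ ID + G, data = long)

# Convert the table to a data frame with ID as a regular column
wide <- as.data.frame.matrix(tab)
wide <- cbind(ID = rownames(wide), wide)
rownames(wide) <- NULL
wide
```

One caveat of this approach is that a genuine 0 in Value cannot be told apart from a missing ID once the table is built.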
With base R, we need to use all = TRUE in the merge
res <- Reduce(function(...) merge(..., all = TRUE, by="ID"), lst)
replace(res, is.na(res), 0)
# ID Value.x Value.y Value
#1 A 1 1 0
#2 B 1 0 1
#3 C 1 0 1
#4 D 0 1 0
#5 E 0 1 0
data
lst <- list(structure(list(ID = c("A", "B", "C"), Value = c(1, 1, 1)), .Names = c("ID",
"Value"), row.names = c(NA, -3L), class = "data.frame"), structure(list(
ID = c("A", "D", "E"), Value = c(1, 1, 1)), .Names = c("ID",
"Value"), row.names = c(NA, -3L), class = "data.frame"), structure(list(
ID = c("B", "C"), Value = c(1, 1)), .Names = c("ID", "Value"
), row.names = c(NA, -2L), class = "data.frame"))
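The merged results above end up with the suffixed names Value.x, Value.y and Value, while the question asks for columns that identify the source replicate. One way to get there, sketched in base R, is to rename each Value column by its list position before merging. The Value1, Value2, Value3 naming is an assumption for illustration.

```r
lst <- list(
  data.frame(ID = c("A", "B", "C"), Value = c(1, 1, 1), stringsAsFactors = FALSE),
  data.frame(ID = c("A", "D", "E"), Value = c(1, 1, 1), stringsAsFactors = FALSE),
  data.frame(ID = c("B", "C"), Value = c(1, 1), stringsAsFactors = FALSE)
)

# Give each frame's Value column a name that records its position in the list
named <- Map(function(d, i) {
  names(d)[names(d) == "Value"] <- paste0("Value", i)
  d
}, lst, seq_along(lst))

# Full outer merge on ID, then replace the NAs introduced by the merge with 0
res <- Reduce(function(x, y) merge(x, y, by = "ID", all = TRUE), named)
res[is.na(res)] <- 0
res
```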
