I have two fairly complicated data.frames and managed to simplify the first step of my problem here. I have a reference table and another that contains my data as follows:
REFERENCE
ref <- structure(list(group = c("A", "B", "C"), position = c("a", "a",
"b")), row.names = c(NA, -3L), class = c("tbl_df", "tbl", "data.frame"))
DATA
df <- structure(list(position = c("a", "a"), value = c(1, 1), name = c("foo",
"bar")), row.names = c(NA, -2L), class = c("tbl_df", "tbl", "data.frame"))
I used left_join(ref, df, by = "position") %>% arrange(name) to obtain:
  group position value name
  <chr> <chr>    <dbl> <chr>
1 A     a            1 bar
2 B     a            1 bar
3 A     a            1 foo
4 B     a            1 foo
5 C     b           NA NA
The ideal output however is:
group position value name
<chr> <chr> <dbl> <chr>
1 A a 1 bar
2 B a 1 bar
3 C b 0 bar
4 A a 1 foo
5 B a 1 foo
6 C b 0 foo
I would like the NAs in the name column to be replaced with the names from df, and the NAs in the value column to be replaced with 0. In the real df there are more names than just foo in the name column.
We could use crossing to get the combinations, then replace the 'value' column with 0 where the 'position' columns are not equal.
library(dplyr)
library(tidyr)
crossing(ref, df %>%
           rename(position2 = position)) %>%
  arrange(name) %>%
  mutate(value = replace(value, position != position2, 0)) %>%
  select(-position2)
# A tibble: 6 x 4
# group position value name
# <chr> <chr> <dbl> <chr>
#1 A a 1 bar
#2 B a 1 bar
#3 C b 0 bar
#4 A a 1 foo
#5 B a 1 foo
#6 C b 0 foo
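If you prefer to keep the join from the question, here is a hedged alternative sketch (it only uses the ref and df objects above, plus dplyr and tidyr): expand ref by every name in df first, then join on both keys so the unmatched rows can be filled with 0.
library(dplyr)
library(tidyr)
# Give every group/position row of ref each name found in df, then join on
# both keys; position "b" has no match in df, so its value is NA and
# replace_na() turns it into 0.
crossing(ref, name = unique(df$name)) %>%
  left_join(df, by = c("position", "name")) %>%
  replace_na(list(value = 0)) %>%
  select(group, position, value, name) %>%
  arrange(name)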
I need to create column C in a data frame where 30% of the rows within each group (column B) get a value of 0.
How do I do this in R?
We may use rbinom after grouping by the 'category' column, specifying prob as a vector of values.
library(dplyr)
df1 %>%
  group_by(category) %>%
  mutate(value = rbinom(n(), 1, c(0.7, 0.3))) %>%
  ungroup
-output
# A tibble: 9 x 3
sno category value
<int> <chr> <int>
1 1 A 1
2 2 A 0
3 3 A 1
4 4 B 1
5 5 B 0
6 6 B 1
7 7 C 1
8 8 C 0
9 9 C 0
data
df1 <- structure(list(sno = 1:9, category = c("A", "A", "A", "B", "B",
"B", "C", "C", "C")), class = "data.frame", row.names = c(NA,
-9L))
If your data already exist (assuming this is a simplified example), and if you want the value to be randomly assigned within each group:
library(dplyr)
d <- data.frame(sno = 1:9,
                category = rep(c("A", "B", "C"), each = 3))
d %>%
  group_by(category) %>%
  mutate(value = sample(c(rep(1, floor(n() * .7)), rep(0, n() - floor(n() * .7)))))
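As an optional sanity check (a sketch using the same d as above, nothing else assumed), counting the values per group shows the guaranteed split: with three rows per group, floor(3 * 0.7) = 2 ones and 1 zero in every group.
d %>%
  group_by(category) %>%
  mutate(value = sample(c(rep(1, floor(n() * .7)),
                          rep(0, n() - floor(n() * .7))))) %>%
  ungroup() %>%
  # count how many 0s and 1s each category received
  count(category, value)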
Base R
set.seed(42)
d$value <- ave(
rep(0, nrow(d)), d$category,
FUN = function(z) sample(0:1, size = length(z), prob = c(0.3, 0.7), replace = TRUE)
)
d
# sno category value
# 1 1 A 0
# 2 2 A 0
# 3 3 A 1
# 4 4 B 0
# 5 5 B 1
# 6 6 B 1
# 7 7 C 0
# 8 8 C 1
# 9 9 C 1
Data copied from Brigadeiro's answer:
d <- structure(list(sno = 1:9, category = c("A", "A", "A", "B", "B", "B", "C", "C", "C")), class = "data.frame", row.names = c(NA, -9L))
I have a dataframe of the following type
ID case1 case2 case3 case4
1 A B C D
2 B A
3 E F
4 G C A
5 T
I need to change its format, to a long shape, similar as the below:
ID col1 col2
1 A B
1 A C
1 A D
1 B C
1 B D
1 C D
2 B A
3 E F
4 G C
4 G A
4 C A
5 T
As you can see, I need to maintain the ID and ignore empty columns. There are some cases like T that need to remain in the dataset, but without a col2.
I am honestly not sure how to approach this, so that is why there are no examples of what I have tried.
You can get the data in long format and create all combinations of values for each ID when the number of rows in that ID is greater than 1.
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(cols = -ID, values_drop_na = TRUE) %>%
  group_by(ID) %>%
  summarise(value = if (n() > 1)
                      list(setNames(as.data.frame(t(combn(value, 2))),
                                    c('col1', 'col2')))
                    else
                      list(data.frame(col1 = value[1], col2 = NA_character_))) %>%
  unnest(value)
# A tibble: 12 x 3
# ID col1 col2
# <int> <chr> <chr>
# 1 1 A B
# 2 1 A C
# 3 1 A D
# 4 1 B C
# 5 1 B D
# 6 1 C D
# 7 2 B A
# 8 3 E F
# 9 4 G C
#10 4 G A
#11 4 C A
#12 5 T NA
data
df <- structure(list(ID = 1:5, case1 = c("A", "B", "E", "G", "T"),
case2 = c("B", "A", "F", "C", NA), case3 = c("C", NA, NA,
"A", NA), case4 = c("D", NA, NA, NA, NA)),
class = "data.frame", row.names = c(NA, -5L))
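The pairing itself is done by combn(); as a small standalone illustration with the values from ID 1 above, combn(x, 2) returns every pair as a column and t() flips them into the two-column shape used for col1/col2.
# every unordered pair of the four values, one pair per row
t(combn(c("A", "B", "C", "D"), 2))
#      [,1] [,2]
# [1,] "A"  "B"
# [2,] "A"  "C"
# [3,] "A"  "D"
# [4,] "B"  "C"
# [5,] "B"  "D"
# [6,] "C"  "D"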
I want to merge the following 3 data frames and fill the missing values with -1. I think I should use the function merge() but do not know exactly how to do it.
> df1
Letter Values1
1 A 1
2 B 2
3 C 3
> df2
Letter Values2
1 A 0
2 C 5
3 D 9
> df3
Letter Values3
1 A -1
2 D 5
3 B -1
The desired output would be:
Letter Values1 Values2 Values3
1 A 1 0 -1
2 B 2 -1 -1 # fill missing values with -1
3 C 3 5 -1
4 D -1 9 5
code:
> dput(df1)
structure(list(Letter = structure(1:3, .Label = c("A", "B", "C"
), class = "factor"), Values1 = c(1, 2, 3)), class = "data.frame", row.names = c(NA,
-3L))
> dput(df2)
structure(list(Letter = structure(1:3, .Label = c("A", "C", "D"
), class = "factor"), Values2 = c(0, 5, 9)), class = "data.frame", row.names = c(NA,
-3L))
> dput(df3)
structure(list(Letter = structure(c(1L, 3L, 2L), .Label = c("A",
"B", "D"), class = "factor"), Values3 = c(-1, 5, -1)), class = "data.frame", row.names = c(NA,
-3L))
You can put the data frames in a list and use merge with Reduce. Missing values in the new data frame can then be replaced with -1.
new_df <- Reduce(function(x, y) merge(x, y, all = TRUE), list(df1, df2, df3))
new_df[is.na(new_df)] <- -1
new_df
# Letter Values1 Values2 Values3
#1 A 1 0 -1
#2 B 2 -1 -1
#3 C 3 5 -1
#4 D -1 9 5
A tidyverse way with the same logic:
library(dplyr)
library(purrr)
library(tidyr)
list(df1, df2, df3) %>%
  reduce(full_join, by = "Letter") %>%
  mutate(across(where(is.numeric), ~ replace_na(.x, -1)))
Here's a dplyr solution (replace_na() comes from tidyr):
df1 %>%
  full_join(df2, by = "Letter") %>%
  full_join(df3, by = "Letter") %>%
  mutate_if(is.numeric, function(x) replace_na(x, -1))
output:
Letter Values1 Values2 Values3
<chr> <dbl> <dbl> <dbl>
1 A 1 0 -1
2 B 2 -1 -1
3 C 3 5 -1
4 D -1 9 5
Here is my toy dataframe:
df <- structure(list(a = c(1, 2), b = c(3, 4), c = c(5, 6), d = c(7,
8)), .Names = c("a", "b", "c", "d"), row.names = c(NA, -2L), class = c("tbl_df",
"tbl", "data.frame"))
Now I want to reorder the columns, excluding one of them and keeping the others:
df %>% select(-a, d, everything())
I want my df to be:
d b c
7 3 5
8 4 6
I get the following:
b c d a
<dbl> <dbl> <dbl> <dbl>
1 3 5 7 1
2 4 6 8 2
Keep the -a at the end of the select(). Even though we remove a at the beginning, the everything() at the end still checks the column names of the whole dataset, so a comes back.
df %>%
  select(d, everything(), -a)
# A tibble: 2 x 3
# d b c
# <dbl> <dbl> <dbl>
#1 7 3 5
#2 8 4 6
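If your dplyr is 1.0.0 or newer (an assumption about the installed version), relocate() expresses the same reorder-and-drop without needing everything():
library(dplyr)
df %>%
  select(-a) %>%   # drop a first
  relocate(d)      # then move d to the front (relocate defaults to first position)
# A tibble: 2 x 3
#       d     b     c
#   <dbl> <dbl> <dbl>
#1      7     3     5
#2      8     4     6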
Here's my list of data frames:
[[1]]
ID Value
A 1
B 1
C 1
[[2]]
ID Value
A 1
D 1
E 1
[[3]]
ID Value
B 1
C 1
I'm after a single data frame with unique (non-redundant) IDs in the left hand column, replicates in columns, and NULL values as 0:
ID [1]Value [2]Value [3]Value
A 1 1 0
B 1 0 1
C 1 0 1
D 0 1 0
E 0 1 0
I've tried:
Reduce(function(x, y) merge(x, y, by = "ID"), datahere)
This provides a single data frame but without regard to where the original values came from, and duplicate IDs are repeated in new rows.
rbindlist(datahere, use.names=TRUE, fill=TRUE, idcol="Replicate")
This provides a single data frame with the [x]Value number as a new column called Replicate, but it still isn't in the structure I want, as the ID column has redundancies.
What about something like this using dplyr/purrr:
require(tidyverse);
reduce(lst, full_join, by = "ID");
# ID Value.x Value.y Value
# 1 A 1 1 NA
# 2 B 1 NA 1
# 3 C 1 NA 1
# 4 D NA 1 NA
# 5 E NA 1 NA
Or with the NAs replaced with 0s:
reduce(lst, full_join, by = "ID") %>% replace(., is.na(.), 0);
# ID Value.x Value.y Value
#1 A 1 1 0
#2 B 1 0 1
#3 C 1 0 1
#4 D 0 1 0
#5 E 0 1 0
Sample data
options(stringsAsFactors = FALSE);
lst <- list(
data.frame(ID = c("A", "B", "C"), Value = c(1, 1, 1)),
data.frame(ID = c("A", "D", "E"), Value = c(1, 1, 1)),
data.frame(ID = c("B", "C"), Value = c(1, 1)))
You already have a nice answer, but the typical way to do this is with tidyr::spread.
Your data
A <- data.frame(ID=LETTERS[1:3], Value=1, stringsAsFactors=FALSE)
B <- data.frame(ID=LETTERS[c(1,4,5)], Value=1, stringsAsFactors=FALSE)
C <- data.frame(ID=LETTERS[c(2:3)], Value=1, stringsAsFactors=FALSE)
L <- list(A, B, C)
Solution
dplyr::bind_rows(L, .id="G") %>%
tidyr::spread(G, Value, fill=0)
# ID 1 2 3
# 1 A 1 1 0
# 2 B 1 0 1
# 3 C 1 0 1
# 4 D 0 1 0
# 5 E 0 1 0
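spread() is superseded in current tidyr; the same reshape works with pivot_wider() (a sketch, assuming tidyr 1.1 or later so that values_fill accepts a plain 0):
dplyr::bind_rows(L, .id = "G") %>%
  tidyr::pivot_wider(names_from = G, values_from = Value, values_fill = 0)
# same result: one row per ID, one column per list element, gaps filled with 0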
With base R, we need to use all = TRUE in the merge
res <- Reduce(function(...) merge(..., all = TRUE, by="ID"), lst)
replace(res, is.na(res), 0)
# ID Value.x Value.y Value
#1 A 1 1 0
#2 B 1 0 1
#3 C 1 0 1
#4 D 0 1 0
#5 E 0 1 0
data
lst <- list(structure(list(ID = c("A", "B", "C"), Value = c(1, 1, 1)), .Names = c("ID",
"Value"), row.names = c(NA, -3L), class = "data.frame"), structure(list(
ID = c("A", "D", "E"), Value = c(1, 1, 1)), .Names = c("ID",
"Value"), row.names = c(NA, -3L), class = "data.frame"), structure(list(
ID = c("B", "C"), Value = c(1, 1)), .Names = c("ID", "Value"
), row.names = c(NA, -2L), class = "data.frame"))