Making explicit implicit missing values in nested levels [duplicate] - r

I am trying to complete my dataframe with missing levels.
Current output
id foo bar val
1 a x 7
2 a y 9
3 a z 6
4 b x 10
5 b y 4
6 b z 5
7 c y 2
Data
structure(list(id = c("1", "2", "3", "4", "5", "6", "7"), foo = c("a",
"a", "a", "b", "b", "b", "c"), bar = c("x", "y", "z", "x", "y",
"z", "y"), val = c("7", "9", "6", "10", "4", "5", "2")), .Names = c("id",
"foo", "bar", "val"), row.names = c(NA, -7L), class = "data.frame")
I would like to make explicit the missing nested levels of c with 0s for x and z. I could find a workaround with expand.grid but could not manage to obtain the desired output with tidyr.
Desired output :
id foo bar val
1 a x 7
2 a y 9
3 a z 6
4 b x 10
5 b y 4
6 b z 5
7 c x 0
8 c y 2
9 c z 0
Thanks in advance!

Given that you are looking for a tidyr solution, you should check out tidyr::complete (which does exactly what you are after):
library(tidyverse)
# val is a character column in the posted data, so the fill value is given as "0"
complete(df, foo, bar, fill = list(val = "0")) %>% select(-id)
#> # A tibble: 9 x 3
#> foo bar val
#> <chr> <chr> <chr>
#> 1 a x 7
#> 2 a y 9
#> 3 a z 6
#> 4 b x 10
#> 5 b y 4
#> 6 b z 5
#> 7 c x 0
#> 8 c y 2
#> 9 c z 0
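For comparison, the expand.grid workaround mentioned in the question can be sketched in base R roughly as follows. This is a minimal sketch, not the asker's actual code: it assumes val is stored as a number (the posted dput stores it as character) and it drops the id column.

```r
# Sample data from the question, with val as numeric (an assumption)
df <- data.frame(
  foo = c("a", "a", "a", "b", "b", "b", "c"),
  bar = c("x", "y", "z", "x", "y", "z", "y"),
  val = c(7, 9, 6, 10, 4, 5, 2),
  stringsAsFactors = FALSE
)

# Build every foo/bar combination, including those absent from df
grid <- expand.grid(foo = unique(df$foo), bar = unique(df$bar),
                    stringsAsFactors = FALSE)

# Left join onto the grid; combinations missing from df get NA in val
full <- merge(grid, df, by = c("foo", "bar"), all.x = TRUE)

# Make the implicit missing values explicit as 0
full$val[is.na(full$val)] <- 0
full <- full[order(full$foo, full$bar), ]
full
```

merge() already sorts by the key columns, so the final order() call is only there to make the intended ordering explicit.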

Related

combining rows based on a condition in R

I am trying to remove some redundant rows from the df below. Each ID can have several types (1:5), and the yes_no variable indicates whether a value was recorded. I would like to remove the 3rd and 5th rows, because other rows with the same ID and type have a recorded value (yes_no = y).
df <- data.frame(ID = c("1", "1", "1", "1", "1", "1", "1", "1"), type = c("1", "2", "3", "3", "4", "4", "4", "5"), yes_no = c("n", "n", "n", "y", "n", "y", "y", "n"), value = c(NA, NA, NA, "2", NA, "5", "6", NA))
ID type yes_no value
1 1 n <NA>
1 2 n <NA>
1 3 n <NA>
1 3 y 2
1 4 n <NA>
1 4 y 5
1 4 y 6
1 5 n <NA>
The desired output is as follows:
df2 <- data.frame(ID = c("1", "1", "1", "1", "1", "1"), type = c("1", "2", "3", "4", "4", "5"), yes_no = c("n", "n", "y", "y", "y", "n"), value = c(NA, NA, "2", "5", "6", NA))
ID type yes_no value
1 1 n <NA>
1 2 n <NA>
1 3 y 2
1 4 y 5
1 4 y 6
1 5 n <NA>
There are IDs other than 1 that have types 1:5, so it looks like I have to group_by(ID). A dplyr solution would be great too.
Any help would be appreciated, thanks!
You may use an if condition to check if yes_no has any y value.
library(dplyr)
df %>%
group_by(ID, type) %>%
filter(if(any(yes_no == 'y')) yes_no == 'y' else TRUE) %>%
ungroup
# ID type yes_no value
# <chr> <chr> <chr> <chr>
#1 1 1 n NA
#2 1 2 n NA
#3 1 3 y 2
#4 1 4 y 5
#5 1 4 y 6
#6 1 5 n NA
A base R option using subset + ave
subset(
df,
ave(yes_no == "y", ID, type, FUN = max) == (yes_no == "y")
)
gives
ID type yes_no value
1 1 1 n <NA>
2 1 2 n <NA>
4 1 3 y 2
6 1 4 y 5
7 1 4 y 6
8 1 5 n <NA>
After grouping by 'ID' and 'type', we may use an OR (|) condition in filter to keep the rows where yes_no is 'y', or every row of a group in which no element is 'y'.
library(dplyr)
df %>%
group_by(ID, type) %>%
filter(yes_no == 'y'|all(yes_no != 'y')) %>%
ungroup
-output
# A tibble: 6 x 4
ID type yes_no value
<chr> <chr> <chr> <chr>
1 1 1 n <NA>
2 1 2 n <NA>
3 1 3 y 2
4 1 4 y 5
5 1 4 y 6
6 1 5 n <NA>
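Since the question mentions that other IDs exist, here is a small base R sketch of the same keep rule on data with a second ID. The rows for ID "2" are hypothetical and only there to show that the condition is evaluated per ID/type group.

```r
# Hypothetical data: ID "2" is invented for illustration
df <- data.frame(
  ID     = c("1", "1", "1", "2", "2"),
  type   = c("3", "3", "4", "3", "3"),
  yes_no = c("n", "y", "n", "n", "n"),
  value  = c(NA, "2", NA, NA, NA),
  stringsAsFactors = FALSE
)

# For every row: does its ID/type group contain at least one "y"?
has_y <- ave(df$yes_no == "y", df$ID, df$type, FUN = any)

# Keep "y" rows, and keep all rows of groups that have no "y" at all
keep <- df$yes_no == "y" | !has_y
res <- df[keep, ]
res
```

Only the "n" row of ID 1, type 3 is dropped; both "n" rows of ID 2, type 3 stay because that group has no "y".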

how to split a dataframe by specific rows in r

I have data that looks like this:
data <- structure(list(A = c("1", "1", "1", "A", "10", "10", "B", "200"), B = c("2", "2", "2", "B", "20", "20", "C", "300"), C = c("3","3", "3", "C", "30", "30", "D", "400"), D = c("4", "4", "4", "D", "40", "40", NA, NA)), row.names = c(NA, -8L), class = c("tbl_df","tbl", "data.frame"))
> data
# A tibble: 8 x 4
A B C D
<chr> <chr> <chr> <chr>
1 1 2 3 4
2 1 2 3 4
3 1 2 3 4
4 A B C D
5 10 20 30 40
6 10 20 30 40
7 B C D NA
8 200 300 400 NA
The data frames were bound by rows incorrectly, and I want to split the data into 3 sub data frames (d1, d2 and d3) like this:
NOTE: In my real situation, d1, d2 and d3 have different nrow(). I set nrow(d1) = 3, nrow(d2) = 2 and nrow(d3) = 1 just to simplify the question in this example.
d1 <- data.frame(A = rep(1,3), B = rep(2,3), C = rep(3,3), D = rep(4,3))
d2 <- data.frame(A = rep(10,2), B = rep(20,2), C = rep(30,2), D = rep(40,2))
d3 <- data.frame( B = 200, C = 300, D = 400)
> d1
A B C D
1 1 2 3 4
2 1 2 3 4
3 1 2 3 4
> d2
A B C D
1 10 20 30 40
2 10 20 30 40
> d3
B C D
1 200 300 400
And then I could bind them correctly using bind_rows from dplyr
bind_rows(d1, d2, d3) %>% as_tibble()
# A tibble: 6 x 4
A B C D
<dbl> <dbl> <dbl> <dbl>
1 1 2 3 4
2 1 2 3 4
3 1 2 3 4
4 10 20 30 40
5 10 20 30 40
6 NA 200 300 400
The problem is that I do not know how to get d1, d2 and d3 from data.
Any help will be highly appreciated!
Here is a tidyverse solution.
process_df takes a data frame and sets the column names and removes the first row.
process_df <- function(df, ...) {
df %>%
set_names(slice(., 1)) %>%
select(which(!is.na(names(.)))) %>%
slice(-1)
}
Add a header row that just contains the column names.
Use rowwise() and c_across() to get the values of all columns by row. Use this to identify which rows are header rows.
group_map will apply a function over each group and bind_rows will combine the results.
data %>%
add_row(!!!set_names(names(.)), .before = 1) %>%
rowwise() %>%
mutate(
group = all(is.na(c_across()) | c_across() %in% names(.))
) %>%
ungroup() %>%
mutate(group = cumsum(group)) %>%
group_by(group) %>%
group_map(process_df) %>%
bind_rows()
#> # A tibble: 6 x 4
#> A B C D
#> <chr> <chr> <chr> <chr>
#> 1 1 2 3 4
#> 2 1 2 3 4
#> 3 1 2 3 4
#> 4 10 20 30 40
#> 5 10 20 30 40
#> 6 NA 200 300 400
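The header-detection idea described above (a row is a header when all of its non-missing values are column names) can also be sketched in base R. This is a rough sketch under the same assumption as the answer: no genuine data row consists only of column names. The names is_header, block and pieces are illustrative.

```r
data <- data.frame(
  A = c("1", "1", "1", "A", "10", "10", "B", "200"),
  B = c("2", "2", "2", "B", "20", "20", "C", "300"),
  C = c("3", "3", "3", "C", "30", "30", "D", "400"),
  D = c("4", "4", "4", "D", "40", "40", NA, NA),
  stringsAsFactors = FALSE
)

# Prepend the column names as a header row so every block starts with one
full <- rbind(names(data), data)

# A row counts as a header if each value is NA or one of the column names
is_header <- apply(full, 1, function(r) all(is.na(r) | r %in% names(data)))

# The block id increases by one at every header row
block <- cumsum(is_header)

pieces <- lapply(split(full, block), function(p) {
  hdr  <- unname(unlist(p[1, ]))             # first row holds the block's names
  body <- p[-1, !is.na(hdr), drop = FALSE]   # drop the header row and unused columns
  names(body) <- hdr[!is.na(hdr)]
  rownames(body) <- NULL
  body
})
names(pieces) <- paste0("d", names(pieces))
pieces
```

All columns are still character at this point; type.convert() could be applied to each piece before binding them back together.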
Explanation of the usage of !!! in add_row
set_names(names(.)) creates a named vector that represents the row we want to add. However, add_row doesn't accept a named vector - it wants the values to be specified as arguments.
Here is a simplified example.
new_row <- c(speed = 1, dist = 2)
add_row doesn't accept a named vector, so this doesn't work.
cars %>% add_row(new_row, .before = TRUE)
# (Error)
!!! will unpack the vector as arguments to the function.
cars %>% add_row(!!!new_row, .before = TRUE)
# (Works)
!!! above essentially results in this:
cars %>% add_row(speed = 1, dist = 2, .before = TRUE)
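A loosely comparable idea in base R is do.call, which passes the elements of a list to a function as separate (named) arguments. The sketch below is only an analogy to illustrate unpacking; it does not use add_row and is not how !!! is implemented.

```r
new_row <- c(speed = 1, dist = 2)

# do.call(data.frame, list(speed = 1, dist = 2)) behaves like
# data.frame(speed = 1, dist = 2)
row_df <- do.call(data.frame, as.list(new_row))

# Put the new row in front of the built-in cars dataset
out <- rbind(row_df, cars)
head(out, 3)
```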
Does this work:
data
# A tibble: 5 x 4
A B C D
<chr> <chr> <chr> <chr>
1 1 2 3 4
2 A B C D
3 10 20 30 40
4 B C D NA
5 200 300 400 NA
data <- rbind(LETTERS[1:4],data)
data
# A tibble: 6 x 4
A B C D
<chr> <chr> <chr> <chr>
1 A B C D
2 1 2 3 4
3 A B C D
4 10 20 30 40
5 B C D NA
6 200 300 400 NA
split(data, rep(1:ceiling(nrow(data)/2), each = 2))
$`1`
# A tibble: 2 x 4
A B C D
<chr> <chr> <chr> <chr>
1 A B C D
2 1 2 3 4
$`2`
# A tibble: 2 x 4
A B C D
<chr> <chr> <chr> <chr>
1 A B C D
2 10 20 30 40
$`3`
# A tibble: 2 x 4
A B C D
<chr> <chr> <chr> <chr>
1 B C D NA
2 200 300 400 NA
Base R solution:
Map(function(x){setNames(data.frame(t(x[,2, drop = FALSE])), x[,1])[,!is.na(x[,1])]},
split.default(cbind(X0 = names(df), data.frame(t(df))), c(0, seq_len(nrow(df)) %/% 2)))
Including pushing separate data.frames to Global Environment:
list2env(setNames(Map(function(x){setNames(data.frame(t(x[,2, drop = FALSE])), x[,1])[,!is.na(x[,1])]},
split.default(cbind(X0 = names(df), data.frame(t(df))), c(0, seq_len(nrow(df)) %/% 2))),
paste0('d', seq_len(ceiling(nrow(df) / 2)))), .GlobalEnv)
Tidyverse Solution:
library(tidyverse)
df %>%
rbind(names(df), .) %>%
split(cumsum(seq_len(nrow(.)) %% 2)) %>%
Map(function(x){setNames(x[2,], x[1,])[,complete.cases(t(x))]}, .) %>%
set_names(str_c('d', names(.))) %>%
list2env(., .GlobalEnv)
Note solution adjusted to reflect edit to the question:
rdf <- type.convert(data.frame(t(rbind(names(df), df))))
Map(function(x){
y <- setNames(t(x[,-1, drop = FALSE]), x[,1]); y[,!is.na(colSums(y))]
}, split.default(rdf, cumsum(!sapply(rdf, is.integer))))
New solution including push to Global Env:
rdf <- type.convert(data.frame(t(rbind(names(df), df))))
dflist <- Map(function(x) {
y <-
setNames(t(x[, -1, drop = FALSE]), x[, 1])
y[, !is.na(colSums(y))]
}, split.default(rdf, cumsum(!sapply(rdf, is.integer))))
list2env(setNames(dflist, paste0('d', names(dflist))), .GlobalEnv)
Adjusted Tidyverse solution:
df %>%
rbind(names(.), .) %>%
t() %>%
data.frame() %>%
type.convert() %>%
split.default(cumsum(!sapply(., is.integer))) %>%
Map(function(x){
y <- setNames(t(x[,-1, drop = FALSE]), x[,1])
data.frame(y[,!is.na(colSums(y)), drop = FALSE])}, .) %>%
set_names(str_c('d', names(.))) %>%
list2env(., .GlobalEnv)
Data:
df <- structure(list(A = c("1", "A", "10", "B", "200"), B = c("2", "B", "20", "C", "300"), C = c("3", "C", "30", "D", "400"), D = c("4","D", "40", NA, NA)), row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"))
Updated Data:
df <- structure(list(A = c("1", "1", "1", "A", "10", "10", "B", "200"), B = c("2", "2", "2", "B", "20", "20", "C", "300"), C = c("3","3", "3", "C", "30", "30", "D", "400"), D = c("4", "4", "4", "D", "40", "40", NA, NA)), row.names = c(NA, -8L), class = c("tbl_df","tbl", "data.frame"))

arrange a complicated data set in R

I have a large data set:
> ncol(d)
[1] 1680
> nrow(d)
[1] 12
that looks like this:
a b c e f g
3 2 5 1 3 6
a b c d e g
1 7 8 4 5 8
a c d e f h #in this row b does not exist
5 10 4 7 5 10
And I need it to look like this:
a b c d e f g h
3 2 5 0 1 3 6 0
1 7 8 4 5 0 8 0
5 0 10 4 7 5 0 10 #and all the other columns ...
Since my data is really long and I have many corrections like this one to make across the whole data set, it is hard to do by hand. I would like to know whether there is an automated way to do this, such as a function or a loop.
Any idea is welcome
Regards
Here's a possible approach using data.table:
library(data.table)
melt(
setDT(
setnames(
data.table::transpose(df1),
paste(rep(1:(nrow(df1)/2), each = 2), c("name", "value"), sep = "_"))),
measure = patterns("name", "value"))[
, dcast(.SD, variable ~ value1, value.var = "value2", fill = 0)]
# variable a b c d e f g h
# 1: 1 3 2 5 0 1 3 6 0
# 2: 2 1 7 8 4 5 0 8 0
# 3: 3 5 0 10 4 7 5 0 10
We could get the alternate rows with a recycled logical vector, construct a data frame from them, and pivot it to wide format with pivot_wider
library(dplyr)
library(tidyr)
library(data.table)
sub1 <- df1[c(TRUE, FALSE),]
sub2 <- df1[c(FALSE, TRUE),]
tibble(ind = c(row(sub1)), col1 = factor(unlist(sub1), levels = letters[1:8]),
col2 = as.integer(unlist(sub2))) %>%
pivot_wider(names_from = col1, values_from = col2,
values_fill = list(col2 = 0)) %>%
select(-ind)
#A tibble: 3 x 8
# a b c d e f g h
# <int> <int> <int> <int> <int> <int> <int> <int>
#1 3 2 5 0 1 3 6 0
#2 1 7 8 4 5 0 8 0
#3 5 0 10 4 7 5 0 10
Or using base R with reshape
out <- reshape(
data.frame(ind = c(row(sub1)),
col1 = factor(unlist(sub1), levels = letters[1:8]),
col2 = as.integer(unlist(sub2))),
idvar = 'ind', direction = 'wide', timevar = 'col1')[-1]
names(out) <- sub("col2\\.", "", names(out))
out[is.na(out)] <- 0
row.names(out) <- NULL
out
# a b c d e f g h
#1 3 2 5 0 1 3 6 0
#2 1 7 8 4 5 0 8 0
#3 5 0 10 4 7 5 0 10
data
df1 <- structure(list(v1 = c("a", "3", "a", "1", "a", "5"), v2 = c("b",
"2", "b", "7", "c", "10"), v3 = c("c", "5", "c", "8", "d", "4"
), v4 = c("e", "1", "d", "4", "e", "7"), v5 = c("f", "3", "e",
"5", "f", "5"), v6 = c("g", "6", "g", "8", "h", "10")), class = "data.frame",
row.names = c(NA,
-6L))
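Another way to look at the problem is as filling a zero matrix by (row, column) position. The following base R sketch assumes, as in the sample data, that name rows and value rows strictly alternate; the names nm, val, cols and out are illustrative.

```r
df1 <- data.frame(
  v1 = c("a", "3", "a", "1", "a", "5"),
  v2 = c("b", "2", "b", "7", "c", "10"),
  v3 = c("c", "5", "c", "8", "d", "4"),
  v4 = c("e", "1", "d", "4", "e", "7"),
  v5 = c("f", "3", "e", "5", "f", "5"),
  v6 = c("g", "6", "g", "8", "h", "10"),
  stringsAsFactors = FALSE
)

nm  <- as.matrix(df1[c(TRUE, FALSE), ])  # odd rows: column names
val <- as.matrix(df1[c(FALSE, TRUE), ])  # even rows: matching values

# All names that occur anywhere, in alphabetical order
cols <- sort(unique(c(nm)))

# Start from zeros, then write each value at (record, position of its name)
out <- matrix(0L, nrow = nrow(nm), ncol = length(cols),
              dimnames = list(NULL, cols))
out[cbind(c(row(nm)), match(c(nm), cols))] <- as.integer(val)
out
```

Names that never appear in a record simply keep the initial 0.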

how can I regroup multiple categorical variables into a new variable

I have a data.frame (df) with 2 columns (A, B):
A B
1 a TCRB
2 a TCRG
3 a TCRB
4 b TCRB
5 b TCRG
6 c TCRB
7 c TCRB
8 c TCRB
9 c TCRB
10 d TCRG
11 d TCRG
12 d TCRG
I want to create a new column "C" as below that tells me whether each unique value in "A" has both TCRB and TCRG or only one of them (0 = TCRB only, 1 = TCRG only, 2 = both):
A: a b c d
C: 2 2 0 1
Greatly appreciate any help!
Here's an approach with dplyr:
library(dplyr)
df %>%
group_by(A) %>%
dplyr::summarise(C = case_when("TCRB" %in% B & "TCRG" %in% B ~ 2,
"TCRB" %in% B ~ 0,
"TCRG" %in% B ~ 1,
TRUE ~ NA_real_))
# A tibble: 4 x 2
A C
<fct> <dbl>
1 a 2
2 b 2
3 c 0
4 d 1
An option with n_distinct
library(dplyr)
df %>%
group_by(A) %>%
summarise(C = n_distinct(B) *!all(B == 'TCRB'))
# A tibble: 4 x 2
# A C
# <chr> <int>
#1 a 2
#2 b 2
#3 c 0
#4 d 1
data
df <- structure(list(A = c("a", "a", "a", "b", "b", "c", "c", "c",
"c", "d", "d", "d"), B = c("TCRB", "TCRG", "TCRB", "TCRB", "TCRG",
"TCRB", "TCRB", "TCRB", "TCRB", "TCRG", "TCRG", "TCRG")),
class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"))
In Base R, we can use aggregate :
aggregate(B~A, df, function(x) {
if(all(c('TCRB', 'TCRG') %in% x)) 2
else if(any(x == 'TCRG')) 1
else if(any(x == 'TCRB')) 0
else NA
})
# A B
#1 a 2
#2 b 2
#3 c 0
#4 d 1
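The same coding can also be read off a cross-tabulation of A against B. This is a small sketch assuming B only ever contains "TCRB" and "TCRG", as in the sample data; present is an illustrative name.

```r
df <- data.frame(
  A = c("a", "a", "a", "b", "b", "c", "c", "c", "c", "d", "d", "d"),
  B = c("TCRB", "TCRG", "TCRB", "TCRB", "TCRG", "TCRB", "TCRB",
        "TCRB", "TCRB", "TCRG", "TCRG", "TCRG"),
  stringsAsFactors = FALSE
)

# Logical matrix: does group A (rows) contain category B (columns)?
present <- unclass(table(df$A, df$B)) > 0

# 2 = both, 1 = TCRG only, 0 = TCRB only
C <- ifelse(present[, "TCRB"] & present[, "TCRG"], 2,
            ifelse(present[, "TCRG"], 1, 0))
C
```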

replacing multiple values in data frame in R

I want to replace multiple different character strings with the same value in a single call. However, the following code only replaces some of the values in each variable.
dat <-data.frame(x=c(rep("1=x",4),rep("b",4)),y=c(rep("1=z",4),rep("b",4)))
dat[] <- sapply(dat[], as.character)
dat[dat == c("1=x", "1=y")]<- 1
such that I get:
dat
x y
1 1 1=z
2 1=x 1=z
3 1 1=z
4 1=x 1=z
5 b b
6 b b
7 b b
8 b b
when what I want is the following:
dat
x y
1 1 1
2 1 1
3 1 1
4 1 1
5 b b
6 b b
7 b b
8 b b
With dplyr:
library(dplyr)
# funs() is deprecated in current dplyr; across() expresses the same replacement
dat <- mutate(dat, across(everything(), ~ replace(.x, grepl("1=", .x), "1")))
With Base R:
dat[] <- lapply(dat, function(x) replace(x, grepl("1=", x), 1))
Result:
x y
1 1 1
2 1 1
3 1 1
4 1 1
5 b b
6 b b
7 b b
8 b b
Data:
dat <- structure(list(x = c("1=x", "1=x", "1=x", "1=x", "b", "b", "b",
"b"), y = c("1=z", "1=z", "1=z", "1=z", "b", "b", "b", "b")), .Names = c("x",
"y"), row.names = c(NA, -8L), class = "data.frame")
Another Base R option if you want to make an explicit replacement of certain strings would be:
dat[] <- lapply(dat,function(x) ifelse(x %in% c("1=x", "1=z"), 1, x))
Result:
x y
1 1 1
2 1 1
3 1 1
4 1 1
5 b b
6 b b
7 b b
8 b b
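The reason the original `==` attempt replaced only some values is vector recycling: the two-element right-hand side is repeated element by element instead of being treated as a set. A short sketch on a plain vector (x is a made-up example) contrasts the two:

```r
x <- c("1=x", "1=x", "1=x", "1=x", "b", "b")

# == recycles the right-hand side: "1=x", "1=y", "1=x", "1=y", ...
eq <- x == c("1=x", "1=y")

# %in% checks each element of x against the whole set
inn <- x %in% c("1=x", "1=y")

eq   # only positions 1 and 3 match
inn  # positions 1 to 4 match
```

This matches the pattern in the question's output, where rows 1 and 3 of x were replaced and rows 2 and 4 were not.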
Data:
dat <- structure(list(x = c("1=x", "1=x", "1=x", "1=x", "b", "b", "b",
"b"), y = c("1=z", "1=z", "1=z", "1=z", "b", "b", "b", "b")), row.names = c(NA,
-8L), class = "data.frame")
