tidyr spread does not aggregate data - r

I have data like the following:
> data <- data.frame(unique=1:9, grouping=rep(c('a', 'b', 'c'), each=3), value=sample(1:30, 9))
> data
unique grouping value
1 1 a 15
2 2 a 21
3 3 a 26
4 4 b 8
5 5 b 6
6 6 b 4
7 7 c 17
8 8 c 1
9 9 c 3
I would like to create a table that looks like this:
a b c
1 15 8 17
2 21 6 1
3 26 4 3
I am using tidyr::spread and not getting the correct result:
> data %>% spread(grouping, value)
unique a b c
1 1 15 NA NA
2 2 21 NA NA
3 3 26 NA NA
4 4 NA 8 NA
5 5 NA 6 NA
6 6 NA 4 NA
7 7 NA NA 17
8 8 NA NA 1
9 9 NA NA 3
Or
> data %>% select(grouping, value) %>% spread(grouping, value)
Error: Duplicate identifiers for rows (1, 2, 3), (4, 5, 6), (7, 8, 9)
Is there a way to do this also when one group (c) has a different length than the others?

spread needs each output row to be uniquely identified, so we create a sequence column within each group to avoid the "Duplicate identifiers for rows" error.
library(tidyr)
library(dplyr)
data %>%
group_by(grouping) %>%
mutate(id = row_number()) %>%
select(-unique) %>%
spread(grouping, value) %>%
select(-id)
# a b c
# (int) (int) (int)
#1 15 8 17
#2 21 6 1
#3 26 4 3
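The question also asks about groups of different lengths. A minimal sketch, assuming tidyr 1.0.0 or later (where pivot_wider supersedes spread) and a made-up data frame in which group c has only two rows; the shorter group is padded with NA:

```r
library(dplyr)
library(tidyr)

# Hypothetical data: group "c" has only two rows
data <- data.frame(
  unique   = 1:8,
  grouping = c("a", "a", "a", "b", "b", "b", "c", "c"),
  value    = c(15, 21, 26, 8, 6, 4, 17, 1)
)

wide <- data %>%
  group_by(grouping) %>%
  mutate(id = row_number()) %>%   # position of each row within its group
  ungroup() %>%
  select(-unique) %>%             # drop the column that keeps every row distinct
  pivot_wider(names_from = grouping, values_from = value) %>%
  select(-id)

wide
```

The id column plays the same role as in the spread version above: it tells the reshaping step which values belong on the same output row.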

Is there a way to group values in a column between data gaps in R?

I want to group my data in different chunks when the data is continuous. Trying to get the group column from dummy data like this:
a b group
<dbl> <dbl> <dbl>
1 1 1 1
2 2 2 1
3 3 3 1
4 4 NA NA
5 5 NA NA
6 6 NA NA
7 7 12 2
8 8 15 2
9 9 NA NA
10 10 25 3
I tried using
test %>% mutate(test = complete.cases(.)) %>%
group_by(group = cumsum(test == TRUE)) %>%
select(group, everything())
But it doesn't work as expected:
group a b test
<int> <dbl> <dbl> <lgl>
1 1 1 1 TRUE
2 2 2 2 TRUE
3 3 3 3 TRUE
4 3 4 NA FALSE
5 3 5 NA FALSE
6 3 6 NA FALSE
7 4 7 12 TRUE
8 5 8 15 TRUE
9 5 9 NA FALSE
10 6 10 25 TRUE
Any advice?
Using rle in base R (df here is the question's data, including the desired group column):
transform(df, group1 = with(rle(!is.na(b)), rep(cumsum(values), lengths))) |>
transform(group1 = replace(group1, is.na(b), NA))
# a b group group1
#1 1 1 1 1
#2 2 2 1 1
#3 3 3 1 1
#4 4 NA NA NA
#5 5 NA NA NA
#6 6 NA NA NA
#7 7 12 2 2
#8 8 15 2 2
#9 9 NA NA NA
#10 10 25 3 3
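To see why the rle approach works, it may help to look at the intermediate values. This is only an illustrative sketch; the data frame test is rebuilt here from the question's table, and the names runs, run_id and group are made up for the example:

```r
# Rebuild the question's data (columns a and b only)
test <- data.frame(a = 1:10, b = c(1, 2, 3, NA, NA, NA, 12, 15, NA, 25))

# Run-length encode "is b present?"
runs <- rle(!is.na(test$b))
runs$values   # TRUE FALSE TRUE FALSE TRUE
runs$lengths  # 3 3 2 1 1

# cumsum over the run values counts the TRUE runs seen so far;
# rep() expands that counter back to one entry per row
run_id <- rep(cumsum(runs$values), runs$lengths)

# Rows where b is missing do not belong to any group
group <- replace(run_id, is.na(test$b), NA)
group
```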
A couple of approaches to consider if you wish to use dplyr for this.
First, you could look at transition from non-complete cases (using lag) to complete cases.
library(dplyr)
test %>%
mutate(test = complete.cases(.)) %>%
group_by(group = cumsum(test & !lag(test, default = FALSE))) %>%
mutate(group = replace(group, !test, NA))
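The difference between the question's attempt and the lag-based version can be seen on a plain logical vector. A small sketch, where complete stands in for the result of complete.cases on the question's data and the other names are illustrative:

```r
library(dplyr)

# TRUE where the row is complete, as in the question's data
complete <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE)

# The question's attempt: the counter increases on every complete row
naive <- cumsum(complete)

# Only count rows where a complete run starts:
# complete now, and not complete on the previous row
starts <- complete & !lag(complete, default = FALSE)
group <- cumsum(starts)

# Incomplete rows belong to no group
group[!complete] <- NA
group
```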
Alternatively, you could add row numbers to your data.frame. Then, you could filter to include only complete cases, and group_by enumerating with cumsum based on gaps in row numbers. Then, join back to original data.
test$rn <- seq.int(nrow(test))
test %>%
filter(complete.cases(.)) %>%
group_by(group = c(0, cumsum(diff(rn) > 1)) + 1) %>%
right_join(test) %>%
arrange(rn) %>%
dplyr::select(-rn)
Output
a b group
<int> <int> <dbl>
1 1 1 1
2 2 2 1
3 3 3 1
4 4 NA NA
5 5 NA NA
6 6 NA NA
7 7 12 2
8 8 15 2
9 9 NA NA
10 10 25 3
Using data.table, get rleid then remove group IDs for NAs, then fix the sequence with factor to integer conversion:
library(data.table)
setDT(test)[, group1 := {
x <- complete.cases(test)
grp <- rleid(x)
grp[ !x ] <- NA
as.integer(factor(grp))
}]
# a b group group1
# 1: 1 1 1 1
# 2: 2 2 1 1
# 3: 3 3 1 1
# 4: 4 NA NA NA
# 5: 5 NA NA NA
# 6: 6 NA NA NA
# 7: 7 12 2 2
# 8: 8 15 2 2
# 9: 9 NA NA NA
# 10: 10 25 3 3

Update a variable if dplyr filter conditions are met

The command df %>% filter(is.na(df)[,2:4]) subsets the rows that have NAs in columns 2, 3 and 4 into a new df. What I want is not a new, subsetted df but to assign a value, for example "1", to a new variable called "Exclude" in the original df.
This example with mutate was close, but not exactly what I was looking for:
Use dplyr's filter and mutate to generate a new variable
I would also need the same to work with other filter conditions.
For example, I have the following:
df <- data.frame(A = 1:6, B = 11:16, C = 21:26, D = 31:36)
df[3,2:4] <- NA
df[5,2:4] <- NA
df
> df
A B C D
1 1 11 21 31
2 2 12 22 32
3 3 NA NA NA
4 4 14 24 34
5 5 NA NA NA
6 6 16 26 36
and would like
> df
A B C D Exclude
1 1 11 21 31 NA
2 2 12 22 32 NA
3 3 NA NA NA 1
4 4 14 24 34 NA
5 5 NA NA NA 1
6 6 16 26 36 NA
Any good ideas on how the filter condition could be used to update the data frame easily? The workaround would be to generate the subset, create the new variable there and then join back, but that is not tidy code.
We can do this with base R using vectorized rowSums
df$Exclude <- NA^!rowSums(is.na(df[-1]))
Output:
df
# A B C D Exclude
#1 1 11 21 31 NA
#2 2 12 22 32 NA
#3 3 NA NA NA 1
#4 4 14 24 34 NA
#5 5 NA NA NA 1
#6 6 16 26 36 NA
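The NA^! expression above is compact but cryptic. It relies on NA^0 being 1 and NA^1 being NA in R. A short sketch of the same logic on a plain vector, next to a more readable ifelse equivalent (na_counts stands in for the rowSums result; the variable names are illustrative):

```r
# Number of NAs per row, as rowSums(is.na(df[-1])) would return for the example
na_counts <- c(0, 0, 3, 0, 3, 0)

# !na_counts is TRUE where the count is zero, so:
#   rows without NAs -> NA^TRUE  = NA^1 = NA
#   rows with NAs    -> NA^FALSE = NA^0 = 1
trick <- NA^!na_counts

# The same result, spelled out
readable <- ifelse(na_counts > 0, 1, NA)

trick
readable
```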
Does this work:
library(dplyr)
df %>% rowwise() %>%
mutate(Exclude = +any(is.na(c_across(everything()))), Exclude = na_if(Exclude, 0))
# A tibble: 6 x 5
# Rowwise:
A B C D Exclude
<int> <int> <int> <int> <int>
1 1 11 21 31 NA
2 2 12 22 32 NA
3 3 NA NA NA 1
4 4 14 24 34 NA
5 5 NA NA NA 1
6 6 16 26 36 NA
Using anyNA.
df %>% mutate(Exclude=ifelse(apply(df[2:4], 1, anyNA), 1, NA))
# A B C D Exclude
# 1 1 11 21 31 NA
# 2 2 12 22 32 NA
# 3 3 NA NA NA 1
# 4 4 14 24 34 NA
# 5 5 NA NA NA 1
# 6 6 16 26 36 NA
Or just
df$Exclude <- ifelse(apply(df[2:4], 1, anyNA), 1, NA)
Another one-line solution (note that this gives 0 rather than NA for rows without missing values):
df$Exclude <- as.numeric(apply(df[2:4], 1, function(x) any(is.na(x))))
Use rowwise, sum over all numeric columns, assign 1 or NA in ifelse.
df <- data.frame(A = 1:6, B = 11:16, C = 21:26, D = 31:36)
df[3, 2:4] <- NA
df[5, 2:4] <- NA
library(tidyverse)
df %>%
rowwise() %>%
mutate(Exclude = ifelse(
is.na(sum(c_across(where(is.numeric)))), 1, NA
))
#> # A tibble: 6 x 5
#> # Rowwise:
#> A B C D Exclude
#> <int> <int> <int> <int> <dbl>
#> 1 1 11 21 31 NA
#> 2 2 12 22 32 NA
#> 3 3 NA NA NA 1
#> 4 4 14 24 34 NA
#> 5 5 NA NA NA 1
#> 6 6 16 26 36 NA
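If rowwise turns out to be slow on larger data, a vectorized variant may be worth trying. This is a sketch assuming dplyr 1.0.4 or later, which provides if_any; the name result is made up for the example:

```r
library(dplyr)

df <- data.frame(A = 1:6, B = 11:16, C = 21:26, D = 31:36)
df[3, 2:4] <- NA
df[5, 2:4] <- NA

# if_any() returns TRUE for rows where is.na() holds in any of B, C, D
result <- df %>%
  mutate(Exclude = ifelse(if_any(B:D, is.na), 1, NA))

result
```

Other filter conditions can be handled the same way by swapping the predicate passed to if_any (or using if_all when every column must match).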

replacing missing values with non-missing values in grouped data using tidyverse

For each id, I am trying to replace missing values with data that is available.
library(tidyverse)
df <- data.frame(id=c(1,1,1,2,2,2,3),
a=c(NA, NA, 10, NA, 12, NA, 10),
b=c(10, NA, NA, NA, 13,NA, NA))
> df
id a b
1 1 NA 10
2 1 NA NA
3 1 10 NA
4 2 NA NA
5 2 12 13
6 2 NA NA
7 3 10 NA
I have tried:
df %>%
dplyr::group_by(id) %>%
dplyr::mutate_at(vars(a:b), fill(., direction="up"))
and get the following error:
Error: 1 components of `...` had unexpected names.
We detected these problematic arguments:
* `direction`
Did you misspecify an argument?
Desired output:
id a b
1 1 10 10
2 1 10 NA
3 1 10 NA
4 2 12 13
5 2 12 13
6 2 12 13
7 3 10 NA
We don't use fill with mutate_at; fill takes the data frame and a column selection directly. Note also that the argument is .direction, not direction. According to ?fill:
data - A data frame.
... - A selection of columns. If empty, nothing happens. You can supply bare variable names, select all variables between x and z with x:z, exclude y with -y.
library(dplyr)
library(tidyr)
df %>%
group_by(id) %>%
fill(a:b, .direction = 'up')
# A tibble: 7 x 3
# Groups: id [3]
# id a b
# <dbl> <dbl> <dbl>
#1 1 10 10
#2 1 10 NA
#3 1 10 NA
#4 2 12 13
#5 2 12 13
#6 2 NA NA
#7 3 10 NA
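In the output above, row 6 stays NA because "up" only fills from later rows to earlier ones. If values should be filled in both directions within each id, one option is .direction = "updown" (available in tidyr 1.0.0 and later). A sketch on the question's data, with filled as an illustrative name:

```r
library(dplyr)
library(tidyr)

df <- data.frame(id = c(1, 1, 1, 2, 2, 2, 3),
                 a  = c(NA, NA, 10, NA, 12, NA, 10),
                 b  = c(10, NA, NA, NA, 13, NA, NA))

# Fill upwards first, then downwards, separately for each id
filled <- df %>%
  group_by(id) %>%
  fill(a:b, .direction = "updown") %>%
  ungroup()

filled
```

This fills every gap that has any value available in the same id, so for id 1 column b becomes 10 in all three rows, which goes further than the desired output shown in the question.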

R: creating multiple new variables based on conditions of selection of other variables with similar names

I have a data frame where each condition (in the example: hope, dream, joy) has 5 variables (in the example, coded with suffixes x, y, z, a, b, which are the same for each condition).
df <- data.frame(matrix(1:16,5,16))
names(df) <- c('ID','hopex','hopey','hopez','hopea','hopeb','dreamx','dreamy','dreamz','dreama','dreamb','joyx','joyy','joyz','joya','joyb')
df[1,2:6] <- NA
df[3:5,c(7,10,14)] <- NA
This is what the data looks like:
ID hopex hopey hopez hopea hopeb dreamx dreamy dreamz dreama dreamb joyx joyy joyz joya joyb
1 1 NA NA NA NA NA 15 4 9 14 3 8 13 2 7 12
2 2 7 12 1 6 11 16 5 10 15 4 9 14 3 8 13
3 3 8 13 2 7 12 NA 6 11 NA 5 10 15 NA 9 14
4 4 9 14 3 8 13 NA 7 12 NA 6 11 16 NA 10 15
5 5 10 15 4 9 14 NA 8 13 NA 7 12 1 NA 11 16
I want to create a new variable for each condition (hope, dream, joy) that codes whether all of the variables x...b for that condition are NA (0 if all are NA, 1 if any is non-NA). And I want the new variables to be stored in the data frame. Thus, the output should be this:
ID hopex hopey hopez hopea hopeb dreamx dreamy dreamz dreama dreamb joyx joyy joyz joya joyb hope joy dream
1 1 NA NA NA NA NA 15 4 9 14 3 8 13 2 7 12 0 1 1
2 2 7 12 1 6 11 16 5 10 15 4 9 14 3 8 13 1 1 1
3 3 8 13 2 7 12 NA 6 11 NA 5 10 15 NA 9 14 1 1 1
4 4 9 14 3 8 13 NA 7 12 NA 6 11 16 NA 10 15 1 1 1
5 5 10 15 4 9 14 NA 8 13 NA 7 12 1 NA 11 16 1 1 1
The code below does it, but I'm looking for a more elegant solution (e.g., for a case where I have even more conditions). I've tried with various combinations of all(), select(), mutate(), but while they all seem useful, I cannot figure out how to combine them to get what I want. I'm stuck and would be interested in learning to code more efficiently. Thanks in advance!
df$hope <- 0
df[is.na(df$hopex) == FALSE | is.na(df$hopey) == FALSE | is.na(df$hopez) == FALSE | is.na(df$hopea) == FALSE | is.na(df$hopeb) == FALSE, "hope"] <- 1
df$dream <- 0
df[is.na(df$dreamx) == FALSE | is.na(df$dreamy) == FALSE | is.na(df$dreamz) == FALSE | is.na(df$dreama) == FALSE | is.na(df$dreamb) == FALSE, "dream"] <- 1
df$joy<- 0
df[is.na(df$joyx) == FALSE | is.na(df$joyy) == FALSE | is.na(df$joyz) == FALSE | is.na(df$joya) == FALSE | is.na(df$joyb) == FALSE, "joy"] <- 1
Here is an option with tidyverse
library(dplyr)
library(purrr)
library(magrittr)
df %>%
mutate(hope = select(., starts_with('hope')) %>%
is.na %>%
`!` %>%
rowSums %>%
is_greater_than(0) %>%
as.integer)
# hopex hopey hopez hopea hopeb dreamx dreamy dreamz dreama dreamb joyx joyy joyz joya joyb hope
#1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 0
#2 1 1 4 3 2 3 5 4 5 2 5 NA 4 3 1 1
#3 2 NA 4 4 4 3 5 NA 5 5 4 NA 4 5 1 1
#4 4 3 NA 1 1 1 5 2 NA 5 1 2 1 1 1 1
#5 1 NA 4 NA NA 2 1 5 1 2 NA 3 1 2 5 1
Or with rowSums
df %>%
mutate(hope = +(rowSums(!is.na(select(., starts_with('hope'))))!= 0))
For multiple columns, we can create a function
f1 <- function(dat, colSubstr) {
dplyr::select(dat, starts_with(colSubstr)) %>%
is.na %>%
`!` %>%
rowSums %>%
is_greater_than(0) %>%
as.integer
}
df %>%
mutate(hope = f1(., 'hope'),
dream = f1(., 'dream'),
joy = f1(., 'joy'))
Or using base R
cbind(df, sapply(split.default(df, sub(".$", "", names(df))),
function(x) +(rowSums(!is.na(x)) != 0)))
If we want to subset columns
nm1 <- setdiff(names(df), "ID")
cbind(df, sapply(split.default(df[nm1], sub(".$", "", names(df[nm1]))),
function(x) +(rowSums(!is.na(x)) != 0)))
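When the list of conditions is known, a plain loop over the prefixes is another way to scale this to more conditions. A minimal base R sketch on the question's data; the vector conditions and the loop variables are illustrative names, and it assumes no other column name starts with one of the prefixes:

```r
# Rebuild the question's data
df <- data.frame(matrix(1:16, 5, 16))
names(df) <- c('ID', 'hopex', 'hopey', 'hopez', 'hopea', 'hopeb',
               'dreamx', 'dreamy', 'dreamz', 'dreama', 'dreamb',
               'joyx', 'joyy', 'joyz', 'joya', 'joyb')
df[1, 2:6] <- NA
df[3:5, c(7, 10, 14)] <- NA

conditions <- c("hope", "dream", "joy")

for (cond in conditions) {
  # Columns belonging to this condition, selected by name prefix
  cols <- startsWith(names(df), cond)
  # 1 if any of them is non-NA in the row, 0 if all are NA
  df[[cond]] <- as.integer(rowSums(!is.na(df[cols])) > 0)
}

df[c("ID", conditions)]
```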
Data:
set.seed(24)
df <- as.data.frame(matrix(sample(c(NA, 1:5), 5 * 15, replace = TRUE),
ncol = 15, dimnames = list(NULL, paste0(rep(c("hope", "dream", "joy"),
each = 5), c('x', 'y', 'z', 'a', 'b')))))
df[1,] <- NA

Generating multiple column to sort the data out in R

I have a database including names, codes and rooms as follows:
Name1 Code1 R1
A A 12 1
A B 13 2
A C 15 5
A B 8 4
A C 13 2
A D 17 1
A B 16 7
I want to generate columns for the repeated names like this:
Name1 Code1 R1 Name2 Code2 R2 Name3 Code3 R3
A A 12 1
A B 13 2
A C 15 5
A B 8 4 A B 8 4
A C 13 2 A C 13 2
A D 17 1
A B 16 7 A B 16 7
I have searched for a solution but could not find one, or may have missed something. Could you help? Some names (Name1) are repeated up to 5 times, which I did not include in the example, so I would need Name2, Code2, R2; Name3, Code3, R3; and so on.
Sample data:
library(dplyr)
library(tidyr)
df <- read.table(stringsAsFactors = F, header = T, text = "
Name1a Name1b Code1 R1
1 A A 12 1
2 A B 13 2
3 A C 15 5
4 A B 8 4
5 A C 13 2
6 A D 17 1
7 A B 16 7") %>%
tidyr::unite(Name1, Name1a, Name1b)
Edit: The original answer was in packed format, but the OP would like the first set of columns repeated for all lines, with the second and third appearances showing up in the row they originally appeared in.
Here's an approach using dplyr and tidyr.
# Keep track of original rows, label repeats, and make it long format
df_order <- df %>%
mutate(orig_row = row_number()) %>%
group_by(Name1) %>% mutate(repeat_no = row_number()) %>% ungroup() %>%
gather(col_type, value, Code1:R1)
# Make one copy of all the rows to keep in first column
df_ones <- df_order %>%
mutate(repeat_no = 1) %>%
unite(col_rpt, repeat_no, col_type)
# Get the repeated rows to add on
df_repeats <- df_order %>%
filter(repeat_no > 1) %>%
unite(col_rpt, repeat_no, col_type)
# Combine the two and spread out
output <- df_ones %>%
bind_rows(df_repeats) %>%
spread(col_rpt, value) %>%
arrange(orig_row) %>%
select(-orig_row)
Output:
> output
# A tibble: 7 x 7
Name1 `1_Code1` `1_R1` `2_Code1` `2_R1` `3_Code1` `3_R1`
<chr> <int> <int> <int> <int> <int> <int>
1 A_A 12 1 NA NA NA NA
2 A_B 13 2 NA NA NA NA
3 A_C 15 5 NA NA NA NA
4 A_B 8 4 8 4 NA NA
5 A_C 13 2 13 2 NA NA
6 A_D 17 1 NA NA NA NA
7 A_B 16 7 NA NA 16 7
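The key step in the approach above is numbering each appearance of a name. A small sketch of just that step, with the sample data rebuilt directly as a data frame (labelled is an illustrative name):

```r
library(dplyr)

df <- data.frame(
  Name1 = c("A_A", "A_B", "A_C", "A_B", "A_C", "A_D", "A_B"),
  Code1 = c(12, 13, 15, 8, 13, 17, 16),
  R1    = c(1, 2, 5, 4, 2, 1, 7)
)

# repeat_no is 1 for the first appearance of a name, 2 for the second, and so on
labelled <- df %>%
  group_by(Name1) %>%
  mutate(repeat_no = row_number()) %>%
  ungroup()

labelled$repeat_no
```

The repeat number is what later becomes the prefix of the output columns (1_Code1, 2_Code1, 3_Code1), so names repeated five times would simply produce columns up to 5_Code1 and 5_R1.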
