Using pivot_wider instead of spread for R dataframe

My understanding was that, to an extent, pivot_wider(data, names_from, values_from) is roughly equivalent to spread(data, key, value). I'm not seeing that here.
library(tidyverse)
df = data.frame(
A = sample(letters[1:3], 20, replace=T),
B = sample(LETTERS[1:3], 20, replace=T)
)
df
#> A B
#> 1 c B
#> 2 b C
#> 3 b B
#> 4 b C
#> 5 c C
#> 6 b B
#> 7 c B
#> 8 b A
#> 9 c A
#> 10 b B
#> 11 c B
#> 12 b B
#> 13 b A
#> 14 c C
#> 15 c C
#> 16 a B
#> 17 c C
#> 18 b C
#> 19 b A
#> 20 a C
df %>% count(A,B)
#> A B n
#> 1 a B 1
#> 2 a C 1
#> 3 b A 3
#> 4 b B 4
#> 5 b C 3
#> 6 c A 1
#> 7 c B 3
#> 8 c C 4
df %>% count(A,B) %>% spread(key=B, value=n)
#> A B C
#> 1 NA 1 1
#> 2 3 4 3
#> 3 1 3 4
df %>% count(A,B) %>% pivot_wider(names_from=B, values_from=n)
#> Error: Failed to create output due to bad names.
#> * Choose another strategy with `names_repair`
Created on 2020-10-22 by the reprex package (v0.3.0)

pivot_wider is equivalent to spread; however, it is also stricter. You need to be more explicit about the transformation you want.
library(dplyr)
library(tidyr)
df %>% count(A,B)
# A B n
#1 a A 3
#2 a B 1
#3 a C 1
#4 b A 2
#5 b B 4
#6 b C 2
#7 c A 2
#8 c B 4
#9 c C 1
Notice how the A column above is silently overwritten when you use spread.
df %>% count(A,B) %>% spread(key=B, value=n)
# A B C
#1 3 1 1
#2 2 4 2
#3 2 4 1
pivot_wider doesn't allow that. It wants you to state explicitly what you want to do. You already have an A column, and the B column you pass to names_from contains the value 'A', so the result would have a second A column. Tibbles don't allow duplicate column names, hence the error.
An option would be to rename the original A column to something else.
df %>%
  count(A,B) %>%
  rename(A1 = A) %>%
  pivot_wider(names_from=B, values_from=n)
# A1 A B C
# <chr> <int> <int> <int>
#1 a 3 1 1
#2 b 2 4 2
#3 c 2 4 1
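If renaming the id column is not desirable, another possibility is to prefix the new column names so that they cannot collide with the existing A column. The following is a minimal sketch using the names_prefix argument of pivot_wider; the small data frame and the "B_" prefix are made up for illustration and are not part of the original answer.

```r
library(dplyr)
library(tidyr)

# Hypothetical, deterministic data (the question used random samples)
df <- data.frame(
  A = c("a", "a", "b", "b", "c"),
  B = c("A", "B", "A", "A", "C")
)

wide <- df %>%
  count(A, B) %>%
  # Prefix the new columns so the value "A" in B cannot clash with column A
  pivot_wider(names_from = B, values_from = n, names_prefix = "B_")

names(wide)
```

Combinations that never occur (for example a with C) are left as NA, just as with spread.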

Related

How can I stack my dataset so each observation relates to all other observations but itself, within a group?

EDIT: I took one observation out from the data frame of the original post and changed some values so writing manually is easier. I am also adding the desired output, so my question is easier to read.
This is a continuation to a question I made in another post:
How can I stack my dataset so each observation relates to all other observations but itself?
In that post, I asked how to make each row relate to all other observations but itself. I am trying to apply the answers to my dataset, but my dataset is organized by country-year-party. In my actual dataset, I want each observation to relate to every other observation within the same country-year.
Say for example I have a data frame with 2 countries (id1) A and B:
df <- data.frame(id1 = c("A","A","A","B","B","B"),
id2 = c("a", "b", "c", "a", "b", "c" ),
x1 = c(1,2,3,1,2,3))
df
id1 id2 x1
1 A a 1
2 A b 2
3 A c 3
4 B a 1
5 B b 2
6 B c 3
Each row in column id2 identifies one person: a, b or c. I want each person to relate to every other person within the same country. So person a will be related to persons b and c, but only within the country. I tried the following code:
df <- df %>% group_by(id1) %>% merge( df, by = NULL) %>%
filter(id2.x != id2.y)
or even:
df <- df %>% group_by(id2) %>%
left_join(df, df, by = character()) %>%
filter(id2.x != id2.y)
But it leads to the following result:
id1.x id2.x x1.x id1.y id2.y x1.y
1 A b 2 A a 1
2 A c 3 A a 1
3 B b 2 A a 1
4 B c 3 A a 1
5 A a 1 A b 2
6 A c 3 A b 2
7 B a 1 A b 2
8 B c 3 A b 2
9 A a 1 A c 3
10 A b 2 A c 3
11 B a 1 A c 3
12 B b 2 A c 3
13 A b 2 B a 1
14 A c 3 B a 1
15 B b 2 B a 1
16 B c 3 B a 1
17 A a 1 B b 2
18 A c 3 B b 2
19 B a 1 B b 2
20 B c 3 B b 2
21 A a 1 B c 3
22 A b 2 B c 3
23 B a 1 B c 3
24 B b 2 B c 3
Notice that in observation 3, person b in country B is related to person a in country A. This is what I am trying to avoid. I want person a to relate to b and c, but only within each country. How can I do that?
The desired output would be something like this:
id1.x id2.x x1.x id1.y id2.y x1.y
1 A a 1 A b 2
2 A a 1 A c 3
3 A b 2 A a 1
4 A b 2 A c 3
5 A c 3 A a 1
6 A c 3 A b 2
7 B a 1 B b 2
8 B a 1 B c 3
9 B b 2 B a 1
10 B b 2 B c 3
11 B c 3 B a 1
12 B c 3 B b 2
So, within each of countries A and B, each person a, b, c relates to every other person but not to themselves. I have tried to answer some of the questions raised and to simplify my example; let me know whether it is clear now or whether you need more clarification.
df %>%
  group_by(id1) %>%
  mutate(vals = map(row_number(), ~cur_data_all()[-.x,])) %>%
  unnest(vals, names_sep = "_")
# A tibble: 12 × 6
# Groups: id1 [2]
id1 id2 x1 vals_id1 vals_id2 vals_x1
<chr> <chr> <dbl> <chr> <chr> <dbl>
1 A a 1 A b 2
2 A a 1 A c 3
3 A b 2 A a 1
4 A b 2 A c 3
5 A c 3 A a 1
6 A c 3 A b 2
7 B a 1 B b 2
8 B a 1 B c 3
9 B b 2 B a 1
10 B b 2 B c 3
11 B c 3 B a 1
12 B c 3 B b 2
Here is a base R option:
df <- data.frame(id1 = c("A","A","A","A","B","B","B","B"),
id2 = c("a", "b", "c", "d", "a", "b", "c", "d"),
x1 = c(1,2,3,4, 5,6,7,8))
#base option
by(df, df$id1, \(x){
rws <- t(combn(seq(nrow(x)), 2))
cbind(x[rws[,1],], x[rws[,2],2:3]) |>
`colnames<-`(c("id1", "id2.x","x1.x", "id2.y", "x2.y"))
}) |>
do.call(what = rbind.data.frame)|>
`row.names<-`(NULL)
#> id1 id2.x x1.x id2.y x2.y
#> 1 A a 1 b 2
#> 2 A a 1 c 3
#> 3 A a 1 d 4
#> 4 A b 2 c 3
#> 5 A b 2 d 4
#> 6 A c 3 d 4
#> 7 B a 5 b 6
#> 8 B a 5 c 7
#> 9 B a 5 d 8
#> 10 B b 6 c 7
#> 11 B b 6 d 8
#> 12 B c 7 d 8
EDIT: here is a tidyverse option
library(tidyverse)
full_join(df, df, by = "id1") |>
filter(id2.x != id2.y)
#> id1 id2.x x1.x id2.y x1.y
#> 1 A a 1 b 2
#> 2 A a 1 c 3
#> 3 A a 1 d 4
#> 4 A b 2 a 1
#> 5 A b 2 c 3
#> 6 A b 2 d 4
#> 7 A c 3 a 1
#> 8 A c 3 b 2
#> 9 A c 3 d 4
#> 10 A d 4 a 1
#> 11 A d 4 b 2
#> 12 A d 4 c 3
#> 13 B a 5 b 6
#> 14 B a 5 c 7
#> 15 B a 5 d 8
#> 16 B b 6 a 5
#> 17 B b 6 c 7
#> 18 B b 6 d 8
#> 19 B c 7 a 5
#> 20 B c 7 b 6
#> 21 B c 7 d 8
#> 22 B d 8 a 5
#> 23 B d 8 b 6
#> 24 B d 8 c 7
Building on @RitchieSacramento’s solution from your previous question, you can use expand_grid() inside group_modify().
library(dplyr)
library(tidyr)
df %>%
  group_by(id1) %>%
  group_modify(~ expand_grid(.x, .x, .name_repair = make.unique)) %>%
  ungroup() %>%
  filter(id2 != id2.1)
# A tibble: 12 × 5
id1 id2 x1 id2.1 x1.1
<chr> <chr> <dbl> <chr> <dbl>
1 A a 1 b 2
2 A a 1 c 3
3 A b 2 a 1
4 A b 2 c 3
5 A c 3 a 1
6 A c 3 b 2
7 B a 1 b 2
8 B a 1 c 3
9 B b 2 a 1
10 B b 2 c 3
11 B c 3 a 1
12 B c 3 b 2
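For comparison, the same within-country pairing can be sketched in base R alone. This is only an illustration of the idea shared by the join-based answers above (join the data to itself on the grouping column, then drop self-pairs); the object name pairs is made up here.

```r
# Example data from the question
df <- data.frame(id1 = c("A", "A", "A", "B", "B", "B"),
                 id2 = c("a", "b", "c", "a", "b", "c"),
                 x1  = c(1, 2, 3, 1, 2, 3))

# Self-join on the grouping column only, so pairs never cross countries;
# merge() adds the suffixes .x and .y to the non-key columns
pairs <- merge(df, df, by = "id1")

# Drop the rows where a person is paired with themselves
pairs <- pairs[pairs$id2.x != pairs$id2.y, ]
pairs
```

Because id1 is the join key, it appears only once in the result instead of as id1.x and id1.y.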

R apply a sum on dataset values according to all possible combination of character column

I have a dataset that looks like this
data.frame(A = c("a","b","c","d"),B= c(1,2,3,4))
OUTPUT
A B
a 1
b 2
c 3
d 4
I would like to get a new dataframe with the sum of the elements in column B for each possible combination of 2 elements in column A, for example
comb_A sum_B
a b 3
b c 5
c d 7
a d 5
et cetera...
I'm new to R; is there any way to do this? Thank you in advance
You may try in base R
df1 <- as.data.frame(t(combn(df$A, 2)))
data.frame(comb_A = paste(df1$V1, df1$V2), comb_B = df$B[match(df1$V1, df$A)] + df$B[match(df1$V2, df$A)])
comb_A comb_B
1 a b 3
2 a c 4
3 a d 5
4 b c 5
5 b d 6
6 c d 7
A possible solution in base R:
result <- data.frame(
expand.grid(comb_B = df$A, comb_A = df$A)[2:1],
sum = c(outer(df$B, df$B, \(x,y) x+y))
)
result <- result[result$comb_A != result$comb_B,]
result
#> comb_A comb_B sum
#> 2 a b 3
#> 3 a c 4
#> 4 a d 5
#> 5 b a 3
#> 7 b c 5
#> 8 b d 6
#> 9 c a 4
#> 10 c b 5
#> 12 c d 7
#> 13 d a 5
#> 14 d b 6
#> 15 d c 7
Here's one (albeit messy) way to do it.
library(tidyverse)
df <- data.frame(A = c("a","b","c","d"),B= c(1,2,3,4))
df %>%
expand(A, A) %>%
unite("comb_A", starts_with("A"), sep = " ") %>%
mutate(sum_B = map_dbl(
str_split(comb_A, " "),
~sum(df$B[match(.x, df$A)])
))
#> # A tibble: 16 × 2
#> comb_A sum_B
#> <chr> <dbl>
#> 1 a a 2
#> 2 a b 3
#> 3 a c 4
#> 4 a d 5
#> 5 b a 3
#> 6 b b 4
#> 7 b c 5
#> 8 b d 6
#> 9 c a 4
#> 10 c b 5
#> 11 c c 6
#> 12 c d 7
#> 13 d a 5
#> 14 d b 6
#> 15 d c 7
#> 16 d d 8
We can use combn like below
with(
df,
data.frame(
comb_A = combn(A, 2, list),
sum_B = combn(B, 2, sum)
)
)
which gives
comb_A sum_B
1 a, b 3
2 a, c 4
3 a, d 5
4 b, c 5
5 b, d 6
6 c, d 7
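If a plain character column is preferred over the list column above, combn can also paste each pair together directly. This is a small variation sketched under the assumption that a space-separated label such as "a b" is wanted, as in the question; the name res is made up.

```r
df <- data.frame(A = c("a", "b", "c", "d"), B = c(1, 2, 3, 4),
                 stringsAsFactors = FALSE)

res <- data.frame(
  # Extra arguments to combn() are passed on to FUN, here paste()
  comb_A = combn(df$A, 2, paste, collapse = " "),
  # combn() visits the pairs in the same order for both columns
  sum_B  = combn(df$B, 2, sum)
)
res
```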

Keeping all NAs in dplyr distinct function

I have a data.frame (the eBird basic dataset) in which many observers may upload a record of the same sighting to a database. In that case, the event is given a "group identifier"; when a record is not from a group session, an NA appears instead. I'm trying to filter out the duplicates from group events while keeping all NAs, and I'd like to do this without splitting the data frame in two:
library(dplyr)
set.seed(1)
df <- tibble(
x = sample(c(1:6, NA), 30, replace = T),
y = sample(c(letters[1:4]), 30, replace = T)
)
df %>% count(x,y)
gives:
> df %>% count(x,y)
# A tibble: 20 x 3
x y n
<int> <chr> <int>
1 1 a 1
2 1 b 2
3 2 a 1
4 2 b 1
5 2 c 1
6 2 d 3
7 3 a 1
8 3 b 1
9 3 c 4
10 4 d 1
11 5 a 1
12 5 b 2
13 5 c 1
14 5 d 1
15 6 a 1
16 6 c 2
17 NA a 1
18 NA b 2
19 NA c 2
20 NA d 1
I don't want rows with NA in x to be collapsed together, as happened here with the "NA b" and "NA c" combinations. The documentation for distinct says nothing about excluding NAs from the comparison. Is splitting the data frame the only solution?
With distinct an option is to create a new column based on the NA elements in 'x'
library(dplyr)
df %>%
mutate(x1 = row_number() * is.na(x)) %>%
distinct %>%
select(-x1)
Or we can use duplicated with an OR (|) condition to return all NA elements in 'x' with filter
df %>%
filter(is.na(x)|!duplicated(cur_data()))
# A tibble: 20 x 2
# x y
# <int> <chr>
# 1 1 b
# 2 4 b
# 3 NA a
# 4 1 d
# 5 2 c
# 6 5 a
# 7 NA d
# 8 3 c
# 9 6 b
#10 2 b
#11 3 b
#12 1 c
#13 5 d
#14 2 d
#15 6 d
#16 2 a
#17 NA c
#18 NA a
#19 1 a
#20 5 b
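The same condition (keep a row if x is NA, or if it is the first occurrence of its x/y combination) can also be written without dplyr. The sketch below uses a small made-up data frame so the result is easy to check by eye; the name kept is hypothetical.

```r
# Made-up data: rows 1-2 are true duplicates, rows 3-4 are NA "duplicates"
df <- data.frame(x = c(1, 1, NA, NA, 2),
                 y = c("a", "a", "b", "b", "c"))

# Keep every row with NA in x, plus the first copy of each other row
kept <- df[is.na(df$x) | !duplicated(df), ]
kept
```

Here the second "1 a" row is dropped, while both "NA b" rows survive.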

How would one use dplyr to recursively concatenate characters in a tibble until a character repeats

I'm trying to use dplyr to concatenate characters from prior tibble rows until a character repeats. Once a character repeats, we use the repeated character to start the same concatenation process again. Here is a reprex that shows the source data frame (df), my failed attempt to concatenate the characters (df1), and the desired result of the proposed concatenation process (df2).
In my attempt, it appears the concatenation only takes place once, when we create bf. Unfortunately, I'm not sure why this is the case. I'm still fairly new to dplyr, so I suspect I'm missing something very obvious. Also, if there is a better approach to solving this problem, I am happy to expand my horizons and knowledge.
library (tidyverse)
df <- tibble(id = c(1,2,3,4,5,6,7,8,9,10,11,12,13,14),
cde =c("b","f","c","e","b","f","c","e","d","f","b","c","e","d"))
df
#> # A tibble: 14 x 2
#> id cde
#> <dbl> <chr>
#> 1 1 b
#> 2 2 f
#> 3 3 c
#> 4 4 e
#> 5 5 b
#> 6 6 f
#> 7 7 c
#> 8 8 e
#> 9 9 d
#> 10 10 f
#> 11 11 b
#> 12 12 c
#> 13 13 e
#> 14 14 d
df1 <- df %>%
mutate(cum_cde = "") %>%
mutate(cum_cde = if_else(id ==1,cde,cum_cde)) %>%
mutate(cum_cde = if_else(id > 1 & str_count(lag(cum_cde),(cde)) < 1,str_c(lag(cum_cde),cde,sep="",collapse=NULL),cde))
df1
#> # A tibble: 14 x 3
#> id cde cum_cde
#> <dbl> <chr> <chr>
#> 1 1 b b
#> 2 2 f bf
#> 3 3 c c
#> 4 4 e e
#> 5 5 b b
#> 6 6 f f
#> 7 7 c c
#> 8 8 e e
#> 9 9 d d
#> 10 10 f f
#> 11 11 b b
#> 12 12 c c
#> 13 13 e e
#> 14 14 d d
df2 <- tibble(id = c(1,2,3,4,5,6,7,8,9,10,11,12,13,14),
cde =c("b","f","c","e","b","f","c","e","d","f","b","c","e","d"),
result = c("b","bf","bfc","bfce","b","bf","bfc","bfce","bfced","f","fb","fbc","fbce","fbced"))
df2
#> # A tibble: 14 x 3
#> id cde result
#> <dbl> <chr> <chr>
#> 1 1 b b
#> 2 2 f bf
#> 3 3 c bfc
#> 4 4 e bfce
#> 5 5 b b
#> 6 6 f bf
#> 7 7 c bfc
#> 8 8 e bfce
#> 9 9 d bfced
#> 10 10 f f
#> 11 11 b fb
#> 12 12 c fbc
#> 13 13 e fbce
#> 14 14 d fbced
Created on 2019-12-23 by the [reprex package](https://reprex.tidyverse.org) (v0.3.0)
An option with for loop would be
library(stringr)
v1 <- character(nrow(df))
j <- 1
for(i in seq_len(nrow(df))) {
  v1[i] <- paste(df$cde[unique(j:i)], collapse="")
  if(str_count(v1[i], df$cde[i]) > 1) {
    v1[i] <- df$cde[i]
    j <- i
  }
}
v1
#[1] "b" "bf" "bfc" "bfce"
#[5] "b" "bf" "bfc" "bfce" "bfced"
#[10]"f" "fb" "fbc" "fbce" "fbced"
Or using accumulate
library(purrr)
library(dplyr)
df %>%
  group_by(grp = cummax(str_count(accumulate(cde, str_c), cde))) %>%
  mutate(result = accumulate(cde, str_c)) %>%
  ungroup %>%
  select(-grp)
# A tibble: 14 x 3
# id cde result
# <dbl> <chr> <chr>
# 1 1 b b
# 2 2 f bf
# 3 3 c bfc
# 4 4 e bfce
# 5 5 b b
# 6 6 f bf
# 7 7 c bfc
# 8 8 e bfce
# 9 9 d bfced
#10 10 f f
#11 11 b fb
#12 12 c fbc
#13 13 e fbce
#14 14 d fbced
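The restart rule can also be expressed as a single running reduction: carry the string built so far, and start over from the current character whenever it already occurs in that string. The following base R sketch is an alternative to the answers above, not a restatement of them; it assumes each cde value is a plain literal character, which is why fixed = TRUE is used.

```r
cde <- c("b", "f", "c", "e", "b", "f", "c", "e", "d", "f", "b", "c", "e", "d")

result <- Reduce(
  function(acc, ch) {
    # Restart from the current character if it is already in the running string
    if (grepl(ch, acc, fixed = TRUE)) ch else paste0(acc, ch)
  },
  cde,
  accumulate = TRUE  # keep every intermediate string, one per input element
)
result
```

The result has one element per row, so it can be attached to the tibble with df$result <- result or inside mutate().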

How to substitute NA by 0 in 20 columns?

I want to replace NA with 0 in 20 columns. I found this approach for 2 columns, but I guess it's not optimal when there are 20 columns. Is there a more compact alternative?
mydata[,c("a", "c")] <-
apply(mydata[,c("a","c")], 2, function(x){replace(x, is.na(x), 0)})
UPDATE:
For simplicity, let's take this data with 8 columns and substitute NAs in columns b, c, e, f and d
a b c d e f g d
1 NA NA 2 3 4 7 6
2 g 3 NA 4 5 4 Y
3 r 4 4 NA t 5 5
The result must be this one:
a b c d e f g d
1 0 0 2 3 4 7 6
2 g 3 NA 4 5 4 Y
3 r 4 4 0 t 5 5
The replace_na function from tidyr can be applied over a vector as well as a dataframe (http://tidyr.tidyverse.org/reference/replace_na.html).
Use it with a mutate_at variation from dplyr to apply it to multiple columns at the same time:
my_data %>% mutate_at(vars(b,c,e,f), replace_na, 0)
or
my_data %>% mutate_at(c('b','c','e','f'), replace_na, 0)
Here is a tidyverse way to replace NA with different values based on the data type of the column.
library(tidyverse)
dataset %>% mutate_if(is.numeric, replace_na, 0) %>%
mutate_if(is.character, replace_na, "")
Another option:
library(tidyr)
v <- c('b', 'c', 'e', 'f')
replace_na(df, as.list(setNames(rep(0, length(v)), v)))
Which gives:
# a b c d e f g d.1
#1 1 0 0 2 3 4 7 6
#2 2 g 3 NA 4 5 4 Y
#3 3 r 4 4 0 t 5 5
We can use NAer from qdap to convert the NA to 0. If there are multiple columns, loop over them using lapply.
library(qdap)
nm1 <- c('b', 'c', 'e', 'f')
mydata[nm1] <- lapply(mydata[nm1], NAer)
mydata
# a b c d e f g d.1
#1 1 0 0 2 3 4 7 6
#2 2 g 3 NA 4 5 4 Y
#3 3 r 4 4 0 t 5 5
Or using dplyr
library(dplyr)
mydata %>%
mutate_each_(funs(replace(., which(is.na(.)), 0)), nm1)
# a b c d e f g d.1
#1 1 0 0 2 3 4 7 6
#2 2 g 3 NA 4 5 4 Y
#3 3 r 4 4 0 t 5 5
Another strategy using tidyr::replace_na()
library(tidyverse)
df <- read.table(header = T, text = 'a b c d e f g h
1 NA NA 2 3 4 7 6
2 g 3 NA 4 5 4 Y
3 r 4 4 NA t 5 5')
df %>%
mutate(across(everything(), ~replace_na(., 0)))
#> a b c d e f g h
#> 1 1 0 0 2 3 4 7 6
#> 2 2 g 3 0 4 5 4 Y
#> 3 3 r 4 4 0 t 5 5
Created on 2021-08-22 by the reprex package (v2.0.0)
Knowing that replace_na() accepts a named list for the replace argument, using purrr::map() is a good option here to reduce the amount of code. It is also possible to replace different values in each column using 'map2()'.
code:
library(data.table)
library(tidyverse)
tbl <-read_table("a b c d e f g d
1 NA NA 2 3 4 7 6
2 g 3 NA 4 5 4 Y
3 r 4 4 NA t 5 5")
#> Warning: Duplicated column names deduplicated: 'd' => 'd_1' [8]
nms <- c('b', 'c', 'e', 'f', 'g')
imap_dfc(tbl, ~ if(any(.y == nms)) replace_na(.x, 0) else .x)
#> # A tibble: 3 × 8
#> a b c d e f g d_1
#> <dbl> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <chr>
#> 1 1 0 0 2 3 4 7 6
#> 2 2 g 3 NA 4 5 4 Y
#> 3 3 r 4 4 0 t 5 5
#using data.table
tblDT <- as.data.table(tbl)
#Further explanation here: https://stackoverflow.com/questions/16846380
tblDT[, (nms) := map(.SD, ~replace_na(., 0)), .SDcols = nms]
tblDT %>%
as_tibble()
#> # A tibble: 3 × 8
#> a b c d e f g d_1
#> <dbl> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <chr>
#> 1 1 0 0 2 3 4 7 6
#> 2 2 g 3 NA 4 5 4 Y
#> 3 3 r 4 4 0 t 5 5
#to replace na's in every column:
tbl %>%
replace_na(map(., ~0))
#> # A tibble: 3 × 8
#> a b c d e f g d_1
#> <dbl> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <chr>
#> 1 1 0 0 2 3 4 7 6
#> 2 2 g 3 0 4 5 4 Y
#> 3 3 r 4 4 0 t 5 5
Created on 2021-09-25 by the reprex package (v2.0.1)
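For completeness, the same selective replacement can be done in base R with no extra packages, by indexing the chosen columns with a logical matrix. This is a minimal sketch on made-up numeric data; the column names in cols stand in for the 20 columns of the question.

```r
# Made-up data; only columns b, c and e should have their NAs replaced
mydata <- data.frame(a = 1:3,
                     b = c(NA, 2, 3),
                     c = c(NA, 3, 4),
                     d = c(2, NA, 4),
                     e = c(3, 4, NA))

cols <- c("b", "c", "e")

# is.na() on a data frame returns a logical matrix, which can be used
# to assign 0 to every NA cell of the selected columns at once
mydata[cols][is.na(mydata[cols])] <- 0
mydata
```

Column d is not listed in cols, so its NA is left untouched.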
