I have data in a data frame where one column is a list. This is an example:
rand_lets <- function(){
sample(letters, sample(5:12, 1))  # draw between 5 and 12 distinct letters
}
example_data <- data.frame(ID = seq(1:5),
location = LETTERS[1:5],
observations = I(list(rand_lets(),
rand_lets(),
rand_lets(),
rand_lets(),
rand_lets())))
I am looking for an elegant tidyverse approach to unlist the list column so that each element in the list is separated into a new column. For example the first row would look like this:
ID location observations observations.1 observations.2 observations.3 observations.4 observations.5 observations.6 observations.7 observations.8 observations.9
1 A "y" "b" "m" "u" "x" "j" "t" "i" "v" "w"
Of course the list entries may have different lengths, so empty cells should be NA.
How could this be done?
If you want to keep your data in "long" format, you can do:
library(tidyverse)
example_data %>% unnest(observations)
ID location observations
1 1 A e
2 1 A x
3 1 A w
...
44 5 E u
45 5 E o
46 5 E z
To spread the data to "wide" format, as in your example, you can do:
library(stringr)
example_data %>% unnest(observations) %>%
group_by(location) %>%
mutate(counter=paste0("Obs_", str_pad(1:n(),2,"left","0"))) %>%
spread(counter, observations)
ID location Obs_01 Obs_02 Obs_03 Obs_04 Obs_05 Obs_06 Obs_07 Obs_08 Obs_09 Obs_10 Obs_11
* <int> <fctr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 A e x w c s j k t z <NA> <NA>
2 2 B k u d h z x <NA> <NA> <NA> <NA> <NA>
3 3 C v z m o s f n c r u b
4 4 D z i m s a v n r e t x
5 5 E f b g h a d u o z <NA> <NA>
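On recent tidyr releases the same wide layout can probably be reached in one step with unnest_wider(). The sketch below is not part of the original answer. It uses a small hand-made tibble (example_small, with made-up values) so the result is reproducible, and it assumes a tidyr version in which names_sep also numbers the elements of unnamed vectors.

```r
library(tidyr)
library(tibble)

# Hypothetical stand-in for example_data with fixed, hand-picked letters
example_small <- tibble(
  ID = 1:3,
  location = LETTERS[1:3],
  observations = list(c("a", "b", "c"), "d", c("e", "f"))
)

# Each list element becomes its own column; shorter vectors are padded with NA
wide <- unnest_wider(example_small, observations, names_sep = "_")
wide
```

Unlike the unnest/spread route, this keeps one row per original row throughout, so no grouping counter is needed.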
Related
For some reason, I have data in which a few columns are themselves data frames, each consisting of one column. I want to "collapse" these data-frame columns into one data frame.
library(tidyverse)
df <- tibble(col1=1:5,
col2=tibble(newcol=LETTERS[1:5]),
col3=tibble(newcol2=LETTERS[6:10]))
df
# A tibble: 5 x 3
col1 col2$newcol col3$newcol2
<int> <chr> <chr>
1 1 A F
2 2 B G
3 3 C H
4 4 D I
5 5 E J
I have tried unnest(), but the function replicates the data frame/tibble in col2 and col3 for each row of col1, which is not what I want.
df2 <- df %>% unnest(cols = c(col2, col3))
df2
# A tibble: 25 x 3
col1 col2 col3
<int> <chr> <chr>
1 1 A F
2 1 B G
3 1 C H
4 1 D I
5 1 E J
6 2 A F
7 2 B G
8 2 C H
9 2 D I
10 2 E J
# ... with 15 more rows
The result that I want is as below:
df3 <- tibble(col1=1:5,
newcol=LETTERS[1:5],
newcol2=LETTERS[6:10])
df3
# A tibble: 5 x 3
col1 newcol newcol2
<int> <chr> <chr>
1 1 A F
2 2 B G
3 3 C H
4 4 D I
5 5 E J
Any idea how to do this? Any help is much appreciated.
It looks like you only want to change the column names, or am I missing something here?
df<-df%>%mutate(col2=df$col2$newcol, col3=df$col3$newcol2)
After your comment, here you can find a more general version (might not be suitable for all use cases)
df1<-df%>%unnest(cols = c(1:3))%>%
group_by(col1)%>%
mutate(row=row_number())%>%
filter(row==col1)%>%
select(-row)
If I understand correctly, you have three data frames, each containing one column, and you want to bring them together in one data frame. Then cbind is an option.
# col2 and col3 live inside df, so pull them out with $ before binding
df3 <- cbind(df["col1"], df$col2, df$col3)
Output:
col1 newcol newcol2
1 1 A F
2 2 B G
3 3 C H
4 4 D I
5 5 E J
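Another route that may fit here is tidyr::unpack(), which is meant for columns that are themselves data frames. This is a minimal sketch rather than anything proposed in the answers above; it rebuilds df from the question and assumes tidyr 1.0 or later.

```r
library(tidyr)
library(tibble)

# Same structure as in the question: col2 and col3 are one-column tibbles
df <- tibble(
  col1 = 1:5,
  col2 = tibble(newcol = LETTERS[1:5]),
  col3 = tibble(newcol2 = LETTERS[6:10])
)

# unpack() lifts the inner columns of each data-frame column to the top level
df3 <- unpack(df, cols = c(col2, col3))
df3
```

Because nothing is unnested row by row, the number of rows stays at five.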
Let's say I've got some data:
data <- tibble(A = c("a", "b", "c", "d"),
B = c("e", "f", "g", NA_character_),
C = c("h", "i", NA_character_, NA_character_))
Which looks like this:
# A tibble: 4 x 3
A B C
<chr> <chr> <chr>
1 a e h
2 b f i
3 c g NA
4 d NA NA
What I'd like to do is get the value that's furthest to the right into a new column:
# A tibble: 4 x 4
A B C D
<chr> <chr> <chr> <chr>
1 a e h h
2 b f i i
3 c g NA g
4 d NA NA d
I know I could do it with case_when and a bunch of logical !is.na(A) ~ A, statements, but say I've got a load of columns and that's not feasible. I feel like there probably is an easy way that I just don't know about and haven't been able to find. Thanks
coalesce would be easier:
library(dplyr)
data %>%
mutate(D = coalesce(C, B, A))
-output
# A tibble: 4 x 4
# A B C D
# <chr> <chr> <chr> <chr>
#1 a e h h
#2 b f i i
#3 c g <NA> g
#4 d <NA> <NA> d
Or, if there are many columns, rev the column names, convert them to symbols and evaluate (!!!):
data %>%
mutate(D = coalesce(!!! rlang::syms(rev(names(.)))))
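If only some of the columns should take part, the same idea can be written without tidy evaluation by passing the reversed columns to coalesce() through do.call(). This is a sketch under the assumption that the candidate columns are listed in a character vector (cols is a name introduced here for illustration).

```r
library(dplyr)

data <- tibble(A = c("a", "b", "c", "d"),
               B = c("e", "f", "g", NA_character_),
               C = c("h", "i", NA_character_, NA_character_))

# Hypothetical: the columns to search, written left to right
cols <- c("A", "B", "C")

# Reverse the columns so coalesce() checks the rightmost one first
data$D <- do.call(coalesce, unname(rev(as.list(data[cols]))))
data
```

coalesce() returns the first non-missing value per row, so reversing the column order is what turns it into "rightmost non-missing".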
I have a dataframe like this:
library(tidyverse)
a <- tibble(x=c("mother","father","brother","brother"),y=c("a","b","c","d"))
b <- tibble(x=c("mother","father","brother","brother"),z=c("e","f","g","h"))
I want to join these dataframes so that each "brother" occurs only once.
I have tried full_join:
ab <- full_join(a,b,by="x")
and obtained this:
# A tibble: 6 x 3
x y z
<chr> <chr> <chr>
1 mother a e
2 father b f
3 brother c g
4 brother c h
5 brother d g
6 brother d h
What I need is this:
ab <- tibble(x=c("mother","father","brother1","brother2"),y=c("a","b","c","d"),z=c("e","f","g","h"))
# A tibble: 4 x 3
x y z
<chr> <chr> <chr>
1 mother a e
2 father b f
3 brother1 c g
4 brother2 d h
Using dplyr you could do something like the following, which adds an extra variable person to identify each person within each group in x, and then joins by x and person:
library(dplyr)
a %>%
group_by(x) %>%
mutate(person = 1:n()) %>%
full_join(b %>%
group_by(x) %>%
mutate(person = 1:n()),
by = c("x", "person")
) %>%
select(x, person, y, z)
Which returns:
# A tibble: 4 x 4
# Groups: x [3]
x person y z
<chr> <int> <chr> <chr>
1 mother 1 a e
2 father 1 b f
3 brother 1 c g
4 brother 2 d h
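To get from this result to the labels asked for in the question (brother1, brother2), the person index can be pasted onto x wherever a value occurs more than once. The following is a sketch built on the join above; add_person is a hypothetical helper name, and matching the first brother in a with the first brother in b is an assumption about row order.

```r
library(dplyr)

a <- tibble(x = c("mother", "father", "brother", "brother"), y = c("a", "b", "c", "d"))
b <- tibble(x = c("mother", "father", "brother", "brother"), z = c("e", "f", "g", "h"))

# Hypothetical helper: number the rows within each value of x
add_person <- function(df) {
  df %>% group_by(x) %>% mutate(person = row_number()) %>% ungroup()
}

ab <- full_join(add_person(a), add_person(b), by = c("x", "person")) %>%
  add_count(x) %>%                                    # n = how often each x occurs
  mutate(x = if_else(n > 1, paste0(x, person), x)) %>% # suffix only repeated values
  select(x, y, z)
ab
```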
Unfortunately, the first and second brother are indistinguishable from each other! How would R know that you want to join them that way, and not the reverse?
I would try to "remove duplicates" in the original data.frames by adding the "1" and "2" identifiers there.
I don't know tidyverse syntax, but if you never get more than two repetitions, you may want to try
a <- c("A", "B", "C", "C")
a[duplicated(a)] <- paste0(a[duplicated(a)], 2)
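If a value can repeat more than twice, a base R sketch along the following lines may be more robust: ave() applies a function within each group of equal values, so a running number can be appended to every repeated entry. The input vector is made up for illustration.

```r
a <- c("A", "B", "C", "C", "C")

# Within each group of equal values, append a running number
# only when the value occurs more than once
a_unique <- ave(a, a, FUN = function(v) {
  if (length(v) > 1) paste0(v, seq_along(v)) else v
})
a_unique
```

Note that this numbers every member of a repeated group (C1, C2, C3), whereas the duplicated() approach leaves the first occurrence unchanged.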
Is there a way to create a key without using rowwise()?
Any pointer is much appreciated.
df <- tibble(grp1=rev(LETTERS[1:5]),grp2=letters[11:15],grp3=LETTERS[1:5],
value=rnorm(5,10,10))
df %>% rowwise %>% mutate(key=paste(sort(c(grp1, grp3)), collapse="")) %>% ungroup()
grp1 grp2 grp3 value key
<chr> <chr> <chr> <chr> <chr>
1 E k A -3.73984194875213 AE
2 D l B 3.25846392371014 BD
3 C m C 3.62405652088127 CC
4 B n D 6.41520621902784 BD
5 A o E 20.1892413026407 AE
Update: the tibble contains multiple character vectors, but the key should be generated from column grp1 and grp3.
Using purrr::pmap_chr:
library(tidyverse)
df %>% mutate(key=pmap_chr(.[c("grp1","grp3")],~paste(sort(c(...)), collapse="")))
# # A tibble: 5 x 5
# grp1 grp2 grp3 value key
# <chr> <chr> <chr> <chr> <chr>
# 1 E k A 22.0150932758833 AE
# 2 D l B 2.24725610156698 BD
# 3 C m C -6.2414882455089 CC
# 4 B n D 22.5699168856552 BD
# 5 A o E -6.21443670571301 AE
In base R you could do:
transform(df, key=mapply(function(...) paste(sort(c(...)), collapse=""), grp1, grp3))
Here is a vectorized option using pmin/pmax. Take the min/max for each row of columns 'grp1' and 'grp3' with pmin/pmax and concatenate them together (str_c).
library(dplyr)
library(stringr)
df %>%
mutate(key = str_c(pmin(grp1, grp3), pmax(grp1, grp3)))
# A tibble: 5 x 5
# grp1 grp2 grp3 value key
# <chr> <chr> <chr> <dbl> <chr>
#1 E k A 24.7 AE
#2 D l B 5.66 BD
#3 C m C 16.3 CC
#4 B n D 5.88 BD
#5 A o E -9.22 AE
data
df <- tibble(grp1=rev(LETTERS[1:5]),grp2=letters[11:15],grp3=LETTERS[1:5],
value=rnorm(5,10,10))
NOTE: cbind converts its inputs to a matrix, and a matrix can hold only a single class. Converting the result to a tibble with as_tibble does not restore the original classes. Build the data with tibble/data.frame directly instead of going through cbind.
Another way is to use mutate, without rowwise, but with a vectorised version of your function, like this:
library(dplyr)
# create a function and vectorise it
f = function(x, y) paste(sort(c(x, y)), collapse="")
f = Vectorize(f)
# use the function
df %>% mutate(key = f(grp1, grp3))
# # A tibble: 5 x 5
# grp1 grp2 grp3 value key
# <chr> <chr> <chr> <chr> <chr>
# 1 E k A -4.41213449814982 AE
# 2 D l B 10.4314736952111 BD
# 3 C m C 5.69345098226371 CC
# 4 B n D 4.39266020802413 BD
# 5 A o E 22.0623810028979 AE
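For the two-column case, the pmin/pmax idea also works in base R without any packages, and the resulting key can then be used for grouping. A small sketch, with made-up numbers in value so the totals are easy to check by hand:

```r
df <- data.frame(
  grp1 = rev(LETTERS[1:5]),
  grp3 = LETTERS[1:5],
  value = c(1, 2, 3, 4, 5)  # made-up values for illustration
)

# Order-insensitive key: smaller letter first, larger letter second
df$key <- paste0(pmin(df$grp1, df$grp3), pmax(df$grp1, df$grp3))

# Rows that share the same unordered pair now share a key and can be aggregated
totals <- aggregate(value ~ key, data = df, FUN = sum)
totals
```

This only covers keys built from exactly two columns; with three or more, a row-wise sort such as the pmap_chr or Vectorize versions is still needed.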
I am new to R. I struggle to find a suitable solution for the following problem:
My dataframe looks approximately like this:
ID Att
1 a
1 b
1 c
2 d
3 e
3 f
4 g
I would like to convert it into a new df of the following form:
ID Att_1 Att_2 ... Att_n
1 a b c
2 d N/A N/A
3 e f N/A
4 g N/A N/A
Where the number of columns depends on the maximum number of 'Att' values per 'ID' (here three). The number of columns in the new dataframe (i.e. 'n') should be determined automatically from these counts:
max_ID_count <- table(df$ID)
n <- max(max_ID_count)
Thanks a lot!
We can create a sequence column and then spread
library(tidyverse)
df1 %>%
group_by(ID) %>%
mutate(rn = paste0("Att_", row_number())) %>%
spread(rn, Att)
# A tibble: 4 x 4
# Groups: ID [4]
# ID Att_1 Att_2 Att_3
# <int> <chr> <chr> <chr>
#1 1 a b c
#2 2 d <NA> <NA>
#3 3 e f <NA>
#4 4 g <NA> <NA>
Or with dcast from data.table
library(data.table)
dcast(setDT(df1), ID ~ paste0("Att_", rowid(ID)), value.var = "Att")
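Since spread() has been superseded in tidyr, the first approach can also be sketched with pivot_wider(). The code below recreates the question's data as df1 (the column types are an assumption) and checks that the number of Att_ columns equals the n computed in the question.

```r
library(dplyr)
library(tidyr)

# Data from the question, rebuilt by hand
df1 <- tibble(ID = c(1, 1, 1, 2, 3, 3, 4),
              Att = c("a", "b", "c", "d", "e", "f", "g"))

wide <- df1 %>%
  group_by(ID) %>%
  mutate(rn = row_number()) %>%   # position of each Att within its ID
  ungroup() %>%
  pivot_wider(names_from = rn, values_from = Att, names_prefix = "Att_")

# The number of Att_ columns follows from the data, as asked
n <- max(table(df1$ID))
ncol(wide) - 1 == n
```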