R-Tidying multiple columns containing data in lists - r

I have a dataset arranged such that the data is stored as a list of multiple observations within each 'cell'. See below:
partID | Var 1 | Var 2
1      | 1,2,3 | 4,5,6
2      | 7,8,9 | 1,2,3
I would like to get the data in a format more like this:
partID | Var 1 | Var 2
1      | 1     | 4
1      | 2     | 5
1      | 3     | 6
I've been trying various combinations of melt, unlist, and data.table but I haven't had much luck applying the various ways to expand the lists while simultaneously preserving multiple columns and their names. Am I reduced to looping through the dataset and binding the columns together?

If, for each row, the cells have the same number of entries and they are stored as strings, then this is what you can do, using data.table.
require(data.table)
DT <- data.table(partID = c(1, 2), Var1 = c("1,2,3", "7,8,9"), Var2 = c("4,5,6", "1,2,3"))
DT2 <- DT[, list(Var1 = unlist(strsplit(Var1, ",")),
                 Var2 = unlist(strsplit(Var2, ","))), by = partID]
You use strsplit() to split the strings by the commas. You use unlist() to make the entries into a vector, not a list.
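To see why both calls are needed, here is a minimal illustration (strsplit() always returns a list, even for a single string):
strsplit("1,2,3", ",")             # list containing one character vector: "1" "2" "3"
unlist(strsplit("1,2,3", ","))     # plain character vector: "1" "2" "3"
Note that the split values are character strings; wrap them in as.numeric() if you need numbers.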
If, on the other hand, each cell is already a list, then all you need to do is unlist().
require(data.table)
DT3 <- data.table(partID = c(1, 2), Var1 = list(c(1, 2, 3), c(7, 8, 9)), Var2 = list(c(4, 5, 6), c(1, 2, 3)))
DT4 <- DT3[, list(Var1 = unlist(Var1), Var2 = unlist(Var2)), by = partID]
Either way, you get this:
partID Var1 Var2
1 1 4
1 2 5
1 3 6
2 7 1
2 8 2
2 9 3
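The "same number of entries per row" assumption is easy to verify up front; a quick sketch using the DT defined above:
# Sanity check: each row's Var1 and Var2 cells must split into equally many pieces
stopifnot(lengths(strsplit(DT$Var1, ",")) == lengths(strsplit(DT$Var2, ",")))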

We can do this easily with cSplit
library(splitstackshape)
cSplit(DT, c("Var1", "Var2"), ",", "long")
# partID Var1 Var2
#1: 1 1 4
#2: 1 2 5
#3: 1 3 6
#4: 2 7 1
#5: 2 8 2
#6: 2 9 3
data
DT<-data.frame(partID=c(1,2),Var1=c("1,2,3","7,8,9"),Var2=c("4,5,6","1,2,3"))

The separate_rows() function in tidyr is the boss for observations with multiple delimited values: it splits them across rows in a single call.
# create data
library(tidyverse)
d <- tibble(
  partID = c(1, 2),
  Var1 = c("1,2,3", "7,8,9"),
  Var2 = c("4,5,6", "1,2,3")
)
d
# # A tibble: 2 x 3
# partID Var1 Var2
# <dbl> <chr> <chr>
# 1 1 1,2,3 4,5,6
# 2 2 7,8,9 1,2,3
# tidy data
separate_rows(d, Var1, Var2, convert = TRUE)
# # A tibble: 6 x 3
# partID Var1 Var2
# <dbl> <int> <int>
# 1 1 1 4
# 2 1 2 5
# 3 1 3 6
# 4 2 7 1
# 5 2 8 2
# 6 2 9 3
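The convert = TRUE argument is doing real work here: without it, the split columns stay character. A quick sketch:
separate_rows(d, Var1, Var2)   # same six rows, but Var1 and Var2 remain <chr>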

You can also use dplyr and tidyr which provides the unnest function to expand the columns:
library(dplyr); library(tidyr);
df %>%
  mutate(Var.1 = strsplit(Var.1, ","),
         Var.2 = strsplit(Var.2, ",")) %>%
  unnest()
Source: local data frame [6 x 3]
partID Var.1 Var.2
(dbl) (chr) (chr)
1 1 1 4
2 1 2 5
3 1 3 6
4 2 7 1
5 2 8 2
6 2 9 3
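Note that in tidyr 1.0 and later, unnest() expects the list-columns to be named explicitly; a sketch of the same pipeline under that API:
df %>%
  mutate(Var.1 = strsplit(Var.1, ","),
         Var.2 = strsplit(Var.2, ",")) %>%
  unnest(c(Var.1, Var.2))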

Related

How to create new column of repeating sequence based on other column

I have the following dataframe:
Participant_ID Order
1 A
1 A
2 B
2 B
3 A
3 A
4 B
4 B
5 B
5 B
6 A
6 A
Every two rows refer to the same participant. I want to create a new column based on the value in the 'Order' column: if 'Order' == A, the new column should contain [1, 2] for that participant's two rows, and if 'Order' == B, it should contain [2, 1].
The preferred output would be the following:
Participant_ID Order Period
1 A 1
1 A 2
2 B 2
2 B 1
3 A 1
3 A 2
4 B 2
4 B 1
5 B 2
5 B 1
6 A 1
6 A 2
Any help would be appreciated.
Here are a couple of possibilities. This assumes that the Order value is the same for a given Participant_ID. If this isn't the case, you will need to include additional logic.
You can use if_else:
library(tidyverse)
df %>%
  group_by(Participant_ID) %>%
  mutate(Period = if_else(Order == "A", 1:2, 2:1))
Or to explicitly check for multiple different values (e.g., "A", "B", etc.), have more flexibility, and include NA for other cases, you can use case_when:
df %>%
  group_by(Participant_ID) %>%
  mutate(Period = case_when(
    Order == "A" ~ 1:2,
    Order == "B" ~ 2:1,
    TRUE ~ NA_integer_
  ))
Output
Participant_ID Order Period
<int> <chr> <int>
1 1 A 1
2 1 A 2
3 2 B 2
4 2 B 1
5 3 A 1
6 3 A 2
7 4 B 2
8 4 B 1
9 5 B 2
10 5 B 1
11 6 A 1
12 6 A 2
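If you prefer base R, a sketch of the same idea, assuming (as stated in the question) exactly two rows per participant and the plain data frame df: compute each row's position within its pair, then flip it for "B".
pos <- ave(seq_len(nrow(df)), df$Participant_ID, FUN = seq_along)  # 1, 2 within each pair
df$Period <- ifelse(df$Order == "A", pos, 3 - pos)                 # A gets 1,2; B gets 2,1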

Merging columns while ignoring NAs

I would like to merge multiple columns. Here is what my sample dataset looks like.
df <- data.frame(
  id = c(1, 2, 3, 4, 5),
  cat.1 = c(3, 4, NA, 4, 2),
  cat.2 = c(3, NA, 1, 4, NA),
  cat.3 = c(3, 4, 1, 4, 2))
> df
id cat.1 cat.2 cat.3
1 1 3 3 3
2 2 4 NA 4
3 3 NA 1 1
4 4 4 4 4
5 5 2 NA 2
I am trying to merge columns cat.1, cat.2, and cat.3. It is a little complicated for me since there are NAs.
I need to end up with a single cat variable, and even though some columns contain NAs, I need to ignore them. The desired output is below:
> df
id cat
1 1 3
2 2 4
3 3 1
4 4 4
5 5 2
Any thoughts?
Another variation of Gregor's answer using dplyr::transmute:
library(dplyr)
df %>%
  transmute(id = id, cat = coalesce(cat.1, cat.2, cat.3))
#> id cat
#> 1 1 3
#> 2 2 4
#> 3 3 1
#> 4 4 4
#> 5 5 2
With dplyr:
library(dplyr)
df %>%
  mutate(cat = coalesce(cat.1, cat.2, cat.3)) %>%
  select(-cat.1, -cat.2, -cat.3)
An option with fcoalesce from data.table, applied to all the cat columns at once via .SDcols:
library(data.table)
setDT(df)[, .(id, cat = do.call(fcoalesce, .SD)), .SDcols = patterns('^cat')]
-output
# id cat
#1: 1 3
#2: 2 4
#3: 3 1
#4: 4 4
#5: 5 2
Does this work:
> library(dplyr)
> df %>% rowwise() %>% mutate(cat = mean(c(cat.1, cat.2, cat.3), na.rm = T)) %>% select(-(2:4))
# A tibble: 5 x 2
# Rowwise:
id cat
<dbl> <dbl>
1 1 3
2 2 4
3 3 1
4 4 4
5 5 2
Since the non-NA values within each row are identical, the row mean returns that same value; max or min would work just as well.
Here is a base R solution which uses apply:
# Per row: drop NAs, drop the leading id value, keep the single remaining value (assumes id is first and never NA)
df$cat <- apply(df, 1, function(x) unique(x[!is.na(x)][-1]))

R count() using dynamically generated list of variables/columns

If I have a tibble called observations with the following variables/columns:
category_1_red_length
category_1_red_width
category_1_red_depth
category_1_blue_length
category_1_blue_width
category_1_blue_depth
category_1_green_length
category_1_green_width
category_1_green_depth
category_2_red_length
category_2_red_width
category_2_red_depth
category_2_blue_length
category_2_blue_width
category_2_blue_depth
category_2_green_length
category_2_green_width
category_2_green_depth
Plus a load more. Is there a way to dynamically generate the following count()?
count(observations,
category_1_red_length,
category_1_red_width,
category_1_red_depth,
category_1_blue_length,
category_1_blue_width,
category_1_blue_depth,
category_1_green_length,
category_1_green_width,
category_1_green_depth,
category_2_red_length,
category_2_red_width,
category_2_red_depth,
category_2_blue_length,
category_2_blue_width,
category_2_blue_depth,
category_2_green_length,
category_2_green_width,
category_2_green_depth,
sort=TRUE)
I can create the list of columns I want to count with:
columns_to_count = list()
column_prefix = 'category'
aspects = c('red', 'blue', 'green')
dimensions = c('length', 'width', 'depth')
for (x in 1:2) {
for (aspect in aspects) {
for (dimension in dimensions) {
columns_to_count = append(columns_to_count, paste(column_prefix, x, aspect, dimension, sep='_'))
}
}
}
But then how do I pass my list of columns in columns_to_count to the count() function?
In my actual data set there are about 170 columns like this that I want to count so creating the list of columns without loops doesn't seem sensible.
I'm struggling to think of the name for what I'm trying to do, so I've been unable to find useful search results.
Thanks.
You can use non-standard evaluation using syms and !!!. For example, using mtcars dataset
library(dplyr)
library(rlang)
cols <- c('am', 'cyl')
mtcars %>% count(!!!syms(cols), sort = TRUE)
# am cyl n
#1 0 8 12
#2 1 4 8
#3 0 6 4
#4 0 4 3
#5 1 6 3
#6 1 8 2
This is same as doing
mtcars %>% count(am, cyl, sort = TRUE)
# am cyl n
#1 0 8 12
#2 1 4 8
#3 0 6 4
#4 0 4 3
#5 1 6 3
#6 1 8 2
You don't need to write out the names in cols one by one by hand. You can use a regex if the column names share a pattern, or use positions to pick out the appropriate column names.
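For example, building cols by pattern instead of by hand; a sketch against the question's observations tibble:
cols <- grep("^category_", names(observations), value = TRUE)  # all matching column names
observations %>% count(!!!syms(cols), sort = TRUE)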
You can use .dots to receive strings as variables:
count(observations, .dots=columns_to_count, sort=TRUE)
r$> d
V1 V2
1 1 4
2 2 5
3 3 6
r$> count(d, .dots=list('V1', 'V2'))
# A tibble: 3 x 3
V1 V2 n
<int> <int> <int>
1 1 4 1
2 2 5 1
3 3 6 1
r$> count(d, V1, V2)
# A tibble: 3 x 3
V1 V2 n
<int> <int> <int>
1 1 4 1
2 2 5 1
3 3 6 1
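In recent dplyr (1.0+), tidyselect helpers inside across() are the cleaner route; a sketch, again assuming the category_ prefix from the question:
library(dplyr)
observations %>% count(across(starts_with("category_")), sort = TRUE)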

How should a function be applied by row on a dataframe to generate a new or expanded dataframe in R

I am trying to expand an existing dataset, which currently looks like this:
df <- tibble(
  site = letters[1:3],
  years = rep(4, 3),
  tr = c(3, 6, 4)
)
tr is the total number of replicates for each site/year combination. I simply want to add in the replicates and later the response variable for each replicate. This was easy for a single site/year combination using the following function:
f <- function(site = NULL, years = NULL, t = NULL){
  df <- tibble(
    site = rep(site, each = t, times = years),
    tr = rep(1:t, times = years),
    year = rep(1:years, each = t)
  )
  df
}
# For one site:
f(site='a', years=4, t=3)
# Producing this:
# # A tibble: 12 x 3
# site tr year
# <chr> <int> <int>
# 1 a 1 1
# 2 a 2 1
# 3 a 3 1
# 4 a 1 2
# 5 a 2 2
# 6 a 3 2
# 7 a 1 3
# 8 a 2 3
# 9 a 3 3
# 10 a 1 4
# 11 a 2 4
# 12 a 3 4
How can the function be applied to each row of the input dataframe to produce the final dataframe? One of the apply functions in base R or pmap_df() in the purrr package would seem ideal, but being unfamiliar with how these functions work, all my efforts have only produced errors.
If we want to apply the same function to each row, use pmap:
library(purrr)
pmap_dfr(df, ~ f(..1, ..2, ..3))
# A tibble: 52 x 3
# site tr year
# * <chr> <int> <int>
# 1 a 1 1
# 2 a 2 1
# 3 a 3 1
# 4 a 1 2
# 5 a 2 2
# 6 a 3 2
# 7 a 1 3
# 8 a 2 3
# 9 a 3 3
#10 a 1 4
# … with 42 more rows
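The positional ..1, ..2, ..3 lambda is only needed because f's third argument is named t while the column is tr; pmap also matches by name, so a small wrapper (the hypothetical f2 below) makes the call direct:
f2 <- function(site, years, tr) f(site, years, tr)  # argument names now match df's columns
pmap_dfr(df, f2)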
Another option is condense() from the development version of dplyr:
library(tidyr)
df %>%
  group_by(rn = row_number()) %>%
  condense(out = f(site, years, tr)) %>%
  unnest(c(out))
Or in base R, we can also use do.call with Map
do.call(rbind, do.call(Map, c(f, unname(as.data.frame(df)))))
Well, in base R, you could do:
do.call(rbind,do.call(Vectorize(f,SIMPLIFY = FALSE),unname(df)))
# A tibble: 52 x 3
site tr year
* <chr> <int> <int>
1 a 1 1
2 a 2 1
3 a 3 1
4 a 1 2
5 a 2 2
6 a 3 2
7 a 1 3
8 a 2 3
9 a 3 3
10 a 1 4
# ... with 42 more rows
# Split by site, expand each piece with rep(), then bind the pieces back together
do.call(rbind, lapply(split(df, df$site), function(x){
  with(x, data.frame(site,
                     years = rep(sequence(years), each = tr),
                     tr = rep(sequence(tr), years)))
}))
We can use Map to apply f to every value of site, years and tr.
do.call(rbind, Map(f, df$site, df$years, df$tr))
# A tibble: 52 x 3
# site tr year
# * <chr> <int> <int>
# 1 a 1 1
# 2 a 2 1
# 3 a 3 1
# 4 a 1 2
# 5 a 2 2
# 6 a 3 2
# 7 a 1 3
# 8 a 2 3
# 9 a 3 3
#10 a 1 4
# … with 42 more rows
Akrun's answer worked well for me, so I modified it to make the function to be applied to each row of the dataframe a little more explicit:
df1 <- pmap_df(df, function(site, years, tr){
  site = rep(site, each = tr, times = years)
  year = rep(1:years, each = tr)
  tr = rep(1:tr, times = years)
  return(tibble(site, year, tr))
})

Retain rows up to first occurrence of a value in a column, by group. Groups without value allowed

I have a data frame like this one:
> df
id type
1 1 a
2 1 a
3 1 b
4 1 a
5 1 b
6 2 a
7 2 a
8 2 b
9 3 a
10 3 a
I want to keep all rows for each group (id) up to the first occurrence of value 'b' in the type column. For groups without type 'b', I want to keep all their rows.
The resulting data frame should look like this:
> dfnew
id type
1 1 a
2 1 a
3 1 b
4 2 a
5 2 a
6 2 b
7 3 a
8 3 a
I tried the following code, but it retains additional 'a' rows beyond the first occurrence of 'b' and only excludes repeat occurrences of 'b', which is not what I want. Look at row 4 in the following; I want to get rid of it.
> df %>% group_by(id) %>% filter(cumsum(type == 'b') <= 1)
Source: local data frame [9 x 2]
Groups: id
id type
1 1 a
2 1 a
3 1 b
4 1 a
5 2 a
6 2 a
7 2 b
8 3 a
9 3 a
You could combine match() or which() with slice(), or (as mentioned by @Richard) use which.max().
library(dplyr)
df %>%
  group_by(id) %>%
  slice(if(any(type == "b")) 1:which.max(type == "b") else row_number())
# Source: local data table [8 x 2]
# Groups: id
#
# id type
# 1 1 a
# 2 1 a
# 3 1 b
# 4 2 a
# 5 2 a
# 6 2 b
# 7 3 a
# 8 3 a
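The match() variant mentioned above looks like this (index of the first "b" per group, or every row when there is none):
df %>%
  group_by(id) %>%
  slice(seq_len(if (any(type == "b")) match("b", type) else n()))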
Or you could try it with data.table
library(data.table)
setDT(df)[, if(any(type == "b")) .SD[1:which.max(type == "b")] else .SD, by = id]
# id type
# 1: 1 a
# 2: 1 a
# 3: 1 b
# 4: 2 a
# 5: 2 a
# 6: 2 b
# 7: 3 a
# 8: 3 a
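For completeness, the cumsum() attempt from the question only needs a lag so that the first "b" in each group is still counted as zero; a sketch:
df %>%
  group_by(id) %>%
  filter(cumsum(lag(type == "b", default = FALSE)) == 0)  # keeps rows up to and including the first "b"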
