Splitting and creating 2 rows out of one in an R data.table

I have a dataset (dt) like this in R:
n id val
1 1&&2 10
2 3 20
3 4&&5 30
And what I want to get is
n id val
1 1 10
2 2 10
3 3 20
4 4 30
5 5 30
I know that to split ids I need to do something like this:
id_split <- strsplit(dt$id,"&&")
But how do I create new rows with the same val for ids which were initially together in a row?

You can split the ids, build a small data frame per original row with Map (val recycles across the split ids), and rbind them together:
res <- do.call(rbind, Map(data.frame,
                          id = lapply(strsplit(dt$id, "&&"), cbind),
                          val = dt$val))
res <- cbind(n=1:nrow(res), res)
res
# n id val
# 1 1 1 10
# 2 2 2 10
# 3 3 3 20
# 4 4 4 30
# 5 5 5 30

You can use the lengths from the split of id to expand your rows, then set n to be a sequence along the length of your data frame, i.e.
l1 <- strsplit(as.character(dt$id), '&&')
res_df <- transform(dt[rep(seq_len(nrow(dt)), lengths(l1)),],
                    id = unlist(l1),
                    n = seq_along(unlist(l1)))
which gives,
n id val
1 1 1 10
1.1 2 2 10
2 3 3 20
3 4 4 30
3.1 5 5 30
You can remove the rownames with rownames(res_df) <- NULL

A data.table solution.
library(data.table)
DT <- fread('n id val
1 1&&2 10
2 3 20
3 4&&5 30')
DT[,.(id=unlist(strsplit(id,split ="&&"))),by=.(n,val)][,n:=.I][]
#> n val id
#> 1: 1 10 1
#> 2: 2 10 2
#> 3: 3 20 3
#> 4: 4 30 4
#> 5: 5 30 5
Created on 2020-05-08 by the reprex package (v0.3.0)
Note:
A more robust solution is by = 1:nrow(DT), but you then need to handle your other columns yourself.
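A minimal sketch of that by = 1:nrow(DT) variant (listing val in j to carry it along is my assumption about how to handle the other columns):
# by = 1:nrow(DT) groups one row at a time; the grouping counter is named nrow
res <- DT[, .(val = val, id = unlist(strsplit(id, "&&"))), by = 1:nrow(DT)]
res[, `:=`(nrow = NULL, n = .I)][]  # drop the counter, renumber n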

If anyone is looking for a tidy solution:
library(tidyverse)
dt %>%
  separate(id, into = paste0("id", 1:2), sep = "&&") %>%
  pivot_longer(cols = c(id1, id2), names_to = "id_name", values_to = "id") %>%
  drop_na(id) %>%
  select(n, id, val)
output as
# A tibble: 5 x 3
n id val
<dbl> <chr> <dbl>
1 1 1 10
2 1 2 10
3 2 3 20
4 3 4 30
5 3 5 30
Edit:
As suggested by @sotos (and completely missed by me), a one-liner solution:
dt %>% separate_rows(id, sep = "&&")
gives the same output:
# A tibble: 5 x 3
n id val
<dbl> <chr> <dbl>
1 1 1 10
2 1 2 10
3 2 3 20
4 3 4 30
5 3 5 30
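Note that separate_rows keeps the original n (split rows share it, as seen above); if you want n renumbered 1 to 5 as in the desired output, add a mutate:
dt %>%
  separate_rows(id, sep = "&&") %>%
  mutate(n = row_number())  # renumber n after splitting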

tstrsplit by id from data.table can do the job:
library(data.table)
dt <- setDT(dt)[, .(id = tstrsplit(id, "&&")), by = c('n', 'val')]
dt[, n := seq(.N)]
dt
n val id
1: 1 10 1
2: 2 10 2
3: 3 20 3
4: 4 30 4
5: 5 30 5

Related

How to create another column in a data frame based on repeated observations in another column?

So basically I have a data frame that looks like this:
BX BY
1 12
1 12
1 12
2 14
2 14
3 5
I want to create another column, ID, which will have the same number for the same values in BX and BY. The table would then look like this:
BX BY ID
1 12 1
1 12 1
1 12 1
2 14 2
2 14 2
3 5 3
Here is a base R way.
Subset the data.frame by the grouping columns, mark the rows that start a new group with !duplicated(), and turn those markers into group ids with a standard cumsum trick (each TRUE increments the counter).
df1<-'BX BY
1 12
1 12
1 12
2 14
2 14
3 5'
df1 <- read.table(textConnection(df1), header = TRUE)
cumsum(!duplicated(df1[c("BX", "BY")]))
#> [1] 1 1 1 2 2 3
df1$ID <- cumsum(!duplicated(df1[c("BX", "BY")]))
df1
#> BX BY ID
#> 1 1 12 1
#> 2 1 12 1
#> 3 1 12 1
#> 4 2 14 2
#> 5 2 14 2
#> 6 3 5 3
Created on 2022-10-12 with reprex v2.0.2
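For completeness, the same grouping id with data.table (a sketch; .GRP numbers the groups in order of first appearance):
library(data.table)
setDT(df1)[, ID := .GRP, by = .(BX, BY)][]  # one id per unique (BX, BY) pair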
You can do:
transform(dat, ID = as.numeric(interaction(dat, drop = TRUE, lex.order = TRUE)))
BX BY ID
1 1 12 1
2 1 12 1
3 1 12 1
4 2 14 2
5 2 14 2
6 3 5 3
Or if you prefer dplyr:
library(dplyr)
dat %>%
group_by(across()) %>%
mutate(ID = cur_group_id()) %>%
ungroup()
# A tibble: 6 × 3
BX BY ID
<dbl> <dbl> <int>
1 1 12 1
2 1 12 1
3 1 12 1
4 2 14 2
5 2 14 2
6 3 5 3
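A side note, assuming a newer dplyr (>= 1.1.0): across() with no arguments is deprecated there, and pick() is the suggested replacement:
dat %>%
  group_by(pick(everything())) %>%  # group by all columns
  mutate(ID = cur_group_id()) %>%
  ungroup()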

How can I create a new column with the mutate function in R that is a sequence of values from other columns?

I have a data frame that looks like this :
a b c
1 2 10
2 2 10
3 2 10
4 2 10
5 2 10
I want to create a column, with the mutate function or something else under the dplyr framework of functions (or base R), that is a sequence from b to c (i.e. from 2 to 10, with length equal to the number of rows of this tibble or data frame).
Ideally my new data frame would look like this:
a b c c
1 2 10 2
2 2 10 4
3 2 10 6
4 2 10 8
5 2 10 10
How can I do this in R using dplyr?
library(tidyverse)
n=5
a = seq(1,n,length.out=n)
b = rep(2,n)
c = rep(10,n)
data = tibble(a,b,c)
We may do
library(dplyr)
data %>%
rowwise %>%
mutate(new = seq(b, c, length.out = n)[a]) %>%
ungroup
Output:
# A tibble: 5 × 4
a b c new
<dbl> <dbl> <dbl> <dbl>
1 1 2 10 2
2 2 2 10 4
3 3 2 10 6
4 4 2 10 8
5 5 2 10 10
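Because seq() is linear, the same column can also be computed without rowwise; this sketch uses the closed form of seq(b, c, length.out = n)[a], assuming a runs over 1..n as in the setup:
data %>%
  mutate(new = b + (a - 1) * (c - b) / (n - 1))  # n = 5 from the setup above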
If you want this done "by group" for each a value (creating many new rows), we can create the sequence as a list column and then unnest it:
data %>%
mutate(result = map2(b, c, seq, length.out = n)) %>%
unnest(result)
# # A tibble: 25 × 4
# a b c result
# <dbl> <dbl> <dbl> <dbl>
# 1 1 2 10 2
# 2 1 2 10 4
# 3 1 2 10 6
# 4 1 2 10 8
# 5 1 2 10 10
# 6 2 2 10 2
# 7 2 2 10 4
# 8 2 2 10 6
# 9 2 2 10 8
# 10 2 2 10 10
# # … with 15 more rows
# # ℹ Use `print(n = ...)` to see more rows
If you want to keep the same number of rows and go from the first b value to the last c value, we can use seq directly in mutate:
data %>%
mutate(result = seq(from = first(b), to = last(c), length.out = n()))
# # A tibble: 5 × 4
# a b c result
# <dbl> <dbl> <dbl> <dbl>
# 1 1 2 10 2
# 2 2 2 10 4
# 3 3 2 10 6
# 4 4 2 10 8
# 5 5 2 10 10
This one? It reproduces the desired output for this data, since seq(2, 10, length.out = 5) happens to equal a * b here, but it is not a general solution:
library(dplyr)
df %>%
mutate(c1 = a*b)
a b c c1
1 1 2 10 2
2 2 2 10 4
3 3 2 10 6
4 4 2 10 8
5 5 2 10 10

Converting time-dependent variable to long format using one variable indicating day of update

I am trying to convert my data to a long format using one variable that indicates the day of the update.
I have the following variables:
baseline temperature variable "temp_b";
time-varying temperature variable "temp_v" and
the number of days "n_days" when the varying variable is updated.
I want to create a long format using the carried forward approach and a max follow-up time of 5 days.
Example of data
df <- structure(list(id=1:3, temp_b=c(20L, 7L, 7L), temp_v=c(30L, 10L, NA), n_days=c(2L, 4L, NA)), class="data.frame", row.names=c(NA, -3L))
# id temp_b temp_v n_days
# 1 1 20 30 2
# 2 2 7 10 4
# 3 3 7 NA NA
df_long <- structure(list(id=c(1,1,1,1,1, 2,2,2,2,2, 3,3,3,3,3),
days_cont=c(1,2,3,4,5, 1,2,3,4,5, 1,2,3,4,5),
long_format=c(20,30,30,30,30,7,7,7,10,10,7,7,7,7,7)),
class="data.frame", row.names=c(NA, -15L))
# id days_cont long_format
# 1 1 1 20
# 2 1 2 30
# 3 1 3 30
# 4 1 4 30
# 5 1 5 30
# 6 2 1 7
# 7 2 2 7
# 8 2 3 7
# 9 2 4 10
# 10 2 5 10
# 11 3 1 7
# 12 3 2 7
# 13 3 3 7
# 14 3 4 7
# 15 3 5 7
You could repeat each row 5 times with tidyr::uncount():
library(dplyr)
df %>%
tidyr::uncount(5) %>%
group_by(id) %>%
transmute(days_cont = 1:n(),
temp = ifelse(row_number() < n_days | is.na(n_days), temp_b, temp_v)) %>%
ungroup()
# # A tibble: 15 × 3
# id days_cont temp
# <int> <int> <int>
# 1 1 1 20
# 2 1 2 30
# 3 1 3 30
# 4 1 4 30
# 5 1 5 30
# 6 2 1 7
# 7 2 2 7
# 8 2 3 7
# 9 2 4 10
# 10 2 5 10
# 11 3 1 7
# 12 3 2 7
# 13 3 3 7
# 14 3 4 7
# 15 3 5 7
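For reference, the same carried-forward logic in base R (a sketch using the df from the question):
df_rep <- df[rep(seq_len(nrow(df)), each = 5), ]  # 5 rows per id
df_rep$days_cont <- rep(1:5, times = nrow(df))
df_rep$long_format <- ifelse(
  df_rep$days_cont < df_rep$n_days | is.na(df_rep$n_days),
  df_rep$temp_b,  # before the update day, or never updated
  df_rep$temp_v   # on and after the update day
)
df_long <- df_rep[, c("id", "days_cont", "long_format")]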
Here's a possibility using tidyverse functions. First, pivot_longer and get rid of unwanted values (that will not appear in the final df, i.e. values with temp_v == NA), then group_by id, and mutate the n_days variable to match the number of rows it will have in the final df. Finally, uncount the dataframe.
library(tidyverse)
df %>%
replace_na(list(n_days = 6)) %>%
pivot_longer(-c(id, n_days)) %>%
filter(!is.na(value)) %>%
group_by(id) %>%
mutate(n_days = case_when(name == "temp_b" ~ n_days - 1,
name == "temp_v" ~ 5 - (n_days - 1))) %>%
uncount(n_days) %>%
mutate(days_cont = row_number()) %>%
select(id, days_cont, long_format = value)
id days_cont long_format
<int> <int> <int>
1 1 1 20
2 1 2 30
3 1 3 30
4 1 4 30
5 1 5 30
6 2 1 7
7 2 2 7
8 2 3 7
9 2 4 10
10 2 5 10
11 3 1 7
12 3 2 7
13 3 3 7
14 3 4 7
15 3 5 7

How to balance a dataset in `dplyr` using `sample_n` automatically to the size of the smallest class?

I have a dataset like:
library(dplyr)
df <- tibble(
id = 1:18,
class = rep(c(rep(1,3),rep(2,2),3),3),
var_a = rep(c("a","b"),9)
)
# A tibble: 18 x 3
id class var_a
<int> <dbl> <chr>
1 1 1 a
2 2 1 b
3 3 1 a
4 4 2 b
5 5 2 a
6 6 3 b
7 7 1 a
8 8 1 b
9 9 1 a
10 10 2 b
11 11 2 a
12 12 3 b
13 13 1 a
14 14 1 b
15 15 1 a
16 16 2 b
17 17 2 a
18 18 3 b
That dataset contains a number of observations in several classes. The classes are not balanced. In the sample above we can see that only 3 observations are of class 3, while there are 6 observations of class 2 and 9 observations of class 1.
Now I want to automatically balance that dataset so that all classes are of the same size. So I want a dataset of 9 rows, 3 rows in each class. I can use the sample_n function from dplyr to do such a sampling.
I managed to do so by first calculating the smallest class size...
min_length <- as.numeric(df %>%
group_by(class) %>%
summarise(n = n()) %>%
ungroup() %>%
summarise(min = min(n)))
...and then applying the sample_n function:
set.seed(1)
df %>% group_by(class) %>% sample_n(min_length)
# A tibble: 9 x 3
# Groups: class [3]
id class var_a
<int> <dbl> <chr>
1 15 1 a
2 7 1 a
3 13 1 a
4 4 2 b
5 5 2 a
6 17 2 a
7 18 3 b
8 6 3 b
9 12 3 b
I wondered if it's possible to do that (calculating the smallest class size and then sampling) in one go?
You can do it in one step, but it is cheating a little:
set.seed(42)
df %>%
group_by(class) %>%
sample_n(min(table(df$class))) %>%
ungroup()
# # A tibble: 9 x 3
# id class var_a
# <int> <dbl> <chr>
# 1 1 1 a
# 2 8 1 b
# 3 15 1 a
# 4 4 2 b
# 5 5 2 a
# 6 11 2 a
# 7 12 3 b
# 8 18 3 b
# 9 6 3 b
I say "cheating" because normally you would not want to reference df$ from within the pipe. However, because they property we're looking for is of the whole frame but the table function only sees one group at a time, we need to side-step that a little.
One could do
df %>%
mutate(mn = min(table(class))) %>%
group_by(class) %>%
sample_n(mn[1]) %>%
ungroup()
# # A tibble: 9 x 4
# id class var_a mn
# <int> <dbl> <chr> <int>
# 1 14 1 b 3
# 2 13 1 a 3
# 3 7 1 a 3
# 4 4 2 b 3
# 5 16 2 b 3
# 6 5 2 a 3
# 7 12 3 b 3
# 8 18 3 b 3
# 9 6 3 b 3
Though I don't think that is any more elegant or readable.
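A side note, assuming a newer dplyr (>= 1.0.0): sample_n() is superseded there, and slice_sample() expresses the same idea:
set.seed(42)
df %>%
  group_by(class) %>%
  slice_sample(n = min(table(df$class))) %>%  # smallest class size
  ungroup()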

Calculate cumulative sum (cumsum) by group

With data frame:
df <- data.frame(id = rep(1:3, each = 5)
, hour = rep(1:5, 3)
, value = sample(1:15))
I want to add a cumulative sum column that matches the id:
df
id hour value csum
1 1 1 7 7
2 1 2 9 16
3 1 3 15 31
4 1 4 11 42
5 1 5 14 56
6 2 1 10 10
7 2 2 2 12
8 2 3 5 17
9 2 4 6 23
10 2 5 4 27
11 3 1 1 1
12 3 2 13 14
13 3 3 8 22
14 3 4 3 25
15 3 5 12 37
How can I do this efficiently? Thanks!
df$csum <- ave(df$value, df$id, FUN=cumsum)
ave is the "go-to" function if you want a by-group vector of equal length to an existing vector that can be computed from those sub-vectors alone. If you need by-group processing based on multiple "parallel" values, the base strategy is do.call(rbind, by(dfrm, grp, FUN)).
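A minimal sketch of that do.call(rbind, by(...)) strategy applied here, just to illustrate (ave is simpler for this case):
res <- do.call(rbind, by(df, df$id, function(d) {
  d$csum <- cumsum(d$value)  # per-group cumulative sum
  d
}))
rownames(res) <- NULL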
To add to the alternatives, data.table's syntax is nice:
library(data.table)
DT <- data.table(df, key = "id")
DT[, csum := cumsum(value), by = key(DT)]
Or, more compactly:
library(data.table)
setDT(df)[, csum := cumsum(value), id][]
The above will:
Convert the data.frame to a data.table by reference
Calculate the cumulative sum of value grouped by id and assign it by reference
Print (the last [] there) the result of the entire operation
"df" will now be a data.table with a "csum" column.
Using dplyr:
require(dplyr)
df %>% group_by(id) %>% mutate(csum = cumsum(value))
Using the plyr library:
library(plyr)
ddply(df,.(id),transform,csum=cumsum(value))
Using base R
df <- data.frame(id = rep(1:3, each = 5)
, hour = rep(1:5, 3)
, value = sample(1:15))
transform(df , csum = ave(value , id , FUN = cumsum))
#> id hour value csum
#> 1 1 1 4 4
#> 2 1 2 12 16
#> 3 1 3 13 29
#> 4 1 4 6 35
#> 5 1 5 5 40
#> 6 2 1 15 15
#> 7 2 2 1 16
#> 8 2 3 2 18
#> 9 2 4 8 26
#> 10 2 5 9 35
#> 11 3 1 11 11
#> 12 3 2 7 18
#> 13 3 3 10 28
#> 14 3 4 3 31
#> 15 3 5 14 45
Created on 2022-06-05 by the reprex package (v2.0.1)
