Unnest vector in dataframe but add list indices column

Say I have a tibble such as this:
library(tibble)
df <- tibble(x = 22:23, y = list(4:6, 4:7))
df
# A tibble: 2 × 2
x y
<int> <list>
1 22 <int [3]>
2 23 <int [4]>
I would like to convert it into a new, larger tibble by unnesting the lists (e.g. with unnest), which would give me a tibble with 7 rows. However, I also want a new column that tells me, for each y value in a row after unnesting, what the index of that value was when it was still in list form. Here's what the above would look like after doing this:
# A tibble: 7 × 3
x y index
<int> <int> <int>
1 22 4 1
2 22 5 2
3 22 6 3
4 23 4 1
5 23 5 2
6 23 6 3
7 23 7 4

You can map over the y column and bind an index to each element before unnesting:
library(tidyverse)

df %>%
mutate(y = map(y, ~ data.frame(y = .x, index = seq_along(.x)))) %>%
unnest(y)
# A tibble: 7 x 3
# x y index
# <int> <int> <int>
#1 22 4 1
#2 22 5 2
#3 22 6 3
#4 23 4 1
#5 23 5 2
#6 23 6 3
#7 23 7 4
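
A small variant (a sketch on my part, assuming tidyr >= 1.0, where unnest() takes the columns to unnest) leaves y alone, builds a parallel list-column of positions, and unnests both together:
df %>%
mutate(index = map(y, seq_along)) %>%  # one vector of positions per element of y
unnest(c(y, index))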

Here is another version using lengths:
df %>%
mutate(index = lengths(y)) %>%
unnest(y) %>%
mutate(index = sequence(unique(index)))
# A tibble: 7 x 3
# x index y
# <int> <int> <int>
#1 22 1 4
#2 22 2 5
#3 22 3 6
#4 23 1 4
#5 23 2 5
#6 23 3 6
#7 23 4 7
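
Note that sequence(unique(index)) only works because the pattern of list lengths here recycles cleanly; with other combinations of lengths it can error or mis-assign indices. A variant that avoids that assumption (a sketch, at the cost of reaching back to df from inside the pipe) is:
df %>%
unnest(y) %>%
mutate(index = sequence(lengths(df$y)))  # one run of 1..length per original row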

Using unnest and group_by:
library(tidyr)
library(dplyr)
df %>%
unnest(y) %>%
group_by(x) %>%
mutate(index = row_number())
# A tibble: 7 x 3
# Groups: x [2]
x y index
<int> <int> <int>
1 22 4 1
2 22 5 2
3 22 6 3
4 23 4 1
5 23 5 2
6 23 6 3
7 23 7 4
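
This relies on x uniquely identifying the original rows. If x could repeat, a row id created before unnesting avoids that assumption (a sketch, with .row as a throwaway helper column):
df %>%
mutate(.row = row_number()) %>%
unnest(y) %>%
group_by(.row) %>%
mutate(index = row_number()) %>%
ungroup() %>%
select(-.row)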

You can also try rowwise and do.
library(tidyverse)
tibble(x = 22:23, y = list(4:6, 4:7)) %>%
rowwise() %>%
do(tibble(x = .$x, y = unlist(.$y), index = seq_along(unlist(.$y))))
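
do() is superseded in current dplyr, and with a recent tidyr (1.0 or later, an assumption about your setup) the whole task is a one-liner: unnest_longer() can write the element positions straight into a new column.
df %>%
unnest_longer(y, indices_to = "index")  # for unnamed vectors the index column holds the positions 1, 2, ...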

Related

R filter removes unexpected values

I want to filter out all the rows with row_number > 12 from a data frame like this:
head(dat1)
# A tibble: 6 × 7
date order_id product_id row_number shelf_number shelf_level position
<date> <chr> <chr> <dbl> <chr> <chr> <dbl>
1 2020-01-02 ES100025694747 000072489501 6 01 C 51
2 2020-01-02 ES100025694747 000058155401 2 39 B 51
3 2020-01-02 ES100025694747 000067694201 21 28 B 51
4 2020-01-02 ES100025699052 000057235001 9 05 B 31
5 2020-01-02 ES100025699052 000050456101 5 29 D 31
6 2020-01-02 ES100025699052 000067091601 2 17 D 11
The row_number column originally contains values like this:
dat1 %>% distinct(row_number)
# A tibble: 15 × 1
row_number
<dbl>
1 6
2 2
3 21
4 9
5 5
6 1
7 10
8 3
9 4
10 8
11 7
12 20
13 22
14 11
15 12
I filtered like this: dat1 <- dat1 %>% filter(row_number < '13')
The result: instead of keeping all values <13, it removes values from 2 to 9.
dat1 %>% distinct(row_number)
# A tibble: 4 × 1
row_number
<dbl>
1 1
2 10
3 11
4 12
What's wrong with my code?
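
The culprit is the quoted '13': row_number is numeric, so comparing it with a character value makes R coerce the numbers to text and compare them lexicographically. "2" sorts after "13", which is why 2 through 9 disappear while 1, 10, 11 and 12 survive. Comparing against an actual number keeps the test numeric; a minimal sketch of the fix:
library(dplyr)

# compare against a number, not a string, so the test stays numeric
dat1 <- dat1 %>% filter(row_number < 13)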

How to create new variables based on stacked values in columns in R

I have a data set that looks like this
ID a b c d source_file
1 3 4 7 23 feb2010.txt
2 2 1 2 47 feb2010.txt
1 3 4 7 26 march2010.txt
2 2 1 2 33 march2010.txt
1 3 4 7 28 april2010.txt
2 2 1 2 32 april2010.txt
I'd like the column names to read
ID a b c Feb10 March10 April10
1 3 4 7 23 26 28
2 2 1 2 47 33 32
My actual data set has more than just 2 unique ids. It has thousands of unique ids. Any help as to how to change this would be very much appreciated as most of the code I have tried hasn't worked yet.
Thank you!
You can use pivot_wider() from the tidyverse.
library(dplyr)
library(tidyr)
library(lubridate)
df %>%
mutate(source_file = tools::file_path_sans_ext(source_file),
source_file = format(my(source_file), format = "%b%y")) %>%
pivot_wider(names_from = "source_file", values_from = "d")
Which gives you the following:
# A tibble: 2 x 7
ID a b c Feb10 Mar10 Apr10
<int> <int> <int> <int> <int> <int> <int>
1 1 3 4 7 23 26 28
2 2 2 1 2 47 33 32
Data:
df <- read.table(textConnection("ID a b c d source_file
1 3 4 7 23 feb2010.txt
2 2 1 2 47 feb2010.txt
1 3 4 7 26 march2010.txt
2 2 1 2 33 march2010.txt
1 3 4 7 28 april2010.txt
2 2 1 2 32 april2010.txt"), header = TRUE)
Using dcast from data.table
library(data.table)
dcast(setDT(df), ID + a + b + c ~ trimws(source_file,
whitespace = "\\.txt"), value.var = "d")
Key: <ID, a, b, c>
ID a b c april2010 feb2010 march2010
<int> <int> <int> <int> <int> <int> <int>
1: 1 3 4 7 28 23 26
2: 2 2 1 2 32 47 33
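
If the shorter Feb10/Mar10/Apr10 names are wanted from the data.table route as well, one option (a sketch, reusing lubridate's my() from the answer above) is to reformat the file names before casting:
library(data.table)
library(lubridate)

# strip ".txt", parse the month-year, and reformat as e.g. "Feb10"
setDT(df)[, month := format(my(sub("\\.txt$", "", source_file)), "%b%y")]
dcast(df, ID + a + b + c ~ month, value.var = "d")
The columns come out in alphabetical order; turn month into a factor with the desired levels if chronological order matters.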

How to balance a dataset in `dplyr` using `sample_n` automatically to the size of the smallest class?

I have a dataset like:
df <- tibble(
id = 1:18,
class = rep(c(rep(1,3),rep(2,2),3),3),
var_a = rep(c("a","b"),9)
)
# A tibble: 18 x 3
id class var_a
<int> <dbl> <chr>
1 1 1 a
2 2 1 b
3 3 1 a
4 4 2 b
5 5 2 a
6 6 3 b
7 7 1 a
8 8 1 b
9 9 1 a
10 10 2 b
11 11 2 a
12 12 3 b
13 13 1 a
14 14 1 b
15 15 1 a
16 16 2 b
17 17 2 a
18 18 3 b
That dataset contains a number of observations in several classes. The classes are not balanced. In the sample above we can see, that only 3 observations are of class 3, while there are 6 observations of class 2 and 9 observations of class 1.
Now I want to automatically balance that dataset so that all classes are of the same size. So I want a dataset of 9 rows, 3 rows in each class. I can use the sample_n function from dplyr to do such a sampling.
I managed to do so by first calculating the smallest class size...
min_length <- as.numeric(df %>%
group_by(class) %>%
summarise(n = n()) %>%
ungroup() %>%
summarise(min = min(n)))
...and then applying the sample_n function:
set.seed(1)
df %>% group_by(class) %>% sample_n(min_length)
# A tibble: 9 x 3
# Groups: class [3]
id class var_a
<int> <dbl> <chr>
1 15 1 a
2 7 1 a
3 13 1 a
4 4 2 b
5 5 2 a
6 17 2 a
7 18 3 b
8 6 3 b
9 12 3 b
I wondered if it's possible to do that (calculating the smallest class size and then sampling) in one go?
You can do it in one step, but it is cheating a little:
set.seed(42)
df %>%
group_by(class) %>%
sample_n(min(table(df$class))) %>%
ungroup()
# # A tibble: 9 x 3
# id class var_a
# <int> <dbl> <chr>
# 1 1 1 a
# 2 8 1 b
# 3 15 1 a
# 4 4 2 b
# 5 5 2 a
# 6 11 2 a
# 7 12 3 b
# 8 18 3 b
# 9 6 3 b
I say "cheating" because normally you would not want to reference df$ from within the pipe. However, because they property we're looking for is of the whole frame but the table function only sees one group at a time, we need to side-step that a little.
One could do
df %>%
mutate(mn = min(table(class))) %>%
group_by(class) %>%
sample_n(mn[1]) %>%
ungroup()
# # A tibble: 9 x 4
# id class var_a mn
# <int> <dbl> <chr> <int>
# 1 14 1 b 3
# 2 13 1 a 3
# 3 7 1 a 3
# 4 4 2 b 3
# 5 16 2 b 3
# 6 5 2 a 3
# 7 12 3 b 3
# 8 18 3 b 3
# 9 6 3 b 3
Though I don't think that is any more elegant or readable.
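
In current dplyr, sample_n() is superseded by slice_sample(); the same one-step idea (with the same df$ caveat) looks like this as a sketch:
library(dplyr)

set.seed(42)
df %>%
group_by(class) %>%
slice_sample(n = min(table(df$class))) %>%  # smallest class size, taken from the whole frame
ungroup()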

Split information from two columns, R, tidyverse

I've got some data in two columns:
# A tibble: 16 x 2
code niveau
<chr> <dbl>
1 A 1
2 1 2
3 2 2
4 3 2
5 4 2
6 5 2
7 B 1
8 6 2
9 7 2
My desired output is:
A tibble: 16 x 3
code niveau cat
<chr> <dbl> <chr>
1 A 1 A
2 1 2 A
3 2 2 A
4 3 2 A
5 4 2 A
6 5 2 A
7 B 1 B
8 6 2 B
Is there a tidy way to convert these data without looping through them?
Here is some dummy data:
data<-tibble(code=c('A', 1,2,3,4,5,'B', 6,7,8,9,'C',10,11,12,13), niveau=c(1, 2,2,2,2,2,1,2,2,2,2,1,2,2,2,2))
desired_output<-tibble(code=c('A', 1,2,3,4,5,'B', 6,7,8,9,'C',10,11,12,13), niveau=c(1, 2,2,2,2,2,1,2,2,2,2,1,2,2,2,2),
cat=c(rep('A', 6),rep('B', 5), rep('C', 5)))
You can create a new column cat and replace the code values with NA wherever the code is a number. We can then use fill to replace each missing value with the previous non-NA value.
library(dplyr)
data %>% mutate(cat = replace(code, grepl('\\d', code), NA)) %>% tidyr::fill(cat)
# A tibble: 16 x 3
# code niveau cat
# <chr> <dbl> <chr>
# 1 A 1 A
# 2 1 2 A
# 3 2 2 A
# 4 3 2 A
# 5 4 2 A
# 6 5 2 A
# 7 B 1 B
# 8 6 2 B
# 9 7 2 B
#10 8 2 B
#11 9 2 B
#12 C 1 C
#13 10 2 C
#14 11 2 C
#15 12 2 C
#16 13 2 C
We can use str_detect from stringr
library(dplyr)
library(stringr)
library(tidyr)
data %>%
mutate(cat = replace(code, str_detect(code, '\\d'), NA)) %>%
fill(cat)
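
Since niveau == 1 marks exactly the letter rows in this data, you can also key off niveau instead of pattern-matching code (a sketch, relying on that property of the data):
library(dplyr)
library(tidyr)

data %>%
mutate(cat = if_else(niveau == 1, code, NA_character_)) %>%  # keep the code only on header rows
fill(cat)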

calculate difference between rows, but keep the raw value by group

I have a dataframe with cumulative values by group that I need to recalculate back to raw values. The lag function works pretty well here, but for the first number in a sequence I get back either NA or the difference between two groups.
How can I get the first number in each group instead of NA or the between-group difference?
My dummy data:
# make example
df <- data.frame(id = rep(1:3, each = 5),
hour = rep(1:5, 3),
value = sample(1:15))
First I calculate the cumulative values, then convert them back to row values, i.e. value should equal valBack. The suggestion mutate(valBack = c(cumsum[1], (cumsum - lag(cumsum))[-1])) only replaces the first (NA) value with the correct value; it does not work for the first number of each group.
df %>%
group_by(id) %>%
dplyr::mutate(cumsum = cumsum(value)) %>%
mutate(valBack = c(cumsum[1], (cumsum - lag(cumsum))[-1])) # skip the first value in a lag vector
Which results in:
# A tibble: 15 x 5
# Groups: id [3]
id hour value cumsum valBack
<int> <int> <int> <int> <int>
1 1 1 10 10 10 # this works
2 1 2 13 23 13
3 1 3 8 31 8
4 1 4 4 35 4
5 1 5 9 44 9
6 2 1 12 12 -32 # here a new group starts. The number should be 12; instead it is -32??
7 2 2 14 26 14
8 2 3 5 31 5
9 2 4 15 46 15
10 2 5 1 47 1
11 3 1 2 2 -45 # here it should be 2 instead of -45
12 3 2 3 5 3
13 3 3 6 11 6
14 3 4 11 22 11
15 3 5 7 29 7
I want a safe calculation that makes valBack equal to value. (Of course, in the real data I don't have the value column, just the cumsum column.)
Try the following. (Your snippet mixes dplyr::mutate with a bare mutate; if plyr is also attached, the bare call may be picking up plyr::mutate, which ignores the grouping and would explain the -32 and -45. Using dplyr's mutate for both calls keeps the calculation within each id.)
library(dplyr)
df %>%
group_by(id) %>%
mutate(
cumsum = cumsum(value),
valBack = c(cumsum[1], (cumsum - lag(cumsum))[-1])
)
Giving:
# A tibble: 15 x 5
# Groups: id [3]
id hour value cumsum valBack
<int> <int> <int> <int> <int>
1 1 1 10 10 10
2 1 2 13 23 13
3 1 3 8 31 8
4 1 4 4 35 4
5 1 5 9 44 9
6 2 1 12 12 12
7 2 2 14 26 14
8 2 3 5 31 5
9 2 4 15 46 15
10 2 5 1 47 1
11 3 1 2 2 2
12 3 2 3 5 3
13 3 3 6 11 6
14 3 4 11 22 11
15 3 5 7 29 7
While the accepted answer works, it is more complicated than it needs to be. If you look at the lag function you will see that it has additional arguments:
dplyr::lag(x, n = 1L, default = NA, order_by = NULL, ...)
Here we can use default and set it to 0 to get the desired output:
library(dplyr)
df %>%
group_by(id) %>%
mutate(cumsum = cumsum(value),
rawdata = cumsum - lag(cumsum, default = 0))
#> # A tibble: 15 x 5
#> # Groups: id [3]
#> id hour value cumsum rawdata
#> <int> <int> <int> <int> <dbl>
#> 1 1 1 2 2 2
#> 2 1 2 1 3 1
#> 3 1 3 13 16 13
#> 4 1 4 15 31 15
#> 5 1 5 10 41 10
#> 6 2 1 3 3 3
#> 7 2 2 8 11 8
#> 8 2 3 4 15 4
#> 9 2 4 12 27 12
#> 10 2 5 11 38 11
#> 11 3 1 14 14 14
#> 12 3 2 6 20 6
#> 13 3 3 5 25 5
#> 14 3 4 7 32 7
#> 15 3 5 9 41 9
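
Base R's diff() expresses the same idea: prepend a zero to the cumulative sums and take first differences, which recovers the raw values including the first one in each group (a sketch, equivalent to lag(..., default = 0)):
library(dplyr)

df %>%
group_by(id) %>%
mutate(cumsum = cumsum(value),
valBack = diff(c(0, cumsum))) %>%  # 0 is prepended so the first value of each group survives
ungroup()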
