Splitting panel data rows in R

I have a dataset that has rows I would like to split. Is there a simple way to do this?
data = data.frame(id = 111, t1 = 277, t2 = 385, meds = 1)

I am trying to use a conditional that lets me split rows and create an output similar to this:

data = data.frame(id = 111, t1 = c(277, 366), t2 = c(365, 385), meds = 1)

I think you can just do a little row-wise summary using dplyr:

library(dplyr)

data %>%
  rowwise() %>%
  summarize(id,
            t1 = if (t1 < 365 & t2 > 365) c(t1, 366) else t1,
            t2 = if (t1 < 365 & t2 > 365) c(365, t2) else t2,
            meds)
#> # A tibble: 2 x 4
#>      id    t1    t2  meds
#>   <dbl> <dbl> <dbl> <dbl>
#> 1   111   277   365     1
#> 2   111   366   385     1
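Note: returning more than one row per group from summarize() was deprecated in dplyr 1.1.0 in favor of reframe(). A minimal sketch of the same idea, assuming dplyr >= 1.1.0 (untested):

library(dplyr)  # assuming dplyr >= 1.1.0

data %>%
  rowwise() %>%
  reframe(id,
          t1 = if (t1 < 365 & t2 > 365) c(t1, 366) else t1,
          t2 = if (t1 < 365 & t2 > 365) c(365, t2) else t2,
          meds)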

I used the group_split function from dplyr:

## Loading the required libraries
library(dplyr)
library(tidyverse)

## Creating the dataframe
df <- data.frame(
  t1 = c(1:600),
  t2 = c(200:799)
)

## Conditional column
df1 <- df %>%
  mutate(DataframeNo = ifelse(t1 < 365 & t2 > 365, "2 dfs", "1 df")) %>%
  group_by(DataframeNo)

## Get the first dataframe
group_split(df1)[[1]]

## Get the second dataframe
group_split(df1)[[2]]
Output

> group_split(df1)[[1]]
# A tibble: 402 x 3
      t1    t2 DataframeNo
   <int> <int> <chr>
 1     1   200 1 df
 2     2   201 1 df
 3     3   202 1 df
 4     4   203 1 df
 5     5   204 1 df
 6     6   205 1 df
 7     7   206 1 df
 8     8   207 1 df
 9     9   208 1 df
10    10   209 1 df
# ... with 392 more rows

> ## Get the second dataframe
> group_split(df1)[[2]]
# A tibble: 198 x 3
      t1    t2 DataframeNo
   <int> <int> <chr>
 1   167   366 2 dfs
 2   168   367 2 dfs
 3   169   368 2 dfs
 4   170   369 2 dfs
 5   171   370 2 dfs
 6   172   371 2 dfs
 7   173   372 2 dfs
 8   174   373 2 dfs
 9   175   374 2 dfs
10   176   375 2 dfs
# ... with 188 more rows

Related

Inexact joining data based on greater equal condition

I have some values in df:

# A tibble: 7 × 1
   var1
  <dbl>
1     0
2    10
3    20
4   210
5   230
6   266
7   267

that I would like to compare to a second dataframe called value_lookup:

# A tibble: 4 × 2
   var1 value
  <dbl> <dbl>
1     0     0
2   200    10
3   230    20
4   260    30

In particular, I would like to make a join based on >=, meaning that a value greater than or equal to the number in var1 gets the corresponding value. E.g. take the number 210 of the original dataframe: since it is >= 200 and < 230, it would get a value of 10.
Here is the expected output:

  var1 value
1    0     0
2   10     0
3   20     0
4  210    10
5  230    20
6  266    30
7  267    30
I thought it should be doable using {fuzzyjoin} but I cannot get it done.

value_lookup <- tibble(var1 = c(0, 200, 230, 260),
                       value = c(0, 10, 20, 30))
df <- tibble(var1 = c(0, 10, 20, 210, 230, 266, 267))

library(fuzzyjoin)
fuzzyjoin::fuzzy_left_join(
  x = df,
  y = value_lookup,
  by = "var1",
  match_fun = list(`>=`)
)
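The {fuzzyjoin} attempt above is close: with match_fun = list(`>=`), every lookup row whose threshold is met gets matched, so each row of df can come back several times. A possible way to finish it, keeping only the largest matched threshold per row (a sketch; fuzzy_left_join suffixes the duplicated column name as var1.x/var1.y):

library(dplyr)
library(fuzzyjoin)

fuzzy_left_join(
  x = df,
  y = value_lookup,
  by = "var1",
  match_fun = list(`>=`)
) %>%
  group_by(var1.x) %>%            # one group per original value
  slice_max(var1.y, n = 1) %>%    # keep the largest threshold that matched
  ungroup() %>%
  select(var1 = var1.x, value)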
An option is also findInterval, which returns for each element of df$var1 the index of the last value in value_lookup$var1 that does not exceed it; that index is then used to pick the matching value:

df$value <- value_lookup$value[findInterval(df$var1, value_lookup$var1)]
Output:

  var1 value
1    0     0
2   10     0
3   20     0
4  210    10
5  230    20
6  266    30
7  267    30
As you're mentioning joins, you could also do a rolling join via data.table with the argument roll = TRUE, which looks for the same or closest value preceding var1 in your df:

library(data.table)
setDT(value_lookup)[setDT(df), on = 'var1', roll = TRUE]
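With the sample data above this should reproduce the expected output (sketched below; the exact print format depends on the data.table version):

#    var1 value
# 1:    0     0
# 2:   10     0
# 3:   20     0
# 4:  210    10
# 5:  230    20
# 6:  266    30
# 7:  267    30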
You can use cut; the breaks c(value_lookup$var1, Inf) bin each var1 into its threshold interval, and the resulting factor's integer codes index into value:

df$value <- value_lookup$value[cut(df$var1,
                                   c(value_lookup$var1, Inf),
                                   right = FALSE)]
# # A tibble: 7 x 2
#    var1 value
#   <dbl> <dbl>
# 1     0     0
# 2    10     0
# 3    20     0
# 4   210    10
# 5   230    20
# 6   266    30
# 7   267    30

How to split a dataframe into a list of dataframes based on distinct value ranges

I want to split a dataframe into a list of dataframes based on distinct ranges of a numeric variable.
ILLUSTRATIVE DATA:
set.seed(123)
df <- data.frame(
  subject = LETTERS[1:10],
  weight = sample(1:1000, 10)
)
df
   subject weight
1        A    288
2        B    788
3        C    409
4        D    881
5        E    937
6        F     46
7        G    525
8        H    887
9        I    548
10       J    453
I'd like to have a list of 4 smaller dataframes based on these limits of the variable weight:
limits <- c(250, 500, 750, 1000)
That is, what I'm after, in the list of dataframes, is one dataframe where weight is in the range 0-250, another where weight ranges between 251-500, another where the range is 501-750, and so on. In other words, the ranges are distinct.
What I've tried so far is this dplyr solution, which outputs a list of 5 dataframes, but with cumulative ranges:
limits <- c(250, 500, 750, 1000)
lapply(limits, function(x) {df %>% filter(weight <= x)})
[[1]]
[1] subject weight
<0 rows> (or 0-length row.names)

[[2]]
  subject weight
1       F     46

[[3]]
  subject weight
1       A    288
2       C    409
3       F     46
4       J    453

[[4]]
  subject weight
1       A    288
2       C    409
3       F     46
4       G    525
5       I    548
6       J    453

[[5]]
   subject weight
1        A    288
2        B    788
3        C    409
4        D    881
5        E    937
6        F     46
7        G    525
8        H    887
9        I    548
10       J    453
How could this code be fixed, or which other code can be used, so that a list of dataframes is obtained based on distinct weight ranges?
Perhaps:

library(dplyr)
df %>%
  group_split(group = findInterval(weight, limits))
Output:
[[1]]
# A tibble: 4 x 3
  subject weight group
  <fct>    <int> <int>
1 C          179     0
2 E          195     0
3 H          118     0
4 J          229     0

[[2]]
# A tibble: 3 x 3
  subject weight group
  <fct>    <int> <int>
1 A          415     1
2 B          463     1
3 I          299     1

[[3]]
# A tibble: 1 x 3
  subject weight group
  <fct>    <int> <int>
1 D          526     2

[[4]]
# A tibble: 2 x 3
  subject weight group
  <fct>    <int> <int>
1 F          938     3
2 G          818     3
Just pass .keep = FALSE (keep = FALSE in older dplyr versions) as an additional argument to group_split if you want to remove the group column from the output.
A base R one-liner can split the data by limits:

split(df, findInterval(df$weight, limits))
#$`0`
#   subject weight
#3        C    179
#5        E    195
#8        H    118
#10       J    229
#
#$`1`
#  subject weight
#1       A    415
#2       B    463
#9       I    299
#
#$`2`
#  subject weight
#4       D    526
#
#$`3`
#  subject weight
#6       F    938
#7       G    818
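If readable range labels are wanted instead of the 0-3 bin indices, cut can name the list elements; a small sketch (the labels are illustrative):

split(df, cut(df$weight, c(0, limits),
              labels = c("0-250", "251-500", "501-750", "751-1000")))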

Get most frequently occurring factor level in dplyr piping structure

I'd like to be able to find the most frequently occurring level in a factor in a dataset while using dplyr's piping structure. I'm trying to create a new variable that contains the 'modal' factor level when being grouped by another variable.
This is an example of what I'm looking for:
df <- data.frame(cat = stringi::stri_rand_strings(100, 1, '[A-Z]'),
                 num = floor(runif(100, min = 0, max = 500)))

df <- df %>%
  dplyr::group_by(cat) %>%
  dplyr::mutate(cat_mode = Mode(num))

where Mode is the function I'm looking for.
Use table to count the items and then use which.max to find the most frequent one:

df %>%
  group_by(cat) %>%
  mutate(cat_mode = names(which.max(table(num)))) %>%
  head()

# A tibble: 6 x 3
# Groups: cat [4]
#   cat      num cat_mode
#   <fctr> <dbl> <chr>
# 1 Q      305   138
# 2 W       34.0 212
# 3 R       53.0 53
# 4 D      395   5
# 5 W      212   212
# 6 Q      417   138
# ...
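Note that names(which.max(table(num))) returns a character vector (the names of the table), which is why cat_mode prints as <chr> above. If a numeric result is needed, wrap it in as.numeric():

mutate(cat_mode = as.numeric(names(which.max(table(num)))))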
This is similar to the question Is there a built-in function for finding the mode?, whose Mode function works here:

Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

df %>%
  group_by(cat) %>%
  mutate(cat_mode = Mode(num))
# A tibble: 100 x 3
# Groups: cat [26]
   cat     num cat_mode
   <fct> <dbl>    <dbl>
 1 S        25       25
 2 V        86      478
 3 R       335      335
 4 S       288       25
 5 S       330       25
 6 Q       384      384
 7 C       313      313
 8 H       275      275
 9 K       274      274
10 J        75       75
# ... with 90 more rows
To see the mode for each factor level:

df %>%
  group_by(cat) %>%
  summarise(cat_mode = Mode(num))

# A tibble: 26 x 2
   cat   cat_mode
   <fct>    <dbl>
 1 A          480
 2 B          380
 3 C          313
 4 D          253
 5 E          202
 6 F           52
 7 G          182
 8 H          275
 9 I          356
10 J           75
# ... with 16 more rows

Differences between all possible pairs of rows for all columns within each level of factor

I want to build all possible pairs of rows in a dataframe within each level of a categorical variable name, and then take the differences between these rows, within each level of name, for all non-factor variables: row 1 - row 2, row 1 - row 3, …
set.seed(9)
df <- data.frame(
  ID = 1:10,
  name = as.factor(rep(LETTERS, each = 4)[1:10]),
  X1 = sample(1001, 10),
  X2 = sample(1001, 10),
  bool = sample(c(TRUE, FALSE), 10, replace = TRUE),
  fruit = as.factor(sample(c("Apple", "Orange", "Kiwi"), 10, replace = TRUE))
)
This is what the sample looks like:
   ID name  X1  X2  bool  fruit
1   1    A 222 118 FALSE  Apple
2   2    A  25   9  TRUE   Kiwi
3   3    A 207 883  TRUE Orange
4   4    A 216 301  TRUE   Kiwi
5   5    B 443 492 FALSE  Apple
6   6    B 134 499 FALSE   Kiwi
7   7    B 389 401  TRUE   Kiwi
8   8    B 368 972  TRUE   Kiwi
9   9    C 665 356 FALSE  Apple
10 10    C 985 488 FALSE   Kiwi
I want to get a dataframe of 13 rows which looks like:

   ID name  X1   X2 bool fruit
1 1-2    A 197  109   -1 Apple
2 1-3    A  15 -765   -1  Kiwi
…

Ideally the factor fruit would be carried over unchanged, but that is a bonus; above all I want X1 and X2 differenced and the factor name kept.
I know I could use the combn function, but I do not see how to do it. I would prefer a solution with the dplyr package and the group_by function.
I've managed to create all differences for consecutive rows with dplyr using:

varnotfac <- names(df)[!sapply(df, is.factor)]  # drop the factor variables
                                                # (but keep the logical one)
library(dplyr)
diff <- df %>%
  group_by(name) %>%
  mutate_at(varnotfac, funs(. - lead(.))) %>%
  na.omit()
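As an aside, funs() is deprecated in recent dplyr; a minimal sketch of the same step with across(), assuming dplyr >= 1.0:

diff <- df %>%
  group_by(name) %>%
  mutate(across(all_of(varnotfac), ~ .x - lead(.x))) %>%
  na.omit()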
I could not find out how to keep all variables using filter_if / filter_at, so I used select_at. Building on #Axeman's answer:

set.seed(9)
varnotfac <- names(df)[!sapply(df, is.factor)]  # names of non-factor variables
diff1 <- df %>%
  group_by(name) %>%
  select_at(vars(varnotfac)) %>%
  nest() %>%
  mutate(data = purrr::map(data, ~as.data.frame(map(.x, ~combn(., 2, base::diff))))) %>%
  unnest()
Or with the outer function, which is much faster than combn:

set.seed(9)
varnotfac <- names(df)[!sapply(df, is.factor)]  # names of non-factor variables

allpairs <- function(v) {
  y <- outer(v, v, '-')   # matrix of all pairwise differences v[i] - v[j]
  z <- y[lower.tri(y)]    # keep each pair once
  return(z)
}

diff2 <- df %>%
  group_by(name) %>%
  select_at(vars(varnotfac)) %>%
  nest() %>%
  mutate(data = purrr::map(data, ~as.data.frame(map(.x, ~allpairs(.))))) %>%
  unnest()
One can check that the dataframes obtained are the same (the column-major order in which lower.tri extracts entries matches the pair order of combn):

all.equal(diff1, diff2)
[1] TRUE
My sample looks different...

   ID name  X1  X2  bool
1   1    A 222 118 FALSE
2   2    A  25   9  TRUE
3   3    A 207 883  TRUE
4   4    A 216 301  TRUE
5   5    B 443 492 FALSE
6   6    B 134 499 FALSE
7   7    B 389 401  TRUE
8   8    B 368 972  TRUE
9   9    C 665 356 FALSE
10 10    C 985 488 FALSE
Using this, and looking here, we can do:
library(dplyr)
library(tidyr)
library(purrr)

df %>%
  group_by(name) %>%
  nest() %>%
  mutate(data = map(data, ~as.data.frame(map(.x, ~as.numeric(dist(.)))))) %>%
  unnest()
# A tibble: 13 x 5
   name     ID    X1    X2  bool
   <fct> <dbl> <dbl> <dbl> <dbl>
 1 A         1   197   109     1
 2 A         2    15   765     1
 3 A         3     6   183     1
 4 A         1   182   874     0
 5 A         2   191   292     0
 6 A         1     9   582     0
 7 B         1   309     7     0
 8 B         2    54    91     1
 9 B         3    75   480     1
10 B         1   255    98     1
11 B         2   234   473     1
12 B         1    21   571     0
13 C         1   320   132     0
This is unsigned though, since dist returns absolute distances. Alternatively:
df %>%
  group_by(name) %>%
  nest() %>%
  mutate(data = map(data, ~as.data.frame(map(.x, ~combn(., 2, diff))))) %>%
  unnest()
# A tibble: 13 x 5
   name     ID    X1    X2  bool
   <fct> <int> <int> <int> <int>
 1 A         1  -197  -109     1
 2 A         2   -15   765     1
 3 A         3    -6   183     1
 4 A         1   182   874     0
 5 A         2   191   292     0
 6 A         1     9  -582     0
 7 B         1  -309     7     0
 8 B         2   -54   -91     1
 9 B         3   -75   480     1
10 B         1   255   -98     1
11 B         2   234   473     1
12 B         1   -21   571     0
13 C         1   320   132     0

Note the sign convention: combn(., 2, diff) yields later-minus-earlier differences, so the signs are flipped relative to the row 1 - row 2 convention in the question; negate the result if you want to match it exactly.

Filter rows based on two criteria in dplyr

Sample data:

y <- c(sort(sample(0:100, 365, replace = TRUE)),
       sort(sample(0:100, 365, replace = TRUE)))
df <- data.frame(loc.id = rep(1:2, each = 365),
                 day = rep(1:365, times = 2),
                 y = y, ref.day = 250)
For each loc.id, I want to select the first row where y > 20, y > 40, y > 60, and y > 80:

df %>%
  group_by(loc.id) %>%
  dplyr::filter(any(y > 20)) %>%  # additional check
  dplyr::slice(unique(c(which.max(y > 20), which.max(y > 40),
                        which.max(y > 60), which.max(y > 80)))) %>%
  ungroup()
# A tibble: 8 x 4
  loc.id   day     y ref.day
   <int> <int> <int>   <dbl>
1      1    78    21     250
2      1   154    41     250
3      1   225    61     250
4      1   288    81     250
5      2    79    21     250
6      2   147    41     250
7      2   224    61     250
8      2   300    81     250
I want to add a further condition: if, after slicing, day > ref.day, then select the row where day equals ref.day instead.
In this case, it would look like:

# A tibble: 8 x 4
  loc.id   day     y ref.day
   <int> <int> <int>   <dbl>
1      1    78    21     250
2      1   154    41     250
3      1   225    61     250
4      1   288    81     250  # this row will not be selected; the row where day == 250 goes here instead
5      2    79    21     250
6      2   147    41     250
7      2   224    61     250
8      2   300    81     250  # this row will not be selected; the row where day == 250 goes here instead
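This last question is left unanswered above. A possible approach (a sketch, untested; y_ref is a hypothetical helper column) is to record each group's y on the reference day before slicing, then swap it in wherever the sliced day overshoots ref.day:

library(dplyr)

df %>%
  group_by(loc.id) %>%
  mutate(y_ref = y[day == ref.day][1]) %>%   # y observed on the reference day
  slice(unique(c(which.max(y > 20), which.max(y > 40),
                 which.max(y > 60), which.max(y > 80)))) %>%
  mutate(y   = ifelse(day > ref.day, y_ref, y),    # replace overshooting rows
         day = ifelse(day > ref.day, ref.day, day)) %>%
  select(-y_ref) %>%
  ungroup()

If several sliced rows in a group overshoot, they all collapse onto the ref.day row, so a final distinct() may be wanted.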
