Splitting panel data rows in R

I have a dataset that has rows I would like to split. Is there a simple way to do this?
data = data.frame(id = 111, t1 = 277, t2 = 385, meds = 1)

I am trying to use a conditional that lets me split rows and create an output similar to this:

data = data.frame(id = 111, t1 = c(277, 366), t2 = c(365, 385), meds = 1)

I think you can just do a little row-wise summary using dplyr:

library(dplyr)

data %>%
  rowwise() %>%
  summarize(id,
            t1 = if (t1 < 365 & t2 > 365) c(t1, 366) else t1,
            t2 = if (t1 < 365 & t2 > 365) c(365, t2) else t2,
            meds)
#> # A tibble: 2 x 4
#>      id    t1    t2  meds
#>   <dbl> <dbl> <dbl> <dbl>
#> 1   111   277   365     1
#> 2   111   366   385     1
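Note: returning more than one row per group from summarize() was deprecated in dplyr 1.1.0 in favor of reframe(). A minimal sketch of the same idea, assuming dplyr >= 1.1.0 (untested):

library(dplyr)  # assuming dplyr >= 1.1.0

data %>%
  rowwise() %>%
  reframe(id,
          t1 = if (t1 < 365 & t2 > 365) c(t1, 366) else t1,
          t2 = if (t1 < 365 & t2 > 365) c(365, t2) else t2,
          meds)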

I used the group_split function from dplyr:

## Loading the required libraries
library(dplyr)
library(tidyverse)

## Creating the dataframe
df <- data.frame(
  t1 = c(1:600),
  t2 = c(200:799)
)

## Conditional column
df1 <- df %>%
  mutate(DataframeNo = ifelse(t1 < 365 & t2 > 365, "2 dfs", "1 df")) %>%
  group_by(DataframeNo)

## Get the first dataframe
group_split(df1)[[1]]

## Get the second dataframe
group_split(df1)[[2]]
Output

> group_split(df1)[[1]]
# A tibble: 402 x 3
      t1    t2 DataframeNo
   <int> <int> <chr>
 1     1   200 1 df
 2     2   201 1 df
 3     3   202 1 df
 4     4   203 1 df
 5     5   204 1 df
 6     6   205 1 df
 7     7   206 1 df
 8     8   207 1 df
 9     9   208 1 df
10    10   209 1 df
# ... with 392 more rows

> ## Get the second dataframe
> group_split(df1)[[2]]
# A tibble: 198 x 3
      t1    t2 DataframeNo
   <int> <int> <chr>
 1   167   366 2 dfs
 2   168   367 2 dfs
 3   169   368 2 dfs
 4   170   369 2 dfs
 5   171   370 2 dfs
 6   172   371 2 dfs
 7   173   372 2 dfs
 8   174   373 2 dfs
 9   175   374 2 dfs
10   176   375 2 dfs
# ... with 188 more rows

Related

Inexact joining data based on greater equal condition

I have some values in df:

# A tibble: 7 × 1
   var1
  <dbl>
1     0
2    10
3    20
4   210
5   230
6   266
7   267

that I would like to compare to a second dataframe called value_lookup:

# A tibble: 4 × 2
   var1 value
  <dbl> <dbl>
1     0     0
2   200    10
3   230    20
4   260    30

In particular, I would like to make a join based on >=, meaning that a value greater than or equal to the number in var1 gets the corresponding value. E.g. take the number 210 of the original dataframe: since it is >= 200 and < 230, it would get a value of 10.
Here is the expected output:

  var1 value
1    0     0
2   10     0
3   20     0
4  210    10
5  230    20
6  266    30
7  267    30
I thought it should be doable using {fuzzyjoin} but I cannot get it done.

value_lookup <- tibble(var1 = c(0, 200, 230, 260),
                       value = c(0, 10, 20, 30))
df <- tibble(var1 = c(0, 10, 20, 210, 230, 266, 267))

library(fuzzyjoin)
fuzzyjoin::fuzzy_left_join(
  x = df,
  y = value_lookup,
  by = "var1",
  match_fun = list(`>=`)
)
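The {fuzzyjoin} attempt above is close: with match_fun = list(`>=`), every lookup row whose threshold is met gets matched, so each row of df can come back several times. A possible way to finish it, keeping only the largest matched threshold per row (a sketch; fuzzy_left_join suffixes the duplicated column name as var1.x/var1.y):

library(dplyr)
library(fuzzyjoin)

fuzzy_left_join(
  x = df,
  y = value_lookup,
  by = "var1",
  match_fun = list(`>=`)
) %>%
  group_by(var1.x) %>%            # one group per original value
  slice_max(var1.y, n = 1) %>%    # keep the largest threshold that matched
  ungroup() %>%
  select(var1 = var1.x, value)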
An option is also findInterval, which returns for each element of df$var1 the index of the last value in value_lookup$var1 that does not exceed it; that index is then used to pick the matching value:

df$value <- value_lookup$value[findInterval(df$var1, value_lookup$var1)]
Output:

  var1 value
1    0     0
2   10     0
3   20     0
4  210    10
5  230    20
6  266    30
7  267    30
As you're mentioning joins, you could also do a rolling join via data.table with the argument roll = TRUE, which looks for the same or closest value preceding var1 in your df:

library(data.table)
setDT(value_lookup)[setDT(df), on = 'var1', roll = TRUE]
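With the sample data above this should reproduce the expected output (sketched below; the exact print format depends on the data.table version):

#    var1 value
# 1:    0     0
# 2:   10     0
# 3:   20     0
# 4:  210    10
# 5:  230    20
# 6:  266    30
# 7:  267    30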
You can use cut; the breaks c(value_lookup$var1, Inf) bin each var1 into its threshold interval, and the resulting factor's integer codes index into value:

df$value <- value_lookup$value[cut(df$var1,
                                   c(value_lookup$var1, Inf),
                                   right = FALSE)]
# # A tibble: 7 x 2
#    var1 value
#   <dbl> <dbl>
# 1     0     0
# 2    10     0
# 3    20     0
# 4   210    10
# 5   230    20
# 6   266    30
# 7   267    30

How to split a dataframe into a list of dataframes based on distinct value ranges

I want to split a dataframe into a list of dataframes based on distinct ranges of a numeric variable.
ILLUSTRATIVE DATA:
set.seed(123)
df <- data.frame(
  subject = LETTERS[1:10],
  weight = sample(1:1000, 10)
)
df
   subject weight
1        A    288
2        B    788
3        C    409
4        D    881
5        E    937
6        F     46
7        G    525
8        H    887
9        I    548
10       J    453
I'd like to have a list of 4 smaller dataframes based on these limits of the variable weight:
limits <- c(250, 500, 750, 1000)
That is, what I'm after, in the list of dataframes, is one dataframe where weight is in the range 0-250, another where weight ranges between 251-500, another where the range is 501-750, and so on. In other words, the ranges are distinct.
What I've tried so far is this dplyr solution, which outputs a list of 5 dataframes, but with cumulative ranges:
limits <- c(250, 500, 750, 1000)
lapply(limits, function(x) {df %>% filter(weight <= x)})
[[1]]
[1] subject weight
<0 rows> (or 0-length row.names)

[[2]]
  subject weight
1       F     46

[[3]]
  subject weight
1       A    288
2       C    409
3       F     46
4       J    453

[[4]]
  subject weight
1       A    288
2       C    409
3       F     46
4       G    525
5       I    548
6       J    453

[[5]]
   subject weight
1        A    288
2        B    788
3        C    409
4        D    881
5        E    937
6        F     46
7        G    525
8        H    887
9        I    548
10       J    453
How could this code be fixed, or which other code can be used, so that a list of dataframes is obtained based on distinct weight ranges?
Perhaps:

library(dplyr)
df %>%
  group_split(group = findInterval(weight, limits))
Output:
[[1]]
# A tibble: 4 x 3
  subject weight group
  <fct>    <int> <int>
1 C          179     0
2 E          195     0
3 H          118     0
4 J          229     0

[[2]]
# A tibble: 3 x 3
  subject weight group
  <fct>    <int> <int>
1 A          415     1
2 B          463     1
3 I          299     1

[[3]]
# A tibble: 1 x 3
  subject weight group
  <fct>    <int> <int>
1 D          526     2

[[4]]
# A tibble: 2 x 3
  subject weight group
  <fct>    <int> <int>
1 F          938     3
2 G          818     3
Just pass .keep = FALSE (keep = FALSE in older dplyr versions) as an additional argument to group_split if you want to remove the group column from the output.
A base R one-liner can split the data by limits:

split(df, findInterval(df$weight, limits))
#$`0`
#   subject weight
#3        C    179
#5        E    195
#8        H    118
#10       J    229
#
#$`1`
#  subject weight
#1       A    415
#2       B    463
#9       I    299
#
#$`2`
#  subject weight
#4       D    526
#
#$`3`
#  subject weight
#6       F    938
#7       G    818
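If readable range labels are wanted instead of the 0-3 bin indices, cut can name the list elements; a small sketch (the labels are illustrative):

split(df, cut(df$weight, c(0, limits),
              labels = c("0-250", "251-500", "501-750", "751-1000")))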

Get most frequently occurring factor level in dplyr piping structure

I'd like to be able to find the most frequently occurring level in a factor in a dataset while using dplyr's piping structure. I'm trying to create a new variable that contains the 'modal' factor level when being grouped by another variable.
This is an example of what I'm looking for:
df <- data.frame(cat = stringi::stri_rand_strings(100, 1, '[A-Z]'),
                 num = floor(runif(100, min = 0, max = 500)))

df <- df %>%
  dplyr::group_by(cat) %>%
  dplyr::mutate(cat_mode = Mode(num))

where Mode is the function I'm looking for.
Use table to count the items and then use which.max to find the most frequent one:

df %>%
  group_by(cat) %>%
  mutate(cat_mode = names(which.max(table(num)))) %>%
  head()

# A tibble: 6 x 3
# Groups: cat [4]
#   cat      num cat_mode
#   <fctr> <dbl> <chr>
# 1 Q      305   138
# 2 W       34.0 212
# 3 R       53.0 53
# 4 D      395   5
# 5 W      212   212
# 6 Q      417   138
# ...
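Note that names(which.max(table(num))) returns a character vector (the names of the table), which is why cat_mode prints as <chr> above. If a numeric result is needed, wrap it in as.numeric():

mutate(cat_mode = as.numeric(names(which.max(table(num)))))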
This is similar to the question Is there a built-in function for finding the mode?, whose Mode function works here:

Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

df %>%
  group_by(cat) %>%
  mutate(cat_mode = Mode(num))
# A tibble: 100 x 3
# Groups: cat [26]
   cat     num cat_mode
   <fct> <dbl>    <dbl>
 1 S        25       25
 2 V        86      478
 3 R       335      335
 4 S       288       25
 5 S       330       25
 6 Q       384      384
 7 C       313      313
 8 H       275      275
 9 K       274      274
10 J        75       75
# ... with 90 more rows
To see the mode for each factor level:

df %>%
  group_by(cat) %>%
  summarise(cat_mode = Mode(num))

# A tibble: 26 x 2
   cat   cat_mode
   <fct>    <dbl>
 1 A          480
 2 B          380
 3 C          313
 4 D          253
 5 E          202
 6 F           52
 7 G          182
 8 H          275
 9 I          356
10 J           75
# ... with 16 more rows

Differences between all possible pairs of rows for all columns within each level of factor

I want to build all possible pairs of rows in a dataframe within each level of a categorical variable name, and then take the differences between these rows, within each level of name, for all non-factor variables: row 1 - row 2, row 1 - row 3, …
set.seed(9)
df <- data.frame(
  ID = 1:10,
  name = as.factor(rep(LETTERS, each = 4)[1:10]),
  X1 = sample(1001, 10),
  X2 = sample(1001, 10),
  bool = sample(c(TRUE, FALSE), 10, replace = TRUE),
  fruit = as.factor(sample(c("Apple", "Orange", "Kiwi"), 10, replace = TRUE))
)
This is what the sample looks like:
   ID name  X1  X2  bool  fruit
1   1    A 222 118 FALSE  Apple
2   2    A  25   9  TRUE   Kiwi
3   3    A 207 883  TRUE Orange
4   4    A 216 301  TRUE   Kiwi
5   5    B 443 492 FALSE  Apple
6   6    B 134 499 FALSE   Kiwi
7   7    B 389 401  TRUE   Kiwi
8   8    B 368 972  TRUE   Kiwi
9   9    C 665 356 FALSE  Apple
10 10    C 985 488 FALSE   Kiwi
I want to get a dataframe of 13 rows which looks like:

   ID name  X1   X2 bool fruit
1 1-2    A 197  109   -1 Apple
2 1-3    A  15 -765   -1  Kiwi
…

Ideally the factor fruit would be carried over unchanged, but that is a bonus; above all I want X1 and X2 differenced and the factor name kept.
I know I could use the combn function, but I do not see how to do it. I would prefer a solution with the dplyr package and the group_by function.
I've managed to create all differences for consecutive rows with dplyr using:

varnotfac <- names(df)[!sapply(df, is.factor)]  # drop the factor variables
                                                # (but keep the logical one)
library(dplyr)
diff <- df %>%
  group_by(name) %>%
  mutate_at(varnotfac, funs(. - lead(.))) %>%
  na.omit()
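As an aside, funs() is deprecated in recent dplyr; a minimal sketch of the same step with across(), assuming dplyr >= 1.0:

diff <- df %>%
  group_by(name) %>%
  mutate(across(all_of(varnotfac), ~ .x - lead(.x))) %>%
  na.omit()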
I could not find out how to keep all variables using filter_if / filter_at, so I used select_at. Building on #Axeman's answer:

set.seed(9)
varnotfac <- names(df)[!sapply(df, is.factor)]  # names of non-factor variables
diff1 <- df %>%
  group_by(name) %>%
  select_at(vars(varnotfac)) %>%
  nest() %>%
  mutate(data = purrr::map(data, ~as.data.frame(map(.x, ~combn(., 2, base::diff))))) %>%
  unnest()
Or with the outer function, which is much faster than combn:

set.seed(9)
varnotfac <- names(df)[!sapply(df, is.factor)]  # names of non-factor variables

allpairs <- function(v) {
  y <- outer(v, v, '-')   # matrix of all pairwise differences v[i] - v[j]
  z <- y[lower.tri(y)]    # keep each pair once
  return(z)
}

diff2 <- df %>%
  group_by(name) %>%
  select_at(vars(varnotfac)) %>%
  nest() %>%
  mutate(data = purrr::map(data, ~as.data.frame(map(.x, ~allpairs(.))))) %>%
  unnest()
One can check that the dataframes obtained are the same (the column-major order in which lower.tri extracts entries matches the pair order of combn):

all.equal(diff1, diff2)
[1] TRUE
My sample looks different...

   ID name  X1  X2  bool
1   1    A 222 118 FALSE
2   2    A  25   9  TRUE
3   3    A 207 883  TRUE
4   4    A 216 301  TRUE
5   5    B 443 492 FALSE
6   6    B 134 499 FALSE
7   7    B 389 401  TRUE
8   8    B 368 972  TRUE
9   9    C 665 356 FALSE
10 10    C 985 488 FALSE
Using this, and looking here, we can do:
library(dplyr)
library(tidyr)
library(purrr)

df %>%
  group_by(name) %>%
  nest() %>%
  mutate(data = map(data, ~as.data.frame(map(.x, ~as.numeric(dist(.)))))) %>%
  unnest()
# A tibble: 13 x 5
   name     ID    X1    X2  bool
   <fct> <dbl> <dbl> <dbl> <dbl>
 1 A         1   197   109     1
 2 A         2    15   765     1
 3 A         3     6   183     1
 4 A         1   182   874     0
 5 A         2   191   292     0
 6 A         1     9   582     0
 7 B         1   309     7     0
 8 B         2    54    91     1
 9 B         3    75   480     1
10 B         1   255    98     1
11 B         2   234   473     1
12 B         1    21   571     0
13 C         1   320   132     0
This is unsigned though, since dist returns absolute distances. Alternatively:
df %>%
  group_by(name) %>%
  nest() %>%
  mutate(data = map(data, ~as.data.frame(map(.x, ~combn(., 2, diff))))) %>%
  unnest()
# A tibble: 13 x 5
   name     ID    X1    X2  bool
   <fct> <int> <int> <int> <int>
 1 A         1  -197  -109     1
 2 A         2   -15   765     1
 3 A         3    -6   183     1
 4 A         1   182   874     0
 5 A         2   191   292     0
 6 A         1     9  -582     0
 7 B         1  -309     7     0
 8 B         2   -54   -91     1
 9 B         3   -75   480     1
10 B         1   255   -98     1
11 B         2   234   473     1
12 B         1   -21   571     0
13 C         1   320   132     0

Note the sign convention: combn(., 2, diff) yields later-minus-earlier differences, so the signs are flipped relative to the row 1 - row 2 convention in the question; negate the result if you want to match it exactly.

Filter rows based on two criteria in dplyr

Sample data:

y <- c(sort(sample(0:100, 365, replace = TRUE)),
       sort(sample(0:100, 365, replace = TRUE)))
df <- data.frame(loc.id = rep(1:2, each = 365),
                 day = rep(1:365, times = 2),
                 y = y, ref.day = 250)
For each loc.id, I want to select the first row where y > 20, y > 40, y > 60, and y > 80:

df %>%
  group_by(loc.id) %>%
  dplyr::filter(any(y > 20)) %>%  # additional check
  dplyr::slice(unique(c(which.max(y > 20), which.max(y > 40),
                        which.max(y > 60), which.max(y > 80)))) %>%
  ungroup()
# A tibble: 8 x 4
  loc.id   day     y ref.day
   <int> <int> <int>   <dbl>
1      1    78    21     250
2      1   154    41     250
3      1   225    61     250
4      1   288    81     250
5      2    79    21     250
6      2   147    41     250
7      2   224    61     250
8      2   300    81     250
I want to add a further condition: if, after slicing, day > ref.day, then select the row where day equals ref.day instead.
In this case, it would look like:

# A tibble: 8 x 4
  loc.id   day     y ref.day
   <int> <int> <int>   <dbl>
1      1    78    21     250
2      1   154    41     250
3      1   225    61     250
4      1   288    81     250  # this row will not be selected; the row where day == 250 goes here instead
5      2    79    21     250
6      2   147    41     250
7      2   224    61     250
8      2   300    81     250  # this row will not be selected; the row where day == 250 goes here instead
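This last question is left unanswered above. A possible approach (a sketch, untested; y_ref is a hypothetical helper column) is to record each group's y on the reference day before slicing, then swap it in wherever the sliced day overshoots ref.day:

library(dplyr)

df %>%
  group_by(loc.id) %>%
  mutate(y_ref = y[day == ref.day][1]) %>%   # y observed on the reference day
  slice(unique(c(which.max(y > 20), which.max(y > 40),
                 which.max(y > 60), which.max(y > 80)))) %>%
  mutate(y   = ifelse(day > ref.day, y_ref, y),    # replace overshooting rows
         day = ifelse(day > ref.day, ref.day, day)) %>%
  select(-y_ref) %>%
  ungroup()

If several sliced rows in a group overshoot, they all collapse onto the ref.day row, so a final distinct() may be wanted.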
