R: Convert a complex time series dataframe to long format

This is for R
date <- seq(as.Date("2020/03/11"), as.Date("2020/03/16"), "day")
x_pos_a <- c(1, 5, 4, 9, 0)
x_pos_b <- c(2, 6, 9, 5, 4)
like so [...]
I have a timeseries dataframe with 69 time points. The rows in the dataframe are dates.
Four variables (pos, anx, ang, sad) have been measured from three populations (A, B, C). Three samples were drawn from each population (x, y, z). Currently, each combination of variable, population, and sample forms a column in the dataframe, for example "x_pos_A", "x_pos_B", "x_pos_C", "x_anx_A", ..., "z_sad_B", "z_sad_C".
I want to reshape it into the following shape:
"Date" "variables" "population" "sample" "value"
I have spent the last 3 hours searching for answers on the forum but have been unsuccessful.
Any help much appreciated!
Thanks

You can use pivot_longer from tidyr:
tidyr::pivot_longer(df,
                    cols = -date,
                    names_to = c('sample', 'variable', 'population'),
                    names_sep = '_')
#    date       sample variable population value
#    <date>     <chr>  <chr>    <chr>      <dbl>
#  1 2020-03-11 x      pos      a              1
#  2 2020-03-11 x      pos      b              2
#  3 2020-03-12 x      pos      a              5
#  4 2020-03-12 x      pos      b              6
#  5 2020-03-13 x      pos      a              4
#  6 2020-03-13 x      pos      b              9
#  7 2020-03-14 x      pos      a              9
#  8 2020-03-14 x      pos      b              5
#  9 2020-03-15 x      pos      a              0
# 10 2020-03-15 x      pos      b              4
data
date <- seq(as.Date("2020/03/11"), as.Date("2020/03/15"), "day")
x_pos_a <- c(1, 5, 4, 9, 0)
x_pos_b <- c(2, 6, 9, 5, 4)
df <- data.frame(date, x_pos_a, x_pos_b)
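If you also want the output columns to match the names in the question exactly, pivot_longer()'s values_to argument together with dplyr::rename() can do that; a minimal sketch using the df defined above:
library(tidyr)
library(dplyr)
df %>%
  pivot_longer(cols = -date,
               names_to = c('sample', 'variable', 'population'),
               names_sep = '_',
               values_to = 'value') %>%  # 'value' is the default name, spelled out for clarity
  rename(Date = date)                    # match the "Date" column name from the question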

Related

Checking if columns in dataframe are "paired"

I have a very long data frame (~10,000 rows), in which two of the columns look something like this.
A B
1 5.5
1 5.5
2 201
9 18
9 18
2 201
9 18
... ...
Just scrubbing through the data it seems that the two columns are "paired" together, but is there any way of explicitly checking this?
You want to know if value x in column A always means value y in column B?
Let's group by A and count the distinct values in B:
library(dplyr)
df <- data.frame(
  A = c(1, 1, 2, 9, 9, 2, 9),
  B = c(5.5, 5.5, 201, 18, 18, 201, 18)
)
df %>%
  group_by(A) %>%
  distinct(B) %>%
  summarize(n_unique = n())
# A tibble: 3 x 2
      A n_unique
  <dbl>    <int>
1     1        1
2     2        1
3     9        1
If we now alter df so that this is no longer true:
df <- data.frame(
  A = c(1, 1, 2, 9, 9, 2, 9),
  B = c(5.5, 5.4, 201, 18, 18, 201, 18)
)
df %>%
  group_by(A) %>%
  distinct(B) %>%
  summarize(n_unique = n())
# A tibble: 3 x 2
      A n_unique
  <dbl>    <int>
1     1        2
2     2        1
3     9        1
Observe the increased count for group 1. Since you have more than 10,000 rows, what remains is to check whether there is at least one group with n_unique > 1, for instance with filter(n_unique > 1).
If you run this you will see how many unique values of B there are for each value of A:
tapply(df$B, df$A, function(x) length(unique(x)))
If the max of this vector is 1, then there are no values of A that have more than one corresponding value of B.
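Putting the two ideas together, the whole check collapses to a single logical; a minimal sketch using the df defined above:
# TRUE if every value of A maps to exactly one value of B
all(tapply(df$B, df$A, function(x) length(unique(x))) == 1)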

R: sequential ranking (1,1,1,2,2,2,etc) based on date of entry for each patient? [duplicate]

This question already has answers here:
Numbering rows within groups in a data frame
(10 answers)
Closed 1 year ago.
I'm trying to create a column that ranks each person based on their date of entry, but since everyone's date of entry is unique, it's been challenging.
Here's a reprex:
df <- data.frame(
  unique_id = c(1, 1, 1, 2, 2, 3, 3, 3),
  date_of_entry = c("3-12-2001", "3-13-2001", "3-14-2001", "4-1-2001", "4-2-2001", "3-28-2001", "3-29-2001", "3-30-2001"))
What I want:
df_desired <- data.frame(
  unique_id = c(1, 1, 1, 2, 2, 3, 3, 3),
  date_of_entry = c("3-12-2001", "3-13-2001", "3-14-2001", "4-1-2001", "4-2-2001", "3-28-2001", "3-29-2001", "3-30-2001"),
  day_at_facility = c(1, 2, 3, 1, 2, 1, 2, 3))
Basically, I want to order the days at the facility, but I need it to restart for each unique ID. Let me know if this is not clear.
(This is a dupe of something, haven't found it yet, but in the interim ...)
base R
ave(rep(1L,nrow(df)), df$unique_id, FUN = seq_along)
# [1] 1 2 3 1 2 1 2 3
so therefore
df$day_at_facility <- ave(rep(1L,nrow(df)), df$unique_id, FUN = seq_along)
dplyr
library(dplyr)
df %>%
  group_by(unique_id) %>%
  mutate(day_at_facility = row_number())
# # A tibble: 8 x 3
# # Groups:   unique_id [3]
#   unique_id date_of_entry day_at_facility
#       <dbl> <chr>                   <int>
# 1         1 3-12-2001                   1
# 2         1 3-13-2001                   2
# 3         1 3-14-2001                   3
# 4         2 4-1-2001                    1
# 5         2 4-2-2001                    2
# 6         3 3-28-2001                   1
# 7         3 3-29-2001                   2
# 8         3 3-30-2001                   3
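Both approaches number rows in the order in which they appear, so they assume the data is already sorted by date within each ID, as in the reprex. If that is not guaranteed, a minimal sketch that sorts first (assuming the m-d-Y format shown above):
library(dplyr)
df %>%
  mutate(date_of_entry = as.Date(date_of_entry, format = "%m-%d-%Y")) %>%  # parse the character dates
  group_by(unique_id) %>%
  arrange(date_of_entry, .by_group = TRUE) %>%                             # chronological order within each ID
  mutate(day_at_facility = row_number()) %>%
  ungroup()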

Combine 2 columns into 1, duplicating the values, in R

I have this data:
data.frame(start_date = as.Date(c('2020-03-02', '2020-03-09', '2020-03-16')),
           end_date = as.Date(c('2020-03-06', '2020-03-13', '2002-03-20')),
           a = c(9, 1, 8),
           b = c(6, 5, 7))
and I want to transform it like this:
data.frame(date = as.Date(c('2020-03-02', '2020-03-09', '2020-03-16', '2020-03-06', '2020-03-13', '2002-03-20')),
           a = c(9, 1, 8, 9, 1, 8),
           b = c(6, 5, 7, 6, 5, 7))
How can I do it? Thanks!
You can use the tidyr gather function to get this.
- First, assign the data frame to an object.
- Then gather the start and end dates together with their a and b values (by excluding a and b from gather with a minus "-" sign), and name the value column "date". The output from gather looks like this:
df %>%
  gather(key, value = "date", -a, -b)
  a b        key       date
1 9 6 start_date 2020-03-02
2 1 5 start_date 2020-03-09
3 8 7 start_date 2020-03-16
4 9 6   end_date 2020-03-06
5 1 5   end_date 2020-03-13
6 8 7   end_date 2002-03-20
- For the last part, to get rid of the "key" column (start_date and end_date), select only the columns you want.
See the full code below:
library(dplyr)
library(tidyr)
df <- data.frame(start_date = as.Date(c('2020-03-02', '2020-03-09', '2020-03-16')),
                 end_date = as.Date(c('2020-03-06', '2020-03-13', '2002-03-20')),
                 a = c(9, 1, 8),
                 b = c(6, 5, 7))  # assign the data frame to an object
df1 <- df %>%
  gather(key, value = "date", -a, -b) %>%  # gather the dates
  select(date, a, b)                       # choose only what is needed
- The full code produces this output:
        date a b
1 2020-03-02 9 6
2 2020-03-09 1 5
3 2020-03-16 8 7
4 2020-03-06 9 6
5 2020-03-13 1 5
6 2002-03-20 8 7
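gather() still works but has been superseded by pivot_longer() in current tidyr; for reference, a minimal sketch of the same reshape with pivot_longer(), using the df defined above:
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(cols = c(start_date, end_date),  # stack both date columns
               values_to = "date") %>%          # name the stacked column "date"
  select(date, a, b)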

How to split a dataframe into different dataframes based on one column whose values start with some prefix?

How do I split a dataframe into different dataframes based on one column, say sensor_name, whose values start with some prefix like "RI_" or "AI_", so that I have two dataframes, one for RI and another for AI?
I have tried the following code but this works well when I pivot my dataframe.
map(set_names(c("RI", "AI","FI")),~select(temp_df,starts_with(.x),starts_with("time_stamp")))
I expect the output to have two different dataframes,
RI_df:
AI_df:
It would be great if anyone could help me with this, since I have just started working with the R programming language.
An option is split from base R
lst1 <- split(df1, substr(df1$sensor_name, 1,2))
names(lst1) <- paste0(names(lst1), "_df")
If the prefix length is variable
lst1 <- split(df1, sub("_.*", "", df1$sensor_name))
Or using tidyverse
library(dplyr)
library(stringr)
df1 %>%
  group_split(grp = str_remove(sensor_name, "_.*"), keep = FALSE)
NOTE: It is not recommended to create multiple separate objects in the global environment. For that reason, keep the data in the list and do all the analysis on that list itself.
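To illustrate working with that list directly instead of creating separate objects, a minimal sketch assuming the lst1 built by split() above (so its elements are named "RI_df", "AI_df", ...):
lst1$RI_df                 # pull out one group by name
lapply(lst1, summary)      # or run the same operation on every group at once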
Another approach from base R
df <- data.frame(sensor_name = c("R1_111", "R1_113", "A1_124", "A1_2444"),
                 A = c(1, 2, 24, 4), B = c(2, 2, 1, 2), C = c(3, 4, 4, 2))
df[grepl("R1",df$sensor_name),]
sensor_name A B C
1 R1_111 1 2 3
2 R1_113 2 2 4
df[grepl("A1",df$sensor_name),]
sensor_name A B C
3 A1_124 24 1 4
4 A1_2444 4 2 2
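One small caution with this grepl() approach, offered as a sketch: anchoring the pattern with "^" makes sure only the prefix is matched, not the same text appearing later in the sensor name:
df[grepl("^R1_", df$sensor_name), ]   # rows whose sensor_name starts with "R1_"
df[grepl("^A1_", df$sensor_name), ]   # rows whose sensor_name starts with "A1_"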
Create a variable to identify each group. After that you can subset the data to separate the groups. Functions from the stringr package can extract the relevant text from the longer sensor name.
library(stringr)
library(dplyr)
# Sample data
X <- tibble(
  sensor = c("RI_1", "RI_2", "AI_1", "AI_2"),
  A = c(1, 2, 3, 4),
  B = c(5, 6, 7, 8),
  C = c(9, 10, 11, 12)
)
# Extract text to identify groups
X <- X %>%
  mutate(prefix = str_replace(sensor, "_.*", ""))
# Subset for desired group
X %>% filter(prefix == "AI")
# A tibble: 2 x 5
  sensor     A     B     C prefix
  <chr>  <dbl> <dbl> <dbl> <chr>
1 AI_1       3     7    11 AI
2 AI_2       4     8    12 AI
# Or, split all the groups
lapply(unique(X$prefix), function(x) {
  X %>% filter(prefix == x)
})
[[1]]
# A tibble: 2 x 5
  sensor     A     B     C prefix
  <chr>  <dbl> <dbl> <dbl> <chr>
1 RI_1       1     5     9 RI
2 RI_2       2     6    10 RI
[[2]]
# A tibble: 2 x 5
  sensor     A     B     C prefix
  <chr>  <dbl> <dbl> <dbl> <chr>
1 AI_1       3     7    11 AI
2 AI_2       4     8    12 AI
Depending on what you are doing with these groups, you may do better to use group_by() from the dplyr package.
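For example, if the goal is a per-prefix summary rather than physically separate dataframes, group_by() avoids the split entirely; a minimal sketch using the X and prefix column defined above:
library(dplyr)
X %>%
  group_by(prefix) %>%                  # one group per sensor prefix
  summarize(across(c(A, B, C), mean))   # any per-group computation goes here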

R - Find a sequence of row elements based on time constraints in a dataframe

Consider the following dataframe (ordered by id and time):
df <- data.frame(id = c(rep(1, 7), rep(2, 5)),
                 event = c("a", "b", "b", "b", "a", "b", "a", "a", "a", "b", "a", "a"),
                 time = c(1, 3, 6, 12, 24, 30, 32, 1, 2, 6, 17, 24))
df
   id event time
1   1     a    1
2   1     b    3
3   1     b    6
4   1     b   12
5   1     a   24
6   1     b   30
7   1     a   42
8   2     a    1
9   2     a    2
10  2     b    6
11  2     a   17
12  2     a   24
I want to count how many times a given sequence of events appears in each "id" group. Consider the following sequence with time constraints:
seq <- c("a", "b", "a")
time_LB <- c(0, 2, 12)
time_UB <- c(Inf, 8, 18)
It means that event "a" can start at any time, event "b" must start no earlier than 2 and no later than 8 after event "a", another event "a" must start no earlier than 12 and no later than 18 after event "b".
Some rules for creating sequences:
Events don't need to be consecutive with respect to "time" column. For example, seq can be constructed from rows 1, 3, and 5.
To be counted, sequences must have different first events. For example, if the sequence formed by rows 8, 10, and 11 is counted, then the sequence formed by rows 8, 10, and 12 must not be counted.
The events may be included in many constructed sequences if they do not violate the second rule. For example, we count both sequences: rows 1, 3, 5 and rows 5, 6, 7.
The expected result:
df1
  id count
1  1     2
2  2     2
There are some related questions in R - Identify a sequence of row elements by groups in a dataframe and Finding rows in R dataframe where a column value follows a sequence.
Is there a way to solve the problem using "dplyr"?
I believe this is what you're looking for. It gives you the desired output. Note that there is a typo in your original question where you have a 32 instead of a 42 when you define the time column in df. I say this is a typo because it doesn't match your output immediately below the definition of df. I changed the 32 to a 42 in the code below.
library(dplyr)
df <- data.frame(id = c(rep(1, 7), rep(2, 5)),
                 event = c("a", "b", "b", "b", "a", "b", "a", "a", "a", "b", "a", "a"),
                 time = c(1, 3, 6, 12, 24, 30, 42, 1, 2, 6, 17, 24))
seq <- c("a", "b", "a")
time_LB <- c(0, 2, 12)
time_UB <- c(Inf, 8, 18)
df %>%
  full_join(df, by = 'id', suffix = c('1', '2')) %>%
  full_join(df, by = 'id') %>%
  rename(event3 = event, time3 = time) %>%
  filter(event1 == seq[1] & event2 == seq[2] & event3 == seq[3]) %>%
  filter(time1 %>% between(time_LB[1], time_UB[1])) %>%
  filter((time2 - time1) %>% between(time_LB[2], time_UB[2])) %>%
  filter((time3 - time2) %>% between(time_LB[3], time_UB[3])) %>%
  group_by(id, time1) %>%
  slice(1) %>% # slice 1 row for each unique id and time1 (so no duplicate time1s)
  group_by(id) %>%
  count()
Here's the output:
# A tibble: 2 x 2
     id     n
  <dbl> <int>
1     1     2
2     2     2
Also, if you omit the last 2 parts of the dplyr pipe that do the counting (to see the sequences it is matching), you get the following sequences:
Source: local data frame [4 x 7]
Groups: id, time1 [4]
     id event1 time1 event2 time2 event3 time3
  <dbl> <fctr> <dbl> <fctr> <dbl> <fctr> <dbl>
1     1      a     1      b     6      a    24
2     1      a    24      b    30      a    42
3     2      a     1      b     6      a    24
4     2      a     2      b     6      a    24
EDIT IN RESPONSE TO COMMENT REGARDING GENERALIZING THIS: Yes, it is possible to generalize this to arbitrary-length sequences, but it requires some R voodoo. Most notably, note the use of Reduce, which allows you to apply a common function across a list of objects, as well as foreach, which I'm borrowing from the foreach package to do some arbitrary looping. Here's the code:
library(dplyr)
library(foreach)
df <- data.frame(id = c(rep(1, 7), rep(2, 5)),
                 event = c("a", "b", "b", "b", "a", "b", "a", "a", "a", "b", "a", "a"),
                 time = c(1, 3, 6, 12, 24, 30, 42, 1, 2, 6, 17, 24))
seq <- c("a", "b", "a")
time_LB <- c(0, 2, 12)
time_UB <- c(Inf, 8, 18)
multi_full_join = function(df1, df2) {full_join(df1, df2, by = 'id')}
df_list = foreach(i = 1:length(seq)) %do% {df}
df2 = Reduce(multi_full_join, df_list)
names(df2)[grep('event', names(df2))] = paste0('event', seq_along(seq))
names(df2)[grep('time', names(df2))] = paste0('time', seq_along(seq))
df2 = df2 %>% mutate_if(is.factor, as.character)
df2 = df2 %>%
  mutate(seq_string = Reduce(paste0, df2 %>% select(grep('event', names(df2))) %>% as.list)) %>%
  filter(seq_string == paste0(seq, collapse = ''))
time_diff = df2 %>% select(grep('time', names(df2))) %>%
  t %>%
  as.data.frame() %>%
  lapply(diff) %>%
  unlist %>% matrix(ncol = 2, byrow = TRUE) %>%
  as.data.frame
foreach(i = seq_along(time_diff), .combine = data.frame) %do%
  {
    time_diff[[i]] %>% between(time_LB[i + 1], time_UB[i + 1])
  } %>%
  Reduce(`&`, .) %>%
  which %>%
  slice(df2, .) %>%
  filter(time1 %>% between(time_LB[1], time_UB[1])) %>% # deal with time1 bounds, which we skipped over earlier
  group_by(id, time1) %>%
  slice(1) # slice 1 row for each unique id and time1 (so no duplicate time1s)
This outputs the following:
Source: local data frame [4 x 8]
Groups: id, time1 [4]
     id event1 time1 event2 time2 event3 time3 seq_string
  <dbl>  <chr> <dbl>  <chr> <dbl>  <chr> <dbl>      <chr>
1     1      a     1      b     6      a    24        aba
2     1      a    24      b    30      a    42        aba
3     2      a     1      b     6      a    24        aba
4     2      a     2      b     6      a    24        aba
If you want just the counts, you can group_by(id) then count() as in the original code snippet.
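For completeness, a minimal sketch of that counting step, assuming the grouped result of the generalized pipeline above has been saved to an object called matches (a name introduced here only for illustration):
matches %>%
  group_by(id) %>%
  count()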
Perhaps it's easier to represent event sequences as strings and use regex:
# build one string per id: position i holds the event that occurred at time i, '-' otherwise
df.str = lapply(split(df, df$id), function(d) {
  z = rep('-', tail(d, 1)$time); z[d$time] = as.character(d$event); z })
df.str = lapply(df.str, paste, collapse = '')
# > df.str
# $`1`
# [1] "a-b--b-----b-----------a-----b-----------a"
#
# $`2`
# [1] "aa---b----------a------a"
# the lookahead (?=...) counts overlapping matches; 1-7 filler characters correspond to a gap of
# 2-8 between "a" and "b", and 11-17 filler characters to a gap of 12-18 between "b" and the final "a"
df1 = lapply(df.str, function(s) length(gregexpr('(?=a.{1,7}b.{11,17}a)', s, perl = T)[[1]]))
data.frame(id = names(df1), count = unlist(df1))
#   id count
# 1  1     2
# 2  2     2
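One caveat with the gregexpr() count, offered as a sketch: when a string has no matches, gregexpr() returns -1 and length() would still report 1, so a safer count keeps only positive match positions:
df1 = lapply(df.str, function(s) {
  m = gregexpr('(?=a.{1,7}b.{11,17}a)', s, perl = T)[[1]]
  sum(m > 0)   # 0 when gregexpr returns -1 (no matches)
})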
