In a series of series, how to subtract every 1st number in each sub-series event from every nth number in those events? - r

I have multiple series of timepoints. Some series have five timepoints, others have ten or fifteen timepoints. The series are in multiples of five because the event I am measuring is always five timepoints long; some recordings have multiple events in succession. For instance:
Series 1:
0
77
98
125
174
Series 2:
0
69
95
117
179
201
222
246
277
293
0 marks the beginning of each series. Series 1 is a single event, but Series 2 is two events in succession. The 6th timepoint in Series 2 is the start of the second event in that series.
I have an R dataframe that contains every timepoint in one column:
dd <- data.frame(
timepoint=c(0, 77, 98, 125, 174,
0, 69, 95, 117, 179, 201, 222, 246, 277, 293)
)
I need to know the duration from the start of each event to the 4th timepoint in each event. For the above data, that means:
Duration 1: 125 - 0 = 125
Duration 2: 117 - 0 = 117
Duration 3: 277 - 201 = 76
How can I write a simple piece of R code that will tell me the duration of that interval regardless of how many series or events there are, i.e. regardless of how many numbers are in the column?
I tried using diff() and seq_along(), but that seems only useful for every nth number, which doesn't work in this case.
diff(vec[seq_along(vec) %% 4 == 1])

Here is one way to do it with dplyr. We break the data up into "runs", which reset at each 0, and then into "sequences", which reset every 5 values within a run.
library(dplyr)

dd %>%
  group_by(run = cumsum(timepoint == 0)) %>%     # a new run starts at every 0
  mutate(seq = (row_number() - 1) %/% 5 + 1) %>% # event number within the run
  group_by(run, seq) %>%
  summarize(diff = timepoint[4] - timepoint[1])
#     run   seq  diff
#   <int> <dbl> <dbl>
# 1     1     1   125
# 2     2     1   117
# 3     2     2    76
It makes it somewhat easy to tie the value back to where it came from.
If you just wanted to use indexing, here's a helper function
diff4v1 <- function(x) {
  # position of each value within its five-timepoint event (1 to 5)
  idx <- (seq_along(x) - 1) %% 5 + 1
  x[idx == 4] - x[idx == 1]
}
diff4v1(dd$timepoint)
# [1] 125 117 76
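Because every event is exactly five timepoints long, the same calculation can also be sketched by reshaping the column into a matrix with one event per column. This is only a sketch, and it assumes the column length is always a multiple of five, as described in the question; the names `m` and `durations` are made up for illustration.

```r
# Data from the question
timepoint <- c(0, 77, 98, 125, 174,
               0, 69, 95, 117, 179, 201, 222, 246, 277, 293)

# Assumption: length(timepoint) is a multiple of 5, one event per block of 5
stopifnot(length(timepoint) %% 5 == 0)

# matrix() fills column by column, so each column holds one event
m <- matrix(timepoint, nrow = 5)

# 4th timepoint minus 1st timepoint of every event
durations <- m[4, ] - m[1, ]
durations
```

With this layout, any other interval (say 5th minus 2nd) is just a different pair of row indices.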

This is your data frame (hypothetical)
df = data.frame(series = round(rnorm(40, 100, 50)))
head(df)
series
1 16
2 35
3 75
4 125
5 190
6 85
And these are your differences
idx = c(1:nrow(df))
df[which(idx %% 5 == 4), "series"] - df[which(idx %% 5 == 1), "series"]
[1] 109 -38 -101 -47 34 -52 -63 -5

Related

How can I differentiate between months using pairs in R?

data(airquality)
a=airquality
convert_fahr_to_kelvin <- function(temp) {
kelvin <- ((temp - 32) * (5 / 9)) + 273.15
return(kelvin)
}
a[,4]=
convert_fahr_to_kelvin(a[,4])
oz=a[,1]
sr=a[,2]
wv=a[,3]
te=a[,4]
pairs(~oz+sr+wv+te,
col = c("orange") ,
pch = c(18),
labels = c("Ozono", "Irradiancia Solar", "Velocidad del viento","Temperatura"),
main = "Diagramas de dispersión por parejas")
This is the plot that I get.
This is what I am doing, but I would actually like to differentiate between months: for example, the first 31 rows of my matrix a, across all columns, would be colored green, and likewise for each following month. I tried to separate the rows into groups using group:
group <- NA
group[sr[1:31]]<-1
group[sr[32:61]]<-2
group[sr[62:92]]<-3
group[sr[93:123]]<-4
group[sr[124:153]]<-5
group[sr[1:31]]
group[sr[32:61]]
group[sr[62:92]]
group[sr[93:123]]
group[sr[124:153]]
Here the numbers are repeated.
What I get is that rows with the same value in a column end up in the same group. I have tried to solve it in other ways, but I still don't get what I want.
It is easier to create a group with gl
group <- as.integer(gl(length(sr), 31, length(sr)))
table(group)
#group
#1 2 3 4 5
#31 31 31 31 29
In the OP's code, 'group' is initialized as NA of length 1. It is then assigned at the positions given by the values of 'sr', instead of simply by row position:
group <- integer(length(sr))
group[1:31] <- 1
group[32:61] <- 2
...
whereas if we use sr values as index
sr[1:31]
#[1] 190 118 149 313 NA NA 299 99 19 194 NA 256 290 274 65 334 307 78 322 44 8 320 25 92 66 266 NA 13 252 223 279
then group values that are changed to 1 are at positions 190, 118, 149, 313, ....
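Since airquality already has a Month column, the grouping can also be taken from the data rather than from fixed row ranges. The following is a minimal sketch under that assumption; the colour choices and the name `month_cols` are arbitrary and only for illustration.

```r
data(airquality)

# One colour per month (May to September); the colours are an arbitrary choice
month_cols <- c("green", "blue", "orange", "red", "purple")

# Map each row's month (5..9) to a group index 1..5
group <- as.integer(factor(airquality$Month))

# Colour each point by the month of its row
pairs(airquality[, 1:4],
      col = month_cols[group],
      pch = 18,
      main = "Pairwise scatter plots by month")
```

Note that the months in this dataset do not all have 31 rows, so a grouping taken from Month will differ slightly from one built in blocks of 31.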

If() statement in R

I am not very experienced with if statements and loops in R.
Perhaps you can help me solve my problem.
My task is to add +1 to df$fz if sum(df$fz) < 450, but at the same time I have to add +1 only to the max values in df$fz, and only for as long as sum(df$fz) is lower than 450.
Here is my df
ID_PP <- c(3,6, 22, 30, 1234456)
z <- c(12325, 21698, 21725, 8378, 18979)
fz <- c(134, 67, 70, 88, 88)
df <- data.frame(ID_PP,z,fz)
After mutating the new column df$new_value, it should look like 134 68 71 88 89
At this moment I have this code, but it adds +1 to all values.
if (sum(df$fz ) < 450) {
mutate(df, new_value=fz+1)
}
I know that I can pick top_n(3, z) and add +1 only to this top, but it is not what I want, because in that case I have to pick a top manually after checking sum(df$fz)
From what I understood from @Oksana's question and comments, we can probably do it this way:
library(tidyverse)
# data
vru <- data.frame(
id = c(3, 6, 22, 30, 1234456),
z = c(12325, 21698, 21725, 8378, 18979),
fz = c(134, 67, 70, 88, 88)
)
# solution
vru %>% #
top_n(450 - sum(fz), z) %>% # subset by top z, if sum(fz) == 450 -> NULL
mutate(fz = fz + 1) %>% # increase fz by 1 for the subset
bind_rows( #
anti_join(vru, ., by = "id"), # take rows from vru which are not in subset
. # take subset with transformed fz
) %>% # bind those subsets
arrange(id) # sort rows by id
# output
id z fz
1 3 12325 134
2 6 21698 68
3 22 21725 71
4 30 8378 88
5 1234456 18979 89
The clarifications in the comments helped. Let me know if this works for you. Of course, you can drop the cumsum_fz and leftover columns.
# Making variables to use in the calculation
df <- df %>%
arrange(fz) %>%
mutate(cumsum_fz = cumsum(fz),
leftover = 450 - cumsum_fz)
# Find the minimum, non-negative value to use for select values that need +1
min_pos <- min(df$leftover[df$leftover > 0])
# Creating a vector that adds 1 using the min_pos value and keeps
# the other values the same
df$new_value <- c((head(sort(df$fz), min_pos) + 1), tail(sort(df$fz), length(df$fz) - min_pos))
# Checking the sum of the new value
> sum(df$new_value)
[1] 450
>
> df
ID_PP z fz cumsum_fz leftover new_value
1 6 21698 67 67 383 68
2 22 21725 70 137 313 71
3 30 8378 88 225 225 89
4 1234456 18979 88 313 137 88
5 3 12325 134 447 3 134
EDIT:
Because utubun already posted a great tidyverse solution, I am going to translate my first one completely to base (it was a bit sloppy to mix the two anyway). Same logic as above, and using the data OP provided.
> # Using base
> df <- df[order(df$fz), ]
>
> leftover <- 450 - cumsum(df$fz)
> min_pos <- min(leftover[leftover > 0])
> df$new_value <- c((head(sort(df$fz), min_pos) + 1), tail(sort(df$fz), length(df$fz) - min_pos))
>
> sum(df$new_value)
[1] 450
> df
ID_PP z fz new_value
2 6 21698 67 68
3 22 21725 70 71
4 30 8378 88 89
5 1234456 18979 88 88
1 3 12325 134 134
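The desired output in the question (134 68 71 88 89) corresponds to adding 1 to the rows with the largest z until the total reaches 450. A minimal base R sketch of that reading follows. It assumes the shortfall is never larger than the number of rows, so each selected row gets at most one increment; the names `target`, `shortfall` and `top_rows` are made up for illustration.

```r
df <- data.frame(
  ID_PP = c(3, 6, 22, 30, 1234456),
  z     = c(12325, 21698, 21725, 8378, 18979),
  fz    = c(134, 67, 70, 88, 88)
)

target <- 450

# How many +1 increments are needed (0 if the sum already reaches the target)
shortfall <- max(target - sum(df$fz), 0)

# Assumption: at most one increment per row
k <- min(shortfall, nrow(df))

# Row positions of the k largest z values
top_rows <- order(df$z, decreasing = TRUE)[seq_len(k)]

df$new_value <- df$fz
df$new_value[top_rows] <- df$new_value[top_rows] + 1
df
```

Because the row order is never changed, the new column lines up with the original IDs without any re-sorting.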

dplyr mutate column with nearest value in external list

I'm trying to mutate a column and populate it with exact matches from a list if those occur, and if not, the closest match possible.
My data frame looks like this:
index <- seq(1, 10, 1)
blockID <- c(100, 120, 132, 133, 201, 207, 210, 238, 240, 256)
df <- as.data.frame(cbind(index, blockID))
index blockID
1 1 100
2 2 120
3 3 132
4 4 133
5 5 201
6 6 207
7 7 210
8 8 238
9 9 240
10 10 256
I want to mutate a new column that checks whether blockID is in a list. If yes, it should just keep the value of blockID. If not, It should return the nearest value in blocklist:
blocklist <- c(100, 120, 130, 150, 201, 205, 210, 238, 240, 256)
so the additional column should contain
100 (match),
120 (match),
130 (no match for 132--nearest value is 130),
130 (no match for 133--nearest value is 130),
201,
205 (no match for 207--nearest value is 205),
210,
238,
240,
256
Here's what I've tried:
df2 <- df %>% mutate(blockmatch = ifelse(blockID %in% blocklist, blockID, ifelse(match.closest(blockID, blocklist, tolerance = Inf), "missing")))
I just put in "missing" to complete the ifelse() statements, but it shouldn't actually be returned anywhere since the preceding cases will be fulfilled for every value of blockID. However, the resulting df2 just has "missing" in all the cells where it should have substituted the nearest number. I know there are base R alternatives to match.closest but I'm not sure that's the problem. Any ideas?
You don't need if..else. Your rule can be simplified: for each blockID, always take the blocklist element with the smallest absolute difference. If a value matches exactly, the absolute difference is 0, which is always the smallest.
With that here's a simple base R solution -
df$blockmatch <- sapply(df$blockID, function(x) blocklist[order(abs(x - blocklist))][1])
index blockID blockmatch
1 1 100 100
2 2 120 120
3 3 132 130
4 4 133 130
5 5 201 201
6 6 207 205
7 7 210 210
8 8 238 238
9 9 240 240
10 10 256 256
Here are a couple of ways with dplyr -
df %>%
rowwise() %>%
mutate(
blockmatch = blocklist[order(abs(blockID - blocklist))][1]
)
df %>%
mutate(
blockmatch = sapply(blockID, function(x) blocklist[order(abs(x - blocklist))][1])
)
Thanks to @Onyambu, here's a faster way -
df$blockmatch <- blocklist[max.col(-abs(sapply(blocklist, '-', df$blockID)))]
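The same idea can be sketched with outer() and which.min(), which makes the distance matrix explicit and always picks the first of any tied candidates. This is only an illustrative variant of the answers above; `dist_mat` and `nearest` are made-up names.

```r
blockID   <- c(100, 120, 132, 133, 201, 207, 210, 238, 240, 256)
blocklist <- c(100, 120, 130, 150, 201, 205, 210, 238, 240, 256)

# Rows correspond to blockID values, columns to blocklist values
dist_mat <- abs(outer(blockID, blocklist, "-"))

# For each row, take the blocklist entry with the smallest distance
# (which.min returns the first minimum, so ties go to the earlier entry)
nearest <- blocklist[apply(dist_mat, 1, which.min)]
nearest
```

The distance matrix has length(blockID) * length(blocklist) cells, so for very long vectors a different approach may be preferable.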

Breaking down a timed sequence into episodes

I'm trying to break down a vector of event times into episodes. An episode must meet 2 criteria. 1) It consists of 3 or more events and 2) those events have inter-event times of 25 time units or less. My data is organized in a data frame as shown below.
So far, I have figured out that I can find the difference between events with diff(EventTime). By creating a logical vector that marks the events meeting the 2nd (inter-event) criterion, I can use rle(EpisodeTimeCriterion) to get the total number and the lengths of the episodes.
EventTime TimeDifferenceBetweenNextEvent EpisodeTimeCriterion
25 NA NA
75 50 TRUE
100 25 TRUE
101 1 TRUE
105 4 TRUE
157 52 FALSE
158 1 TRUE
160 2 TRUE
167 7 TRUE
169 2 TRUE
170 1 TRUE
175 5 TRUE
178 3 TRUE
278 100 FALSE
302 24 TRUE
308 6 TRUE
320 12 TRUE
322 459 FALSE
However, I would like to know the timing of the episodes and 'rle()' doesn’t let me do that.
Ideally I would like to generate a data frame that looks like this:
Episode EventsPerEpisode EpisodeStartTime EpisodeEndTime
1 4 75 105
2 7 158 178
3 3 302 322
I know that this is probably a simple problem but being new to R, the only solution I can envision is some series of loops. Is there a way of doing this without loops? Or is there package that lends itself to this sort of analysis?
Thanks!
Edited for clarity. Added a desired outcome data fame and expanded the example data to make it clearer.
You've got the pieces you need. You really need to make a variable that gives each episode a number/name so you can group by it. rle(...)$lengths gives you run lengths, so just use rep to repeat a number that number of times:
runs <- rle(df$EpisodeTimeCriterion)$lengths # You don't need this extra variable, but it makes the code more readable
df$Episode <- rep(seq_along(runs), runs)
so df looks like
> head(df)
EventTime TimeDifferenceBetweenNextEvent EpisodeTimeCriterion Episode
1 25 NA NA 1
2 75 50 TRUE 2
3 100 25 TRUE 2
4 101 1 TRUE 2
5 105 4 TRUE 2
6 157 52 FALSE 3
Now use dplyr to summarize the data:
library(dplyr)
df2 <- df %>% filter(EpisodeTimeCriterion) %>% group_by(Episode) %>%
summarise(EventsPerEpisode = n(),
EpisodeStartTime = min(EventTime),
EpisodeEndTime = max(EventTime))
which returns
> df2
Source: local data frame [3 x 4]
Episode EventsPerEpisode EpisodeStartTime EpisodeEndTime
(int) (int) (dbl) (dbl)
1 2 4 75 105
2 4 7 158 178
3 6 3 302 320
If you want your episode numbers to be integers starting with one, you can clean up with
df2$Episode <- 1:nrow(df2)
Data
If someone wants to play with the data, the results of dput(df) before running the above code:
df <- structure(list(EventTime = c(25, 75, 100, 101, 105, 157, 158,
160, 167, 169, 170, 175, 178, 278, 302, 308, 320, 322), TimeDifferenceBetweenNextEvent = c(NA,
50, 25, 1, 4, 52, 1, 2, 7, 2, 1, 5, 3, 100, 24, 6, 12, 459),
EpisodeTimeCriterion = c(NA, TRUE, TRUE, TRUE, TRUE, FALSE,
TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE,
TRUE, FALSE)), .Names = c("EventTime", "TimeDifferenceBetweenNextEvent",
"EpisodeTimeCriterion"), row.names = c(NA, -18L), class = "data.frame")
Here is one approach I came up with using a combination of cut2 from Hmisc package and cumsum to label episodes into numbers:
library(Hmisc)
library(dplyr)
df$episodeCut <- cut2(df$TimeDifferenceBetweenNextEvent, c(26))
df$episode <- cumsum((df$episodeCut == '[ 1,26)' & lag(df$episodeCut) != '[ 1,26)') | df$episodeCut != '[ 1,26)')
Output is as follows:
EventTime TimeDifferenceBetweenNextEvent EpisodeTimeCriterion episodeCut episode
1 25 50 FALSE [26,52] 1
2 75 25 TRUE [ 1,26) 2
3 100 1 TRUE [ 1,26) 2
4 101 4 TRUE [ 1,26) 2
5 105 52 TRUE [26,52] 3
6 157 52 FALSE [26,52] 4
As you can see, it tags rows 2, 3, 4 as belonging to a single episode.
Is this what you are looking for? Not sure from your description. So, my answer may be wrong.
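For comparison, here is a base R sketch that works directly from the event times and applies both criteria (gaps of 25 or less, at least 3 events). It uses small hypothetical data rather than the table in the question, and all variable names are made up for illustration.

```r
# Hypothetical event times, not the data from the question
event_time <- c(10, 20, 30, 100, 110, 200, 210, 215, 220)
max_gap    <- 25
min_events <- 3

# TRUE where the gap to the next event is small enough
close_enough <- diff(event_time) <= max_gap

# Runs of consecutive small gaps; a run of n gaps spans n + 1 events
runs      <- rle(close_enough)
run_end   <- cumsum(runs$lengths)
run_start <- run_end - runs$lengths + 1

# Keep only runs of small gaps that cover enough events
keep <- runs$values & (runs$lengths + 1) >= min_events

episodes <- data.frame(
  Episode          = seq_len(sum(keep)),
  EventsPerEpisode = runs$lengths[keep] + 1,
  EpisodeStartTime = event_time[run_start[keep]],
  EpisodeEndTime   = event_time[run_end[keep] + 1]
)
episodes
```

Gap i sits between events i and i + 1, which is why the end index is shifted by one when looking up the end time.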

Custom sorting of a dataframe in R

I have a binomial dataset that looks like this:
df <- data.frame(replicate(4,sample(1:200,1000,rep=TRUE)))
addme <- data.frame(replicate(1,sample(0:1,1000,rep=TRUE)))
df <- cbind(df,addme)
df <-df[order(df$replicate.1..sample.0.1..1000..rep...TRUE..),]
The data is currently sorted so that all rows belonging to the 0 group come first, followed by the rows belonging to the 1 group. Is there a way to sort the data in a 0-1-0-1-0... fashion? That is, a row from the 0 group, then a row from the 1 group, then the 0 group again, and so on.
All I can think about is complex functions. I hope there's a simple way around it.
Thank you,
Here's an attempt, which will add any extra 1's at the end:
First make some example data:
set.seed(2)
df <- data.frame(replicate(4,sample(1:200,10,rep=TRUE)),
addme=sample(0:1,10,rep=TRUE))
Then order:
with(df, df[unique(as.vector(rbind(which(addme==0),which(addme==1)))),])
# X1 X2 X3 X4 addme
#2 141 48 78 33 0
#1 37 111 133 3 1
#3 115 153 168 163 0
#5 189 82 70 103 1
#4 34 37 31 174 0
#6 189 171 98 126 1
#8 167 46 72 57 0
#7 26 196 30 169 1
#9 94 89 193 134 1
#10 110 15 27 31 1
#Warning message:
#In rbind(which(addme == 0), which(addme == 1)) :
# number of columns of result is not a multiple of vector length (arg 1)
Here's another way using dplyr, which would make it suitable for within-group ordering. It's also probably pretty quick. If there's unbalanced numbers of 0's and 1's, it will leave them at the end.
library(dplyr)
df %>%
arrange(addme) %>%
mutate(n0 = sum(addme == 0),
orderme = seq_along(addme) - (n0 * addme) + (0.5 * addme)) %>%
arrange(orderme) %>%
select(-n0, -orderme)
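A base R sketch of the same interleaving idea: number the rows within each group, then order by that within-group counter first and by the group second. The data here is a small hypothetical example, and `within_rank` and `ord` are made-up names.

```r
# Hypothetical data with unbalanced groups (three 0s, four 1s)
df <- data.frame(
  value = c(11, 12, 13, 14, 15, 16, 17),
  addme = c(0, 0, 1, 1, 1, 0, 1)
)

# Counter 1, 2, 3, ... within each addme group, in current row order
within_rank <- ave(seq_along(df$addme), df$addme, FUN = seq_along)

# Sort by the counter first, then by group: 0, 1, 0, 1, ...
ord <- order(within_rank, df$addme)
df[ord, ]
```

As with the answers above, any surplus rows from the larger group end up at the bottom.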
