Determine windows/epochs in time series and calculate average - r

I'm dealing with time series data from neurophysiological recordings which typically have 'markers' that mark the beginning of an event (e.g., stimulus being presented on the screen). I'm trying to subset particular windows/epochs based on certain markers, and then average those separate windows/epochs.
To illustrate this, below is a really simple example (my actual datasets have millions of data points, so it'd be nice to have efficient solutions).
df <- data.frame(value = c(1:10, 101:110), #time series data
marker = c(NA, NA, 'start', NA, NA, NA, NA, 'end', NA, NA, #event markers
NA, NA, 'start', NA, NA, NA, NA, 'end', NA, NA))
start <- which(df$marker == "start") #indices 3 and 13 are the 'start' markers
end <- which(df$marker == 'end') #indices 8 and 18 are the 'end markers'
window1 <- df$value[start[1]:end[1]] #first window (indices 3 to 8)
window2 <- df$value[start[2]:end[2]] #second window (indices 13 to 18)
averageWindow <- (window1 + window2) / 2 #average of the two windows
Is this the most efficient way to go about doing this (I have nearly 1000 windows in my actual data and about 1 million rows)?

I am not sure whether you want one average over all windows or an average for each window, so I produced both. Using your start and end, I subset the data with lapply(), which drops the rows outside the windows. Then I combined the data frames in the list with rbindlist(), storing the window ID in a new column (.id). The final step was to compute the averages.
library(data.table)
start <- which(df$marker == "start") #indices 3 and 13 are the 'start' markers
end <- which(df$marker == 'end') #indices 8 and 18 are the 'end markers'
temp <- rbindlist(lapply(seq_along(start), function(x) {
  df[start[x]:end[x], ]}), idcol = TRUE)
# An overall average
temp[, list(average = sum(value) / uniqueN(.id))]
# average
#1: 333
# An average for each window
temp[, list(average = sum(value) / .N), by = .id]
# .id average
#1: 1 5.5
#2: 2 105.5
In reply to the OP's comment, I came up with the following code. It assigns an index to each of the 6 positions within a window and then averages across windows at each position.
temp[, index := 1:.N, by = .id][,
list(average = sum(value) / .N), by = index]
# index average
#1: 1 53
#2: 2 54
#3: 3 55
#4: 4 56
#5: 5 57
#6: 6 58
DATA
df <- data.frame(value = c(1:10, 101:110), #time series data
marker = c(NA, NA, 'start', NA, NA, NA, NA, 'end', NA, NA, #event markers
NA, NA, 'start', NA, NA, NA, NA, 'end', NA, NA),
stringsAsFactors = FALSE)
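If every window has the same length, as in the example, the position-wise average can also be sketched without building a list of data frames: create a matrix of row indices with one column per window and take row means. This is only a base R sketch under that equal-length assumption; `win_len`, `idx`, and `avg_by_position` are illustrative names that do not appear in the answer above.

```r
df <- data.frame(value = c(1:10, 101:110),
                 marker = c(NA, NA, "start", NA, NA, NA, NA, "end", NA, NA,
                            NA, NA, "start", NA, NA, NA, NA, "end", NA, NA),
                 stringsAsFactors = FALSE)
start <- which(df$marker == "start")
end <- which(df$marker == "end")

# Assumption: all windows span the same number of rows
win_len <- end[1] - start[1] + 1L
stopifnot(all(end - start + 1L == win_len))

# One column of row indices per window (win_len rows, length(start) columns)
idx <- outer(seq_len(win_len) - 1L, start, "+")

# Pull the values into the same shape and average across windows
avg_by_position <- rowMeans(matrix(df$value[idx], nrow = win_len))
avg_by_position
# [1] 53 54 55 56 57 58
```

Because the indexing and averaging are done on whole vectors, this should scale reasonably to many windows, though it has not been benchmarked here.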

Related

R - Coalesce dataframe while creating a new variable with column names [duplicate]

I am using the dplyr::coalesce and dplyr::mutate to find all first non-missing values and stuff that into a new variable. However, I would like to also create a new variable with the information on which variable is used to infill the new variable.
Here is an example:
df <- data.frame(
St1 = c(1, NA, NA, NA),
St2 = c(NA, 3, NA, NA),
St3 = c(NA, NA, 12, NA),
St4 = c(NA, NA, NA, 4))
What I do:
df <- df %>%
mutate(df.coalesce = coalesce(St1, St2, St3, St4)) %>%
select(df.coalesce)
Result:
df.coalesce
1
3
12
4
Desired result:
Station df.coalesce
St.1 1
St.2 3
St.3 12
St.4 4
Is there a way to do that using the tidyverse grammar?
Thanks!
You can use max.col to get the name of the column that holds the non-NA value in each row, and do.call with coalesce to apply coalesce across all the columns.
library(dplyr)
df %>%
transmute(Station = names(df)[max.col(replace(., is.na(.), 0))],
df.coalesce = do.call(coalesce, .))
# Station df.coalesce
#1 St1 1
#2 St2 3
#3 St3 12
#4 St4 4
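One caveat with the snippet above: replacing NA with 0 before calling max.col only points at the right column when the non-missing value is positive, and max.col breaks ties at random by default. A hedged base R variant, assuming the goal is "first non-missing column per row", works on the logical matrix `!is.na(df)` with `ties.method = "first"` instead. The names `present`, `first_col`, and `result` are illustrative.

```r
df <- data.frame(St1 = c(1, NA, NA, NA),
                 St2 = c(NA, 3, NA, NA),
                 St3 = c(NA, NA, 12, NA),
                 St4 = c(NA, NA, NA, 4))

# TRUE where a value is present; the first TRUE in each row marks the source column
present <- !is.na(df)
first_col <- max.col(present, ties.method = "first")

# Pick one cell per row with (row, column) matrix indexing
result <- data.frame(
  Station = names(df)[first_col],
  df.coalesce = as.matrix(df)[cbind(seq_len(nrow(df)), first_col)]
)
result
```

Rows with no non-missing value at all would be labeled with the first column, so they may need separate handling.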
You can find all the ids having NA and then remove them.
train <- read.csv (file = "file", sep = ",", na.strings=c("NA"))
id_na_Cols <- sapply(train,function(x)any(is.na(x)))
trainData <- train[,!(id_na_Cols)]
write.table (trainData, file = "file_new", sep = ",")
Afterwards you can load new data for further analysis.

How can I left-join two datasets multiple times with each time using a different variable as a key?

I am combining two dataframes using left_join as I need to keep all the rows from X and only matched ones from Y. However, there is no one column which provides a perfect match. Instead there is 1 column in X which partially matches 2 columns in Y, and a second column in X which partially matches a third column in Y. Only by matching on all 3 can I adequately merge the data since they all match on different combinations of rows (with some overlap).
My strategy so far has been to do three separate left_join commands, each with a different match. Doing this enables me to match about 95% of the rows (the first match only gets about 70%). However, this leaves me with three versions of each column from Y in the merged data. I have tried using paste and a number of other ways to combine them, but either it hasn't worked or the new columns aren't helpful as they are concatenated (e.g. showing "12345 NA 12345" or "NA NA NA"). I need it to show only the first non-NA result as all non-NA results will be identical. So for the two examples above I would want to return only "12345" and "NA".
So I think I either need to figure out how to do a left_join on multiple columns (This doesn't work, but something like:
left_join(X, Y, by = (c("Column1" = Column1) OR c("Column1" = "Column2") OR c( Column 2, Y = "Column 3")).
Or, less elegantly, I just need to figure out how to reformat the merged/ pasted column to keep only the first non-NA result.
In the example below, X has 2 columns and 5 rows and Y has 4 columns and 7 rows (5 of which match the rows in X). The only way to merge X and Y fully is to match X$Column1 with Y$Column1 OR Y$Column2, OR X$Column2 with Y$Column3. (In the real data there are around 50,000 rows in X and also lots of other columns in X and Y.) The desired output from the below should be the five rows from X (which also have matches in Y), along with the corresponding values in Y$Column4.
X <- data.frame(Column1 = c(10, 150, 3550, 9421, 22000),
                Column2 = c("Dog", "Cat", "Bird", "Rat", "Fox"))
Y <- data.frame(Column1 = c(NA, 453, NA, NA, 3550, 9421, NA, 4200),
                Column2 = c(22, NA, 10, 150, 3550, NA, NA, 4200),
                Column3 = c(NA, "Badger", "Dog", NA, NA, NA, "Fox", "Mouse"),
                # the original listing had a ninth value (20) here; dropped so every column has 8 values
                Column4 = c(NA, 4500, 12345, 54, NA, 5555, 321, 65))
From the above I would want to return 5 rows (one for each of the rows in X), along with 3 columns (the 3 original columns of X plus the 4th column of Y). The other three columns in Y are only useful for matching. As in the above example, in my data there is no way of completing a full (or close to full) match without joining on all three matches. I have been searching ways to do this for ages with no luck but I'm quite new to R so sorry if I'm being stupid.
My code at the moment:
merged_pvga <- left_join(merged_pvga, sherpa, by = c("issn1" = "issn_print"))
merged_pvga <- left_join(merged_pvga, sherpa, by = c("issn1" = "issn_electronic"))
merged_pvga <- left_join(merged_pvga, sherpa, by = (c("journal_title" = "title")))
merged_pvga$id_all <- paste(merged_pvga$id.x, merged_pvga$id.y, merged_pvga$id)
merged_pvga$subject_all <- paste(merged_pvga$subject.x, merged_pvga$subject.y, merged_pvga$subject)
etc. for other columns
Here is an example that left-joins X onto Y twice, first on ColumnA and then on ColumnB.
Note:
Y and X share two variables, ColumnA and ColumnB, so after each left join you need to combine the columns that were not used as the join key (for example, after joining on ColumnA, combine the ColumnB columns of the two datasets).
Be sure you know what to do about overlap that may arise from joining twice on two different variables. The example below prioritizes the first left join: rows of X that were already matched are excluded from the second left join.
library(dplyr)
X = tibble(id_x=1:5)
Y = tibble(id_y=1:8)
X$ColumnA = c(10, 150, 3550, 9421, 22000)
X$ColumnB = c('Dog', 'Cat', 'Bird', 'Rat', 'Fox')
Y$ColumnA = c(NA, 453, NA, NA, 3550, 9421, NA, 4200)
Y$Column2 = c(22, NA, 10, 150, 3550, NA, NA, 4200)
Y$ColumnB = c(NA, 'Badger', 'Dog', NA, NA, NA, 'Fox', 'Mouse')
Y$Column4 = c(NA, 4500, 12345, 54, NA, 5555, 321, 65)
replace_na_with_blank <- function(df, varnames) {
for (varname in varnames) {
df[is.na(df[[varname]]), varname] <- ""
}
return(df)
}
concat_columns <- function(df, v1, v2) {
idx_na <- df[[v1]]==""
df[[v1]][idx_na] <- paste(df[[v1]][idx_na], df[[v2]][idx_na], sep='')
df[[v2]] <- NULL
return(df)
}
concat_columns_num <- function(df, v1, v2) {
idx_na <- is.na(df[[v1]])
df[[v1]][idx_na] <- df[[v2]][idx_na]
df[[v2]] <- NULL
return(df)
}
merged_1 <- left_join(Y, X, by = c("ColumnA" = "ColumnA"), suffix=c("",".x"))
merged_1 = replace_na_with_blank(merged_1, c("ColumnB","ColumnB.x"))
merged_1 <- concat_columns(merged_1, "ColumnB", "ColumnB.x")
merged_1 # first merge indicator is "id_x" (no name clash yet, so no suffix is added)
merged_2 <- left_join(merged_1, X %>% filter(!(id_x %in% merged_1$id_x)),
by = c("ColumnB" = "ColumnB"), suffix=c("",".x"))
merged_2 <- concat_columns_num(merged_2, "ColumnA", "ColumnA.x")
merged_2 # second merge indicator is "id_x.x" (id_x from the second join gets the ".x" suffix)
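Since the asker only needs the first successful match per row of X, another way to sketch this is to look up the matching row of Y once per key with base R match() and keep the first hit, which avoids duplicated columns altogether. This is a sketch under the assumption that each key value appears at most once in Y; `m1`, `m2`, `m3`, and `y_row` are illustrative names, and the column names follow the question's example rather than the renamed ones used above.

```r
X <- data.frame(Column1 = c(10, 150, 3550, 9421, 22000),
                Column2 = c("Dog", "Cat", "Bird", "Rat", "Fox"),
                stringsAsFactors = FALSE)
Y <- data.frame(Column1 = c(NA, 453, NA, NA, 3550, 9421, NA, 4200),
                Column2 = c(22, NA, 10, 150, 3550, NA, NA, 4200),
                Column3 = c(NA, "Badger", "Dog", NA, NA, NA, "Fox", "Mouse"),
                Column4 = c(NA, 4500, 12345, 54, NA, 5555, 321, 65),
                stringsAsFactors = FALSE)

# Row of Y matched by each candidate key; NA keys are never treated as a match
m1 <- match(X$Column1, Y$Column1, incomparables = NA)
m2 <- match(X$Column1, Y$Column2, incomparables = NA)
m3 <- match(X$Column2, Y$Column3, incomparables = NA)

# Keep the first non-NA match in priority order: m1, then m2, then m3
y_row <- ifelse(!is.na(m1), m1, ifelse(!is.na(m2), m2, m3))

# Bring over only the column of interest from Y
X$Column4 <- Y$Column4[y_row]
X
```

With this example data every row of X finds a row in Y (rows 3 to 7). The third row still gets NA in Column4 because the matched row of Y holds NA there, not because the match failed.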

Tidy several variables with different keys at once in r

I have some trouble tidying my data. I have a table with 10 peptide sequences, and I have recorded the mean abundance, standard deviation and coefficient of variation for three samples: Reference, ZAP02 and ZAP02_GA.
The initial table is 10x10:
example <- data.frame(
Sequence = c("YVVDTSK","EALDFFAR","VLGIDGGEGKEELFR","VLGIDGGEGK","DIPVPKPK","IGDYAGIK", "DWVQAVR","DNIEPILK","LLDGTVVSR","NQETSEEYQIK"),
Reference = c(1098144.12, 41276.04, 172023.14, 399734.69, 1242669.19, 1585792.75, 1676065.88, 2152511.00, 60473.17, 768250.31),
Reference_SD = c(48098.6407, 888.9603, 8572.5207, 2475.0947, 92398.6154, 287270.7919, 71968.6762, 73495.9717, 5610.4587, 52914.2146),
Reference_CV = c( 4.3799934, 2.1536957, 4.9833532, 0.6191844, 7.4354958, 18.1152797, 4.2939050, 3.4144296, 9.2776003, 6.8876268),
ZAP02_GA = c( NaN, NaN, 1788.838, NaN, 1298.561, NaN, NaN, 1926.935, NaN, NaN),
ZAP02_GA_SD = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
ZAP02_GA_CV = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
ZAP02 = c( NaN, NaN, 2286.836, NaN, 12303.839, NaN, 2535.902, 2806.022, NaN, NaN),
ZAP02_SD = c( NA, NA, 1393.2599, NA, NA, NA, NA, 218.3964, NA, NA),
ZAP02_CV = c(NA, NA, 60.925235, NA, NA, NA, NA, 7.783131, NA, NA))
I would like to tidy my data to have one column for the means, another for the SD values and another for the CV. At the end, I should have a table with 30 rows and 5 variables (Sequence, Sample, Abundance, Standard Deviation and CV).
I tried to use the gather() function for this purpose, but at the end I got a very long and confusing dataframe.
example_tidy <- example %>%
gather(Reference, ZAP02_GA, ZAP02,
key = "Sample",
value = "Abundance") %>%
gather(Reference_SD, ZAP02_GA_SD, ZAP02_SD,
key = "Sample",
value = "Standard deviation") %>%
gather(Reference_CV, ZAP02_GA_CV, ZAP02_CV,
key = "Sample",
value = "CV (%)")
To get what I want, I had to gather the means, SDs and CVs into separate data frames and then column-bind them. But this turned out to be tedious and time-consuming.
example_mean <- example %>%
gather(Reference, ZAP02_GA, ZAP02,
key = "Sample",
value = "Abundance")
example_sd <- example %>%
gather(Reference_SD, ZAP02_GA_SD, ZAP02_SD,
key = "Sample",
value = "Standard deviation")
example_cv <- example %>%
gather(Reference_CV, ZAP02_GA_CV, ZAP02_CV,
key = "Sample",
value = "CV (%)")
example_tidy2 <- cbind(select(example_mean, Sequence, Sample, Abundance),
"Standard deviation" = example_sd$`Standard deviation`,
"CV (%)" = example_cv$`CV (%)`)
Is there a simpler way to do this? Can you do a gather() with several keys?
Thank you in advance for your help.
This is a typical use case for pivot_longer().
library(dplyr)
library(tidyr)
example %>%
rename_at(vars(-matches("Seq|SD|CV")), paste0, "_Abundance") %>%
pivot_longer(-Sequence, names_to = c("Sample", ".value"), names_pattern = "(.*)_(.*)")
# # A tibble: 30 x 5
# Sequence Sample Abundance SD CV
# <fct> <chr> <dbl> <dbl> <dbl>
# 1 YVVDTSK Reference 1098144. 48099. 4.38
# 2 YVVDTSK ZAP02_GA NaN NA NA
# 3 YVVDTSK ZAP02 NaN NA NA
# 4 EALDFFAR Reference 41276. 889. 2.15
# 5 EALDFFAR ZAP02_GA NaN NA NA
# 6 EALDFFAR ZAP02 NaN NA NA
# 7 VLGIDGGEGKEELFR Reference 172023. 8573. 4.98
# 8 VLGIDGGEGKEELFR ZAP02_GA 1789. NA NA
# 9 VLGIDGGEGKEELFR ZAP02 2287. 1393. 60.9
# 10 VLGIDGGEGK Reference 399735. 2475. 0.619
# … with 20 more rows
The term .value has special meaning in pivot_longer(). You can search ?pivot_longer for more details and practice its examples part.
Parentheses divide a string into multiple groups. The structure of names_pattern corresponds to the elements of names_to. Take ZAP02_SD for example. The first (.*) extracts ZAP02 and puts it into the Sample column. The second (.*) extracts SD and defines it as a new column, which is what .value works for.
Information about pattern matching can be found by searching for "regular expression" or "regex" on Google. The Wikipedia article on regular expressions is a good resource for beginners. All the special symbols used in this answer, such as "|", "(", "." and "*", are explained there.
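To see how the pattern "(.*)_(.*)" splits a column name, the two capture groups can be inspected with base R sub(). This is only an illustration of the regular expression, not of what pivot_longer() does internally; `nms`, `sample_part`, and `value_part` are illustrative names.

```r
# Column names as they look after the rename step (means carry "_Abundance")
nms <- c("Reference_Abundance", "Reference_SD", "ZAP02_GA_CV", "ZAP02_SD")
pattern <- "(.*)_(.*)"

# The first group is greedy, so the split happens at the LAST underscore
sample_part <- sub(pattern, "\\1", nms)
value_part <- sub(pattern, "\\2", nms)

data.frame(name = nms, Sample = sample_part, value_col = value_part)
#                  name    Sample value_col
# 1 Reference_Abundance Reference Abundance
# 2        Reference_SD Reference        SD
# 3         ZAP02_GA_CV  ZAP02_GA        CV
# 4            ZAP02_SD     ZAP02        SD
```

The greedy first group is what lets a sample name such as ZAP02_GA keep its own underscore.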
You can achieve the desired outputs without splitting each variable in a new dataframe, but I think there still needs to be an intermediary step involved - although my solution might not be the most elegant.
If your sample/variable names were more consistent, I would have used separate after the first gather to split e.g. Reference_CV into a Sample (Reference) and Measure (CV) column, but because means were not named and you have sample names containing underscores, I used regular expressions to select them.
First step gathers all the values (regardless what type of values they are) into one value column.
step1 <- gather(example, key = "Sample", value = "value", 2:10)
Then I create a "measure" column that gets filled based on information pulled from the sample name, and tidy up the "Sample" column to remove that information. (Here is where someone could chip in with a more elegant and widely applicable solution, but that's all I could come up with based on your naming conventions.)
step1 <- step1 %>% mutate(
measure = case_when(
grepl("_CV", Sample) ~ "CV",
grepl("_SD", Sample) ~ "SD",
!grepl("_CV", Sample) & !grepl("_SD", Sample) ~ "Abundance"
),
Sample = case_when(
grepl("Reference", Sample) ~ "Reference",
grepl("ZAP02_GA", Sample) ~ "ZAP02_GA",
grepl("ZAP02", Sample) ~ "ZAP02"
)
)
And finally I spread the resulting data frame to put the measures back into their own columns: Abundance, CV and SD.
output <- spread(step1, key = measure, value = value)
dim(output)
# [1] 30 5
You can condense all of this in one long pipe, but I thought it would be easier to demonstrate the steps like this. Hope that helps!

Replacing NA values using a rolling window

How can I replace a NA value by the average of the previous non-NA and next non-NA values?
For example, I want to replace the first NA value by -0.873, and the 4th and 5th by the average of -0.497 and 53.200.
Thanks!
t <- c(NA, -0.873, -0.497, NA, NA, 53.200, NA, NA, NA, 26.100)
=================== ADD ON ===================
Thank you all for answering the question! Sorry for the late response. This is only a part of a data frame (10000 * 91), and I only took the first 10 rows of the first column in order to simplify the question. I think David and MKR have the result that I expect.
Here's a possible vectorized approach using base R (some steps could be probably improved but I have no time to look into it right now)
x <- c(NA, -0.873, -0.497, NA, NA, 53.200, NA, NA, NA, 26.100)
# Store a boolean vector of NA locations for further use
na_vals <- is.na(x)
# Find the location of each NA relative to the non-NAs
start_ind <- findInterval(which(na_vals), which(!na_vals))
# Create a right limit
end_ind <- start_ind + 1L
# Replace zero locations with NAs
start_ind[start_ind == 0L] <- NA_integer_
# Calculate the means and replace the NAs
x[na_vals] <- rowMeans(cbind(x[!na_vals][start_ind], x[!na_vals][end_ind]), na.rm = TRUE)
x
# [1] -0.8730 -0.8730 -0.4970 26.3515 26.3515 53.2000 39.6500 39.6500 39.6500 26.1000
This should work properly for NAs on both sides of the vector.
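Because the question mentions that the vector is one column of a larger data frame, the steps above can be wrapped in a function and applied to each column. This is a sketch only: `fill_na_mean` and the example data frame `dat` (including its second column `b`) are hypothetical, and the early return for columns with no NA or only NA is an added assumption.

```r
fill_na_mean <- function(x) {
  na_vals <- is.na(x)
  # Nothing to fill, or nothing to fill from
  if (!any(na_vals) || all(na_vals)) return(x)
  known <- x[!na_vals]
  # Index (within 'known') of the previous and next non-NA value for each NA
  left <- findInterval(which(na_vals), which(!na_vals))
  right <- left + 1L
  left[left == 0L] <- NA_integer_
  # Average the two neighbours; a missing neighbour is ignored via na.rm
  x[na_vals] <- rowMeans(cbind(known[left], known[right]), na.rm = TRUE)
  x
}

dat <- data.frame(a = c(NA, -0.873, -0.497, NA, NA, 53.200, NA, NA, NA, 26.100),
                  b = c(1, NA, 3, NA, NA, 6, 7, NA, 9, 10))

# Apply column by column while keeping the data frame structure
dat[] <- lapply(dat, fill_na_mean)
dat
```

Column a reproduces the result shown above, and column b is filled the same way from its own neighbours.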
This function imputes each NA in a vector with the average of the non-NA values in a window running from the first element up to the element just after the NA.
t <- c(NA, -0.873, -0.497, NA, NA, 53.200, NA, NA, NA, 26.100)
roll_impute <- function(x){
n <- length(x)
res <- x
for (i in seq_along(x)){
if (is.na(x[i])){
res[i] <- mean(rep_len(x, i+1), na.rm = TRUE )
}
}
if (is.na(x[n])) res[n] <- mean(x, na.rm = TRUE)
res
}
roll_impute(t)
# [1] -0.87300 -0.87300 -0.49700 -0.68500 17.27667 53.20000 17.27667 17.27667 19.48250
# [10] 26.10000
roll_impute() includes code that corrects the rolling window in the case that the final element is NA, so that the vector isn't recycled. This isn't the case in your example, but is needed in order to generalize the function. Any improvements on this function would be welcome :) It does use a for loop, but doesn't grow any vectors. No simple way to avoid the for loop and rely on the structure of the objects jumps to my mind right now.
One dplyr and tidyr based solution could be:
library(dplyr)
library(tidyr)
t <- c(NA, -0.873, -0.497, NA, NA, 53.200, NA, NA, NA, 26.100)
data.frame(t) %>%
mutate(last_nonNA = ifelse(!is.na(t), t, NA)) %>%
mutate(next_nonNA = ifelse(!is.na(t), t, NA)) %>%
fill(last_nonNA) %>%
fill(next_nonNA, .direction = "up") %>%
mutate(t = case_when(
!is.na(t) ~ t,
!is.na(last_nonNA) & !is.na(next_nonNA) ~ (last_nonNA + next_nonNA)/2,
is.na(last_nonNA) ~ next_nonNA,
is.na(next_nonNA) ~ last_nonNA
)
) %>%
select(t)
# t
# 1 -0.8730
# 2 -0.8730
# 3 -0.4970
# 4 26.3515
# 5 26.3515
# 6 53.2000
# 7 39.6500
# 8 39.6500
# 9 39.6500
# 10 26.1000
Note: It looks a bit complicated, but it does the trick. One can achieve the same thing with a for loop.

R within() function: unexpected error when last value(s) are NA

I have encountered some unexpected behaviour when using the within() function in R. I (eventually!) tracked the cause to a situation where the last element(s) of particular columns in question in a data frame contain NA.
I have simplified the code to create a reproducible example. Obviously the real world application in which I encountered this is substantially more complex (data frame >500k rows 400 columns, >100 lines inside within(), etc.), and rather inconvenient to avoid using within().
This works as expected:
fooTest <- data.frame(Group = c("Shell", NA, "Cup", NA, NA),
CupComposition = c("Metal", NA, "Polyethylene", NA, "Test"),
LinerComposition = c("Polyethylene", NA, NA, NA, "Test"))
fooTest$Bearing <- NA
fooTest$Bearing[which(fooTest$Group=="Cup")] <-
as.character(fooTest$CupComposition[which(fooTest$Group=="Cup")])
fooTest$Bearing[which(fooTest$Group=="Shell")] <-
as.character(fooTest$LinerComposition[which(fooTest$Group=="Shell")])
fooTest$Bearing
Whereas this (which should be equivalent) throws an error:
fooTest <- data.frame(Group = c("Shell", NA, "Cup", NA, NA),
CupComposition = c("Metal", NA, "Polyethylene", NA, "Test"),
LinerComposition = c("Polyethylene", NA, NA, NA, "Test"))
fooTest <- within(fooTest, {
Bearing <- NA
Bearing[which(Group=="Cup")] <-
as.character(CupComposition[which(Group=="Cup")])
Bearing[which(Group=="Shell")] <-
as.character(LinerComposition[which(Group=="Shell")])
})
The error message is
Error in [<-.data.frame(*tmp*, nl, value = list(Bearing = c("Polyethylene", :
replacement element 1 has 3 rows, need 5
The last two rows, in which Group is NA, are evidently not being included. NA rows in the middle of the data are OK.
A couple of questions:
The behaviour of within() is a bit unexpected; is this a bug? I am not very experienced, so am slightly reticent about filing bugs where it is likely to be my understanding that is deficient!
In this particular case, I expect there is a neater way to populate the "Bearing" column than the method I have employed. Suggestions welcome!
I tend to use %in% in this case; it handles NAs more gracefully:
fooTest <- data.frame(Group = c("Shell", NA, "Cup", NA, NA),
CupComposition = c("Metal", NA, "Polyethylene", NA, "Test"),
LinerComposition = c("Polyethylene", NA, NA, NA, "Test"))
fooTest <- within(fooTest, {
Bearing <- NA
Bearing[Group %in% "Cup"] <-
as.character(CupComposition[Group %in% "Cup"])
Bearing[Group %in% "Shell"] <-
as.character(LinerComposition[Group %in% "Shell"])
})
Regarding the error message using within, you can try:
within(fooTest, {Bearing <- NA
Bearing[Group=='Cup' & !is.na(Group)] <-
as.character(CupComposition)[Group=='Cup' & !is.na(Group)]
Bearing[Group=='Shell' & !is.na(Group)] <-
as.character(LinerComposition)[Group=='Shell' & !is.na(Group)]
})
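As for why the original version fails: inside within(), `Bearing <- NA` creates a length-one vector, and assigning to the positions returned by which() only grows it up to the largest index assigned. When the trailing rows have Group equal to NA, that largest index is 3, so Bearing ends up with 3 elements and cannot be stored in a 5-row data frame. The sketch below illustrates this and one possible fix, assuming Bearing is pre-allocated to length(Group); the names `b` and `fixed` are illustrative.

```r
fooTest <- data.frame(Group = c("Shell", NA, "Cup", NA, NA),
                      CupComposition = c("Metal", NA, "Polyethylene", NA, "Test"),
                      LinerComposition = c("Polyethylene", NA, NA, NA, "Test"),
                      stringsAsFactors = FALSE)

# A length-one vector only grows to the largest index assigned
b <- NA
b[which(fooTest$Group == "Cup")] <- "Polyethylene"
length(b)  # 3, not 5

# Pre-allocating to the full length keeps the new column at 5 rows
fixed <- within(fooTest, {
  Bearing <- rep(NA_character_, length(Group))
  Bearing[which(Group == "Cup")] <- CupComposition[which(Group == "Cup")]
  Bearing[which(Group == "Shell")] <- LinerComposition[which(Group == "Shell")]
})
fixed$Bearing
```

The logical-index versions above avoid the problem for the same reason: a logical index of length 5 extends Bearing to 5 elements.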
It is not clear whether the Group column and all other columns are following some order. From the column names, I couldn't find a common pattern that helps in matching the elements in Group. Based on the example provided, you could also do (for the bigger dataset)
fooTest1 <- fooTest
fooTest1[] <- lapply(fooTest1, as.character)#convert the columns to character class
Un1 <- sort(unique(na.omit(fooTest1$Group)))
m1 <- do.call(cbind,Map(function(v, x,y)
ifelse(v==y & !is.na(v), x, NA) , list(fooTest1[,1]),
fooTest1[,-1], Un1))
indx1 <- which(!is.na(m1), arr.ind=TRUE)[,1]
fooTest1$Bearing <- NA
fooTest1$Bearing[indx1] <- m1[!is.na(m1)]
fooTest1
# Group CupComposition LinerComposition Bearing
#1 Shell Metal Polyethylene Polyethylene
#2 <NA> <NA> <NA> <NA>
#3 Cup Polyethylene <NA> Polyethylene
#4 <NA> <NA> <NA> <NA>
#5 <NA> Test Test <NA>
