How can I replace an NA value with the average of the previous non-NA and next non-NA values?
For example, I want to replace the first NA value with -0.873, and the 4th/5th with the average of -0.497 and 53.200.
Thanks!
t <- c(NA, -0.873, -0.497, NA, NA, 53.200, NA, NA, NA, 26.100)
=================== ADD ON ===================
Thank you all for answering the question! Sorry for the late response. This is only part of a data frame (10000 × 91), and I only took the first 10 rows from the first column to simplify the question. I think David and MKR have the result that I am expecting to have.
Here's a possible vectorized approach using base R (some steps could probably be improved, but I have no time to look into it right now):
x <- c(NA, -0.873, -0.497, NA, NA, 53.200, NA, NA, NA, 26.100)
# Store a boolean vector of NA locations for further use
na_vals <- is.na(x)
# Find the NA locations relative to the non-NAs
start_ind <- findInterval(which(na_vals), which(!na_vals))
# Create a right limit
end_ind <- start_ind + 1L
# Replace zero locations with NAs
start_ind[start_ind == 0L] <- NA_integer_
# Calculate the means and replace the NAs
x[na_vals] <- rowMeans(cbind(x[!na_vals][start_ind], x[!na_vals][end_ind]), na.rm = TRUE)
x
# [1] -0.8730 -0.8730 -0.4970 26.3515 26.3515 53.2000 39.6500 39.6500 39.6500 26.1000
This should work properly for NAs on both sides of the vector.
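For example, a quick check on a short hypothetical vector with NAs at both ends:
y <- c(NA, 1, NA, 3, NA)
na_vals <- is.na(y)
start_ind <- findInterval(which(na_vals), which(!na_vals))
end_ind <- start_ind + 1L
start_ind[start_ind == 0L] <- NA_integer_
y[na_vals] <- rowMeans(cbind(y[!na_vals][start_ind], y[!na_vals][end_ind]), na.rm = TRUE)
y
# [1] 1 1 2 3 3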
This function imputes each NA in a vector with the average of the non-NA values in a window running from the first element to the element just after the NA.
t <- c(NA, -0.873, -0.497, NA, NA, 53.200, NA, NA, NA, 26.100)
roll_impute <- function(x){
  n <- length(x)
  res <- x
  for (i in seq_along(x)){
    if (is.na(x[i])){
      # Mean of the window from the first element to position i + 1
      # (rep_len() recycles from the start when i + 1 > n)
      res[i] <- mean(rep_len(x, i + 1), na.rm = TRUE)
    }
  }
  # If the final element is NA, the loop above used a recycled value;
  # overwrite the result with the mean of the whole vector instead
  if (is.na(x[n])) res[n] <- mean(x, na.rm = TRUE)
  res
}
roll_impute(t)
# [1] -0.87300 -0.87300 -0.49700 -0.68500 17.27667 53.20000 17.27667 17.27667 19.48250
# [10] 26.10000
roll_impute() includes code that corrects the result when the final element is NA, so that a value recycled from the start of the vector isn't kept. That isn't the case in your example, but it is needed to generalize the function. Any improvements on this function would be welcome :) It does use a for loop, but it doesn't grow any vectors. No simple way to avoid the for loop by relying on the structure of the objects jumps to mind right now.
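For what it's worth, here is one possible loop-free variant of the same expanding-window logic using cumulative sums (a sketch; roll_impute_vec is a hypothetical name, and it assumes the same tail handling as above):
roll_impute_vec <- function(x) {
  n <- length(x)
  s <- cumsum(ifelse(is.na(x), 0, x))  # running sum of the non-NA values
  cnt <- cumsum(!is.na(x))             # running count of the non-NA values
  j <- pmin(seq_along(x) + 1L, n)      # window end, capped at the last element
  out <- x
  out[is.na(x)] <- (s[j] / cnt[j])[is.na(x)]
  out
}
roll_impute_vec(t)
# [1] -0.87300 -0.87300 -0.49700 -0.68500 17.27667 53.20000 17.27667 17.27667 19.48250
# [10] 26.10000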
One dplyr and tidyr based solution could be:
library(dplyr)
library(tidyr)
t <- c(NA, -0.873, -0.497, NA, NA, 53.200, NA, NA, NA, 26.100)
data.frame(t) %>%
  mutate(last_nonNA = ifelse(!is.na(t), t, NA)) %>%
  mutate(next_nonNA = ifelse(!is.na(t), t, NA)) %>%
  fill(last_nonNA) %>%
  fill(next_nonNA, .direction = "up") %>%
  mutate(t = case_when(
    !is.na(t) ~ t,
    !is.na(last_nonNA) & !is.na(next_nonNA) ~ (last_nonNA + next_nonNA) / 2,
    is.na(last_nonNA) ~ next_nonNA,
    is.na(next_nonNA) ~ last_nonNA
  )) %>%
  select(t)
# t
# 1 -0.8730
# 2 -0.8730
# 3 -0.4970
# 4 26.3515
# 5 26.3515
# 6 53.2000
# 7 39.6500
# 8 39.6500
# 9 39.6500
# 10 26.1000
Note: It looks a bit complicated, but it does the trick. One could achieve the same thing with a for loop.
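For reference, a minimal for-loop version of the same previous/next averaging (a sketch; fill_na is a hypothetical helper name):
fill_na <- function(x) {
  out <- x
  for (i in which(is.na(x))) {
    prev <- tail(na.omit(x[seq_len(i - 1)]), 1)  # last non-NA before i (may be empty)
    nxt <- head(na.omit(x[-seq_len(i)]), 1)      # first non-NA after i (may be empty)
    out[i] <- mean(c(prev, nxt))                 # falls back to one side at the edges
  }
  out
}
fill_na(t)
# [1] -0.8730 -0.8730 -0.4970 26.3515 26.3515 53.2000 39.6500 39.6500 39.6500 26.1000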
The following code is designed to extract the first x observations of each column, where the columns are time series spanning different periods (or, equivalently, to erase everything other than the first x values in each column).
The first values can be numbers followed by NAs, as long as they are at the beginning of the time series.
It is crucial that each value stays linked to its own place in the indexing (the first column, 'Year').
# data example
df <- data.frame("Year" = 1791:1800,
"F1" = c(NA, NA, NA, 1.2,1.3, NA, NA, NA, NA, NA),
"F2" = c(NA, NA, 2.1, 2.2, 2.3, 2.4, 2.5, NA, NA, NA),
"F3" = c(NA, NA, NA, NA, NA, 0.1,0.2,0.3,0.4,0.5),
"F4" = c(NA, 3.1,3.2,3.3,3.4,3.5,3.6,3.7,3.8,3.9))
# Convert the dataframe to a list by column
long <- setNames(lapply(names(df)[-1], function(x) cbind(df[1], df[x])), names(df)[-1])
# and select only the first 3 elements after NAs in each column
mylist <- lapply(long, function(x){
  head(na.omit(x), 3)
})
# or in a more concise way?? (iterating over df[-1] so Year isn't paired
# with itself, and using df[1] so the "Year" column name survives for melt())
mylist2 <- lapply(df[-1], function(x){
  head(na.omit(cbind(df[1], x)), 3)
})
# Now 'mylist' (or 'mylist2') contains several vectors of different lengths,
# which is not very convenient for a data frame, so let's switch to a long-format data frame
library(reshape2)
mydata <- do.call(rbind, lapply(mylist, function(x){
  melt(x, id.vars = "Year")
}))
# and switch back to regular spreadsheet format
library(tidyverse)
mydataCOL <- spread(mydata, key = "variable", value = "value")
write.table(mydataCOL, "sheet1.txt")
This becomes complicated to apply to a list of data frames (multiple Excel files). Is there an easier way to achieve this? That is, to do such operations on each column of each data frame in the list :)
I'm currently trying with 'nested' lapply():
mylist <- lapply(d, function(x){
  lapply(x, function(y){
    head(na.omit(cbind(x[[1]], y)), 50)
  })
})
but this is not the easiest way, I guess... Thanks!
If you are using the tidyverse anyway, why not go all in with Hadley's stuff?
GetTop <- function(indf){
  indf %>%
    pivot_longer(-Year, names_to = "F") %>%
    na.omit() %>%
    group_by(F) %>%
    top_n(3, wt = -Year) %>%
    pivot_wider(names_from = "F")
}
Now we can call it for one dataframe:
> mytops <- GetTop(df)
If you have a list of these dataframes you can use lapply to do this to each one.
allmytop <- lapply(biglist,FUN=GetTop)
That will give you a list of dataframes. Seems like you also want to join them into one fat dataframe.
fatdf <- lapply(biglist,FUN=GetTop) %>% reduce(full_join,by="Year")
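Note that reduce() comes from purrr, which the tidyverse loads. If you prefer base R for that last step, a sketch of the equivalent full join:
fatdf <- Reduce(function(a, b) merge(a, b, by = "Year", all = TRUE),
                lapply(biglist, GetTop))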
I have a table of numbers with many significant digits, which vary across entries. I need to find the exact matches for these numbers across columns - e.g.
find_mz_matches <- data.frame(
  "mz1" = c(3.14222, 314.12003, 214.220, 254.111223, NA, NA, NA, NA, NA),
  "mz2" = c(3.14222, 456.2200001, NA, NA, NA, NA, NA, NA, NA),
  "mz3" = c(300.112223, 456.2200001, 3.14222, 254.111223, 900.232,
            476.0012503, 459.00201, 500.60402, 300.4053102))
I want to know what values are shared between mz1 and mz2, mz2 and mz3, and finally between all three columns together.
So, comparing mz1 and mz2 should yield:
mz1_v_mz2
3.14222
456.2200001
And comparing all three:
mz_all
3.14222
I have cobbled together something that almost works, but the problem is that it's rounding somewhere and my output includes numbers that are similar but not the same, e.g. 3.14222 should not match with 3.14223. It also includes NAs in the output, which is not desired.
duplicates_across1 <- find_mz_matches[find_mz_matches$mz1 %in% find_mz_matches$mz2, ]
That should work to compare the first two columns, so I figured I would just take the output and do it again for the next comparison - compare the output of duplicates_across1 to find_mz_matches$mz3. For some reason, it doesn't catch the presence of 3.14222 between all three columns and I have no idea why.
duplicates_all <- duplicates_across1[duplicates_across1$mz1 %in% find_mz_matches$mz3, ]
Here is a very lapply-y answer that compares each combination and stores the results in a list. It should be flexible if you have more than 3 columns as well. Good luck!
# Creating all combinations of columns in a list (and flattening it with unlist)
combos <- unlist(lapply(2:ncol(find_mz_matches), combn, x = find_mz_matches,
                        simplify = FALSE), recursive = FALSE)
# Checking for common elements
common_elements <- lapply(combos, function(x) Reduce(base::intersect, x))
# Renaming the elements
names(common_elements) <- sapply(lapply(combos, names), paste, collapse = "_")
common_elements
$mz1_mz2
[1] 3.14222 NA
$mz1_mz3
[1] 3.14222 254.11122
$mz2_mz3
[1] 3.14222 456.22000
$mz1_mz2_mz3
[1] 3.14222
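If the stray NA in mz1_mz2 is unwanted, one possible tweak (a sketch) is to drop missing values from each column before intersecting:
common_elements_noNA <- lapply(combos, function(x) Reduce(base::intersect, lapply(x, na.omit)))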
See Reduce():
Reduce(intersect, find_mz_matches, accumulate = T)
Reduce(intersect, find_mz_matches, accumulate = T, right = T)
The argument accumulate is optional - it's only there to show you what's happening. For your use, you can take it out and it would result in 3.14222.
Reduce(intersect, find_mz_matches)
[1] 3.14222
I would do something like:
find_mz_matches <- data.frame(
  "mz1" = c(3.14222, 314.12003, 214.220, 254.111223, NA, NA, NA, NA, NA),
  "mz2" = c(3.14222, 456.2200001, NA, NA, NA, NA, NA, NA, NA),
  "mz3" = c(300.112223, 456.2200001, 3.14222, 254.111223, 900.232,
            476.0012503, 459.00201, 500.60402, 300.4053102))
find_mz_matches$mz_allmz1mz2 <- ifelse(find_mz_matches$mz1 == find_mz_matches$mz2,
                                       find_mz_matches$mz1, NA)
find_mz_matches$mz_allmz2mz3 <- ifelse(find_mz_matches$mz2 == find_mz_matches$mz3,
                                       find_mz_matches$mz2, NA)
After that, I would combine the resulting columns if no conflict is present.
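A sketch of that combining step (my reading of it; mz_combined is a hypothetical column name): keep whichever pairwise match is present, and blank out rows where the two columns disagree:
a <- find_mz_matches$mz_allmz1mz2
b <- find_mz_matches$mz_allmz2mz3
conflict <- !is.na(a) & !is.na(b) & a != b
find_mz_matches$mz_combined <- ifelse(is.na(a), b, a)  # coalesce the two match columns
find_mz_matches$mz_combined[conflict] <- NA            # both matched but disagree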
I'm dealing with time series data from neurophysiological recordings which typically have 'markers' that mark the beginning of an event (e.g., stimulus being presented on the screen). I'm trying to subset particular windows/epochs based on certain markers, and then average those separate windows/epochs.
To illustrate this, below is a really simple example (my actual datasets have millions of data points, so it'd be nice to have efficient solutions).
df <- data.frame(value = c(1:10, 101:110),  # time series data
                 marker = c(NA, NA, 'start', NA, NA, NA, NA, 'end', NA, NA,  # event markers
                            NA, NA, 'start', NA, NA, NA, NA, 'end', NA, NA))
start <- which(df$marker == "start") #indices 3 and 13 are the 'start' markers
end <- which(df$marker == 'end') #indices 8 and 18 are the 'end markers'
window1 <- df$value[start[1]:end[1]] #first window (indices 3 to 8)
window2 <- df$value[start[2]:end[2]] #second window (indices 13 to 18)
averageWindow <- (window1 + window2) / 2 #average of the two windows
Is this the most efficient way to go about doing this (I have nearly 1000 windows in my actual data and about 1 million rows)?
I am not sure if you want an average based on all windows or an average for each window, so I decided to produce both results. Using your start and end, I subsetted the data with lapply(), removing the irrelevant rows in the process. Then I combined the data frames in the list with rbindlist() and assigned an ID to a new column. The final step was to compute the averages.
library(data.table)
start <- which(df$marker == "start") #indices 3 and 13 are the 'start' markers
end <- which(df$marker == 'end') #indices 8 and 18 are the 'end markers'
temp <- rbindlist(lapply(seq_along(start), function(x){
  df[start[x]:end[x], ]
}), idcol = TRUE)
# An overall average
temp[, list(average = sum(value) / uniqueN(.id))]
# average
#1: 333
# An average for each window
temp[, list(average = sum(value) / .N), by = .id]
# .id average
#1: 1 5.5
#2: 2 105.5
Replying to the OP's comment, I came up with the following code. I created an ID for each of the six positions in a window and calculated an average for each position.
temp[, index := 1:.N, by = .id][,
     list(average = sum(value) / .N), by = index]
# index average
#1: 1 53
#2: 2 54
#3: 3 55
#4: 4 56
#5: 5 57
#6: 6 58
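For comparison, a base R sketch of the same per-position average, assuming every window has the same length so that mapply() simplifies to a matrix:
win <- mapply(function(s, e) df$value[s:e], start, end)  # one column per window
rowMeans(win)
# [1] 53 54 55 56 57 58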
DATA
df <- data.frame(value = c(1:10, 101:110),  # time series data
                 marker = c(NA, NA, 'start', NA, NA, NA, NA, 'end', NA, NA,  # event markers
                            NA, NA, 'start', NA, NA, NA, NA, 'end', NA, NA),
                 stringsAsFactors = FALSE)
I have encountered some unexpected behaviour when using the within() function in R. I (eventually!) tracked the cause to a situation where the last element(s) of particular columns in question in a data frame contain NA.
I have simplified the code to create a reproducible example. Obviously the real-world application in which I encountered this is substantially more complex (a data frame of >500k rows and 400 columns, >100 lines inside within(), etc.), and it is rather inconvenient to avoid using within().
This works as expected:
fooTest <- data.frame(Group = c("Shell", NA, "Cup", NA, NA),
                      CupComposition = c("Metal", NA, "Polyethylene", NA, "Test"),
                      LinerComposition = c("Polyethylene", NA, NA, NA, "Test"))
fooTest$Bearing <- NA
fooTest$Bearing[which(fooTest$Group == "Cup")] <-
  as.character(fooTest$CupComposition[which(fooTest$Group == "Cup")])
fooTest$Bearing[which(fooTest$Group == "Shell")] <-
  as.character(fooTest$LinerComposition[which(fooTest$Group == "Shell")])
fooTest$Bearing
Whereas this (which should be equivalent) throws an error:
fooTest <- data.frame(Group = c("Shell", NA, "Cup", NA, NA),
                      CupComposition = c("Metal", NA, "Polyethylene", NA, "Test"),
                      LinerComposition = c("Polyethylene", NA, NA, NA, "Test"))
fooTest <- within(fooTest, {
  Bearing <- NA
  Bearing[which(Group == "Cup")] <-
    as.character(CupComposition[which(Group == "Cup")])
  Bearing[which(Group == "Shell")] <-
    as.character(LinerComposition[which(Group == "Shell")])
})
The error message is:
Error in `[<-.data.frame`(`*tmp*`, nl, value = list(Bearing = c("Polyethylene",  :
  replacement element 1 has 3 rows, need 5
The last two rows, in which Group is NA, are evidently not being included. NA rows in the middle of the data are OK.
A couple of questions:
The behaviour of within() is a bit unexpected; is this a bug? I am not very experienced, so I am slightly hesitant about filing bugs where it is likely my understanding that is deficient!
In this particular case, I expect there is a neater way to populate the "Bearing" column than the method I have employed. Suggestions welcome!
I tend to use %in% in this case; it handles NAs more gracefully:
fooTest <- data.frame(Group = c("Shell", NA, "Cup", NA, NA),
                      CupComposition = c("Metal", NA, "Polyethylene", NA, "Test"),
                      LinerComposition = c("Polyethylene", NA, NA, NA, "Test"))
fooTest <- within(fooTest, {
  Bearing <- NA
  Bearing[Group %in% "Cup"] <-
    as.character(CupComposition[Group %in% "Cup"])
  Bearing[Group %in% "Shell"] <-
    as.character(LinerComposition[Group %in% "Shell"])
})
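The reason this works: == propagates NA, whereas %in% (built on match()) always returns TRUE or FALSE, so the logical index covers every row and no NA subscripts appear:
NA == "Cup"    # NA
NA %in% "Cup"  # FALSE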
Regarding the error message using within, you can try:
within(fooTest, {
  Bearing <- NA
  Bearing[Group == 'Cup' & !is.na(Group)] <-
    as.character(CupComposition)[Group == 'Cup' & !is.na(Group)]
  Bearing[Group == 'Shell' & !is.na(Group)] <-
    as.character(LinerComposition)[Group == 'Shell' & !is.na(Group)]
})
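For what it's worth, the underlying cause (as far as I can tell) is vector growth under subassignment: inside within(), Bearing starts as a length-1 NA, and indexing only grows it as far as the largest index assigned, which is 3 here because which() skips the NA rows at positions 4 and 5. within() then cannot insert a length-3 column into a 5-row data frame. A minimal illustration:
Bearing <- NA
Bearing[which(fooTest$Group == "Cup")] <- "Polyethylene"    # assigns index 3
Bearing[which(fooTest$Group == "Shell")] <- "Polyethylene"  # assigns index 1
length(Bearing)
# [1] 3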
It is not clear whether the Group column and the other columns follow some order. From the column names, I couldn't find a common pattern that helps in matching the elements of Group. Based on the example provided, you could also do (for the bigger dataset):
fooTest1 <- fooTest
fooTest1[] <- lapply(fooTest1, as.character)  # convert the columns to character class
Un1 <- sort(unique(na.omit(fooTest1$Group)))
m1 <- do.call(cbind, Map(function(v, x, y)
  ifelse(v == y & !is.na(v), x, NA), list(fooTest1[, 1]),
  fooTest1[, -1], Un1))
indx1 <- which(!is.na(m1), arr.ind = TRUE)[, 1]
fooTest1$Bearing <- NA
fooTest1$Bearing[indx1] <- m1[!is.na(m1)]
fooTest1
# Group CupComposition LinerComposition Bearing
#1 Shell Metal Polyethylene Polyethylene
#2 <NA> <NA> <NA> <NA>
#3 Cup Polyethylene <NA> Polyethylene
#4 <NA> <NA> <NA> <NA>
#5 <NA> Test Test <NA>
At some point in time, I encountered this problem...and solved it. However, as it is a recurring problem and I've now forgotten the solution, hopefully this question will offer clarification to others as well as me :)
I am creating a variable that is based on answers to several questions. Each question can have three values: 1, 2, or NA. 1's and 2's are mutually exclusive for each observation.
I simply want to create a variable that is a composite of whichever choice is coded "1" for each person, and give it a value based on that code.
So let's say I have this df:
ID  var1  var2  var3  var4
 1     1     2    NA    NA
 2    NA    NA     2     1
 3     2     1    NA    NA
 4     2    NA     1    NA
I then try to recode based on the following statement:
df$var <-
  ifelse(
    as.numeric(df$var1) == 1,
    "Gut instinct",
    ifelse(
      as.numeric(df$var2) == 1,
      "Data",
      ifelse(
        as.numeric(df$var3) == 1,
        "Science",
        ifelse(
          as.numeric(df$var4) == 1,
          "Philosophy",
          NA
        )
      )
    )
  )
However, this code only PARTIALLY codes based on the ifelse. For example, df$var might contain observations of 'Gut instinct' and 'Philosophy', while the codings for rows where var2 or var3 == 1 are still NA.
Any thoughts on why this might be happening?
An alternative that will be quicker than apply (using @MrFlick's data):
vals <- c("Gut", "Data", "Science", "Phil")
# Logical matrix: TRUE where a response column equals 1 (NA-safe)
intm <- dd[-1] == 1 & !is.na(dd[-1])
dd$resp <- NA
# row()/col() give the row and column index of each TRUE cell
dd$resp[row(intm)[intm]] <- vals[col(intm)[intm]]
How much quicker? On 1 million rows:
#row/col assignment
user system elapsed
0.99 0.02 1.02
#apply
user system elapsed
11.98 0.04 12.30
And giving the same results when tried on identical datasets:
identical(flick$resp,latemail$resp)
#[1] TRUE
This is because ifelse (and ==) has special behavior for NA. Specifically, R doesn't want to tell you that NA is different from 1 (or anything else), because often NA is used to represent a value that could be anything, maybe even 1.
> 1 == NA
[1] NA
> ifelse(NA == 1, "yes", "no")
[1] NA
With your code, if an NA occurs before a 1 (like for ID 2), then that ifelse statement will just return NA, and the nested FALSE ifelse will never be called.
Here's a way to do this without the nested ifelse statements:
# your data
dd <- data.frame(ID = 1:4,
                 var1 = c(1, NA, 2, 2),
                 var2 = c(2, NA, 1, NA),
                 var3 = c(NA, 2, NA, 2),
                 var4 = c(NA, 1, NA, NA))
resp <- c("Gut","Data","Sci","Phil")[apply(dd[,-1]==1,1,function(x) which(x)[1])]
cbind(dd, resp)
I use apply to scan across the rows to find the first 1 and use that index to subset the response values. Using which helps to deal with the NA values.
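As a small aside on the NA handling: which() silently drops NAs, and a row with no 1 at all yields an NA index, which in turn selects NA from the response values:
which(c(NA, TRUE, NA))[1]  # 2: NA entries in the logical vector are ignored
which(c(NA, NA, NA))[1]    # NA: nothing is TRUE, so the row stays NA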
To answer your question: it is due to the NAs in your data. This should sort out your problem:
df <- data.frame(ID = 1:4, var1 = c(1, NA, 2, 2), var2 = c(2, NA, 1, NA),
                 var3 = c(NA, 2, NA, 2), var4 = c(NA, 1, NA, NA))
df$var <- ifelse(as.numeric(df$var1) == 1 & !is.na(df$var1), "Gut instinct",
           ifelse(as.numeric(df$var2) == 1 & !is.na(df$var2), "Data",
            ifelse(as.numeric(df$var3) == 1 & !is.na(df$var3), "Science",
             ifelse(as.numeric(df$var4) == 1 & !is.na(df$var4), "Philosophy", NA))))
However, I would find it easier to reshape the data into long format rather than working across a wide table, and do it using a single vector.
data <- df
library(reshape2)
long <- melt(data, id.vars="ID")
long
This gives you a long-format data frame. Convert the var titles to something more meaningful:
library(stringr)
long$variable <- str_replace(long$variable, "var1", "Gut Instinct")
long$variable <- str_replace(long$variable, "var2", "Data")
long$variable <- str_replace(long$variable, "var3", "Science")
long$variable <- str_replace(long$variable, "var4", "Philosophy")
And now you can decide what to do based on each result
long$var <- ifelse(long$value==1, long$variable, NA)
and convert it back to something like the original if you want it that way
reshape(data=long, timevar="ID",idvar=c("var", "variable"), v.names = "value", direction="wide")
HTH