dataframe where one column only has na values omitted - r

I have a data frame "accdata".
dim(accdata)
[1] 6496 188
One of the variables - "VAL" is of interest to me. I must calculate the number of instances where VAL is equal to 24.
I tried a few functions that returned error messages. After some research it seems I need to remove the NA values from VAL first.
I would try something like nonaaccdaa <- na.omit(accdata) except this removes instances of NA in any variable, not just VAL.
I tried nonaval <- na.omit(accdata[accdata$VAL]) but when I then checked the number of rows using nrow the result was NULL. I had expected a value between 1 and 6,496.
What's going on here?

This should do the trick:
sum(accdata$VAL == 24, na.rm=TRUE)
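As a minimal sketch of the two things asked about above (counting matches while ignoring NA, and dropping only the rows where VAL is NA), the following uses a small made-up stand-in for accdata; the OTHER column and all values are assumptions for illustration. It also shows that nrow() returns NULL for a plain vector, which is one way to end up with a NULL row count.

```r
# Toy stand-in for accdata (hypothetical values, not the asker's data)
accdata <- data.frame(VAL = c(24, NA, 10, 24, NA),
                      OTHER = c(NA, 1, 2, 3, 4))

# Count rows where VAL equals 24; the NA comparisons are dropped by na.rm
n24 <- sum(accdata$VAL == 24, na.rm = TRUE)

# Keep only rows where VAL is not NA; NA in other columns is left alone
nonaval <- accdata[!is.na(accdata$VAL), ]
n_rows <- nrow(nonaval)

# nrow() is NULL for a plain vector; use length() for vectors instead
vec_rows <- nrow(accdata$VAL)
```

Note the comma in accdata[!is.na(accdata$VAL), ]: it selects rows, whereas a single index without a comma selects columns.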

Related

Replace missing datapoints with the average of 2 other distant observations when there are multiple missing observations

I have a dataset of net hourly animal movements but there are several occasions where observers were periodically absent. I wish to replace the missing datapoints (in a new column) with the average of the same time period 24 hours before and after the missing datapoint.
Example data:
#Data Creation
Day1<- rep(1,24)
Day2<- rep(2,24)
Day3<- rep(3,24)
Day<- c(Day1,Day2,Day3)
Hour<- rep(0:23,3)
Net <- round(rnorm(length(Day),mean = 2))
Dat<- data.frame(Day= Day,Hour= Hour,Net= Net)
#Populate missing observations
Dat[27,3]<- NA
Dat[31,3]<- NA
Dat
I initially applied a function (below) that would locate a single missing value and then index the missing datapoint to locate and take the average of the rows 24 hours before and after the missing point.
Dat$new.net<- sapply(Dat[,3],function(x)
if_else(is.na(x), mean(c(Dat[which(is.na(Dat),arr.ind = T)[1]-24,3],Dat[which(is.na(Dat),arr.ind = T)[1]+24,3])),x))
I cannot find a way to make the function I used for 1 missing value work for multiple missing occasions, producing a unique average for each missing value. Currently the code only uses the average for the first missing value due to the "Dat[which(is.na(Dat),arr.ind = T)[1]"
How can I alter my code to work for multiple missing values, or is there a more elegant solution?
PS. I know I will have issues if there are missing values in the first or final 23 hours. I will cross that bridge when I get there.
Any help will be greatly appreciated!
We could get the indices of the NA values, subtract 24 from and add 24 to each of them, take the rowMeans of the two sets of neighbouring values after cbinding them, and assign the result to the missing positions:
ind <- which(is.na(Dat[[3]]))
ind_minus <- ind - 24
ind_minus[ind_minus < 1] <- NA
ind_plus <- ind + 24
ind_plus[ind_plus > nrow(Dat)] <- NA
Dat[[3]][ind] <- rowMeans(cbind(Dat[[3]][ind_minus], Dat[[3]][ind_plus]),
na.rm = TRUE)
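Since the question asks for the filled values in a new column, here is a hedged variant of the same idea that leaves Net untouched and writes to new.net instead. The seed and the helper names before and after are assumptions added for illustration.

```r
set.seed(1)  # assumed seed, only so the example is reproducible
Dat <- data.frame(Day = rep(1:3, each = 24),
                  Hour = rep(0:23, 3),
                  Net = round(rnorm(72, mean = 2)))
Dat$Net[c(27, 31)] <- NA

ind <- which(is.na(Dat$Net))   # rows with a missing observation
before <- ind - 24             # same hour, previous day
after <- ind + 24              # same hour, next day
before[before < 1] <- NA       # no earlier day available
after[after > nrow(Dat)] <- NA # no later day available

# Copy Net, then fill each gap with the mean of its two neighbours.
# If both neighbours are missing, rowMeans(..., na.rm = TRUE) gives NaN.
Dat$new.net <- Dat$Net
Dat$new.net[ind] <- rowMeans(cbind(Dat$Net[before], Dat$Net[after]),
                             na.rm = TRUE)
```

Each missing row gets its own average because before and after are vectors with one entry per missing index.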

Affect value in R dataframe without checking if the index is empty

df = data.frame(A=c(1,1),B=c(2,2))
df$C = NA
df[is.na(df$B),]$C=5
Whenever I try to assign a value and the index turns out to be empty, as with is.na(df$B) here, R raises the error "replacement has 1 row, data has 0".
Is there a way to make R simply assign nothing in these cases instead of raising an error?
We can do this in a single line instead of assigning 'C' to NA and then subsetting the data.frame. The below code will assign 5 to 'C' where there are NA elements in 'B' or else it will be NA
df$C[is.na(df$B)] <- 5
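A small sketch of both situations, assuming C is initialised to NA first as in the question: one data frame with no NA in B (nothing is assigned, no error) and one with an NA in B. The second data frame, df2, is made up for illustration.

```r
# Case 1: no NA in B, so the index is empty and nothing is assigned
df <- data.frame(A = c(1, 1), B = c(2, 2))
df$C <- NA
df$C[is.na(df$B)] <- 5

# Case 2 (hypothetical data): one NA in B, so only that row gets 5
df2 <- data.frame(A = c(1, 1), B = c(2, NA))
df2$C <- NA
df2$C[is.na(df2$B)] <- 5
```

The difference from the original attempt is that the vector df$C is indexed, rather than assigning into a zero-row data frame subset.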

How to split a dataframe with missing values?

I've got a dataframe that I need to split based on the values in one of the columns - most of them are either 0 or 1, but a couple are NA, which I can't get to form a subset. This is what I've done:
all <- read.csv("XXX.csv")
splitted <- split(all, all$case_con)
dim(splitted[[1]]) #--> gives me 185
dim(splitted[[2]]) #--> gives me 180
but all contained 403 rows, which means that 38 NA values were left out and I don't know how to form a similar subset to the ones above with them. Any suggestions?
Try this:
splitted <- c(split(all, all$case_con), list(subset(all, is.na(case_con))))
This should tack the subset of rows with NA onto the end of the list as its last element.
list(split(all, all$case_con), split(all, is.na(all$case_con)))
I think this would work.
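To make the first suggestion concrete, here is a minimal sketch on a made-up data frame (all_rows, its id column, and the element name "missing" are assumptions). It also shows base R's addNA(), which turns NA into a factor level so that split() keeps those rows as a group of their own.

```r
# Toy data: case_con is 0, 1, or NA (hypothetical values)
all_rows <- data.frame(id = 1:6,
                       case_con = c(0, 1, NA, 1, 0, NA))

# split() drops rows whose grouping value is NA, so append them by hand
splitted <- c(split(all_rows, all_rows$case_con),
              list(missing = all_rows[is.na(all_rows$case_con), ]))

# Alternative: make NA an explicit factor level before splitting
splitted_alt <- split(all_rows, addNA(all_rows$case_con))

group_sizes <- sapply(splitted, nrow)
```

Either way the group sizes should add up to the number of rows in the original data frame, which is a quick check that no rows were lost.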

Why do I get "number of items to replace is not a multiple of replacement length"

I have a dataframe combi including two variables DT and OD.
I have a few missing values (NA) in both DT and OD, but not necessarily in the same records.
I then try to replace missing values in DT with OD wherever OD is not missing, but I get the warning "number of items to replace is not a multiple of replacement length". I can see that it means a mismatch in length, but I don't understand how two columns in the same dataframe can have different lengths. More seriously, the output is not fully correct (see below).
combi$DT[is.na(combi$DT) & ! is.na(combi$OD) ] <- combi$OD
Output
id DT OD
67 2010-12-12 2010-12-12
68 NA NA
69 NA 2010-12-12
70 NA NA
I would have expected DT to be 2010-12-12 for id=69 (the dates are POSIXct).
There must be something I don't understand about length in dataframes. Can anybody help?
Because the number of items to replace is not a multiple of replacement length. The number of items to replace is the number of rows where is.na(combi$DT) & !is.na(combi$OD) which is less than the number of rows in combi (and thus the length of the replacement).
You should use ifelse:
combi$DT <- ifelse(is.na(combi$DT) & !is.na(combi$OD), combi$OD, combi$DT)
N.B. the & !is.na(combi$OD) is redundant: if both are NA, the replacement will be NA. So you can just use
combi$DT <- ifelse(is.na(combi$DT), combi$OD, combi$DT)
The warning is produced because you are trying to assign all combi$OD to the places where combi$DT is NA. For example if you have 100 rows of 2 variables with 5 NAs, then you are telling it to replace those 5 NAs of variable1 with the 100 values of variable2. Hence the warning. Try this instead,
combi$DT[is.na(combi$DT) & !is.na(combi$OD)] <- combi$OD[is.na(combi$DT) & !is.na(combi$OD)]
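Here is a small sketch of the indexed replacement on data shaped like the output in the question; the values, the UTC time zone, and the helper name fill are assumptions. Because the columns are POSIXct, it is worth knowing that ifelse() drops the date-time class and returns plain numbers, whereas indexed assignment keeps the class.

```r
# Toy version of combi (hypothetical values, UTC assumed)
combi <- data.frame(id = 67:70)
combi$DT <- as.POSIXct(c("2010-12-12", NA, NA, NA), tz = "UTC")
combi$OD <- as.POSIXct(c("2010-12-12", NA, "2010-12-12", NA), tz = "UTC")

# ifelse() returns the underlying numbers, not POSIXct
via_ifelse <- ifelse(is.na(combi$DT), combi$OD, combi$DT)

# Same logical index on both sides, so the lengths always match
fill <- is.na(combi$DT) & !is.na(combi$OD)
combi$DT[fill] <- combi$OD[fill]
```

Storing the condition in one variable and using it on both sides avoids the length mismatch and means the condition is written only once.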

correlation loop keep getting NA

Despite using two complete columns where every element is numeric and no numbers are missing for rows 2 thru 570, I find it impossible to get a result other than NA when using a loop to find a rolling 24-week correlation between the two columns.
rolling.correlation <- NULL
temp <- NULL
for(i in 2:547){
temp <- cor(training.set$return.SPY[i:i+23],training.set$return.TLT[i:i+23])
rolling.correlation <- c(rolling.correlation, temp)
} #end "for" loop
rolling.correlation
The cor() command works fine for [2:25], [i:25], or [2:i], but R doesn't do what I expect when I write [i:i+23].
I want R to calculate a correlation for rows 2 thru 25, then 3 thru 26, ..., 547 thru 570. The result should be a vector of length 546 which has numeric values for each correlation. Instead I'm getting a vector of 546 NAs. How can I fix this? Thanks for your help.
Look what happens when you do
5:5+2
# [1] 7
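Note that : has higher operator precedence than +, which means 5:5+2 is the same as (5:5)+2 when you really want 5:(5+2). In the loop, i:i+23 therefore evaluates to the single index i+23, and cor() of a single pair of observations is NA. Use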
i:(i+23)
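Putting the fix into the loop, a minimal sketch might look like the following. The data are random stand-ins for training.set (an assumption, as is the seed), and vapply() is used here in place of growing a vector with c(), which is a design choice rather than a requirement.

```r
set.seed(42)  # assumed seed, only so the example is reproducible
training.set <- data.frame(return.SPY = rnorm(570),
                           return.TLT = rnorm(570))

starts <- 2:547  # first row of each 24-row window
rolling.correlation <- vapply(starts, function(i) {
  window <- i:(i + 23)  # parentheses force 24 consecutive rows
  cor(training.set$return.SPY[window], training.set$return.TLT[window])
}, numeric(1))

# Precedence check: (5:5) + 2 is one number, 5:(5 + 2) is three
one_value <- 5:5 + 2
three_values <- 5:(5 + 2)
```

The last window starts at row 547 and ends at row 570, so the result has 546 correlations.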
