Replacing NA with mean using loop in R - r

I have to solve this problem using a loop in R (I am aware that you can do it much more easily without loops, but it is for school...).
So I have a vector with NAs like this:
trades<-sample(1:500,150,T)
trades<-trades[order(trades)]
trades[sample(10:140,25)]<-NA
and I have to create a for loop that replaces each NA with the mean of the 2 numbers before the NA and the 2 numbers after it.
I am able to do this with a loop like this:
for (i in 1:length(trades)) {
if (is.na(trades[i])==T) {
trades[i] <- mean(c(trades[c(i-1:2)], trades[c(i+1:2)]), na.rm = T)
}
}
But there is another part to the homework. If there is an NA within the 2 previous or 2 following numbers, then you have to replace the NA with the mean of the 4 previous numbers and the 4 following numbers (I presume with the NAs removed). But I just am not able to crack it... I have the best results with this loop:
for (i in 1:length(trades)) {
if (is.na(trades[i])==T && is.na(trades[c(i-1:2)]==T || is.na(trades[c(i+1:2)]==T))) {
trades[i] <- mean(c(trades[c(i-1:4)], trades[c(i+1:4)]), na.rm = T)
}else if (is.na(trades[i])==T){
trades[i] <- mean(c(trades[c(i-1:2)], trades[c(i+1:2)]))
}
}
But it still misses some NAs.
Thank you for your help in advance.

We can use na.approx from zoo
library(zoo)
na.approx(trades)
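Note that na.approx fills gaps by linear interpolation, which is not the same rule as the window mean asked for in the homework. As a rough illustration of what interpolation does to interior NAs, here is a minimal base-R sketch using approx() from the stats package (the vector x and the names idx and filled are made up for the example; na.approx itself is not called):

```r
# Made-up example vector with interior NAs
x <- c(10, NA, 30, NA, NA, 60)
idx <- seq_along(x)

# approx() drops the NA points and interpolates linearly between the rest
filled <- approx(idx, x, xout = idx)$y
filled
```

Each NA is placed on the straight line between its nearest known neighbours, so a run of NAs gets evenly spaced values rather than one shared mean.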

Here is another solution using a loop. I took a shortcut by using lead and lag from dplyr. First we use 2 recursive functions to calculate the lead and lag sums. Then we use conditional statements to determine whether any of the neighbouring values are missing. Lastly, we fill the missing data using either the output of the recursive functions (divided by 4) or the mean of the previous and following 4 values (with NAs removed). I would note that this is not the way that I would go about this issue, but I tried it out with a loop as requested.
library(dplyr)
r.lag <- function(x, n){
if (n == 1) return(lag(x = x, n = 1))
else return( lag(x = x, n = n) + r.lag(x = x, n = n-1))
}
r.lead <- function(x, n){
if (n == 1) return(lead(x = x, n = 1))
else return( lead(x = x, n = n) + r.lead(x = x, n = n-1))
}
lead.vec <- r.lead(trades, 2)
lag.vec <- r.lag(trades, 2)
output <- vector(length = length(trades))
for(i in 1:length(trades)){
if(!is.na(trades[[i]])){
output[[i]] <- trades[[i]]
}
else if(is.na(trades[[i]]) & !is.na(lead.vec[[i]]) & !is.na(lag.vec[[i]])){
output[[i]] <- (lead.vec[[i]] + lag.vec[[i]])/4
}
else
output[[i]] <- mean(
c(trades[[i-4]], trades[[i-3]], trades[[i-2]], trades[[i-1]],
trades[[i+4]], trades[[i+3]], trades[[i+2]], trades[[i+1]]),
na.rm = T
)
}
tibble(
original = trades,
filled = output
)
#> # A tibble: 150 x 2
#> original filled
#> <int> <dbl>
#> 1 7 7
#> 2 7 7
#> 3 12 12
#> 4 18 18
#> 5 30 30
#> 6 31 31
#> 7 36 36
#> 8 NA 40
#> 9 43 43
#> 10 50 50
#> # … with 140 more rows

So it seems that posting to StackOverflow helped me solve the problem.
trades<-sample(1:500,25,T)
trades<-trades[order(trades)]
trades[sample(1:25,5)]<-NA
which gives us:
[1] NA 20 24 30 NA 77 188 217 238 252 264 273 296 NA 326 346 362 368 NA NA 432 451 465 465 490
and if you run this loop:
for (i in 1:length(trades)) {
if (is.na(trades[i])== T) {
test1 <- c(trades[c(i+1:2)])
if (any(is.na(test1))==T) {
test2 <- c(trades[abs(c(i-1:4))], trades[c(i+1:4)])
trades[i] <- round(mean(test2, na.rm = T),0)
}else {
test3 <- c(trades[abs(c(i-1:2))], trades[c(i+1:2)])
trades[i] <- round(mean(test3, na.rm = T),0)
}
}
}
it changes the NAs to this:
[1] 22 20 24 30 80 77 188 217 238 252 264 273 296 310 326 346 362 368 387 410 432 451 465 465 490
So it works pretty much as expected.
Thank you for all your help.
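For comparison, here is a minimal sketch of the rule as stated in the question, checking both the 2 previous and the 2 following values. The function name fill_na_window is hypothetical, and two choices are assumptions rather than part of the homework text: windows are clipped at the ends of the vector, and the means are taken from the original values rather than from values already filled in earlier iterations.

```r
# Hypothetical helper: fill each NA with a window mean.
fill_na_window <- function(x) {
  n <- length(x)
  out <- x
  for (i in seq_along(x)) {
    if (!is.na(x[i])) next
    # 2 positions before and 2 after, clipped to valid indices (assumption)
    near <- c(i - 2:1, i + 1:2)
    near <- near[near >= 1 & near <= n]
    if (anyNA(x[near])) {
      # an NA nearby: widen to 4 before and 4 after, dropping NAs
      wide <- c(i - 4:1, i + 1:4)
      wide <- wide[wide >= 1 & wide <= n]
      out[i] <- mean(x[wide], na.rm = TRUE)
    } else {
      out[i] <- mean(x[near])
    }
  }
  out
}

fill_na_window(c(10, 20, NA, 40, 50))
```

Reading from x and writing to out keeps the result independent of the order in which the NAs are visited; filling in place, as in the loops above, lets earlier replacements influence later ones.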

Related

How to know if a number is in a given interval in R

I have a dataset with 3 columns: Default, Height and Weight.
I binned the variables and stored the result (I have to do it this way) in a list. Every bin has a woe associated with it, but now I want to put those woes in the original data frame depending on which bucket each observation falls into:
For example, the data frame
df1 <- data.frame(default=sample(c(0,1), replace=TRUE, size=100, prob=c(0.9,0.1)),
height=sample(150:180, 100, replace=T),
weight=sample(50:80,100,replace=T))
> head(df1)
# default height weight
# 1 0 172 54
# 2 0 169 71
# 3 0 164 61
# 4 0 156 55
# 5 0 180 66
# 6 0 162 63
The bins (I will just show the first one)
bins <- lapply(c("height","weight"), function(x) woe.binning(df1, "default", x,
min.perc.total=0.05,
min.perc.class=0.05,event.class=1,
stop.limit = 0.05)[2])
# [[1]]
# [[1]][[1]]
# woe cutpoints.final cutpoints.final[-1] iv.total.final 0 1 col.perc.a col.perc.b iv.bins
# (-Inf,156] -46.58742 -Inf 156 0.1050725 21 5 0.24137931 0.38461538 0.0667299967
# (156,168] 23.91074 156 168 0.1050725 34 4 0.39080460 0.30769231 0.0198727638
# (168,169] -10.91993 168 169 0.1050725 6 1 0.06896552 0.07692308 0.0008689599
# (169, Inf] 25.85255 169 Inf 0.1050725 26 3 0.29885057 0.23076923 0.0176007627
# Missing NA Inf Missing 0.1050725 0 0 0.00000000 0.00000000
Now I want to see which bins my data fall into.
My desired output is something similar to this
# default height weight woe_height woe_weight
# 1 0 160 54 23.91074 -8.180032
# 2 0 140 71 -46.58742 -7.640947
Is there any way to do it? The main problem I see here is that the intervals (a,b) are strings. I was thinking about using substr() or something similar to split the strings into logical conditions, but I don't think that would work, and it's not very elegant.
Any help will be welcome, thanks in advance.
Does this work fine for you?
apply_woe_binning <- function(df, x){
# woe binning
w <- woe.binning(df, "default", x,
min.perc.total=0.05,
min.perc.class=0.05,
event.class=1,
stop.limit = 0.05)[[2]]
# create new column name
new_col <- paste("woe", x, sep = "_")
# define cuts
cuts <- cut(df[[x]], w$cutpoints.final)
# add new column
df[[new_col]] <- w[cuts, "woe", drop = TRUE]
df
}
# one by one
df2 <- apply_woe_binning(df1, "height")
df2 <- apply_woe_binning(df2, "weight")
# in a functional
df2 <- Reduce(function(y, x) apply_woe_binning(df = y, x = x),
c("height","weight"),
init = df1)
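The core of the function above is cut() followed by a lookup, so the interval labels never have to be parsed as strings. Here is a minimal sketch of just that step, with the cutpoints and woe values copied by hand from the height table in the question (the variable names cutpoints, woe, bin and woe_height are made up, and woe.binning is not called):

```r
# Cutpoints and woe values taken from the height bins shown in the question
cutpoints <- c(-Inf, 156, 168, 169, Inf)
woe <- c(-46.58742, 23.91074, -10.91993, 25.85255)

height <- c(160, 140, 169, 175)

# cut() assigns each value to a right-closed interval (a, b]
bin <- cut(height, cutpoints)

# the integer code of the factor is the row of the bin, so it indexes woe directly
woe_height <- woe[as.integer(bin)]
data.frame(height, bin, woe_height)
```

Because cut() uses right-closed intervals by default, 169 lands in (168,169], the same convention as the bin labels in the table.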

Finding nearest matching points

What I would like to do is, for each red point, find the nearest equivalent blue dot on the other side of the abline (i.e. for 1,5 find 5,1).
Data:
https://1drv.ms/f/s!Asb7WztvacfOuesIq4evh0jjvejZ4Q
Edit: to open data do readRDS("path/to/data")
So what I have tried is to find the difference between the x and y coordinates, rank them and then find the min value going down the ranks for both x and y. The results are pretty bad. The thing I'm struggling with is finding a way to find the nearest match of tuples.
My attempt:
find_nearest <- function(query, subject){
weight_df <- data.frame(ID=query$ID)
#find difference of first, then second, rank and find match in both going from top to bottom
tmp_df <- query
for(i in 1:nrow(subject)){
first_order <- order(abs(query$mean_score_n-subject$mean_score_n[i]))
second_order <- order(abs(query$mean_score_p-subject$mean_score_p[i]))
tmp_df$order_1[first_order] <- seq(1, nrow(tmp_df))
tmp_df$order_2[second_order] <- seq(1, nrow(tmp_df))
weight_df[,i+1] <- tmp_df$order_1 + tmp_df$order_2
}
rownames(weight_df) <- weight_df$ID
weight_df$ID <- NULL
print(dim(weight_df))
nearest_match <- list()
count <- 1
subject_ids <- NA
query_ids <- NA
while(ncol(weight_df) > 0 & count <= ncol(weight_df)){
pos <- which(weight_df == min(weight_df, na.rm = TRUE), arr.ind = TRUE)
if(length(unique(rownames(pos))) > 1){
for(i in nrow(pos)){
#if subject/query already used then mask and find another
if(subject$ID[pos[i,2]] %in% subject_ids){
weight_df[pos[i,1],pos[i,2]] <- NA
}else if(query$ID[pos[i,1]] %in% query_ids){
weight_df[pos[i,1],pos[i,2]] <- NA
}else{
subject_ids <- c(subject_ids, subject$ID[pos[i,2]])
query_ids <- c(query_ids, query$ID[pos[i,1]])
nearest_match[[count]] <- data.frame(query=query[pos[i,1],]$ID, subject=subject[pos[i,2],]$ID)
#mask
weight_df[pos[i,1],pos[i,2]] <- NA
count <- count + 1
}
}
}else if(nrow(pos) > 1){
#if subject/query already used then mask and find another
if(subject$ID[pos[1,2]] %in% subject_ids){
weight_df[pos[1,1],pos[1,2]] <- NA
}else if(query$ID[pos[1,1]] %in% query_ids){
weight_df[pos[1,1],pos[1,2]] <- NA
}else{
subject_ids <- c(subject_ids, subject$ID[pos[1,1]])
query_ids <- c(query_ids, query$ID[pos[1,1]])
nearest_match[[count]] <- data.frame(query=query[pos[1,1],]$ID, subject=subject[pos[1,2],]$ID)
#mask
weight_df[pos[1,1],pos[1,2]] <- NA
count <- count + 1
}
}else{
#if subject/query already used then mask and find another
if(subject$ID[pos[2]] %in% subject_ids){
weight_df[pos[1],pos[2]] <- NA
}else if(query$ID[pos[1]] %in% query_ids){
weight_df[pos[1],pos[2]] <- NA
}else{
subject_ids <- c(subject_ids, subject$ID[pos[2]])
query_ids <- c(query_ids, query$ID[pos[1]])
nearest_match[[count]] <- data.frame(query=query[pos[1],]$ID, subject=subject[pos[2],]$ID)
#mask
weight_df[pos[1],pos[2]] <- NA
count <- count + 1
}
}
}
out <- plyr::ldply(nearest_match, rbind)
out <- merge(out, data.frame(subject=subject$ID,
mean_score_p_n=subject$mean_score_p,
mean_score_n_n= subject$mean_score_n), by="subject", all.x=TRUE)
out <- merge(out, data.frame(query=query$ID,
mean_score_p_p=query$mean_score_p,
mean_score_n_p= query$mean_score_n), by="query", all.x=TRUE)
return(out)
}
Edit: is this what the solution looks like for you?
ggplot() +
geom_point(data=B[out,], aes(x=mean_score_p, y= mean_score_n, color="red")) +
geom_point(data=A, aes(x=mean_score_p, y=mean_score_n, color="blue")) +
geom_abline(intercept = 0, slope = 1)
Let
query <- readRDS("query.dms")
subject <- readRDS("subject.dms")
kA <- nrow(subject)
kB <- nrow(query)
A <- as.matrix(subject[, 2:3])
B <- as.matrix(query[, 2:3])
where we want to find the closest "reverse" point (row) in B to each point in A.
Solution permitting non-unique results
Then, assuming that you are using the Euclidean distance,
D <- as.matrix(dist(rbind(A, B[, 2:1])))[(1 + kA):(kA + kB), 1:kA]
unname(apply(D, 2, which.min))
# [1] 268 183 350 284 21 360 132 287 100 298 58 56 170 70 47 305 353
# [18] 43 266 198 58 215 198 389 412 321 255 181 79 340 292 268 198 54
# [35] 390 38 376 47 19 94 244 18 168 201 160 194 114 247 287 273 182
# [52] 87 94 87 192 63 160 244 101 298 62
are the corresponding row numbers in B. The trick was to switch the coordinates of the points in B by using B[, 2:1].
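To see the coordinate swap on something small, here is a sketch with made-up points in place of the real data (the matrices A and B below are invented; only the dist() construction is taken from the answer):

```r
# Made-up points: rows of A are queries, rows of B are candidates
A <- rbind(c(1, 5), c(2, 8))
B <- rbind(c(8, 2), c(5, 1), c(3, 3))
kA <- nrow(A)
kB <- nrow(B)

# Mirror B across the line y = x by swapping its columns, then keep the
# B-by-A block of the full distance matrix
D <- as.matrix(dist(rbind(A, B[, 2:1])))[(1 + kA):(kA + kB), 1:kA, drop = FALSE]

# For each point in A (column), the row of B whose mirror image is closest
nearest <- unname(apply(D, 2, which.min))
nearest
```

Here (1,5) is matched to row 2 of B, which is (5,1), and (2,8) to row 1, which is (8,2).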
Solution with unique results
out <- vector("numeric", length = kA)
colnames(D) <- 1:ncol(D)
rownames(D) <- 1:nrow(D)
while(any(out == 0))
for(i in 1:nrow(D)) {
aux <- apply(D, 2, which.min)
if(i %in% aux) {
win <- which(aux == i)[which.min(D[i, aux == i])]
out[as.numeric(names(win))] <- as.numeric(rownames(D)[i])
D <- D[-i, -win, drop = FALSE]
}
}
out
# [1] 268 183 350 284 21 360 132 213 100 298 22 56 170 70 128 305 353
# [18] 43 266 198 58 215 294 389 412 321 255 181 79 340 292 20 347 54
# [35] 390 38 376 47 19 94 73 18 168 201 160 194 114 247 287 273 182
# [52] 87 365 158 192 63 211 244 101 68 62
whereas
all(table(out) == 1)
# [1] TRUE
confirms uniqueness. The solution is not the most efficient, but on your dataset it takes only a couple of seconds. It takes some time because it keeps going over all the available points in B checking if it is the closest one to any of the points in A. If so, the corresponding point in B is assigned to the closest one in A. Then both the point in A and the point in B are eliminated from the distance matrix. The loop goes until every point in A has some match in B.

R - Sum range over lookback period, divided sum of look back - excel to R

I am looking to work out a percentage total over a look-back range in R.
I know how to do this in excel with the following formula:
=SUM(B2:B4)/SUM(B2:B4,C2:C4)
This sums column B over a range from today looking back 3 lines. It then divides this sum by the total sum of columns B + C, again looking back 3 lines.
I am looking to achieve the same calculation in R to run across my matrix.
The output would look something like this:
adv dec perct
1 69 376
2 113 293
3 270 150 0.355625492
4 74 371 0.359559402
5 308 96 0.513790386
6 236 173 0.491255962
7 252 134 0.663886572
8 287 129 0.639966969
9 219 187 0.627483444
This is a line of code I could perhaps add the look-back range to:
perct <- apply(data.matrix[,c('adv','dec')], 1, function(x) { x[1] / (x[1] + x[2]) } )
If I could get x[1] to sum the previous 3-line range and
if I could get x[2] to also sum the previous 3-line range.
I am still learning how to apply forward and look-back periods within R, so any additional explanation in the answer would be appreciated!
Here are some approaches. The first 3 use rollsumr and/or rollapplyr in zoo and the last one uses only the base of R.
1) rollsumr Create a matrix with rollsumr whose columns contain the rolling sums, convert that to row proportions and take the "adv" column. Finally assign that to a new column frac in DF. This approach has the shortest code.
library(zoo)
DF$frac <- prop.table(rollsumr(DF, 3, fill = NA), 1)[, "adv"]
giving:
> DF
adv dec frac
1 69 376 NA
2 113 293 NA
3 270 150 0.3556255
4 74 371 0.3595594
5 308 96 0.5137904
6 236 173 0.4912560
7 252 134 0.6638866
8 287 129 0.6399670
9 219 187 0.6274834
1a) This variation is similar except instead of using prop.table we write out the ratio. The code is longer but you may find it clearer.
m <- rollsumr(DF, 3, fill = NA)
DF$frac <- with(as.data.frame(m), adv / (adv + dec))
1b) This is a variation of (1) that is the same except it uses a magrittr pipeline:
library(magrittr)
DF %>% rollsumr(3, fill = NA) %>% prop.table(1) %>% `[`(TRUE, "adv") -> DF$frac
2) rollapplyr We could use rollapplyr with by.column = FALSE like this. The result is the same.
ratio <- function(x) sum(x[, "adv"]) / sum(x)
DF$frac <- rollapplyr(DF, 3, ratio, by.column = FALSE, fill = NA)
3) Yet another variation is to compute the numerator and denominator separately:
DF$frac <- rollsumr(DF$adv, 3, fill = NA) /
rollapplyr(DF, 3, sum, by.column = FALSE, fill = NA)
4) base This uses embed followed by rowSums on each column to get the rolling sums and then uses prop.table as in (1).
DF$frac <- prop.table(sapply(lapply(rbind(NA, NA, DF), embed, 3), rowSums), 1)[, "adv"]
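To make the embed step in (4) less opaque, here is a small sketch on the first four rows of the data, written out as plain vectors (the names roll_adv, roll_dec and frac are made up for the illustration):

```r
# First four rows of the example data
adv <- c(69, 113, 270, 74)
dec <- c(376, 293, 150, 371)

# embed(x, 3) puts each value next to the two values before it, one window per row
embed(adv, 3)

# Row sums of those windows are the 3-line rolling sums
roll_adv <- rowSums(embed(adv, 3))
roll_dec <- rowSums(embed(dec, 3))

# Same ratio as the Excel formula SUM(B)/SUM(B, C) over the window
frac <- roll_adv / (roll_adv + roll_dec)
frac
```

embed returns two fewer rows than the input, which is why the one-liner in (4) first pads the data frame with two NA rows so that the result lines up with DF.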
Note: The input used in reproducible form is:
Lines <- "adv dec
1 69 376
2 113 293
3 270 150
4 74 371
5 308 96
6 236 173
7 252 134
8 287 129
9 219 187"
DF <- read.table(text = Lines, header = TRUE)
Consider an sapply that loops through the number of rows in order to index two rows back:
DF$pred <- sapply(seq(nrow(DF)), function(i)
ifelse(i>=3, sum(DF$adv[(i-2):i])/(sum(DF$adv[(i-2):i]) + sum(DF$dec[(i-2):i])), NA))
DF
# adv dec pred
# 1 69 376 NA
# 2 113 293 NA
# 3 270 150 0.3556255
# 4 74 371 0.3595594
# 5 308 96 0.5137904
# 6 236 173 0.4912560
# 7 252 134 0.6638866
# 8 287 129 0.6399670
# 9 219 187 0.6274834

Finding local maxima and minima in R

I'm trying to create functions to find local maxima and minima. I have the following data:
y
157
144
80
106
124
46
207
188
190
208
143
170
162
178
155
163
162
149
135
160
149
147
133
146
126
120
151
74
122
145
160
155
173
126
172
93
I have tried this function to find "maxima"
localMaxima <- function(x) {
# Use -Inf instead if x is numeric (non-integer)
y <- diff(c(-.Machine$integer.max, x)) > 0L
rle(y)$lengths
y <- cumsum(rle(y)$lengths)
y <- y[seq.int(1L, length(y), 2L)]
if (x[[1]] == x[[2]]) {
y <- y[-1]
}
y
}
maks <- localMaxima(x)
And a function to find "minima"
localMinima <- function(x) {
# Use -Inf instead if x is numeric (non-integer)
y <- diff(c(.Machine$integer.max, x)) > 0L
rle(y)$lengths
y <- cumsum(rle(y)$lengths)
y <- y[seq.int(1L, length(y), 2L)]
if (x[[1]] == x[[2]]) {
y <- y[-1]
}
y
}
mins <- localMinima(x)
And the result is not 100% right
maks = 1 5 7 10 12 14 16 20 24 27 31 33 35
mins = 3 6 8 11 13 15 19 23 26 28 32 34 36
The result should be
maks = 5 7 10 12 14 16 20 24 27 31 33 35
mins = 3 6 8 11 13 15 19 23 26 28 32 34
Finding local maxima and minima in R comes close, but doesn't quite fit.
How can I fix this?
Thank you very much
You could define two functions like the below which produce the vectors you need:
library(data.table)
#shift lags or leads a vector by a certain amount defined as the second argument
#the default is to lag a vector.
#The rationale behind the below code is that each local minimum's adjacent
#values will be greater than itself. The opposite is true for a local
#maximum. I think this is what you are trying to achieve and one way to do
#it is the following code
maximums <- function(x) which(x - shift(x, 1) > 0 & x - shift(x, 1, type='lead') > 0)
minimums <- function(x) which(x - shift(x, 1) < 0 & x - shift(x, 1, type='lead') < 0)
Output:
> maximums(y)
[1] 5 7 10 12 14 16 20 24 27 31 33 35
> minimums(y)
[1] 3 6 8 11 13 15 19 23 26 28 32 34
This is a function I wrote a while back (it is more general than you need). It finds peaks in sequential data x, where I define a peak as a local maximum whose m neighbouring points on either side are no higher than it (so a bigger m means a more stringent criterion for peak finding):
find_peaks <- function (x, m = 3){
shape <- diff(sign(diff(x, na.pad = FALSE)))
pks <- sapply(which(shape < 0), FUN = function(i){
z <- i - m + 1
z <- ifelse(z > 0, z, 1)
w <- i + m + 1
w <- ifelse(w < length(x), w, length(x))
if(all(x[c(z : i, (i + 2) : w)] <= x[i + 1])) return(i + 1) else return(numeric(0))
})
pks <- unlist(pks)
pks
}
So for your case, use m = 1:
find_peaks(x, m = 1)
#[1] 5 7 10 12 14 16 20 24 27 31 33 35
and for the minima:
find_peaks(-x, m = 1)
#[1] 3 6 8 11 13 15 19 23 26 28 32 34
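Both answers rest on the same idea: a strict local maximum is where the slope changes from positive to negative. Here is a minimal base-R sketch of that idea on the data from the question (the names s, maks and mins are chosen for the example, and it assumes there are no runs of equal neighbouring values, which holds for this data):

```r
# Data from the question
y <- c(157, 144, 80, 106, 124, 46, 207, 188, 190, 208, 143, 170,
       162, 178, 155, 163, 162, 149, 135, 160, 149, 147, 133, 146,
       126, 120, 151, 74, 122, 145, 160, 155, 173, 126, 172, 93)

# Sign of each step: +1 going up, -1 going down
s <- sign(diff(y))

# A change from +1 to -1 (difference -2) marks a maximum, and the reverse a
# minimum; add 1 to shift from the position of the step to the position of the point
maks <- which(diff(s) == -2) + 1
mins <- which(diff(s) == 2) + 1
maks
mins
```

The endpoints are never reported, since a point needs a neighbour on both sides to count.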

replacing specific elements of a vector

I am trying to write the user-defined function below in R
wrkexpcode.into.month <- function(vec) {
tmp.vec <- vec
tmp.vec[tmp.vec == 0 | tmp.vec == 9] <- NA
tmp.vec[tmp.vec == 1] <- 4
tmp.vec[tmp.vec == 2] <- 13
tmp.vec[tmp.vec == 3] <- 31
tmp.vec[tmp.vec == 4] <- 78
tmp.vec[tmp.vec == 5] <- 174
tmp.vec[tmp.vec == 6] <- 240
return (tmp.vec)
}
but when I execute with a simple command like
wrkexpcode.into.month(c(3,2,2,3,1,3,5,6,4))
the result comes like
[1] 31 13 13 31 78 31 174 240 78
but I expect the result like
[1] 31 13 13 31 **4** 31 174 240 78
How can I fix this?
You have to carefully follow the flow of your function, evaluating what the values are at each step. You are expecting 1 to be replaced by 4 based on tmp.vec[tmp.vec == 1] <- 4; however, in tmp.vec[tmp.vec == 4] <- 78 further down, that 4 is then replaced by 78. This is caused by replacing the values in tmp.vec while also using tmp.vec to determine what needs to be replaced. Like @MattewPlourde said, you need to base the replacement on vec:
tmp.vec[vec == 1] <- 4
Although I would simply replace the code by:
wrkexpcode.into.month <- function(vec) {
translation_vector = c('0' = NA, '1' = 4, '2' = 13, '3' = 31,
'4' = 78, '5' = 174, '6' = 240, '9' = NA)
return(translation_vector[as.character(vec)])
}
wrkexpcode.into.month(c(3,2,2,3,1,3,5,6,4))
# 3 2 2 3 1 3 5 6 4
# 31 13 13 31 4 31 174 240 78
See also a blogpost I wrote recently about this kind of operation.
I think it will be much easier to use one of the many recode functions that are designed for such purposes instead of hard-coding it. It's just a one-liner then, e.g.
library(likert)
x <- c(3,2,2,3,1,3,5,6,4)
recode(x, from=c(0:6, 9), to=c(NA, 4,13,31,78,174,240,NA))
[1] 31 13 13 31 4 31 174 240 78
And if desired, wrap it into a function, e.g.
wrkexpcode.into.month <- function(x)
recode(x, from=c(0:6, 9), to=c(NA, 4,13,31,78,174,240,NA))
wrkexpcode.into.month(x)
[1] 31 13 13 31 4 31 174 240 78
You could create a matrix mapping the input value (column 1) to the desired output value (column 2)
table=matrix(c(0,1,2,3,4,5,6,9,NA,4,13,31,78,174,240,NA),ncol=2)
And using sapply on the vector c(3,2,2,3,1,3,5,6,4)
sapply(c(3,2,2,3,1,3,5,6,4), function(x) table[which(table[,1] == x),2] )
to give you the desired output too
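A base-R variant of the same lookup idea uses match(), which does every replacement in one step against the original values, so an already-replaced 4 can never be picked up again. This is a sketch with made-up names (from, to, recode_months), not code from any of the answers above:

```r
# Lookup table: codes and the months they stand for
from <- c(0:6, 9)
to <- c(NA, 4, 13, 31, 78, 174, 240, NA)

# Hypothetical helper: match() finds each code's position in `from`,
# and that position picks the replacement from `to`
recode_months <- function(vec) to[match(vec, from)]

recode_months(c(3, 2, 2, 3, 1, 3, 5, 6, 4))
```

Codes that do not appear in from come back as NA, since match() returns NA for them.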
