Replacing NULLs in one variable based on another - R

I have a dataset consisting of measured variables and categorical variables derived from those measurements, i.e. X1 is a measured variable and Y1 will be either 0 or 1 depending on the measurement in X1.
There were a lot of NA values in the X1 variable, which I have already replaced. I am now trying to fill in the corresponding Y1 variable based on the new value in X1.
So what I'm trying to do with the code below is: if there is an NA in Y1, check whether the corresponding X1 value is less than 34.5. If so, give that Y1 a 0, otherwise a 1.
Data$Y1[is.na(Data$Y1)] <- ifelse(Data$X1 <34.5, 0, 1)
The warning I get:
Warning message:
In x[...] <- m :
number of items to replace is not a multiple of replacement length

A simple loop should do the trick:
for (i in 1:nrow(Data)) {
  if (is.na(Data$Y1[i])) {
    Data$Y1[i] <- ifelse(Data$X1[i] < 34.5, 0, 1)
  }
}
It may not be the most efficient way, but the logic is clear and it runs fairly fast when your dataset isn't huge.
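The warning in the question arises because ifelse(Data$X1 < 34.5, 0, 1) produces one value per row of Data, while the left-hand side selects only the NA positions. A vectorized sketch (with made-up values standing in for Data) that subsets both sides identically:

```r
# Toy data standing in for the question's Data (hypothetical values)
Data <- data.frame(X1 = c(10, 50, 30, 40), Y1 = c(NA, 1, NA, 1))

# Subset X1 to the same rows as the NAs in Y1, so the lengths match
miss <- is.na(Data$Y1)
Data$Y1[miss] <- ifelse(Data$X1[miss] < 34.5, 0, 1)
```

This avoids the loop entirely, and the replacement vector now has exactly as many elements as there are NAs to fill.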


How to add grouping variable to data set that will classify both an observation and its N neighbors based on some condition

I am having some trouble coming up with a solution that properly handles classifying a variable number of neighbors for any given observation in a data frame based on some condition. I would like to be able to add a simple, binary indicator variable to a data frame that will equal 1 if the condition is satisfied, and 0 if it is not.
Where I am getting stuck is that I am unsure how to iteratively check the condition against neighboring observations only, in either direction (i.e., to check whether, out of 4 neighboring observations in a given column of my data frame, at least 3 of them contain the same value). I have tried first creating another indicator variable showing whether the condition is satisfied or not (1 or 0 = yes or no). Then, I tried setting up a series of ifelse() statements within a loop to try to assign the proper categorization to the observations where the initial condition is satisfied, +/- 2 observations in either direction. However, when I inspect the dataframe after running the loop, only the observation itself where the condition is satisfied (not its neighbors) receives the value, rather than all neighboring observations also receiving it. Here is my code:
#sample data
sample_dat <- data.frame(initial_ind = c(0,1,0,1,0,0,1,1,0,1,0,0))
sample_dat$violate <- NULL
for (i in 1:nrow(sample_dat)) {
  sample_dat$violate[i] <- ifelse(sample_dat$initial_ind[i] == 1 &
                                    ((sample_dat$initial_ind[i-2] == 1 |
                                        sample_dat$initial_ind[i-1] == 1) &
                                       (sample_dat$initial_ind[i+2] == 1 |
                                          sample_dat$initial_ind[i+1] == 1)),
                                  "trending",
                                  "non-trending")
}
This loop correctly identifies one of the four points that needs to be labelled "trending", but it does not also assign "trending" to the correct neighbors. In other words, I expect the output to be "trending" for observations 7-10, since 3/4 observations in that group of 4 have a value of 1 in the initial indicator column. I feel like there might be an easier way to accomplish this - but what I need to ensure is that my code is robust enough to identify and assign observations to a group regardless of whether I want 3/4 to indicate a group, 5/6, 2/5, etc.
Thank you for any and all advice.
You can use the rollapply function from the zoo package to apply a function over set intervals in your data. The question then becomes one of creating a function that satisfies your needs. I'm not sure if I've understood correctly, but it seems you want a function that checks whether the condition is true for at least 3 out of 5 of the observation plus its four closest neighbors. In that case, just adding the 1s up and checking whether the sum is above 2 works.
library(zoo)
sample_dat <- data.frame(initial_ind = c(0,1,0,1,0,0,1,1,0,1,0,0))
trend_test = function(x){
  ifelse(sum(x) > 2, "trending", "non-trending")
}
sample_dat$violate_new = rollapply(sample_dat$initial_ind, FUN = trend_test, width = 5, fill = NA)
Edit: If you want a function that checks if the observation and the next 3 observations have at least 3 1s, you can do something very similar, just by changing the align argument on rollapply:
trend_test_2 = function(x){
  ifelse(sum(x) > 2, "trending", "non-trending")
}
sample_dat$violate_new = rollapply(sample_dat$initial_ind, FUN = trend_test_2, width = 4,
                                   fill = NA, align = "left")
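If you prefer not to depend on zoo, the same centered rolling check (width 5, threshold 3, as in the first snippet) can be sketched in base R; positions too close to either edge stay NA, mirroring fill = NA:

```r
# Base-R sketch of the centered rolling check (width 5, threshold 3)
initial_ind <- c(0,1,0,1,0,0,1,1,0,1,0,0)
n <- length(initial_ind)
violate <- rep(NA_character_, n)
for (i in 3:(n - 2)) {
  win <- initial_ind[(i - 2):(i + 2)]  # the observation and its 4 closest neighbors
  violate[i] <- if (sum(win) > 2) "trending" else "non-trending"
}
```

With this sample vector, positions 6, 8 and 9 come out "trending", matching what rollapply with width = 5 would report.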

How to select a specific number of rows before and after predefined values

I am trying to select relevant rows from a large time-series data set. The tricky bit is that the needed rows are before and after certain values in a column.
# example data
x <- rnorm(100)
y <- rep(0,100)
y[c(13,44,80)] <- 1
y[c(20,34,92)] <- 2
df <- data.frame(x,y)
In this case the critical values are 1 and 2 in the df$y column. If, e.g., I want to select 2 rows before and 4 after df$y==1 I can do:
ones <- which(df$y==1)
selection <- NULL
for (i in ones) {
  jj <- (i-2):(i+4)
  selection <- c(selection,jj)
}
df$selection <- 0
df$selection[selection] <- 1
This, arguably, scales poorly for more values. For df$y==2 I would have to repeat it with:
twos <- which(df$y==2)
selection <- NULL
for (i in twos) {
  jj <- (i-2):(i+4)
  selection <- c(selection,jj)
}
df$selection[selection] <- 2
The ideal scenario would be a function doing something similar to this imaginary function: selector(data=df$y, values=c(1,2), before=2, after=5, afterafter = FALSE, beforebefore=FALSE), where values is fed the critical values, before the number of rows to select before, and after correspondingly.
afterafter would allow for the possibility of selecting from a certain row after the value up to a later row, e.g. after=5, afterafter=10 (and beforebefore the same but going in the other direction).
Any tips and suggestions are very welcome!
Thanks!
This is easy enough with rep and its each argument.
df$selection[rep(which(df$y == 2), each=7L) + -2:4] <- 2
Here, rep repeats the row indices that match your criterion 7 times each (two before, the value itself, and four after; the L indicates that the argument should be an integer). Adding the values -2 through 4 to each repetition gives the desired indices. Now, replace.
Note that for some comparisons, == will not be adequate due to numerical precision. See the SO post "Why are these numbers not equal?" for a detailed discussion of this topic. In these cases, you could use something like
which(abs(df$y - 2) < 0.001)
or whatever precision measure will work for your problem.
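The imaginary selector() from the question can be sketched in base R by combining the rep trick above with edge clipping; this is a hypothetical helper, not an existing function, and the before/after semantics are assumed from the question's description:

```r
# Hypothetical selector(): marks `before` rows before and `after` rows after
# every occurrence of each critical value, clipping indices at the edges
selector <- function(y, values, before, after) {
  out <- rep(0, length(y))
  for (v in values) {
    hits <- which(y == v)
    idx <- rep(hits, each = before + after + 1L) + (-before):after
    idx <- idx[idx >= 1 & idx <= length(y)]  # drop out-of-range indices
    out[idx] <- v
  }
  out
}

# Usage on a small example: critical values at rows 5 (a 1) and 15 (a 2)
y <- rep(0, 20); y[5] <- 1; y[15] <- 2
sel <- selector(y, values = c(1, 2), before = 2, after = 4)
```

Each window is marked with its own critical value, matching the question's df$selection convention; overlapping windows are resolved in favor of the later value in values.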

R: how to conditionally replace rows in data frame with randomly sampled rows from another data frame?

I need to conditionally replace rows in a data frame (x) with rows selected at random from another data frame (y). Some of the rows between the two data frames are the same, and so data frame x will contain rows with repeated information. What sort of base R code would I need to achieve this?
I am writing an agent based model in r where rows can be thought of as vectors of attributes pertaining to an agent and columns are attribute types. For agents to transmit their attributes they need to send rows from one data frame (population) to another, but according to conditional learning rules. These rules need to be: conditionally replace values in row n in data frame x if attribute in column 10 for that row is value 1 or more and if probability s is greater than a randomly selected number between 0 and 1. Probability s is itself an adjustable parameter that can take any value from 0 to 1.
I have tried the if function in the code below, but I am new to R and have made a mistake somewhere with it, as I get this warning:
"missing value where TRUE/FALSE needed"
I reckon that I have not specified what should happen to a row if the conditions are not satisfied.
I cannot think of an alternative method of achieving my aim.
Note: agent.dat is data frame x and top_ten_percent is data frame y.
s = 0.7
N = nrow(agent.dat)
copy <- runif(N) #to generate a random probability for each row in agent.dat
for (i in 1:nrow(agent.dat)){
  if(agent.dat[,10] >= 1 & copy < s){
    agent.dat <- top_ten_percent[sample(nrow(top_ten_percent), 1), ]
  }
}
The agent.dat data frame should have rows that are replaced with values from rows in the top_ten_percent data frame if the randomly selected value of copy between 0 and 1 for that row is less than the value of parameter s and if the value for that row in column 10 is 1 or more. For each row I need to replace the first 10 columns of agent.dat with the first 10 columns of top_ten_percent (excluding column 11 i.e. copy value).
Assistance with this problem is greatly appreciated.
So you just need to change a few things.
You need to get a particular value of copy for each iteration of the for loop (use copy[i]), and likewise index the row in the condition (agent.dat[i,10]).
You can then also make the & in the if statement an && (Boolean operators && and ||), since the condition is now a single TRUE/FALSE.
Then you need to replace a particular row (and columns 1 through 10) in agent.dat, instead of the whole thing (agent.dat[i,1:10]).
So, the final code should look like:
copy <- runif(N)
for (i in 1:nrow(agent.dat)){
  if(agent.dat[i,10] >= 1 && copy[i] < s){
    agent.dat[i,1:10] <- top_ten_percent[sample(nrow(top_ten_percent), 1), 1:10]
  }
}
This should fix your errors, assuming your data structure fits your code:
copy <- runif(nrow(agent.dat))
s <- 0.7
for (i in 1:nrow(agent.dat)){
  if(agent.dat[i,10] >= 1 & copy[i] < s){
    agent.dat[i,] <- top_ten_percent[sample(1:nrow(top_ten_percent), 1), ]
  }
}
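Because the row test is independent across rows, the loop can also be replaced with a single logical index and one block assignment. A hedged sketch, with toy stand-ins for agent.dat and top_ten_percent (5 agents, 10 attribute columns, hand-picked copy values so the outcome is deterministic):

```r
# Toy stand-ins for the question's data frames (hypothetical values)
set.seed(42)
agent.dat <- data.frame(matrix(runif(50), nrow = 5))   # 5 agents, 10 attributes
agent.dat[, 10] <- c(0, 1, 2, 0, 3)                    # column 10: the tested attribute
top_ten_percent <- data.frame(matrix(1, nrow = 2, ncol = 10))
s <- 0.7
copy <- c(0.1, 0.2, 0.9, 0.1, 0.3)                     # one draw per row

# One logical test per row, then a single block replacement:
# rows with attribute >= 1 AND copy < s get a random donor row
idx <- agent.dat[, 10] >= 1 & copy < s
agent.dat[idx, 1:10] <-
  top_ten_percent[sample(nrow(top_ten_percent), sum(idx), replace = TRUE), 1:10]
```

Here rows 2 and 5 satisfy both conditions and are overwritten with donor rows; row 3 has a qualifying attribute but fails the copy < s test and is left alone.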

Loop of regressions on input range - How can I avoid the for loop and improve performance?

I am currently backtesting a strategy which involves an lm() regression and a probit glm() regression. I have a dataframe named forBacktest with 200 rows (one for each day to backtest) and 9 columns: the first 8 (x1 to x8) are the explanatory variables and the last one (x9) is the real value (which I am trying to explain in the regression). To do the regression, I have another dataframe named temp which has about 1000 rows (one for each day) and a lot of columns, some of which are the x1 to x8 values and also the x9 value.
But the tricky part is that I do not just generate one regression model and then loop over predict: I select a part of the dataframe temp based on the values of x1, which I split into 8 different ranges, and then, according to the x1 value in forBacktest, I run a regression on the part of temp with x1 in the corresponding range.
So, for each of the 200 rows, I take x1 and, if x1 is between 0 and 1 (for example), I create a subset of temp where all the x1 are between 0 and 1. I then run a regression to explain x9 with x1, x2, ..., x8 (just x1+x2+..., there is no x1:x2, x1^2, ...) and use the predict function with the dataframe forBacktest. If I predict a positive value and x9 is positive, I increment a success counter by one (likewise if both are negative); if one is positive and the other negative, success stays the same. Then I take the next row, and so on. At the end of the 200 rows, I return the average of the successes. In fact, I have two averages: one for the lm regression and the other for the glm regression (same methodology, I just take sign(x9) as the variable to explain).
So my question is: how can I do this efficiently in R, ideally without a big for loop with 200 iterations that, on each iteration, creates a subset of the dataframe, runs the regressions, predicts the two values, updates a counter, and so on? (This is currently my solution, but I find it too slow and not very R-like.)
My code looks like this:
backtest <- function() {
  for (i in 1:dim(forBacktest)[1]) {
    x1 <- forBacktest[i,1]; x2 <- forBacktest[i,2] ... x9 <- forBacktest[i,9]
    a <- ifelse(x1>1.5,1.45,ifelse(x1>1,0.95,....
    b <- ifelse(x1>1.5,100,ifelse(x1>1,1.55,....
    temp2 <- temp[(temp$x1>=a/100)&(temp$x1<=b/100),]
    df <- data.frame(temp$x1,temp$x2,...temp$x9)
    reg <- lm(temp$x9~.,data=df)
    df2 <- data.frame(x1,x2,...x9)
    rReg <- predict(reg,df2)
    trueOrFalse <- ifelse(sign(rReg*x9)>0,1,0)
    success <- success+trueOrFalse
  }
  success
}
The code you have written is more complicated than it needs to be. Things could be much simpler.
Use the cut() and the by() function.
breaks <- 0:8 # the ranges into which you want to divide your data
divider <- cut(forBacktest$x1, breaks)
subsetDat <- by(forBacktest, INDICES = divider, data.frame) # this creates 8 dataframes
reg <- lapply(subsetDat, lm, formula = x9~.)
'reg' will now contain all 8 lm objects corresponding to the 8 ranges. To predict for all these ranges, use lapply() with reg and the temp dataframe. It will return the predicted values for the eight ranges.
A few things to keep in mind:
The method suggested above is simpler and easier to read. It will be faster than your for loops, but as the size of the data frame increases, it could get slower.
The by function takes a dataframe and applies the specified function (data.frame()) to the subsets defined by INDICES, returning a list. So new dataframes are created, and this could take up a lot of space if the dataframe is large.
The *apply() functions are often faster than for loops, and the apply family comes in handy for this kind of operation.
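Putting the cut/lapply idea together with the prediction step, here is a hedged end-to-end sketch on made-up data (the column structure of temp and forBacktest is assumed from the question; split() is used in place of by() since it returns plain dataframes):

```r
# Made-up stand-ins: temp holds the history, forBacktest the rows to predict
set.seed(1)
temp <- data.frame(x1 = runif(400, 0, 8), x2 = rnorm(400))
temp$x9 <- temp$x1 + temp$x2 + rnorm(400)
forBacktest <- data.frame(x1 = runif(10, 0, 8), x2 = rnorm(10))

# One lm per x1 range: split temp by range, fit x9 ~ . within each piece
breaks <- 0:8
models <- lapply(split(temp, cut(temp$x1, breaks)),
                 function(d) lm(x9 ~ ., data = d))

# Predict each backtest row with the model matching its x1 range
rng <- as.character(cut(forBacktest$x1, breaks))
preds <- vapply(seq_len(nrow(forBacktest)),
                function(i) predict(models[[rng[i]]], forBacktest[i, ]),
                numeric(1))
```

The 200-iteration loop collapses to one fit per range (8 fits instead of 200) plus a lookup per row; the success rate would then be mean(sign(preds * forBacktest$x9) > 0) once x9 is available.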

Simulate x percentage of missing values and errors in data in R

I would like to perform two things on my fairly large data set, about 10K x 50K. The following is a smaller set of 200 x 10000.
First I want to generate 5% missing values, which is perhaps simple and can be done with a simple trick:
# dummy data
set.seed(123)
# matrix of X variable
xmat <- matrix(sample(0:4, 2000000, replace = TRUE), ncol = 10000)
colnames(xmat) <- paste ("M", 1:10000, sep ="")
rownames(xmat) <- paste("sample", 1:200, sep = "")
Generate missing values at 5% of randomly chosen positions in the data:
N <- 2000000*0.05 # 5% random missing values
inds_miss <- round(runif(N, 1, length(xmat)))
xmat[inds_miss] <- NA
Now I would like to generate errors (meaning values different from what is in the matrix above, which holds values 0 to 4). What I would like to do:
(1) I would like to replace a value x with another value that is not x (for example, 0 can be replaced by a random sample from the values that are not 0, i.e. 1, 2, 3 or 4; similarly, 1 can be replaced by a value that is not 1, i.e. 0, 2, 3 or 4). Indices where values can be replaced can simply be generated with:
inds_err <- round(runif(N, 1, length(xmat)))
If I randomly sample 0:4 values and replace at these indices, this will sometimes replace a value with itself (0 with 0, 1 with 1 and so on) without creating an error.
errorg <- sample(0:4, length(inds_err), replace = TRUE)
xmat[inds_err] <- errorg
(2) I would like to introduce errors into xmat alongside the missing values. However, I do not want the NAs generated in the step above to be replaced with a value (0 to 4). So inds_err should not share members with the vector inds_miss.
So, summary of the rules:
(1) The missing values should not be replaced with error values.
(2) An existing value must be replaced with a different value (which is the definition of error here) - naive random sampling over 0:4 has a 1/5 probability of violating this.
How can this be done? I need a fast solution that can be used on my large dataset.
You can try this:
inds_err <- setdiff(round(runif(2*N, 1, length(xmat))), inds_miss)[1:N]
xmat[inds_err] <- (xmat[inds_err] + sample(4, N, replace=TRUE)) %% 5
With the first line you generate 2*N possible error indices, then you subtract the ones belonging to inds_miss and take the first N. With the second line you add a random number between 1 and 4 to the values you want to change and then take them mod 5. In this way you are sure that the new value will be different from the original and still in the 0-4 range.
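The guarantee behind the mod-5 trick is easy to check on a small sample: adding an offset of 1 to 4 can never map a value in 0:4 back onto itself, and the mod keeps the result in range.

```r
# Sanity check of the mod-5 trick on 1000 random values in 0:4
set.seed(123)
x <- sample(0:4, 1000, replace = TRUE)
y <- (x + sample(4, 1000, replace = TRUE)) %% 5
```

Every y differs from its x, and all results stay within 0:4, so the "existing value must change" rule holds with no rejection sampling.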
Here's an if/else solution that could work for you. It is a for loop, so not sure if that will be okay for you. Possibly vectorize it in some way to make it faster.
# vector of options
vec <- 0:4
# simple logic-based solution if you just don't want NAs changed
for (i in inds_err) {
  if (is.na(xmat[i])) {
    next
  } else {
    xmat[i] <- sample(vec[vec != xmat[i]], 1)
  }
}
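The NA-guard in that loop can also be done in one vectorized step by filtering the error indices first and then applying the mod-5 shift from the earlier answer. A small self-contained sketch (toy xmat and inds_err, not the question's full matrix):

```r
# Toy data: a vector with some NAs, and candidate error positions
xmat <- c(0, NA, 2, 3, NA, 4, 1)
inds_err <- c(1, 2, 4, 6)

# Drop the positions that are NA, then shift the rest in one step
ok <- inds_err[!is.na(xmat[inds_err])]  # keeps positions 1, 4, 6
xmat[ok] <- (xmat[ok] + sample(4, length(ok), replace = TRUE)) %% 5
```

NAs are left untouched, and every surviving position is guaranteed to receive a value different from its original one.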
