Replacing NAs with random decimals in a particular column in R

I am trying to replace NAs with random decimals in a particular column in R. However, R generates random decimals with the same trailing fraction and just changes the part before the decimal. The following are the methods I tried:
df_LT$ATC[is.na(df_LT$ATC)] <- sample(seq(10.2354897,23.78954214), size=sum(is.na(df_LT$ATC)), replace=T)
With dplyr:
df_LT <- df_LT %>% mutate_at(vars(df_LT$ATC), ~replace_na(., sample(10.2354897:23.78954214, size=sum(is.na(ATC)), replace=T)))
The data looks as below:
A ATC
1 11.2356879
2 42.58974164
3 NA
4 34.25382343
5 NA
Now, wherever there is an NA in the ATC column, I want to insert a decimal number like the others, but in the range 10 to 23. Hope this explanation helps.
I may be missing something very obvious. Thanks for the help in advance.

You are using seq or the colon operator : to create your samples, which means you are sampling from the following sequence:
seq(10.2354897, 23.78954214)
# [1] 10.23549 11.23549 12.23549 13.23549 14.23549 ....
So the starting value is increased by 1 in each step, leaving the numbers after the decimal points fixed.
If you want to sample random numbers within the range of these two limits, you can use runif:
runif(n = 1, min = 10.2354897, max = 23.78954214)
So for your example this translates to:
df_LT$ATC[is.na(df_LT$ATC)] <-
runif(n = sum(is.na(df_LT$ATC)), 10.2354897, 23.78954214)
If you want to add a condition you can do:
df_LT$ATC <-
ifelse(is.na(df_LT$ATC) & df_LT$A == 3,
runif(n = nrow(df_LT), 10.2354897, 23.78954214),
df_LT$ATC)
This checks whether ATC is missing and also whether A is equal to 3. If this is fulfilled, the missing value is replaced with a random number; otherwise the original value (missing or not) is returned.
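Putting the pieces together on the example data from the question (a reproducible sketch; set.seed is added here only so the draw is repeatable):

```r
# Reproduce the small example data frame from the question
df_LT <- data.frame(A = 1:5,
                    ATC = c(11.2356879, 42.58974164, NA, 34.25382343, NA))

set.seed(42)  # only for reproducibility of the random draw
n_missing <- sum(is.na(df_LT$ATC))

# Draw one uniform random number per NA, within the requested range
df_LT$ATC[is.na(df_LT$ATC)] <- runif(n_missing,
                                     min = 10.2354897, max = 23.78954214)

# No NAs remain, and the imputed values fall inside the range
stopifnot(!anyNA(df_LT$ATC))
```

Each imputed value now has its own random fractional part, instead of the fixed trailing fraction produced by seq.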

Related

How to add grouping variable to data set that will classify both an observation and its N neighbors based on some condition

I am having some trouble coming up with a solution that properly handles classifying a variable number of neighbors for any given observation in a data frame based on some condition. I would like to be able to add a simple, binary indicator variable to a data frame that will equal 1 if the condition is satisfied, and 0 if it is not.
Where I am getting stuck is I am unsure how to iteratively check the condition against neighboring observations only, in either direction (i.e., to check if out of 4 neighboring observations in a given column in my data frame, that at least 3 out of 4 of them contain the same value). I have tried first creating another indicator variable indicating if the condition is satisfied or not (1 or 0 = yes or no). Then, I tried setting up a series of ifelse() statements within a loop to try to assign the proper categorization of the observation where the initial condition is satisfied, +/- 2 observations in either direction. However, when I inspect the dataframe after running the loop, only the observation itself (not its neighbors) where the condition is satisfied is receiving the value, rather than all neighboring observations also receiving the value. Here is my code:
#sample data
sample_dat <- data.frame(initial_ind = c(0,1,0,1,0,0,1,1,0,1,0,0))
sample_dat$violate <- NULL
for(i in 1:nrow(sample_dat)){
sample_dat$violate[i] <- ifelse(sample_dat$initial_ind[i]==1 &
((sample_dat$initial_ind[i-2]==1 |
sample_dat$initial_ind[i-1]==1) &
(sample_dat$initial_ind[i+2]==1 |
sample_dat$initial_ind[i+1]==1)),
"trending",
"non-trending"
)
}
This loop correctly identifies one of the four points that needs to be labelled "trending", but it does not also assign "trending" to the correct neighbors. In other words, I expect the output to be "trending" for observations 7-10, since 3 of the 4 observations in that group have a value of 1 in the initial indicator column. I feel like there might be an easier way to accomplish this - but I need to ensure that my code is robust enough to identify and assign observations to a group regardless of whether I want 3/4 to indicate a group, 5/6, 2/5, etc.
Thank you for any and all advice.
You can use the rollapply function from the zoo package to apply a function to set intervals in your data. The question then becomes about creating a function that satisfies your needs. I'm not sure if I've understood correctly, but it seems you want a function that checks if the condition is true for at least 3/5 of the observation plus its four closest neighbors. In this case just adding the 1s up and checking if they're above 2 works.
library(zoo)
sample_dat <- data.frame(initial_ind = c(0,1,0,1,0,0,1,1,0,1,0,0))
trend_test = function(x){
ifelse(sum(x) > 2, "trending", "non-trending")
}
sample_dat$violate_new = rollapply(sample_dat$initial_ind, FUN = trend_test, width = 5, fill = NA)
Edit: If you want a function that checks if the observation and the next 3 observations have at least 3 1s, you can do something very similar, just by changing the align argument on rollapply:
trend_test_2 = function(x){
ifelse(sum(x) > 2, "trending", "non-trending")
}
sample_dat$violate_new = rollapply(sample_dat$initial_ind, FUN = trend_test_2, width = 4,
fill = NA, align = "left")
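If the goal is also to label every observation inside a qualifying window, as the expected output for observations 7-10 suggests, one possible follow-up (a base-R sketch, not part of the original answer) is to find the window start positions first and then flag all of their members:

```r
sample_dat <- data.frame(initial_ind = c(0,1,0,1,0,0,1,1,0,1,0,0))

k <- 4        # window width
thresh <- 3   # minimum number of 1s required in a window

n <- nrow(sample_dat)
# Sum over each left-aligned window of k consecutive observations
# (base-R equivalent of rollapply(..., width = k, FUN = sum, align = "left"))
win_sums <- sapply(1:(n - k + 1),
                   function(s) sum(sample_dat$initial_ind[s:(s + k - 1)]))

# Start positions of windows meeting the threshold
starts <- which(win_sums >= thresh)

# Flag every observation covered by a qualifying window
sample_dat$violate <- "non-trending"
for (s in starts) {
  sample_dat$violate[s:(s + k - 1)] <- "trending"
}
```

With the example data this marks observations 7 through 10 as "trending", and changing k and thresh covers the 3/4, 5/6, 2/5 cases mentioned in the question.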

How get numbers from given range with given condition in R

I want to get all numbers that are greater than 0 and less than 1e6 and do not contain the digit 4. How can I do that, please?
My try was:
library(prob)
A <- c(0:(1e6-1))
V <- subset(A, /*I don't know what to put here*/)
But I don't know how to state that I want all numbers that do not contain the digit 4.
You could use grep to find the indices of numbers containing a 4 and remove them with negative subsetting.
A = 0:1e6
V = A[-grep(4,A)]
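A variant worth noting (an alternative sketch, not the original answer): grepl with logical negation avoids the edge case where grep finds no matches, in which case A[-integer(0)] would return an empty vector rather than all of A. To match the question's open range exactly, the sequence can also start at 1 and stop at 1e6 - 1:

```r
# Open interval (0, 1e6), as asked in the question
A <- 1:(1e6 - 1)

# grepl coerces the numbers to character and returns TRUE where the
# digits contain a "4"; negating keeps the rest, and this also behaves
# correctly when nothing matches
V <- A[!grepl("4", A)]

stopifnot(!any(grepl("4", V)))
```

For 1 to 999999 this keeps 9^6 - 1 = 531440 numbers (each of the six digit positions may take any of the nine digits other than 4, minus the all-zero case).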

Counting consecutive repeats, and returning the maximum value in each in each string of repeats if over a threshold

I am working with long strings of repeating 1's and 0's representing the presence of a phenomenon as a function of depth. If this phenomenon is flagged for over 1m, it is deemed significant enough to use for further analyses, if not it could be due to experimental error.
I ultimately need to get a total thickness displaying this phenomenon at each location (if over 1m).
In a dummy data set the input and expected output would look like this:
#Depth from 0m to 10m with 0.5m readings
depth <- seq(0, 10, 0.5)
#Phenomenon found = 1, not = 0
phenomflag <- c(1,0,1,1,1,1,0,0,1,0,1,0,1,0,1,1,1,1,1,0)
What I would like as an output is a vector with: 4, 5 (which gets converted back to 2m and 2.5m)
I have attempted to solve this problem using
y <- rle(phenomflag)
z <- y$length[y$values ==1]
but once I have my count, I have no idea how to:
a) Isolate 1 maximum number from each group of consecutive repeats.
b) Restrict to consecutive strings longer than (x) - this might be easier after a.
Thanks in advance.
count posted a good solution in the comments section.
y <- rle(phenomflag)
x <- cbind(y$lengths, y$values)
x[which(x[,1] >= 3 & x[,2] == 1)]
This returns just the lengths of the runs of 1s that exceed the threshold of 2; for the example data these are 4 and 5.
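To finish the example from the question, the run lengths can be converted back to thicknesses by multiplying by the 0.5 m reading interval:

```r
# Dummy data from the question: phenomenon found = 1, not found = 0
phenomflag <- c(1,0,1,1,1,1,0,0,1,0,1,0,1,0,1,1,1,1,1,0)

y <- rle(phenomflag)
# Lengths of runs of 1s that span at least 3 consecutive readings
z <- y$lengths[y$lengths >= 3 & y$values == 1]

# Convert reading counts back to metres (0.5 m per reading)
thickness <- z * 0.5
```

Here z is c(4, 5) and thickness is c(2.0, 2.5), matching the expected output in the question.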

simulate x percentage of missing and error in data in r

I would like to perform two things on my fairly large data set of about 10K x 50K. The following is a smaller set of 200 x 10000.
First I want to generate 5% missing values, which is perhaps simple and can be done with a simple trick:
# dummy data
set.seed(123)
# matrix of X variable
xmat <- matrix(sample(0:4, 2000000, replace = TRUE), ncol = 10000)
colnames(xmat) <- paste ("M", 1:10000, sep ="")
rownames(xmat) <- paste("sample", 1:200, sep = "")
Generate missing values at 5% random places in the data.
N <- 2000000*0.05 # 5% random missing values
inds_miss <- round ( runif(N, 1, length(xmat)) )
xmat[inds_miss] <- NA
Now I would like to generate errors, meaning values different from those currently in the matrix. The matrix above holds values from 0 to 4. So what I would like to do is:
(1) Replace a value x with another value that is not x. For example, 0 can be replaced by a random sample from the values that are not 0 (i.e. 1, 2, 3 or 4); similarly, 1 can be replaced by a value that is not 1 (i.e. 0, 2, 3 or 4). Indices where values can be replaced can simply be generated with:
inds_err <- round ( runif(N, 1, length(xmat)) )
If I simply sample values from 0:4 and assign them at these indices, this will sometimes replace a value with the same value (0 with 0, 1 with 1, and so on) without creating an error:
errorg <- sample(0:4, length(inds_err), replace = TRUE)
xmat[inds_err] <- errorg
(2) So what I would like to do is introduce errors into an xmat that already has missing values; however, I do not want the NAs generated in the step above to be replaced with a value (0 to 4). So inds_err should not share any members with the vector inds_miss.
So summary rules :
(1) The missing values should not be replaced with error values
(2) An existing value must be replaced with a different value (which is the definition of an error here); with naive random sampling there is a 1/5 probability of drawing the same value.
How can it be done? I need a fast solution that can be used on my large dataset.
You can try this:
inds_err <- setdiff(round ( runif(2*N, 1, length(xmat)) ),inds_miss)[1:N]
xmat[inds_err]<-(xmat[inds_err]+sample(4,N,replace=TRUE))%%5
With the first line you generate 2*N candidate error indices, then you remove the ones belonging to inds_miss and take the first N. With the second line you add a random number between 1 and 4 to the values you want to change and then take the result mod 5. This way you are sure that the new value is different from the original and still in the 0-4 range.
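A quick check of that claim on a small toy matrix (just a verification sketch, not part of the original answer):

```r
set.seed(1)
# Small stand-in for the 200 x 10000 matrix in the question
x <- matrix(sample(0:4, 100, replace = TRUE), ncol = 10)
old <- x

# Add 1..4 and wrap mod 5: the result can never equal the original value
x[] <- (x + sample(4, length(x), replace = TRUE)) %% 5

stopifnot(all(x != old))     # every value changed
stopifnot(all(x %in% 0:4))   # still in the 0-4 range
```

Since the added offset is always between 1 and 4 (never 0 or a multiple of 5), the wrapped value is guaranteed to differ from the original.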
Here's an if/else solution that could work for you. It uses a for loop, so I am not sure whether that will be okay for you. You could possibly vectorize it in some way to make it faster.
# vector of options
vec <- 0:4
# simple logic-based solution if you just don't want NAs changed
for(i in inds_err){
  if(is.na(xmat[i])){
    next
  }else{
    # sample from the four values different from the current one
    xmat[i] <- sample(vec[vec != xmat[i]], 1)
  }
}
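A more compact variant of the same idea (a sketch using small toy stand-ins for xmat and inds_err, since the real objects are large): keep only the error indices that do not point at an NA, then redraw each remaining cell from the four other values:

```r
set.seed(123)
# Toy stand-ins for the question's xmat and inds_err
xmat <- matrix(sample(0:4, 100, replace = TRUE), ncol = 10)
xmat[c(3, 8, 15)] <- NA          # pretend these are the missing cells
inds_err <- c(1, 3, 5, 20, 40)   # candidate error positions

vec <- 0:4
old <- xmat

# Drop error indices that point at an NA, then redraw each remaining
# cell from the four values different from its current one
ok <- inds_err[!is.na(xmat[inds_err])]
xmat[ok] <- sapply(xmat[ok], function(v) sample(vec[vec != v], 1))
```

The NA cells are never touched, and every redrawn cell is guaranteed to differ from its original value.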

Drop row in a data frame if value has less than 2 decimal places

In one column of a data frame, I have values for longitude. For example:
df<-data.frame(long=c(-169.42000,144.80000,7.41139,-63.07000,-62.21000,14.48333,56.99900))
I want to keep rows which have at least three decimal places (i.e. three non-zero values immediately after the decimal point) and delete all others. So rows 1, 2, 4 and 5 would be deleted from df in the example above.
So far I've tried using grep to extract the rows I want to keep:
new.df<-df[-grep("000$",df$long),]
However this has deleted all rows. Any ideas? I'm new to using grep so there may be glaring errors that I've not picked up on!
Many thanks!
I wouldn't use regex for this.
tol <- .Machine$double.eps ^ 0.5
#use tol <- 0.001 to get the same result as with the regex for numbers like 0.9901
discard <- df$long-trunc(df$long*100)/100 < tol
df[!discard, , drop=FALSE]
# long
# 3 7.41139
# 6 14.48333
# 7 56.99900
You have to modify your regular expression slightly. The following one selects all values with three non-zero digits after the decimal point:
new.df <- df[grep("\\.[1-9][1-9][1-9]", df$long), ]
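Applied to the data frame from the question, this keeps exactly rows 3, 6 and 7 (a small verification sketch; note that grep coerces the numeric column to character, so trailing zeros like those in 56.99900 are dropped before matching):

```r
df <- data.frame(long = c(-169.42000, 144.80000, 7.41139,
                          -63.07000, -62.21000, 14.48333, 56.99900))

# Keep rows whose printed value has three non-zero digits
# immediately after the decimal point
new.df <- df[grep("\\.[1-9][1-9][1-9]", df$long), , drop = FALSE]
```

new.df contains 7.41139, 14.48333 and 56.99900, as requested.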
