Replace a value with NA if it is enclosed between NAs - r

I'm trying to clean my data. Let's imagine that we've got a vector of 20 values with several NAs:
set.seed(1234)
x <- rnorm(20, mean = 10, sd = 5) %>% round
x[c(6, 8, 12, 16, 19)] <- NA
So it looks something like this:
> 4 11 15 -2 12 NA 7 NA 7 6 8 NA 6 10 15 NA 7 5 NA 22
I need to replace the values that are enclosed by NAs with NA. E.g. the 7 at position 7 of my vector should become NA because both the previous and the next values are NA. I can do it with an ifelse statement and some dplyr functions:
library(dplyr)
ifelse(is.na(lag(x))&is.na(lead(x)), NA, x)
> 4 11 15 -2 12 NA NA NA 7 6 8 NA 6 10 15 NA 7 5 NA NA
The question is: how can I replace two consecutive values enclosed by NAs, e.g. 7 and 5? I tried duplicating the condition, i.e. using lag(lag(x)) and lead(lead(x)), but I get a mess.
ifelse(is.na(lag(x))&is.na(lead(x)) | is.na(lead(lead(x)))&is.na(lag(lag(x))), NA, x)
> 4 11 15 -2 12 NA NA NA 7 NA 8 NA 6 NA 15 NA 7 5 NA NA

We can group by NA and count the length of each group. A group of length 3 consists of NA, value, value, so we simply replace those values with NA.
i1 <- cumsum(is.na(x))
x[ave(i1, i1, FUN = function(i)length(i)) == 3] <- NA
#[1] 4 11 15 -2 12 NA 7 NA 7 6 8 NA 6 10 15 NA NA NA NA 22
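The grouping idea above handles runs of exactly two values. As a hedged sketch of a more general variant, run-length encoding with base R's rle() can blank out any run of non-NA values up to a chosen length that has NAs on both sides. The function name na_enclosed and the max_len parameter are illustrative assumptions, not part of the original answer; x is typed in from the printed vector in the question.

```r
# Vector from the question, written out explicitly
x <- c(4, 11, 15, -2, 12, NA, 7, NA, 7, 6, 8, NA, 6, 10, 15, NA, 7, 5, NA, 22)

# Hypothetical helper: set runs of at most `max_len` non-NA values to NA
# when the run is preceded and followed by NA
na_enclosed <- function(x, max_len = 2) {
  r <- rle(is.na(x))            # alternating runs of NA / non-NA
  n <- length(r$values)
  idx <- seq_len(n)
  # interior non-NA runs always have NA runs on both sides
  enclosed <- !r$values & r$lengths <= max_len & idx > 1 & idx < n
  x[rep(enclosed, r$lengths)] <- NA
  x
}

na_enclosed(x)
# 4 11 15 -2 12 NA NA NA 7 6 8 NA 6 10 15 NA NA NA NA 22
```

With max_len = 1 only single enclosed values are replaced. The trailing 22 is kept because no NA follows it.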

Related

How to put the column entry value matching the row number in data frame in R

I have generated random data like this.
data <- replicate(10,sample(0:9,10,rep=FALSE))
ind <- which(data %in% sample(data, 5))
#now replace those indices in data with NA
data[ind]<-NA
#here is our vector with 15 random NAs
data = as.data.frame(data)
rownames(data) = 1:10
colnames(data) = 1:10
data
This produces a 10 x 10 data frame of digits 0-9 and NAs. How can I rearrange the entries in each column so that every numeric value v is placed in the row whose number is v + 1 (that is, the value equals row number - 1), with NA in every row that has no matching value? For example, a column holding 2, 4, 9, 5, 3 and 7 (plus NAs) should end up as NA, NA, 2, 3, 4, 5, NA, 7, NA, 9.
How can I do this? I have no clue at all. I know how to sort in increasing or decreasing order and put the NAs last, but that is not what I want.
You can make a helper function to assign values to indices at (values + 1), then apply the function over all columns:
fx <- function(x) {
  vals <- x[!is.na(x)]          # keep the non-missing values
  pos <- vals + 1               # value v belongs in row v + 1
  out <- rep(NA, length(x))     # start with an all-NA column
  out[pos] <- vals
  out
}
as.data.frame(sapply(data, fx))
1 2 3 4 5 6 7 8 9 10
1 NA 0 NA 0 0 0 0 NA 0 0
2 NA NA NA 1 1 NA NA NA NA NA
3 2 NA 2 2 NA NA NA NA 2 NA
4 3 NA 3 3 NA NA 3 NA 3 3
5 4 4 4 4 NA 4 NA 4 4 NA
6 5 5 NA 5 NA NA 5 5 5 NA
7 NA 6 6 NA 6 NA NA 6 NA NA
8 7 NA 7 7 NA 7 7 NA NA 7
9 NA NA NA NA 8 8 8 8 8 8
10 9 9 NA NA 9 NA NA 9 NA 9
Starting data:
set.seed(13)
data <- replicate(10, sample(
c(0:9, rep(NA, 10)),
10,
replace = FALSE
))
data <- as.data.frame(data)
colnames(data) <- 1:10
data
1 2 3 4 5 6 7 8 9 10
1 2 NA NA 2 NA NA 0 NA 3 7
2 4 NA NA 4 NA NA NA NA 2 9
3 9 9 NA 3 9 4 NA 6 4 0
4 NA NA NA 1 6 NA NA 4 NA NA
5 5 6 3 0 NA NA 5 8 8 NA
6 NA NA 7 NA NA NA 7 NA 5 3
7 3 4 6 NA 1 0 NA 5 NA NA
8 NA NA NA 7 0 7 NA NA 0 NA
9 NA 0 4 NA 8 8 8 9 NA 8
10 7 5 2 5 NA NA 3 NA NA NA
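As a compact alternative sketch of the same idea, match() can look up where each target value (row number - 1) sits in the column. The name fx_match is an illustrative assumption; the example column is the first column of the starting data shown above.

```r
# Hypothetical one-line variant of fx(): for each row i, find the position
# of the value i - 1 in x; rows with no match get NA
fx_match <- function(x) x[match(seq_along(x) - 1, x)]

# First column of the starting data above
col1 <- c(2, 4, 9, NA, 5, NA, 3, NA, NA, 7)
fx_match(col1)
# NA NA 2 3 4 5 NA 7 NA 9
```

It can be applied over all columns in the same way, e.g. as.data.frame(sapply(data, fx_match)). Like fx(), it assumes each value appears at most once per column.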

Is there a way to ignore NA values in a sample function in R?

I would like to randomly select two non-repeating values from each row of my dataframe and insert these values into two columns at the end of the dataframe at the same row.
I'm using sample(), but the problem is that some data is missing. I would like to find a way to use sample() while ignoring the missing data.
I tried specifying na.rm, but it does not work.
What can I do?
Let x be a vector like this:
x <- c(NA, 3, 4, 5, NA)
Now subset x to its non-NA values and sample from that subset.
sample(x[!is.na(x)], 1)
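One caveat worth sketching: when only a single non-NA number remains, sample() treats it as the upper bound of 1:n rather than as a one-element vector. The resample() helper below is the one given in the examples of R's ?sample help page; the vector x here is an illustrative assumption.

```r
# Helper from the examples in ?sample: sample positions, then index
resample <- function(x, ...) x[sample.int(length(x), ...)]

x <- c(NA, 7, NA)
non_missing <- x[!is.na(x)]   # a single value: 7

resample(non_missing, 1)      # always returns 7
# sample(non_missing, 1) would instead draw a number from 1:7
```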
Suppose we have the following data.frame:
set.seed(3)
data <- as.data.frame(matrix(sample(c(1:30,rep(NA,20)),replace = TRUE,size = 24),ncol = 3))
data
V1 V2 V3
1 5 20 29
2 12 10 NA
3 NA NA NA
4 NA NA 5
5 NA NA NA
6 NA 8 NA
7 NA NA 9
8 8 2 9
We can see that some rows have enough values to sample from, while others do not. To handle these edge cases, we can write a custom function:
sample.function <- function(x) {
  if (sum(!is.na(x)) == 0) {
    c(NA, NA)
  } else if (sum(!is.na(x)) == 1) {
    c(x[!is.na(x)], NA)
  } else {
    sample(x[!is.na(x)], size = 2)
  }
}
If there are no non-NA values, the function returns c(NA,NA). If there is only one non-NA value, it returns that value and NA. If there are two or more, it uses the sample function on x which is subset to not include any NA values.
Then we can use the apply function to apply our custom sample.function to our data. Apply binds the results column wise, so we can transpose it with t().
t(apply(data,1,sample.function))
[,1] [,2]
[1,] 20 29
[2,] 10 12
[3,] NA NA
[4,] 5 NA
[5,] NA NA
[6,] 8 NA
[7,] 9 NA
[8,] 2 9
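Now add it to the original data. Note that apply() is called again here, so sample() draws anew and the sampled values can differ from the output above: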
setNames(cbind(data,t(apply(data,1,sample.function))),c("V1","V2","V3","Sample1","Sample2"))
V1 V2 V3 Sample1 Sample2
1 5 20 29 5 29
2 12 10 NA 12 10
3 NA NA NA NA NA
4 NA NA 5 5 NA
5 NA NA NA NA NA
6 NA 8 NA 8 NA
7 NA NA 9 9 NA
8 8 2 9 9 8
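To keep the sampled values consistent between inspection and the final table, one option is to store the sample once and bind that stored result. The following is a minimal sketch under that assumption; the data frame is typed in from the printed example above, and the names samp and result are illustrative.

```r
# Example data copied from the printed data.frame above
data <- data.frame(
  V1 = c(5, 12, NA, NA, NA, NA, NA, 8),
  V2 = c(20, 10, NA, NA, NA, 8, NA, 2),
  V3 = c(29, NA, NA, 5, NA, NA, 9, 9)
)

# Same logic as sample.function above
sample.function <- function(x) {
  vals <- x[!is.na(x)]
  if (length(vals) == 0) {
    c(NA, NA)
  } else if (length(vals) == 1) {
    c(vals, NA)
  } else {
    sample(vals, size = 2)
  }
}

# Draw once, name the columns, then bind the stored result
samp <- t(apply(data, 1, sample.function))
colnames(samp) <- c("Sample1", "Sample2")
result <- cbind(data, samp)
result
```

Calling set.seed() before the apply() line would make the draw reproducible.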

Using if_else in combination with data.table .SD

I need to create new columns in a data.table based on criteria set relative to some of the existing columns. I encountered some problems with missing data, however. Specifically, for each person a few data points are missing. For some individuals, though, the entire data of a questionnaire is missing (see p == 3 or 4 in the example data below). In such cases (i.e. the entire data of a questionnaire is missing) I would like data.table to enter NA in the output for this particular person. I have tried resolving this using if_else from the dplyr package. However, data.table returns NaN or 0 instead of NA as a result even when all data of a person is missing (i.e. when column p is 3 or 4).
This is my current script, which only partially produces the desired output (i.e. correct output for p== 1 or 2, but not for p== 3 or 4).
library(data.table)
library(dplyr)
# Create example datatable
set.seed(4)
p <- c(rep(1, 5), rep(2, 5), rep(3, 5), rep(4, 5))
time1 <- as.integer(c(sample(1:20, 5, replace=TRUE), sample(21:40, 5, replace=TRUE), rep("NA",10)))
closeness1 <- as.integer(c(NA, NA, sample(c(1:40,NA), 7, replace=TRUE), NA, rep("NA",10)))
dt <- data.table::data.table(p, time1, closeness1)
# Compute new columns
dt[, c("mean1", "sum1") := .(
dplyr::if_else(sum(is.na(.SD[time1,]))==length(.SD[time1,]) | sum(is.na(.SD[closeness1,]))==length(.SD[closeness1,]),
as.numeric(NA), .SD[time1 <= 10, mean(closeness1, na.rm=TRUE)]),
dplyr::if_else(sum(is.na(.SD[time1,]))==length(.SD[time1,]) | sum(is.na(.SD[closeness1,]))==length(.SD[closeness1,]),
as.integer(NA), .SD[time1 <= 10, sum(closeness1, na.rm=TRUE)])),
by = p, .SDcols = c("time1", "closeness1")]
The following script produces the output I would want to see. However, this is obviously just for illustrative purposes and I would need to know how to modify the above script to produce the desired outcome:
# Select rows from original data that were as intended
p12 <- dplyr::filter(dt, p %in% c(1,2))
# Create new data.table with corrected output
p <- c(rep(3, 5), rep(4, 5))
time1 <- as.integer(rep("NA",10))
closeness1 <- as.integer(rep("NA",10))
mean1 <- as.integer(rep("NA",10))
sum1 <- as.integer(rep("NA",10))
dt.des <- data.table::data.table(p, time1, closeness1, mean1, sum1)
# Desired output
dsrd.opt <- dplyr::bind_rows(p12, dt.des)
dsrd.opt
p time1 closeness1 mean1 sum1
1 1 12 NA 21.5 43
2 1 1 NA 21.5 43
3 1 6 31 21.5 43
4 1 6 12 21.5 43
5 1 17 5 21.5 43
6 2 26 40 NaN 0
7 2 35 18 NaN 0
8 2 39 19 NaN 0
9 2 39 40 NaN 0
10 2 22 NA NaN 0
11 3 NA NA NA NA
12 3 NA NA NA NA
13 3 NA NA NA NA
14 3 NA NA NA NA
15 3 NA NA NA NA
16 4 NA NA NA NA
17 4 NA NA NA NA
18 4 NA NA NA NA
19 4 NA NA NA NA
20 4 NA NA NA NA
Edit:
It looks like I simplified the above example too much. I basically need to compute the mean of closeness1 based on two separate conditions, once for time1 <= 10 and once for time1 > 10 & time1 <= 21. The respective output should then be saved in two new columns. I have updated the example script accordingly, see below:
dt[, c("mean1", "mean2") := .(
dplyr::if_else(sum(is.na(.SD[time1,]))==length(.SD[time1,]) | sum(is.na(.SD[closeness1,]))==length(.SD[closeness1,]),
as.numeric(NA), .SD[time1 <= 10, mean(closeness1, na.rm=TRUE)]),
dplyr::if_else(sum(is.na(.SD[time1,]))==length(.SD[time1,]) | sum(is.na(.SD[closeness1,]))==length(.SD[closeness1,]),
as.numeric(NA), .SD[time1 > 10 & time1 <= 21, mean(closeness1, na.rm=TRUE)])),
by = p, .SDcols = c("time1", "closeness1")]
Updated example output
dsrd.opt
p time1 closeness1 mean1 mean2
1 1 12 NA 21.5 5
2 1 1 NA 21.5 5
3 1 6 31 21.5 5
4 1 6 12 21.5 5
5 1 17 5 21.5 5
6 2 26 40 NaN NaN
7 2 35 18 NaN NaN
8 2 39 19 NaN NaN
9 2 39 40 NaN NaN
10 2 22 NA NaN NaN
11 3 NA NA NA NA
12 3 NA NA NA NA
13 3 NA NA NA NA
14 3 NA NA NA NA
15 3 NA NA NA NA
16 4 NA NA NA NA
17 4 NA NA NA NA
18 4 NA NA NA NA
19 4 NA NA NA NA
20 4 NA NA NA NA
If I understood you correctly, I would suggest using a simple left join. I think this is pretty straightforward and produces the desired result.
dt_result <- merge(x = dt
, y = dt[time1 <= 10, .(mean1 = mean(closeness1, na.rm = TRUE)
, sum1 = sum(closeness1, na.rm = TRUE)), by = list(p)]
, by.x = "p"
, by.y = "p"
, all.x = TRUE
)
> dt_result
p time1 closeness1 mean1 sum1
1: 1 12 NA 21.5 43
2: 1 1 NA 21.5 43
3: 1 6 31 21.5 43
4: 1 6 12 21.5 43
5: 1 17 5 21.5 43
6: 2 26 40 NA NA
7: 2 35 18 NA NA
8: 2 39 19 NA NA
9: 2 39 40 NA NA
10: 2 22 NA NA NA
11: 3 NA NA NA NA
12: 3 NA NA NA NA
13: 3 NA NA NA NA
14: 3 NA NA NA NA
15: 3 NA NA NA NA
16: 4 NA NA NA NA
17: 4 NA NA NA NA
18: 4 NA NA NA NA
19: 4 NA NA NA NA
20: 4 NA NA NA NA
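For the two-window case described in the question's edit, a plain if inside a small helper may be easier to reason about than if_else. The sketch below is an assumption-laden illustration, not the answerer's method: window_means is a hypothetical helper, and dt is typed in from the printed example. It returns NA when a person's data is entirely missing and otherwise the window means, which are NaN when no rows fall in a window, matching the desired output shown in the question.

```r
library(data.table)

# Example data typed in from the printed table above
dt <- data.table(
  p = rep(1:4, each = 5),
  time1 = c(12L, 1L, 6L, 6L, 17L, 26L, 35L, 39L, 39L, 22L,
            rep(NA_integer_, 10)),
  closeness1 = c(NA, NA, 31L, 12L, 5L, 40L, 18L, 19L, 40L, NA,
                 rep(NA_integer_, 10))
)

# Hypothetical helper: one mean per time window, NA if a column is all NA
window_means <- function(time, close) {
  if (all(is.na(time)) || all(is.na(close))) {
    return(list(NA_real_, NA_real_))
  }
  list(
    mean(close[which(time <= 10)], na.rm = TRUE),
    mean(close[which(time > 10 & time <= 21)], na.rm = TRUE)
  )
}

# The two list elements are recycled to every row of each group
dt[, c("mean1", "mean2") := window_means(time1, closeness1), by = p]
dt
```

which() drops rows where time1 is NA, so they cannot leak into either window.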

Replacing changing values columnwise in a DF

I have a dataframe that looks like this:
x1 y1 z1 x2 y2 z2
1 6 7 8 5 4 10
2 7 8 9 6 5 11
3 8 9 10 7 6 12
4 9 10 11 8 7 13
5 10 11 12 9 8 14
6 11 12 13 10 9 15
Now I want to change the values in x1 and x2 according to this rule: every value in x1 or x2 that is greater than 8 should have 8 subtracted from it, and every value in x1 or x2 that is 8 or smaller should be replaced by NA. Additionally, if a value in x1 or x2 is replaced by NA, the corresponding y1/y2 and z1/z2 values should also be set to NA. The dataframe should look like this:
x1 y1 z1 x2 y2 z2
1 NA NA NA NA NA NA
2 NA NA NA NA NA NA
3 NA NA NA NA NA NA
4 1 10 11 NA NA NA
5 2 11 12 1 8 14
6 3 12 13 2 9 15
The code to generate the dataframe
df1<-data.frame("x1"=6:11,"y1"=7:12,"z1"=8:13,"x2"=5:10,"y2"=4:9,"z2"=10:15)
We create two indexes based on 'x1' and 'x2' and assign the values based on those indexes:
i1 <- df1$x1 <=8 #x1 index
i2 <- df1$x2 <=8 #x2 index
nm1 <- grep("1$", names(df1)) #column index for suffix 1 in column names
nm2 <- grep("2$", names(df1)) #column index for suffix 2 in column names
df1[i1,nm1] <- NA #set the values for suffix 1 columns to NA
df1[i2, nm2] <- NA #set the values for suffix 2 columns to NA
df1[c('x1', 'x2')] <- df1[c('x1', 'x2')] - 8 #subtract 8 from the 'x' columns
df1
# x1 y1 z1 x2 y2 z2
#1 NA NA NA NA NA NA
#2 NA NA NA NA NA NA
#3 NA NA NA NA NA NA
#4 1 10 11 NA NA NA
#5 2 11 12 1 8 14
#6 3 12 13 2 9 15
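If there are more than two suffix groups, the same steps can be sketched as a loop over the suffixes. The variables threshold and suffixes are illustrative assumptions; the column-naming scheme (x, y, z followed by a numeric suffix) is taken from the question.

```r
df1 <- data.frame(x1 = 6:11, y1 = 7:12, z1 = 8:13,
                  x2 = 5:10, y2 = 4:9, z2 = 10:15)

threshold <- 8            # assumed cut-off, as in the question
suffixes <- c("1", "2")   # assumed column suffixes

for (s in suffixes) {
  xcol <- paste0("x", s)
  cols <- grep(paste0(s, "$"), names(df1), value = TRUE)  # e.g. x1, y1, z1
  drop <- df1[[xcol]] <= threshold                        # rows to blank out
  df1[drop, cols] <- NA
  df1[[xcol]] <- df1[[xcol]] - threshold                  # NA stays NA
}
df1
```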
Another approach: apply the condition to x1 and x2 first, then propagate the resulting NAs to the other columns.
# Apply the condition to x1 and x2
df1$x1 <- ifelse(df1$x1 > 8, df1$x1 - 8, NA)
df1$x2 <- ifelse(df1$x2 > 8, df1$x2 - 8, NA)
# x1 and x2 have already been modified, so test for NA instead of re-testing > 8
df1$y1 <- ifelse(is.na(df1$x1), NA, df1$y1)
df1$y2 <- ifelse(is.na(df1$x2), NA, df1$y2)
# Same for the z columns
df1$z1 <- ifelse(is.na(df1$x1), NA, df1$z1)
df1$z2 <- ifelse(is.na(df1$x2), NA, df1$z2)
# Alternative using dplyr (starting again from the original df1)
library(dplyr)
df1[, c("x1", "x2")] <- sapply(df1[, c("x1", "x2")], function(x) ifelse(x > 8, x - 8, NA))
df1 %>%
  mutate(y1 = replace(y1, is.na(x1), NA)) %>%
  mutate(z1 = replace(z1, is.na(x1), NA)) %>%
  mutate(y2 = replace(y2, is.na(x2), NA)) %>%
  mutate(z2 = replace(z2, is.na(x2), NA))
x1 y1 z1 x2 y2 z2
1 NA NA NA NA NA NA
2 NA NA NA NA NA NA
3 NA NA NA NA NA NA
4 1 10 11 NA NA NA
5 2 11 12 1 8 14
6 3 12 13 2 9 15

Search and replace entries in a dataframe in two columns

I have a certain data set in which there are few missing values.
the dataset looks like the following:
a b c0 d0 c1 d1 g h
1 5 20 10 NA NA 2 NA
1 6 NA NA 8 2 NA 4
2 5 25 10 NA NA 2.5 NA
2 7 NA NA 2 2 NA 1
2 8 50 10 NA NA 5 NA
3 9 10 10 NA NA 1 NA
3 6 NA NA 8 4 NA 2
3 10 NA NA 5 1 NA 5
4 5 NA NA 6 2 NA 3
4 11 25 10 NA NA 2.5 NA
My data is in the above mentioned format. Column a is a kind of time period which is in sequence and has multiple codes corresponding to it.
Column b just shows an item. This item either has a repeated entry in time or has an unique value.
Column g and h are just the columns made by dividing column c0/d0 = g and c1/d1 = h. Out here, column g holds more importance.
Some values are NA, and some entries in column b are repeated while the rest are unique.
I have to perform the following steps to fill in the NAs in column 'g':
Check whether each entry in 'column b' is repeated or unique. E.g.: entries 6 and 5 are repeated, whereas 7, 8, 9, 10 and 11 are unique.
Next, check whether 'column g' already holds a value for that item.
If it does, take the average of the non-NA values of 'column g' for the repeated item. E.g. for item 5 the values are 2 and 2.5, so the average of 2.25 should be placed in 'column g' for the repeated 5 at a=4.
If the item is repeated but column g is NA for all of its rows, simply take the 'column h' value as the value of 'column g'.
For the non-repeated items, like 9, 10, 7, etc., just replace the NA in column g with the value of column h.
The final output should be as follows:
a b c0 d0 c1 d1 g h
1 5 20 10 NA NA 2 NA
1 6 NA NA 8 2 4 4
2 5 25 10 NA NA 2.5 NA
2 7 NA NA 2 2 1 1
2 8 50 10 NA NA 5 NA
3 9 10 10 NA NA 1 NA
3 6 NA NA 8 4 2 2
3 10 NA NA 5 1 5 5
4 5 NA NA 6 2 2.25 3
4 11 25 10 NA NA 2.5 NA
Please help me with this. If anything in the question is unclear or more details are needed, let me know.
Your desired output is inconsistent: one row is missing, column h has been altered, and hence column g in the seventh row looks inconsistent too.
Either way, following your description, I would do this in two steps:
First, subset your data to the b values that have duplicates and replace the NAs with the mean of the rest of the group.
Then replace all remaining NAs with the values from column h.
I'd suggest data.table, as it allows convenient operations on subsets.
library(data.table)
setDT(df)[duplicated(b) | duplicated(b, fromLast = TRUE), # operate only on the dupes
g := replace(g, is.na(g), mean(g, na.rm = TRUE)), by = b] # replace NA by group
df[is.na(g), g := as.double(h)] # subset by NAs and replace with corresponding values in h
df
# a b c0 d0 c1 d1 g h
# 1: 1 5 20 10 NA NA 2.00 NA
# 2: 1 6 NA NA 8 2 4.00 4
# 3: 2 5 25 10 NA NA 2.50 NA
# 4: 2 7 NA NA 2 2 1.00 1
# 5: 2 8 50 10 NA NA 5.00 NA
# 6: 3 9 10 10 NA NA 1.00 NA
# 7: 3 6 NA NA 8 2 4.00 4
# 8: 3 10 NA NA 5 1 5.00 5
# 9: 4 5 NA NA 6 2 2.25 3
# 10: 4 11 25 10 NA NA 2.50 NA
We can reduce this to "one" step once we recognize that, when grouping by b, duplicates imply that the group has more than one row. Therefore, the NA values in g are replaced by the mean of the group's non-NA values if:
the number of rows in the group is greater than one and not all values of g in the group are NA.
Otherwise, replace the NA values in g with h:
library(data.table)
setDT(df)[, g := if (.N > 1 && !all(is.na(g))) {
  replace(g, is.na(g), mean(g, na.rm = TRUE))
} else {
  replace(g, is.na(g), as.double(h))
}, by = b][]
## a b c0 d0 c1 d1 g h
## 1: 1 5 20 10 NA NA 2.00 NA
## 2: 1 6 NA NA 8 2 4.00 4
## 3: 2 5 25 10 NA NA 2.50 NA
## 4: 2 7 NA NA 2 2 1.00 1
## 5: 2 8 50 10 NA NA 5.00 NA
## 6: 3 9 10 10 NA NA 1.00 NA
## 7: 3 6 NA NA 8 2 4.00 4
## 8: 3 10 NA NA 5 1 5.00 5
## 9: 4 5 NA NA 6 2 2.25 3
##10: 4 11 25 10 NA NA 2.50 NA
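For readers without data.table, the same rule can be sketched in base R with ave(): fill each NA in g with the group mean where one exists, otherwise fall back to h. This is an illustrative sketch, not one of the original answers; the data frame keeps only the columns a, b, g and h from the question's input table, and grp_mean is an assumed name.

```r
# Columns a, b, g, h from the question's input table
df <- data.frame(
  a = c(1, 1, 2, 2, 2, 3, 3, 3, 4, 4),
  b = c(5, 6, 5, 7, 8, 9, 6, 10, 5, 11),
  g = c(2, NA, 2.5, NA, 5, 1, NA, NA, NA, 2.5),
  h = c(NA, 4, NA, 1, NA, NA, 2, 5, 3, NA)
)

# Mean of the non-NA g values per item b; NA when the item has none
grp_mean <- ave(df$g, df$b, FUN = function(v) {
  if (all(is.na(v))) NA_real_ else mean(v, na.rm = TRUE)
})

# Keep existing g; otherwise use the group mean, and h as a last resort
df$g <- ifelse(is.na(df$g),
               ifelse(is.na(grp_mean), df$h, grp_mean),
               df$g)
df$g
# 2 4 2.5 1 5 1 2 5 2.25 2.5
```

A unique item with a missing g has no group mean, so it falls through to h without needing a separate duplicate check.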
