Delete NA data, but with a certain condition, in R

Here is some example data:
nad=structure(list(x1 = 1:5, x2 = c(NA, 2L, 2L, NA, 34L), x3 = c(NA,
1L, NA, NA, NA), x4 = c(NA, 2L, 5L, NA, NA), x5 = c(NA, 3L, NA,
NA, NA), x6 = c(NA, 4L, NA, NA, NA)), .Names = c("x1", "x2",
"x3", "x4", "x5", "x6"), class = "data.frame", row.names = c(NA,
-5L))
x1 x2 x3 x4 x5 x6
1 1 NA NA NA NA NA
2 2 2 1 2 3 4
3 3 2 NA 5 NA NA
4 4 NA NA NA NA NA
5 5 34 NA NA NA NA
Usually, to get complete data without NA, I can use this function:
na.omit(nad)
But my problem is a little more complex.
Even though x2 contains NA, I do not want to delete a row just because x2 is NA.
A row is valuable when it has a value for x1 but not for x2.
If a row has observations for x1 and x2 but not for the other variables, it should be deleted.
Therefore, the first and fourth rows should not be deleted.
Rows 3 and 5 should be deleted, because they have observations for x1 and x2 but the other variables are blank.
The second row is fully complete, so I do not need to delete it.
How can I delete NA rows under such a condition?
Desired output
x1 x2 x3 x4 x5 x6
1 1 NA NA NA NA NA
2 2 2 1 2 3 4
3 4 NA NA NA NA NA
As an additional question (separate, but related) that I may need for analytics:
suppose there is a situation like this
x1 x2 x3 x4 x5 x6
1 1 NA NA NA NA NA
2 2 NA 1 1 1 1
Here the first row has NA for x2 and NA for the other variables,
and the second row has NA for x2, but the other variables are not NA.
In such a case, how can I keep only the rows where x1 has a value, x2 does not, but the other variables do have values?

So maybe you are looking for
nad[!is.na(nad$x1) & is.na(nad$x2) | rowSums(!is.na(nad)) == ncol(nad), ]
# x1 x2 x3 x4 x5 x6
#1 1 NA NA NA NA NA
#2 2 2 1 2 3 4
#4 4 NA NA NA NA NA
This selects rows where x1 has non-NA values and x2 has NA OR all the values in the row are non-NA.
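For the follow-up case in the question (x1 has a value, x2 is NA, and the other variables have values), the same pattern could probably be adapted. This is only a sketch: nad2 is a hypothetical name for the two-row example, and "have values" is assumed to mean that every one of x3 to x6 is non-NA. If "at least one" is meant instead, the last condition would become rowSums(!is.na(nad2[others])) > 0.

```r
# Hypothetical reconstruction of the two-row follow-up example
nad2 <- data.frame(x1 = 1:2,
                   x2 = c(NA_integer_, NA_integer_),
                   x3 = c(NA, 1L), x4 = c(NA, 1L),
                   x5 = c(NA, 1L), x6 = c(NA, 1L))

# Every column except x1 and x2
others <- setdiff(names(nad2), c("x1", "x2"))

# Assumption: "have values" means all of x3..x6 are non-NA
keep <- !is.na(nad2$x1) & is.na(nad2$x2) &
  rowSums(is.na(nad2[others])) == 0

res <- nad2[keep, ]
res
```

Only the second row satisfies all three conditions here.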

I think you would probably be best off checking each row to see whether it satisfies your conditions. If I understood correctly, something like the following could work:
keep <- apply(nad, 1, function(row) {
  # Don't keep data if first column is NA
  if (!is.na(row[[1]])) {
    sumna <- sum(is.na(row[-1]))
    # Only keep if rest is all NA or none is NA
    if (sumna == 0 || sumna == length(row) - 1) {
      return(TRUE)
    } else {
      return(FALSE)
    }
  } else {
    return(FALSE)
  }
})
nad[keep, ]
x1 x2 x3 x4 x5 x6
1 1 NA NA NA NA NA
2 2 2 1 2 3 4
4 4 NA NA NA NA NA
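The same row-wise rule (x1 present, and the remaining columns either all NA or all non-NA) can also be written without apply. A minimal sketch, assuming the nad data from the question; n_na is just an illustrative helper name:

```r
nad <- data.frame(x1 = 1:5,
                  x2 = c(NA, 2L, 2L, NA, 34L),
                  x3 = c(NA, 1L, NA, NA, NA),
                  x4 = c(NA, 2L, 5L, NA, NA),
                  x5 = c(NA, 3L, NA, NA, NA),
                  x6 = c(NA, 4L, NA, NA, NA))

# Number of NA values per row, ignoring the first column
n_na <- rowSums(is.na(nad[-1]))

# Keep rows where x1 is present and the rest is all NA or has no NA at all
keep <- !is.na(nad$x1) & (n_na == 0 | n_na == ncol(nad) - 1)

res <- nad[keep, ]
res
```

This keeps rows 1, 2 and 4, matching the apply version, and avoids the coercion to a matrix that apply performs on each call.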

Related

R Merging non-unique columns to consolidate data frame

I'm having issues figuring out how to merge non-unique columns that look like this:
2_2 2_3 2_4 2_2 3_2
  1   2   3  NA  NA
  2   3  -1  NA  NA
 NA  NA  NA   3  -2
 NA  NA  NA  -2   4
To make them look like this:
2_2 2_3 2_4 3_2
  1   2   3  NA
  2   3  -1  NA
  3  NA  NA  -2
 -2  NA  NA   4
Essentially reshaping any non-unique columns. I have a large data set to work with so this is becoming an issue!
Note that data.frame doesn't allow duplicate column names by default. Even if we create them, they may get modified when we apply functions, because make.unique is applied automatically. Assuming we created the data.frame with duplicate names, one option is to use split.default to split the data into a list of subsets by column name, then loop over the list with map and apply coalesce within each subset
library(dplyr)
library(purrr)
map_dfc(split.default(df1, names(df1)),~ invoke(coalesce, .x))
-output
# A tibble: 4 × 4
`2_2` `2_3` `2_4` `3_2`
<int> <int> <int> <int>
1 1 2 3 NA
2 2 3 -1 NA
3 3 NA NA -2
4 -2 NA NA 4
data
df1 <- structure(list(`2_2` = c(1L, 2L, NA, NA), `2_3` = c(2L, 3L, NA,
NA), `2_4` = c(3L, -1L, NA, NA), `2_2` = c(NA, NA, 3L, -2L),
`3_2` = c(NA, NA, -2L, 4L)), class = "data.frame", row.names = c(NA,
-4L))
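To illustrate the point about duplicate names, and as a possible alternative without dplyr or purrr, here is a base R sketch. coalesce2 is a hypothetical helper (not a library function) that takes the first non-NA of two vectors; the data is the same as df1 above, rebuilt with check.names = FALSE.

```r
# By default data.frame() repairs duplicate names ...
names(data.frame(x = 1, x = 2))                       # "x" "x.1"
# ... unless check.names = FALSE is given
names(data.frame(x = 1, x = 2, check.names = FALSE))  # "x" "x"

df1 <- data.frame(`2_2` = c(1L, 2L, NA, NA),
                  `2_3` = c(2L, 3L, NA, NA),
                  `2_4` = c(3L, -1L, NA, NA),
                  `2_2` = c(NA, NA, 3L, -2L),
                  `3_2` = c(NA, NA, -2L, 4L),
                  check.names = FALSE)

# Hypothetical helper: element-wise first non-NA of two vectors
coalesce2 <- function(a, b) ifelse(is.na(a), b, a)

# Group columns by name, then fold each group into a single column
groups <- split.default(df1, names(df1))
merged <- lapply(groups, function(g) Reduce(coalesce2, g))
res <- data.frame(merged, check.names = FALSE)
res
```

Like coalesce, this keeps the leftmost non-NA value when duplicated columns overlap.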
Also using coalesce:
You use non-syntactic names. R is strict about names (see https://adv-r.hadley.nz/names-values.html); also note the explanation by @akrun:
library(dplyr)
df %>%
mutate(X2_2 = coalesce(X2_2, X2_2.1), .keep="unused")
X2_2 X2_3 X2_4 X3_2
1 1 2 3 NA
2 2 3 -1 NA
3 3 NA NA -2
4 -2 NA NA 4

How to change values across 1 row based on values in a column in R?

I have a lot of columns in 1 dataframe that identify different timepoints of the same variable. Basically, within my data, if there's no response at timepoint X-1, there will be no response at time point X or beyond (after an NA appears in a row, it will continue). I currently have a column that shows which row the last response came from and what that response is. The dataframe currently looks like this:
id X1 X2 X3 X4 X_final X_final_location
1 1 5 5 6 NA 6 X3
2 2 4 NA NA NA 4 X1
3 3 7 1 3 5 5 X4
4 4 8 2 4 2 2 X4
5 5 1 5 NA NA 5 X2
6 6 5 7 7 7 7 X4
My goal is to be able to conduct a regression using the last response of each row as the outcome variable. However, I don't want it to repeat twice in the "X_final" column and also in the column that the response actually comes from. Therefore, I am hoping to find a way to put a "." in for the cell where that value originally came from so it looks like this:
id X1 X2 X3 X4 X_final X_final_location
1 1 5 5 6 NA 6 X3
2 2 . <NA> NA NA 4 X1
3 3 7 1 3 5 5 X4
4 4 8 2 4 2 2 X4
5 5 1 . NA NA 5 X2
6 6 5 7 7 7 7 X4
Any suggestions would be appreciated - thank you!
Another method, since you already have the locations in $X_final_location. As mentioned in the question comments, NA values would be preferable to "." if the goal is regression analysis, because they keep the columns numeric.
data_orig <- data.frame(
  id = c(1, 2, 3, 4, 5, 6),
  X1 = c(5, 4, 7, 8, 1, 5),
  X2 = c(5, NA, 1, 2, 5, 7),
  X3 = c(6, NA, 3, 4, NA, 7),
  X4 = c(NA, NA, 5, 2, NA, 7),
  X_final = c(6, 4, 5, 2, 5, 7),
  X_final_location = c("X3", "X1", "X4", "X4", "X2", "X4")
)
data_new <- data_orig
# For each row, blank out the cell named in X_final_location
for (i in seq_len(nrow(data_new))) {
  data_new[i, data_new$X_final_location[i]] <- NA
}
data_new
# id X1 X2 X3 X4 X_final X_final_location
# 1 1 5 5 NA NA 6 X3
# 2 2 NA NA NA NA 4 X1
# 3 3 7 1 3 NA 5 X4
# 4 4 8 2 4 NA 2 X4
# 5 5 1 NA NA NA 5 X2
# 6 6 5 7 7 NA 7 X4
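If the loop becomes slow on a large data frame, the same replacement could be sketched with matrix indexing. This assumes the timepoint columns are all numeric (so as.matrix keeps them numeric); cols, m and idx are illustrative names:

```r
data_orig <- data.frame(
  id = c(1, 2, 3, 4, 5, 6),
  X1 = c(5, 4, 7, 8, 1, 5),
  X2 = c(5, NA, 1, 2, 5, 7),
  X3 = c(6, NA, 3, 4, NA, 7),
  X4 = c(NA, NA, 5, 2, NA, 7),
  X_final = c(6, 4, 5, 2, 5, 7),
  X_final_location = c("X3", "X1", "X4", "X4", "X2", "X4")
)

cols <- c("X1", "X2", "X3", "X4")
m <- as.matrix(data_orig[cols])

# One (row, column) pair per row: the column named in X_final_location
idx <- cbind(seq_len(nrow(m)), match(data_orig$X_final_location, cols))
m[idx] <- NA

data_new <- data_orig
data_new[, cols] <- m
data_new
```

The result should be the same as with the for loop above.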
One way to do this (using NA instead of . to preserve the data type):
match finds the position of the first NA, and replace sets the value at that position - 1 (the previous one) to NA.
apply(data, 1, function(x) ...) applies that function to each row. Finally, t transposes the result (since apply returns each row's result as a column).
Note that rows without any NA (where the last response is in X4) are left unchanged, as the output shows.
data = data.frame(id = 1:6, X1 = c(5L, 4L, 7L, 8L, 1L, 5L), X2 = c(5L,
NA, 1L, 2L, 5L, 7L), X3 = c(6L, NA, 3L, 4L, NA, 7L), X4 = c(NA,
NA, 5L, 2L, NA, 7L), X_final = c(6L, 4L, 5L, 2L, 5L, 7L), X_final_location = c("X3",
"X1", "X4", "X4", "X2", "X4"))
data[,2:5] <- t(apply(data[,2:5], 1 , function(x) replace(x, match(NA, x) - 1, NA)))
data
#> id X1 X2 X3 X4 X_final X_final_location
#> 1 1 5 5 NA NA 6 X3
#> 2 2 NA NA NA NA 4 X1
#> 3 3 7 1 3 5 5 X4
#> 4 4 8 2 4 2 2 X4
#> 5 5 1 NA NA NA 5 X2
#> 6 6 5 7 7 7 7 X4
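If rows whose last response sits in X4 should be blanked as well, one possible variation is to look for the last non-NA position instead of the first NA. A sketch, assuming (as the question states) that once an NA appears in a row, the rest of the row is NA too; blank_last is a hypothetical helper name:

```r
data <- data.frame(id = 1:6,
                   X1 = c(5L, 4L, 7L, 8L, 1L, 5L),
                   X2 = c(5L, NA, 1L, 2L, 5L, 7L),
                   X3 = c(6L, NA, 3L, 4L, NA, 7L),
                   X4 = c(NA, NA, 5L, 2L, NA, 7L),
                   X_final = c(6L, 4L, 5L, 2L, 5L, 7L),
                   X_final_location = c("X3", "X1", "X4", "X4", "X2", "X4"))

# Hypothetical helper: set the last non-NA element of a vector to NA
blank_last <- function(x) {
  pos <- which(!is.na(x))
  if (length(pos) > 0) x[max(pos)] <- NA
  x
}

data[, 2:5] <- t(apply(data[, 2:5], 1, blank_last))
data
```

With this version, X4 is NA in every row, which agrees with the locations listed in X_final_location.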
Another way using split (grouping by row):
split(data, row.names(data)) <-
lapply(split(data, row.names(data)), \(x) replace(x, x$X_final_location, "."))

In R data frame for each set of rows and column use value that is not na

I have a data frame df of the following structure:
observation x1 x2 x3 x4
"obs1" NA NA NA 51
"obs1" NA NA NA NA
"obs1" NA 25 NA NA
"obs2" NA NA NA NA
"obs2" NA NA NA NA
"obs2" NA NA NA 56
"obs3" 26 NA NA NA
"obs3" NA 82 NA NA
"obs3" NA NA "x" NA
I want a data frame df2 that, for each observation and each column, takes the one value that is not NA. The resulting data frame should look like this:
observation x1 x2 x3 x4
"obs1" NA 25 NA 51
"obs2" NA NA NA 56
"obs3" 26 82 "x" NA
I tried to do:
only_value = function(x){
x[which(!is.na(x))]
}
df2 = df %>% lapply(only_value) %>% as.data.frame()
However, this only works if there is the same number of values for each observation, which is not the case in my example.
A data.table option using fcoalesce may help
library(data.table)
type.convert(setDT(df)[, data.table(t(fcoalesce(asplit(.SD, 1)))), observation], as.is = TRUE)
which gives
observation x1 x2 x3 x4
1: obs1 NA 25 <NA> 51
2: obs2 NA NA <NA> 56
3: obs3 26 82 x NA
Data
> dput(df)
structure(list(observation = c("obs1", "obs1", "obs1", "obs2",
"obs2", "obs2", "obs3", "obs3", "obs3"), x1 = c(NA, NA, NA, NA,
NA, NA, 26L, NA, NA), x2 = c(NA, NA, 25L, NA, NA, NA, NA, 82L,
NA), x3 = c(NA, NA, NA, NA, NA, NA, NA, NA, "x"), x4 = c(51L,
NA, NA, NA, NA, 56L, NA, NA, NA)), class = "data.frame", row.names = c(NA,
-9L))
Similarly, you can use coalesce with dplyr
library(dplyr)
df %>%
  group_by(observation) %>%
  summarise(across(x1:x4, ~ do.call(coalesce, as.list(.x))))
which gives
observation x1 x2 x3 x4
* <chr> <int> <int> <chr> <int>
1 obs1 NA 25 <NA> 51
2 obs2 NA NA <NA> 56
3 obs3 26 82 x NA
Change the only_value function to return only 1st non-NA value.
only_value = function(x){
x[!is.na(x)][1]
}
Now apply this function by group to columns x1 to x4 :
library(dplyr)
df %>%
  group_by(observation) %>%
  summarise(across(x1:x4, only_value))
# observation x1 x2 x3 x4
#* <chr> <int> <int> <chr> <int>
#1 obs1 NA 25 NA 51
#2 obs2 NA NA NA 56
#3 obs3 26 82 x NA
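The same "first non-NA value per group" idea could also be sketched in base R with aggregate, in case dplyr is not available. This assumes the df from the question; the data-frame method of aggregate is used (rather than the formula method) so that rows containing NA are not dropped and each column keeps its own type:

```r
df <- data.frame(
  observation = rep(c("obs1", "obs2", "obs3"), each = 3),
  x1 = c(NA, NA, NA, NA, NA, NA, 26L, NA, NA),
  x2 = c(NA, NA, 25L, NA, NA, NA, NA, 82L, NA),
  x3 = c(NA, NA, NA, NA, NA, NA, NA, NA, "x"),
  x4 = c(51L, NA, NA, NA, NA, 56L, NA, NA, NA),
  stringsAsFactors = FALSE
)

# First non-NA value of a vector (NA if there is none)
only_value <- function(x) x[!is.na(x)][1]

# Apply only_value to every column within each observation
df2 <- aggregate(df[-1], by = df["observation"], FUN = only_value)
df2
```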

Calculate rolling average in matrix

I want to calculate a rolling average. Specifically, I want to fill each row of columns 5 and 6 of Mat1, with a rolling average of the prior 3 columns. For column 5 this implies an average over 2,3,4 and for column 6, the average over columns 3,4,5. I only want to calculate the average when there are no NAs in the columns over which the average is calculated.
mat1 <- data.frame(matrix(nrow =6, ncol =6))
mat1[1:4,1:4] = rnorm(16,0,1)
mat1[5:6,1:3] = rnorm(6,0,1)
mat1
X1 X2 X3 X4 X5 X6
1 0.40023542 2.05111693 0.695422777 0.9938004 NA NA
2 0.22673283 -0.86433614 0.002620227 0.8464388 NA NA
3 0.88522293 -0.72385091 0.751663489 1.3240476 NA NA
4 0.65373734 1.68385938 0.759718967 -0.4577604 NA NA
5 -0.09442161 0.72186678 0.180312264 NA NA NA
6 0.39930843 0.04311092 2.141065229 NA NA NA
For example, entry [1,5] should be the mean of (2.051, 0.69, 0.99), and entry [1,6] the mean of (0.69, 0.99, and the value just computed for entry [1,5]).
We can use a for loop to calculate the rolling mean of the previous three columns
cols <- 5:6
for (i in cols) {
  mat1[i] <- rowMeans(mat1[(i-3):(i-1)])
}
mat1
# X1 X2 X3 X4 X5 X6
#1 0.40023542 2.05111693 0.695422777 0.9938004 1.246780036 0.9786677
#2 0.22673283 -0.86433614 0.002620227 0.8464388 -0.005092371 0.2813222
#3 0.88522293 -0.72385091 0.751663489 1.3240476 0.450620060 0.8421104
#4 0.65373734 1.68385938 0.759718967 -0.4577604 0.661939316 0.3212993
#5 -0.09442161 0.72186678 0.180312264 NA NA NA
#6 0.39930843 0.04311092 2.141065229 NA NA NA
This returns NA if any NA value is present in the calculation as mentioned in the comments. If we need to ignore NA values, we can set na.rm = TRUE in rowMeans.
data
mat1 <- structure(list(X1 = c(0.40023542, 0.22673283, 0.88522293, 0.65373734,
-0.09442161, 0.39930843), X2 = c(2.05111693, -0.86433614, -0.72385091,
1.68385938, 0.72186678, 0.04311092), X3 = c(0.695422777, 0.002620227,
0.751663489, 0.759718967, 0.180312264, 2.141065229), X4 = c(0.9938004,
0.8464388, 1.3240476, -0.4577604, NA, NA), X5 = c(NA, NA, NA,
NA, NA, NA), X6 = c(NA, NA, NA, NA, NA, NA)), class = "data.frame",
row.names = c("1", "2", "3", "4", "5", "6"))
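A side note on the shorthand mean(2.051, 0.69, 0.99) used in the question: in R, mean() averages its first argument only, and further positional arguments are matched to trim and na.rm. The sketch below (using the first row's values from the data above) shows the difference, along with the effect of na.rm in rowMeans on a row that contains an NA:

```r
vals <- c(2.05111693, 0.695422777, 0.9938004)

# Passing one vector averages all three values
m_vec <- mean(vals)

# Passing three separate numbers does not: the 2nd and 3rd are
# taken as trim and na.rm, so only the first number is "averaged"
m_pos <- mean(2.05111693, 0.695422777, 0.9938004)

# rowMeans on a row with an NA (row 5, columns 2 to 4 of mat1)
row5 <- matrix(c(0.72186678, 0.180312264, NA), nrow = 1)
strict  <- rowMeans(row5)                # NA, as in the output above
lenient <- rowMeans(row5, na.rm = TRUE)  # mean of the two available values

c(m_vec = m_vec, m_pos = m_pos, strict = strict, lenient = lenient)
```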

How to combine rowSums and ifelse with mutate

I need to combine rowSums and ifelse in order to create a new variable. My data looks like this:
boss var1 var2 var3 newvar
1 NA NA 3 NA
1 2 3 3 8
2 NA NA NA 0
2 NA NA NA 0
2 NA NA NA 0
1 1 NA 2 3
If boss==1 and there is more than one missing value in var1 to var3, newvar should be NA; otherwise, it should be the result of var1+var2+var3 (ignoring a single missing value).
If boss==2, newvar should automatically be 0.
So far, I have been able to solve parts of the problem using dplyr:
mutate(newvar=rowSums(.[,2:4],na.rm=TRUE) +
ifelse(rowSums(is.na(.[,2:4]))>1 & boss==2,NA,0))
mutate(newvar=ifelse(boss==2,0,NA))
However, I'm struggling to combine the two. Any help is much appreciated.
Here is one option with case_when where we create an index ('i1') which computes the number of NA elements in the row. The index is used in the case_when to create logical conditions to assign the values
df %>%
  mutate(i1 = rowSums(is.na(.[-1]))) %>%
  mutate(newvar = case_when(i1 > 1 & boss == 1 ~ NA_integer_,
                            boss == 2 ~ 0L,
                            i1 <= 1 & boss != 2 ~ as.integer(rowSums(.[2:4], na.rm = TRUE)))) %>%
  select(-i1)
# boss var1 var2 var3 newvar
#1 1 NA NA 3 NA
#2 1 2 3 3 8
#3 2 NA NA NA 0
#4 2 NA NA NA 0
#5 2 NA NA NA 0
#6 1 1 NA 2 3
In base R, this can be done by creating an index, without using any ifelse
i1 <- df$boss != 2
tmp <- i1 * df[-1]
df$newvar <- NA^(rowSums(is.na(tmp)) > 1 & i1) * rowSums(tmp, na.rm = TRUE)
df$newvar
#[1] NA 8 0 0 0 3
data
df <- structure(list(boss = c(1L, 1L, 2L, 2L, 2L, 1L), var1 = c(NA,
2L, NA, NA, NA, 1L), var2 = c(NA, 3L, NA, NA, NA, NA), var3 = c(3L,
3L, NA, NA, NA, 2L)), .Names = c("boss", "var1", "var2", "var3"
), row.names = c(NA, -6L), class = "data.frame")
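The NA^(...) expression in the base R version may look cryptic. It relies on the fact that in R anything raised to the power 0 is 1, including NA, so NA^FALSE is 1 and NA^TRUE is NA. A step-by-step sketch of the same idea, with illustrative variable names (is_boss1, too_many_na, mask) that are not part of the original answer:

```r
df <- data.frame(boss = c(1L, 1L, 2L, 2L, 2L, 1L),
                 var1 = c(NA, 2L, NA, NA, NA, 1L),
                 var2 = c(NA, 3L, NA, NA, NA, NA),
                 var3 = c(3L, 3L, NA, NA, NA, 2L))

# NA^TRUE is NA, NA^FALSE is 1
pow <- NA^c(TRUE, FALSE)

is_boss1 <- df$boss != 2
# Rows that must become NA: boss 1 with more than one missing value
too_many_na <- rowSums(is.na(df[-1])) > 1 & is_boss1
# Multiplicative mask: NA where the result must be NA, 1 elsewhere
mask <- NA^too_many_na

# Row sums, forced to 0 for boss == 2 by multiplying with FALSE
total <- rowSums(df[-1], na.rm = TRUE) * is_boss1

newvar <- mask * total
newvar
```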
A solution in base R using apply:
df$newvar <- apply(df, 1, function(x) {
  if (x["boss"] == 2) {
    # boss 2: always 0
    0
  } else if (sum(is.na(x[-1])) > 1) {
    # boss 1 with more than one missing value
    NA
  } else {
    sum(x[-1], na.rm = TRUE)
  }
})
# boss var1 var2 var3 newvar
# 1 1 NA NA 3 NA
# 2 1 2 3 3 8
# 3 2 NA NA NA 0
# 4 2 NA NA NA 0
# 5 2 NA NA NA 0
# 6 1 1 NA 2 3
Data:
df <- read.table(text =
"boss var1 var2 var3
1 NA NA 3
1 2 3 3
2 NA NA NA
2 NA NA NA
2 NA NA NA
1 1 NA 2",
header = TRUE, stringsAsFactors = FALSE)
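Since the question asks specifically about combining rowSums and ifelse, a vectorized base R sketch of that combination might look as follows. It assumes the same df as above; vars, n_na and total are illustrative names:

```r
df <- data.frame(boss = c(1L, 1L, 2L, 2L, 2L, 1L),
                 var1 = c(NA, 2L, NA, NA, NA, 1L),
                 var2 = c(NA, 3L, NA, NA, NA, NA),
                 var3 = c(3L, 3L, NA, NA, NA, 2L))

vars <- c("var1", "var2", "var3")
n_na  <- rowSums(is.na(df[vars]))         # missing values per row
total <- rowSums(df[vars], na.rm = TRUE)  # sum ignoring missing values

# boss 2 -> 0; boss 1 with more than one NA -> NA; otherwise the sum
df$newvar <- ifelse(df$boss == 2, 0,
                    ifelse(n_na > 1, NA, total))
df
```

The same nested ifelse expression could be placed inside mutate, with the two rowSums calls computed on the selected columns.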

Resources