is.na or complete.cases in R using column number - r

In this example I need to drop all rows with NA values. I tried:
drop <- is.na(df[,c(3,4,5)])
Error in df[, c(3, 4, 5)] : incorrect number of dimensions
My dataframe has 5 columns.
I am not trying to select columns by column name.
I also tried
df[complete.cases(df[ , 3:5]),]
Same error, incorrect number of dimensions

Dropping missing values from vectors
The errors indicate that your data are likely a vector, not a data.frame. Accordingly, there are no rows or columns (it has no dim) and so using [,] is throwing errors. To support this, below I create a vector, reproduce the errors, and demonstrate how to drop missing values from it.
# Create vector, show it's a vector
vec <- c(NA,1:4)
vec
#> [1] NA 1 2 3 4
is.vector(vec)
#> [1] TRUE
# Reproduces your errors for both methods
is.na(vec[ ,2:3])
#> Error in vec[, 2:3]: incorrect number of dimensions
vec[complete.cases(vec[ , 2:3]), ]
#> Error in vec[, 2:3]: incorrect number of dimensions
# Remove missing values from the vector
vec[!is.na(vec)]
#> [1] 1 2 3 4
vec[complete.cases(vec)]
#> [1] 1 2 3 4
I'll additionally show you below how to check if your data object is a data.frame and how to omit rows with missing values in case it is.
Create data and check it's a data.frame
# Create an example data.frame
set.seed(123)
N <- 10
df <- data.frame(
x1 = sample(c(NA_real_, 1, 2, 3), N, replace = TRUE),
x2 = sample(c(NA_real_, 1, 2, 3), N, replace = TRUE),
x3 = sample(c(NA_real_, 1, 2, 3), N, replace = TRUE)
)
print(df)
#> x1 x2 x3
#> 1 2 3 NA
#> 2 2 1 3
#> 3 2 1 NA
#> 4 1 NA NA
#> 5 2 1 NA
#> 6 1 2 2
#> 7 1 3 3
#> 8 1 NA 1
#> 9 2 2 2
#> 10 NA 2 1
# My hunch is that you are not using a data.frame. You can check as follows:
class(df)
#> [1] "data.frame"
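As a further quick check, the sketch below contrasts a data.frame with a vector pulled out of it. The names d, v and d1 are made up for illustration. One common way to end up with a vector (an assumption here, since the question does not show how the data were created) is single-column subsetting, which drops dimensions unless drop = FALSE is given.

```r
# Hypothetical example objects, not the asker's data
d <- data.frame(x1 = c(1, NA, 3), x2 = c(NA, 2, 3))

v <- d[, 1]                  # single column: result is a plain vector
is.data.frame(d)             # TRUE
is.data.frame(v)             # FALSE
dim(v)                       # NULL, so v[, 1] would fail

d1 <- d[, 1, drop = FALSE]   # keep the data.frame structure
is.data.frame(d1)            # TRUE
dim(d1)                      # 3 rows, 1 column
```

If dim() returns NULL for your object, [ , ] indexing will give the "incorrect number of dimensions" error.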
Approaches to removing rows with missing values from data.frames
Your first approach returns logical values for whether a value is missing for the specified columns. You could then use rowSums on the result and drop the rows with any missing values, as shown below.
# Example: shows whether values are missing for second and third columns
miss <- is.na(df[ ,2:3])
print(miss)
#> x2 x3
#> [1,] FALSE TRUE
#> [2,] FALSE FALSE
#> [3,] FALSE TRUE
#> [4,] TRUE TRUE
#> [5,] FALSE TRUE
#> [6,] FALSE FALSE
#> [7,] FALSE FALSE
#> [8,] TRUE FALSE
#> [9,] FALSE FALSE
#> [10,] FALSE FALSE
# We can sum all of these values by row (`TRUE` = 1, `FALSE` = 0 in R) and keep only
# those rows that sum to 0 to remove missing values. Notice that the row names
# retain the original numbering.
df[rowSums(miss) == 0, ]
#> x1 x2 x3
#> 2 2 1 3
#> 6 1 2 2
#> 7 1 3 3
#> 9 2 2 2
#> 10 NA 2 1
Your second approach is to use complete.cases. This also works and produces the same result as the first approach.
miss_cases <- df[complete.cases(df[ ,2:3]), ]
miss_cases
#> x1 x2 x3
#> 2 2 1 3
#> 6 1 2 2
#> 7 1 3 3
#> 9 2 2 2
#> 10 NA 2 1
A third approach is to use na.omit(). However, it doesn't let you specify columns, so you should use complete.cases instead if you need to filter on specific columns.
na.omit(df)
#> x1 x2 x3
#> 2 2 1 3
#> 6 1 2 2
#> 7 1 3 3
#> 9 2 2 2
A fourth approach is to use the tidyr package where the appeal is you can use column indices as well as unquoted column names. This also updates row names.
library(tidyr)
drop_na(df, 2:3)
#> x1 x2 x3
#> 1 2 1 3
#> 2 1 2 2
#> 3 1 3 3
#> 4 2 2 2
#> 5 NA 2 1
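If you would rather stay in base R and select columns by number, the complete.cases approach can be wrapped in a small helper. This is only a sketch: drop_rows_na is a made-up name, and the drop = FALSE arguments are there so that selecting a single column or ending up with a single row does not collapse the data.frame into a vector.

```r
# Hypothetical helper: keep rows with no NA in the given column numbers
drop_rows_na <- function(d, cols) {
  stopifnot(is.data.frame(d))              # fail early if d is a vector
  keep <- complete.cases(d[, cols, drop = FALSE])
  d[keep, , drop = FALSE]
}

# Small made-up data set for illustration
d <- data.frame(a = c(1, NA, 3), b = c(NA, 2, 3), c = c(1, 2, NA))

res_23 <- drop_rows_na(d, 2:3)   # only row 2 has no NA in columns 2 and 3
res_1  <- drop_rows_na(d, 1)     # rows 1 and 3 have no NA in column 1
```

The stopifnot() call also gives a clearer message than "incorrect number of dimensions" when the input is not a data.frame.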

Related

replace NA's with 0, and Non NA's with a different value

I have a dataframe like this:
my_df <- data.frame(
ID = c(2, 4, 6, 8, 10, 12, 14, 16, 18),
b2 = c(NA, 4, 6, 2, NA, 6, 1, 1, NA))
and I want to replace all NA's with '0' and every other value (non-NA's) with '1', and place the result in a new column (b4).
I can replace only NA's with 0 using this:
my_df2 <- my_df %>%
mutate(b3 = replace(b2,is.na(b2),0))
I would have thought I can use below step to then replace other values (Non-NA's) with '1':
my_df3 <- my_df2 %>% mutate(b4=ifelse(b3=="NA","0","1"))
This, however, does not work the way I anticipated. Is there a way to do this in one step?
Any advice with this please?
The problem with the code in the question is that comparing to "NA" is not the same as checking if the value is NA. What that is doing is comparing the value to a character string which contains N and A. Also note that comparing to NA always gives NA so we can't use that either. Instead use is.na.
my_df$b2 == "NA"
## [1] NA FALSE FALSE FALSE NA FALSE FALSE FALSE NA
my_df$b2 == NA
## [1] NA NA NA NA NA NA NA NA NA
is.na(my_df$b2)
## [1] TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE
Now, since coercing TRUE and FALSE to numeric gives 1 and 0 respectively,
+TRUE
## [1] 1
+FALSE
## [1] 0
we can compute !is.na(b2) which is TRUE if it is not NA and FALSE if it is and then convert that to numeric using + to give the 0/1 value needed.
library(dplyr)
my_df %>% mutate(b3 = +!is.na(b2))
giving:
ID b2 b3
1 2 NA 0
2 4 4 1
3 6 6 1
4 8 2 1
5 10 NA 0
6 12 6 1
7 14 1 1
8 16 1 1
9 18 NA 0
Please find below one possible answer using the dplyr library
Reprex
Code
library(dplyr)
my_df %>%
mutate(b2 = if_else(is.na(b2), 0, 1))
Output
#> ID b2
#> 1 2 0
#> 2 4 1
#> 3 6 1
#> 4 8 1
#> 5 10 0
#> 6 12 1
#> 7 14 1
#> 8 16 1
#> 9 18 0
Created on 2022-01-20 by the reprex package (v2.0.1)
You are not using NA properly here: x=="NA" treats it like a character string. With NA values, standard practice is to use is.na(), not x==NA. Try:
my_df$b3 <- ifelse(is.na(my_df$b2), 0, 1)
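For completeness, here is a base R sketch that writes the 0/1 indicator into a new column named b4, as the question asked. It assumes an integer result is acceptable; as.integer() converts TRUE to 1 and FALSE to 0.

```r
my_df <- data.frame(
  ID = c(2, 4, 6, 8, 10, 12, 14, 16, 18),
  b2 = c(NA, 4, 6, 2, NA, 6, 1, 1, NA))

# 0 where b2 is missing, 1 otherwise
my_df$b4 <- as.integer(!is.na(my_df$b2))
my_df$b4
```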

Replace missing values with row means if exactly N missing values per row

I have a data matrix with a different number of missing values per row. What I want is to replace the missing values with row means if the number of missing values per row is N (let's say 1).
I have already created a solution for this problem but it's a very inelegant one so I'm looking for something else.
My solution:
#SAMPLE DATA
a <- c(rep(c(1:4, NA), 2))
b <- c(rep(c(1:3, NA, 5), 2))
c <- c(rep(c(1:3, NA, 5), 2))
df <- as.matrix(cbind(a,b,c), ncol = 3, nrow = 10)
#CALCULATING THE NUMBER OF MISSING VALUES PER ROW
miss_row <- rowSums(apply(as.matrix(df), c(1,2), function(x) {
sum(is.na(x)) +
sum(x == "", na.rm=TRUE)
}) )
df <- cbind(df, miss_row)
#CALCULATING THE ROW MEANS FOR ROWS WITH 1 MISSING VALUE
row_mean <- ifelse(df[,4] == 1, rowMeans(df[,1:3], na.rm = TRUE), NA)
df <- cbind(df, row_mean)
Here is the way I mentioned in the comments, with more details:
# create your matrix
df <- cbind(a, b, c) # already a matrix, you don't need as.matrix there
# Get number of missing values per row (is.na is vectorised so you can apply it directly on the entire matrix)
nb_NA_row <- rowSums(is.na(df))
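# Replace only the missing cells by the row mean when there are exactly N NAs in the row
# (indexing with nb_NA_row==N alone would overwrite every cell of those rows)
N <- 1 # the given example
to_fill <- is.na(df) & nb_NA_row==N
df[to_fill] <- rowMeans(df, na.rm=TRUE)[row(df)[to_fill]]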
# check df
df
# a b c
# [1,] 1 1 1
# [2,] 2 2 2
# [3,] 3 3 3
# [4,] 4 NA NA
# [5,] 5 5 5
# [6,] 1 1 1
# [7,] 2 2 2
# [8,] 3 3 3
# [9,] 4 NA NA
#[10,] 5 5 5
df <- data.frame(df)
df$miss_row <- rowSums(is.na(df))
df$row_mean <- NA
df$row_mean[df$miss_row == 1] <- rowMeans(df[df$miss_row == 1,1:3],na.rm = TRUE)
# a b c miss_row row_mean
# 1 1 1 1 0 NA
# 2 2 2 2 0 NA
# 3 3 3 3 0 NA
# 4 4 NA NA 2 NA
# 5 NA 5 5 1 5
# 6 1 1 1 0 NA
# 7 2 2 2 0 NA
# 8 3 3 3 0 NA
# 9 4 NA NA 2 NA
# 10 NA 5 5 1 5
(This gives your expected output, which seems not to be completely in line with your text, but for this see comments and duplicate link)
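If the goal is the one described in the text (fill the missing cells themselves, but only in rows with exactly N missing values), a reusable version might look like the sketch below. The function name impute_row_mean and the small matrix m are made up for illustration.

```r
# Hypothetical helper: fill NAs with the row mean, but only in rows
# that have exactly n_missing missing values
impute_row_mean <- function(m, n_missing = 1) {
  stopifnot(is.matrix(m), is.numeric(m))
  # TRUE for NA cells that sit in a row with exactly n_missing NAs
  target <- is.na(m) & (rowSums(is.na(m)) == n_missing)
  # look up the mean of the row each target cell belongs to
  m[target] <- rowMeans(m, na.rm = TRUE)[row(m)[target]]
  m
}

m <- cbind(a = c(1, NA, 4), b = c(NA, 5, NA), c = c(3, 5, NA))
filled <- impute_row_mean(m, 1)
filled
```

Rows 1 and 2 each have one NA and get filled with their row mean (2 and 5). Row 3 has two NAs and is left untouched, as are all non-missing cells.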

Repeat vector to fill down column in data frame

Seems like this very simple maneuver used to work for me, and now it simply doesn't. A dummy version of the problem:
df <- data.frame(x = 1:5) # create simple dataframe
df
x
1 1
2 2
3 3
4 4
5 5
df$y <- c(1:5) # adding a new column with a vector of the exact same length. Works out like it should
df
x y
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
df$z <- c(1:4) # trying to add a new column, this time with a vector with fewer elements than there are rows in the dataframe.
Error in `$<-.data.frame`(`*tmp*`, "z", value = 1:4) :
replacement has 4 rows, data has 5
I was expecting this to work with the following result:
x y z
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 1
I.e. the shorter vector should just start repeating itself automatically. I'm pretty certain this used to work for me (it's in a script that I've been running a hundred times before without problems). Now I can't even get the above dummy example to work like I want to. What am I missing?
If the vector can be evenly recycled into the data.frame, you do not get an error or a warning:
df <- data.frame(x = 1:10)
df$z <- 1:5
This may be what you were experiencing before.
You can get your vector to fit as you mention with rep_len:
df$y <- rep_len(1:3, length.out=10)
This results in
df
x z y
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 1
5 5 5 2
6 6 1 3
7 7 2 1
8 8 3 2
9 9 4 3
10 10 5 1
Note that in place of rep_len, you could use the more common rep function:
df$y <- rep(1:3,len=10)
From the help file for rep:
rep.int and rep_len are faster simplified versions for two common cases. They are not generic.
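Applied to the dummy example from the question, a minimal sketch would be the following. Using nrow(df) rather than a hard-coded length is my own choice here, so the code keeps working if the number of rows changes.

```r
df <- data.frame(x = 1:5)

# Recycle 1:4 explicitly to the number of rows in df
df$z <- rep_len(1:4, nrow(df))
df$z
```

This should give the column the question expected, with the fifth row wrapping around to 1.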
If the total number of rows is a multiple of the length of your new vector, it works fine. When it is not, it does not work everywhere. In particular, you have probably relied on this type of recycling with matrices:
data.frame(1:6, 1:3, 1:4) # not a multiple
# Error in data.frame(1:6, 1:3, 1:4) :
# arguments imply differing number of rows: 6, 3, 4
data.frame(1:6, 1:3) # a multiple
# X1.6 X1.3
# 1 1 1
# 2 2 2
# 3 3 3
# 4 4 1
# 5 5 2
# 6 6 3
cbind(1:6, 1:3, 1:4) # works even with not a multiple
# [,1] [,2] [,3]
# [1,] 1 1 1
# [2,] 2 2 2
# [3,] 3 3 3
# [4,] 4 1 4
# [5,] 5 2 1
# [6,] 6 3 2
# Warning message:
# In cbind(1:6, 1:3, 1:4) :
# number of rows of result is not a multiple of vector length (arg 3)

Correlations by grouping twice in R, using dplyR or aggregate?

My (toy) data looks like:
Item_Id Location_Id date price
1 A 5372 1 .5
2 A 5372 2 NA
3 A 5372 3 1
4 A 6065 1 1
5 A 6065 2 1
6 A 6065 3 3
7 A 7000 1 NA
8 A 7000 2 NA
9 A 7000 3 NA
10 B 5372 1 3
11 B 5372 2 NA
12 B 5372 3 1
13 B 6065 1 2
14 B 6065 2 1
15 B 6065 3 3
16 B 7000 1 8
17 B 7000 2 NA
18 B 7000 3 9
In reality there are hundreds of unique item_Ids and location_Ids.
Data
Item_Id=c(rep('A',9),rep('B',9))
Location_Id=rep(c(rep(5372,3),rep(6065,3),rep(7000,3)),2)
date = rep(1:3,6)
price = c(0.5,NA,1,1,1,3,NA,NA,NA,3,NA,1,2,1,3,8,NA,9)
df = data.frame(Item_Id,Location_Id,date,price)
I want to ultimately get the median correlation (over locations) of the price series for every item with every other item. I tried writing a loop in the hopes that it would be quick (not finished):
for(item in items){
remainingitems = items[items!=item]
for(item2 in remainingitems){
cortemp = numeric(0)
for(locat in locations){
print(locat)
a = pricepanel[pricepanel$Item_Id==item &
pricepanel$Location_Id==locat,]$price
b = pricepanel[pricepanel$Item_Id==item2 &
pricepanel$Location_Id==locat,]$price
cortemp=c(cortemp,cor(cbind(a,b), use="pairwise.complete.obs")[2])
}
}
}
But I stopped because it was much too slow. The most inner loop took several minutes alone and there are hundreds of stores and items. Basically I want to get the correlation matrix (every product with every other product) for every location, and then take the element-wise median across those matrices.
I expect there is an efficient way to do this, but I am new to this kind of thing in R. I tried reading dplyr since I suspect the solution lies in there, but I got stuck.
The interim output would be something like:
$5372
A B
A 1 -1
B -1 1
$6065
A B
A 1 0.8660254
B 0.8660254 1
$7000
A B
A 1 NA
B NA 1
Then the final would take the elementwise median of all those location matrices.
Final:
A B
A 1 -.0669873
B -.0669873 1
You could get the "interim" output using dplyr and tidyr:
library(dplyr)
library(tidyr)
cors <- df %>% spread(Item_Id, price) %>%
group_by(Location_Id) %>%
do(correlation = cor(.[, -(1:2)], use = "pairwise.complete.obs"))
The way that this works is that the spread function (from tidyr) spreads the As, Bs, Cs etc into their own columns:
df %>% spread(Item_Id, price)
# Location_Id date A B
# 1 5372 1 0.5 3
# 2 5372 2 NA NA
# 3 5372 3 1.0 1
# 4 6065 1 1.0 2
# 5 6065 2 1.0 1
# 6 6065 3 3.0 3
# 7 7000 1 NA 8
# 8 7000 2 NA NA
# 9 7000 3 NA 9
(This should work with any number of "Items"- A, B, C, D...) The group_by(Location_Id) function then tells the code to operate within each location. Finally the do command tells it to find the correlation of the columns within each group (. is a placeholder for "the data within each group"), while ignoring the first two columns, Location_Id and date.
The above code produces a result that looks like:
# Source: local data frame [3 x 2]
# Groups: <by row>
#
# Location_Id correlation
# 1 5372 <dbl[2,2]>
# 2 6065 <dbl[2,2]>
# 3 7000 <dbl[2,2]>
The correlation column is a list of your three within-location matrices. At that point you can use the solution in this question to take the elementwise median:
apply(simplify2array(cors$correlation), c(1,2), median, na.rm = TRUE)
Here's a possible split apply solution using base R
lapply(split(df[, c("Item_Id", "price")], df$Location_Id),
function(x) {
cor(matrix(x$price, nrow = nrow(x)/length(unique(x$Item_Id))), use ="pairwise.complete.obs")
} )
# $`5372`
# [,1] [,2]
# [1,] 1 -1
# [2,] -1 1
#
# $`6065`
# [,1] [,2]
# [1,] 1.0000000 0.8660254
# [2,] 0.8660254 1.0000000
#
# $`7000`
# [,1] [,2]
# [1,] NA NA
# [2,] NA 1
And here's a similar solution to @David's using the data.table package
library(data.table)
DT <- dcast.data.table(as.data.table(df),
Location_Id + date ~ Item_Id,
value.var = "price")[, -2, with = FALSE]
Res <- DT[, .(Res = list(cor(.SD, use = "pairwise.complete.obs"))), Location_Id]
You can then view the cor matrices using
Res$Res
# [[1]]
# A B
# A 1 -1
# B -1 1
#
# [[2]]
# A B
# A 1.0000000 0.8660254
# B 0.8660254 1.0000000
#
# [[3]]
# A B
# A NA NA
# B NA 1
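To go all the way to the final element-wise median without extra packages, something like the following base R sketch may work. It assumes exactly one price per item, location and date (as in the toy data), so tapply() with mean simply reshapes the prices into a date-by-item matrix.

```r
Item_Id <- c(rep("A", 9), rep("B", 9))
Location_Id <- rep(c(rep(5372, 3), rep(6065, 3), rep(7000, 3)), 2)
date <- rep(1:3, 6)
price <- c(0.5, NA, 1, 1, 1, 3, NA, NA, NA, 3, NA, 1, 2, 1, 3, 8, NA, 9)
df <- data.frame(Item_Id, Location_Id, date, price)

# One correlation matrix per location (rows = dates, columns = items)
cors <- lapply(split(df, df$Location_Id), function(d) {
  wide <- tapply(d$price, list(d$date, d$Item_Id), mean)
  cor(wide, use = "pairwise.complete.obs")
})

# Stack the matrices into a 3-d array and take the median of each cell,
# ignoring locations where the correlation is NA
med <- apply(simplify2array(cors), c(1, 2), median, na.rm = TRUE)
med
```

With the toy data the off-diagonal entry should come out at about -0.067, the midpoint of -1 and 0.8660254, matching the final matrix in the question.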

Convert a list of varying lengths into a dataframe

I am trying to convert a simple list of varying lengths into a data frame as shown below. I would like to populate the missing values with NaN. I tried using ldply, rbind, as.data.frame() but I failed to get it into the format I want. Please help.
x=c(1,2)
y=c(1,2,3)
z=c(1,2,3,4)
a=list(x,y,z)
a
[[1]]
[1] 1 2
[[2]]
[1] 1 2 3
[[3]]
[1] 1 2 3 4
Output should be:
x y z
1 1 1
2 2 2
NaN 3 3
NaN NaN 4
Using rbind.fill.matrix from "plyr" gets you very close to what you're looking for:
> library(plyr)
> t(rbind.fill.matrix(lapply(a, t)))
[,1] [,2] [,3]
1 1 1 1
2 2 2 2
3 NA 3 3
4 NA NA 4
This is a lot of code, so not as clean as Ananda's solution, but it's all base R:
maxl <- max(sapply(a,length))
out <- do.call(cbind, lapply(a,function(x) x[1:maxl]))
# out <- matrix(unlist(lapply(a,function(x) x[1:maxl])), nrow=maxl) #another way
out <- as.data.frame(out)
#names(out) <- names(a)
Result:
> out
V1 V2 V3
1 1 1 1
2 2 2 2
3 NA 3 3
4 NA NA 4
Note: names of the resulting df will depend on the names of your list (a), which doesn't currently have names.
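If you want the x, y, z column names and NaN (rather than NA) as the filler, as in the requested output, one option is to name the list and pad each element yourself. This is a sketch only; the helper name pad is made up.

```r
a <- list(x = c(1, 2), y = c(1, 2, 3), z = c(1, 2, 3, 4))

maxl <- max(lengths(a))   # length of the longest element

# Hypothetical helper: append NaN until the vector has length n
pad <- function(v, n) c(v, rep(NaN, n - length(v)))

out <- as.data.frame(lapply(a, pad, n = maxl))
out
```

Because the list is named, as.data.frame() picks up x, y and z as column names.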
