This question already has answers here:
Determine the number of rows with NAs
(5 answers)
Closed 4 years ago.
I'm trying to count the rows containing NA in the whole df, as I'm looking to compute the % of rows with NA over the total number of rows of the df.
I have already seen this post: Determine the number of rows with NAs, but it only covers a specific range of columns.
tl;dr: row wise, you'll want sum(!complete.cases(DF)), or, equivalently, sum(apply(DF, 1, anyNA))
There are a number of different ways to look at the number, proportion or position of NA values in a data frame:
Most of these start from the logical matrix produced by is.na(), with TRUE for every NA and FALSE everywhere else. For the built-in dataset airquality:
is.na(airquality)
There are 44 NA values in this data set
sum(is.na(airquality))
# [1] 44
You can look at the total number of NA values per row or column:
head(rowSums(is.na(airquality)))
# [1] 0 0 0 0 2 1
colSums(is.na(airquality))
# Ozone Solar.R Wind Temp Month Day
#    37       7    0    0     0   0
You can use anyNA() in place of is.na() as well:
# by row
head(apply(airquality, 1, anyNA))
# [1] FALSE FALSE FALSE FALSE TRUE TRUE
sum(apply(airquality, 1, anyNA))
# [1] 42
# by column
head(apply(airquality, 2, anyNA))
# Ozone Solar.R Wind Temp Month Day
# TRUE TRUE FALSE FALSE FALSE FALSE
sum(apply(airquality, 2, anyNA))
# [1] 2
complete.cases() can be used, but only row-wise:
sum(!complete.cases(airquality))
# [1] 42
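Putting the pieces above together, the percentage the question actually asks for is just the mean of the incomplete-row indicator (since the mean of a logical vector is the fraction of TRUEs):

```r
# Fraction (and percentage) of rows of airquality with at least one NA
frac <- mean(!complete.cases(airquality))
frac        # 42/153
100 * frac  # about 27.45 percent
```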
From the example here:
DF <- read.table(text=" col1 col2 col3
1 23 17 NA
2 55 NA NA
3 24 12 13
4 34 23 12", header=TRUE)
You can check which rows have at least one NA:
(which_nas <- apply(DF, 1, function(X) any(is.na(X))))
# 1 2 3 4
# TRUE TRUE FALSE FALSE
And then count them, identify them or get the ratio:
## Identify them
which(which_nas)
# 1 2
# 1 2
## Count them
length(which(which_nas))
#[1] 2
## Ratio
length(which(which_nas))/nrow(DF)
#[1] 0.5
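Equivalently, since which_nas is a logical vector, sum() and mean() give the count and the ratio directly, without going through which():

```r
DF <- read.table(text=" col1 col2 col3
1 23 17 NA
2 55 NA NA
3 24 12 13
4 34 23 12", header=TRUE)
which_nas <- apply(DF, 1, anyNA)
sum(which_nas)   # count of rows with NA: 2
mean(which_nas)  # ratio: 0.5
```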
Related
My question is probably quite basic but I've been struggling with it so I'd be really grateful if someone could offer a solution.
I have data in the following format:
ORG_NAME  var_1_12  var_1_13  var_1_14
A         12        11        5
B         13        13        11
C         6         7         NA
D         NA        NA        5
I have data on organizations over 5 years, but over that time, some organizations have merged and others have disappeared. I'm planning on conducting a fixed-effects regression, so I need to add a dummy variable which is "0" when organizations have remained the same (in this case rows A and B), and "1" in the year before the merge and in the year after the merge. In this case, I know that orgs C and D merged, so I would like for the data to look like this:
ORG_NAME  var_1_12  dum_12  var_1_13  dum_13
A         12        0       5         0
B         13        0       11        0
C         6         1       NA        1
D         NA        1       5         1
How would I code this?
This approach (as is any, according to your description) is absolutely dependent on the companies being in consecutive rows.
mtx <- apply(is.na(dat[,-1]), MARGIN = 2,
             function(vec) zoo::rollapply(vec, 2, function(z) xor(z[1], z[2]), fill = FALSE))
mtx
# var_1_12 var_1_13 var_1_14
# [1,] FALSE FALSE FALSE
# [2,] FALSE FALSE TRUE
# [3,] TRUE TRUE TRUE
# [4,] FALSE FALSE FALSE
out <- rowSums(mtx) == ncol(mtx)
out
# [1] FALSE FALSE TRUE FALSE
out | c(FALSE, out[-length(out)])
# [1] FALSE FALSE TRUE TRUE
### and your 0/1 numbers, if logical isn't right for you
+(out | c(FALSE, out[-length(out)]))
# [1] 0 0 1 1
Brief walk-through:
is.na(dat[,-1]) returns a matrix of whether the values (except the first column) are NA; because it's a matrix, we use apply to call a function on each column (using MARGIN=2);
zoo::rollapply is a function that does rolling calculations on a portion ("window") of the vector at a time, in this case 2-wide. For example, if we have 1:5, then it first looks at c(1,2), then c(2,3), then c(3,4), etc.
xor is an eXclusive OR, meaning it will be true when one of its arguments is true and the other is false;
mtx is a matrix indicating that a cell and the one below it met the conditions (one is NA, the other is not). We then check to see which of these rows are all true, forming out.
since we need a 1 in both the row before the merge and the row of the merge, we vector-OR (|) out with a copy of itself shifted down one, to produce your intended output
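The 2-wide rolling xor that zoo::rollapply performs here can also be reproduced in base R by XOR-ing the vector against a copy of itself shifted by one (a minimal sketch of the windowing idea, not the answer's exact code):

```r
vec <- c(FALSE, FALSE, TRUE, FALSE)
# element i is xor(vec[i], vec[i+1]); the result is one shorter than vec
pairwise_xor <- xor(vec[-length(vec)], vec[-1])
pairwise_xor
# [1] FALSE  TRUE  TRUE
```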
If I understand correctly, you want to code with "1" the rows with at least one NA. If so, you just need one dummy variable for all the years, right? Something like this:
set.seed(4)
df <- data.frame(org=as.factor(LETTERS[1:5]),y1=sample(c(1:4,NA),5),y2=sample(c(3:6,NA),5),y3=sample(c(2:5,NA),5))
df$dummy <- as.numeric(apply(df, 1, function(x)any(is.na(x))))
which gives you
org y1 y2 y3 dummy
1 A 3 5 3 0
2 B NA 4 5 1
3 C 4 3 2 0
4 D 1 6 NA 1
5 E 2 NA 4 1
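An equivalent without apply, assuming the same df from the answer above: complete.cases() already flags the NA-free rows, so its negation is the dummy:

```r
set.seed(4)
df <- data.frame(org = as.factor(LETTERS[1:5]),
                 y1 = sample(c(1:4, NA), 5),
                 y2 = sample(c(3:6, NA), 5),
                 y3 = sample(c(2:5, NA), 5))
# rows that are NOT complete get dummy = 1
df$dummy <- as.numeric(!complete.cases(df))
```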
Consider an ordered data frame with a column that consists of values and NAs, like this:
df <- data.frame(id=rep(1:6), value=c(NA,NA,23,45,12,76))
I would like to shift the position of the NA's to the first two rows of the data frame, whilst maintaining the order of the values as so:
df$new_value <- c(23,45,12,76,NA,NA)
Is there any way I can do this? Thanks!
We can use order on the NA elements
df$new_value <- df$value[order(is.na(df$value))]
df$new_value
#[1] 23 45 12 76 NA NA
Calling is.na() returns a logical vector:
is.na(df$value)
#[1] TRUE TRUE FALSE FALSE FALSE FALSE
applying order on it returns
order(is.na(df$value))
#[1] 3 4 5 6 1 2
because FALSE sorts before TRUE (logicals order as 0 before 1). The values returned by order() are the original position indices in the vector. This can be understood more easily with
sort(c(TRUE, FALSE, TRUE), index.return = TRUE)
#$x
#[1] FALSE TRUE TRUE
#$ix
#[1] 2 1 3
Another idea, which will work only if your NAs are all at the very beginning of the column, is to use the lead function from dplyr to shift your data n positions forward. For your case, it would be:
dplyr::lead(df$value, sum(is.na(df$value)))
#[1] 23 45 12 76 NA NA
Without being clever, some elementary techniques can also be applied:
df$new_value <- c(df[!is.na(df$value), "value"], df[is.na(df$value), "value"])
id value new_value
1 1 NA 23
2 2 NA 45
3 3 23 12
4 4 45 76
5 5 12 NA
6 6 76 NA
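A quick check that the order() approach and the elementary subsetting approach agree on this df:

```r
df <- data.frame(id = 1:6, value = c(NA, NA, 23, 45, 12, 76))
a <- df$value[order(is.na(df$value))]                          # order() approach
b <- c(df$value[!is.na(df$value)], df$value[is.na(df$value)])  # subsetting approach
identical(a, b)
# [1] TRUE
```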
This question already has answers here:
Subset of rows containing NA (missing) values in a chosen column of a data frame
(7 answers)
Remove rows with all or some NAs (missing values) in data.frame
(18 answers)
Closed 5 years ago.
I have a data frame with many rows (say 4 rows in the example below), and each row has a certain number of columns associated with it as below:
1 91 90 20
2 21 NA NA
3 20 20 NA
4 30 NA NA
The numbers 1,2,3 and 4 in the far left are row IDs. I need to extract the rows that contain more than one number across all associated columns. So what I would expect is:
1 91 90 20
3 20 20 NA
I have tried using "which" in combination with "lapply" but this just gives me TRUE or FALSE as output, whereas I need the actual values as above.
You can do that by using rowSums on !is.na(df) to count the non-NA values per row, then filtering to rows where that count is greater than 1.
df[rowSums(!is.na(df)) > 1,]
Breakdown:
df <- data.frame(x = c(91, 21, 20, 30), y = c(90, NA, 20, NA), z = c(20, NA, NA, NA))
We can turn it into a T/F matrix by:
!is.na(df)
x y z
[1,] TRUE TRUE TRUE
[2,] TRUE FALSE FALSE
[3,] TRUE TRUE FALSE
[4,] TRUE FALSE FALSE
This shows where there are and aren't numbers. Now we just need to sum up the rows:
rowSums(!is.na(df))
[1] 3 1 2 1
This yields the # of non-NA entries per row. Now we can change that back into a logical vector by looking for only ones that have more than 1:
rowSums(!is.na(df)) > 1
[1] TRUE FALSE TRUE FALSE
Now subset the df with that:
df[rowSums(!is.na(df)) > 1,]
x y z
1 91 90 20
3 20 20 NA
This question already has answers here:
Remove duplicated rows
(10 answers)
Closed 5 years ago.
New to R, working on customer latency.
In the dataset I have around 300,000 rows with 15 columns. Some relevant columns are "Account", "Account Open Date", "Shipment pick up date", etc.
Account numbers are repeated and just want the rows with account numbers where it is recorded for the first time, not the subsequent rows.
For example, acc # 610829952 appears in the first row as well as in the 5th row, 6th row, etc. I need to keep just that first row, and I need to do this for all the account numbers.
I am not sure how to do this. Could someone please help me with this?
There is a function in R called duplicated(). It allows you to check whether a certain value, like your account, has already been recorded.
First you check in the relevant column account which account numbers have already appeared before, using duplicated(). You will get a TRUE / FALSE vector (TRUE indicating that the corresponding value has already appeared). With that information, you index your data.frame in order to only retrieve the rows you are interested in. I will assume your data looks like df below:
df <- data.frame(account = sample(1:5, 20, replace = TRUE),
                 segment = sample(LETTERS, 20, replace = TRUE))
# account segment
# 1 3 N
# 2 2 V
# 3 4 T
# 4 4 Y
# 5 4 M
# 6 4 E
# 7 5 H
# 8 3 A
# 9 3 J
# 10 3 Y
# 11 4 R
# 12 5 O
# 13 4 O
# 14 1 R
# 15 5 U
# 16 2 Q
# 17 5 F
# 18 2 J
# 19 4 E
# 20 2 H
inds <- duplicated(df$account)
# [1] FALSE FALSE FALSE TRUE TRUE TRUE FALSE TRUE TRUE TRUE
# [11] TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
df <- df[!inds, ]
# account segment
# 1 3 N
# 2 2 V
# 3 4 T
# 7 5 H
# 14 1 R
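If "recorded for the first time" should mean earliest by "Account Open Date" rather than first physical row, sort by the date before applying duplicated(). A sketch with hypothetical column names and made-up values, since the real data isn't shown:

```r
acc <- data.frame(account   = c(2, 1, 2, 1),
                  open_date = as.Date(c("2020-03-01", "2020-01-15",
                                        "2020-01-01", "2020-06-01")))
acc <- acc[order(acc$open_date), ]             # earliest records first
first_seen <- acc[!duplicated(acc$account), ]  # first (earliest) row per account
nrow(first_seen)  # 2, one row per account
```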
How can I split the following data.frame
df <- data.frame(var1 = c("a", 1, 2, 3, "a", 1, 2, 3, 4, 5, 6, "a", 1, 2), var2 = 1:14)
into lists of / groups of
a 1
1 2
2 3
3 4
a 5
1 6
2 7
3 8
4 9
5 10
6 11
a 12
1 13
2 14
So basically, value "a" in column 1 is the tag / identifier I want to split the data frame on. I know about the split function, but that means I have to add another column, and since, as can be seen from my example, the size of the groups can vary, I do not know how to automatically create such a grouping column to fit my needs.
Any ideas on that?
Cheers,
Sven
You could find which values of the indexing vector equal "a", then create a grouping variable based on that and then use split.
df[,1] == "a"
# [1] TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
#[13] FALSE FALSE
cumsum(df[,1] == "a")
# [1] 1 1 1 1 2 2 2 2 2 2 2 3 3 3
split(df, cumsum(df[,1] == "a"))
#$`1`
# var1 var2
#1 a 1
#2 1 2
#3 2 3
#4 3 4
#
#$`2`
# var1 var2
#5 a 5
#6 1 6
#7 2 7
#8 3 8
#9 4 9
#10 5 10
#11 6 11
#
#$`3`
# var1 var2
#12 a 12
#13 1 13
#14 2 14
You could create a loop that goes through the entire first column of the data frame and saves the positions of the non-numeric entries in a vector. One catch: a mixed column like var1 is stored as character (or factor), so is.numeric() is always FALSE element-wise; testing whether as.numeric() yields NA works instead. Thus, you'd have something like:
data <- as.character(df$var1) #this gives you a vector of the values you'll sort through
positions <- c()
for (i in seq_along(data)) {
  #as.numeric() returns NA (with a warning) for non-numeric strings like "a"
  if (is.na(suppressWarnings(as.numeric(data[i])))) {
    positions <- append(positions, i) #saves the positions of the non-numeric characters
  }
}
positions
# [1]  1  5 12
With those positions, you shouldn't have a problem splitting up the data frame from there. It's just a matter of using the sequences between consecutive values in the positions vector.
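With those positions in hand, findInterval() turns row numbers into the group ids that split() needs, reproducing the result of the accepted cumsum approach (a sketch on the same df):

```r
df <- data.frame(var1 = c("a", 1, 2, 3, "a", 1, 2, 3, 4, 5, 6, "a", 1, 2),
                 var2 = 1:14)
positions <- which(df$var1 == "a")                  # 1, 5, 12
# each row gets the index of the last "a" at or before it
groups <- findInterval(seq_len(nrow(df)), positions)
out <- split(df, groups)
sapply(out, nrow)  # group sizes: 4 7 3
```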