My question is probably quite basic, but I've been struggling with it, so I'd be really grateful if someone could offer a solution.
I have data in the following format:
ORG_NAME  var_1_12  var_1_13  var_1_14
A         12        11        5
B         13        13        11
C         6         7         NA
D         NA        NA        5
I have data on organizations over 5 years, but over that time, some organizations have merged and others have disappeared. I'm planning on conducting a fixed-effects regression, so I need to add a dummy variable which is "0" when organizations have remained the same (in this case row A and row B), and "1" in the year before the merge, and after the merge. In this case, I know that orgs C and D merged, so I would like for the data to look like this:
ORG_NAME  var_1_12  dum_12  var_1_13  dum_13
A         12        0       5         0
B         13        0       11        0
C         6         1       NA        1
D         NA        1       5         1
How would I code this?
This approach (as would any, given your description) depends entirely on the merging companies being in consecutive rows.
# flag, down each column, the places where a cell and the one below it
# differ in NA-ness (one is NA, the other is not)
mtx <- apply(is.na(dat[,-1]), MARGIN = 2,
             function(vec) zoo::rollapply(vec, 2, function(z) xor(z[1], z[2]), fill = FALSE))
mtx
# var_1_12 var_1_13 var_1_14
# [1,] FALSE FALSE FALSE
# [2,] FALSE FALSE TRUE
# [3,] TRUE TRUE TRUE
# [4,] FALSE FALSE FALSE
out <- rowSums(mtx) == ncol(mtx)
out
# [1] FALSE FALSE TRUE FALSE
out | c(FALSE, out[-length(out)])
# [1] FALSE FALSE TRUE TRUE
### and your 0/1 numbers, if logical isn't right for you
+(out | c(FALSE, out[-length(out)]))
# [1] 0 0 1 1
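If you then want the flag written back as per-year dummy columns (dum_12, dum_13, ...) as in your expected output, the value is the same for every year here, so a minimal sketch (assuming your frame is named dat) would be:
dum <- +(out | c(FALSE, out[-length(out)]))
dat$dum_12 <- dum
dat$dum_13 <- dum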
Brief walk-through:
is.na(dat[,-1]) returns a matrix of whether the values (except the first column) are NA; because it's a matrix, we use apply to call a function on each column (using MARGIN=2);
zoo::rollapply is a function that does rolling calculations on a portion ("window") of the vector at a time, in this case 2-wide. For example, if we have 1:5, then it first looks at c(1,2), then c(2,3), then c(3,4), etc. (see the short demo after this walk-through);
xor is an eXclusive OR, meaning it is true when one of its arguments is true and the other is false;
mtx is a matrix indicating that a cell and the one below it met the conditions (one is NA, the other is not). We then check to see which of these rows are all true, forming out.
since we need a 1 in both rows (the year before and the year of the merge), we vector-OR (|) out with a shifted copy of itself to produce your intended output.
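To make the rolling window concrete, here is a tiny standalone demo (it assumes only that the zoo package is installed):
zoo::rollapply(1:5, 2, sum)
# [1] 3 5 7 9   ## sums of c(1,2), c(2,3), c(3,4), c(4,5)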
If I understand correctly, you want to code with "1" the rows that have at least one NA. If so, you just need one dummy variable for all the years, right? Something like this:
set.seed(4)
df <- data.frame(org=as.factor(LETTERS[1:5]),y1=sample(c(1:4,NA),5),y2=sample(c(3:6,NA),5),y3=sample(c(2:5,NA),5))
df$dummy <- as.numeric(apply(df, 1, function(x)any(is.na(x))))
which gives you
org y1 y2 y3 dummy
1 A 3 5 3 0
2 B NA 4 5 1
3 C 4 3 2 0
4 D 1 6 NA 1
5 E 2 NA 4 1
I'm trying to count the number of rows with NAs in the whole df, since I want to compute the percentage of rows with NAs over the total number of rows of the df.
I have already seen this post: Determine the number of rows with NAs, but it only covers a specific range of columns.
tl;dr: row wise, you'll want sum(!complete.cases(DF)), or, equivalently, sum(apply(DF, 1, anyNA))
There are a number of different ways to look at the number, proportion or position of NA values in a data frame:
Most of these start from the logical matrix that is.na() returns, with TRUE for every NA and FALSE everywhere else. For the base dataset airquality:
is.na(airquality)
There are 44 NA values in this data set
sum(is.na(airquality))
# [1] 44
You can look at the total number of NA values per row or column:
head(rowSums(is.na(airquality)))
# [1] 0 0 0 0 2 1
colSums(is.na(airquality))
#   Ozone Solar.R    Wind    Temp   Month     Day
#      37       7       0       0       0       0
You can use anyNA() in place of is.na() as well:
# by row
head(apply(airquality, 1, anyNA))
# [1] FALSE FALSE FALSE FALSE TRUE TRUE
sum(apply(airquality, 1, anyNA))
# [1] 42
# by column
head(apply(airquality, 2, anyNA))
# Ozone Solar.R Wind Temp Month Day
# TRUE TRUE FALSE FALSE FALSE FALSE
sum(apply(airquality, 2, anyNA))
# [1] 2
complete.cases() can be used, but only row-wise:
sum(!complete.cases(airquality))
# [1] 42
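Since the goal is the percentage of rows with NAs, taking the mean of that logical vector gets you there directly (base R only):
mean(!complete.cases(airquality)) * 100
# [1] 27.45098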
From the example here:
DF <- read.table(text=" col1 col2 col3
1 23 17 NA
2 55 NA NA
3 24 12 13
4 34 23 12", header=TRUE)
You can check which rows have at least one NA:
(which_nas <- apply(DF, 1, function(X) any(is.na(X))))
# 1 2 3 4
# TRUE TRUE FALSE FALSE
And then count them, identify them or get the ratio:
## Identify them
which(which_nas)
# 1 2
# 1 2
## Count them
length(which(which_nas))
#[1] 2
## Ratio
length(which(which_nas))/nrow(DF)
#[1] 0.5
Consider an ordered dataframe with a column that consists of values and NAs, like this:
df <- data.frame(id=rep(1:6), value=c(NA,NA,23,45,12,76))
I would like to shift the NAs to the last two rows of the data frame, while maintaining the order of the values, like so:
df$new_value <- c(23,45,12,76,NA,NA)
Is there any way I can do this? Thanks!
We can use order on the NA elements
df$new_value <- df$value[order(is.na(df$value))]
df$new_value
#[1] 23 45 12 76 NA NA
Calling is.na on the column returns a logical vector:
is.na(df$value)
#[1] TRUE TRUE FALSE FALSE FALSE FALSE
applying order on it returns
order(is.na(df$value))
#[1] 3 4 5 6 1 2
because FALSE is ordered before TRUE (logicals sort as 0 before 1). The values returned by order are the original position indices in the vector. This can be understood more easily with:
sort(c(TRUE, FALSE, TRUE), index.return = TRUE)
#$x
#[1] FALSE TRUE TRUE
#$ix
#[1] 2 1 3
Another idea, which will work only if your NAs are at the very beginning of the column, is to use the lead function from dplyr to shift your data n positions forward. For your case, that would be:
dplyr::lead(df$value, sum(is.na(df$value)))
#[1] 23 45 12 76 NA NA
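Assigned back, if you want it as the new column:
df$new_value <- dplyr::lead(df$value, sum(is.na(df$value)))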
Without being clever, some elementary techniques can also be applied:
df$new_value <- c(df[!is.na(df$value), "value"], df[is.na(df$value), "value"])
id value new_value
1 1 NA 23
2 2 NA 45
3 3 23 12
4 4 45 76
5 5 12 NA
6 6 76 NA
Quick question on understanding R's anyDuplicated: when passed a dataframe (let's say x, y, z columns with 1k observations), will it check whether any of the rows has the exact same x, y, z values as another row in the same dataframe? Thanks
I would use duplicated and combine it from front to back.
mydf <- data.frame(x = c(1:3,1,1), y = c(3:5,3,3))
mydf
# x y
# 1 1 3
# 2 2 4
# 3 3 5
# 4 1 3
# 5 1 3
There are three duplicated rows: 1, 4, and 5. But duplicated will only mark the later occurrences as duplicates, not the original value as well.
duplicated(mydf)
#[1] FALSE FALSE FALSE TRUE TRUE
duplicated(mydf, fromLast = TRUE)
#[1] TRUE FALSE FALSE TRUE FALSE
Using fromLast = TRUE looks from the end to the front, which picks up the original value too. By the way, I will ask the R core team to add a unified function that does both.
myduplicates <- duplicated(mydf) | duplicated(mydf, fromLast = TRUE)
Saving the expression as a variable allows us to count and subset later.
sum(myduplicates)
#[1] 3
mydf[myduplicates,]
# x y
# 1 1 3
# 4 1 3
# 5 1 3
mydf[!myduplicates,]
# x y
# 2 2 4
# 3 3 5
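To answer the literal question about anyDuplicated: yes, for a data frame it compares whole rows, and it returns the index of the first row that duplicates an earlier one (0 if there are none), so it works as a fast yes/no test. A quick sketch with the same mydf:
anyDuplicated(mydf)
# [1] 4        ## row 4 is the first duplicate of an earlier row
anyDuplicated(mydf[1:3, ])
# [1] 0        ## no duplicated rows
anyDuplicated(mydf) > 0
# [1] TRUE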
Having two dataframes:
x <- data.frame(numbers=c('1','2','3','4','5','6','7','8','9'), coincidence="NA")
and
y <- data.frame(numbers=c('1','3','10'))
How can I check whether the observations in y (1, 3 and 10) also exist in x, and fill the column x["coincidence"] accordingly (for example with YES|NO, TRUE|FALSE...)?
I would do the same in Excel with a formula combining IFERROR and VLOOKUP, but I don't know how to do the same with R.
Note:
I am open to changing the data.frames to tables or using libraries. The dataframe with the numbers to check (y) will never have more than 10-20 observations, while the other one (x) will never have more than 1K observations. Therefore, I could also iterate with an if, if it's necessary.
We can create the vector matching the desired output with a set membership test that outputs boolean TRUE and FALSE values where appropriate. The %in% operator is a binary operator that compares the values on its left-hand side to the set of values on its right:
x$coincidence <- x$numbers %in% y$numbers
# numbers coincidence
# 1 1 TRUE
# 2 2 FALSE
# 3 3 TRUE
# 4 4 FALSE
# 5 5 FALSE
# 6 6 FALSE
# 7 7 FALSE
# 8 8 FALSE
# 9 9 FALSE
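If you would rather have the YES/NO labels mentioned in the question, ifelse() over the same test is a direct translation of the Excel IFERROR/VLOOKUP pattern (a small variation, not the only way):
x$coincidence <- ifelse(x$numbers %in% y$numbers, "YES", "NO")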
Do numbers have to be factors, as you've set them up? (They're not numbers, but character.) If not, it's easy:
x <- data.frame(numbers=c('1','2','3','4','5','6','7','8','9'), coincidence="NA", stringsAsFactors=FALSE)
y <- data.frame(numbers=c('1','3','10'), stringsAsFactors=FALSE)
x$coincidence[x$numbers %in% y$numbers] <- TRUE
> x
numbers coincidence
1 1 TRUE
2 2 NA
3 3 TRUE
4 4 NA
5 5 NA
6 6 NA
7 7 NA
8 8 NA
9 9 NA
If they need to be factors, then you'll need to either set common levels or use as.character().
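For example, a minimal sketch of the as.character() route, which works regardless of the factor levels:
x$coincidence[as.character(x$numbers) %in% as.character(y$numbers)] <- TRUE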
Consider the following code. When you don't explicitly test for NA in your condition, that code will fail at some later date when your data changes.
> # A toy example
> a <- as.data.frame(cbind(col1=c(1,2,3,4),col2=c(2,NA,2,3),col3=c(1,2,3,4),col4=c(4,3,2,1)))
> a
col1 col2 col3 col4
1 1 2 1 4
2 2 NA 2 3
3 3 2 3 2
4 4 3 4 1
>
> # Bummer, there's an NA in my condition
> a$col2==2
[1] TRUE NA TRUE FALSE
>
> # Why is this a good thing to do?
> # It NA'd the whole row, and kept it
> a[a$col2==2,]
col1 col2 col3 col4
1 1 2 1 4
NA NA NA NA NA
3 3 2 3 2
>
> # Yes, this is the right way to do it
> a[!is.na(a$col2) & a$col2==2,]
col1 col2 col3 col4
1 1 2 1 4
3 3 2 3 2
>
> # Subset seems designed to avoid this problem
> subset(a, col2 == 2)
col1 col2 col3 col4
1 1 2 1 4
3 3 2 3 2
Can someone explain why the behavior you get without the is.na check would ever be good or useful?
I definitely agree that this isn't intuitive (I made that point before on SO). In defense of R, I think that knowing when you have a missing value is useful (i.e. this is not a bug). The == operator is explicitly designed to notify the user of NA or NaN values. See ?"==" for more information. It states:
Missing values ('NA') and 'NaN' values are regarded as
non-comparable even to themselves, so comparisons involving them
will always result in 'NA'.
In other words, a missing value isn't comparable using a binary operator (because it's unknown).
Beyond is.na(), you could also do:
which(a$col2==2) # tests explicitly for TRUE
Or
a$col2 %in% 2 # only checks for 2
%in% is defined as using the match() function:
'"%in%" <- function(x, table) match(x, table, nomatch = 0) > 0'
This is also covered in "The R Inferno".
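A quick illustration of the difference (base R only): == propagates the NA, while %in% treats it as simply not matching:
c(1, NA, 2) == 2
# [1] FALSE    NA  TRUE
c(1, NA, 2) %in% 2
# [1] FALSE FALSE  TRUE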
Checking for NA values in your data is crucial in R, because many important operators don't handle them the way you might expect. Beyond ==, this is also true for things like &, |, <, sum(), and so on. I am always thinking "what would happen if there were an NA here?" when I'm writing R code. Requiring an R user to be careful with missing values is "by design".
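For instance, a small sketch of how NA propagates through a few of these (base R only):
NA > 1
# [1] NA
sum(c(1, 2, NA))
# [1] NA
sum(c(1, 2, NA), na.rm = TRUE)
# [1] 3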
Update: How is NA handled when there are multiple logical conditions?
NA is a logical constant, and you might get unexpected subsetting if you don't think about what might be returned (e.g. NA | TRUE evaluates to TRUE, while NA & TRUE is NA). These truth tables from ?Logic may provide a useful illustration:
x <- c(NA, FALSE, TRUE)
outer(x, x, "&") ## AND table
# <NA> FALSE TRUE
#<NA> NA FALSE NA
#FALSE FALSE FALSE FALSE
#TRUE NA FALSE TRUE
outer(x, x, "|") ## OR table
# <NA> FALSE TRUE
#<NA> NA NA TRUE
#FALSE NA FALSE TRUE
#TRUE TRUE TRUE TRUE
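In a data frame, this is exactly how the surprising subsetting above arises with multiple conditions; a short sketch reusing the toy data frame a from the question:
a$col2 == 2 & a$col3 < 4
# [1]  TRUE    NA  TRUE FALSE
a[a$col2 == 2 & a$col3 < 4, ]
#    col1 col2 col3 col4
# 1     1    2    1    4
# NA   NA   NA   NA   NA
# 3     3    2    3    2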