Swap NA's with values whilst preserving original value order - r

Consider an ordered data frame with a column that consists of values and NAs, like this:
df <- data.frame(id=rep(1:6), value=c(NA,NA,23,45,12,76))
I would like to shift the NAs to the first two rows of the data frame, while maintaining the order of the values, like so:
df$new_value <- c(23,45,12,76,NA,NA)
Is there any way I can do this? Thanks!

We can use order on the NA elements
df$new_value <- df$value[order(is.na(df$value))]
df$new_value
#[1] 23 45 12 76 NA NA
Calling is.na returns a logical vector:
is.na(df$value)
#[1] TRUE TRUE FALSE FALSE FALSE FALSE
applying order on it returns
order(is.na(df$value))
#[1] 3 4 5 6 1 2
because FALSE (which coerces to 0) sorts before TRUE (which coerces to 1). The order values are the original position indices of the vector. This can be understood more easily with
sort(c(TRUE, FALSE, TRUE), index.return = TRUE)
#$x
#[1] FALSE TRUE TRUE
#$ix
#[1] 2 1 3
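If you need this repeatedly, the same idea wraps into a small helper; push_na_last is a hypothetical name, not a base function:
push_na_last <- function(x) x[order(is.na(x))]   # move NAs to the end, keep value order
push_na_last(c(NA, NA, 23, 45, 12, 76))
# [1] 23 45 12 76 NA NA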

Another idea, which will work only if your NAs are at the very beginning of the column (as they are here), is to use the lead function from dplyr in order to shift your data n positions forward. So for your case, it would be,
dplyr::lead(df$value, sum(is.na(df$value)))
#[1] 23 45 12 76 NA NA
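If you'd rather stay in base R, a sketch of the same shift (assuming, as above, that all the NAs sit at the start of the column) could be:
n <- sum(is.na(df$value))             # number of leading NAs to shift out
c(df$value[-seq_len(n)], rep(NA, n))  # drop the first n values, pad with NA at the end
# [1] 23 45 12 76 NA NA
(Note this sketch assumes n > 0; x[-seq_len(0)] would return an empty vector.)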

Without being clever, some elementary techniques can also be applied:
df$new_value <- c(df[!is.na(df$value), "value"], df[is.na(df$value), "value"])
id value new_value
1 1 NA 23
2 2 NA 45
3 3 23 12
4 4 45 76
5 5 12 NA
6 6 76 NA

Related

How to add dummy variables to data with specific characteristic

My question is probably quite basic, but I've been struggling with it, so I'd be really grateful if someone could offer a solution.
I have data in the following format:
ORG_NAME  var_1_12  var_1_13  var_1_14
A         12        11        5
B         13        13        11
C         6         7         NA
D         NA        NA        5
I have data on organizations over 5 years, but over that time, some organizations have merged and others have disappeared. I'm planning on conducting a fixed-effects regression, so I need to add a dummy variable which is "0" when organizations have remained the same (in this case row A and row B), and "1" in the year before the merge, and after the merge. In this case, I know that orgs C and D merged, so I would like for the data to look like this:
ORG_NAME  var_1_12  dum_12  var_1_13  dum_13
A         12        0       5         0
B         13        0       11        0
C         6         1       NA        1
D         NA        1       5         1
How would I code this?
This approach (as is any, according to your description) is absolutely dependent on the companies being in consecutive rows.
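To make the snippet below reproducible, dat is assumed to hold the question's table:
dat <- data.frame(ORG_NAME = c("A", "B", "C", "D"),
                  var_1_12 = c(12, 13, 6, NA),
                  var_1_13 = c(11, 13, 7, NA),
                  var_1_14 = c(5, 11, NA, 5))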
mtx <- apply(is.na(dat[,-1]), MARGIN = 2,
function(vec) zoo::rollapply(vec, 2, function(z) xor(z[1], z[2]), fill = FALSE))
mtx
# var_1_12 var_1_13 var_1_14
# [1,] FALSE FALSE FALSE
# [2,] FALSE FALSE TRUE
# [3,] TRUE TRUE TRUE
# [4,] FALSE FALSE FALSE
out <- rowSums(mtx) == ncol(mtx)
out
# [1] FALSE FALSE TRUE FALSE
out | c(FALSE, out[-length(out)])
# [1] FALSE FALSE TRUE TRUE
### and your 0/1 numbers, if logical isn't right for you
+(out | c(FALSE, out[-length(out)]))
# [1] 0 0 1 1
Brief walk-through:
is.na(dat[,-1]) returns a matrix of whether the values (except the first column) are NA; because it's a matrix, we use apply to call a function on each column (using MARGIN=2);
zoo::rollapply is a function that does rolling calculations on a portion ("window") of the vector at a time, in this case 2-wide. For example, if we have 1:5, then it first looks at c(1,2), then c(2,3), then c(3,4), etc.
xor is an eXclusive OR, meaning it will be true when one of its arguments is true and the other is false;
mtx is a matrix indicating that a cell and the one below it met the conditions (one is NA, the other is not). We then check to see which of these rows are all true, forming out.
since we need a 1 in both rows, we vector-OR (|) out with itself, shifted, to produce your intended output.
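Putting it together, attaching the flags as the dum_ columns from your desired output might look like this (the layout is an assumption; the question only shows dum_12 and dum_13):
dummy <- +(out | c(FALSE, out[-length(out)]))
dat$dum_12 <- dummy   # same flag each year here, since the merge shows up in every year
dat$dum_13 <- dummy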
If I understand well, you want to code with "1" the rows with at least one NA. If so, you just need one dummy variable for all the years, right? Something like this:
set.seed(4)
df <- data.frame(org=as.factor(LETTERS[1:5]),y1=sample(c(1:4,NA),5),y2=sample(c(3:6,NA),5),y3=sample(c(2:5,NA),5))
df$dummy <- as.numeric(apply(df, 1, function(x) any(is.na(x))))
which give you
org y1 y2 y3 dummy
1 A 3 5 3 0
2 B NA 4 5 1
3 C 4 3 2 0
4 D 1 6 NA 1
5 E 2 NA 4 1

How to simply count number of rows with NAs - R [duplicate]

I'm trying to count the number of rows with NAs in the whole df, as I want to compute the % of rows with NAs over the total number of rows of the df.
I have already seen this post: Determine the number of rows with NAs, but it only covers a specific range of columns.
tl;dr: row-wise, you'll want sum(!complete.cases(DF)) or, equivalently, sum(apply(DF, 1, anyNA)).
There are a number of different ways to look at the number, proportion or position of NA values in a data frame:
Most of these start from the logical matrix that is.na returns, with TRUE for every NA and FALSE everywhere else. For the base dataset airquality:
is.na(airquality)
There are 44 NA values in this data set
sum(is.na(airquality))
# [1] 44
You can look at the total number of NA values per row or column:
head(rowSums(is.na(airquality)))
# [1] 0 0 0 0 2 1
colSums(is.na(airquality))
# Ozone Solar.R Wind Temp Month Day
#    37       7    0    0     0   0
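If you want per-column proportions instead of counts, colMeans does the division by the number of rows for you:
colMeans(is.na(airquality))
#      Ozone    Solar.R       Wind       Temp      Month        Day
# 0.24183007 0.04575163 0.00000000 0.00000000 0.00000000 0.00000000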
You can use anyNA() in place of is.na() as well:
# by row
head(apply(airquality, 1, anyNA))
# [1] FALSE FALSE FALSE FALSE TRUE TRUE
sum(apply(airquality, 1, anyNA))
# [1] 42
# by column
head(apply(airquality, 2, anyNA))
# Ozone Solar.R Wind Temp Month Day
# TRUE TRUE FALSE FALSE FALSE FALSE
sum(apply(airquality, 2, anyNA))
# [1] 2
complete.cases() can be used, but only row-wise:
sum(!complete.cases(airquality))
# [1] 42
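That makes the percentage asked for in the question a one-liner, since the mean of a logical vector is the proportion of TRUEs:
mean(!complete.cases(airquality)) * 100   # % of rows containing at least one NA
# [1] 27.45098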
From the example here:
DF <- read.table(text=" col1 col2 col3
1 23 17 NA
2 55 NA NA
3 24 12 13
4 34 23 12", header=TRUE)
You can check which rows have at least one NA:
(which_nas <- apply(DF, 1, function(X) any(is.na(X))))
# 1 2 3 4
# TRUE TRUE FALSE FALSE
And then count them, identify them or get the ratio:
## Identify them
which(which_nas)
# 1 2
# 1 2
## Count them
length(which(which_nas))
#[1] 2
## Ratio
length(which(which_nas))/nrow(DF)
#[1] 0.5
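Since which_nas is a logical vector, sum() and mean() give the count and the ratio more directly:
sum(which_nas)    # count of rows with at least one NA
# [1] 2
mean(which_nas)   # ratio over all rows
# [1] 0.5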

Select entries with more than one number across all columns in R [duplicate]

I have a data frame with many rows (say 4 rows in the example below), and each row has a certain number of columns associated with it as below:
1 91 90 20
2 21 NA NA
3 20 20 NA
4 30 NA NA
The numbers 1,2,3 and 4 in the far left are row IDs. I need to extract the rows that contain more than one number across all associated columns. So what I would expect is:
1 91 90 20
3 20 20 NA
I have tried using "which" in combination with "lapply" but this just gives me TRUE or FALSE as output, whereas I need the actual values as above.
You can do that by using rowSums on the non-NA check, then filtering to rows with more than one number.
df[rowSums(!is.na(df)) > 1,]
Breakdown:
df <- data.frame(x = c(91, 21, 20, 30), y = c(90, NA, 20, NA), z = c(20, NA, NA, NA))
We can turn it into a T/F matrix by:
!is.na(df)
x y z
[1,] TRUE TRUE TRUE
[2,] TRUE FALSE FALSE
[3,] TRUE TRUE FALSE
[4,] TRUE FALSE FALSE
This shows where there are and aren't numbers. Now we just need to sum up the rows:
rowSums(!is.na(df))
[1] 3 1 2 1
This yields the number of non-NA entries per row. Now we can change that back into a logical vector by keeping only the rows that have more than one:
rowSums(!is.na(df)) > 1
[1] TRUE FALSE TRUE FALSE
Now subset the df with that:
df[rowSums(!is.na(df)) > 1,]
x y z
1 91 90 20
3 20 20 NA
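The same condition also drops into subset() if you prefer that form:
subset(df, rowSums(!is.na(df)) > 1)
#    x  y  z
# 1 91 90 20
# 3 20 20 NA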

Smartest way to check if an observation in data.frame(x) exists also in data.frame(y) and populate a new column according to the result

Having two dataframes:
x <- data.frame(numbers=c('1','2','3','4','5','6','7','8','9'), coincidence="NA")
and
y <- data.frame(numbers=c('1','3','10'))
How can I check whether the observations in y (1, 3 and 10) also exist in x, and fill the column x["coincidence"] accordingly (for example with YES|NO, TRUE|FALSE...)?
I would do the same in Excel with a formula combining IFERROR and VLOOKUP, but I don't know how to do the same with R.
Note:
I am open to changing the data.frames to tables or using libraries. The data frame with the numbers to check (y) will never have more than 10-20 observations, while the other one (x) will never have more than 1K observations. Therefore, I could also iterate with an if, if it's necessary.
We can create the vector matching the desired output with a set membership test that outputs boolean TRUE and FALSE values where appropriate. %in% is a binary operator that compares the values on the left-hand side to the set of values on the right:
x$coincidence <- x$numbers %in% y$numbers
# numbers coincidence
# 1 1 TRUE
# 2 2 FALSE
# 3 3 TRUE
# 4 4 FALSE
# 5 5 FALSE
# 6 6 FALSE
# 7 7 FALSE
# 8 8 FALSE
# 9 9 FALSE
Do numbers have to be factors, as you've set them up? (They're not numbers, but character.) If not, it's easy:
x <- data.frame(numbers=c('1','2','3','4','5','6','7','8','9'), coincidence="NA", stringsAsFactors=FALSE)
y <- data.frame(numbers=c('1','3','10'), stringsAsFactors=FALSE)
x$coincidence[x$numbers %in% y$numbers] <- TRUE
> x
numbers coincidence
1 1 TRUE
2 2 NA
3 3 TRUE
4 4 NA
5 5 NA
6 6 NA
7 7 NA
8 8 NA
9 9 NA
If they need to be factors, then you'll need to either set common levels or use as.character().
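For a closer analogue of the VLOOKUP step mentioned in the question, match() returns the position in y of each value of x, or NA when there is no match; a minimal sketch using the stringsAsFactors=FALSE versions above:
idx <- match(x$numbers, y$numbers)
idx
# [1]  1 NA  2 NA NA NA NA NA NA
x$coincidence <- ifelse(is.na(idx), "NO", "YES")   # or !is.na(idx) for TRUE/FALSE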

How to loop over a vector comparing rows without FOR

I need some hints for making an efficient loop over a vector without a "for" loop, because of optimization issues.
At first glance, it is recommended to use functions such as apply() or sapply().
I have a vector converted into matrix:
x1<-c(1,2,4,1,4,3,5,3,1,0)
Looping through the vector, I need to set x1[i+1] = x1[i] whenever x1[i] > x1[i+1].
Example:
Input vector:
x1<-as.matrix(c(1,2,4,1,4,3,5,3,1,0))
Output vector:
c(1,2,4,4,4,4,5,5,5,5)
My approach is to use a user function in apply(), but I have some difficulty coding the relation between x1[i] and x1[i+1] in the user function.
I would be very grateful for your ideas or hints.
In general you can use Reduce with accumulate=TRUE for cumulative operations
Reduce(max,x1,accumulate=TRUE)
# [1] 1 2 4 4 4 4 5 5 5 5
But as #Khashaa points out, the common cases cumsum, cumprod, cummin, and yours, cummax, are provided as efficient base functions.
cummax(x1)
# [1] 1 2 4 4 4 4 5 5 5 5
We could do this using ave. (Using the vector x1)
ave(x1,cumsum(c(TRUE,x1[-1]>x1[-length(x1)])), FUN=function(x) head(x,1))
#[1] 1 2 4 4 4 4 5 5 5 5
We create a grouping variable based on the condition described in the OP's post: check whether the succeeding element (x1[-1], the vector with its first element removed) is greater than the current element (x1[-length(x1)], the vector with its last element removed).
x1[-1]>x1[-length(x1)]
#[1] TRUE TRUE FALSE TRUE FALSE TRUE FALSE FALSE FALSE
The length is one less than that of the vector x1, so we prepend TRUE to make the lengths equal and then do the cumsum:
cumsum(c(TRUE,x1[-1]>x1[-length(x1)]))
#[1] 1 2 3 3 4 4 5 5 5 5
This we use as the grouping variable in ave, selecting the first observation of x1 within each group.
Another option would be to get the logical index (c(TRUE, x1[-1] > x1[-length(x1)])) as before, negate it (!) so that TRUE becomes FALSE and FALSE becomes TRUE, convert the TRUE values to NA (NA^(!...)), and then use na.locf from library(zoo) to replace each NA with the preceding non-NA value.
library(zoo)
na.locf(x1*NA^(!c(TRUE,x1[-1]>x1[-length(x1)])))
#[1] 1 2 4 4 4 4 5 5 5 5
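The intermediate vector, before na.locf fills it, looks like this:
x1 * NA^(!c(TRUE, x1[-1] > x1[-length(x1)]))
# [1]  1  2  4 NA  4 NA  5 NA NA NA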
