How can I calculate pairwise difference between values in one column?
The calculation should start with the first two values, and should be continued with the next two values as it is done in column "desired_result" here:
data <- data.frame(data = c(5, NA, NA, NA, 3, NA, NA, 4, NA, 3, NA, NA, NA, 6, 1, 4, NA, 2))
Here's a one-liner:
data$desired_result[which(!is.na(data$data))[c(FALSE, TRUE)]] <-
rev(diff(rev(na.omit(data$data))))[c(TRUE, FALSE)]
where which(!is.na(data$data)) finds the positions of the non-NA entries of data$data, and indexing the result with c(FALSE, TRUE), which R recycles, keeps only every second position. On the right-hand side, na.omit(data$data) discards the NA values, rev reverses the vector, diff takes the differences, and the second rev restores the original order. Finally, since we don't want all the differences, I again keep only every second one, this time with c(TRUE, FALSE).
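To make the recycling of the logical index concrete, here is a small step-by-step sketch of the same computation on a plain vector. The variable names (pos, target, d, pair_diff, result) are only illustrative and do not appear in the answer above.

```r
x <- c(5, NA, NA, NA, 3, NA, NA, 4, NA, 3, NA, NA, NA, 6, 1, 4, NA, 2)

pos <- which(!is.na(x))         # positions of the non-NA values
pos
# [1]  1  5  8 10 14 15 16 18
target <- pos[c(FALSE, TRUE)]   # logical index is recycled: keeps the 2nd, 4th, ...
target
# [1]  5 10 15 18

v <- x[!is.na(x)]               # the non-NA values themselves
d <- rev(diff(rev(v)))          # v[i] - v[i + 1] for every consecutive pair
d
# [1]  2 -1  1 -3  5 -3  2
pair_diff <- d[c(TRUE, FALSE)]  # keeps the 1st, 3rd, ...: one difference per pair
pair_diff
# [1] 2 1 5 2

result <- rep(NA_real_, length(x))
result[target] <- pair_diff     # write each difference at the second element of its pair
```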
Same as Julius but even shorter and faster:
data$desired_result[which(!is.na(data$data))[c(FALSE, TRUE)]] <-
diff(na.omit(data$data))[c(TRUE, FALSE)] * -1
Since diff() calculates x1 - x0, both rev() calls can be replaced by multiplying the result of diff() by -1.
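A quick way to convince yourself of this equivalence is to compare both forms on the non-NA values from the question. This is only a sanity check, not a benchmark.

```r
v <- c(5, 3, 4, 3, 6, 1, 4, 2)   # the non-NA values from the question

forward  <- diff(v)              # v[i + 1] - v[i]
reversed <- rev(diff(rev(v)))    # v[i] - v[i + 1]
negated  <- -diff(v)             # same values as `reversed`, without reversing twice

stopifnot(isTRUE(all.equal(reversed, negated)))
```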
Speed comparison using microbenchmark:
Unit: microseconds
   expr    min     lq     mean median      uq        max neval cld
 julius 38.096 43.757 51.44687 46.143 50.8655 170511.851 1e+05   b
   this 32.828 37.501 43.02233 39.548 43.4390   7405.489 1e+05  a
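If you want a result exactly like the one you described, you can use:
library(dplyr)
library(zoo)
data <- data.frame(data = c(5, NA, NA, NA, 3, NA, NA, 4, NA, 3, NA,
                            NA, NA, 6, 1, 4, NA, 2)) %>% mutate(index = 1:n())
# keep only the non-NA rows, remembering their original row positions
ex <- data %>% filter(!is.na(data))
# for each pair: the index of its second value, and first value minus second value
df2 <- data.frame(index = rollapply(ex$index, width = 2, by = 2, last),
                  desired_results = rollapply(ex$data, width = 2, by = 2, FUN = function(x) -1 * diff(x)))
# join the differences back onto the full data and drop the helper column
data2 <- left_join(data, df2, by = "index") %>% select(-index)
data2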
data desired_results
1 5 NA
2 NA NA
3 NA NA
4 NA NA
5 3 2
6 NA NA
7 NA NA
8 4 NA
9 NA NA
10 3 1
11 NA NA
12 NA NA
13 NA NA
14 6 NA
15 1 5
16 4 NA
17 NA NA
18 2 2
If you just want the differences, you can use:
rollapply(na.omit(data$data), by = 2, width = 2, diff)
Beware that the results are negative, because diff() computes the second value minus the first: -2 -1 -5 -2
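If you would rather avoid the zoo dependency, the pairs can also be laid out as the columns of a two-row matrix. This is a base R sketch of a different technique than rollapply, and it assumes the number of non-NA values is even.

```r
x <- c(5, NA, NA, NA, 3, NA, NA, 4, NA, 3, NA, NA, NA, 6, 1, 4, NA, 2)

v <- x[!is.na(x)]
stopifnot(length(v) %% 2 == 0)   # assumption: the values split cleanly into pairs

pairs <- matrix(v, nrow = 2)     # filled column-wise, so each column holds one pair
pair_diff <- pairs[1, ] - pairs[2, ]   # first minus second, so the sign is already positive here
pair_diff
# [1] 2 1 5 2
```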
I have a simplified dataframe:
test <- data.frame(
x = c(1,2,3,NA,NA,NA),
y = c(NA, NA, NA, 3, 2, NA),
a = c(NA, NA, NA, NA, NA, TRUE)
)
I want to create a new column rating that holds the number from either column x or column y. The dataset is structured so that whenever there is a numeric value in x, there is an NA in y. If both columns are NA, the value in rating should be NA.
In this case, the expected output is: 1,2,3,3,2,NA
With coalesce:
library(dplyr)
test %>%
mutate(rating = coalesce(x, y))
x y a rating
1 1 NA NA 1
2 2 NA NA 2
3 3 NA NA 3
4 NA 3 NA 3
5 NA 2 NA 2
6 NA NA TRUE NA
library(dplyr)
test %>%
mutate(rating = if_else(is.na(x),
y, x))
x y a rating
1 1 NA NA 1
2 2 NA NA 2
3 3 NA NA 3
4 NA 3 NA 3
5 NA 2 NA 2
6 NA NA TRUE NA
Here are several solutions.
# Input
test <- data.frame(
x = c(1,2,3,NA,NA,NA),
y = c(NA, NA, NA, 3, 2, NA),
a = c(NA, NA, NA, NA, NA, TRUE)
)
# Base R solution
test$rating <- ifelse(!is.na(test$x), test$x,
ifelse(!is.na(test$y), test$y, NA))
# dplyr solution
library(dplyr)
test <- test %>%
mutate(rating = case_when(!is.na(x) ~ x,
!is.na(y) ~ y,
TRUE ~ NA_real_))
# data.table solution
library(data.table)
setDT(test)
test[, rating := ifelse(!is.na(x), x, ifelse(!is.na(y), y, NA))]
Created on 2022-12-23 with reprex v2.0.2
test <- data.frame(
x = c(1,2,3,NA,NA,NA),
y = c(NA, NA, NA, 3, 2, NA),
a = c(NA, NA, NA, NA, NA, TRUE)
)
test$rating <- dplyr::coalesce(test$x, test$y)
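For readers who want the same "first non-NA value wins" behaviour without dplyr, here is a minimal base R sketch. The helper name coalesce_base is hypothetical, and the function assumes all inputs are vectors of the same length.

```r
# Hypothetical helper: take the first non-NA value across any number of vectors
coalesce_base <- function(...) {
  Reduce(function(a, b) ifelse(is.na(a), b, a), list(...))
}

test <- data.frame(
  x = c(1, 2, 3, NA, NA, NA),
  y = c(NA, NA, NA, 3, 2, NA),
  a = c(NA, NA, NA, NA, NA, TRUE)
)
test$rating <- coalesce_base(test$x, test$y)
test$rating
# [1]  1  2  3  3  2 NA
```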
I have a data set with many columns (DATA_OLD) in which I want to exchange all values based on an allocation list with many entries (KEY).
Every value in DATA_OLD should be replaced by its counterpart (can be seen in KEY) to create DATA_NEW.
For simplicity, the example here contains a short KEY and DATA_OLD set. In reality, there are >2500 rows in KEY and >100 columns in DATA_OLD. Therefore, an approach that can be applied to the whole data set simultaneously without calling each colname of DATA_OLD is important.
KEY:
old  new
  1    1
  3    2
  7    3
 12    4
 55    5
Following this example, every value "1" should be replaced with another value "1". Every value "3" should be replaced with value "2". Every value "7" should be replaced with value "3".
DATA_OLD (START):
var1  var2  var3
  NA     3    NA
  NA    55    NA
   1    NA    NA
  NA    NA    NA
   3    NA    NA
  55    NA    12
DATA_NEW (RESULT):
var1  var2  var3
  NA     2    NA
  NA     5    NA
   1    NA    NA
  NA    NA    NA
   2    NA    NA
   5    NA     4
Here is reproducible data:
KEY<-structure(list(old = c(1, 3, 7, 12, 55), new = c(1, 2, 3, 4,
5)), class = "data.frame", row.names = c(NA, -5L))
DATA_OLD<-structure(list(var1 = c(NA, NA, 1, NA, 3, 55), var2 = c(3,
55, NA, NA, NA, NA), var3 = c(1, NA, NA, NA, NA, 12)), class = "data.frame", row.names = c(NA, -6L))
DATA_NEW<-structure(list(var1 = c(NA, NA, 1, NA, 2, 5), var2 = c(2,
5, NA, NA, NA, NA), var3 = c(1, NA, NA, NA, NA, 4)), class = "data.frame", row.names = c(NA, -6L))
I have tried back and forth, and it appears that I am completely clueless. Help would be greatly appreciated! The real data set is quite large...
1) Base R Be careful here since some solutions have the side effect of converting the numeric columns to character or factor or the data frame to something else. A solution using match will generally work. The result of lapply is a list so convert back to data frame.
DATA_OLD |>
lapply(function(x) with(KEY, new[match(x, old)])) |>
as.data.frame()
or
DATA_NEW <- DATA_OLD
DATA_NEW[] <- lapply(DATA_OLD, function(x) with(KEY, new[match(x, old)]))
This last one is easy to convert to act only on some columns
DATA_NEW <- DATA_OLD
ix <- 1:2 # only convert these columns
DATA_NEW[ix] <- lapply(DATA_OLD[ix], function(x) with(KEY, new[match(x, old)]))
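To see why match does the job, it helps to look at what it returns for a single vector. In the sketch below the value 99 is made up to show what happens to a value that has no entry in KEY: it becomes NA rather than being kept.

```r
KEY <- data.frame(old = c(1, 3, 7, 12, 55), new = c(1, 2, 3, 4, 5))

x <- c(3, NA, 55, 7, 99)      # 99 is a made-up value that is not in KEY$old

idx <- match(x, KEY$old)      # row of KEY where each value is found, NA if absent
idx
# [1]  2 NA  5  3 NA
recoded <- KEY$new[idx]       # look up the replacement in the same row
recoded
# [1]  2 NA  5  3 NA
```

If unmatched values should be kept instead of turned into NA, one option is ifelse(is.na(idx), x, KEY$new[idx]).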
2) purrr Alternately use map_dfr which returns a data frame directly:
library(purrr)
map_dfr(DATA_OLD, ~ with(KEY, new[match(.x, old)]))
3) dplyr A dplyr solution using across is the following. If there were some non-numeric columns that should not be converted then replace everything() with where(is.numeric)
library(dplyr)
DATA_OLD %>%
mutate(across(everything(), ~ with(KEY, new[match(.x, old)])))
The simplest way to implement a dictionary in R is a named vector, where you can use the names as indices:
key <- setNames(KEY$new, KEY$old)
> key
1 3 7 12 55
1 2 3 4 5
The only thing to be mindful of is that the indexing must be done by character, rather than by integer:
> key[3]
7
3 # WRONG! This is the 3rd item!
> key["3"]
3
2 # RIGHT! This is the item named "3"
Then you can apply the transformation column-wise. This turns the data into a matrix, but you can simply turn it back.
as.data.frame(apply(DATA_OLD, 2, \(col) key[as.character(col)]))
var1 var2 var3
1 NA 2 1
2 NA 5 NA
3 1 NA NA
4 NA NA NA
5 2 NA NA
6 5 NA 4
I have this kind of data :
daynight
[1] NA NA NA NA 2 1 NA NA
I want R to detect if there is a series of at least x NA and replace these by another value.
For example if x=3 and the replacement value is 3 I want R to give me in output :
daynight
[1] 3 3 3 3 2 1 NA NA
Would you have any ideas?
We can use rle
daynight <- c(NA, NA, NA, NA ,2 ,1, NA, NA)
x <- 3
r <- 3
daynight[with(rle(is.na(daynight)), rep(lengths >= x & values, lengths))] <- r
daynight
#[1] 3 3 3 3 2 1 NA NA
Taking another example :
daynight <- c(NA, NA, NA, 3,2,1, NA, NA, 1, NA, NA, NA, 1, NA, NA)
daynight[with(rle(is.na(daynight)), rep(lengths >= x & values, lengths))] <- r
#[1] 3 3 3 3 2 1 NA NA 1 3 3 3 1 NA NA
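The one-liner is easier to follow when the intermediate results of rle are printed. The names runs, long_na and mask below are only illustrative.

```r
daynight <- c(NA, NA, NA, NA, 2, 1, NA, NA)
x <- 3   # minimum run length
r <- 3   # replacement value

runs <- rle(is.na(daynight))
runs$lengths
# [1] 4 2 2
runs$values
# [1]  TRUE FALSE  TRUE

long_na <- runs$lengths >= x & runs$values   # runs that are NA and long enough
long_na
# [1]  TRUE FALSE FALSE
mask <- rep(long_na, runs$lengths)           # expand back to one flag per element

daynight[mask] <- r
daynight
# [1]  3  3  3  3  2  1 NA NA
```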
And here is another solution using the zoo package
library(zoo)
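replace_consecutive_NAs <- function(x, nrNAs = 3, replaceBy = nrNAs){
  na <- as.numeric(is.na(x))
  # 1 where a window of nrNAs consecutive NAs starts at this position
  starts <- rollapply(na, nrNAs, prod, fill = 0, align = "left")
  # a position is replaced if any all-NA window of width nrNAs covers it
  covered <- rollapply(c(rep(0, nrNAs - 1), starts), nrNAs, max) > 0
  x[covered] <- replaceBy   # modify the original values, not the 0/1 indicator
  x
}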
x <- c(NA, NA, NA, NA ,2 ,1, NA, NA)
replace_consecutive_NAs(x, 3, 999)
[1] 999 999 999 999 2 1 NA NA
I have a dataset where I have to fill NA values using the previous value and a sum of current value in another column. Basically, my data looks like
library(lubridate)
library(tidyverse)
library(zoo)
df <- tibble(
Id = c(1, 1, 1, 1, 2, 2, 2, 2),
Time = ymd(c("2012-09-01", "2012-09-02", "2012-09-03", "2012-09-04", "2012-09-01", "2012-09-02", "2012-09-03", "2012-09-04")),
av = c(18, NA, NA, NA, 21, NA, NA, NA),
Value = c(121, NA,NA, NA, 146, NA, NA, NA)
)
# A tibble: 8 x 4
Id Time av Value
<dbl> <date> <dbl> <dbl>
1 2012-09-01 18 121
1 2012-09-02 NA NA
1 2012-09-03 NA NA
1 2012-09-04 NA NA
2 2012-09-01 21 146
2 2012-09-02 NA NA
2 2012-09-03 NA NA
2 2012-09-04 NA NA
What I want to do is: where the Value is NA, I want to replace it by sum of previous Value and current value of av. If av is NA, it can be replaced with previous value. I use na.locf function from zoo package as
df1 <- df %>% arrange(Id, Time) %>% group_by(Id) %>%
mutate(av = zoo::na.locf(av))
However, filling in for Value seems to be difficult. I can do it using for loop as
# Back up the Value column for testing
df1$Value_backup <- df1$Value
for(i in 2:nrow(df1))
{
df1$Value[i] <- ifelse(is.na(df1$Value[i]), df1$av[i] + df1$Value[i-1], df1$Value[i])
}
This produces the result I want, but for a large dataset I believe there are better ways to do it in R. I tried the complete function from tidyr, but it adds two additional rows:
df1 <- df %>% arrange(Id, Time) %>% group_by(Id) %>% mutate(av = zoo::na.locf(av)) %>%
mutate(num_rows = n()) %>%
complete(nesting(Id), Value = seq(min(Value, na.rm = TRUE),
(min(Value, na.rm = TRUE) + max(num_rows) * min(na.omit(av))), min(na.omit(av))))
The output has two extra rows; 10 instead of 8
# A tibble: 10 x 5
# Groups: Id [2]
Id Value Time av num_rows
<dbl> <dbl> <date> <dbl> <int>
1 121 2012-09-01 18 4
1 139 NA NA NA
1 157 NA NA NA
1 175 NA NA NA
1 193 NA NA NA
2 146 2012-09-01 21 4
2 167 NA NA NA
2 188 NA NA NA
2 209 NA NA NA
2 230 NA NA NA
Any help to do it faster without loops would be greatly appreciated.
In the question av starts with a non-NA in each group and is followed by NAs so if this is the general pattern then this will work. Note that it is good form to close any group_by with ungroup; however, we did not do that below so that we could compare df2 with df1.
df2 <- df %>%
group_by(Id) %>%
mutate(Value_backup = Value,
av = first(av),
Value = first(Value) + cumsum(av) - av)
identical(df1, df2)
## [1] TRUE
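The reason first(Value) + cumsum(av) - av reproduces the loop can be checked on a single group in base R. This sketch uses the numbers for Id 1 and assumes, as the answer does, that av is constant within the group.

```r
av <- rep(18, 4)      # av after filling the NAs with the first value
first_value <- 121

# Loop version, as in the question: each value is the previous value plus av
loop <- numeric(length(av))
loop[1] <- first_value
for (i in 2:length(av)) loop[i] <- loop[i - 1] + av[i]

# Vectorised version: cumsum(av) counts av once too often, so subtract it again
closed <- first_value + cumsum(av) - av
closed
# [1] 121 139 157 175

stopifnot(isTRUE(all.equal(loop, closed)))
```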
Note
For reproducibility first run this (taken from question except we only load needed packages):
library(dplyr)
library(tibble)
library(lubridate)
df <- tibble(
Id = c(1, 1, 1, 1, 2, 2, 2, 2),
Time = ymd(c("2012-09-01", "2012-09-02", "2012-09-03", "2012-09-04",
"2012-09-01", "2012-09-02", "2012-09-03", "2012-09-04")),
av = c(18, NA, NA, NA, 21, NA, NA, NA),
Value = c(121, NA,NA, NA, 146, NA, NA, NA)
)
df1 <- df %>% arrange(Id, Time) %>% group_by(Id) %>%
mutate(av = zoo::na.locf(av))
df1$Value_backup <- df1$Value
for(i in 2:nrow(df1))
{
df1$Value[i] <- ifelse(is.na(df1$Value[i]), df1$av[i] + df1$Value[i-1], df1$Value[i])
}
My zoo (time series) data set looks like below and goes on for hundreds of rows:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
NA NA NA NA 1 1 1 NA NA NA 3 3 3 NA NA 1 1
cycle4I <- zoo(c(NA, NA, NA, NA, 1, 1, 1, NA, NA, NA, 3, 3, 3, NA, NA, 1, 1))
This variable is part of a larger zoo data set. The general pattern of this variable is a series of 1's, then NAs, then 3's, then NAs, and repeat the pattern again starting with a series of 1's. There is no regular pattern of the number of NAs.
I am trying to (i) fill the NAs between the 1's and 3's with 2, (ii) fill the NAs between the 3's and subsequent 1's with 4, and (iii) fill the NAs in the first four observations with 4 following the general pattern. When done, the values will be a series of 1, 2, 3, and 4 without a pattern of the quantity for each of the four values.
I have spent hours trying ifelse and for loops without success. (Relatively newbie with this part of R.)
I previously did this task in Stata but can't figure out the R code to fill the NAs. The Stata code to fill the NAs is:
replace cycle4I = 2 if missing(cycle4I) & (cycle4I[_n-1] == 1 | cycle4I[_n-1] == 2) & (cycle4I[_n+1] == . | cycle4I[_n+1] == 3)
replace cycle4I = 4 if missing(cycle4I) & (cycle4I[_n-1] == 3 | cycle4I[_n-1] == 4) & (cycle4I[_n+1] == . | cycle4I[_n+1] == 1)
Here is a straightforward way:
library(zoo)
cycle4I <- zoo(c(NA, NA, NA, NA, 1, 1, 1, NA, NA, NA, 3, 3, 3, NA, NA, 1, 1))
x <- cycle4I
x[1] <- 3
x <- is.na(x) + na.locf(x)
x[1] <- 4
Which gives:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
4 4 4 4 1 1 1 2 2 2 3 3 3 4 4 1 1
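The trick is that is.na(x) contributes 1 exactly at the positions that were missing, so each gap becomes "last observed value plus one". The sketch below replays the steps on a plain numeric vector and, to stay dependency-free, swaps zoo's na.locf for a base R carry-forward built with cummax; that substitution is my own and not part of the answer.

```r
v <- c(NA, NA, NA, NA, 1, 1, 1, NA, NA, NA, 3, 3, 3, NA, NA, 1, 1)

x <- v
x[1] <- 3                      # seed so the leading NAs become 3 + 1 = 4

# Base R stand-in for na.locf: index of the most recent non-NA position
last_seen <- cummax(ifelse(is.na(x), 0L, seq_along(x)))
filled <- x[last_seen]         # last observation carried forward

out <- filled + is.na(x)       # add 1 only where the value was missing
out[1] <- 4                    # the seed position itself belongs to the leading 4s
out
# [1] 4 4 4 4 1 1 1 2 2 2 3 3 3 4 4 1 1
```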
Here is one way
library(dplyr)
library(zoo)
tibble(cycle4I = c(NA, NA, NA, NA, 1, 1, 1, NA, NA, NA, 3, 3, 3, NA, NA, 1, 1)) %>%
mutate(final =
cycle4I %>%
lag %>%
na.locf(na.rm = FALSE) %>%
`+`(1) %>%
ifelse(is.na(cycle4I),
., cycle4I) )
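Note that the lag-based approach has no previous value for the leading NAs, so they stay NA and need a separate step. Below is a base R sketch of the same idea (lag, carry forward, add one) with that extra step added; the carry-forward via cummax stands in for na.locf, and filling the leading NAs with 4 follows the rule stated in the question.

```r
v <- c(NA, NA, NA, NA, 1, 1, 1, NA, NA, NA, 3, 3, 3, NA, NA, 1, 1)

prev <- c(NA, head(v, -1))                        # lagged values
# Carry the last non-NA lagged value forward; index 0 means "nothing seen yet"
idx <- cummax(ifelse(is.na(prev), 0L, seq_along(prev)))
prev_filled <- c(NA, prev)[idx + 1]               # shift by one so index 0 maps to NA

final <- ifelse(is.na(v), prev_filled + 1, v)     # gaps become previous value + 1
final
# [1] NA NA NA NA  1  1  1  2  2  2  3  3  3  4  4  1  1

final[is.na(final)] <- 4                          # leading NAs, per the question's rule
final
# [1] 4 4 4 4 1 1 1 2 2 2 3 3 3 4 4 1 1
```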