I've looked up some web pages (but their results don't meet my needs):
NA replacing with blanks
Replacing "NA" (NA string) with NA inplace data.table
replace <NA> with NA.
I want to write a function that could do this:
Say there is a vector a.
a = c(100000, 137862, NA, NA, NA, 178337, NA, NA, NA, NA, NA, 295530)
First, find the non-NA values before and after each run of single or consecutive NAs.
In this case these are 137862, NA, NA, NA, 178337 and 178337, NA, NA, NA, NA, NA, 295530.
Second, calculate the slope in each part and replace the NAs.
# 137862, NA, NA, NA, 178337
slope_1 = (178337 - 137862)/4
137862 + slope_1*1 # 1st NA replace with 147980.8
137862 + slope_1*2 # 2nd NA replace with 158099.5
137862 + slope_1*3 # 3rd NA replace with 168218.2
# 178337, NA, NA, NA, NA, NA, 295530
slope_2 = (295530 - 178337)/6
178337 + slope_2*1 # 4th NA replace with 197869.2
178337 + slope_2*2 # 5th NA replace with 217401.3
178337 + slope_2*3 # 6th NA replace with 236933.5
178337 + slope_2*4 # 7th NA replace with 256465.7
178337 + slope_2*5 # 8th NA replace with 275997.8
Finally, the expected vector should be this:
a_without_NA = c(100000, 137862, 147980.8, 158099.5, 168218.2, 178337, 197869.2, 217401.3,
236933.5, 256465.7, 275997.8, 295530)
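A quick sanity check of the worked arithmetic above in plain R (nothing here beyond the numbers already computed by hand):

```r
slope_1 <- (178337 - 137862) / 4
slope_2 <- (295530 - 178337) / 6
137862 + slope_1 * 1:3
# [1] 147980.8 158099.5 168218.2
178337 + slope_2 * 1:5
# [1] 197869.2 217401.3 236933.5 256465.7 275997.8
```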
If single or consecutive NAs are at the beginning (or end), they should be kept.
# NA at beginning
b = c(NA, NA, 1, 3, NA, 5, 7)
# 3, NA, 5
slope_1 = (5-3)/2
3 + slope_1*1 # 3rd NA replace with 4
b_without_NA = c(NA, NA, 1, 3, 4, 5, 7)
# NA at ending
c = c(1, 3, NA, 5, 7, NA, NA)
# 3, NA, 5
slope_1 = (5-3)/2
3 + slope_1*1 # 1st NA replace with 4
c_without_NA = c(1, 3, 4, 5, 7, NA, NA)
Note: in my real situation, every element of the vector is increasing (vector[n + 1] > vector[n]).
I know the principle, but I don't know how to write a self-defined function to implement it.
Any help will be highly appreciated!
zoo's na.approx can help:
a = c(100000, 137862, NA, NA, NA, 178337, NA, NA, NA, NA, NA, 295530)
zoo::na.approx(a, na.rm = FALSE)
# [1] 100000.0 137862.0 147980.8 158099.5 168218.2 178337.0 197869.2 217401.3
# [9] 236933.5 256465.7 275997.8 295530.0
b = c(NA, NA, 1, 3, NA, 5, 7)
zoo::na.approx(b, na.rm = FALSE)
#[1] NA NA 1 3 4 5 7
c = c(1, 3, NA, 5, 7, NA, NA)
zoo::na.approx(c, na.rm = FALSE)
#[1] 1 3 4 5 7 NA NA
Here is a base R option using approx
> approx(seq_along(a)[!is.na(a)], a[!is.na(a)], seq_along(a))$y
[1] 100000.0 137862.0 147980.8 158099.5 168218.2 178337.0 197869.2 217401.3
[9] 236933.5 256465.7 275997.8 295530.0
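Note that approx with the default rule = 1 returns NA outside the range of the known points, so leading and trailing NAs are kept automatically, which matches the requirement in the question. A quick check on the b and c vectors from the question (wrapped in a small helper for readability; the name na_interp is my own):

```r
# Linearly interpolate interior NAs; NAs outside the known range stay NA (rule = 1)
na_interp <- function(v) approx(seq_along(v)[!is.na(v)], v[!is.na(v)], seq_along(v))$y

b <- c(NA, NA, 1, 3, NA, 5, 7)
na_interp(b)
# [1] NA NA  1  3  4  5  7
c_vec <- c(1, 3, NA, 5, 7, NA, NA)  # avoiding 'c' as a variable name
na_interp(c_vec)
# [1]  1  3  4  5  7 NA NA
```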
Here is one approach with data.table. Get the run-length id (rleid) of the runs of NA in 'a' ('grp'), create two temporary columns 'a1' and 'a2' as the lag and lead of 'a', then, grouped by 'grp', compute 'tmp' from the slope calculation, and finally fcoalesce the original 'a' with 'tmp'.
library(data.table)
data.table(a)[, grp := rleid(is.na(a))][,
    c('a1', 'a2') := .(shift(a), shift(a, type = 'lead'))][,
    tmp := first(a1) + seq_len(.N) * ((last(a2) - first(a1)) / (.N + 1)),
    .(grp)][, fcoalesce(a, tmp)]
#[1] 100000.0 137862.0 147980.8 158099.5 168218.2 178337.0
#[7] 197869.2 217401.3 236933.5 256465.7 275997.8 295530.0
For this purpose I defined a custom function:
my_replace_na <- function(x) {
  non <- which(!is.na(x)) # indices of the non-NA values
  for (i in 1:(length(non) - 1)) {
    if (non[i + 1] - non[i] > 1) { # a run of NAs sits between these two values
      lo <- non[i]
      hi <- non[i + 1]
      for (j in 1:(hi - lo - 1)) {
        x[lo + j] <- x[lo] + ((x[hi] - x[lo]) / (hi - lo)) * j
      }
    }
  }
  x
}
a <- c(100000, 137862, NA, NA, NA, 178337, NA, NA, NA, NA, NA, 295530)
my_replace_na(a)
[1] 100000.0 137862.0 147980.8 158099.5 168218.2 178337.0 197869.2 217401.3 236933.5 256465.7
[11] 275997.8 295530.0
# NA at beginning
d <- c(NA, NA, 1, 3, NA, 5, 7)
my_replace_na(d)
[1] NA NA 1 3 4 5 7
# NA at ending
e <- c(1, 3, NA, 5, 7, NA, NA)
my_replace_na(e)
[1] 1 3 4 5 7 NA NA
I have the following data frame:
data <- structure(list(Date = structure(c(-17897, -17896, -17895, -17894,
-17893, -17892, -17891, -17890, -17889, -17888, -17887, -17887,
-17886, -17885, -17884, -17883, -17882, -17881, -17880, -17879,
-17878, -17877, -17876, -17875, -17874, -17873, -17872, -17871,
-17870, -17869, -17868, -17867, -17866, -17865, -17864), class = "Date"),
duration = c(NA, NA, NA, 5, NA, NA, NA, 5, NA, NA, 1, 1,
NA, NA, 3, NA, 3, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, 4, NA, NA, 4, NA, NA), name = c(NA, NA, NA, "Date_beg",
NA, NA, NA, "Date_end", NA, NA, "Date_beg", "Date_end", NA,
NA, "Date_beg", NA, "Date_end", NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, "Date_beg", NA, NA, "Date_end", NA, NA
)), row.names = c(NA, -35L), class = c("tbl_df", "tbl", "data.frame"
))
And looks like:
Date duration name
<date> <dbl> <chr>
1 1921-01-01 NA NA
2 1921-01-02 NA NA
3 1921-01-03 NA NA
4 1921-01-04 5 Date_beg
5 1921-01-05 NA NA
6 1921-01-06 NA NA
7 1921-01-07 NA NA
8 1921-01-08 5 Date_end
9 1921-01-09 NA NA
10 1921-01-10 NA NA
...
I want to replace the NA values in column name that are between rows with Date_beg and Date_end with the word "event".
I have tried this:
data %<>% mutate(name = ifelse(((lag(name) == 'Date_beg') | (lag(name) == 'event')) & is.na(name), 'event', name))
But only the first row after Date_beg changes. It is quite easy with a for-loop, but I wanted to use a more R-like method.
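For comparison, the for-loop route the question mentions could look like this (a sketch on a toy name vector, not the asker's actual code):

```r
name <- c(NA, "Date_beg", NA, NA, "Date_end", NA)  # toy stand-in for data$name

inside <- FALSE  # TRUE while we are between a Date_beg and its Date_end
for (i in seq_along(name)) {
  if (!is.na(name[i]) && name[i] == "Date_beg") {
    inside <- TRUE
  } else if (!is.na(name[i]) && name[i] == "Date_end") {
    inside <- FALSE
  } else if (inside) {
    name[i] <- "event"
  }
}
name
# NA "Date_beg" "event" "event" "Date_end" NA
```

It works, but as noted it is not very R-like; the vectorised answers below the question avoid the explicit loop.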
There is probably a better way using data.table::nafill, but as you're using tidyverse functions, I would do it by creating an extra event column using tidyr::fill and then pulling it through to the name column where name is NA:
library(tidyr)
data %>%
  mutate(
    events = ifelse(
      fill(data, name)$name == "Date_beg",
      "event",
      NA),
    name = coalesce(name, events)
  ) %>%
  select(-events)
You can do it by looking at the indices where there have been more "Date_beg" than "Date_end" entries with:
data$name[lag(cumsum(data$name == "Date_beg" & !is.na(data$name))) -
cumsum(data$name == "Date_end" & !is.na(data$name)) >0] <- "event"
print(data, n=20)
# # A tibble: 35 x 3
# Date duration name
# <date> <dbl> <chr>
# 1 1921-01-01 NA NA
# 2 1921-01-02 NA NA
# 3 1921-01-03 NA NA
# 4 1921-01-04 5 Date_beg
# 5 1921-01-05 NA event
# 6 1921-01-06 NA event
# 7 1921-01-07 NA event
# 8 1921-01-08 5 Date_end
# 9 1921-01-09 NA NA
# 10 1921-01-10 NA NA
# 11 1921-01-11 1 Date_beg
# 12 1921-01-11 1 Date_end
# 13 1921-01-12 NA NA
# 14 1921-01-13 NA NA
# 15 1921-01-14 3 Date_beg
# 16 1921-01-15 NA event
# 17 1921-01-16 3 Date_end
# 18 1921-01-17 NA NA
# 19 1921-01-18 NA NA
# 20 1921-01-19 NA NA
# # ... with 15 more rows
Lagging the first index by one is required so that you don't overwrite the "Date_beg" at the start of each run.
Another dplyr approach using the cumsum function.
If the row in the name column is NA, it adds 0 to the cumulative sum, otherwise 1. Therefore the values from Date_beg down to just before Date_end will always be odd numbers (0 + 1), and the values from Date_end onward will always be even numbers (0 + 1 + 1). Then replace values that are odd in the ref column AND NA in the name column with "event".
library(dplyr)
data %>%
mutate(ref = cumsum(ifelse(is.na(name), 0, 1)),
name = ifelse(ref %% 2 == 1 & is.na(name), "event", name)) %>%
select(-ref)
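The parity trick is easy to see on a toy vector: the running count is odd between a Date_beg and its Date_end, and even elsewhere (simplified data, base R only):

```r
name <- c(NA, "Date_beg", NA, NA, "Date_end", NA)  # toy stand-in for the column
ref <- cumsum(!is.na(name))
ref
# [1] 0 1 1 1 2 2
ifelse(ref %% 2 == 1 & is.na(name), "event", name)
# NA "Date_beg" "event" "event" "Date_end" NA
```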
This question already has answers here:
R: coalescing a large data frame
(2 answers)
How to implement coalesce efficiently in R
(9 answers)
Closed 2 years ago.
I have a df that looks something like this:
id <- c(1:8)
born.swis <- c(0, 1, NA, NA, NA, 2, NA, NA)
born2005 <- c(NA, NA, 2, NA, NA, NA, NA, NA)
born2006 <- c(NA, NA, NA, 1, NA, NA, NA, NA)
born2007 <- c(NA, NA, NA, NA, NA, NA, NA, 1)
born2008 <- c(NA, NA, NA, NA, NA, NA, 2, NA)
born2009 <- c(NA, NA, NA, NA, NA, NA, NA, NA)
df <- data.frame(id, born.swis, born2005, born2006, born2007, born2008, born2009)
I'm trying to mutate born.swis based on the values of the other variables. Basically, I want born.swis to be filled with the value of one of the other variables IF born.swis is NA and IF that other variable is not NA. Something like this:
id <- c(1:8)
born.swis <- c(0, 1, 2, 1, NA, 2, 2,1)
df.desired <- data.frame(id, born.swis)
I tried several things with mutate and ifelse, like this:
df <- df%>%
mutate(born.swis = ifelse(is.na(born.swis), born2005, NA,
ifelse(is.na(born.swis), born2006, NA,
ifelse(is.na(born.swis), born2007, NA,
ifelse(is.na(born.swis), born2008, NA,
ifelse(is.na(born.swis), born2009, NA,)
)))))
and similar things, but I'm not able to reach my desired outcome.
Any ideas?
Many thanks!
One dplyr option could be:
df %>%
mutate(born.swis_res = coalesce(!!!select(., starts_with("born"))))
id born.swis born2005 born2006 born2007 born2008 born2009 born.swis_res
1 1 0 NA NA NA NA NA 0
2 2 1 NA NA NA NA NA 1
3 3 NA 2 NA NA NA NA 2
4 4 NA NA 1 NA NA NA 1
5 5 NA NA NA NA NA NA NA
6 6 2 NA NA NA NA NA 2
7 7 NA NA NA NA 2 NA 2
8 8 NA NA NA 1 NA NA 1
Or with dplyr 1.0.0:
df %>%
mutate(born.swis_res = Reduce(coalesce, across(starts_with("born"))))
In base R, you can use max.col:
df[cbind(1:nrow(df), max.col(!is.na(df[-1])) + 1 )]
#[1] 0 1 2 1 NA 2 2 1
max.col gives the column position of the first non-NA value in each row (excluding the first column); since each row of df[-1] here has at most one non-NA value, the default ties.method is safe (in general you would pass ties.method = "first"). We create a matrix of row and column indices and use it to subset df.
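The cbind(row, col) subsetting may be unfamiliar: indexing a data frame (or matrix) with a two-column matrix of (row, column) positions returns the elements at those positions as a plain vector. A minimal illustration (toy data):

```r
m <- data.frame(x = c(10, 20), y = c(30, 40))
# pick the element at (row 1, col 2) and the one at (row 2, col 1)
m[cbind(1:2, c(2, 1))]
# [1] 30 20
```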
base R
df$born.swis <- apply(df[-1], 1, function(x) ifelse(all(is.na(x)), NA, sum(x, na.rm = T)))
I have a list of tibbles like the following:
list(A = structure(list(
ID = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
g1 = c(0, 1, 2, NA, NA, NA, NA, NA, NA),
g2 = c(NA, NA, NA, 3, 4, 5, NA, NA, NA),
g3 = c(NA, NA, NA, NA, NA, NA, 6, 7, 8)),
row.names = c(NA, -9L),
class = c("tbl_df", "tbl", "data.frame")),
B = structure(list(ID = c(1, 2, 1, 2, 1, 2),
g1 = c(10, 11, NA, NA, NA, NA),
g2 = c(NA, NA, 12,13, NA, NA),
g3 = c(NA, NA, NA, NA, 14, 15)),
row.names = c(NA, -6L),
class = c("tbl_df", "tbl", "data.frame"))
)
Each element looks like this:
ID g1 g2 g3
<dbl> <dbl> <dbl> <dbl>
1 0 NA NA
2 1 NA NA
3 2 NA NA
1 NA 3 NA
2 NA 4 NA
3 NA 5 NA
1 NA NA 6
2 NA NA 7
3 NA NA 8
The g* columns are created dynamically, during previous mutates, and their number can vary, but it will be the same across all list elements.
Every g* column has only certain non-NA elements (as many as the unique IDs).
I would like to shift the g* columns so that the non-NA elements move to the top rows.
I can do it for a single column by
num.shifts <- rle(is.na(myList[[1]]$g1))$lengths[1]
shift(myList[[1]]$g2, -num.shifts)
but how can I do it for all the g* columns, for all list elements, when I don't know in advance the number of g* columns?
Ideally, I would like a tidyverse solution, but not a requirement...
Thanks!
We can loop over the list with map, and use mutate_at to go over the columns that match 'g' followed by digits, reordering each so the non-NA elements come first
library(dplyr)
library(purrr)
map(lst1, ~
.x %>%
mutate_at(vars(matches('^g\\d+')), ~ .[order(is.na(.))]))
In base R, we can do
lapply(lst1, function(x) {
  i1 <- grepl("^g\\d+$", names(x))
  x[i1] <- lapply(x[i1], function(y) y[order(is.na(y))])
  x
})
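The core trick in both answers is y[order(is.na(y))]: order() is stable, so the non-NA values keep their relative order and move to the front, with the NAs pushed to the end. On a single vector:

```r
y <- c(NA, NA, 3, NA, 5)
# order(is.na(y)) puts the FALSE (non-NA) positions first, in original order
y[order(is.na(y))]
# [1]  3  5 NA NA NA
```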
I ran an experiment in which people move cubes around until they have made a figure they like. When they like a figure, they save it and create a new one. The script tracked the time and the number of moves between all figure saves.
I now have a column (A) with the number of moves between each save and a column (B) with the time between each move until the figure is saved. Thus column A is filled with NAs and then a number (signifying a figure save), and column B has a time in seconds in every row (except the first row), signifying all the moves made.
Excerpt of data:
A B C
NA 1.6667798
NA 3.3326443
NA 3.5506110
NA 11.4995562
NA 1.4334849
NA 4.9502637
NA 2.1161980
NA 4.7833326
NA 2.8500842
NA 4.0331373
NA 4.3498785
12 5.0910905 Sum
NA 4.2424078
NA 1.7332665
NA 1.5341006
3 4.8923275 Sum
NA 4.1064621
NA 3.3498289
NA 1.6002373
3 6.0122170 Sum
I have tried several loop options, but I cannot seem to make it work properly.
I made this loop, but it is not doing the correct calculation in column C.
data$C <- rep(NA, nrow(data))
for (i in unique(data$id)) {
C <- which(data$id == i & data$type == "moveblock")
for (e in 1:length(C)){
if (e == 1){
data$C[C[e]] = C[e] - which(data$id == i)[1]
}
else if (e > 1){
data$C[C[e]] = C[e] + C[e+1]+1}
}
d_times <- which(data$id == i)
for (t in 2:length(d_times)){
data$B[d_times[t]] <- data$time[d_times[t]] - data$time[d_times[t-1]]
}
}
I want a new column (C) which has the sum of all rows from column B until a figure has been saved = a number in column A. In other words, I want to calculate the total time it took the subject to make all the moves before saving the figure.
Hope anyone can figure this out!
We can create groups based on the occurrence of non-NA values and take the sum
library(dplyr)
df %>%
group_by(group = lag(cumsum(!is.na(A)), default = 0)) %>%
summarise(sum = sum(B, na.rm = TRUE))
# group sum
# <dbl> <dbl>
#1 0 49.7
#2 1 12.4
#3 2 15.1
In base R, we can use aggregate to do the same
aggregate(B~c(0, head(cumsum(!is.na(A)), -1)), df, sum, na.rm = TRUE)
data
df <- structure(list(A = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, 12L, NA, NA, NA, 3L, NA, NA, NA, 3L), B = c(1.6667798, 3.3326443,
3.550611, 11.4995562, 1.4334849, 4.9502637, 2.116198, 4.7833326,
2.8500842, 4.0331373, 4.3498785, 5.0910905, 4.2424078, 1.7332665,
1.5341006, 4.8923275, 4.1064621, 3.3498289, 1.6002373, 6.012217
)), class = "data.frame", row.names = c(NA, -20L))
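The lagged-cumsum grouping used in both the dplyr and the aggregate version is easier to see on a toy vector (A here is a shortened stand-in for the real column):

```r
A <- c(NA, NA, 12, NA, 3)          # toy version of column A
cumsum(!is.na(A))                  # counts the saves so far
# [1] 0 0 1 1 2
c(0, head(cumsum(!is.na(A)), -1))  # lagged by one position
# [1] 0 0 0 1 1
```

Shifting the count right by one makes each save row close its own group rather than open the next one, so its B value is included in the sum for that figure.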
You could build the periods (i.e. the row sequences between saves) and sum the values of column B accordingly. For this, create a vector saved that indicates where a subject has saved, build the list of sequences with apply(), and finally let sapply() loop over the sequences in the periods list.
saved <- which(!is.na(dat$A))
periods <- apply(cbind(c(1, head(saved + 1, -1)), saved), 1, function(x) seq(x[1], x[2]))
dat$C[saved] <- sapply(periods, function(x) sum(dat$B[x]))
Result
dat$C
# [1] NA NA NA NA NA NA NA NA NA
# [10] NA NA 49.65706 NA NA NA 12.40210 NA NA
# [19] NA 15.06875
Data
dat <- structure(list(A = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, 12L, NA, NA, NA, 3L, NA, NA, NA, 3L), B = c(1.6667798, 3.3326443,
3.550611, 11.4995562, 1.4334849, 4.9502637, 2.116198, 4.7833326,
2.8500842, 4.0331373, 4.3498785, 5.0910905, 4.2424078, 1.7332665,
1.5341006, 4.8923275, 4.1064621, 3.3498289, 1.6002373, 6.012217
), C = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA)), row.names = c(NA, -20L), class = "data.frame")
I have two data frames. They are
x <- data.frame(sulfur = c(NA, 5, 7, NA, NA), nitrate = c(NA, NA, NA, 3, 7))
y <- data.frame(sulfur = c(NA, 3, 7, 9, NA), nitrate = c(NA, NA, NA, 6, 7))
I want a new data frame which should be like
z <- data.frame(sulfur = c(NA, 5, 7, NA, NA, NA, 3, 7, 9, NA), nitrate = c(NA, NA, NA, 3, 7, NA, NA, NA, 6, 7))
I am trying to stack the two data frames' columns into a single data frame. How do I do it?
Try this:
df <- data.frame(Sulfur = c(NA, 5, 7, NA, NA), Nitrate = c(NA, NA, NA, 3, 7))
df2 <- data.frame(Sulfur = c(NA, 3, 7, 9, NA), Nitrate = c(NA, NA, NA, 6, 7))
df3 <- rbind(df, df2)
> df3
Sulfur Nitrate
1 NA NA
2 5 NA
3 7 NA
4 NA 3
5 NA 7
6 NA NA
7 3 NA
8 7 NA
9 9 6
10 NA 7
> class(lst)
[1] "list"
> dplyr::rbind_all(lst)  # deprecated in current dplyr; use bind_rows() instead
> do.call(rbind, lst)
You can use either of the above since they apply to a list of multiple (more than two) data frames.
Other options include, placing the datasets in a list and then use rbindlist from data.table
library(data.table)
rbindlist(list(x,y))
or we can use bind_rows from dplyr.
library(dplyr)
bind_rows(x,y)
NOTE: The above two functions can be applied to more than 2 datasets.