I am working with a data frame in R that has missing values across rows. The data frame is shown below (dput output is at the end):
df
id V1 V2 V3 V4
1 01 1 1 1 NA
2 02 2 1 NA NA
3 03 3 1 NA NA
4 04 4 1 2 NA
Each row is a different id. As you can see, the rows have missing values. I would like to know how I can get a completed data frame like the one below without reshaping to long or pivoting, as my real data is very large:
df
id V1 V2 V3 V4
1 01 1 1 1 1
2 02 2 1 1 1
3 03 3 1 1 1
4 04 4 1 2 2
I was trying to use fill from tidyr but at row level I am having issues. I have seen some posts where it is used along with dplyr function across but I can not find it. I have tried using group_by(id) and rowwise but I have not had success. Also only the variables/columns starting with V should be filled with previous values.
Data is next:
#Data
df <- structure(list(id = c("01", "02", "03", "04"), V1 = c(1, 2, 3,
4), V2 = c(1, 1, 1, 1), V3 = c(1, NA, NA, 2), V4 = c(NA, NA,
NA, NA)), class = "data.frame", row.names = c(NA, -4L))
Many thanks for your time.
One solution is to use the na.locf function from the zoo package together with purrr::pmap in a row-wise operation. na.locf takes the most recent non-NA value and replaces all subsequent NA values with it. As a reminder, c(...) in both solutions collects the values of each row on every iteration. I excluded the id column with [-1] because it is not involved in the calculation.
library(dplyr)
library(zoo)
library(purrr)
df %>%
mutate(pmap_df(., ~ na.locf(c(...)[-1])))
id V1 V2 V3 V4
1 01 1 1 1 1
2 02 2 1 1 1
3 03 3 1 1 1
4 04 4 1 2 2
Or we can use the coalesce function from dplyr. We replace every NA value in each row with the last non-NA value, as we did earlier with na.locf. However, this solution is a bit verbose:
df %>%
mutate(pmap_df(., ~ {x <- c(...)[!is.na(c(...))];
coalesce(c(...), x[length(x)])}))
id V1 V2 V3 V4
1 01 1 1 1 1
2 02 2 1 1 1
3 03 3 1 1 1
4 04 4 1 2 2
Or you could also use this:
library(purrr)
df %>%
mutate(across(!id, ~ replace(., is.na(.), invoke(coalesce, rev(df[-1])))))
id V1 V2 V3 V4
1 01 1 1 1 1
2 02 2 1 1 1
3 03 3 1 1 1
4 04 4 1 2 2
The warning message can be ignored. It is in fact produced because we have 6 NA values but the result of applying dplyr::coalesce on every vector is 1 element resulting in 4 elements to replace 6 slots.
One option using dplyr could be:
df %>%
mutate(across(-id, ~ ifelse(is.na(.), coalesce(!!!select(., V4:V1)), .)))
id V1 V2 V3 V4
1 01 1 1 1 1
2 02 2 1 1 1
3 03 3 1 1 1
4 04 4 1 2 2
A dplyr approach
df <- structure(list(id = c("01", "02", "03", "04"), V1 = c(1, 2, 3,
4), V2 = c(1, 1, 1, 1), V3 = c(1, NA, NA, 2), V4 = c(NA, NA,
NA, NA)), class = "data.frame", row.names = c(NA, -4L))
library(dplyr, warn.conflicts = F)
df %>% mutate(across(V1:V4, ~ coalesce(., tail(cur_data()[-1][!is.na(cur_data()[-1])],1))))
#> id V1 V2 V3 V4
#> 1 01 1 1 1 2
#> 2 02 2 1 2 2
#> 3 03 3 1 2 2
#> 4 04 4 1 2 2
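If you group_by the id column, you won't have to use [-1] on cur_data(), because the grouping column is excluded from it. Each id then forms its own one-row group, so the fill value is the last non-NA value of that row rather than of the whole data frame (which is why the ungrouped output above is filled with 2 everywhere).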
df %>% group_by(id) %>%
mutate(across(V1:V4, ~ coalesce(., tail(cur_data()[!is.na(cur_data())],1))))
A data.table option with nafill
> setDT(df)[, setNames(as.list(nafill(unlist(.SD), type = "locf")), names(.SD)), id]
id V1 V2 V3 V4
1: 01 1 1 1 1
2: 02 2 1 1 1
3: 03 3 1 1 1
4: 04 4 1 2 2
If the reason you want to avoid reshaping is to save runtime, the benchmark below suggests that idea is mistaken, provided the results continue to hold at scale. Note that f, which transposes, applies na.locf and then transposes back, is the fastest.
library(microbenchmark)
library(data.table)
library(dplyr)
library(purrr)
library(zoo)
microbenchmark(times = 10,
a = df %>% mutate(pmap_df(., ~ na.locf(c(...)[-1]))),
b = df %>%
mutate(pmap_df(., ~ {x <- c(...)[!is.na(c(...))];
coalesce(c(...), x[length(x)])})),
c = df %>%
mutate(across(-id, ~ ifelse(is.na(.), coalesce(!!!select(., V4:V1)), .))),
d = df %>% mutate(across(V1:V4, ~ coalesce(., tail(cur_data()[-1][!is.na(cur_data()[-1])],1)))),
e = as.data.table(df)[, setNames(as.list(nafill(unlist(.SD), type = "locf")), names(.SD)), id],
f = data.frame(id = df$id, t(na.locf(t(df[-1])))))
giving:
Unit: milliseconds
expr min lq mean median uq max neval
a 11.343302 12.934702 15.032001 13.115151 14.799400 30.135901 10
b 11.641301 13.116401 14.030551 14.426751 15.012701 15.517501 10
c 28.201501 30.470801 33.375761 32.672950 36.671101 40.448701 10
d 25.394901 26.648801 30.044331 27.971251 32.433801 39.570600 10
e 3.750801 4.023700 8.771401 4.150701 4.367502 50.636700 10
f 2.454701 2.458201 3.009181 2.603951 2.952302 6.126101 10
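For comparison, the same row-wise fill can also be sketched in base R without any packages. This is only an illustration under assumptions: locf_rows is a hypothetical helper name, it is not one of the benchmarked expressions above, and no timing is claimed for it. It records, for each cell, the column index of the most recent non-NA value in that row and then looks the values up by matrix indexing.

```r
df <- data.frame(id = c("01", "02", "03", "04"),
                 V1 = c(1, 2, 3, 4), V2 = c(1, 1, 1, 1),
                 V3 = c(1, NA, NA, 2), V4 = NA_real_)

# Hypothetical helper: carry the last non-NA value forward within each row.
locf_rows <- function(m) {
  m <- as.matrix(m)
  # For every cell: column index of the latest non-NA cell at or left of it
  # (0 when the row has no non-NA value yet).
  idx <- t(apply(!is.na(m), 1, function(ok) cummax(ok * seq_along(ok))))
  idx[idx == 0] <- NA  # leading NAs have nothing to carry forward
  filled <- m[cbind(c(row(m)), c(idx))]
  matrix(filled, nrow = nrow(m), dimnames = dimnames(m))
}

# Only the V columns are filled; id is left untouched.
df[-1] <- as.data.frame(locf_rows(df[-1]))
df
```

Because the work is done on a numeric matrix, this sketch assumes all filled columns share one type, as the V columns do here.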
I want to delete the rows for any id that doesn't appear in EVERY month (the month is stored in another column):
The dataset is as follows:
id month
1 01
2 01
3 01
1 02
2 02
1 03
2 03
I want to delete id = 3 from the dataset, since it doesn't appear in month = 02
I'm using R.
Thank you for helping
You can split the dataset and use Reduce, i.e.
remove <- Reduce(setdiff, split(df$id, df$month))
df[!df$id %in% remove,]
id month
1 1 1
2 2 1
4 1 2
5 2 2
6 1 3
7 2 3
As @jay.sf mentioned, you need to assign it back to your dataframe,
df <- df[!df$id %in% remove,]
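One caveat worth checking on real data: chaining setdiff through Reduce subtracts every later month from the first one, so it only flags ids that occur in the first month and in no later month. A minimal sketch of a variant that keeps exactly the ids present in every month uses intersect instead. The data below is rebuilt from the question, and the names keep and df_complete are made up for illustration.

```r
df <- data.frame(id = c(1, 2, 3, 1, 2, 1, 2),
                 month = c("01", "01", "01", "02", "02", "03", "03"))

# ids that occur in every month: intersect the per-month id sets
keep <- Reduce(intersect, split(df$id, df$month))

# keep only rows whose id is present in all months
df_complete <- df[df$id %in% keep, ]
df_complete
```

With this version an id that is missing from any single month is dropped, regardless of which month it is missing from.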
Using dplyr
library(dplyr)
df %>%
group_by(id) %>%
filter(n_distinct(month) == n_distinct(df$month)) %>%
ungroup
-output
# A tibble: 6 × 2
id month
<int> <int>
1 1 1
2 2 1
3 1 2
4 2 2
5 1 3
6 2 3
Or using data.table
library(data.table)
data_hh[, if (uniqueN(month) == uniqueN(data_hh$month)) .SD, by = id]
data
data_hh <- structure(list(id = c(18354L, 18815L, 19014L, 63960L, 72996L,
73930L), month = c(1, 1, 1, 1, 1, 1), value = c(113.33, 251.19,
160.15, 278.8, 254.39, 733.22), x1 = c(96.75, 186.78, 106.02,
195.23, 184.57, 473.92), x2 = c(1799.1, 5399.1, 1799.1, 1349.1,
2924.1, 2024.1), x3 = c(85.37, 74.36, 66.2, 70.02, 72.55, 64.63
), x4 = c(6.29, 4.65, 8.9, 20.66, 8.69, 36.22)), row.names = c(NA,
-6L), class = c("data.table", "data.frame"))
Let's say I have the following data frame:
dt <- data.frame(id= c(1),
parameter= c("a","b","c"),
start_day = c(1,8,4),
end_day = c(16,NA,30))
I need to combine the start_day and end_day columns (let's call the new column day) while preserving all the other columns. I also need to create another column that indicates whether each row shows start_day or end_day. To clarify, I am looking to create the following data frame
I am creating the above data frame using the following code:
dt1 <- subset(dt, select = -c(end_day))
dt1 <- dt1 %>% rename(day = start_day)
dt1$start <- 1
dt2 <- subset(dt, select = -c(start_day))
dt2 <- dt2 %>% rename(day = end_day)
dt2$end <- 1
dt <- bind_rows(dt1, dt2)
dt <- dt[order(dt$id, dt$parameter),]
Although my code works, I am not happy with my solution. I am certain that there is a better and cleaner way to do this. I would appreciate any input on better alternatives for tackling this problem.
(tidyr::pivot_longer(dt, cols = c(start_day, end_day), values_to = "day")
|> dplyr::mutate(start = ifelse(name == "start_day", 1, NA),
end = ifelse(name == "end_day", 1, NA))
)
Result:
# A tibble: 6 × 6
id parameter name day start end
<dbl> <chr> <chr> <dbl> <dbl> <dbl>
1 1 a start_day 1 1 NA
2 1 a end_day 16 NA 1
3 1 b start_day 8 1 NA
4 1 b end_day NA NA 1
5 1 c start_day 4 1 NA
6 1 c end_day 30 NA 1
You could get rid of the name column, but maybe it would be more useful than your new start/end columns?
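To illustrate that last point, here is a minimal base R sketch that keeps a single indicator column instead of separate start/end flags. The column name type and the object name long are assumptions made for this example, not something from the question.

```r
dt <- data.frame(id = 1,
                 parameter = c("a", "b", "c"),
                 start_day = c(1, 8, 4),
                 end_day = c(16, NA, 30))

# Stack the two day columns and record which one each row came from.
long <- data.frame(
  id = rep(dt$id, times = 2),
  parameter = rep(dt$parameter, times = 2),
  type = rep(c("start", "end"), each = nrow(dt)),
  day = c(dt$start_day, dt$end_day)
)

# order() leaves ties in their original order, so "start" precedes "end"
# within each id/parameter pair.
long <- long[order(long$id, long$parameter), ]
long
```

A single type column is usually easier to filter or group on than two mostly-NA flag columns.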
using base R (faster than data.table up to ~300 rows; faster than tidyr up to ~1k rows) :
cbind(dt[1:2], day = c(dt$start_day,dt$end_day)) |>
(\(x) x[order(x$id, x$parameter),])() |>
(`[[<-`)("start", value = c(1, NA)) |>
(`[[<-`)("end", value = c(NA, 1))
id parameter day start end
1 1 a 1 1 NA
4 1 a 16 NA 1
2 1 b 8 1 NA
5 1 b NA NA 1
3 1 c 4 1 NA
6 1 c 30 NA 1
using the data.table package (faster than tidyr up to ~500k rows) :
dt <- as.data.table(dt)
dt[,.(day = c(start_day, end_day),
start = rep(c(1, NA), each = .N),
end = rep(c(NA, 1), each = .N)),
by = .(id, parameter)]
id parameter day start end
1: 1 a 1 1 NA
2: 1 a 16 NA 1
3: 1 b 8 1 NA
4: 1 b NA NA 1
5: 1 c 4 1 NA
6: 1 c 30 NA 1
I have a data frame that contains several scattered NA values. I would like to fill those NAs with the values immediately preceding it in the cell to the left (same row) or the following cell to the right (same row) if a value doesn't exist to the left or is NA. It seems like using zoo::na.locf or tidyr::fill() can help with this but it only seems to work by taking the previous/next value either above or below in the same column.
I currently have this code but it's only filling based on above values in same column:
lapply(df, function(x) zoo::na.locf(zoo::na.locf(x, na.rm = FALSE), fromLast = TRUE))
My dataframe df looks like this:
C1 C2 C3 C4
1 2 1 9 2
2 NA 5 1 1
3 1 NA 3 8
4 3 NA NA 4
structure(list(C1 = c(2, NA, 1, 3), C2 = c(1, 5, NA, NA), C3 = c(9,
1, 3, NA), C4 = c(2, 1, 8, 4)), row.names = c(NA, 4L), class = "data.frame")
After filling the NA values, I would like it to look like this:
C1 C2 C3 C4
1 2 1 9 2
2 5 5 1 1
3 1 1 3 8
4 3 3 3 4
This is indeed not the usual way to store data, but if you just transpose you can use tidyr::fill(). Only downside is that it adds quite a bit of wrapping code.
xx <- structure(list(C1 = c(2, NA, 1, 3), C2 = c(1, 5, NA, NA), C3 = c(9,
1, 3, NA), C4 = c(2, 1, 8, 4)), row.names = c(NA, 4L), class = "data.frame")
library(dplyr)  # %>% and as_tibble()
library(purrr)  # set_names()
xx %>%
t() %>%
as_tibble() %>%
tidyr::fill(everything(), .direction = "downup") %>%
t() %>%
as_tibble() %>%
set_names(names(xx))
# A tibble: 4 x 4
# C1 C2 C3 C4
# <dbl> <dbl> <dbl> <dbl>
#1 2 1 9 2
#2 5 5 1 1
#3 1 1 3 8
#4 3 3 3 4
With apply and na.locf0
library(zoo)
df[] <- t(apply(df, 1, function(x) na.locf0(na.locf0(x), fromLast = TRUE)))
-output
df
# C1 C2 C3 C4
#1 2 1 9 2
#2 5 5 1 1
#3 1 1 3 8
#4 3 3 3 4
na.locf can directly work on dataframes but it works column-wise. If you want to make it run row-wise you can transpose the dataframe. You can also use fromLast = TRUE to fill the data from opposite direction. Finally, we use coalesce to select the first non-NA value from the two vectors.
library(zoo)
df[] <- dplyr::coalesce(c(t(na.locf(t(df), na.rm = FALSE))),
c(t(na.locf(t(df), na.rm = FALSE, fromLast = TRUE))))
df
# C1 C2 C3 C4
#1 2 1 9 2
#2 5 5 1 1
#3 1 1 3 8
#4 3 3 3 4
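If zoo is not available, the same "left first, otherwise right" rule can be sketched in base R. This is an illustrative sketch only: fill_row is a hypothetical helper, and it assumes every column is numeric so that apply() can treat the data frame as a matrix.

```r
df <- structure(list(C1 = c(2, NA, 1, 3), C2 = c(1, 5, NA, NA),
                     C3 = c(9, 1, 3, NA), C4 = c(2, 1, 8, 4)),
                row.names = c(NA, 4L), class = "data.frame")

# Hypothetical helper: fill NAs in one row from the left,
# falling back to the first value on the right for leading NAs.
fill_row <- function(x) {
  ok <- which(!is.na(x))
  if (length(ok) == 0) return(x)  # nothing to fill from
  # position (within ok) of the nearest non-NA at or left of each cell;
  # 0 means there is no value to the left
  left <- findInterval(seq_along(x), ok)
  left[left == 0] <- 1            # leading NAs take the first non-NA value
  x[ok[left]]
}

df[] <- t(apply(df, 1, fill_row))
df
```

Rows that are entirely NA are returned unchanged, since there is no value to copy from either side.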
I have two dataframes:
df1 <- data.frame( v1 = c(1,2,3,4),
v2 = c(2, 10, 5, 11),
v3=c(20, 25, 23, 2))
> df1
v1 v2 v3
1 1 2 20
2 2 10 35
3 3 5 23
4 4 11 2
df2 <- data.frame(v1 = 4, v2 = 10, v3 = 30)
> df2
v1 v2 v3
1 4 10 30
I want to add a new column that would say "Fail" when df1 is larger than df2 and "Pass" when it is smaller so that the intended results would be:
> df3
v1 v2 v3 check
1 1 2 20 Pass
2 2 10 35 Fail
3 3 5 23 Pass
4 4 11 2 Fail
You can make both dataframes the same size and compare them directly:
ifelse(rowSums(df1 >= df2[rep(1,length.out = nrow(df1)), ]) == 0, 'Pass', 'Fail')
#[1] "Pass" "Fail" "Pass" "Fail"
Or using Map :
ifelse(Reduce(`|`, Map(`>=`, df1, df2)), 'Fail', 'Pass')
#Other similar alternatives :
#c('Pass', 'Fail')[Reduce(`|`, Map(`>=`, df1[-1], df2[-1])) + 1]
#c('Fail', 'Pass')[(rowSums(mapply(`>=`, df1, df2)) == 0) + 1]
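The same comparison can also be written with sweep(), which applies the limits column by column without repeating the rows of df2. This is a sketch under the same assumption as the answers above, namely that a value equal to its limit counts as a failure. The names limits and exceeds are made up for the example, and df1 uses the values from the data.frame() call in the question.

```r
df1 <- data.frame(v1 = c(1, 2, 3, 4),
                  v2 = c(2, 10, 5, 11),
                  v3 = c(20, 25, 23, 2))
df2 <- data.frame(v1 = 4, v2 = 10, v3 = 30)

# One limit per column, matched to df1 by column name
limits <- unlist(df2[names(df1)])

# TRUE wherever a value reaches or exceeds its column's limit
exceeds <- sweep(as.matrix(df1), 2, limits, FUN = ">=")

# A row fails as soon as any of its columns hits the limit
df3 <- cbind(df1, check = ifelse(rowSums(exceeds) > 0, "Fail", "Pass"))
df3
```

Matching by names(df1) guards against df2 listing its columns in a different order.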
In tidyverse, we can make use of c_across
library(dplyr) # >= 1.0.0
df1 %>%
rowwise %>%
mutate(check = c('Pass', 'Fail')[1 + any(c_across(everything()) >= df2)])
# A tibble: 4 x 4
# Rowwise:
# v1 v2 v3 check
# <dbl> <dbl> <dbl> <chr>
#1 1 2 20 Pass
#2 2 10 25 Fail
#3 3 5 23 Pass
#4 4 11 2 Fail
Slightly difficult to phrase, as far as I saw none of the similar questions answered my problem.
I have a data.frame such as:
df1 <- data.frame(id = rep(c("a", "b","c"), each = 4),
val = c(NA, NA, NA, NA, 1, 2, 2, 3,NA,2,NA,3))
df1
id val
1 a NA
2 a NA
3 a NA
4 a NA
5 b 1
6 b 2
7 b 2
8 b 3
9 c NA
10 c 2
11 c NA
12 c 3
and I want to get rid of all the NA values (easy enough using e.g. filter() ) but make sure that if this removes all of one id value (in this case it removes every instance of "a") that one extra row is inserted of (e.g.) a = 0
so that:
id val
1 a 0
2 b 1
3 b 2
4 b 2
5 b 3
6 c 2
7 c 3
obviously easy enough to do this in a roundabout way but I was wondering if there's a tidy/elegant way to do this. I thought tidyr::complete() might help but not entirely sure how to apply it to a case like this
I don't care about the order of the rows
Cheers!
edit: updated with clearer desired output. might make desired answers submitted before that a bit less clear
Another idea using dplyr,
library(dplyr)
df1 %>%
group_by(id) %>%
mutate(val = ifelse(row_number() == 1 & all(is.na(val)), 0, val)) %>%
na.omit()
which gives,
# A tibble: 5 x 2
# Groups: id [2]
id val
<fct> <dbl>
1 a 0
2 b 1
3 b 2
4 b 2
5 b 3
We may do
df1 %>% group_by(id) %>% do(if(all(is.na(.$val))) replace(.[1, ], 2, 0) else na.omit(.))
# A tibble: 5 x 2
# Groups: id [2]
# id val
# <fct> <dbl>
# 1 a 0
# 2 b 1
# 3 b 2
# 4 b 2
# 5 b 3
After grouping by id, if everything in val is NA, then we leave only the first row with the second element replaced by 0, otherwise the same data is returned after applying na.omit.
In a more readable format that would be
df1 %>% group_by(id) %>%
do(if(all(is.na(.$val))) data.frame(id = .$id[1], val = 0) else na.omit(.))
(Here I presume that you indeed want to get rid of all NA values; otherwise there is no need for na.omit.)
df1[is.na(df1)] <- 0
df1[!(duplicated(df1$id) & df1$val == 0), ]
id val
1 a 0
5 b 1
6 b 2
7 b 2
8 b 3
Base R option is to find groups with all NAs and transform them by changing their val to 0 and select only unique rows so that there is only one row per group. We rbind this dataframe with the groups which are !all_NA.
all_NA <- with(df1, ave(is.na(val), id, FUN = all))
rbind(unique(transform(df1[all_NA, ], val = 0)), df1[!all_NA, ])
# id val
#1 a 0
#5 b 1
#6 b 2
#7 b 2
#8 b 3
The dplyr option looks ugly, but one way is to split the data into two sets of groups: one with the groups whose values are all NA and the other with the groups that have no NA values. For the all-NA groups we add a row with its id and val as 0, and bind this to the other set.
library(dplyr)
bind_rows(df1 %>%
group_by(id) %>%
filter(all(!is.na(val))),
df1 %>%
group_by(id) %>%
filter(all(is.na(val))) %>%
ungroup() %>%
summarise(id = unique(id),
val = 0)) %>%
arrange(id)
# id val
# <fct> <dbl>
#1 a 0
#2 b 1
#3 b 2
#4 b 2
#5 b 3
Changed the df to make example more exhaustive -
df1 <- data.frame(id = rep(c("a", "b","c"), each = 4),
val = c(NA, NA, NA, NA, 1, 2, 2, 3,NA,2,NA,3))
library(dplyr)
df1 %>%
group_by(id) %>%
mutate(case=sum(is.na(val))==n(), row_num=row_number() ) %>%
mutate(val=ifelse(is.na(val)&case,0,val)) %>%
filter( !(case&row_num!=1) ) %>%
select(id, val)
Output
id val
<fct> <dbl>
1 a 0
2 b 1
3 b 2
4 b 2
5 b 3
6 c NA
7 c 2
8 c NA
9 c 3
Another base approach, one that doesn't maintain the order of the rows and takes advantage of factors remembering lost values:
df1 <- na.omit(df1)
df1 <- rbind(
df1,
data.frame(
id = levels(df1$id)[!levels(df1$id) %in% df1$id],
val = 0)
)
I do personally prefer the dplyr approach given by Sotos, as I don't like rbind-ing data.frames back together so it's a matter of taste, but this isn't unbearably complicated by my eye. It's easy enough to adapt to a character id column with a unique(df1$id) variable.
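A minimal sketch of that character-id adaptation follows. It is an illustration under assumptions rather than the answer's exact code: the names all_ids, kept, missing_ids and res are made up here, and stringsAsFactors = FALSE is set explicitly so that id is a character column on any R version.

```r
df1 <- data.frame(id = rep(c("a", "b", "c"), each = 4),
                  val = c(NA, NA, NA, NA, 1, 2, 2, 3, NA, 2, NA, 3),
                  stringsAsFactors = FALSE)

# Remember every id before dropping rows, since a character column
# has no levels to recover them from afterwards.
all_ids <- unique(df1$id)

kept <- df1[!is.na(df1$val), ]

# ids that lost all of their rows get a single placeholder row with val = 0
missing_ids <- setdiff(all_ids, kept$id)
res <- rbind(kept,
             data.frame(id = missing_ids,
                        val = rep(0, length(missing_ids)),
                        stringsAsFactors = FALSE))
res
```

As with the factor version, row order is not preserved: the placeholder rows are appended at the end.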
Here is an option too:
df1 %>%
mutate_if(is.factor,as.character) %>%
mutate_all(~ replace(., is.na(.), 0)) %>%
slice(4:nrow(.))
This gives:
id val
1 a 0
2 b 1
3 b 2
4 b 2
5 b 3
Alternative:
df1 %>%
mutate_if(is.factor,as.character) %>%
mutate_all(~ replace(., is.na(.), 0)) %>%
unique()
UPDATE based on other requirements:
Some users suggested testing on this dataframe. Of course, this answer hard-codes what you find by inspecting the data by hand, which makes it less useful in general, but here goes:
df1 <- data.frame(id = rep(c("a", "b","c"), each = 4), val = c(NA, NA, NA, NA, 1, 2, 2, 3,NA,2,NA,3))
df1 %>%
mutate_if(is.factor,as.character) %>%
mutate(val=ifelse(id=="a",0,val)) %>%
slice(4:nrow(.))
This yields:
id val
1 a 0
2 b 1
3 b 2
4 b 2
5 b 3
6 c NA
7 c 2
8 c NA
9 c 3
Here is a base R solution.
res <- lapply(split(df1, df1$id), function(DF){
if(anyNA(DF$val)) {
i <- is.na(DF$val)
DF$val[i] <- 0
DF <- rbind(DF[i & !duplicated(DF[i, ]), ], DF[!i, ])
}
DF
})
res <- do.call(rbind, res)
row.names(res) <- NULL
res
# id val
#1 a 0
#2 b 1
#3 b 2
#4 b 2
#5 b 3
Edit.
A dplyr solution could be the following.
It was tested with the original dataset posted by the OP, with the dataset in Vivek Kalyanarangan's answer and with the dataset in markus' comment, renamed df2 and df3, respectively.
library(dplyr)
na2zero <- function(DF){
DF %>%
group_by(id) %>%
mutate(val = ifelse(is.na(val), 0, val),
crit = val == 0 & duplicated(val)) %>%
filter(!crit) %>%
select(-crit)
}
na2zero(df1)
na2zero(df2)
na2zero(df3)
One may try this :
df1 = data.frame(id = rep(c("a", "b","c"), each = 4),
val = c(NA, NA, NA, NA, 1, 2, 2, 3,NA,2,NA,3))
df1
# id val
#1 a NA
#2 a NA
#3 a NA
#4 a NA
#5 b 1
#6 b 2
#7 b 2
#8 b 3
#9 c NA
#10 c 2
#11 c NA
#12 c 3
The task is to remove all rows for an id if and only if every val for that id is NA, and then add a new row with that id and val = 0.
In this example, that is id = a.
Note: val for c also has NAs, but not every val for c is NA, so we only need to remove the rows for c where val = NA.
So let's create another column, say val2, where 0 means the id's values are all NA and 1 otherwise.
library(dplyr)
df1 = df1 %>%
group_by(id) %>%
mutate(val2 = if_else(condition = all(is.na(val)),true = 0, false = 1))
df1
# A tibble: 12 x 3
# Groups: id [3]
# id val val2
# <fct> <dbl> <dbl>
#1 a NA 0
#2 a NA 0
#3 a NA 0
#4 a NA 0
#5 b 1 1
#6 b 2 1
#7 b 2 1
#8 b 3 1
#9 c NA 1
#10 c 2 1
#11 c NA 1
#12 c 3 1
Get the list of ids for which every val is NA.
all_na = unique(df1$id[df1$val2 == 0])
Then remove the rows with val = NA from the dataframe df1.
df1 = na.omit(df1)
df1
# A tibble: 6 x 3
# Groups: id [2]
# id val val2
# <fct> <dbl> <dbl>
# 1 b 1 1
# 2 b 2 1
# 3 b 2 1
# 4 b 3 1
# 5 c 2 1
# 6 c 3 1
And create a new dataframe with ids in all_na and val = 0
all_na_df = data.frame(id = all_na, val = 0)
all_na_df
# id val
# 1 a 0
then combine these two dataframes.
df1 = bind_rows(all_na_df, df1[,c('id', 'val')])
df1
# id val
# 1 a 0
# 2 b 1
# 3 b 2
# 4 b 2
# 5 b 3
# 6 c 2
# 7 c 3
Hope this helps and Edits are most welcomed :-)