Replacing NAs with a specific condition in R

If 2017 is NA and the 2015 and 2016 columns both have values, I want to assign their average to 2017 in the same row.
Index 2015 2016 2017
1 NA 6355698 10107023
2 13000000 73050000 NA
4 NA NA NA
5 10500000 NA 8000000
6 331000000 659000000 1040000000
7 55500000 NA 32032920
8 NA NA 20000000
9 2521880 5061370 7044288
...
Here is what I tried, but it didn't work:
ind <- which(is.na(df), arr.ind=TRUE)
df[ind] <- rowMeans(df, na.rm = TRUE)[ind[,1]]
Similarly, if the 2015 and 2017 columns have values and 2016 is NA, I want to assign their average to 2016 in the same row. Any help would be appreciated!

Disclaimer: I'm not entirely clear on what your expected output is. My solution below is based on the assumption that you want to replace NA values with either the mean of all values for every year or with the mean value of all values for every Index.
Here is a tidyverse option that first gathers the data from wide to long, replaces NAs with the mean value per year, and finally spreads the result back from long to wide.
library(tidyverse)
df %>%
gather(year, value, -Index) %>%
group_by(year) %>%
mutate(value = ifelse(is.na(value), mean(value, na.rm = TRUE), value)) %>%
spread(year, value)
## A tibble: 8 x 4
# Index `2015` `2016` `2017`
# <int> <dbl> <dbl> <dbl>
#1 1 115507293. 6355698. 10107023.
#2 2 13000000. 223472356. 186197372.
#3 4 115507293. 223472356. 186197372.
#4 5 115507293. 223472356. 8000000.
#5 6 331000000. 659000000. 1040000000.
#6 7 115507293. 223472356. 32032920.
#7 8 115507293. 223472356. 20000000.
#8 9 2521880. 5061370. 7044288.
Note that here we replace NAs with the mean value per year. If you instead want to replace NAs with the mean value per Index, simply replace group_by(year) with group_by(Index):
df %>%
gather(year, value, -Index) %>%
group_by(Index) %>%
mutate(value = ifelse(is.na(value), mean(value, na.rm = TRUE), value)) %>%
spread(year, value)
## A tibble: 8 x 4
## Groups: Index [8]
# Index `2015` `2016` `2017`
# <int> <dbl> <dbl> <dbl>
#1 1 8231360. 6355698. 10107023.
#2 2 13000000. 13000000. 13000000.
#3 4 NaN NaN NaN
#4 5 8000000. 8000000. 8000000.
#5 6 331000000. 659000000. 1040000000.
#6 7 32032920. 32032920. 32032920.
#7 8 20000000. 20000000. 20000000.
#8 9 2521880. 5061370. 7044288.
Update
To replace only the NAs in column 2017 with the row average of the 2015 and 2016 values, you can do:
df <- read_table("Index 2015 2016 2017
1 NA 6355698 10107023
2 13000000 73050000 NA
4 NA NA NA
5 10500000 NA 8000000
6 331000000 659000000 1040000000
7 55500000 NA 32032920
8 NA NA 20000000
9 2521880 5061370 7044288")
df %>%
mutate(`2017` = ifelse(is.na(`2017`), 0.5 * (`2015` + `2016`), `2017`))
## A tibble: 8 x 4
# Index `2015` `2016` `2017`
# <int> <int> <int> <dbl>
#1 1 NA 6355698 10107023.
#2 2 13000000 73050000 43025000.
#3 4 NA NA NA
#4 5 10500000 NA 8000000.
#5 6 331000000 659000000 1040000000.
#6 7 55500000 NA 32032920.
#7 8 NA NA 20000000.
#8 9 2521880 5061370 7044288.
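The question also asks for the 2016 column to be filled when 2015 and 2017 are present. A minimal base R sketch of that extension follows, assuming an NA should only be filled when both other year columns in the same row are non-missing. The names `target`, `others` and `fill` are chosen for illustration and are not part of the original answer.

```r
# Data from the question, with the year columns kept as "2015", "2016", "2017"
df <- data.frame(
  Index  = c(1, 2, 4, 5, 6, 7, 8, 9),
  `2015` = c(NA, 13000000, NA, 10500000, 331000000, 55500000, NA, 2521880),
  `2016` = c(6355698, 73050000, NA, NA, 659000000, NA, NA, 5061370),
  `2017` = c(10107023, NA, NA, 8000000, 1040000000, 32032920, 20000000, 7044288),
  check.names = FALSE
)

years <- c("2015", "2016", "2017")

# Assumption: only 2016 and 2017 are targets, as described in the question
for (target in c("2016", "2017")) {
  others <- setdiff(years, target)
  # Fill only where the target is NA and both other years are present
  fill <- is.na(df[[target]]) & rowSums(is.na(df[others])) == 0
  df[[target]][fill] <- rowMeans(df[others])[fill]
}

df
```

Rows with more than one missing year (Index 4 and 8) are left untouched, since there is no pair of values to average.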
Sample data
df <- read_table("Index 2015 2016 2017
1 NA 6355698 10107023
2 13000000 NA NA
4 NA NA NA
5 NA NA 8000000
6 331000000 659000000 1040000000
7 NA NA 32032920
8 NA NA 20000000
9 2521880 5061370 7044288")

Related

Why does grepl work but not str_detect for mutate depending on row value?

I have been trying to wrap my head around this.
I need to create a corrected column based on detecting a specific comment in another "error" column in my database. I can work around this with grepl, but I am struggling to get str_detect to work as well (it is usually faster for big datasets).
Here is an example database:
test <- tibble(
id = seq(1:30),
date = sample(seq(as.Date('2000/01/01'), as.Date('2018/01/01'), by="day"), 30),
error = c(rep(NA, 3), "wrong date! Correct date = 01.03.2022",
rep(NA, 5), "wrong date! Correct date = 01.05.2021",
rep(NA, 5), "wrong date! Correct date = 01.03.2022",
rep(NA, 7), "wrong date! Correct date = 01.05.2021",
rep(NA, 2), "date already corrected on 01.05.2021",
NA, "date already corrected on 01.03.2022", NA))
I first tried to create a new "date_corr" column with str_detect:
test %>%
mutate(date_corr=if_else(str_detect(error, "date \\= 01\\.03\\.2022$"), as.Date('2022/03/01'), date),
date_corr=if_else(str_detect(error, "date \\= 01\\.05\\.2021$"), as.Date('2021/05/01'), date_corr))
This yields:
A tibble: 30 × 4
id date error date_corr
<int> <date> <chr> <date>
1 1 2010-04-28 NA NA
2 2 2004-06-30 NA NA
3 3 2015-09-25 NA NA
4 4 2005-08-21 wrong date! Correct date = 01.03.2022 2022-03-01
5 5 2008-07-16 NA NA
6 6 2004-08-02 NA NA
7 7 2001-10-15 NA NA
8 8 2007-07-21 NA NA
9 9 2014-04-19 NA NA
10 10 2013-02-08 wrong date! Correct date = 01.05.2021 2021-05-01
# … with 20 more rows
Adding rowwise() makes no difference:
test %>%
rowwise() %>%
mutate(date_corr=if_else(str_detect(error, "date \\= 01\\.03\\.2022$"), as.Date('2022/03/01'), date),
date_corr=if_else(str_detect(error, "date \\= 01\\.05\\.2021$"), as.Date('2021/05/01'), date_corr))
A tibble: 30 × 4
# Rowwise:
id date error date_corr
<int> <date> <chr> <date>
1 1 2010-04-28 NA NA
2 2 2004-06-30 NA NA
3 3 2015-09-25 NA NA
4 4 2005-08-21 wrong date! Correct date = 01.03.2022 2022-03-01
5 5 2008-07-16 NA NA
6 6 2004-08-02 NA NA
7 7 2001-10-15 NA NA
8 8 2007-07-21 NA NA
9 9 2014-04-19 NA NA
10 10 2013-02-08 wrong date! Correct date = 01.05.2021 2021-05-01
# … with 20 more rows
However, with grepl I get the desired outcome, regardless of rowwise:
test %>%
mutate(date_corr=if_else(grepl("date \\= 01\\.03\\.2022$", error), as.Date('2022/03/01'), date),
date_corr=if_else(grepl("date \\= 01\\.05\\.2021$", error), as.Date('2021/05/01'), date_corr))
# A tibble: 30 × 4
id date error date_corr
<int> <date> <chr> <date>
1 1 2010-04-28 NA 2010-04-28
2 2 2004-06-30 NA 2004-06-30
3 3 2015-09-25 NA 2015-09-25
4 4 2005-08-21 wrong date! Correct date = 01.03.2022 2022-03-01
5 5 2008-07-16 NA 2008-07-16
6 6 2004-08-02 NA 2004-08-02
7 7 2001-10-15 NA 2001-10-15
8 8 2007-07-21 NA 2007-07-21
9 9 2014-04-19 NA 2014-04-19
10 10 2013-02-08 wrong date! Correct date = 01.05.2021 2021-05-01
# … with 20 more rows
test %>%
rowwise() %>%
mutate(date_corr=if_else(grepl("date \\= 01\\.03\\.2022$", error), as.Date('2022/03/01'), date),
date_corr=if_else(grepl("date \\= 01\\.05\\.2021$", error), as.Date('2021/05/01'), date_corr))
A tibble: 30 × 4
# Rowwise:
id date error date_corr
<int> <date> <chr> <date>
1 1 2010-04-28 NA 2010-04-28
2 2 2004-06-30 NA 2004-06-30
3 3 2015-09-25 NA 2015-09-25
4 4 2005-08-21 wrong date! Correct date = 01.03.2022 2022-03-01
5 5 2008-07-16 NA 2008-07-16
6 6 2004-08-02 NA 2004-08-02
7 7 2001-10-15 NA 2001-10-15
8 8 2007-07-21 NA 2007-07-21
9 9 2014-04-19 NA 2014-04-19
10 10 2013-02-08 wrong date! Correct date = 01.05.2021 2021-05-01
# … with 20 more rows
What am I missing here?
The difference is in how they handle NA values:
str_detect(NA, "missing")
# [1] NA
grepl("missing", NA)
# [1] FALSE
Note also that if the condition passed to if_else is NA, the result is NA as well:
if_else(NA, 1, 2)
# [1] NA
So str_detect propagates the NA, and if_else then returns NA for those rows. It's not clear what the "right" value should be, but if you want str_detect to give the same result as grepl, you can be explicit about not changing rows where error is NA:
test %>%
mutate(date_corr=if_else(!is.na(error) & str_detect(error, "date \\= 01\\.03\\.2022$"), as.Date('2022/03/01'), date),
date_corr=if_else(!is.na(error) & str_detect(error, "date \\= 01\\.05\\.2021$"), as.Date('2021/05/01'), date_corr))
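Two other ways to get the grepl-like behaviour are sketched below on a three-element toy vector (the vectors `error` and `date` here are made up for illustration, not the `test` tibble). The first converts NA matches to FALSE with `dplyr::coalesce()`; the second uses the `missing` argument of `if_else()`. Both are offered as possible variants rather than as the recommended fix.

```r
library(dplyr)
library(stringr)

# Toy data (assumption: a cut-down stand-in for the question's columns)
error <- c(NA,
           "wrong date! Correct date = 01.03.2022",
           "date already corrected on 01.03.2022")
date <- as.Date(c("2010-04-28", "2005-08-21", "2013-02-08"))

pattern <- "date = 01\\.03\\.2022$"

# Option 1: turn NA matches into FALSE before they reach if_else()
hit <- coalesce(str_detect(error, pattern), FALSE)
fixed1 <- if_else(hit, as.Date("2022-03-01"), date)

# Option 2: tell if_else() what to return when the condition is NA
fixed2 <- if_else(str_detect(error, pattern),
                  as.Date("2022-03-01"), date,
                  missing = date)

fixed1
fixed2
```

In both cases the row with a missing error keeps its original date, which matches what grepl produces.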

R: take different variables from a row and make column header

rainfall <- data.frame("date" = rep(1:15),"location_code" = rep(6:8,5),
"rainfall"=runif(15, min=12, max=60))
rainfall30 <- rainfall %>%
group_by(location_code) %>%
filter(rainfall>30)
I want to use the above data to make the following table. Is there a way to do it in R using dplyr?
date location6 location7 location8
2 47.7
5 46.8
6 32.3
7 55.3
9 40.5
I am just starting to use R, so apologies if this has already been answered. Thanks.
I think what you are looking for is tidyr::pivot_wider, which turns this long-form data.frame into wide form. See the tidyr documentation on pivoting for more information.
rainfall30 %>%
pivot_wider(names_from = location_code,
values_from = rainfall)
# date `6` `7` `8`
# <int> <dbl> <dbl> <dbl>
# 1 1 32.3 NA NA
# 2 2 NA 52.7 NA
# 3 3 NA NA 54.3
# 4 4 30.6 NA NA
# 5 7 52.4 NA NA
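The requested table has columns named location6, location7 and location8 rather than 6, 7 and 8. A small sketch of how that could be done with the `names_prefix` argument of `pivot_wider()` follows. The `set.seed(1)` call is an assumption added only to make the random data reproducible, and `names_sort = TRUE` is an optional extra to order the new columns.

```r
library(dplyr)
library(tidyr)

set.seed(1)  # assumption: fixed seed so the example is reproducible
rainfall <- data.frame(
  date = 1:15,
  location_code = rep(6:8, 5),
  rainfall = runif(15, min = 12, max = 60)
)

wide <- rainfall %>%
  filter(rainfall > 30) %>%
  pivot_wider(
    names_from   = location_code,
    values_from  = rainfall,
    names_prefix = "location",  # gives location6, location7, location8
    names_sort   = TRUE         # order the new columns by location code
  )

wide
```

Cells for which a date has no reading above 30 at a given location come out as NA.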
Here is a base R option using reshape + subset
reshape(
subset(rainfall, rainfall > 30),
idvar = "date",
timevar = "location_code",
direction = "wide"
)
which gives something like below (using set.seed(1) to generate rainfall)
date rainfall.8 rainfall.6 rainfall.7
3 3 39.49696 NA NA
4 4 NA 55.59397 NA
6 6 55.12270 NA NA
7 7 NA 57.34441 NA
8 8 NA NA 43.71829
9 9 42.19747 NA NA
13 13 NA 44.97710 NA
14 14 NA NA 30.43698
15 15 48.95239 NA NA

Counting columns with NAs after group_by

I want to count the number of columns that have an NA value after using group_by.
Similar questions have been asked, but they count the total number of NAs rather than the number of columns containing NA (e.g. "group by counting non NA").
Data:
Spes <- "Year Spec.1 Spec.2 Spec.3 Spec.4
1 2016 5 NA NA 5
2 2016 1 NA NA 6
3 2016 6 NA NA 4
4 2018 NA 5 5 9
5 2018 NA 4 7 3
6 2018 NA 5 2 1
7 2019 6 NA NA NA
8 2019 4 NA NA NA
9 2019 3 NA NA NA"
Data <- read.table(text=Spes, header = TRUE)
Data$Year <- as.factor(Data$Year)
The desired output:
2016 2
2018 1
2019 3
I have tried a few things, this is my current best attempt. I would be keen for a dplyr solution.
> Data %>%
group_by(Year) %>%
summarise_each(colSums(is.na(Data, [2:5])))
Error: Can't create call to non-callable object
I have tried variations without much luck. Many thanks
One option is to group_by Year, check whether each column contains any NA values, and sum those indicators for each Year.
library(dplyr)
Data %>%
group_by(Year) %>%
summarise_all(~any(is.na(.))) %>%
mutate(output = rowSums(.[-1])) %>%
select(Year, output)
# A tibble: 3 x 2
# Year output
# <fct> <dbl>
#1 2016 2
#2 2018 1
#3 2019 3
Base R translation using aggregate
rowSums(aggregate(.~Year, Data, function(x)
any(is.na(x)), na.action = "na.pass")[-1], na.rm = TRUE)
#[1] 2 1 3
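The dplyr answer above relies on `summarise_all()`, which newer dplyr versions have superseded in favour of `across()`. A possible equivalent is sketched here, assuming dplyr 1.0.0 or later; the object name `result` is chosen for illustration.

```r
library(dplyr)

Spes <- "Year Spec.1 Spec.2 Spec.3 Spec.4
1 2016 5 NA NA 5
2 2016 1 NA NA 6
3 2016 6 NA NA 4
4 2018 NA 5 5 9
5 2018 NA 4 7 3
6 2018 NA 5 2 1
7 2019 6 NA NA NA
8 2019 4 NA NA NA
9 2019 3 NA NA NA"
Data <- read.table(text = Spes, header = TRUE)
Data$Year <- as.factor(Data$Year)

result <- Data %>%
  group_by(Year) %>%
  # One TRUE/FALSE per column and year: does the column contain any NA?
  summarise(across(everything(), anyNA)) %>%
  # Count the TRUE flags across the species columns
  mutate(output = rowSums(across(-Year))) %>%
  select(Year, output)

result
```

The idea is the same as before: flag each column per year with `anyNA()`, then count the flags row by row.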

how to replace missing values with previous year's binned mean

I have the data frame shown below.
The Bin_p1 and Bin_f1 columns were computed with the cut function as follows:
Bins <- function(x) cut(x, breaks = c(0, seq(1, 1000, by = 5)), labels = 1:200)
binned <- as.data.frame (sapply(df[,-1], Bins))
colnames(binned) <- paste("Bin", colnames(binned), sep = "_")
df<- cbind(df, binned)
How can I calculate the mean over the previous two years within a bin and use it to replace the NA values in that bin?
For example, in row 5 p1 is NA and f1 is 30, which falls in bin 7. The NA should be replaced with the mean of p1 over the previous two years for the same bin (7), i.e.:
df
ID year p1 f1 Bin_p1 Bin_f1
1 2013 20 30 5 7
2 2013 24 29 5 7
3 2014 10 16 2 3
4 2014 11 17 2 3
5 2015 NA 30 NA 7
6 2016 10 NA 2 NA
df1
ID year p1 f1 Bin_p1 Bin_f1
1 2013 20 30 5 7
2 2013 24 29 5 7
3 2014 10 16 2 3
4 2014 11 17 2 3
5 2015 **22** 30 NA 7
6 2016 10 **16.5** 2 NA
Thanks in advance
I believe the following code produces the desired output. There's probably a much more elegant way than using mean(rev(lag(f1))[1:2]) to get the average of the last two values of f1 but this should do the trick anyway.
library(dplyr)
df %>%
arrange(year) %>%
mutate_at(c("p1", "f1"), "as.double") %>%
group_by(Bin_p1) %>%
mutate(f1 = ifelse(is.na(f1), mean(rev(lag(f1))[1:2]), f1)) %>%
group_by(Bin_f1) %>%
mutate(p1 = ifelse(is.na(p1), mean(rev(lag(p1))[1:2]), p1)) %>%
ungroup()
and the output is:
# A tibble: 6 x 6
ID year p1 f1 Bin_p1 Bin_f1
<int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2013 20 30.0 5 7
2 2 2013 24 29.0 5 7
3 3 2014 10 16.0 2 3
4 4 2014 11 17.0 2 3
5 5 2015 22 30.0 NA 7
6 6 2016 10 16.5 2 NA
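The answer notes that `mean(rev(lag(f1))[1:2])` is probably not the most elegant way to average the last two values. One possible alternative, sketched here in base R, averages the last two non-missing values with `tail()`. The helper name `last_two_mean` is hypothetical, and it assumes the vector is already ordered by year. It agrees with the original expression when the only NA is the final element, but not necessarily in other cases.

```r
# Hypothetical helper: mean of the last two non-missing values of a vector
last_two_mean <- function(v) {
  observed <- v[!is.na(v)]
  mean(tail(observed, 2))
}

# f1 values for the rows with Bin_p1 == 2 (rows 3, 4 and 6 of df)
f1_bin2 <- c(16, 17, NA)
last_two_mean(f1_bin2)

# p1 values for the rows with Bin_f1 == 7 (rows 1, 2 and 5 of df)
p1_bin7 <- c(20, 24, NA)
last_two_mean(p1_bin7)
```

Inside the grouped `mutate()` this could stand in for the `rev(lag(...))` expression, for example `ifelse(is.na(f1), last_two_mean(f1), f1)`.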

Keep unique entries by name and by time

Here is a puzzle I am struggling with quite a bit. I got hold of a complex dataset in long format, which I need in wide format for analysis. The conversion itself was easy. However, the dataset contains redundancy after the conversion because of how the data was entered. Here is a MWE showing the problem:
id <- c("ana","ana","ana", "brad","ana","brad","brad","brad", "matt", "matt", "matt")
hour <- c(0, 0, 24, 0, 48, 24, NA, 72, 0 , 24, 48 )
assessment <- c("memory", "memory", "attention", "verbal", "attention", "memory", "attention","attention", "memory", "attention", "attention")
value <- c(0.000,NA,0.895,0.000,15.000, 3, 5, NA,2, 4,5 )
mydata<-data.frame(id, hour, assessment, value)
Results in:
> mydata
id hour assessment value
1 ana 0 memory 0.000
2 ana 0 memory NA
3 ana 24 attention 0.895
4 brad 0 verbal 0.000
5 ana 48 attention 15.000
6 brad 24 memory 3.000
7 brad NA attention 5.000
8 brad 72 attention NA
9 matt 0 memory 2.000
10 matt 24 attention 4.000
11 matt 48 attention 5.000
and after:
library(dplyr)
library(tidyr)
mydata %>%
group_by(id) %>%
mutate(i1=row_number()) %>%
spread(assessment, value)
gets to:
Source: local data frame [11 x 6]
Groups: id [3]
id hour i1 attention memory verbal
* <fctr> <dbl> <int> <dbl> <dbl> <dbl>
1 ana 0 1 NA 0 NA
2 ana 0 2 NA NA NA
3 ana 24 3 0.895 NA NA
4 ana 48 4 15.000 NA NA
5 brad 0 1 NA NA 0
6 brad 24 2 NA 3 NA
7 brad 72 4 NA NA NA
8 brad NA 3 5.000 NA NA
9 matt 0 1 NA 2 NA
10 matt 24 2 4.000 NA NA
11 matt 48 3 5.000 NA NA
Note that ana has two entries for hour 0 and memory, and brad has one entry with hour 0 and another with a missing hour. The missing hour should be treated as zero as well; it was a typing error by whoever collected the data.
The table below shows how ana's and brad's entries should be. Repetitions for the same id and hour (including NA) should be collapsed/merged (look at lines 1 and 5 below).
id hour i1 attention memory verbal
* <fctr> <dbl> <int> <dbl> <dbl> <dbl>
1 ana 0 1 NA 0 NA
2 ana 24 3 0.895 NA NA
4 ana 48 4 15.000 NA NA
5 brad 0 1 5.000 NA 0
6 brad 24 2 NA 3 NA
7 brad 72 4 NA NA NA
9 matt 0 1 NA 2 NA
10 matt 24 2 4.000 NA NA
11 matt 48 3 5.000 NA NA
Question:
How do I reduce the duplicates for each subject+hour in such a dataset, so that it will look like the previous table?
One option is to replace the NAs with 0, keep the distinct rows, and then proceed as in the OP's code:
mydata %>%
mutate_at(vars(hour, value), ~ replace(., is.na(.), 0)) %>%
arrange(id, hour, desc(value)) %>%
distinct() %>%
group_by(id, hour, assessment) %>%
spread(assessment, value)
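The same idea can be written with the newer `across()` and `pivot_wider()` functions. This is a sketch under the assumption that every NA in `hour` and `value` really should become 0, as the question states for `hour`; the result name `wide` is chosen for illustration.

```r
library(dplyr)
library(tidyr)

mydata <- data.frame(
  id = c("ana", "ana", "ana", "brad", "ana", "brad", "brad", "brad",
         "matt", "matt", "matt"),
  hour = c(0, 0, 24, 0, 48, 24, NA, 72, 0, 24, 48),
  assessment = c("memory", "memory", "attention", "verbal", "attention",
                 "memory", "attention", "attention", "memory", "attention",
                 "attention"),
  value = c(0.000, NA, 0.895, 0.000, 15.000, 3, 5, NA, 2, 4, 5)
)

wide <- mydata %>%
  # Assumption: missing hours and values are treated as 0
  mutate(across(c(hour, value), ~ replace(.x, is.na(.x), 0))) %>%
  # Drop rows that became exact duplicates after the replacement
  distinct() %>%
  # One row per id and hour, one column per assessment
  pivot_wider(names_from = assessment, values_from = value) %>%
  arrange(id, hour)

wide
```

With this input, brad's hour 0 row ends up holding both the verbal score and the attention score that was originally recorded with a missing hour.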
