How can I efficiently update multiple columns from NA to 0? I don't want to update all NAs to 0; only certain columns need to be updated.
My current solution is below. Is there a better method?
dataframe$col1 = replace(dataframe$col1, is.na(dataframe$col1), 0)
dataframe$col2 = replace(dataframe$col2, is.na(dataframe$col2), 0)
dataframe$col3 = replace(dataframe$col3, is.na(dataframe$col3), 0)
I also tried the syntax below, but it is not working as expected, meaning it does not replace NA with 0.
dataframe = dataframe %>% mutate(across(c('col1', 'col2', 'col3'), ~ replace(., all(is.na(.)), 0)))
Sample Data.
structure(list(col1 = c(63755.4062, 61131.3242,
61131.3242, 192055.25, 191429.9844, 190076.4688), col2 = c(18.8754,
14.6002, 14.6002, 24.0053, 24.4012, 25.3588), col3 = c(NA, NA, NA, 45.6442, 43.9821, 47.2581)), row.names = c(NA, 6L), class = "data.frame")
The following worked. Thanks #MATT and #Karthik.
dataframe = dataframe %>% mutate(across(c('col1', 'col2', 'col3'), ~ tidyr::replace_na(., 0)))
Karthik's solution still returns NA for your sample data because all(is.na(.)) is TRUE only when the entire column is NA, so replace() changes nothing in a partially-NA column like col3. Using replace_na from tidyr works:
library(tidyr)
dataframe %>% mutate(across(c('col1', 'col2', 'col3'), ~ replace_na(., 0)))
Which gives us:
col1 col2 col3
1 63755.41 18.8754 0.0000
2 61131.32 14.6002 0.0000
3 61131.32 14.6002 0.0000
4 192055.25 24.0053 45.6442
5 191429.98 24.4012 43.9821
6 190076.47 25.3588 47.2581
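For completeness, the original replace() approach should also work once the all() wrapper is dropped, so the logical index is elementwise rather than a single TRUE/FALSE (a sketch on the question's sample data):
library(dplyr)
dataframe %>% mutate(across(c('col1', 'col2', 'col3'), ~ replace(., is.na(.), 0)))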
Does this work:
library(dplyr)
library(tibble)
df <- tibble(c1 = round(rnorm(10, 10,1)),
c2 = NA_real_,
c3 = round(rnorm(10, 10,1)),
c4 = NA_real_,
c5 = round(rnorm(10, 10,1)),
c6 = NA_real_)
df
# A tibble: 10 x 6
c1 c2 c3 c4 c5 c6
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 12 NA 11 NA 11 NA
2 9 NA 10 NA 10 NA
3 11 NA 11 NA 10 NA
4 11 NA 9 NA 10 NA
5 10 NA 9 NA 10 NA
6 9 NA 13 NA 12 NA
7 10 NA 10 NA 9 NA
8 10 NA 10 NA 9 NA
9 11 NA 11 NA 10 NA
10 10 NA 10 NA 10 NA
df %>% mutate(across(c3:c6, ~ replace(., all(is.na(.)), 0)))
# A tibble: 10 x 6
c1 c2 c3 c4 c5 c6
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 12 NA 11 0 11 0
2 9 NA 10 0 10 0
3 11 NA 11 0 10 0
4 11 NA 9 0 10 0
5 10 NA 9 0 10 0
6 9 NA 13 0 12 0
7 10 NA 10 0 9 0
8 10 NA 10 0 9 0
9 11 NA 11 0 10 0
10 10 NA 10 0 10 0
The quickest way IMO would be to subset the columns you want to edit as a new dataframe, edit all NAs in the subset to 0, then overwrite your original df's selected columns.
DFsubset <- DF[, 10:12]         # whichever columns need editing
DFsubset[is.na(DFsubset)] <- 0  # is.na() already returns a logical, no == T needed
DF[, 10:12] <- DFsubset
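Applied to the columns from the first question, the same base R idea collapses to two lines without the intermediate copy (a sketch; cols is just a hypothetical name for whichever columns you pick):
cols <- c("col1", "col2", "col3")
dataframe[cols][is.na(dataframe[cols])] <- 0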
I have a dataframe where each row represents a spatial unit. The nbid* variables indicate which unit is a neighbour. I would like to get the dum variable of the neighbour into the main dataframe. (Instead of spatial units it could be any kind of relations within a dataframe - business partners, relatives, related genes etc.)
Some simplified data look like this:
set.seed(999)
df_base <- data.frame(id = seq(1:100),
dum= sample(c(rep(0,50), rep(1,50)),100),
nbid_1=sample(1:100,100),
nbid_2=sample(1:100,100),
nbid_3=sample(1:100,100)) %>%
mutate(nbid_1 = replace(nbid_1, sample(row_number(), size = ceiling(0.1 * n()), replace = FALSE), NA),
nbid_2 = replace(nbid_2, sample(row_number(), size = ceiling(0.3 * n()), replace = FALSE), NA),
nbid_3 = replace(nbid_3, sample(row_number(), size = ceiling(0.7 * n()), replace = FALSE), NA))
(In these simplified data, unlike in the real data, neighbours 1, 2 and 3 can be the same, but that does not matter for the question.)
My approach was to duplicate and then join the data, which would look like this:
df1 <- df_base
df2 <- df_base %>%
select(-c(nbid_1,nbid_2,nbid_3)) %>%
rename(nbdum=dum)
df <- left_join(df1,df2,by=c("nbid_1"="id")) %>%
rename(nbdum1=nbdum) %>%
left_join(.,df2,by=c("nbid_2"="id")) %>%
rename(nbdum2=nbdum) %>%
left_join(.,df2,by=c("nbid_3"="id")) %>%
rename(nbdum3=nbdum)
df is the result that I am looking for - from here I can create an overall neighbour dummy or a count.
This approach is however neither elegant nor feasible to implement with the real data which has many more neighbours.
How can I solve this in a less clumsy way?
Thanks in advance for your ideas!!
A key clue is that when you see var_1, var_2, ..., var_n, it suggests that the data can be transformed to be longer. See pivot_longer() or data.table::melt() where molten data is discussed frequently.
For your example, we can pivot and then join the df2 table back. I am unsure whether the wide format is needed, but after the join we can pivot back to wide with pivot_wider().
library(dplyr)
library(tidyr)
df1 %>%
select(!id) %>%
pivot_longer(cols = starts_with("nbid"), names_prefix = "nbid_")%>%
mutate(original_id = rep(1:100, each = 3))%>%
left_join(df2, by = c("value" = "id"))%>%
pivot_wider(id_cols = original_id, names_from = name, values_from = c(value, nbdum))
#> # A tibble: 100 × 7
#> original_id value_1 value_2 value_3 nbdum_1 nbdum_2 nbdum_3
#> <int> <int> <int> <int> <dbl> <dbl> <dbl>
#> 1 1 25 90 23 0 0 1
#> 2 2 12 NA NA 1 NA NA
#> 3 3 11 40 47 0 0 0
#> 4 4 94 87 NA 0 1 NA
#> 5 5 46 77 NA 1 0 NA
#> 6 6 98 82 NA 1 0 NA
#> 7 7 43 NA NA 1 NA NA
#> 8 8 74 NA 7 0 NA 1
#> 9 9 57 NA NA 1 NA NA
#> 10 10 49 72 NA 0 0 NA
#> # … with 90 more rows
## compare to original
as_tibble(df)
#> # A tibble: 100 × 8
#> id dum nbid_1 nbid_2 nbid_3 nbdum1 nbdum2 nbdum3
#> <int> <dbl> <int> <int> <int> <dbl> <dbl> <dbl>
#> 1 1 0 25 90 23 0 0 1
#> 2 2 1 12 NA NA 1 NA NA
#> 3 3 1 11 40 47 0 0 0
#> 4 4 1 94 87 NA 0 1 NA
#> 5 5 0 46 77 NA 1 0 NA
#> 6 6 1 98 82 NA 1 0 NA
#> 7 7 1 43 NA NA 1 NA NA
#> 8 8 0 74 NA 7 0 NA 1
#> 9 9 0 57 NA NA 1 NA NA
#> 10 10 0 49 72 NA 0 0 NA
#> # … with 90 more rows
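Since the stated goal is an overall neighbour dummy or count, the long format is arguably the place to compute it, before any pivot back to wide. A sketch under that assumption (nb_any and nb_count are hypothetical names):
library(dplyr)
library(tidyr)
df1 %>%
pivot_longer(cols = starts_with("nbid"), values_to = "nb") %>%
left_join(df2, by = c("nb" = "id")) %>%
group_by(id) %>%
summarise(nb_any = as.integer(any(nbdum == 1, na.rm = TRUE)),
nb_count = sum(nbdum, na.rm = TRUE))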
As you just seem to be indexing dum with your neighbor variables you should be able to do:
library(dplyr)
df_base %>%
mutate(across(starts_with("nbid"), ~ dum[.x], .names = "nbdum{1:3}"))
id dum nbid_1 nbid_2 nbid_3 nbdum1 nbdum2 nbdum3
1 1 0 25 90 23 0 0 1
2 2 1 12 NA NA 1 NA NA
3 3 1 11 40 47 0 0 0
4 4 1 94 87 NA 0 1 NA
5 5 0 46 77 NA 1 0 NA
6 6 1 98 82 NA 1 0 NA
7 7 1 43 NA NA 1 NA NA
8 8 0 74 NA 7 0 NA 1
9 9 0 57 NA NA 1 NA NA
10 10 0 49 72 NA 0 0 NA
...
Or same idea in base R:
df_base[paste0("nbdum", 1:3)] <- sapply(df_base[startsWith(names(df_base), "nbid")], \(x) df_base$dum[x])
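The indexing trick relies on the fact that subsetting a vector at an NA position returns NA, which is exactly the behaviour wanted for missing neighbours (a toy vector, just for illustration):
dum <- c(10, 20, 30)
dum[c(2, NA, 3)]
#> [1] 20 NA 30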
I am trying to sum across each row when at least one column has a value, but the code below is not working for me:
df=data.frame(
x3=c(2,NA,3,5,4,6,NA,NA,3,3),
x4=c(0,NA,NA,6,5,6,NA,0,4,2))
df$summ <- ifelse(is.na(c(df[,"x3"] & df[,"x4"])),NA,rowSums(df[,c("x3","x4")], na.rm=TRUE))
The output should look like this:
   x3 x4 summ
1   2  0    2
2  NA NA   NA
3   3 NA    3
4   5  6   11
5   4  5    9
6   6  6   12
7  NA NA   NA
8  NA  0    0
9   3  4    7
10  3  2    5
An alternative solution:
library(data.table)
setDT(df)[!(is.na(x3) & is.na(x4)), summ := rowSums(.SD, na.rm = TRUE)]
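A sketch generalizing the data.table approach to an arbitrary set of columns (cols is a hypothetical name):
library(data.table)
cols <- c("x3", "x4")
dt <- as.data.table(df)
dt[rowSums(is.na(dt[, ..cols])) < length(cols), summ := rowSums(.SD, na.rm = TRUE), .SDcols = cols]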
You can do :
df <- transform(df, summ = ifelse(is.na(x3) & is.na(x4), NA,
rowSums(df, na.rm = TRUE)))
df
# x3 x4 summ
#1 2 0 2
#2 NA NA NA
#3 3 NA 3
#4 5 6 11
#5 4 5 9
#6 6 6 12
#7 NA NA NA
#8 NA 0 0
#9 3 4 7
#10 3 2 5
In general, for any number of columns:
cols <- c('x3', 'x4')
df <- transform(df, summ = ifelse(rowSums(is.na(df[cols])) == length(cols),
NA, rowSums(df, na.rm = TRUE)))
Try the code below with rowSums + replace
df$summ <- replace(rowSums(df, na.rm = TRUE), rowSums(is.na(df)) == 2, NA)
which gives
> df
x3 x4 summ
1 2 0 2
2 NA NA NA
3 3 NA 3
4 5 6 11
5 4 5 9
6 6 6 12
7 NA NA NA
8 NA 0 0
9 3 4 7
10 3 2 5
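The same rowSums + replace idea generalizes beyond two columns (a sketch; cols is whichever set you need):
cols <- c("x3", "x4")
df$summ <- replace(rowSums(df[cols], na.rm = TRUE),
rowSums(is.na(df[cols])) == length(cols), NA)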
This is not much different from the already posted answers; however, it contains some useful functions:
library(dplyr)
df %>%
rowwise() %>%
mutate(Count = ifelse(all(is.na(cur_data())), NA,
sum(c_across(everything()), na.rm = TRUE)))
# A tibble: 10 x 3
# Rowwise:
x3 x4 Count
<dbl> <dbl> <dbl>
1 2 0 2
2 NA NA NA
3 3 NA 3
4 5 6 11
5 4 5 9
6 6 6 12
7 NA NA NA
8 NA 0 0
9 3 4 7
10 3 2 5
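As a side note, cur_data() is superseded in dplyr 1.1+, where pick() plays the same role, so the rowwise version could be written as follows (a sketch, assuming dplyr >= 1.1):
library(dplyr)
df %>%
rowwise() %>%
mutate(Count = ifelse(all(is.na(pick(everything()))), NA,
sum(c_across(everything()), na.rm = TRUE)))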
Hello, a really simple question, but I have just got stuck: how do I add a conditional column containing the number 1 where the completed column is not NA?
id completed
<chr> <chr>
1 abc123sdf 35929
2 124cv NA
3 125xvdf 36295
4 126v NA
5 127sdsd 43933
6 128dfgs NA
7 129vsd NA
8 130sdf NA
9 131sdf NA
10 123sdfd NA
I need this to calculate the overall percentage of ids with a completed value.
(Additional question: how can I do this in dplyr without using a helper column?)
Thanks
You can use is.na to check for NA values.
library(dplyr)
df %>% mutate(newcol = as.integer(!is.na(completed)))
# id completed newcol
#1 abc123sdf 35929 1
#2 124cv NA 0
#3 125xvdf 36295 1
#4 126v NA 0
#5 127sdsd 43933 1
#6 128dfgs NA 0
#7 129vsd NA 0
#8 130sdf NA 0
#9 131sdf NA 0
#10 123sdfd NA 0
library("dplyr")
df <- data.frame(id = 1:10,
completed = c(35929, NA, 36295, NA, 43933, NA, NA, NA, NA, NA))
df %>%
mutate(is_na = as.integer(!is.na(completed)))
#> id completed is_na
#> 1 1 35929 1
#> 2 2 NA 0
#> 3 3 36295 1
#> 4 4 NA 0
#> 5 5 43933 1
#> 6 6 NA 0
#> 7 7 NA 0
#> 8 8 NA 0
#> 9 9 NA 0
#> 10 10 NA 0
But you shouldn't need this extra column to calculate a percentage; you can just use na.rm:
df %>%
mutate(pct = completed / sum(completed, na.rm = TRUE))
#> id completed pct
#> 1 1 35929 0.3093141
#> 2 2 NA NA
#> 3 3 36295 0.3124650
#> 4 4 NA NA
#> 5 5 43933 0.3782209
#> 6 6 NA NA
#> 7 7 NA NA
#> 8 8 NA NA
#> 9 9 NA NA
#> 10 10 NA NA
We can also do
library(dplyr)
df %>%
mutate(newcol = +(!is.na(completed)))
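To answer the follow-up directly: the overall percentage of completed ids needs no helper column at all, because the mean of a logical vector is already the proportion of TRUEs (a sketch):
df %>%
summarise(pct_completed = 100 * mean(!is.na(completed)))
#>   pct_completed
#> 1            30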
I have a dataframe df where:
Days Treatment A Treatment B Treatment C
0 5 1 1
1 0 2 3
2 1 1 0
For example, there were 5 individuals receiving Treatment A who survived 0 days, 1 who survived 2 days, and so on. I would like each of those individuals to become their own row, with the cell holding the number of days they survived:
Patient #    A    B    C
 1           0
 2           0
 3           0
 4           0
 5           0
 6           2
 7                0
 8                1
 9                1
10                2
11                     0
12                     1
13                     1
14                     1
(The Patient # values are arbitrary labels.)
I am sorry if this is not descriptive enough, but I appreciate any and all help you have to offer! I have the dataset in Excel at the moment, but I can place it into R if that's easier.
We can replicate the 'Days' values by the counts in each treatment column to get a list, then create a list of sequence numbers, use Map to construct a data.frame for each treatment, and finally combine them with bind_rows
library(dplyr)
lst1 <- lapply(df[-1], function(x) rep(df$Days, x))
bind_rows(Map(function(x, y, z) setNames(data.frame(x, y),
c("Patient", z)), relist(seq_along(unlist(lst1)),
skeleton = lst1), lst1, sub("Treatment\\s+", "", names(lst1))))
-output
# Patient A B C
#1 1 0 NA NA
#2 2 0 NA NA
#3 3 0 NA NA
#4 4 0 NA NA
#5 5 0 NA NA
#6 6 2 NA NA
#7 7 NA 0 NA
#8 8 NA 1 NA
#9 9 NA 1 NA
#10 10 NA 2 NA
#11 11 NA NA 0
#12 12 NA NA 1
#13 13 NA NA 1
#14 14 NA NA 1
Or another option with reshaping into 'long' and then to 'wide'
library(tidyr)
df %>%
pivot_longer(cols = -Days) %>%
separate(name, into = c('name1', 'name2')) %>%
group_by(name2) %>%
summarise(value = rep(Days, value), .groups = 'drop') %>%
mutate(Patient = row_number()) %>%
pivot_wider(names_from = name2, values_from = value)
-output
# A tibble: 14 x 4
# Patient A B C
# <int> <int> <int> <int>
# 1 1 0 NA NA
# 2 2 0 NA NA
# 3 3 0 NA NA
# 4 4 0 NA NA
# 5 5 0 NA NA
# 6 6 2 NA NA
# 7 7 NA 0 NA
# 8 8 NA 1 NA
# 9 9 NA 1 NA
#10 10 NA 2 NA
#11 11 NA NA 0
#12 12 NA NA 1
#13 13 NA NA 1
#14 14 NA NA 1
data
df <- structure(list(Days = 0:2, `Treatment A` = c(5L, 0L, 1L),
`Treatment B` = c(1L,
2L, 1L), `Treatment C` = c(1L, 3L, 0L)), class = "data.frame", row.names = c(NA,
-3L))
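One caveat on the tidyr pipeline above: summarise() returning several rows per group was deprecated in dplyr 1.1.0, so on newer versions reframe() is the drop-in replacement for that step (a sketch, assuming dplyr >= 1.1):
df %>%
pivot_longer(cols = -Days) %>%
separate(name, into = c('name1', 'name2')) %>%
group_by(name2) %>%
reframe(value = rep(Days, value)) %>%
mutate(Patient = row_number()) %>%
pivot_wider(names_from = name2, values_from = value)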
With the command df %>% filter(is.na(df)[,2:4]), filter subsets into a new df the rows that have NAs in columns 2, 3 and 4. What I want is not a new subsetted df, but rather to assign, for example, "1" to a new variable called "Exclude" in the actual df.
This example with mutate was not exactly what I was looking for, but close:
Use dplyr's filter and mutate to generate a new variable
I would also need the same to happen with other filter conditions.
Example: I have the following:
df <- data.frame(A = 1:6, B = 11:16, C = 21:26, D = 31:36)
df[3,2:4] <- NA
df[5,2:4] <- NA
df
> df
A B C D
1 1 11 21 31
2 2 12 22 32
3 3 NA NA NA
4 4 14 24 34
5 5 NA NA NA
6 6 16 26 36
and would like
> df
A B C D Exclude
1 1 11 21 31 NA
2 2 12 22 32 NA
3 3 NA NA NA 1
4 4 14 24 34 NA
5 5 NA NA NA 1
6 6 16 26 36 NA
Any good ideas how the filter subset could be used for an easy update? The hard workaround would be to generate this subset, create the new variable there, and then join back, but that is not tidy code.
We can do this with base R using vectorized rowSums: rowSums(is.na(df[-1])) counts the NAs in each row, ! turns that count into TRUE when it is zero, and since NA^TRUE is NA while NA^FALSE (i.e. NA^0) is 1, rows containing NAs get 1 and complete rows stay NA.
df$Exclude <- NA^!rowSums(is.na(df[-1]))
-output
df
# A B C D Exclude
#1 1 11 21 31 NA
#2 2 12 22 32 NA
#3 3 NA NA NA 1
#4 4 14 24 34 NA
#5 5 NA NA NA 1
#6 6 16 26 36 NA
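To see the arithmetic behind NA^!x: the exponent collapses the NA count to a logical, and R defines NA^0 as 1:
NA^TRUE    # NA^1 -> NA  (row had no NAs, so Exclude stays NA)
NA^FALSE   # NA^0 -> 1   (row had NAs, so Exclude becomes 1)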
Does this work:
library(dplyr)
df %>% rowwise() %>%
mutate(Exclude = +any(is.na(c_across(everything()))), Exclude = na_if(Exclude, 0))
# A tibble: 6 x 5
# Rowwise:
A B C D Exclude
<int> <int> <int> <int> <int>
1 1 11 21 31 NA
2 2 12 22 32 NA
3 3 NA NA NA 1
4 4 14 24 34 NA
5 5 NA NA NA 1
6 6 16 26 36 NA
Using anyNA.
df %>% mutate(Exclude=ifelse(apply(df[2:4], 1, anyNA), 1, NA))
# A B C D Exclude
# 1 1 11 21 31 NA
# 2 2 12 22 32 NA
# 3 3 NA NA NA 1
# 4 4 14 24 34 NA
# 5 5 NA NA NA 1
# 6 6 16 26 36 NA
Or just
df$Exclude <- ifelse(apply(df[2:4], 1, anyNA), 1, NA)
Another one-line solution (note that this one codes complete rows as 0 rather than NA):
df$Exclude <- as.numeric(apply(df[2:4], 1, function(x) any(is.na(x))))
Use rowwise, sum over all numeric columns, assign 1 or NA in ifelse.
df <- data.frame(A = 1:6, B = 11:16, C = 21:26, D = 31:36)
df[3, 2:4] <- NA
df[5, 2:4] <- NA
library(tidyverse)
df %>%
rowwise() %>%
mutate(Exclude = ifelse(
is.na(sum(c_across(where(is.numeric)))), 1, NA
))
#> # A tibble: 6 x 5
#> # Rowwise:
#> A B C D Exclude
#> <int> <int> <int> <int> <dbl>
#> 1 1 11 21 31 NA
#> 2 2 12 22 32 NA
#> 3 3 NA NA NA 1
#> 4 4 14 24 34 NA
#> 5 5 NA NA NA 1
#> 6 6 16 26 36 NA