My data frame looks like this:
df <- tibble(x = c(1, 2, NA),
y = c(1, NA, 3),
z = c(NA, 2, 3))
I want to replace NA with 0 using tidyr::replace_na(). As this function's documentation makes clear, it's straightforward to do this once you know which columns you want to perform the operation on.
df <- df %>% replace_na(list(x = 0, y = 0, z = 0))
But what if you have an indeterminate number of columns? (I say 'indeterminate' because I'm trying to create a function that does this on the fly using dplyr tools.) If I'm not mistaken, the base R equivalent to what I'm trying to achieve using the aforementioned tools is:
df[, 1:ncol(df)][is.na(df[, 1:ncol(df)])] <- 0
But I always struggle to get my head around this code. Thanks in advance for your help.
We can do this by creating a list of 0s, one per column of the dataset, and setting its names to the column names:
library(tidyverse)
df %>%
replace_na(set_names(as.list(rep(0, length(.))), names(.)))
# A tibble: 3 x 3
# x y z
# <dbl> <dbl> <dbl>
#1 1 1 0
#2 2 0 2
#3 0 3 3
Another option is mutate_all (or mutate_at for selected columns, or mutate_if to pick columns by a condition), applying replace_na:
df %>%
mutate_all(replace_na, replace = 0)
With base R, it is more straightforward
df[is.na(df)] <- 0
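Since the question asks for a function that works on the fly, here is a minimal sketch using across(), which supersedes mutate_all() in dplyr 1.0.0 and later. The helper name replace_all_na and its value argument are made up for illustration. The sketch assumes every column is numeric; a data frame with mixed column types would probably need a narrower selection such as where(is.numeric).

```r
library(dplyr)
library(tidyr)

# Hypothetical helper (not part of dplyr or tidyr): replace NA with `value`
# in every column of `data`. Assumes all columns are numeric.
replace_all_na <- function(data, value = 0) {
  data %>% mutate(across(everything(), ~ replace_na(.x, value)))
}

df <- tibble(x = c(1, 2, NA),
             y = c(1, NA, 3),
             z = c(NA, 2, 3))

result <- replace_all_na(df)
```

Because the columns are chosen by everything(), the helper does not need to know in advance how many columns there are or what they are called.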
I have to calculate the number of missing values per observation in a data set. As there are several variables across multiple time periods, I thought it best to try a function to keep my syntax clean. The first part of looking up the number of missing values works fine:
data$NMISS <- data %>%
select('x1':'x4') %>%
apply(1, function(x) sum(is.na(x)))
But when I try to turn it into a function, I get "Error in select(): ! NA/NaN argument".
library(dplyr)
library(tidyverse)
data <- data.frame(x1 = c(NA, 1, 5, 1),
x2 = c(7, 1, 1, 5),
x3 = c(9, NA, 4, 9),
x4 = c(3, 4, 1, 2))
NMISSfunc <- function (dataFrame,variables) {
dataFrame %>% select(variables) %>%
apply(1, function(x) sum(is.na(x)))
}
data$NMISS2 <- NMISSfunc(data,'x1':'x4')
I think it doesn't like the : in the range as it will accept c('x1','x2','x3','x4') instead of 'x1':'x4'
Some of the ranges are over twenty columns so listing them doesn't really provide a solution to keep the syntax neat.
Any suggestions?
You are right that you can't use "x1":"x4", as this isn't a valid use of the : operator: outside of a selection context, R tries to build a numeric sequence from the two strings and fails. To get this to work in tidyverse style, your variables argument needs to be forwarded to select without being evaluated first. Fortunately, the tidyverse has the curly-curly notation {{variables}} for handling exactly this situation:
NMISSfunc <- function(dataFrame, variables) {
  dataFrame %>%
    select({{ variables }}) %>%
    apply(1, function(x) sum(is.na(x)))
}
Now we can use x1:x4 (without quotes) and the function works as expected:
NMISSfunc(data, x1:x4)
#> [1] 1 1 0 0
Created on 2022-12-13 with reprex v2.0.2
Why not simply,
data %>%
mutate(NMISS = rowSums(is.na(select(., x1:x4))))
x1 x2 x3 x4 NMISS
1 NA 7 9 3 1
2 1 1 NA 4 1
3 5 1 4 1 0
4 1 5 9 2 0
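The two answers can be combined: a minimal sketch that wraps the rowSums() approach in a function and forwards the column range with the curly-curly operator. The helper name count_missing is hypothetical and chosen only for this example.

```r
library(dplyr)

# Hypothetical helper: count the NAs in each row, looking only at the
# columns selected by `cols` (any tidyselect expression, e.g. x1:x4).
count_missing <- function(data, cols) {
  rowSums(is.na(select(data, {{ cols }})))
}

data <- data.frame(x1 = c(NA, 1, 5, 1),
                   x2 = c(7, 1, 1, 5),
                   x3 = c(9, NA, 4, 9),
                   x4 = c(3, 4, 1, 2))

data$NMISS <- count_missing(data, x1:x4)
```

rowSums() works on the whole logical matrix at once, so it avoids the row-by-row apply() call while keeping the x1:x4 range syntax.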
I am working with the dplyr library and have created a dataframe in a pipe that looks something like this:
a <- c(1, 2, 2)
b <- c(3, 4, 4)
data <- data.frame(a, b)
data %>% summarize_all(c(min, max))
which gives me this dataframe:
a_fn1 b_fn1 a_fn2 b_fn2
1 3 2 4
and I am trying to reshape this dataframe so that the output of the pipe stacks multiple columns on top of each other in several rows that look like this:
A B
----
1 3
2 4
How would I go about this? I do not want to change how the functions are called because the summarize_all function helps me achieve the values I am looking for. I just want to know how to change this dataframe to the shape such that each value in each row is the value of the summarize function for the given column.
First, naming your functions in summarize_all() will make them appear in the result for easier wrangling.
Then, you can use pivot_longer() with the special .value sentinel in names_to to achieve what you want:
library(tidyverse)
a <- c(1, 2, 2)
b <- c(3, 4, 4)
data <- data.frame(a, b)
data %>%
summarize_all(c(min=min, max=max)) %>%
pivot_longer(everything(), names_to=c(".value", "variable"), names_pattern="(.)_(.+)")
#> # A tibble: 2 x 3
#> variable a b
#> <chr> <dbl> <dbl>
#> 1 min 1 3
#> 2 max 2 4
Created on 2021-07-22 by the reprex package (v2.0.0)
Depending on what output you want, you can even switch the order to c("variable", ".value").
Note that summarize_all() has been superseded and that you might want to use the newer, more verbose syntax: summarize(across(everything(), list(min = min, max = max))).
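For reference, here is a sketch of the same reshaping written with across() throughout. It relies on across() naming its outputs column_function by default (a_min, a_max, and so on) and splits those names on the underscore, which assumes the original column names contain no underscores themselves. The name "statistic" for the label column is an arbitrary choice.

```r
library(dplyr)
library(tidyr)

data <- data.frame(a = c(1, 2, 2),
                   b = c(3, 4, 4))

result <- data %>%
  # Produces one row with columns a_min, a_max, b_min, b_max
  summarize(across(everything(), list(min = min, max = max))) %>%
  # The part before "_" becomes a column name (.value),
  # the part after it becomes a row label
  pivot_longer(everything(),
               names_to = c(".value", "statistic"),
               names_sep = "_")
```

The result has one row per summary function and one column per original variable, plus the label column.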
I have a dataframe of this form:
df <- data.frame(abc = c(1, 0, 3, 2, 0),
foo = c(0, 4, 2, 1, 0),
glorx = c(0, 0, 0, 1, 2))
Here, the column names are strings and the values in the data frame are the number of times I would like to concatenate that string in a new data column. The new column I'd like to create would be a concatenation across all existing columns, with each column name being repeated according to the data.
For example, I'd like to create this new column and add it to the dataframe.
new_col <- c('abc', 'foofoofoofoo', 'abcabcabcfoofoo', 'abcabcfooglorx', 'glorxglorx')
also_acceptable <- c('abc', 'foofoofoofoo', 'abcfooabcfooabc', 'abcfooglorxabc', 'glorxglorx')
df %>% mutate(new_col = new_col, also_acceptable = also_acceptable)
The order of concatenation does not matter. The core problem I have is I don't know how to reference the name of a column by row when constructing a purrr::map() or dplyr::mutate() function to build a new column. Thus, I'm not sure how to programmatically construct this new column.
(The core application here is combinatorial construction of chemical formulae in case anyone wonders why I would need such a thing.)
Here is an option using Map and strrep:
mutate(df, new_col = do.call(paste, c(sep="", Map(strrep, names(df), df))))
# abc foo glorx new_col
#1 1 0 0 abc
#2 0 4 0 foofoofoofoo
#3 3 2 0 abcabcabcfoofoo
#4 2 1 1 abcabcfooglorx
#5 0 0 2 glorxglorx
Or a simpler version, as per @thelatemail's comment:
df %>% mutate(new_col = do.call(paste0, Map(strrep, names(.), .)))
Map gives a list as follows:
Map(strrep, names(df), df) %>% as_tibble()
# A tibble: 5 x 3
# abc foo glorx
# <chr> <chr> <chr>
#1 abc
#2 foofoofoofoo
#3 abcabcabc foofoo
#4 abcabc foo glorx
#5 glorxglorx
Use do.call(paste, ...) to paste strings rowwise.
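To make the two steps easier to follow, here is the same idea unrolled in base R only. The intermediate names one_column and pieces are introduced just for this sketch.

```r
df <- data.frame(abc = c(1, 0, 3, 2, 0),
                 foo = c(0, 4, 2, 1, 0),
                 glorx = c(0, 0, 0, 1, 2))

# strrep() is vectorised over `times`: one string, one repeat count per row.
# A count of 0 gives the empty string.
one_column <- strrep("abc", df$abc)

# Map() pairs each column name with its column of counts,
# giving a list with one character vector per column.
pieces <- Map(strrep, names(df), df)

# do.call() passes those vectors to paste0() as separate arguments,
# so they are glued together element by element, i.e. row by row.
df$new_col <- do.call(paste0, pieces)
```

Here do.call(paste0, pieces) is equivalent to writing paste0(pieces$abc, pieces$foo, pieces$glorx) by hand, but it works for any number of columns.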
Using dplyr, I am trying to conditionally update values in a column using ifelse and mutate. I am trying to say that, in a data frame, if any variable (column) in a row is equal to 7, then variable c should become 100, otherwise c remains the same.
df <- data.frame(a = c(1,2,3),
b = c(1,7,3),
c = c(5,2,9))
df <- df %>% mutate(c = ifelse(any(vars(everything()) == 7), 100, c))
This gives me the error:
Error in mutate_impl(.data, dots) :
Evaluation error: (list) object cannot be coerced to type 'double'.
The output I'd like is:
a b c
1 1 1 5
2 2 7 100
3 3 3 9
Note: this is an abstract example of a larger data set with more rows and columns.
EDIT:
This code gets me a bit closer, but it does not apply the ifelse statement by each row. Instead, it is changing all values to 100 in column c if 7 is present anywhere in the data frame.
df <- df %>% mutate(c = ifelse(any(select(., everything()) == 7), 100, c))
a b c
1 1 1 100
2 2 7 100
3 3 3 100
Perhaps this is not possible to do using dplyr?
I think this should work. We can check which values in df are equal to 7, then use rowSums to count the matches in each row; a count larger than 0 means that at least one value in that row is 7.
df <- df %>% mutate(c = ifelse(rowSums(df == 7) > 0, 100, c))
Or we can use apply
df <- df %>% mutate(c = ifelse(apply(df == 7, 1, any), 100, c))
A base R equivalent is like this.
df$c[apply(df == 7, 1, any)] <- 100
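One caveat, since the real data set is said to be larger: if it contains missing values, df == 7 yields NA for those cells, and the row test can become NA as well. The sketch below is a variation on the rowSums approach that ignores NA cells. The NA values in the example data are added here for illustration and are not part of the original question.

```r
# Example data with some missing values (an assumption for this sketch)
df <- data.frame(a = c(1, NA, 3),
                 b = c(1, 7, NA),
                 c = c(5, 2, 9))

# TRUE for rows with at least one 7; NA cells are skipped via na.rm = TRUE
has_seven <- rowSums(df == 7, na.rm = TRUE) > 0

# Only the flagged rows are overwritten, every other value of c is untouched
df$c[has_seven] <- 100
```

Without na.rm = TRUE, the third row would be flagged as NA rather than FALSE, and the logical subscript in the assignment would fail.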
You could try with purrr::map_dbl
library(purrr)
df$c <- map_dbl(1:nrow(df), ~ifelse(any(df[.x,]==7), 100, df[.x,]$c))
Output
a b c
1 1 1 5
2 2 7 100
3 3 3 9
In a dplyr::mutate statement this would be
library(purrr)
library(dplyr)
df %>%
mutate(c = map_dbl(1:nrow(df), ~ifelse(any(df[.x,]==7), 100, df[.x,]$c)))
I have grouped data that has blocks of missing values. I used dplyr to compute the sum of my target variable over each group. For groups where the sum is zero, I want to replace that group's values with the ones from the previous group. I could do this in a loop, but since my data is in a large data frame, that would be extremely inefficient.
Here's a synthetic example:
df <- tbl_df(as.data.frame(cbind(c(rep(1, 4), rep(2, 4)),
c(abs(rnorm(4)), rep(NA, 4)))))
names(df) <- c("group", "var")
df <- df %>%
group_by(group) %>%
mutate(total = sum(var, na.rm = TRUE))
Output:
Source: local data frame [8 x 3]
Groups: group
group var total
1 1 1.3697267 4.74936
2 1 1.5263502 4.74936
3 1 0.4065596 4.74936
4 1 1.4467237 4.74936
5 2 NA 0.00000
6 2 NA 0.00000
7 2 NA 0.00000
8 2 NA 0.00000
In this case, I want to replace the values of var in group 2 with the values of var in group 1, and I want to do it by detecting that total = 0 in group 2.
I've tried to come up with a custom function to feed into do() that does this, but can't figure out how to tell it to replace values in the current group with values from a different group. With the above example, I tried the following, which will always replace using the values from group 1:
CheckDay <- function(x) {
if( all(x$total == 0) ) { x$var <- df[df$group==1, 2] } ; x
}
do(df, CheckDay)
CheckDay does return a df, but do() throws an error:
Error: Results are not data frames at positions: 1, 2
Is there a way to get this to work?
There are a couple of things going on. First, you need to make sure df is a plain data.frame. Second, your function CheckDay(x) uses both the local variable x (to which you pass df) and the global variable df itself; it's better to keep everything inside the function local. Finally, your call to do() is missing the (.) part: it should be do(df, CheckDay(.)). Try this; it should work:
library("dplyr")
# tbl_df() is deprecated in current dplyr; build the data frame directly
df <- data.frame(group = c(rep(1, 4), rep(2, 4)),
                 var = c(abs(rnorm(4)), rep(NA, 4)))
df <- df %>%
group_by(group) %>%
mutate(total = sum(var, na.rm = TRUE))
df <- as.data.frame(df)
CheckDay <- function(x) {
  if (all(x[x$group == 2, ]$total == 0)) {
    x$var <- x[x$group == 1, 2]
  }
  x
}
result <- do(df, CheckDay(.))
print(result)
To expand on Brouwer's answer, here is what I implemented to accomplish my goal:
Generate df as previously.
Create df.shift, a copy of df with groups 1, 1, 2... etc -- i.e. a df with the variables shifted down by one group. (The rows in group 1 of df.shift could also simply be blank.)
Get the indices where total = 0 and copy the values from df.shift into df at those indices.
This can all be done in base R. It creates one copy, but is much cheaper and faster than looping over the groups.
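The steps above could be sketched in base R roughly as follows. This is an illustration under stated assumptions, not the exact code that was used: it assumes the rows are sorted by group, that every group has the same number of rows, and it uses fixed numbers instead of rnorm() so the result is reproducible. Instead of copying the whole data frame into df.shift, it shifts only the var column (shifted_var), which is enough for the replacement.

```r
# Three groups of four rows; group 2 is entirely missing
df <- data.frame(group = rep(1:3, each = 4),
                 var = c(1, 2, 3, 4, rep(NA, 4), 5, 6, 7, 8))

# Per-group total, treating NA as 0 (same role as the dplyr mutate above)
df$total <- ave(df$var, df$group, FUN = function(v) sum(v, na.rm = TRUE))

# Shift `var` down by one whole group: row i now holds the value that sits
# at the same position in the previous group (NA for the first group)
n_per_group <- sum(df$group == df$group[1])
shifted_var <- c(rep(NA, n_per_group), head(df$var, -n_per_group))

# Rows belonging to groups whose total is zero get the shifted values
empty <- df$total == 0
df$var[empty] <- shifted_var[empty]
```

Note that the shift only looks one group back, so two consecutive empty groups would leave the second one filled with NA.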