Replace values in vector where not %in% vector - r

Short question:
I can substitute certain variable values like this:
values <- c("a", "b", "a", "b", "c", "a", "b")
df <- data.frame(values)
What's the easiest way to replace every value of df$values that is neither "a" nor "b" with "x"?
Output should be:
c("a", "b", "a", "b", "x", "a", "b")

Your example is a bit unclear and not reproducible.
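However, guessing at what you actually want, you could try the data.table package (setDT() converts df to a data.table by reference):
setDT(df)[!values %in% c("a", "b"), values := "x"]
or the dplyr package:
df %>% mutate(values = ifelse(values %in% c("a", "b"), values, "x"))
Both lines are written for a character column. If values is a factor (the default for data.frame() before R 4.0), convert it with as.character() first.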

What about:
df[!df[, 1] %in% c("a", "b"), ] <- "x"
values
1 a
2 b
3 a
4 b
5 x
6 a
7 b
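For reference, a minimal base R sketch of the same replacement. It assumes values is stored as a character column (stringsAsFactors = FALSE is set explicitly so that this also holds on R versions before 4.0). The name keep is introduced here only for illustration. The last lines show one possible way to handle a factor, by renaming its levels instead of its elements.

```r
values <- c("a", "b", "a", "b", "c", "a", "b")
df <- data.frame(values, stringsAsFactors = FALSE)

# Values that should be left untouched (hypothetical helper name)
keep <- c("a", "b")

# Logical index: TRUE wherever the value is NOT one of the kept values
df$values[!df$values %in% keep] <- "x"
df$values

# Factor variant: rename every level that is not in keep
f <- factor(values)
levels(f)[!levels(f) %in% keep] <- "x"
as.character(f)
```

Both df$values and as.character(f) should then match the output requested in the question.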

Related

Subset column values into separate vectors using 'for' loop in R

I have this data.frame and vector:
df <- data.frame (fruit = c(rep("apple", 5), rep("banana", 3), rep("cherry", 6), rep("date", 4)),
letter = c("a", "b", "c", "d", "e", "a", "d", "f", "b", "c", "f", "p", "q", "r", "d", "p",
"x", "y")
)
my_vector <- c("apple", "banana", "date")
Now I would like to use a for loop that creates one vector per element of my_vector, named after that element and containing the matching values from the letter column.
The expected outcome is:
apple <- c("a", "b", "c", "d", "e")
banana <- c("a", "d", "f")
date <- c("d", "p", "x", "y")
Thank you.
We can subset the data to keep only the fruits in my_vector and split the letter column into a list of vectors.
list2env(with(subset(df, fruit %in% my_vector),split(letter, fruit)), .GlobalEnv)
apple
#[1] "a" "b" "c" "d" "e"
banana
#[1] "a" "d" "f"
date
#[1] "d" "p" "x" "y"
list2env writes the list elements to the global environment as separate vectors, but it is usually better practice to keep the data in the list than to split it into individual objects.
A for loop solution would use assign:
for (vec in my_vector) {
  assign(vec, df$letter[df$fruit == vec])
}
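As a sketch of the "keep it in a list" advice, the same split can be stored in a single named list and accessed by name. The name letters_by_fruit is made up for this example, and stringsAsFactors = FALSE is an assumption added so that no empty "cherry" group appears on older R versions.

```r
df <- data.frame(
  fruit = c(rep("apple", 5), rep("banana", 3), rep("cherry", 6), rep("date", 4)),
  letter = c("a", "b", "c", "d", "e", "a", "d", "f", "b", "c", "f", "p", "q",
             "r", "d", "p", "x", "y"),
  stringsAsFactors = FALSE
)
my_vector <- c("apple", "banana", "date")

# Keep only the rows whose fruit is in my_vector
wanted <- df[df$fruit %in% my_vector, ]

# One list element per fruit, named after the fruit
letters_by_fruit <- split(wanted$letter, wanted$fruit)

names(letters_by_fruit)
letters_by_fruit[["banana"]]
```

Looping over the list with lapply() or a for loop then works without creating or looking up global variables.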

How to avoid for loop when iterating through unique values in a column [R]

Let's assume that we have following toy data:
library(tidyverse)
data <- tibble(
subject = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3),
id1 = c("a", "a", "b", "a", "a", "a", "b", "a", "a", "b"),
id2 = c("b", "c", "c", "b", "c", "d", "c", "b", "c", "c")
)
which represents network relationships for each subject. For example, there are three unique subjects in the data, and the network for the first subject can be represented as this sequence of relations:
a -- b, a -- c, b -- c
The task is to compute centralities for each network. Using for loop this is straightforward:
library(igraph)
# Get unique subjects
subjects_uniq <- unique(data$subject)
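# Compute centrality of nodes for each graph, one list element per subject
centrality <- vector("list", length(subjects_uniq))
for (i in seq_along(subjects_uniq)) {
  # Filter on the subject value itself, not on the loop index
  current_data <- data %>% filter(subject == subjects_uniq[i]) %>% select(-subject)
  current_graph <- current_data %>% graph_from_data_frame(directed = FALSE)
  centrality[[i]] <- eigen_centrality(current_graph)$vector
}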
Question: My dataset is huge, so I wonder how to avoid the explicit for loop. Should I use apply() and its modern cousins (maybe map() in the purrr package)? Any suggestions are very welcome.
Here is an option using map
library(tidyverse)
library(igraph)
map(subjects_uniq, ~ data %>%
      filter(subject == .x) %>%
      select(-subject) %>%
      graph_from_data_frame(directed = FALSE) %>%
      {eigen_centrality(.)$vector})
#[[1]]
#a b c
#1 1 1
#[[2]]
# a b c d
#1.0000000 0.8546377 0.8546377 0.4608111
#[[3]]
#a b c
#1 1 1
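If filtering the full table once per subject turns out to be the slow part, one option is to partition the edge list a single time with split() and then apply a function to each piece. The sketch below uses base R only. node_degree() is a hypothetical stand-in (a plain degree count) for the eigen_centrality() call, chosen so that the example runs without igraph; whether this is actually faster on a given dataset would need to be measured.

```r
data <- data.frame(
  subject = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3),
  id1 = c("a", "a", "b", "a", "a", "a", "b", "a", "a", "b"),
  id2 = c("b", "c", "c", "b", "c", "d", "c", "b", "c", "c"),
  stringsAsFactors = FALSE
)

# Hypothetical per-network function: number of edges touching each node
node_degree <- function(edges) {
  tab <- table(c(edges$id1, edges$id2))
  setNames(as.integer(tab), names(tab))
}

# Partition the edge list once; list names are the subject ids as strings
by_subject <- split(data[c("id1", "id2")], data$subject)

# Apply the function to every subject's edge list
degrees <- lapply(by_subject, node_degree)
degrees[["2"]]
```

Replacing node_degree() with a function that builds the graph and returns eigen_centrality(...)$vector would give the same kind of result as the map() version above, as a list named by subject.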

create long list of variables based on existing variables

I have a long list of variables, and for each one I want to create a dummy variable. I am using the dplyr mutate code below to do this, but I know that in SAS something like an array could be used (so that I don't have to copy this line out multiple times). I just haven't been able to find an answer on Stack or anywhere else that fits.
Grade_Dist2 <- Grade_Dist2 %>% mutate(
ACCT2301_FA15_z = ifelse(ACCT2301_FA15 %in% c("A", "B", "C"), 1,
ifelse(ACCT2301_FA15 %in% c("D", "F", "W", "Q"), 0, NA)))
The columns/vars are arranged together--all vars in the table are similar except an ID var.
In the tidyverse you should probably look at something like mutate_all(), but in the meantime something like this base R solution should work:
# Names of all grade columns, e.g. ACCT2301_FA15
all_names <- grep("FA[0-9]+", names(Grade_Dist2), value = TRUE)
for (id in all_names) {
  cur_var <- Grade_Dist2[[id]]
  # Store the dummy in a new column with the suffix _z
  Grade_Dist2[[paste0(id, "_z")]] <-
    ifelse(cur_var %in% c("A", "B", "C"), 1,
           ifelse(cur_var %in% c("D", "F", "W", "Q"), 0, NA))
}
Here's a try at using a tidyverse approach with mutate_all as suggested by @BenBolker.
library(tidyverse)
Grade_Dist2 <- tibble(ACCT2301_FA15_z = c("A", "F", "C", "Z"))
Grade_Dist2 <- Grade_Dist2 %>%
mutate_all(., funs(if_else(. %in% c("A", "B", "C"), 1,
if_else(. %in% c("D", "F", "W", "Q"), 0, NA_real_))))
Grade_Dist2
#> # A tibble: 4 x 1
#> ACCT2301_FA15_z
#> <dbl>
#> 1 1
#> 2 0
#> 3 1
#> 4 NA
If you want to append the dummy variables to the existing data instead of overwriting then
mutate_all(., funs("dummy" = if_else(. %in% c("A", "B", "C"), 1,
if_else(. %in% c("D", "F", "W", "Q"), 0, NA_real_))))
will append variables with names like ACCT2301_FA15_z_dummy (or be called dummy if there is only one variable being mutated).
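The same recoding can also be sketched without a loop in base R by applying one function to all grade columns with lapply(). The helper to_dummy(), the ID column and the second course column MATH1314_FA15 are assumptions made up for this illustration; only ACCT2301_FA15 comes from the question.

```r
Grade_Dist2 <- data.frame(
  ID = 1:4,
  ACCT2301_FA15 = c("A", "F", "C", "Z"),
  MATH1314_FA15 = c("W", "B", NA, "D"),  # hypothetical second course
  stringsAsFactors = FALSE
)

# Hypothetical helper: 1 for passing grades, 0 for failing or withdrawn, NA otherwise
to_dummy <- function(g) {
  ifelse(g %in% c("A", "B", "C"), 1,
         ifelse(g %in% c("D", "F", "W", "Q"), 0, NA))
}

# Every column except the ID column is treated as a grade column
grade_cols <- setdiff(names(Grade_Dist2), "ID")

# Append one _z column per grade column in a single assignment
Grade_Dist2[paste0(grade_cols, "_z")] <- lapply(Grade_Dist2[grade_cols], to_dummy)
names(Grade_Dist2)
```

Unrecognized grades such as "Z" and missing grades both end up as NA in the new columns.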

R: efficient way of assigning factor levels

I have a factor vector. Some values can be repeated. The values are not known beforehand, but can be sorted. For example,
x1 <- factor(c("A", "C", "C", "A", "B" ), levels=c("A", "B", "C"))
x2 <- factor(c("E", "C", "C", "D", "B" ), levels=c("B", "C", "D", "E"))
I want to create another vector, in which each value is either "last", "other" or "first", and the values correspond to the first or last factor level. In the above case, the resulting vector y1 would have to be c("first", "last", "last", "first", "other"), while y2 would have to be c("last", "other", "other", "other", "first").
Currently, I do it like this:
f2l <- function(x) {
x <- as.numeric(x)
y <- rep("other", length(x))
y[ x == max(x) ] <- "last"
y[ x == min(x) ] <- "first"
y
}
This works as intended, but I wonder whether there is a more efficient solution.
You can reassign level labels using a list.
x1 <- factor(c("A", "C", "C", "A", "B" ), levels=c("A", "B", "C"))
x2 <- factor(c("E", "C", "C", "D", "B" ), levels=c("B", "C", "D", "E"))
f2l <- function(x){
levels(x) <- list("first" = levels(x)[1],
"other" = levels(x)[-c(1, nlevels(x))],
"last" = levels(x)[nlevels(x)])
x
}
f2l(x1)
f2l(x2)
Apart from Benjamin's method, if you are sure that the number of levels would be more than 2, you can use
f2l <- function(x) {
  levels(x) <- c("first", rep("other", length(levels(x)) - 2), "last")
  x
}
If you are doing this for many factors, Benjamin's method is slower than the method above. The times for 100,000 repetitions are:
Benjamin
user system elapsed
26.58 0.00 26.68
Saksham
user system elapsed
17.15 0.08 18.30
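A rough sketch of how such a comparison might be set up with system.time(). The names f2l_list and f2l_vec are labels chosen here to tell the two answers apart, and n is kept small so the example runs quickly; actual timings depend on the machine and R version, so none are shown.

```r
x2 <- factor(c("E", "C", "C", "D", "B"), levels = c("B", "C", "D", "E"))

# List-based relabelling (first answer)
f2l_list <- function(x) {
  levels(x) <- list("first" = levels(x)[1],
                    "other" = levels(x)[-c(1, nlevels(x))],
                    "last"  = levels(x)[nlevels(x)])
  x
}

# Vector-based relabelling (second answer); duplicate labels are merged by levels<-
f2l_vec <- function(x) {
  levels(x) <- c("first", rep("other", length(levels(x)) - 2), "last")
  x
}

n <- 1000  # number of repetitions, far fewer than in the timings above
t_list <- system.time(for (i in seq_len(n)) f2l_list(x2))
t_vec  <- system.time(for (i in seq_len(n)) f2l_vec(x2))

# Both versions should label the values identically
identical(as.character(f2l_list(x2)), as.character(f2l_vec(x2)))
```

Checking that the two functions agree before timing them guards against comparing implementations that do different things.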

How to filter a column by multiple, flexible criteria

I'm writing a function to aggregate a dataframe, and it needs to be generally applicable to a wide variety of datasets. One step in this function is dplyr's filter function, used to select from the data only the ad campaign types relevant to the task at hand. Since I need the function to be flexible, I want ad_campaign_types as an input, but this makes filtering kind of hairy, like so:
aggregate_data <- function(ad_campaign_types) {
raw_data %>%
filter(ad_campaign_type == ad_campaign_types) -> agg_data
agg_data
}
new_data <- aggregate_data(ad_campaign_types = c("campaign_A", "campaign_B", "campaign_C"))
I would think the above would work, but while it runs, oddly enough it returns only a small fraction of what the filtered dataset should be. Is there a better way to do this?
Another tiny example of reproducible code:
ad_types <- c("a", "a", "a", "b", "b", "c", "c", "c", "d", "d")
revenue <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
data <- as.data.frame(cbind(ad_types, revenue))
# Now, filtering to select only ad types "a", "b", and "d",
# which should leave us with only 7 values
new_data <- filter(data, ad_types == c("a", "b", "d"))
nrow(new_data)
[1] 3
For multiple criteria, use the %in% operator:
filter(data, ad_types %in% c("a", "b", "d"))
You can also use a "not in" criterion:
filter(data, !(ad_types %in% c("a", "b", "d")))
However, notice that %in% behaves a little differently from == when NA is involved:
> c(2, NA) == 2
[1] TRUE NA
> c(2, NA) %in% 2
[1] TRUE FALSE
Some find one of these more intuitive than the other, but you have to remember the difference.
To combine several different criteria, simply chain them with & (and) or | (or):
filter(mtcars, cyl > 2 & wt < 2.5 & gear == 4)
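The reason == returned only three rows in the question is vector recycling: the shorter vector is repeated to the length of the longer one and the comparison is then made position by position. A small base R sketch of this follows; the names wanted, recycled, eq_mask and in_mask are introduced here for illustration.

```r
ad_types <- c("a", "a", "a", "b", "b", "c", "c", "c", "d", "d")
wanted <- c("a", "b", "d")

# What == effectively compares against: wanted repeated to length 10
recycled <- rep(wanted, length.out = length(ad_types))
recycled

# Element-wise comparison: TRUE only where the positions happen to line up
eq_mask <- ad_types == recycled

# Set membership: TRUE wherever the value appears anywhere in wanted
in_mask <- ad_types %in% wanted

sum(eq_mask)  # rows kept by ==
sum(in_mask)  # rows kept by %in%
```

Here sum(eq_mask) is 3, matching the nrow() result in the question, while sum(in_mask) is the expected 7.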
Tim is correct about filtering a dataframe. However, if you want to write a function with dplyr, you need to follow the instructions on this webpage: https://rpubs.com/hadley/dplyr-programming.
Here is the code I would suggest:
library(tidyverse)
ad_types <- c("a", "a", "a", "b", "b", "c", "c", "c", "d", "d")
revenue <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
df <- tibble(ad_types = as.factor(ad_types), revenue = revenue)
aggregate_data <- function(df, ad_types, my_list) {
  df %>%
    # {{ }} (rlang >= 0.4.0) captures and unquotes the column in one step,
    # replacing the older enquo() + UQ() pair
    filter({{ ad_types }} %in% my_list)
}
new_data <- aggregate_data(df = df, ad_types = ad_types,
                           my_list = c("a", "b", "c"))
That should work!
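If tidy evaluation is more machinery than the task needs, a base R sketch that takes the column name as a string is another possibility. The function name filter_by_values and its arguments are hypothetical and not part of dplyr.

```r
ad_types <- c("a", "a", "a", "b", "b", "c", "c", "c", "d", "d")
revenue <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
df <- data.frame(ad_types, revenue, stringsAsFactors = FALSE)

# Hypothetical helper: keep the rows whose `column` value is one of `keep`
filter_by_values <- function(data, column, keep) {
  stopifnot(column %in% names(data))  # fail early on a misspelled column name
  data[data[[column]] %in% keep, , drop = FALSE]
}

new_data <- filter_by_values(df, "ad_types", c("a", "b", "c"))
nrow(new_data)
```

Passing the column as a string keeps the function easy to call from other code, at the cost of the bare-name syntax that dplyr verbs offer.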
