create long list of variables based on existing variables - r

I have a long list of variables, and for each one I want to create a dummy variable. I am using the dplyr mutate code below to do this, but I know that something like an array in SAS could be used instead (so I don't have to copy this line out multiple times). I just haven't been able to find an answer on Stack Overflow or anywhere else that fits.
Grade_Dist2 <- Grade_Dist2 %>% mutate(
  ACCT2301_FA15_z = ifelse(ACCT2301_FA15 %in% c("A", "B", "C"), 1,
                    ifelse(ACCT2301_FA15 %in% c("D", "F", "W", "Q"), 0, NA)))
The columns/vars are arranged together; all vars in the table are similar except an ID var.

In the tidyverse you should probably look at something like mutate_all(), but in the meantime I would think a base R solution like this would work:
all_names <- grep("FA[0-9]+", names(Grade_Dist2), value = TRUE)
for (id in all_names) {
  cur_var <- Grade_Dist2[[id]]
  Grade_Dist2[[paste0(id, "_z")]] <-
    ifelse(cur_var %in% c("A", "B", "C"), 1,
    ifelse(cur_var %in% c("D", "F", "W", "Q"), 0, NA))
}

Here's a try at a tidyverse approach with mutate_all, as suggested by @BenBolker.
library(tidyverse)
Grade_Dist2 <- tibble(ACCT2301_FA15_z = c("A", "F", "C", "Z"))
Grade_Dist2 <- Grade_Dist2 %>%
  mutate_all(., funs(if_else(. %in% c("A", "B", "C"), 1,
                     if_else(. %in% c("D", "F", "W", "Q"), 0, NA_real_))))
Grade_Dist2
#> # A tibble: 4 x 1
#>   ACCT2301_FA15_z
#>             <dbl>
#> 1               1
#> 2               0
#> 3               1
#> 4              NA
If you want to append the dummy variables to the existing data instead of overwriting, then
mutate_all(., funs("dummy" = if_else(. %in% c("A", "B", "C"), 1,
                   if_else(. %in% c("D", "F", "W", "Q"), 0, NA_real_))))
will append variables with names like ACCT2301_FA15_z_dummy (or the new variable will simply be called dummy if only one variable is being mutated).
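On dplyr 1.0 and later, where mutate_all() and funs() are superseded, the same idea can be sketched with across(). This is only a sketch: it borrows the FA[0-9]+ column pattern from the base R answer above and appends columns with a _dummy suffix, as described.
# assumes dplyr >= 1.0; keeps the original columns and appends <column>_dummy versions
Grade_Dist2 %>%
  mutate(across(matches("FA[0-9]+"),
                ~ if_else(.x %in% c("A", "B", "C"), 1,
                  if_else(.x %in% c("D", "F", "W", "Q"), 0, NA_real_)),
                .names = "{.col}_dummy"))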

Related

How to check if vector elements are in grouped column of a dataframe and add a binary column (yes, no)

I have a dataframe with one column of letters A to Z (name_x) and another column with values (value_x).
For simplicity, these 26 rows stand in for a group of many rows.
I want to check whether the values in the vector vocals <- c("A", "E", "I", "O", "U") appear in the column name_x of the dataframe df, and append a third column to the dataframe with 1 if yes and 0 if no.
I tried it with case_when function from dplyr and get this error:
Error: Problem with `mutate()` input `vocal_yes`.
x Input `vocal_yes` can't be recycled to size 26.
i Input `vocal_yes` is `case_when(vocals %in% name_x ~ 1, TRUE ~ 0)`.
i Input `vocal_yes` must be size 26 or 1, not 5.
Run `rlang::last_error()` to see where the error occurred.
I understand the problem. Is there a way to overcome it? Many thanks.
The code:
library(dplyr)
# constructing the dataframe
name_x <- LETTERS[1:26]
value_x <- sample.int(100, 26)
df <- data.frame(name_x, value_x)
# vector vocals
vocals <- c("A", "E", "I", "O", "U")
# vector consonant
consonant <- c("B", "C", "D", "F", "G", "H", "J", "K", "L", "M", "N", "P", "Q", "R", "S", "T", "V", "W", "X", "Y", "Z")
df1 <- df %>%
  mutate(vocal_yes = case_when(vocals %in% name_x ~ 1,
                               TRUE ~ 0))
Swap the two sides of %in% so the condition is evaluated once per row of name_x:
df1 <- df %>%
  mutate(vocal_yes = case_when(name_x %in% vocals ~ 1,
                               TRUE ~ 0))
This solves the problem as long as each entry of name_x is a single letter.
For a better understanding, try this simple comparison:
vocals <- c('a', 'e', 'i', 'o', 'u')
letters %in% vocals
vocals %in% letters
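The key difference is the length of the result, which follows the left-hand side of %in%:
length(letters %in% vocals)   # 26: one TRUE/FALSE per row, which is what mutate() needs
length(vocals %in% letters)   #  5: one TRUE/FALSE per vowel, which triggers the recycling error above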
We can also do this without case_when:
library(dplyr)
df <- df %>%
  mutate(vocal_yes = +(name_x %in% vocals))
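The unary + just coerces the logical vector to 0/1; an equivalent, slightly more explicit sketch:
df %>%
  mutate(vocal_yes = as.integer(name_x %in% vocals))   # same 0/1 result as +(...)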

Replace values in vector where not %in% vector

Short question:
I can substitute certain variable values like this:
values <- c("a", "b", "a", "b", "c", "a", "b")
df <- data.frame(values)
What's the easiest way to replace every value of df$values that is neither "a" nor "b" with "x"?
Output should be:
c("a", "b", "a", "b", "x", "a", "b")
Your example is a bit unclear and not reproducible.
However, based on a guess at what you actually want, you could try this option using the data.table package:
library(data.table)
setDT(df)[!values %in% c("a", "b"), values := "x"]
or the dplyr package:
df %>% mutate(values = ifelse(values %in% c("a", "b"), values, "x"))
What about:
df[!df[, 1] %in% c("a", "b"), ] <- "x"
  values
1      a
2      b
3      a
4      b
5      x
6      a
7      b
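If you would rather not assign to whole rows, a column-targeted base R sketch does the same thing (assuming values is a character column, the default in recent R versions):
df$values[!df$values %in% c("a", "b")] <- "x"   # replace only within the values column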

How to drop observations based on conditions

I have a subset of data with a total count for each observation from a bigger dataset. Where several rows share the same name, I want to keep only the row with the highest count and drop the codes that appear less often. How would I go about that? So for instance:
name = c("a", "a", "b", "b", "b", "c", "d", "e", "e", "e")
code = c(1,1,2,3,4,1,1,2,2,3)
n = c(1,10,2,3,5,4,8,100,90,40)
data = data.frame(name,code,n)
The end product would be left with these rows:
name = c("a", "b", "c", "d", "e")
code = c(1,4,1,1,2)
n = c(10,5,4,8,100)
data2 = data.frame(name,code,n)
If you can use dplyr, this should do the trick:
library(dplyr)
data %>%
  group_by(name) %>%
  filter(n == max(n)) %>%
  ungroup()
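On the example data above this should return something like the data2 the question asks for (note that ties for the maximum n would keep more than one row per name):
#   name  code     n
# 1 a        1    10
# 2 b        4     5
# 3 c        1     4
# 4 d        1     8
# 5 e        2   100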

How to avoid for loop when iterating through unique values in a column [R]

Let's assume that we have the following toy data:
library(tidyverse)
data <- tibble(
  subject = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3),
  id1 = c("a", "a", "b", "a", "a", "a", "b", "a", "a", "b"),
  id2 = c("b", "c", "c", "b", "c", "d", "c", "b", "c", "c")
)
which represents network relationships for each subject. For example, there are three unique subjects in the data, and the network for the first subject could be represented as a sequence of relations:
a -- b, a -- c, b -- c
The task is to compute centralities for each network. Using a for loop, this is straightforward:
library(igraph)
# Get unique subjects
subjects_uniq <- unique(data$subject)
# Compute centrality of nodes for each graph
centrality <- vector("list", length(subjects_uniq))
for (i in seq_along(subjects_uniq)) {
  current_data <- data %>% filter(subject == subjects_uniq[i]) %>% select(-subject)
  current_graph <- current_data %>% graph_from_data_frame(directed = FALSE)
  centrality[[i]] <- eigen_centrality(current_graph)$vector
}
Question: my dataset is huge, so I wonder how to avoid the explicit for loop. Should I use apply() and its modern cousins (maybe map() in the purrr package)? Any suggestions are greatly welcome.
Here is an option using map:
library(tidyverse)
library(igraph)
map(subjects_uniq, ~ data %>%
      filter(subject == .x) %>%
      select(-subject) %>%
      graph_from_data_frame(directed = FALSE) %>%
      {eigen_centrality(.)$vector})
#[[1]]
#a b c
#1 1 1
#
#[[2]]
#        a         b         c         d
#1.0000000 0.8546377 0.8546377 0.4608111
#
#[[3]]
#a b c
#1 1 1
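If the dataset is really huge, the repeated filter(subject == .x) can be avoided by splitting the data once. A sketch of that alternative, using the same packages as above (the result is a named list keyed by subject):
data %>%
  split(.$subject) %>%                      # one data frame per subject
  map(~ .x %>%
        select(-subject) %>%
        graph_from_data_frame(directed = FALSE) %>%
        {eigen_centrality(.)$vector})       # named vector of centralities per network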

How to filter a column by multiple, flexible criteria

I'm writing a function to aggregate a dataframe, and it needs to be generally applicable to a wide variety of datasets. One step in this function is dplyr's filter function, used to select from the data only the ad campaign types relevant to the task at hand. Since I need the function to be flexible, I want ad_campaign_types as an input, but this makes filtering kind of hairy, like so:
aggregate_data <- function(ad_campaign_types) {
  raw_data %>%
    filter(ad_campaign_type == ad_campaign_types) -> agg_data
  agg_data
}
new_data <- aggregate_data(ad_campaign_types = c("campaign_A", "campaign_B", "campaign_C"))
I would think the above would work, but while it runs, oddly enough it returns only a small fraction of what the filtered dataset should be. Is there a better way to do this?
Another tiny example of reproducible code:
ad_types <- c("a", "a", "a", "b", "b", "c", "c", "c", "d", "d")
revenue <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
data <- as.data.frame(cbind(ad_types, revenue))
# Now, filtering to select only ad types "a", "b", and "d",
# which should leave us with only 7 values
new_data <- filter(data, ad_types == c("a", "b", "d"))
nrow(new_data)
[1] 3
For multiple criteria, use the %in% function:
filter(data, ad_types %in% c("a", "b", "d"))
You can also use a "not in" criterion:
filter(data, !(ad_types %in% c("a", "b", "d")))
However, notice that %in%'s behavior is a little different from ==:
> c(2, NA) == 2
[1] TRUE NA
> c(2, NA) %in% 2
[1] TRUE FALSE
Some find one of these more intuitive than the other, but you have to remember the difference.
As for combining multiple criteria, simply chain them with and/or statements:
filter(mtcars, cyl > 2 & wt < 2.5 & gear == 4)
Tim is correct about filtering a dataframe. However, if you want to wrap this in a function with dplyr, you need to follow the instructions at this page: https://rpubs.com/hadley/dplyr-programming.
Here is the code I would suggest:
library(tidyverse)
ad_types <- c("a", "a", "a", "b", "b", "c", "c", "c", "d", "d")
revenue <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
df <- data_frame(ad_types = as.factor(ad_types), revenue = revenue)
aggregate_data <- function(df, ad_types, my_list) {
  ad_types = enquo(ad_types)            # make ad_types a quosure
  df %>%
    filter(UQ(ad_types) %in% my_list)   # unquote it inside filter()
}
new_data <- aggregate_data(df = df, ad_types = ad_types,
                           my_list = c("a", "b", "c"))
That should work!
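On newer versions of rlang (0.4.0 and later), the enquo()/UQ() pair can be written with the {{ }} ("curly-curly") operator. A minimal sketch, assuming a recent rlang/dplyr, that should behave the same as the function above:
aggregate_data <- function(df, ad_types, my_list) {
  df %>%
    filter({{ ad_types }} %in% my_list)   # {{ }} quotes and unquotes in one step
}
new_data <- aggregate_data(df = df, ad_types = ad_types, my_list = c("a", "b", "c"))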
