Extract separate columns from data frame - r

I am trying to extract the prefixes of the column names, such as packs_10, packs_18 and packs_20, and sum the columns that share a prefix. Below you can see my data:
df<-data.frame(
packs_10_value5=c(100,0,0,0,0),
packs_18_value9=c(200,0,0,0,0),
packs_20_value13=c(300,0,0,0,0),
packs_10_value15=c(100,0,0,0,0),
packs_18_value17=c(200,0,0,0,0),
packs_20_value18=c(300,0,0,0,0)
)
df
So can anybody help me with how to solve this?

You can split the columns and apply rowSums by group:
library(purrr)
split.default(df, f = gsub("_value.*", "", names(df))) %>%
map_dfc(rowSums)
# A tibble: 5 × 3
packs_10 packs_18 packs_20
<dbl> <dbl> <dbl>
1 200 400 600
2 0 0 0
3 0 0 0
4 0 0 0
5 0 0 0
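For comparison, the same split-then-rowSums idea can be sketched in base R without purrr. This is only a sketch: the names groups and sums are illustrative and not part of the original answer.

```r
df <- data.frame(
  packs_10_value5  = c(100, 0, 0, 0, 0),
  packs_18_value9  = c(200, 0, 0, 0, 0),
  packs_20_value13 = c(300, 0, 0, 0, 0),
  packs_10_value15 = c(100, 0, 0, 0, 0),
  packs_18_value17 = c(200, 0, 0, 0, 0),
  packs_20_value18 = c(300, 0, 0, 0, 0)
)

# Strip the "_value..." suffix so columns sharing a prefix get the same label
groups <- sub("_value.*", "", names(df))

# split.default() splits the data frame column-wise by label;
# rowSums() then adds up the columns inside each group
sums <- as.data.frame(sapply(split.default(df, groups), rowSums))
sums
```

sapply() returns a matrix here, so the as.data.frame() call is only needed if a data frame is wanted downstream.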

A bit more convoluted/less elegant, but you could also create a row_number column, pivot_longer by all other columns, do the regex, aggregate by the clean column name and each row and pivot_wider back:
library(dplyr)
library(tidyr)
df %>%
mutate(rn = row_number()) %>%
pivot_longer(cols = -rn) %>%
group_by(name = sub('_value\\d+', '', name), rn) %>%
summarise(value = sum(value, na.rm = TRUE)) %>%
pivot_wider(names_from = 'name', values_from = 'value') %>%
select(-rn)
Output:
# A tibble: 5 x 3
packs_10 packs_18 packs_20
<dbl> <dbl> <dbl>
1 200 400 600
2 0 0 0
3 0 0 0
4 0 0 0
5 0 0 0

Related

Rename and create indicator variables?

I have a data frame in which one of the columns ('subject') has a bit of an odd format. I would like to eliminate all observations where the first digit is greater than one. Additionally, I would like to create indicator variables for the remaining observations, and eliminate the number from the beginning.
So I want this:
Subject
1; HMB 2 (HB)
1; HRB 4 (HB-R)
2; HRB 1 (HB-L); HRB4
1; HRB 2 (HB-L)
To become this:
HMB 2 (HB) HRB 4 (HB-R) HRB 2 (HB-L)
1 0 0
0 1 0
0 0 1
You can use separate to split the data into different columns, keep the observations whose leading number is less than or equal to 1, and reshape the result to wide format.
library(dplyr)
library(tidyr)
df %>%
separate(Subject, c('col1', 'col2'),
sep = ';', extra = 'drop', convert = TRUE) %>%
filter(col1 <= 1) %>%
mutate(col1 = 1,
row = row_number()) %>%
pivot_wider(names_from = col2, values_from = col1, values_fill = 0) %>%
select(-row)
# ` HMB 2 (HB)` ` HRB 4 (HB-R)` ` HRB 2 (HB-L)`
# <dbl> <dbl> <dbl>
#1 1 0 0
#2 0 1 0
#3 0 0 1
data
df <- structure(list(Subject = c("1; HMB 2 (HB)", "1; HRB 4 (HB-R)",
"2; HRB 1 (HB-L); HRB4", "1; HRB 2 (HB-L)")),
class = "data.frame", row.names = c(NA, -4L))
Here is a more generic dplyr approach. You can separate the values into rows and then filter by group. In this way, you can avoid specifying the columns to be created for each Subject.
library(dplyr)
library(tidyr)
df %>%
mutate(id = row_number(), value = 1L) %>%
separate_rows(Subject, sep = ";\\s*") %>%
group_by(id) %>%
filter(row_number() > 1L & as.integer(Subject[[1L]]) < 2L) %>%
pivot_wider(names_from = "Subject", values_fill = 0L)
Output
# A tibble: 3 x 4
# Groups: id [3]
id `HMB 2 (HB)` `HRB 4 (HB-R)` `HRB 2 (HB-L)`
<int> <int> <int> <int>
1 1 1 0 0
2 2 0 1 0
3 4 0 0 1
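For comparison, the same three steps (split on the semicolon, keep rows whose leading number is at most 1, build one indicator column per remaining label) can be sketched in base R. The variable names parts, first, label and ind are illustrative only, and the sketch assumes, as in the example data, that the label of interest is always the second piece.

```r
df <- data.frame(
  Subject = c("1; HMB 2 (HB)", "1; HRB 4 (HB-R)",
              "2; HRB 1 (HB-L); HRB4", "1; HRB 2 (HB-L)"),
  stringsAsFactors = FALSE
)

# Split each entry on ";" plus any following whitespace
parts <- strsplit(df$Subject, ";\\s*")
first <- as.integer(vapply(parts, `[`, character(1), 1))  # leading number
label <- vapply(parts, `[`, character(1), 2)              # second piece

# Keep only observations whose leading number is <= 1
label <- label[first <= 1]

# One 0/1 column per distinct label, in order of first appearance
ind <- +(outer(label, unique(label), "=="))
colnames(ind) <- unique(label)
ind
```

Splitting on ";\\s*" rather than ";" avoids the leading space that otherwise ends up in the column names.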
Does this work:
library(tidyr)
library(dplyr)
df %>% separate(col = Subject, into = c('count','Subject', 'Subject2'), sep = ';') %>%
filter(!count > 1) %>% select(1,2) %>% type.convert(as.is = TRUE) %>%
mutate(ID = row_number()) %>% pivot_wider(id_cols = ID, names_from = Subject, values_from = count, values_fill = 0) %>%
select(-ID)
# A tibble: 3 x 3
` HMB 2 (HB)` ` HRB 4 (HB-R)` ` HRB 2 (HB-L)`
<int> <int> <int>
1 1 0 0
2 0 1 0
3 0 0 1

Mark row before count starts again

shift = c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3)
count =c(1,2,3,4,5,6,7,8,1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7)
test <- cbind(shift,count)
So I am trying to mark the last row of every shift (the rows with count = c(8,10,7)) with a binary 1 and every other row with 0. Right now I am thinking this might be possible with a left join, but I am not quite sure. I would prefer not to work with loops but rather use some techniques from dplyr. Thanks guys!
Assuming that you want to add a new 0/1 column last that contains a 1 in the last row of each shift and that the shifts are contiguous, here are two base R approaches:
transform(test, last = ave(count, shift, FUN = function(x) x == max(x)))
transform(test, last = +!duplicated(shift, fromLast = TRUE))
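or with dplyr, use mutate:
library(dplyr)
# flag the row whose position within the shift equals the group size
test %>%
as.data.frame %>%
group_by(shift) %>%
mutate(last = +(1:n() == n())) %>%
ungroup
# or, without grouping, flag the last occurrence of each shift value
test %>%
as.data.frame %>%
mutate(last = +!duplicated(shift, fromLast = TRUE))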
Try this one
library(dplyr)
test %>%
as_tibble() %>%
group_by(shift) %>%
mutate(is_last = ifelse( row_number() == max(row_number()), 1, 0)) %>%
ungroup()
# A tibble: 25 x 3
shift count is_last
<dbl> <dbl> <dbl>
1 1 1 0
2 1 2 0
3 1 3 0
4 1 4 0
5 1 5 0
6 1 6 0
7 1 7 0
8 1 8 1
9 2 1 0
10 2 2 0
# … with 15 more rows
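If the shifts are contiguous, as assumed in the answers above, "last row of a shift" can also be read as "row after which shift changes". A minimal base R sketch of that reading follows. It builds test as a data frame rather than the matrix from the question, and the column name last is just an illustration.

```r
shift <- c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3)
count <- c(1,2,3,4,5,6,7,8,1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7)
test <- data.frame(shift, count)

# diff() is non-zero exactly where the next row belongs to a new shift;
# the final row is always the last row of its shift, hence the trailing TRUE
test$last <- as.integer(c(diff(test$shift) != 0, TRUE))

test[test$last == 1L, ]
```

This would not work if the same shift value could reappear later in the data. In that case the grouped dplyr version is the safer choice.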

How to filter for a combination of list arguments and multiple character strings in dplyr

Given a dataframe:
v1_attr1 <- c(1,0,0,0,1,0,0,0,1,1) %>% as.integer ()
v1_attr2 <- c(0,1,0,0,1,1,1,1,1,1) %>% as.integer ()
v2_attr1 <- c(0,0,1,0,0,1,1,1,0,0) %>% as.integer ()
v2_attr2 <- c(0,0,0,1,0,1,1,1,0,0) %>% as.integer ()
df <- data.frame (v1_attr1, v1_attr2, v2_attr1, v2_attr2)
How can I set a filter for the attr of each v[[x]]?
I tried the following code to get the number of rows in each data.frame filtered by attr.
library(dplyr)
# create list for vs
list_vs <- list ("v1", "v2")
# set multiple attr filter for each v[[x]] to get the respective number of rows in each filtered data.frame (presented in a list)
filtered <- lapply (list_vs, function (x){
df %>% filter (noquote(paste0(list_vs[[x]], "_attr1")) == 1 | noquote(paste0(list_vs[[x]], "_attr2")) == 1) %>%
nrow ()
})
Although this code doesn't return an error, the result for filtered[[x]] is always 0. How do I need to set the filter arguments correctly to get the desired number of rows in each data.frame? I used noquote because otherwise filtering arguments would be pasted in quotes.
One dplyr and purrr option could be:
map(.x = list_vs,
~ df %>%
filter_at(vars(starts_with(.x)), any_vars(. == 1)))
[[1]]
v1_attr1 v1_attr2 v2_attr1 v2_attr2
1 1 0 0 0
2 0 1 0 0
3 1 1 0 0
4 0 1 1 1
5 0 1 1 1
6 0 1 1 1
7 1 1 0 0
8 1 1 0 0
[[2]]
v1_attr1 v1_attr2 v2_attr1 v2_attr2
1 0 0 1 0
2 0 0 0 1
3 0 1 1 1
4 0 1 1 1
5 0 1 1 1
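Since the question asks for the number of rows per filter, here is a hedged sketch that returns the counts directly. It assumes dplyr 1.0.4 or later, where if_any() is available as the replacement for the superseded filter_at()/any_vars() pair. The argument name prefix is illustrative.

```r
library(dplyr)

df <- data.frame(
  v1_attr1 = c(1L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 1L),
  v1_attr2 = c(0L, 1L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L),
  v2_attr1 = c(0L, 0L, 1L, 0L, 0L, 1L, 1L, 1L, 0L, 0L),
  v2_attr2 = c(0L, 0L, 0L, 1L, 0L, 1L, 1L, 1L, 0L, 0L)
)
list_vs <- list("v1", "v2")

# For each prefix, keep rows where any column starting with it equals 1,
# then count the remaining rows
counts <- vapply(list_vs, function(prefix) {
  df %>%
    filter(if_any(starts_with(prefix), ~ .x == 1)) %>%
    nrow()
}, integer(1))

counts
```

The original attempt returned 0 because filter() compared the pasted string itself with 1 instead of looking up the column of that name. Selecting columns with starts_with() avoids building column names as strings altogether.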
Another option is to convert to 'long' format with pivot_longer, picking up the group and attribute automatically from the column names, and then do a group_by and filter_at:
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = everything(), names_sep = "_",
names_to = c('group', '.value' )) %>%
group_by(group) %>%
filter_at(vars(-group_cols()), any_vars(. == 1))
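To get the row counts the question asks for from the long format, the filter step can be swapped for a summarise. This is only a sketch: it assumes the attribute columns are named attr1 and attr2 after reshaping, which holds for the example data, and the result name counts is illustrative.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  v1_attr1 = c(1L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 1L),
  v1_attr2 = c(0L, 1L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L),
  v2_attr1 = c(0L, 0L, 1L, 0L, 0L, 1L, 1L, 1L, 0L, 0L),
  v2_attr2 = c(0L, 0L, 0L, 1L, 0L, 1L, 1L, 1L, 0L, 0L)
)

counts <- df %>%
  # "v1_attr1" becomes group = "v1" and a value column named "attr1"
  pivot_longer(cols = everything(), names_sep = "_",
               names_to = c("group", ".value")) %>%
  group_by(group) %>%
  # count the rows per group where at least one attribute equals 1
  summarise(n = sum(attr1 == 1 | attr2 == 1))

counts
```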

separate_columns for tidyr

Let's say I had a survey question that read:
What did you eat?
[ ] apple
[ ] pear
[x] banana
[x] grapes
Now, I have the endorsed options as comma-separated strings in one variable.
I wrote myself a little helper to turn this comma-separated list of answers into boolean dummies showing whether each box was checked.
df <- data.frame(
x = 1:5,
ate = c("apple", "apple, pear, banana", "banana, grapes", NA_character_, ""),
stringsAsFactors = FALSE
)
separate_columns <- function(df, col, convert = TRUE, sep = ", ") {
colname <- deparse(substitute(col))
# sorry about this ugly non-rlang approach, hoping not to reuse this
df$.splitcol <- df %>% pull(colname)
separate_rows(df, .splitcol, convert = convert, sep = sep) %>%
mutate(.splitcol = stringr::str_c(colname, "_", .splitcol), value = 1) %>%
mutate(.splitcol = if_else(is.na(.splitcol), stringr::str_c(colname, "_nonresponse"), .splitcol)) %>%
spread(.splitcol, value, fill = 0) %>%
select(-colname)
}
separate_columns(df, ate)
Gets me to this:
x ate_apple ate_banana ate_grapes ate_nonresponse ate_pear
1 1 0 0 0 0
2 1 1 0 0 1
3 0 1 1 0 0
4 0 0 0 1 0
5 0 0 0 1 0
Writing the helper felt clunky, and I feel like I'm missing a more tidyverse way of accomplishing the same transformation (despite lots of searching).
Also, I found no easy way for missings to propagate using this method (I'd prefer if all dummies would be missing if the response was NA, but 0 if it was an empty string). So, I'd rather get this
x ate_apple ate_banana ate_grapes ate_pear
1 1 0 0 0
2 1 1 0 1
3 0 1 1 0
4 NA NA NA NA
5 0 0 0 0
Is there a nicer tidyverse way?
After reshaping to 'long' format by splitting the 'ate' column on the delimiter ", ", create a column of 1s and spread from 'long' to 'wide':
library(tidyverse)
df %>%
separate_rows(ate, sep=", ", convert = TRUE) %>%
mutate(ate = replace(ate, is.na(ate), "NA"),
n = paste(NA ^ (ate == "NA")),
ate = paste0("ate_", replace(ate, ate == "", "nonresponse" ))) %>%
spread(ate, n, fill = "0") %>%
mutate_at(vars(-x, -ate_NA),
funs(replace(as.integer(.), ate_NA=="NA", NA_integer_))) %>%
select(-ate_NA)
# x ate_apple ate_banana ate_grapes ate_nonresponse ate_pear
#1 1 1 0 0 0 0
#2 2 1 1 0 0 1
#3 3 0 1 1 0 0
#4 4 NA NA NA NA NA
#5 5 0 0 0 1 0
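For reference, the output the question asks for (all dummies NA for a missing response, all 0 for an empty string) can also be sketched in base R. The names items, options, dummies and result are illustrative. The sketch relies on strsplit() returning NA for a missing string and character(0) for an empty one.

```r
df <- data.frame(
  x = 1:5,
  ate = c("apple", "apple, pear, banana", "banana, grapes", NA_character_, ""),
  stringsAsFactors = FALSE
)

# One character vector of answers per respondent
items <- strsplit(df$ate, ", ", fixed = TRUE)

# All distinct answers; sort() drops the NA coming from the missing response
options <- sort(unique(unlist(items)))
options <- options[options != ""]

# One row of 0/1 flags per respondent, or all NA when the response is missing
dummies <- t(vapply(items, function(ans) {
  if (length(ans) == 1 && is.na(ans)) {
    rep(NA_integer_, length(options))
  } else {
    as.integer(options %in% ans)
  }
}, integer(length(options))))
colnames(dummies) <- paste0("ate_", options)

result <- cbind(df["x"], dummies)
result
```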
I take a different approach, by first extracting what was there to eat and then matching it in the data:
total_eat_list <- map(df$ate, str_split, pattern = ",") %>%
unlist() %>%
str_trim() %>%
na.exclude() %>%
unique()
Remove empty strings:
total_eat_list <- total_eat_list[total_eat_list != ""]
total_eat_list
# [1] "apple" "pear" "banana" "grapes"
Now let's map everything in the original data:
map_df(total_eat_list, ~
df %>%
mutate(ate_what = str_c("ate_", .x),
ind = case_when(str_detect(string = df$ate, .x) ~ 1,
!str_detect(string = df$ate, .x) ~ 0,
TRUE ~ NA_real_))) %>%
spread(ate_what, ind) %>%
select(-ate)
# A tibble: 5 x 5
# x ate_apple ate_banana ate_grapes ate_pear
# <int> <dbl> <dbl> <dbl> <dbl>
# 1 1 1 0 0 0
# 2 2 1 1 0 1
# 3 3 0 1 1 0
# 4 4 NA NA NA NA
# 5 5 0 0 0 0
The nice thing is that NAs are infectious for the str_-functions.
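A small illustration of that point, using a made-up vector that is not part of the original answer: a missing input stays missing, while an empty string simply fails to match.

```r
library(stringr)

ate <- c("apple", NA, "")

# NA in, NA out; the empty string is an ordinary non-match
detected <- str_detect(ate, "apple")
detected

# Converting to integer keeps the NA, which gives the wanted 1 / NA / 0 coding
as.integer(detected)
```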
As function:
who_ate_what <- function(data, col) {
col <- enquo(col)
col_name <- quo_name(col)
match_list <- data %>%
select(!!col) %>%
map(str_split, pattern = ",") %>%
unlist() %>%
str_trim() %>%
na.exclude() %>%
unique()
match_list <- match_list[match_list != ""]
map_df(match_list, ~
data %>%
mutate(matches = str_c(!!col_name, "_", .x),
ind = case_when(str_detect(string = !!col, .x) ~ 1,
!str_detect(string = !!col, .x) ~ 0,
TRUE ~ NA_real_)
)) %>%
spread(matches, ind) %>%
select(-!!col)
}
This is way too verbose, I'm sure, but I guess it's a start.
library(tidyverse)
df <- data.frame(
x = 1:5,
ate = c("apple", "apple, pear, banana", "banana, grapes", NA_character_, ""),
stringsAsFactors = FALSE
)
df %>%
nest(-x) %>%
mutate(data = map(data, ~str_split(.x$ate, ",") %>% unlist())) %>%
unnest() %>%
group_by(x, data) %>%
summarise(n = n()) %>%
ungroup() %>%
spread(data, n, fill = NA) %>%
select(-`<NA>`) %>%
mutate(rs = rowSums(.[2:ncol(.)],na.rm = TRUE)) %>%
gather(nm, val, -x, -rs) %>%
mutate(val = case_when(
is.na(val) & rs > 0 ~ "0",
is.na(val) & rs == 0 ~ "NA",
!is.na(val) ~ as.character(val)
), val = as.numeric(val)) %>%
spread(nm, val, fill = NA) %>%
select(-rs, -V1)
#> # A tibble: 5 x 6
#> x ` banana` ` grapes` ` pear` apple banana
#> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0 0 0 1 0
#> 2 2 1 0 1 1 0
#> 3 3 0 1 0 0 1
#> 4 4 NA NA NA NA NA
#> 5 5 0 0 0 0 0
EDIT
Let's wrap this into a function and take care of the name issue. I adopted the splitting from your original function to make the use of quosures easier.
my_sep_fun <- function(data, col){
col <- enquo(col)
col_name <- quo_name(col)
data %>%
separate_rows(!!col, sep =', ', convert = TRUE) %>%
group_by(x, !!col) %>%
summarise(n = n()) %>%
ungroup() %>%
spread(!!col, n, fill = NA) %>%
select(-`<NA>`) %>%
mutate(rs = rowSums(.[2:ncol(.)],na.rm = TRUE)) %>%
gather(nm, val, -x, -rs) %>%
mutate(val = case_when(
is.na(val) & rs > 0 ~ "0",
is.na(val) & rs == 0 ~ "NA",
!is.na(val) ~ as.character(val)
), val = as.numeric(val)) %>%
spread(nm, val, fill = NA) %>%
select(-rs, -V1) %>%
rename_at(vars(2:ncol(.)), funs(paste0(!!col_name,"_", .)))
}
my_sep_fun(df, ate)
#> # A tibble: 5 x 5
#> x ate_apple ate_banana ate_grapes ate_pear
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 0 0 0
#> 2 2 1 1 0 1
#> 3 3 0 1 1 0
#> 4 4 NA NA NA NA
#> 5 5 0 0 0 0
Created on 2018-08-20 by the reprex package (v0.2.0).
One solution, much less verbose, in just three lines. Once you have the dataframe:
First, separate the values in each cell:
df <- separate_rows_(df, 'ate')
Second, dummify every answer using the function dummify from DataExplorer:
df <- DataExplorer::dummify(df, select = 'ate')
Third, aggregate the redundant rows like this (note that by must be a list):
df <- aggregate(df[,2:6], by = list(x = df$x), FUN = sum)
(you could also apply a max function here since you want to capture all 1's in the columns).
Done!

R - create dynamic indicator columns from values in character columns

I have data that looks like this:
library(dplyr)
d<-data.frame(ID=c(1,1,2,3,3,4), Quality=c("Good", "Bad", "Ugly", "Good", "Good", "Ugly"), Area=c("East", "North", "North", "South", "East", "North"))
What I'd like to do is create one new column for each unique value in Quality and populate it with whether the ID matches that value and then aggregate the ID's. I want to do the same for Area.
This is what I have for when Quality == Good:
d$Quality.Good <- 0
d$Quality.Good[d$Quality=="Good"] <- 1
e <- d %>%
group_by(ID) %>%
summarise(n=n(), MAX.Quality.Good = max(Quality.Good))
e
Output
# A tibble: 4 x 3
ID n MAX.Quality.Good
<dbl> <int> <dbl>
1 1 2 1
2 2 1 0
3 3 2 1
4 4 1 0
Is it possible to build a function that will loop through each character column and build an indicator column for Good, Bad, Ugly, North, East, South instead of copy pasting the above many more times?
Here's where I'm stuck:
library(stringr)
#vector of each Quality
e <-d %>%
group_by(Quality) %>%
summarise(n=n()) %>%
select(Quality)
e<-as.data.frame(e)
#create new column names
f <- str_c(names(e),".",e[,1])
#initialize list of new columns
d[f] <- 0
#I'm stuck after this...
Thank you!
We can do this in base R using table, by replicating the 'ID' column by the number of columns of the dataset minus 1 and pasting the column names onto the unlisted values (excluding the 'ID' column):
table(rep(d$ID, 2), paste0(names(d)[-1][col(d[-1])], unlist(d[-1])))
# AreaEast AreaNorth AreaSouth QualityBad QualityGood QualityUgly
# 1 1 1 0 1 1 0
# 2 0 1 0 0 0 1
# 3 1 0 1 0 2 0
# 4 0 1 0 0 0 1
or with tidyverse, gather into 'long' format, unite the 'key', 'val' columns to a single column, get the distinct rows, and spread into 'wide' format after creating a column of 1s.
library(tidyverse)
gather(d, key, val, -ID) %>%
unite(kv, key, val) %>%
distinct %>%
mutate(n = 1) %>%
spread(kv, n, fill = 0)
#ID Area_East Area_North Area_South Quality_Bad Quality_Good Quality_Ugly
#1 1 1 1 0 1 1 0
#2 2 0 1 0 0 0 1
#3 3 1 0 1 0 1 0
#4 4 0 1 0 0 0 1
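gather() and spread() are superseded in current tidyr, so here is a hedged sketch of the same pipeline with pivot_longer() and pivot_wider(). It assumes tidyr 1.0.0 or later. The result name wide is illustrative, and the column order follows first appearance rather than alphabetical order.

```r
library(dplyr)
library(tidyr)

d <- data.frame(
  ID = c(1, 1, 2, 3, 3, 4),
  Quality = c("Good", "Bad", "Ugly", "Good", "Good", "Ugly"),
  Area = c("East", "North", "North", "South", "East", "North"),
  stringsAsFactors = FALSE
)

wide <- d %>%
  # one row per ID / column name / value
  pivot_longer(-ID, names_to = "key", values_to = "val") %>%
  # combine e.g. "Quality" and "Good" into "Quality_Good"
  unite(kv, key, val) %>%
  # repeated combinations within an ID should count only once
  distinct() %>%
  mutate(n = 1L) %>%
  # one indicator column per combination, 0 where it never occurred
  pivot_wider(names_from = kv, values_from = n, values_fill = 0L)

wide
```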
1) Base R Create the model matrix for each column (using function make_mm) and bind them together as a data frame m. Finally aggregate on ID. No packages are used.
make_mm <- function(nm, data) model.matrix(~ . - 1, data[nm])
m <- do.call("data.frame", lapply(names(d)[-1], make_mm, d))
with(d, aggregate(. ~ ID, m, max))
giving:
ID QualityBad QualityGood QualityUgly AreaEast AreaNorth AreaSouth
1 1 1 1 0 1 1 0
2 2 0 0 1 0 1 0
3 3 0 1 0 1 0 1
4 4 0 0 1 0 1 0
2) dplyr/purrr This could alternately be written as the following which is close to the code in the question but generalizes to all required columns. Note that here we make model data frames using make_md rather than making model matrices with make_mm. Also note that the dot in group_by(m, ID = .$ID) refers to d and not to m.
library(dplyr)
library(purrr)
make_md <- function(nm, data) {
data %>%
select(all_of(nm)) %>%
model.matrix(~ . - 1, .) %>%
as.data.frame
}
d %>% {
m <- map_dfc(names(.)[-1], make_md, .)
group_by(m, ID = .$ID) %>%
summarize_all(max) %>%
ungroup
}
