Concatenate columns if they don't contain a zero - r

I'm trying to concatenate 4 columns into a single column named "tags" for later use in multilabel classification. I would like to concatenate the columns so that only the values that are not zero are pasted, separated by commas.
For example, the cell in the last row would be {1,2} instead of {1,2,0,0}.
I currently have no code that works as needed and haven't been able to find something on the internet. Do you guys have a tip to do this?
Current code:
df$TV[df$TV==1] = '1'
df$Internet[df$Internet ==1] = '2'
df$Mobil[df$Mobil==1] = '3'
df$Fastnet[df$Fastnet==1] = '4'
df$tags = paste(df$TV,df$Internet,df$Mobil,df$Fastnet, sep=",")

Base R option using apply -
cols <- c('TV', 'Internet', 'Mobil', 'Fastnet')
#df <- read.csv('stack.csv')
df$tags <- apply(df[cols], 1, function(x) toString(x[x!= 0]))
df
In dplyr we can use rowwise -
library(dplyr)
df <- df %>%
rowwise() %>%
mutate(tags = {
tmp <- c_across(all_of(cols))
toString(tmp[tmp != 0])
}) %>%
ungroup
df

We may use dapply from collapse
library(collapse)
cols <- c('TV', 'Internet', 'Mobil', 'Fastnet')
df$tags <- dapply(slt(df, cols), MARGIN = 1, FUN = function(x) toString(x[x != 0]))
data
df <- data.frame(TV = c(1, 3, 2, 0), Internet = c(1, 0, 1, 4), Mobil = c(0, 1, 3, 2), Fastnet = c(1, 5, 3, 2))
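If the columns are 0/1 indicators, as the recoding in the question suggests, the same apply idea can return the label numbers directly, with no space after the comma. This is only a sketch: the indicator data and the `labels` vector (TV -> 1, Internet -> 2, and so on) are made up for illustration.

```r
# Hypothetical 0/1 indicator data, one column per product
df <- data.frame(TV = c(1, 0, 1), Internet = c(1, 1, 0),
                 Mobil = c(0, 0, 0), Fastnet = c(0, 1, 0))
cols <- c("TV", "Internet", "Mobil", "Fastnet")

# Assumed coding: the position of the column is its label (TV -> 1, ..., Fastnet -> 4)
labels <- seq_along(cols)

# For each row keep only the labels whose indicator is not zero
df$tags <- apply(df[cols], 1, function(x) paste(labels[x != 0], collapse = ","))
df$tags
```

Using paste(..., collapse = ",") instead of toString gives "1,2" rather than "1, 2", which may be easier to split again later.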

Related

Why does map_df produce many missing values? How can I concatenate across rows to remove NAs?

I'm trying to count how many students received 1s, 2s, 3s, 4s, and 5s across their subjects, and I want a column for each subject and the possible grade (math_1, science_2, etc.).
I originally wrote a for loop, but my actual dataset has so many cases that I need to use map. I can get it to work, but it produces many NAs and only one chunk per column has actual data. I'm curious to know either:
Why is map_df() doing this and how can I avoid it? OR
How can I tighten this up so I only have this information on one row per the original rows in the first dataset (18 rows)? In other words, I'd concatenate up and down the column, so all the NAs are filled in (unless there truly was missing data).
Here's my code
library(tidyverse)
#Set up - generate sample dataset and get all combinations of grades and subjects
student_grades <- tibble(student_id = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5),
subject = c(rep(c("english", "biology", "math", "history"), 4), NA, "biology"),
grade = as.character(c(1, 2, 3, 4, 5, 4, 3, 2, 2, 4, 1, 1, 1, 1, 2, 3, 3, 4)))
all_subject_combos <- c("english", "history", "math", "biology")
all_grades <- c("1", "2", "3",
"4", "5")
subjects_and_letter_grades <- expand.grid(all_subject_combos, all_grades)
all_combos <- subjects_and_letter_grades %>%
unite("names", c(Var1, Var2)) %>%
mutate(names = str_replace_all(names, "\\|", "_")) %>%
pull(names)
# iterate over each combination using map_df()
student_map <- map_df(all_combos,
~student_grades %>%
mutate("{.x}" := paste(i)) %>%
group_by(student_id) %>%
mutate("{.x}" := sum(case_when(str_detect(.x, subject) &
str_detect(.x, grade) ~ 1,
TRUE ~ 0), na.rm = T)))
EDIT
For the record, my almost identical for loop does not pad in many missing values. I assume it must have something to do with how it is building the dataset, but I don't know how I can override what map_df is doing under the hood.
student_map <- student_grades
for(i in all_combos) {
student_map <- student_map %>%
mutate("{i}" := paste(i)) %>%
group_by(student_id) %>%
mutate("{i}" := sum(case_when(str_detect(i, subject) &
str_detect(i, grade) ~ 1,
TRUE ~ 0), na.rm = T))
}
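There is no i inside the map call: the value being looped over is available in the lambda as .x, so paste(i) does not refer to the current combination. Also, map_df binds the results by row, so each iteration adds a new block of rows in which every other new column is NA. It is better to use transmute instead of mutate, so that each iteration returns only the column it adds, bind those columns together with map_dfc, and then bind them to the original data at the end.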
library(dplyr)
library(purrr)
library(stringr)
student_map2 <- map_dfc(all_combos,
~ student_grades %>%
transmute(subject, grade, student_id, "{.x}" := .x) %>%
group_by(student_id) %>%
transmute("{.x}" := sum(case_when(str_detect( .x, subject) &
str_detect(.x, grade)~ 1, TRUE ~ 0), na.rm = TRUE)) %>%
ungroup %>%
select(-student_id)) %>%
bind_cols(student_grades, .)
Checking against the OP's for loop output:
> all.equal(student_map, student_map2, check.attributes = FALSE)
[1] TRUE
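To see where the NAs come from, here is a tiny sketch with made-up one-column data frames. Each piece has a different column name, which is the situation each iteration over all_combos produces.

```r
library(purrr)
library(dplyr)

# Two toy pieces, each holding one differently named column
nms <- c("a", "b")
pieces <- map(nms, function(nm) setNames(data.frame(1:2), nm))

# Row-binding (what map_df does) stacks the pieces and pads the
# columns a piece does not have with NA
by_rows <- bind_rows(pieces)

# Column-binding (what map_dfc does) keeps the original number of rows
by_cols <- bind_cols(pieces)

dim(by_rows)
dim(by_cols)
```

The row-bound result has one block of rows per piece, which is why only one chunk per column holds real data.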
Though I can't figure out why map_df() is performing in this undesirable way, I did find a solution, inspired heavily by the answer to this post.
solution <- student_map %>%
group_by(student_id, subject, grade) %>%
summarise_all(~ last(na.omit(.)))
solution
Basically, within each student_id, subject and grade group, this code keeps the last non-missing value of every other column, so an NA survives only where every value in the group is missing. Because those columns in my dataset will never have missing values, this solution works in my case.

How can column names be used in a complex dplyr function that uses base R elements?

First, this question is likely a repeat, however the solutions I've tried (eg using enquo() and !!, or get(), or {{}}) have not yielded results.
The Problem
I have a function that needs to take column names passed to it in a pipeline, perform a series of dplyr-based functions with some base R components, and return the new dataframe. The problem is that the function will not take the column names passed to it as variables in the referenced dataframe, treating them as strings instead.
The Data
This code will create a usable dataframe for this problem:
df_ext <- tibble(ID = c(rep(1, 5), rep(2, 5)),
TIME = rep(c(1, 2, 3, 4, 5), 2),
VAL = c(0, 1, 2, 2, 3, 1, 5, 0, 1, 4))
The Current Function
Here's a version of the function that I can share. It's designed to create a series of categories for the data I pass to it, but this is a simplified version that just calculates some basic groupings (ie, it doesn't do much of anything).
my_fun <- function(.data, id, time, val){
require("dplyr")
df <- df |>
group_by(id) |>
mutate(val_lag = if_else(val > 0, time - lag(time, default = 0), 0)) |>
mutate(first_time = min(time),
last_time = max(time),
first_val_pos = ifelse(any(val), min(time[val > 0]), NA),
last_val_pos = ifelse(any(val > 0), max(time[val > 0]), NA)) |>
group_by(grp = cumsum(val_lag == 0)) |>
mutate(val_pos_run = cumsum(val_lag)) |>
ungroup() |>
group_by(id) |>
mutate(ada_bl = ifelse(first_val_pos <= 0, val[time == first_val_pos], 0)) |>
ungroup()
df
}
df_ext |>
my_fun(id = ID, time = TIME, val = VAL)
If anyone can get the columns from the dataframe to pass into the function and get treated as columns in the pipe-referenced dataframe, you'd be ending a very frustrating headache.
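One thing that stands out is that the body starts from df rather than from the .data argument, so the data piped in is never used. Beyond that, a possible route is to embrace every column argument with {{ }}, including inside base R subsetting. The following is only a sketch, assuming dplyr 1.0 or later and R 4.1 or later for the native pipe; my_fun_sketch is a hypothetical, trimmed-down stand-in for the real function.

```r
library(dplyr)

# Hypothetical cut-down version of my_fun: every column argument is embraced
my_fun_sketch <- function(.data, id, time, val) {
  .data |>
    group_by({{ id }}) |>
    mutate(
      first_time = min({{ time }}),
      last_time = max({{ time }}),
      # embraced arguments also work inside base R subsetting
      first_val_pos = if (any({{ val }} > 0)) min({{ time }}[{{ val }} > 0]) else NA_real_
    ) |>
    ungroup()
}

df_ext <- tibble(ID = c(rep(1, 5), rep(2, 5)),
                 TIME = rep(c(1, 2, 3, 4, 5), 2),
                 VAL = c(0, 1, 2, 2, 3, 1, 5, 0, 1, 4))

res <- df_ext |>
  my_fun_sketch(id = ID, time = TIME, val = VAL)
res
```

The same pattern should extend to the remaining mutate steps, as long as each bare id, time or val in the body is wrapped the same way.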

How can I loop a command that adds a column to a data frame?

I want to add a dummy variable for each year, with the value 1 if a person was retired in that year and the value 0 otherwise. So far I've been doing this:
df$DummyRetired1987 <- with(df, ifelse(Empl_1987 == 4, 1, 0))
df$DummyRetired1988 <- with(df, ifelse(Empl_1988 == 4, 1, 0))
df$DummyRetired1989 <- with(df, ifelse(Empl_1989 == 4, 1, 0))
df$DummyRetired1990 <- with(df, ifelse(Empl_1990 == 4, 1, 0))
df$DummyRetired1991 <- with(df, ifelse(Empl_1991 == 4, 1, 0))
df$DummyRetired1992 <- with(df, ifelse(Empl_1992 == 4, 1, 0))
And that works fine, but I'm aware that there must be a much cleaner way to do this, probably as a for loop.
I've tried this:
for(year in c(1987,1988,1989,1990,1991,1992)){
nam1 <- paste("df$DummyRetired", year, sep = "")
nam2 <- paste("Empl_", year, sep = "")
assign(nam1, with(df, ifelse(nam2 == 4, 1, 0)))
}
But it doesn't work.
Would appreciate some help. Thanks!
We can subset the columns of interest ('nm1') with grep, create the new column names ('nm2') with paste0, and assign the result of the logical comparison converted to 0/1 integers
nm1 <- grep("^Empl_\\d+$", names(df), value = TRUE)
nm2 <- paste0("DummyRetired", sub("\\D+", "", nm1))
df[nm2] <- +(df[nm1] == 4)
If we need a loop
for(i in seq_along(nm1)) {
df[[nm2[i]]] <- as.integer(df[[nm1[i]]] == 4)
}
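The attempt in the question fails for two reasons: assign() creates a new variable whose name is literally the string "df$DummyRetired1987" instead of adding a column to df, and nam2 == 4 compares the string "Empl_1987" with 4 rather than the column of that name. A small worked run of the vectorised version, on made-up data with two of the year columns:

```r
# Hypothetical employment codes, 4 meaning retired
df <- data.frame(Empl_1987 = c(4, 1, 2), Empl_1988 = c(4, 4, 3))

nm1 <- grep("^Empl_\\d+$", names(df), value = TRUE)   # "Empl_1987" "Empl_1988"
nm2 <- paste0("DummyRetired", sub("\\D+", "", nm1))   # strip the non-digit prefix

# Compare all selected columns at once; unary + turns TRUE/FALSE into 1/0
df[nm2] <- +(df[nm1] == 4)
df
```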

How to exclude 0 in the search for the max and min values?

I have a dataframe with some price values. Now I want to have one, or in the best case two, data frames with the max and min values for each article, ignoring 0 values.
I tried it this way with DT (For maxValue everything works perfect):
minValue <- setDT(df)[, .SD[which.min(price > 0)], by=number]
maxValue <- setDT(df)[, .SD[which.max(price)], by=number]
But the minValue df shows 0 values. I have also tried it with:
do.call(rbind, tapply(df$price, df$number, FUN = function(x) c(max = max(x), min = min(x))))
But here I don't know how to use the > 0 condition.
In the best case I would like to have two dfs, maxValue and minValue, for each product.
You can use dplyr like:
library(dplyr)
df %>%
group_by(number) %>%
filter(price != 0) %>%
summarise(minPrice = min(price),
maxPrice = max(price))
Using base R
f1 <- function(x) c(minPrice = min(x), maxPrice = max(x))
aggregate(price ~ number, FUN = f1, df, subset = price != 0)
Or with by
do.call(rbind, by(df, df$number, FUN = function(d) f1(d$price[d$price != 0])))
data
df <- data.frame(number = c(1, 1, 1, 2, 2, 3, 3, 3),
price = c(0, 3, 2, 4, 3, 1, 2, 0))
Is this working?
minValue <- setDT(df)[price!=0, .(MinPrice=min(price)), by=number]
maxValue <- setDT(df)[price!=0, .(MaxPrice=max(price)), by=number]
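As for why the original attempt returns zeros: price > 0 is a logical vector, and which.min on a logical vector returns the position of the first FALSE, which is exactly a zero price. A base R sketch that filters first and then summarises, using the sample data from the answer above:

```r
df <- data.frame(number = c(1, 1, 1, 2, 2, 3, 3, 3),
                 price = c(0, 3, 2, 4, 3, 1, 2, 0))

# which.min on the logical vector picks the first FALSE, i.e. a zero price
first_group <- df$price[df$number == 1]
which.min(first_group > 0)

# Drop the zeros first, then take min and max per article
pos <- df[df$price > 0, ]
res <- do.call(rbind, tapply(pos$price, pos$number,
                             function(x) c(min = min(x), max = max(x))))
res
```

This also shows one way to apply the > 0 condition with tapply: subset the data before passing it in.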

How to use dplyr to make modifications in a dataframe equivalent to the use of a 'which'?

If I have a data frame, say
df <- data.frame(x = c(1, 2, 3), y = c(2, 4, 7), z = c(3, 6, 10))
then I can modify entries with the which function:
w <- which(df[,"y"] == 7)
df[w,c("y", "z")] <- data.frame(6, 9)
One way I see to do this with the package dplyr is the following:
df <- df %>%
mutate(W = (y==7),
y = ifelse(W, 6, y),
z = ifelse(W, 9, z)) %>%
select(-W)
But I find it a bit inelegant, and I am not so sure it would replace all kinds of which uses. Ideally I would imagine something like:
df <- df %>%
keep(y == 7) %>%
mutate(y = 6) %>%
unkeep()
where keep would provisionally select rows where modifications are to be made, and unkeep would unselect them to recover the full data frame.
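As far as I know dplyr has no keep/unkeep pair like this, but the idea can be approximated with a small helper. The sketch below assumes dplyr and rlang are installed; mutate_rows is a hypothetical name, not an existing dplyr function.

```r
library(dplyr)

# Hypothetical helper: apply mutate() only to the rows where cond is TRUE
mutate_rows <- function(data, cond, ...) {
  # evaluate the condition with the data frame's columns in scope
  keep <- rlang::eval_tidy(rlang::enquo(cond), data)
  keep <- keep & !is.na(keep)
  # modify the selected rows and write them back in place
  data[keep, ] <- mutate(data[keep, , drop = FALSE], ...)
  data
}

df <- data.frame(x = c(1, 2, 3), y = c(2, 4, 7), z = c(3, 6, 10))

res <- df %>%
  mutate_rows(y == 7, y = 6, z = 9)
res
```

Rows that do not match the condition pass through unchanged, which is the role unkeep would have played.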
