Find the shortest string by category in R

I am quite a beginner in R, and I ran into this problem.
I would like to find the length of the shortest string in each category of my tibble and then
truncate all strings in that category to the width of the shortest one.
ex <- tibble(category = c("A", "A", "C", "B", "C", "A"),
             string = c("cat", "bird", "apple", "cloud", "banana", "elephant"))
I have an idea of how to solve the problem in theory, but I am not able to put it together.
ex %>%
  group_by(category) %>%
  mutate(length = lapply(ex, function(x) min(nchar(x)))) %>%
  # somehow str_trunc() ?
In the end I would like to see something like this:
ex <- tibble(category = c("A", "A", "C", "B", "C", "A"),
             string = c("cat", "bir", "apple", "cloud", "banan", "ele"))

This should do what you need
library(dplyr)
library(stringr)

ex %>%
  group_by(category) %>%
  mutate(length = min(nchar(string)),
         string = str_sub(string, 1, length))
We don't need the lapply inside the mutate to find the length. We can just run that transformation on the string column directly. And here I used stringr::str_sub to get the substring with the right number of characters since you already seem to be using tidyverse functions. You could also use the base substr function instead.
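For example, the same idea with base substr() in place of str_sub() (a sketch, not from the original answer; it also skips the helper length column):

ex %>%
  group_by(category) %>%
  mutate(string = substr(string, 1, min(nchar(string))))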

You can get the shortest string length per category in base R with
aggregate(ex$string, list(ex$category),
          function(s) min(nchar(as.character(s))))
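To also do the truncation in base R, one option (a sketch using ave(), not part of the original answer) is:

# compute each row's group-wise minimum length, then truncate with substr()
ex$string <- substr(ex$string, 1,
                    ave(nchar(ex$string), ex$category, FUN = min))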

Related

How do you find the date when one column stopped equaling another column in R?

I have the following variables in a dataset:
UserID | Date | Workplace | First_Workplace
Users are assigned to a workplace (Remote or Office). I'm trying to figure out when each user returned to the office and Workplace no longer equaled First_Workplace. I need to add this data to a new column named Date_Changed.
Pseudocode would be something like the below; I just can't figure it out in R:
data %>%
mutate(data, Date_Changed = Date WHEN Workplace != First_Workplace FOR EACH UserID)
We may use which() on the logical comparison to get the indices and select the first one with [1].
library(dplyr)
data %>%
  group_by(UserID) %>%
  mutate(Date_changed = Date[which(Workplace != first(Workplace))[1]]) %>%
  ungroup()
NOTE: If there is no change, it returns NA
data
data <- data.frame(UserID = rep(1:3, each = 3),
                   Workplace = c("A", "A", "B", "A", "A", "A", "C", "C", "C"),
                   Date = Sys.Date() + 0:8)
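If your real data does have the First_Workplace column described in the question, a hedged variant that compares against it directly (assuming that column name) would be:

data %>%
  group_by(UserID) %>%
  mutate(Date_Changed = Date[which(Workplace != First_Workplace)[1]]) %>%
  ungroup()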

Automatically create data frames based on factor levels of a column

I have some fake case data with a manager id, type, and location. I'd like to automatically create data frames with the average number of cases a manager has at a given location.
# create fake data
manager_id <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3)
type <- c("A", "A", "B", "B", "B", "A", "A", "A", "C", "A", "B", "B", "C", "C", "C")
location <- c("Beach", "Beach", "Beach", "Beach", "Beach", "City", "City", "City", "Farm", "Farm", "Farm", "Farm", "Farm", "Farm", "City")
manager_id <- data.frame(manager_id)
type <- data.frame(type)
location <- data.frame(location)
df <- cbind(manager_id, type, location)
After creating fake data, I created a function that finds this average. The function works.
avgs_function <- function(dat){
  dat1 <- dat %>% group_by(manager_id) %>% summarise(total = n())
  total <- mean(dat1$total)
  total <- round(total, 0)
  total
}
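As a quick sanity check (a hedged usage example, not from the original post), calling the function on a single location's subset returns that location's average:

avgs_function(dplyr::filter(df, location == "Beach"))
# [1] 5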
I then loop through each location, create data frames using avgs_function, and store them in a list. Then I call the data frames into my global environment. Something is going wrong here that I can't figure out. The weird thing is that it was working fine yesterday.
df_list <- unique(df$location) %>%
  set_names() %>%
  map(~avgs_function(df))
names(df_list) <- paste0(names(df_list), "_avg")
list2env(df_list, envir = .GlobalEnv)
Right now, the code is giving these values:
Beach_avg = 5
City_avg = 5
Farm_avg = 5
I would like:
Beach_avg = 5
City_avg = 2
Farm_avg = 3
I believe the issue is happening with the purrr package. Any help would be greatly appreciated!
I don't think you need purrr at all (just dplyr): this gets your desired output
result <- (df
  %>% count(manager_id, location)
  %>% group_by(location)
  %>% summarise(across(n, mean))
)
(although without the _avg added to the location names: you could add mutate(across(location, paste0, "_avg")) (or something with glue) if you wanted)
This also doesn't create the separate variables you wanted. You could obviously add more, e.g. with(result, setNames(as.list(n), location)) %>% list2env(envir = .GlobalEnv), but in general, workflows that populate your global workspace with a bunch of different named variables are a bad idea: collections like this can usually be handled better by keeping them inside a list, data frame, or tibble.
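For completeness, if you do want to keep the purrr/list2env approach, the likely culprit in the original code is that the full df is passed to avgs_function on every iteration rather than the per-location subset; a hedged fix would be:

library(dplyr)
library(purrr)

df_list <- unique(df$location) %>%
  set_names() %>%
  map(~ avgs_function(filter(df, location == .x)))
names(df_list) <- paste0(names(df_list), "_avg")
list2env(df_list, envir = .GlobalEnv)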

Find matching values in different row of dataframe with group_by and str_detect

Let's say I have a table:
library(tidyverse)

df <- tibble(group = c("a", "a", "b", "b", "c", "c"),
             code = c("foo", "bar", "fuz", "baz", "fiz", "boz"),
             child_code = c("bar", "", "baz", "", "biz", ""))
I'd like to group_by group and then search for the code in the child_code column to get something like this:
group | code | child_code | code_in_child_code
a     | foo  | bar        | FALSE
a     | bar  |            | TRUE
b     | fuz  | baz        | FALSE
b     | baz  |            | TRUE
c     | fiz  | biz        | FALSE
c     | boz  |            | FALSE
I've tried:
df %>% group_by(group) %>% mutate(code_in_child_code = str_detect(child_code, code))
But (I suppose obviously) that's just looking for the same row's code in that row's child_code. I want to search the child_code column for any value among the whole group's codes.
Any help would be much appreciated.
I've found an answer:
df %>%
  group_by(group) %>%
  mutate(codes = paste(code, collapse = "|"),
         code_in_child_code = str_detect(child_code, codes))
It seems fairly simple now: of course I needed to get the group's codes into the same row as the value I was searching. The values TRUE and FALSE are the "wrong" way round in this answer compared to the desired table, but it's just as useful either way.
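If you want the flags exactly as in the desired table, i.e. whether each code appears anywhere in the group's child_code values, a hedged alternative sketch is:

df %>%
  group_by(group) %>%
  mutate(code_in_child_code = code %in% child_code) %>%
  ungroup()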

Convert character column to factor preserving column label

I have a dataframe that I read from an XLSX file. Every column name looks like this: CODE___DESCRIPTION, for example A1___Some funky column here. It is easier to use the codes as column names, but I want to use the descriptions when needed, so they must be stored in the dataframe. This is why I am using the sjlabelled package later on.
Make yourself some random data and save it as some_data.xlsx.
library(dplyr)    # to play with tibbles
library(stringi)  # to play with strings
library(writexl)  # name speaks for itself

tibble(col1 = sample(c("a", "b", "c", NA, "N/A"), 50, replace = T),
       col2 = sample(c("d", "e", "f", NA, "N/A"), 50, replace = T),
       col3 = sample(c("g", "h", "i", NA, "N/A"), 50, replace = T),
       col4 = sample(c("j", "k", "l", NA, "N/A"), 50, replace = T)) %>%
  setNames(stri_c("A", 1:4, "___", stri_rand_strings(4, 10))) %>%
  write_xlsx(path = "some_data.xlsx", col_names = T, format_headers = F)
I've created a simple function to prepare my data the way I want it.
library(sjlabelled) # to play with labelled data

label_it <- function(data = NULL, split = "___"){
  # This basically makes an array of two columns (of codes and descriptions respectively)
  k.n <- data %>%
    names() %>%
    stri_split_fixed(pattern = split, simplify = T)
  data %>%
    set_label(k.n[, 2]) %>% # set description as each column's label
    setNames(k.n[, 1])      # set code as each column's name
}
First I read the data from the XLSX file. Then I label it.
library(readxl) #name speaks for itself again
data <- read_xlsx("some_data.xlsx", na = c("", "N/A")) %>%
  label_it()
Now each column of my dataframe is a character vector (in fact a structure) with two attributes:
label, which is the description part
names, which is the original column name (CODE___DESCRIPTION style) and is not to be mistaken for the output of names(data), which would be the codes part
Let's say I would like to change the first and third columns to factors.
To do this I have tried two things:
data[,1] <- factor(data[,1], levels = c("c", "a", "b"))
data[,3] <- factor(data[,3], levels = c("h", "g", "i"))
This changes all values in those two columns to NA_integer_.
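(An aside offered as a likely explanation, not something from the original post: single-bracket indexing on a tibble returns a one-column tibble rather than a vector, so factor() cannot match the values against the supplied levels. Double-bracket indexing avoids that, although factor() still drops the label attribute:)

data[[1]] <- factor(data[[1]], levels = c("c", "a", "b"))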
data <- data %>%
  mutate(A1 = factor(A1, levels = c("c", "a", "b")),
         A3 = factor(A3, levels = c("h", "g", "i")))
This changes the character vectors to factors as intended, but it drops both column attributes (label and names), which I need preserved.
I also tried quite a lot of functions from sjlabelled, labelled and haven packages. Nothing worked as I intended. Finally, I have found a solution, but it isn't perfect and I would love to find an easier way of doing this.
The solution is to lose those attributes but then regain ('copy' in fact) them.
data <- data %>%
  mutate(A1 = factor(A1, levels = c("c", "a", "b")),
         A3 = factor(A3, levels = c("h", "g", "i"))) %>%
  copy_labels(data)
copy_labels() is a function from the sjlabelled package which is used when labels are lost due to, e.g., subsetting the data, as in this example.
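To verify that the attributes survived (a hedged check, not part of the original post):

get_label(data) # sjlabelled: the descriptions should still be attached
str(data$A1)    # factor with the original label attribute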
P.S.
I would love to add r-sjlabelled and r-labelled tags, because those packages are relevant to this problem, but I am under the 1500 reputation required to do so.

Retrieve number of factor levels from columns within a function in R

I am trying to create a function that performs several statistical tests on specific columns in a dataframe. Some of the tests require more than one level. I would like to test how many levels are in a specific column, but can't seem to get it right.
In my actual code this section would be followed by an ifelse that returns a string saying 'only one level' if single, or continues to the statistical test if > 1.
require("dplyr")
df <- data.frame(A = c("a", "b", "c"), B = c("a", "a", "a"), C = c("a", "b", "b")) %>%
mutate(A = factor(A)) %>%
mutate(B = factor(B)) %>%
mutate(C = factor(C))
my_funct <- function(data_f, column){
n_fact <- paste("data_f", column, sep = "$")
n_levels <- do.call("nlevels",
list(x = as.name(n_fact)))
print(n_levels)
}
Then I call my function with the dataframe and a column name:
my_funct(df, "A")
I get the following error:
Error in levels(x) : object 'data_f$A' not found
If I remove the as.name() wrapper it returns a value of 0.
One reason your code is not working is that data_f$A is not the name of any object available to the function.
But I would recommend you don't even try to parse code as strings. It's the wrong way to do it. All you need is double bracket indexing [[. So the body of your function can be the following single line:
nlevels(data_f[[column]])
And for all the columns:
sapply(data_f, nlevels)
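Putting that together, a minimal sketch of the function plus a couple of usage calls (the commented results assume the df defined above):

my_funct <- function(data_f, column){
  nlevels(data_f[[column]])
}

my_funct(df, "A")   # 3
my_funct(df, "B")   # 1
sapply(df, nlevels) # named vector: A = 3, B = 1, C = 2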
