How to change a dataframe's column types using tidy selection principles - r

I'm wondering what the best practices are for changing a dataframe's column types, ideally using the tidy selection language.
Ideally you would set the column types correctly up front when you import the data, but that isn't always possible for various reasons.
So the next best pattern that I could identify is the one below:
# random dataframe
library(tibble)
library(lubridate)

df <- tibble(a_col = 1:10,
             b_col = letters[1:10],
             c_col = seq.Date(ymd("2022-01-01"), by = "day", length.out = 10))
My current favorite pattern involves using across(), because I can use tidy selection verbs to pick the variables I want and then "map" a function over them.
# current favorite pattern
df <- df %>%
  mutate(across(starts_with("a"), as.character))
Does anyone have any other favorite patterns or useful tricks here? It doesn't have to be mutate(). Oftentimes I have to change the column types of dataframes with hundreds of columns, so it becomes quite tedious.

Yes, this happens. A common pain point is dates stored as character: if you convert them once and then try to convert them again (say, in a mutate()/summarise()), you get an error.
In such cases, change the datatype only once you know what kind of data is in the column.
Select columns by name if the names carry meaning.
Before applying an as.* conversion, check with the matching is.* function whether the column is already that type.
Applying it can be done with map()/lapply()/a for loop, whatever is comfortable.
But it would be difficult to have a single approach for "all dataframes", since people name their fields however they like.
Shared mine. Hope others help.
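A minimal base-R sketch of that check-before-convert idea; the column names and which columns hold dates are assumptions for illustration:

```r
# toy data: a date column that arrived as character (names assumed)
df <- data.frame(id = 1:3,
                 joined = c("2022-01-01", "2022-02-01", "2022-03-01"),
                 stringsAsFactors = FALSE)

date_cols <- c("joined")  # assumed: we already know which columns hold dates

# convert only if the column is still character, so re-running this
# block on an already-converted column is a safe no-op
df[date_cols] <- lapply(df[date_cols], function(x) {
  if (is.character(x)) as.Date(x) else x
})
```

Because of the is.character() guard, the conversion is idempotent: running it a second time leaves the Date columns untouched instead of erroring.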

Related

Using "count" function in a loop in R

I'm quite new to R and I've been learning with the available resources on the internet.
I came across this issue where I have a vector (a) holding the variable names "var1", "var2", and "var3". I want to use the count function to generate a new df with the categories for each of those variables and their frequencies.
The function I want to use in a loop is this
b <- count(mydata, var1)
However, when I use this loop below;
for (i in (a)) {
'j' <- count(mydata[, i])
print (j)
}
The loop runs, but the frequencies that get saved in j are only those of the categorical variable "var3".
Can someone assist me on this code please?
TIA!
In R there are generally better ways than to use loops to process data. In your particular case, the “straightforward” way fails, because the idea of the “tidyverse” is to have the data in tidy format (I highly recommend you read this article; it’s somewhat long but its explanation is really fundamental for any kind of data processing, even beyond the tidyverse). But (from the perspective of your code) your data is spread across multiple columns (wide format) rather than being in a single column (long form).
The other issue is that count (like many other tidyverse functions) expects an unevaluated column name. It does not accept the column name via a variable. akrun’s answer shows how you can work around this (using tidy evaluation and the bang-bang operator), but that workaround isn’t necessary here.
The usual solution, instead of using a loop, would first require you to bring your data into long form, using pivot_longer.
After that, you can perform a single count on your data:
result <- mydata %>%
  pivot_longer(all_of(a), names_to = 'Var', values_to = 'Value') %>%
  count(Var, Value)
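If tidyr/dplyr aren't at hand, the same reshape-then-count idea can be sketched in base R; the toy data and column names here are assumptions standing in for mydata:

```r
# toy wide data standing in for mydata; column names are assumed
mydata <- data.frame(var1 = c("a", "b", "a"),
                     var2 = c("x", "x", "y"),
                     var3 = c("m", "m", "m"),
                     stringsAsFactors = FALSE)
a <- c("var1", "var2", "var3")

# stack() reshapes wide -> long (like pivot_longer), then table()
# tabulates each (variable, value) pair (like count)
long <- stack(mydata[a])
counts <- as.data.frame(table(Var = long$ind, Value = long$values))
counts <- counts[counts$Freq > 0, ]
```

The final filter drops the zero-frequency cells that table() materializes for every variable/value combination.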
Some comments regarding your current approach:
Be wary of cryptic variable names: what are i, j and a? Use concise but descriptive variable names. There are some conventions where i and j are used but, if so, they almost exclusively refer to index variables in a loop over vector indices. Using them differently is therefore quite misleading.
There’s generally no need to put parentheses around a variable name in R (except when that name is the sole argument to a function call). That is, instead of for (i in (a)) it’s conventional to write for (i in a).
Don’t put quotes around your variable names! R happens to accept the code 'j' <- … but since quotes normally signify string literals, its use here is incredibly misleading, and additionally doesn’t serve a purpose.

R: Conditionally mutating a variable (tricky)

We are currently working on a project for school, and we do not have that much experience with coding and R. The dataset that we are working on contains the variable operationtype, which has a lot of combinations between several operation types. We want to recode this into the variable operationcategory. These are the categories we want to recode the many operations into:
"AVR/P+other"
"AVR/P+MVP/R+other"
"MVR/P+other"
"CABG+other"
"CABG+AVR/P+other"
"CABG+MVR/P+other"
If none of the above apply, then "Remaining".
We were wondering if this can be done somewhat automatically, where we can specify the following for AVR/P+other: if it includes AVR/P but does not include MVP/R, classify it as "AVR/P+other"; if it does include MVP/R, classify it as "AVR/P+MVP/R+other", since these two categories are closely related. Doing this by hand would take forever, so hopefully this is possible.
Thank you for your help in advance.
Koen
Assuming that operationtype contains the exact string, what I would probably do is something like this:
library(dplyr)
library(stringr)
transformed_df <- df %>%
  mutate(operationcategory = case_when(
    str_detect(operationtype, "AVR/P") & str_detect(operationtype, "MVP/R") ~ "AVR/P+MVP/R+other",
    str_detect(operationtype, "AVR/P") ~ "AVR/P+other",
    TRUE ~ "Remaining"
  ))
Just beware that the conditions are evaluated in order, so the most restrictive conditions should be on top.
You could use regular expressions to get by with a single str_detect, but this is probably easier to understand and maintain.
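For illustration, a base-R analogue of the same ordered-condition logic using grepl() instead of str_detect(); the example operation strings are assumptions:

```r
# assumed example strings for operationtype
ops <- c("AVR/P + MVP/R repair", "isolated AVR/P", "CABG only")

# nested ifelse() mirrors case_when(): the most specific test comes first
category <- ifelse(grepl("AVR/P", ops, fixed = TRUE) &
                   grepl("MVP/R", ops, fixed = TRUE), "AVR/P+MVP/R+other",
            ifelse(grepl("AVR/P", ops, fixed = TRUE), "AVR/P+other",
                   "Remaining"))
```

fixed = TRUE treats the pattern as a literal string, which avoids surprises if a category label ever contains regex metacharacters like "+".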

Does ggplot2 always require that variables be named?

Is it possible to use ggplot without first assigning names to the variables in the data.frame? My intended use is early exploration of large datasets, the kind that involves trying one form of a question or possibility then moving on to the next. I record short comments about what is being explored but seldom name the rows and columns of the one-time-use variables I derive.
To see the data, I make quick and dirty plots:
plot(matrix.name)
or more often
plot(x = array.name[1, , 1], y=array.name[ , , 1])
At times I’d like to use ggplot2’s features instead. The requisite as.data.frame(matrix.name) conversion is quick, but then is there a way to pass the necessary arguments to aes() without assigning row and column names? As best I have been able to research, aes requires variable names.
Thank you very much for your help. I will repeatedly have occasion to use any answers.
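One possible workaround sketch, under the assumption that the throwaway data starts as an unnamed matrix: as.data.frame() generates default column names (V1, V2, ...), which aes() can then reference without any manual naming. The ggplot line is left commented because it assumes ggplot2 is loaded:

```r
# assumed throwaway matrix with no dimnames
m <- matrix(1:20, ncol = 2)

# as.data.frame() invents column names V1, V2, ... for an unnamed matrix
df <- as.data.frame(m)
names(df)   # the generated names

# aes() can then use those generated names, e.g. (requires ggplot2):
# ggplot(df, aes(V1, V2)) + geom_point()
```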

R approach for iterative querying

This is a question of a general approach in R, I'm trying to find a way into R language but the data types and loop approaches (apply, sapply, etc) are a bit unclear to me.
What is my target:
Query data from API with parameters from a config list with multiple parameters. Return the data as aggregated data.frame.
First I want to define a list of multiple vectors (columns):
site        segment     id
google.com  Googleuser  123
bing.com    Binguser    456
How do I manage such a list of value groups (row by row)? data.frames are column-focused; you can't write a data.frame row by row in an R script. So the only way I found to define this initial config table is a CSV, which is an approach I'd really like to avoid, but I can't find anything more elegant.
Now I want to query my data, lets say with this function:
query.data <- function(site, segment, id) {
  config <- define_request(site, segment, id)
  result <- query_api(config)
  return(result)
}
This will give me a data.frame as a result, this means every time I query data the same columns are used. So my result should be one big data.frame, not a list of similar data.frames.
Now, sapply allows one parameter list plus multiple static parameters. mapply works, but it gives me my data in some crazy output that I can't handle or even understand exactly.
In principle the list of data.frames is ok, the data is correct, but it feels cumbersome to me.
What core concepts of R I did not understand yet? What would be the approach?
If you have a lapply/sapply solution that returns a list of dataframes with identical columns, you can easily get a single large dataframe with do.call(). do.call() passes each item of a list as an argument to another function, allowing you to do things such as
big.df <- do.call(rbind, list.of.dfs)
which appends the component dataframes into a single large dataframe.
In general do.call(rbind,something) is a good trick to keep in your back pocket when working with R, since often the most efficient way to do something will be some kind of apply function that leaves you with a list of elements when you really want a single matrix/vector/dataframe/etc.
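A self-contained sketch of that pattern; query.data here is an assumed stand-in for the real API call, and the config vectors mirror the table from the question:

```r
# assumed stand-in for the real API call: one data.frame per config row
query.data <- function(site, segment, id) {
  data.frame(site = site, segment = segment, id = id,
             visits = nchar(site),   # fake metric for illustration
             stringsAsFactors = FALSE)
}

sites    <- c("google.com", "bing.com")
segments <- c("Googleuser", "Binguser")
ids      <- c(123, 456)

# Map() iterates over the three config vectors in parallel and returns
# a list of data.frames with identical columns...
results <- Map(query.data, sites, segments, ids)

# ...which do.call(rbind, ...) stacks into one big data.frame
big.df <- do.call(rbind, results)
```

Map() is the multi-argument counterpart of lapply(), so it sidesteps the "crazy output" that mapply() produces when it tries to simplify the result.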

How to pass a name to a function like dplyr::distinct()

I have a list of five data frames full of user responses to a survey.
In each of these data frames, the second column is the user id number. Some of the users took the survey multiple times, and I am trying to weed out the duplicate responses and just keep the first record.
The naming conventions are fairly standard, so the column in the first data frame is called akin to survey1_id and the second is survey2_id, etc. with the exception being that the column in the third data frame is called survey3a_id.
So basically what I tried to do was this:
for (i in seq(1,5)) {
newdata <- distinct(survey_list[[i]], grep(names("^survey.*_id$", survey_list[[i]]), value = TRUE))
}
But this doesn't work.
I originally thought it was just because the grep output had quotes around it, but I tried to strip them with noquote() and that didn't work. I then realized that distinct() doesn't actually evaluate the second argument, it just takes it literally, so I tried to force it to evaluate using eval(), but that didn't work. (Not sure I really expected it to.)
So now I'm kind of stuck. I don't know if the best solution is just to write five individual lines of code or, for a more generalizable solution, to sort and compare item-by-item in a loop? Was just hoping for a cleaner solution. I'm kind of new to this stuff.
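A base-R sketch of one way this could work, assuming the column-name conventions from the question; duplicated() on the id column keeps the first record per user, which is what distinct() does:

```r
# assumed toy data: one survey data.frame with a duplicated user id
survey_list <- list(
  data.frame(resp = c("a", "b", "c"), survey1_id = c(1, 1, 2))
)

for (i in seq_along(survey_list)) {
  df <- survey_list[[i]]
  # find the id column by its naming convention
  id_col <- grep("^survey.*_id$", names(df), value = TRUE)
  # keep only the first record per user id, like distinct() would
  survey_list[[i]] <- df[!duplicated(df[[id_col]]), ]
}
```

Looking up the column name as a string and indexing with df[[id_col]] avoids the tidy-evaluation issue of passing a quoted name to distinct().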
