R equivalent of Stata *

In Stata, if I have these variables: var1, var2, var3, var4, var5, and var6, I can select all of them with the command var*. Does R have a similar functionality?

The select function from the "dplyr" package offers several flexible ways to select variables. For instance, using @Marius's sample data, try the following:
library(dplyr)
df %>% select(starts_with("var")) # At the start
df %>% select(num_range("var", 1:3)) # specifying range
df %>% select(num_range("var", c(1, 3))) # gaps are allowed

You can use grep to do this kind of regexp matching on the column names:
x = c(1, 2, 3)
df = data.frame(var1=x, var2=x, var3=x, other=x)
df[, grep("^var", colnames(df))] # "^var" anchors the match to the start of each name
Output:
  var1 var2 var3
1    1    1    1
2    2    2    2
3    3    3    3
So, basically, this just uses the usual df[rows_to_keep, columns_to_keep] indexing syntax, feeding the result of grep in as columns_to_keep.
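If you would rather avoid regular expressions altogether, a base R alternative is startsWith() (a small sketch, assuming the same df as above):
df[, startsWith(colnames(df), "var")] # logical index: TRUE for names beginning with "var"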

Related

How to run a custom function in R multiple times using a data frame to get the argument for each run

I wrote a function to output all values for a specified categorical variable into an output data frame (I'm more used to using call execute in SAS). It works if I manually call the function with the variable name, but not if I try to use mapply. I'm not very experienced (yet) with apply, lapply, mapply, etc., but I have a lot of variables I want to run this on, so I would like to use a data frame to supply the function arguments.
Does anyone have any suggestions? REPREX below:
This works (outputs a table listing all variables and all values associated with each one):
library(dplyr)
a <- data.frame(var1 = c("one", "two", "three"), var2 = c("ants", "moths", "cows"),
                var3 = c("Sam", "Sally", "Jugdish"))
b <- data.frame(VNAME = c("var1", "var2", "var3"))
getvals <- function(varb){
  temp <- a %>% mutate(VNAME = quo_name(enquo(varb))) %>% mutate(VALUE = {{varb}}) %>%
    select(c(VNAME, VALUE)) %>% distinct()
  Values <- bind_rows(Values, temp)
  Values <- Values %>% filter(VNAME != 'delete' & !is.na(VALUE))
  Values <<- Values
}
Values<-data.frame(VNAME='delete')
getvals(var1)
getvals(var2)
getvals(var3)
But this does not (in fact, it just outputs a table listing the variable names in both columns):
With a, b, getvals, and Values defined exactly as above:
mapply(getvals, b$VNAME)
Thank you!
I tried using apply and lapply, but got the same results.
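One likely reason the mapply() version fails: mapply() passes the column names as character strings, while getvals() expects bare column names, so enquo(varb) captures the string itself rather than the column. A small sketch that works with strings instead (assuming dplyr is loaded and a and b are defined as above; getvals_chr is just an illustrative name):
library(dplyr)
getvals_chr <- function(varb) {
  # varb is a character string, so look the column up with .data[[...]]
  a %>%
    transmute(VNAME = varb, VALUE = .data[[varb]]) %>%
    distinct()
}
Values <- bind_rows(lapply(as.character(b$VNAME), getvals_chr))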
A much simpler way to get every variable/value pair is a single pivot_longer call:
tidyr::pivot_longer(a, everything())
Result
# A tibble: 9 × 2
name value
<chr> <chr>
1 var1 one
2 var2 ants
3 var3 Sam
4 var1 two
5 var2 moths
6 var3 Sally
7 var1 three
8 var2 cows
9 var3 Jugdish
Or if you just want certain variables included:
include <- c("var1", "var3")
tidyr::pivot_longer(dplyr::select(a, all_of(include)), everything())
# A tibble: 6 × 2
name value
<chr> <chr>
1 var1 one
2 var3 Sam
3 var1 two
4 var3 Sally
5 var1 three
6 var3 Jugdish
Or as a function for the same output:
extract_vars <- function(df, cols) {
  tidyr::pivot_longer(dplyr::select(df, all_of(cols)), everything())
}
extract_vars(a, include)

Loop through specific columns of dataframe keeping some columns as fixed

I have a large dataset whose first two columns serve as IDs (one is an ID proper and the other is a year variable). I would like to compute a count by group, looping over every variable that is not an ID. The code below shows what I want to achieve for one variable:
library(tidyverse)
df <- tibble(
  ID1 = c(rep("a", 10), rep("b", 10)),
  year = c(2001:2020),
  var1 = rnorm(20),
  var2 = rnorm(20))
df %>%
  select(ID1, year, var1) %>%
  filter(if_any(starts_with("var"), ~ !is.na(.))) %>%
  group_by(year) %>%
  count() %>%
  print(n = Inf)
I cannot use a loop that starts with for(i in names(df)) since I want to keep the variables "ID1" and "year". How can I run this piece of code for all the columns that start with "var"? I tried using quosures but it did not work; I get the error select() doesn't handle lists. I also tried select(starts_with("var")) but with no success.
Many thanks!
Another possible solution:
library(tidyverse)
df %>%
  group_by(ID1) %>%
  summarise(across(starts_with("var"), ~ length(na.omit(.x))))
#> # A tibble: 2 × 3
#> ID1 var1 var2
#> <chr> <int> <int>
#> 1 a 10 10
#> 2 b 10 10
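If you want the per-year counts from the question's single-variable pipeline, the same across() idea can be grouped by year instead (a sketch, assuming the tidyverse is loaded and df is defined as above):
df %>%
  group_by(year) %>%
  summarise(across(starts_with("var"), ~ sum(!is.na(.x))))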
Or loop over just the matching column names:
for(i in names(df)[grepl('var', names(df))])
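That is only the loop header; a sketch of a complete body (assuming the tidyverse is loaded and df is defined as above) that prints one per-year count for each var column:
for (i in names(df)[grepl('var', names(df))]) {
  df %>%
    select(ID1, year, all_of(i)) %>%   # keep the ID columns plus the current var
    filter(!is.na(.data[[i]])) %>%
    group_by(year) %>%
    count() %>%
    print(n = Inf)
}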

Pivoting CreateTableOne in R to show levels as column headers?

I'm trying to generate some descriptive summary tables in R using the CreateTableOne function. I have a series of variables that all have the same response options/levels (Yes or No), and want to generate a wide table where the levels are column headings, like this:
Variable  Yes  No
Var1      1    7
Var2      5    2
But CreateTableOne generates nested long tables, with one column for Level where Yes and No are values in rows, like this:
Variable  Level  Value
Var1      Yes    1
Var1      No     7
Is there a way to pivot the table to get what I want while still using this function, or is there a different function I should be using instead?
Here is my current code:
library(tableone)
vars <- c('var1', 'var2')
Table <- CreateTableOne(vars = vars, data = dataframe, factorVars = vars)
Table_exp <- print(Table, showAllLevels = TRUE, varLabels = TRUE, format = "f",
                   test = FALSE, noSpaces = TRUE, printToggle = FALSE)
write.csv(Table_exp, file = "Table.csv")
Thanks!
You could use pivot_wider alone to make that table. Here is your data:
library(tidyverse)
dataframe = data.frame(Variable = c("Var1", "Var1", "Var2", "Var2"),
                       Level = c("Yes", "No", "Yes", "No"),
                       Value = c(1, 7, 5, 2))
Your data:
Variable Level Value
1 Var1 Yes 1
2 Var1 No 7
3 Var2 Yes 5
4 Var2 No 2
You can use this code to make the wider table:
dataframe %>%
  pivot_wider(names_from = "Level", values_from = "Value")
Output:
# A tibble: 2 × 3
Variable Yes No
<chr> <dbl> <dbl>
1 Var1 1 7
2 Var2 5 2
So I got the answer to this question from a coworker, and it's very similar to what Quinten suggested but with some additional steps to account for the structure of my raw data.
The example tables I provided in my question were my desired outputs, not examples of my raw data. The number values weren't values in my dataset, but rather calculated counts of records, and the solution below includes steps for doing that calculation.
This is what my raw data looks like, and it's actually structured wide:
Participant_ID  Var1  Var2  Age
1               Yes   No    20
2               No    No    30
We started by creating a subset with just the relevant variables:
subset <- data |> select(Participant_ID, Var1, Var2)
Then pivoted the data longer first, in order to calculate the counts I wanted in my output table. In this code, we specify that we don't want to pivot Participant_ID and create columns called Vars and Response.
subsetlong <- subset |> pivot_longer(-c("Participant_ID"), names_to = "Vars", values_to = "Response")
This is what subsetlong looks like:
Participant_ID  Vars  Response
1               Var1  Yes
1               Var2  No
2               Var1  No
2               Var2  No
Then we calculated the counts by Vars, putting that into a new dataframe called counts:
counts <- subsetlong |> group_by(Vars) |> count(Response)
And this is what counts looks like:
Vars  Response  n
Var1  Yes       1
Var1  No        7
Var2  Yes       5
Var2  No        2
Now that the calculation was done, we pivoted this back to wide again, specifying that any NAs should appear as 0s:
counts_wide <- counts |> pivot_wider(names_from="Response", values_from="n", values_fill = 0)
And finally got the desired structure:
Vars  Yes  No
Var1  1    7
Var2  5    2
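Put together, the same steps can be chained into a single pipeline (a sketch, assuming the tidyverse is loaded and data is the raw wide table shown above):
counts_wide <- data |>
  select(Participant_ID, Var1, Var2) |>
  pivot_longer(-Participant_ID, names_to = "Vars", values_to = "Response") |>
  group_by(Vars) |>
  count(Response) |>
  pivot_wider(names_from = "Response", values_from = "n", values_fill = 0)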

Finding the minimum value of multiple variables by group

I would like to find, for each of several variables, the minimum value of another variable (time) at which that variable equals 1 (or any other value). Basically, my application is finding the first year in which x == 1, for several x. I know how to find this for one x, but would like to avoid generating multiple reduced data frames of minima and then merging them together. Is there an efficient way to do this? Here are my example data and my solution for one variable:
library(plyr)
d <- data.frame(cat = c(rep("A", 10), rep("B", 10)),
                time = c(1:10),
                var1 = c(0,0,0,1,1,1,1,1,1,1,0,0,0,0,0,0,1,1,1,1),
                var2 = c(0,0,0,0,1,1,1,1,1,1,0,0,0,0,0,0,0,1,1,1))
ddply(d[d$var1 == 1,], .(cat), summarise,
      start = min(time))
How about this, using dplyr:
library(dplyr)
d %>%
  group_by(cat) %>%
  summarise_at(vars(contains("var")), funs(time[which(. == 1)[1]]))
Which gives
# A tibble: 2 x 3
# cat var1 var2
# <fct> <int> <int>
# 1 A 4 5
# 2 B 7 8
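funs() has since been deprecated in dplyr; an across() version of the same idea (a sketch, assuming dplyr 1.0 or later) is:
d %>%
  group_by(cat) %>%
  summarise(across(contains("var"), ~ time[which(.x == 1)[1]]))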
We can use base R to get the minimum 'time' among all the 'var' columns, grouped by 'cat':
sapply(split(d[-1], d$cat), function(x)
  x$time[min(which(x[-1] == 1, arr.ind = TRUE)[, 1])])
#A B
#4 7
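If you instead want one value per var column (matching the dplyr output above), a base R sketch along the same lines, assuming the same d, is:
sapply(split(d, d$cat), function(x)
  sapply(x[startsWith(names(x), "var")], function(v) x$time[which(v == 1)[1]]))
#      A B
# var1 4 7
# var2 5 8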
Is this something you are expecting?
library(dplyr)
df <- d %>%
  group_by(cat, var1, var2) %>%
  summarise(start = min(time)) %>%
  filter()
I have left a blank filter() call that you can use to specify any filter condition you want (say var1 == 1 or cat == "A").
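For example, to keep only the combinations where var1 == 1 (a sketch based on the code above):
d %>%
  group_by(cat, var1, var2) %>%
  summarise(start = min(time)) %>%
  filter(var1 == 1)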

To create a frequency table with dplyr to count the factor levels and missing values and report it

Some questions are similar to this one (here or here, for example), and I know one solution that works, but I want a more elegant answer.
I work in epidemiology and I have variables coded 1 and 0 (or NA). Example:
Does the patient have cancer?
NA or 0 is no
1 is yes
Let's say I have several variables in my dataset and I want to count only the values equal to 1. It's a classic frequency table, but dplyr is making this more complicated than I imagined at first glance.
My code is working:
dataset %>%
  select(VISimpair, HEARimpai, IntDis, PhyDis, EmBehDis, LearnDis,
         ComDis, ASD, HealthImpair, DevDelays) %>% # replace to your needs
  summarise_all(funs(sum(1 - is.na(.))))
And you can reproduce this code here:
library(tidyverse)
dataset <- data.frame(var1 = rep(c(NA,1),100), var2=rep(c(NA,1),100))
dataset %>% select(var1, var2) %>% summarise_all(funs(sum(1-is.na(.))))
But I really want to select all the variables of interest, count how many 0s (or NAs) and how many 1s each one has, and report that in a single output.
Thanks.
What about the following frequency table per variable?
First, I edit your sample data to also include 0's and load the necessary libraries.
library(tidyr)
library(dplyr)
dataset <- data.frame(var1 = rep(c(NA,1,0),100), var2=rep(c(NA,1,0),100))
Second, I convert the data using gather to make it easier to group_by later for the frequency table created by count, as mentioned by CPak.
dataset %>%
  select(var1, var2) %>%
  gather(var, val) %>%
  mutate(val = factor(val)) %>%
  group_by(var, val) %>%
  count()
# A tibble: 6 x 3
# Groups: var, val [6]
var val n
<chr> <fct> <int>
1 var1 0 100
2 var1 1 100
3 var1 NA 100
4 var2 0 100
5 var2 1 100
6 var2 NA 100
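gather() has since been superseded by pivot_longer(); an equivalent sketch (assuming tidyr 1.0 or later) is:
dataset %>%
  pivot_longer(c(var1, var2), names_to = "var", values_to = "val") %>%
  count(var, val)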
A quick and dirty method to do this is to coerce your input into factors:
dataset$var1 = as.factor(dataset$var1)
dataset$var2 = as.factor(dataset$var2)
summary(dataset$var1)
summary(dataset$var2)
summary() tells you the number of occurrences of each level of the factor.
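To apply the same idea to several columns at once, a quick sketch (assuming the two-variable dataset above) is:
sapply(dataset[c("var1", "var2")], function(x) summary(factor(x, exclude = NULL))) # exclude = NULL keeps NA as a level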
