Using R's dplyr, is there a clean way to use the across() function to select variables not already captured in other across() statements?
For example, I could have a data set that I want to summarise on a 1-to-1 basis (i.e. the output columns have the same structure as the input data set), applying a first-type function to the character fields, mean to a select number of numeric fields, and then sum to all remaining fields.
Typically I have a large number of columns with various selection methods, so explicitly working out the remaining columns can be quite onerous. To date I have found workarounds for the particular data sets I am working with, but a generic solution would be very useful.
The code below shows what I would like to run and the resulting output. remaining_columns() is made up.
library(dplyr)
first.if.unique <- function(x) if (length(unique(x)) == 1) x[1] else NA
x <- starwars %>% select(species, sex, mass, height) %>% head(10)
cols_to_average <- c("mass")
x %>%
  group_by(sex) %>%
  summarise(
    across(where(is.character), first.if.unique),
    across(any_of(cols_to_average), mean),
    across(remaining_columns(), sum)
  )
Input / Output data sets:
> print(x)
# A tibble: 10 x 4
species sex mass height
<chr> <chr> <dbl> <int>
1 Human male 77 172
2 Droid none 75 167
3 Droid none 32 96
4 Human male 136 202
5 Human female 49 150
6 Human male 120 178
7 Human female 75 165
8 Droid none 32 97
9 Human male 84 183
10 Human male 77 182
> print(y)
# A tibble: 3 x 4
sex species mass height
<chr> <chr> <dbl> <int>
1 female Human 62 315
2 male Human 98.8 917
3 none Droid 46.3 360
I personally don't see a way to achieve this apart from explicitly negating the previous selections.
x %>%
  group_by(sex) %>%
  summarise(
    across(where(is.character), first.if.unique),
    across(any_of(cols_to_average), mean),
    across(!where(is.character) & !any_of(cols_to_average), sum)
  )
# A tibble: 3 x 4
sex species mass height
<chr> <chr> <dbl> <int>
1 female Human 62 315
2 male Human 98.8 917
3 none Droid 46.3 360
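Another workaround sketch along the same lines: precompute the leftover column names with setdiff() so the negative selection only has to be written once. Here remaining is a plain character vector, not a tidyselect helper.
# Sketch: columns not covered by the other across() selections.
remaining <- setdiff(
  names(x),
  c(names(x)[sapply(x, is.character)], cols_to_average)
)

x %>%
  group_by(sex) %>%
  summarise(
    across(where(is.character), first.if.unique),
    across(any_of(cols_to_average), mean),
    across(any_of(remaining), sum)
  )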
Thank you in advance for any assistance.
Aim: I have a 5-day food intake survey dataset that I am trying to analyse in R. I am interested in calculating the mean, SE, min, and max weight of a specific food consumed per day.
I could complete this more easily in Excel, but due to the scale of the data I need R to do it.
Example question: what is a person's daily intake (g) of lettuce? [mean, standard deviation, standard error, min, and max]
Example extraction dataset (note the actual dataset includes a number of foods and a large number of participants):
participant  day  code  foodname  weight
132          1    62    lettuce   53
84           3    62    lettuce   23
132          3    62    lettuce   32
153          4    62    lettuce   26
142          2    62    lettuce   23
123          3    62    lettuce   23
131          3    62    lettuce   30
153          5    62    lettuce   16
At present:
library(foreign)  # for read.spss()
library(dplyr)    # for filter()

# import dataset
foodsurvey <- read.spss("foodsurvey.sav", to.data.frame = TRUE, use.value.labels = TRUE)
summary(foodsurvey)

# keep my relevant columns
myvariables <- subset(foodsurvey, select = c(1, 2, 3, 4, 5))

# rename columns
colnames(myvariables) <- c('participant', 'day', 'foodcode', 'foodname', 'foodweight')

# create values
day <- myvariables$day
participant <- myvariables$participant
foodcode <- myvariables$foodcode
foodname <- myvariables$foodname
foodweight <- myvariables$foodweight

# extract lettuce by ID code to be analysed
lettuce <- filter(myvariables, foodcode == "62")
dim(lettuce)
str(lettuce)
# errors arise attempting to analyse consumption (weight) of lettuce per day using ops.factor function
# to analyse the outputs
summary(lettuce/days)
quantile(lettuce/foodweight)
max(lettuce)
min(lettuce)
median(lettuce)
mean(lettuce)
This should give you the mean, standard deviation, standard error, min, and max food weight for each participant and food type combination across the days:
library(dplyr)
myvariables %>%
  filter(foodname == "lettuce") %>%
  group_by(participant) %>%
  summarise(mean = mean(foodweight, na.rm = TRUE),
            max_val = max(foodweight),
            min_val = min(foodweight),
            sd = sd(foodweight, na.rm = TRUE),
            se = sqrt(var(foodweight, na.rm = TRUE) / length(foodweight)))
Here's a method that groups by participant and food itself to give summaries across everything.
dplyr
library(dplyr)
dat %>%
group_by(participant, foodname) %>%
summarize(
across(weight, list(min = min, mean = mean, max = max,
sigma = sd, se = ~ sd(.)/n()))
) %>%
ungroup()
# # A tibble: 6 x 7
# participant foodname weight_min weight_mean weight_max weight_sigma weight_se
# <int> <chr> <int> <dbl> <int> <dbl> <dbl>
# 1 84 lettuce 23 23 23 NA NA
# 2 123 lettuce 23 23 23 NA NA
# 3 131 lettuce 30 30 30 NA NA
# 4 132 lettuce 32 42.5 53 14.8 7.42
# 5 142 lettuce 23 23 23 NA NA
# 6 153 lettuce 16 21 26 7.07 3.54
Once you have those summaries, you can easily filter for one participant, a specific food, etc. If you need to also group by code, just add it to the group_by.
The premise of using summarise(across(...)) is that the first argument includes whichever variables you want to summarize (just weight here, but you can add others if it makes sense), and the second argument is a list of functions in various forms. It accepts a plain function symbol (e.g., mean), a tilde-function facilitated by rlang (e.g., ~ sd(.) / n(), where n() is a dplyr-special function), or a regular anonymous function (e.g., function(z) sd(z)/length(z), not shown here). The "name" on the LHS of each listed function is used in the resulting column name.
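For completeness, a minimal sketch of the anonymous-function form mentioned above, computing the same se column without the tilde shorthand:
dat %>%
  group_by(participant, foodname) %>%
  summarize(
    # function(z) ... is plain base R; no rlang shorthand needed
    across(weight, list(se = function(z) sd(z) / length(z)))
  ) %>%
  ungroup()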
I have this tibble
host_id district availability_365
<dbl> <chr> <dbl>
1 8573 Fatih 280
2 3725 Maltepe 365
3 1428 Fatih 355
4 6284 Fatih 164
5 3518 Esenyurt 0
6 8427 Esenyurt 153
7 4218 Fatih 0
8 5342 Kartal 134
9 4297 Pendik 0
10 9340 Maltepe 243
# … with 51,342 more rows
I want to find out, per district, what proportion of the hosts have all their rooms at availability_365 == 0. As you can see there are 51,352 rows, but the hosts are not all distinct: there are exactly 37,572 different host_ids.
I know that I can use group_by(district) to split the data up into the 5 different districts, but I am not quite sure how to work out what percentage of the hosts only have rooms with no availability. Can anybody help me out here?
Use the summarise() function along with group_by() from dplyr; this gives the share of rows per district with zero availability:
library(dplyr)
df %>%
  group_by(district) %>%
  summarise(Zero_Availability = sum(availability_365 == 0) / n())
# A tibble: 5 x 2
district Zero_Availability
<chr> <dbl>
1 Esenyurt 0.5
2 Fatih 0.25
3 Kartal 0
4 Maltepe 0
5 Pendik 1
It's difficult to be sure my answer works without the actual data, but if you're open to using data.table, the following should work:
library(data.table)
setDT(data)
data[, .(no_avail = all(availability_365 == 0)), .(host_id, district)][, .(
prop_no_avail = sum(no_avail) / .N
), .(district)]
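For comparison, a dplyr sketch of the same two-step logic: first collapse to one row per host (is every listing of that host at zero availability?), then take the district-level proportion.
library(dplyr)
data %>%
  group_by(host_id, district) %>%
  summarise(no_avail = all(availability_365 == 0), .groups = "drop") %>%
  group_by(district) %>%
  summarise(prop_no_avail = sum(no_avail) / n())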
I'm trying to work out the total volume remaining and the average volume for a large data set, which I thought would be a simple case of using rowSums and rowMeans on the data frame I created with pivot_wider, but I keep encountering the same errors.
library(tidyr)

df <- data.frame(
  parent = c("001","001","001","001","002","002","002","002",
             "003","003","003","003","004","004","004","004"),
  tube = c("tube1","tube2","tube3","tube4","tube1","tube2","tube3","tube4",
           "tube1","tube2","tube3","tube4","tube1","tube2","tube3","tube4"),
  microlitres = c(100,120,60,100,NA,200,100,120,
                  60,100,120,40,100,120,400,NA)
)

df <- pivot_wider(df, names_from = tube, values_from = microlitres)
df$sum <- rowSums(df, na.rm = TRUE)
I get "Error: x must be numeric", and when I alter the code to
df$sum <- rowSums(as.numeric(df), na.rm = TRUE)
I get "Error: List object cannot be coerced to double".
I've spent a long time googling and haven't come across anything that helps. I'm sure there's a simple fix, but I just can't see it. I've tried using mutate with nested rowSums, unlist(), and converting the data to a matrix. I'd be very grateful for any help and advice!
I hope the output is the one you had in mind:
library(dplyr)
df %>%
  rowwise() %>%
  mutate(sum_cols = sum(c_across(tube1:tube4), na.rm = TRUE),
         mean_cols = mean(c_across(tube1:tube4), na.rm = TRUE))
# A tibble: 4 x 7
# Rowwise:
parent tube1 tube2 tube3 tube4 sum_cols mean_cols
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 001 100 120 60 100 380 95
2 002 NA 200 100 120 420 140
3 003 60 100 120 40 320 80
4 004 100 120 400 NA 620 207.
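If the real data has many rows, rowwise() can be slow. A vectorised sketch of the same idea passes just the tube columns to rowSums()/rowMeans() via across() (in dplyr >= 1.1, pick(tube1:tube4) is the preferred spelling):
df %>%
  mutate(sum_cols = rowSums(across(tube1:tube4), na.rm = TRUE),
         mean_cols = rowMeans(across(tube1:tube4), na.rm = TRUE))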
This should work:
df$sum <- rowSums(sapply(df, as.numeric), na.rm = TRUE)
The problem is that the parent column is character (so the data frame is not all-numeric), and as.numeric() takes a vector as input, not a whole data frame; sapply() applies it column by column.
> df
# A tibble: 4 x 6
parent tube1 tube2 tube3 tube4 sum
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 001 100 120 60 100 762
2 002 NA 200 100 120 422
3 003 60 100 120 40 646
4 004 100 120 400 NA 624
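Alternatively, a base R sketch that sidesteps the coercion entirely by summing only the numeric tube columns and leaving the character parent column out:
tube_cols <- c("tube1", "tube2", "tube3", "tube4")
df$sum  <- rowSums(df[, tube_cols], na.rm = TRUE)
df$mean <- rowMeans(df[, tube_cols], na.rm = TRUE)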
First of all - apologies, I'm new to all of this, so I may write things in a confusing way.
I have multiple .csv files that I need to read, and to save a lot of time I am looking to find an automated way of doing this.
I am looking to read different rows of the .csv and store the information as two separate files, based on the information stored in the last column.
My data is specifically areas and slices of a 3D image, which I will use to compile volumes. If two rows have the same slice, then I need to separate them, as the area in row 1 corresponds to a different structure from the one measured in row 2 on the same slice.
Eg:
Row,area,slice
1,50,180
2,52,180
3,49,181
4,53,181
5,65,182
6,60,183
So structure 1 has an area at slice 180 (area = 50) and at slice 181 (area = 49), whereas structure 2 has an area at each slice from 180 to 183.
I want to be able to store all the bold data (rows 2, 4, 5, and 6 above) in one .csv, and all the other data in another .csv.
There may be .csv files with more or less overlapping slice values, adding complexity to this.
Thank you for the help, please let me know if I need to clarify anything.
Use duplicated:
dat <- read.csv(text="
Row,area,slice
1,50,180
2,52,180
3,49,181
4,53,181
5,65,182
6,60,183")
dat[duplicated(dat$slice),]
# Row area slice
# 2 2 52 180
# 4 4 53 181
dat[!duplicated(dat$slice),]
# Row area slice
# 1 1 50 180
# 3 3 49 181
# 5 5 65 182
# 6 6 60 183
(Whether you write each of these last two frames to files or store them for later use is up to you.)
duplicated() normally returns TRUE for the second and subsequent occurrences of the field(s). Your logic of 2, 4, 5, 6 is more along the lines of "last of the dupes" or "no dupes", which is a little different.
library(dplyr)
dat %>%
group_by(slice) %>%
slice(-n()) %>%
ungroup()
# # A tibble: 2 x 3
# Row area slice
# <int> <int> <int>
# 1 1 50 180
# 2 3 49 181
dat %>%
group_by(slice) %>%
slice(n()) %>%
ungroup()
# # A tibble: 4 x 3
# Row area slice
# <int> <int> <int>
# 1 2 52 180
# 2 4 53 181
# 3 5 65 182
# 4 6 60 183
Similarly, with data.table:
library(data.table)
as.data.table(dat)[, .SD[.N,], by = .(slice)]
# slice Row area
# 1: 180 2 52
# 2: 181 4 53
# 3: 182 5 65
# 4: 183 6 60
as.data.table(dat)[, .SD[-.N,], by = .(slice)]
# slice Row area
# 1: 180 1 50
# 2: 181 3 49
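To finish the original ask (two .csv files), either pair of subsets can be written out with write.csv(); the file names below are placeholders, and the mapping of subsets to structures follows the example in the question:
library(data.table)
structure2 <- as.data.table(dat)[, .SD[.N, ], by = .(slice)]   # rows 2, 4, 5, 6
structure1 <- as.data.table(dat)[, .SD[-.N, ], by = .(slice)]  # rows 1, 3
write.csv(structure2, "structure2.csv", row.names = FALSE)     # hypothetical file name
write.csv(structure1, "structure1.csv", row.names = FALSE)     # hypothetical file name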
I am trying to add the sum of all the counts in a specific vector to my data frame in R. Specifically, I want to keep all the counts and then add a sum at the end. In Excel, you would do =SUM(A1:A5232). Additionally, I don't know the length of the specific vector. See below:
# summarise by column name
library(dplyr)

NewDepartment <- List %>%
  group_by(NewDepartment) %>%
  tally(sort = TRUE)
The above code will give me the following:
NewDepartment n
<chr> <int>
1 <NA> 709
2 Collections 454
3 Telesales 281
4 Operations Control Management 93
5 Underwriting 92
I want a total count at the end like this:
NewDepartment n
<chr> <int>
1 <NA> 709
2 Collections 454
3 Telesales 281
4 Operations Control Management 93
5 Underwriting 92
6 Total Sum 1721
How do I get row 6 above?
Try this:
NewDepartment <- rbind(
  NewDepartment,
  data.frame(NewDepartment = "Total Sum", n = sum(NewDepartment$n))
)
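An equivalent sketch with tibble::add_row(), which keeps the tibble class and avoids spelling out a data.frame():
library(tibble)
NewDepartment <- add_row(NewDepartment,
                         NewDepartment = "Total Sum",
                         n = sum(NewDepartment$n))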