Subsets in R studio (basic questions)? - r

I am terrible with R and I am trying to figure out subsets. I have read the data file into RStudio via:
> Vehicle_Data <-read.table("VehicleData.txt.txt", header=T,sep="\t",quote="")
> attach(Vehicle_Data)
I'm confused about subsets. One of the columns in my data is Type which includes a variety of vehicle types. I need to narrow down Car within the type column so I can calculate the mean MPG value of the cars only.
Here's what I have tried:
> TypeCar<-subset(Vehicle_Data, Type=="Car")
I think this worked to subset the data, but I'm not sure. Also I have no idea how to calculate the mean MPG from the subset?

The code for subsetting appears to be fine. To calculate the mean, you need to use the mean() function in this way:
mean_mpg <- mean(TypeCar$MPG, na.rm = TRUE)
Setting na.rm = TRUE makes mean() drop any NA values in MPG before averaging.
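If you want to confirm that the subset really contains only cars before taking the mean, here is a minimal sketch on a made-up data frame. Only the column names Type and MPG come from the question; the rows are invented for illustration.

```r
# Toy stand-in for Vehicle_Data; the values are invented for illustration.
Vehicle_Data <- data.frame(
  Type = c("Car", "Truck", "Car", "SUV", "Car"),
  MPG  = c(30, 18, NA, 22, 26)
)

TypeCar <- subset(Vehicle_Data, Type == "Car")

# Every remaining row should have Type "Car".
unique(TypeCar$Type)
nrow(TypeCar)

# The NA is dropped before averaging, so this is the mean of 30 and 26.
mean_mpg <- mean(TypeCar$MPG, na.rm = TRUE)
mean_mpg
```

On your real data, comparing nrow(TypeCar) with table(Vehicle_Data$Type) is a quick sanity check.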

You can use the tidyverse (dplyr) to perform data transformations such as subsetting (filtering):
library(dplyr)
Vehicle_Data %>%
  filter(Type == "Car")
You can also calculate the mean MPG per Type like so:
Vehicle_Data %>%
  group_by(Type) %>%
  summarise(mean.MPG = mean(MPG, na.rm = TRUE))
If you'd like to calculate the mean of an existing subset of data (i.e. TypeCar), you can just run mean(TypeCar$MPG, na.rm = TRUE)
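For comparison, the same per-type means can be sketched in base R without loading any packages. The data frame below is invented for illustration; tapply() is swapped in for group_by()/summarise().

```r
# Toy stand-in for Vehicle_Data; the values are invented for illustration.
Vehicle_Data <- data.frame(
  Type = c("Car", "Truck", "Car", "SUV", "Truck"),
  MPG  = c(30, 18, 26, 22, NA)
)

# Mean MPG for each Type; na.rm = TRUE is passed through to mean().
mean_by_type <- tapply(Vehicle_Data$MPG, Vehicle_Data$Type, mean, na.rm = TRUE)
mean_by_type

# Mean MPG of cars only, without creating an intermediate subset.
car_mean <- mean(Vehicle_Data$MPG[Vehicle_Data$Type == "Car"], na.rm = TRUE)
car_mean
```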

Related

Using summarize_all() for multiple data types

I am attempting to use the new dplyr scoped summarise verbs to build a summary data frame for each of a set of outcomes, grouped by treatment arm, containing statistics for both numeric variables (2.5th, 50th, 97.5th percentiles) and categorical predictor variables (counts). The computations appear to work (thanks to this massively helpful post - r summarize_if with multiple conditions).
However, the data frames are hard to read: quantile() and table() put lists into each cell, so I cannot scroll through the results in the RStudio viewer. Does anybody have suggestions for how to reorganize or view this data frame in RStudio so that the full contents of these lists are visible?
Thank you kindly!
outcome.dfs <- list()
dt <- data.table(short.vars.full) # convert to data table to allow for subsetting with column name stored in variable
for (ae.outcome in outcomes.list) {
  outcome.dfs[[ae.outcome]] <- dt[get(ae.outcome) == ae.outcome, ] %>%
    select(-USUBJID) %>%
    group_by(XDARM) %>%
    summarise_all(~ if (is.numeric(.))
      list(format(round(quantile(., probs = c(0.025, 0.50, 0.975), na.rm = TRUE), 2), nsmall = 2))
    else if (is.factor(.))
      list(table(.)))
}
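One possible way to avoid list cells is to give each percentile its own column, so the viewer shows plain numbers. This is only a sketch in base R, not the scoped-verb approach from the question. XDARM is the grouping column from the question; AGE, SEX and all the values are hypothetical.

```r
# Toy data: XDARM comes from the question, AGE and SEX are hypothetical.
short.vars <- data.frame(
  XDARM = c("A", "A", "A", "B", "B", "B"),
  AGE   = c(30, 40, 50, 35, 45, 55),
  SEX   = factor(c("F", "M", "F", "M", "M", "F"))
)

probs <- c(0.025, 0.5, 0.975)

# One row per arm, one column per percentile.
q <- t(sapply(split(short.vars$AGE, short.vars$XDARM),
              quantile, probs = probs, na.rm = TRUE))

age_summary <- data.frame(XDARM = rownames(q), round(q, 2), check.names = FALSE)
age_summary

# Counts of a categorical variable per arm as a flat cross-tabulation.
sex_counts <- table(short.vars$XDARM, short.vars$SEX)
sex_counts
```

Because every cell of age_summary holds a single number, it can be browsed with View() like any other data frame.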

summarizing across multiple variables and assign to new variable names

I am using the 'across' function to get summary statistics for a series of variables (for example, all variables that start with 'f_'). Since across() stores the summarised results under the original variable names, calling it several times with different summary functions overwrites the earlier results (as shown below).
I can think of a work-around by renaming the variables after summarise() and cbind the resulting individual tables. However, that seems cumbersome, and I am wondering if there is a tidy (pun intended) way to store the series of summarised results to new variable names.
var_stats = data %>%
  summarise(
    across(starts_with('f_'), max, na.rm = TRUE),
    across(starts_with('f_'), min, na.rm = TRUE)
  )
With across you can use multiple summary functions at the same time and control the names of the variables. Does the following example help you?
mtcars %>%
  summarise(across(starts_with("c"), list(custom_min = min, custom_max = max), .names = "{.col}_{.fn}"))
  cyl_custom_min cyl_custom_max carb_custom_min carb_custom_max
1              4              8               1               8
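The question also passed na.rm to the summary functions. A hedged sketch of one way to keep that while naming the outputs is to wrap each function in a lambda. The f_a and f_b columns and their values are invented for illustration, and recent dplyr versions discourage passing extra arguments through across() itself.

```r
library(dplyr)

# Toy data with the 'f_' prefix from the question; the values are invented.
data <- data.frame(
  f_a   = c(1, NA, 3),
  f_b   = c(10, 20, NA),
  other = 1:3
)

var_stats <- data %>%
  summarise(across(
    starts_with("f_"),
    list(min = ~ min(.x, na.rm = TRUE),
         max = ~ max(.x, na.rm = TRUE)),
    .names = "{.col}_{.fn}"
  ))

names(var_stats)
```

Each input column yields one output column per function, named from the column name and the function's label in the list.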

How to find the mean and standard deviation of rows in dataframes with some having NAs and others not

I'm trying to find the mean and standard deviation for C and P separately.
I have toyed around with this so far:
C <- rowMeans(dplyr::select(total, C1:41), na.rm=TRUE)
This didn't yield what I needed it to.
Then I thought about just using the summary, but again it didn't give me what I needed.
So then I thought of using na.omit:
Of course though, this would take out all of the data since I have NAs throughout the dataframe.
What am I missing here? Is this a matter of aggregating my data into certain groups?
I know describeby could force these descriptives, but again I'm not sure how to do that.
So, I think the angle I want to take is to order these, then aggregate and find totals, and then find the descriptives using describeby in order to avoid NAs. I'm stuck though. Where am I going wrong?
Try this:
library(dplyr)
total %>%
  # Select only columns whose names start with S,
  # i.e. SP and SC
  select(starts_with('S')) %>%
  # Get the data in long format, remove NA values
  tidyr::pivot_longer(cols = everything(), values_drop_na = TRUE) %>%
  # Create a group for each participant
  group_by(grp = c('Participant1', 'Participant2')[grepl('C\\d+', name) + 1]) %>%
  # Take mean and standard deviation for each group
  summarise(mean = mean(value), sd = sd(value))
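The same idea (pool every value in a set of columns, ignoring NAs) can be sketched in base R. The column names SC1, SC2, SP1 and SP2 and all values below are assumptions made for illustration, since the real column names are not shown in the question.

```r
# Toy data; the column names and values are assumed for illustration.
total <- data.frame(
  SC1 = c(1, NA, 3),
  SC2 = c(2, 4, NA),
  SP1 = c(10, 20, NA),
  SP2 = c(NA, 30, 40)
)

# Pool all values from the C columns and from the P columns.
c_vals <- unlist(total[grepl("^SC", names(total))])
p_vals <- unlist(total[grepl("^SP", names(total))])

stats <- data.frame(
  group = c("C", "P"),
  mean  = c(mean(c_vals, na.rm = TRUE), mean(p_vals, na.rm = TRUE)),
  sd    = c(sd(c_vals, na.rm = TRUE), sd(p_vals, na.rm = TRUE))
)
stats
```

Unlike na.omit() on the whole data frame, na.rm = TRUE discards only the missing cells, not entire rows.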

In R, how do I compute mean and standard error of a subset of data, grouped by multiple columns, and output this into a new data frame?

I have a dataset (named 'gala') that has the columns "Day", "Tree", "Trt", and "LogColumn". The data were collected over time, so each numbered tree within a treatment is the same tree across all days. The tree numbers are repeated for each treatment (e.g. there is a tree "1" in multiple treatments).
I would like to compute the mean and standard error for the 'LogColumn' column, for each tree per each treatment per each day (e.g. I will have a mean + standard error for day 1, tree one, treatment x, etc.), and output the mean and standard error results into a new data frame that also includes the original day, Tree, Trt values.
I have been unsuccessfully trying to make a Frankenstein of codes from other Stack Overflow answers, but I cannot seem to find one that has all the components at once. If I missed this, I am sorry, and please let me know with a link to this answer. I am new to coding, and R, and do not understand well how other codes not directly relating to what I would like to do can be applied.
At this point, I have this, but do not know if it is anywhere near correct (I am also currently getting the error message "object of type 'closure' is not subsettable"):
TreeAverages <- data.table[, MeanLog=mean(gala$LogColumn), se=std.error(gala$LogColumn), by=c("Day","Tree","Trt")]
Any help is greatly appreciated. Thank you!
If you're using data.table, remember to convert gala into a data.table object first.
library(data.table)
library(plotrix) # assumption: std.error() comes from plotrix; it is not in base R
gala = data.table(gala)
gala_output = gala[, .("MeanLog" = mean(LogColumn),
                       "std" = std.error(LogColumn)),
                   by = c("Day", "Tree", "Trt")]
You were really close. Inside [ ], data.table looks up column names in the table itself, much as dplyr verbs do, so you can write LogColumn by name instead of gala$LogColumn.
.() is just a shorthand for list(), so I'm specifying that data.table should return the columns MeanLog and std grouped by Day, Tree, and Trt.
Using base R aggregate:
aggregate(LogColumn ~ Day + Tree + Trt, data = gala,
FUN = function(x) c(mean = mean(x), se = std.error(x)))
Using dplyr:
library(dplyr)
df <- gala %>%
  group_by(Day, Tree, Trt) %>%
  summarise(mean = mean(LogColumn),
            se = sd(LogColumn) / sqrt(n())) # standard error = sd / sqrt(group size)
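If you would rather not depend on a package for the standard error, here is a minimal base R sketch with a hand-written helper. The helper name se and the toy values are assumptions; the column names come from the question. aggregate() returns the two statistics as one matrix column, so do.call(data.frame, ...) is used to flatten it.

```r
# Hypothetical helper: standard error of the mean, ignoring NAs.
se <- function(x) {
  x <- x[!is.na(x)]
  sd(x) / sqrt(length(x))
}

# Toy stand-in for gala; the values are invented for illustration.
gala <- data.frame(
  Day       = c(1, 1, 1, 1),
  Tree      = c(1, 1, 2, 2),
  Trt       = c("x", "x", "x", "x"),
  LogColumn = c(1, 3, 2, 6)
)

agg <- aggregate(LogColumn ~ Day + Tree + Trt, data = gala,
                 FUN = function(x) c(mean = mean(x), se = se(x)))

# Flatten the matrix column into LogColumn.mean and LogColumn.se.
TreeAverages <- do.call(data.frame, agg)
TreeAverages
```

A group with only one observation gets an NA standard error, because sd() of a single value is NA.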

Getting "NA" when I run a standard deviation

Quick question. I read my csv file into the variable data. It has a column labeled var, which holds numeric values.
When I run the command
sd(data$var)
I get
[1] NA
instead of my standard deviation.
Could you please help me figure out what I am doing wrong?
Try sd(data$var, na.rm=TRUE); any NAs in the column var will then be ignored. It also pays to inspect your data to make sure the NAs really should be NAs and are not the result of read-in errors. Commands like head(data), tail(data), and str(data) should help with that.
I've made the mistake a time or two of reusing variable names within a dplyr pipeline, which caused this kind of issue.
mtcars %>%
  group_by(gear) %>%
  mutate(ave = mean(hp)) %>%
  ungroup() %>%
  group_by(cyl) %>%
  summarise(med = median(ave),
            ave = mean(ave), # should've named this variable something different
            sd = sd(ave)) # this is the sd of my newly created "ave" (a single value per group, so NA), not the original one.
You probably have missing values in var, or the column is not numeric, or there's only one row.
Try removing missing values which will help for the first case:
sd(dat$var, na.rm = TRUE)
If that doesn't work, check that
class(dat$var)
is "numeric" (the second case) and that
nrow(dat)
is greater than 1 (the third case).
Finally, data is a function in R so best to use a different name, which I've done here.
There may be Inf or -Inf values in the column.
Try
is.finite(data$var)
or
min(data$var, na.rm = TRUE)
max(data$var, na.rm = TRUE)
to check whether that is the case.
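The checks suggested in these answers can be tried on a small example. The values below are invented for illustration; var mirrors the column name from the question, and dat is used instead of data to avoid masking the built-in function.

```r
# Invented values for illustration.
dat <- data.frame(var = c(2.5, NA, 4.1, 3.3))

n_missing <- sum(is.na(dat$var))        # how many values are missing
sd_raw    <- sd(dat$var)                # NA, because one value is missing
sd_clean  <- sd(dat$var, na.rm = TRUE)  # computed from the three observed values

sd_single <- sd(7)                      # NA: a single value has no spread to estimate

# is.finite() is FALSE for NA, NaN, Inf and -Inf, so it catches all of them.
y <- c(1, 2, Inf)
n_bad     <- sum(!is.finite(y))
sd_finite <- sd(y[is.finite(y)])
```

Whether dropping the non-finite values is appropriate depends on why they are in the data, so it is worth checking how they got there first.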
