R - Weighted Average by Group dplyr Not Working

I am trying to calculate the weighted average by group in R, but it is only returning the weighted average of the whole dataset, and I haven't been able to determine where my issue is occurring. Below is my code. Note that in the weighted.mean call, if I do not prefix the columns with the data frame name, nothing is returned, so I am not sure whether the way I am referencing the data is causing the issue.
unit_averages = selected_units %>%
  group_by(`Length x Width`, Date) %>%
  summarise(index_mean = weighted.mean(selected_units$"Wtd Avg Price", w = selected_units$"Unit Count"))

#akrun provided the answer, but they posted it as a comment rather than an answer, so I am posting this to close out the inquiry.
"Remove the selected_units$ and use backquotes for the column names with spaces" – akrun, Jun 10 at 18:03
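Applying that fix, the corrected pipeline would look like the sketch below (assuming the column names from the question). Inside summarise(), columns are referenced bare; keeping the selected_units$ prefix pulls the full, ungrouped vectors, which is why the whole-dataset average was returned:

library(dplyr)

unit_averages <- selected_units %>%
  group_by(`Length x Width`, Date) %>%
  summarise(index_mean = weighted.mean(`Wtd Avg Price`, w = `Unit Count`))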

Related

Looking for an R function to divide data by date

I'm just 2 days into R, so I hope I can give enough info on my problem.
I have an Excel table on endothelial cell angiogenesis with technical repeats on 4 different dates. (But those dates are not in order and are in different weeks.)
My data looks like this (of course it's not only the 2nd of March):
I want to average the data on those 4 different days, so I can compare e.g. the "Nb Nodes" from day 1 to day 4,
and finally have a jitterplot containing the group, the investigated data point, and the date.
I'm a medical student, so I don't really have any knowledge about this kind of stuff yet, but I'm trying to learn it. Hopefully I provided enough info!
Found the solution:
# Group by experiment date and group
library(dplyr)
DateGroup <- group_by(Exclude0, Exp.Date, Group)
# Summarise the mean for every group and date; inside summarise(),
# columns are referenced bare, without a data-frame prefix
summarise(DateGroup, mymean = mean(`Nb meshes`))
I think the code below will work.
1. group_by() the dimension you want to summarize by.
2a. across() is a helper verb, so you don't need to type out each column manually; it lets us use tidyselect language to quickly reference all columns that contain "Nb" (a pattern I noticed from your screenshot).
2b. The second argument of across() is the function you want to apply to each column selected by the first argument.
2c. The optional .names argument of across() gives the new columns a consistent naming convention.
Good luck on your R learning! It's a really great language and you made the right choice.
# df is your data frame; a named list of functions makes {.fn} resolve to "mean"
df %>% group_by(Exp.Date) %>%
  summarize(across(contains("Nb"), list(mean = mean), .names = "{.fn}_{.col}"))
# if you just want a single column, then do this
df %>% group_by(Exp.Date) %>%
  summarize(mean_nb_nodes = mean(`Nb nodes`))

Finding the mean of a subset with 2 variables in R

Sincere apologies if my terminology is inaccurate; I am very new to R and programming in general (<1 month experience). I was recently given the opportunity to do data analysis on a project I wish to write up for a conference and could use some help.
I have a csv file (cida_ams_scc_csv) with patient data from a recent study. It's a data frame with columns for patient ID ('Cow ID'), location of sample ('QTR', either LH, LF, RH, or RF), date ('Date', written DD/MM/YY), and the lab result from testing of the sample ('SCC', an integer).
For any given day, each of the four anatomic locations for each patient was sampled and tested. I want to find the average 'SCC' of each of the locations for each of the patients, across all days the patient was sampled.
I was able to find the average SCC for each patient across all days and all anatomic sites using the code below.
aggregate(cida_ams_scc_csv$SCC, list(cida_ams_scc_csv$'Cow ID'), mean)
Now I want to add another "layer," where I see not just the patient's average, but the average of each patient for each of the 4 sample sites.
I honestly have no idea where to start. Please walk me through this in the simplest way possible, I will be eternally grateful.
It is always better to provide a minimal reproducible example, but here the answer might be easy enough that it's not necessary...
You can use the same code to do what you want. If we look at the aggregate documentation (?aggregate), we find that the second argument, by, is
a list of grouping elements, each as long as the variables in the data
frame x. The elements are coerced to factors before use.
Therefore running:
aggregate(mtcars$mpg, by = list(mtcars$cyl, mtcars$gear), mean)
returns the "double grouped" means.
In your case that means adding the "second layer" to the list you pass as value for the by parameter.
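Applied to the data frame from the question, that would look something like the sketch below (assuming the column names given above):

aggregate(cida_ams_scc_csv$SCC,
          by = list(cida_ams_scc_csv$`Cow ID`, cida_ams_scc_csv$QTR),
          FUN = mean)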
I'd recommend dplyr for working with data frames - it's fast and friendly. Here's a way to calculate the average of each patient for each location using it:
# Load the package
library(dplyr)
# Make some fake data that looks similar to yours
cida_ams_scc_csv <- data.frame(
  QTR = gl(n = 4, k = 5, labels = c("LH", "LF", "RH", "RF"), length = 40),
  Date = rep(c("06/10/2021", "05/10/2021"), each = 5),
  SCC = runif(40),
  Cow_ID = 1:5
)
# Group by ID and QTR, then calculate the mean for each group
cida_ams_scc_csv %>%
  group_by(Cow_ID, QTR) %>%
  summarise(grouped_mean = mean(SCC))
which returns one mean per Cow_ID and QTR combination.

How to calculate the average of different groups in a dataset using R

I have a dataset in R for which I would like to find the average of a given variable for each year in the dataset (here, from 1871-2019). Not every year has the same number of entries, and so I have encountered two problems: first, how to find the average of the variable for each year, and second, how to add the column of averages to the dataset.
I am unsure how to approach the first problem, but I attempted a version of the second by finding the sum of each group and then trying to add those values to the dataset for each entry of a given year with the code teams$SBtotal <- tapply(teams$SB, teams$yearID, FUN=sum). That code resulted in an error that notes replacement has 149 rows, data has 2925.
I know that this can be done less quickly in Excel, but I'm hoping to be able to use R to solve this problem.
The tapply should work
data(iris)
tapply(iris$Sepal.Length, iris$Species, FUN = sum)
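For the second problem, the error comes from tapply() returning one value per group (149 years) while the data has 2925 rows. Base R's ave() instead returns a vector as long as its input, so the result can be assigned straight back as a column; a minimal sketch, assuming the teams data from the question:

# one row per entry, each carrying its year's total
teams$SBtotal <- ave(teams$SB, teams$yearID, FUN = sum)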

Subsetting dates from colnames

I have a dataframe as follows:
TAS1: 2000 obs. of 9862 variables
Each of these variables (columns) represent daily temperatures from 1979-01-01 to 2005-12-31. The colnames have been set with these dates. I now wish to separate the dataframe into twelve separate monthly data frames - containing Jan, Feb, Mar etc.
I have tried:
TAS1.JAN = subset(TAS1, grepl("-01-"), colnames(TAS1))
But get the error:
Error in grepl("-01-") : argument "x" is missing, with no default
Is there a relatively quick solution for this? I feel there must be but haven't cracked it despite trying various solutions.
I would subset January data like below.
Jan_df <- subset(MyDatSet, select = grepl("-01-", colnames(MyDatSet)))
I have assumed that your parent dataset is called MyDatSet and a pattern "-01-" defines that it is January data.
You may repeat the process for other 11 months or come up with intelligent loop.
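That loop might look something like the sketch below (assuming the MyDatSet name from above; it produces a named list of twelve data frames rather than twelve separate variables):

monthly_patterns <- sprintf("-%02d-", 1:12)   # "-01-", "-02-", ..., "-12-"
monthly_dfs <- lapply(monthly_patterns, function(m) {
  MyDatSet[, grepl(m, colnames(MyDatSet)), drop = FALSE]
})
names(monthly_dfs) <- month.abb               # "Jan", "Feb", ..., "Dec"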
Like Roland suggested in the comments, I would opt for a melting mechanism too. However, since I do not know your use case, here you go, based on what you posted and asked for.
As your error says, you are missing an argument there:
tas1.jan <- subset(df, grepl("-01-", df$tas1))
Another way to do it with the help of stringr and dplyr would be:
library(stringr)
library(dplyr)
tas1.jan <- df %>% filter(str_detect(tas1, "-01-"))
Bottom side of this approach: you need to run a loop or do this 12 times for all months.

Method to compare previous day to current day values

I am looking for a better way to compare a value from a day (day X) to the previous day (day X-1). Here I am using the airquality dataset. Suppose I am interested in comparing the wind from one day to the wind from the previous day. Right now I am using merge() to bring together two dataframes - one current day dataframe and one from the previous day. I am also just subtracting 1 from the Day column to get the PrevDay column:
airquality$PrevDay = airquality$Day - 1
airquality.comp <- merge(
  airquality[, c("Wind", "Day")],
  airquality[, c("Temp", "PrevDay")],
  by.x = c("Day"), by.y = c("PrevDay")
)
My issue here is that I'd need to create another dataframe if I wanted to look back 2 days or if I wanted to switch Wind and Temp and look at them the other way. This just seems clunky. Can anyone recommend a better way of doing this?
IMO data.table may be harder to get used to than dplyr, but it will save your tail later when you need robust analysis:
setDT(airquality)[, shift(Wind, n=2L, type="lag") < Wind]
In base R, you can add an NA value and eliminate the last for comparison:
with(airquality, c(NA,head(Wind,-1)) < Wind)
What kind of comparison do you need?
For example, to check whether the following value is greater, you could use:
library(dplyr)
with(airquality, lag(Wind) < Wind)
Or with two lags:
with(airquality, lag(Wind, 2) < Wind)
It depends on what questions you are trying to answer, but I would look into Autocorrelation (the correlation of a time series with its own lagged values). You may want to look into the acf() function to compare the time series to itself since this will help you highlight which lags are significantly correlated.
Or if you want to compare 2 different metrics (such as Wind and Temp), then you may want to try the ccf() function since it allows you to input 2 different vectors and it will compute the cross correlation with lags. For example:
ccf(airquality$Wind,airquality$Temp)
If you are interested in autocorrelation or cross-correlation in particular, then you might also consider something like mutual information, which will work for non-Gaussian data as well. Both the infotheo and entropy packages for R have built-in functions to do so.
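A minimal sketch with infotheo, assuming the airquality example from above (infotheo estimates mutual information on discretized data, so the continuous columns are binned first):

library(infotheo)
# Bin the continuous variables, then estimate mutual information between them
wind_binned <- discretize(airquality$Wind)
temp_binned <- discretize(airquality$Temp)
mutinformation(wind_binned, temp_binned)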
