Finding the mean of a subset with 2 variables in R

Sincere apologies if my terminology is inaccurate; I am very new to R and programming in general (<1 month of experience). I was recently given the opportunity to do data analysis on a project I wish to write up for a conference and could use some help.
I have a csv file (cida_ams_scc_csv) with patient data from a recent study. It's a data frame with columns for patient ID ('Cow ID'), location of sample ('QTR', either LH, LF, RH, or RF), date ('Date', written DD/MM/YY), and the lab result from testing of the sample ('SCC', an integer).
For any given day, each of the four anatomic locations for each patient was sampled and tested. I want to find the average 'SCC' of each of the locations for each of the patients, across all days the patient was sampled.
I was able to find the average SCC for each patient across all days and all anatomic sites using the code below.
aggregate(cida_ams_scc_csv$SCC, list(cida_ams_scc_csv$'Cow ID'), mean)
Now I want to add another "layer," where I see not just the patient's average, but the average of each patient for each of the 4 sample sites.
I honestly have no idea where to start. Please walk me through this in the simplest way possible, I will be eternally grateful.

It is always better to provide a minimal reproducible example, but here the answer might be easy enough that it's not necessary...
You can use the same code to do what you want. If we look at the aggregate documentation (?aggregate), we find that the second argument, by, is
a list of grouping elements, each as long as the variables in the data
frame x. The elements are coerced to factors before use.
Therefore running:
aggregate(mtcars$mpg, by = list(mtcars$cyl, mtcars$gear), mean)
returns the "double grouped" means.
In your case, that means adding the "second layer" to the list you pass as the value of the by parameter, as in the sketch below.
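A sketch of what that might look like with the column names from your question (the exact names are assumptions based on your description):
# Mean SCC for every combination of cow and sample site, across all dates
aggregate(cida_ams_scc_csv$SCC,
          by = list(`Cow ID` = cida_ams_scc_csv$`Cow ID`,
                    QTR = cida_ams_scc_csv$QTR),
          FUN = mean)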

I'd recommend dplyr for working with data frames - it's fast and friendly. Here's a way to calculate the average of each patient for each location using it:
# Load the package
library(dplyr)
# Make some fake data that looks similar to yours
cida_ams_scc_csv <- data.frame(
  QTR = gl(n = 4, k = 5, labels = c("LH", "LF", "RH", "RF"), length = 40),
  Date = rep(c("06/10/2021", "05/10/2021"), each = 5),
  SCC = runif(40),
  Cow_ID = 1:5
)
# Group by ID and QTR, then calculate the mean for each group
cida_ams_scc_csv %>%
  group_by(Cow_ID, QTR) %>%
  summarise(grouped_mean = mean(SCC))
which returns one grouped mean per Cow_ID and QTR combination.

Related

Looking for an R function to divide data by date

I'm just 2 days into R, so I hope I can give enough info on my problem.
I have an Excel table on endothelial cell angiogenesis with technical repeats on 4 different dates (but those dates are not in order and fall in different weeks).
My data looks like this (of course it's not only the 2nd of March):
I want to average the data on those 4 different days, so I can compare, e.g., the "Nb Nodes" from day 1 to day 4.
The end goal is a jitterplot containing the group, the investigated data point, and the date.
I'm a medical student, so I don't really have any knowledge about this kind of stuff yet, but I'm trying to learn it. Hopefully I provided enough info!
Found the solution:
# Group by
library(dplyr)
DateGroup <- group_by(Exclude0, Exp.Date, Group)
# Summarizing the mean in every Group and Date
summarise(DateGroup, mymean = mean(`Nb meshes`))
I think the code below will work.
1. group_by() the dimension you want to summarize by.
2a. across() is a helper verb so that you don't need to type each column manually; it lets us use tidy-select language to quickly reference the columns that contain "Nb" (a pattern I noticed from your screenshot).
2b. In the second argument of across(), you supply the function(s) you want to apply to each column selected by the first argument.
2c. The optional .names argument of across() gives the new columns a consistent naming convention.
Good luck on your R learning! It's a really great language and you made the right choice.
# df is your data frame
df %>%
  group_by(Exp.Date) %>%
  summarize(across(contains("Nb"), list(mean = mean), .names = "{.fn}_{.col}"))
# if you just want a single column then do this
df %>%
  group_by(Exp.Date) %>%
  summarize(mean_nb_nodes = mean(`Nb nodes`))

Recoding a variable across waves of a longitudinal dataset

I realised recently that a longitudinal variable in my dataset (whether people stated they were on furlough after being asked why their working hours were reduced from the previous wave of the study) was coded incorrectly. Right now, the variable is coded as "1" if a respondent reported furlough as the reason for fewer working hours than in the last survey wave, and "0" even if a respondent was on furlough but whose working hours did not change from the last survey wave. Therefore, I want to recode this variable so that after the first report of a furlough-related decrease in working hours ("1"), the rest of the data (i.e. the subsequent waves) would also be coded "1". This may be a simple change to execute in R, but I spent a few hours this morning trying if-else statements and dplyr with no success.
TLDR: I would like to recode a variable so that if the variable equals 1 for a specified wave of my longitudinal dataset, it also equals 1 for the rest of the waves of the dataset.
Can I please ask for any suggestions you have for resolving this? Thank you so much!
It is very hard to help without seeing your data. I made some basic data following your explanation. Check if this is what you have in mind.
library(dplyr)
data <- tibble(ID = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
               Wave = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
               Reduced = c(1, 1, 0, 1, 0, 0, 1, 0, 0))
data2 <- data %>% filter(Wave == 1)
data <- data %>% group_by(Wave) %>% mutate(Reduced_new = data2$Reduced)
rm(data2)
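If the intent is instead to carry the first "1" forward within each respondent across all later waves, a cumulative-maximum sketch along these lines might be closer (this assumes one row per ID per wave, as in the toy data above):
# Once Reduced hits 1 for an ID, keep it at 1 for every later wave
data %>%
  group_by(ID) %>%
  arrange(Wave, .by_group = TRUE) %>%
  mutate(Reduced_new = cummax(Reduced)) %>%
  ungroup()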

Trouble with summing specific rows and columns

I have a problem and was wondering if there is code that would let me solve it faster than doing it manually.
So for my example, I have 100 different funds with numerous securities in each fund. Within each fund, I have the Name of each security, the Date which shows the given quarter, the State where the security is issued, and the Weighting of each security within the total fund. The Name is not important; only the State where the security was issued is.
I was wondering if there was a way to add up the Weighting from each fund for the specific States I want, for each quarter. So let's say from Fund1, I need the sum of the Weighting just for the states SC and AZ in 16-1Q. The sum would be (.18 + .001). I do not need to include the weighting for KS because I am not interested in that specific state. I would only be interested in the states SC and AZ for every FundId; however, in my real problem I am interested in ~30 states. I would then do the same task for Fund1 for 16-2Q, and so on until 17-4Q. My end goal is to find the sum of every portfolio weighting for the states I'm interested in and see how it changes over time. I can do this manually for each fund, but is there a way to automatically sum up the Weighting for each FundId based on the State I want and for each Date (16-1Q, 16-2Q, etc.)?
In the end I would like a table such as:
(.XX) is the sum of portfolio weight
Example of Data
The "Example of Data" link you sent has a much better data format than the "(.XX) is the sum of portfolio weight" table; only in Excel would you prefer that other kind of layout.
So, using the example data frame, do this operation:
library(dplyr)
example_data <- example_data %>%
  group_by(Fund_Id) %>%
  summarize(sum = sum(Weighting))
We can use aggregate in base R
aggregate(Weighting ~ Fund_Id, example_data, sum)
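To match the full question (only certain states, summed per fund per quarter), a sketch along these lines may help, starting from the original example_data before the summarize above; the State and Date column names and the states vector are assumptions based on the description:
library(dplyr)
# Placeholder list of states; extend to the ~30 states of interest
states_of_interest <- c("SC", "AZ")
example_data %>%
  filter(State %in% states_of_interest) %>%                  # keep only the wanted states
  group_by(Fund_Id, Date) %>%                                # one total per fund per quarter
  summarize(total_weight = sum(Weighting), .groups = "drop")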

Creating a new factor level based on combination of measurement values from two other factor levels

I am doing an analysis of automated volumetric fat measurements in different body compartments on abdominal CT scans. Measurements are taken at consecutive vertebral levels for each patient's scan, and each patient has multiple compartments measured separately (subcutaneous and visceral). Prior research has identified the ratio of visceral/subcutaneous fat measurements to be of particular interest.
I am having a difficult time trying to calculate this ratio in my dataset. In this example code there are six entries per patient. Each entry is associated with a measured fat volume of a compartment at a vertebral level.
What I want to do is create a new measurement type - 'vat/sat' - that is just a ratio of the two measures at each of the three vertebral levels. In essence, I am trying to insert three new observations per patient that are associated with a new factor level and value that is an operation of other values. Any help is greatly appreciated.
library(data.table)
data <- data.table(ID = rep(c(1:4), each = 6), value = rnorm(24, 1000, 500),
                   level = rep(c('l1', 'l2', 'l3')),
                   measure = rep(c(rep('vat', 3), rep('sat', 3)), 4))
EDIT: I have been using data.table for this project and am familiar with the basic operations, but can't seem to figure this one out.
I would consider going to wide format, where it's more natural:
res = dcast(data, ID + level ~ measure)[, rat := vat/sat][]
To go back to long, there's
melt(res, id=c("ID", "level"))
The [] at the end is needed thanks to a quirk of data.table printing. Without it, when you type...
> res = dcast(data, ID + level ~ measure)[, rat := vat/sat][]
> res
# nothing happens
> res
# now it prints
I'm not sure if it's in the function documentation, but you might want to review the vignettes with browseVignettes("data.table"), since they cover quirks like this and help to build intuition for the syntax.
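If you would rather stay in long format, one possible sketch (not from the answer above; column names follow the example data) is to compute the ratio by ID and level and bind it back as new rows with measure = 'vat/sat':
# New rows holding the vat/sat ratio for each patient and vertebral level
ratios <- data[, .(value = value[measure == 'vat'] / value[measure == 'sat'],
                   measure = 'vat/sat'),
               by = .(ID, level)]
data_long <- rbind(data, ratios)   # rbind matches columns by name
setorder(data_long, ID, level)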

How to group data to minimize the variance while preserving the order of the data in R

I have a data frame (760 rows) with two columns, named Price and Size. I would like to put the data into 4 or 5 groups based on Price so as to minimize the variance within each group while preserving the order of Size (which is in ascending order). The Jenks natural breaks optimization would be an ideal function; however, it does not take the order of Size into consideration.
Basically, I have data similar to the following (with more data):
Price=c(90,100,125,100,130,182,125,250,300,95)
Size=c(10,10,10.5,11,11,11,12,12,12,12.5)
mydata=data.frame(Size,Price)
I would like to group the data to minimize the variance of Price in each group while respecting 1) the Size value: for example, the first two prices, 90 and 100, cannot be in different groups since they have the same Size; and 2) the order of Size: for example, if Group One includes observations 1-2 and Group Two includes observations 3-9, observation 10 can only enter into Group Two or Group Three.
Can someone please give me some advice? Maybe there is already some such function that I can’t find?
Is this what you are looking for? With the dplyr package, grouping is quite easy. The %>% can be read as "then do", so you can combine multiple actions if you like.
See http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html for further information.
library("dplyr")
Price <– c(90,100,125,100,130,182,125,250,300,95)
Size <- c(10,10,10.5,11,11,11,12,12,12,12.5)
mydata <- data.frame(Size,Price) %>% # "then"
group_by(Size) # group data by Size column
mydata_mean_sd <- mydata %>% # "then"
summarise(mean = mean(Price), sd = sd(Price)) # calculate grouped
#mean and sd for illustration
I had a similar problem with optimally splitting a day into 4 "load blocks". Adjacent time periods must stick together, of course.
It's not an elegant solution, but I wrote my own function that first splits a sorted series at specified break points, then calculates the sum(SDCM) for those break points (using the algorithm underlying the Jenks approach described on Wikipedia).
I then iterated through all valid combinations of break points and selected the set that produced the minimum sum(SDCM); a rough sketch of the idea is below.
This quickly becomes unmanageable as the number of possible break-point combinations increases, but it worked for my data set.
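A minimal sketch of that brute-force idea on a sorted price vector (illustrative only; the number of groups k is an assumption, and the Size constraint from the question is not handled here):
# Enumerate every placement of k-1 break points in a sorted series,
# score each split by the total sum of squared deviations from group means (SDCM),
# and keep the split with the smallest score.
price <- sort(c(90, 100, 125, 100, 130, 182, 125, 250, 300, 95))
k <- 4                                    # desired number of groups
n <- length(price)
cuts <- combn(2:n, k - 1)                 # candidate break positions (index starting each new group)
sdcm <- function(breaks) {
  groups <- findInterval(seq_len(n), breaks)   # assign each index to a group
  sum(tapply(price, groups, function(x) sum((x - mean(x))^2)))
}
scores <- apply(cuts, 2, sdcm)
best <- cuts[, which.min(scores)]         # break points minimizing total SDCM
split(price, findInterval(seq_len(n), best))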
