I realised recently that a longitudinal variable in my dataset (whether people stated they were on furlough after being asked why their working hours were reduced from the previous wave of the study) was coded incorrectly. Right now, the variable is coded as “1” if a respondent reported furlough as the reason for working fewer hours than in the last survey wave, and “0” otherwise, even if the respondent was still on furlough but their working hours did not change from the last survey wave. Therefore, I want to recode this variable so that after the first report of a furlough-related decrease in working hours (“1”), the rest of the data (i.e. the subsequent waves) would also be coded “1”. This may be a simple change to execute in R, but I spent a few hours this morning trying if-else statements and dplyr with no success.
TLDR: I would like to recode a variable so that if the variable equals 1 for a specified wave of my longitudinal dataset, it also equals 1 for the rest of the waves of the dataset.
Can I please ask for any suggestions you have for resolving this? Thank you so much!
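It is very hard to help without seeing your data. I made a small example dataset following your explanation. Check if this is what you have in mind.
library(dplyr)

# Example data: three respondents observed over three waves
data <- tibble(ID = c(1,2,3,1,2,3,1,2,3),
               Wave = c(1,1,1,2,2,2,3,3,3),
               Reduced = c(1,1,0,1,0,0,1,0,0))

# Keep the wave 1 rows, then copy their Reduced values into every wave.
# This relies on every wave listing the same IDs in the same order.
data2 <- data %>% filter(Wave == 1)
data <- data %>% group_by(Wave) %>% mutate(Reduced_new = data2$Reduced)
rm(data2)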
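If the first furlough report can happen in any wave, not only wave 1, a cumulative maximum within each respondent may be closer to what the question describes. This is only a sketch: it assumes the columns are called ID, Wave and Reduced as in the example data above, and that Reduced has no missing values (cummax() would carry an NA forward).

```r
library(dplyr)

# Made-up data: respondent 2 first reports furlough in wave 2, respondent 3 never does
furlough <- tibble(
  ID      = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
  Wave    = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  Reduced = c(1, 0, 0, 0, 1, 0, 0, 0, 0)
)

furlough_fixed <- furlough %>%
  arrange(ID, Wave) %>%                       # waves must be in order within each respondent
  group_by(ID) %>%
  mutate(Reduced_new = cummax(Reduced)) %>%   # stays 1 once a 1 has been seen
  ungroup()
```

Here Reduced_new is 1 in every wave for respondent 1, switches to 1 from wave 2 onwards for respondent 2, and stays 0 for respondent 3.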
Sincere apologies if my terminology is inaccurate, I am very new to R and programming in general (<1m experience). I was recently given the opportunity to do data analysis on a project I wish to write-up for a conference and could use some help.
I have a csv file (cida_ams_scc_csv) with patient data from a recent study. It's a dataframe, with columns of patient ID ('Cow ID'), location of sample ('QTR', either LH LF RH or RF), date ('Date', written DD/MM/YY), and the lab result from testing of the sample ('SCC', an integer).
For any given day, each of the four anatomic locations was sampled and tested for each patient. I want to find the average 'SCC' for each location of each patient, across all days the patient was sampled.
I was able to find the average SCC for each patient across all days and all anatomic sites using the code below.
aggregate(cida_ams_scc_csv$SCC, list(cida_ams_scc_csv$'Cow ID'), mean)
Now I want to add another "layer," where I see not just the patient's average, but the average of each patient for each of the 4 sample sites.
I honestly have no idea where to start. Please walk me through this in the simplest way possible, I will be eternally grateful.
It is always better to provide a minimal reproducible example. But here the answer might be easy enough that it's not necessary...
You can use the same code to do what you want. If we look at the aggregate documentation ?aggregate we find that the second argument by is
a list of grouping elements, each as long as the variables in the data
frame x. The elements are coerced to factors before use.
Therefore running:
aggregate(mtcars$mpg, by = list(mtcars$cyl, mtcars$gear), mean)
Returns the "double grouped" means
In your case that means adding the "second layer" to the list you pass as value for the by parameter.
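Applied to the data from the question, that could look like the sketch below. The data frame is made up, and its column names (Cow ID, QTR, SCC) are assumed from the description in the question rather than taken from the real file.

```r
# Made-up data with the column layout described in the question
cida_ams_scc_csv <- data.frame(
  `Cow ID` = rep(c(101, 102), each = 8),
  QTR      = rep(c("LH", "LF", "RH", "RF"), times = 4),
  SCC      = c(10, 20, 30, 40, 30, 40, 50, 60,
               5, 5, 5, 5, 15, 25, 35, 45),
  check.names = FALSE   # keep the space in "Cow ID"
)

# Two grouping elements in `by` give one mean per cow and quarter
scc_means <- aggregate(
  cida_ams_scc_csv$SCC,
  by  = list(CowID = cida_ams_scc_csv$`Cow ID`, QTR = cida_ams_scc_csv$QTR),
  FUN = mean
)
```

The result has one row per cow and quarter, with the mean in a column named x. Naming the list elements (CowID, QTR) only gives the grouping columns readable names in the output.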
I'd recommend dplyr for working with data frames - it's fast and friendly. Here's a way to calculate the average of each patient for each location using it:
# Load the package
library(dplyr)

# Make some fake data that looks similar to yours
cida_ams_scc_csv <- data.frame(
  QTR = gl(
    n = 4, k = 5, labels = c("LH", "LF", "RH", "RF"), length = 40
  ),
  Date = rep(c("06/10/2021", "05/10/2021"), each = 5),
  SCC = runif(40),
  Cow_ID = 1:5
)

# Group by ID and QTR, then calculate the mean for each group
cida_ams_scc_csv %>%
  group_by(Cow_ID, QTR) %>%
  summarise(grouped_mean = mean(SCC))
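which returns one row for each combination of Cow_ID and QTR, with that group's average SCC in the grouped_mean column.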
By comparison to most on this site, I am an extreme newbie when it comes to R, and would appreciate any possible help. I am looking to sample my data with replacement, but given how my data is set up, I am not sure how to go about that. I have 11 plant species. For each species I took 5 plant cuttings, and sampled 10 leaves from each cutting totaling 50 leaves per plant species. I need to sample with replacement within species. I was looking at using the sample function for this, but considering I need to sample within species I am not sure if I can. Attached is a photo of my data for context.
Data image
Apologies in advance for the naivety of my question and thanks in advance for any help!
You can group by and then sample. This assumes you have 5 cuttings per species and want 10 leaves drawn from each cutting. If not, you may want to remove cutting from the group_by.
library(dplyr)
data %>%
group_by(species, cutting) %>%
slice_sample(weight_by = `leaf size`, n=10, replace = TRUE) %>%
ungroup()
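If the resampling should ignore cuttings and draw directly within species, grouping by species alone might be enough. A rough sketch on made-up data follows; the column names (species, cutting, leaf, leaf_size) are guesses, since the real data was only shown as an image.

```r
library(dplyr)

# Made-up data: 2 species x 5 cuttings x 10 leaves (column names are assumptions)
set.seed(1)
leaves <- expand.grid(leaf = 1:10, cutting = 1:5, species = c("A", "B"))
leaves$leaf_size <- runif(nrow(leaves))

# Draw 50 leaves per species with replacement, ignoring which cutting they came from
resampled <- leaves %>%
  group_by(species) %>%
  slice_sample(n = 50, replace = TRUE) %>%
  ungroup()
```

Each species keeps 50 rows after resampling, but some leaves appear more than once and others not at all, which is what sampling with replacement is expected to do.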
I am working with a data frame in RStudio, trying to understand whether there is a correlation between exercising and a person's general health. There are three main variables:
exerof1: how many times the participants exercised in the last 30 days.
exerany2: whether the participants exercised in the last month; they can answer yes or no, or refuse to answer.
genhlth: a factor variable which splits the observations into 5 levels.
I have already transformed the exerof1 variable, but 30% of its values are NA, and most of those are NA because the participant answered "No" to the exerany2 question.
My objective is to identify the people who said "No" in the exerany2 variable and have NA in exerof1, and to change those NAs to 0.
I don't know if my analysis is the best way because I am a beginner. I tried to do what I want using ifelse, but I am struggling. I also tried to check whether there is another thread with the same question, but I couldn't find one.
I await your feedback.
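Assuming your data frame is called data:
data[which(is.na(data$exerof1) & data$exerany2 == "No"), "exerof1"] <- 0
Basically we select the rows that satisfy your condition, then pick the column exerof1, and assign those the value 0. Wrapping the condition in which() drops rows where the condition evaluates to NA (exerof1 and exerany2 both missing); a logical index containing NA would otherwise make the assignment fail.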
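Since the question mentions trying ifelse, here is a sketch of the same idea written that way, on made-up rows. The variable names follow the question; the values are invented.

```r
# Made-up survey rows; the variable names follow the question
survey <- data.frame(
  exerany2 = c("Yes", "No", "No", NA, "Yes"),
  exerof1  = c(12, NA, NA, NA, NA)
)

# Flag rows with a missing count where the respondent answered "No";
# %in% returns FALSE (not NA) when exerany2 itself is missing
no_exercise <- is.na(survey$exerof1) & survey$exerany2 %in% "No"

# Replace only the flagged NAs with 0 and leave everything else as it was
survey$exerof1 <- ifelse(no_exercise, 0, survey$exerof1)
```

Rows 2 and 3 become 0, while rows 4 and 5 stay NA because those respondents did not answer "No".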
I have a data set of cases, when patients were registered and when they had blood test results.
I only want to get the blood test results that were taken less than a year after the patient registered for the trial.
The registration date is unique to each person.
data2=data1[!(data1$START.DATE> data1$READING.DATE),]
This is as close as I can get, but it doesn't work.
How can I do this?
A reproducible example would help get to an answer quicker.
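This might work, assuming both columns are stored as Date objects so that their difference is in days:
data2 <- data1[data1$READING.DATE >= data1$START.DATE &
               data1$READING.DATE - data1$START.DATE < 365, ]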
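To make the date arithmetic explicit, here is a small sketch on made-up data. It assumes the dates arrive as text in year-month-day form and are converted with as.Date(); a different layout would need the format argument. "Less than a year" is taken to mean fewer than 365 days.

```r
# Made-up data; dates are converted from text with as.Date()
data1 <- data.frame(
  ID           = c(1, 1, 1, 2),
  START.DATE   = as.Date(c("2020-01-01", "2020-01-01", "2020-01-01", "2020-06-01")),
  READING.DATE = as.Date(c("2019-12-01", "2020-03-01", "2021-02-01", "2021-05-31"))
)

# Days from registration to each reading (negative if the reading came first)
days_since_start <- as.numeric(
  difftime(data1$READING.DATE, data1$START.DATE, units = "days")
)

# Keep readings taken on or after registration and fewer than 365 days later
data2 <- data1[days_since_start >= 0 & days_since_start < 365, ]
```

Asking difftime() for units = "days" avoids surprises if the columns are date-times rather than dates, where the default unit is chosen automatically.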
I am looking for a better way to compare a value from a day (day X) to the previous day (day X-1). Here I am using the airquality dataset. Suppose I am interested in comparing the wind from one day to the wind from the previous day. Right now I am using merge() to bring together two dataframes - one current day dataframe and one from the previous day. I am also just subtracting 1 from the Day column to get the PrevDay column:
airquality$PrevDay=airquality$Day-1
airquality.comp <- merge(
airquality[,c("Wind","Day")],
airquality[,c("Temp","PrevDay")],
by.x=c("Day"),by.y=c("PrevDay"))
My issue here is that I'd need to create another dataframe if I wanted to look back 2 days or if I wanted to switch Wind and Temp and look at them the other way. This just seems clunky. Can anyone recommend a better way of doing this?
IMO data.table may be harder to get used to compared to dplyr, but it will save your tail later when you need robust analysis:
library(data.table)
as.data.table(airquality)[, shift(Wind, n=2L, type="lag") < Wind]  # TRUE where Wind is higher than two days earlier
In base R, you can prepend an NA and drop the last value so that each day lines up with the previous one:
with(airquality, c(NA,head(Wind,-1)) < Wind)
What kind of comparison do you need?
For example, to check whether each value is greater than the previous one, you could use:
library(dplyr)
with(airquality, lag(Wind) < Wind)
Or with two lags:
with(airquality, lag(Wind, 2) < Wind)
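If you would rather not load a package, the same idea can be wrapped in a small base R helper so that any column and any number of days can be swapped in. The function name lag_by is made up for this sketch, and it assumes the rows are in date order with one row per day, which is the case for airquality.

```r
# Hypothetical helper: shift a vector back by k positions, padding with NA
lag_by <- function(x, k = 1) {
  stopifnot(k >= 0, k < length(x))
  c(rep(NA, k), head(x, length(x) - k))
}

aq <- airquality  # work on a copy; rows are already in date order

# Temperature one and two days earlier, next to the current day's values
aq$Temp_lag1 <- lag_by(aq$Temp, 1)
aq$Temp_lag2 <- lag_by(aq$Temp, 2)

# TRUE where today's wind is higher than yesterday's
aq$Wind_up <- aq$Wind > lag_by(aq$Wind, 1)
```

Looking back a different number of days, or switching Wind and Temp, only means changing the arguments, so no extra data frame or merge is needed.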
It depends on what questions you are trying to answer, but I would look into Autocorrelation (the correlation of a time series with its own lagged values). You may want to look into the acf() function to compare the time series to itself since this will help you highlight which lags are significantly correlated.
Or if you want to compare 2 different metrics (such as Wind and Temp), then you may want to try the ccf() function since it allows you to input 2 different vectors and it will compute the cross correlation with lags. For example:
ccf(airquality$Wind,airquality$Temp)
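To read the numbers instead of the plot, the result of ccf() can be stored and turned into a table. This is a sketch; the choice of lag.max = 5 is arbitrary.

```r
# Cross-correlation between Wind and Temp for lags -5..5, without plotting
cc <- ccf(airquality$Wind, airquality$Temp, lag.max = 5, plot = FALSE)

# One row per lag; lag k pairs Wind[t + k] with Temp[t]
cc_table <- data.frame(
  lag         = as.vector(cc$lag),
  correlation = as.vector(cc$acf)
)

# Lag with the strongest (positive or negative) correlation
strongest <- cc_table[which.max(abs(cc_table$correlation)), ]
```

The value at lag 0 is the ordinary correlation between the two columns, so the other rows show how that relationship changes when one series is shifted against the other.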
If you are interested in autocorrelation or cross-correlation in particular, then you might also consider something like mutual information, which will work for non-Gaussian data as well. Both the infotheo and entropy packages for R have built-in functions to do so.