Looking for an R function to divide data by date - r

I'm just 2 days into R, so I hope I can give enough info on my problem.
I have an Excel table on endothelial cell angiogenesis with technical repeats on 4 different dates (the dates are not in order and fall in different weeks).
My data looks like this (of course it's not only the 2nd of March):
I want to average the data on those 4 different days, so I can compare e.g. the "Nb Nodes" from day 1 to day 4.
The goal is a jitter plot containing the group, the investigated data point, and the date.
I'm a medical student, so I don't yet have any background in this kind of thing, but I'm trying to learn it. Hopefully I provided enough info!
Found the solution:
# Group by date and group
library(dplyr)
DateGroup <- group_by(Exclude0, Exp.Date, Group)
# Summarize the mean in every group and date
summarise(DateGroup, mymean = mean(`Nb meshes`))

I think the code below will work.
1. group_by() the dimension you want to summarize by.
2a. across() is a helper verb so that you don't need to type each column manually; it lets us use tidy-select language to quickly reference columns that contain "Nb" (a pattern I noticed in your screenshot).
2b. The second argument of across() is the function (or formula) to apply to each column selected by the first argument.
2c. The optional .names argument of across() gives the new columns a naming convention.
Good luck on your R learning! It's a really great language and you made the right choice.
# df is your data frame
df %>%
  group_by(Exp.Date) %>%
  summarize(across(contains("Nb"), list(mean = mean), .names = "{.fn}_{.col}"))
# if you just want a single column then do this
df %>%
  group_by(Exp.Date) %>%
  summarize(mean_nb_nodes = mean(`Nb nodes`))
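For a concrete illustration, here's a minimal sketch with an invented data frame that borrows the question's column names (`Exp.Date`, `Nb nodes`); the values are made up:

```r
library(dplyr)

# Toy data mimicking the question's layout (values are invented)
df <- data.frame(
  Exp.Date   = rep(c("02.03", "09.03"), each = 3),
  `Nb nodes` = c(10, 12, 14, 20, 22, 24),
  check.names = FALSE
)

# One mean per experiment date: 12 for 02.03, 22 for 09.03
df %>%
  group_by(Exp.Date) %>%
  summarize(mean_nb_nodes = mean(`Nb nodes`))
```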

Related

How do I find a mean of a value in a row where each value is in a different data set?

I am a beginner with R and I'm having a hard time finding any info related to the task at hand. Basically, I am trying to calculate 5-year averages for 15 categories of NHL player statistics and add each of the results as a column in the '21-22 player data. So, for example, I'd want 5-year averages for a player (e.g. Connor McDavid) to be displayed in the same dataset as the 21-22 player data, but each of the numbers needed to calculate the mean lives in its own spreadsheet that has been imported into R. I have an .xlsx worksheet for each year from 17-18 to 21-22, so 5 sheets in total. I have loaded each of the sheets into RStudio, but the next steps are very difficult for me to figure out.
I think I have to use a pipe, locate one specific cell (e.g. Connor McDavid, goals) in 5 different data frames, use a mean function to find the average for that one particular cell, assign that as a vector 5_year_average_goals, then add that vector as a column in the original 21-22 dataset so I can compare each player's production last season to their 5-year averages. Then repeat that step for each column (assists, points, etc.). Would I have to repeat these steps for each player (row)? Is there an easy way to use a placeholder that will calculate these averages for every player in the 21-22 dataset?
This is a guess, and just a start ...
I suggest as you're learning that using dplyr and friends might make some of this simpler.
library(dplyr)
library(readxl)
files <- list.files(pattern = "xlsx$")
lapply(setNames(nm = files), read_xlsx) %>%
  bind_rows(.id = "filename") %>%
  arrange(filename) %>%
  group_by(`Player Name`) %>%
  mutate(across(c(GPG, APG), ~ cumsum(.) / seq_along(.), .names = "{.col}_5yr_avg"))
Brief walk-through:
list.files(.) should produce a character vector of all of the filenames. If you need it to be a subset, filter it as needed. If they are in a different directory, then files <- list.files("path/to/dir", pattern="xlsx$", full.names=TRUE).
lapply(..) reads the xlsx file for each of the filenames found by list.files(.), returns a list of data.frames.
bind_rows(.) combines all of the nested frames into a single frame, and adds the names of the list as a new column, filename, which will contain the file from which each row was extracted.
arrange(.) sorts by the filename (which encodes the year), which should put things in chronological order, as required for a running average. I'm assuming the files sort correctly; you may need to adjust this if I got it wrong.
group_by(..) makes sure that the next expressions only see one player at a time (across all files).
mutate calculates (I believe) the running average over the years. It's not perfectly resilient to issues (e.g., gaps in years), but it's a good start.
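To illustrate just the running-average step, here's a minimal sketch with one invented player and a made-up `GPG` column (the real data would have one row per season, already sorted chronologically):

```r
library(dplyr)

# Toy data: one player, three seasons (values invented)
d <- data.frame(
  `Player Name` = "Connor McDavid",
  GPG = c(0.5, 0.7, 0.9),
  check.names = FALSE
)

# cumsum(.) / seq_along(.) is the mean of all seasons seen so far:
# 0.5, then (0.5+0.7)/2 = 0.6, then (0.5+0.7+0.9)/3 = 0.7
d %>%
  group_by(`Player Name`) %>%
  mutate(GPG_running_avg = cumsum(GPG) / seq_along(GPG))
```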
Hope this helps.

How to combine multiple survey rounds into one panel data (R)?

I am analysing a longitudinal survey (https://microdata.worldbank.org/index.php/catalog/3712) with around 2k participating households (dropping with each round). There were 11 waves/rounds, each divided into around 6-8 datasets based on theme of the questions. To analyse it, I need it in proper panel data format, with each theme being in one file, combining all the waves.
Please see the Excel snippets (with most columns removed for simplicity) of how it looks: round 1 vs round 9. (The levels of the categorical variables have changed names, from full names to just numbers, but it's the same question.) Basically, the format looks something like this:
head(round1.csv)
ID   INCOME SOURCE   ANSWER   CHANGE
101  1. Business     1. YES   3. Reduced
101  2. Pension      2. NO
102  1. Business     1. YES   2. No change
102  2. Assistance   1. YES   3. Reduced
So far I have only been analysing separate waves on their own, but I do not know how to:
Combine so many data frames together.
Convert it to the format where each ID appears only once per wave. I used spread() to enable modelling in single files. I think I can imagine what the data frame would look like if the question were only whether they receive the income source (maybe like this?:
WAVE  ID   Business  Pension
1     101  1. YES    1. NO
1     102  1. YES    1. YES
2     101  1. NO     1. YES
2     102  1. YES    1. YES
), but I do not understand what it is supposed to look like with the change to that income also included.
How to deal with weights - there are weights added to one of the files for each wave. Some are missing, and they change per wave, as fewer households agree to participate each round.
I am happy to filter and only use households that participated in every round, to make it easier.
I looked for an answer here: Panel data, from wide to long with multiple variables and Transpose a wide dataset into long with multiple steps, but I think my problem is different.
I am a student, definitely not an expert, so I apologise for my skill-level.
There are too many questions at once; I'll ignore the weights (they should be a separate question, after the merging is resolved).
How to merge? For sure you'll be doing something called a left join. The leftmost dataset should be the longest one (the first wave). The others will be joined by ID, and the IDs missing in the later waves will get NAs instead of values. I'll be using tidyverse code examples; see the left_join docs.
You'll have to deal with a few things on the way.
duplicate column names
you can use the suffix argument like suffix = c(".wave1", ".wave2")
different coding of the data (seen in your samples, e.g. s7q1: 1. YES vs 1)
use something like extract() to get the same representation
When you're done with the joining, you need to re-shape your data. That would be something like pivot_longer(), followed by extract() to get the .wave# suffix into a separate column. Then you can pivot_wider() back into a wider format, keeping your wave column.
R-like pseudocode that illustrates how it could be done; it does not run as-is (I don't have your datasets):
library(tidyverse)
library(readxl)

read_excel("wave1.xlsx") -> d_w1
read_excel("wave2.xlsx") -> d_w2

# Normalize the coding so "1. YES" becomes "1"
d_w1 %>%
  extract(s7q1, into = "s7q1", regex = "([0-9]+)") ->
  d_w1fix

d_w1fix %>%
  left_join(d_w2, by = "ID", suffix = c(".wave1", ".wave2")) %>%
  pivot_longer(-ID, names_to = "question", values_to = "answer") %>%
  extract(question, into = c("question", "wave"), regex = "([[:alnum:]]+)\\.wave([0-9])") %>%
  pivot_wider(names_from = "question", values_from = "answer") ->
  d_final
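For a runnable toy version of the same flow, here is a sketch with two invented two-row waves; the `ID` and `s7q1` names follow the samples above, and the values are made up:

```r
library(tidyverse)

# Invented stand-ins for two survey waves
d_w1 <- tibble(ID = c(101, 102), s7q1 = c("1. YES", "2. NO"))
d_w2 <- tibble(ID = c(101, 102), s7q1 = c("2. NO", "1. YES"))

# Normalize the coding, join wave 2 onto wave 1, then reshape
# so "wave" becomes its own column
d_w1 %>%
  mutate(s7q1 = str_extract(s7q1, "[0-9]+")) %>%
  left_join(
    d_w2 %>% mutate(s7q1 = str_extract(s7q1, "[0-9]+")),
    by = "ID", suffix = c(".wave1", ".wave2")
  ) %>%
  pivot_longer(-ID, names_to = "question", values_to = "answer") %>%
  extract(question, into = c("question", "wave"),
          regex = "([[:alnum:]]+)\\.wave([0-9])") %>%
  pivot_wider(names_from = "question", values_from = "answer")
# One row per ID per wave, with s7q1 as a column
```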

Finding the mean of a subset with 2 variables in R

Sincere apologies if my terminology is inaccurate; I am very new to R and programming in general (<1 month experience). I was recently given the opportunity to do data analysis on a project I wish to write up for a conference and could use some help.
I have a csv file (cida_ams_scc_csv) with patient data from a recent study. It's a dataframe, with columns of patient ID ('Cow ID'), location of sample ('QTR', either LH LF RH or RF), date ('Date', written DD/MM/YY), and the lab result from testing of the sample ('SCC', an integer).
For any given day, each of the four anatomic locations for each patient was sampled and tested. I want to find the average 'SCC' of each of the locations for each of the patients, across all days the patient was sampled.
I was able to find the average SCC for each patient across all days and all anatomic sites using the code below.
aggregate(cida_ams_scc_csv$SCC, list(cida_ams_scc_csv$'Cow ID'), mean)
Now I want to add another "layer," where I see not just the patient's average, but the average of each patient for each of the 4 sample sites.
I honestly have no idea where to start. Please walk me through this in the simplest way possible, I will be eternally grateful.
It is always better to provide a minimal reproducible example. But here the answer might be easy enough that it's not necessary.
You can use the same code to do what you want. If we look at the aggregate documentation ?aggregate we find that the second argument by is
a list of grouping elements, each as long as the variables in the data
frame x. The elements are coerced to factors before use.
Therefore running:
aggregate(mtcars$mpg, by = list(mtcars$cyl, mtcars$gear), mean)
Returns the "double grouped" means
In your case that means adding the "second layer" to the list you pass as value for the by parameter.
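For instance, here's a minimal sketch with an invented data frame that borrows the question's column names ('Cow ID', QTR, SCC); the values are made up so that every cow/site mean comes out to 150:

```r
# Toy data: 2 cows x 4 sites x 2 sampling days (values invented)
cida_ams_scc_csv <- data.frame(
  `Cow ID` = rep(rep(1:2, each = 4), times = 2),
  QTR      = rep(c("LH", "LF", "RH", "RF"), times = 4),
  Date     = rep(c("01/01/21", "02/01/21"), each = 8),
  SCC      = rep(c(100, 200), each = 8),
  check.names = FALSE
)

# Group by both patient and sample site: the "second layer"
aggregate(cida_ams_scc_csv$SCC,
          by = list(`Cow ID` = cida_ams_scc_csv$`Cow ID`,
                    QTR     = cida_ams_scc_csv$QTR),
          FUN = mean)
# 8 rows (one per cow/site pair), each averaging 100 and 200 to 150
```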
I'd recommend dplyr for working with data frames - it's fast and friendly. Here's a way to calculate the average of each patient for each location using it:
# Load the package
library(dplyr)

# Make some fake data that looks similar to yours
cida_ams_scc_csv <- data.frame(
  QTR    = gl(n = 4, k = 5, labels = c("LH", "LF", "RH", "RF"), length = 40),
  Date   = rep(c("06/10/2021", "05/10/2021"), each = 5),
  SCC    = runif(40),
  Cow_ID = 1:5
)

# Group by ID and QTR, then calculate the mean for each group
cida_ams_scc_csv %>%
  group_by(Cow_ID, QTR) %>%
  summarise(grouped_mean = mean(SCC))
which returns the grouped mean for each cow and location.

R coding, I'm trying to correctly order the variables in my dataframe from 1 to 13 but it goes like 201501, 2015010, 011,012,013, 02...09

I have a large dataframe sorted by fiscal year and fiscal period. I am trying to create a time plot starting at fiscal period 1 of 2015, ending at fiscal period 13 of 2019. I have two columns, one for FY, one for FP. They look like this.
I merged the two columns together separated by a 0 in a new column (C) using the code:
MarkP$C = paste(MarkP$FY, MarkP$FP, sep="0")
This ensures that my new column is a numeric variable.
It looks like this (check column C)
Then since I want to plot a time plot of total sales per period, I aggregated all sales to the level of C, so all rows ending with the same C aggregate together. I used this code for the aggregation.
MarkP11 <- MarkP %>%
group_by(C) %>%
summarise(Sales=sum(Sales))
This is what MarkP11 looks like.
The problem I'm having is that the rows are out of order, so when I plot them it gives me an incorrect plot: it has period 10 coming after period 1.
I've done some research and discovered that the sprintf function may work, but I'm not sure how to incorporate it into the code for my data frame.
The code below is how my C column is created by merging two columns. I believe I need to edit this line with sprintf, but I'm not sure how to get that to work.
MarkP$C = paste(MarkP$FY, MarkP$FP, sep="0")
I expect the ordering of the MarkP dataframe to look something like this:
sprintf is indeed what you want:
sprintf("%0.0f%02.0f", 2019, c(1,10))
# [1] "201901" "201910"
This assumes that FP's range is 0-99. It would not be incorrect to use sprintf("%d%02d", 2019, c(1,10)) since you're intending to use integers, but sometimes I find that seemingly-integer values can trigger Error ... invalid format '%02d', so I just strong-arm it. You could also use as.integer on each set of values ... another workaround.
I was speaking with a colleague of mine and he helped me figure out the solution. Like r2evans commented, sprintf is the correct function. The syntax that worked for me was:
MarkP$C = paste(MarkP$FY, sprintf("%02d", MarkP$FP), sep="")
What that did in my code was concatenate the two cells FY and FP together in a new column titled "C".
-It first added my FY column to the new cell.
-Then, since sep="", there was no separator character, so FY and FP were simply merged together.
-Since I wrapped FP in sprintf("%02d", ...), it padded the FP value with a leading zero before tacking it on.
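Putting it together, here's a minimal sketch with invented FY/FP values showing that the zero-padded keys now sort chronologically:

```r
# Invented FY/FP values that reproduce the ordering problem
MarkP <- data.frame(FY = 2015, FP = c(10, 1, 13, 2))

# Zero-pad FP so that character sorting matches chronological order
MarkP$C <- paste(MarkP$FY, sprintf("%02d", MarkP$FP), sep = "")
sort(MarkP$C)
# [1] "201501" "201502" "201510" "201513"
```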

Assigning a variable in one dataset to multiple fields in another dataset

I'm trying to assign a variable in one dataframe into multiple rows of another dataframe - namely the AWND variable here (average wind speed).
I'm trying to obtain the AWND from here, and I am trying to match it with multiple dates based on the date here.
Here's what I've tried so far.
dfNew <- merge(dfWeather, dfFlight, by="DATE")
I'm not sure how to proceed with this.
Should I do a join?
(EDIT: Here's the data- https://shrib.com/#-7dXevTkb12Bt6Kdfxim (this is the dput output of the data I am getting AWND from)
I got the flights data (that I am trying to match dates with) from the nycflights13 package, and then I subset the flights data to include only the carriers that had at least 1000 flights depart from LaGuardia.
The flights data has the date-time class shown in your tibble. First, make sure that the elements you want to join on are the same; e.g. 2013-01-01 05:00:00 will not match 2013-01-01 in your dfWeather data.frame.
# Make sure dates match between data.frames
dfFlight$DATE <- stringr::str_extract(dfFlight$DATE, "\\S*")
# Join AWND wherever dates match to left-hand side
dfNew <- dplyr::left_join(dfFlight, dfWeather, by = "DATE")
I did assume some things about your data since I couldn't fully see what you're working with from the screenshot. This is my first answer on Stack Overflow, so feel free to edit or leave me suggestions.
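As a minimal sketch of that idea with invented stand-in data frames (the real dfFlight and dfWeather have many more columns):

```r
library(dplyr)

# Invented stand-ins: flights carry date-times, weather carries plain dates
dfFlight <- data.frame(
  DATE   = c("2013-01-01 05:00:00", "2013-01-02 06:30:00"),
  flight = c(1545, 1714)
)
dfWeather <- data.frame(
  DATE = c("2013-01-01", "2013-01-02"),
  AWND = c(10.9, 8.5)
)

# Keep only the date part so the join keys match
dfFlight$DATE <- stringr::str_extract(dfFlight$DATE, "\\S*")

# AWND is attached wherever the dates match
dfNew <- dplyr::left_join(dfFlight, dfWeather, by = "DATE")
```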
