R data long-wide restructure

I am currently working with data in R. My dataset looks like this (except with about 3 million observations):
[screenshot of the original long-format data]
And I have two objectives.
Objective 1 is to restructure it so that it looks like this:
[screenshot of the desired wide format]
Then my second objective is to go back to the original structure and make it look like this:
[screenshot of the restored long format]
I have tried variations of aggregate with dcast (which is apparently being deprecated...?)
so, for example, I have tried this:
df2 <- dcast(df1, Store + Sales ~ Year, value.var = "Sales")
or even
df %>%
  group_by(Store, Year) %>%
  summarise(across(starts_with('Sales'), sum))
Either way I get a diagonal of sales totals across the years, and I'm unable to collapse them so that the result looks like:
Store  Year1  Year2
A      $$     $$
B      $$     $$
Since there is so much data, it looks like a bunch of stacked identity matrices, except that instead of 1's there are the sales values for the years (and there are many years, not just two).
I am looking for suggestions on how to proceed. One package I found was 'pivottabler', but I have not used it yet; I wanted to see whether anyone had better suggestions first.
:::Much appreciated:::
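For reference, a minimal sketch with tidyr::pivot_wider() (column names Store, Year, Sales assumed from the example above; note that reshape2::dcast() is retired, while data.table::dcast() is still maintained). The diagonal pattern appears because Sales is on the left-hand side of the dcast formula, so every distinct Sales value becomes its own group; the value column should appear only in value.var / values_from:
library(dplyr)
library(tidyr)
# Objective 1: one row per Store, one column per Year, summing Sales over duplicates
df_wide <- df %>%
  pivot_wider(id_cols = Store, names_from = Year,
              values_from = Sales, values_fn = sum)
# Objective 2: back to the original long structure
df_long <- df_wide %>%
  pivot_longer(-Store, names_to = "Year", values_to = "Sales")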

Related

3-way Contingency Table R: How to get marginal sum, percentages per group

I have been trying to create a contingency table in R with the percentage distribution of education for husbands (6 categories) and wives (6 categories) BY marriage cohort (4 cohorts in total). My ideal is something like this: [IdealTable screenshot].
However, the most I have been able to get is: [CurrentTable screenshot].
I am not able to figure out how to convert my row and column sums to percentages (similar to the ideal). The current code that I am using is:
three.table = addmargins(xtabs(~MarriageCohort + HerEdu + HisEdu, data = mydata))
ftable(three.table)
Is there a way I can turn the row and column sums into percentages for each marriage cohort?
How can I add labels to this and export the ftable?
I am relatively new to R and have tried to find solutions to the questions above on Google, but haven't been successful. This is my first time posting a query on this platform, and any help will be greatly appreciated! Thank you!
One approach would be to create separate xtab runs for each MarriageCohort:
Cohorts <- lapply( split(mydata, mydata["MarriageCohort"]),
                   function(z) xtabs( ~HerEdu + HisEdu, data = z) )
Then get totals in each Cohorts item before dividing the cohort addmargins(.) result by those totals and multiplying by 100 to get percent values:
divCohorts <- lapply(Cohorts, function(tbl) 100*addmargins(tbl)/sum(tbl) )
Then you will need to clean those items up as desired. You have not included data, so the cleanup remains your responsibility. (I did not use sapply because that could give you a big matrix that might be difficult to manage, but you could try it and see whether you are satisfied with that approach in the second step.)
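On the export part of the question, a small hedged sketch (same objects as above; write.ftable() from base R's stats package writes an ftable out as text):
# Round the percentage tables for readability
pctCohorts <- lapply(divCohorts, function(tbl) round(tbl, 1))
# Flatten and export one cohort's percentage table
write.ftable(ftable(pctCohorts[[1]]), file = "cohort1_percentages.txt", quote = FALSE)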

Looking for an R function to divide data by date

I'm just 2 days into R, so I hope I can give enough info on my problem.
I have an Excel table on endothelial cell angiogenesis with technical repeats on 4 different dates. (Those dates are not in order and fall in different weeks.)
My data looks like this (of course it's not only the 2nd of March):
[screenshot of the data]
I want to average the data on those 4 different days, so I can compare e.g. the "Nb Nodes" from day 1 to day 4.
The end goal is a jitter plot containing the group, the investigated data point, and the date.
I'm a medical student, so I don't really have any knowledge about this kind of stuff yet, but I'm trying to learn it. Hopefully I provided enough info!
Found the solution:
# Group by date and group
library(dplyr)
DateGroup <- group_by(Exclude0, Exp.Date, Group)
# Summarize the mean within every group and date
summarise(DateGroup, mymean = mean(`Nb meshes`))
I think the code below will work.
1. group_by() the dimension you want to summarize by.
2. summarize() across the columns of interest:
2a. across() is a helper verb so that you don't need to manually type each column; it lets us use tidyselect language to quickly reference columns that contain "Nb" (a pattern that I noticed in your screenshot).
2b. The second argument of across() is the function (or formula) you want to apply to each column selected by its first argument.
2c. The optional .names argument of across() gives the new columns a naming convention.
Good luck on your R learning! It's a really great language and you made the right choice.
# df is your data frame
df %>%
  group_by(Exp.Date) %>%
  summarize(across(contains("Nb"), list(mean = mean), .names = "{.fn}_{.col}"))
# If you just want a single column, then do this
df %>%
  group_by(Exp.Date) %>%
  summarize(mean_nb_nodes = mean(`Nb nodes`))
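Since the end goal was a jitter plot, here is a minimal ggplot2 sketch (column names Exp.Date, Group, and `Nb nodes` assumed from the question):
library(ggplot2)
# One point per measurement: dates on the x-axis, points colored by group
ggplot(df, aes(x = factor(Exp.Date), y = `Nb nodes`, colour = Group)) +
  geom_jitter(width = 0.2) +
  labs(x = "Experiment date", y = "Nb nodes")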

How to summarize two data frames by matching date columns?

I have two data frames, Original and base:
Original <- data.frame(Bond = c("A", "B", "C", "D"),
                       Date = c("19-11-2021", "19-11-2021", "19-11-2021", "17-11-2021"),
                       Rate = c("O_11", "O_12", "O_13", "O_31"))
base <- data.frame(Date = c("19-11-2021", "18-11-2021", "17-11-2021"),
                   Rate = c("B_1", "B_2", "B_3"))
Here I would like to calculate the rate differential between Original and base for each bond on each date, w.r.t. the base rate. The output should be in the following format:
[expected output screenshot]
Note: the actual data frames contain numerical values for the Original and base rates.
I was trying group_by() but wasn't able to proceed much further. Please help me with this; even a suggestion would be welcome.
Seems like you want to join on date. With dplyr you can do that with an inner_join(), assuming there exists a date in base for every record in Original:
Output <- Original %>%
  inner_join(base, by = "Date") %>%
  mutate(Rate_Diff = paste0(Rate.x, "-", Rate.y), Rate = Rate.x) %>%
  select(-Rate.x, -Rate.y)
> Output
Bond Date Rate_Diff Rate
1 A 19-11-2021 O_11-B_1 O_11
2 B 19-11-2021 O_12-B_1 O_12
3 C 19-11-2021 O_13-B_1 O_13
4 D 17-11-2021 O_31-B_3 O_31
Edit: I see the note now; in that case you can just replace the paste0() call with actual arithmetic on the columns:
mutate(Rate_Diff = Rate.x - Rate.y, Rate = Rate.x)
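A hedged side note on the join choice: if some dates in Original had no match in base, inner_join() would silently drop those bonds; left_join() keeps them, with an NA rate differential instead:
Output <- Original %>%
  left_join(base, by = "Date") %>%
  mutate(Rate_Diff = Rate.x - Rate.y, Rate = Rate.x) %>%
  select(-Rate.x, -Rate.y)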

Finding the mean of a subset with 2 variables in R

Sincere apologies if my terminology is inaccurate; I am very new to R and programming in general (<1 month experience). I was recently given the opportunity to do data analysis on a project I wish to write up for a conference and could use some help.
I have a csv file (cida_ams_scc_csv) with patient data from a recent study. It's a data frame with columns for patient ID ('Cow ID'), location of sample ('QTR', either LH, LF, RH, or RF), date ('Date', written DD/MM/YY), and the lab result from testing of the sample ('SCC', an integer).
For any given day, each of the four anatomic locations for each patient was sampled and tested. I want to find the average 'SCC' of each of the locations for each of the patients, across all days the patient was sampled.
I was able to find the average SCC for each patient across all days and all anatomic sites using the code below.
aggregate(cida_ams_scc_csv$SCC, list(cida_ams_scc_csv$`Cow ID`), mean)
Now I want to add another "layer," where I see not just the patient's average, but the average of each patient for each of the 4 sample sites.
I honestly have no idea where to start. Please walk me through this in the simplest way possible, I will be eternally grateful.
It is always better to provide a minimal reproducible example, but here the answer might be easy enough that it's not necessary.
You can use the same code to do what you want. If we look at the aggregate documentation ?aggregate we find that the second argument by is
a list of grouping elements, each as long as the variables in the data
frame x. The elements are coerced to factors before use.
Therefore running:
aggregate(mtcars$mpg, by = list(mtcars$cyl, mtcars$gear), mean)
Returns the "double grouped" means
In your case that means adding the "second layer" to the list you pass as value for the by parameter.
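Applied to your data, that would look something like the sketch below (column names taken from the question; backticks handle the space in `Cow ID`):
aggregate(cida_ams_scc_csv$SCC,
          by = list(ID = cida_ams_scc_csv$`Cow ID`, QTR = cida_ams_scc_csv$QTR),
          FUN = mean)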
I'd recommend dplyr for working with data frames - it's fast and friendly. Here's a way to calculate the average of each patient for each location using it:
# Load the package
library(dplyr)
# Make some fake data that looks similar to yours
cida_ams_scc_csv <- data.frame(
  QTR = gl(n = 4, k = 5, labels = c("LH", "LF", "RH", "RF"), length = 40),
  Date = rep(c("06/10/2021", "05/10/2021"), each = 5),
  SCC = runif(40),
  Cow_ID = 1:5)
# Group by ID and QTR, then calculate the mean for each group
cida_ams_scc_csv %>%
  group_by(Cow_ID, QTR) %>%
  summarise(grouped_mean = mean(SCC))
which returns one grouped_mean row for every Cow_ID and QTR combination.

How can I add the populations of males and females together to remove gender as a variable in a demographics table in RStudio? [duplicate]

This question already has answers here:
How to sum a variable by group
(18 answers)
Closed 2 years ago.
This is my first time posting a question, so I may not have the correct info to start; apologies in advance. I am new to R and prefer to use dplyr or the tidyverse, because those are the packages we've used so far. I did search for a similar question, but most gender/sex-related questions are about separating the data or performing operations on each group separately.
I have a table of population counts with the variables (factors) Age Range, Year, and Sex, and Population as the dependent variable. I want to create a plot to show whether the population is aging, that is, how the relative proportion of different age groups changes over time. But gender is not relevant, so I want to add together the population counts for males and females for each year and age range.
I don't know how to provide a copy of the raw data .csv file, so if you have any suggestions, please let me know.
This is a sample of the data(output table):
And here is the code so far:
file_name <- "AusPopDemographics.csv"
AusDemo_df <- read.table(file_name, sep = ",", header = TRUE)
(grp_AusDemo_df <- AusDemo_df %>% group_by(Year, Age))
I am guessing it may be something like pivot_wider() to bring male and female up as column headings, then transmute() to sum them and create a new population column.
Thanks for your help.
With dplyr you could do something like this
library(dplyr)
grp_AusDemo_df <- AusDemo_df %>%
  group_by(Year, Age) %>%
  summarise(Population = sum(Population, na.rm = TRUE))
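Since the stated goal is to show how the relative proportion of age groups changes over time, a hedged follow-up step (same column names as above) converts the summed counts into within-year shares:
prop_AusDemo_df <- grp_AusDemo_df %>%
  group_by(Year) %>%
  mutate(Proportion = Population / sum(Population))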
