How to combine multiple survey rounds into one panel dataset (R)?

I am analysing a longitudinal survey (https://microdata.worldbank.org/index.php/catalog/3712) with around 2k participating households (the number drops with each round). There were 11 waves/rounds, each divided into around 6-8 datasets based on the theme of the questions. To analyse it, I need it in proper panel data format, with each theme in one file, combining all the waves.
Please see the Excel snippets (with most columns removed for simplicity) of how it looks: round 1 vs round 9 (the levels of the categorical variables have changed names, from full names to just numbers, but it's the same question). Basically, the format looks something like this:
head(round1.csv)

ID   INCOME SOURCE   ANSWER   CHANGE
101  1. Business     1. YES   3. Reduced
101  2. Pension      2. NO
102  1. Business     1. YES   2. No change
102  2. Assistance   1. YES   3. Reduced
So far I have only been analysing separate waves on their own, but I do not know how to:
Combine so many data frames together.
Convert it to a format where each ID appears only once per wave. I used spread for modelling within single files. I can imagine what the data frame would look like if the question were only whether they receive the income source (maybe like this:
WAVE  ID   Business  Pension
1     101  1. YES    1. NO
1     102  1. YES    1. YES
2     101  1. NO     1. YES
2     102  1. YES    1. YES
), but I do not understand what it is supposed to look like with the change to that income source also included.
Deal with the weights - there are weights added to one of the files for each wave. Some are missing, and they change per wave, as fewer households agree to participate each round.
I am happy to filter and only use households that participated in every round, to make it easier.
I looked for an answer here: "Panel data, from wide to long with multiple variables" and "Transpose a wide dataset into long with multiple steps", but I think my problem is different.
I am a student, definitely not an expert, so I apologise for my skill-level.

There are too many questions at once; I'll ignore the weights (that should be a separate question, once the merging is resolved).
How to merge? For sure you'll be doing something called a left join. The leftmost dataset should be the longest one (the first wave). The others will be joined by ID, and IDs missing in the later waves will get NAs instead of values. I'll be using tidyverse code examples; see the left_join documentation.
You'll have to deal with a few things on the way.
duplicate column names: you can use the suffix argument, like suffix = c(".wave1", ".wave2")
different coding of the data (seen in your samples, e.g. s7q1 as "1. YES" vs "1"): use something like extract() to get the same representation
When you're done with the joining, you need to reshape your data. That would be something like pivot_longer(), followed by extract() to move the .wave# suffix into a separate column. Then you can pivot_wider() back into a wider format, keeping your wave column.
R-like pseudocode that illustrates how it could be done; it is untested, as I don't have your datasets:
library(tidyverse)
library(readxl)

read_excel("wave1.xlsx") -> d_w1
read_excel("wave2.xlsx") -> d_w2

# harmonise the coding first ("1. YES" -> "1")
d_w1 %>%
  extract(s7q1, into = "s7q1", regex = "([0-9]+)") ->
  d_w1fix

# join the waves, reshape long, split off the wave suffix, reshape back
d_w1fix %>%
  left_join(d_w2, by = "ID", suffix = c(".wave1", ".wave2")) %>%
  pivot_longer(-ID, names_to = "question", values_to = "answer") %>%
  extract(question, into = c("question", "wave"),
          regex = "([[:alnum:]]+)\\.wave([0-9])") %>%
  pivot_wider(names_from = "question", values_from = "answer") ->
  d_final
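Here is the same pipeline as a self-contained toy version with made-up two-household data, so you can run it and inspect the shape at each step (the column name s7q1 is just a stand-in for your real question codes):

library(tidyverse)

d_w1 <- tibble(ID = c(101, 102), s7q1 = c("1. YES", "2. NO"))
d_w2 <- tibble(ID = 101, s7q1 = "1")   # wave 2 lost household 102

d_w1 %>%
  extract(s7q1, into = "s7q1", regex = "([0-9]+)") %>%          # "1. YES" -> "1"
  left_join(d_w2, by = "ID", suffix = c(".wave1", ".wave2")) %>%
  pivot_longer(-ID, names_to = "question", values_to = "answer") %>%
  extract(question, into = c("question", "wave"),
          regex = "([[:alnum:]]+)\\.wave([0-9])") %>%
  pivot_wider(names_from = "question", values_from = "answer")
# household 102 gets NA for s7q1 in wave 2, since it dropped out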

Related

How do I find a mean of a value in a row where each value is in a different data set?

I am a beginner with R and I'm having a hard time finding any info related to the task at hand. Basically, I am trying to calculate 5-year averages for 15 categories of NHL player statistics and add each of the results as a column in the '21-22 player data. So, for example, I'd want the 5-year averages for a player (e.g. Connor McDavid) to be displayed in the same dataset as the 21-22 player data, but each of the numbers needed to calculate the mean lives in its own spreadsheet that has been imported into R. I have an .xlsx worksheet for each year from 17-18 to 21-22, so 5 sheets in total. I have loaded each of the sheets into RStudio, but the next steps are very difficult for me to figure out.
I think I have to use a pipe, locate one specific cell (e.g. Connor McDavid, goals) in 5 different data frames, use a mean function to find the average for that one particular cell, assign that as a vector 5_year_average_goals, then add that vector as a column in the original 21-22 dataset so I can compare each player's production last season to their 5-year averages. Then repeat that step for each column (assists, points, etc.). Would I have to repeat these steps for each player (row)? Is there an easy way to use a placeholder that will calculate these averages for every player in the 21-22 dataset?
This is a guess, and just a start ...
I suggest as you're learning that using dplyr and friends might make some of this simpler.
library(dplyr)
library(readxl)   # for read_xlsx

files <- list.files(pattern = "xlsx$")
lapply(setNames(nm = files), read_xlsx) %>%
  bind_rows(.id = "filename") %>%
  arrange(filename) %>%
  group_by(`Player Name`) %>%
  mutate(across(c(GPG, APG), ~ cumsum(.) / seq_along(.),
                .names = "{.col}_5yr_avg"))
Brief walk-through:
list.files(.) should produce a character vector of all of the filenames. If you need it to be a subset, filter it as needed. If they are in a different directory, then files <- list.files("path/to/dir", pattern="xlsx$", full.names=TRUE).
lapply(..) reads the xlsx file for each of the filenames found by list.files(.), returns a list of data.frames.
bind_rows(.) combines all of the nested frames into a single frame, and adds the names of the list as a new column, filename, which will contain the file from which each row was extracted.
arrange(.) sorts by the filename, which should put things in chronological order, as required for a running average. I'm assuming that the files sort correctly; you may need to adjust this if I got that wrong.
group_by(..) makes sure that the next expressions only see one player at a time (across all files).
mutate calculates (I believe) the running average over the years. It's not perfectly resilient to issues (e.g., gaps in years), but it's a good start.
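If you then want just the 5-year averages next to the '21-22 season rows, one option is to keep each player's most recent row. This is a sketch, assuming the result of the pipeline above was stored as all_stats and that the 21-22 file sorts last:

# Hypothetical follow-up, still grouped by player
all_stats %>%
  slice_tail(n = 1) %>%    # last (most recent) row per player
  ungroup() %>%
  select(`Player Name`, ends_with("_5yr_avg"))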
Hope this helps.

Looking for an R function to divide data by date

I'm just 2 days into R, so I hope I can give enough info on my problem.
I have an Excel table on endothelial cell angiogenesis with technical repeats on 4 different dates (but those dates are not in order and are in different weeks).
My data looks like this (of course it's not only the 2nd of March): [screenshot of the data]
I want to average the data for those 4 different days, so I can compare e.g. the "Nb Nodes" from day 1 to day 4.
The end goal is a jitterplot containing the group, the investigated data point and the date.
I'm a medical student, so I don't really have any knowledge about this kind of stuff yet, but I'm trying to learn. Hopefully I provided enough info!
Found the solution:
#Group by date and group
library(dplyr)
DateGroup <- group_by(Exclude0, Exp.Date, Group)
#Summarising the mean in every group and date
summarise(DateGroup, mymean = mean(`Nb meshes`))
I think the below code will work.
1. group_by() the dimension you want to summarize by.
2. summarize() with across():
2a. across() is a helper verb so that you don't need to manually type each column; it allows us to use tidyselect language to quickly reference columns that contain "Nb" (a pattern I noticed from your screenshot).
2b. The second argument of across() is the function or formula you want to apply to each column selected by the first argument.
2c. The optional .names argument gives the new columns a naming convention.
Good luck on your R learning! It's a really great language and you made the right choice.
#df is your data frame
df %>%
  group_by(Exp.Date) %>%
  summarize(across(contains("Nb"), mean, .names = "mean_{.col}"))

#if you just want a single column then do this
df %>%
  group_by(Exp.Date) %>%
  summarize(mean_nb_nodes = mean(`Nb nodes`))

Finding the mean of a subset with 2 variables in R

Sincere apologies if my terminology is inaccurate; I am very new to R and programming in general (<1 month experience). I was recently given the opportunity to do data analysis on a project I wish to write up for a conference and could use some help.
I have a csv file (cida_ams_scc_csv) with patient data from a recent study. It's a data frame with columns for patient ID ('Cow ID'), location of sample ('QTR', either LH, LF, RH or RF), date ('Date', written DD/MM/YY), and the lab result from testing of the sample ('SCC', an integer).
On any given day, each of the four anatomic locations was sampled and tested for each patient. I want to find the average 'SCC' of each of the locations for each of the patients, across all days the patient was sampled.
I was able to find the average SCC for each patient across all days and all anatomic sites using the code below.
aggregate(cida_ams_scc_csv$SCC, list(cida_ams_scc_csv$'Cow ID'), mean)
Now I want to add another "layer," where I see not just the patient's average, but the average of each patient for each of the 4 sample sites.
I honestly have no idea where to start. Please walk me through this in the simplest way possible, I will be eternally grateful.
It is always better to provide a minimal reproducible example, but here the answer might be easy enough that it's not necessary...
You can use the same code to do what you want. If we look at the aggregate documentation ?aggregate we find that the second argument by is
a list of grouping elements, each as long as the variables in the data
frame x. The elements are coerced to factors before use.
Therefore running:
aggregate(mtcars$mpg, by = list(mtcars$cyl, mtcars$gear), mean)
returns the "double-grouped" means.
In your case that means adding the "second layer" to the list you pass as value for the by parameter.
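Applied to your columns, that would look something like this (untested, as we don't have your data):

# Mean SCC for every Cow ID / QTR combination
aggregate(cida_ams_scc_csv$SCC,
          by = list(CowID = cida_ams_scc_csv$`Cow ID`,
                    QTR   = cida_ams_scc_csv$QTR),
          FUN = mean)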
I'd recommend dplyr for working with data frames - it's fast and friendly. Here's a way to calculate the average of each patient for each location using it:
# Load the package
library(dplyr)
# Make some fake data that looks similar to yours
cida_ams_scc_csv <- data.frame(
  QTR = gl(n = 4, k = 5, labels = c("LH", "LF", "RH", "RF"), length = 40),
  Date = rep(c("06/10/2021", "05/10/2021"), each = 5),
  SCC = runif(40),
  Cow_ID = 1:5
)
# Group by ID and QTR, then calculate the mean for each group
cida_ams_scc_csv %>%
  group_by(Cow_ID, QTR) %>%
  summarise(grouped_mean = mean(SCC))
which returns one mean SCC per Cow_ID and QTR combination.

How to assign numerical values in one column to text in other column

Basically I'm trying to sort through a large dataset where the first 3 numbers correspond to different texts. Before I can filter through, I'm trying to assign the different strings values.
Crops 101
Fishing 102
Livestock 103
Movies, TV, & Stage 201
In the larger dataset, there are hundreds of numbers such as 1018347, where the first three digits correspond to crops, together with the number of times that value appeared. The digits after specify what type of crop, but for the purpose of my work I need to sort through the entire thing by the first three digits and sum the amounts for each occurrence. I'm fairly new to R and wasn't able to find a sufficient answer, so any help would be appreciated.
Not sure if I am understanding your question correctly, but it seems you are looking for a way to first create a new variable based on the first three digits, and afterwards summarize the results per category.
What could work is:
data %>%
  mutate(first_part = substr(variable, 1, 3)) %>%
  group_by(first_part) %>%
  summarize(occurrences = n())
The code above counts the number of times each "first_part" (the first 3 digits) occurs. The same approach can also be used for the second part, or both together.
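If you also need to sum an amount per category rather than just count rows, sum() slots into the same pipeline. A sketch, where `amount` is a placeholder for whatever your value column is called:

data %>%
  mutate(first_part = substr(variable, 1, 3)) %>%
  group_by(first_part) %>%
  summarize(occurrences = n(),
            total = sum(amount))   # `amount` is a hypothetical column name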

Calculating the proportions of Yes or No responses in R

I am new to R and really trying to wrap my head around everything (even taking an online course, which so far has not helped at all).
What I started with is a large data frame containing 97 variables pertaining to compliance with regulations.
I have created multiple data frames based on the various geographic locations (there is probably an easier way to do it).
In each of these data frames, there are 7 variables for which I would like to find the proportions of "Yes" and "No" responses.
I first tried:
summary(urban$vio_bag)
   Length     Class      Mode
      398 character character
However, this just tells me nothing useful except that I have 398 responses.
So I put this into a table:
urbanbag<-table(urban$vio_bag)
This at least provided me with the number of Yes and No responses
  Var1 Freq
1   No  365
2  Yes   30
So I then converted to a data.frame:
urbanbag = as.data.frame(urbanbag)
Then viewed it:
summary(urbanbag)
  Var1        Freq
 No :1   Min.   : 30.0
 Yes:1   1st Qu.:113.8
         Median :197.5
         Mean   :197.5
         3rd Qu.:281.2
         Max.   :365.0
And that output definitely did not help either; it was actually even less useful.
I am not building these matrices in R; the table is imported from Excel.
I am just so lost and frustrated, having spent days trying to figure out something that seems so elementary, and googling for help did not work out.
Is there a way to actually do this?
We can use prop.table to get the proportion
v1 <- prop.table(table(urban$vio_bag))
then use barplot to plot it
barplot(v1)
Try dplyr's n() (which performs counts) within summarise():
library(dplyr)
data %>% group_by(yes_no_column) %>% summarise(my_counts = n())
This will give you the counts you're looking for. Adjust the group_by() variables as needed; multiple variables can be used at a time for grouping purposes. Just like n(), functions such as mean and sd can be passed to summarise. If you want to make a column out of each calculated metric, use mutate().
Oscar.
prop.table is a useful way of doing this. You can also solve this using mean:
mean(urban$vio_bag == "Yes")
mean(urban$vio_bag == "No")
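And since the data was split into separate data frames per location, a grouped version of this avoids that entirely. A sketch, assuming the original combined frame is called df and has a location column (both placeholder names):

library(dplyr)

# Proportion of "Yes" per geographic location, from the combined frame
df %>%
  group_by(location) %>%
  summarise(prop_yes = mean(vio_bag == "Yes"))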
