Calculate frequency and percentage within a subgroup in R - percentage

I have a data frame called measlescases comprised of columns like "Final Classification' 'ProvinceofResidence', 'AgeCategory', 'Vaccinationstatus', etc.
I would like to calculate the number of measles cases (obtained from the 'Final Classification' column) each by 'ProvinceofResidence', 'AgeCategory', 'Vaccinationstatus'.
Can someone give me a line of code using one of these columns as examples?
Thanks
YEt to try. Still new to R

Related

How to calculate the average of different groups in a dataset using R

I have a dataset in R that I would like to find the average of a given variable for each year in the dataset (here, from 1871-2019). Not every year has the same number of entries, and so I have encountered two problems: first, how to find the average of the variable for each year, and second, how to add the column of averages to the dataset. I am unsure how to approach the first problem, but I attempted a version of the second problem by simply finding the sum of each group and then trying to add those values to the dataset for each entry of a given year with the code teams$SBtotal <- tapply(teams$SB, teams$yearID, FUN=sum). That code resulted in an error that notes replacement has 149 rows, data has 2925. I know that this can be done less quickly in Excel, but I'm hoping to be able to use R to solve this problem.
The tapply should work
data(iris)
tapply(iris$Sepal.Length, iris$Species, FUN = sum)

How to compute questionnaire total score and subscores by summing all and a selection of columns in R?

I'm new in R and I'm having a little issue. I hope some of you can help me!
I have a data.frame including answers at a single questionnaire.
The rows indicate the participants.
The first columns indicates the participant ID.
The following columns include the answers to each item of the questionnaire (item.1 up to item.20).
I need to create two new vectors:
total.score <- sum of all 20 values for each participant
subscore <- sum of some of the items
I would like to use a function, like a sum(A:T) in Excel.
Just to recap, I'm using R and not other software.
I already did it by summing each vector just with the symbol +
(data$item.1 + data$item.2 + data$item.3 etc...)
but it is a slow way to do it.
Answers range from 0 to 3 for each item, so I expect a total score ranging from 0 to 60.
Thank you in advance!!
Let's use as example this data from a national survey with a questionnaire
If you download the .csv file to your working directory
data <- read.csv("2016-SpanishSurveyBreastfeedingKnowledge-AELAMA.csv", sep = "\t")
Item names are p01, p02, p03...
Imagine you want a subtotal of the first five questions (from p01 to p05)
You can give a name to the group:
FirstFive <- c("p01", "p02", "p03", "p04", "p05")
I think this is worthy because of probably you will want to perform more tasks with this group (analysis, add or delete a question from the group...), and because it helps you to provide meaningful names (for instance "knowledge", "attitudes"...)
And then create the subtotal variable:
data$subtotal1 <- rowSums(data[ , FirstFive])
You can check that the new variable is the sum
head(data[ , c(FirstFive, "subtotal2")])
(notice that FirstFive is not quoted, because it is an object outside data, but subtotal2 is quoted, because it is the name of a variable in data)
You can compute more subtotals and use them to compute a global score
You could may be save some keystrokes if you know that these variables are the columns 20 to 24:
names(data)[20:24]
And then sum them as
rowSums(data[ , c(20:24)])
I think this is what you asked for, but I would avoid doing this way, as it is easier to make mistakes, whick can be hard to be detected

Estimating new columns with n-1 cases in R

I am trying to build a code for a fund analysis, which starts from the returns of the fund at different frequencies. I have been able to split the data by frequencies, that is, daily, weekly, monthly, quarterly and yearly, but the next step is not quite working for me. I have done it many times on excel and SPSS, but since R is a new language for me, it is proving to be challenging. A sample of my data is given herewith :
Date Dat1 Dat2
30/06/2009 54.26 1307.16
31/07/2009 65.28 1425.40
31/08/2009 70.71 1498.97
30/09/2009 76.18 1552.84
30/10/2009 71.92 1532.74
30/11/2009 77.14 1559.57
What I wish to do is to have two more columns, with n-1 elements in them, starting at the second index. So the entries in the new columns for the 30/06/2009 would be '-' and '-' and for the 31/07/2009 row would be the values of (65.28-54.26)/54.26 and (1425.40-1307.16)/1307.16 and so forth, until the very last case. But when I run the simple code
Daily$Dat1.Return <- diff(log(Daily$Dat1)) I get the following error :
Error in `$<-.data.frame`(`*tmp*`, Dat1, value = c(-0.0616981144702153, :
replacement has 2161 rows, data has 2162
How can I get the columns I want?

Removing data frames from a list that contains a certain value under a variable in R

Currently have a list of 27 correlation matrices with 7 variables, doing social science research.
Some correlations are "NA" due to missing data.
When I do the analysis, however, I do not analyse all variables in one go.
In a particular instance, I would like to keep one of the variables conditionally, if it contains at least some value (i.e. other than "NA", since there are 7 variables, I am keeping anything that DOES NOT contain 6"NA"s, and correlation with itself, 1 -> this is the tricky part because 1 is a value, but it's meaningless to me in a correlation matrix).
Appreciate if anyone could enlighten me regarding the code.
I am rather new to R, and the only thought I have is to use an if statement to set the condition. But I have been trying for hours but to no avail, as this is my first real coding experience.
Thanks a lot.
since you didn't provide sample data, I am first going to convert your matrix into a dataframe and then I am just going to pretend that you want us to see if your dataframe df has a variable var with at least one non-NA or 1. value
df <- as.data.frame(as.table(matrix)) should convert your matrix into a dataframe
table(df$var) will show you the distribution of values in your dataframe's variable. from here you can make your judgement call on whether to keep the variable or not.

missing values for each participant in the study

I am working in r, what I want to di is make a table or a graph that represents for each participant their missing values. i.e. I have 4700+ participants and for each questions there are between 20 -40 missings. I would like to represent the missing in such a way that I can see who are the people that did not answer the questions and possible look if there is a pattern in the missing values. I have done the following:
Count of complete cases in a data frame named 'data'
sum(complete.cases(mydata))
Count of incomplete cases
sum(!complete.cases(mydata$Variable1))
Which cases (row numbers) are incomplete?
which(!complete.cases(mydata$Variable1))
I then got a list of numbers (That I am not quite sure how to interpret,at first I thought these were the patient numbers but then I noticed that this is not the case.)
I also tried making subsets with only the missings, but then I litterly only see how many missings there are but not who the missings are from.
Could somebody help me? Thanks!
Zas
If there is a column that can distinguish a row in the data.frame mydata say patient numbers patient_no, then you can easily find out the patient numbers of missing people by:
> mydata <- data.frame(patient_no = 1:5, variable1 = c(NA,NA,1,2,3))
> mydata[!complete.cases(mydata$variable1),'patient_no']
[1] 1 2
If you want to consider the pattern in which the users have missed a particular question, then this might be useful for you:
Assumption: Except Column 1, all other columns represent the columns related to questions.
> lapply(mydata[,-1],function(x){mydata[!complete.cases(x),'patient_no']})
Remember that R automatically attach numbers to the observations in your data set. For example if your data has 20 observations (20 rows), R attaches numbers from 1 to 20, which is actually not part of your original data. They are the row numbers. The results produced by the R code: which(!complete.cases(mydata$Variable1)) correspond to those numbers. The numbers are the rows of your data set that has at least one missing data (column).

Resources