How to create a data.frame with 3 factors? - r

I hope you won't find my question too silly, i did a lot of research but it seems that i can't figure how to solve this really annoying issue.
Well, i have datas for 6 participants (P) in an experiment, with 50 trials (T) per participants and 10 condition (C). So i'd like to create a dataframe in r allowing me to put these datas.
This data.frame should have 3 factors (P, T and C) and so a number of total row of (P*T*C). The difficulty for me is to create this one, since i have the datas for the 6 participant in 6 data.frame of 100 obs(T) by 10 varibles(C).
I'd like first to create the empty dataset with these factors, and then copy the values of the 6 data.set according to the factors P, T and C.
Any help would be greatly appreciated, i'm novice in r.
Thank you.

OK; First we create one big dataframe for all participants:
result<-rbind(dfrforparticipant1, dfrforparticipant2,...dfrforparticipant6) #you'll have to fill out the proper names of the original data.frames
Next, we add a column for the participant ID:
numTrials<-50 #although 100 is also mentioned in your question
result$P<-as.factor(rep(1:6, each=numTrials))
Finally, we need to go from 'wide' format to 'long' format (I'm assuming your column names holding the results for each condition are called C1, C2 etc. ; I'm also assuming your original data.frames already held a column named T to denote the trial), like this (untested, since you did not provide example data):
orgcolnames<-paste("C", 1:10, sep="")
result2<-reshape(result, varying=list(orgcolnames), v.names="val", idvar=c("T","P"), timevar="C", times=seq_along(orgcolnames), direction="long")
What you want is now in result2.

Related

How to compute questionnaire total score and subscores by summing all and a selection of columns in R?

I'm new in R and I'm having a little issue. I hope some of you can help me!
I have a data.frame including answers at a single questionnaire.
The rows indicate the participants.
The first columns indicates the participant ID.
The following columns include the answers to each item of the questionnaire (item.1 up to item.20).
I need to create two new vectors:
total.score <- sum of all 20 values for each participant
subscore <- sum of some of the items
I would like to use a function, like a sum(A:T) in Excel.
Just to recap, I'm using R and not other software.
I already did it by summing each vector just with the symbol +
(data$item.1 + data$item.2 + data$item.3 etc...)
but it is a slow way to do it.
Answers range from 0 to 3 for each item, so I expect a total score ranging from 0 to 60.
Thank you in advance!!
Let's use as example this data from a national survey with a questionnaire
If you download the .csv file to your working directory
data <- read.csv("2016-SpanishSurveyBreastfeedingKnowledge-AELAMA.csv", sep = "\t")
Item names are p01, p02, p03...
Imagine you want a subtotal of the first five questions (from p01 to p05)
You can give a name to the group:
FirstFive <- c("p01", "p02", "p03", "p04", "p05")
I think this is worthy because of probably you will want to perform more tasks with this group (analysis, add or delete a question from the group...), and because it helps you to provide meaningful names (for instance "knowledge", "attitudes"...)
And then create the subtotal variable:
data$subtotal1 <- rowSums(data[ , FirstFive])
You can check that the new variable is the sum
head(data[ , c(FirstFive, "subtotal2")])
(notice that FirstFive is not quoted, because it is an object outside data, but subtotal2 is quoted, because it is the name of a variable in data)
You can compute more subtotals and use them to compute a global score
You could may be save some keystrokes if you know that these variables are the columns 20 to 24:
names(data)[20:24]
And then sum them as
rowSums(data[ , c(20:24)])
I think this is what you asked for, but I would avoid doing this way, as it is easier to make mistakes, whick can be hard to be detected

missing values for each participant in the study

I am working in r, what I want to di is make a table or a graph that represents for each participant their missing values. i.e. I have 4700+ participants and for each questions there are between 20 -40 missings. I would like to represent the missing in such a way that I can see who are the people that did not answer the questions and possible look if there is a pattern in the missing values. I have done the following:
Count of complete cases in a data frame named 'data'
sum(complete.cases(mydata))
Count of incomplete cases
sum(!complete.cases(mydata$Variable1))
Which cases (row numbers) are incomplete?
which(!complete.cases(mydata$Variable1))
I then got a list of numbers (That I am not quite sure how to interpret,at first I thought these were the patient numbers but then I noticed that this is not the case.)
I also tried making subsets with only the missings, but then I litterly only see how many missings there are but not who the missings are from.
Could somebody help me? Thanks!
Zas
If there is a column that can distinguish a row in the data.frame mydata say patient numbers patient_no, then you can easily find out the patient numbers of missing people by:
> mydata <- data.frame(patient_no = 1:5, variable1 = c(NA,NA,1,2,3))
> mydata[!complete.cases(mydata$variable1),'patient_no']
[1] 1 2
If you want to consider the pattern in which the users have missed a particular question, then this might be useful for you:
Assumption: Except Column 1, all other columns represent the columns related to questions.
> lapply(mydata[,-1],function(x){mydata[!complete.cases(x),'patient_no']})
Remember that R automatically attach numbers to the observations in your data set. For example if your data has 20 observations (20 rows), R attaches numbers from 1 to 20, which is actually not part of your original data. They are the row numbers. The results produced by the R code: which(!complete.cases(mydata$Variable1)) correspond to those numbers. The numbers are the rows of your data set that has at least one missing data (column).

aggregate data.frame with formula and date variable goes wrong

I want to aggregate and count how often in my dataset is a special kind of disease at one date. (I don't use duplicate because I want all rows, not only the duplicated ones)
My original data set looks like:
id dat kinds kind
AE00302 2011-11-20 valv 1
AE00302 2011-10-31 vask 2
(of course my data.frame is much larger)
I try this:
xagg<-aggregate(kind~id+dat+kinds,subx,length)
names(xagg)<-c("id","dat","kinds","kindn")
and get:
id dat kinds kindn
AE00302 2011-10-31 valv 1
AE00302 2011-11-20 vask 1
I wonder why R is going wrong by the 'date' resp. the 'kinds'-column.
Has anybody an idea?
I still don't know why.
But I found out, aggregate goes wrong, because of columns I don't use for aggregating.
Therefor these steps solve the problem for me:
# 1st step: reduce the data.frame to only the needed columns
# 2nd Step: aggregate the reduced data.frame
# 3rd Step: merge aggregated data to reduced dataset
# 4th step: remove duplicated rows from reduced dataset (if they occur)
# 5th step: merge reduced dataset without dublicated data to original dataset
Maybe the problem occurs, if there are duplicated datasets in the aggregated data.frame.
Thanks for all your help, questions and attempts to solve my problem!
elchvonoslo

R: Function: generate and save multiple matrices based on multiple conditions

I am a new R user and an unexperienced coder and I have a data handling problem. Hopefully someone can help:
I have a data.frame with 3 columns (firm, year, class) and about 50.000 rows. I want to generate and store for every firm a (class x year) matrix with class counts as the elements in the matrix. Every matrix would be automatically named something like firm.name and stored so that I can use them afterwards for computations. Ideally, I'd be able to change the simple class counts into a function of values in columns 4 and 5 (backward and forward citations)
I am looking at 40 firms, 30 years, and about 1500 classes (so many firm-year-class counts are zero).
I realise I can get most of what I need (for counts) by simply using table(class,year,firm) as these columns have the same length. However, I don't know how to either store or access the matrices this function generates...
Any help would be greatly appreciated!
Simon
So, your question is how to deal with a table object?
Example:
#note the assigment operator
mytable <- with(ChickWeight, table(cut(weight, c(0,100,200,Inf)), Diet, Chick))
#access the data for the first chick
mytable[,,1]
#turn the table object into a data.frame
as.data.frame(mytable)

R sum list of data.frame with almost same structure

Here is my question,
I have a list of data.frames. It's produced by same piece of codes with different data.
All of the data.frames looks like
US 100 (not guarantee to exist in another data.frame because data is different)
CA 50
...
Is there any fast/neat way to sum over all the data.frames?
I am not sure whether I have understood your problem correctly, but here a possible solution:
Try to put all your dataframes in a list, e.g., your_list=list(df1,df2,...)
Then use total_df=do.call(rbind,your_list) to combine all dataframes (row-wise).
After that you can use ddply(total_df,"country",function (x) sum(x$value)) to aggregate the data. Here, I have assumed that US and CA stand for entries in a country column and 100 and 50 for entries in a value column.

Resources