wrong result in comparison - r

R beginner here. I have a data.frame that contains information on trotting horses (their wins, earnings, time records and such). I have a subsetted data.frame organised in a way that every row contains information for every specific year the horse competed. I have a variable called Competition.age that states what age the horses were each year they competed.
I'm writing down my summary statistics stratified by age and sex of the horse using both the summary() function and describe() from package psych. For example:
summary(Data_year[Data_year$Competition.age>="3"&
Data_year$Competition.age<="6"& Data_year$Sex=="Mare", ])
This works perfectly fine. But when I try to get a range between 7 and 10 years (instead of 3 and 6), it only returns NAs. The str() function with this line of code returns a blank list of variables; for some reason it won't read the data.
I've even created separate subsetted data.frames with only these years (7, 8, 9 and 10 respectively) and there are no problems with those, individually. I created subsetted data frames with ranges 7-8, 7-9 and they were fine! But 7-10 created an empty data.frame.
Any help will be greatly appreciated!!

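In your comment you wrote that Data_year$Competition.age is an integer. The problem is that "7" is not numeric. If you compare a numeric value with a non-numeric value (e.g. a character string), the numeric value is coerced to character and the comparison is done on strings (alphabetical order). In alphabetical order "3" is greater than (sorts after) "10", and so is "7". That is why the 7 to 10 range is empty: as strings, no age is both >= "7" and <= "10". The ranges 3 to 6, 7 to 8 and 7 to 9 only have single-digit bounds and happen to work.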
See this example:
age <- 1:15
sort(as.character(age))
You want Data_year$Competition.age >= 3 and Data_year$Competition.age <= 6, and so on.
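To see the effect on a small vector, here is a minimal sketch. The vector age is the one from the example above; the other names are made up for illustration.

```r
age <- 1:15

# Comparing against strings coerces age to character,
# so the test is alphabetical rather than numeric.
as_text <- age[age >= "7" & age <= "10"]

# Comparing against numbers gives the intended range.
as_number <- age[age >= 7 & age <= 10]

# With string bounds, the 3 to 6 range only works by accident.
lucky <- age[age >= "3" & age <= "6"]

print(as_text)
print(as_number)
print(lucky)
```

The string version selects nothing, which is why the subsetted data.frame comes out empty and summary() has only NAs to report.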

Related

NAs introduced by coercion with a line of code that worked some days before

I am trying to convert a variable coded as a factor to numeric. Two variables in the database are coded the same way. For one of them the code I tried works, but for the other all the values are converted to NAs, and I don't know why, because until a few days ago it worked too.
The database is called 'Visitas'
The variables are 'result' and 'dose' both of them contain numbers representing the result of a medical trial and the administered dose in mg.
The code I used is the following:
Visitas[, 'result'] <- as.numeric(as.character(Visitas[, 'result']))
Visitas[, 'dose'] <- as.numeric(as.character(Visitas[, 'dose']))
I have to convert them to character first so that I get the number itself rather than the index of the factor level. This is how the variables are coded when I import the database.
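A minimal sketch of how one might track down which values turn into NA. It does not diagnose the actual 'Visitas' data: the factor dose below and its contents (a comma decimal and a text entry) are hypothetical, chosen only because as.numeric() cannot parse them.

```r
# Hypothetical factor, standing in for a column such as Visitas[, 'dose'].
dose <- factor(c("10", "2.5", "7,5", "n/a"))

# as.numeric() on a factor returns the level codes, not the labels.
codes <- as.numeric(dose)

# Going through character gives the labels; unparseable labels become NA.
values <- suppressWarnings(as.numeric(as.character(dose)))

# List the original labels that failed to convert.
bad <- unique(as.character(dose)[is.na(values) & !is.na(dose)])

print(values)
print(bad)
```

If every value of a column ends up in bad, it is worth checking how the file was imported (for example the decimal separator), since that is one plausible way for a column to stop converting from one day to the next.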

After subsetting data frame, functions do not work on subset

My initial dataset was a csv containing information about the number of bikes that were rented in a certain city, with other variables being temperature, season, etc.
I was creating a subset based on conditions, to get a set in which saison is "3" or "4" and annee is "1". I tried the following:
P<- subset(velo,saison>2&annee==1)
I also tried
W<- velo[which(velo$annee==1 & velo$saison>2),]
Both returned the same data frame/subset of 183 obs. of 5 variables.
I then wanted to summarise the data through
summary(W$velos[saison==3])
summary(W$velos[saison==4])
It gives me the following outputs
In the data set I can see that the column season is not full of NaN and doing the class() returns integer for that column.
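The issue is that saison is used without saying which data frame it belongs to: inside the brackets, R looks for an object called saison in the workspace, not for the column of W. Extract the column from W explicitly: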
summary(W$velos[W$saison==3])
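As a minimal sketch of the fix, assuming a small made-up velo data frame (the values below are invented, only the column names come from the question):

```r
# Hypothetical stand-in for the bike rental data.
velo <- data.frame(
  saison = c(1, 2, 3, 3, 4, 4),
  annee  = c(1, 1, 1, 0, 1, 1),
  velos  = c(10, 20, 30, 40, 50, 60)
)

W <- velo[which(velo$annee == 1 & velo$saison > 2), ]

# Refer to the column through the data frame it lives in.
mean_s3 <- mean(W$velos[W$saison == 3])

# with() evaluates the expression inside W, so bare column names work there.
mean_s4 <- with(W, mean(velos[saison == 4]))

print(summary(W$velos[W$saison == 3]))
print(summary(with(W, velos[saison == 4])))
```

Both forms look the column up in W, so the condition has the same length as W$velos.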

Removing data frames from a list that contains a certain value under a variable in R

Currently have a list of 27 correlation matrices with 7 variables, doing social science research.
Some correlations are "NA" due to missing data.
When I do the analysis, however, I do not analyse all variables in one go.
In a particular instance, I would like to keep one of the variables conditionally, if it contains at least some value other than "NA". Since there are 7 variables, I am keeping anything that DOES NOT contain 6 "NA"s plus the correlation with itself, 1. This is the tricky part, because 1 is a value, but it is meaningless to me in a correlation matrix.
Appreciate if anyone could enlighten me regarding the code.
I am rather new to R, and the only thought I have is to use an if statement to set the condition. I have been trying for hours to no avail, as this is my first real coding experience.
Thanks a lot.
Since you didn't provide sample data, I am first going to convert your matrix into a dataframe, and then I am going to assume that you want to see whether your dataframe df has a variable var with at least one value other than NA or 1.
df <- as.data.frame(as.table(matrix)) should convert your matrix into a dataframe (here matrix stands for one of your correlation matrices).
table(df$var) will show you the distribution of values in your dataframe's variable. From here you can make your judgement call on whether to keep the variable or not.
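If the goal is to drop whole matrices from the list, one possible approach (a swapped-in technique, not the method described above) is to test the off-diagonal entries of the variable and pass that test to Filter(). This is only a sketch: the 3-variable matrices, the variable names v1 to v3 and the helper has_correlation() are all hypothetical.

```r
vars <- paste0("v", 1:3)

# Two hypothetical correlation matrices (filled column by column).
m_ok <- matrix(c(1, 0.2, NA,
                 0.2, 1, 0.5,
                 NA, 0.5, 1), 3, 3, dimnames = list(vars, vars))
m_empty <- matrix(c(1, NA, NA,
                    NA, 1, 0.4,
                    NA, 0.4, 1), 3, 3, dimnames = list(vars, vars))
cor_list <- list(a = m_ok, b = m_empty)

# TRUE if `var` has at least one non-NA correlation with another variable.
# The correlation of `var` with itself (the 1 on the diagonal) is skipped.
has_correlation <- function(m, var) {
  others <- setdiff(colnames(m), var)
  any(!is.na(m[var, others]))
}

# Keep only the matrices in which v1 correlates with something.
kept <- Filter(function(m) has_correlation(m, "v1"), cor_list)
print(names(kept))
```

Skipping the diagonal by name avoids having to treat the value 1 specially, so a genuine correlation of 1 with another variable would still count.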

missing values for each participant in the study

I am working in R. What I want to do is make a table or a graph that shows the missing values for each participant. I have 4700+ participants, and for each question there are between 20 and 40 missing values. I would like to represent the missing values in such a way that I can see which people did not answer the questions, and possibly check whether there is a pattern in the missing values. I have done the following:
Count of complete cases in a data frame named 'mydata'
sum(complete.cases(mydata))
Count of incomplete cases
sum(!complete.cases(mydata$Variable1))
Which cases (row numbers) are incomplete?
which(!complete.cases(mydata$Variable1))
I then got a list of numbers that I am not quite sure how to interpret. At first I thought these were the patient numbers, but then I noticed that this is not the case.
I also tried making subsets with only the missing values, but then I literally only see how many missing values there are, not who they are from.
Could somebody help me? Thanks!
Zas
If there is a column that identifies each row in the data.frame mydata, say a patient number patient_no, then you can easily find the patient numbers of the people with missing values:
> mydata <- data.frame(patient_no = 1:5, variable1 = c(NA,NA,1,2,3))
> mydata[!complete.cases(mydata$variable1),'patient_no']
[1] 1 2
If you want to consider the pattern in which the users have missed a particular question, then this might be useful for you:
Assumption: Except Column 1, all other columns represent the columns related to questions.
> lapply(mydata[,-1],function(x){mydata[!complete.cases(x),'patient_no']})
Remember that R automatically attaches numbers to the observations in your data set. For example, if your data has 20 observations (20 rows), R attaches the numbers 1 to 20, which are not part of your original data. They are the row numbers. The results produced by which(!complete.cases(mydata$Variable1)) correspond to those numbers: they are the rows of your data set in which Variable1 is missing.
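To get a per-participant overview rather than a per-question one, counting the NAs in each row is one option. A minimal sketch, assuming a made-up data frame in which patient_no is the identifier and q1, q2 are hypothetical question columns:

```r
# Hypothetical data: patient numbers deliberately differ from row numbers.
mydata <- data.frame(
  patient_no = c(101, 102, 103, 104),
  q1 = c(NA, 2, 3, NA),
  q2 = c(1, NA, 3, NA)
)

questions <- setdiff(names(mydata), "patient_no")

# Number of unanswered questions per participant.
n_missing <- rowSums(is.na(mydata[questions]))
missing_per_patient <- data.frame(patient_no = mydata$patient_no,
                                  n_missing = n_missing)

# which() returns row numbers; index patient_no with them to get the IDs.
rows <- which(!complete.cases(mydata$q1))
ids <- mydata$patient_no[rows]

print(missing_per_patient)
print(ids)
```

The matrix is.na(mydata[questions]) can also be inspected or plotted directly when looking for patterns across questions.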

How to create a data.frame with 3 factors?

I hope you won't find my question too silly. I did a lot of research, but I can't figure out how to solve this really annoying issue.
I have data for 6 participants (P) in an experiment, with 50 trials (T) per participant and 10 conditions (C). I'd like to create a data frame in R to hold these data.
This data.frame should have 3 factors (P, T and C), and so a total number of rows of (P*T*C). The difficulty for me is creating it, since I have the data for the 6 participants in 6 data.frames of 100 obs (T) by 10 variables (C).
I'd like first to create the empty dataset with these factors, and then copy in the values of the 6 data sets according to the factors P, T and C.
Any help would be greatly appreciated, I'm a novice in R.
Thank you.
OK; First we create one big dataframe for all participants:
result<-rbind(dfrforparticipant1, dfrforparticipant2,...dfrforparticipant6) #you'll have to fill out the proper names of the original data.frames
Next, we add a column for the participant ID:
numTrials<-50 #although 100 is also mentioned in your question
result$P<-as.factor(rep(1:6, each=numTrials))
Finally, we need to go from 'wide' format to 'long' format (I'm assuming your column names holding the results for each condition are called C1, C2 etc. ; I'm also assuming your original data.frames already held a column named T to denote the trial), like this (untested, since you did not provide example data):
orgcolnames<-paste("C", 1:10, sep="")
result2<-reshape(result, varying=list(orgcolnames), v.names="val", idvar=c("T","P"), timevar="C", times=seq_along(orgcolnames), direction="long")
What you want is now in result2.
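The same steps can be tried end to end on toy data. This is only a sketch under stated assumptions: 2 participants, 3 trials and 2 conditions instead of the real sizes, and a hypothetical helper make_participant() that builds data frames with a trial column T and condition columns C1, C2.

```r
# Hypothetical per-participant data: one row per trial, one column per condition.
make_participant <- function(offset) {
  data.frame(T = 1:3, C1 = offset + 1:3, C2 = offset + 11:13)
}
p1 <- make_participant(0)
p2 <- make_participant(100)

# Stack the participants and add the participant factor.
numTrials <- 3
result <- rbind(p1, p2)
result$P <- as.factor(rep(1:2, each = numTrials))

# Wide to long: one row per participant, trial and condition.
orgcolnames <- paste("C", 1:2, sep = "")
result2 <- reshape(result, varying = list(orgcolnames), v.names = "val",
                   idvar = c("T", "P"), timevar = "C",
                   times = seq_along(orgcolnames), direction = "long")

print(nrow(result2))
print(head(result2))
```

With these sizes result2 has 2 * 3 * 2 = 12 rows; T and C come out as integers and can be turned into factors with as.factor() if needed.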
