Attempting to run cor.test with outliers - r

I am new to R, so I apologize if this is a fairly basic question. I am analyzing the nycflights13 data set and I am attempting to run a cor.test on distance and departure delay (dep_delay). I want to take out any outliers prior to running the correlation. However, when I do this, I end up getting an error because the lengths are not the same. I am just wondering how to fix this problem. Do I need to replace those missing values with NA? If so, how do I do that? Do I need to create new parameters that remove all the rows with missing info? Below is a photo of my code.
I tried rewriting my code several ways to see if there was a better way to do this, but I ultimately got the same error. I think I'm just not removing the outliers properly.

You could try taking the union of the indices that are not outliers, like this:
# keep values between the 0.3rd and 99.7th percentiles of each variable
delay_thresh = quantile(flights$dep_delay, probs = c(0.003, 0.997), na.rm = TRUE)
dist_thresh = quantile(flights$distance, probs = c(0.003, 0.997), na.rm = TRUE)
# indices of rows that are non-outliers in at least one of the two variables
indices = union(
  which(flights$dep_delay > delay_thresh[1] & flights$dep_delay < delay_thresh[2]),
  which(flights$distance > dist_thresh[1] & flights$distance < dist_thresh[2])
)
# both vectors are subset by the same indices, so their lengths match
cor.test(flights$dep_delay[indices], flights$distance[indices])
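The original length error comes from filtering each vector separately; indexing both vectors by the same indices, as above, fixes that, and cor.test() drops any remaining NA pairs on its own. If you would rather keep only the rows that are non-outliers in both variables (one reading of "take out any outliers", not what the answer above does), you could intersect the index sets instead of unioning them:
# stricter variant: a row survives only if it is a non-outlier in *both* variables
indices = intersect(
  which(flights$dep_delay > delay_thresh[1] & flights$dep_delay < delay_thresh[2]),
  which(flights$distance > dist_thresh[1] & flights$distance < dist_thresh[2])
)
cor.test(flights$dep_delay[indices], flights$distance[indices])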

Related

How do I use prodlim function with a non-binary variable in formula?

I am trying to (eventually) plot data by groups, using the prodlim function.
I'm adjusting and adapting code that someone else (not available for questions) has written, and I'm not very familiar with the prodlim library/function. There are definitely other ways to do what I'd like to, but I'm trying to keep it consistent with what the previous person did.
I have code that works, when dividing the data into 2 groups, but when I try to adjust for a 4 group situation, I get an error.
Of note, the data is coming over from SAS using StatTransfer, which has been working fine.
I am new to coding, but I have compared the dataframes I'm trying to work with. The second is just a subset of the first (where the code does work), with all the same variables, and both of the variables I'm trying to group by are integer values.
Hist(medpop$dz_time, medpop$dz_status) works just fine, so the problem must be with the prodlim function, and I haven't understood much of what I've looked up about it, sadly :/ But the documentation seems to indicate it supports continuous or categorical variables, and doesn't seem limited to binary either. None of the options seem applicable as I understand them.
this works:
M <- prodlim(Hist(dz_time, dz_status)~med, data=pop)
where med is a binary variable equal to 1 when a member of this population is taking it, and dz is a disease that some portion of them develop.
this does not:
(either of these gets the error below)
N <- prodlim(Hist(dz_time, dz_status)~strength, data=medpop)
N <- prodlim(Hist(dz_time, dz_status)~strength, data=pop, subset=pop$med==1)
medpop = the subset of the original population taking the med,
strength = categorical variable ("1","2","3","4")
For the line that does work, the next step is just plot(M), giving a plot with two lines, med==0 and med==1 (showing cumulative incidence of dz_status by dz_time).
For the other line, I get an error saying
Error in KernSmooth::dpik(cumtabx/N, kernel = "box") :
scale estimate is zero for input data
I don't know what that means or how to fix it. :/
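One thing that may be worth trying (a guess based on the error, not a confirmed fix): when the grouping variable on the right of ~ is numeric, prodlim treats it as a continuous covariate and smooths it with KernSmooth::dpik, which is where the "scale estimate is zero" message comes from. Declaring strength as a factor should make prodlim treat it as four discrete groups instead:
library(prodlim)
# assumption: strength is currently stored as an integer; make it a factor
medpop$strength <- factor(medpop$strength)
N <- prodlim(Hist(dz_time, dz_status) ~ strength, data = medpop)
plot(N)   # one curve per strength level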

ANOVA Test - Separate Groups with Two-Factor Comparison

Good morning,
I am trying to run some ANOVA tests on my dataset (using R) and I keep getting errors. I am trying to compare the average percentage of correct responses, as a factor of what "group" the subjects were in and what session/day it was. However, I have two separate conditions that I need to analyze separately.
So essentially, I need to compare PctCorrect in Condition 1, between groups and sessions and then do the same thing for condition 2.
I attempted using this code:
aov(ext$Pct.Correct[ext$Condition=="NC-EXT"]~ext$Group*ext$Session, data=ext)
and I got the following error:
Error in model.frame.default(formula = ext$Pct.Correct[ext$Condition == :
  variable lengths differ (found for 'ext$Group')
I ran this code to make sure that all of my values were even:
mytable <- table(ext$Session, ext$Group, ext$Condition)
ftable(mytable)
And they were all the same value (which was to be expected), so I am not sure what's wrong.
I am very new to R so it's entirely possible that I am doing this completely wrong. Any help would be greatly appreciated.
You are filtering the left-hand side of the formula but not the right-hand side, hence the "variable lengths differ" error.
You could try filtering your data frame in the data= argument, like this:
aov(Pct.Correct ~ Group * Session, data = ext[ext$Condition == "NC-EXT", ])
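An equivalent form that keeps the full data frame (just an alternative phrasing of the same fix, not from the answer above) is the subset argument; the same call can then be repeated for the second condition:
# fit the two-way ANOVA for the first condition only
fit_nc <- aov(Pct.Correct ~ Group * Session, data = ext, subset = Condition == "NC-EXT")
summary(fit_nc)
# repeat with whatever label the second condition has in ext$Condition (label below is hypothetical)
# fit_c <- aov(Pct.Correct ~ Group * Session, data = ext, subset = Condition == "C-EXT")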

Discretization by only one column of the dataset using mdlp()

library(discretization)
data("CO2")
disc <- mdlp(CO2[4])
I just need to discretize the 4th column of the data set provided, but I get an "Error in data[1, ] : incorrect number of dimensions" error. Could you please help me fix this?
I don't know if this is what you're going for, but 1) mdlp needs more than just one column of data, and 2) it also has trouble working with complex objects like CO2, which is a grouped-data frame carrying extra attributes. Here is one way to make it execute:
CO2.df <- as.data.frame(CO2) # strips the extra info
mdlp(CO2.df[,4:5])
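One caveat worth adding (my understanding of the discretization package, not something stated above): mdlp() treats the last column as the class label and discretizes the other columns against it, so CO2.df[,4:5] cuts conc using uptake as the "class". If the goal is to cut conc against a real grouping variable such as Treatment, that column would go last instead:
library(discretization)
CO2.df <- as.data.frame(CO2)                 # strips the extra info
# assumption: use the Treatment factor as the class label, placed in the last column
disc <- mdlp(CO2.df[, c("conc", "Treatment")])
disc$cutp                                    # cut point(s) chosen for conc
head(disc$Disc.data)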

Dealing with points vs. rows

I fear I've missed some crucial point in my education thus far.
I have a table HR and I've performed functions on it.
For example, HR$FTE <- HR$'Std Hrs' / 38 gives me a new column with a value for each employee; working as intended.
However, whenever I try to apply a function while creating a new column, it doesn't like that. The question I posted yesterday is similar in nature: the error came from the whole row being returned.
An example that doesn't work would be HR$FYEnd <- as.Date(paste(HR$FY + 1, "06", "30", sep = "-")). In this case, "non-numeric argument to binary operator" is returned, as HR$FY is not a single numeric value but a whole column of numeric data. What should be output is a set of dates falling on 30/06.
In Excel (which I'm trying to train myself to leave), the equivalent when dealing with tables would be [#[FY Start]] or something to that effect, which demonstrates that you're working with the figure on that row rather than the whole row.
Worked it out a couple of days later.
The step that I was missing was using the mapply/sapply commands. Using these has sorted everything out.
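For what it's worth, a guess about the original error (not something the post confirms): "non-numeric argument to binary operator" at HR$FY + 1 usually means FY is stored as character or factor rather than numeric. Converting it first lets the vectorised version run directly, without mapply/sapply:
# assumption: FY arrived from the source system as character/factor
HR$FY    <- as.numeric(as.character(HR$FY))
HR$FYEnd <- as.Date(paste(HR$FY + 1, "06", "30", sep = "-"))   # e.g. FY 2024 -> "2025-06-30"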

Creating vectors between specific values in a dataset with R

I have quite a special case with a dataset and what I want to do with it. To make it comprehensible I have to give a brief description of the background:
I have a sensor producing data, which needs maintenance every now and then. Between maintenances the data produced has a decreasing trend which I want to get rid of, and since maintenance is carried out quite often, I want to automate this procedure.
The sensor is turned off during maintenance, but the telemetry system still produces readings marked with "*". Therefore the subsets of data to be detrended can be easily spotted between batches of "*" readings.
I have been (unsuccessfully) trying to create a vector (on which I can then carry out a detrending procedure) by selecting the desired values in a loop with conditional statements. To begin selecting the values I used the following statement:
if((tryp[i-2,2]="*")&(tryp[i-1,2]="*")&(tryp[i,2]!="*"))
and to finish the selection (exit the loop):
if((tryp[i-2,2]!="*")&(tryp[i-1,2]!="*")&(tryp[i,2]="*"))
However, this last statement gives an error of "argument is of length zero" and the first statement doesn't seem to be working properly either.
This is what the data looks like (see the screenshot):
So for example, one subset of data that I would like to select for de-trending is between data points 9686 and 9690. Obviously this is very small subset, but it shows well what I am trying to communicate.
I would really appreciate if someone could let me know about an elegant way of doing this, including anything way different from what I was trying to do originally.
Many thanks!
library(dplyr)
my_df <- data.frame(a = LETTERS[1:10], b = c('+','*','*', '+', '*', '*', '+', '+', '*', '*'))
my_df %>% filter(b != '*')
Supposing the '+' signs are your data points, you can easily get rid of the '*' readings by filtering out the rows that contain them.
And of course a solution without the dplyr-package:
my_df[which(my_df$b!='*'),]
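If you also need each stretch between two maintenance batches as its own vector (which is what the detrending step ultimately wants), one sketch of that, reusing the toy my_df above, is to build a run id with cumsum() and split on it. Incidentally, the comparisons in the original loop use = where == is needed, which is one reason the if() conditions misbehave.
# sketch: give every run of non-'*' readings its own id, then split into a list
is_data  <- my_df$b != '*'
run_id   <- cumsum(!is_data)                 # increments at every '*' reading
segments <- split(my_df[is_data, ], run_id[is_data])
segments                                     # one data frame per interval between maintenances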
