So here's my problem:
I have a bunch of data about sound production and where the emphasis falls in a word. What I'm trying to do is determine whether the difference in production between stressed and unstressed syllables is significant. The problem is that when I try to use the cor() function, the data sets aren't the same length: I have about 500 instances of stressed syllables but only 400 of unstressed syllables. I'm very new to R, but here's the code I've attempted:
data <- read.csv('D:/blaaah/Stressed.csv', header = TRUE)    # stressed syllables
var1 <- data$intdiff
data <- read.csv('D:/blaaah/Unstressed.csv', header = TRUE)  # unstressed syllables (reuses the name data)
var2 <- data$intdiff
cor(var1, var2)   # error: var1 and var2 have different lengths
Of course, I get an error because the data sets are different lengths. So how do I check for significance between the sets without them having to be the same length?
Thanks a bunch!
P.S. Just ask if my question isn't clear. I'm afraid I sometimes assume everyone knows what I'm doing...
Using cor() would be appropriate if you expected a relationship between var1 and var2, for instance if you'd expect an item in var2 to be larger when the corresponding item in var1 is larger. That breaks down when the data sets are not the same length, because there is no corresponding item to compare once you get past the end of the shorter data set.
In this case, I think a comparison of the two data sets to establish whether their means differ is more likely to be useful to you. For that you'd want to use a t-test, as described (with examples in R) here. You'd also want to confirm that the assumptions for using the t-test hold in your case, e.g. see here.
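For example, a minimal sketch reusing var1 and var2 from the code above (Welch's t-test, R's default, does not require the two samples to be the same length):
t.test(var1, var2)   # Welch two-sample t-test; reports a p-value for the difference in means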
I have a question regarding a command. In a class we used runif to create a training set that should contain 50% of the data set (we developed a decision tree based on this training set). But I still can't understand the logic behind this command; could someone explain to me how it works?
I understand decision trees and the logic behind splitting up a data set; my question is explicitly about how this command works.
inTrain <- runif(nrow(USArrests)) < 0.5
You have a dataset named USArrests with nrow(USArrests) rows, let's say 100 for the sake of simplicity. So runif(nrow(USArrests)) creates 100 uniformly distributed random numbers between 0 and 1, i.e. one number for every row in your dataset.
Next, the expression runif(nrow(USArrests)) < 0.5 checks whether each number is less than 0.5, returning TRUE or FALSE. This gives you a logical vector of length 100 (or nrow(USArrests)) that indicates whether a row belongs to the training set or to the test set. Since each uniform number falls below 0.5 with probability 0.5, roughly half the rows end up in each set.
It's not shown, but finally you select your training data with
USArrests[inTrain, ]
and your test data with
USArrests[!inTrain, ]
(note the logical negation !; negative indexing with - only works for numeric indices, not for a logical vector like inTrain).
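A runnable sketch putting it all together (USArrests ships with R; the seed is arbitrary):
set.seed(42)                                  # make the split reproducible
inTrain <- runif(nrow(USArrests)) < 0.5       # logical vector, roughly 50% TRUE
train <- USArrests[inTrain, ]                 # rows where inTrain is TRUE
test  <- USArrests[!inTrain, ]                # all remaining rows
nrow(train) + nrow(test) == nrow(USArrests)   # TRUE: every row used exactly once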
I am new to R and have a data set containing a column with 3 states (1, 2, 3). The problem is I don't know how to split the data set with the respective dummy variables so as to create box plots and ultimately a linear model.
Please help!! :'(
So I think you can just specify which feature is categorical. Say
data <- read.csv(filename)
data$feature <- factor(data$feature)
where feature is the column you want to convert to categorical data.
Is that what you are looking for?
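For instance, with a made-up data frame (the column names here are hypothetical):
data <- data.frame(state = c(1, 2, 3, 1, 2), y = c(2.1, 3.4, 5.0, 1.9, 3.6))
class(data$state)            # "numeric" before conversion
data$state <- factor(data$state)
levels(data$state)           # "1" "2" "3" -- three categories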
If I understand your problem, you have 2 columns: one with factor levels (1, 2, 3 in your example) and another with a response variable. Is that it? (An example with part of your data would be very helpful.) In any case, if your data has this structure you don't need to split it. For a boxplot, just run
boxplot(data$variable ~ data$factor)
You can use the same approach for a linear model:
lm(data$variable ~ data$factor)
If your data has a different structure, you will need to explain it before someone can give further help...
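A minimal end-to-end sketch with simulated data (all names and values here are made up):
set.seed(1)
data <- data.frame(state = factor(rep(1:3, each = 20)),
                   y     = rnorm(60, mean = rep(c(10, 12, 15), each = 20)))
boxplot(y ~ state, data = data)     # one box per state
fit <- lm(y ~ state, data = data)   # R builds the dummy variables automatically
summary(fit)                        # coefficients contrast states 2 and 3 with state 1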
I am using the edgeR and limma packages to analyse an RNA-seq count data table.
I only need a subset of the data file, so my question is: do I need to normalize my data across all the samples, or is it better to subset the data first and normalize it then?
Thank you.
Regards Lisanne
I think it depends on what you want to prove/show. If you also want to take into account your "darkcounts", then you should normalize first, so that you also account for the fraction of cases in which your experiment fails. Here your total number of experiments (good and bad results) sums to one.
If you want to find out the distribution of your "good events", then you should first produce your subset of good samples and normalize afterwards. In this case your number of good events sums to 1.
So once again, it depends on what you want to prove. As a physicist I would prefer the first method, since we do not remove bad data points.
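In edgeR terms, the two orders look roughly like this (a sketch; counts is the full count matrix and keep indexes the samples of interest, both hypothetical):
library(edgeR)
dge_all <- calcNormFactors(DGEList(counts = counts))         # (a) normalize across all samples...
dge_a   <- dge_all[, keep]                                   # ...then subset
dge_b   <- calcNormFactors(DGEList(counts = counts[, keep])) # (b) subset first, then normalize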
Cheers TL
I have a species abundance dataset with quite a few zeros in it, and even when I set trymax = 1000 for metaMDS() the program is unable to find a stable solution for the stress. I have already tried combining data (collapsing multiple years together to reduce the number of zeros) and I can't do any more. I was just wondering if anyone knows: is it scientifically valid to pick what R gives me at the end (the lowest-stress of the 1000 solutions), or should I not be using NMDS because it cannot find a stable spot? There seems to be very little information about this on the internet.
One explanation for this is that you are trying to use too few dimensions for the mapping. I presume you are using the default k = 2? If so, try k = 3 and compare the stress of the best k = 3 solution with the best you got from the 1000 tries at k = 2.
I would be a little concerned to take one solution out of 1000 just because it had the best/lowest stress.
You could also try 1000 more random starts to see if it converges with more iterations. If you saved the output from metaMDS(), you can supply that object to another call to metaMDS() via the previous.best argument. It will then do trymax further random starts, compare any lower-stress solution with the previous best, and converge as soon as it finds one similar to it, rather than having to find two similar low-stress solutions within the 1000 starts.
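A sketch of both suggestions, assuming comm is your species abundance matrix (a hypothetical name):
library(vegan)
sol  <- metaMDS(comm, k = 2, trymax = 1000)   # the run described in the question
sol3 <- metaMDS(comm, k = 3, trymax = 1000)   # try one more dimension
sol$stress; sol3$stress                       # compare the best stresses
sol2 <- metaMDS(comm, k = 2, trymax = 1000, previous.best = sol)  # restart from the saved best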
My question deals with the fracdiff.sim function in R (in the fracdiff package), for which the help document, just like that of arima.sim, is not really clear about initial values.
It's fine that stationary processes do not depend on their initial values as time grows, but my aim is to see in my simulations the return of my long-memory process (fitted with arfima) to its mean.
Therefore, I need to input at least the p final values of my in-sample process (and possibly q innovations) if it is ARFIMA(p,d,q). In other words, I would like to set the burn-in period's length to 0 and give starting values instead.
Nevertheless, I'm currently not able to do this. I know that fracdiff.sim lets the user choose the length of the burn-in period (which leads to the stationary behavior) and the mean of the simulated process (the process is simulated and then translated to make the means match). There is also a condition: the length of the burn-in period must be >= p+q. I suppose the innov argument has something to do with it, but I'm really not sure.
This idea is inspired by the arima.sim function, which has a start.innov argument. However, even if my aim were only to simulate an ARMA(p,q), I'm not sure of the exact use of this argument (the help is quite poor): must we input only q innovations? Add to them the p last values of the in-sample process? In which order?
To sum up, I want to simulate ARFIMA processes starting from a specific value and having a specific mean, in order to see the return to the mean and not only the long-term behavior. I found the beginnings of solutions for arima.sim on the internet, but nobody clearly answered, and if the solution uses start.innov, how do I solve the problem for ARFIMA processes (fracdiff.sim doesn't have the start.innov argument)?
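For illustration, here is the kind of experiment I have in mind with arima.sim (the AR coefficient and starting innovation below are made up; start.innov supplies the innovations used during the burn-in, whose length is set by n.start):
set.seed(1)
y <- arima.sim(model = list(ar = 0.9), n = 200,
               n.start = 1, start.innov = 10)   # start the AR(1) far from its mean of 0
plot(y)                                         # the series drifts back towards 0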
Hoping I have been clear enough,