Friedman test "not an unreplicated complete block design" error in R

I'm having trouble running a Friedman test over my data.
I'm trying to run a Friedman test using this command:
friedman.test(mean ~ isi | expId, data=monoSum)
On the following database (https://www.dropbox.com/s/2ox0y1b4gwld0ai/monoSum.csv):
> monoSum
   expId isi  N       mean
1  m80B1   1 10 100.000000
2  m80B1   2 10  73.999819
3  m80B1   3 10  45.219362
4  m80B1   4 10 116.566174
...
18 m80L2   2 10  82.945491
19 m80L2   3 10  57.675480
20 m80L2   4 10 207.169277
...
25 m80M2   1 10 100.000000
26 m80M2   2 10  49.752687
27 m80M2   3 10  19.042592
28 m80M2   4 10 150.411035
It gives me back the error:
Error in friedman.test.default(c(100, 73.9998193095267, 45.2193621626293, :
not an unreplicated complete block design
I figure it gives the error because, when monoSum$isi==1, the value of mean is always 100. Is this correct?
However, monoSum$isi==1 is always 100 because it is the control condition against which all the other monoSum$isi groups are normalized. I cannot assume a normal distribution, so I cannot run a repeated-measures ANOVA…
Is there a way to run a Friedman test on this data, or am I missing something essential here?
Many thanks in advance!

I don't get an error if I run your dataset:
Friedman rank sum test
data: mean and isi and expId
Friedman chi-squared = 17.9143, df = 3, p-value = 0.0004581
However, you have to make sure that expId and isi are coded as factors. Run these commands:
monoSum$expId <- factor(monoSum$expId)
monoSum$isi <- factor(monoSum$isi)
Then run the test again. This has worked for me with a similar problem.

I know this is pretty old, but for future generations (see also: me when I forget and google this again):
You can determine which values are missing from your data frame by running table(groups, blocks), or in the case of this question table(monoSum$isi, monoSum$expId). This returns a table of 0s and 1s; the missing records are in the cells with 0s.
I ran into this problem after trying to remove the blocks that had incomplete results; taking a subset of the data did not remove the blocks for some reason.
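A small sketch of that diagnostic on the question's data (the subset shown is just an assumed example, not from the original answer):
# Every cell of the groups x blocks table should be exactly 1
table(monoSum$isi, monoSum$expId)
# If expId is a factor, a removed block keeps showing up as an all-zero column
# until its level is dropped (e.g. with droplevels())
monoSub <- subset(monoSum, expId != "m80B1")
table(monoSub$isi, monoSub$expId)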

Just thought I would mention that I found this post because I was getting a similar error message. The above suggestions did not solve it. Strangely, I had to sort my data frame so that, block by block, the groups appeared in the same order, i.e. I could not have the following:
Block 1 A
Block 1 B
Block 2 B
Block 2 A
It had to appear as A, B, A, B.
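A minimal sketch of that reordering (the data frame and the column names block, group and value are assumptions, not from the original answer):
# Sort so that within each block the groups always appear in the same order
df <- df[order(df$block, df$group), ]
friedman.test(value ~ group | block, data = df)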

I ran into the same cryptic error message in R, though in my case it was resolved when I applied as.matrix() to what was originally a data frame built from the CSV file I had imported with read.csv().
I also had a missing data point in my original data set, and I found that when my data was transformed into a matrix for the friedman.test() call, the entire row containing the missing data point was omitted automatically.

Using as.matrix() to transform my data frame is the magic that got the function to run for me.
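For reference, friedman.test() also accepts a numeric matrix directly, with blocks as rows and groups as columns; a hedged sketch (the file name and the row.names column are assumptions):
# One row per block, one column per group, cells are the responses
m <- as.matrix(read.csv("mydata.csv", row.names = 1))
friedman.test(m)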

I had this exact error too with my dataset.
It turns out that friedman.test() accepts data frames (e.g. those created by data.frame()) but not tibbles (those created by dplyr and other tidyverse tools). The solution for me was to convert my dataset to a data frame first.
library(dplyr)  # for %>% and select()
D_fri <- D_all %>% dplyr::select(FrustrationEpisode, Condition, Participant)
D_fri <- as.data.frame(D_fri)
str(D_fri)  # confirm the object is now a 'data.frame'
friedman.test(FrustrationEpisode ~ Condition | Participant, D_fri)

I ran into this problem too. Fixed mine by removing the NAs.
# My data (called layers) looks like:
| resp.no | av.l.all | av.baseem | av.base |
| 1 | 1.5 | 1.3 | 2.3 |
| 2 | 1.4 | 3.2 | 1.4 |
| 3 | 2.5 | 2.8 | 2.9 |
...
| 1088 | 3.6 | 1.1 | 3.3 |
library(dplyr)    # for the %>% pipe
library(tidyr)    # for gather()
library(rstatix)  # for convert_as_factor()
# Remove NAs
layers1 <- na.omit(layers)
# Re-organise the data so the scores are stacked, with a factor column holding the original column name
layers2 <- layers1 %>%
  gather(key = "layertype", value = "score", av.l.all, av.baseem, av.base) %>%
  convert_as_factor(resp.no, layertype)
# Data now looks like this
| resp.no | layertype | score |
| 1 | av.l.all | 1.5 |
| 1 | av.baseem | 1.3 |
| 1 | av.base | 2.3 |
| 2 | av.l.all | 1.4 |
...
| 1088 | av.base | 3.3 |
# Then do Friedman test
friedman.test(score ~ layertype | resp.no, data = layers2)

Just want to share what my problem was: my ID factor did not have the correct levels after doing pivot_longer(), and because of this the same error was given. I made sure the levels were correct and it worked, using: df$ID <- as.factor(as.character(df$ID))
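A hedged sketch of that fix in context (the object and column names here are placeholders, not from the original answer):
library(tidyr)
long <- pivot_longer(wide, cols = -ID, names_to = "condition", values_to = "score")
long <- as.data.frame(long)  # friedman.test() dislikes tibbles (see the answer above)
# Re-create the factors so their levels match the data actually present
long$ID <- as.factor(as.character(long$ID))
long$condition <- as.factor(long$condition)
friedman.test(score ~ condition | ID, data = long)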

Reviving an old thread with new information. I ran into a similar problem after removing NAs. My group and block were factors before the NA removal. However, after removing NAs, the factors retained the levels from before the removal, even though some levels were no longer in the data!
Running friedman.test() with the as.matrix() trick (e.g. friedman.test(a ~ b | c, as.matrix(df))) was fine, but running frdAllPairsExactTest() or friedman_effsize() would still throw the "not an unreplicated complete block design" error. I ended up re-factoring the group and block (i.e. dropping the levels that were no longer in the data, df$block <- factor(df$block)) to make things work. After the re-factor, I did not need the as.matrix() trick either.
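A short sketch of that re-factoring step after NA removal (the data frame and column names are placeholders; droplevels() is an equivalent base R shortcut not mentioned in the answer):
df <- na.omit(df)
# Drop the factor levels that no longer occur in the data
df$group <- factor(df$group)
df$block <- factor(df$block)
# equivalently: df <- droplevels(df)
friedman.test(response ~ group | block, data = df)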

Related

Interpreting 'arulesSequences' Results in R

I used 'arulesSequences' in R to analyze a user's sequence of actions, and I got results like this:
# | sequence  | support
1 | <{A,B}>   | 0.87
2 | <{A},{B}> | 0.68
They seem to be the same sequence. What's the difference?
Does this mean that the session has changed?
Sequence 1 means that A and B happened together, at the same time. Sequence 2 means that A happened first, and then B happened.

Converting 1 row instance to a suitable format in R for repeated measures ANOVA

I'm really struggling with how to format my data to a suitable one in R.
At the moment, I have my data in the format of:
ParticipantNo | Sex | Age | IV1(0)_IV2(0)_DV1 | IV1(1)_IV2(0)_DV1 | etc
There are two levels for IV1, and 3 for IV2, so 6 columns per DV.
I've stacked them, so that I compare all IV1 results with each other, and the same for IV2 using a Friedman test.
However, I'd like to compare across groups like Sex and Age, and was told ANOVA is the best for this. I've used ANOVA directly before in SPSS, which accepts this data format.
The problem I have is getting this data into the correct format in R.
As I understand it, it should look like:
1 | M | 40 | IV1(0)_IV2(0)_DV1_Result
1 | M | 40 | IV1(1)_IV2(0)_DV1_Result
1 | M | 40 | IV1(0)_IV2(1)_DV1_Result
1 | M | 40 | IV1(1)_IV2(1)_DV1_Result
1 | M | 40 | IV1(0)_IV2(2)_DV1_Result
1 | M | 40 | IV1(1)_IV2(2)_DV1_Result
Then I can do
aov(sex~DV1_result, data=data)
Does this seem like the correct thing to do, and if so, how can I convert from the format I have to the one I need in R?
Figured it out!
I used stack on my data, and then separate (i.e. s = separate(stack(data), "ind", c("IV1", "IV2"))).
Then I could do the ANOVA by aov(values ~ IV1 * IV2, data = s)
Hope this helps someone!
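A self-contained sketch of that stack + separate approach (the column-name pattern, the "_" separator, and the example values are assumptions; real data would also need the participant, Sex and Age columns carried along):
library(tidyr)
# Assumed wide format: one numeric column per IV1/IV2 combination
wide <- data.frame("IV1.0_IV2.0" = c(5, 6),
                   "IV1.1_IV2.0" = c(7, 8),
                   "IV1.0_IV2.1" = c(4, 9),
                   "IV1.1_IV2.1" = c(6, 3),
                   check.names = FALSE)
# stack() produces columns 'values' and 'ind'; separate() splits 'ind' on "_"
s <- separate(stack(wide), "ind", into = c("IV1", "IV2"), sep = "_")
summary(aov(values ~ IV1 * IV2, data = s))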

How to get rid of circular/cyclic dependencies in TIBCO Spotfire?

I have two tables that are linked via a relation (edit -> data table properties -> relations). One contains some raw data, and the other contains aggregated data (calculation on the value).
You can see some examples below. Here, data are linked on "category" column.
RAW DATA
category | id | value
---------+----+------
A        | 1  | 10
A        | 2  | 20
A        | 3  | 30
A        | 4  | 30
B        | 1  | 20
B        | 2  | 20
COMPUTED DATA
category | any_calculation   // aggregation of raw data based on category
---------+----------------
A        | 10
B        | 20
To do the calculation, I use a R/TERR function that take raw data as an input, and that output computed data.
Then I display raw data in a scatter plot (one per category), and I add a curve that is taken from the column "any_calculation" of the computed data.
My main problem is that my table with computed data isn't filled by the R/TERR script. The cause is, in my opinion, the cyclic dependency between those two tables.
Do you have any idea/workaround/fix ?
I should also add that I can't do the calculation in the scatter plot (huge calculation). I use Spotfire 7.8.0.
It seems that a table can't be modified/edited by different sources, that is to say multiple scripts (R and Python) can't have the same table as an output.
To fix my problem, I created a new table in one of my scripts, and then created a relation between this table and the one produced by the other script.

Issues solving a Regression with numeric and categorical variables in R

I am very new to statistics and R in general, so my question might be a bit dumb, but since I cannot find a solution online I thought I would ask it here.
I have a data frame dataset of a whole lot of different variables very similar to as follows:
Item | Size | Value | Town
-----+------+-------+-----
A    | 10   | 800   | 1
B    | 11   | 100   | 2
A    | 17   | 900   | 2
D    | 13   | 200   | 3
B    | 15   | 500   | 1
C    | 12   | 250   | 3
E    | 14   | NA    | 2
A    |      | 800   | 1
C    |      | 800   | 2
Basically, I have to try and 'guess' the Size based on the type of Item, its Value, and the Town it was sold in, so I think a regression method would be a good idea.
I try and use a polynomial regression (although I'm not even sure if that's correct) to see how that looks by using a function similar to the following:
summary(lm(Size~ polym(factor(Item), Value, factor(Town), degree=2, raw=TRUE), dataset))
But I get this Warning message when I try to do this:
Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
0 (non-NA) cases
In addition: Warning message:
In Ops.factor(X, Y, ...) : ‘^’ not meaningful for factors
Can anyone tell me why this happens? More importantly, is what I've done even correct?
My second question is regarding NA values in a regression. In the dataset above, I have an NA value in the Value Column. From what I understand, R ignores rows which have an NA value in a column. But what if I have a lot of NA values? Also, it seems like a waste of data to automatically eliminate entire rows if there is only one NA value in a column, so I was wondering if there is perhaps a better way of solving or working around this issue. Thanks!
EDIT: I just have one more question: In the regression model I have created it appears there are new 'levels' in the testing data which were not in the training data (e.g. the error says factor(Town) has new levels). What would be the right thing to do for cases such as this?
Yes, follow @RemkoDuursma's suggestion and use lm(Size ~ factor(Item) + factor(Town) + Value, ...), and look into other degrees as well (was there a reason why you chose squared?) by comparing residuals.
As for substituting NA values, you have many options:
substitute them all with the median value of the variable
substitute them all with the mean value of the variable
substitute each with a prediction based on the values of the other variables
Good luck, and next time you might want to check out https://stats.stackexchange.com/!
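A hedged sketch of the suggested model plus a simple median imputation for Value (column names follow the question; the choice of imputation is just one of the options listed above):
# Median imputation for missing Value entries
dataset$Value[is.na(dataset$Value)] <- median(dataset$Value, na.rm = TRUE)
# Additive model with Item and Town treated as factors
fit <- lm(Size ~ factor(Item) + factor(Town) + Value, data = dataset)
summary(fit)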

In R, how to efficiently de-dupe a data.frame while processing the duplicates?

(The title of the question is terrible, I'm sorry. I was having a hard time finding a pithy way to express it.)
I have a "tall" data.frame that I have compiled. It looks like this:
id | rating
---+-------
3  | 5.5
4  | 6
4  | 7
5  | 3
5  | 5
6  | 7.5
7  | 9
...
I want to turn that into this:
id | avg rating
---+-----------
3  | 5.5
4  | 6.5
5  | 4
6  | 7.5
7  | 9
...
I don't just want to remove duplicates. I want to take the rows that share the same id, collapse them into one row, and update the rating field to be their average.
I'm not sure how to go about this. I'm not even sure whether I should be modifying the original data frame or instead creating a new one with the modified data.
(Note: I think a good answer would be a bit agnostic to the specifics of the operation. Like, if I wanted to do something similar but instead have the resulting rating column be a sum or a count, hopefully your answer would apply to those situations as well.)
You also have the option to use SQL, if you are familiar with it.
You will require the sqldf package: library(sqldf)
sqldf("
  select id, avg(rating) `avg_rating`
  from your_data
  group by id
")
A version using dplyr and including a sum example.
library(dplyr)
df %>%
  group_by(id) %>%
  summarize(avg_rating = mean(rating),
            sum_rating = sum(rating))
