Interpreting 'arulesSequences' Results in R

I used 'arulesSequences' in R to analyze users' sequences of actions, and I got results like this:
# | sequence   | support
1 | <{A,B}>    |    0.87
2 | <{A},{B}>  |    0.68
They look like the same sequence. What's the difference?
Does sequence 2 mean that the session has changed?

Sequence 1 means that A and B happened together in the same event (the same transaction). Sequence 2 means that A happened first and B happened in a later event of the same sequence.
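In the sequence notation, items inside one pair of braces belong to the same element (same eventID), while separate brace groups are separate events in temporal order. A minimal sketch using the zaki example data bundled with arulesSequences shows the same notation in practice:

library(arulesSequences)
data(zaki)                                    # small example transaction data shipped with the package
s1 <- cspade(zaki, parameter = list(support = 0.4),
             control = list(verbose = FALSE))
as(s1, "data.frame")                          # sequences print as <{...},{...}> with their support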

Related

Generate variable by country for missing years?

Stata and R:
I have two cross-sectional datasets I'm merging. The two datasets cover the same number of countries, but only one of them has no missing years (year). The problem is that the missing years are simply not recorded, so I thought I needed to add the years where there is no other data. Otherwise, I cannot merge the datasets on the two keys, country and year.
Not so -- at least in Stata (and I would be surprised at a problem in R, but others must speak to that).
Missing observations -- in this context and any similar better called absent -- are not a problem. Here's a demonstration: merge is smart enough to notice the gaps and make them explicit as missings. You could "fix" them yourself ahead of the merge, but that is pointless.
* dataset containing every state-year combination
clear
input state year y
1 2019 1
1 2020 2
2 2019 3
2 2020 4
end
save tomerge

* master dataset where 2020 is absent for both states
clear
input state year x
1 2019 42
2 2019 84
end

merge 1:1 state year using tomerge
list
Results
. merge 1:1 state year using tomerge
    Result                      Number of obs
    -----------------------------------------
    Not matched                             2
        from master                         0  (_merge==1)
        from using                          2  (_merge==2)

    Matched                                 2  (_merge==3)
    -----------------------------------------
.
. list
     +-----------------------------------------+
     | state   year    x   y           _merge  |
     |-----------------------------------------|
  1. |     1   2019   42   1      Matched (3)  |
  2. |     2   2019   84   3      Matched (3)  |
  3. |     1   2020    .   2   Using only (2)  |
  4. |     2   2020    .   4   Using only (2)  |
     +-----------------------------------------+
Otherwise put, the 1:1 syntax specifies the overall pattern and doesn't rule out 0:1 or 1:0 matches. merge will in effect append if identifiers don't match at all. You do need the key variables to exist under identical names in both datasets.
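For the R side, which the answer above does not demonstrate, here is a rough sketch of the same idea with base R's merge(), using made-up data frames mirroring the Stata example:

tomerge <- data.frame(state = c(1, 1, 2, 2),
                      year  = c(2019, 2020, 2019, 2020),
                      y     = 1:4)
master  <- data.frame(state = c(1, 2),
                      year  = c(2019, 2019),
                      x     = c(42, 84))

# all = TRUE keeps unmatched rows from either side and fills the gaps with NA,
# much as Stata flags them with _merge==1 / _merge==2
merge(master, tomerge, by = c("state", "year"), all = TRUE)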

DynamoDB Table/Index Modeling + Querying

Basic requirements:
I have a table with a bunch of attributes (20-30), but only 3 are used in querying: User, Category, and Date. It would be structured something like this...
User | Category | Date | ...
   1 | Red      | 5/15
   1 | Green    | 5/15
   1 | Red      | 5/16
   1 | Green    | 5/16
   2 | Red      | 5/18
   2 | Green    | 5/18
I want to be able to query this table in the following 2 ways:
Most recent rows (based on Date) by User. e.g., User=1 returns the 2 rows from 5/16 (3rd and 4th row)
Most recent rows (based on Date) by User and Category. e.g., User=1, Category=Red returns the 5/16 row only (the 3rd row).
Is the best way to model this with a HASH on User, RANGE on Date, and a GSI with HASH on User+Category and RANGE on Date? Is there anything else that might be more efficient? If that's the path of least resistance, I'd still need to know how many rows to return, which would seem to require a count against distinct categories or something similar.
I've decided that it's going to be easier to just change the way I'm storing the documents: I'll move the category and other attributes into a sub-document so I can easily query against User+Date, and I'll do any User+Category+Date querying with client-side code against the User+Date result set.

Issues solving a Regression with numeric and categorical variables in R

I am very new to statistics and R in general, so my question might be a bit dumb, but since I cannot find a solution online I thought I should try asking it here.
I have a data frame, dataset, with a whole lot of different variables, very similar to the following:
Item | Size | Value | Town
-----+------+-------+-----
 A   |  10  |  800  |  1
 B   |  11  |  100  |  2
 A   |  17  |  900  |  2
 D   |  13  |  200  |  3
 B   |  15  |  500  |  1
 C   |  12  |  250  |  3
 E   |  14  |   NA  |  2
 A   |      |  800  |  1
 C   |      |  800  |  2
Basically, I have to try and 'guess' the Size based on the type of Item, its Value, and the Town it was sold in, so I think a regression method would be a good idea.
I tried a polynomial regression (although I'm not even sure that's the right choice) to see how it looks, with a call similar to the following:
summary(lm(Size ~ polym(factor(Item), Value, factor(Town), degree = 2, raw = TRUE), data = dataset))
But I get this error (and warning) when I run it:
Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
0 (non-NA) cases
In addition: Warning message:
In Ops.factor(X, Y, ...) : ‘^’ not meaningful for factors
Can anyone tell me why this happens? More importantly, is what I've done even correct?
My second question is regarding NA values in a regression. In the dataset above, I have an NA value in the Value Column. From what I understand, R ignores rows which have an NA value in a column. But what if I have a lot of NA values? Also, it seems like a waste of data to automatically eliminate entire rows if there is only one NA value in a column, so I was wondering if there is perhaps a better way of solving or working around this issue. Thanks!
EDIT: I just have one more question: In the regression model I have created it appears there are new 'levels' in the testing data which were not in the training data (e.g. the error says factor(Town) has new levels). What would be the right thing to do for cases such as this?
Yes, follow @RemkoDuursma's suggestion of using lm(Size ~ factor(Item) + factor(Town) + Value, ...), and look into other degrees as well (was there a reason you chose squared?) by comparing residuals.
In regards to substituting NA values, you have many options:
substitute all with median variable value
substitute all with mean variable value
substitute each with prediction based on values of other variables
good luck, and next time you might want to check out https://stats.stackexchange.com/!
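A minimal sketch of that suggestion, assuming dataset has the columns shown in the question (Item, Size, Value, Town):

dataset$Item <- factor(dataset$Item)    # categorical predictors as factors
dataset$Town <- factor(dataset$Town)

# crude example of the median substitution mentioned above
dataset$Value[is.na(dataset$Value)] <- median(dataset$Value, na.rm = TRUE)

fit <- lm(Size ~ Item + Town + Value, data = dataset)   # additive model, degree 1
summary(fit)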

Friedman test unreplicated complete block design error

I'm having trouble running a Friedman test over my data.
I'm trying to run a Friedman test using this command:
friedman.test(mean ~ isi | expId, data=monoSum)
On the following dataset (https://www.dropbox.com/s/2ox0y1b4gwld0ai/monoSum.csv):
> monoSum
   expId isi  N       mean
1  m80B1   1 10 100.000000
2  m80B1   2 10  73.999819
3  m80B1   3 10  45.219362
4  m80B1   4 10 116.566174
...
18 m80L2   2 10  82.945491
19 m80L2   3 10  57.675480
20 m80L2   4 10 207.169277
...
25 m80M2   1 10 100.000000
26 m80M2   2 10  49.752687
27 m80M2   3 10  19.042592
28 m80M2   4 10 150.411035
It gives me back the error:
Error in friedman.test.default(c(100, 73.9998193095267, 45.2193621626293, :
not an unreplicated complete block design
I figure it gives the error because, when monoSum$isi==1, the value of mean is always 100. Is this correct?
However, mean is always 100 when monoSum$isi==1 because that is the control condition to which all the other monoSum$isi groups are normalized. I cannot assume a normal distribution, so I cannot run a repeated-measures ANOVA…
Is there a way to run a Friedman test on these data, or am I missing a very essential point here?
Many thanks in advance!
I don't get an error if I run your dataset:
Friedman rank sum test
data: mean and isi and expId
Friedman chi-squared = 17.9143, df = 3, p-value = 0.0004581
However, you have to make sure that expId and isi are coded as factors. Run these commands:
monoSum$expId <- factor(monoSum$expId)
monoSum$isi <- factor(monoSum$isi)
Then run the test again. This has worked for me with a similar problem.
I know this is pretty old but for future generations (see also: me when I forget and google this again):
You can determine which combinations are missing in your data frame by running table(groups, blocks), or in the case of this question table(monoSum$isi, monoSum$expId). This returns a table of 0s and 1s; the missing records are in the cells with 0s.
I ran into this problem after trying to remove the blocks that had incomplete results; taking a subset of the data did not remove the blocks for some reason.
Just thought I would mention that I found this post because I was getting a similar error message. The above suggestions did not solve it. Strangely, I had to sort my data frame so that, within each block, the groups appeared in the same order (i.e. I could not have the following:
Block 1 A
Block 1 B
Block 2 B
Block 2 A
It has to appear as A, B, A, B)
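A minimal sketch of that sorting step, with hypothetical columns y (response), group, and block:

d <- d[order(d$block, d$group), ]     # same group order within every block
friedman.test(y ~ group | block, data = d)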
I ran into the same cryptic error message in R, though in my case it was resolved when I applied as.matrix() to what was originally a data frame read in from a CSV file with read.csv().
I also had a missing data point in my original data set, and I found that when my data were transformed into a matrix for the friedman.test() call, the entire row containing the missing data point was omitted automatically.
Using as.matrix() to transform my data frame is what got the function to run for me.
I had this exact error too with my dataset.
It turns out that the function friedman.test() accepts data frames (e.g. those created by data.frame()) but not tibbles (those created by dplyr and other modern tools). The solution for me was to convert my dataset to a data frame first.
library(dplyr)  # for select() and the %>% pipe
D_fri <- D_all %>% dplyr::select(FrustrationEpisode, Condition, Participant)
D_fri <- as.data.frame(D_fri)
str(D_fri)      # confirm the object is now a 'data.frame'
friedman.test(FrustrationEpisode ~ Condition | Participant, D_fri)
I ran into this problem too. Fixed mine by removing the NAs.
# My data (called layers) looks like:
| resp.no | av.l.all | av.baseem | av.base |
|       1 |      1.5 |       1.3 |     2.3 |
|       2 |      1.4 |       3.2 |     1.4 |
|       3 |      2.5 |       2.8 |     2.9 |
...
|    1088 |      3.6 |       1.1 |     3.3 |
# Remove NAs
library(tidyr)    # for gather() and the %>% pipe
library(rstatix)  # for convert_as_factor()
layers1 <- na.omit(layers)
# Re-organise the data so the scores are stacked, with a factor column holding the original column name
layers2 <- layers1 %>%
  gather(key = "layertype", value = "score", av.l.all, av.baseem, av.base) %>%
  convert_as_factor(resp.no, layertype)
# Data now looks like this
| resp.no | layertype | score |
|       1 | av.l.all  |   1.5 |
|       1 | av.baseem |   1.3 |
|       1 | av.base   |   2.3 |
|       2 | av.l.all  |   1.4 |
...
|    1088 | av.base   |   3.3 |
# Then do Friedman test
friedman.test(score ~ layertype | resp.no, data = layers2)
Just want to share what my problem was: my ID factor did not have the correct levels after doing pivot_longer(), and because of this the same error was given. I made sure the levels were correct, and it worked, with the following: df$ID <- as.factor(as.character(df$ID))
Reviving an old thread with new information. I ran into a similar problem after removing NAs. My group and block were factors before the NA removal. However, after removing NAs the factors retained the levels from before the removal, even though some levels were no longer in the data!
Running friedman.test() with the as.matrix() trick (e.g., friedman.test(a ~ b | c, as.matrix(df))) was fine, but running frdAllPairsExactTest() or friedman_effsize() would throw the 'not an unreplicated complete block design' error. I ended up re-factoring the group and block (i.e., dropping the levels that were no longer in the data, df$block <- factor(df$block)) to make things work. After the re-factoring, I did not need the as.matrix() trick either.
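For reference, a minimal sketch of that re-factoring step, reusing the hypothetical columns a, b, and c from the example call above:

df <- na.omit(df)        # drop incomplete rows first
df$b <- factor(df$b)     # re-factor the group: unused levels are dropped
df$c <- factor(df$c)     # re-factor the block
friedman.test(a ~ b | c, data = df)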

R heatmap with different color scales for different rows

I am wondering what would be an easy way to produce a heatmap() of composite data that require different scaling for different rows.
In my case the columns represent different events of the same type, and the rows are different observations of these events, which can be binary or continuous data on different scales.
For example:
Event:  ev1 | ev2 | ev3 | ev4 | ev5 | ev6
Obs1:     1 |   0 |   1 |   1 |   0 |   0
Obs2:   5.6 | 0.2 | 4.8 | 7.1 | 0.1 | 0.8
Thanks in advance for hints and help
To have a single heatmap where different rows have been scaled differently would be rather confusing, since you would need a legend for each shading. This would quickly clutter up the plot.
If you don't need a legend, then simply scale each row's values to the range zero to one and you should get what you are after. So for Obs2 you would have something like:
scaled_obs2 <- (Obs2 - min(Obs2)) / (max(Obs2) - min(Obs2))
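A minimal sketch of that row-wise rescaling, assuming the data sit in a numeric matrix m with observations in rows, as in the example:

rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))
m_scaled  <- t(apply(m, 1, rescale01))                  # rescale each row to [0, 1]
heatmap(m_scaled, Rowv = NA, Colv = NA, scale = "none") # scale = "none": values are already rescaled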
