3-way tabulation in R

I have a dataset that looks like
| ID | Category | Failure |
|----+----------+---------|
| 1  | a        | 0       |
| 1  | b        | 0       |
| 1  | b        | 0       |
| 1  | a        | 0       |
| 1  | c        | 0       |
| 1  | d        | 0       |
| 1  | c        | 0       |
| 1  | failure  | 1       |
| 2  | c        | 0       |
| 2  | d        | 0       |
| 2  | d        | 0       |
| 2  | b        | 0       |
This is data where each ID potentially ends in a failure event, after an intermediate sequence of events {a, b, c, d}. I want to count the number of IDs in which each of those intermediate events occurs, broken down by whether the ID ended in failure.
So, I would like a table of the form
|            | a | b | c | d |
|------------+---+---+---+---|
| Failure    | 4 | 5 | 6 | 2 |
| No failure | 9 | 8 | 6 | 9 |
where, for example, the number 4 indicates that 4 of the IDs in which a occurred ended in failure.
How would I go about doing this in R?

You can use table, for example:
dat <- data.frame(categ = sample(letters[1:4], 20, replace = TRUE),
                  failure = sample(c(0, 1), 20, replace = TRUE))
res <- table(dat$failure, dat$categ)
# the first row of the table is failure == 0, so label it 'No failure'
rownames(res) <- c('No failure', 'Failure')
res
             a b c d
  No failure 3 2 2 1
  Failure    1 2 4 5
You can plot it using barplot:
barplot(res)
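By default barplot stacks the two rows on top of each other; a grouped version with a legend may be easier to read (a minimal sketch using only base graphics arguments, where legend.text = TRUE takes the labels from the row names):
barplot(res, beside = TRUE, legend.text = TRUE)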
EDIT To get this by ID, you can use by, for example:
dat <- data.frame(ID = c(rep(1, 9), rep(2, 11)),
                  categ = sample(letters[1:4], 20, replace = TRUE),
                  failure = sample(c(0, 1), 20, replace = TRUE))
by(dat, dat$ID, function(x) table(x$failure, x$categ))
dat$ID: 1
    a b c d
  0 1 2 1 3
  1 1 1 0 0
------------------------------------------------------------
dat$ID: 2
    a b c d
  0 1 2 3 0
  1 1 3 1 0
EDIT Another way to get this is using tapply:
with(dat, tapply(categ, list(failure, categ, ID), length))
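Note that all of the above count rows (events), while the question asks for the number of distinct IDs per category. A minimal base-R sketch of that variant, assuming an ID counts as failed if any of its rows has failure == 1 (the intermediate names are illustrative, not from the answer above):
failed <- unique(dat$ID[dat$failure == 1])   # IDs that ever failed
id_cat <- unique(dat[, c("ID", "categ")])    # one row per ID/category pair
status <- ifelse(id_cat$ID %in% failed, "Failure", "No failure")
table(status, id_cat$categ)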

Related

R, Friedman's test 'not an unreplicated complete block design' error?

I am trying to do a Friedman's test; my data is repeated measures but nonparametric.
The data comes from a csv organized as below, imported with RStudio's import dataset function, so it is a table in RStudio:
score | treatment | day
10    | 1         | 1
20    | 1         | 1
40    | 1         | 1
7     | 2         | 1
100   | 2         | 1
58    | 2         | 1
98    | 3         | 1
89    | 3         | 1
40    | 3         | 1
70    | 4         | 1
10    | 4         | 1
28    | 4         | 1
86    | 5         | 1
200   | 5         | 1
40    | 5         | 1
77    | 1         | 2
100   | 1         | 2
90    | 1         | 2
33    | 2         | 2
15    | 2         | 2
25    | 2         | 2
23    | 3         | 2
54    | 3         | 2
67    | 3         | 2
1     | 4         | 2
2     | 4         | 2
400   | 4         | 2
16    | 5         | 2
10    | 5         | 2
90    | 5         | 2
library(readr)
sample_data$treatment <- as.factor(sample_data$treatment) #setting treatment as categorical independent variable
sample_data$day <- as.factor(sample_data$day) #setting day as categorical independent variable
summary(sample_data)
#attach(sample_data) #not sure if this should be used only because according to https://www.sheffield.ac.uk/polopoly_fs/1.714578!/file/stcp-marquier-FriedmanR.pdf it says to use attach for R to use the variables directly
friedman3 <- friedman.test(y = sample_data$score, groups = sample_data$treatment, blocks = sample_data$day)
summary(friedman3)
I am interested in day and score using Friedman's.
This is the error I get:
Error in friedman.test.default(y = sample_data$score, groups = sample_data$treatment, blocks = sample_data$day) :
  not an unreplicated complete block design
Not sure what is wrong.
Prior to writing the Friedman part of the code, I only specified day and treatment as categorical using as.factor.
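For what it's worth, friedman.test requires an unreplicated complete block design: exactly one observation per treatment/block combination. The data above has three replicate scores per treatment/day cell, which is exactly what triggers this error. A hedged sketch of one common workaround, collapsing the replicates first (here by median; whether that is appropriate depends on the study design):
# one observation per treatment/day cell, e.g. the median of the replicates
agg <- aggregate(score ~ treatment + day, data = sample_data, FUN = median)
friedman.test(score ~ treatment | day, data = agg)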

R how to summarize the % breakdown of a column by other columns

I have a dataframe like this:
VisitID | No_Of_Visits | Store A | Store B | Store C | Store D
A1      | 1            | 1       | 0       | 0       | 0
B1      | 2            | 1       | 0       | 0       | 1
C1      | 4            | 1       | 2       | 1       | 0
D1      | 3            | 2       | 0       | 1       | 0
E1      | 4            | 1       | 1       | 1       | 1
In R, how can I summarize the dataframe to show the % of visits for each store category by visit-count level? Expected result:
| No_Of_Visits | Store A | Store B | Store C | Store D |
| 1            | 100%    | 0%      | 0%      | 0%      |
| 2            | 50%     | 0%      | 0%      | 50%     |
| 3            | 67%     | 0%      | 33%     | 0%      |
| 4            | 25%     | 38%     | 25%     | 13%     |
I'm thinking of group_by(No_Of_Visits) and mutate_all?
We can get the data into long format, calculate the sum for each No_Of_Visits and Store, and then calculate their ratio before reshaping back to wide format.
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(cols = starts_with('Store')) %>%
  group_by(No_Of_Visits, name) %>%
  summarise(value = sum(value)) %>%
  mutate(value = round(value/sum(value) * 100, 2)) %>%
  pivot_wider()
#  No_Of_Visits Store.A Store.B Store.C Store.D
#         <int>   <dbl>   <dbl>   <dbl>   <dbl>
#1            1     100       0       0       0
#2            2      50       0       0      50
#3            3    66.7       0    33.3       0
#4            4      25    37.5      25    12.5
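If you need the literal percent signs from the expected output, a small variation of the same pipeline works (a sketch; it turns the store columns into character):
df %>%
  pivot_longer(cols = starts_with('Store')) %>%
  group_by(No_Of_Visits, name) %>%
  summarise(value = sum(value)) %>%
  mutate(value = paste0(round(value/sum(value) * 100), '%')) %>%
  pivot_wider()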

Subtract column values using coalesce

I want to subtract successive values in the "place" column within each "race", "bib" group, ordered by "split", so that a "diff" column appears like so.
Desired Output:
race | bib | split | place | diff
----------------------------------
10 | 514 | 1 | 5 | 0
10 | 514 | 2 | 3 | 2
10 | 514 | 3 | 2 | 1
10 | 17 | 1 | 8 | 0
10 | 17 | 2 | 12 | -4
10 | 17 | 3 | 15 | -3
I'm new to using the coalesce statement, and the closest I have come to the desired output is the following:
select a.race, a.bib, a.split, a.place,
       coalesce(a.place -
                (select b.place from ranking b where b.split < a.split),
                a.place) as diff
from ranking a
group by race, bib, split
which produces:
race | bib | split | place | diff
----------------------------------
10 | 514 | 1 | 5 | 5
10 | 514 | 2 | 3 | 2
10 | 514 | 3 | 2 | 1
10 | 17 | 1 | 8 | 8
10 | 17 | 2 | 12 | 11
10 | 17 | 3 | 15 | 14
Thanks for looking!
To compute the difference, you have to look up the value in the row that has the same race and bib values, and the next-smaller split value:
SELECT race, bib, split, place,
       coalesce((SELECT r2.place
                 FROM ranking AS r2
                 WHERE r2.race = ranking.race
                   AND r2.bib = ranking.bib
                   AND r2.split < ranking.split
                 ORDER BY r2.split DESC
                 LIMIT 1
                ) - place,
                0) AS diff
FROM ranking;

Cross-table for subset in R

I have the following data frame (simplified):
IPET Task Type
1 1 1
1 2 2
1 3 1
2 1 1
2 1 2
How can I create a cross table (using the CrossTable function in gmodels, because I need to do a chi-square test), but only for rows where Type equals 1?
You probably want this.
library(gmodels)
with(df.1[df.1$Type==1, ], CrossTable(IPET, Task))
Yielding
   Cell Contents
|-------------------------|
|                       N |
| Chi-square contribution |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

Total Observations in Table:  3

             | Task
        IPET |         1 |         3 | Row Total |
-------------|-----------|-----------|-----------|
           1 |         1 |         1 |         2 |
             |     0.083 |     0.167 |           |
             |     0.500 |     0.500 |     0.667 |
             |     0.500 |     1.000 |           |
             |     0.333 |     0.333 |           |
-------------|-----------|-----------|-----------|
           2 |         1 |         0 |         1 |
             |     0.167 |     0.333 |           |
             |     1.000 |     0.000 |     0.333 |
             |     0.500 |     0.000 |           |
             |     0.333 |     0.000 |           |
-------------|-----------|-----------|-----------|
Column Total |         2 |         1 |         3 |
             |     0.667 |     0.333 |           |
-------------|-----------|-----------|-----------|
Data
df.1 <- read.table(header=TRUE, text="IPET Task Type
1 1 1
1 2 2
1 3 1
2 1 1
2 1 2")

update with query of multiple fields from various tables

I have the following tables:
book_tbl:
book_instance_id | book_type_id | library_instance_id | location_id | book_index
1                | 70000        | 2                   | 0           | 1
2                | 70000        | 2                   | 0           | 2
3                | 70000        | 2                   | 0           | 3
4                | 70000        | 3                   | 0           | 1
5                | 70000        | 3                   | 0           | 2
6                | 70000        | 3                   | 0           | 3
7                | 70000        | 4                   | 1           | 1
8                | 70000        | 4                   | 1           | 2
9                | 70000        | 4                   | 1           | 3
and library_tbl:
library_instance_id | library_type_id | location_id
2                   | 1000            | 0
3                   | 1001            | 0
4                   | 1000            | 1
I would like to update the field book_type_id in book_tbl only for the first element (book_index = 1) of each library with library_type_id 1000.
To retrieve this information I used the following SQLite query:
SELECT *
FROM (SELECT *
      FROM library_tbl
      WHERE library_type_id = 1000) t1
     JOIN book_tbl t2 ON t1.location_id = t2.location_id
                     AND t1.library_instance_id = t2.library_instance_id
                     AND book_index = 1
How could I combine the query above with an UPDATE statement to update rows 1 and 7?
UPDATE book_tbl SET book_type_id=15000 WHERE ????
Use EXISTS with a correlated subquery to check whether the corresponding library row exists:
UPDATE book_tbl
SET book_type_id = 15000
WHERE EXISTS (SELECT 1
              FROM library_tbl
              WHERE library_type_id = 1000
                AND location_id = book_tbl.location_id
                AND library_instance_id = book_tbl.library_instance_id)
  AND book_index = 1;
