How to summarize the % breakdown of a column by other columns in R

I have a dataframe like this:
VisitID | No_Of_Visits | Store A | Store B | Store C | Store D
--------|--------------|---------|---------|---------|--------
A1      | 1            | 1       | 0       | 0       | 0
B1      | 2            | 1       | 0       | 0       | 1
C1      | 4            | 1       | 2       | 1       | 0
D1      | 3            | 2       | 0       | 1       | 0
E1      | 4            | 1       | 1       | 1       | 1
In R, how can I summarize the data frame to show the % of visits to each store category by visit-count level? Expected result:
No_Of_Visits | Store A | Store B | Store C | Store D
-------------|---------|---------|---------|--------
1            | 100%    | 0%      | 0%      | 0%
2            | 50%     | 0%      | 0%      | 50%
3            | 67%     | 0%      | 33%     | 0%
4            | 25%     | 38%     | 25%     | 13%
I'm thinking of group_by(No_Of_Visits) and mutate_all?

We can get the data in long format, calculate the sum for each No_Of_Visits and Store combination, compute each store's share of that total, and then reshape back to wide format.
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(cols = starts_with('Store')) %>%
  group_by(No_Of_Visits, name) %>%
  summarise(value = sum(value)) %>%
  mutate(value = round(value / sum(value) * 100, 2)) %>%
  pivot_wider()
# No_Of_Visits Store.A Store.B Store.C Store.D
# <int> <dbl> <dbl> <dbl> <dbl>
#1 1 100 0 0 0
#2 2 50 0 0 50
#3 3 66.7 0 33.3 0
#4 4 25 37.5 25 12.5
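For reference, the group_by() idea from the question can also work without pivoting; a minimal sketch, assuming the store columns are exactly those matched by starts_with('Store') (across() is the current replacement for mutate_all):
library(dplyr)
df %>%
  group_by(No_Of_Visits) %>%
  summarise(across(starts_with('Store'), sum), .groups = 'drop') %>%
  # compute the row total first so every percentage shares one denominator
  mutate(total = rowSums(across(starts_with('Store')))) %>%
  mutate(across(starts_with('Store'), ~ round(.x / total * 100, 2))) %>%
  select(-total)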

Skipping number of rows after a certain value in R

I have data that looks like the table below. I would like to skip 2 rows after the last occurrence of certain types (3 and 4). For example, I have two 4s in my table, but I only need to remove the 2 rows after the second 4. Same for 3: I only need to remove the 2 rows after the third 3.
| grade | type |
|-------|------|
| 93    | 2    |
| 90    | 2    |
| 54    | 2    |
| 36    | 4    |
| 31    | 4    |
| 94    | 1    |
| 57    | 1    |
| 16    | 3    |
| 11    | 3    |
| 12    | 3    |
| 99    | 1    |
| 99    | 1    |
| 9     | 3    |
| 10    | 3    |
| 97    | 1    |
| 96    | 1    |
The desired output would be:
| grade | type |
|-------|------|
| 93    | 2    |
| 90    | 2    |
| 54    | 2    |
| 36    | 4    |
| 31    | 4    |
| 16    | 3    |
| 11    | 3    |
| 12    | 3    |
| 9     | 3    |
| 10    | 3    |
Here is the code of my example:
data <- data.frame(grade = c(93, 90, 54, 36, 31, 94, 57, 16, 11, 12, 99, 99, 9, 10, 97, 96),
                   type  = c(2, 2, 2, 4, 4, 1, 1, 3, 3, 3, 1, 1, 3, 3, 1, 1))
Could anyone give me some hints on how to approach this in R? Thanks a bunch in advance for your help and your time!
# drop the two rows after the last 3 and the two rows after the last 4
data[-c(max(which(data$type == 3)) + 1:2, max(which(data$type == 4)) + 1:2), ]
# grade type
# 1 93 2
# 2 90 2
# 3 54 2
# 4 36 4
# 5 31 4
# 8 16 3
# 9 11 3
# 10 12 3
Using some indexing:
data[-(nrow(data) - match(c(3,4), rev(data$type)) + 1 + rep(1:2, each=2)),]
# grade type
#1 93 2
#2 90 2
#3 54 2
#4 36 4
#5 31 4
#8 16 3
#9 11 3
#10 12 3
Or more generically:
vals <- c(3,4)
data[-(nrow(data) - match(vals, rev(data$type)) + 1 + rep(1:2, each=length(vals))),]
The logic is to match the first instance of each value in the reversed column (i.e., its last occurrence in the original), convert that back to an original row index, add 1 and 2 to those indices, and drop the resulting rows.
Similar to Ric's answer, but I find it a bit easier to read (way more verbose, though):
library(dplyr)
idx <- data %>%
  mutate(id = row_number()) %>%
  filter(type %in% 3:4) %>%
  group_by(type) %>%
  filter(id == max(id)) %>%
  pull(id)
data[-c(idx + 1, idx + 2), ]
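The answers all drop the same four rows; a quick sanity check (my addition, using only the example data from above) confirms the first two agree:
a <- data[-c(max(which(data$type == 3)) + 1:2, max(which(data$type == 4)) + 1:2), ]
b <- data[-(nrow(data) - match(c(3, 4), rev(data$type)) + 1 + rep(1:2, each = 2)), ]
identical(a, b)
# TRUE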

Subtract column values using coalesce

For each "race"/"bib" group, ordered by "split", I want to subtract each row's "place" value from the previous row's, so that a "diff" column appears like so.
Desired Output:
race | bib | split | place | diff
----------------------------------
10 | 514 | 1 | 5 | 0
10 | 514 | 2 | 3 | 2
10 | 514 | 3 | 2 | 1
10 | 17 | 1 | 8 | 0
10 | 17 | 2 | 12 | -4
10 | 17 | 3 | 15 | -3
I'm new to using the coalesce statement, and the closest I have come to the desired output is the following:
select a.race, a.bib, a.split, a.place,
       coalesce(a.place -
                (select b.place from ranking b where b.split < a.split),
                a.place) as diff
from ranking a
group by race, bib, split
which produces:
race | bib | split | place | diff
----------------------------------
10 | 514 | 1 | 5 | 5
10 | 514 | 2 | 3 | 2
10 | 514 | 3 | 2 | 1
10 | 17 | 1 | 8 | 8
10 | 17 | 2 | 12 | 11
10 | 17 | 3 | 15 | 14
Thanks for looking!
To compute the difference, you have to look up the value in the row that has the same race and bib values, and the next-smaller split value:
SELECT race, bib, split, place,
       coalesce((SELECT r2.place
                 FROM ranking AS r2
                 WHERE r2.race = ranking.race
                   AND r2.bib = ranking.bib
                   AND r2.split < ranking.split
                 ORDER BY r2.split DESC
                 LIMIT 1
                ) - place,
                0) AS diff
FROM ranking;
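Not part of the original answer, but on engines with window functions (SQLite 3.25+, PostgreSQL, MySQL 8+), LAG() expresses the same "previous split" lookup more directly:
-- lag(place) pulls place from the previous split within the same race/bib
SELECT race, bib, split, place,
       coalesce(lag(place) OVER (PARTITION BY race, bib ORDER BY split) - place,
                0) AS diff
FROM ranking;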

Cross-table for subset in R

I have the following data frame (simplified):
IPET Task Type
1 1 1
1 2 2
1 3 1
2 1 1
2 1 2
How can I create a cross table (using the CrossTable function in gmodels, because I need to do a chi-square test), but only for rows where Type equals 1?
You probably want this.
library(gmodels)
with(df.1[df.1$Type==1, ], CrossTable(IPET, Task))
Yielding
Cell Contents
|-------------------------|
| N |
| Chi-square contribution |
| N / Row Total |
| N / Col Total |
| N / Table Total |
|-------------------------|
Total Observations in Table: 3
| Task
IPET | 1 | 3 | Row Total |
-------------|-----------|-----------|-----------|
1 | 1 | 1 | 2 |
| 0.083 | 0.167 | |
| 0.500 | 0.500 | 0.667 |
| 0.500 | 1.000 | |
| 0.333 | 0.333 | |
-------------|-----------|-----------|-----------|
2 | 1 | 0 | 1 |
| 0.167 | 0.333 | |
| 1.000 | 0.000 | 0.333 |
| 0.500 | 0.000 | |
| 0.333 | 0.000 | |
-------------|-----------|-----------|-----------|
Column Total | 2 | 1 | 3 |
| 0.667 | 0.333 | |
-------------|-----------|-----------|-----------|
Data
df.1 <- read.table(header=TRUE, text="IPET Task Type
1 1 1
1 2 2
1 3 1
2 1 1
2 1 2")
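As a small variation (my addition, not from the answer above), the same row filter can be written with subset(), which some find more readable:
library(gmodels)
# identical cross table, filtering rows via subset() instead of bracket indexing
with(subset(df.1, Type == 1), CrossTable(IPET, Task))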

R data.table apply increasing values to specific row indices

My data is like this:
Time | State | Event
01 | 0 |
02 | 0 |
03 | 0 |
04 | 2 | A_start
05 | 2 |
06 | 2 |
07 | 2 |
08 | 2 |
09 | 1 | A_end
10 | 1 |
11 | 1 |
12 | 1 |
13 | 1 |
14 | 2 | B_start
15 | 2 |
16 | 2 |
17 | 2 |
18 | 2 |
19 | 0 | B_end
20 | 0 |
21 | 0 |
22 | 0 |
23 | 0 |
24 | 2 | A_start
25 | 2 |
26 | 2 |
27 | 2 |
28 | 2 |
29 | 2 |
30 | 2 |
31 | 1 | A_end
32 | 1 |
33 | 1 |
34 | 1 |
35 | 1 |
36 | 1 |
37 | 2 | B_start
38 | 2 |
39 | 2 |
40 | 2 |
The cycle can repeat with any number of 0s, 1s and 2s in between. Sometimes, 0s, 1s or 2s can be missing entirely. I want to get the difference in the Time column between every A_start and the A_end immediately after it. Similarly, I want the difference in Time between every B_start and the B_end that immediately follows.
For this, I thought it would help if I made a "group" for each cycle, as follows:
Time | State | Event | Group
01 | 0 | |
02 | 0 | |
03 | 0 | |
04 | 2 | A_start | 1
05 | 2 | |
06 | 2 | |
07 | 2 | |
08 | 2 | |
09 | 1 | A_end | 1
10 | 1 | |
11 | 1 | |
12 | 1 | |
13 | 1 | |
14 | 2 | B_start | 1
15 | 2 | |
16 | 2 | |
17 | 2 | |
18 | 2 | |
19 | 0 | B_end | 1
20 | 0 | |
21 | 0 | |
22 | 0 | |
23 | 0 | |
24 | 2 | A_start | 2
25 | 2 | |
26 | 2 | |
27 | 2 | |
28 | 2 | |
29 | 2 | |
30 | 2 | |
31 | 1 | A_end | 2
32 | 1 | |
33 | 1 | |
34 | 1 | |
35 | 1 | |
36 | 1 | |
37 | 2 | B_start | 2
38 | 2 | |
39 | 2 | |
40 | 2 | |
However, because there are sometimes missing values in the State column, this isn't working out too well.
The correct cycle sequence is 0 -> 2 -> 1 -> 2 -> 0. Sometimes, a cycle may miss a 2 and be like this: 0 -> 1 -> 2 -> 0. Various combinations of the cycle 0 -> 2 -> 1 -> 2 -> 0 are possible (44 in total). How should I go about this?
Here is a base solution:
# identify the times where there is a change in the State
timeWithChanges <- which(abs(diff(dat$State)) > 0) + 1

# pivot those times into an m x 2 matrix of (start, end) pairs
startEnd <- matrix(dat$Time[timeWithChanges], ncol = 2, byrow = TRUE)

# calculate the time difference and label the pairs A, B
data.frame(AB = rep(c("A", "B"), nrow(startEnd) / 2),
           TimeDiff = startEnd[, 2] - startEnd[, 1])
Please let me know if this works generally enough for you.
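Since the question mentions data.table, here is a rough data.table sketch of the same change-point idea; like the base answer, it assumes the changes always come in alternating (start, end) pairs. rleid() numbers each run of identical State values:
library(data.table)
dt <- as.data.table(dat)
# first Time of each run of identical States; dropping the initial run
# leaves the change points, in order
chg <- dt[, .(Time = first(Time)), by = rleid(State)][-1, Time]
# consecutive change points pair up as (start, end); label the pairs A, B
data.table(AB = rep(c("A", "B"), length.out = length(chg) / 2),
           TimeDiff = chg[c(FALSE, TRUE)] - chg[c(TRUE, FALSE)])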
data:
dat <- read.table(text="Time | State
01 | 0
02 | 0
03 | 0
04 | 2
05 | 2
06 | 2
07 | 2
08 | 2
09 | 1
10 | 1
11 | 1
12 | 1
13 | 1
14 | 2
15 | 2
16 | 2
17 | 2
18 | 2
19 | 0
20 | 0
21 | 0
22 | 0
23 | 0
24 | 2
25 | 2
26 | 2
27 | 2
28 | 2
29 | 2
30 | 2
31 | 1
32 | 1
33 | 1
34 | 1
35 | 1
36 | 1
37 | 2
38 | 2
39 | 2
40 | 2
41 | 0", sep="|", header=TRUE)

3-way tabulation in R

I have a dataset that looks like
| ID | Category | Failure |
|----+----------+---------|
| 1 | a | 0 |
| 1 | b | 0 |
| 1 | b | 0 |
| 1 | a | 0 |
| 1 | c | 0 |
| 1 | d | 0 |
| 1 | c | 0 |
| 1 | failure | 1 |
| 2 | c | 0 |
| 2 | d | 0 |
| 2 | d | 0 |
| 2 | b | 0 |
This is data where each ID potentially ends in a failure event, reached through an intermediate sequence of events {a, b, c, d}. I want to count the number of IDs in which each of those intermediate events occurs, broken down by failure outcome.
So, I would like a table of the form
| | a | b | c | d |
|------------+---+---+---+---|
| Failure | 4 | 5 | 6 | 2 |
| No failure | 9 | 8 | 6 | 9 |
where, for example, the number 4 indicates that 4 of the IDs in which a occurred ended in failure.
How would I go about doing this in R?
You can use table, for example:
dat <- data.frame(categ = sample(letters[1:4], 20, rep = TRUE),
                  failure = sample(c(0, 1), 20, rep = TRUE))
# table() orders rows 0 then 1, so reverse them to put failures first
res <- table(dat$failure, dat$categ)[2:1, ]
rownames(res) <- c('Failure', 'No failure')
res
a b c d
Failure 3 2 2 1
No failure 1 2 4 5
You can plot it using barplot:
barplot(res)
EDIT: to get this by ID, you can use by, for example:
dat <- data.frame(ID = c(rep(1, 9), rep(2, 11)),
                  categ = sample(letters[1:4], 20, rep = TRUE),
                  failure = sample(c(0, 1), 20, rep = TRUE))
by(dat, dat$ID, function(x) table(x$failure, x$categ))
dat$ID: 1
a b c d
0 1 2 1 3
1 1 1 0 0
---------------------------------------------------------------------------------------
dat$ID: 2
a b c d
0 1 2 3 0
1 1 3 1 0
EDIT: using tapply
Another way to get this is using tapply:
with(dat, tapply(categ, list(failure, categ, ID), length))
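A further option (my addition): xtabs() produces the full 3-way contingency table from a formula in one call:
# failure x categ x ID: one 2-D slice per ID, same counts as the tapply above
xtabs(~ failure + categ + ID, data = dat)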
