Easy way to convert long to wide format with counts [duplicate] - r

This question already has answers here:
Faster ways to calculate frequencies and cast from long to wide
(4 answers)
Closed 4 years ago.
I have the following data set:
sample.data <- data.frame(Step = c(1,2,3,4,1,2,1,2,3,1,1),
Case = c(1,1,1,1,2,2,3,3,3,4,5),
Decision = c("Referred","Referred","Referred","Approved","Referred","Declined","Referred","Referred","Declined","Approved","Declined"))
sample.data
Step Case Decision
1 1 1 Referred
2 2 1 Referred
3 3 1 Referred
4 4 1 Approved
5 1 2 Referred
6 2 2 Declined
7 1 3 Referred
8 2 3 Referred
9 3 3 Declined
10 1 4 Approved
11 1 5 Declined
Is it possible in R to translate this into a wide table format, with the decisions on the header, and the value of each cell being the count of the occurrence, for example:
Case Referred Approved Declined
1 3 1 0
2 1 0 1
3 2 0 1
4 0 1 0
5 0 0 1

The aggregation parameter of the dcast function in the reshape2 package defaults to length (= count). The data.table package implements an improved version of the dcast function. So in your case this would be:
library('reshape2') # or library('data.table')
newdf <- dcast(sample.data, Case ~ Decision)
or with using the parameters explicitly:
newdf <- dcast(sample.data, Case ~ Decision,
value.var = "Decision", fun.aggregate = length)
This gives the following dataframe:
> newdf
Case Approved Declined Referred
1 1 1 0 3
2 2 0 1 1
3 3 0 1 2
4 4 1 0 0
5 5 0 1 0
If you don't specify an aggregation function, you get a warning telling you that dcast is using length as a default.

You can accomplish this with a simple table() statement. You can set the factor levels to order the columns the way you want.
sample.data$Decision <- factor(x = sample.data$Decision,
levels = c("Referred","Approved","Declined"))
table(Case = sample.data$Case, sample.data$Decision)
Case Referred Approved Declined
1 3 1 0
2 1 0 1
3 2 0 1
4 0 1 0
5 0 0 1

Here's a dplyr + tidyr approach:
if (!require("pacman")) install.packages("pacman")
pacman::p_load(dplyr, tidyr)
sample.data %>%
count(Case, Decision) %>%
spread(Decision, n, fill = 0)
## Case Approved Declined Referred
## (dbl) (dbl) (dbl) (dbl)
## 1 1 1 0 3
## 2 2 0 1 1
## 3 3 0 1 2
## 4 4 1 0 0
## 5 5 0 1 0

We can use the base R xtabs
xtabs(Step ~ Case + Decision, transform(sample.data, Step = 1))
# Decision
# Case Approved Declined Referred
# 1 1 0 3
# 2 0 1 1
# 3 0 1 2
# 4 1 0 0
# 5 0 1 0
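Since these answers were written, tidyr has superseded spread() with pivot_wider(); a minimal sketch of the same count-then-widen approach (assuming tidyr >= 1.0.0):

```r
library(dplyr)
library(tidyr)

sample.data <- data.frame(Step = c(1,2,3,4,1,2,1,2,3,1,1),
                          Case = c(1,1,1,1,2,2,3,3,3,4,5),
                          Decision = c("Referred","Referred","Referred","Approved",
                                       "Referred","Declined","Referred","Referred",
                                       "Declined","Approved","Declined"))

# count each Case/Decision pair, then spread Decision into columns,
# filling combinations that never occur with 0
wide <- sample.data %>%
  count(Case, Decision) %>%
  pivot_wider(names_from = Decision, values_from = n, values_fill = 0)
```

Here values_fill = 0 plays the role of fill = 0 in spread().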

Related

Create variable that flags an ID if it has existed in any previous month

I am unsure of how to create a variable that flags an ID in the current month if the ID has existed in any previous month.
Example data:
ID <- c(1,2,3,2,3,4,1,5)
Month <- c(1,1,1,2,2,2,3,3)
Flag <- c(0,0,0,1,1,0,1,0)
have <- cbind(ID, Month)
> have
ID Month
1 1
2 1
3 1
2 2
3 2
4 2
1 3
5 3
want:
> want
ID Month Flag
1 1 0
2 1 0
3 1 0
2 2 1
3 2 1
4 2 0
1 3 1
5 3 0
a data.table approach
library(data.table)
# set to data.table format
DT <- as.data.table(have)
# initialise Signal column
DT[, Signal := 0]
# flag repeated IDs with a 1; duplicated() is evaluated over the whole
# table, so any earlier occurrence (i.e. a previous month) counts
DT[duplicated(ID), Signal := 1][]
ID Month Signal
1: 1 1 0
2: 2 1 0
3: 3 1 0
4: 2 2 1
5: 3 2 1
6: 4 2 0
7: 1 3 1
8: 5 3 0
The idea comes from akrun in the comments. Here is the dplyr application:
First use as_tibble to bring the matrix into tibble format,
then use an ifelse statement with duplicated, as @akrun suggested.
library(tibble)
library(dplyr)
have %>%
as_tibble() %>%
mutate(flag = ifelse(duplicated(ID), 1, 0))
ID Month flag
<dbl> <dbl> <dbl>
1 1 1 0
2 2 1 0
3 3 1 0
4 2 2 1
5 3 2 1
6 4 2 0
7 1 3 1
8 5 3 0
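The duplicated() idea also works without any packages; a base R sketch, assuming the rows are already ordered by Month (note it would also flag a repeat within the same month):

```r
ID <- c(1,2,3,2,3,4,1,5)
Month <- c(1,1,1,2,2,2,3,3)
want <- data.frame(ID, Month)

# duplicated() marks every occurrence of an ID after its first, which,
# with rows sorted by Month, is exactly "seen in a previous month"
want$Flag <- as.integer(duplicated(want$ID))
```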

Add a column that divides another column into n chunks, R

There's no easy way to describe my question; that's probably why I was not able to find an answer through search.
I have a data frame with 3 columns: Subject number, Correctness, and Block. There are 2 participants, each exposed to 2 blocks of 3 stimuli each.
subj corr block
1 1 1 1
2 1 0 1
3 1 1 1
4 1 1 2
5 1 1 2
6 1 1 2
7 2 0 1
8 2 1 1
9 2 1 1
10 2 0 2
11 2 1 2
12 2 1 2
What I want to do is create another column that, for a given subj, divides that subject's rows into 3 even chunks (the original df has 2 chunks). In general, I want to know how to divide the stimuli each subj is exposed to into N chunks and write the chunk number into another column.
subj corr block newblock
1 1 1 1 1
2 1 0 1 1
3 1 1 1 2
4 1 1 2 2
5 1 1 2 3
6 1 1 2 3
7 2 0 1 1
8 2 1 1 1
9 2 1 1 2
10 2 0 2 2
11 2 1 2 3
12 2 1 2 3
Something like this:
library(dplyr)
n_chunks = 3
df %>%
group_by(subj) %>%
mutate(newblock = rep(1:n_chunks, each = ceiling(n() / n_chunks))[1:n()])
How much of this is necessary depends on your use case. If you can guarantee that n_chunks evenly divides the number of observations for each subject you can simplify to:
df %>%
group_by(subj) %>%
mutate(newblock = rep(1:n_chunks, each = n() / n_chunks))
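The same chunking can be sketched in base R with ave(), again assuming n_chunks evenly divides each subject's row count:

```r
df <- data.frame(subj = rep(1:2, each = 6),
                 corr = c(1,0,1,1,1,1, 0,1,1,0,1,1),
                 block = rep(rep(1:2, each = 3), 2))
n_chunks <- 3

# within each subject, number the rows 1..n and map them onto
# n_chunks consecutive, evenly sized chunks
df$newblock <- ave(seq_len(nrow(df)), df$subj,
                   FUN = function(i) ceiling(seq_along(i) / (length(i) / n_chunks)))
```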

Carry Forward First Observation for a Variable For Each Patient

My dataset has 3 variables:
Patient ID Outcome Duration
1 1 3
1 0 4
1 0 5
2 0 2
3 1 1
3 1 2
What I want is the first observation for "Duration" for each patient ID to be carried forward.
That is, for patient #1 I want Duration to read 3, 3, 3; for patient #3 I want Duration to read 1, 1.
Here is one way with data.table. You take the first number in Duration and ask R to repeat it for each PatientID.
mydf <- read.table(text = "PatientID Outcome Duration
1 1 3
1 0 4
1 0 5
2 0 2
3 1 1
3 1 2", header = T)
library(data.table)
setDT(mydf)[, Duration := Duration[1L], by = PatientID]
print(mydf)
# PatientID Outcome Duration
#1: 1 1 3
#2: 1 0 3
#3: 1 0 3
#4: 2 0 2
#5: 3 1 1
#6: 3 1 1
This is a good job for dplyr (a wicked-better successor to plyr for data frames, with far better syntax than data.table):
library(dplyr)
dat %>%
group_by(`Patient ID`) %>%
mutate(Duration=first(Duration))
## Source: local data frame [6 x 3]
## Groups: Patient ID
##
## Patient ID Outcome Duration
## 1 1 1 3
## 2 1 0 3
## 3 1 0 3
## 4 2 0 2
## 5 3 1 1
## 6 3 1 1
Another alternative using plyr (if you will be doing lots of operations on your dataframe though, and particularly if it's big, I recommend data.table. It has a steeper learning curve but well worth it).
library(plyr)
ddply(mydf, .(PatientID), transform, Duration = Duration[1])
#   PatientID Outcome Duration
# 1         1       1        3
# 2         1       0        3
# 3         1       0        3
# 4         2       0        2
# 5         3       1        1
# 6         3       1        1
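For completeness, the same per-patient carry-forward can be done in base R with ave(), no packages required:

```r
mydf <- data.frame(PatientID = c(1,1,1,2,3,3),
                   Outcome   = c(1,0,0,0,1,1),
                   Duration  = c(3,4,5,2,1,2))

# replace every Duration within a patient by that patient's first value
mydf$Duration <- ave(mydf$Duration, mydf$PatientID, FUN = function(d) d[1])
```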

R Counting duplicate values and adding them to separate vectors [duplicate]

This question already has answers here:
Faster ways to calculate frequencies and cast from long to wide
(4 answers)
How do I get a contingency table?
(6 answers)
Closed 4 years ago.
x <- c(1,1,1,2,3,3,4,4,4,5,6,6,6,6,6,7,7,8,8,8,8)
y <- c('A','A','C','A','B','B','A','C','C','B','A','A','C','C','B','A','C','A','A','A','B')
X <- data.frame(x,y)
Above I have a data frame where I want to identify the duplicates in vector x, while counting the number of duplicate instances for each (x, y) pair.
For example I have found that ddply and this post here is similar to what I am looking for (Find how many times duplicated rows repeat in R data frame).
library(plyr)
ddply(X, .(x,y), nrow)
This counts the number of times each (x, y) pair occurs; for example, (1, A) occurs 2 times. However, I am looking for R to return each unique identifier in vector x together with the counted number of times it matches each value in column y (getting rid of vector y if necessary), like below:
x A B C
1 2 0 1
2 1 0 0
3 0 2 0
4 1 0 2
5 0 1 0
6 2 1 2
Any help will be appreciated, thanks
You just need the table function :)
> table(X)
y
x A B C
1 2 0 1
2 1 0 0
3 0 2 0
4 1 0 2
5 0 1 0
6 2 1 2
7 1 0 1
8 3 1 0
This is fairly straightforward by casting your data.frame.
require(reshape2)
dcast(X, x ~ y, fun.aggregate=length)
Or if you'd want things to be faster (say working on large data), then you can use the newly implemented dcast.data.table function from data.table package:
require(data.table) ## >= 1.9.0
setDT(X) ## convert data.frame to data.table by reference
dcast.data.table(X, x ~ y, fun.aggregate=length)
Both result in:
x A B C
1: 1 2 0 1
2: 2 1 0 0
3: 3 0 2 0
4: 4 1 0 2
5: 5 0 1 0
6: 6 2 1 2
7: 7 1 0 1
8: 8 3 1 0

In R, how can I make a running count of runs?

Suppose I have an R dataframe that looks like this, where end.group signifies the end of a unique group of observations:
x <- data.frame(end.group=c(0,0,1,0,0,1,1,0,0,0,1,1,1,0,1))
I want to return the following, where group.count is a running count of the number of observations in a group, and group is a unique identifier for each group, in number order. Can anyone help me with a piece of R code to do this?
end.group group.count group
0 1 1
0 2 1
1 3 1
0 1 2
0 2 2
1 3 2
1 1 3
0 1 4
0 2 4
0 3 4
1 4 4
1 1 5
1 1 6
0 1 7
1 2 7
You can create group using cumsum and rev. You need rev because end.group marks the end points of the groups.
x <- data.frame(end.group=c(0,0,1,0,0,1,1,0,0,0,1,1,1,0,1))
# create groups
x$group <- rev(cumsum(rev(x$end.group)))
# re-number groups from smallest to largest
x$group <- abs(x$group-max(x$group)-1)
Now you can use ave to create group.count.
x$group.count <- ave(x$end.group, x$group, FUN=seq_along)
x <- data.frame(end.group=c(0,0,1,0,0,1,1,0,0,0,1,1,1,0,1))
ends <- which(as.logical(x$end.group))
ends2 <- c(ends[1],diff(ends))
transform(x, group.count = unlist(sapply(ends2, seq)), group = rep(seq(length(ends)), times = ends2))
end.group group.count group
1 0 1 1
2 0 2 1
3 1 3 1
4 0 1 2
5 0 2 2
6 1 3 2
7 1 1 3
8 0 1 4
9 0 2 4
10 0 3 4
11 1 4 4
12 1 1 5
13 1 1 6
14 0 1 7
15 1 2 7
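A shift-based variant of the cumsum idea avoids rev(): a new group starts at row 1 and immediately after every end marker, so cumsum over the lagged end.group numbers the groups in order directly. A sketch:

```r
x <- data.frame(end.group = c(0,0,1,0,0,1,1,0,0,0,1,1,1,0,1))

# lag end.group by one row: each 1 opens a new group on the *next* row
x$group <- cumsum(c(1, head(x$end.group, -1)))
# running count of rows within each group
x$group.count <- ave(x$end.group, x$group, FUN = seq_along)
```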
