R Counting duplicate values and adding them to separate vectors [duplicate]

This question already has answers here:
Faster ways to calculate frequencies and cast from long to wide
(4 answers)
How do I get a contingency table?
(6 answers)
Closed 4 years ago.
x <- c(1,1,1,2,3,3,4,4,4,5,6,6,6,6,6,7,7,8,8,8,8)
y <- c('A','A','C','A','B','B','A','C','C','B','A','A','C','C','B','A','C','A','A','A','B')
X <- data.frame(x,y)
Above I have a data frame in which I want to identify the duplicates in vector x, while counting the number of duplicate instances for each (x, y) pair.
For example, I have found that ddply and this post (Find how many times duplicated rows repeat in R data frame) are similar to what I am looking for.
library(plyr)  # ddply() comes from the plyr package
ddply(X, .(x, y), nrow)
This counts the number of times the combination 1-A occurs, which is 2. However, I am looking for R to return each unique identifier in vector x together with the number of times it matches each value in column y (dropping vector y if necessary), like below:
x A B C
1 2 0 1
2 1 0 0
3 0 2 0
4 1 0 2
5 0 1 0
6 2 1 2
Any help will be appreciated, thanks

You just need the table function :)
> table(X)
y
x A B C
1 2 0 1
2 1 0 0
3 0 2 0
4 1 0 2
5 0 1 0
6 2 1 2
7 1 0 1
8 3 1 0
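If you need the result as a data.frame rather than a table object, a minimal base R sketch is to convert the contingency table directly:
as.data.frame.matrix(table(X))  # the x values become row names, A/B/C become columns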

This is fairly straightforward: just cast your data.frame.
require(reshape2)
dcast(X, x ~ y, fun.aggregate=length)
Or if you want things to be faster (say, when working on large data), you can use the newly implemented dcast.data.table function from the data.table package:
require(data.table) ## >= 1.9.0
setDT(X) ## convert data.frame to data.table by reference
dcast.data.table(X, x ~ y, fun.aggregate=length)
Both result in:
x A B C
1: 1 2 0 1
2: 2 1 0 0
3: 3 0 2 0
4: 4 1 0 2
5: 5 0 1 0
6: 6 2 1 2
7: 7 1 0 1
8: 8 3 1 0
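In more recent versions of data.table (>= 1.9.6), dcast and melt methods for data.tables are available directly, so calling dcast() on a data.table dispatches to the data.table method. A minimal sketch, assuming a current data.table installation:
library(data.table)
setDT(X)                                 ## convert data.frame to data.table by reference
dcast(X, x ~ y, fun.aggregate = length)  ## dispatches to dcast.data.table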

Related

Putting back a missing column from a data.frame into a list of data.frames

My LIST of data.frames below is made from my data. However, this LIST is missing the scale column, which is available in the original data.
I was wondering how to put the missing scale column back into LIST to achieve my DESIRED_LIST?
Reproducible data and code are below.
m3="
scale study outcome time ES bar
2 1 1 0 1 8
2 1 2 0 2 7
1 2 1 0 3 6
1 2 1 1 4 5
2 3 1 0 5 4
2 3 1 1 6 3
1 4 1 0 7 2
1 4 2 0 8 1"
data <- read.table(text = m3, h=T)
LIST <- list(data.frame(study=c(3,3) ,outcome=c(1,1) ,time=0:1),
data.frame(study=c(1,1) ,outcome=c(1,2) ,time=c(0,0)),
data.frame(study=c(2,2,4,4),outcome=c(1,1,1,2),time=c(0,1,0,0)))
DESIRED_LIST <- list(data.frame(scale=c(2,2) ,study=c(3,3) ,outcome=c(1,1) ,time=0:1),
data.frame(scale=c(2,2) ,study=c(1,1) ,outcome=c(1,2) ,time=c(0,0)),
data.frame(scale=c(1,1,1,1),study=c(2,2,4,4),outcome=c(1,1,1,2),time=c(0,1,0,0)))
In base R, you could merge each list element with the original data and keep only the desired columns:
lapply(LIST, \(x) merge(x, data)[c("scale", names(x))])
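If you prefer the tidyverse, here is a sketch of the same idea with dplyr (assuming dplyr >= 1.0.0 for relocate(), and joining on study, outcome and time):
library(dplyr)
lapply(LIST, \(x) {
  merged <- left_join(x, data[c("scale", "study", "outcome", "time")],
                      by = c("study", "outcome", "time"))
  relocate(merged, scale)  # move scale to the front to match DESIRED_LIST
})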

Extended Sorting according to two attributes (I think it is grid sorting) [duplicate]

This question already has answers here:
How to reshape data from long to wide format
(14 answers)
Closed 6 years ago.
I have the following data:
id brand quantity
1 a 2
1 b 1
2 b 5
3 c 10
2 d 11
3 a 1
4 b 2
The output should be
a b c d
1 2 1 0 0
2 0 5 0 11
3 1 0 10 0
4 0 2 0 0
How can I get this kind of result in R, where the column names are the brand types, the row names are the customer ids, and the cell values are the quantities?
This can be done with reshape() and a couple of post hoc fixups:
res <- reshape(df, direction = 'wide', timevar = 'brand')[-1L]  ## drop the id column
names(res) <- sub('^quantity\\.', '', names(res))               ## strip the 'quantity.' prefix
res[is.na(res)] <- 0L                                           ## fill missing combinations with 0
res
## a b c d
## 1 2 1 0 0
## 3 0 5 0 11
## 4 1 0 10 0
## 7 0 2 0 0
Data
df <- data.frame(id=c(1L,1L,2L,3L,2L,3L,4L),brand=c('a','b','b','c','d','a','b'),quantity=c(
2L,1L,5L,10L,11L,1L,2L),stringsAsFactors=F);
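For comparison, a one-line base R alternative is xtabs(), which returns a contingency-style table of summed quantities rather than a data.frame; a minimal sketch using the df defined above:
xtabs(quantity ~ id + brand, data = df)
Wrap the result in as.data.frame.matrix() if you need a plain data.frame.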

How to create a table that shows the frequency of all dummy variables in R

I am a rookie in R.
I want to create a frequency table of all the dummy variables, and I have data like this:
ID Dummy_2008 Dummy_2009 Dummy_2010 Dummy_2011 Dummy_2012 Dummy_2013
1 1 1 0 0 1 1
2 0 0 1 1 0 1
3 0 0 1 0 0 1
4 0 1 1 0 0 1
5 0 0 0 0 1 0
6 0 0 0 1 0 0
I want to see the total frequency of each variable, like this:
0 1 sum
Dummy_2008 5 1 6
Dummy_2009 4 2 6
Dummy_2010 3 3 6
Dummy_2011 4 2 6
Dummy_2012 4 2 6
Dummy_2013 2 4 6
I only know how to use table(), but that handles only one variable at a time.
I have many time-series dummy variables, and I want to see their trend.
Many thanks for the help
Terence
Here is another option using mtabulate and addmargins
library(qdapTools)
addmargins(as.matrix(mtabulate(df1[-1])),2)
# 0 1 Sum
#Dummy_2008 5 1 6
#Dummy_2009 4 2 6
#Dummy_2010 3 3 6
#Dummy_2011 4 2 6
#Dummy_2012 4 2 6
#Dummy_2013 2 4 6
result = as.data.frame(t(sapply(dat[,-1], table)))
result$Sum = rowSums(result)
0 1 Sum
Dummy_2008 5 1 6
Dummy_2009 4 2 6
Dummy_2010 3 3 6
Dummy_2011 4 2 6
Dummy_2012 4 2 6
Dummy_2013 2 4 6
Explanation:
sapply applies a function to each column of a data frame and returns a matrix. So sapply(dat[,-1], table) returns a matrix with the output of table for each column (except the first column, which we've excluded).
The matrix needs to be transposed so that the column names from the original data frame are the rows and the dummy values are the columns, so we use the t (transpose) function for that.
We want a data frame, not a matrix, so we wrap the whole thing in as.data.frame.
Next, we want another column giving the total number of values, so we use the rowSums function.
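Since the columns are 0/1 dummies, the counts can also be computed without table() at all. A minimal base R sketch, assuming the dummy columns contain only 0 and 1 (and reusing the df1 name from the first answer):
ones  <- colSums(df1[-1])    # number of 1s in each dummy column
zeros <- nrow(df1) - ones    # number of 0s in each dummy column
data.frame(`0` = zeros, `1` = ones, Sum = nrow(df1), check.names = FALSE)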

How can I reshape a dataframe in R using dcast (advanced)? [duplicate]

This question already has answers here:
How to reshape data from long to wide format
(14 answers)
Closed 7 years ago.
I have the following R dataframe
df1=data.frame(x = c(1,1,2,2,2,3), y = c("f","g","g","h","i","f"), z=c(6,7,5,2,1,5))
x y z
1 1 f 6
2 1 g 7
3 2 g 5
4 2 h 2
5 2 i 1
6 3 f 5
and I need to obtain
df2=data.frame(x = c(1,2,3), f=c(6,0,5), g=c(7,5,0), h=c(0,2,0),i=c(0,1,0))
x f g h i
1 1 6 7 0 0
2 2 0 5 2 1
3 3 5 0 0 0
I tried using dcast from reshape2
df3=dcast(df1,x~y,length)
which yields
x f g h i
1 1 1 1 0 0
2 2 0 1 1 1
3 3 1 0 0 0
which is not exactly what I need.
Thanks for your help!
UPDATE
I realize this question was already asked and a complete answer can be found here.
By the way, akrun's answer below is exactly what I need, in a clear format.
We don't need to specify fun.aggregate if the values in the 'z' column need to be populated for each combination of 'x' and 'y' (assuming that there are no duplicate combinations of 'x' and 'y'):
dcast(df1, x~y, value.var='z', fill=0)
# x f g h i
#1 1 6 7 0 0
#2 2 0 5 2 1
#3 3 5 0 0 0
Or using spread from library(tidyr)
spread(df1, y, z, fill=0)
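In newer versions of tidyr, spread() has been superseded by pivot_wider(); a sketch of the equivalent call, assuming a recent tidyr:
library(tidyr)
pivot_wider(df1, names_from = y, values_from = z, values_fill = 0)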

Easy way to convert long to wide format with counts [duplicate]

This question already has answers here:
Faster ways to calculate frequencies and cast from long to wide
(4 answers)
Closed 4 years ago.
I have the following data set:
sample.data <- data.frame(Step = c(1,2,3,4,1,2,1,2,3,1,1),
Case = c(1,1,1,1,2,2,3,3,3,4,5),
Decision = c("Referred","Referred","Referred","Approved","Referred","Declined","Referred","Referred","Declined","Approved","Declined"))
sample.data
Step Case Decision
1 1 1 Referred
2 2 1 Referred
3 3 1 Referred
4 4 1 Approved
5 1 2 Referred
6 2 2 Declined
7 1 3 Referred
8 2 3 Referred
9 3 3 Declined
10 1 4 Approved
11 1 5 Declined
Is it possible in R to translate this into a wide table format, with the decisions as the column headers and the value of each cell being the count of occurrences, for example:
Case Referred Approved Declined
1 3 1 0
2 1 0 1
3 2 0 1
4 0 1 0
5 0 0 1
The aggregation parameter in the dcast function of the reshape2 package defaults to length (= count). The data.table package implements an improved version of the dcast function. So in your case this would be:
library('reshape2') # or library('data.table')
newdf <- dcast(sample.data, Case ~ Decision)
or with using the parameters explicitly:
newdf <- dcast(sample.data, Case ~ Decision,
value.var = "Decision", fun.aggregate = length)
This gives the following dataframe:
> newdf
Case Approved Declined Referred
1 1 1 0 3
2 2 0 1 1
3 3 0 1 2
4 4 1 0 0
5 5 0 1 0
If you don't specify an aggregation function, dcast tells you that it is defaulting to length.
You can accomplish this with a simple table() statement. You can play with the factor levels to get the columns in the order you want.
sample.data$Decision <- factor(x = sample.data$Decision,
levels = c("Referred", "Approved", "Declined"))
table(Case = sample.data$Case, sample.data$Decision)
Case Referred Approved Declined
1 3 1 0
2 1 0 1
3 2 0 1
4 0 1 0
5 0 0 1
Here's a dplyr + tidyr approach:
if (!require("pacman")) install.packages("pacman")
pacman::p_load(dplyr, tidyr)
sample.data %>%
count(Case, Decision) %>%
spread(Decision, n, fill = 0)
## Case Approved Declined Referred
## (dbl) (dbl) (dbl) (dbl)
## 1 1 1 0 3
## 2 2 0 1 1
## 3 3 0 1 2
## 4 4 1 0 0
## 5 5 0 1 0
We can use the base R xtabs
xtabs(Step~Case+Decision, transform(sample.data, Step=1))
# Decision
# Case Approved Declined Referred
# 1 1 0 3
# 2 0 1 1
# 3 0 1 2
# 4 1 0 0
# 5 0 1 0
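With current dplyr and tidyr, the dplyr + tidyr answer above can also be written with pivot_wider() in place of spread(); a sketch assuming recent versions of both packages:
library(dplyr)
library(tidyr)
sample.data %>%
  count(Case, Decision) %>%
  pivot_wider(names_from = Decision, values_from = n, values_fill = 0)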
