How can I reshape a dataframe in R using dcast (advanced)? [duplicate] - r

This question already has answers here:
How to reshape data from long to wide format
(14 answers)
Closed 7 years ago.
I have the following R dataframe
df1=data.frame(x = c(1,1,2,2,2,3), y = c("f","g","g","h","i","f"), z=c(6,7,5,2,1,5))
x y z
1 1 f 6
2 1 g 7
3 2 g 5
4 2 h 2
5 2 i 1
6 3 f 5
and I need to obtain
df2=data.frame(x = c(1,2,3), f=c(6,0,5), g=c(7,5,0), h=c(0,2,0),i=c(0,1,0))
x f g h i
1 1 6 7 0 0
2 2 0 5 2 1
3 3 5 0 0 0
I tried using dcast from reshape2
df3=dcast(df1,x~y,length)
which yields
x f g h i
1 1 1 1 0 0
2 2 0 1 1 1
3 3 1 0 0 0
which is not exactly what I need.
Thanks for your help!
UPDATE
I realize this question was already asked and a complete answer can be found here.
By the way Akrun's answer is exactly what I need in a clear format.

We don't need to specify fun.aggregate if the values in the 'z' column should simply populate each combination of 'x' and 'y' (assuming there are no duplicate combinations of 'x' and 'y').
dcast(df1, x~y, value.var='z', fill=0)
# x f g h i
#1 1 6 7 0 0
#2 2 0 5 2 1
#3 3 5 0 0 0
Or using spread from library(tidyr)
spread(df1, y, z, fill=0)
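For comparison, the same long-to-wide step can be sketched in base R with xtabs(), which sums 'z' over each 'x'/'y' combination and fills absent combinations with 0. This is only an illustrative alternative, not part of the answer above; the names tab and wide are hypothetical. It also shows why dcast(df1, x~y, length) in the question produced counts: length counts the rows in each cell instead of carrying their 'z' values over.

```r
# Data from the question
df1 <- data.frame(x = c(1, 1, 2, 2, 2, 3),
                  y = c("f", "g", "g", "h", "i", "f"),
                  z = c(6, 7, 5, 2, 1, 5))

# Sum z for every x/y combination; missing combinations become 0
tab <- xtabs(z ~ x + y, data = df1)

# Turn the contingency table into a plain data.frame with an explicit x column
wide <- as.data.frame.matrix(tab)
wide <- cbind(x = as.numeric(rownames(wide)), wide)
rownames(wide) <- NULL
wide
```

Note that if an 'x'/'y' combination occurred more than once, xtabs() would add the 'z' values together, so the no-duplicates assumption matters here too.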

Related

Turning factor variable into a list of binary variable per row (trial) in R [duplicate]

This question already has answers here:
Transform one column from categoric to binary, keep the rest [duplicate]
(3 answers)
Closed 3 years ago.
A while ago I posted a question here about how to convert a factor data.frame into a binary (one-hot encoded) data.frame. Now I am trying to find the most efficient way to loop over trials (rows) and binarize a factor variable. A minimal example would look like this:
d = data.frame(
Trial = c(1,2,3,4,5,6,7,8,9,10),
Category = c('a','b','b','b','a','b','a','a','b','a')
)
d
Trial Category
1 1 a
2 2 b
3 3 b
4 4 b
5 5 a
6 6 b
7 7 a
8 8 a
9 9 b
10 10 a
While I would like to get this:
Trial a b
1 1 1 0
2 2 0 1
3 3 0 1
4 4 0 1
5 5 1 0
6 6 0 1
7 7 1 0
8 8 1 0
9 9 0 1
10 10 1 0
What would be the most efficient way of doing it?
Here is an option with pivot_wider. Create a column of 1's and then apply pivot_wider with names_from set to 'Category' and values_from set to the newly created column.
library(dplyr)
library(tidyr)
d %>%
mutate(n = 1) %>%
pivot_wider(names_from = Category, values_from = n, values_fill = list(n = 0))
# A tibble: 10 x 3
# Trial a b
# <dbl> <dbl> <dbl>
# 1 1 1 0
# 2 2 0 1
# 3 3 0 1
# 4 4 0 1
# 5 5 1 0
# 6 6 0 1
# 7 7 1 0
# 8 8 1 0
# 9 9 0 1
#10 10 1 0
The efficient option would be data.table
library(data.table)
dcast(setDT(d), Trial ~ Category, length)
It can also be done with base R
table(d)
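Note that table(d) returns a contingency table, not a data.frame. If a data.frame with one row per trial is wanted without any reshaping, a minimal base R sketch is to compare the column against each of its distinct values. The names lv, onehot, and res are hypothetical, and this has not been benchmarked against the options above.

```r
# Data from the question
d <- data.frame(
  Trial = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  Category = c('a', 'b', 'b', 'b', 'a', 'b', 'a', 'a', 'b', 'a')
)

# Distinct category values, as character so sapply() names the columns
lv <- sort(unique(as.character(d$Category)))

# One 0/1 column per category: 1 where the row's category matches
onehot <- sapply(lv, function(l) as.integer(d$Category == l))

# Keep the original row order and attach the indicator columns
res <- cbind(d["Trial"], onehot)
res
```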

Convert factor levels into the column names without changing the structure of the data.frame [duplicate]

This question already has answers here:
Generate a dummy-variable
(17 answers)
Closed 3 years ago.
I am doing a fairly easy job in R to convert the factor levels into column names.
Let tmp be a data.frame:
tmp <- data.frame(x=gl(2,3, labels=letters[24:25]),
y=gl(3,1,6, labels=letters[1:3]),
z=c(1,2,3,3,3,2))
> tmp
x y z
1 x a 1
2 x b 2
3 x c 3
4 y a 3
5 y b 3
6 y c 2
My goal is to turn the levels of the y column into column names and put a 1 in the corresponding column, like below:
x y z a b c
x a 1 1 0 0
x b 2 0 1 0
x c 3 0 0 1
y a 3 1 0 0
y b 3 0 1 0
y c 2 0 0 1
Note that what I need is different from what the dcast and spread functions in the R packages reshape2 and tidyr do.
We can use mtabulate from qdapTools
cbind(tmp, qdapTools::mtabulate(tmp$y))
# x y z a b c
#1 x a 1 1 0 0
#2 x b 2 0 1 0
#3 x c 3 0 0 1
#4 y a 3 1 0 0
#5 y b 3 0 1 0
#6 y c 2 0 0 1
Or cSplit_e from splitstackshape
splitstackshape::cSplit_e(tmp, "y", type = "character", fill = 0)
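If installing an extra package is not an option, the same indicator columns can be sketched in base R with model.matrix(). Dropping the intercept (the "- 1" in the formula) yields one 0/1 column per level of y. This is an illustrative alternative rather than part of the answer above; mm and out are hypothetical names.

```r
# Data from the question
tmp <- data.frame(x = gl(2, 3, labels = letters[24:25]),
                  y = gl(3, 1, 6, labels = letters[1:3]),
                  z = c(1, 2, 3, 3, 3, 2))

# One indicator column per level of y (no intercept, so no level is dropped)
mm <- model.matrix(~ y - 1, data = tmp)

# model.matrix names the columns "ya", "yb", "yc"; use the bare levels instead
colnames(mm) <- levels(tmp$y)

# Keep the original columns and append the indicators
out <- cbind(tmp, as.data.frame(mm))
out
```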

Extended Sorting according to two attributes (I think it is grid sorting) [duplicate]

This question already has answers here:
How to reshape data from long to wide format
(14 answers)
Closed 6 years ago.
I have the following data:
id brand quantity
1 a 2
1 b 1
2 b 5
3 c 10
2 d 11
3 a 1
4 b 2
The output should be
a b c d
1 2 1 0 0
2 0 5 0 11
3 1 0 10 0
4 0 2 0 0
How to get this type of sort in R language where the column names are brand types and row names are customer ids and the matrix data are quantity?
This can be done with reshape() and a couple of post hoc fixups:
res <- reshape(df, direction = 'wide', idvar = 'id', timevar = 'brand')[-1L]
names(res) <- sub('^quantity\\.', '', names(res))
res[is.na(res)] <- 0L
res
## a b c d
## 1 2 1 0 0
## 3 0 5 0 11
## 4 1 0 10 0
## 7 0 2 0 0
Data
df <- data.frame(id = c(1L,1L,2L,3L,2L,3L,4L), brand = c('a','b','b','c','d','a','b'),
quantity = c(2L,1L,5L,10L,11L,1L,2L), stringsAsFactors = FALSE)
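Since the question asks for a matrix with customer ids as row names and brands as column names, another base R sketch is tapply() over both grouping columns. It is shown here only as an alternative to the reshape() approach; the name m is hypothetical.

```r
# Data from the answer
df <- data.frame(id = c(1L, 1L, 2L, 3L, 2L, 3L, 4L),
                 brand = c('a', 'b', 'b', 'c', 'd', 'a', 'b'),
                 quantity = c(2L, 1L, 5L, 10L, 11L, 1L, 2L),
                 stringsAsFactors = FALSE)

# Sum quantity per id/brand pair; pairs that never occur come back as NA
m <- tapply(df$quantity, list(id = df$id, brand = df$brand), sum)

# Replace the missing pairs with 0
m[is.na(m)] <- 0L
m
```

The result is a matrix whose row names are the ids, so no post hoc renaming of columns is needed.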

How to randomly choose only one row in each group [duplicate]

This question already has answers here:
from data table, randomly select one row per group
(4 answers)
Closed 6 years ago.
Say I have a dataframe as follows:
df <- data.frame(Region = c("A","A","A","B","B","C","D","D","D","D"),
Combo = c(1,2,3,1,2,1,1,2,3,4))
> df
Region Combo
1 A 1
2 A 2
3 A 3
4 B 1
5 B 2
6 C 1
7 D 1
8 D 2
9 D 3
10 D 4
What I would like to do, is for each Region (A,B,C,D) randomly choose only one of the possible combos for that region.
If the chosen combination were indicated by a binary variable, it would look something potentially like this:
Region Combo RandomlyChosen
1 A 1 1
2 A 2 0
3 A 3 0
4 B 1 0
5 B 2 1
6 C 1 1
7 D 1 0
8 D 2 0
9 D 3 1
10 D 4 0
I'm aware of the sample function, but just don't know how to choose only one combo within each region.
I regularly use data.table, so any solutions using that are welcome, though solutions not using data.table are equally welcome.
Thanks!
In plain R you can use sample() within tapply():
df$Chosen <- 0
df[-tapply(-seq_along(df$Region),df$Region, sample, size=1),]$Chosen <- 1
df
Region Combo Chosen
1 A 1 0
2 A 2 1
3 A 3 0
4 B 1 1
5 B 2 0
6 C 1 1
7 D 1 0
8 D 2 0
9 D 3 1
10 D 4 0
Note the -(-selected_row_number) trick: negating the row numbers keeps sample() from drawing from 1 to n when a group has only a single row, and the outer minus sign restores the positive row index.
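The single-row pitfall exists because sample(x, 1) treats a single number x >= 1 as the range 1:x. A sketch that sidesteps it without negation is to sample a position within each group using sample.int() and mark it through ave(). The helper name pick_one is hypothetical, and set.seed() is there only to make the sketch reproducible.

```r
set.seed(1)  # only for reproducibility of this sketch

# Data from the question
df <- data.frame(Region = c("A", "A", "A", "B", "B", "C", "D", "D", "D", "D"),
                 Combo  = c(1, 2, 3, 1, 2, 1, 1, 2, 3, 4))

# For one group: draw a position 1..n and flag it with 1, everything else with 0.
# sample.int(n, 1) is safe even when n == 1.
pick_one <- function(idx) as.integer(seq_along(idx) == sample.int(length(idx), 1))

# Apply the helper within each Region; ave() returns values in the original row order
df$RandomlyChosen <- ave(seq_len(nrow(df)), df$Region, FUN = pick_one)
df
```

Exactly one row per Region ends up flagged, which can be checked with tapply(df$RandomlyChosen, df$Region, sum).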

R Counting duplicate values and adding them to separate vectors [duplicate]

This question already has answers here:
Faster ways to calculate frequencies and cast from long to wide
(4 answers)
How do I get a contingency table?
(6 answers)
Closed 4 years ago.
x <- c(1,1,1,2,3,3,4,4,4,5,6,6,6,6,6,7,7,8,8,8,8)
y <- c('A','A','C','A','B','B','A','C','C','B','A','A','C','C','B','A','C','A','A','A','B')
X <- data.frame(x,y)
Above I have a data frame in which I want to identify the duplicates in vector x while counting the number of duplicate instances of each (x, y) pair.
For example, I found that ddply (from the plyr package) and this post (Find how many times duplicated rows repeat in R data frame) come close to what I am looking for.
library(plyr)
ddply(X, .(x, y), nrow)
This counts how many times each pair occurs, e.g. 1 - A occurs 2 times. However, I am looking for R to return each unique identifier in vector x together with the number of times it is matched with each value in column y (getting rid of vector y if necessary), like below:
x A B C
1 2 0 1
2 1 0 0
3 0 2 0
4 1 0 2
5 0 1 0
6 2 1 2
Any help will be appreciated, thanks
You just need the table function :)
> table(X)
y
x A B C
1 2 0 1
2 1 0 0
3 0 2 0
4 1 0 2
5 0 1 0
6 2 1 2
7 1 0 1
8 3 1 0
This is fairly straightforward by casting your data.frame.
require(reshape2)
dcast(X, x ~ y, fun.aggregate=length)
Or if you'd want things to be faster (say working on large data), then you can use the newly implemented dcast.data.table function from data.table package:
require(data.table) ## >= 1.9.0
setDT(X) ## convert data.frame to data.table by reference
dcast.data.table(X, x ~ y, fun.aggregate=length)
Both result in:
x A B C
1: 1 2 0 1
2: 2 1 0 0
3: 3 0 2 0
4: 4 1 0 2
5: 5 0 1 0
6: 6 2 1 2
7: 7 1 0 1
8: 8 3 1 0
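The two shapes discussed in this thread, long pair counts (what ddply returns) and the wide layout, can both be derived from the same table() call in base R. The following is a minimal sketch with hypothetical names tab, long, and wide; unlike ddply, the long form here also lists pairs that occur zero times.

```r
# Data from the question
x <- c(1, 1, 1, 2, 3, 3, 4, 4, 4, 5, 6, 6, 6, 6, 6, 7, 7, 8, 8, 8, 8)
y <- c('A', 'A', 'C', 'A', 'B', 'B', 'A', 'C', 'C', 'B', 'A', 'A', 'C', 'C', 'B',
       'A', 'C', 'A', 'A', 'A', 'B')
X <- data.frame(x, y)

tab <- table(X)

# Long form: one row per (x, y) pair with its count in Freq
long <- as.data.frame(tab)

# Wide form: one row per x, one count column per value of y
wide <- as.data.frame.matrix(tab)
wide <- cbind(x = as.numeric(rownames(wide)), wide)
rownames(wide) <- NULL
wide
```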
