This question already has answers here:
How to reshape data from long to wide format
(14 answers)
Closed 6 years ago.
I have this data frame:
var1 var2 var3
A B 1
B C 2
B A 3
D C 4
B D 5
I would like to transform it into a matrix and add a column and a row that sum the values associated with each variable, like this, using R code:
A B C D Total
A 0 1 0 0 1
B 3 0 2 5 10
C 0 0 0 0 0
D 0 0 4 0 4
T 3 1 6 5
Can you suggest a way of doing it?
Thanks a lot!!
nms <- sort(unique(c(as.character(df$var1),as.character(df$var2))));
m <- matrix(vector(typeof(df$var3),1L),length(nms),length(nms),dimnames=list(nms,nms));
m[cbind(as.character(df$var1),as.character(df$var2))] <- df$var3;
m;
## A B C D
## A 0 1 0 0
## B 3 0 2 5
## C 0 0 0 0
## D 0 0 4 0
The as.character() coercions can be omitted if the var1 and var2 input columns are already character vectors.
Data
df <- data.frame(var1=c('A','B','B','D','B'),var2=c('B','C','A','C','D'),var3=c(1L,2L,3L,4L,5L));
Marginal totals can be added as follows:
m <- cbind(m,Total=rowSums(m));
m <- rbind(m,T=colSums(m));
m;
## A B C D Total
## A 0 1 0 0 1
## B 3 0 2 5 10
## C 0 0 0 0 0
## D 0 0 4 0 4
## T 3 1 6 5 15
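As an alternative sketch (not part of the answer above), base R's xtabs and addmargins can build the same table in one pass. It assumes the data frame df from the Data section; the names lv, tab and m2 are illustrative, and addmargins labels the margins Sum rather than Total and T.

```r
# Same input as in the Data section above
df <- data.frame(var1 = c("A", "B", "B", "D", "B"),
                 var2 = c("B", "C", "A", "C", "D"),
                 var3 = 1:5)

# Use one shared set of levels so that rows and columns both cover A..D
lv <- sort(unique(c(as.character(df$var1), as.character(df$var2))))
df$var1 <- factor(df$var1, levels = lv)
df$var2 <- factor(df$var2, levels = lv)

# Sum var3 for every (var1, var2) pair; pairs that never occur stay 0
tab <- xtabs(var3 ~ var1 + var2, data = df)

# Append a "Sum" row and a "Sum" column holding the marginal totals
m2 <- addmargins(tab)
m2
```

Setting the factor levels explicitly is what keeps the all-zero row C in the result; without it, var1 would only have the levels A, B and D.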
I have the following table in R:
df <- data.frame('a' = c(1,0,0,1,0),
'b' = c(1,0,0,1,0),
'c' = c(1,1,0,1,1))
df
a b c
1 1 1 1
2 0 0 1
3 0 0 0
4 1 1 1
5 0 0 1
What I want is to replace each value with its column name whenever the value is equal to 1. The output would be this:
a b c
1 a b c
2 0 0 c
3 0 0 0
4 a b c
5 0 0 c
How can I do this in R? Thanks.
I would use Map and replace:
df[] <- Map(function(n, x) replace(x, x == 1, n), names(df), df)
df
# a b c
# 1 a b c
# 2 0 0 c
# 3 0 0 0
# 4 a b c
# 5 0 0 c
We can use
df[] <- names(df)[(NA^!df) * col(df)]
df[is.na(df)] <- 0
df
# a b c
#1 a b c
#2 0 0 c
#3 0 0 0
#4 a b c
#5 0 0 c
You can try stack and unstack
a=stack(df)
a
values ind
1 1 a
2 0 a
3 0 a
4 1 a
5 0 a
6 1 b
7 0 b
8 0 b
9 1 b
10 0 b
11 1 c
12 1 c
13 0 c
14 1 c
15 1 c
a$values[a$values==1]=as.character(a$ind)[a$values==1]
unstack(a)
a b c
1 a b c
2 0 0 c
3 0 0 0
4 a b c
5 0 0 c
We can try iterating over the names of the data frame, and then handling each column, for a base R option:
df <- data.frame(a=c(1,0,0,1,0), b=c(1,0,0,1,0), c=c(1,1,0,1,1))
df <- data.frame(sapply(names(df), function(x) {
y <- df[[x]]
y[y == 1] <- x
return(y)
}))
df
a b c
1 a b c
2 0 0 c
3 0 0 0
4 a b c
5 0 0 c
You can do it with ifelse, but you have to do some intermediate transposing to account for R's column-major order processing.
data.frame(t(ifelse(t(df)==1,names(df),0)))
a b c
1 a b c
2 0 0 c
3 0 0 0
4 a b c
5 0 0 c
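The transposing is needed because ifelse recycles its yes argument down the columns of the test matrix. A minimal sketch of the same idea without t(), assuming the df from the question: repeating each column name nrow(df) times lines the names up with column-major storage. The names m and out are illustrative.

```r
df <- data.frame(a = c(1, 0, 0, 1, 0),
                 b = c(1, 0, 0, 1, 0),
                 c = c(1, 1, 0, 1, 1))

m <- as.matrix(df)

# A matrix is stored column by column, so the first nrow(df) elements
# belong to column "a", the next nrow(df) to column "b", and so on.
# rep(..., each = nrow(df)) produces the column names in that same order.
out <- ifelse(m == 1, rep(names(df), each = nrow(df)), "0")

# out is a character matrix with the same shape and column names as m
data.frame(out)
```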
This question already has answers here:
Search multiple columns for string to set indicator variable
(3 answers)
R model.matrix using same factor set among all columns
(1 answer)
Closed 4 years ago.
I am a beginner in R and am looking to create dummy variables for a dataset.
I have a data set with a few columns, like the one below:
Dataset1
T1 T2 T3
A C B
A C B
A C B
A D C
B D C
B E F
I want to add dummy variables to this, like dummy,A; dummy,B; dummy,C and so on, and assign each one the value 1 if its letter is present in any of T1, T2 or T3, else 0.
So the final data set should look like -
T1 T2 T3 dummy,A dummy,B dummy,C dummy,D dummy,E dummy,F
A C B 1 1 1 0 0 0
A C B 1 1 1 0 0 0
A C B 1 1 1 0 0 0
A D C 1 0 1 1 0 0
B D C 0 1 1 1 0 0
B E F 0 1 0 0 1 1
So can anyone please suggest how I can achieve this?
Any help in this regard is really appreciated. Thanks!
We can use mtabulate from qdapTools. Transpose 'Dataset1', convert it to a data.frame, apply mtabulate, change the column names (if needed), and cbind the result with the original 'Dataset1':
library(qdapTools)
d1 <- mtabulate(as.data.frame(t(Dataset1)))
row.names(d1) <- NULL
names(d1) <- paste0("dummy.", names(d1))
cbind(Dataset1, d1)
# T1 T2 T3 dummy.A dummy.B dummy.C dummy.D dummy.E dummy.F
#1 A C B 1 1 1 0 0 0
#2 A C B 1 1 1 0 0 0
#3 A C B 1 1 1 0 0 0
#4 A D C 1 0 1 1 0 0
#5 B D C 0 1 1 1 0 0
#6 B E F 0 1 0 0 1 1
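If installing qdapTools is not an option, the same table can be sketched in base R. This is an illustrative alternative rather than the answerer's method; it rebuilds Dataset1 from the question, and the names lv, d1 and res are assumptions.

```r
Dataset1 <- data.frame(T1 = c("A", "A", "A", "A", "B", "B"),
                       T2 = c("C", "C", "C", "D", "D", "E"),
                       T3 = c("B", "B", "B", "C", "C", "F"),
                       stringsAsFactors = FALSE)

# All distinct values that occur anywhere in T1..T3
lv <- sort(unique(unlist(Dataset1)))

# For each value, flag the rows where it appears in at least one column
d1 <- sapply(lv, function(l) as.integer(rowSums(Dataset1 == l) > 0))
colnames(d1) <- paste0("dummy.", lv)

res <- cbind(Dataset1, d1)
res
```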
Question
Let's say I have this dataframe:
# mock data set
df.size = 10
cluster.id<- sample(c(1:5), df.size, replace = TRUE)
letters <- sample(LETTERS[1:5], df.size, replace = TRUE)
test.set <- data.frame(cluster.id, letters)
Will be something like:
cluster.id letters
<int> <fctr>
1 5 A
2 4 B
3 4 B
4 3 A
5 3 E
6 3 D
7 3 C
8 2 A
9 2 E
10 1 A
Now I want to group these by cluster.id and see which letters occur within each cluster; for example, cluster 3 contains the letters A, E, D and C. Then I want to get all unique pairwise combinations (excluding combinations of a letter with itself, e.g. A,A): A,E; A,D; A,C; etc. Finally, I want to update the count for each of these combinations in an adjacency matrix/data frame.
Idea
# group by cluster.id
# per group get all (unique) pairwise combinations for the letters (excluding pairwise combinations with itself, e.g. A,A)
# update adjacency for each pairwise combinations
What I tried
# empty adjacency df
possible <- LETTERS
adj.df <- data.frame(matrix(0, ncol = length(possible), nrow = length(possible)))
colnames(adj.df) <- rownames(adj.df) <- possible
# what I tried
update.adj <- function( data ) {
for (comb in combn(data$letters,2)) {
# stuck here
}
}
test.set %>% group_by(cluster.id) %>% update.adj(.)
There is probably an easy way to do this, because I see adjacency matrices all the time, but I'm not able to figure it out. Please let me know if anything is unclear.
Answer to comment
Answer to @Manuel Bickel:
For the data I gave as example (the table under "will be something like"):
This matrix will be A-->Z for the full dataset, keep that in mind.
A B C D E
A 0 0 1 1 2
B 0 0 0 0 0
C 1 0 0 1 1
D 1 0 1 0 1
E 2 0 1 1 0
I will explain what I did:
cluster.id letters
<int> <fctr>
1 5 A
2 4 B
3 4 B
4 3 A
5 3 E
6 3 D
7 3 C
8 2 A
9 2 E
10 1 A
Only the clusters containing more than one unique letter are relevant (because we don't want combinations of a letter with itself; e.g. cluster 4 contains only the letter B, so it would result in the combination B,B and is therefore not relevant):
4 3 A
5 3 E
6 3 D
7 3 C
8 2 A
9 2 E
Now I look for each cluster what pairwise combinations I can make:
cluster 3:
A,E
A,D
A,C
E,D
E,C
D,C
Update these combinations in the adjacency matrix:
A B C D E
A 0 0 1 1 1
B 0 0 0 0 0
C 1 0 0 1 1
D 1 0 1 0 1
E 1 0 1 1 0
Then go to the next cluster
cluster 2
A,E
Update the adjacency matrix again:
A B C D E
A 0 0 1 1 2 <-- note the 2 now
B 0 0 0 0 0
C 1 0 0 1 1
D 1 0 1 0 1
E 2 0 1 1 0
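The manual procedure above can be sketched as a plain loop in base R. This is only an illustration of the steps as described, restricted to the letters A to E of the example; the names adj, grp, u and pairs are assumptions, and a loop like this may be slow on a very large data set.

```r
test.set <- data.frame(cluster.id = c(5, 4, 4, 3, 3, 3, 3, 2, 2, 1),
                       letters = c("A", "B", "B", "A", "E", "D", "C", "A", "E", "A"),
                       stringsAsFactors = FALSE)

# Empty adjacency matrix over the letters of the example
possible <- LETTERS[1:5]
adj <- matrix(0, length(possible), length(possible),
              dimnames = list(possible, possible))

# Walk over the letters of each cluster
for (grp in split(test.set$letters, test.set$cluster.id)) {
  u <- unique(grp)
  if (length(u) < 2) next        # a single letter has no pair to form

  pairs <- combn(u, 2)           # one column per unordered pair
  for (k in seq_len(ncol(pairs))) {
    i <- pairs[1, k]
    j <- pairs[2, k]
    adj[i, j] <- adj[i, j] + 1   # update both halves to keep
    adj[j, i] <- adj[j, i] + 1   # the matrix symmetric
  }
}
adj
```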
In reaction to the huge dataset
library(reshape2)
test.set <- read.table(text = "
cluster.id letters
1 5 A
2 4 B
3 4 B
4 3 A
5 3 E
6 3 D
7 3 C
8 2 A
9 2 E
10 1 A", header = T, stringsAsFactors = F)
x1 <- reshape2::dcast(test.set, cluster.id ~ letters)
x1
#cluster.id A B C D E
#1 1 1 0 0 0 0
#2 2 1 0 0 0 1
#3 3 1 0 1 1 1
#4 4 0 2 0 0 0
#5 5 1 0 0 0 0
x2 <- table(test.set)
x2
# letters
#cluster.id A B C D E
# 1 1 0 0 0 0
# 2 1 0 0 0 1
# 3 1 0 1 1 1
# 4 0 2 0 0 0
# 5 1 0 0 0 0
x1.c <- crossprod(x1)
#Error in crossprod(x, y) :
# requires numeric/complex matrix/vector arguments
x2.c <- crossprod(x2)
#works fine
Following the comment above, here is Tyler Rinker's code applied to your data. I hope this is what you want.
UPDATE: Following the comments below, I added a solution using the package reshape2 so that larger amounts of data can be handled.
test.set <- read.table(text = "
cluster.id letters
1 5 A
2 4 B
3 4 B
4 3 A
5 3 E
6 3 D
7 3 C
8 2 A
9 2 E
10 1 A", header = T, stringsAsFactors = F)
x <- table(test.set)
x
#           letters
#cluster.id A B C D E
# 1 1 0 0 0 0
# 2 1 0 0 0 1
# 3 1 0 1 1 1
# 4 0 2 0 0 0
# 5 1 0 0 0 0
#base approach, based on answer by Tyler Rinker
x <- crossprod(x)
diag(x) <- 0 #this is to set matches such as AA, BB, etc. to zero
x
#        letters
# letters A B C D E
# A 0 0 1 1 2
# B 0 0 0 0 0
# C 1 0 0 1 1
# D 1 0 1 0 1
# E 2 0 1 1 0
#reshape2 approach
library(reshape2)
x <- acast(test.set, cluster.id ~ letters)
x <- crossprod(x)
diag(x) <- 0
x
# A B C D E
# A 0 0 1 1 2
# B 0 0 0 0 0
# C 1 0 0 1 1
# D 1 0 1 0 1
# E 2 0 1 1 0
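One caveat, shown here as a hedged sketch: crossprod multiplies the counts, so a letter that is repeated inside a cluster alongside other letters inflates the pair count. In the example data the only repeat is B in cluster 4, which has no partner, so the result above is unaffected. If each cluster should count a pair at most once, the table can be reduced to presence/absence first. The data frame d and the names counts, binary, adj.counts and adj.binary are illustrative.

```r
# Small example where cluster 1 contains A twice and E once
d <- data.frame(cluster.id = c(1, 1, 1, 2, 2),
                letter = c("A", "A", "E", "A", "E"),
                stringsAsFactors = FALSE)

counts <- table(d$cluster.id, d$letter)

# Raw counts: cluster 1 contributes 2 * 1 and cluster 2 contributes 1 * 1
adj.counts <- crossprod(counts)
diag(adj.counts) <- 0

# Presence/absence: every cluster contributes at most 1 per pair
binary <- 1 * (counts > 0)
adj.binary <- crossprod(binary)
diag(adj.binary) <- 0

adj.counts
adj.binary
```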
This question already has answers here:
Generate a dummy-variable
(17 answers)
Closed 5 years ago.
test <- data.frame(
x=rep(letters[1:3],each=2),
y=c(4,4,5,5,5,6)
)
x y
1 a 4
2 a 4
3 b 5
4 b 5
5 c 5
6 c 6
How do I create new columns that contain dummy variables (1 and 0) indicating each row's value?
I wish to create something like this for column x:
x y x_a x_b x_c
1 a 4 1 0 0
2 a 4 1 0 0
3 b 5 0 1 0
4 b 5 0 1 0
5 c 5 0 0 1
6 c 6 0 0 1
Or for column y
x y y_4 y_5 y_6
1 a 4 1 0 0
2 a 4 1 0 0
3 b 5 0 1 0
4 b 5 0 1 0
5 c 5 0 1 0
6 c 6 0 0 1
I managed to do this in base R using ifelse for each new column.
I wish to do this in dplyr so that it can work on SQL tables.
con <- DBI::dbConnect(RSQLite::SQLite(), path = "")
dbWriteTable(con, "test",test)
testdb <- tbl(con, "test")
testdb %>% mutate(i = row_number(), i2 = 1) %>% spread(x, i2, fill = 0)
The row_number() function does not work on SQL tables:
Error: Window function row_number() is not supported by this database. I'm using SQLite.
For x:
library(dplyr)
test %>% bind_cols(as_tibble(setNames(lapply(unique(test$x),
function(x){as.integer(test$x == x)}),
paste0('x_', unique(test$x)))))
x y x_a x_b x_c
1 a 4 1 0 0
2 a 4 1 0 0
3 b 5 0 1 0
4 b 5 0 1 0
5 c 5 0 0 1
6 c 6 0 0 1
For y:
test %>% bind_cols(as_tibble(setNames(lapply(unique(test$y),
function(x){as.integer(test$y == x)}),
paste0('y_', unique(test$y)))))
x y y_4 y_5 y_6
1 a 4 1 0 0
2 a 4 1 0 0
3 b 5 0 1 0
4 b 5 0 1 0
5 c 5 0 1 0
6 c 6 0 0 1
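For a data frame held in memory (not a remote SQL table), base R's model.matrix gives the same indicator columns. This is a sketch of an alternative, not the dplyr approach above; the names dm and res are illustrative.

```r
test <- data.frame(x = rep(letters[1:3], each = 2),
                   y = c(4, 4, 5, 5, 5, 6))

# "- 1" drops the intercept so that every level of x gets its own column
dm <- model.matrix(~ x - 1, data = test)

# model.matrix names the columns xa, xb, xc; rename them to x_a, x_b, x_c
colnames(dm) <- sub("^x", "x_", colnames(dm))

res <- cbind(test, dm)
res
```

For column y, the same call works after converting it to a factor, for example with model.matrix(~ factor(y) - 1, data = test), though the column names then need a different rename.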
I have two data frames:
DATA1:
ID com_alc_cd com_liv_cd com_hyee_cd
A 1 0 0
B 0 0 1
D 0 0 0
C 0 1 0
DATA2:
ID com_alc_dd com_liv_dd com_hyee_dd
B 0 2 0
A 1 0 2
C 0 1 0
D 0 1 0
I want to combine the two data frames, so as to obtain the sum of the two:
SUM(DATA1, DATA2):
ID com_alc com_liv com_hyee
A 2 0 2
B 0 2 1
C 0 2 0
D 0 1 0
Try this, for example (assuming that your data frames have the same dimensions and contain the same IDs):
d1 <- DATA1[order(DATA1$ID),]
d2 <- DATA2[order(DATA2$ID),]
data.frame(ID=d1$ID,as.matrix(subset(d1,select=-ID)) +
as.matrix(subset(d2,select=-ID)))
ID com_alc_cd com_liv_cd com_hyee_cd
1 A 2 0 2
2 B 0 2 1
4 C 0 2 0
3 D 0 1 0
EDIT general solution
library(reshape2)
## put the data in the long format
res <- do.call(rbind,lapply(list(DATA1,DATA2),melt,id.vars='ID'))
## polish names
res$variable <- gsub('(.*_.*)_.*','\\1',res$variable)
## wide format and aggregate using sum
dcast(ID~variable,data=res,fun.aggregate=sum)
ID com_alc com_hyee com_liv
1 A 2 2 0
2 B 0 1 2
3 C 0 0 2
4 D 0 0 1
You can also use aggregate (df1 and df2 here stand for DATA1 and DATA2; the result keeps the column names of df2):
names(df1) <- names(df2)
df3 <- rbind(df1, df2)
res <- aggregate(df3[,-1], by=list(df3$ID), sum)
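A self-contained sketch of the aggregate approach with the data from the question. The helper strip is hypothetical; it assumes the suffixes are always _cd or _dd and removes them so that both data frames share the column names of the desired output.

```r
DATA1 <- data.frame(ID = c("A", "B", "D", "C"),
                    com_alc_cd  = c(1, 0, 0, 0),
                    com_liv_cd  = c(0, 0, 0, 1),
                    com_hyee_cd = c(0, 1, 0, 0))
DATA2 <- data.frame(ID = c("B", "A", "C", "D"),
                    com_alc_dd  = c(0, 1, 0, 0),
                    com_liv_dd  = c(2, 0, 1, 1),
                    com_hyee_dd = c(0, 2, 0, 0))

# Hypothetical helper: drop the trailing _cd / _dd from the column names
strip <- function(d) setNames(d, sub("_(cd|dd)$", "", names(d)))

# Stack both data frames, then sum every column within each ID
both <- rbind(strip(DATA1), strip(DATA2))
res <- aggregate(. ~ ID, data = both, FUN = sum)
res
```

Because the rows are matched by ID rather than by position, the two inputs do not need to be sorted the same way.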