This question already has answers here:
How to reshape data from long to wide format
(14 answers)
Closed 3 years ago.
I have a data.frame that looks like this.
x a 1
x b 2
x c 3
y a 3
y b 3
y c 2
I want this in matrix form so I can feed it to heatmap to make a plot. The result should look something like:
a b c
x 1 2 3
y 3 3 2
I have tried cast from the reshape package and I have tried writing a manual function to do this but I do not seem to be able to get it right.
There are many ways to do this. This answer starts with what is quickly becoming the standard method, but also includes older methods and various other methods from answers to similar questions scattered around this site.
tmp <- data.frame(x=gl(2,3, labels=letters[24:25]),
y=gl(3,1,6, labels=letters[1:3]),
z=c(1,2,3,3,3,2))
Using the tidyverse:
The new cool new way to do this is with pivot_wider from tidyr 1.0.0. It returns a data frame, which is probably what most readers of this answer will want. For a heatmap, though, you would need to convert this to a true matrix.
library(tidyr)
pivot_wider(tmp, names_from = y, values_from = z)
## # A tibble: 2 x 4
## x a b c
## <fct> <dbl> <dbl> <dbl>
## 1 x 1 2 3
## 2 y 3 3 2
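If you do need a plain matrix for heatmap, here is a minimal sketch of the conversion (an addition, not part of the original answer; it assumes the tmp data above):
library(tidyr)
wide <- as.data.frame(pivot_wider(tmp, names_from = y, values_from = z))
m <- as.matrix(wide[, -1])            # drop the x column, keep the numeric values
rownames(m) <- as.character(wide$x)   # "x", "y"
# heatmap(m)                          # m is now a true matrix with dimnames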
The old cool new way to do this is with spread from tidyr. It similarly returns a data frame.
library(tidyr)
spread(tmp, y, z)
## x a b c
## 1 x 1 2 3
## 2 y 3 3 2
Using reshape2:
One of the first steps toward the tidyverse was the reshape2 package.
To get a matrix use acast:
library(reshape2)
acast(tmp, x~y, value.var="z")
## a b c
## x 1 2 3
## y 3 3 2
Or to get a data frame, use dcast, as here: Reshape data for values in one column.
dcast(tmp, x~y, value.var="z")
## x a b c
## 1 x 1 2 3
## 2 y 3 3 2
Using plyr:
In between reshape2 and the tidyverse came plyr, with the daply function, as shown here: https://stackoverflow.com/a/7020101/210673
library(plyr)
daply(tmp, .(x, y), function(x) x$z)
## y
## x a b c
## x 1 2 3
## y 3 3 2
Using matrix indexing:
This is kinda old school but is a nice demonstration of matrix indexing, which can be really useful in certain situations.
with(tmp, {
out <- matrix(nrow=nlevels(x), ncol=nlevels(y),
dimnames=list(levels(x), levels(y)))
out[cbind(x, y)] <- z
out
})
Using xtabs:
xtabs(z~x+y, data=tmp)
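Note (an addition, not from the original answer): xtabs() returns a "table"-classed array rather than a plain matrix. It usually works wherever a matrix is expected, but the class and call attributes can be stripped if a bare matrix is required:
tab <- xtabs(z ~ x + y, data = tmp)
m <- unclass(tab)          # drop the "xtabs"/"table" classes
attr(m, "call") <- NULL    # drop the stored call attribute
m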
Using a sparse matrix:
There's also sparseMatrix within the Matrix package, as seen here: R - convert BIG table into matrix by column names
library(Matrix)
with(tmp, sparseMatrix(i = as.numeric(x), j = as.numeric(y), x = z,
                       dimnames = list(levels(x), levels(y))))
## 2 x 3 sparse Matrix of class "dgCMatrix"
## a b c
## x 1 2 3
## y 3 3 2
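Since heatmap() wants a dense base matrix, the sparse result would typically be densified first. A short sketch (an addition, assuming the same tmp data):
library(Matrix)
sm <- with(tmp, sparseMatrix(i = as.numeric(x), j = as.numeric(y), x = z,
                             dimnames = list(levels(x), levels(y))))
as.matrix(sm)              # dense base matrix with the same dimnames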
Using reshape:
You can also use the base R function reshape, as suggested here: Convert table into matrix by column names, though you have to do a little manipulation afterwards to remove an extra column and get the names right (a possible cleanup is sketched below).
reshape(tmp, idvar="x", timevar="y", direction="wide")
## x z.a z.b z.c
## 1 x 1 2 3
## 4 y 3 3 2
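One possible cleanup, sketched here as an addition to the original answer (the exact steps depend on what you need):
wide <- reshape(tmp, idvar = "x", timevar = "y", direction = "wide")
m <- as.matrix(wide[, -1])                     # drop the id column
rownames(m) <- as.character(wide$x)            # "x", "y"
colnames(m) <- sub("^z\\.", "", colnames(m))   # "z.a" -> "a", etc.
m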
The question is some years old but maybe some people are still interested in alternative answers.
If you don't want to load any packages, you might use this function:
#' Converts three columns of a data.frame into a matrix -- e.g. to plot
#' the data via image() later on. Two of the columns form the row and
#' col dimensions of the matrix. The third column provides values for
#' the matrix.
#'
#' @param data data.frame: input data
#' @param rowtitle string: row dimension; name of the column in data whose distinct values should be used as row names in the output matrix
#' @param coltitle string: col dimension; name of the column in data whose distinct values should be used as column names in the output matrix
#' @param datatitle string: name of the column in data whose values should be filled into the output matrix
#' @param rowdecreasing logical: should the row names be in ascending (FALSE) or descending (TRUE) order?
#' @param coldecreasing logical: should the col names be in ascending (FALSE) or descending (TRUE) order?
#' @param default_value numeric: default value for matrix entries that have no corresponding value in the data.frame
#' @return matrix: matrix containing the values of data[[datatitle]], with rownames data[[rowtitle]] and colnames data[[coltitle]]
#' @author Daniel Neumann
#' @date 2017-08-29
data.frame2matrix = function(data, rowtitle, coltitle, datatitle,
rowdecreasing = FALSE, coldecreasing = FALSE,
default_value = NA) {
# check, whether titles exist as columns names in the data.frame data
if ( (!(rowtitle%in%names(data)))
|| (!(coltitle%in%names(data)))
|| (!(datatitle%in%names(data))) ) {
stop('data.frame2matrix: bad row-, col-, or datatitle.')
}
# get number of rows in data
ndata = dim(data)[1]
# extract rownames and colnames for the matrix from the data.frame
rownames = sort(unique(data[[rowtitle]]), decreasing = rowdecreasing)
nrows = length(rownames)
colnames = sort(unique(data[[coltitle]]), decreasing = coldecreasing)
ncols = length(colnames)
# initialize the matrix
out_matrix = matrix(NA,
nrow = nrows, ncol = ncols,
dimnames=list(rownames, colnames))
# iterate rows of data
for (i1 in 1:ndata) {
# get matrix-row and matrix-column indices for the current data-row
iR = which(rownames==data[[rowtitle]][i1])
iC = which(colnames==data[[coltitle]][i1])
# throw an error if the matrix entry (iR,iC) is already filled.
if (!is.na(out_matrix[iR, iC])) stop('data.frame2matrix: double entry in data.frame')
out_matrix[iR, iC] = data[[datatitle]][i1]
}
# set empty matrix entries to the default value
out_matrix[is.na(out_matrix)] = default_value
# return matrix
return(out_matrix)
}
How it works:
myData = as.data.frame(list('dim1'=c('x', 'x', 'x', 'y','y','y'),
'dim2'=c('a','b','c','a','b','c'),
'values'=c(1,2,3,3,3,2)))
myMatrix = data.frame2matrix(myData, 'dim1', 'dim2', 'values')
myMatrix
> a b c
> x 1 2 3
> y 3 3 2
base R, unstack
unstack(df, V3 ~ V2)
# a b c
# 1 1 2 3
# 2 3 3 2
This may not be a general solution but works well in this case.
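If the row labels matter (for example for heatmap()), they can be re-attached, since unstack() drops them. A small sketch (an addition) using the df defined under data below:
m <- as.matrix(unstack(df, V3 ~ V2))
rownames(m) <- unique(as.character(df$V1))   # assumes the rows are grouped by V1, as here
m
#   a b c
# x 1 2 3
# y 3 3 2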
data
df<-structure(list(V1 = structure(c(1L, 1L, 1L, 2L, 2L, 2L), .Label = c("x",
"y"), class = "factor"), V2 = structure(c(1L, 2L, 3L, 1L, 2L,
3L), .Label = c("a", "b", "c"), class = "factor"), V3 = c(1L,
2L, 3L, 3L, 3L, 2L)), .Names = c("V1", "V2", "V3"), class = "data.frame", row.names = c(NA,
-6L))
For the sake of completeness, there's also a tapply() solution.
with(d, tapply(z, list(x, y), sum))
# a b c
# x 1 2 3
# y 3 3 2
Data
d <- structure(list(x = structure(c(1L, 1L, 1L, 2L, 2L, 2L), .Label = c("x",
"y"), class = "factor"), y = structure(c(1L, 2L, 3L, 1L, 2L,
3L), .Label = c("a", "b", "c"), class = "factor"), z = c(1, 2,
3, 3, 3, 2)), class = "data.frame", row.names = c(NA, -6L))
As of tidyr 0.8.3.9000, a new function called pivot_wider() has been introduced. It is basically an upgraded version of the previous spread() function, which is no longer under active development. From the pivoting vignette:
This vignette describes the use of the new pivot_longer() and
pivot_wider() functions. Their goal is to improve the usability of
gather() and spread(), and incorporate state-of-the-art features found
in other packages.
For some time, it’s been obvious that there is something fundamentally
wrong with the design of spread() and gather(). Many people don’t find
the names intuitive and find it hard to remember which direction
corresponds to spreading and which to gathering. It also seems
surprisingly hard to remember the arguments to these functions,
meaning that many people (including me!) have to consult the
documentation every time.
How to use it (using the data from @Aaron):
pivot_wider(data = tmp, names_from = y, values_from = z)
x a b c
<fct> <dbl> <dbl> <dbl>
1 x 1 2 3
2 y 3 3 2
Or in a "full" tidyverse fashion:
tmp %>%
pivot_wider(names_from = y, values_from = z)
The tidyr package from the tidyverse has an excellent function that does this.
Assuming your variables are named v1, v2, and v3, left to right, and your data frame is named dat:
dat %>%
spread(key = v2,
value = v3)
Ta da!
Related
I have to apologize in advance if the question is very basic as I am still new to R. I have tried to look on stackoverflow for similar questions, but I still can't resolve the problem that I am facing.
I am currently working on a large dataset X. What I am trying to do is pretty simple. I want to replace all NAs in selected columns (non-consecutive columns) with "no".
I firstly have created a variable including all the columns that I want to modify. For instance, if I want to modify the NAs in columns named "m","l" and "h", I wrote the following:
modify <- c("m","l","h")
for (i in 1:length(modify))
column <- modify[i]
X$column <- as.character(X$column) #X is my dataframe
X$column %>% replace_na("no")
This loop returned output only for the "m" column, which is the first variable in my modify vector. However, even after the loop generated that output, when I checked X$m, nothing had changed in my original dataset.
I also tried to create a function, which is very similar to the loop. Even though no error message was generated, it didn't work, as I do not know what the return value should be.
Why can't the loop be applied to my entire dataset when the individual steps in the loop work?
Thank you so so much for your help!
This might help; it is similar to one of the answers here, but slightly different in that it uses all_of():
library(tidyverse)
df <- tibble(x = c(1, 2, NA), y = c("a", NA, "b"))
df
#> # A tibble: 3 × 2
#> x y
#> <dbl> <chr>
#> 1 1 a
#> 2 2 <NA>
#> 3 NA b
modify <- c("x","y")
df %>%
mutate(
across(all_of(modify), ~replace_na(.x, 0))
)
#> # A tibble: 3 × 2
#> x y
#> <dbl> <chr>
#> 1 1 a
#> 2 2 0
#> 3 0 b
Created on 2021-09-22 by the reprex package (v2.0.1)
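For the question as literally asked (replacing NA with the string "no" rather than 0), the columns need to be character first, since "no" cannot be stored in a numeric column. A sketch along the same lines, reusing the df and modify objects from above:
df %>%
  mutate(
    across(all_of(modify), ~ replace_na(as.character(.x), "no"))
  )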
Here's a base R approach modifying data from @scrameri.
df <- data.frame(x = c(1, 2, NA), y = c("a", NA, "b"), c = c(1, NA, 5))
modify <- c('x', 'y')
df[modify][is.na(df[modify])] <- 'no'
df
# x y c
#1 1 a 1
#2 2 no NA
#3 no b 5
I'm going to fix your code with as few changes as possible, so you can learn.
There are two big problems. First, the for loop needs to have curly braces {} around the lines you want to loop over. Second, if you want to reference variables in a data frame dynamically, you can't use the $ operator. You have to use double brackets [[]].
library(tidyr)
X <- data.frame(m = c(1, 2, NA), l = c("a", NA, "b"), h = c(1, NA, 5))
modify <- c("m","l","h")
for (i in seq_along(modify)) {
column <- modify[i]
X[[column]] <- as.character(X[[column]]) #X is my dataframe
X[[column]] <- X[[column]] %>% replace_na("no")
}
X
# m l h
# 1 1 a 1
# 2 2 no no
# 3 no b 5
You can do what you were trying to do much more efficiently, as shown in the other answers. But I wanted to show you how to do it the way you were trying, to correct your understanding of for loops and the subset operator. These are basic things that everyone should understand when first learning R.
You might want to go through a beginners tutorial to solidify your understanding. I used tutorialspoint when I was first learning and found it useful.
We could do this efficiently with set from data.table
library(data.table)
setDT(X)
for(nm in modify) {
set(X, i = NULL, j= nm, value = as.character(X[[nm]]))
set(X, i = which(is.na(X[[nm]])), j = nm, value = 'no')
}
Output:
> X
m l h i
1: 1 a 1 NA
2: 2 no no 5
3: no b 5 6
data
X <- data.frame(m = c(1, 2, NA), l = c("a", NA, "b"),
h = c(1, NA, 5), i = c(NA, 5, 6))
modify <- c("m","l","h")
I need some help creating a matrix. I have a large dataset with multiple groups. Each group is sorted into cases and non cases.
For ex.
Group     Cases   Noncases
GroupA    4       7
GroupB    9       4
GroupC    10      3
I want to create a matrix which will compare one group to the sum of the other groups.
For instance:
Disease Category   GroupA   NotGroupA
Case               4        19
Noncase            7        7
The goal is to set up a matrix which will allow me to run a chisquare test and/or a Fisher's exact test (depending on sample size).
I have tried the following code to extrapolate values from my dataframe into a matrix:
GroupA <- as.table(matrix(c(df[1,3], df[1,4], (sum(df$group_cases)-df$group_cases[1])), (sum(df$Noncases)-df$Noncases[1])), nrow=2, ncol=2,
dimnames=list(Group= c("A", "Other"),
Case = c(1, 0)))
However, I get the following error:
Warning message:
In matrix(c(df[1, 3], df[1, 4], (sum(df$group_cases) - :
data length [3] is not a sub-multiple or multiple of the number of rows [329]
It outputs a 329 row list instead of a 2 by 2 matrix.
Because I have many groups, I want R to calculate the values for me when constructing the matrix. I don't want to calculate the "NotGroup_" column separately, as that leaves room for human error.
How would you all recommend constructing this matrix, and is it possible to have R calculate the sums of columns/subtract values while creating a matrix?
Thank you for your help!
dplyr
library(dplyr)
library(tidyr) # pivot_*
dat %>%
mutate(Group = ifelse(Group == "GroupA", "GroupA", "NotGroupA")) %>%
pivot_longer(-Group, names_to = c("Case")) %>%
pivot_wider(id_cols = Case, names_from = Group, values_from = value, values_fn = list(value = sum))
# # A tibble: 2 x 3
# Case GroupA NotGroupA
# <chr> <int> <int>
# 1 Cases 4 19
# 2 Noncases 7 7
base R
dat2 <- transform(dat, Group = ifelse(Group == "GroupA", "GroupA", "NotGroupA"))
aggregate(. ~ Group, data = dat2, FUN = sum)
# Group Cases Noncases
# 1 GroupA 4 7
# 2 NotGroupA 19 7
(though the axes are reversed)
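If you want the Case/Noncase-by-group orientation shown in the question, the aggregated result can simply be transposed. A small sketch (an addition to the answer above):
agg <- aggregate(. ~ Group, data = dat2, FUN = sum)
m <- t(as.matrix(agg[-1]))       # rows become Cases/Noncases
colnames(m) <- agg$Group
m
#          GroupA NotGroupA
# Cases         4        19
# Noncases      7         7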
Data
dat <- structure(list(Group = c("GroupA", "GroupB", "GroupC"), Cases = c(4L, 9L, 10L), Noncases = c(7L, 4L, 3L)), class = "data.frame", row.names = c(NA, -3L))
A related link provides many ways to "summarize by group": Calculate the mean by group
Set up example:
dd <- data.frame(Group = LETTERS[1:3], Cases = c(4, 9, 10),
Noncases = c(7,4,3))
Function:
mktab <- function(focal, data) {
## subset rows according to whether $Group == focal or not
## subset cols according to "Cases"/"Noncases"
## sum() the not-focal elements
matrix(c(data[data$Group==focal, "Cases"],
sum(data[data$Group!=focal, "Cases"]),
data[data$Group==focal, "Noncases"],
sum(data[data$Group!=focal, "Noncases"])
),
nrow = 2,
byrow=TRUE,
dimnames = list(c("Case", "Noncase"),
c(focal, paste0("not_", focal)))
)
}
mktab("A", dd)
Results:
A not_A
Case 4 19
Noncase 7 7
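Since the stated goal is a chi-square and/or Fisher's exact test, the resulting 2x2 matrix can be passed to those functions directly:
tab <- mktab("A", dd)
chisq.test(tab)    # chi-square test on the 2x2 table
fisher.test(tab)   # Fisher's exact test on the same table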
I'm trying to figure out how to replace rows in one dataframe with another by matching the values of one of the columns. Both dataframes have the same column names.
Ex:
df1 <- data.frame(x = c(1,2,3,4), y = c("a", "b", "c", "d"))
df2 <- data.frame(x = c(1,2), y = c("f", "g"))
Is there a way to replace the rows of df1 with the same row in df2 where they share the same x variable? It would look like this.
data.frame(x = c(1,2,3,4), y = c("f","g","c","d"))
I've been working on this for a while and this is the closest I've gotten -
df1[which(df1$x %in% df2$x),]$y <- df2[which(df1$x %in% df2$x),]$y
But it just replaces the values with NA.
Does anyone know how to do this?
We can use match():
inds <- match(df1$x, df2$x)
df1$y[!is.na(inds)] <- df2$y[na.omit(inds)]
df1
# x y
#1 1 f
#2 2 g
#3 3 c
#4 4 d
First off, well done in producing a nice reproducible example that's directly copy-pastable. That always helps, especially with an example of expected output. Nice one!
You have several options, but lets look at why your solution doesn't quite work:
First of all, I tried copy-pasting your last line into a new session and got the dreaded factor-error:
Warning message:
In `[<-.factor`(`*tmp*`, iseq, value = 1:2) :
invalid factor level, NA generated
If we look at your data frames df1 and df2 with the str function, you will see that they do not contain text but factors. These are not text - in short, they represent categorical data (male vs. female, scores A, B, C, D, and F, etc.) and are really integers that carry text labels. So that could be your issue.
Running your code gives a warning because you are trying to import new factors (labels) into df1 that don't exist. And R doesn't know what to do with them, so it just inserts NA-values.
As r2evans answered, he used stringsAsFactors = FALSE to stop strings being converted to factors - you can even go as far as disabling it on a session-wide basis with options(stringsAsFactors=FALSE) (and I've heard it will be the default in the forthcoming R 4.0 - yay!).
After disabling stringsAsFactors, your code works - or does it? Try this on for size:
df2 <- df2[c(2,1),]
df1[which(df1$x %in% df2$x),]$y <- df2[which(df1$x %in% df2$x),]$y
What's in df1 now? Not quite right anymore.
In the first line, I swapped the two rows in df2 and lo and behold, the replaced values in df1 were swapped. Why is that?
Let's deconstruct your statement df2[which(df1$x %in% df2$x),]$y
Calling df1$x %in% df2$x returns a logical (boolean) vector indicating which elements in df1$x are found in df2 - i.e. the first two and not the last two. But it doesn't relate which positions in the first vector correspond to which in the second.
Calling which(df1$x %in% df2$x) then reduces the logical vector to the indices that were TRUE. Again, we do not know which elements correspond to which.
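To make that concrete, here is a small sketch (mirroring the match() answer earlier in this thread) with df2's rows deliberately swapped:
df1 <- data.frame(x = c(1, 2, 3, 4), y = c("a", "b", "c", "d"), stringsAsFactors = FALSE)
df2 <- data.frame(x = c(2, 1), y = c("g", "f"), stringsAsFactors = FALSE)   # rows swapped
df1$x %in% df2$x              # TRUE TRUE FALSE FALSE: says which rows match, not where
inds <- match(df1$x, df2$x)   # 2 1 NA NA: also says where each match sits in df2
df1$y[!is.na(inds)] <- df2$y[na.omit(inds)]
df1$y                         # "f" "g" "c" "d" -- correct despite the swapped order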
For solutions, I would recommend r2evans, as it doesn't rely on extra packages (although data.table or dplyr are two powerful packages to get to know).
In his solution, he uses merge to perform a "full join" which matches rows based on the value, rather than - well, what you did. With transform, he assigns new variables within the context of the data.frame returned by the merge() call in the first argument.
I think what you need here is a "merge" or "join" operation.
(I add stringsAsFactors=FALSE to the frames so that the merging and later work is without any issue, as factors can be disruptive sometimes.)
Base R:
df1 <- data.frame(x = c(1,2,3,4), y = c("a", "b", "c", "d"), stringsAsFactors = FALSE)
df2 <- data.frame(x = c(1,2), y = c("f", "g"), stringsAsFactors = FALSE)
merge(df1, df2, by = "x", all = TRUE)
# x y.x y.y
# 1 1 a f
# 2 2 b g
# 3 3 c <NA>
# 4 4 d <NA>
transform(merge(df1, df2, by = "x", all = TRUE), y = ifelse(is.na(y.y), y.x, y.y))
# x y.x y.y y
# 1 1 a f f
# 2 2 b g g
# 3 3 c <NA> c
# 4 4 d <NA> d
transform(merge(df1, df2, by = "x", all = TRUE), y = ifelse(is.na(y.y), y.x, y.y), y.x = NULL, y.y = NULL)
# x y
# 1 1 f
# 2 2 g
# 3 3 c
# 4 4 d
Dplyr:
library(dplyr)
full_join(df1, df2, by = "x") %>%
mutate(y = coalesce(y.y, y.x)) %>%
select(-y.x, -y.y)
# x y
# 1 1 f
# 2 2 g
# 3 3 c
# 4 4 d
A join option with data.table where we join on the 'x' column, assign the values of 'y' in second dataset (i.y) to the first one with :=
library(data.table)
setDT(df1)[df2, y := i.y, on = .(x)]
NOTE: It is better to use stringsAsFactors = FALSE (in R 4.0.0 it is the default anyway), or else we need to have all the levels common in both datasets.