In Wickham's Tidy Data PDF, there is an example of going from messy to tidy data.
I wonder where the code is?
For example, what code is used to go from
Table 1: Typical presentation dataset.
to
Table 3: The same data as in Table 1 but with variables in columns and observations in rows.
Perhaps melt or cast. But from http://www.statmethods.net/management/reshape.html I can't see how.
(Note to self: Need it for GDPpercapita...)
The answer sort of depends on what the structure of your data is. In the paper you linked to, Hadley was writing about the "reshape" and "reshape2" packages.
It's ambiguous what the data structure is in "Table 1". Judging by the description, it would sound like a matrix with named dimnames (like I show in mymat). In that case, a simple melt would work:
library(reshape2)
melt(mymat)
# Var1 Var2 value
# 1 John Smith treatmenta —
# 2 Jane Doe treatmenta 16
# 3 Mary Johnson treatmenta 3
# 4 John Smith treatmentb 2
# 5 Jane Doe treatmentb 11
# 6 Mary Johnson treatmentb 1
If it were not a matrix, but a data.frame with row.names, you can still use the matrix method by using something like melt(as.matrix(mymat)).
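For intuition, here is a minimal base R sketch of what melting a matrix amounts to: every cell becomes one row, labelled by its row name and column name. The matrix m below is made up for illustration (with NA standing in for the missing cell), and the column names Var1/Var2/value are simply chosen to mirror the melt output above; this is not how reshape2 implements it.

```r
# Small character matrix with named rows and columns (illustrative values).
m <- matrix(c(NA, "16", "3", "2", "11", "1"), nrow = 3,
            dimnames = list(c("John Smith", "Jane Doe", "Mary Johnson"),
                            c("treatmenta", "treatmentb")))

# Matrices are stored column by column, so repeat the row names once per
# column and each column name once per row to label every cell.
long <- data.frame(
  Var1  = rep(rownames(m), times = ncol(m)),
  Var2  = rep(colnames(m), each = nrow(m)),
  value = as.vector(m),
  stringsAsFactors = FALSE
)
print(long)
```

The result has one row per (person, treatment) pair, in the same order as the melt output shown earlier.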
If, on the other hand, the "names" are a column in a data.frame (as they are in the "tidyr" vignette), you need to specify either the id.vars or the measure.vars so that melt knows how to treat the columns.
melt(mydf, id.vars = "name")
# name variable value
# 1 John Smith treatmenta —
# 2 Jane Doe treatmenta 16
# 3 Mary Johnson treatmenta 3
# 4 John Smith treatmentb 2
# 5 Jane Doe treatmentb 11
# 6 Mary Johnson treatmentb 1
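If you would rather not load a package, base R's reshape() can do roughly the same wide-to-long step. This is only a sketch: the data frame wide is a made-up numeric version of the table (NA for the missing cell), and the output column names variable and value are chosen here to mirror the melt output.

```r
# Illustrative wide data: one row per person, one column per treatment.
wide <- data.frame(
  name       = c("John Smith", "Jane Doe", "Mary Johnson"),
  treatmenta = c(NA, 16, 3),
  treatmentb = c(2, 11, 1),
  stringsAsFactors = FALSE
)

# Stack the two treatment columns into a single 'value' column and record
# which column each value came from in 'variable'.
long <- reshape(wide, direction = "long",
                varying = c("treatmenta", "treatmentb"),
                v.names = "value",
                timevar = "variable",
                times   = c("treatmenta", "treatmentb"),
                idvar   = "name")
rownames(long) <- NULL  # drop the generated "name.time" row names
print(long)
```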
The new kid on the block is "tidyr". The "tidyr" package works with data.frames because it is often used in conjunction with dplyr. I won't reproduce the code for "tidyr" here, because that is sufficiently covered in the vignette.
Sample data:
mymat <- structure(c("—", "16", "3", " 2", "11", " 1"), .Dim = c(3L,
2L), .Dimnames = list(c("John Smith", "Jane Doe", "Mary Johnson"
), c("treatmenta", "treatmentb")))
mydf <- structure(list(name = structure(c(2L, 1L, 3L), .Label = c("Jane Doe",
"John Smith", "Mary Johnson"), class = "factor"), treatmenta = c("—",
"16", "3"), treatmentb = c(2L, 11L, 1L)), .Names = c("name",
"treatmenta", "treatmentb"), row.names = c(NA, 3L), class = "data.frame")
I have two data frames, each with two columns that I'd like to compare. I want to output the rows that appear only in the first data frame, where rows are compared on the combination (interaction) of the two columns.
I've tried using merge, %in%, Interaction, match, and I can't seem to get the correct output. I've also searched extensively on SO and am not finding a similar problem.
The closest response I've found is:
newdat <- match(interaction(dfA$colA, dfA$colB), interaction(dfB$colA, dfB$colB))
But obviously this code isn't correct: even if it worked, it would give me what is common between the data frames, whereas I want the difference between them. (It also misbehaves: it generates a numeric vector, even though both colA and colB are strings.)
Example data:
#Dataframe A
colA colB
Aspirin Smith, John
Aspirin Doe, Jane
Atorva Smith, John
Simva Doe, Jane
#Dataframe B
colA colB
Aspirin Smith, John
Aspirin Doe, Jane
Atorva Doe, Jane
## GOAL:
#Dataframe
colA colB
Atorva Smith, John
Simva Doe, Jane
Thanks!
We can use setdiff from the dplyr package.
library(dplyr)
setdiff(datA, datB)
# colA colB
# 1 Atorva Smith, John
# 2 Simva Doe, Jane
DATA
datA <- read.table(text = " colA colB
Aspirin 'Smith, John'
Aspirin 'Doe, Jane'
Atorva 'Smith, John'
Simva 'Doe, Jane'",
header = TRUE, stringsAsFactors = FALSE)
datB <- read.table(text = " colA colB
Aspirin 'Smith, John'
Aspirin 'Doe, Jane'
Atorva 'Doe, Jane'",
header = TRUE, stringsAsFactors = FALSE)
If you want a base R solution, it's easy to write a setdiffDF function.
setdiffDF <- function(x, y){
ix <- !duplicated(rbind(y, x))[nrow(y) + 1:nrow(x)]
x[ix, ]
}
setdiffDF(dfA, dfB)
# colA colB
#3 Atorva Smith, John
#4 Simva Doe, Jane
Data in dput format.
dfA <-
structure(list(colA = structure(c(1L, 1L, 2L, 3L),
.Label = c("Aspirin", "Atorva", "Simva"), class = "factor"),
colB = structure(c(2L, 1L, 2L, 1L), .Label = c("Doe, Jane",
"Smith, John"), class = "factor")), class = "data.frame",
row.names = c(NA, -4L))
dfB <-
structure(list(colA = structure(c(1L, 1L, 2L),
.Label = c("Aspirin", "Atorva"), class = "factor"),
colB = structure(c(2L, 1L, 1L), .Label = c("Doe, Jane",
"Smith, John"), class = "factor")), class = "data.frame",
row.names = c(NA, -3L))
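Another base R sketch, closer in spirit to the interaction()/match() attempt in the question: build one key per row from both columns, then keep the rows of the first data frame whose key does not occur in the second. The separator "\r" is an assumption (any string that cannot appear in the data would do), and the data frames are rebuilt here as plain character columns so the block runs on its own.

```r
dfA <- data.frame(
  colA = c("Aspirin", "Aspirin", "Atorva", "Simva"),
  colB = c("Smith, John", "Doe, Jane", "Smith, John", "Doe, Jane"),
  stringsAsFactors = FALSE
)
dfB <- data.frame(
  colA = c("Aspirin", "Aspirin", "Atorva"),
  colB = c("Smith, John", "Doe, Jane", "Doe, Jane"),
  stringsAsFactors = FALSE
)

# One key per row; the separator is assumed never to occur in the data.
keyA <- paste(dfA$colA, dfA$colB, sep = "\r")
keyB <- paste(dfB$colA, dfB$colB, sep = "\r")

# %in% returns TRUE/FALSE per row of dfA (match() would return positions),
# so negating it selects the rows that exist only in dfA.
only_in_A <- dfA[!(keyA %in% keyB), ]
print(only_in_A)
```

This keeps the original row names (3 and 4 here), just like setdiffDF above.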
Data set1:
ID Name Territory Sales
1 Richard NY 59
8 Sam California 44
Data set2:
Terr ID Name Comments
LA 5 Rick yes
MH 11 Oly no
I want the final data set to have only the columns of the first data set, recognizing that Terr is the same as Territory and not carrying the Comments column forward.
Final data should look like:
ID Name Territory Sales
1 Richard NY 59
8 Sam California 44
5 Rick LA NA
11 Oly MH NA
Thanks in advance
A possible solution:
# create a named vector with names from 'set2'
# with the positions of the matching columns in 'set1'
nms2 <- sort(unlist(sapply(names(set2), agrep, x = names(set1))))
# only keep the columns in 'set2' for which a match is found
# and give them the same names as in 'set1'
set2 <- setNames(set2[names(nms2)], names(set1[nms2]))
# bind the two dataset together
# option 1:
library(dplyr)
bind_rows(set1, set2)
# option 2:
library(data.table)
rbindlist(list(set1, set2), fill = TRUE)
which gives (dplyr-output shown):
ID Name Territory Sales
1 1 Richard NY 59
2 8 Sam California 44
3 5 Rick LA NA
4 11 Oly MH NA
Used data:
set1 <- structure(list(ID = c(1L, 8L),
Name = c("Richard", "Sam"),
Territory = c("NY", "California"),
Sales = c(59L, 44L)),
.Names = c("ID", "Name", "Territory", "Sales"), class = "data.frame", row.names = c(NA, -2L))
set2 <- structure(list(Terr = c("LA", "MH"),
ID = c(5L, 11L),
Name = c("Rick", "Oly"),
Comments = c("yes", "no")),
.Names = c("Terr", "ID", "Name", "Comments"), class = "data.frame", row.names = c(NA, -2L))
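If fuzzy matching of column names feels too implicit, a base R sketch with an explicit mapping may be easier to audit. The rename_map vector is hypothetical (you would list your own old-name/new-name pairs); columns missing from the second set are filled with NA, and extra columns such as Comments are dropped.

```r
set1 <- data.frame(ID = c(1L, 8L), Name = c("Richard", "Sam"),
                   Territory = c("NY", "California"), Sales = c(59L, 44L),
                   stringsAsFactors = FALSE)
set2 <- data.frame(Terr = c("LA", "MH"), ID = c(5L, 11L),
                   Name = c("Rick", "Oly"), Comments = c("yes", "no"),
                   stringsAsFactors = FALSE)

# Hypothetical explicit mapping: names in set2 -> names in set1.
rename_map <- c(Terr = "Territory")
hit <- names(set2) %in% names(rename_map)
names(set2)[hit] <- unname(rename_map[names(set2)[hit]])

# Add any set1 columns that set2 lacks (here: Sales) as NA.
missing_cols <- setdiff(names(set1), names(set2))
set2[missing_cols] <- NA

# Keep only set1's columns, in set1's order, then stack the rows.
combined <- rbind(set1, set2[names(set1)])
print(combined)
```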
I have this matrix below and the apply loop changes the row names to numbers.
This is matrix:
treatmenta treatmentb
John Smith NA " 2"
John Doe "16" "11"
Mary Johnson " 3" " 1"
and this code: as.matrix(apply(y, 2, as.numeric))
The result is this, but I want the row names to be the people's names:
treatmenta treatmentb
[1,] NA 2
[2,] 16 11
[3,] 3 1
Converting to data.table also does not work. How do I do this?
Here is code to reproduce data:
name <- c("John Smith", "John Doe", "Mary Johnson")
treatmenta <- c("NA", "16", "3")
treatmentb <- c("2", "11", "1")
y <- data.frame(name, treatmenta, treatmentb)
rownames(y) <- y[,1]
y[,1] <- NULL
We can do
y <- `dimnames<-`(`dim<-`(as.numeric(y), dim(y)), dimnames(y))
y
# treatmenta treatmentb
#John Smith NA 2
#John Doe 16 11
#Mary Johnson 3 1
Or a compact option is
class(y) <- "numeric"
data
y <- structure(c(NA, "16", " 3", " 2", "11", " 1"), .Dim = c(3L, 2L
), .Dimnames = list(c("John Smith", "John Doe", "Mary Johnson"
), c("treatmenta", "treatmentb")))
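As for why the names disappear in the first place: as.numeric() returns a bare vector without names, so apply() only has the column names left to put on its result. A small sketch of two ways around that, using the same matrix as above (the variable names y_num and y_mode are just for illustration):

```r
y <- matrix(c(NA, "16", " 3", " 2", "11", " 1"), nrow = 3,
            dimnames = list(c("John Smith", "John Doe", "Mary Johnson"),
                            c("treatmenta", "treatmentb")))

# Way 1: keep using apply(), then copy the row names back.
y_num <- apply(y, 2, as.numeric)
rownames(y_num) <- rownames(y)

# Way 2: change the mode in place; dim and dimnames are preserved.
y_mode <- y
mode(y_mode) <- "numeric"

print(y_num)
```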
You are going from a more general data structure (a data frame) to a matrix (a vector with a dim attribute), which can hold only one type. Any base method that converts your data to a matrix, as.matrix included, therefore coerces all columns to a common type: either everything becomes character, or everything is forced to numeric and the name column turns into NAs (depending on how you call as.matrix).
Having said that, if for some reason you still have to use matrix form then use this for better readability:
treatmenta <- c("1", "16", "3")
treatmentb <- c("2", "11", "1")
name <- c("John Smith", "John Doe", "Mary Johnson")
# convert each column to numeric and bind them into a matrix
y <- cbind(treatmenta = as.numeric(treatmenta), treatmentb = as.numeric(treatmentb))
# is.matrix(y)
# [1] TRUE
# attach the names as row names instead of keeping them as a column
rownames(y) <- name
y
# treatmenta treatmentb
# John Smith 1 2
# John Doe 16 11
# Mary Johnson 3 1
I am trying to follow the following example for the reshape package but am getting an error
smithsm <- melt(smiths)
smithsm
subject variable value
1 John Smith time 1.00
2 Mary Smith time 1.00
3 John Smith age 33.00
4 Mary Smith age NA
5 John Smith weight 90.00
6 Mary Smith weight NA
7 John Smith height 1.87
8 Mary Smith height 1.54
cast(smithsm, time + subject ~ variable)
This gives the error "Error: Casting formula contains variables not found in molten data: time". Does anyone know what is causing this error? The above is taken word for word from an example.
Thanks!
The smithsm dataset doesn't have a time column, and it is not clear what the expected wide form is. Perhaps this helps:
library(reshape2)
dcast(smithsm, subject~variable, value.var='value')
# subject age height time weight
#1 John Smith 33 1.87 1 90
#2 Mary Smith NA 1.54 1 NA
data
smithsm <- structure(list(subject = c("John Smith", "Mary Smith", "John Smith",
"Mary Smith", "John Smith", "Mary Smith", "John Smith", "Mary Smith"
), variable = c("time", "time", "age", "age", "weight", "weight",
"height", "height"), value = c(1, 1, 33, NA, 90, NA, 1.87, 1.54
)), .Names = c("subject", "variable", "value"), class = "data.frame",
row.names = c("1", "2", "3", "4", "5", "6", "7", "8"))
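To see the cause of the error directly, you can compare the variables named in the casting formula with the columns of the molten data. The sketch below uses only base R; the reshape() call at the end is a package-free alternative for the wide form, not what the reshape example itself does, and it prefixes the new columns with "value.".

```r
smithsm <- data.frame(
  subject  = rep(c("John Smith", "Mary Smith"), times = 4),
  variable = rep(c("time", "age", "weight", "height"), each = 2),
  value    = c(1, 1, 33, NA, 90, NA, 1.87, 1.54),
  stringsAsFactors = FALSE
)

# Variables used in the formula that are not columns of the molten data.
# Here "time" is a value inside the 'variable' column, not a column itself.
missing_vars <- setdiff(all.vars(time + subject ~ variable), names(smithsm))
print(missing_vars)

# Base R wide form: one row per subject, one column per level of 'variable'.
wide <- reshape(smithsm, direction = "wide",
                idvar = "subject", timevar = "variable")
print(wide)
```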
I have a variable-length list of people that I get as one long row in a data frame, and I would like to reorganize these records into a more meaningful format.
My raw data looks like this,
df <- data.frame(name1 = "John Doe", email1 = "John#Doe.com", phone1 = "(444) 444-4444", name2 = "Jane Doe", email2 = "Jane#Doe.com", phone2 = "(444) 444-4445", name3 = "John Smith", email3 = "John#Smith.com", phone3 = "(444) 444-4446", name4 = NA, email4 = "Jane#Smith.com", phone4 = NA, name5 = NA, email5 = NA, phone5 = NA)
df
# name1 email1 phone1 name2 email2 phone2
# 1 John Doe John#Doe.com (444) 444-4444 Jane Doe Jane#Doe.com (444) 444-4445
# name3 email3 phone3 name4 email4 phone4 name5
# 1 John Smith John#Smith.com (444) 444-4446 NA Jane#Smith.com NA NA
# email5 phone5
# 1 NA NA
and I am trying to bend it into a format like this,
df_transform <- structure(list(name = structure(c(2L, 1L, 3L, NA, NA), .Label = c("Jane Doe",
"John Doe", "John Smith"), class = "factor"), email = structure(c(3L,
1L, 4L, 2L, NA), .Label = c("Jane#Doe.com", "Jane#Smith.com",
"John#Doe.com", "John#Smith.com"), class = "factor"), phone = structure(c(1L,
2L, 3L, NA, NA), .Label = c("(444) 444-4444", "(444) 444-4445",
"(444) 444-4446"), class = "factor")), .Names = c("name", "email",
"phone"), class = "data.frame", row.names = c(NA, -5L))
df_transform
# name email phone
# 1 John Doe John#Doe.com (444) 444-4444
# 2 Jane Doe Jane#Doe.com (444) 444-4445
# 3 John Smith John#Smith.com (444) 444-4446
# 4 <NA> Jane#Smith.com <NA>
# 5 <NA> <NA> <NA>
I should add that it's not always five records; it could be any number between 1 and 99. I tried reshape2's melt and t(), but it got way too complicated. I imagine there is some known method that I simply do not know about.
You're on the right track, try this:
library(reshape2)
# melt it down
df.melted = melt(t(df))
# get rid of the numbers at the end
df.melted$Var1 = sub('[0-9]+$', '', df.melted$Var1)
# cast it back
dcast(df.melted, (seq_len(nrow(df.melted)) - 1) %/% 3 ~ Var1)[,-1]
# email name phone
#1 John#Doe.com John Doe (444) 444-4444
#2 Jane#Doe.com Jane Doe (444) 444-4445
#3 John#Smith.com John Smith (444) 444-4446
#4 Jane#Smith.com <NA> <NA>
#5 <NA> <NA> <NA>
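The left-hand side of that dcast formula is just a record index: integer division of the zero-based row position by the number of fields per person. A tiny sketch of that arithmetic, assuming three fields per record as in the example (n_fields and record are illustrative names):

```r
n_fields <- 3          # name, email, phone
idx <- seq_len(9)      # row positions of the melted data for three people

# Positions 1-3 map to record 0, 4-6 to record 1, 7-9 to record 2.
record <- (idx - 1) %/% n_fields
print(record)
```

This relies on the columns arriving in a fixed order with every record complete, which holds for the data in the question.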
1) reshape() First we strip the digits off the column names, giving the reduced column names, names0. Then we split the columns into groups, producing g (which has three components corresponding to the email, name and phone column groups). Then we use reshape (from base R) to perform the wide-to-long transformation and select the desired columns from the resulting long data frame, in order to exclude the columns that reshape adds automatically. That selection vector, unique(names0), also reorders the resulting columns in the desired way.
names0 <- sub("\\d+$", "", names(df))
g <- split(names(df), names0)
reshape(df, dir = "long", varying = g, v.names = names(g))[unique(names0)]
and the last line gives this:
name email phone
1.1 John Doe John#Doe.com (444) 444-4444
1.2 Jane Doe Jane#Doe.com (444) 444-4445
1.3 John Smith John#Smith.com (444) 444-4446
1.4 <NA> Jane#Smith.com <NA>
1.5 <NA> <NA> <NA>
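To make the intermediate objects concrete, here is a sketch of what names0 and g look like for a cut-down set of column names (two people instead of five; cols is an illustrative stand-in for names(df)):

```r
cols <- c("name1", "email1", "phone1", "name2", "email2", "phone2")

# Strip the trailing digits to get the field each column belongs to.
names0 <- sub("\\d+$", "", cols)

# Group the original column names by field; split() sorts the groups
# alphabetically (email, name, phone).
g <- split(cols, names0)
print(g)

# unique() keeps first-appearance order (name, email, phone), which is why
# it is used to put the final columns back in the original order.
print(unique(names0))
```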
2) reshape2 package Here is a solution using reshape2. We add a rowname column to df and melt it to long form. Then we split the variable column into the name portion (name, email, phone) and the numeric suffix portion which we call id. Finally we convert it back to wide form using dcast and select out the appropriate columns as we did before.
library(reshape2)
m <- melt(data.frame(rowname = 1:nrow(df), df), id = 1)
mt <- transform(m,
variable = sub("\\d+$", "", variable),
id = sub("^\\D+", "", variable)
)
dcast(mt, rowname + id ~ variable)[, unique(mt$variable)]
where the last line gives this:
name email phone
1 John Doe John#Doe.com (444) 444-4444
2 Jane Doe Jane#Doe.com (444) 444-4445
3 John Smith John#Smith.com (444) 444-4446
4 <NA> Jane#Smith.com <NA>
5 <NA> <NA> <NA>
3) Simple matrix reshaping. Remove the numeric suffixes from the column names and set cn to the unique remaining names (cn stands for column names). Then we merely reshape the df row into an n x length(cn) matrix, adding the column names.
cn <- unique(sub("\\d+$", "", names(df)))
matrix(as.matrix(df), nc = length(cn), byrow = TRUE, dimnames = list(NULL, cn))
name email phone
[1,] "John Doe" "John#Doe.com" "(444) 444-4444"
[2,] "Jane Doe" "Jane#Doe.com" "(444) 444-4445"
[3,] "John Smith" "John#Smith.com" "(444) 444-4446"
[4,] NA "Jane#Smith.com" NA
[5,] NA NA NA
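If you need this for an arbitrary number of records, the matrix approach wraps up naturally in a small function. This is a sketch only: wide_row_to_records is a made-up name, and it assumes a single-row data frame whose columns come in complete, consistently ordered groups (name1, email1, ..., name2, email2, ...).

```r
# Hypothetical helper: turn one wide row into one row per record.
wide_row_to_records <- function(df) {
  stopifnot(nrow(df) == 1)
  cn <- unique(sub("\\d+$", "", names(df)))
  # Every record must contribute the same number of fields.
  stopifnot(ncol(df) %% length(cn) == 0)
  m <- matrix(as.matrix(df), ncol = length(cn), byrow = TRUE,
              dimnames = list(NULL, cn))
  as.data.frame(m, stringsAsFactors = FALSE)
}

# Usage with a small made-up row of two records.
one_row <- data.frame(name1 = "A", email1 = "a@example.com",
                      name2 = "B", email2 = NA,
                      stringsAsFactors = FALSE)
rec <- wide_row_to_records(one_row)
print(rec)
```

Note that everything comes back as character, because the intermediate matrix can hold only one type.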
4) tapply This problem can also be solved with a simple tapply. As before, names0 is the column names without the numeric suffixes, and names.suffix is just the suffixes. Now use tapply:
names0 <- sub("\\d+$", "", names(df))
names.suffix <- sub("^\\D+", "", names(df))
tapply(as.matrix(df), list(names.suffix, names0), c)[, unique(names0)]
The last line gives:
name email phone
1 "John Doe" "John#Doe.com" "(444) 444-4444"
2 "Jane Doe" "Jane#Doe.com" "(444) 444-4445"
3 "John Smith" "John#Smith.com" "(444) 444-4446"
4 NA "Jane#Smith.com" NA
5 NA NA NA