Casting and Melting with reshape in R

As an example let's say that I have the following dataframe:
datas=data.frame(Variables=c("Power","Happiness","Power","Happiness"),
Country=c("France", "France", "UK", "UK"), y2000=c(1213,1872,1726,2234), y2001=c(1234,2345,6433,9082))
Resulting in the following output:
Variables Country y2000 y2001
1 Power France 1213 1234
2 Happiness France 1872 2345
3 Power UK 1726 6433
4 Happiness UK 2234 9082
I would like to reshape this dataframe as follows:
Year Country Power Happiness
1 2000 France 1213 1872
2 2001 France 1234 2345
3 2000 UK 1726 2234
4 2001 UK 6433 9082
I started out with:
q2=cast(datas, Country~Variables, value="2000")
But then got the following error:
Aggregation requires fun.aggregate: length used as default
Error in `[.data.frame`(sort_df(data, variables), , c(variables, "value"), :
undefined columns selected
Any suggestions?
Also: Would it matter for the solution that my dataframe is really big (417120 by 62)?

Perhaps you're interested in a tidyverse alternative
library(tidyverse)
df %>%
gather(Year, val, -Variables, -Country) %>%
spread(Variables, val)
# Country Year Happiness Power
#1 France 2000 1872 1213
#2 France 2001 2345 1234
#3 UK 2000 2234 1726
#4 UK 2001 9082 6433
Or using reshape2::melt and reshape2::dcast
reshape2::dcast(
reshape2::melt(df, id.vars = c("Country", "Variables"), variable.name = "Year"),
Country + Year ~ Variables)
# Country Year Happiness Power
#1 France 2000 1872 1213
#2 France 2001 2345 1234
#3 UK 2000 2234 1726
#4 UK 2001 9082 6433
Or (identically) using data.table::melt and data.table::dcast
data.table::dcast(
data.table::melt(df, id.vars = c("Country", "Variables"), variable.name = "Year"),
Country + Year ~ Variables)
# Country Year Happiness Power
#1 France 2000 1872 1213
#2 France 2001 2345 1234
#3 UK 2000 2234 1726
#4 UK 2001 9082 6433
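If adding a package is not an option, the same melt-then-cast idea can probably be sketched in base R with reshape(). This is only a sketch under assumptions: it uses the y2000/y2001 column names from the question's datas object, and the names long, wide and val are placeholders chosen here rather than anything taken from the answers above.

```r
# Data as defined in the question (year columns are named y2000 and y2001)
datas <- data.frame(
  Variables = c("Power", "Happiness", "Power", "Happiness"),
  Country   = c("France", "France", "UK", "UK"),
  y2000     = c(1213, 1872, 1726, 2234),
  y2001     = c(1234, 2345, 6433, 9082)
)

# Step 1 ("melt"): stack the two year columns into a Year/val pair
long <- reshape(datas, direction = "long",
                varying = c("y2000", "y2001"), v.names = "val",
                timevar = "Year", times = c(2000L, 2001L),
                idvar = c("Variables", "Country"))

# Step 2 ("cast"): spread Variables back out, one column per level
wide <- reshape(long, direction = "wide",
                idvar = c("Country", "Year"),
                timevar = "Variables", v.names = "val")

# Tidy up: drop the "val." prefix, sort, and reset the row names
names(wide) <- sub("^val\\.", "", names(wide))
wide <- wide[order(wide$Country, wide$Year), ]
rownames(wide) <- NULL
wide
```

Whether this is fast enough for a 417120 by 62 data frame is not something this sketch establishes; it is meant to show the two steps, not to compete with the packaged solutions.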
In terms of performance/runtime, I imagine the data.table or tidyr solutions to be the most efficient. You can check by running a microbenchmark on some larger sample data.
Sample data
df <-read.table(text =
" Variables Country 2000 2001
1 Power France 1213 1234
2 Happiness France 1872 2345
3 Power UK 1726 6433
4 Happiness UK 2234 9082", header = T)
colnames(df)[3:4] <- c("2000", "2001")
Benchmark analysis
The following results are from a microbenchmark analysis of the four methods, based on a (slightly) larger 78x22 sample dataset.
set.seed(2017)
df <- data.frame(
Variables = rep(c("Power", "Happiness", "something_else"), 26),
Country = rep(LETTERS[1:26], each = 3),
matrix(sample(10000, 20 * 26 * 3), nrow = 26 * 3))
colnames(df)[3:ncol(df)] <- 2000:2019
library(microbenchmark)
library(tidyr)
res <- microbenchmark(
reshape2 = {
reshape2::dcast(
reshape2::melt(df, id.vars = c("Country", "Variables"), variable.name = "Year"),
Country + Year ~ Variables)
},
tidyr = {
df %>%
gather(Year, val, -Variables, -Country) %>%
spread(Variables, val)
},
datatable = {
data.table::dcast(
data.table::melt(df, id.vars = c("Country", "Variables"), variable.name = "Year"),
Country + Year ~ Variables)
},
reshape = {
reshape::cast(reshape::melt(df), Country + variable ~ Variables)
}
)
res
#Unit: milliseconds
# expr min lq mean median uq max neval
# reshape2 3.088740 3.449686 4.313044 3.919372 5.112560 7.856902 100
# tidyr 4.482361 4.982017 6.215872 5.771133 6.931964 28.293377 100
# datatable 3.179035 3.511542 4.861192 4.040188 5.123103 46.010810 100
# reshape 27.371094 30.226222 32.425667 32.504644 34.118499 41.286803 100
library(ggplot2)
autoplot(res)

As above, I would strongly recommend using tidyr instead of reshape, or at least using reshape2 instead of reshape, as it fixes many of the performance issues with reshape.
In reshape itself, you have to melt datas first
> cast(melt(datas), Country + variable ~ Variables)
Using Variables, Country as id variables
Country variable Happiness Power
1 France y2000 1872 1213
2 France y2001 2345 1234
3 UK y2000 2234 1726
4 UK y2001 9082 6433
Then rename and convert the columns as necessary.
In reshape2 the code is identical, but you would use dcast instead of cast. tidyr, as in @Maurits Evers's solution above, is a better solution, as most development has shifted from reshape2 to the tidyverse.
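The renaming and converting step mentioned above could look roughly like the following base R sketch. The object res is a hand-built stand-in for the cast() output shown above (an assumption, so that the example runs without the reshape package), and the target column order is taken from the question.

```r
# Stand-in for the result of cast(melt(datas), Country + variable ~ Variables)
res <- data.frame(
  Country   = c("France", "France", "UK", "UK"),
  variable  = factor(c("y2000", "y2001", "y2000", "y2001")),
  Happiness = c(1872, 2345, 2234, 9082),
  Power     = c(1213, 1234, 1726, 6433)
)

# Rename the melted column to Year
names(res)[names(res) == "variable"] <- "Year"

# Strip the leading "y" and convert the year to an integer
res$Year <- as.integer(sub("^y", "", as.character(res$Year)))

# Put the columns in the order requested in the question
res <- res[, c("Year", "Country", "Power", "Happiness")]
res
```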

Related

Creating a loop to add labels to columns: library(Hmisc)

I have a dataset which looks something like this:
Year Country Matchcode P H
1 2000 France 0001 1213 1872
2 2001 France 0002 1234 2345
3 2000 UK 0003 1726 2234
4 2001 UK 0004 6433 9082
I have another dataset which looks something like this:
Indicator Code Indicator Name
P Power
H Happiness
I would like to add info in the second column of the second dataset (Power, Happiness) as a label to the abbreviation used in the first dataset with a loop, but I don't know exactly how to write the loop.
This is how far I got:
library(Hmisc)
for i in df2[,1]{
if (df1[,i] == df2[i,]){
label(df1[,i]) <- df2[i,2]
}}
But this merely checks whether names are the same and does not search for it.
Could anyone guide further?
Desired output:
Year Country Matchcode P(label=Power) H(label=Happiness)
1 2000 France 0001 1213 1872
2 2001 France 0002 1234 2345
3 2000 UK 0003 1726 2234
4 2001 UK 0004 6433 9082
If you specifically want to use a loop, this approach gives the output you describe:
df <- data.frame(Year = c(2000, 2001, 2000, 2001),
Country = c("France", "France", "UK","UK"),
Matchcode = c("0001", "0002", "0003", "0004"),
P = c(1213, 1234, 1726, 6433),
H = c(1872, 2345, 2234, 9082))
lookup <- data.frame(code = c ("P", "H"),
label = c("Power", "Happiness"),
stringsAsFactors = FALSE)
for (i in 1:length(colnames(df))) {
if(!is.na(match(colnames(df), lookup$code)[i])) {
Hmisc::label(df[[i]]) <- lookup$label[(match(colnames(df), lookup$code))[i]]
}
}
This works:
Hmisc::label(df[4])
# P
# "Power"
It also checks out in the RStudio viewer.
Like several of the other answerers and commenters, I had originally thought you wanted to append the "label = " text to the column names. For anyone wanting that, this is the (loop) code.
for (i in 1:length(colnames(df))) {
if(!is.na(match(colnames(df), lookup$code)[i])) {
colnames(df)[i] <- paste0(colnames(df)[i],
"(label=",
lookup$label[(match(colnames(df), lookup$code))[i]],
")")
}
}
It's not clear to me at all what you're trying to do with Hmisc::label but I think you're misinterpreting the role & function of Hmisc::label.
Consider the following:
Let's construct a sample data.frame consisting of 2 rows and 2 columns.
df <- setNames(data.frame(matrix(0, ncol = 2, nrow = 2)), c("a", "b"))
df
# a b
#1 0 0
#2 0 0
We extract the column names. Note that cn is a character vector.
cn <- colnames(df)
cn
#[1] "a" "b"
We now set a Hmisc::label for cn.
label(cn) <- "label for cn"
cn
#label for cn
#[1] "a" "b"
We inspect the attributes of cn
attributes(cn)
#$label
#[1] "label for cn"
#
#$class
#[1] "labelled" "character"
We now assign cn to the column names of df.
colnames(df) <- cn
df
# a b
#1 0 0
#2 0 0
Note how the label attribute is not stored as part of the column names.
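The practical consequence is that a label has to be attached to the column itself, not to the vector of names. A minimal base R sketch of that idea follows. It assumes, based on the attributes(cn) output above, that the label lives in an attribute called "label"; it uses attr() directly and does not set the "labelled" class, so it only imitates what Hmisc::label does. The names df, lookup, code and label are illustrative.

```r
# Small example data and a code-to-label lookup table (illustrative names)
df <- data.frame(Year = c(2000, 2001), P = c(1213, 1234), H = c(1872, 2345))
lookup <- data.frame(code = c("P", "H"),
                     label = c("Power", "Happiness"),
                     stringsAsFactors = FALSE)

# Attach each label to the matching column, skipping codes that have no column
for (k in seq_len(nrow(lookup))) {
  col <- lookup$code[k]
  if (col %in% names(df)) {
    attr(df[[col]], "label") <- lookup$label[k]
  }
}

# Collect the labels per column; unlabelled columns give an empty string
labels <- vapply(df, function(x) {
  lab <- attr(x, "label")
  if (is.null(lab)) "" else lab
}, character(1))
labels
```

The column names stay P and H; only the columns' attributes change, which is why the default print of the data frame looks the same as before.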
Here's a dplyr solution:
# example datasets
df = read.table(text = "
Year Country Matchcode P H
1 2000 France 0001 1213 1872
2 2001 France 0002 1234 2345
3 2000 UK 0003 1726 2234
4 2001 UK 0004 6433 9082
", header=T)
df2 = read.table(text = "
IndicatorName IndicatorCode
P Power
H Happiness
", header=T)
library(dplyr)
data.frame(original_names = names(df)) %>% # get original names
left_join(df2, by=c("original_names"="IndicatorName")) %>% # join names that should be updated
mutate(new_names = ifelse(is.na(IndicatorCode), original_names, paste0(original_names,"(label=",IndicatorCode,")"))) %>% # if there is a match update the name
pull(new_names) -> list_new_names # get column of new names and store it in a vector
# update names
names(df) = list_new_names
# check new names
df
# Year Country Matchcode P(label=Power) H(label=Happiness)
# 1 2000 France 1 1213 1872
# 2 2001 France 2 1234 2345
# 3 2000 UK 3 1726 2234
# 4 2001 UK 4 6433 9082
This would work. Find the corresponding text using match, and use paste0 to generate the label.
colnames(df1)[4:5] <- paste0(colnames(df1)[4:5], '(label=', df2$V2[match(colnames(df1)[4:5], df2$V1)], ')')
df1
Year Country Matchcode P(label=Power) H(label=Happiness)
1 2000 France 1 1213 1872
2 2001 France 2 1234 2345
3 2000 UK 3 1726 2234
4 2001 UK 4 6433 9082
Data used
df1 <- read.table(text="Year Country Matchcode P H
1 2000 France 0001 1213 1872
2 2001 France 0002 1234 2345
3 2000 UK 0003 1726 2234
4 2001 UK 0004 6433 9082", header=T, stringsAsFactors=F)
df2 <- read.table(text="
P Power
H Happiness", header=F, stringsAsFactors=F)
If you still want to stick with Hmisc, you can modify the 'print' function to handle the extra information provided by the labels, or rather (and less harmfully) tell R that your data has to be printed using the labels. You can achieve this by creating a new data frame class for which the print function behaves differently.
The 'print' trick is not necessary with RStudio, which natively shows the labels together with the column names.
df1 = read.table(text = "
Year Country Matchcode P H
1 2000 France 0001 1213 1872
2 2001 France 0002 1234 2345
3 2000 UK 0003 1726 2234
4 2001 UK 0004 6433 9082 ", header=T)
df2 = read.table(text = "
var lab
P Power
H Happiness", header=T, stringsAsFactors=FALSE)
## Set the labels of the columns in df1 accordingly to df2
library(Hmisc)
for (i in 1:ncol(df1)) {
lab <- df2[df2$var==colnames(df1)[i],2]
if (length(lab) != 0) label(df1[[i]]) <- lab
}
# A 'print' function dedicated to 'truc' objects
# Mainly it is the code from the original 'print' except for dimnames[[2L]]
print.truc <- function (x, ..., digits = NULL, quote = FALSE, right = TRUE,
row.names = TRUE)
{
n <- length(row.names(x))
if (length(x) == 0L) {
cat(sprintf(ngettext(n, "data frame with 0 columns and %d row",
"data frame with 0 columns and %d rows"), n), "\n",
sep = "")
}
else if (n == 0L) {
print.default(names(x), quote = FALSE)
cat(gettext("<0 rows> (or 0-length row.names)\n"))
}
else {
m <- as.matrix(format.data.frame(x, digits = digits,
na.encode = FALSE))
if (!isTRUE(row.names))
dimnames(m)[[1L]] <- if (isFALSE(row.names))
rep.int("", n)
else row.names
dimnames(m)[[2L]] <- purrr::map(1:ncol(x),
function(i) {
z <- attributes(x[[i]])$label
if (length(z)!=0) z else colnames(x)[i]
})
print(m, ..., quote = quote, right = right)
}
invisible(x)
}
# Says that 'df1' is an 'enhanced' data frame
class(df1) <- c("truc",class(df1))
# Print as enhanced
print(df1)
# Year Country Matchcode Power Happiness
#1 2000 France 1 1213 1872
#2 2001 France 2 1234 2345
#3 2000 UK 3 1726 2234
#4 2001 UK 4 6433 9082
# Print using standard way
print(as.data.frame(df1))
# Year Country Matchcode P H
#1 2000 France 1 1213 1872
#2 2001 France 2 1234 2345
#3 2000 UK 3 1726 2234
#4 2001 UK 4 6433 9082
There is no need for a loop with Hmisc; you can do this in one line using the option self = FALSE in the label command.
label(df1[, df2$IndicatorName], self = FALSE) <- df2$IndicatorCode
I.e.:
library(Hmisc, warn.conflicts = FALSE, quietly = TRUE)
df1 = read.table(text = "
Year Country Matchcode P H
1 2000 France 0001 1213 1872
2 2001 France 0002 1234 2345
3 2000 UK 0003 1726 2234
4 2001 UK 0004 6433 9082
", header=T)
df2 = read.table(text = "
IndicatorName IndicatorCode
P Power
H Happiness
", header=T)
label(df1[, df2$IndicatorName], self = FALSE) <- df2$IndicatorCode
sapply(df1, label)
#> Year Country Matchcode P H
#> "" "" "" "Power" "Happiness"
Created on 2020-09-14 by the reprex package (v0.3.0)

Dataframe does not correctly reshape

I have the following dataframe:
Variables Varcode Country Ccode 2000 2001
1 Power P France FR 1213 1234
2 Happiness H France FR 1872 2345
3 Power P UK UK 1726 6433
4 Happiness H UK UK 2234 9082
I would like to reshape this dataframe as follows:
Year Country Ccode P(label=Power) H(label=Happiness)
1 2000 France FR 1213 1872
2 2001 France FR 1234 2345
3 2000 UK UK 1726 2234
4 2001 UK UK 6433 9082
The original code was as follows:
library(tidyverse)
df %>%
gather(Year, val, -Variables, -Country) %>%
spread(Variables, val)
I tried to expand the code because the Ccode and Indicator Code ended up as a row in the list, and I decided I wanted to use the codes as variable names and the variable names as labels (please note that because of that I swapped -Variables and Variables with -Varcode and Varcode, respectively):
library(tidyverse)
library(Hmisc)
List <- df$Variables
df<-df %>%
gather(Year, val, -Varcode, -Country) %>%
spread(Varcode, val)
for(i in List){
label(df[,i]) <- List[i]
}
Please note: I am using a list because of memory limitations.
I ran into two problems:
The transformation does not go smoothly because two additional columns from df (among which Variables) are added where values should be.
The label function gives an error.
Can anyone help me figure out what goes wrong?
I think you went wrong with your selection of columns to gather.
Data:
library(tidyverse)
df <- read.table(text = "Variables Varcode Country Ccode 2000 2001
1 Power P France FR 1213 1234
2 Happiness H France FR 1872 2345
3 Power P UK UK 1726 6433
4 Happiness H UK UK 2234 9082", header = TRUE, stringsAsFactors = FALSE) %>%
rename(`2000` = X2000, `2001` = X2001)
df %>%
select(-Varcode) %>%
gather(Year, val,`2000`:`2001`) %>%
unite(Country_Ccode, Country, Ccode, sep = "_") %>%
spread(Variables, val) %>%
separate(Country_Ccode, c("Country", "Ccode"), sep = "_")
Output
Country Ccode Year Happiness Power
1 France FR 2000 1872 1213
2 France FR 2001 2345 1234
3 UK UK 2000 2234 1726
4 UK UK 2001 9082 6433

Reshaping by ID number into wide format [duplicate]

Posting a second question because my first was marked as a duplicate. I apologize in advance if there already is a question that addresses this specific issue.
I started out with a dataframe as follows:
dat<-data.frame(
ID=c(100,101,101,101,102,103),
DEGREE=c("BA","BA","MS","PHD","BA","BA"),
YEAR=c(1980,1990, 1992, 1996, 2000, 2004))
> dat
ID DEGREE YEAR
100 BA 1980
101 BA 1990
101 MS 1992
101 PHD 1996
102 BA 2000
103 BA 2004
ID 101 earned a BA in 1990, an MS in 1992, and a PHD in 1996.
I want to reshape this dataframe into a wide format that ultimately looks like this:
ID DEGREE_1 DEGREE_2 DEGREE_3 YEAR_DEGREE_1 YEAR_DEGREE_2 YEAR_DEGREE_3
100 BA 1980
101 BA MS PHD 1990 1992 1996
102 BA 2000
103 BA 2004
With help from an answer to my original question, I attempted to create my new data frame using the following code:
dat$DEGREE<-as.character(dat$DEGREE)
dat %>% group_by(ID) %>%
mutate(DegreeNum = paste("Degree", row_number(), sep = "_"))%>%
mutate(DegreeYear = paste("YearDegree", row_number(), sep = "_"))%>%
spread(DegreeNum, DEGREE, fill = "")%>%
spread(DegreeYear,YEAR,fill="")%>%
as.data.frame()
ID Degree_1 Degree_2 Degree_3 YearDegree_1 YearDegree_2 YearDegree_3
100 BA 1980
101 PHD 1996
101 MS 1992
101 BA 1990
102 BA 2000
103 BA 2004
This is as far as I was able to get, but cannot figure out how to reshape it into a dataframe so that everything from ID 101 is in one row. Any help would be appreciated.
Not so hard with the tidyverse (dplyr here, combined with base R's reshape):
library(dplyr)
df<-data.frame(ID=c(100,101,101,101,102,103),
DEGREE=c("BA","BA","MS","PHD","BA","BA"),
YEAR=c(1980,1990, 1992, 1996, 2000, 2004),
stringsAsFactors=FALSE)
df1 <- df %>% select(-3) %>% group_by(ID) %>% mutate(i=row_number()) %>%
as.data.frame() %>%
reshape(direction="wide",idvar="ID",v.names="DEGREE",timevar="i",sep="_")
df1[is.na(df1)] <- ""
df2 <- df %>% select(-2) %>% group_by(ID) %>% mutate(i=row_number()) %>%
as.data.frame() %>%
reshape(direction="wide",idvar="ID",v.names="YEAR",timevar="i",sep="_")
df2[is.na(df2)] <- ""
inner_join(df1,df2,"ID")
# ID DEGREE_1 DEGREE_2 DEGREE_3 YEAR_1 YEAR_2 YEAR_3
#1 100 BA 1980
#2 101 BA MS PHD 1990 1992 1996
#3 102 BA 2000
#4 103 BA 2004
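As a possible simplification, base R's reshape() accepts several v.names at once, so DEGREE and YEAR can be widened in a single call without the join. The sketch below assumes the same data as in the question; the counter column i is computed with ave() here instead of dplyr's row_number(), and the name wide is a placeholder.

```r
dat <- data.frame(ID = c(100, 101, 101, 101, 102, 103),
                  DEGREE = c("BA", "BA", "MS", "PHD", "BA", "BA"),
                  YEAR = c(1980, 1990, 1992, 1996, 2000, 2004),
                  stringsAsFactors = FALSE)

# Running counter of degrees within each ID (1 for the first degree, 2 for the second, ...)
dat$i <- ave(seq_along(dat$ID), dat$ID, FUN = seq_along)

# Widen both value columns in one call; new columns are named DEGREE_1, YEAR_1, ...
wide <- reshape(dat, direction = "wide", idvar = "ID", timevar = "i",
                v.names = c("DEGREE", "YEAR"), sep = "_")
rownames(wide) <- NULL
wide
```

Missing degrees come out as NA rather than empty strings, which keeps the YEAR columns numeric; replace them with "" only for display, as in the answer above.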

Looking up values without loop in R

I need to look up a value in a data frame based on multiple criteria in another data frame. Example
A=
Country Year Number
USA 1994 455
Canada 1997 342
Canada 1998 987
should get an added column named "Rate", with values coming from
B=
Year USA Canada
1993 21 654
1994 41 321
1995 56 789
1996 85 123
1997 65 456
1998 1 999
So that the final data frame is
C=
Country Year Number Rate
USA 1994 455 41
Canada 1997 342 456
Canada 1998 987 999
In other words: Look up year and country from A in B and result is C. I would like to do this without a loop. I would like a general approach, such that I would be able to look up based on more than two criteria.
Here's another way using data.table that doesn't require converting the 2nd data table to long form:
require(data.table) # 1.9.6+
A[B, Rate := get(Country), by=.EACHI, on="Year"]
# Country Year Number Rate
# 1: USA 1994 455 41
# 2: Canada 1997 342 456
# 3: Canada 1998 987 999
where A and B are data.tables, and Country is of character type.
We can melt the second dataset from 'wide' to 'long' format, merge with the first dataset to get the expected output.
library(reshape2)
res <- merge(A, melt(B, id.var='Year'),
by.x=c('Country', 'Year'), by.y=c('variable', 'Year'))
names(res)[4] <- 'Rate'
res
# Country Year Number Rate
#1 Canada 1997 342 456
#2 Canada 1998 987 999
#3 USA 1994 455 41
Or we can use gather from tidyr and right_join to get this done.
library(dplyr)
library(tidyr)
gather(B, Country,Rate, -Year) %>%
right_join(., A)
# Year Country Rate Number
#1 1994 USA 41 455
#2 1997 Canada 456 342
#3 1998 Canada 999 987
Or, as @DavidArenburg mentioned in the comments, this can also be done with data.table. We convert the 'data.frame' to 'data.table' (setDT(A)), melt the second dataset and join on 'Year' and 'Country'.
library(data.table)#v1.9.6+
setDT(A)[melt(setDT(B), 1L, variable = "Country", value = "Rate"),
on = c("Country", "Year"),
nomatch = 0L]
# Country Year Number Rate
# 1: USA 1994 455 41
# 2: Canada 1997 342 456
# 3: Canada 1998 987 999
Or a shorter version (if we are not too picky about variable names)
setDT(A)[melt(B, 1L), on = c(Country = "variable", Year = "Year"), nomatch = 0L]
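For the two-criteria case specifically, a package-free sketch is to index B as a matrix with a two-column (row, column) index built by match(). The data below is retyped from the question and the name rates is a placeholder. This trick does not extend to more than two criteria; for that, the melt-and-merge approach above is the more general route.

```r
A <- data.frame(Country = c("USA", "Canada", "Canada"),
                Year = c(1994, 1997, 1998),
                Number = c(455, 342, 987),
                stringsAsFactors = FALSE)
B <- data.frame(Year = 1993:1998,
                USA = c(21, 41, 56, 85, 65, 1),
                Canada = c(654, 321, 789, 123, 456, 999))

# Rates as a plain numeric matrix: one row per year, one column per country
rates <- as.matrix(B[, -1])

# One (row, column) pair per row of A, then a single vectorised lookup
idx <- cbind(match(A$Year, B$Year), match(A$Country, colnames(rates)))
A$Rate <- rates[idx]
A
```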

Convert Dataframe to key value pair list in R [duplicate]

How can I 'unpivot' a table? What is the proper technical term for this?
UPDATE: The term is called melt
I have a data frame for countries and data for each year
Country 2001 2002 2003
Nigeria 1 2 3
UK 2 NA 1
And I want to have something like
Country Year Value
Nigeria 2001 1
Nigeria 2002 2
Nigeria 2003 3
UK 2001 2
UK 2002 NA
UK 2003 1
I still can't believe I beat Andrie with an answer. :)
> library(reshape)
> my.df <- read.table(text = "Country 2001 2002 2003
+ Nigeria 1 2 3
+ UK 2 NA 1", header = TRUE)
> my.result <- melt(my.df, id = c("Country"))
> my.result[order(my.result$Country),]
Country variable value
1 Nigeria X2001 1
3 Nigeria X2002 2
5 Nigeria X2003 3
2 UK X2001 2
4 UK X2002 NA
6 UK X2003 1
The base R reshape approach for this problem is pretty ugly, particularly since the names aren't in a form that reshape likes. It would be something like the following, where the first setNames line modifies the column names into something that reshape can make use of.
reshape(
setNames(mydf, c("Country", paste0("val.", c(2001, 2002, 2003)))),
direction = "long", idvar = "Country", varying = 2:ncol(mydf),
sep = ".", new.row.names = seq_len(prod(dim(mydf[-1]))))
A better alternative in base R is to use stack, like this:
cbind(mydf[1], stack(mydf[-1]))
# Country values ind
# 1 Nigeria 1 2001
# 2 UK 2 2001
# 3 Nigeria 2 2002
# 4 UK NA 2002
# 5 Nigeria 3 2003
# 6 UK 1 2003
There are also new tools for reshaping data now available, like the "tidyr" package, which gives us gather. Of course, the tidyr:::gather_.data.frame method just calls reshape2::melt, so this part of my answer doesn't necessarily add much except introduce the newer syntax that you might be encountering in the Hadleyverse.
library(tidyr)
gather(mydf, year, value, `2001`:`2003`) ## Note the backticks
# Country year value
# 1 Nigeria 2001 1
# 2 UK 2001 2
# 3 Nigeria 2002 2
# 4 UK 2002 NA
# 5 Nigeria 2003 3
# 6 UK 2003 1
All three options here would need reordering of rows if you want the row order you showed in your question.
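A sketch of that reordering step in base R follows. To keep it self-contained it builds the long table by hand with rep() and unlist(), which should hold the same values as the stack() result, and then sorts by country and year. The column names Year and Value follow the question; long is a placeholder name.

```r
mydf <- data.frame(Country = c("Nigeria", "UK"),
                   `2001` = 1:2, `2002` = c(2L, NA), `2003` = c(3L, 1L),
                   check.names = FALSE, stringsAsFactors = FALSE)

n_years <- ncol(mydf) - 1

# Long table, column by column: countries repeat once per year column,
# each year repeats once per country
long <- data.frame(
  Country = rep(mydf$Country, times = n_years),
  Year    = as.integer(rep(names(mydf)[-1], each = nrow(mydf))),
  Value   = unlist(mydf[-1], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Reorder to the Country-then-Year layout shown in the question
long <- long[order(long$Country, long$Year), ]
rownames(long) <- NULL
long
```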
A fourth option would be to use merged.stack from my "splitstackshape" package. Like base R's reshape, you'll need to modify the column names to something that includes a "variable" and "time" indicator.
library(splitstackshape)
merged.stack(
setNames(mydf, c("Country", paste0("V.", 2001:2003))),
var.stubs = "V", sep = ".")
# Country .time_1 V
# 1: Nigeria 2001 1
# 2: Nigeria 2002 2
# 3: Nigeria 2003 3
# 4: UK 2001 2
# 5: UK 2002 NA
# 6: UK 2003 1
Sample data
mydf <- structure(list(Country = c("Nigeria", "UK"), `2001` = 1:2, `2002` = c(2L,
NA), `2003` = c(3L, 1L)), .Names = c("Country", "2001", "2002",
"2003"), row.names = 1:2, class = "data.frame")
You can use the melt command from the reshape package. See here: http://www.statmethods.net/management/reshape.html
Probably something like melt(myframe, id=c('Country'))
