Must one `melt` a dataframe before having it `cast`?

Must one melt a data frame prior to having it cast? From ?melt:
data: molten data frame, see melt.
In other words, is it absolutely necessary to have a data frame molten prior to any acast or dcast operation?
Consider the following:
library("reshape2")
library("MASS")
xb <- dcast(Cars93, Manufacturer ~ Type, mean, value.var="Price")
m.Cars93 <- melt(Cars93, id.vars=c("Manufacturer", "Type"), measure.vars="Price")
xc <- dcast(m.Cars93, Manufacturer ~ Type, mean, value.var="value")
Then:
> identical(xb, xc)
[1] TRUE
So in this case the melt operation seems to have been redundant.
What are the general guiding rules in these cases? How do you decide when a data frame needs to be molten prior to a *cast operation?

Whether or not you need to melt your dataset depends on what form you want the final data to be in and how that relates to what you currently have.
The way I generally think of it is:
For the LHS of the formula, I should have one or more columns that will become my "id" rows. These will remain as separate columns in the final output.
For the RHS of the formula, I should have one or more columns that combine to form new columns in which I will be "spreading" my values out across. When this is more than one column, dcast will create new columns based on the combination of the values.
I must have just one column that would feed the values to fill in the resulting "grid" created by these rows and columns.
To illustrate with a small example, consider this tiny dataset:
mydf <- data.frame(
A = c("A", "A", "B", "B", "B"),
B = c("a", "b", "a", "b", "c"),
C = c(1, 1, 2, 2, 3),
D = c(1, 2, 3, 4, 5),
E = c(6, 7, 8, 9, 10)
)
Imagine that our possible value variables are columns "D" or "E", but we are only interested in the values from "E". Imagine also that our primary "id" is column "A", and we want to spread the values out according to column "B". Column "C" is irrelevant at this point.
With that scenario, we would not need to melt the data first. We could simply do:
library(reshape2)
dcast(mydf, A ~ B, value.var = "E")
# A a b c
# 1 A 6 7 NA
# 2 B 8 9 10
Compare what happens when you do the following, keeping in mind my three points above:
dcast(mydf, A ~ C, value.var = "E")
dcast(mydf, A ~ B + C, value.var = "E")
dcast(mydf, A + B ~ C, value.var = "E")
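To make the comparison concrete, here is a minimal sketch of those three calls on the same `mydf` (restated so the block runs on its own). The result names `wide_c`, `wide_bc` and `wide_ab_c` are only illustrative. The first call is the instructive one: "A" and "C" together do not identify a single row, so there is more than one value per cell and dcast falls back to counting them.

```r
library(reshape2)

mydf <- data.frame(
  A = c("A", "A", "B", "B", "B"),
  B = c("a", "b", "a", "b", "c"),
  C = c(1, 1, 2, 2, 3),
  D = c(1, 2, 3, 4, 5),
  E = c(6, 7, 8, 9, 10)
)

# A ~ C: two rows share A == "A" and C == 1, so one cell would have to hold
# two values. With no fun.aggregate supplied, dcast counts the values instead
# and emits a message that it is defaulting to length.
wide_c <- suppressMessages(dcast(mydf, A ~ C, value.var = "E"))

# A ~ B + C: one output column per observed combination of B and C.
wide_bc <- dcast(mydf, A ~ B + C, value.var = "E")

# A + B ~ C: A and B both stay as id columns; C supplies the new columns.
wide_ab_c <- dcast(mydf, A + B ~ C, value.var = "E")

wide_c
names(wide_bc)
dim(wide_ab_c)
```

If the cells of `wide_c` hold counts rather than the values of "E", that is the sign that the formula does not uniquely identify each value (point 3 above).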
When is melt required?
Now, let's make one small adjustment to the scenario: We want to spread out the values from both columns "D" and "E" with no actual aggregation taking place. With this change, we need to melt the data first so that the relevant values that need to be spread out are in a single column (point 3 above).
dfL <- melt(mydf, measure.vars = c("D", "E"))
dcast(dfL, A ~ B + variable, value.var = "value")
# A a_D a_E b_D b_E c_D c_E
# 1 A 1 6 2 7 NA NA
# 2 B 3 8 4 9 5 10
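As a side note, reshape2 also has `recast()`, which melts and casts in one call. The following is a sketch of the same reshaping under that approach, not a replacement for understanding the two steps; `wide_de` is an illustrative name, and note that `recast()` spells its arguments `id.var` and `measure.var` (singular).

```r
library(reshape2)

mydf <- data.frame(
  A = c("A", "A", "B", "B", "B"),
  B = c("a", "b", "a", "b", "c"),
  C = c(1, 1, 2, 2, 3),
  D = c(1, 2, 3, 4, 5),
  E = c(6, 7, 8, 9, 10)
)

# recast() melts the data (here: D and E stacked into one "value" column,
# keyed by A and B) and then casts the molten result with the formula.
wide_de <- recast(mydf, A ~ B + variable,
                  id.var = c("A", "B"), measure.var = c("D", "E"))
wide_de
```

Column "C" is dropped here because it is listed neither as an id nor as a measure variable.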

Related

Recoding factor with many levels

I need to recode a factor variable with almost 90 levels. It is trait names from database which I then need to pivot to get the dataset for analysis.
Is there a way to do it automatically without typing each OldName=NewName?
This is how I do it with dplyr for fewer levels:
df$TraitName <- recode_factor(df$TraitName, 'Old Name' = "new.name")
My idea was to use a key dataframe with a column of old names and corresponding new names but I cannot figure out how to feed it to recode
You could quite easily create a named vector from your lookup table and pass that to recode using splicing. It may well also be faster than a join.
library(tidyverse)
# test data
df <- tibble(TraitName = c("a", "b", "c"))
# Make a lookup table with your own data.
# You'll bind your two columns here instead.
# Keep the column order (old names first, new names second) so that deframe() builds the right named vector.
# The column names don't matter.
lookup <- tibble(old = c("a", "b", "c"), new = c("aa", "bb", "cc"))
# Convert to named vector and splice it within the recode
df <-
df |>
mutate(TraitNameRecode = recode_factor(TraitName, !!!deframe(lookup)))
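If you would rather avoid extra packages, the same lookup idea can be sketched in base R by replacing factor levels directly. This is a minimal sketch with made-up data; `trait` and `lookup` are hypothetical names, and levels without an entry in the lookup table are left untouched.

```r
# Hypothetical factor and lookup table (old name -> new name).
trait <- factor(c("a", "b", "c", "a"))
lookup <- data.frame(old = c("a", "b"), new = c("aa", "bb"),
                     stringsAsFactors = FALSE)

# For each existing level, find its row in the lookup table (NA if absent).
idx <- match(levels(trait), lookup$old)

# Overwrite only the levels that have a replacement.
levels(trait)[!is.na(idx)] <- lookup$new[idx[!is.na(idx)]]

as.character(trait)
```

Because only the levels are rewritten, this does not depend on how many rows the data has.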
One way would be a lookup table, a join, and coalesce (to get the first non-NA value):
my_data <- data.frame(letters = letters[1:6])
levels_to_change <- data.frame(letters = letters[4:5],
new_letters = LETTERS[4:5])
library(dplyr)
my_data %>%
left_join(levels_to_change) %>%
mutate(new = coalesce(new_letters, letters))
Result
Joining, by = "letters"
letters new_letters new
1 a <NA> a
2 b <NA> b
3 c <NA> c
4 d D D
5 e E E
6 f <NA> f

Trying to produce a loop for summing up consecutive column values in R

I am trying to write a loop that sums up consecutive columns of a table and outputs the results into another table.
For example, in my original table, we have columns a, b, c, etc, which contain the same number of numeric values.
The resulting table then should be a, a+b, a+b+c, etc up to the last column of the original table
I have a feeling a for loop should be sufficient for this operation however can't get my head around the format and syntax.
Any help would be appreciated!
Since you're new, here is an example of a very minimal reproducible example:
library(data.table)
x = data.table(a=1:3,b=4:6,c=7:9)
for(... now what?
And here's a way to do your task:
library(data.table)
# make some dummy data
X = data.table(a=1:2,b=3:4,c=5:6)
# make an empty result table
Y = data.table()
# for i = 1 to the number of columns in X
for(i in 1:ncol(X)){
# colnames(X) is "a" "b" "c".
# colnames(X)[1:1] is "a", colnames(X)[1:2] is "a" "b", colnames(X)[1:3] is "a" "b" "c"
# paste0(colnames(X)[1:1],collapse='') is "a",
# paste0(colnames(X)[1:2],collapse='') is "ab",
# paste0(colnames(X)[1:3],collapse='') is "abc"
newcolname = paste0(colnames(X)[1:i],collapse='')
# Y[,(newcolname):= is data.table syntax to create a new column called newcolname
# X[,1:i] selects columns 1 to i
# rowSums calculates the, um, row sums :D
Y[,(newcolname):=rowSums(X[,1:i])]
}
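For comparison, here is a sketch of the same running total as a plain base R loop on a data.frame, with no data.table syntax involved. The names `X` and `Y` follow the example above; the rest is an assumption about how you might lay it out.

```r
# Dummy data with the same shape as in the question.
X <- data.frame(a = 1:3, b = 4:6, c = 7:9)

# Start from a copy: the first result column is just the first input column.
Y <- X

# Each later column is the previous running total plus the next input column.
for (i in seq_along(X)[-1]) {
  Y[[i]] <- Y[[i - 1]] + X[[i]]
}

# Name the results "a", "ab", "abc", ...
names(Y) <- Reduce(paste0, names(X), accumulate = TRUE)

Y
```

Carrying the previous total forward means each column is added only once, instead of re-summing columns 1 to i on every iteration.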
Maybe you need Reduce like below
cbind(
df,
setNames(
as.data.frame(Reduce(`+`, df, accumulate = TRUE)),
Reduce(paste0, names(df), accumulate = TRUE)
)
)
such that
a b c a ab abc
1 1 4 7 1 5 12
2 2 5 8 2 7 15
3 3 6 9 3 9 18
Data
df <- structure(list(a = 1:3, b = 4:6, c = 7:9), class = "data.frame", row.names = c(NA,
-3L))

Is there a way to replace rows in one dataframe with another in R?

I'm trying to figure out how to replace rows in one dataframe with another by matching the values of one of the columns. Both dataframes have the same column names.
Ex:
df1 <- data.frame(x = c(1,2,3,4), y = c("a", "b", "c", "d"))
df2 <- data.frame(x = c(1,2), y = c("f", "g"))
Is there a way to replace the rows of df1 with the same row in df2 where they share the same x variable? It would look like this.
data.frame(x = c(1,2,3,4), y = c("f","g","c","d"))
I've been working on this for a while and this is the closest I've gotten -
df1[which(df1$x %in% df2$x),]$y <- df2[which(df1$x %in% df2$x),]$y
But it just replaces the values with NA.
Does anyone know how to do this?
We can use match:
inds <- match(df1$x, df2$x)
df1$y[!is.na(inds)] <- df2$y[na.omit(inds)]
df1
# x y
#1 1 f
#2 2 g
#3 3 c
#4 4 d
First off, well done in producing a nice reproducible example that's directly copy-pastable. That always helps, especially with an example of expected output. Nice one!
You have several options, but let's look at why your solution doesn't quite work:
First of all, I tried copy-pasting your last line into a new session and got the dreaded factor-error:
Warning message:
In `[<-.factor`(`*tmp*`, iseq, value = 1:2) :
invalid factor level, NA generated
If we look at your data frames df1 and df2 with the str function, you will see that they do not contain text but factors. These are not text - in short they represent categorical data (male vs. female, scores A, B, C, D, and F, etc.) and are really integers that have a text as label. So that could be your issue.
Running your code gives a warning because you are trying to import new factors (labels) into df1 that don't exist. And R doesn't know what to do with them, so it just inserts NA-values.
As r2evans did in his answer, you can set stringsAsFactors = FALSE to stop strings from being converted to factors. You can even disable it session-wide using options(stringsAsFactors = FALSE) (and it became the default in R 4.0.0, yay!).
After disabling stringsAsFactors, your code works - or does it? Try this on for size:
df2 <- df2[c(2,1),]
df1[which(df1$x %in% df2$x),]$y <- df2[which(df1$x %in% df2$x),]$y
What's in df1 now? Not quite right anymore.
In the first line, I swapped the two rows in df2 and lo and behold, the replaced values in df1 were swapped. Why is that?
Let's deconstruct your statement df2[which(df1$x %in% df2$x),]$y
The call df1$x %in% df2$x returns a logical (boolean) vector marking which elements of df1$x are found in df2$x, i.e. the first two and not the last two. But it doesn't say which position in the first vector corresponds to which position in the second.
Calling which(df1$x %in% df2$x) then reduces the logical vector to the indices that were TRUE. Again, we do not know which elements correspond to which.
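A small sketch makes the difference visible. It uses the question's data with the rows of df2 swapped; `hit`, `pos`, `wrong` and `right` are illustrative names. `%in%` only answers "is it there?", while `match` answers "where is it?", which is what the replacement needs.

```r
df1 <- data.frame(x = c(1, 2, 3, 4), y = c("a", "b", "c", "d"),
                  stringsAsFactors = FALSE)
# Same content as the question's df2, but with the two rows swapped.
df2 <- data.frame(x = c(2, 1), y = c("g", "f"), stringsAsFactors = FALSE)

hit <- df1$x %in% df2$x    # membership only: TRUE TRUE FALSE FALSE
pos <- match(df1$x, df2$x) # position in df2 for each df1 row: 2 1 NA NA

# Indexing df2 by which(hit) takes df2's rows 1 and 2 in their stored order,
# so the values land on the wrong rows of df1.
wrong <- df1
wrong$y[hit] <- df2$y[which(hit)]

# Indexing df2 by the matched positions pairs each df1 row with its own x.
right <- df1
right$y[!is.na(pos)] <- df2$y[pos[!is.na(pos)]]

wrong$y
right$y
```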
For solutions, I would recommend r2evans's answer, as it doesn't rely on extra packages (although data.table and dplyr are two powerful packages to get to know).
In his solution, he uses merge to perform a "full join" which matches rows based on the value, rather than - well, what you did. With transform, he assigns new variables within the context of the data.frame returned from the merge function called in the first argument.
I think what you need here is a "merge" or "join" operation.
(I add stringsAsFactors=FALSE to the frames so that the merging and later work is without any issue, as factors can be disruptive sometimes.)
Base R:
df1 <- data.frame(x = c(1,2,3,4), y = c("a", "b", "c", "d"), stringsAsFactors = FALSE)
df2 <- data.frame(x = c(1,2), y = c("f", "g"), stringsAsFactors = FALSE)
merge(df1, df2, by = "x", all = TRUE)
# x y.x y.y
# 1 1 a f
# 2 2 b g
# 3 3 c <NA>
# 4 4 d <NA>
transform(merge(df1, df2, by = "x", all = TRUE), y = ifelse(is.na(y.y), y.x, y.y))
# x y.x y.y y
# 1 1 a f f
# 2 2 b g g
# 3 3 c <NA> c
# 4 4 d <NA> d
transform(merge(df1, df2, by = "x", all = TRUE), y = ifelse(is.na(y.y), y.x, y.y), y.x = NULL, y.y = NULL)
# x y
# 1 1 f
# 2 2 g
# 3 3 c
# 4 4 d
Dplyr:
library(dplyr)
full_join(df1, df2, by = "x") %>%
mutate(y = coalesce(y.y, y.x)) %>%
select(-y.x, -y.y)
# x y
# 1 1 f
# 2 2 g
# 3 3 c
# 4 4 d
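If your dplyr is version 1.0.0 or later, `rows_update()` is another way to express this. The sketch below assumes that every x in df2 also exists in df1 and that the x values in df2 are unique; `updated` is an illustrative name.

```r
library(dplyr)

df1 <- data.frame(x = c(1, 2, 3, 4), y = c("a", "b", "c", "d"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(x = c(1, 2), y = c("f", "g"), stringsAsFactors = FALSE)

# Overwrite the non-key columns of df1 with the values from df2
# for rows whose key column "x" matches.
updated <- rows_update(df1, df2, by = "x")
updated
```

Unlike the join, this keeps the original column names, so there are no y.x / y.y columns to clean up afterwards.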
A join option with data.table where we join on the 'x' column, assign the values of 'y' in second dataset (i.y) to the first one with :=
library(data.table)
setDT(df1)[df2, y := i.y, on = .(x)]
NOTE: It is better to use stringsAsFactors = FALSE (the default since R 4.0.0); otherwise both datasets need to share all factor levels.

Opposite of dcast [duplicate]

The idea is to convert a frequency table to something geom_density can handle (ggplot2).
Starting with the raw data
> dat <- data.frame(x = c("a", "a", "b", "b", "b"), y = c("c", "c", "d", "d", "d"))
> dat
x y
1 a c
2 a c
3 b d
4 b d
5 b d
Use dcast to make a frequency table
> library(reshape2)
> dat2 <- dcast(dat, x + y ~ ., fun.aggregate = length)
> dat2
x y count
1 a c 2
2 b d 3
How can this be reversed? melt does not seem to be the answer:
> colnames(dat2) <- c("x", "y", "count")
> melt(dat2, measure.vars = "count")
x y variable value
1 a c count 2
2 b d count 3
As you can use any aggregate function, you won't be able to reverse the dcast (aggregation) without knowing how to reverse the aggregation.
For length, the obvious inverse is rep. For aggregations like sum or mean there isn't an obvious inverse (that assumes you haven't saved the original data as an attribute)
Some options to invert length
You could use ddply
library(plyr)
ddply(dat2,.(x), summarize, y = rep(y,count))
or more simply
as.data.frame(lapply(dat2[c('x','y')], rep, dat2$count))
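Another way to sketch the same inversion is to repeat the row indices rather than each column. The frequency table is rebuilt by hand here so the block is self-contained; `expanded` is an illustrative name.

```r
# The frequency table from the question.
dat2 <- data.frame(x = c("a", "b"), y = c("c", "d"), count = c(2, 3),
                   stringsAsFactors = FALSE)

# Repeat row i of dat2 count[i] times, then drop the count column.
expanded <- dat2[rep(seq_len(nrow(dat2)), dat2$count), c("x", "y")]

# Row names are inherited from the repeated rows; reset them if that matters.
rownames(expanded) <- NULL
expanded
```

This keeps working unchanged if more id columns are added, since whole rows are repeated.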

split dataframe in R by row

I have a long dataframe like this:
Row Conc group
1 2.5 A
2 3.0 A
3 4.6 B
4 5.0 B
5 3.2 C
6 4.2 C
7 5.3 D
8 3.4 D
...
The actual data have hundreds of rows. I would like to split them into two parts: groups A to C, and group D. I searched the web and found several solutions, but none of them apply to my case.
How to split a data frame?
For example:
Case 1:
x = data.frame(num = 1:26, let = letters, LET = LETTERS)
set.seed(10)
split(x, sample(rep(1:2, 13)))
I don't want to split by arbitrary number
Case 2: Split by level/factor
data2 <- data[data$sum_points == 2500, ]
I don't want to split by a single factor either. Sometimes I want to combine many levels together.
Case 3: select by row number
newdf <- mydf[1:3,]
The actual data have hundreds of rows. I don't know the row number. I just know the level I would like to split at.
It sounds like you want two data frames, where one has (A,B,C) in it and one has just D. In that case you could do
Data1 <- subset(Data, group %in% c("A","B","C"))
Data2 <- subset(Data, group=="D")
Correct me if you were asking something different
For those who end up here through internet search engines time after time, the answer to the question in the title is:
x <- data.frame(num = 1:26, let = letters, LET = LETTERS)
split(x, sort(as.numeric(rownames(x))))
Assuming that your data frame has numerically ordered row names. Also split(x, rownames(x)) works, but the result is rearranged.
You may consider using the recode() function from the "car" package.
# Load the library and make up some sample data
library(car)
set.seed(1)
dat <- data.frame(Row = 1:100,
Conc = runif(100, 0, 10),
group = sample(LETTERS[1:10], 100, replace = TRUE))
Currently, dat$group contains the upper case letters A to J. Imagine we wanted the following four groups:
"one" = A, B, C
"two" = D, E, J
"three" = F, I
"four" = G, H
Now, use recode() (note the semicolon and the nested quotes).
recodes <- recode(dat$group,
'c("A", "B", "C") = "one";
c("D", "E", "J") = "two";
c("F", "I") = "three";
c("G", "H") = "four"')
split(dat, recodes)
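The same grouping can also be sketched without the car package, because base R lets you assign a named list to the levels of a factor. This is a sketch on the same kind of made-up data; `grp` and `pieces` are illustrative names.

```r
set.seed(1)
dat <- data.frame(Row = 1:100,
                  Conc = runif(100, 0, 10),
                  group = sample(LETTERS[1:10], 100, replace = TRUE))

# Work on a factor copy of the grouping column.
grp <- factor(dat$group)

# Each list element names a new level and lists the old levels merged into it.
levels(grp) <- list(one   = c("A", "B", "C"),
                    two   = c("D", "E", "J"),
                    three = c("F", "I"),
                    four  = c("G", "H"))

# Split the original data by the merged grouping.
pieces <- split(dat, grp)
sapply(pieces, nrow)
```

The original `dat$group` column is left as it was, so each piece still shows which letter every row came from.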
With base R, we can input the factor that we want to split on.
split(df, df$group == "D")
Output
$`FALSE`
Row Conc group
1 1 2.5 A
2 2 3.0 A
3 3 4.6 B
4 4 5.0 B
5 5 3.2 C
6 6 4.2 C
$`TRUE`
Row Conc group
7 7 5.3 D
8 8 3.4 D
If you wanted to split on multiple letters, then we could:
split(df, df$group %in% c("A", "D"))
Another option is to use group_split from dplyr, but you will need to make a grouping variable first for the split.
library(dplyr)
df %>%
mutate(spl = ifelse(group == "D", 1, 0)) %>%
group_split(spl, .keep = FALSE)
