The idea is to convert a frequency table to something geom_density can handle (ggplot2).
Starting with some raw data:
> dat <- data.frame(x = c("a", "a", "b", "b", "b"), y = c("c", "c", "d", "d", "d"))
> dat
x y
1 a c
2 a c
3 b d
4 b d
5 b d
Use dcast to make a frequency table
> library(reshape2)
> dat2 <- dcast(dat, x + y ~ ., fun.aggregate = length)
> dat2
  x y .
1 a c 2
2 b d 3
How can this be reversed? melt does not seem to be the answer:
> colnames(dat2) <- c("x", "y", "count")
> melt(dat2, measure.vars = "count")
x y variable value
1 a c count 2
2 b d count 3
Since dcast can use any aggregation function, you won't be able to reverse it in general without knowing how to invert that particular aggregation.
For length, the obvious inverse is rep. For aggregations like sum or mean there is no obvious inverse (assuming you haven't saved the original data as an attribute).
Some options to invert length
You could use ddply
library(plyr)
ddply(dat2, .(x), summarize, y = rep(y, count))
or more simply
as.data.frame(lapply(dat2[c('x','y')], rep, dat2$count))
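For reference, the second approach should give back the original rows; a quick check with the dat2 above:
as.data.frame(lapply(dat2[c('x', 'y')], rep, dat2$count))
#   x y
# 1 a c
# 2 a c
# 3 b d
# 4 b d
# 5 b d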
I'm trying to figure out how to replace rows in one dataframe with rows from another by matching the values of one of the columns. Both dataframes have the same column names.
Ex:
df1 <- data.frame(x = c(1,2,3,4), y = c("a", "b", "c", "d"))
df2 <- data.frame(x = c(1,2), y = c("f", "g"))
Is there a way to replace the rows of df1 with the corresponding rows of df2 wherever they share the same x value? It would look like this:
data.frame(x = c(1,2,3,4), y = c("f","g","c","d"))
I've been working on this for a while and this is the closest I've gotten -
df1[which(df1$x %in% df2$x),]$y <- df2[which(df1$x %in% df2$x),]$y
But it just replaces the values with NA.
Does anyone know how to do this?
We can use match:
inds <- match(df1$x, df2$x)
df1$y[!is.na(inds)] <- df2$y[na.omit(inds)]
df1
# x y
#1 1 f
#2 2 g
#3 3 c
#4 4 d
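To see how this works, the intermediate objects for the example data look roughly like this:
inds <- match(df1$x, df2$x)
inds           # 1 2 NA NA - position of each df1$x value in df2$x, NA if absent
!is.na(inds)   # TRUE TRUE FALSE FALSE - which rows of df1 get replaced
na.omit(inds)  # 1 2 - the rows of df2 to pull y values from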
First off, well done producing a nice reproducible example that's directly copy-pastable. That always helps, especially with an example of the expected output. Nice one!
You have several options, but let's look at why your solution doesn't quite work:
First of all, I tried copy-pasting your last line into a new session and got the dreaded factor warning:
Warning message:
In `[<-.factor`(`*tmp*`, iseq, value = 1:2) :
invalid factor level, NA generated
If we look at your data frames df1 and df2 with the str function, you will see that they do not contain text but factors. These are not text - in short they represent categorical data (male vs. female; scores A, B, C, D, and F; etc.) and are really integers that carry text labels. So that could be your issue.
Running your code gives a warning because you are trying to introduce new factor levels (labels) into df1 that don't exist there. R doesn't know what to do with them, so it inserts NA values instead.
As r2evans answered, he used the stringsAsFactors argument to stop strings being converted to factors - you can even go as far as disabling it on a session-wide basis using options(stringsAsFactors=FALSE) (and I've heard it will become the default in the forthcoming R 4.0.0 - yay!).
After disabling stringsAsFactors, your code works - or does it? Try this on for size:
df2 <- df2[c(2,1),]
df1[which(df1$x %in% df2$x),]$y <- df2[which(df1$x %in% df2$x),]$y
What's in df1 now? Not quite right anymore.
In the first line, I swapped the two rows in df2 and lo and behold, the replaced values in df1 were swapped. Why is that?
Let's deconstruct your statement df2[which(df1$x %in% df2$x),]$y
The call df1$x %in% df2$x returns a logical (boolean) vector indicating which elements of df1$x are found in df2 - i.e. the first two and not the last two. But it doesn't say which position in the first vector corresponds to which position in the second.
Calling which(df1$x %in% df2$x) then reduces the logical vector to the indices that were TRUE. Again, we do not know which elements correspond to which.
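To make this concrete (with df2's rows swapped as above), compare what %in% and which report with what match reports:
df1$x %in% df2$x         # TRUE TRUE FALSE FALSE - found somewhere, but where?
which(df1$x %in% df2$x)  # 1 2 - positions in df1 only
match(df1$x, df2$x)      # 2 1 NA NA - positions in df2, so the pairing is preserved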
For solutions, I would recommend r2evans's answer, as it doesn't rely on extra packages (although data.table and dplyr are two powerful packages worth getting to know).
In his solution, he uses merge to perform a "full join", which matches rows based on their values rather than - well, what you did. With transform, he then assigns new variables within the context of the data.frame returned by the merge call in the first argument.
I think what you need here is a "merge" or "join" operation.
(I add stringsAsFactors = FALSE to both frames so that the merge and the later steps go smoothly, as factors can sometimes be disruptive.)
Base R:
df1 <- data.frame(x = c(1,2,3,4), y = c("a", "b", "c", "d"), stringsAsFactors = FALSE)
df2 <- data.frame(x = c(1,2), y = c("f", "g"), stringsAsFactors = FALSE)
merge(df1, df2, by = "x", all = TRUE)
# x y.x y.y
# 1 1 a f
# 2 2 b g
# 3 3 c <NA>
# 4 4 d <NA>
transform(merge(df1, df2, by = "x", all = TRUE), y = ifelse(is.na(y.y), y.x, y.y))
# x y.x y.y y
# 1 1 a f f
# 2 2 b g g
# 3 3 c <NA> c
# 4 4 d <NA> d
transform(merge(df1, df2, by = "x", all = TRUE), y = ifelse(is.na(y.y), y.x, y.y), y.x = NULL, y.y = NULL)
# x y
# 1 1 f
# 2 2 g
# 3 3 c
# 4 4 d
dplyr:
library(dplyr)
full_join(df1, df2, by = "x") %>%
  mutate(y = coalesce(y.y, y.x)) %>%
  select(-y.x, -y.y)
# x y
# 1 1 f
# 2 2 g
# 3 3 c
# 4 4 d
A join option with data.table: we join on the 'x' column and assign the values of 'y' from the second dataset (i.y) to the first one with :=.
library(data.table)
setDT(df1)[df2, y := i.y, on = .(x)]
NOTE: It is better to use stringsAsFactors = FALSE (in R 4.0.0 it is the default), or else all the factor levels need to be common to both datasets.
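For reference, a quick check of what df1 looks like after the update join (a sketch, assuming both frames were created with stringsAsFactors = FALSE):
library(data.table)
df1 <- data.frame(x = c(1, 2, 3, 4), y = c("a", "b", "c", "d"), stringsAsFactors = FALSE)
df2 <- data.frame(x = c(1, 2), y = c("f", "g"), stringsAsFactors = FALSE)
setDT(df1)[df2, y := i.y, on = .(x)]
df1
#    x y
# 1: 1 f
# 2: 2 g
# 3: 3 c
# 4: 4 d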
I have a data frame that looks like this:
library(dplyr)
df <- data_frame(doc.x = c("a", "b", "c", "d"),
                 doc.y = c("b", "a", "d", "c"))
So that df is:
Source: local data frame [4 x 2]
doc.x doc.y
(chr) (chr)
1 a b
2 b a
3 c d
4 d c
This is a list of ordered pairs: a to b but also b to a, and so on. What is a dplyr-like way to return only a list of unordered pairs from this data frame? I.e.
doc.x doc.y
(chr) (chr)
1 a b
2 c d
Use pmin and pmax to sort the pairs alphabetically, i.e. turn (b,a) into (a,b) and then filter away all the duplicates.
df %>%
  mutate(dx = pmin(doc.x, doc.y), dy = pmax(doc.x, doc.y)) %>%
  distinct(dx, dy, .keep_all = TRUE) %>%
  select(-dx, -dy)
doc.x doc.y
(chr) (chr)
1 a b
2 c d
Alternate way using data.table:
df <- data.frame(doc.x = c("a", "b", "c", "d"),
                 doc.y = c("b", "a", "d", "c"), stringsAsFactors = FALSE)
library(data.table)
setDT(df)
df[, row := 1:nrow(df)]
df <- df[, list(Left = max(doc.x, doc.y), Right = min(doc.x, doc.y)), by = row]
df <- df[, list(Left, Right)]
unique(df)
Left Right
1: b a
2: d c
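The same result can also be had without the by-row grouping, by reusing the pmin/pmax idea from the answer above (a sketch, starting from a fresh df):
library(data.table)
df <- data.frame(doc.x = c("a", "b", "c", "d"),
                 doc.y = c("b", "a", "d", "c"), stringsAsFactors = FALSE)
setDT(df)
unique(df[, .(Left = pmax(doc.x, doc.y), Right = pmin(doc.x, doc.y))])
#    Left Right
# 1:    b     a
# 2:    d     c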
Using dplyr
# make character columns into factors
df <- as.data.frame(unclass(df))
df$x.lvl <- levels(df$doc.x)
df$y.lvl <- levels(df$doc.y)
# find unique pairs
res <- df %>%
  group_by(doc.x) %>%
  transform(x.lvl = order(doc.x),
            y.lvl = order(doc.y)) %>%
  transform(pair = ifelse(x.lvl < y.lvl,
                          paste(doc.x, doc.y, sep = ","),
                          paste(doc.y, doc.x, sep = ","))) %>%
  .$pair %>%
  unique
Unique pairs
res
[1] a,b c,d
Levels: a,b c,d
Edit
Inspired by Backlin's solution, in base R
unique(with(df, paste(pmin(doc.x, doc.y), pmax(doc.x, doc.y), sep=",")))
[1] "a,b" "c,d"
Or to store in a data.frame
unique(with(df, data.frame(lvl1=pmin(doc.x, doc.y), lvl2=pmax(doc.x, doc.y))))
lvl1 lvl2
1 a b
3 c d
I have a data.table in this fashion:
dd <- data.table(f = c("a", "a", "a", "b", "b"), g = c(1,2,3,4,5))
dd
I need to sum the values g by factor f, and finally return a single row data.table object that has the maximum value of g, but that also contains the factor information. i.e.
   f g
1: b 9
My closest attempt so far is
tmp3 <- dd[, sum(g), by = f][, max(V1)]
tmp3
Which results in:
> tmp3
[1] 9
EDIT: I'm ideally looking for a purely data.table piece of code/workflow. I'm surprised that, with all the speedy split-apply-combine wizardry and the ability to subset your data in the form of `example[i = subset, ]`, I haven't found a straightforward way to subset on a single-value condition.
Here's one way to do it:
library(data.table)
dd <- data.table(
  f = c("a", "a", "a", "b", "b"),
  g = c(1, 2, 3, 4, 5))
##
> dd[,list(g = sum(g)),by=f][which.max(g),]
f g
1: b 9
You can use dplyr syntax on a data.table, in this case:
library(dplyr)
dd %>%
  group_by(f) %>%
  summarise(g = sum(g)) %>%
  top_n(1, g)
Source: local data table [1 x 2]
f g
1 b 9
Must one melt a data frame prior to casting it? From ?dcast:
data: molten data frame, see melt.
In other words, is it absolutely necessary to have a data frame molten prior to any acast or dcast operation?
Consider the following:
library("reshape2")
library("MASS")
xb <- dcast(Cars93, Manufacturer ~ Type, mean, value.var="Price")
m.Cars93 <- melt(Cars93, id.vars=c("Manufacturer", "Type"), measure.vars="Price")
xc <- dcast(m.Cars93, Manufacturer ~ Type, mean, value.var="value")
Then:
> identical(xb, xc)
[1] TRUE
So in this case the melt operation seems to have been redundant.
What are the general guiding rules in these cases? How do you decide when a data frame needs to be molten prior to a *cast operation?
Whether or not you need to melt your dataset depends on what form you want the final data to be in and how that relates to what you currently have.
The way I generally think of it is:
For the LHS of the formula, I should have one or more columns that will become my "id" rows. These will remain as separate columns in the final output.
For the RHS of the formula, I should have one or more columns that combine to form the new columns across which my values will be "spread". When this is more than one column, dcast will create new columns based on the combinations of their values.
I must have just one column that feeds the values used to fill in the resulting "grid" created by these rows and columns.
To illustrate with a small example, consider this tiny dataset:
mydf <- data.frame(
  A = c("A", "A", "B", "B", "B"),
  B = c("a", "b", "a", "b", "c"),
  C = c(1, 1, 2, 2, 3),
  D = c(1, 2, 3, 4, 5),
  E = c(6, 7, 8, 9, 10)
)
Imagine that our possible value variables are columns "D" or "E", but we are only interested in the values from "E". Imagine also that our primary "id" is column "A", and we want to spread the values out according to column "B". Column "C" is irrelevant at this point.
With that scenario, we would not need to melt the data first. We could simply do:
library(reshape2)
dcast(mydf, A ~ B, value.var = "E")
# A a b c
# 1 A 6 7 NA
# 2 B 8 9 10
Compare what happens when you do the following, keeping in mind my three points above:
dcast(mydf, A ~ C, value.var = "E")
dcast(mydf, A ~ B + C, value.var = "E")
dcast(mydf, A + B ~ C, value.var = "E")
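For reference, here is roughly what those three calls produce (a sketch; the exact column ordering and printing may differ slightly):
dcast(mydf, A ~ C, value.var = "E")
# The A/C combinations are not unique, so dcast reports
# "Aggregation function missing: defaulting to length" and counts rows:
#   A 1 2 3
# 1 A 2 0 0
# 2 B 0 2 1

dcast(mydf, A ~ B + C, value.var = "E")
# one column per observed B/C combination (a_1, a_2, b_1, b_2, c_3),
# filled with the single matching E value or NA

dcast(mydf, A + B ~ C, value.var = "E")
#   A B  1  2  3
# 1 A a  6 NA NA
# 2 A b  7 NA NA
# 3 B a NA  8 NA
# 4 B b NA  9 NA
# 5 B c NA NA 10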
When is melt required?
Now, let's make one small adjustment to the scenario: We want to spread out the values from both columns "D" and "E" with no actual aggregation taking place. With this change, we need to melt the data first so that the relevant values that need to be spread out are in a single column (point 3 above).
dfL <- melt(mydf, measure.vars = c("D", "E"))
dcast(dfL, A ~ B + variable, value.var = "value")
# A a_D a_E b_D b_E c_D c_E
# 1 A 1 6 2 7 NA NA
# 2 B 3 8 4 9 5 10
I use aggregate() to compute the sums of the value column per level of site in the R data.frame given below:
set.seed(2013)
df <- data.frame(site = sample(c("A", "B", "C"), 10, replace = TRUE),
                 currency = sample(c("USD", "EUR", "GBP", "CNY", "CHF"), 10,
                                   replace = TRUE, prob = c(10, 6, 5, 6, 0.5)),
                 value = sample(seq(1:10)/10, 10, replace = FALSE))
df.site.sums <- aggregate(value ~ site, data=df, FUN=sum)
df.site.sums
# site value
#1 A 0.2
#2 B 0.6
#3 C 4.7
However, I would like to be able to specify the row order of the resulting df.site.sums. For instance like:
reorder <- c("C","B","A")
?special_sort(df, BY=site, ORDER=reorder) # imaginary function
# site value
#1 C 4.7
#2 B 0.6
#3 A 0.2
How can I do this using base R? Just to be clear, this is essentially a data frame row ordering question where the context is the aggregate() function (which may or may not matter).
This is relevant but does not directly address my issue, or I am missing the crux of the solution.
UPDATE
For future reference, I found a solution to ordering a data.frame's rows with respect to a target vector on this link. I guess it can be applied as a post-processing step.
df.site.sums[match(reorder,df.site.sums$site),]
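With the df.site.sums above, that post-processing step should give (row names carried over from the original order):
df.site.sums[match(reorder, df.site.sums$site), ]
#   site value
# 3    C   4.7
# 2    B   0.6
# 1    A   0.2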
This may be a possibility: convert 'site' to a factor and specify the order in levels.
df$site2 <- factor(df$site, levels = c("C", "B", "A"))
aggregate(value ~ site2, data = df, FUN = sum)
# site2 value
# 1 C 4.7
# 2 B 0.6
# 3 A 0.2
Update following @Ananda Mahto's comment (thanks!). You can use the 'non-formula' approach of aggregate:
reorder <- c("C", "B", "A")
with(df, aggregate(x = list(value = value),
                   by = list(site = factor(site, levels = reorder)),
                   FUN = sum))
# site value
# 1 C 4.7
# 2 B 0.6
# 3 A 0.2
Or, converting to a factor within the formula interface and renaming the converted site column:
df2 <- aggregate(value ~ factor(site, levels = c("C", "B", "A")),
                 data = df, FUN = sum)
df2
names(df2) <- c("site", "value")
df2
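With the data above, the renamed df2 should print as:
#   site value
# 1    C   4.7
# 2    B   0.6
# 3    A   0.2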