For an analysis I would like to transform data from:
data <- data.frame(
Customer = c("A", "A", "B", "B", "C", "C", "C"),
Product = c("X", "Y", "X", "Z", "X", "Y", "Z"),
Value = c(10, 15, 5, 10, 20, 5, 10)
)
data
# Customer Product Value
# 1 A X 10
# 2 A Y 15
# 3 B X 5
# 4 B Z 10
# 5 C X 20
# 6 C Y 5
# 7 C Z 10
To:
Product Product Sum Value
-------|-------|---------
X |Y |50
X |Z |45
Y |Z |15
Basically I want to get the sum of the value for every product combination within a customer. I guess it could work with some help of the reshape package but I cannot get it to work.
Thanks for your time.
Here is one way, in two steps:
1) transform your data into a long data.frame of all pairs within customers. For that, I rely on combn to provide the indices of all possible pairs:
process.one <- function(x) {
  n <- nrow(x)
  i <- combn(n, 2)  # 2-row matrix: each column is one pair of row indices
  data.frame(Product1 = x$Product[i[1, ]],
             Product2 = x$Product[i[2, ]],
             Value    = x$Value[i[1, ]] +
                        x$Value[i[2, ]])
}
library(plyr)
long <- ddply(data, "Customer", process.one)
long
# Customer Product1 Product2 Value
# 1 A X Y 25
# 2 B X Z 15
# 3 C X Y 25
# 4 C X Z 30
# 5 C Y Z 15
2) drop the Customer dimension and aggregate your values:
aggregate(Value ~ ., long[c("Product1", "Product2", "Value")], sum)
# Product1 Product2 Value
# 1 X Y 50
# 2 X Z 45
# 3 Y Z 15
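If plyr is not available, the same two steps can be sketched in base R. This is only an illustration under a few assumptions: `pairs_one`, `long` and `pair_sums` are names chosen here, rows are sorted by Product within each customer so that a pair is always recorded in the same order, and customers with a single product are skipped because they have no pair.

```r
data <- data.frame(
  Customer = c("A", "A", "B", "B", "C", "C", "C"),
  Product  = c("X", "Y", "X", "Z", "X", "Y", "Z"),
  Value    = c(10, 15, 5, 10, 20, 5, 10),
  stringsAsFactors = FALSE
)

# Hypothetical helper: all product pairs for one customer's rows
pairs_one <- function(x) {
  if (nrow(x) < 2) return(NULL)      # assumption: no pair, nothing to report
  x <- x[order(x$Product), ]         # keep (X, Y) and (Y, X) from diverging
  i <- combn(nrow(x), 2)             # each column is one pair of row indices
  data.frame(Product1 = x$Product[i[1, ]],
             Product2 = x$Product[i[2, ]],
             Value    = x$Value[i[1, ]] + x$Value[i[2, ]],
             stringsAsFactors = FALSE)
}

# Step 1: pairs per customer, stacked into one long data frame
long <- do.call(rbind, lapply(split(data, data$Customer), pairs_one))

# Step 2: sum over customers for each product pair
pair_sums <- aggregate(Value ~ Product1 + Product2, long, sum)
pair_sums
```

Sorting inside the helper matters only when the products are not already in a consistent order within each customer, as they are in the example data.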
My team and I are dealing with many thousands of URLs that have similar segments.
Some URLs have one segment ("seg", plural, "segs") in a position of interest to us. Other similar URLs have a different seg in the position of interest to us.
We need to sort a dataframe consisting of URLs and associated unique segs in the position of interest, showing the frequency of those unique segs.
Here is a simplified example:
url <- c(1, 3, 1, 4, 2, 3, 1, 3, 3, 3, 3, 2)
seg <- c("a", "c", "a", "d", "b", "c", "a", "x", "x", "y", "c", "b")
df <- data.frame(url,seg)
We are looking for the following:
url freq seg
1 3 a in other words, url #1 appears three times each with a seg = "a",
2 2 b in other words: url #2 appears twice each with a seg = "b",
3 3 c in other words: url #3 appears three times with a seg = "c",
3 2 x two times with a seg = "x", and,
3 1 y once with a seg = "y"
4 1 d etc.
I can get there using a loop and several small steps, but am convinced there is a more elegant way of doing this. Here's my inelegant approach:
Create empty dataframe with num.unique rows and three columns (url, freq, seg)
result <- data.frame(url=0, Freq=0, seg=0)
Determine the unique URLs
unique.df.url <- unique(df$url)
Loop through the dataframe
for (xx in unique.df.url) {
url.seg <- df[which(df$url == xx), ] # create a dataframe for each of the unique urls and associated segs (xx is already the url value, not an index)
freq.df.url <- data.frame(table(url.seg)) # summarize the frequency distribution of the segs by url
result <- rbind(result,freq.df.url) # append a new data.frame onto the last one
}
Eliminate rows in the dataframe where Frequency = 0
result.freq <- result[which(result$Freq > 0), ]
Sort the dataframe by URL
result.order <- result.freq[order(result.freq$url), ]
This yields the desired results, but since it is so inelegant, I am concerned that once we move to scale, the time required will be prohibitive or at least a concern. Any suggestions?
In base R you can do this:
aggregate(freq~seg+url,`$<-`(df,freq,1),sum)
# or aggregate(freq~seg+url, data.frame(df,freq=1),sum)
# seg url freq
# 1 a 1 3
# 2 b 2 2
# 3 c 3 3
# 4 x 3 2
# 5 y 3 1
# 6 d 4 1
The trick with $<- is just to add a column freq of value 1 everywhere, without changing your source table.
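To make the point about the source table concrete, here is a small sketch on a shortened version of the data. The names `with_freq` and `same_thing` are made up for this illustration, and `transform()` is shown only as a more conventional way to get the same kind of copy.

```r
df <- data.frame(url = c(1, 3, 1), seg = c("a", "c", "a"))

# Calling the replacement function directly returns a modified copy
with_freq <- `$<-`(df, freq, 1)

names(df)         # the original still has only url and seg
names(with_freq)  # the copy has the extra freq column

# A more conventional spelling of the same idea
same_thing <- transform(df, freq = 1)
```

Either form can be passed straight to aggregate() as the data argument.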
Another possibility:
subset(as.data.frame(table(df[2:1])),Freq!=0)
# seg url Freq
# 1 a 1 3
# 8 b 2 2
# 15 c 3 3
# 17 x 3 2
# 18 y 3 1
# 22 d 4 1
Here I use [2:1] to switch the order of columns so table orders the results in the required way.
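One caveat, as far as I can tell: table orders each url's rows by the levels of seg (alphabetically here), which in this example happens to coincide with descending frequency. If the order by frequency matters, an explicit sort is safer. A minimal sketch, where `counts` is a name chosen for this illustration:

```r
url <- c(1, 3, 1, 4, 2, 3, 1, 3, 3, 3, 3, 2)
seg <- c("a", "c", "a", "d", "b", "c", "a", "x", "x", "y", "c", "b")
df <- data.frame(url, seg)

# Count every url/seg combination and drop the ones that never occur
counts <- subset(as.data.frame(table(df[2:1])), Freq != 0)

# Sort by url, then by descending frequency, and put columns in the asked order
counts <- counts[order(counts$url, -counts$Freq), c("url", "Freq", "seg")]
counts
```

Note that url and seg come back as factors after as.data.frame(), so convert them if numeric urls are needed later.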
url <- c(1, 3, 1, 4, 2, 3, 1, 3, 3, 3, 3, 2)
seg <- c("a", "c", "a", "d", "b", "c", "a", "x", "x", "y", "c", "b")
df <- data.frame(url,seg)
library(dplyr)
df %>% count(url, seg) %>% arrange(url, desc(n))
# # A tibble: 6 x 3
# url seg n
# <dbl> <fct> <int>
# 1 1 a 3
# 2 2 b 2
# 3 3 c 3
# 4 3 x 2
# 5 3 y 1
# 6 4 d 1
Would the following code be better for you?
library(dplyr)
df %>% group_by(url, seg) %>% summarise(n())
Or paste & tapply:
url <- c(1, 3, 1, 4, 2, 3, 1, 3, 3, 3, 3, 2)
seg <- c("a", "c", "a", "d", "b", "c", "a", "x", "x", "y", "c", "b")
df <- data.frame(url,seg)
want <- tapply(url, INDEX = paste(url, seg, sep = "_"), length)
want <- data.frame(do.call(rbind, strsplit(names(want), "_")), want)
colnames(want) <- c("url", "seg", "freq")
want <- want[order(want$url, -want$freq), ]
rownames(want) <- NULL # needed?
want <- want[ , c("url", "freq", "seg")] # needed?
want
An option is to use table and as.data.frame to get the data in the format needed by the OP:
library(tidyverse)
table(df) %>% as.data.frame() %>%
filter(Freq > 0 ) %>%
arrange(url, desc(Freq))
# url seg Freq
# 1 1 a 3
# 2 2 b 2
# 3 3 c 3
# 4 3 x 2
# 5 3 y 1
# 6 4 d 1
OR
df %>% group_by(url, seg) %>%
summarise(freq = n()) %>%
arrange(url, desc(freq))
# # A tibble: 6 x 3
# # Groups: url [4]
# url seg freq
# <dbl> <fctr> <int>
# 1 1.00 a 3
# 2 2.00 b 2
# 3 3.00 c 3
# 4 3.00 x 2
# 5 3.00 y 1
# 6 4.00 d 1
I have data1 like:
id number
1 a
2 b
3 c
The other data2 like:
id value
1 x
2 y
3 z
I hope to merge two datasets like
a x
a y
a z
b x
b y
b z
c x
c y
c z
Both datasets have 10k entries, so I really can't do this by hand. Could someone give me a suggestion on this? Thanks!
You can either use 'expand.grid()' like lmo pointed out, or in a more tidyverse fashion:
library(tidyverse)
Creating dataframes ("tibbles"):
dt1 <- tribble(
~id, ~number,
1, "a",
2, "b",
3, "c" )
dt2 <- tribble(
~id, ~value,
1, "x",
2, "y",
3, "z")
Using lmo's suggestion, expand.grid():
expand.grid(dt1$number, dt2$value)
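One detail worth knowing: expand.grid() varies its first argument fastest, so the call above yields a x, b x, c x, ... rather than the a x, a y, a z order shown in the question. A minimal sketch of getting the requested order, using plain vectors and the made-up names `grid` and `crossed`:

```r
number <- c("a", "b", "c")
value  <- c("x", "y", "z")

# Put value first so it varies fastest, then swap the columns back
grid <- expand.grid(value = value, number = number, stringsAsFactors = FALSE)
crossed <- grid[, c("number", "value")]
crossed
```

Keep in mind that crossing two inputs of 10k entries each produces 100 million rows, so the result may be too large to handle comfortably.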
A tidyverse approach (expand() comes from tidyr, left_join() from dplyr) would be:
dt2 %>%
expand(id, value) %>%
dplyr::left_join(dt1) %>%
select(-id)
Resulting in:
Joining, by = "id"
# A tibble: 9 × 2
value number
<chr> <chr>
1 x a
2 y a
3 z a
4 x b
5 y b
6 z b
7 x c
8 y c
9 z c
I would like to filter, by group, the maximal combination of values based on a given order of columns.
A vector of column names should specify the order in which to look for maximal values.
For example:
x <- data.frame(id = c("a", "a", "b", "b"),
x = c(1, 1, 1, 2),
y = c(1, 2, 2, 1),
z = c(1, 1, 2, 1))
> x
id x y z
1 a 1 1 1
2 a 1 2 1
3 b 1 2 2
4 b 2 1 1
In this example I would like to group by id and set the 'priority' to x, y, z, which means that I want to find the maximal x value, then its associated maximal y value, and then the maximal z value for that maximal (x, y) pair.
I'm not aware of a vectorized function for this, so I recursively group to find the maximum of each column in turn. The expected result is:
> x
id x y z
1 a 1 2 1
2 b 2 1 1
I can do it with base R, with a loop :
group <- "id"
cols <- c("x", "y", "z")
for (i in seq_along(cols)) {
tmp <- aggregate(setNames(list(x[[cols[i]]]), cols[i]), by = as.list(x[group]), FUN = max)
x <- merge(x, tmp, by = c(group, cols[i]))
group <- c(group, cols[i])
}
x <- x[!duplicated(x), ]
> x
id x y z
1 a 1 2 1
2 b 2 1 1
I would like to apply this to a larger amount of data, so this code will struggle at some point. Do you have any ideas to improve this?
Thank you for any help !
We can try with dplyr
library(dplyr)
x %>%
group_by(id) %>%
arrange(desc(y),desc(z)) %>%
slice(which.max(x))
# id x y z
# <fctr> <dbl> <dbl> <dbl>
#1 a 1 2 1
#2 b 2 1 1
Here is a base R solution using the split-apply-combine methodology.
dfNew <- do.call(rbind, lapply(split(x, x$id),
function(x) x[with(x, order(x, y, z, decreasing=TRUE))[1],]))
which returns
dfNew
id x y z
a a 1 2 1
b b 2 1 1
split splits the data frame by id and returns a list. This list is fed to lapply, which applies an anonymous function that returns the row with the maximum values according to order. Finally, the list of single-row data.frames is combined with rbind and do.call.
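The same ordering idea can also be written as one sort over the whole data frame, which avoids splitting and takes the priority columns as a vector, as the question asked. This is a sketch only: `dat`, `cols`, `ord`, `sorted` and `top` are names chosen here, and negating the columns assumes they are numeric.

```r
dat <- data.frame(id = c("a", "a", "b", "b"),
                  x  = c(1, 1, 1, 2),
                  y  = c(1, 2, 2, 1),
                  z  = c(1, 1, 2, 1))
cols <- c("x", "y", "z")   # priority order of the columns

# Sort by id, then by each priority column in descending order
keys <- c(list(dat$id), lapply(dat[cols], function(v) -v))
ord  <- do.call(order, unname(keys))
sorted <- dat[ord, ]

# The first row of each id is now its maximal combination
top <- sorted[!duplicated(sorted$id), ]
top
```

For non-numeric priority columns, something like xtfrm() would be needed before negating.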
I have a data set for example,
Data <- data.frame(
groupname = as.factor(sample(c("a", "b", "c"), 10, replace = TRUE)),
someuser = sample(c("x", "y", "z"), 10, replace = TRUE))
groupname someuser
1 a x
2 b y
3 a x
4 a y
5 c z
6 b x
7 b x
8 c x
9 c y
10 c x
How do I aggregate the data so that I get:
groupname someuser
a x
b x
c x
that is, the most common value for each groupname.
PS: Given my setup, I am limited to using only 2 packages: plyr & lubridate
You can combine this function for finding the mode with aggregate.
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
aggregate(someuser ~ groupname, Data, Mode)
groupname someuser
1 a x
2 b x
3 c x
Note that in the event of a tie, it will only return the first value.
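To illustrate the tie behaviour: "first" here means the tied value that appears first in the data, because unique() keeps order of appearance. The variant `all_modes` below is a hypothetical helper, not part of the answer, for cases where every tied value should be reported.

```r
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Hypothetical variant that returns every value tied for most frequent
all_modes <- function(x) {
  ux <- unique(x)
  counts <- tabulate(match(x, ux))
  ux[counts == max(counts)]
}

tied <- c("y", "x", "x", "y")
Mode(tied)       # "y": both occur twice, and "y" appears first
all_modes(tied)  # "y" "x"
```

Since all_modes() can return more than one value per group, it does not drop into aggregate() as neatly as Mode() does.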
There are many options. Here is one using table to compute the frequencies and which.max to select the most frequent value, within the data.table framework:
library(data.table)
setDT(Data)[,list(someuser={
tt <- table(someuser)
names(tt)[which.max(tt)]
}),groupname]
Using plyr (nearly the same):
library(plyr)
ddply(Data,.(groupname),summarize,someuser={
tt <- table(someuser)
names(tt)[which.max(tt)]
})
This might work for you - using base R
set.seed(1)
Data <- data.frame(
groupname = as.factor(sample(c("a", "b", "c"), 10, replace = TRUE)),
someuser = sample(c("x", "y", "z"), 10, replace = TRUE))
Data
groupname someuser
1 a x
2 b x
3 b z
4 c y
5 a z
6 c y
7 c z
8 b z
9 b y
10 a z
res <- lapply(split(Data, Data$groupname), function(x)
data.frame(groupname=x$groupname[1], someuser=names(sort(table(x$someuser),
decreasing=TRUE))[1]))
do.call(rbind, res)
groupname someuser
a a z
b b z
c c y
And using ddply
sort_fn2 <- function(x) {names(sort(table(x$someuser), decreasing=TRUE))[1]}
ddply(Data, .(groupname), .fun=sort_fn2)
groupname V1
1 a z
2 b z
3 c y
I've got two data frames in which the unique identifiers common to both frames differ in the number of observations. I would like to create a dataframe from both in which the observations from each frame are taken if they have more observations for a common identifier. For example:
f1 <- data.frame(x = c("a", "a", "b", "c", "c", "c"), y = c(1,1,2,3,3,3))
f2 <- data.frame(x = c("a","b", "b", "c", "c"), y = c(4,5,5,6,6))
I would like this to generate a merge based on the longer x such that it produces:
x y
a 1
a 1
b 5
b 5
c 3
c 3
c 3
Any and all thoughts would be great.
Here's a solution using split
dd<-rbind(cbind(f1, s="f1"), cbind(f2, s="f2"))
keep<-unsplit(lapply(split(dd$s, dd$x), FUN=function(x) {
y<-table(x)
x == names(y[which.max(y)])
}), dd$x)
dd <- dd[keep,]
Normally I'd prefer to use the ave function here, but because I'm changing data types from a factor to a logical, it wasn't as appropriate, so I basically copied the idea that ave uses and used split.
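For comparison, here is a sketch of how ave could still be used if it only ever returns numbers: count rows per identifier and source, take the largest count per identifier, and compare the two. The column names `n` and `nmax` and the result name `kept` are made up for this illustration. If both frames have the same number of rows for an identifier, this version keeps the rows from both.

```r
f1 <- data.frame(x = c("a", "a", "b", "c", "c", "c"), y = c(1, 1, 2, 3, 3, 3))
f2 <- data.frame(x = c("a", "b", "b", "c", "c"), y = c(4, 5, 5, 6, 6))
dd <- rbind(cbind(f1, s = "f1"), cbind(f2, s = "f2"))

# Number of rows for each identifier within each source frame
dd$n <- ave(seq_along(dd$x), dd$x, dd$s, FUN = length)

# Largest such count for each identifier across both frames
dd$nmax <- ave(dd$n, dd$x, FUN = max)

# Keep the rows that come from the source with the most observations
kept <- dd[dd$n == dd$nmax, c("x", "y")]
kept
```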
dplyr solution
library(dplyr)
First we combine the data with rbind() and introduce a new variable called ref to know where each observation came from:
both <- rbind( f1, f2 )
both$ref <- rep( c( "f1", "f2" ) , c( nrow(f1), nrow(f2) ) )
Then count the observations: make another new variable that contains the number of observations for each ref and x combination:
both_with_counts <- both %>%
group_by( ref ,x ) %>%
mutate( counts = n() )
Then filter for the largest count:
both_with_counts %>% group_by( x ) %>% filter( counts==max(counts) )
note: you could also select only the x and y cols with select(x,y)...
this gives:
## Source: local data frame [7 x 4]
## Groups: x
##
## x y ref counts
## 1 a 1 f1 2
## 2 a 1 f1 2
## 3 c 3 f1 3
## 4 c 3 f1 3
## 5 c 3 f1 3
## 6 b 5 f2 2
## 7 b 5 f2 2
Altogether now...
what_I_want <-
rbind(cbind(f1,ref = "f1"),cbind(f2,ref = "f2")) %>%
group_by(ref,x) %>%
mutate(counts = n()) %>%
group_by( x ) %>%
filter( counts==max(counts) ) %>%
select( x, y )
and thus:
> what_I_want
# Source: local data frame [7 x 2]
# Groups: x
#
# x y
# 1 a 1
# 2 a 1
# 3 c 3
# 4 c 3
# 5 c 3
# 6 b 5
# 7 b 5
Not an elegant answer, but it still gives the desired result. Hope this helps.
f1table <- data.frame(table(f1$x))
colnames(f1table) <- c("x","freq")
f1new <- merge(f1,f1table)
f2table <- data.frame(table(f2$x))
colnames(f2table) <- c("x","freq")
f2new <- merge(f2,f2table)
table <- rbind(f1table, f2table)
table <- table[with(table, order(x,-freq)), ]
table <- table[!duplicated(table$x), ]
data <-rbind(f1new, f2new)
merge(data, table, by=c("x","freq"))[,c(1,3)]
x y
1 a 1
2 a 1
3 b 5
4 b 5
5 c 3
6 c 3
7 c 3