I have data1 like:
id number
1 a
2 b
3 c
The other data2 like:
id value
1 x
2 y
3 z
I hope to merge two datasets like
a x
a y
a z
b x
b y
b z
c x
c y
c z
Both datasets have about 10k rows, so I can't do this by hand. Could someone give me a suggestion? Thanks!
You can either use 'expand.grid()' like lmo pointed out, or in a more tidyverse fashion:
library(tidyverse)
Creating dataframes ("tibbles"):
dt1 <- tribble(
~id, ~number,
1, "a",
2, "b",
3, "c" )
dt2 <- tribble(
~id, ~value,
1, "x",
2, "y",
3, "z")
Using lmo's suggestion, expand.grid():
expand.grid(dt1$number, dt2$value)
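A tidyverse approach (tidyr's expand() followed by a dplyr join) would be as follows. Note that it relies on the id values in dt2 matching those in dt1: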
dt2 %>%
expand(id, value) %>%
dplyr::left_join(dt1) %>%
select(-id)
Resulting in:
Joining, by = "id"
# A tibble: 9 × 2
value number
<chr> <chr>
1 x a
2 y a
3 z a
4 x b
5 y b
6 z b
7 x c
8 y c
9 z c
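If you only need the cross product in the order shown in the question, a base R sketch like the following may be enough. It assumes the two tables from the question (rebuilt here as plain data frames) and uses expand.grid(), which varies its first argument fastest, so value is passed first and the columns are reordered afterwards. Keep in mind that two inputs of 10k rows each would produce 100 million rows.

```r
# Sample data rebuilt from the question (assumed to be plain data frames)
dt1 <- data.frame(id = 1:3, number = c("a", "b", "c"))
dt2 <- data.frame(id = 1:3, value = c("x", "y", "z"))

# expand.grid() varies the first argument fastest, so pass `value` first
# and then reorder the columns to get a x, a y, a z, b x, ...
res <- expand.grid(value = dt2$value, number = dt1$number,
                   stringsAsFactors = FALSE)[, c("number", "value")]
print(res)
```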
Related
I am new to R and there may be a simple solution to this but I'm struggling to find one.
I wish to subset a data frame to exclude all rows that don't have both values offered in another row.
So, let's say this is my data frame:
df1
v1 v2 v3
A 1 x
A 2 y
A 3 x
B 4 x
C 5 y
C 6 y
D 7 y
D 8 x
I wish to eliminate any rows that do NOT have both an x and a y value (v3) for the corresponding letter (v1), while keeping all other columns (v2) intact, so my final result would be:
v1 v2 v3
A 1 x
A 2 y
A 3 x
D 7 y
D 8 x
Only values A and D would be retained because they have both a corresponding x and a corresponding y value. B and C would be eliminated since they only have either x OR y but not both.
I've tried using group_by and filter. The result comes out as an empty data frame:
library(dplyr)
df2 <- df1 %>%
group_by(v1) %>%
filter(all(c('x', 'y') %in% v3))
as well as:
library(dplyr)
df2 <- df1 %>%
group_by(v1) %>%
filter(any(v3 == "x"),
any(v3 == "y"))
df1 %>%
group_by(v1) %>%
filter(all(unique(df1$v3) %in% v3))
# A tibble: 5 x 3
# Groups: v1 [2]
v1 v2 v3
<chr> <int> <chr>
1 A 1 x
2 A 2 y
3 A 3 x
4 D 7 y
5 D 8 x
Try this aggregate solution
df1[df1$v1 %in% names( which( table(
aggregate( . ~ v3 + v1, df1, c )[,"v1"] ) > 1 )),]
v1 v2 v3
1 A 1 x
2 A 2 y
3 A 3 x
7 D 7 y
8 D 8 x
Data
df1 <- structure(list(v1 = c("A", "A", "A", "B", "C", "C", "D", "D"),
v2 = 1:8, v3 = c("x", "y", "x", "x", "y", "y", "y", "x")), class = "data.frame", row.names = c(NA,
-8L))
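A base R alternative, offered as a sketch rather than the definitive way: count the distinct v3 values per v1 with tapply() and keep the groups that cover every value seen in the data. The names n_distinct_v3 and keep_groups are only illustrative.

```r
df1 <- data.frame(v1 = c("A", "A", "A", "B", "C", "C", "D", "D"),
                  v2 = 1:8,
                  v3 = c("x", "y", "x", "x", "y", "y", "y", "x"))

# Number of distinct v3 values within each v1 group
n_distinct_v3 <- tapply(df1$v3, df1$v1, function(g) length(unique(g)))

# Keep the groups that contain every v3 value present in the data
keep_groups <- names(n_distinct_v3)[n_distinct_v3 == length(unique(df1$v3))]
res <- df1[df1$v1 %in% keep_groups, ]
print(res)
```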
I'm using a group-by function on a dataset in R, but my target groups overlap, so some IDs would have to be counted more than once. Here is the sample dataset:
ID Var1
A 1
A 3
B 2
C 3
C 1
D 2
With an ordinary group-by on each ID, I can do
library(data.table)
DT <- data.table(dataset)
DT[, sum(Var1), by = ID]
and get the result:
ID V1
A 4
B 2
C 4
D 2
However, I have to group ID into A+B, B+C, and D
(say F = A+B and G = B+C),
and the target result is:
ID V1
F 6
G 6
D 2
If I simply recode ID, B, which belongs to both groups, gets overwritten.
Does anyone have a solution?
Many thanks!
library(dplyr)
library(tidyr)
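# assumes df holds the sample data with columns ID and Var1
# one 0/1 indicator column per target group; B is flagged for both F and G
df <- df %>% mutate(F = ifelse(ID %in% c("A", "B"), 1, 0),
                    G = ifelse(ID %in% c("B", "C"), 1, 0),
                    D = ifelse(ID == "D", 1, 0))
# reshape to long form, keep the flagged rows, and sum Var1 per group
df %>%
  gather(var, val, F:D) %>%
  filter(val == 1) %>%
  group_by(var) %>%
  summarise(V1 = sum(Var1))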
# # A tibble: 3 x 2
# var V1
# <chr> <dbl>
# 1 D 2
# 2 F 6
# 3 G 6
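Another way to express overlapping groups, sketched here in base R, is a small lookup table in which an ID may appear once per target group, followed by a join and an aggregate. The table name map and its column group are hypothetical names chosen for this example.

```r
dataset <- data.frame(ID = c("A", "A", "B", "C", "C", "D"),
                      Var1 = c(1, 3, 2, 3, 1, 2))

# Lookup table: B appears twice because it belongs to both F and G
map <- data.frame(ID    = c("A", "B", "B", "C", "D"),
                  group = c("F", "F", "G", "G", "D"))

# The join duplicates B's rows, one copy per group it belongs to
joined <- merge(dataset, map, by = "ID")
res <- aggregate(Var1 ~ group, data = joined, FUN = sum)
print(res)
```

The advantage of a lookup table is that adding or changing a group only means editing rows in map, not writing another ifelse() column.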
I have some data like this:
X Y
-----
A 1
A 2
B 3
B 4
C 5
C 6
I would like to add a new column with values equal to the mean of all Ys in rows where X is not equal to the X of the current observation.
In this particular case we would get
X Y Mean
-------------------
A 1 (3+4+5+6)/4
A 2 (3+4+5+6)/4
B 3 (1+2+5+6)/4
B 4 (1+2+5+6)/4
C 5 (1+2+3+4)/4
C 6 (1+2+3+4)/4
Thanks in advance!
You can likely do this more succinctly, but this will get you the result.
First compute the sum of Y and the number of observations for the whole data.frame. Then group by the X column and compute the same two quantities per group. Subtracting the group values from the totals gives the sum and count of all other groups, and their ratio is the mean you want.
data
df <- data.frame(X = c("A", "A", "B", "B", "C", "C"),
Y = c(1:6))
solution
library(tidyverse)
df %>%
mutate(total_sum = sum(Y),
total_obs = n()) %>%
group_by(X) %>%
mutate(group_sum = sum(Y),
group_obs = n()) %>%
ungroup() %>%
mutate(other_group_sum = total_sum - group_sum,
other_group_obs = total_obs - group_obs,
other_mean = other_group_sum/other_group_obs) %>%
select(X, Y, other_mean)
result
# A tibble: 6 x 3
X Y other_mean
<fct> <int> <dbl>
1 A 1 4.50
2 A 2 4.50
3 B 3 3.50
4 B 4 3.50
5 C 5 2.50
6 C 6 2.50
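The same "total minus own group" idea can be sketched in base R with ave(), which returns per-group values aligned to the original rows. The column name other_mean follows the answer above; group_sum and group_n are illustrative names.

```r
df <- data.frame(X = c("A", "A", "B", "B", "C", "C"),
                 Y = 1:6)

# Per-row sum and count of the row's own group
group_sum <- ave(df$Y, df$X, FUN = sum)
group_n   <- ave(df$Y, df$X, FUN = length)

# Mean of Y over all rows that belong to the other groups
df$other_mean <- (sum(df$Y) - group_sum) / (nrow(df) - group_n)
print(df)
```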
I've got two data frames in which the unique identifiers common to both frames differ in the number of observations. I would like to create a dataframe from both in which the observations from each frame are taken if they have more observations for a common identifier. For example:
f1 <- data.frame(x = c("a", "a", "b", "c", "c", "c"), y = c(1,1,2,3,3,3))
f2 <- data.frame(x = c("a","b", "b", "c", "c"), y = c(4,5,5,6,6))
I would like this to generate a merge based on the longer x such that it produces:
x y
a 1
a 1
b 5
b 5
c 3
c 3
c 3
Any and all thoughts would be great.
Here's a solution using split
dd<-rbind(cbind(f1, s="f1"), cbind(f2, s="f2"))
keep<-unsplit(lapply(split(dd$s, dd$x), FUN=function(x) {
y<-table(x)
x == names(y[which.max(y)])
}), dd$x)
dd <- dd[keep,]
Normally I'd prefer to use the ave function here, but because I'm changing data types from a factor to a logical, it wasn't as appropriate, so I basically copied the idea that ave uses and used split.
dplyr solution
library(dplyr)
First we combine the data with rbind() and introduce a new variable called ref to record where each observation came from:
both <- rbind( f1, f2 )
both$ref <- rep( c( "f1", "f2" ) , c( nrow(f1), nrow(f2) ) )
Then we count the observations, making another new variable that holds the number of observations for each ref and x combination:
both_with_counts <- both %>%
group_by( ref ,x ) %>%
mutate( counts = n() )
Then we filter for the largest count:
both_with_counts %>% group_by( x ) %>% filter( counts==max(counts) )
note: you could also select only the x and y cols with select(x,y)...
this gives:
## Source: local data frame [7 x 4]
## Groups: x
##
## x y ref counts
## 1 a 1 f1 2
## 2 a 1 f1 2
## 3 c 3 f1 3
## 4 c 3 f1 3
## 5 c 3 f1 3
## 6 b 5 f2 2
## 7 b 5 f2 2
Altogether now...
what_I_want <-
rbind(cbind(f1,ref = "f1"),cbind(f2,ref = "f2")) %>%
group_by(ref,x) %>%
mutate(counts = n()) %>%
group_by( x ) %>%
filter( counts==max(counts) ) %>%
select( x, y )
and thus:
> what_I_want
# Source: local data frame [7 x 2]
# Groups: x
#
# x y
# 1 a 1
# 2 a 1
# 3 c 3
# 4 c 3
# 5 c 3
# 6 b 5
# 7 b 5
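For readers without dplyr, the three steps above (combine, count, keep the largest count) can be sketched in base R with ave(). This is only one possible translation; the helper columns counts and max_counts are illustrative, and when both frames have the same count for an x, rows from both are kept.

```r
f1 <- data.frame(x = c("a", "a", "b", "c", "c", "c"), y = c(1, 1, 2, 3, 3, 3))
f2 <- data.frame(x = c("a", "b", "b", "c", "c"), y = c(4, 5, 5, 6, 6))

# Step 1: combine, tagging each row with its source frame
both <- rbind(cbind(f1, ref = "f1"), cbind(f2, ref = "f2"))

# Step 2: number of rows for each (ref, x) combination
both$counts <- ave(seq_len(nrow(both)), both$ref, both$x, FUN = length)

# Step 3: keep the rows whose count is the largest for their x
both$max_counts <- ave(both$counts, both$x, FUN = max)
res <- both[both$counts == both$max_counts, c("x", "y")]
print(res)
```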
Not an elegant answer, but it still gives the desired result. Hope this helps.
f1table <- data.frame(table(f1$x))
colnames(f1table) <- c("x","freq")
f1new <- merge(f1,f1table)
f2table <- data.frame(table(f2$x))
colnames(f2table) <- c("x","freq")
f2new <- merge(f2,f2table)
table <- rbind(f1table, f2table)
table <- table[with(table, order(x,-freq)), ]
table <- table[!duplicated(table$x), ]
data <-rbind(f1new, f2new)
merge(data, table, by=c("x","freq"))[,c(1,3)]
x y
1 a 1
2 a 1
3 b 5
4 b 5
5 c 3
6 c 3
7 c 3
I want to make a grouped filter using dplyr, in a way that within each group only that row is returned which has the minimum value of variable x.
My problem is: As expected, in the case of multiple minima all rows with the minimum value are returned. But in my case, I only want the first row if multiple minima are present.
Here's an example:
df <- data.frame(
A=c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
x=c(1, 1, 2, 2, 3, 4, 5, 5, 5),
y=rnorm(9)
)
library(dplyr)
df.g <- group_by(df, A)
filter(df.g, x == min(x))
As expected, all minima are returned:
Source: local data frame [6 x 3]
Groups: A
A x y
1 A 1 -1.04584335
2 A 1 0.97949399
3 B 2 0.79600971
4 C 5 -0.08655151
5 C 5 0.16649962
6 C 5 -0.05948012
With ddply, I would have approached the task this way:
library(plyr)
ddply(df, .(A), function(z) {
z[z$x == min(z$x), ][1, ]
})
... which works:
A x y
1 A 1 -1.04584335
2 B 2 0.79600971
3 C 5 -0.08655151
Q: Is there a way to approach this in dplyr? (For speed reasons)
Update
With dplyr >= 0.3 you can use the slice function in combination with which.min, which would be my favorite approach for this task:
df %>% group_by(A) %>% slice(which.min(x))
#Source: local data frame [3 x 3]
#Groups: A
#
# A x y
#1 A 1 0.2979772
#2 B 2 -1.1265265
#3 C 5 -1.1952004
Original answer
For the sample data, it is also possible to use two filter calls one after the other:
group_by(df, A) %>%
filter(x == min(x)) %>%
filter(1:n() == 1)
Just for completeness: Here's the final dplyr solution, derived from the comments of @hadley and @Arun:
library(dplyr)
df.g <- group_by(df, A)
filter(df.g, rank(x, ties.method="first")==1)
For what it's worth, here's a data.table solution, for those who may be interested:
library(data.table)
# approach with setting keys
dt <- as.data.table(df)
setkey(dt, A,x)
dt[J(unique(A)), mult="first"]
# without using keys
dt <- as.data.table(df)
dt[dt[, .I[which.min(x)], by=A]$V1]
This can be accomplished by using row_number combined with group_by. row_number handles ties by assigning a rank not only by the value but also by the relative order within the vector. To get the first row of each group with the minimum value of x:
df.g <- group_by(df, A)
filter(df.g, row_number(x) == 1)
For more information see the dplyr vignette on window functions.
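To see why the tie-breaking matters, here is a small base R sketch. It uses rank() with ties.method = "first", which the dplyr documentation describes as equivalent to row_number(); the vector x is a made-up example with a tied minimum.

```r
x <- c(1, 1, 2)

# Default ranking: tied values share the average rank, so no element gets rank 1
r_avg <- rank(x)

# "first" breaks ties by position, so exactly one element gets rank 1
r_first <- rank(x, ties.method = "first")

print(r_avg)
print(r_first)
print(which(r_first == 1))
```

With the default method neither tied element has rank exactly 1, which is why a test like rank(x) == 1 can return no rows for a group with a tied minimum.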
dplyr offers the slice_min function, which does the job with the argument with_ties = FALSE:
library(dplyr)
df %>%
group_by(A) %>%
slice_min(x, with_ties = FALSE)
Output :
# A tibble: 3 x 3
# Groups: A [3]
A x y
<fct> <dbl> <dbl>
1 A 1 0.273
2 B 2 -0.462
3 C 5 1.08
Another way to do it:
set.seed(1)
x <- data.frame(a = rep(1:2, each = 10), b = rnorm(20))
x <- dplyr::arrange(x, a, b)
dplyr::filter(x, !duplicated(a))
Result:
a b
1 1 -0.8356286
2 2 -2.2146999
Could also be easily adapted for getting the row in each group with maximum value.
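As a sketch of that adaptation, written in base R so it runs without dplyr: sort descending within each group and again keep the first row per group. The names x_desc and res are illustrative.

```r
set.seed(1)
x <- data.frame(a = rep(1:2, each = 10), b = rnorm(20))

# Sort by group, then by b in descending order within each group
x_desc <- x[order(x$a, -x$b), ]

# The first row of each group now holds that group's maximum
res <- x_desc[!duplicated(x_desc$a), ]
print(res)
```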
If you are looking to filter on the minimum of x and then on the minimum of y, an intuitive way to do it is to use filtering functions:
> df
A x y
1 A 1 1.856368296
2 A 1 -0.298284187
3 A 2 0.800047796
4 B 2 0.107289719
5 B 3 0.641819999
6 B 4 0.650542284
7 C 5 0.422465687
8 C 5 0.009819306
9 C 5 -0.482082635
df %>% group_by(A) %>%
filter(x == min(x), y == min(y))
# A tibble: 3 x 3
# Groups: A [3]
A x y
<chr> <dbl> <dbl>
1 A 1 -0.298
2 B 2 0.107
3 C 5 -0.482
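This works for the sample data, but note that both conditions are evaluated on the whole group, so a group is dropped entirely if its smallest y is not on a row that also has the smallest x.
Chaining two filters avoids this, because the second filter only sees the rows that survived the first, and it looks even more readable: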
df %>% group_by(A) %>%
filter(x == min(x)) %>%
filter(y == min(y))
# A tibble: 3 x 3
# Groups: A [3]
A x y
<chr> <dbl> <dbl>
1 A 1 -0.298
2 B 2 0.107
3 C 5 -0.482
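The difference between one combined condition and two sequential filters can be seen on a small made-up group. The sketch below uses base R subsetting on a single group to mimic the logic; the data frame g and its values are hypothetical and are not taken from the example above.

```r
# One group where the smallest y is NOT on a row with the smallest x
g <- data.frame(A = c("A", "A", "A"),
                x = c(1, 1, 2),
                y = c(0.5, 0.3, -1.0))

# Combined condition: min(y) is taken over the whole group (-1.0)
combined <- g[g$x == min(g$x) & g$y == min(g$y), ]

# Sequential: restrict to the minimum x first, then take the minimum y there
step1 <- g[g$x == min(g$x), ]
sequential <- step1[step1$y == min(step1$y), ]

print(nrow(combined))
print(sequential)
```

Here the combined condition returns no rows, while the sequential version returns the row with x = 1 and y = 0.3.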
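I like sqldf for its simplicity. Note that this relies on SQLite (sqldf's default backend) taking y from the row that holds the minimum when a bare column is selected alongside min():
library(sqldf)
sqldf("select A,min(X),y from 'df.g' group by A")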
Output:
A min(X) y
1 A 1 -1.4836989
2 B 2 0.3755771
3 C 5 0.9284441
For the sake of completeness, here's the base R answer:
df[with(df, ave(x, A, FUN = \(x) rank(x, ties.method = "first")) == 1), ]
# A x y
#1 A 1 0.1076158
#4 B 2 -1.3909084
#7 C 5 0.3511618