I have a large data frame with hundreds of thousands of rows, and I want to add a column whose value is a sample of a subset of another data frame, based on common names in the two data frames. It might be easier to explain with examples...
largeDF <- data.frame(colA = c('a', 'b', 'b', 'a', 'a', 'b'),
                      colB = c('x', 'y', 'y', 'x', 'y', 'y'),
                      colC = 1:6)
sampleDF <- data.frame(colA = c('a','a','a','a','b','b','b','b','b','b'),
                       colB = c('x','x','y','y','x','y','y','y','y','y'),
                       sample = 1:10)
I then want to add a new column sample to largeDF, which is a random sample of the sample column in sampleDF for the appropriate subset of colA and colB.
For example, for the first row the values are a and x, so the value will be a random sample of 1 or 2, for the next row (b and y) it will be a random sample of 6, 7, 8, 9 or 10.
So we could end up with something like:
  colA colB colC sample
1 a x 1 2
2 b y 2 9
3 b y 3 7
4 a x 4 2
5 a y 5 4
6 b y 6 8
Any help would be appreciated!
Using dplyr... (This throws a few warnings, but appears to work anyway.)
library(dplyr)
largeDF <- largeDF %>% group_by(colA, colB) %>%
  mutate(sample = sample(sampleDF$sample[sampleDF$colA == colA & sampleDF$colB == colB],
                         size = n(), replace = TRUE))
largeDF
colA colB colC sample
<fctr> <fctr> <int> <int>
1 a x 1 2
2 b y 2 6
3 b y 3 9
4 a x 4 1
5 a y 5 4
6 b y 6 9
You could do something like this:
largeDF$sample <- apply(largeDF, 1, function(a)
  with(sampleDF, sample(sampleDF[colA == a[1] & colB == a[2], ]$sample, 1)))
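One caveat with this per-row lookup (an aside, not from the original answer): if a colA/colB combination ever matches exactly one row of sampleDF, sample(x, 1) on a length-1 numeric x draws from 1:x rather than returning x. The resample helper suggested in the Examples of ?sample guards against that:
# resample() avoids sample()'s special case for length-1 numeric input
resample <- function(x, ...) x[sample.int(length(x), ...)]

largeDF$sample <- apply(largeDF, 1, function(a)
  with(sampleDF, resample(sample[colA == a[1] & colB == a[2]], 1)))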
I do not quite understand the question, but it seems you just want to add a new column to the large data frame that is a random draw from the "sample" column of the other data frame...
See if the following code gives you an idea of the functionality you need:
cbind.data.frame(largeDF, sample = sample(sampleDF$sample, nrow(largeDF)))
# colA colB colC sample
#1 a x 1 9
#2 b y 2 10
#3 b y 3 1
#4 a x 4 3
#5 a y 5 6
#6 b y 6 7
I think this is one possible solution for you...
library(dplyr)
largeDF_sample <- sapply(1:nrow(largeDF), function(x) {
  sampleDF_part <- filter(sampleDF, colA == largeDF$colA[x] & colB == largeDF$colB[x])
  return(sample(sampleDF_part$sample)[1])
})
largeDF$sample <- largeDF_sample
I have a data frame that looks like this:
> data <- data.frame(foo=c(1, 1, 2, 3, 3, 3), bar=c('a', 'b', 'a', 'b', 'c', 'd'))
> data
foo bar
1 1 a
2 1 b
3 2 a
4 3 b
5 3 c
6 3 d
I would like to create a new column bars_by_foo which is the concatenation of the values of bar by foo. So the new data should look like this:
foo bar bars_by_foo
1 1 a ab
2 1 b ab
3 2 a a
4 3 b bcd
5 3 c bcd
6 3 d bcd
I was hoping that the following would work:
p <- function(v) {
  Reduce(f = paste, x = v)
}
data %>%
  group_by(foo) %>%
  mutate(bars_by_foo = p(bar))
But that code gives me an error
Error: incompatible types, expecting a character vector.
What am I doing wrong?
You could simply do
data %>%
  group_by(foo) %>%
  mutate(bars_by_foo = paste0(bar, collapse = ""))
Without any helper functions
It looks like there's a bit of an issue with the mutate function here - I've found it's often better to work with summarise when you're grouping data in dplyr (though that's in no way a hard and fast rule).
The paste function also introduces whitespace into the result, so either set sep = "" or just use paste0.
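A quick illustration of the difference:
paste("a", "b")            # "a b" -- default is sep = " "
paste("a", "b", sep = "")  # "ab"
paste0("a", "b")           # "ab" -- shorthand for sep = ""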
Here is my code:
p <- function(v) {
  Reduce(f = paste0, x = v)
}
data %>%
  group_by(foo) %>%
  summarise(bars_by_foo = p(as.character(bar))) %>%
  merge(., data, by = 'foo') %>%
  select(foo, bar, bars_by_foo)
Resulting in...
foo bar bars_by_foo
1 1 a ab
2 1 b ab
3 2 a a
4 3 b bcd
5 3 c bcd
6 3 d bcd
You can try this:
agg <- aggregate(bar~foo, data = data, paste0, collapse="")
df <- merge(data, agg, by = "foo", all = T)
colnames(df) <- c(colnames(data), "bars_by_foo") # optional
# foo bar bars_by_foo
# 1 1 a ab
# 2 1 b ab
# 3 2 a a
# 4 3 b bcd
# 5 3 c bcd
# 6 3 d bcd
Your function works if you ensure that bar is a character vector rather than a factor.
data <- data.frame(foo = c(1, 1, 2, 3, 3, 3), bar = c('a', 'b', 'a', 'b', 'c', 'd'),
                   stringsAsFactors = FALSE)
library("dplyr")
p <- function(v) {
  Reduce(f = paste, x = v)
}
data %>%
  group_by(foo) %>%
  mutate(bars_by_foo = p(bar))
Source: local data frame [6 x 3]
Groups: foo [3]
foo bar bars_by_foo
<dbl> <chr> <chr>
1 1 a a b
2 1 b a b
3 2 a a
4 3 b b c d
5 3 c b c d
6 3 d b c d
I wonder if there is a way to do:
df <- data.frame(x = 1:3)
df$y = df$x + 5
yielding:
x y
1 1 6
2 2 7
3 3 8
in one line of code where the y column refers to the x column? For example:
data.frame(x = 1:3, y = self$x + 5) # doesn't work
(I won't accept answers that ignore the x column, for example data.frame(x = 1:3, y = 6:8) :-))
This is possible using tibble from the tibble library. Credit to @DaveArmstrong from the comments.
library(tibble)
tibble(x = 1:3, y = x + 5)
# A tibble: 3 × 2
x y
<int> <dbl>
1 1 6
2 2 7
3 3 8
Here's a base R method that does not need an external package (e.g. tibble).
We can use outer to add 5 to each element, then cbind the result with the x values.
setNames(data.frame(cbind(1:3, outer(1:3, 5, `+`))), c("x", "y"))
# or to expand your code
setNames(cbind(data.frame(x = 1:3), outer(1:3, 5, `+`)), c("x", "y"))
x y
1 1 6
2 2 7
3 3 8
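For completeness, base R's transform also evaluates its arguments in the context of the data frame being built on, so the same result fits on one line (just another option, not part of the answer above):
transform(data.frame(x = 1:3), y = x + 5)
#   x y
# 1 1 6
# 2 2 7
# 3 3 8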
I need to subset data for cases where columns don't match. For example, if the first column X has an identifier like 1, then all of the corresponding entries in column Y should match:
X <- rep(1:4, times=2, each=2)
Y <- rep(c("Dave","Sam","Sam","Sam"))
Z <- as.data.frame(cbind(X,Y))
head(Z)
So in this example I would like to subset the rows where X = 1 and X = 3, since column Y doesn't fully agree within those groups (for X = 2 and X = 4 it does, so those stay out of the subset). It would be great to have a function that does this kind of subsetting on a larger data frame.
Thanks,
With dplyr:
df <- data.frame(x = rep(1:4, times = 2, each = 2),
                 y = rep(c("Dave","Sam","Sam","Sam")))
library(dplyr)
df %>%
  group_by(x) %>%
  filter(any(!y == lag(y), na.rm = TRUE))
#> Source: local data frame [8 x 2]
#> Groups: x [2]
#>
#> x y
#> <int> <fctr>
#> 1 1 Dave
#> 2 1 Sam
#> 3 3 Dave
#> 4 3 Sam
#> 5 1 Dave
#> 6 1 Sam
#> 7 3 Dave
#> 8 3 Sam
I tested some cases, but I'm not sure this holds for all edge cases.
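If the lag-based condition feels fragile, one alternative (a sketch, assuming the goal is simply the groups where y takes more than one value) is to count distinct values instead, which sidesteps row order entirely:
df %>%
  group_by(x) %>%
  filter(n_distinct(y) > 1)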
This is the way I would do it, though there may be a more elegant way. Is this what you need?
X <- rep(1:4, times=2, each=2)
Y <- rep(c("Dave","Sam","Sam","Sam"))
Z <- as.data.frame(cbind(X,Y))
head(Z)
# First create a concatenated column
Z$XY <- paste(Z$X, Z$Y)
# Eliminate all duplicates
Z_unique <- unique(Z)
# Find the number of occurrences of each X value
n_occur <- data.frame(table(Z_unique$X))
# Show only those that occurred more than once
n_occur[n_occur$Freq > 1, ]
# Subset the output to only those values
output <- Z[Z$X %in% n_occur$Var1[n_occur$Freq > 1], ]
We can use data.table
library(data.table)
setDT(df)[, .SD[any(!y == shift(y))], x]
# x y
#1: 1 Dave
#2: 1 Sam
#3: 1 Dave
#4: 1 Sam
#5: 3 Dave
#6: 3 Sam
#7: 3 Dave
#8: 3 Sam
data
df <- data.frame(x = rep(1:4, times = 2, each = 2),
                 y = rep(c("Dave","Sam","Sam","Sam")))
I didn't find a solution for this common grouping problem in R:
This is my original dataset
ID State
1 A
2 A
3 B
4 B
5 B
6 A
7 A
8 A
9 C
10 C
This should be my grouped resulting dataset
State min(ID) max(ID)
A 1 2
B 3 5
A 6 8
C 9 10
So the idea is to sort the dataset first by the ID column (or a timestamp column). Then all runs of identical consecutive states should be grouped together, and the min and max ID values should be returned. It's related to the rle method, but rle alone doesn't give the min and max values for the groups.
Any ideas?
You could try:
library(dplyr)
df %>%
  mutate(rleid = cumsum(State != lag(State, default = ""))) %>%
  group_by(rleid) %>%
  summarise(State = first(State), min = min(ID), max = max(ID)) %>%
  select(-rleid)
Or, as mentioned by @alistaire in the comments, you can actually mutate within group_by() with the same syntax, combining the first two steps. Stealing data.table::rleid() and using summarise_all() to simplify:
df %>%
  group_by(State, rleid = data.table::rleid(State)) %>%
  summarise_all(funs(min, max)) %>%
  select(-rleid)
Which gives:
# A tibble: 4 × 3
# State min max
# <fctr> <int> <int>
#1 A 1 2
#2 B 3 5
#3 A 6 8
#4 C 9 10
Here is a method that uses the rle function in base R for the data set you provided.
# get the run length encoding
temp <- rle(df$State)
# construct the data.frame
newDF <- data.frame(State = temp$values,
                    min.ID = c(1, head(cumsum(temp$lengths) + 1, -1)),
                    max.ID = cumsum(temp$lengths))
which returns
newDF
State min.ID max.ID
1 A 1 2
2 B 3 5
3 A 6 8
4 C 9 10
Note that rle requires a character vector rather than a factor, so I use the as.is argument below.
As @cryo111 notes in the comments below, the ID column might be unordered timestamps that do not correspond to the run lengths calculated by rle. For this method to work, you would need to first convert the timestamps to a date-time format with a function like as.POSIXct, order the rows by ID, and then employ a slight alteration of the method above:
# order the rows first (convert timestamps with as.POSIXct beforehand if needed)
df <- df[order(df$ID), ]
# get the run length encoding
temp <- rle(df$State)
# construct the data.frame, indexing ID by the run boundaries
newDF <- data.frame(State = temp$values,
                    min.ID = df$ID[c(1, head(cumsum(temp$lengths) + 1, -1))],
                    max.ID = df$ID[cumsum(temp$lengths)])
data
df <- read.table(header=TRUE, as.is=TRUE, text="ID State
1 A
2 A
3 B
4 B
5 B
6 A
7 A
8 A
9 C
10 C")
An idea with data.table:
require(data.table)
dt <- fread("ID State
1 A
2 A
3 B
4 B
5 B
6 A
7 A
8 A
9 C
10 C")
dt[, rle := rleid(State)]
dt2 <- dt[, list(min = min(ID), max = max(ID)), by = c("rle", "State")]
which gives:
rle State min max
1: 1 A 1 2
2: 2 B 3 5
3: 3 A 6 8
4: 4 C 9 10
The idea is to identify runs with rleid and then get the min and max of ID by the tuple (rle, State).
You can remove the rle column with:
dt2[,rle:=NULL]
Chained:
dt2 <- dt[, list(min = min(ID), max = max(ID)), by = c("rle", "State")][, rle := NULL]
You can shorten the above code even more by using rleid inside by directly:
dt2 <- dt[, .(min=min(ID),max=max(ID)), by=.(State, rleid(State))][, rleid:=NULL]
Here is another attempt using rle and aggregate from base R:
rl <- rle(df$State)
newdf <- data.frame(ID=df$ID, State=rep(1:length(rl$lengths),rl$lengths))
newdf <- aggregate(ID~State, newdf, FUN = function(x) c(minID=min(x), maxID=max(x)))
newdf$State <- rl$values
# State ID.minID ID.maxID
# 1 A 1 2
# 2 B 3 5
# 3 A 6 8
# 4 C 9 10
data
df <- structure(list(ID = 1:10,
                     State = c("A", "A", "B", "B", "B", "A", "A", "A", "C", "C")),
                .Names = c("ID", "State"), class = "data.frame",
                row.names = c(NA, -10L))
Consider the following data frame:
df <- data.frame(var1 = 1:5, var2 = c(5,6,7,8,1))
> df
var1 var2
1 1 5
2 2 6
3 3 7
4 4 8
5 5 1
I'd like to remove all rows whose values are flipped across the two columns. In this case, it would be row 1 and row 5 as the values 1 and 5 in row 1 are flipped to 5 and 1 in row 5. These two rows should be removed.
I hope it came clear what I am asking for :-)
Kind regards!
Perhaps something like this could work too:
df <- data.frame(var1 = 1:5, var2 = c(5,6,7,8,1))
df[!do.call(paste, df) %in% do.call(paste, rev(df)), ]
var1 var2
2 2 6
3 3 7
4 4 8
I'd have to test it on a few more cases, but the general idea is to use rev to reverse the order of the columns in df, paste them together, and compare that with the pasted columns of df itself.
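To make the comparison concrete, here is what the two pasted vectors look like for this df:
do.call(paste, df)       # "1 5" "2 6" "3 7" "4 8" "5 1"
do.call(paste, rev(df))  # "5 1" "6 2" "7 3" "8 4" "1 5"
# rows 1 and 5 match an entry in the reversed set, so they are dropped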
Here's a simple but not especially elegant way: make a reversed data frame with a flag, and then merge it onto df:
# Make a reversed dataset
fd <- data.frame(var1 = df$var2, var2 = df$var1, flag = TRUE)
# Merge it onto your original df, then drop the matched rows and the flag var
df.sub <- subset(merge(x = df, y = fd, by = c("var1", "var2"), all.x = TRUE),
subset = is.na(flag),
select = c("var1", "var2"))
Using a bit of maths: two rows are the same up to a permutation if their sum and absolute difference are the same:
df[with(df, !duplicated(data.frame(var1 + var2, abs(var1 - var2)))),]
# var1 var2
#1 1 5
#2 2 6
#3 3 7
#4 4 8
Edit: I should've read the question more carefully; to remove both duplicates, follow Ananda's suggestion:
df.ind = with(df, data.frame(var1 + var2, abs(var1 - var2)))
df[!duplicated(df.ind) & !duplicated(df.ind, fromLast = TRUE),]
# var1 var2
#2 2 6
#3 3 7
#4 4 8
If creating a copy doesn't cause memory issues, then this works as well:
df <- data.frame(var1 = 1:5, var2 = c(5,6,7,8,1))
df2 <- data.frame(var12 = 1:5, var22 = c(5,6,7,8,1))
df3 <- merge(df,df2, by.x = 'var2', by.y = 'var12', all.x = TRUE)
df3 <- subset(
df3,
is.na(var22),
select = c('var1','var2')
)
Output:
> df3
var1 var2
3 2 6
4 3 7
5 4 8
I tried merging df with df but that gives a warning about the column var2 being duplicated. Anybody know what to do?
If you can assume there are no duplicates in the data frame, here's a one-line answer (still not especially concise):
library(data.table)
df[!duplicated(rbindlist(list(df, df[, 2:1])))[nrow(df) + 1:nrow(df)], ]
## var1 var2
## 2 2 6
## 3 3 7
## 4 4 8
rbindlist is necessary here because rbind(df,df[,2:1]) will match by column name rather than index, so the other option is something like rbind(df,setnames(df[,2:1],names(df))). If you want to keep duplicates from the original, this gets even more unpleasant:
> df <- data.frame(var1 = 1:5, var2 = c(5,6,7,8,1))
> df<-rbind(df,c(2,6))
> df[!duplicated(rbindlist(list(df,df[,2:1])))[nrow(df)+1:nrow(df)],]
var1 var2
2 2 6
3 3 7
4 4 8
> df[!duplicated(rbindlist(list(df,df[,2:1])))[nrow(df)+1:nrow(df)] | duplicated(df),]
var1 var2
2 2 6
3 3 7
4 4 8
6 2 6