R: select the second element in a group

I am trying to find a more R-esque way of selecting the 2nd (but NOT the first) element of a group in R.
I ended up: 1. creating an index, rowNumIndex; 2. putting the first row of each group in one data frame and the first two rows of each group in a separate data frame; and then 3. "reverse merging" the two data frames to keep only the rows unique to the data frame with the first two rows:
library(plyr)
firsts <- ddply(df, .(group), function(x) head(x, 1))  # 2 records using data below
seconds <- ddply(df, .(group), function(x) head(x, 2)) # 4 records using data below
real.seconds <- seconds[!seconds$rowNumIndex %in% firsts$rowNumIndex, ] # 2 records, the second elements only
Here's some pretend data:
group var1 rowNumIndex
A 8 1
A 9 2
A 10 3
B 11 4
B 12 5
B 13 6
B 14 7
structure(list(group = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 2L
), .Label = c("A", "B"), class = "factor"), var1 = 8:14, rowNumIndex = 1:7), .Names = c("group",
"var1", "rowNumIndex"), class = "data.frame", row.names = c(NA,
-7L))
So, data frame firsts looks like:
group var1 rowNumIndex
A 8 1
B 11 4
And data frame seconds looks like:
group var1 rowNumIndex
A 8 1
A 9 2
B 11 4
B 12 5
And data frame real.seconds looks like:
group var1 rowNumIndex
A 9 2
B 12 5
Is there a way to do this w/o resorting to, e.g., the index? Thanks in advance for what will undoubtedly be a soul-crushingly simple and elegant solution!

A solution with dplyr:
library(dplyr)
group_by(df, group) %>% slice(2)
# group var1 rowNumIndex
# <fctr> <int> <int>
# 1 A 9 2
# 2 B 12 5
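If you prefer a predicate to a position, the same result can be written with filter() and row_number() (a sketch; like slice(2), this drops groups with fewer than two rows rather than padding with NA):
group_by(df, group) %>% filter(row_number() == 2)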
Pre-dplyr 0.3 alternative:
group_by(df, group) %.% filter(seq_along(var1) == 2)
group var1 rowNumIndex
1 A 9 2
2 B 12 5
This solution will keep all the columns of the data. If you just want the two columns (group and var1), you can do this:
group_by(df, group) %.% summarise(var1[2])
group var1[2]
1 A 9
2 B 12
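dplyr also ships an nth() helper, which reads a little more clearly than var1[2] (a sketch):
group_by(df, group) %>% summarise(var1 = nth(var1, 2))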
A solution with split, lapply and do.call
real.seconds <- do.call("rbind", lapply(split(df, df$group), function(x) x[2, ]))
This will give you:
real.seconds
group var1 rowNumIndex
A A 9 2
B B 12 5
Or, more elegantly, with by:
real.seconds <- do.call(rbind, by(df, df$group, function(x) x[2, ]))

I would use data.table:
library(data.table)
dt = data.table(df)
dt[,var1[2],by=group]
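That returns only var1; to keep all columns, you can subset the per-group data .SD instead (a sketch):
# .SD is the subset of the data for the current group; take its second row
dt[, .SD[2], by = group]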
As I think about it, there's no reason you shouldn't be able to do this with plyr:
ddply(df, .(group), function(x) x[2,])

A base alternative, where only 'var1' is aggregated:
aggregate(var1 ~ group, data = df, `[`, 2)
...or if you wish to aggregate all columns in the data frame, you can use the 'dot notation':
aggregate(. ~ group, data = df, `[`, 2)
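For reference, on the example data these calls should return (output reconstructed by hand, so treat it as a sketch):
# aggregate(var1 ~ group, data = df, `[`, 2)
#   group var1
# 1     A    9
# 2     B   12
# aggregate(. ~ group, data = df, `[`, 2)
#   group var1 rowNumIndex
# 1     A    9           2
# 2     B   12           5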

Related

How can I find the number of a vector's elements in another vector?

I have two vectors. The first vector is comments$author_id and the second is enrolments$learner_id. I want to add a new column to the enrolments data frame that shows, for each row of enrolments$learner_id, the count of matching rows in comments$author_id.
Example:
if(enrolments$learner_id[1] repeated 5 times in comments$author_id)
enrolments$freqs[1] = 5
Can I do this without using any loops?
The vector samples are as follows:
df1 <- data.frame(v1 = c(1,1,1,4,5,5,4,1,2,3,5,6,2,1,5,2,3,4,1,6,4,2,3,5,1,2,5,4))
df2 <- data.frame(v2 = c(1,2,3,4,5,6))
I want to add "counts" column to "df2" that shows counts of repeated v2 element in v1.
"[tabulate] gives me this error: Error in $<-.data.frame(tmp, "comments_count", value = c(0L, 0L, : replacement has 25596 rows, data
has 25597"
That is probably because there is a value at the end of df2$v2 that never occurs in df1$v1, so the default number of bins falls short. I added 0 and 7 to your example to show that:
df1 <- data.frame(v1 = c(1,1,1,4,5,5,4,1,2,3,5,6,2,1,5,2,3,4,1,6,4,2,3,5,1,2,5,4))
df2 <- data.frame(v2 = c(1,2,3,0,4,5,6,7))
df2$count <- tabulate(factor(df1$v1, df2$v2))
# Error in `$<-.data.frame`(`*tmp*`, count, value = c(7L, 5L, 3L, 0L, 5L, :
# replacement has 7 rows, data has 8
To correct that using tabulate, which might be the fastest solution on larger data:
df2$count <- tabulate(factor(df1$v1, df2$v2), length(df2$v2))
df2
# v2 count
# 1 1 7
# 2 2 5
# 3 3 3
# 4 0 0
# 5 4 5
# 6 5 6
# 7 6 2
# 8 7 0
See ?tabulate for the documentation on that function.
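To see why the factor() step matters, here is a minimal sketch: tabulate() only counts the positive integers 1..nbins, so factor() is used to map each v1 value onto the position of the matching level in v2; values absent from v2 become NA and are ignored.
f <- factor(df1$v1, levels = df2$v2) # integer codes are positions in v2
as.integer(f)                        # the codes that tabulate() counts
tabulate(f, nbins = length(df2$v2))  # one count per element of v2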
Using your df1 and df2 example, you could do it like this:
# Make data
df1 = data.frame(v1 = c(1,1,1,4,5,5,4,1,2,3,5,6,2,1,5,2,3,4,1,6,4,2,3,5,1,2,5,4))
df2 = data.frame(v2 = c(1,2,3,4,5,6))
# Add 'count' variable as requested
df2$counts = sapply(df2$v2, function(x) {
  sum(df1$v1 == x, na.rm = TRUE) # na.rm = TRUE just in case df1$v1 has missing values
})
df2 # view output
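A base R one-liner in the same spirit uses table() to count everything in one pass and then looks the counts up by name (a sketch, using the same df1 and df2):
tab <- table(df1$v1)
df2$counts <- as.integer(tab[as.character(df2$v2)])
df2$counts[is.na(df2$counts)] <- 0 # values of v2 that never occur in v1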
What you are essentially doing is aggregating df1 to get a count, and then joining that count back onto df2. This logic translates easily to several different methods:
# base R
merge(
  df2,
  aggregate(cbind(df1[0], count = 1), df1["v1"], FUN = sum),
  by.x = "v2", by.y = "v1", all.x = TRUE
)
# data.table
library(data.table)
setDT(df1)
setDT(df2)
df2[df1[, .(count = .N), by = v1], on = c("v2" = "v1")]
# dplyr
library(dplyr)
df1 %>%
  group_by(v1) %>%
  count() %>%
  left_join(df2, ., by = c("v2" = "v1"))
# v2 count
#1 1 7
#2 2 5
#3 3 3
#4 4 5
#5 5 6
#6 6 2

Grouping of R dataframe by connected values

I didn't find a solution for this common grouping problem in R:
This is my original dataset
ID State
1 A
2 A
3 B
4 B
5 B
6 A
7 A
8 A
9 C
10 C
This should be my grouped resulting dataset
State min(ID) max(ID)
A 1 2
B 3 5
A 6 8
C 9 10
So the idea is to sort the dataset first by the ID column (or a timestamp column). Then all connected states with no gaps should be grouped together and the min and max ID value should be returned. It's related to the rle method, but this doesn't allow the calculation of min, max values for the groups.
Any ideas?
You could try:
library(dplyr)
df %>%
  mutate(rleid = cumsum(State != lag(State, default = ""))) %>%
  group_by(rleid) %>%
  summarise(State = first(State), min = min(ID), max = max(ID)) %>%
  select(-rleid)
Or, as mentioned by @alistaire in the comments, you can actually mutate within group_by() with the same syntax, combining the first two steps. Stealing data.table::rleid() and using summarise_all() to simplify:
df %>%
  group_by(State, rleid = data.table::rleid(State)) %>%
  summarise_all(funs(min, max)) %>%
  select(-rleid)
Which gives:
# A tibble: 4 × 3
# State min max
# <fctr> <int> <int>
#1 A 1 2
#2 B 3 5
#3 A 6 8
#4 C 9 10
Here is a method that uses the rle function in base R for the data set you provided.
# get the run length encoding
temp <- rle(df$State)
# construct the data.frame
newDF <- data.frame(State = temp$values,
                    min.ID = c(1, head(cumsum(temp$lengths) + 1, -1)),
                    max.ID = cumsum(temp$lengths))
which returns
newDF
State min.ID max.ID
1 A 1 2
2 B 3 5
3 A 6 8
4 C 9 10
Note that rle requires a character vector rather than a factor, so I use the as.is argument below.
As @cryo111 notes in the comments below, the IDs might be unordered timestamps, in which case the positions derived from the rle lengths would not line up with them. For this method to work, you would need to first convert the timestamps to a date-time format with a function like as.POSIXct, sort the rows with df <- df[order(df$ID), ], and then employ a slight alteration of the method above:
# get the run length encoding
temp <- rle(df$State)
# construct the data.frame
newDF <- data.frame(State = temp$values,
                    min.ID = df$ID[c(1, head(cumsum(temp$lengths) + 1, -1))],
                    max.ID = df$ID[cumsum(temp$lengths)])
data
df <- read.table(header=TRUE, as.is=TRUE, text="ID State
1 A
2 A
3 B
4 B
5 B
6 A
7 A
8 A
9 C
10 C")
An idea with data.table:
require(data.table)
dt <- fread("ID State
1 A
2 A
3 B
4 B
5 B
6 A
7 A
8 A
9 C
10 C")
dt[, rle := rleid(State)]
dt2 <- dt[, list(min = min(ID), max = max(ID)), by = c("rle", "State")]
which gives:
rle State min max
1: 1 A 1 2
2: 2 B 3 5
3: 3 A 6 8
4: 4 C 9 10
The idea is to identify sequences with rleid and then get the min and max of ID by the tuple (rle, State).
You can remove the rle column with
dt2[, rle := NULL]
Chained:
dt2 <- dt[, list(min = min(ID), max = max(ID)), by = c("rle", "State")][, rle := NULL]
You can shorten the above code even more by using rleid inside by directly:
dt2 <- dt[, .(min=min(ID),max=max(ID)), by=.(State, rleid(State))][, rleid:=NULL]
Here is another attempt using rle and aggregate from base R:
rl <- rle(df$State)
newdf <- data.frame(ID=df$ID, State=rep(1:length(rl$lengths),rl$lengths))
newdf <- aggregate(ID~State, newdf, FUN = function(x) c(minID=min(x), maxID=max(x)))
newdf$State <- rl$values
# State ID.minID ID.maxID
# 1 A 1 2
# 2 B 3 5
# 3 A 6 8
# 4 C 9 10
data
df <- structure(list(ID = 1:10, State = c("A", "A", "B", "B", "B",
"A", "A", "A", "C", "C")), .Names = c("ID", "State"), class = "data.frame",
row.names = c(NA,
-10L))

How to split columns by a certain category?

My data look like this:
dd gg site
5 10 A
7 8 A
5 6 B
7 9 B
I want to split the site column according to A and B. Desired output is:
dd gg site gg_B
5 10 A 6
7 8 A 9
It sounds like you want to split by site and then merge by the dd column. You can do this with split and merge:
Reduce(function(x, y) merge(x, y, by="dd"), split(dat, dat$site))
# dd gg.x site.x gg.y site.y
# 1 5 10 A 6 B
# 2 7 8 A 9 B
By using Reduce, this code should work even if you had more than two sites. I have performed an inner join, meaning I will only keep a row for a given value of dd if it is reported for all sites. If you wanted to instead keep a row for a given value of dd if it is used by 1 or more sites, you could use:
Reduce(function(x, y) merge(x, y, by="dd", all=TRUE), split(dat, dat$site))
Maybe you'd be happy with
library("reshape2")
dcast(dat,dd~site,value.var="gg")
## dd A B
## 1 5 10 6
## 2 7 8 9
? (This is essentially the same as the tidyr::spread() answer suggested by others.)
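(For what it's worth, spread() has since been superseded; in tidyr >= 1.0 the same reshape would be written with pivot_wider(), a sketch:)
library(tidyr)
pivot_wider(dat, names_from = site, values_from = gg)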
If the columns are always in the right order you simply cbind them:
l <- split(dat, dat$site)
l$B <- l$B$gg
cbind(l$A, l$B, deparse.level = 0)
Result:
dd gg site l$B
1 5 10 A 6
2 7 8 A 9
Data:
dat <- read.table(header = TRUE, stringsAsFactors = FALSE, text = " dd gg site
5 10 A
7 8 A
5 6 B
7 9 B ")
Your request seems strange in the way the A values of site are treated differently from the B values.
Using this data:
xx = structure(list(dd = c(5L, 7L, 5L, 7L), gg = c(10L, 8L, 6L, 9L
), site = structure(c(1L, 1L, 2L, 2L), .Label = c("A", "B"), class = "factor")), .Names = c("dd",
"gg", "site"), class = "data.frame", row.names = c(NA, -4L))
We can "spread" the columns from long to wide format using tidyr::spread. but this eliminates the site column and treats A and B values of it the same:
library(tidyr)
(xx = spread(xx, key = site, value = gg))
# dd A B
# 1 5 10 6
# 2 7 8 9
Adding the gg_ prefix to the names:
names(xx)[2:3] = paste("gg", names(xx[2:3]), sep = "_")
xx
# dd gg_A gg_B
# 1 5 10 6
# 2 7 8 9
I would prefer data in the above format. If you want to exactly match your desired output, adding xx$site = "A" and renaming the existing columns is easy enough.
You can use tidyr to turn the subset of your data that has the site you want in column in wide format and then use dplyr::inner_join to merge it with the subset of the data having the other sites.
library(dplyr)
library(tidyr)
df %>%
  filter(site == "B") %>%
  spread(key = site, value = gg) %>%
  inner_join(filter(df, site != "B"))
## Joining by: "dd"
## dd B gg site
## 1 5 6 10 A
## 2 7 9 8 A

cumsum when current obs equals next obs for same variable (column)

I want to add a column to a data frame that holds a cumulative sum of another variable, accumulating whenever a third variable has the same value in consecutive rows. For example:
Row Var1 Var2 CumVal
1 A 2 2
2 A 4 6
3 B 5 5
So I want CumVal to cumulate/sum the Var2 column if the Var1 value in row 2 equals the Var1 value in row 1. In other words, if it is equal to the observation before.
If the cumsum is based on Var1 as a grouping variable:
library(dplyr)
df %>%
  group_by(Var1) %>%
  mutate(CumVal = cumsum(Var2))
Or
library(data.table)
setDT(df)[, CumVal:=cumsum(Var2), by=Var1]
Or using base R
transform(df, CumVal=ave(Var2, Var1, FUN=cumsum))
Update
If it is instead based on runs of equal adjacent elements:
transform(df, CumVal = ave(Var2, cumsum(c(TRUE, Var1[-1] != Var1[-nrow(df)])), FUN = cumsum))
# Row Var1 Var2 CumVal
#1 1 A 2 2
#2 2 A 4 6
#3 3 B 5 5
#4 4 A 6 6
Or the dplyr approach
df %>%
  group_by(indx = cumsum(c(TRUE, (lag(Var1) != Var1)[-1]))) %>%
  mutate(CumVal = cumsum(Var2)) %>%
  ungroup() %>%
  select(-indx)
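As an aside (an assumption on my part, since neither variant above uses it): data.table's rleid() builds the same run index in a single call, a sketch:
library(data.table)
# group by runs of equal adjacent Var1 values, then cumsum within each run
setDT(df)[, CumVal := cumsum(Var2), by = rleid(Var1)]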
data
df <- structure(list(Row = 1:4, Var1 = c("A", "A", "B", "A"), Var2 = c(2L,
4L, 5L, 6L)), .Names = c("Row", "Var1", "Var2"), class = "data.frame",
row.names = c(NA, -4L))
I like rle, which detects runs of identical successive values in a vector and describes them in a nice compact way. E.g. let's say we have a vector x of length 10:
x <- c(2, 3, 2, 2, 2, 2, 0, 0, 2, 1)
rle is able to detect that there are 4 successive 2s and 2 successive 0s:
rle(x)
# Run Length Encoding
# lengths: int [1:6] 1 1 4 2 1 1
# values : num [1:6] 2 3 2 0 2 1
(in the output, we can see that there are two lengths different from 1, namely 4 and 2, corresponding to those runs)
We can use this function to apply cumsum to subvectors of another vector. Let's say we want to apply cumsum to a new vector y <- 1:10, but separately within each run of x (run membership will be stored in a factor f):
y <- 1:10
z <- rle(x)$lengths
f <- factor(rep(seq_along(z), z))
We can then use by or tapply (or something else) to achieve the desired output:
cumval <- unlist(tapply(y, f, cumsum))
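If you want the per-run cumulative sums returned directly in the original row order, ave() wraps the same split-apply-recombine into one step (a sketch using the same y and f):
cumval <- ave(y, f, FUN = cumsum) # same length and order as y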

Removing rows when flipped in two columns

Considering the following data frame:
df <- data.frame(var1 = 1:5, var2 = c(5,6,7,8,1))
> df
var1 var2
1 1 5
2 2 6
3 3 7
4 4 8
5 5 1
I'd like to remove all rows whose values are flipped across the two columns. In this case, it would be row 1 and row 5 as the values 1 and 5 in row 1 are flipped to 5 and 1 in row 5. These two rows should be removed.
I hope it came clear what I am asking for :-)
Kind regards!
Perhaps something like this could work too:
df <- data.frame(var1 = 1:5, var2 = c(5,6,7,8,1))
df[!do.call(paste, df) %in% do.call(paste, rev(df)), ]
var1 var2
2 2 6
3 3 7
4 4 8
I'd have to test it on a few more cases, but the general idea is to use rev to reverse the order of the columns in "df", paste them together, and compare that with the pasted columns of "df".
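To make the mechanics concrete, here is what the two sides of the comparison look like on the example data (output written out by hand, so treat it as a sketch):
do.call(paste, df)      # "1 5" "2 6" "3 7" "4 8" "5 1"
do.call(paste, rev(df)) # "5 1" "6 2" "7 3" "8 4" "1 5"
# rows 1 and 5 find a match in the reversed set, so they are dropped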
Here's a simple but not especially elegant way: make a reversed data frame with a flag, and then merge it on to df:
# Make a reversed dataset
fd <- data.frame(var1 = df$var2, var2 = df$var1, flag = TRUE)
# Merge it onto your original df, then drop the matched rows and the flag var
df.sub <- subset(merge(x = df, y = fd, by = c("var1", "var2"), all.x = TRUE),
                 subset = is.na(flag),
                 select = c("var1", "var2"))
Using a bit of maths: two rows are the same up to a permutation if their sum and the absolute value of their difference are the same:
df[with(df, !duplicated(data.frame(var1 + var2, abs(var1 - var2)), fromLast = TRUE)),]
# var1 var2
#2    2    6
#3    3    7
#4    4    8
#5    5    1
Edit: I should've read the question more carefully. To remove both duplicates, follow Ananda's suggestion:
df.ind = with(df, data.frame(var1 + var2, abs(var1 - var2)))
df[!duplicated(df.ind) & !duplicated(df.ind, fromLast = TRUE),]
# var1 var2
#2 2 6
#3 3 7
#4 4 8
If creating a copy doesn't cause memory issues, then this works as well:
df <- data.frame(var1 = 1:5, var2 = c(5,6,7,8,1))
df2 <- data.frame(var12 = 1:5, var22 = c(5,6,7,8,1))
df3 <- merge(df,df2, by.x = 'var2', by.y = 'var12', all.x = TRUE)
df3 <- subset(
  df3,
  is.na(var22),
  select = c('var1', 'var2')
)
Output:
> df3
var1 var2
3 2 6
4 3 7
5 4 8
I tried merging df with df but that gives a warning about the column var2 being duplicated. Anybody know what to do?
If you can assume there are no duplicates in the data frame, here's a one-line answer, though still not too concise:
library(data.table) # for rbindlist()
df[!duplicated(rbindlist(list(df, df[, 2:1])))[nrow(df) + 1:nrow(df)], ]
## var1 var2
## 2 2 6
## 3 3 7
## 4 4 8
rbindlist is necessary here because rbind(df,df[,2:1]) will match by column name rather than index, so the other option is something like rbind(df,setnames(df[,2:1],names(df))). If you want to keep duplicates from the original, this gets even more unpleasant:
> df <- data.frame(var1 = 1:5, var2 = c(5,6,7,8,1))
> df<-rbind(df,c(2,6))
> df[!duplicated(rbindlist(list(df,df[,2:1])))[nrow(df)+1:nrow(df)],]
var1 var2
2 2 6
3 3 7
4 4 8
> df[!duplicated(rbindlist(list(df,df[,2:1])))[nrow(df)+1:nrow(df)] | duplicated(df),]
var1 var2
2 2 6
3 3 7
4 4 8
6 2 6
