Create a list from columns of a data.table expression in R

Consider the following dt:
dt <- data.table(a=c(1,1,2,3),b=c(4,5,6,4))
It looks like this:
> dt
a b
1: 1 4
2: 1 5
3: 2 6
4: 3 4
Here I am aggregating each column by its unique values and counting how often each unique value occurs:
> dt[,lapply(.SD,function(agg) dt[,.N,by=agg])]
a.agg a.N b.agg b.N
1: 1 2 4 2
2: 2 1 5 1
3: 3 1 6 1
So 1 appears twice in column a, hence a.N is 2; the same logic applies to the other values.
The problem is that if these per-column transformations end up with different numbers of rows, values get recycled.
For example this dt:
dt <- data.table(a=c(1,1,2,3,7),b=c(4,5,6,4,4))
> dt[,lapply(.SD,function(agg) dt[,.N,by=agg])]
a.agg a.N b.agg b.N
1: 1 2 4 3
2: 2 1 5 1
3: 3 1 6 1
4: 7 1 4 3
Warning message:
In as.data.table.list(jval, .named = NULL) :
Item 2 has 3 rows but longest item has 4; recycled with remainder.
That is no longer the right answer: the b columns should now have only 3 rows, so the shorter vectors got recycled.
This is why I would like to turn the expression dt[,lapply(.SD,function(agg) dt[,.N,by=agg])] into a list whose items can have different lengths, with the item names being the column names of the transformed dt.
A sketch of what I mean is:
newlist
$a.agg
1 2 3 7
$a.N
2 1 1 1
$b.agg
4 5 6
$b.N
3 1 1
Or, even better, a solution returning a data.table that keeps track of the source column in another column:
dt_final
agg N column
1 2 a
2 1 a
3 1 a
7 1 a
4 3 b
5 1 b
6 1 b

Get the data in long format and then aggregate by group.
library(data.table)
dt_long <- melt(dt, measure.vars = c('a', 'b'))
dt_long[, .N, .(variable, value)]
# variable value N
#1: a 1 2
#2: a 2 1
#3: a 3 1
#4: a 7 1
#5: b 4 3
#6: b 5 1
#7: b 6 1
In the tidyverse:
library(dplyr)
library(tidyr)
dt %>%
pivot_longer(cols = everything()) %>%
count(name, value)
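If the named-list form sketched in the question is still wanted, the long counts can be split per source column and flattened; a sketch building on the melt() approach above (newlist is just an illustrative name):
res <- dt_long[, .N, by = .(variable, value)]
# split by source column, then flatten so the names become a.agg, a.N, b.agg, b.N
newlist <- unlist(
  lapply(split(res, res$variable), function(d) list(agg = d$value, N = d$N)),
  recursive = FALSE
)
newlist
# $a.agg
# [1] 1 2 3 7
# $a.N
# [1] 2 1 1 1
# $b.agg
# [1] 4 5 6
# $b.N
# [1] 3 1 1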

Related

Select value from previous group based on condition

I have the following df
df<-data.frame(value = c(1,1,1,2,1,1,2,2,1,2),
group = c(5,5,5,6,7,7,8,8,9,10),
no_rows = c(3,3,3,1,2,2,2,2,1,1))
where identical consecutive values form a group, i.e., values in rows 1:3 fall under group 5. Column "no_rows" tells us how many rows/entries each group has, i.e., group 5 has 3 rows/entries.
I am trying to substitute all values, where no_rows < 2, with the value from a previous group. I expect my end df to look like this:
df_end<-data.frame(value = c(1,1,1,1,1,1,2,2,2,2),
group = c(5,5,5,6,7,7,8,8,9,10),
no_rows = c(3,3,3,1,2,2,2,2,1,1))
I came up with this combination of if...else in a for loop, which gives me the desired output; however, it is very slow and I am looking for a way to optimise it.
for (i in 2:length(df$group)){
if (df$no_rows[i] < 2){
df$value[i] <- df$value[i-1]
}
}
I have also tried with dplyr::mutate and lag() but it does not give me the desired output (it only removes the first value per group instead of taking the value of a previous group).
df<-df%>%
group_by(group) %>%
mutate(value = ifelse(no_rows < 2, lag(value), value))
I have looked for a solution for a few days now but could not find anything that fits my problem completely. Any ideas?
A data.table approach...
First, get the values of the groups with no_rows >= 2, then fill in the missing values (NA) by last observation carried forward.
library(data.table)
# make it a data.table
setDT(df, key = "group")
# get values for groups of no_rows >= 2
df[no_rows >= 2, new_value := value][]
# value group no_rows new_value
# 1: 1 5 3 1
# 2: 1 5 3 1
# 3: 1 5 3 1
# 4: 2 6 1 NA
# 5: 1 7 2 1
# 6: 1 7 2 1
# 7: 2 8 2 2
# 8: 2 8 2 2
# 9: 1 9 1 NA
#10: 2 10 1 NA
# fill down missing values in new_value
setnafill(df, "locf", cols = c("new_value"))
# value group no_rows new_value
# 1: 1 5 3 1
# 2: 1 5 3 1
# 3: 1 5 3 1
# 4: 2 6 1 1
# 5: 1 7 2 1
# 6: 1 7 2 1
# 7: 2 8 2 2
# 8: 2 8 2 2
# 9: 1 9 1 2
#10: 2 10 1 2
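The same idea ports to dplyr/tidyr if preferred: blank out the values of the short groups, then fill downwards (a sketch, assuming tidyr is available for fill()):
library(dplyr)
library(tidyr)
df %>%
  mutate(value = ifelse(no_rows >= 2, value, NA)) %>%
  fill(value)  # fills NA downwards (last observation carried forward)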

R: Collapse duplicated values in a column while keeping the order

I'm sure this is super simple but just can't find the answer. I have a data frame like so
Id event
1 1 A
2 1 B
3 1 A
4 1 A
5 2 C
6 2 C
7 2 A
And I'd like to group by Id and collapse consecutive duplicate event values while keeping the event order, like so:
Id event
1 1 A
2 1 B
3 1 A
4 2 C
5 2 A
Most of my searches end up with using the distinct() or unique() functions, but that leads to losing the A event in row 3 for Id 1.
Thanks in advance!
We can use lead to compare each row with the next one and keep the rows that differ from the row after them. is.na(lead(Id)) additionally keeps the last row.
library(dplyr)
dat2 <- dat %>%
filter(!(Id == lead(Id) & event == lead(event)) | is.na(lead(Id)))
dat2
# Id event
# 1 1 A
# 2 1 B
# 3 1 A
# 4 2 C
# 5 2 A
DATA
dat <- read.table(text = " Id event
1 1 A
2 1 B
3 1 A
4 1 A
5 2 C
6 2 C
7 2 A",
header = TRUE, stringsAsFactors = FALSE)
You can just compare every row with the one after it, keeping the last row of each run (the appended TRUE keeps the final row):
df = read.table(text=" Id event
1 1 A
2 1 B
3 1 A
4 1 A
5 2 C
6 2 C
7 2 A",
header=TRUE)
df[c(rowSums(df[-1, ] == head(df, -1)) != 2, TRUE), ]
Id event
1 1 A
2 1 B
4 1 A
6 2 C
7 2 A
Here is a solution with data.table:
library("data.table")
dt <- fread(
" Id event
1 A
1 B
1 A
1 A
2 C
2 C
2 A")
unique(dt[, r:=rleidv(event), Id])[, -3]
# Id event
# 1: 1 A
# 2: 1 B
# 3: 1 A
# 4: 2 C
# 5: 2 A
or
dt[, .SD[!duplicated(rleidv(event))], by = Id]
(thx to @mt1022 for the comment; !duplicated() keeps the first row of each run of identical events)
A base R solution using tapply and rle:
x <- tapply(dat$event,dat$Id,function(x) rle(x)$values)
do.call(rbind,Map(data.frame,Id=names(x),event=x))
# Id event
# 1.1 1 A
# 1.2 1 B
# 1.3 1 A
# 2.1 2 C
# 2.2 2 A
The dplyr distinct function may look tempting:
dat %>%
distinct(Id, event)
but note that it keeps only the first occurrence of each Id/event pair, so it drops the repeated A in row 3 for Id 1 and does not give the desired output here.
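A dplyr variant that does work compares each row with the previous one instead (a sketch; the defaults in lag() make the first row always pass the filter):
library(dplyr)
dat %>%
  filter(Id != lag(Id, default = -1L) | event != lag(event, default = ""))
# Id event
# 1 1 A
# 2 1 B
# 3 1 A
# 4 2 C
# 5 2 A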

How to group rows in data frame while counting occurrences in one column and summing values in other?

I am trying to modify my data frame:
start end duration_time
1 1 2 2.438
2 2 1 3.901
3 1 2 18.037
4 2 3 85.861
5 3 4 83.922
and create something like this:
start end duration_time weight
1 1 2 20.475 2
2 2 1 3.901 1
4 2 3 85.861 1
5 3 4 83.922 1
So duplicate start-end combinations are collapsed: the weight increases and the duration times are summed.
I already have part of this working; I just can't get the weight right:
library('plyr')
df <- read.table(header = TRUE, text = "start end duration_time
1 1 2 2.438
2 2 1 3.901
3 1 2 18.037
4 2 3 85.861
5 3 4 83.922")
ddply(df, c("start","end"), summarise, weight=? ,duration_time=sum(duration_time))
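For completeness, the missing piece in the plyr attempt above is just the group size: length() of any column inside summarise supplies it (a sketch):
ddply(df, c("start", "end"), summarise,
      weight = length(start),        # rows per start-end combination
      duration_time = sum(duration_time))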
A base R option is aggregate:
do.call(data.frame, aggregate(duration_time ~ ., df,
        FUN = function(x) c(duration_time = sum(x), weight = length(x))))
Simplest solution using data.table:
library(data.table)
setDT(df)[, .(duration_time=sum(duration_time), wt = .N) , by =c("start", "end")]
start end duration_time wt
1: 1 2 20.475 2
2: 2 1 3.901 1
3: 2 3 85.861 1
4: 3 4 83.922 1
Trying something using dplyr and tidyr:
library(dplyr)
library(tidyr)
df1 <- df %>% unite(by_var, start,end)
df2 <- cbind(df1 %>% count(by_var), df1 %>% group_by(by_var)%>%
summarise( duration_time=sum(duration_time))%>%
separate(by_var, c("start","end")))[c(3,4,5,2)]
> df2
start end duration_time n
1 1 2 20.475 2
2 2 1 3.901 1
3 2 3 85.861 1
4 3 4 83.922 1
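A more direct dplyr version avoids the unite/separate round trip (a sketch, assuming dplyr >= 1.0 for the .groups argument):
library(dplyr)
df %>%
  group_by(start, end) %>%
  summarise(duration_time = sum(duration_time), weight = n(), .groups = "drop")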

for loop & if function in R

I was writing a loop with an if condition in R. The table is shown below:
ID category
1 a
1 b
1 c
2 a
2 b
3 a
3 b
4 a
5 a
I want to use a for loop with an if condition to add another column that counts the rows within each ID group, like the Count column below:
ID category Count
1 a 1
1 b 2
1 c 3
2 a 1
2 b 2
3 a 1
3 b 2
4 a 1
5 a 1
My code is (output1 is the table name):
for (i in 2:nrow(output1)){
if(output1[i,1] == output[i-1,1]){
output1[i,"rn"]<- output1[i-1,"rn"]+1
}
else{
output1[i,"rn"]<-1
}
}
But in the result, the count column values are all 1:
ID category Count
1 a 1
1 b 1
1 c 1
2 a 1
2 b 1
3 a 1
3 b 1
4 a 1
5 a 1
Please help me out... Thanks
There are packages and vectorized ways to do this task, but if you are practicing with loops try:
output1$rn <- 1
for (i in 2:nrow(output1)){
if(output1[i,1] == output1[i-1,1]){
output1[i,"rn"]<- output1[i-1,"rn"]+1
}
else{
output1[i,"rn"]<-1
}
}
With your original code, the call output1[i-1,"rn"]+1 in the loop referenced the rn column, which didn't exist on the first pass. By first creating the column and filling it with the value 1, you give the loop something explicit to refer to. (Note the fixed code also corrects the typo output to output1 in the if condition.)
output1
# ID category rn
# 1 1 a 1
# 2 1 b 2
# 3 1 c 3
# 4 2 a 1
# 5 2 b 2
# 6 3 a 1
# 7 3 b 2
# 8 4 a 1
# 9 5 a 1
With the package dplyr you can accomplish it quickly with:
library(dplyr)
output1 %>% group_by(ID) %>% mutate(rn = 1:n())
Or with data.table:
library(data.table)
setDT(output1)[,rn := 1:.N, by=ID]
With base R you can also use ave (with FUN = seq_along, since seq() errors on single-element character groups):
output1$rn <- with(output1, ave(seq_along(ID), ID, FUN = seq_along))
There are vignettes and tutorials on the two packages mentioned; for the last approach, see ?ave in the R console.
A looping solution will be painfully slow for bigger data. Here is a one-line solution using data.table:
require(data.table)
a<-data.table(ID=c(1,1,1,2,2,3,3,4,5),category=c('a','b','c','a','b','a','b','a','a'))
a[,':='(category_count = 1:.N),by=.(ID)]
What you want here happens to be the factor level as a number; this works only because each ID's categories start at 'a' and run consecutively. Do this:
df$count=as.numeric(df$category)
this will give out put as
ID category count
1 1 a 1
2 1 b 2
3 1 c 3
4 2 a 1
5 2 b 2
6 3 a 1
7 3 b 2
8 4 a 1
9 5 a 1
provided your category is already a factor. If not, first convert it:
df$category=as.factor(df$category)
df$count=as.numeric(df$category)
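An alternative base R one-liner that does not depend on factor levels counts within runs of ID (a sketch; assumes rows are sorted by ID):
df$count <- sequence(rle(df$ID)$lengths)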

Create a rolling index of pairs over groups

I need to create (with R) a rolling index of pairs from a data set that includes groups. Consider the following data set:
times <- c(4,3,2)
V1 <- unlist(lapply(times, function(x) seq(1, x)))
df <- data.frame(group = rep(1:length(times), times = times),
V1 = V1,
rolling_index = c(1,1,2,2,3,3,4,5,5))
df
group V1 rolling_index
1 1 1 1
2 1 2 1
3 1 3 2
4 1 4 2
5 2 1 3
6 2 2 3
7 2 3 4
8 3 1 5
9 3 2 5
The data frame I have includes the variables group and V1. Within each group V1 designates a running index (that may or may not start at 1).
I want to create a new indexing variable that looks like rolling_index. It pairs up rows within the same group that have consecutive V1 values, creating a new rolling index. The new index must run consecutively across groups. If a group has an odd number of rows (e.g. group 2), the last, single row gets its own rolling index value.
You can try
library(data.table)
setDT(df)[, gr:=as.numeric(gl(.N, 2, .N)), group][,
rollindex:=cumsum(c(TRUE,abs(diff(gr))>0))][,gr:= NULL]
# group V1 rolling_index rollindex
#1: 1 1 1 1
#2: 1 2 1 1
#3: 1 3 2 2
#4: 1 4 2 2
#5: 2 1 3 3
#6: 2 2 3 3
#7: 2 3 4 4
#8: 3 1 5 5
#9: 3 2 5 5
Or using base R
indx1 <- !duplicated(df$group)
indx2 <- with(df, ave(group, group, FUN=function(x)
gl(length(x), 2, length(x))))
cumsum(c(TRUE,diff(indx2)>0)|indx1)
#[1] 1 1 2 2 3 3 4 5 5
Update
The above methods are based on the 'group' column. If you already have a sequence column ('V1') by group, as in the example, creating the rolling index is easier:
cumsum(!!df$V1 %%2)
#[1] 1 1 2 2 3 3 4 5 5
As mentioned in the post, if the 'V1' column does not start at 1 for some groups, we can build the sequence from the 'group' column and then do the cumsum as above:
cumsum(!!with(df, ave(seq_along(group), group, FUN=seq_along))%%2)
#[1] 1 1 2 2 3 3 4 5 5
There is probably a simpler way but you can do:
rep_each <- unlist(mapply(function(q,r) {c(rep(2, q),rep(1, r))},
q=table(df$group)%/%2,
r=table(df$group)%%2))
df$rolling_index <- inverse.rle(x=list(lengths=rep_each, values=seq(rep_each)))
df$rolling_index
#[1] 1 1 2 2 3 3 4 5 5
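For comparison, a dplyr sketch of the same idea (pair is a hypothetical helper column; assumes rows are ordered by group):
library(dplyr)
df %>%
  group_by(group) %>%
  mutate(pair = ceiling(row_number() / 2)) %>%  # pair id within each group
  ungroup() %>%
  mutate(rolling_index2 = cumsum(pair != lag(pair, default = 0) |
                                   group != lag(group, default = -1L))) %>%
  select(-pair)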
