Aggregate and count in new column - r

I have a large data frame with the columns V1 and V2 that represents an edge list. I want to create a third column, COUNT, which counts how many times each exact edge appears. For example, if V1 == 1 and V2 == 2, I want to count how many rows have V1 == 1 and V2 == 2, combine them into one row, and put the count in a third column.
Data <- data.frame(
V1 = c(1,1),
V2 = c(2,2)
)
I've tried something like new = aggregate(V1 ~ V2,data=df,FUN=length) but it's not working for me.

...or maybe use data.table:
library(data.table)
df<-data.table(v1=c(1,2,3,4,5,1,2,3,1),v2=c(2,3,4,5,6,2,3,4,3))
df[ , count := .N, by=.(v1,v2)] ; df
v1 v2 count
1: 1 2 2
2: 2 3 2
3: 3 4 2
4: 4 5 1
5: 5 6 1
6: 1 2 2
7: 2 3 2
8: 3 4 2
9: 1 3 1

Assuming the structure of data as :
df<-data.frame(v1=c(1,2,3,4,5,1,2,3),v2=c(2,3,4,5,6,2,3,4),stringsAsFactors = FALSE)
> df
v1 v2
1 1 2
2 2 3
3 3 4
4 4 5
5 5 6
6 1 2
7 2 3
8 3 4
Use the ddply function from the plyr package to get the count of each edge pair:
library(plyr)
df2 <- ddply(df, .(v1,v2), function(df) c(count=nrow(df)))
> df2
v1 v2 count
1 1 2 2
2 2 3 2
3 3 4 2
4 4 5 1
5 5 6 1
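For comparison, here is a base R sketch of the same count. The attempt in the question, aggregate(V1 ~ V2, ...), groups by V2 only, so it never sees the edge as a pair; putting both columns on the right-hand side of the formula groups by the edge itself. The helper column n and the name edge_counts are assumptions added here for illustration.

```r
# Same sample data as in the ddply answer
df <- data.frame(v1 = c(1, 2, 3, 4, 5, 1, 2, 3),
                 v2 = c(2, 3, 4, 5, 6, 2, 3, 4))

# Hypothetical helper column: every row contributes 1 to its edge
df$n <- 1L

# Group by both endpoints and sum the helper column
edge_counts <- aggregate(n ~ v1 + v2, data = df, FUN = sum)
names(edge_counts)[3] <- "count"
edge_counts
```

This returns one row per distinct edge, like the ddply result; the data.table answer instead keeps every original row.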

Related

Shifting rows up in columns and flush remaining ones

I have a problem with moving values up by one row within each group. When rows become completely NA, I would like to drop those rows. My current approach, however, still keeps the second row of each group.
Here is my approach
data <- data.frame(gr=c(rep(1:3,each=2)),A=c(1,NA,2,NA,4,NA), B=c(NA,1,NA,3,NA,7),C=c(1,NA,4,NA,5,NA))
> data
gr A B C
1 1 1 NA 1
2 1 NA 1 NA
3 2 2 NA 4
4 2 NA 3 NA
5 3 4 NA 5
6 3 NA 7 NA
so using this approach
data.frame(apply(data,2,function(x){x[complete.cases(x)]}))
gr A B C
1 1 1 1 1
2 1 2 3 4
3 2 4 7 5
4 2 1 1 1
5 3 2 3 4
6 3 4 7 5
As we can see, I still have the second row in each group!
The expected output
> data
gr A B C
1 1 1 1 1
2 2 2 3 4
3 3 4 7 5
thanks!
If there's at most one valid value per gr, you can use na.omit then take the first value from it:
library(dplyr)
data %>% group_by(gr) %>% summarise_all(~ na.omit(.)[1])
# [1] is optional depending on your actual data
# A tibble: 3 x 4
# gr A B C
# <int> <dbl> <dbl> <dbl>
#1 1 1 1 1
#2 2 2 3 4
#3 3 4 7 5
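You can do it with dplyr and tidyr like this:
library(dplyr)
library(tidyr)  # fill() comes from tidyr
data$ind <- rep(c(1, 2), length.out = nrow(data))  # marks the 1st and 2nd row of each pair
data %>% fill(A,B,C) %>% filter(ind == 2) %>% mutate(ind=NULL)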
gr A B C
1 1 1 1 1
2 2 2 3 4
3 3 4 7 5
Depending on how consistent your full data is, this may need to be adjusted.
One more solution using data.table, with na.locf from zoo:
data <- data.frame(gr=c(rep(1:3,each=2)),A=c(1,NA,2,NA,4,NA), B=c(NA,1,NA,3,NA,7),C=c(1,NA,4,NA,5,NA))
library(data.table)
library(zoo)
setDT(data)
data[, A := na.locf(A), by = gr]
data[, B := na.locf(B), by = gr]
data[, C := na.locf(C), by = gr]
data <- unique(data)
data
gr A B C
1: 1 1 1 1
2: 2 2 3 4
3: 3 4 7 5
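A base R sketch of the same idea, assuming (as in the sample data) that each column has at most one non-NA value per gr. The name collapsed is hypothetical. na.action = na.pass matters here: without it, aggregate drops every row that contains an NA before grouping, which would leave nothing to summarise in this data.

```r
data <- data.frame(gr = rep(1:3, each = 2),
                   A = c(1, NA, 2, NA, 4, NA),
                   B = c(NA, 1, NA, 3, NA, 7),
                   C = c(1, NA, 4, NA, 5, NA))

# For every column, keep the first non-NA value within each group
collapsed <- aggregate(. ~ gr, data = data,
                       FUN = function(x) na.omit(x)[1],
                       na.action = na.pass)
collapsed
```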

R - Loop through a data table with combination of dcast of sum

I have a table similar to this one, with more columns. What I am trying to do is create a new table that shows, for each ID, the sum of Counts for each Type and the sum of Value for each Type.
df
ID Type Counts Value
1 A 1 5
1 B 2 4
2 A 2 1
2 A 3 4
2 B 1 3
2 B 2 3
I am able to do it for one single column by using
dcast(df[,j=list(sum(Counts,na.rm = TRUE)),by = c("ID","Type")],ID ~ paste(Type,"Counts",sep="_"))
However, I want to loop over every column in the data table, but without success: it always adds up all the rows. I have tried:
sum(df[[i]],na.rm = TRUE)
sum(names(df)[[i]] == "",na.rm = TRUE)
sum(df[[names(df)[i]]],na.rm = TRUE)
j = list(apply(df[,c(3:4),with=FALSE],2,function(x) sum(x,na.rm = TRUE)))
I want to get a new table like this:
ID A_Counts B_Counts A_Value B_Value
1 1 2 5 4
2 5 3 5 6
My own table has more columns, but the idea is the same. Am I over-complicating it, or is there an easy trick I am not aware of? Please help me. Thank you!
You have to melt your data first, and then dcast it:
library(reshape2)
df2 <- melt(df,id.vars = c("ID","Type"))
# ID Type variable value
# 1 1 A Counts 1
# 2 1 B Counts 2
# 3 2 A Counts 2
# 4 2 A Counts 3
# 5 2 B Counts 1
# 6 2 B Counts 2
# 7 1 A Value 5
# 8 1 B Value 4
# 9 2 A Value 1
# 10 2 A Value 4
# 11 2 B Value 3
# 12 2 B Value 3
dcast(df2,ID ~ Type + variable,fun.aggregate=sum)
# ID A_Counts A_Value B_Counts B_Value
# 1 1 1 5 2 4
# 2 2 5 5 3 6
Another solution with base functions only:
df3 <- aggregate(cbind(Counts,Value) ~ ID + Type,df,sum)
# ID Type Counts Value
# 1 1 A 1 5
# 2 2 A 5 5
# 3 1 B 2 4
# 4 2 B 3 6
reshape(df3, idvar='ID', timevar='Type',direction="wide")
# ID Counts.A Value.A Counts.B Value.B
# 1 1 1 5 2 4
# 2 2 5 5 3 6
Data
df <- read.table(text ="ID Type Counts Value
1 A 1 5
1 B 2 4
2 A 2 1
2 A 3 4
2 B 1 3
2 B 2 3",stringsAsFactors=FALSE,header=TRUE)
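Since the question asks for a loop over the value columns, here is a base R sketch that does exactly that with tapply. The names value_cols, pieces and result are assumptions made for illustration; the column naming (Type_column) follows the table requested in the question.

```r
df <- read.table(text = "ID Type Counts Value
1 A 1 5
1 B 2 4
2 A 2 1
2 A 3 4
2 B 1 3
2 B 2 3", stringsAsFactors = FALSE, header = TRUE)

# Columns to summarise; extend this vector for a wider table
value_cols <- c("Counts", "Value")

# One ID x Type matrix of sums per value column
pieces <- lapply(value_cols, function(col) {
  m <- tapply(df[[col]], list(df$ID, df$Type), sum, na.rm = TRUE)
  colnames(m) <- paste(colnames(m), col, sep = "_")
  m
})

# Bind the matrices side by side and add the ID column back
result <- data.frame(ID = as.integer(rownames(pieces[[1]])),
                     do.call(cbind, pieces),
                     check.names = FALSE)
result
```

An ID and Type combination that never occurs would show up as NA rather than 0 with this approach.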

Remove rows which are different with the first changing in R

I have the following data set:
ID <- c(1,1,1,2,2,2,2,3,3,4,4,4,4,4,4)
x <- c(1,2,3,1,2,3,4,1,2,1,2,3,4,5,6)
y <- c(2,2,3,6,6,4,5, 1,1,5,5,5,2,2,2)
df <- data.frame(ID, x, y)
df
ID x y
1 1 1 2
2 1 2 2
3 1 3 3
4 2 1 6
5 2 2 6
6 2 3 4
7 2 4 5
8 3 1 1
9 3 2 1
10 4 1 5
11 4 2 5
12 4 3 5
13 4 4 2
14 4 5 2
15 4 6 2
Within each ID, I want to keep rows only up to and including the first row where y changes, set y on that row to the value of the previous row, and remove the rest. For example, ID 1 has 3 rows and y changes to 3 on the third row, so I want to set that y to 2 (the value in the previous row). For ID 2, y changes at y = 4, so I want to set it to 6 and delete the following row.
The resulting table should be
ID x y
1 1 2
1 2 2
1 3 2
2 1 6
2 2 6
2 3 6
3 1 1
3 2 1
4 1 5
4 2 5
4 3 5
4 4 5
I couldn't figure it out. If you have any idea, please help me. Thanks.
Or we can do
library(data.table)
df1 <- setDT(df)[, .SD[shift(rleid(y), fill = 1) == 1], .(ID)]
df1[, y := y[1], .(ID)]
df1
ID x y
1: 1 1 2
2: 1 2 2
3: 1 3 2
4: 2 1 6
5: 2 2 6
6: 2 3 6
7: 3 1 1
8: 3 2 1
9: 4 1 5
10: 4 2 5
11: 4 3 5
12: 4 4 5
We can use data.table. Convert the 'data.frame' to a 'data.table' (setDT(df)) and group by 'ID'. If 'y' has only one unique value, take the full sequence of rows (1:.N); otherwise take the difference of 'y' (diff), check where it is not equal to 0, use which to get the numeric index of the first TRUE ([1]), build the sequence up to that index, and wrap it in .I to return the row index.
library(data.table)
i1 <- setDT(df)[, if(uniqueN(y) >1) .I[seq(which(c(FALSE,diff(y)!=0))[1])]
else .I[1:.N], ID]$V1
Based on 'i1', we subset the rows of 'df'; then, grouped by 'ID', we assign (:=) the first element of 'y' to the whole 'y' column.
df[i1][, y:= y[1], ID][]
# ID x y
#1: 1 1 2
#2: 1 2 2
#3: 1 3 2
#4: 2 1 6
#5: 2 2 6
#6: 2 3 6
#7: 3 1 1
#8: 3 2 1
#9: 4 1 5
#10: 4 2 5
#11: 4 3 5
#12: 4 4 5
Or we can use somewhat simpler code with dplyr. (Disclaimer: the idea is somewhat similar to @Psidom's code.) After grouping by 'ID', we take the lag of 'y', build a logical index by comparing it with the first observation, filter the rows on that index, and set the 'y' values to the first value.
library(dplyr)
df %>%
group_by(ID) %>%
filter(first(y)==lag(y, default = first(y))) %>%
mutate(y = first(y))
# ID x y
# <dbl> <dbl> <dbl>
#1 1 1 2
#2 1 2 2
#3 1 3 2
#4 2 1 6
#5 2 2 6
#6 2 3 6
#7 3 1 1
#8 3 2 1
#9 4 1 5
#10 4 2 5
#11 4 3 5
#12 4 4 5
Or another option is ave from base R (lag with a default argument is dplyr::lag, so dplyr is still needed here):
df1 <- df[with(df, as.logical(ave(y, ID, FUN = function(x)
dplyr::lag(x, default= x[1])== x[1]))),]
df1$y <- with(df1, ave(y, ID, FUN= function(x) x[1]))
You could use a for loop, matching to the first instance of a given ID:
for( i in 1:nrow(df) ){
df$new[i] <- df$y[ match( df$ID[i], df$ID ) ]
}
This works because you're effectively asking for all subsequent values of y to be replaced with the first value for a given ID. match returns the position of the first match, which works well for what you're after.
Or you could eliminate the for loop, because match is vectorized over its first argument:
ID <- df$ID
df$new <- df$y[ match( ID, df$ID ) ]
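EDIT TO ADD: Sorry, here's a step to delete rows as requested (shift is from data.table):
library(data.table)
# keep rows that still hold the first value, plus the first row where y changes
df <- subset( df, y == new |
( shift( y, 1, type = "lag" ) == new &
shift( ID, 1, type = "lag" ) == ID )
)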
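For readers who want to avoid extra packages, here is a base R sketch of the same filter-then-overwrite idea, with the lag written by hand. It assumes the rows of each ID are already in the desired order; keep and result are hypothetical names.

```r
ID <- c(1,1,1,2,2,2,2,3,3,4,4,4,4,4,4)
x  <- c(1,2,3,1,2,3,4,1,2,1,2,3,4,5,6)
y  <- c(2,2,3,6,6,4,5,1,1,5,5,5,2,2,2)
df <- data.frame(ID, x, y)

# Keep a row when the previous y in its group equals the group's first y.
# c(v[1], head(v, -1)) is a manual lag whose first element defaults to v[1].
keep <- as.logical(ave(df$y, df$ID,
                       FUN = function(v) c(v[1], head(v, -1)) == v[1]))

result <- df[keep, ]

# Overwrite y with the first value of each group
result$y <- ave(result$y, result$ID, FUN = function(v) v[1])
result
```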

Split data.frame by variable and apply function referring to concrete row

I need to split a data.frame by some variable and calculate the difference between each row's value and the value from another specified row.
In the example below, I split df by v1. Then, for each row, I calculate the difference between its v3 value and v3[v2 == "C"] within the same group (the expected result is in the res column).
v1 <- rep(1:4,each = 3)
v2 <- rep(c("A","B","C"),4)
v3 <- rep(1:5,3)[1:12]
res <- c(-2,-1,0,3,4,0,-2,-1,0,3,-1,0)
df <- data.frame(v1,v2,v3,res)
df
v1 v2 v3 res
1 1 A 1 -2
2 1 B 2 -1
3 1 C 3 0
4 2 A 4 3
5 2 B 5 4
6 2 C 1 0
7 3 A 2 -2
8 3 B 3 -1
9 3 C 4 0
10 4 A 5 3
11 4 B 1 -1
12 4 C 2 0
I prefer plyr or data.table, if possible.
Here is a data table solution:
library(data.table)
setDT(df)
df[, new := v3 - v3[v2=="C"], by = "v1"]
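The same calculation can be sketched in base R with a lookup table, assuming there is exactly one row with v2 == "C" per v1 group (as in the example). The names ref and new are chosen here for illustration.

```r
v1 <- rep(1:4, each = 3)
v2 <- rep(c("A", "B", "C"), 4)
v3 <- rep(1:5, 3)[1:12]
df <- data.frame(v1, v2, v3)

# One reference row per group: the v3 value where v2 == "C"
ref <- df[df$v2 == "C", c("v1", "v3")]

# Look up each row's reference value by group and subtract it
df$new <- df$v3 - ref$v3[match(df$v1, ref$v1)]
df
```

If a group had no "C" row, match would return NA and the difference for that group would be NA.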

Inserting a count field for each row by a grouping variable

I have a data set with observations that are both grouped and ordered (by rank). I'd like to add a third variable that is a count of the number of observations for each grouping variable. I'm aware of ways to group and count variables but I can't find a way to re-insert these counts back into the original data set, which has more rows. I'd like to get the variable C in the example table below.
A B C
1 1 3
1 2 3
1 3 3
2 1 4
2 2 4
2 3 4
2 4 4
Here's one way using ave:
DF <- within(DF, {C <- ave(A, A, FUN=length)})
# A B C
# 1 1 1 3
# 2 1 2 3
# 3 1 3 3
# 4 2 1 4
# 5 2 2 4
# 6 2 3 4
# 7 2 4 4
Here is one approach using data.table that makes use of .N, which the data.table help file describes as follows: ".N is an integer, length 1, containing the number of rows in the group."
> library(data.table)
> DT <- data.table(A = rep(c(1, 2), times = c(3, 4)), B = c(1:3, 1:4))
> DT
A B
1: 1 1
2: 1 2
3: 1 3
4: 2 1
5: 2 2
6: 2 3
7: 2 4
> DT[, C := .N, by = "A"]
> DT
A B C
1: 1 1 3
2: 1 2 3
3: 1 3 3
4: 2 1 4
5: 2 2 4
6: 2 3 4
7: 2 4 4
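The question mentions being able to group and count but not to put the counts back into the original data. A base R sketch of that two-step route, using aggregate for the counts and merge to re-attach them, might look like this; the names counts and merged are hypothetical.

```r
DF <- data.frame(A = rep(c(1, 2), times = c(3, 4)), B = c(1:3, 1:4))

# Step 1: one row per group with the number of observations
counts <- aggregate(B ~ A, data = DF, FUN = length)
names(counts)[2] <- "C"

# Step 2: join the counts back onto every original row
merged <- merge(DF, counts, by = "A")
merged
```

merge may reorder rows, so ave or data.table's := is the safer choice when the original row order matters.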
