Add a column for counting unique tuples in the data frame [duplicate] - r

This question already has answers here:
How to get frequencies then add it as a variable in an array?
(3 answers)
Closed 8 years ago.
Suppose I have the following data frame:
userID <- c(1, 1, 3, 5, 3, 5)
A <- c(2, 3, 2, 1, 2, 1)
B <- c(2, 3, 1, 0, 1, 0)
df <- data.frame(userID, A, B)
df
# userID A B
# 1 1 2 2
# 2 1 3 3
# 3 3 2 1
# 4 5 1 0
# 5 3 2 1
# 6 5 1 0
I would like to create a data frame with the same columns but with an added final column that counts up the number of unique tuples / combinations of the other columns. The output should look like the following:
userID A B count
1 2 2 1
1 3 3 1
3 2 1 2
5 1 0 2
The meaning is that the tuple / combination of (1, 2, 2) occurs once, so it has count=1, while the tuple (3, 2, 1) occurs twice, so it has count=2. I would prefer not to use any external packages.

1) aggregate
ag <- aggregate(count ~ ., cbind(count = 1, df), length)
ag[do.call("order", ag), ] # sort the rows
giving:
userID A B count
3 1 2 2 1
4 1 3 3 1
2 3 2 1 2
1 5 1 0 2
The last line of code which sorts the rows could be omitted if the order of the rows is unimportant.
The remaining solutions use the indicated packages:
2) sqldf
library(sqldf)
Names <- toString(names(df))
fn$sqldf("select *, count(*) count from df group by $Names order by $Names")
giving:
userID A B count
1 1 2 2 1
2 1 3 3 1
3 3 2 1 2
4 5 1 0 2
The order by clause could be omitted if the order is unimportant.
3) dplyr
library(dplyr)
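# regroup() has been removed from dplyr; group_by(across(everything())) is the equivalent in dplyr >= 1.0
df %>% group_by(across(everything())) %>% summarise(count = n())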
giving:
Source: local data frame [4 x 4]
Groups: userID, A
userID A B count
1 1 2 2 1
2 1 3 3 1
3 3 2 1 2
4 5 1 0 2
4) data.table
library(data.table)
data.table(df)[, list(count = .N), by = names(df)]
giving:
userID A B count
1: 1 2 2 1
2: 1 3 3 1
3: 3 2 1 2
4: 5 1 0 2
ADDED additional solutions. Also some small improvements.

Here's a fairly straightforward way (ave to the rescue!):
unique(cbind(df,
count = ave(rep(1, nrow(df)),
do.call(paste, df),
FUN = length)))
# userID A B count
# 1 1 2 2 1
# 2 1 3 3 1
# 3 3 2 1 2
# 4 5 1 0 2
Here's a variation of the above:
unique(within(df, {
counter <- rep(1, nrow(df))
count <- ave(counter, df, FUN = length)
rm(counter)
}))
# userID A B count
# 1 1 2 2 1
# 2 1 3 3 1
# 3 3 2 1 2
# 4 5 1 0 2
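Another package-free possibility, not taken from the answers above, is to let table() do the counting and then drop the combinations that never occur. This is only a sketch: the name counts is made up here, and the key columns come back as factors rather than numbers.

```r
userID <- c(1, 1, 3, 5, 3, 5)
A <- c(2, 3, 2, 1, 2, 1)
B <- c(2, 3, 1, 0, 1, 0)
df <- data.frame(userID, A, B)

# Cross-tabulate all three columns, then flatten the table to one row per combination
counts <- as.data.frame(table(df), responseName = "count")

# table() also lists combinations that never occur (count 0); keep only the observed ones
counts <- counts[counts$count > 0, ]
counts
```

Because table() enumerates every possible combination of levels before filtering, this is best kept for data with few distinct values per column.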

userID <- c(1, 1, 3, 5, 3, 5)
A <- c(2, 3, 2, 1, 2, 1)
B <- c(2, 3, 1, 0, 1, 0)
df <- data.frame(userID, A, B)
Make a quick factor of the tuples:
df$AB <- as.factor(paste(df$userID, df$A, df$B, sep = "_"))  # the separator keeps keys such as 1,12,3 and 11,2,3 distinct
No external packages are needed: take advantage of summary(), store its result as a data frame, then merge the counts back onto the original data:
df2 <- as.data.frame(summary(df$AB))
df2 <- data.frame(x=row.names(df2), y=df2[1])
names(df2) <- c("AB", "count")
df <- merge(df, df2, by="AB", all.x=TRUE)
df$AB <- NULL
Almost final output, just has dupes:
df
userID A B count
1 1 2 2 1
2 1 3 3 1
3 3 2 1 2
4 3 2 1 2
5 5 1 0 2
6 5 1 0 2
Lastly, clean up dupes:
df <- df[!duplicated(df), ]
Here you go:
df
userID A B count
1 1 2 2 1
2 1 3 3 1
3 3 2 1 2
5 5 1 0 2
It has been a while since I did this without SQL or plyr. If you can use dplyr or another package later on, do so. Bioconductor has a lot of great sequencing packages if things start to get more complex.
Hope this helps.

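This should do the trick, even if it is a little bit ugly:
vec <- table(do.call(paste, c(df, sep = "_")))  # one key per row; "_" keeps multi-digit values apart
df2 <- data.frame(do.call(rbind, strsplit(names(vec), "_", fixed = TRUE)))
names(df2) <- names(df)
df2[] <- lapply(df2, as.numeric)  # the split keys are character, so convert them back
df2$count <- as.vector(vec)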
# userID A B count
#1 1 2 2 1
#2 1 3 3 1
#3 3 2 1 2
#4 5 1 0 2
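When rows are pasted into a single key, as in this answer, the fields can run together if nothing separates them. The sketch below uses two made-up rows (not from the question's data) to show the collision and how a separator avoids it.

```r
# Hypothetical rows chosen to collide; they are not part of the question's data
rows <- data.frame(userID = c(1, 11), A = c(12, 2), B = c(3, 3))

# Without a separator both rows collapse to the same key "1123"
key_plain <- do.call(paste, c(rows, sep = ""))

# With a separator the keys stay distinct: "1_12_3" and "11_2_3"
key_sep <- do.call(paste, c(rows, sep = "_"))

length(unique(key_plain))
length(unique(key_sep))
```

With single-digit values, as in the question, the two variants agree. The separator only matters once a column can hold values of different widths.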

Related

Pair-wise manipulating rows in data.frame

I have data on several thousand US basketball players over multiple years.
Each basketball player has a unique ID. For each year, the team and position they played are known, much like the mock data df below:
df <- data.frame(id = c(rep(1:4, times=2), 1),
year = c(1, 1, 2, 2, 3, 4, 4, 4,5),
team = c(1,2,3,4, 2,2,4,4,2),
position = c(1,2,3,4,1,1,4,4,4))
> df
id year team position
1 1 1 1 1
2 2 1 2 2
3 3 2 3 3
4 4 2 4 4
5 1 3 2 1
6 2 4 2 1
7 3 4 4 4
8 4 4 4 4
9 1 5 2 4
What is an efficient way to manipulate df into new_df below?
> new_df
id move time position.1 position.2 year.1 year.2
1 1 0 2 1 1 1 3
2 2 1 3 2 1 1 4
3 3 0 2 3 4 2 4
4 4 1 2 4 4 2 4
5 1 0 2 1 4 3 5
In new_df, the first occurrence of each basketball player is compared with the second occurrence, recording whether the player switched teams and how long it took the player to make the switch.
Note:
In the real data some basketball players occur more than twice and can play for multiple teams and on multiple positions.
In such a case a new row in new_df is added that compares each additional occurrence of a player with only the previous occurrence.
Edit: I don't think this is a simple reshape exercise, for the reasons given in the previous two sentences. To clarify this, I've added an additional occurrence of player ID 1 to the mock data.
Any help is most welcome and appreciated!
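# This answer appears to assume the original 8-row mock data (two occurrences per player)
df$time <- ave(df$id, df$id, FUN = seq_along)  # occurrence number within each id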
df1 = reshape(df,idvar = "id",dir="wide")
transform(df1, move=+(team.1==team.2),time=year.2-year.1)
id year.1 team.1 position.1 year.2 team.2 position.2 move time
1 1 1 1 1 3 2 1 0 2
2 2 1 2 2 4 2 1 1 3
3 3 2 3 3 4 4 4 0 2
4 4 2 4 4 4 4 4 1 2
The code below should get you to the point where the data is transposed.
You'll still have to create the move and time variables yourself.
df <- data.frame(id = rep(1:4, times=2),
year = c(1, 1, 2, 2, 3, 4, 4, 4),
team = c(1, 2, 3, 4, 2, 2, 4, 4),
position = c(1, 2, 3, 4, 1, 1, 4, 4))
library(reshape2)
library(data.table)
setDT(df) #convert to data.table
df[, rno := rank(year, ties.method = "min"), by = .(id)] #gives the occurrence number within each id
#creating the transposed dataset
Dcast_DT<-dcast(df,id~rno,value.var = c("year","team","position"))
This piece of code did the trick, using data.table
library(data.table)
#transform to data.table
dtt <- as.data.table(df)
#sort on year
setorder(dtt, year, na.last=TRUE)
#indicate the names of the new columns
new_cols= c("time", "move", "prev_team", "prev_year", "prev_position")
#set up the new variables
dtt[ , (new_cols) := list(year - shift(year),team!= shift(team), shift(team), shift(year), shift(position)), by = id]
# select only repeating occurrences
dtt <- dtt[!is.na(dtt$time),]
#outcome
dtt
id year team position time move prev_team prev_year prev_position
1: 1 3 2 1 2 TRUE 1 1 1
2: 2 4 2 1 3 FALSE 2 1 2
3: 3 4 4 4 2 TRUE 3 2 3
4: 4 4 4 4 2 FALSE 4 2 4
5: 1 5 2 4 2 FALSE 2 3 1
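The same "compare each occurrence with the previous one" idea can be sketched in base R as well. This is an illustrative variant and not the poster's code. The helper lag_by_id is a name invented here, and move is TRUE when the team changed, as in the data.table output above.

```r
df <- data.frame(id = c(rep(1:4, times = 2), 1),
                 year = c(1, 1, 2, 2, 3, 4, 4, 4, 5),
                 team = c(1, 2, 3, 4, 2, 2, 4, 4, 2),
                 position = c(1, 2, 3, 4, 1, 1, 4, 4, 4))

# Sort so that each player's rows are in chronological order
df <- df[order(df$id, df$year), ]

# Hypothetical helper: shift a column down by one row within each id
lag_by_id <- function(x, id) ave(x, id, FUN = function(v) c(NA, head(v, -1)))

df$prev_year <- lag_by_id(df$year, df$id)
df$prev_team <- lag_by_id(df$team, df$id)
df$time <- df$year - df$prev_year   # years since the previous occurrence
df$move <- df$team != df$prev_team  # TRUE when the team changed

# The first occurrence of each player has nothing to compare against
res <- df[!is.na(df$time), ]
res
```

A player with three occurrences contributes two rows, one per consecutive pair, which is the behaviour asked for in the question's note.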

Get start and end index of runs of values [duplicate]

This question already has answers here:
Find start and end positions/indices of runs/consecutive values
(2 answers)
Closed 3 years ago.
I have a vector:
a <- c(1, 1, 0, 0, 1, 2, 0, 0)
I would like to get the start and end indexes of each run of equal values:
number start end
0 3 4
0 7 8
1 1 2
1 5 5
2 6 6
A solution from base R.
a <- c(1,1,0,0,1,2,0,0)
# Get run length encoding
b <- rle(a)
# Create a data frame
dt <- data.frame(number = b$values, lengths = b$lengths)
# Get the end
dt$end <- cumsum(dt$lengths)
# Get the start
dt$start <- dt$end - dt$lengths + 1
# Select columns
dt <- dt[, c("number", "start", "end")]
# Sort rows
dt <- dt[order(dt$number), ]
dt
# number start end
#2 0 3 4
#5 0 7 8
#1 1 1 2
#3 1 5 5
#4 2 6 6
Update
Here is a solution using with to make the code more concise.
with(rle(a), data.frame(number = values,
start = cumsum(lengths) - lengths + 1,
end = cumsum(lengths))[order(values),])
# number start end
#2 0 3 4
#5 0 7 8
#1 1 1 2
#3 1 5 5
#4 2 6 6
By using dplyr and rleid from data.table
library(data.table)
library(dplyr)
a=c(1,1,0,0,1,2,0,0)
df=data.frame(number=c(1,1,0,0,1,2,0,0))
df$Id=data.table::rleid(df$number)
df$rowname=seq(1:length(a))
df%>%group_by(Id,number)%>%summarise(start=first(rowname),end=last(rowname))%>%arrange(number)
# Groups: Id [5]
Id number start end
<int> <dbl> <int> <int>
1 2 0 3 4
2 5 0 7 8
3 1 1 1 2
4 3 1 5 5
5 4 2 6 6
A solution using a for loop in base R:
a <- c(1, 1, 0, 0, 1, 2, 0, 0)
start <- 1
res <- data.frame()
v <- c(a, -1) # add number that is different from all other numbers
for (index in 1:(length(v) - 1)) {
if (v[index] != v[index + 1]) {
res <- rbind(res,
data.frame(element = v[index], start = start, stop = index))
start <- index + 1
}
}
Which gives:
element start stop
1 1 1 2
2 0 3 4
3 1 5 5
4 2 6 6
5 0 7 8
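The loop above grows the data frame one row at a time. A vectorised sketch of the same boundary detection, assuming the input has at least one element, finds the positions where the value changes and derives the starts from the ends:

```r
a <- c(1, 1, 0, 0, 1, 2, 0, 0)

# A run ends wherever the next element differs; the last element always ends a run
ends <- which(c(a[-1] != a[-length(a)], TRUE))

# Each run starts right after the previous run ends; the first run starts at 1
starts <- c(1, head(ends, -1) + 1)

res <- data.frame(element = a[ends], start = starts, stop = ends)
res
```

This keeps the runs in order of appearance, like the loop. Sort by element afterwards if the layout from the question is needed.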

Summary of values across rows and columns in R

I have a dataset that looks like:
Group A B C D
XYZ 4 Na 1 3
XYZ Na 2 2 1
DEF 4 3 2 1
DEF 3 3 1 1
PQR 1 Na Na 1
PQR 3 2 2 4
I want the summary of this dataset across rows and columns for the count of each value as below:
Group 4 3 2 1
XYZ 1 1 2 2
DEF 1 3 1 3
PQR 1 1 2 2
The count of 4 in the dataset for group XYZ across all rows and columns is 1; for 2 and 1 it is 2; for 3 it is 1. I can do this by creating 4 new columns (4, 3, 2, 1) and getting the count row-wise and then column-wise, but this is neither efficient nor scalable. I am sure there is a better way to get this done.
Using reshape2 package we can melt and dcast as follows,
library(reshape2)
dcast(na.omit(melt(df, id.vars = 'Group')), Group ~ value, fun.aggregate = length)
# Group 1 2 3 4
#1 DEF 3 1 3 1
#2 PQR 2 2 1 1
#3 XYZ 2 2 1 1
This uses no packages and is just one line. Here DF$Group[row(DF[-1])] is a Group labels vector such that each element corresponds to the unravelled numeric vector unlist(DF[-1]).
table(DF$Group[row(DF[-1])], unlist(DF[-1]))
giving:
1 2 3 4
DEF 3 1 3 1
PQR 2 2 1 1
XYZ 2 2 1 1
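To see why the one-liner works, it may help to look at the intermediate pieces. The sketch below rebuilds DF from the question's data and uses the illustrative names idx, labels and values, which do not appear in the answer itself.

```r
DF <- read.table(text = "Group A B C D
XYZ 4 Na 1 3
XYZ Na 2 2 1
DEF 4 3 2 1
DEF 3 3 1 1
PQR 1 Na Na 1
PQR 3 2 2 4", header = TRUE, na.strings = "Na")

# row() gives, for every cell of the value columns, the row number of that cell
idx <- row(DF[-1])

# Indexing Group by that matrix yields one label per cell, in column-major order
labels <- DF$Group[idx]

# unlist() flattens the value columns in the same column-major order
values <- unlist(DF[-1])

# Labels and values are now aligned cell by cell, so table() can cross-tabulate them
tab <- table(labels, values)
tab
```

NA cells are dropped by table() by default, which is why the missing entries do not show up in the counts.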
If the order of rows and columns shown in the question is important, then we can create factors from each of the two table arguments, with the factor levels defined in the desired order. In this case we use the following line instead of the line of code above:
table(Group = factor(DF$Group[row(DF[-1])], unique(DF$Group)), factor(unlist(DF[-1]), 4:1))
giving:
Group 4 3 2 1
XYZ 1 1 2 2
DEF 1 3 1 3
PQR 1 1 2 2
The above produces an object of class "table". This is a particularly suitable class for tabulated frequencies. For example, once in this form ftable can be used to easily rearrange it further as in ftable(tab, row.vars = 2) or ftable(tab, row.vars = 1:2) where tab is the above computed table.
If a data.frame were preferred then convert it like this:
cbind(Group = rownames(tab), as.data.frame.matrix(tab))
The input data.frame DF is defined reproducibly in Note 2 at the end.
Alternatives
Although the above seems the most direct here are some other alternatives that also use no packages:
1) by For each set of rows having the same Group value the anonymous function creates a data.frame identifying the Group, converting the columns other than the first to a factor with the indicated levels and running table to get the counts. The "by" list that is returned is sorted back to the original order and we rbind everything back together.
do.call("rbind",
by(DF, DF$Group, function(x) {
data.frame(Group = x[1,1],
as.list(table(factor(unlist(x[, -1]), levels = 4:1))),
check.names = FALSE)
})[unique(DF$Group)])
giving:
Group 4 3 2 1
XYZ XYZ 1 1 2 2
DEF DEF 1 3 1 3
PQR PQR 1 1 2 2
1a) This slightly shorter variation would also work. It returns a matrix identifying the groups using row names.
kount <- function(x) table(factor(unlist(x), levels = 4:1))
m <- do.call("rbind", by(DF[, -1], DF$Group, kount)[unique(DF$Group)])
giving:
> m
4 3 2 1
XYZ 1 1 2 2
DEF 1 3 1 3
PQR 1 1 2 2
2) outer
gps <- unique(DF$Group)
levs <- 4:1
kount2 <- function(g, lv) sum(subset(DF, Group == g)[-1] == lv, na.rm = TRUE)
m <- outer(gps, levs, Vectorize(kount2))
dimnames(m) <- list(gps, levs)
giving this matrix:
> m
4 3 2 1
XYZ 1 1 2 2
DEF 1 3 1 3
PQR 1 1 2 2
3) sapply
kount3 <- function(g) table(factor(unlist(DF[DF$Group == g, -1]), levels = 4:1))
gps <- as.character(unique(DF$Group))
do.call("rbind", sapply(gps, kount3, simplify = FALSE))
giving:
4 3 2 1
XYZ 1 1 2 2
DEF 1 3 1 3
PQR 1 1 2 2
4) aggregate
aggregate(1:nrow(DF), DF["Group"], function(ix)
table(factor(unlist(DF[ix, -1]), levels = 4:1)))[unique(DF$Group), ]
giving:
Group x.4 x.3 x.2 x.1
3 XYZ 1 1 2 2
1 DEF 1 3 1 3
2 PQR 1 1 2 2
5) tapply
do.call("rbind", tapply(1:nrow(DF), DF$Group, function(ix)
table(factor(unlist(DF[ix, -1]), levels = 4:1))))[unique(DF$Group), ]
6) reshape
with(reshape(DF, dir = "long", varying = list(2:5)),
table(factor(Group, unique(DF$Group)), factor(A, 4:1)))
giving:
4 3 2 1
XYZ 1 1 2 2
DEF 1 3 1 3
PQR 1 1 2 2
Note 1: (1a), (2), (3), (5) and (6) produce a matrix or table result with groups as row names. If you prefer a data frame with Groups as a column then supposing that m is the matrix, add this:
data.frame(Group = rownames(m), m, check.names = FALSE)
Note 2: The input DF in reproducible form is:
Lines <- "Group A B C D
XYZ 4 Na 1 3
XYZ Na 2 2 1
DEF 4 3 2 1
DEF 3 3 1 1
PQR 1 Na Na 1
PQR 3 2 2 4"
DF <- read.table(text = Lines, header = TRUE, na.strings = "Na")
We can use dplyr/tidyr
library(dplyr)
library(tidyr)
df1 %>%
mutate(across(everything(), ~ replace(.x, .x == "Na", NA))) %>%  # replaces the deprecated mutate_each(funs(...))
gather(Var, Val, A:D, na.rm=TRUE) %>%
group_by(Group, Val) %>%
tally() %>%
spread(Val, n)
# Group `1` `2` `3` `4`
#* <chr> <int> <int> <int> <int>
#1 DEF 3 1 3 1
#2 PQR 2 2 1 1
#3 XYZ 2 2 1 1

Create counter with multiple variables [duplicate]

This question already has answers here:
Numbering rows within groups in a data frame
(10 answers)
Closed 6 years ago.
I have my data that looks like below:
CustomerID TripDate
1 1/3/2013
1 1/4/2013
1 1/9/2013
2 2/1/2013
2 2/4/2013
3 1/2/2013
I need to create a counter variable, which will be like below:
CustomerID TripDate TripCounter
1 1/3/2013 1
1 1/4/2013 2
1 1/9/2013 3
2 2/1/2013 1
2 2/4/2013 2
3 1/2/2013 1
Tripcounter will be for each customer.
Use ave. Assuming your data.frame is called "mydf":
mydf$counter <- with(mydf, ave(CustomerID, CustomerID, FUN = seq_along))
mydf
# CustomerID TripDate counter
# 1 1 1/3/2013 1
# 2 1 1/4/2013 2
# 3 1 1/9/2013 3
# 4 2 2/1/2013 1
# 5 2 2/4/2013 2
# 6 3 1/2/2013 1
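The ave() call numbers rows in the order they appear, so it relies on each customer's trips already being in date order. If that is not guaranteed, one possible sketch is to parse the dates and sort first. The shuffled rows below are made up for illustration, and the month/day/year format is an assumption about the data.

```r
# Hypothetical, deliberately shuffled version of the question's data
mydf <- data.frame(CustomerID = c(2, 1, 3, 1, 2, 1),
                   TripDate = c("2/4/2013", "1/9/2013", "1/2/2013",
                                "1/3/2013", "2/1/2013", "1/4/2013"))

# Assumption: dates are written month/day/year
mydf$TripDate <- as.Date(mydf$TripDate, format = "%m/%d/%Y")

# Sort by customer, then chronologically within customer
mydf <- mydf[order(mydf$CustomerID, mydf$TripDate), ]

# Number the trips within each customer
mydf$TripCounter <- ave(mydf$CustomerID, mydf$CustomerID, FUN = seq_along)
mydf
```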
For what it's worth, I also implemented a version of this approach in a function included in my "splitstackshape" package. The function is called getanID:
mydf <- data.frame(IDA = c("a", "a", "a", "b", "b", "b", "b"),
IDB = c(1, 2, 1, 1, 2, 2, 2), values = 1:7)
mydf
# install.packages("splitstackshape")
library(splitstackshape)
# getanID(mydf, id.vars = c("IDA", "IDB"))
getanID(mydf, id.vars = 1:2)
# IDA IDB values .id
# 1 a 1 1 1
# 2 a 2 2 1
# 3 a 1 3 2
# 4 b 1 4 1
# 5 b 2 5 1
# 6 b 2 6 2
# 7 b 2 7 3
As you can see from the example above, I've written the function in such a way that you can specify one or more columns that should be treated as ID columns. It checks to see if any of the id.vars are duplicated, and if they are, then it generates a new ID variable for you.
You can also use plyr for this (using @AnadaMahto's example data):
> ddply(mydf, .(IDA), transform, .id = seq_along(IDA))
IDA IDB values .id
1 a 1 1 1
2 a 2 2 2
3 a 1 3 3
4 b 1 4 1
5 b 2 5 2
6 b 2 6 3
7 b 2 7 4
or even:
> ddply(mydf, .(IDA, IDB), transform, .id = seq_along(IDA))
IDA IDB values .id
1 a 1 1 1
2 a 1 3 2
3 a 2 2 1
4 b 1 4 1
5 b 2 5 1
6 b 2 6 2
7 b 2 7 3
Note that plyr does not have a reputation for being the quickest solution; for that, take a look at data.table.
Here's a data.table approach:
library(data.table)
DT <- data.table(mydf)
DT[, .id := sequence(.N), by = "IDA,IDB"]
DT
# IDA IDB values .id
# 1: a 1 1 1
# 2: a 2 2 1
# 3: a 1 3 2
# 4: b 1 4 1
# 5: b 2 5 1
# 6: b 2 6 2
# 7: b 2 7 3
Alternatively, you can use dplyr. Assuming your data.frame is called mydata:
library(dplyr)
mydata %>% group_by(CustomerID) %>% mutate(TripCounter = row_number())
I need to do this often, and wrote a function that accomplishes it differently than the previous answers. I am not sure which solution is most efficient.
idCounter <- function(x) {
unlist(lapply(rle(x)$lengths, seq_len))
}
mydf$TripCounter <- idCounter(mydf$CustomerID)
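One caveat with the rle() approach: it numbers positions within consecutive runs, so it only matches a per-customer counter when each ID's rows are contiguous. A small sketch with a made-up vector shows the difference from ave():

```r
idCounter <- function(x) {
  unlist(lapply(rle(x)$lengths, seq_len))
}

# Hypothetical IDs where customer 1 appears in two separate runs
x <- c(1, 2, 1, 1)

by_run <- idCounter(x)                 # restarts at every run: 1 1 1 2
by_id  <- ave(x, x, FUN = seq_along)   # counts per ID overall: 1 1 2 3
```

Sorting by the ID column first makes the two agree.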
Here's a procedural version. I don't subscribe to the idea that using a loop in R means you are probably doing something wrong.
x <- dataframe$CustomerID
y <- rep(1, length(x))  # the first row always starts a new count
for (i in seq_along(x)[-1]) {
  # keep counting within the same customer, otherwise restart at 1
  y[i] <- if (x[i] == x[i - 1]) y[i - 1] + 1 else 1
}
dataframe$counter <- y
This isn't a correct answer, but it shows something interesting in comparison with for loops: vectorized code is fast, but it does not perform sequential updating, so each step cannot see the result of the previous one.
a<-read.table(textConnection(
"CustomerID TripDate
1 1/3/2013
1 1/4/2013
1 1/9/2013
2 2/1/2013
2 2/4/2013
3 1/2/2013 "), header=TRUE)
a <- a %>%
group_by(CustomerID,TripDate) # must in order
res <- rep(1, nrow(a)) #base # 1
res[2:6] <-sapply(2:6, function(i)if(a$CustomerID[i]== a$CustomerID[i - 1]) {res[i] = res[i-1]+1} else {res[i]= res[i]})
a$TripeCounter <- res
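If sequential updating is wanted without writing an explicit loop, one option is Reduce() with accumulate = TRUE, which passes each step's result on to the next. The sketch below is an illustrative variant and not the poster's code. It assumes the rows are already ordered by customer, and id, same and counter are names chosen here.

```r
id <- c(1, 1, 1, 2, 2, 3)  # CustomerID column from the question

# TRUE where a row belongs to the same customer as the row before it
same <- id[-1] == id[-length(id)]

# Start at 1, then either continue the count or restart it, carrying the
# previous result forward at every step
counter <- Reduce(function(prev, s) if (s) prev + 1 else 1,
                  same, 1, accumulate = TRUE)
counter
```

Unlike the sapply() version, every step here receives the updated previous value, so the count for customer 1 reaches 3.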

Cumulative count of each value [duplicate]

This question already has answers here:
Numbering rows within groups in a data frame
(10 answers)
Closed 2 years ago.
I want to create a cumulative counter of the number of times each value appears.
e.g. say I have the column:
id
1
2
3
2
2
1
2
3
This would become:
id count
1 1
2 1
3 1
2 2
2 3
1 2
2 4
3 2
etc...
The ave function computes a function by group.
> id <- c(1,2,3,2,2,1,2,3)
> data.frame(id,count=ave(id==id, id, FUN=cumsum))
id count
1 1 1
2 2 1
3 3 1
4 2 2
5 2 3
6 1 2
7 2 4
8 3 2
I use id==id to create a vector of all TRUE values, which get converted to numeric when passed to cumsum. You could replace id==id with rep(1,length(id)).
Here is a way to get the counts:
id <- c(1,2,3,2,2,1,2,3)
sapply(1:length(id),function(i)sum(id[i]==id[1:i]))
Which gives you:
[1] 1 1 1 2 3 2 4 2
The dplyr way:
library(dplyr)
foo <- data.frame(id=c(1, 2, 3, 2, 2, 1, 2, 3))
foo <- foo %>% group_by(id) %>% mutate(count=row_number())
foo
# A tibble: 8 x 2
# Groups: id [3]
id count
<dbl> <int>
1 1 1
2 2 1
3 3 1
4 2 2
5 2 3
6 1 2
7 2 4
8 3 2
That ends up grouped by id. If you want it not grouped, add %>% ungroup().
For completeness, adding a data.table way:
library(data.table)
DT <- data.table(id = c(1, 2, 3, 2, 2, 1, 2, 3))
DT[, count := seq(.N), by = id][]
Output:
id count
1: 1 1
2: 2 1
3: 3 1
4: 2 2
5: 2 3
6: 1 2
7: 2 4
8: 3 2
The dataframe I had was too large and the accepted answer kept crashing. This worked for me:
library(plyr)
df$ones <- 1
df <- ddply(df, .(id), transform, cumulative_count = cumsum(ones))
df$ones <- NULL
Function to get the cumulative count of any array, including a non-numeric array:
cumcount <- function(x){
cumcount <- numeric(length(x))
names(cumcount) <- x
for(i in 1:length(x)){
cumcount[i] <- sum(x[1:i]==x[i])
}
return(cumcount)
}
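The loop above compares each element with everything before it, so its cost grows quadratically with the length of the input. A shorter sketch with the same intent uses ave(), as in the first answer. The name cumcount_ave is made up here, and unlike the loop version it returns an unnamed vector.

```r
# Hypothetical alternative: number the occurrences of each distinct value
cumcount_ave <- function(x) {
  ave(seq_along(x), x, FUN = seq_along)
}

cumcount_ave(c("a", "b", "a", "a"))
cumcount_ave(c(1, 2, 3, 2, 2, 1, 2, 3))
```

It handles non-numeric input as well, because ave() only uses x as a grouping variable.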
