Adding a counter column for a set of similar rows in R [duplicate] - r

This question already has answers here:
How can I rank observations in-group faster?
(4 answers)
Closed 9 years ago.
I have a data-frame in R with two columns. The first column contains the subjectID and the second column contains the trial ID that subject has done.
The a specific subjectID might have done the trial for more than 1 time. I want to add a column with a counter that starts counting for each subject-trial unique value and increment by 1 till it reaches the last row with that occurance.
More precisely, I have this table:
ID T
A 1
A 1
A 2
A 2
B 1
B 1
B 1
B 1
and I want the following output
ID T Index
A 1 1
A 1 2
A 2 1
A 2 2
B 1 1
B 1 2
B 1 3
B 1 4

I really like the simple syntax of data.table for this (not to mention speed)...
# Load package
require( data.table )
# Turn data.frame into a data.table
dt <- data.table( df )
# Get running count by ID and T
dt[ , Index := 1:.N , by = c("ID" , "T") ]
# ID T Index
#1: A 1 1
#2: A 1 2
#3: A 2 1
#4: A 2 2
#5: B 1 1
#6: B 1 2
#7: B 1 3
#8: B 1 4
.N is an integer equal to the number of rows in each group. The groups are defined by the column names in the by argument, so 1:.N gives a vector as long as the group.
As data.table inherits from data.frame any function that takes a data.frame as input will also take a data.table as input and you can easily convert back if you wished ( df <- data.frame( dt ) )

Related

data.table: How to indicate first occurrence of unique column value by group

I have a large data.table ~ 18*10^6 rows filled with columns ID and CLASS and I want to create a new binary column that indicates the occurrence of a new CLASS value by ID.
DT <- data.table::data.table(ID=c("1","1","1","2","2"),
CLASS=c("a","a","b","c","b"))
### Starting
ID CLASS
1 a
1 a
1 b
2 c
2 b
### Desired
ID CLASS NEWCLS
1 a 1
1 a 0
1 b 1
2 c 1
2 b 1
I originally initialized the NEWCLS variable and used the data.table::shift() function to lag a 1 by ID and CLASS
DT[,NEWCLS:=0]
DT[,NEWCLS:=data.table::shift(NEWCLS, n = 1L, fill = 1, type = "lag"),by=.(ID,CLASS)]
This creates the desired output but with ~18*10^6 rows it takes quite some time, even for data.table.
Would someone know how to create the NEWCLS variable in quicker and more efficient way using solely data.table arguments?
One possibility could be:
DT[, NEWCLS := as.integer(!duplicated(CLASS)), by = ID]
ID CLASS NEWCLS
1: 1 a 1
2: 1 a 0
3: 1 b 1
4: 2 c 1
5: 2 b 1

R - How to reformat wide dataset to wider dataset by changing the primary unique row variable

Sorry if this has been asked already and if the title is very confusing. I did look and only found questions on reformating where the values of one of the columns were used as column headings in the output dataset.
My dataset is organized so the Filter is the unique value for each row. I want to change it so the id of the individual within each sampling season is unique for each row since individuals had multiple filters. Basically, I want to reformat Table 1 so it looks like Table 2.
Table 1
id season FilterI
1: 1 1 A
2: 1 1 B
3: 2 1 C
4: 2 1 D
5: 1 2 E
6: 1 2 F
Table 2
id season FilterI1 FilterI2
1: 1 1 A B
2: 1 2 E F
3: 2 1 C D
Reshape does not seem to work because none of the columns in the first dataset contain the column headings for the second dataset.
Using dcast with rowid, change from 'long' to 'wide' (assuming the example data is data.table)
library(data.table)
dcast(Table1, id + season ~ paste0("FilterI", rowid(id)), value.var = "FilterI")
# id season FilterI1 FilterI2
#1: 1 1 A B
#2: 1 2 E F
#3: 2 1 C D

Duplicating data frame rows by freq value in same data frame [duplicate]

This question already has answers here:
Repeat each row of data.frame the number of times specified in a column
(10 answers)
Closed 7 years ago.
I have a data frame with names by type and their frequencies. I'd like to expand this data frame so that the names are repeated according to their name-type frequency.
For example, this:
> df = data.frame(name=c('a','b','c'),type=c(0,1,2),freq=c(2,3,2))
name type freq
1 a 0 2
2 b 1 3
3 c 2 2
would become this:
> df_exp
name type
1 a 0
2 a 0
3 b 1
4 b 1
5 b 1
6 c 2
7 c 2
Appreciate any suggestions on a easy way to do this.
You can just use rep to "expand" your data.frame rows:
df[rep(sequence(nrow(df)), df$freq), c("name", "type")]
# name type
# 1 a 0
# 1.1 a 0
# 2 b 1
# 2.1 b 1
# 2.2 b 1
# 3 c 2
# 3.1 c 2
And there's a function expandRows in the splitstackshape package that does exactly this. It also has the option to accept a vector specifying how many times to replicate each row, for example:
expandRows(df, "freq")

Using R: Make a new column that counts the number of times 'n' conditions from 'n' other columns occur

I have columns 1 and 2 (ID and value). Next I would like a count column that lists the # of times that the same value occurs per id. If it occurs more than once, it will obviously repeat the value. There are other variables in this data set, but the new count variable needs to be conditional only on 2 of them. I have scoured this blog, but I can't find a way to make the new variable conditional on more than one variable.
ID Value Count
1 a 2
1 a 2
1 b 1
2 a 2
2 a 2
3 a 1
3 b 3
3 b 3
3 b 3
Thank you in advance!
You can use ave:
df <- within(df, Count <- ave(ID, list(ID, Value), FUN=length))
You can use ddply from plyr package:
library(plyr)
df1<-ddply(df,.(ID,Value), transform, count1=length(ID))
>df1
ID Value Count count1
1 1 a 2 2
2 1 a 2 2
3 1 b 1 1
4 2 a 2 2
5 2 a 2 2
6 3 a 1 1
7 3 b 3 3
8 3 b 3 3
9 3 b 3 3
> identical(df1$Count,df1$count1)
[1] TRUE
Update: As suggested by #Arun, you can replace transform with mutate if you are working with large data.frame
Of course, data.table also has a solution!
data[, Count := .N, by = list(ID, Value)
The built-in constant, ".N", is a length 1 vector reporting the number of observations in each group.
The downside to this approach would be joining this result with your initial data.frame (assuming you wish to retain the original dimensions).

Multirow deletion: delete row depending on other row

I'm stuck with a quite complex problem. I have a data frame with three rows: id, info and rownum. The data looks like this:
id info row
1 a 1
1 b 2
1 c 3
2 a 4
3 b 5
3 a 6
4 b 7
4 c 8
What I want to do now is to delete all other rows of one id if one of the rows contains the info a. This would mean for example that row 2 and 3 should be removed as row 1's coloumn info contains the value a. Please note that the info values are not ordered (id 3/row 5 & 6) and cannot be ordered due to other data limitations.
I solved the case using a for loop:
# select all id containing an "a"-value
a_val <- data$id[grep("a", data$info)]
# check for every id containing an "a"-value
for(i in a_val) {
temp_data <- data[which(data$id == i),]
# only go on if the given id contains more than one row
if (nrow(temp_data) > 1) {
for (ii in nrow(temp_data)) {
if (temp_data$info[ii] != "a") {
temp <- temp_data$row[ii]
if (!exists("delete_rows")) {
delete_rows <- temp
} else {
delete_rows <- c(delete_rows, temp)
}
}
}
}
}
My solution works quite well. Nevertheless, it is very, very, very slow as the original data contains more than 700k rows and more that 150k rows with an "a"-value.
I could use a foreach loop with 4 cores to speed it up, but maybe someone could give me a hint for a better solution.
Best regards,
Arne
[UPDATE]
The outcome should be:
id info row
1 a 1
2 a 4
3 a 6
4 b 7
4 c 8
Here is one possible solution.
First find ids where info contains "a":
ids <- with(data, unique(id[info == "a"]))
Subset the data:
subset(data, (id %in% ids & info == "a") | !id %in% ids)
Output:
id info row
1 1 a 1
4 2 a 4
6 3 a 6
7 4 b 7
8 4 c 8
An alternative solution (maybe harder to decipher):
subset(data, info == "a" | !rep.int(tapply(info, id, function(x) any(x == "a")),
table(id)))
Note. #BenBarnes found out that this solution only works if the data frame is ordered according to id.
You might want to investigate the data.table package:
EDIT: If the row variable is not a sequential numbering of each row in your data (as I assumed it was), you could create such a variable to obtain the original row order:
library(data.table)
# Create data.table of your data
dt <- as.data.table(data)
# Create index to maintain row order
dt[, idx := seq_len(nrow(dt))]
# Set a key on id and info
setkeyv(dt, c("id", "info"))
# Determine unique ids
uid <- dt[, unique(id)]
# subset your data to select rows with "a"
dt2 <- dt[J(uid, "a"), nomatch = 0]
# identify rows of dataset where the id doesn't have an "a"
dt3 <- dt[J(dt2[, setdiff(uid, id)])]
# rbind those two data.tables together
(dt4 <- rbind(dt2, dt3))
# id info row idx
# 1: 1 a 1 1
# 2: 2 a 4 4
# 3: 3 a 6 6
# 4: 4 b 7 7
# 5: 4 c 8 8
# And if you need the original ordering of rows,
dt5 <- dt4[order(idx)]
Note that setting a key for the data.table will order the rows according to the key columns. The last step (creating dt5) sets the row order back to the original.
Here is a way using ddply:
df <- read.table(text="id info row
1 a 1
1 b 2
1 c 3
2 a 4
3 b 5
3 a 6
4 b 7
4 c 8",header=TRUE)
library("plyr")
ddply(df,.(id),subset,rep(!'a'%in%info,length(info))|info=='a')
Returns:
id info row
1 1 a 1
2 2 a 4
3 3 a 6
4 4 b 7
5 4 c 8
if df is this (RE Sacha above) use match which just finds the index of the first occurrence:
df <- read.table(text="id info row
1 a 1
1 b 2
1 c 3
2 a 4
3 b 5
3 a 6
4 b 7
4 c 8",header=TRUE)
# the first info row matching 'a' and all other rows that are not 'a'
with(df, df[c(match('a',info), which(info != 'a')),])
id info row
1 1 a 1
2 1 b 2
3 1 c 3
5 3 b 5
7 4 b 7
8 4 c 8
try to take a look at subset, it's quite easy to use and it will solve your problem.
you just need to specify the value of the column that you want to subset based on, alternatively you can choose more columns.
http://stat.ethz.ch/R-manual/R-devel/library/base/html/subset.html
http://www.statmethods.net/management/subset.html

Resources