Order multiple rows in a data frame - r

I have a data frame like this
ID EPOCH
B 2
B 3
A 1
A 2
A 3
C 0
and what I would like to do is to order it by each ID's first appearance (i.e. the minimum value of EPOCH for each ID) so that I get
ID EPOCH
C 0
A 1
A 2
A 3
B 2
B 3
I only managed to order the data frame by EPOCH and then by ID
df[order(df$EPOCH,df$ID),]
but then it is no longer clustered by ID, i.e.
C 0
A 1
A 2
B 2
A 3
B 3
Many thanks

First add a column with the minimum EPOCH for each ID to the data.frame:
data <- read.table(textConnection("ID EPOCH
B 2
B 3
A 1
A 2
A 3
C 0"), header=TRUE)
a <- aggregate(data$EPOCH, data["ID"], min)
names(a)[2] <- "min_EPOCH"
data <- merge(data, a)
Then sort on that new column:
o <- order(data$min_EPOCH, data$ID, data$EPOCH)
data[o, ]
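The same ordering can also be obtained without the intermediate merge. The following is only a sketch, assuming the same ID and EPOCH columns as above: ave() computes each ID's minimum EPOCH in place, and that vector is used as the first sort key. The name first_seen is made up for illustration.

```r
data <- data.frame(ID = c("B", "B", "A", "A", "A", "C"),
                   EPOCH = c(2, 3, 1, 2, 3, 0))

# first_seen (hypothetical name): minimum EPOCH of the ID each row belongs to
first_seen <- ave(data$EPOCH, data$ID, FUN = min)

# Sort by first appearance, then ID (to break ties), then EPOCH within each ID
sorted <- data[order(first_seen, data$ID, data$EPOCH), ]
sorted
```

Because the helper vector is never added to the data frame, there is no extra column to drop afterwards.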

Related

R determine sequence with dplyr using group by

For the following data frame I would like to determine the sequence of column drug, grouped by ID, where the order is based on column dat (the earliest date should come first in the sequence). In the initial df, some IDs have 1 row and some have more than 1 (in this case ID 1 & 5).
df <- data.frame(ID = c(1,1,2,3,4,5,5,6,7,8),
dat = seq(as.Date("2021-01-01"), as.Date("2021-03-05"), by="weeks"),
drug = c("A","A","B","C","B","B","C","D","C","B"))
The desired output should be
ID seq1
1 1 A,A
2 2 B
3 3 C
4 4 B
5 5 B,C
6 6 D
7 7 C
8 8 B
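One possible approach, sketched here in base R rather than dplyr so that it needs no extra packages: sort the rows by ID and dat, then paste the drugs together within each ID. The result name res is an assumption; seq1 is taken from the desired output above.

```r
df <- data.frame(ID = c(1,1,2,3,4,5,5,6,7,8),
                 dat = seq(as.Date("2021-01-01"), as.Date("2021-03-05"), by = "weeks"),
                 drug = c("A","A","B","C","B","B","C","D","C","B"))

# Sort so that, within each ID, the earliest date comes first
sorted <- df[order(df$ID, df$dat), ]

# Collapse the drugs of each ID into one comma-separated string
res <- aggregate(drug ~ ID, data = sorted, FUN = paste, collapse = ",")
names(res)[2] <- "seq1"
res
```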

Count of times a value in one df is exceeded in a second df subject to other conditions

I have a history of transactions in a dataframe. Each transaction has three attributes: year, size and color.
Transactions <- data.frame(Size=c("S","S","S","S","L","L","S","L"),
Color=c("R","R","R","B","R","B","B","R"),
Year=c(1,1,2,1,1,1,2,2))
Size Color Year
S R 1
S R 1
S R 2
S B 1
L R 1
L B 1
S B 2
L R 2
So the first, second and third transactions are: SR1, SR1, and SR2. That's three SR transactions. Two in year 1 and one in year 2.
I'd like a report, in the form of a df, that summarizes for each combination of color and size the number of times the year is matched or exceeded. So, for the data above, a correct final report is shown below.
Size Color Year Count
S R 1 3 (from obs 1,2,3 because there are 3 SRs Yr 1 or later)
S R 2 1 (from row 3 of transaction b/c just one SR2)
S B 1 2
S B 2 1
L R 1 2
L R 2 1
L B 1 1
L B 2 0 (because LB2 doesn't appear in transactions)
The sequence of the rows in the report doesn't come from the transaction frame. It's every combination of the levels of Size, Color, and Year. In my real problem, I have a df with the structure of the first three cols in the report, so I'd like to be able to just append the last col to it. This df without the final col would be:
Report <- data.frame(Size= c("S","S","S","S","L","L","L","L"),
Color=c("R","R","B","B","R","R","B","B"),
Year= c(1,2,1,2,1,2,1,2)
)
I would like to append the final col, but if there's a way to generate it directly from the transactions, that's fine, too. But since it's possible that some report combinations don't appear in the transactions I don't think that's feasible.
Here is a solution with data.table:
Transactions <- data.frame(Size=c("S","S","S","S","L","L","S","L"),
Color=c("R","R","R","B","R","B","B","R"),
Year=c(1,1,2,1,1,1,2,2))
library("data.table")
setDT(Transactions)
allYears <- Transactions[, unique(Year)]
Transactions[, .(Year=allYears, count=sapply(allYears, function(y) sum(Year>=y))), by=.(Size, Color)]
# > Transactions[, .(Year=allYears, count=sapply(allYears, function(y) sum(Year>=y))), by=.(Size, Color)]
# Size Color Year count
# 1: S R 1 3
# 2: S R 2 1
# 3: S B 1 2
# 4: S B 2 1
# 5: L R 1 2
# 6: L R 2 1
# 7: L B 1 1
# 8: L B 2 0
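If the goal is to append the column to the existing Report frame, as the question asks, a base R sketch along the following lines may work. It assumes the Transactions and Report frames exactly as defined in the question; the column name Count is taken from the desired report, and the helper argument names s, cl and y are arbitrary.

```r
Transactions <- data.frame(Size  = c("S","S","S","S","L","L","S","L"),
                           Color = c("R","R","R","B","R","B","B","R"),
                           Year  = c(1,1,2,1,1,1,2,2))
Report <- data.frame(Size  = c("S","S","S","S","L","L","L","L"),
                     Color = c("R","R","B","B","R","R","B","B"),
                     Year  = c(1,2,1,2,1,2,1,2))

# For each Report row, count the transactions with the same Size and Color
# whose Year is at least the Report row's Year.
Report$Count <- mapply(
  function(s, cl, y) {
    sum(Transactions$Size == s & Transactions$Color == cl & Transactions$Year >= y)
  },
  as.character(Report$Size), as.character(Report$Color), Report$Year,
  USE.NAMES = FALSE
)
Report
```

Combinations that never occur in Transactions, such as L B 2, simply get a count of 0, because the sum runs over an all-FALSE vector.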

Count of unique values across all columns in a data frame

We have a data frame as below :
raw<-data.frame(v1=c("A","B","C","D"),v2=c(NA,"B","C","A"),v3=c(NA,"A",NA,"D"),v4=c(NA,"D",NA,NA))
I need a result data frame in the following format :
result<-data.frame(v1=c("A","B","C","D"), v2=c(3,2,2,3))
Used the following code to get the count across one particular column :
count_raw<-sqldf("SELECT DISTINCT(v1) AS V1, COUNT(v1) AS count FROM raw GROUP BY v1")
This returns the count of each value within a single column only, not across all columns.
Any help would be highly appreciated.
Use this
table(unlist(raw))
Output
A B C D
3 2 2 3
For data frame type output wrap this with as.data.frame.table
as.data.frame.table(table(unlist(raw)))
Output
Var1 Freq
1 A 3
2 B 2
3 C 2
4 D 3
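To match the column names of the requested result frame (v1 and v2), the columns can be renamed afterwards. This is a small sketch; the name counts is made up, and stringsAsFactors = FALSE is only there so that v1 comes back as character rather than factor.

```r
raw <- data.frame(v1 = c("A","B","C","D"), v2 = c(NA,"B","C","A"),
                  v3 = c(NA,"A",NA,"D"), v4 = c(NA,"D",NA,NA))

# table() drops NA by default, so only A, B, C and D are counted
counts <- as.data.frame(table(unlist(raw)), stringsAsFactors = FALSE)
names(counts) <- c("v1", "v2")
counts
```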
If you want a total count,
sapply(unique(raw[!is.na(raw)]), function(i) length(which(raw == i)))
#A B C D
#3 2 2 3
We can use apply with MARGIN = 1
cbind(raw[1], v2=apply(raw, 1, function(x) length(unique(x[!is.na(x)]))))
If it is for each column
sapply(raw, function(x) length(unique(x[!is.na(x)])))
Or if we need the count based on all the columns, convert to matrix and use the table
table(as.matrix(raw))
# A B C D
# 3 2 2 3
If your data frame contains only character values, as in your example, you can unlist it and use unique to get the distinct values, or use count (from plyr) to get their frequencies
> library(plyr)
> raw<-data.frame(v1=c("A","B","C","D"),v2=c(NA,"B","C","A"),v3=c(NA,"A",NA,"D"),v4=c(NA,"D",NA,NA))
> unique(unlist(raw))
[1] A B C D <NA>
Levels: A B C D
> count(unlist(raw))
x freq
1 A 3
2 B 2
3 C 2
4 D 3
5 <NA> 6

How to write the remaining data frame in R after randomly subsetting the data

I took a random sample from a data frame. But I don't know how to get the remaining data frame.
df <- data.frame(x=rep(1:3,each=2),y=6:1,z=letters[1:6])
#select 3 random rows
df[sample(nrow(df),3)]
What I want is to get the remaining data frame with the other 3 rows.
sample draws new random numbers each time you run it, so if you want to reproduce its results you will need to either call set.seed first or save its result in a variable.
Addressing your question, you simply need to add - before your index in order to get the rest of the data set.
Also, don't forget to add a comma after the indx if you want to select rows (unlike in your question)
set.seed(1)
indx <- sample(nrow(df), 3)
Your subset
df[indx, ]
# x y z
# 2 1 5 b
# 6 3 1 f
# 3 2 4 c
Remaining data set
df[-indx, ]
# x y z
# 1 1 6 a
# 4 2 3 d
# 5 3 2 e
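One caveat with negative indexing: if the index vector happens to be empty, df[-indx, ] returns zero rows instead of the whole data frame. A logical mask avoids this. The sketch below assumes the same df as in the question; the names in_sample, sampled and remaining are made up for illustration.

```r
df <- data.frame(x = rep(1:3, each = 2), y = 6:1, z = letters[1:6])

set.seed(1)
indx <- sample(nrow(df), 3)

# TRUE for rows that were drawn, FALSE for the rest
in_sample <- seq_len(nrow(df)) %in% indx

sampled   <- df[in_sample, ]    # the 3 drawn rows, in original row order
remaining <- df[!in_sample, ]   # the other 3 rows

# Edge case: with an empty index the two approaches differ
empty <- integer(0)
nrow(df[-empty, ])                           # 0 rows
nrow(df[!(seq_len(nrow(df)) %in% empty), ])  # all 6 rows
```

Note that the mask returns the sampled rows in their original order, not in the order sample drew them; use df[indx, ] if the drawn order matters.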
Try:
> df
x y z
1 1 6 a
2 1 5 b
3 2 4 c
4 2 3 d
5 3 2 e
6 3 1 f
>
> df2 = df[sample(nrow(df),3),]
> df2
x y z
5 3 2 e
3 2 4 c
1 1 6 a
> df[!rownames(df) %in% rownames(df2),]
x y z
2 1 5 b
4 2 3 d
6 3 1 f

Replace values in a series exceeding a threshold

In a dataframe I'd like to replace values in a series where they exceed a given threshold.
For example, within a group ('ID') in a series ordered by 'time', if 'value' ever reaches 3, I'd like to make all following entries also equal 3.
ID <- as.factor(c(rep("A", 3), rep("B",3), rep("C",3)))
time <- rep(1:3, 3)
value <- c(c(1,1,2), c(2,3,2), c(3,3,2))
dat <- cbind.data.frame(ID, time, value)
dat
ID time value
A 1 1
A 2 1
A 3 2
B 1 2
B 2 3
B 3 2
C 1 3
C 2 3
C 3 2
I'd like it to be:
ID time value
A 1 1
A 2 1
A 3 2
B 1 2
B 2 3
B 3 3
C 1 3
C 2 3
C 3 3
This should be easy, but I can't figure it out. Thanks!
The ave function makes this very easy by allowing you to apply a function to each of the groupings. In this case, we will adapt cummax (cumulative maximum) to check whether we've seen a 3 yet.
dat$value2<-with(dat, ave(value, ID, FUN=
function(x) ifelse(cummax(x)>=3, 3, x)))
dat;
# ID time value value2
# 1 A 1 1 1
# 2 A 2 1 1
# 3 A 3 2 2
# 4 B 1 2 2
# 5 B 2 3 3
# 6 B 3 2 3
# 7 C 1 3 3
# 8 C 2 3 3
# 9 C 3 2 3
You could also just use FUN=cummax if you want never-decreasing values. I wasn't sure about the sequence c(1,2,1) if you wanted to keep that unchanged or not.
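To make the difference concrete, here is a small sketch on the sequence c(1,2,1) mentioned above, which never reaches the threshold. The names capped and monotone are made up for illustration.

```r
x <- c(1, 2, 1)   # never reaches the threshold of 3

# Threshold version from above: values are only replaced once a 3 has been seen
capped <- ifelse(cummax(x) >= 3, 3, x)

# Plain cumulative maximum: the series can never decrease
monotone <- cummax(x)

capped    # 1 2 1
monotone  # 1 2 2
```

So the ifelse version leaves dips below the threshold untouched, while plain cummax flattens every dip.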
If you can assume your data are sorted by group, then this should be fast, essentially relying on findInterval() behind the scenes:
library(IRanges)
id <- Rle(ID)
three <- which(value>=3L)
ir <- reduce(IRanges(three, end(id)[findRun(three, id)]))
dat$value[as.integer(ir)] <- 3L
This avoids looping over the groups.
