How to shift data in only one column up and down in R? - r

I have a data frame that looks as follows:
ID
Count
1
3
2
5
3
2
4
0
5
1
And I am trying to shift ONLY the values in the "Count" column down one so that it looks as follows:
ID
Count
1
NA
2
3
3
5
4
2
5
0
I will also need to eventually shift the same data up one:
ID
Count
1
5
2
2
3
0
4
1
5
NA
I've tried the following code:
shift <- function(x, n){
c(x[-(seq(n))], rep(NA, n))
}
df$Count <- shift(df$Count, 1)
But it ended up duplicating the titles and shifting the data down, like as follows:
ID
Count
ID
Count
1
3
2
5
3
2
4
0
Is there an easy way for me to accomplish this? Thank you!!

# set as data.table
setDT(df)
# shift
df[, count := shift(count, 1)]

df$Count=c(NA, df$Count[1:(nrow(df)-1)])

1) dplyr Using DF shown reproducibly in the Note at the end, use lag and lead from dplyr
library(dplyr)
DF %>% mutate(CountLag = lag(Count), CountLead = lead(Count))
## ID Count CountLag CountLead
## 1 1 3 NA 5
## 2 2 5 3 2
## 3 3 2 5 0
## 4 4 0 2 1
## 5 5 1 0 NA
2) zoo This creates a zoo object using zoo's vectorized lag. Optionally use fortify.zoo(z) or as.ts(z) to convert it back to a data frame or ts object.
Note that dplyr clobbers lag with its own lag so we used stats::lag to ensure it does not interfere. The stats:: can optionally be omitted if dplyr is not loaded.
library(zoo)
z <- stats::lag(read.zoo(DF), seq(-1, 1)); z
Index lag-1 lag0 lag1
1 1 NA 3 5
2 2 3 5 2
3 3 5 2 0
4 4 2 0 1
5 5 0 1 NA
3) collapse flag from the collapse package is also vectorized over its second argument.
library(collapse)
with(DF, data.frame(ID, Count = flag(Count, seq(-1, 1))))
## ID Count.F1 Count... Count.L1
## 1 1 5 3 NA
## 2 2 2 5 3
## 3 3 0 2 5
## 4 4 1 0 2
## 5 5 NA 1 0
Note
DF <- data.frame(ID = 1:5, Count = c(3, 5, 2, 0, 1))

Related

How to vectorize the RHS of dplyr::case_when?

Suppose I have a dataframe that looks like this:
> data <- data.frame(x = c(1,1,2,2,3,4,5,6), y = c(1,2,3,4,5,6,7,8))
> data
x y
1 1 1
2 1 2
3 2 3
4 2 4
5 3 5
6 4 6
7 5 7
8 6 8
I want to use mutate and case_when to create a new id variable that will identify rows using the variable x, and give rows missing x a unique id. In other words, I should have the same id for rows one and two, rows three and four, while rows 5-8 should have their own unique ids. Suppose I want to generate these id values with a function:
id_function <- function(x, n){
set.seed(x)
res <- character(n)
for(i in seq(n)){
res[i] <- paste0(sample(c(letters, LETTERS, 0:9), 32), collapse="")
}
res
}
id_function(1, 1)
[1] "4dMaHwQnrYGu0PTjgioXKOyW75NRZtcf"
I am trying to use this function on the RHS of a case_when expression like this:
data %>%
mutate(my_id = id_function(1234, nrow(.)),
my_id = dplyr::case_when(!is.na(x) ~ id_function(x, 1),
TRUE ~ my_id))
But the RHS does not seem to be vectorized and I get the same value for all non-missing values of x:
x y my_id
1 1 1 4dMaHwQnrYGu0PTjgioXKOyW75NRZtcf
2 1 2 4dMaHwQnrYGu0PTjgioXKOyW75NRZtcf
3 2 3 4dMaHwQnrYGu0PTjgioXKOyW75NRZtcf
4 2 4 4dMaHwQnrYGu0PTjgioXKOyW75NRZtcf
5 NA 5 0vnws5giVNIzp86BHKuOZ9ch4dtL3Fqy
6 NA 6 IbKU6DjvW9ypitl7qc25Lr4sOwEfghdk
7 NA 7 8oqQMPx6IrkGhXv4KlUtYfcJ5Z1RCaDy
8 NA 8 BRsjumlCEGS6v4ANrw1bxLynOKkF90ao
I'm sure there's a way to vectorize the RHS, what am I doing wrong? Is there an easier approach to solving this problem?
I guess rowwise() would do the trick:
data %>%
rowwise() %>%
mutate(my_id = id_function(x, 1))
x y my_id
1 1 4dMaHwQnrYGu0PTjgioXKOyW75NRZtcf
1 2 4dMaHwQnrYGu0PTjgioXKOyW75NRZtcf
2 3 uof7FhqC3lOXkacp54MGZJLUR6siSKDb
2 4 uof7FhqC3lOXkacp54MGZJLUR6siSKDb
3 5 e5lMJNQEhtj4VY1KbCR9WUiPrpy7vfXo
4 6 3kYcgR7109DLbxatQIAKXFeovN8pnuUV
5 7 bQ4ok7OuDgscLUlpzKAivBj2T3m6wrWy
6 8 0jSn3Jcb2HDA5uhvG8g1ytsmRpl6CQWN
purrr map functions can be used for non-vectorized functions. The following will give you a similar result. map2 will take the two arguments expected by your id_function.
library(tidyverse)
data %>%
mutate(my_id = map2(x, 1, id_function))
Output
x y my_id
1 1 1 4dMaHwQnrYGu0PTjgioXKOyW75NRZtcf
2 1 2 4dMaHwQnrYGu0PTjgioXKOyW75NRZtcf
3 2 3 uof7FhqC3lOXkacp54MGZJLUR6siSKDb
4 2 4 uof7FhqC3lOXkacp54MGZJLUR6siSKDb
5 3 5 e5lMJNQEhtj4VY1KbCR9WUiPrpy7vfXo
6 4 6 3kYcgR7109DLbxatQIAKXFeovN8pnuUV
7 5 7 bQ4ok7OuDgscLUlpzKAivBj2T3m6wrWy
8 6 8 0jSn3Jcb2HDA5uhvG8g1ytsmRpl6CQWN

Create a rolling index of pairs over groups

I need to create (with R) a rolling index of pairs from a data set that includes groups. Consider the following data set:
times <- c(4,3,2)
V1 <- unlist(lapply(times, function(x) seq(1, x)))
df <- data.frame(group = rep(1:length(times), times = times),
V1 = V1,
rolling_index = c(1,1,2,2,3,3,4,5,5))
df
group V1 rolling_index
1 1 1 1
2 1 2 1
3 1 3 2
4 1 4 2
5 2 1 3
6 2 2 3
7 2 3 4
8 3 1 5
9 3 2 5
The data frame I have includes the variables group and V1. Within each group V1 designates a running index (that may or may not start at 1).
I want to create a new indexing variable that looks like rolling_index. This variable groups rows within the same group and consecutive V1 value, thus creating a new rolling index. This new index must be consecutive over groups. If there is an uneven amount of rows within a group (e.g. group 2), then the last, single row gets its own rolling index value.
You can try
library(data.table)
setDT(df)[, gr:=as.numeric(gl(.N, 2, .N)), group][,
rollindex:=cumsum(c(TRUE,abs(diff(gr))>0))][,gr:= NULL]
# group V1 rolling_index rollindex
#1: 1 1 1 1
#2: 1 2 1 1
#3: 1 3 2 2
#4: 1 4 2 2
#5: 2 1 3 3
#6: 2 2 3 3
#7: 2 3 4 4
#8: 3 1 5 5
#9: 3 2 5 5
Or using base R
indx1 <- !duplicated(df$group)
indx2 <- with(df, ave(group, group, FUN=function(x)
gl(length(x), 2, length(x))))
cumsum(c(TRUE,diff(indx2)>0)|indx1)
#[1] 1 1 2 2 3 3 4 5 5
Update
The above methods are based on the 'group' column. Suppose you already have a sequence column ('V1') by group as showed in the example, creation of rolling index is easier
cumsum(!!df$V1 %%2)
#[1] 1 1 2 2 3 3 4 5 5
As mentioned in the post, if the 'V1' column do not start at '1' for some groups, we can get the sequence from the 'group' and then do the cumsum as above
cumsum(!!with(df, ave(seq_along(group), group, FUN=seq_along))%%2)
#[1] 1 1 2 2 3 3 4 5 5
There is probably a simpler way but you can do:
rep_each <- unlist(mapply(function(q,r) {c(rep(2, q),rep(1, r))},
q=table(df$group)%/%2,
r=table(df$group)%%2))
df$rolling_index <- inverse.rle(x=list(lengths=rep_each, values=seq(rep_each)))
df$rolling_index
#[1] 1 1 2 2 3 3 4 5 5

How can I subset a dataframe according to group membership?

I am wanting to write a function so that a (potentially large) dataframe can be subsetted according to group membership, where a 'group' is a unique combination of a set of column values.
For example, I would like to subset the following data frame according to unique combination of the first two columns (Loc1 and Loc2).
Loc1 <- c("A","A","A","A","B","B","B")
Loc2 <- c("a","a","b","b","a","a","b")
Dat1 <- c(1,1,1,1,1,1,1)
Dat2 <- c(1,2,1,2,1,2,2)
Dat3 <- c(2,2,4,4,6,5,3)
DF=data.frame(Loc1,Loc2,Dat1,Dat2,Dat3)
Loc1 Loc2 Dat1 Dat2 Dat3
1 A a 1 1 2
2 A a 1 2 2
3 A b 1 1 4
4 A b 1 2 4
5 B a 1 1 6
6 B a 1 2 5
7 B b 1 2 3
I want to return (i) the number of groups (i.e. 4), (ii) the number in each group (i.e. c(2,2,2,1), and (iii) to relabel the rows so that I can further analyse the data frame according to group membership (e.g. for ANOVA and MANOVA) (i.e.
Group<-as.factor(c(1,1,2,2,3,3,4))
Data <- cbind(Group,DF[,-1:-2])
Group Dat1 Dat2 Dat3
1 1 1 1 2
2 1 1 2 2
3 2 1 1 4
4 2 1 2 4
5 3 1 1 6
6 3 1 2 5
7 4 1 2 3
).
So far all I have managed is to get the number of groups, and I'm suspicious that there's a better way to do even this:
nrow(unique(DF[,1:2]))
I was hoping to avoid for-loops as I am concerned about the function being slow.
I have tried converting to a data matrix so that I could concatenate the row values but I couldn't get that to work either.
Many thanks
You could try:
Create Group column by using unique level combination of Loc1 and Loc2.
indx <- paste(DF[,1], DF[,2])
DF$Group <- as.numeric(factor(indx, unique(indx))) #query No (iii)
DF1 <- DF[-(1:2)][,c(4,1:3)]
# Group Dat1 Dat2 Dat3
#1 1 1 1 2
#2 1 1 2 2
#3 2 1 1 4
#4 2 1 2 4
#5 3 1 1 6
#6 3 1 2 5
#7 4 1 2 3
table(DF$Group) #(No. ii)
#1 2 3 4
#2 2 2 1
length(unique(DF$Group)) #(i)
#[1] 4
Then, if you need to subset the datasets by group, you could split the dataset using the Group to create a list of 4 list elements
split(DF1, DF1$Group)
Update
If you have multiple columns, you could still try:
ColstoGroup <- 1:2
indx <- apply(DF[,ColstoGroup], 1, paste, collapse="")
as.numeric(factor(indx, unique(indx)))
#[1] 1 1 2 2 3 3 4
You could create a function;
fun1 <- function(dat, GroupCols){
FactGroup <- dat[, GroupCols]
if(length(GroupCols)==1){
dat$Group <- as.numeric(factor(FactGroup, levels=unique(FactGroup)))
}
else {
indx <- apply(FactGroup, 1, paste, collapse="")
dat$Group <- as.numeric(factor(indx, unique(indx)))
}
dat
}
fun1(DF, "Loc1")
fun1(DF, c("Loc1", "Loc2"))
This gets all three of your queries.
Begin with a table of the first two columns and then work with that data.
> (tab <- table(DF$Loc1, DF$Loc2))
#
# a b
# A 2 2
# B 2 1
#
> (ct <- c(tab)) ## (ii)
# [1] 2 2 2 1
> length(unlist(dimnames(tab))) ## (i)
# [1] 4
> cbind(Group = rep(seq_along(ct), ct), DF[-c(1,2)]) ## (iii)
# Group Dat1 Dat2 Dat3
# 1 1 1 1 2
# 2 1 1 2 2
# 3 2 1 1 4
# 4 2 1 2 4
# 5 3 1 1 6
# 6 3 1 2 5
# 7 4 1 2 3
Borrowing a bit from this answer and using some dplyr idioms:
library(dplyr)
Loc1 <- c("A","A","A","A","B","B","B")
Loc2 <- c("a","a","b","b","a","a","b")
Dat1 <- c(1,1,1,1,1,1,1)
Dat2 <- c(1,2,1,2,1,2,2)
Dat3 <- c(2,2,4,4,6,5,3)
DF <- data.frame(Loc1, Loc2, Dat1, Dat2, Dat3)
emitID <- local({
idCounter <- -1L
function(){
idCounter <<- idCounter + 1L
}
})
DF %>% group_by(Loc1, Loc2) %>% mutate(Group=emitID())
## Loc1 Loc2 Dat1 Dat2 Dat3 Group
## 1 A a 1 1 2 0
## 2 A a 1 2 2 0
## 3 A b 1 1 4 1
## 4 A b 1 2 4 1
## 5 B a 1 1 6 2
## 6 B a 1 2 5 2
## 7 B b 1 2 3 3

Add a column for counting unique tuples in the data frame [duplicate]

This question already has answers here:
How to get frequencies then add it as a variable in an array?
(3 answers)
Closed 8 years ago.
Suppose I have the following data frame:
userID <- c(1, 1, 3, 5, 3, 5)
A <- c(2, 3, 2, 1, 2, 1)
B <- c(2, 3, 1, 0, 1, 0)
df <- data.frame(userID, A, B)
df
# userID A B
# 1 1 2 2
# 2 1 3 3
# 3 3 2 1
# 4 5 1 0
# 5 3 2 1
# 6 5 1 0
I would like to create a data frame with the same columns but with an added final column that counts up the number of unique tuples / combinations of the other columns. The output should look like the following:
userID A B count
1 2 2 1
1 3 3 1
3 2 1 2
5 1 0 2
The meaning is the the tuple / combination of (1, 2, 2) occurs with count=1, while the tuple of (3, 2, 1) occurs twice so has count=2. I would prefer not to use any external packages.
1) aggregate
ag <- aggregate(count ~ ., cbind(count = 1, df), length)
ag[do.call("order", ag), ] # sort the rows
giving:
userID A B count
3 1 2 2 1
4 1 3 3 1
2 3 2 1 2
1 5 1 0 2
The last line of code which sorts the rows could be omitted if the order of the rows is unimportant.
The remaining solutions use the indicated packages:
2) sqldf
library(sqldf)
Names <- toString(names(df))
fn$sqldf("select *, count(*) count from df group by $Names order by $Names")
giving:
userID A B count
1 1 2 2 1
2 1 3 3 1
3 3 2 1 2
4 5 1 0 2
The order by clause could be omitted if the order is unimportant.
3) dplyr
library(dplyr)
df %>% regroup(as.list(names(df))) %>% summarise(count = n())
giving:
Source: local data frame [4 x 4]
Groups: userID, A
userID A B count
1 1 2 2 1
2 1 3 3 1
3 3 2 1 2
4 5 1 0 2
4) data.table
library(data.table)
data.table(df)[, list(count = .N), by = names(df)]
giving:
userID A B count
1: 1 2 2 1
2: 1 3 3 1
3: 3 2 1 2
4: 5 1 0 2
ADDED additional solutions. Also some small improvements.
Here's a fairly straightforward way (ave to the rescue!):
unique(cbind(df,
count = ave(rep(1, nrow(df)),
do.call(paste, df),
FUN = length)))
# userID A B count
# 1 1 2 2 1
# 2 1 3 3 1
# 3 3 2 1 2
# 4 5 1 0 2
Here's a variation of the above:
unique(within(df, {
counter <- rep(1, nrow(df))
count <- ave(counter, df, FUN = length)
rm(counter)
}))
# userID A B count
# 1 1 2 2 1
# 2 1 3 3 1
# 3 3 2 1 2
# 4 5 1 0 2
userID <- c(1, 1, 3, 5, 3, 5)
A <- c(2, 3, 2, 1, 2, 1)
B <- c(2, 3, 1, 0, 1, 0)
df <- data.frame(userID, A, B)
Make a quick factor of the tuples:
df$AB <- as.factor(paste(df$userID,df$A,df$B, sep=""))
No external packages just taking advantage of summary() and storing it as a DF then merging the counts on the original data:
df2 <- as.data.frame(summary(df$AB))
df2 <- data.frame(x=row.names(df2), y=df2[1])
names(df2) <- c("AB", "count")
df <- merge(df, df2, by="AB", all.x=TRUE)
df$AB <- NULL
Almost final output, just has dupes:
df
userID A B count
1 1 2 2 1
2 1 3 3 1
3 3 2 1 2
4 3 2 1 2
5 5 1 0 2
6 5 1 0 2
Lastly, clean up dupes:
df <- df[!duplicated(df), ]
Here you go:
df
userID A B count
1 1 2 2 1
2 1 3 3 1
3 3 2 1 2
5 5 1 0 2
Been a while not doing that with sql or plyr. if you can use dplyr or a package later on do it. Bioconductor has a lot of great sequencing packages if it starts to get more complex.
Hope this helps.
This should do the trick, even if it is a little bit ugly:
vec <- table(apply(df,1,paste,collapse=""))
df2 <- data.frame(do.call(rbind,strsplit(names(vec),"")))
names(df2) <- names(df)
df2$count <- vec
# userID A B count
#1 1 2 2 1
#2 1 3 3 1
#3 3 2 1 2
#4 5 1 0 2

Select max or equal value from several columns in a data frame

I'm trying to select the column with the highest value for each row in a data.frame. So for instance, the data is set up as such.
> df <- data.frame(one = c(0:6), two = c(6:0))
> df
one two
1 0 6
2 1 5
3 2 4
4 3 3
5 4 2
6 5 1
7 6 0
Then I'd like to set another column based on those rows. The data frame would look like this.
> df
one two rank
1 0 6 2
2 1 5 2
3 2 4 2
4 3 3 3
5 4 2 1
6 5 1 1
7 6 0 1
I imagine there is some sort of way that I can use plyr or sapply here but it's eluding me at the moment.
There might be a more efficient solution, but
ranks <- apply(df, 1, which.max)
ranks[which(df[, 1] == df[, 2])] <- 3
edit: properly spaced!

Resources