Efficient way to count the change of values between 2 or more matrices or vectors - r

I am checking the change that occurs between different datasets; for now I am using a simple loop that gives me the counts for each change. The datasets are numeric (sequences of numbers), and I count how many times each change occurs (e.g. 1 changed to 5 XX times):
n = 100
tmp1 <- sample(1:25, n, replace = TRUE)
tmp2 <- sample(1:25, n, replace = TRUE)
values_tmp1 = sort(unique(tmp1))
values_tmp2 = sort(unique(tmp2))
count = c()
i = 1
for (m in 1:length(values_tmp1)) {
  for (j in 1:length(values_tmp2)) {
    count[i] = length(which(tmp1 == values_tmp1[m] & tmp2 == values_tmp2[j]))
    i = i + 1
  }
}
However, my data is much bigger (n = 2000000), and the loop gets extremely slow.
Can anyone help me improve this calculation?

Like this?
tmp1 <- c(1:5,3)
tmp2 <- c(1,3,3,1,5,3)
aggregate(tmp1,list(tmp1,tmp2),length)
# Group.1 Group.2 x
# 1 1 1 1
# 2 4 1 1
# 3 2 3 1
# 4 3 3 2
# 5 5 5 1
This might be faster for a big dataset:
library(data.table)
DT <- data.table(cbind(tmp1,tmp2),key=c("tmp1","tmp2"))
DT[,.N,by=key(DT)]
# tmp1 tmp2 N
# 1: 1 1 1
# 2: 2 3 1
# 3: 3 3 2
# 4: 4 1 1
# 5: 5 5 1
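For completeness, a base-R cross-tabulation also works here (a sketch added for this write-up, not part of the original answers). table() counts every (tmp1, tmp2) pair in one vectorized pass, and as.data.frame() reshapes the result into long format, including zero counts:
# cross-tabulate the two vectors; each cell is the count of one (tmp1, tmp2) pair
counts <- as.data.frame(table(tmp1 = tmp1, tmp2 = tmp2))
# keep only the combinations that actually occur
counts[counts$Freq > 0, ]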

Related

Slow population of a dataframe

I'm still very new to R and am noticing a very slow load time when populating a data frame.
For my dataset I want to populate the data frame row by row, repeating each row according to the value in the $population column.
It should end up with around 700,000 rows, but after 10 minutes of processing it has only loaded about 77,000, which seems really, really slow.
Code as per below
df <- data.frame(Ints=integer())
for(i in 1:nrow(popDemo)) {
  row <- popDemo[i, ]
  # Use a while value to loop
  j <- 1
  while (j <= row$population) {
    df[nrow(df) + 1, ] <- row$age
    j = j + 1
  }
}
Any guidance greatly appreciated
Thanks
Starting with a simple popDemo,
popDemo <- data.frame(population=c(3,5), age=c(1,10))
Your code produces
df <- data.frame(Ints=integer())
for (i in 1:nrow(popDemo)) {
  row <- popDemo[i, ]
  # Use a while value to loop
  j <- 1
  while (j <= row$population) {
    df[nrow(df) + 1, ] <- row$age
    j = j + 1
  }
}
df
# Ints
# 1 1
# 2 1
# 3 1
# 4 10
# 5 10
# 6 10
# 7 10
# 8 10
This can be done much faster in one step:
data.frame(Ints = rep(popDemo$age, times = popDemo$population))
# Ints
# 1 1
# 2 1
# 3 1
# 4 10
# 5 10
# 6 10
# 7 10
# 8 10
If by chance you have more columns and you're hoping to just repeat whole rows, here is an alternative implementation that is not limited to one column:
popDemo <- data.frame(population=c(3,5), age=c(1,10), ltr=c("a","b"))
popDemo[ rep(seq_len(nrow(popDemo)), times = popDemo$population), ]
# population age ltr
# 1 3 1 a
# 1.1 3 1 a
# 1.2 3 1 a
# 2 5 10 b
# 2.1 5 10 b
# 2.2 5 10 b
# 2.3 5 10 b
# 2.4 5 10 b
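If the tidyverse is already loaded, tidyr's uncount() does the same row repetition (a sketch of an alternative, not part of the original answer; it also handles any number of extra columns):
library(tidyr)
# repeat each row 'population' times; the population column itself is dropped
# by default (pass .remove = FALSE to keep it)
uncount(popDemo, population)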

Find unique set of strings in vector where vector elements can be multiple strings

I have a series of batch records that are labeled sequentially. Sometimes batches overlap.
x <- c("1","1","1/2","2","3","4","5/4","5")
> data.frame(x)
x
1 1
2 1
3 1/2
4 2
5 3
6 4
7 5/4
8 5
I want to find the set of batches that are not overlapping and label those periods. Batch "1/2" includes both "1" and "2", so it is not unique. When batch = "3", which is not contained in any previous batches, it starts a new period. I'm having difficulty dealing with the combined batches; otherwise this would be straightforward. The result of this would be:
x period
1 1 1
2 1 1
3 1/2 1
4 2 1
5 3 2
6 4 3
7 5/4 3
8 5 3
My experience is in more functional programming paradigms, so I know the way I did this is very un-R. I'm looking for the way to do this in R that is clean and simple. Any help is appreciated.
Here's my un-R code that works, but is super clunky and not extensible.
x <- c("1","1","1/2","2","3","4","5/4","5")
p <- 1          # period number
temp <- NULL    # temp variable for storing cases of x (batches)
temp[1] <- x[1]
period <- NULL
rl <- 0         # length to repeat period
for (i in 1:length(x)) {
  # check for "/", split and add to temp
  if (grepl("/", x[i])) {
    z <- strsplit(x[i], "/")    # split character
    z <- unlist(z)              # convert to vector
    temp <- c(temp, z, x[i])    # add to temp vector for comparison
  }
  # check if x in temp
  if (x[i] %in% temp) {
    temp <- append(temp, x[i])  # add to search vector
    rl <- rl + 1                # increase length
  } else {
    period <- append(period, rep(p, rl))  # add to period vector
    p <- p + 1                  # increase period count
    temp <- NULL                # reset
    rl <- 1                     # reset
  }
}
#add last batch
rl <- length(x) - length(period)
period <- append(period, rep(p,rl))
df <- data.frame(x,period)
> df
x period
1 1 1
2 1 1
3 1/2 1
4 2 1
5 3 2
6 4 3
7 5/4 3
8 5 3
R has functional paradigm influences, so you can solve this with Map and Reduce. Note that this solution follows your approach in unioning seen values. A simpler approach is possible if you assume batch numbers are consecutive, as they are in your example.
x <- c("1","1","1/2","2","3","4","5/4","5")
s <- strsplit(x, "/")
r <- Reduce(union, s, init = list(), accumulate = TRUE)
p <- cumsum(Map(function(x, y) length(intersect(x, y)) == 0, s, r[-length(r)]))
data.frame(x, period = p)
x period
1 1 1
2 1 1
3 1/2 1
4 2 1
5 3 2
6 4 3
7 5/4 3
8 5 3
What this does is first calculate a cumulative union of seen values. Then, it maps across this to determine the places where none of the current values have been seen before. (Alternatively, this second step could be included within the reduce, but this would be wordier without support for destructuring.) The cumulative sum provides the "period" numbers based on the number of times the intersections have come up empty.
If you make the assumption that the batch numbers are consecutive, then you can do the following instead:
x <- c("1","1","1/2","2","3","4","5/4","5")
s <- strsplit(x, "/")
n <- mapply(function(x) range(as.numeric(x)), s)
p <- cumsum(c(1, n[1, -1] > n[2, -ncol(n)]))
data.frame(x, period = p)
This gives the same result (not repeated here).
A little bit shorter:
x <- c("1","1","1/2","2","3","4","5/4","5")
x <- data.frame(x = x, period = -1, stringsAsFactors = FALSE)
period <- 0
prevBatch <- -1
for (i in 1:nrow(x))
{
  # convert to numeric so batches compare by value rather than alphabetically
  spl <- as.numeric(unlist(strsplit(x$x[i], "/")))
  currentBatch <- min(spl)
  if (currentBatch < prevBatch) { stop("Error in sequence") }
  if (currentBatch > prevBatch)
    period <- period + 1
  x$period[i] <- period
  prevBatch <- max(spl)
}
x
Here's a twist on the original that uses tidyr to split the data into two columns so it's easier to use:
# sample data
x <- c("1","1","1/2","2","3","4","5/4","5")
df <- data.frame(x)
library(tidyr)
# separate x into two columns, with second NA if only one number
df <- separate(df, x, c('x1', 'x2'), sep = '/', remove = FALSE, convert = TRUE)
Now df looks like:
> df
x x1 x2
1 1 1 NA
2 1 1 NA
3 1/2 1 2
4 2 2 NA
5 3 3 NA
6 4 4 NA
7 5/4 5 4
8 5 5 NA
Now the loop can be a lot simpler:
period <- 1
for (i in 1:nrow(df)) {
  period <- c(period,
              # test if either x1 or x2 of row i are in any x1 or x2 above it
              ifelse(any(df[i, 2:3] %in% unlist(df[1:(i-1), 2:3])),
                     period[i],       # if so, repeat the terminal value
                     period[i] + 1))  # else append the terminal value + 1
}
# rebuild df with x and period, which loses its extra initializing value here
df <- data.frame(x = df$x, period = period[2:length(period)])
The resulting df:
> df
x period
1 1 1
2 1 1
3 1/2 1
4 2 1
5 3 2
6 4 3
7 5/4 3
8 5 3
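As a footnote (not part of the original answer), the separated x1/x2 columns also allow a fully vectorized computation, assuming the batch labels are numeric and a new period starts whenever a row's smallest batch is larger than every batch seen so far:
library(tidyr)
df2 <- separate(data.frame(x), x, c("x1", "x2"), sep = "/",
                remove = FALSE, convert = TRUE)
lo <- pmin(df2$x1, df2$x2, na.rm = TRUE)           # smallest batch on each row
hi <- cummax(pmax(df2$x1, df2$x2, na.rm = TRUE))   # largest batch seen so far
df2$period <- cumsum(c(1, lo[-1] > hi[-length(hi)]))
df2[, c("x", "period")]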

How can I loop a data matrix in R?

I am trying to loop a data matrix for each separate ID tag, "1", "2" and "3" (see my data at the bottom). Ultimately I am doing this to transform the X and Y coordinates into a time series with the ts() function, but first I need to build a loop into the function that returns a time series for each separate ID. The looping itself works perfectly fine when I use the following code for a data frame:
for (i in 1:3) {
  print(na.omit(xyframe[ID == i, ]))
}
Returning the following output:
Timestamp X Y ID
1. 0 -34.012 3.406 1
2. 100 -33.995 3.415 1
3. 200 -33.994 3.427 1
Timestamp X Y ID
4. 0 -34.093 3.476 2
5. 100 -34.145 3.492 2
6. 200 -34.195 3.506 2
Timestamp X Y ID
7. 0 -34.289 3.522 3
8. 100 -34.300 3.520 3
9. 200 -34.303 3.517 3
Yet, when I want to produce a loop in a matrix with the same code:
for (i in 1:3) {
  print(na.omit(xymatrix[ID == i, ]))
}
It returns the following error:
Error in print(na.omit(xymatrix[ID == i, ])) :
(subscript) logical subscript too long
Why does looping over ID work for the data frame but not for the matrix, and how can I fix it?
Furthermore, I have read that looping requires much more computational power than doing the same thing vectorized; would there be a way to do this vector-based?
The data (simplification of the real data):
Timestamp X Y ID
1. 0 -34.012 3.406 1
2. 100 -33.995 3.415 1
3. 200 -33.994 3.427 1
4. 0 -34.093 3.476 2
5. 100 -34.145 3.492 2
6. 200 -34.195 3.506 2
7. 0 -34.289 3.522 3
8. 100 -34.300 3.520 3
9. 200 -34.303 3.517 3
The format xymatrix[ID == i, ] doesn't work for a matrix. Try it this way:
for (i in 1:3) {
  print(na.omit(xymatrix[xymatrix[, 'ID'] == i, ]))
}
In general, if you want to apply a function to a data frame, split by some factor, then you should be using one of the apply family of functions in combination with split.
Here's some reproducible sample data.
n <- 20
some_data <- data.frame(
  x   = sample(c(1:5, NA), n, replace = TRUE),
  y   = sample(c(letters[1:5], NA), n, replace = TRUE),
  grp = gl(3, 1, length = n)
)
If you want to print out the rows with no missing values, split by each ID level, then you want something like this.
lapply(split(some_data, some_data$grp), na.omit)
or more concisely using the plyr package.
library(plyr)
dlply(some_data, .(grp), na.omit)
Both methods return output like this
# $`1`
# x y grp
# 1 2 d 1
# 4 3 e 1
# 7 3 c 1
# 10 4 a 1
# 13 2 e 1
# 16 3 a 1
# 19 1 d 1
# $`2`
# x y grp
# 2 1 e 2
# 5 3 e 2
# 8 3 b 2
# $`3`
# x y grp
# 6 3 c 3
# 9 5 a 3
# 12 2 c 3
# 15 2 d 3
# 18 4 a 3
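Since the stated goal was a ts() object per ID, the last step could look like the sketch below (not from the original answers; the frequency = 10 value is an assumption based on the 100 ms timestamps):
# split the X/Y columns by ID and turn each piece into a multivariate time series
xy_split <- split(xyframe[, c("X", "Y")], xyframe$ID)
ts_list  <- lapply(xy_split, ts, start = 0, frequency = 10)
ts_list[["1"]]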

How to assign a counter to a specific subset of a data.frame which is defined by a factor combination?

My question is: I have a data frame with some factor variables. I now want to assign a new vector to this data frame, which creates an index for each subset of those factor variables.
data <- data.frame(fac1 = factor(rep(1:2, 5)), fac2 = sample(letters[1:3], 10, rep = TRUE))
Gives me something like:
fac1 fac2
1 1 a
2 2 c
3 1 b
4 2 a
5 1 c
6 2 b
7 1 a
8 2 a
9 1 b
10 2 c
And what I want is a combination counter which counts the occurrence of each factor combination, like this:
fac1 fac2 counter
1 1 a 1
2 2 c 1
3 1 b 1
4 2 a 1
5 1 c 1
6 2 b 1
7 1 a 2
8 2 a 2
9 1 b 2
10 1 a 3
So far I thought about using tapply to get the counter over all factor-combinations, which works fine
counter <- tapply(data$fac1, list(data$fac1, data$fac2), function(x) 1:length(x))
But I do not know how I can assign the counter list (e.g. unlisted) to the combinations in the data-frame without using inefficient looping :)
This is a job for the ave() function:
# Use set.seed for reproducible examples
# when random number generation is involved
set.seed(1)
myDF <- data.frame(fac1 = factor(rep(1:2, 7)),
                   fac2 = sample(letters[1:3], 14, replace = TRUE),
                   stringsAsFactors = FALSE)
myDF$counter <- ave(myDF$fac2, myDF$fac1, myDF$fac2, FUN = seq_along)
myDF
# fac1 fac2 counter
# 1 1 a 1
# 2 2 b 1
# 3 1 b 1
# 4 2 c 1
# 5 1 a 2
# 6 2 c 2
# 7 1 c 1
# 8 2 b 2
# 9 1 b 2
# 10 2 a 1
# 11 1 a 3
# 12 2 a 2
# 13 1 c 2
# 14 2 b 3
Note the use of stringsAsFactors=FALSE in the data.frame() step. If you didn't have that, you can still get the output with: myDF$counter <- ave(as.character(myDF$fac2), myDF$fac1, myDF$fac2, FUN = seq_along).
A data.table solution
library(data.table)
DT <- data.table(data)
DT[, counter := seq_len(.N), by = list(fac1, fac2)]
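For comparison (not one of the original answers), here is the same grouped counter written as a dplyr sketch, assuming the data frame from the question:
library(dplyr)
data %>%
  group_by(fac1, fac2) %>%
  mutate(counter = row_number()) %>%   # 1, 2, 3, ... within each fac1/fac2 group
  ungroup()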
This is a base R way that avoids (explicit) looping.
data$counter <- with(data, {
  inter <- as.character(interaction(fac1, fac2))
  names(inter) <- seq_along(inter)
  inter.ordered <- inter[order(inter)]
  counter <- with(rle(inter.ordered), unlist(sapply(lengths, sequence)))
  counter[match(names(inter), names(inter.ordered))]
})
Here is a variant with a little looping (I have renamed your variable to "x" since "data" is used otherwise):
x <- data.frame(fac1 = rep(1:2, 5), fac2 = sample(letters[1:3], 10, rep = TRUE))
x$fac3 <- paste(x$fac1, x$fac2, sep = "")
x$ctr <- 1
y <- table(x$fac3)
for (i in 1:length(rownames(y)))
  x$ctr[x$fac3 == rownames(y)[i]] <- 1:length(x$ctr[x$fac3 == rownames(y)[i]])
x <- x[-3]
No idea whether this is efficient over a large data.frame but it works!

Count and label observations per participant using loop

I have repeated-measures data.
I need to create a loop that will incrementally count each observation, within a participant, and label it.
I am new to writing loops. My logic was to say, for each item in the list of unique ids, count each row in that, and apply some function to that row.
Could someone point out what I am doing wrong?
data$Ob <- 0
for (i in unique(data$id)) {
  count <- 1
  for (u in data[data$id == i, ]) {
    data[data$id == u, ]$Ob <- count
    count <- count + 1
    print(count)
  }
}
Thanks!
Justin
You can also use ave:
set.seed(1)
data <- data.frame(id = sample(4, 10, TRUE))
data$Ob = ave(data$id, data$id, FUN=seq_along)
data
id Ob
1 2 1
2 2 2
3 3 1
4 4 1
5 1 1
6 4 2
7 4 3
8 3 2
9 3 3
10 1 2
# Generate some dummy data
data <- data.frame(Ob = 0, id = sample(4, 20, TRUE))
# Go through every id value
for (i in unique(data$id)) {
  # Label observations
  data$Ob[data$id == i] = 1:sum(data$id == i)
}
Be aware though that for loops are notoriously slow in R. In this simple case they work fine, but should you have millions and millions of rows in your data frame you'd better do something purely vectorized.
But you don't need a loop...
data <- data.frame (id = sample (4, 10, TRUE))
## id
## 1 3
## 2 4
## 3 1
## 4 3
## 5 3
## 6 4
## 7 2
## 8 1
## 9 1
## 10 4
data$Ob [order (data$id)] <- sequence (table (data$id))
## id Ob
## 1 3 1
## 2 4 1
## 3 1 1
## 4 3 2
## 5 3 3
## 6 4 2
## 7 2 1
## 8 1 2
## 9 1 3
## 10 4 3
(works also with character or factor IDs)
(isn't R just cool!?)
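And for completeness (not part of the original answers), the same per-id counter with data.table, mirroring the earlier threads:
library(data.table)
setDT(data)                          # convert the data frame by reference
data[, Ob := seq_len(.N), by = id]   # 1..N within each id, in row order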
