Slow population of a dataframe - r

I'm still very new to R and am seeing very slow load times when populating a data frame.
For each row in my dataset, I want to add rows to the data frame, with the number of rows given by the value in column $population.
It should end up with around 700,000 rows, but after 10 minutes of processing it has only loaded about 77,000, which seems really slow.
Code is below:
df <- data.frame(Ints=integer())
for(i in 1:nrow(popDemo)) {
row <- popDemo[i,]
# Use a while value to loop
j <- 1
while (j <= row$population) {
df[nrow(df) + 1,] <- row$age
j = j+1
}
}
Any guidance greatly appreciated
Thanks

Starting with a simple popDemo,
popDemo <- data.frame(population=c(3,5), age=c(1,10))
Your code produces
df <- data.frame(Ints=integer())
for (i in 1:nrow(popDemo)) {
row <- popDemo[i,]
# Use a while value to loop
j <- 1
while (j <= row$population) {
df[nrow(df) + 1,] <- row$age
j = j+1
}
}
df
# Ints
# 1 1
# 2 1
# 3 1
# 4 10
# 5 10
# 6 10
# 7 10
# 8 10
This can be done much faster in one step:
data.frame(Ints = rep(popDemo$age, times = popDemo$population))
# Ints
# 1 1
# 2 1
# 3 1
# 4 10
# 5 10
# 6 10
# 7 10
# 8 10
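As a quick sanity check, the sketch below compares an element-by-element loop with the rep() call on a small, made-up popDemo (the third row is invented for illustration). The loop is slow on real data largely because each iteration grows the object by one element, which makes R copy it again and again.

```r
# Made-up demographic table; the third row is added only for illustration
popDemo <- data.frame(population = c(3, 5, 2), age = c(1, 10, 20))

# Loop version: appends one value per iteration (the vector is copied each time)
loop_ages <- numeric(0)
for (i in seq_len(nrow(popDemo))) {
  for (j in seq_len(popDemo$population[i])) {
    loop_ages <- c(loop_ages, popDemo$age[i])
  }
}

# Vectorized version: repeat each age `population` times in a single call
vec_ages <- rep(popDemo$age, times = popDemo$population)

identical(loop_ages, vec_ages)  # TRUE
```

Both versions give the same ten values, but the rep() version allocates the result once.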
If you have more columns and want to repeat whole rows rather than a single column, you can index the data frame with repeated row numbers instead:
popDemo <- data.frame(population=c(3,5), age=c(1,10), ltr=c("a","b"))
popDemo[ rep(seq_len(nrow(popDemo)), times = popDemo$population), ]
# population age ltr
# 1 3 1 a
# 1.1 3 1 a
# 1.2 3 1 a
# 2 5 10 b
# 2.1 5 10 b
# 2.2 5 10 b
# 2.3 5 10 b
# 2.4 5 10 b
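If row names such as 1.1 and 2.4 are not wanted, they can be reset after the expansion. This is a minimal sketch on the same popDemo; the names idx and expanded are only illustrative.

```r
popDemo <- data.frame(population = c(3, 5), age = c(1, 10), ltr = c("a", "b"))

# Repeat each row index as many times as its population count
idx <- rep(seq_len(nrow(popDemo)), times = popDemo$population)
expanded <- popDemo[idx, ]

# Replace row names such as "1.1" with a plain 1..n sequence
rownames(expanded) <- NULL

nrow(expanded)      # 8
rownames(expanded)  # "1" "2" "3" "4" "5" "6" "7" "8"
```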

Related

How to skip iteration in for loop if condition is met

I have code to turn the upper triangle of a matrix into a vector and store the values from this vector along with their original coordinates from the matrix into a data frame.
How do I skip the for loop if the element in the vector is zero?
I have tried else statements and other attempts.
v <- matrix(sample(0:1, 81, replace = TRUE),9,9)
t <- v[upper.tri(v,diag=T)]
tful <- t[t!=0]
df <- data.frame(FP1=rep(0,length(t)),FP2=rep(0,length(t)),tanimoto=rep(0,length(t)))
for (i in 1:length(t)){
if (t[i]==0) next
else {
col_num <- floor(sqrt(2*i-7/4)+.5)
row_num <- i-(.5*col_num^2-.5*col_num+1)+1
df$FP1[i] <- row_num
df$FP2[i] <- col_num
df$tanimoto[i] <- v[row_num,col_num]
}
}
I don't want any zeros in my data frame, and I want the loop to skip these values.
I understand the data frame needs to have fewer rows, but I am using this as an example.
Your next is working fine to skip the current iteration of the loop.
You still get 0s in the final result because all values of df were initialized to 0. When you skip an iteration, those values are not changed, so they remain 0. If you change the initialization to NA values, you'll see that no 0s are added.
df <- data.frame(FP1=rep(NA,length(t)),FP2=rep(NA,length(t)),tanimoto=rep(NA,length(t)))
for (i in 1:length(t)){
if (t[i]==0) next
else {
col_num <- floor(sqrt(2*i-7/4)+.5)
row_num <- i-(.5*col_num^2-.5*col_num+1)+1
df$FP1[i] <- row_num
df$FP2[i] <- col_num
df$tanimoto[i] <- v[row_num,col_num]
}
}
df
# FP1 FP2 tanimoto
# 1 1 1 1
# 2 1 2 1
# 3 2 2 1
# 4 1 3 1
# 5 2 3 1
# 6 3 3 1
# 7 NA NA NA
# 8 2 4 1
# 9 3 4 1
# 10 4 4 1
# 11 NA NA NA
# ...
A simple modification would be to filter your data frame as a last step: df = df[df$tanimoto != 0, ], or if you switch to NA, df = na.omit(df).
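Another way to avoid the leftover rows is to apply the loop's index arithmetic only to the nonzero positions, so nothing needs filtering afterwards. The sketch below assumes a small fixed matrix in place of the random one, and uses tv instead of t to avoid shadowing base R's t().

```r
# Small fixed matrix standing in for the random 9 x 9 example
v <- matrix(c(1, 0, 1,
              0, 1, 0,
              1, 1, 0), nrow = 3, byrow = TRUE)

# Upper triangle (with diagonal), read column by column
tv <- v[upper.tri(v, diag = TRUE)]

# Positions of the nonzero entries within that vector
idx <- which(tv != 0)

# Same formulas as in the loop, applied to all positions at once
col_num <- floor(sqrt(2 * idx - 7 / 4) + 0.5)
row_num <- idx - (0.5 * col_num^2 - 0.5 * col_num + 1) + 1

df <- data.frame(FP1 = row_num, FP2 = col_num,
                 tanimoto = v[cbind(row_num, col_num)])
df
#   FP1 FP2 tanimoto
# 1   1   1        1
# 2   2   2        1
# 3   1   3        1
```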
We could also create a non-looping solution:
v1 = v != 0
df2 = data.frame(FP1 = row(v)[v1], FP2 = col(v)[v1], tanimoto = v[v1])
df2 = subset(df2, FP1 <= FP2)
df2
# FP1 FP2 tanimoto
# 1 1 1 1
# 7 1 2 1
# 8 2 2 1
# 13 1 3 1
# 14 2 3 1
# 15 3 3 1
# 20 2 4 1
# 21 3 4 1
# 22 4 4 1
# 27 3 5 1
# 28 4 5 1
# 29 5 5 1
# 33 1 6 1
# 34 4 6 1
# 35 5 6 1
# ...
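A variation on the same idea, sketched here on a small fixed matrix rather than the random one, is to combine the upper-triangle mask with the nonzero test and let which() return the coordinates directly. The names keep, pos and df3 are only illustrative.

```r
v <- matrix(c(1, 0, 1,
              0, 1, 0,
              1, 1, 0), nrow = 3, byrow = TRUE)

# TRUE only for nonzero cells on or above the diagonal
keep <- upper.tri(v, diag = TRUE) & v != 0

# arr.ind = TRUE returns a matrix with one (row, col) pair per TRUE cell
pos <- which(keep, arr.ind = TRUE)

df3 <- data.frame(FP1 = pos[, 1], FP2 = pos[, 2], tanimoto = v[keep])
nrow(df3)  # 3
```

Because the mask is applied first, no subset() step is needed afterwards.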

Find unique set of strings in vector where vector elements can be multiple strings

I have a series of batch records that are labeled sequentially. Sometimes batches overlap.
x <- c("1","1","1/2","2","3","4","5/4","5")
> data.frame(x)
x
1 1
2 1
3 1/2
4 2
5 3
6 4
7 5/4
8 5
I want to find the set of batches that are not overlapping and label those periods. Batch "1/2" includes both "1" and "2" so it is not unique. When batch = "3" that is not contained in any previous batches, so it starts a new period. I'm having difficulty dealing with the combined batches, otherwise this would be straightforward. The result of this would be:
x period
1 1 1
2 1 1
3 1/2 1
4 2 1
5 3 2
6 4 3
7 5/4 3
8 5 3
My experience is in more functional programming paradigms, so I know the way I did this is very un-R. I'm looking for the way to do this in R that is clean and simple. Any help is appreciated.
Here's my un-R code that works, but is super clunky and not extensible.
x <- c("1","1","1/2","2","3","4","5/4","5")
p <- 1 #period number
temp <- NULL #temp variable for storing cases of x (batches)
temp[1] <- x[1]
period <- NULL
rl <- 0 #length to repeat period
for (i in 1:length(x)){
#check for "/", split and add to temp
if (grepl("/", x[i])){
z <- strsplit(x[i], "/") #split character
z <- unlist(z) #convert to vector
temp <- c(temp, z, x[i]) #add to temp vector for comparison
}
#check if x in temp
if(x[i] %in% temp){
temp <- append(temp, x[i]) #add to search vector
rl <- rl + 1 #increase length
} else {
period <- append(period, rep(p, rl)) #add to period vector
p <- p + 1 #increase period count
temp <- NULL #reset
rl <- 1 #reset
}
}
#add last batch
rl <- length(x) - length(period)
period <- append(period, rep(p,rl))
df <- data.frame(x,period)
> df
x period
1 1 1
2 1 1
3 1/2 1
4 2 1
5 3 2
6 4 3
7 5/4 3
8 5 3
R has functional paradigm influences, so you can solve this with Map and Reduce. Note that this solution follows your approach in unioning seen values. A simpler approach is possible if you assume batch numbers are consecutive, as they are in your example.
x <- c("1","1","1/2","2","3","4","5/4","5")
s<-strsplit(x,"/")
r<-Reduce(union,s,init=list(),acc=TRUE)
p<-cumsum(Map(function(x,y) length(intersect(x,y))==0,s,r[-length(r)]))
data.frame(x,period=p)
x period
1 1 1
2 1 1
3 1/2 1
4 2 1
5 3 2
6 4 3
7 5/4 3
8 5 3
What this does is first calculate a cumulative union of seen values. Then, it maps across this to determine the places where none of the current values have been seen before. (Alternatively, this second step could be included within the reduce, but this would be wordier without support for destructuring.) The cumulative sum provides the "period" numbers based on the number of times the intersections have come up empty.
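To make the intermediate steps visible, here is the same idea spelled out with named pieces. It is only a sketch: the names seen and is_new are illustrative, and it starts the accumulation from character(0) rather than list().

```r
x <- c("1", "1", "1/2", "2", "3", "4", "5/4", "5")
s <- strsplit(x, "/")

# Cumulative union of batch ids; element k holds everything seen before row k
seen <- Reduce(union, s, init = character(0), accumulate = TRUE)

# A row opens a new period when it shares no id with what was seen before it
is_new <- mapply(function(cur, prev) length(intersect(cur, prev)) == 0,
                 s, seen[-length(seen)])

# Counting the TRUE values so far gives the period number
period <- cumsum(is_new)
period
# [1] 1 1 1 1 2 3 3 3
```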
If you do make the assumption that the batch numbers are consecutive then you can do the following instead
x <- c("1","1","1/2","2","3","4","5/4","5")
s<-strsplit(x,"/")
n<-mapply(function(x) range(as.numeric(x)),s)
p<-cumsum(c(1,n[1,-1]>n[2,-ncol(n)]))
data.frame(x,period=p)
For the same result (not repeated here).
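The range-based version can be unpacked the same way. In this sketch (the names rng, lo, hi and starts_new are illustrative), a row starts a new period when its lowest batch number is above the highest batch number of the row before it.

```r
x <- c("1", "1", "1/2", "2", "3", "4", "5/4", "5")
s <- strsplit(x, "/")

# 2 x n matrix: first row holds each entry's minimum, second row its maximum
rng <- vapply(s, function(b) range(as.numeric(b)), numeric(2))
lo <- rng[1, ]
hi <- rng[2, ]

# The first row always starts period 1; later rows are compared with the previous row
starts_new <- c(TRUE, lo[-1] > hi[-length(hi)])
period <- cumsum(starts_new)
period
# [1] 1 1 1 1 2 3 3 3
```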
A little bit shorter:
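x <- c("1","1","1/2","2","3","4","5/4","5")
x <- data.frame(x = x, period = -1, stringsAsFactors = FALSE)
period <- 0
prevBatch <- -1
for (i in seq_len(nrow(x))) {
  # convert to numeric so batches compare as numbers, not as strings
  spl <- as.numeric(unlist(strsplit(x$x[i], "/")))
  currentBatch <- min(spl)
  if (currentBatch < prevBatch) stop("Error in sequence")
  # a batch beyond everything seen so far starts a new period
  if (currentBatch > prevBatch) period <- period + 1
  x$period[i] <- period
  prevBatch <- max(spl)
}
x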
Here's a twist on the original that uses tidyr to split the data into two columns so it's easier to use:
# sample data
x <- c("1","1","1/2","2","3","4","5/4","5")
df <- data.frame(x)
library(tidyr)
# separate x into two columns, with second NA if only one number
df <- separate(df, x, c('x1', 'x2'), sep = '/', remove = FALSE, convert = TRUE)
Now df looks like:
> df
x x1 x2
1 1 1 NA
2 1 1 NA
3 1/2 1 2
4 2 2 NA
5 3 3 NA
6 4 4 NA
7 5/4 5 4
8 5 5 NA
Now the loop can be a lot simpler:
period <- 1
for(i in 1:nrow(df)){
period <- c(period,
# test if either x1 or x2 of row i are in any x1 or x2 above it
ifelse(any(df[i, 2:3] %in% unlist(df[1:(i-1),2:3])),
period[i], # if so, repeat the terminal value
period[i] + 1)) # else append the terminal value + 1
}
# rebuild df with x and period, which loses its extra initializing value here
df <- data.frame(x = df$x, period = period[2:length(period)])
The resulting df:
> df
x period
1 1 1
2 1 1
3 1/2 1
4 2 1
5 3 2
6 4 3
7 5/4 3
8 5 3

multiplication in matrix R

Here is what I want to do (it is clear and easy in Excel, but I have a problem doing it in R):
Column A 1 2 3 4 5
Column B 0 9 2 1 7
That's my real "algorithm":
Column C
(first value) = mean(Column A) = 3
(second value) = ((mean(Column A)*4) + 0)/5 = 2,4
(third value) = ((second value*4) + 9)/5 = 3,72
etc.
So we have:
# A B C
# 1 1 0 3
# 2 2 9 2,4
# 3 3 2 3,72
# 4 4 1 3,37
# 5 5 7 2,90
This is my actual code, with your suggestion:
a <- c(1:5)
b <- c(0,9,0,1,7,0)
matrix <- data.frame(A=a,B=b)
matrix <- c(mean(matrix$A), (cumsum(matrix$B) + (mean(matrix$A)*4))/5)
The result is 2.4 4.2 4.2 4.4 5.8, which is wrong.
R also gives me the error "replacement has 6 rows, data has 5", but that isn't relevant here. I only want to know how I should do this.
You could use ?cumsum:
a <- 1:5
b <- c(0, 9, 2, 1, 7)
mean(a) + cumsum(b)
# [1] 3 12 14 15 22
UPDATE:
It seems you want to run a (weighted) moving average. Maybe you should have a look at the TTR package.
Please find an easy approach below:
wma <- function(b, startValue, a=4/5) {
m <- double(length(b)+1)
m[1] <- startValue
for (i in seq(along=b)) {
m[i+1] <- a * m[i] + (1-a) * b[i]
}
return(m)
}
wma(b, mean(a))
# [1] 3.00000 2.40000 3.72000 3.37600 2.90080 3.72064
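The same recursion can also be written without an explicit loop by using Reduce() with accumulate = TRUE. This is only a sketch of an equivalent formulation; the name w stands for the 4/5 weight used in wma() above.

```r
a <- 1:5
b <- c(0, 9, 2, 1, 7)
w <- 4 / 5  # weight on the previous value, as in wma() above

# Each step combines the previous result with the next element of b
res <- Reduce(function(prev, x) w * prev + (1 - w) * x,
              b, init = mean(a), accumulate = TRUE)
res
# [1] 3.00000 2.40000 3.72000 3.37600 2.90080 3.72064

# The first length(b) values correspond to column C in the question
head(res, length(b))
```

The result has one more element than b because the start value is included, so head() trims it to the five rows of the data frame.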
This solves your issue:
mydf<-data.frame(A=1:5, B=c(0,9,2,1,7))
mydf$C<-cumsum(mydf$B)+mean(mydf$A)
mydf
# A B C
# 1 1 0 3
# 2 2 9 12
# 3 3 2 14
# 4 4 1 15
# 5 5 7 22
Hope it helps.

Efficient way to count the change of values between 2 or more matrix or vectors

I am checking the changes that occur between different datasets. For now I am using a simple loop that gives me the count for each change. The datasets are numeric (a sequence of numbers), and I count how many times each change occurs (for example, 1 changed to 5 XX times):
n=100
tmp1<-sample(1:25, n, replace=T)
tmp2<-sample(1:25, n, replace=T)
values_tmp1=sort(unique(tmp1))
values_tmp2=sort(unique(tmp2))
count=c()
i=1
for (m in 1:length(values_tmp1)){
for (j in 1:length(values_tmp2)){
count[i]=length(which(tmp1==values_tmp1[m] & tmp2==values_tmp2[j]))
i=i+1
}
}
However, my real data is much bigger, with n = 2000000, and the loop gets extremely slow.
Can anyone help me improve this calculation?
Like this?
tmp1 <- c(1:5,3)
tmp2 <- c(1,3,3,1,5,3)
aggregate(tmp1,list(tmp1,tmp2),length)
# Group.1 Group.2 x
# 1 1 1 1
# 2 4 1 1
# 3 2 3 1
# 4 3 3 2
# 5 5 5 1
This might be faster for a big dataset:
library(data.table)
DT <- data.table(cbind(tmp1,tmp2),key=c("tmp1","tmp2"))
DT[,.N,by=key(DT)]
# tmp1 tmp2 N
# 1: 1 1 1
# 2: 2 3 1
# 3: 3 3 2
# 4: 4 1 1
# 5: 5 5 1
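Base R's table() gives the same counts without extra packages. This is a sketch on the small example above; note that table() builds the full cross-tabulation, including pairs that never occur, so the zero rows are dropped afterwards.

```r
tmp1 <- c(1:5, 3)
tmp2 <- c(1, 3, 3, 1, 5, 3)

# Cross-tabulate the two vectors, then reshape to one row per (tmp1, tmp2) pair
counts <- as.data.frame(table(tmp1, tmp2), stringsAsFactors = FALSE)

# Keep only the changes that actually occur
counts <- counts[counts$Freq > 0, ]

nrow(counts)      # 5
sum(counts$Freq)  # 6
```

With only 25 distinct values per vector the full table stays small, so this should remain cheap even for long inputs.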

Count and label observations per participant using loop

I have repeated-measures data.
I need to create a loop that will incrementally count each observation, within a participant, and label it.
I am new to writing loops. My logic was to say, for each item in the list of unique ids, count each row in that, and apply some function to that row.
Could someone point out what I am doing wrong?
data$Ob <- 0
for (i in unique(data$id)) {
count <- 1
for (u in data[data$id == i,]) {
data[data$id ==u,]$Ob <- count
count <- count + 1
print(count)
}
}
Thanks!
Justin
You can also use ave:
set.seed(1)
data <- data.frame(id = sample(4, 10, TRUE))
data$Ob = ave(data$id, data$id, FUN=seq_along)
data
id Ob
1 2 1
2 2 2
3 3 1
4 4 1
5 1 1
6 4 2
7 4 3
8 3 2
9 3 3
10 1 2
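For ids that are not numeric, a small variation is to pass a row counter to ave() instead of the id column itself, so the result stays an integer. A sketch with made-up character ids:

```r
# Made-up participant ids, in arbitrary order
data <- data.frame(id = c("b", "a", "b", "c", "a", "b"))

# Number the rows within each id group, keeping the original row order
data$Ob <- ave(seq_along(data$id), data$id, FUN = seq_along)

data$Ob
# [1] 1 1 2 1 2 3
```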
# Generate some dummy data
data <- data.frame(Ob=0, id=sample(4,20,TRUE))
# Go through every id value
for(i in unique(data$id)){
# Label observations
data$Ob[data$id == i] = 1:sum(data$id == i)
}
Be aware, though, that for loops can be slow in R. In this simple case they work fine, but with millions of rows in your data frame you would be better off with something purely vectorized.
But you don't need a loop...
data <- data.frame (id = sample (4, 10, TRUE))
## id
## 1 3
## 2 4
## 3 1
## 4 3
## 5 3
## 6 4
## 7 2
## 8 1
## 9 1
## 10 4
data$Ob [order (data$id)] <- sequence (table (data$id))
## id Ob
## 1 3 1
## 2 4 1
## 3 1 1
## 4 3 2
## 5 3 3
## 6 4 2
## 7 2 1
## 8 1 2
## 9 1 3
## 10 4 3
(works also with character or factor IDs)
(isn't R just cool!?)
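A short check of the same approach with character ids, using made-up values. table() counts the ids in sorted order and order() lists the row positions in that same sorted order, which is why the two line up.

```r
# Made-up character ids
id <- c("b", "a", "b", "c", "a", "b")

ob <- integer(length(id))

# sequence() turns the per-id counts (a = 2, b = 3, c = 1) into 1,2,1,2,3,1;
# order(id) places those counters back on the matching rows
ob[order(id)] <- sequence(table(id))

ob
# [1] 1 1 2 1 2 3
```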

Resources