Removing specific rows from a data frame within a loop in R

How can I save the changes made to a data frame within a loop so that they persist across iterations? With the following code, I want to delete from my data frame (df) the rows whose V1 value equals v. The code runs without errors, but the problem is that, in the end, only the last value of i in the loop affects the data frame!
for (i in 1:50) {
  v <- i + 450
  # fn$sqldf() is from the sqldf package; fn$ substitutes '$v' into the query
  temp <- fn$sqldf("select count(V1) from df where V1 = '$v'")
  if (temp[1, ] < 1000) {
    g <- temp[1, ]
    c <- v
    print(paste("Topic number: ", c, "is less than 1000, with ", g, "records"))
    new_df <- df[df$V1 != v, ]  # rebuilt from the unmodified df every time (see below)
  }
}
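The immediate problem is the last line: every iteration filters the original, unmodified df and overwrites new_df, so earlier deletions are thrown away. A minimal sketch of the fix is to assign the filtered result back to df itself, so the deletions accumulate:
for (i in 1:50) {
  v <- i + 450
  temp <- fn$sqldf("select count(V1) from df where V1 = '$v'")
  if (temp[1, ] < 1000) {
    print(paste("Topic number:", v, "is less than 1000, with", temp[1, ], "records"))
    df <- df[df$V1 != v, ]  # overwrite df so the next iteration sees the deletion
  }
}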

A more idiomatic R way would be:
reduced <- subset(df, V1 > 450 & V1 <= 500)        # just the topics the loop touches (451..500)
count <- table(reduced$V1)                         # records per topic
V1small <- as.integer(names(count)[count < 1000])  # topics with fewer than 1000 records
filtered <- subset(df, !(V1 %in% V1small))         # drop those topics, keep everything else
If you'd rather continue with an SQL-centric perspective, your problem appears to be that you generate file3 anew on each iteration (I have to guess what file_new is). You could set up a keep-flag for each row prior to the loop:
V1OK <- rep(TRUE, nrow(df))
and update it within the loop (inside the if, for each undersized topic v) with
V1OK <- V1OK & df$V1 != v
and after the loop you could access
file_new[V1OK, ]
(Starting the flags at TRUE and combining with & means a row survives only if it matches none of the undersized topics.)
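Putting the flag approach together with the question's loop, as a sketch using the same objects:
V1OK <- rep(TRUE, nrow(df))
for (i in 1:50) {
  v <- i + 450
  temp <- fn$sqldf("select count(V1) from df where V1 = '$v'")
  if (temp[1, ] < 1000) V1OK <- V1OK & df$V1 != v  # flag the undersized topic's rows for removal
}
new_df <- df[V1OK, ]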

Related

How do I save a single column of data produced from a while loop in R to a dataframe?

I have written the following very simple while loop in R.
i=1
while (i <= 5) {
print(10*i)
i = i+1
}
I would like to save the results to a dataframe that will be a single column of data. How can this be done?
You may try this (if you want to stick with while):
df1 <- c()
i <- 1
while (i <= 5) {
  print(10 * i)
  df1 <- c(df1, 10 * i)  # append one value per iteration
  i <- i + 1
}
as.data.frame(df1)
  df1
1  10
2  20
3  30
4  40
5  50
Or
df1 <- data.frame()
i <- 1
while (i <= 5) {
  df1[i, 1] <- 10 * i  # write each value directly into row i
  i <- i + 1
}
df1
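Both versions grow df1 one element per iteration, which copies the object every time. That is harmless for 5 values, but for long loops preallocating the result first is noticeably faster (a sketch, not part of the original answer):
n <- 5
out <- numeric(n)  # preallocate the full length up front
i <- 1
while (i <= n) {
  out[i] <- 10 * i  # write into an existing slot instead of growing
  i <- i + 1
}
data.frame(result = out)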
If you already have a data frame (let's call it dat), you can create a new, empty column in the data frame, and then assign each value to that column by its row number:
# Make a data frame with column `x`
n <- 5
dat <- data.frame(x = 1:n)
# Fill the column `y` with the "missing value" `NA`
dat$y <- NA
# Run your loop, assigning values back to `y`
i <- 1
while (i <= 5) {
  result <- 10 * i
  print(result)
  dat$y[i] <- result
  i <- i + 1
}
Of course, in R we rarely need to write loops like this. Normally, we use vectorized operations to carry out such tasks faster and more succinctly:
n <- 5
dat <- data.frame(x = 1:n)
# Same result as your loop
dat$y <- 10 * (1:n)
Also note that, if you really did need a loop instead of a vectorized operation, that particular while loop could equally be written as a for loop, for example:
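for (i in 1:5) {
  result <- 10 * i
  print(result)
  dat$y[i] <- result
}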
I recommend consulting an introductory book or other guide to data manipulation in R. Data frames are very powerful, and using them well is an essential part of programming in R.

Replacing values in high-frequency data

I've got high-frequency data about durations. I've found out that I've got some faulty entries, which I cannot discard, that have 1800 * (a random number) added to them. Now I was stupid enough to try:
for (i in 1:21863924) { while (rr[i] >= 1800) { rr[i] <- rr[i] - 1800 } }
This obviously didn't finish, even though I left it running overnight. I was wondering if there is a more elegant way to do this, since subsetting the dataset to exclude the faulty entries works in a matter of seconds?
It can be done in a vectorized way. Create a logical vector
i1 <- rr >= 1800
Use that vector to subset the offending entries, subtract 1800, and assign the result back into the original vector
rr[i1] <- rr[i1] - 1800
A recursive function would be
f1 <- function(x, val) {
  i1 <- x >= val
  x[i1] <- x[i1] - val
  if (any(x >= val)) f1(x, val) else x  # recurse until no value is left to reduce
}
out <- f1(rr, val = 1800)
sum(out >= 1800)
#[1] 0
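Since both the loop and the recursion assume the corruption adds whole-number multiples of 1800, the same cleanup can also be written (under that assumption) as a single modulo step:
out2 <- rr %% 1800  # drops every full multiple of 1800; values below 1800 are unchanged
all.equal(out, out2)  # should be TRUE: the same result in one vectorized operation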
data
set.seed(24)
rr <- sample(20000, 100)

Trouble speeding up algorithm

I have made an algorithm in R to combine multiple sensor readings together under one timestamp.
Most sensor readings are taken every 500 ms, but some sensors only report changes. Therefore I had to make an algorithm that takes the last known value of a sensor at a given time.
Now the algorithm works, but it is so slow that when I start using it for the actual 20+ sensors it will take ages to complete. My hypothesis is that it is slow because of my use of data frames or the way I access and move my data.
I have tried making it faster by only walking through every dataframe once and not iterating over them for every timestamp. I have also preallocated all the space needed for the data.
Any suggestions would be very welcome. I am very new to the R language, so I don't really know which data types are slow and which are fast.
library(tidyverse)
library(tidytext)
library(stringr)
library(readr)
library(dplyr)
library(pracma)
# take a list of dataframes as a parameter
generalise_data <- function(dataframes, timeinterval) {
  if (typeof(dataframes) == "list") {
    # get the biggest and smallest datetime stamp from every dataframe
    # this will be used to calculate the size of the resulting frame
    # ((largest time - smallest time)/1000 = dataframe rows), i.e. one value every second
    largest_time <- 0
    smallest_time <- as.numeric(Sys.time()) * 1000  # everything will be smaller than the current time
    for (i in 1:length(dataframes)) {
      dataframe_max <- max(dataframes[[i]]$TIMESTAMP)
      dataframe_min <- min(dataframes[[i]]$TIMESTAMP)
      if (dataframe_max > largest_time) largest_time <- dataframe_max
      if (dataframe_min < smallest_time) smallest_time <- dataframe_min
    }
    # result dataframe will have ... rows
    result.size <- floor((largest_time - smallest_time) / timeinterval)
    message(sprintf("Result size: %i", result.size))
    # a numeric vector that holds the current index into every dataframe, all set to 1
    dataframe_indexes <- rep(1, length(dataframes))
    # data vectors for the dataframe
    result.timestamps <- numeric(result.size)
    result <- list(result.timestamps)
    for (i in 2:(length(dataframes) + 1)) result[[i]] <- numeric(result.size)  # an empty vector for every datapoint
    # use progressbar
    pb <- txtProgressBar(1, result.size, style = 3)
    # run through every row of the resulting data frame (creating a row per pass);
    # each pass, advance every dataframe's index until its timestamp exceeds the
    # result row's timestamp, then step one index back
    for (i in 1:result.size) {
      current_timestamp <- smallest_time + timeinterval * (i - 1)
      result[[1]][i] <- current_timestamp
      for (i2 in 1:length(dataframes)) {
        while (dataframes[[i2]]$TIMESTAMP[dataframe_indexes[i2]] < current_timestamp &&
               dataframes[[i2]]$TIMESTAMP[dataframe_indexes[i2]] != max(dataframes[[i2]]$TIMESTAMP)) {
          dataframe_indexes[i2] <- dataframe_indexes[i2] + 1
        }
        if (dataframe_indexes[i2] > 1) {
          dataframe_indexes[i2] <- dataframe_indexes[i2] - 1  # take the one that's smaller
        }
        result[[i2 + 1]][i] <- dataframes[[i2]]$VALUE[dataframe_indexes[i2]]
      }
      setTxtProgressBar(pb, i)
    }
    close(pb)
    result.final <- data.frame(result)
    return(result.final)
  } else {
    return(NA)
  }
}
I fixed it today by changing every data frame to a matrix. The code ran in 9.5 seconds instead of 70 minutes.
Conclusion: element-wise access into data frames inside tight loops is VERY slow; matrices are far faster for this access pattern.
library(tidyverse)
library(tidytext)
library(stringr)
library(readr)
library(dplyr)
library(pracma)
library(compiler)
# take a list of dataframes as a parameter
generalise_data <- function(dataframes, timeinterval) {
  time.start <- Sys.time()
  if (typeof(dataframes) == "list") {
    # store the sizes of all the dataframes
    resources.largest_size <- 0
    resources.sizes <- numeric(length(dataframes))
    for (i in 1:length(dataframes)) {
      resources.sizes[i] <- length(dataframes[[i]]$VALUE)
      if (resources.sizes[i] > resources.largest_size) resources.largest_size <- resources.sizes[i]
    }
    # generate a matrix that can hold all needed dataframe values
    resources <- matrix(nrow = resources.largest_size, ncol = length(dataframes) * 2)
    for (i in 1:length(dataframes)) {
      j <- i * 2
      resources[1:resources.sizes[i], j - 1] <- dataframes[[i]]$TIMESTAMP
      resources[1:resources.sizes[i], j] <- dataframes[[i]]$VALUE
    }
    # get the biggest and smallest datetime stamp from every dataframe
    # this will be used to calculate the size of the resulting frame
    # ((largest time - smallest time)/1000 = dataframe rows), i.e. one value every second
    largest_time <- 0
    smallest_time <- as.numeric(Sys.time()) * 1000  # everything will be smaller than the current time
    for (i in 1:length(dataframes)) {
      dataframe_max <- max(dataframes[[i]]$TIMESTAMP)
      dataframe_min <- min(dataframes[[i]]$TIMESTAMP)
      if (dataframe_max > largest_time) largest_time <- dataframe_max
      if (dataframe_min < smallest_time) smallest_time <- dataframe_min
    }
    # result dataframe will have ... rows
    result.size <- floor((largest_time - smallest_time) / timeinterval)
    message(sprintf("Result size: %i", result.size))
    # a numeric vector that holds the current index into every dataframe, all set to 1
    dataframe_indexes <- rep(1, length(dataframes))
    # data matrix for the result
    result <- matrix(data = 0, nrow = result.size, ncol = length(dataframes) + 1)
    # use progressbar
    pb <- txtProgressBar(1, result.size, style = 3)
    # run through every row of the resulting data frame (creating a row per pass);
    # each pass, advance every dataframe's index until its timestamp exceeds the
    # result row's timestamp, then step one index back
    for (i in 1:result.size) {
      current_timestamp <- smallest_time + timeinterval * (i - 1)
      result[i, 1] <- current_timestamp
      for (i2 in 1:length(dataframes)) {
        j <- i2 * 2
        # stop advancing once the last stored row for this sensor is reached
        while (resources[dataframe_indexes[i2], j - 1] < current_timestamp &&
               dataframe_indexes[i2] < resources.sizes[i2]) {
          dataframe_indexes[i2] <- dataframe_indexes[i2] + 1
        }
        # at the moment the last value of the array is never selected, needs to be fixed
        if (dataframe_indexes[i2] > 1) {
          dataframe_indexes[i2] <- dataframe_indexes[i2] - 1  # take the one that's smaller
        }
        result[i, i2 + 1] <- resources[dataframe_indexes[i2], j]
      }
      setTxtProgressBar(pb, i)
    }
    close(pb)
    result.final <- data.frame(result)
    time.end <- Sys.time()
    print(time.end - time.start)
    return(result.final)
  } else {
    return(NA)
  }
}
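For reference, the per-row scanning can be avoided altogether: base R's findInterval() returns, for each grid timestamp, the index of the last reading at or before it, which is exactly the "last known value" lookup. A vectorized sketch (not from the original post), assuming each data frame's TIMESTAMP column is sorted in ascending order:
generalise_data_vec <- function(dataframes, timeinterval) {
  if (typeof(dataframes) != "list") return(NA)
  all_ts <- unlist(lapply(dataframes, function(d) d$TIMESTAMP))
  grid <- seq(min(all_ts), max(all_ts), by = timeinterval)
  cols <- lapply(dataframes, function(d) {
    idx <- findInterval(grid, d$TIMESTAMP)  # index of last reading <= each grid time
    d$VALUE[pmax(idx, 1)]                   # before the first reading, reuse the first value
  })
  names(cols) <- paste0("V", seq_along(cols))
  data.frame(TIMESTAMP = grid, cols)
}
On sorted data this replaces both inner loops with one findInterval() call per sensor.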

Using R to loop through vector and copy some sequences to data.frame

I want to search through a vector for the sequence of strings "hello" "world". When I find this sequence, I want to copy it, including the 10 elements before and after, as a row in a data.frame to which I'll apply further analysis.
My problem: I get an error "new column would leave holes after existing columns". I'm new to coding, so I'm not sure how to manipulate data.frames. Maybe I need to create rows in the loop?
This is what I have:
df <- data.frame()
i <- 1
for (n in 1:length(v)) {
  if (v[n] == 'hello' & v[n+1] == 'world') {
    # note: n-11:n+11 parses as n - (11:n) + 11, not (n-11):(n+11),
    # which produces scattered column indices and hence the "holes" error;
    # the same applies to v[n-10:n+11]
    df[i, n-11:n+11] <- v[n-10:n+11]
    i <- i + 1
  }
}
Thanks!
Maybe this helps:
# positions where 'hello' is immediately followed by 'world'
indx <- which(v1[-length(v1)] == 'hello' & v1[-1] == 'world')
# for each match, take the window 10 before to 11 after, clipped to v1's bounds
lst <- Map(function(x, y) {
  s1 <- seq(x, y)
  v1[s1[s1 > 0 & s1 <= length(v1)]]
}, indx - 10, indx + 11)
# pad the windows to a common length and bind them into data.frame rows
len <- max(sapply(lst, length))
d1 <- as.data.frame(do.call(rbind, lapply(lst, `length<-`, len)))
data
set.seed(496)
v1 <- sample(c(letters[1:3], 'hello', 'world'), 100, replace=TRUE)

Efficiency of transforming counts to percentages and index scores

I currently have the following code, which produces the desired results (Data_Index and Data_Percentages):
Input_Data <- read.csv("http://dl.dropbox.com/u/881843/RPubsData/gd/2010_pop_estimates.csv", row.names=1, stringsAsFactors = FALSE)
Input_Data <- data.frame(head(Input_Data))
Rows <- nrow(Input_Data)
Vars <- ncol(Input_Data) - 1
#Total population column
TotalCount <- Input_Data[1]
#Total population sum
TotalCountSum <- sum(TotalCount)
Input_Data[1] <- NULL
VarNames <- colnames(Input_Data)
Data_Per_Row <- c()
Data_Index_Row <- c()
for (i in 1:Rows) {
  # Proportion of all areas' population found in this row
  OAPer <- TotalCount[i, ] / TotalCountSum * 100
  Data_Per_Col <- c()
  Data_Index_Col <- c()
  for (u in 1:Vars) {
    # For every column value in the selected row,
    # the percentage of that value compared to the
    # total population (TotalCount) for that row is calculated
    VarPer <- Input_Data[i, u] / TotalCount[i, ] * 100
    # Once the percentage is calculated, the index
    # score is calculated by dividing this percentage
    # by the proportion of the total population in that
    # area compared to all areas
    VarIndex <- VarPer / OAPer * 100
    # Binds results for all columns in the row
    Data_Per_Col <- cbind(Data_Per_Col, VarPer)
    Data_Index_Col <- cbind(Data_Index_Col, VarIndex)
  }
  # Binds results for the completed row with previously completed rows
  Data_Per_Row <- rbind(Data_Per_Row, Data_Per_Col)
  Data_Index_Row <- rbind(Data_Index_Row, Data_Index_Col)
}
colnames(Data_Per_Row) <- VarNames
colnames(Data_Index_Row) <- VarNames
# Changes the index scores to range from -1 to 1
OldRange <- (max(Data_Index_Row) - min(Data_Index_Row))
NewRange <- (1 - -1)
Data_Index <- (((Data_Index_Row - min(Data_Index_Row)) * NewRange) / OldRange) + -1
Data_Percentages <- Data_Per_Row
# Final outputs
Data_Index
Data_Percentages
The problem I have is that the code is very slow. I want to be able to use it on a dataset that has 200,000 rows and 200 columns (which, using the code as it stands, would take around 4 days). I am sure there must be a way of speeding this process up, but I am not sure how exactly.
What the code is doing is taking (in this example) a population counts table, divided into age bands and different areas, and turning it into percentages and index scores. Currently there are 2 nested loops, so every value in every row and column is selected individually and has calculations performed on it. I assume it is these loops that make it run slowly; are there any alternatives that produce the same results, but quicker? Thanks for any help you can offer.
This is your entire code. The for-loops are not necessary, and neither is apply: the division can be done on whole matrices at once.
df <- Input_Data
total_count <- df[, 1]
total_sum <- sum(total_count)
df <- df[, -1]
# equivalent of your for-loop
oa_per <- total_count / total_sum * 100
# each row of df divided by that row's total (ncol(df) generalises the column count)
Data_Per_Row <- df / matrix(rep(total_count, each = ncol(df)), ncol = ncol(df), byrow = TRUE) * 100
Data_Index_Row <- Data_Per_Row / oa_per * 100
names(Data_Per_Row) <- names(Data_Index_Row) <- names(df)
# rest of your code: identical
OldRange <- max(Data_Index_Row) - min(Data_Index_Row)
NewRange <- 1 - -1
Data_Index <- (((Data_Index_Row - min(Data_Index_Row)) * NewRange) / OldRange) + -1
Data_Percentages <- Data_Per_Row
Get rid of the "i" loop; use apply to calculate OAPer:
OAPer <- apply(TotalCount, 1,
               function(x, tcs) x / tcs * 100,
               tcs = TotalCountSum)
Likewise, you can vectorize the work inside the "u" loop, as sketched below. (Some comments in your code would also be appreciated.)
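A sketch of that vectorization (not part of the original answer), relying on the fact that dividing a data frame by a vector recycles the vector down each column; the names come from the question's code:
# percentage of each value relative to its row's total population
VarPer_all <- Input_Data / TotalCount[, 1] * 100
# index score: each row's percentages divided by that row's OAPer
VarIndex_all <- VarPer_all / OAPer * 100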
