How to speedup for and if loop in R - r

In my current project, I have around 8.2 million rows. I want to scan for all rows and apply a certain function if the value of a specific column is not zero.
counter=1
for(i in 1:nrow(data)){
if(data[i,8]!=0){
totalclicks=sum(data$Clicks[counter:(i-1)])
test$Clicks[i]=totalclicks
counter=i
}
}
In the above code, I am searching for the specific column over 8.2 million rows and if values are not zero then I will calculate sum over values. The problem is that for and if loops are too slow. It takes 1 hour for 50K rows. I heard that apply family is alternative for this. The following code also takes too long:
sapply(1:nrow(data), function(x)
if(data[x,8]!=0){
totalclicks=sum(data$Clicks[counter:(x-1)])
test$Clicks[x]=totalclicks
counter=x
})
[Updated]
Kindly consider the following as sample dataset:
clicks revenue new_column (sum of previous clicks)
1 0
2 0
3 5 3
1 0
4 0
2 7 8
I want above kind of solution, in which I will go through all the rows. If any non-zero revenue value is encountered then it will add all previous values of clicks.
Am I missing something? Please correct me.

The aggregate() function can be used for splitting your long dataframe into chunks and performing operations on each chunk, so you could apply it in your example as:
data <- data.frame(Clicks=c(1,2,3,1,4,2),
Revenue=c(0,0,5,0,0,7),
new_column=NA)
sub_totals <- aggregate(data$Clicks, list(cumsum(data$Revenue)), sum)
data$new_column[data$Revenue != 0] <- head(sub_totals$x, -1)

Related

Minimising number of computations in fuzzy matching and a for loop

I am currently trying to find some potential duplicates in a large data set (500,000+ lines) using fuzzy matching. There are three main parts to this code:
A function that I have written that identifies the most like potential duplicate in a data set (by returning a score - it selects the highest score).
A function that identifies the position of the record that is the most likely to be a duplicate.
A for loop that runs both of the functions above for every record and returns values in the DupScore column and the positionBestMatch column.
An example of a resulting dataset is below:
Name: DOB: DupScore positionbestMatch
Ben 6/3/1994 15 3
Abe 5/5/2005 11 5
Benjamin 6/3/1994 15 1
Gabby 01/01/1900 10 6
Abraham 5/5/2005 11 2
Gabriella 01/01/1900 10 4
The for loop to calculate these scores looks a bit like this (scorefunc and position func are self
written functions):
for (i in c(1:length(df$Name))) {
df$dupScore[i]<-scorefunc[i]
df$positionBestMatch[i]<-positionfunc[i]
}
Obviously, on a data set with so many rows, this loop is time consuming and computationally intensive as it loops through each row. How can I edit my for loop so that :
When a DupScore is calculated for a row, it will also insert the score not only in the [i] row, but also the row of positionbestMatch ?
And have the loop only run for those with empty DupScore and positionBestMatch values.
I hope this makes sense!
Try using a while loop
all_inds <- seq_len(nrow(df))
i <- all_inds[1]
while (length(all_inds) > 1) {
i <- all_inds[1]
df$dupScore[i]<-scorefunc[i]
df$positionBestMatch[i]<-positionfunc[i]
df$dupScore[df$positionBestMatch[i]] <- df$dupScore[i]
all_inds <- setdiff(all_inds, c(i, df$positionBestMatch[i]))
}
But this will keep some empty values for df$positionBestMatch.

R: Dropping variables using number of observations

I have a large dataset, and I'm trying to drop some of my variables based on how many observations each has. For instance, I would like to drop any variable in my dataframe where n < 3 (total observations for that variable is less than 3). Since R can count observations for each variable using describe, can't I use that number to subset the data instead of having to type in each variable name each time I pull in a new version (each version has different variables that will have low n's and there are over 40 variables). Thanks so much for your help!
For instance, my data looks like this:
ID Runaway Aggressive Emergency Hospitalization Injury
1 3 NA 4 1 NA
2 NA NA 2 1 NA
3 4 NA 6 2 3
4 1 NA 1 1 NA
I want to be able to drop "Aggressive" and "Injury" based on their n's being 0 and 1 respectively. However, instead of telling R to drop them by variable name, it would be much more convenient if it was possible to tell R to drop any variable where n < 3 (or whatever number I choose) as I'll be using this code for multiple versions of this dataset. I have tried using column numbers (which is better than writing them out) but it's still pretty tedious when I have to describe() the data, figure out which variables have low n's, and then drop 28 variables or subset() around them.
This works but it's cumbersome...
UIRCorrelation <- UIRKidUnique61[c(28, 30, 32, 34:38, 42, 54:74)]
For some reason, my example looks different when I'm editing versus when I save so I also included an image of it. Sorry. This is the first time I've ever used stack overflow to ask a question. I actually spent a lot of time googling this but couldn't find an answer relating to n.
This line did not work: DF[, sapply(DF, function(col) length(na.omit(col))) > 4]
DF being your dataframe
DF[, sapply(DF, function(col) length(na.omit(col))) > 4]
This function did the trick:
valid <- function(x) {sum(!is.na(x))}
N <- apply(UIRCorrelation,2,valid)
UIRCorrelation2 <- UIRCorrelation[N > 3]

Complex data calculation for consecutive zeros at row level in R (lag v/s lead)

I have a complex calculation that needs to be done. It is basically at a row level, and i am not sure how to tackle the same.
If you can help me with the approach or any functions, that would be really great.
I will break my problem into two sub-problems for simplicity.
Below is how my data looks like
Group,Date,Month,Sales,lag7,lag6,lag5,lag4,lag3,lag2,lag1,lag0(reference),lead1,lead2,lead3,lead4,lead5,lead6,lead7
Group1,42005,1,2503,1,1,0,0,0,0,0,0,0,0,0,0,1,0,1
Group1,42036,2,3734,1,1,1,1,1,0,0,0,0,1,1,0,0,0,0
Group1,42064,3,6631,1,0,0,1,0,0,0,0,0,0,1,1,1,1,0
Group1,42095,4,8606,0,1,0,1,1,0,1,0,1,1,1,0,0,0,0
Group1,42125,5,1889,0,1,1,0,1,0,0,0,0,0,0,0,1,1,0
Group1,42156,6,4819,0,1,0,0,0,1,0,0,1,0,1,1,1,1,0
Group1,42186,7,5120,0,0,1,1,1,1,1,0,0,1,1,0,1,1,0
I have data for each Group at Monthly Level.
I would like to capture the below two things.
1. The count of consecutive zeros for each row to-and-fro from lag0(reference)
The highlighted yellow are the cases, that are consecutive with lag0(reference) to a certain point, that it reaches first 1. I want to capture the count of zero's at row level, along with the corresponding Sales value.
Below is the output i am looking for the part1.
Output:
Month,Sales,Count
1,2503,9
2,3734,3
3,6631,5
4,8606,0
5,1889,6
6,4819,1
7,5120,1
2. Identify the consecutive rows(row:1,2 and 3 & similarly row:5,6) where overlap of any lag or lead happens for any 0 within the lag0(reference range), and capture their Sales and Month value.
For example, for row 1,2 and 3, the overlap happens at atleast lag:3,2,1 &
lead: 1,2, this needs to be captured and tagged as case1 (or 1). Similarly, for row 5 and 6 atleast lag1 is overlapping, hence this needs to be captured, and tagged as Case2(or 2), along with Sales and Month value.
Now, row 7 is not overlapping with the previous or later consecutive row,hence it will not be captured.
Below is the result i am looking for part2.
Month,Sales,Case
1,2503,1
2,3734,1
3,6631,1
5,1889,2
6,4819,2
I want to run this for multiple groups, hence i will either incorporate dplyr or loop to get the result. Currently, i am simply looking for the approach.
Not sure how to solve this problem. First time i am looking to capture things at row level in R. I am not looking for any solution. Simply looking for a first step to counter this problem. Would appreciate any leads.
An option using rle for the 1st part of the calculation can be as:
df$count <- apply(df[,-c(1:4)],1,function(x){
first <- rle(x[1:7])
second <- rle(x[9:15])
count <- 0
if(first$values[length(first$values)] == 0){
count = first$lengths[length(first$values)]
}
if(second$values[1] == 0){
count = count+second$lengths[1]
}
count
})
df[,c("Month", "Sales", "count")]
# Month Sales count
# 1 1 2503 9
# 2 2 3734 3
# 3 3 6631 5
# 4 4 8606 0
# 5 5 1889 6
# 6 6 4819 1
# 7 7 5120 1
Data:
df <- read.table(text =
"Group,Date,Month,Sales,lag7,lag6,lag5,lag4,lag3,lag2,lag1,lag0(reference),lead1,lead2,lead3,lead4,lead5,lead6,lead7
Group1,42005,1,2503,1,1,0,0,0,0,0,0,0,0,0,0,1,0,1
Group1,42036,2,3734,1,1,1,1,1,0,0,0,0,1,1,0,0,0,0
Group1,42064,3,6631,1,0,0,1,0,0,0,0,0,0,1,1,1,1,0
Group1,42095,4,8606,0,1,0,1,1,0,1,0,1,1,1,0,0,0,0
Group1,42125,5,1889,0,1,1,0,1,0,0,0,0,0,0,0,1,1,0
Group1,42156,6,4819,0,1,0,0,0,1,0,0,1,0,1,1,1,1,0
Group1,42186,7,5120,0,0,1,1,1,1,1,0,0,1,1,0,1,1,0",
header = TRUE, stringsAsFactors = FALSE, sep = ",")

how to do sumproduct in R by dplyr package

I just want to achieve a thing on R. Here is the explanation,
I have data sets which contains same value, please find the below data sets,
A B
1122513454 0
1122513460 0
1600041729 0
2100002632 147905
2840007103 0
2840064133 138142
3190300079 138040
3190301011 138120
3680024411 0
4000000263 4000000263
4100002263 4100002268
4880004352 138159
4880015611 138159
4900007044 0
7084781116 142967
7124925306 0
7225002523 7225001325
23012600000 0
80880593057 0
98880000045 0
I have two columns (A & B). In the b column, I have the same value (138159,138159). It appears two times.
I just want to make a calculation, where it will get the same value it will count as 1. That means, I am getting two 138159, but that will be treated as 1. and finally it will count the whole b column value except 0. That means, 0 is here 10 times and the other value is also 10 times, but 138519 appears 2 times, so it will be counted as 1, so other values are 9 times and finally it will give me only other value's count i.e 9.
So my expected output will be 9
I have already done this in excel. But, want to achieve the same in R. Is there any way to do it in R by dplyr package?
I have written following formula in excel,
=+SUMPRODUCT((I2:I14<>0)/COUNTIFS(I2:I14,I2:I14))
how can I count only other value's record without 0?
Can you guys help me with that?
any suggestion is really appreciable.
Edit 1: I have done this by following way,
abc <- hardy[hardy$couponid !=0,]
undertaker <- abc %>%
group_by(TYC) %>%
summarise(count_couponid= n_distinct(couponid))
any smart way to do that?
Thanks

Calculating ratio of consecutive values in dataframe in r

I have a dataframe with 5 second intraday data of a stock. The dataframe exists of a column for the date, one for the time and one for the price at that moment.
I want to make a new column in which it calculates the ratio of two consecutive price values.
I tried it with a for loop, which works but is really slow.
data["ratio"]<- 0
i<-2
for(i in 2:nrow(data))
{
if(is.na(data$price[i])== TRUE){
data$ratio[i] <- 0
} else {
data$ratio[i] <- ((data$price[i] / data$price[i-1]) - 1)
}
}
I was wondering if there is a faster option, since my dataset contains more than 500.000 rows.
I was already trying something with ddply:
data["ratio"]<- 0
fun <- function(x){
data$ratio <- ((data$price/lag(data$price, -1))-1)
}
ddply(data, .(data), fun)
and mutate:
data<- mutate(data, (ratio =((price/lag(price))-1)))
but both don't work and I don't know how to solve it...
Hopefully somebody can help me with this!
You can use the lag function to shift the your data by one row and then take the ratio of the original data to the shifted data. This is vectorized, so you don't need a for loop, and it should be much faster. Also, the number of lag units in the lag function has to be positive, which may be causing an error when you run your code.
# Create some fake data
set.seed(5) # For reproducibility
dat = data.frame(x=rnorm(10))
dat$ratio = dat$x/lag(dat$x,1)
dat
x ratio
1 -0.84085548 NA
2 1.38435934 -1.64637013
3 -1.25549186 -0.90691183
4 0.07014277 -0.05586875
5 1.71144087 24.39939227
6 -0.60290798 -0.35228093
7 -0.47216639 0.78314834
8 -0.63537131 1.34565131
9 -0.28577363 0.44977422
10 0.13810822 -0.48327840
for loop in R can be extremely slow. Try to avoid it if you can.
datalen=length(data$price)
data$ratio[2:datalen]=data$price[1:datalen-1]/data$price[2:datalen]
You don't need to do the is.NA check, you will get NA in the result either the numerator or the denominator is NA.

Resources