I have a data frame with 5-second intraday data of a stock. The data frame consists of a column for the date, one for the time, and one for the price at that moment.
I want to add a new column that holds the ratio of two consecutive price values.
I tried it with a for loop, which works but is really slow.
data["ratio"]<- 0
i<-2
for(i in 2:nrow(data))
{
if(is.na(data$price[i])== TRUE){
data$ratio[i] <- 0
} else {
data$ratio[i] <- ((data$price[i] / data$price[i-1]) - 1)
}
}
I was wondering if there is a faster option, since my dataset contains more than 500.000 rows.
I was already trying something with ddply:
data["ratio"]<- 0
fun <- function(x){
data$ratio <- ((data$price/lag(data$price, -1))-1)
}
ddply(data, .(data), fun)
and mutate:
data<- mutate(data, (ratio =((price/lag(price))-1)))
but neither works and I don't know how to fix it...
Hopefully somebody can help me with this!
You can use the lag function from dplyr to shift your data by one row and then take the ratio of the original data to the shifted data. This is vectorized, so you don't need a for loop, and it should be much faster. Also, the number of lag units in dplyr's lag cannot be negative, which may be causing an error when you run your code.
library(dplyr) # lag() here is dplyr::lag, which shifts a plain vector
# Create some fake data
set.seed(5) # For reproducibility
dat = data.frame(x=rnorm(10))
dat$ratio = dat$x/lag(dat$x,1)
dat
x ratio
1 -0.84085548 NA
2 1.38435934 -1.64637013
3 -1.25549186 -0.90691183
4 0.07014277 -0.05586875
5 1.71144087 24.39939227
6 -0.60290798 -0.35228093
7 -0.47216639 0.78314834
8 -0.63537131 1.34565131
9 -0.28577363 0.44977422
10 0.13810822 -0.48327840
A for loop in R can be extremely slow. Try to avoid it if you can.
datalen=length(data$price)
# current price divided by the previous price, minus 1, as in the question
data$ratio[2:datalen]=data$price[2:datalen]/data$price[1:(datalen-1)]-1
You don't need the is.na check: you will get NA in the result if either the numerator or the denominator is NA.
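If the zeros from the original loop are still wanted for rows with a missing price, a base R sketch could look like the following. The sample prices are made up for illustration, and the NA rule simply mirrors the loop in the question (a row whose own price is NA gets 0; a row whose previous price is NA stays NA).

```r
# Hypothetical sample prices; the real data also has date and time columns.
data <- data.frame(price = c(100, 101, NA, 102, 103.02))

n <- nrow(data)
# Each price divided by the previous one, minus 1.
# The first row has no predecessor, so it keeps the initial value 0.
ratio <- c(0, data$price[-1] / data$price[-n] - 1)
# Mirror the original loop: rows whose own price is NA get 0.
ratio[is.na(data$price)] <- 0
data$ratio <- ratio
data
```

Dropping the first element (`price[-1]`) and the last element (`price[-n]`) lines the two vectors up so that the whole column is computed in one vectorized division.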
In my current project, I have around 8.2 million rows. I want to scan for all rows and apply a certain function if the value of a specific column is not zero.
counter=1
for(i in 1:nrow(data)){
if(data[i,8]!=0){
totalclicks=sum(data$Clicks[counter:(i-1)])
test$Clicks[i]=totalclicks
counter=i
}
}
In the above code, I check a specific column over 8.2 million rows, and whenever its value is not zero I calculate a sum over the preceding values. The problem is that the for loop with the if check is too slow: it takes 1 hour for 50K rows. I heard that the apply family is an alternative for this, but the following code also takes too long:
sapply(1:nrow(data), function(x)
if(data[x,8]!=0){
totalclicks=sum(data$Clicks[counter:(x-1)])
test$Clicks[x]=totalclicks
counter=x
})
[Updated]
Kindly consider the following as sample dataset:
clicks  revenue  new_column (sum of previous clicks)
1       0
2       0
3       5        3
1       0
4       0
2       7        8
I want a result like the one above: going through all the rows, whenever a non-zero revenue value is encountered, the new column should hold the sum of the clicks in the preceding rows, counted from the previous non-zero revenue row (or from the start).
Am I missing something? Please correct me.
The aggregate() function can be used for splitting your long dataframe into chunks and performing operations on each chunk, so you could apply it in your example as:
data <- data.frame(Clicks=c(1,2,3,1,4,2),
Revenue=c(0,0,5,0,0,7),
new_column=NA)
sub_totals <- aggregate(data$Clicks, list(cumsum(data$Revenue)), sum)
data$new_column[data$Revenue != 0] <- head(sub_totals$x, -1)
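As a cross-check, the same totals can be obtained without grouping, using only cumsum() and diff(). This is a minimal sketch on the sample data from the question; the helper names `hits` and `clicks_before` are made up here.

```r
data <- data.frame(Clicks  = c(1, 2, 3, 1, 4, 2),
                   Revenue = c(0, 0, 5, 0, 0, 7))

# Rows where revenue is non-zero.
hits <- which(data$Revenue != 0)
# Total clicks strictly before each row.
clicks_before <- cumsum(data$Clicks) - data$Clicks

data$new_column <- NA_real_
# Clicks accumulated since the previous non-zero revenue row (or since the start).
data$new_column[hits] <- diff(c(0, clicks_before[hits]))
data
```

Because this variant never groups on cumsum(Revenue), it does not depend on the revenue values being positive, and it stays fully vectorized, which should matter for millions of rows.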
This might be very simple, but I can't figure out how to fix this problem. Basically I need to calculate growth for multiple columns. When I divide by a column that contains a 0 value, the result is Inf.
Let me take an example data set:
a <- c(1,0,3,4,5)
b <- c(1,4,2,0,4)
c <- data.frame(a,b)
c$growth <- b/a-1
As you can see, in the 2nd row a is 0, so the growth is Inf. It should display 4 instead.
My original data is in data.table so any solution in data.table would help.
How can we fix this?
I don't know why you want to turn Inf into 4. In my opinion it doesn't make sense, as the growth is not 4, it is Inf. However, if you still want to do that, here's some code:
a <- c(1,0,3,4,5)
b <- c(1,4,2,0,4)
data <- data.frame(a,b)
data$growth <- b/a-1
data[data$growth == Inf,3] <- data[data$growth == Inf,2]
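Another way to express the same rule is to decide per row before dividing, so Inf never appears in the first place. The sketch below uses base R ifelse() on the sample vectors from the question; the name `dat` is made up here. The same ifelse() expression should carry over to a data.table `:=` assignment, though that variant is not shown or tested here.

```r
a <- c(1, 0, 3, 4, 5)
b <- c(1, 4, 2, 0, 4)
dat <- data.frame(a, b)

# Where the base value a is 0, fall back to b itself (the rule requested in the question);
# everywhere else use the usual growth formula.
dat$growth <- ifelse(dat$a == 0, dat$b, dat$b / dat$a - 1)
dat
```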
I want to apply a percentage calculation on certain rows (according to column criteria) of my data set. Normally I would do a (1) subset for this, (2) calculate the percentage, (3) delete the old (or previously subsetted rows) in my original data and (4) finally stack them together via rbind().
My question: is there a better/faster/shorter way to do this calculation? Here is some example data:
df <- data.frame(object = c("apples","tomatoes", "apples","pears" ),
Value = c(50,10,30,40))
The percentage calculation (50%) I would like to use for the subset on e.g. apples:
sub[,2] <- sub$Value * 50 /100
And the result should look like this:
object Value
1 apples 25
2 tomatoes 10
3 apples 15
4 pears 40
Thank you. There is probably an easy way, but I haven't found a solution online so far.
Create a logical index for the rows where 'object' is 'apples' and do the calculation only on the subset of 'Value' selected by that index.
i1 <- df$object=='apples'
df$Value[i1] <- df$Value[i1]*50/100
Or you can use ifelse
df$Value <- with(df, ifelse(object=='apples', Value*50/100, Value))
Or a faster approach would be data.table
library(data.table)
setDT(df)[object=='apples', Value := Value*0.5]
I am new to Stackoverflow and to R, so I hope you can be a bit patient and excuse any formatting mistakes.
I am trying to write an R-script, which allows me to automatically analyze the raw data of a qPCR machine.
I was quite successful in cleaning up the data, but at some point I run into trouble. My goal is to consolidate the data into a comprehensive table.
The initial data frame (DF) looks something like this:
Sample Detector Value
1 A 1
1 B 2
2 A 3
3 A 2
3 B 3
3 C 1
My goal is to have a dataframe with the Sample-names as row names and Detector as column names.
A B C
1 1 2 NA
2 3 NA NA
3 2 3 1
My approach
First I took out the names of samples and detectors and saved them in vectors as factors.
detectors = summary(DF$Detector)
detectors = names(detectors)
samples = summary(DF$Sample)
samples = names(samples)
Then I subsetted the detectors into a new dataframe based on the name of the detector in the dataframe.
for (i in 1:length(detectors)){
assign(detectors[i], DF[which(DF$Detector == detectors[i]),])
}
Then I initialize an empty dataframe with the right column and row names:
result = data.frame(matrix(NA, nrow = length(samples), ncol = length(detectors)))
colnames(result) = detectors
rownames(result) = samples
So now the problem. I have to get the values from the detector subsets into the result dataframe. Here it is important that each value finds its way to the right position in the dataframe. The issue is that there are not equally many values, since some samples lack some detectors.
I tried to do the following: iterate through the detector subsets, compare the row name (= sample name) with the sample of each entry, and if they are the same write the value into the new dataframe. If they are not the same, it should write an NA.
for (i in 1:length(detectors)){
for (j in 1:length(get(detectors[i])$Sample)){
result[j,i] = ifelse(get(detectors[i])$Sample[j] == rownames(result[j,]), get(detectors[i])$Ct.Mean[j], NA)
}
}
The trouble is that this stops the iteration through the detector$Sample column and switches to the next detector. My understanding is that the samples being compared get out of sync, so that all following ifelse calls yield NA.
I tried to circumvent this by replacing the "no" argument of ifelse(test, yes, no) with j=j+1 to get it back in sync, but this unfortunately didn't work.
I hope I could make my problem understandable to you!
Looking forward to hearing any suggestions or comments (also on how to improve my code in general ;)
We can use acast from library(reshape2) to convert from 'long' to 'wide' format.
acast(DF, Sample~Detector, value.var='Value') #returns a matrix output
# A B C
#1 1 2 NA
#2 3 NA NA
#3 2 3 1
If we need a data.frame output, use dcast.
Or use spread from library(tidyr), which will also have the 'Sample' as an additional column.
library(tidyr)
spread(DF, Detector, Value)
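For readers who want to see what the reshape does under the hood, here is a base R sketch that fills the result by position instead of by nested loops. It rebuilds the sample DF from the question; `samples`, `detectors` and `result` follow the question's names, and the matrix-indexing approach is an illustration, not what acast does internally.

```r
DF <- data.frame(
  Sample   = c(1, 1, 2, 3, 3, 3),
  Detector = c("A", "B", "A", "A", "B", "C"),
  Value    = c(1, 2, 3, 2, 3, 1)
)

samples   <- sort(unique(DF$Sample))
detectors <- sort(unique(DF$Detector))

# Start with an all-NA matrix: one row per sample, one column per detector.
result <- matrix(NA_real_,
                 nrow = length(samples), ncol = length(detectors),
                 dimnames = list(as.character(samples), detectors))

# A two-column index matrix (row, column) places every value in one step.
result[cbind(match(DF$Sample, samples),
             match(DF$Detector, detectors))] <- DF$Value
result
```

Each value is looked up by its own sample and detector, so missing combinations simply stay NA and nothing can drift out of sync.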
R-users,
I have this dataframe:
head(M2006)
X.ID_punto MM.GG.AA Rad_SWD
2945377 1 0001-01-06 19.918
2945378 2 0001-01-06 19.911
2945379 1 0001-02-06 19.903
2945380 2 0001-02-06 19.893
2945381 1 0001-03-06 19.875
2945382 2 0001-03-06 19.858
What I need to do is to obtain different subsets for every dates (MM.GG.AA):
subset(M2006, M2006$MM.GG.AA=="0001-10-06" )
or, in other words, different subsets for every sites (X.ID_punto):
subset(M2006, M2006$X.ID_punto==1)
Is it possible to loop this on sites (X.ID_punto) or dates (MM.GG.AA)?
I have tried in this way:
output<- data.frame(ID=rep(1:365))
for (p in as.factor(M2006[,1])) {
sub<- subset(M2006, M2006$X.ID_punto==p )
output[,p] <- sub$Rad_SWD
}
The code runs, but it does not loop over every ID.
If I can't loop, I have to write down subset(M2006, M2006$X.ID_punto==xxx) for a thousand times...
Thank you in advance!
Fra
I think from your description of input and desired output you can achieve this pretty simply using the reshape package and the cast function:
require(reshape)
cast( M2006 , MM.GG.AA ~ X.ID_punto , value = .(Rad_SWD) )
# MM.GG.AA 1 2
#1 0001-01-06 19.918 19.911
#2 0001-02-06 19.903 19.893
#3 0001-03-06 19.875 19.858
It will certainly be quicker than using loops (it isn't going to be the absolute quickest solution, but I imagine it takes < 1-2 seconds).
I've found a possible solution by myself.
I won't cancel my question, maybe someone will find it useful.
#first of all, since I have 1008 sites (X.ID_punto)
#I created a list of my sites
list<- rep(1:1008)
#then, create a dataframe where I'll store my subsets.
#Every subset will be a column of 365 observations
output<- data.frame(site1=rep(1:365))
#loop the subset function on list of 1008 sites
for (p in 1:length(list)) {
print(p) #just to see if loop run
sub<- subset(M2006, M2006$X.ID_punto==p )
output[,p] <- sub$Rad_SWD #add the subset, as a column, to output dataframe
}
write.csv(output, "output.csv") #save the resulting data frame
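If the goal is only to get one subset per site or per date, base R's split() might save the loop entirely. The sketch below uses a made-up miniature of M2006 with the same column names; `by_site` and `by_date` are hypothetical names, and the last step assumes every site has the same number of rows, already in date order.

```r
# Hypothetical miniature of M2006 with the same column names.
M2006 <- data.frame(
  X.ID_punto = c(1, 2, 1, 2, 1, 2),
  MM.GG.AA   = c("0001-01-06", "0001-01-06", "0001-02-06",
                 "0001-02-06", "0001-03-06", "0001-03-06"),
  Rad_SWD    = c(19.918, 19.911, 19.903, 19.893, 19.875, 19.858)
)

# One data frame per site, and one per date, as named lists.
by_site <- split(M2006, M2006$X.ID_punto)
by_date <- split(M2006, M2006$MM.GG.AA)

# One column of Rad_SWD per site (requires equal row counts per site).
output <- as.data.frame(lapply(by_site, function(d) d$Rad_SWD))
output
```

A single subset is then available by name, for example `by_site[["1"]]` or `by_date[["0001-02-06"]]`.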