For loop on dates in R

R-users,
I have this dataframe:
head(M2006)
X.ID_punto MM.GG.AA Rad_SWD
2945377 1 0001-01-06 19.918
2945378 2 0001-01-06 19.911
2945379 1 0001-02-06 19.903
2945380 2 0001-02-06 19.893
2945381 1 0001-03-06 19.875
2945382 2 0001-03-06 19.858
What I need to do is obtain a different subset for every date (MM.GG.AA):
subset(M2006, M2006$MM.GG.AA=="0001-10-06" )
or, in other words, a different subset for every site (X.ID_punto):
subset(M2006, M2006$X.ID_punto==1)
Is it possible to loop this on sites (X.ID_punto) or dates (MM.GG.AA)?
I have tried in this way:
output<- data.frame(ID=rep(1:365))
for (p in as.factor(M2006[,1])) {
sub<- subset(M2006, M2006$X.ID_punto==p )
output[,p] <- sub$Rad_SWD
}
The code runs, but it doesn't loop over every ID.
If I can't loop, I'll have to write subset(M2006, M2006$X.ID_punto==xxx) a thousand times...
Thank you in advance!
Fra

I think, from your description of the input and desired output, you can achieve this pretty simply using the reshape package and its cast function:
require(reshape)
cast( M2006 , MM.GG.AA ~ X.ID_punto , value = .(Rad_SWD) )
# MM.GG.AA 1 2
#1 0001-01-06 19.918 19.911
#2 0001-02-06 19.903 19.893
#3 0001-03-06 19.875 19.858
It will certainly be quicker than using loops (it isn't going to be the absolute quickest solution, but I'd expect it to take under 1-2 seconds).
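If the reshape package isn't available, the newer reshape2 package does the same job with dcast(); value.var names the measured column. A sketch, reconstructing a small data frame from the head(M2006) shown in the question:

```r
library(reshape2)

# Rebuild the sample rows from the question's head(M2006) output
M2006 <- data.frame(
  X.ID_punto = c(1, 2, 1, 2, 1, 2),
  MM.GG.AA   = rep(c("0001-01-06", "0001-02-06", "0001-03-06"), each = 2),
  Rad_SWD    = c(19.918, 19.911, 19.903, 19.893, 19.875, 19.858)
)

# One row per date, one column per site
wide <- dcast(M2006, MM.GG.AA ~ X.ID_punto, value.var = "Rad_SWD")
wide
#     MM.GG.AA      1      2
# 1 0001-01-06 19.918 19.911
# 2 0001-02-06 19.903 19.893
# 3 0001-03-06 19.875 19.858
```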

I've found a possible solution myself.
I won't delete my question; maybe someone will find it useful.
#first of all, since I have 1008 sites (X.ID_punto),
#I created a vector of site IDs
#(note: calling it "list" would shadow the base function of that name,
#and rep() around 1:1008 was redundant)
sites <- 1:1008
#then, create a dataframe where I'll store my subsets.
#Every subset will be a column of 365 observations
output <- data.frame(site1=rep(1:365))
#loop the subset function over the 1008 sites
for (p in seq_along(sites)) {
  print(p) #just to see that the loop is running
  sub <- subset(M2006, M2006$X.ID_punto==p)
  output[,p] <- sub$Rad_SWD #add the subset, as a column, to the output dataframe
}
write.csv(output, "output.csv") #save the resulting data frame
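For the record, the same wide table can be built without an explicit loop: split() breaks Rad_SWD into one vector per site, and cbind() binds them into columns. A sketch, assuming every site has the same number of observations in the same date order, illustrated on the small sample from the question:

```r
# Sample rows from the question's head(M2006) output
M2006 <- data.frame(
  X.ID_punto = c(1, 2, 1, 2, 1, 2),
  MM.GG.AA   = rep(c("0001-01-06", "0001-02-06", "0001-03-06"), each = 2),
  Rad_SWD    = c(19.918, 19.911, 19.903, 19.893, 19.875, 19.858)
)

# One column per site, one row per date
output <- do.call(cbind, split(M2006$Rad_SWD, M2006$X.ID_punto))
output
#           1      2
# [1,] 19.918 19.911
# [2,] 19.903 19.893
# [3,] 19.875 19.858
```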

Related

How to speedup for and if loop in R

In my current project, I have around 8.2 million rows. I want to scan for all rows and apply a certain function if the value of a specific column is not zero.
counter=1
for(i in 1:nrow(data)){
if(data[i,8]!=0){
totalclicks=sum(data$Clicks[counter:(i-1)])
test$Clicks[i]=totalclicks
counter=i
}
}
In the above code, I scan the specific column over 8.2 million rows, and where the value is non-zero I compute the sum of the click values since the previous non-zero row. The problem is that the for/if loop is too slow: it takes 1 hour for 50K rows. I've heard that the apply family is an alternative, but the following code also takes too long:
sapply(1:nrow(data), function(x)
if(data[x,8]!=0){
totalclicks=sum(data$Clicks[counter:(x-1)])
test$Clicks[x]=totalclicks
counter=x
})
[Updated]
Kindly consider the following as sample dataset:
clicks  revenue  new_column (sum of previous clicks)
  1        0
  2        0
  3        5        3
  1        0
  4        0
  2        7        8
I want a solution of the above kind: going through all the rows, and whenever a non-zero revenue value is encountered, summing all the clicks since the previous one.
Am I missing something? Please correct me.
The aggregate() function can be used for splitting your long dataframe into chunks and performing operations on each chunk, so you could apply it in your example as:
data <- data.frame(Clicks=c(1,2,3,1,4,2),
Revenue=c(0,0,5,0,0,7),
new_column=NA)
sub_totals <- aggregate(data$Clicks, list(cumsum(data$Revenue)), sum)
data$new_column[data$Revenue != 0] <- head(sub_totals$x, -1)
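The same result can also be had with plain cumulative sums, which scale well to millions of rows. A sketch; it assumes, as in the sample, that the first row has zero revenue:

```r
data <- data.frame(Clicks  = c(1, 2, 3, 1, 4, 2),
                   Revenue = c(0, 0, 5, 0, 0, 7),
                   new_column = NA)

cs  <- cumsum(data$Clicks)       # running total of clicks
idx <- which(data$Revenue != 0)  # rows where revenue occurs
# clicks accumulated since the previous revenue row (excluding this row)
data$new_column[idx] <- diff(c(0, cs[idx - 1]))
data$new_column
# [1] NA NA  3 NA NA  8
```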

Calculating ratio of consecutive values in dataframe in r

I have a dataframe with 5 second intraday data of a stock. The dataframe exists of a column for the date, one for the time and one for the price at that moment.
I want to make a new column in which it calculates the ratio of two consecutive price values.
I tried it with a for loop, which works but is really slow.
data["ratio"]<- 0
i<-2
for(i in 2:nrow(data))
{
if(is.na(data$price[i])== TRUE){
data$ratio[i] <- 0
} else {
data$ratio[i] <- ((data$price[i] / data$price[i-1]) - 1)
}
}
I was wondering if there is a faster option, since my dataset contains more than 500,000 rows.
I was already trying something with ddply:
data["ratio"]<- 0
fun <- function(x){
data$ratio <- ((data$price/lag(data$price, -1))-1)
}
ddply(data, .(data), fun)
and mutate:
data<- mutate(data, (ratio =((price/lag(price))-1)))
but both don't work and I don't know how to solve it...
Hopefully somebody can help me with this!
You can use the lag function to shift your data by one row and then take the ratio of the original data to the shifted data. This is vectorized, so you don't need a for loop, and it should be much faster. Also, the number of lag units in the lag function has to be positive, which may be what causes the error when you run your code.
# Create some fake data
library(dplyr)  # lag() below is dplyr's vector lag, not stats::lag
set.seed(5) # For reproducibility
dat = data.frame(x=rnorm(10))
dat$ratio = dat$x/lag(dat$x,1)
dat
x ratio
1 -0.84085548 NA
2 1.38435934 -1.64637013
3 -1.25549186 -0.90691183
4 0.07014277 -0.05586875
5 1.71144087 24.39939227
6 -0.60290798 -0.35228093
7 -0.47216639 0.78314834
8 -0.63537131 1.34565131
9 -0.28577363 0.44977422
10 0.13810822 -0.48327840
A for loop in R can be extremely slow; try to avoid it if you can.
datalen <- nrow(data)
data$ratio[2:datalen] <- data$price[2:datalen] / data$price[1:(datalen - 1)] - 1
(Note the parentheses: 1:datalen-1 would parse as (1:datalen)-1.) You don't need the is.na() check: if either the numerator or the denominator is NA, the result is NA as well.
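Another vectorized base-R spelling of the same computation uses diff(), which directly yields the period-over-period change. A sketch on a hypothetical price series:

```r
price <- c(100, 101, 99.5, 100.2)  # hypothetical prices
# (price[i] - price[i-1]) / price[i-1]  ==  price[i] / price[i-1] - 1
ratio <- c(NA, diff(price) / head(price, -1))
ratio
# first element NA; then approximately 0.0100, -0.01485, 0.00704
```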

How to run a for-loop through a string vector of a data frame in R?

I'm trying to do something very simple: to run a loop through a vector of names and use those names in my code.
geo = c(rep("AT",3),rep("BE",3))
time = c(rep(c("1990Q1","1990Q2","1990Q3"),2))
value = c(1:6)
Data <- data.frame(geo,time,value)
My real dataset has 14 countries and 75 time periods. I would like a function that, for example, loops over the countries and subsets them, so that I end up with individual datasets such as:
data_AT <- subset(Data, (Data$geo=="AT"))
data_BE <- subset(Data, (Data$geo=="BE"))
but with a loop and ideally with a solution I can apply to other functions as well :-)
In my mind, this should look something like this:
codes <- unique(Data$geo)
for (i in 1:length(codes))
{k <- codes[i]
data_(k) <- subset(Data, (Data$geo==k))}
However, subset() doesn't work like this, and neither do other functions. I think my problem is that I don't know how to refer to the respective name that "k" has taken (e.g. "AT") inside my code. If at all possible, I would very much appreciate a general solution for running a function over a vector of strings and using each element of that vector in my code. Maybe something in the direction of the apply functions? Though I'm not getting very far with that either...
Any help would be very much appreciated!
I'm using loops for similar purposes too. Maybe it's not the fastest way, but at least I understand it, for example when saving plots for different subsets.
There is no need to loop over the indices of the vector; you can loop over the vector itself. To turn a string into a variable name, you can use assign().
geo = c(rep("AT",3),rep("BE",3))
time = c(rep(c("1990Q1","1990Q2","1990Q3"),2))
value = c(1:6)
Data <- data.frame(geo,time,value)
codes <- sort(unique(Data$geo))
for (k in codes) {
name<-paste("data", k, sep="_")
assign(name, subset(Data, (Data$geo==k)))
}
BTW, filter from package dplyr is much faster than subset!
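For completeness, the dplyr version mentioned above looks like this (a sketch on the question's sample data):

```r
library(dplyr)

Data <- data.frame(geo   = c(rep("AT", 3), rep("BE", 3)),
                   time  = rep(c("1990Q1", "1990Q2", "1990Q3"), 2),
                   value = 1:6)

data_AT <- filter(Data, geo == "AT")
data_AT
#   geo   time value
# 1  AT 1990Q1     1
# 2  AT 1990Q2     2
# 3  AT 1990Q3     3
```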
In R, you would typically do this with a list of data.frames instead of several separate data.frames:
lst <- split(Data, Data$geo)
lst
#$AT
# geo time value
#1 AT 1990Q1 1
#2 AT 1990Q2 2
#3 AT 1990Q3 3
#
#$BE
# geo time value
#4 BE 1990Q1 4
#5 BE 1990Q2 5
#6 BE 1990Q3 6
Now you can access each element (which is a data.frame) by typing:
lst[["AT"]]
# geo time value
#1 AT 1990Q1 1
#2 AT 1990Q2 2
#3 AT 1990Q3 3
If you have a vector of country names for which you want to add +1 to the value column, you can do it like this:
cntrs <- c("BE", "AT")
lst[cntrs] <- lapply(lst[cntrs], function(x) {x$value <- x$value + 1; return(x)} )
#$BE
# geo time value
#4 BE 1990Q1 5
#5 BE 1990Q2 6
#6 BE 1990Q3 7
#
#$AT
# geo time value
#1 AT 1990Q1 2
#2 AT 1990Q2 3
#3 AT 1990Q3 4
Edit: if you really want to stick with a for loop, I recommend not splitting the data into several separate data.frames but running the loop over the whole data set, for example like this:
cntrs <- "BE"
for(i in cntrs){
Data$value[Data$geo == i] <- Data$value[Data$geo == i] + 1
}

Filling Gaps in Time Series Data in R

So this question has been bugging me for a while since I've been looking for an efficient way of doing it. Basically, I have a dataframe, with a data sample from an experiment in each row. I guess this should be looked at more as a log file from an experiment than the final version of the data for analyses.
The problem that I have is that, from time to time, certain events get logged in a column of the data. To make the analyses tractable, what I'd like to do is "fill in the gaps" for the empty cells between events so that each row in the data can be tied to the most recent event that has occurred. This is a bit difficult to explain but here's an example:
Now, I'd like to take that and turn it into this:
Doing so will enable me to split the data up by the current event. In any other language I would jump straight into a for loop, but I know that R isn't great with loops of that type, and in this case I have hundreds of thousands of rows of data to sort through, so I'm wondering if anyone can suggest a speedy way of doing this?
Many thanks.
This question has been asked in various forms on this site many times. The standard answer is to use zoo::na.locf. Search [r] for na.locf to find examples of how to use it.
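For reference, a minimal na.locf() sketch (it assumes the gaps are coded as NA rather than empty strings):

```r
library(zoo)

x <- c("FIRST_EVENT", NA, "SECOND_EVENT", NA, NA)
na.locf(x)  # each NA replaced by the last preceding non-NA value
# [1] "FIRST_EVENT"  "FIRST_EVENT"  "SECOND_EVENT" "SECOND_EVENT" "SECOND_EVENT"
```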
Here is an alternative way in base R using rle:
d <- data.frame(LOG_MESSAGE=c('FIRST_EVENT', '', 'SECOND_EVENT', '', ''))
within(d, {
# ensure character data
LOG_MESSAGE <- as.character(LOG_MESSAGE)
CURRENT_EVENT <- with(rle(LOG_MESSAGE), # list with 'values' and 'lengths'
rep(replace(values,
nchar(values)==0,
values[nchar(values) != 0]),
lengths))
})
# LOG_MESSAGE CURRENT_EVENT
# 1 FIRST_EVENT FIRST_EVENT
# 2 FIRST_EVENT
# 3 SECOND_EVENT SECOND_EVENT
# 4 SECOND_EVENT
# 5 SECOND_EVENT
The na.locf() function in package zoo is useful here, e.g.
require(zoo)
dat <- data.frame(ID = 1:5, sample_value = c(34,56,78,98,234),
log_message = c("FIRST_EVENT", NA, "SECOND_EVENT", NA, NA))
dat <-
transform(dat,
Current_Event = sapply(strsplit(as.character(na.locf(log_message)),
"_"),
`[`, 1))
Gives
> dat
ID sample_value log_message Current_Event
1 1 34 FIRST_EVENT FIRST
2 2 56 <NA> FIRST
3 3 78 SECOND_EVENT SECOND
4 4 98 <NA> SECOND
5 5 234 <NA> SECOND
To explain the code:
1. na.locf(log_message) returns a factor (that was how the data were created in dat) with the NAs replaced by the previous non-NA value (the "last observation carried forward" part).
2. The result of 1. is then converted to a character vector.
3. strsplit() is run on this character vector, breaking it apart on the underscore. strsplit() returns a list with as many elements as there were elements in the character vector. In this case each component is a vector of length two. We want the first elements of these vectors.
4. So I use sapply() to run the subsetting function `[`() and extract the 1st element from each list component.
The whole thing is wrapped in transform() so that i) I don't need to refer to dat$ and ii) I can add the result as a new variable directly into the data dat.

Merging databases in R on multiple conditions with missing values (NAs) spread throughout

I am trying to build a database in R from multiple csvs. There are NAs spread throughout each csv, and I want to build a master list that summarizes all of the csvs in a single database. Here is some quick code that illustrates my problem (most csvs actually have 1000s of entries, and I would like to automate this process):
d1=data.frame(common=letters[1:5],species=paste(LETTERS[1:5],letters[1:5],sep='.'))
d1$species[1]=NA
d1$common[2]=NA
d2=data.frame(common=letters[1:5],id=1:5)
d2$id[3]=NA
d3=data.frame(species=paste(LETTERS[1:5],letters[1:5],sep='.'),id=1:5)
I have been going around in circles (writing loops), trying to use merge and reshape(melt/cast) without much luck, in an effort to succinctly summarize the information available. This seems very basic but I can't figure out a good way to do it. Thanks in advance.
To be clear, I am aiming for a final database like this:
common species id
1 a A.a 1
2 b B.b 2
3 c C.c 3
4 d D.d 4
5 e E.e 5
I recently had a similar situation. The code below goes through all the variables and pulls as much information as possible back into the dataset. Once all the data are there, running it one last time on the first variable gives the result.
#combine all into one dataframe
require(gtools)
d <- smartbind(d1,d2,d3)
#function to get the first non-NA result
getfirstnonna <- function(x){
  ret <- head(x[!is.na(x)], 1)
  if (length(ret) == 0) NA else ret
}
#function to get max info based on one variable
runiteration <- function(dataset,variable){
require(plyr)
e <- ddply(.data=dataset,.variables=variable,.fun=function(x){apply(X=x,MARGIN=2,FUN=getfirstnonna)})
#returns the above without the NA "factor"
return(e[which(!is.na(e[ ,variable])), ])
}
#run through all variables
for(i in 1:length(names(d))){
d <- rbind(d,runiteration(d,names(d)[i]))
}
#repeat first variable since all possible info should be available in dataset
d <- runiteration(d,names(d)[1])
If id, species, etc. differ between the separate datasets, this will return whichever non-NA value comes first. In that case, changing the row order in d, or the variable order, could affect the result. Changing the getfirstnonna function alters this behavior (tail() would pick the last value; you could even collect all distinct possibilities). You could also order the dataset from the most complete records to the least.
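A base-R alternative for this particular three-table layout: use the complete table (d3) as the spine, merge in what d2 knows, and fill the remaining gap from d1 via match(). A sketch; it assumes R >= 4.0 (so the strings stay characters rather than factors) and that species uniquely identifies a row:

```r
d1 <- data.frame(common = letters[1:5],
                 species = paste(LETTERS[1:5], letters[1:5], sep = "."))
d1$species[1] <- NA
d1$common[2]  <- NA
d2 <- data.frame(common = letters[1:5], id = 1:5)
d2$id[3] <- NA
d3 <- data.frame(species = paste(LETTERS[1:5], letters[1:5], sep = "."), id = 1:5)

# d3 has every species/id pair; merging d2 adds common where the id is known
m <- merge(d3, d2, by = "id", all.x = TRUE)

# fill the still-missing common values from d1's species -> common mapping
miss <- is.na(m$common)
m$common[miss] <- d1$common[match(m$species[miss], d1$species)]

m[order(m$id), c("common", "species", "id")]
#   common species id
# 1      a     A.a  1
# 2      b     B.b  2
# 3      c     C.c  3
# 4      d     D.d  4
# 5      e     E.e  5
```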
