I am looking for a better way to compare a value from one day (day X) to the previous day (day X-1). Here I am using the airquality dataset. Suppose I am interested in comparing the wind on one day to the wind on the previous day. Right now I am using merge() to bring together two data frames: one for the current day and one for the previous day. I am also just subtracting 1 from the Day column to get the PrevDay column:
airquality$PrevDay <- airquality$Day - 1
airquality.comp <- merge(
  airquality[, c("Wind", "Day")],
  airquality[, c("Temp", "PrevDay")],
  by.x = "Day", by.y = "PrevDay")
My issue here is that I'd need to create another data frame if I wanted to look back 2 days, or if I wanted to switch Wind and Temp and look at them the other way around. This just seems clunky. Can anyone recommend a better way of doing this?
IMO data.table may be harder to get used to than dplyr, but it will save your tail later when you need robust analysis. For example, to compare Wind with its value from two days earlier:
library(data.table)
setDT(airquality)[, shift(Wind, n = 2L, type = "lag") < Wind]
In base R, you can prepend an NA and drop the last value to build a one-day lag for the comparison:
with(airquality, c(NA, head(Wind, -1)) < Wind)
What kind of comparison do you need?
For example, to check whether each value is greater than the previous one, you could use:
library(dplyr)
with(airquality, lag(Wind) < Wind)
Or with two lags:
with(airquality, lag(Wind, 2) < Wind)
It depends on what questions you are trying to answer, but I would look into autocorrelation (the correlation of a time series with its own lagged values). The acf() function compares the series with itself and highlights which lags are significantly correlated.
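For example, a quick look at Wind's correlation with its own lagged values:
acf(airquality$Wind)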
Or, if you want to compare two different metrics (such as Wind and Temp), try the ccf() function: it takes two vectors and computes the cross-correlation at a range of lags. For example:
ccf(airquality$Wind, airquality$Temp)
If you are interested in autocorrelation or cross-correlation in particular, then you might also consider something like mutual information, which works for non-Gaussian data as well. Both the infotheo and entropy packages for R have built-in functions for this.
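A minimal sketch with infotheo (the discretization defaults are a judgment call, so treat this as a starting point rather than the definitive recipe):
library(infotheo)
# continuous variables must be binned before estimating mutual information
wind.d <- discretize(airquality$Wind)
temp.d <- discretize(airquality$Temp)
mutinformation(wind.d, temp.d)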
I have the following data frame in R:
df <- data.frame(time = c("10:01", "10:05", "10:11", "10:21"),
                 power = c(30, 32, 35, 36))
Problem: I want to calculate the energy consumption, so I need the sum of the time differences multiplied by the power. But every row has only one timestamp, meaning I need to subtract between two different rows, and that is the part I cannot figure out. I guess I need some kind of function, but I couldn't find any hints online.
Example: it has to subtract row1$time from row2$time, and then multiply the difference by row1$power.
As said, I do not know how to implement this step in one call; I am confused about the subtraction part, since it takes values from different rows.
Expected output: E=662
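(The time differences are 4, 6, and 10 minutes, so E = 4*30 + 6*32 + 10*35 = 662.)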
Try this:
# parse the clock times, then take successive differences in minutes
tmp <- strptime(df$time, format = "%H:%M")
df$interval <- c(as.numeric(diff(tmp), units = "mins"), NA)
sum(df$interval * df$power, na.rm = TRUE)
I got 662 back.
Hi, I am new to R and don't quite know what I'm looking for. I want to measure the probability of each frequency of a dust concentration, so I need to divide each frequency by the total of all the dust concentration frequencies. From there I can continue by finding the CDF and PMF of the dust concentration.
So I have dust probability data with two columns (Dust Concentration and its Frequency), and it looks like this:
My first thought was that I have to increment i in this line of R code
dustProb[i, "Frekuensi"]
which should take the specific frequency in row i, so that I can sum all the frequencies queried from it with a for loop like this:
# the dataset is called dustData here
# dustFrequencies = dustData[i, "Frekuensi"]
for (i in dustFrequencies) {
  print(dustFrequencies)
}
The print() part is supposed to be where I sum all the values obtained through those incremented queries.
My questions are:
Can I increment the i inside that R query?
Is my way too complicated, or is there another way to measure probability in R?
Sorry for all the confusion, inefficiency, and holes; I hope I was clear enough here.
Using loops in R isn't very tidy-friendly. You can do:
library(dplyr)
dustData <- dustData %>%
  mutate(probabilities = Frekuensi / sum(Frekuensi))
The new column is the frequency divided by the sum of all frequencies, for each dust concentration.
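The same idea works in base R, and cumsum() then gives the CDF you mentioned (this assumes the frequency column is named Frekuensi, as in your code):
dustData$pmf <- dustData$Frekuensi / sum(dustData$Frekuensi)  # PMF: each frequency over the total
dustData$cdf <- cumsum(dustData$pmf)                          # CDF: running total of the PMF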
I have a dataset in R for which I would like to find the average of a given variable for each year in the dataset (here, from 1871-2019). Not every year has the same number of entries, and so I have encountered two problems: first, how to find the average of the variable for each year, and second, how to add the column of averages to the dataset. I am unsure how to approach the first problem, but I attempted a version of the second by finding the sum of each group and then trying to add those values to the dataset for each entry of a given year with the code teams$SBtotal <- tapply(teams$SB, teams$yearID, FUN=sum). That code resulted in an error noting that the replacement has 149 rows while the data has 2925. I know that this can be done less quickly in Excel, but I'm hoping to be able to use R to solve this problem.
tapply should work for the per-year summaries:
data(iris)
tapply(iris$Sepal.Length, iris$Species, FUN = sum)
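For the second problem, adding the per-year values back onto the dataset, base R's ave() returns one value per row (repeating each group's result), which avoids the "replacement has 149 rows" error. A sketch using the column names from the question:
# per-year mean of SB, repeated for every row of that year
teams$SBavg <- ave(teams$SB, teams$yearID, FUN = mean)
# per-year totals, if those are still wanted
teams$SBtotal <- ave(teams$SB, teams$yearID, FUN = sum)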
I have a data frame consisting of three variables: momentum returns (numeric), volatility (factor), and market state (factor). Volatility and market state each have two levels: volatility is high or low, and market state is positive or negative. I want to make a two-way sorted table showing the mean of momentum returns in every case.
library(wakefield)
mom <- rnorm(30)
vol <- r_sample_factor(30, x = c("high", "low"))
mar <- r_sample_factor(30, x = c("positive", "negative"))
df <- data.frame(mom, vol, mar)
Based on the suggestion given by @r2evans, if you want the mean of every sorted case you can use the following code.
xtabs(mom ~ vol + mar, aggregate(mom ~ vol + mar, data = df, mean))
## If you want a simple sum in every case
xtabs(mom ~ vol + mar, data = df)
You can also do this with the data.table package, which will do the same task in less time.
library(data.table)
df <- as.data.table(df)
## mean of mom for every vol/mar combination
df[, .(mean(mom)), by = .(vol, mar)]
## the same result without the .() wrapper around j
df[, mean(mom), by = .(vol, mar)]
I am a beginner using R, and I want to create a dataframe that maps a range of dates to their respective classified time periods.
paleo.periods <- c("Paleoindian", "Early Paleoindian", "Middle Paleoindian",
                   "Late Paleoindian", "Archaic", "Early Archaic",
                   "Middle Archaic", "Late Archaic", "Woodland",
                   "Early Woodland", "Middle Woodland", "Late Woodland",
                   "Late Prehistoric")
paleo.dates <- c(c(13500,8000), c(13500,10050), c(10050,9015), c(9015,8000),
                 c(8000,2500), c(8000,5500), c(5500,3500), c(3500,2500),
                 c(2500,1150), c(2500,2000), c(2000,1500), c(1500,1150),
                 c(1150,500))
I would like the arrangement to come out such that I can refer to a given time period, e.g. "Late Woodland", and get the associated vector of its beginning and end dates, e.g. (1500, 1150).
I tried simply doing this by
paleo.seg <- data.frame(paleo.periods, paleo.dates)
however, this creates 3 variables: a list of the periods, a list of the vectors, and paleo.dates. I am not sure why it is creating 3 variables, as I'd like there to be only 2: paleo.periods and paleo.dates. I would also like to be able to refer to paleo.seg$paleo.periods and get back the list of periods (and later use this to refer to the periods individually), and likewise for the dates.
Essentially I would like my dataframe to look a bit like this:
paleoperiods      paleodates
"Late Woodland"   1500,1150
Then I could look specifically for the string "Late Woodland" and find the date vector. I tried doing this on my current data.frame, and
"Woodland" %in% paleo.seg returns FALSE. So I feel like I am misunderstanding how to build a proper dataframe, as well as how to match one categorical variable to two dates.
There are a few ways you could go about this, depending on what you want to do with your dataframe. My recommendation would be to split the dates column into two separate date columns (start and end, I believe, from your description). This way you can calculate, or apply rules based on, the dates. I've found this useful when looking at data, as it gives you the ability to filter on two different aspects of the date. If you would like the dates in a single column, you could store them as character strings; however, that approach has drawbacks for exploratory data analysis. An example of this would be:
paleo.dates <- c("13500,8000","13500,10050","10050,9015","9015,8000", ...)
This would allow you to look up Late Woodland and get "1500,1150", but you wouldn't be able to search for periods occurring after 1500 if that type of analysis is something you would be doing at a later point.
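For completeness, a minimal sketch of the two-column layout, with the start/end values unpacked from the pairs in the question (the column names are just suggestions, and I'm assuming the dates are years BP):
paleo.start <- c(13500, 13500, 10050, 9015, 8000, 8000, 5500,
                 3500, 2500, 2500, 2000, 1500, 1150)
paleo.end   <- c(8000, 10050, 9015, 8000, 2500, 5500, 3500,
                 2500, 1150, 2000, 1500, 1150, 500)
paleo.seg <- data.frame(period = paleo.periods, start = paleo.start, end = paleo.end)
# look up one period by name
paleo.seg[paleo.seg$period == "Late Woodland", c("start", "end")]
# filter on the dates, e.g. periods beginning at or after 1500 BP
paleo.seg[paleo.seg$start <= 1500, ]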