I have a data frame x,
and I need to calculate the total number of steps (the first column) per day or per 5-minute interval.
This code, which groups by date, works fine:
b<-summarise(group_by(x,date),h = sum(steps))
But when I group by interval instead of date,
b<-summarise(group_by(x,interval),h = sum(steps))
it returns only NA values.
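The question does not show the data, so this is only a guess at the cause: if `steps` contains NA values (for example, whole days with no readings), every interval group would include at least one NA, and `sum()` returns NA for any group that contains one. A minimal sketch with a made-up data frame, assuming that is the situation:

```r
library(dplyr)

# Hypothetical sample shaped like the question's data frame: one row per
# date/interval pair, with one whole day of missing step counts.
x <- data.frame(
  steps    = c(NA, NA, 5, 10, 0, 20),
  date     = rep(c("2012-10-01", "2012-10-02", "2012-10-03"), each = 2),
  interval = rep(c(0, 5), times = 3)
)

# Without na.rm, a single NA in a group makes that group's sum NA,
# so here every interval comes back as NA.
b_na <- summarise(group_by(x, interval), h = sum(steps))

# Dropping the NAs inside sum() gives a total per interval.
b <- summarise(group_by(x, interval), h = sum(steps, na.rm = TRUE))
```

Note that with `na.rm = TRUE` a group made up entirely of NA sums to 0, which may or may not be what you want for days without readings.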
I have a dataset (call it df) that has several columns. One of those columns is the column date, which holds either NA or strings of the form "d-MON-yy" or "dd-MON-yy", depending on whether the day number is less than 10 (e.g. 9-Jan-04, 15-Oct-98).
I am trying to change this to date type values, but I only need the year. Specifically, all the dates whose yy digits are less than 20 are from this century, and all the dates whose yy digits are greater than or equal to 20 are from the 1900s. I want to end up with the full four-digit year.
Since I am only interested in the year, I don't mind a solution that returns numeric values.
In the end, I'd also like to filter out the rows that have NA in the date variable only.
I am pretty new to R, and I have tried to make it work with several answers I found here to no avail.
Thank you.
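A minimal base R sketch of one way to do this, assuming the year is always the two digits after the last hyphen. The sample data frame, the `value` column, and the `year` column name are made up for illustration:

```r
# Hypothetical sample in the format described in the question.
df <- data.frame(
  date  = c("9-Jan-04", "15-Oct-98", NA, "1-Mar-19", "30-Jun-20"),
  value = 1:5,
  stringsAsFactors = FALSE
)

# Strip everything up to and including the last hyphen, leaving "yy".
# NA dates stay NA.
yy <- as.integer(sub(".*-", "", df$date))

# yy < 20 is read as 20yy, anything else as 19yy (the rule from the question).
df$year <- ifelse(yy < 20, 2000L + yy, 1900L + yy)

# Keep only the rows where the date itself is not NA.
df_clean <- df[!is.na(df$date), ]
```

The result is a numeric year column rather than a Date, which the question says is acceptable.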
I am trying to compute a moving average that matches the desired column in the data below.
NA should stay NA; otherwise the result should be the average of three periods, except that for the first and second periods the missing values are assumed to be an extension of the existing values.
I have tried the rollmean and rollapply functions with varying inputs, but I do not get the results I want.
library(data.table)
library(zoo)
# sample data: input column and the result I am after
tempo<-data.table(original = c(NA,NA,NA,10,0,0,0,10,10,10,0,NA,NA),
desired = c(NA,NA,NA,10,5,3.3,0,3.3,6.6,10,6.6,NA,NA))
# my attempts so far
tempo[,toto:= rollmean(original,3,align="left", fill="extend")]
tempo[,toto1:= rollapply(original,3,mean,align="left", na.pad=FALSE)]
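Working backwards from the desired column, the values look like a trailing (right-aligned) mean over up to three periods that simply skips missing values: 5 is the mean of 10 and 0, and 3.3 is the mean of 10, 0 and 0. That is slightly different from the wording about extending the existing value, so treat the following as a sketch under that reading. It uses plain base R instead of zoo, and `trailing_mean` is a made-up helper name:

```r
original <- c(NA, NA, NA, 10, 0, 0, 0, 10, 10, 10, 0, NA, NA)

# Mean of the current value and up to (width - 1) previous values,
# ignoring NAs inside the window.
trailing_mean <- function(x, width = 3) {
  out <- vapply(seq_along(x), function(i) {
    window <- x[max(1, i - width + 1):i]
    if (all(is.na(window))) NA_real_ else mean(window, na.rm = TRUE)
  }, numeric(1))
  # Positions that were NA in the input stay NA in the output.
  out[is.na(x)] <- NA_real_
  out
}

result <- trailing_mean(original)
```

The output agrees with the desired column except for rounding: the desired values 3.3 and 6.6 appear to be truncated versions of 3.33 and 6.67.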
I'm importing a large dataset in R and am curious whether there's a way to quickly go through the columns and identify whether each column holds categorical values, numeric values, dates, etc. When I use str(df) or class(df), the columns mostly come back mislabeled.
For example, some columns are labeled as numeric, but there are only 10 unique values in the column (ranging from 1-10), indicating that it should really be a factor. There are other columns that only have 11 unique values representing a rating, from 0-5 in 0.5 increments. Another column has country codes (172 values), which range from 1-230.
Is there a way to quickly identify whether a column should be a factor without going through each column to understand the nature of the variable? (There are many columns in the dataset.)
Thanks!
At the moment, I've been using variations of the following code to catch the first two cases:
df[,51] <- as.numeric(df[,51]) #convert the column to numeric (assign the result, otherwise it is discarded)
len = length(unique(df[,51])) #find number of unique values
diff = max(df[,51]) - min(df[,51]) #calculate difference between min and max
ord = diff / (len - 1) # calculate the increment if equally spaced (range divided by number of gaps)
#subtract the max value from second to max value to find the actual increment (only uses last two values)
step = sort(unique(df[,51]),partial=len)[len] -
sort(unique(df[,51]),partial=len-1)[len-1]
ord == step #check if the last increment equals the implied increment
However, this approach assumes that the values of each variable are equally spaced (for example, in increments of 0.5) and only tests the gap between the last two values. It wouldn't catch a column that contains c(1,2,3.5,4.5,5,6), which has 6 unique values but uneven spacing in the middle (not that this is common in my dataset).
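One way to test every gap rather than only the last one is to take `diff()` of the sorted unique values and check that all gaps agree. A small sketch; `is_evenly_spaced` and its tolerance are made up for illustration:

```r
# TRUE if the distinct non-missing values of x are equally spaced.
is_evenly_spaced <- function(x, tol = 1e-8) {
  vals <- sort(unique(x[!is.na(x)]))
  if (length(vals) < 3) return(TRUE)   # zero or one gap is trivially even
  gaps <- diff(vals)
  # Compare with a tolerance instead of ==, since the values may be
  # floating point (e.g. 0.5 increments).
  all(abs(gaps - gaps[1]) < tol)
}

even   <- is_evenly_spaced(seq(0, 5, by = 0.5))
uneven <- is_evenly_spaced(c(1, 2, 3.5, 4.5, 5, 6))
```

Here `even` is TRUE and `uneven` is FALSE, so the c(1,2,3.5,4.5,5,6) case from the question is caught.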
It is not obvious how many distinct values would indicate a factor vs a numeric variable, but you can examine all variables to see what is in your data with
table(sapply(df, function(x) { length(unique(x))} ))
and if you decide that the boundary between factor and numeric is k you can identify the factors with
which(sapply(df, function(x) {length(unique(x)) < k}))
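As a sketch of how that could be used end to end, here is a made-up data frame with a hypothetical cutoff of k = 12. The column names and the value of k are assumptions, not something taken from the question's dataset:

```r
# Hypothetical data: two low-cardinality columns and one continuous one.
df <- data.frame(
  level   = rep(1:10, times = 5),                        # 10 distinct values
  rating  = rep(seq(0, 5, by = 0.5), length.out = 50),   # 11 distinct values
  measure = sqrt(1:50)                                   # 50 distinct values
)

# Number of distinct values per column.
n_unique <- sapply(df, function(x) length(unique(x)))

# Columns below the chosen cutoff are treated as factors.
k <- 12
factor_cols <- names(which(n_unique < k))

# Convert only those columns; the rest are left untouched.
df[factor_cols] <- lapply(df[factor_cols], factor)
```

A count-based cutoff will not catch something like the country-code column with 172 values, so columns like that would still need a higher k or a manual decision.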
I started with a daily time series of wind speeds. I want to examine whether the mean and maximum number of consecutive days under a certain threshold change between two periods of time. This is how far I've come: I subset the data to rows with values beneath the threshold and identified consecutive days.
I now have a data frame that looks like this:
dates consecutive_days
1970-03-25 NA
1970-04-09 TRUE
1970-04-10 TRUE
1970-04-11 TRUE
1970-04-12 TRUE
1970-04-15 FALSE
1970-05-08 TRUE
1970-05-09 TRUE
1970-05-13 FALSE
What I want to do next is to find the maximum and mean length of the runs of consecutive TRUE values (which in this case would be: maximum=4; mean=3).
Here is one method using rle:
# construct sample data.frame:
set.seed(1234)
df <- data.frame(days=1:12, consec=sample(c(TRUE, FALSE), 12, replace=T))
# get rle object
consec <- rle(df$consec)
# max consecutive values
max(consec$lengths[consec$values==TRUE])
# mean consecutive values
mean(consec$lengths[consec$values==TRUE])
Quoting from ?rle, rle
Compute[s] the lengths and values of runs of equal values in a vector
We save the results and then subset to consecutive TRUE observations to calculate the mean and max.
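One caveat when applying this to the data frame in the question: its first value is NA. `rle` keeps the NA as a run of its own, and `consec$values==TRUE` is then NA for that run, which puts an NA into the subset, so `max` and `mean` return NA. A small sketch of one way around this, with the question's column typed in by hand:

```r
# The consecutive_days column from the question, including its leading NA.
consecutive_days <- c(NA, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE)

runs <- rle(consecutive_days)

# `%in% TRUE` is FALSE for NA, so the NA run is dropped instead of
# propagating NA into the subset.
true_runs <- runs$lengths[runs$values %in% TRUE]

max(true_runs)   # 4
mean(true_runs)  # 3
```

These match the maximum and mean given in the question.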
You could easily combine this into a function, or simply concatenate the results above:
myResults <- c("max"=max(consec$lengths[consec$values==TRUE]),
"mean"= mean(consec$lengths[consec$values==TRUE]))
I have dataframe (return.monthly) of the form
Date Return
2001-09-1 0.0404775
2001-10-1 -0.01771575
2001-11-1 -0.03304925
etc.
i.e. monthly returns over a period of time (2 years). I would like to calculate quarterly returns, i.e. just take 3 observations at a time and calculate their sum.
I tried
library(xts)
return.quarterly <- xts(return.monthly[, -1], return.monthly[,1])
sum_fun <- function(x) sum(x) # "function" is a reserved word and cannot be used as a name
time <- 3
return.quarterly$return_q <- rollapply(return.quarterly$Return, FUN=sum_fun,
width=time, align="left", na.pad=TRUE)
Obviously this calculates returns over a rolling window, i.e. it takes observations 1-3 and calculates the sum, then 2-4 and calculates the sum, etc. What I want, however, is non-overlapping windows: 1-3, 4-6, 7-9, ...
How could I do that?
Thanks in advance, Dani
You can use apply.quarterly from xts to compute the mean over a quarter:
apply.quarterly(return.quarterly,mean) #Jan-Feb-Mar first quarter etc.
BTW: for quarterly returns, shouldn't you use the sum rather than the mean?
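As far as I know, apply.quarterly groups by calendar quarter, so for a series that starts in September the first group would hold a single month rather than observations 1-3. If what is wanted is strictly consecutive blocks of three rows, a base R sketch without xts could look like this. The first three rows are the ones shown in the question; the last three, and the `block`, `start` and `return_q` names, are made up for illustration:

```r
return.monthly <- data.frame(
  Date   = as.Date(c("2001-09-01", "2001-10-01", "2001-11-01",
                     "2001-12-01", "2002-01-01", "2002-02-01")),
  Return = c(0.0404775, -0.01771575, -0.03304925, 0.01, 0.02, -0.005)
)

# Block index: rows 1-3 -> 1, rows 4-6 -> 2, and so on.
block <- ceiling(seq_len(nrow(return.monthly)) / 3)

# Sum the returns within each block and label it by its first month.
return.quarterly <- data.frame(
  start    = return.monthly$Date[!duplicated(block)],
  return_q = as.numeric(tapply(return.monthly$Return, block, sum))
)
```

If the number of rows is not a multiple of three, the last block is simply summed over fewer months, which may need separate handling.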