How to order data based on repeated dates in R

Suppose that the following dataset is available:
     date        V1        V2
[1,] 1996-01-01  1.4628995 12
[2,] 1996-01-01  0.1972603 11
..............
[3,] 1996-02-01  0.1205479 11
[4,] 1996-02-01  0.9643836  9
..............
[5,] 1996-03-01  0.1972603 14
[6,] 1996-03-01  0.1205479  8
How can I sort V1 in ascending order within each specific date, with the remaining variables (V2, V3, and so on) following the ordering? Like this:
     date        V1        V2
[1,] 1996-01-01  0.1972603 11
[2,] 1996-01-01  1.4628995 12
..............
[3,] 1996-02-01  0.1205479 11
[4,] 1996-02-01  0.9643836  9
..............
[5,] 1996-03-01  0.1205479  8
[6,] 1996-03-01  0.1972603 14
Thank you.

To sort by date and then by V1...
data <- data[order(as.Date(data$date),data$V1),]
In response to a follow-up question in the comments, the rows with the two smallest values of V1 for each date can easily be selected using dplyr...
library(dplyr)
data2 <- data %>% group_by(date) %>% filter(rank(V1,ties.method = "min")<3)
Or, rather less intuitively, using base-R...
data2 <- data[as.logical(ave(data$V1,data$date,FUN=function(v) rank(v,ties.method = "min")<3)),]
You might need to fiddle with the parameters of rank to adjust the treatment of NA and the way it handles ties. See ?rank
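As a self-contained sketch (the small data frame below just mimics the question's example; the column types are assumptions):
data <- data.frame(
  date = rep(c("1996-01-01", "1996-02-01", "1996-03-01"), each = 2),
  V1 = c(1.4628995, 0.1972603, 0.1205479, 0.9643836, 0.1972603, 0.1205479),
  V2 = c(12, 11, 11, 9, 14, 8)
)
data[order(as.Date(data$date), data$V1), ] # date first, then V1 within each date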

Related

How to extract the values from a raster in R

I want to use R to extract values from a raster. Basically, my raster has values from 0-6 and I want to extract the corresponding value for every single pixel, so that at the end I have a data table containing those two variables.
Thank you for your help; I hope my explanation is precise enough.
Example data
library(raster)
r <- raster(ncol=5, nrow=5, vals=1:25)
To get all values, you can do
values(r)
# [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
#as.matrix(r)
#     [,1] [,2] [,3] [,4] [,5]
#[1,]    1    2    3    4    5
#[2,]    6    7    8    9   10
#[3,]   11   12   13   14   15
#[4,]   16   17   18   19   20
#[5,]   21   22   23   24   25
Also see ?getValues
You can also use indexing
r[2,2]
#7
r[7:8]
#[1] 7 8
For more complex extractions using points, lines or polygons, see ?extract
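For instance, a minimal point-extraction sketch using the example raster above (the coordinates are made up; the default raster spans longitudes -180 to 180 and latitudes -90 to 90):
pts <- cbind(x = c(-90, 0, 90), y = c(45, 0, -45)) # hypothetical lon/lat points
extract(r, pts) # returns the cell value under each point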
x is the raster object you are trying to extract values from; y may be a SpatialPoints, SpatialPolygons, SpatialLines, or Extent object, or a vector representing cell numbers (take a look at ?extract). Your code values_raster <- extract(x = values, df=TRUE) will not work because you're not feeding the function any y object/vector.
You could try to build a vector with all the cell numbers of your raster. Imagine your raster has 200 cells. If you do values_raster <- extract(x = values, y = seq(1, 200, 1), df = TRUE), you'll get a data frame with values for each cell.
How about simply doing
as.data.frame(s, xy=TRUE) # s is your raster file
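Applied to the example raster above, that gives one row per pixel (a sketch):
df <- as.data.frame(r, xy = TRUE) # columns: x, y, and the cell values
head(df)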

Using rollapply() to find modal value

I've got panel data and have been playing around with k-means clustering. So now I've got a panel of factor values that are mostly stable but I'd like to smooth that out a bit more so that (for example) the data says "Wyoming was in group 1 in earlier years, moved into group 2, then moved into group 5" rather than "Wyoming was in group 1,1,1,2,3,2,2,5,5,5".
So the approach I'm taking is to use rollapply() to calculate the modal value. Below is code that works to calculate the mode ("Mode()"), and a wrapper for that ("ModeR()") that (perhaps clumsily) resolves the problem of multi-modal windows by randomly picking a mode. All that is fine, but when I put it into rollapply() I'm getting problems.
library(dplyr) # for arrange() and filter()
library(zoo)   # for rollapply()

Mode <- function(vect){ # take a vector as input
  temp <- as.data.frame(table(vect))
  temp <- arrange(temp, desc(Freq)) # from dplyr
  max.f <- temp[1,2]
  temp <- filter(temp, Freq == max.f) # cut out anything that isn't modal
  return(temp[,1])
}
ModeR <- function(vect){
  out <- Mode(vect)
  return(out[round(runif(1, min = 0.5000001, max = length(out) + 0.499999999))])
}
temp <- round(runif(20, min = 1, max = 10)) # A vector to test this out on.
cbind(temp, rollapply(data = temp, width = 5, FUN = ModeR, fill = NA, align = "right"))
which returned:
      temp
 [1,]    5 NA
 [2,]    6 NA
 [3,]    5 NA
 [4,]    5 NA
 [5,]    7  1
 [6,]    6  1
 [7,]    5  1
 [8,]    5  1
 [9,]    3  2
[10,]    1  3
[11,]    5  3
[12,]    7  3
[13,]    5  3
[14,]    4  3
[15,]    3  3
[16,]    4  2
[17,]    8  2
[18,]    5  2
[19,]    6  3
[20,]    6  3
Compare that with:
> ModeR(temp[1:5])
[1] 5
Levels: 5 6 7
> ModeR(temp[2:6])
[1] 6
Levels: 5 6 7
So it seems like the problem is in how ModeR is being applied in rollapply(). Any ideas?
Thanks!
Rick
Thanks to /u/murgs! His comment pointed me in the right direction (in addition to helping me streamline ModeR() using sample()).
ModeR() as written above returns a factor (as does Mode()). I need it to be a number. I can fix this by updating my code as follows:
Mode <- function(vect){ # take a vector as input
  temp <- as.data.frame(table(vect))
  temp <- arrange(temp, desc(Freq))
  max.f <- temp[1,2]
  temp <- filter(temp, Freq == max.f) # cut out anything that isn't modal
  return(as.numeric(as.character(temp[,1]))) # HERE'S THE BIG CHANGE
}
ModeR <- function(vect){
  out <- Mode(vect)
  return(out[sample(1:length(out), 1)]) # HERE'S SOME IMPROVED CODE!
}
Now rollapply() does what I expected it to do! As for that as.character() bit: calling as.numeric() directly on a factor returns the underlying level codes rather than the original values, so the factor has to be converted to character first.
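For reference, the same thing can be done without dplyr (a sketch; ModeR2 is a hypothetical name):
ModeR2 <- function(vect){
  tab <- table(vect)
  modes <- as.numeric(names(tab)[tab == max(tab)]) # all modal values, as numbers
  modes[sample(length(modes), 1)] # pick one at random when there are ties
}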

Reduce a large dataset into a smaller dataset using R

I want to reduce a very large dataset with two variables into a smaller file. I need to find runs of data points with the same value, keep only the starting and ending points of each run, and remove all the data points in between. For example,
the sample dataset looks like the following:
363.54167 23.3699
363.58333 23.3699
363.625 0
363.66667 0
363.70833 126.16542
363.75 126.16542
363.79167 126.16542
363.83333 126.16542
363.875 126.16542
363.91667 0
363.95833 0
364 0
364.04167 0
364.08333 0
364.125 0
364.16667 0
364.20833 0
364.25 127.79872
364.29167 127.79872
364.33333 127.79872
364.375 127.79872
364.41667 127.79872
364.45833 127.79872
364.5 0
364.54167 0
364.58333 0
364.625 0
364.66667 0
364.70833 127.43202
364.75 135.44052
364.79167 135.25522
364.83333 135.12892
364.875 20.32986
364.91667 0
364.95833 0
Here, the first two points have the same value, i.e. 23.3699, so I will keep them as they are. I need to write a condition: if two or more consecutive data points have the same value, then keep only the starting and ending data points. The next two values also have the same value, i.e. 0, and I will keep those two. However, after that there are 5 data points with the same value. I need to write a program that keeps just two data points, i.e. 363.70833 & 363.875, and removes the data points in between them. After that I will keep only the two data points with zero values, i.e. 363.91667 and 364.20833.
The sample output I am looking for is as follows:
363.54167 23.3699
363.58333 23.3699
363.625 0
363.66667 0
363.70833 126.16542
363.875 126.16542
363.91667 0
364.20833 0
364.25 127.79872
364.45833 127.79872
364.5 0
364.66667 0
364.70833 127.43202
364.75 135.44052
364.79167 135.25522
364.83333 135.12892
364.875 20.32986
364.91667 0
364.95833 0
If your data is in a dataframe DF with column names a and b, then
runs <- rle(DF$b)
firsts <- cumsum(c(0, runs$lengths[-length(runs$lengths)])) + 1
lasts <- cumsum(runs$lengths)
edges <- unique(sort(c(firsts, lasts)))
DF[edges,]
gives
> DF[edges,]
a b
1 363.5417 23.36990
2 363.5833 23.36990
3 363.6250 0.00000
4 363.6667 0.00000
5 363.7083 126.16542
9 363.8750 126.16542
10 363.9167 0.00000
17 364.2083 0.00000
18 364.2500 127.79872
23 364.4583 127.79872
24 364.5000 0.00000
28 364.6667 0.00000
29 364.7083 127.43202
30 364.7500 135.44052
31 364.7917 135.25522
32 364.8333 135.12892
33 364.8750 20.32986
34 364.9167 0.00000
35 364.9583 0.00000
rle gives the lengths of the groups that have the same value (floating point precision may be an issue if you have more decimal places). firsts and lasts give the row index of the first row of a group and the last row of a group, respectively. Put the indexes together, sort them, and get rid of duplicates (since a group of size one will list the same row as the first and last) and then index DF by the row numbers.
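The same logic can be wrapped in a small helper (a sketch; collapse_runs is a hypothetical name, and the column to collapse on is passed as a string):
collapse_runs <- function(DF, col = "b") {
  # if floating-point noise is a concern, round first, e.g. rle(round(DF[[col]], 8))
  runs <- rle(DF[[col]])
  lasts <- cumsum(runs$lengths)
  firsts <- lasts - runs$lengths + 1 # first row of each run
  DF[unique(sort(c(firsts, lasts))), ]
}
collapse_runs(DF, "b") # reproduces DF[edges,] from above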
I'd use rle here (no surprise to those who know me :-) . Keeping in mind that you will want to check for approximate equality to avoid floating-point rounding problems, here's the concept. rle will return two sequences, one of which tells you how many times a value is repeated and the other tells you the value itself. Since you want to keep only single or double values, we'll essentially "shrink" all sequence values which are longer.
Edit: I recognize that this is relatively clunky code and a gentle touch with melt/cast should be far more efficient. I just liked doing this.
df <- cbind(1:20, sample(1:3, 20, replace = TRUE))
rdf <- rle(df[,2])
lenfoo <- rdf$lengths
cfoo <- cumsum(lenfoo)
repfoo <- ifelse(lenfoo == 1, 1, 2)
outfoo <- matrix(ncol = 2)
for(j in 1:length(cfoo)) outfoo <- rbind(outfoo, matrix(rep(df[cfoo[j],], times = repfoo[j]), ncol = 2, byrow = TRUE))
Rgames> df
[,1] [,2]
[1,] 1 2
[2,] 2 2
[3,] 3 3
[4,] 4 3
[5,] 5 3
[6,] 6 3
[7,] 7 3
[8,] 8 2
[9,] 9 2
[10,] 10 3
[11,] 11 1
[12,] 12 2
[13,] 13 2
[14,] 14 3
[15,] 15 1
[16,] 16 2
[17,] 17 1
[18,] 18 2
[19,] 19 3
[20,] 20 1
Rgames> outfoo
[,1] [,2]
[1,] NA NA
[2,] 2 2
[3,] 2 2
[4,] 7 3
[5,] 7 3
[6,] 9 2
[7,] 9 2
[8,] 10 3
[9,] 11 1
[10,] 13 2
[11,] 13 2
[12,] 14 3
[13,] 15 1
[14,] 16 2
[15,] 17 1
[16,] 18 2
[17,] 19 3
[18,] 20 1
x = tapply(df[[1]], df[[2]], range)
gives the values
cbind(unlist(x, use.names=FALSE), as.numeric(rep(names(x), each=2)))
gets a matrix. More explicitly, and avoiding coercion to / from character vectors
u = unique(df[[2]])
rng = sapply(split(df[[1]], match(df[[2]], u)), range)
cbind(as.vector(rng), rep(u, each=2))
If the data is very large then sort by df[[1]] and find the first (min) and last (max) values of each element of df[[2]]; combine these
df = df[order(df[[1]]),]
res = rbind(df[!duplicated(df[[2]]),], df[!duplicated(df[[2]], fromLast=TRUE),])
res[order(res[[2]]),]
perhaps setting the row names of the subset to NULL.
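Putting the duplicated() approach together on a toy frame with columns a and b (a sketch; note that every value of b here forms a single run of length 2 or more):
df <- data.frame(a = 1:7, b = c(0, 0, 5, 5, 5, 9, 9))
df <- df[order(df[[1]]), ]
res <- rbind(df[!duplicated(df[[2]]), ], df[!duplicated(df[[2]], fromLast = TRUE), ])
res <- res[order(res[[2]]), ]
rownames(res) <- NULL # setting the row names of the subset to NULL, as suggested
res # first and last row for each value of b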

How to bootstrap respecting within-subject information?

This is the first time I post to this forum, and I want to say from the start I am not a skilled programmer. So please let me know if the question or code were unclear!
I am trying to get the 95% confidence interval (CI) for an interaction (that is my test statistic) by doing bootstrapping. I am using the package "boot". My problem is that for every resample, I would like the randomization to be done within subjects, so that observations from different subjects are not mixed. Here is the code to generate a dataframe similar to mine. As you can see, I have two within-subjects factors ("Num" and "Gram") and I am interested in the interaction between both:
Subject = rep(c("S1","S2","S3","S4"),4)
Num = rep(c("singular","plural"),8)
Gram = rep(c("gram","gram","ungram","ungram"),4)
RT = c(657,775,678,895,887,235,645,916,930,768,890,1016,590,978,450,920)
data = data.frame(Subject,Num,Gram,RT)
This is the code I used to get the empirical interaction value:
summary(lm(RT ~ Num*Gram, data=data))
As you can see, the interaction between my two factors is -348. I want to get a bootstrap confidence interval for this statistic, which I can generate using the "boot" package:
# You need the following packages
install.packages("car")
install.packages("MASS")
install.packages("boot")
library("car")
library("MASS")
library("boot")
# Function to create the statistic to be bootstrapped
boot.huber <- function(data, indices) {
  data <- data[indices, ] # select obs. in bootstrap sample
  mod <- lm(RT ~ Num*Gram, data = data)
  coefficients(mod) # return coefficient vector
}
#Generate bootstrap estimate
data.boot <- boot(data, boot.huber, 1999)
#Get confidence interval
boot.ci(data.boot, index=4, type=c("norm", "perc", "bca"),conf=0.95) #4 gets the CI for the interaction
My problem is that I think the resamples should be generated without mixing the individual subjects' observations: that is, to generate the new resamples, the observations from subject 1 (S1) should be shuffled within subject 1, not mixed with the observations from subject 2, etc. I don't know how "boot" does the resampling (I read the documentation but don't understand how the function does it).
Does anyone know how I could make sure that the resampling procedure used by "boot" respects subject level information?
Thanks a lot for your help/advice!
Just modify your call to boot() like this:
data.boot <- boot(data, boot.huber, 1999, strata=data$Subject)
?boot provides this description of the strata= argument, which does exactly what you are asking for:
strata: An integer vector or factor specifying the strata for
        multi-sample problems. This may be specified for any
        simulation, but is ignored when ‘sim = "parametric"’. When
        ‘strata’ is supplied for a nonparametric bootstrap, the
        simulations are done within the specified strata.
Additional note:
To confirm that it's working as you'd like, you can call debugonce(boot), run the call above, and step through the debugger until the object i (whose rows contain the indices used to resample rows of data to create each bootstrap resample) has been assigned, and then have a look at it.
debugonce(boot)
data.boot <- boot(data, boot.huber, 1999, strata=data$Subject)
# Browse[2]>
## [Press return 34 times]
# Browse[2]> head(i)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
# [1,] 9 10 11 16 9 14 15 16 9 2 15 16 1 10
# [2,] 9 14 7 12 5 6 15 4 13 6 11 16 13 6
# [3,] 5 10 15 16 9 6 3 4 1 2 15 12 5 6
# [4,] 5 10 11 4 9 6 15 16 9 14 11 16 5 2
# [5,] 5 10 3 4 1 10 15 16 9 6 3 8 13 14
# [6,] 13 10 3 12 5 10 3 4 5 14 7 16 5 14
# [,15] [,16]
# [1,] 7 8
# [2,] 11 16
# [3,] 3 16
# [4,] 3 8
# [5,] 7 8
# [6,] 7 12
(You can enter Q to leave the debugger at any time.)
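While you're in the debugger, a quick sanity check (a sketch assuming the 16-row data frame from the question) is to confirm that every resample draws exactly four rows from each subject:
# Browse[2]>
blocks <- as.integer(factor(data$Subject)) # which subject each original row belongs to
table(blocks[i[1, ]]) # with strata, each subject should be drawn exactly 4 times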

fill gaps in a timeseries with averages

I have a dataframe like so:
day        sum_flux samples mean
2005-10-26 0.02     48      0.02
2005-10-27 0.12     12      0.50
It's a series of daily readings spanning 5 years; however, some of the days are missing. I want to fill those days with the average of that month from the other years.
I.e. if 2005-10-26 were missing, I'd want to use the average of all Octobers in the data set.
If all of October were missing, I'd want to apply this average to each missing day.
I think I need to build a function (possibly using plyr) to evaluate the days. However, I'm very inexperienced with the various timeseries objects in R and with conditionally subsetting data, and would like some advice, especially regarding which type of timeseries object I should be using.
Many Thanks
Some sample data. I'm assuming that sum_flux is the column that has the missing values you want to fill in.
library(lubridate)
# plain seq() dispatches on the Date objects that ymd() returns
days <- seq(ymd("2005-10-26"), ymd("2010-10-26"), by = "1 day")
n_days <- length(days)
readings <- data.frame(
  day = days,
  sum_flux = runif(n_days),
  samples = sample(100, n_days, replace = TRUE),
  mean = runif(n_days)
)
readings$sum_flux[sample(n_days, floor(n_days / 10))] <- NA
Add a month column.
readings$month <- month(readings$day, label = TRUE)
Use tapply to get the monthly mean flux.
monthly_avg_flux <- with(readings, tapply(sum_flux, month, mean, na.rm = TRUE))
Use this value whenever the flux is missing, or keep the flux if not.
readings$sum_flux2 <- with(readings, ifelse(
  is.na(sum_flux),
  monthly_avg_flux[month],
  sum_flux
))
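A quick sanity check (a sketch; with this simulated data every month has at least one non-missing value, so no gaps should remain):
sum(is.na(readings$sum_flux2)) # should print 0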
This is one (very fast) way in data.table.
Using the nice example data from Richie:
require(data.table)
days <- seq(as.IDate("2005-10-26"), as.IDate("2010-10-26"), by = "1 day")
n_days <- length(days)
readings <- data.table(
  day = days,
  sum_flux = runif(n_days),
  samples = sample(100, n_days, replace = TRUE),
  mean = runif(n_days)
)
readings$sum_flux[sample(n_days, floor(n_days / 10))] <- NA
readings
day sum_flux samples mean
[1,] 2005-10-26 0.32838686 94 0.09647325
[2,] 2005-10-27 0.14686591 88 0.48728321
[3,] 2005-10-28 0.25800913 51 0.72776002
[4,] 2005-10-29 0.09628937 81 0.80954124
[5,] 2005-10-30 0.70721591 23 0.60165240
[6,] 2005-10-31 0.59555079 2 0.96849533
[7,] 2005-11-01 NA 42 0.37566491
[8,] 2005-11-02 0.01649860 89 0.48866220
[9,] 2005-11-03 0.46802818 49 0.28920807
[10,] 2005-11-04 0.13024856 30 0.29051080
First 10 rows of 1827 printed.
Create the average for each month, in order of first appearance of each group:
> avg = readings[,mean(sum_flux,na.rm=TRUE),by=list(mnth = month(day))]
> avg
mnth V1
[1,] 10 0.4915999
[2,] 11 0.5107873
[3,] 12 0.4451787
[4,] 1 0.4966040
[5,] 2 0.4972244
[6,] 3 0.4952821
[7,] 4 0.5106539
[8,] 5 0.4717122
[9,] 6 0.5110490
[10,] 7 0.4507383
[11,] 8 0.4680827
[12,] 9 0.5150618
Next, reorder avg to start in January:
avg = avg[order(mnth)]
avg
mnth V1
[1,] 1 0.4966040
[2,] 2 0.4972244
[3,] 3 0.4952821
[4,] 4 0.5106539
[5,] 5 0.4717122
[6,] 6 0.5110490
[7,] 7 0.4507383
[8,] 8 0.4680827
[9,] 9 0.5150618
[10,] 10 0.4915999
[11,] 11 0.5107873
[12,] 12 0.4451787
Now update by reference (:=) the sum_flux column, where sum_flux is NA, with the value from avg for that month.
readings[is.na(sum_flux), sum_flux:=avg$V1[month(day)]]
day sum_flux samples mean
[1,] 2005-10-26 0.32838686 94 0.09647325
[2,] 2005-10-27 0.14686591 88 0.48728321
[3,] 2005-10-28 0.25800913 51 0.72776002
[4,] 2005-10-29 0.09628937 81 0.80954124
[5,] 2005-10-30 0.70721591 23 0.60165240
[6,] 2005-10-31 0.59555079 2 0.96849533
[7,] 2005-11-01 0.51078729** 42 0.37566491 # ** updated with the Nov avg
[8,] 2005-11-02 0.01649860 89 0.48866220
[9,] 2005-11-03 0.46802818 49 0.28920807
[10,] 2005-11-04 0.13024856 30 0.29051080
First 10 rows of 1827 printed.
Done.
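For what it's worth, the same fill can also be written as a single grouped update, skipping the intermediate avg table (a sketch, not from the original answer):
readings[, sum_flux := ifelse(is.na(sum_flux), mean(sum_flux, na.rm = TRUE), sum_flux), by = month(day)]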
