Error from rollingmedian using columns - r

I have a data frame to which I am trying to add a column containing the median of the current and the previous 2 values.
Date Value
21/07/2016 14.8
22/07/2016 14.9
23/07/2016 15.8
24/07/2016 15.0
25/07/2016 15.7
26/07/2016 15.6
27/07/2016 16.1
28/07/2016 16.1
I used the following code:
library(zoo)
dataframe$medianval <-rollmedian(dataframe$Value,k=3)
I get the following error:
> Error: k <= n is not TRUE
Any suggestions?

Think about what R is trying to do here. The data frame has 8 rows, but the vector you want to append has only 6 elements. To which rows should those elements align? What should R put in the other two spots?
library(zoo)
dataframe <- read.table(text="Date Value
21/07/2016 14.8
22/07/2016 14.9
23/07/2016 15.8
24/07/2016 15.0
25/07/2016 15.7
26/07/2016 15.6
27/07/2016 16.1
28/07/2016 16.1", header=TRUE)
rollmedian(dataframe$Value,k=3)
# [1] 14.9 15.0 15.7 15.6 15.7 16.1
nrow(dataframe) # [1] 8
length(rollmedian(dataframe$Value,k=3)) # [1] 6
Because I can guess what you meant (correct me if I'm wrong), I would try:
dataframe$medianval <- c(NA, NA, rollmedian(dataframe$Value,k=3))
dataframe
# Date Value medianval
# 1 21/07/2016 14.8 NA
# 2 22/07/2016 14.9 NA
# 3 23/07/2016 15.8 14.9
# 4 24/07/2016 15.0 15.0
# 5 25/07/2016 15.7 15.7
# 6 26/07/2016 15.6 15.6
# 7 27/07/2016 16.1 15.7
# 8 28/07/2016 16.1 16.1
If you want to be able to adapt this conveniently, you should write a simple function:
med.fun <- function(var, data, k){
# Note: variable name must be in quotes
return(c(rep(NA, k-1), with(data, rollmedian(get(var), k=k))))
}
med.fun("Value", dataframe, 5)
# [1] NA NA NA NA 15.0 15.6 15.7 15.7
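To make the alignment explicit, here is a minimal base-R sketch of the same trailing-window median, assuming the window should end at the current row. The name trailing_median is made up for this illustration and is not part of zoo. zoo's rollmedian also takes fill and align arguments that address the same padding question; see its help page.

```r
# Hypothetical helper: median of the current and the previous k-1 values.
# Rows that do not yet have k values behind them are left as NA.
trailing_median <- function(x, k) {
  n <- length(x)
  out <- rep(NA_real_, n)
  if (k > n) return(out)          # window longer than the data: all NA
  for (i in k:n) {
    out[i] <- median(x[(i - k + 1):i])
  }
  out
}

value <- c(14.8, 14.9, 15.8, 15.0, 15.7, 15.6, 16.1, 16.1)
medianval <- trailing_median(value, 3)
medianval
# [1]   NA   NA 14.9 15.0 15.7 15.6 15.7 16.1
```

The result has the same length as the input, so it can be assigned directly as a new column.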

Related

Multiplying matrices of different sizes

Good evening,
I am working in RStudio.
I have a problem multiplying two matrices of different sizes. What makes it harder is that the values in the row with d2$ID == 1 must multiply only the rows of w where w$sample == 1.
sample and ID both identify the same sample.
In other words, every single value in the row d2$ID == 1 ("L1", "ST", "GR", "CB", "HSK", "DDM") has to multiply the whole subset w$sample == 1 (4 rows in this case, but not always), that is, all of the values "G2", "G4", "G6", "G8", "G12".
>d2
ID L1 ST GR CB HSK DDM
1 1 0.1662000 0.2337000 0.3637000 0.11110000 0.10100000 0.024300000
2 2 0.1896576 0.2280830 0.3705740 0.09406879 0.09319434 0.024422281
3 3 0.1110259 0.2217769 0.4180797 0.11122498 0.10902635 0.028866094
4 4 0.1558785 0.2008862 0.4222565 0.09805538 0.10218119 0.020742172
5 5 0.1536421 0.1674096 0.4205395 0.14362176 0.08635519 0.028431849
6 6 0.1841964 0.1514189 0.4603306 0.10243621 0.08928011 0.012337688
> w
sample G2 G4 G6 G8 G12
1 1 10.9 15.9 21.4 28.0 37.8
2 1 11.5 16.6 22.2 29.5 38.3
3 1 10.3 15.1 20.7 28.3 36.7
4 1 11.7 18.1 24.8 31.2 39.5
5 2 11.0 16.8 22.4 30.6 38.0
6 2 10.1 15.9 22.5 30.2 36.7
7 2 12.8 17.8 22.8 28.7 37.1
8 2 11.8 16.3 20.8 27.3 34.7
9 2 11.9 16.7 21.6 28.3 34.6
10 3 12.0 18.1 24.2 30.9 40.0
11 3 12.2 17.7 24.2 31.7 40.5
12 4 11.1 16.5 22.7 31.0 39.2
13 4 12.5 19.8 27.4 32.8 38.8
14 4 12.4 19.2 25.8 33.0 39.9
15 4 12.4 19.2 26.2 33.4 38.9
16 4 13.4 18.3 23.7 30.0 38.2
17 5 13.3 18.6 24.0 30.7 38.4
18 5 13.3 18.1 22.9 30.1 36.8
19 5 13.7 19.9 26.5 33.8 43.0
20 5 12.7 18.2 24.6 32.5 41.3
21 6 12.1 17.5 24.3 33.7 42.2
22 6 14.5 20.8 28.4 35.3 43.7
I have already checked a lot of questions but I can't figure it out, mainly because most of the information is about matrices of the same size.
I tried filtering the data from d2, but the data set is really big, so that is very inefficient.
I am a beginner; if you think this is easy, I would appreciate at least a hint, please!
I have several data sets like these...
Thanks in advance!
This seems to perform as requested:
res <- apply(w, 1, function(x){ unclass(
outer(as.matrix( x[-1] ),
# match() picks the row of d2 whose ID equals this row's sample value
as.matrix( d2[match(x[1], d2$ID), c( "L1", "ST", "GR", "CB", "HSK", "DDM")])))})
str(res)
# result
# num [1:30, 1:22] 1.81 2.64 3.56 4.65 6.28 ...
# - attr(*, "dimnames")=List of 2
# ..$ : NULL
# ..$ : chr [1:22] "1" "2" "3" "4" ...
I almost got it right on the first pass but after some debugging found that I needed to add the as.matrix call to both arguments inside outer (so to speak ;-). To explain my logic ... I wanted to run down each row of w with apply and then use match on the value of the first column (of each row of w) to the unique row of d2. The match function is designed for just this purpose, to return a suitable number to be used for indexing. Then with the rest of the row (x[-1] by the time it was passed through the function call), I would use outer on the row values crossed with the desired row and columns of d2. If you do it without the as.matrix calls you get an error message:
Error in tcrossprod(x, y) :
requires numeric/complex matrix/vector arguments
I don't think that's a very informative error message. Both of the arguments were numeric vectors.
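The same idea can be tried on a toy example. This is only a sketch under assumptions: the two small data frames are cut-down stand-ins for d2 and w with made-up numbers, and unlist() is used instead of as.matrix() to turn the one-row data frame into a plain numeric vector before it goes into outer.

```r
# Toy stand-ins for d2 and w (values invented for illustration only)
d2 <- data.frame(ID = c(1, 2), L1 = c(0.5, 0.1), ST = c(2, 10))
w  <- data.frame(sample = c(1, 1, 2), G2 = c(10, 20, 30), G4 = c(1, 2, 3))

coef_cols <- c("L1", "ST")

res <- apply(w, 1, function(x) {
  # x[1] is the sample value; match() finds the d2 row with the same ID
  row_in_d2 <- match(x[1], d2$ID)
  # every G value of this row times every coefficient of the matching d2 row
  as.vector(outer(x[-1], unlist(d2[row_in_d2, coef_cols])))
})

dim(res)   # one column per row of w, one row per (G, coefficient) pair
res[, 1]   # products for the first row of w, using the coefficients of ID 1
```

Each column of res belongs to one row of w, so rows from sample 2 are only ever multiplied by the coefficients of ID 2.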

Efficiently find groups of rows in a data frame with almost identical values

I have a data set of about a thousand rows, but it will grow.
I want to find groups of rows where a certain number of attributes are almost identical.
I can do a stupid brute-force search, but it's really slow and I'm sure R can do this much better.
Test data:
df = read.table(text="
Date Time F1 F2 F3 F4 F5 Conc
2010-06-23 04:00:00 0.1 17.6 14.2 19.5 18.6 16.1
2010-12-16 00:20:00 0.0 28.7 13.7 15.6 16.2 14.4
2010-12-16 10:30:00 0.0 17.0 14.0 19.0 18.0 16.0
2010-12-16 22:15:00 0.0 25.5 12.6 14.9 16.6 16.8
", header=T)
#Initialize column to hold clustering info
df$cluster = NA
#Maximum tolerance for a match between rows
toler=1.0
#Brute force search, very ugly.
for(i in 1:(nrow(df)-1)){
if(is.na(df$cluster[i])){
df$cluster[i] <- i
for(j in (i+1):nrow(df)){
if(max(abs(df[i,3:7] - df[j,3:7]))<toler){
df$cluster[j]<-i
}
}
}
}
if(is.na(df$cluster[j])){df$cluster[j] <- j}
df
Expected output:
Date Time F1 F2 F3 F4 F5 Conc cluster
1 2010-06-23 04:00:00 0.1 17.6 14.2 19.5 18.6 16.1 1
2 2010-12-16 00:20:00 0.0 28.7 13.7 15.6 16.2 14.4 2
3 2010-12-16 10:30:00 0.0 17.0 14.0 19.0 18.0 16.0 1
4 2010-12-16 22:15:00 0.0 25.5 12.6 14.9 16.6 16.8 4
Here is an option using data.table:
# cols, fcols, rng and toler are defined in the setup under "data:" below
df[, (cols) := c(.SD - toler, .SD + toler), .SDcols=fcols]
onstr <- c(paste0(fcols,">",cols[rng]),paste0(fcols,"<",cols[5+rng]))
df[, cluster := df[df, on=onstr, by=.EACHI, min(x.rn)]$V1]
output:
Date Time F1 F2 F3 F4 F5 Conc rn lower1 lower2 lower3 lower4 lower5 upper1 upper2 upper3 upper4 upper5 cluster
1: 2010-06-23 04:00:00 0.1 17.6 14.2 19.5 18.6 16.1 1 -0.9 16.6 13.2 18.5 17.6 1.1 18.6 15.2 20.5 19.6 1
2: 2010-12-16 00:20:00 0.0 28.7 13.7 15.6 16.2 14.4 2 -1.0 27.7 12.7 14.6 15.2 1.0 29.7 14.7 16.6 17.2 2
3: 2010-12-16 10:30:00 0.0 17.0 14.0 19.0 18.0 16.0 3 -1.0 16.0 13.0 18.0 17.0 1.0 18.0 15.0 20.0 19.0 1
4: 2010-12-16 22:15:00 0.0 25.5 12.6 14.9 16.6 16.8 4 -1.0 24.5 11.6 13.9 15.6 1.0 26.5 13.6 15.9 17.6 4
data:
df = read.table(text="
Date Time F1 F2 F3 F4 F5 Conc
2010-06-23 04:00:00 0.1 17.6 14.2 19.5 18.6 16.1
2010-12-16 00:20:00 0.0 28.7 13.7 15.6 16.2 14.4
2010-12-16 10:30:00 0.0 17.0 14.0 19.0 18.0 16.0
2010-12-16 22:15:00 0.0 25.5 12.6 14.9 16.6 16.8
", header=T)
library(data.table)
setDT(df)[, rn := .I]
toler <- 1.0
rng <- 1L:5L
fcols <- paste0("F", rng)
cols <- do.call(paste0, CJ(c("lower", "upper"), rng))
Explanation:
The L suffix after an integer literal tells R to store it as type integer (see Why would R use the "L" suffix to denote an integer?).
df[df, on=onstr, by=.EACHI, min(x.rn)] performs a non-equi self-join using the inequalities specified in onstr.
$V1 accesses the V1 column of the join result (columns are named V* by default when no name is given).
df[, cluster := my_result] updates the original data.table by reference, which is faster when the data set is large because no deep copy of the original data.frame is needed, as it would be in base R.
Since we are performing a join, there is a left table and a right table. In data.table lingo, for x[i, on=join_keys], x is the right table and i is the left table (the inspiration comes from base R syntax). Hence the x. prefix refers to columns of the right table (x), much like a table qualifier in SQL, and the i. prefix refers to columns of the left table (i). More details can be found in the data.table vignettes (see https://cran.r-project.org/web/packages/data.table/).
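For a data set of this size the tolerance test can also be written in base R, which may help when checking the join result. This is a sketch, not a replacement for the join: it assumes that "almost identical" means the largest absolute difference over F1 to F5 is below toler, and it builds a full pairwise distance matrix, so memory grows with the square of the number of rows.

```r
# Only the F columns from the test data are needed for the comparison
df <- data.frame(
  F1 = c(0.1, 0.0, 0.0, 0.0),
  F2 = c(17.6, 28.7, 17.0, 25.5),
  F3 = c(14.2, 13.7, 14.0, 12.6),
  F4 = c(19.5, 15.6, 19.0, 14.9),
  F5 = c(18.6, 16.2, 18.0, 16.6)
)
toler <- 1.0

# method = "maximum" gives the largest absolute difference between two rows
d <- as.matrix(dist(df, method = "maximum"))

# Label each row with the smallest row number that lies within the tolerance
# (a row always matches itself, so every row gets a label)
df$cluster <- unname(apply(d < toler, 1, function(hit) min(which(hit))))
df$cluster
# [1] 1 2 1 4
```

Like the join, this labels each row with the lowest-numbered row it matches; it does not merge chains of rows that are only pairwise close.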

How to use a conditional (if or while) to replace values in R?

I am working with a data set that has information about electricity prices. The price is recorded every 5 minutes. My objective is to replace the negative values with the mean of the day.
year month day fivemin rrp_nsw rrp_qld rrp_sa rrp_tas rrp_vic
2009 7 1 1 16.9 17.6 16.7 15.7 15.5
2009 7 1 2 17.7 18.8 17.8 -16.1 15.5
2009 7 1 3 -17.7 18.6 18.1 15.9 15.4
2009 7 1 4 16.7 18.6 -17.6 14.3 12.8
2009 7 2 1 -15.6 17.6 16.3 13.2 11.8
2009 7 2 2 13.7 15.7 12.0 -11.1 -12.9
2009 7 2 3 13.7 15.8 11.9 11.1 12.9
2009 7 2 4 -13.9 16.1 -12.1 11.2 12.9
2009 8 1 1 13.8 16.0 12.2 11.2 12.8
2009 8 1 2 -13.7 16.3 11.6 10.6 12.6
2009 8 1 3 13.7 -15.8 11.9 11.0 12.7
2009 8 1 4 13.8 16.0 12.1 11.2 12.9
2009 8 2 1 17.6 -17.6 17.3 16.5 17.1
2009 8 2 2 17.7 17.6 17.3 16.8 17.4
2009 8 2 3 15.8 16.0 15.1 15.0 15.5
2009 8 2 4 -15.4 15.6 14.5 14.6 15.1
2009 9 1 1 14.7 15.0 13.8 14.0 14.5
2009 9 1 2 15.3 15.4 14.3 14.6 15.0
2009 9 1 3 15.3 15.6 14.4 14.5 15.0
2009 9 1 4 14.9 15.7 13.7 13.8 14.5
To obtain the mean of each day I use the following code:
Daily_mean<-Base %>%
arrange(year, month, day, fivemin) %>% #we are ordering the data
group_by(year, month, day)%>%
summarise_at(
vars(c(rrp_nsw, rrp_qld, rrp_sa, rrp_tas, rrp_vic)),
.funs = mean) # funs() is deprecated in recent dplyr; pass the function directly
Once I have the daily mean, I want to replace each negative value with the mean of that day. For example, the 16th observation should become:
2009 8 2 4 "8.925" 15.6 14.5 14.6 15.1
If someone can help me I would be grateful.
We can use a replace in mutate_at to change the negative values to mean of that column after grouping by the relevant columns
library(dplyr)
Base %>%
arrange(year, month, day, fivemin) %>%
group_by(year, month, day) %>%
mutate_at(vars(rrp_nsw, rrp_qld, rrp_sa, rrp_tas, rrp_vic),
~ replace(., . < 0, mean(.)))
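The same replacement can be sketched in base R for a single column, which also answers the "if or while" part of the question: neither is needed, because ifelse() is the vectorised conditional. The sketch below uses only the rrp_nsw values for 2009-08-01 and 2009-08-02 from the question; the column name rrp_nsw_fixed is made up for the illustration. As in the example in the question, the daily mean includes the negative values.

```r
# rrp_nsw for two days, copied from the question's data
Base <- data.frame(
  year    = 2009,
  month   = 8,
  day     = c(1, 1, 1, 1, 2, 2, 2, 2),
  fivemin = c(1, 2, 3, 4, 1, 2, 3, 4),
  rrp_nsw = c(13.8, -13.7, 13.7, 13.8, 17.6, 17.7, 15.8, -15.4)
)

# ave() returns the group mean repeated for every row of the group
day_mean <- ave(Base$rrp_nsw, Base$year, Base$month, Base$day, FUN = mean)

# Keep non-negative prices, replace negative ones with the mean of their day
Base$rrp_nsw_fixed <- ifelse(Base$rrp_nsw < 0, day_mean, Base$rrp_nsw)
Base$rrp_nsw_fixed
# [1] 13.800  6.900 13.700 13.800 17.600 17.700 15.800  8.925
```

The last value, 8.925, is the figure given for the 16th observation in the question.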

Computation of the average of n observations in a data set using R

How do I compute the mean of the last 5 observations per class in a data set? The first column is the class (Plot) and the second column is the measured variable (Weight).
Plot Weight
1 12.5
1 14.5
1 15.8
1 16.1
1 18.9
1 21.2
1 23.4
1 25.7
2 13.1
2 15.0
2 15.8
2 16.3
2 17.4
2 18.6
2 22.6
2 24.1
2 25.6
3 11.5
3 12.2
3 13.9
3 14.7
3 18.9
3 20.5
3 21.6
3 22.6
3 24.1
3 25.8
We select the last 5 observations for each 'Plot' and get the mean:
library(dplyr)
df1 %>%
group_by(Plot) %>%
summarise(MeanWt = mean(tail(Weight, 5)))
Or with data.table
library(data.table)
setDT(df1)[, .(MeanWt = mean(tail(Weight, 5))), by = Plot]
Or using base R
aggregate(cbind(MeanWt = Weight) ~ Plot, data = df1, FUN = function(x) mean(tail(x, 5)))
I made this without a library:
It's a step-by-step solution, of course you can make the code shorter using a for or apply.
Hope you find it useful.
#Collecting your data
values <- scan()
1 12.5 1 14.5 1 15.8 1 16.1 1 18.9 1 21.2 1 23.4 1 25.7 2 13.1 2 15.0 2 15.8
2 16.3 2 17.4 2 18.6 2 22.6 2 24.1 2 25.6 3 11.5 3 12.2 3 13.9 3 14.7 3 18.9
3 20.5 3 21.6 3 22.6 3 24.1 3 25.8
data_w <- matrix(values, ncol=2, byrow = T)
#Naming your cols
colnames(data_w) <- c("Plot", "Weight")
dt_w <- as.data.frame(data_w)
#Mean of the 5 last observations by class:
#Weights that belong to Plot 1
w1 <- dt_w$Weight[dt_w$Plot == 1]
#Number of observations in Plot 1
size1 <- length(w1)
#Index of the first of the last 5 values (size1 - 4, so index1:size1 spans exactly 5)
index1 <- size1 - 4
#Way to compute the mean
mean1 <- mean(w1[index1:size1])
#mean of the last 5 observations of class 1
mean1
For classes 2 and 3 the process is the same: subset the weights for that Plot first, then take the last 5.
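The per-class steps can be collapsed into one call with tapply, which applies a function to each group in turn. A minimal sketch, assuming the data frame is called df1 as in the other answer and that the rows are already in the order in which "last" should be understood:

```r
# The data from the question: 8, 9 and 10 observations for Plots 1, 2 and 3
df1 <- data.frame(
  Plot = rep(1:3, times = c(8, 9, 10)),
  Weight = c(12.5, 14.5, 15.8, 16.1, 18.9, 21.2, 23.4, 25.7,
             13.1, 15.0, 15.8, 16.3, 17.4, 18.6, 22.6, 24.1, 25.6,
             11.5, 12.2, 13.9, 14.7, 18.9, 20.5, 21.6, 22.6, 24.1, 25.8)
)

# For each Plot, keep the last 5 weights and average them
means <- tapply(df1$Weight, df1$Plot, function(x) mean(tail(x, 5)))
means
#     1     2     3 
# 21.06 21.66 22.92
```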

Draw histograms per row over multiple columns in R

I'm using R for the analysis in my master's thesis.
I have the following data frame, STOF (student-to-staff ratio):
HEI.ID X2007 X2008 X2009 X2010 X2011 X2012
1 OP 41.8 147.6 90.3 82.9 106.8 63.0
2 MO 20.0 20.8 21.1 20.9 12.6 20.6
3 SD 21.2 32.3 25.7 23.9 25.0 40.1
4 UN 51.8 39.8 19.9 20.9 21.6 22.5
5 WS 18.0 19.9 15.3 13.6 15.7 15.2
6 BF 11.5 36.9 20.0 23.2 18.2 23.8
7 ME 34.2 30.3 28.4 30.1 31.5 25.6
8 IM 7.7 18.1 20.5 14.6 17.2 17.1
9 OM 11.4 11.2 12.2 11.1 13.4 19.2
10 DC 14.3 28.7 20.1 17.0 22.3 16.2
11 OC 28.6 44.0 24.9 27.9 34.0 30.7
Then I rank the colleges using this command:
HEIrank1<-(STOF[,-c(1)])
rank1 <- apply(HEIrank1,2,rank)
> HEIrank11
HEI.ID X2007 X2008 X2009 X2010 X2011 X2012
1 OP 18.0 20 20.0 20.0 20.0 20
2 MO 14.0 9 13.0 13.5 2.0 12
3 SD 15.0 16 17.0 16.0 16.0 19
4 UN 20.0 18 8.0 13.5 14.0 13
5 WS 12.0 8 4.0 7.0 6.0 8
6 BF 6.5 17 9.5 15.0 10.0 14
7 ME 17.0 15 19.0 19.0 17.0 15
8 IM 2.0 6 12.0 8.0 8.5 10
9 OM 4.5 3 2.5 3.0 3.0 11
10 DC 11.0 14 11.0 9.0 15.0 9
11 OC 16.0 19 16.0 18.0 19.0 17
I would like to draw a histogram for each HEI (that is, for each row). How can I do that?
If you use ggplot2 you won't need a loop; you can plot them all at once. You do need to reshape your data first so that it is in long format rather than wide format. The melt function from the reshape2 package does that.
library(reshape2)
new.df<-melt(HEIrank11,id.vars="HEI.ID")
names(new.df)=c("HEI.ID","Year","Rank")
substring() in the call below just strips the leading X from each year.
library(ggplot2)
ggplot(new.df, aes(x=HEI.ID,y=Rank,fill=substring(Year,2)))+
# the ranks are already computed, so draw them as bars rather than binning them
geom_bar(stat="identity",position="dodge")
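To see what the wide-to-long step does without any extra package, here is a small base-R sketch. It uses only the first two rows and the first two year columns of the rank table from the question; the object names wide and long are made up for the illustration.

```r
# Two rows and two year columns taken from the rank table in the question
wide <- data.frame(
  HEI.ID = c("OP", "MO"),
  X2007  = c(18, 14),
  X2008  = c(20, 9)
)

year_cols <- setdiff(names(wide), "HEI.ID")

# One row per (HEI, year) pair: stack the year columns on top of each other
long <- data.frame(
  HEI.ID = rep(wide$HEI.ID, times = length(year_cols)),
  Year   = rep(substring(year_cols, 2), each = nrow(wide)),  # drop the "X"
  Rank   = unlist(wide[year_cols], use.names = FALSE)
)
long
```

The resulting long has the same three columns (HEI.ID, Year, Rank) that the plotting call expects, with the X already removed from the year.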
Here's a solution in lattice:
require(lattice)
barchart(X2007+X2008+X2009+X2010+X2011+X2012 ~ HEI.ID,
data=HEIrank11,
auto.key=list(space='right')
)

Resources