I have a column of anomaly values and I want to weight it by a specific number representing a number of years (32).
How can I do this?
data(mtcars)
mtcars$weight <- apply(mtcars[5], 1, ??, 32)
mtcars$weighted <- mtcars$drat * 32
Or, if your weights differ for each observation:
mtcars$weighted <- mtcars$drat * mtcars$cyl
No need for apply; multiplication is already vectorized for your convenience ;)
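For what it's worth, the `??` in the question could simply be the multiplication operator itself, which makes a nice demonstration that both routes give the same numbers:

```r
data(mtcars)
via_apply  <- apply(mtcars[5], 1, `*`, 32)  # row-wise, as the question attempted
vectorized <- mtcars$drat * 32              # the idiomatic one-liner
```

The only difference is that `via_apply` carries row names along, so `unname()` it before comparing.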
I am plotting data that consist of intervals that are more or less constant, plus spikes originating from the data being a quotient of two parameters. The very large and very small quotients aren't relevant for my purpose, so I have been looking for a way to filter them out. The dataset contains 40k+ values, so I cannot manually remove the high/low quotients.
Is there any function that can trim/filter out the very large/small quotients?
You can use the filter() function from dplyr. This creates a new dataframe without the outliers that you can then plot. For example:
library(dplyr)
no_spikes <- filter(original_df, x > -100 & x < 100)
This would create a new dataframe, no_spikes, that only contains observations where the variable x is between the values -100 and 100.
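If you'd rather not pull in dplyr, base subsetting does the same thing, and quantile-based bounds can replace hand-picked ones. The 1%/99% cut-offs below are purely illustrative, and `original_df` here is simulated since the real data isn't shown:

```r
set.seed(1)
original_df <- data.frame(x = c(rnorm(1000), -500, 500))  # two artificial spikes

# Fixed bounds, mirroring the dplyr answer
no_spikes <- original_df[original_df$x > -100 & original_df$x < 100, , drop = FALSE]

# Data-driven bounds: drop everything outside the 1st-99th percentile
q <- quantile(original_df$x, c(0.01, 0.99))
trimmed <- original_df[original_df$x >= q[1] & original_df$x <= q[2], , drop = FALSE]
```

The quantile version adapts to the data, which is handy when you don't know in advance how big the spikes are.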
This question already has an answer here:
Obtaining connected components of neighboring values
(1 answer)
Closed 6 years ago.
I'm interested in identifying contiguous regions within a matrix (not necessarily square) of 0-1 (boolean) values, using R. I would like, given a matrix of 0-1 values, to identify each contiguous cluster (diagonals count, although an option of whether to count them or not would be ideal) and register the number of cells within that cluster.
Take the following example:
set.seed(14)
p <- matrix(0, ncol = 10, nrow = 10)
p[sample(1:100, 10)] <- 1
ones <- which(p == 1)
image(p)
I'd like to be able to identify (since I'm counting diagonals) four different groups, with (from top to bottom) 2, 1, 5, and 2 cells per cluster.
The raster package has an adjacent function which does a good job of locating adjacent cells, but I can't figure out how to do this.
One last constraint is that an ideal solution should be fast. I'd like to be able to use it within a data.table dt[, lapply(.SD, ...)] type situation with a large number of groups (each group being a data set from which I could create the matrix).
You definitely need a connected-component labeling algorithm.
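For a dependency-free sketch, a breadth-first flood fill over the matrix does the labeling; the packaged alternative is raster::clump() (which needs igraph). This is illustrative code, not tuned for the data.table speed requirement in the question:

```r
# Label contiguous clusters of 1s in a 0-1 matrix.
# diagonal = TRUE counts diagonal neighbors (8-connectivity).
label_components <- function(p, diagonal = TRUE) {
  nr <- nrow(p); nc <- ncol(p)
  lab <- matrix(0L, nr, nc)
  if (diagonal) {
    dr <- rep(-1:1, times = 3); dc <- rep(-1:1, each = 3)
    keep <- !(dr == 0 & dc == 0); dr <- dr[keep]; dc <- dc[keep]
  } else {
    dr <- c(-1L, 1L, 0L, 0L); dc <- c(0L, 0L, -1L, 1L)
  }
  current <- 0L
  for (start in which(p == 1)) {
    if (lab[start] != 0L) next          # already assigned to a cluster
    current <- current + 1L
    lab[start] <- current
    queue <- start
    while (length(queue) > 0) {
      cell <- queue[1]; queue <- queue[-1]
      r  <- (cell - 1L) %% nr + 1L      # row from column-major index
      cl <- (cell - 1L) %/% nr + 1L     # column from column-major index
      for (k in seq_along(dr)) {
        rr <- r + dr[k]; cc <- cl + dc[k]
        if (rr >= 1 && rr <= nr && cc >= 1 && cc <= nc) {
          nb <- (cc - 1L) * nr + rr
          if (p[nb] == 1 && lab[nb] == 0L) {
            lab[nb] <- current
            queue <- c(queue, nb)
          }
        }
      }
    }
  }
  lab
}
```

On the example above, `lab <- label_components(p); table(lab[lab > 0])` gives the number of cells per cluster.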
I have a simple question. I have a vector of years, spanning 1945:2000, with many repeated years. I want to make this an ordinal vector, so that 1945 is changed to 1, 1946 to 2, etc...
Obviously in this case the easiest way is just to subtract 1944 from the vector. But I have to do this with other numeric vectors that are not evenly spaced.
Is there an R function that does this?
You can do:
as.numeric(factor(x))
For example:
x <- sample(1945:2010, 40)
ordinal_x <- as.numeric(as.factor(x))
plot(x, ordinal_x)
Notice that ordinal_x skips the gaps in x.
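An equivalent without the factor round-trip, in case the coercion feels indirect: match() against the sorted unique values returns the same dense ranks (x here is a toy vector):

```r
x <- c(1950, 1945, 1950, 2000)
ordinal_x <- match(x, sort(unique(x)))
ordinal_x
# [1] 2 1 2 3
```

This gives exactly the same codes as `as.integer(as.factor(x))`, since factor levels default to sorted unique values.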
Following some excellent replies to an earlier question I posed - selecting n random rows across all levels of a factor within a dataframe - I have been considering an extension to this problem.
The previous question sought to randomly sample n rows/observations from each level of a particular factor, and to combine all information in a new dataframe.
However, this sort of random sampling may not be optimal for some types of data. Here, I want to again select n rows/observations per every level of a particular factor. The major difference here is that the rows/observations selected from each level of the particular factor should be consecutive.
This is an example dataset:
id <- sample(1:20, 100, replace = TRUE)
dat <- as.data.frame(id)
color <- c("blue", "red", "yellow", "pink", "green", "orange", "white", "brown")
dat$colors <- sample(color, 100, replace = TRUE)
To add to this example dataset are timestamps for each observation. These will form the order along which I wish to sample. I am using a function suggested in this thread - efficiently generate a random sample of times and dates between two dates - for this purpose:
randomts <- function(N, st = "2013/12/09", et = "2013/12/14") {
  st <- as.POSIXct(as.Date(st))
  et <- as.POSIXct(as.Date(et))
  dt <- as.numeric(difftime(et, st, unit = "sec"))
  ev <- sort(runif(N, 0, dt))
  st + ev
}
dat$ts <- randomts(100)
I am not sure if this is necessary, but it is also possible to add a variable that gives the 'day'. This is the factor which I wish to sample from every level.
temp <- strsplit(as.character(dat$ts), " ")
mat <- matrix(unlist(temp), ncol = 2, byrow = TRUE)
df <- as.data.frame(mat)
colnames(df) <- c("date", "time")
dat <- cbind(df, dat)
mindate <- as.Date(min(dat$date))
dates <- as.Date(dat$date)
x <- as.numeric(dates - mindate)
x <- x + 1
dat$day <- x
dat$day <- as.factor(dat$day) # in this example data there are 5 or 6 levels to 'day', depending on what the random timestamp function generates
EDIT: the original version of this post did not accurately calculate day; the x + 1 step above ensures the first day is day = 1 rather than day = 0.
To summarize, the problem is this. I want to create a new dataframe that contains e.g. 5 consecutive observations randomly sampled from every level of the factor day of the dataframe "dat" (ie 5 random consecutive observations taken from every day). Therefore, the new dataframe would have 30 observations. An additional caveat would be that if I wanted to sample e.g. 20 consecutive observations, and a particular level only had 15 observations, then all 15 are returned and there is no replacement.
I have tried to play around with seq_along to solve this. I seem to be able to get this to work for one variable at a time - e.g. if sampling from colors:
x <- sample(seq_along(dat$colors),1)
dat$colors[x:(x+4)]
This produces a randomly sampled list of 5 consecutive colors from the variable colors.
I am having trouble applying this to the problem at hand. I have tried modifying some of the answers to my previous question selecting n random rows across all levels of a factor within a dataframe - but can't seem to work out the correct placement of seq_along in any.
This should sample runs of colors, assuming your data.frame is sorted by date. Here N is how many of each color you want. The return value keep will be TRUE for the selected run within each color group.
N <- 5
keep <- with(dat, ave(rep(TRUE, nrow(dat)), colors, FUN = function(x) {
  # -N+1 (not -N) so the run can end at the last row of the group
  start <- sample.int(max(length(x) - N + 1, 1), 1)
  end <- min(length(x), start + N - 1)
  rep(c(FALSE, TRUE, FALSE), c(start - 1, end - start + 1, length(x) - end))
}))
dat[keep, ]
This method does not look at any day value. It simply finds a random run of N observations within each color group. It will only return fewer per category if there are fewer than N observations for a particular group.
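If you do want to group by day rather than color, a split/lapply variant is one way to sketch it. Note that `sample_run` is a hypothetical helper name, the data frame below is a stand-in for the real `dat`, and rows are assumed to already be in timestamp order:

```r
set.seed(1)
dat <- data.frame(id = 1:100, day = rep(1:5, each = 20))  # stand-in data, sorted

sample_run <- function(d, N = 5) {
  n <- nrow(d)
  if (n <= N) return(d)                     # fewer than N rows: return them all
  start <- sample.int(n - N + 1, 1)         # random start of a consecutive run
  d[start:(start + N - 1), , drop = FALSE]
}

out <- do.call(rbind, lapply(split(dat, dat$day), sample_run, N = 5))
```

Each group in `out` is a consecutive block of rows, which also handles the caveat about groups smaller than N.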
This question already has answers here:
calculating mean for every n values from a vector
(3 answers)
Closed 1 year ago.
I have 28 groups of 48 rows in an R dataframe. I'm trying to take the standard deviation of each group. I used the following statement in RStudio:
stddev <- vector();
for (i in 1:28) { stddev[i] <- sd(in.subj[((i * 48) -47):(i * 48), 5]); }
When I check the values of stddev[] afterward, stddev[1] = NA. Likewise, when I check the standard deviations of individual groups, like sd(in.subj[49:96,5]) I get different values than the for loop printed out.
What would be the cause of these issues?
Thanks!
You can try:
tapply(in.subj[, 5], gl(28, 48), sd)
If there are NAs in your data:
tapply(in.subj[, 5], gl(28, 48), sd, na.rm = TRUE)
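As a quick self-contained check of how gl() lines up with the manual slices (simulated data, since in.subj isn't shown):

```r
set.seed(1)
v <- rnorm(28 * 48)      # stand-in for in.subj[, 5]
g <- gl(28, 48)          # factor: level 1 repeated 48 times, then level 2, ...
res <- tapply(v, g, sd)
```

Here `res[["2"]]` matches `sd(v[49:96])`, i.e. the manual slice for the second group from the question.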