I am trying to use a formula in the current cell that refers to the cell above it in R. For example:
data$srno = data$srno[offset(-1,0)] + 1
Is there a way to code this in R?
What may be more convenient for you is to use a lag or shift function from different packages.
Here are some different ways of tackling the challenge:
myvector<-1:26
# base version
1+c(0,myvector[-length(myvector)])
# returns an NA for 1st row
1+Hmisc::Lag(myvector)
1L + data.table::shift(myvector, fill=0)
The problem is the top cell has no cell above it. One approach is to use NA for that cell:
data$srno <- c(NA,data$srno[-length(data$srno)]+1);
Another approach is to consider the bottom cell to "wrap around", so that it can be used in the formula for calculating the new value for the top cell. Whether this makes sense depends on your data/formula, but here's how it could be done:
data$srno <- data$srno[c(length(data$srno),1:(length(data$srno)-1))]+1;
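Both approaches can be tried on a small data frame. This is only a sketch: the toy `data` frame and the names `n`, `lagged_na` and `lagged_wrap` are made up for illustration.

```r
# Toy data, assumed for illustration only
data <- data.frame(srno = c(10, 20, 30, 40))
n <- nrow(data)

# Approach 1: the top cell has nothing above it, so it becomes NA
lagged_na <- c(NA, data$srno[-n]) + 1

# Approach 2: the bottom cell wraps around and feeds the top cell
lagged_wrap <- data$srno[c(n, seq_len(n - 1))] + 1

print(lagged_na)
print(lagged_wrap)
```

Using `seq_len(n - 1)` instead of `1:(n - 1)` keeps the wrap-around version safe for a one-row data frame.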
I am trying to do some regression matching using the following code:
library(data.table)
colCount = 3
Test = data.table(ID=c(1,2,3),R1=c(1,1,1),R2=c(1,2,3),R3=c(4,2,1))
Compare = data.table(ID=c(1,2,3),R1=c(2,1,2),R2=c(6,2,3),R3=c(1,1,4))
# Example, in real run, I know colCount will give the arbitrary number of data columns in Test and Compare; I will not know their names
for (i in 1:nrow(Compare)){
Test[,paste0('Factor_',i):=sum(Compare[i,2:(1+colCount)]*.SD)/sum(.SD*.SD),.SDcols=2:(1+colCount),by=1:nrow(Test)]
Test[,paste0('Error_',i) :=sum((Compare[i,2:(1+colCount)]-.SD*paste0('Factor_',i))^2),.SDcols=2:(1+colCount),by=1:nrow(Test)]
}
The paste on the second to last line does not work as I was hoping; is there a better way to refer to columns with generated names?
Also more generally, is there a smarter way to do this? The sum-by-row method I'm doing seems way too complicated, but I couldn't get it working using matrix math or lapply instead
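One possible way to express the same calculation with matrix math is sketched below in base R rather than data.table. It assumes that every column after ID is numeric and that Test and Compare have the same data columns in the same order; the names `Tm`, `Cm`, `Factor` and `Error` are hypothetical.

```r
# Same numbers as in the question, as plain data frames
Test    <- data.frame(ID = c(1, 2, 3), R1 = c(1, 1, 1), R2 = c(1, 2, 3), R3 = c(4, 2, 1))
Compare <- data.frame(ID = c(1, 2, 3), R1 = c(2, 1, 2), R2 = c(6, 2, 3), R3 = c(1, 1, 4))

# Drop the ID column and work on numeric matrices
Tm <- as.matrix(Test[, -1])
Cm <- as.matrix(Compare[, -1])

# Factor[r, i] = sum(Compare[i, ] * Test[r, ]) / sum(Test[r, ]^2)
Factor <- (Tm %*% t(Cm)) / rowSums(Tm^2)

# Error[r, i] = sum((Compare[i, ] - Test[r, ] * Factor[r, i])^2)
Error <- sapply(seq_len(nrow(Cm)), function(i) {
  target <- matrix(Cm[i, ], nrow = nrow(Tm), ncol = ncol(Tm), byrow = TRUE)
  rowSums((target - Tm * Factor[, i])^2)
})

colnames(Factor) <- paste0("Factor_", seq_len(nrow(Cm)))
colnames(Error)  <- paste0("Error_", seq_len(nrow(Cm)))
result <- cbind(Test, Factor, Error)
```

Because the columns are selected by position, this works without knowing their names.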
I am trying to create a column which holds the mean of a variable for each subgroup of my data set. In this case, the mean is the crime rate of each state calculated from county observations, and I then want to assign this number to each county according to the state it is located in. Here is the function I wrote.
Create the new column
Data.Final$state_mean <- 0
Then calculate and assign the mean.
for (j in range[1:3136])
{
state <- Data.Final[j, "state"]
Data.Final[j, "state_mean"] <- mean(Data.Final$violent_crime_2009-2014,
which(Data.Final[, "state"] == state))
}
Here is the error I get:
Error in range[1:3137] : object of type 'builtin' is not subsettable
It would be very much appreciated if you could take a few minutes to help a beginner out.
You've got a few problems:
range[1:3136] isn't valid syntax. range(1:3136) is valid syntax, but the range() function just returns the minimum and maximum. You don't need anything more than 1:3136, just use
for (j in 1:3136) instead.
Because of the dash, violent_crime_2009-2014 isn't a standard column name. You'll need to use it in backticks, Data.Final$`violent_crime_2009-2014`, or in quotes with [: Data.Final[["violent_crime_2009-2014"]] or Data.Final[, "violent_crime_2009-2014"]
Also, your code is very inefficient: you recalculate the mean on every single iteration. Try having a look at the
Mean by Group R-FAQ. There are many faster and easier methods to get grouped means.
Without using extra packages, you could do
Data.Final$state_mean = ave(x = Data.Final[["violent_crime_2009-2014"]],
Data.Final$state,
FUN = mean)
For friendlier syntax and greater efficiency, the data.table and dplyr packages are popular. You can see examples using them at the link above.
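As a minimal sketch of the ave approach, here is a toy data frame with made-up values. The column names follow the question, but the numbers are invented for illustration.

```r
# Toy data; check.names = FALSE keeps the dash in the column name
Data.Final <- data.frame(
  state = c("A", "A", "B", "B", "B"),
  `violent_crime_2009-2014` = c(10, 20, 1, 2, 3),
  check.names = FALSE
)

# ave() returns one value per row: the mean of that row's group
Data.Final$state_mean <- ave(
  Data.Final[["violent_crime_2009-2014"]],
  Data.Final$state,
  FUN = mean
)

print(Data.Final)
```

No loop and no pre-created column are needed, and the rows stay in their original order.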
Here is one of many ways this can be done (I'm sure someone will post a tidyverse answer soon if not before I manage to post):
# Data for my example:
data(InsectSprays)
# Note I have a response column and a column I could subset on
str(InsectSprays)
# Take the averages with the by var:
mn <- with(InsectSprays,aggregate(x=list(mean=count),by=list(spray=spray),FUN=mean))
# Map the means back to your data using the by var as the key to map on:
InsectSprays <- merge(InsectSprays,mn,by="spray",all=TRUE)
Since you mentioned you're a beginner, I'll just mention that you should avoid looping in R whenever you can; vectorize your operations instead. The nice thing about using aggregate and merge is that you don't have to worry about mapping errors caused by an index shifting while you loop.
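If you would rather not call merge, which sorts the result by the key, a named-vector lookup is another vectorized option. This is only a sketch of that alternative; `group_means` is a name chosen here for illustration.

```r
data(InsectSprays)

# One mean per spray, stored with the spray levels as names
group_means <- tapply(InsectSprays$count, InsectSprays$spray, mean)

# Look up each row's group mean by name; row order is unchanged
InsectSprays$mean <- as.vector(group_means[as.character(InsectSprays$spray)])

head(InsectSprays)
```

The lookup by name does the same job as the merge key, so there is no index to get out of step.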
Cheers!
Or how to split a vector into pairs of contiguous members and combine them in a list?
Suppose you are given the vector
map <-seq(from = 1, to = 20, by = 4)
which is
1 5 9 13 17
My goal is to create the following list
path <- list(c(1,5), c(5,9), c(9,13), c(13,17))
This is supposed to represent the path segments that the map is suggesting we follow. In order to go from 1 to 17, we must first take the first path (path[1]), then the second path (path[2]), and so on all the way to the end.
My first attempt led me to:
path <- split(aux <- data.frame(S = map[-length(map)], E = map[-1]), row(aux))
But I think it should be possible without creating this auxiliary data frame,
avoiding the performance decrease when the initial vector (the map) is too big. It also returns a warning message, which is harmless, but I would like to avoid it.
Then I found this here on Stack Overflow (not exactly like this; this is the version adapted to my problem):
mod_map <- c(map, map[c(-1,-length(map))])
mod_map <- sort(mod_map)
split(mod_map, ceiling(seq_along(mod_map)/2))
which is a simpler solution, but I have to use this modified version of my map.
Perhaps I'm asking too much, as I already have two solutions. But could there be a third one, so that I don't have to use data frames as in my first solution and can use the original map, unlike in my second solution?
We can use Map on the vector with the last and the first element removed, respectively, and concatenate elementwise. (It is better not to name the vector 'map', since that is also the name of a function from purrr.)
Map(c, map[-length(map)], map[-1])
Or, as @Sotos mentioned, split can be used, which would be faster
split(cbind(map[-length(map)], map[-1]), seq(length(map)-1))
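A short sketch that runs both variants on the vector from the question. The vector is called `v` here, an arbitrary name, to avoid clashing with purrr's map.

```r
v <- seq(from = 1, to = 20, by = 4)   # 1 5 9 13 17

# Pair each element with its successor
pairs_map <- Map(c, v[-length(v)], v[-1])

# Same pairs via split; the grouping index is recycled over both columns
pairs_split <- split(cbind(v[-length(v)], v[-1]), seq(length(v) - 1))

str(pairs_map)
str(pairs_split)
```

The split version returns a list named "1" to "4", while Map returns an unnamed list; wrap it in unname() if the names get in the way.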
I am trying to normalize some columns on a data frame so they have the same mean. The solution I am now implementing, even though it works, feels like there is a simpler way of doing this.
# we make a copy of women
w = women
# print out the col Means
colMeans(women)
height weight
65.0000 136.7333
# create a vector of factors to normalize with
factor = colMeans(women)/colMeans(women)[1]
# normalize the copy of women that we previously made
for(i in 1:length(factor)){w[,i] <- w[,i] / factor[i]}
#We achieved our goal to have same means in the columns
colMeans(w)
height weight
65 65
I can come up with the same thing easily using apply, but is there something easier, like just doing women/factor and getting the correct answer?
By the way, what is women/factor actually doing? Running:
colMeans(women/factor)
height weight
49.08646 98.40094
does not give the same result.
You can use mapply too:
colMeans(mapply("/", women, factor))
As for what women/factor does: women is a data.frame with two columns, while factor is a numeric vector of length two. When you do women/factor, R does not match factor[j] to column j. It walks down the entries of women column by column and divides them by factor[1], factor[2], factor[1], factor[2], and so on. Because factor is shorter than women, R recycles it over and over.
You can see, for example, that every other entry of women[, 1]/factor (positions 1, 3, 5, ...) equals the corresponding entry of women[, 1], because factor[1] equals 1.
One way of doing this is with sweep. By default this function subtracts a summary statistic along the chosen margin (here 2, the columns), but you can specify a different function. In this case, a division:
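The recycling can be reproduced by hand. The sketch below builds the recycled divisor explicitly and compares it with what women/factor returns; the names `f`, `recycled`, `manual` and `auto` are chosen here for illustration, with `f` standing in for factor so that the base function of that name is not masked.

```r
f <- colMeans(women) / colMeans(women)[1]

# f repeated down the columns: f[1], f[2], f[1], f[2], ...
recycled <- rep(f, length.out = nrow(women) * ncol(women))

# Divide the column-major values by the recycled vector
manual <- matrix(unlist(women) / recycled, nrow = nrow(women))

# What R does on its own
auto <- as.matrix(women / f)

colMeans(auto)
```

Since women has 15 rows, an odd number, the second column starts with f[2] rather than f[1], which is why neither column ends up with the intended mean.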
colMeans(sweep(women, 2, factor, '/'))
Also:
rowMeans(t(women)/factor)
#height weight
#65 65
Regarding your question:
I can come up with the same thing easily using apply, but is there something easier, like just doing women/factor and getting the correct answer? By the way, what is women/factor actually doing?
women/factor ## is similar to
unlist(women)/rep(factor,nrow(women))
What you need is:
unlist(women)/rep(factor, each=nrow(women))
or
women/rep(factor, each=nrow(women))
In my solution, I didn't use rep because factor gets recycled as needed.
t(women) ##matrix
as.vector(t(women))/factor #will give same result as above
or just
t(women)/factor #preserve the dimensions for ?rowMeans
In short, the operations are happening column-wise here.
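As a quick check, the sketch below confirms that the sweep, transpose and rep(each = ) forms all give the same column means. The name `f` stands in for factor, and the `via_*` names are hypothetical.

```r
f <- colMeans(women) / colMeans(women)[1]

via_sweep     <- colMeans(sweep(women, 2, f, "/"))
via_transpose <- rowMeans(t(women) / f)
via_rep       <- colMeans(women / rep(f, each = nrow(women)))

print(via_sweep)
print(via_transpose)
print(via_rep)
```

The transpose trick works because t(women) has one row per original column, so the recycled f lines up with the rows.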
Let me preface this question by saying that I know very little about R. I'm importing a text file into R using read.table("file.txt", T). The text file is in the general format:
header1 header2
a 1
a 4
b 3
b 2
Each a is an observation from a sample and similarly each b is an observation from a different sample. I want to calculate various statistics of the sets of a and b which I'm doing with tapply(header2, header1, mean). That works fine.
Now I need to do some qqnorm plots of a and b and draw the fitted lines with qqline. I can use tapply(header2, header1, qqnorm) to make quantile plots of each, BUT using tapply(header2, header1, qqline) draws both best-fit lines on the last quantile plot. Programmatically that makes sense, but it doesn't help me.
So my question is, how can I convert the data frame to two vectors (one for all a and one for all b)? Does that make sense? Basically, in the above example, I'd want to end up with two vectors: a=(1,4) and b=(3,2).
Thanks!
Create a function that does both. You won't be able (easily at least) to revert to an old graphics device.
e.g.
with(dd, tapply(header2,header1, function(x) {qqnorm(x); qqline(x)}))
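If you do want one vector per sample, split gives you a named list of them without creating separate objects. The sketch below uses made-up values for a data frame `dd` with the question's column names, and draws to a null PDF device only so that it runs anywhere; drop the pdf(NULL) and dev.off() lines for on-screen plots.

```r
# Toy data with the question's column names (values are made up)
dd <- data.frame(
  header1 = c("a", "a", "a", "a", "b", "b", "b", "b"),
  header2 = c(1, 4, 2, 6, 3, 2, 5, 7)
)

# One numeric vector per sample, kept together in a named list
groups <- split(dd$header2, dd$header1)

pdf(NULL)  # null device: nothing is written to disk
for (name in names(groups)) {
  # Each plot gets its own line because both calls run before the next plot
  qqnorm(groups[[name]], main = paste("Normal Q-Q plot:", name))
  qqline(groups[[name]])
}
invisible(dev.off())
```

The vectors are then available as groups$a and groups$b.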
You could use data.table here for coding elegance (and speed)
You can pass the equivalent of a body of a function that is evaluated within the scope of the data.table e.g.
library(data.table)
DT <- data.table(dd)
DT[, {qqnorm(header2)
qqline(header2)}, by=header1]
You don't really want to pollute your global environments with lots of objects (that will be inefficient).