Creating new columns using a for loop - r

I am new to R (Economist with background in Stata) and I am having trouble getting a nested for loop to work for me. I know the issue is that I don't have a good understanding of how to use the loop counter as part of a variable name.
A bit of background. I have a data frame with data on average rental rates for homes of different sizes (1 bedroom, 2 bedroom, etc.) and data on annual earnings (mean, median, and various percentiles). I am trying to generate a series of new columns containing the ratio of these two things (rental rate / mean earnings).
Specifically my variables are:
beds1, beds2, beds3, beds4
mean, median, p10, p25, p75, p90
So you see I need to generate 24 new columns of cost/earnings data. I could write out 24 lines of code but I don't want to. More importantly, I want to learn an efficient way of doing this in R. In Stata I could do this very simply using a nested for loop, but I can't get it to work in R. Here is my code so far.
for (i in 1:4) {
stat <- c("median", "mean", "p10", "p25", "p75","p90")
for (x in stat) {
df$beds[i]_[x] <- round((df$beds[i]/df$[x]),digits=3)
}
}
When I run this code the error I get is
Error: unexpected input in:
" for (x in stat) {
df$beds[i]_"
> }
Error: unexpected '}' in " }"
> }
Error: unexpected '}' in "}"
I have tried to use the double brackets [[]] but that didn't change the results. If anyone has some insight into why the dynamic variable names aren't working please let me know. Even better, since I guess loops are evil in R, if anyone knows a way to use lapply to get this done, I would love to hear that too.
EDIT
Thanks @Spacedman for the comment. I think I am getting what you're saying. So does that mean that there simply isn't any way to do what I want to do in R?
var1 <- c("beds1", "beds2")
var2 <- c("mean", "median")
for (i in 1:2) {
for (j in 1:2) {
df$var1[i]_var2[j] <- df$var1[i]/df$var2[j]
}
}
I think this should grab the elements of the lists var1 and var2 so that when i=1 and j=1, df$var1[i]/df$var2[j] should mean df$beds1/df$mean. Or would R get mad and think I was trying to divide strings?
FINAL EDIT WITH ANSWER FROM @SPACEDMAN
Thanks @Spacedman. I loved your spoiler and thank you for providing additional help. I didn't fully grasp the difference between the two ways of referring to columns after your last post, but I think I have a better idea now. I did a bit of tweaking and now I have something that works perfectly. Thanks again!
beds <- c("beds1", "beds2", "beds3", "beds4")
stat <- c("median", "mean", "p10", "p25", "p75","p90")
for(i in beds){
for(x in stat){
res = paste0(i,"_",x)
df[[res]]=round(df[[i]]/df[[x]],digits=3)
}
}

R is not a macro expansion language like other languages you might be used to.
x[i], if i=123, does not "expand" into x123. It gets the value of the 123rd element of the vector, x.
So df$beds[i] tries to get the i'th element of a vector df$beds.
You need to know two things:
How to construct strings from other strings.
For this you can use paste0:
> for(i in 1:4){
+ print(paste0("beds",i))
+ }
[1] "beds1"
[1] "beds2"
[1] "beds3"
[1] "beds4"
How to access columns by names.
For this you can use double square brackets. In a list:
> z = list()
> n = "thing"
Double square brackets evaluate their index and use that. So:
> z[[n]] = 99
Will set z$thing, but dollar sign indexing is literal, so:
> z$n = 123
will set z$n:
> z
$thing
[1] 99
$n
[1] 123
Hopefully that's enough hints to get you through. It should all be covered in basic R tutorials online.
Spoiler
If you want to work out how to do it yourself, look away now...
First, let's create a sample data frame - you should include something like this in your question so we have common test data to work on. I'll just have three beds and two stats:
> df = data.frame(
beds1=c(1,2,3),
beds2=c(5,2,3),
beds3=c(6,6,6),
mean=c(8,4,3),
median=c(1,7,4))
> df
beds1 beds2 beds3 mean median
1 1 5 6 8 1
2 2 2 6 4 7
3 3 3 6 3 4
Now the work. We loop over the bed number and the character stats. The bed column name is stored in bed by pasting "beds" to the number i. We compute the name of the result column (res) for a given bed number and stat by pasting "beds" to i and "_" and the name of the stat in x.
Then set the new resulting column to the value by dividing the beds number by the stat. We use [[z]] to get the columns by name:
> for(i in 1:3){
stats=c("mean","median")
for(x in stats){
bed = paste0("beds",i)
res = paste0("beds",i,"_",x)
df[[res]]=round(df[[bed]]/df[[x]],digits=3)
}
}
Resulting in....
> df
beds1 beds2 beds3 mean median beds1_mean beds1_median beds2_mean beds2_median
1 1 5 6 8 1 0.125 1.000 0.625 5.000
2 2 2 6 4 7 0.500 0.286 0.500 0.286
3 3 3 6 3 4 1.000 0.750 1.000 0.750
beds3_mean beds3_median
1 0.75 6.000
2 1.50 0.857
3 2.00 1.500
>
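Since the question also asked about an lapply-style alternative, here is a minimal sketch of one, not the only way to do it. It assumes the same small sample data frame as above; the names combos and ratios are made up for illustration. expand.grid lists every bed/stat pair, Map computes one ratio column per pair, and the result is attached to df in one assignment.

```r
# Sample data, same shape as the spoiler example above
df <- data.frame(
  beds1 = c(1, 2, 3),
  beds2 = c(5, 2, 3),
  beds3 = c(6, 6, 6),
  mean = c(8, 4, 3),
  median = c(1, 7, 4)
)

beds <- c("beds1", "beds2", "beds3")
stats <- c("mean", "median")

# One row per (stat, bed) pair; the first argument varies fastest,
# so the order matches the nested loop: beds1_mean, beds1_median, ...
combos <- expand.grid(stat = stats, bed = beds, stringsAsFactors = FALSE)

# Compute each ratio column; [[ ]] looks the column up by the string it is given
ratios <- Map(
  function(bed, stat) round(df[[bed]] / df[[stat]], digits = 3),
  combos$bed, combos$stat
)
names(ratios) <- paste0(combos$bed, "_", combos$stat)

# Attach all new columns at once
df[names(ratios)] <- ratios
```

For a handful of columns the explicit loop is just as good; this version mostly saves typing when the list of pairs gets long.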

Related

Why isn't the mean() function in R giving me the right result?

I've been practicing basics in R (3.6.3) and I'm stuck trying to understand this problem for hours already. This was the exercise:
Step 1: Generate sequence of data between 1 and 3 of total length 100; #use the jitter function (with a large factor) to add noise to your data
Step 2: Compute the vector of rolling averages roll.mean with the average of 5 consecutive points. This vector has only 96 averages.
Step 3: add the vector of these averages to your plot
Step 4: generalize step 2 and step 3 by making a function with parameters consec (default=5) and y.
y88 = seq(1,3,0.02)
y = jitter(y88, 120, set.seed(1))
y = y[-99] # removed one guy so y can have 100 elements, as asked
roll.meanT = rep(0,96)
for (i in 1:length(roll.meanT)) # my 'reference i' is roll.mean[i], not y[i]
{
roll.meanT[i] = (y[i+4]+y[i+3]+y[i+2]+y[i+1]+y[i])/5
}
plot(y)
lines(roll.meanT, col=3, lwd=2)
This produced this plot:
Then I proceeded to generalize using a function (the exercise asks me to generalize steps 2 and 3, so the data creation step was ignored, and I consider y to remain constant):
fun50 = function(consec=5,y)
{
roll.mean <- rep(NA,96) # Apparently, we just leave NA's as NA's, since lenght(y) is always greater than lenght(roll.means)
for (i in 1:96)
{
roll.mean[i] <- mean(y[i:i+consec-1]) # Using mean(), I'm able to generalize.
}
plot(y)
lines(roll.mean, col=3, lwd=2)
}
Which gave me a completely different plot:
When I manually try to see if mean(y[1:5]) produces the right mean, it does. I know I could have already used the mean() function in the first part, but I would really like to get the same results using (y[i+4]+y[i+3]+y[i+2]+y[i+1]+y[i])/5 or mean(y[1:5],......).
You have the line
roll.mean[i] <- mean(y[i:i+consec-1]) # Using mean(), I'm able to generalize.
I believe your intention is to grab the values with indices i to (i+consec-1). Unfortunately for you - the : operator takes precedence over arithmetic operations.
> 1:1+5-1 #(this is what your code would do for i=1, consec=5)
[1] 5
> (1:1)+5-1 # this is what it's actually doing for you
[1] 5
> 2:2+5-1 #(this is what your code would do for i=2, consec=5)
[1] 6
> 3:3+5-1 #(this is what your code would do for i=3, consec=5)
[1] 7
> 3:(3+5-1) #(this is what you want your code to do for i=3, consec=5)
[1] 3 4 5 6 7
So to fix it, just add some parentheses:
roll.mean[i] <- mean(y[i:(i+consec-1)]) # Using mean(), I'm able to generalize.
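As a sanity check, here is a small sketch of the generalized function with the parentheses in place, compared against the hand-written five-term average. The function name roll_mean and the way y is generated here are my own assumptions, not the exact data from the question; the output length is also derived from y and consec instead of being hard-coded to 96.

```r
# Rolling mean over `consec` consecutive points
roll_mean <- function(y, consec = 5) {
  n_out <- length(y) - consec + 1
  out <- numeric(n_out)
  for (i in seq_len(n_out)) {
    # Parentheses force i:(i + consec - 1) instead of (i:i) + consec - 1
    out[i] <- mean(y[i:(i + consec - 1)])
  }
  out
}

set.seed(1)
y <- jitter(seq(1, 3, length.out = 100), factor = 120)

# Hand-written version for consec = 5
manual <- vapply(
  1:96,
  function(i) (y[i] + y[i + 1] + y[i + 2] + y[i + 3] + y[i + 4]) / 5,
  numeric(1)
)

same <- isTRUE(all.equal(roll_mean(y), manual))
```

Here same should come out TRUE, and length(roll_mean(y)) is 96 for 100 points.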

How to calculate a pooled standard deviation in R?

I want to calculate the pooled (actually weighted) standard deviation for all the unique sites in my data frame.
The values for these sites are values for single species forest stands and I want to pool the mean and the sd so that I can compare broadleaved stands with conifer stands.
This is the data frame (df) with values for the broadleaved stands:
keybl n mean sd
Vest02DenmDesp 3 58.16 6.16
Vest02DenmDesp 5 54.45 7.85
Vest02DenmDesp 3 51.34 1.71
Vest02DenmDesp 3 59.57 5.11
Vest02DenmDesp 5 62.89 10.26
Vest02DenmDesp 3 77.33 2.14
Mato10GermDesp 4 41.89 12.6
Mato10GermDesp 4 11.92 1.8
Wawa07ChinDesp 18 0.097 0.004
Chen12ChinDesp 3 41.93 1.12
Hans11SwedDesp 2 1406.2 679.46
Hans11SwedDesp 2 1156.2 464.07
Hans11SwedDesp 2 4945.3 364.58
Keybl is the code for the site. The formula for the pooled SD is:
s = sqrt(((n1-1)*s1^2 + (n2-1)*s2^2) / (n1+n2-2))
(Sorry I can't post pictures and did not find a link that would directly go to the formula)
Where 2 is the number of groups and therefore will change depending on site. I know this is used for t-tests and two groups one wants to compare. In this case I'm not planning to compare these groups. My professor suggested that I use this formula to get a weighted sd. I didn't find an R function that incorporates this formula in the way I need it, therefore I tried to build my own. I am, however, new to R and not very good at making functions and loops, therefore I hope for your help.
This is what I got so far:
sd=function (data) {
nc1=data[z,"nc"]
sc1=data[z, "sc"]
nc2=data[z+1, "nc"]
sc2=data[z+1, "sc"]
sd1=(nc1-1)*sc1^2 + (nc2-1)*sc2^2
sd2=sd1/(nc1+nc2-length(nc1))
sqrt(sd2)
}
splitdf=split(df, with(df, df$keybl), drop = TRUE)
for (c in 1:length(splitdf)) {
for (i in 1:length(splitdf[[i]])) {
a = (splitdf[[i]])
b =sd(a)
}
}
1) The function itself is not correct as it gives slightly lower values than it should and I don't understand why. Could it be that it does not stop when z+1 has reached the last row? If so, how can that be corrected?
2) The loop is totally wrong but it is what I could come up with after several hours of no success.
Can anybody help me?
Thanks,
Antra
What you're trying to do would benefit from a more general formula which will make it easier. If you didn't need to break it into pieces by the keybl variable you'd be done.
dd <- df #df is not a good name for a data.frame variable since df has a meaning in statistics
dd$df <- dd$n-1
pooledSD <- sqrt( sum(dd$sd^2 * dd$df) / sum(dd$df) )
# note, in this case I only pre-calculated df because I'll need it more than once. The sum of squares, variance, etc. are only used once.
An important general principle in R is that you use vector math as much as possible. In this trivial case it won't matter much but in order to see how to do this on large data.frame objects where compute speed is more important read on.
# First use R's vector facilities to define the variables you need for pooling.
dd$df <- dd$n-1
dd$s2 <- dd$sd^2 # sd isn't a good name for standard deviation variable even in a data.frame just because it's a bad habit to have... it's already a function and standard deviations have a standard name
dd$ss <- dd$s2 * dd$df
And now just use convenience functions for splitting and calculating the necessary sums. Note only one function is executed here in each implicit loop (*apply, aggregate, etc. are all implicit loops executing functions many times).
ds <- aggregate(ss ~ keybl, data = dd, sum)
ds$df <- tapply(dd$df, dd$keybl, sum) #two different built in methods for split apply, we could use aggregate for both if we wanted
# divide your ss by your df and voila
ds$s2 <- ds$ss / ds$df
# and also you can easily get your sd
ds$s <- sqrt(ds$s2)
And the correct answer is:
keybl ss df s2 s
1 Chen12ChinDesp 2.508800e+00 2 1.254400e+00 1.120000
2 Hans11SwedDesp 8.099454e+05 3 2.699818e+05 519.597740
3 Mato10GermDesp 4.860000e+02 6 8.100000e+01 9.000000
4 Vest02DenmDesp 8.106832e+02 16 5.066770e+01 7.118125
5 Wawa07ChinDesp 2.720000e-04 17 1.600000e-05 0.004000
This looks much less concise than other methods (like 42-'s answer) but if you unroll those in terms of how many R commands are actually being executed this is much more concise. For a short problem like this either way is fine but I thought I'd show you the method that uses the most vector math. It also highlights why those convenient implicit loop functions are available, for expressiveness. If you used for loops to accomplish the same then the temptation would be stronger to put everything in the loop. This can be a bad idea in R.
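To see how the general formula relates to the two-group formula from the question, here is a small check on the Mato10GermDesp site (n = 4 and 4, sd = 12.6 and 1.8, taken from the table in the question). The variable names are just for illustration.

```r
# The two rows for Mato10GermDesp
n <- c(4, 4)
s <- c(12.6, 1.8)

# Two-group textbook formula, as written in the question
two_group <- sqrt(((n[1] - 1) * s[1]^2 + (n[2] - 1) * s[2]^2) / (n[1] + n[2] - 2))

# General k-group form: weight each variance by its degrees of freedom
general <- sqrt(sum((n - 1) * s^2) / sum(n - 1))

agree <- isTRUE(all.equal(two_group, general))
```

Both give 9, the value in the s column for that site, because sum(n - 1) is the same thing as n1 + n2 - 2 when there are two groups.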
The pooled SD under the assumption of independence (so the covariance terms can be assumed to be zero) will be: sqrt( sum_over_groups[ (n-1)*var ] / (sum(n) - N_groups) )
lapply( split(dat, dat$keybl),
function(dd) sqrt( sum( dd$sd^2 * (dd$n-1) )/(sum(dd$n)-nrow(dd)) ) )
#-------------------------
$Chen12ChinDesp
[1] 1.12
$Hans11SwedDesp
[1] 519.5977
$Mato10GermDesp
[1] 9
$Vest02DenmDesp
[1] 7.118125
$Wawa07ChinDesp
[1] 0.004

R: a for statement wanted that allows for the use of values from each row

I'm pretty new to R..
I'm reading in a file that looks like this:
1 2 1
1 4 2
1 6 4
and storing it in a matrix:
matrix <- read.delim("filename",...)
Does anyone know how to make a for statement that adds up the first and last numbers of one row per iteration ?
So the output would be:
2
3
5
Many thanks!
Edit: My bad, I should have made this more clear...
I'm actually more interested in an actual for-loop where I can use multiple values from any column on that specific row in each iteration. The adding up numbers was just an example. I'm actually planning on doing much more with those values (for more than 2 columns), and there are many rows.
So something in the lines of:
for (i in matrix_i) #where i means each row
{
#do something with column j and column x from row i, for example add them up
}
If you want to get a vector out of this, it is simpler (and marginally computationally faster) to use apply rather than a for statement. In this case,
sums = apply(m, 1, function(x) x[1] + x[3])
Also, you shouldn't call your variables "matrix" since that is the name of a built in function.
ETA: There is an even easier and computationally faster way. R lets you pull out columns and add them together (since they are vectors, they will get added elementwise):
sums = m[, 1] + m[, 3]
m[, 1] means the first column of the data.
Something along these lines should work rather efficiently (i.e. this is a vectorised approach):
m <- matrix(c(1,1,1,2,4,6,1,2,4), 3, 3)
# [,1] [,2] [,3]
# [1,] 1 2 1
# [2,] 1 4 2
# [3,] 1 6 4
v <- m[,1] + m[,3]
# [1] 2 3 5
You probably can use an apply function or a vectorized approach --- and if you can you really should, but you ask for how to do it in a for loop, so here's how to do that. (Let's call your matrix m.)
results <- numeric(nrow(m))
for (row in seq_len(nrow(m))) {
results[row] <- m[row, 1] + m[row, 3]
}
This is probably one of those 100 ways to skin a cat questions. You are perhaps looking for the rowSums function, although you might also find many answers using the apply function.
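For what it's worth, a quick sketch of the rowSums route, assuming the matrix is called m as in the other answers. Selecting the columns first keeps the middle column out of the sum.

```r
# Same 3 x 3 example matrix as in the question
m <- matrix(c(1, 1, 1, 2, 4, 6, 1, 2, 4), 3, 3)

# Sum only the first and last columns of each row
sums <- rowSums(m[, c(1, 3)])

# Explicit loop over every row, for comparison
results <- numeric(nrow(m))
for (row in seq_len(nrow(m))) {
  results[row] <- m[row, 1] + m[row, 3]
}
```

Both sums and results come out as 2 3 5 here. The loop form is the one to extend if each row needs more involved work than a sum.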

Summary in R for frequency tables?

I have a set of user recommendations
review=matrix(c(5:1,10,2,1,1,2), nrow=5, ncol=2, dimnames=list(NULL,c("Star","Votes")))
and wanted to use summary(review) to show basic properties mean, median, quartiles and min max.
But it gives back the summary of both columns. I refrain from using data.frame because the factors 'Star' are ordered.
How can I tell R that Star is an ordered factor holding the numeric score and that Votes is its frequency?
I'm not exactly sure what you mean by taking the mean in general if Star is supposed to be an ordered factor. However, in the example you give where Star is actually a set of numeric values, you can use the following:
library(Hmisc)
R> review=matrix(c(5:1,10,2,1,1,2), nrow=5, ncol=2, dimnames=list(NULL,c("Star","Votes")))
R> wtd.mean(review[, 1], weights = review[, 2])
[1] 4.0625
R> wtd.quantile(review[, 1], weights = review[, 2])
0% 25% 50% 75% 100%
1.00 3.75 5.00 5.00 5.00
I don't understand what's the problem. Why shouldn't you use data.frame?
rv <- data.frame(star = ordered(review[, 1]), votes = review[, 2])
You should convert your data.frame to a vector:
( vts <- with(rv, rep(star, votes)) )
[1] 5 5 5 5 5 5 5 5 5 5 4 4 3 2 1 1
Levels: 1 < 2 < 3 < 4 < 5
Then do the summary... I just don't know what kind of summary, since summary will bring you back to the start. O_o
summary(vts)
1 2 3 4 5
2 1 1 2 10
EDIT (on @Prasad's suggestion)
Since vts is an ordered factor, you should convert it to numeric, then calculate the summary (at this moment I will disregard the background statistical issues):
nvts <- as.numeric(levels(vts)[vts]) ## numeric conversion
summary(nvts) ## "ordinary" summary
fivenum(nvts) ## Tukey's five number summary
Just to clarify -- when you say you would like "mean, median, quartiles and min/max", you're talking in terms of number of stars? e.g mean = 4.062 stars?
Then using aL3xa's code, would something like summary(as.numeric(as.character(vts))) be what you want?
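If treating the stars as plain numbers is acceptable, a base-R sketch without Hmisc could look like the following. It expands the frequency table with rep and then uses the ordinary summary functions; the name stars is my own.

```r
review <- matrix(c(5:1, 10, 2, 1, 1, 2), nrow = 5, ncol = 2,
                 dimnames = list(NULL, c("Star", "Votes")))

# One entry per vote: ten 5s, two 4s, one 3, one 2, two 1s
stars <- rep(review[, "Star"], times = review[, "Votes"])

star_mean <- mean(stars)      # 4.0625, same as wtd.mean above
star_median <- median(stars)  # 5
star_summary <- summary(stars)

# weighted.mean() from the stats package gives the mean without expanding
star_wmean <- weighted.mean(review[, "Star"], review[, "Votes"])
```

Expanding is fine for small tables like this one; for very large vote counts the weighted functions avoid building a long vector.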

Vectorize my thinking: Vector Operations in R

So earlier I answered my own question on thinking in vectors in R. But now I have another problem which I can't 'vectorize.' I know vectors are faster and loops slower, but I can't figure out how to do this in a vector method:
I have a data frame (which for sentimental reasons I like to call my.data) which I want to do a full marginal analysis on. I need to remove certain elements one at a time and 'value' the data frame then I need to do the iterating again by removing only the next element. Then do again... and again... The idea is to do a full marginal analysis on a subset of my data. Anyhow, I can't conceive of how to do this in a vector efficient way.
I've shortened the looping part of the code down and it looks something like this:
for (j in my.data$item[my.data$fixed==0]) { # <-- selects the items I want to loop
# through
my.data.it <- my.data[my.data$item!= j,] # <-- this kicks item j out of the list
sum.data <-aggregate(my.data.it, by=list(year), FUN=sum, na.rm=TRUE) #<-- do an
# aggregation
do(a.little.dance) && make(a.little.love) -> get.down(tonight) # <-- a little
# song and dance
delta <- (get.love) # <-- get some love
delta.list<-append(delta.list, delta, after=length(delta.list)) #<-- put my love
# in a vector
}
So obviously I hacked out a bunch of stuff in the middle, just to make it less clumsy. The goal would be to remove the j loop using something more vector efficient. Any ideas?
Here's what seems like another very R-type way to generate the sums. Generate a vector that is as long as your input vector, containing nothing but the repeated sum of n elements. Then, subtract your original vector from the sums vector. The result: a vector (isums) where each entry is your original vector less the ith element.
> (my.data$item[my.data$fixed==0])
[1] 1 1 3 5 7
> sums <- rep(sum(my.data$item[my.data$fixed==0]),length(my.data$item[my.data$fixed==0]))
> sums
[1] 17 17 17 17 17
> isums <- sums - (my.data$item[my.data$fixed==0])
> isums
[1] 16 16 14 12 10
Strangely enough, learning to vectorize in R is what helped me get used to basic functional programming. A basic technique would be to define your operations inside the loop as a function:
data = ...;
items = ...;
leave_one_out = function(i) {
data1 = data[items != i];
delta = ...; # some operation on data1
return(delta);
}
for (j in items) {
delta.list = cbind(delta.list, leave_one_out(j));
}
To vectorize, all you do is replace the for loop with the sapply mapping function:
delta.list = sapply(items, leave_one_out);
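A runnable toy version of that pattern, assuming the "operation" is just a sum over the remaining values (the real delta computation from the question is not shown, so this is only a stand-in). The values are the ones printed in the other answer.

```r
values <- c(1, 1, 3, 5, 7)

# Drop the i-th element and summarise what is left
leave_one_out <- function(i) {
  sum(values[-i])
}

# sapply replaces the explicit for loop and the repeated append/cbind
loo_sums <- sapply(seq_along(values), leave_one_out)

# For a plain sum there is a fully vectorised shortcut
shortcut <- sum(values) - values
```

Both loo_sums and shortcut are 16 16 14 12 10. The sapply form still works when the summary inside leave_one_out has no algebraic shortcut.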
This is no answer, but I wonder if any insight lies in this direction:
> tapply((my.data$item[my.data$fixed==0])[-1], my.data$year[my.data$fixed==0][-1], sum)
tapply produces a table of statistics (sums, in this case; the third argument) grouped by the parameter given as the second argument. For example
2001 2003 2005 2007
1 3 5 7
The [-1] notation drops observation (row) one from the selected rows. So, you could loop and use [-i] on each loop
for (i in 1:length(my.data$item)) {
tapply((my.data$item[my.data$fixed==0])[-i], my.data$year[my.data$fixed==0][-i], sum)
}
keeping in mind that if you have any years with only 1 observation, then the tables returned by the successive tapply calls won't have the same number of columns (i.e., if you drop out the only observation for 2001, then 2003, 2005, and 2007 would be the only columns returned).
