I want to test the SAS code my team has produced by reproducing it in R and comparing the estimates we get from each, but being new to R I am not having much luck. In SAS we have written three macros to produce three separate estimates (HFS010, HFS011, HFS012); an example of one is given here:
%macro HFS010 (peninc_var, pengn_var, pentax_var, pentype_var, HFS010_x_var);
do i = 1 to dim(pentypex);
if &pentype_var = 1 and &pengn_var = 1 then &HFS010_x_var = &peninc_var;
else if &pentype_var = 1 and &pengn_var = 2 then &HFS010_x_var = &peninc_var + &pentax_var;
end;
%mend HFS010;
Basically the idea is that each macro produces an estimate for gross pension income (so where applicable adds tax deducted from pensions on to pension income value). There are three macros as we want separate estimates for cases where pentype = 1 (HFS010), pentype = 2 (HFS011) and pentype = 3 to 7 (HFS012) and the survey accepts up to 16 entries for pensions.
To attempt to produce an equivalent of the above code in R, I wrote the following;
for(i in 1:16) {
pens_data[[paste0("HFS010_",i)]] <- case_when(
pens_data[[paste0("pentype",i)]] == 1 & pens_data[[paste0("pengn",i)]] == 1 ~ pens_data[[paste0("peninc",i)]],
pens_data[[paste0("pentype",i)]] == 1 & pens_data[[paste0("pengn",i)]] == 2 ~ pens_data[[paste0("peninc",i)]] + pens_data[[paste0("pentax",i)]],
TRUE ~ 0)
}
This code does not produce errors but upon inspecting the estimates, there were some cases that should have estimates that were left blank.
Does anyone know of a way to write a macro in R? I thought of writing a function potentially for each of HFS010, HFS011, HFS012 but being new to R am not sure how to go about this.
If anyone has any suggestions as to why my R code isn't producing the correct estimates, or how they would write the equivalent of a SAS macro in R it would be greatly appreciated! I have tried to use defmacro but could not get this to work without errors.
Thanks so much!
Ashlee
There are many ways to write this in R. But first a couple of comments:
R works well with vectors, so we should manipulate whole vectors wherever possible. This is much faster and avoids slow for loops with side effects.
To help others answer, please provide a reproducible example that covers both use cases.
For example:
set.seed(1)
dx <- data.frame(
peninc_var=sample(c(1,3),5,TRUE),
pengn_var=sample(c(1,2),5,TRUE),
pentax_var=1:5)
Here is an option in base R. I create the new variable HFS010_x_var using ifelse:
dx$HFS010_x_var <-
with(dx,{
## the final "no" branch assigns NA when neither condition holds
ifelse(peninc_var==1 & pengn_var==1,peninc_var,
ifelse(peninc_var==1 & pengn_var==2,peninc_var + pentax_var,NA))
})
peninc_var pengn_var pentax_var HFS010_x_var
1: 1 2 1 2
2: 1 2 2 3
3: 3 2 3 NA
4: 3 2 4 NA
5: 1 1 5 1
Another option (with more syntactic sugar) is to use data.table:
library(data.table)
setDT(dx)
dx[peninc_var==1 & pengn_var==1,HFS010_x_var := peninc_var]
dx[peninc_var==1 & pengn_var==2,HFS010_x_var := peninc_var+pentax_var]
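On the question of an R equivalent of the SAS macro: a plain function usually does the job. The following is only a sketch, assuming the wide column layout from the question (pentype1 to pentype16, pengn1, peninc1, pentax1 and so on). The helper names gross_pension and add_estimates and the small pens_data frame are made up for illustration. Rows that match neither condition get NA here, as in the ifelse example above, rather than 0.

```r
# Hypothetical helper: estimate for one pension slot.
# `types` is the set of pentype codes the estimate covers,
# e.g. 1 for HFS010, 2 for HFS011, 3:7 for HFS012.
gross_pension <- function(pentype, pengn, peninc, pentax, types) {
  in_type <- pentype %in% types  # %in% gives FALSE (not NA) for missing values
  ifelse(in_type & pengn %in% 1, peninc,                  # pengn == 1: income as is
         ifelse(in_type & pengn %in% 2, peninc + pentax,  # pengn == 2: add tax
                NA_real_))                                 # otherwise: missing
}

# Hypothetical wrapper: loop over the numbered columns and add
# <prefix>_1, <prefix>_2, ... to the data frame.
add_estimates <- function(data, prefix, types, n = 16) {
  for (i in seq_len(n)) {
    data[[paste0(prefix, "_", i)]] <- gross_pension(
      pentype = data[[paste0("pentype", i)]],
      pengn   = data[[paste0("pengn", i)]],
      peninc  = data[[paste0("peninc", i)]],
      pentax  = data[[paste0("pentax", i)]],
      types   = types
    )
  }
  data  # return the modified copy
}

# Made-up data with two pension slots instead of 16
pens_data <- data.frame(
  pentype1 = c(1, 1, 2),       pengn1  = c(1, 2, 1),
  peninc1  = c(100, 200, 300), pentax1 = c(10, 20, 30),
  pentype2 = c(3, 1, NA),      pengn2  = c(2, 1, 1),
  peninc2  = c(50, 60, 70),    pentax2 = c(5, 6, 7)
)
pens_data <- add_estimates(pens_data, "HFS010", types = 1,   n = 2)
pens_data <- add_estimates(pens_data, "HFS012", types = 3:7, n = 2)
```

The function returns a modified copy, so the result has to be assigned back. Note that peninc + pentax is NA whenever pentax is missing, which is one thing worth checking if estimates come out blank unexpectedly.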
I'm starting to learn R and I'm having a hard time making changes to the names of values in a factor. I've tried using revalue and recode but am still seeing the original names when I look at the dataframe.
Here's what the DF looks like:
head(freecut)
gender oldness student_loaniness homeland
1 0 20 4 Eurasia
2 1 25 4 Oceana
3 1 56 2 Eastasia
4 0 65 6 Eastasia
5 1 50 5 Oceana
6 0 20 5 Eastasia
And here are the coding attempts:
revalue(freecut$homeland, c("Eastasia" = "East_Asia", "Eurasia" = "Asiope",
"Oceana" = "Nemoville"))
recode(freecut$homeland, Eastasia = "East_Asia", Eurasia = "Asiope",
Oceana = "Nemoville")
After running the code the DF looks exactly the same. I know that in Python I would have to throw in "inplace = TRUE" to make changes stick--not sure what I need to do here (or what I'm missing).
R doesn't modify in place; you have to assign results, either back to the original variable to modify it, or to a new variable. This is a paradigm of functional programming, and R is a functional programming language.
If you have x = 1, running x + 1 will evaluate and print the result, 2, but x is not changed. If you want to overwrite x with the modified value, you run x = x + 1.
In just the same way, running recode will evaluate and print a result, but if you want to modify the column in your data frame, you need to explicitly assign it with freecut$homeland = recode(...).
There are a few exceptions in add-on packages. For example, the data.table package defines some set* operators which do modify objects in place. data.table is fantastic, especially if you need efficiency, but if you are just starting with R I would recommend getting familiar with the basics first.
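A minimal sketch of the difference between calling a function and assigning its result. The data are made up, and base R's sub() is used purely as a stand-in for recode() or revalue() so that the example needs no extra packages:

```r
# Made-up data; sub() stands in for recode()/revalue() here
freecut <- data.frame(homeland = c("Eurasia", "Oceana", "Eastasia"))

# Called without assignment: the result is printed, then discarded
sub("Eastasia", "East_Asia", freecut$homeland)
after_call <- freecut$homeland    # still the original values

# Assigned back: the column is actually replaced
freecut$homeland <- sub("Eastasia", "East_Asia", freecut$homeland)
after_assign <- freecut$homeland
```

The same pattern applies to any function that returns a modified copy of its input.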
In addition to Gregor's answer which addresses more fundamental issues, you can in your particular case use levels<-:
levels(freecut$homeland) <- c("first", "second", "third")
# order is important: the new names are matched to levels(freecut$homeland) by position
Or if you are ready to join the dark side, consider macros from gtools package. The first steps are described e.g. in https://www.r-bloggers.com/macros-in-r/. Nobody is using macros in R but I don't know why. Maybe they're dangerous but maybe they just seem obscure.
I am new to R (Economist with background in Stata) and I am having trouble getting a nested for loop to work for me. I know the issue is that I don't have a good understanding of how to use the loop counter as part of a variable name.
A bit of background. I have data frame with data on average rental rates for homes of different size (1 bedroom, 2 bedroom, etc) and data on annual earnings (mean, median, and various percentiles). I am trying to generate a series of new columns containing the ratio of these two things (rental rate / mean earnings).
Specifically my variables are:
beds1, beds2, beds3, beds4
mean, median, p10, p25, p75, p90
So you see I need to generate 24 new columns of cost/earnings data. I could write out 24 lines of code but I don't want to. More importantly, I want to learn an efficient way of doing this in R. In Stata I could do this very simply using a nested for loop, but I can't get it to work in R. Here is my code so far.
for (i in 1:4) {
stat <- c("median", "mean", "p10", "p25", "p75","p90")
for (x in stat) {
df$beds[i]_[x] <- round((df$beds[i]/df$[x]),digits=3)
}
}
When I run this code the error I get is
Error: unexpected input in:
" for (x in stat) {
df$beds[i]_"
> }
Error: unexpected '}' in " }"
> }
Error: unexpected '}' in "}"
I have tried to use the double brackets [[]] but that didn't change the results. If anyone has some insight into why the dynamic variables names aren't working please let me know. Even better, since I guess loops are evil in R, if anyone knows a way to use lapply to get this done, I would love to hear that too.
EDIT
Thanks @Spacedman for the comment. I think I am getting what you're saying. So does that mean that there simply isn't any way to do what I want to do in R?
var1 <- c("beds1", "beds2")
var2 <- c("mean", "median")
for (i in 1:2) {
for (j in 1:2) {
df$var1[i]_var2[j] <- df$var1[i]/df$var2[j]
}
}
I think this should grab the elements of the lists var1 and var2 so that when i=1 and j=1, df$var1[i]/df$var2[j] should mean df$beds1/df$mean. Or would R get mad and think I was trying to divide strings?
FINAL EDIT WITH ANSWER FROM @SPACEDMAN
Thanks @Spacedman. I loved your spoiler and thank you for providing additional help. I didn't fully grasp the difference between the two ways of referring to columns after your last post, but I think I have a better idea now. I did a bit of tweaking and now I have something that works perfectly. Thanks again!
beds <- c("beds1", "beds2", "beds3", "beds4")
stat <- c("median", "mean", "p10", "p25", "p75","p90")
for(i in beds){
for(x in stat){
res = paste0(i,"_",x)
df[[res]]=round(df[[i]]/df[[x]],digits=3)
}
}
R is not a macro expansion language like other languages you might be used to.
x[i], if i=123, does not "expand" into x123. It gets the value of the 123rd element of the vector, x.
So df$beds[i] tries to get the i'th element of a vector df$beds.
You need to know two things:
How to construct strings from other strings.
For this you can use paste0:
> for(i in 1:4){
+ print(paste0("beds",i))
+ }
[1] "beds1"
[1] "beds2"
[1] "beds3"
[1] "beds4"
How to access columns by names.
For this you can use double square brackets. In a list:
> z = list()
> n = "thing"
Double square brackets evaluate their index and use that. So:
> z[[n]] = 99
Will set z$thing, but dollar sign indexing is literal, so:
> z$n = 123
will set z$n:
> z
$thing
[1] 99
$n
[1] 123
Hopefully that's enough hints to get you through. It should all be covered in basic R tutorials online.
Spoiler
If you want to work out how to do it yourself, look away now...
First, let's create a sample data frame - you should include something like this in your question so we have common test data to work on. I'll just have three beds and two stats:
> df = data.frame(
beds1=c(1,2,3),
beds2=c(5,2,3),
beds3=c(6,6,6),
mean=c(8,4,3),
median=c(1,7,4))
> df
beds1 beds2 beds3 mean median
1 1 5 6 8 1
2 2 2 6 4 7
3 3 3 6 3 4
Now the work. We loop over the bed number and the character stats. The bed column name is stored in bed by pasting "beds" to the number i. We compute the name of the result column (res) for a given bed number and stat by pasting "beds" to i and "_" and the name of the stat in x.
Then set the new resulting column to the value by dividing the beds number by the stat. We use [[z]] to get the columns by name:
> for(i in 1:3){
stats=c("mean","median")
for(x in stats){
bed = paste0("beds",i)
res = paste0("beds",i,"_",x)
df[[res]]=round(df[[bed]]/df[[x]],digits=3)
}
}
Resulting in....
> df
beds1 beds2 beds3 mean median beds1_mean beds1_median beds2_mean beds2_median
1 1 5 6 8 1 0.125 1.000 0.625 5.000
2 2 2 6 4 7 0.500 0.286 0.500 0.286
3 3 3 6 3 4 1.000 0.750 1.000 0.750
beds3_mean beds3_median
1 0.75 6.000
2 1.50 0.857
3 2.00 1.500
>
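Since the question also asked about an apply-style alternative, here is one possible sketch of the same computation without explicit loops. It assumes a cut-down version of the sample data frame above; the names combos and ratios are made up for illustration. expand.grid lists every bed/stat pair and Map computes one ratio column per pair:

```r
# Cut-down copy of the sample data above
df <- data.frame(
  beds1 = c(1, 2, 3), beds2 = c(5, 2, 3),
  mean  = c(8, 4, 3), median = c(1, 7, 4)
)
beds  <- c("beds1", "beds2")
stats <- c("mean", "median")

# One row per (bed, stat) combination
combos <- expand.grid(bed = beds, stat = stats, stringsAsFactors = FALSE)

# One ratio vector per combination, looked up by column name with [[ ]]
ratios <- Map(
  function(bed, stat) round(df[[bed]] / df[[stat]], digits = 3),
  combos$bed, combos$stat
)
names(ratios) <- paste0(combos$bed, "_", combos$stat)

# Add all new columns in one assignment
df[names(ratios)] <- ratios
```

The values match the nested loop; only the order of the new columns differs, because expand.grid varies the first variable fastest.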
I am using the dative data frame (from the languageR package) in R, and I am trying to plot LengthOfRecipient against Modality for only the rows where PronomOfRec == 'nonpronominal'. I gathered the LengthOfRecipient values for those rows:
library('languageR')
lor.np = dative[dative$PronomOfRec == 'nonpronominal',]$LengthOfRecipient
I have tried nesting this subset function, and applied vectors, but I cannot figure out a way to then access the Modality column for only the items in lor.np and store it in mod.np, so that I can plot and analyze the data with:
boxplot(lor.np, mod.np)
I'm very new to R and the syntax is extremely confusing. Any help would be very appreciated. Thanks in advance!
It might be easier to select all the columns you want at once and then use the formula feature in boxplot rather than using vectors:
library('languageR')
lor.np <- dative[dative$PronomOfRec == 'nonpronominal',
c('LengthOfRecipient','Modality')]
head(lor.np)
# LengthOfRecipient Modality
# 2 2 written
# 3 1 written
# 5 2 written
# 6 2 written
# 7 2 written
# 11 2 written
## but you don't even need to select the columns:
lor.np <- dative[dative$PronomOfRec == 'nonpronominal', ]
boxplot(LengthOfRecipient ~ Modality, lor.np)
After looking at the data, you don't need droplevels, but here is an example when it may be useful:
dat1 <- dative[dative$Modality == 'written', ]
boxplot(LengthOfRecipient ~ Modality, dat1)
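To make the droplevels point concrete without needing languageR, here is a small sketch on made-up data (the data frame dat and its values are hypothetical; only the column names mirror dative). Subsetting rows keeps every factor level, including ones that no longer occur:

```r
# Made-up stand-in for the dative data
dat <- data.frame(
  LengthOfRecipient = c(1, 2, 2, 3, 1, 4),
  Modality = factor(c("written", "written", "spoken",
                      "spoken", "written", "spoken"))
)

dat1 <- dat[dat$Modality == "written", ]
levels(dat1$Modality)   # "spoken" is still a level, although no rows use it
table(dat1$Modality)    # the unused level shows up with a count of 0

dat2 <- droplevels(dat1)
levels(dat2$Modality)   # only the levels that actually occur
```

With the unused level still present, boxplot(LengthOfRecipient ~ Modality, dat1) reserves an empty slot for it; passing droplevels(dat1) instead removes that slot.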
I am totally new with R and I'll appreciate the time anyone bothers to take with helping me with these probably simple tasks. I'm just at a loss with all the resources available and am not sure where to start.
My data looks something like this:
subject sex age nR medL medR meanL meanR pL ageBin
1 0146si 1 67 26 1 1 1.882353 1.5294118 0.5517241 1
2 0162le 1 72 5 2 1 2 1.25 0.6153846 1
3 0323er 1 54 30 2.5 3 2.416667 2.5 0.4915254 0
4 0811ne 0 41 21 2 2 2 1.75 0.5333333 0
5 0825en 1 44 31 2 2 2.588235 1.8235294 0.5866667 0
Though the actual data has many, many more subjects and variables.
The first thing I need to do is compare the 'ageBin' values. 0 = under age 60, 1 = over age 60. I want to compare stats between these two groups. So I guess the first thing I need is the ability to recognize the different ageBin values and make those the two rows.
Then I need to do things like calculate the frequency of the values in the two groups (ie. how many instances of 1 and 0), the mean of the 'age' variable, the median of the age variable, number of males (ie. sex = 1), the mean of meanL, etc. Simple things like that. I just want them to be all in one table.
So an example of a potential table might be
n nMale mAge
ageBin 0 14 x x
ageBin 1 14 x x
I could easily do this stuff in SPSS or even Excel...I just really want to get started with R. So any resource or advice someone could offer to point me in the right direction would be so, so helpful. Sorry if this sounds unclear...I can try to clarify if necessary.
Thanks in advance, anyone.
Use the plyr package to split up the data structure, apply a function to each piece, and combine all the results back together.
install.packages("plyr") # install package from CRAN
library(plyr) # load the package into R
dd <- list(subject=c("0146si", "0162le", "0323er", "0811ne", "0825en"),
sex = c(1,1,1,0,1),
age = c(67,72,54,41,44),
nR = c(26,5,30,21,31),
medL = c(1,2,2.5,2,2),
medR = c(1,1,3,2,2),
meanL = c(1.882352,2,2.416667,2,2.588235),
meanR = c(1.5294118,1.25,2.5,1.75,1.8235294),
pL = c(0.5517241,0.6153846,0.4915254,0.5333333,0.5866667),
ageBin = c(1,1,0,0,0))
dd <- data.frame(dd) # convert to data.frame
Using the ddply function, you can compute per-group summaries, such as the number of males and the mean age in the two groups:
ddply(dd, .(ageBin), summarise, nMale = sum(sex), mAge = mean(age))
ageBin nMale mAge
0 2 46.33333
1 2 69.50000
The following is a very useful resource by Sean Anderson for getting up to speed with the plyr package.
A more comprehensive resource by Hadley Wickham, the package author, can be found here
Try the by function:
if your data frame is named df:
by(data=df, INDICES=df$ageBin, FUN=summary)
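If the goal is a single table like the one sketched in the question, one base R possibility is to split the data by ageBin and build one summary row per group. This is only a sketch on a cut-down copy of the example data; the names dd and summary_table are illustrative:

```r
# Cut-down copy of the example data (only the columns used below)
dd <- data.frame(
  sex    = c(1, 1, 1, 0, 1),
  age    = c(67, 72, 54, 41, 44),
  ageBin = c(1, 1, 0, 0, 0)
)

# One summary row per ageBin value, stacked into a single data frame
summary_table <- do.call(rbind, lapply(split(dd, dd$ageBin), function(g) {
  data.frame(
    ageBin = g$ageBin[1],
    n      = nrow(g),           # group size
    nMale  = sum(g$sex == 1),   # number of males (sex == 1)
    mAge   = mean(g$age),
    medAge = median(g$age)
  )
}))
summary_table
```

Further statistics, such as the mean of meanL, can be added as extra columns inside the data.frame() call.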
G'day, I have a list of individuals that are grouped by place. I want to produce a new variable that gives a number to each individual dependent on their place. What I would like my data to look like is:
place individual
here 1
here 2
here 3
there 1
there 2
somewhere 1
somewhere 2
I have written this:
nest<-data.frame(location=c("one","one","two", "three", "three", "three"))
individual<- function(x)
{
pp = 1
jj = 1
for (i in 2:length(x)){
if (x[i] == x[pp]){
res<- jj+1
pp = pp + 1
jj = jj + 1
}
else{
res<- 1
pp = pp + 1
jj = 1
}
}
return(res)
}
test<- individual(nest$location)
test
I am pretty sure this does what I want; however, I can't figure out how to return more than one result value. How do I change this function to return a result for each value of location? Or is there an easier way to do this with a pre-existing R package?
P.S. on a side note, I start the function from nest$individual[2] because when it starts the function from 1 and it tries to look for the previous value (which doesn't exist) it returns an error. If anyone has a thought on how to get around this I would love to hear it. Cheers.
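For reference, a minimal sketch of how the loop could be changed to return one value per element: preallocate a result vector and fill it by position. Checking i > 1 before looking at the previous element also avoids the problem described in the P.S. This is one possible rewrite, not the only one, and it assumes rows of the same location are adjacent:

```r
nest <- data.frame(location = c("one", "one", "two", "three", "three", "three"))

individual <- function(x) {
  res <- integer(length(x))   # one slot per element of x
  for (i in seq_along(x)) {
    # Same value as the previous row: continue counting; otherwise restart at 1
    res[i] <- if (i > 1 && x[i] == x[i - 1]) res[i - 1] + 1L else 1L
  }
  res   # the whole vector is returned, not just the last value
}

nest$individual <- individual(nest$location)
nest
```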
Maybe something like this...?
ddply(nest,.(location),transform,individual = seq_len(length(location)))
location individual
1 one 1
2 one 2
3 three 1
4 three 2
5 three 3
6 two 1
(ddply is in the plyr package)
I'm sure there are slick ways of doing this with rle as well. Indeed, just for fun:
do.call(c,lapply(rle(as.character(nest$location))$lengths,seq_len))
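A base R alternative that needs no extra package is ave, which applies a function within groups and returns the results in the original row order. A short sketch on the same example data:

```r
nest <- data.frame(location = c("one", "one", "two", "three", "three", "three"))

# Number the rows within each location, keeping the original row order
nest$individual <- ave(seq_along(nest$location),
                       as.character(nest$location),
                       FUN = seq_along)
nest

# Unlike a run-length approach, this also works when groups are not contiguous
scattered <- ave(1:3, c("a", "b", "a"), FUN = seq_along)
```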