R equivalent of Stata's for-loop over local macro list of stubnames

I'm a Stata user transitioning to R, and there's one Stata crutch that I find hard to give up, because I don't know how to do the equivalent with R's "apply" functions.
In Stata, I often generate a local macro list of stubnames and then loop over that list, calling on variables whose names are built off of those stubnames.
For a simple example, imagine that I have the following dataset:
study_id year varX06 varX07 varX08 varY06 varY07 varY08
1 6 50 40 30 20.5 19.8 17.4
1 7 50 40 30 20.5 19.8 17.4
1 8 50 40 30 20.5 19.8 17.4
2 6 60 55 44 25.1 25.2 25.3
2 7 60 55 44 25.1 25.2 25.3
2 8 60 55 44 25.1 25.2 25.3
and so on...
I want to generate two new variables, varX and varY that take on the values of varX06 and varY06 respectively when year is 6, varX07 and varY07 respectively when year is 7, and varX08 and varY08 respectively when year is 8.
The final dataset should look like this:
study_id year varX06 varX07 varX08 varY06 varY07 varY08 varX varY
1 6 50 40 30 20.5 19.8 17.4 50 20.5
1 7 50 40 30 20.5 19.8 17.4 40 19.8
1 8 50 40 30 20.5 19.8 17.4 30 17.4
2 6 60 55 44 25.1 25.2 25.3 60 25.1
2 7 60 55 44 25.1 25.2 25.3 55 25.2
2 8 60 55 44 25.1 25.2 25.3 44 25.3
and so on...
To clarify, I know that I can do this with melt and reshape commands - essentially converting this data from wide to long format, but I don't want to resort to that. That's not the intent of my question.
My question is about how to loop over a local macro list of stubnames in R and I'm just using this simple example to illustrate a more generic dilemma.
In Stata, I could generate a local macro list of stubnames:
local stub varX varY
And then loop over the macro list. I can generate a new variable varX or varY and replace the new variable value with the value of varX06 or varY06 (respectively) if year is 6 and so on.
foreach i of local stub {
display "`i'"
gen `i'=.
replace `i'=`i'06 if year==6
replace `i'=`i'07 if year==7
replace `i'=`i'08 if year==8
}
The last block is the part I find hardest to replicate in R. When I write `i'06, Stata takes the string "varX", concatenates it with the string "06", and then returns the value of the variable varX06. Additionally, when I write `i', Stata returns the string "varX" and not the literal string "`i'".
How do I do these things with R?
I've searched through Muenchen's "R for Stata Users", googled the web, and searched through previous posts here at StackOverflow but haven't been able to find an R solution.
I apologize if this question is elementary. If it's been answered before, please direct me to the response.
Thanks in advance,
Tara

Well, here's one way. Columns in R data frames can be accessed using their character names, so this will work:
# create sample dataset
set.seed(1) # for reproducible example
df <- data.frame(year=as.factor(rep(6:8,each=100)), #categorical variable
varX06 = rnorm(300), varX07=rnorm(300), varX08=rnorm(100),
varY06 = rnorm(300), varY07=rnorm(300), varY08=rnorm(100))
# you start here...
years <- unique(df$year)
df$varX <- unlist(lapply(years,function(yr)df[df$year==yr,paste0("varX0",yr)]))
df$varY <- unlist(lapply(years,function(yr)df[df$year==yr,paste0("varY0",yr)]))
print(head(df),digits=4)
# year varX06 varX07 varX08 varY06 varY07 varY08 varX varY
# 1 6 -0.6265 0.8937 -0.3411 -0.70757 1.1350 0.3412 -0.6265 -0.70757
# 2 6 0.1836 -1.0473 1.5024 1.97157 1.1119 1.3162 0.1836 1.97157
# 3 6 -0.8356 1.9713 0.5283 -0.09000 -0.8708 -0.9598 -0.8356 -0.09000
# 4 6 1.5953 -0.3836 0.5422 -0.01402 0.2107 -1.2056 1.5953 -0.01402
# 5 6 0.3295 1.6541 -0.1367 -1.12346 0.0694 1.5676 0.3295 -1.12346
# 6 6 -0.8205 1.5122 -1.1367 -1.34413 -1.6626 0.2253 -0.8205 -1.34413
For a given yr, the anonymous function extracts the rows with that yr and the column named "varX0" + yr (the result of paste0(...)). Then lapply(...) "applies" this function for each year, and unlist(...) converts the returned list into a vector.
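The two building blocks doing the work, shown in isolation (just an illustration, using the df created above):
yr <- 6
paste0("varX0", yr)                # "varX06" -- the column name built as a character string
head(df[df$year == yr, "varX06"])  # rows for that year, column picked by its character name
head(df[["varX06"]])               # a whole column can also be pulled by its character name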

Maybe a more transparent way:
sub <- c("varX", "varY")
for (i in sub) {
df[[i]] <- NA
df[[i]] <- ifelse(df[["year"]] == 6, df[[paste0(i, "06")]], df[[i]])
df[[i]] <- ifelse(df[["year"]] == 7, df[[paste0(i, "07")]], df[[i]])
df[[i]] <- ifelse(df[["year"]] == 8, df[[paste0(i, "08")]], df[[i]])
}

This method reorders your data, but involves a one-liner, which may or may not be better for you (assume d is your dataframe):
> do.call(rbind, by(d, d$year, function(x) { within(x, { varX <- x[, paste0('varX0',x$year[1])]; varY <- x[, paste0('varY0',x$year[1])] }) } ))
study_id year varX06 varX07 varX08 varY06 varY07 varY08 varY varX
6.1 1 6 50 40 30 20.5 19.8 17.4 20.5 50
6.4 2 6 60 55 44 25.1 25.2 25.3 25.1 60
7.2 1 7 50 40 30 20.5 19.8 17.4 19.8 40
7.5 2 7 60 55 44 25.1 25.2 25.3 25.2 55
8.3 1 8 50 40 30 20.5 19.8 17.4 17.4 30
8.6 2 8 60 55 44 25.1 25.2 25.3 25.3 44
Essentially, it splits the data based on year, then uses within to create the varX and varY variables within each subset, and then rbind's the subsets back together.
A direct translation of your Stata code, however, would be something like the following:
stubs <- c("varX", "varY")
for (i in stubs) {
  d[[i]] <- ifelse(d$year == 6, d[[paste0(i, "06")]],
             ifelse(d$year == 7, d[[paste0(i, "07")]],
              ifelse(d$year == 8, d[[paste0(i, "08")]], NA)))
}

Here's another option.
Create a 'column selection matrix' based on year, then use that to grab the values you want from any block of columns.
# indexing matrix based on the 'year' column
col_select_mat <-
t(sapply(your_df$year, function(x) unique(your_df$year) == x))
# make selections from col groups by stub name
sapply(c('varX', 'varY'),
function(x) your_df[, grep(x, names(your_df))][col_select_mat])
This gives the desired result (which you can cbind to your_df if you like)
varX varY
[1,] 50 20.5
[2,] 60 25.1
[3,] 40 19.8
[4,] 55 25.2
[5,] 30 17.4
[6,] 44 25.3
OP's dataset:
your_df <- read.table(header=T, text=
'study_id year varX06 varX07 varX08 varY06 varY07 varY08
1 6 50 40 30 20.5 19.8 17.4
1 7 50 40 30 20.5 19.8 17.4
1 8 50 40 30 20.5 19.8 17.4
2 6 60 55 44 25.1 25.2 25.3
2 7 60 55 44 25.1 25.2 25.3
2 8 60 55 44 25.1 25.2 25.3')
Benchmarking: Looking at the three posted solutions, this appears to be the fastest on average, but the differences are very small.
df <- your_df
d <- your_df
arvi1000 <- function() {
col_select_mat <- t(sapply(your_df$year, function(x) unique(your_df$year) == x))
# make selections from col groups by stub name
cbind(your_df,
sapply(c('varX', 'varY'),
function(x) your_df[, grep(x, names(your_df))][col_select_mat]))
}
jlhoward <- function() {
years <- unique(df$year)
df$varX <- unlist(lapply(years,function(yr)df[df$year==yr,paste0("varX0",yr)]))
df$varY <- unlist(lapply(years,function(yr)df[df$year==yr,paste0("varY0",yr)]))
}
Thomas <- function() {
do.call(rbind, by(d, d$year, function(x) { within(x, { varX <- x[, paste0('varX0',x$year[1])]; varY <- x[, paste0('varY0',x$year[1])] }) } ))
}
> microbenchmark(arvi1000, jlhoward, Thomas)
Unit: nanoseconds
expr min lq mean median uq max neval
arvi1000 37 39 43.73 40 42 380 100
jlhoward 38 40 46.35 41 42 377 100
Thomas 37 40 56.99 41 42 1590 100

Run a function in loop on a data frame of 3 values

Function given:
hmin <- function(H, M, s) {
  H - ((2*9.8*M)/1.5*s)
}
# Create a data frame to test One-at-A-Time sensitivity analysis
# Keep H and M constant, varying s
# H = 50, M = 70.5, s = 20 to 40
s1 <- numeric(length = 21)
s1[1:21] <- 20:40
H1 <- c(rep(50,length(s1)))
M1 <- c(rep(70.5, length(s1)))
dataframe1 <- data.frame(H1,M1,s1)
Now, with this data frame, how do I run the hmin function over the variables in dataframe1 and store the output in a vector, so that I can plot a barplot?
Tried using:
lapply(dataframe1, hmin (H1,M1,s1), dataframe1)
but it does not work.
Any help would be appreciated. Thank you.
My intended output is just the hmin values, computed by running the function over the data frame's variables and saved in a vector. How can I go about doing this?
Since your function is vectorized, do this:
result = hmin(H = dataframe1$H1, M = dataframe1$M1, s = dataframe1$s1)
# same thing, using the with() helper function to save typing
result = with(dataframe1, hmin(H = H1, M = M1, s = s1))
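If the end goal is the barplot mentioned in the question, a minimal sketch using the result vector above (the axis labels are just suggestions):
barplot(result, names.arg = dataframe1$s1, xlab = "s", ylab = "hmin")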
It is far easier if you use dplyr:
> library(dplyr)
> dataframe1 <- data.frame(H1,M1,s1)
> hmin <- function(H,M,s) {
+ H - ((2*9.8*M)/1.5*s)
+ }
> dataframe1 %>%
+ mutate(hminscores = hmin(H1, M1, s1))
H1 M1 s1 hminscores
1 50 70.5 20 -18374.0
2 50 70.5 21 -19295.2
3 50 70.5 22 -20216.4
4 50 70.5 23 -21137.6
5 50 70.5 24 -22058.8
6 50 70.5 25 -22980.0
7 50 70.5 26 -23901.2
8 50 70.5 27 -24822.4
9 50 70.5 28 -25743.6
10 50 70.5 29 -26664.8
11 50 70.5 30 -27586.0
12 50 70.5 31 -28507.2
13 50 70.5 32 -29428.4
14 50 70.5 33 -30349.6
15 50 70.5 34 -31270.8
16 50 70.5 35 -32192.0
17 50 70.5 36 -33113.2
18 50 70.5 37 -34034.4
19 50 70.5 38 -34955.6
20 50 70.5 39 -35876.8
21 50 70.5 40 -36798.0
If you just want a vector, you can also use apply:
> hminscores <- apply(dataframe1, 1, function(x) {
+ hmin(x[1], x[2], x[3])
+ })
> hminscores
[1] -18374.0 -19295.2 -20216.4 -21137.6 -22058.8 -22980.0 -23901.2 -24822.4 -25743.6
[10] -26664.8 -27586.0 -28507.2 -29428.4 -30349.6 -31270.8 -32192.0 -33113.2 -34034.4
[19] -34955.6 -35876.8 -36798.0

Outputting percentiles by filtering a data frame

Note that, as requested in the comments, this question has been revised.
Consider the following example:
df <- data.frame(FILTER = rep(1:10, each = 10), VALUE = 1:100)
I would like to, for each value of FILTER, create a data frame which contains the 1st, 2nd, ..., 99th percentiles of VALUE. The final product should be
PERCENTILE df_1 df_2 ... df_10
1 [first percentiles]
2 [second percentiles]
etc., where df_i is based on FILTER == i.
Note that FILTER, although it contains numbers, is actually categorical.
The way I have been doing this is by using dplyr:
nums <- 1:10
library(dplyr)
for (i in nums){
df_temp <- filter(df, FILTER == i)$VALUE
assign(paste0("df_", i), quantile(df_temp, probs = (1:99)/100))
}
and then I would have to cbind these (with 1:99 in the first column), but I would rather not type in every single df name. I have considered using a loop on the names of these data frames, but this would involve using eval(parse()).
Here's a basic outline of a possibly smoother approach. I have not included every single aspect of your desired output, but the modification should be fairly straightforward.
df <- data.frame(FILTER = rep(1:10, each = 10), VALUE = 1:100)
df_s <- lapply(split(df,df$FILTER),
FUN = function(x) quantile(x$VALUE,probs = c(0.25,0.5,0.75)))
out <- do.call(cbind,df_s)
colnames(out) <- paste0("df_",colnames(out))
> out
df_1 df_2 df_3 df_4 df_5 df_6 df_7 df_8 df_9 df_10
25% 3.25 13.25 23.25 33.25 43.25 53.25 63.25 73.25 83.25 93.25
50% 5.50 15.50 25.50 35.50 45.50 55.50 65.50 75.50 85.50 95.50
75% 7.75 17.75 27.75 37.75 47.75 57.75 67.75 77.75 87.75 97.75
I did this for just 3 quantiles to keep things simple, but it obviously extends. And you can add the 1:99 column afterwards as well.
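For completeness, a sketch of the same approach extended to all 99 percentiles, with the percentile index added as a first column (reusing df from above):
df_s <- lapply(split(df, df$FILTER),
               FUN = function(x) quantile(x$VALUE, probs = (1:99)/100))
out <- do.call(cbind, df_s)
colnames(out) <- paste0("df_", colnames(out))
out <- data.frame(PERCENTILE = 1:99, out) # 99 rows, one column per FILTER level
head(out)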
I suggest that you use a list.
list_of_dfs <- list()
nums <- 1:10
for (i in nums){
list_of_dfs[[i]] <- nums*i
}
df <- data.frame(do.call("cbind", list_of_dfs))
colnames(df) <- paste0("df_", nums)
You'll get the result you want:
df_1 df_2 df_3 df_4 df_5 df_6 df_7 df_8 df_9 df_10
1 1 2 3 4 5 6 7 8 9 10
2 2 4 6 8 10 12 14 16 18 20
3 3 6 9 12 15 18 21 24 27 30
4 4 8 12 16 20 24 28 32 36 40
5 5 10 15 20 25 30 35 40 45 50
6 6 12 18 24 30 36 42 48 54 60
7 7 14 21 28 35 42 49 56 63 70
8 8 16 24 32 40 48 56 64 72 80
9 9 18 27 36 45 54 63 72 81 90
10 10 20 30 40 50 60 70 80 90 100
How about using get?
df <- data.frame(1:99) # placeholder column; each df_i holds 99 percentiles
for (i in nums) {
df <- cbind(df, get(paste0("df_", i)))
}
# get rid of first useless column
df <- df[, -1]
# get names
names(df) <- paste0("df_", nums)
df
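Relatedly, if the df_1 ... df_10 objects from the original assign() loop already exist, base R's mget() collects them all at once, avoiding both the explicit get() loop and eval(parse()); a sketch:
vals <- mget(paste0("df_", nums)) # named list of the 99-element quantile vectors
res <- data.frame(PERCENTILE = 1:99, do.call(cbind, vals))
head(res)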

R - "find and replace" integers in a column with character labels

I have two data frames the first (DF1) is similar to this:
Ba Ram You Sheep
30 1 33.2 120.9
27 3 22.1 121.2
22 4 39.1 99.1
11 1 20.0 101.6
9 3 9.8 784.3
The second (DF2) contains titles for column "Ram":
V1 V2
1 RED
2 GRN
3 YLW
4 BLU
I need to replace the DF1$Ram with corresponding character strings of DF2$V2:
Ba Ram You Sheep
30 RED 33.2 120.9
27 YLW 22.1 121.2
22 BLU 39.1 99.1
11 RED 20.0 101.6
9 YLW 9.8 784.3
I can do this with a nested for loop, but it feels REALLY inefficient:
x <- 1:nrow(DF1)
y <- 1:4
for (i in x) {
  for (j in y) {
    if (DF1$Ram[i] == DF2$V1[j]) {
      DF1$Ram[i] <- as.character(DF2$V2[j])
    }
  }
}
Is there a way to do this more efficiently??!?! I know there is. I'm a noob.
Use merge
> result <- merge(df1, df2, by.x="Ram", by.y="V1")[,-1] # merging data.frames
> colnames(result)[4] <- "Ram" # setting name
The following is just for getting the output in the order you showed us
> result[order(result$Ba, decreasing = TRUE), c("Ba", "Ram", "You", "Sheep")]
Ba Ram You Sheep
1 30 RED 33.2 120.9
3 27 YLW 22.1 121.2
5 22 BLU 39.1 99.1
2 11 RED 20.0 101.6
4 9 YLW 9.8 784.3
Usually, when you encode some character strings with integers, you likely want factor. They offer some benefits you can read about in the fine manual.
df1 <- data.frame(V2 = c(3,3,2,3,1))
df2 <- data.frame(V1=1:4, V2=c('a','b','c','d'))
df1 <- within(df1, {
f <- factor(df1$V2, levels=df2$V1, labels=df2$V2)
aschar <- as.character(f)
asnum <- as.numeric(f)
})

How to find the highest value of a column in a data frame in R?

I have the following data frame which I called ozone:
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
7 23 299 8.6 65 5 7
8 19 99 13.8 59 5 8
9 8 19 20.1 61 5 9
I would like to extract the highest value from each column: Ozone, Solar.R, Wind, and so on.
Also, if possible, how would I sort Solar.R (or any column of this data frame) in descending order?
I tried
max(ozone, na.rm=T)
which gives me the highest value in the dataset.
I have also tried
max(subset(ozone,Ozone))
but got "subset" must be logical."
I can set an object to hold the subset of each column, by the following commands
ozone <- subset(ozone, Ozone >0)
max(ozone,na.rm=T)
but it gives the same value of 334, which is the max value of the data frame, not the column.
Any help would be great, thanks.
Similar to colMeans, colSums, etc, you could write a column maximum function, colMax, and a column sort function, colSort.
colMax <- function(data) sapply(data, max, na.rm = TRUE)
colSort <- function(data, ...) sapply(data, sort, ...)
I use ... in the second function in hopes of sparking your intrigue.
Get your data:
dat <- read.table(h=T, text = "Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
7 23 299 8.6 65 5 7
8 19 99 13.8 59 5 8
9 8 19 20.1 61 5 9")
Use colMax function on sample data:
colMax(dat)
# Ozone Solar.R Wind Temp Month Day
# 41.0 313.0 20.1 74.0 5.0 9.0
To do the sorting on a single column,
sort(dat$Solar.R, decreasing = TRUE)
# [1] 313 299 190 149 118 99 19
and over all columns use our colSort function,
colSort(dat, decreasing = TRUE) ## compare with '...' above
To get the max of any column you want something like:
max(ozone$Ozone, na.rm = TRUE)
To get the max of all columns, you want:
apply(ozone, 2, function(x) max(x, na.rm = TRUE))
And to sort:
ozone[order(ozone$Solar.R),]
Or to sort the other direction:
ozone[rev(order(ozone$Solar.R)),]
Here's a dplyr solution:
library(dplyr)
# find max for each column
summarise_each(ozone, funs(max(., na.rm=TRUE)))
# sort by Solar.R, descending
arrange(ozone, desc(Solar.R))
UPDATE: summarise_each() has been deprecated in favour of a more featureful family of functions: mutate_all(), mutate_at(), mutate_if(), summarise_all(), summarise_at(), summarise_if()
Here is how you could do it:
# find max for each column
ozone %>%
summarise_if(is.numeric, funs(max(., na.rm=TRUE)))%>%
arrange(Ozone)
or
ozone %>%
summarise_at(vars(1:6), funs(max(., na.rm=TRUE)))%>%
arrange(Ozone)
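In recent dplyr releases (1.0.0 and later), across() supersedes funs() and the _all/_if/_at variants; a sketch of the same column-wise maximum:
ozone %>%
  summarise(across(where(is.numeric), ~ max(.x, na.rm = TRUE)))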
In response to finding the max value for each column, you could try using the apply() function:
> apply(ozone, MARGIN = 2, function(x) max(x, na.rm=TRUE))
Ozone Solar.R Wind Temp Month Day
41.0 313.0 20.1 74.0 5.0 9.0
Another way would be to use ?pmax
do.call('pmax', c(as.data.frame(t(ozone)),na.rm=TRUE))
#[1] 41.0 313.0 20.1 74.0 5.0 9.0
There is a package, matrixStats, that provides functions for column and row summaries (see the package vignette), but you have to convert your data.frame to a matrix first.
Then you run: colMaxs(as.matrix(ozone))
max(ozone$Ozone, na.rm = TRUE)
Without $Ozone it will take the maximum over the whole data frame rather than a single column; this is covered in the swirl tutorial package. I'm studying this course on Coursera too.
Assuming that your data is in a data.frame called maxinozone, you can do this:
max(maxinozone[, 1], na.rm = TRUE) # max of the first column (Ozone)
max(ozone$Ozone, na.rm = TRUE) should do the trick. Remember to include the na.rm = TRUE or else R will return NA.
Try this solution:
Oz <- subset(ozone, Month == 5, select = Ozone) # select the Ozone values for May (Month == 5)
summary(Oz) # gives characteristics of the table (one column, Ozone), including max, min, ...

Select a value based on the highest value in another column

I don't understand why I can't find a solution for this, since I feel that this is a pretty basic question. Need to ask for help, then. I want to rearrange airquality dataset by month with maximum temp value for each month. In addition I want to find the corresponding day for each monthly maximum temperature. What is the laziest (code-wise) way to do this?
I have tried following without a success:
require(reshape2)
names(airquality) <- tolower(names(airquality))
mm <- melt(airquality, id.vars = c("month", "day"), meas = c("temp"))
dcast(mm, month + day ~ variable, max)
aggregate(formula = temp ~ month + day, data = airquality, FUN = max)
I am after something like this:
month day temp
5 7 89
...
There was quite a discussion a while back about whether being lazy is good or not. Anyway, this is short and natural to write and read (and is fast for large data so you don't need to change or optimize it later):
require(data.table)
DT=as.data.table(airquality)
DT[,.SD[which.max(Temp)],by=Month]
Month Ozone Solar.R Wind Temp Day
[1,] 5 45 252 14.9 81 29
[2,] 6 NA 259 10.9 93 11
[3,] 7 97 267 6.3 92 8
[4,] 8 76 203 9.7 97 28
[5,] 9 73 183 2.8 93 3
.SD is the subset of the data for each group, and you just want the row from it with the largest Temp, iiuc. If you need the row number then that can be added.
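If you do want the row numbers, .I holds them; a sketch using the same DT:
DT[, .I[which.max(Temp)], by = Month]         # Month plus the row index of its max Temp
DT[DT[, .I[which.max(Temp)], by = Month]$V1]  # the full rows, selected by those indices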
Or to get all the rows where the max is tied:
DT[,.SD[Temp==max(Temp)],by=Month]
Month Ozone Solar.R Wind Temp Day
[1,] 5 45 252 14.9 81 29
[2,] 6 NA 259 10.9 93 11
[3,] 7 97 267 6.3 92 8
[4,] 7 97 272 5.7 92 9
[5,] 8 76 203 9.7 97 28
[6,] 9 73 183 2.8 93 3
[7,] 9 91 189 4.6 93 4
Another approach with plyr
require(reshape2)
names(airquality) <- tolower(names(airquality))
mm <- melt(airquality, id.vars = c("month", "day"), meas = c("temp"), value.name = 'temp')
library(plyr)
ddply(mm, .(month), subset, subset = temp == max(temp), select = -variable)
Gives
month day temp
1 5 29 81
2 6 11 93
3 7 8 92
4 7 9 92
5 8 28 97
6 9 3 93
7 9 4 93
Or, even simpler
require(reshape2)
require(plyr)
names(airquality) <- tolower(names(airquality))
ddply(airquality, .(month), subset,
subset = temp == max(temp), select = c(month, day, temp) )
How about with plyr?
max.func <- function(df) {
  max.temp <- max(df$Temp)
  return(data.frame(day = df$Day[df$Temp == max.temp],
                    temp = max.temp))
}
ddply(airquality, .(Month), max.func)
As you can see, the max temperature for the month happens on more than one day. If you want different behavior, the function is easy enough to adjust.
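For example, a sketch of one adjustment that keeps only the first day on which the maximum occurs:
max.func.first <- function(df) {
  data.frame(day = df$Day[which.max(df$Temp)],
             temp = max(df$Temp))
}
ddply(airquality, .(Month), max.func.first)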
Or if you want to use the data.table package (for instance, if speed is an issue and the data set is large or if you prefer the syntax):
library(data.table)
DT <- data.table(airquality)
DT[, list(maxTemp=max(Temp), dayMaxTemp=.SD[max(Temp)==Temp, Day]), by="Month"]
If you want to know what the .SD stands for, have a look here: SO
