What is the difference between with and within in R?

I always use "with" instead of "within" in my research, and I originally thought they were the same. Just now I mistyped "within" for "with" and the results returned were quite different. I am wondering why?
I am using the baseball data in the plyr package, so I first load the library by
require(plyr)
Then I want to select all rows with the id "ansonca01". At first, as I said, I used "within", and ran the function as follows:
within(baseball, baseball[id=="ansonca01", ])
I got very strange results, which basically include everything:
id year stint team lg g ab r h X2b X3b hr rbi sb cs bb so ibb hbp sh sf gidp
4 ansonca01 1871 1 RC1 25 120 29 39 11 3 0 16 6 2 2 1 NA NA NA NA NA
44 forceda01 1871 1 WS3 32 162 45 45 9 4 0 29 8 0 4 0 NA NA NA NA NA
68 mathebo01 1871 1 FW1 19 89 15 24 3 1 0 10 2 1 2 0 NA NA NA NA NA
99 startjo01 1871 1 NY2 33 161 35 58 5 1 1 34 4 2 3 0 NA NA NA NA NA
102 suttoez01 1871 1 CL1 29 128 35 45 3 7 3 23 3 1 1 0 NA NA NA NA NA
106 whitede01 1871 1 CL1 29 146 40 47 6 5 1 21 2 2 4 1 NA NA NA NA NA
113 yorkto01 1871 1 TRO 29 145 36 37 5 7 2 23 2 2 9 1 NA NA NA NA NA
.........
Then I used "with" instead of "within",
with(baseball, baseball[id=="ansonca01",])
and got the results that I expected
id year stint team lg g ab r h X2b X3b hr rbi sb cs bb so ibb hbp sh sf gidp
4 ansonca01 1871 1 RC1 25 120 29 39 11 3 0 16 6 2 2 1 NA NA NA NA NA
121 ansonca01 1872 1 PH1 46 217 60 90 10 7 0 50 6 6 16 3 NA NA NA NA NA
276 ansonca01 1873 1 PH1 52 254 53 101 9 2 0 36 0 2 5 1 NA NA NA NA NA
398 ansonca01 1874 1 PH1 55 259 51 87 8 3 0 37 6 0 4 1 NA NA NA NA NA
525 ansonca01 1875 1 PH1 69 326 84 106 15 3 0 58 11 6 4 2 NA NA NA NA NA
I checked the documentation of with and within by typing help(with) in R environment, and got the following:
with is a generic function that evaluates expr in a local environment constructed from data. The environment has the caller's environment as its parent. This is useful for simplifying calls to modeling functions. (Note: if data is already an environment then this is used with its existing parent.)
Note that assignments within expr take place in the constructed environment and not in the user's workspace.
within is similar, except that it examines the environment after the evaluation of expr and makes the corresponding modifications to data (this may fail in the data frame case if objects are created which cannot be stored in a data frame), and returns it. within can be used as an alternative to transform.
From this explanation of the differences, I don't see why I obtained different results with such a simple operation. Does anyone have any ideas?

I find simple examples often work to highlight the difference. Something like:
df <- data.frame(a=1:5,b=2:6)
df
a b
1 1 2
2 2 3
3 3 4
4 4 5
5 5 6
with(df, {c <- a + b; df;} )
a b
1 1 2
2 2 3
3 3 4
4 4 5
5 5 6
within(df, {c <- a + b; df;} )
# equivalent to: within(df, c <- a + b)
# i've just made the return of df explicit
# for comparison's sake
a b c
1 1 2 3
2 2 3 5
3 3 4 7
4 4 5 9
5 5 6 11

The documentation is quite clear about the semantics and return values (and nicely matches the everyday meanings of the words “with” and “within”):
Value:
For ‘with’, the value of the evaluated ‘expr’. For ‘within’, the
modified object.
Since your code doesn’t modify anything inside baseball, the unmodified baseball is returned. with, on the other hand, doesn’t return the object; it returns the value of expr.
Here’s an example where the expression modifies the object:
> head(within(cars, speed[dist < 20] <- 1))
speed dist
1 1 2
2 1 10
3 1 4
4 7 22
5 1 16
6 1 10
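To tie this back to the question, here is a minimal sketch on a small made-up data frame (d, with invented columns id and val, not the baseball data). It should show that with hands back the subset, while within computes the subset, discards it, and returns the data with unchanged contents because nothing was assigned.

```r
# Made-up data for illustration only
d <- data.frame(id = c("x", "y", "x"), val = 1:3)

# with(): returns whatever the expression evaluates to (here, a subset)
sub_with <- with(d, d[id == "x", ])

# within(): the subset is computed and thrown away; nothing was assigned,
# so the data frame comes back with the same contents
sub_within <- within(d, d[id == "x", ])

# within() only shows an effect when the expression assigns something
d2 <- within(d, val2 <- val * 2)

nrow(sub_with)    # 2
nrow(sub_within)  # 3
names(d2)         # "id" "val" "val2"
```

So if the goal is to select rows, with (or plain indexing) is the one to reach for; within is for adding or changing columns.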

As above, with returns the value of the last evaluated expression. It is handy for one-liners such as:
with(cars, summary(lm (speed ~ dist)))
but is less convenient for multiple expressions, since only the value of the last one is returned.
I often find within useful for manipulating a data.frame or list (or data.table) as I find the syntax easy to read.
I feel that the documentation could be improved by adding examples of use in this regard, e.g.:
df1 <- data.frame(a=1:3,
b=4:6,
c=letters[1:3])
## library("data.table")
## df1 <- as.data.table(df1)
df1 <- within(df1, {
a <- 10:12
b[1:2] <- letters[25:26]
c <- a
})
df1
giving
a b c
1: 10 y 10
2: 11 z 11
3: 12 6 12
and
df1 <- as.list(df1)
df1 <- within(df1, {
a <- 20:23
b[1:2] <- letters[25:26]
c <- paste0(a, b)
})
df1
giving
$a
[1] 20 21 22 23
$b
[1] "y" "z" "6"
$c
[1] "20y" "21z" "226" "23y"
Note also that methods("within") lists only these methods:
within.data.frame
within.list
(and within.data.table if the package is loaded).
Other packages may define additional methods.
Perhaps unexpectedly for some, with and within are generally not appropriate choices when manipulating variables within defined environments...
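To address the comment: there is no within.environment method. Using with on an environment created by as.environment() requires the function you're calling to exist in that environment (its parent is the empty environment, so even base functions are not found), which somewhat defeats the purpose for me, e.g.: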
df1 <- as.environment(df1)
## with(df1, ls()) ## Error
assign("ls", ls, envir=df1)
with(df1, ls())
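If you do want to use with on an environment, one possible workaround is to build the environment with list2env and choose its parent yourself. The following is only a sketch; the names lst, e_bare and e_child are made up for illustration.

```r
lst <- list(a = 1:3, b = 4:6)

# as.environment() on a list: the parent is the empty environment,
# so even base functions such as sum() cannot be found from inside it
e_bare <- as.environment(lst)
bare_fails <- inherits(try(with(e_bare, sum(a)), silent = TRUE), "try-error")

# list2env() lets you pick the parent; with globalenv() as the parent,
# functions are found through the usual search path
e_child <- list2env(lst, parent = globalenv())
total <- with(e_child, sum(a) + sum(b))

bare_fails  # TRUE
total       # 21
```

This avoids copying functions into the environment with assign, at the cost of making everything on the search path visible from inside it.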

Related

How to combine/concatenate two dataframes one after the other but not merging common columns in R?

Suppose there are two dataframes as follows with same column names and I want to combine/concatenate one after the other without merging the common columns. There is a way of assigning it columnwise like df1[3]<-df2[1] but would like to know if there's some other way.
df1<-data.frame(A=c(1:10), B=c(2:5, rep(NA,6)))
df2<-data.frame(A=c(12:20), B=c(32:40))
Expected Output:
A B A.1 B.1
1 2 12 32
2 3 13 33
3 4 14 34
4 5 15 35
5 NA 16 36
6 NA 17 37
7 NA 18 38
8 NA 19 39
9 NA 20 40
10 NA NA NA
I tend to work with multiple frames like this as a list of frames. Try this:
LOF <- list(df1, df2)
maxrows <- max(sapply(LOF, nrow))
out <- do.call(cbind, lapply(LOF, function(z) z[seq_len(maxrows),]))
names(out) <- make.names(names(out), unique = TRUE)
out
# A B A.1 B.1
# 1 1 2 12 32
# 2 2 3 13 33
# 3 3 4 14 34
# 4 4 5 15 35
# 5 5 NA 16 36
# 6 6 NA 17 37
# 7 7 NA 18 38
# 8 8 NA 19 39
# 9 9 NA 20 40
# 10 10 NA NA NA
One advantage of this is that it allows you to work with an arbitrary number of frames, not just two.
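As a rough illustration of that point, here is the same idea applied to three frames. The frames f1, f2 and f3 are made up for this sketch; only the padding and renaming steps come from the code above.

```r
# Three made-up frames with the same column names but different lengths
f1 <- data.frame(A = 1:4,   B = 5:8)
f2 <- data.frame(A = 11:12, B = 13:14)
f3 <- data.frame(A = 21:23, B = 24:26)

frames  <- list(f1, f2, f3)
maxrows <- max(vapply(frames, nrow, integer(1)))

# Indexing past the last row pads each frame with NA rows
padded <- lapply(frames, function(z) z[seq_len(maxrows), , drop = FALSE])

out <- do.call(cbind, padded)
names(out) <- make.names(names(out), unique = TRUE)
out
```

The only thing that changes when you add a frame is the list passed in; the suffixes (.1, .2, ...) are generated by make.names.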
One base R way could be
setNames(Reduce(cbind.data.frame,
Map(`length<-`, c(df1, df2), max(nrow(df1), nrow(df2)))),
paste0(names(df1), rep(c('', '.1'), each=2)))
# A B A.1 B.1
# 1 1 2 12 32
# 2 2 3 13 33
# 3 3 4 14 34
# 4 4 5 15 35
# 5 5 NA 16 36
# 6 6 NA 17 37
# 7 7 NA 18 38
# 8 8 NA 19 39
# 9 9 NA 20 40
# 10 10 NA NA NA
Another option is to use the merge function. The documentation can be a bit cryptic, so here is a short explanation of the arguments:
by -- "the name "row.names" or the number 0 specifies the row names"
all = TRUE -- keeps all original rows from both dataframes
suffixes -- specify how you want the duplicated colnames to be distinguished
sort = FALSE -- do not sort the result on the by columns (here, the row names)
merge(df1, df2, by = 0, all = TRUE, suffixes = c('', '.1'), sort = FALSE)
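One thing to be aware of: when merging on row names, merge adds a Row.names column to the result, and the row order is not guaranteed. A possible way to tidy this up is sketched below; the reordering and column-dropping steps are my own addition, not something merge does for you.

```r
df1 <- data.frame(A = c(1:10), B = c(2:5, rep(NA, 6)))
df2 <- data.frame(A = c(12:20), B = c(32:40))

m <- merge(df1, df2, by = 0, all = TRUE, suffixes = c('', '.1'), sort = FALSE)

# Row.names comes back as character, so convert before ordering
ord <- order(as.integer(as.character(m$Row.names)))

# Put rows back in their original order and drop the Row.names column
out <- m[ord, setdiff(names(m), "Row.names")]
rownames(out) <- NULL
out
```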
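One way would be the following. Note that it appends a single all-NA row to df2, which is enough here because df2 is exactly one row shorter than df1, and that the result keeps the duplicated column names A and B: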
cbind(
df1,
rbind(
df2,
rep(NA, nrow(df1) - nrow(df2))
)
)

Ranking based on two variables

I need to rank rows based on two variables and I just can't wrap my head around it.
Test data below:
df <- data.frame(A = c(12,35,55,7,6,NA,NA,NA,NA,NA), B = c(NA,12,25,53,12,2,66,45,69,43))
A B
12 NA
35 12
55 25
7 53
6 12
NA 2
NA 66
NA 45
NA 69
NA 43
I want to calculate a third variable, C that equals A when A!=NA. When A==NA then C==B, BUT the C score should always follow that a row with A==NA should never outrank a row with A!=NA.
In the data above Max(A) should equal max(C) and max(B) only can hold the sixth highest C value, because A has five non-NA values. If A ==NA and B outranks a row with A!=NA, then some form of transformation should take place that ensures that the A!=NA row always outranks the B row in the final C score
I would like the result to look something like this:
A B C
55 25 1
35 12 2
12 NA 3
7 53 4
6 12 5
NA 69 6
NA 66 7
NA 45 8
NA 43 9
NA 2 10
So far the closest I can get is
df$C <- ifelse(is.na(df$A), min(df$A, na.rm=T)/df$B, df$A)
But that turns the ranking upside down when A==NA, so B==2 is ranked 6 instead of B==69
A B C
55 25 1
35 12 2
12 NA 3
7 53 4
6 12 5
NA 2 6
NA 43 7
NA 45 8
NA 66 9
NA 69 10
I'm not sure if I could use some kind of weights?
Any suggestions are greatly appreciated! Thanks!
You can try:
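# rank A in descending order; order() alone returns a permutation, not ranks, and only happens to coincide with the ranks for this data
df$C <- sort.list(order(-df$A))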
df[is.na(df$A),"C"] <- sort.list(order(-df[is.na(df$A),"B"]))+length(which(!is.na(df$A)))
and the order for C:
df[order(df$C),]
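Another way to look at it, as a sketch rather than a definitive solution: sort on two keys at once, first on whether A is missing and then on the relevant value in descending order, and turn the resulting order into ranks. The helper names score, ord and rank_c are made up here.

```r
df <- data.frame(A = c(12, 35, 55, 7, 6, NA, NA, NA, NA, NA),
                 B = c(NA, 12, 25, 53, 12, 2, 66, 45, 69, 43))

# Use A where it exists, otherwise fall back to B
score <- ifelse(is.na(df$A), df$B, df$A)

# First key: rows with A come before rows without A
# Second key: larger scores first
ord <- order(is.na(df$A), -score)

# Convert the ordering into ranks
rank_c <- integer(nrow(df))
rank_c[ord] <- seq_len(nrow(df))
df$C <- rank_c

df[order(df$C), ]
```

Because the missing-A flag is the first sort key, a row without A can never outrank a row with A, whatever its B value is. Ties are broken by original row order.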

Conditional filtering of data.frame with preceding and trailing NA observations

I have a data.frame composed of observations and modelled predictions of data. A minimal example dataset could look like this:
myData <- data.frame(tree=c(rep("A", 20)), doy=c(seq(75, 94)), count=c(NA,NA,NA,NA,0,NA,NA,NA,NA,1,NA,NA,NA,NA,2,NA,NA,NA,NA,NA), pred=c(0,0,0,0,1,1,1,2,2,2,2,3,3,3,3,6,9,12,20,44))
The count column represents when observations were made and predictions are modelled over a complete set of days, in effect interpolating the data to a day level (from every 5 days).
I would like to conditionally filter this dataset so that I end up truncating the predictions to the same range as the observations, in effect keeping all predictions between when count starts and ends (i.e. removing preceding and trailing rows/values of pred when they correspond to an NA in the count column). For this example, the ideal outcome would be:
tree doy count pred
5 A 79 0 1
6 A 80 NA 1
7 A 81 NA 1
8 A 82 NA 2
9 A 83 NA 2
10 A 84 1 2
11 A 85 NA 2
12 A 86 NA 3
13 A 87 NA 3
14 A 88 NA 3
15 A 89 2 3
I have tried to solve this problem through combining filter with first and last, thinking about using a conditional mutate to create a column that determines if there is an observation in the previous doy (probably using lag) and filling that with 1 or 0 and using that output to then filter, or even creating a second data.frame that contains the proper doy range that can be joined to this data.
In my searches on StackOverflow I have come across the following questions that seemed close, but were not quite what I needed:
Select first observed data and utilize mutate
Conditional filtering based on the level of a factor R
My actual dataset is much larger with multiple trees over multiple years (with each tree/year having different period of observation depending on elevation of the sites, etc.). I am currently implementing the dplyr package across my code, so an answer within that framework would be great but would be happy with any solutions at all.
I think you're just looking to limit the rows to fall between the first and last non-NA count value:
myData[seq(min(which(!is.na(myData$count))), max(which(!is.na(myData$count)))),]
# tree doy count pred
# 5 A 79 0 1
# 6 A 80 NA 1
# 7 A 81 NA 1
# 8 A 82 NA 2
# 9 A 83 NA 2
# 10 A 84 1 2
# 11 A 85 NA 2
# 12 A 86 NA 3
# 13 A 87 NA 3
# 14 A 88 NA 3
# 15 A 89 2 3
In dplyr syntax, grouping by the tree variable:
library(dplyr)
myData %>%
group_by(tree) %>%
filter(seq_along(count) >= min(which(!is.na(count))) &
seq_along(count) <= max(which(!is.na(count))))
# Source: local data frame [11 x 4]
# Groups: tree
#
# tree doy count pred
# 1 A 79 0 1
# 2 A 80 NA 1
# 3 A 81 NA 1
# 4 A 82 NA 2
# 5 A 83 NA 2
# 6 A 84 1 2
# 7 A 85 NA 2
# 8 A 86 NA 3
# 9 A 87 NA 3
# 10 A 88 NA 3
# 11 A 89 2 3
Try
indx <- which(!is.na(myData$count))
myData[seq(indx[1], indx[length(indx)]),]
# tree doy count pred
#5 A 79 0 1
#6 A 80 NA 1
#7 A 81 NA 1
#8 A 82 NA 2
#9 A 83 NA 2
#10 A 84 1 2
#11 A 85 NA 2
#12 A 86 NA 3
#13 A 87 NA 3
#14 A 88 NA 3
#15 A 89 2 3
If this is based on groups
ind <- with(myData, ave(!is.na(count), tree,
FUN=function(x) cumsum(x)>0 & rev(cumsum(rev(x))>0)))
myData[ind,]
# tree doy count pred
#5 A 79 0 1
#6 A 80 NA 1
#7 A 81 NA 1
#8 A 82 NA 2
#9 A 83 NA 2
#10 A 84 1 2
#11 A 85 NA 2
#12 A 86 NA 3
#13 A 87 NA 3
#14 A 88 NA 3
#15 A 89 2 3
Or using na.trim from zoo
library(zoo)
do.call(rbind,by(myData, myData$tree, FUN=na.trim))
Or using data.table
library(data.table)
# which() gives positions within each group (.I would give row numbers in the whole table)
setDT(myData)[,.SD[do.call(`:`,as.list(range(which(!is.na(count)))))] , tree]
# tree doy count pred
#1: A 79 0 1
#2: A 80 NA 1
#3: A 81 NA 1
#4: A 82 NA 2
#5: A 83 NA 2
#6: A 84 1 2
#7: A 85 NA 2
#8: A 86 NA 3
#9: A 87 NA 3
#10: A 88 NA 3
#11: A 89 2 3
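Since the example data only has one tree, here is a small sketch with two made-up trees (the data frame obs is invented for illustration) to check that the grouped ave approach above trims each tree separately.

```r
# Made-up data: two trees, observations at different days
obs <- data.frame(
  tree  = rep(c("A", "B"), each = 6),
  doy   = rep(75:80, times = 2),
  count = c(NA, 0, NA, 1, NA, NA,
            NA, NA, 2, NA, 3, NA)
)

# TRUE from the first non-NA count through the last non-NA count, per tree
keep <- ave(!is.na(obs$count), obs$tree,
            FUN = function(x) cumsum(x) > 0 & rev(cumsum(rev(x)) > 0))

trimmed <- obs[keep, ]
trimmed
# tree A keeps doy 76 to 78, tree B keeps doy 77 to 79
```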

Aggregation of all possible unique combinations with observations in the same column in R

I am trying to shorten a chunk of code to make it faster and easier to modify. This is a short example of my data.
order obs year var1 var2 var3
1 3 1 1 32 588 NA
2 4 1 2 33 689 2385
3 5 1 3 NA 678 2369
4 33 3 1 10 214 1274
5 34 3 2 10 237 1345
6 35 3 3 10 242 1393
7 78 6 1 5 62 NA
8 79 6 2 5 75 296
9 80 6 3 5 76 500
10 93 7 1 NA NA NA
11 94 7 2 4 86 247
12 95 7 3 3 54 207
Basically, what I want is R to find any possible and unique combination of two values (observations) in column "obs", within the same year, to create a new matrix or DF with observations being the aggregation of the originals. Order is not important, so 1+6 = 6+1. For instance, having 150 observations, I will expect 11,175 feasible combinations (each year).
I more or less got what I want with basic code but, as you will see, it is far too long (I have built 66 different new data sets this way, so it does not really make sense), and I am wondering how to shorten it. I did some trials (plyr, ...) with no real success. Here is what I did:
# For the 1st year, groups of 2 obs
newmatrix <- data.frame(t(combn(unique(data$obs[data$year==1]), 2)))
colnames(newmatrix) <- c("obs1", "obs2")
newmatrix$name <- do.call(paste, c(newmatrix[c("obs1", "obs2")], sep = "_"))
# and the aggregation of var. using indexes, which I will skip here to save your time :)
To illustrate, here is the result I would get for the 1st year, given the sample above. The NAs appear because I only computed pairs where both values were valid, and only for variables 1 and 3. I used the sum, but it could be any other function:
order obs1 obs2 year var1 var3
1 1 1 3 1_3 42 NA
2 2 1 6 1_6 37 NA
3 3 1 7 1_7 NA NA
4 4 3 6 3_6 15 NA
5 5 3 7 3_7 NA NA
6 6 6 7 6_7 NA NA
As for the 2 first lines in the 3rd year, same type of matrix:
order obs1 obs2 year var1 var3
1 1 1 3 1_3 NA 3762
2 2 1 6 1_6 NA 2868
.......... etc ............
I hope I have explained myself. Thank you in advance for your hints on how to do this more efficiently.
I would use split-apply-combine to split by year, find all the combinations, and then combine back together:
do.call(rbind, lapply(split(data, data$year), function(x) {
p <- combn(nrow(x), 2)
data.frame(order=paste(x$order[p[1,]], x$order[p[2,]], sep="_"),
obs1=x$obs[p[1,]],
obs2=x$obs[p[2,]],
year=x$year[1],
var1=x$var1[p[1,]] + x$var1[p[2,]],
var2=x$var2[p[1,]] + x$var2[p[2,]],
var3=x$var3[p[1,]] + x$var3[p[2,]])
}))
# order obs1 obs2 year var1 var2 var3
# 1.1 3_33 1 3 1 42 802 NA
# 1.2 3_78 1 6 1 37 650 NA
# 1.3 3_93 1 7 1 NA NA NA
# 1.4 33_78 3 6 1 15 276 NA
# 1.5 33_93 3 7 1 NA NA NA
# 1.6 78_93 6 7 1 NA NA NA
# 2.1 4_34 1 3 2 43 926 3730
# 2.2 4_79 1 6 2 38 764 2681
# 2.3 4_94 1 7 2 37 775 2632
# 2.4 34_79 3 6 2 15 312 1641
# 2.5 34_94 3 7 2 14 323 1592
# 2.6 79_94 6 7 2 9 161 543
# 3.1 5_35 1 3 3 NA 920 3762
# 3.2 5_80 1 6 3 NA 754 2869
# 3.3 5_95 1 7 3 NA 732 2576
# 3.4 35_80 3 6 3 15 318 1893
# 3.5 35_95 3 7 3 13 296 1600
# 3.6 80_95 6 7 3 8 130 707
This enables you to be very flexible in how you combine data pairs of observations within a year --- x[p[1,],] represents the year-specific data for the first element in each pair and x[p[2,],] represents the year-specific data for the second element in each pair. You can return a year-specific data frame with any combination of data for the pairs, and the year-specific data frames are combined into a single final data frame with do.call and rbind.
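To make the flexibility concrete, here is a sketch that wraps the same split-apply-combine steps in a function taking the combining function as an argument. pair_up is a hypothetical helper written for this example, and dat is a cut-down, made-up version of the question's data with only var1.

```r
# Made-up data: three observations (1, 3, 6) in each of two years
dat <- data.frame(obs  = c(1, 3, 6, 1, 3, 6),
                  year = c(1, 1, 1, 2, 2, 2),
                  var1 = c(32, 10, 5, 33, 10, 5))

# pair_up() is a hypothetical helper, not part of any package
pair_up <- function(d, fun = `+`) {
  per_year <- lapply(split(d, d$year), function(x) {
    p <- combn(nrow(x), 2)   # each column is one pair of row indices
    first  <- x[p[1, ], ]    # data for the first element of each pair
    second <- x[p[2, ], ]    # data for the second element of each pair
    data.frame(obs1 = first$obs, obs2 = second$obs, year = x$year[1],
               var1 = fun(first$var1, second$var1))
  })
  do.call(rbind, per_year)
}

sums  <- pair_up(dat)        # var1: 42 37 15 43 38 15
maxes <- pair_up(dat, pmax)  # var1: 32 32 10 33 33 10
```

Any vectorised function of two arguments can be passed as fun. Note that combn will fail for a year with fewer than two observations, so you may want to guard against that in real data.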

How to pick value from a particular column in R

My data is in the following form:
y<-data.frame(atp=c(1,0,1,0,0,1),
ssmin=c(2,NA,3,NA,NA,1),
Day_1=round(runif(6,5,11),0),
Day_2=round(runif(6,88,110),0),
Day_3=round(runif(6,90,211),0))
I need to create a new column which picks the value from column 3, 4 or 5 depending on the value in column 2(ssmin).
The output would be like this:
FDRT<-c(89,NA,175,NA,NA,7)
I am trying out the following command but this does not help
y$new<- y[which(y$atp==1),na.omit(2+y$ssmin)]
Can anyone help me write code for this? My data set is large, so I cannot write the values individually.
I think this might be what you're trying to do, but I'm not certain:
set.seed(1)
y<-data.frame(atp=c(1,0,1,0,0,1),
ssmin=c(2,NA,3,NA,NA,1),
Day_1=round(runif(6,5,11),0),
Day_2=round(runif(6,88,110),0),
Day_3=round(runif(6,90,211),0))
y
# atp ssmin Day_1 Day_2 Day_3
# 1 1 2 7 109 173
# 2 0 NA 7 103 136
# 3 1 3 8 102 183
# 4 0 NA 10 89 150
# 5 0 NA 6 93 177
# 6 1 1 10 92 210
x <- vapply(y$ssmin, function(x) unique(grep(x, names(y), value = TRUE)),
vector("character", 1L))
Z <- vector(length = length(x))
for (i in sequence(nrow(y))) {
Z[i] <- if (is.na(x[i])) NA else y[i, x[i]]
}
Z
# [1] 109 NA 183 NA NA 10
If I understand your question correctly, the last line you give almost solves your question. You just have to modify it slightly to get only the diagonal elements of the right hand side and only assign it to the applicable elements of the vector new. Here's the modified code.
y[which(y$atp==1), "new"] <- diag(as.matrix(y[which(y$atp==1),na.omit(2+y$ssmin)]))
not very elegant but short
y
atp ssmin Day_1 Day_2 Day_3
1 1 2 5 97 123
2 0 NA 8 108 165
3 1 3 10 109 190
4 0 NA 9 110 177
5 0 NA 10 91 182
6 1 1 7 94 141
> apply(y,1, function(r)r[r[2]+2])
[1] 97 NA 190 NA NA 7
For a more robust and maintainable solution, you probably want to hardcode the column names, using ddply or some such.
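As one possible version of hardcoding the column names, here is a sketch using matrix indexing in base R: build a two-column index of (row, column) pairs and pull the values out in one step. The Day values are copied from the printed data just above so that the result is reproducible; the names days and new are only for illustration.

```r
y <- data.frame(atp   = c(1, 0, 1, 0, 0, 1),
                ssmin = c(2, NA, 3, NA, NA, 1),
                Day_1 = c(5, 8, 10, 9, 10, 7),
                Day_2 = c(97, 108, 109, 110, 91, 94),
                Day_3 = c(123, 165, 190, 177, 182, 141))

# Only the Day columns, named explicitly, as a numeric matrix
days <- as.matrix(y[, c("Day_1", "Day_2", "Day_3")])

# One (row, column) pair per row of y; a pair containing NA yields NA
y$new <- days[cbind(seq_len(nrow(y)), y$ssmin)]
y$new
# [1]  97  NA 190  NA  NA   7
```

This avoids the loop and does not depend on the Day columns sitting at positions 3 to 5 of the data frame.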
