I read that using seq_along() handles the empty case much better, but this concept is not so clear in my mind.
For example, I have this data frame:
df
a b c d
1 1.2767671 0.133558438 1.5582137 0.6049921
2 -1.2133819 -0.595845408 -0.9492494 -0.9633872
3 0.4512179 0.425949910 0.1529301 -0.3012190
4 1.4945791 0.211932487 -1.2051334 0.1218442
5 2.0102918 0.135363711 0.2808456 1.1293810
6 1.0827021 0.290615747 2.5339719 -0.3265962
7 -0.1107592 -2.762735937 -0.2428827 -0.3340126
8 0.3439831 0.323193841 0.9623515 -0.1099747
9 0.3794022 -1.306189542 0.6185657 0.5889456
10 1.2966537 -0.004927108 -1.3796625 -1.1577800
Considering these three different code snippets:
# Case 1
for (i in 1:ncol(df)) {
print(median(df[[i]]))
}
# Case 2
for (i in seq_along(df)) {
print(median(df[[i]]))
}
# Case 3
for(i in df) print(median(i))
What is the difference between these different procedures when a full data.frame exists or in the presence of an empty data.frame?
Under the condition that df <- data.frame(), we have:
Case 1 falling victim to...
Error in .subset2(x, i, exact = exact) : subscript out of bounds
while Case 2 and 3 are not triggered.
In essence, the error in Case 1 is due to ncol(df) being 0. This makes the sequence 1:ncol(df) equal to 1:0, which counts down and creates the vector c(1, 0). The for loop therefore runs, and its first iteration uses i = 1, which tries to access column 1, a column that does not exist. Hence, the subscript is out of bounds.
Meanwhile, in Cases 2 and 3 the for loop body is never executed, because the collections being iterated over are empty, i.e. they have a length of 0.
As this question specifically relates to what the heck is happening to seq_along(), let's take a traditional seq_along example by constructing a full vector a and seeing the results:
set.seed(111)
a <- runif(5)
seq_along(a)
#[1] 1 2 3 4 5
In essence, for each element of the vector a, there is a corresponding index that was created by seq_along to be accessed.
If we apply seq_along now to the empty df in the above case, we get:
seq_along(df)
# integer(0)
Thus, what was created was a zero-length vector. It's mighty hard to move along a zero-length vector.
Ergo, Case 1 poorly protects against the empty case.
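To see the difference directly, here is a minimal sketch comparing the index vectors that each approach produces for an empty data.frame. seq_len() does not appear in the question; it is included only as the usual alternative when you want to count up to ncol(df). All variable names are illustrative.

```r
# An empty data.frame: zero rows and zero columns.
df_empty <- data.frame()

idx_colon <- 1:ncol(df_empty)         # 1:0 counts down: two elements, 1 and 0
idx_along <- seq_along(df_empty)      # integer(0): nothing to iterate over
idx_len   <- seq_len(ncol(df_empty))  # integer(0) as well

# How many times would each loop body run?
n_colon <- length(idx_colon)  # 2 iterations, the first one already fails
n_along <- length(idx_along)  # 0 iterations

for (i in idx_along) {
  print(median(df_empty[[i]]))  # never reached for an empty data.frame
}
```

So the loop over seq_along(df) needs no special guard for the empty case, whereas 1:ncol(df) would need an explicit check such as if (ncol(df) > 0).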
Now, under the traditional assumption that there is some data within the data.frame (which is a very bad assumption for any kind of developer to make)...
set.seed(1234)
df <- data.frame(matrix(rnorm(40), ncol = 4))
All three cases would operate as expected. That is, you would receive a median per column of the data.frame.
[1] -0.5555419
[1] -0.4941011
[1] -0.4656169
[1] -0.605349
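If the goal is to collect the medians rather than print them, a vectorized call avoids the index question altogether. The following is only a sketch of that alternative, not something from the answer above; vapply() is swapped in for the explicit loop, and df_full, med_loop, med_vapply and med_empty are illustrative names.

```r
set.seed(1234)
df_full <- data.frame(matrix(rnorm(40), ncol = 4))  # 10 rows, 4 columns

# Loop version (Case 2), storing the results instead of printing them.
med_loop <- numeric(ncol(df_full))
for (i in seq_along(df_full)) {
  med_loop[i] <- median(df_full[[i]])
}

# vapply() version: one median per column, with the result type declared.
med_vapply <- vapply(df_full, median, numeric(1))

# On an empty data.frame the same call returns a zero-length result.
med_empty <- vapply(data.frame(), median, numeric(1))
```

Like seq_along(), vapply() simply does nothing when there are no columns, so the empty case needs no extra handling.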
I don't know how to explain it briefly, so I'll try my best:
I have the following example data:
Data<-data.frame(A=c(1,2,3,5,8,9,10),B=c(5.3,9.2,5,8,10,9.5,4),C=c(1:7))
and an index
Ind<-data.frame(I=c(5,6,2,4,1,3,7))
The values in Ind correspond to the C column in Data. I want to start with the first Ind value and find the corresponding row in Data (via column C). From that row I want to look up and down and find the values in column A that are within a tolerance of 1. I want to write these rows into a result data frame, add a group id column, and delete them from Data (where I found them). Then I continue with the next entry in Ind, and so on until Data is empty. I know how to match Ind against column C of Data, and how to do the writing, deleting and other bookkeeping in a for loop, but I don't know how to do the main point, which is my question here:
When I have found my row in Data, how can I look up the values of column A within the tolerance range above and below that entry to get my group id?
What I want to get is this result:
A B C Group
1 5.3 1 2
2 9.2 2 2
3 5 3 2
5 8 4 3
8 10 5 1
9 9.5 6 1
10 4 7 4
Maybe somebody could help me with the critical point in my question or even how to solve this issue in a fast way.
Many thanks!
Generally: avoid deleting or growing a data frame row by row inside a loop. R's memory management means that every time you add or delete a row, another copy of the data frame is made. Garbage collection will eventually discard the "old" copies of the data frame, but garbage can quickly accumulate and reduce performance. Instead, add a logical column to the Data data frame, and set "extracted" rows to TRUE. So like this:
Data$extracted <- rep(FALSE,nrow(Data))
As for your problem: I get a different set of grouping numbers, but the groups are identical.
There might be a more elegant way to do this, but this will get it done.
# store results in a separate list
res <- list()
group.counter <- 1
# loop until they're all done.
for(idx in Ind$I) {
  # skip this iteration if idx is NA.
  if(is.na(idx)) {
    next
  }
  # dat.rows is a logical vector which shows the rows where
  # "A" meets the tolerance requirement.
  # specify the tolerance here.
  mytol <- 1
  # the next only works for integer compare.
  # also not covered: what if multiple values of C
  # match idx? do we loop over each corresponding value of A,
  # i.e. loop over each value of 'target'?
  target <- Data$A[Data$C == idx]
  # use the magic of vectorized logical compare.
  dat.rows <-
    ( (Data$A - target) >= -mytol) &
    ( (Data$A - target) <= mytol) &
    ( ! Data$extracted)
  # if dat.rows is all false, then nothing met the criteria.
  # skip the rest of the loop
  if( ! any(dat.rows)) {
    next
  }
  # copy the rows to the result list.
  res[[length(res) + 1]] <- data.frame(
    A=Data[dat.rows,"A"],
    B=Data[dat.rows,"B"],
    C=Data[dat.rows,"C"],
    Group=group.counter # this value will be recycled to match length of A, B, C.
  )
  # flag the extraction.
  Data$extracted[dat.rows] <- TRUE
  # increment the group counter
  group.counter <- group.counter + 1
}
# now make a data.frame from the results.
# this is the last step in how we avoid
# "growing" a data.frame inside a loop.
resData <- do.call(rbind, res)
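The same idea can be sketched more compactly by storing the group id itself instead of a separate logical flag. This is a variant of the loop above, not a replacement for it; group_by_tolerance is a hypothetical helper name, and skipping an index that matches zero or several rows of C is an assumption, since the question does not cover that case.

```r
Data <- data.frame(A = c(1, 2, 3, 5, 8, 9, 10),
                   B = c(5.3, 9.2, 5, 8, 10, 9.5, 4),
                   C = 1:7)
Ind <- data.frame(I = c(5, 6, 2, 4, 1, 3, 7))

# Hypothetical helper: returns dat with an added Group column.
group_by_tolerance <- function(dat, index, tol = 1) {
  group <- rep(NA_integer_, nrow(dat))  # NA means "not extracted yet"
  counter <- 1L
  for (idx in index) {
    target <- dat$A[dat$C == idx]
    if (length(target) != 1L) next      # assumption: skip missing/ambiguous C
    # rows within the tolerance that have not been assigned to a group
    hit <- abs(dat$A - target) <= tol & is.na(group)
    if (!any(hit)) next
    group[hit] <- counter
    counter <- counter + 1L
  }
  cbind(dat, Group = group)
}

grouped <- group_by_tolerance(Data, Ind$I)
```

With the example data, rows with A = 8 and 9 form group 1, A = 10 forms group 2, A = 1, 2, 3 form group 3, and A = 5 forms group 4, so the numbering differs from the requested output but the groups are the same.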
Part of a function I'm working on uses the following code to take a data frame and reorder its columns on the basis of the largest (absolute) value in each column.
ord <- order(abs(apply(dfm,2,function(x) x[which(abs(x) == max(abs(x)), arr.ind = TRUE)])))
For the most part, this works fine, but with the dataset I'm working on, I occasionally get data that looks like this:
a <- rnorm(10,5,7); b <- rnorm(10,0,1); c <- rep(1,10)
dfm <- data.frame(A = a, B = b, C = c)
> dfm
A B C
1 0.6438373 -1.0487023 1
2 10.6882204 0.7665011 1
3 -16.9203506 -2.5047946 1
4 11.7160291 -0.1932127 1
5 13.0839793 0.2714989 1
6 11.4904625 0.5926858 1
7 -5.9559206 0.1195593 1
8 4.6305924 -0.2002087 1
9 -2.2235623 -0.2292297 1
10 8.4390810 1.1989515 1
When that happens, the above code returns a "non-numeric argument to mathematical function" error at the abs() step. (And if I get rid of the abs() step because I know, due to transformation, my data will be all positive, order() returns: "unimplemented type 'list' in 'orderVector1'".) This is because which() returns all the 1's in column C, which in turn makes apply() spit out a list, rather than a nice tidy vector.
My question is this: How can I make which() JUST return one value for column C in this case? Alternately, is there a better way to write this code to do what I want it to (reorder the columns of a matrix based on the largest value in each column, whether or not that largest value is duplicated) that won't have this problem?
If you want to select just the first element of the result, you can subset it with [1]:
ord <- order(abs(apply(dfm,2,function(x) x[which(abs(x) == max(abs(x)), arr.ind = TRUE)][1])))
To order the columns by their maximum element (in absolute value), you can do
dfm[order(apply(abs(dfm),2,max))]
Your code, with @CarlosCinelli's correction, should work fine, though.
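To make the cause of the problem concrete, here is a small sketch with a constant column. The seed is arbitrary and only there for reproducibility; n_ties, col_max and dfm_ordered are illustrative names.

```r
set.seed(42)  # arbitrary seed, only for reproducibility
dfm <- data.frame(A = rnorm(10, 5, 7), B = rnorm(10, 0, 1), C = rep(1, 10))

# which() finds every position of the maximum: all 10 positions in column C.
n_ties <- length(which(abs(dfm$C) == max(abs(dfm$C))))

# max(abs(x)) always yields exactly one number per column, ties or not.
col_max <- vapply(dfm, function(x) max(abs(x)), numeric(1))

# Reorder the columns from smallest to largest absolute maximum.
dfm_ordered <- dfm[order(col_max)]
```

Because max() collapses ties to a single value, the per-column results always have the same length and can be passed to order() directly.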
How can I make R check whether an object is too large to print in the console? "Too large" here means larger than a user-defined value.
Example: You have a list f_data with two elements f_data$data (a 100MB data.frame) and f_data$info (for instance, a vector). Assume you want to inspect the first few entries of the f_data$data data.frame but you make a mistake and type head(f_data) instead of head(f_data$data). R will try to print the whole content of f_data to the console (which would take forever).
Is there somewhere an option that I can set in order to suppress the output of objects that are larger than let's say 1MB?
Edit: Thank you guys for your help. After implementing the max.print option I realized that this does indeed give the desired output. BUT the problem that the output takes very long to show up still persists. I will give you a proper example below.
df_nrow=100000
df_ncol=100
#create list with first element being a large data.frame
#second element is a short vector
test_list=list(df=data.frame(matrix(rnorm(df_nrow*df_ncol),nrow=df_nrow,ncol=df_ncol)),
vec=1:110)
#only print the first 100 elements of an object
options(max.print=100)
#head correctly displays the first row of the data.frame
#BUT for some reason the output takes really long to show up in the console (~30sec)
head(test_list)
#let's try to see how long exactly
system.time(head(test_list))
# user system elapsed
# 0 0 0
#well, obviously system.time is not the proper tool to measure this
#the same problem if I just print the object to the console without using head
test_list$df
I assume that R performs some sort of analysis on the object being printed and this is what takes so long.
Edit 2:
As per my comment below, I checked whether the problem persists if I use a matrix instead of a data.frame.
#create list with first element being a large MATRIX
test_list=list(mat=matrix(rnorm(df_nrow*df_ncol),nrow=df_nrow,ncol=df_ncol),vec=1:110)
#no problem
head(test_list)
#no problem
test_list$mat
Could it be that the output to the console is not really efficiently implemented for data.frame objects?
I think there is no such option, but you can check the size of an object with object.size and print it only if it is below a threshold (measured in bytes), for example:
print.small.objects <- function(x, threshold = 1e06, ...)
{
  if (object.size(x) < threshold) {
    print(x, ...)
  } else {
    cat("too big object\n")
    print(object.size(x))
  }
}
Here's an example that you could adjust up to 100MB. It basically only prints the first 6 rows and 5 columns if the object's size is above 8e5 bytes. You could also turn this into a function and place it in your .Rprofile
> lst <- list(data.frame(replicate(100, rnorm(1000))), 1:10)
> sapply(lst, object.size)
# [1] 810968 88
> lapply(lst, function(x){
if(object.size(x) > 8e5) head(x)[1:5] else x
})
#[[1]]
# X1 X2 X3 X4 X5
#1 0.3398235 -1.7290077 -0.35367971 0.09874918 -0.8562069
#2 0.2318548 -0.3415523 -0.38346083 -0.08333569 -1.1091982
#3 0.0714407 -1.4561768 0.50131914 -0.54899188 0.1652095
#4 -0.5170228 1.7343073 -0.05602883 0.87855313 0.4025590
#5 0.6962212 -0.3179930 0.28016057 1.05414456 -0.5172885
#6 0.9471200 1.4424843 -1.46323827 -0.78004192 -1.3611820
#
#[[2]]
# [1] 1 2 3 4 5 6 7 8 9 10
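The two answers can be combined into one helper that descends into lists and truncates only the parts that are too large. This is just a sketch under assumptions: peek is a hypothetical name, and the default threshold of 1e6 bytes and n = 6 rows are arbitrary choices.

```r
# Hypothetical helper: return x unchanged if it is small, otherwise a preview.
peek <- function(x, threshold = 1e6, n = 6L) {
  if (object.size(x) <= threshold) {
    return(x)                       # small enough: show everything
  }
  if (is.list(x) && !is.data.frame(x)) {
    # plain list: apply the same rule to every element
    return(lapply(x, peek, threshold = threshold, n = n))
  }
  head(x, n)                        # large data.frame, matrix or vector
}

lst <- list(df = data.frame(replicate(100, rnorm(1000))), vec = 1:10)
out <- peek(lst, threshold = 8e5)
```

Here out$df holds only the first 6 rows of the large data.frame, while the short vector out$vec is returned in full.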
I need to compare the values stored in two variables. The variables have different lengths. For example
x = c(1,2,3,4,5,6,7,8,9,10)
and
y = c(2,6,11,12,13)
I need an answer saying that 2 and 6 are present in both variables. I need this to be done in R. Can anyone help, please?
The intersect function avoids the need for @mdsumner's simple indexing:
> x = c(1,2,3,4,5,6,7,8,9,10)
> y = c(2,6,11,12,13)
> intersect(x,y)
[1] 2 6
Whole bunch of set operators to be found here: help(intersect)
Posted after the added requirement that some sort of tolerance be allowed: You could sequentially check one set of values against all the others in the second set or you could do it all at once with outer(). Once you have the outer result as a logical matrix there remains the task of referring back to the values, but expand.grid seems capable of handling that:
expand.grid(x,y)[outer(x,y, FUN=function(x,y) abs(x-y) < 0.01), ]
# Var1 Var2
#2 2 2
#16 6 6
After posting, it occurred to me that your values were sorted. It turns out that this extraction from expand.grid() also works when unsorted vectors are passed.
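Here is a small sketch of the tolerance match on unsorted input. The values and the tolerance of 0.01 are made up for illustration, and is_close and pairs are illustrative names.

```r
x <- c(6.001, 1, 2)   # deliberately unsorted, with near-matches
y <- c(2.004, 13, 6)
tol <- 0.01           # illustrative tolerance

# Logical matrix: is_close[i, j] is TRUE when x[i] and y[j] are within tol.
is_close <- outer(x, y, FUN = function(a, b) abs(a - b) < tol)

# expand.grid() lists the pairs in the same column-major order as the matrix,
# so the flattened matrix can be used directly as a row filter.
pairs <- expand.grid(x = x, y = y)[as.vector(is_close), ]
```

Each row of pairs holds one x value and the y value it matched, so the pairs (2, 2.004) and (6.001, 6) are both recovered regardless of the input order.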
x[x %in% y]
[1] 2 6
Or, more explicitly:
x[match(x, y, nomatch = 0) > 0]
[1] 2 6
Note that you actually chain together the results of the match with simple indexing into the input values.
See ?match.
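For the original vectors the three approaches agree, but they differ once the first vector contains duplicates. A short sketch (x_dup is an illustrative vector that is not in the question):

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
y <- c(2, 6, 11, 12, 13)

common_intersect <- intersect(x, y)                  # set semantics
common_in        <- x[x %in% y]                      # logical indexing
common_match     <- x[match(x, y, nomatch = 0) > 0]  # explicit match()

# With duplicates in the first vector the results are no longer the same:
x_dup <- c(2, 2, 6)
dup_intersect <- intersect(x_dup, y)  # duplicates removed
dup_in        <- x_dup[x_dup %in% y]  # duplicates kept
```

intersect() treats its arguments as sets and drops repeated values, while indexing with %in% keeps every matching element of the first vector.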
Super short version: I'm trying to use a user-defined function to populate a new column in a dataframe with the command:
TestDF$ELN<-EmployeeLocationNumber(TestDF$Location)
However, when I run the command, it seems to just apply EmployeeLocationNumber to the first row's value of Location rather than using each row's value to determine the new column's value for that row individually.
Please note: I'm trying to understand R, not just perform this particular task. I was actually able to get the output I was looking for using the Apply() function, but that's irrelevant. My understanding is that the above line should work on a row-by-row basis, but it isn't.
Here are the specifics for testing:
TestDF<-data.frame(Employee=c(1,1,1,1,2,2,3,3,3),
Month=c(1,5,6,11,4,10,1,5,10),
Location=c(1,5,6,7,10,3,4,2,8))
This TestDF keeps track of where each of 3 employees was over the course of the year among several locations.
(You can think of "Location" as unique to each Employee...it is essentially a unique ID for that row.)
The function EmployeeLocationNumber takes a location and outputs a number indicating the order in which that employee visited that location. For example EmployeeLocationNumber(8) = 2 because it was the second location visited by the employee who visited it.
EmployeeLocationNumber <- function(Site){
CurrentEmployee <- subset(TestDF,Location==Site,select=Employee, drop = TRUE)[[1]]
LocationDate<- subset(TestDF,Location==Site,select=Month, drop = TRUE)[[1]]
LocationNumber <- length(subset(TestDF,Employee==CurrentEmployee & Month<=LocationDate,select=Month)[[1]])
return(LocationNumber)
}
I realize I probably could have packed all of that into a single subset command, but I didn't know how referencing worked when you used subset commands inside other subset commands.
So, keeping in mind that I'm really trying to understand how to work in R, I have a few questions:
Why won't TestDF$ELN<-EmployeeLocationNumber(TestDF$Location) work row-by-row like other assignment statements do?
Is there an easier way to reference a particular value in a dataframe based on the value of another one? Perhaps one that does not return a dataframe/list that then must be flattened and extracted from?
I'm sure the function I'm using is laughably un-R-like...what should I have done to essentially emulate an INNER Join type query?
Using logical indexing, the condensed one-liner replacement for your function is:
EmployeeLocationNumber <- function(Site){
with(TestDF[do.call(order, TestDF), ], which(Location[Employee==Employee[which(Location==Site)]] == Site))
}
Of course this isn't the most readable way, but it demonstrates the principles of logical indexing and which() in R. Then, like others have said, just wrap it up with a vectorized *ply function to apply this across your dataset.
A) TestDF$Location is a vector. Your function is not set up to return a vector, so giving it a vector will probably fail.
B) In what sense is Location:8 the "second location visited"?
C) If you want within-group ordering, then you need to pass your dataframe, split up by employee, to a function that calculates a result.
D) Conditional access of a data.frame typically involves logical indexing and/or the use of which().
If you just want the sequence of visits by employee try this:
(The first argument is Month, since that is what determines the sequence of locations.)
with(TestDF, ave(Month, Employee, FUN=seq_along))
[1] 1 2 3 4 1 2 1 2 3
TestDF$LocOrder <- with(TestDF, ave(Month, Employee, FUN=seq_along))
If you wanted the second location for EE:3 it would be:
subset(TestDF, LocOrder==2 & Employee==3, select= Location)
# Location
# 8 2
The vectorized nature of R (aka row-by-row) works not by repeatedly calling the function with each next value of the arguments, but by passing the entire vector at once and operating on all of it at one time. But in EmployeeLocationNumber, you only return a single value, so that value gets repeated for the entire data set.
Also, your example for EmployeeLocationNumber does not match your description.
> EmployeeLocationNumber(8)
[1] 3
Now, one way to vectorize a function in the manner you are thinking (repeated calls for each value) is to pass it through Vectorize()
TestDF$ELN<-Vectorize(EmployeeLocationNumber)(TestDF$Location)
which gives
> TestDF
Employee Month Location ELN
1 1 1 1 1
2 1 5 5 2
3 1 6 6 3
4 1 11 7 4
5 2 4 10 1
6 2 10 3 2
7 3 1 4 1
8 3 5 2 2
9 3 10 8 3
As to your other questions, I would just write it as
TestDF$ELN<-ave(TestDF$Month, TestDF$Employee, FUN=rank)
The logic is: take the months, look at them in separate groups by employee, and return the rank of each month within its group (where it falls in order).
Your EmployeeLocationNumber function takes a vector in and returns a single value.
The assignment to create a new data.frame column therefore just gets a single value:
EmployeeLocationNumber(TestDF$Location) # returns 1
TestDF$ELN<-1 # Creates a new column with the single value 1 everywhere
Assignment doesn't do any magic like that. It takes a value and puts it somewhere. In this case the value 1. If the value was a vector of the same length as the number of rows, it would work as you wanted.
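The difference between one call on the whole vector and one call per element can be checked directly. The sketch below reuses TestDF and EmployeeLocationNumber from the question; vapply() is used here as one way to call the function per element, and single_value, per_row and by_group are illustrative names.

```r
TestDF <- data.frame(Employee = c(1, 1, 1, 1, 2, 2, 3, 3, 3),
                     Month    = c(1, 5, 6, 11, 4, 10, 1, 5, 10),
                     Location = c(1, 5, 6, 7, 10, 3, 4, 2, 8))

# Function from the question, unchanged apart from spacing.
EmployeeLocationNumber <- function(Site) {
  CurrentEmployee <- subset(TestDF, Location == Site, select = Employee, drop = TRUE)[[1]]
  LocationDate <- subset(TestDF, Location == Site, select = Month, drop = TRUE)[[1]]
  LocationNumber <- length(subset(TestDF, Employee == CurrentEmployee & Month <= LocationDate, select = Month)[[1]])
  return(LocationNumber)
}

# One call with the whole vector: a single number comes back.
single_value <- EmployeeLocationNumber(TestDF$Location)

# One call per element: a vector with one result per row.
per_row <- vapply(TestDF$Location, EmployeeLocationNumber, integer(1))

# Grouped ranking gives the same answer without the helper function.
by_group <- ave(TestDF$Month, TestDF$Employee, FUN = rank)
```

Assigning single_value to a new column recycles that one number down all rows, whereas per_row and by_group already have one entry per row.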
I'll get back to you on that :)
Ditto.
Update: I finally worked out some code to do it, but by then @DWin had a much better solution :(
TestDF$ELN <- unlist(lapply(split(TestDF, TestDF$Employee), function(x) rank(x$Month)))
...I guess the ave function does pretty much what the code above does. But for the record:
First I split the data.frame into sub-frames, one per employee. Then I rank the months (just in case your months are not in order). You could use order too, but rank can handle ties better. Finally I combine all the results into a vector and put it into the new column ELN.
Update again Regarding question 2, "What is the best way to reference a value in a dataframe?":
This depends a bit on the specific problem, but if you have a value, say Employee=3, and want to find all rows in the data.frame that match it, then simply:
TestDF$Employee == 3 # Returns logical vector with TRUE for all rows with Employee == 3
which(TestDF$Employee == 3) # Returns a vector of indices instead
TestDF[which(TestDF$Employee == 3), ] # Subsets the data.frame on Employee == 3
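One practical difference between indexing with the logical vector and indexing with which() shows up when the column contains NA. A minimal sketch with made-up data (emp_df is not part of the question):

```r
emp_df <- data.frame(Employee = c(1, 3, NA, 3),
                     Location = c(10, 20, 30, 40))

is_three <- emp_df$Employee == 3            # FALSE TRUE NA TRUE

rows_logical <- emp_df[is_three, ]          # NA in the index yields an all-NA row
rows_which   <- emp_df[which(is_three), ]   # which() keeps only the TRUE positions
```

rows_logical has three rows, one of them filled with NA, while rows_which has only the two rows where Employee is really 3.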