How do I grab elements from a table in R?
My data looks like this:
V1 V2
1 12.448 13.919
2 22.242 4.606
3 24.509 0.176
etc...
I basically just want to grab elements individually. I'm getting confused with all the R terminology, like vectors, and I just want to be able to get at the individual elements.
Is there a function where I can just do like data[v1][1] and get the element in row 1 column 1?
Try
data[1, "V1"] # Row first, quoted column name second, and case does matter
Further note: Terminology in discussing R can be crucial and sometimes tricky. Using the term "table" to refer to that structure leaves open the possibility that it was either a 'table'-classed, or a 'matrix'-classed, or a 'data.frame'-classed object. The answer above would succeed with any of them, while #BenBolker's suggestion below would only succeed with a 'data.frame'-classed object.
There is a ton of free introductory material for beginners in R: CRAN: Contributed Documentation
?"[" pretty much covers the various ways of accessing elements of things.
Under usage it lists these:
x[i]
x[i, j, ... , drop = TRUE]
x[[i, exact = TRUE]]
x[[i, j, ..., exact = TRUE]]
x$name
getElement(object, name)
x[i] <- value
x[i, j, ...] <- value
x[[i]] <- value
x$i <- value
The second item is sufficient for your purpose
Under Arguments it points out that with [ the arguments i and j can be numeric, character or logical
So these work:
data[1,1]
data[1,"V1"]
As does this:
data$V1[1]
and keeping in mind a data frame is a list of vectors:
data[[1]][1]
data[["V1"]][1]
will also both work.
So that's a few things to be going on with. I suggest you type in the examples at the bottom of the help page one line at a time (yes, actually type the whole thing in one line at a time and see what they all do, you'll pick up stuff very quickly and the typing rather than copypasting is an important part of helping to commit it to memory.)
Maybe not so perfect as above ones, but I guess this is what you were looking for.
data[1:1,3:3] #works with positive integers
data[1:1, -3:-3] #does not work, gives the entire 1st row without the 3rd element
data[i:i,j:j] #given that i and j are positive integers
Here indexing will work from 1, i.e,
data[1:1,1:1] #means the top-leftmost element
Related
There are several ways to index i R e.g. by selecting using positive integers, by excluding using negative integers, by selecting using logicals/conditions, by selecting using names (character-vectors) in a named vector or list, and probably a lot of other ways than that.
What I want is a function taking two inputs ix and lst that tells me if ix makes sense as an index of lst, i.e. that lst[ix] makes sense.
I already know you can do something like
is.index <- function(ix,lst){
ans = FALSE
try({ans=all(!is.na(lst[ix]))},silent = TRUE)
return(ans)
}
But I want it to work when the list contains NAs, and when it's lst is a list it works differently. Both of these cases I could probably easily take care of as special cases, but I feel that I don't know all the possible ways to index, and all the intricacies of these, so I have no way of knowing if I have nailed all the special cases.
I know the "make sense" term isn't well defined, but it would seem reasonable to me, that there exist a function or at least a somewhat easy way of telling if an index is reasonable.
So is there a function or a simple way to do that, preferable not something requiring a try or a try catch statement?
EDIT: I realize that I haven't been clear in the statement of my question. Even if ix is a vector I want to know if lst[ix] makes sense, and not a vector telling me if lst[ix[i]] makes sense for the different possible values of i. Preferably we should have that no matter the type of ix and lst the function should always be able to return one logical value i.e. a TRUE or a FALSE. For example if lst = 1:5 and ix = c(-1,2) should return FALSE and not c(TRUE,TRUE).
Further clarification: Personally I don't like the partial matching or that it makes sense to index by non-integer doubles (I like even less that it just uses the integer part (rather that e.g. rounding to closest integer (useful for small precision errors) or taking the floor (makes lst[x/y] = lst[x%/%y]))); but since it makes sense to R I think it should be up to the preferences of the answerer whether to return TRUE or FALSE in these situations. The same goes for lst[0] and lst[NA], whereas since list("Cheese" = 4)["Che"] gives NA I don't think that partial matching should be accepted.
But seeing that (at least I think) requiring the answerer to make their own choices is bad practice; if I were to choose I think that all these (except partial matching) should be accepted and returned as TRUE.
Something like the following seems to do the job. It uses partial matching to match character ix with the vector or list names.
is.index <- function(X, ix){
if(is.character(ix)){
if(is.null(names(X))) FALSE
!is.na(pmatch(ix, names(X)))
}else{
abs(ix) <= length(X)
}
}
Test with vectors.
x <- 1:6
y <- setNames(x, letters[x])
is.index(x, 2)
is.index(x, 7)
is.index(x, -3)
is.index(y, 'a')
is.index(y, 'z')
And now with lists.
lst <- list(1:6, letters[1:4])
is.index(lst, 3)
is.index(lst, "a")
is.index(lst, -1)
But there are problems, partial matching only works with the $ extractor function, it doesn't work with [[, not even with [.
lst2 <- setNames(lst, c("A", "2nd"))
is.index(lst2, "A")
is.index(lst2, "2n")
lst2$`2n` # works
lst[['2n']] # fails
Recently, I came across a strange behavior of R that I wanted to better understand.
Let us assume the following two examples:
Example 1:
ebit.2018_base <- c(1,2,3,5,7,3,2)
ebit.2017_base <- c(1,2,3,5,7,3,2)
ebit <- data.frame(ebit.2018_base, ebit.2017_base)
ebit$test <- ebit$ebit.2018 * 5
R can compute with this column name, even though it does not exactly match.
Example 2
ebit.2018_base <- c(1,2,3,5,7,3,2)
ebit.2018_notbase <- c(1,2,3,5,7,3,2)
ebit.2017_base <- c(1,2,3,5,7,3,2)
ebit <- data.frame(ebit.2018_base, ebit.2018_notbase, ebit.2017_base)
ebit$test <- ebit$ebit.2018 * 5
This does not work.
My hypothesis: in the first example, R can clearly understand that I refer to the column ebit.2018_base by using the term ebit.2018. In the second example, it is ambiguous because there are two columns starting with ebit.2018.
Is this correct? Sorry if this is common knowledge or has been previously addressed, I just want to be sure I understand the logic behind correctly.
Yes, you are right!
From the documentation of Extract (which you can access through: ?"$" or ?Extract):
Both [[ and $ select a single element of the list. The main difference
is that $ does not allow computed indices, whereas [[ does. x$name is
equivalent to x[["name", exact = FALSE]]. Also, the partial matching
behavior of [[ can be controlled using the exact argument.
Because $ allows partial matching, your first case works. But in the second case, it doesn't resolve a single column and therefore thinks you're asking for a non-existing one; hence the error.
Trying to clean up some dirty data (for work), my data frame has a column for customer information (for our example lets say store and product) in a long weird string, as well as a column for store and a column for product. I can parse the store and the product from the string. Here is where I arrive at my problem.
let's say (consider these vectors part of a larger dataframe, appended with data$ if that helps, I was just working with them as vectors thinking it may speed up the code not having to pull the whole dataframe):
WeirdString <- c("fname: john; lname:smith; store:Amazon Inc.; product:Echo", "fname: cindy; lname:smith; store:BestBuy; product:Ps-4","fname: jon; lname:smith; store:WALMART; product:Pants")
so I parse this to be:
WS_Store <- c("Amazon Inc.", "BestBuy", "WALMART")
WS_Prod <- c("Echo", "Ps-4", "Pants")
What's in the tables (i.e. the non-parsed columns) is:
DB_Store <- c("Amazon", "BEST BUY", "Other")
DB_Prod <- c("ECHO", "PS4", "Jeans")
I currently am using a for loop to loop through i to grepl the "true" string from the parsed string. This takes forever, and I know R was designed to use vectorized code, So my question is, how do I eliminate the loop and use something like lapply (which I tried, and failed at, because I'm not savvy enough with lapply), or some other vectorized thing?
My current code:
for(i in 1:nrow(data)){ # could be i in length(DB_prod) or whatever, all vectors are the same length)
Diff_Store[i] <- !grepl(DB_Store[i], WS_Store[i], ignore.case=T)
Diff_Prod[i] <- !grepl(DB_Prod[i] , WS_Prod[i] , ignore.case=T)
}
I intend to append those columns back into the dataframe, as the true goal is to diagnose why the database has this problem.
If there's a better way than this, rather than trying to vectorize it, I'm open to it. The data in the DB_Store is restricted to a specific number of "stores" (in the table it comes from) but in the string, it seems to be open, which is why I use the DB as the pattern, not the x. Product is similar, but not as restricted, this is why some have dashes and some don't. I would love to match "close things" like Ps-4 vs. PS4, but I will probably just build a table of matches once I see how weird the string gets. To be true though, the string may not match, which is represented by the Pants/Jeans thing. The dataset is 2.5 million records, and there are many different "stores" and "products", and I do want to make sure they match on the same line, not "is it in the database" (which is what previous questions seem to ask, can I see if a string is in a list of strings, rather than a 1:1 comparison, and the last question did end in a loop, which takes minutes and hours to run)
Thanks!
Please check if this works for you:
check <- function(vec_a, vec_b){
mat <- cbind(vec_a, vec_b)
diff <- apply(mat, 1, function(x) !grepl(pattern = x[1], x = x[2], ignore.case = TRUE))
diff
}
Use your different vectors for stores (or products) in the arguments vec_a and vec_b, respectively (example: diff_stores <- check(DB_Store, WS_Store) ). This function will return a logical vector with TRUE values referring to items that weren't a match in the two original vectors. Is this what you wanted?
I have three data sources:
types<-c(1,3,3)
places<-list(c(1,2,3),1,c(2,3))
lookup.counts<-as.data.frame(matrix(runif(9,min=0,max=10),nrow=3,ncol=3))
assigned.places<-rep.int(0,length(types))
the numbers in the "types" vector tell me what 'type' a given observation is. The vectors in the places list tell me which places the observation can be found in (some observations are found in only one place, others in all places). By definition there is one entry in types and one list in places for each observation. Lookup.counts tells me how many observations of each type are located in each place (generated from another data source).
I want to randomly assign each observation to a place based on a probability generated from lookup.counts. Using for loops it looks something like"
for (i in 1:length(types)){
row<-types[i]
columns<-places[[i]]
this.obs<-lookup.counts[row,columns] #the counts of this type in each place
total<-sum(this.obs)
this.obs<-this.obs/total #the share of observations of this type in these places
pick<-runif(1,min=0,max=1)
#the following should really be a 'while' loop, but regardless it needs help
for(j in 1:length(this.obs[])){
if(this.obs[j] > pick){
#pick is less than this county so assign
pick<- 100 #just a way of making sure an observation doesn't get assigned twice
assigned.places[i]<-colnames(lookup.counts)[j]
}else{
#pick is greater, move to the next category
pick<- pick-this.obs[j]
}
}
}
I have been trying to vectorize this somehow, but am getting hung up on the variable length of 'places' and of 'this.obs'
In practice, of course, the lookup.counts table is quite a bit bigger (500 x 40) and I have some 900K observations with places lists of length 1 through length 39.
To vectorize the inner loop, you can use sample or sample.int to choose from several alternaives with prescribed probabilities. Unless I read your code incorrectly, you want something like this:
assigned.places[i] <- sample(colnames(this.obs), 1, prob = this.obs)
I'm a bit surprised that you're using colnames(lookup.counts) instead. Shouldn't this be subset by columns as well? It seems that either I missed something, or there is a bug in your code.
the different lengths of your lists are a severe obstacle to vectorizing your outer loops. Perhaps you could use the Matrix package to store that information as sparse matrices. Then you could simply multiply probabilities by that vector to exclude those columns which are not in the places list of a given observation. But as you'd probably still use apply for the above sampling code, you might as well keep the list and use some form of apply to iterate over that.
The overall result might look somewhat like this:
assigned.places <- colnames(lookup.counts)[
apply(cbind(types, places), 1, function(x) {
sample(x[[2]], 1, prob=lookup.counts[x[[1]],x[[2]]])
})
]
The use of cbind and apply isn't particularly beautiful, but seems to work. Each x is a list of two items, x[[1]] being the type and x[[2]] being the corresponding places. We use these to index lookup.counts just as you did. Then we use the found counts as relative probabilities when choosing the index of one of the columns we used in the subscript. Only after all these numbers have been assembled into a single vector by apply will the indices be turned into names based on colnames.
You can check whether things are faster if you don't cbindstuff together, but instead iterate over the indices only:
assigned.places <- colnames(lookup.counts)[
sapply(1:length(types), function(i) {
sample(places[[i]], 1, prob=lookup.counts[types[i],places[[i]]])
})
]
This appears to work as well:
# More convenient if lookup.counts is a matrix.
lookup.counts<-matrix(runif(9,min=0,max=10),nrow=3,ncol=3)
colnames(lookup.counts)<-paste0('V',1:ncol(lookup.counts))
# A function that does what the for loop does for each i
test<-function(i) {
this.places<-colnames(lookup.counts)[places[[i]]]
this.obs<-lookup.counts[types[i],this.places]
sample(this.places,size=1,prob=this.obs)
}
# Applies the function for all i
sapply(1:length(types),test)
I've created a list of matrices in R. In all matrices in the list, I'd like to "pull out" the collection of matrix elements of a particular index. I was thinking that the colon operator might allow me to implement this in one line. For example, here's an attempt to access the [1,1] elements of all matrices in a list:
myList = list() #list of matrices
myList[[1]] = matrix(1:9, nrow=3, ncol=3, byrow=TRUE) #arbitrary data
myList[[2]] = matrix(2:10, nrow=3, ncol=3, byrow=TRUE)
#I expected the following line to output myList[[1]][1,1], myList[[2]][1,1]
slice = myList[[1:2]][1,1] #prints error: "incorrect number of dimensions"
The final line of the above code throws the error "incorrect number of dimensions."
For reference, here's a working (but less elegant) implementation of what I'm trying to do:
#assume myList has already been created (see the code snippet above)
slice = c()
for(x in 1:2) {
slice = c(slice, myList[[x]][1,1])
}
#this works. slice = [1 2]
Does anyone know how to do the above operation in one line?
Note that my "list of matrices" could be replaced with something else. If someone can suggest an alternative "collection of matrices" data structure that allows me to perform the above operation, then this will be solved.
Perhaps this question is silly...I really would like to have a clean one-line implementation though.
Two things. First, the difference between [ and [[. The relevant sentence from ?'[':
The most important distinction between [, [[ and $ is that the [ can
select more than one element whereas the other two select a single
element.
So you probably want to do myList[1:2]. Second, you can't combine subsetting operations in the way you describe. Once you do myList[1:2] you will get a list of two matrices. A list typically has only one dimension, so doing myList[1:2][1,1] is nonsensical in your case. (See comments for exceptions.)
You might try lapply instead: lapply(myList,'[',1,1).
If your matrices will all have same dimension, you could store them in a 3-dimensional array. That would certainly make indexing and extracting elements easier ...
## One way to get your data into an array
a <- array(c(myList[[1]], myList[[2]]), dim=c(3,3,2))
## Extract the slice containing the upper left element of each matrix
a[1,1,]
# [1] 1 2
This works:
> sapply(myList,"[",1,1)
[1] 1 2
edit: oh, sorry, I see almost the same idea toward the end of an earlier answer. But sapply probably comes closer to what you want, anyway