I'm a newbie in R (and in programming). I have seen some examples with one or two "[", but I am not sure what they mean.
dim(data)[[-1]] # means the column number of a data frame
dim(data)[-1] # what does it mean?
samples[,dim(samples)[[2]],2] # what does this mean?
Thanks a lot for your help!
If data is an object of class data.frame, matrix or array, dim() returns an integer vector containing the size of each dimension, so the subsetting operator is simply applied to that vector. The operations you described can be used more generally. Here is an explanation of what exactly they do.
Let vec <- dim(data)
vec[-1] - drops the first element similar to vec[2:length(vec)]
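vec[[-1]] - gives the same result as above in your example, but only because dim() of a data frame has exactly two elements, so dropping the first leaves a single one. [[ must select exactly one element and fails on longer vectors. It is usually used in the context of data.frames and lists. Here is an example that demonstrates the difference: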
dt <- data.frame(a = rnorm(20), b = rnorm(20))
dt[-1] # returns data.frame with only b column
dt[[-1]] # returns numeric vector containing values of b column
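As a small sketch of the same distinction on plain vectors (vec2 and vec3 are made-up stand-ins for what dim() would return on a data frame and on a three-dimensional array):

```r
vec2 <- c(20L, 2L)     # stand-in for dim() of a 20 x 2 data frame
vec3 <- c(2L, 2L, 2L)  # stand-in for dim() of a 2 x 2 x 2 array

drop_first_2 <- vec2[-1]    # drops the first element, leaves the column count
only_left_2  <- vec2[[-1]]  # works because exactly one element remains
drop_first_3 <- vec3[-1]    # still fine: two elements remain

# `[[` must select exactly one element, so a negative index fails
# on a vector with more than two elements; capture the error message
err <- tryCatch(vec3[[-1]], error = function(e) conditionMessage(e))
```

So for dim() of a data frame the two forms agree, but [-1] is the safer habit when the number of dimensions is not known in advance.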
samples[, dim(samples)[[2]], 2] - this syntax is more often used for indexing an array (a matrix with more than two dimensions). It returns a numeric vector that contains all rows of the last column in the second slice of the third dimension. You can play with the following to see for yourself:
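# use a name that does not shadow the array() function
arr <- array(data = rnorm(8), dim = c(2, 2, 2))
# all rows, last column, second slice of the third dimension
arr[, dim(arr)[[2]], 2]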
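With random data it is hard to see which cells were picked, so here is a hedged variant using the integers 1 to 8 instead (the object name arr is an assumption for illustration):

```r
# fill a 2 x 2 x 2 array with 1..8 so every cell is recognisable
arr <- array(data = 1:8, dim = c(2, 2, 2))

last_col <- dim(arr)[[2]]       # number of columns, here 2
picked   <- arr[, last_col, 2]  # all rows, last column, second slice

# the same cells written out explicitly
explicit <- arr[, 2, 2]
```

R fills arrays column by column and slice by slice, so the second column of the second slice holds the values 7 and 8.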
Note: Please provide example data so that we don't have to guess what the objects are or recreate them.
[Probably this question already has an answer here, but I didn't manage to find one, also because I have some difficulty in formulating it concisely. Suggestions for reformulating the title of the question are appreciated.]
I have
a list of matrices with different numbers of rows,
a vector of integer values with the same names as the list's,
a list of names that appear in the list and vector above,
an integer variable telling which column to choose from those matrices.
Let's construct, as a working example:
mynames <- c('a', 'c')
mylist <- list(a=matrix(1:4,2,2), b=matrix(1:6,3,2), c=matrix(1:8,4,2))
myvec <- 2:4
names(myvec) <- names(mylist)
chooseCol <- 2
I'd like to construct a vector whose elements are taken, for each name appearing in mynames, from the row given by myvec and the column chooseCol of the corresponding matrix. My attempt is
sapply(mynames, function(elem){mylist[[elem]][myvec[elem], chooseCol]})
which correctly yields
a c
4 8
but I was wondering if there's a faster, base (non-tidyverse) method of doing this.
Also important or relevant: the order of the names in mylist and myvec can be different, so I can't rely on position indices.
I would use mapply -
mapply(function(x, y) x[y, chooseCol], mylist[mynames], myvec[mynames])
#a c
#4 8
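A variant with vapply, offered only as a sketch and not as a benchmarked speed-up: it looks everything up by name, so it does not depend on the order of mylist and myvec, and it checks that each result is a single integer.

```r
mynames <- c('a', 'c')
mylist <- list(a = matrix(1:4, 2, 2), b = matrix(1:6, 3, 2), c = matrix(1:8, 4, 2))
myvec <- 2:4
names(myvec) <- names(mylist)
chooseCol <- 2

# look up matrix and row by name; integer(1) declares the expected result type
res <- vapply(mynames,
              function(nm) mylist[[nm]][myvec[[nm]], chooseCol],
              integer(1))
```

Because mynames is a character vector, vapply uses it for the names of the result.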
Is it possible to store a numeric vector in the names variable of a list?
i.e.
x <- c(1.2,3.4,5.9)
alist <- list()
alist[[x]]$somevar <- 2
I know I can store it as a vector within the list element, but I thought it would be faster to move through and find the element of the list I want (or add if needed) if the numeric vector is the name of the list element itself...
EDIT:
I have included a snippet of the code in context below; apologies for the change in nomenclature. In brief, I am working on a clustering problem. The dataset is too large to do the distance calculation on directly, so my solution was to create bins for each dimension of the data and find the nearest bin for each observation in the original data. Of course, I cannot make a complete permutation matrix, since this would be larger than the original data itself. Therefore, I have opted to find the nearest bin for each dimension individually and add it to a vector, temp.bin, which ideally would become the name of the list element in which the row name of the original observation is stored. I was hoping that this would simplify searching for and adding bins to the list.
I also realise that the distance calculation part is likely wrong - this is still very much a prototype.
binlist <- list()
for (i in 1:nrow(data)) # iterate through all data points
{
  # for each data point make a container for the nearest bin found
  temp.bin <- vector(length = length(markers))
  names(temp.bin) <- markers
  for (j in markers) # and dimensions
  {
    # find the nearest bin for marker j
    if (dist == "eucl")
    {
      dists <- apply(X = bin.mat, MARGIN = 1, FUN = function(x, y) {sqrt(sum((x - y)^2))}, y = data[i, j])
      temp.bin[j] <- bin.mat[which(dists == min(dists)), j]
    }
  }
  ### I realise this part doesn't work
  binlist[[temp.bin]] <- append(binlist[[temp.bin]], values = i)
}
The closest answer so far is John Coleman's.
names(alist) is a character vector. A numeric vector is not a string, hence it isn't a valid name for a list element. What you want is thus impossible. You could create a string representation of such a list and use that as a name, but that would be cumbersome. If this is what you really wanted to do, you could do something like the following:
x <- c(1.2,3.4,5.9)
alist <- list()
alist[[paste(x,collapse = " ")]]$somevar <- 2
This will create a 1-element list whose only element has the name "1.2 3.4 5.9".
While there might be some use cases for this, I suspect that you have an XY problem. What are you trying to achieve?
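If the goal is to collect row indices per bin, the string-key idea above could be sketched as follows. The helper make_key and the toy bins list are hypothetical and stand in for temp.bin in the question:

```r
# hypothetical helper: turn a numeric vector into a single string key
make_key <- function(v) paste(v, collapse = " ")

# toy stand-in for the per-row temp.bin vectors (rows 1 and 3 share a bin)
bins <- list(c(1.2, 3.4), c(0.5, 3.4), c(1.2, 3.4))

binlist <- list()
for (i in seq_along(bins)) {
  key <- make_key(bins[[i]])
  # binlist[[key]] is NULL for a new key, so c() starts a fresh vector
  binlist[[key]] <- c(binlist[[key]], i)
}
```

Each distinct bin vector becomes one named list element holding the indices of the rows that fell into it.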
Solution
With some slight modifications we can achieve the following:
x = c(1.2,3.4,5.9)
alist = vector("list", length(x))
names(alist) = x
alist[as.character(x)] = list(c(somevar = 2))
#$`1.2`
#somevar
# 2
#
#$`3.4`
#somevar
# 2
#
#$`5.9`
#somevar
# 2
Explanation
Basically:
I had to create the list with the correct length (vector("list", length(x)))
Then assign the correct names (names(alist) = x)
So we can refer to list elements by name using [ and assign a new value to each of them (alist[as.character(x)] = list(c(somevar = 2))); note that each element is then a named numeric vector, c(somevar = 2)
2nd Solution
Going by John Coleman's comment:
It isn't clear that you answered the question. You gave a list whose
vector of names is the vector x, coerced to character. OP asked if
it was possible "if the numeric vector is the name of the list element
itself... ". They wanted to treat x as a single name, not a vector of
names.
If you wanted the list element to be named after the vector x, that is, after the variable name "x" rather than its values, you could try the deparse(substitute(.)) trick:
x = c(1.2,3.4,5.9)
alist = list()
alist[[deparse(substitute(x))]]$somevar = 2
#> alist[[deparse(substitute(x))]]
#$somevar
#[1] 2
If you really wanted the numeric values in x as the name itself, then I point you to John's solution.
I have a simple R data.frame object df. I am trying to select rows from this dataframe based on logical indexing from a column col in df.
I am coming from the Python world, where during similar operations I can choose to select using either df[df[col] == 1] or df[df.col == 1] with the same end result.
However, with the R data frame, df[df$col == 1] gives an incorrect result compared to df[df[,col] == 1] (confirmed by the summary command). I do not understand this difference, as from links like http://adv-r.had.co.nz/Subsetting.html it seems that either way is OK. Also, the str command on df$col and df[, col] shows the same output.
Are there any guidelines about when to use the $ vs the [] operator?
Edit:
Digging a little deeper and using this question as a reference, it seems that the following code works correctly:
df[which(df$col == 1), ]
However, it is not clear to me how to guard against NA and when to use which.
You have confused several things.
In
df[,col]
col can be the column number (or a character string holding the column name). For example,
col = 2
x = df[,col]
would select the second column and store it to x.
In
df$col
col should be the column name. For example,
df=data.frame(aa=1:5,bb=10:14)
x = df$bb
would select the second column and store it to x. But you cannot write df$2.
Finally,
df[[col]]
is the same as df[,col] if col is a number. If col is a character ("character" in R means the same as string in other languages), then it selects the column with this name. Example:
df=data.frame(aa=1:5,bb=10:14)
foo = "bb"
x = df[[foo]]
y = df[[2]]
z = df[["bb"]]
Now x, y, and z all contain a copy of the second column of df.
The notation foo[[bar]] is from lists. The notation foo[,bar] is from matrices. Since a data frame has features of both a matrix and a list, it can use both.
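A short sketch that puts the different forms side by side (the variable names are chosen for illustration only):

```r
df <- data.frame(aa = 1:5, bb = 10:14)
col <- "bb"

by_dollar   <- df$bb      # literal column name after $
by_double   <- df[[col]]  # list-style, name held in a variable
by_name     <- df[, col]  # matrix-style, by name
by_position <- df[, 2]    # matrix-style, by position

# single bracket without a comma treats df as a list and keeps a data frame
as_df <- df[col]
```

The first four all return the same integer vector, while df[col] returns a one-column data frame.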
Use $ when you want to select one specific column by name, e.g. df$col_name.
Use [] when you want to select one or more columns by number:
df[,1] # select column with index 1
df[,1:3] # select columns with indexes 1 to 3
df[,c(1,3:5,7)] # select columns with indexes 1, 3 to 5 and 7
[[]] is mostly for lists.
EDIT: df[which(df$col == 1), ] works because the comparison df$col == 1 produces a logical vector (TRUE, FALSE or NA for each row), and which() converts it to the integer positions of the TRUE values, silently dropping FALSE and NA. These positions are passed to df[ , ] as row indices, so only the matching rows are returned.
See "Remove rows with NAs (missing values) in data.frame" to find out more about how to deal with missing values. It is good practice to decide explicitly how missing values are handled, for example by excluding them from the dataset.
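To make the row-selection and NA points concrete, here is a minimal sketch on a made-up data frame. Note the comma: df[cond, ] selects rows, whereas df[cond] without a comma applies the index to columns.

```r
# made-up example data with one missing value in col
df <- data.frame(col = c(1, 2, NA, 1),
                 val = c("a", "b", "c", "d"),
                 stringsAsFactors = FALSE)

# a logical index containing NA produces an extra all-NA row
rows_logical <- df[df$col == 1, ]

# which() keeps only the positions that are TRUE, so the NA is dropped
rows_which <- df[which(df$col == 1), ]

# equivalent guard written out explicitly
rows_guard <- df[!is.na(df$col) & df$col == 1, ]
```

Either which() or the explicit !is.na() guard avoids the spurious NA row. Which one reads better is a matter of taste.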
I have a matrix where every row consists of zeros and a single one, say y <- rbind(c(1,0,0), c(0,1,0), c(0,1,0)), and I have a vector holding a column index for each row, say x <- c(1,2,3). Now I would like to count the number of times that y[i,x[i]] == 1 holds. I know I could do this like
count <- 0
for(i in 1:3)
count <- count + y[i, x[i]]
but I was interested if there would be a smarter way. Something like count <- sum(y[,x]). Of course this does not work, because y[,x] gives a matrix.
Therefore my question: is there a way to get a vector with the elements at the positions given by another vector by using apply or any other smart trick, i.e. without for-loops?
I've already been looking for this, but I don't really know how to call this and therefore I didn't find anything useful. Apologies if this question already hangs around somewhere...
We can index with a two-column matrix of (row, column) pairs to extract the elements y[i, x[i]] and then take the sum
sum(y[cbind(1:nrow(y), x)])
#[1] 2
If the matrix can contain values other than 0 and 1, count the matches explicitly,
sum(y[cbind(1:nrow(y), x)]==1)
Or for this case, where x is 1:3 and the selected elements therefore lie on the diagonal,
sum(diag(y)==1)
#[1] 2
Or
sum(y*diag(y))
EDIT: Changed the row/column index from cbind(x,1:ncol(y)) to cbind(1:nrow(y), x) as per the comments.
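To show that the matrix-index approach does not rely on the diagonal, here is a sketch with a different, made-up index vector x:

```r
y <- rbind(c(1, 0, 0), c(0, 1, 0), c(0, 1, 0))
x <- c(1, 3, 2)  # made-up column index for each row

# each row of idx is one (row, column) pair
idx <- cbind(seq_len(nrow(y)), x)

picked <- y[idx]          # y[1, 1], y[2, 3], y[3, 2]
count  <- sum(picked == 1)
```

Indexing a matrix with a two-column matrix returns one element per row of the index, which is exactly the vector the loop in the question builds up.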
I'm new to writing functions, and I'm sure this is a simple one. I have a data frame with 111 columns and ~10,500 rows, with all missing values coded as <NA>. Intuitively, I need a function that does the following column-wise over a dataframe:
ifelse(sum(is.na(colx)) > length(colx)/5, NULL, colx)
i.e. I need to drop any variables with more than 1/5 (20%) missing values. Thanks to all for indicating there's a similar answer, i.e. using
colMeans(is.na(mydf)) > .20
to ID the columns, but this doesn't fully answer my question.
The above code returns a logical vector indicating the variables to be dropped. I have more than 100 variables with complex names and picking through them to drop by hand is tedious and bound to introduce errors. How can I modify the above, or use some version of my original proposed ifelse, to only return a new dataframe of columns with < 20% NA, as I asked originally?
Thanks!!
One way of doing this (probably not the shortest) is to iterate over the rows of the data.frame with by and then rbind the results together into one data.frame.
Just change the condition in the if in the code below; here, rows with at least one NA value are removed.
do.call(rbind, by(your.dataset,
                  1:nrow(your.dataset),
                  FUN = function(x) {
                    if (sum(is.na(x)) == 0) {
                      return(x)
                    } else {
                      return(NULL)
                    }
                  }))
When you use lapply on a data.frame, it applies the given function to each column, because a data frame is a list of columns.
So if f is your function for "processing" a column, you should use:
lapply(df, f)
vapply should be used when the result will always be a vector of a known size.
sapply is like an automatic vapply. It tries to simplify the result to a vector. I would advise against using sapply, except for exploratory programming.
(Updated to reflect edit)
Try:
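f <- function(x) {
  # TRUE when fewer than 20% of the values in the column are missing
  sum(is.na(x)) < length(x) * 0.2
}
# keep only the columns for which f is TRUE; drop = FALSE keeps a data frame
df[, vapply(df, f, logical(1)), drop = FALSE]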
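The colMeans(is.na(...)) expression from the question can also be used directly as a column index. A minimal sketch on a made-up data frame (mydf and its columns are assumptions for illustration):

```r
# made-up data: a has 10% NA, b has 40% NA, c has none
mydf <- data.frame(a = c(NA, 1:9),
                   b = c(rep(NA, 4), 1:6),
                   c = letters[1:10],
                   stringsAsFactors = FALSE)

# share of missing values per column
na_share <- colMeans(is.na(mydf))

# keep columns with at most 20% missing; drop = FALSE keeps a data frame
kept <- mydf[, na_share <= 0.20, drop = FALSE]
```

The logical vector selects the columns by position, so there is no need to type out the names of the dropped variables.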