Role of square brackets - r

I got this code from elsewhere and I am wondering if someone can explain what the square brackets are doing.
matrix1[i,] <- df[[1]][]
I am using this to assign values to a matrix and it works but I am not sure what exactly it's doing. What does the initial set of [[]] mean followed by another []?

This might help you understand a bit. You can copy and paste this code to see the differences between the ways of indexing with [], [[]] and $. The one thing I can't explain is the second, empty set of square brackets; as far as I understand, it does nothing unless a value is placed inside those brackets.
#Retrieves the first column as a data frame
mtcars[1]
#Retrieves the first column values only (three different methods of doing the same thing)
mtcars[,1]
mtcars[[1]]
mtcars$mpg
#Retrieves the first row as a data frame
mtcars[1,]
#I can use a second set of brackets to get the 4th value within the first column
mtcars[[1]][4]
mtcars$mpg[4]

The general function of [ is subsetting, which is well documented both in the help pages (as suggested in the comments) and in this piece. The rest of my answer is heavily based on that source.
In fact, there are three operators for subsetting in R: [[, [, and $.
The [ and $ operators subset by index and by name, respectively. For example, the first three elements of the vector a = 1:10 can be extracted with a[c(1,2,3)]. You can also subset negatively to drop elements: a[-1] returns a without its first element.
The $ operator is different in that it only takes element names as input, e.g. if your df was a data frame with a column named values, df$values would extract that column. You can do the same with [[, but only with a quoted name such as df[["values"]]; df["values"] with single brackets returns a one-column data frame instead.
To answer more specifically, what does df[[1]][] do?
First, the [[-operator will return the 1st element from df, and the following empty [-operator will pull everything from that output.
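A minimal sketch of that last point, using a small made-up data frame (df, matrix1 and i here are illustrative stand-ins, not the asker's objects). It suggests that the trailing empty [] returns the column unchanged, so the assignment behaves the same with or without it:

```r
# Hypothetical data: a data frame with three rows
df <- data.frame(a = 1:3, b = c(10, 20, 30))

# [[1]] extracts the first column as a plain vector
first_col <- df[[1]]

# An empty [] selects every element, so nothing changes
same <- identical(df[[1]][], df[[1]])

# Fill row i of a matrix with the values of that column
matrix1 <- matrix(0, nrow = 2, ncol = 3)
i <- 1
matrix1[i, ] <- df[[1]][]
```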

Related

Replacing df values in a column with values from another df via key

I need to replace the values in the Nth column of my df, call them v1s, with values from another df, call them v2s. There is a dictionary, or rather two dictionaries: the first translates v1s into numbers, and the second translates the numbers into v2s. I tried merge(), left_join()/right_join(), and a few other things, but nothing seems to work. Can somebody help, please?
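Merging the datasets should work, as long as the key columns in both data frames hold matching values of the same type. Keep adjusting the code until it works.
Otherwise, if the rows of both datasets are already in the same order, you can simply add an extra column to your dataset with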
datasetA$newvar <- datasetB$v2s
Once you have correctly added the second variable, simply drop the first.
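When the rows are not aligned, a key-based lookup is needed. Here is a rough sketch using match() from base R. The names dict1, dict2, num and the sample values are assumptions made up for illustration, since the question does not show its data:

```r
# Hypothetical data: v1 is the column whose values should be replaced
df <- data.frame(id = 1:4, v1 = c("a", "b", "a", "c"))

# Hypothetical dictionaries: v1 -> number, then number -> v2
dict1 <- data.frame(v1 = c("a", "b", "c"), num = c(1, 2, 3))
dict2 <- data.frame(num = c(1, 2, 3), v2 = c("alpha", "beta", "gamma"))

# Step 1: translate each v1 into its number
num <- dict1$num[match(df$v1, dict1$v1)]

# Step 2: translate each number into its v2 and overwrite the column
df$v1 <- dict2$v2[match(num, dict2$num)]
```

Values with no entry in either dictionary come out as NA, which makes missing keys easy to spot.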

Filtering data, comma vs not comma

I have the following code
#abnormal return
exp.ret <- lm((RET-rf)~mkt.rf+smb+hml, data=tesla[tesla$period=="estimation.period",])
tesla$abn.ret <- (tesla$RET-tesla$rf)-predict(exp.ret,tesla)
#CAR during event window
CAR <- sum(tesla$abn.ret[tesla$period=="event.period",])
The first section runs fine, but the second gives this error:
Error in tesla$abn.ret[tesla$period == "event.period", ] :
incorrect number of dimensions
I know that the solution is to remove the last comma:
#CAR during event window
CAR <- sum(tesla$abn.ret[tesla$period=="event.period"])
I am just wondering what the right pedagogical way of understanding this is: why do I need a trailing comma in some cases but not in others when I filter for only part of the data frame?
$, [[]] and [] have different meanings.
In short:
$ and [[]] extract one column of a data frame or one item of a list.
Extracting from a data frame gives a vector, while extracting from a list gives an object of the same class as the extracted item, which can be a data frame, another list, etc.
It's important to note that $ doesn't accept a column index (only a column name), and that you cannot put two column names or indices after $ or inside [[]].
[] slices a data frame or a list, selecting one or more elements.
The class of the output will be the same as that of the original object.
If you slice a data frame using [] with a single index, the output will be a data frame; the same applies to lists, etc.
In your specific case, you used $ to extract your variable. Then you tried to slice that output using [ , ], but the output is a vector, and a vector has only one dimension, so an error was raised. You should slice your vector using [] (the output will be a vector) or [[]] (the output will be a vector of length 1).
Possible ways to subset tesla as you wish:
tesla$abn.ret[tesla$period == "event.period"]
tesla[["abn.ret"]][tesla$period == "event.period"]
tesla[tesla$period == "event.period", "abn.ret"]
You would achieve the same result using tesla[["period"]] instead of tesla$period.
For some extra details/examples, refer to An introduction to R, published by CRAN.
I hope it helped you somehow..!
tesla$abn.ret is one-dimensional. Each comma separates a dimension, so yours implies 2 dimensions.
Alternatively you could run
tesla[tesla$period=="event.period", "abn.ret"]
And get the same results, since tesla is 2-d.
If you look at the documentation with the command ?'[', you will find that the default behaviour of x[i, j] on a matrix or data frame is to drop any dimension of extent one, so selecting a single column returns a plain vector.
If you want to disable the dropping of the dimension, you have to write x[i, j, drop = FALSE] explicitly.
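The comma rule can be checked on the built-in mtcars data, used here as a stand-in for tesla (the filter on cyl is an arbitrary choice for illustration). The sketch compares filtering the extracted vector with filtering the data frame, with and without drop = FALSE:

```r
keep <- mtcars$cyl == 4

# One-dimensional: a vector takes a single index, no comma
a <- mtcars$mpg[keep]

# Two-dimensional: rows before the comma, columns after it
b <- mtcars[keep, "mpg"]

# drop = FALSE keeps the result as a one-column data frame
d <- mtcars[keep, "mpg", drop = FALSE]

# A comma on a vector asks for a second dimension that does not exist
err <- tryCatch(mtcars$mpg[keep, ], error = function(e) conditionMessage(e))
```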

How to run process through first row only in R?

I have a line of code that uses the data.table package to look through all the rows and check whether any cell contains the word "Margin".
Census_Bureau_Data<-Filter(function(Census_Bureau_Data) !any(Census_Bureau_Data %like% "Margin"), Census_Bureau_Data)
The code works perfectly and removes every column that has a cell containing the word Margin. Although I got the result I wanted, I want my script to limit the check to the first row. This is in case the word Margin happens to appear somewhere outside of the first row in the future, since I wouldn't necessarily want the whole column deleted because of that. I only care about the first column.
Census_Bureau_Data<-Filter(function(Census_Bureau_Data) !any(Census_Bureau_Data[1,] %like% "Margin"), Census_Bureau_Data)
So I tried this instead. Note the bracket I added, [1,]. I thought this would be enough, and it should be simple. How can I keep the same call but have it check only the first row?
Two comments:
I think it's a little confusing (though not an error) to have the anonymous function's argument named the same as the external object itself, so for brevity I'll use function(xyz) ... here.
Realize that in that function, xyz is a vector of data, not a frame of data, so [,1] or [1,] are meaningless.
Since you're only looking at the first row's worth of values, you don't need any, just [1].
I think this is what you need:
Filter(
function(xyz) !(xyz[1] %like% "Margin"),
Census_Bureau_Data
)
However, while the use of Filter is not wrong, I think this can be simplified a little:
# data.table
Census_Bureau_Data[, !Census_Bureau_Data[1,,drop=TRUE] %like% "Margin", with = FALSE ]
# data.frame or tbl_df
Census_Bureau_Data[, !Census_Bureau_Data[1,,drop=TRUE] %like% "Margin" ]
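To see the Filter approach on something runnable, here is a small sketch with a made-up data frame (census and its values are hypothetical). It uses base R's grepl() in place of data.table's %like% so that it runs without extra packages; the idea is the same, because Filter hands each column to the function as a vector:

```r
# Hypothetical data: "Margin" appears in row 1 of column b
# and only in row 2 of column c
census <- data.frame(
  a = c("Estimate", "1"),
  b = c("Margin of Error", "2"),
  c = c("Estimate", "Margin"),
  stringsAsFactors = FALSE
)

# Each column arrives as a vector, so [1] is its first-row value
kept <- Filter(function(xyz) !grepl("Margin", xyz[1], fixed = TRUE), census)

kept_names <- names(kept)
```

Column b is dropped because its first row matches, while column c survives even though a later row contains the word.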
It seems that I found this to work.
Census_Bureau_Data<-Filter(function(Census_Bureau_Data) !(Census_Bureau_Data[[1]] %like% "Margin"), Census_Bureau_Data)
I removed "any", as the comments suggested, and added a double bracket, [[1]]. I also ran tests: I added the word "margin" in column 5, row 5.
When I ran my original code, the column containing the word margin in row 5 was deleted. When I ran the code I have here, the script checked only row 1 and kept that column.

Subsetting list containing multiple classes by same index/vector

I need to subset a list that contains an array as well as a factor variable. Essentially, imagine that each component of the array corresponds to a single individual, who is in turn associated with a two-level factor variable (treatment).
list(array=array(rnorm(2,4,1),c(5,5,10)), treatment= rep(c(1,2),5))
Typically when sub-setting multiple components of the array from the first component of the list I would use something like
list$array[,,c(2,4,6)]
This would return the array components at locations 2, 4 and 6. However, this wouldn't work for the factor component of the list, because it is subset differently. What you would need is this:
list$treatment[c(2,4,6)]
I need to subset a list containing different classes (an array and a vector) by the same relative index.
You're treating your list of matrices as some kind of 3-dimensional object, but it's not.
Your list$matrices is itself a list, which means you can index it as a list too; it doesn't matter whether it is a list of matrices, numerics, plot objects, or whatever.
The data you provided as an example can just be indexed at one level, so list$matrices[c(2,4,6)] works fine.
And I don't really get your question about saving the indices in a numeric vector, what's to stop you from this code?
indices <- c(2,4,6)
mysubset <- list(list$matrices[indices], list$treatment[indices])
EDIT, adding new info for edited question:
I see you actually have a 3-D array now, which is kind of odd, as there is no clear convention for what counts as a "component". I mean, from your question I understand that list$array[,,n] refers to the n-th individual, but from a pure code point of view there is no reason why something like list$array[n,,] couldn't refer to that.
Maybe you got the idea from other languages, but this is not really R-ish, your earlier example with a list of matrices made more sense to me. And I think the most logical would have been a data.frame with columns matrix and treatment (which is conceptually close to a list with a vector and a list of matrices, but it's clearer to others what you have).
But anyway, what is your desired output?
If it's just subsetting: with this structure, as there are no constraints on what could have been the content, you just have to tell R exactly what you want. There is no one operator that takes a subset of a vector and the 3rd index of an array at the same time. You're going to have to tell R that you want 3rd index to use for subsetting, and that you want to use the same index for subsetting a vector. Which is basically just the code you already have:
idx <- c(2,4,6)
output <- list(list$array[,,idx], list$treatment[idx])
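A runnable version of the same idea, as a sketch. The object is called dat rather than list here (an assumption made to avoid masking the base function list()), and drop = FALSE is added so the result stays a 3-D array even if idx has length one:

```r
set.seed(1)

# Same shape as in the question: ten 5x5 slices, one per individual
dat <- list(
  array     = array(rnorm(250, mean = 4, sd = 1), c(5, 5, 10)),
  treatment = rep(c(1, 2), 5)
)

idx <- c(2, 4, 6)

# The same index is applied to the 3rd array dimension and to the vector
output <- list(
  array     = dat$array[, , idx, drop = FALSE],
  treatment = dat$treatment[idx]
)
```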
The way you subset the matrices actually gives an error, because you supply an extra dimension even though you have already specified which sublist you are in. To subset the matrices for the given indices, you can use my_list[[1]][indices] or, directly, my_list$matrices[indices]. The same goes for treatment: my_list[[2]][indices] or my_list$treatment[indices].

How can I make a list of data frames which have the same values in the first column?

Say I have multiple data frames, and I want to make multiple lists of the data frames that share the same first row. For example, dfs 1-4 have "abc" in all columns of the first row, dfs 5-7 have "def" in all columns of the first row, etc. How can I write a script which puts (in this case) dfs 1-4 in a list called "abc" and dfs 5-7 in a list called "def"?
This is my first question, so please let me know if there is anything else I could provide. I researched for a few days with no luck :(
Thanks!
Jack
So this is a guide to the solution, as you asked.
First make sure you have your list of data frames called l (all(sapply(l, is.data.frame)) should be TRUE).
Then, for each element (df) of this list, you need to get the character (string) in the first row (in any column, for example the first one). This will give you a vector of characters and you can get it by using either sapply or purrr::map_chr.
After that comes the split you want to do. Use split, with the vector of indices (see ?seq_along) as the first argument and the vector of characters you have just computed as the second.
Finally, use lapply to transform this list of indices in a list of data frames (you need to know the [ accessor for a list).
If you need more guidance, don't hesitate to ask.
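The steps above can be sketched as follows, with three tiny made-up data frames standing in for the real ones (the names l, keys, idx_groups and grouped are illustrative assumptions):

```r
# Hypothetical input: a named list of data frames
l <- list(
  df1 = data.frame(x = c("abc", "1"), y = c("abc", "2")),
  df2 = data.frame(x = c("abc", "3"), y = c("abc", "4")),
  df3 = data.frame(x = c("def", "5"), y = c("def", "6"))
)

# Step 1: check that every element is a data frame
all_dfs <- all(sapply(l, is.data.frame))

# Step 2: take the string in the first row, first column of each one
keys <- sapply(l, function(df) as.character(df[1, 1]))

# Step 3: split the positions 1..n by that string
idx_groups <- split(seq_along(l), keys)

# Step 4: turn each group of positions back into a list of data frames
grouped <- lapply(idx_groups, function(i) l[i])
```

Here grouped$abc holds df1 and df2, and grouped$def holds df3.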

Resources