Filtering data, comma vs not comma - r

I have the following code
#abnormal return
exp.ret <- lm((RET-rf)~mkt.rf+smb+hml, data=tesla[tesla$period=="estimation.period",])
tesla$abn.ret <- (tesla$RET-tesla$rf)-predict(exp.ret,tesla)
#CAR during event window
CAR <- sum(tesla$abn.ret[tesla$period=="event.period",])
The first section runs fine, but the second gives this error:
Error in tesla$abn.ret[tesla$period == "event.period", ] :
incorrect number of dimensions
I know that the solution is to remove the last comma:
#CAR during event window
CAR <- sum(tesla$abn.ret[tesla$period=="event.period"])
I am wondering what the right way of understanding this is: why do I need a trailing comma in some cases but not in others when I am filtering for only part of a data frame?

$ sign, [[]] and [] have different meanings.
In short:
$ and [[]] extract a single column of a data frame or a single item of a list.
Extracting a column from a data frame gives a vector, while extracting an item from a list gives an object of whatever class that item has, which can be a data frame, another list, etc.
Note that $ does not accept a column index (only a column name), and that you cannot select two columns at once after $ or inside [[]].
[] slices a data frame or a list, selecting one or more elements.
In general, the class of the output is the same as that of the original object.
If you slice a data frame using [], the output is a data frame, and the same applies to lists. The exception is the two-index form df[i, j] with a single column j, which by default drops to a vector.
In your specific case, you used $ to extract a column, which gives a vector. You then tried to slice that vector using [ , ], but a vector has only one dimension, so the two-index form raised the error. Slice the vector with [] instead (the output is a vector); [[]] is only suitable when you want a single element (the output is a vector of length 1).
Possible ways to subset tesla as you wish:
tesla$abn.ret[tesla$period == "event.period"]
tesla[["abn.ret"]][tesla$period == "event.period"]
tesla[tesla$period == "event.period", "abn.ret"]
You would achieve the same result using tesla[["period"]] instead of tesla$period.
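To make the difference concrete, here is a minimal sketch on a toy data frame. The values are made up for illustration and only stand in for the asker's tesla data; in_event, car_vec, car_df and err are hypothetical helper names.

```r
# Toy stand-in for the asker's data; all values are made up for illustration.
tesla <- data.frame(
  period  = c("estimation.period", "event.period", "event.period"),
  abn.ret = c(0.01, 0.02, -0.03)
)
in_event <- tesla$period == "event.period"

# A column extracted with $ is a plain vector: it has no dim attribute.
dim(tesla$abn.ret)   # NULL
# The data frame itself has two dimensions (rows, columns).
dim(tesla)

# One index for the vector, two indices (row, column) for the data frame.
car_vec <- sum(tesla$abn.ret[in_event])
car_df  <- sum(tesla[in_event, "abn.ret"])

# The two-index form on the vector reproduces the error from the question;
# tryCatch captures the message instead of stopping the script.
err <- tryCatch(tesla$abn.ret[in_event, ],
                error = function(e) conditionMessage(e))
```

Both car_vec and car_df hold the same sum, because each form uses as many indices as the object it is applied to has dimensions.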
For some extra details/examples, refer to An introduction to R, published by CRAN.
I hope it helped you somehow..!

tesla$abn.ret is one-dimensional. Each comma separates a dimension, so yours implies 2 dimensions.
Alternatively you could run
tesla[tesla$period=="event.period", "abn.ret"]
And get the same results, since tesla is 2-d.

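If you look at the documentation with the command ?'[', you will find that for matrix-style indexing x[i, j] the default, drop = TRUE, drops any dimension of length one, which is why selecting a single column of a data frame returns a vector.
To disable the dropping, you have to write it explicitly, e.g. x[i, j, drop = FALSE]. Note that the logical constant is FALSE, not False, and that drop has no effect on a one-dimensional vector such as tesla$abn.ret.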
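A short sketch of what drop does, using a made-up matrix and data frame (m, df and the result names are illustrative only):

```r
# A 2 x 3 matrix; selecting one row drops to a plain vector by default.
m <- matrix(1:6, nrow = 2)
row_vec <- m[1, ]                 # no dim attribute any more
row_mat <- m[1, , drop = FALSE]   # still a 1 x 3 matrix

# The same idea for a data frame with the two-index form.
df <- data.frame(a = 1:3, b = c("x", "y", "z"))
col_vec <- df[, "a"]                 # integer vector
col_df  <- df[, "a", drop = FALSE]   # one-column data frame
```

Keeping drop = FALSE is mostly useful in code that must keep working when a selection happens to contain a single row or column.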

Related

Role of square brackets

I got this code from elsewhere and I am wondering if someone can explain what the square brackets are doing.
matrix1[i,] <- df[[1]][]
I am using this to assign values to a matrix and it works but I am not sure what exactly it's doing. What does the initial set of [[]] mean followed by another []?
This might help you understand a bit. You can copy and paste this code and see the differences between different ways of indexing using [] and $. The only thing I can't answer for you is the second empty set of square brackets, from my understanding that does nothing, unless a value is within those brackets.
#Retrieves the first column as a data frame
mtcars[1]
#Retrieves the first column values only (three different methods of doing the same thing)
mtcars[,1]
mtcars[[1]]
mtcars$mpg
#Retrieves the first row as a data frame
mtcars[1,]
#I can use a second set of brackets to get the 4th value within the first column
mtcars[[1]][4]
mtcars$mpg[4]
The general function of [ is subsetting, which is well documented both in the help (as suggested in the comments) and in this piece. The rest of my answer is heavily based on that source.
There are three operators for subsetting in R: [[, [, and $.
[ is typically used with indices and $ with names. For example, the first three elements of the vector a = 1:10 can be selected with a[c(1,2,3)]. You can also subset negatively to remove elements: a[-1] returns a without its first element.
The $ operator is different in that it only takes element names as input, e.g. if df is a data frame with a column named values, df$values extracts that column. You can achieve the same with [[ and a quoted name, df[["values"]]; the single-bracket form df["values"] returns a one-column data frame instead.
To answer more specifically, what does df[[1]][] do?
First, the [[ operator returns the first element of df (its first column), and the following empty [ operator selects every element of that output, so df[[1]][] gives the same values as df[[1]].
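The behaviour described above can be checked on a small made-up example. The data frame df, its column names, and the 2 x 3 shape of matrix1 are assumptions chosen for illustration; only the assignment line comes from the question.

```r
# Hypothetical data frame with a numeric first column.
df <- data.frame(values = c(10, 20, 30), other = c("a", "b", "c"))

col     <- df[[1]]        # the first column as a numeric vector
same    <- df[[1]][]      # empty [ ] selects every element of that vector
one_col <- df["values"]   # single bracket: a one-column data frame

# The assignment from the question: fill row i of a matrix with the column.
matrix1 <- matrix(0, nrow = 2, ncol = 3)
i <- 1
matrix1[i, ] <- df[[1]][]
```

Here the trailing [] changes nothing, so matrix1[i, ] <- df[[1]] would fill the row with the same values.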

Logical comparison of elements from named list vs named vector in R

Sorry if this is a duplicate, I read through a number of threads but couldn't really find a good explanation.
I have a dataset (dataframe) where I calculated the mean value of each column. I now want to do some logical comparisons between these values. I used lapply to get the means
means_list <- lapply(dataset_df, mean)
which outputs a named list. But when I try to compare two elements of this list, e.g.
means_list["condition1"] > means_list["condition2"]
I get an error ("comparison of these types is not implemented").
I don't get that error if I use sapply instead so that I'm working with a named vector. I can also get around the error by converting the list to a dataframe with as.data.frame first.
So, I feel like I'm doing something wrong when subsetting a named list here but I don't quite understand how. Is there a correct way to subset the list so that I can do the logical comparison? Or is this not possible with named lists?
Thanks!
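To access an element of a list by its name, use double brackets. Single brackets return a one-element list rather than the number inside it, and two lists cannot be compared with >: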
means_list[["condition1"]] > means_list[["condition2"]]
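As a minimal sketch of the difference, assuming a made-up dataset_df with two numeric columns named as in the question:

```r
# Hypothetical data; only the column names come from the question.
dataset_df <- data.frame(condition1 = c(1, 2, 3), condition2 = c(4, 5, 6))
means_list <- lapply(dataset_df, mean)

class(means_list["condition1"])     # "list": a one-element list
class(means_list[["condition1"]])   # "numeric": the element itself

# Double brackets give plain numbers, so the comparison works.
is_greater <- means_list[["condition1"]] > means_list[["condition2"]]

# With sapply the result is a named numeric vector, where [ ] is enough.
means_vec <- sapply(dataset_df, mean)
is_greater_vec <- means_vec["condition1"] > means_vec["condition2"]
```

This also explains why the sapply version in the question worked: single brackets on an atomic vector return an atomic vector, which > can compare.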

How can I access data in a nested R list?

I want to learn how to access data from a nested list in R. I am relatively new to the R programming language, so I am unsure how to proceed.
The data is a 'large list' (947 elements, 654.9 MB) and takes the form:
The numbers within the datalist refer to station numbers and when I click on one (in Rstudio) it looks like this:
I want to know how I can access the data within 'doy' for example. I have tried:
data[[1]]
which returns all the data for the first element of the list (site, location, doy,ltm etc). So clearly the number used within the square brackets is interpreted as an index for the list, as opposed to an identifier for the elements/station in the list.
Then I tried:
data$1
but it returned the error:
Error: unexpected numeric constant in "data$1"
Then I tried:
data[data$1==doy]
But was returned this:
Error: unexpected numeric constant in "data[data$1"
So at this point, I realise that it is not construing the number of the station as a category/factor within the list. It's just reading it as a number. So I thought I'd put some quotes around it to see if that changed what happened:
data[data$"1"=="doy"]
This returned
named list()
But when I looked at it in the environment, it was a list of 0.
I looked at some of the similar question here on Stack (like: accessing nested lists in R) and tried:
data[data$"1"=="doy",][[1]]
But just got:
Error in data[data$"1" == "doy", ] : incorrect number of dimensions
How can I access this data? It reminds me of a structure in Matlab, but it doesn't seem to be indexed in a similar fashion in R.
Let's look at some ways to do what you want:
data[[1]]
This returns the first element of the list, which is itself a list. You can use the $ subsetting shorthand, but the name of the first element is nonstandard. R prefers names that start with letters and include only alphanumeric characters, periods and underscores. You can escape this behavior with backticks:
data$`1`
If you want to access one of the elements of list 1 in your list of lists, you need to subset further. To get to doy, which is the third element of 1, you can use any of these four ways:
data[[1]][[3]]
data$`1`[[3]]
data[[1]]$doy
data$`1`$doy
One way (in addition to what Ben Norris has shown):
our_list[[c("1", "doy")]]
Reproducible example data (please provide next time)
our_list <- list(`1` = list(site = "x", doy = 3))
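The access patterns from both answers can be tried side by side on a slightly larger stand-in. The station names "1" and "7" and all values below are invented for illustration; the second station is there only to show that position and name are different things.

```r
# Hypothetical list of stations, each itself a list; values are made up.
data <- list(
  `1` = list(site = "x", location = "y", doy = c(1, 2, 3)),
  `7` = list(site = "z", location = "w", doy = c(4, 5, 6))
)

by_position  <- data[[1]][[3]]          # first station, third element
by_backtick  <- data$`1`$doy            # non-standard name needs backticks
by_name      <- data[["1"]][["doy"]]    # quoted names inside [[ ]]
by_recursive <- data[[c("1", "doy")]]   # a vector inside [[ ]] indexes recursively

# A number inside [[ ]] is a position, a string is a name.
second_station <- data[[2]]$doy   # same element as data[["7"]]$doy
```

Using the quoted name, as in data[["7"]], is the safer choice when station numbers do not match their position in the list.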

How can I make a list of data frames which have the same values in the first column?

Say I have multiple data frames, and I want to make multiple lists of the data frames that share the same value in the first row. For example, dfs 1-4 have "abc" in all columns of the first row, dfs 5-7 have "def" in all columns of the first row, etc. How can I write a script which puts (in this case) dfs 1-4 in a list called "abc", dfs 5-7 in a list called "def"?
This is my first question, so please let me know if there is anything else I could provide. I researched for a few days with no luck :(
Thanks!
Jack
So this is a guide to the solution, as you asked.
First make sure you have your list of data frames called l (all(sapply(l, is.data.frame)) should be TRUE).
Then, for each element (df) of this list, you need to get the string in the first row (in any column, for example the first one). This will give you a character vector, and you can get it by using either sapply or purrr::map_chr.
After that comes the split itself. Call split with the vector of indices as the first argument (see ?seq_along) and the character vector you have just computed as the second argument.
Finally, use lapply to transform this list of indices into a list of lists of data frames (you need to know the [ accessor for a list).
If you need more guidance, don't hesitate to ask.
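The steps above can be sketched as follows. The three data frames and the names keys, idx and grouped are made up for illustration; this is one possible reading of the guide, not the only way to do it.

```r
# Hypothetical input: a list of data frames whose first row holds a label.
l <- list(
  data.frame(a = "abc", b = "abc"),
  data.frame(a = "abc", b = "abc"),
  data.frame(a = "def", b = "def")
)
stopifnot(all(sapply(l, is.data.frame)))

# Step 1: the string in the first row, first column of each data frame.
keys <- sapply(l, function(df) as.character(df[1, 1]))

# Step 2: split the positions 1..length(l) by those strings.
idx <- split(seq_along(l), keys)

# Step 3: turn each group of positions back into a list of data frames.
grouped <- lapply(idx, function(i) l[i])
```

The result is a named list, so grouped$abc holds the first two data frames and grouped$def the third.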

My data is stored as a matrix and as a list at the same time?

I am using the tabular() function to produce tables in r (tables library).
I want to compute CIs from the data in the output (let mytable be the output from tabular()). Simple enough I thought, except when I go to call a value from the matrix, I get the error Error in mytable[1, i] - 1 : non-numeric argument to binary operator. I thought this was odd, as when I call up a particular cell of the matrix (where is.matrix returned TRUE for mytable), for example mytable[1, i] for some i, I get an integer. I then ran is.list on mytable and got TRUE also, so I am not sure what this means. I guess the tabular() function stores the results as a special kind of matrix.
I am only trying to pull out the mean,sdev, and n, which I am able to just by typing the cell location, for example mytable[1, i] would return an 86. However, when I try to call up the value in qt(.975,df=(mytable[1,i]-1)) for example, I get the error above. Not sure really how to approach this except to manually enter the values into another matrix (which I would like to avoid). Or, if I can compute CI's directly in the tabular() function that would work also. Cheers.
I shall quote for you the Value section of the documentation on the function ?tabular:
An object of S3 class "tabular". This is a matrix of mode list, whose
entries are computed summary values, with the following attributes:
rowLabels - A matrix of labels for the rows. This will have the same
number of rows as the main matrix, but may have multiple columns for
different nested levels of labels. If a label covers multiple rows, it
is entered in the first row, and NA is used to fill following rows.
colLabels - Like rowLabels, but labelling the columns.
table - The original table expression being displayed. A list of the
original format specifications are attached as a "fmtlist" attribute.
formats - A matrix of the same shape as the main result, containing NA
for default formatting, or an index into the format list.
As the documentation says, the matrix is of mode list, so each cell holds a list element rather than a bare number. If your tabular object is called tab, type tab[1,1] and you should see a list containing one of your table values. If I wanted to modify that value, I would probably do something like:
tab[1,1]$term <- value
just like you would modify values in any other list.
Type attributes(tab) and you'll see the items listed above, containing a lot of the formatting information and row/col headers.
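The "matrix of mode list" idea can be reproduced in base R without the tables package. The sketch below uses a plain list matrix with made-up values as a stand-in; whether [[ behaves identically when applied directly to a tabular object is not confirmed here.

```r
# A plain base R matrix whose cells are list elements (values are made up).
m <- matrix(list(86, 12.5, 40, 9.1), nrow = 2)

is.matrix(m)   # TRUE
is.list(m)     # TRUE: it is a matrix and a list at the same time

cell  <- m[1, 1]    # single bracket: still a list, of length one
value <- m[[1, 1]]  # double bracket: the number stored in the cell

# Arithmetic on the single-bracket result fails, as in the question.
err <- tryCatch(m[1, 1] - 1, error = function(e) conditionMessage(e))

# With the extracted number, the t quantile for the CI can be computed.
t_crit <- qt(0.975, df = m[[1, 1]] - 1)
```

The printed cell looks like a number in both cases, which is why the error is surprising; only the double-bracket form actually returns one.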
