What is the difference between dataset[,'column'] and dataset$column in R?

If I want to list all rows of a column in a dataset in R, I am able to do it in these two ways:
> dataset[,'column']
> dataset$column
It appears that both give me the same result. What is the difference?

In practice, not much, as long as dataset is a data frame. The main difference is that the dataset[, "column"] form evaluates its argument, so you can pass a variable: after j <- "column", dataset[, j] returns that column, whereas dataset$j looks for a column literally named j, which is not what you want.
dataset$column is list syntax and dataset[ , "column"] is matrix syntax. Data frames are really lists, where each list element is a column and every element has the same length. This is why length(dataset) returns the number of columns. Because they are "rectangular," we are able to treat them like matrices, and R kindly allows us to use matrix syntax on data frames.
Note that, for lists, list$item and list[["item"]] are almost synonymous. Again, the biggest difference is that the latter form evaluates its argument, whereas the former does not. This is true even in the form `$`(list, item), which is exactly equivalent to list$item. In Hadley Wickham's terminology, $ uses "non-standard evaluation."
Also, as mentioned in the comments, $ always uses partial name matching, [[ does not by default (but has the option to use partial matching), and [ does not allow it at all.
I recently answered a similar question with some additional details that might interest you.
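The points above about evaluation and partial matching can be checked with a small sketch. The data frame and list here are toy objects made up for illustration, not the asker's data:

```r
# Toy data frame; the column names are invented for this sketch
dataset <- data.frame(column = 1:3, other = c("a", "b", "c"))

j <- "column"
by_matrix <- dataset[, j]   # j is evaluated: returns the column named "column"
by_list   <- dataset[[j]]   # j is evaluated here as well
by_dollar <- dataset$j      # looks for a column literally named "j": NULL

# Partial matching on a plain list
lst <- list(column = 1:3)
partial_dollar <- lst$col                      # $ matches "col" to "column"
partial_exact  <- lst[["col"]]                 # [[ matches exactly by default: NULL
partial_opt    <- lst[["col", exact = FALSE]]  # [[ with partial matching switched on
```

Only the $ form fails silently when given a variable, which is why [[ or [, ] is usually the safer choice inside functions and loops.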

Use the str command to see the difference:
> mydf
  user_id Gender Age
1       1      F  13
2       2      M  17
3       3      F  13
4       4      F  12
5       5      F  14
6       6      M  16
>
> str(mydf)
'data.frame': 6 obs. of 3 variables:
$ user_id: int 1 2 3 4 5 6
$ Gender : Factor w/ 2 levels "F","M": 1 2 1 1 1 2
$ Age : int 13 17 13 12 14 16
>
> str(mydf[1])
'data.frame': 6 obs. of 1 variable:
$ user_id: int 1 2 3 4 5 6
>
> str(mydf[,1])
int [1:6] 1 2 3 4 5 6
>
> str(mydf[,'user_id'])
int [1:6] 1 2 3 4 5 6
> str(mydf$user_id)
int [1:6] 1 2 3 4 5 6
>
> str(mydf[[1]])
int [1:6] 1 2 3 4 5 6
>
> str(mydf[['user_id']])
int [1:6] 1 2 3 4 5 6
mydf[1] is a data frame, while mydf[,1], mydf[,'user_id'], mydf$user_id, mydf[[1]] and mydf[['user_id']] are all vectors.
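As a minimal sketch, the same distinction can be checked programmatically. The data frame below is rebuilt from the printed output above, and the drop = FALSE line is an addition that is not in the original answer: it is the standard way to keep the data frame structure while using matrix syntax.

```r
# mydf rebuilt from the printed output above
mydf <- data.frame(
  user_id = 1:6,
  Gender  = factor(c("F", "M", "F", "F", "F", "M")),
  Age     = c(13L, 17L, 13L, 12L, 14L, 16L)
)

list_style   <- is.data.frame(mydf[1])                 # single bracket, one index: data frame
matrix_style <- is.data.frame(mydf[, 1])               # matrix syntax drops to a vector
kept         <- is.data.frame(mydf[, 1, drop = FALSE]) # drop = FALSE keeps the data frame
same_vector  <- identical(mydf[, 1], mydf$user_id)     # both forms give the same vector
```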

Related

R anesrake issue with list names non-binary argument

I am using anesrake to weight some survey data, but am getting a "non-numeric argument to binary operator" error. The error only occurs after I have added the names to the list to use as targets:
gender1<-c(0.516166000986901,0.483833999013099)
age<-c(0.15828262425613,0.364861110549873,0.429947760183493,0.0469085050104993)
mylist<-list(gender1,age)
names(mylist)<-c("gender1","age")
result<-anesrake(mylist,france,caseid=france$caseid, iterate=TRUE)
Error in x + weights : non-numeric argument to binary operator
In addition: Warning message:
In anesrake(targets, france, caseid = france$caseid, iterate = TRUE) :
Targets for age do not sum to 100%. Adjusting values to total 100%
This also says that the targets for age don't add up to 100%, which they do, so I am also not sure what that's about. If I leave out the names(mylist) line, I get the following error instead, presumably because R doesn't know which variables to use, but not the binary-operator error:
Error in selecthighestpcts(discrep1, inputter, pctlim) :
No variables are off by more than 5 percent using the method you have chosen, either weighting is unnecessary or a smaller pre-raking limit should be chosen.
The variables in the data frame are called the same as the targets in the list, and are numeric:
> str(france)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 993 obs. of 5 variables:
$ Gender :Classes 'labelled', 'numeric' atomic [1:993] 2 2 2 2 2 2 2 2 2 2 ...
.. ..- attr(*, "label")= chr "Gender"
$ Age2 : num 2 3 2 2 2 2 2 1 2 3 ...
$ gender1: num 2 2 2 2 2 2 2 2 2 2 ...
$ caseid : int 1 2 3 4 5 6 7 8 9 10 ...
$ age : num 2 3 2 2 2 2 2 1 2 3 ...
I have also tried converting gender1 and age to factor variables (as the numbers represent levels of each variable - gender has 2, age has 4), but with the same result. I have used anesrake before successfully, so there must be something I am missing, but cannot see it! Any help greatly appreciated....

How can I store a value in a name?

I use the neotoma package, where I get data from a geographical site that is identified by an ID. What I want to do is "store" the number in a name, like Sitenum, so that I only need to write down the ID once and can then reuse it.
What I did:
Site<-get_download(20131, verbose = TRUE)
taxa<-as.vector(Site$'20131'$taxon.list$taxon.name)
What I want to do:
Sitenum <-20131
Site<-get_download(Sitenum, verbose = TRUE) # this obv. works
taxa<-as.vector(Site$Sitenum$taxon.list$taxon.name) # this doesn't work
The structure of the dataset:
str(Site)
List of 1
$ 20131:List of 6
..$ taxon.list :'data.frame': 84 obs. of 6 variables:
.. ..$ taxon.name : Factor w/ 84 levels "Alnus","Amaranthaceae",..: 1 2 3 4 5 6 7 8 9 10 ...

I constructed an object that mimics yours as follows:
Site <- list("2043"=list(other=data.frame(that=1:10)))
Note that the structure is essentially identical.
str(Site)
List of 1
$ 2043:List of 1
..$ other:'data.frame': 10 obs. of 1 variable:
.. ..$ that: int [1:10] 1 2 3 4 5 6 7 8 9 10
Now, I save the value of the first term:
temp <- 2043
Then use the code in my comment to access the inner vector:
Site[[as.character(temp)]]$other$that
[1] 1 2 3 4 5 6 7 8 9 10
I could also use recursive referencing like this
Site[[c(temp,"other", "that")]]
[1] 1 2 3 4 5 6 7 8 9 10
because c will coerce temp to be a character vector in the presence of "other" and "that" character vectors.
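Applied to the names used in the question, the approach might look like the sketch below. The Site object is a small hand-built stand-in for what get_download() returns (only the nesting is mimicked), so treat the structure as an assumption:

```r
# Hand-built stand-in for the downloaded object; only the nesting is mimicked
Site <- list("20131" = list(taxon.list = data.frame(
  taxon.name = c("Alnus", "Amaranthaceae")
)))

Sitenum <- 20131

# [[ evaluates its argument, so the stored ID can be used once it is a string.
# Site[[Sitenum]] without as.character() would index by position and fail.
taxa <- as.vector(Site[[as.character(Sitenum)]]$taxon.list$taxon.name)
```

Site$Sitenum fails because $ does not evaluate Sitenum; it looks for an element literally named "Sitenum".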

Grouping error with lmer

I have a data frame with the following structure:
> t <- read.csv("combinedData.csv")[,1:7]
> str(t)
'data.frame': 699 obs. of 7 variables:
$ Awns : int 0 0 0 0 0 0 0 0 1 0 ...
$ Funnel : Factor w/ 213 levels "MEL001","MEL002",..: 1 1 2 2 2 3 4 4 4 4 ...
$ Plant : int 1 2 1 3 8 1 1 2 3 5 ...
$ Line : Factor w/ 8 levels "a","b","c","cA",..: 2 2 1 1 1 3 1 1 1 1 ...
$ X : int 1 2 3 4 7 8 9 10 11 12 ...
$ ID : Factor w/ 699 levels "MEL_001-1b","MEL_001-2b",..: 1 2 3 4 5 6 7 8 9 10 ...
$ BobWhite_c10082_241: int 2 2 2 2 2 2 0 2 2 0 ...
I want to construct a mixed effect model. I know in my data frame that the random effect I want to include (Funnel) is a factor, but it does not work:
> lmer(t$Awns ~ (1|t$Funnel) + t$BobWhite_c10082_241)
Error: couldn't evaluate grouping factor t$Funnel within model frame: try adding grouping factor to data frame explicitly if possible
In fact this happens whatever I want to include as a random effect e.g. Plant:
> lmer(t$Awns ~ (1|t$Plant) + t$BobWhite_c10082_241)
Error: couldn't evaluate grouping factor t$Plant within model frame: try adding grouping factor to data frame explicitly if possible
Why is R giving me this error? The only other answer my Google-fu could find is that the random effect fed in wasn't a factor in the data frame. But as str shows, t$Funnel certainly is.

It is actually not so easy to provide a convenient syntax for modeling functions and at the same time have a robust implementation. Most package authors assume that you use the data parameter, and even then scoping issues can occur. Thus, strange things can happen if you specify variables with DF$col syntax, since package authors rarely spend much effort making this work correctly and don't include many unit tests for it.
It is therefore strongly recommended to use the data parameter if the model function offers a formula method. Strange things can happen if you don't follow that practice (also with other model functions like lm).
In your example:
lmer(Awns ~ (1|Funnel) + BobWhite_c10082_241, data = t)
This not only works, but is also more convenient to write.
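One of the "strange things" can be seen with plain lm, without needing lme4. This is a sketch on the built-in mtcars data, chosen here purely for illustration: both fits give the same coefficients, but prediction on new data only behaves sensibly for the fit that used the data parameter.

```r
# Same model, specified two ways, on the built-in mtcars data
fit_data   <- lm(mpg ~ wt, data = mtcars)   # recommended form
fit_dollar <- lm(mtcars$mpg ~ mtcars$wt)    # DF$col form

new <- data.frame(wt = 3)

# The data= fit predicts for the single new row
p_data <- predict(fit_data, newdata = new)

# The $ fit cannot find a variable called "mtcars$wt" in newdata, so it falls
# back to the original data (with a warning) and returns one value per original row
p_dollar <- suppressWarnings(predict(fit_dollar, newdata = new))

n_data   <- length(p_data)
n_dollar <- length(p_dollar)
```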

tapply function complains that args are unequal length yet they appear to match

Here is the failing call, error messages and some displays to show the lengths in question:
it <- tapply(molten, c(molten$Activity, molten$Subject, molten$variable), mean)
# Error in tapply(molten, c(molten$Activity, molten$Subject, molten$variable), :
# arguments must have same length
length(molten$Activity)
# [1] 679734
length(molten$Subject)
# [1] 679734
length(molten$variable)
# [1] 679734
dim(molten)
# [1] 679734 4
str(molten)
# 'data.frame': 679734 obs. of 4 variables:
# $ Activity: Factor w/ 6 levels "WALKING","WALKING_UPSTAIRS",..: 5 5 5 5 5 5 5 5 5 5 ...
# $ Subject : Factor w/ 30 levels "1","2","3","4",..: 2 2 2 2 2 2 2 2 2 2 ...
# $ variable: Factor w/ 66 levels "tBodyAcc-mean()-X",..: 1 1 1 1 1 1 1 1 1 1 ...
# $ value : num 0.257 0.286 0.275 0.27 0.275 ...

If you have a look at ?tapply you will see that X should be "an atomic object, typically a vector". You feed tapply with a data frame ("molten"), which is not an atomic object. See is.atomic, and try is.atomic(molten). Furthermore, your grouping variables should be provided as a list (see INDEX argument).
Something like this works:
tapply(X = warpbreaks$breaks, INDEX = list(warpbreaks$wool, warpbreaks$tension), mean)
# L M H
# A 44.55556 24.00000 24.55556
# B 28.22222 28.77778 18.77778

You need to have a single object for INDEX, but c() strings the three factors together into one long vector, which is the source of the error, so use a list:
it <- tapply(molten$value, list(Act=molten$Activity, sub=molten$Subject, var=molten$variable), mean)
Better would be:
it <- with(molten , tapply(value, list(Act=Activity, Sub=Subject, var=variable), mean) )
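To see why c() triggers the length error, here is a small sketch on the built-in warpbreaks data (used instead of molten, which is not available here):

```r
# c() concatenates the two grouping factors end to end
idx_c <- c(warpbreaks$wool, warpbreaks$tension)
n_values <- length(warpbreaks$breaks)   # one value per row
n_index  <- length(idx_c)               # twice as long as the data

# tapply() therefore refuses the concatenated index
msg <- tryCatch(
  tapply(warpbreaks$breaks, idx_c, mean),
  error = function(e) conditionMessage(e)
)

# A list keeps the factors separate, giving one table dimension per factor
res <- tapply(
  warpbreaks$breaks,
  list(wool = warpbreaks$wool, tension = warpbreaks$tension),
  mean
)
```

With three grouping factors, as in the question, the concatenated index is three times the length of the values, which is exactly what the "arguments must have same length" message is complaining about.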

Did you ever get this solved? I had the same issue when reading in a CSV file and could fix it by saving the original CSV file as CSV (delimiter separated) instead of CSV (delimiter separated - UTF-8). My dataset had German umlauts in it, though, so that might play a role as well.

Assign list of attributes() to sublist in R

I have a dataframe called 'situations' containing list of attributes.
> str(situations)
'data.frame': 24 obs. of 8 variables:
$ ID.SITUATION : Factor w/ 24 levels "cnf_01_be","cnf_02_ch",..: 1 2 3 4 5 6 7 8 9 10 ...
$ ELICITATION.D : Factor w/ 2 levels "NATUREL","SEMI.DIRIGE": 1 1 1 1 1 1 1 1 2 2 ...
$ INTERLOCUTEUR.C : Factor w/ 3 levels "DIALOGUE","MONOLOGUE",..: 2 2 2 2 3 3 3 3 1 1 ...
$ PREPARATION.D : Factor w/ 3 levels "PREPARE","SEMI.PREPARE",..: 2 2 2 2 3 3 3 3 3 3 ...
$ INTERACTIVITE.D : Factor w/ 3 levels "INTERACTIF","NON. INTERACTIF",..: 2 2 2 2 1 1 1 1 3 3 ...
$ MEDIATISATION.D : Factor w/ 3 levels "MEDIATIQUE","NON.MEDIATIQUE",..: 2 2 2 2 2 2 2 2 2 2 ...
$ PROFESSIONNALISATION.C: Factor w/ 1 level "PRO": 1 1 1 1 1 1 1 1 1 1 ...
$ ID.TASK : Factor w/ 5 levels "conference scientifique",..: 1 1 1 1 2 2 2 2 3 3 ...
I have as many observations in this data frame (24) as I have sublists in a given corpus.
The ID.SITUATION values (e.g. cnf_01_be) correspond to the names of the sublists (cnf_01_be).
I know how to assign individual attributes :
attributes(corpus$cnf_01_be) = situations[1,]
attributes(corpus$cnf_02_ch) = situations[2,]
And retrieve them for a specific purpose :
attr(corpus$cnf_01_be, "ELICITATION.D")
attr(corpus$cnf_02_ch, "ELICITATION.D")
attr(corpus$cnf_02_ch, "PREPARATION.D")
But how can I use, for example, lapply to assign attributes automatically to all the sublists in my corpus?
I feel like all my attempts are going in the wrong direction:
setattr <- function(x,y) {
attributes(x) <- situations[y,]
return(attributes)
}
...or...
lapply(corpus,setattr)
lapply(corpus, attributes(corpus) <- situations[c(1:length(situations[,1])),])
Thanks in advance!

The main problem with using lapply (and similar approaches) is that they cannot normally change the original object of interest, but rather return a new structure. So if you already have a list "corpus" and just want to change its members' attributes, you can't usually do that inside a function.
One way to overcome this limitation is to use an eval.parent() call instead of the usual assignment. This function evaluates the assignment expression in the parent environment (the environment that called the function), rather than on the local instances (copies) of the objects you assign. If you use this, you don't have to return any value.
Another option would be to create a local copy of your corpus list within the function, add all the attributes to it, then return the whole structure from the function and use it to replace the old list. If your list is big or complex, this is probably not a wise choice.
Here is code that does it. Note: this is ugly code. I'm still looking to see if I can make it simpler, but because of the issues above, I'm not sure there is a much simpler option. Anyway, I hope the following will do the trick for you:
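f = function(lname, data) {
  # names of the sublists, looked up in the calling environment
  snames = eval.parent(parse(text = paste0("names(", lname, ")")))
  for (xn in snames) {
    # row whose ID matches the sublist name (NA if there is none);
    # the ID column is assumed to be ID.SITUATION, as in the question
    i = match(xn, as.character(data$ID.SITUATION))
    if (!is.na(i)) {
      tmp___ <<- data[i, ]
      # quote the name so that [[ ]] receives a string, not a variable
      cmm = paste0("attributes(", lname, "[[\"", xn, "\"]]) = tmp___")
      eval.parent(parse(text = cmm))
    }
  }
}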
Note that in order to use it you need to supply your list name (as a character string, and not as a variable), and your data frame. In your case the call would be:
f("corpus",situations)
I hope this helps.
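For comparison, the "local copy" option mentioned above can be sketched as follows. The corpus and situations objects here are tiny hypothetical stand-ins, and set_attrs is a made-up helper name; it adds one attribute per column with attr() rather than replacing all attributes at once, which is a deliberate deviation from the code above.

```r
# Hypothetical stand-ins for the asker's objects
corpus <- list(cnf_01_be = c("a", "b"), cnf_02_ch = c("c", "d"))
situations <- data.frame(
  ID.SITUATION  = c("cnf_01_be", "cnf_02_ch"),
  ELICITATION.D = c("NATUREL", "SEMI.DIRIGE"),
  stringsAsFactors = FALSE
)

# set_attrs is a made-up helper: it works on a local copy and returns it
set_attrs <- function(lst, data, id_col = "ID.SITUATION") {
  for (nm in names(lst)) {
    i <- match(nm, as.character(data[[id_col]]))
    if (is.na(i)) next                       # no matching row: leave sublist untouched
    for (col in names(data)) {
      attr(lst[[nm]], col) <- data[[col]][i] # one attribute per column
    }
  }
  lst
}

# The returned copy replaces the old list
corpus <- set_attrs(corpus, situations)
elicitation <- attr(corpus$cnf_01_be, "ELICITATION.D")
```

This avoids parse() and global temporaries at the cost of copying the list, which is the trade-off the answer describes.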
