R: Can't select a specific column in a data frame

I have a problem with a function to select a given column. I have a data frame called Volume from which I want to make a subset DateSearch:
DateSearch = subset(Volume,select=c("TRI",name))
For some reason it does not work. I have used browser(). I can select TRI or name individually, but I can't select both (either by name or by index). I have tried with and without quotes.
Does anyone know why that is?
Many thanks,
Vincent

I just did what (I think) you describe:
str(dfrm)
#'data.frame': 20 obs. of 8 variables:
# $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
# $ factor1: Factor w/ 4 levels "Not at all","To a small extent",..: 3 2 3 NA 3 NA 3 NA 4 1 ...
## <snip>
name = "factor1"
subset(dfrm, select=c("ID", name))
No error, .... results as expected.
Examine the spelling carefully. My guess is that the value of name has a space at the beginning or end, or perhaps even a non-printing character. You can use nchar(name) to check.
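A minimal sketch of that check, assuming a made-up data frame and a made-up stray space (neither comes from the asker's data, which is not shown):

```r
# Hypothetical data; the asker's Volume data frame is not shown
dfrm <- data.frame(ID = 1:3, factor1 = factor(c("a", "b", "a")))

name <- " factor1"            # note the stray leading space
nchar(name)                   # 8, one more than nchar("factor1")
name %in% names(dfrm)         # FALSE: the padded name matches no column

clean <- trimws(name)         # strip leading/trailing whitespace
result <- dfrm[, c("ID", clean)]
names(result)                 # "ID" "factor1"
```

If nchar() reports more characters than you can see, trimws() (or a closer look at where name was built) is a reasonable next step.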

Value labels (levels) are lost when modifying a memisc:data.set in R

I use memisc:data.set because I import data from SPSS. I can get the value labels (in the SPSS sense) from an object by calling levels(). I use them as the tick-mark labels in a plot.
When I modify the data.set (as in the example below), levels() doesn't work anymore.
library('memisc')
# example data
d <- data.set(a = sample(1:100))
d$a_strat <- cut(d$a, breaks=seq(1,100, by=10))
# "modify" the data.set
e <- d[,c('a_strat')]
# it is still a data.set but "a_strat" changed its type
> class(e)
[1] "data.set"
attr(,"package")
[1] "memisc"
Now have a look at the different data types of a_strat in the two data.set.
> str(d$a_strat)
Factor w/ 9 levels "(1,11]","(11,21]",..: 4 9 3 1 NA 9 5 4 9 9 ...
> str(e$a_strat)
$ Nmnl. item w/ 9 labels for 1,2,3,... int 4 9 3 1 NA 9 5 4 9 9 ...
The practical issue is that levels() no longer works on the second data.set.
> levels(e$a_strat)
NULL
But this works
> labels(e$a_strat)
Values and labels:
1 '(1,11]'
2 '(11,21]'
3 '(21,31]'
4 '(31,41]'
5 '(41,51]'
6 '(51,61]'
7 '(61,71]'
8 '(71,81]'
9 '(81,91]'
But when I use that for plotting in axis(..., labels=labels(e$a_strat)), the value labels (e.g. (31,41]) don't appear. Instead, the values (1, 2, ..., 9) appear on the tick marks.
I am not sure how to solve that.
The little helper here is as.factor().
So it could look like this
axis(..., labels=labels(as.factor(e$a_strat)))
But please don't rate that answer positive. ;) I still can't understand why the type of a_strat changes in my example.

What is the difference between dataset[,'column'] and dataset$column in R?

If I want to list all rows of a column in a dataset in R, I am able to do it in these two ways:
> dataset[,'column']
> dataset$column
It appears that both give me the same result. What is the difference?
In practice, not much, as long as dataset is a data frame. The main difference is that the dataset[, "column"] formulation accepts variable arguments, like j <- "column"; dataset[, j] while dataset$j would instead return the column named j, which is not what you want.
dataset$column is list syntax and dataset[ , "column"] is matrix syntax. Data frames are really lists, where each list element is a column and every element has the same length. This is why length(dataset) returns the number of columns. Because they are "rectangular," we are able to treat them like matrices, and R kindly allows us to use matrix syntax on data frames.
Note that, for lists, list$item and list[["item"]] are almost synonymous. Again, the biggest difference is that the latter form evaluates its argument, whereas the former does not. This is true even in the form `$`(list, item), which is exactly equivalent to list$item. In Hadley Wickham's terminology, $ uses "non-standard evaluation."
Also, as mentioned in the comments, $ always uses partial name matching, [[ does not by default (but has the option to use partial matching), and [ does not allow it at all.
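The points above can be tried on a small data frame. This is only a sketch; the data frame and its column names are invented for illustration:

```r
# Hypothetical data frame with a column literally named "column"
dataset <- data.frame(column = 1:3, other = c("a", "b", "c"))
j <- "column"

dataset[, j]                     # j is evaluated: returns 1 2 3
dataset[[j]]                     # same here
dataset$j                        # NULL: looks for a column literally named "j"

dataset$col                      # partial matching: finds "column"
dataset[["col"]]                 # NULL: [[ matches exactly by default
dataset[["col", exact = FALSE]]  # partial matching on request

length(dataset)                  # 2, the number of columns
```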
I recently answered a similar question with some additional details that might interest you.
Use the str command to see the difference:
> mydf
  user_id Gender Age
1       1      F  13
2       2      M  17
3       3      F  13
4       4      F  12
5       5      F  14
6       6      M  16
>
> str(mydf)
'data.frame': 6 obs. of 3 variables:
$ user_id: int 1 2 3 4 5 6
$ Gender : Factor w/ 2 levels "F","M": 1 2 1 1 1 2
$ Age : int 13 17 13 12 14 16
>
> str(mydf[1])
'data.frame': 6 obs. of 1 variable:
$ user_id: int 1 2 3 4 5 6
>
> str(mydf[,1])
int [1:6] 1 2 3 4 5 6
>
> str(mydf[,'user_id'])
int [1:6] 1 2 3 4 5 6
> str(mydf$user_id)
int [1:6] 1 2 3 4 5 6
>
> str(mydf[[1]])
int [1:6] 1 2 3 4 5 6
>
> str(mydf[['user_id']])
int [1:6] 1 2 3 4 5 6
mydf[1] is a data frame while mydf[,1] , mydf[,'user_id'], mydf$user_id, mydf[[1]], mydf[['user_id']] are vectors.
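As a small addition, matrix-style indexing can also keep the data frame structure if you pass drop = FALSE. A sketch that rebuilds mydf from the printout above:

```r
# mydf rebuilt from the printed output above
mydf <- data.frame(
  user_id = 1:6,
  Gender  = factor(c("F", "M", "F", "F", "F", "M")),
  Age     = c(13L, 17L, 13L, 12L, 14L, 16L)
)

is.data.frame(mydf[1])                  # TRUE: list-style [ keeps the data frame
is.data.frame(mydf[, 1])                # FALSE: a single column is dropped to a vector
is.data.frame(mydf[, 1, drop = FALSE])  # TRUE: drop = FALSE keeps it a data frame
```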

Assign list of attributes() to sublist in R

I have a data frame called 'situations' containing a list of attributes.
> str(situations)
'data.frame': 24 obs. of 8 variables:
$ ID.SITUATION : Factor w/ 24 levels "cnf_01_be","cnf_02_ch",..: 1 2 3 4 5 6 7 8 9 10 ...
$ ELICITATION.D : Factor w/ 2 levels "NATUREL","SEMI.DIRIGE": 1 1 1 1 1 1 1 1 2 2 ...
$ INTERLOCUTEUR.C : Factor w/ 3 levels "DIALOGUE","MONOLOGUE",..: 2 2 2 2 3 3 3 3 1 1 ...
$ PREPARATION.D : Factor w/ 3 levels "PREPARE","SEMI.PREPARE",..: 2 2 2 2 3 3 3 3 3 3 ...
$ INTERACTIVITE.D : Factor w/ 3 levels "INTERACTIF","NON. INTERACTIF",..: 2 2 2 2 1 1 1 1 3 3 ...
$ MEDIATISATION.D : Factor w/ 3 levels "MEDIATIQUE","NON.MEDIATIQUE",..: 2 2 2 2 2 2 2 2 2 2 ...
$ PROFESSIONNALISATION.C: Factor w/ 1 level "PRO": 1 1 1 1 1 1 1 1 1 1 ...
$ ID.TASK : Factor w/ 5 levels "conference scientifique",..: 1 1 1 1 2 2 2 2 3 3 ...
I have as many observations in this data frame (24) as I have sublists in a given corpus.
The ID.SITUATION values (e.g. cnf_01_be) correspond to the names of the sublists (cnf_01_be).
I know how to assign individual attributes :
attributes(corpus$cnf_01_be) = situations[1,]
attributes(corpus$cnf_02_ch) = situations[2,]
And retrieve them for a specific purpose :
attr(corpus$cnf_01_be, "ELICITATION.D")
attr(corpus$cnf_02_ch, "ELICITATION.D")
attr(corpus$cnf_02_ch, "PREPARATION.D")
But how can I use, for example, lapply to assign attributes automatically to all the sublists in my corpus?
I feel like all my attempts are going in the wrong direction:
setattr <- function(x,y) {
attributes(x) <- situations[y,]
return(attributes)
}
...or...
lapply(corpus,setattr)
lapply(corpus, attributes(corpus) <- situations[c(1:length(situations[,1])),])
Thanks in advance!
The main problem with using lapply (and similar approaches) is that they cannot normally change the original object of interest, but rather return a new structure. So if you already have a list "corpus" and just want to change its members' attributes, you can't usually do that inside a function.
One way to overcome this limitation is to use an eval.parent() call instead of the usual assignment. This function evaluates the assignment expression in the parent environment (the environment that called the function) rather than on the local copies of the objects you assign. If you use this, you don't have to return any value.
Another option would be to create a local copy of your corpus list within the function, add all the attributes to it, then return the whole structure from the function and use it to replace the old list. If your list is big or complex, this is probably not a wise choice.
Here is some code that takes the eval.parent() route. Note that it is ugly. I'm still looking to see if I can make it simpler, but because of the issues above, I'm not sure there is a much simpler option. Anyway, I hope the following will do the trick for you:
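f = function(lname, data) {
  # names of the sublists, looked up in the caller's environment
  snames = eval.parent(parse(text = paste0("names(", lname, ")")))
  for (xn in snames) {
    # row of 'data' whose ID.SITUATION matches this sublist name
    rd = data[which(as.character(data$ID.SITUATION) == xn), ]
    if (nrow(rd) > 0) {
      tmp___ <<- rd[1, ]
      # quote the name so that [[ receives a character string
      cmm = paste0("attributes(", lname, "[[\"", xn, "\"]]) = tmp___")
      eval.parent(parse(text = cmm))
    }
  }
}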
Note that in order to use it you need to supply your list name (as a character string, and not as a variable), and your data frame. In your case the call would be:
f("corpus",situations)
I hope this helps.
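For comparison, the second option mentioned above (build a modified copy and assign it back) might look like the sketch below. The toy corpus and situations objects and the helper add_attrs are assumptions made for illustration; the real objects are larger.

```r
# Toy stand-ins for the asker's objects (hypothetical data)
corpus <- list(cnf_01_be = c("a", "b"), cnf_02_ch = c("c", "d"))
situations <- data.frame(
  ID.SITUATION  = c("cnf_01_be", "cnf_02_ch"),
  ELICITATION.D = c("NATUREL", "SEMI.DIRIGE"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return a copy of x carrying one attribute
# per column of the matching row of 'situations'
add_attrs <- function(x, id) {
  row <- situations[situations$ID.SITUATION == id, , drop = FALSE]
  for (col in names(row)) attr(x, col) <- as.character(row[[col]])
  x
}

# Map() walks the sublists and their names in parallel;
# the result replaces the old list
corpus <- Map(add_attrs, corpus, names(corpus))
attr(corpus$cnf_01_be, "ELICITATION.D")   # "NATUREL"
```

Unlike assigning with attributes(x) <- ..., setting attributes one at a time with attr() leaves any attributes the sublist already had (such as names) in place.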

How does sort() work in data.frame within R

How does sorting work here, that is, which method is used to sort the columns of the
data.frame (barley$site, barley$year, barley$variety)
in the following code?
library(lattice)
barley <- barley[order(barley$site, barley$year, barley$variety), ]
You probably want:
barley[order(as.character(barley$site), as.numeric(as.character(barley$year)), as.character(barley$variety)),]
As you have it, you are ordering by the underlying integer codes of the factor columns, which leads to really odd results. Look at the structure of the data frame:
'data.frame': 120 obs. of 4 variables:
$ yield : num 27 48.9 27.4 39.9 33 ...
$ variety: Factor w/ 10 levels "Svansota","No. 462",..: 3 3 3 3 3 3 7 7 7 7 ...
$ year : Factor w/ 2 levels "1932","1931": 2 2 2 2 2 2 2 2 2 2 ...
$ site : Factor w/ 6 levels "Grand Rapids",..: 3 6 4 5 1 2 3 6 4 5 ...
Notice how the levels for year are in the opposite order from what you would expect. The documentation for order discusses this very briefly:
For factors, this sorts on the internal codes, which is particularly appropriate for ordered factors.
I personally think this is terribly confusing, but it is what it is. Factors are very useful in most contexts, but incredibly dangerous in others if you're not careful. Having numbers represented as factors (as year was here) is particularly bad.
See ?factor for more details.
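A small sketch of the effect, using a made-up factor whose levels are stored in the same reversed order as barley$year:

```r
# Hypothetical factor: levels deliberately listed as "1932", "1931"
year <- factor(c("1931", "1932", "1931"), levels = c("1932", "1931"))

as.integer(year)                         # 2 1 2: the internal codes
order(year)                              # 2 1 3: "1932" sorts first
order(as.character(year))                # 1 3 2: sorted by the text
order(as.numeric(as.character(year)))    # 1 3 2: sorted by the number
```

Note that as.numeric(year) on its own would return the internal codes again, which is why the conversion goes through as.character() first.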
By default, sort doesn't know how to do anything with a data frame. You can sort the individual columns within a data frame, with something like df$x <- sort(df$x) but you almost certainly don't want to do that; it will just mess up your data.
You order the rows in the data frame by using order as in the example code you have there. This orders the rows by values in the column site, breaking ties with year, and then with variety.
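The tie-breaking can be seen on a tiny example. The data frame below is made up for illustration and uses plain character and numeric columns rather than factors:

```r
# Hypothetical data, not the barley data set
df <- data.frame(
  site  = c("B", "A", "B", "A"),
  year  = c(1932, 1932, 1931, 1931),
  yield = c(1, 2, 3, 4),
  stringsAsFactors = FALSE
)

idx <- order(df$site, df$year)   # sort by site, break ties by year
idx                              # 4 2 3 1
sorted <- df[idx, ]              # whole rows move together
sorted$yield                     # 4 2 3 1
```

Because entire rows are reordered, each yield stays attached to its own site and year, which is what sorting a single column with sort() would break.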

loop over columns with semi like columnnames

I have the following variable and dataframe
welltypes <- c("LC","HC")
qccast <- data.frame(
LC_mean=1:10,
HC_mean=10:1,
BC_mean=rep(0,10)
)
Now I only want to see the welltypes I selected(in this case LC and HC, but it could also be different ones.)
for(i in 1:length(welltypes)){
qccast$welltypes[i]_mean
}
This does not work, I know.
But how do I loop over those columns?
It has to be driven by the variable, because welltypes is of unknown size.
The second argument to $ needs to be a column name of the first argument. I haven't run the code, but I would expect welltypes[i]_mean to be a syntax error. $ is similar to [[, so you can use paste to create the column name string and subset via [[.
For example:
qccast[[paste(welltypes[i],"_mean",sep="")]]
Depending on the rest of your code, you may be able to do something like this instead.
for(i in paste(welltypes,"_mean",sep="")){
qccast[[i]]
}
Here's another strategy:
qccast[ sapply(welltypes, grep, names(qccast)) ]
LC_mean HC_mean
1 1 10
2 2 9
3 3 8
4 4 7
5 5 6
6 6 5
7 7 4
8 8 3
9 9 2
10 10 1
Another easy way to access given welltypes
qccast[,paste(welltypes, '_mean', sep = "")]
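Putting the pieces together, a runnable sketch with the question's data. Taking the mean of each column is only a stand-in for whatever should happen inside the loop:

```r
welltypes <- c("LC", "HC")
qccast <- data.frame(
  LC_mean = 1:10,
  HC_mean = 10:1,
  BC_mean = rep(0, 10)
)

# Build the column names once, whatever the length of welltypes
cols <- paste0(welltypes, "_mean")

# Loop over the names and pull each column out with [[
for (col in cols) {
  print(mean(qccast[[col]]))
}

# Or select the columns in one step and apply a function to each
col_means <- sapply(qccast[cols], mean)
col_means   # LC_mean 5.5, HC_mean 5.5
```

If some welltypes might have no matching column, intersect(cols, names(qccast)) is one way to guard against "undefined columns selected" errors.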
