How can I store a value in a name? - r

I use the neotoma package, where I get data from a geographical site identified by an ID. I want to store that number in a variable, such as Sitenum, so that I only need to write the ID once and can then reuse it.
What I did:
Site<-get_download(20131, verbose = TRUE)
taxa<-as.vector(Site$'20131'$taxon.list$taxon.name)
What I want to do:
Sitenum <-20131
Site<-get_download(Sitenum, verbose = TRUE) # this obv. works
taxa<-as.vector(Site$Sitenum$taxon.list$taxon.name) # this doesn't work
The structure of the dataset:
str(Site)
List of 1
$ 20131:List of 6
..$ taxon.list :'data.frame': 84 obs. of 6 variables:
.. ..$ taxon.name : Factor w/ 84 levels "Alnus","Amaranthaceae",..: 1 2 3 4 5 6 7 8 9 10 ...

I constructed an object that mimics yours as follows:
Site <- list("2043"=list(other=data.frame(that=1:10)))
Note that the structure is essentially identical.
str(Site)
List of 1
$ 2043:List of 1
..$ other:'data.frame': 10 obs. of 1 variable:
.. ..$ that: int [1:10] 1 2 3 4 5 6 7 8 9 10
Now, I save the value of the first term:
temp <- 2043
Then use the code in my comment to access the inner vector:
Site[[as.character(temp)]]$other$that
[1] 1 2 3 4 5 6 7 8 9 10
I could also use recursive indexing, like this:
Site[[c(temp,"other", "that")]]
[1] 1 2 3 4 5 6 7 8 9 10
This works because c() coerces temp to a character string when it is combined with the character strings "other" and "that".
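Applied to the object from the question, the same idea might look like the sketch below. The Site list here is a hand-built stand-in for what get_download() returns (an assumption, since the real download is not reproduced), holding only the first two taxon names from the str() output.

```r
# Hand-built stand-in for the list returned by get_download(20131);
# the real object has more elements, this mock is an assumption for illustration.
Sitenum <- 20131
Site <- list("20131" = list(
  taxon.list = data.frame(taxon.name = factor(c("Alnus", "Amaranthaceae")))
))

# `$` takes its argument literally, so this looks for an element named "Sitenum".
via_dollar <- Site$Sitenum

# `[[` evaluates its argument, so the ID can come from a variable.
taxa <- as.vector(Site[[as.character(Sitenum)]]$taxon.list$taxon.name)

print(is.null(via_dollar))
print(taxa)
```

The only change compared with the code in the question is replacing Site$Sitenum with Site[[as.character(Sitenum)]].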

Related

Recursive partitioning for factors/characters problem

Currently I am working with the dataset predictions. In this data I have converted clear character type variables into factors because I think factors work better than characters for glmtree() code (tell me if I am wrong with this):
> str(predictions)
'data.frame': 43804 obs. of 14 variables:
$ month : Factor w/ 7 levels "01","02","03",..: 6 6 6 6 1 1 2 2 3 3 ...
$ pred : num 0.21 0.269 0.806 0.945 0.954 ...
$ treatment : Factor w/ 2 levels "0","1": 1 1 2 2 2 2 2 2 2 2 ...
$ type : Factor w/ 4 levels "S","MS","ML",..: 1 1 4 4 4 4 4 4 4 4 ...
$ i_mode : Factor w/ 143 levels "AAA","ABC","CBB",..: 28 28 104 104 104 104 104 104 104 104 ...
$ r_mode : Factor w/ 29 levels "0","5","8","11",..: 4 4 2 2 2 2 2 2 2 2 ...
$ in_mode: Factor w/ 22 levels "XY",..: 11 11 6 6 6 6 6 6 6 6 ...
$ v_mode : Factor w/ 5 levels "1","3","4","7",..: 1 1 1 1 1 1 1 1 1 1 ...
$ di : num 1157 1157 1945 1945 1945 ...
$ cont : Factor w/ 5 levels "AN","BE",..: 2 2 2 2 2 2 2 2 2 2 ...
$ hk : num 0.512 0.512 0.977 0.977 0.941 ...
$ np : num 2 2 2 2 2 2 2 2 2 2 ...
$ hd : num 1 1 0.408 0.408 0.504 ...
$ nd : num 1 1 9 9 9 9 7 7 9 9 ...
I want to estimate a recursive partitioning model of this kind:
library("partykit")
glmtr <- glmtree(formula = pred ~ treatment + 1 | (month+type+i_mode+r_mode+in_mode+v_mode+di+cont+np+nd+hd+hk),
data = predictions,
maxdepth=6,
family = quasibinomial)
My data does not contain any NAs. However, the following error arises (even after converting the character variables to factors):
Error in matrix(0, nrow = mi, ncol = nl) :
invalid 'nrow' value (too large or NA)
In addition: Warning message:
In matrix(0, nrow = mi, ncol = nl) :
NAs introduced by coercion to integer range
Any clue?
Thank you
You are right that glmtree() and the underlying mob() function expect nominal split variables to be factors. However, this is computationally feasible only for factors with a limited number of levels, because the algorithm tries all possible partitions of the levels into two groups. Thus, for your i_mode factor with nl levels, this means going through mi splits into two groups:
nl <- 143
mi <- 2^(nl - 1L) - 1L
mi
## [1] 5.575186e+42
Internally, mob() tries to create a matrix for storing all log-likelihoods associated with the corresponding partitioned models. This is not possible because a matrix with that many rows cannot be represented. (And even if it could, you would never finish fitting all the associated models.) Admittedly, the error message is not very useful and should be improved. We will look into that for the next revision of the package.
To solve the problem, I would recommend turning the variables i_mode, r_mode, and in_mode into variables that are more suitable for binary splitting with exhaustive search. Maybe some of the variables are actually ordinal? If so, I would recommend turning them into ordered factors or, in the case of i_mode, even into a numeric variable, because the number of levels is large enough. Alternatively, you may be able to create several factors that each describe a different property of the levels and use those for partitioning.
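To get a feel for how quickly the exhaustive search grows, the count from above can be evaluated for the level counts shown in the str() output. The helper n_splits and the toy r_mode vector below are made up for illustration; whether r_mode is really ordinal or numeric is something only you can judge from the data.

```r
# Number of ways to split nl unordered levels into two non-empty groups.
n_splits <- function(nl) 2^(nl - 1) - 1

# Level counts taken from the str() output in the question.
print(n_splits(c(cont = 5, in_mode = 22, r_mode = 29, i_mode = 143)))

# Toy stand-in for r_mode, using the first four levels shown by str().
r_mode <- factor(c("11", "5", "0", "8", "5"), levels = c("0", "5", "8", "11"))

# Option 1: mark the factor as ordered (only sensible if the levels
# really have a natural order).
r_mode_ord <- factor(r_mode, levels = levels(r_mode), ordered = TRUE)

# Option 2: numeric variable. Go through as.character() so that the labels
# are converted, not the internal integer codes.
r_mode_num <- as.numeric(as.character(r_mode))
print(r_mode_num)
```

A factor with 5 levels needs 15 candidate splits and one with 22 levels already needs about two million, which is why the 143-level factor is hopeless as an unordered factor.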

R equivalent to SAS DO loop

ID Type Sales Date
1 1 $ 5,027 18-Jan-2016
2 1 $ 2,646 10-Nov-2012
3 1 $ 7,549 11-Feb-2018
4 2 $ 4,536 18-Feb-2016
5 2 $ 3,118 26-Aug-2017
6 3 $ 9,815 07-Jun-2017
7 3 $ 885 15-Dec-2017
8 3 $ 2,911 10-Nov-2017
9 3 $ 1,823 12-Oct-2015
10 4 $ 5,723 04-Jul-2014
11 5 $ 2,612 31-Mar-2015
12 5 $ 3,344 06-Jan-2016
13 5 $ 4,215 22-May-2016
14 6 $ 5,500 23-Mar-2018
To split the above dataset (Main) by Type, we can use the following SAS macro. How can I do the same in R?
Thanks in advance.
%MACRO split;
%DO m = 1 %TO 6 ;
DATA type_%eval(&m) ;
SET main ;
IF Type = &m then output type_%eval(&m) ;
RUN ;
%END ;
%MEND split ;
%split ;
ID Type Sales Date
1 1 $ 5,027 18-Jan-2016
2 1 $ 2,646 10-Nov-2012
3 1 $ 7,549 11-Feb-2018
ID Type Sales Date
4 2 $ 4,536 18-Feb-2016
5 2 $ 3,118 26-Aug-2017
ID Type Sales Date
6 3 $ 9,815 07-Jun-2017
7 3 $ 885 15-Dec-2017
8 3 $ 2,911 10-Nov-2017
9 3 $ 1,823 12-Oct-2015
This gives me the following datasets: Type1, Type2, Type3, ..., Type6.
You can use split. If your data frame is called df, do
df.list <- split(df, df$Type)
This gives you a list of data frames. You can get individual data frames by using $ and the value of Type (as below). Since these names are not syntactic (they start with a digit), you have to put backticks or quotes around them:
df.list$'1'
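You can also use list indexing, e.g. df.list[[3]] for the third data frame (single brackets, as in df.list[3], return a list of length one that contains it). In your example, position and name happen to coincide, e.g. df.list$'1' is the same as df.list[[1]], but only because the Type values are 1, 2, 3, ...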
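Here is a small self-contained sketch of this approach, using a cut-down version of the data from the question (only ID and Type for the first nine rows, typed in by hand). The renaming step at the end is an optional extra, not something the SAS macro requires, for the case where you prefer element names like type_1 that need no quoting.

```r
# First nine rows of the question's data, reduced to ID and Type.
df <- data.frame(ID = 1:9, Type = c(1, 1, 1, 2, 2, 3, 3, 3, 3))

# One data frame per value of Type, collected in a named list.
df.list <- split(df, df$Type)
print(names(df.list))        # element names are the Type values, as strings

print(class(df.list[3]))     # single brackets: a list of length one
print(class(df.list[["3"]])) # double brackets: the data frame itself

# Optional: syntactic names, so that $ works without quotes.
names(df.list) <- paste0("type_", names(df.list))
print(df.list$type_3)
```

Keeping the pieces in one list, rather than creating six separate objects, also makes it easy to process them all with lapply().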

Remove duplicates in R without converting to numeric

I have 2 variables in a data frame with 300 observations.
$ imagelike: int 3 27 4 5370 ...
$ user: Factor w/ 24915 levels "\"0.1gr\"","\"008bla\"", ..
I then tried to remove the duplicates (for example, "- " appears twice):
testclean <- data1[!duplicated(data1), ]
This gives me the warning message:
In Ops.factor(left): "-"not meaningful for factors
I then converted it to a matrix:
data2 <- data.matrix(data1)
testclean2 <- data2[!duplicated(data2), ]
This does the trick; however, it converts the user names to numeric codes.
=========================================================================
I am new but I have tried looking at previous posts on this topic (including the one below) but it did not work out:
Convert data.frame columns from factors to characters
Some sample data, from your image (please don't post images of data!):
data1 <- data.frame(imageLikeCount = c(3,27,4,4,16,103),
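userName = c("\"testblabla\"", "test_00", "frenchfries", "frenchfries", "test.inc", "\"parmezan_pizza\""),
stringsAsFactors = TRUE) # default before R 4.0.0; set explicitly so userName is a factor as shown below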
str(data1)
# 'data.frame': 6 obs. of 2 variables:
# $ imageLikeCount: num 3 27 4 4 16 103
# $ userName : Factor w/ 5 levels "\"parmezan_pizza\"",..: 2 5 3 3 4 1
To fix the problem with factors as well as the embedded quotes:
data1$userName <- gsub('"', '', as.character(data1$userName))
str(data1)
# 'data.frame': 6 obs. of 2 variables:
# $ imageLikeCount: num 3 27 4 4 16 103
# $ userName : chr "testblabla" "test_00" "frenchfries" "frenchfries" ...
As @DanielWinkler suggested, if you can change how the data is read in or defined, you might choose to include stringsAsFactors = FALSE (this argument is accepted by many functions, including read.csv, read.table, and most data.frame functions, including as.data.frame and rbind):
data1 <- data.frame(imageLikeCount = c(3,27,4,4,16,103),
userName = c("\"testblabla\"", "test_00", "frenchfries", "frenchfries", "test.inc", "\"parmezan_pizza\""),
stringsAsFactors = FALSE)
str(data1)
# 'data.frame': 6 obs. of 2 variables:
# $ imageLikeCount: num 3 27 4 4 16 103
# $ userName : chr "\"testblabla\"" "test_00" "frenchfries" "frenchfries" ...
(Note that this still has embedded quotes, so you'll still need something like data1$userName <- gsub('"', '', data1$userName).)
Now, we have data that looks like this:
data1
# imageLikeCount userName
# 1 3 testblabla
# 2 27 test_00
# 3 4 frenchfries
# 4 4 frenchfries
# 5 16 test.inc
# 6 103 parmezan_pizza
and removing duplicates works as intended:
data1[! duplicated(data1), ]
# imageLikeCount userName
# 1 3 testblabla
# 2 27 test_00
# 3 4 frenchfries
# 5 16 test.inc
# 6 103 parmezan_pizza
Try
data$userName <- as.character(data$userName)
And then
data<-unique(data)
You could also pass the argument stringsAsFactors = FALSE when reading the data. This is usually a good idea.

What is the difference between dataset[,'column'] and dataset$column in R?

If I want to list all rows of a column in a dataset in R, I am able to do it in these two ways:
> dataset[,'column']
> dataset$column
It appears that both give me the same result. What is the difference?
In practice, not much, as long as dataset is a data frame. The main difference is that the dataset[, "column"] formulation accepts variable arguments, like j <- "column"; dataset[, j] while dataset$j would instead return the column named j, which is not what you want.
dataset$column is list syntax and dataset[ , "column"] is matrix syntax. Data frames are really lists, where each list element is a column and every element has the same length. This is why length(dataset) returns the number of columns. Because they are "rectangular," we are able to treat them like matrices, and R kindly allows us to use matrix syntax on data frames.
Note that, for lists, list$item and list[["item"]] are almost synonymous. Again, the biggest difference is that the latter form evaluates its argument, whereas the former does not. This is true even in the form `$`(list, item), which is exactly equivalent to list$item. In Hadley Wickham's terminology, $ uses "non-standard evaluation."
Also, as mentioned in the comments, $ always uses partial name matching, [[ does not by default (but has the option to use partial matching), and [ does not allow it at all.
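A short sketch of these differences on a plain list (the names lst and j are made up for the illustration):

```r
lst <- list(column = 1:3, other = c("a", "b", "c"))
j <- "column"

# [[ evaluates its argument, so j is looked up and "column" is used.
by_variable <- lst[[j]]

# $ takes the name literally and looks for an element called "j".
by_dollar <- lst$j

# Partial matching: $ accepts an unambiguous prefix, [[ only on request.
partial_dollar <- lst$col
partial_strict <- lst[["col"]]
partial_opt_in <- lst[["col", exact = FALSE]]

print(by_variable)
print(is.null(by_dollar))
print(partial_dollar)
print(is.null(partial_strict))
print(partial_opt_in)
```

In scripts and functions, [[ is usually the safer choice because a mistyped or shortened name gives NULL instead of silently matching another element.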
I recently answered a similar question with some additional details that might interest you.
Use the str command to see the difference:
> mydf
user_id Gender Age
1 1 F 13
2 2 M 17
3 3 F 13
4 4 F 12
5 5 F 14
6 6 M 16
>
> str(mydf)
'data.frame': 6 obs. of 3 variables:
$ user_id: int 1 2 3 4 5 6
$ Gender : Factor w/ 2 levels "F","M": 1 2 1 1 1 2
$ Age : int 13 17 13 12 14 16
>
> str(mydf[1])
'data.frame': 6 obs. of 1 variable:
$ user_id: int 1 2 3 4 5 6
>
> str(mydf[,1])
int [1:6] 1 2 3 4 5 6
>
> str(mydf[,'user_id'])
int [1:6] 1 2 3 4 5 6
> str(mydf$user_id)
int [1:6] 1 2 3 4 5 6
>
> str(mydf[[1]])
int [1:6] 1 2 3 4 5 6
>
> str(mydf[['user_id']])
int [1:6] 1 2 3 4 5 6
mydf[1] is a data frame, while mydf[,1], mydf[,'user_id'], mydf$user_id, mydf[[1]], and mydf[['user_id']] are vectors.

Getting an error "(subscript) logical subscript too long" while training SVM from e1071 package in R

I am training an SVM on my training data (e1071 package in R). Here is the structure of my data:
> str(train)
'data.frame': 891 obs. of 10 variables:
$ survived: int 0 1 1 1 0 0 0 0 1 1 ...
$ pclass : int 3 1 3 1 3 3 1 3 3 2 ...
$ name : Factor w/ 15 levels "capt","col","countess",..: 12 13 9 13 12 12 12 8 13 13
$ sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
$ age : num 22 38 26 35 35 ...
$ ticket : Factor w/ 533 levels "110152","110413",..: 516 522 531 50 473 276 86 396
$ fare : num 7.25 71.28 7.92 53.1 8.05 ...
$ cabin : Factor w/ 9 levels "a","b","c","d",..: 9 3 9 3 9 9 5 9 9 9 ...
$ embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
$ family : int 1 1 0 1 0 0 0 4 2 1 ...
I train it as the following.
library(e1071)
model1 <- svm(survived~.,data=train, type="C-classification")
No problem here. But when I predict as:
pred <- predict(model1,test)
I get the following error:
Error in newdata[, object$scaled, drop = FALSE] :
(subscript) logical subscript too long
I also tried removing the "ticket" predictor from both the train and test data, but I still get the same error. What is the problem?
There might be a difference in the number of levels in one of the factors in the 'test' dataset.
Run str(test) and check that the factor variables have the same levels as the corresponding variables in the 'train' dataset.
For instance, the example below shows that my.test$foo has only 4 levels:
str(my.train)
'data.frame': 554 obs. of 7 variables:
....
$ foo: Factor w/ 5 levels "C","Q","S","X","Z": 2 2 4 3 4 4 4 4 4 4 ...
str(my.test)
'data.frame': 200 obs. of 7 variables:
...
$ foo: Factor w/ 4 levels "C","Q","S","X": 3 3 3 3 1 3 3 3 3 3 ...
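One way to check for this, and to give the test factor the same level set as the training factor, is sketched below. The data frames are toy stand-ins built from the levels in the example above; whether re-levelling removes the error in your case is an assumption you would need to verify.

```r
# Toy stand-ins: the training factor has one level ("Z") the test factor lacks.
my.train <- data.frame(foo = factor(c("C", "Q", "S", "X", "Z")))
my.test  <- data.frame(foo = factor(c("S", "C", "X", "Q")))

# Which training levels are missing from the test factor?
missing_levels <- setdiff(levels(my.train$foo), levels(my.test$foo))
print(missing_levels)

# Rebuild the test factor on the training levels; the values stay the same.
my.test$foo <- factor(my.test$foo, levels = levels(my.train$foo))
print(levels(my.test$foo))
```

Note that a value in the test data that is not among the training levels would become NA with this approach, so it is worth checking for NAs afterwards.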
That's correct: the train data contains 2 blanks for embarked. Because of this, there is one extra level for the blank value, and you get this error.
$ Embarked : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
The first level is the blank one.
I encountered the same problem today. It turned out that the svm model in the e1071 package expects samples in rows: one row is one sample and one column is one variable. If you use columns as samples and rows as variables, this error occurs.
Your data is probably fine (no new levels in the test data), and you just need a small trick before predicting:
test.df = rbind(train.df[1,],test.df)
test.df = test.df[-1,]
This trick was from R Random Forest - type of predictors in new data do not match.
I ran into this problem today, used the trick above, and it solved the problem.
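My understanding of why this works, as a hedged sketch with toy data (the contents of train.df and test.df below are made up): rbind() on data frames expands factor levels to cover both inputs, so after binding a training row on top, the test factors carry the training levels, and dropping that row again leaves the levels in place.

```r
# Toy data: the test factor lacks the level "Z" present in the training factor.
train.df <- data.frame(foo = factor(c("C", "Q", "S", "X", "Z")), x = 1:5)
test.df  <- data.frame(foo = factor(c("S", "C", "X", "Q")), x = 6:9)
print(levels(test.df$foo))

# Bind one training row on top, then remove it again.
test.df <- rbind(train.df[1, ], test.df)
test.df <- test.df[-1, ]

# The rows are the original test rows, but the factor now has the training levels.
print(levels(test.df$foo))
print(nrow(test.df))
```

This only lines up the level sets; it does not help if a column has a different type altogether in the two data frames.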
I have been playing with that data set as well. I know this was a long time ago, but one thing you can do is explicitly include only the columns you feel will add to the model, like this:
fit <- svm(Survived~Pclass + Sex + Age + SibSp + Parch + Fare + Embarked, data=train)
This eliminated the problem for me by dropping columns that contribute nothing to the model (like the ticket number).
Another issue, whose fix resolved the problem in my code, was that I had forgotten to convert some of my independent variables to factors.
