R equivalent to SAS DO loop

ID Type Sales Date
1 1 $ 5,027 18-Jan-2016
2 1 $ 2,646 10-Nov-2012
3 1 $ 7,549 11-Feb-2018
4 2 $ 4,536 18-Feb-2016
5 2 $ 3,118 26-Aug-2017
6 3 $ 9,815 07-Jun-2017
7 3 $ 885 15-Dec-2017
8 3 $ 2,911 10-Nov-2017
9 3 $ 1,823 12-Oct-2015
10 4 $ 5,723 04-Jul-2014
11 5 $ 2,612 31-Mar-2015
12 5 $ 3,344 06-Jan-2016
13 5 $ 4,215 22-May-2016
14 6 $ 5,500 23-Mar-2018
To split the above dataset (main) by Type, we can use the following SAS macro. How can I do the same in R?
Thanks in advance.
%MACRO split;
%DO m = 1 %TO 6 ;
DATA type_%eval(&m) ;
SET main ;
IF Type = &m then output type_%eval(&m) ;
RUN ;
%END ;
%MEND split ;
%split ;
ID Type Sales Date
1 1 $ 5,027 18-Jan-2016
2 1 $ 2,646 10-Nov-2012
3 1 $ 7,549 11-Feb-2018
ID Type Sales Date
4 2 $ 4,536 18-Feb-2016
5 2 $ 3,118 26-Aug-2017
ID Type Sales Date
6 3 $ 9,815 07-Jun-2017
7 3 $ 885 15-Dec-2017
8 3 $ 2,911 10-Nov-2017
9 3 $ 1,823 12-Oct-2015
This gives me one dataset per Type: type_1, type_2, type_3, ..., type_6 (the first three are shown above).

You can use split. If your data frame is called df, do
df.list <- split(df, df$Type)
This gives you a list of data frames, named after the values of Type. You can get an individual data frame with $ and the value of Type (as below). Because these names start with a digit, they are not syntactically valid R names, so you have to put backticks or quotes around them:
df.list$'1'
You can also index the list by position, e.g. df.list[[3]] for the third data frame (df.list[3], with single brackets, returns a list of length one that contains it). In your example the Type values happen to match the positions, so df.list$'1' is the same as df.list[[1]].
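To make the above concrete, here is a minimal sketch using only the ID and Type columns from the question. The type_ prefix is added to mirror the SAS dataset names and is my own choice, not something split does for you.

```r
# Reduced copy of the question's data (Sales and Date left out for brevity)
main <- data.frame(
  ID   = 1:14,
  Type = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 5, 5, 5, 6)
)

# One data frame per value of Type, collected in a named list
df.list <- split(main, main$Type)

# Optional: rename the elements so they look like the SAS datasets
names(df.list) <- paste0("type_", names(df.list))

df.list$type_3        # rows with Type == 3
df.list[["type_3"]]   # same data frame, addressed by a string
df.list[[3]]          # same data frame, addressed by position

# If you really want separate objects in the workspace, as in SAS:
# list2env(df.list, envir = globalenv())
```

Keeping the pieces in a list is usually more convenient than creating six separate objects, because you can loop over them with lapply.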

Related

R anesrake issue with list names non-binary argument

I am using anesrake to weight some survey data, but am getting a "non-numeric argument to binary operator" error. The error only occurs after I have added the names to the list to use as targets:
gender1<-c(0.516166000986901,0.483833999013099)
age<-c(0.15828262425613,0.364861110549873,0.429947760183493,0.0469085050104993)
mylist<-list(gender1,age)
names(mylist)<-c("gender1","age")
result<-anesrake(mylist,france,caseid=france$caseid, iterate=TRUE)
Error in x + weights : non-numeric argument to binary operator
In addition: Warning message:
In anesrake(targets, france, caseid = france$caseid, iterate = TRUE) :
Targets for age do not sum to 100%. Adjusting values to total 100%
The warning also says that the targets for age don't add up to 100%, which they do, so I am not sure what that is about either. If I leave out the names(mylist) line, I get the following error instead, presumably because R doesn't know which variables to use, but not the binary operator error:
Error in selecthighestpcts(discrep1, inputter, pctlim) :
No variables are off by more than 5 percent using the method you have chosen, either weighting is unnecessary or a smaller pre-raking limit should be chosen.
The variables in the data frame are called the same as the targets in the list, and are numeric:
> str(france)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 993 obs. of 5 variables:
$ Gender :Classes 'labelled', 'numeric' atomic [1:993] 2 2 2 2 2 2 2 2 2 2 ...
.. ..- attr(*, "label")= chr "Gender"
$ Age2 : num 2 3 2 2 2 2 2 1 2 3 ...
$ gender1: num 2 2 2 2 2 2 2 2 2 2 ...
$ caseid : int 1 2 3 4 5 6 7 8 9 10 ...
$ age : num 2 3 2 2 2 2 2 1 2 3 ...
I have also tried converting gender1 and age to factor variables (as the numbers represent levels of each variable - gender has 2, age has 4), but with the same result. I have used anesrake before successfully, so there must be something I am missing, but cannot see it! Any help greatly appreciated....
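Two checks that may help narrow this down, offered as a sketch and not as a confirmed fix. The first compares the target sums to 1 with a tolerance, because decimals typed by hand rarely add up to exactly 1 in floating point. The second is only a guess on my part: str(france) shows a tbl_df, and it may be worth passing a plain data frame to anesrake in case it does not handle tibbles.

```r
gender1 <- c(0.516166000986901, 0.483833999013099)
age <- c(0.15828262425613, 0.364861110549873, 0.429947760183493,
         0.0469085050104993)
mylist <- list(gender1 = gender1, age = age)

# Do the targets sum to 1 within numerical tolerance?
sums_ok <- sapply(mylist, function(p) isTRUE(all.equal(sum(p), 1)))
sums_ok

# Untested assumption: drop the tibble classes before raking
# france <- as.data.frame(france)
# result <- anesrake(mylist, france, caseid = france$caseid)
```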

How can I store a value in a name?

I use the neotoma package to download data for a geographical site, which is identified by an ID. What I want to do is store the number in a variable, like Sitenum, so that I only need to write the ID once and can then reuse it.
What I did:
Site<-get_download(20131, verbose = TRUE)
taxa<-as.vector(Site$'20131'$taxon.list$taxon.name)
What I want to do:
Sitenum <-20131
Site<-get_download(Sitenum, verbose = TRUE) # this obv. works
taxa<-as.vector(Site$Sitenum$taxon.list$taxon.name) # this doesn't work
The structure of the dataset:
str(Site)
List of 1
$ 20131:List of 6
..$ taxon.list :'data.frame': 84 obs. of 6 variables:
.. ..$ taxon.name : Factor w/ 84 levels "Alnus","Amaranthaceae",..: 1 2 3 4 5 6 7 8 9 10 ...
I constructed an object that mimics yours as follows:
Site <- list("2043"=list(other=data.frame(that=1:10)))
Note that the structure is essentially identical.
str(Site)
List of 1
$ 2043:List of 1
..$ other:'data.frame': 10 obs. of 1 variable:
.. ..$ that: int [1:10] 1 2 3 4 5 6 7 8 9 10
Now, I save the value of the first term:
temp <- 2043
Then use the code in my comment to access the inner vector:
Site[[as.character(temp)]]$other$that
[1] 1 2 3 4 5 6 7 8 9 10
I could also use recursive referencing like this
Site[[c(temp,"other", "that")]]
[1] 1 2 3 4 5 6 7 8 9 10
because c will coerce temp to be a character vector in the presence of "other" and "that" character vectors.
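Applied to the names in the question, the same idea looks like the sketch below. The Site object here is a small mock with the same nesting, not a real get_download result, and the two taxon names are taken from the str output in the question.

```r
Sitenum <- 20131

# Mock of the downloaded object: a list with one element named "20131"
Site <- setNames(
  list(list(taxon.list = data.frame(taxon.name = c("Alnus", "Amaranthaceae")))),
  as.character(Sitenum)
)

# $ does not evaluate Sitenum, so it looks for an element literally called "Sitenum"
is.null(Site$Sitenum)

# [[ evaluates its argument; convert the number to a string to look up by name
taxa <- as.vector(Site[[as.character(Sitenum)]]$taxon.list$taxon.name)
taxa
```

Note that Site[[Sitenum]] without as.character would be read as "element number 20131" and fail, since the list has only one element.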

How to automatically convert a numeric column to categorical data using statistical techniques

>data
ACC_ID REG PRBLT OPP_TYPE_DESC PARENT_ID ACCT_NM INDUSTRY_ID BUY PWR REV QTY
11316456 No 90 A 2122628569 INF 7379 10190.82 6500 1
11456476 Yes 1 I 2385888136 Module 9199 17441.72 466.5 31
13453245 No 10 D 2122628087 Wooden 3559 44279.21 2500 500
15674568 No 1 I 2702074521 Nine 7379 183218.8 25.91 1
Above is the given dataset
When I load the same in R, I have the following structure
>str(data)
$ ACC_ID : int 11316974 11620677 11865091 ...
$ REG : Factor w/ 2 levels "No ","Yes ": 1 2 1 1 1 1 1 1 1 1 ...
$ PRBLT : int 90 1 10 1 30 30 10 1 60 1 ...
$ OPP_TYPE_DESC : Factor w/ 3 levels "D",..: 3 2 1 2 1 1 1 3 3 2 ...
$ PARENT_ID : num 2.12e+09 2.39e+09 2.12e+09 2.70e+09 2.12e+09 ...
$ ACCT_NM : Factor w/ 20 levels "Marketing Vertical",..: 10 15 20 17 8 16 2 14 7 11 ...
$ INDUSTRY_ID : int 7379 9199 3559 7379 2711 7374 7371 8742 4813 2111 ..
$ BUY PWR : num 1014791 17442 ...
$ REV : num 6500 46617 250000 25564 20000 ...
$ QTY : int 1 31 500 1 6 100 ...
However, I would like R to automatically treat the fields below as factors instead of int or num (with the help of statistical modelling or any other technique). These are not continuous fields but nominal categorical fields:
ACC_ID
PARENT_ID
INDUSTRY_ID
Whereas the REV and QTY columns should be left as is.
Also, the analysis should not be specific to the data and the columns shown here. The logic must be applicable to any data-set (with different columns) that we load in R
Can there be any method through which this is possible? Any ideas are welcome.
Thank you
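One possible starting point, sketched under a strong assumption: that identifier columns can be recognised by a naming convention (here, names ending in _ID). The function name ids_to_factor and the pattern are hypothetical. A rule based only on the values may be hard to get right, since a column like QTY is also made of whole numbers.

```r
# Convert numeric columns whose names match `pattern` to factors
ids_to_factor <- function(df, pattern = "_ID$") {
  is_id <- grepl(pattern, names(df)) & vapply(df, is.numeric, logical(1))
  if (any(is_id)) {
    df[is_id] <- lapply(df[is_id], factor)
  }
  df
}

# A few columns from the question's sample rows
data <- data.frame(
  ACC_ID      = c(11316456, 11456476, 13453245, 15674568),
  PARENT_ID   = c(2122628569, 2385888136, 2122628087, 2702074521),
  INDUSTRY_ID = c(7379L, 9199L, 3559L, 7379L),
  REV         = c(6500, 466.5, 2500, 25.91),
  QTY         = c(1L, 31L, 500L, 1L)
)

converted <- ids_to_factor(data)
str(converted)
```

Because the pattern is an argument, the same function can be reused on other datasets that follow a different naming convention.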

What is the difference between dataset[,'column'] and dataset$column in R?

If I want to list all rows of a column in a dataset in R, I am able to do it in these two ways:
> dataset[,'column']
> dataset$column
It appears that both give me the same result. What is the difference?
In practice, not much, as long as dataset is a data frame. The main difference is that the dataset[, "column"] formulation accepts variable arguments, like j <- "column"; dataset[, j] while dataset$j would instead return the column named j, which is not what you want.
dataset$column is list syntax and dataset[ , "column"] is matrix syntax. Data frames are really lists, where each list element is a column and every element has the same length. This is why length(dataset) returns the number of columns. Because they are "rectangular," we are able to treat them like matrices, and R kindly allows us to use matrix syntax on data frames.
Note that, for lists, list$item and list[["item"]] are almost synonymous. Again, the biggest difference is that the latter form evaluates its argument, whereas the former does not. This is true even in the form `$`(list, item), which is exactly equivalent to list$item. In Hadley Wickham's terminology, $ uses "non-standard evaluation."
Also, as mentioned in the comments, $ always uses partial name matching, [[ does not by default (but has the option to use partial matching), and [ does not allow it at all.
I recently answered a similar question with some additional details that might interest you.
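The points about evaluation and partial matching can be checked on a toy data frame. The column names below are made up for illustration.

```r
dataset <- data.frame(column = 1:3, other = c("a", "b", "c"))
j <- "column"

by_matrix <- dataset[, j]   # j is evaluated: returns the column named "column"
by_list   <- dataset[[j]]   # same result with list syntax
by_dollar <- dataset$j      # j is NOT evaluated: looks for a column named "j"

partial_dollar <- dataset$col                      # $ falls back to partial matching
exact_default  <- dataset[["col"]]                 # [[ matches exactly by default
partial_opt_in <- dataset[["col", exact = FALSE]]  # partial matching on request

is.null(by_dollar)      # TRUE: there is no column called "j"
is.null(exact_default)  # TRUE: "col" is not an exact column name
```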
Use 'str' command to see the difference:
> mydf
user_id Gender Age
1 1 F 13
2 2 M 17
3 3 F 13
4 4 F 12
5 5 F 14
6 6 M 16
>
> str(mydf)
'data.frame': 6 obs. of 3 variables:
$ user_id: int 1 2 3 4 5 6
$ Gender : Factor w/ 2 levels "F","M": 1 2 1 1 1 2
$ Age : int 13 17 13 12 14 16
>
> str(mydf[1])
'data.frame': 6 obs. of 1 variable:
$ user_id: int 1 2 3 4 5 6
>
> str(mydf[,1])
int [1:6] 1 2 3 4 5 6
>
> str(mydf[,'user_id'])
int [1:6] 1 2 3 4 5 6
> str(mydf$user_id)
int [1:6] 1 2 3 4 5 6
>
> str(mydf[[1]])
int [1:6] 1 2 3 4 5 6
>
> str(mydf[['user_id']])
int [1:6] 1 2 3 4 5 6
mydf[1] is a data frame while mydf[,1] , mydf[,'user_id'], mydf$user_id, mydf[[1]], mydf[['user_id']] are vectors.
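If you want matrix syntax but need a data frame back, drop = FALSE keeps the result from being simplified to a vector. A short sketch that rebuilds mydf from the output above:

```r
mydf <- data.frame(
  user_id = 1:6,
  Gender  = factor(c("F", "M", "F", "F", "F", "M")),
  Age     = c(13L, 17L, 13L, 12L, 14L, 16L)
)

is.data.frame(mydf[1])                  # list-style [ keeps the data frame
is.data.frame(mydf[, 1])                # matrix-style [ simplifies to a vector
is.data.frame(mydf[, 1, drop = FALSE])  # unless drop = FALSE is given
```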

Reading a csv file using ffdf and subsetting it successfully

I have been researching a way to efficiently extract information from large csv data sets using R. Many seem to recommend the package ff. I was successful in reading the data sets but am now running into problems trying to subset them.
The largest data set contains over 650,000 rows and 1005 columns. Not all columns contain the same data types. Viewed as a dataframe, the structure would look like this:
'data.frame': 5 obs. of 1005 variables:
$ SAMPLING_EVENT_ID : Factor w/ 5 levels "S6230404","S6252242",..: 2 1 3 4 5
$ LATITUDE : num 24.4 24.5 24.5 24.5 24.5
$ LONGITUDE : num -81.9 -81.9 -82 -82 -82
$ YEAR : int 2010 2010 2010 2010 2010
$ MONTH : int 4 3 10 10 10
$ DAY : int 97 88 299 298 300
$ TIME : num 9 10 10 11.58 9.58
$ COUNTRY : Factor w/ 1 level "United_States": 1 1 1 1 1
$ STATE_PROVINCE : Factor w/ 1 level "Florida": 1 1 1 1 1
$ COUNT_TYPE : Factor w/ 2 levels "P21","P22": 2 2 1 1 1
$ EFFORT_HRS : num 6 2 7 6 3.5
$ EFFORT_DISTANCE_KM : num 48.28 8.05 0 0 0
$ EFFORT_AREA_HA : int 0 0 0 0 0
$ OBSERVER_ID : Factor w/ 3 levels "obs132426","obs58643",..: 3 2 1 1 1
$ NUMBER_OBSERVERS : Factor w/ 2 levels "?","1": 2 1 2 2 2
$ Zenaida_macroura : int 0 0 1 0 0
All other variables being similar to this last one i.e. various species of bird.
Here is the code I used to "successfully" read the csv:
B2010 <- read.table.ffdf(x = NULL, "filePath&Name", nrows = -1, first.rows = 50000, next.rows = 50000)
Trying to learn about the ffdf output, I entered commands such as dim(B2010), str(B2010), ls(B2010), etc. dim(B2010) returned the appropriate number of rows but only one column (one string per record, with the values separated by commas), and ls(B2010) returned [1] "physical" "row.names" "virtual" instead of the usual list of variables.
I am not sure how to handle this type of output to extract, say, STATE_PROVINCE == "California". How do I tell B2010 what the variables are? I think I need to look at this differently but need some of your help to figure it out.
The ultimate goal for me is to subset a bunch of csv data sets (I have one per year) and put the results back together as a data frame for various analyses.
Thanks,
Joe
To subset an ffdf, use the ffbase package.
As in
require(ffbase)
x <- subset(B2010, B2010$STATE_PROVINCE == "California")
I finally found the solution to getting the ffdf variable names and types properly read and accessible for subsetting:
B2010 <- read.csv.ffdf (file = "filepath/name", colClasses = c("factor", "numeric", "numeric", "integer", "integer", "integer", "numeric", rep("factor",998)), first.rows = 10000, next.rows = 50000, nrows = -1)
This took forever to read but seemed to have worked i.e. I was able to create a subset of the data. Next step: to save the subset back to a "normal" dataframe and/or to a csv.
According to the help page at ?read.table.ffdf, you should be using read.csv.ffdf(...). Then go to the page cited by Brandon.
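The single-column result in the question is what you would expect if the file is parsed with read.table's defaults, which split on whitespace and do not treat the first line as a header. As far as I understand the help page, read.table.ffdf passes each chunk to read.table unless told otherwise. The sketch below reproduces the effect with base R only, on a tiny made-up file, so it shows the parsing difference and not ff itself.

```r
# Write a small comma-separated file to a temporary location
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("STATE_PROVINCE,YEAR,MONTH",
    "Florida,2010,4",
    "California,2010,3"),
  csv_path
)

# read.table defaults: whitespace separator, no header -> one column of strings
as_table <- read.table(csv_path)
dim(as_table)

# read.csv defaults: comma separator, header row -> proper columns
as_csv <- read.csv(csv_path)
dim(as_csv)

# With real columns, subsetting by name works
california <- subset(as_csv, STATE_PROVINCE == "California")
california
```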

Resources