Character variables default to factor in R [duplicate]

I've recently updated to R 4.0.0 from R 3.5.1. The behaviour of read.csv seems to have changed: when I load .csv files in R 4.0.0, factors are not automatically detected and are instead read in as characters. I'm also still running 3.5.1 on my machine, and when I load the same files in 3.5.1 using the same code, factors are recognised as factors. This is somewhat suboptimal.
Any suggestions?
I'm running Windows 10 Pro and create .csv files in Excel 2013.

As Ronak Shah said in a comment on your question, R 4.0.0 changed the default behaviour of how read.table() (and therefore its wrappers, including read.csv()) treats character vectors. There has been a long debate over the issue, but essentially stringsAsFactors = TRUE had been the default since the inception of R because it helped to save memory, owing to the way factor variables are implemented in R (they are essentially integer vectors with factor-level information added on top). There is less reason to do that nowadays, since memory is far more abundant, and the old default often produced unintended side effects.
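If you want the old behaviour for a particular import, you can ask for it explicitly in the call instead of relying on the default. A minimal sketch (the file and column names are hypothetical):
# Request the pre-4.0.0 behaviour for this one call; read.csv() has always accepted this argument
dat <- read.csv("survey.csv", stringsAsFactors = TRUE)
str(dat)  # character columns now show up as factors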
You can read more about your particular issue, and about other peculiarities of vectors in R, in Chapter 3 of Advanced R by Hadley Wickham. There he cites two articles that go into great detail on why the default behaviour was the way it was.
Here is one and here is another. I would also suggest checking out Hadley's book if you already have some experience with R; it helped me a great deal in learning some of the less obvious features of the language.

As everyone here has said, the default behaviour changed in R 4.0.0 and strings are no longer automatically converted to factors. This affects various functions, including read.csv() and data.frame(). However, some functions that are explicitly made to work with factors are not affected; these include expand.grid() and as.data.frame.table().
One way you can bypass this change is by setting a global option:
options(stringsAsFactors = TRUE)
But this option is also deprecated, and eventually you will have to convert strings to factors manually.
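For the manual route, one common pattern is to convert every character column of the imported data frame to a factor; a small sketch (the file name is just a placeholder):
dat <- read.csv("survey.csv")                        # characters stay characters in R >= 4.0.0
chr_cols <- vapply(dat, is.character, logical(1))    # flag the character columns
dat[chr_cols] <- lapply(dat[chr_cols], factor)       # convert only those columns
str(dat)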
The main reason for the decision seems to be reproducibility. Automatic string-to-factor conversion produces factor levels, and those levels can depend on the locale used by the system. Hence, if you are in Russia and share a script that relies on automatically converted factors with a friend in Japan, they might end up with a different order of factor levels.
You can read more about this in the stringsAsFactors post by Kurt Hornik on The R Blog.
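If the order of levels matters for your analysis, the locale problem disappears once you spell the levels out yourself; a minimal sketch with made-up values:
# Explicit levels make the order independent of the system locale
sizes <- c("small", "large", "medium", "small")
sizes <- factor(sizes, levels = c("small", "medium", "large"))
levels(sizes)   # always "small" "medium" "large", on any machine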

Related

R stopped converting chr to factor variables [duplicate]


R version 4.0.1 does not analyse my data properly [duplicate]


What is the practical difference between read_csv and read.csv? When should one be used over the other?

I often work with comma-separated values and was curious about the differences between read_csv() and read.csv().
Are there any practical differences that could shed light on when to use each one?
Quoted from section 11.2.1 of R for Data Science:
11.2.1 Compared to base R
If you’ve used R before, you might wonder why we’re not using read.csv(). There are a few good reasons to favour readr functions over the base equivalents:
They are typically much faster (~10x) than their base equivalents. Long running jobs have a progress bar, so you can see what’s happening. If you’re looking for raw speed, try data.table::fread(). It doesn’t fit quite so well into the tidyverse, but it can be quite a bit faster.
They produce tibbles, they don’t convert character vectors to factors*, use row names, or munge the column names. These are common sources of frustration with the base R functions.
They are more reproducible. Base R functions inherit some behaviour from your operating system and environment variables, so import code that works on your computer might not work on someone else’s.
*Note that from R 4.0.0, R [...] uses a stringsAsFactors = FALSE default, and hence by default no longer converts strings to factors in calls to data.frame() and read.table().
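A quick way to see these differences for yourself is to read the same file both ways; a minimal sketch (the file name is hypothetical and readr must be installed):
library(readr)
base_df <- read.csv("sales.csv")   # base R: a plain data.frame, names munged by make.names()
tidy_df <- read_csv("sales.csv")   # readr: a tibble, reports the guessed column types
class(base_df)                     # "data.frame"
class(tidy_df)                     # "spec_tbl_df" "tbl_df" "tbl" "data.frame" (depends on the readr version)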
read_csv() reads comma delimited numbers. It reads 1,000 as 1000.
[Screenshots in the original answer compare the original numbers with the values as read by read_csv and by read.csv.]
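Related to that point, readr also exposes its number parser directly, which is handy when a column of formatted numbers has been read in as character; a small sketch:
library(readr)
parse_number("1,000")                 # 1000 (the default locale treats "," as a grouping mark)
parse_number(c("$1,234.50", "7%"))    # 1234.5 and 7: currency symbols and percent signs are dropped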

XTS size limitation

I have been working with large datasets lately (more than 400 thousand lines). So far, I had been using the XTS format, which worked fine for "small" datasets of a few tens of thousands of elements.
Now that the project has grown, R simply crashes when retrieving the data from the database and putting it into the XTS object.
It is my understanding that R should be able to handle vectors with up to 2^32-1 elements (or 2^64-1, depending on the version). Hence, I came to the conclusion that XTS might have some limitations, but I could not find the answer in the docs (maybe I was a bit overconfident about my understanding of the theoretically possible vector size).
To sum up, I would like to know:
whether XTS indeed has a size limitation;
what you think is the smartest way to handle large time series (I was thinking about splitting the analysis into several smaller datasets).
I don't get an error message; R simply shuts down automatically. Is this a known behaviour?
SOLUTION
The limit is the same as R's, and it depends on the kind of build being used (64-bit or 32-bit). It is in any case extremely large.
Chunking the data is indeed a good idea, but it turned out not to be needed here.
This problem came from a bug in R 2.11.0 which was fixed in R 2.11.1. There was a problem with long date vectors (here, the indexes of the XTS).
Regarding your two questions, my $0.02:
Yes, there is a limit of 2^31-1 elements for R vectors. This comes from the indexing logic, which reportedly sits 'deep down' enough in R that it is unlikely to be replaced soon (as it would affect so much existing code). Google the r-devel list for details; this has come up before. The xts package does not impose an additional restriction.
Yes, splitting things into manageable chunks is the smartest approach. I used to do that on large data sets when I was working exclusively with 32-bit versions of R. I now use 64-bit R and no longer have this issue (and/or keep my data sets sane).
There are some 'out-of-memory' approaches, but I'd first try to rethink the problem and affirm that you really need all 400k rows at once.
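If you do end up chunking, xts makes it fairly easy to split a long series by calendar period and work on one piece at a time; a minimal sketch with simulated data:
library(xts)
# Simulate a long daily series as a stand-in for data pulled from a database
idx <- seq(as.Date("2000-01-01"), by = "day", length.out = 5000)
x <- xts(rnorm(length(idx)), order.by = idx)
# Split into yearly chunks and summarise each chunk separately
chunks <- split(x, f = "years")    # a list of smaller xts objects
results <- sapply(chunks, mean)    # e.g. a per-year summary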

What is your preferred style for naming variables in R? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 6 years ago.
Which conventions for naming variables and functions do you favor in R code?
As far as I can tell, there are several different conventions, all of which coexist in cacophonous harmony:
1. Use of period separator, e.g.
stock.prices <- c(12.01, 10.12)
col.names <- c('symbol','price')
Pros: Has historical precedent in the R community, is prevalent throughout R core, and is recommended by Google's R Style Guide.
Cons: Rife with object-oriented connotations, and confusing to R newbies
2. Use of underscores
stock_prices <- c(12.01, 10.12)
col_names <- c('symbol','price')
Pros: A common convention in many programming langs; favored by Hadley Wickham's Style Guide, and used in ggplot2 and plyr packages.
Cons: Not historically used by R programmers; is annoyingly mapped to '<-' operator in Emacs-Speaks-Statistics (alterable with 'ess-toggle-underscore').
3. Use of mixed capitalization (camelCase)
stockPrices <- c(12.01, 10.12)
colNames <- c('symbol','price')
Pros: Appears to have wide adoption in several language communities.
Cons: Has recent precedent, but not historically used (in either R base or its documentation).
Finally, as if it weren't confusing enough, I ought to point out that the Google Style Guide argues for dot notation for variables, but mixed capitalization for functions.
The lack of consistent style across R packages is problematic on several levels. From a developer standpoint, it makes maintaining and extending others' code difficult (especially where its style is inconsistent with your own). From an R user's standpoint, the inconsistent syntax steepens R's learning curve by multiplying the ways a concept might be expressed (e.g. is that date-casting function asDate(), as.date(), or as_date()? No, it's as.Date()).
Good previous answers so just a little to add here:
underscores are really annoying for ESS users; given that ESS is pretty widely used, you won't see many underscores in code authored by ESS users (and that set includes a bunch of R Core as well as CRAN authors, exceptions like Hadley notwithstanding);
dots are evil too because they can get mixed up in simple method dispatch; I believe I once read comments to this effect on one of the R lists: dots are a historical artifact and no longer encouraged;
so we have a clear winner still standing in the last round: camelCase. I am also not sure I really agree with the assertion of 'lacking precedent in the R community'.
And yes: pragmatism and consistency trump dogma. So whatever works and is used by colleagues and co-authors. After all, we still have white-space and braces to argue about :)
I did a survey of the naming conventions actually used on CRAN, which got accepted to the R Journal :) Here is a graph summarizing the results:
It turns out (no surprise, perhaps) that lowerCamelCase was most often used for function names and period.separated names most often used for parameters. Using UpperCamelCase, as advocated by Google's R style guide, is really rare, however, and it is a bit strange that they advocate that naming convention.
The full paper is here:
http://journal.r-project.org/archive/2012-2/RJournal_2012-2_Baaaath.pdf
Underscores all the way! Contrary to popular opinion, there are a number of functions in base R that use underscores. Run grep("^[^\\.]*$", apropos("_"), value = T) to see them all.
I use the official Hadley style of coding ;)
I like camelCase when the camel actually provides something meaningful -- like the datatype.
dfProfitLoss, where df = dataframe
or
vdfMergedFiles(), where the function takes in a vector and spits out a dataframe
While I think _ really adds to readability, there just seem to be too many issues with using '.', '-', '_', or other such characters in names, especially if you work across several languages.
This comes down to personal preference, but I follow the google style guide because it's consistent with the style of the core team. I have yet to see an underscore in a variable in base R.
As I point out here:
How does the verbosity of identifiers affect the performance of a programmer?
it's worth bearing in mind how understandable your variable names are to your co-workers/users if they are non-native speakers...
For that reason I'd say underscores and periods are better than capitalisation, but as you point out consistency is essential within your script.
As others have mentioned, underscores will screw up a lot of folks. No, it's not verboten but it isn't particularly common either.
Using dots as a separator gets a little hairy with S3 classes and the like.
In my experience, it seems like a lot of the high muckity mucks of R prefer the use of camelCase, with some dot usage and a smattering of underscores.
I have a preference for mixedCapitals.
But I often use periods to indicate what the variable type is:
mixedCapitals.mat is a matrix.
mixedCapitals.lm is a linear model.
mixedCapitals.lst is a list object.
and so on.
Usually I name my variables using a mix of underscores and mixed capitalization (camelCase). Simple variables are named using underscores, for example:
PSOE_votes -> number of votes for the PSOE (political group of Spain).
PSOE_states -> categorical, indicates the states where the PSOE wins {Aragon, Andalucia, ...}.
PSOE_political_force -> categorical, indicates the PSOE's position among political groups {first, second, third}.
PSOE_07 -> union of PSOE_votes + PSOE_states + PSOE_political_force in 2007 (header -> votes, states, position).
If a variable is the result of applying a function to one or two variables, I use mixed capitalization.
Example:
positionXstates <- xtabs(~states+position, PSOE_07)
