I'm new to R and am having trouble understanding how to handle local and global environments. I checked the post on local and global variables, but couldn't figure it out.
If, for example, I would like to make several plots using a function and save them like this:
PlottingFunction <- function(type) {
type <<- mydata %>%
filter(typeVariable==type) %>%
qplot(a,b)
}
lapply(ListOfTypes, PlottingFunction)
This didn't yield the desired result. I tried using the assign() function, but couldn't get it to work either.
I want to save the graphs in the global environment so I can combine them using gridExtra. This might not be the best way to do that, but I think it might be useful to understand this issue nevertheless.
You don't need to assign the plot to a global variable. All plots can be saved in one list.
For this example, I use the iris data set.
library(gridExtra)
library(ggplot2)
library(dplyr)
str(iris)
# 'data.frame': 150 obs. of 5 variables:
# $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
# $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
# $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
# $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
# $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
The modified function without assignment:
PlottingFunction <- function(type) {
iris %>%
filter(Species == type) %>%
qplot(Sepal.Length, Sepal.Width, data = .)
}
One figure per Species is created
species <- unique(iris$Species)
# [1] setosa versicolor virginica
# Levels: setosa versicolor virginica
l <- lapply(species, PlottingFunction)
Now, the function do.call can be used to call grid.arrange with the plot objects in the list l.
do.call(grid.arrange, l)
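To connect this back to the question: inside the function, `type <<- ...` creates a single global variable literally called `type`, which is overwritten on every call, rather than one variable per value of `type`. Below is a minimal sketch of the list-based alternative. It uses plain data frames in place of plots so that it runs with base R only; the helper name `subset_by_species` and the `plot_` prefix are made up for illustration.

```r
# Hypothetical helper: returns one object per species (a data frame here,
# standing in for the plot object returned by PlottingFunction).
subset_by_species <- function(type) {
  iris[iris$Species == type, c("Sepal.Length", "Sepal.Width")]
}

species <- levels(iris$Species)

# Collect the results in a list and name each element after its species.
results <- setNames(lapply(species, subset_by_species), species)

names(results)
results[["setosa"]][1:3, ]   # access one element by name

# If separate variables are really wanted, assign() builds the name from a string.
# A scratch environment is used here instead of the global one.
env <- new.env()
for (s in species) {
  assign(paste0("plot_", s), results[[s]], envir = env)
}
ls(env)
```

A named list is usually easier to pass on (for example to `do.call`) than a set of separately named variables.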
I'm trying to use leveneTest from the "car" library in R with the iris dataset.
The code I have is:
library(tidyverse)
library(car)
iris %>% group_by (Species) %>% leveneTest( Sepal.Length )
From there I'm getting the following error:
Error in leveneTest.default(., Sepal.Length) :
. is not a numeric variable
I don't know how to fix this, since the data types seem to be of the right type:
> str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
For Levene's test, you need to specify a grouping factor, for example:
leveneTest(Sepal.Length ~ Species,data=iris)
Levene's Test for Homogeneity of Variance (center = median)
       Df F value   Pr(>F)
group   2  6.3527 0.002259 **
      147
This tests whether the variances are homogeneous across groups. It doesn't quite make sense to group the data first and then run leveneTest within each group. If you intend to do something else, please elaborate or comment.
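For intuition, the median-centred Levene test amounts to a one-way ANOVA on the absolute deviations of each observation from its group median. A minimal base-R sketch under that assumption follows (the variable names `group`, `med`, `dev` and `tab` are mine); it should reproduce the F value in the leveneTest output above.

```r
group <- iris$Species

# Group medians, repeated so that each row gets the median of its own species
med <- ave(iris$Sepal.Length, group, FUN = median)

# Absolute deviation of each observation from its group median
dev <- abs(iris$Sepal.Length - med)

# One-way ANOVA of the deviations on the grouping factor
tab <- anova(lm(dev ~ group))
tab[, c("Df", "F value", "Pr(>F)")]
```

This also shows why the piped call fails: leveneTest needs a numeric response and a grouping factor, whereas the pipe hands it the whole grouped data frame as the response.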
Try it this way:
with(iris, leveneTest(Sepal.Length, Species))
Or maybe you are looking for a solution like this (map comes from purrr):
map(iris[, 1:4], ~ leveneTest(.x, iris$Species))
Are there any packages in R that can generate a random dataset given a pre-existing template dataset?
For example, let's say I have the iris dataset:
data(iris)
> head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
I want some function random_df(iris) which will generate a data frame with the same columns as iris but with random data (preferably random data that preserves certain statistical properties of the original, e.g., the mean and standard deviation of the numeric variables).
What is the easiest way to do this?
[Comment from question author moved here. --Editor's note]
I don't want to sample random rows from an existing dataset. I want to generate actually random data with all the same columns (and types) as an existing dataset. Ideally, if there is some way to preserve statistical properties of the data for numeric variables, that would be preferable, but it's not needed.
How about this for a start:
Define a function that simulates data from df by
drawing samples from a normal distribution for numeric columns in df, with the same mean and sd as in the original data column, and
uniformly drawing samples from the levels of factor columns.
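generate_data <- function(df, nrow = 10) {
  as.data.frame(lapply(df, function(x) {
    if (is.numeric(x)) {
      # normal draws with the same mean and sd as the original column
      rnorm(nrow, mean = mean(x), sd = sd(x))
    } else if (is.factor(x)) {
      # uniform draws from the levels; factor() keeps the column a factor in R >= 4.0
      factor(sample(levels(x), nrow, replace = TRUE), levels = levels(x))
    }
  }))
}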
Then for example, if we take iris, we get
set.seed(2019)
df <- generate_data(iris)
str(df)
#'data.frame': 10 obs. of 5 variables:
# $ Sepal.Length: num 6.45 5.42 4.49 6.6 4.79 ...
# $ Sepal.Width : num 2.95 3.76 2.57 3.16 3.2 ...
# $ Petal.Length: num 4.26 5.47 5.29 6.19 2.33 ...
# $ Petal.Width : num 0.487 1.68 1.779 0.809 1.963 ...
# $ Species : Factor w/ 3 levels "setosa","versicolor",..: 3 2 1 2 3 2 1 1 2 3
It should be fairly straightforward to extend the generate_data function to account for other column types.
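As a rough sketch of such an extension, the version below adds branches for integer, logical and character columns. The name `generate_data_ext`, the `template` data frame and the sampling rule chosen for each type are illustrative assumptions, not the only reasonable choices.

```r
# Hypothetical extension of generate_data(); each sampling rule is an assumption.
generate_data_ext <- function(df, nrow = 10) {
  cols <- lapply(df, function(x) {
    if (is.factor(x)) {
      # uniform draw from the levels, keeping the full level set
      factor(sample(levels(x), nrow, replace = TRUE), levels = levels(x))
    } else if (is.integer(x)) {
      # rounded normal draws with the column's mean and sd
      as.integer(round(rnorm(nrow, mean(x, na.rm = TRUE), sd(x, na.rm = TRUE))))
    } else if (is.numeric(x)) {
      rnorm(nrow, mean(x, na.rm = TRUE), sd(x, na.rm = TRUE))
    } else if (is.logical(x)) {
      # TRUE with the same probability as observed in the column
      runif(nrow) < mean(x, na.rm = TRUE)
    } else if (is.character(x)) {
      # uniform draw from the distinct observed values
      sample(unique(x), nrow, replace = TRUE)
    } else {
      rep(NA, nrow)   # unsupported type: placeholder column
    }
  })
  as.data.frame(cols, stringsAsFactors = FALSE)
}

# Small made-up template with one column of each supported type
template <- data.frame(
  id    = 1:6,
  score = c(2.5, 3.1, 4.8, 3.9, 2.2, 4.0),
  flag  = c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE),
  label = c("a", "b", "a", "c", "b", "a"),
  group = factor(c("x", "y", "x", "y", "x", "y")),
  stringsAsFactors = FALSE
)

set.seed(1)
sim <- generate_data_ext(template, nrow = 4)
str(sim)
```

Note that each column is simulated independently, so correlations between columns in the template are not preserved.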
I want to remove parts from a list to reduce the list to the elements of it that have a certain number of columns.
This is a dummy example of what I'm trying to do:
#1: define the list
tables = list(mtcars,iris)
for(k in 1:length(tables)) {
# 2: be sure that each element is shaped as dataframe and not matrix
tables[[k]] = as.data.frame(tables[[k]])
# 3: remove elements that have more or less than 5 columns
if(ncol(tables[[k]]) != 5) {
tables <- tables[-k]
}
}
Another option I tried:
#1: define the list
tables = list(mtcars,iris)
for(k in 1:length(tables)) {
# 2: be sure that each element is shaped as dataframe
tables[[k]] = as.data.frame(tables[[k]])
# 3: remove elements that have more or less than 5 columns
if(ncol(tables[[k]]) != 5) {
tables[[-k]] <- NULL
}
}
I'm getting
Error in tables[[k]] : subscript out of bounds.
Is there an alternative and correct approach?
We can use Filter
Filter(function(x) ncol(x)==5, tables)
Or with sapply to create a logical index and subset the list
tables[sapply(tables, ncol)==5]
Or as @Sotos commented
tables[lengths(tables)==5]
lengths returns the length of each list element; comparing that with 5 gives a logical vector, which is then used to subset the list. The length of a data.frame is the number of columns it has.
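A quick check of what these calls see, plus a note on why the loops in the question fail: `1:length(tables)` is fixed at `1:2` when the loop starts, so once an element is removed at `k = 1` the list has a single element left and `tables[[2]]` is out of bounds. Filtering in one step avoids modifying the list while iterating over it. The names `keep_idx` and `filtered` are only for illustration.

```r
tables <- list(mtcars, iris)

lengths(tables)                     # number of columns per data frame: 11 5
sapply(tables, ncol)                # same numbers for data frames

keep_idx <- lengths(tables) == 5    # logical index: FALSE TRUE
filtered <- tables[keep_idx]

length(filtered)                    # 1
identical(filtered[[1]], iris)      # TRUE
```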
For a tidyverse option, you can use purrr::keep. You define a predicate function: if it returns true, the list element is kept; if false, it is removed. Here I've done that with the formula shorthand.
library(purrr)
tables <- list(mtcars, iris)
result <- purrr::keep(tables, ~ ncol(.x) == 5)
str(result)
#> List of 1
#> $ :'data.frame': 150 obs. of 5 variables:
#> ..$ Sepal.Length: num [1:150] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#> ..$ Sepal.Width : num [1:150] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#> ..$ Petal.Length: num [1:150] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#> ..$ Petal.Width : num [1:150] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#> ..$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
I'm working with a large dataset and want to get some basic information about my variables, first of all whether they are numeric or factor/ordinal.
I'm writing a function and want to check, one variable at a time, whether it is numeric or a factor.
To make the for loop work I'm using dataset[i] to get to the variable I want.
object<-function(dataset){
n=ncol(dataset)
for(i in 1:n){
variable_name<-names(dataset[i])
factor<-is.factor(dataset[i])
ordered<-is.ordered(dataset[i])
numeric<-is.numeric(dataset[i])
print(list(variable_name,factor,ordered,numeric))
}
}
My problem is that is.numeric() does not seem to work with dataset[i]: all the results come out as "FALSE". It only works with dataset$.
Do you have any idea how to solve this?
Try str(dataset) to get summary information on an object, but to solve your problem you need to completely extract your data with double square brackets. Single square bracket subsetting keeps the output as a sub-list (or data.frame) rather than extracting the contents:
str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
is.numeric(iris[1])
[1] FALSE
class(iris[1])
[1] "data.frame"
is.numeric(iris[[1]])
[1] TRUE
Assuming that dataset is something like a data.frame, you can do the following (and avoid the loop):
var_names = names(dataset) # or equivalently `colnames(dataset)`
types = sapply(dataset, class)
Then types gives you either numeric or factor. You can then simply do something like this:
is_factor = types == 'factor'
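Putting the two answers together, here is a loop-free sketch of the summary the question is after. The function name `describe_columns` and the output column names are assumptions for illustration. `sapply` passes each column to the test function as a vector, which is equivalent to extracting it with `[[`.

```r
describe_columns <- function(dataset) {
  data.frame(
    variable   = names(dataset),
    is_factor  = sapply(dataset, is.factor),
    is_ordered = sapply(dataset, is.ordered),
    is_numeric = sapply(dataset, is.numeric),
    row.names  = NULL,
    stringsAsFactors = FALSE
  )
}

res <- describe_columns(iris)
res
```

Using `is.factor` and `is.numeric` directly, rather than comparing `class()` to a string, also handles columns with more than one class, such as ordered factors.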
I am a newbie to R. After I ran a linear regression with categorical variable "sale year"
ols <- lm(logprice ~ x + factor(city) + factor(sale_year))
I would like to create a new variable, which tells me for each observation the regression coefficient of factor(sale_year) on the sale_year of that observation.
sale_year new variable
1980 coef(ols)["factor(sale_year)1980"]
1973 coef(ols)["factor(sale_year)1973"]
1990 coef(ols)["factor(sale_year)1990"]
1990 coef(ols)["factor(sale_year)1990"]
1973 coef(ols)["factor(sale_year)1973"]
...
If there are no other factor variables, then I can simply set all variables to zero except for sale_year, and use predict.lm to get the coefficients. But given multiple factor variables, it's messier, and I cannot get it right in R.
In Stata, I can do this:
xi: reg logprice x i.city i.sale_year
gen newvar = .
levelsof sale_year, local(saleyr)
foreach lv of local saleyr {
replace newvar = _b[_Isaleyr`lv'] if sale_year == `lv'
}
How can I do this in R? Thanks!
Since you didn't supply the sample data, I will use iris data from R:
mydata<-iris
mydata$Petal.Width<-as.factor(mydata$Petal.Width)
str(mydata)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : Factor w/ 22 levels "0.1","0.2","0.3",..: 2 2 2 2 2 4 3 2 2 1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
myreg<-lm(Sepal.Length~Sepal.Width+Petal.Width+Species,data=mydata)
k<-length(levels(mydata$Petal.Width))
mycoef<-coef(myreg)[3:(k+1)]
mycoef<-data.frame(mycoef)
> head(mycoef)
mycoef
Petal.Width0.2 0.13981323
Petal.Width0.3 0.17193663
Petal.Width0.4 0.20220902
Petal.Width0.5 0.31915175
Petal.Width0.6 0.08864592
mycoef$var<-rownames(mycoef)
rownames(mycoef)<-1:dim(mycoef)[1]
mycoef[,c("var","mycoef")]
var mycoef
1 Petal.Width0.2 0.13981323
2 Petal.Width0.3 0.17193663
3 Petal.Width0.4 0.20220902
4 Petal.Width0.5 0.31915175
Update:
mycoef$var1<-substring(mycoef$var,12,15)
myout<-merge(mydata,mycoef,by.x="Petal.Width",by.y="var1")
> head(myout)
Petal.Width Sepal.Length Sepal.Width Petal.Length Species var mycoef
1 0.2 4.9 3.0 1.4 setosa Petal.Width0.2 0.1398132
2 0.2 4.7 3.2 1.3 setosa Petal.Width0.2 0.1398132
3 0.2 4.6 3.1 1.5 setosa Petal.Width0.2 0.1398132
4 0.2 5.0 3.6 1.4 setosa Petal.Width0.2 0.1398132
5 0.2 5.1 3.5 1.4 setosa Petal.Width0.2 0.1398132
6 0.2 5.4 3.7 1.5 setosa Petal.Width0.2 0.1398132
You will still need to use predict.lm to get the baseline value for the first level of the factor, since there will be no coefficient for that level (or rather it will be 0). All the other coefficients are really offsets to that value (assuming that the result of predict is what you are expecting), so something like:
faclev1 <- predict(ols, list(x=mean(x), city=levels(city)[1], sale_year =levels(sale_year)[1]))
otherlevs <- faclev1 + coef(ols)[grep("sale_year", names(coef(ols) ) )]
For a vector of coefficients matching individual cases:
fac_coef <- c(0, coef(ols)[grep("sale_year", names(coef(ols) ) )])
fac_coef[ as.numeric(factor(sale_year)) ]
This works because the order of levels is the same as the order in which the coefficients get displayed and the numeric value is what determines how levels ordinarily get displayed.
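Here is a self-contained sketch of the same idea on a built-in data set, since the original data are not available. It uses `mtcars`, with `cyl` standing in for `sale_year` and `gear` for `city` (both substitutions are mine), and looks the coefficients up by name instead of by position, so it does not rely on the level order.

```r
fit <- lm(mpg ~ wt + factor(gear) + factor(cyl), data = mtcars)
cf  <- coef(fit)

lv <- levels(factor(mtcars$cyl))    # "4" "6" "8"

# The baseline level has no coefficient, so it gets 0;
# the remaining levels are looked up by coefficient name.
fac_coef <- c(0, cf[paste0("factor(cyl)", lv[-1])])
names(fac_coef) <- lv

# One coefficient per observation, matched on the observation's own level
dat <- mtcars
dat$cyl_coef <- unname(fac_coef[as.character(dat$cyl)])

head(dat[, c("cyl", "cyl_coef")])
```

Indexing by `as.character(dat$cyl)` matches on level names, which stays correct even if the factor has unused levels or the coefficient vector is reordered.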