I am exploring the iris data set in R and I would like some clarification on the following two lines of code:
cluster_iris<-kmeans(iris[,1:4], centers=3)
iris$ClusterM <- as.factor(cluster_iris$cluster)
I think the first one is performing a k-means cluster analysis using all the cases of the data file and only the first 4 columns with a choice of 3 clusters.
However I'm not sure what the second piece of code is doing? Is the first one just stating the preferences for the analysis and the second one actually executing it (i.e. performing the k-means)?
Any help is appreciated
The first line runs the cluster analysis and stores the result in cluster_iris. The cluster labels are in the component cluster_iris$cluster, which is just a vector of integers.
The second line attaches that cluster number to the rows of the original data set as a categorical label. So now your iris data has all the petal and sepal measurements plus a cluster index in a column called "ClusterM".
> head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species ClusterM
1 5.1 3.5 1.4 0.2 setosa 1
2 4.9 3.0 1.4 0.2 setosa 3
3 4.7 3.2 1.3 0.2 setosa 3
4 4.6 3.1 1.5 0.2 setosa 3
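To see this for yourself, here is a small sketch. The seed value and the copy name iris_km are my own choices, not from the question. kmeans() starts from random centers, so without set.seed() the cluster numbers can change between runs.

```r
set.seed(42)      # assumed seed, only to make the run repeatable
iris_km <- iris   # work on a copy so the built-in data stays untouched

# Line 1: this call already runs the k-means analysis
cluster_iris <- kmeans(iris_km[, 1:4], centers = 3)

# The result is a list; $cluster holds one integer label per row
str(cluster_iris$cluster)

# Line 2: only attaches those labels to the data as a factor column
iris_km$ClusterM <- as.factor(cluster_iris$cluster)

# Cross-tabulate the clusters against the known species
table(iris_km$Species, iris_km$ClusterM)
```

So the first line does all the work, and the second one is just bookkeeping.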
Is there a way to extract the mean and p-value from a t.test output and create a table that includes the features, mean, and p-value? Say there are 10 columns put through t.test, and that means there are 10 means, and 10 p-values. How would I be able to create a table which only shows those specific items?
Here is an example using data(iris):
a. b. c. d. e.
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
t.test(a)
t.test(b) # ...etc.; from each we obtain the mean and p-value.
This is the output I'm looking for:
feature mean p-val
col1 0.01 0.95
col2 0.01 0.95
.
.
.
coln
hope it makes sense!
Using the built-in iris data set as an example:
t(sapply(iris[, 1:4], function(i){
t.test(i)[c(5,3)]
}))
The sapply() function applies that custom function, which performs a t-test on a variable and returns the estimate and p-value (elements 5 and 3 of the t.test result), to each of columns 1 to 4 of iris. The result is then transposed by t() to rotate it into the desired shape. If you like, you can store it as a data frame using data.frame() and use row.names() to get the variable names into a new column.
values <- t(sapply(iris[, 1:4], function(i){
t.test(i)[c(5,3)]
}))
values <- data.frame("feature"=row.names(values), values)
row.names(values) <- NULL
values
Beware multiple testing though...
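On the multiple testing point, here is a rough sketch of the same table built by element name instead of position, with an adjusted p-value column added. The choice of Holm's method and the column name p.adjusted are my assumptions; pick whatever correction suits your analysis.

```r
# Run a one-sample t-test on each numeric column
res <- lapply(iris[, 1:4], t.test)

# Pull out the estimate (the mean) and the p-value by name
values <- data.frame(
  feature = names(res),
  mean    = unname(vapply(res, function(x) unname(x$estimate), numeric(1))),
  p.value = unname(vapply(res, function(x) x$p.value, numeric(1))),
  row.names = NULL
)

# Adjust for the number of tests (Holm is an assumed choice)
values$p.adjusted <- p.adjust(values$p.value, method = "holm")
values
```

Because the columns are plain numeric vectors here, the result is an ordinary data frame rather than one with list columns.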
I would like to use the ntile function from dplyr or a similar function on a list of data frames but using a different n for each data frame. My list contains 150 data frames so a manual solution like the one below will not work. How can I rewrite the code below to act on the list of data frames and return the list of data frames with the new column?
library(tidyverse)
iris_list=split(iris,iris$Species)
iris_setosa=iris_list[[1]]
iris_versicolor=iris_list[[2]]
iris_virginica=iris_list[[3]]
iris_setosa$n3=ntile(iris_setosa$Sepal.Length,3)
iris_versicolor$n5=ntile(iris_versicolor$Sepal.Length,5)
iris_virginica$n7=ntile(iris_virginica$Sepal.Length,7)
The final result should be this
final_list=list(iris_setosa,iris_versicolor,iris_virginica)
head(final_list[[1]])
Sepal.Length Sepal.Width Petal.Length Petal.Width Species n3
1 5.1 3.5 1.4 0.2 setosa 2
2 4.9 3.0 1.4 0.2 setosa 1
3 4.7 3.2 1.3 0.2 setosa 1
4 4.6 3.1 1.5 0.2 setosa 1
5 5.0 3.6 1.4 0.2 setosa 2
6 5.4 3.9 1.7 0.4 setosa 3
There are several ways to achieve this, depending on what type of object you want in the end.
One way would be to use base::expand.grid and purrr::pmap like this:
percentiles = list(3,5,7)
iris_list %>%
map("Sepal.Length") %>%
expand.grid(percentiles) %>%
pmap(~ntile(..1,..2))
First, you want only the Sepal.Length variable from each of your datasets, so you use purrr::map to extract it.
Then, expand.grid creates a data frame of all combinations of its arguments. Here, with 2 lists of 3 members each, it returns a data frame of 3x3=9 rows: setosa 3, versicolor 3, virginica 3, setosa 5, ...
Finally, pmap iterates over the rows of that data frame and applies the function ntile, with the first column (from iris_list) as the first argument and the second column (percentiles) as the second argument. Unfortunately, purrr handles names poorly here, though this seems to be by design.
EDIT:
Your edit is really a different question, so here is another answer:
iris_list %>%
map(~mutate(.x, n3=ntile(Sepal.Length,3),
n5=ntile(Sepal.Length,5), n7=ntile(Sepal.Length,7)))
I've found a way that works
n_size=data.frame(Species=c("setosa","versicolor","virginica"),size=c(3,5,7))
iris_bin=iris %>% inner_join(n_size,by="Species") %>%
group_by(Species)%>%
mutate(bin=ntile(Sepal.Length,size[1])) %>%
arrange(Species,Sepal.Length,bin)
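If you want to keep the list of data frames, another option is to walk over the list and a parallel vector of n values with purrr::map2. This is only a sketch: the vector n_bins and its values are my assumption about how you store the n for each data frame, and it assumes dplyr and purrr are installed.

```r
library(dplyr)
library(purrr)

iris_list <- split(iris, iris$Species)

# Hypothetical lookup: one n per data frame, keyed by list name
n_bins <- c(setosa = 3, versicolor = 5, virginica = 7)

# map2 pairs each data frame with its own n
final_list <- map2(iris_list, n_bins[names(iris_list)], function(df, n) {
  # Name the new column after the n that was used (n3, n5, n7)
  df[[paste0("n", n)]] <- ntile(df$Sepal.Length, n)
  df
})

head(final_list[[1]])
```

With 150 data frames you would only need to build n_bins once, for example from a lookup table, and the same call would cover all of them.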
I have a big dataset with 100+ observations and 68 variables.
I was wondering whether there is a way to generate boxplots and histograms for all those variables at once, without having to write the code for each boxplot/histogram one by one, and save them in a folder as PNGs or in a PDF.
If possible, I'd like to have more than one plot on the same page (I know you can do that using "par").
I know this is probably a simple piece of code, but it would be really helpful for me.
Thank you
Ok I think an example could be the data from the iris dataset:
"Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa"
But instead of having just "Sepal.Length Sepal.Width Petal.Length Petal.Width " as observed variables, I have 68 of them.
I want to check whether each of my 68 variables is normally distributed, using a histogram and a boxplot per variable. I know how to create boxplots and histograms one variable at a time, but that would take a lot of time, and I imagine there must be a way to do it all at once, probably using a loop or a %>%?
Take a look at the DataExplorer, skimr and inspectdf packages. They all produce summaries like the one you want. These articles give an overview:
https://www.littlemissdata.com/blog/simple-eda
https://www.littlemissdata.com/blog/inspectdf
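If you would rather stay in base R, a plain loop over the numeric columns also works. Here is a minimal sketch using iris in place of your data. The output file name, the page size and the 2x2 layout are my assumptions, so adjust them to taste.

```r
# Pick out the numeric columns (iris stands in for the real data)
num_vars <- names(iris)[vapply(iris, is.numeric, logical(1))]

# Assumed output location: a PDF in the session's temp directory
out_file <- file.path(tempdir(), "all_variables.pdf")

pdf(out_file, width = 8, height = 8)
par(mfrow = c(2, 2))  # four plots per page

for (v in num_vars) {
  hist(iris[[v]], main = paste("Histogram of", v), xlab = v)
  boxplot(iris[[v]], main = paste("Boxplot of", v), ylab = v)
}

dev.off()  # close the device so the file is written
```

Each page then holds the histogram and boxplot for two variables. A new page is started automatically when the 2x2 grid is full.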
I am trying to split my dataset into chunks to run biglm (with fastLm I would need 350 GB of RAM).
My complete dataset is called res. As an experiment, I drastically decreased the size to 10,000 rows. I want to create chunks to use with biglm.
library(biglm)
formula <- iris$Sepal.Length ~ iris$Sepal.Width
test <- iris[1:10,]
biglm(formula, test)
And somehow, I get the following output:
> test <- iris[1:10,]
> test
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
Above you can see that the data frame test contains 10 rows. Yet when running biglm, it shows a sample size of 150:
> biglm(formula, test)
Large data regression model: biglm(formula, test)
Sample size = 150
It looks like it uses iris instead of test. How is this possible, and how do I get biglm to use chunk1 the way I intend it to?
I suspect the following line is to blame:
formula <- iris$Sepal.Length ~ iris$Sepal.Width
where the formula explicitly references the iris dataset. This causes R to look up iris itself when biglm is called, and it finds the full dataset by following R's scoping rules from the global environment, so the data argument is effectively ignored.
In a formula you normally do not use vectors, but simply the column names:
formula <- Sepal.Length ~ Sepal.Width
This ensures that the formula contains only the column (or variable) names, which are looked up in the data passed to biglm. So biglm will use test instead of iris.
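You can check the effect without biglm at all. In this sketch I swap in base lm(), which resolves formula variables in the same way through the data argument, and compare the number of observations used by each formula. The names bad_fit and good_fit are just for illustration.

```r
test <- iris[1:10, ]

# Formula hard-codes iris, so the data argument has nothing to supply
bad_fit <- lm(iris$Sepal.Length ~ iris$Sepal.Width, data = test)
nobs(bad_fit)   # 150

# Formula uses bare column names, so they are looked up in test
good_fit <- lm(Sepal.Length ~ Sepal.Width, data = test)
nobs(good_fit)  # 10
```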
I have a dataset with many missing values. Some of the missing values are NAs, some are Nulls, and others have varying lengths of blank spaces. I would like to utilize the fread function in R to be able to read all these values as missing.
Here is an example:
library(data.table)
#Create fake data
iris <- data.table(iris)[1:5]
#Add missing values non-uniformly
iris[1,Species:=' ']
iris[2,Species:=' ']
iris[3,Species:='NULL']
#Write to csv and read back in using fread
write.csv(iris,file="iris.csv")
fread("iris.csv",na.strings=c("NULL"," "))
V1 Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1: 1 5.1 3.5 1.4 0.2
2: 2 4.9 3.0 1.4 0.2 NA
3: 3 4.7 3.2 1.3 0.2 NA
4: 4 4.6 3.1 1.5 0.2 setosa
5: 5 5.0 3.6 1.4 0.2 setosa
From the above example, we see that the first missing value is not read as NA, because its run of blank spaces does not match any entry in na.strings. Does anyone know of a way to account for this?
Thanks so much to @eddi for the wonderful answer:
fread("sed 's/ *//g' iris.csv",na.strings=c("",NA,"NULL"))
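Note that the sed command strips every space in the file, including spaces inside legitimate values. If that is a concern, one alternative is to clean up after reading. Below is a rough sketch in base R. The helper name blank_to_na and the small example data frame are hypothetical, and read.csv stands in for fread so the example has no package dependencies.

```r
# Hypothetical helper: blank-only strings and "NULL" become NA
blank_to_na <- function(df, extra = "NULL") {
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], function(x) {
    x <- trimws(x)                          # "    " becomes ""
    x[x %in% c("", extra)] <- NA_character_
    x
  })
  df
}

# Small made-up example with blanks of different lengths
dat <- data.frame(
  Sepal.Length = c(5.1, 4.9, 4.7, 4.6, 5.0),
  Species = c(" ", "    ", "NULL", "setosa", "setosa"),
  stringsAsFactors = FALSE
)

path <- tempfile(fileext = ".csv")
write.csv(dat, path, row.names = FALSE)

raw   <- read.csv(path, stringsAsFactors = FALSE)
clean <- blank_to_na(raw)
clean
```

Only whitespace at the start or end of a value is trimmed here, so a value with a space in the middle is left alone.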