Methods of applying the mean function to parts of a data.frame - r

Hello,
I am learning R and at the moment I am using the "iris" data, which ships with R by default. I want to apply the "mean" function to part of this data frame.
My question is nothing complicated;
I ask because I am still quite new to R.
The data I am using are:
library(datasets)
data(iris)
head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
What I want to do is apply the mean function only to, e.g., "setosa" in Species,
and I want to calculate it just for "Sepal.Length".
The way I did it is (I actually apply it to "virginica" Species here,
but I mean it just as example):
virgin<- iris[101:150,1]
virgin
and then
mean(virgin)
It gives me the correct mean, but I think this method is rather crude and is
probably not suitable when you don't want to search through the data.frame manually.
So my question is how to do the same with other functions like
apply or others I do not know about.
You can also suggest some sources where I could read more about it.
It can be this page as well (I found only more advanced questions though).
If you want, of course.
Thank you.

Your question is really about how to subset a data frame.
Here is one way:
mean(iris$Sepal.Length[iris$Species=="virginica"])
[1] 6.588
You can rewrite this with less duplication by using the function with():
mean(with(iris, Sepal.Length[Species=="virginica"]))
[1] 6.588
And another way:
mean(with(iris, iris[Species=="virginica", "Sepal.Length"]))
[1] 6.588
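Since the question also asks about apply-style functions, here is a minimal sketch of two base R options that compute the mean for every species at once instead of one subset at a time. The variable names species_means and agg are only illustrative.

```r
# Mean of Sepal.Length for each level of Species, returned as a named vector
species_means <- tapply(iris$Sepal.Length, iris$Species, mean)
species_means
#>     setosa versicolor  virginica
#>      5.006      5.936      6.588

# The same calculation with the formula interface, returned as a data frame
agg <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
agg
```

tapply() is handy when you want a quick named vector; aggregate() keeps the result as a data frame, which is easier to merge or plot later.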

Related

Creating a dataframe in R that is a subset of a number of other columns

I have a data frame with 854 observations and 47 variables (India_Summary). I want to create another data frame that contains only some columns from the 47 variables, named 'MEMSEXCOV1', 'PostSecAvailable', 'TertiaryYears'.
I thought I could simply use this (assuming I am just naming the new df 'India_Summary2'):
India_Summary2 <- India_Summary[['MEMSEXCOV1', 'PostSecAvailable', 'TertiaryYears']]
The error I receive is:
Error in `[[.default`(col, i, exact = exact) : subscript out of bounds.
I tried using an equal sign instead:
India_Summary2 = India_Summary[['MEMSEXCOV1', 'PostSecAvailable', 'TertiaryYears']]
and I receive the below error:
Error in `[[.default`(col, i, exact = exact) : subscript out of bounds
In addition: Warning messages:
1: In doTryCatch(return(expr), name, parentenv, handler) :
display list redraw incomplete
2: In doTryCatch(return(expr), name, parentenv, handler) :
invalid graphics state
3: In doTryCatch(return(expr), name, parentenv, handler) :
invalid graphics state
Your code looks like Python. In R, I'd recommend using the dplyr package. You'd have something like this:
library(dplyr)
India_Summary2 <- India_Summary %>%
  select(MEMSEXCOV1, PostSecAvailable, TertiaryYears)
You haven't provided any of your data and Justin already provided a solution using the dplyr package. It's impossible to know if this will work for you since your data is not available, so I show a way to do it with the iris dataset already in R, employing a method that doesn't require libraries.
First, the data. I can inspect the top with head(iris):
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
I want Sepal.Length and Sepal.Width. So I can achieve this in R's base functions in two ways. First, with matrix notation, I select a row x column location of values [X, X]. Since I only want columns Sepal.Width and Sepal.Length, I ask for only columns by omitting the row [,X].
#### Subset by Matrix Notation ####
iris.2 <- iris[,c(1,2)]
Alternatively, I can do the same thing by naming exactly the columns I want with subset() and its select argument.
#### Subset with Function ####
iris.2 <- subset(iris,
                 select = c("Sepal.Length",
                            "Sepal.Width"))
Both achieve the same thing. If I now use head(iris.2), I only see two columns:
Sepal.Length Sepal.Width
1 5.1 3.5
2 4.9 3.0
3 4.7 3.2
4 4.6 3.1
5 5.0 3.6
6 5.4 3.9
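To connect this back to the original attempt: `[[` extracts a single column, so it is not the right tool for picking several columns by name. A short sketch on iris (the object names cols, by_matrix, by_list and one_col are only illustrative) of selecting columns by name with single brackets:

```r
# A character vector of the column names to keep
cols <- c("Sepal.Length", "Sepal.Width")

# Matrix-style indexing: all rows, the named columns
by_matrix <- iris[, cols]

# List-style indexing: the same columns, no comma needed
by_list <- iris[cols]

# Double brackets return one column as a plain vector, not a data frame
one_col <- iris[["Sepal.Length"]]

str(by_matrix)
str(one_col)
```

Selecting by name rather than by position (c(1, 2)) keeps the code working if the column order ever changes.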

Create several subsets at the same time

I have a dataset (insti) and I want to create 3 different subsets according to a factor (xarxa) with three levels (linkedin, instagram, twitter).
I used this:
linkedin <- subset(insti, insti$xarxa=="linkedin")
twitter <- subset(insti, insti$xarxa=="twitter")
instagram <- subset(insti, insti$xarxa=="instagram")
It does work, however, I was wondering if this can be done with tapply, so I tried:
tapply(insti, insti$xarxa, subset)
It gives this error:
Error in tapply(insti, insti$xarxa, subset) : arguments must have same length
I think there might be some straightforward way to do this, but I cannot work it out. Can you help me with this without using loops?
Thanks a lot.
It's usually better to deal with data frames in a named list. This makes them easy to iterate over, and stops your global workspace being filled up with lots of different variables. The easiest way to get a named list is with split(insti, insti$xarxa).
If you really want the variables written directly to your global environment rather than kept in a list, you can do it in a single line:
list2env(split(insti, insti$xarxa), globalenv())
Example
Obviously, I don't have the insti data frame, since you did not supply any example data in your question, but we can demonstrate that the above solution works using the built-in iris data set.
First we can see that my global environment is empty:
ls()
#> character(0)
Now we get the iris data set, split it by species, and put the result in the global environment:
list2env(split(datasets::iris, datasets::iris$Species), globalenv())
#> <environment: R_GlobalEnv>
So now when we check the global environment's contents, we can see that we have three data frames: one for each Species:
ls()
#> [1] "setosa" "versicolor" "virginica"
head(setosa)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5.0 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
And of course, we can also access versicolor and virginica in the same way.
Created on 2021-11-12 by the reprex package (v2.0.0)
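As a small sketch of why keeping the pieces in a named list is convenient, here is the list form used directly on iris (by_species is just an illustrative name). It also hints at why the tapply() attempt failed: tapply() expects a vector with one element per row, whereas the length of a data frame is its number of columns.

```r
# One data frame per species, held in a named list
by_species <- split(iris, iris$Species)
names(by_species)
#> [1] "setosa"     "versicolor" "virginica"

# Iterate over the list without an explicit loop
sapply(by_species, nrow)
sapply(by_species, function(d) mean(d$Sepal.Length))

# Pull out a single subset by name when needed
head(by_species[["virginica"]], 3)

# length() of a data frame counts columns, not rows
length(iris)
#> [1] 5
```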

Possible way(s) to create a random sample from a data set by subsetting for dput purposes

Triggered by a comment of mine, I read the (very useful) post about how to create a reproducible example, and I think this question is closely related to it.
I have noticed that users (a growing number of them) sometimes ask for solutions here, but the introduction is always the same, "I have a very large dataset...", and as a result they do not dput any of it.
So I was wondering if there is a way to create a small sample of the data, but not with just head(<data>, n), because sometimes (most times, actually) there are factors etc. that are very important for the purposes of the question, and to be useful the example data set must contain at least a few rows for each of the factor levels in the original data. This makes the classic dput(head(data)) useless.
Browsing, I found a good solution here, which I reproduce below, but first the question:
Are there other ways to do that (of course there are)? More efficient ones? Or more "stable" ones, in the sense that the presence of all factor levels is guaranteed?
Here is the solution I have found:
set.seed(123)
samp_dat <- iris[sample(1:nrow(iris), 10, replace = FALSE), ]
samp_dat
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
44 5.0 3.5 1.6 0.6 setosa
118 7.7 3.8 6.7 2.2 virginica
61 5.0 2.0 3.5 1.0 versicolor
130 7.2 3.0 5.8 1.6 virginica
138 6.4 3.1 5.5 1.8 virginica
7 4.6 3.4 1.4 0.3 setosa
77 6.8 2.8 4.8 1.4 versicolor
128 6.1 3.0 4.9 1.8 virginica
79 6.0 2.9 4.5 1.5 versicolor
65 5.6 2.9 3.6 1.3 versicolor
Edit
The solutions provided so far are very good (and I have upvoted them), and I do thank the posters of course, but please consider this: the purpose was to create a simple sample of the original data set. So I invite you all to post solutions that are as easy as possible, because a user asking for help may not have a deep knowledge of R. I think it is better to avoid long solutions and solutions that rely on external packages (even though I have to admit that the dplyr one is very easy [with a little knowledge of dplyr, of course]).
If you just want to keep every factor level in view, probably the easiest way is to use dplyr and sample a specific number of rows per group. For example:
iris %>% group_by(Species) %>% sample_n(3)
Though from a practical standpoint you probably want to do stratified sampling, like caret's createDataPartition or other packages with more complex sampling approaches.
Here are a couple of options that ensure sampling all the factor combinations
## Factor columns
cols <- sapply(iris, class) == "factor"
## Using dplyr
library(dplyr)
iris %>% group_by(interaction(iris[, cols])) %>%
  sample_n(2) -> output
## Base R
do.call(rbind, lapply(split(iris, interaction(iris[, cols])), function(group)
  group[sample(nrow(group), 2), ]))
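For askers who want something shorter to paste, a base R sketch along the same lines is shown below. It assumes a single grouping factor (Species here) and that every level has at least n_per_level rows; the names n_per_level, idx and samp are only illustrative.

```r
set.seed(123)
n_per_level <- 2

# Split the row numbers by factor level, then draw rows within each level.
# sample.int() on the group size avoids the surprise sample() gives when
# it is handed a single number.
idx <- unlist(
  lapply(split(seq_len(nrow(iris)), iris$Species),
         function(i) i[sample.int(length(i), n_per_level)]),
  use.names = FALSE
)
samp <- iris[idx, ]

# Every level is present the same number of times
table(samp$Species)

# Text representation that can be pasted into a question
dput(samp)
```

Because the sampling happens inside each level, the presence of all levels is guaranteed rather than left to chance.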

Using data.table and RJSONIO / jsonlite - results are transposed

I have implemented a wrapper library, built on RStudio's htmlwidgets, that renders a pivot table.
The package is here.
The package works well with data.tables and data.frame (as it should!). For example it works with iris.
On the other hand if I try to convert iris to data.table my package (actually htmlwidgets - which internally uses RJSONIO) throws an error.
I know that it seems convoluted, but you can more or less reproduce the error by comparing the following two snippets:
library(data.table)
library(RJSONIO)
data.table(fromJSON(toJSON(data.table(iris))))
The result is different from the dear iris dataset:
V1
1: 5.1,4.9,4.7,4.6,5.0,5.4,
2: 3.5,3.0,3.2,3.1,3.6,3.9,
3: 1.4,1.4,1.3,1.5,1.4,1.7,
4: 0.2,0.2,0.2,0.2,0.2,0.4,
5: setosa,setosa,setosa,setosa,setosa,setosa,
On the other hand jsonlite is able to re-build iris properly (just remember to detach RJSONIO before running the code):
library(data.table)
library(jsonlite)
data.table(fromJSON(toJSON(data.table(iris))))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1: 5.1 3.5 1.4 0.2 setosa
2: 4.9 3.0 1.4 0.2 setosa
3: 4.7 3.2 1.3 0.2 setosa
4: 4.6 3.1 1.5 0.2 setosa
5: 5.0 3.6 1.4 0.2 setosa
I am not sure if the problem lies with data.table or RJSONIO...
This is not related to JSON.
RJSONIO::fromJSON returns a list, while jsonlite::fromJSON returns a data.frame.
The difference comes from calling data.table() on a list, which behaves differently from calling it on a data.frame, but still behaves as expected.
Try as.data.table instead of data.table in the outer call.
as.data.table(fromJSON(toJSON(data.table(iris))))
This was already discussed on the data.table GitHub. I've replied to your issue with a reference to the discussion.

Abbreviate column names during data frame print

R's abbreviate() is useful for truncating, among other things, the column names of a data frame to a set length, with nice checks to ensure uniqueness, etc.:
abbreviate(names(dframe), minlength=2)
One could, of course, use this function to abbreviate the column names in-place and then print out the altered data frame:
names(dframe) <- abbreviate(names(dframe), minlength=2)
dframe
But I would like to print out the data frame with abbreviated column names without altering the data frame in the process. Hopefully this can be done through a simple format option in the print() call, though my search through the help pages of print and format methods like print.data.frame didn't turn up any obvious solution (the available options seem more for formatting the column values, not their names).
So, does print() or format() have any options that call abbreviate() on the column names? If not, is there a way to apply abbreviate() to the column names of a data frame before passing it to print(), again without altering the passed data frame?
The more I think about it, the more I think that the only way would be to pass print() a copy of the data frame with already abbreviated column names. But this is not a solution for me, because I don't want to constantly be updating this copy as I update the original during an interactive session. The original column names must remain unaltered, because I use which(colnames(dframe)=="name_of_column") to interface with the data.
My ultimate goal is to work better remotely on the small screen of my mobile device when working in ssh apps like Server Auditor. If the column names are abbreviated to only 2-3 characters I can still recognize them but can fit much more data on the screen. Perhaps there are even R packages that are better suited for condensed printing?
You could define your own print method
print.myDF <- function(x, abbr = TRUE, minlength = 2, ...) {
  if (abbr) {
    names(x) <- abbreviate(names(x), minlength = minlength)
  }
  print.data.frame(x, ...)
}
Then add the class myDF to the data and print
class(iris) <- c("myDF", class(iris))
head(iris, 3)
# S.L S.W P.L P.W Sp
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
print(head(iris, 3), abbr = FALSE)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
print(head(iris, 3), minlength = 5)
# Spl.L Spl.W Ptl.L Ptl.W Specs
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
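If you would rather not change the class of your data, a plain helper function does the same job. This is only a sketch, and print_abbr is a hypothetical name. R copies the data frame inside the function, so the renaming never reaches the original and there is no separate copy to keep in sync.

```r
print_abbr <- function(df, minlength = 2, ...) {
  short <- df                      # local copy; the caller's object is untouched
  names(short) <- abbreviate(names(df), minlength = minlength)
  print(short, ...)
  invisible(df)                    # return the original, unmodified data frame
}

print_abbr(head(iris, 3))

# The original column names are still intact
names(iris)
```

Calls such as which(colnames(iris) == "Sepal.Length") keep working, because only the printed copy carries the short names.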
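Just rewrite print.data.frame:
print.data.frame <- function(x, ...) {
  # Abbreviate the names of the local copy only, then call the original method
  names(x) <- abbreviate(names(x), minlength = 2)
  base::print.data.frame(x, ...)
}
(If you still want the full names occasionally, keep an auxiliary printfull.data.frame to which you first copy print.data.frame.)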
