Create several subsets at the same time - r

I have a dataset (insti) and I want to create 3 different subsets according to a factor (xarxa) with three levels (linkedin, instagram, twitter).
I used this:
linkedin <- subset(insti, insti$xarxa=="linkedin")
twitter <- subset(insti, insti$xarxa=="twitter")
instagram <- subset(insti, insti$xarxa=="instagram")
It does work, however, I was wondering if this can be done with tapply, so I tried:
tapply(insti, insti$xarxa, subset)
It gives this error:
Error in tapply(insti, insti$xarxa, subset) : arguments must have same length
I think there might be some straightforward way to do this, but I cannot work it out. Can you help me do this without using loops?
Thanks a lot.

It's usually better to deal with data frames in a named list. This makes them easy to iterate over, and stops your global workspace being filled up with lots of different variables. The easiest way to get a named list is with split(insti, insti$xarxa).
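As for the error itself: tapply() is meant for a vector X whose length matches the grouping factor, whereas the length of a data frame is its number of columns, so the two lengths disagree. A minimal sketch of that mismatch, using the built-in iris data as a stand-in for insti (the variable names are illustrative, and newer R releases may treat data frames in tapply() differently, so take this only as an illustration of the original failure):

```r
# A data frame is a list of columns, so length() counts columns, not rows
n_cols <- length(iris)            # 5
n_rows <- length(iris$Species)    # 150

# The two lengths differ, which is what the error message complains about.
# split() instead partitions the rows and returns a named list of data frames.
pieces <- split(iris, iris$Species)
names(pieces)
```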
If you really want the variables written directly to your global environment rather than kept in a list, you can do it in a single line:
list2env(split(insti, insti$xarxa), globalenv())
Example
Obviously, I don't have the insti data frame, since you did not supply any example data in your question, but we can demonstrate that the above solution works using the built-in iris data set.
First we can see that my global environment is empty:
ls()
#> character(0)
Now we get the iris data set, split it by species, and put the result in the global environment:
list2env(split(datasets::iris, datasets::iris$Species), globalenv())
#> <environment: R_GlobalEnv>
So now when we check the global environment's contents, we can see that we have three data frames: one for each Species:
ls()
#> [1] "setosa" "versicolor" "virginica"
head(setosa)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5.0 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
And of course, we can also access versicolor and virginica in the same way.
Created on 2021-11-12 by the reprex package (v2.0.0)
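If the subsets stay in the list, the "easy to iterate over" point can be sketched as follows. This is only an illustration on iris; by_species, row_counts and mean_sepal are names made up for the example:

```r
# Keep the subsets together in one named list instead of separate variables
by_species <- split(iris, iris$Species)

# Apply the same operation to every subset without writing a loop
row_counts <- sapply(by_species, nrow)
mean_sepal <- sapply(by_species, function(d) mean(d$Sepal.Length))

# Individual subsets are still available by name
head(by_species$setosa, 3)
```

Because sapply() carries the list names through, row_counts and mean_sepal are named by species, which makes the results easy to look up.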

Related

Creating a dataframe in R that is a subset of a number of other columns

I have a data frame with 854 observations and 47 variables (India_Summary). I want to create another data frame that contains only some columns from the 47 variables, named 'MEMSEXCOV1', 'PostSecAvailable', 'TertiaryYears'.
I thought I could simply use this (assuming I am just naming the new df 'India_Summary2'):
India_Summary2 <- India_Summary[['MEMSEXCOV1', 'PostSecAvailable', 'TertiaryYears']]
The error I receive is:
Error in `[[.default`(col, i, exact = exact) : subscript out of bounds.
I tried using an equal sign instead:
India_Summary2 = India_Summary[['MEMSEXCOV1', 'PostSecAvailable', 'TertiaryYears']]
and I receive the below error:
Error in `[[.default`(col, i, exact = exact) : subscript out of bounds
In addition: Warning messages:
1: In doTryCatch(return(expr), name, parentenv, handler) :
display list redraw incomplete
2: In doTryCatch(return(expr), name, parentenv, handler) :
invalid graphics state
3: In doTryCatch(return(expr), name, parentenv, handler) :
invalid graphics state
Your code looks like Python. In R, I'd recommend using the dplyr package. You'd have something like this:
library(dplyr)
India_Summary2 <- India_Summary %>%
select(MEMSEXCOV1, PostSecAvailable, TertiaryYears)
You haven't provided any of your data and Justin already provided a solution using the dplyr package. It's impossible to know if this will work for you since your data is not available, so I show a way to do it with the iris dataset already in R, employing a method that doesn't require libraries.
First, the data. I can inspect the top with head(iris):
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
I want Sepal.Length and Sepal.Width. So I can achieve this in R's base functions in two ways. First, with matrix notation, I select a row x column location of values [X, X]. Since I only want columns Sepal.Width and Sepal.Length, I ask for only columns by omitting the row [,X].
#### Subset by Matrix Notation ####
iris.2 <- iris[,c(1,2)]
Alternatively, I can do the same thing by specifying specifically what I want with subset using the select argument.
#### Subset with Function ####
iris.2 <- subset(iris,
select = c("Sepal.Length",
"Sepal.Width"))
Both achieve the same thing. If I now use head(iris.2), I only see two columns:
Sepal.Length Sepal.Width
1 5.1 3.5
2 4.9 3.0
3 4.7 3.2
4 4.6 3.1
5 5.0 3.6
6 5.4 3.9
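The error in the question appears to come from the double brackets: [[ extracts a single element, while selecting several columns by name needs single brackets with a character vector. A minimal sketch on iris, since the India_Summary data is not available (wanted, iris_small and one_col are illustrative names):

```r
wanted <- c("Sepal.Length", "Sepal.Width")

# Single brackets with a character vector keep several columns by name
iris_small <- iris[, wanted]

# Double brackets extract exactly one column, returned as a plain vector
one_col <- iris[["Sepal.Length"]]

str(iris_small)
```

Selecting by name rather than by position also keeps working if the column order changes later.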

How to rename columns by their number?

I've got a dataset that is updated once in a while. I want to automate the analysis of that dataset, so I've written an R script. The problem is that with every update the names of the columns change, but their order stays the same. I want to rename the columns no matter what names they got this time. I wanted to use rename() from dplyr, but it requires the old names of the columns. I tried something like this:
dataset %<>% rename('new.name1'=.[[1]], 'new.name2'=.[[2]], 'new.name3'=.[[3]])
but it didn't work. So how can I replace the old name with column number in the rename() function? Or what other function can I use to get it done?
Full example to my comment:
dataset <- ...
new_names <- c("new_name_1", "new_name_2", ...)
dataset <- dataset %>% set_names(new_names)
If you only want to replace some older names, use something like this:
dataset <- ...
mtch <- c("old_name_2" = "new_name_2", ...)
new_names <- names(dataset)
# find the position of each old name (assumed to exist in the data), then overwrite it
new_names[match(names(mtch), new_names)] <- unname(mtch)
dataset <- dataset %>% set_names(new_names)
Probably too late to be of use to the OP. But for new readers stumbling here, you can use rename() in the usual way even with the column numbers as the old name. Just drop the .[[i]]'s and use the column number by itself:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
iris %>%
rename("NEW_NAME" = 1) %>%
head
#> NEW_NAME Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5.0 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
Created on 2022-03-02 by the reprex package (v2.0.1)
As with @David's 2nd solution, this solution lets you rename only specific columns, though in a somewhat more straightforward way, IMO.
You can use setNames:
dataset <- setNames(dataset, paste0('new.name', seq_along(dataset)))
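If only a few columns need new names, base R can also assign to selected positions of names() directly. A small sketch on a copy of iris (df and the new names are made up for the example):

```r
df <- head(iris)

# Rename the 1st and 5th columns by position; the others keep their names
names(df)[c(1, 5)] <- c("sepal_len", "kind")

names(df)
```

This relies only on the column order, so it fits the case where the incoming names change with every update.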

Abbreviate column names during data frame print

R's abbreviate() is useful for truncating, among other things, the column names of a data frame to a set length, with nice checks to ensure uniqueness, etc.:
abbreviate(names(dframe), minlength=2)
One could, of course, use this function to abbreviate the column names in-place and then print out the altered data frame:
> names(dframe) <- abbreviate(names(dframe), minlength=2)
> dframe
But I would like to print out the data frame with abbreviated column names without altering the data frame in the process. Hopefully this can be done through a simple format option in the print() call, though my search through the help pages of print and format methods like print.data.frame didn't turn up any obvious solution (the available options seem more for formatting the column values, not their names).
So, does print() or format() have any options that call abbreviate() on the column names? If not, is there a way to apply abbreviate() to the column names of a data frame before passing it to print(), again without altering the passed data frame?
The more I think about it, the more I think that the only way would be to pass print() a copy of the data frame with already abbreviated column names. But this is not a solution for me, because I don't want to constantly be updating this copy as I update the original during an interactive session. The original column names must remain unaltered, because I use which(colnames(dframe)=="name_of_column") to interface with the data.
My ultimate goal is to work better remotely on the small screen of my mobile device when working in ssh apps like Server Auditor. If the column names are abbreviated to only 2-3 characters, I can still recognize them but can fit much more data on the screen. Perhaps there are even R packages that are better suited for condensed printing?
You could define your own print method:
print.myDF <- function(x, abbr = TRUE, minlength = 2, ...) {
  if (abbr) {
    names(x) <- abbreviate(names(x), minlength = minlength)
  }
  print.data.frame(x, ...)
}
Then add the class myDF to the data and print
class(iris) <- c("myDF", class(iris))
head(iris, 3)
# S.L S.W P.L P.W Sp
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
print(head(iris, 3), abbr = FALSE)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
print(head(iris, 3), minlength = 5)
# Spl.L Spl.W Ptl.L Ptl.W Specs
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
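If changing the class of the data frame is undesirable, a plain helper function is another option. The sketch below uses a hypothetical name, print_abbr; it prints a renamed copy, so the names of the original object are never touched and no copy has to be kept up to date:

```r
# Hypothetical helper: print a data frame with abbreviated column names
print_abbr <- function(df, minlength = 2, ...) {
  # setNames() returns a modified copy; df itself is left as it was
  short <- setNames(df, abbreviate(names(df), minlength = minlength))
  print(short, ...)
  invisible(df)
}

d <- head(iris, 3)
out <- capture.output(print_abbr(d))  # capture the printed lines for inspection
names(d)                              # still the full column names
```

Calling print_abbr(d) at the console shows the abbreviated header, while which(colnames(d) == "Sepal.Length") keeps working as before.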
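Alternatively, override print.data.frame itself, after first copying the original method to an auxiliary printfull.data.frame so that the new method can call it without recursing:
printfull.data.frame <- base::print.data.frame
print.data.frame <- function(x, ...) {
  # print a renamed copy; the caller's data frame keeps its full names
  printfull.data.frame(setNames(x, abbreviate(names(x), minlength = 2)), ...)
  invisible(x)
}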

Error in `[.data.frame`(ABCD, , -xyz) : object 'xyz' not found [duplicate]

This question already has answers here:
Remove an entire column from a data.frame in R
(8 answers)
Closed 9 years ago.
I'm trying to run the cor function as part of a PCA. The data frame I have clearly contains the column that I'm trying to exclude from the correlation, yet I'm getting an error message stating that the object is not found.
Error in `[.data.frame`(ABCD, , -xyz) : object 'xyz' not found
In the above example 'xyz' is the column name. What should I be doing differently?
I'm trying to learn from the data set that is available in "HSAUR" package, called heptathlon.
> head(heptathlon)
hurdles highjump shot run200m longjump javelin run800m score
Joyner-Kersee (USA) 12.69 1.86 15.80 22.56 7.27 45.66 128.51 7291
The column "score" is the eighth column and I get the error when I run:
> round(cor(heptathlon[,-score]), 2)
Error in `[.data.frame`(heptathlon, , -score) : object 'score' not found
If I substitute the column name with the column number, it seems to work. Clearly, I cannot use this approach for large data sets.
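You can't remove a column by name with a - sign, like you can with numerical indices. R evaluates -score as the negation of a variable called score, and no such variable exists outside the data frame.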
But you can easily remove a column by name by using logical indexing. Here's an example, removing the column Sepal.Width from iris:
head(iris, 2)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
i <- iris[,names(iris) != 'Sepal.Width']
head(i, 2)
Sepal.Length Petal.Length Petal.Width Species
1 5.1 1.4 0.2 setosa
2 4.9 1.4 0.2 setosa
Note that - is not used, and the column name is quoted.
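Two other base R ways to drop a column by name, sketched on iris because the HSAUR package may not be installed (drop_col, numeric_part, same_part and cor_mat are illustrative names). Dropping Species here plays the role of dropping score from heptathlon:

```r
drop_col <- "Species"   # column to leave out, given as a string

# setdiff() returns every column name except the unwanted one
numeric_part <- iris[, setdiff(names(iris), drop_col)]

# subset() accepts an unquoted name with a minus sign in its select argument
same_part <- subset(iris, select = -Species)

# The remaining columns are all numeric, so cor() can be applied
cor_mat <- round(cor(numeric_part), 2)
cor_mat
```

The setdiff() form is convenient in scripts because the column name can be held in a variable.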

Methods of using mean function to parts of data.frame

Hello,
I am learning R and at the moment I am using the "iris" data, which ships with R. In the iris data, I want to apply the "mean" function to part of the data frame.
My question does not concern anything complicated and
it's because I am still quite new to R.
The data I am using are:
library(datasets)
data(iris)
head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
What I want to do is apply the mean function only to, e.g., "setosa" in Species,
and I want to calculate it just for "Sepal.Length".
The way I did it is (I actually apply it to "virginica" Species here,
but I mean it just as example):
virgin<- iris[101:150,1]
virgin
and then
mean(virgin)
It gives me the correct mean, but I think this method is rather crude and is
probably not suitable when you don't want to search through the data frame manually.
So my question is how to do the same with other functions like
apply, or others I do not know about.
You can also suggest some sources where I could read more about it.
It can be this page as well (I found only more advanced questions though),
if you want, of course.
Thank you.
Your question is really about how to subset a data frame.
Here is one way:
mean(iris$Sepal.Length[iris$Species=="virginica"])
[1] 6.588
You can rewrite this with less duplication by using the function with():
mean(with(iris, Sepal.Length[Species=="virginica"]))
[1] 6.588
And another way:
mean(with(iris, iris[Species=="virginica", "Sepal.Length"]))
[1] 6.588
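To get the mean for every species at once rather than one subset at a time, tapply() or aggregate() can do the grouping. A short sketch (means_by_species and means_df are names chosen for the example):

```r
# Mean Sepal.Length for each level of Species in one call
means_by_species <- tapply(iris$Sepal.Length, iris$Species, mean)

# The formula interface of aggregate() gives the same numbers as a data frame
means_df <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)

means_by_species
```

The virginica entry should match the 6.588 obtained above.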

Resources