Error in `[.data.frame`(ABCD, , -xyz) : object 'xyz' not found [duplicate] - r

This question already has answers here:
Remove an entire column from a data.frame in R
(8 answers)
Closed 9 years ago.
I'm trying to run the cor function as a step towards PCA. The data frame clearly has the column I'm trying to exclude from the correlation, but I get an error message stating that the object is not found.
Error in `[.data.frame`(ABCD, , -xyz) : object 'xyz' not found
In the above example 'xyz' is the column name. What should I be doing differently?
I'm trying to learn from the data set that is available in "HSAUR" package, called heptathlon.
> head(heptathlon)
hurdles highjump shot run200m longjump javelin run800m score
Joyner-Kersee (USA) 12.69 1.86 15.80 22.56 7.27 45.66 128.51 7291
The column "score" is the eighth column and I get the error when I run:
> round(cor(heptathlon[,-score]), 2)
Error in `[.data.frame`(heptathlon, , -score) : object 'score' not found
If I substitute the column name with the column number, it seems to work. Clearly, I cannot use this approach for large data sets.

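You can't remove a column by name with a - sign the way you can with numerical indices. In heptathlon[,-score], R evaluates score as a variable in your workspace, finds none, and raises the error. Quoting the name does not help either, because - cannot be applied to a character string.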
But you can easily remove a column by name by using logical indexing. Here's an example, removing the column Sepal.Width from iris:
head(iris, 2)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
i <- iris[,names(iris) != 'Sepal.Width']
head(i, 2)
Sepal.Length Petal.Length Petal.Width Species
1 5.1 1.4 0.2 setosa
2 4.9 1.4 0.2 setosa
Note that - is not used, and the column name is quoted.
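Applied to the question's own case, the same idea might look like the sketch below. The hept data frame is a hypothetical stand-in for heptathlon (three of its columns, with the first row taken from the question and the other rows invented), so the HSAUR package is not needed to run it. The setdiff() and which() variants are alternatives I am adding for comparison, not part of the answer above.

```r
# Hypothetical stand-in for heptathlon; only the first row comes from the question
hept <- data.frame(
  hurdles  = c(12.69, 12.85, 13.20, 13.61),
  highjump = c(1.86, 1.80, 1.83, 1.80),
  score    = c(7291, 6897, 6858, 6540)
)

# Logical indexing: TRUE for every column whose name is not "score"
keep <- names(hept) != "score"
by_logical <- round(cor(hept[, keep]), 2)

# Alternative: build the vector of names to keep with setdiff()
by_setdiff <- round(cor(hept[, setdiff(names(hept), "score")]), 2)

# Alternative: turn the name into a position, then negate the position
by_position <- round(cor(hept[, -which(names(hept) == "score")]), 2)

by_logical
```

The which() form needs care: if the name does not match any column, which() returns an empty vector and the negative index then selects no columns at all, so the logical or setdiff() forms are the safer default.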

Related

Creating a dataframe in R that is a subset of a number of other columns

I have a data frame with 854 observations and 47 variables (India_Summary). I want to create another data frame that contains only some columns from the 47 variables, named 'MEMSEXCOV1', 'PostSecAvailable', 'TertiaryYears'.
I thought I could simply use this (assuming I am just naming the new df 'India_Summary2'):
India_Summary2 <- India_Summary[['MEMSEXCOV1', 'PostSecAvailable', 'TertiaryYears']]
The error I receive is:
Error in `[[.default`(col, i, exact = exact) : subscript out of bounds.
I tried using an equal sign instead:
India_Summary2 = India_Summary[['MEMSEXCOV1', 'PostSecAvailable', 'TertiaryYears']]
and I receive the below error:
Error in `[[.default`(col, i, exact = exact) : subscript out of bounds
In addition: Warning messages:
1: In doTryCatch(return(expr), name, parentenv, handler) :
display list redraw incomplete
2: In doTryCatch(return(expr), name, parentenv, handler) :
invalid graphics state
3: In doTryCatch(return(expr), name, parentenv, handler) :
invalid graphics state
Your code looks like Python. In R, I'd recommend using the dplyr package. You'd have something like this:
library(dplyr)
India_Summary2 <- India_Summary %>%
select(MEMSEXCOV1, PostSecAvailable, TertiaryYears)
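For comparison, here is a minimal base R sketch of the same selection. The underlying issue is that [[ ]] is meant to extract a single element, while [ ] with a character vector selects several columns. The miniature India_Summary below is invented purely for illustration (including the extra column named Other); only the three column names come from the question.

```r
# Hypothetical miniature of India_Summary; all values are invented
India_Summary <- data.frame(
  MEMSEXCOV1       = c(1, 2, 1),
  PostSecAvailable = c(TRUE, FALSE, TRUE),
  TertiaryYears    = c(3, 0, 4),
  Other            = c("a", "b", "c")
)

# Put the wanted names in one character vector built with c()
cols <- c("MEMSEXCOV1", "PostSecAvailable", "TertiaryYears")

# Single brackets with a character vector return a data frame of those columns
India_Summary2 <- India_Summary[, cols]

# Double brackets pull out exactly one column, as a plain vector
one_col <- India_Summary[["TertiaryYears"]]

str(India_Summary2)
```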
You haven't provided any of your data and Justin already provided a solution using the dplyr package. It's impossible to know if this will work for you since your data is not available, so I show a way to do it with the iris dataset already in R, employing a method that doesn't require libraries.
First, the data. I can inspect the top with head(iris):
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
I want Sepal.Length and Sepal.Width. I can achieve this with R's base functions in two ways. First, with matrix notation, I select values by [row, column] position. Since I only want the columns Sepal.Length and Sepal.Width (columns 1 and 2), I leave the row position empty: [, columns].
#### Subset by Matrix Notation ####
iris.2 <- iris[,c(1,2)]
Alternatively, I can do the same thing by naming the columns I want in the select argument of subset.
#### Subset with Function ####
iris.2 <- subset(iris,
select = c("Sepal.Length",
"Sepal.Width"))
Both achieve the same thing. If I now use head(iris.2), I only see two columns:
Sepal.Length Sepal.Width
1 5.1 3.5
2 4.9 3.0
3 4.7 3.2
4 4.6 3.1
5 5.0 3.6
6 5.4 3.9
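Position-based selection such as c(1,2) silently picks different columns if the column order ever changes, so selecting by name is often the more robust variant of the matrix notation. A short sketch on the same iris data (the variable names by_position, by_name and by_subset are mine):

```r
# Select by position, as in the answer above
by_position <- iris[, c(1, 2)]

# Select by name with the same bracket notation
by_name <- iris[, c("Sepal.Length", "Sepal.Width")]

# Select by name with subset()
by_subset <- subset(iris, select = c("Sepal.Length", "Sepal.Width"))

# Compare the three results
all.equal(by_position, by_name)
all.equal(by_name, by_subset)
```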

selecting column by name data and ignoring other column attributes such as "label" R

Any tips on selecting columns by just column name when other column attributes are present?
When using dplyr::subset and selecting from a c(x, y, z) sort of list, which should work when given exact column names, I am not able to get columns that also contain "*,label" attributes.
This is new to me, something I've not seen or dealt with before.
I can use select and starts_with but it's not limited enough as a character search and is getting other columns with similar character strings.
Doesn't work, because of the "label" attribute?
subset(NHSDA_2001_W, select = c("YEAR", "IRPINC3"))
Works, but yields far too many columns:
NHSDA_2001_W %>% select(starts_with("YEAR") | starts_with("IRPINC3"))
Is it possible to subset by name and ignore other column attributes?
It's a little unclear what the issue is, as labels do not prevent you from selecting by column name. Note: I used a different function to print the labels in the console.
library(tidyverse)
library(labelled)
df <- iris
var_label(df) <- list(Petal.Length = "Length of petal", Petal.Width = "Width of Petal")
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#1 Length of petal Width of Petal
#2 5.1 3.5 1.4 0.2 setosa
#3 4.9 3 1.4 0.2 setosa
We can just use select as normal from dplyr:
df %>%
select(Petal.Length, Petal.Width) %>%
head()
Petal.Length Petal.Width
1 1.4 0.2
2 1.4 0.2
3 1.3 0.2
4 1.5 0.2
5 1.4 0.2
6 1.7 0.4
Same for using subset (not part of dplyr):
subset(df, select = c(Petal.Length, Petal.Width))
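The same point can be checked without any packages. The sketch below assumes the label is stored as a plain "label" attribute on the column vector, which is my assumption about the asker's data, and sets it by hand with attr(). It then selects by exact name, so similarly named columns are not picked up the way they are with starts_with().

```r
df <- iris

# Attach a label attribute by hand (assumed to mirror how the labels are stored)
attr(df$Petal.Length, "label") <- "Length of petal"

# Exact names only: a name must match in full to be selected
cols <- c("Petal.Length", "Petal.Width")
labelled_subset <- df[cols]

# Equivalent selection through a logical vector of exact matches
same_subset <- df[names(df) %in% cols]

names(labelled_subset)
attr(labelled_subset$Petal.Length, "label")
```

Selecting whole columns this way leaves each column untouched, so the label should still be attached afterwards.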

Create several subsets at the same time

I have a dataset (insti) and I want to create 3 different subsets according to a factor (xarxa) with three levels (linkedin, instagram, twitter).
I used this:
linkedin <- subset(insti, insti$xarxa=="linkedin")
twitter <- subset(insti, insti$xarxa=="twitter")
instagram <- subset(insti, insti$xarxa=="instagram")
It does work, however, I was wondering if this can be done with tapply, so I tried:
tapply(insti, insti$xarxa, subset)
It gives this error:
Error in tapply(insti, insti$xarxa, subset) : arguments must have same length
I think there might be some straightforward way to do this, but I cannot work it out. Can you help me do this without using loops?
Thanks a lot.
It's usually better to deal with data frames in a named list. This makes them easy to iterate over, and stops your global workspace being filled up with lots of different variables. The easiest way to get a named list is with split(insti, insti$xarxa).
If you really want the variables written directly to your global environment rather than kept in a list, you can do it in a single line:
list2env(split(insti, insti$xarxa), globalenv())
Example
Obviously, I don't have the insti data frame, since you did not supply any example data in your question, but we can demonstrate that the above solution works using the built-in iris data set.
First we can see that my global environment is empty:
ls()
#> character(0)
Now we get the iris data set, split it by species, and put the result in the global environment:
list2env(split(datasets::iris, datasets::iris$Species), globalenv())
#> <environment: R_GlobalEnv>
So now when we check the global environment's contents, we can see that we have three data frames: one for each Species:
ls()
#> [1] "setosa" "versicolor" "virginica"
head(setosa)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5.0 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
And of course, we can also access versicolor and virginica in the same way.
Created on 2021-11-12 by the reprex package (v2.0.0)
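To illustrate why keeping the pieces in a named list is convenient, here is a small sketch with an invented insti data frame. Only the column name xarxa and its three levels come from the question; the followers column and all values are hypothetical.

```r
# Hypothetical data; only xarxa and its levels come from the question
insti <- data.frame(
  xarxa     = factor(c("linkedin", "twitter", "instagram", "twitter", "linkedin")),
  followers = c(120, 300, 450, 80, 60)
)

# One data frame per level of xarxa, collected in a named list
by_xarxa <- split(insti, insti$xarxa)

# The list is named after the factor levels
names(by_xarxa)

# Apply the same function to every subset without writing a loop
rows_per_xarxa <- sapply(by_xarxa, nrow)
rows_per_xarxa

# Pull out a single subset by name when needed
by_xarxa$twitter
```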

Explanation for R code used to delete column

Can anyone tell me the piece-by-piece meaning of the following code used to conditionally delete a column of a data frame?
df2=df[,!names(df)%in%c("column")]
Conditions:
column is the column I want to delete from the dataframe df. df2 is the new dataframe.
Let's break it down:
df2=df[,!names(df)%in%c("column")]
df is our dataframe.
So we are choosing columns in df that are not "column".
Choosing Columns is done like:
df[,mycol]
names(df) returns the column names.
! is the logical negation operator. Applied to the result of %in%, it tells R to choose, out of the column names in df, the columns that are not "column".
!names(df)%in%c("column")
We then assign our selection to df2 (a new data frame).
Illustration:
This chooses all columns that are not Species.
iris[,!names(iris)%in%c("Species")]
Sepal.Length Sepal.Width Petal.Length Petal.Width
1 5.1 3.5 1.4 0.2
2 4.9 3.0 1.4 0.2
3 4.7 3.2 1.3 0.2
What were the original columns?
names(iris)
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
The %in% operator is exhaustively tackled here:
The R %in% operator
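The breakdown can also be run one step at a time. In this sketch on iris the intermediate names (cols, is_target, keep) are mine, added only to expose each stage. Note that %in% binds more tightly than !, so !names(df) %in% x means !(names(df) %in% x).

```r
# Step 1: the column names as a character vector
cols <- names(iris)

# Step 2: TRUE where a name is in the set to delete
is_target <- cols %in% c("Species")

# Step 3: flip it, so TRUE now marks the columns to keep
keep <- !is_target

# Step 4: use the logical vector in the column position
df2 <- iris[, keep]

is_target
keep
names(df2)
```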

Operation on specific columns in R

More a curiosity than a question: is it possible to perform an operation on only specific columns of a data frame while keeping the data frame's original structure?
For example, suppose I simply want to add 1 to the first 4 columns of the iris dataset, because the 5th column is a factor and it makes no sense to add values to it.
1. Ignoring the factor column
Just perform the operation and disregard the warning message:
ex <- iris[,] + 1
head(ex, 2)
#gives
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 6.1 4.5 2.4 1.2 NA
2 5.9 4.0 2.4 1.2 NA
So the original 5th column loses its values because of the meaningless operation.
2. Excluding the last column
Exclude the column's index from the operation:
ex <- iris[,-c(5)] + 1
head(ex, 2)
Sepal.Length Sepal.Width Petal.Length Petal.Width
1 6.1 4.5 2.4 1.2
2 5.9 4.0 2.4 1.2
But doing so, I have to perform a cbind operation to recover the original column (not a big deal with this data frame).
I was wondering if there is a smarter solution for this operation. If the data frame is very big, cbind loses the original position of the columns, and restoring it could be quite tricky.
Thanks to all
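One possible approach, offered as a sketch rather than a definitive answer, is to assign the result back into the selected columns. That way the data frame keeps every column in its original position and no cbind is needed. The variable names ex, ex2 and num_cols are mine.

```r
# Variant 1: address the columns by position and overwrite them in place
ex <- iris
ex[, 1:4] <- ex[, 1:4] + 1

# Variant 2: let R find the numeric columns, which scales to wide data frames
ex2 <- iris
num_cols <- sapply(ex2, is.numeric)
ex2[num_cols] <- ex2[num_cols] + 1

head(ex, 2)
names(ex)
```

The second variant avoids hard-coding positions, which may matter when the factor columns are scattered through a large data frame.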
