More a curiosity than a question: is it possible to perform an operation on only specific columns of a data frame while keeping the data frame's original structure?
For example, suppose I want simply to add 1 to the first 4 columns of the iris dataset because the 5th column is a factor and it is nonsense to add values to it.
1. Ignoring the factor column
Just perform the operation and disregard the warning message:
ex <- iris[,] + 1
head(ex, 2)
#gives
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 6.1 4.5 2.4 1.2 NA
2 5.9 4.0 2.4 1.2 NA
So the 5th column loses its original values because the operation makes no sense for a factor.
2. Excluding the last column
Exclude the column from the operation by its index:
ex <- iris[,-c(5)] + 1
head(ex, 2)
Sepal.Length Sepal.Width Petal.Length Petal.Width
1 6.1 4.5 2.4 1.2
2 5.9 4.0 2.4 1.2
But doing so, I have to perform a cbind operation to recover the original column (not a big deal with this data frame).
I was wondering if there is a smarter solution for this operation. Imagine the data frame is very big: with cbind one loses the original position of the columns, and restoring it could be quite tricky.
Thanks to all
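One possible approach, sketched here with base R only, is to assign the result back into the selected columns instead of building a new data frame; the column order and the factor column are then left untouched. The names ex and num_cols are illustrative.

```r
# Work on a copy so the built-in iris data set is left unchanged
ex <- iris

# Flag the numeric columns instead of hard-coding their positions
num_cols <- sapply(ex, is.numeric)

# Assign back into the same columns: only those are modified,
# and Species keeps both its position and its values
ex[num_cols] <- ex[num_cols] + 1

head(ex, 2)
```

Because the assignment targets existing columns, no cbind or reordering step is needed afterwards.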
Related
I have a question about using distinct() from dplyr on a tibble/data.frame. From the documentation it is clear that you can use it by explicitly naming the columns. I have a data frame with >100 columns and want to use the function on just a subset of them. My intuition was to put the column names in a vector and pass it as an argument to distinct. But distinct uses only the first element of the vector.
Example on iris:
data(iris)
library(dplyr)
exclude.columns <- c('Species', 'Sepal.Width')
distinct_(iris, exclude.columns)
This is different from
exclude.columns <- c('Sepal.Width', 'Species')
distinct_(iris, exclude.columns)
I think distinct is not made for this operation. Another option would be to subset the data.frame then use distinct and join again with the excluded columns. But my question is if there is another option using just one function?
As suggested in my comment, you could also try:
data(iris)
library(dplyr)
exclude.columns <- c('Species', 'Sepal.Width')
distinct(iris, !!! syms(exclude.columns))
Output (first 10 rows):
Sepal.Width Species
1 3.5 setosa
2 3.0 setosa
3 3.2 setosa
4 3.1 setosa
5 3.6 setosa
6 3.9 setosa
7 3.4 setosa
8 2.9 setosa
9 3.7 setosa
10 4.0 setosa
However, that was suggested more than 2 years ago. With dplyr 1.0.0 and later, which introduced across(), a more idiomatic way would be:
distinct(iris, across(all_of(exclude.columns)))
It is not entirely clear to me whether you would like to keep only the exclude.columns or actually exclude them; if the latter then you just put minus in front i.e. distinct(iris, across(-all_of(exclude.columns))).
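For comparison, roughly the same two results can be obtained in base R with unique() on a column subset. This is a sketch of an alternative technique, not what distinct() does internally; the names kept and dropped are illustrative.

```r
exclude.columns <- c("Species", "Sepal.Width")

# Distinct combinations of only the listed columns
kept <- unique(iris[exclude.columns])

# Distinct combinations of every column except the listed ones
dropped <- unique(iris[setdiff(names(iris), exclude.columns)])

names(kept)
names(dropped)
```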
Your objective sounds unclear. Are you trying to get all distinct rows across all columns except $Species and $Sepal.Width? If so, that doesn't make sense.
Let's say two rows are the same in all other variables except for $Sepal.Width. Using distinct() in the way you described would throw out the second row because it was not distinct from the first. Except that it was in the column you ignored.
You need to rethink your objective and whether it makes sense.
If you are just worried about duplicate rows, then
data %>%
distinct(across(everything()))
will do the trick.
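The concern raised in this answer can be made concrete with a small made-up data frame (toy and its columns are hypothetical). The sketch uses base R's duplicated() rather than dplyr, so it only approximates what distinct() would do:

```r
# Two rows that agree on a and b but differ in the ignored column
toy <- data.frame(a = c(1, 1), b = c("x", "x"), ignored = c(10, 20))

# Deduplicating on a and b only: the second row is dropped,
# even though its value in `ignored` is different
toy[!duplicated(toy[c("a", "b")]), ]

# Deduplicating on all columns: both rows are kept
toy[!duplicated(toy), ]
```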
Can anyone tell me the piece-by-piece meaning of the following code used to conditionally delete a column of a data frame?
df2=df[,!names(df)%in%c("column")]
Conditions:
column is the column I want to delete from the dataframe df. df2 is the new dataframe.
Let's break it down:
df2=df[,!names(df)%in%c("column")]
df is our dataframe.
So we are choosing columns in df that are not "column".
Choosing Columns is done like:
df[,mycol]
names(df) returns the column names of df.
! is the logical negation operator. names(df) %in% c("column") is TRUE for the column named "column" and FALSE for every other column; negating it selects the columns that are not "column".
!names(df)%in%c("column")
We then assign our selection to df2 (a new data frame).
Illustration:
This chooses all columns that are not Species.
iris[,!names(iris)%in%c("Species")]
Sepal.Length Sepal.Width Petal.Length Petal.Width
1 5.1 3.5 1.4 0.2
2 4.9 3.0 1.4 0.2
3 4.7 3.2 1.3 0.2
What were the original columns?
names(iris)
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
The %in% operator is exhaustively tackled here:
The R %in% operator
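To see each piece at work, the expression can be evaluated one step at a time. The intermediate names below (cols, is_species, keep, iris2) are introduced only for illustration:

```r
cols <- names(iris)                   # character vector of column names
is_species <- cols %in% c("Species")  # TRUE where the name matches
keep <- !is_species                   # flip TRUE and FALSE

keep
iris2 <- iris[, keep]                 # keep only the columns flagged TRUE
names(iris2)
```

Note that %in% binds more tightly than !, so !names(df) %in% c("column") is read as !(names(df) %in% c("column")). If only one column would remain, adding drop = FALSE inside the brackets keeps the result a data frame.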
For someone new to R, what is the best way to view the range for a number of variables? I've run the summary command on the entire dataset; can I run range() on the entire dataset as well, or do I need to create a variable for each variable in the dataset?
For an individual variable, you can use range. To see the range of multiple variables, you can combine range with one of the apply functions. See below for an example.
range(iris$Sepal.Length)
# [1] 4.3 7.9
sapply(iris[, 1:4], range)
# Sepal.Length Sepal.Width Petal.Length Petal.Width
#[1,] 4.3 2.0 1.0 0.1
#[2,] 7.9 4.4 6.9 2.5
(only the first four columns were selected from iris since the 5th is a factor, and range doesn't apply for factors)
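When the positions of the numeric columns are not known in advance, they can be detected rather than hard-coded. A minimal sketch, with num_cols and ranges as illustrative names:

```r
# Flag numeric columns so factors are skipped automatically
num_cols <- sapply(iris, is.numeric)

# One output column per numeric variable: row 1 = min, row 2 = max
ranges <- sapply(iris[num_cols], range)
ranges
```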
I'm trying to run the cor function as part of a PCA. The data frame clearly has the column name that I'm trying to exclude from the correlation, yet I get an error message stating that the object is not found.
Error in `[.data.frame`(ABCD, , -xyz) : object 'xyz' not found
In the above example 'xyz' is the column name. What should I be doing differently?
I'm trying to learn from the data set that is available in "HSAUR" package, called heptathlon.
> head(heptathlon)
hurdles highjump shot run200m longjump javelin run800m score
Joyner-Kersee (USA) 12.69 1.86 15.80 22.56 7.27 45.66 128.51 7291
The column "score" is the eighth column and I get the error when I run:
> round(cor(heptathlon[,-score]), 2)
Error in `[.data.frame`(heptathlon, , -score) : object 'score' not found
If I substitute the column name with the column number, it seems to work. Clearly, I cannot use this approach for large data sets.
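You can't remove a column by name with a - sign the way you can with numerical indices: inside the brackets, an unquoted score is looked up as a variable (hence "object 'score' not found"), and negating a quoted name such as -"score" is an error as well.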
But you can easily remove a column by name by using logical indexing. Here's an example, removing the column Sepal.Width from iris:
head(iris, 2)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
i <- iris[,names(iris) != 'Sepal.Width']
head(i, 2)
Sepal.Length Petal.Length Petal.Width Species
1 5.1 1.4 0.2 setosa
2 4.9 1.4 0.2 setosa
Note that - is not used, and the column name is quoted.
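Two other base R idioms also drop a column by name. They are shown on iris, since it ships with R and the HSAUR data from the question may not be installed; keep is an illustrative name.

```r
# setdiff() builds the vector of column names to keep
keep <- setdiff(names(iris), "Species")
round(cor(iris[, keep]), 2)

# subset() accepts a minus sign with an unquoted name in select =
round(cor(subset(iris, select = -Species)), 2)
```

Applied to the question, the second form would read subset(heptathlon, select = -score).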
I am trying to create a new data frame which is identical in the number of columns (but not rows) of an existing data frame. All columns are of identical type, numeric. I need to sample each column of the original data frame (n=241 samples, replace=T) and add those samples to the new data frame at the same column number as the original data frame.
My code so far:
#create the new data frame
tree.df <- data.frame(matrix(nrow=0, ncol=72))
#give same column names as original data frame (data3)
colnames(tree.df)<-colnames(data3)
#populate with NA values
tree.df[1:241,]=NA
#sample original data frame column wise and add to new data frame
for (i in colnames(data3)){
rbind(sample(data3[i], 241, replace = T),tree.df)}
The code isn't working out. Any ideas on how to get this to work?
Use the fact that a data frame is a list, and pass to lapply to perform a column-by-column operation.
Here's an example, taking 5 elements from each column in iris:
as.data.frame(lapply(iris, sample, size=5, replace=TRUE))
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.7 3.2 1.7 0.2 versicolor
## 2 5.8 3.1 1.5 1.2 setosa
## 3 6.0 3.8 4.9 1.9 virginica
## 4 4.4 2.5 5.3 0.2 versicolor
## 5 5.1 3.1 3.3 0.3 setosa
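There are several issues here. The one most likely causing things not to work is how you access a column of the data frame data3: data3[i] returns a one-column data frame rather than a vector. To get the column itself, use data3[, i]. Note the comma, which separates the row index from the column index. In addition, the result of rbind is never assigned, so tree.df is never modified.
Additionally, since you already know how big your data frame will be, allocate the space from the beginning, and copy the column names from data3 so that the loop below fills the existing columns instead of appending new ones:
tree.df <- data.frame(matrix(nrow = 241, ncol = 72))
colnames(tree.df) <- colnames(data3)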
tree.df is already prepopulated with missing (NA) values so you don't need to do it again. You can now rewrite your for loop as
for (i in colnames(data3)){
tree.df[, i] <- sample(data3[, i], 241, replace = TRUE)
}
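A self-contained version of the loop, using a small made-up data3 with three numeric columns in place of the real 72-column data (the values are randomly generated for illustration only):

```r
set.seed(1)  # make the random draws reproducible

# Hypothetical stand-in for the original data frame
data3 <- data.frame(a = rnorm(50), b = runif(50), c = rnorm(50, mean = 10))

# Preallocate 241 rows and reuse the original column names
tree.df <- data.frame(matrix(nrow = 241, ncol = ncol(data3)))
colnames(tree.df) <- colnames(data3)

# Resample each column with replacement into the matching column
for (i in colnames(data3)) {
  tree.df[, i] <- sample(data3[, i], 241, replace = TRUE)
}

dim(tree.df)
```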
Notice I spelled out TRUE. This is better practice than using T because T can be reassigned. Compare:
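T              # TRUE
T <- FALSE     # allowed: T is an ordinary variable
T              # now FALSE
TRUE <- FALSE  # error: TRUE is a reserved word and cannot be reassigned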