I am trying to subset a data frame so that if a column name is present I subset, but if not I ignore it. For the example I will use the mtcars data set. What I am trying to accomplish is: if there is a column "vs", subset the first 3 columns and vs. This would be a data frame named "vsdf".
df <- mtcars
if(colnames(df)=="vs") {
vsdf <- df[,1,2,3,"vs"]
} else {
NULL
}
Any help or guidance would be greatly appreciated.
There are two problems with your code:
1) using ==
You want to check whether "vs" is one of the column names, but == compares every column name with "vs" element by element and returns one TRUE/FALSE per column. if() needs a single TRUE or FALSE, so this would only work if there's exactly one column and it is called "vs". Instead you need to use %in%, as in
if("vs" %in% colnames(df))
{...}
2) the subsetting syntax df[,1,2,3,"vs"]
Subsetting a data.frame usually follows the syntax
df[i, j]
where i denotes rows and j denotes columns. Since you want to subset columns, you'll do this in j. What you did is supply more arguments to [.data.frame than it takes, because you didn't put those values into a vector. The vector can be numeric / integer or a character vector, but not both forms mixed, as you did. Instead you can build the vector like this:
df[, c(names(df)[1:3], "vs")]
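Putting both fixes together, a minimal sketch on the built-in mtcars data could look like this. Assigning NULL in the else branch is only an assumption about what "ignore" should mean here.

```r
df <- mtcars

# Test membership with %in%, then select all columns by name
if ("vs" %in% colnames(df)) {
  vsdf <- df[, c(names(df)[1:3], "vs")]
} else {
  vsdf <- NULL  # assumption: nothing to build when "vs" is absent
}

names(vsdf)
# "mpg" "cyl" "disp" "vs"
```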
I have a dataframe with 62 columns and 110 rows. In the column "date_observed" I have 57 dates with some of them having multiple records for the same date.
I am trying to extract only 12 dates out of this. They are not in any given order.
I tried this:
datesubset <- original %>% select (original$date_observed == c("13-Jun-21","21-Jun-21", "28-Jun-21", "13-Jul-21", "20-Jul-21", "8-Aug-21", "9-Aug-21", "25-Aug-21", "31-Aug-21", "8-Sep-21", "27-Sep-21"))
But, I got the following error:
Error: Must subset columns with a valid subscript vector.
x Subscript has the wrong type logical.
i It must be numeric or character.
I did try searching here and on google but I could find results only for how to subset a set of columns but not for specific values within columns. I am still new to R so please pardon me if this was a very simple question to ask.
In {dplyr}, the select() function is for selecting particular columns, but if you want to subset particular rows you want to use filter().
The logical operator == compares the two sides element by element, recycling the shorter vector, so each row is checked against only one of your dates (whichever happens to line up with it by position) rather than against the whole list.
What I think you are after is the operator %in%, which checks whether each value on the left appears anywhere on the right, and returns a single TRUE or FALSE for each row.
As was mentioned, inside of tidyverse functions you don't need the $, you can just input the column name as in the example below.
I don't have your original data to double check, but the example below should work with your original data frame.
specific_dates <- c(
"13-Jun-21",
"21-Jun-21",
"28-Jun-21",
"13-Jul-21",
"20-Jul-21",
"8-Aug-21",
"9-Aug-21",
"25-Aug-21",
"31-Aug-21",
"8-Sep-21",
"27-Sep-21"
)
datesubset <- original %>%
filter(date_observed %in% specific_dates)
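To see the difference between == and %in% without needing the real data, here is a small base R sketch. The toy `original` data frame and its `count` column are made up for illustration, and the last line is the base R counterpart of the filter() call.

```r
# Hypothetical stand-in for the real `original` data frame
original <- data.frame(
  date_observed = c("13-Jun-21", "14-Jun-21", "21-Jun-21", "13-Jun-21"),
  count = c(5, 2, 7, 1),
  stringsAsFactors = FALSE
)
specific_dates <- c("21-Jun-21", "13-Jun-21")

# == pairs values up by position, recycling specific_dates:
# row 1 is compared with "21-Jun-21" only, row 2 with "13-Jun-21" only, ...
eq_result <- original$date_observed == specific_dates
# FALSE FALSE TRUE TRUE  (row 1 is missed even though its date is wanted)

# %in% asks, for each row, whether its date appears anywhere in the set
in_result <- original$date_observed %in% specific_dates
# TRUE FALSE TRUE TRUE

# Base R equivalent of filter(date_observed %in% specific_dates)
datesubset <- original[in_result, ]
```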
I am having trouble creating a subset for a large dataframe. I need to extract all rows that match one of two correct cities in one of the columns, however any subset that I create ends up empty. Given the main dataframe, I try:
New = data[data$Home.port %in% c("ARDGLASS","NEWLYN")]
However R returns "undefined columns selected"
A comma is missing:
New = data[data$Home.port %in% c("ARDGLASS","NEWLYN"), ]
That is because you are selecting rows, not columns; if you leave out the comma, R tries to subset columns instead of rows.
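A small sketch of the two forms side by side, using a made-up four-row data frame (the `Vessel` column and the "BELFAST" value are hypothetical):

```r
# Hypothetical toy data; only Home.port comes from the question
data <- data.frame(
  Vessel = c("A", "B", "C", "D"),
  Home.port = c("ARDGLASS", "BELFAST", "NEWLYN", "ARDGLASS"),
  stringsAsFactors = FALSE
)
keep <- data$Home.port %in% c("ARDGLASS", "NEWLYN")

# With the comma: `keep` indexes rows, the empty slot after it means "all columns"
New <- data[keep, ]

# Without the comma, `keep` is used as a column index. It has 4 entries but
# there are only 2 columns, so the call fails; tryCatch captures the message.
no_comma <- tryCatch(data[keep], error = function(e) conditionMessage(e))
```

Here `New` holds the three matching rows, while `no_comma` holds the error message instead of a data frame.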
I recommend using data.table:
# install.packages("data.table")
library(data.table)
data <- as.data.table(data)
new_data <- data[Home.port %in% c("ARDGLASS","NEWLYN")]
You can check the data.table documentation to learn more; it is very fast with large data sets.
The subset function will also do this task
new <- subset(data, subset = Home.port %in% c("ARDGLASS","NEWLYN"))
The base approach is functionally the same; it's just a matter of using a declarative function for the task or not.
When using subset() the first argument is the data frame you want to subset. When you refer to its variables in the condition you do not need to put "data$" in front. This saves time and makes it easier to read.
datasubset <- subset(data, Home.port %in% c("ARDGLASS","NEWLYN"))
You can also combine multiple conditions, using "&" for AND or "|" for OR depending on what you plan to do. Here you need "|", because a single Home.port value can never equal both cities at once:
datasubset <- subset(data, Home.port == "ARDGLASS" | Home.port == "NEWLYN")
I need to create a data subset from multiple "inclusion" criteria from a column (V5:Format) of my df.
I have tried :
new.data <- old.data[grep("text1", old.data$V5), ]
This works for 1 inclusion criteria. I want to add a second inclusion criteria - data must include "text1" & "text2" for data subset
Thanks in advance.
You can use grepl() instead of grep() to get a boolean vector which tells you which strings contain the pattern. On these vectors, you can use logical conditions like &:
new.data <- old.data[grepl("text1", old.data$V5) & grepl("text2", old.data$V5), ]
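A quick way to check this on toy data (the contents of `old.data` below are made up; only the column name V5 comes from the question). Note that grepl() treats the pattern as a regular expression unless you pass fixed = TRUE.

```r
# Hypothetical example data
old.data <- data.frame(
  V4 = 1:4,
  V5 = c("text1 only", "text1 and text2", "text2 only", "text2 then text1"),
  stringsAsFactors = FALSE
)

# One logical vector per pattern, combined elementwise with &
has_both <- grepl("text1", old.data$V5) & grepl("text2", old.data$V5)
new.data <- old.data[has_both, ]

new.data$V4
# 2 4
```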
I want to create a new column in a data.frame where its value is equal to the value in another data.frame where a particular condition is satisfied between two columns in each data frame.
The R pseudo-code being something like this:
DF1$Activity <- DF2$Activity where DF2$NAME == DF1$NAME
In each data.frame values for $NAME are unique in the column.
Use the ifelse function. Here, I put NA when the condition is not met. However, you may choose any value or values from any vector.
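Recycling rules apply, so this compares the two NAME columns row by row and only gives the intended result when both data frames have the same rows in the same order.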
DF1$Activity <- ifelse(DF2$NAME == DF1$NAME, DF2$Activity, NA)
I'm not sure this one actually needs an example. What happens when you create a column with a set of NA values and then assign the required rows with the same logical vector on both sides?
DF1$Activity <- NA
DF1$Activity[DF2$NAME == DF1$NAME] <- DF2$Activity[DF2$NAME == DF1$NAME]
Without an example it's quite hard to tell, but from your description it sounds like a base::merge or dplyr::inner_join operation. Those are quite fast in comparison to if statements.
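Since no example data was posted, here is a sketch on made-up data frames showing a lookup that does not depend on row order. It uses match(), which is a different technique from the ifelse() answer, plus the merge() call suggested above; the NAME and Activity values are hypothetical.

```r
# Hypothetical data: the names appear in a different order in each data frame
DF1 <- data.frame(NAME = c("c", "a", "d"), stringsAsFactors = FALSE)
DF2 <- data.frame(
  NAME = c("a", "b", "c"),
  Activity = c("run", "swim", "bike"),
  stringsAsFactors = FALSE
)

# match() returns, for each DF1$NAME, its row position in DF2$NAME (NA if absent)
idx <- match(DF1$NAME, DF2$NAME)
DF1$Activity <- DF2$Activity[idx]
# DF1$Activity is now "bike" "run" NA

# The same lookup as a left join; note that merge() sorts the result by NAME
merged <- merge(DF1["NAME"], DF2, by = "NAME", all.x = TRUE)
```

Because NAME is unique in each data frame, match() finds at most one row per name, and names missing from DF2 get NA.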
Cheers
I am trying to exclude a series of rows from a dataset by using the subset() command by identifying a sequence of numbers in the "Rec" column that I want to remove. My attempts to use : and > within subset have failed, for example:
dataset<-subset(dataset,Rec !1812:1843) #here I'd like to exclude all rows with values of 1812:1843 for Rec in the dataset
or
dataset<-subset(dataset,Rec !>1812) #here I'd like to exclude all rows with Rec>1812
Can someone show me how to use the <> and : operators in this way? Can it be done with subset()?
For inclusion/exclusion based on membership in a list in general, you can use the %in% operator:
dataset <- subset(dataset, !(Rec %in% 1812:1843))
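Both cases from the question can be tried on a small made-up dataset (the `value` column is hypothetical). For the "greater than" case there is no `!>` operator; you can either negate the whole comparison or use the opposite comparison.

```r
# Hypothetical data with Rec running from 1810 to 1846
dataset <- data.frame(Rec = 1810:1846, value = seq_along(1810:1846))

# Exclude every row whose Rec falls in 1812:1843
no_range <- subset(dataset, !(Rec %in% 1812:1843))
no_range$Rec
# 1810 1811 1844 1845 1846

# Exclude every row with Rec > 1812: negate the comparison ...
no_large <- subset(dataset, !(Rec > 1812))
# ... or, equivalently when Rec has no NA values, flip it
no_large2 <- subset(dataset, Rec <= 1812)
no_large$Rec
# 1810 1811 1812
```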