This question already has answers here:
How to extract columns with same name but different identifiers in R
(3 answers)
Closed 3 years ago.
I have a very large dataset. A small subset of its columns share the same name stem followed by a numeric index (unlike the post "How to extract columns with same name but different identifiers in R", where the indexing value is a string). For example
Q_1_1, Q_1_2, Q_1_3, ...
I am looking for a way to either loop through just those columns using the indices or to subset them all at once.
I have tried to use paste() to write their column names but have had no luck. See sample code below
Define Dataframe
df = data.frame("Q_1_1" = rep(1,5),"Q_1_2" = rep(2,5),"Q_1_3" = rep(3,5))
Define the Column Name Using Paste
cn <- as.symbol(paste("Q_1_",1, sep=""))
cn
df$cn
df$Q_1_1
I want df$cn to return the same thing as df$Q_1_1, but df$cn returns NULL.
If you are just trying to subset your data frame by column name, you could use dplyr to subset all your indexed columns at once, with a regex that matches every column name following a certain pattern:
library(dplyr)
df = data.frame("Q_1_1" = rep(1,5),"Q_1_2" = rep(2,5),"Q_1_3" = rep(3,5), "A_1" = rep(4,5))
newdf <- df %>%
dplyr::select(matches("Q_[0-9]_[0-9]"))
The [0-9] in the regex matches any single digit between the underscores. Depending on which variables you are trying to match, you may have to change the regular expression (for example, [0-9]+ for multi-digit indices).
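The problem with your attempt is that $ does not evaluate its argument: df$cn looks for a column literally named cn, finds none, and returns NULL. Storing the column name in a variable is not enough on its own; to look a column up by a stored name, index with a character string, as in df[[paste0("Q_1_", 1)]].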
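As a minimal sketch of looping over the indexed columns by constructed name, using the example data from the question (the variable names first_col, col_means and q_cols are only illustrative):

```r
# Same example data as in the question
df <- data.frame(Q_1_1 = rep(1, 5), Q_1_2 = rep(2, 5), Q_1_3 = rep(3, 5))

# Build the name as a plain character string (no as.symbol() needed)
cn <- paste0("Q_1_", 1)
first_col <- df[[cn]]  # same values as df$Q_1_1

# Loop over the numeric indices and compute something per column
col_means <- numeric(3)
for (i in 1:3) {
  col_means[i] <- mean(df[[paste0("Q_1_", i)]])
}

# Or subset all indexed columns at once by their constructed names
q_cols <- df[paste0("Q_1_", 1:3)]
```

Indexing with [[ ]] returns a single column as a vector, while single brackets with a character vector return a data frame of the named columns.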
I hope this helps!
This question already has answers here:
Rename Columns with names from another data frame
(3 answers)
How can I rename all columns of a data frame based on another data frame in R?
(3 answers)
Closed 2 years ago.
I have a dataset called "df" with 5 variables: year, v1, v2, v3, v4. I also have another dataset, df_names, with two columns, old_names and new_names. old_names contains the current names of the variables in "df", and new_names contains the names I want to assign to them.
So I am looking for a solution in which the code finds the variable names in "df" that appear in the old_names column of df_names and replaces each one with the corresponding new_names value. I am expecting something like "df_expected".
In my real dataset I have more than 1000 variables, so I have to use old_names and new_names from df_names; that is, I cannot refer to each variable name individually.
Thanks in advance for your help.
I tried to use the solution here: Rename Columns with names from another data frame, however, it did not work. Applying the code to my case like so
names(df)[match(df_names[,"old_names"], names(df))] = df_names[,"new_names"]
returns the error:
#> Error in names(df)[match(df_names[, "old_names"], names(df))] = df_names[, : NAs are not allowed in subscripted assignments
df <- data.frame(year = 2018:2020, v1=1:3, v2=4:6, v3=7:9, v4=10:12)
df_names <- data.frame(old_names = c("v1","v2","v3","v4","v5"),new_names = c("A","B","C","D","E"))
df_expected <- data.frame(year = 2018:2020, A=1:3, B=4:6, C=7:9, D=10:12)
You can set the names with setnames() from data.table without mentioning the old names:
library(data.table)
setDT(df)
setnames(df, c('year','A','B','C','D'))
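For the case with many variables, where typing the new names by hand is not practical, one possible base R sketch is to reverse the direction of match(): look up each column of df in old_names, rather than each old name in df. That way a lookup entry with no counterpart in df (v5 here) never produces an NA subscript. The example uses year = 2018:2020 so that all columns have three rows; idx and found are illustrative names.

```r
df <- data.frame(year = 2018:2020, v1 = 1:3, v2 = 4:6, v3 = 7:9, v4 = 10:12)
df_names <- data.frame(old_names = c("v1", "v2", "v3", "v4", "v5"),
                       new_names = c("A", "B", "C", "D", "E"),
                       stringsAsFactors = FALSE)

# For every column of df, find its position in the lookup table (NA if absent)
idx <- match(names(df), df_names$old_names)
found <- !is.na(idx)

# Rename only the columns that have an entry in the lookup table
names(df)[found] <- df_names$new_names[idx[found]]
```

Columns without an entry in df_names, such as year, keep their original names.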
This question already has answers here:
Selecting data frame rows based on partial string match in a column
(4 answers)
Closed 1 year ago.
I want to filter a dataframe using dplyr contains() and filter. Must be simple, right? The examples I've seen use base R grepl which sort of defeats the object. Here's a simple dataframe:
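library(dplyr)
site_type <- c('Urban','Rural','Rural Background','Urban Background','Roadside','Kerbside')
# row_id is not defined in the post as shown; a simple index is assumed here
row_id <- seq_along(site_type)
df <- data.frame(row_id, site_type)
df <- as_tibble(df)  # as.tibble() is deprecated in favour of as_tibble()
df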
Now I want to filter the dataframe to all rows where site_type contains the string background.
I can find the string directly if I know the unique values of site_type:
filtered_df <- filter(df, site_type == 'Urban Background')
But I want to do something like:
filtered_df <- filter(df, site_type(contains('background', match_case = False)))
Any ideas how to do that? Can dplyr helper contains only be used with columns and not rows?
The contains function in dplyr is a select helper. Its purpose is to help when using the select function, and the select function is focused on selecting columns, not rows. See the documentation for select.
filter is the intended mechanism for selecting rows. The function you are probably looking for is grepl which does pattern matching for text.
So the solution you are looking for is probably:
filtered_df <- filter(df, grepl("background", site_type, ignore.case = TRUE))
I suspect that contains is mostly a wrapper applying grepl to the column names. So the logic is very similar.
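For comparison, the same row filter can be sketched in base R without any packages, assuming the example data from the question (row_id is reconstructed here as a simple index, and keep is an illustrative name):

```r
site_type <- c("Urban", "Rural", "Rural Background",
               "Urban Background", "Roadside", "Kerbside")
df <- data.frame(row_id = seq_along(site_type),
                 site_type = site_type,
                 stringsAsFactors = FALSE)

# Logical vector: TRUE where site_type contains "background", ignoring case
keep <- grepl("background", df$site_type, ignore.case = TRUE)

# Keep only the matching rows
filtered_df <- df[keep, ]
```

grepl() returns one TRUE/FALSE per row, which is the kind of condition that filter() also expects.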
This question already has answers here:
Coerce multiple columns to factors at once
(11 answers)
Access a URL and read Data with R
(5 answers)
Closed 2 years ago.
I have a csv file with categorical and numerical data. I want to read in the csv file as a data frame, but I want to convert certain categorical variables to factors, and I want to transform the data of certain numerical variables with a log10 transformation.
I know that the relevant functions are read.csv() (automatically reads data in as a data frame), factor(), and log10(), but I've been unable to find a way to do this. How is this done?
Read the data into R using read.csv.
df <- read.csv('/path/of/file.csv')
Let's assume your df looks something like this :
set.seed(123)
df <- data.frame(a = runif(5), b = letters[sample(5)],
c = letters[sample(5)], d = runif(5), e = 1:5)
Create a vector of column names for each kind of change you want to make.
factor_cols <- c('b', 'c')
log_cols <- c('a', 'd')
Now apply the functions to those columns. Using dplyr, you can do it like this:
library(dplyr)
new_df <- df %>%
  mutate(across(all_of(factor_cols), factor),
         across(all_of(log_cols), log10))
Or in base R:
df[factor_cols] <- lapply(df[factor_cols], factor)
df[log_cols] <- lapply(df[log_cols], log10)
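As an alternative sketch, the factor conversion can also happen while reading the file, by passing a named colClasses vector to read.csv(). The inline CSV text below is made-up illustration data standing in for a real file.

```r
# Made-up CSV content standing in for a file on disk
csv_text <- "a,b,c,d\n0.1,x,p,10\n1,y,q,100\n10,x,p,1000\n"

# Columns named in colClasses are read as factors; the others are guessed
df <- read.csv(text = csv_text,
               colClasses = c(b = "factor", c = "factor"))

# The log10 transformation is still applied after reading
log_cols <- c("a", "d")
df[log_cols] <- lapply(df[log_cols], log10)
```

With a real file, the text argument would be replaced by the file path.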
Here is a complete, working example using the Pokémon Stats data. We can automate the conversion of columns by obtaining the column types from the input data.
gen01file <- "https://raw.githubusercontent.com/lgreski/pokemonData/master/gen01.csv"
gen01 <- read.csv(gen01file,header=TRUE,stringsAsFactors = FALSE)
At this point, the gen01 data frame consists of some character columns, some integer columns, and a logical column.
Next, we'll extract the column types with a combination of lapply() and unlist().
# extract the column types
colTypes <- unlist(lapply(gen01, typeof))
At this point, colTypes is a vector that contains the column types, where each element is named by the column name. This is important, because now we can extract the names and automate the process of converting character variables to factor, and integer / double variables with a log10() transformation.
# find character types to convert to factor, using element names from
# colTypes vector
factorColumns <- names(colTypes[colTypes == "character"])
logColumns <- names(colTypes[colTypes %in% c("integer","double")])
Note that at this point we could potentially subset the column name objects further (e.g. use regular expressions to pull certain names from the list of columns, given their data type).
Finally, we use lapply() to apply the appropriate transform on the relevant columns, as noted in Ronak Shah's answer.
gen01[factorColumns] <- lapply(gen01[factorColumns],factor)
gen01[logColumns] <- lapply(gen01[logColumns],log10)
Inspecting the result, for example with str(gen01), shows that the character variables are now factors and the values of the integer columns have been log transformed. The Legendary logical column is untouched.
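The same type-driven approach can be tried without a network connection on a small data frame. The toy data below is invented for illustration and is not taken from the Pokémon file.

```r
# Invented data with one column of each relevant type
toy <- data.frame(name = c("a", "b", "c"),
                  hp = c(10L, 100L, 1000L),
                  weight = c(1, 10, 100),
                  legendary = c(FALSE, TRUE, FALSE),
                  stringsAsFactors = FALSE)

# Named character vector: column name -> storage type
col_types <- vapply(toy, typeof, character(1))

# Pick column names by type
factor_columns <- names(col_types)[col_types == "character"]
log_columns <- names(col_types)[col_types %in% c("integer", "double")]

# Apply the conversions; the logical column matches neither group
toy[factor_columns] <- lapply(toy[factor_columns], factor)
toy[log_columns] <- lapply(toy[log_columns], log10)
```

vapply() is used here in place of unlist(lapply()) only because it states the expected result type up front; both give a named character vector.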
This question already has answers here:
grep using a character vector with multiple patterns
(11 answers)
Closed 3 years ago.
I am looking for a way to pass a vector of strings into a select statement. I want to subset a data frame to only those variables whose names contain one of the strings in my vector. I don't want exact matching, so I need something like contains, because the variable names in the data frame include text that is not in my vector.
Here is an example of the vector I want to pass into my select statement.
c("clrs_name", "_clrs_sitedetails_value", "_clrs_targetlicence_value",
"clrs_licenceclass", "clrs_licenceownership", "clrs_type", "statuscode")
For example, I want to extract the variable "odate_value_clrs_name" from my data frame and the string "clrs_name" in vector should extract that, but I am not sure how to incorporate contains and a vector into a select statement.
We can use matches in select after collapsing the pattern vector with |, using either paste from base R or str_c (note that str_c returns NA if any element is NA). This does not raise an error or warning if one of the patterns has no match among the column names.
library(dplyr)
library(stringr)
df1 %>%
select(matches(str_c(v1, collapse = "|")))
where
v1 <- c("clrs_name", "_clrs_sitedetails_value", "_clrs_targetlicence_value",
"clrs_licenceclass", "clrs_licenceownership", "clrs_type", "statuscode")
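A base R sketch of the same idea, for readers without dplyr or stringr installed. The data frame df1 and its column names are made up for illustration, and only two of the patterns are used to keep it short.

```r
# Made-up data frame; only the column names matter here
df1 <- data.frame(odate_value_clrs_name = 1:2,
                  x_statuscode = 3:4,
                  other = 5:6)
v1 <- c("clrs_name", "statuscode")

# Collapse the vector into a single alternation pattern: "clrs_name|statuscode"
pattern <- paste(v1, collapse = "|")

# Keep the columns whose names match any of the patterns
subset_df <- df1[, grepl(pattern, names(df1)), drop = FALSE]
```

Because the strings are treated as regular expressions, any regex metacharacters in them would need escaping first.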
This question already has answers here:
Subset data to contain only columns whose names match a condition
(10 answers)
Closed 3 years ago.
I have a dataframe dat that has many variables like
"x_tp1_y"
"g_tp1_z"
"f_tp2_h"
I would like to extract elements that include "tp1".
I already tried this:
grep("tp1", dat)
grepl("tp1", dat)
dat["tp1",]
I just want R to give me elements with this pattern so I do not have to type in all variable names that are in the dataframe dat.
Like this:
command that extracts elements with pattern "tp1"
R returns parts of the dataframe that have pattern "tp1":
x_tp1_y g_tp1_z
1 2
0 3
And then I would like to create a new dataframe.
I know that I just can use
newdat <- data.frame( dat[[1]], dat[ c(1:30)])
but I have so many elements in my dataframe that this would take ages.
Thank you for your help!
dat[,grep("tp1", colnames(dat))]
grep finds the index numbers in the column names of the data.frame (the vector colnames(dat)) that contain the necessary pattern. "[" subsets
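A small self-contained sketch of this subsetting, using made-up values in the spirit of the question's example. The drop = FALSE argument is an addition here: it keeps the result a data frame even when only one column matches.

```r
# Made-up data with the column names from the question
dat <- data.frame(x_tp1_y = c(1, 0),
                  g_tp1_z = c(2, 3),
                  f_tp2_h = c(5, 6))

# Logical index over the column names, then subset the columns
newdat <- dat[, grepl("tp1", colnames(dat)), drop = FALSE]
```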