Noob here to R. Trying to figure something out. I need to build a function that adds a new column to the beginning of a dataset. This new column is a concatenation of the values in other columns that the user specifies.
Imagine this is the data set named myDataSet:
col_1 col_2 col_3 col_4
bat red 1 a
cow orange 2 b
dog green 3 c
The user could use the function like so:
addPrimaryKey(myDataSet, cols=c(1,3,4))
to get the result of a new data set with columns 1, 3 and 4 concatenated into a column called ID and added to the beginning, like so:
ID col_1 col_2 col_3 col_4
bat1a bat red 1 a
cow2b cow orange 2 b
dog3c dog green 3 c
This is the script I have been working on but I have been staring at it so long, I think I have made a few mistakes. I can't figure out how to get the column numbers from the arguments into the paste function properly.
addPrimaryKey <- function(df, cols=NULL){
newVector = rep(NA, length(cols)) ##initialize vector to length of columns
colsN <- as.numeric(cols)
df <- cbind(ID=paste(
for(i in 1:length(colsN)){
holder <- df[colsN[i]]
holder
}
, sep=""), df) ##concatenate the selected columns and add as ID column to df
df
}
Any help would be greatly appreciated. Thanks so much
paste0 works fine, with some help from do.call:
do.call(paste0, mydf[c(1, 3, 4)])
# [1] "bat1a" "cow2b" "dog3c"
Your function, thus, can be something like:
addPrimaryKey <- function(inDF, cols) {
cbind(ID = do.call(paste0, inDF[cols]),
inDF)
}
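To see why do.call is needed here: a data frame is a list of columns, and do.call hands each element of that list to paste0 as a separate argument. A minimal sketch, assuming the question's data is rebuilt as myDataSet (the names viaDoCall and byHand are only for illustration):

```r
# Rebuild the example data from the question (assumed layout).
myDataSet <- data.frame(
  col_1 = c("bat", "cow", "dog"),
  col_2 = c("red", "orange", "green"),
  col_3 = 1:3,
  col_4 = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

# do.call() spreads the selected columns out as separate arguments,
# so the two calls below build the same character vector.
viaDoCall <- do.call(paste0, myDataSet[c(1, 3, 4)])
byHand <- paste0(myDataSet$col_1, myDataSet$col_3, myDataSet$col_4)

identical(viaDoCall, byHand)
```

Writing the columns out by hand only works when you know them in advance; do.call lets the caller choose them at run time.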
You may also want to look at interaction:
interaction(mydf[c(1, 3, 4)], drop=TRUE)
# [1] bat.1.a cow.2.b dog.3.c
# Levels: bat.1.a cow.2.b dog.3.c
This should do the trick
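addPrimaryKey <- function(df, cols) {
  # drop = FALSE keeps a data frame even when only one column is selected
  ID <- apply(df[, cols, drop = FALSE], 1, function(x) paste(x, collapse = ""))
  df <- cbind(ID, df)  # the new column is named ID and goes first
  return(df)
}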
Just add in some conditional logic for your nulls
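The NULL handling mentioned above could look something like the sketch below. It assumes that "no columns given" should simply return the data unchanged (the question does not say what should happen), and it builds the key with do.call(paste0, ...) rather than apply:

```r
addPrimaryKey <- function(df, cols = NULL) {
  # Assumption: with no columns specified, return the data untouched.
  if (is.null(cols) || length(cols) == 0) {
    return(df)
  }
  # unname() stops a column called e.g. "collapse" from being
  # mistaken for a paste0() argument.
  key <- do.call(paste0, unname(as.list(df[cols])))
  data.frame(ID = key, df, stringsAsFactors = FALSE, check.names = FALSE)
}

# Example data rebuilt from the question (assumed layout).
myDataSet <- data.frame(
  col_1 = c("bat", "cow", "dog"),
  col_2 = c("red", "orange", "green"),
  col_3 = 1:3,
  col_4 = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

keyed <- addPrimaryKey(myDataSet, cols = c(1, 3, 4))
keyed
```

Stopping with an error instead of returning the data unchanged would be just as reasonable; it depends on how the function is meant to be used.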
Two other options for combining columns are dplyr::mutate() and tidyr::unite():
library(dplyr)
df %>%
mutate(new_col = paste0(col_1, col_3, col_4)) %>%
select(new_col, everything()) # to order the column names with the new column first
library(tidyr)
df %>%
unite(new_col, c(col_1, col_3, col_4), sep = '', remove = FALSE)
The default in tidyr::unite() is remove = TRUE, which drops the input columns from the data frame and leaves only the new combined column in their place; remove = FALSE keeps them.
I can construct a data.frame like this -
data.frame('a_1' = 3)
However, I want to build the column name a_1 from variables. So I tried this -
data.frame(get(paste("a", 1, sep = "_")) = 3)
With this I get below error -
Error: unexpected '=' in "data.frame(get(paste("a", 1, sep = "_")) ="
Can you please help me to understand the right approach to make the colnames as variable?
Thanks for your pointer.
We can use tibble with := to do this
library(stringr)
library(tibble)
tibble(!! str_c("a", "_", 1) := 3)
-output
# A tibble: 1 x 1
a_1
<dbl>
1 3
In base R, this can be done using setNames
df1 <- setNames(data.frame(3), paste0("a", "_", 1))
-output
df1
a_1
1 3
Or if it is only for a specific number of columns, create the dataset, and use names
df1 <- data.frame(3)
names(df1)[1] <- paste0("a_", 1)
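The setNames approach also extends to several generated names at once. A minimal sketch, where the prefix, the indices and the values 3, 4, 5 are made up purely for illustration:

```r
# Hypothetical prefix and indices, chosen only for this example.
prefix <- "a"
idx <- 1:3
colNames <- paste(prefix, idx, sep = "_")

# as.list() turns each value into its own single-row column.
df1 <- setNames(data.frame(as.list(c(3, 4, 5))), colNames)
df1
```

Because the names are an ordinary character vector, they can come from anywhere: a loop, a config file, or another data frame's names.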
I have a data frame df with 7 columns and a character vector z containing several strings.
I want a data frame containing only the columns of df whose names contain one of the strings in z.
df <- data.frame("a_means","b_means","c_means","d_means","e_mean","f_means","g_means")
z <- c("a_m","c_m","f_m")
How do I get the column number of the z strings in df? Or how do I get a dataframe with only the columns which contains the z strings.
What I want is:
print(df)
"a_means" "c_m" "f_m"
What I tried:
match(a, names(df)
and
df[,which(colnames(df) %in% colnames(df[ ,grepl(z,names(df)])]
You can use:
df[,match(z, substring(colnames(df), 1, 3))]
With base R:
z <- paste(z, collapse = "|")
df[, grepl(z, names(df))] # you could use grep as well
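Collapsing z with "|" turns it into a regular expression, so any regex metacharacters in z would be interpreted rather than matched literally. The sketch below shows the regex route next to a literal-matching alternative, and also recovers the column numbers the question asked about. It assumes a data frame with properly named columns (the values 1 to 7 are placeholders), and the names byRegex, byLiteral and hits are illustrative:

```r
# Assumed data: named columns, placeholder values.
df <- data.frame(a_means = 1, b_means = 2, c_means = 3, d_means = 4,
                 e_means = 5, f_means = 6, g_means = 7)
z <- c("a_m", "c_m", "f_m")

# Regex route: "a_m|c_m|f_m" matches any name containing one of the strings.
pattern <- paste(z, collapse = "|")
byRegex <- df[, grepl(pattern, names(df)), drop = FALSE]

# Literal route: test each string with fixed = TRUE, then OR the results.
hits <- Reduce(`|`, lapply(z, grepl, x = names(df), fixed = TRUE))
byLiteral <- df[, hits, drop = FALSE]

# Column positions of the matching names.
which(hits)
```

For the strings in this question both routes select the same columns; they only differ when z contains characters such as "." or "?".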
Combine the search patterns and use that as a pattern for stringr::str_detect() function.
library(dplyr)
library(stringr)
df <- data.frame(a_means = "a_means",
b_means = "b_means",
c_means = "c_means",
d_means = "d_means",
e_means = "e_means",
f_means = "f_means",
g_means = "g_means"
)
z <- c("a_m","c_m","f_m")
z <- paste(z, collapse = "|")
df %>% select_if(str_detect(names(df), z))
#> a_means c_means f_means
#> 1 a_means c_means f_means
You can simply do this:
library(dplyr)
df %>%
select(contains(z))
Check out help("starts_with"). You can also match to a starting prefix with starts_with() among other things.
You can use select and matches to subset the columns based on z
library(dplyr)
df <- data.frame("a_means","b_means","c_means","d_means","e_mean","f_means","g_means")
z <- c("a_m","c_m","f_m")
df %>%
select(matches(z))
#> X.a_means. X.c_means. X.f_means.
#> 1 a_means c_means f_means
I'm trying to remove all rows that contain a ? in any column of a data frame. I have 950 rows by 11 columns.
I've tried this to do it all at once.
dataNew <- data %>% filter_all(all_vars(!grepl("?",.)))
and this to see if I could even get it to work for one column.
dataNew <- data[!grepl('?',data$column),]
Both of these attempts resulted in an empty dataframe. Any help is appreciated, thank you.
Because ? is a regex metacharacter and grepl defaults to fixed = FALSE, it has to be matched literally: use fixed = TRUE, escape it (\\?), or wrap it in square brackets ([?]).
library(dplyr)
data %>%
filter_all(all_vars(!grepl("?",., fixed = TRUE)))
# col1 col2
#1 1 2
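Or using across(), which at the time of writing was only in the development version of dplyr (it has since been released in dplyr 1.0.0, and newer releases recommend if_all() for this inside filter())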
data %>%
filter(across(everything(), ~ !grepl("?", ., fixed = TRUE)))
# col1 col2
#1 1 2
Or using base R
data[!Reduce(`|`, lapply(data, grepl, pattern = '?', fixed = TRUE)),]
data
data <- data.frame(col1 = c("?", 1, 3, "?"), col2 = c(1, 2, "?", "?"),
stringsAsFactors = FALSE)
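The three ways of matching a literal ? mentioned in this answer should all give the same result. A quick base R check on a small made-up vector (x and the three result names are illustrative):

```r
# Made-up test values: two contain a question mark, two do not.
x <- c("?", "1", "3?", "ok")

viaFixed  <- grepl("?", x, fixed = TRUE)  # treat the pattern as plain text
viaEscape <- grepl("\\?", x)              # escape the metacharacter
viaClass  <- grepl("[?]", x)              # put it in a character class

viaFixed
```

fixed = TRUE is usually the simplest choice when the whole pattern is plain text.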
I have the data below
df<- data_frame(State= c('CA', 'IN', 'CHI'),
Age= c(46,29,32),
Status= c('Employed', '', 'Employed')
)
In the end, I want to create data that looks like this:
df<- data_frame(col1= c('State-CA', 'State-IN', 'State-CHI'),
col2= c('Age-46','Age-29','Age-32'),
col3= c('Status-Employed', '', 'Status-Employed')
)
That is, each value should be joined to its column name with a dash. If a value is missing, the column name should not be added to it. Could anyone help? Thanks in advance!
With imap this is a single step. A data frame is a named list of equal-length columns, so imap loops over the columns and passes each call of the anonymous function (~) the column values as .x and the column name as .y; str_c then pastes them together. Note that case_when comes from dplyr, so that package has to be loaded as well.
library(dplyr)
library(purrr)
library(stringr)
imap_dfc(df, ~ case_when(.x == "" | is.na(.x) ~ as.character(.x), TRUE ~ str_c(.y, .x, sep = '-')))
# A tibble: 3 x 3
# State Age Status
# <chr> <chr> <chr>
#1 State-CA Age-46 Status-Employed
#2 State-IN Age-29 ""
#3 State-CHI Age-32 Status-Employed
In base R
df[] <- Map(function(x, y) ifelse(x == "", x, paste(y, x, sep = "-")), df, names(df))
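The base R version can be extended to leave NA values alone as well as empty strings. A sketch, where the helper name labelColumn and the NA in the sample data are assumptions added for illustration:

```r
# Sample data modelled on the question, with one NA added on purpose.
df <- data.frame(State = c("CA", "IN", "CHI"),
                 Age = c(46, 29, 32),
                 Status = c("Employed", "", NA),
                 stringsAsFactors = FALSE)

# Hypothetical helper: prefix each value with its column name,
# unless the value is empty or missing.
labelColumn <- function(values, name) {
  values <- as.character(values)
  blank <- is.na(values) | values == ""
  ifelse(blank, values, paste(name, values, sep = "-"))
}

labelled <- df
labelled[] <- Map(labelColumn, df, names(df))
labelled
```

Assigning into labelled[] keeps the data frame structure and column names while replacing every column.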
I think what you are looking for has been answered on this thread - Insert Column Name into its Value using R. Hope you find this helpful!
Also, this code should work for you -
col_names <- names(df)
for (c in col_names) {
df[[c]] <- ifelse(df[[c]] != "", paste(c, df[[c]], sep = "-"), "")
}
df
Output -
State Age Status
1 State-CA Age-46 Status-Employed
2 State-IN Age-29
3 State-CHI Age-32 Status-Employed
library(dplyr)
clean_name <- function(df,col_name,new_col_name){
#remove whitespace and common titles.
df$new_col_name <- mutate_all(df,
trimws(gsub("MR.?|MRS.?|MS.?|MISS.?|MASTER.?","",df$col_name)))
#remove any chunks of text where a number is present
df$new_col_name<- transmute_all(df,
gsub("[^\\s]*[\\d]+[^\\s]*","",df$col_name,perl = TRUE))
}
I get the following error
"Error: Column new_col_name must be a 1d atomic vector or a list"
What you want to do is make sure that the output of the functions you're using is either a vector or a one-dimensional list, so that you can add it as a new column of the data frame. You can check the class of an object with the class() function from base R.
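As a rough illustration of that check, the sketch below compares a cleaned character vector with a one-column data frame. The sample data and the simplified pattern are assumptions made for this example, not the asker's exact code:

```r
# Sample data assumed for illustration.
df <- data.frame(col1 = c("MR. one", "MS. two 24"), stringsAsFactors = FALSE)

# gsub() on a single column returns a plain character vector ...
cleaned <- trimws(gsub("MR\\.?", "", df$col1))
class(cleaned)
is.atomic(cleaned)

# ... whereas a one-column data frame is not an atomic vector.
wholeFrame <- df[, "col1", drop = FALSE]
class(wholeFrame)
is.atomic(wholeFrame)

# The vector can be assigned directly as a new column.
df$new_col <- cleaned
```

Functions such as mutate_all() and transmute_all() are designed to return whole data frames, which is why their result does not fit into a single column.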
The mutate function by itself should do what you want, it returns the same data frame but with the new column:
library(dplyr)
clean_name <- function(df, col_name, new_col_name) {
# first_cleaning_to_colname = The first change you want to make to the col_name column. This should be a vector.
# second_cleaning_to_colname = The change you're going to make to the col_name column after the first one. This should be a vector too.
first_change <- mutate(df, col_name = first_cleaning_to_colname)
second_change <- mutate(first_change, new_col_name = second_cleaning_to_colname)
return(second_change)
}
You can make both of these changes in one step, but I thought this way would be easier to read.
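If the column names are passed as strings, the same two-step cleaning can be sketched in base R without dplyr. The regular expressions below are an assumption about what the asker intended (titles followed by an optional period, then any chunk containing a digit), not a copy of the original patterns:

```r
clean_name <- function(df, col_name, new_col_name) {
  # col_name and new_col_name are character strings here.
  values <- df[[col_name]]
  # Step 1: remove common titles, with an optional trailing period.
  values <- gsub("\\b(MASTER|MISS|MRS|MR|MS)\\b\\.?", "", values, perl = TRUE)
  # Step 2: remove any whitespace-delimited chunk that contains a digit.
  values <- gsub("\\S*\\d+\\S*", "", values, perl = TRUE)
  # Trim leftover whitespace and store the result as a new column.
  df[[new_col_name]] <- trimws(values)
  df
}

# Sample data assumed for illustration.
dat1 <- data.frame(col1 = c("MR. one", "MS. two 24"), stringsAsFactors = FALSE)
result <- clean_name(dat1, "col1", "colN")
result
```

Using df[[name]] avoids the df$col_name problem in the question, where $ looks for a column literally called col_name.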
If we are passing unquoted column names, then use
library(tidyverse)
clean_name <- function(df,col_name, new_col_name){
col_name <- enquo(col_name)
new_col_name <- enquo(new_col_name)
df %>%
mutate(!! new_col_name :=
trimws(str_replace_all(!!col_name, "MR.?|MRS.?|MS.?|MISS.?|MASTER.?","")) ) %>%
transmute(!! new_col_name := trimws(str_replace_all(!! new_col_name,
"[^\\s]*[\\d]+[^\\s]*","")))
}
clean_name(dat1, col1, colN)
# colN
#1 one
#2 two
data
dat1 <- data.frame(col1 = c("MR. one", "MS. two 24"), stringsAsFactors = FALSE)