I want to cbind a column to the data frame with the column name dynamically assigned from a string
y_attribute = "Survived"
cbind(test_data, y_attribute = NA)
This results in a new column named y_attribute instead of the required Survived, which is provided as a string in the y_attribute variable. What needs to be done to get a column in the data frame whose name comes from a variable?
You don't actually need cbind to add a new column. Any of these will work:
test_data[, y_attribute] = NA # data frame row,column syntax
test_data[y_attribute] = NA # list syntax (would work for multiple columns at once)
test_data[[y_attribute]] = NA # list single item syntax (single column only)
New columns are added after the existing columns, just like cbind.
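A minimal sketch of the difference, using the test_data frame from the dplyr answer on this page: cbind takes the argument name literally, whereas bracket indexing uses the value stored in the variable.

```r
test_data <- data.frame(col1 = 1:5)
y_attribute <- "Survived"

# Argument names are taken literally, so this column is called "y_attribute"
literal <- cbind(test_data, y_attribute = NA)
names(literal)

# Indexing with the variable uses its value as the column name
test_data[[y_attribute]] <- NA
names(test_data)
```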
We can use tidyverse to do this
library(dplyr)
test_data %>%
mutate(!! y_attribute := NA)
# col1 Survived
#1 1 NA
#2 2 NA
#3 3 NA
#4 4 NA
#5 5 NA
data
test_data <- data.frame(col1 = 1:5)
Not proud of this, but I usually do something like this:
dyn.col <- "XYZ"
# add the column under a placeholder name, then rename it
test.data <- cbind(test.data, UNIQUE_NAMEXXX=NA)
colnames(test.data)[colnames(test.data) == 'UNIQUE_NAMEXXX'] <- dyn.col
We can also do it with data.table
library(data.table)
setDT(test_data)[, (y_attribute) := NA]
Related
I have a column variable that I want to split into three factor variables. There are the factor variables I want to create:
goal<-c('newref', 'meow', 'woof')
area<-c('eco', 'social', 'bank')
fr<-c('demo', 'hist', 'util')
And the current variable looks more or less like that:
code<-c('goal\\\\meow', 'area\\\\bank', 'area\\\\bank', 'fr\\\\utilitarian', 'fr\\\\history')
And let's say the dataframe is something like that
df<-data.frame(var1=c(1,2,3,4,5), var2=c('a', 'b', 'c', 'd', 'e'), code=code)
So I would like to create 3 new columns, one for each factor variable, and use a regular expression to detect which variable each code belongs to. For example, row number one should look as follows:
row1<-data.frame(var1=1, var2=c('a'), code=c('goal\\\\meow'), goal=2, area=NA, fr=NA)
Also note that the value of the factor variables is an abbreviation of the value in code (eg, history / hist).
The database is likely to have 10000 entries, so I would really appreciate any hints on this.
Thank you!
We can define a function that finds the position of the factor variable that, when used as a regular expression, finds a match in the code column:
find_match <- function(code, matches) {
apply(sapply(matches, grepl, code), 1, match, x = TRUE)
}
If there is no match, this function returns NA for that row.
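To see what the function does internally, here is a small sketch with a shortened code vector (an assumption for illustration, with a single literal backslash in each value). sapply builds a logical matrix with one row per code and one column per pattern, and match(TRUE, row) picks the first matching column in each row.

```r
goal <- c("newref", "meow", "woof")
code <- c("goal\\meow", "area\\bank", "fr\\utilitarian")

# One row per element of code, one column per pattern in goal
hits <- sapply(goal, grepl, code)
hits

# Position of the first TRUE in each row, NA when no pattern matches
pos <- apply(hits, 1, match, x = TRUE)
pos
```

Note that with a single-element code vector sapply returns a plain vector rather than a matrix, so apply would fail; the sketch assumes at least two rows.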
Next, we can simply use mutate from dplyr to add each column of factors:
df %>% mutate(goal = find_match(code, goal),
area = find_match(code, area),
fr = find_match(code, fr))
Which gives:
var1 var2 code goal area fr
1 1 a goal\\\\meow 2 NA NA
2 2 b area\\\\bank NA 3 NA
3 3 c area\\\\bank NA 3 NA
4 4 d fr\\\\utilitarian NA NA 3
5 5 e fr\\\\history NA NA 2
Doing this with tidyverse tools like the pipe %>% and dplyr:
Separate breaks up the code column into two with the separator you specify.
Because "\" is a special character in regex, you have to escape each \ you want to look for with another \.
Spread converts it from tall form to wide form as you needed.
library(dplyr)
library(tidyr) # separate() and spread() come from tidyr
df %>%
separate(code, into = c("colName", "value"), sep = "\\\\\\\\") %>%
spread(colName, value)
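The escaping rule can be checked in base R as well. This is only a sketch, and it assumes each code value contains a single literal backslash (written "\\" in R source); adjust the pattern if your data really holds two.

```r
# Assumed data: each value holds one literal backslash, written "\\" in R source
code <- c("goal\\meow", "area\\bank", "fr\\utilitarian")

# Regex split: the regex engine must see an escaped backslash,
# which is written "\\\\" in R source
parts_regex <- strsplit(code, "\\\\")

# Fixed split: no regex escaping, only the R string escape remains
parts_fixed <- strsplit(code, "\\", fixed = TRUE)

# First piece is the variable name, second piece is the value
pieces <- data.frame(colName = sapply(parts_fixed, `[`, 1),
                     value   = sapply(parts_fixed, `[`, 2))
pieces
```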
We have brands data in a column/variable which is delimited by semicolon(;). Our task is to split these column data to multiple columns which we were able to do with the following syntax.
Attached the data as Screen shot.
Data set
Here is the R code:
x<-dataset$Pref_All
point<-df %>% separate(x, c("Pref_01","Pref_02","Pref_03","Pref_04","Pref_05"), ";")
point[is.na(point)] <- ""
However our question is: We have this type of brands data in more than 10 to 15 columns and if we use the above syntax the maximum number of columns to be split is to be decided on the number of brands each column holds (which we manually calculated and taken as 5 columns).
We would like to know is there any way where we can write the code in a dynamic way such that it should calculate the maximum number of brands each column holds and accordingly it should create those many new columns in a data frame. for e.g.
Pref_01,Pref_02,Pref_03,Pref_04,Pref_05.
the preferred output is given as a screen shot.
Output
Thanks for the help in advance.
x <- c("Swift;Baleno;Ciaz;Scross;Brezza", "Baleno;swift;celerio;ignis", "Scross;Baleno;celerio;brezza", "", "Ciaz;Scross;Brezza")
strsplit(x,";")
library(dplyr)
library(tidyr)
x <- data.frame(ID = c(1,2,3,4,5),
Pref_All = c("S;B;C;S;B",
"B;S;C;I",
"S;B;C;B",
" ",
"C;S;B"))
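x$Pref_All <- as.character(x$Pref_All) # convert to character if the column was read as a factor
# b is not defined in the answer as posted; it is assumed here to be the number of brands in each row
b <- lengths(strsplit(x$Pref_All, ";"))
final_df <- x %>%
tidyr::separate(Pref_All, paste0("Pref_0", 1:max(b)), ";", fill = "right")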
final_df$ID <- x$Pref_All
final_df <- rename(final_df, Pref_All = ID)
final_df[is.na(final_df)] <- ""
Pref_All Pref_01 Pref_02 Pref_03 Pref_04 Pref_05
1 S;B;C;S;B S B C S B
2 B;S;C;I B S C I
3 S;B;C;B S B C B
4
5 C;S;B C S B
The trick for the column names is given by paste0 going from 1 to the maximum number of brands in your data!
I would use str_split() which returns a list of character vectors. From that, we can work out the max number of preferences in the dataframe and then apply over it a function to add the missing elements.
library(stringr) # provides str_split()
df=data.frame("id"=1:5,
"Pref_All"=c("brand1", "brand1;brand2;brand3", "", "brand2;brand4", "brand5"))
spl = str_split(df$Pref_All, ";")
# Find the max number of preferences
maxl = max(unlist(lapply(spl, length)))
# Add missing values to each element of the list
spl = lapply(spl, function(x){c(x, rep("", maxl-length(x)))})
# Bind each element of the list in a data.frame
dfr = data.frame(do.call(rbind, spl))
# Rename the columns
names(dfr) = paste0("Pref_", 1:maxl)
print(dfr)
# Pref_1 Pref_2 Pref_3
#1 brand1
#2 brand1 brand2 brand3
#3
#4 brand2 brand4
#5 brand5
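Since the question mentions 10 to 15 such columns, the same idea can be wrapped in a function and applied to each of them. The following is a base R sketch, not a tested solution for the real data: split_pref is a hypothetical helper, and the Own_All column is invented for illustration.

```r
# Hypothetical helper: split one delimited column into as many columns as needed
split_pref <- function(values, prefix, sep = ";") {
  parts <- strsplit(as.character(values), sep, fixed = TRUE)
  n_max <- max(lengths(parts), 1L)
  # Pad shorter rows with empty strings so every row has n_max pieces
  padded <- lapply(parts, function(p) c(p, rep("", n_max - length(p))))
  out <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
  names(out) <- sprintf("%s_%02d", prefix, seq_len(n_max))
  out
}

# Made-up data with two delimited columns
dataset <- data.frame(
  Pref_All = c("Swift;Baleno;Ciaz", "Baleno", ""),
  Own_All  = c("Ciaz;Brezza", "", "Swift"),
  stringsAsFactors = FALSE
)

cols <- c("Pref_All", "Own_All")
pieces <- lapply(cols, function(col) split_pref(dataset[[col]], sub("_All$", "", col)))
result <- cbind(dataset, do.call(cbind, pieces))
names(result)
```

Each column gets its own maximum, so Pref_All expands to three columns here while Own_All expands to two.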
I'm trying to modify the data in a data set based on a vector of columns to change. That way I could factorize the treatment based on a config file which would have the list of columns to change as a variable.
Ideally, I'd like to be able to use ddply like that :
column <- "var2"
df <- ddply(df, .(), transform, column = func(column))
This would replace each element of the column var2 with the result of func (here func trims a string in a particular way). In the example further down, the output would be the same data frame, but each letter in column "B" would have an "A" appended to it. I've tried several solutions, such as:
df[do.call(func, df[,column]), ]
which doesn't accept the df[,column] as argument (not a list), or
param = c("var1", "var2")
for(p in param){
df <- df[func(df[,p]),]
}
which destroys the other data, or
df[, column] <- lapply(df[, column], func)
Which doesn't work because it takes the whole column as argument instead of changing each element 1 by 1. I'm kinda out of ideas on how to make this treatment more automatic.
Example :
df <- data.frame(A=1:10, B=letters[2:11])
colname <- "B"
addA <- function(text) { paste0(text, "A") }
And I would like to do something like this :
df <- ddply(df, .(), transform, colname = addA(colname))
Though if the solution does not use ddply, it's not an issue, it's just what I'm the most used to
You could use mutate_at from package dplyr for this.
library(dplyr)
mutate_at(df, colname, addA)
A B
1 1 bA
2 2 cA
3 3 dA
4 4 eA
5 5 fA
6 6 gA
7 7 hA
8 8 iA
9 9 jA
10 10 kA
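If you prefer to stay in base R, a sketch along these lines should also work, because paste0 is already vectorised over the whole column. The columns vector stands in for the list read from your config file. In dplyr 1.0.0 and later, mutate_at is superseded by across(), so mutate(df, across(all_of(colname), addA)) is the newer spelling of the call above.

```r
df <- data.frame(A = 1:10, B = letters[2:11], stringsAsFactors = FALSE)
addA <- function(text) { paste0(text, "A") }

# Names of the columns to change, e.g. read from a config file
columns <- c("B")

# Apply the function to each listed column; the other columns are untouched
df[columns] <- lapply(df[columns], addA)
head(df, 3)
```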
I am trying to train a data that's converted from a document term matrix to a dataframe. There are separate fields for the positive and negative comments, so I wanted to add a string to the column names to serve as a "tag", to differentiate the same word coming from the different fields - for example, the word hello can appear both in the positive and negative comment fields (and thus, represented as a column in my dataframe), so in my model, I want to differentiate these by making the column names positive_hello and negative_hello.
I am looking for a way to rename columns in such a way that a specific string will be appended to all columns in the dataframe. Say, for mtcars, I want to rename all of the columns to have "_sample" at the end, so that the column names would become mpg_sample, cyl_sample, disp_sample and so on, which were originally mpg, cyl, and disp.
I'm considering using sapply or lapply, but I haven't made any progress with it. Any help would be greatly appreciated.
Use colnames and paste0 functions:
df = data.frame(x = 1:2, y = 2:1)
colnames(df)
[1] "x" "y"
colnames(df) <- paste0('tag_', colnames(df))
colnames(df)
[1] "tag_x" "tag_y"
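Applied to the cases in the question, a sketch could look like this. mtcars ships with R; the positive and negative data frames are made-up stand-ins for the converted document term matrices.

```r
# Suffix every column of mtcars; work on a copy so the original stays untouched
tagged <- mtcars
colnames(tagged) <- paste0(colnames(tagged), "_sample")
head(colnames(tagged), 3)

# Hypothetical term-count data frames standing in for the two comment fields
positive <- data.frame(hello = c(1, 0), great = c(2, 1))
negative <- data.frame(hello = c(0, 1), awful = c(1, 3))
colnames(positive) <- paste0("positive_", colnames(positive))
colnames(negative) <- paste0("negative_", colnames(negative))

# The two "hello" columns are now distinguishable after combining
combined <- cbind(positive, negative)
colnames(combined)
```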
If you want to prefix each item in a column with a string, you can use paste():
# Generate sample data
df <- data.frame(good=letters, bad=LETTERS)
# Use the paste() function to append the same word to each item in a column
df$good2 <- paste('positive', df$good, sep='_')
df$bad2 <- paste('negative', df$bad, sep='_')
# Look at the results
head(df)
good bad good2 bad2
1 a A positive_a negative_A
2 b B positive_b negative_B
3 c C positive_c negative_C
4 d D positive_d negative_D
5 e E positive_e negative_E
6 f F positive_f negative_F
Edit:
Looks like I misunderstood the question. But you can rename columns in a similar way:
colnames(df) <- paste(colnames(df), 'sample', sep='_')
colnames(df)
[1] "good_sample" "bad_sample" "good2_sample" "bad2_sample"
Or to rename one specific column (column one, in this case):
colnames(df)[1] <- paste('prefix', colnames(df)[1], sep='_')
colnames(df)
[1] "prefix_good_sample" "bad_sample" "good2_sample" "bad2_sample"
You can use setnames from the data.table package. It renames the columns by reference, so it doesn't create a copy of your data.
library(data.table)
df <- data.frame(a=c(1,2),b=c(3,4))
# a b
# 1 1 3
# 2 2 4
setnames(df,paste0(names(df),"_tag"))
print(df)
# a_tag b_tag
# 1 1 3
# 2 2 4
My goal is to be able to allocate column names to a data frame that I create based on a passed variable. For instance:
i='column1'
data.frame(i=1)
i
1 1
Above the column name is 'i' when I want it to be 'column1'. I know the following works but isn't as efficient as I'd like:
i='column1'
df<-data.frame(x=1)
setnames(df,i)
column1
1 1
It's good to learn how base R works this way:
i <- 'column1'
df <- `names<-`(data.frame(1), i)
df
# column1
#1 1
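The backtick call is the replacement function behind names(df) <- i. As a sketch of the same idea in a more readable form, setNames from the stats package wraps that assignment and returns the renamed object, so the data frame can still be created and named in one expression:

```r
i <- "column1"

# Name the data frame right after creating it
df1 <- setNames(data.frame(1), i)

# Or name the list first and let data.frame pick the names up
df2 <- data.frame(setNames(list(1), i))

names(df1)
names(df2)
```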
Aside from the answers posted by other users, I think you may be stuck with the solution you've already presented. If you already have a data frame with the intended number of rows, you can add a new column using brackets:
df <- data.frame('column1'=1)
i <- 'column2'
df[[i]] <- 2
df
column1 column2
1 1 2
If the idea is to get rid of setNames, you could do the following, although you would probably never do this in practice:
i <- 'column1'
data.frame(`attr<-`(list(1), "names", i))
# column1
# 1 1
You can see in data.frame, it has the code
x <- list(...)
vnames <- names(x)
so, you can mess with the name attribute.
Not exactly sure how you want it more efficient but you could add all the column names at once after your data frame has been assembled with colnames. Here's an example based on yours.
data.frame(Td)
a b
1 1 4
2 1 5
nam<-c("Test1","Test2")
colnames(Td)<-nam
data.frame(Td)
Test1 Test2
1 1 4
2 1 5
You could simply pass the name of your column variable and its values as arguments to a dataframe, without adding more lines:
df <- data.frame(column1=1)
df
# column1
#1 1