I am writing a function to automatically read in columns of several data frames. The name and the position of the column that I want to access differs sometimes
as a reproducable example:
a= c(1:3)
b= c(4:6)
A = data.frame(a,b)
colnames(A) = c("column1", "column2")
B = data.frame(b,a)
colnames(B) = c("col1", "col2")
Now I want to access the values a, something like:
`$`(A, "column1"|"col2")
Because logical operators don't work for characters, so I tried to go with
A[,which(colnames(A[,1]) == "column1" | names(A[,2]) =="col2")]
but it seems that the colnames()-function doesn't work that.
Does anyone have an idea how to approach this?
Try using grep. It works with both A and B.
A[, grep("column1|col2", names(A))]
#[1] 1 2 3
B[, grep("column1|col2", names(B))]
#[1] 1 2 3
Related
I have a series of dataframes that are noncumulative. During the cleaning process I want to loop through the dataframe and check if certain columns exists and if they aren't present create them. I can't for the life of me figure out a method to do this. I am not package shy and prefer them to base.
Any direction is much appreciated.
You can use this dummy data df and colToAdd columns to check if not exists to add
df <- data.frame(A = rnorm(5) , B = rnorm(5) , C = rnorm(5))
colToAdd <- c("B" , "D")
then apply the check if the column exists NULL produced else add your column e.g. rnorm(5)
add <- sapply(colToAdd , \(x) if(!(x %in% colnames(df))) rnorm(5))
data.frame(do.call(cbind , c(df , add)))
output
A B C D
1 1.5681665 -0.1767517 0.6658019 -0.8477818
2 -0.5814281 -1.0720196 0.5343765 -0.8259426
3 -0.5649507 -1.1552189 -0.8525945 1.0447395
4 1.2024881 -0.6584889 -0.1551638 0.5726059
5 0.7927576 0.5340098 -0.5139548 -0.7805733
I need to standardize how subgroups are referred to in a data set. To do this I need to identify when a variable matches one of several strings and then set a new variable with the standardized name. I am trying to do that with the following:
df <- data.frame(a = c(1,2,3,4), b = c(depression_male, depression_female, depression_hsgrad, depression_collgrad))
TestVector <- "male"
for (i in TestVector) {
df$grpl <- grepl(paste0(i), df$b)
df[ which(df$grpl == TRUE),]$standard <- "male"
}
The test vector will frequently have multiple elements. The grepl works (I was going to deal with the male/female match confusion later but I'll take suggestions on that) but the subsetting and setting a new variable doesn't. It would be better (and work) if I could transform the grepl output directly into the standard name variable.
Your only real issue is that you need to initialize the standard column. But we can simplify your code a bit:
df <- data.frame(a = c(1,2,3,4), b = c("depression_male", "depression_female", "depression_hsgrad", "depression_collgrad"))
TestVector <- "male"
df$standard <- NA
for (i in TestVector) {
df[ grepl(i, df$b), "standard"] <- "male"
}
df
# a b standard
# 1 1 depression_male male
# 2 2 depression_female male
# 3 3 depression_hsgrad <NA>
# 4 4 depression_collgrad <NA>
Then you've got the issue that the "male" pattern matches "female" as well.
Perhaps you're looking for sub instead? It works like find/replace:
df$standard = sub(pattern = "depression_", replacement = "", df$b)
df
# a b standard
# 1 1 depression_male male
# 2 2 depression_female female
# 3 3 depression_hsgrad hsgrad
# 4 4 depression_collgrad collgrad
It's hard to generalize what will be best in your case without more example input/output pairs. If all your data is of the form "depression_" this will work well. Or maybe the standard name is always after an underscore, so you could use pattern = ".*_" to replace everything before the last underscore. Or maybe something else... Hopefully these ideas give you a good start.
I am trying to train a data that's converted from a document term matrix to a dataframe. There are separate fields for the positive and negative comments, so I wanted to add a string to the column names to serve as a "tag", to differentiate the same word coming from the different fields - for example, the word hello can appear both in the positive and negative comment fields (and thus, represented as a column in my dataframe), so in my model, I want to differentiate these by making the column names positive_hello and negative_hello.
I am looking for a way to rename columns in such a way that a specific string will be appended to all columns in the dataframe. Say, for mtcars, I want to rename all of the columns to have "_sample" at the end, so that the column names would become mpg_sample, cyl_sample, disp_sample and so on, which were originally mpg, cyl, and disp.
I'm considering using sapplyor lapply, but I haven't had any progress on it. Any help would be greatly appreciated.
Use colnames and paste0 functions:
df = data.frame(x = 1:2, y = 2:1)
colnames(df)
[1] "x" "y"
colnames(df) <- paste0('tag_', colnames(df))
colnames(df)
[1] "tag_x" "tag_y"
If you want to prefix each item in a column with a string, you can use paste():
# Generate sample data
df <- data.frame(good=letters, bad=LETTERS)
# Use the paste() function to append the same word to each item in a column
df$good2 <- paste('positive', df$good, sep='_')
df$bad2 <- paste('negative', df$bad, sep='_')
# Look at the results
head(df)
good bad good2 bad2
1 a A positive_a negative_A
2 b B positive_b negative_B
3 c C positive_c negative_C
4 d D positive_d negative_D
5 e E positive_e negative_E
6 f F positive_f negative_F
Edit:
Looks like I misunderstood the question. But you can rename columns in a similar way:
colnames(df) <- paste(colnames(df), 'sample', sep='_')
colnames(df)
[1] "good_sample" "bad_sample" "good2_sample" "bad2_sample"
Or to rename one specific column (column one, in this case):
colnames(df)[1] <- paste('prefix', colnames(df)[1], sep='_')
colnames(df)
[1] "prefix_good_sample" "bad_sample" "good2_sample" "bad2_sample"
You can use setnames from the data.table package, it doesn't create any copy of your data.
library(data.table)
df <- data.frame(a=c(1,2),b=c(3,4))
# a b
# 1 1 3
# 2 2 4
setnames(df,paste0(names(df),"_tag"))
print(df)
# a_tag b_tag
# 1 1 3
# 2 2 4
I have two dataframes which look like follows:
df1 <- data.frame(V1 = 1:4, V2 = rep(2, 4), V3 = 7:4)
df2 <- data.frame(V2 = rep(NA, 4), V1 = rep(NA, 4), V3 = rep(NA, 4))
I need to write a function which assigns the values of df1 to df2, if the columnnames of both dataframes are the same. The structure of the function should look like this:
fun <- function(x){
if(# If the name of x is the same like the name of a column in df1)
out <- df1$? # Here I need to assign df1$"x" somehow
out
}
fun(df2$V1)
The output should look like this:
[1] 1 2 3 4
Unfortunately I couldnt find a solution by myself. Is there a way how I could do this? Thank you very much in advance!
I need to write a function which assigns the values of df1 to df2, if
the columnnames of both dataframes are the same.
Are you sure you need a function?
names_in_common <- intersect(names(df1),names(df2))
df2[,names_in_common] <- df1[,names_in_common]
Using Joachim Schork's code:
names_in_common <- intersect(names(df1),names(df2))
df2[,names_in_common] <- df1[,names_in_common]
and if you want to change a single column of df2:
names_in_common <- intersect(names(df1), names(df2[, "V1", drop=FALSE]))
df2[,names_in_common] <- df1[,names_in_common]
This is impossible, because when you access a column of a data.frame using the dollar syntax you lose the column name. There's no way for fun() to determine the column name of the vector that was passed in as an argument.
Instead, you can simply call fun() using the column name itself as the argument, rather than the vector of NAs, which are not useful and not used at all inside the function. In other words, the call becomes
fun('V1');
Then you can write the function as follows:
fun <- function(name) df1[[name]];
Demo:
fun('V1');
## [1] 1 2 3 4
Although now that I think about it, you might as well just index df1 directly, since that's all the function does now:
df1$V1;
## [1] 1 2 3 4
Rereading your question, you said you want to assign the column from df1 to df2, although your example code doesn't do that. Assuming you did want to carry out this assignment inside the function, you could do this:
fun <- function(name) df2[[name]] <<- df1[[name]];
Demo:
fun('V1');
df2;
## V2 V1 V3
## 1 NA 1 NA
## 2 NA 2 NA
## 3 NA 3 NA
## 4 NA 4 NA
This makes use of the superassignment operator <<-.
I want to update one column of a dataframe, referencing it using its original name, is this possible? For example say I had the table 'data'
a b c
1 2 2
3 2 3
4 1 2
and I wanted to update the name of column b to 'd'. I know I could use
colnames(data)[2] <- 'd'
but can I make the change by specifically referencing b, i.e. something like
colnames(data)['b'] <- 'd'
so that if the column ordering of the dataframe changes the correct column name will still be updated.
Thanks in advance
There is a function setnames built into package data.table for exactly that.
setnames(DT, "b", "d")
It changes the names by reference with no copy at all. Any other method using names(data)<- or names(data)[i]<- or similar will copy the entire object, usually several times. Even though all you're doing is changing a column name.
DT must be type data.table for setnames to work, though. So you'd need to switch to data.table or convert using as.data.table, to use it.
Here is the extract from ?setnames. The intention is that you run example(setnames) at the prompt and then the comments relate to the copies you see being reported by tracemem.
DF = data.frame(a=1:2,b=3:4) # base data.frame to demo copies
tracemem(DF)
colnames(DF)[1] <- "A" # 4 copies of entire object
names(DF)[1] <- "A" # 3 copies of entire object
names(DF) <- c("A", "b") # 2 copies of entire object
`names<-`(DF,c("A","b")) # 1 copy of entire object
x=`names<-`(DF,c("A","b")) # still 1 copy (so not print method)
# What if DF is large, say 10GB in RAM. Copy 10GB just to change a column name?
DT = data.table(a=1:2,b=3:4,c=5:6)
tracemem(DT)
setnames(DT,"b","B") # by name; no match() needed. No copy.
setnames(DT,3,"C") # by position. No copy.
setnames(DT,2:3,c("D","E")) # multiple. No copy.
setnames(DT,c("a","E"),c("A","F")) # multiple by name. No copy.
setnames(DT,c("X","Y","Z")) # replace all. No copy.
As of October 2014 this can now be done easily in the dplyr package:
rename(data, d = b)
This seems like a hack, but the first thing that came to mind was to use grepl() with a sufficiently detailed enough search string to only get the column you want. I'm sure there are better options:
dat <- data.frame(a = 1:3, b = 1:3, c = 1:3)
colnames(dat)[grepl("b", colnames(dat))] <- "foo"
dat
#------
a foo c
1 1 1 1
2 2 2 2
3 3 3 3
As Joran points out below, I overcomplicated things...no need for a regex at all. This saves a few characters on the typing too.
colnames(dat)[colnames(dat) == "foo"] <- "bar"
#------
a bar c
1 1 1 1
2 2 2 2
3 3 3 3
Yes but it's more difficult (as far as I know) than numeric indexing. I'm going to provide a dirty function that will do this and if you want to see how to do it just tear the function apart line by line:
rename <- function(df, column, new){
x <- names(df) #Did this to avoid typing twice
if (is.numeric(column)) column <- x[column] #Take numeric input by indexing
names(df)[x %in% column] <- new #What you're interested in
return(df)
}
#try it out
rename(mtcars, 'mpg', 'NEW')
rename(mtcars, 1, 'NEW')
I disagree with #Chase - the grepl solution ain't the luckiest one. I'd say: go with simple ==. Here's why:
d <- data.frame(matrix(rnorm(100), 10))
colnames(d) <- replicate(10, paste(sample(letters[1:5], size = 5, replace=TRUE, prob=c(.1, .6, .1, .1, .1)), collapse = ""))
Now try doing grepl("b", colnames(d)). Either pass fixed = TRUE, or even better do simple colnames(d) == "b" like #joran suggested. Regex matching will always be slower than ==, so for simple tasks like this you may want to use simple ==.