Subsetting data frame by vector of elements - r

I spent about 20 minutes looking through previous questions, but could not find what I am looking for. I have a large data frame I want to subset down based on a list of names, but the names in the data frame can also have a postfix not indicated in the list.
In other words, is there a simpler generic way (for infinite numbers of postfixes) to do the following:
data <- data.frame("name"=c("name1","name1_post1","name2","name2_post1",
"name2_post2","name3","name4"),
"data"=rnorm(7,0,1),
stringsAsFactors=FALSE)
names <- c("name2","name3")
subset <- data[ data$name %in% names | data$name %in% paste0(names,"_post1") | data$name %in% paste0(names,"_post2") , ]
In response to @Arun's answer: the names in my data actually include more than one underscore, which makes the problem more complicated.
data <- data.frame("name"=c("name1_target_time","name1_target_time_post1","name2_target_time","name2_target_time_post1",
"name2_target_time_post2","name3_target_time","name4_target_time"),
"data"=rnorm(7,0,1),
stringsAsFactors=FALSE)
names <- c("name2_target_time","name3_target_time")
subset <- data[ data$name %in% names | data$name %in% paste0(names,"_post1") | data$name %in% paste0(names,"_post2") , ]

Edit: solution using regular expressions (following OP's follow-up in comment):
data[grepl(paste(names, collapse="|"), data$name), ]
# name data
# 3 name2 1.4934931
# 4 name2_post1 -1.6070809
# 5 name2_post2 -0.4157518
# 6 name3 0.4220084
On your new data:
# name data
# 3 name2_target_time 0.6295361
# 4 name2_target_time_post1 0.8951720
# 5 name2_target_time_post2 0.6602126
# 6 name3_target_time 2.2734835
As @flodel shows in the comments, this also works fine:
subset(data, sub("_post\\d+$", "", name) %in% names)
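One caveat with the plain alternation pattern above is that grepl matches anywhere in the string, so name2 would also pick up a hypothetical name20. The following is a minimal sketch of an anchored variant. It assumes every postfix starts with an underscore and that the names contain no regex metacharacters; the extra name20 row and the variables wanted and pattern are illustrative and not part of the original question.

```r
set.seed(1)
data <- data.frame(
  name = c("name1", "name1_post1", "name2", "name2_post1",
           "name2_post2", "name3", "name4", "name20"),
  data = rnorm(8),
  stringsAsFactors = FALSE
)
wanted <- c("name2", "name3")

# ^( ... )   : the name must start at the beginning of the string
# (_.*)?$    : optionally followed by an underscore and any postfix
pattern <- paste0("^(", paste(wanted, collapse = "|"), ")(_.*)?$")

result <- data[grepl(pattern, data$name), ]
result$name
# "name2" "name2_post1" "name2_post2" "name3"
```

Because the postfix part is optional and unrestricted, this handles any number of different postfixes, and it also works when the base names themselves contain underscores.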
Old solution:
data[sapply(strsplit(data$name, "_"), "[[", 1) %in% names, ]
# name data
# 3 name2 1.4934931
# 4 name2_post1 -1.6070809
# 5 name2_post2 -0.4157518
# 6 name3 0.4220084
The idea: First split the string at _ using strsplit. This results in a list. For ex: name2 will result in just name2 (first element of the list). But name2_post1 will result in name2 and post1 (second element of the list). By wrapping it with sapply and using [[ with 1, we can select just the "first" element of this resulting list. Then we can use that with %in% to check if they are present in names (which is straightforward).
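The intermediate steps can be inspected one at a time. This is only an illustrative sketch with made-up vectors x and y; it also shows why the approach stops working once the base names contain underscores themselves, as in the follow-up data.

```r
x <- c("name2", "name2_post1", "name3")

# Step 1: split every string at "_" (the result is a list of character vectors)
parts <- strsplit(x, "_")

# Step 2: keep only the first piece of each element
first <- sapply(parts, "[[", 1)
first
# "name2" "name2" "name3"

# Step 3: compare the first pieces with the wanted names
keep <- first %in% c("name2", "name3")

# With several underscores, the first piece is no longer the full base name
y <- "name2_target_time_post1"
first_y <- sapply(strsplit(y, "_"), "[[", 1)
first_y
# "name2"
```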

A grep solution would probably look something like the following:
subset <- data[grep("(name2)|(name3)", data$name), ]

Related

How to get a value from vector and assign it as a name of new dataframe

In R, I have a vector called NAME:
[1] "ALKR50SV" "AMKR71SV" "AOKR71SV" "AZKR52SV" "BFKR70SV" "BJKR61SV" "BUKR6HSV"
"CDKR61SV" "CFKR31SV"
I want to use them as a name for each new dataframe
Like a dataframe named ALKR50SV, a dataframe named AMKR71SV, and so on.
for loop like:
NAME[i] <- data1
will cause problem.
What should I do? Thank you.
As @joran and @neilfws said, it is best to work with a list of data.frames.
For example, consider the following list of three data.frames
lst <- lapply(1:3, function(x) as.data.frame(matrix(sample(20), ncol = 4)));
You can name list elements
names(lst) <- c("ALKR50SV", "AMKR71SV", "AOKR71SV");
and operate on list elements using lapply, e.g.
lapply(lst, dim);
#$ALKR50SV
#[1] 5 4
#
#$AMKR71SV
#[1] 5 4
#
#$AOKR71SV
#[1] 5 4
You can use assign:
nms <- c('one', 'two', 'three')
for (i in 1:3) {
  assign(nms[i], i)
}
one # 1
two # 2
three # 3
But as others have commented, it is most likely better to put your dataframes into a named list.
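A rough sketch of the named-list approach inside a loop might look like the following. The data frames built here are placeholders (the question does not show what data1 contains), and frames is a name chosen for illustration.

```r
NAME <- c("ALKR50SV", "AMKR71SV", "AOKR71SV")

# Pre-allocate a list with one named slot per data frame
frames <- vector("list", length(NAME))
names(frames) <- NAME

for (nm in NAME) {
  # Placeholder data; in practice this would be the real data frame for nm
  frames[[nm]] <- data.frame(id = 1:3, source = nm, stringsAsFactors = FALSE)
}

# Access a data frame by name
frames[["AMKR71SV"]]
```

Each data frame stays reachable by its name, and the whole collection can still be processed at once with lapply.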

What's the best way to add a specific string to all column names in a dataframe in R?

I am trying to train a data that's converted from a document term matrix to a dataframe. There are separate fields for the positive and negative comments, so I wanted to add a string to the column names to serve as a "tag", to differentiate the same word coming from the different fields - for example, the word hello can appear both in the positive and negative comment fields (and thus, represented as a column in my dataframe), so in my model, I want to differentiate these by making the column names positive_hello and negative_hello.
I am looking for a way to rename columns in such a way that a specific string will be appended to all columns in the dataframe. Say, for mtcars, I want to rename all of the columns to have "_sample" at the end, so that the column names would become mpg_sample, cyl_sample, disp_sample and so on, which were originally mpg, cyl, and disp.
I'm considering using sapply or lapply, but I haven't made any progress with it. Any help would be greatly appreciated.
Use colnames and paste0 functions:
df = data.frame(x = 1:2, y = 2:1)
colnames(df)
[1] "x" "y"
colnames(df) <- paste0('tag_', colnames(df))
colnames(df)
[1] "tag_x" "tag_y"
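Applied to the two cases in the question, the same idea could be sketched as follows. The pos and neg data frames are tiny invented stand-ins for the positive and negative document-term matrices, and cars_sample is just a copy of mtcars used for illustration.

```r
# Suffix every column of mtcars with "_sample"
cars_sample <- mtcars
names(cars_sample) <- paste0(names(cars_sample), "_sample")
head(names(cars_sample), 3)
# "mpg_sample" "cyl_sample" "disp_sample"

# Hypothetical term counts from the positive and negative comment fields
pos <- data.frame(hello = c(1, 0), great = c(0, 2))
neg <- data.frame(hello = c(0, 1), awful = c(3, 0))

# Tag each set of columns before combining, so "hello" stays distinguishable
names(pos) <- paste0("positive_", names(pos))
names(neg) <- paste0("negative_", names(neg))
combined <- cbind(pos, neg)
names(combined)
# "positive_hello" "positive_great" "negative_hello" "negative_awful"
```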
If you want to prefix each item in a column with a string, you can use paste():
# Generate sample data
df <- data.frame(good=letters, bad=LETTERS)
# Use the paste() function to append the same word to each item in a column
df$good2 <- paste('positive', df$good, sep='_')
df$bad2 <- paste('negative', df$bad, sep='_')
# Look at the results
head(df)
good bad good2 bad2
1 a A positive_a negative_A
2 b B positive_b negative_B
3 c C positive_c negative_C
4 d D positive_d negative_D
5 e E positive_e negative_E
6 f F positive_f negative_F
Edit:
Looks like I misunderstood the question. But you can rename columns in a similar way:
colnames(df) <- paste(colnames(df), 'sample', sep='_')
colnames(df)
[1] "good_sample" "bad_sample" "good2_sample" "bad2_sample"
Or to rename one specific column (column one, in this case):
colnames(df)[1] <- paste('prefix', colnames(df)[1], sep='_')
colnames(df)
[1] "prefix_good_sample" "bad_sample" "good2_sample" "bad2_sample"
You can use setnames from the data.table package, it doesn't create any copy of your data.
library(data.table)
df <- data.frame(a=c(1,2),b=c(3,4))
# a b
# 1 1 3
# 2 2 4
setnames(df,paste0(names(df),"_tag"))
print(df)
# a_tag b_tag
# 1 1 3
# 2 2 4

Splitting a dataframe by column name indices

This is a variation of an earlier question.
df <- data.frame(matrix(rnorm(9*9), ncol=9))
names(df) <- c("c_1", "d_1", "e_1", "a_p", "b_p", "c_p", "1_o1", "2_o1", "3_o1")
I want to split the dataframe by the index that is given in the column.names after the underscore "_". (The indices can be any character/number in different lengths; these are just random examples).
indx <- gsub(".*_", "", names(df))
and name the resulting dataframes accordingly. In the end I would like to get three dataframes, called:
df_1
df_p
df_o1
Thank you!
Here, you can split the column names by indx, subset the data for each group using lapply and [, set the names of the list elements using setNames, and use list2env if you need them as individual datasets. That last step is not really recommended, as most operations can be done within the list, and each element can later be saved using write.table with lapply if you want.
list2env(
setNames(
lapply(split(colnames(df), indx), function(x) df[x]),
paste('df', sort(unique(indx)), sep="_")),
envir=.GlobalEnv)
head(df_1,2)
# c_1 d_1 e_1
#1 1.0085829 -0.7219199 0.3502958
#2 -0.9069805 -0.7043354 -1.1974415
head(df_o1,2)
# 1_o1 2_o1 3_o1
#1 0.7924930 0.434396 1.7388130
#2 0.9202404 -2.079311 -0.6567794
head(df_p,2)
# a_p b_p c_p
#1 -0.12392272 -1.183582 0.8176486
#2 0.06330595 -0.659597 -0.6350215
Or using Map. This is similar to the above approach ie. split the column names by indx and use [ to extract the columns, and the rest is as above.
list2env(setNames(Map(`[` ,
list(df), split(colnames(df), indx)),
paste('df',unique(sort(indx)), sep="_")), envir=.GlobalEnv)
Update
You can do:
indx1 <- factor(indx, levels=unique(indx))
split(colnames(df), indx1)
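The reason for the factor is that split orders its groups by factor level, and a plain character vector is converted to a factor with sorted levels. Setting the levels explicitly keeps the groups in order of first appearance. Below is a small sketch of this that also keeps the pieces in a named list rather than in the global environment; lst is an illustrative name.

```r
set.seed(42)
df <- data.frame(matrix(rnorm(9 * 9), ncol = 9))
names(df) <- c("c_1", "d_1", "e_1", "a_p", "b_p", "c_p", "1_o1", "2_o1", "3_o1")
indx <- gsub(".*_", "", names(df))

# Levels in order of first appearance: "1", "p", "o1"
indx1 <- factor(indx, levels = unique(indx))

# One data frame per group, kept together in a named list
lst <- lapply(split(colnames(df), indx1), function(cols) df[cols])
names(lst) <- paste("df", names(lst), sep = "_")

names(lst)
# "df_1" "df_p" "df_o1"
names(lst$df_o1)
# "1_o1" "2_o1" "3_o1"
```

Taking the names from the split result itself, rather than recomputing them with sort and unique, avoids any mismatch between the group order and the labels.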
You can try this:
invisible(sapply(unique(indx),
function(x)
assign(paste("df",x,sep="_"),
df[,grepl(paste0("_",x,"$"),colnames(df))],
envir=.GlobalEnv)))
# the code applies to each unique element of indx the assignement (in the global environment)
# of the columns corresponding to indx in a new data.frame, named according to the indx.
# invisible function avoids that the data.frames are printed on screen.
> ls()
[1] "df" "df_1" "df_o1" "df_p" "indx"
> df_1[1:3,]
c_1 d_1 e_1
1 1.8033188 0.5578494 2.2458750
2 1.0095556 -0.4042410 -0.9274981
3 0.7122638 1.4677821 0.7770603
> df_o1[1:3,]
1_o1 2_o1 3_o1
1 -2.05854176 -0.92394923 -0.4932116
2 -0.05743123 -0.24143979 1.9060076
3 0.68055653 -0.70908036 1.4514368
> df_p[1:3,]
a_p b_p c_p
1 -0.2106823 -0.1170719 2.3205184
2 -0.1826542 -0.5138504 1.9341230
3 -1.0551739 -0.2990706 0.5054421

Accessing column names in a data frame

I have a data frame with column names z_1, z_2, up to z_200. In the following example, for ease of representation, I am showing only z_1.
df <- data.frame(x=1:5, y=2:6, z_1=3:7, u=4:8)
df
i=1
tmp <- paste("z",i,sep="_")
subset(df, select=-c(tmp))
The above code will be used in a loop i for accessing certain elements that need to be removed from the data frame
While executing the above code, I get the error "Error in -c(tmp) : invalid argument to unary operator"
Thank you for your help
Try:
df[names(df)!=tmp]
Your code fails because the unary minus in -c(tmp) is not defined for character vectors, which is what triggers the "invalid argument to unary operator" error. Negative indexing of this kind works with numeric values only.
Alternatively this would also work:
subset(df, select=-which(names(df)==tmp))
Because which returns a number.
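Since the question mentions looping over i, it may be simpler to build all the unwanted names at once and drop them in a single step. This is a sketch under the assumption that the columns to remove are exactly z_1 to z_200; the extra z_2 column and the names drop and kept are added here for illustration.

```r
df <- data.frame(x = 1:5, y = 2:6, z_1 = 3:7, z_2 = 4:8, u = 4:8)

# All candidate names; ones that do not exist in df are simply ignored
drop <- paste("z", 1:200, sep = "_")

# Logical indexing on the column names avoids the unary-minus problem
kept <- df[!(names(df) %in% drop)]
names(kept)
# "x" "y" "u"
```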
If you want to use subset and have a large number of similarly named columns to include or exclude, consider using grepl to construct a logical vector of matches against the column names (or grep to construct a numeric vector just as easily). Negating the result removes the matching columns:
df <- data.frame(x=1:5, y=2:6, z_1=3:7, u=4:8)
df
i=1
tmp <- paste("z",i,sep="_")
subset(df, select= !grepl("^z", names(df) ) )
x y u
1 1 2 4
2 2 3 5
3 3 4 6
4 4 5 7
5 5 6 8
With negation this lets you remove (or without it include) all of the columns starting with "z" using that pattern. Or you can use grep with value =TRUE in combination with character values:
subset(df, select= c("x", grep("^z", names(df), value=TRUE ) ) )

Does column exist and how to rearrange columns in R data frame

How do I add a column in the middle of an R data frame? I want to see if I have a column named "LastName" and then add it as the third column if it does not already exist.
One approach is to just add the column to the end of the data frame, and then use subsetting to move it into the desired position:
d$LastName <- c("Flim", "Flom", "Flam")
bar <- d[c("x", "y", "LastName", "fac")]
1) Testing for existence: Use %in% on the colnames, e.g.
> example(data.frame) # to get 'd'
> "fac" %in% colnames(d)
[1] TRUE
> "bar" %in% colnames(d)
[1] FALSE
2) You essentially have to create a new data.frame from the first half of the old, your new column, and the second half:
> bar <- data.frame(d[1:3,1:2], LastName=c("Flim", "Flom", "Flam"), fac=d[1:3,3])
> bar
x y LastName fac
1 1 1 Flim C
2 1 2 Flom A
3 1 3 Flam A
>
Of the many silly little helper functions I've written, this gets used every time I load R. It just makes a list of the column names and indices but I use it constantly.
##creates an object from a data.frame listing the column names and location
namesind <- function(df) {
  # one row per column: its name (VAR) and its position (COL)
  data.frame(VAR = names(df), COL = seq_along(df))
}
ni <- namesind
Use ni to see your column numbers. (ni is just an alias for namesind, I never use namesind but thought it was a better name originally) Then if you want insert your column in say, position 12, and your data.frame is named bob with 20 columns, it would be
bob2 <- data.frame(bob[,1:11], newcolumn, bob[,12:20])
though I liked the add at the end and rearrange answer from Hadley as well.
Dirk Eddelbuettel's answer works, but you don't need to indicate row numbers or specify entries in the lastname column. This code should do it for a data frame named df:
if(!("LastName" %in% names(df))){
df <- cbind(df[1:2],LastName=NA,df[3:length(df)])
}
(this defaults LastName to NA, but you could just as easily use "LastName='Smith'")
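The check and the insertion can be wrapped in a small helper. The function below is a hypothetical sketch (add_column_if_missing and its arguments are not from any package); it assumes the new column should be filled with a single default value and placed after a given column position.

```r
# Hypothetical helper: insert column `name` after position `after`
# unless a column with that name already exists
add_column_if_missing <- function(df, name, value = NA, after = 2) {
  if (name %in% names(df)) {
    return(df)  # column already present: return the data unchanged
  }
  after <- min(after, ncol(df))
  new_col <- data.frame(rep(value, length.out = nrow(df)),
                        stringsAsFactors = FALSE)
  names(new_col) <- name
  left  <- df[seq_len(after)]                       # columns before the gap
  right <- df[seq_len(ncol(df) - after) + after]    # columns after the gap
  cbind(left, new_col, right)
}

d <- data.frame(x = 1, y = 1:3, fac = c("C", "A", "A"))
out <- add_column_if_missing(d, "LastName")
names(out)
# "x" "y" "LastName" "fac"
```

Calling the helper a second time on out returns it unchanged, because the existence check comes first.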
or using cbind:
> example(data.frame) # to get 'd'
> bar <- cbind(d[1:3,1:2],LastName=c("Flim", "Flom", "Flam"),fac=d[1:3,3])
> bar
x y LastName fac
1 1 1 Flim A
2 1 2 Flom B
3 1 3 Flam B
I always thought something like append() [though unfortunate the name is] should be a generic function
## redefine append() as generic function
append.default <- append
append <- `body<-`(args(append),value=quote(UseMethod("append")))
append.data.frame <- function(x,values,after=length(x))
`row.names<-`(data.frame(append.default(x,values,after)),
row.names(x))
## apply the function
d <- (if( !"LastName" %in% names(d) )
append(d,values=list(LastName=c("Flim","Flom","Flam")),after=2) else d)
