Remove an entire column from a data.frame in R

Does anyone know how to remove an entire column from a data.frame in R? For example if I am given this data.frame:
> head(data)
chr genome region
1 chr1 hg19_refGene CDS
2 chr1 hg19_refGene exon
3 chr1 hg19_refGene CDS
4 chr1 hg19_refGene exon
5 chr1 hg19_refGene CDS
6 chr1 hg19_refGene exon
and I want to remove the 2nd column.

You can set it to NULL.
> Data$genome <- NULL
> head(Data)
chr region
1 chr1 CDS
2 chr1 exon
3 chr1 CDS
4 chr1 exon
5 chr1 CDS
6 chr1 exon
As pointed out in the comments, here are some other possibilities:
Data[2] <- NULL # Wojciech Sobala
Data[[2]] <- NULL # same as above
Data <- Data[,-2] # Ian Fellows
Data <- Data[-2] # same as above
You can remove multiple columns via:
Data[1:2] <- list(NULL) # Marek
Data[1:2] <- NULL # does not work!
Be careful with matrix-subsetting though, as you can end up with a vector:
Data <- Data[,-(2:3)] # vector
Data <- Data[,-(2:3),drop=FALSE] # still a data.frame
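To see the difference between the matrix-style and list-style forms, here is a small sketch on a made-up three-column data frame (the name Data3 and its contents are only illustrative, not the asker's data):

```r
# Illustrative three-column data frame (not the asker's data)
Data3 <- data.frame(chr = c("chr1", "chr1"),
                    genome = c("hg19_refGene", "hg19_refGene"),
                    region = c("CDS", "exon"),
                    stringsAsFactors = FALSE)

as_vector  <- Data3[, -(2:3)]                # one column left: collapses to a vector
as_frame   <- Data3[, -(2:3), drop = FALSE]  # one column left: stays a data.frame
list_style <- Data3[-(2:3)]                  # list-style indexing never drops

is.data.frame(as_vector)    # FALSE
is.data.frame(as_frame)     # TRUE
is.data.frame(list_style)   # TRUE
```

The list-style form Data[-2] is usually the safer default, since it returns a data.frame regardless of how many columns remain.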

To remove one or more columns by name, when the column names are known (as opposed to being determined at run-time), I like the subset() syntax. E.g. for the data-frame
df <- data.frame(a=1:3, d=2:4, c=3:5, b=4:6)
to remove just the a column you could do
df <- subset( df, select = -a )
and to remove the b and d columns you could do
df <- subset( df, select = -c(d, b ) )
You can remove all columns between d and b with:
df <- subset( df, select = -c( d : b ) )
As I said above, this syntax works only when the column names are known. It won't work when say the column names are determined programmatically (i.e. assigned to a variable). I'll reproduce this Warning from the ?subset documentation:
Warning:
This is a convenience function intended for use interactively.
For programming it is better to use the standard subsetting
functions like '[', and in particular the non-standard evaluation
of argument 'subset' can have unanticipated consequences.

(For completeness) If you want to remove columns by name, you can do this:
cols.dont.want <- "genome"
cols.dont.want <- c("genome", "region") # if you want to remove multiple columns
data <- data[, ! names(data) %in% cols.dont.want, drop = FALSE]
Including drop = FALSE ensures that the result will still be a data.frame even if only one column remains.
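If the names to drop are computed at run time, the same idiom can be wrapped in a small function. This is only a sketch: drop_cols and example_df are hypothetical names introduced for illustration, not part of any package.

```r
# Hypothetical helper: drop columns whose names are held in a character vector
drop_cols <- function(df, cols) {
  df[, !names(df) %in% cols, drop = FALSE]
}

# Illustrative one-row data frame
example_df <- data.frame(chr = "chr1", genome = "hg19_refGene", region = "CDS")

# Names that are absent from the data are simply ignored by %in%
cols.dont.want <- c("genome", "not_a_column")

kept <- drop_cols(example_df, cols.dont.want)
names(kept)   # "chr" "region"
```

Because %in% produces a logical vector of the same length as names(df), the result is well defined even when nothing matches.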

The posted answers are very good when working with data.frames. However, these tasks can be pretty inefficient from a memory perspective. With large data, removing a column can take an unusually long amount of time and/or fail due to out of memory errors. Package data.table helps address this problem with the := operator:
library(data.table)
> dt <- data.table(a = 1, b = 1, c = 1)
> dt[,a:=NULL]
b c
[1,] 1 1
I should put together a bigger example to show the differences. I'll update this answer at some point with that.
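In the meantime, a short sketch of removing several columns by reference, assuming the data.table package is installed (the table contents are made up):

```r
library(data.table)

dt <- data.table(a = 1, b = 1, c = 1, d = 1)

# Drop several columns in place; the remaining columns are not copied
dt[, c("a", "b") := NULL]

# Column names held in a variable: wrap the variable in parentheses
cols.dont.want <- "c"
dt[, (cols.dont.want) := NULL]

names(dt)   # "d"
```

Note that := modifies dt itself, so there is no need to assign the result back.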

There are several options for removing one or more columns with dplyr::select() and some helper functions. The helper functions can be useful because some do not require naming all the specific columns to be dropped. Note that to drop columns using select() you need to use a leading - to negate the column names.
Using the dplyr::starwars sample data for some variety in column names:
library(dplyr)
starwars %>%
select(-height) %>% # a specific column name
select(-one_of('mass', 'films')) %>% # any columns named in one_of()
select(-(name:hair_color)) %>% # the range of columns from 'name' to 'hair_color'
select(-contains('color')) %>% # any column name that contains 'color'
select(-starts_with('bi')) %>% # any column name that starts with 'bi'
select(-ends_with('er')) %>% # any column name that ends with 'er'
select(-matches('^v.+s$')) %>% # any column name matching the regex pattern
select_if(~!is.list(.)) %>% # not by column name but by data type
head(2)
# A tibble: 2 x 2
homeworld species
<chr> <chr>
1 Tatooine Human
2 Tatooine Droid
You can also drop by column number:
starwars %>%
select(-2, -(4:10)) # column 2 and columns 4 through 10

This removes the column and stores the result in a new variable:
df = subset(data, select = -c(genome) )

Using dplyr, the following works:
data <- select(data, -genome)
as per documentation found here https://www.marsja.se/how-to-remove-a-column-in-r-using-dplyr-by-name-and-index/#:~:text=select(starwars%2C%20%2Dheight)

I just thought I'd add one in that wasn't mentioned yet. It's simple but also interesting because in all my perusing of the internet I did not see it, even though the highly related %in% appears in many places.
df <- df[ , -which(names(df) == 'removeCol')]
Also, I didn't see anyone post grep alternatives. These can be very handy for removing multiple columns that match a pattern.
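A possible grep-style version, sketched in base R on a made-up data frame, together with a pitfall of the -which() form: when nothing matches, which() returns integer(0), and negative indexing with an empty vector selects no columns at all.

```r
# Illustrative data frame; the column names are made up
df <- data.frame(id = 1:2, score_a = 3:4, score_b = 5:6, note = c("x", "y"))

# grepl() returns a logical vector, so negating it is safe even with no match
no_scores <- df[, !grepl("^score_", names(df)), drop = FALSE]
names(no_scores)   # "id" "note"

# Pitfall with -which(): no match gives integer(0), and df[, -integer(0)]
# selects zero columns instead of leaving df unchanged
n_after_which <- ncol(df[, -which(names(df) == "removeCol"), drop = FALSE])
n_after_which      # 0

# The logical form keeps all columns when nothing matches
n_after_logical <- ncol(df[, names(df) != "removeCol", drop = FALSE])
n_after_logical    # 4
```

For that reason the logical forms (!=, %in%, grepl) tend to be more robust than -which() or -grep() when the column may be absent.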

Related

R - dplyr joining two df with conditions for rows (char)

I am still a beginner with stackoverflow and dplyr. Perhaps that's why I couldn't find any other similar question.
Problem:
I have two df.
df1 contains a variable ("a") whose entries I want to compare with the entries of a variable in df2 ("c"). Both variables are characters.
If I have a match between both dfs I want to add an entry in a new column ("new") which contains the string from df1 ("birne" etc.).
However, the length of each entry differs between the two variables, so perhaps str_detect or ends_with would be helpful.
##DFs
df1 <-data.frame("a"= c("055","022","010","0105","0777","077"), "b"= c("birne", "apfel", "banane","traube","blaubeere","kiwi"))
df2 <-data.frame("c"= c("GX00000055","GX0000022","GX00000010","GX00000105","GX0000777","GX0000077"))
## I want
df2_newcolumn<-data.frame("c"= c("GX00000055","GX0000022","GX00000010","GX00000105","GX0000777","GX0000077"), "new"=c("birne", "apfel","NA","NA","blaubeere","NA"))
I thought I could get it using left_join and filter in combination with ends_with, grepl or str_detect. However, I struggled to get the right combination and order of commands.
I cannot reproduce your desired output (why are there NAs in it?), but a regex join might be what you need:
library(tidyverse)
library(fuzzyjoin)
df2 %>%
regex_left_join(df1 %>% mutate(regex = paste0(a, "$")), by = c(c = "regex")) %>%
# c a b regex
# 1 GX00000055 055 birne 055$
# 2 GX0000022 022 apfel 022$
# 3 GX00000010 010 banane 010$
# 4 GX00000105 0105 traube 0105$
# 5 GX0000777 0777 blaubeere 0777$
# 6 GX0000077 077 kiwi 077$
select(c,b)
# c b
# 1 GX00000055 birne
# 2 GX0000022 apfel
# 3 GX00000010 banane
# 4 GX00000105 traube
# 5 GX0000777 blaubeere
# 6 GX0000077 kiwi
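If installing fuzzyjoin is not an option, roughly the same suffix matching can be sketched in base R with endsWith(). The helper first_suffix_match is a hypothetical name, and taking the first matching code when several match is an assumption of this sketch, not something the question specifies.

```r
df1 <- data.frame(a = c("055", "022", "010", "0105", "0777", "077"),
                  b = c("birne", "apfel", "banane", "traube", "blaubeere", "kiwi"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(c = c("GX00000055", "GX0000022", "GX00000010",
                        "GX00000105", "GX0000777", "GX0000077"),
                  stringsAsFactors = FALSE)

# For one ID, return the index of the first code that is a suffix of the ID,
# or NA if none is ("first match wins" is an assumption of this sketch)
first_suffix_match <- function(id, codes) {
  hits <- which(endsWith(id, codes))
  if (length(hits) == 0) NA_integer_ else hits[1]
}

idx <- vapply(df2$c, first_suffix_match, integer(1), codes = df1$a)
df2$new <- df1$b[idx]
df2
```

With the sample data every ID ends in exactly one code, so this gives the same pairs as the regex join above.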

Which() for the whole dataset

I want to write a function in R that does the following:
I have a table of cases, and some data. I want to find the correct row matching to each observation from the data. Example:
crit1 <- c(1,1,2)
crit2 <- c("yes","no","no")
Cases <- matrix(c(crit1,crit2),ncol=2,byrow=FALSE)
data1 <- c(1,2,1)
data2 <- c("no","no","yes")
data <- matrix(c(data1,data2),ncol=2,byrow=FALSE)
Now I want a function that returns for each row of my data, the matching row from Cases, the result would be the vector
c(2,3,1)
Are you sure you want to be using matrices for this?
Note that the numeric data in crit1 and data1 has been converted to string (matrices can only store one data type):
typeof(data[ , 1L])
# [1] character
In R, a data.frame is a much more natural choice for what you're after. data.table is (among many other things) a toolset for working with "enhanced" data.frames; See the Introduction.
I would create your data as:
Cases = data.table(crit1, crit2)
data = data.table(data1, data2)
We can get the matching row indices as asked by doing a keyed join (See the vignette on keys):
setkey(Cases) # key by all columns
Cases
# crit1 crit2
# 1: 1 no
# 2: 1 yes
# 3: 2 no
setkey(data)
data
# data1 data2
# 1: 1 no
# 2: 1 yes
# 3: 2 no
Cases[data, which=TRUE]
# [1] 1 2 3
This differs from 2,3,1 because the order of your data has changed, but note that the answer is still correct.
If you don't want to change the order of your data, it's slightly more complicated (but more readable if you're not used to data.table syntax):
Cases = data.table(crit1, crit2)
data = data.table(data1, data2)
Cases[data, on = setNames(names(data), names(Cases)), which=TRUE]
# [1] 2 3 1
The on= part creates the mapping between the columns of data and those of Cases.
We could write this in a bit more SQL-like fashion as:
Cases[data, on = .(crit1 == data1, crit2 == data2), which=TRUE]
# [1] 2 3 1
This is shorter and more readable for your sample data, but not as extensible if your data has many columns or if you don't know the column names in advance.
The prodlim package has a function for that:
library(prodlim)
row.match(data,Cases)
[1] 2 3 1
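A base R variant without extra packages, as a sketch: build one key string per row and look the keys up with match(). The "\r" separator is an arbitrary choice and is assumed not to occur in the data; obs is a made-up name used to avoid masking data().

```r
crit1 <- c(1, 1, 2)
crit2 <- c("yes", "no", "no")
data1 <- c(1, 2, 1)
data2 <- c("no", "no", "yes")

Cases <- data.frame(crit1, crit2)
obs   <- data.frame(data1, data2)

# One key per row; the separator is assumed not to appear in any value
case_keys <- paste(Cases$crit1, Cases$crit2, sep = "\r")
obs_keys  <- paste(obs$data1, obs$data2, sep = "\r")

# For each observation, the position of the first matching row in Cases
row_idx <- match(obs_keys, case_keys)
row_idx   # 2 3 1
```

match() returns NA for observations that have no corresponding row in Cases.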

Incorrect output from inner_join of dplyr package

I have two datasets, named "results" and "support2", available here.
I want to merge the two datasets by the only common column name "SNP". Code below:
> library(dplyr)
> library(readr)  # read_delim() comes from readr, not dplyr
> results <- read_delim("<path>\\results", delim = "\t", col_names = TRUE)
> support2 <- read_delim("<path>\\support2", delim = "\t", col_names = TRUE)
> head(results)
# A tibble: 6 x 2
SNP p.value
<chr> <dbl>
1 rs28436661 0.334
2 rs9922067 0.322
3 rs2562132 0.848
4 rs3930588 0.332
5 rs2562137 0.323
6 rs3848343 0.363
> head(support2)
# A tibble: 6 x 2
SNP position
<chr> <dbl>
1 rs62028702 60054
2 rs190434815 60085
3 rs62028703 60087
4 rs62028704 60095
5 rs181534180 60164
6 rs186233776 60177
> dim(results)
[1] 188242 2
> dim(support2)
[1] 1210619 2
# determine the number of common SNPs
length(Reduce(intersect, list(results$SNP, support2$SNP)))
[1] 187613
I would expect that after inner_join, the new data would have 187613 rows.
> newdata <- inner_join(results, support2)
Joining, by = "SNP"
> dim(newdata)
[1] 1409812 3
Strangely, instead of having 187613 rows, the new data has 1409812 rows, which is even more than the sum of the row counts of the two data frames.
I switched to the merge function as below:
> newdata2 <- merge(results, support2)
> dim(newdata2)
[1] 1409812 3
This second new dataframe has the same issue. No idea why.
I wish to know how should I obtain a new dataframe whose rows represent the common rows of the two dataframes (should have 187613 rows) and whose columns contain columns of both dataframes.
It could be the result of duplicated values in the join column:
results <- data.frame(col1 = rep(letters[1:3], each = 3), col2 = rnorm(9))
support2 <- data.frame(col1 = rep(letters[1:5],each = 2), newcol = runif(10))
library(dplyr)
out <- inner_join(results, support2)
nrow(out)
#[1] 18
Here, values in the common column ('col1') are repeated in both datasets. Each row in one table is paired with every row in the other that has the same key, so a key that appears 3 times in one table and 2 times in the other yields 6 rows. The result resembles a cross join within each key value, though not across the whole table.
As already pointed out by @akrun, the data may have duplicates; that is probably the only explanation for this behavior.
According to its documentation, intersect always returns unique values, whereas an inner join returns duplicates if the "by" column contains duplicates. Hence the count mismatch.
To verify this, count the unique values of the "by" variable in the joined result; that count should match your intersect result. That alone doesn't mean your join/merge is right, though: ideally, a join where the key is duplicated in both table A and table B is not recommended (unless, of course, you have a business or other justification). So check whether the duplicates are present in both tables or in only one of them. If they are found in only one of the tables, your merge/join is probably all right.
Please let me know if this doesn't answer your question, and I will remove it.
From the documentation:
intersect:
Each of union, intersect, setdiff and setequal will discard any
duplicated values in the arguments, and they apply as.vector to their
arguments
inner_join():
return all rows from x where there are matching values in y, and all
columns from x and y. If there are multiple matches between x and y,
all combination of the matches are returned.
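One way to check for this on your own data, sketched in base R on toy tables shaped like the example above (fixed values instead of random ones): count the duplicated keys, then predict the join size as the sum over shared keys of (count in x) times (count in y). Keeping only the first occurrence of each key, as in the last step, is an assumption; which duplicate to keep depends on the data.

```r
# Toy tables with duplicated keys
results  <- data.frame(col1 = rep(letters[1:3], each = 3), col2 = 1:9)
support2 <- data.frame(col1 = rep(letters[1:5], each = 2), newcol = 1:10)

# How many key values are repeats in each table?
dup_x <- sum(duplicated(results$col1))    # 6
dup_y <- sum(duplicated(support2$col1))   # 5

# Predicted join size: for each shared key, (count in x) * (count in y)
nx <- table(results$col1)
ny <- table(support2$col1)
shared <- intersect(names(nx), names(ny))
expected_rows <- sum(as.integer(nx[shared]) * as.integer(ny[shared]))
expected_rows                             # 18

joined <- merge(results, support2, by = "col1")
nrow(joined)                              # 18

# One row per key: drop repeated keys before joining (keeps first occurrence)
one_per_key <- merge(results[!duplicated(results$col1), ],
                     support2[!duplicated(support2$col1), ],
                     by = "col1")
nrow(one_per_key)                         # 3
```

If the predicted count matches the observed row count, duplicated keys fully explain the inflated result.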

R: Populating a data frame with multiple matches for a single value without looping

I have a working solution to this problem using a while-loop. I have been made aware that it is typically bad practice to use loops in R so was wondering of alternative approaches.
I have two dataframes, one single-column df full of gene names:
head(genes)
Genes
1 C1QA
2 C1QB
3 C1QC
4 CSF1R
5 CTSC
6 CTSS
And a two-column df that has pairs of the gene name (HGNC.symbol) and accompanying ensembl ID (Gene.stable.ID) for each transcript of the given gene:
head(ensembl_key)
Gene.stable.ID HGNC.symbol
1 ENSG00000210049 MT-TF
2 ENSG00000211459 MT-RNR1
3 ENSG00000210077 MT-TV
4 ENSG00000210082 MT-RNR2
5 ENSG00000209082 MT-TL1
6 ENSG00000198888 MT-ND1
My goal is to create a df that for each gene in the genes df extracts all corresponding transcript ID's (Gene.stable.ID) from the ensembl_key df.
The reason I have only found the looping solution is because a single entry in genes may have multiple matches in ensembl_key. I need to retain all matches and include them in the final df and I also do not know the number of matches a single ID from genes has a priori.
Here is my current working solution:
# Create large empty df to hold all transcripts
gene_transcript<- data.frame(matrix(NA, nrow= 5000, ncol= 2))
colnames(gene_transcript)<- c("geneID", "ensemblID")
# Populate Ensembl column
curr_gene<- 1
gene_count<- 1
while(gene_count <= dim(genes)[1]){
  transcripts<- ensembl_key[which(ensembl_key$HGNC.symbol== genes$Genes[gene_count]),1]
  if(length(transcripts)>1){
    num<- length(transcripts)-1
    # Index genes by gene_count (position in genes), not curr_gene (output row)
    gene_transcript$geneID[curr_gene:(curr_gene+num)]<- genes$Genes[gene_count]
    gene_transcript$ensemblID[curr_gene:(curr_gene+num)]<- transcripts
    gene_count<- gene_count+1
    curr_gene<- curr_gene + num + 1
  }
  else{
    gene_transcript$geneID[curr_gene]<- genes$Genes[gene_count]
    gene_transcript$ensemblID[curr_gene]<- transcripts
    gene_count<- gene_count+1
    curr_gene<- curr_gene + 1
  }
}
# Remove unused rows (those still NA)
last_row<- which(is.na(gene_transcript$geneID))[1]-1
gene_transcript<- gene_transcript[1:last_row,]
Any help is greatly appreciated, thanks!
It sounds like you want to join or merge. Several ways to do this, but the following should work.
merge(genes,
ensembl_key,
by.x = "Genes",
by.y = "HGNC.symbol")
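A toy run illustrates that merge keeps every match for a gene without any loop. The tables below are small stand-ins for the asker's data, and the Ensembl IDs are made up.

```r
# Toy stand-ins for the asker's tables; the IDs are invented for illustration
genes <- data.frame(Genes = c("C1QA", "C1QB", "CTSS"))
ensembl_key <- data.frame(
  Gene.stable.ID = c("ENSG_X1", "ENSG_X2", "ENSG_X3", "ENSG_X4"),
  HGNC.symbol    = c("C1QA", "C1QA", "C1QB", "MT-TF")
)

# Inner join: one output row per (gene, transcript) match
gene_transcript <- merge(genes, ensembl_key,
                         by.x = "Genes", by.y = "HGNC.symbol")
gene_transcript   # C1QA appears twice, C1QB once, CTSS is dropped

# all.x = TRUE additionally keeps genes with no match (ID filled with NA)
with_unmatched <- merge(genes, ensembl_key,
                        by.x = "Genes", by.y = "HGNC.symbol", all.x = TRUE)
nrow(with_unmatched)   # 4
```

Whether unmatched genes should be kept (all.x = TRUE) or dropped depends on what the final table is used for.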

What's the best way to add a specific string to all column names in a dataframe in R?

I am trying to train a model on data that has been converted from a document term matrix to a dataframe. There are separate fields for the positive and negative comments, so I want to add a string to the column names to serve as a "tag" that distinguishes the same word coming from different fields. For example, the word hello can appear in both the positive and negative comment fields (and is thus represented as a column in my dataframe), so in my model I want to differentiate these by making the column names positive_hello and negative_hello.
I am looking for a way to rename columns in such a way that a specific string will be appended to all columns in the dataframe. Say, for mtcars, I want to rename all of the columns to have "_sample" at the end, so that the column names would become mpg_sample, cyl_sample, disp_sample and so on, which were originally mpg, cyl, and disp.
I'm considering using sapply or lapply, but I haven't made any progress with it. Any help would be greatly appreciated.
Use colnames and paste0 functions:
df = data.frame(x = 1:2, y = 2:1)
colnames(df)
[1] "x" "y"
colnames(df) <- paste0('tag_', colnames(df))
colnames(df)
[1] "tag_x" "tag_y"
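Applied to the mtcars case from the question, the same idea with a suffix instead of a prefix might look like this (renamed is just an illustrative variable name, used so the built-in dataset is left untouched):

```r
renamed <- mtcars                       # work on a copy of the built-in dataset
colnames(renamed) <- paste0(colnames(renamed), "_sample")
head(colnames(renamed), 3)              # "mpg_sample" "cyl_sample" "disp_sample"
```

paste0 is vectorized over the names, so no sapply or lapply is needed.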
If you want to prefix each item in a column with a string, you can use paste():
# Generate sample data
df <- data.frame(good=letters, bad=LETTERS)
# Use the paste() function to append the same word to each item in a column
df$good2 <- paste('positive', df$good, sep='_')
df$bad2 <- paste('negative', df$bad, sep='_')
# Look at the results
head(df)
good bad good2 bad2
1 a A positive_a negative_A
2 b B positive_b negative_B
3 c C positive_c negative_C
4 d D positive_d negative_D
5 e E positive_e negative_E
6 f F positive_f negative_F
Edit:
Looks like I misunderstood the question. But you can rename columns in a similar way:
colnames(df) <- paste(colnames(df), 'sample', sep='_')
colnames(df)
[1] "good_sample" "bad_sample" "good2_sample" "bad2_sample"
Or to rename one specific column (column one, in this case):
colnames(df)[1] <- paste('prefix', colnames(df)[1], sep='_')
colnames(df)
[1] "prefix_good_sample" "bad_sample" "good2_sample" "bad2_sample"
You can use setnames from the data.table package; it changes the names by reference, without creating a copy of your data.
library(data.table)
df <- data.frame(a=c(1,2),b=c(3,4))
# a b
# 1 1 3
# 2 2 4
setnames(df,paste0(names(df),"_tag"))
print(df)
# a_tag b_tag
# 1 1 3
# 2 2 4

Resources