Remove columns that contain a specific word - r

I have a data set that has 313 columns, ~52000 rows of information. I need to remove each column that contains the word "PERMISSIONS". I've tried grep and dplyr but I can't seem to get it to work.
I've read the file in,
testSet <- read.csv("/Users/.../data.csv")
Other examples show how to remove columns by name but I don't know how to handle wildcards. Not quite sure where to go from here.

If you just want to remove columns whose names contain PERMISSIONS, you can use the select function from the dplyr package together with contains().
library(dplyr)
df <- data.frame("PERMISSIONS" = c(1,2), "Col2" = c(1,4), "Col3" = c(1,2))
PERMISSIONS Col2 Col3
1 1 1
2 4 2
df_sub <- select(df, -contains("PERMISSIONS"))
Col2 Col3
1 1
4 2

From what I could understand from the question, the OP has a data frame like this:
df <- read.table(text = '
a b c d
e f PERMISSIONS g
h i j k
PERMISSIONS l m n',
stringsAsFactors = FALSE)
The goal is to remove every column that has any 'PERMISSIONS' entry. Assuming that every such entry is exactly 'PERMISSIONS' (same case, no extra characters), this code should work:
cols <- colSums(mapply('==', 'PERMISSIONS', df))
new.df <- df[, which(cols == 0), drop = FALSE]
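A variant of the same idea is sketched below. It assumes, as in this answer, that a cell counts as a match only when it is exactly 'PERMISSIONS'. The name has_word is just an illustrative choice.

```r
# Same toy data as in this answer; columns are named V1..V4 by read.table
df <- read.table(text = '
a b c d
e f PERMISSIONS g
h i j k
PERMISSIONS l m n',
stringsAsFactors = FALSE)

# TRUE for each column holding at least one exact 'PERMISSIONS' entry
has_word <- vapply(df, function(x) any(x == "PERMISSIONS", na.rm = TRUE), logical(1))

# drop = FALSE keeps a data frame even if only one column survives
new.df <- df[, !has_word, drop = FALSE]
names(new.df)
```

vapply returns one logical value per column, so the result can be negated and used directly as a column index.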

Try this,
New.testSet <- testSet[,!grepl("PERMISSIONS", colnames(testSet))]
EDIT: changed script as per comment.

We can use grepl with ! to negate the match. Note that this also drops any rows whose row names contain PERMISSIONS:
New.testSet <- testSet[!grepl("PERMISSIONS",row.names(testSet)),
!grepl("PERMISSIONS", colnames(testSet))]

It looks like these answers only do part of what you want. I think this is what you're looking for. There is probably a better way to write this though.
library(data.table)
df = data.frame("PERMISSIONS" = c(1,2), "Col2" = c("PERMISSIONS","A"), "Col3" = c(1,2))
PERMISSIONS Col2 Col3
1 1 PERMISSIONS 1
2 2 A 2
df = df[,!grepl("PERMISSIONS",colnames(df))]
setDT(df)
ind = df[, lapply(.SD, function(x) grepl("PERMISSIONS", x, perl=TRUE))]
df[,which(colSums(ind) == 0), with = FALSE]
Col3
1: 1
2: 2
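The same two-step logic (check the column names, then check the cell contents) can also be sketched in base R without data.table. The helper name drop_matching_cols is hypothetical, and the sketch assumes that a partial (regular expression) match should count, as with grepl above.

```r
# Hypothetical helper: drop columns whose name or contents match `pattern`
drop_matching_cols <- function(data, pattern) {
  in_name  <- grepl(pattern, names(data))
  in_value <- vapply(data, function(x) any(grepl(pattern, x)), logical(1))
  # drop = FALSE keeps a data frame even when a single column remains
  data[, !(in_name | in_value), drop = FALSE]
}

df <- data.frame(PERMISSIONS = c(1, 2),
                 Col2 = c("PERMISSIONS", "A"),
                 Col3 = c(1, 2))
result <- drop_matching_cols(df, "PERMISSIONS")
names(result)
```

Only Col3 should remain here, because the first column matches by name and the second by content.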

Related

Transpose dataframe based on column name/number condition

I am trying to standardize feedback from an API in R. However in some cases, the API returns a different format. This does not allow me to standardize and automate. I have thought of a solution which is as follows:
if dataframe has more than 1 variable, keep dataframe as it is
if dataframe has 1 variable then transpose
This is what I have tried so far:
col <- ncol(df)
df <- ifelse(col > 1, as.data.frame(df), as.data.frame(t(df)))
However, this returns a list, so I cannot process the result any further. Thank you in advance for the help; any links would be welcome too.
Maybe you need something like this:
# some simple dataframes
df1 <- data.frame(col1 = c("a","b"))
df2 <- data.frame(col1 = c("a","b"),
col2 = c("c","d"))
func <- function(df) {
  if (ncol(df) == 1) {
    as.data.frame(t(df))
  } else {
    df
  }
}
func(df1)
V1 V2
col1 a b
func(df2)
col1 col2
1 a c
2 b d
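As for why the ifelse attempt in the question returned a list: ifelse is vectorised and returns a result shaped like its test. With a length-one test it keeps only the first element of the data frame, which is its first column wrapped in a list. A minimal sketch of the difference, using a made-up two-column data frame:

```r
df2 <- data.frame(col1 = c("a", "b"), col2 = c("c", "d"))

# ifelse() returns something the same length as its test (here 1),
# so only the first column survives, wrapped in a plain list
res <- ifelse(ncol(df2) > 1, df2, t(df2))
class(res)

# A plain if/else expression returns the whole object unchanged
out <- if (ncol(df2) > 1) df2 else as.data.frame(t(df2))
class(out)
```

This is why a scalar condition is better served by if/else, as in the function above.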

R - Selecting columns from data table with for loop issue [duplicate]

How can we select multiple columns using a vector of their numeric indices (position) in data.table?
This is how we would do with a data.frame:
df <- data.frame(a = 1, b = 2, c = 3)
df[ , 2:3]
# b c
# 1 2 3
For versions of data.table >= 1.9.8, the following all just work:
library(data.table)
dt <- data.table(a = 1, b = 2, c = 3)
# select single column by index
dt[, 2]
# b
# 1: 2
# select multiple columns by index
dt[, 2:3]
# b c
# 1: 2 3
# select single column by name
dt[, "a"]
# a
# 1: 1
# select multiple columns by name
dt[, c("a", "b")]
# a b
# 1: 1 2
For versions of data.table < 1.9.8 (for which numerical column selection required the use of with = FALSE), see this previous version of this answer. See also NEWS on v1.9.8, POTENTIALLY BREAKING CHANGES, point 3.
It's a bit verbose, but I've gotten used to using the hidden .SD variable.
b<-data.table(a=1,b=2,c=3,d=4)
b[,.SD,.SDcols=c(1:2)]
It's a bit of a hassle, but you don't lose out on other data.table features (I don't think), so you should still be able to use other important functions like join tables etc.
If you want to use column names to select the columns, simply use .(), which is an alias for list():
library(data.table)
dt <- data.table(a = 1:2, b = 2:3, c = 3:4)
dt[ , .(b, c)] # select the columns b and c
# Result:
# b c
# 1: 2 3
# 2: 3 4
From v1.10.2 onwards, you can also use the .. prefix to select the columns named in a character vector:
dt <- data.table(a=1:2, b=2:3, c=3:4)
keep_cols = c("a", "c")
dt[, ..keep_cols]
@Tom, thank you very much for pointing out this solution. It works great for me.
I was looking for a way to exclude just one column when printing. Building on the example above, you can exclude the second column like this:
library(data.table)
dt <- data.table(a=1:2, b=2:3, c=3:4)
dt[,.SD,.SDcols=-2]
dt[,.SD,.SDcols=c(1,3)]
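If you would rather exclude a column by name than by position, one possible sketch combines setdiff with the .. prefix shown earlier. It assumes data.table v1.10.2 or later; keep_cols is just an illustrative variable name.

```r
library(data.table)

dt <- data.table(a = 1:2, b = 2:3, c = 3:4)

# Build the vector of column names to keep, then select it with the .. prefix
keep_cols <- setdiff(names(dt), "b")
res <- dt[, ..keep_cols]
names(res)
```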

Create a new data frame that will act as a dictionary with key and value pairs

I was playing with some data and trying to create a new data frame that contains key and value pairs that could be a dictionary. Here's some sample data and a quick manual solution.
df = data.frame(col1 = c("one", "one", "two", "two", "one"),
col2 = c("AG", "AB", "AC", "AG", "AB"),
col3 = c("F3", "F1", "F2", "F3", "F2") )
df
d1 = data.frame(vals = unique(df$col1))
d2 = data.frame(vals = unique(df$col2))
d3 = data.frame(vals = unique(df$col3))
d1
d2
d3
d1$name = "col1"
d2$name = "col2"
d3$name = "col3"
d1
d2
d3
rbind(d1,d2,d3)
Of course, this is a simple use case so real data is going to be a bit more mundane. For that reason, I was looking for a loop that could go through and set the key value pairs in a dictionary.
Most of my attempts have resulted in failure. Here's the format for my solution but I'm not sure how to dynamically create the new_df dictionary. Any suggestions?
new_df=data.frame()
prod.cols = c("col1", "col2", "col3")
for(col in prod.cols){
if(col %in% colnames(df)){
## solution in here
}
}
new_df
tidyr makes this easy:
library(tidyr)
df %>% gather(name, vals) %>% unique()
# name vals
# 1 col1 one
# 3 col1 two
# 6 col2 AG
# 7 col2 AB
# 8 col2 AC
# 11 col3 F3
# 12 col3 F1
# 13 col3 F2
alistaire's answer is quite elegant and readable. Just for fun, here's a base R approach. Not that efficiency is particularly important here, but this scales relatively well as more rows and columns are added:
My second and third approaches are nicer than my first, so I'm moving them to the top of the answer:
Approach # 2, implementing thelatemail's comment for a nice, efficient one-liner:
stack(lapply(df, function(ii) as.character(unique(ii))))
What's nice about this solution is that it first reduces the columns using unique, which makes less work for as.character and then for stack.
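Approach # 3: a more concise and more efficient version of approach 2 that avoids the need for unique and character conversion by using levels to deal with the factor columns. Note that this only works when the columns are factors; since R 4.0.0, data.frame() no longer converts strings to factors by default, so the columns may need to be converted first: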
stack(lapply(df, levels))
First approach:
Reduce(rbind,
lapply(seq_along(df),
function(ii) data.frame(vals = unique(df[, ii]), name = names(df)[ii])
)
)
# vals name
#1 one col1
#2 two col1
#3 AG col2
#4 AB col2
#5 AC col2
#6 F3 col3
#7 F1 col3
#8 F2 col3
Using do.call instead of Reduce is roughly equivalent here:
do.call(rbind,
lapply(seq_along(df),
function(ii) data.frame(vals = unique(df[, ii]), name = names(df)[ii])
)
)
We can also do
library(reshape2)
unique(melt(as.matrix(df))[-1])
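For completeness, the loop skeleton from the question can also be filled in directly. This is only a sketch in base R, and the intermediate name pairs is an illustrative choice. Growing a data frame with rbind inside a loop is fine for a handful of columns, but the vectorised approaches above are likely to scale better.

```r
df <- data.frame(col1 = c("one", "one", "two", "two", "one"),
                 col2 = c("AG", "AB", "AC", "AG", "AB"),
                 col3 = c("F3", "F1", "F2", "F3", "F2"))

new_df <- data.frame()
prod.cols <- c("col1", "col2", "col3")
for (col in prod.cols) {
  if (col %in% colnames(df)) {
    # one row per distinct value, tagged with the column it came from
    pairs <- data.frame(vals = as.character(unique(df[[col]])), name = col)
    new_df <- rbind(new_df, pairs)
  }
}
new_df
```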

Find the index of the column in data frame that contains the string as value

I have data frame like this :
df <- data.frame(col1 = c(letters[1:4],"a"),col2 = 1:5,col3 = letters[10:14])
df
col1 col2 col3
1 a 1 j
2 b 2 k
3 c 3 l
4 d 4 m
5 a 5 n
I want to find the index of the column of df that has values matching to string "a".
i.e. it should give me 1 as result.
I tried using which in sapply but it's not working.
Does anybody know how to do it without a loop?
Something like this?
which(apply(df, 2, function(x) any(grepl("a", x))))
The steps are:
With apply go over each column
Search if a is in this column with grepl
Since we get a vector back, use any to get TRUE if any element has been matched to a
Finally check which elements (columns) are TRUE (i.e. contain the searched letter a).
Since you mention you were trying to use sapply() but were unsuccessful, here's how you can do it:
> sapply(df, function(x) any(x == "a"))
col1 col2 col3
TRUE FALSE FALSE
> which(sapply(df, function(x) any(x == "a")))
col1
1
Of course, you can also use the grep()/grepl() approach if you prefer string matching. You can also wrap your which() function with unname() if you want just the column number.
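The two approaches can give different answers, because == tests for an exact value while grepl tests for a substring. The sketch below uses a modified version of the question's data (col3 is changed to include "banana", which is an assumption made purely for illustration) and vapply in place of sapply so the result type is guaranteed.

```r
df <- data.frame(col1 = c(letters[1:4], "a"),
                 col2 = 1:5,
                 col3 = c("banana", "k", "l", "m", "n"))

# Exact match: only col1 holds the value "a"
exact <- unname(which(vapply(df, function(x) any(x == "a"), logical(1))))

# Substring match: "banana" in col3 also contains an "a"
partial <- unname(which(vapply(df, function(x) any(grepl("a", x)), logical(1))))

exact
partial
```

Also note that apply(df, 2, ...) first converts the data frame to a matrix, so with mixed column types every value is compared as a character string.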

How to merge two columns in R with a specific symbol?

I have a table read in R as follows:
column1 column2
A B
What is the command to be used to match two columns together as follows?
Column 3
A_B
I'm a bit unsure what you mean by "merge", but is this what you mean?
> DF = data.frame(A = LETTERS[1:10], B = LETTERS[11:20])
> DF$C = paste(DF$A, DF$B, sep="_")
> head(DF)
A B C
1 A K A_K
2 B L B_L
3 C M C_M
4 D N D_N
5 E O E_O
6 F P F_P
Or equivalently, as @daroczig points out:
within(DF, C <- paste(A, B, sep='_'))
My personal favourite makes use of the unite function from tidyr:
set.seed(1)
df <- data.frame(colA = sample(LETTERS, 10),
colB = sample(LETTERS, 10))
# packages: magrittr for the %<>% pipe, tidyr for unite
library(magrittr); library(tidyr)
# Unite
df %<>%
unite(ColAandB, colA, colB, remove = FALSE)
Results
> head(df, 3)
ColAandB colA colB
1 G_F G F
2 J_E J E
3 N_Q N Q
Side notes
Personally, I find the remove = TRUE / FALSE argument of unite very useful. In addition, tidyr fits the dplyr workflow very well and works nicely with separate in case you change your mind about the columns being merged. Along the same lines, if NAs are a problem, adding na.omit to your workflow lets you conveniently drop the undesirable rows before creating the desired column.
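To illustrate the NA point with base R: paste turns a missing value into the literal string "NA", so dropping incomplete rows first can be worthwhile. The data below is made up for the sketch.

```r
DF <- data.frame(A = c("A", NA, "C"), B = c("X", "Y", "Z"))

# paste() converts NA to the string "NA" instead of propagating it
DF$C <- paste(DF$A, DF$B, sep = "_")
DF$C

# Dropping incomplete rows first avoids values such as "NA_Y"
clean <- na.omit(DF[, c("A", "B")])
clean$C <- paste(clean$A, clean$B, sep = "_")
clean$C
```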
