Subset a dataframe using a string of column names - r

I need to subset a data frame (df) by a string of column names that I have created, but I'm not sure how to inject this into a subset.
For example, colstoKeep is a character string:
"col1", "col2", "col3", "col4"
How do I pass this into a subset operation?
df <- df[colstoKeep]
I'm sure this is easy, but the above doesn't work.

df <- data.frame(A=seq(1:5),B=seq(5:1),C=seq(1:5))
df
colsToKeep <- "\"A\", \"C\""
If I understand your question correctly, your colsToKeep variable is a single string, as given above. In order to extract the columns, you will have to convert it into a character vector. If I've used the right format, you can do that with the following code.
library(magrittr)
colsToKeepVector <-
  strsplit(colsToKeep, ",") %>%
  unlist() %>%
  trimws() %>%
  gsub("\"", "", .)
df[colsToKeepVector]
However, if I'm also understanding correctly that you started with a vector and collapsed it into a string (paste(..., collapse = ", ")?), I would strongly advise you not to do that in the first place.
(Edited to match the string format in the question)
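For completeness, here is a base R sketch of the same clean-up, assuming colsToKeep has exactly the quoted, comma-separated format shown above:
# Drop the embedded quotes, split on commas, and trim the surrounding whitespace.
colsToKeepVector <- trimws(strsplit(gsub('"', "", colsToKeep), ",")[[1]])
colsToKeepVector
# [1] "A" "C"
df[colsToKeepVector]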

df <- data.frame(A=seq(1:5),B=seq(5:1),C=seq(1:5))
df
A B C
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
cols_to_keep <- c("A","C")
df[,cols_to_keep]
A C
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
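If you use dplyr, the same subsetting can also be written with select() and all_of(); this is just an optional sketch assuming the dplyr package is installed:
library(dplyr)
# all_of() takes a character vector of column names and errors if any are missing.
select(df, all_of(cols_to_keep))
# A C
# 1 1 1
# 2 2 2
# 3 3 3
# 4 4 4
# 5 5 5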

Related

dplyr::mutate_if() with multiple conditions including column class not working

Really confused about why this isn't working:
df <- data.frame(a = c("1", "2", "3"),
                 b = c(2, 3, 4),
                 c = c(4, 3, 2),
                 d = c("1", "5", "9"))
varnames = c("a", "c")
df %>%
  mutate_if((is.character(.) & names(.) %in% varnames),
            funs(mean(as.numeric(.))))
a b c d
1 1 2 4 1
2 2 3 3 5
3 3 4 2 9
Expected output would be
a b c d
1 2 2 4 1
2 2 3 3 5
3 2 4 2 9
It works with a single condition, but I've only gotten the class condition to work using the formulation below (which I don't know how to combine with the column-name condition):
df %>%
  mutate_if(function(col) is.character(col),
            funs(mean(as.numeric(.))))
a b c d
1 2 2 4 5
2 2 3 3 5
3 2 4 2 5
However, is.factor does seem to work fine together with the column names:
df %>%
  mutate_if(!is.factor(.) & (names(.) %in% varnames),
            funs(mean(as.numeric(.))))
a b c d
1 2 2 3 1
2 2 3 3 5
3 2 4 3 9
Note that mutate_if is being phased out in favour of across, so the following is perhaps what you want...
df %>%
  mutate(across(where(is.character) & matches(varnames), ~ mean(as.numeric(.))))
a b c d
1 2 2 4 1
2 2 3 3 5
3 2 4 2 9
mutate_if() doesn't work the way you are using it. Its help page says that the second argument, which sets the condition, needs to be one of the following two things:
A predicate function to be applied to the columns (it can be a normal function or a lambda function, i.e. the ~ fun(.) form).
A logical vector.
If you want to calculate means for character columns, the correct syntax is
Code 1:
df %>% mutate_if(~ is.character(.), funs(mean(as.numeric(.))))
instead of
df %>% mutate_if(is.character(.), funs(mean(as.numeric(.))))
which results in an error message. Then, let's talk about the following code:
Code 2:
df %>% mutate_if(names(.) %in% varnames, funs(mean(as.numeric(.))))
Theoretically, mutate_if only extracts column values, not column names, so ~ names(.) should make no sense inside it. So why does Code 2 work fine without the ~ in front of names(.)? The reason is that the . in names(.) actually refers to df itself rather than to each column of df, owing to how the pipe operator (%>%) works. Therefore, Code 2 is actually executed as
df %>% mutate_if(names(df) %in% varnames, funs(mean(as.numeric(.))))
where a logical vector is passed rather than a predicate function. names(df) %in% varnames returns TRUE FALSE TRUE FALSE, and hence a and c are selected. This explains why your first block fails but the last one works.
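To make that concrete, here is a minimal sketch that builds the logical vector up front and passes it to mutate_if() directly (assuming your dplyr version still exports mutate_if()):
library(dplyr)
# Compute the condition on df itself, then hand the logical vector to mutate_if().
keep <- sapply(df, is.character) & names(df) %in% varnames
keep
#    a     b     c     d
# TRUE FALSE FALSE FALSE
df %>% mutate_if(keep, ~ mean(as.numeric(.)))
# a b c d
# 1 2 2 4 1
# 2 2 3 3 5
# 3 2 4 2 9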
In your first block,
df %>% mutate_if(is.character(.) & names(.) %in% varnames,
                 funs(mean(as.numeric(.))))
replace every . with df and you will find that
is.character(df) returns FALSE
names(df) %in% varnames returns TRUE FALSE TRUE FALSE
The & operator makes the final condition FALSE FALSE FALSE FALSE, so no column is selected. The same substitution explains the last block: is.factor(df) returns FALSE, so !is.factor(df) & names(df) %in% varnames evaluates to TRUE FALSE TRUE FALSE, and columns a and c are selected.
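For reference, the intersection of both conditions can also be expressed with across(), combining where() and all_of() instead of matches(); a sketch assuming dplyr >= 1.0 (loaded as above):
df %>%
  mutate(across(where(is.character) & all_of(varnames),
                ~ mean(as.numeric(.x))))
# a b c d
# 1 2 2 4 1
# 2 2 3 3 5
# 3 2 4 2 9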

R multiple regular expressions, dataframe column names

I have a dataframe data with a lot of columns in the form of
...v1...min ...v1...max ...v2...min ...v2...max
1 a a a a
2 b b b b
3 c c c c
where ... could be any string.
I would like to create a function createData that takes three arguments:
X: a data frame,
cols: a vector containing the first part of the column names, e.g. c("v1", "v2"),
fun: a vector containing the second part of the column names, e.g. c("min") or c("max", "min"),
and returns the filtered data frame. For example, createData(X, c("v1"), NULL) would return this kind of data frame:
...v1...min ...v1...max
1 a a
2 b b
3 c c
while createData(X, c("v1", "v2"), c("min")) would give me
...v1...min ...v2...min
1 a a
2 b b
3 c c
At this point I figured I need something like select(contains()) from the dplyr package.
createData <- function(data, fun, cols) {
  X %>% select(contains())
  return(X)
}
What I struggle with is:
how to filter columns whose names contain two (or maybe more) strings, e.g. both v1 and min? I tried data[grepl(".*(v1*min|min*v1).*", colnames(data), ignore.case=TRUE)], but it doesn't seem to work, and my expressions aren't fixed anyway - they depend on the vectors I pass;
how to filter multiple columns with different name parts, e.g. c("v1", "v2"), passed in a vector? And how do I combine that with the first point?
I don't really need to stick with the dplyr package; it was just for the sake of the example. Thanks!
EDIT:
A reproducible example:
data = data.frame(AXv1c2min = c(1,2,3),
                  subv1trwmax = c(4,5,6),
                  ss25v2xxmin = c(7,8,9),
                  cwfv2urttmmax = c(10,11,12))
If you pass a vector to contains(), it acts like an OR across those strings, while chaining multiple select() calls narrows the result cumulatively, like an AND. So for your example data:
We can filter for (v1 OR v2) AND min like this:
library(tidyverse)
data %>%
  select(contains(c('v1','v2'))) %>%
  select(contains('min'))
AXv1c2min ss25v2xxmin
1 1 7
2 2 8
3 3 9
So as a function where either argument is optional:
createData <- function(data, fun = NULL, cols = NULL) {
  if (!is.null(fun)) data <- select(data, contains(fun))
  if (!is.null(cols)) data <- select(data, contains(cols))
  return(data)
}
A series of examples:
createData(data, cols=c('v1', 'v2'), fun='min')
AXv1c2min ss25v2xxmin
1 1 7
2 2 8
3 3 9
createData(data, cols=c('v1'))
AXv1c2min subv1trwmax
1 1 4
2 2 5
3 3 6
createData(data, fun=c('min'))
AXv1c2min ss25v2xxmin
1 1 7
2 2 8
3 3 9
createData(data, cols=c('v1'), fun=c('min', 'max'))
AXv1c2min subv1trwmax
1 1 4
2 2 5
3 3 6
createData(data, cols=c('v1'), fun=c('max'))
subv1trwmax
1 4
2 5
3 6
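If you prefer to avoid dplyr, a base R sketch of the same idea could look like the following (createDataBase is just an illustrative name, and the cols/fun entries are treated as regular expressions):
# Keep columns whose names match any entry of `cols` AND any entry of `fun`.
createDataBase <- function(data, cols = NULL, fun = NULL) {
  keep <- rep(TRUE, ncol(data))
  if (!is.null(cols)) keep <- keep & grepl(paste(cols, collapse = "|"), names(data))
  if (!is.null(fun))  keep <- keep & grepl(paste(fun, collapse = "|"), names(data))
  data[, keep, drop = FALSE]
}
createDataBase(data, cols = c("v1", "v2"), fun = "min")
# AXv1c2min ss25v2xxmin
# 1 1 7
# 2 2 8
# 3 3 9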

changing column names of a data frame by changing values - R

Say I have the data frame below.
df.open<-c(1,4,5)
df.close<-c(2,8,3)
df<-data.frame(df.open, df.close)
> df
df.open df.close
1 1 2
2 4 8
3 5 3
I want to replace column names that include "open" with "a" and column names that include "close" with "b".
That is, I want to obtain the data frame below:
a b
1 1 2
2 4 8
3 5 3
I have a lot of such data frames. The prefix (here it is "df.") changes, but "open" and "close" are fixed.
Thanks a lot.
We can create a function for reuse
f1 <- function(dat) {
  names(dat)[grep('open$', names(dat))] <- 'a'
  names(dat)[grep('close$', names(dat))] <- 'b'
  dat
}
and apply on the data
df <- f1(df)
-output
df
a b
1 1 2
2 4 8
3 5 3
If these datasets are in a list:
lst1 <- list(df, df)
lst1 <- lapply(lst1, f1)
Thanks to dear @akrun's insightful suggestion, as always, we can do it in one go. We pass character vectors to the pattern and replacement arguments of str_replace so that both substitutions are carried out at once. Each argument can be a character vector of length one or more; in the latter case the lengths of the two vectors should correspond. More to the point, as the documentation says:
References of the form \1, \2, etc. will be replaced with the contents
of the respective matched group (created by ())
library(dplyr)
library(stringr)
df %>%
  rename_with(~ str_replace(., c(".*\\.open", ".*\\.close"), c("a", "b")))
a b
1 1 2
2 4 8
3 5 3
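A slightly more robust variant (a sketch, not part of the answer above) gives str_replace_all() a named vector, so every column name is checked against every pattern regardless of order:
library(dplyr)
library(stringr)
df %>%
  rename_with(~ str_replace_all(., c(".*open$" = "a", ".*close$" = "b")))
# a b
# 1 1 2
# 2 4 8
# 3 5 3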
Another base R option using gsub + match + setNames
setNames(
  df,
  c("a", "b")[match(
    gsub("[^open|close]", "", names(df)),
    c("open", "close")
  )]
)
gives
a b
1 1 2
2 4 8
3 5 3

Remove repeated strings in a sequence and keep only the last occurrence of each

I have a dictionary of words. I also have a column in a data frame containing sequences built from combinations of the words in that dictionary.
I want to remove repeated words and keep only the occurrence that appears last in the sequence, so each unique word is kept once, at the position of its last occurrence. For example, if dictionary <- c("A","B","C") and my sequence is mySeq <- "ABCCBCA", I want the result to be "BCA".
Let's try it on the following data:
dic<- c("AA","BB","c","p")
df<-read.table(text="
id mySequece
1 AAcAABBcPAA
2 AABBAA
3 AABBAABB
4 AAcBBc
5 cBBAABBBBBBBB
6 cBBAABBBBcBB
7 ccp
8 ccppcc",header=T,stringsAsFactors = F)
desired result:
id My_new_sequence
1 BBcPAA
2 BBAA
3 AABB
4 AABBc
5 cAABB
6 AAcBB
7 cp
8 pc
How can I do it in R?
We can extract the elements based on 'dic', then use duplicated to remove the duplicates from the end, and paste the result together.
library(dplyr)
library(stringr)
library(purrr)
df %>%
  mutate(mySequece = str_extract_all(mySequece, str_c(dic, collapse = "|")) %>%
           map_chr(~ str_c(.x[!duplicated(.x, fromLast = TRUE)], collapse = "")))
# id mySequece
#1 1 BBcAA
#2 2 BBAA
#3 3 AABB
#4 4 AABBc
#5 5 cAABB
#6 6 AAcBB
#7 7 cp
#8 8 pc
Or using base R
sapply(regmatches(df$mySequece,
                  gregexpr(paste(dic, collapse = "|"), df$mySequece)),
       function(x) paste(x[!duplicated(x, fromLast = TRUE)], collapse = ""))
#[1] "BBcAA" "BBAA" "AABB" "AABBc" "cAABB" "AAcBB" "cp" "pc"
data
df <- structure(list(id = 1:8, mySequece = c("AAcAABBcPAA", "AABBAA",
"AABBAABB", "AAcBBc", "cBBAABBBBBBBB", "cBBAABBBBcBB", "ccp",
"ccppcc")), class = "data.frame", row.names = c(NA, -8L))

Reshaping count-summarised data into long form in R [duplicate]

This question already has answers here: Repeat each row of data.frame the number of times specified in a column.
Embarrassingly basic question, but if you don't know, you don't know. I need to reshape a data.frame of count-summarised data into what it would have looked like before being summarised. This is essentially the reverse of plyr::count(), e.g.
> (d = data.frame(value=c(1,1,1,2,3,3), cat=c('A','A','A','A','B','B')))
value cat
1 1 A
2 1 A
3 1 A
4 2 A
5 3 B
6 3 B
> (summry = plyr::count(d))
value cat freq
1 1 A 3
2 2 A 1
3 3 B 2
If you start with summry, what is the quickest way back to d? Unless I'm mistaken (very possible), {reshape2} doesn't do this.
Just use rep:
summry[rep(rownames(summry), summry$freq), c("value", "cat")]
# value cat
# 1 1 A
# 1.1 1 A
# 1.2 1 A
# 2 2 A
# 3 3 B
# 3.1 3 B
A variation of this approach can be found in expandRows from my "SOfun" package. If you had that loaded, you would be able to simply do:
expandRows(summry, "freq")
There is a good table-to-data.frame function on the R Cookbook website that you can modify slightly. The only modifications are changing 'Freq' to 'freq' (to be consistent with plyr::count) and making sure the row names are reset as increasing integers.
expand.dft <- function(x, na.strings = "NA", as.is = FALSE, dec = ".") {
  # Take each row in the source data frame table and replicate it
  # using the freq value
  DF <- sapply(1:nrow(x),
               function(i) x[rep(i, each = x$freq[i]), ],
               simplify = FALSE)
  # Take the above list and rbind it to create a single DF
  # Also subset the result to eliminate the freq column
  DF <- subset(do.call("rbind", DF), select = -freq)
  # Now apply type.convert to the character coerced factor columns
  # to facilitate data type selection for each column
  for (i in 1:ncol(DF)) {
    DF[[i]] <- type.convert(as.character(DF[[i]]),
                            na.strings = na.strings,
                            as.is = as.is, dec = dec)
  }
  row.names(DF) <- seq(nrow(DF))
  DF
}
expand.dft(summry)
value cat
1 1 A
2 1 A
3 1 A
4 2 A
5 3 B
6 3 B
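If you have tidyr available, tidyr::uncount() does the same row expansion in one call (an additional option, not from the answers above):
library(tidyr)
# uncount() repeats each row `freq` times and drops the freq column by default,
# giving back the original value/cat rows.
uncount(summry, freq)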
