I have a dataframe data with a lot of columns in the form of
...v1...min ...v1...max ...v2...min ...v2...max
1 a a a a
2 b b b b
3 c c c c
where any string can stand in place of each "...".
I would like to create a function createData that takes three arguments:
X: a dataframe,
cols: a vector containing the first part of the column name, e.g. c("v1", "v2")
fun: a vector containing the second part of the column name, e.g. c("min") or c("max", "min")
and returns the filtered dataframe. For example,
createData(X, c("v1"), NULL) would return this kind of dataframe:
...v1...min ...v1...max
1 a a
2 b b
3 c c
while createData(X, c("v1", "v2"), c("min")) would give me
...v1...min ...v2...min
1 a a
2 b b
3 c c
At this point I decided I need to use something like select(contains()) from the dplyr package.
createData <- function(X, cols, fun)
{
X %>% select(contains())
return(X)
}
What I struggle with is:
how to select columns whose names contain two (or maybe more) strings, e.g. both v1 and min? I tried data[grepl(".*(v1*min|min*v1).*", colnames(data), ignore.case=TRUE)] but it doesn't seem to work, and my expressions aren't fixed anyway: they depend on the vectors I pass,
how to select multiple columns with different names, e.g. c("v1", "v2"), passed in a vector, and how to combine that with the first question?
I don't really need to stick with dplyr package, it was just for the sake of the example. Thanks!
EDIT:
A reproducible example:
data = data.frame(AXv1c2min = c(1,2,3),
subv1trwmax = c(4,5,6),
ss25v2xxmin = c(7,8,9),
cwfv2urttmmax = c(10,11,12))
If you pass a vector to contains(), its elements are combined with OR, while chaining several select() calls narrows the result step by step, so the calls combine with AND. So for your example data:
We can select (v1 OR v2) AND min like this:
library(tidyverse)
data %>%
select(contains(c('v1','v2'))) %>%
select(contains('min'))
AXv1c2min ss25v2xxmin
1 1 7
2 2 8
3 3 9
So as a function where either argument is optional:
createData <- function(data, fun=NULL, cols=NULL) {
if (!is.null(fun)) data <- select(data, contains(fun))
if (!is.null(cols)) data <- select(data, contains(cols))
return(data)
}
A series of examples:
createData(data, cols=c('v1', 'v2'), fun='min')
AXv1c2min ss25v2xxmin
1 1 7
2 2 8
3 3 9
createData(data, cols=c('v1'))
AXv1c2min subv1trwmax
1 1 4
2 2 5
3 3 6
createData(data, fun=c('min'))
AXv1c2min ss25v2xxmin
1 1 7
2 2 8
3 3 9
createData(data, cols=c('v1'), fun=c('min', 'max'))
AXv1c2min subv1trwmax
1 1 4
2 2 5
3 3 6
createData(data, cols=c('v1'), fun=c('max'))
subv1trwmax
1 4
2 5
3 6
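Since the question mentions that dplyr is not a requirement, the same (any of cols) AND (any of fun) logic can also be sketched in base R with grepl. This is only a sketch under stated assumptions: createDataBase is a hypothetical name, and the entries of cols and fun are assumed to contain no regular-expression metacharacters, because they are pasted together into a pattern. ignore.case = TRUE is used to mimic the default case-insensitivity of contains().

```r
# Reproducible data from the question
data <- data.frame(AXv1c2min = c(1, 2, 3),
                   subv1trwmax = c(4, 5, 6),
                   ss25v2xxmin = c(7, 8, 9),
                   cwfv2urttmmax = c(10, 11, 12))

# Hypothetical helper: keep columns whose names contain
# (any element of cols) AND (any element of fun); NULL means "no filter".
createDataBase <- function(data, cols = NULL, fun = NULL) {
  keep <- rep(TRUE, ncol(data))
  for (parts in list(cols, fun)) {
    if (!is.null(parts)) {
      pattern <- paste(parts, collapse = "|")  # "v1|v2" acts as OR
      keep <- keep & grepl(pattern, names(data), ignore.case = TRUE)
    }
  }
  data[, keep, drop = FALSE]  # drop = FALSE keeps a one-column result a data frame
}

names(createDataBase(data, cols = c("v1", "v2"), fun = "min"))
names(createDataBase(data, cols = "v1"))
```

Building one logical vector and indexing once avoids copying the data frame for every filter, which may matter when there are many columns.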
I'm really confused why this is not working:
df <- data.frame(a = c("1", "2", "3"),
b = c(2, 3, 4),
c = c(4, 3, 2),
d = c("1", "5", "9"))
varnames = c("a", "c")
df %>%
mutate_if((is.character(.) & names(.) %in% varnames),
funs(mean(as.numeric(.))))
a b c d
1 1 2 4 1
2 2 3 3 5
3 3 4 2 9
Expected output would be
a b c d
1 2 2 4 1
2 2 3 3 5
3 2 4 2 9
It works with a single condition, but I've only gotten the class condition to work using the formulation below (which I don't know how to combine with the column name condition):
df %>%
mutate_if(function(col) is.character(col),
funs(mean(as.numeric(.))))
a b c d
1 2 2 4 5
2 2 3 3 5
3 2 4 2 5
However is.factor seems to work fine with the column names?
df %>%
mutate_if(!is.factor(.) & (names(.) %in% varnames),
funs(mean(as.numeric(.))))
a b c d
1 2 2 3 1
2 2 3 3 5
3 2 4 3 9
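Note that mutate_if has been superseded by across, so the following is perhaps what you want. all_of(varnames) selects exactly the named columns, whereas matches(varnames) would treat each name as a regular expression and also pick up any column whose name merely contains "a" or "c".
df %>%
mutate(across(where(is.character) & all_of(varnames), ~mean(as.numeric(.))))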
a b c d
1 2 2 4 1
2 2 3 3 5
3 2 4 2 9
mutate_if() doesn't work the way you expect. According to its help page, the second argument, which sets the condition, must be one of the following:
A predicate function to be applied to each column. (It can be an ordinary function or a lambda of the form ~ fun(.).)
A logical vector.
If you want to calculate means for character columns, the correct syntax is
Code 1:
df %>% mutate_if(~ is.character(.), funs(mean(as.numeric(.))))
instead of
df %>% mutate_if(is.character(.), funs(mean(as.numeric(.))))
which results in an error message. Then, let's talk about the following code:
Code 2:
df %>% mutate_if(names(.) %in% varnames, funs(mean(as.numeric(.))))
In principle, mutate_if passes only column values, not column names, to the predicate, so ~ names(.) would make no sense there. So why does Code 2 work without the ~ in front of names(.)? Because the "." in names(.) refers to df itself rather than to each column of df: that is how the pipe operator (%>%) fills in the placeholder. Code 2 is therefore equivalent to
df %>% mutate_if(names(df) %in% varnames,funs(mean(as.numeric(.))))
where a logical vector is passed to it rather than a predicate function. names(df) %in% varnames returns TRUE FALSE TRUE FALSE and hence a and c are selected. This can explain why your first block fails but the last one works.
The first block
df %>% mutate_if(is.character(.) & names(.) %in% varnames,
funs(mean(as.numeric(.))))
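Replacing every "." with df, you find:
is.character(df) returns FALSE
names(df) %in% varnames returns TRUE FALSE TRUE FALSE
The & operator recycles the single FALSE, so the final condition is FALSE FALSE FALSE FALSE and no column is selected. The same substitution explains the last block: !is.factor(df) is TRUE, so the condition reduces to TRUE FALSE TRUE FALSE, and columns a and c are both transformed regardless of their class.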
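The substitution argument can be checked directly in base R, without dplyr. The sketch below evaluates the two conditions on df as a whole and then builds the per-column condition the question was after; sel is just an illustrative variable name, and stringsAsFactors = FALSE is an assumption made so that a and d are character columns on any R version.

```r
# stringsAsFactors = FALSE is set explicitly so that a and d are character
# columns on every R version (it is the default only from R 4.0.0 on).
df <- data.frame(a = c("1", "2", "3"),
                 b = c(2, 3, 4),
                 c = c(4, 3, 2),
                 d = c("1", "5", "9"),
                 stringsAsFactors = FALSE)
varnames <- c("a", "c")

is.character(df)           # tests the data frame as a whole: FALSE
names(df) %in% varnames    # one value per column: TRUE FALSE TRUE FALSE

# Per-column class test combined with the name test
sel <- vapply(df, is.character, logical(1)) & names(df) %in% varnames
sel                        # TRUE only for column a

# Replace each selected column by its mean, repeated to the column length
df[sel] <- lapply(df[sel], function(x) rep(mean(as.numeric(x)), length(x)))
df
```

Here vapply applies is.character to each column, which is the per-column test that is.character(.) inside the pipe does not perform.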
Say I have the data frame below.
df.open<-c(1,4,5)
df.close<-c(2,8,3)
df<-data.frame(df.open, df.close)
> df
df.open df.close
1 1 2
2 4 8
3 5 3
I want to rename the columns whose names include "open" to "a" and those whose names include "close" to "b".
That is, I want to obtain the data frame below:
a b
1 1 2
2 4 8
3 5 3
I have a lot of such data frames. The prefix (here "df.") changes, but "open" and "close" are fixed.
Thanks a lot.
We can create a function for reuse
f1 <- function(dat) {
names(dat)[grep('open$', names(dat))] <- 'a'
names(dat)[grep('close$', names(dat))] <- 'b'
dat
}
and apply on the data
df <- f1(df)
-output
df
a b
1 1 2
2 4 8
3 5 3
If these datasets are in a list:
lst1 <- list(df, df)
lst1 <- lapply(lst1, f1)
Thanks to @akrun's insightful suggestion, we can do it in one go. We pass character vectors as the pattern and replacement arguments of str_replace so that both renames happen in a single call. Each argument can be a vector of length one or more; if longer than one, the two vectors should have the same length, and the patterns are then paired with the column names position by position. As the documentation says:
References of the form \1, \2, etc will be replaced with the contents
of the respective matched group (created by ())
library(dplyr)
library(stringr)
df %>%
rename_with(~ str_replace(., c(".*\\.open", ".*\\.close"), c("a", "b")))
a b
1 1 2
2 4 8
3 5 3
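Because str_replace is vectorised over the strings and the patterns in parallel, the call above relies on the "open" column coming before the "close" column. If the column order can vary between data frames, an order-independent variant may be safer. The sketch below uses base R only; rename_by_suffix and its map argument are hypothetical names introduced for illustration.

```r
# Hypothetical helper: rename every column whose name ends in a known suffix.
# `map` is a named character vector: names are suffixes, values are new names.
rename_by_suffix <- function(dat, map = c(open = "a", close = "b")) {
  for (suffix in names(map)) {
    hit <- endsWith(names(dat), suffix)
    names(dat)[hit] <- map[[suffix]]
  }
  dat
}

# Columns in the "wrong" order and with a different prefix
x <- data.frame(spx.close = c(2, 8, 3), spx.open = c(1, 4, 5))
names(rename_by_suffix(x))
```

Each suffix is tested against every column name, so neither the prefix nor the column order matters.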
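Another base R option using sub + match + setNames
# sub() keeps only the trailing "open" or "close"; a bracket class such as
# [^open|close] matches single characters, so it breaks on prefixes like "spx."
setNames(
df,
c("a", "b")[match(
sub(".*(open|close)$", "\\1", names(df)),
c("open", "close")
)]
)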
gives
a b
1 1 2
2 4 8
3 5 3
I have a data frame with coordinates ("start","end") and labels ("group"):
a <- data.frame(start=1:4, end=3:6, group=c("A","B","C","D"))
a
start end group
1 1 3 A
2 2 4 B
3 3 5 C
4 4 6 D
I want to create a new data frame in which labels are assigned to every element of the sequence on the range of coordinates:
V1 V2
1 1 A
2 2 A
3 3 A
4 2 B
5 3 B
6 4 B
7 3 C
8 4 C
9 5 C
10 4 D
11 5 D
12 6 D
The following code works but it is extremely slow with wide ranges:
df<-data.frame()
for(i in 1:dim(a)[1]){
s<-seq(a[i,1],a[i,2])
df<-rbind(df,data.frame(s,rep(a[i,3],length(s))))
}
colnames(df)<-c("V1","V2")
How can I speed this up?
You can try data.table
library(data.table)
setDT(a)[, start:end, by = group]
which gives
group V1
1: A 1
2: A 2
3: A 3
4: B 2
5: B 3
6: B 4
7: C 3
8: C 4
9: C 5
10: D 4
11: D 5
12: D 6
Obviously this would only work if you have one row per group, which it seems you have here.
If you want a very fast solution in base R, you can manually create the data.frame in two steps:
Use mapply to create a list of your ranges from "start" to "end".
Use rep + lengths to repeat the "groups" column to the expected number of rows.
The base R approach shared here won't depend on having only one row per group.
Try:
temp <- mapply(":", a[["start"]], a[["end"]], SIMPLIFY = FALSE)
data.frame(group = rep(a[["group"]], lengths(temp)),
values = unlist(temp, use.names = FALSE))
If you're doing this a lot, just put it in a function:
myFun <- function(indf) {
temp <- mapply(":", indf[["start"]], indf[["end"]], SIMPLIFY = FALSE)
data.frame(group = rep(indf[["group"]], lengths(temp)),
values = unlist(temp, use.names = FALSE))
}
If you want some sample data to try it with, you can use the following:
set.seed(1)
a <- data.frame(start=1:4, end=sample(5:10, 4, TRUE), group=c("A","B","C","D"))
x <- do.call(rbind, replicate(1000, a, FALSE))
y <- do.call(rbind, replicate(100, x, FALSE))
Note that this does seem to slow down as the number of different unique values in "group" increases.
(In other words, the "data.table" approach will make the most sense in general. I'm just sharing a possible base R alternative that should be considerably faster than your existing approach.)
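A fully vectorised base R variant is also possible; it avoids building a list of ranges at all. This is only a sketch, and it assumes that end is never smaller than start in any row. The names n and res are illustrative.

```r
a <- data.frame(start = 1:4, end = 3:6, group = c("A", "B", "C", "D"))

n <- a$end - a$start + 1L              # length of each range (assumes end >= start)
res <- data.frame(
  V1 = rep(a$start, n) + sequence(n) - 1L,  # start, start + 1, ..., end for each row
  V2 = rep(a$group, n)
)
res
```

sequence(n) produces 1..n[i] for each row in turn, so adding the repeated start values minus one yields every position in every range. Like the mapply version, it does not depend on having one row per group.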
Suppose I have a data.frame like:
df <- data.frame(a=1:5, b=sample(1:5, 5, replace=TRUE), c=5:1)
df
a b c
1 1 4 5
2 2 3 4
3 3 5 3
4 4 2 2
5 5 1 1
and I need to replace every 5 with NA in columns b and c, then assign the result back to df:
df
a b c
1 1 4 NA
2 2 3 4
3 3 NA 3
4 4 2 2
5 5 1 1
But I want to use a generic apply()-style function instead of calling replace() on each column in turn, because there are actually many variables to replace in the real data. Suppose I've defined a vector of variable names:
var <- c("b", "c")
and come up with something like:
df <- within(df, sapply(var, function(x) x <- replace(x, x==5, NA)))
but nothing happens. I was wondering whether there is a way to make something like the above work by passing a vector of column names from a data.frame into a generic apply / plyr function (or maybe some completely different way). Thanks~
You could just do
df[,var][df[,var] == 5] <- NA
df <- data.frame(a=1:5, b=sample(1:5, 5, replace=TRUE), c=5:1)
df
var <- c("b","c")
df[,var] <- sapply(df[,var],function(x) ifelse(x==5,NA,x))
df
I find the ifelse notation easier to understand here, but most R users would probably use indexing instead.
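For completeness, the apply-style version the question was reaching for can be written with lapply over the selected columns. The original attempt did nothing because sapply(var, ...) iterates over the strings "b" and "c" rather than over the columns, and its result is never assigned to anything inside within(). The sketch below uses fixed values for b (an assumption, so that the result is reproducible) instead of sample().

```r
df <- data.frame(a = 1:5, b = c(4, 3, 5, 2, 1), c = 5:1)
var <- c("b", "c")

# df[var] is a list of the selected columns; lapply returns a list of the
# same shape, which is assigned back column by column.
df[var] <- lapply(df[var], function(x) replace(x, x == 5, NA))
df
```

Column a still contains its 5, since only the columns named in var are touched.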
I have an aggregation problem which I cannot figure out how to perform efficiently in R.
Say I have the following data:
group1 <- c("a","b","a","a","b","c","c","c","c",
"c","a","a","a","b","b","b","b")
group2 <- c(1,2,3,4,1,3,5,6,5,4,1,2,3,4,3,2,1)
value <- c("apple","pear","orange","apple",
"banana","durian","lemon","lime",
"raspberry","durian","peach","nectarine",
"banana","lemon","guava","blackberry","grape")
df <- data.frame(group1,group2,value)
I am interested in sampling from the data frame df such that I randomly pick only a single row from each combination of factors group1 and group2.
As you can see, the results of table(df$group1,df$group2)
1 2 3 4 5 6
a 2 1 2 1 0 0
b 2 2 1 1 0 0
c 0 0 1 1 2 1
shows that some combinations are seen more than once, while others are never seen. For those that are seen more than once (e.g., group1="a" and group2=3), I want to randomly pick only one of the corresponding rows and return a new data frame that has only that subset of rows. That way, each possible combination of the grouping factors is represented by only a single row in the data frame.
One important aspect here is that my actual data sets can contain anywhere from 500,000 rows to >2,000,000 rows, so it is important to be mindful of performance.
I am relatively new at R, so I have been having trouble figuring out how to generate this structure correctly. One attempt looked like this (using the plyr package):
choice <- function(x,label) {
cbind(x[sample(1:nrow(x),1),],data.frame(state=label))
}
df <- ddply(df[,c("group1","group2","value")],
.(group1,group2),
choice,
label="test")
Note that in this case, I am also adding an extra column to the data frame called "state", whose value is passed to ddply as the extra argument label. However, I killed this after about 20 min.
In other cases, I have tried using aggregate or by or tapply, but I never know exactly what the specified function is getting, what it should return, or what to do with the result (especially for by).
I am trying to switch from python to R for exploratory data analysis, but this type of aggregation is crucial for me. In python, I can perform these operations very rapidly, but it is inconvenient as I have to generate a separate script/data structure for each different type of aggregation I want to perform.
I want to love R, so please help! Thanks!
Uri
Here is the plyr solution
set.seed(1234)
ddply(df, .(group1, group2), summarize,
value = value[sample(length(value), 1)])
This gives us
group1 group2 value
1 a 1 apple
2 a 2 nectarine
3 a 3 banana
4 a 4 apple
5 b 1 grape
6 b 2 blackberry
7 b 3 guava
8 b 4 lemon
9 c 3 durian
10 c 4 durian
11 c 5 raspberry
12 c 6 lime
EDIT. With a data frame that big, you are better off using data.table
library(data.table)
dt = data.table(df)
dt[, list(value = value[sample(length(value), 1)]), by = c("group1", "group2")]
EDIT 2: Performance comparison: data.table is roughly 14 times faster
group1 = sample(letters, 1000000, replace = TRUE)
group2 = sample(LETTERS, 1000000, replace = TRUE)
value = runif(1000000, 0, 1)
df = data.frame(group1, group2, value)
dt = data.table(df)
f1_dtab = function() {
dt[, list(value = value[sample(length(value), 1)]), by = c("group1", "group2")]
}
f2_plyr = function() {
ddply(df, .(group1, group2), summarize,
value = value[sample(length(value), 1)])
}
f3_by = function() {
do.call(rbind, by(df, list(grp1 = df$group1, grp2 = df$group2),
FUN = function(x) x[sample(nrow(x), 1), ]))
}
library(rbenchmark)
benchmark(f1_dtab(), f2_plyr(), f3_by(), replications = 10)
test replications elapsed relative
f1_dtab() 10 4.764 1.00000
f2_plyr() 10 68.261 14.32851
f3_by() 10 67.369 14.14127
One more way:
with(df, tapply(value, list( group1, group2), length))
1 2 3 4 5 6
a 2 1 2 1 NA NA
b 2 2 1 1 NA NA
c NA NA 1 1 2 1
# Now use tapply to sample within groups
# `resample` fn is from the sample help page:
# Avoids an error with sample when only one value in a group.
resample <- function(x, ...) x[sample.int(length(x), ...)]
#Create a row index
df$idx <- 1:NROW(df)
rowidxs <- with(df, unique( c( # the `c` function will make a matrix into a vector
tapply(idx, list( group1, group2),
function (x) resample(x, 1) ))))
rowidxs
# [1] 1 5 NA 12 16 NA 3 15 6 4 14 10 NA NA 7 NA NA 8
df[rowidxs[!is.na(rowidxs)] , ]
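Another base R possibility, offered only as a sketch, is to shuffle the rows once and then keep the first row seen for each (group1, group2) combination. Since the first occurrence within a random permutation is uniformly distributed over the rows of a group, this picks one random row per combination without any per-group function calls. The names shuffled and picked are illustrative, and no timing claim is made for large data.

```r
group1 <- c("a","b","a","a","b","c","c","c","c",
            "c","a","a","a","b","b","b","b")
group2 <- c(1,2,3,4,1,3,5,6,5,4,1,2,3,4,3,2,1)
value <- c("apple","pear","orange","apple",
           "banana","durian","lemon","lime",
           "raspberry","durian","peach","nectarine",
           "banana","lemon","guava","blackberry","grape")
df <- data.frame(group1, group2, value)

set.seed(1)
shuffled <- df[sample(nrow(df)), ]               # random row order
first_seen <- !duplicated(shuffled[c("group1", "group2")])
picked <- shuffled[first_seen, ]                 # one row per combination

# Optional: sort for readability
picked[order(picked$group1, picked$group2), ]
```

Combinations that never occur in the data are simply absent from the result, so no NA handling is needed.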