I have a large dataframe that looks like this:
group_id distance metric
1 1.1 0.85
1 1.1 0.37
1 1.7 0.93
1 2.3 0.45
...
1 6.3 0.29
1 7.9 0.12
2 2.5 0.78
2 2.8 0.32
...
The dataframe is already sorted by group_id and then distance. I want to know the efficient dplyr or data.table equivalent of the following operations:
Within each group_id:
Let the unique, sorted values of distance within the current group_id be d_1, d_2, ..., d_n.
For each d in d_1, d_2, ..., d_n: compute some function f on all values of metric whose distance is less than d. The function f is a custom user-defined function that takes a vector and returns a scalar. Assume that f is well defined on an empty vector.
So, in the example above, the desired dataframe would look like:
group_id distance_less_than metric
1 1.1 f(empty vector)
1 1.7 f(0.85, 0.37)
1 2.3 f(0.85, 0.37, 0.93)
...
1 7.9 f(0.85, 0.37, 0.93, 0.45,...,0.29)
2 2.5 f(empty vector)
2 2.8 f(0.78)
...
Notice how distance values can be repeated, like the value 1.1 under group 1. In such cases, every row with that distance is excluded when computing f for d = 1.1 (which here results in an empty vector).
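For concreteness, here is a hypothetical f with that property (any vector-in, scalar-out function defined on numeric(0) would do):
# hypothetical example of such an f: a mean that handles the empty-vector case explicitly
f <- function(x) if (length(x) == 0) NA_real_ else mean(x)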
A possible approach is the non-equi join available in data.table. The left table is the unique set of combinations of group_id and distance; the right table supplies all rows whose distance is strictly less than the left table's distance.
f <- sum
DT[unique(DT, by = c("group_id", "distance")),
   on = .(group_id, distance < distance), allow.cartesian = TRUE,
   f(metric), by = .EACHI]
output:
group_id distance V1
1: 1 1.1 NA
2: 1 1.7 1.22
3: 1 2.3 2.15
4: 1 6.3 2.60
5: 1 7.9 2.89
6: 2 2.5 NA
7: 2 2.8 0.78
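Note the NA rows: for a left-table row with no match, the join supplies a single NA metric, so f sees c(NA) rather than a true empty vector (sum(numeric(0)) would be 0). A sketch of a workaround, assuming metric itself never contains NAs, is to drop the join-induced NA before applying f:
DT[unique(DT, by = c("group_id", "distance")),
   on = .(group_id, distance < distance), allow.cartesian = TRUE,
   f(metric[!is.na(metric)]), by = .EACHI]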
data:
library(data.table)
DT <- fread("group_id distance metric
1 1.1 0.85
1 1.1 0.37
1 1.7 0.93
1 2.3 0.45
1 6.3 0.29
1 7.9 0.12
2 2.5 0.78
2 2.8 0.32")
I don't think this would be faster than the data.table option, but here is one way using dplyr:
library(dplyr)
df %>%
group_by(group_id) %>%
mutate(new = purrr::map_dbl(distance, ~f(metric[distance < .])))
where f is your function. map_dbl expects the function's return type to be double. If your function returns a different type, you might want to use map_int, map_chr, or the like.
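For example, if f returned an integer count, map_int would be the right variant; this hypothetical n_below column counts the metrics at a strictly smaller distance:
df %>%
  group_by(group_id) %>%
  mutate(n_below = purrr::map_int(distance, ~ length(metric[distance < .])))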
If you want to keep only one entry per distance, you can remove the duplicates using filter and duplicated:
df %>%
group_by(group_id) %>%
mutate(new = purrr::map_dbl(distance, ~f(metric[distance < .]))) %>%
filter(!duplicated(distance))
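Equivalently, distinct() with .keep_all = TRUE drops the duplicated distances; grouping variables are retained automatically, so this keeps the first row per group_id/distance pair:
df %>%
  group_by(group_id) %>%
  mutate(new = purrr::map_dbl(distance, ~f(metric[distance < .]))) %>%
  distinct(distance, .keep_all = TRUE)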
What are the differences between distinct and unique in R using dplyr, with respect to:
Speed
Capabilities (valid inputs, parameters, etc.) & uses
Output
For example:
library(dplyr)
data(iris)
# creating data with duplicates
iris_dup <- bind_rows(iris, iris)
d <- distinct(iris_dup)
u <- unique(iris_dup)
all(d==u) # returns TRUE
In this example distinct and unique perform the same function. Are there examples of times you should use one but not the other? Are there any tricks or common uses of one?
These functions can generally be used interchangeably, since equivalent calls exist for both. The main differences lie in speed and output format.
distinct() is a function from the dplyr package and can be customized. For example, the following snippet returns only the distinct combinations of a specified set of columns in the dataframe:
distinct(iris_dup, Petal.Width, Species)
unique() strictly returns the unique rows in a dataframe: all the elements in a row must match another row's for it to be treated as a duplicate.
Edit: As Imo points out, unique() can do the same column selection: we take a temporary dataframe of just those columns and find its unique rows. This can be slower for large dataframes:
unique(iris_dup[c("Petal.Width", "Species")])
Both return the same output, albeit with a small difference: they show different row numbers. distinct renumbers the rows from 1, whereas unique keeps the row number of the first occurrence of each unique row.
Petal.Width Species
1 0.2 setosa
2 0.4 setosa
3 0.3 setosa
4 0.1 setosa
5 0.5 setosa
6 0.6 setosa
7 1.4 versicolor
8 1.5 versicolor
9 1.3 versicolor
10 1.6 versicolor
11 1.0 versicolor
12 1.1 versicolor
13 1.8 versicolor
14 1.2 versicolor
15 1.7 versicolor
16 2.5 virginica
17 1.9 virginica
18 2.1 virginica
19 1.8 virginica
20 2.2 virginica
21 1.7 virginica
22 2.0 virginica
23 2.4 virginica
24 2.3 virginica
25 1.5 virginica
26 1.6 virginica
27 1.4 virginica
Overall, both functions return the unique rows based on the combined set of columns chosen. However, as the dplyr documentation itself suggests, distinct is the faster of the two.
With regard to two of your criteria, speed and input, here's a little function using the tictoc library. It shows that distinct() is notably faster (the input has numeric and character columns):
library(dplyr)
library(tictoc)
library(glue)
make_a_df <- function(nrows = NULL){
  # time unique(); the timing includes building the tibble, but both
  # branches pay the same cost, so the comparison stays fair
  tic()
  df <- tibble(
    alpha = sample(letters, nrows, replace = TRUE),
    numeric = rnorm(mean = 0, sd = 1, n = nrows)
  )
  unique(df)
  print(glue('Unique with {nrows}: '))
  toc()
  # time distinct() on a freshly generated tibble of the same shape
  tic()
  df <- tibble(
    alpha = sample(letters, nrows, replace = TRUE),
    numeric = rnorm(mean = 0, sd = 1, n = nrows)
  )
  distinct(df)
  print(glue('Distinct with {nrows}: '))
  toc()
}
Result:
> make_a_df(50); make_a_df(500); make_a_df(5000); make_a_df(50000); make_a_df(500000)
Unique with 50:
0.02 sec elapsed
Distinct with 50:
0 sec elapsed
Unique with 500:
0 sec elapsed
Distinct with 500:
0 sec elapsed
Unique with 5000:
0.02 sec elapsed
Distinct with 5000:
0 sec elapsed
Unique with 50000:
0.09 sec elapsed
Distinct with 50000:
0.01 sec elapsed
Unique with 5e+05:
1.77 sec elapsed
Distinct with 5e+05:
0.34 sec elapsed
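For a more careful comparison that separates data generation from the timing itself, the microbenchmark package could be used along these lines (a sketch, assuming the package is installed):
library(dplyr)
library(microbenchmark)
df <- tibble(alpha = sample(letters, 5e5, replace = TRUE),
             numeric = rnorm(5e5))
microbenchmark(unique(df), distinct(df), times = 10)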
I have a dataframe in R.
index seq change change1 change2 change3 change4 change5 change6
1 1 0.12 0.34 1.2 1.7 4.5 2.5 3.4
2 2 1.12 2.54 1.1 0.56 0.87 2.6 3.2
3 3 1.86 3.23 1.6 0.23 3.4 0.75 11.2
... ... ... ... ... ... ... ... ...
The name of the dataframe is just FullData. I can access each column of FullData using code like:
FullData[2] for 'change'
FullData[3] for 'change1'
FullData[4] for 'change2'
...
Now, I wish to calculate the standard deviation of the values in the first row of the first four change columns, and so on across all the columns:
standarddeviation = sd(c(0.12, 0.34, 1.2, 1.7))
then
standarddeviation = sd(c(0.34, 1.2, 1.7, 4.5))
This has to be done for all rows, so basically I want to calculate the sd row-wise over a sliding window of four columns, even though the data is stored column-wise. Is this possible?
How can I access a row of the data frame using a for loop over the index or seq variable?
How can I do this in R? Is there a better way?
I guess you're looking for something like this: a sliding window of four columns, computed row by row.
win <- 4
vals <- as.matrix(FullData[ , -(1:2)])  # drop the non-data columns (index and seq)
st.dev <- matrix(NA, nrow(vals), ncol(vals) - win + 1)
for (i in seq_len(nrow(vals))) {
  for (j in seq_len(ncol(vals) - win + 1)) {
    st.dev[i, j] <- sd(vals[i, j:(j + win - 1)])
  }
}
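Alternatively, zoo::rollapply expresses the same sliding window without explicit loops; a sketch, assuming the zoo package is installed and that the first two columns are index and seq:
library(zoo)
vals <- as.matrix(FullData[ , -(1:2)])
t(apply(vals, 1, function(r) rollapply(r, width = 4, FUN = sd)))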
I am looking for an explicit function to subscript elements in R, say subscript(x,i) to mean x[i].
The reason that I need this traces back to a piece of code using dplyr and the magrittr pipe operator (which is not a pipe), where I need to divide by the first element of each column.
pipedDF <- rawdata %>%
  filter(...) %>% merge(...) %>% summarize(...) %>% dcast(...) %>%  # steps elided
  mutate_each(funs(. / subscript(., 1)), -index)
I think this would do the trick and keep that pipe syntax which people like.
Without dplyr it would look like the following example:
> df
index a b c
1 1 6.00 5.0 4
2 2 7.50 6.0 5
3 3 5.00 4.5 6
4 4 9.00 7.0 7
> data.frame(sapply(df, function(x) x / x[1]))
index a b c
1 1 1.00 1.0 1.00
2 2 1.25 1.2 1.25
3 3 0.83 0.9 1.50
4 4 1.50 1.4 1.75
You should be able to use '[', as in
x<-5:1
'['(x,2)
# [1] 4
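Back in the dplyr context, note that mutate_each()/funs() have since been superseded; here is a sketch of the same division using the current across() idiom, with `[` called explicitly as a function:
library(dplyr)
df %>% mutate(across(-index, ~ .x / `[`(.x, 1)))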
Say I have an "integer" factor vector of length 5:
vecFactor = c(1,3,2,2,3)
and another "integer" data vector of length 5:
vecData = c(1.3,4.5,6.7,3,2)
How can I find the average of the data within each factor level, so that I would get a result like:
Factor 1: Average = 1.3
Factor 2: Average = 4.85
Factor 3: Average = 3.25
tapply(vecData, vecFactor, FUN=mean)
1 2 3
1.30 4.85 3.25
I sometimes use a linear model to do this instead of tapply; it is quite flexible (for instance if you need to add weights). Don't forget the -1 in the formula: it removes the intercept, so each coefficient is a group mean rather than a difference from the first group's mean.
lm(vecData~factor(vecFactor)-1)$coef
factor(vecFactor)1 factor(vecFactor)2 factor(vecFactor)3
1.30 4.85 3.25
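For instance, adding weights only takes the weights argument; the coefficients then become weighted group means (the w below is made up for illustration):
w <- c(1, 2, 1, 1, 2)
lm(vecData ~ factor(vecFactor) - 1, weights = w)$coef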
To get the result as a tidy table, try the aggregate function with a data.frame:
ddf = data.frame(vecData, vecFactor)
aggregate(vecData~vecFactor, data=ddf, mean)
vecFactor vecData
1 1 1.30
2 2 4.85
3 3 3.25
data.table can also be used for this:
library(data.table)
ddt = data.table(ddf)
ddt[,list(meanval=mean(vecData)),by=vecFactor]
vecFactor meanval
1: 1 1.30
2: 3 3.25
3: 2 4.85
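Note that the groups come back in order of first appearance; keyby both groups and sorts, matching the ordering of the other answers:
ddt[, list(meanval = mean(vecData)), keyby = vecFactor]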
I have a dataframe containing one column (depth, z) from which I am trying to compute cumulative depth values at regular depth criteria, and the differences between consecutive cumulative values. I would like to create a new dataframe with 3 columns: the criterion value, its cumulative depth value, and the difference between consecutive cumulative depths. For example:
z1<-c(1.2, 1.5, 0.8, 0.7, 1.6, 1.9, 1.1, 0.6, 1.3, 1.0)
z<-data.frame(z1)
crit1<-c(0.5,1,1.5,2)
# A loop comes to mind,
for(i in c(0.5,1,1.5,2)){
print( sum(subset(z1,z1<=i)))
} # But I get an error, because I cannot use integers
Error in FUN(X[[1L]], ...) :
only defined on a data frame with all numeric variables
Attempting with cumsum
cumsum(z1)[seq(0.5,2,by=0.5)] # Which doesn't work either
I would like to get a table like this:
Crit Cumulative Difference
0.5 0 0
1 3.1 3.1
1.5 8.2 5.1
Don't use a for loop here; use sapply instead, since you want to store the result:
y <- sapply(crit1,function(x)sum(z1[z1<=x]))
d <- c(0,diff(y))
data.frame(Crit = crit1, Cumulative =y, Difference=d)
# Crit Cumulative Difference
# 1 0.5 0.0 0.0
# 2 1.0 3.1 3.1
# 3 1.5 8.2 5.1
# 4 2.0 11.7 3.5
You could try cut to bin the depths and tapply to sum each bin (note this assumes every z1 value falls inside the crit1 range, as it does here):
Difference <- unname(c(0, tapply(z1, cut(z1, breaks = crit1, labels = FALSE), FUN = sum)))
data.frame(Crit = crit1, Cumulative = cumsum(Difference), Difference)
# Crit Cumulative Difference
#1 0.5 0.0 0.0
#2 1.0 3.1 3.1
#3 1.5 8.2 5.1
#4 2.0 11.7 3.5