I am looking for an explicit function to subscript elements in R, say subscript(x, i), meaning x[i].
The reason I need this traces back to a piece of code using dplyr and the magrittr pipe operator (which is not a pipe), where I need to divide by the first element of each column.
pipedDF <- rawdata %>%
    # ... filter, merge, summarize, dcast steps ...
    mutate_each(funs(. / subscript(., 1)), -index)
I think this would do the trick and keep the pipe syntax that people like.
Without dplyr it would look like this. Example:
> df
index a b c
1 1 6.00 5.0 4
2 2 7.50 6.0 5
3 3 5.00 4.5 6
4 4 9.00 7.0 7
> data.frame(sapply(df, function(x)x/x[1]))
index a b c
1 1 1.00 1.0 1.00
2 2 1.25 1.2 1.25
3 3 0.83 0.9 1.50
4 4 1.50 1.4 1.75
You should be able to use '[', as in
x<-5:1
'['(x,2)
# [1] 4
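For the pipeline in the question, here is a minimal sketch. Note that mutate_each()/funs() have since been superseded by mutate(across(...)) in current dplyr, so this is an adaptation rather than the question's original API; the data frame reproduces the question's example.
library(dplyr)

df <- data.frame(index = 1:4,
                 a = c(6, 7.5, 5, 9),
                 b = c(5, 6, 4.5, 7),
                 c = 4:7)

# `[` is the explicit subscript function: `[`(., 1) is the same as .[1]
df %>%
  mutate(across(-index, ~ . / `[`(., 1)))
#   index         a   b    c
# 1     1 1.0000000 1.0 1.00
# 2     2 1.2500000 1.2 1.25
# 3     3 0.8333333 0.9 1.50
# 4     4 1.5000000 1.4 1.75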
I have a large dataframe that looks like this:
group_id distance metric
1 1.1 0.85
1 1.1 0.37
1 1.7 0.93
1 2.3 0.45
...
1 6.3 0.29
1 7.9 0.12
2 2.5 0.78
2 2.8 0.32
...
The dataframe is already sorted by group_id and then distance. I want to know the efficient dplyr or data.table equivalent of the following operations:
Within each group_id:
Let the unique, sorted values of distance within the current group_id be d_1, d_2, ..., d_n.
For each d in d_1, d_2, ..., d_n: compute some function f on all values of metric whose distance is less than d. The function f is a custom user-defined function that takes in a vector and returns a scalar. Assume that f is well defined on an empty vector.
So, in the example above, the desired dataframe would look like:
group_id distance_less_than metric
1 1.1 f(empty vector)
1 1.7 f(0.85, 0.37)
1 2.3 f(0.85, 0.37, 0.93)
...
1 7.9 f(0.85, 0.37, 0.93, 0.45,...,0.29)
2 2.5 f(empty vector)
2 2.8 f(0.78)
...
Notice how distance values can be repeated, like the value 1.1 under group 1. In such cases, both rows are excluded when computing for d = 1.1, since only strictly smaller distances count (which here results in an empty vector).
A possible approach is the non-equi join available in data.table. The left table is the unique set of combinations of group_id and distance, and the right table is the full data, joined on all rows whose distance is strictly less than the left table's distance.
f <- sum
DT[unique(DT, by = c("group_id", "distance")),
   on = .(group_id, distance < distance), allow.cartesian = TRUE,
   f(metric), by = .EACHI]
output:
group_id distance V1
1: 1 1.1 NA
2: 1 1.7 1.22
3: 1 2.3 2.15
4: 1 6.3 2.60
5: 1 7.9 2.89
6: 2 2.5 NA
7: 2 2.8 0.78
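One caveat: for left-table rows with no match (the NA rows above), the join fills metric with a single NA rather than handing f an empty vector, so V1 is sum(NA) = NA instead of sum(numeric(0)) = 0. If you need the f(empty vector) semantics from the question, one option (a sketch, assuming NA in V1 can only come from non-matching rows) is to patch those rows afterwards:
res <- DT[unique(DT, by = c("group_id", "distance")),
          on = .(group_id, distance < distance), allow.cartesian = TRUE,
          f(metric), by = .EACHI]
# replace non-match NAs with f applied to an empty vector (0 for f = sum)
res[is.na(V1), V1 := f(numeric(0))]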
data:
library(data.table)
DT <- fread("group_id distance metric
1 1.1 0.85
1 1.1 0.37
1 1.7 0.93
1 2.3 0.45
1 6.3 0.29
1 7.9 0.12
2 2.5 0.78
2 2.8 0.32")
I don't think this would be faster than the data.table option, but here is one way using dplyr:
library(dplyr)
df %>%
group_by(group_id) %>%
mutate(new = purrr::map_dbl(distance, ~f(metric[distance < .])))
where f is your function. map_dbl expects the function's return type to be double; if your function returns a different type, you might want to use map_int, map_chr, or the like.
If you want to keep only one entry per distance, you can remove the duplicates using filter and duplicated:
df %>%
group_by(group_id) %>%
mutate(new = purrr::map_dbl(distance, ~f(metric[distance < .]))) %>%
filter(!duplicated(distance))
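For instance, with a toy f that is well defined on an empty vector (a stand-in for the user-defined function in the question), the full pipeline runs like this:
library(dplyr)

# toy f for illustration: mean of the vector, 0 on an empty vector
f <- function(v) if (length(v) == 0) 0 else mean(v)

df <- data.frame(
  group_id = c(1, 1, 1, 1, 1, 1, 2, 2),
  distance = c(1.1, 1.1, 1.7, 2.3, 6.3, 7.9, 2.5, 2.8),
  metric   = c(0.85, 0.37, 0.93, 0.45, 0.29, 0.12, 0.78, 0.32)
)

df %>%
  group_by(group_id) %>%
  mutate(new = purrr::map_dbl(distance, ~f(metric[distance < .]))) %>%
  filter(!duplicated(distance)) %>%
  ungroup()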
I have asked this question before and solved the problem with Saga's help.
I am working on a simulation study. I have to reorganize my results and continue the analysis.
I have a data matrix containing my results, like this:
> data
It S X Y F
1 1 0.5 0.8 2.39
1 2 0.3 0.2 1.56
2 1 1.56 2.13 1.48
3 1 2.08 1.05 2.14
3 2 1.56 2.04 2.45
.......
It shows the iteration
S shows the second-level iteration working inside It
X shows the X coordinate obtained from a method
Y shows the Y coordinate obtained from a method
F shows the F statistic.
My problem is that I have to find the minimum F value for every iteration. So I have to store every iteration in a separate matrix or data frame and find its minimum F value.
I have tried many things, but nothing worked. Any help or ideas will be appreciated.
EDIT: Updated table information
This was the solution:
library(dplyr)
data %>%
group_by(It) %>%
slice(which.min(F))
# A tibble: 3 x 5
# Groups: It [3]
It S X Y F
1 1 2 0.30 0.20 1.56
2 2 1 1.56 2.13 1.48
3 3 1 2.08 1.05 2.14
However, I will continue with another for loop, and I want to select the X values satisfying the above condition.
For example, when I use data$X[i], the code doesn't select the values of X above (0.30, 1.56, 2.08); it selects the original values from "data", before grouping. How can I solve this problem?
I hope this is what you are expecting:
> library(dplyr)
> data %>%
group_by(It) %>%
slice(which.min(F))
# A tibble: 3 x 5
# Groups: It [3]
It S X Y F
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2 0.30 0.20 1.56
2 2 1 1.56 2.13 1.48
3 3 1 2.08 1.05 2.14
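To reuse those X values in a later loop, a simple approach (a sketch built on the output above) is to save the grouped result and index into it, instead of indexing into the original data:
res <- data %>%
  group_by(It) %>%
  slice(which.min(F)) %>%
  ungroup()

res$X                      # or dplyr::pull(res, X)
# [1] 0.30 1.56 2.08

for (i in seq_len(nrow(res))) {
  # res$X[i] is now the X of the minimum-F row of iteration i
  print(res$X[i])
}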
I have data like so:
aye <- c(0,0,3,4,5,6)
bee <- c(3,4,0,0,7,8)
see <- c(9,8,3,5,0,0)
df <- data.frame(aye, bee, see)
I am looking for a concise way to create columns based on the mean of each column in the data frame, where zeros are kept at zero.
To obtain the mean excluding zero:
df2 <- as.data.frame(t(apply(df, 2, function(x) mean(x[x>0]))))
I can't figure out how to simply replace the values in the columns with the mean excluding zero. My approach so far is:
df$aye <- ifelse(df$aye == 0, 0, df2$aye)
df$bee <- ifelse(df$bee == 0, 0, df2$bee)
df$see <- ifelse(df$see == 0, 0, df2$see)
But this gets messy with many variables; it would be nice to wrap it up in one function.
Thanks for your help!
Why can't we just use
data.frame(lapply(dat, function (u) ave(u, u > 0, FUN = mean)))
# aye bee see
#1 0.0 5.5 6.25
#2 0.0 5.5 6.25
#3 4.5 0.0 6.25
#4 4.5 0.0 6.25
#5 4.5 5.5 0.00
#6 4.5 5.5 0.00
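Here ave(u, u > 0, FUN = mean) splits u into two groups according to the logical u > 0 and replaces each value with its group mean; the mean of the zero group is 0, so zeros stay zero. A quick standalone check on the aye column:
aye <- c(0, 0, 3, 4, 5, 6)
ave(aye, aye > 0, FUN = mean)
# [1] 0.0 0.0 4.5 4.5 4.5 4.5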
Note, I used dat rather than df as the name of your data frame. df is already a function in R (the density of the F distribution), so it is best not to mask it.
We can keep the result of the apply call as a numeric vector x:
x <- apply(df, 2, function(x){ mean(x[x>0])})
df[which(df!=0, arr.ind = T)] <- x[ceiling(which(df!=0)/nrow(df))]
df
# aye bee see
#1 0.0 5.5 6.25
#2 0.0 5.5 6.25
#3 4.5 0.0 6.25
#4 4.5 0.0 6.25
#5 4.5 5.5 0.00
#6 4.5 5.5 0.00
Breaking the code down further to explain how it works:
This gives the indices where the value is not zero:
which(df != 0)
#[1] 3 4 5 6 7 8 11 12 13 14 15 16
This line decides which element of x we select for each nonzero position:
ceiling(which(df!=0)/nrow(df))
#[1] 1 1 1 1 2 2 2 2 3 3 3 3
x[ceiling(which(df!=0)/nrow(df))]
#aye aye aye aye bee bee bee bee see see see see
#4.50 4.50 4.50 4.50 5.50 5.50 5.50 5.50 6.25 6.25 6.25 6.25
Now substitute the above values at the positions in the data frame where the value isn't 0:
df[which(df!=0, arr.ind = T)] <- x[ceiling(which(df!=0)/nrow(df))]
Try rearranging what you already have into a zeroless_mean function, and then use apply on each column of your data.frame:
# Data
aye <- c(0,0,3,4,5,6)
bee <- c(3,4,0,0,7,8)
see <- c(9,8,3,5,0,0)
dff <- data.frame(aye, bee, see)
# Function
zeroless_mean <- function(x) ifelse(x==0,0,mean(x[x!=0]))
# apply
data.frame(apply(dff, 2, zeroless_mean))
# Output
aye bee see
1 0.0 5.5 6.25
2 0.0 5.5 6.25
3 4.5 0.0 6.25
4 4.5 0.0 6.25
5 4.5 5.5 0.00
6 4.5 5.5 0.00
I hope this helps.
Say I have an "integer" factor vector of length 5:
vecFactor = c(1,3,2,2,3)
and another "integer" data vector of length 5:
vecData = c(1.3,4.5,6.7,3,2)
How can I find the average of the data in each factor, so that I would get a result of:
Factor 1: Average = 1.3
Factor 2: Average = 4.85
Factor 3: Average = 3.25
tapply(vecData, vecFactor, FUN=mean)
1 2 3
1.30 4.85 3.25
I sometimes use a linear model for this instead of tapply, which is quite flexible (for instance, if you need to add weights). Don't forget the -1 in the formula: it removes the intercept, so each coefficient is a group mean rather than a contrast against a baseline level.
lm(vecData~factor(vecFactor)-1)$coef
factor(vecFactor)1 factor(vecFactor)2 factor(vecFactor)3
1.30 4.85 3.25
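As an example of the flexibility mentioned above, adding a weights argument turns the coefficients into weighted group means (the weights here are made up purely for illustration):
w <- c(1, 2, 1, 3, 1)  # hypothetical weights, one per observation
lm(vecData ~ factor(vecFactor) - 1, weights = w)$coef
# factor(vecFactor)1 factor(vecFactor)2 factor(vecFactor)3
#           1.300000           3.925000           3.666667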
To get a nicely formatted table, try the aggregate function with a data.frame:
ddf = data.frame(vecData, vecFactor)
aggregate(vecData~vecFactor, data=ddf, mean)
vecFactor vecData
1 1 1.30
2 2 4.85
3 3 3.25
data.table can also be used for this:
library(data.table)
ddt = data.table(ddf)
ddt[, list(meanval = mean(vecData)), by = vecFactor]
vecFactor meanval
1: 1 1.30
2: 3 3.25
3: 2 4.85
I ran into some issues using the ddply function from the plyr package. I created a data frame that looks like this one:
u v intensity season
24986 -1.97 -0.35 2.0 1
24987 -1.29 -1.53 2.0 1
24988 -0.94 -0.34 1.0 1
24989 -1.03 2.82 3.0 1
24990 1.37 3.76 4.0 1
24991 1.93 2.30 3.0 2
24992 3.83 -3.21 5.0 2
24993 0.52 -2.95 3.0 2
24994 3.06 -2.57 4.0 2
24995 2.57 -3.06 4.0 2
24996 0.34 -0.94 1.0 2
24997 0.87 4.92 5.0 3
24998 0.69 3.94 4.0 3
24999 4.60 3.86 6.0 3
I tried to use the cumsum function on the u and v values, but I don't get what I want. When I select the subset of my data corresponding to one season, for example:
x <- cumsum(mydata$u[56297:56704]*10.8)
y <- cumsum(mydata$v[56297:56704]*10.8)
...this works perfectly. The thing is that I have a huge dataset (67,208 rows) with 92 seasons, and I'd like to make this computation work on each subset of the data. So I tried this:
new <- ddply(mydata, .(mydata$seasons), summarize, x=c(0,cumsum(mydata$u*10.8)))
...and the result looks like this :
24986 1 NA
24987 1 NA
24988 1 NA
I found some questions related to this one on Stack Overflow and other websites, but none of them helped me deal with my problem. If someone has an idea, you're welcome ;)
Don't use your data frame's name inside the plyr call; just reference the column name as though it were already defined:
ddply(mydata, .(seasons), summarise, x=c(0, cumsum(u*10.8)))
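For comparison, here is a dplyr sketch of the same per-season cumulative sums (without the prepended 0, so each value aligns with its row; the column is assumed to be named season, as in the sample data above):
library(dplyr)

mydata %>%
  group_by(season) %>%
  mutate(x = cumsum(u * 10.8),
         y = cumsum(v * 10.8)) %>%
  ungroup()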