I'm trying to get the overall distance that an animal traveled by using a function that uses differences in lat long positions to output a distance traveled. Having some issues with nested loops.
The function ComputeDistance takes the arguments Lat1, Lat2, Long1, Long2 in that order. Column 5 of DistTest contains latitude values and column 6 contains longitude values.
So for the object "output", I'm trying to get sequential distances going through all 38 rows.
e.g.
ComputeDistance(DistTest[1,5],DistTest[2,5],DistTest[1,6],DistTest[2,6])
followed by:
ComputeDistance(DistTest[2,5],DistTest[3,5],DistTest[2,6],DistTest[3,6])
followed by:
ComputeDistance(DistTest[3,5],DistTest[4,5],DistTest[3,6],DistTest[4,6])
....
ComputeDistance(DistTest[37,5],DistTest[38,5],DistTest[37,6],DistTest[38,6])
I'm thinking that the problem is that the loop is going through every possible combination of DL and EL, not just going sequentially in order.
Below is the code I'm using currently.
## rows 1-37 and rows 2-38
DL <- 1:37
EL <- 2:38
## subsetting for one tagged animal
DistTest <- subset(Dispsum, Tag.ID == 1658)
## creating blank objects to save output in
testid <- c()
testdistance <- c()
for( j in DL){
for( k in EL){
output <- (ComputeDistance(DistTest[j,5], DistTest[k,5],DistTest[j,6], DistTest[k,6]))
Name <- 1658
testid <- rbind(testid, Name)
testdistance <- rbind(testdistance,output)
}
}
Generally in R, it is better to find functions that do the looping for you, as most of them are set up for that. In this case, you can try using mutate and lead from the dplyr package:
library(dplyr)
df <- dplyr::tibble(lat = 1:5, lon = 5:1)
df
# A tibble: 5 x 2
#     lat   lon
#   <int> <int>
# 1     1     5
# 2     2     4
# 3     3     3
# 4     4     2
# 5     5     1
df %>% mutate(distance = ComputeDistance(lat, lead(lat), lon, lead(lon)))
# A tibble: 5 x 3
#     lat   lon distance
#   <int> <int>    <dbl>
# 1     1     5     1.41
# 2     2     4     1.41
# 3     3     3     1.41
# 4     4     2     1.41
# 5     5     1    NA
If you really want to stick with for loops, you only need one for this problem. You were right in saying that you're going through every combination. One alternative would be:
for (i in 1:37) {
output <- ComputeDistance(DistTest[i, 5], DistTest[i + 1, 5],
DistTest[i, 6], DistTest[i + 1, 6])
Name <- 1658
testid <- rbind(testid, Name)
testdistance <- rbind(testdistance, output)
}
One reason to avoid this construct is that you are incrementally growing an object: each rbind() call copies everything accumulated so far, which gets slow as the number of rows grows.
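Both points above (one pass over consecutive rows, no growing objects) can be sketched in base R. This is only an illustration: the ComputeDistance below is a hypothetical haversine stand-in, since the real function is not shown in the question, and the track coordinates are made up.

```r
# Hypothetical stand-in for ComputeDistance (haversine distance in km).
# The real function is not shown in the question; only the argument order
# (Lat1, Lat2, Long1, Long2) is taken from it.
ComputeDistance <- function(lat1, lat2, long1, long2, radius = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlong <- (long2 - long1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlong / 2)^2
  2 * radius * asin(pmin(1, sqrt(a)))
}

# Made-up track with four position fixes
track <- data.frame(lat = c(0, 0, 1, 1), long = c(0, 1, 1, 2))
n <- nrow(track)

# Vectorized: pair rows 1..(n-1) with rows 2..n
step_dist <- ComputeDistance(track$lat[-n], track$lat[-1],
                             track$long[-n], track$long[-1])

# Same result with a single loop and a preallocated vector
loop_dist <- numeric(n - 1)
for (i in seq_len(n - 1)) {
  loop_dist[i] <- ComputeDistance(track$lat[i], track$lat[i + 1],
                                  track$long[i], track$long[i + 1])
}

# Overall distance traveled
total_dist <- sum(step_dist)
```

The vectorized call only works if the distance function accepts vectors, which the stand-in does; if the real ComputeDistance handles one pair at a time, the preallocated loop is the safer of the two.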
I have a data frame in R that I want to aggregate. The summary function that I want to apply to each subset is a custom function that takes several variables (columns) as input, and returns a vector or list of variable length. As an output, I would like to have a data frame with a column of the grouping variable, and a single other column containing the output vector (of varying length).
To give a mock example, suppose I have the following dataframe:
df <- data.frame( particle = c(rep("X",5),rep("Y",3),rep("Z",4)),
time = c(1:5,1:3,1:4), state = c(c("A","A","B","C","A"),c("A","B","B"),
c("B","C","A","A")), energy = round(runif(12,0,10)))
> df
particle time state energy
1 X 1 A 9
2 X 2 A 8
3 X 3 B 7
4 X 4 C 5
5 X 5 A 0
6 Y 1 A 1
7 Y 2 B 7
8 Y 3 B 7
9 Z 1 B 3
10 Z 2 C 9
11 Z 3 A 5
12 Z 4 A 6
I would like to obtain for each particle a list of the energy they had every time they changed state. The output I'm looking for is something like this:
>
particle energy
1 X c(9,7,5,0)
2 Y c(1,7)
3 Z c(3,9,5)
To do so, I would define a function like the following:
myfun <- function(state, energy){
tempstate <- state[1]
energyvec <- energy[1]
for(i in 2:length(state)){
if(state[i] != tempstate){
energyvec <- c(energyvec, energy[i])
tempstate <- state[i]
}
}
return(energyvec)
}
And then try to pass it to aggregate somehow.
The two data structures I tried for this are data.frame and data.table.
In data.frame, using a custom function that returns a vector seems to give the correct output format I am looking for, that is where the output column is really a list, and each row contains a list with the output of the function. However, I can't seem to pass several columns to the function when aggregating this way.
With a data.table, the aggregation is easier to do when considering a function of several variables. However, I can't seem to obtain the output I'm looking for. Indeed,
dt <- data.table(df)
dt[,myfun(state, energy), by= particle]
only returns the first element of energyvec (instead of a vector), and
dt <- data.table(df)
dt[,as.list(myfun(state, energy)), by= particle]
doesn't work as the outputs don't all have the same length.
Is there an alternative way to go to accomplish this?
Thank you very much in advance for all your help!
Here's a tidyverse approach:
library(tidyverse)
df <- data.frame( particle = c(rep("X",5),rep("Y",3),rep("Z",4)),
time = c(1:5,1:3,1:4), state = c(c("A","A","B","C","A"),c("A","B","B"),
c("B","C","A","A")), energy = round(runif(12,0,10)))
# Hard-code energy to make this reproducible
df$energy <- c(9, 8, 7, 5, 0, 1, 7, 7, 3, 9, 5, 6)
df %>%
group_by(particle) %>%
mutate(
changed_state = coalesce(state != lag(state, 1), TRUE)
) %>%
filter(changed_state) %>%
summarise(
string = toString(energy)
)
#> # A tibble: 3 x 2
#> particle string
#> <fct> <chr>
#> 1 X 9, 7, 5, 0
#> 2 Y 1, 7
#> 3 Z 3, 9, 5
I'd run each line of the pipe individually to see what it does. Basically, create a changed_state variable by checking whether the current state differs from the previous state, lag(state, 1); coalesce() turns the NA produced for the first row of each group into TRUE, so the first row is always kept. Since we only care about rows where the state changed, we filter where this is TRUE (a more verbose line would be filter(changed_state == TRUE)). The toString function collapses the rows of energy as desired, and we are already "grouped" by particle.
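If a real list column is wanted rather than a string, the same "compare with the previous state" idea can be sketched in base R. This is a minimal sketch, not the answer's method: the helper name energy_at_change is made up for illustration, and the energy values are the hard-coded ones from above.

```r
df <- data.frame(
  particle = c(rep("X", 5), rep("Y", 3), rep("Z", 4)),
  time     = c(1:5, 1:3, 1:4),
  state    = c("A", "A", "B", "C", "A", "A", "B", "B", "B", "C", "A", "A"),
  energy   = c(9, 8, 7, 5, 0, 1, 7, 7, 3, 9, 5, 6)
)

# Hypothetical helper: keep the first energy value and every value
# recorded at a row whose state differs from the previous row's state.
energy_at_change <- function(state, energy) {
  state <- as.character(state)
  keep <- c(TRUE, state[-1] != state[-length(state)])
  energy[keep]
}

# Split by particle, apply the helper, and store the vectors in a list column
pieces <- split(df, df$particle)
res <- data.frame(particle = names(pieces))
res$energy <- unname(lapply(pieces, function(d) energy_at_change(d$state, d$energy)))
```

Each element of res$energy is then a numeric vector of its own length, e.g. res$energy[[2]] holds the values for particle Y.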
data.table approach
sample data
#stolen from JasonAizkalns's answer
df <- data.frame( particle = c(rep("X",5),rep("Y",3),rep("Z",4)),
time = c(1:5,1:3,1:4), state = c(c("A","A","B","C","A"),c("A","B","B"),
c("B","C","A","A")), energy = round(runif(12,0,10)))
df$energy <- c(9, 8, 7, 5, 0, 1, 7, 7, 3, 9, 5, 6)
code
library( data.table )
#create data.table
dt <- as.data.table(df)
#use `uniqlist` to get rownumbers where the value of `state` changes,
# then get these rows into a subset
result <- dt[ data.table:::uniqlist(dt[, c("particle", "state")]), ]
#split the resulting `energy`-column by the contents of the `particle`-column
l <- split( result$energy, result$particle)
# $X
# [1] 9 7 5 0
#
# $Y
# [1] 1 7
#
# $Z
# [1] 3 9 5
#create final output
data.table( particle = names(l), energy = l )
# particle energy
# 1: X 9,7,5,0
# 2: Y 1,7
# 3: Z 3,9,5
Another possible data.table approach
library(data.table)
setDT(DF)[, .(energy=.(.SD[, first(energy), by=.(rleid(state))]$V1)), by=.(particle)]
output:
particle energy
1: X 9,4,6,9
2: Y 2,9
3: Z 7,6,1
data:
set.seed(0L)
DF <- data.frame( particle = c(rep("X",5),rep("Y",3),rep("Z",4)),
time = c(1:5,1:3,1:4), state = c(c("A","A","B","C","A"),c("A","B","B"),
c("B","C","A","A")), energy = round(runif(12,0,10)))
DF
# particle time state energy
# 1 X 1 A 9
# 2 X 2 A 3
# 3 X 3 B 4
# 4 X 4 C 6
# 5 X 5 A 9
# 6 Y 1 A 2
# 7 Y 2 B 9
# 8 Y 3 B 9
# 9 Z 1 B 7
# 10 Z 2 C 6
# 11 Z 3 A 1
# 12 Z 4 A 2
I have a dataframe:
Start <- data.frame("Number" = 2,"Square" = 4,"Cube" = 8)
A Vector of inputs:
Numbers <- c(3,5)
I want to pass the elements of Numbers to the function SquareCube and fill the dataframe with the results:
SquareCube <- function(x){ df <- c(x^2,x^3)
df}
Desired Output:
Filled <- data.frame("Number" = c(2,3,5),"Square" = c(4,9,25),"Cube" = c(8,27,125))
Note: I already searched for this topic, but in this case the size of the vector Numbers can vary. My intent is to fill the dataframe with the results of the function.
Thanks
If I am reading your question right, you may just be having issues with structure that do.call may be able to help with. I also redefined the function slightly to accommodate the naming:
Start <- data.frame("Number" = 2,"Square" = 4, "Cube" = 8)
Number <- c(3,5)
Define your function:
SquareCube <- function(x){ list(Number=x,Square=x^2,Cube=x^3) }
Then construct the data frame with desired end results:
> rbind(Start, data.frame( do.call(cbind, SquareCube(Number)) ))
Number Square Cube
1 2 4 8
2 3 9 27
3 5 25 125
You can also make a wrapper function and just hand it the Start data and the original Number list that you want to process, which will yield a data frame:
> makeResults <- function(a, b) { rbind(a, data.frame(do.call(cbind,SquareCube(b)))) }
> makeResults(Start, Number)
Number Square Cube
1 2 4 8
2 3 9 27
3 5 25 125
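Because the redefined SquareCube returns a named list of equal-length vectors, a shorter variant is possible. The following is a sketch of that alternative, not part of the answer above: it assumes as.data.frame() on the list is acceptable in place of the do.call(cbind, ...) step.

```r
Start <- data.frame(Number = 2, Square = 4, Cube = 8)
Numbers <- c(3, 5)

# Same function as in the answer: works on a whole vector at once
SquareCube <- function(x) list(Number = x, Square = x^2, Cube = x^3)

# A named list of equal-length vectors converts directly to a data frame,
# and rbind() matches the columns by name
Filled <- rbind(Start, as.data.frame(SquareCube(Numbers)))
```

Since x^2 and x^3 are vectorized, this works for a Numbers vector of any length without an explicit loop.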
The outer() function produces a matrix with exactly the values you want. You can then convert it to a data frame and rename the columns.
(Filled <- outer(
c(2, 3, 5),
1:3,
FUN = "^"
))
#> [,1] [,2] [,3]
#> [1,] 2 4 8
#> [2,] 3 9 27
#> [3,] 5 25 125
You can then use whatever functions you know to change the class and the column names of this matrix. Here, for instance, as_tibble() and dplyr::rename():
library(tidyverse)
Filled %>%
as_tibble() %>% # make data frame
rename(Number = V1, Square = V2, Cube = V3) # rename column names
#> # A tibble: 3 x 3
#> Number Square Cube
#> <dbl> <dbl> <dbl>
#> 1 2 4 8
#> 2 3 9 27
#> 3 5 25 125
I've been going through the posts regarding summarizing data, but haven't found what I'm looking for.
I wish to create a summary "count-table" which will allow me to see how often a certain medication was given to patients. The fact that some patients received multiple medications simultaneously doesn't matter, because I simply want a summary of all the medication given and then calculate which percentage each medication class is of all medication given. The issue is, that I don't know the names of the possible medication given, they're "hidden" somewhere in the data.frame, thus, I have to specify which columns R would have to look through first to create a "list" by which it can then summarize the columns.
I anticipate that this points towards the plyr package but my attempts to use the functions in it correctly haven't worked until now.
My df looks something like this
x <- sample(letters[1:4], 20, replace = TRUE)
y <- sample(letters[1:5], 20, replace = TRUE)
z <- sample(letters[1:6], 20, replace = TRUE)
df<-data.frame(x,y,z)
head(df)
x y z
1 a a f
2 a c d
3 b b e
4 c d b
5 a a b
6 c d d
As you can see, the data.frame contains three columns that share some letters but also contain different ones, indicating the names of the medications given.
What I'd now like to do is create a list of unique characters,
unique(x)
unique(y)
unique(z)
which serves as my reference list by which R can then summarize the counts in each column.
summary(df)
returns a summary of counts of each column but not of each ID itself and also without a percentage of all unique counts.
I also tried the following, which sort of goes in the right direction, but ideally, I'd like to have a list of unique characters, which I can feed to the length argument
ddply(df, .(x), summarize, counts=length(unique(y)))
Any idea how I could do this? Help much appreciated.
If you just want a count for the whole data frame, you can use table(unlist(df)) (see also @goctlr's answer), and if you also want proportions: prop.table(table(unlist(df))). If you also want the counts for the individual columns, it gets more difficult.
To get the count for each column and the total count, I wrote the following function:
# some reproducible data:
set.seed(1)
x <- sample(letters[1:4], 20, replace = TRUE)
y <- sample(letters[1:5], 20, replace = TRUE)
z <- sample(letters[1:6], 20, replace = TRUE)
df <- data.frame(x,y,z)
# the function
func <- function(x) {
x2 <- data.frame()
nms <- names(x)
id <- sort(unique(unlist(x)))
for(i in 1:length(id)) {
for(j in 1:length(nms)) {
x2[i,j] <- sum(x[,j] %in% id[i])
}
}
names(x2) <- nms
x2$total <- rowSums(x2)
x2 <- cbind(id,x2)
assign("dat", x2, envir = .GlobalEnv)
}
Executing the function with func(df) will give you a data frame dat in your global environment:
> dat
id x y z total
1 a 4 4 3 11
2 b 5 5 2 12
3 c 5 4 4 13
4 d 6 4 5 15
5 e 0 3 5 8
6 f 0 0 1 1
After that, you can calculate the percentages with for example the dplyr package:
library(dplyr)
dat <- dat %>% mutate(xperc=round(100*x/sum(total),1),
yperc=round(100*y/sum(total),1),
zperc=round(100*z/sum(total),1),
perc=round(100*total/sum(total),1))
which results in:
> dat
id x y z total xperc yperc zperc perc
1 a 4 4 3 11 6.7 6.7 5.0 18.3
2 b 5 5 2 12 8.3 8.3 3.3 20.0
3 c 5 4 4 13 8.3 6.7 6.7 21.7
4 d 6 4 5 15 10.0 6.7 8.3 25.0
5 e 0 3 5 8 0.0 5.0 8.3 13.3
6 f 0 0 1 1 0.0 0.0 1.7 1.7
For a summary of counts for the whole data frame you can unlist the data frame and then call the table function:
table(unlist(df))
To get the percentage of total counts, save the result and use the prop.table function:
tout <- table(unlist(df))
prop.table(tout)
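The per-column counts and the overall percentages can also be obtained without nested loops by tabulating each column against a shared set of levels. A minimal sketch, assuming a small made-up data frame (meds) in place of the random one:

```r
# Made-up example data: each column holds medication names
meds <- data.frame(
  x = c("a", "a", "b", "c"),
  y = c("a", "c", "b", "d"),
  z = c("f", "d", "e", "b"),
  stringsAsFactors = FALSE
)

# Reference list of all medication names found anywhere in the data
lv <- sort(unique(unlist(meds)))

# Count every name in every column; fixing the factor levels keeps zero counts
counts <- sapply(meds, function(col) table(factor(col, levels = lv)))

# Totals per medication and their share of all medications given
total <- rowSums(counts)
perc <- round(100 * total / sum(total), 1)

summary_tab <- data.frame(id = lv, counts, total = total, perc = perc,
                          row.names = NULL)
```

Using factor(col, levels = lv) is the design choice that matters here: without it, a medication missing from one column would be dropped from that column's table and the results would not line up.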
I want to make calculations for elements of individual rows using a for-loop.
I have two data.frames
df: contains data of all trading-days stocks
events: contains data of only event days of stocks
Even though there might be a much easier approach for this specific example, I’d like to know how to do such a task with a loop in a loop (for-loops).
First, my data.frames:
comp1 <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3)
date1 <- c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5)
ret <- c(1.2,2.2,-0.5,0.98,0.73,-1.3,-0.02,0.3,1.1,2.0,1.9,-0.98,1.45,1.71,0.03)
df <- data.frame(comp1,date1,ret)
comp2 <- c(1,1,2,2,2,3,3)
date2 <- c(2,4,1,2,5,4,5)
q <- paste("")
events <- data.frame(comp2,date2,q)
df
# comp1 date1 ret
# 1 1 1 1.20
# 2 1 2 2.20
# 3 1 3 -0.50
# 4 1 4 0.98
# 5 1 5 0.73
# 6 2 1 -1.30
# 7 2 2 -0.02
# 8 2 3 0.30
# 9 2 4 1.10
# 10 2 5 2.00
# 11 3 1 1.90
# 12 3 2 -0.98
# 13 3 3 1.45
# 14 3 4 1.71
# 15 3 5 0.03
events
# comp2 date2 q
# 1 1 2
# 2 1 4
# 3 2 1
# 4 2 2
# 5 2 5
# 6 3 4
# 7 3 5
I want to make calculations of df$ret. As an example let's just take 2 * df$ret. The results for each event-day should be stored in mylist. The final output should be the data.frame "events" with a column "q" where I want the results of the calculation to be stored.
# important objects:
companies <- as.vector(unique(df$comp1)) # all the companies (here: 1, 2, 3)
days <- as.vector(unique(df$date1)) # all the trading-days (here: 1, 2, 3, 4, 5)
mylist <- vector('list', length(companies)) # a list where the results should be stored for each company
I came up with some piece of code which doesn't work. But I still think it should look something like this:
for(i in 1:nrow(events)) {
events_k <- events[which(comp1==companies[i]),] # data of all event days of company i
df_k <- df[which(comp2==companies[i]),] # data of all trading days of company i
for(j in 1:nrow(df_k)) {
events_k[j, "q"] <- df_k[which(days==events_k[j,"date2"]), "ret"] * 2
}
mylist[i] <- events_k
}
I don't understand how to set up the loop inside the other loop and how to store the results in mylist. Any help appreciated!!
Thank you!
Don't feel bad. All of your problems are common R gotchas. First, try creating events with
events <- data.frame(comp2,date2,q,stringsAsFactors=FALSE)
instead. In R versions before 4.0.0, data.frame() converts strings to factors by default, so your column q is implicitly converted to a factor, which disallows the arithmetic * 2 operation later.
Next, let's consider the fixed loop
for(i in 1:nrow(events)) {
events_k <- events[which(comp1==companies[i]),] # data of all event days of company i
df_k <- df[which(comp2==companies[i]),] # data of all trading days of company i
for(j in 1:nrow(df_k)) {
events_k[j, "q"] <-
if (0 == length(tmp <- df_k[which(days==events_k[j,"date2"]), "ret"] * 2)) NA
else tmp
}
mylist[[i]] <- events_k
}
Your first problem was on the last line, where you used [ instead of [[. In R, [ on a list always returns (and assigns into) a sub-list, whereas [[ accesses the element itself, so mylist[[i]] <- events_k stores the whole data frame as element i.
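The difference between [ and [[ on a list is easy to check in isolation. A small sketch with made-up objects (holder and piece are illustrative names):

```r
holder <- vector("list", 2)
piece <- data.frame(a = 1:2, b = c("u", "v"))

# `[[` stores the data frame itself as element 1
holder[[1]] <- piece

# `[` replaces a sub-list, so the right-hand side must itself be a list
holder[2] <- list(piece)

# Note: holder[2] <- piece would treat the data frame as a list of two
# columns, keep only the first one, and emit a warning.

first_is_df <- is.data.frame(holder[[1]])
sublist_class <- class(holder[1])
```

Here holder[1] is still a list of length one, while holder[[1]] is the data frame.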
Your second problem is that sometimes which(days==events_k[j,"date2"]) is integer(0) (i.e., there is no matching event date), so there is nothing to assign. With the check above the code will work, but you'll still have a lot of data frames with NAs. To remove those, you could do something like:
mylist <- Filter(function(df) nrow(df) > 0,
lapply(mylist, function(df) df[apply(df, 1, function(row) !all(is.na(row))), ]))
which will filter out list elements with empty dataframes, and rows in dataframes with all NA.
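For comparison, here is a sketch of how the nested loop could be written so that no NA rows are produced in the first place. It is an alternative under a few assumptions, not the loop from the answer: the outer loop runs over companies rather than over rows of events, each subset is taken with its own data frame's column, and q starts as a numeric NA column.

```r
comp1 <- rep(1:3, each = 5)
date1 <- rep(1:5, times = 3)
ret <- c(1.2, 2.2, -0.5, 0.98, 0.73, -1.3, -0.02, 0.3, 1.1, 2.0,
         1.9, -0.98, 1.45, 1.71, 0.03)
df <- data.frame(comp1, date1, ret)

events <- data.frame(comp2 = c(1, 1, 2, 2, 2, 3, 3),
                     date2 = c(2, 4, 1, 2, 5, 4, 5),
                     q = NA_real_)

companies <- unique(df$comp1)
mylist <- vector("list", length(companies))

for (i in seq_along(companies)) {
  events_k <- events[events$comp2 == companies[i], ]  # event days of company i
  df_k <- df[df$comp1 == companies[i], ]              # trading days of company i
  for (j in seq_len(nrow(events_k))) {
    # return of company i on the j-th event day (empty if no match)
    hit <- df_k$ret[df_k$date1 == events_k$date2[j]]
    events_k$q[j] <- if (length(hit) == 0) NA_real_ else 2 * hit[1]
  }
  mylist[[i]] <- events_k
}

# Stack the per-company pieces back into one data frame
result <- do.call(rbind, mylist)
```

With seq_len(nrow(events_k)) the inner loop is skipped cleanly for a company that has no events, which 1:nrow(...) would not do.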
I would like to aggregate an R data.frame by equal amounts of the cumulative sum of one of the variables in the data.frame. I googled quite a lot, but probably I don't know the correct terminology to find anything useful.
Suppose I have this data.frame:
> x <- data.frame(cbind(p=rnorm(100, 10, 0.1), v=round(runif(100, 1, 10))))
> head(x, 20)
p v
1 10.002904 4
2 10.132200 2
3 10.026105 6
4 10.001146 2
5 9.990267 2
6 10.115907 6
7 10.199895 9
8 9.949996 8
9 10.165848 8
10 9.953283 6
11 10.072947 10
12 10.020379 2
13 10.084002 3
14 9.949108 8
15 10.065247 6
16 9.801699 3
17 10.014612 8
18 9.954638 5
19 9.958256 9
20 10.031041 7
I would like to reduce the x to a smaller data.frame where each line contains the weighted average of p, weighted by v, corresponding to an amount of n units of v. Something of this sort:
> n <- 100
> cum.v <- cumsum(x$v)
> f <- cum.v %/% n
> x.agg <- aggregate(cbind(v*p, v) ~ f, data=x, FUN=sum)
> x.agg$'v * p' <- x.agg$'v * p' / x.agg$v
> x.agg
f v * p v
1 0 10.039369 98
2 1 9.952049 94
3 2 10.015058 104
4 3 9.938271 103
5 4 9.967244 100
6 5 9.995071 69
First question: I was wondering if there is a better (more efficient) approach than the code above. The second, more important, question is how to correct the code above in order to obtain more precise bucketing. Namely, each row in x.agg should contain exactly 100 units of v, not just approximately, as is the case above. For example, the first row contains the aggregate of the first 17 rows of x, which correspond to 98 units of v. The next row (the 18th) contains 5 units of v and is fully included in the next bucket. What I would like to achieve instead is to attribute 2 units of row 18 to the first bucket and the remaining 3 units to the following one.
Thanks in advance for any help provided.
Here's another method that does this without repeating each p v times. The way I understand it, the rows where the cumulative sum crosses 100 (see below)
18 9.954638 5 98
19 9.958256 9 107
should be changed to:
18 9.954638 5 98
19.1 9.958256 2 100 # ---> 2 units will be considered with previous group
19.2 9.958256 7 107 # ----> remaining 7 units will be split for next group
The code:
n <- 100
# get cumulative sum, an id column (for retrace) and current group id
x <- transform(x, cv = cumsum(x$v), id = seq_len(nrow(x)), grp = cumsum(x$v) %/% n)
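# Install IRanges from Bioconductor
# (biocLite() has since been replaced by the BiocManager package)
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("IRanges")
require(IRanges)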
ir1 <- successiveIRanges(x$v)
ir2 <- IRanges(seq(n, max(x$cv), by=n), width=1)
o <- findOverlaps(ir1, ir2)
# gets position where multiple of n(=100) occurs
# (where we'll have to do something about it)
pos <- queryHits(o)
# how much do the values differ from multiple of 100?
val <- start(ir2)[subjectHits(o)] - start(ir1)[queryHits(o)] + 1
# we need "pos" new rows of "pos" indices
x1 <- x[pos, ]
x1$v <- val # corresponding values
# reduce the group by 1, so that multiples of 100 will
# belong to the previous row
x1$grp <- x1$grp - 1
# subtract val in the original data x
x$v[pos] <- x$v[pos] - val
# bind and order them
x <- rbind(x1,x)
x <- x[with(x, order(id)), ]
# remove unnecessary entries
x <- x[!(duplicated(x$id) & x$v == 0), ]
x$cv <- cumsum(x$v) # updated cumsum
x$id <- NULL
require(data.table)
x.dt <- data.table(x, key="grp")
x.dt[, list(res = sum(p*v)/sum(v), cv = tail(cv, 1)), by=grp]
Running on your data:
# grp res cv
# 1: 0 10.037747 100
# 2: 1 9.994648 114
Running on @geektrader's data:
# grp res cv
# 1: 0 9.999680 100
# 2: 1 10.040139 200
# 3: 2 9.976425 300
# 4: 3 10.026622 400
# 5: 4 10.068623 500
# 6: 5 9.982733 562
Here's a benchmark on a relatively big data:
set.seed(12345)
x <- data.frame(cbind(p=rnorm(1e5, 10, 0.1), v=round(runif(1e5, 1, 10))))
require(rbenchmark)
benchmark(out <- FN1(x), replications=10)
# test replications elapsed relative user.self
# 1 out <- FN1(x) 10 13.817 1 12.586
It takes about 1.4 seconds on 1e5 rows.
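The same splitting can be sketched in base R without IRanges by computing how many units of each row fall into each bucket. This is a swapped-in technique, not the code benchmarked above: it builds a rows-by-buckets overlap matrix with outer(), so it is only an illustration for small inputs, and the tiny data frame is made up.

```r
# Made-up data: 10 units of v in total, bucket size n = 5
x <- data.frame(p = c(10, 11, 12, 13), v = c(3, 4, 2, 1))
n <- 5

# Each row covers the interval (start, end] on the cumulative-v axis
end <- cumsum(x$v)
start <- end - x$v

# Each bucket covers (lo, hi]; the last bucket may be shorter than n
lo <- seq(0, max(end) - 1, by = n)
hi <- pmin(lo + n, max(end))

# overlap[i, k] = units of row i that fall into bucket k
overlap <- pmax(outer(end, hi, pmin) - outer(start, lo, pmax), 0)

# Weighted average of p per bucket, weighted by the overlapping units
res <- data.frame(
  bucket = seq_along(lo) - 1,
  p = colSums(overlap * x$p) / colSums(overlap),
  v = colSums(overlap)
)
```

In this example the second row (4 units) is split into 2 units for bucket 0 and 2 units for bucket 1, which is the behavior asked for in the question.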
If you are looking for precise bucketing, I am assuming the value of p is the same for the two parts of a "split" v,
i.e. in your example, the value of p for the 2 units of row 18 that go into the first bucket is 9.954638.
With the above assumption, you can do the following for datasets that are not super large:
> set.seed(12345)
> x <- data.frame(cbind(p=rnorm(100, 10, 0.1), v=round(runif(100, 1, 10))))
> z <- unlist(mapply(function(x,y) rep(x,y), x$p, x$v, SIMPLIFY=T))
This creates a vector with each value of p repeated v times for each row; the result is combined into a single vector using unlist.
After this, aggregation is trivial using the aggregate function:
> aggregate(z, by=list((1:length(z)-0.5)%/%100), FUN=mean)
Group.1 x
1 0 9.999680
2 1 10.040139
3 2 9.976425
4 3 10.026622
5 4 10.068623
6 5 9.982733
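The expansion step can be checked by hand on a tiny example. The sketch below uses made-up data and rep() with a times vector, which should give the same expanded vector as the mapply/unlist combination as long as v holds whole numbers:

```r
# Made-up data: 10 units of v in total, bucket size n = 5
x <- data.frame(p = c(10, 11, 12, 13), v = c(3, 4, 2, 1))
n <- 5

# One entry of p per unit of v
z <- rep(x$p, times = x$v)

# Units 1..n go to bucket 0, units n+1..2n to bucket 1, and so on
bucket <- (seq_along(z) - 1) %/% n

# Mean of p per bucket, plus the number of units each bucket holds
agg <- aggregate(z, by = list(bucket = bucket), FUN = mean)
agg$v <- as.vector(table(bucket))
```

Bucket 0 holds the values 10, 10, 10, 11, 11 and bucket 1 holds 11, 11, 12, 12, 13, so the row with v = 4 is split across the two buckets.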