Convert xy dataset to array - r

I have a dataset of time/value e.g.
> df <- data.frame(time=c(1,4,5,6), speed=c(100, 500, 600, 800))
I want to convert it to an array with one item per time tick:
> some_func(df, speed ~ time, step=1)
[1] 100 100 100 500 600 800
Notice that values for time == 2 and 3 were added.
Then I can use it in the cross correlation function.

One possible way is to build a step function with stepfun, which carries each value forward across the gaps, as in your example.
sf<-stepfun(df$time,c(df$speed[1],df$speed))
sf(1:6)

This will return the values that you want.
df$speed[rep(seq_len(nrow(df)), c(diff(df$time), 1))]
[1] 100 100 100 500 600 800
Here, rep is given a vector as its second (times) argument, so each row index of df is repeated a different number of times, and the result is used to index df$speed. diff computes the gaps between successive time values, and c(..., 1) appends a 1 so that the final value is returned once.
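If you want something closer to the some_func(df, speed ~ time, step=1) call sketched in the question, one option is approx with method = "constant", which carries the last observed value forward. This is only a sketch: expand_ticks is a made-up name, and it takes the two columns directly instead of a formula.

```r
# Hypothetical helper; the name expand_ticks is not from any package.
expand_ticks <- function(time, value, step = 1) {
  # One tick per step between the first and last observed time
  ticks <- seq(min(time), max(time), by = step)
  # method = "constant" returns the value of the nearest observation to the left
  approx(time, value, xout = ticks, method = "constant")$y
}

df <- data.frame(time = c(1, 4, 5, 6), speed = c(100, 500, 600, 800))
expand_ticks(df$time, df$speed)
# [1] 100 100 100 500 600 800
```

Unlike the rep/diff approach, this also works when step is not 1 or when the time values are not integers.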

Related

How to calculate the sum of a dimension of an array and redistribute it to the other dimensions?

I am working with climatological data that consists of a gridded data set of hourly precipitation values. The grid consists of 300 x 300 points (which include the geographical coordinates) and the data values cover a one-day period (24 hours), so the resulting array has the dimensions 300 300 24.
Now I need to calculate the day sum of the precipitation for the same grid. So what needs to be done is to sum up all the values in the third dimension of the array and assign these day sum values to the same grid values from the first two dimensions.
The resulting dimensions should then be 300 300 1.
Does anybody have advice on how to achieve that?
You can use apply across the first two dimensions.
Suppose I have an array with the same dimensions as the one in your question:
set.seed(69)
data_array <- array(runif(300 * 300 * 24), dim = c(300, 300, 24))
dim(data_array)
#> [1] 300 300 24
Now if I wanted the daily sum of the precipitation in the cell on the very top left of the grid, I could do:
sum(data_array[1, 1, ])
#> [1] 12.04836
So to get it for every cell in the grid, I use apply over the first two dimensions:
result <- apply(data_array, 1:2, sum)
This gives me a 300 * 300 grid:
dim(result)
#> [1] 300 300
And we can see that the top left cell has the sum of the top left cells of each of the 24 slices of the array:
result[1, 1]
#> [1] 12.04836
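As a possible alternative, rowSums with dims = 2 sums over every dimension after the second, which should give the same grid without calling sum once per cell. The sketch below uses a small stand-in array (small_array is a name made up here) and also shows how to get the trailing dimension of length 1 that the question asks for.

```r
# Small stand-in for the 300 x 300 x 24 array (assumed data, for illustration)
set.seed(1)
small_array <- array(runif(3 * 4 * 24), dim = c(3, 4, 24))

# Treat the first two dimensions as "rows" and sum over the third
day_sum <- rowSums(small_array, dims = 2)
dim(day_sum)
#> [1] 3 4

# Same values as the apply() approach
all.equal(day_sum, apply(small_array, 1:2, sum))
#> [1] TRUE

# Add a third dimension of length 1 if a 3-d array is required
dim(day_sum) <- c(dim(day_sum), 1)
dim(day_sum)
#> [1] 3 4 1
```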

Composing a data.frame from loop-generated sequences

I have a data set which is made up of observations of the weights of fish, the julian dates they were captured on, and their names. I am seeking to assess what the average growth rate of these fish is according to the day of the year (julian date). I believe the best method to do this is to compose a data.frame with two fields: "Julian Date" and "Growth". The idea is this: for a fish which is observed on January 1 (1) at weight 100 and a fish observed again on April 10 (101) at weight 200, the growth rate would be 100g/100days, or 1g/day. I would represent this in a data.frame as 100 rows in which the "Julian Date" column is composed of the Julian date sequence (1:100) and the "Growth" column is composed of the average growth rate (1g/day) over all days.
I have attempted to write a for loop which passes through each fish, calculates the average growth rate, then creates a list in which each element contains the sequence of Julian dates and the growth rate (repeated as many times as the length of the Julian date sequence). I would then use do.call(rbind, ...) to compose my data.frame.
growth_list <- list() # initialize empty list
p <- 1 # initialize increment count
# Looks at every other fish ID beginning at 1 (all even-number observations are the same fish at a later observation)
for (i in seq(1, length(df$FISH_ID), by = 2)){
rate <- (df$growth[i+1]-df$growth[i])/(as.double(df$date[i+1])-as.double(df$date[i]))
growth_list[[p]] <- list(c(seq(as.numeric(df$date[i]),as.numeric(df$date[i+1]))), rep(rate, length(seq(from = as.numeric(df$date[i]), to = as.numeric(df$date[i+1])))))
p <- p+1 # increase to change index of list item in next iteration
}
# Converts list of vectors (the rows which fulfill above criteria) into a data.frame
growth_df <- do.call(rbind, growth_list)
My expected results can be illustrated here: https://imgur.com/YXKLkpK
My actual results are illustrated here: https://imgur.com/Zg4vuVd
As you can see, the actual results appear to be a data.frame with two columns specifying the type of the object, as well as the length of the original list item. That is, row 1 of this dataset contained 169 days between observations, and therefore contained 169 julian dates and 169 repetitions of the growth rate.
Instead of list(), use data.frame() with named columns to build a list of data frames that are row-bound at the end:
growth_list <- vector(mode="list", length=length(df$FISH_ID)/2)
p <- 1 # index of the next list element to fill
for (i in seq(1, length(df$FISH_ID), by=2)){
  rate <- with(df, (growth[i+1]-growth[i])/(as.double(date[i+1])-as.double(date[i])))
  date_seq <- seq(as.numeric(df$date[i]), as.numeric(df$date[i+1]))
  growth_list[[p]] <- data.frame(Julian_Date = date_seq,
                                 Growth_Rate = rep(rate, length(date_seq)))
  p <- p + 1
}
growth_df <- do.call(rbind, growth_list)
Welcome to Stack Overflow.
A couple of things about your code:
I recommend using the apply family of functions instead of the for loop. You can set parameters in apply to perform row-wise functions. It usually makes your code shorter, though not necessarily faster. lapply also returns a list for you, which reduces the code you write to create the list and populate it.
It is common to supply users with a small example of your initial data to work with. Sometimes the way we describe our data is not representative of our actual data, and a sample helps avoid that kind of miscommunication. If you can, please create a dummy dataset for us to use.
Have you tried using as.data.frame(growth_list), or data.frame(growth_list)?
Another option is to use an if/else statement within your for loop that performs the rbind. This would look something like this (here i stands for your input data frame, and date, length and rate are placeholders for the values you calculate):
# make a row-wise for loop
for(x in 1:nrow(i)){
  # insert your desired calculations here. You can turn the row into its own
  # data frame like this, which may make the calculations easier:
  dataCurrent <- data.frame(i[x,])
  # finish with something like this to turn your calculations for each row
  # into an output data frame of your choice.
  outFish <- cbind(date, length, rate)
  # build your final data frame: create it on the first pass, append afterwards
  if(!exists("finalFishOut")){
    finalFishOut <- outFish
  }else{
    finalFishOut <- rbind(finalFishOut, outFish)
  }
}
Please update with a snippet of data and I'll update this answer with your exact solution.
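To make the lapply suggestion concrete, here is a minimal sketch on made-up data. The column names FISH_ID, date and growth follow the question, but the values are invented, and the code assumes each fish occupies two consecutive rows (first and second observation) with date already numeric.

```r
# Hypothetical toy data: two observations per fish, in consecutive rows
df <- data.frame(
  FISH_ID = c("A", "A", "B", "B"),
  date    = c(1, 101, 10, 20),      # Julian dates (assumed numeric)
  growth  = c(100, 200, 50, 70)     # weights at each observation
)

# Row index of the first observation of each fish
starts <- seq(1, nrow(df), by = 2)

# lapply returns one data frame per fish, so no counter or empty list is needed
growth_list <- lapply(starts, function(i) {
  date_seq <- seq(df$date[i], df$date[i + 1])
  rate <- (df$growth[i + 1] - df$growth[i]) / (df$date[i + 1] - df$date[i])
  data.frame(Julian_Date = date_seq, Growth_Rate = rate)  # rate is recycled
})

growth_df <- do.call(rbind, growth_list)
nrow(growth_df)
# [1] 112
```

Fish A contributes 101 rows (days 1 to 101) at 1 g/day and fish B contributes 11 rows (days 10 to 20) at 2 g/day.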
Here is a solution using dplyr and plyr with some toy data. There are 20 fish, with a random start and end time, plus random weights at each time. Find the growth rate over time, then create a new df for each fish with 1 row per day elapsed and the daily average growth rate, and output a new df containing all fish.
library(plyr)  # load plyr before dplyr so that dplyr's mutate/arrange are not masked
library(dplyr)
df <- data.frame(fish=rep(1:20,2),weight=sample(50:100,40,TRUE),
                 time=sample(1:100,40,TRUE))
df1 <- df %>% group_by(fish) %>% arrange(time) %>%
  mutate(diff.weight=weight-lag(weight),
         diff.time=time-lag(time)) %>%
  mutate(rate=diff.weight/diff.time) %>%
  filter(!is.na(rate)) %>%
  ddply(.,.(fish),function(x){
    data.frame(time=seq(1:x$diff.time),rate=x$rate)
  })
head(df1)
fish time rate
1 1 1 -0.7105263
2 1 2 -0.7105263
3 1 3 -0.7105263
4 1 4 -0.7105263
5 1 5 -0.7105263
6 1 6 -0.7105263
tail(df1)
fish time rate
696 20 47 -0.2307692
697 20 48 -0.2307692
698 20 49 -0.2307692
699 20 50 -0.2307692
700 20 51 -0.2307692
701 20 52 -0.2307692

Apply a function that requires seq() in R

I am trying to run a summation on each row of a data frame. Let's say I want to take the sum of 100n^2, from n=1 to n=4.
> df <- data.frame(n = seq(1:4),a = rep(100))
> df
n a
1 1 100
2 2 100
3 3 100
4 4 100
Simpler example:
Let's make fun our example summation function. I can pull 100 out because I can just multiply it in later.
fun <- function(x) {
  i <- seq(1,x,1)
  sum(i^2)
}
I want to then apply this function to each row to the dataframe, where df$n provides the upper bound of the summation.
The desired outcome would be as follows, in df$b:
> df
n a b
1 1 100 1
2 2 100 5
3 3 100 14
4 4 100 30
To achieve these results I've tried the apply function
apply(df$n,1,fun)
and also with df converted into a matrix
mat <- as.matrix(df)
apply(mat[1,],1,fun)
Both return an error:
Error in seq.default(1, x, 1) : 'to' must be of length 1
I understand this error, in that I understand why seq requires a 'to' value of length 1. I don't know how to go forward.
I have also tried the same while reading the dataframe as a matrix.
Maybe less simple example:
In my case I only need to multiply the results above, df$b, by 100 (or df$a) to get my final answer for each row. In other cases, though, the second value might be more entrenched, for example a^i. How would I call on both variables, a and n?
Underlying question:
My underlying goal is to apply a summation to each row of a dataframe (or a matrix). The above questions stem from my attempt to do so using seq(), as I saw advised in an answer on this site. I will gladly accept an answer that obviates the above questions with a different way to run a summation.
seq does not accept a vector for its from and to arguments, so the function has to be called once per value of n. We can loop over df$n with sapply:
df$b <- sapply(df$n, fun)
df$b
#[1] 1 5 14 30
Or we can Vectorize
Vectorize(fun)(df$n)
#[1] 1 5 14 30
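For the follow-up case where both a and n are needed inside the summation, one possible approach is mapply, which walks over several vectors in parallel. The sketch below is an illustration only; weighted_sum is a name made up here.

```r
df <- data.frame(n = 1:4, a = 100)

# Hypothetical two-argument version: sum of a * i^2 for i = 1..n
weighted_sum <- function(n, a) sum(a * seq_len(n)^2)

# mapply calls weighted_sum(df$n[k], df$a[k]) for each row k
df$b <- mapply(weighted_sum, df$n, df$a)
df$b
# [1]  100  500 1400 3000
```

The same pattern should carry over to a term such as a^i by changing the body to sum(a^seq_len(n)).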

R: Comparing values in vector to column in data frame

Apologies if this has been asked before, but I've searched for a while and can't find anything to answer my question. I'm somewhat comfortable using R but never really learned the fundamentals. Here's what I'm trying to do.
I've got a vector (call it "responseTimes") that looks something like this:
150 50 250 200 100 150 250
(It's actually much longer, but I'm truncating it here.)
I've also got a data frame where one column, timeBin, is essentially counting up by 50 from 0 (so 0 50 100 150 200 250 etc).
What I'm trying to do is to count how many values in responseTimes are less than or equal to each row in the data frame. I want to store these counts in a new column of my data frame. My output should look something like this:
timeBin counts
0 0
50 1
100 2
150 4
200 5
250 7
I know I can use the sum function to compare vector elements to some constant (e.g., sum(responseTimes>100) would give me 5 for the data I've shown here) but I don't know how to do this to compare to a changing value (that is, to compare to each row in the timeBin column).
I'd prefer not to use a loop, as I'm told those can be particularly slow in R and I have quite a large data set that I'm working with. Any suggestions would be much appreciated! Thanks in advance.
You can use sapply this way:
> timeBin <- seq(0, 250, by=50)
> responseTimes <- c(150, 50, 250, 200, 100, 150, 250 )
>
> # using sapply (after all `sapply` is a loop)
> ans <- sapply(timeBin, function(x) sum(responseTimes<=x))
> data.frame(timeBin, counts=ans) # your desired output.
timeBin counts
1 0 0
2 50 1
3 100 2
4 150 4
5 200 5
6 250 7
This might help:
responseTimes <- c(150, 50, 250, 200, 100, 150, 250)
bins1 <- seq(0, 250, by = 50)
sahil1 <- function(input = responseTimes, binsx = bins1) {
  tablem <- table(cut(input, binsx)) # count of input across bins
  tablem <- cumsum(tablem) # cumulative sums
  return(as.data.frame(tablem)) # table to data frame
}
sahil1() # call with the defaults; rows are labelled by interval, e.g. (0,50]
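A loop-free alternative that may scale better to long vectors is findInterval: applied to the sorted response times, it returns, for each bin value, how many response times are less than or equal to it. This is a sketch using the example data from the question.

```r
responseTimes <- c(150, 50, 250, 200, 100, 150, 250)
timeBin <- seq(0, 250, by = 50)

# For each timeBin value, the number of sorted responseTimes that are <= it
counts <- findInterval(timeBin, sort(responseTimes))
data.frame(timeBin, counts)
```

The counts column comes out as 0, 1, 2, 4, 5, 7, matching the desired output in the question.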

Sample with constraint, vectorized

What is the most efficient way to sample a data frame under a certain constraint?
For example, say I have a directory of Names and Salaries, how do I select 3 such that their sum does not exceed some value. I'm just using a while loop but that seems pretty inefficient.
You could face a combinatorial explosion. The code below simulates a set of 20 employees (EE's) with salaries drawn from a normal distribution with mean 60 and sd 20, then enumerates every combination of 3. Of the 1140 combinations, only 263 have a salary sum below 150.
> set.seed(123)
> salry <- data.frame(EEnams = sapply(1:20 ,
function(x){paste(sample(letters[1:20], 6) ,
collapse="")}), sals = rnorm(20, 60, 20))
> head(salry)
EEnams sals
1 fohpqa 67.59279
2 kqjhpg 49.95353
3 nkbpda 53.33585
4 gsqlko 39.62849
5 ntjkec 38.56418
6 trmnah 66.07057
> sum( apply( combn(1:NROW(salry), 3) , 2, function(x) sum(salry[x, "sals"]) < 150))
[1] 263
If you had 1000 EE's then you would have:
> choose(1000, 3) # Combination possibilities
# [1] 166,167,000 Commas added to output
One approach would be to start with the full data frame and sample one case. Create a data frame which consists of all the cases which have a salary less than your constraint minus the selected salary. Select a second case from this and repeat the process of creating a remaining set of cases to choose from. Stop if you get to the number you need (3), or if at any point there are no cases in the data frame to choose from (reject what you have so far and restart the sampling procedure).
Note that different approaches will create different probability distributions for a case being included; generally it won't be uniform.
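The sequential procedure described above can be sketched as follows. The function name sample_with_budget, its arguments and the max_tries cap are all assumptions made for illustration; the code also assumes non-negative salaries and treats the constraint as "sum does not exceed budget".

```r
# Hypothetical helper: returns the row indices of k cases whose salaries
# sum to at most `budget`, chosen one at a time.
sample_with_budget <- function(salaries, k = 3, budget, max_tries = 1000) {
  for (attempt in seq_len(max_tries)) {
    chosen <- integer(0)
    remaining <- budget
    for (step in seq_len(k)) {
      # Cases not yet chosen that still fit within the remaining budget
      candidates <- setdiff(which(salaries <= remaining), chosen)
      if (length(candidates) == 0) break  # dead end: reject and restart
      # Index into candidates to avoid sample()'s behaviour on length-1 input
      pick <- candidates[sample.int(length(candidates), 1)]
      chosen <- c(chosen, pick)
      remaining <- remaining - salaries[pick]
    }
    if (length(chosen) == k) return(chosen)
  }
  stop("no valid selection found within max_tries")
}

set.seed(1)
sals <- c(40, 55, 60, 70, 90, 30)
idx <- sample_with_budget(sals, k = 3, budget = 150)
sum(sals[idx]) <= 150
# [1] TRUE
```

As noted above, this does not sample uniformly over all valid groups of three; cases with low salaries are more likely to survive the later filtering steps.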
How big is your dataset? If it is small (and small really depends on your hardware), you could just list all groups of three, calculate the sum, and sample from that.
## create example salaries
N <- 100
salary <- rnorm(N)
## list all possible groups of 3 from this
x <- combn(salary, 3)
## the sum
sx <- colSums(x)
sxc <- sx[sx<1]
## sampling with replacement
sample(sxc, 10, replace=TRUE)
