Composing a data.frame from loop-generated sequences - r

I have a data set which is made up of observations of the weights of fish, the julian dates they were captured on, and their names. I am seeking to assess what the average growth rate of these fish is according to the day of the year (julian date). I believe the best method to do this is to compose a data.frame with two fields: "Julian Date" and "Growth". The idea is this: for a fish which is observed on January 1 (1) at weight 100 and a fish observed again on April 10 (101) at weight 200, the growth rate would be 100g/100days, or 1g/day. I would represent this in a data.frame as 100 rows in which the "Julian Date" column is composed of the Julian date sequence (1:100) and the "Growth" column is composed of the average growth rate (1g/day) over all days.
I have attempted to compose a for loop which passes through each fish, calculates the average growth rate, then creates a list in which each index contains the sequence of Julian dates and the growth rate (repeated the number of times equal to the length of the Julian date sequence). I would then use do.call(rbind, ...) to compose my data.frame.
growth_list <- list() # initialize empty list
p <- 1 # initialize increment count
# Looks at every other fish ID beginning at 1 (all even-number observations are the same fish at a later observation)
for (i in seq(1, length(df$FISH_ID), by = 2)){
rate <- (df$growth[i+1]-df$growth[i])/(as.double(df$date[i+1])-as.double(df$date[i]))
growth_list[[p]] <- list(c(seq(as.numeric(df$date[i]),as.numeric(df$date[i+1]))), rep(rate, length(seq(from = as.numeric(df$date[i]), to = as.numeric(df$date[i+1])))))
p <- p+1 # increase to change index of list item in next iteration
}
# Converts list of vectors (the rows which fulfill above criteria) into a data.frame
growth_df <- do.call(rbind, growth_list)
My expected results can be illustrated here: https://imgur.com/YXKLkpK
My actual results are illustrated here: https://imgur.com/Zg4vuVd
As you can see, the actual results appear to be a data.frame with two columns specifying the type of the object, as well as the length of the original list item. That is, row 1 of this dataset contained 169 days between observations, and therefore contained 169 julian dates and 169 repetitions of the growth rate.

Instead of list(), use data.frame() with named columns to build a list of data frames that are row-bound at the end:
growth_list <- vector(mode="list", length=length(df$FISH_ID)/2)
p <- 1 # index of the next list element to fill
for (i in seq(1, length(df$FISH_ID), by=2)){
  rate <- with(df, (growth[i+1]-growth[i])/(as.double(date[i+1])-as.double(date[i])))
  date_seq <- seq(as.numeric(df$date[i]), as.numeric(df$date[i+1]))
  growth_list[[p]] <- data.frame(Julian_Date = date_seq,
                                 Growth_Rate = rep(rate, length(date_seq)))
  p <- p + 1
}
growth_df <- do.call(rbind, growth_list)

Welcome to Stack Overflow.
A couple of things about your code:
I recommend using the apply family of functions instead of the for loop. You can set parameters in apply to perform row-wise functions. It often makes your code shorter and can make it run faster. The apply family of functions also creates a list for you, which reduces the code you write to make the list and populate it.
It is common to supply users with a snippet example of your initial data to work with. Sometimes the way we describe our data is not representative of our actual data. This tradition is necessary to alleviate any communication errors. If you can, please manufacture a dummy dataset for us to use.
Have you tried using as.data.frame(growth_list), or data.frame(growth_list)?
Another option is to use an if else statement within your for loop that performs the rbind function. This would look something like this:
#make a row-wise for loop
for(x in 1:nrow(i)){
#insert your desired calculations here. You can turn the rows into their own dataframe by using this, which may make it easier to perform your calculations:
dataCurrent <- data.frame(i[x,])
#finish with something like this to turn your calculations for each row into an output dataframe of your choice.
outFish <- cbind(date, length, rate)
#make your final dataframe as follows
if(exists("finalFishOut") == FALSE){
finalFishOut <- outFish
}else{
finalFishOut <- rbind(finalFishOut, outFish)
}
}
Please update with a snippet of data and I'll update this answer with your exact solution.
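The apply-family suggestion above can be sketched with lapply(). This is only an illustration under assumptions: the toy df below is made up (two fish, two observations each, rows paired as the question describes), and the column names FISH_ID, date and growth are taken from the question's code.

```r
# Toy data (hypothetical values): rows 1-2 are fish A, rows 3-4 are fish B
df <- data.frame(
  FISH_ID = c("A", "A", "B", "B"),
  date    = c(1, 101, 10, 20),
  growth  = c(100, 200, 50, 70)
)

# Row index of the first observation in each pair
starts <- seq(1, nrow(df), by = 2)

# lapply() returns a list with one small data frame per fish
growth_list <- lapply(starts, function(i) {
  rate <- (df$growth[i + 1] - df$growth[i]) / (df$date[i + 1] - df$date[i])
  # the single rate value is recycled across every day in the sequence
  data.frame(Julian_Date = seq(df$date[i], df$date[i + 1]), Growth_Rate = rate)
})

# Stack the per-fish data frames into one
growth_df <- do.call(rbind, growth_list)
head(growth_df)
```

No counter variable or pre-allocated list is needed here, because lapply() builds the list itself.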

Here is a solution using dplyr and plyr with some toy data. There are 20 fish, with a random start and end time, plus random weights at each time. Find the growth rate over time, then create a new df for each fish with 1 row per day elapsed and the daily average growth rate, and output a new df containing all fish.
library(plyr)  # load plyr before dplyr so dplyr's verbs are not masked
library(dplyr)
df <- data.frame(fish=rep(seq(1:20),2), weight=sample(c(50:100),40,TRUE),
                 time=sample(c(1:100),40,TRUE))
df1 <- df %>% group_by(fish) %>% arrange(time) %>%
  mutate(diff.weight=weight-lag(weight),
         diff.time=time-lag(time)) %>%
  mutate(rate=diff.weight/diff.time) %>%
  filter(!is.na(rate)) %>%
  ddply(.,.(fish),function(x){
    data.frame(time=seq(1:x$diff.time),rate=x$rate)
  })
head(df1)
fish time rate
1 1 1 -0.7105263
2 1 2 -0.7105263
3 1 3 -0.7105263
4 1 4 -0.7105263
5 1 5 -0.7105263
6 1 6 -0.7105263
tail(df1)
fish time rate
696 20 47 -0.2307692
697 20 48 -0.2307692
698 20 49 -0.2307692
699 20 50 -0.2307692
700 20 51 -0.2307692
701 20 52 -0.2307692

Related

R: locate element previous in vector within for loop and report in new column

I've looked through many older posts but nothing is really hitting the answer I need. In short: I have a data frame that contains observation data and the time of observation in days.
My goal is to add a column for weeks. I have already subsetted the data so that I only have the time vector at intervals of 7 (t == 7, 14, 21, etc). I just need to make a for loop that creates a new vector of "weeks" that I can then cbind to my data. I'd prefer it to be a character string so I can use it more easily in ggplot geom_histogram, but that isn't as necessary as just creating the new vector successfully.
The tricky part of the data is that there is not an equal number of observations per time: t = 28 has maybe 5x as many observations as t = 7, etc.
I want to create code that evaluates what t is, then checks to see if it is greater than the last element in the t vector. If it isn't, then populate the week vector with the last value it did, and if so, then increase it by 1.
I know this is bad from a like, computer science/R perspective in a lot of ways, but any help would be useful:
#fake data (in reality this is a huge data set with many observations at intervals of 1 for t
L = rnorm(50, mean=10, sd=2)
t = c((rep.int(7,3)), (rep.int(14,6)), rep.int(21,8), rep.int(28,12), (rep.int(31, 5)), (rep.int(36,16)))
fake = cbind(L,t)
#create df that has only the observations that are at weekly time points
dayofweek = seq(7,120,7)
df = subset(fake, t %in% dayofweek)
#create empty week vector
week = c()
#for loop with if-else statement nested to populate the week vector
for (i in 1:length(dayofweek)){
if (t = t[t-1]){
week = i
} else if (t > t[t-1]{
week = i+1
}
}
Thanks!!
I'm not sure I can follow what you want to do. If you want to determine which week the data fall within, why not:
set.seed(1)
L = rnorm(50, mean=10, sd=2)
...
fake <- data.frame(L=L, t=t)
fake$week <- floor(fake$t/7) + 1 # drop the "+ 1" so that t==7 becomes week==1
head(fake)
# L t week
# 1 8.747092 7 2
# 2 10.367287 7 2
# 3 8.328743 7 2
# 4 13.190562 14 3
# 5 10.659016 14 3
# 6 8.359063 14 3
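If a character label is preferred, as the question mentions for plotting, the same idea can be sketched without a loop. This assumes the t vector from the question; the name weekly and the "week N" label format are illustrative choices, and this variant numbers t == 7 as week 1.

```r
set.seed(1)
# Days of observation as defined in the question, including non-weekly days
t <- c(rep(7, 3), rep(14, 6), rep(21, 8), rep(28, 12), rep(31, 5), rep(36, 16))
fake <- data.frame(L = rnorm(length(t), mean = 10, sd = 2), t = t)

# Keep only observations that fall exactly on a weekly time point
weekly <- fake[fake$t %% 7 == 0, ]

# Integer division gives the week number; paste() turns it into a label
weekly$week <- paste("week", weekly$t %/% 7)

# Number of observations per week label
table(weekly$week)
```

Because the week is computed directly from t, the unequal number of observations per time point does not matter.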

All data in one column

I have this soccer data all in one column.
Round 36 # Round of the league------------------------------------
29.07. 20:45 # Date and time of the match
Barcelona # Home Team
4 - 1 # FT result
Getafe # Away team
(2 - 0) # HT result
29.07. 20:45 # *date of the second match of the round*
Valencia
2 - 3
Laci
(1 - 2)
Round 35 # repeating pattern -------------------------------------------------
How can I move all the data in a certain round of the league into a new column? E.g. I want all observations from the Round 36 observation to the Round 35 observation in a single column, and so on.
I really do not have any idea how to solve this. I tried to transpose the data so that I could work better with observations as variables but still nothing. I am just a beginner in R and would appreciate any help.
thanks
Assuming your data is within a variable named lines (e.g., lines[1] = Round 36 is the first entry, lines[2] = 29.07. 20:45 is the next entry and so forth), we can spot the round lines, split the vector into a list and then finally bind it into a data.frame (assuming the pieces have equal length; if not, you will have to do some manual work):
#Figure out where each round is.
rounds <- grepl('^Round', lines)
# Split it into a list with one element per round. cumsum(rounds) will be an index for each group.
data <- split(lines, cumsum(rounds))
# Bind the data into a data.frame (assuming all have the same amount of data)
bound <- do.call(rbind, data)
Of course without a reproducible example it is hard to test the final result.
Note that if the soccer data does not have equal amount of data between rounds or if the data does not come in the same order, the resulting data.frame may not make immediate sense (if round 45 has 7 elements but round 46 has 4, round 46 will recycle element 1, 2 and 3 to fill out the missing values), but it might make it simpler to do some follow up data cleaning.
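As a rough check of the approach above, here is a sketch on a small made-up lines vector with one match per round. The Round 35 date and time are invented for illustration. Note that do.call(rbind, ...) on character vectors gives a matrix, so the last step converts it and transposes so that each round ends up in its own column, as the question asks.

```r
# Toy input (hypothetical): one match per round, six entries each
lines <- c("Round 36", "29.07. 20:45", "Barcelona", "4 - 1", "Getafe", "(2 - 0)",
           "Round 35", "22.07. 18:00", "Valencia", "2 - 3", "Laci", "(1 - 2)")

# TRUE where a new round starts
rounds <- grepl("^Round", lines)

# cumsum(rounds) increases by one at every round header, so it labels the groups
data <- split(lines, cumsum(rounds))

# One row per round (character matrix)
bound <- do.call(rbind, data)

# One column per round, as a data frame
by_round <- as.data.frame(t(bound), stringsAsFactors = FALSE)
by_round
```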

R code to iteratively and randomly delete entire rows from a data frame based on a column value, and saving as a new data frame each time

Please forgive me if this question has been asked before!
So I have a dataframe (df) of individuals sampled from various populations with each individual given a population name and a corresponding number assigned to that population as follows:
Individual Population Popnum
ALM16-014 AimeesMdw 1
ALM16-024 AimeesMdw 1
ALM16-026 AimeesMdw 1
ALM16-003 AMKRanch 2
ALM16-022 AMKRanch 2
ALM16-075 BearPawLake 3
ALM16-076 BearPawLake 3
ALM16-089 BearPawLake 3
There are a total of 12 named populations (they do not all have the same number of individuals) with Popnum 1-12 in this file. What I need to do is randomly delete one or more populations (preferably using the 'Popnum' column) from the dataframe and repeating this 100 times and then saving each result as a separate dataframe (ie. df1, df2, df3, etc). The end result is 100 dfs with each one having one population removed randomly. The next step is to repeat this 100 times removing two random populations, then 3 random populations, and so on.
Any help would be greatly appreciated!!
You can write a function which takes a dataframe and n, the number of Popnum values to remove, as input.
remove_n_Popnum <- function(data, n) {
subset(data, !Popnum %in% sample(unique(Popnum), n))
}
To remove one Popnum you can do:
remove_n_Popnum(df, 1)
# Individual Population Popnum
#1 ALM16-014 AimeesMdw 1
#2 ALM16-024 AimeesMdw 1
#3 ALM16-026 AimeesMdw 1
#4 ALM16-003 AMKRanch 2
#5 ALM16-022 AMKRanch 2
To do this 100 times you can use replicate:
list_data <- replicate(100, remove_n_Popnum(df, 1), simplify = FALSE)
To pass different n in remove_n_Popnum function you can use lapply
nested_list_data <- lapply(seq_along(unique(df$Popnum)[-1]),
function(x) replicate(100, remove_n_Popnum(df, x), simplify = FALSE))
where seq_along generates a sequence which is 1 less than the number of unique values.
seq_along(unique(df$Popnum)[-1])
#[1] 1 2
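The question asks for results named df1, df2, and so on. One way to get that without creating 100 separate objects is to name the list elements. The sketch below rebuilds the eight sample rows from the question; the set.seed() call and the "df" name prefix are assumptions made only for illustration.

```r
set.seed(42)  # only so the sketch is repeatable

# The sample rows shown in the question
df <- data.frame(
  Individual = c("ALM16-014", "ALM16-024", "ALM16-026", "ALM16-003",
                 "ALM16-022", "ALM16-075", "ALM16-076", "ALM16-089"),
  Population = c("AimeesMdw", "AimeesMdw", "AimeesMdw", "AMKRanch",
                 "AMKRanch", "BearPawLake", "BearPawLake", "BearPawLake"),
  Popnum     = c(1, 1, 1, 2, 2, 3, 3, 3)
)

# Drop n randomly chosen populations
remove_n_Popnum <- function(data, n) {
  subset(data, !Popnum %in% sample(unique(Popnum), n))
}

# 100 data frames, each with one random population removed
list_data <- replicate(100, remove_n_Popnum(df, 1), simplify = FALSE)

# Name the elements df1 ... df100 so they can be looked up by name
names(list_data) <- paste0("df", seq_along(list_data))

# Populations remaining in the first three results
sapply(list_data[1:3], function(d) length(unique(d$Popnum)))
```

A single result is then available as list_data$df1 or list_data[["df1"]].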

Why the for loop is not using the 'i' specified in the function

I have a data frame with 25 weeks of observations per animal and 20 animals in total. I am trying to write a function that calculates a linear equation between 2 points each time and do that for the 25 weeks and the 20 animals.
I want to use a general form of the equation so I can calculate values at any point. In the function, Week=t, Weight=d.
I can't figure out how to make this work. I don't think the loop is working using each row of the data frame as the index for the function. My data frame named growth looks something like this:
Week Weight Animal
1 50 1
2 60 1
n=25
1 80 2
2 90 2
.
.
20
for (i in growth$Week){
eq<- function(t){
d = growth$BW.Kg
t = growth$Week
(d[i+1]-d[i])/(t[i+1]-t[i])*(t-t[i])+d[i]
return(eq)
}
}
eq(3)
OK, so I think there are a few points of confusion here. The first is writing a function inside a for loop. What is happening is that you are re-writing the function over and over, and also your function doesn't save the values of your equation anywhere. Secondly, you are passing t as your argument but then expecting t to follow the for loop with the i value. Finally, you say that you want this to be done for each animal, but the animal value is not shown in your code.
So it's a little bit hard to see what you are trying to achieve here.
Based on your information above, I've rewritten your function into something that will provide a result for your equation.
library(tidyverse)
growth <- tibble(week = 1:5,
animal = 1,
weight = c(50,52,55,54,57))
eq <- function(d,t,i){
z <- (d[i+1]-d[i])/(t[i+1]-t[i])*(t-t[i])+d[i]
return(z)
}
test_result <- eq(growth$weight,growth$week,3)
Results:
[1] 57 56 55 54 53
Is that the kind of result you were expecting? Or did you want just a single result per week per animal? Could you provide a working example of a formula that would produce a single desired result (i.e. a result for animal 1 on week 1)?
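If the goal is a function that returns the weight at any time point, lying on the straight line between the two neighbouring observations, base R's approxfun() does this directly. This is a different technique from the eq() function above, offered only as a sketch: the second animal's weights are made up, and eq_by_animal is a hypothetical name.

```r
# Toy data: animal 1 uses the weights above, animal 2 is invented
growth <- data.frame(
  week   = rep(1:5, times = 2),
  animal = rep(1:2, each = 5),
  weight = c(50, 52, 55, 54, 57, 80, 90, 95, 99, 104)
)

# One linear-interpolation function per animal, stored in a named list
eq_by_animal <- lapply(split(growth, growth$animal), function(g) {
  approxfun(g$week, g$weight)
})

# Weight of animal 1 halfway between weeks 3 and 4
eq_by_animal[["1"]](3.5)

# Weights of animal 2 at week 1 and halfway to week 2
eq_by_animal[["2"]](c(1, 1.5))
```

By default approxfun() returns NA outside the observed range of weeks, so it interpolates but does not extrapolate.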

How to select specific elements and find their index in a data.frame?

I would like to select specific elements of a data.list after processing it.
To explain the processing parameters, I describe my problem in the reproducible example below.
In the example code below, I have three sets of data in data.list, each with 5 columns.
Each set repeats itself three times, and each set is assigned a unique number called set_nbr that identifies it.
#to create reproducible data (this part creates three sets of data each one repeats 3 times of those of Mx, My and Mz values along with set_nbr)
set.seed(1)
data.list <- lapply(1:3, function(x) {
nrep <- 3
time <- rep(seq(90,54000,length.out=600),times=nrep)
Mx <- c(replicate(nrep,sort(runif(600,-0.014,0.012),decreasing=TRUE)))
My <- c(replicate(nrep,sort(runif(600,-0.02,0.02),decreasing=TRUE)))
Mz <- c(replicate(nrep,sort(runif(600,-1,1),decreasing=TRUE)))
df <- data.frame(time,Mx,My,Mz,set_nbr=x)
})
after applying some function I have output like this.
result
time Mz set_nbr
1 27810 -1.917835e-03 1
2 28980 -1.344288e-03 1
3 28350 -3.426615e-05 1
4 27900 -9.934413e-04 1
5 25560 -1.016492e-02 2
6 27360 -4.790767e-03 2
7 28080 -7.062256e-04 2
8 26550 -1.171716e-04 2
9 26820 -2.495893e-03 3
10 26550 -7.397865e-03 3
11 26550 -2.574022e-03 3
12 27990 -1.575412e-02 3
My questions start here.
1) How do I get the min, middle and max values of the time column for each set_nbr?
2) How do I use the evaluated set_nbr and Mz values inside data.list?
In short:
After determining the min, middle and max values of the time column and the corresponding Mz values for each set_nbr in result, I want to go back to the original data.list and extract the Mx, My and Mz columns that match those set_nbr and Mz values. Since each set_nbr actually corresponds to 600 rows, I would like to extract the whole family of rows for each selected set_nbr from data.list.
We use time as a factor to select set_nbr. Here "factor" means an extraction parameter, not an R factor object.
In addition, as you will see, four set_nbr entries exist for each dataset, but they actually refer to different datasets in data.list.
I'm a big advocate of using lists of data frames when appropriate, but in this case it doesn't look like there's any reason to keep them separated as different list items. Let's combine them into a single data frame.
library(dplyr)
dat = bind_rows(data.list)
Then getting your summary stats is easy:
dat %>% group_by(set_nbr) %>%
summarize(min_time = min(time),
max_time = max(time),
middle_time = median(time))
# Source: local data frame [3 x 4]
#
# set_nbr min_time max_time middle_time
# 1 1 90 54000 27045
# 2 2 90 54000 27045
# 3 3 90 54000 27045
In your sample data, time is defined the same way each time, so of course the min, median, and max are all the same.
I'd suggest, in the new question you ask about plotting, starting with the combined data frame dat.
As to your second question:
2) How to select evaluated set_nbr values inside of data.list?
To select a single item from a list, use double brackets:
data.list[[2]]
However, with the combined data, it's just a normal column of a normal data frame so any of these will work:
dat[dat$set_nbr == 2, ]
subset(dat, set_nbr == 2)
filter(dat, set_nbr == 2)
To your clarification in comments, if you want the Mx and My values for the time and set_nbr in the result object, using my combined dat above, simply do a join: left_join(result, dat).
This should work, but I'm a little confused because in your simulated data time is numeric, but in your new text you say "we use time as a factor". If you've converted time to a factor object, this will only work if it has the same levels in each of the data frames in your data list. If not, I would recommend keeping time as numeric.
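The join can be sketched on a tiny example. To keep the sketch free of package dependencies it uses base R's merge() with all.x = TRUE, which plays the role of left_join() here. The values in dat and result below are made up, and joining only on time and set_nbr is an assumption that avoids relying on exact floating-point matches in Mz.

```r
# Toy stand-in (hypothetical values) for the combined data frame
dat <- data.frame(
  time    = rep(c(90, 180, 270), times = 2),
  Mx      = c(0.010, 0.005, -0.002, 0.011, 0.004, -0.003),
  My      = c(0.020, 0.000, -0.010, 0.015, 0.001, -0.012),
  Mz      = c(0.9, 0.1, -0.8, 0.7, 0.0, -0.9),
  set_nbr = rep(1:2, each = 3)
)

# Toy stand-in for the processed result: the rows we want to look up
result <- data.frame(time = c(180, 270), set_nbr = c(1L, 2L))

# Keep every row of result and pull in the matching Mx, My and Mz from dat
joined <- merge(result, dat, by = c("time", "set_nbr"), all.x = TRUE)
joined
```

In the question's simulated data each time value occurs three times per set_nbr, so a join on these two keys would return three matching rows per row of result.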
