Adding to a column based on values from another variable in R

I'm very new to R and currently working through data from my lab. I have a data frame with a good number of variables; two of these are Sample and Time. Each sample records a maximum of 10 minutes of observations, then the clock restarts at 0 for the next sample. That is, sample 1 correctly displays timestamps from 0 to 10 minutes, but once the observations pass 10 minutes, the Time column resets to 0 and the Sample column displays 2. Therefore, each time value in sample 2 should be the time displayed plus 10, each time value in sample 3 should be the time displayed plus 20, and so on. What would be the best way to go about this? Sorry if I don't have the jargon down; I just started learning R.

Without knowing for sure what the column that starts with 9.314... contains, I cannot give an exact answer.
Could you try something like this:
df$Time <- df$Time + (df$Sample - 1) * 10
My idea is to take the Time column and add
(1 - 1) * 10 = 0 for Sample 1
(2 - 1) * 10 = 10 for Sample 2
etc
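A quick check on toy data (the values here are invented for illustration):
df <- data.frame(Sample = c(1, 1, 2, 2, 3), Time = c(0, 5, 0, 5, 0))
df$Time <- df$Time + (df$Sample - 1) * 10
df$Time
# [1]  0  5 10 15 20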

Related

Create a new column with intervals in R

I am looking for a quick way to create a new column (interval) in a data frame based on the values of the time column.
My dummy data:
time <- c(7.1,8.2,9.3,10.4,11.5,12.6,50.9)
df <- data.frame(time)
df
My desired output:
   time interval
1   7.1       10
2   8.2       10
3   9.3       10
4  10.4       20
5  11.5       20
6  12.6       20
7  50.9       60
Based on the values in the time column, I would like to determine the interval. In my example, whatever falls between 0.0 and 10.0 (including 10) gets interval 10. The intervals are grouped by 10, as you see, so whatever falls between 10.0 and 20.0 will be assigned to interval 20, and so on and so forth.
Any hints on how to get the new column with intervals would be highly appreciated. Thanks
(Moved my comments to an answer.) You can use integer division for this: divide the time column by 10 (the interval size you are looking for), add 1, and multiply by 10 to get the result.
df["interval"] <- (df$time %/% 10 + 1) * 10

Count total values after CountDistinct

I created a table in which I want to see all the resources that were used on one day, for different missions. It's possible that a resource executed more than one mission per day; that's why I used an expression with CountDistinct to show only the number of unique resources used in one day across all the missions.
Now, as a next step, I want to see what the average number of unique resources is for a selected time period.
Unfortunately I am not able to use a Count or Sum expression on the CountDistinct expression.
If I execute a Sum function it gives me the total number of unique values across the whole time period, but I want the sum of the resources used per day.
For example, I have 3 resources. On day 1 I use resource A for 5 missions; on day 2 I use resources A and B for 6 missions. So that makes 11 missions in 2 days, and 3 resources counted per day (A + A + B).
So I want to sum the values 82+92+100+90+91+92. How do I get the sum of these values?
Any suggestions on how to fix this, please?
Many thanks!
Found the solution: I created 2 extra datasets to pull the unique values per day.
Then I added a Lookup function in one of the two tablixes to compare the values on the same dates (the dates appear in both datasets), which gives the unique values per day. Afterwards I summed those values and divided by the number of days to get the average unique values per day.
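The report expressions themselves aren't shown here, but the intended aggregation is easy to illustrate in R on invented data: count the distinct resources per day, then sum and average those per-day counts.
missions <- data.frame(
  day      = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
  resource = c("A", "A", "A", "A", "A", "A", "A", "A", "B", "B", "B")
)
# distinct resources per day
per_day <- tapply(missions$resource, missions$day, function(r) length(unique(r)))
sum(per_day)   # 3   (A on day 1; A and B on day 2)
mean(per_day)  # 1.5 average unique resources per day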

Moving average with dynamic window

I'm trying to add a new column to my data table that contains the average of some of the following rows. How many rows are selected for the average, however, depends on the time stamps of the rows.
Here is some test data:
library(data.table)
DT <- data.table(
  Weekstart = c(1, 2, 2, 3, 3, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9),
  Art       = c("a","b","a","b","a","a","a","b","b","a","b","a","b","a","b","a"),
  Demand    = 1:16
)
I want to add a column with the mean of all demands that occurred in the weeks ("Weekstart") up to three weeks before the respective week (grouped by Art, excluding the current week).
With rollapply from the zoo library, it works like this:
library(zoo)
setorder(DT, -Weekstart)
DT[, RollMean := rollapply(Demand, width = list(1:3), partial = TRUE, FUN = mean, align = "left", fill = NA), by = .(Art)]
The problem, however, is that some data is missing. In the example, the data for Art b lacks week no. 4; there is no demand in week 4. Since I want the average of the three prior weeks, not the three prior rows, the average is wrong. Instead, the result for Art b in week 6 should look like this:
DT[Art=="b"&Weekstart==6,RollMean:=6]
(6 instead of 14/3, because only week 5 and week 3 count: (8 + 4) / 2 = 6)
Here is what I tried so far:
It would be possible to loop through the minima of the weeks of the following rows in order to create a vector that defines, for each row, how wide 'width' should be (the new column 'rollwidth'):
i <- 3
DT[, rollwidth := Weekstart - rollapply(Weekstart, width = list(1:3), partial = TRUE, FUN = min, align = "left", fill = 1), by = .(Art)]
while (max(DT[, Weekstart - rollapply(Weekstart, width = list(1:i), partial = TRUE, FUN = min, align = "left", fill = NA), by = .(Art)][, V1], na.rm = TRUE) > 3) {
  i <- i - 1
  DT[rollwidth > 3, rollwidth := i]
}
But that seems very unprofessional (excuse my poor skills). And, unfortunately, rollapply with width = list(1:rollwidth) doesn't work as intended (it produces warnings, as 'rollwidth' is interpreted as all the rollwidths in the table):
DT[, RollMean2 := rollapply(Demand, width = list(1:rollwidth), partial = TRUE, FUN = mean, align = "left", fill = NA), by = .(Art)]
What does work is
DT[, RollMean3 := rollapply(Demand, width = rollwidth, partial = TRUE, FUN = mean, align = "left", fill = NA), by = .(Art)]
but then the average includes the current week again (not what I want).
Does anybody know how to apply a criterion (i.e., the difference in weeks must be <= 3) instead of a number of rows to the width argument?
Any suggestions are appreciated!
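One way to do this, sketched under the assumption that each Art has at most one row per week: filter by the week criterion directly inside each group instead of going through width.
# for each row, average the demands of the same Art whose Weekstart lies
# in the three weeks before the current one (current week excluded);
# rows with no prior weeks in range yield NaN
DT[, RollMean4 := sapply(Weekstart, function(w)
  mean(Demand[Weekstart >= w - 3 & Weekstart < w])), by = Art]
For Art b in week 6, this averages weeks 5 and 3 only, giving (8 + 4) / 2 = 6 as desired.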

Independent binary variable (frequency) and continuous response variable - lmm

I've spent a lot of time searching for a solution, but without success. For that reason I decided to post my problem here, hoping somebody can help me.
I want to find out which variables are influencing the travel distance of two animals (same species).
The response variable is distance moved (in meters). In total I have 66 tracking sessions for both animals.
The independent variables are: temperature, rainfall, offspring (yes = 1, no = 0), observation period (in minutes) and activity.
I observed the animals (one day, one animal) every 15 minutes and noted the state of activity (active = 1 or inactive = 0). For that reason my data table consists of around 1800 rows and the same number of activity records.
Then I created a table with following columns:
Animal, Tracking-Session, rainfall, offspring, observation period, active, inactive, distance
The two columns active and inactive contain the number of active (respectively inactive) records per tracking session.
For example, in tracking session 1, animal A was active 30 times and inactive 11 times, and moved 6000 meters during that session.
I thought I could do my analysis with this table, using cbind() to make one activity column out of the two columns "active" and "inactive". But this does not work; I get:
Error in lme4::lFormula(formula = distance~ (1 | animal) + activity + offspring + ...
rank of X = 12 < ncol(X) = 13
I want to include the second animal as a random factor to get an output valid for the whole "population" (which only consists of two animals in this case).
How can I fit a linear mixed model to these data? Or rather, the first question is: what does my data table have to look like for such an analysis?
I started by running a linear mixed model on my original data table of 1800 rows, but the outcome was not convincing, and I don't know whether that table was built up correctly for this task. I have only 60 tracking sessions and therefore only 60 resulting travel distances, but 1800 records of activity (every 15 minutes: active or inactive). The only way I saw to handle this was to copy the travel distance (which is the result of all points observed in a day) and assign it to every single point of that tracking session.
The same goes for rainfall and temperature: because these conditions were only measured once a day, I had to copy the value to every point taken on the same day.
Is this correct, or can R handle such tables better (like the one in the picture)? Or is it better to create a table with one row per day (as I described above)?
If the second table (the one with one row per tracking session) is the better choice, how does it have to be transformed so that R can use it?
Hopefully you can follow my explanations (I tried to be as detailed as possible), and I hope someone can help me!
Thanks in advance!
Iris
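Not an authoritative answer, but one way to sidestep the rank-deficiency error with the session-level table: the counts active and inactive together are likely collinear with the observation period (they sum to roughly period / 15), so encode activity as a single proportion per session instead of two count columns. A minimal sketch, assuming one row per tracking session and illustrative column names:
library(lme4)
# proportion of records in which the animal was active during the session
dat$prop_active <- dat$active / (dat$active + dat$inactive)
m <- lmer(distance ~ prop_active + temperature + rainfall + offspring +
            observation_period + (1 | animal), data = dat)
summary(m)
Note that with only two animals, the random intercept for animal is estimated from very little information; treating animal as a fixed factor may be more defensible.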

R data frame - row number increases

Assume we have a data frame (the third column is a date column) that contains observations of irregular events from Jan 2000 to Oct 2011. The goal is to choose those rows of the data frame that contain observations between two dates:
start<-"2005/09/30"
end<-"2011/01/31"
The original data frame contains about 21 000 rows. We can verify this using
length(df_original$date_column)
We now create a new data frame that contains dates newer than the start date:
df_new<-df_original[df_original$date_column>start,]
If I check the length using length(df_new$date_column), it shows about 13 000.
Now we create another data frame applying the second criterion (smaller than end date):
df_new2<-df_new[df_new$date_column<end,]
If I check the length again using length(df_new2$date_column), it shows about 19 000.
How is it possible that applying a second criterion to the new data frame df_new increases the number of rows? df_new2 should have at most the 13 000 rows of df_new.
The data frame is quite large, so I cannot post it here. Maybe someone can explain under which circumstances this behavior occurs.
The following example works fine for me:
df_original = data.frame(date_column = seq(as.Date('2000/01/01'), Sys.Date(), by=1), value = 1)
start = as.Date('2005/09/30')
end = as.Date('2011/01/31')
df_new = df_original[df_original$date_column>start,]
df_new2 = df_new[df_new$date_column<end,]
> dim(df_original)
[1] 4316 2
> dim(df_new)
[1] 2216 2
> dim(df_new2)
[1] 1948 2
Without seeing an example of your actual data, I would suggest 2 things to look out for:
Make sure your dates are coded as dates.
Make sure you aren't accidentally indexing by row name. This is a common culprit for the behavior you're talking about.
Can you get the results you want via one subset command?
df_new <- df_original[with(df_original, date_column>start & date_column<end),]
# or
df_new <- subset(df_original, date_column>start & date_column<end)
Can you give us dput(head(df_original))? That shows us the first six rows and their data structure. I am suspicious that something is up with the format of your date_column.
If you are storing start and end as strings (which your example seems to indicate) and the date column is not actually of class Date (for example, a factor), then < and > will not compare them as dates. So somewhere you need to validate that everything being compared is known by R to be dates.
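One concrete mechanism that can produce the growing row count, offered as a guess: if date_column is a factor, comparing it to a string with > yields NA, and each NA in a logical subscript contributes an all-NA row to the result.
df <- data.frame(date_column = factor(c("2004/01/01", "2006/05/05", "2010/12/31")))
idx <- df$date_column > "2005/09/30"   # warning: '>' not meaningful for factors
idx                                    # NA NA NA
nrow(df[idx, , drop = FALSE])          # 3 all-NA rows, not 0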
