Saving function output to a specific place in a dataframe - r

I'm working on a function that predicts using a gbm model, one row at a time. I then want to save the predicted value in a specific place in my DF so that the next value can be predicted with that output included: basically, a prediction with a lagged dependent variable. Below is a snippet of my data:
   DEC AAA_CCC BBBB LLLLL DDD_SHR ST_DSC WKG.P WKG.P.1T _CHNG XXXX_pr XXXX_pr_r XXXX_vol XXXX_.T.1. XXXX_wd XXXX_ICP_A XXXX_ICP_A_.T.1.
38   0    0.99    0     0    0.51   8.28     0        0  6.04    2.84      2.84    10.49       9.83    4.41          0                1
39   0    0.99    0     0    0.51   8.27     0        0  5.97    2.75      2.75    10.33      10.49    4.33          1                0
40   0    1.04    0     0    0.51   8.27     0        0  6.01    2.81      2.81    10.58      10.33    4.36          0                1
41   0    0.98    0     0    0.51   8.28     0        0  5.99    2.87      2.87     9.49      10.58    4.32          1                0
42   0    0.98    0     1    0.52   8.27     0        0  6.10    2.81      2.81    10.35       9.49    4.19          0                1
43   0    0.95    0     1    0.51   8.27     0        0  6.01    2.72      2.71    10.67      10.35    4.25          1                0
The function needs to: find columns with specific names within a DF; check whether they contain 0s; if so, predict a value based on the row holding the 0; then save the predicted value both in that 0's place and in the matching 0 of the other column; and keep repeating until there are no more 0s in the 'vol' column.
I've come up with this:
PREDICTION <- function(a, model) {
  vol <- select(a, ends_with("vol"))
  vol_1 <- select(a, ends_with("vol_.T.1."))
  while (min(which(a[, colnames(vol)] == 0)) != 0) {
    PRED <- predict(model, a[min(which(a[, colnames(vol)] == 0)), ])
    a[[min(which(a[, colnames(vol_1)] == 0)), colnames(vol_1)]] <<- print(PRED)
    a[[min(which(a[, colnames(vol)] == 0)), colnames(vol)]] <<- print(PRED)
  }
}
It prints the right values but doesn't save them the way I wanted, so the while condition never changes and the loop runs over the same row forever. I've tried replacing the print with return, which didn't change anything. I don't really know where to go from here, so I appreciate any help.

I have found a solution. It's definitely not elegant and I will work on improving it, but I'm posting it in case anyone is interested.
I simplified the loop within my function: for instead of while. The rows I'm working on are always in the same range, so this also makes the code visually easier to follow. I will keep messing with this, so a while loop might make a comeback to make the code more universal.
PREDICTION <- function(a, model) {
  vol <- select(a, ends_with("vol"))
  vol_1 <- select(a, ends_with("vol_.T.1."))
  for (i in 54:72) {
    a[[i, colnames(vol)]] <- print(predict(model, a[i, ]))
    a[[i + 1, colnames(vol_1)]] <- print(predict(model, a[i, ]))
  }
  return(a)
}
Then I use mapply to map this function over my list, which contains the data frames and the models to use.
LISTA$DF <- mapply(PREDICTION, a=LISTA$DF, model = LISTA$GBM)
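Depending on what LISTA$DF should hold afterwards, note that mapply simplifies its result by default, which can mangle a list of data frames; keeping the list structure explicit avoids that:
LISTA$DF <- mapply(PREDICTION, a = LISTA$DF, model = LISTA$GBM, SIMPLIFY = FALSE)
# or, equivalently:
LISTA$DF <- Map(PREDICTION, LISTA$DF, LISTA$GBM)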
Essentially this was mostly a syntax problem on my end. Just goes to show how much more I have to learn to be able to code more efficiently and functionally.
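For reference, a sketch of a more universal while-based version that stops once no zeros remain (it assumes exactly one column matches each suffix, and that dplyr is loaded for select):
PREDICTION <- function(a, model) {
  vol_col   <- names(select(a, ends_with("vol")))
  vol_1_col <- names(select(a, ends_with("vol_.T.1.")))
  # keep predicting while the vol column still contains zeros
  while (any(a[[vol_col]] == 0)) {
    i <- which(a[[vol_col]] == 0)[1]  # first remaining zero
    pred <- predict(model, a[i, ])
    a[i, vol_col] <- pred             # fill the current row...
    if (i < nrow(a)) {
      a[i + 1, vol_1_col] <- pred     # ...and lag it into the next row
    }
  }
  a
}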

Related

Is there a way to resolve this cardinality_threshold error?

I tried to use ggpairs to visualise my dataset, but I don't understand the error message I'm getting. Can someone please help me?
> describe(Mydata)
             vars      n     mean       sd   median  trimmed      mad min       max     range  skew kurtosis     se
Time            1 192008  4257.07  2589.28  4156.44  4210.33  3507.03   0   8869.91   8869.91  0.09    -1.28   5.91
Source*         2 192008     9.32     5.95     8.00     8.53     2.97   1     51.00     50.00  3.39    15.94   0.01
Destination*    3 192008     8.22     6.49     7.00     7.31     2.97   1     51.00     50.00  3.07    13.21   0.01
Protocol*       4 192008    16.14     4.29    19.00    16.77     0.00   1     20.00     19.00 -1.26     0.66   0.01
Length          5 192008   166.12   464.07    74.00    96.25    11.86  60  21786.00  21726.00 14.40   349.17   1.06
Info*           6 192008 63731.70 46463.90 60732.50 62899.62 69904.59   1 131625.00 131624.00  0.14    -1.47 106.04
> Mydata[,1][Mydata[,1] ==0]<-NA
> ggpairs(Mydata)
Error in stop_if_high_cardinality(data, columns, cardinality_threshold) :
Column 'Source' has more levels (51) than the threshold (15) allowed.
Please remove the column or increase the 'cardinality_threshold' parameter. Increasing the
cardinality_threshold may produce long processing times
As the error suggests, the way to get rid of it is to set cardinality_threshold = NULL or cardinality_threshold = 51, since Source and Destination are both factor variables with 51 levels.
However, it will likely be hard to see any detail in the plots, if they render at all, because one panel of the plot would be attempting to fit 51 barplots of 51 columns each. You may want to consider whether grouping your factor levels makes sense for the analysis you're interested in, or exclude the factors (although that leaves only two continuous variables).
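For example (a sketch, using the column names shown by describe() above):
library(GGally)

# Option 1: lift the cap (may be slow, and 51-level panels are hard to read)
ggpairs(Mydata, cardinality_threshold = NULL)

# Option 2: drop the high-cardinality factors and keep the continuous columns
ggpairs(Mydata[, c("Time", "Length")])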

Ramp up/down missing time-series data in R

I have a set of time-series data (GPS speed data, specifically) which includes gaps of missing values where the signal was lost. Missing periods of short duration I can fill simply using na.spline, but this is inappropriate for longer periods. For those, I would like to ramp the values from the last true value down to zero, based on predefined acceleration limits.
#create sample data frame
test <- as.data.frame(c(6,5.7,5.4,5.14,4.89,4.64,4.41,4.19,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,5,5.1,5.3,5.4,5.5))
names(test)[1] <- "speed"
#set rate of acceleration for ramp
ramp <- 6
#set sampling rate of receiver
Hz <- 1/10
So for missing data, the ramp would use the previous value and the rate of acceleration to get the next data point, until speed reaches zero (i.e. last speed [4.19] - (Hz * ramp)), yielding the following values:
3.59
2.99
2.39
1.79
1.19
0.59
0
Lastly, I need to do this in the reverse fashion, to ramp up from zero when the signal picks back up again.
Hope this is clear.
Cheers
It's not really elegant, but you can do it in a loop.
na.pos <- which(is.na(test$speed))
acc <- FALSE
for (i in na.pos) {
  if (acc) {
    # past the bottom of the ramp: accelerate back up
    speed <- test$speed[i - 1] + (Hz * ramp)
  } else {
    # still ramping down
    speed <- test$speed[i - 1] - (Hz * ramp)
    if (round(speed, 1) < 0) {
      acc <- TRUE
      speed <- test$speed[i - 1] + (Hz * ramp)
    }
  }
  test[i, ] <- speed
}
The result is:
speed
1 6.00
2 5.70
3 5.40
4 5.14
5 4.89
6 4.64
7 4.41
8 4.19
9 3.59
10 2.99
11 2.39
12 1.79
13 1.19
14 0.59
15 -0.01
16 0.59
17 1.19
18 1.79
19 2.39
20 2.99
21 3.59
22 4.19
23 4.79
24 5.00
25 5.10
26 5.30
27 5.40
28 5.50
Note the '-0.01': 0.59 - (ramp * Hz) = 0.59 - 0.6 = -0.01, not 0. You can round it later; I decided not to.
When the question says to "ramp the values from the last true value down to zero" in each run of NAs, I assume that means any remaining NAs in the run after reaching zero are also to be replaced by zero.
Now, use rleid from data.table to create a grouping vector the same length as test$speed identifying each run of is.na(test$speed), and use ave to create sequence numbers within those groups, seqno. Then calculate the declining sequences, ramp_down, by combining na.locf(test$speed) and seqno. Finally, replace the NAs.
library(data.table)
library(zoo)
test_speed <- test$speed
seqno <- ave(test_speed, rleid(is.na(test_speed)), FUN = seq_along)
ramp_down <- pmax(na.locf(test_speed) - seqno * ramp * Hz, 0)
result <- ifelse(is.na(test_speed), ramp_down, test_speed)
giving:
> result
[1] 6.00 5.70 5.40 5.14 4.89 4.64 4.41 4.19 3.59 2.99 2.39 1.79 1.19 0.59 0.00
[16] 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 5.00 5.10 5.30 5.40 5.50
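The same idea extends to ramping back up: count sequence positions from the end of each NA run against the next observed value, then keep the larger of the two ramps. A sketch (the symmetric treatment of the up-ramp is an assumption, since the question doesn't pin it down):
library(data.table)
library(zoo)

test_speed <- test$speed
grp <- rleid(is.na(test_speed))
seqno <- ave(test_speed, grp, FUN = seq_along)
# positions counted from the end of each run, for the up-ramp
seqno_rev <- ave(test_speed, grp, FUN = function(x) rev(seq_along(x)))

ramp_down <- pmax(na.locf(test_speed, na.rm = FALSE) - seqno * ramp * Hz, 0)
ramp_up   <- pmax(na.locf(test_speed, fromLast = TRUE, na.rm = FALSE) -
                    seqno_rev * ramp * Hz, 0)

result <- ifelse(is.na(test_speed), pmax(ramp_down, ramp_up), test_speed)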

Unable to apply ddply-summarise in R correctly

new here and new to R, so bear with me, please.
I have a data.frame similar to this:
time. variable TEER
1 0.07 cntrl 234.2795
2 1.07 cntrl 602.8245
3 2.07 cntrl 703.6844
4 3.07 cntrl 699.4538
...
48 0.07 cntrl 234.2795
49 1.07 cntrl 602.8245
50 2.07 cntrl 703.6844
51 3.07 cntrl 699.4538
...
471 0.07 agr1111 251.9119
472 1.07 agr1111 480.1573
473 2.07 agr1111 629.3744
474 3.07 agr1111 676.6782
...
518 0.07 agr1111 251.9119
519 1.07 agr1111 480.1573
520 2.07 agr1111 629.3744
521 3.07 agr1111 676.6782
...
753 0.07 agr2222 350.1049
754 1.07 agr2222 306.6072
755 2.07 agr2222 346.0387
756 3.07 agr2222 447.0137
757 4.07 agr2222 530.2433
...
802 2.07 agr2222 346.0387
803 3.07 agr2222 447.0137
804 4.07 agr2222 530.2433
805 5.07 agr2222 591.2122
I'm trying to apply ddply() to this data frame to get a new data frame with means and standard error (to plot later) like so:
> ddply(data_melt, c("time.", "variable"), summarise,
mean = mean(TEER), sd = sd(TEER),
sem = sd(TEER)/sqrt(length(TEER)))
What I get in the output data frame is the same TEER values in the mean column as in the first rows of the original data frame, and zeroes in the sd and sem columns. I also get a warning:
Warning message:
In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else
paste0(labels, : duplicated levels in factors are deprecated
It looks like the function only goes through the first part of the data frame and doesn't look at the later duplicates of each time./variable group?
I already tried looking at the solutions to similar problems here but nothing seems to work. Am I missing something or is this a legitimate problem?
Any help / tips appreciated.
P.S. Let me know if I'm not explaining the problem coherently enough and I'll try to go into more detail.
I think I've found a way around my problem.
Initially, when I load the data frame, each of the variables ("cntrl", "agr1111", "agr2222") has a unique letter-number code attached ("A1", "A2", "B1", "B2"), hence looking like this: "cntrl.A1", "agr1111.B2". Instead of stripping the letter-number code from each of them using gsub, I tried using filter with grepl to isolate the rows I need and summarise those.
Here's the code:
library(dplyr)
dt_11 <- dt %>%
  group_by(time.) %>%
  filter(grepl("agr1111", variable)) %>%
  summarise(avg_11 = mean(teer),
            sd_11 = sd(teer),
            sem_11 = sd(teer) / sqrt(length(teer)))
This only gives me a data frame for one group of variables ("agr1111"), and I'll have to do it two more times, for "cntrl" and "agr2222", resulting in 3 data frames. But I'm sure I'll be able to either merge the data frames or plot them on the same graph separately.
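A single grouped summary would produce all three groups at once and avoid the three separate data frames. A sketch, assuming the same dt and teer column as above (the gsub pattern for stripping the letter-number suffix is an assumption about the naming):
library(dplyr)

dt_all <- dt %>%
  mutate(variable = gsub("\\..*$", "", variable)) %>%  # "cntrl.A1" -> "cntrl"
  group_by(time., variable) %>%
  summarise(avg = mean(teer),
            sd  = sd(teer),
            sem = sd(teer) / sqrt(n()))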
This doesn't quite fit as an answer, but it's too long for a comment:
I ran your exact code and everything works fine!
> ddply(dt, c("time.", "variable"), summarise,
+ mean = mean(TEER), sd = sd(TEER),
+ sem = sd(TEER)/sqrt(length(TEER)), count = length(TEER))
#time. variable mean sd sem count
# 0.07 agr1111 251.9119 0 0 2
# 0.07 agr2222 350.1049 NA NA 1
# 0.07 cntrl 234.2795 0 0 2
# 1.07 agr1111 480.1573 0 0 2
# 1.07 agr2222 306.6072 NA NA 1
# 1.07 cntrl 602.8245 0 0 2
# 2.07 agr1111 629.3744 0 0 2
# 2.07 agr2222 346.0387 0 0 2
# 2.07 cntrl 703.6844 0 0 2
# 3.07 agr1111 676.6782 0 0 2
# 3.07 agr2222 447.0137 0 0 2
# 3.07 cntrl 699.4538 0 0 2
# 4.07 agr2222 530.2433 0 0 2
# 5.07 agr2222 591.2122 NA NA 1
> sessionInfo()
#other attached packages:
#[1] plyr_1.8.4
Could you update to the latest versions of the packages? I am not sure of the cause of your problem. I hope you understand how sd is actually calculated and why the `NA`s appear. (Hint: look at the count column.)
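To spell out the hint (a minimal illustration using values from the table above):
sd(350.1049)              # NA: that group has only one observation
sd(c(234.2795, 234.2795)) # 0: duplicated rows have zero spread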

Analysis of several behaviours of 2 individuals

I am analysing several animal behaviours during a defined time period.
I watch videos of the animals and record when each behaviour is displayed. They will display each behaviour several times during the recording (these correspond to the different events). Sometimes 2 or 3 behaviours can be displayed at the same time, but they don't usually start/finish exactly at the same time (so they overlap partly).
I end up with a series of events for each behaviour, and for each event I have their onset, duration and end point (see example hereafter).
I need to extract from this data the total time during which behaviour 1 overlaps with behaviour 2, behaviour 1 overlaps with behaviour 3, and behaviour 2 overlaps with behaviour 3. This is so that I can find correlations between behaviours: which ones tend to be displayed at the same time, which ones do not, ...
I am only a beginner with programming (mostly R) and I find it hard to get started. Can you please advise me how to proceed? Many thanks!
Example with a series of events for 3 behaviours:
Event tracked Onset Duration End
Behaviour 1 _event 1 7.40 548.88 556.28
Behaviour 1 _event 2 36.20 0.47 36.67
Behaviour 1 _event 3 48.45 0.25 48.70
Behaviour 1 _event 4 68.92 1.53 70.45
Behaviour 1 _event 5 75.48 0.22 75.70
Behaviour 1 _event 6 89.75 0.66 90.41
Behaviour 1 _event 7 94.62 0.16 94.78
Behaviour 1 _event 8 101.78 0.22 102.00
Behaviour 1 _event 9 108.86 0.59 109.45
Behaviour 1 _event 10 146.35 0.66 147.00
Behaviour 1 _event 11 150.20 0.75 150.95
Behaviour 1 _event 12 152.98 0.66 153.64
Behaviour 1 _event 13 157.84 0.56 158.41
Behaviour 2_event 1 7.52 0.38 7.90
Behaviour 2_event 2 18.73 0.16 18.88
Behaviour 2_event 3 19.95 2.25 22.20
Behaviour 2_event 4 26.41 0.25 26.66
Behaviour 2_event 5 35.91 0.16 36.07
Behaviour 2_event 6 37.29 0.34 37.63
Behaviour 2_event 7 38.13 0.72 38.85
Behaviour 2_event 8 40.19 0.31 40.51
Behaviour 2_event 9 44.26 0.16 44.41
Behaviour 2_event 10 45.32 0.16 45.48
Behaviour 2_event 11 54.84 1.44 56.27
Behaviour 2_event 12 56.65 1.19 57.84
Behaviour 2_event 13 61.59 1.03 62.62
Behaviour 2_event 14 81.13 3.83 84.96
Behaviour 2_event 15 86.65 0.31 86.96
Behaviour 2_event 16 90.15 0.19 90.34
Behaviour 2_event 17 96.97 0.53 97.50
Behaviour 2_event 18 107.12 0.22 107.34
Behaviour 2_event 19 118.53 0.41 118.94
Behaviour 2_event 20 127.76 0.25 128.01
Behaviour 2_event 21 129.45 0.69 130.13
Behaviour 2_event 22 130.60 2.31 132.91
Behaviour 2_event 23 141.01 0.41 141.41
Behaviour 2_event 24 152.85 0.37 153.23
Behaviour 2_event 25 156.54 0.13 156.66
Behaviour 3_event 1 7.71 1.94 9.65
Behaviour 3_event 2 11.12 1.53 12.65
Behaviour 3_event 3 19.01 0.19 19.20
Behaviour 3_event 4 20.01 3.97 23.98
Behaviour 3_event 5 24.95 4.22 29.16
Behaviour 3_event 6 29.70 2.19 31.88
Behaviour 3_event 7 33.23 2.50 35.73
Behaviour 3_event 8 36.82 0.44 37.26
Behaviour 3_event 9 38.20 1.16 39.35
Behaviour 3_event 10 39.91 2.13 42.04
Behaviour 3_event 11 42.49 3.62 46.11
Behaviour 3_event 12 47.09 0.53 47.62
Behaviour 3_event 13 48.15 0.34 48.49
Behaviour 3_event 14 49.40 2.13 51.52
Behaviour 3_event 15 57.57 2.25 59.82
Behaviour 3_event 16 60.89 0.88 61.76
Behaviour 3_event 17 66.85 6.78 73.63
Behaviour 3_event 18 75.65 3.03 78.68
In order to do the kind of study you want to do, it might be easiest to convert the data to a time series with variables on states (i.e. whether behavior 1, 2, 3, etc. is being displayed). So you want to transform the dataset you have into one that looks like
time animal behav_1 behav_2 behav_3
0 1 FALSE TRUE FALSE
0 2 TRUE FALSE FALSE
1 1 FALSE TRUE FALSE
1 2 TRUE FALSE TRUE
... ... ... ... ...
Each row tells whether a particular animal is displaying each of the three behaviors at the given time. (I am assuming here that you have multiple animals and you want to keep their behavior data separate.)
Then you could easily approximate many of the quantities you are interested in. For example, you could compute the probability that an animal is doing behavior 1 given it is doing behavior 2 by
Computing a column data$behav_1_and_2 <- data$behav_1 & data$behav_2
Dividing the sum of the col behav_1_and_2 by the sum of behav_2: sum(data$behav_1_and_2) / sum(data$behav_2)
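Concretely, putting those two steps together (a sketch, assuming the data frame described above):
# P(behavior 1 | behavior 2): fraction of the behav_2 time points
# at which behav_1 is also TRUE
data$behav_1_and_2 <- data$behav_1 & data$behav_2
p_1_given_2 <- sum(data$behav_1_and_2) / sum(data$behav_2)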
Okay, but how do you transform the data? First, decide how many time points you want to check. Maybe you should increment by about 0.1.
num_animals <- 10  # how many animals you have
time_seq <- seq(from = 0, to = 600, by = 0.1)  # 'to' should be the end of the video
data <- expand.grid(time = time_seq, animal = seq_len(num_animals))
That gets you the first two columns of the data frame you want. Then you need to compute the three behavior columns. Define a function that takes the time, animal, and name of the behavior column, and returns TRUE if the animal is doing that behavior at the time, or FALSE if not.
has_behavior <- function(time, animal, behavior) {
...
}
(I'm going to let you figure out how to make that function.) With that function in hand, you can then create the last three columns with a loop:
# First create empty columns
data$behav_1 <- logical(nrow(data))
data$behav_2 <- logical(nrow(data))
data$behav_3 <- logical(nrow(data))
# Now loop through rows
for (i in 1:nrow(data)) {
  data$behav_1[i] <- has_behavior(data$time[i], data$animal[i], 1)
  data$behav_2[i] <- has_behavior(data$time[i], data$animal[i], 2)
  data$behav_3[i] <- has_behavior(data$time[i], data$animal[i], 3)
}
With data in this format, you should be able to study the problem much more easily. You can compute those summary quantities easily, as I outlined earlier. This data frame is also set up to be useful for time series modeling. And it's tidy, making it easy to use with packages like dplyr for data summarising and ggplot2 for visualization. (You can learn more about those last two tools in the free online book R for Data Science by Hadley Wickham and Garrett Grolemund.)
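For reference, one possible way to fill in has_behavior(), assuming the events have been read into a data frame called events with columns behaviour (1, 2, or 3), animal, Onset, and End (all of these names are hypothetical; they are not fixed by the question):
has_behavior <- function(time, animal, behavior) {
  # events of this behaviour for this animal
  ev <- events[events$behaviour == behavior & events$animal == animal, ]
  # TRUE if any such event spans the given time point
  any(ev$Onset <= time & time <= ev$End)
}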

Reading non-uniform data into R

I'm struggling with reading non-uniform data into R.
I've achieved the following:
Used "readLines" to read the text file data in
Used "grep" to find the block of data that I want
Used the index from grep to create a variable (named "block") that contains only that block of data
All good so far: I now have the data I want. But it's a character variable with only one column, containing all the data.
This creates a sample of the variable I have made called "block" (first 3 rows):
line1 = c(" 114.24 -0.39 0.06 13.85 -0.06 1402.11 -1.48 0.0003 0.0000 35.468 1.02 -0.02 0.00 0 1 1 1 0 49.87 4 -290 0 0 -0.002 -0.010 0.155 999.00 11482.66 999.00 11482.66 16:52:24:119 255 13.89 50.00 0.00 -5.49 0.00")
line2 = c(" 114.28 -0.39 0.08 13.84 -0.06 1402.57 -1.48 0.0004 0.0000 35.479 1.29 -0.02 0.00 0 1 1 1 0 49.82 4 -272 0 0 -0.002 -0.011 0.124 999.00 11482.66 999.00 11482.66 16:52:24:150 255 13.89 50.00 0.00 -5.49 0.00")
line3 = c(" 114.31 -0.39 0.09 13.83 -0.06 1403.03 -1.47 0.0005 0.0000 35.492 1.42 -0.02 0.00 0 1 1 1 0 49.78 4 -263 0 0 -0.002 -0.011 0.046 999.00 11482.66 999.00 11482.66 16:52:24:197 255 13.89 50.00 0.00 -5.49 0.00")
block = c(line1,line2,line3)
My goal is to have this data as a data.frame with separate columns for each data point.
My attempts at using strsplit haven't helped (does the solution involve strsplit?)- what is the best approach here? Any suggestions/feedback welcome.
strsplit(block,"\s",fixed=F)
Either of the following should work for you:
## Creates a "data.table"
library(splitstackshape)
cSplit(data.table(x = block), "x", " ")
## Creates a "data.frame"
read.table(text = block, header = FALSE)
## Creates a character matrix
do.call(rbind, strsplit(block, "\\s+"))
## Like the above, but likely to be faster
library(stringi)
stri_split_regex(block, "\\s+", simplify = TRUE)
Note the "\\s+" for the last two options. The "+" is to match multiple spaces.
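For instance, with the base R read.table option, a quick check of what comes back (a sketch using the three-line block above):
df <- read.table(text = block, header = FALSE)
dim(df)  # 3 rows, one column per whitespace-separated field (V1, V2, ...)
str(df)  # numeric fields are parsed as numeric; the timestamp field
         # (e.g. "16:52:24:119") stays character (or factor, on older R)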
Actually, this looks like it might work:
Import raw data into R
But I wanted to check whether this is the best approach to this situation...?
