Using lag function gives an atomic vector with all zeroes - r

I have been trying to use the lag function in base R to calculate rainfall accumulations for a 6-hour period. I have hourly rainfall; I first calculate cumulative rainfall with the cumsum function and then use the lag function to calculate 6-hour accumulations, as below.
Event_Data <- dbGetQuery(con, "select feature_id, TO_CHAR(datetime, 'MM/DD/YYYY HH24:MI') as DATE_TIME, value_ms as RAINFALL_IN from Rain_HOURLY")
Event_Data$cume <- cumsum(Event_Data$RAINFALL_IN)
Event_Data$six_hr <- Event_Data$cume - lag(Event_Data$cume, 6)
But the lag function gives me all zeroes, and the structure of the data frame looks like this:
'data.frame': 169 obs. of 5 variables:
$ feature_id : num 80 80 80 80 80 ...
$ DATE_TIME : chr "09/10/2017 00:00" "09/10/2017 01:00" "09/10/2017 02:00" "09/10/2017 03:00" ...
$ RAINFALL_IN: num 0.251 0.09 0.017 0.071 0.016 0.01 0.136 0.651 0.185 0.072 ...
$ cume : num 0.251 0.341 0.358 0.429 0.445 ...
$ six_hr : atomic 0 0 0 0 0 0 0 0 0 0 ...
..- attr(*, "tsp")= num -23 145 1
This code has worked fine in several of my other projects, but I have no clue why I am getting zeroes here. Any help is greatly appreciated.
Thanks.

There might be a conflict with a lag function from another package; that would explain why this code worked in other scripts but not in this one.
Try stats::lag instead of just lag to make explicit which package's function you want to use (or dplyr::lag, which seems to work better for me, at least).
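For what it's worth, here is a minimal sketch of the dplyr approach, assuming Event_Data from the question is already sorted by DATE_TIME. Unlike stats::lag, dplyr::lag shifts the vector's values rather than a time attribute, so the subtraction behaves as intended:
library(dplyr)
Event_Data$cume <- cumsum(Event_Data$RAINFALL_IN)
# Difference between the running total now and the running total 6 rows back;
# the first 6 entries are NA because there is no value 6 hours earlier
Event_Data$six_hr <- Event_Data$cume - dplyr::lag(Event_Data$cume, 6)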

I think you have a misconception about what lag() from the stats package does. It returns zeros because you are taking the full cumulative-rainfall vector and then subtracting it from itself. Check this small example for an illustration:
x <- 1:20
y <- lag(x, 3); y
#[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
#attr(,"tsp")
#[1] -2 17 1
x-y #x is a vector
# [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#attr(,"tsp")
#[1] -2 17 1
As you can see, lag() simply keeps the vector values and just adds a time-series attribute holding "starting time, ending time, frequency". Because you put in a plain vector, it used the default values "1, length(Event_Data$cume), 1" and subtracted the lag from the starting and ending times, which is 3 in the example and apparently 24 in your code output (which doesn't match the code input above it, by the way).
The problem is that your vector doesn't have any time attribute assigned, so R doesn't know which values of your data and the lagged data correspond to each other. It therefore subtracts the vector values element-wise and attaches the time attribute of the lagged variable. To fix this, assign times to Event_Data$cume by converting it to a time-series object, e.g. try
Event_Data$six_hr <- c(rep(NA, 6), as.numeric(ts(Event_Data$cume) - lag(ts(Event_Data$cume), -6)))
(the negative lag compares each value with the one six hours earlier, and the NA padding keeps the column the same length as the data frame).
It works fine for the small example above:
x <- ts(1:20)
y <- lag(x,3)
x-y #x is a ts
#Time Series:
#Start = 1
#End = 17
#Frequency = 1
# [1] -3 -3 -3 -3 -3 -3 -3 -3 -3 -3 -3 -3 -3 -3 -3 -3 -3
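If you prefer to stay in base R without time-series objects, the same 6-hour totals can be computed directly with diff() on the cumulative sum. A sketch, assuming Event_Data holds one row per hour in time order:
# diff(x, lag = 6) returns x[7]-x[1], x[8]-x[2], ..., i.e. the rolling 6-hr sums;
# pad the first 6 positions with NA so the length matches the data frame
Event_Data$six_hr <- c(rep(NA, 6), diff(Event_Data$cume, lag = 6))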

Related

How to divide my dataset for using permanova

Hello everyone :) I have a data set with individuals belonging to 5 different species in one column, and their presence/absence in different landscapes (7 other columns).
'data.frame': 1212 obs. of 10 variables:
$ latitude : num 34.5 34.7 34.7 34.8 34.8 ...
$ longitude : num 127 127 127 127 127 ...
$ species : Factor w/ 5 levels "Bufo gargarizans",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Built : int 0 0 0 0 0 0 0 0 0 0 ...
$ Agriculture: int 1 1 0 1 0 0 1 0 0 0 ...
$ Forested : int 0 0 1 0 0 0 0 1 1 1 ...
$ Grassland : int 0 0 0 0 0 0 0 0 0 0 ...
$ Wetland : int 0 0 0 0 0 0 0 0 0 0 ...
$ Bare : int 0 0 0 0 1 0 0 0 0 0 ...
$ Water : int 1 0 0 0 0 1 0 0 0 0 ...
I am trying to use PERMANOVA and then a Tukey test to see whether the species use the landscape differently. My supervisor did it in SPSS and it worked very well, so now I have to do it in R.
I read that I need 2 csv files to run a PERMANOVA in R, but I have only one. Here is the script that I found on the internet and want to use for my analysis:
library(vegan)
data(dune)
data(dune.env)
# default test by terms
adonis2(dune ~ Management*A1, data = dune.env)
In my case, if I understand correctly, I should have one data frame with species and one data frame with environmental variables.
However, my presence/absence values sit inside the environmental categories (see the str of my table above). So if I create a data frame with species only, I will not have numerical values in the species data frame.
I am totally lost and don't know how to proceed. Can someone help me please? Thank you!
I will split my answer into two parts: the one where I know what I am talking about, and one where I am brainstorming ;)
Here is the first part on how to split your data into two data.frames
# Set seed for reproducibility
set.seed(1312)
# Some sample data
sample_data <- dplyr::tibble(species     = sample(LETTERS[1:5], size = 500, replace = TRUE),
                             Built       = sample(c(0, 1), size = 500, replace = TRUE),
                             Agriculture = sample(c(0, 1), size = 500, replace = TRUE),
                             Forested    = sample(c(0, 1), size = 500, replace = TRUE),
                             Grassland   = sample(c(0, 1), size = 500, replace = TRUE),
                             Wetland     = sample(c(0, 1), size = 500, replace = TRUE),
                             Bare        = sample(c(0, 1), size = 500, replace = TRUE),
                             Water       = sample(c(0, 1), size = 500, replace = TRUE))
# Split data (library(dplyr) provides the %>% pipe)
library(dplyr)
df1 <- sample_data %>%
  dplyr::select(species)
df2 <- sample_data %>%
  dplyr::select(Built:Water)
Now part two: as it says in ?adonis2, the first argument of adonis2 is a formula whose left-hand side must be a community data matrix or a dissimilarity matrix.
Even though I am not sure it makes sense, I went wild and followed the instructions :D
df2_dist <- dist(df2)
vegan::adonis2(df2_dist~species, data=df1)
Permutation test for adonis under reduced model
Terms added sequentially (first to last)
Permutation: free
Number of permutations: 999
vegan::adonis2(formula = df2_dist ~ species, data = df1)
Df SumOfSqs R2 F Pr(>F)
species 4 5.333 0.15802 0.7038 0.88
Residual 15 28.417 0.84198
Total 19 33.750 1.00000
Of course this might be nonsense in terms of content, as I took a purely technical approach here, but maybe it helps you shape your data as required.
So I made this code :
setwd("C:/Users/Johan/Downloads/memoire Johanna (1)/memoire Johanna")
xdata=read.csv(file="all10_reduce_focal species1.csv", header=T, sep=",")
str(xdata)
xdata$species <- as.factor(xdata$species)
sample_data <- dplyr::tibble(species     = sample(LETTERS[1:5], size = 1212, replace = TRUE),
                             Built       = sample(c(0, 1), size = 1212, replace = TRUE),
                             Agriculture = sample(c(0, 1), size = 1212, replace = TRUE),
                             Forested    = sample(c(0, 1), size = 1212, replace = TRUE),
                             Grassland   = sample(c(0, 1), size = 1212, replace = TRUE),
                             Wetland     = sample(c(0, 1), size = 1212, replace = TRUE),
                             Bare        = sample(c(0, 1), size = 1212, replace = TRUE),
                             Water       = sample(c(0, 1), size = 1212, replace = TRUE))
library(dplyr)
df1 <- sample_data %>%
  dplyr::select(species)
df2 <- sample_data %>%
  dplyr::select(Built:Water)
df1_dist <- dist(df1)
vegan::adonis2(df1_dist~Built+Agriculture+Grassland+Forested+Wetland+Bare+Water, data=df2)
Species should be the response, as I am trying to see the effect of the landscape on the species. When I do this I get:
Error in vegdist(as.matrix(lhs), method = method, ...) : input data must be numeric
That's because the "species" variable contains only characters, so I changed it to be numeric:
sample_data <- dplyr::tibble(species=sample(c(1:5),size=1212,replace=T),
Built=sample(c(0,1),size=1212,replace=T),......
But the result I get is different from the result in SPSS: I don't have any significant variable (in SPSS, Built, Agriculture, Forested and Water are significant).
I think my code is wrong.
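One thing worth noting about the code above: the adonis2 call runs on sample_data, which was simulated with sample(), not on the imported xdata, so it cannot reproduce any real signal from the SPSS analysis. A minimal sketch of the same split applied to the actual data, assuming xdata has the columns shown in the str() output above:
library(dplyr)
# Split the real data, not the simulated sample_data
df1 <- xdata %>% dplyr::select(species)
df2 <- xdata %>% dplyr::select(Built:Water)
# Distances between sites based on their landscape profile,
# tested against species membership
df2_dist <- dist(df2)
vegan::adonis2(df2_dist ~ species, data = df1)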

How to organize my data to create a heatmap using correlation or cluster analysis (x must be numeric problem)

I need some help with generating heatmaps with cluster analysis and correlation (I am new to R). My data looks like this in Excel:
Gene1 Gene2 Gene3 Gene4 Gene5 ... Gene296
Bacteria1 0 0 0 0.7 0.2 ... 0
Bacteria2 0.44 0 0 0 0 ... 0.9
Bacteria2 0 0.32 0 0.4 0 ... 0
... ... ... ... ... ... ... ...
Bacteria117 0 0.2 0.3 0 0.7 ... 0
A value of 0.32 represents a score of 32 on a scale from 0 to 100; there are higher scores (0.9, for example) and lower ones (0 or 0.2, for example). I checked for NAs and there are none. I want to do a cluster analysis to find out which bacteria form clusters according to my experimental data (scores). The file is a CSV. I used this code:
> aa <- read.csv(file.choose())
> str(aa)
#I obtain this structure
'data.frame': 117 obs. of 296 variables:
$ X : Factor w/ 117 levels "Ac_neuii_BVI",..: 45 64 67 104 1 2 3 4 5 6 ...
$ AAC6_Iad : num 0 0 0 0 0 0 0 0 0 0 ...
$ aad6 : num 0 0 0 0 0 0 0 0 0 0 ...
$ abeS : num 0 0 0 0 0 0 0 0 0 0 ...
> is.numeric(aa)
[1] FALSE
When I try to use the correlation or the clustering I get this error:
> az <- cor(aa)
Error in cor(aa) : 'x' must be numeric
I tried as.matrix, but the error of course persists in the matrix. I tried as.numeric, but it didn't work. When I erased X with aa$X <- NULL the problem disappeared (I don't know if this is the correct way to solve it), but the names of the bacteria disappeared too, and I then get a correlation between my genes only, not between my genes AND the bacteria. The same thing happens with clustering using hclust or dist. Is there a way I should organize my csv file? I haven't found a clear article on the internet on how to solve the "x must be numeric" problem, or on how to do the correlation or measure the distances between the genes and the bacteria.
Thank you. Sorry for my ignorance of things that might appear obvious to you.
You can import the bacteria names as row.names:
aa <- read.csv(file.choose(), row.names = 1)
aa$X is not numeric (it is a factor). You can transform it with:
aa$X = as.numeric(aa$X)
Then az <- cor(aa) will run... but (as noted by @Cole) it does not make sense, since X holds the names of the bacteria.
You can set the first column to be the names of the rows with the row.names parameter of read.csv:
aa <- read.csv(file.choose(), row.names = 1)
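With the bacteria names stored as row names, the data frame is all-numeric, so the correlation and clustering calls go through. A short sketch (note that cor(aa) correlates the gene columns with each other, while dist(aa) measures distances between the bacteria rows):
aa <- read.csv(file.choose(), row.names = 1)
# Correlation between the gene columns
az <- cor(aa)
# Hierarchical clustering of the bacteria; dist() works on rows,
# so the row names label the leaves of the dendrogram
hc <- hclust(dist(aa))
plot(hc)
# heatmap() clusters both bacteria (rows) and genes (columns)
heatmap(as.matrix(aa))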

Format number series in R

I have a number series like the one below. I need the negative numbers (numbers below 0) to be set to zero and the other numbers to be rounded to two digits.
Can somebody help me do this in R?
Output:
21.31 22.0 0 8.71 -25.27 1.63 0 144.23 0 0 21.9558290 57.2186577 214.2688719 57.9806240 0 0 21.7744036 50.7217715 0 131.4853834
Thanks in advance.
pred_cty1
1 2 3 4 5 6 7
21.3147237 22.0741859 -1.5040034 8.7155408 -25.2777258 1.6331518 -1.5303588
8 9 10 11 12 13 14
144.2318083 -13.1278888 -19.6253222 21.9558290 57.2186577 214.2688719 57.9806240
15 16 17 18 19 20
-7.7710546 -35.6169525 21.7744036 50.7217715 -0.4616455 131.4853834
> str(pred_cty1)
Named num [1:20] 21.31 22.07 -1.5 8.72 -25.28 ...
- attr(*, "names")= chr [1:20] "1" "2" "3" "4" ...
These are very basic R functions and methodologies, so I'd recommend researching the concept of subsetting and checking out ?round. FYI, pred_cty1 is a vector-type object; 'series' doesn't really help answer your question, because a bunch of data types can store one.
After reading up on subsetting and round, check out this simple solution:
# Round everything to two digits first...
pred_cty1 <- round(pred_cty1, digits = 2)
# ...then use logical subsetting to replace the negative values with zero
pred_cty1[pred_cty1 < 0] <- 0
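The same thing fits on one line with pmax(), which takes the element-wise maximum of the rounded values and zero (an equivalent alternative, not from the original answer):
pred_cty1 <- pmax(round(pred_cty1, digits = 2), 0)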

R: filling a matrix with values does not work

I have a data frame vec that I need to prepare for an image.plot() plot. The structure of vec is as follows:
> str(vec)
'data.frame': 31212 obs. of 5 variables:
$ x : int 8 24 40 56 72 88 104 120 136 152 ...
$ y : int 8 8 8 8 8 8 8 8 8 8 ...
$ dx: num 0 0 0 0 0 0 0 0 0 0 ...
$ dy: num 0 0 0 0 0 0 0 0 0 0 ...
$ d : num 0 0 0 0 0 0 0 0 0 0 ...
Note: the values in $dx, $dy and $d are not zero, they are just too small to show in this overview.
Background: the data is the output of pixel-tracking software. $x and $y are pixel coordinates, while $d holds the displacement vector length (in pixels) of each pixel.
image.plot() expects as its first and second arguments the grid coordinates of the matrix as ordered vectors, so I think sort(unique(vec$x)) and sort(unique(vec$y)) respectively should be good. I would like to end up with image.plot(sort(unique(vec$x)), sort(unique(vec$y)), data).
The third argument is the actual data. To build this I tried:
# spanning an empty matrix
data = matrix(NA,length(unique(vec$x)),length(unique(vec$y)))
# filling the matrix
data[match(vec$x, sort(unique(vec$x))), match(vec$y, sort(unique(vec$y)))] = vec$d
But, unfortunately, this isn't working. It reports no errors, but data contains no values! This works:
for (i in seq_along(vec$x)) data[match(vec$x[i], sort(unique(vec$x))), match(vec$y[i], sort(unique(vec$y)))] = vec$d[i]
but it is very slow.
a) is there a better way to build data?
b) is there a better way to deal with my problem, anyways?
R allows indexing a matrix by a two-column matrix, where the first column of the index is interpreted as the row index and the second column as the column index. So create the indexes into data as a two-column matrix:
idx = cbind(match(vec$x, sort(unique(vec$x))),
            match(vec$y, sort(unique(vec$y))))
and use that
data[idx] = vec$d
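A tiny self-contained demonstration of this kind of matrix indexing, with made-up values (not the question's data):
m <- matrix(NA, 3, 3)
# each row of idx is one (row, column) position to fill
idx <- cbind(c(1, 2, 3), c(3, 1, 2))
m[idx] <- c(10, 20, 30)
m
#      [,1] [,2] [,3]
# [1,]   NA   NA   10
# [2,]   20   NA   NA
# [3,]   NA   30   NA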

How do I automatically specify the correct regression model when the number of variables differs between input data sets?

I have a working R program that my internal client will use to analyse their nutrient intake data. They will re-run the R program for each dataset they have.
A key part of the analysis is a nonlinear mixed-effects model, using nlmer from the lme4 package, that incorporates dummy variables for age. Depending on whether they are analysing children or adults, the number of age-band dummies in the formula will differ, although the reference age band will always be the youngest. I think the number of possible age bands ranges from 4 to about 6, so it's not a large range, and it is a trivial matter to count the number of age-band dummies if I need to condition on that.
What is the most efficient way for me to wrap the model-based code (the lmer that provides the starting parameter values, the function for the nlmer model, and the model specification in nlmer itself) so that the correct function and models are applied based on the number of age-band dummies in the model? The other variables in the model are constant across datasets.
I've already set the program up to automatically generate the relevant dummies and drop those that aren't used in the current analysis. The part of the program after the model is pretty well automated as well. I'm just stuck on how to automate the two lme4-based analyses and the function; these will only be run once per dataset.
I've been wondering whether I need to write a function to contain all the lme4-related code, or whether there is an easier way. I would appreciate some pointers on how to do this. It took me a day to work out the function I needed for the nlmer model, so I am still at a beginner level with functions.
I've searched for other R-related automation questions on the site and didn't find anything similar to what I would like to do.
Thanks in advance.
Update in response to the suggestion in the comments about using a string: that sounds like an easy way forward, except that I don't know how to apply the string's contents in a function, since each dummy-variable level (excluding the reference category) is used in the function for nlmer. How can I pull apart the string and use only the dummy variables that I have in a function? For example, one analysis might have AgeBand2, AgeBand3 and AgeBand4, and another analysis might have AgeBand5 as well as those 3. If this were VBA, I would just create subfunctions based on the number of age dummy variables, but I have no idea how to do this efficiently in R.
Can I just wrap a while loop around the lmer, function, and nlmer parts, so that I have a series of while loops?
This is the section of code I wish to automate; the number of AgeBand dummy variables differs depending on the dataset being analysed (children vs. adults). This uses the dataset on which I have been testing a SAS-to-R translation, but the real datasets will be very similar. A nonlinear model is necessary because it is the basis of the peer-reviewed published method I am working from.
library(lme4)
# Linear mixed model whose fixed effects provide starting values for nlmer
Male.lmer <- lmer(BoxCoxXY ~ AgeBand4 + AgeBand5 + AgeBand6 + AgeBand7 +
                    AgeBand8 + Race1 + Race3 + Weekend + IntakeDay + (1|RespondentID),
                  data = Male.AddSugar,
                  weights = Replicates)
# Pull the fixed-effect coefficients out one by one
Male.lmer.fixef <- fixef(Male.lmer)
Male.lmer.fixef <- as.data.frame(Male.lmer.fixef)
bA <- Male.lmer.fixef[1,1]
bB <- Male.lmer.fixef[2,1]
bC <- Male.lmer.fixef[3,1]
bD <- Male.lmer.fixef[4,1]
bE <- Male.lmer.fixef[5,1]
bF <- Male.lmer.fixef[6,1]
bG <- Male.lmer.fixef[7,1]
bH <- Male.lmer.fixef[8,1]
bI <- Male.lmer.fixef[9,1]
bJ <- Male.lmer.fixef[10,1]
# Model function with analytic gradients for nlmer
MD <- deriv(expression(b0 + b1*AgeBand4 + b2*AgeBand5 + b3*AgeBand6 +
                         b4*AgeBand7 + b5*AgeBand8 + b6*Race1 + b7*Race3 + b8*Weekend + b9*IntakeDay),
            namevec = c("b0","b1","b2","b3","b4","b5","b6","b7","b8","b9"),
            function.arg = c("b0","b1","b2","b3","b4","b5","b6","b7","b8","b9",
                             "AgeBand4","AgeBand5","AgeBand6","AgeBand7","AgeBand8",
                             "Race1","Race3","Weekend","IntakeDay"))
Male.nlmer <- nlmer(BoxCoxXY ~ MD(b0,b1,b2,b3,b4,b5,b6,b7,b8,b9,AgeBand4,AgeBand5,AgeBand6,AgeBand7,AgeBand8,
                                  Race1,Race3,Weekend,IntakeDay)
                    ~ b0|RespondentID,
                    data = Male.AddSugar,
                    start = c(b0=bA, b1=bB, b2=bC, b3=bD, b4=bE, b5=bF, b6=bG, b7=bH, b8=bI, b9=bJ),
                    weights = Replicates)
These will be the required changes between the datasets:
the number of fixed-effect coefficients that I need to assign out of the lmer will change,
in the function, the expression, namevec, and function.arg parts will change,
in nlmer, the model statement and the start parameter list will change.
I can change the lmer model statement so it takes AgeBand as a factor with levels, but I still need to pull out the values of the coefficients afterwards.
str(Male.AddSugar) gives:
'data.frame': 10287 obs. of 23 variables:
$ RespondentID: int 9966 9967 9970 9972 9974 9976 9978 9979 9982 9993 ...
$ RACE : int 2 3 2 2 3 2 2 2 2 1 ...
$ RNDW : int 26290 7237 10067 75391 1133 31298 20718 23908 7905 1091 ...
$ Replicates : num 41067 2322 17434 21723 375 ...
$ DRXTNUMF : int 27 11 13 18 17 13 13 19 11 11 ...
$ DRDDAYCD : int 1 1 1 1 1 1 1 1 1 1 ...
$ IntakeAmt : num 33.45 2.53 9.58 43.34 55.66 ...
$ RIAGENDR : int 1 1 1 1 1 1 1 1 1 1 ...
$ RIDAGEYR : int 39 23 16 44 13 36 16 60 13 16 ...
$ Subgroup : Ord.factor w/ 6 levels "3"<"4"<"5"<"6"<..: 4 3 2 4 1 4 2 5 1 2 ...
$ WKEND : int 1 1 1 0 1 0 0 1 1 1 ...
$ AmtInd : num 1 1 1 1 1 1 1 1 1 1 ...
$ IntakeDay : num 0 0 0 0 0 0 0 0 0 0 ...
$ Weekend : int 1 1 1 0 1 0 0 1 1 1 ...
$ Race1 : num 0 0 0 0 0 0 0 0 0 1 ...
$ Race3 : num 0 1 0 0 1 0 0 0 0 0 ...
$ AgeBand4 : num 0 0 1 0 0 0 1 0 0 1 ...
$ AgeBand5 : num 0 1 0 0 0 0 0 0 0 0 ...
$ AgeBand6 : num 1 0 0 1 0 1 0 0 0 0 ...
$ AgeBand7 : num 0 0 0 0 0 0 0 1 0 0 ...
$ AgeBand8 : num 0 0 0 0 0 0 0 0 0 0 ...
$ YN : num 1 1 1 1 1 1 1 1 1 1 ...
$ BoxCoxXY : num 7.68 1.13 3.67 8.79 9.98 ...
The AgeBand data is also (incorrectly) present as the ordered factor Subgroup. Because I haven't used it, I haven't gone back and corrected this to a plain factor.
This assumes that you have one variable, ageband, which is a factor with levels AgeBand2, AgeBand3, AgeBand4, and perhaps others that you want to be ignored. Since R's regression functions generally treat the lowest lexicographic value of a factor as the reference level, your correct reference level would be chosen automagically. You pick your desired levels by creating a dataset that has only those levels:
agelevs <- c("AgeBand2", "AgeBand3", "AgeBand4")
dsub <- subset(inpdat, ageband %in% agelevs)
res <- nlmer(y ~ ageband + <other-parameters>, data=dsub, ...)
If you have gone to the trouble of creating separate variables, then you need to learn to use factors correctly rather than holding on to inefficient habits enforced by training in SPSS or other clunky macro processors.
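As for the update about building the model from a string: one sketch, assuming the age-band dummies all start with "AgeBand" and the rest of the model is as in the question, is to keep the names in a character vector and assemble the formula and the deriv() expression from it, so the same code handles 4, 5 or 6 age bands:
library(lme4)
# Which age-band dummies exist in this dataset (assumed naming convention)
ageband_names <- grep("^AgeBand", names(Male.AddSugar), value = TRUE)
other_vars    <- c("Race1", "Race3", "Weekend", "IntakeDay")
pred_vars     <- c(ageband_names, other_vars)
# Build "BoxCoxXY ~ AgeBand4 + ... + IntakeDay + (1|RespondentID)" as a formula
lmer_form <- reformulate(c(pred_vars, "(1|RespondentID)"), response = "BoxCoxXY")
Male.lmer <- lmer(lmer_form, data = Male.AddSugar, weights = Replicates)
# Starting values: one b-name per coefficient, however many there are
start_vals <- fixef(Male.lmer)
names(start_vals) <- paste0("b", seq_along(start_vals) - 1)
# Assemble the deriv() expression as text, then parse it
coef_names <- names(start_vals)
expr_text  <- paste("b0 +", paste(coef_names[-1], "*", pred_vars, collapse = " + "))
MD <- deriv(parse(text = expr_text)[[1]],
            namevec = coef_names,
            function.arg = c(coef_names, pred_vars))
The nlmer formula itself can be pasted together and converted with as.formula() in the same way, since it uses the same coefficient and variable names.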
