Loop to create series of graphs from different files - r

I am trying to plot histograms with long term (several years) mean precipitation (pp) for each day of the month from a series of files. Each file has data collected from a different place (and has a different code). Each of my files looks like this:
X code year month day pp
1 2867 1945 1 1 0.0
2 2867 1945 1 2 0.0
...
And I am using the following code:
files <- list.files(pattern=".csv")
par(mfrow=c(4,6))
for (i in 1:24) {
obs <- read.table(files[i],sep=",", header=TRUE)
media.dia <- ddply(obs, .(day), summarise, daily.mean<-mean(pp))
codigo <- unique(obs$code)
hist(daily.mean, main=c("hist per day of month", codigo))
}
I get 24 histograms with 24 different codes in the title, but instead of 24 DIFFERENT histograms from 24 different locations, I get the same histogram 24 times (with 24 different titles). Can anybody tell me why? Thanks!

There are at least two errors in your code:
The ddply statement names the new column with <- where it needs =.
You are passing the wrong variable to hist, so you plot an object that may or may not exist depending on what you did earlier in the session.
Inside summarise, daily.mean <- mean(pp) does not create a column called daily.mean. The column name has to be given with =:
media.dia<- ddply(obs, .(day),summarise, daily.mean = mean(pp))
Then edit your hist statement:
hist(media.dia$daily.mean,main=c("hist per day of month",codigo))
hist needs the daily.mean column of media.dia, not a free-standing object called daily.mean. Your original code produced a plot at all only because some earlier step in your session must have created a variable called daily.mean (as Brandon points out in the comment). That same stale variable was then drawn in every iteration, which is why all 24 histograms look identical.
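Putting both fixes together, the loop could look roughly like the sketch below. This is not the asker's exact code: it swaps ddply for base R's tapply so that it runs without plyr, and the helper names daily_means and plot_station_hists are made up for illustration. It assumes every file has the code, day and pp columns shown in the question.

```r
# Mean precipitation for each day of the month in one station's data.
# Assumes `obs` has numeric columns `day` and `pp`, as in the question.
daily_means <- function(obs) {
  means <- tapply(obs$pp, obs$day, mean, na.rm = TRUE)
  data.frame(day = as.integer(names(means)),
             daily.mean = as.vector(means))
}

# Draw one histogram per file; each panel uses that file's own daily means.
plot_station_hists <- function(files, layout = c(4, 6)) {
  old_par <- par(mfrow = layout)
  on.exit(par(old_par))          # restore the plotting layout afterwards
  for (f in files) {
    obs <- read.csv(f, header = TRUE)
    media.dia <- daily_means(obs)
    codigo <- unique(obs$code)
    hist(media.dia$daily.mean,
         main = paste("hist per day of month", codigo),
         xlab = "mean pp")
  }
  invisible(NULL)
}
```

It would be called as plot_station_hists(list.files(pattern = "\\.csv$")). Because media.dia is rebuilt from each file and passed to hist explicitly, no panel can pick up a leftover variable from the session.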

I think the daily.mean calculated in the ddply function is assigned in a separate environment, and does not exist in an environment hist can see.
Try daily.mean<<-mean(pp)

Related

Creation of an index with a year base

I have a simple features panel data with the murders in the 32 Mexican states through 24 years. I want to create an index taking as a base the first year in my data (1994). To do so I am running the following code:
#Taking the data of murders in 1994 from each state and then paste it for all the years
mexico.sf$murders1994 <- mexico.sf$murders[mexico.sf$year==1994]
#Use the murders from each year divided by the murders in 1994 per state to create an index
mexico.sf$murdersrelativeto1994 <- (mexico.sf$murders / mexico.sf$murders1994)
However, when I run the first line I get the following error:
Error: Assigned data `mexico.sf$murders[mexico.sf$year == 1994]` must be compatible with existing data.
x Existing data has 800 rows.
x Assigned data has 32 rows.
i Only vectors of size 1 are recycled.
Run `rlang::last_error()` to see where the error occurred.
It is clear to me that I only get 32 values because I am filtering by year. How can I copy those 32 values across the whole sample?
Without seeing the actual data, I'm not sure I understand your goal correctly. But if you only need to repeat the 32 values across all 800 rows, this may work, provided the rows are sorted by year and the states appear in the same order within every year:
mexico.sf$murders1994 <- rep(mexico.sf$murders[mexico.sf$year==1994], 800/32)
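If the row order cannot be guaranteed, looking the base-year value up by state is safer than rep(). A minimal sketch, assuming the data has a column that identifies the state (called state here, which is a guess because the question does not show the column names) and using a tiny made-up data frame in place of mexico.sf:

```r
# Toy stand-in for mexico.sf: 2 states x 3 years, deliberately unsorted.
df <- data.frame(
  state   = c("B", "A", "A", "B", "A", "B"),
  year    = c(1995, 1994, 1995, 1994, 1996, 1996),
  murders = c(30, 10, 15, 20, 5, 40)
)

# One row per state holding its 1994 value.
base <- df[df$year == 1994, c("state", "murders")]

# Look up each row's state in the 1994 table, whatever the row order.
df$murders1994 <- base$murders[match(df$state, base$state)]

# Index relative to the base year.
df$murdersrelativeto1994 <- df$murders / df$murders1994
```

match() returns, for every row, the position of its state in the 1994 subset, so the result always has as many values as the full data and does not depend on sorting.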

How to read a CSV file and calculate the mean in R dynamically?

I want to read a file and calculate the mean of it.
> list
[1] "book1.csv" "book2.csv"
for book1
observation1
23
24
65
76
34
Each book has one variable: the column observation1 in book 1 and observation2 in book 2. I want to write a function that calculates its mean. I am new to R and have not been able to subset the variable in each book. Can anyone please help me write the function?
Try this. file is the file to be read (e.g. book1.csv) and variable is the column to take the mean over (e.g. observation1):
read.mean<-function(file,variable){
df<-read.csv(file)
mean.df <- mean(df[,variable])
return(mean.df)
}
Make sure to pass both arguments in quotes and to include the file extension, i.e. read.mean("book1.csv", "observation1"). There is a way to do it without the quotes (Passing a variable name to a function in R) but it is complicated.
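To cover the "dynamically" part of the question, the function can be applied to every file found in a folder. The sketch below is only one way to do it. It writes two throwaway files to a temporary directory so that it runs on its own, and it assumes that bookN.csv always holds a column named observationN, which is inferred from the question rather than stated in it.

```r
# Same idea as read.mean above; [[ ]] selects the column by its name.
read.mean <- function(file, variable) {
  df <- read.csv(file)
  mean(df[[variable]], na.rm = TRUE)
}

# Demo data: two small files in a temporary directory.
tmp <- tempfile("books")
dir.create(tmp)
write.csv(data.frame(observation1 = c(23, 24, 65, 76, 34)),
          file.path(tmp, "book1.csv"), row.names = FALSE)
write.csv(data.frame(observation2 = c(1, 2, 3)),
          file.path(tmp, "book2.csv"), row.names = FALSE)

files <- list.files(tmp, pattern = "\\.csv$", full.names = TRUE)

# Assumption: bookN.csv holds a column named observationN.
variables <- sub("^book([0-9]+)\\.csv$", "observation\\1", basename(files))

# One mean per file, named after the file it came from.
means <- mapply(read.mean, files, variables)
names(means) <- basename(files)
```

If the column names do not follow the file names, variables can simply be written out by hand as a character vector in the same order as files.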

Discriminant analysis and column name in the code

I have been writing code to simplify running a discriminant analysis with the lda function, but there is one step I cannot solve: supplying the name of the categorical column to the code. Suppose we have the following table (called Smoke in the code below), in which the column Factor holds the groups (in our case, smoker and nsmok).
smoke
Factor Lung Heart Blood
1 smoker 7 22 15
2 smoker 8 21 12
3 nsmok 22 9 5
This is the code I have prepared. Please look at the XXXX's in the code (they appear twice). I want the name of the categorical column to be filled in there automatically, instead of typing it in twice by hand.
lda=lda(XXXX~.,data=Smoke)
plot(lda)
lda
lda$counts
lda$svd
lda.p=predict(lda)
Tabla=table(Smoke$XXXX,lda.p$class)
Tabla
diag(prop.table(Tabla, 1))
sum(diag(prop.table(Tabla)))
I thought that writing...
colnames(Table)[1]
... would solve it. But actually there still exist some errors when running the code.
Alternatively, I thought that storing the name directly in this way:
Column_Factor-> Factor
and writing Column_Factor in the two places in the code would solve it. But it doesn't.
Any ideas?
You could do something like this:
library(MASS)
#gets the column name of the factor, maybe check if there is only one factor column first
Column_Factor <- names(Smoke)[sapply(Smoke, class)=="factor"]
#creates the formula by pasting the name and the RHS
lda <- lda(as.formula(paste(Column_Factor,"~.",sep="")),data=Smoke)
plot(lda)
lda
lda$counts
lda$svd
lda.p=predict(lda)
#selects the column using the variable
Tabla=table(Smoke[,Column_Factor],lda.p$class)
Tabla
diag(prop.table(Tabla, 1))
sum(diag(prop.table(Tabla)))
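A slightly more defensive variant of the same idea is sketched below. The helper name find_group_column is made up for illustration. It also accepts character columns, because read.csv() in R 4.0 and later no longer converts strings to factors by default, and it builds the formula with reformulate() instead of pasting strings. The Smoke data frame is rebuilt from the three rows shown in the question.

```r
# Return the name of the single grouping column in `df`.
# Stops with an error unless exactly one factor/character column exists.
find_group_column <- function(df) {
  is_group <- vapply(df,
                     function(col) is.factor(col) || is.character(col),
                     logical(1))
  if (sum(is_group) != 1) {
    stop("expected exactly one factor/character column, found ",
         sum(is_group))
  }
  names(df)[is_group]
}

# The three rows from the question.
Smoke <- data.frame(
  Factor = factor(c("smoker", "smoker", "nsmok")),
  Lung   = c(7, 8, 22),
  Heart  = c(22, 21, 9),
  Blood  = c(15, 12, 5)
)

Column_Factor <- find_group_column(Smoke)

# Build the formula `Factor ~ .` from the column name.
lda_formula <- reformulate(".", response = Column_Factor)
```

The formula can then be passed on as lda(lda_formula, data = Smoke), and the grouping column selected for the confusion table with Smoke[[Column_Factor]].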

Circular-linear regression with covariates in R

I have data showing when an animal came to a survey station. example csv file here The first few lines of data look like this:
Site_ID DateTime HourOfDay MinTemp LunarPhase Habitat
F1 6/12/2013 14:01:00 14 -1 0 river
F1 6/12/2013 14:23:00 14 -1 0 river
F2 6/13/2013 1:21:00 1 3 1 upland
F2 6/14/2013 1:33:00 1 4 2 upland
F3 6/14/2013 1:48:00 1 4 2 river
F3 6/15/2013 11:08:00 11 0 0 river
I would like to perform a circular-linear regression in R to determine peak activity times. The dependent variable could be DateTime or HourOfDay, whichever is easier. I would like to incorporate the covariates Site_ID (random effect), plus MinTemp, LunarPhase, and Habitat into a mixed-effects model.
I have tried using the lm.circular function of the circular package, and have the following code:
data<-read.csv("StackOverflowExampleData.csv")
data$DateTime<-as.POSIXct(as.character(data$DateTime), format = "%m/%d/%Y %H:%M:%S")
data$LunarPhase<-as.factor(data$LunarPhase)
str(data)
library(circular)
y<-data$DateTime
y<-circular(y, units ="hours",template = "clock24",rotation = "clock")
x<-data[,c(1,4,5,6)]
lm.circular(y=y, x=x, init=c(1,1,1,1), type='c-l', verbose=TRUE)
I keep getting the error:
Error in Ops.POSIXt(x, 12) : '/' not defined for "POSIXt" objects
Apparently this is a known bug, but I was confused by this thread about it and could not determine an appropriate work-around. Suggestions?
Also, my ultimate goal with this data was to run a circular-linear version of a glm, and then test several models against one another using AIC or some other information theoretics method. The model I'm seeking would be a circular-linear version of something like this:
glmer(HourOfDay~MinTemp+LunarPhase+Habitat+(1|Site_ID),family=binomial,data=data)
Perhaps this is an inappropriate application of the circular package. If so, I'm open to other suggestions of models and/or graphics that would investigate peak activity using the data and covariates.
Note: I did search for related discussions and found this somewhat relevant thread, but it was never answered, did not request a solution in R, and was of a different scope.
The specific problem is caused by conversion.circular. There, a POSIXlt object is divided by 12. This is an operation that has a non-defined outcome:
> as.POSIXlt('2005-07-16') / 2
Error in Ops.POSIXt(as.POSIXlt("2005-07-16"), 2) :
'/' not defined for "POSIXt" objects
So, it seems that you cannot use data of this class as input for the circular package. I could not find any mention of POSIXlt data in the examples. Maybe you need to specify the timestamps simply as numbers, not as POSIXct or POSIXlt objects.
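One way to follow that suggestion is to convert each timestamp to decimal hours of the day before handing it to circular(). The sketch below uses two timestamps in the format from the question and fixes the time zone to UTC, which is an assumption made only so the example is reproducible. The circular() call runs only if the package is installed.

```r
# Two timestamps in the format used in the question's CSV.
stamps <- c("6/12/2013 14:01:00", "6/13/2013 1:21:00")
dt <- as.POSIXct(stamps, format = "%m/%d/%Y %H:%M:%S", tz = "UTC")

# Split each timestamp into components, then express the time of day
# as decimal hours (a plain numeric vector).
lt <- as.POSIXlt(dt, tz = "UTC")
hours <- lt$hour + lt$min / 60 + lt$sec / 3600

# The same times as angles in radians (24 h = one full turn).
radians <- hours * 2 * pi / 24

# Only if the circular package is installed: build the circular response
# from plain numbers instead of a date-time object.
if (requireNamespace("circular", quietly = TRUE)) {
  y <- circular::circular(hours, units = "hours", template = "clock24")
}
```

Since hours is an ordinary numeric vector, the division that failed on the date-time object is well defined. The existing HourOfDay column would also work, at the cost of dropping the minutes.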

R storing different columns in different vectors to compute conditional probabilities

I am completely new to R. I tried reading the reference and a couple of good introductions, but I am still quite confused.
I am hoping to do the following:
I have produced a .txt file that looks like the following:
area,energy
1.41155882174e-05,1.0914586287e-11
1.46893363946e-05,5.25011714434e-11
1.39244046855e-05,1.57904991488e-10
1.64155121046e-05,9.0815757601e-12
1.85202830392e-05,8.3207522281e-11
1.5256036289e-05,4.24756620609e-10
1.82107587343e-05,0.0
I have the following command to read the file in R:
tbl <- read.csv("foo.txt",header=TRUE).
producing:
> tbl
area energy
1 1.411559e-05 1.091459e-11
2 1.468934e-05 5.250117e-11
3 1.392440e-05 1.579050e-10
4 1.641551e-05 9.081576e-12
5 1.852028e-05 8.320752e-11
6 1.525604e-05 4.247566e-10
7 1.821076e-05 0.000000e+00
Now I want to store each column in two different vectors, respectively area and energy.
I tried:
area <- c(tbl$first)
energy <- c(tbl$second)
but it does not seem to work.
I need two different vectors (which must include only the numerical data of each column) in order to compute:
> prob(energy, given = area), i.e. the conditional probability P(energy|area).
And then plot it. Can you help me please?
As @Ananda Mahto alluded to, the problem is in the way you are referring to columns.
To 'get' a column of a data frame in R, you have several options:
DataFrameName$ColumnName
DataFrameName[,ColumnNumber]
DataFrameName[["ColumnName"]]
So to get area, you would do:
tbl$area #or
tbl[,1] #or
tbl[["area"]]
With the first option generally being preferred (from what I've seen).
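As a small runnable check of the three forms, the sketch below reads the first three rows from the question directly from a string, so no foo.txt is needed. It also shows why the original attempt produced nothing: the table has no column named first.

```r
# First three data rows from the question, read from inline text.
tbl <- read.csv(text = "area,energy
1.41155882174e-05,1.0914586287e-11
1.46893363946e-05,5.25011714434e-11
1.39244046855e-05,1.57904991488e-10")

# Any of the three forms returns the column as a plain numeric vector.
area   <- tbl$area
energy <- tbl[["energy"]]
area_by_position <- tbl[, 1]

# `$` looks columns up by name, not by position, so this is NULL.
missing_col <- tbl$first
```

Because tbl$first is NULL, c(tbl$first) is NULL as well, which is why area and energy ended up empty in the original attempt.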
Incidentally, for your 'end goal', you don't need to do any of this:
with(tbl, prob(energy, given = area))
does the trick.
