I am doing some very basic plots for exploratory analyses, and have successfully created a for loop to do most of the work for me. I have 12 years of data, 5 different categories (Cat1-Cat5), and 3 different variables (say X, Y, Z). The loops that I have written so far give me the histogram of each of the variables by year (so X in year 1 through X in year 12, for example).
I partitioned my data in 2 ways - by category, and by year as follows:
Cat.1<-subset(data,Category==1) #Similar code for categories 2-5
categories<-list(Cat.1,Cat.2,Cat.3,Cat.4,Cat.5)
Year.1<-subset(data,Year==1)
years<-list(Year.1,Year.2, ... , Year.12)
Now, with the data partitioned this way I have set up loops:
for(i in 1:length(categories))
{
store.data<-categories[[i]]
hist(store.data$X)
}
What I would like to do is have an external loop that deals with the 3 variables:
variables<-list(X,Y,Z)
for(j in 1:length(variables))
{
#insert above for loop here
}
The desired output would be the output of all of the histograms for each year and each variable. I realize that I can just add in lines to the original for loop:
hist(store.data$Y)
hist(store.data$Z)
But, eventually I will be running analyses (ANOVA, t-test, etc) on the data and I plan on having the same setup. By having the external loop that deals with which variable the internal loop works on, I should have much less code to write in theory.
This short solution gives you the histograms, but doesn't title them to tell you which histogram relates to which category. The histograms will be named by variable, and the order in which the histograms are generated will correspond to the numerical order of your categories. It doesn't look like you're labeling your histograms in the code you posted, so this may not be a problem for you.
category = rep(1:5,20)
X = rnorm(100)
Y = rexp(100)
Z = rgamma(100,5)
require(data.table)
DT = data.table(category, X, Y, Z)
DT[,lapply(.SD, hist), by=category]
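If you do want the nested-loop layout from the question, with each plot titled, here is a minimal base R sketch. It assumes the variables are columns of one data frame and refers to them by name as strings; the simulated data, the use of split() instead of repeated subset() calls, and the titles vector are illustrative assumptions, not part of the original code.

```r
# Simulated stand-in for the real data (assumption: one row per observation)
set.seed(42)
data <- data.frame(Category = rep(1:5, 20),
                   X = rnorm(100), Y = rexp(100), Z = rgamma(100, 5))

# split() builds the per-category list in one call
categories <- split(data, data$Category)
variables <- c("X", "Y", "Z")   # column names as strings, not bare objects

pdf(NULL)  # discards the plots when run non-interactively; remove to see them
titles <- character(0)
for (v in variables) {               # outer loop: which variable
  for (k in names(categories)) {     # inner loop: which category
    title <- paste0(v, ", category ", k)
    hist(categories[[k]][[v]], main = title, xlab = v)
    titles <- c(titles, title)
  }
}
dev.off()
```

The key point is `[[v]]`: it selects a column by a name held in a variable, which `$` cannot do. The same two loops would work for a list of per-year subsets.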
I've been trying to solve the following problem which I am sure is an easy one (I am just not able to find a solution). I am using the package vegan and want to perform a cca that shows the actual row names as labels (instead of the default "sit1", "sit2", ...).
I created a data frame (ls_Treat1) with cast(), showing plot treatments (AB, DB, DL, etc.) as row names and species occurrences. The data frame looks as follows:
    species 1  species 2  species 3
AB  0          3          1
DB  1          6          0
DL  3          4          2
I created the data frame with the following code to set the treatments (AB, DB, DL, ...) as row names:
ls_Treat1 <- cast(fungi_ls, Treatment ~ species)
row.names(ls_Treat1)<- ls_Treat1$Treatment
ls_Treat1 <- ls_Treat1[,-1]
When I perform a cca with the following code:
ca <- cca(ls_Treat1)
plot(ca,display="sites")
R puts the default labels "sit1", "sit2", ... into the plot, instead of the actual row names, even though I have performed it this way before and the plots normally showed the right labels. Does this have anything to do with my creating the data frame? I tried to change the treatments (characters) into numbers (integers or factors) but still, the plot won't be labelled with my row names.
Can anyone help me with this?
Thank you very very much!!
The problem is that reshape::cast() does not produce a data.frame but something else. It claims to be a data.frame, but it is not. We do matrix algebra in cca and therefore convert the input to a matrix, which works for a standard data.frame but not for the object you supplied. In particular, when you remove the first column with ls_Treat1 <- ls_Treat1[,-1], you also remove the attributes that allow the names to be preserved; it would have worked without removing this column (if the reshape package was still loaded). Upgrading to the reshape2 package and using reshape2::acast() may be a solution.
This is probably simple, but I'm new to R and it doesn't work like GrADS, so I've been searching high and low for examples, but to no avail.
I have two sets of data. Data A (1997) and Data B (2000)
Data A has 35 headings (apples, orange, grape etc). 200 observations.
Data B has 35 headings (apples, orange, grape, etc). 200 observations.
The only difference between the two datasets is the year.
So I would like to correlate the two datasets, i.e. the 200 values under Apples (1997) vs. the 200 values under Apples (2000). Each heading should give me only one value.
I've converted all the header names to V1,V2,V3...
So now I need to do this:
x<-1
while(x<35) {
new(x)=cor(1997$V(x),2000$V(x))
print(new(x))
}
and then i get this error:
Error in pptn26$V(x) : attempt to apply non-function.
Any advice is highly appreciated!
Your error comes directly from using parentheses where R isn't expecting them. You'll get the same type of error if you do 1(x). 1 is not a function, so if you put parentheses right after it, you're attempting to apply a non-function.
I'm also a bit surprised at how you are managing to get all the way to that error, before running into several others, but I suppose that has something to do with when R evaluates what...
Here's how to get the behavior you're looking for:
mapply(cor, A, B)
# provided A is the name of your 1997 data frame and B the 2000
Here's an example with simulated data:
set.seed(123)
A <- data.frame(x = 1:10, y = sample(10), z = rnorm(10))
B <- data.frame(x = 4:13, y = sample(10), z = rnorm(10))
mapply(cor, A, B)
#         x         y          z
# 1.0000000 0.1393939 -0.2402058
In its typical usage, mapply takes an n-ary function and n objects that provide the n arguments for that function. Here the n-ary function is cor, and the objects are A, and B, each a data frame. A data frame is structured as a list of vectors, the columns of the data frame. So mapply will loop along your columns for you, making 35 calls to cor, each time with the next column of both A and B.
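For comparison, here is a sketch of the explicit loop the question was reaching for, written with `[[` so that the column name can be built from the counter. The data frames A and B, their V1..V3 column names, and the result vector are simulated stand-ins, not the asker's data.

```r
# Simulated stand-ins for the 1997 (A) and 2000 (B) data frames
set.seed(123)
A <- data.frame(V1 = rnorm(200), V2 = rnorm(200), V3 = rnorm(200))
B <- data.frame(V1 = rnorm(200), V2 = rnorm(200), V3 = rnorm(200))

# One correlation per column, stored in a named numeric vector
result <- numeric(ncol(A))
names(result) <- names(A)
for (x in seq_along(A)) {
  col <- paste0("V", x)                    # builds "V1", "V2", ...
  result[col] <- cor(A[[col]], B[[col]])   # [[ ]] looks a column up by string
}

# The loop and the one-liner agree
all.equal(result, mapply(cor, A, B))
```

Writing `A$V(x)` fails because R reads `V(x)` as a function call; `A[[paste0("V", x)]]` is the form that accepts a computed name.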
If you have managed to figure out how to name your data frames 1997 and 2000, kudos. It's not easy to do that. It's also going to cause you headaches. You'll want to have a syntactically valid name for your data frame(s). That means they should start with a letter (or a dot, but really a letter). See the R FAQ for the details.
Seems like quite an easy problem to solve, but I can't seem to get my head around it in R.
I have a dataset with the following columns:
'Biomass' where each row is a value of biomass for a particular species
'Count' where each row is the number of individual animals of that species counted
I need to create a histogram of biomasses, but if I use hist(DF$Biomass) I will get a histogram of the biomasses of the animals where each value is one animal.
I need to include the count, so that I have (for example) the weight frequencies of elephant x 2, giraffe x 56 etc..
you're not making my life easy :)
Is this what you want ?
DF <- data.frame(Biomass=c(200,200,1500),Count = c(36,20,2))
DF2 <- aggregate(Count ~ Biomass,DF,sum) # sum different occurrences for each Biomass value
barplot(DF2$Count,names.arg =DF2$Biomass) # presents them with a barplot, which is more appropriate than a histogram in the R sense here.
If I understood you right that is what you need :)
biomass<-c(1,5,7,6,3)
count<-c(1,2,1,3,4)
new<-NULL
for (i in 1:length(biomass))
{
new<-c(new, rep(biomass[i], count[i]))
}
new
hist(new)
So finally just type:
new<-NULL
for (i in 1:length(DF$Biomass))
{
new<-c(new, rep(DF$Biomass[i], DF$Count[i]))
}
hist(new)
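The same expansion can probably be done without a loop, since rep() accepts a vector of repeat counts through its times argument. A minimal sketch, reusing the small DF from the other answer as example data (the name expanded is just illustrative):

```r
# Example data: one row per species record, with how many animals were counted
DF <- data.frame(Biomass = c(200, 200, 1500), Count = c(36, 20, 2))

# Repeat each biomass value Count times, in one vectorised call
expanded <- rep(DF$Biomass, times = DF$Count)

pdf(NULL)  # discards the plot when run non-interactively; remove to see it
hist(expanded, xlab = "Biomass", main = "Biomass weighted by count")
dev.off()
```

This avoids growing a vector inside a loop, which gets slow when the counts are large.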
I'm completely new to R, and I have been tasked with making a script to plot the protocols used by a simulated network of users into a histogram by a) identifying the protocols they use and b) splitting everything into a 5-second interval and generate a graph for each different protocol used.
Currently we have
data$bucket <- cut(as.numeric(format(data$DateTime, "%H%M")),
c(0,600, 2000, 2359),
labels=c("00:00-06:00", "06:00-20:00", "20:00-23:59")) #Split date into dates that are needed to be
to split the codes into 3-zones for another function.
What should the code be changed to for 5 second intervals?
Sorry if the question isn't very clear, and thank you
The histogram function hist() can aggregate and/or plot all by itself, so you really don't need cut().
Let's create 1,000 random time stamps across one hour:
set.seed(1)
foo <- as.POSIXct("2014-12-17 00:00:00")+runif(1000)*60*60
(Look at ?POSIXct on how R treats POSIX time objects. In particular, note that "+" assumes you want to add seconds, which is why I am multiplying by 60^2.)
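A quick way to convince yourself of the seconds arithmetic (the tz = "UTC" argument and the names t0 and t1 are only there to make this sketch reproducible):

```r
# Adding a number to a POSIXct value adds that many seconds
t0 <- as.POSIXct("2014-12-17 00:00:00", tz = "UTC")
t1 <- t0 + 60 * 60                 # 3600 seconds later

format(t1, "%H:%M:%S")             # "01:00:00"
difftime(t1, t0, units = "secs")   # the gap expressed in seconds
```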
Next, define the breakpoints in 5 second intervals:
breaks <- seq(as.POSIXct("2014-12-17 00:00:00"),
as.POSIXct("2014-12-17 01:00:00"),by="5 sec")
(This time, look at ?seq.POSIXt.)
Now we can plot the histogram. Note how we assign the output of hist() to an object bar:
bar <- hist(foo,breaks)
(If you don't want the plot, but only the bucket counts, use plot=FALSE.)
?hist tells you that hist() (invisibly) returns an object that includes the counts per bucket. We can look at these by accessing the counts component of bar:
bar$counts
[1] 1 2 0 1 0 1 1 2 3 3 0 ...
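To get one set of 5-second counts per protocol, the same idea can be applied to each protocol's time stamps in turn. This is only a sketch: the data frame log, its DateTime and Protocol columns, and the protocol names are made up for illustration, since the question does not show the real data.

```r
# Simulated log: 300 events spread over one minute, three made-up protocols
set.seed(1)
start <- as.POSIXct("2014-12-17 00:00:00", tz = "UTC")
log <- data.frame(
  DateTime = start + runif(300) * 60,
  Protocol = sample(c("TCP", "UDP", "ICMP"), 300, replace = TRUE),
  stringsAsFactors = FALSE
)

# Shared 5-second break points covering the whole minute
breaks <- seq(start, start + 60, by = "5 sec")

# One column of bucket counts per protocol; plot = FALSE skips the drawing
protocols <- sort(unique(log$Protocol))
counts <- sapply(protocols, function(p) {
  hist(log$DateTime[log$Protocol == p], breaks = breaks, plot = FALSE)$counts
})
dim(counts)   # 12 buckets by 3 protocols
```

From here, something like barplot(counts[, "TCP"]) would draw the graph for a single protocol, and a loop over colnames(counts) would draw one per protocol.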
I am trying to build a database in R from multiple csvs. There are NAs spread throughout each csv, and I want to build a master list that summarizes all of the csvs in a single database. Here is some quick code that illustrates my problem (most csvs actually have 1000s of entries, and I would like to automate this process):
d1=data.frame(common=letters[1:5],species=paste(LETTERS[1:5],letters[1:5],sep='.'))
d1$species[1]=NA
d1$common[2]=NA
d2=data.frame(common=letters[1:5],id=1:5)
d2$id[3]=NA
d3=data.frame(species=paste(LETTERS[1:5],letters[1:5],sep='.'),id=1:5)
I have been going around in circles (writing loops), trying to use merge and reshape(melt/cast) without much luck, in an effort to succinctly summarize the information available. This seems very basic but I can't figure out a good way to do it. Thanks in advance.
To be clear, I am aiming for a final database like this:
  common species id
1      a     A.a  1
2      b     B.b  2
3      c     C.c  3
4      d     D.d  4
5      e     E.e  5
I recently had a similar situation. The code below goes through all the variables and, for each one, recovers as much information as possible to add back into the dataset. Once all the data is there, running it one last time on the first variable gives you the result.
#combine all into one dataframe
require(gtools)
d <- smartbind(d1,d2,d3)
#function to get the first non NA result
getfirstnonna <- function(x){
ret <- head(x[!is.na(x)],1)
#head() returns a zero-length vector (not NULL) when every value is NA
if (length(ret) == 0) NA else ret
}
#function to get max info based on one variable
runiteration <- function(dataset,variable){
require(plyr)
e <- ddply(.data=dataset,.variables=variable,.fun=function(x){apply(X=x,MARGIN=2,FUN=getfirstnonna)})
#returns the above without the NA "factor"
return(e[which(!is.na(e[ ,variable])), ])
}
#run through all variables
for(i in 1:length(names(d))){
d <- rbind(d,runiteration(d,names(d)[i]))
}
#repeat first variable since all possible info should be available in dataset
d <- runiteration(d,names(d)[1])
If id, species, etc. differ in separate datasets, then this will return whichever non-NA data is on top. In that case, changing the row order in d, and changing the variable order could affect the result. Changing the getfirstnonna function will alter this behavior (tail would pick last, maybe even get all possibilities). You could order the dataset by the most complete records to the least.
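To make the "first versus last" choice concrete, here is a small standalone sketch of the two selection rules. The helper names first_non_na and last_non_na and the ids vector are hypothetical and not part of the code above.

```r
# Pick the first non-NA value, or NA if there is none
first_non_na <- function(x) {
  keep <- x[!is.na(x)]
  if (length(keep) == 0) NA else keep[1]
}

# Pick the last non-NA value instead (what a tail()-based variant would do)
last_non_na <- function(x) {
  keep <- x[!is.na(x)]
  if (length(keep) == 0) NA else keep[length(keep)]
}

ids <- c(NA, 3, 7)          # conflicting ids for the same record
first_non_na(ids)           # 3
last_non_na(ids)            # 7
first_non_na(c(NA, NA))     # NA
```

When the datasets disagree, which rule is right depends on which source you trust more, so it may be worth sorting the rows by reliability before picking.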