R: data.frame to vector

Let me preface this question by saying that I know very little about R. I'm importing a text file into R using read.table("file.txt", T). The text file is in the general format:
header1 header2
a 1
a 4
b 3
b 2
Each a is an observation from one sample and each b is an observation from a different sample. I want to calculate various statistics for the sets of a and b, which I'm doing with tapply(header2, header1, mean). That works fine.
Now I need to produce qqnorm plots of a and b and add lines with qqline. I can use tapply(header2, header1, qqnorm) to make quantile plots of each, BUT using tapply(header2, header1, qqline) draws both best-fit lines on the last quantile plot. Programmatically that makes sense, but it doesn't help me.
So my question is: how can I convert the data frame to two vectors (one for all the a values and one for all the b values)? Does that make sense? Basically, in the above example, I'd want to end up with two vectors: a = c(1, 4) and b = c(3, 2).
Thanks!

Create a function that does both. You won't (easily, at least) be able to go back to an already-drawn plot to add the line afterwards.
e.g.
with(dd, tapply(header2,header1, function(x) {qqnorm(x); qqline(x)}))
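If you do want the two named vectors the question literally asks for, base R's split() produces them directly. A minimal sketch, again assuming the data frame is called dd:
# split header2 into a list of vectors, one per level of header1
samples <- split(dd$header2, dd$header1)
samples$a # 1 4
samples$b # 3 2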
You could use data.table here for coding elegance (and speed).
You can pass the equivalent of the body of a function, which is evaluated within the scope of the data.table, e.g.
library(data.table)
DT <- data.table(dd)
DT[, {qqnorm(header2)
      qqline(header2)}, by=header1]
You don't really want to pollute your global environment with lots of objects (that would be inefficient).

How to check if a column has numeric or categorical levels in R?

I am trying to plot 9 barplots in a 3x3 matrix in R using base R wrapped inside a for loop. (I am working on a workhorse solution for visualizing every column before I begin manipulating the data.) Below is the code:
library(ISLR);
library(ggplot2);
# load wage data
data(Wage)
par(mfrow=c(3,3))
for(i in 1:(dim(Wage)[2]-2)){
  plot(Wage[,i], main = paste0(names(Wage)[i]), las = 2)
}
Unfortunately this doesn't work properly for the first two columns, because they are numeric and actually need a histogram. I get that I need to fit an if-else condition somewhere inside the for() statement, but that is giving me errors. Below is the output where the first two columns are plotted wrong (age and year are actually numeric, and I may need to put them on the x-axis instead of letting them default to the y-axis).
Can anyone suggest an edit/hack? I also learnt that I can't use par() when I am wrapping ggplot inside for, so I had to use base R; otherwise ggplot would have been great aesthetically.
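A minimal sketch of the if-else branch described above, assuming the same Wage data: is.numeric() decides between a histogram for the numeric columns and the default (bar) plot for the factor columns. The branch is the only change to the original loop.
library(ISLR)
data(Wage)
par(mfrow=c(3,3))
for(i in 1:(dim(Wage)[2]-2)){
  if(is.numeric(Wage[,i])){
    hist(Wage[,i], main = names(Wage)[i], xlab = names(Wage)[i]) # numeric: histogram
  } else {
    plot(Wage[,i], main = names(Wage)[i], las = 2)               # factor: barplot
  }
}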

R: conditional expand.grid function

I would like to find all combinations of vector elements that match a specific condition. The function expand.grid returns all possible combinations without checking for a specific condition. It is possible to test for the condition after using expand.grid, but in some situations the number of possible combinations is too large to generate with expand.grid. Therefore, is there a function that allows me to check for a condition while generating all possible combinations?
This is a simplified version of the problem:
A <- seq.int(12, from=0, by=1)*15
B <- seq.int(27, from=0, by=1)*23
C <- seq.int(18, from=0, by=1)*18
D <- seq.int(33, from=0, by=1)*10
out<-expand.grid(A,B,C,D) #out is a dataframe with 235144 x 4 as dimensions
idx<-which(rowSums(out)<=400 & rowSums(out)>=300) #Only a small fraction of 'out' is needed
results <- out[idx,]
In a word, no. After all, if you knew a priori which combinations were desirable/undesirable, you could exclude them from the expansion, e.g. expand.grid(A[A<20],B[B<15],...) . In the general case, which I'm assuming is your real question, you have no simple way to exclude portions of the input vectors.
You might just want to write a multilevel loop which tests each combination in turn and saves or rejects it, as in the sketch below. This will be slow (again, unless you come up with some clever algorithm to predict regions which are all TRUE or FALSE). So, in the long run, you may be better off using some of the R packages which partition large calculations (and datasets) so as to avoid exceeding your memory limits.
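A minimal sketch of that multilevel-loop idea, using the vectors A-D from the question: each combination is tested as it is generated, so the full 235,144-row grid never has to exist in memory at once. (Growing a list like this is slow, as warned above; it is an illustration, not an optimized implementation.)
keep <- list()
for (a in A) for (b in B) for (c in C) for (d in D) {
  s <- a + b + c + d
  if (s >= 300 && s <= 400) keep[[length(keep)+1]] <- c(a, b, c, d)
}
results <- do.call(rbind, keep) # one matrix row per accepted combination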
Now that I've said all that, someone's going to post a link to a package which does exactly that :-(

Plotting Multiple Graphs using R

I currently have a dataset which has a format of: (x, y, type)
I've used the code that is found on the example of plotting with Postgres through R.
My question is: How would I get R to generate multiple graphs for each unique "type" column?
I'm new to R, so my apologies if this is something that is extremely easy and I just lack an understanding of loops in R.
So lets say we have this data:
(1,1,T), (1,2,T), (1,3,T), (1,4,T), (1,5,T), (1,6,T),
(1,1,A), (1,2,B), (1,3,B), (1,4,B), (1,5,A), (1,6,A),
(1,1,B), (1,2,B), (1,3,C), (1,4,C), (1,5,C), (1,6,C),
It would plot 4 individual graphs on the page, one for each of the types T, A, B, and C [plotting x, y].
How would I do that with R when the data coming in may look like the data above?
While the other post has some good info, there's a faster way to do all that. So assuming your data frame or matrix is called DF and is in the form above (where each (1,2,B) or whatever is a row), then:
by(DF, DF[,3], function(x) plot(x[,1], x[,2], main=unique(x[,3])))
And that's it.
If you'd like all four plots to be on the same page, you can first change the graphics parameter option:
par(mfrow=c(2,2))
And set it back to the default, par(mfrow=c(1,1)), when you're done.
I'm quite fond of the ggplot2 package, which does the same thing that user1717913 suggests, but with slightly different syntax (it does a lot of other things very nicely, which is why I like it.)
test <- data.frame(x=rep(1,18),y=rep(1:6,3),type=c("T","T","T","T","T","T","A","B","B","B","A","A","B","B","C","C","C","C"))
require(ggplot2)
ggplot(test, aes(x=x, y=y)) + #define the data that the plot will use, and which variables go where
geom_point() + #plot it with points
facet_wrap(~type) #facet it by the type variable
R is really cool in that there's a bazillion (that's a technical term) different ways to do most things. The way I would do it is to split the data along the groups, and then plot by group.
To do that, the split command is what you want (I'll assume your data is in an object called data):
data.splitted <- split(data, data$type)
Now the data will have this form (let's assume you have 3 types, A, B, and C):
data.splitted
$A
  x y type
  1 4    A
  3 6    A

$B
  x y type
  3 3    B
  2 1    B

$C
  x y type
  4 5    C
  5 2    C
and so on. You would reference the "4" in the y column of group A like so:
data.splitted$A$y[1] or data.splitted[[1]][[2]][1]. Hopefully seeing them both together makes enough sense.
Now that we have the data split, we're getting closer.
We still need to tell R that we want to plot a bunch of graphs to the same window. Now, this is just one way to go about it. You could also tell it to write each graph to an image file or a PDF, or whatever you want, as sketched below.
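A hedged sketch of that PDF alternative (the file name plots.pdf is just a placeholder): open a pdf() device before plotting, and close it with dev.off() when done.
pdf("plots.pdf")
for (g in names(data.splitted)) {
  plot(data.splitted[[g]]$x, data.splitted[[g]]$y, main = g) # one page per group
}
dev.off()
Back to the same-window approach: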
groups <- names(data.splitted) puts your different types into a variable for reference later.
par(mfcol=c(length(groups),1))
Using mfcol fills the graphs in vertically; the mfrow option fills them in horizontally. The c() just combines its inputs, and length(groups) returns the total number of groups.
Now we can work on the for-loop.
for(i in 1:length(data.splitted)){   # This tells it what i is iterating from and to.
                                     # It can start and stop wherever, or be a
                                     # sequence, ascending or descending;
                                     # the sky is the limit.
  tempx <- data.splitted[[i]][["x"]] # Note the quotes: "x" and "y" are column names.
  tempy <- data.splitted[[i]][["y"]] # This just saves us a bunch of typing.
  plot(tempx, tempy, main=groups[i]) # Plot it and make the title the type.
  rm(tempx, tempy)                   # Remove our temporary variables for the next run through.
}
So you see, it's not too bad when you break it down into its components. You can do pretty much anything this way. I have a project I'm working on right now, where I'm doing this for 18 lidar metrics that I calculated using another for loop.
Commands to read up on:
split, plot, data.frame, "[", par(mfrow=___) and par(mfcol=___)
Here are a few helpful links to get you started. The most helpful one of all is built right into R, though: a ? followed by a command will bring up the HTML help for that command in your browser.
Good luck!

perform function on pairs of columns

I am trying to run some Monte Carlo simulations on animal position data. So far, I have sampled 100 X and Y coordinates, 100 times. This results in a list of 200. I then convert this list into a data frame that is more conducive to the eventual functions I want to run on each sample (kernel.area).
Now I have a data frame with 200 columns, and I would like to perform the kernel.area function using each successive pair of columns.
I can't reproduce my own data here very well, so I've tried to give a basic example just to show the structure of the data frame I'm working with. I've included the for loop I've tried so far, but I am still an R novice and would appreciate any suggestions.
# generate dataframe representing X and Y positions
df <- data.frame(x=seq(1:200),y=seq(1:200))
# 100 replications of sampling 100 "positions"
resamp <- replicate(100,df[sample(nrow(df),100),])
# convert to data frame (kernel.area needs an xy dataframe)
df2 <- do.call("rbind", resamp[1:2,])
# xy positions need to be in columns for kernel.area
df3 <- t(df2)
#edit: kernel.area requires you have an id field, but I am only dealing with one individual, so I'll construct a fake one of the same length as the positions
id=replicate(100,c("id"))
id=data.frame(id)
Here is the structure of the for loop I've tried (edited since first post):
for (j in seq(1,ncol(df3)-1,2)) {
kud <- kernel.area(df3[,j:(j+1)],id=id,kern="bivnorm",unin=c("m"),unout=c("km2"))
print(kud)
}
My end goal is to calculate kernel.area for each resampling event (i.e. rows 1:100 for every pair of columns up to 200), and to be able to combine the results in a data frame. However, after running the loop, I get this error message:
Error in df[, 1] : incorrect number of dimensions
Edit: I realised my id format was not the same as my data frame, so I changed it and now have the error:
Error in kernelUD(xy, id, h, grid, same4all, hlim, kern, extent) :
id should have the same length as xy
First, a disclaimer: I have never worked with the package adehabitat, which has a function kernel.area, which I assume you are using. Perhaps you could confirm which package contains the function in question.
I think there are a couple suggestions I can make that are independent of knowledge of the specific package, though.
The first lies in the creation of df3. This should probably be df3 <- t(df2), but that is most likely correct in your actual code and just a typo in your post.
The second suggestion has to do with the way you subset df3 in the loop. j:j+1 is just a single number, since : has higher precedence than + (see ?Syntax for the order in which mathematical operations are evaluated in R). To get the desired two columns, use j:(j+1) instead.
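A quick demonstration of that precedence point at the console:
j <- 1
j:j+1   # evaluates as (j:j)+1, i.e. the single number 2
j:(j+1) # the intended pair: 1 2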
EDIT:
When loading adehabitat, I was warned to "Be careful" and to use the related new packages, among which is adehabitatHR, which also contains a kernel.area function. This function has slightly different syntax and behavior, but perhaps it would be worth examining. Using adehabitatHR (I had to install from source since the package is not available for R 2.15.0), I was able to do the following.
library(adehabitatHR)
for (j in seq(1, ncol(df3)-1, 2)) {
  kud <- kernelUD(SpatialPoints(df3[,j:(j+1)]), kern="bivnorm")
  kernAr <- kernel.area(kud, unin=c("m"), unout=c("km2"))
  print(kernAr)
}
detach(package:adehabitatHR, unload=TRUE)
This prints the computed areas; note that kernelUD() is called first and its result is passed to kernel.area().
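To get at the poster's stated end goal of combining the results into one data frame, one possibility (a sketch only; I have not verified the exact shape of what kernel.area returns for this data) is to accumulate each result in a list and bind at the end:
areas <- list()
for (j in seq(1, ncol(df3)-1, 2)) {
  kud <- kernelUD(SpatialPoints(df3[,j:(j+1)]), kern="bivnorm")
  areas[[length(areas)+1]] <- kernel.area(kud, unin=c("m"), unout=c("km2"))
}
area.df <- do.call(rbind, areas) # one block of rows per resampling event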

How to compute descriptive statistics on a set of differently sized vectors

In a problem, I have a set of vectors. Each vector holds sensor readings, but the vectors are of different lengths. I'd like to compute the same descriptive statistics on each of these vectors. My question is: how should I store them in R? Using c() concatenates the vectors. Using list() seems to cause functions like mean() to misbehave. Is a data frame the right object?
What is the best practice for applying the same function to vectors of different sizes? Supposing the data resides in a SQL server, how should it be imported?
Vectors of different sizes should be combined into a list: a data.frame expects each column to be the same length.
Use lapply to fetch your data. Then use lapply again to get the descriptive statistics.
x <- lapply(ids, sqlfunction)
stats <- lapply(x, summary)
Where sqlfunction is some function you created to query your database. You can collapse the stats list into a data.frame by calling do.call(rbind, stats) or by using plyr:
library(plyr)
x <- llply(ids, sqlfunction)
stats <- ldply(x, summary)
Most plotting and regression functions expect data to be in a "long" format: numeric values in one column and grouping or covariate values in others. The stack function will accept irregular length lists, and tapply or aggregate will allow functions to work over irregular length category variables:
dlist <- list(a=1:2, b=13:15, cc= 5:1)
s.dfrm <- stack(dlist)
s.dfrm
values ind
1 1 a
2 2 a
3 13 b
4 14 b
5 15 b
6 5 cc
7 4 cc
8 3 cc
9 2 cc
10 1 cc
tapply(s.dfrm$values, s.dfrm$ind, mean)
a b cc
1.5 14.0 3.0
"What is the best practice for applying the same function to vectors if different sizes? Supposing the data resides in a SQL server, how should it be imported?"
As suggested by Shane, lapply is your definite choice here. You can, of course, use it with custom functions as well, in case you feel summary does not provide enough information.
For the SQL part: there are packages around for most relational DBMSs: RPostgreSQL, RMySQL, ROracle, and RODBC as a general one. If you're on MS SQL Server, I am not sure there is a specific package, but RODBC should do the job. I don't know if you are married to MS SQL Server, but if running your own local database for R is an option, RMySQL is really easy to set up.
In general, these database packages give you wrappers like dbListTables, or dbReadTable, which simply turns a table into an R data.frame.
If you really want to import the data, you could use .csv exports of your database and use read.table or read.csv, depending on what fits your needs. But I suggest connecting to the database directly; it's not that difficult even if you haven't done it before, and it's more fun.
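A hedged sketch of that direct-connection workflow using DBI with RMySQL; the database name sensors and table name readings are placeholders, not real objects:
library(DBI)
library(RMySQL)
con <- dbConnect(MySQL(), dbname = "sensors", host = "localhost")
dbListTables(con)                 # see which tables are available
x <- dbReadTable(con, "readings") # one table as an R data.frame
dbDisconnect(con)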
EDIT: I don't use MS SQL Server myself, but others have done it before; maybe the mailing list post helps.
I would tend to import this into a data frame and not a list. Each of your individual vectors is likely differentiated by one or more meaningful variables. Let's say you wanted to keep track of the time the data was collected and the location it was collected from. In a data frame you would have one column with all of the vectors concatenated together, but each would be differentiated by values in the time and location columns. To get each individual vector's mean, tapply() might be the tool of choice.
tapply(df$y, list(df$time, df$location), mean)
Or, perhaps aggregate() would be even better, depending on the number of variables and your future needs; a small illustration follows.
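A small illustration of the aggregate() alternative, assuming the same hypothetical df with columns y, time, and location: the formula interface returns one mean per time/location combination, in long format.
aggregate(y ~ time + location, data = df, FUN = mean)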
