How to compute descriptive statistics on a set of differently sized vectors - r

In a problem, I have a set of vectors. Each vector contains sensor readings, but the vectors are of different lengths. I'd like to compute the same descriptive statistics on each of these vectors. My question is: how should I store them in R? Using c() concatenates the vectors. Using list() seems to cause functions like mean() to misbehave. Is a data frame the right object?
What is the best practice for applying the same function to vectors of different sizes? Supposing the data resides in a SQL server, how should it be imported?

Vectors of different sizes should be combined into a list: a data.frame expects each column to be the same length.
Use lapply to fetch your data. Then use lapply again to get the descriptive statistics.
x <- lapply(ids, sqlfunction)   # one vector of readings per id
stats <- lapply(x, summary)     # descriptive statistics for each vector
Where sqlfunction is some function you created to query your database. You can collapse the stats list into a data.frame by calling do.call(rbind, stats) or by using plyr:
library(plyr)
x <- llply(ids, sqlfunction)
stats <- ldply(x, summary)
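For concreteness, sqlfunction might look something like this minimal sketch using DBI; the connection con and the readings table with id and value columns are hypothetical placeholders for your own schema:
library(DBI)
# Fetch the vector of readings for one sensor id (assumed table layout)
sqlfunction <- function(id) {
  dbGetQuery(con, paste0("SELECT value FROM readings WHERE id = ", id))$value
}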

Most plotting and regression functions expect data to be in a "long" format: numeric values in one column and grouping or covariate values in others. The stack function will accept irregular length lists, and tapply or aggregate will allow functions to work over irregular length category variables:
dlist <- list(a=1:2, b=13:15, cc= 5:1)
s.dfrm <- stack(dlist)
s.dfrm
   values ind
1       1   a
2       2   a
3      13   b
4      14   b
5      15   b
6       5  cc
7       4  cc
8       3  cc
9       2  cc
10      1  cc
tapply(s.dfrm$values, s.dfrm$ind, mean)
   a    b   cc
 1.5 14.0  3.0
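aggregate works on the same long-format data and returns the group means as a data.frame rather than a named vector:
aggregate(values ~ ind, data = s.dfrm, FUN = mean)
  ind values
1   a    1.5
2   b   14.0
3  cc    3.0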

"What is the best practice for applying the same function to vectors if different sizes? Supposing the data resides in a SQL server, how should it be imported?"
As suggested by Shane, lapply is definitely your choice here. You can, of course, use it with custom functions as well, in case you feel summary does not provide enough information.
For the SQL part: there are packages around for most relational DBMSs: RPostgreSQL, RMySQL, ROracle, and there's RODBC as a general one. If you mean MS SQL Server, I am not sure whether there is a specific package, but RODBC should do the job. I don't know if you are married to MS SQL Server, but if running your own local database for R is an option, RMySQL is really easy to set up.
In general, the database packages give you wrappers like dbListTables or dbReadTable, which simply turn a table into an R data.frame.
If you really want to import the data as files, you could use .csv exports of your database and read.table or read.csv, depending on what fits your needs. But I suggest connecting directly to the database: it's not that difficult even if you haven't done it before, and it's more fun.
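A minimal RODBC sketch, assuming an ODBC data source named "mydsn" and a table sensor_readings (both hypothetical):
library(RODBC)
ch <- odbcConnect("mydsn")                           # open the connection
dat <- sqlQuery(ch, "SELECT * FROM sensor_readings") # returns a data.frame
odbcClose(ch)                                        # close it again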
EDIT: I don't use MS SQL Server myself, but others have done it before; maybe the mailing list post helps.

I would tend to import this into a data frame and not a list. Each of your individual vectors is likely differentiated by one or more meaningful variables. Let's say you wanted to keep track of the time the data was collected and the location it was collected from. In a data frame you would have one column holding all of the vectors concatenated together, but each would be differentiated by the values in the time and location columns. To get each individual vector's mean, tapply() might be the tool of choice.
tapply(df$y, list(df$time, df$location), mean)
Or, perhaps aggregate() would be even better, depending on the number of variables and your future needs.
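For example, with the same hypothetical df, the formula interface of aggregate returns the group means as a data.frame:
aggregate(y ~ time + location, data = df, FUN = mean)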

Related

Is there a reason why common R data types/containers are indexed so differently?

Common container types used in R are data.tables, data.frames, matrices and lists (probably more?!).
All these storage types have slightly different rules for indexing.
Let's say we have a simple dataset with named columns:
name1 name2
1 11
2 12
... ...
10 20
We now put this data in every container accordingly. If I want to index the number 5, which is in the name1 column, it goes as follows:
lists: dataset[['name1']][5]
-> why the double brackets?!?!
data frames: dataset$name1[5] or dataset[5,'name1']
-> here are two options possible, why the ambiguity?!?
data table: dataset$name1[5]
-> why is there only one possibility here?!
I often stumble upon this problem, and coming from Python this is something very odd. It furthermore leads to extremely tedious debugging. In Python this is solved in a very uniform way, where indexing is pretty much standard across lists, numpy arrays, pandas data frames, etc.
A data.frame is a list whose elements all have equal length. We use $ or [[ to extract individual list elements; otherwise the result would still be a list with one element.
You reference the data.frame example in R and then go on to say you are used to pandas, but these have direct, standard equivalents in pandas for the exact same purpose, so I am unsure where the confusion comes from:
dataset$name1[5] -> dataset['name1'][5] or dataset.name1[5]
dataset[5, 'name1'] -> dataset.loc[5, 'name1']
Using the definitions in the Note at the end, these all work and give the same answer:
L[["name1"]][5]
DF[["name1"]][5]
DT[["name1"]][5]
L$name1[5]
DF$name1[5]
DT$name1[5]
It seems not unreasonable that a data frame, which is conceptually a 2d object, can take two subscripts, whereas a list, which is one-dimensional, takes one.
[[ and [ have different meanings, so I am not sure consistency plays a role here.
Note
L <- list(name1 = 1:10, name2 = 11:20)
DF <- as.data.frame(L)
library(data.table)
DT <- as.data.table(DF)

R: Avoid repeating lines of code using R subsets in scripts

I'm very new to R - but have been developing SAS-programs (and VBA) for some years. Well, the thing is that I have 4 lines of R-code (scripts?) that I would like to repeat 44 times. Two times for each of 22 different train stations, indicating whether the train is in- or out-going. The four lines of code are:
dataGL_FLIin <- subset( dataGL_all, select = c(Tidsinterval, Dag, M.ned, Ugenr.,Kode, Ugedag, FLIin))
names(dataGL_FLIin)[names(dataGL_FLIin)=='FLIin'] <- 'GL_Antal'
dataGL_FLIin$DIR<-"IN"
dataGL_FLIin$STATION<-"FLI"
To avoid repeating the 4 lines 44 times I need 2 "macro variables" (yes, I'm aware that this is a SAS thing only, sorry). One "macro variable" indicating the train station and one indicating the direction. In the example above the train station is FLI and the direction is in. Below, the same 4 lines are demonstrated for the train station FBE, this time in the out-going direction.
dataGL_FBEout <- subset( dataGL_all, select = c(Tidsinterval, Dag, M.ned, Ugenr.,Kode, Ugedag, FBEout))
names(dataGL_FBEout)[names(dataGL_FBEout)=='FBEout'] <- 'GL_Antal'
dataGL_FBEout$DIR<-"OUT"
dataGL_FBEout$STATION<-"FBE"
I have looked in many places and tried many combinations of R functions and R lists, but I can't make it work. Quite possibly I'm getting it all wrong. I apologize in advance if the question is (too) stupid, but I will be very grateful for any help on the matter.
Please notice that, in the end, I want 44 different data frames created:
1) dataGL_FLIin
2) dataGL_FBEout
3) Etc. ...
ADDED: 2-STATION, 2-DIRECTION EXAMPLE OF MY PROBLEM
# The one data frame I have
Date<-c("01-01-15 04:00","01-01-15 04:20","01-01-15 04:40")
FLIin<-c(96,39,72)
FLIout<-c(173,147,103)
FBEin<-c(96,116,166)
FBEout<-c(32,53,120)
dataGL_all<-data.frame(Date, FLIin, FLIout, FBEin, FBEout)
# The four data frames I would like
GL_antal<-c(96,39,72)
Station<-("FLI")
Dir<-("IN")
dataGL_FLIin<-data.frame(Date, Station, Dir, GL_antal)
GL_antal<-c(173,147,103)
Station<-("FLI")
Dir<-("OUT")
dataGL_FLIout<-data.frame(Date, Station, Dir, GL_antal)
GL_antal<-c(96,116,166)
Station<-("FBE")
Dir<-("IN")
dataGL_FBEin<-data.frame(Date, Station, Dir, GL_antal)
GL_antal<-c(32,53,120)
Station<-("FBE")
Dir<-("OUT")
dataGL_FBEout<-data.frame(Date, Station, Dir, GL_antal)
Thanks,
lars
With your example it is now clearer what you want, and I'll give it a second try. I use dataGL_all as defined in your question and then define
stations <- rep(c("FLI","FBE"),each=2)
directions <- rep(c("in","out"),times=length(stations)/2)
You could also extract the stations and directions from your data frame. Using your example, the following would work
stations <- substr(names(dataGL_all)[-1],1,3)
directions <- substr(names(dataGL_all)[-1],4,6)
Then, I define the function that will work on the data:
dataGLfun <- function(station, direction) {
  name <- paste0(station, direction)
  dataGL <- dataGL_all[, c("Date", name)]
  names(dataGL)[names(dataGL) == name] <- 'GL_Antal'
  dataGL$DIR <- direction
  dataGL$STATION <- station
  dataGL
}
And now I apply this function to all stations with both directions:
dataGL <- mapply(dataGLfun,stations,directions,SIMPLIFY=FALSE)
names(dataGL) <- paste0(stations,directions)
Now, you can get the data frames for each combination of station and direction. For instance, the two examples in your question, you get with dataGL$FLIin and dataGL$FBEout. The reason that there is a $ instead of a _ is that I did not actually create a separate variable for each data frame. Instead, I created a list, where each element of the list is one of the data frames. This has the advantage that it will be easier to do something to all the data frames later. With your solution, you would have to type all the various variable names, but if the data frames are in a list, you can work with them using functions like lapply.
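For example, a single call now summarises every station/direction combination at once:
lapply(dataGL, summary)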
If you prefer to have many different variables, you could do the following
for (i in seq_along(stations)) {
  assign(paste0("dataGL_", stations[i], directions[i]),
         dataGLfun(stations[i], directions[i]))
}
However, in my opinion, this is not how you should solve this problem in R.

R: data.frame to vector

Let me preface this question by saying that I know very little about R. I'm importing a text file into R using read.table("file.txt", T). The text file is in the general format:
header1 header2
a 1
a 4
b 3
b 2
Each a is an observation from a sample and similarly each b is an observation from a different sample. I want to calculate various statistics of the sets of a and b which I'm doing with tapply(header2, header1, mean). That works fine.
Now I need to do some qqnorm plots of a and b and draw them with qqline. I can use tapply(header2, header1, qqnorm) to make quantile plots of each, BUT using tapply(header2, header1, qqline) draws both best-fit lines on the last quantile plot. Programmatically that makes sense, but it doesn't help me.
So my question is, how can I convert the data frame to two vectors (one for all a and one for all b)? Does that make sense? Basically, in the above example, I'd want to end up with two vectors: a=(1,4) and b=(3,2).
Thanks!
Create a function that does both. You won't be able (easily at least) to revert to an old graphics device.
e.g.
with(dd, tapply(header2,header1, function(x) {qqnorm(x); qqline(x)}))
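If you also want the two vectors themselves, split does exactly that conversion (assuming the data frame is called dd, as above):
vecs <- split(dd$header2, dd$header1)
vecs$a  # 1 4
vecs$b  # 3 2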
You could use data.table here for coding elegance (and speed).
You can pass the equivalent of the body of a function that is evaluated within the scope of the data.table, e.g.
library(data.table)
DT <- data.table(dd)
DT[, {qqnorm(header2)
      qqline(header2)}, by=header1]
You don't really want to pollute your global environment with lots of objects (that would be inefficient).

Performing statistics on multiple columns of data

I'm trying to conduct certain statistics such as t-tests on a table of data containing hundreds to thousands of columns. The data is formatted in a way that the two groups of values I'm comparing are in the same column.
So, basically my first attempt was to cut and paste like the following:
NN <-read.delim("E:/output.txt")
View(NN)
attach(NN)
#output p-values of 100 t-tests
sink(file="E:/ttest.txt", append=TRUE, split=FALSE)
t.test(Tree1[1:13],Tree1[14:34])$p.value
t.test(Tree2[1:13],Tree2[14:34])$p.value
t.test(Tree3[1:13],Tree3[14:34])$p.value
....
...
..
.
As my data grows, this is becoming more and more impractical. Is there a way to loop these t-tests through each column sequentially and save the output to a file?
Thanks in advance.
lapply will get you there I think with an anonymous function:
> test <- data.frame(a=1:100,b=101:200)
> lapply(test,function(x) t.test(x[1:50],x[51:100])$p.value)
$a
[1] 2.876776e-31
$b
[1] 2.876776e-31
I should do my part for good practice and also note that running 100 t-tests in a single go is fraught with the potential for type-1 errors and other badness.
Extracting the p-value in isolation is also probably a really bad move.
Not sure if this is a wise approach or if it even works correctly, but try mapply with the indexed parts, as in:
test <- data.frame(a=1:100,b=101:200)
testa <- test[1:50, ]
testb <- test[51:100, ]
t.test2 <- function(x, y) t.test(x, y)[["p.value"]]
mapply(t.test2, testa, testb)
EDIT: I used thelatemail's data so it's comparable. His warning is right on.
Thanks for all the input. Just a few clarifications; while I AM running hundreds of t-tests at once, they are comparing independent sets of data each time. So for example, the values in column 1 (Tree1), rows 1:50, would only be compared once to rows 51:100 in the same column, and never used again. The same for column 2 (Tree2), and so on. Would type-1 error still be a problem? The way I see it, I'm basically doing t-tests on separate data sets one at a time.
That being said, I've come up with a way to do this with a for-loop, and the results correspond to those when t-testing each column individually.
for (i in 1:100) {
  print(t.test(mydata[1:50, i], mydata[51:100, i])$p.value)
}
The only problem being that my output always has a [1] in front of it.
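The [1] is just the index print() prepends to a vector; collecting the p-values into a vector first avoids it and makes them easy to write out. A sketch, assuming mydata as above:
pvals <- sapply(seq_len(ncol(mydata)), function(i) {
  t.test(mydata[1:50, i], mydata[51:100, i])$p.value
})
# one row per column of mydata, written out for later use
write.csv(data.frame(column = names(mydata), p.value = pvals),
          "E:/ttest.csv", row.names = FALSE)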

Simple Function to normalize related objects

I'm quite new to R and I'm trying to write a function that normalizes my data in different dataframes.
The normalization process is quite easy: I just divide the numbers I want to normalize by the population size for each object (which is stored in the table population).
To know which object relates to which, I tried to use IDs that are stored in each dataframe in the first column.
I thought to do so because some objects that are in the population dataframe have no corresponding objects in the dataframes to be normalized; that is to say, the dataframes sometimes have fewer objects.
Normally one would build up a relational database (which I tried), but it didn't work out for me that way. So I tried to relate the objects within the function, but the function didn't work. Maybe someone of you has experience with this and can help me.
So my attempt to write this function was:
# Load Tables
# Agriculture, Annual Crops
table.annual.crops <-read.table ("C:\\Users\\etc", header=T,sep=";")
# Agriculture, Bianual and Perrenial Crops
table.bianual.crops <-read.table ("C:\\Users\\etc", header=T,sep=";")
# Fishery
table.fishery <-read.table ("C:\\Users\\etc", header=T,sep=";")
# Population per Municipality
table.population <-read.table ("C:\\Users\\etc", header=T,sep=";")
# attach data
attach(table.annual.crops)
attach(table.bianual.crops)
attach(table.fishery)
attach(table.population)
# Create a function to normalize data
# Objects should be related by their ID in the first column
# Values to be normalized and the population appear in the second column
funktion.norm.percapita<-function (x,y){if(x[,1]==y[,1]){x[,2]/y[,2]}else{return("0")}}
# execute the function
funktion.norm.percapita(table.annual.crops,table.population)
Let's start with the attach steps... why? It's usually unnecessary and can get you into trouble! Especially since both your population data.frame and your crops data.frame have Geocode as a column!
As suggested in the comments, you can use merge. This will by default combine data.frames using columns of the same name. You can specify which columns to merge on with the by parameters.
dat <- merge(table.annual.crops, table.population)
dat$crop.norm <- dat$CropValue / dat$Population
The reason your function isn't working? Look at the result of your if statement:
table.annual.crops[,1] == table.population[,1]
It gives a vector of booleans, recycling the shorter vector, rather than the single TRUE/FALSE that if expects. If your data is quite large (on the order of millions of rows), the merge function can be slow. If this is the case, take a look at the data.table package and use its merge function instead.
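A quick sketch of the data.table route, reusing the table names above and assuming Geocode is the shared ID column (the CropValue and Population columns are hypothetical, as before):
library(data.table)
crops <- as.data.table(table.annual.crops)
pop <- as.data.table(table.population)
# merge.data.table joins the two tables on the shared key column
dat <- merge(crops, pop, by = "Geocode")
dat$crop.norm <- dat$CropValue / dat$Population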
