How to use a list of lists of data frames - r

I'm not sure whether this is possible, or how to find a good solution to the following R problem.
Data / Background / Structure:
I've collected a large dataset of project-based cooperation data, which maps specific projects to the participating companies (this can be understood as a bipartite edge list for social network analysis). For analytical reasons it is advisable to split the whole dataset into subsets for different locations and time periods. Therefore, I've created the following data structure:
sna.location.list
[[1]] (location1)
[[1]] (is a dataframe containing the bip. edge-list for time-period1)
[[2]] (is a dataframe containing the bip. edge-list for time-period2)
...
[[20]] (is a dataframe containing the bip. edge-list for time-period20)
[[2]] (location2)
... (same as 1)
...
[[32]] (location32)
...
Every data frame contains a project id and the corresponding company ids.
My goal is now to transform the bipartite edge lists to one-mode networks and then do some further SNA-related calculations (degree, centralization, status, community detection etc.) and save the results.
I know how to do these calculation steps with one(!) specific network, but I'm having a really hard time automating this process for all of the networks in the described list structure at once, and saving the various outputs (node-level and network-level variables) in a similar structure.
I've already tried several for-loop and apply approaches, but it still gives me sleepless nights, and right now I feel quite helpless. Any help or suggestions would be highly appreciated. If you need more information or examples so you can give me a brief demo or code example of how to tackle such a nested structure and do such SNA-related calculations/modifications for all of the aforementioned subsets in an efficient, automatic way, please feel free to ask.

Let's say you have a function foo that you want to apply to each data frame. Those data frames are in lists, so lapply(that_list, foo) is what we want. But you've got a bunch of lists, so we actually want to lapply that first lapply across the outer list, hence lapply(that_list, lapply, foo). (The foo will be passed along to the inner lapply via .... If you wish to be more explicit, you can use an anonymous function instead: lapply(that_list, function(x) lapply(x, foo)).)
You haven't given a reproducible example, so I'll demonstrate by applying the nrow function to a list of lists of built-in data frames:
d <- list(
  list(mtcars, iris),
  list(airquality, faithful)
)
result <- lapply(d, lapply, nrow)
result
# [[1]]
# [[1]][[1]]
# [1] 32
#
# [[1]][[2]]
# [1] 150
#
#
# [[2]]
# [[2]][[1]]
# [1] 153
#
# [[2]][[2]]
# [1] 272
As you can see, the output is a list with the same structure. If you need the names, you can switch to sapply with simplify = FALSE.
This covers applying functions to a nested list and saving the returns in a similar data structure. If you need help with calculation efficiency, parallelization, etc., I'd suggest asking a separate question focused on that, with a reproducible example.
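To connect this back to your data, here is a minimal sketch. It assumes the igraph package and that each data frame has columns named project_id and company_id (adjust to your real column names); the two measures are placeholders for whatever SNA calculations you need.
library(igraph)

# Hypothetical helper: bipartite edge list -> one-mode company network,
# returning one node-level and one network-level measure.
analyze_network <- function(edges) {
  g <- graph_from_data_frame(edges, directed = FALSE)
  # TRUE marks company vertices, FALSE marks project vertices:
  V(g)$type <- V(g)$name %in% as.character(edges$company_id)
  proj <- bipartite_projection(g, which = "true")  # company-by-company network
  list(
    degree         = degree(proj),                      # node-level
    centralization = centr_degree(proj)$centralization  # network-level
  )
}

# Same nested structure in, same nested structure out:
results <- lapply(sna.location.list, lapply, analyze_network)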

Related

Gather list after partitioning with split in R

A simplified example is:
data <- data.frame(rnorm(10), rbinom(10, 1, prob = .7))
sdata <- split(data[, 1], data[, 2])
ret <- lapply(sdata, mean)
I create a data column and a factor column. I come from SQL, where I'd solve any such task with a grouping pattern.
The result of lapply(sdata, mean) is:
class(ret)
[1] "list"
str(ret)
List of 2
$ 0: num -0.146
$ 1: num -0.0572
How can I turn the list back into a data frame?
Are there limitations to the factor type? The factor levels become the "names" of the list elements, and I'm afraid of losing data/precision when converting from the names back to actual data in a data frame.
Is there a better way to process partitioned/grouped data than split/lapply?
PS Feel free to correct the question's wording. I have too little experience with R to word it professionally.
@MLavoie: ret <- data.frame(lapply(sdata, mean)) gives me:
> dim(ret)
[1] 1 2
I expect 2x1.
@DavidArenburg: the function applied via lapply to the result of split receives not just a single column but all of them, and not just a single row but all rows within the group. This approach may lead to performance degradation but allows arbitrary processing logic.
aggregate and data.table work on individual columns within each group, if I understand properly.
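For the first question, a small sketch using the ret from above. The group labels ride along as a column, and as.numeric() recovers the original 0/1 values from the names, so nothing is lost for integer-valued groups like these:
# stack() turns a named list into a two-column data frame
# (values plus the group label as a factor):
stack(ret)
#    values ind
# 1 -0.1460   0
# 2 -0.0572   1

# or keep the labels explicitly as a column:
data.frame(group = as.numeric(names(ret)), mean = unlist(ret))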

Creating multiple matrices with a "for" loop

I am currently in a statistics class working on multivariate clustering and classification. For our homework we are trying to use 10-fold cross-validation to test how accurate different classification methods are on a 6-variable data set with three classes. I was hoping I could get some help creating a for loop (or something better that I don't know about) to create and run the 10 classifications and validations so I don't have to repeat myself 10 times on everything. Here is what I have. It will run, but the first two matrices only show the first variable. Because of this, I have not been able to troubleshoot the other parts.
index <- sample(1:10, 90, rep = TRUE)
table(index)
training <- NULL
leave <- NULL
Trfootball <- NULL
football.pred <- NULL
for (i in 1:10) {
  training[i] <- football[index != i, ]
  leave[i] <- football[index == i, ]
  Trfootball[i] <- rpart(V1 ~ ., data = training[i], method = "class")
  football.pred[i] <- predict(Trfootball[i], leave[i], type = "class")
  table(Actual = leave[i]$"V1", Classified = football.pred[i])
}
Removing the "[i]" parts and writing the ten iterations out individually works right now....
Your problem lies in the assignment of a data.frame or matrix to a vector that you initially set as NULL (training and leave). A way to think about it: you are trying to squeeze a whole matrix into an element that can only hold a single number. That's why R has a problem with your code. You need to initialise training and leave as something that can handle your iterative accumulation of values (the R list object, as @akrun points out).
The following example should give you a feel for what is happening and what you can do to fix your problem:
a <- NULL  # your setup at the moment
print(a)   # NULL, as expected

# your football data is either a data.frame or a matrix;
# try assigning those objects to the first element of a:
a[1] <- data.frame(1:10, 11:20)  # no good
a[1] <- matrix(1:10, nrow = 2)   # no good either
print(a)

## create "a" up front, instead of as an empty object
# what you need:
a <- vector(mode = "list", length = 10)
print(a)  # empty list with 10 locations

## to assign and extract elements of a list, use the "[[" double brackets
a[[1]] <- data.frame(1:10, 11:20)

# access the data.frame in "a"
a[1]    # no good
a[[1]]  # what you need to extract the first element of the list

## how does it look when you add an extra element?
a[[2]] <- matrix(1:10, nrow = 2)
print(a)
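Putting this together, a sketch of how the original loop might look with pre-allocated lists (assuming the rpart package is loaded and football and index exist as in the question):
library(rpart)

training      <- vector(mode = "list", length = 10)
leave         <- vector(mode = "list", length = 10)
Trfootball    <- vector(mode = "list", length = 10)
football.pred <- vector(mode = "list", length = 10)

for (i in 1:10) {
  training[[i]]      <- football[index != i, ]
  leave[[i]]         <- football[index == i, ]
  Trfootball[[i]]    <- rpart(V1 ~ ., data = training[[i]], method = "class")
  football.pred[[i]] <- predict(Trfootball[[i]], leave[[i]], type = "class")
  print(table(Actual = leave[[i]]$V1, Classified = football.pred[[i]]))
}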

Performing statistics on multiple columns of data

I'm trying to conduct certain statistics, such as t-tests, on a table of data containing hundreds to thousands of columns. The data is formatted so that the two groups of values I'm comparing are in the same column.
So, basically, my first attempt was to cut and paste, like the following:
NN <-read.delim("E:/output.txt")
View(NN)
attach(NN)
#output p-values of 100 t-tests
sink(file="E:/ttest.txt", append=TRUE, split=FALSE)
t.test(Tree1[1:13],Tree1[14:34])$p.value
t.test(Tree2[1:13],Tree2[14:34])$p.value
t.test(Tree3[1:13],Tree3[14:34])$p.value
...
As my data grows, this is becoming more and more impractical. Is there a way to loop these t-tests through each column sequentially and save the output to a file?
Thanks in advance.
lapply will get you there, I think, with an anonymous function:
> test <- data.frame(a=1:100,b=101:200)
> lapply(test,function(x) t.test(x[1:50],x[51:100])$p.value)
$a
[1] 2.876776e-31
$b
[1] 2.876776e-31
I should do my part for good practice and also note that running 100 t-tests in a single go is fraught with the potential for type-1 errors and other badness.
Extracting the p-value in isolation is also probably a really bad move.
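If you do want more than the p-value, one option (a sketch reusing the test data above) is to keep the full htest objects and extract pieces afterwards:
tests <- lapply(test, function(x) t.test(x[1:50], x[51:100]))
sapply(tests, function(tt) tt$p.value)   # p-values
lapply(tests, function(tt) tt$conf.int)  # confidence intervals, etc.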
Not sure if this is a wise approach, or whether it even works correctly, but try mapply with the indexed parts, as in:
test <- data.frame(a=1:100,b=101:200)
testa <- test[1:50, ]
testb <- test[51:100, ]
t.test2 <- function(x, y) t.test(x, y)[["p.value"]]
mapply(t.test2, testa, testb)
EDIT: I used thelatemail's data so it's comparable. His warning is right on.
Thanks for all the input. Just a few clarifications: while I AM running hundreds of t-tests at once, they compare independent sets of data each time. So, for example, the values in column 1 (Tree1), rows 1:50, would be compared only once to rows 51:100 in the same column, and never used again. The same goes for column 2 (Tree2), and so on. Would type-1 error still be a problem? The way I see it, I'm basically doing t-tests on separate data sets, one at a time.
That being said, I've come up with a way to do this with a for loop, and the results correspond to those from t-testing each column individually.
for (i in 1:100)
  print(t.test(mydata[1:50, i], mydata[51:100, i])$p.value)
The only problem being that my output always has a [1] in front of it.
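The [1] is just print() marking the index of the first element of each printed vector. One way around it (a sketch; it assumes mydata has at least 100 columns, as above) is to collect the p-values into a vector and write them out in one go:
pvals <- sapply(1:100, function(i) t.test(mydata[1:50, i], mydata[51:100, i])$p.value)
write.csv(data.frame(column = names(mydata)[1:100], p.value = pvals),
          file = "ttests.csv", row.names = FALSE)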

How to compute descriptive statistics on a set of differently sized vectors

In a problem, I have a set of vectors. Each vector holds sensor readings, but they are of different lengths. I'd like to compute the same descriptive statistics on each of these vectors. My question is: how should I store them in R? Using c() concatenates the vectors. Using list() seems to cause functions like mean() to misbehave. Is a data frame the right object?
What is the best practice for applying the same function to vectors of different sizes? Supposing the data resides in a SQL server, how should it be imported?
Vectors of different sizes should be combined into a list: a data.frame expects each column to be the same length.
Use lapply to fetch your data. Then use lapply again to get the descriptive statistics.
x <- lapply(ids, sqlfunction)
stats <- lapply(x, summary)
Where sqlfunction is some function you created to query your database. You can collapse the stats list into a data.frame by calling do.call(rbind, stats) or by using plyr:
library(plyr)
x <- llply(ids, sqlfunction)
stats <- ldply(x, summary)
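For illustration, a self-contained sketch of that collapse step (random vectors stand in for the database results):
x <- list(a = rnorm(5), b = rnorm(12), cc = rnorm(8))
stats <- lapply(x, summary)
do.call(rbind, stats)  # one row of descriptive statistics per vector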
Most plotting and regression functions expect data to be in a "long" format: numeric values in one column and grouping or covariate values in others. The stack function will accept irregular length lists, and tapply or aggregate will allow functions to work over irregular length category variables:
dlist <- list(a=1:2, b=13:15, cc= 5:1)
s.dfrm <- stack(dlist)
s.dfrm
values ind
1 1 a
2 2 a
3 13 b
4 14 b
5 15 b
6 5 cc
7 4 cc
8 3 cc
9 2 cc
10 1 cc
tapply(s.dfrm$values, s.dfrm$ind, mean)
a b cc
1.5 14.0 3.0
"What is the best practice for applying the same function to vectors if different sizes? Supposing the data resides in a SQL server, how should it be imported?"
As suggested by Shane, lapply is definitely your choice here. You can, of course, use it with custom functions as well, in case you feel summary does not provide enough information.
For the SQL part: there are packages around for most relational DBMSs: RPostgreSQL, RMySQL, ROracle, and RODBC as a general one. If you mean MS SQL Server, I am not sure whether there is a specific package, but RODBC should do the job. I don't know if you are married to the MS SQL Server stack, but if running your own local database for R is an option, RMySQL is really easy to set up.
In general, the database packages give you wrappers like dbListTables, or dbReadTable, which simply turns a table into an R data.frame.
If you really want to import the data, you could use .csv exports of your database and use read.table or read.csv, depending on what fits your needs. But I suggest connecting directly to the database: it's not that difficult even if you haven't done it before, and it's more fun.
EDIT: I don't use MS, but others have done it before; maybe the mailing list post helps.
I would tend to import this into a data frame, not a list. Each of your individual vectors is likely differentiated by one or more meaningful variables. Let's say you wanted to keep track of the time the data was collected and the location it was collected from. In a data frame you would have one column with all of the vectors concatenated together, but each would be differentiated by values in the time and location columns. To get each individual vector's mean, tapply() might be the tool of choice.
tapply(df$y, list(df$time, df$location), mean)
Or perhaps aggregate() would be even better, depending on the number of variables and your future needs.
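For concreteness, a small sketch with hypothetical data in that long format:
df <- data.frame(
  y        = c(2.1, 3.4, 1.8, 5.0, 4.2),
  time     = c("t1", "t1", "t2", "t2", "t2"),
  location = c("A", "B", "A", "B", "B")
)
tapply(df$y, list(df$time, df$location), mean)
aggregate(y ~ time + location, data = df, FUN = mean)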

How to get a .csv file into R?

I have this .csv file:
ID,GRADES,GPA,Teacher,State
3,"C",2,"Teacher3","MA"
1,"A",4,"Teacher1","California"
And what I want to do is read the file into R and read the header into some kind of list or array (I'm new to R and have been looking for how to do this, but so far have had no luck).
Here's some pseudocode of what I want to do:
inputfile=read.csv("C:/somedirectory")
for eachitem in row1:{
add eachitem to list
}
Then I want to be able to use those names to refer to each vertical column so that I can perform calculations.
I've been scouring Google for an hour trying to find out how to do this, but there is not much out there on dealing with headers specifically.
Thanks for your help!
You mention that you will call on each vertical column so that you can perform calculations. I assume that you just want to examine each individual variable. This can be done through the following.
df <- read.csv("myRandomFile.csv", header=TRUE)
df$ID
df$GRADES
df$GPA
Might be helpful just to assign the data to a variable.
var3 <- df$GPA
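And since the question asks for the header in some kind of list: the column names are available as a character vector via names() (a small sketch reusing df from above):
cols <- names(df)  # "ID" "GRADES" "GPA" "Teacher" "State"
df[["GPA"]]        # access a column by its header name
mean(df$GPA)       # e.g. a calculation on that column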
You need read.csv("C:/somedirectory/some/file.csv"), and in general it doesn't hurt to actually look at the help page, including its example section at the bottom.
As Dirk said, the function you are after is 'read.csv' or one of the other read.table variants. Given your sample data above, I think you will want to do something like this:
setwd("c:/random/directory")
df <- read.csv("myRandomFile.csv", header=TRUE)
All we did in the above was set the working directory to where your .csv file is and then read the .csv into a data frame named df. You can check that the data loaded properly by checking the structure of the object with:
str(df)
Assuming the data loaded properly, you can then go on to perform any number of statistical methods on the data in your data frame. I think summary(df) would be a good place to start. Learning how to use the help in R will be immensely useful, and a quick read through the help on CRAN will save you lots of time in the future: http://cran.r-project.org/
You can use
df <- read.csv("filename.csv", header = TRUE)

# To loop over each column
for (i in 1:ncol(df)) {
  dosomething(df[, i])
}

# To loop over each row
for (i in 1:nrow(df)) {
  dosomething(df[i, ])
}
Also, you may want to have a look at the apply function (type ?apply or help(apply)) if you want to use the same function on each row/column.
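For instance, a small sketch (dosomething above is a placeholder; here sapply computes the mean of each numeric column):
sapply(df[sapply(df, is.numeric)], mean)  # e.g. means of ID and GPA for the sample file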
Please check whether this helps you:
df <- read.csv("F:/test.csv", header = FALSE, nrows = 1)
df
V1 V2 V3 V4 V5
1 ID GRADES GPA Teacher State
a <- c(df)
a[1]
$V1
[1] ID
Levels: ID
a[2]
$V2
[1] GRADES
Levels: GRADES
a[3]
$V3
[1] GPA
Levels: GPA
a[4]
$V4
[1] Teacher
Levels: Teacher
a[5]
$V5
[1] State
Levels: State
Since you say you want to access by position once your data is read in, you should know about R's subsetting/ indexing functions.
The easiest is
df[row, column]
# examples
df[1:5, ]  # rows 1 to 5, all columns
df[, 5]    # all rows, column 5
Other methods are here. I personally use the dplyr package for intuitive data manipulation (not by position).
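A minimal dplyr sketch (reusing the df from the read.csv examples above, with the column names from the sample file):
library(dplyr)
df %>% filter(State == "MA") %>% select(ID, GPA)  # pick rows and columns by name
df %>% summarise(mean_gpa = mean(GPA))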
