Applying lapply on Multiple Data Frames in a List, R - r

I have a list of similar data frames in a list u (4 columns, all with same headers) and would like to run an lapply function to get the correlation of columns 2 and 3 of each data frame. I want the function to read any integer i (the list has 300+ csvs).
I've tried this code but it hasn't worked:
i<-1:2
for (i) lapply(u, cor(u[[i]][,2],u[[i]][,3]))
Can someone please help me fix this code? Still fairly new to the program.
Edit: I've tried Metrics' code below and it works, but unfortunately one of the csvs contains only headers and no data. I get this error: Error in cor(u[, 2], u[, 3]) : 'x' is empty
Is there any way sapply can be modified so that the "cor" function returns 0 if there isn't any data available?

Let x be the list of all data frames. In the following example, I have used two data frames that ship with R (mtcars and iris):
x <- list(mtcars=mtcars,iris=iris)
lapply(x,function(x)cor(x[,2],x[,3]))
$mtcars
[1] 0.9020329
$iris
[1] -0.4284401
Or, if you want the vector output:
sapply(x,function(x)cor(x[,2],x[,3]))
    mtcars       iris
 0.9020329 -0.4284401
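For the follow-up about the csv that has headers but no rows, one option is to wrap cor in a small helper that checks the row count first. This is only a sketch: safe_cor and its fallback argument are made-up names, and the empty data frame is simulated with mtcars[0, ]. Returning NA is usually safer than 0, since 0 reads as "no correlation" rather than "no data", but the fallback can be set to 0 if that is really what is wanted.

```r
# Hypothetical helper: correlation of columns 2 and 3, or a fallback
# value when there are too few rows to compute one.
safe_cor <- function(df, fallback = NA_real_) {
  if (nrow(df) < 2) return(fallback)
  cor(df[, 2], df[, 3])
}

# Example list: two built-in data frames plus one with headers only.
u <- list(mtcars = mtcars, iris = iris, empty = mtcars[0, ])

res      <- sapply(u, safe_cor)                # NA for the empty data frame
res_zero <- sapply(u, safe_cor, fallback = 0)  # 0 for the empty data frame
```

Extra arguments given to sapply after the function (here fallback = 0) are passed on to each call of safe_cor.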

Related

Comparing character lists in R

I have two lists of characters that I read in from Excel files.
One is a very long list of all bird species that have been documented in a region (allBirds) and another is a list of species that were recently seen in a sample location (sampleBirds), which is much shorter. I want to write a section of code that will compare the lists and tell me which sampleBirds show up in the allBirds list. Both are lists of characters.
I have tried:
# upload xlsx file
Full_table <- read_excel("Full_table.xlsx")
Pathogen_table <- read_excel("pathogens.xlsx")
# read species column into a new dataframe
species <-c(as.data.frame(Full_table[,7], drop=FALSE))
pathogens <- c(as.data.frame(Pathogen_table[,3], drop=FALSE))
intersect(pathogens, species)
intersect(species, pathogens)
but intersect returns lists of length 0, which I know cannot be true. Any suggestions?
Maybe you can try match() function or "==".
You need to run the intersect on the individual columns that are stored in the list:
> a <- c(data.frame(c1=as.factor(c('a', 'q'))))
> b <- c(data.frame(c1=as.factor(c('w', 'a'))))
> intersect(a,b)
list()
> intersect(a$c1,b$c1)
[1] "a"
This will probably do in your case:
intersect(Full_table[[7]], Pathogen_table[[3]])
Or, if you insist on creating the intermediate objects:
intersect(pathogens[[1]], species[[1]])
where [[1]] selects the first column (here, the first list element) as a plain vector. Note that by using c(as.data.frame(... you are converting the data.frame to a regular list. I'd go with only as.data.frame(....
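A small self-contained sketch of the column-wise comparison, for anyone who wants to try it without the Excel files. The two data frames and their contents are invented stand-ins for what read_excel() would return, and the column name species is an assumption.

```r
# Stand-in tables; in the question these come from read_excel().
Full_table <- data.frame(id = 1:4,
                         species = c("robin", "wren", "heron", "swift"))
Pathogen_table <- data.frame(id = 1:2,
                             species = c("swift", "osprey"))

# [[ ]] extracts a column as a plain vector, for data frames and tibbles alike.
all_species    <- Full_table[["species"]]
sample_species <- Pathogen_table[["species"]]

# Values that occur in both vectors.
common <- intersect(sample_species, all_species)

# One TRUE/FALSE per sample entry, handy for filtering the sample table.
is_known <- sample_species %in% all_species
```

intersect gives the shared values themselves, while %in% keeps the positions, so Pathogen_table[is_known, ] would return the matching rows.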

Use lists/dataframes as items in for-loops in R

I am quite sure this is basic stuff, but I just can't find the answer by googling. So my problem:
I want to use a for-loop on a list of lists or data frames. But when you use list[i], you get all the values in the data frame instead of the data frame itself. Can anyone point out to me how to code this properly?
Example of the code:
a<-data.frame(seq(1:3),seq(3:1))
b<-data.frame(seq(1:3),seq(3:1))
l<-c(a,b)
Then l[1] returns:
> l[1]
$seq.1.3..
[1] 1 2 3
And I want it to just return: a
You can use the list function:
a<-data.frame(1:3,1:3)
b<-data.frame(3:1,3:1)
l<-list(a,b)
And access its elements with double brackets [[:
l[[1]]
l[[2]]
PS: seq(1:3) and seq(3:1) output the same value, so I used 1:3 and 3:1. :)
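To make the difference concrete, here is a short sketch (the column names x and y and the row_counts vector are just for illustration). It shows what c() and list() each produce, what single and double brackets return, and a for-loop that visits each data frame in turn.

```r
a <- data.frame(x = 1:3, y = 3:1)
b <- data.frame(x = 4:6, y = 6:4)

flat   <- c(a, b)     # concatenates the columns: a plain list of 4 vectors
nested <- list(a, b)  # keeps each data frame as one element

n_flat   <- length(flat)    # 4
n_nested <- length(nested)  # 2

cls_single <- class(nested[1])    # "list": a sub-list of length one
cls_double <- class(nested[[1]])  # "data.frame": the element itself

# Loop over the data frames by index and record the number of rows in each.
row_counts <- integer(length(nested))
for (i in seq_along(nested)) {
  df <- nested[[i]]
  row_counts[i] <- nrow(df)
}
```

seq_along(nested) is used instead of 1:length(nested) so the loop is simply skipped when the list is empty.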

How to use List of List of Dataframes

I'm not sure if this is possible or even how to get a good solution for the following R problem.
Data / Background / Structure:
I've collected a big dataset of project-based cooperation data, which maps specific projects to the participating companies (this can be understood as a bipartite edgelist for social network analysis). For analytical reasons it is advisable to split the whole dataset into subsets by location and time period. Therefore, I've created the following data structure:
sna.location.list
  [[1]] (location1)
    [[1]] (is a dataframe containing the bip. edge-list for time-period1)
    [[2]] (is a dataframe containing the bip. edge-list for time-period2)
    ...
    [[20]] (is a dataframe containing the bip. edge-list for time-period20)
  [[2]] (location2)
    ... (same as 1)
  ...
  [[32]] (location32)
    ...
Every dataframe contains a project id and the corresponding company ids.
My goal is now to transform the bipartite edgelists into one-mode networks, then do some further SNA-related calculations (degree, centralization, status, community detection etc.) and save the results.
I know how to do these calculation steps with one(!) specific network, but I am having a really hard time automating this process for all of the networks at once in the described list structure, and saving the various outputs (node-level and network-level variables) in a similar structure.
I have already looked into several for-loop and apply approaches, but I still can't work out how to do this, and right now I feel very helpless. Any help or suggestions would be highly appreciated. If you need more information or examples in order to give me a brief demo or code example of how to tackle such a nested structure and run such SNA-related calculations/modifications for all of the aforementioned subsets in an efficient, automatic way, please feel free to contact me.
Let's say you have a function foo that you want to apply to each data frame. Those data frames are in lists, so lapply(that_list, foo) is what we want. But you've got a list of such lists, so we actually want to lapply that first lapply across the outer list, hence lapply(that_list, lapply, foo). (The foo will be passed along to the inner lapply through the ... argument.) If you wish to be more explicit, you can use an anonymous function instead: lapply(that_list, function(x) lapply(x, foo)).
You haven't given a reproducible example, so I'll demonstrate applying the nrow function to a nested list of built-in data frames:
d = list(
list(mtcars, iris),
list(airquality, faithful)
)
result = lapply(d, lapply, nrow)
result
# [[1]]
# [[1]][[1]]
# [1] 32
#
# [[1]][[2]]
# [1] 150
#
#
# [[2]]
# [[2]][[1]]
# [1] 153
#
# [[2]][[2]]
# [1] 272
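As you can see, the output is a list with the same structure. If the input lists are named, lapply carries those names over to the result; sapply with simplify = FALSE behaves the same way and additionally uses the values as names when the input is an unnamed character vector.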
This covers applying functions to a nested list and saving the returns in a similar data structure. If you need help with calculation efficiency, parallelization, etc., I'd suggest asking a separate question focused on that, with a reproducible example.
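As a rough sketch of how several outputs per network could be stored in the same nested layout: the names sna_list, location1, period1 and summarise_network below are all invented, and the "calculations" are just row and column counts standing in for the real SNA steps.

```r
# Stand-in for the location -> time-period structure from the question.
sna_list <- list(
  location1 = list(period1 = mtcars,     period2 = iris),
  location2 = list(period1 = airquality, period2 = faithful)
)

# Hypothetical per-network summary; replace the body with the real calculations.
summarise_network <- function(df) {
  list(n_rows = nrow(df), n_cols = ncol(df))
}

# Outer lapply walks the locations, inner lapply walks the time periods.
results <- lapply(sna_list, function(periods) lapply(periods, summarise_network))

# Individual results are reachable by name.
one_value <- results$location2$period1$n_rows

# Pull one statistic out of every cell: rows are periods, columns are locations.
n_rows <- sapply(results, function(periods) sapply(periods, `[[`, "n_rows"))
```

Because the input lists are named, the results keep the same names, which makes it easier to save or look up the output for a given location and period later.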

How to subset a list based on the length of its elements in R

In R I have a function (ip2coordinates from the package RDSTK) which looks up 11 fields of data for each IP address you supply.
I have a vector of IPs called ip.addresses:
> head(ip.addresses)
[1] "128.177.90.11" "71.179.12.143" "66.31.55.111" "98.204.243.187" "67.231.207.9" "67.61.248.12"
Note: Those or any other IPs can be used to reproduce this problem.
So I apply the function to that object with sapply:
ips.info <- sapply(ip.addresses, ip2coordinates)
and get a list called ips.info as my result. This is all well and good, but I can't do much more with a list, so I need to convert it to a data frame. The problem is that not all IP addresses are in the database, so some list elements only have 1 field and I get this error:
> ips.df <- as.data.frame(ips.info)
Error in data.frame(`128.177.90.10` = list(ip.address = "128.177.90.10", :
arguments imply differing number of rows: 1, 0
My question is -- "How do I remove the elements with missing/incomplete data or otherwise convert this list into a data frame with 11 columns and 1 row per IP address?"
I have tried several things.
First, I tried to write a loop that removes elements with a length of less than 11:
for (i in 1:length(ips.info)){
if (length(ips.info[i]) < 11){
ips.info[i] <- NULL}}
This leaves some records with no data and makes others say "NULL", but even those with "NULL" are not detected by is.null
Next, I tried the same thing with double square brackets and get
Error in ips.info[[i]] : subscript out of bounds
I also tried complete.cases() to see if it could potentially be useful
Error in complete.cases(ips.info) : not all arguments have the same length
Finally, I tried a variation of my for loop which was conditioned on length(ips.info[[i]] == 11 and wrote complete records to another object, but somehow it results in an exact copy of ips.info
Here's one way you can accomplish this using the built-in Filter function:
#input data
library(RDSTK)
ip.addresses<-c("128.177.90.10","71.179.13.143","66.31.55.111","98.204.243.188",
"67.231.207.8","67.61.248.15")
ips.info <- sapply(ip.addresses, ip2coordinates)
#data.frame creation
lengthIs <- function(n) function(x) length(x)==n
do.call(rbind, Filter(lengthIs(11), ips.info))
or, if you prefer not to use a helper function:
do.call(rbind, Filter(function(x) length(x)==11, ips.info))
An alternative base R solution using logical indexing:
# find non-complete elements
ids.to.remove <- sapply(ips.info, function(i) length(i) < 11)
# remove found elements
ips.info <- ips.info[!ids.to.remove]
# create data.frame
df <- do.call(rbind, ips.info)
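It may help to see why the loop attempts in the question misbehave. With single brackets, ips.info[i] is always a sub-list of length 1, so the length test never looks at the number of fields. With double brackets, assigning NULL deletes the element and shortens the list, so later indices run past the end. The sketch below avoids both by filtering in one step. It uses an invented stand-in list (ips_info, with 3 fields per complete record instead of 11) so that it runs without RDSTK or a network connection.

```r
# Stand-in for ips.info: complete records have 3 fields here (11 in the
# question); a failed lookup has only 1.
ips_info <- list(
  "10.0.0.1" = list(ip = "10.0.0.1", lat = 1.5, lon = 2.5),
  "10.0.0.2" = list(ip = "10.0.0.2"),
  "10.0.0.3" = list(ip = "10.0.0.3", lat = 3.5, lon = 4.5)
)
n_fields <- 3

len_single <- length(ips_info[1])    # 1: a sub-list holding one element
len_double <- length(ips_info[[1]])  # 3: the fields of that element

# lengths() returns the length of every element, so no loop is needed.
complete <- ips_info[lengths(ips_info) == n_fields]

# One row per complete record.
ips_df <- do.call(rbind, lapply(complete, as.data.frame))
```

lengths() is available in base R from version 3.2.0; on older versions sapply(ips_info, length) does the same job.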

Recoding over multiple data frames in R

(edited to reflect help...I'm not doing great with formatting, but appreciate the feedback)
I'm a bit stuck on what I suspect is an easy enough problem. I have multiple different data sets that I have loaded into R, all of which have different numbers of observations, but all of which have three variables named "A1," "A2," and "A3". I want to create a new variable in each of the data frames that contains the value held in "A1" if A3 contains a value greater than zero, and the value held in "A2" if A3 contains a value less than zero. Seems simple enough, right?
My attempt at this code uses this faux-data:
set.seed(1)
A1=seq(1,100,length=100)
A2=seq(-100,-1,length=100)
A3=runif(100,-1,1)
df1=cbind(A1,A2,A3)
A3=runif(100,-1,1)
df2=cbind(A1,A2,A3)
I'm about a thousand percent sure that R has some functionality for creating the same named variable in multiple data frames, but I have tried doing this with lapply:
mylist=list(df1,df2)
lapply(mylist,function(x){
x$newVar=x$A1
x$newVar[x$A3>0]=x$A2[x$A3>0]
return(x)
})
But the newVar is not available for me once I leave the lapply loop. For example, if I ask for the mean of the new variable:
mean(df1$newVar)
[1] NA
Warning message:
In mean.default(df1$newVar) :
argument is not numeric or logical: returning NA
Any help would be appreciated.
Thank you.
Well first of all, df1 and df2 are not data.frames but matrices (the dollar syntax doesn't work on matrices).
In fact, if you do:
set.seed(1)
A1=seq(1,100,length=100)
A2=seq(-100,-1,length=100)
A3=runif(100,-1,1)
df1=as.data.frame(cbind(A1,A2,A3))
A3=runif(100,-1,1)
df2=as.data.frame(cbind(A1,A2,A3))
mylist=list(df1,df2)
lapply(mylist,function(x){
x$newVar=x$A1
x$newVar[x$A3>0]=x$A2
})
the code almost works but gives some warnings. In fact, there's still an error in the last line of the function called by lapply. If you change it like this, it works as expected:
lapply(mylist,function(x){
x$newVar=x$A1
x$newVar[x$A3>0]=x$A2[x$A3>0] # you need to subset x$A2 otherwise it's too long
return(x) # better to state explicitly what's the return value
})
EDIT (as per comment):
As is almost always the case in R, functions do not modify existing objects but return new ones.
So, in this case, df1 and df2 are unchanged, but lapply returns a list with the expected 2 new data.frames, i.e.:
resultList <- lapply(mylist,function(x){
x$newVar=x$A1
x$newVar[x$A3>0]=x$A2[x$A3>0]
return(x)
})
newDf1 <- resultList[[1]]
newDf2 <- resultList[[2]]
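A slightly more compact variant, offered as a sketch rather than the one right way: name the list elements so the results can be picked out by name, overwrite the list with the lapply result, and use ifelse for the recoding. The names df1 and df2 inside the list are my own choice. The rule is the same as in the code above (A2 where A3 > 0, otherwise A1).

```r
set.seed(1)
A1 <- seq(1, 100, length.out = 100)
A2 <- seq(-100, -1, length.out = 100)
df1 <- data.frame(A1, A2, A3 = runif(100, -1, 1))
df2 <- data.frame(A1, A2, A3 = runif(100, -1, 1))

# Naming the elements lets the results be accessed as mylist$df1, mylist$df2.
mylist <- list(df1 = df1, df2 = df2)

# Overwrite the list with the modified copies returned by lapply.
mylist <- lapply(mylist, function(x) {
  x$newVar <- ifelse(x$A3 > 0, x$A2, x$A1)
  x
})

mean_new <- mean(mylist$df1$newVar)
```

The original df1 and df2 in the workspace are still untouched; the updated versions live in mylist.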
