Looping and storing results over many data frames - r

I want to perform at least six looping steps in R. My data sets are 28 files stored in one folder. Each file has 22 rows (21 individual cases and one row for column names) and the following columns: Id, id, PC1, PC2, ... PC20.
I intend to:
1. read each file into R as a data frame
2. delete the first column, named "Id", in each data frame
3. arrange each data frame so that the first column is "id" and the next ten columns are the first ten PCs (PC1, PC2, ... PC10)
4. sort each data frame according to "id" (data frames should have the same order of individuals and their respective PC scores)
5. perform pairwise comparisons with the protest function in the vegan package among all possible pairs (378 combinations)
6. store the result of each pairwise comparison in a symmetric (28*28) matrix, which will be used in further analysis
At the moment I am able to do it manually for each pair of data (code is below):
## 1. step
## read files into R as a data frame
c_2d_hand_1a<-read.table("https://googledrive.com/host/0B90n5RdIvP6qbkNaUG1rTXN5OFE/PC scores, c_2d_hand-1a, Symmetric component.txt",header=T)
c_2d_hand_1b<-read.table("https://googledrive.com/host/0B90n5RdIvP6qbkNaUG1rTXN5OFE/PC scores, c_2d_hand-1b, Symmetric component.txt",header=T)
## 2. step
## delete first column named “Id” in the each data frame
c_2d_hand_1a[,1]<-NULL
c_2d_hand_1b[,1]<-NULL
## 3. step
## arrange each data frame that have 21 rows and 11 columns (id,PC1,PC2..PC10)
c_2d_hand_1a<-c_2d_hand_1a[,1:11]
c_2d_hand_1b<-c_2d_hand_1b[,1:11]
## 4. step
## sort each data frame according to “id”
c_2d_hand_1a<-c_2d_hand_1a[order(c_2d_hand_1a$id),]
c_2d_hand_1b<-c_2d_hand_1b[order(c_2d_hand_1b$id),]
## 5. step
## perform pairwise comparison by protest function
library(permute)
library(vegan)
c_2d_hand_1a_c_2d_hand_1b<-protest(c_2d_hand_1a[,2:ncol(c_2d_hand_1a)],c_2d_hand_1b[,2:ncol(c_2d_hand_1b)],permutations=10000)
summary(c_2d_hand_1a_c_2d_hand_1b)[2] ## or c_2d_hand_1a_c_2d_hand_1b[3]
Since I am a newbie in data handling/manipulation in R, my self-taught skills only allow me to perform these steps manually, typing the code for each data set and performing one pairwise comparison at a time. Since I need to perform those six steps 378 times, manual typing would be exhausting and time consuming.
I tried to import the files as a list and tried several operations, but I was unsuccessful. Specifically, using list.files(), I made a list called "probe". I was able to select a certain data frame using e.g. probe[2]. I could also access column "Id" by e.g. probe[2][1], and delete it by probe[2][1]<-NULL. But when I tried to work with a for loop, I got stuck.

This code is untested, but with some luck, it should work. The summary of each protest() result is stored in a matrix of lists.
# develop a way to easily reference all of the URLs
url.begin <- "https://googledrive.com/host/0B90n5RdIvP6qbkNaUG1rTXN5OFE/PC scores, "
url.middle <- c("c_2d_hand-1a", "c_2d_hand-1b")
url.end <- ", Symmetric component.txt"
L <- length(url.middle)
# read in all of the data and save it to a list of data frames
mybiglist <- lapply(url.middle, function(mid) read.table(paste0(url.begin, mid, url.end), header=TRUE))
# save columns 2 to 12 in each data frame and order by id
mybiglist11cols <- lapply(mybiglist, function(df) df[order(df$id), 2:12])
# get needed packages
library(permute)
library(vegan)
# create empty matrix of lists to store results
results <- matrix(vector("list", L*L), nrow=L, ncol=L)
# perform pairwise comparison by protest function
for(i in 1:L) {
  for(j in 1:L) {
    df1 <- mybiglist11cols[[i]]
    df2 <- mybiglist11cols[[j]]
    results[i, j] <- list(summary(protest(df1[, -1], df2[, -1], permutations=10000)))
  }
}
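As a variation on the loop above, here is a minimal sketch that visits each of the choose(n, 2) unordered pairs only once and mirrors the value into a symmetric numeric matrix. It runs on simulated stand-in data so that it is self-contained; make_df, prep, dfs and res are hypothetical names, and storing the Procrustes correlation (the t0 component of the protest result) is an assumption about which statistic is wanted.

```r
library(vegan)

# Stand-in data: three simulated data frames shaped like the files
# (Id, id, PC1 ... PC20, 21 rows). With real files, something like
#   files <- list.files("path/to/folder", full.names = TRUE)
#   dfs <- lapply(files, read.table, header = TRUE)
#   names(dfs) <- basename(files)
# would build the list instead (the folder path is a placeholder).
set.seed(1)
make_df <- function() {
  scores <- matrix(rnorm(21 * 20), nrow = 21,
                   dimnames = list(NULL, paste0("PC", 1:20)))
  data.frame(Id = 1:21, id = sample(1:21), scores)
}
dfs <- list(a = make_df(), b = make_df(), c = make_df())

# Steps 2 to 4: keep id and PC1..PC10 (dropping Id), then sort by id
prep <- function(df) {
  df <- df[, c("id", paste0("PC", 1:10))]
  df[order(df$id), ]
}
dfs <- lapply(dfs, prep)

# Steps 5 and 6: one protest() call per unordered pair
n <- length(dfs)
res <- matrix(NA_real_, n, n, dimnames = list(names(dfs), names(dfs)))
diag(res) <- 1  # assumption: a configuration matches itself perfectly
pairs <- combn(n, 2)
for (k in seq_len(ncol(pairs))) {
  i <- pairs[1, k]
  j <- pairs[2, k]
  fit <- protest(dfs[[i]][, -1], dfs[[j]][, -1], permutations = 999)
  res[i, j] <- res[j, i] <- fit$t0
}
res
```

With 28 data frames, combn(28, 2) yields the 378 pairs mentioned in the question, so the loop body runs 378 times rather than 784.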

Related

rds file decompressed has inconsistent size

I have downloaded a .rds file that I have decompressed in R using:
t<-readRDS("myfile.rds")
The file is easily decompressed into a data frame. ncol(t)=24, nrow(t)=20.
When I view the file in R studio, the table has actually 1572 columns and 20 rows.
I would like to know what I am actually dealing with here, mainly because when I try to save this data frame on a mysql server using RMySQL and DBI (dbWriteTable() ), R freezes.
For your information, class(t)='data.frame', typeof(t)='list'.
str(t) yields
tidyr::unnest(t) yields
thank you for your assistance
Based on your str call, consider walking down each nested element and flattening each one accordingly with [[, unlist, or cbind to build a main data frame. The recurring property is that most components appear to have a length of 20 items, which is the number of observations in t.
# FLATTEN LIST OF CHR VECTORS INTO DATA FRAME COLUMN
t$alt_names <- unlist(t$alt_names)
# FLATTEN ONE COLUMN DATA FRAME
t$official_sites <- t$official_sites[[1]]
# ADJUST COLUMNS TO ALL NAs DUE TO ALL EMPTY LISTS
t$stats$previous_seasons <- NA
# CREATE TWENTY NEW COLUMNS FROM LIST OF CHR VECTORS
t$stats[paste0("seasonGoals_away", 1:20)] <- unlist(t$stats$seasonGoals_away)
t$stats$seasonGoals_away <- NULL # REMOVE ORIGINAL COLUMN
# SEPARATE LARGE DATA FRAME PIECES
stats <- t$stats
t$stats <- NULL # REMOVE ORIGINAL COLUMN
# COMBINE LARGE DATA FRAME PIECES
main_df <- cbind(t, stats) # DATA FRAME OF 20 OBS BY 1053 COLUMNS
Add similar steps for the other nested objects not shown in the screenshots. Ultimately, main_df should be a data frame of only flat, atomic types of 20 obs by 1053 (24 + 1023) columns.
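To see which columns still need flattening, a small sketch like the following may help. It uses a toy data frame (the column names only mimic those above and are assumptions) and classifies each top-level column as atomic, list, or nested data frame. A nested data frame counts as a single column for ncol() even though a viewer may expand it into many columns.

```r
# Toy data frame with one atomic column, one nested data frame column
# and one list column (all names are illustrative)
t <- data.frame(id = 1:3)
t$stats <- data.frame(goals = c(1, 2, 3), wins = c(0, 1, 1))
t$alt_names <- list("a", "b", "c")

# Classify every top-level column
kind <- vapply(t, function(col) {
  if (is.data.frame(col)) "data.frame"
  else if (is.list(col)) "list"
  else "atomic"
}, character(1))
kind

ncol(t)        # counts stats as one column
ncol(t$stats)  # columns hidden inside the nested data frame
```

Columns reported as "list" or "data.frame" are the ones to flatten before writing the table to a database.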

How do I loop through multiple Data Frames in r to create a vector?

This is the code I am currently using to move data from multiple data frames into a time-ordered vector which I then perform analysis on and graph:
TotalLoans <- c(
sum(as.numeric(HCD2001$loans_all)), sum(as.numeric(HCD2002$loans_all)),
sum(as.numeric(HCD2003$loans_all)), sum(as.numeric(HCD2004$loans_all)),
sum(as.numeric(HCD2005$loans_all)), sum(as.numeric(HCD2006$loans_all)),
sum(as.numeric(HCD2007$loans_all)), sum(as.numeric(HCD2008$loans_all)),
sum(as.numeric(HCD2009$loans_all)), sum(as.numeric(HCD2010$loans_all)),
sum(as.numeric(HCD2011$loans_all)), sum(as.numeric(HCD2012$loans_all)),
sum(as.numeric(HCD2013$loans_all)), sum(as.numeric(HCD2014$loans_all)),
sum(as.numeric(HCD2015$loans_all)), sum(as.numeric(HCD2016$loans_all))
)
I do this four more times with similar data frames that also are similarly formatted as:
Varname$year
Is there a way to loop through these 16 data frames, select an individual column, perform a function on it, and put it into a vector? This is what I have tried so far:
AllList <- list(HCD2001, HCD2002, HCD2003, HCD2004, HCD2005, HCD2006, HCD2007, HCD2008, HCD2009, HCD2010, HCD2011, HCD2012, HCD2013, HCD2014, HCD2015, HCD2016)
TotalLoans <- lapply(AllList,
  function(df){
    sum(as.numeric(df$loans_all))
    return(df)
  }
)
However, it returns a Large List with every column from the data frames. All the other posts related to this were for modifying data frames, not creating a new vector with modified values of the data frames.
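The attempt above returns whole data frames because the function ends with return(df), which discards the computed sum. A minimal sketch of one way around this, using two toy data frames in place of the sixteen real ones (the values are made up for illustration), is to return the sum itself and use vapply so the result is a numeric vector rather than a list:

```r
# Toy stand-ins for HCD2001 ... HCD2016
HCD2001 <- data.frame(loans_all = c("10", "20"), stringsAsFactors = FALSE)
HCD2002 <- data.frame(loans_all = c("5", "7"), stringsAsFactors = FALSE)

# Collect the data frames by name; for the real data this would be
# mget(paste0("HCD", 2001:2016))
AllList <- mget(paste0("HCD", 2001:2002))

# Return the sum (not df); vapply guarantees one number per data frame
TotalLoans <- vapply(AllList,
                     function(df) sum(as.numeric(df$loans_all)),
                     numeric(1))
TotalLoans
```

Because mget() returns a named list, the resulting vector keeps the data frame names, which keeps the time order easy to check.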

For Loop Over List of Data Frames and Create New Data Frames from Every Iteration Using Variable Name

I cannot for the life of me figure out where the simple error is in my for loop to perform the same analyses over multiple data frames and output each iteration's new data frame utilizing the variable used along with extra string to identify the new data frame.
Here is my code:
john and jane are 2 data frames among many I am hoping to loop over and compare to bcm to find duplicate results in rows.
x <- list(john,jane)
for (i in x) {
  test <- rbind(bcm,i)
  test$dups <- duplicated(test$Full.Name,fromLast=T)
  test$dups2 <- duplicated(test$Full.Name)
  test <- test[which(test$dups==T | test$dups2==T),]
  newname <- paste("dupl",i,sep=".")
  assign(newname, test)
}
Thus far, I can either get the naming to work correctly without including the x data or the loop to complete correctly without naming the new data frames correctly.
Intended Result: I am hoping to create new data frames dupl.john and dupl.jane to show which rows are duplicated in comparison to bcm.
I understand that lapply() might be better to use and am very open to that form of solution. I could not figure out how to use it to solve my problem, so I turned to the more familiar for loop.
EDIT:
Sorry if I'm not being more clear. I have about 13 data frames in total that I want to run the same analysis over to find the duplicate rows in $Full.Name. I could do the first 4 lines of my loop and then dupl.john <- test 13 times (for each data frame), but I am purposely trying to write a for loop or lapply() to gain more knowledge in R and because I'm sure it is more efficient.
If I understand your intended result correctly, match_df() from the plyr package could be an option.
library(plyr)
dupl.john <- match_df(john, bcm)
dupl.jane <- match_df(jane, bcm)
dupl.john and dupl.jane will both be data frames, each containing the rows that appear in both the respective data frame and bcm. Is this what you are trying to achieve?
EDITED after the first comment
library(plyr)
l <- list(john, jane)
res <- lapply(l, function(x) {match_df(x, bcm, on = "Full.Name")} )
dupl.john <- res[[1]]  # [[ ]] extracts the data frame itself
dupl.jane <- res[[2]]
Now, res will have a list of the data frames with the matches, based on the column "Full.Name".
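In the original for loop, i is the data frame itself rather than its name, so paste("dupl", i, sep=".") cannot produce a single object name. A base R sketch that keeps the question's duplicated() logic but carries the names through a named list might look like this (the toy data and the helper find_dups are hypothetical):

```r
# Toy data standing in for bcm, john and jane
bcm  <- data.frame(Full.Name = c("Ann Lee", "Bob Ray", "Cy Fox"),
                   stringsAsFactors = FALSE)
john <- data.frame(Full.Name = c("Bob Ray", "Dee Poe"),
                   stringsAsFactors = FALSE)
jane <- data.frame(Full.Name = c("Ann Lee", "Cy Fox", "Eve Roe"),
                   stringsAsFactors = FALSE)

# A named list, so the names are available after lapply
x <- list(john = john, jane = jane)

# Stack bcm on top of one data frame and keep every row whose
# Full.Name occurs more than once
find_dups <- function(df) {
  test <- rbind(bcm, df)
  dup <- duplicated(test$Full.Name) |
    duplicated(test$Full.Name, fromLast = TRUE)
  test[dup, , drop = FALSE]
}

dupl <- lapply(x, find_dups)
names(dupl) <- paste("dupl", names(x), sep = ".")
dupl$dupl.john
```

Keeping the results in the list dupl is usually enough; if separate objects are really wanted, list2env(dupl, envir = globalenv()) would create dupl.john and dupl.jane.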

Create new columns for percentile rank of numerous columns in a data frame in R

I have a fairly large data-set (4000 obs of 149 variables), and I would like to look at the percentile rank of many of these variables. I have been able to successfully generate the percentile ranks (I believe) ignoring NA values with the following code:
prank <- function(x){
  r <- rank(x)/sum(!is.na(x))*100
  r[is.na(x)]<-NA
  r
}
My question is how to automate applying this function to the columns I am interested in, returning a new column with the ranks? I tried this:
y <- data.frame(x, t(apply(-x,1,prank)))
But this appears to group everything together and establish the ranks. I essentially want to be able to do the following on ~100 different columns:
y$V5.pr <- prank(x$V5)
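One way to do this for many columns at once is to apply prank column by column with lapply and assign the results to new ".pr" columns in a single step. This is a sketch on a toy data frame; the vector cols, which lists the columns of interest, is an assumed name.

```r
prank <- function(x) {
  r <- rank(x) / sum(!is.na(x)) * 100
  r[is.na(x)] <- NA
  r
}

# Toy data with missing values
x <- data.frame(V1 = c(10, 20, NA, 40),
                V2 = c(3, 1, 2, NA))

# Columns to rank; for the real data this could be any character
# vector of column names
cols <- c("V1", "V2")

# lapply works per column, so each variable is ranked on its own
y <- x
y[paste0(cols, ".pr")] <- lapply(x[cols], prank)
y
```

Unlike apply(x, 1, prank), which ranks across each row, lapply over x[cols] treats every column separately, which matches y$V5.pr <- prank(x$V5).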

Manipulating cutree object in R to segment original dataframe

I'm using R's built-in correlation matrix and hierarchical clustering methods to segment daily sales data into 10 clusters. Then, I'd like to create agglomerated daily sales data by cluster. I've got as far as creating a cutree() object, but am stumped on extracting only the column names in the cutree object where the cluster number is 1, for example.
For simplicity's sake, I'll use the EuStockMarkets data set and cut the tree into 2 segments; bear in mind that I'm working with thousands of columns here, so the solution needs to be scalable:
data=as.data.frame(EuStockMarkets)
corrMatrix<-cor(data)
dissimilarity<-round(((1-corrMatrix)/2), 3)
distSimilarity<-as.dist(dissimilarity)
hirearchicalCluster<-hclust(distSimilarity)
treecuts<-cutree(hirearchicalCluster, k=2)
Now I get stuck. I want to extract only the column names from treecuts where the cluster number is equal to 1, for example. But the object that cutree() makes is not a data frame, which makes subsetting difficult. I've tried to convert treecuts into a data frame, but R does not create a column for the row names; all it does is coerce the numbers into a row with the name treecuts.
I would want to do the following operations:
....Code that converts treecuts into a data frame called "treeIDs" with the
columns "Index" and "Cluster"......
cluster1Columns<-colnames(treeIDs[Cluster==1, ])
cluster1DF<-data[ , (colnames(data) %in% cluster1Columns)]
rowSums(cluster1DF)
...and voila, I'm done.
Thoughts/suggestions?
Here is the solution:
names(treecuts)[treecuts == 1]
[1] "DAX" "SMI" "FTSE"
If you also want, say, cluster 2 (or higher), you can use %in%:
names(treecuts)[treecuts %in% c(1, 2)]
[1] "DAX" "SMI" "CAC" "FTSE"
Why not just
data$clusterID <- treecuts
then subset data as usual?
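Note that cutree() returns a named integer vector with one entry per column of data, not one per row. A sketch of the treeIDs lookup table described in the question, followed by a version that scales to all clusters, could look like this (clusterTotals is an assumed name):

```r
data <- as.data.frame(EuStockMarkets)
dissimilarity <- round((1 - cor(data)) / 2, 3)
treecuts <- cutree(hclust(as.dist(dissimilarity)), k = 2)

# Lookup table: one row per column of data
treeIDs <- data.frame(Index = names(treecuts),
                      Cluster = unname(treecuts),
                      stringsAsFactors = FALSE)

# Columns in cluster 1 and their row-wise totals
cluster1Columns <- treeIDs$Index[treeIDs$Cluster == 1]
cluster1DF <- data[, cluster1Columns, drop = FALSE]
cluster1Total <- rowSums(cluster1DF)

# All clusters at once: split the column names by cluster,
# then sum the matching columns row by row
clusterTotals <- sapply(split(names(treecuts), treecuts),
                        function(cols) rowSums(data[, cols, drop = FALSE]))
head(clusterTotals)
```

The drop = FALSE argument keeps a cluster with a single column as a data frame, so rowSums() still works for it.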
