rds file decompressed has inconsistent size - r

I have downloaded a .rds file that I have decompressed in R using:
t<-readRDS("myfile.rds")
the file is easily decompressed into a data frame. ncol(t)=24, nrow(t)=20.
When I view the file in RStudio, the table actually has 1572 columns and 20 rows.
I would like to know what I am actually dealing with here, mainly because when I try to save this data frame on a mysql server using RMySQL and DBI (dbWriteTable() ), R freezes.
For your information, class(t)='data.frame', typeof(t)='list'.
str(t) yields
tidyr::unnest(t) yields
thank you for your assistance

From your str call, consider walking down each nested element and flattening each one accordingly with [[, unlist, or cbind to generate a main data frame. The recurring property is that most components appear to have a length of 20 items, which is the number of observations of t.
# FLATTEN LIST OF CHR VECTORS INTO DATA FRAME COLUMN
t$alt_names <- unlist(t$alt_names)
# FLATTEN ONE COLUMN DATA FRAME
t$official_sites <- t$official_sites[[1]]
# ADJUST COLUMNS TO ALL NAs DUE TO ALL EMPTY LISTS
t$stats$previous_seasons <- NA
# CREATE TWENTY NEW COLUMNS FROM LIST OF CHR VECTORS
t$stats[paste0("seasonGoals_away", 1:20)] <- unlist(t$stats$seasonGoals_away)
t$stats$seasonGoals_away <- NULL # REMOVE ORIGINAL COLUMN
# SEPARATE LARGE DATA FRAME PIECES
stats <- t$stats
t$stats <- NULL # REMOVE ORIGINAL COLUMN
# COMBINE LARGE DATA FRAME PIECES
main_df <- cbind(t, stats) # DATA FRAME OF 20 OBS BY 1053 COLUMNS
Add similar steps for the other nested objects not shown in the screenshots. Ultimately, main_df should be a data frame of only flat, atomic types of 20 obs by 1053 (24 + 1023) columns.
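As a minimal sketch of the idea above, the toy data frame below (hypothetical data, not the asker's file) has a list column and a data frame stored as a column. ncol() counts only the top-level columns, which is presumably why it reports fewer columns than a viewer that expands nested pieces. The names nested, is_nested and flat are illustrative only.

```r
# Toy stand-in for the nested structure (hypothetical data)
nested <- data.frame(id = 1:3)
nested$alt_names <- list("a", "b", "c")          # list column of length-1 chr vectors
nested$stats <- data.frame(goals = c(5, 7, 2),   # data frame stored as one column
                           wins  = c(1, 2, 0))

ncol(nested)  # counts top-level columns only

# Which top-level columns are lists or data frames rather than atomic vectors?
is_nested <- vapply(nested, is.list, logical(1))
names(nested)[is_nested]

# Flatten: unlist the list column, then pull the nested data frame out and cbind it back
nested$alt_names <- unlist(nested$alt_names)
stats <- nested$stats
nested$stats <- NULL
flat <- cbind(nested, stats)

# Every column should now be atomic, which is what a database writer expects
all(vapply(flat, is.atomic, logical(1)))
```

Running the is_nested check on the real object first gives a list of exactly which columns still need one of the flattening steps.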

Related

Write an excel or csv file in a way that the dataframes are listed on the same sheet, instead of multiple sheets in R

I have an object that contains multiple dataframes and wanted to produce a single excel worksheet with the data. I know there are ways of dealing with this problem in excel. But is there a way to manipulate it from the R side, so that people don't have to worry about the extra steps that weren't in my R script? I have been using this function (see below), but am open to another function. This function produces 1 excel file, but a worksheet for every dataframe. I have 119 dataframes. So this is not really practical.
write_xlsx(results1, "hpresponse1.MinimallyAdjusted")
I used the bind_rows. However, some of the data was lost. I am not sure how to retain it, especially as I don't even know what kind of data it is. But I turned my results for logistic regression into a dataframe so that I was able to perform certain manipulations. There are labels of some kind off to the left that are not variables. Can I turn these data into a variable so that it is retained when I use bind_rows?
(Up front, I'm assuming that the 119 frames are all different. Or at least that they are structured or used such that you must not combine them into a single frame, as stefan has suggested in the comments.)
I suggest you use the openxlsx package and offset each table individually.
L <- list(mtcars[1:3,1:3], mtcars[4:5,1:5], mtcars[6:10,1:4])
sapply(L, nrow)
# [1] 3 2 5
starts <- cumsum(c(1L, 2L + sapply(L[-length(L)], nrow)))
starts
# [1] 1 6 10
wb <- openxlsx::createWorkbook()
openxlsx::addWorksheet(wb, "quuxsheet")
ign <- Map(function(x, st) openxlsx::writeData(wb, sheet = "quuxsheet", x = x, startRow = st), L, starts)
openxlsx::saveWorkbook(wb, file = "quux.xlsx")
sapply(L, nrow) gives us the number of rows in each table in the list. This is used so that we know to offset so-many-rows after a table for the next table. Since we don't care about the number of rows in the last frame, we omit it with L[-length(L)]
2L + sapply(..) gives us a gap of 1 row between each frame in the worksheet: one of the two extra rows is taken by the header row that writeData writes by default, the other is the blank row. Change it to suit your needs.
cumsum(c(1L, 2L+sapply(..))) is because we need an absolute row number, not just the counts for each frame. This results in starts holding the first excel row number for each frame in L.
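The offset arithmetic can be checked on its own, without writing a workbook. The sketch below assumes each table is written with a header row (the writeData default) and uses a hypothetical gap variable for the number of blank rows between tables; frames, n_rows, starts and ends are illustrative names.

```r
frames <- list(mtcars[1:3, 1:3], mtcars[4:5, 1:5], mtcars[6:10, 1:4])
gap <- 1L  # blank rows wanted between tables (assumption: adjust as needed)

n_rows <- vapply(frames, nrow, integer(1))

# Each table occupies nrow + 1 rows (data plus header), then `gap` blank rows
starts <- cumsum(c(1L, n_rows[-length(n_rows)] + 1L + gap))

# Last worksheet row used by each table, header included
ends <- starts + n_rows

data.frame(start = starts, end = ends)
```

With gap set to 1L this reproduces the starts shown above; setting gap to 0L would stack the tables directly under each other.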

How to read data frame name from a csv and use it in a loop in R

I have created some 500 data frames in the global environment of RStudio and I have the list of all these data frame names in a csv file. I need to collect the no. of observations of each data frame. I know I need to use the nrow(dataset_name) command, but is there any way I can create a loop such that R reads the dataset names from the csv and executes the nrow command?
p.s.- I am a newbie to R, so plz pardon me if this question is very basic.
TIA!
Regards,
Brock
Sample Example Below:
# Creating 3 data frames
df1=data.frame(a=1:12,b=1:12) # Number of rows 12
df2=data.frame(a=1:5,b=1:5) # Number of rows 5
df3=data.frame(a=1:20,b=1:20) # Number of rows 20
# Storing the created data frames in a list
df_names=list(df1,df2,df3)
# Computing the number of rows for each data frame
for(i in 1:length(df_names)){
  print(nrow(df_names[[i]]))
}
# [1] 12
# [1] 5
# [1] 20
Note that here the list holds the data frames themselves, not their names, and nothing is read from the csv file.
If you have 500 data frames named df1, df2, ..., df499, df500, you can build the vector of names with paste0, fetch the corresponding objects with mget, and then apply the same for loop as above. (Wrapping paste in list() alone would only give a list holding one character vector of names, not the data frames.)
df_names=mget(paste0("df",1:500))
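To tie this back to the CSV of names in the question, here is a minimal sketch. It assumes the CSV has a single column called name (a hypothetical column name) and that the data frames live in the environment the code is run from; a temporary file stands in for the real CSV.

```r
# Three example data frames standing in for the ~500 in the question
df1 <- data.frame(a = 1:12, b = 1:12)
df2 <- data.frame(a = 1:5,  b = 1:5)
df3 <- data.frame(a = 1:20, b = 1:20)

# Stand-in for the CSV of names (hypothetical column name "name")
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(name = c("df1", "df2", "df3")), csv_path, row.names = FALSE)

# Read the names, look the objects up by name, and count rows
df_names <- read.csv(csv_path, stringsAsFactors = FALSE)$name
row_counts <- vapply(mget(df_names), nrow, integer(1))
row_counts
```

The result is a named integer vector, so each count stays attached to the data frame name it came from.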

Unexpected results when merging two vectors into a dataframe

I am extracting data from multiple CSV files and attempting to combine them into a single data frame. The source data is formatted weirdly, so I have to extract the data from specific locations in the source, then place them in a logical pattern in my resulting data frame.
I created two vectors of equal length and pulled the data from my source files. The end result is that I wind up with two vectors of length 3 (as expected), but instead of having a 3x2 data frame (3 observations of 2 variables), I wind up with a 1x6 data frame (1 observation of 6 variables).
What is curious to me is that although RStudio deems them both to be "List of 3", when I show them in the console, they display very differently:
The source code which doesn't work:
#set the working directory to where the data files are stored
setwd("/foo")
# identify how many data files are present
files = list.files("/foo")
# create vectors long enough to contain all the postal codes and income data
postalCodeData=vector(length=length(files))
medianIncomeData=vector(mode="character", length=length(files))
# loop through all the files, pulling data from rows 2 and 1585.
for(i in 1:length(files)) {
x = read.csv(files[i],skip=1,nrows= 1,header=F)
y = read.csv(files[i], skip = 1584, nrows = 1,header=F)
postalCodeData[i]=x
medianIncomeData[i]=y[2]
}
#create the data frame
Results=data.frame(postalCodeData,medianIncomeData)
#name the columns
names(Results)=c("FSA", "Median Income")
My data frame winds up looking like this:
Source code which does work:
setwd("/Users/Perry/Downloads/Postal Code Data/")
files = list.files("/Users/Perry/Downloads/Postal Code Data/")
postalCodeData=c("K0A","K0B","K0C")
medianIncomeData=c("10000","20000","30000")
Results=data.frame(postalCodeData,medianIncomeData)
names(Results)=c("FSA", "Median Income")
Unfortunately, I can't specify the values explicitly because I have a few hundred files to extract the information from. Any advice on how I can correct the loop to get the desired results would be appreciated.
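The output of "read.csv" is a data frame, so when you store
medianIncomeData[i]=y[2]
you are storing a column of a data frame, which turns your vector into a list. Note that y[2][1] would still be a one-column data frame. Index the single cell by row and column instead:
medianIncomeData[i]=as.character(y[1,2])
to store only the value that you want. The same applies to x: use as.character(x[1,1]).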
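A minimal sketch of the loop with single-cell indexing follows. The files are small temporary stand-ins (hypothetical layout: one junk line, then the postal code, then the income line), so the skip values differ from the asker's real files; colClasses = "character" is an added choice here so the values arrive as plain strings.

```r
# Build three tiny stand-in files in a temporary directory
tmp_dir <- tempfile("fsa")
dir.create(tmp_dir)
codes   <- c("K0A", "K0B", "K0C")
incomes <- c("10000", "20000", "30000")
for (k in seq_along(codes)) {
  writeLines(c("junk,header", paste0(codes[k], ",x"), paste0("Median income,", incomes[k])),
             file.path(tmp_dir, paste0(codes[k], ".csv")))
}

files <- list.files(tmp_dir, full.names = TRUE)
postalCodeData   <- character(length(files))
medianIncomeData <- character(length(files))

for (i in seq_along(files)) {
  x <- read.csv(files[i], skip = 1, nrows = 1, header = FALSE, colClasses = "character")
  y <- read.csv(files[i], skip = 2, nrows = 1, header = FALSE, colClasses = "character")
  postalCodeData[i]   <- x[1, 1]  # one cell, not a data frame
  medianIncomeData[i] <- y[1, 2]
}

# y[2] and y[2][1] are both data frames; y[1, 2] is a single value
c(class(y[2]), class(y[2][1]), class(y[1, 2]))

Results <- data.frame(FSA = postalCodeData, MedianIncome = medianIncomeData)
dim(Results)
```

Because both vectors stay atomic, data.frame() produces one row per file and one column per vector.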

Name list of data frames from data frame

I usually read a bunch of .csv files into a list of data frames and name their columns manually, like this:
#...code for creating the list named "datos" with files from library
# Naming the columns of the data frames
names(datos$v1r1)<-c("estado","tiempo","x1","x2","y1","y2")
names(datos$v1r2)<-c(...)
names(datos$v1r3)<-c(...)
I want to do this renaming operation automatically. To do so, I created a data frame with the names I want for each of the data frames in my datos list.
Here is how I generate this data frame:
pru<-rbind(c("UT","TR","UT+","TR+"),
c("UT","TR","UT+","TR+"),
c("TR","UT","TR+","UT+"),
c("TR","UT","TR+","UT+"))
vec<-paste("v1r",seq(1,20,1),sep="")
tor<-paste("v1s",seq(1,20,1),sep="")
nombres<-do.call("rbind", replicate(10, pru, simplify = FALSE))
nombres_df<-data.frame(corrida=c(vec,tor),nombres)
Because nombres_df$corrida[1] is v1r1, I have to name the datos$v1r1 columns ("estado","tiempo", nombres_df[1,2:5]), and so on for the other 40 elements.
I want to do this renaming automatically. I was thinking I could use something that uses regular expressions.
Just for the record, I don't know why but the order of the list of data frames is not the same as the 1:20 sequence (by this I mean 10 comes before 2,3,4...)
Here's a toy example of a list with a similar structure but fewer and shorter data frames.
toy<-list(a=replicate(6,1:5),b=replicate(6,10:14))
You have a data frame where the variable corrida is the name of the data frame to be renamed and the remaining columns are the desired variable names for that data frame. You could use a loop to do all the renaming operations:
for (i in seq_len(nrow(nombres_df))) {
  # look the data frame up by name, so the order of the list does not matter
  nm <- as.character(nombres_df$corrida[i])
  # unlist the one-row data frame into a plain character vector of names
  names(datos[[nm]]) <- c("estado","tiempo",as.character(unlist(nombres_df[i,-1])))
}
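Here is a minimal sketch of the same loop on a toy list, assuming a lookup table whose first column (corrida) holds the list element names. The objects toy and toy_names are illustrative, and the toy matrices are converted to data frames first.

```r
# Toy list of two data frames with six columns each
toy <- list(a = as.data.frame(replicate(6, 1:5)),
            b = as.data.frame(replicate(6, 10:14)))

# Lookup table: one row per list element, remaining columns are the new names
toy_names <- data.frame(corrida = c("a", "b"),
                        rbind(c("UT", "TR", "UT+", "TR+"),
                              c("TR", "UT", "TR+", "UT+")),
                        stringsAsFactors = FALSE)

for (i in seq_len(nrow(toy_names))) {
  target   <- toy_names$corrida[i]                    # element name, e.g. "a"
  new_cols <- as.character(unlist(toy_names[i, -1]))  # plain character vector
  names(toy[[target]]) <- c("estado", "tiempo", new_cols)
}

names(toy$a)
names(toy$b)
```

Since elements are matched by name rather than position, it does not matter that the list is sorted with 10 before 2.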

Looping and storing results over many data frames

I want to perform at least six looping steps in R. My data sets are 28 files stored in one folder. Each file has 22 rows (21 individual cases and one row for column names) and columns as follows: Id, id, PC1, PC2….PC20.
I intend to:
1. read each file into R as a data frame
2. delete the first column, named “Id”, in each data frame
3. arrange each data frame as follows:
   - first column should be “id” and
   - next ten columns should be first ten PCs (PC1, PC2, …PC10)
4. sort each data frame according to “id” (data frames should have the same order of individuals and their respective PC scores)
5. perform pairwise comparison by the protest function in the vegan package among all possible pairwise combinations (378 combinations)
6. store the result of each pairwise comparison in a symmetric (28*28) matrix which will be used in further analysis
At the moment I am able to do it manually for each pair of data (code is below):
## 1. step
## read files into R as a data frame
c_2d_hand_1a<-read.table("https://googledrive.com/host/0B90n5RdIvP6qbkNaUG1rTXN5OFE/PC scores, c_2d_hand-1a, Symmetric component.txt",header=T)
c_2d_hand_1b<-read.table("https://googledrive.com/host/0B90n5RdIvP6qbkNaUG1rTXN5OFE/PC scores, c_2d_hand-1b, Symmetric component.txt",header=T)
## 2. step
## delete first column named “Id” in the each data frame
c_2d_hand_1a[,1]<-NULL
c_2d_hand_1b[,1]<-NULL
## 3. step
## arrange each data frame that have 21 rows and 11 columns (id,PC1,PC2..PC10)
c_2d_hand_1a<-c_2d_hand_1a[,1:11]
c_2d_hand_1b<-c_2d_hand_1b[,1:11]
## 4. step
## sort each data frame according to “id”
c_2d_hand_1a<-c_2d_hand_1a[order(c_2d_hand_1a$id),]
c_2d_hand_1b<-c_2d_hand_1b[order(c_2d_hand_1b$id),]
## 5. step
## perform pairwise comparison by protest function
library(permute)
library(vegan)
c_2d_hand_1a_c_2d_hand_1b<-protest(c_2d_hand_1a[,2:ncol(c_2d_hand_1a)],c_2d_hand_1b[,2:ncol(c_2d_hand_1b)],permutations=10000)
summary(c_2d_hand_1a_c_2d_hand_1b)[2] ## or c_2d_hand_1a_c_2d_hand_1b[3]
Since I am a newbie in data handling/manipulation in R, my self-learning skills are suitable to perform respective steps manually, typing codes for each data set and perform each pairwise comparisons at the time. Since I need to perform those six steps 378 times, manual typing would be exhaustive and time consuming.
I tried to import files as a list and tried several operations, but I was unsuccessful. Specifically, using list.files(), I made the list, called “probe”. I was able to select certain data frame using e.g. probe[2]. Also I could assess column “Id” by e.g. probe[2][1], and deleted it by probe[2][1]<-NULL. But when I tried to work with for loop, I was stuck.
This code is untested, but with some luck, it should work. The summaries of the protest() results are stored in a matrix of lists.
# develop a way to easily reference all of the URLs
url.begin <- "https://googledrive.com/host/0B90n5RdIvP6qbkNaUG1rTXN5OFE/PC scores, "
url.middle <- c("c_2d_hand-1a", "c_2d_hand-1b")
url.end <- ", Symmetric component.txt"
L <- length(url.middle)
# read in all of the data and save it to a list of data frames
mybiglist <- lapply(url.middle, function(mid) read.table(paste0(url.begin, mid, url.end), header=TRUE))
# save columns 2 to 12 in each data frame and order by id
mybiglist11cols <- lapply(mybiglist, function(df) df[order(df$id), 2:12])
# get needed packages
library(permute)
library(vegan)
# create empty matrix of lists to store results
results <- matrix(vector("list", L*L), nrow=L, ncol=L)
# perform pairwise comparison by protest function
for(i in 1:L) {
  for(j in 1:L) {
    df1 <- mybiglist11cols[[i]]
    df2 <- mybiglist11cols[[j]]
    results[i, j] <- list(summary(protest(df1[, -1], df2[, -1], permutations=10000)))
  }
}
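If a numeric symmetric matrix is wanted rather than a matrix of lists, one option is to loop over the unique pairs only and mirror each value. The sketch below is an assumption-laden illustration: it uses random matrices and a stand-in statistic (pair_stat, a hypothetical helper) so that it runs without vegan. With vegan loaded, the statistic would presumably be taken from the protest() result instead (check ?protest for the element that holds the correlation).

```r
set.seed(1)
n_sets <- 4  # 28 in the question; 4 keeps the sketch quick

# Stand-in data: 21 individuals by 10 PCs per data set (random numbers)
score_list <- lapply(seq_len(n_sets), function(k) matrix(rnorm(21 * 10), nrow = 21))
names(score_list) <- paste0("set", seq_len(n_sets))

# Hypothetical stand-in for the protest statistic
pair_stat <- function(a, b) cor(as.vector(a), as.vector(b))

# All unique pairs: choose(n_sets, 2) columns, one pair per column
pairs <- combn(n_sets, 2)

sym <- matrix(NA_real_, n_sets, n_sets,
              dimnames = list(names(score_list), names(score_list)))
diag(sym) <- 1  # a data set compared with itself

for (k in seq_len(ncol(pairs))) {
  i <- pairs[1, k]
  j <- pairs[2, k]
  sym[i, j] <- sym[j, i] <- pair_stat(score_list[[i]], score_list[[j]])
}

sym
choose(28, 2)  # number of pairs for the full 28 data sets
```

Looping over combn() pairs halves the work compared with the full i by j loop, which matters when each comparison runs 10000 permutations.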
