rbind formatting issues when exporting to csv - r

What I am doing is crosstabulating a range of variables in a dataset (kol) with one crossing variable (cross) and then exporting the result into a csv.
library(dplyr)
cross <- dataset$crossingvariable
kol <- select(dataset,var.first:var.last)
myfunction <- function(x) {
  table1 <- table(cross, x)
  table2 <- prop.table(table1, 1)
  table3 <- data.frame(table2)
}
dtfr <- lapply(kol,FUN=myfunction)
dtfr2 <- do.call("rbind", dtfr)
write.csv(dtfr2, file="C:/Users/Gebruiker/Documents/statistics/test.csv")
The problem is that the two combining steps (dtfr and dtfr2) mess up the formatting. What I need is for every element of dtfr to be stacked vertically, one under another, when exporting to csv. dtfr2 does stack them, but in the process it also rearranges the tables. If I use as.data.frame.matrix for table3 (shown below), the formatting is still correct at the dtfr step, exactly how it looks in the R console; after dtfr2, however, the layout in the csv has moved around. The dtfr2 step is still needed, though, to get the tables stacked vertically in the csv file.
When I change table3 to:
library(dplyr)
cross <- dataset$crossingvariable
kol <- select(dataset,var.first:var.last)
myfunction <- function(x) {
  table1 <- table(cross, x)
  table2 <- prop.table(table1, 1)
  table3 <- as.data.frame.matrix(table2)
}
dtfr <- lapply(kol,FUN=myfunction)
dtfr2 <- do.call("rbind", dtfr)
write.csv(dtfr2, file="C:/Users/Gebruiker/Documents/statistics/test.csv")
dtfr now has a correct output (shown above). But at this point dtfr2 throws the following error message when I run the script:
Error in rbind(deparse.level, ...) :
numbers of columns of arguments do not match
I assume it has something to do with unequal numbers of columns in the different variables I am crosstabulating. Most of them are nominal or ordinal variables, so some of them have three categories and some have five.
I can't manage to find a fix, mostly because of my limited R knowledge. I googled the problem and searched Stack Overflow, which turned up some similar problems, but none of the offered solutions worked in my case. Again, this has a lot to do with me not knowing enough about R yet, but some help pointing me in the right direction would be very much appreciated.
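For what it's worth, here is a hedged sketch (not from the original post; the function and column names are invented) of one way around the column-mismatch error: pad every table out to the union of all column names before calling rbind().

```r
# pad each data frame to the union of all column names, then rbind
pad_and_rbind <- function(df_list, fill = 0) {
  all_cols <- unique(unlist(lapply(df_list, names)))
  padded <- lapply(df_list, function(d) {
    missing <- setdiff(all_cols, names(d))
    if (length(missing)) d[missing] <- fill  # absent categories become `fill`
    d[all_cols]                              # consistent column order
  })
  do.call(rbind, padded)
}

# toy example: a 3-category and a 5-category variable
a <- data.frame(low = 0.2, mid = 0.5, high = 0.3)
b <- data.frame(low = 0.1, mid = 0.2, high = 0.3, higher = 0.2, top = 0.2)
res <- pad_and_rbind(list(a, b))  # 2 rows, 5 columns; a's extra cells are 0
```

Whether filling with 0 or NA is right depends on how the percentages should read in the final csv.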


Error: Files must have consistent column names: * File 1 column 5 is: 1 * File 2 column 5 is: 0

Note: all of the datasets and scripts referenced here can be easily found in my
GitHub Repository for this research project. The code referenced and reprinted both in this question and the one linked to in the previous question should be in the script called "Both BE & FS script(40) - prepped for Antony" (Dr. Antony Davies is the principal researcher on this project).
This question is technically a follow-up to a previous question, which was itself a follow-up to a question before that. I firmly believe that the answer proposed for the previous question (which can be found in the link) would work in most cases, but my datasets have an unusual characteristic: they have no headers. Their top rows are not the variable names; the variable names are in the 3rd rows. Instead, the top rows record which variables were selected by a variable selection algorithm (LASSO, Backward Stepwise, or Forward Stepwise), denoted as binary indicators, basically dummy variables, as can be seen in the screenshot below:
This peculiar feature of these datasets I am loading into R and storing in a list/dataframe object I call 'datasets' means that when I ran the proposed solution to the previous question, I got this back in the Console:
all_data <- vroom(list.files(pattern = 'csv'), id = 'source_file')
Error: Files must have consistent column names: * File 1 column 5 is: 1 * File 2 column 5 is: 0
I can clearly see what the problem is here, but I don't know how to fix it without asking my collaborator on this research project to edit part of the custom Excel Macro he wrote and used to generate the 260,000 synthetic datasets (the ones we are running his new algorithm and the several benchmark algorithms on), and then having him run it again on his second computer for half a week or more, just so I can avoid this error at this very last step in measuring the performance of the benchmark methods!
p.s. 1 - Just to make myself clearer: the best format I could have after transforming their current state would be for the top row to be what the 2nd row currently is, except with an X in front of every number (or, equivalently, what is in the 3rd row starting with column #2), with what is currently in the 1st row sitting right below that proper 1st row, or header. I already know exactly how to write functions to score the selections if I can get the output into that format!
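A hedged sketch of that transformation (the layout and values below are invented from the description, not taken from the real datasets): promote the row holding the variable names to the header, and keep the current 1st row, the selection dummies, as the first data row.

```r
# toy stand-in for one raw dataset read with header = FALSE:
# row 1 = selection dummies, row 3 = variable names, row 4+ = data
raw <- data.frame(V1 = c("0", "1", "Y",  "0.5"),
                  V2 = c("1", "2", "X1", "0.2"),
                  V3 = c("0", "3", "X2", "0.9"))
fixed <- raw
names(fixed) <- as.character(unlist(raw[3, ]))  # 3rd row becomes the header
fixed <- fixed[-(2:3), ]                        # drop the leftover rows 2 and 3
# fixed now has the dummies row directly under the proper header
```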
p.s. 2 - Again, to show how I got here (in case it is too much trouble to quickly toggle between two tabs, or between two tabs and an R script), I use the following code to get to where my issue is:
setwd("/datasets folder's location on my laptop")
filepaths_list <- list.files(path = ".", full.names = TRUE, recursive = TRUE)  # "." = the folder set above
# reformat the names of each of the csv file formatted datasets
DS_names_list <- basename(filepaths_list)
DS_names_list <- tools::file_path_sans_ext(DS_names_list)
datasets <- lapply(filepaths_list, read.csv, header = FALSE)
True_IVs <- lapply(datasets, function(j) {j[1, -1]})
datasets <- lapply(datasets, function(i) {i[-1:-3, ]})
datasets <- lapply(datasets, \(X) { lapply(X, as.numeric) })
datasets <- lapply(datasets, function(i) { as.data.frame(i) })
# the line below won't run without the line above it having been run 1st
datasets <- lapply(datasets, \(X) { round(X, 3) })
# fit all of the Backward Stepwise Regressions (in parallel)
library(parallel)
CL <- makeCluster(detectCores() - 1)
set.seed(11) # for reproducibility
system.time(BE.fits <- parLapply(CL, datasets, \(X) {
  full_models <- lm(X$V1 ~ ., X)
  back <- step(full_models, scope = formula(full_models),
               direction = 'backward', trace = FALSE) }) )
# extract their fitted coefficients
BE_Coeffs <- lapply(seq_along(BE.fits), function(i) coef(BE.fits[[i]]))
# extract just the candidate variables they select by name
# this is one of the two components needed for performance measurement
IVs_Selected_by_BE <- lapply(seq_along(BE.fits),
\(i) names(coef(BE.fits[[i]])[-1]))
# this is the other component needed for performance measurement
True_Regressors <- lapply(True_IVs, function(i){names(i)[i == 1]})
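With those two components in hand, the scoring itself could look something like the sketch below; the scoring rule (simple true/false positive counts) is my assumption, not taken from the post.

```r
# count correct and incorrect selections for one dataset
score_selection <- function(selected, truth) {
  c(true_pos  = length(intersect(selected, truth)),
    false_pos = length(setdiff(selected, truth)),
    false_neg = length(setdiff(truth, selected)))
}

# toy illustration with invented variable names
s <- score_selection(selected = c("X1", "X3", "X7"),
                     truth    = c("X1", "X2", "X3"))
s  # true_pos 2, false_pos 1, false_neg 1
```

Applied across all datasets, this would be `Map(score_selection, IVs_Selected_by_BE, True_Regressors)`.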

Looping through variables to produce tables of percentages

I am very new to R and would appreciate any advice. I come from a Stata background, so I am still learning to think in R. I am trying to produce tables of percentages for my 20 binary variables. I have tried a for loop, but I am not sure where I am going wrong, as there is no warning message.
for (i in 1:ncol(MAAS1r[varbinary])) {
  varprop <- varbinary[i]
  my.table <- table(MAAS1r[varprop])
  my.prop <- prop.table(my.table)
  cbind(my.table, my.prop)
}
Many thanks
I made an example extracted from mtcars.
These are two binary (0 or 1) variables, called vs and am:
mtcarsBivar<- mtcars[,c(8,9)]
get names of the columns:
varbinary <- colnames(mtcarsBivar)
use dplyr to do it:
library(dplyr)
make an empty list to populate
Binary_table <- list()
now fill it with the loop:
for (i in 1:length(varbinary)) {
  # use column i (not column 1), so each variable gets its own percentage
  Binary_table[[i]] <- summarise(mtcarsBivar, percent_1 = sum(mtcarsBivar[, i] == 1) / nrow(mtcarsBivar))
}
Transform it to a data frame
Binary_table <- do.call("cbind", Binary_table)
give the columns the names from varbinary
colnames(Binary_table) <- varbinary
this only works if all your variables are binary
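As an aside (my addition, not part of the original answer): for 0/1 columns the whole loop can be replaced by one vectorized call, since the mean of a binary column is exactly the proportion of 1s.

```r
# proportion of 1s in each binary column (same vs/am columns of mtcars)
mtcarsBivar <- mtcars[, c("vs", "am")]
Binary_table <- colMeans(mtcarsBivar)
Binary_table  # named vector of proportions, one entry per variable
```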

Delete row after row in for loop

I have a large character vector, file, and I need to draw a random sample from it. This works fine. But I need to draw sample after sample, and for that I want to shorten file by removing every element that has already been drawn (so that I can draw a new sample without drawing the same element more than once).
I've got a solution of sorts, but I'm interested in anything else that might work faster and, more importantly, more correctly.
Here are my tries:
Approach 1
file <- rep(1:10000)
rand_no <- sample(file, 100)
library(car)
a <- data.frame()
for (i in 1:length(rand_no)) {
  a <- rbind(a, which.names(rand_no[i], file))
  file <- file[-a[1, 1]]
}
Problem:
Warning message:
In which.names(rand_no[i], file) : 297 not matched
Approach 2
file <- rep(1:10000)
rand_no <- sample(file, 100)
library(car)
deleter <- function(i) {
  a <- which.names(rand_no[i], file)
  file <- file[-a]
}
lapply(1:length(rand_no), deleter)
Problem:
This doesn't work at all. Maybe I should split the question, because the second problem clearly lies with me not fully understanding lapply.
Thanks for any suggestions.
Edit
I hoped that it would work with numbers, but of course file actually looks like this:
file <- c("Post-19960101T000000Z-1.tsv", "Post-19960101T000000Z-2.tsv", "Post-19960101T000000Z-3.tsv","Post-19960101T000000Z-4.tsv", "Post-19960101T000000Z-5.tsv", "Post-19960101T000000Z-6.tsv", "Post-19960101T000000Z-7.tsv","Post-19960101T000000Z-9.tsv")
Of course rand_no can't be 100 files with such a small vector. Therefore:
rand_no <- sample(file, 2)
Use a list instead of c. Then you can set the values to NULL and they will be removed.
file[file %in% rand_no] <- NULL
This finds all instances of rand_no in file and removes them.
file <- list("Post-19960101T000000Z-1.tsv",
"Post-19960101T000000Z-2.tsv",
"Post-19960101T000000Z-3.tsv",
"Post-19960101T000000Z-4.tsv",
"Post-19960101T000000Z-5.tsv",
"Post-19960101T000000Z-6.tsv",
"Post-19960101T000000Z-7.tsv",
"Post-19960101T000000Z-9.tsv")
rand_no <- sample(file, 2)
library(car) #From poster's code.
file[file %in% rand_no] <- NULL
If you are working with a large list of files, using %in% to compare strings may bog you down. In that case I would use indexes.
file <- list("Post-19960101T000000Z-1.tsv",
"Post-19960101T000000Z-2.tsv",
"Post-19960101T000000Z-3.tsv",
"Post-19960101T000000Z-4.tsv",
"Post-19960101T000000Z-5.tsv",
"Post-19960101T000000Z-6.tsv",
"Post-19960101T000000Z-7.tsv",
"Post-19960101T000000Z-9.tsv")
rand_no <- sample(1:length(file), 2)
library(car) #From poster's code.
file[rand_no] <- NULL
sample() already returns values in a permuted order with no replacement (unless you set replace = TRUE). So it will never pick a value twice.
So if you want three sets of 100 samples that don't share any elements, you can use
file <- rep(1:10000)
rand_no <- sample(seq_along(file), 300)
s1<-file[rand_no[1:100]]
s2<-file[rand_no[101:200]]
s3<-file[rand_no[201:300]]
Or if you want to decrease the total size by 100 each time, you could do
s1<-file[-rand_no[1:100]]
s2<-file[-rand_no[1:200]]
s3<-file[-rand_no[1:300]]
A simple approach would be to select random indices and then remove those indices:
file <- 1:10000 # Build sample data
ind <- sample(seq(length(file)), 100) # Select random indices
rand_no <- file[ind] # Compute the actual values selected
file <- file[-ind] # Remove selected indices
I think using sample and split could be a nice way of doing this without having to alter your files variable. I'm not a big fan of mutation unless you really need it, and this way you know exactly which files you used for each chunk of the analysis going forward.
files<-paste("file",1:100,sep="_")
randfiles<-sample(files, 50)
randfiles_chunks<-split(randfiles,seq(1,length(randfiles), by=10))

Create several data.frames via a for loop and name them accordingly

I want to apply a for loop to every element of a list (the station codes of air quality stations) and create a single data.frame for each station with specific data.
My current code looks like this:
for (i in Stations) {
  i_PM <- data.frame(PM2.5$DateTime, PM2.5$i)
  colnames(i_PM)[1] <- "DateTime"
  i_AOT <- subset(MOD2011, MOD2011$Station_ID == i)
  i <- merge(i_PM, i_AOT, by = "DateTime")
}
Stations consists of 28 elements. The result should be a data.frame for every station with the columns DateTime, PM2.5, and several elements from MOD2011.
I just don't get it running as it's supposed to. I'm sure it's my fault; I couldn't find the specific answer on the internet.
Can you show me my mistake?
Try assign:
for (i in Stations) {
  dat <- data.frame(DateTime = PM2.5$DateTime, PM2.5 = PM2.5[[i]])  # [[i]] -- `$i` would look for a column literally named "i"
  dat2 <- subset(MOD2011, MOD2011$Station_ID == i)
  assign(paste(i, "_PM", sep = ""), dat)
  assign(paste(i, "_AOT", sep = ""), dat2)
  assign(i, merge(dat, dat2, by = "DateTime"))
}
Note, however, that this is bad coding practice. You should reconsider your algorithm. For instance, use a list instead.
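Here is a sketch of the list-based version that note recommends; Stations, PM2.5, and MOD2011 below are invented stand-ins for the poster's objects, so the example runs on its own.

```r
# stand-in data (invented): two stations, three time points
Stations <- c("S1", "S2")
PM2.5    <- data.frame(DateTime = 1:3, S1 = c(10, 12, 9), S2 = c(7, 8, 11))
MOD2011  <- data.frame(DateTime = rep(1:3, times = 2),
                       Station_ID = rep(Stations, each = 3),
                       AOT = runif(6))

# one merged data.frame per station, kept together in a named list
station_data <- lapply(Stations, function(i) {
  i_PM <- data.frame(DateTime = PM2.5$DateTime, PM2.5 = PM2.5[[i]])
  i_AOT <- MOD2011[MOD2011$Station_ID == i, ]
  merge(i_PM, i_AOT, by = "DateTime")
})
names(station_data) <- Stations  # access as station_data$S1, station_data$S2, ...
```

A named list keeps all 28 station data.frames in one object, which is much easier to loop over later than 28 variables created with assign().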

merge tables in Loop using R

I have a simple question about a loop that I wrote. I want to access different files in different directories, extract data from these files, and combine it into one table. My problem is that my loop is not accumulating the results from the different files, but only overwriting them with the species currently in the loop. Here is my code:
for (i in 1:length(splist.par)) {
  results <- read.csv(paste(getwd(), "/ResultsR10arcabiotic/", splist.par[i], "/", "maxentResults.csv", sep = ""), h = T)
  species <- splist.par[i]
  AUC <- results$Test.AUC[1:10]
  AUC_SD <- results$AUC.Standard.Deviation[1:10]
  Variable <- "a"
  Resolution <- "10arc"
  table <- cbind(species, AUC, AUC_SD, Variable, Resolution)
}
This is probably an easy question, but I am not an experienced programmer. Thanks for the attention.
Gabriel
I'd use lapply to get the desired data from each file and add the Species information, and then combine with rbind. Something like this (untested):
do.call(rbind, lapply(splist.par, function(x) {
  d <- read.csv(file.path("ResultsR10arcabiotic", x, "maxentResults.csv"))
  d <- d[1:10, c("Test.AUC", "AUC.Standard.Deviation")]
  names(d) <- c("AUC", "AUC_SD")
  cbind(Species = x, d, stringsAsFactors = FALSE)
}))
Aaron's lapply answer is good, and clean. But to debug your code: you put a bunch of data into table but overwrite it on every pass through the loop. You need to accumulate instead, e.g. initialize table <- NULL before the loop and then do
table <- rbind(table, cbind(species, AUC, AUC_SD, Variable, Resolution))
BTW, since table is a function in R, I'd avoid using it as a variable name. Imagine:
table(table)
:-)
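A minimal self-contained illustration of that overwrite-vs-accumulate point (the data here is invented; the real code reads it from the maxentResults.csv files):

```r
species_list <- c("sp1", "sp2", "sp3")
out <- NULL
for (sp in species_list) {
  AUC <- runif(2)  # stand-in for the ten values read per file
  out <- rbind(out, data.frame(species = sp, AUC = AUC))  # grow, don't overwrite
}
nrow(out)  # 6: two rows per species, all three species kept
```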
