function for randomizing (row-wise) df, splitting, then writing to csv - r

So, my goal is to write a function that will take as input any csv file, an output path, and an arbitrary number of split sizes (by number of rows), and then randomize and split the data into the appropriate files. I could really easily do this manually if I know the split sizes ahead of time, but I want an automated function that will handle varying split sizes. Seems straightforward, and here's what I had written:
randomizer = function(startFile, endPath, ...){ ##where ... are the user-defined split sizes
vec = unlist(list(...))
n_files = length(vec)
values = read.csv(startFile, stringsAsFactors = FALSE)
values_rand = as.data.frame(values[sample(nrow(values)),])
for(i in 1:n_files){
if(nrow(values_rand)!=0 & !is.null(nrow(values_rand))){
assign(paste('group', i , sep=''), values_rand[1:vec[i], ]);
values_rand = as.data.frame(values_rand[(vec[i]+1):nrow(values_rand), ], stringsAsFactors = FALSE)
## (A) write.csv fn here?
} else {
print("something went wrong")
}
}
## (B) write.csv fn here?
}
}
When I try to do something in place (A) like write.csv(x = paste('group', i, sep=''), file = paste(endPath, '/group', i, '.csv', sep=''), row.names=FALSE), I get errors or literally write the string "group1" to a csv, rather than the chunk of the randomized data frame I'm looking for. I'm super confused, as this seems like I'm running up against R semantics rather than a genuine programming issue. Thanks in advance for the help.

You have indeed programmed yourself into a corner here, and it's a common one for beginners to end up in, particularly beginners that are coming to R from other programming languages.
The use of assign is the big red flag. At least when you're starting out in the language, if you feel yourself reaching for that function, stop and think again. You're most likely approaching the problem entirely wrong and need to rethink it.
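To make the point about assign concrete, here is a minimal sketch of the usual alternative: keep the pieces in a named list. The name chunks and the toy data frames are made up for illustration. It also shows why the write.csv attempt in the question wrote text: paste('group', i, sep='') only builds a character string, it does not look up an object of that name.

```r
# Collect the pieces in a named list instead of creating group1, group2, ...
# as separate variables with assign().
chunks <- list()
for (i in 1:3) {
  # Toy data frame standing in for one chunk of the real data.
  chunks[[paste0("group", i)]] <- data.frame(id = i, value = i * 10)
}

# paste0() only ever produces a string ...
label <- paste0("group", 2)

# ... but that string can be used to index the list and get the object back.
second <- chunks[[label]]
```

With a list, the later steps (writing files, summarising) become a loop or lapply over chunks, with no need to reconstruct variable names.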
Here is my (entirely untested) version of what you described, annotated with some comments:
split_file <- function(startFile, endPath, sizes){
  #There's no need to use "..." for the partition sizes.
  # A simple vector of values is much simpler
  values <- read.csv(startFile, stringsAsFactors = FALSE)
  if (sum(sizes) != nrow(values)){
    #I'm assuming here that we're not doing anything fancy with bad input
    stop("sizes do not evenly partition data!")
  }else{
    #Shuffle data frame
    # Note we don't need as.data.frame(); drop = FALSE keeps a one-column file as a data frame
    values <- values[sample(nrow(values)), , drop = FALSE]
    #Split data frame: one group label per row, e.g. sizes = c(2,3) gives 1 1 2 2 2
    values <- split(values, rep(seq_along(sizes), times = sizes))
    #Create the output file paths
    paths <- file.path(endPath, paste0("group_", seq_along(sizes), ".csv"))
    #We could shoe-horn this into lapply, but there's no real need
    for (i in seq_along(values)){
      write.csv(x = values[[i]], file = paths[i], row.names = FALSE)
    }
  }
}
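As a small self-contained sketch of the shuffle-then-split idea, the following works on a toy data frame rather than a real CSV. The data, the sizes, and the temporary output directory are all assumptions made for illustration. The key step is building one group label per row with rep(), so that split() cuts the shuffled rows into chunks of the requested sizes.

```r
# Toy data frame standing in for the contents of the input CSV.
values <- data.frame(id = 1:10, score = seq(10, 100, by = 10))
sizes <- c(3, 5, 2)  # requested group sizes; must sum to nrow(values)
stopifnot(sum(sizes) == nrow(values))

set.seed(42)  # only so the shuffle is reproducible
shuffled <- values[sample(nrow(values)), ]

# One group label per row: 1 1 1 2 2 2 2 2 3 3
labels <- rep(seq_along(sizes), times = sizes)
groups <- split(shuffled, labels)

# Write each group to its own file in a temporary directory.
out_dir <- tempfile("groups_")
dir.create(out_dir)
paths <- file.path(out_dir, paste0("group_", seq_along(groups), ".csv"))
for (i in seq_along(groups)) {
  write.csv(groups[[i]], file = paths[i], row.names = FALSE)
}

# Row counts of the files as read back from disk.
written_sizes <- vapply(paths, function(p) nrow(read.csv(p)), integer(1))
```

Because the rows are shuffled before the labels are attached, each file holds a random subset, and every row ends up in exactly one file.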

Related

For loop returns last result

I have a small number of csv files, each containing two columns with numeric values. I want to write a for loop that reads the files, sums the columns, and stores the sum totals for each csv in a numeric vector. This is the closest I've come:
allfiles <- list.files()
for (i in seq(allfiles)) {
total <- numeric()
total[i] <- sum(subset(read.csv(allfiles[i]), select=Gift.1), subset(read.csv(allfiles[i]), select=Gift.2))
total
}
My result is all NAs save a value for the last file. I understand that I'm overwriting the result each time the for loop executes, and I think I need to do something with indexing.
The first problem is that you are not pre-allocating the right length of (or properly appending to) total. Regardless, I recommend against that method.
There are several ways to do this, but the R-onic way (my term, based on "pythonic"; I know, it doesn't flow well) is based on vectors/lists.
alldata <- sapply(allfiles, read.csv, simplify = FALSE)
totals <- sapply(alldata, function(a) sum(subset(a, select=Gift.1), subset(a, select=Gift.2)))
I often like to do that, keeping the "raw/unaltered" data in one list and then repeatedly extracting from it. For instance, if the files are huge and reading them takes a non-trivial amount of time, and you then realize you also need Gift.3, doing it your way would mean re-reading the entire dataset. Using my method, however, you just update the second sapply to include the change and rerun it on the already-loaded data. (Most of my rationale is based on untrusted data, portions that are typically unused, or other factors that may not apply to you.)
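The read-once, extract-many idea can be sketched with two small throwaway files. The file names, values, and the extra Gift.3 column are assumptions for illustration, and vapply is used in place of sapply only to pin the result type to one number per file.

```r
# Create two small CSV files in a temporary directory (made-up data).
tmp <- tempfile("gifts_")
dir.create(tmp)
write.csv(data.frame(Gift.1 = c(1, 2), Gift.2 = c(3, 4), Gift.3 = c(5, 6)),
          file.path(tmp, "a.csv"), row.names = FALSE)
write.csv(data.frame(Gift.1 = c(10, 20), Gift.2 = c(30, 40), Gift.3 = c(50, 60)),
          file.path(tmp, "b.csv"), row.names = FALSE)

# Step 1: read every file exactly once and keep the raw data frames in a list.
allfiles <- list.files(tmp, full.names = TRUE)
alldata <- lapply(allfiles, read.csv)
names(alldata) <- basename(allfiles)

# Step 2: compute one total per file from the already-loaded data.
totals12 <- vapply(alldata, function(a) sum(a$Gift.1, a$Gift.2), numeric(1))

# Later change of mind: include Gift.3 as well. No file is read again.
totals123 <- vapply(alldata, function(a) sum(a$Gift.1, a$Gift.2, a$Gift.3),
                    numeric(1))
```

Only the second step changes when the requirements change, which is the point of keeping the raw list around.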
If you really wanted to reduce the code to a single line, something like:
totals <- sapply(allfiles, function(fn) {
x <- read.csv(fn)
sum(subset(x, select=Gift.1), subset(x, select=Gift.2))
})
allfiles <- list.files()
total <- numeric()
for (i in seq(allfiles)) {
total[i] <- sum(subset(read.csv(allfiles[i]), select=Gift.1), subset(read.csv(allfiles[i]), select=Gift.2))
}
total
If possible, give total a known length beforehand, i.e. total <- numeric(length(allfiles)).
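A stripped-down sketch of why the original loop left NAs everywhere except the last slot, next to the preallocated version. The vector file_sums is a made-up stand-in for the per-file sums so that no files are needed.

```r
# Stand-ins for the per-file column sums (hypothetical values).
file_sums <- c(3, 7, 11)

# Pattern from the question: the result vector is re-created on every pass,
# so only the value assigned in the final iteration survives.
for (i in seq_along(file_sums)) {
  broken <- numeric()
  broken[i] <- file_sums[i]
}

# Fixed pattern: allocate once, before the loop, then fill in place.
total <- numeric(length(file_sums))
for (i in seq_along(file_sums)) {
  total[i] <- file_sums[i]
}
```

Assigning to position i of an empty vector pads the earlier positions with NA, which is exactly the symptom described in the question.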

How to cycle through variables without a for-loop

After using R for the last little bit, I have distanced myself from using for loops for everything, but I still don't know how to cycle through names without using for loops. Whenever I am processing multiple things, I use for loops as a way to cover all my bases in one go. Here is a mock example of something I would do. Is there a simpler way to go about doing this?
names <- c("John_Doe","Jane_Doe")
employee <- vector(length = length(names))
for(i in 1:length(names)){
filename <- paste0(names[i],".csv")
employee[i] <- read.csv(filename,header = FALSE)
}
Not sure if it's simpler, but you could try this:
dfs <- lapply(seq_along(names), function(i) read.csv(paste0(names[i], ".csv"), header = FALSE))
Instead of looping, this applies the same function to each index from 1 through the length of your names vector and returns the results as a list.
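A possible variation, sketched under the assumption that the files sit in one directory: iterate over the names themselves rather than over their indices, and name the resulting list so each data frame can be looked up by employee. The temporary directory and file contents below are made up so the example runs on its own.

```r
# Write two small header-less CSV files to a temporary directory (toy data).
tmp <- tempfile("employees_")
dir.create(tmp)
employee_names <- c("John_Doe", "Jane_Doe")
for (nm in employee_names) {
  write.table(data.frame(a = 1:2, b = c("x", "y")),
              file.path(tmp, paste0(nm, ".csv")),
              sep = ",", row.names = FALSE, col.names = FALSE)
}

# Apply read.csv to each name directly; no index variable is needed.
employees <- lapply(employee_names, function(nm) {
  read.csv(file.path(tmp, paste0(nm, ".csv")), header = FALSE)
})

# Name the list elements so they can be retrieved by employee.
names(employees) <- employee_names
jane <- employees[["Jane_Doe"]]
```

A list is used because each element is a whole data frame; a plain vector, as in the question, cannot hold one data frame per slot.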

How to modify every data frame in a list in R

I couldn't find a solution to this, even though it seems easy.
I have a list of data frames and a very large piece of code (not just an apply call, but a bunch of for loops, table creation, and so on) that I want to apply to each data frame, that is, to each element of the list.
I thought of looping over the list and going through it data frame by data frame, but how can I extract the current element to work on it?
(My code is about 450 lines. Instead of replacing the data frame name with the next name by hand, I want this to happen automatically.)
dbR<-list()
for (i in datedeb:datefin)
{
sqlst<-paste("SELECT * FROM `cl4d6-2015/09/",sprintf("%02d",i),"`",sep="")
nomcl<-paste0("cl",sprintf("%02d",i),sep="")
dbR[[nomcl]]<-dbGetQuery(db,sqlst)
}
for (i in dbR)
{
#mycode
}
Please see the below example code.
dbR <- sapply(1:30, simplify = FALSE, USE.NAMES = TRUE, FUN = function(i) {
dt <- dbGetQuery(db, paste("SELECT * FROM `cl4d6-2015/09/",
sprintf("%02d", i), "`", sep = ""))
#mycode
return(dt)
})
The code above will operate across 1:30 similar to a for loop, except the output is automatically saved as individual list entries. We save the list of dataframes out to dbR. You can also add your code operating on the dataframes subsequent to the data read-in.
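sapply() with simplify = FALSE returns a list, just like lapply(). USE.NAMES = TRUE additionally names the elements after the input, but only when the input is a character vector; with a numeric input such as 1:30 the result is unnamed, so set the names yourself if you need them.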
The apply family of functions is not always intuitive, but they're powerful and concise (though not necessarily faster than a well-written for loop). They're also easily translated into parallel operations. I recommend getting comfortable with them.
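For the case where the list of data frames already exists, a minimal sketch of applying one block of code to every element is to wrap that code in a function and pass it to lapply. The function process_one and the toy data frames are hypothetical stand-ins for the 450 lines and the query results. The last two lines check the naming behaviour of USE.NAMES.

```r
# Toy list standing in for the data frames returned by the database queries.
dbR <- list(cl01 = data.frame(x = 1:3), cl02 = data.frame(x = 4:6))

# Wrap the per-data-frame code in a function that takes one data frame
# and returns the modified data frame.
process_one <- function(df) {
  df$x_doubled <- df$x * 2
  df
}

# lapply hands each element to the function and keeps the list names.
dbR_processed <- lapply(dbR, process_one)

# Equivalent for loop: iterate over the names and extract with [[ ]].
dbR_loop <- dbR
for (nm in names(dbR_loop)) {
  dbR_loop[[nm]] <- process_one(dbR_loop[[nm]])
}

# USE.NAMES only produces names for character input.
names_from_numeric <- names(sapply(1:2, function(i) i, simplify = FALSE))
names_from_character <- names(sapply(c("a", "b"), function(s) s, simplify = FALSE))
```

Inside the function, df is "the current element", which answers the question of how to get hold of it without editing a variable name for each data frame.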

Using a function to create multiple data frames from a 3D array

Thanks for reading. I am new to functions and want to make my scripts much more readable, as they will be shared around, and wondered if you could help.
I used to have this piece of script to create multiple new data frames from a 3D data set.
for (i in 1:ncol(x$data))
{nam <- paste("timepoint",deparse(i), sep=".")
assign(nam, as.data.frame(print(x$data[,i,])))}
Now I want this as a function, not sure on the best return format, but likely a list so that my data is contained within the list as listname$timepoint1
So far I have....
funcs <- list()
step2<-function(funcs, x){
for(i in ncol(x$data))
{
timepoint <- paste( "timepoint", i, sep = '' )
funcs[[timepoint]] = assign(timepoint[i],(print(x$data[,i,])))
}
return(funcs)
}
tester<-step2(funcs,x)
This works, except it only returns the final timepoint, giving at the end:
summary(tester)
Length Class Mode
timepoint79 25600 -none- numeric
Thanks a lot.
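One likely cause of the behaviour described above is that for(i in ncol(x$data)) iterates over a single value, the column count itself, rather than over 1 up to that count. A minimal sketch of a list-returning version follows; it assumes x$data is a three-dimensional array whose second dimension indexes timepoints, and the function name split_timepoints and the toy array are made up for illustration.

```r
# Toy 3D array standing in for x$data: 2 rows, 3 timepoints, 4 variables.
x <- list(data = array(1:24, dim = c(2, 3, 4)))

split_timepoints <- function(x) {
  # seq_len() gives 1, 2, ..., n; ncol() of an array is its second dimension.
  idx <- seq_len(ncol(x$data))
  # One data frame per timepoint, built without assign().
  out <- lapply(idx, function(i) as.data.frame(x$data[, i, ]))
  names(out) <- paste0("timepoint", idx)
  out
}

tester <- split_timepoints(x)
```

The result can then be used as tester$timepoint1, tester$timepoint2, and so on, which matches the list format asked for in the question.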

Looping a function over Lists in R

I have a list of samples, each of varying lengths. I need to compare sample means (using a Mann-Whitney-Wilcoxon test) for all samples in the list. Current code is as follows:
wilcox.v = list() ##This creates the list of samples
for (i in df){
treat = list(i$treatment)
wilcox.v = c(wilcox.v,treat)
}
###This *should* iterate over all items in the list
wilcox = sapply(wilcox.v, function(i){ wilcox.test(as.numeric(wilcox.v[i,]), as.numeric(wilcox.v[-i,]), exact = FALSE)$p.value
})
I'd like to have the function return a vector of p-values, so that the broader function can re-sample if necessary.
The problem seems to lie in the need to compare a sample mean to all other sample means in the list.
I'm sure there's an easy way to do this (and I think it has something to do with calling indicies correctly), but I'm not sure!
As joran said, you wrote your apply function a little wonky. There are two ways you can fix this.
Modify it so i is in fact an index reference:
#wilcox.v[-i] is the list without element i; unlist() pools the remaining samples
wilcox = sapply(seq_along(wilcox.v)
,function(i){ wilcox.test(as.numeric(wilcox.v[[i]])
,as.numeric(unlist(wilcox.v[-i])), exact = FALSE)$p.value
})
Modify your function so it appropriately treats i as a list element. I'll leave this as an exercise to you (primarily since I don't want to deal with the wilcox.v[-i,] term).
Thanks for your help! This is the solution I ended up using. It's hardly elegant but it gets the job done.
mannwhit = vector()
for (i in mannwhit.v){
for (j in mannwhit.v){
if (identical(i,j) == FALSE){
p.val = wilcox.test(i, j, paired=FALSE)$p.value
mannwhit = c(mannwhit, p.val)
}
}
}
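A more compact way to get one p-value per pair of samples, offered as a sketch rather than a drop-in replacement: enumerate the pairs with combn and run the test on each. The list samples below is simulated data standing in for mannwhit.v. Unlike the nested loop, this visits each unordered pair once instead of twice, and it does not skip two samples that happen to hold identical values.

```r
set.seed(1)  # only so the simulated samples are reproducible

# Simulated samples of different lengths (made-up data).
samples <- list(a = rnorm(8), b = rnorm(10, mean = 1), c = rnorm(12, mean = 2))

# All unordered pairs of sample names, one pair per column.
pairs <- combn(names(samples), 2)

# One Mann-Whitney-Wilcoxon p-value per pair.
p_values <- apply(pairs, 2, function(p) {
  wilcox.test(samples[[p[1]]], samples[[p[2]]], exact = FALSE)$p.value
})
names(p_values) <- apply(pairs, 2, paste, collapse = "_vs_")
```

The result is a named numeric vector, so a surrounding function can inspect it and decide whether to re-sample.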
