I have some code that creates a data frame with 2 columns. I want to write data from a for loop to this data frame. How do I do that?
df<-data.frame(id = numeric(), nobs = numeric())
setwd(directory)
files <-list.files(directory)
files <-files[id]
for (i in files) {
#print(i)
file <- read.csv(i)
x <- nrow(file)
num = as.numeric(gsub(".csv","",i))
y <- sprintf("%i %i", num, x)
#print(y)
df <- rbind(df,num,x)
}
To add rows to a data.frame inside a loop, assign to the next row index, as in the following example:
df<-data.frame(id = numeric(), nobs = numeric())
for (i in 1:1000) {
df[i,] <- c(runif(1),runif(1))
}
However, if you know the number of rows needed then preallocating memory is strongly recommended:
files <- 1:1000
df<-data.frame(id = numeric(length(files)), nobs = numeric(length(files)))
for (i in 1:length(files)) {
df[i,] <- c(runif(1),runif(1))
}
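Applied to the file-counting loop in the question, preallocation might look like the sketch below. It assumes the CSV files are named by number (1.csv, 2.csv, ...), as the gsub() call in the question suggests. count_rows is a hypothetical helper name, and the demo files are written to a temporary directory purely for illustration.

```r
# Hypothetical helper: one row of output per CSV file in `directory`.
count_rows <- function(directory) {
  files <- list.files(directory, pattern = "\\.csv$", full.names = TRUE)
  # Preallocate the full result instead of growing it with rbind().
  df <- data.frame(id = numeric(length(files)), nobs = numeric(length(files)))
  for (i in seq_along(files)) {
    # Assumes file names such as "1.csv", so the stem converts to a number.
    df$id[i] <- as.numeric(tools::file_path_sans_ext(basename(files[i])))
    df$nobs[i] <- nrow(read.csv(files[i]))
  }
  df
}

# Demo with made-up files in a temporary directory.
demo_dir <- file.path(tempdir(), "count_rows_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(a = 1:3), file.path(demo_dir, "1.csv"), row.names = FALSE)
write.csv(data.frame(a = 1:5), file.path(demo_dir, "2.csv"), row.names = FALSE)
result <- count_rows(demo_dir)
result
```

Using full.names = TRUE avoids the setwd() call, so the function does not change the working directory as a side effect.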
I am trying to create a CSV file that is a list of all unique values in my dataset. My data is from a folder that contains 200+ CSV files, all with 9 columns and a varying number of rows. Some files have no duplicates, but many have duplicate values. I have found code that lists how many rows are in each file, but I am wondering what I could add to it so that it removes the duplicate values and counts only the unique values in the final output CSV. I would like the final CSV file to list the row count of each of the 200+ files in one sheet.
The code I found is below
library(tidyverse)
csv.file <- list.files("TestA") # Directory with your .csv files
data.frame.output <- data.frame(number_of_cols = NA,
number_of_rows = NA,
name_of_csv = NA) #The df to be written
MyF <- function(x){
csv.read.file <- data.table::fread(
paste("TestA", x, sep = "/")
)
number.of.cols <- ncol(csv.read.file)
number.of.rows <- nrow(csv.read.file)
data.frame.output <<- add_row(data.frame.output,
number_of_cols = number.of.cols,
number_of_rows = number.of.rows,
name_of_csv = str_remove_all(x,".csv")) %>%
filter(!is.na(name_of_csv))
}
map(csv.file, MyF)
data.table::fwrite(data.frame.output, file = "Output1.csv")
I appreciate any guidance as I am a total R/coding beginner.
The following function accepts a vector of file names, reads them one by one, removes duplicated rows and returns a data.frame with the number of columns, the number of rows and the CSV file name for each file.
There is no need to create the results data.frame data.frame.output beforehand.
MyF <- function(x, path = "TestA"){
f <- function(x, path) {
# commented out to test the function
# uncomment these 3 lines and comment out the next one
#csv.read.file <- data.table::fread(
# file.path(path, x)
#)
csv.read.file <- data.table::fread(x)
i_dups <- (duplicated(csv.read.file) | duplicated(csv.read.file, fromLast = TRUE))
csv.read.file <- csv.read.file[!i_dups, ]
#
number.of.cols <- ncol(csv.read.file)
number.of.rows <- nrow(csv.read.file)
#
name_of_csv <- if(is.na(x)) NA_character_ else basename(x)
name_of_csv <- tools::file_path_sans_ext(name_of_csv)
#
data.frame(number_of_cols = number.of.cols,
number_of_rows = number.of.rows,
name_of_csv = name_of_csv) |>
dplyr::filter(!is.na(name_of_csv))
}
#
y <- purrr::map(x, f, path = path)
data.table::rbindlist(y)
}
data.frame.output <- MyF(csv.file)
data.table::fwrite(data.frame.output, file = "Output1.csv")
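Note that combining duplicated(x) with duplicated(x, fromLast = TRUE) flags every copy of a repeated row, so the function above drops all copies. If the goal is instead to keep one copy of each repeated row, !duplicated(x) alone (or unique(x)) does that. A small sketch with a made-up data frame, toy, shows the difference:

```r
# Made-up data: rows 1 and 3 are identical.
toy <- data.frame(a = c(1, 2, 1, 3), b = c("x", "y", "x", "z"))

# Flags every copy of a repeated row (the approach used in the function above).
all_copies <- duplicated(toy) | duplicated(toy, fromLast = TRUE)
no_repeats <- toy[!all_copies, ]     # rows 2 and 4 remain

# Flags only the second and later copies, so one copy of each row is kept.
one_each <- toy[!duplicated(toy), ]  # rows 1, 2 and 4 remain

nrow(no_repeats)
nrow(one_each)
```

Which of the two counts is wanted depends on what "unique values" means for the data at hand.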
I find this for loop version better. Though for loops are not considered very idiomatic in R, there is nothing wrong with them. Like the function above, it avoids assigning in the parent environment with the operator <<-, and the code is simpler. The results data.frame data.frame.output is created beforehand with as many rows as the input file name vector has elements, and the NA values are then replaced by each CSV file's values.
MyF <- function(x, path = "TestA"){
data.frame.output <- data.frame(number_of_cols = rep(NA, length(x)),
number_of_rows = rep(NA, length(x)),
name_of_csv = rep(NA, length(x)))
for(i in seq_along(x)) {
# commented out to test the function
# uncomment this line and comment out the next one
#fl_name <- file.path(path, x[i])
fl_name <- x[i]
#
csv.read.file <- data.table::fread(fl_name)
i_dups <- (duplicated(csv.read.file) | duplicated(csv.read.file, fromLast = TRUE))
csv.read.file <- csv.read.file[!i_dups, ]
#
data.frame.output$number_of_cols[i] <- ncol(csv.read.file)
data.frame.output$number_of_rows[i] <- nrow(csv.read.file)
#
name_of_csv <- if(is.na(fl_name)) NA_character_ else basename(fl_name)
name_of_csv <- tools::file_path_sans_ext(name_of_csv)
data.frame.output$name_of_csv[i] <- name_of_csv
}
#
data.frame.output |> dplyr::filter(!is.na(name_of_csv))
}
MyF(csv.file)
I have a table with samples of data named Sample_1, Sample_2, etc. I take user input as a string for which samples are wanted (Sample_1,Sample_3,Sample_5). Then after parsing the string, I have a for-loop which I pass each sample name to and the program filters the original dataset for the name and creates a DF with calculations. I then append the DF to a list after each iteration of the loop and at the end, I rbind the list for a complete DF.
sampleloop <- function(samplenames) {
data <- unlist(strsplit(samplenames, ","))
temp = list()
for(inc in 1:length(data)) {
df <- CT[CT[["Sample_Name"]] == data[inc],]
........
tempdf = goitemp
temp[inc] <- tempdf
}
newdf <- do.call(rbind.data.frame, temp)
}
The inner function on its own produces the correct wanted output. However, with the loop the function produces the following wrong DF if the input is "Sample_3,Sample_9":
I'm wondering if it has something to do with the rbind?
The issue seems to be the use of [ instead of [[ to access and assign to the list element:
sampleloop <- function(samplenames) {
data <- unlist(strsplit(samplenames, ","))
temp <- vector('list', length(data))
for(inc in seq_along(data)) {
df <- CT[CT[["Sample_Name"]] == data[inc],]
........
tempdf <- goitemp
temp[[inc]] <- tempdf
}
newdf <- do.call(rbind.data.frame, temp)
return(newdf)
}
The difference can be noted with the reproducible example below
lst1 <- vector('list', 5)
lst2 <- vector('list', 5)
for(i in 1:5) {
lst1[i] <- data.frame(col1 = 1:5, col2 = 6:10)
lst2[[i]] <- data.frame(col1 = 1:5, col2 = 6:10)
}
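Inspecting the two lists after the loop makes the difference visible. With [, R treats the data frame as a list of two columns, keeps only the first and warns that the lengths do not match; with [[, the whole data frame is stored. The sketch below repeats the loop (wrapped in suppressWarnings() only to keep the output quiet) and checks what each list holds:

```r
lst1 <- vector("list", 5)
lst2 <- vector("list", 5)
# The `[` assignment warns on every iteration; silenced here for brevity.
suppressWarnings(
  for (i in 1:5) {
    lst1[i] <- data.frame(col1 = 1:5, col2 = 6:10)
    lst2[[i]] <- data.frame(col1 = 1:5, col2 = 6:10)
  }
)

class(lst1[[1]])  # an integer vector: only col1 survived
class(lst2[[1]])  # a data.frame with both columns

# rbind-ing the list of data frames stacks them as intended.
combined <- do.call(rbind, lst2)
dim(combined)
```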
I want to do an operation on each data frame in a list: I want to perform the Kolmogorov–Smirnov (KS) test on one column in each data frame. I am using the code below, but it is not working:
PDF_mean <- matrix(nrow = length(siteNumber), ncol = 4)
PDF_mean <- data.frame(PDF_mean)
names(PDF_mean) <- c("station","normal","gamma","gev")
listDF <- mget(ls(pattern="DSF_moments_"))
length(listDF)
i <- 1
for (i in length(listDF)) {
PDF_mean$station[i] <- siteNumber[i]
PDF_mean$normal[i] <- ks.test(list[i]$mean,"pnorm")$p.value
PDF_mean$gev[i] <- ks.test(list[i]$mean,"pgev")$p.value
PDF_mean$gamma[i] <- ks.test(list[i]$mean,"gamma")$p.value
}
Any help?
The loop should not iterate over length(listDF). It should iterate over seq_along(listDF) or 1:length(listDF) (seq_along is more appropriate, since it also handles an empty list). length(listDF) is a single value, so the loop body runs only once, for the last index.
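for(i in seq_along(listDF)) {
PDF_mean$station[i] <- listDF[[i]]$siteNumber
PDF_mean$normal[i] <- ks.test(listDF[[i]]$mean,"pnorm")$p.value
# pgev is not part of base R; it must come from an add-on package
PDF_mean$gev[i] <- ks.test(listDF[[i]]$mean,"pgev")$p.value
# note: "gamma" resolves to the gamma function, not a distribution function;
# the gamma CDF is pgamma, which also needs a shape argument
PDF_mean$gamma[i] <- ks.test(listDF[[i]]$mean,"gamma")$p.value
}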
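The difference is easy to check on a toy list: length() gives the loop a single value to iterate over, whereas seq_along() supplies every index. The names below are made up for illustration.

```r
toy_list <- list(a = 1:3, b = 4:6, c = 7:9)

visited_length <- integer(0)
for (i in length(toy_list)) {
  visited_length <- c(visited_length, i)   # runs once, with i equal to 3
}

visited_seq <- integer(0)
for (i in seq_along(toy_list)) {
  visited_seq <- c(visited_seq, i)         # runs for i = 1, 2, 3
}

visited_length
visited_seq
```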
What I want is to create 60 data frames with 500 rows in each. I tried the below code and, while I get no errors, I am not getting the data frames. However, when I do a View on the as.data.frame, I get the view, but no data frame in my environment. I've been trying for three days with various versions of this code:
getDS <- function(x){
for(i in 1:3){
for(j in 1:30000){
ID_i <- data.table(x$ID[j: (j+500)])
}
}
as.data.frame(ID_i)
}
getDS(DATASETNAME)
We can use outer (on a small example)
out1 <- c(outer(1:3, 1:3, Vectorize(function(i, j) list(x$ID[j:(j + 5)]))))
lapply(out1, as.data.table)
--
The issue in the OP's function is that ID_i is overwritten on every iteration inside the loop, i.e. the earlier values are not stored. In order to keep them, we can initialize a list and store each result in it:
getDS <- function(x) {
ID_i <- vector('list', 3)
for(i in 1:3) {
for(j in 1:3) {
ID_i[[i]][[j]] <- data.table(x$ID[j:(j + 5)])
}
}
ID_i
}
do.call(c, getDS(x))
data
x <- data.table(ID = 1:50)
I'm not sure the description matches the code, so I'm a little unsure what the desired result is. That said, it is usually not helpful to split a data.table because the built-in by-processing makes it unnecessary. If for some reason you do want to split into a list of data.tables you might consider something along the lines of
getDS <- function(x, n=5, size = nrow(x)/n, column = "ID", reps = 3) {
x <- x[1:(n*size), ..column]
index <- rep(1:n, each = size)
replicate(reps, split(x, index),
simplify = FALSE)
}
getDS(data.table(ID = 1:20), n = 5)
I have a loop for a data frame construction, and I would like to write all small pieces at each iteration in a csv file. Something like a rbind() but in a file...
I have seen sink() used like this.
Example:
sink("resultat/output.txt")
df.total<-NULL
for(i in 1:length(list2.1)){
if(i%%100==0){print(i)}
tps<-as.data.frame(list2.1[i])
tps<-cbind(tps,as.data.frame(list2.2[i]))
colnames(tps)<-c("slope","echange")
tps$Run<-rep(data$run_nb[i],length(tps$slope))
tps$coefSlope<-rep(data$coef_Slope_i[i],length(tps$slope))
tps$coefDist<-rep(data$coef_dist_i[i],length(tps$slope))
sink(tps)
df.total<-rbind(df.total,tps)
}
sink()
write.csv(df.total,"resultat/df_total.csv")
but I don't think it works for my case...
Any suggestions?
You can use sink (which redirects R output to a connection) along with print
df <- data.frame(a = 0:1, b = 1:2)
sink("output.txt")
for(i in 1:10) {
print(df[i+2, ] <- c(sum(df$a), tail(df$b, 1) + 1))
## Or, to save the whole data.frame each time: print(df)
}
sink()
Another option is to use cat
df <- data.frame(a = 0:1, b = 1:2)
for(i in 1:10) {
cat(df[i+2, ] <- c(sum(df$a), tail(df$b, 1) + 1), "\n", file="output.txt", append=TRUE)
}
You can also use write.table to save the whole data.frame
df <- data.frame(a = 0:1, b = 1:2)
for(i in 1:10) {
df[i+2, ] <- c(sum(df$a), tail(df$b, 1) + 1)
write.table(df, file="output.txt", append=TRUE)
}
If you set append = FALSE only the last iteration will be saved.
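For the original goal of an rbind() that lives in a file, one possible sketch is to append each piece with write.table() and write the header only on the first iteration. The pieces list and the temporary output path are placeholders for illustration, not the asker's data.

```r
# Placeholder pieces standing in for the per-iteration data frames.
pieces <- list(
  data.frame(slope = c(0.1, 0.2), echange = c(1, 2)),
  data.frame(slope = 0.3, echange = 3)
)
out_file <- tempfile(fileext = ".csv")

for (i in seq_along(pieces)) {
  write.table(pieces[[i]], file = out_file, sep = ",",
              row.names = FALSE,
              col.names = (i == 1),  # header only once
              append = (i > 1))      # first write creates the file
}

# Reading the file back gives the same rows as rbind-ing the pieces.
combined <- read.csv(out_file)
combined
```

Because each piece goes straight to disk, the full data frame never has to be held in memory during the loop.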