I need to pull a random sample of 100,000 to 200,000 rows from a CSV dataset of 2.8 million rows. How can I do this effectively so that the random sample can then be cleaned and processed?
From the DMwR2 package, I have used the sampleCSV function, but the output garbles the 22 variables that I need to use.
library(caret)
library(DMwR2)
# dataset source: https://www.kaggle.com/pschale/mlb-pitch-data-20152018#pitches.csv
pitchData <- sampleCSV(file.choose(), 200000, 2867154 , header = TRUE , mxPerc = 0.5)
summary(pitchData)
I expect the output of summary(pitchData) to have the same variable names as the csv file, but it renames them using random numbers, and some of the variables are lost.
Maybe the following function can do what the question asks for. Note that it uses a function from package R.utils.
The return value is a list with two members:
lines, the line numbers that were read in;
data, the data frame.
This can be changed to return just the data frame.
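sample_csv <- function(fname, n, sep = ",", header = TRUE, ...){
  # total number of lines in the file, including the header line if present
  N <- as.integer(R.utils::countLines(fname))
  first <- if (header) 2L else 1L
  stopifnot(N - first + 1L >= n)
  # sample data lines only, so that the header line is never drawn
  lns <- first - 1L + sample.int(N - first + 1L, n)
  # read each sampled line whole (sep = "\n" stops scan splitting on spaces)
  x <- vapply(lns, function(l){
    scan(fname, what = character(), sep = "\n", skip = l - 1, nlines = 1, quiet = TRUE)
  }, character(1))
  # put the header line first so that read.table recovers the column names
  if (header) x <- c(readLines(fname, n = 1), x)
  list(lines = lns,
       data = read.table(text = x, sep = sep, header = header, ...)
  )
}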
set.seed(1234)
res <- sample_csv(filename, 100, header = FALSE)
str(res$data)
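Calling scan once per sampled line re-reads the file from the top each time, which may become slow for samples of 100,000 rows or more. As a rough alternative sketch in base R (not part of the original answer), the function below reads the file in chunks and keeps only the sampled lines. The name sample_csv_onepass and the chunk_size parameter are made up for illustration, and the sketch assumes a header line and no line breaks inside quoted fields.

```r
sample_csv_onepass <- function(fname, n, chunk_size = 50000L, ...) {
  # First pass: keep the header line and count the data lines
  con <- file(fname, open = "r")
  header_line <- readLines(con, n = 1)
  n_data <- 0L
  repeat {
    chunk <- readLines(con, n = chunk_size)
    if (length(chunk) == 0L) break
    n_data <- n_data + length(chunk)
  }
  close(con)
  stopifnot(n_data >= n)

  # Data line numbers to keep, sorted so they can be matched chunk by chunk
  keep <- sort(sample.int(n_data, n))

  # Second pass: collect only the sampled lines
  con <- file(fname, open = "r")
  on.exit(close(con))
  readLines(con, n = 1)  # skip the header line
  picked <- character(n)
  filled <- 0L
  offset <- 0L
  repeat {
    chunk <- readLines(con, n = chunk_size)
    if (length(chunk) == 0L) break
    # positions inside this chunk that were sampled
    idx <- keep[keep > offset & keep <= offset + length(chunk)] - offset
    if (length(idx) > 0L) {
      picked[filled + seq_along(idx)] <- chunk[idx]
      filled <- filled + length(idx)
    }
    offset <- offset + length(chunk)
  }
  # Parse header plus sampled lines, so the column names are preserved
  read.csv(text = c(header_line, picked), ...)
}

# Usage on a small temporary file (hypothetical data)
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:1000, value = rnorm(1000)), tmp, row.names = FALSE)
set.seed(1)
smp <- sample_csv_onepass(tmp, 50, chunk_size = 100L)
str(smp)
```

The sampled rows come back in file order rather than in the order they were drawn; shuffle them afterwards if the order matters.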
I have a folder of files (csv) that have filtered/gated data -- two columns (dihedral angle vs bend angle). It was filtered based upon an individualized min and max for each file.
I need to be able to get at least the mean, median, sd, skewness, and kurtosis for each column of each file and have that data presented as a table. (One line per file in the final product)
I am not familiar with which R packages may be suitable for this task, so I was trying to do something simple. I can get it to work for a single file, but I have over 200 files. They will likely be updated, so I'll have to run this multiple times.
module load ccs/container/R/4.1.0
R
library(moments)
files <- list.files("/mnt/gpfs2_4m/scratch/username/fs_scripts/foldedstart_*", pattern="*.csv", recursive=TRUE, full.names=TRUE)
cat("filename","\t","dihedral mean","\t","bend mean","\t","dihedral median","\t","bend median","\t","dh sd","\t","bd sd","\t","dh skew","\t","bd skew","\t","dh kurt","\t","bd kurt","\n")
for (currentFile in files) {
df <- read.table(fileName[i], header=TRUE)
z1 <- mean(df$V1)
z2 <- median(df$V1)
z3 <- sd(df$V1)
z4 <- skewness(df$V1)
z5 <- kurtosis(df$V1)
z7 <- mean(df$V2)
z8 <- median(df$V2)
z9 <- sd(df$V2)
z10 <- skewness(df$V2)
z11 <- kurtosis(df$V2)
cat(filename,"\t",z1,"\t",z7,"\t",z2,"\t",z8,"\t",z3,"\t",z9,"\t",z4,"\t",z10,"\t",z5,"\t",z11,"\n")
write.table(newdata, file=statsFileName[i]))
}
The "first cat line" is the header and labels.
The "for cat line" likely goes nowhere, but it is the format that I am trying to achieve.
The "write.table line" is something that I found, but I don't think it is appropriate for this.
I truly appreciate any help on this. I am not that familiar with R, and the examples that I have found do not appear to relate closely enough to what I am trying to do for me to adapt them.
Edit: This is a plot from where the data is visualized. I’m looking for the medians (centers) of each major area of density. Trying to give some context.
Example of what the data looks like (head and tail) and some of the files in the folder
The following computes all the statistics the question asks for, for each file, and writes a table of results to a CSV file.
library(moments)
stats <- function(filename, na.rm = TRUE) {
  tryCatch({
    x <- read.csv(filename)
    xbar <- colMeans(x, na.rm = na.rm)
    med <- apply(x, 2, median, na.rm = na.rm)
    S <- apply(x, 2, sd, na.rm = na.rm)
    skwn <- skewness(x, na.rm = na.rm)
    kurt <- kurtosis(x, na.rm = na.rm)
    #
    # return a data.frame, it will
    # make the code simpler further on
    out <- data.frame(
      filename = filename,
      dihedral.mean = xbar[1],
      bend.mean = xbar[2],
      dihedral.median = med[1],
      bend.median = med[2],
      dihedral.sd = S[1],
      bend.sd = S[2],
      dihedral.skewness = skwn[1],
      bend.skewness = skwn[2],
      dihedral.kurtosis = kurt[1],
      bend.kurtosis = kurt[2]
    )
    row.names(out) <- NULL
    out
  },
  error = function(e) e
  )
}
statsFileName <- "statsfile.txt"
#files <- list.files("/mnt/gpfs2_4m/scratch/username/fs_scripts/foldedstart_*", pattern="*.csv", recursive=TRUE, full.names=TRUE)
# full.names = TRUE so that read.csv() receives the complete path to each file
files <- list.files("~/Temp", "^t.*\\.csv$", full.names = TRUE)
newdata <- lapply(files, stats)
ok <- !sapply(newdata, inherits, "error")
cat("files read:", sum(ok), "\n")
if(any(!ok)) {
  cat("errors:", sum(!ok), "\n")
  # keep the failed file names and their error messages for inspection
  err_list <- list(
    files = files[!ok],
    error = sapply(newdata[!ok], conditionMessage)
  )
}
newdata <- do.call(rbind, newdata[ok])
write.csv(newdata, file = statsFileName, row.names = FALSE)
This solution uses dplyr to summarise each file, combines the summaries into a single dataframe, then writes the results to a csv file.
library(moments)
library(dplyr)
csv_output_path <- "./results.csv"
data_dir <- "./data"
### Create dummy csv files for reproducibility ###
if(!dir.exists(data_dir)) dir.create(data_dir)
for(i in 1:200){
write.csv(data.frame(V1 = runif(100), V2 = runif(100)),
file = paste0(data_dir, "/file_", i, ".csv"),
row.names = FALSE)
}
### Summarise files ###
# escape the dot so that the pattern matches a literal ".csv" extension
files <- list.files(data_dir, pattern = "\\.csv$", recursive = TRUE, full.names = TRUE)
all_results <- vector("list", length(files)) # results placeholder
# Loop that calculates summary statistics
for (i in seq_along(files)) {
  currentFile <- files[i]
  # skip files that cannot be read
  df <- tryCatch(read.csv(file = currentFile, header = TRUE),
                 error = function(e) NULL)
  if (is.null(df))
    next
  result <- df %>%
    summarise_all(list(mean = mean, median = median,
                       sd = sd, skew = skewness, kur = kurtosis)) %>%
    mutate(file = currentFile) %>% # add filename to the result
    select(file, everything())     # reorder
  all_results[[i]] <- result
}
# Combine results into a single df
final_table <- bind_rows(all_results)
# write file
write.csv(final_table, csv_output_path, row.names = FALSE)
I am trying to write a function (I am new to R; most of what I know about R I learned from this website, thanks).
I want to apply my function to a list. The list contains some ".CSV" files.
All CSV files in my folder look like the picture below: the same structure, but with different numbers of columns.
I want to:
based on the "Frame" column, delete every row that contains "T",
which leaves "110*n1" rows of data;
delete every column whose name contains "Flag" (these are blank columns);
delete the 1st column, which leaves "2*n2" columns;
reshape the multiple columns into 2 columns, which gives "110*n3" rows of data;
repeat "1,2,3,4,...,110" as series numbers n times (n = n3) and bind the result as a column;
take "1,2,3,...,n3", repeat each value 110 times, and add the result as a column;
export the new table as a txt file.
Here is what I've done so far:
T_function <- function(x) {
data.df <- read.csv(x, skip = 1,header=TRUE, na.strings=c("NA","NaN", " ","*"),
dec=".", strip.white=TRUE)
filename <- substr(x = x, start = 1, stop = (nchar(x)-4))
data.df[!grepl("T", data.df$Frame),]
data.df <- data.df [,-1]
data.df <- data.df [,colSums(is.na(data.df))<nrow(data.df)]
splitter <- function(indf, ncols) {
if (ncol(indf) %% ncols != 0) stop("Not the right number of columns to split")
inds <- split(sequence(ncol(indf)), c(0, sequence(ncol(indf)-1) %/% ncols))
temp <- unlist(lapply(inds, function(x) c(t(indf[x]))), use.names = FALSE)
as.data.frame(matrix(temp, ncol = ncols, byrow = TRUE))
}
out <- splitter(data.df, 2)
list <- 1:110
from <- which(out$V1 == 1)
to <- c((from-1)[-1], nrow(out))
end <- c(to/110)
list2 <- rep(list,length(to/110))
out$Number <- unlist(list2)
out$Number <- as.factor(out$Number)
list3 <- rep(1:end,each=110)
out$slice <- unlist(list3)
out$slice <- as.factor(out$slice)
write.table(x = data.df,
file = paste0(filename, "_analysis.txt"),
sep = ",",quote=F)
}
It seems the function cannot add the correct "out$Number" and "out$slice" columns.
filenames <- list.files(path = "",pattern="csv",full.names = T)
sapply(filenames, FUN = T_function)
I am trying to apply my function to every file in the list, but apart from the first file I can't get the other files to work.
Could anybody help me find and solve the problems?
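A minimal sketch of the reshaping and numbering steps on a toy data frame, not a tested fix for the function above. It assumes (hypothetically) that after the row and column deletions the data consist of consecutive pairs of columns, and it uses 3 rows per block instead of 110 to stay short; wide, long and rows_per_block are illustrative names. Two things in the posted function may also be worth checking: the result of the grepl row filter is never assigned back to data.df, and write.table writes data.df rather than out.

```r
# Hypothetical toy input: 3 rows per block instead of 110,
# and two pairs of columns (x1, y1) and (x2, y2)
rows_per_block <- 3
wide <- data.frame(x1 = 1:3, y1 = 4:6, x2 = 7:9, y2 = 10:12)

# Stack each pair of columns underneath the previous pair
pair_starts <- seq(1, ncol(wide), by = 2)
long <- do.call(rbind, lapply(pair_starts, function(j) {
  data.frame(V1 = wide[[j]], V2 = wide[[j + 1]])
}))

# Number of stacked blocks
n_blocks <- nrow(long) / rows_per_block

# 1, 2, ..., rows_per_block repeated once per block
long$Number <- rep(seq_len(rows_per_block), times = n_blocks)
# block index, each repeated rows_per_block times
long$slice <- rep(seq_len(n_blocks), each = rows_per_block)

print(long)
```

Using rep with times and each avoids having to locate block boundaries with which(), which is where the posted code seems to go wrong when 1:end is built from a vector.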
I'm doing analysis on company networks in R and am trying to export my igraph results into a dataframe.
Here's a reproducible example:
library(igraph)
sample <- data.frame(ID = 1:8, org_ID = c(5,4,1,2,2,2,5,7), mon = c("199801", "199802","199802","199802","199904","199912","200001", "200012"))
create.graphs <- function(df){
g <- graph.data.frame(d = df, directed = TRUE)
g <- simplify(g, remove.multiple = FALSE, remove.loops = TRUE)
E(g)$weight <- count_multiple(g)
#calculate global values
g$centrality <- centralization.degree(g)
#calculate local values
g$indegree <- degree(g, mode = "in",
loops = FALSE, normalized = FALSE)
return(g)
}
df.list <- split(sample, sample$mon)
g <- lapply(df.list, create.graphs)
As you can see, I have graphs for multiple months. I want to export this to longitudinal data, where each row represents a month (per ID) and each column represents the corresponding network measures.
So far I've managed to create a data frame, but not how to run it through the list of graphs and put it into a fitting format. An additional problem could be that the graphs have different numbers of nodes (some have around 25, others more than 40), but that should theoretically just be recognised as missing by my regression model.
output <- data.frame(Centrality = g$`199801`$centrality,
Indegree = g$`199801`$indegree)
output
summary(output)
I tried writing a function similar to the one above for this, but unfortunately to no avail.
Thanks in advance for reading this, any help is greatly appreciated
I wanted to share how I solved it (thanks to Dave2e's suggestion).
Note that ci$monat defines my time periods in the original data, so one row for each point in time.
sumarTable <- data.frame(time = unique(ci$monat))
sumarTable$indegree <- lapply(g, function(x){x$indegree})
sumarTable$outdegree <- lapply(g, function(x){x$outdegree})
sumarTable$constraint <- lapply(g, function(x){x$constraint})
# ... and so on for the remaining measures
Edit: in order to export these values, I had to "flatten" the lists:
sumarTable$indegree <- vapply(sumarTable$indegree, paste, collapse = ", ", character(1L))
sumarTable$outdegree <- vapply(sumarTable$outdegree, paste, collapse = ", ", character(1L))
sumarTable$constraint <- vapply(sumarTable$constraint, paste, collapse = ", ", character(1L))
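If a long table is preferred over comma-separated strings (one row per node per month, which is closer to what the question describes), something along these lines might work. It is only a sketch: indegree_by_month is a hypothetical stand-in for the result of lapply(g, function(x) x$indegree), with made-up node names and values, so that the example runs without igraph.

```r
# Hypothetical stand-in for lapply(g, function(x) x$indegree):
# one named numeric vector of node-level values per month
indegree_by_month <- list(
  `199801` = c(a = 0, b = 1),
  `199802` = c(a = 1, b = 0, c = 2)
)

# Build one small data frame per month and stack them
long <- do.call(rbind, Map(function(month, values) {
  data.frame(time = month,
             node = names(values),
             indegree = unname(values),
             stringsAsFactors = FALSE)
}, names(indegree_by_month), indegree_by_month))
row.names(long) <- NULL

print(long)
```

Months with fewer nodes simply contribute fewer rows, so graphs of different sizes need no padding.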
I am a new R user and am trying to create multiple subsamples of a data frame. My data are assigned to 4 strata (STRATUM = 1, 2, 3, 4), and I want to randomly keep only a specified number of rows in each stratum. To achieve this, I import my data, sort by the stratification value, then assign a random number to each row. I want to keep my original random number assignments since I need to use them again in future analyses, so I save a .csv with these values. Next, I subset the data by stratum and specify the number of records that I want to retain in each stratum. Finally, I rejoin the data and save the result as a new .csv. The code works; however, I want to repeat this process 100 times. In each case I want to save the .csv with the random numbers assigned, as well as the final .csv of randomly selected plots. I am unsure how to get this block of code to repeat 100 times, and also how to assign a unique file name to each iteration. Any help would be much appreciated.
DataFiles <- "//Documents/flownData_JR.csv"
PlotsFlown <- read.table (file = DataFiles, header = TRUE, sep = ",")
#Sort the data by the stratification
FlownStratSort <- PlotsFlown[order(PlotsFlown$STRATUM),]
#Create a new column with a random number (no duplicates)
FlownStratSort$RAND_NUM <- sample(137, size = nrow(FlownStratSort), replace = FALSE)
#Sort by the stratum, then random number
FLOWNRAND <- FlownStratSort[order(FlownStratSort$STRATUM,FlownStratSort$RAND_NUM),]
#Save a csv file with the random numbers
write.table(FLOWNRAND, file = "//Documents/RANDNUM1_JR.csv", sep = ",", row.names = FALSE, col.names = TRUE)
#Subset the data by stratum
FLOWNRAND1 <- FLOWNRAND[which(FLOWNRAND$STRATUM=='1'),]
FLOWNRAND2 <- FLOWNRAND[which(FLOWNRAND$STRATUM=='2'),]
FLOWNRAND3 <- FLOWNRAND[which(FLOWNRAND$STRATUM=='3'),]
FLOWNRAND4 <- FLOWNRAND[which(FLOWNRAND$STRATUM=='4'),]
#Remove data from each stratum, specifying the number of records we want to retain
FLOWNRAND1 <- FLOWNRAND1[1:34, ]
FLOWNRAND2 <- FLOWNRAND2[1:21, ]
FLOWNRAND3 <- FLOWNRAND3[1:7, ]
FLOWNRAND4 <- FLOWNRAND4[1:7, ]
#Rejoin the data
FLOWNRAND_uneven <- rbind(FLOWNRAND1, FLOWNRAND2, FLOWNRAND3, FLOWNRAND4)
#Save the table with plots removed from each stratum flown in 2017
write.table(FLOWNRAND_uneven, file = "//Documents/Flown_RAND_uneven_JR.csv", sep = ",", row.names = FALSE, col.names = TRUE)
Here's a data.table solution if you just need to know which rows are in each set.
library(data.table)
df <- data.table(dat = runif(100),
                 stratum = sample(1:4, 100, replace = TRUE))
# Draws the specified number of rows at random from each stratum
get_strata <- function(df, n, i){
  # Row indices of n randomly chosen rows within each stratum
  # (replace stratum with your variable name)
  f <- df[df[, .I[sample(.N, n)], by = stratum]$V1]
  # Save as CSV with the iteration number in the file name; replace path
  # (write.csv always writes column names, so col.names is not needed)
  write.csv(f, file = paste0("path/df_", i, ".csv"),
            row.names = FALSE)
}
for (i in 1:100){
  # replace 10 with the number of rows needed per stratum
  get_strata(df, 10, i)
}
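The question keeps a different number of rows in each stratum (34, 21, 7 and 7) and also saves the random numbers from every run. A base R sketch of that workflow could look like the following. The plots data frame, its PLOT_ID column, the stratum sizes and the output folder are invented for illustration; only the per-stratum counts and the column names STRATUM and RAND_NUM come from the question.

```r
set.seed(42)
# Hypothetical stand-in for the plot data in the question (137 rows)
plots <- data.frame(PLOT_ID = 1:137,
                    STRATUM = rep(1:4, times = c(60, 40, 20, 17)))
# Rows to keep per stratum (the sizes used in the question)
keep_n <- c(`1` = 34, `2` = 21, `3` = 7, `4` = 7)
# Hypothetical output folder; replace with a real path
out_dir <- file.path(tempdir(), "subsamples")
dir.create(out_dir, showWarnings = FALSE)

for (i in 1:100) {
  # New random number for every row, without duplicates
  plots$RAND_NUM <- sample(nrow(plots))
  # Sort by stratum, then by random number
  ranked <- plots[order(plots$STRATUM, plots$RAND_NUM), ]
  # Save the random number assignment of this iteration
  write.csv(ranked, file.path(out_dir, sprintf("RANDNUM_%03d.csv", i)),
            row.names = FALSE)
  # Position of each row within its stratum (1 = smallest random number)
  rank_in_stratum <- ave(ranked$RAND_NUM, ranked$STRATUM, FUN = seq_along)
  # Keep the first keep_n rows of each stratum
  kept <- ranked[rank_in_stratum <= keep_n[as.character(ranked$STRATUM)], ]
  write.csv(kept, file.path(out_dir, sprintf("Flown_RAND_uneven_%03d.csv", i)),
            row.names = FALSE)
}
```

sprintf("%03d", i) pads the iteration number with zeros, so the 200 files sort in run order.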
I have a simple data set with 10 rows and two columns.
What I want to do is sample 5 rows from my data 10 times without replacement and
then store these new subsamples under 10 different names.
This is what I have:
for (i in 1:10){
print(Data[sample(nrow(Data), size = 5, replace= FALSE),] -> i)
write.table(i, file = "data_i", row.names = F, col.names = F)
}
But in the new table I only get one subsample, and each value of my data set is wrapped in " ".
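You reassign your looping variable i inside the loop. Also, you use only one file name, so the file is overwritten on every iteration. The quotation marks around the values come from write.table's default quote = TRUE; setting quote = FALSE removes them. Try this:
for (i in 1:10){
  # draw 5 rows without replacement and store them under a name other than i
  x <- Data[sample(nrow(Data), size = 5, replace = FALSE), ]
  print(x)
  # build a different file name on each iteration
  write.table(x, file = paste0("data_", i), row.names = FALSE, col.names = FALSE, quote = FALSE)
}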
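If the subsamples are also needed inside R under 10 different names, one option is to keep them in a named list before writing them out. This is only a sketch: the toy Data frame, the data_1 to data_10 names, the .txt extension and the use of tempdir() as the output folder are assumptions made for illustration.

```r
set.seed(1)
# Hypothetical 10-row, two-column data set
Data <- data.frame(x = 1:10, y = letters[1:10])
out_dir <- tempdir()  # hypothetical output folder

# One subsample of 5 rows (without replacement) per list element
subsamples <- lapply(1:10, function(i) Data[sample(nrow(Data), size = 5), ])
names(subsamples) <- paste0("data_", 1:10)

# Write each subsample to its own file, named after the list element
for (nm in names(subsamples)) {
  write.table(subsamples[[nm]],
              file = file.path(out_dir, paste0(nm, ".txt")),
              row.names = FALSE, col.names = FALSE, quote = FALSE)
}

# Individual subsamples are then available by name
subsamples$data_3
```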