I have a simple data set with 10 rows and two columns.
What I want to do is sample 5 rows from my data 10 times without replacement and
then store these new subsamples under 10 different names.
This is what I have:
for (i in 1:10){
print(Data[sample(nrow(Data), size = 5, replace= FALSE),] -> i)
write.table(i, file = "data_i", row.names = F, col.names = F)
}
But in the new table I only get one subsample, and each value of my data set is wrapped in " ".
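You reassign your loop variable i inside the loop. In addition, "data_i" is a literal string, so every iteration writes to the same file and overwrites the previous subsample. The quotation marks appear because write.table quotes character and factor values by default; pass quote = FALSE to suppress them. Try this:
for (i in 1:10){
  # Draw 5 distinct rows and keep them in x, leaving the loop counter i untouched
  x <- Data[sample(nrow(Data), size = 5, replace = FALSE), ]
  print(x)
  # Build a distinct file name per iteration: data_1, data_2, ...
  write.table(x, file = paste0("data_", i), quote = FALSE, row.names = FALSE, col.names = FALSE)
}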
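If the subsamples are also needed later in the same session, one option is to collect them in a named list before writing them out. The following is a minimal sketch under stated assumptions: Data here is a made-up 10-row stand-in for the asker's data set, and the files go to a temporary directory rather than the working directory.

```r
set.seed(42)  # only so the illustration is reproducible

# Hypothetical stand-in for the asker's 10-row, 2-column data set
Data <- data.frame(id = 1:10, value = letters[1:10])

# Draw 10 subsamples of 5 rows each, without replacement within a subsample
subsamples <- lapply(1:10, function(i) {
  Data[sample(nrow(Data), size = 5, replace = FALSE), ]
})
names(subsamples) <- paste0("data_", 1:10)

# Write each subsample to its own file in a temporary directory
out_dir <- tempdir()
for (nm in names(subsamples)) {
  write.table(subsamples[[nm]], file = file.path(out_dir, paste0(nm, ".txt")),
              quote = FALSE, row.names = FALSE, col.names = FALSE)
}
```

Each subsample can then be reached by name, for example subsamples[["data_3"]].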
I need to pull a random sample of 100,000 - 200,000 rows from a csv dataset of 2.8mil rows. How do I effectively do this so that the random sample can be cleaned and processed?
Under the DMwR2 library, I have used the sampleCSV function, but the output data messes up the 22 variables that I need to use.
library(caret)
library(DMwR2)
# dataset source: https://www.kaggle.com/pschale/mlb-pitch-data-20152018#pitches.csv
pitchData <- sampleCSV(file.choose(), 200000, 2867154 , header = TRUE , mxPerc = 0.5)
summary(pitchData)
I expect the output of summary(pitchData) to have the same variable names as the csv file, but it renames them using random numbers, and some of the variables are lost.
Maybe the following function can do what the question asks for. Note that it uses a function from package R.utils.
The return value is a list with 2 members:
lines the line numbers read in;
data the data frame.
This can be changed to return just the dataframe.
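sample_csv <- function(fname, n, sep = ",", header = TRUE, ...){
  # Total number of lines in the file, header line included
  N <- as.integer(R.utils::countLines(fname))
  # Never sample the header line itself
  first <- if (header) 2L else 1L
  stopifnot(N - first + 1L >= n)
  lns <- first - 1L + sample(N - first + 1L, n)
  # Read each sampled line as one whole string (sep = "\n" stops scan splitting on spaces)
  x <- vapply(lns, function(l){
    scan(fname, what = character(), sep = "\n", quote = "", skip = l - 1, nlines = 1, quiet = TRUE)
  }, character(1))
  # Put the header line back in front so the column names survive
  if (header) x <- c(readLines(fname, n = 1), x)
  list(lines = lns,
       data = read.table(textConnection(x),
                         sep = sep, header = header, ...)
  )
}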
set.seed(1234)
res <- sample_csv(filename, 100, header = FALSE)
str(res$data)
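The function above reopens the file once for every sampled line, which may be slow for 100,000 to 200,000 rows. If the whole file fits in memory as text, a base R alternative is to read all lines once and subset them. This is only a sketch: sample_csv_lines is a hypothetical helper name, the temporary file stands in for pitches.csv, and it assumes no quoted field contains an embedded line break.

```r
# Sample n data rows from a CSV, keeping the header line as column names.
# sample_csv_lines is a hypothetical helper name, not part of any package.
sample_csv_lines <- function(fname, n, ...) {
  all_lines <- readLines(fname)
  header <- all_lines[1]
  body <- all_lines[-1]
  stopifnot(length(body) >= n)
  # Sorted indices keep the sampled rows in their original file order
  keep <- sort(sample(length(body), n))
  read.csv(text = c(header, body[keep]), ...)
}

# Demonstration on a small temporary file standing in for pitches.csv
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(pitch_id = 1:1000, speed = round(runif(1000, 70, 100), 1)),
          tmp, row.names = FALSE)
set.seed(1234)
pitch_sample <- sample_csv_lines(tmp, 100)
```

Because the header line is always passed to read.csv first, the sampled data frame keeps the original variable names.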
I am trying to write a function (I am new to R, and most of what I know about R I learned from this website, thanks).
I want to apply my function to a list of ".CSV" files.
All CSV files in my folder look like the picture below: same structure, but with different numbers of columns.
I want to:
based on the "Frame" column, delete all rows that contain "T",
which leaves "110*n1" rows of data.
delete all columns whose names contain "Flag"; they are blank columns.
delete the 1st column, which leaves "2*n2" columns.
reshape the multiple columns into 2 columns, which gives "110*n3" rows of data.
repeat "1,2,3,4,...,110" as a series number n times (n = n3) and bind it as a column.
for "1,2,3,...,n3", repeat each value 110 times and bind it as a column.
export the new table as a txt file.
Here is what I've done so far:
T_function <- function(x) {
data.df <- read.csv(x, skip = 1,header=TRUE, na.strings=c("NA","NaN", " ","*"),
dec=".", strip.white=TRUE)
filename <- substr(x = x, start = 1, stop = (nchar(x)-4))
data.df[!grepl("T", data.df$Frame),]
data.df <- data.df [,-1]
data.df <- data.df [,colSums(is.na(data.df))<nrow(data.df)]
splitter <- function(indf, ncols) {
if (ncol(indf) %% ncols != 0) stop("Not the right number of columns to split")
inds <- split(sequence(ncol(indf)), c(0, sequence(ncol(indf)-1) %/% ncols))
temp <- unlist(lapply(inds, function(x) c(t(indf[x]))), use.names = FALSE)
as.data.frame(matrix(temp, ncol = ncols, byrow = TRUE))
}
out <- splitter(data.df, 2)
list <- 1:110
from <- which(out$V1 == 1)
to <- c((from-1)[-1], nrow(out))
end <- c(to/110)
list2 <- rep(list,length(to/110))
out$Number <- unlist(list2)
out$Number <- as.factor(out$Number)
list3 <- rep(1:end,each=110)
out$slice <- unlist(list3)
out$slice <- as.factor(out$slice)
write.table(x = data.df,
file = paste0(filename, "_analysis.txt"),
sep = ",",quote=F)
}
It seems the function cannot add the correct "out$Number" and "out$slice" columns.
filenames <- list.files(path = "",pattern="csv",full.names = T)
sapply(filenames, FUN = T_function)
I am trying to apply my function to all files in the list, but apart from the 1st file I can't get the other files to work.
Could anybody help me find and solve the problems?
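Three things stand out in the function as posted: the row filter on "Frame" is never assigned back to data.df, write.table exports data.df rather than out, and end is a vector, so 1:end cannot give the intended slice count. For the two index columns, one possible approach is to derive the number of slices from the row count of out instead of searching for rows where V1 equals 1. The sketch below assumes out has exactly 110 rows per slice; the random out table is a made-up stand-in for the reshaped data.

```r
# Hypothetical stand-in for the reshaped two-column table `out`:
# 3 slices of 110 rows each
rows_per_slice <- 110
out <- data.frame(V1 = runif(3 * rows_per_slice), V2 = runif(3 * rows_per_slice))

# Number of complete slices, derived from the row count
stopifnot(nrow(out) %% rows_per_slice == 0)
n_slices <- nrow(out) %/% rows_per_slice

# 1..110 repeated once per slice
out$Number <- factor(rep(seq_len(rows_per_slice), times = n_slices))
# 1,1,...,2,2,...: each slice id repeated 110 times
out$slice <- factor(rep(seq_len(n_slices), each = rows_per_slice))
```

With both columns in place, passing out instead of data.df to write.table would export the reshaped table.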
I am a new user to R and am trying to create multiple subsamples of a data frame. I have my data assigned to 4 strata (STRATUM = 1, 2, 3, 4), and want to randomly keep only a specified number of rows in each stratum. To achieve this, I import my data, sort by the stratification value, then assign a random number to each row. I want to keep my original random number assignments since I need to use them again in future analyses, so I save a .csv with these values. Next, I subset the data by stratum, and then specify the number of records that I want to retain in each stratum. Finally, I rejoin the data and save it as a new .csv. The code works; however, I want to repeat this process 100 times. In each case I want to save the .csv with random numbers assigned, as well as the final .csv of randomly selected plots. I am unsure of how to get this block of code to repeat 100x, and also how to assign a unique file name for each iteration. Any help would be much appreciated.
DataFiles <- "//Documents/flownData_JR.csv"
PlotsFlown <- read.table (file = DataFiles, header = TRUE, sep = ",")
#Sort the data by the stratification
FlownStratSort <- PlotsFlown[order(PlotsFlown$STRATUM),]
#Create a new column with a random number (no duplicates)
FlownStratSort$RAND_NUM <- sample(137, size = nrow(FlownStratSort), replace = FALSE)
#Sort by the stratum, then random number
FLOWNRAND <- FlownStratSort[order(FlownStratSort$STRATUM,FlownStratSort$RAND_NUM),]
#Save a csv file with the random numbers
write.table(FLOWNRAND, file = "//Documents/RANDNUM1_JR.csv", sep = ",", row.names = FALSE, col.names = TRUE)
#Subset the data by stratum
FLOWNRAND1 <- FLOWNRAND[which(FLOWNRAND$STRATUM=='1'),]
FLOWNRAND2 <- FLOWNRAND[which(FLOWNRAND$STRATUM=='2'),]
FLOWNRAND3 <- FLOWNRAND[which(FLOWNRAND$STRATUM=='3'),]
FLOWNRAND4 <- FLOWNRAND[which(FLOWNRAND$STRATUM=='4'),]
#Remove data from each stratum, specifying the number of records we want to retain
FLOWNRAND1 <- FLOWNRAND1[1:34, ]
FLOWNRAND2 <- FLOWNRAND2[1:21, ]
FLOWNRAND3 <- FLOWNRAND3[1:7, ]
FLOWNRAND4 <- FLOWNRAND4[1:7, ]
#Rejoin the data
FLOWNRAND_uneven <- rbind(FLOWNRAND1, FLOWNRAND2, FLOWNRAND3, FLOWNRAND4)
#Save the table with plots removed from each stratum flown in 2017
write.table(FLOWNRAND_uneven, file = "//Documents/Flown_RAND_uneven_JR.csv", sep = ",", row.names = FALSE, col.names = TRUE)
Here's a data.table solution if you just need to know which rows are in each set.
library(data.table)
df <- data.table(dat = runif(100),
stratum = sample(1:4, 100, replace = T))
# Gets the specified number of rows at random from each stratum
get_strata <- function(df, n, i){
  # Subset data frame to randomly chosen w/in strata
  # replace stratum with var name
  f <- df[df[, .I[sample(.N, n)], by = stratum]$V1]
  # Save as CSV, replace path; write.csv always writes column names and ignores col.names
  write.csv(f, file = paste0("path/df_", i, ".csv"),
            row.names = FALSE)
}
for (i in 1:100){
# replace 10 with number needed
get_strata(df, 10, i)
}
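The question asks for a different number of rows per stratum (34, 21, 7 and 7) and for both files to be saved on every iteration. A base R sketch of that loop could look as follows; the plots data frame, the stratum sizes and the output file names are made up for illustration, and the files are written to a temporary directory.

```r
set.seed(1)  # illustration only

# Hypothetical stand-in for the plot data: 137 plots in 4 strata
plots <- data.frame(PLOT = 1:137,
                    STRATUM = rep(1:4, times = c(60, 40, 20, 17)))

# Rows to keep per stratum, taken from the question
keep_n <- c("1" = 34, "2" = 21, "3" = 7, "4" = 7)

out_dir <- tempdir()
for (i in 1:100) {
  # Assign a fresh random number to every row and save that table
  plots$RAND_NUM <- sample(nrow(plots))
  ranked <- plots[order(plots$STRATUM, plots$RAND_NUM), ]
  write.csv(ranked, file.path(out_dir, sprintf("RANDNUM_%03d.csv", i)),
            row.names = FALSE)

  # Keep the first keep_n rows of each stratum, i.e. the lowest random numbers
  kept <- do.call(rbind, lapply(split(ranked, ranked$STRATUM), function(d) {
    d[seq_len(keep_n[[as.character(d$STRATUM[1])]]), ]
  }))
  write.csv(kept, file.path(out_dir, sprintf("Flown_RAND_uneven_%03d.csv", i)),
            row.names = FALSE)
}
```

sprintf("%03d", i) pads the iteration number with zeros, so the 100 files sort in order.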
I would like to read a large .csv into R. It'd be handy to split it into various objects and treat them separately. I managed to do this with a while loop, assigning each tenth to an object:
# The dataset is larger, numbers are fictitious
n <- 0
while(n < 10000){
a <- paste('a_', n, sep = '')
assign(a, read.csv('df.csv',
header = F, stringsAsFactors = F, nrows = 1000, skip = 0 + n))
# There will be some additional processing here (omitted)
n <- n + 1000
}
Is there a more R-like way of doing this? I immediately thought of lapply. According to my understanding each object would be the element of a list that I would then have to unlist.
I gave a shot to the following but it didn't work and my list only has one element:
A <- lapply('df.csv', read.csv,
header = F, stringsAsFactors = F, nrows = 1000, skip = seq(0, 10000, 1000))
What am I missing? How do I proceed from here? How do I then unlist A and specify each element of the list as a separate data.frame?
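If you apply lapply to a single element, you'll get only one element as output; lapply iterates over its first argument, and here that is the single string 'df.csv'.
You probably want to iterate over the skip offsets instead:
skips <- seq(0, 9000, by = 1000) # one offset per block of 1000 rows
A <- lapply(skips, function(n){
  read.csv('df.csv', header = FALSE, stringsAsFactors = FALSE, nrows = 1000, skip = n)
})
names(A) <- paste0('a_', skips) # label each chunk a_0, a_1000, ... as in the question
For each element of skips, called n because that is the name I chose for the function parameter, I execute your command. A will be a list of the results.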
Edit: As @Val mentions in the comments, assign does not seem to be needed here, so I removed it. If all works fine, you'll end up with a list of data.frames read from your csv.
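As a self-contained illustration of reading a file in fixed-size chunks with lapply over skip offsets, here is a sketch that first writes a small temporary file standing in for 'df.csv' (the file contents and the names chunk_size and skips are assumptions made for the example).

```r
# Small temporary file standing in for 'df.csv' (no header, 10000 rows)
tmp <- tempfile(fileext = ".csv")
write.table(data.frame(a = 1:10000, b = rnorm(10000)), tmp,
            sep = ",", row.names = FALSE, col.names = FALSE)

chunk_size <- 1000
skips <- seq(0, 9000, by = chunk_size)

# One read.csv call per offset; each list element is one chunk
A <- lapply(skips, function(n) {
  read.csv(tmp, header = FALSE, stringsAsFactors = FALSE,
           nrows = chunk_size, skip = n)
})
names(A) <- paste0("a_", skips)

# Access a chunk by name instead of creating separate objects with assign()
first_chunk <- A[["a_0"]]
# If one data frame is needed again, bind the chunks back together
whole <- do.call(rbind, A)
```

Keeping the chunks in a named list usually makes unlisting unnecessary: further processing can be applied to every chunk with another lapply call.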
How can I append a column in a dataframe?
I'm iterating over my data matrix, and if some values meet a threshold I've set, I want to store them in a 1-row dataframe so I can print it at the end of the loop.
My code looks like this:
for (i in 1:nrow(my.data.frame)) {
# Store gene name in a variable and use it as row name for the 1-row dataframe.
gene.symbol <- rownames(my.data.frame)[i]
# init the dataframe to output
gene.matrix.of.truth <- data.frame(matrix(ncol = 0, nrow = 0))
for (j in 1:ncol(my.data.frame)) {
if (my.data.frame[i,j] < threshold) {
str <- paste(colnames(my.data.frame)[j], ';', my.data.frame[i,j], sep='')
# And I want to append this str to the gene.matrix.of.truth
# I tried gene.matrix.of.truth <- cbind(gene.matrix.of.truth, str) But didn't get me anywhere.
}
}
# Ideally I want to print the dataframe here.
# but, no need to print if nothing met my requirements.
if (ncol(gene.matrix.of.truth) != 0) {
write.table(gene.matrix.of.truth, file = paste('out_',gene.symbol,sep=''), row.names = T, col.names = F, sep='|', quote = F)
}
}
I do this sort of thing all the time, but with rows instead of columns. Start with
gene.matrix.of.truth = data.frame(x = character(0))
instead of the gene.matrix.of.truth <- data.frame(matrix(ncol = 0, nrow = 0)) you have at initiation. Your append step inside the for j loop will be
gene.matrix.of.truth = rbind(gene.matrix.of.truth, data.frame(x = str))
(i.e. create a dataframe around str and append it to gene.matrix.of.truth).
Obviously, your final if statement will be if(nrow(...)) instead of if(ncol(...)), and if you want the final table as one big row you'll need t() to transpose your data frame at printing time.
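Put together, the row-wise accumulation might look like the sketch below. The small my.data.frame, its gene and sample names, and the threshold value are invented for illustration, and the per-gene results are collected in a list instead of being written to files.

```r
# Hypothetical stand-in data: genes in rows, samples in columns
my.data.frame <- data.frame(s1 = c(0.01, 0.80, 0.20),
                            s2 = c(0.03, 0.90, 0.70),
                            s3 = c(0.50, 0.60, 0.02),
                            row.names = c("geneA", "geneB", "geneC"))
threshold <- 0.05

results <- list()
for (i in seq_len(nrow(my.data.frame))) {
  gene.symbol <- rownames(my.data.frame)[i]
  # Start each gene with an empty one-column data frame
  gene.matrix.of.truth <- data.frame(x = character(0))
  for (j in seq_len(ncol(my.data.frame))) {
    if (my.data.frame[i, j] < threshold) {
      str <- paste(colnames(my.data.frame)[j], my.data.frame[i, j], sep = ";")
      # Append the new entry as one more row
      gene.matrix.of.truth <- rbind(gene.matrix.of.truth, data.frame(x = str))
    }
  }
  # Keep only genes with at least one value under the threshold
  if (nrow(gene.matrix.of.truth) > 0) {
    results[[gene.symbol]] <- gene.matrix.of.truth
  }
}
```

Calling t(results$geneA) would then give the entries for one gene as a single row.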