I have a .csv file with 175 rows and 6 columns. I want to append a 176th row. My code is as follows:
x <- data.frame('1', 'ab', 'username', '<some.sentence>', '2017-05-04T00:51:35Z', '24')
write.table(x, file = "Tweets.csv", append = T)
What I expect to see is the new row appended in the same comma-separated format as the existing 175 rows. Instead, my result is a space-separated line with quoted values and a row name, plus an extra row of auto-generated column names.
How should I change my code?
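write.table's defaults don't match an existing CSV: it uses a space separator, quotes character values, writes row names, and (with a warning) appends the column names as an extra row. Override all of them explicitly: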
write.table(x, file = "Tweets.csv", sep = ",", append = TRUE, quote = FALSE,
col.names = FALSE, row.names = FALSE)
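With those arguments the appended line matches the format of the existing rows:
1,ab,username,<some.sentence>,2017-05-04T00:51:35Z,24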
I have 400+ quite large CSV files (~1 million rows each), all with a similar structure:
A long header, of which I only need the 2nd and 3rd rows
A first time series (always preceded by 'Target 1')
A second time series (always preceded by 'Target 2')
Here is an example of the data:
#multiple rows of header#
Target 1
Timestamp,X,Y,Z
1553972886851,0.017578,-0.003052,-0.971375
1553972886851,0.017883,-0.003662,-0.980408
1553972886851,0.016418,-0.003174,-0.977295
1553972886999,0.017151,-0.002808,-0.978088
1553972886999,0.016785,-0.003113,-0.977051
1553972886999,0.017883,-0.002197,-0.975830
1553972887096,0.017517,-0.003113,-0.976624
1553972887096,0.017883,-0.003113,-0.977966
1553972887096,0.017883,-0.002869,-0.978210
1553972887243,0.017151,-0.003113,-0.976135
1553972887243,0.018250,-0.003235,-0.975647
1553972887243,0.017273,-0.002991,-0.976257
1553972887340,0.018372,-0.003235,-0.977722
1553972887340,0.017761,-0.003235,-0.978027
Target 2
Timestamp,X,Y,Z
1553972886753,-0.411585,0.072409,-0.849848
1553972886753,-0.339177,-0.053354,-0.556402
1553972886753,-0.411585,-0.262957,-0.483994
1553972886855,-0.506860,-0.057165,-0.472561
1553972886855,-0.499238,-0.007622,-0.529726
1553972886855,-0.472561,-0.041921,-0.560213
1553972887002,-0.510671,-0.083841,-0.480183
1553972887002,-0.525915,-0.057165,-0.480183
1553972887002,-0.544969,-0.038110,-0.522104
1553972887098,-0.510671,-0.030488,-0.510671
1553972887098,-0.529726,-0.026677,-0.525915
1553972887098,-0.510671,-0.068598,-0.518293
I need to split each CSV file into those three parts (step 1: the header rows, step 2: the 'Target 1' series, step 3: the 'Target 2' series) and name them accordingly.
I managed to do step 1) and step 3) but struggle with step 2).
Here is what I did for step 3):
fileNames <- basename(list.files(path = ".", all.files = FALSE, full.names = FALSE, recursive = TRUE, ignore.case = FALSE, include.dirs = FALSE))
extension <- "txt"
fileNumbers <- seq(fileNames)

for (fileNumber in fileNumbers) {
  newFileName <- paste("Target2-",
                       sub(paste("\\.", extension, sep = ""), "", fileNames[fileNumber]),
                       ".", extension, sep = "")
  # read old data:
  Lines <- readLines(fileNames[fileNumber])
  ix <- which(Lines == "Target2")
  sample <- read.csv(fileNames[fileNumber],
                     header = TRUE,
                     sep = ",", skip = ix)
  # write old data to new files:
  write.table(sample,
              newFileName,
              append = FALSE,
              quote = FALSE,
              sep = ",",
              row.names = FALSE,
              col.names = TRUE)
}
I'm quite sure this is not the most straightforward approach, and I can't get the data between Target 1 and Target 2 with it. It is also super slow, and I was wondering if there could be a more memory-efficient approach?
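Reading each file once with readLines is faster: flag the marker lines, split the remaining lines into blocks by the marker that precedes them, and write each block out: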
foo = function(filename) {
  cat("\nprocessing", filename, "...")
  x = readLines(con = filename)
  # flag the "Target ..." marker lines
  idx = grepl("^Target", x)
  # group the remaining lines by the marker that precedes them;
  # [-1] drops group "0", i.e. the header block before the first marker
  x = split(x[!idx], cumsum(idx)[!idx])[-1]
  invisible(lapply(seq_along(x), function(i) {
    write.table(
      x = x[[i]],
      file = sub("\\.csv", paste0("_", i, ".csv"), filename),
      append = FALSE,
      row.names = FALSE,
      quote = FALSE,
      col.names = FALSE)
  }))
}
# full.names = TRUE so foo receives paths that readLines can open
files = list.files(path = "path/to/files", pattern = ".+\\.csv$", full.names = TRUE)
lapply(files, foo)
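This writes the 'Target 1' block to file_1.csv and the 'Target 2' block to file_2.csv. For step 1 (the 2nd and 3rd header rows), a hedged sketch along the same lines, assuming the header is everything before the first Target marker (get_header is a hypothetical helper name):

get_header = function(filename) {
  x = readLines(con = filename)
  # index of the first "Target ..." marker line
  first_marker = which(grepl("^Target", x))[1]
  header = x[seq_len(first_marker - 1)]
  # keep only the 2nd and 3rd header rows, as required
  writeLines(header[2:3], sub("\\.csv", "_header.csv", filename))
}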
I am trying to read text files and create a data frame (called dataset) from about 12 specific columns located at fixed character positions, as below:
library(data.table)  # for fread
library(stringi)     # for stri_sub

x <- fread("file1.txt", colClasses = "character", sep = "\n",
           header = FALSE, verbose = FALSE, strip.white = FALSE)
y <- fread("file2.txt", colClasses = "character", sep = "\n",
           header = FALSE, verbose = FALSE, strip.white = FALSE)
# combine them
x = rbind(x, y)
# We basically read each whole line as a string and then take substrings
# corresponding to each variable's start and end positions.
Var1 = sapply(as.list(x$V1), stri_sub, from = 80, to = 82)
Var1 = as.data.frame(Var1)
Var2 = sapply(as.list(x$V1), stri_sub, 83, 89)
Var2 = as.data.frame(Var2)
dataset = cbind(Var1, Var2)
It takes around 1 minute to run; the two text files have 200K and 300K rows respectively, with 1,800 characters per line. Is there a faster way to do this? I will be reading about 200 such files.
I think you can simplify your code in the following manner:
x <- Reduce(rbind, lapply(1:2, function(k) fread(paste0("file", k, ".txt"),
                                                 colClasses = "character",
                                                 sep = "\n",
                                                 header = FALSE,
                                                 verbose = FALSE,
                                                 strip.white = FALSE)))
dataset <- data.frame(Var1 = substr(x$V1, 80, 82), Var2 = substr(x$V1, 83, 89))
where the second line should save the most time: substr is vectorised over the whole column, avoiding the row-by-row sapply.
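Since about 200 such files will be read, a hedged sketch scaling the same idea up with data.table's rbindlist (assuming the files are named file1.txt through file200.txt):

library(data.table)

files <- paste0("file", 1:200, ".txt")
# read every file as one character column and stack them
x <- rbindlist(lapply(files, fread, colClasses = "character",
                      sep = "\n", header = FALSE, strip.white = FALSE))
# substr is vectorised, so each variable is a single pass over the column
dataset <- data.frame(Var1 = substr(x$V1, 80, 82),
                      Var2 = substr(x$V1, 83, 89))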
So, I have this input CSV of the form:
id,No.,V,S,D
1,0100000109,623,233,331
2,0200000109,515,413,314
3,0600000109,611,266,662
I need to read the No. column as-is (i.e., as character). I know I can use something like this for that:
data <- read.csv("input.csv", colClasses = c("MSISDN" = "character"))
Here is the code I'm using to read the CSV file in chunks:
chunk_size <- 2
con <- file("input.csv", open = "r")

data_frame <- read.csv(con, nrows = chunk_size,
                       colClasses = c("MSISDN" = "character"),
                       quote = "", header = TRUE)
header <- names(data_frame)
print(header)
print(data_frame)

if (nrow(data_frame) == chunk_size) {
  repeat {
    data_frame <- read.csv(con, nrows = chunk_size, header = FALSE, quote = "")
    names(data_frame) <- header
    print(header)
    print(data_frame)
    if (nrow(data_frame) < chunk_size) {
      break
    }
  }
}
close(con)
But the issue I'm facing is that only the first chunk reads the No. column as character; the rest of the chunks do not.
How can I resolve this?
PS: the original input file has 150+ columns and about 20 million rows.
You can read the data as strings with readLines and split it:
fileName <- "input.csv"
df <- do.call(rbind.data.frame, strsplit(readLines(fileName), ",")[-1])  # [-1] skips the header line
colnames(df) <- c("id", "No.", "V", "S", "D")  # re-add the column names
or the direct approach with read.csv:
fileName <- "input.csv"
col <- c("integer", "character", "integer", "integer", "integer")
df <- read.csv(file = fileName,
               sep = ",",
               colClasses = col,
               header = TRUE,
               stringsAsFactors = FALSE)
You need to pass the column types via colClasses to the read.csv() inside the repeat loop as well.
Since those chunks no longer have a header, you need an unnamed vector that specifies the colClasses by position.
Let's say there are 150 columns:
myColClasses <- rep("numeric", 150)
myColClasses[2] <- "character"

repeat {
  data_frame <- read.csv(con, nrows = chunk_size, colClasses = myColClasses,
                         header = FALSE, quote = "")
  ...
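Putting it together with the chunked reader above, a hedged, untested sketch (assuming 150 columns with the character column in position 2, as above):

con <- file("input.csv", open = "r")
chunk_size <- 2
myColClasses <- rep("numeric", 150)
myColClasses[2] <- "character"

# first chunk carries the header; the unnamed colClasses applies by position
data_frame <- read.csv(con, nrows = chunk_size, header = TRUE,
                       quote = "", colClasses = myColClasses)
header <- names(data_frame)

repeat {
  data_frame <- read.csv(con, nrows = chunk_size, header = FALSE,
                         quote = "", colClasses = myColClasses)
  names(data_frame) <- header
  # process the chunk here
  if (nrow(data_frame) < chunk_size) break
}
close(con)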
I'm attempting to import and export, in pieces, a single 10 GB CSV file with roughly 10 million observations. I want about 10 manageable RData files in the end (data_1.RData, data_2.RData, etc.), but I'm having trouble making skip and nrows dynamic. My nrows will never change, as I need almost 1 million rows per dataset, but I think I need some equation for skip= so that on every loop iteration it increases to catch the next 1 million rows. Also, header=T might mess up anything beyond ii=1, since only the first row includes variable names. The following is the bulk of the code I'm working with:
for (ii in 1:10) {
  data <- read.csv("myfolder/file.csv",
                   row.names = NULL, header = T, sep = ",", stringsAsFactors = F,
                   skip = 0, nrows = 1000000)
  outName <- paste("data", ii, sep = "_")
  save(data, file = file.path(outPath, paste(outName, ".RData", sep = "")))
}
(Untested but...) You can try something like this:
nrows <- 1000000
# skip the header line plus all rows consumed by previous chunks
ind <- seq(from = 1, by = nrows, length.out = 10)
header <- names(read.csv("myfolder/file.csv", header = TRUE, nrows = 1))

for (i in seq_along(ind)) {
  data <- read.csv("myfolder/file.csv",
                   row.names = NULL, header = FALSE,
                   sep = ",", stringsAsFactors = FALSE,
                   skip = ind[i], nrows = nrows)
  names(data) <- header
  outName <- paste("data", i, sep = "_")
  save(data, file = file.path(outPath, paste(outName, ".RData", sep = "")))
}