I would like to be able to scan a csv file row by row in R and exclude the rows that contain the word "target".
The problem is that the data comes from different places and the word "target" can come up in a number of different columns in the data frame.
So I need a line in a function that will look for this string, and if it is not present, then append that row to a new data frame (that I will then write out as a new csv).
Any and all help gratefully received.
Andrie's comment is probably the way most users would approach this, but if you want to do this at the reading in stage, you can try this:
Read in your csv using readLines and make any lines that have the text target blank:
temp = gsub(".*target.*", "", readLines("test.csv"))
Use read.table to convert temp to a data.frame. Since all lines that have the text target are now blank, the default blank.lines.skip=TRUE in read.table should correctly read in the rest of your data as a data.frame.
read.table(text=temp, sep=",", header=TRUE)
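Putting the two steps together with the write-out step the question mentions, a minimal sketch (test.csv and filtered.csv are placeholder file names):
temp <- gsub(".*target.*", "", readLines("test.csv"))          # blank out lines mentioning "target"
filtered <- read.table(text = temp, sep = ",", header = TRUE)  # blank lines are skipped by default
write.csv(filtered, "filtered.csv", row.names = FALSE)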
Use readLines:
lines <- readLines(file)
n.lines <- length(lines)
vec.1 <- rep(0, n.lines)
vec.2 <- rep(0, n.lines)
# more vectors as necessary
counter <- 0
for (i in 1:n.lines){
this.line <- strsplit(lines[i], ",")[[1]]  # strsplit returns a list; take its first element
if ("target" %in% this.line) next
counter <- counter + 1
vec.1[counter] <- this.line[1]
vec.2[counter] <- this.line[2]
# etc.
}
df <- data.frame(vec.1[1:counter], vec.2[1:counter])
You may have to change n.lines slightly and change the indexing of the for loop if your file has headers; two lines would change as follows:
n.lines <- length(lines) - 1
and
for(i in 2:(n.lines+1)){
I would call from.readLines <- readLines(filename) and then just sub-select the rows that don't contain the target string: data <- read.csv(text = from.readLines[-grep('target', from.readLines)], header = F).
The faster way to do it (if your file is huge) would be to run grep -v 'target' original.csv > new.csv on the command line first and then read.csv("new.csv", ...) in R.
But anyway,
> #Without header
> from.readLines <- c('afaf,afasf,target', 'afaf,target,afasf', 'dagdg,asgst,sagga', 'dagdg,dg,sfafgsgg')
> data <- read.csv(text = from.readLines[-grep('target', from.readLines)], header = F)
> print(data)
V1 V2 V3
1 dagdg asgst sagga
2 dagdg dg sfafgsgg
>
> #With header
> from.readLines <- c('var1,var2,var3', 'afaf,afasf,target', 'afaf,target,afasf', 'dagdg,asgst,sagga', 'dagdg,dg,sfafgsgg')
> data <- read.csv(text = from.readLines[-(grep('target', from.readLines[-1]) + 1)])
> print(data)
var1 var2 var3
1 dagdg asgst sagga
2 dagdg dg sfafgsgg
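If you want to drive the command-line route mentioned above from within R, a rough sketch (this assumes a Unix-like system with grep available; original.csv and new.csv are placeholder names):
system("grep -v 'target' original.csv > new.csv")  # drop every line containing "target"
data <- read.csv("new.csv")                        # add header = FALSE if the file has no header row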
I have a very large (10 million row x 12 column) comma-delimited text file. The first column contains UNIX times (in seconds to 2 d.p.)
I would like to extract all rows corresponding to a particular date (e.g. 2014-06-26), and save the rows for each date in other smaller files.
In the code below I scan through the file, reading in the first number in each row (the time), and print the row number whenever the date associated with the current row differs from that of the previous row:
## create fake data ; there are many duplicate times, rows are not always in order
con <- "BigFile.txt"; rile.remove(con)
Times <- seq ( 1581259391, 1581259391 + (7*24*3600), by=100)
write.table(data.frame(Time=Times, x=runif(n = length(Times))), file=con, sep=",", row.names=F, col.names=F, append=F)
## read in fake data line-by-line
con <- file( "BigFile.txt", open="r")
Row <- 0
Now <- 0
Last <- 0
while (length(myLine <- scan(con, what = character(), nlines = 1, sep = ",", quiet = TRUE)) > 0) {
  Row <- Row + 1
  Now <- as.Date(as.POSIXct(as.numeric(myLine[1]), origin = "1970-01-01", tz = "GMT"))
  if (Now != Last) print(data.frame(Row, Now))
  Last <- Now
}
The idea would then be to save these indices, and use them to cut up the file into smaller daily chunks... However, I am sure there must be much more efficient approaches (I have tried opening these files using the data.table package, but still run into memory issues).
Any pointers will be greatly appreciated.
library(sqldf)
# data
con <- "BigFile.txt"
Times <- seq ( 1581259391, 1581259391 + (7*24*3600), by=100)
write.table(data.frame(Time=Times, x=runif(n = length(Times))), file=con, sep=",", row.names=F, col.names=F, append=F)
# solution
df <- read.csv.sql("BigFile.txt", header = FALSE, eol = "\n",
                   sql = "select * from file where V1 = 1403740800")
I have a text file of names, separated by commas, and I want to read it into R (a data frame or a vector is fine). When I try read.csv it just reads them all in as headers for separate columns, with 0 rows of data. With header=FALSE it still reads them in as separate columns. I could work with this, but what I really want is a single column with one row per name. As it stands, printing the data frame shows all the (useless) column headers and no values. It seems like it should be easily usable, but one column of names would be much easier to work with.
Since the OP asked me to, I'll post the comment above as an answer.
It's very simple, and it comes from some practice in reading in sequences of data, numeric or character, using scan.
dat <- scan(file = your_filename, what = 'character', sep = ',')
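For example, writing a one-line file of made-up comma-separated names and reading it back:
tf <- tempfile()
writeLines("alice,bob,carol", tf)   # made-up names, just for illustration
dat <- scan(file = tf, what = 'character', sep = ',')
dat
# [1] "alice" "bob"   "carol"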
You can use read.csv and read the string in as a header, then just extract the names (using names()) and put them into a data.frame:
data.frame(x = names(read.csv("FILE")))
For example:
write.table("qwerty,asdfg,zxcvb,poiuy,lkjhg,mnbvc",
"FILE", col.names = FALSE, row.names = FALSE, quote = FALSE)
data.frame(x = names(read.csv("FILE")))
x
1 qwerty
2 asdfg
3 zxcvb
4 poiuy
5 lkjhg
6 mnbvc
Something like this?
Make some test data:
# test data
list_of_names <- c("qwerty","asdfg","zxcvb","poiuy","lkjhg","mnbvc" )
list_of_names <- paste(list_of_names, collapse = ",")
list_of_names
# write to temp file
tf <- tempfile()
writeLines(list_of_names, tf)
You need this part:
# read from file
line_read <- readLines(tf)
line_read
list_of_names_new <- unlist(strsplit(line_read, ","))
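If you want this as the one-column data frame described in the question, wrapping the vector is all that's left (the column name x is arbitrary):
names_df <- data.frame(x = list_of_names_new, stringsAsFactors = FALSE)
names_df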
I need to read many files into R, do some clean up, and then combine them into one data frame. The files all basically start like this:
=~=~=~=~=~=~=~=~=~=~=~= PuTTY log 2016.07.11 09:47:35 =~=~=~=~=~=~=~=~=~=~=~=
up
Upload #18
Reader: S1 Site: AA
--------- upload 18 start ---------
Type,Date,Time,Duration,Type,Tag ID,Ant,Count,Gap
E,2016-07-05,11:45:44.17,"upload 17 complete"
D,2016-07-05,11:46:24.69,00:00:00.87,HA,900_226000745055,A2,8,1102
D,2016-07-05,11:46:43.23,00:00:01.12,HA,900_226000745055,A2,10,143
The row with column headers is "Type,Date,Time,Duration,Type,Tag ID,Ant,Count,Gap". Data should have 9 columns. The problem is that the number of rows above the header string is different for every file, so I cannot simply use skip = 5. I also only need lines that begin with "D,", everything else is messages, not data.
What is the best way to read in my files, ensuring that I have 9 columns and skipping all the junk?
I have been using the read_csv function from the readr package because thus far it has produced the fewest formatting issues. But I am open to any new ideas, including a way to read in just the lines that begin with "D,". I toyed with using read.table and skip = grep("Type,", readLines(i)), but it doesn't seem to find the header string correctly. Here's my basic code:
dataFiles <- Sys.glob("*.*")
datalist <- list()
for (i in dataFiles) {
d01 <- read_csv(i, col_names = F, na = "NA", skip = 35)
# do clean-up stuff
datalist[[i]] <- d01
}
Another basic R solution is the following: you read in the file by lines and get the indices of the rows that begin with "D," as well as of the header row. After that, you simply split these lines on ",", put them in a data.frame, and assign the names from the header row to it.
lines <- readLines(i)
dataRows <- grep("^D,", lines)
names <- unlist(strsplit(lines[grep("Type,", lines)], split = ","))
data <- as.data.frame(matrix(unlist(strsplit(lines[dataRows], ",")), nrow = length(dataRows), byrow=T))
names(data) <- names
Output:
Type Date Time Duration Type Tag ID Ant Count Gap
1 D 2016-07-05 11:46:24.69 00:00:00.87 HA 900_226000745055 A2 8 1102
2 D 2016-07-05 11:46:43.23 00:00:01.12 HA 900_226000745055 A2 10 143
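To apply the same idea to all of your files and stack the results, one possible wrapper is sketched below; dataFiles comes from the Sys.glob call in your code, and this assumes every file shares the same header line:
read_d_rows <- function(f) {
  lines <- readLines(f)
  dataRows <- grep("^D,", lines)
  headerNames <- unlist(strsplit(lines[grep("^Type,", lines)], split = ","))
  out <- as.data.frame(matrix(unlist(strsplit(lines[dataRows], ",")),
                              nrow = length(dataRows), byrow = TRUE),
                       stringsAsFactors = FALSE)
  names(out) <- headerNames
  out
}
allData <- do.call(rbind, lapply(dataFiles, read_d_rows))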
You can use a custom function to loop over each file, keep only the rows whose Type column starts with D, and bind them all together at the end. Drop the bind_rows if you want to keep them as a list of separate data frames.
load_data <-function(path) {
require(dplyr)
setwd(path)
files <- dir(pattern = "\\.csv$")
read_files <- function(x) {
data_file <- read.csv(file.path(path, x), stringsAsFactors = FALSE, na.strings = c("", "NA"))
row.number <- grep("^Type$", data_file[,1])
colnames(data_file) <- data_file[row.number,]
data_file <- data_file[-(1:row.number), ]
data_file <- data_file %>%
filter(grepl("^D", Type))
return(data_file)
}
data <- lapply(files, read_files)
}
list_of_file <- bind_rows(load_data("YOUR_FOLDER_PATH"))
If your header row always begins with the word Type, you can simply omit the skip option from your initial read, and then remove any rows before the header row. Here's some code to get you started (not tested):
dataFiles <- Sys.glob("*.*")
datalist <- list()
for (i in dataFiles) {
d01 <- read_csv(i, col_names = F, na = "NA")
headerRow <- which( d01[[1]] == 'Type' )
d01 <- d01[-(1:headerRow),] # This keeps all rows after the header row.
# do clean-up stuff
datalist[[i]] <- d01
}
If you want to keep the header, you can use:
for (i in dataFiles) {
d01 <- read_csv(i, col_names = F, na = "NA")
headerRow <- which( d01[[1]] == 'Type' )
header <- d01[headerRow,] # Get names from the header row.
d01 <- d01[-(1:headerRow),] # This keeps all rows after the header row.
d01 <- setNames( d01, unlist(header) ) # Assign names.
# do clean-up stuff
datalist[[i]] <- d01
}
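Since you also asked about reading in just the lines that begin with "D,": you can pre-filter with readLines/grep, write the surviving lines to a temporary file, and hand that to read_csv. A sketch (untested against your real files; it assumes each file has exactly one header line starting with "Type,", and note the duplicated Type column name will get renamed by read_csv):
library(readr)
for (i in dataFiles) {
  lines <- readLines(i)
  keep <- c(grep("^Type,", lines, value = TRUE)[1],  # the header line
            grep("^D,", lines, value = TRUE))        # data lines only
  tf <- tempfile(fileext = ".csv")
  writeLines(keep, tf)
  d01 <- read_csv(tf, na = "NA")
  datalist[[i]] <- d01
}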
I’m looking to do the following in R.
I have 250+ csv files of chromatographic data structured similarly to the example below, but with 21 rows instead of three:
1 4.708252 BB 9.946890 7.830349 0.01982016 4.684836 4.742056
2 4.970352 BB 1.792341 1.497008 0.01896829 4.945352 5.005390
3 6.393414 BB 6.599891 5.309925 0.01950091 6.368413 6.428723
What I want to do is read a subset of the data in all 250 files into a single data frame, which is easy enough — but I also need to restructure it a fair bit.
Every row in the table above is a peak. I only want the data from the first and fourth columns (which are ‘peak number’ and ‘area under the peak’, respectively), and in the output I need to make each peak an individual column, rather than a row as above, with the peak number as the header. Finally, I want to create a new column where each row (that is, the data from each individual csv file) is given the same name as the csv file name.
So, imagine I have 3 files: ABC1.csv, ABC2.csv, and ABC3.csv. Each file looks like my example above. I want to automatically take all those files and merge them into a single data frame such as the one below.
ID 1 2 3
ABC1 9.94689 1.792341 6.599891
ABC2 9.76651 1.932332 6.600022
ABC3 8.99193 2.556471 6.718934
I hope I’ve made this clear enough. I’ve been able to manage most of the steps but haven’t been successful writing them into a single script. And I have no idea how, if there is any way, to make the file name into a variable.
Cheers
I am assuming the working directory is set to where the files are. Then you can get the list of files below.
filenames <- list.files()
Have a helper function to read a file and keep just columns 1 and 4.
readdata <- function(filename) {
df <- read.csv(filename, header = FALSE)  # the example files have no header row
vec <- df[, 4]
names(vec) <- df[, 1]
return(vec)
}
Loop over all of the files and rbind them
result <- do.call(rbind, lapply(filenames, readdata))
Name them as you like
row.names(result) <- filenames
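If you'd rather have the ID column from your example (the file name without the .csv extension) instead of row names, one more step does it; a sketch:
result_df <- data.frame(ID = sub("\\.csv$", "", filenames),
                        result,
                        check.names = FALSE, row.names = NULL)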
The following code can probably be of some help, though the file name part is still not working properly:
path <- "C:\\Users\\Vidyut\\"
filenames <- list.files(path = path,pattern = ".csv")
l <- data.frame(ID=character(),col1=numeric(),col2=numeric(),col3=numeric(),stringsAsFactors=FALSE)
for (i in filenames) {
#i = filenames[1]
full = paste(path,i,sep="")
m <- read.csv(full, header=F)
# extract the subset of rows required from each file
# m <- m[c(),]
n <- m[, c(1, 4)]
y <- gsub(".csv", "", i, fixed = TRUE)
print("y=")
print(y)
d <- list(ID=as.character(y),col1=n[1,2],col2=n[2,2],col3=n[3,2])
print("d=")
print(d)
l <- rbind.data.frame(l,d)
print("l=")
print(l)
}
Mind you, this is not very pretty code - just something hacked together to get the job done (as you can see from the multiple print lines scattered throughout).
Here's a solution for you. This only works if we can assume that there are exactly 21 peaks in each file and they are in order 1:21. If that's not the case a few changes to the code should remedy this.
folder = "c:/temp/"
files <- dir(folder)
first_loop <- TRUE
for (file in files) {
# Read one file, only the first and fourth columns
temp <- read.csv(file = paste0(folder, file),
                 header = FALSE,
                 colClasses = c("integer", "NULL", "NULL", "numeric", "NULL", "NULL", "NULL", "NULL"))
# Transpose the data
temp <- data.frame(t(temp))
# Keep only the peak areas (drop the peak-number row)
temp <- temp[2,]
# Record which file this row came from
temp$file <- file
# Concatenate the dataframes together
if (first_loop) {
data <- temp
first_loop <- FALSE
} else {
data <- rbind(data, temp)
}
}
data
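To match the layout in your example (an ID column holding the file name without its extension), a small follow-up once the loop has finished; this is just a sketch:
data$file <- sub("\\.csv$", "", data$file)   # strip the extension
names(data)[names(data) == "file"] <- "ID"   # rename the column
data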
I am struggling to do something that I know should be simple.
I have a list of dataframes like so:
a <- rep(1, 10)
b <- rep(3.6, 10)
foo1 <- cbind(a, b)
d <- rep(2, 8)
b <- rep(4.9, 8)
foo2 <- cbind(d, b)
data <- list(foo1, foo2)
I want to extract the 2nd column from each data frame, either by indexing or by column name, and save it to a csv file using write.table, with the same name as the data frame. I have tried a lot of things: for loops, lapply, and sapply.
I get a variety of error messages, but mostly the following:
In if (file == "") file <- stdout() else if (is.character(file)) { :
the condition has length > 1 and only the first element will be used
which I can't resolve.
I know I'm not indexing properly. Help me please!
You can use a loop to iterate over the elements of data:
for (i in 1:length(data)) {
col <- data[[i]][,2]
fname <- paste("foo", i, ".csv", sep="")
write.table(col,fname)
}
The write.table command will likely need a bit of tweaking, until you get the data in the format you want.
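For example, if you build data as a named list, you can write one comma-separated file per element, named after the corresponding data frame (foo1 and foo2 are the objects from your question); a sketch:
data <- list(foo1 = foo1, foo2 = foo2)
for (nm in names(data)) {
  col <- data[[nm]][, 2]                    # second column of this element
  write.table(col, paste0(nm, ".csv"),
              sep = ",", row.names = FALSE, col.names = FALSE)
}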