How to delete specific rows from multiple columns - r

I am importing some columns from multiple csv files into R. I want to delete all the data after row 1472.
temp = list.files(pattern="*.csv") #Importing csv files
Normalyears<-c(temp[1],temp[2],temp[3],temp[5],temp[6],temp[7],temp[9],temp[10],temp[11],temp[13],temp[14],temp[15],temp[17],temp[18],temp[19],temp[21],temp[22],temp[23])
leapyears<-c(temp[4],temp[8],temp[12],temp[16],temp[20]) #separating the csv files into leap years and normal years.
Importing only the second column of each csv file.
myfiles_Normalyears = lapply(Normalyears, read.delim,colClasses=c('NULL','numeric'),sep =",")
myfiles_leapyears = lapply(leapyears, read.delim,colClasses=c('NULL','numeric'),sep =",")
new.data.leapyears <- NULL
for(i in 1:length(myfiles_leapyears)) {
in.data <- read.table(if(is.null(myfiles_leapyears[i])),skip=c(1472:4399),sep=",")
new.data.leapyears <- rbind(new.data.leapyears, in.data)}
The loop is supposed to delete all the rows from 1472 to 4399.
Error: Error in read.table(myfiles_leapyears[i], skip = c(1472:4399), sep = ",") :
'file' must be a character string or connection

There is an nrows parameter to read.table, so why not try
read.table(myfiles_leapyears[i], nrows = 1471,sep=",")
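Note that read.table expects a file name or connection, so nrows only helps when it is applied while importing, not to data that has already been read. A minimal sketch of that idea, assuming the file names are still available in leapyears; the temporary files, their columns, and n_keep are made up for illustration (n_keep would be 1471 for the data in the question):

```r
# Hypothetical stand-ins for the real yearly files: two small CSVs in a temp dir.
demo_dir <- tempfile("csv_demo_")
dir.create(demo_dir)
for (yr in c("2004", "2008")) {
  write.csv(data.frame(day = 1:10, value = seq(0.5, 5, by = 0.5)),
            file.path(demo_dir, paste0(yr, ".csv")), row.names = FALSE)
}
leapyears <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# Keep only the second column and only the first n_keep data rows of each file.
n_keep <- 4
myfiles_leapyears <- lapply(leapyears, read.csv,
                            colClasses = c("NULL", "numeric"),
                            nrows = n_keep)

# Stack the truncated pieces into one data frame.
new.data.leapyears <- do.call(rbind, myfiles_leapyears)
```

Because the rows are dropped at import time, no second pass over the data is needed.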

Your myfiles_leapyears is a list. When subsetting a list, you need double brackets to access a single element, otherwise you just get a sublist of length 1.
So replace
myfiles_leapyears[i]
with
myfiles_leapyears[[i]]
That will at least take care of the invalid subscript type 'list' errors. I'd second Josh W. that the nrows argument seems smarter than the skip argument.
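The difference between the two kinds of brackets is easy to check on a toy list. This is only a sketch; the two small data frames are made up and stand in for the imported files:

```r
# Hypothetical stand-in for the list returned by lapply(..., read.delim).
myfiles <- list(data.frame(value = 1:3), data.frame(value = 4:6))

as_sublist <- myfiles[1]    # single brackets: still a list, of length 1
as_element <- myfiles[[1]]  # double brackets: the data frame itself

class(as_sublist)
class(as_element)
```

Functions that expect a data frame (or a file name) need the element, so they should be given the double-bracket form.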
Alternatively, if you define myfiles_leapyears using sapply ("s" for simplify) instead of lapply ("l" for list), you'll probably be fine using [i]:
myfiles_leapyears = sapply(leapyears, read.delim,colClasses=c('NULL','numeric'),sep =",")

It is fine. I just turned the data from a list into a dataframe.
df <- as.data.frame(myfiles_leapyears,byrow=T)
leap_df<-head(df,-2928)
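For reference, head() accepts either a positive count (rows to keep) or a negative count (rows to drop from the end), so the cut can be written either way. A small sketch on made-up data, where n_keep plays the role of 1471:

```r
# Hypothetical data: 10 rows, of which the first 4 should survive.
df <- data.frame(value = seq_len(10))
n_keep <- 4

kept_by_count <- head(df, n_keep)                # keep the first n_keep rows
kept_by_drop  <- head(df, -(nrow(df) - n_keep))  # drop everything after them

# The same cut can be applied to every data frame in a list without a loop.
myfiles <- list(df, df)
truncated <- lapply(myfiles, head, n_keep)
```

The positive form avoids having to know the total number of rows in advance.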

Related

How to call a variable using a vector and for loop in R

I'm trying to be efficient and use a loop instead of copy-pasting code. I would also like to limit my hard-coding of ID numbers.
I want to loop through a vector containing ID numbers to open and name various data sets with lapply. Here is what I can do via copy-paste to get what I need:
id2174 <- lapply(trials_2174,
FUN = read.table,
header = FALSE,
sep=",")
id2181 <- lapply(trials_2181,
FUN = read.table,
header = FALSE,
sep=",")
id2182 <- lapply(trials_2182,
FUN = read.table,
header = FALSE,
sep=",")
I have a variable named id_nums containing the IDs: 2174, 2181, 2182, 2183, 2185, etc. How do I iterate through these IDs and apply them to the vector argument in lapply, while maintaining the necessary "trials_" part of the vector name, so that it reads like the copy-pasted code above?
Below is what I've tried, with various changes (single bracket, +, "trials_", etc.) with no success:
EDIT: The trials_2174, trials_2181, etc. variables are file directories. I had used list.files to compile them. Ultimately I need 5 lists of data frames for each ID, not a single list of data frames.
for (i in id_nums) {
assign(paste0("id",i), lapply(trials_[[i]],
FUN = read.table,
header = FALSE,
sep = ","))
}
It seems like such a simple syntax thing, but I haven't been able to find anything that works. Let me know if I can somehow clarify my question! Thanks!
Maybe this serves your purpose:
idnum <- c(2174, 2181, 2182, 2183, 2185)
# To create "trials_2174", "trials_2181", etc:
trials <- paste0("trials_", idnum)
# Look up the vectors trials_2174, trials_2181, etc. by name. Each one
# holds the file paths for one ID.
trial_paths <- mget(trials)
# Apply `read.table` to every file of every ID. The result is a list with
# one element per ID, each of which is a list of data frames.
ids <- lapply(trial_paths, function(paths) lapply(paths, read.table, header = FALSE, sep = ","))
names(ids) <- paste0("id", idnum)
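Since the trials_ variables are vectors of file paths, the names built with paste0 have to be resolved to those vectors before anything can be read. A self-contained sketch of one way to do that with mget; the helper make_files, the temporary files, and their contents are invented purely so the example runs:

```r
# Hypothetical stand-ins: each trials_<id> holds the paths of that ID's files.
demo_dir <- tempfile("trials_demo_")
dir.create(demo_dir)
make_files <- function(id, n) {
  paths <- file.path(demo_dir, sprintf("%d_trial%d.csv", id, seq_len(n)))
  for (p in paths) {
    write.table(matrix(1:6, nrow = 2), p, sep = ",",
                row.names = FALSE, col.names = FALSE)
  }
  paths
}
trials_2174 <- make_files(2174, 2)
trials_2181 <- make_files(2181, 3)
id_nums <- c(2174, 2181)

# Resolve the variable names to the vectors of paths they refer to.
trial_paths <- mget(paste0("trials_", id_nums), envir = environment())

# One list of data frames per ID, collected in a single named list.
ids <- lapply(trial_paths, function(paths) {
  lapply(paths, read.table, header = FALSE, sep = ",")
})
names(ids) <- paste0("id", id_nums)
```

The data for one ID is then available as ids$id2174 or ids[["id2174"]], which avoids creating a separate variable per ID with assign.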

Why is read.csv getting wrong classes?

I have to read a big .csv file and read.csv is taking a while. I read that I should use read.csv to read a few rows, get the column classes, and then read the whole file. I tried to do that:
read.csv(full_path_astro_data,
header=TRUE,
sep=",",
comment.char="",
nrow=100,
stringsAsFactors=FALSE) %>%
sapply(class) -> col.classes
df_astro_data <- read.csv(full_path_astro_data,
header=TRUE,
sep=",",
colClasses=col.classes,
comment.char="",
nrow=47000,
stringsAsFactors=FALSE)
But then I got an error message:
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
scan() expected 'an integer', got '0.0776562500000022'
It looks like a column that contains numeric (double?) data was incorrectly classified as integer. This could be because some numeric columns have many zeros at the beginning. So I tried to increase the number of rows in the first read.csv command, but that did not work. One solution I found was to do
col.classes %>%
sapply(function(x) ifelse(x=="integer", "numeric", x)) -> col.classes
With this the file is read much faster than without specifying column classes. Still, it would be best if all columns were classified correctly.
Any insights?
Thanks
I suspect you are correct that in your row sample some columns contain only integers, but outside your row sample they contain non-integers. This is a common problem with large files. You need to either increase your row sample size or explicitly specify column type for certain columns where you see this happening.
It should be noted that readr's read_csv does this row sampling automatically. From the docs: "all column types will be imputed from the first 1000 rows on the input. This is convenient (and fast), but not robust. If the imputation fails, you'll need to supply the correct types yourself." You can do that like this:
library(readr)
read_csv( YourPathName,
col_types = cols(YourProblemColumn1 = col_double(),
YourProblemColumn2 = col_double())
)
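With base read.csv, the sampling problem and one possible workaround can be reproduced on a tiny file. This is a sketch under stated assumptions: the file, its columns id and x, and the sample size of 3 rows are all made up. It promotes every sampled "integer" class to "numeric" before the full read, which is safe because every integer is also a valid double:

```r
# Hypothetical file: the first rows of x look like integers, a later row does not.
path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = c("a", "b", "c", "d"),
                     x  = c(0, 0, 0, 0.0776)),
          path, row.names = FALSE)

# Guess the classes from a small sample of rows.
sample_rows <- read.csv(path, nrows = 3, stringsAsFactors = FALSE)
sampled_classes <- sapply(sample_rows, class)

# Promote integer guesses to numeric so later non-integer values still parse.
col.classes <- sampled_classes
col.classes[col.classes == "integer"] <- "numeric"

# Full read with the adjusted classes.
df_full <- read.csv(path, colClasses = col.classes, stringsAsFactors = FALSE)
```

The speed benefit of colClasses is kept, and columns that really are integers are simply stored as doubles.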

Errors in finding column mean of .csv file with NA cells in R

I have a folder with several .csv files containing raw data with multiple rows and 39 columns (x obs. of 39 variables), which have been read into R as follows:
# Name path containing .csv files as folder
folder = ("/users/.../");
# Find the number of files in the folder
file_list = list.files(path=folder, pattern="*.csv")
# Read files in the folder
for (i in 1:length(file_list))
{
assign(file_list[i],
read.csv(paste(folder, file_list[i], sep='')))
}
I want to find the mean of a specific column in each of these .csv files and save it in a vector as follows:
for (i in 1:length(file_list))
{
clean = na.omit(file_list[i])
ColumnNameMean[i] = mean(clean["ColumnName"])
}
When I run the above fragment of code, I get the error "argument is not numeric or logical: returning NA". This happens in spite of attempting to remove the NA values using na.omit. Using complete.cases,
clean = file_list[i][complete.cases(file_list[i]), ]
I get the error "incorrect number of dimensions", even though the number of columns hasn't been explicitly stated.
How do I fix this?
Edit: corrected clean[i] to clean (and vice versa). Ran code, same error.
Sample .csv file
There are several things wrong with your code.
folder = ("/users/.../"); You don't need the parentheses, and you definitely do not need the semicolon. A semicolon separates instructions; it does not end them. So this line is in fact two instructions: the assignment of a string to folder and, between the ; and the newline, an empty (NULL) instruction.
You are creating many objects in the global environment in the for loop where you assign the return value of read.csv. It is much better to read in the files into a list of data.frames.
na.omit can remove all rows from the data.frames. And there is no need to use it since mean has a na.rm argument.
You compute the mean values of each column of each data.frame. Though the data.frames are processed in a loop, the columns are not and R has a fast colMeans function.
You mistake [ for [[. The correct ways would be either clean[, "ColumnName"] or clean[["ColumnName"]].
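The last two points can be checked on a toy data frame. A minimal sketch; the data frame and its values are made up, and only the column name ColumnName is taken from the question:

```r
# Hypothetical data with one missing value in the column of interest.
clean <- data.frame(ColumnName = c(1.5, NA, 2.5),
                    other = c("a", "b", "c"))

# Single brackets return a one-column data frame, which mean() cannot handle.
# It warns "argument is not numeric or logical: returning NA".
bad <- suppressWarnings(mean(clean["ColumnName"]))

# Double brackets return the numeric vector; na.rm = TRUE skips the NA.
good <- mean(clean[["ColumnName"]], na.rm = TRUE)
```

Here bad is NA while good is the mean of the two non-missing values, so na.omit is not needed at all.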
Now the code, revised. I present several alternatives to compute the columns' means.
First, read all files in one go. I set the working directory before reading them and reset after.
folder <- "/users/.../"
file_list <- list.files(path = folder, pattern = "^muse.*\\.csv$")
old_dir <- setwd(folder)
df_list <- lapply(file_list, read.csv)
setwd(old_dir)
Now compute the means of three columns.
cols <- c("Delta_TP9", "Delta_AF7", "Theta_TP9")
All_Means <- lapply(df_list, function(DF) colMeans(DF[cols], na.rm = TRUE))
names(All_Means) <- file_list
Compute the means of all columns starting with Delta or Theta. Get those columns names with grep.
df_names <- names(df_list[[1]])
cols2 <- grep("^Delta", df_names, value = TRUE)
cols2 <- c(cols2, grep("^Theta", df_names, value = TRUE))
All_Means_2 <- lapply(df_list, function(DF) colMeans(DF[cols2], na.rm = TRUE))
names(All_Means_2) <- file_list
Finally, compute the means of all numeric columns. Note that this time the index vector cols3 is a logical vector.
cols3 <- sapply(df_list[[1]], is.numeric)
All_Means_3 <- lapply(df_list, function(DF) colMeans(DF[cols3], na.rm = TRUE))
names(All_Means_3) <- file_list
Try it like this:
setwd("U:/Playground/StackO/")
# Find the number of files in the folder
file_list = list.files(path=getwd(), pattern="*.csv")
# Read files in the folder
for (i in 1:length(file_list)){
assign(file_list[i],
read.csv(file_list[i]))
}
ColumnNameMean <- numeric(length(file_list))
for (i in 1:length(file_list)){
clean = get(file_list[i])
ColumnNameMean[i] = mean(clean[,"Delta_TP10"])
}
ColumnNameMean
#> [1] 1.286201
I used get to retrieve the data.frame; otherwise file_list[i] just returns a string. I think this is an idiom used in other languages like Python. I tried to stay true to the way you were doing it, but there are easier ways than indexing like this.
Maybe this:
lapply(list.files(path=getwd(), pattern="*.csv"), function(f){ dt <- read.csv(f); mean(dt[,"Delta_TP10"]) })
PS: Be careful with na.omit(): it removes ALL the rows that contain an NA, which in your case is your whole data.frame, since Elements is only NA.
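The effect of na.omit() on a data frame with an all-NA column is easy to reproduce. A small sketch on made-up values, reusing the column names Delta_TP10 and Elements from above:

```r
# Hypothetical data: Elements is entirely NA, Delta_TP10 has one NA.
dt <- data.frame(Delta_TP10 = c(1, NA, 2),
                 Elements   = c(NA, NA, NA))

# Every row contains at least one NA, so na.omit() leaves nothing behind.
rows_left <- nrow(na.omit(dt))

# Dropping NAs only within the column of interest keeps the usable values.
col_mean <- mean(dt[["Delta_TP10"]], na.rm = TRUE)
```

Here rows_left is 0, while col_mean is still computed from the two valid values.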

Creating a new file with both a subset of data and file names from a group of .csv files

My issue is likely with how I'm exporting the data from the for loop, but I'm not sure how to fix it.
I've got over 200 files in a folder, all structured in the same way, from which I'd like to pull the maximum number from a single column. I've made a for loop to do this based off of code from here http://www.r-bloggers.com/looping-through-files/
What I have running so far looks like this:
fileNames<-Sys.glob("*.csv")
for(i in 1:length(fileNames)){
data<-read.csv(fileNames[i])
VelM = max(data[,8],na.rm=TRUE)
write.table(VelM, "Summary", append=TRUE, sep=",",
row.names=FALSE,col.names=FALSE)
}
This works, but I need to figure out a way to have a second column in my summary file that contains the original file name the data in that row came from for reference.
I tried making both a matrix and a data frame instead of going straight to the table writing, but in both cases I wasn't able to append the data and ended up with values from only the last file.
Any ideas would be greatly appreciated!
Here's what I would recommend to improve your current method, also going with fread() because it's very fast and has the select argument. Notice I have moved the write.table() call outside the for() loop. This allows a cleaner way of adding the new column of file names alongside the max column, and eliminates the need to append to the file on every iteration.
library(data.table)
fileNames <- Sys.glob("*.csv")
VelM <- numeric(length(fileNames))
for(i in seq_along(fileNames)) {
VelM[i] <- max(fread(fileNames[i], select = 8)[[1L]], na.rm = TRUE)
}
write.table(data.frame(VelM, fileNames), "Summary", sep = ",",
row.names = FALSE, col.names = FALSE)
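The same pattern (collect first, write once) also works without data.table. A base R sketch, assuming the value of interest is in column 8 as in the question; the temporary directory, the two files, and their contents are made up so the example is runnable:

```r
# Hypothetical input: two CSV files with 8 columns each.
demo_dir <- tempfile("vel_demo_")
dir.create(demo_dir)
write_demo <- function(name, col8) {
  m <- matrix(0, nrow = length(col8), ncol = 8)
  m[, 8] <- col8
  write.csv(as.data.frame(m), file.path(demo_dir, name), row.names = FALSE)
}
write_demo("a.csv", c(1.5, NA, 7.25))
write_demo("b.csv", c(3, 2.5, NA))

fileNames <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# One maximum per file; vapply guarantees a numeric vector of the right length.
VelM <- vapply(fileNames,
               function(f) max(read.csv(f)[[8]], na.rm = TRUE),
               numeric(1))

# Pair each maximum with the file it came from, then write everything at once.
summary_df <- data.frame(file = basename(fileNames), VelM = unname(VelM))
write.table(summary_df, file.path(demo_dir, "Summary"), sep = ",",
            row.names = FALSE, col.names = FALSE)
```

Building the data frame before writing is what keeps the file names and the maxima aligned row by row.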
If you want to quickly read files, you should consider using data.table::fread or readr::read_csv instead of base read.csv.
For example:
library('data.table')
fileNames <- list.files(path = your_path, pattern='\\.csv$', full.names=TRUE) # instead of Sys.glob
names(fileNames) <- basename(fileNames) # these names become the ids below
dt <- rbindlist(lapply(fileNames, fread, select=8), idcol='id')
res <- dt[, .(max_val = max(your_var, na.rm=TRUE)), by = id]
write.table(res, 'yourfile.csv', sep=',', row.names=FALSE, col.names=FALSE)
Explanation: data.table::fread reads in only the 8th column (select=8) from each file (via lapply over fileNames, which returns a list of data.tables). Then data.table::rbindlist combines this list of data.tables (of one column each) into a single data.table, adding an id column named by idcol. From ?rbindlist, note that
If input is a named list, ids are generated using them
Because lapply keeps the names of its input, naming fileNames with the file names yields a named list, so the id column records which file each row came from and can be used for grouping.
The rest is data.table syntax. It wasn't clear from your question if there is a header row and whether you know the heading in advance. If so, you can either keep header=TRUE and use the header name for your_var, or you can do skip=1, header=FALSE, col.names = 'your_var'.

Skip all leading empty lines in read.csv

I want to import csv files into R, with the first non-empty line supplying the names of the data frame columns. I know that you can supply the skip argument (e.g. skip = 0) to specify how many lines to skip before reading. However, the row number of the first non-empty line can change between files.
How do I work out how many lines are empty, and dynamically skip them for each file?
As pointed out in the comments, I need to clarify what "blank" means. My csv files look like:
,,,
w,x,y,z
a,b,5,c
a,b,5,c
a,b,5,c
a,b,4,c
a,b,4,c
a,b,4,c
which means there are rows of commas at the start.
read.csv automatically skips blank lines (unless you set blank.lines.skip=FALSE). See ?read.csv
After writing the above, the poster explained that blank lines are not actually blank but have commas in them but nothing between the commas. In that case use fread from the data.table package which will handle that. The skip= argument can be set to any character string found in the header:
library(data.table)
DT <- fread("myfile.csv", skip = "w") # assuming w is in the header
DF <- as.data.frame(DT)
The last line can be omitted if a data.table is ok as the returned value.
Depending on your file size, this may not be the best solution, but it will do the job.
The strategy here is, instead of reading the file with a delimiter, to read it as lines, count the characters in each line, and store the counts in temp.
Then a while loop searches for the first line with a non-zero length, after which the file is read from that line onwards and stored as data_filename.
flist = list.files()
for (onefile in flist) {
temp = nchar(readLines(onefile))
i = 1
while (temp[i] == 0) {
i = i + 1
}
temp = read.table(onefile, sep = ",", skip = (i-1))
assign(paste0("data_", onefile), temp)
}
If the file contains headers, you can start i from 2.
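A variation on the same readLines idea can also cover leading lines that consist only of commas. This is a sketch, not a tested general solution: the function name read_skip_blank is made up, it assumes a comma-separated file with a header row, and it reads the file twice (once as text, once as a table):

```r
# Hypothetical helper: skip leading lines that hold only commas or whitespace.
read_skip_blank <- function(path) {
  lines <- readLines(path)
  # A line counts as blank if it contains nothing but commas and whitespace.
  is_blank <- grepl("^[,[:space:]]*$", lines)
  first <- which(!is_blank)[1]
  if (is.na(first)) stop("no non-blank line found in ", path)
  # Skip everything before the first non-blank line, which becomes the header.
  read.csv(path, skip = first - 1, header = TRUE)
}

# Made-up example file with two comma-only lines before the header.
path <- tempfile(fileext = ".csv")
writeLines(c(",,,", ",,,", "w,x,y,z", "a,b,5,c", "a,b,4,c"), path)
df <- read_skip_blank(path)
```

The number of lines to skip is worked out per file, so it does not matter how many blank or comma-only lines each file starts with.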
If the first couple of empty lines are truly empty, then read.csv should automatically skip to the first line. If they have commas but no values, then you can use:
df = read.csv(file = 'd.csv')
df = read.csv(file = 'd.csv',skip = as.numeric(rownames(df[which(df[,1]!=''),])[1]))
It's not efficient if you have large files (since you have to import twice), but it works.
If you want to import a tab-delimited file with the same problem (variable blank lines) then use:
df = read.table(file = 'd.txt',sep='\t')
df = read.table(file = 'd.txt',sep='\t',skip = as.numeric(rownames(df[which(df[,1]!=''),])[1]))
