Import data from Excel but get warning messages - R

I am importing data from Excel; I have multiple Excel files, so I read them all at once.
Here is my code:
library(readxl)
library(data.table)
file.list <- dir(path = "path/", pattern='\\.xlsx', full.names = T)
df.list <- lapply(file.list, read_excel)
data <- rbindlist(df.list)
However, the df.list <- lapply(file.list, read_excel) step produces these warning messages:
Warning messages:
1: In read_xlsx_(path, sheet, col_names = col_names, col_types = col_types, :
[3083, 9]: expecting date: got '2015/07/19'
2: In read_xlsx_(path, sheet, col_names = col_names, col_types = col_types, :
[3084, 9]: expecting date: got '2015/07/20'
What is going on? How can I check and correct it?

As mentioned in my comment, I am submitting this as an answer. Have you looked into your Excel sheet at the respective lines? It seems to me that something is going on there: maybe you have an empty cell before or after those rows, a stray space, or something similar... or the date format in those cells differs from the one used in the other cells.
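A minimal sketch for checking this (assuming readxl is loaded and that the first file in file.list is the one that warned): force every column to text so nothing is coerced, then inspect around the reported position [3083, 9].
library(readxl)
raw <- read_excel(file.list[1], col_types = "text")  # a single col_type is recycled to all columns
raw[3081:3084, 9]  # look around the reported row; readxl's row index may count the header row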

It is not an elegant solution, but set the parameter guess_max to the number of lines in your data file; this eliminates the warnings and their side effects.
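A sketch of that workaround (10000 here is an assumed upper bound on the row count of your files):
library(readxl)
# scan up to 10000 rows per column before guessing its type
df.list <- lapply(file.list, read_excel, guess_max = 10000)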

Related

Error in type.convert when reading data from CSV

I am working on a basketball project. I am struggling to open my data in R:
https://www.basketball-reference.com/leagues/NBA_2019_totals.html
I have imported the data into Excel and then saved it as CSV (for Macintosh).
When I import the data into R I get an error message:
"Error in type.convert.default(data[[i]], as.is = as.is[i], dec = dec, : invalid multibyte string at '<e7>lex<20>Abrines' "
The following seems to work. The readHTMLTable function does give warnings due to the presence of null characters in column Player.
library(XML)
uri <- "https://www.basketball-reference.com/leagues/NBA_2019_totals.html"
# read the first table on the page
data <- readHTMLTable(readLines(uri), which = 1, header = TRUE)
# the table repeats its header row periodically; drop those rows
i <- grep("Player", data$Player, ignore.case = TRUE)
data <- data[-i, ]
# columns 1, 4 and 6 onward hold numbers; convert them from character
cols <- c(1, 4, 6:ncol(data))
data[cols] <- lapply(data[cols], function(x) as.numeric(as.character(x)))
Check if there are NA values. This is needed because the table in the link restarts the headers every now and then and character strings become mixed with numeric entries. The grep above is meant to detect such cases but maybe there are others.
sapply(data, function(x) sum(is.na(x)))
There are none, so everything is all right. Now write the data set out as a CSV file.
write.csv(data, "nba.csv")
Setting fileEncoding to "latin1" can help.
For example, to read a CSV file while skipping the second row:
Test <- read.csv("IMDB.csv", header = TRUE, sep = ",", fileEncoding = "latin1")[-2, ]
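Applied to the basketball question above (the file name is just an example), a sketch would be:
# "CSV (for Macintosh)" exports often carry Latin-1/Mac Roman accented bytes,
# which is what trips type.convert on the '<e7>lex Abrines' entry
nba <- read.csv("nba_2019_totals.csv", fileEncoding = "latin1")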

header=FALSE not working when importing multiple CSVs at once

I am trying to import multiple CSVs from a folder at once, but the CSVs do not have column names. The following code works, but the first row is converted into column names:
dat <- list.files(pattern="*.csv") %>% lapply(read.csv)
When I try to use the code below:
dat <- list.files(pattern="*.csv") %>% lapply(read.csv(header = FALSE))
I get the following error message:
Error in read.table(file = file, header = header, sep = sep, quote = quote, : argument "file" is missing, with no default
Any idea how I can avoid this?
The issue comes from incorrectly specifying the additional parameters to FUN. From ?lapply:
lapply(X, FUN, ...)
... optional arguments to FUN.
You need to make a tiny change to your code to get it to work:
dat <- list.files(pattern="*.csv") %>% lapply(read.csv, header=FALSE)
If you're in the tidyverse you might want
list.files(pattern=".csv") %>%
  purrr::map(readr::read_csv, col_names=FALSE)
(watch out for differences in default behaviour between read.csv and readr::read_csv)
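If you also want the files row-bound into one data frame, a sketch (assuming all files share the same columns; map_dfr requires dplyr to be installed):
library(purrr)
library(readr)
# read every file and row-bind the results in one step
dat <- list.files(pattern = "\\.csv$") %>%
  map_dfr(read_csv, col_names = FALSE)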

Reading a dat file in R

I am trying to read a .dat file whose fields are separated by ";". I only want the lines that start with certain characters, like "B"; the other lines are not of interest. Can anyone guide me?
I have tried read_delim, read.table and read.csv2, but since the lines are not all of equal length, I get errors.
file <- read.table(file = '~/file.DAT', header = FALSE, quote = "\"'", dec = ".",
                   numerals = c("no.loss"), sep = ';')
I am expecting an R data frame out of this file, which I can then write back to a CSV file.
You should be able to do that through readLines:
allLines <- readLines('~/file.DAT')   # read the raw lines
grepB <- function(x) grepl('^B', x)   # TRUE for lines starting with "B"
BLines <- Filter(grepB, allLines)     # keep only those lines
df <- as.data.frame(do.call(rbind, strsplit(BLines, ";")), stringsAsFactors = FALSE)
And if your file contains a header, you can specify
names(df) <- strsplit(allLines[1], ";")[[1]]
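Finally, to write the result back to CSV as the question asks (the output file name is just an example):
write.csv(df, '~/file_B_rows.csv', row.names = FALSE)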

Import tabbed spreadsheet into list in R

My data exists as a tabbed spreadsheet, and I'm trying to write a script to import it.
library(readxl)
oput <- 0
tabnames <- excel_sheets("dataset.xlsx")
for (x in seq_along(tabnames)) {
  assign(tabnames[x], read_excel("dataset.xlsx", sheet = tabnames[x], col_names = TRUE))
}
This works, giving me multiple datasheets in the environment:
tab1
tab2
...
What I would like to do is have these outputs as items in a list:
>oput
$tab1
[1] data1
$tab2
[1] data2
...
But I can't get this working properly
assign(oput[[x]], read_excel("dataset.xlsx", sheet = tabnames[x], col_names = TRUE))
and
assign(oput$x, read_excel("dataset.xlsx", sheet = tabnames[x], col_names = TRUE))
both give:
Error in assign(oput[[x]], read_excel("dataset.xlsx", :
invalid first argument
It's obviously an error on my part in identifying the sheetname variable.
What's the correct way of doing this, please?
Found previously on SO with some slightly different search terms. Apologies for the duplicate post.
How to read all worksheets in an Excel Workbook into an R list with data.frame elements using XLConnect?
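A minimal sketch of that list-based approach using readxl alone (no XLConnect):
library(readxl)
tabnames <- excel_sheets("dataset.xlsx")
# read every sheet into one named list instead of assign()ing into the environment
oput <- lapply(tabnames, function(s) read_excel("dataset.xlsx", sheet = s, col_names = TRUE))
names(oput) <- tabnames
# oput$tab1, oput$tab2, ... now hold the individual sheets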

Merge multiple CSV files and remove duplicates in R

I have almost 3,000 CSV files (containing tweets) with the same format. I want to merge these files into one new file and remove the duplicate tweets. I have come across various topics discussing similar questions; however, the number of files is usually quite small. I hope you can help me write R code that does this job both efficiently and effectively.
The CSV files have the following format:
Image of CSV format (screenshot omitted; the raw text is below). I changed the usernames (on Twitter, columns 2 and 3) to A-E and the 'actual names' to A1-E1.
Raw text file:
"tweet";"author";"local.time"
"1";"2012-06-05 00:01:45 #A (A1): Cruijff z'n met-zwart-shirt-zijn-ze-onzichtbaar logica is even mooi ontkracht in #bureausport.";"A (A1)";"2012-06-05 00:01:45"
"2";"2012-06-05 00:01:41 #B (B1): Welterusten #BureauSport";"B (B1)";"2012-06-05 00:01:41"
"3";"2012-06-05 00:01:38 #C (C1): Echt ..... eindelijk een origineel sportprogramma #bureausport";"C (C1)";"2012-06-05 00:01:38"
"4";"2012-06-05 00:01:38 #D (D1): LOL. \"Na onderzoek op de Fontys Hogeschool durven wij te stellen dat..\" Want Fontys staat zo hoog aangeschreven? #bureausport";"D (D1)";"2012-06-05 00:01:38"
"5";"2012-06-05 00:00:27 #E (E1): Ik kijk Bureau sport op Nederland 3. #bureausport #kijkes";"E (E1)";"2012-06-05 00:00:27"
Somehow my headers are messed up; they obviously should move one column to the right. Each CSV file contains up to 1,500 tweets. I would like to remove duplicates by checking the 2nd column (containing the tweets), simply because tweets should be unique, while the author columns can repeat (e.g. one author posting multiple tweets).
Is it possible to combine merging the files and removing the duplicates, or is this asking for trouble and should the processes be separated? As a starting point I included two links to blog posts by Hayward Godwin that discuss three approaches to merging CSV files.
http://psychwire.wordpress.com/2011/06/03/merge-all-files-in-a-directory-using-r-into-a-single-dataframe/
http://psychwire.wordpress.com/2011/06/05/testing-different-methods-for-merging-a-set-of-files-into-a-dataframe/
Obviously there are some topics related to my question on this site as well (e.g. Merging multiple csv files in R) but I haven't found anything that discusses both merging and removing the duplicates. I really hope you can help me and my limited R knowledge deal with this challenge!
Although I have tried some code I found on the web, it didn't actually produce an output file. The approximately 3,000 CSV files have the format discussed above. I mainly tried the following code (for the merge part):
filenames <- list.files(path = "~/")
do.call("rbind", lapply(filenames, read.csv, header = TRUE))
This results in the following error:
Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") :
cannot open file '..': No such file or directory
Update
I have tried the following code:
# grab our list of filenames
filenames <- list.files(path = ".", pattern='^.*\\.csv$')
# write a special little read.csv function to do exactly what we want
my.read.csv <- function(fnam) {
  read.csv(fnam, header=FALSE, skip=1, sep=';',
           col.names=c('ID','tweet','author','local.time'),
           colClasses=rep('character', 4))
}
# read in all those files into one giant data.frame
my.df <- do.call("rbind", lapply(filenames, my.read.csv))
# remove the duplicate tweets
my.new.df <- my.df[!duplicated(my.df$tweet),]
But I run into the following errors:
After the 3rd line I get:
Error in read.table(file = file, header = header, sep = sep, quote = quote, : more columns than column names
After the 4th line I get:
Error: object 'my.df' not found
I suspect these errors are caused by failures in the writing process of the CSV files, since in some cases the author/local.time values are in the wrong column, either to the left or the right of where they are supposed to be, which results in an extra column. I manually fixed 5 files and tested the code on them; I didn't get any errors, but it seemed like nothing happened at all. I didn't get any output from R.
To solve the extra column problem I adjusted the code slightly:
#grab our list of filenames
filenames <- list.files(path = ".", pattern='^.*\\.csv$')
# write a special little read.csv function to do exactly what we want
my.read.csv <- function(fnam) {
  read.csv(fnam, header=FALSE, skip=1, sep=';',
           col.names=c('ID','tweet','author','local.time','extra'),
           colClasses=rep('character', 5))
}
# read in all those files into one giant data.frame
my.df <- do.call("rbind", lapply(filenames, my.read.csv))
# remove the duplicate tweets
my.new.df <- my.df[!duplicated(my.df$tweet),]
I tried this code on all the files; although R clearly started processing, I eventually got the following errors:
Error in read.table(file = file, header = header, sep = sep, quote = quote, : more columns than column names
In addition: Warning messages:
1: In read.table(file = file, header = header, sep = sep, quote = quote, : incomplete final line found by readTableHeader on 'Twitts - di mei 29 19_22_30 2012 .csv'
2: In read.table(file = file, header = header, sep = sep, quote = quote, : incomplete final line found by readTableHeader on 'Twitts - di mei 29 19_24_31 2012 .csv'
Error: object 'my.df' not found
What did I do wrong?
First, simplify matters by working in the folder where the files are, and set the pattern to read only files ending in '.csv', so something like:
filenames <- list.files(path = ".", pattern='^.*\\.csv$')
my.df <- do.call("rbind", lapply(filenames, read.csv, header = TRUE))
This should get you a data.frame with the contents of all the tweets.
A separate issue is the headers in the csv files. Thankfully you know that all files are identical, so I'd handle those something like this:
read.csv('fred.csv', header=FALSE, skip=1, sep=';',
         col.names=c('ID','tweet','author','local.time'),
         colClasses=rep('character', 4))
NB: changed so that all columns are read as character, and the separator is ';'.
I'd parse out the time later if it was needed...
A further separate issue is the uniqueness of the tweets within the data.frame - but I'm not clear if you want them to be unique to a user or globally unique. For globally unique tweets, something like
my.new.df <- my.df[!duplicated(my.df$tweet),]
For unique by author, I'd append the two fields - hard to know what works without the real data though!
my.new.df <- my.df[!duplicated(paste(my.df$tweet, my.df$author)),]
So bringing it all together and assuming a few things along the way...
# grab our list of filenames
filenames <- list.files(path = ".", pattern='^.*\\.csv$')
# write a special little read.csv function to do exactly what we want
my.read.csv <- function(fnam) {
  read.csv(fnam, header=FALSE, skip=1, sep=';',
           col.names=c('ID','tweet','author','local.time'),
           colClasses=rep('character', 4))
}
# read in all those files into one giant data.frame
my.df <- do.call("rbind", lapply(filenames, my.read.csv))
# remove the duplicate tweets
my.new.df <- my.df[!duplicated(my.df$tweet),]
Based on the revised warnings after line 3, it's a problem with files having different numbers of columns. This is not easy to fix in general, except, as you have suggested, by allowing too many columns in the specification. If you remove the specification, then you will run into problems when you try to rbind() the data.frames together...
Here is some code using a for() loop and some debugging cat() statements to make more explicit which files are broken so that you can fix things:
filenames <- list.files(path = ".", pattern='^.*\\.csv$')
n.files.processed <- 0 # how many files did we process?
for (fnam in filenames) {
  cat('about to read from file:', fnam, '\n')
  if (exists('tmp.df')) rm(tmp.df)
  tmp.df <- read.csv(fnam, header=FALSE, skip=1, sep=';',
                     col.names=c('ID','tweet','author','local.time','extra'),
                     colClasses=rep('character', 5))
  if (exists('tmp.df') && (nrow(tmp.df) > 0)) {
    cat(' successfully read:', nrow(tmp.df), ' rows from ', fnam, '\n')
    n.files.processed <- n.files.processed + 1  # count this file as processed
    # now let's append a column containing the originating file name
    # so that debugging the file contents is easier
    tmp.df$fnam <- fnam
    # now let's rbind everything together
    if (exists('my.df')) {
      my.df <- rbind(my.df, tmp.df)
    } else {
      my.df <- tmp.df
    }
  } else {
    cat(' read NO rows from ', fnam, '\n')
  }
}
cat('processed ', n.files.processed, ' files\n')
my.new.df <- my.df[!duplicated(my.df$tweet),]
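As a further sketch (not from the answers above, so treat it as an assumption), data.table can absorb the ragged files directly: fread(fill = TRUE) pads short rows, and rbindlist(fill = TRUE) tolerates differing column counts across files.
library(data.table)
filenames <- list.files(path = ".", pattern='^.*\\.csv$')
# skip the misaligned header line; columns come back named V1..Vn
dt.list <- lapply(filenames, fread, sep = ";", header = FALSE, skip = 1, fill = TRUE)
my.df <- rbindlist(dt.list, fill = TRUE)  # pads missing trailing columns with NA
my.new.df <- my.df[!duplicated(V2)]       # assuming the tweet text lands in V2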
