Read a tab-delimited text file in SparkR

I have a tab-delimited file saved as a .txt, with quotation marks (" ") around the string variables. The file can be found here.
I am trying to read it into SparkR (version 3.1.2), but cannot successfully bring it into the environment. I've tried variations of the read.df code, like this:
df <- read.df(path = "FILE.txt", header="True", inferSchema="True", delimiter = "\t", encoding="ISO-8859-15")
df <- read.df(path = "FILE.txt", source = "txt", header="True", inferSchema="True", delimiter = "\t", encoding="ISO-8859-15")
I have had success bringing in CSVs with read.csv, but many of the files I have are over 10 GB, and it is not practical to convert them to CSV before bringing them into SparkR.
EDIT: When I run read.df I get a laundry list of errors, starting with this:
I am able to bring in csv files used in a previous project with both read.df and read.csv, so I don't think it's a java issue.

If you don't specifically need to use SparkR, then base R's read.table should work just fine for the .txt you provided. Note that the file is tab-delimited, so the separator needs to be specified.
Something like this should work:
dat <- read.table("FILE.TXT",
                  sep = "\t",
                  header = TRUE)
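If you do need SparkR, the built-in "csv" data source also handles tab-delimited text. A minimal sketch, assuming SparkR 3.x and the file name and encoding from the question:

library(SparkR)
sparkR.session()

# The "csv" source accepts a delimiter option, so it can read
# tab-separated files; option values are passed as strings.
df <- read.df(path = "FILE.txt", source = "csv",
              header = "true", inferSchema = "true",
              delimiter = "\t", encoding = "ISO-8859-15")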

Related

How to create a dataframe from a csv file with text separated by pipes in R? [duplicate]

I have just received a data file whose extension is "*.psv". After doing a bit of research, I still don't know how to open it in R.
We could use read.table to read a *.psv file.
read.table("myfile.psv", sep = "|", header = FALSE, stringsAsFactors = FALSE)
There may be different formats that use the .psv extension, but in data contexts it usually means a "pipe-separated values" file, where the fields are separated by "|".
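For comparison, here is a hedged readr alternative (assuming the readr package is installed; "myfile.psv" mirrors the file name above), which infers column types and handles large files quickly:

library(readr)
# delim = "|" splits on the pipe character; col_names = FALSE matches
# header = FALSE in the read.table call above.
dat <- read_delim("myfile.psv", delim = "|", col_names = FALSE)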

Why can't R read this table while Excel can?

I am trying to read a specific file that I have copied from an SFTP location. The file is pipe-delimited. I can read the file in Excel, but R reads it as null values, and the column names are duplicated. Is this an encoding issue? I am trying to create a bash script to automate this process. Any help? Below is the link for the data.
Here's the file!
I have tried changing the encoding, but without knowing which encoding the file uses I am struggling. I have tried read_delim, read_table, read.table, read_csv and read.csv, but with no luck.
This is the code I have used to read the file:
read_delim("./Engagement_Level.txt", delim = "|")
I would like to read it as a data frame.
The issue is that the file encoding is UTF-16LE, which read_delim cannot read at present.
You could use the base read.delim and file() to specify the encoding:
read.delim(file("Engagement_Level.txt", encoding = "UTF-16LE"), sep = "|")
That will convert all the quoted numbers to numeric. If you'd rather they were type character, to deal with later:
read.delim(file("Engagement_Level.txt", encoding = "UTF-16LE"), sep = "|",
           colClasses = "character")
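If you are unsure of a file's encoding in the first place, here is a small diagnostic sketch (assuming the readr package is installed):

library(readr)
# guess_encoding() samples the raw bytes and ranks likely encodings
# with a confidence score.
guess_encoding("Engagement_Level.txt")
# Expect something like UTF-16LE near the top, which you can then pass
# to file(..., encoding = "UTF-16LE") as shown above.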
I really recommend using Excel to build a CSV file via Data > Text to Columns; it may not be the most elegant approach in this context, but it is nearly infallible and quick.
Then use read.csv(file, sep = ",").

I can't import all the rows of my csv file with read.csv in R

I have a very large dataset in a csv file. When opening it in a data visualization software (Spotfire) I can see that it has more than 7 million rows.
However, I am working in RStudio, and when I try to open the file with read.csv2, being careful with the quotes and other options that could affect my dataset, I end up with about 4 million rows.
Here is my code for importing the file:
my_data <- read.csv2(file,
                     sep = ";",
                     header = TRUE,
                     na.strings = c("", " ", "NA"),
                     quote = "",
                     check.names = FALSE,
                     stringsAsFactors = FALSE)
Moreover, when I take a look at the data in RStudio with View(my_data), the rows that did load look perfectly correct.
Is it related to a file size limit in RStudio, or something like that?
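One way to narrow this down is to compare the number of physical lines in the file with the number of rows actually parsed. A diagnostic sketch, assuming file is the same path used above (embedded quotes or newlines inside fields are common culprits for silently merged rows):

raw_lines <- length(readLines(file))   # physical lines, incl. header
nrow(my_data)                          # rows that actually parsed
# If the counts differ, try a more tolerant reader such as
# data.table::fread, which warns about lines it has trouble parsing:
# my_data <- data.table::fread(file, sep = ";", quote = "")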

R: import CSV file, all data falls into one (the first) column

I'm new, and I have a problem:
I got a dataset (csv file) with 15 columns and 33,000 rows.
When I view the data in Excel it looks good, but when I try to load the data
into RStudio I have a problem:
I used the code:
x <- read.csv(file = "1energy.csv", head = TRUE, sep="")
View(x)
The result is that the column names are good, but the data (row 2 and onwards) all
ends up in my first column.
Within that first column the data is separated with ";". But when I try the code:
x1 <- read.csv(file = "1energy.csv", head = TRUE, sep=";")
The next problem is: Error in read.table(file = file, header = header, sep = sep, quote = quote, :
duplicate 'row.names' are not allowed
So I made the code:
x1 <- read.csv(file = "1energy.csv", head = TRUE, sep=";", row.names = NULL)
And it looked like it worked... But now the data is in the wrong columns (for example, the "name" column now contains the "time" values, and the "time" column contains the "costs" values).
Does anybody know how to fix this? I can rename columns, but I think that is not the best way.
Excel, in its English version at least, may use a comma as separator, so you may want to try
x1 <- read.csv(file = "1energy.csv", head = TRUE, sep=",")
I once had a similar problem, where the header had a long entry containing a character that read.csv mistook for a column separator. In reality, it was part of a long name that wasn't quoted properly.
Try skipping the header and see if the problem persists:
x1 <- read.csv(file = "1energy.csv", skip = 1, head = FALSE, sep=";")
In reply to your comment:
Two things you can do. Simplest one is to assign names manually:
myColNames <- c("col1.name", "col2.name")
names(x1) <- myColNames
The other way is to read just the header row (the first line in your file) and split it into a character vector:
nameLine <- readLines(con = "1energy.csv", n = 1)
fileColNames <- unlist(strsplit(nameLine, ";"))
Then see how you can fix the problem, and assign the names to your x1 data frame. I don't know exactly what is wrong with your first line, so I can't tell you how to fix it.
Yet another, cruder option is to open your csv file in a text editor and edit the column names there.
It happens because of Excel's quirks. The easy solution is to copy all your data (Ctrl+C) into Notepad and save it again from Notepad as filename.csv (don't forget to remove .txt if necessary). It worked well for me: R opened the newly created csv file correctly, with all the data split into the right columns.
Open your file in a text editor and see if it really is separated with commas...
Sometimes .csv files are separated with tabs instead of commas or semicolons; Excel opens them without a problem, but in R you have to specify the separator, like this:
x <- read.csv(file = "1energy.csv", head = TRUE, sep="\t")
I once had the same problem; this was my solution. Hope it works for you.
This problem can arise from the regional settings of the Excel application in which the .csv file was created.
While in most places a "," separates the columns in a COMMA-separated file (which makes sense), in other places it is a ";".
Depending on your regional settings, you can experiment with:
x1 <- read.csv(file = "1energy.csv", head = TRUE, sep=",") #used in North America
or,
x1 <- read.csv(file = "1energy.csv", head = TRUE, sep=";") #used in some parts of Asia and Europe
You could use:
df <- read.csv("filename.csv", sep = ";", quote = "")
It solved one of my problems, which was similar to yours.
So I made the code:
x1 <- read.csv(file = "1energy.csv", head = TRUE, sep=";", row.names = NULL)
And it looked like it worked... But now the data is in the wrong columns (for example, the "name" column now contains the "time" values, and the "time" column contains the "costs" values).
Does anybody know how to fix this? I can rename columns, but I think that is not the best way.
I had the exact same issue. I did quite some research and found out that the CSV was ill-formed.
In the header line of the CSV there were all the labels (separated by the separator) and then a line break.
Starting with line 2, there was an additional separator at the end of each line. So an example of such an ill-formed CSV file looks like this:
Field1;Field2 <-- see the *missing* semicolon at the end
12;23; <-- see the *trailing* semicolon in each of the data lines
34;67;
45;56;
Such ill-formatted files are even harder to spot for TAB-separated files.
Excel does not care about that when importing CSV files, but R does.
When you use skip=1 you skip the header line, which contains part of the mismatch. The data frame will import fine, but there will be a column of NA values at the end of each row, and obviously you will not have column names, as these were skipped.
Easiest solution: edit the CSV file and either add an additional separator at the end of the header line as well, or remove the trailing delimiters in the data lines. You can also use generic read and write functions in R for text files to automate that editing.
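For instance, here is a minimal sketch of automating that fix in R (assuming the file fits in memory and that no field legitimately ends with a semicolon):

# Read the raw lines, strip the trailing separator from every data
# line (all lines except the header), and write a clean copy.
lines <- readLines("1energy.csv")
lines[-1] <- sub(";$", "", lines[-1])
writeLines(lines, "1energy_fixed.csv")
x1 <- read.csv("1energy_fixed.csv", sep = ";")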
You can transform the data by arranging it into cells corresponding to columns:
1. Open your csv file.
2. Copy the content and paste it into a txt file; save it, and copy its content again.
3. Open a new Excel file.
4. In Excel, go to the section responsible for data; it is actually called "Data".
5. On the left side, go to the external data query section (in German, "Externe Daten abfragen").
6. Go through the wizard step by step and separate by commas.
7. Save your file as csv.
I had the same problem and it was frustrating...
However, I found the ultimate solution.
First take the csv file and convert it online to a JSON file and download it... then redo the whole thing backwards (re-convert the JSON to csv) online... download the converted file... give it a name...
then load it in RStudio:
my_file <- read.csv(file = "your_file_name.csv")
... took me 4 days to think out of the box... 🙂🙂🙂

Export data from R to Excel

I am writing code to export a database from R into Excel. I have been trying other people's code, including:
write.table(ALBERTA1, "D:/ALBERTA1.txt", sep="\t")
write.csv(ALBERTA1,":\ALBERTA1.csv")
your_filename_in_R = read.csv("ALBERTA1.csv")
your_filename_in_R = read.csv("ALBERTA1.csv")
write.csv(df, file = "ALBERTA1.csv")
your_filename_in_R = read.csv("ALBERTA1.csv")
write.csv(ALBERTA1, "ALBERTA1.csv")
write.table(ALBERTA1, 'clipboard', sep='\t')
write.table(ALBERTA1,"ALBERTA1.txt")
write.table(as.matrix(ALBERTA2),"ALBERTA2.txt")
write.table(as.matrix(vecm.pred$fcst$Alberta_Females[,1]), "vecm.pred$fcst$Alberta_Females[,1].txt")
write.table(as.matrix(foo),"foo.txt")
write.xlsx(ALBERTA2, "/ALBERTA2.xlsx")
write.table(ALBERTA1, "D:/ALBERTA1.txt", sep="\t").
Other users of this forum advised me this:
write.csv2(ALBERTA1, "ALBERTA1.csv")
write.table(kt, "D:/kt.txt", sep="\t", row.names=FALSE)
You can see in the pictures the outcome I got from the code above. But these numbers can't be used for further operations, such as addition with other matrices.
Has anyone experienced this kind of problem?
Another option is the openxlsx package. It doesn't depend on Java and can read, edit and write Excel files. From the description of the package:
openxlsx simplifies the process of writing and styling Excel xlsx files from R and removes the dependency on Java
Example usage:
library(openxlsx)
# read data from an Excel file or Workbook object into a data.frame
df <- read.xlsx('name-of-your-excel-file.xlsx')
# for writing a data.frame or list of data.frames to an xlsx file
write.xlsx(df, 'name-of-your-excel-file.xlsx')
Besides these two basic functions, the openxlsx-package has a host of other functions for manipulating Excel-files.
For example, with the writeDataTable-function you can create formatted tables in an Excel-file.
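A short sketch of that (the file name, sheet name, and the iris example data are placeholders, not from the original answer):

library(openxlsx)
wb <- createWorkbook()
addWorksheet(wb, "Data")
# writeDataTable() writes the data.frame as a formatted Excel table.
writeDataTable(wb, sheet = "Data", x = iris)
saveWorkbook(wb, "iris-table.xlsx", overwrite = TRUE)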
I recently used the xlsx package; it works well.
library(xlsx)
write.xlsx(x, file, sheetName="Sheet1")
where x is a data.frame
writexl, without Java requirement:
# install.packages("writexl")
library(writexl)
tempfile <- write_xlsx(iris)
The WriteXLS function from the WriteXLS package can write data to Excel.
Alternatively, write.xlsx from the xlsx package will also work.
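A minimal hedged example of the WriteXLS route (note that WriteXLS requires a working Perl installation; the file name is a placeholder):

library(WriteXLS)
# Write a single data.frame to an xlsx file.
WriteXLS(iris, "iris.xlsx")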
One could also use the readODS package. Granted it doesn't produce an .xlsx, but Excel can read Open Document Spreadsheet (ODS) / LibreOffice files too.
require(readODS)
tmp = file.path(tempdir(), 'iris.ods')
write_ods(iris, tmp)
If I might offer an alternative, you could also save your dataframe in a regular csv file, and then use the "get data" function within Excel to import the dataframe. This worked like a charm for me, and you need not bother with any excel packages in R.
Here is a way to write data from a dataframe into an Excel file, split into different files by one ID and into different tabs (sheets) by another ID associated with the first-level ID. Imagine you have a dataframe with email_address as one column for a number of different users, where each email has a number of "sub-ids" holding all the data.
library(dplyr)
library(stringr)
library(xlsx)

data <- tibble(id = c(1,2,3,4,5,6,7,8,9),
               email_address = c(rep('aaa@aaa.com', 3), rep('bbb@bbb.com', 3), rep('ccc@ccc.com', 3)))
So ids 1, 2, 3 would be associated with aaa@aaa.com. The following code splits the data by email and then puts ids 1, 2, 3 into different tabs. The important thing is to set append = TRUE when writing the .xlsx file.
temp_dir <- tempdir()
for(i in unique(data$email_address)){
  data %>%
    filter(email_address == i) %>%
    arrange(id) -> subset_data
  for(j in unique(subset_data$id)){
    write.xlsx(subset_data %>% filter(id == j),
               file = str_c(temp_dir, "/your_filename_", str_extract(i, pattern = "\\b[A-Za-z0-9._%+-]+"), '_', Sys.Date(), '.xlsx'),
               sheetName = as.character(j),
               append = TRUE)
  }
}
The regex gets the name from the email address and puts it into the file-name.
Hope somebody finds this useful. I'm sure there are more elegant ways of doing this, but it works.
By the way, here is a way to then send these individual files to the various email addresses in the data.frame (send.mail is from the mailR package). This code goes inside the second loop [j]:
send.mail(from = "sender@sender.com",
          to = i,
          subject = paste("Your report for", str_extract(i, pattern = "\\b[A-Za-z0-9._%+-]+"), 'on', Sys.Date()),
          body = "Your email body",
          authenticate = TRUE,
          smtp = list(host.name = "XXX", port = XXX,
                      user.name = Sys.getenv("XXX"), passwd = Sys.getenv("XXX")),
          attach.files = str_c(temp_dir, "/your_filename_", str_extract(i, pattern = "\\b[A-Za-z0-9._%+-]+"), '_', Sys.Date(), '.xlsx'))
I have been trying out different packages, including this function:
install.packages("prettyR")
library(prettyR)
delimit.table(Corrvar, "Name the csv.csv") # Corrvar is the name of an output object I had, on scaled variables used to run a regression
However, I tried this same code for the output of another analysis (occupancy model selection output) and it did not work. After many attempts and much exploration, I:
1. Copied the output from R (Ctrl+C).
2. Pasted it into an Excel sheet (Ctrl+V).
3. Selected the first column, where the data is.
4. In the "Data" tab, clicked on "Text to Columns".
5. Selected the Delimited option, clicked Next.
6. Ticked the Space box under "Separators", clicked Next.
7. Clicked Finish.
Your output should now be in a form you can easily manipulate in Excel. So perhaps not the fanciest option, but it does the trick if you just want to explore your data in another way.
PS: If the Excel labels are not exact, it is because I am translating them from my Spanish version of Excel.
