Reading a txt file separated by "¬" in R [duplicate]

I'm trying to read a large file into R which is separated by the "not" sign (¬). What I normally do is change this symbol into semicolons using TextEdit and save the result as a csv file, but this file is too large and my computer keeps crashing when I try. I have tried the following options:
my_data <- read.delim("myfile.txt", header = TRUE, stringsAsFactors = FALSE, quote = "", sep = "\t")
which results in a dataframe with a single row. This makes sense, I know, since my file is not separated by tabs but by the not sign. However, when I try to change sep to ¬ or \¬, I get the following message:
Error in scan(file, what = "", sep = sep, quote = quote, nlines = 1, quiet = TRUE, :
invalid 'sep' value: must be one byte
I have also tried with
my_data <- read.csv2(file.choose("myfile.txt"))
and
my_data <- read.table("myfile.txt", sep="\¬", quote="", comment.char="")
getting similar results. I have searched for options similar to mine, but this kind of separator is not commonly used.

You can try to read in a piped translation of it.
Setup:
writeLines("a¬b¬c\n1¬2¬3\n", "quux.csv")
The work:
read.csv(pipe("tr '¬' ',' < quux.csv"))
# a b c
# 1 1 2 3
If commas don't work for you, this works equally well with other replacement chars:
read.table(pipe("tr '¬' '\t' < quux.csv"), header = TRUE)
# a b c
# 1 1 2 3
The tr utility is available on every Linux, it should be available on macOS, and it is included in Rtools for Windows (as well as in Git Bash, if you have that).
If there is an issue using pipe, you can always use the tr tool to create another file (replacing your text-editor step):
system2("tr", c("¬", ","), stdin="quux.csv", stdout="quux2.csv")
read.csv("quux2.csv")
# a b c
# 1 1 2 3
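For completeness, the reason sep = "¬" fails is that ¬ (U+00AC) occupies two bytes in UTF-8, while the read.table family requires a single-byte separator. If tr is not available at all, here is a pure-R sketch of the same translation step (assuming the file fits in memory and is UTF-8 encoded):
# Substitute the multi-byte "¬" with a one-byte separator in
# memory, then parse the result.
lines <- readLines("quux.csv", encoding = "UTF-8")
read.csv(text = gsub("\u00ac", ",", lines))
#   a b c
# 1 1 2 3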

Related

Issues reading data as csv in R

I have a large data set (~20000x1). Not all the fields are filled; in other words, the data does have missing values. Each feature is a string.
I have run the following code:
Input:
data <- read.csv("data.csv", header=TRUE, quote = "")
datan <- read.table("data.csv", header = TRUE, fill = TRUE)
Output of the second call:
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
line 1 did not have 80 elements
Input:
datar <- read.csv("data.csv", header = TRUE, na.strings = NA)
Output:
Warning message:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
EOF within quoted string
As I see it, I run into essentially four problems. Two of them are the error messages shown above. The third is that, even when no error message appears, the global environment window shows that not all my rows are accounted for (roughly 14000 samples are missing) although the feature count is right. The fourth is that, again, not all the samples are accounted for and the feature count is not correct either.
How can I solve this?
Try the argument comment.char = "" as well as quote = "". A hash (#) in the data is read by R as a comment character and will cut the line short.
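Applied to the read.table call above, a minimal sketch (keeping fill = TRUE from the question, and assuming the file is actually comma-separated, as its .csv extension suggests):
# read.table defaults to comment.char = "#"; read.csv already
# defaults to "", so the argument matters most for read.table.
datan <- read.table("data.csv", header = TRUE, sep = ",",
                    quote = "", comment.char = "", fill = TRUE)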
Can you open the CSV in Notepad++? That will let you see 'invisible' characters and any other non-printable characters; the file may not contain what you think it contains! When you get the sourcing issue resolved, you can choose the CSV file with a selector tool:
filename <- file.choose()
data <- read.csv(filename, skip=1)
name <- basename(filename)
Or, hard-code the path, and read the data into R.
# Read CSV into R
MyData <- read.csv(file="c:/your_path_here/Data.csv", header=TRUE, sep=",")
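If you would rather inspect the file without leaving R, here is a quick sketch for spotting BOMs, embedded nulls, and other invisible bytes (using the same hypothetical path as above):
# Dump the first 64 raw bytes: a UTF-8 BOM appears as ef bb bf,
# and a UTF-16 file shows a null (00) byte between characters.
readBin("c:/your_path_here/Data.csv", what = "raw", n = 64)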

R read.table skip not working. Why?

I have a file similar to
ColA ColB ColC
A 1 0.1
B 2 0.2
But with many more columns.
I want to read the table and set the correct type of data for each column.
I am doing the following:
data <- read.table("file.dat", header = FALSE, na.strings = "",
                   dec = ".", skip = 1,
                   colClasses = c("character", "integer", "numeric"))
But I get the following error:
Error in scan(...): scan() expected 'an integer', got 'ColB'
What am I doing wrong? Why is it trying to parse also the first line according to colClasses, despite skip=1?
Thanks for your help.
Some notes: This file has been generated in a Linux environment and is being worked on in a Windows environment. I am thinking of a problem with newline characters, but I have no idea what to do.
Also, if I read the table without colClasses the table is read correctly (skipping the first line) but all columns are factor type. I can probably change the class later, but still I would like to understand what is happening.
Instead of skipping the first line, set header = TRUE and it should work fine:
data <- read.table("file.dat", header = TRUE, na.strings = "",
                   dec = ".", colClasses = c("character", "integer", "numeric"))

How can I read a .dat file separated by "::" in R

I have a text file with "::" as the separator.
When I read this file like below:
tmp <- fread("file.dat", sep="::")
tmp <- read.table("file.dat", sep="::")
I get the errors 'sep' must be 'auto' or a single character (from fread) and invalid 'sep' value: must be one byte (from read.table).
How can I read this file?
You could try
fread("cat file.dat | tr -s :", sep = ":")
fread() allows a system call in its first argument. This one uses tr -s, the "squeeze" option, which replaces runs of repeated : characters with a single occurrence.
With this call, fread() may even detect the separator automatically, eliminating the need to name sep.
Using the same concept, another way you could go (with an example file "x.txt") is to do
writeLines("a::b::c", "x.txt")
read.table(text = system("cat x.txt | tr -s :", intern = TRUE), sep = ":")
# V1 V2 V3
# 1 a b c
I'm not sure how this translates to Windows-based systems.
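On Windows, where cat and tr may be missing, a pure-R sketch of the same squeeze step (assuming the file fits in memory):
# Collapse the two-character "::" separator to a single ":",
# then parse on the resulting one-byte separator.
lines <- readLines("file.dat")
read.table(text = gsub("::", ":", lines, fixed = TRUE), sep = ":")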

read csv +unicode in R

I have the same problem as explained here; the only difference is that the CSV file contains non-English strings, and I couldn't find any solution for it.
When I read the CSV file without an encoding, I get no error, but the data comes out as:
network=read.csv("graph1.csv",header=TRUE)
اشپیل(60*4)
And if I run read.csv with fileEncoding, I get this error:
network=read.csv("graph1.csv",fileEncoding="UTF-8",header=TRUE)
Warning messages:
1: In read.table(file = file, header = header, sep = sep, quote = quote, :
invalid input found on input connection 'graph1.csv'
2: In read.table(file = file, header = header, sep = sep, quote = quote, :
incomplete final line found by readTableHeader on 'graph1.csv'
network[1]
[1] X.
<0 rows> (or 0-length row.names)
System info:
Windows Server 2008
R 3.1.2
Sample file:
node1,node2,weight
ورق800*750*6,ورق 1350*1230*6mm,0.600000024
ورق900*1200*6,ورق 1350*1230*6mm,0.600000024
ورق76*173,ورق 1350*1230*6mm,0.600000024
ورق76*345,ورق 1350*1230*6mm,0.600000024
ورق800*200*4,ورق 1350*1230*6mm,0.600000024
With your input, I tried this:
> read.csv("graph1.csv", encoding="UTF-8")
X.U.FEFF.node1 node2 weight
1 <U+0648><U+0631><U+0642>800*750*6 <U+0648><U+0631><U+0642> 1350*1230*6mm 0.6
2 <U+0648><U+0631><U+0642>900*1200*6 <U+0648><U+0631><U+0642> 1350*1230*6mm 0.6
3 <U+0648><U+0631><U+0642>76*173 <U+0648><U+0631><U+0642> 1350*1230*6mm 0.6
4 <U+0648><U+0631><U+0642>76*345 <U+0648><U+0631><U+0642> 1350*1230*6mm 0.6
5 <U+0648><U+0631><U+0642>800*200*4 <U+0648><U+0631><U+0642> 1350*1230*6mm 0.6
The following should work – mind you, I can’t test it since I don’t have Windows (and Windows, Unicode and R simply do not mix):
x = read.csv('graph1.csv', fileEncoding = '', stringsAsFactors = FALSE)
At this point, x is gibberish, since it was read as-is, without parsing the byte data into an encoding. We should be able to verify this:
Encoding(x[1, 1])
# [1] "unknown"
Now we tell R to treat it as UTF-8:
x = as.data.frame(lapply(x, iconv, from = 'UTF-8', to = 'UTF-8'),
                  stringsAsFactors = FALSE)
These two steps can be compressed into one by using encoding instead of fileEncoding as the argument to read.csv:
x = read.csv('graph1.csv', encoding = 'UTF-8', stringsAsFactors = FALSE)
In either case, roughly the same process takes place.
At this point, x still appears as gibberish, since your terminal on Windows presumably does not support a Unicode code page which R understands. In fact, when running the code with a non-UTF-8 code page on Mac, I get the following output now:
x[1, 1]
# [1] "<U+0648><U+0631><U+0642>800*750*6"
However, at least the encoding is now correctly set, and the bytes are parsed:
Encoding(x[1, 1])
# [1] "UTF-8"
And if you pass the data to a device or program which speaks UTF-8, it should appear correctly. For instance, using the data as labels in a plot command should work.
plot.new()
text(0.5, seq(0, 1, along.with = x[, 1]), x[, 1])

How can I change the encoding of a text file that is delimited by pipe and quotes so I can read it into R?

I want to read data from a text file into an R dataframe. The data is delimited by pipes | and also has quotes around the values. I've tried some combinations of read.table but it's importing everything into a single field as opposed to splitting it. The data looks like this:
"CompetitorDataID"|"CompetitorID"|"ItemID"|"UserID"|"CountryID"|"SegmentID"|"TaskID"|"Price"|"Comment"|"CreateDate"|"GeneralCustomer"|"TenderResult"
"29"|"5"|"187630"|"1375"|"5"|"398"|"4085"|"5.000000"|"test"|"2013-01-1002:58:23.230000000"|"False"|"1"
"30"|"5"|"1341"|"1294"|"5"|"398"|"4088"|"6.000000"|"test"|"2013-01-1003:15:26.687000000"|"False"|"1"
"31"|"5"|"1007"|"1375"|"5"|"398"|"4105"|"5.000000"|""|"2013-01-1005:50:51.150000000"|"False"|"1"
Although the data imports fine when pasted directly into R, it won't import from the original text file. I get the following warning messages:
Warning messages:
1: In read.table("competitorDataCopy.txt", header = TRUE, sep = "|") :
  line 1 appears to contain embedded nulls
(warnings 2 through 6 repeat the same message for lines 2 to 5 and line 1 again)
7: In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
  embedded nul(s) found in input
You can easily import a pipe-delimited .txt file this way:
file_in <- read.table("C:/example.txt", sep = "|")
That works for any character-separated text file; just change sep to suit.
Setting sep="|" seems to work for me. The default for read.table is quote="\"", so it will automatically strip the quotes from the beginning and end of values.
read.table(text='"CompetitorDataID"|"CompetitorID"|"ItemID"|"UserID"|"CountryID"|"SegmentID"|"TaskID"|"Price"|"Comment"|"CreateDate"|"GeneralCustomer"|"TenderResult"
"29"|"5"|"187630"|"1375"|"5"|"398"|"4085"|"5.000000"|"test"|"2013-01-10 02:58:23.230000000"|"False"|"1"
"30"|"5"|"1341"|"1294"|"5"|"398"|"4088"|"6.000000"|"test"|"2013-01-10 03:15:26.687000000"|"False"|"1"
"31"|"5"|"1007"|"1375"|"5"|"398"|"4105"|"5.000000"|""|"2013-01-10 05:50:51.150000000"|"False"|"1"',
sep="|", header=TRUE)
I solved the issue by opening the file in Notepad and changing the encoding from Unicode to ANSI. I'm not sure why this makes a difference, but it imports cleanly now.
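That behaviour is consistent with the file having been saved as UTF-16, which is what Notepad labels "Unicode": ASCII text in UTF-16 carries a null byte after every character, which produces exactly those embedded-null warnings. If re-saving by hand is not an option, a hedged alternative is to declare the encoding when reading (assuming the file really is little-endian UTF-16):
# Let R decode the UTF-16 bytes instead of re-saving the file.
file_in <- read.table("competitorDataCopy.txt", sep = "|",
                      header = TRUE, fileEncoding = "UTF-16LE")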
