R: how to read a .csv file with different separators

ItemID,Sentiment,SentimentSource,SentimentText
1,0,Sentiment140, ok thats it you win.
2,0,Sentiment140, i think mi bf is cheating on me!!! T_T
3,0,Sentiment140," I'm completely useless rt now. Funny, all I can do is twitter. "
How would you read a csv file like this into R?

Read a CSV with read.csv(). You can set the sep argument to whatever separator the file uses; a comma is the default.
R: Data Input
For example, to read a comma-separated csv file into a data frame, manually choosing the file:
df <- read.csv(file.choose())
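If the file uses a different separator, pass it via sep (a short sketch; the file names here are just placeholders):
# Semicolon-separated file (common where "," is the decimal mark)
df_semi <- read.csv("data_semicolon.csv", sep = ";")   # or read.csv2("data_semicolon.csv")

# Tab-separated file
df_tab <- read.csv("data_tab.tsv", sep = "\t")         # or read.delim("data_tab.tsv")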

Related

how to write a file so that it does not have commas and remains in same format in R

I am reading file in R:
data <- read.delim ("file.fas", header=TRUE, sep="\t" )
However, after I have done some manipulations to the data and written it back out, the output format is not the same. It now contains commas (",") all over:
write.table(x= data, file = "file_1.fas")
How can I avoid this? Maybe I should use some different function to write a file?
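One possible approach (a sketch, assuming the goal is to keep the tab-separated layout the file was read with) is to give write.table the same separator and turn off quoting and row names:
write.table(data, file = "file_1.fas", sep = "\t", quote = FALSE, row.names = FALSE)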

How to convert .rds files to .csv in R without losing special characters?

Good day! I need some help with this. I am trying to do some text mining where I count the word frequencies for every word in a text. I was able to do it fine in R with all the different characters, but the problem arises when I export the result to a .csv file. Basically, I am working on Hungarian text, and when I save my data frame to .csv, three accented letters (ő, ű, ú) get converted to non-accented ones (o, u and u). It doesn't happen when the file is an .rds, but I need a .csv file so one of my consultants (zero knowledge of programming) can look at it in a normal Excel file. I tried some tricks, e.g. making sure Notepad++ is in UTF-8 format and adding fileEncoding = "UTF-8" or encoding = "UTF-8" to the write.csv call, but it doesn't work.
Hope you can help me.
Thank you.
write.csv() works with the three characters you mentioned in the question.
Example
First create a data.frame containing special characters
library(tidyverse)
# Create a data frame containing the special characters
test_df <- "ő, ű, ú" %>%
  as.data.frame()
test_df
# .
# 1 ő, ű, ú
Save it as an RDS
saveRDS(test_df, "test.RDS")
Now read in the RDS, save as csv, and read it back in:
# Read in the RDS
df_with_special_characters <- readRDS("test.RDS")
write.csv(df_with_special_characters, "first.csv", row.names=FALSE)
first <- read.csv("first.csv")
first
# .
# 1 ő, ű, ú
We can see above that the special characters are still there!
Extra note
If you have even rarer special characters, you could try setting the file encoding, like so:
write.csv(df_with_special_characters, "second.csv", fileEncoding = "UTF-8", row.names=FALSE)
second <- read.csv("second.csv")
second
# .
# 1 ő, ű, ú
With the writexl package you could use write_xlsx(...) to write an xlsx file instead. It should handle unicode just fine.
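For example (a minimal sketch; the output file name is made up):
library(writexl)
# Excel opens xlsx files with the accented characters intact, no encoding arguments needed
write_xlsx(df_with_special_characters, "special_characters.xlsx")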

Headers changing when reading data from csv or tsv in R

I'm trying to read a data file into R, but every time I do, R changes the headers. I can't see any way to control this in the documentation for the read function.
I have the same data saved as both a csv and a tsv but get the same problem with both.
The headers in the data file look like this when I open it in excel or in the console:
cod name_mun age_class 1985 1985M 1985F 1986 1986M 1986F
But when I read it into R using either read.csv('my_data.csv') or read.delim('my_data.tsv') R changes the headers to this:
> colnames(my_data)
[1] "ï..cod" "name_mun" "age_class" "X1985" "X1985M" "X1985F" "X1986"
[8] "X1986M" "X1986F"
Why does R do this and how can I prevent it from happening?
You are seeing two different things here.
The "ï.." on the first column comes from having a byte order mark at the beginning of your file. Depending on how you created the file, you may be able to save as just ASCII or even just UTF-8 without a BOM to get rid of that.
R does not like to have variable names that begin with a digit. If you look at the help page ?make.names you will see
A syntactically valid name consists of letters, numbers and the dot or
underline characters and starts with a letter or the dot not followed
by a number. Names such as ".2way" are not valid, and neither are the
reserved words.
You can get around that when you read in your data by using the check.names argument to read.csv, for example like this:
my_data = read.csv(file.choose(), check.names = FALSE)
That will keep the column names as plain numbers. It will also leave the full BOM bytes in the first column name (so you will see something like "ï»¿cod" instead of "ï..cod").
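If you also want to get rid of the BOM, one option (a sketch, assuming the file is UTF-8 with a BOM) is to declare the encoding when reading:
# "UTF-8-BOM" strips the byte order mark; check.names = FALSE keeps the numeric headers
my_data <- read.csv("my_data.csv", fileEncoding = "UTF-8-BOM", check.names = FALSE)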

Loading csv into R with `sep=,` as the first line

The program I am exporting my data from (PowerBI) saves the data as a .csv file, but the first line of the file is sep=, and then the second line of the file has the header (column names).
Sample fake .csv file:
sep=,
Initiative,Actual to Estimate (revised),Hours Logged,Revised Estimate,InitiativeType,Client
FakeInitiative1 ,35 %,320.08,911,Platform,FakeClient1
FakeInitiative2,40 %,161.50,400,Platform,FakeClient2
I'm using this command to read the file:
initData <- read.csv("initData.csv",
                     row.names = NULL,
                     header = T,
                     stringsAsFactors = F)
but I keep getting an error that there are the wrong number of columns (because it thinks the first line tells it the number of columns).
If I use header=F instead, it loads, but when I then do names(initData) <- initData[2,] the names have spaces and illegal characters, which breaks the rest of my program. Obnoxious.
Does anyone know how to tell R to ignore that first line? I can go into the .csv file in a text editor and just delete the first line manually before I load it each time (if I do that, everything works fine) but I have to export a bunch of files and this is a bit stupid and tedious.
Any help would be much appreciated.
There are many ways to do that. Here's one:
all_content = readLines("initData.csv")
skip_first_line = all_content[-1]
initData <- read.csv(textConnection(skip_first_line),
                     row.names = NULL,
                     header = T,
                     stringsAsFactors = F)
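An even simpler variant of the same idea (a sketch) is to let read.csv skip the first line itself:
# skip = 1 drops the leading "sep=," line before parsing begins
initData <- read.csv("initData.csv",
                     skip = 1,
                     row.names = NULL,
                     header = TRUE,
                     stringsAsFactors = FALSE)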
Your file could also be in a UTF-16 encoding. See hrbrmstr's answer on how to read a UTF-16 file.

export and import data when a string variable contains an embedded newline

Suppose I have variable s with this code:
s <- "foo\nbar"
Then convert it to a data.frame:
s2 <- data.frame(s)
Now s2 is a data.frame with one record; next I export it to a csv file with:
write.csv(s2, file = "out.csv", row.names = F)
When I open it with Notepad, the "foo\nbar" has been split across two lines. With SAS import:
proc import datafile = "out.csv" out = out dbms = csv replace;
run;
I got two records: one is '"foo', the other is 'bar"', which is not what I expected.
After struggling for a while, I found that if I export from R with the foreign package like this:
write.dbf(s2, 'out.dbf')
Then import with SAS:
proc import datafile = "out.dbf" out = out dbms = dbf replace;
run;
Everything works nicely and I get one record in SAS; the value appears to be 'foo bar'.
Does this mean CSV is a bad choice when dealing with this kind of data, compared with DBF? Are there any other solutions or explanations for this?
CSV stands for comma-separated values. This means that each line in the file should contain a list of values separated by a comma. SAS imported the file correctly based on the definition of the CSV format (i.e. 2 lines = 2 rows).
The problem you are experiencing is due to the \n character(s) in your string. This sequence of characters happens to represent a newline character, and this is why the R write.csv() call is creating two lines instead of putting it all on one.
I'm not an expert in R so I can't tell you how to either modify the call to write.csv() or mask the \n value in the input string to prevent it from writing out the newline character.
The reason you don't have this problem with .dbf is probably that it doesn't rely on commas or newlines to indicate where new variables or rows start; it must have its own special sequence of bytes to indicate this.
DBF is a database format; such formats are generally easier to work with because they have variable types and lengths embedded in their structure.
With a CSV or any other delimited file you have to have documentation included to know the file structure.
The benefits of CSV are smaller file sizes and compatibility across operating systems and applications. For a while, Excel (2007?) no longer supported DBF, for example.
As Robert says you will need to mask the new line value. For example:
replace_linebreak <- function(x, ...){
  gsub('\n', '|n', x)
}
s3 <- replace_linebreak(s2$s)
This replaces \n with |n, which you would then need to convert back when you import again. Obviously what you choose to mask it with will depend on your data.
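A quick sketch of the round trip (the |n mask token is arbitrary and assumes it never occurs in the real data):
# Mask the newlines before exporting
s2$s <- replace_linebreak(s2$s)
write.csv(s2, file = "out.csv", row.names = FALSE)

# After importing the file again, restore the newlines
restore_linebreak <- function(x, ...){
  gsub('|n', '\n', x, fixed = TRUE)
}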
