Opening csv file correctly - r

I am trying to use this dataset: wine_quality_dataset
I am running the following function:
data2 <- read.table("C:/Users/Magda/Downloads/winewhite.csv")
And here is what I got:
head(data2)
V1
1 fixed acidity;volatile acidity;citric acid;residual sugar;chlorides;free sulfur dioxide;total sulfur dioxide;density;pH;sulphates;alcohol;quality
2 7;0.27;0.36;20.7;0.045;45;170;1.001;3;0.45;8.8;6
3 6.3;0.3;0.34;1.6;0.049;14;132;0.994;3.3;0.49;9.5;6
4 8.1;0.28;0.4;6.9;0.05;30;97;0.9951;3.26;0.44;10.1;6
5 7.2;0.23;0.32;8.5;0.058;47;186;0.9956;3.19;0.4;9.9;6
6 7.2;0.23;0.32;8.5;0.058;47;186;0.9956;3.19;0.4;9.9;6
What command should I use to read the csv file correctly?

Try
readr::read_csv("C:/Users/Magda/Downloads/winewhite.csv")
readr is part of the tidyverse, a collection of libraries that help you tidy up data.
If you are using European format CSV with a semicolon ; separator, use
readr::read_csv2("C:/Users/Magda/Downloads/winewhite.csv")
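If you would rather stay in base R, the same file should also load with read.table() once you point it at the right separator. A minimal sketch, assuming the file looks like the output above (";" separators, "." decimals):
# Base-R sketch: declare the ";" separator and the header row
data2 <- read.table("C:/Users/Magda/Downloads/winewhite.csv",
                    header = TRUE, sep = ";")
head(data2)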

Related

Trying to remove "ZCTA" from rows

I am trying to extract only the zip code values from my imported ACS data file, however, the rows all include "ZCTA" before the 5 digit zip code. Is there a way to remove that so just the 5 digit zip code remains?
Example:
I tried using strtrim on the data, but I can't figure out how to target the last 5 digits. I imagine there is a function or loop that could also do this, since the dataset is so large.
To remove "ZCTA5":
gsub("ZCTA5", "", df$zip) # df - your data.frame name
or
library(stringr)
str_replace(df$zip,"ZCTA5","")
To extract ZIP CODE:
str_sub(df$zip,-5,-1)
Here are a few other options, just for fun:
#option 1
stringr::str_extract(df$zip, "(?<=\\s)\\d+$")
#option 2
gsub("^.*\\s(\\d+)$", "\\1", df$zip)

Readxl and openxlsx add extra characters to numbers from an excel file

I have some numbers in an Excel file that I want to read into R as characters. When I import the file using either readxl or openxlsx, the imported data have two extra characters which are not in the Excel file. The Excel sheet looks like this:
The example file is here
I have tried changing the format within the Excel file but this messes up the numbers. My current work-around is to concatenate the number with ' in a separate column in excel and then read that column into R. This works for some reason.
library(readxl)
boo <- read_excel("./boo.xlsx",
col_types = c("text"))
boo
Reading the Excel file gives the following (note the last two characters in the Example numbers column). The concatNum column shows the concatenated version.
# A tibble: 6 x 2
`Example numbers` concatNum
<chr> <chr>
1 985.12002779568002 '985.12002779568
2 985.12002826159505 '985.120028261595
3 985.12002780627301 '985.120027806273
4 985.12002780627301 '985.120027806273
5 985.12002780724401 '985.120027807244
6 985.12002780291402 '985.120027802914
Any reasons why this would be happening? Does anyone have a better way of fixing it than my current work-around?
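One possible explanation is floating-point precision: Excel displays at most 15 significant digits, but the value stored in the cell is a double, and converting that double straight to text prints the extra digits. If that is the cause, a sketch of a work-around (the column name is taken from the output above) would be to let the column come in as numeric and format it back to 15 significant digits yourself:
library(readxl)
boo <- read_excel("./boo.xlsx")                    # let readxl guess the numeric type
boo$as_text <- formatC(boo$`Example numbers`,      # back to text at 15 significant digits
                       digits = 15, format = "g")
boo$as_text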

Read csv but skip escaped commas in strings

I have a csv file like this:
id,name,value
1,peter,5
2,peter\,paul,3
How can I read this file and tell R that "\," does not indicate a new column, only ","?
I should add that the file is 400 MB.
Thanks
You can use readLines() to read the file into memory and then pre-process it. If you're willing to convert the non-separator commas into something else, you can do something like:
> read.csv(text = gsub("\\\\,", "-", readLines("dat.csv")))
id name value
1 1 peter 5
2 2 peter-paul 3
Another option is to use the fact that the fread function from data.table can take a system command as its first argument. That lets you run something like a sed operation on the file before reading it in (which may or may not be faster):
> data.table::fread("sed -e 's/\\\\\\,/-/g' dat.csv")
id name value
1: 1 peter 5
2: 2 peter-paul 3
You can always then use gsub() to convert the temporary - separator back into a comma.
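For example, with the first approach (this assumes the name field contains no legitimate "-" characters of its own):
dat <- read.csv(text = gsub("\\\\,", "-", readLines("dat.csv")))
dat$name <- gsub("-", ",", dat$name, fixed = TRUE)  # restore the comma
dat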

R tm package DataframeSource import

I am reading a CSV into R and want to make a corpus from it with the tm package, but I am not getting the desired results. Currently, when I read in a CSV of text and then inspect the corpus, the data is all numerical. (I only included the first three columns of data to protect privacy; there are nine, as shown in the inspect results.)
library(tm)
data <- read.csv("filename.csv")
head(data)
Directory.Code First.Name Last.Name
1 SCA0025 Nbcde Cdbaace
2 SCA0025 AJCocei aiceice
3 SCA0025 aceca Ac;eice
4 SCA0025 Acoicm aie;cee
5 SCA0025 acei aciomac
6 SCA0025 caeij CIMCEv
data.corp <- corpus(DataframeSource,data)
inspect(data.corp[1])
A corpus with 1 text document
The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
create_date creator
Available variables in the data frame are:
MetaID
$`1`
16
2195
6655
6613
1
5
9757
1
1
If it helps to know the purpose: I am trying to read in a csv of names and un-normalized job titles/descriptions, then compare to a corpus of known titles/descriptions as categories. Now that I type this in, I realize that this csv will be my test/prediction data, but I still want to build a corpus from a csv with colnames = KnownJobTitle,Description.
The goal of this question is to successfully read a CSV into a corpus, but I would also like to know if it is advisable to use the tm package for more than 2 categorizations, and/or if there are other packages more suited to this task.
I got a similar error. It's because the text fields read from the csv are imported as factors instead of character vectors. You need to first convert them to character using something like:
data <- data.frame(lapply(data, as.character), stringsAsFactors=FALSE)
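Alternatively, just as a sketch, the conversion can be avoided at read time and the corpus built from the text column with tm's Corpus()/VectorSource() (the Description column name is taken from the question; adjust to your real column):
data <- read.csv("filename.csv", stringsAsFactors = FALSE)
data.corp <- Corpus(VectorSource(data$Description))  # tm is already loaded above
inspect(data.corp[1])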

What kind of files are suitable to be read in R [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Read an Excel file directly from a R script
I made an Excel file and named it test.xlsx. I want to read the file in R. It contains:
date price
1 34
2 34.5
3 34
4 34
5 35
6 34.5
7 36
Now, when I used
x = read.csv("test.xlsx")
it didn't work. I also tried
x = read.table("test.xlsx")
I got the warning
Warning message:
In read.table("test.xlsx") :
incomplete final line found by readTableHeader on 'test.xlsx'
and the result:
V1
1 PK\003\004\024
2 PˆTز\005›DQ4ï½ùfىé|[™d\003\001µ³9\033g
So, do I need to make a special file in order to read it in R?
Try using a simple CSV file. You can save one in Excel using the Save As option.
You may want to have a look at the XLConnect package for dealing with Excel files in R: http://cran.r-project.org/web/packages/XLConnect/index.html
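A minimal XLConnect sketch for the file above (the sheet index is an assumption):
library(XLConnect)
x <- readWorksheetFromFile("test.xlsx", sheet = 1)
head(x)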
