What kind of files are suitable to be read in R? [duplicate]

Possible Duplicate:
Read an Excel file directly from a R script
I made an Excel file, I named it test.xlsx. I wanted to read the file in R.
date price
1 34
2 34.5
3 34
4 34
5 35
6 34.5
7 36
Now, when I used
x = read.csv("test.xlsx")
it didn't work. I also tried
x = read.table("test.xlsx")
I got the warning
Warning message:
In read.table("test.xlsx") :
incomplete final line found by readTableHeader on 'test.xlsx'
and the result:
V1
1 PK\003\004\024
2 PˆTز\005›DQ4ï½ùfىé|[™d\003\001µ³9\033g
So, do I need to make a special file in order to read it in R?

Try using a simple CSV file. You can save one from Excel using the Save As option.
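As a minimal sketch of the CSV route, the block below writes the data from the question to a CSV file and reads it back with read.csv. The write.csv call merely stands in for Excel's Save As step, and the temporary file path is an assumption made for illustration.

```r
# Recreate the two columns from the question.
prices <- data.frame(
  date  = 1:7,
  price = c(34, 34.5, 34, 34, 35, 34.5, 36)
)

# Stand-in for "Save As > CSV" in Excel: write a plain-text CSV file.
csv_path <- tempfile(fileext = ".csv")
write.csv(prices, csv_path, row.names = FALSE)

# read.csv expects comma-separated text with a header row by default.
x <- read.csv(csv_path)
str(x)
```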

You may want to have a look at the XLConnect package for dealing with Excel files in R: http://cran.r-project.org/web/packages/XLConnect/index.html
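The PK at the start of the read.table output above is the signature of a ZIP archive, which is the container format an .xlsx file uses; that is why text readers return garbage. Here is a rough sketch of checking for that signature before choosing a reader. The helper name looks_like_zip is hypothetical, and the two temporary files are made up for the demonstration.

```r
# Hypothetical helper: TRUE if the file starts with the ZIP signature "PK\003\004".
looks_like_zip <- function(path) {
  sig <- readBin(path, what = "raw", n = 4L)
  identical(sig, as.raw(c(0x50, 0x4b, 0x03, 0x04)))
}

# Demonstration on two temporary files (illustrative only).
fake_xlsx <- tempfile(fileext = ".xlsx")
writeBin(as.raw(c(0x50, 0x4b, 0x03, 0x04, 0x14, 0x00)), fake_xlsx)

plain_csv <- tempfile(fileext = ".csv")
writeLines(c("date,price", "1,34"), plain_csv)

looks_like_zip(fake_xlsx)
looks_like_zip(plain_csv)
```

For a real workbook, something like readxl::read_excel("test.xlsx") or XLConnect::readWorksheetFromFile("test.xlsx", sheet = 1) should read the sheet directly, assuming the respective package is installed.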

Related

Opening csv file correctly

I am trying to use this dataset: wine_quality_dataset
I am running the following function:
data2 <- read.table("C:/Users/Magda/Downloads/winewhite.csv")
And here is what I got:
head(data2)
V1
1 fixed acidity;volatile acidity;citric acid;residual sugar;chlorides;free sulfur dioxide;total sulfur dioxide;density;pH;sulphates;alcohol;quality
2 7;0.27;0.36;20.7;0.045;45;170;1.001;3;0.45;8.8;6
3 6.3;0.3;0.34;1.6;0.049;14;132;0.994;3.3;0.49;9.5;6
4 8.1;0.28;0.4;6.9;0.05;30;97;0.9951;3.26;0.44;10.1;6
5 7.2;0.23;0.32;8.5;0.058;47;186;0.9956;3.19;0.4;9.9;6
6 7.2;0.23;0.32;8.5;0.058;47;186;0.9956;3.19;0.4;9.9;6
What command should I use to read csv file correctly?
Try
readr::read_csv("C:/Users/Magda/Downloads/winewhite.csv")
readr is part of the tidyverse, a collection of packages that help you tidy data.
If you are using a European-format CSV with a semicolon (;) separator, as this file does, use
readr::read_csv2("C:/Users/Magda/Downloads/winewhite.csv")
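For comparison, here is a base R sketch that reads a semicolon-separated file without extra packages. The sample rows are shortened from the question and written to a temporary file, which is an assumption for illustration. One caveat worth checking: read_csv2 also assumes a comma as the decimal mark, so for a file like this one that uses periods, readr::read_delim(path, delim = ";") or the base call below may be the safer choice.

```r
# A few rows in the same layout as the question's file (semicolon-separated,
# period as decimal mark), written to a temporary file for illustration.
wine_path <- tempfile(fileext = ".csv")
writeLines(c(
  "fixed acidity;volatile acidity;citric acid;quality",
  "7;0.27;0.36;6",
  "6.3;0.3;0.34;6",
  "8.1;0.28;0.4;6"
), wine_path)

# sep = ";" splits on semicolons; dec = "." keeps 0.27 as 0.27.
# check.names = FALSE preserves the spaces in the header names.
wine <- read.table(wine_path, header = TRUE, sep = ";", dec = ".",
                   check.names = FALSE)
str(wine)
```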

Error: object not found in R. Headers not naming from .csv file

I am new to R and I keep getting inconsistent results when trying to display a column of data from a CSV file. I am able to import the CSV into R without issue, but I can't refer to the individual columns.
Here's my code:
setwd('mypath')
cdata <- read.csv(file="cendata.csv",header=TRUE, sep=",")
cdata
This prints out the following:
year pop
1 2010 2,775,332
2 2011 2,814,384
3 2012 2,853,375
4 2013 2,897,640
5 2014 2,936,879
6 2015 2,981,835
7 2016 3,041,868
8 2017 3,101,042
9 2018 3,153,550
10 2019 3,205,958
When I try to plot the following, the columns cannot be found.
plot(pop,year)
Error: object 'pop' not found
I even checked if the column names existed, and only data shows up.
ls()
[1] "data"
I can manually enter the data and label them "pop" and "year" but that kind of defeats the point of importing the csv.
Is there a way to label each header as an object?
year and pop are not independent objects. You need to refer to them as part of the data frame you have imported. You might also need to remove the "," from the numbers to turn them into numerics before plotting. Try:
cdata$pop <- as.numeric(gsub(',', '', cdata$pop))
plot(cdata$year, cdata$pop)
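A self-contained version of the same idea, as a sketch: the rows are taken from the question and supplied inline through the text argument of read.csv, on the assumption that the pop values are quoted in the original file.

```r
# A few rows from the question, supplied inline; the quoted pop values contain
# thousands separators, so read.csv cannot treat them as numbers.
csv_text <- 'year,pop
2010,"2,775,332"
2011,"2,814,384"
2012,"2,853,375"'

cdata <- read.csv(text = csv_text, header = TRUE)
class(cdata$pop)   # not numeric yet

# Strip the commas, then convert.
cdata$pop <- as.numeric(gsub(",", "", cdata$pop, fixed = TRUE))

# Columns live inside the data frame: use cdata$year, or with().
with(cdata, plot(year, pop))
```

with() evaluates the expression inside the data frame, which is close to what the original plot(pop, year) call was trying to do.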

Readxl and openxlsx add extra characters to numbers from an excel file

I have some numbers in an excel file that I want to read into R as characters. When I import the file either using readxl or openxlsx, the imported data have two extra characters, which are not in the excel file. The excel sheet looks like this:
The example file is here
I have tried changing the format within the Excel file but this messes up the numbers. My current work-around is to concatenate the number with ' in a separate column in excel and then read that column into R. This works for some reason.
library(readxl)
boo <- read_excel("./boo.xlsx",
col_types = c("text"))
boo
Reading the Excel file gives the following (note the last two characters in the Example numbers column). The concatNum column shows the concatenated version.
# A tibble: 6 x 2
`Example numbers` concatNum
<chr> <chr>
1 985.12002779568002 '985.12002779568
2 985.12002826159505 '985.120028261595
3 985.12002780627301 '985.120027806273
4 985.12002780627301 '985.120027806273
5 985.12002780724401 '985.120027807244
6 985.12002780291402 '985.120027802914
Any reasons why this would be happening? Does anyone have a better way of fixing it than my current work-around?
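One plausible explanation, offered here as an assumption rather than something confirmed in the thread: Excel stores numbers as double-precision values and displays at most 15 significant digits, while a conversion to text can print up to 17. A sketch of trimming to 15 significant digits in R, using one value from the question:

```r
# One of the values as Excel displays it (15 significant digits).
x <- 985.12002779568

# Printing with 17 significant digits exposes the trailing digits of the
# underlying double; 15 significant digits matches what Excel displays.
full_precision  <- formatC(x, digits = 17, format = "g")
excel_like_text <- formatC(x, digits = 15, format = "g")

full_precision
excel_like_text
```

Applied to the imported data, this would look something like formatC(boo[[1]], digits = 15, format = "g"), assuming the column was read as numeric rather than as text.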

R readr package - written and read in file doesn't match source

I apologize in advance for the partial lack of reproducibility here. I am doing an analysis on a very large (for me) dataset. It is from the CMS Open Payments database.
There are four files I downloaded from that website, read into R using readr, then manipulated a bit to make them smaller (column removal), and then stuck together using rbind. I would like to write my pared-down file out to an external hard drive so I don't have to read in all the data and pare it down each time I want to work on it. (Obviously, it's all scripted, but it takes about 45 minutes to do this, so I'd like to avoid it if possible.)
So I wrote out the data and read it in, but now I am getting different results. Below is about as close as I can get to a good example. The data is named sa_all. There is a column in the table for the source. It can only take on two values: gen or res. It is a column that is actually added as part of the analysis, not one that comes in the data.
table(sa_all$src)
gen res
14837291 822559
So I save the sa_all dataframe into a CSV file.
write.csv(sa_all, 'D:\\Open_Payments\\data\\written_files\\sa_all.csv',
row.names = FALSE)
Then I open it:
sa_all2 <- read_csv('D:\\Open_Payments\\data\\written_files\\sa_all.csv')
table(sa_all2$src)
g gen res
1 14837289 822559
I did receive the following parsing warnings.
Warning: 4 parsing failures.
row col expected actual
5454739 pmt_nature embedded null
7849361 src delimiter or quote 2
7849361 src embedded null
7849361 NA 28 columns 54 columns
Since I manually add the src column and it can only take on two values, I don't see how this could cause any parsing errors.
Has anyone had any similar problems using readr? Thank you.
Just to follow up on the comment:
write_csv(sa_all, 'D:\\Open_Payments\\data\\written_files\\sa_all.csv')
sa_all2a <- read_csv('D:\\Open_Payments\\data\\written_files\\sa_all.csv')
Warning: 83 parsing failures.
row col expected actual
1535657 drug2 embedded null
1535657 NA 28 columns 25 columns
1535748 drug1 embedded null
1535748 year an integer No
1535748 NA 28 columns 27 columns
Even more parsing errors and it looks like some columns are getting shuffled entirely:
table(sa_all2a$src)
100000000278 Allergan Inc. gen GlaxoSmithKline, LLC.
1 1 14837267 1
No res
1 822559
There are columns for manufacturer names and it looks like those are leaking into the src column when I use the write_csv function.
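The warnings mention embedded nulls, so one thing worth checking is whether NUL bytes are present in the file on disk. The following is only a diagnostic sketch on a small temporary file. The helper name count_nul_bytes is hypothetical, and reading a very large file into memory this way may not be practical.

```r
# Hypothetical helper: count NUL (0x00) bytes in a file.
count_nul_bytes <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  sum(bytes == as.raw(0))
}

# Small demonstration file containing one embedded NUL in the second record.
demo_path <- tempfile(fileext = ".csv")
writeBin(c(charToRaw("src,year\ngen,2013\nre"), as.raw(0),
           charToRaw("s,2014\n")), demo_path)

count_nul_bytes(demo_path)

# Base R's readers can drop NULs while reading via skipNul = TRUE.
demo <- read.csv(demo_path, skipNul = TRUE)
demo$src
```

If the count is non-zero for the real file, the nulls were probably already present in the source data, and a binary format such as saveRDS()/readRDS() may be a more robust way to cache the pared-down data frame than a CSV round trip.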

Read.CSV not working as expected in R

I am stumped. Normally, read.csv works as expected, but I have come across an issue where the behavior is unexpected. It most likely is user error on my part, but any help will be appreciated.
Here is the URL for the file
http://nces.ed.gov/ipeds/datacenter/data/SFA0910.zip
Here is my code to get the file, unzip, and read it in:
URL <- "http://nces.ed.gov/ipeds/datacenter/data/SFA0910.zip"
download.file(URL, destfile="temp.zip")
unzip("temp.zip")
tmp <- read.table("sfa0910.csv",
header=T, stringsAsFactors=F, sep=",", row.names=NULL)
Here is my problem. When I open the CSV data in Excel, the data look as expected. When I read the data into R, the first column is actually named row.names. R is reading in one extra row of data, but I can't figure out where the "error" occurs that is causing row.names to be a column. Simply put, it looks like the data shifted over.
However, what is strange is that the last column in R does appear to contain the proper data.
Here are a few rows from the first few columns:
tmp[1:5,1:7]
row.names UNITID XSCUGRAD SCUGRAD XSCUGFFN SCUGFFN XSCUGFFP
1 100654 R 4496 R 1044 R 23
2 100663 R 10646 R 1496 R 14
3 100690 R 380 R 5 R 1
4 100706 R 6119 R 774 R 13
5 100724 R 4638 R 1209 R 26
Any thoughts on what I could be doing wrong?
My tip: use count.fields() as a quick diagnostic when delimited files do not behave as expected.
First, count the number of fields using table():
table(count.fields("sfa0910.csv", sep = ","))
# 451 452
# 1 6852
That tells you that all but one of the lines contain 452 fields. So which is the aberrant line?
which(count.fields("sfa0910.csv", sep = ",") != 452)
# [1] 1
The first line is the problem. On inspection, all lines except the first are terminated by 2 commas.
The question now is: what does that mean? Is there supposed to be an extra field in the header row which was omitted? Or were the 2 commas appended to the other lines in error? It may be best to contact whoever generated the data, if possible, to clarify the ambiguity.
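The same diagnostic can be tried on a small made-up file, as sketched below. The three-line file is an assumption that mimics the structure described above: a header with one fewer field than the data lines. This is also the situation the read.table documentation describes, where the first column is taken as row names when the header is one field short.

```r
# Small file mimicking the problem: the header has 3 fields, while every
# data line ends in an extra comma and so has 4.
path <- tempfile(fileext = ".csv")
writeLines(c(
  "UNITID,XSCUGRAD,SCUGRAD",
  "100654,R,4496,",
  "100663,R,10646,"
), path)

n_fields <- count.fields(path, sep = ",")
table(n_fields)

# Lines whose field count differs from the most common count.
typical <- as.integer(names(which.max(table(n_fields))))
which(n_fields != typical)
```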
I have a possible fix, based on mnel's comments:
dat<-readLines(paste("sfa", '0910', ".csv", sep=""))
ncommas<-sapply(seq_along(dat),function(x){sum(attributes(gregexpr(',',dat[x])[[1]])$match.length)})
> head(ncommas)
[1] 450 451 451 451 451 451
All lines after the first have an extra separator, which Excel ignores.
for(i in seq_along(dat)[-1]){
dat[i]<-gsub('(.*),','\\1',dat[i])
}
write(dat,'temp.csv')
tmp<-read.table('temp.csv',header=T, stringsAsFactors=F, sep=",")
> tmp[1:5,1:7]
UNITID XSCUGRAD SCUGRAD XSCUGFFN SCUGFFN XSCUGFFP SCUGFFP
1 100654 R 4496 R 1044 R 23
2 100663 R 10646 R 1496 R 14
3 100690 R 380 R 5 R 1
4 100706 R 6119 R 774 R 13
5 100724 R 4638 R 1209 R 26
the moral of the story .... listen to Joshua Ulrich ;)
Quick fix: open the file in Excel and save it. This will also delete the extra separators.
Alternatively
dat<-readLines(paste("sfa", '0910', ".csv", sep=""),n=1)
dum.names<-unlist(strsplit(dat,','))
tmp <- read.table(paste("sfa", '0910', ".csv", sep=""),
header=F, stringsAsFactors=F,col.names=c(dum.names,'XXXX'),sep=",",skip=1)
tmp1<-tmp[,-dim(tmp)[2]]
I know you've found an answer, but as your answer helped me figure this out, I'll share:
If you read a file into R that has a different number of columns in different rows, like this:
1,2,3,4,5
1,2,3,4
1,2,3
it is read in with the missing columns filled with NAs, like this:
1,2,3,4,5
1,2,3,4,NA
1,2,3,NA,NA
BUT!
If the row with the most columns is not the first row, like this:
1,2,3,4
1,2,3,4,5
1,2,3
then it is read in a rather confusing way:
1,2,3,4
1,2,3,4
5,NA,NA,NA
1,2,3,NA
(overwhelming before you figure out the problem and quite simple after!)
I just hope this helps someone!
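One way to sidestep the problem, sketched here under the assumption that padding short rows with NA is the desired outcome: find the widest line with count.fields and pass that many column names to read.table, so the column count no longer depends on where the longest row sits.

```r
# The second ragged example from above, written to a temporary file.
ragged_path <- tempfile(fileext = ".csv")
writeLines(c("1,2,3,4", "1,2,3,4,5", "1,2,3"), ragged_path)

# Find the widest line, then give read.table that many column names so the
# column count does not depend on which line happens to be longest.
n_cols <- max(count.fields(ragged_path, sep = ","))
ragged <- read.table(ragged_path, header = FALSE, sep = ",", fill = TRUE,
                     col.names = paste0("V", seq_len(n_cols)))
ragged
```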
If you are using local data, also make sure that it's in the right place. To be sure, put it in your working directory, for instance, and set that directory via
setwd("C:/[User]/[MyFolder]")
directly in your R console.
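A quick way to confirm where R is looking, as a small sketch; the file name below is only a placeholder for whatever file you are trying to read.

```r
# Placeholder file name, used only for illustration.
data_file <- "cendata.csv"

getwd()                      # directory that relative paths are resolved against
file.exists(data_file)       # TRUE only if the file is in that directory

# Building an absolute path avoids depending on the working directory at all.
full_path <- file.path(getwd(), data_file)
full_path
```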
