read.csv importing two columns instead of one - r

I'm trying to import a csv file into a vector. There are 100 entries in this csv file, and this is what the file looks like:
My code reads as follows:
> choice_vector <- read.csv("choices.csv", header = FALSE, fileEncoding="UTF-8-BOM")
> choice_vector
And yet, when I try to display said vector, it shows up as:
It is somehow creating a second column which I cannot figure out why it is doing so. In addition, trying to write to a new csv file actually writes the contents of that second column to that as well.

The second column was "habilitated" in excel.
Option1: Manually delete the column in excel.
Option2: Delete all columns with all NA
choice_vector2 <- choice_vector[,colSums(is.na(choice_vector))<nrow(choice_vector)]
In case of being interested in reading the first column only:
choice_vector <- read.csv("choices.csv", header = FALSE, fileEncoding="UTF-8-BOM")[,1]
Good luck!

Short answer:
You have an issue with your data file, but
choice_vector <- read.csv("choices.csv", header = FALSE, fileEncoding="UTF-8-BOM")$V1
should create the vector that you're expecting.
Long answer:
The read.csv function returns a data frame and you need to address a particular column within the data frame with the $ operator in order to extract that column as a vector. As for why you have an unexpected column of NAs, your CSV probably codes for two columns. When you read a CSV with R, a comma indicates a data field to its right. If you look at your CSV with a text editor, I'm guessing it'll look like this:
A,
B,
D,
A,
A,
F,
The absence of anything (other than another comma or a line break) to the right of a comma is interpreted as NA.

If we are using fread from data.table, there is a select option to select only the columns of interest
library(data.table)
dt <- fread("choices.csv", select = 1)
Other than that, it is not clear about why the issue happens. Could be some strange white space. If that is the case, specify strip.white = TRUE (by default it is FALSE)
read.csv(("choices.csv", header = FALSE,
fileEncoding="UTF-8-BOM", strip.white = TRUE)
Or as we commented, copy the columns of interest into a new file, save it and then read with read.csv

Related

How to import a CSV with a last empty column into R?

I wrote an R script to make some scientometric analyses of Journal Citation Report data (JCR), which I have been using and updating in the past years.
Today, Clarivate has just introduced some changes in its database and now the exported CSV file contains one last empty column, which spoils my script. Because of this last empty column, read.csv automatically assumes that the first column contains the row names.
As before, there is also one first useless row, which is automatically removed in my script with skip = 1.
One simple solution to this "empty column situation" would be to manually remove this last column in Excel, and then proceed with my script as usual.
However, is there a way to add this removal to my script using base R?
The beginning of my script is:
jcreco = read.csv("data/jcr ecology 2020.csv",
na = "n/a", skip = 1, header = T)
The original CSV file downloaded from JCR is available in my Dropbox.
Could you please help me? Thank you!
The real problem is that empty column doesn't have a header. If they had only had the extra comma at the end of the header line this probably wouldn't be as messy. But you can also do a bit of column shuffling with fill=TRUE. For example
dd <- read.table("~/../Downloads/jcr ecology 2020.csv", sep=",",
skip=2, fill=T, header=T, row.names=NULL)
names(dd)[-ncol(dd)] <- names(dd)[-1]
dd <- dd[,-ncol(dd)]
This reads in the data but puts the rows names in the data.frame and fills the last column with NA. Then you shift all the column names over to the left and drop the last column.
Here is a way.
Read the data as text lines;
Discard the first line;
Remove the end comma with sub;
Create a text connection;
And read in the data from the connection.
The variable fl holds the file, on my disk I had to set the directory.
fl <- "jcr_ecology_2020.csv"
txt <- readLines(fl)
txt <- txt[-1]
txt <- sub(",$", "", txt)
con <- textConnection(txt)
df1 <- read.csv(con)
close(con)
head(df1)

How to use fread or read_delim in R on characters with no linebreak

I have several .txt files which need to be imported to R as dataframes for some data analysis. One of these files has no EOL in any form, so I'm left wondering how I would go about to import that.
\"A\";\"B\";\"C\";\"D\";\"D\";\"E\";\"F\";\"G\";\"H\";\"I\";\"J\";\"K\";\"L\";\"M\";\"N\";\"O\";\"P\";\"Q\";\"R\";\"S\";\"T\";\"U\";\"V\"\"1\";4;\"55-555-5555-555\";1234-56-78;\"111\";1510;5;1234-12-17;12345.1234512345;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA\"2\";6;\"22-222-2222-222\";5678-56-78;\"222\";2051;0;NA;0;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA
This is how the first ~500 characters of that .txt file look like. The EOL would need to be placed like this:
\"A\";\"B\";\"C\";\"D\";\"D\";\"E\";\"F\";\"G\";\"H\";\"I\";\"J\";\"K\";\"L\";\"M\";\"N\";\"O\";\"P\";\"Q\";\"R\";\"S\";\"T\";\"U\";\"V\"
\"1\";4;\"55-555-5555-555\";1234-56-78;\"111\";1510;5;1234-12-17;12345.1234512345;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA
\"2\";6;\"22-222-2222-222\";5678-56-78;\"222\";2051;0;NA;0;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA
Normally I would just gsub a "\n" to the places I need it to be, but there is no reoccurring string at the places where I would place a \n, so I don't think that gsub would work in this instance.
Seeing how the missing values are clearly indicated with NA, is there a function similar to read_delim that has a "col_number = x" argument? Like the first x values are the headers, the next x values are the values of the first row and so on and so forth?
If it changes anything, these .txt files are rather big (>300mb).
Big thank you to Julian_Hn. Works like a charm.
I would probably just read this in as a vector and then reformat as matrix with the number of columns you know are in the dataset. This essentially does what you want
str <- "\"A\";\"B\";\"C\";\"D\";\"D\";\"E\";\"F\";\"G\";\"H\";\"I\";\"J\";\"K\";\"L\";\"M\";\"N\";\"O\";\"P\";\"Q\";\"R\";\"S\";\"T\";\"U\";\"V\";\"1\";4;\"55-555-5555-555\";1234-56-78;\"111\";1510;5;1234-12-17;12345.1234512345;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;\"2\";6;\"22-222-2222-222\";5678-56-78;\"222\";2051;0;NA;0;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA"
vec <- strsplit(str,";")[[1]]
//EDIT: add byrow = T To stay in the right format. Thanks Yuriy
table <- matrix(vec,ncol=23,nrow=3, byrow = T)
df <- as.data.frame(table)

write.table unintendedly adds subscript x to header

I have got a comma delimited csv document with predefined headers and a few rows. I just want to exchange the comma delimiter to a pipe delimiter. So my naive approach is:
myData <- read.csv(file="C:/test.CSV", header=TRUE, sep=",", check.names = FALSE)
Viewing myData gives me results without X subscripts in header columns. If I set check.names = TRUE, the column headers have a X subscript.
Now I am trying to write a new csv with pipe-delimiter.
write.table(MyData1, file = "C:/test_pipe.CSV",row.names=FALSE, na="",col.names=TRUE, sep="|")
In the next step I am going to test my results:
mydata.test <- read.csv(file="C:/test_pipe.CSV", header=TRUE, sep="|")
Import seems fine, but unfortunately the X subscript in column headers appear again. Now my question is:
Is there something wrong with the original file or is there an error in my naive approach?
The original csv test.csv was created with Excel, of course without X subscripts in column headers.
Thanks in advance
You have to keep using check.names = FALSE, also the second time.
Else your header will be modified, because apparently it contains variable names that would not be considered valid names of columns of a data.frame. E.g., special characters would be replaced by dots, i.e. . Similarly, numbers would be pre-fixed with X.

Isolate column in R from text file

I want to analyze market data that is being saved in a text file.
The data consists of "Date Time;Price;Size". I want to only look at the Sizes, how can I separate this data in R so that I may do statistical analysis on the sizes?
Example:
20170918 040001;50.42;1
20170918 040002;50.42;1
Just use read.csv with semicolon as a delimeter:
df <- read.csv(file="path/to/your/file.csv", sep=";", header=TRUE)
The sizes can be accessed using df$Sizes.
You can use the select argument of data.table:
library(data.table)
#[[1L]] extracts the column of the temporary table to a vector;
# you could also use $V2, but this _may_ not be perfectly robust
price = fread('/path/to/file'select = 2L)[[1L]]
fread should be able to detect automatically that your file doesn't have headers, as well as that the field separator is ;. If not, set header = FALSE and/or sep = ';'.
Of course, it's not likely that you will only use the vector of prices independently of the rest of the data. So you should really just store the whole data file in a data.table:
market_data = fread('/path/to/file', col.names = c('date_time', 'price', 'size'))
Then you can manipulate market_data as you would any data.table (see Getting Started), e.g.
market_data[ , mean(price)]
market_data[ , sd(price)]
and so on.
df=read.table("your file")
size=df[4]
your sizes data will be in size as a data frame

Using cbind to create a columnar .csv file but 'X' always appears in row one

I have a script that is working perfectly except that in my R cbind operation, adjacent to the numerical value that I require in the first row, is an 'X'.
Here is my script:
library(ncdf)
library(Kendall)
library(forecast)
library(zoo)
setwd("/home/cohara/RainfallData")
files=list.files(pattern="*.nc")
j=81
for (i in seq(1,9))
{
file<-open.ncdf(sprintf("/home/cohara/RainfallData/%s.nc",i))
year<-get.var.ncdf(file,"time")
data<-get.var.ncdf(file,"var61")
fit<-lm(data~year) #least sqaures regression
mean=rollmean(data,4,fill=NA)
kendall<-Kendall(data,year)
write.table(kendall[[2]],file="/home/cohara/RainfallAnalysis/Kendall_p-value_for_10%_increase_over_81_-_89_years.csv",append=TRUE,quote=FALSE,row.names=FALSE,col.names=FALSE)
write.table(kendall[[1]],file="/home/cohara/RainfallAnalysis/Kendall_tau_for_10%_increase_over_81_-_89_years.csv",append=TRUE,quote=FALSE,row.names=FALSE,col.names=FALSE)
png(sprintf("./10 percent increase over %s years.png",j))
par(family="serif",mar=c(4,6,4,1),oma=c(1,1,1,1))
plot(year,data,pch="*",col=4,ylab="Precipitation (mm)",main=(sprintf("10 percent increase over %s years",j)),cex.lab=1.5,cex.main=2,ylim=c(800,1400),abline(fit,col="red",lty=1.5))
par(new=T)
plot(year,mean,type="l",xlab="year",ylab="Precipitation (mm)",cex.lab=1.5,ylim=c(800,1400),lty=1.5)
legend("bottomright",legend=c("Kendall tau = ",kendall[[1]]))
legend("bottomleft",legend=c("Kendall 2-tailed p-value = ",kendall[[2]]))
legend(x="topright",c("4 year moving average","Simple linear trend"),lty=1.5,col=c("black","red"),cex=1.2)
legend("topleft",c("Annual total"),pch="*",col="blue",cex=1.2)
dev.off()
j=j+1
}
tmp<-read.csv("/home/cohara/RainfallAnalysis/Kendall_p-value_for_10%_increase_over_81_to_89_years.csv")
tmp2<-read.csv("/home/cohara/RainfallAnalysis/Kendall_p-value_for_10%_increase_over_81_-_89_years.csv")
tmp<-cbind(tmp,tmp2)
tmp3<-read.csv("/home/cohara/RainfallAnalysis/Kendall_tau_for_10%_increase_over_81_to_89_years.csv")
tmp4<-read.csv("/home/cohara/RainfallAnalysis/Kendall_tau_for_10%_increase_over_81_-_89_years.csv")
tmp3<-cbind(tmp3,tmp4)
write.table(tmp,"/home/cohara/RainfallAnalysis/Kendall_p-value_for_10%_increase_over_81_to_89_years.csv",sep="\t",row.names=FALSE)
write.table(tmp3,"/home/cohara/RainfallAnalysis/Kendall_tau_for_10%_increase_over_81_to_89_years.csv",sep="\t",row.names=FALSE)
The output looks like this, from the .csv files created:
X0.0190228056162596 X0.000701081415172666
0.0395622998 0.00531819
0.0126547674 0.0108218994
0.0077754743 0.0015568719
0.0001407317 0.002680057
0.0096391216 0.012719159
0.0107234037 0.0092436085
0.0503448173 0.0103918528
0.0167525802 0.0025036721
I want to be able to use excel functions on the data, so, for simplicity, I don't want row names (I'll be running this loop maybe a hundred times), but I need column names because otherwise the first set of values is cut off.
Can anyone tell me where the 'X' is coming from and how to get rid of it?
Thanks in advance,
Ciara
Here is what I think is going on. Start by running these small examples:
df1 <- read.csv(text = "0.0190228056162596, 0.000701081415172666
0.0395622998, 0.00531819
0.0126547674, 0.0108218994")
df2 <- read.csv(text = "0.0190228056162596, 0.000701081415172666
0.0395622998, 0.00531819
0.0126547674, 0.0108218994", header = FALSE)
df1
df2
str(df1)
str(df2)
names(df1)
names(df2)
make.names(c(0.0190228056162596, 0.000701081415172666))
Please read ?read.csv and about the header argument. As you will find, header = TRUE is default in read.csv. Thus, if the csv file you read lacks header, read.csv will still 'assume' that the file has a header, and use the values in the first row as a header. Another argument in read.csv is check.names, which defaults to TRUE:
If TRUE then the names of the variables in the data frame are checked to ensure that they are syntactically valid variable names. If necessary they are adjusted (by make.names).
In your case, it seems that the data you read lack a header and that the first row is numbers only. read.csv will default treat this row as a header. make.names takes values in the first row (here numbers 0.0190228056162596, 0.000701081415172666), and spits out the 'syntactically valid variable names' X0.0190228056162596 and X0.000701081415172666. Which is not what you want.
Thus, you need to explicitly set header = FALSE to avoid that read.csvconvert the first row to (valid) variable names.
For next time, please provide a minimal, self contained example. Check these links for general ideas, and how to do it in R: here, here, here, and here

Resources