R can't read random columns of my CSV file
R thinks that the columns "nonrow" (which I added as a tester) and "sample" don't exist in the following CSV file (it returns NULL for them). It sees every other column fine...
nonrow,,Fill Weight,Fill Weight,Fill Weight,Fill Weight,xbar,r,sample
0,,352,348,350,351,350.25,3,1
0,,351,352,351,350,351,2,2
0,,351,346,342,350,347.25,8,3
0,,349,353,352,352,351.5,1.5,4
0,,351,350,351,351,350.75,1,5
0,,353,351,346,346,349,5,6
0,,348,344,350,347,347.25,6,7
0,,350,349,351,346,349,5,8
0,,344,345,346,349,346,4,9
0,,349,350,352,352,350.75,2,10
0,,353,352,354,356,353.75,4,11
0,,348,353,346,351,349.5,7,12
0,,352,350,351,348,350.25,3,13
0,,356,351,349,352,352,3,14
0,,353,348,351,350,350.5,3,15
0,,353,354,350,352,352.25,4,16
0,,351,348,347,348,348.5,1.5,17
0,,353,352,346,352,350.75,6,18
0,,346,348,347,349,347.5,2,19
0,,351,348,347,346,348,2,20
0,,348,352,351,352,350.75,1.25,21
0,,356,351,350,350,351.75,1.75,22
0,,352,348,347,349,349,2,23
0,,348,353,351,352,351,2,24
I have tried moving the column around, renaming it, and trying the tester column (which it also mysteriously can't see). Does anyone have any suggestions? Thank you!
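A minimal diagnostic sketch (the file name fill_weights.csv is a placeholder): print the names R actually assigned after import and compare them to the header row. One common culprit for a "missing" first column is a UTF-8 byte-order mark, which read.csv can be told to strip.

    # a minimal diagnostic sketch -- "fill_weights.csv" is a placeholder name
    df <- read.csv("fill_weights.csv")
    names(df)   # the names R actually assigned; compare them to the header row
    # a UTF-8 byte-order mark can mangle the first column name (e.g. "ï..nonrow");
    # if you see that, reread the file with the BOM stripped:
    df <- read.csv("fill_weights.csv", fileEncoding = "UTF-8-BOM")
    df$nonrow   # should no longer be NULL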
Related
R read csv with comma in column
Update 2020-5-14

Working with a different but similar dataset from here, I found that read_csv seems to work fine. I haven't tried it with the original data yet, though. Although the replies didn't solve the problem (because my question was not correct), Shan's reply fits the original question I posted the most, so I accepted his answer.

Update 2020-5-12

I think my original question is not correct. As mentioned in the comments, the data was quoted. Although changing the separator made row 11582 in R look the same as row 11583 in Excel, that doesn't mean it's "right". Maybe there is some incorrect line break due to inappropriate encoding or something, causing some of the columns to be displaced. If I open the data with Notepad++, the instance at row 11583 in Excel is at row 11596.

Original question

I am trying to read the listings.csv from this dataset on Kaggle into R. I downloaded the file and wrote the code read.csv('listing.csv'). The first column, id, is supposed to be numeric. However, it shows:

    listing$id[1:10]
    [1] 2015  2695  3176  3309  7071  9991  14325 16401 16644 17409
    13129 Levels: Ole Berl穩n!,16736423,Nerea,Mitte,Parkviertel,52.55554132116211,13.340658248460871,Entire home/apt,36,6,3,2018-01-26,0.16,1,279\n17312576,Great 2 floor apartment near Friederich Str MITTE,116829651,Selin,Mitte,Alexanderplatz,52.52349354926847,13.391003496971203,Entire home/apt,170,3,31,2018-10-13,1.63,1,92\n17316675,80簡 m of charm in 3 rooms with office space,116862833,Jon,Neuk繹lln,Schillerpromenade,52.47499080234379,13.427509313575928...

I think it is because there are values with commas in the second column. For example, opening the file with Microsoft Excel, I can see that one of the values in the second column is Ole,Ole...:

How can I read a csv file into R correctly when some values contain commas?
Since you have access to the data in Excel, you can 'Save As' in Excel with a separator other than comma (,). First go into Control Panel -> Region and Language -> Additional settings, where you can change the "List separator". The most common one other than comma is the pipe symbol (|). In R, when you read the file, specify the separator as '|'.
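A minimal sketch of that last step, assuming the re-exported file is named listings_pipe.csv (a placeholder):

    # "listings_pipe.csv" is a placeholder for the re-exported file name;
    # base R: override the comma default with sep = "|"
    listings <- read.csv("listings_pipe.csv", sep = "|", stringsAsFactors = FALSE)
    # or with the readr package:
    # listings <- readr::read_delim("listings_pipe.csv", delim = "|")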
You could try this?

    listings <- read.csv("listings.csv", stringsAsFactors = FALSE)
    listings$name <- gsub(",", "", listings$name)

This will remove the commas in the name column.
If you don't need the information in the second column, then you can always delete it (in Excel) before importing into R. The read.csv function, which calls scan, can also omit unwanted columns using the colClasses argument. However, the fread function from the data.table package does this much more simply with the drop argument:

    library(data.table)
    listings <- fread("listings.csv", drop = 2)

If you do need the information in that column, then other methods are needed (see other solutions).
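For completeness, a minimal sketch of the colClasses route mentioned above, assuming the file has 16 columns (an assumption; check the column count first if unsure):

    # NA lets read.csv guess a column's type; "NULL" drops the column entirely
    classes <- rep(NA, 16)   # assumes 16 columns in the file
    classes[2] <- "NULL"     # skip the second column (name)
    listings <- read.csv("listings.csv", colClasses = classes)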
R: Why am I getting an extra column titled "X.1" in my dataframe after reading my .txt file?
I have got this .txt file output by a microscope to process.

    # read the .txt file generated by the microscope, skipping the first 9 lines of garbage information
    df <- read.csv("Objects_Population - AllCells.txt", sep = "\t", skip = 9, header = TRUE, fill = T)

Then I started looking at the structure of the dataframe. Everything seems fine, except I found an extra column at the end of the data frame named "X.1", and all of its rows are NA values. I don't see this column when I open the .txt file in Excel. I suspect the problem has something to do with the column names generated by the microscope; they contain quite a few special characters.

Below is the dataframe as read by Excel (only showing the last 2 columns, since I have 132 columns and their names are disgustingly long):

    AllCells - Cell Contact Area with Neighbors [%]   AllCells - Nucleus Nearest Neighbor Distance [µm]
    0                                                 4.82083
    21.9512                                           0
    15.7895                                           0
    29.4118                                           0.584611
    0                                                 4.21569
    0                                                 1.99599
    0                                                 3.50767
    ...

This has happened to me before, but I never took it too seriously as I was always interested in a subset of my data frame. Now that I'm looking at all columns, this is starting to bother me. Is there any way I can read the file correctly without R attaching that additional "X.1" column at the end? Preferably not manually deleting or subsetting out the last column... Cheers, ML
If all other column names are correct, you probably have a trailing \t in the text file. R tries to include it and gives it the generic column name X.1. You could try to read the file first as 'plain text', remove the trailing \t, and only then use read.csv:

    file_connection <- file("Objects_Population - AllCells.txt")
    content <- readLines(file_connection)
    close(file_connection)

Now we try to get rid of these trailing \t (this might need some testing to fit your needs):

    sanitized <- gsub("\\t$", "", content)

And then we read this sanitized string as if it were a file (using the argument text):

    df <- read.csv(text = paste0(sanitized, collapse = "\n"), sep = "\t", skip = 9, header = TRUE, fill = T)
Had that problem too. Fixed it by saving the file as "CSV (MS-DOS) (*.csv)" instead of what I originally had, "CSV (Comma delimited) (*.csv)".
This is almost certainly because you've got an extra empty column in your spreadsheet. In Excel, open your sheet and press Ctrl-End. If you end up in an empty cell outside the range of your data, there's the problem. Select the column (Ctrl-Space), right-click, and choose Delete.
I also encountered a similar problem. I found that three extra columns (X, X.1, X.2) were created after I loaded a dataset from an Excel sheet into RStudio. Steps I followed:
a) I went to the Excel sheet and selected the three extra columns after the last column with actual values (put the cursor on top of each column, right-click, and select Delete).
b) I loaded that Excel sheet into R again, and the 3 columns were gone.
readcsv fails to read # character in Julia
I've been using asd = readcsv(filename) to read a csv file in Julia. The first row of the csv file contains strings which describe the column contents; the rest of the data is a mix of integers and floats. readcsv reads the numbers just fine, but only reads the first 4½ string entries. After that, it renders "". If I ask the REPL to display asd[1,:], it tells me it is a 1x65 Array{Any,2}. The fifth column in the first row of the csv file (this seems to be the entry it chokes on) is APP #1 bias voltage [V], but asd[1,5] is just "APP ". So it looks to me as though readcsv has choked on the "#" character. I tried using the quotes=false keyword in readcsv, but it didn't help. I used to use xlsread in Matlab and it worked fine. Has anybody out there seen this sort of thing before?
The comment character in Julia is #, and this applies when reading delimited text files. But luckily, the readcsv() and readdlm() functions have an optional argument to help in these situations. You should try:

    readcsv(filename; comment_char = '/')

Of course, the example above assumes that you don't have any / characters in your first line. If you do, then you'll have to change that '/' to something else.
What would cause Microsoft Jet OLEDB SELECT to miss a whole column?
I'm importing an .xls file using the following connection string:

    If _
        SetDBConnect( _
            "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" & filepath & _
            ";Extended Properties=""Excel 8.0;HDR=Yes;IMEX=1""", True) Then

This has been working well for parsing through several Excel files that I've come across. However, with this particular file, when I SELECT * into a DataTable, a whole column of data, Item Description, is missing from the DataTable. Why?

Here are some things that may set this particular workbook apart from the others that I've been working with:
- The workbook has a freeze pane consisting of the first 24 rows (however, all of these rows appear in the DataTable)
- There is some weird cell highlighting going on throughout the workbook

That's pretty much it. I can't see anything that would make the Item Description column not import correctly. Its data is comprised of strings with no special characters apart from &, and each entry in this column is at most 20 characters. What is happening? Is there any other way I can get all of the data? Keep in mind I have to use the original file and cannot alter it, as I want this to ultimately be an automated process. Thanks!
Some initial thoughts/questions: Is the missing column the very first column? What happens if you remove the space within "Item Description"? Stupid question, but does that column have a column header?

-- EDIT 1 --

If you delete that column, does the problem move to another column (the new index 4), or is the file complete? My reason for asking: is the problem specific to the data in that column/header, or is the problem more general, at index 4?

-- EDIT 2 --

Ok, so since we know it's that column, we know it's either the header or the rows. Let's concentrate on rows for now. Start with that ampersand; dump it, and see what happens. Next, work with the first 50% of rows. Does deleting that subset affect anything? What about the latter 50% of rows? If one of those subsets changes the result, you ought to be able to narrow it down to an individual row (hopefully not plural) by halving your selection each time. My guess is that you're going to find a unicode character or something else funky in one of the cells. Maybe there's a formula or, as you mentioned, some of that "weird cell highlighting."
It's been years since I worked with Excel access, but I recall some problems with Excel grouping content into areas that act as tables inside each sheet. Try copying/pasting the content from the problematic sheet into a new workbook and connecting to that workbook. If this works, you may be able to investigate a bit further about those areas.
R: Extract value and lines after key word (text file mining)
Setting: I have (simple) .csv and .dat files created by laboratory devices and other programs storing information on measurements or calculations. I have found this for other languages but not for R.

Problem: Using R, I am trying to extract values to quickly display results without opening the created files. I have two typical settings:
a) I need to read a priori unknown values after known key words.
b) I need to read lines after known key words or lines. I can't make functions such as scan() and grep() work.
c) Finally, I would like to loop over dozens of files in a folder and produce a summary (to make the picture complete: I will manage this part).

I would appreciate any form of help.
OK, it works for the key value (although perhaps not very nicely):

    ks <- scan("file.csv", what = character(), sep = "")

returns a character vector of everything.

    ks[grep("keyword", ks) + 2]   # + 2 because the actual value is stored two places ahead

returns the sought values as characters. For completion: the data had to be converted to numbers, and the "," vs "." decimal problem needed to be solved. In a line:

    data <- as.numeric(gsub(",", ".", ks[grep("Ks_Boden", ks) + 2]))

(gsub is vectorized, so no lapply is needed.) Perseverance is not too bad of an asset ;-) The rest isn't finished yet; I will post once it is.
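Parts (b) and (c) can be handled the same way. A minimal sketch, assuming the values of interest sit on the lines right after the keyword; the folder name "data_folder" and the keyword "Ks_Boden" are placeholders for your own:

    # grab the n lines following the first occurrence of a keyword in one file
    extract_after <- function(path, keyword, n = 1) {
      lines <- readLines(path)
      hit <- grep(keyword, lines)[1]   # index of the first line containing the keyword
      lines[(hit + 1):(hit + n)]       # the n lines that follow it
    }

    # part (c): loop over every .csv/.dat file in a folder and collect the results
    # ("data_folder" and "Ks_Boden" are placeholders)
    files <- list.files("data_folder", pattern = "\\.(csv|dat)$", full.names = TRUE)
    results <- setNames(lapply(files, extract_after, keyword = "Ks_Boden"), basename(files))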