I have a dataframe that is sourced from an excel CSV file. because of how I build the source file, some of the columns will come in with blanks (i.e. I didnt put anything in that cell in excel). I need to set those cells to 0 before I process the data.
I read the data in using fread.
Some of the cells have come in with N/A in the value which is easy enough to replace. However some show "blank" in View(theDataFrame) and are not caught by replace_na. I cant figure out what is in these cells so i can target them and replace with 0.
I tried paste("*", theDataFrame[1, "Current Price"], "*") which printed an asterisk, two white spaces and another asterisk. so, I tried gsub("/s/s", "0", theDataFrame$`Current Price`) but that did not work.
is.empty(theDataFrame[1, "Current Price"]) is FALSE so, it's not empty. How can i figure out what the heck fread put in there so I can replace it?
Thanks!
We could use
theDataFrame[theDataFrame == ''] <- 0
Related
Update 2020-5-14
Working with a different but similar dataset from here, I found read_csv seems to work fine. I haven't tried it with the original data yet though.
Although the replies didn't help solve the problem because my question was not correct, Shan's reply fits the original question I posted the most, so I accepted his answer.
Update 2020-5-12
I think my original question is not correct. Like mentioned in the comment, the data was quoted. Although changing the separator made the 11582 row in R look the same as the 11583 row in excel, it doesn't mean it's "right". Maybe there is some incorrect line switch due to inappropriate encoding or something, and thus causing some of the columns to be displaced. If I open the data with notepad++, the instance at row 11583 in excel is at the 11596 row.
Original question
I am trying to read the listings.csv from this dataset in kaggle into R. I downloaded the file and wrote the coderead.csv('listing.csv'). The first column, the column id, is supposed to be numeric. However, it shows:
listing$id[1:10]
[1] 2015 2695 3176 3309 7071 9991 14325 16401 16644 17409
13129 Levels: Ole Berl穩n!,16736423,Nerea,Mitte,Parkviertel,52.55554132116211,13.340658248460871,Entire home/apt,36,6,3,2018-01-26,0.16,1,279\n17312576,Great 2 floor apartment near Friederich Str MITTE,116829651,Selin,Mitte,Alexanderplatz,52.52349354926847,13.391003496971203,Entire home/apt,170,3,31,2018-10-13,1.63,1,92\n17316675,80簡 m of charm in 3 rooms with office space,116862833,Jon,Neuk繹lln,Schillerpromenade,52.47499080234379,13.427509313575928...
I think it is because there are values with commas in the second column. For example, opening the file with MiCrosoft excel, I can see one of the value in the second column is Ole,Ole...:
How can I read a csv file into R correctly when some values contain commas?
Since you have access to the data in Excel, you can 'Save As' in Excel with a seperator other than comma (,). First go in to Control Panel –> Region and Language -> Additional settings, you can change the "List Seperator". Most common one other than comma is pipe symbol (|). In R, when you read_csv, specify the seperator as '|'.
You could try this?
lsitings <- read.csv("listings.csv", stringsAsFactors = FALSE)
listings$name <- gsub(",","", listings$name) - This will remove the comma in Col name
If you don't need the information in the second column, then you can always delete it (in Excel) before importing into R. The read.csv function, which calls scan, can also omit unwanted columns using the colClasses argument. However, the fread function from the data.table package does this much more simply with the drop argument:
library(data.table)
listings <- fread("listings.csv", drop=2)
If you do need the information in that column, then other methods are needed (see other solutions).
I have got this .txt file outputed by a microscope to process.
#read the .txt file generated by microscope, skipping the first 9 lines of garbage information
df <- read.csv("Objects_Population - AllCells.txt", sep="\t", skip = 9,header=TRUE, fill = T)
Then I started looking at the structure of the dataframe, everything seems fine except I now found an extra column in the end of the data frame named "x.1" and all rows of it are NA values. I don't see this column when I open the .txt file in excel. I suspect the problem has something to do with the column names generated by microscope, they contain quite some special characters
Below is the dataframe read by Excel(only showing the last 2 columns since I have 132 columns, and their names are disgustingly long):
AllCells - Cell Contact Area with Neighbors [%] AllCells - Nucleus Nearest Neighbor Distance [µm]
0 4.82083
21.9512 0
15.7895 0
29.4118 0.584611
0 4.21569
0 1.99599
0 3.50767
...
This has happened to me before but I never took it too serious as I was always interested in a subset of my data frame. Now I'm looking at all columns then this starts to bothering me.
Is there any way I can read them correctly without R attaching that additional "X.1" column in the end? Preferably not manually delete or subset out the last column...
Cheers,
ML
If all other column names are correct, you have probably a trailing \t in the text file. R tries to include it and gives it the generic column name X.1.
You could try and read the file first as 'plain text' and remove the trailing \t and only then use read.csv:
file_connection <- file("Objects_Population - AllCells.txt")
content <- readLines(file_connection )
close(file_connection)
Now we try to get rid of these trailing \t (this might need some testing to fit your needs)
sanitized <- gsub("\\t$", "", content)
And then we read this sanitized string as if it was a file (using the argument text)
df <- read.csv(text=paste0(sanitized, collapse="\n"), sep="\t", skip = 9,header=TRUE, fill = T)
Had that problem too. Fixed it by saving the file as "CSV (MS-DOS (*csv)" instead of what I originally had as "CSV (Comma delimited)(*csv)".
This is almost certainly because you've got an extra empty column in your spreadsheet.
In Excel, open your sheet and press Ctrl-End. If you end up in an empty cell outside the range of your data, there's the problem. Select the column (Ctrl-Space), right-click, and choose Delete.
I also encountered similar problem. I found that three extra columns were created (X, X.1, X.2), after I loaded dataset from excel sheet to R studio.
Steps Followed by me:
a) I went to the excel sheet and selected those three extra columns after last column with actual values in excel sheet. Selected extra 3 columns by keeping cursor on top of columns and then right click the mouse and select delete.
b) Again loaded that excel sheet in R. I did not find those 3 columns.
I'm stumped. My issue is that I want to grab specific names from a given column. However, when I try and filter them I get most of the names except for a few, even though I can clearly see their names in the original excel file. I think it has to do what some sort of special characters or spacing in the name column. I am confused on how I can fix this.
I have tried using excels clean() function to apply that to the given column. I have tried working an Alteryx flow to clean the data. All of these steps haven't helped any. I am starting to wonder if this is an r issue.
surveyData %>% filter(`Completed By` == "Spencer,(redbox with whitedot in middle)Amy")
surveyData %>% filter(`Completed By` == "Spencer, Amy")
in r the first line had this redbox with white dot in between the comma and the first name. I got this red box with white dot by copy the name from the data frame and copying it into notepad and then pasting it in r. This actually works and returns what I want. Now the second case is a standard space which doesn't return what I want. So how can I fix this issue by not having to copy a name from the data frame and copy to notepad then copying the results from notepad to r, which has the redbox with a white dot in between the comma(,) and first name.
Expected results is that I get the rows that are attached to what ever name I filter by.
I was able to find the answer, it turns out the space is actually a break space with unicode of (U+00A0) compared to the normal space unicode (U+0020). The break space is not apart of the American Standard Code for Information Interchange(ACSII). Thus r filter() couldn't grab some names because they had break spaces. I fixed this by subbing the Unicode of the break space with the Unicode for a normal space and applying that to my given column. Example below:
space_fix = gsub("\u00A0", " ", surveyData$`Completed By`, fixed = TRUE) #subbing break space unicode with space unicode for the given column I am interested in
surveyData$`Completed By Clean` = space_fix
Once, I applied this I could easily filter any name!
Thanks everyone!
I have a input file (| separated ) with the following content.
name|val1|val2
foo|30|20
bar||20
qux|30|
ert|12|11
uyt||
What I want to do is to extract lines only when every fields
contain value. Yielding this:
foo|30|20
ert|12|11
Why the following doesn't work:
data<-read.delim("myinput.txt",sep="|",na.strings="",
header=TRUE,blank.lines.skip=TRUE,fill=FALSE)
If the file isn't too big to be read into memory at first, you can easily fix this with:
data <- data[complete.cases(data), ]
Also, you don't have any lines that are blank, you just have lines where some cells have missing values, which isn't quite the same thing, and it's not really covered by the blank.lines.skip argument.
I am trying to load a csv file that has 14 columns like this:
StartDate, var1, var2, var3, ..., var14
when I issue this command:
systems <- read.table("http://getfile.pl?test.csv", header = TRUE, sep = ",")
I get an error message.
duplicate row.names are not allowed
It seems to me that the first column name is causing the issue. When I manually download the file and remove the StartDate name from the file, R successfully reads the file and replaces the first column name with X. Can someone tell me what is going on? The file is a (comma separated) csv file.
Then tell read.table not to use row.names:
systems <- read.table("http://getfile.pl?test.csv",
header=TRUE, sep=",", row.names=NULL)
and now your rows will simply be numbered.
Also look at read.csv which is a wrapper for read.table which already sets the sep=',' and header=TRUE arguments so that your call simplifies to
systems <- read.csv("http://getfile.pl?test.csv", row.names=NULL)
This related question points out a part of the ?read.table documentation that explains your problem:
If there is a header and the first row contains one fewer field
than the number of columns, the first column in the input is used
for the row names. Otherwise if row.names is missing, the rows are numbered.
Your header row likely has 1 fewer column than the rest of the file and so read.table assumes that the first column is the row.names (which must all be unique), not a column (which can contain duplicated values). You can fix this by using one of the following two Solutions:
adding a delimiter (ie \t or ,) to the front or end of your header row in the source file, or,
removing any trailing delimiters in your data
The choice will depend on the structure of your data.
Example:
Here the header row is interpreted as having one fewer column than the data because the delimiters don't match:
v1,v2,v3 # 3 items!!
a1,a2,a3, # 4 items
b1,b2,b3, # 4 items
This is how it is interpreted by default:
v1,v2,v3 # 3 items!!
a1,a2,a3, # 4 items
b1,b2,b3, # 4 items
The first column (with no header) values are interpreted as row.names: a1 and b1. If this column contains duplicates, which is entirely possible, then you get the duplicate 'row.names' are not allowed error.
If you set row.names = FALSE, the shift doesn't happen, but you still have a mismatching number of items in the header and in the data because the delimiters don't match.
Solution 1
Add trailing delimiter to header:
v1,v2,v3, # 4 items!!
a1,a2,a3, # 4 items
b1,b2,b3, # 4 items
Solution 2
Remove excess trailing delimiter from non-header rows:
v1,v2,v3 # 3 items
a1,a2,a3 # 3 items!!
b1,b2,b3 # 3 items!!
In my case was a comma at the end of every line. By removing that worked
I had this error when opening a CSV file and one of the fields had commas embedded in it. The field had quotes around it, and I had cut and paste the read.table with quote="" in it. Once I took quote="" out, the default behavior of read.table took over and killed the problem. So I went from this:
systems <- read.table("http://getfile.pl?test.csv", header=TRUE, sep=",", quote="")
to this:
systems <- read.table("http://getfile.pl?test.csv", header=TRUE, sep=",")
I used read_csv from the readr package
In my experience, the parameter row.names=NULL in the read.csv function will lead to a wrong reading of the
file if a column name is missing, i.e. every column will be shifted.
read_csv solves this.
Another possible reason for this error is that you have entire rows duplicated. If that is the case, the problem is solved by removing the duplicate rows.
The answer here (https://stackoverflow.com/a/22408965/2236315) by #adrianoesch should help (e.g., solves "If you know of a solution that does not require the awkward workaround mentioned in your comment (shift the column names, copy the data), that would be great." and "...requiring that the data be copied" proposed by #Frank).
Note that if you open in some text editor, you should see that the number of header fields less than number of columns below the header row. In my case, the data set had a "," missing at the end of the last header field.
It seems the problem can arise from more than one reasons. Following two steps worked when I was having same error.
I saved my file as MS-DOS csv. ( Earlier it was saved in as just csv , excel starter 2010 ).
Opened the csv in notepad++. No coma was inconsistent (consistency as described above #Brian).
Noticed I was not using argument sep="," . I used and it worked ( even though that is default argument!)