R delete special characters (faster way)

I have a huge data frame, with some columns containing "characters". The problem is that I have some "wrong" characters, like this:
mutate_all(data, funs(tolower))
> Error in mutate_impl(.data, dots) : Evaluation error: invalid input
> 'https://www.ps.f/c-w/nos-promions/v-ambght-rembment.html#modalit<e9>s'
> in 'utf8towcs'.
So I deleted the "wrong" characters (note: I can't just easily remove all the characters, because I need the ":" to separate the data).
I found a solution:
library(qdap)
keep <- c(":")
data$column <- strip(data$column, keep, lower = TRUE)
See: How to remove specific special characters in R
That worked... but it is really slow. Hence my question: how can I apply a function to all my character columns that is quicker than what I just did?
EDIT
An example of what happened in my script:
View(data$column)
"CP:main:234e5qhaw/00:lcd-monitor-with-smatimge-lite"
"CP:main:234e5qhaw/00:lcd-monitor-with-smarimge-lite"
"CP:main:234e5qhaw/00:lcd-monitor-with-sartimge-lite"
"CP:main:bri953/00:faq:skça_sorulan_sorular:xc000003329:f02:9044:9512"
tolower(data$column)
Error in tolower(data$column) :
invalid input "CP:main:bri953/00:faq:skça_sorulan_sorular:xc000003329:f02:9044:9512" in 'utf8towcs'
Optimal situation: keep as much of the original data as possible. But I can imagine that "special" characters must be replaced. I really do need to keep the ":" to separate the data at a later stage.
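A quicker, vectorized base-R alternative (a sketch only; the character class below is an assumption and may need widening for your data): strip un-convertible bytes with iconv(), then delete the remaining special characters with gsub(), applied to all character columns at once:
# iconv() with sub = "" drops bytes that are not valid UTF-8 (the likely
# cause of the 'utf8towcs' error); gsub() then keeps only letters, digits,
# and the separators we need, ":" among them
clean <- function(x) {
  x <- iconv(x, "UTF-8", "UTF-8", sub = "")
  tolower(gsub("[^A-Za-z0-9:/_.-]", "", x))
}
is_chr <- vapply(data, is.character, logical(1))
data[is_chr] <- lapply(data[is_chr], clean)
Both iconv() and gsub() are vectorized, so this avoids a per-row loop entirely.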

Related

Error in non-numeric argument to binary operator: opt parse and date formatting

I am working through code on a sightings dataset and am trying to bin the dates into monthly (4-week) groups.
The command given is:
breaks <- seq(min(min(survey.c$DATE), min(sightings.d$date))-opt$window_length*7,
max(max(survey.c$DATE), max(sightings.d$date))+opt$window_length*7,
by = paste(opt$window_length, 'week'))
This consistently gives back the error
Error in min(min(survey.c$DATE), min(sightings.d$date)) -
opt$window_length * : non-numeric argument to binary operator
Separately, the pieces of the command min(min(survey.c$DATE), min(sightings.d$date)) and max(max(survey.c$DATE), max(sightings.d$date)) work on their own, so the issue comes from the last part, opt$window_length * 7.
The window_length is "2"; however, even if I change this last part to simply say "14", I run into the same issue. If I am totally honest, I don't completely understand what opt$window_length is trying to do.
I have tried converting the columns survey.c$DATE and sightings.d$date into Date format, and this reads back as TRUE with no NAs. I can also convert them into characters, but then the data reads back as NAs.
I think I need to change the final part of the command into a proper date format OR change the original date column into numeric form?
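One plausible cause (an assumption, since the optparse setup is not shown): opt$window_length was parsed as the character string "2", and subtracting a character from a Date throws exactly this error. A sketch of the fix:
# Hypothetical fix: date arithmetic needs a numeric offset
# (Date - numeric works; Date - character does not)
opt$window_length <- as.numeric(opt$window_length)
breaks <- seq(min(min(survey.c$DATE), min(sightings.d$date)) - opt$window_length*7,
              max(max(survey.c$DATE), max(sightings.d$date)) + opt$window_length*7,
              by = paste(opt$window_length, 'week'))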

Why does read.csv2 work just fine, yet read.csv2.sql shows an error/warning?

I am trying to read a csv file in R using read.csv2.sql, since I would like to use an SQL SELECT query to help me filter my data. But before I can even get to my SELECT query, I discovered that simply reading my csv file with read.csv2.sql already generates a warning message.
This is my code:
investment2 <- read.csv2.sql("investmentdata.csv")
This is the warning message:
Warning message:
In result_fetch(res@ptr, n = n) :
Column 'Capital.Investment': mixed type, first seen values of type real, coercing other values of type string
However, when I use the normal read.csv2 function, there is no error. In particular, the following code works fine with no warning messages:
investment <- read.csv2("investmentdata.csv")
Next, I tried to resolve this issue by casting the Capital.Investment column to be real as follows:
investment3 <- read.csv2.sql("investmentdata.csv", "SELECT *, CAST(Capital.Investment AS real) FROM file")
However, R now generates the following error:
Error: no such column: Capital.Investment
Thus, I have two questions. Firstly, why does using read.csv2.sql generate that warning message when read.csv2 works just fine? Secondly, why does R (or SQL) not recognise my Capital.Investment column when I try to cast it as real?
Perhaps it is also worth noting that I cannot simply ignore the warning that read.csv2.sql shows: as a consequence of it, some of the NA rows in my Capital.Investment column have been automatically cast to 0, which I cannot allow - the NA rows must stay as NA. I do not seem to have this problem with the other columns of my csv file, though.
As I am quite new to R, any help and explanations will be greatly appreciated :)
Edit
A truncated, plain-text version of my csv file is shown below. In particular, the name of the column in question is indeed Capital.Investment.
id;targetC;year;comp_id;homeC;Industry.Activity;Capital.Investment;Estimated;Jobs.Created;Estimated.1;Project.Type;geographic distance;SIC;listed;sales;assets;cap_structure;rnd;profit;rndintensity;polcon;homeC_gdp;targetC_gdp;homeC_gdppc;targetC_gdppc
1302;AUS;2008;FR338966385;FRA;Design, Development & Testing;33.1;Yes;36;Yes;New;15.26414042;3669;Unlisted;4333088.972;4037211.732;0;NA;-1339221.733;NA;0.489032525;2.92347E+12;1.05456E+12;45413.06571;49628.11513
1311;AUS;2008;US*190521496652;USA;Research & Development;8.4;Yes;30;No;New;15.24712914;NA;Unlisted;NA;NA;NA;NA;NA;NA;0.489032525;1.47E+13;1.05456E+12;48401.42734;49628.11513
1313;AUS;2008;GB05817296;GBR;Business Services;9.7;Yes;10;Yes;New;15.31094496;7389;Unlisted;NA;87.64187374;NA;NA;NA;NA;0.489032525;2.87546E+12;1.05456E+12;46523.26545;49628.11513
1318;AUS;2008;US129687150L;USA;Business Services;1.3;Yes;225;Yes;New;15.24712914;7373;Unlisted;NA;NA;NA;NA;NA;NA;0.489032525;1.47E+13;1.05456E+12;48401.42734;49628.11513
1351;AUS;2008;GB*P0060071;GBR;Electricity;516;No;51;Yes;New;15.31094496;NA;Unlisted;NA;NA;NA;NA;NA;NA;0.489032525;2.87546E+12;1.05456E+12;46523.26545;49628.11513
9925;AUS;2008;GB00034121;GBR;Business Services;34.8;Yes;37;Yes;New;15.31094496;4412;Unlisted;NA;2079288.611;0.355157008;NA;94320.15469;NA;0.489032525;2.87546E+12;1.05456E+12;46523.26545;49628.11513
9932;AUS;2008;CA30060NC;CAN;Sales, Marketing & Support;3.2;Yes;11;Yes;New;14.88812529;1094;Listed;NA;NA;NA;NA;NA;NA;0.489032525;1.54913E+12;1.05456E+12;46596.33599;49628.11513
9935;AUS;2008;US940890210;USA;Manufacturing;771;Yes;266;Yes;New;15.24712914;2911;Listed;NA;NA;NA;NA;NA;NA;0.489032525;1.47E+13;1.05456E+12;48401.42734;49628.11513
9938;AUS;2008;US770059951;USA;Technical Support Centre;9.1;Yes;104;Yes;Co-Locati;15.24712914;3661;Listed;34922000;53340000;0.120134983;4598000;7333000;0.086201723;0.489032525;1.47E+13;1.05456E+12;48401.42734;49628.11513
9946;AUS;2008;US010562944;USA;Extraction;535.8;Yes;198;Yes;New;15.24712914;2911;Listed;NA;NA;NA;NA;NA;NA;0.489032525;1.47E+13;1.05456E+12;48401.42734;49628.11513
9955;AUS;2008;DE5030147191;DEU;Logistics, Distribution & Transportation;21.2;Yes;134;Yes;New;14.6718338;4311;Listed;93495971.01;346629334.8;0.036629492;0;2044745.934;0;0.489032525;3.75237E+12;1.05456E+12;45699.19832;49628.11513
9958;AUS;2008;US126012192L;USA;Business Services;9.7;Yes;10;Yes;New;15.24712914;8111;Unlisted;NA;NA;NA;NA;NA;NA;0.489032525;1.47E+13;1.05456E+12;48401.42734;49628.11513
9969;AUS;2008;US135409005;USA;Extraction;NA;No;538;Yes;New;15.24712914;2911;Listed;NA;NA;NA;NA;NA;NA;0.489032525;1.47E+13;1.05456E+12;48401.42734;49628.11513
9977;AUS;2008;JP000000728JPN;JPN;ICT & Internet Infrastructure;128.6;Yes;77;Yes;New;7.0333688;3571;Listed;53255396.85;38181450.16;0.190244908;2584585.523;480589.4308;0.067692176;0.489032525;5.03791E+12;1.05456E+12;39339.29757;49628.11513
9984;AUS;2008;US841547578;USA;Sales, Marketing & Support;13.6;Yes;23;Yes;New;15.24712914;2095;Listed;NA;NA;NA;NA;NA;NA;0.489032525;1.47E+13;1.05456E+12;48401.42734;49628.11513
9993;AUS;2008;US258715604L;USA;Customer Contact Centre;1.8;No;40;No;New;15.24712914;NA;Unlisted;NA;NA;NA;NA;NA;NA;0.489032525;1.47E+13;1.05456E+12;48401.42734;49628.11513
This issue was resolved in chat and turned out to be one of two issues:
see my original answer below; this was causing an Error; when that is fixed, we see that ...
there is a warning informing us that a column (it happens to be the same column) looks numeric but has a non-numeric cell somewhere in the guts of the file.
The first is resolved below; the second is just a warning.
However, because the OP is asking to convert to numeric via SQL, the NA is converted to 0, which is not good. My recommendation is either to cast([Capital.Investment] as char) as [Capital.Investment] and use R's as.numeric to convert to numeric (preserving the NA-ness), or to just read.csv2(.) the file outright and use sqldf(.) to run SQL queries on the resulting table-like data.
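A minimal sketch of that first route (assuming the semicolon-delimited file above; the CI alias is just for illustration):
library(sqldf)
# Keep the column as text in SQL, then let R's as.numeric turn the
# literal "NA" strings back into real NA values
inv <- read.csv2.sql("investmentdata.csv",
  sql = "select *, cast([Capital.Investment] as char) as CI from file")
inv$CI <- as.numeric(inv$CI)  # "NA" -> NA, with a harmless coercion warning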
Up front: add brackets or quotes around your column name.
Rationale: Capital.Investment is seen as a dot-delimited table-column or schema-table or something similarly not what you intend. I believe that in SQL, field names with embedded dots generally need this escaping. If your data has an embedded space, realize that R does not like spaces in its field names, so by default it applies make.names when reading the data in (which replaces spaces with dots).
Setup:
Save the following as "quux.csv". (I've named it csv, but since I'm changing it to be ;-delimited, it behaves the same.)
quux;Capital.Investment
1;100
2;200
(Or you can use Capital Investment, it's the same thing.)
sqldf::read.csv2.sql("quux.csv", sql='select quux, cast(Capital.Investment as real) from file')
# Error: no such column: Capital.Investment
sqldf::read.csv2.sql("quux.csv", sql='select quux, cast([Capital.Investment] as real) as CI from file')
# quux CI
# 1 1 100
# 2 2 200
sqldf::read.csv2.sql("quux.csv", sql='select quux, cast("Capital.Investment" as real) as CI from file')
# quux CI
# 1 1 100
# 2 2 200

Comparing Strings for match in a vectorized way

I have a large data frame, which contains two columns containing strings. When these columns are unequal, I want to do an operation.
The problem is that when I use a simple != operator, it gives incorrect results; apparently, 'Tout_Inclus' and 'Tout_Inclus' are unequal.
This leads me to string-comparison functions such as strcmp from the pracma package. However, that is not vectorised; my dataframe has 9.6M rows, so I think looping through it would crash or take ages.
Has anyone got any vectorised methods for comparing strings?
My dataframe looks like this:
City_Break City_Break
City_Break City_Break
Court_Break Court_Break
Petit_Budget Petit_Budget
Pas_Cher Pas_Cher
Deals Deals_Pas_Chers
Vacances Vacances_Éco
Hôtel_Vol Hôtel_Vol
Dernière_Minute Dernière_Minute
Formule Formule_Éco
Court_Séjour Court_Séjour
Voyage Voyage_Pas_Cher
Séjour Séjour_Pas_Cher
Congés Congés_Éco
When I do something like df[colA != colB, ], it gives incorrect results, returning rows where the strings (by looking at them) are equal.
I've ensured encoding is UTF-8, strings are not factors, and I also tried removing special characters before doing the comparison.
By the way, these strings are from multiple languages.
edit: I've already trimmed whitespaces, and still no luck
Try removing leading/trailing whitespace from both columns, and then compare:
df[trimws(df$colA, "both") != trimws(df$colB, "both"), ]
If everything else is fine (trim, etc.), yours could be an encoding problem. In UTF-8 the same accented character can be represented by different byte sequences: it may be encoded as a single precomposed code point or as a base letter plus a combining modifier. However, this would be very strange with 'Tout_Inclus'.
Just to have a check, from stringi package try this:
stringi::stri_compare(df$colA,df$colB, "fr_FR")
What's the output?
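If the two columns differ only in Unicode normalization, stringi can also normalize both sides before a plain vectorized comparison (a sketch, assuming the df/colA/colB names from the question):
library(stringi)
# NFC normalization makes precomposed and combining-character forms of
# the same accented letter byte-identical, so != behaves as expected
a <- stri_trans_nfc(df$colA)
b <- stri_trans_nfc(df$colB)
df[a != b, ]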

Error occurred while using fread function in R

I am trying to read a csv file delimited with | using fread() in R, but I am getting the error below.
I am using a call like:
fread("\\Path\\file_name.csv",sep="|")
Expecting 23 cols, but line 3747410 contains text after processing all cols. It is very likely that this is due to one or more fields having embedded sep='|' and/or (unescaped) '\n' characters within unbalanced unescaped quotes. fread cannot handle such ambiguous cases and those lines may not have been read in as expected. Please read the section on quotes in ?fread.
In addition: Warning message:
In fread("Path\\File_name.csv",sep="|") :
Bumped column 5 to type character on data row 3422329, field contains 'NULL'. Coercing previously read values in this column from logical, integer or numeric back to character which may not be lossless; e.g., if '00' and '000' occurred before they will now be just '0', and there may be inconsistencies with treatment of ',,' and ',NA,' too (if they occurred in this column before the bump). If this matters please rerun and set 'colClasses' to 'character' for this column. Please note that column type detection uses the first 5 rows, the middle 5 rows and the last 5 rows, so hopefully this message should be very rare. If reporting to datatable-help, please rerun and include the output from verbose=TRUE.
Please help me solve this issue. Thanks in advance.
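A possible starting point, following the error message's own advice about quotes (an assumption: the stray '|' characters sit inside unbalanced quotes, so quote handling is disabled; reading everything as character also avoids the type-bump warning on column 5):
library(data.table)
# quote = "" disables quote handling entirely; colClasses = "character"
# reads every column as character, to be converted later as needed
dt <- fread("\\Path\\file_name.csv", sep = "|",
            quote = "", colClasses = "character")
If lines still fail to parse, rerunning with verbose=TRUE (as the warning suggests) will show where.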

What does the "More Columns than Column Names" error mean?

I'm trying to read in a .csv file from the IRS and it doesn't appear to be formatted in any weird way.
I'm using the read.table() function, which I have used several times in the past but it isn't working this time; instead, I get this error:
data_0910<-read.table("/Users/blahblahblah/countyinflow0910.csv",header=T,stringsAsFactors=FALSE,colClasses="character")
Error in read.table("/Users/blahblahblah/countyinflow0910.csv", :
more columns than column names
Why is it doing this?
For reference, the .csv files can be found at:
http://www.irs.gov/uac/SOI-Tax-Stats-County-to-County-Migration-Data-Files
(The ones I need are under the county to county migration .csv section - either inflow or outflow.)
It uses commas as separators. So you can either set sep="," or just use read.csv:
x <- read.csv(file="http://www.irs.gov/file_source/pub/irs-soi/countyinflow1011.csv")
dim(x)
## [1] 113593 9
The error is caused by spaces in some of the values, and unmatched quotes. There are no spaces in the header, so read.table thinks that there is one column. Then it thinks it sees multiple columns in some of the rows. For example, the first two lines (header and first row):
State_Code_Dest,County_Code_Dest,State_Code_Origin,County_Code_Origin,State_Abbrv,County_Name,Return_Num,Exmpt_Num,Aggr_AGI
00,000,96,000,US,Total Mig - US & For,6973489,12948316,303495582
And unmatched quotes, for example on line 1336 (row 1335) which will confuse read.table with the default quote argument (but not read.csv):
01,089,24,033,MD,Prince George's County,13,30,1040
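For completeness, a sketch of making read.table itself work on the same URL: restrict quoting to double quotes (as read.csv does), so the apostrophe in "Prince George's County" is no longer treated as an opening quote:
# read.table's default is quote = "\"'", which trips over apostrophes;
# quote = "\"" matches read.csv's behaviour
x <- read.table("http://www.irs.gov/file_source/pub/irs-soi/countyinflow1011.csv",
                header = TRUE, sep = ",", quote = "\"")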
You may have strange characters in your heading, such as #, %, --, or ,.
For the Germans:
You have to change your decimal commas into a full stop in your csv file (in Excel: File -> Options -> Advanced -> "Decimal separator"); then the error is solved.
Depending on the data (e.g. a .tsv extension), the file may use tabs as separators, so you may try sep = '\t' with read.csv.
This error can get thrown if your data frame has sf geometry columns.
