R mistreats a number as a character - r

I am trying to read a series of text files into R. These files are of the same form, or at least appear to be. Everything is fine except for one file. When I read that file, R treated all numbers as characters. I used as.numeric to convert them back, but the data values changed. I also tried converting the text file to CSV and then reading it into R, but that did not work either. Has anyone had this problem before? How can I fix it? Thank you!
The data is from the Human Mortality Database. I cannot attach the data here due to copyright issues, but anyone can register with HMD and download the data (www.mortality.org). As an example, I used the Australian and Belgian 1 by 1 exposure data.
My code is as follows:
AUSe<-read.table("AUS.Exposures_1x1.txt",skip=1,header=TRUE)[,-5]
BELe<-read.table("BEL.Exposures_1x1.txt",skip=1,header=TRUE)[,-5]
Then I want to add some rows in the above data frame or matrix. It is fine for Australian data (e.g, AUSe[1,3]+AUSe[2,3]). But error occurred when same command is applied to Belgium data: Error in BELe[1, 3] + BELe[2, 3] : non-numeric argument to binary operator. But if you look at the text file, you know those are two numbers. It is clear that R treated a number as a character when reading the text file, which is rather odd.

Try this instead:
BELe<-read.table("BEL.Exposures_1x1.txt",skip=1, colClasses="numeric", header=TRUE)[,-5]
Or you could surely post just a tiny bit of that file and not violate any copyright laws at least in my jurisdiction (which I think is the same one as The Human Mortality Database).
Belgium, Exposure to risk (period 1x1) Last modified: 04-Feb-2011, MPv5 (May07)
Year Age Female Male Total
1841 0 61006.15 62948.23 123954.38
1841 1 55072.53 56064.21 111136.73
1841 2 51480.76 52521.70 104002.46
1841 3 48750.57 49506.71 98257.28
.... . ....
So I might have suggested the even more accurate colClasses:
BELe<-read.table("BEL.Exposures_1x1.txt",skip=2, # really two lines to skip I think
colClasses=c(rep("integer", 2), rep("numeric",3)),
header=TRUE)[,-5]
I suspect the problem occurs because of lines like these:
1842 110+ 0.00 0.00 0.00
So you will need to determine how much interest you have in preserving the 110+ values. With my method they will be coerced to NAs. (Well, I thought they would be, but like you I got an error.) So this multi-step process is needed:
BELe<-read.table("Exposures_1x1.txt",skip=2,
header=TRUE)
BELe[ , 2:5] <- lapply(BELe[ , 2:5], as.character)
str(BELe)
#-------------
'data.frame': 18759 obs. of 5 variables:
$ Year : int 1841 1841 1841 1841 1841 1841 1841 1841 1841 1841 ...
$ Age : chr "0" "1" "2" "3" ...
$ Female: chr "61006.15" "55072.53" "51480.76" "48750.57" ...
$ Male : chr "62948.23" "56064.21" "52521.70" "49506.71" ...
$ Total : chr "123954.38" "111136.73" "104002.46" "98257.28" ...
#-------------
BELe[ , 2:5] <- lapply(BELe[ , 2:5], as.numeric)
#----------
Warning messages:
1: In lapply(BELe[, 2:5], as.numeric) : NAs introduced by coercion
2: In lapply(BELe[, 2:5], as.numeric) : NAs introduced by coercion
3: In lapply(BELe[, 2:5], as.numeric) : NAs introduced by coercion
4: In lapply(BELe[, 2:5], as.numeric) : NAs introduced by coercion
str(BELe)
#-----------
'data.frame': 18759 obs. of 5 variables:
$ Year : int 1841 1841 1841 1841 1841 1841 1841 1841 1841 1841 ...
$ Age : num 0 1 2 3 4 5 6 7 8 9 ...
$ Female: num 61006 55073 51481 48751 47014 ...
$ Male : num 62948 56064 52522 49507 47862 ...
$ Total : num 123954 111137 104002 98257 94876 ...
# and just to show that they are not really integers:
BELe$Total[1:5]
#[1] 123954.38 111136.73 104002.46 98257.28 94875.89
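The "data value changed" symptom from the question is typical of calling as.numeric() directly on a factor, which returns the internal level codes rather than the printed numbers. The following is a minimal sketch, assuming the column had been read as a factor (the read.table default before R 4.0.0); the vector x is a made-up stand-in built from the sample values above plus one "." entry.

```r
# Hypothetical stand-in for a column that was read as a factor
# because one entry (".") is not a number.
x <- factor(c("61006.15", "55072.53", ".", "51480.76"))

# Converting the factor directly returns the integer level codes,
# not the numbers that are printed.
codes <- as.numeric(x)

# Going through character first recovers the printed values;
# the non-numeric entry becomes NA (with a coercion warning, silenced here).
values <- suppressWarnings(as.numeric(as.character(x)))

print(codes)
print(values)
```

Comparing codes with values makes it clear why sums computed after a direct as.numeric() look plausible but are wrong.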

The way I typically read those files is:
BELexp <- read.table("BEL.Exposures_1x1.txt", skip = 2, header = TRUE, na.strings = ".", as.is = TRUE)
Note that Belgium lost 3 years of data in WWI that may never be recovered, and hence these three years are all NAs, which in those files are marked with ".", a character string. Hence the argument na.strings = ".". Specifying that argument will take care of all columns except Age, which is character (intentionally), due to the "110+". The reason the HMD does this is so that users have to be intentional about treatment of the open age group. You can convert the age column to integer using:
BELexp$Age <- as.integer(gsub("[+]", "", BELexp$Age))
Since such issues are long the bane of R-HMD users, the HMD has recently posted some R functions in a small but growing package on github called (for now) DemogBerkeley. The function readHMD() removes all of the above headaches:
library(devtools)
install_github("DemogBerkeley", subdir = "DemogBerkeley", username = "UCBdemography")
library(DemogBerkeley)
BELexp <- readHMD("BEL.Exposures_1x1.txt")
Note that a new indicator column, called OpenInterval is added, while Age is converted to integer as above.
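For readers who do not want to install the package, the steps described above can be sketched in base R. This is not the readHMD() implementation, only an illustration of the same idea under stated assumptions: hmd_text is a made-up inline stand-in for an HMD file (the 1914 row with "." values is invented for illustration), and the OpenInterval column is built here by hand.

```r
# Inline stand-in for an HMD exposure file: title line, blank line,
# header, then data. The 1914 row is invented to show the "." marker.
hmd_text <- "Belgium, Exposure to risk (period 1x1)

Year Age Female Male Total
1841 0 61006.15 62948.23 123954.38
1842 110+ 0.00 0.00 0.00
1914 0 . . .
"

# "." marks missing values; as.is = TRUE keeps Age as character.
exp_df <- read.table(text = hmd_text, skip = 2, header = TRUE,
                     na.strings = ".", as.is = TRUE)

# Flag the open age group before stripping the "+" sign.
exp_df$OpenInterval <- grepl("+", exp_df$Age, fixed = TRUE)
exp_df$Age <- as.integer(gsub("[+]", "", exp_df$Age))

str(exp_df)
```

Keeping the flag in a separate column means the open age group can still be identified after Age has become an integer.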

Can you try read.csv(... stringsAsFactors=FALSE) ?

Related

Error in match.arg(opt_crit) : 'arg' must be NULL or a character vector

Error in match.arg(opt_crit) : 'arg' must be NULL or a character vector
occurs when trying to run my script in r.
I have tried to find the solution for it, but it seems to be pretty specific, and little help for me.
My dataset contains 3936 obs of 7 variables.
environment, skill, volume, datetime, year, month, day
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 3696 obs. of 7 variables:
$ environment: chr "b2b" "b2b" "b2b" "b2b" ...
$ skill : chr "BO Bedrift" "BO Bedrift" "BO Bedrift" "BO Bedrift" ...
$ year : num 2017 2017 2017 2017 2017 ...
$ month : num 1 1 1 1 1 2 2 2 2 3 ...
$ day : num 2 9 16 23 30 6 13 20 27 6 ...
$ volume : num 360 312 305 222 113 ...
$ datetime : Date, format: "2017-01-02" "2017-01-09" "2017-01-16" "2017-01-23" ...
but when trying to run
volume_ets <- volume_tsbl %>% ETS(volume)
this message shows in the console
Error in match.arg(opt_crit) : 'arg' must be NULL or a character vector
I tried somewhat of a shortcut, but nothing helped,
volume_tsbl$volume <- as.numeric(as.character(volume_tsbl$volume))
volume_ets <- volume_tsbl %>% ETS(volume)
my tsibble looks like this;
volume_tsbl <- volume %>% as_tsibble(key = c(skill, environment), index = c(datetime), regular = TRUE)
Expected the code to run, but it does not.
This is the result of an interface change made in late 2018. The change was to make model functions (such as ETS()) create model definitions, rather than fitted models. Essentially, ETS() no longer accepts data as an input, and the specification for the ETS model would become ETS(volume).
The equivalent code in the current version of fable is:
volume_ets <- volume_tsbl %>% model(ETS(volume))
Where the model() function is used to train one or more model definitions (ETS(volume) in this case) to a given dataset.
You can refer to the pkgdown site for fable to see more details: http://fable.tidyverts.org/
In particular, the ETS() function is documented here: http://fable.tidyverts.org/reference/ETS.html

I get the error "NAs introduced by coercion" when trying to run kNN in R?

I am trying to run kNN on a dataset but I keep getting some NA error. I have exhausted stack overflow trying to find a solution to this problem. I could not find anything useful anywhere.
This is the dataset I am working with : https://www.kaggle.com/tsiaras/uk-road-safety-accidents-and-vehicles
I have converted every single factor variable and integer variable for my predictor and target to numeric so it can do Euclidean distance. I have removed all the NA's but kNN keeps throwing the following error message :
NAs introduced by coercion
NAs introduced by coercion
Error in knn(train[2:nrow(train), c(11, 22, 23, 25, 27, 28)], test[(2:nrow(test)), :
NA/NaN/Inf in foreign function call (arg 6)
This is one example of how I am converting all the predictors and running kNN :
as.numeric(levels(test$Road_Type))[levels(test$Road_Type)]
as.numeric(levels(train$Road_Type))[levels(train$Road_Type)]
train <- na.exclude(train)
test <- na.exclude(test)
cl=as.numeric(train[2:nrow(train),5])
cl <- na.exclude(cl)
knn0 <- knn(train[2:nrow(train),c(11,22,23,25,27,28)], test[(2:nrow(test)),c(11,22,23,25,27,28)], cl)
I am doing the as.numeric stuff for all the columns 11,22,23,25,27,28 and also the target. I am starting the row at 2 so it doesn't include the labels. I have also tried running the following code before passing the parameters into the kNN function :
sum(is.na(train[2:nrow(train),c(11,22,23,25,27,28)]))
sum(is.na(test[2:nrow(test),c(11,22,23,25,27,28)]))
sum(is.na(cl))
All 3 of these return 0 so there are no NA values before I am passing it into the kNN function.
EDIT
Fixed the issue by converting to numeric like this :
train$Road_Type <- as.numeric(as.integer(factor(train$Road_Type)))
Thanks to everyone who helped!
You need to always look into the data. This helps you and others to answer the question.
If we check your data it looks like this:
str(df[, c(11, 22, 23, 25, 27, 28)])
'data.frame': 2047256 obs. of 6 variables:
$ Junction_Control : chr "Data missing or out of range" "Auto traffic signal" "Data missing or out of range" "Data missing or out of range" ...
$ Number_of_Vehicles : int 1 1 2 1 1 2 2 1 2 2 ...
$ Pedestrian_Crossing.Human_Control: int 0 0 0 0 0 0 0 0 0 0 ...
$ Police_Force : chr "Metropolitan Police" "Metropolitan Police" "Metropolitan Police" "Metropolitan Police" ...
$ Road_Type : chr "Single carriageway" "Dual carriageway" "Single carriageway" "Single carriageway" ...
$ Special_Conditions_at_Site : chr "None" "None" "None" "None" ...
What happens if we transform a character to numeric:
df$Police_Force <- as.numeric(df$Police_Force)
df$Police_Force
[1] NA NA NA NA NA NA NA ....
Warning message:
NAs introduced by coercion
This does not work in R. However if we set them as factors and afterward change them to numeric the problem is solved.
df$Police_Force <- as.numeric(as.factor(df$Police_Force))
df$Police_Force
[1] 30 30 30 30 30 30 30 ...
Your approach does not work because the variables are not factors but characters.
levels(df$Road_Type)
NULL
as.numeric(levels(df$Road_Type))[levels(df$Road_Type)]
numeric(0)
As you have not shown how your data looks after imported into R I might be wrong. I used the read.csv function.
Are you sure you have converted your data into numeric? as.numeric() does not work in place, you have to assign its result, as you have done it with cl.
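One detail worth checking when train and test are converted separately: as.numeric(as.factor(x)) assigns codes from the levels present in that particular vector, so the same label can get different codes in the two sets. A minimal sketch with made-up data follows; the tiny train and test data frames and the Road_Type_num column are hypothetical, and the integer codes still impose an arbitrary ordering on the categories, which matters for Euclidean distance.

```r
# Made-up stand-ins for the train and test sets (character columns).
train <- data.frame(
  Road_Type = c("Single carriageway", "Dual carriageway", "Roundabout"),
  stringsAsFactors = FALSE
)
test <- data.frame(
  Road_Type = c("Roundabout", "Single carriageway"),
  stringsAsFactors = FALSE
)

# Build one shared set of levels so a label maps to the same code
# in both data frames.
road_levels <- sort(unique(c(train$Road_Type, test$Road_Type)))

train$Road_Type_num <- as.numeric(factor(train$Road_Type, levels = road_levels))
test$Road_Type_num  <- as.numeric(factor(test$Road_Type,  levels = road_levels))

print(train)
print(test)
```

The result is assigned back to the data frame, since as.numeric() does not modify its argument in place.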

Splitting a single variable dataframe

I have a CSV file that appears as just one variable. I want to split it to 6. I need help.
str(nyt_data)
'data.frame': 3104 obs. of 1 variable:
$ Article_ID.Date.Title.Subject.Topic.Code: Factor w/ 3104 levels "16833;7-Dec-03;Ruse in Toyland: Chinese Workers' Hidden Woe;Chinese Workers Hide Woes for American Inspectors;5",..: 2420 2421 2422 2423 2424 2425 2426 2427 2428 2429 ...
nyt_data$Article_ID.Date.Title.Subject.Topic.Code
The result displayed after the above line of code is:
> head(nyt_data$Article_ID.Date.Title.Subject.Topic.Code)
[1] 41246;1-Jan-96;Nation's Smaller Jails Struggle To Cope With Surge in Inmates;Jails overwhelmed with hardened criminals;12
[2] 41257;2-Jan-96;FEDERAL IMPASSE SADDLING STATES WITH INDECISION;Federal budget impasse affect on states;20
[3] 41268;3-Jan-96;Long, Costly Prelude Does Little To Alter Plot of Presidential Race;Contenders for 1996 Presedential elections;20
Please help me with code to split these into 6 separate columns Article_ID, Date, Title, Subject, Topic, Code.
The data is split with ";" but read.csv defaults to ",". Simply do the following:
df <- read.csv(data, sep = ";")
Just read CSV file with custom sep.
Like this:
data <- read.csv(input_file, sep=';')
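To try this without the original file, here is a small self-contained sketch that uses the text argument of read.csv. The header line in csv_text is an assumption reconstructed from the column name shown in the question; the two data rows are copied from the question's output.

```r
# Inline stand-in for the semicolon-delimited file; the header line
# is an assumption based on the column name printed by str().
csv_text <- "Article_ID;Date;Title;Subject;Topic Code
41246;1-Jan-96;Nation's Smaller Jails Struggle To Cope With Surge in Inmates;Jails overwhelmed with hardened criminals;12
41257;2-Jan-96;FEDERAL IMPASSE SADDLING STATES WITH INDECISION;Federal budget impasse affect on states;20"

# sep = ";" makes read.csv split on semicolons instead of commas.
nyt <- read.csv(text = csv_text, sep = ";", stringsAsFactors = FALSE)

str(nyt)
```

read.csv2() also defaults to sep = ";", but it additionally assumes a decimal comma (dec = ","), so it is only a drop-in replacement when the numeric columns are written that way or contain whole numbers only.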

What parameters does Excel use to read in CSV files, and how can it be adapted to R?

I have a CSV file which mixes quoting and unquoting that gives R problems when trying to read it in. The issue arises with commas within the quotes, it delimits on these but I want them ignored. When viewing the CSV in Excel, it manages it perfectly and understands where to break. Is there a way these settings can be viewed/translated to R?
Here is the link to download the file in question, it's a set of gene ontologies and their associated terms and whether or not the gene is part of it (0 or 1). It should be 4 columns of text, 1 column of pValues, and 50 columns of 0/1.
I've tried reading it into R with read.table(file, quote="\"", sep=",", row.names=NULL), but the values from Category, Name, Verbose ID spill into the pValue and then affect the counts data. Then entire rows of data may be put into one cell until another misinterpreted delimiter arises.
Here's an example problem line, with some of the last columns of 0/1 redacted for length.
"Pubmed","Expression of epidermal growth factors, erbBs, in the nasal mucosa of patients with chronic hypertrophic rhinitis.","22327010","pubmed_22327010_Expression_of_epidermal_growth_factors,_erbBs,_i...",0.005837270080633278,0,0,0,0,0,1,0,...
Turns out the R command data.table::fread does exactly what I want, found from this SO post
Hmm, I can't replicate. Using quote="\"", sep="," seems to give what you're asking for ...
example_line <- '"Pubmed","Expression of epidermal growth factors, erbBs, in the nasal mucosa of patients with chronic hypertrophic rhinitis.","22327010","pubmed_22327010_Expression_of_epidermal_growth_factors,_erbBs,_i...",0.005837270080633278,0,0,0,0,0,1,0'
r <- read.table(header=FALSE,quote="\"",sep=",",text=example_line,stringsAsFactors=FALSE)
str(r)
## 'data.frame': 1 obs. of 12 variables:
## $ V1 : chr "Pubmed"
## $ V2 : chr "Expression of epidermal growth factors, erbBs, in the nasal mucosa of patients with chronic hypertrophic rhinitis."
## $ V3 : int 22327010
## $ V4 : chr "pubmed_22327010_Expression_of_epidermal_growth_factors,_erbBs,_i..."
## $ V5 : num 0.00584
## $ V6 : int 0
## $ V7 : int 0
## $ V8 : int 0
## $ V9 : int 0
## $ V10: int 0
## $ V11: int 1
## $ V12: int 0
read_csv from the readr package is also a possibility. It can apparently deal with all sorts of oddities.
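When a file like this misbehaves, it can help to find which lines are being split into the wrong number of fields before choosing a reader. A minimal sketch using base R's count.fields follows; csv_lines is a shortened, made-up stand-in for the real file (the header names and the truncated values are assumptions for illustration).

```r
# Shortened stand-in for the file: a header and one row whose quoted
# text fields contain commas.
csv_lines <- c(
  '"Category","Name","ID","Verbose ID",pValue',
  '"Pubmed","Expression of epidermal growth factors, erbBs, in the nasal mucosa.","22327010","pubmed_22327010_Expression,_erbBs",0.0058'
)

# Field counts per line when double quotes are honoured.
con <- textConnection(csv_lines)
with_quotes <- count.fields(con, sep = ",", quote = "\"")
close(con)

# Field counts per line when quoting is ignored: commas inside the
# quoted text now count as delimiters.
con <- textConnection(csv_lines)
no_quotes <- count.fields(con, sep = ",", quote = "")
close(con)

print(with_quotes)
print(no_quotes)
```

On a real file, which(count.fields(path, sep = ",", quote = "\"") != expected) would point at the offending lines, where path and expected are placeholders for the file name and the expected number of columns.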

Data importing Delimiter issue in R

I am trying to import a text file into R, and put it into a data frame, along with other data.
My delimiter is "|" and a sample of my data is here :
|Painless check-in. Two legs of 3 on AC: AC105, YYZ-YVR. Roomy and clean A321 with fantastic crew. AC33: YVR-SYD,
very light load and had 3 seats to myself. A very enthusiastic and friendly crew as usual on this transpacific
route that I take several times a year. Arrived 20 min ahead of schedule. The expected high level of service from
our flag carrier, Air Canada. Altitude Elite member.
|We recently returned from Dublin to Toronto, then on to Winnipeg. Other than cutting it close due to limited
staffing in Toronto our flight was excellent. Due to the rush in Toronto one of our carry ones was placed to go in
the cargo hold. When we arrived in Winnipeg it stayed in Toronto, they were most helpful and kind at the Winnipeg
airport, and we received 3 phone calls the following day in regards to the misplaced bag and it was delivered to
our home. We are very thankful and more than appreciative of the service we received what a great end to a
wonderful holiday.
|Flew Toronto to Heathrow. Much worse flight than on the way out. We paid a hefty extra fee for exit seats which
had no storage whatsoever, and not even any room under the seats. Ridiculous. Crew were poor, not friendly. One
older male member of staff was quite attitudinal, acting as though he was doing everyone a huge favour by serving
them. A reasonable dinner but breakfast was a measly piece of banana loaf. That's it! The worst airline breakfast
I have had.
As you can see, there are many "|" , but as this screenshot below shows, when I imported the data in R, it only separated it once, instead of about 152 times.
How do I get each individual piece of text in a different column inside the data frame? I would like a data frame of length 152, not 2.
EDIT: The code lines are:
myData <- read.table("C:/Users/Norbert/Desktop/research/Important files/Airline Reviews/Reviews/air_can_Review.txt", sep="|",quote=NULL, comment='',fill = TRUE, header=FALSE)
length(myData)
[1] 2
class(myData)
[1] "data.frame"
str(myData)
'data.frame': 1244 obs. of 2 variables:
$ V1: Factor w/ 1093 levels "","'delayed' on departure (I reference flights between March 2014 and January 2015 in this regard: Denver, SFO,",..: 210 367 698 853 1 344 483 87 757 52 ...
$ V2: Factor w/ 154 levels ""," hotel","5/9/2014, LHR to Vancouver, AC855. 23/9/2014, Vancouver to LHR, AC854. For Economy the leg room was OK compared to",..: 1 1 1 1 78 1 1 1 1 1 ...
myDataFrame <- data.frame(text = myData, otherVar2 = 1, otherVar2 = "blue", stringsAsFactors = FALSE)
str(myDataFrame)
'data.frame': 531 obs. of 3 variables:
$ text : chr "BRU-YUL, May 26th, A330-300. Departed on-time, landed 30 minutes late due to strong winds, nice flight, food" "excellent, cabin-crew smiling and attentive except for one old lady throwing meal trays like boomerangs. Seat-" "pitch was very generous, comfortable seat, IFE a bit outdated but selection was Okay. Air Canadas problem is\nthat the new pro"| __truncated__ "" ...
$ otherVar2 : num 1 1 1 1 1 1 1 1 1 1 ...
$ otherVar2.1: chr "blue" "blue" "blue" "blue" ...
length(myDataFrame)
[1] 3
A better way to read in the text is using scan(), and then put it into a data frame with your other variables (here I just made some up). Note that I took your text above, and pasted it into a file called sample.txt, after removing the starting "|".
myData <- scan("sample.txt", what = "character", sep = "|")
myDataFrame <- data.frame(text = myData, otherVar2 = 1, otherVar2 = "blue",
stringsAsFactors = FALSE)
str(myDataFrame)
## 'data.frame': 3 obs. of 3 variables:
## $ text : chr "Painless check-in. Two legs of 3 on AC: AC105, YYZ-YVR. Roomy and clean A321 with fantastic crew. AC33: YVR-SYD, very light loa"| __truncated__ "We recently returned from Dublin to Toronto, then on to Winnipeg. Other than cutting it close due to limited staffing in Toront"| __truncated__ "Flew Toronto to Heathrow. Much worse flight than on the way out. We paid a hefty extra fee for exit seats which had no storage "| __truncated__
## $ otherVar2 : num 1 1 1
## $ otherVar2.1: Factor w/ 1 level "blue": 1 1 1
The two otherVar2 columns (the duplicated name is made unique as otherVar2.1) are just placeholders for your own variables, as you said you wanted a data.frame with other variables. I chose a numeric variable and a text variable, and by specifying a single value, it gets recycled for all observations in the dataset (in the example, 3).
I realize that your question asks how to get each text in a different column, but that is not a good way to use a data.frame, since data.frames are designed to hold variables in columns. (With one text per column, you cannot add other variables.)
If you really want to do that, you have to coerce the data after transposing it, as follows:
myDataFrame <- as.data.frame(t(data.frame(text = myData, stringsAsFactors = FALSE)), stringsAsFactors = FALSE)
str(myDataFrame)
## 'data.frame': 1 obs. of 3 variables:
## $ V1: chr "Painless check-in. Two legs of 3 on AC: AC105, YYZ-YVR. Roomy and clean A321 with fantastic crew. AC33: YVR-SYD, very light loa"| __truncated__
## $ V2: chr "We recently returned from Dublin to Toronto, then on to Winnipeg. Other than cutting it close due to limited staffing in Toront"| __truncated__
## $ V3: chr "Flew Toronto to Heathrow. Much worse flight than on the way out. We paid a hefty extra fee for exit seats which had no storage "| __truncated__
length(myDataFrame)
## [1] 3
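If the reviews are wrapped over several physical lines in the file, as in the sample shown in the question, an alternative is to join the lines first and then split on the delimiter. This is a sketch of that variation, not the scan() approach above; review_lines is a shortened, made-up stand-in for what readLines("sample.txt") would return.

```r
# Shortened stand-in for readLines("sample.txt"): each review starts
# with "|" and may continue on the following line.
review_lines <- c(
  "|Painless check-in. Two legs of 3 on AC: AC105, YYZ-YVR.",
  "Roomy and clean A321 with fantastic crew.",
  "|We recently returned from Dublin to Toronto, then on to Winnipeg.",
  "|Flew Toronto to Heathrow. Much worse flight than on the way out."
)

# Join the physical lines, then split on the literal "|" character.
raw_text <- paste(review_lines, collapse = " ")
reviews <- trimws(strsplit(raw_text, "|", fixed = TRUE)[[1]])

# The leading "|" produces an empty first piece; drop empty strings.
reviews <- reviews[nzchar(reviews)]

# One review per row, ready to be combined with other variables.
review_df <- data.frame(text = reviews, stringsAsFactors = FALSE)
str(review_df)
```

Using fixed = TRUE matters here, because "|" is the alternation operator in a regular expression.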
"Measly banana loaf"? Definitely economy class.
