Possible inconsistency in conversion from text to numeric - r

Compare the conversion of a character string with as.numeric to how it can be done with read.fwf.
as.numeric("457") # 457
as.numeric("4 57") # NA with warning message
Now read from a file "fwf.txt" containing exactly " 5 7 12 4" .
foo <- read.fwf('fwf.txt', widths = c(5,5), colClasses = 'numeric', header = FALSE)
V1 V2
1 57 124
foo <- read.fwf('fwf.txt', widths = c(5,5), colClasses = 'character', header = FALSE)
V1 V2
1 5 7 12 4
Now, I'll note that in the "numeric" version, read.fwf does concatenation the same way Fortran does. I was just a bit surprised that it doesn't throw an error or return NA in the same manner as as.numeric. Anyone know why?

As @eipi10 pointed out, the space-eliminating behavior is not unique to read.fwf. It actually comes from the scan() function (which is used by read.table, which is used by read.fwf). The scan() function will remove spaces (or tabs, if they are not specified as the delimiter) from any value that is not a character as it processes the input stream. Once it has cleaned the value of spaces, it uses the same function as as.numeric to turn that value into a number. With character values it doesn't take out any white space unless you set strip.white=TRUE, which will only remove spaces from the beginning and end of the value.
Observe these examples
scan(text="TRU E", what=logical(), sep="x")
# [1] TRUE
scan(text="0 . 0 0 7", what=numeric(), sep="x")
# [1] 0.007
scan(text=" text ", what=character(), sep="~")
# [1] " text "
scan(text=" text book ", what=character(), sep="~", strip.white=T)
# [1] "text book"
scan(text="F\tALS\tE", what=logical(), sep=" ")
# [1] FALSE
You can find the source for scan() in /src/main/scan.c and the specific part responsible for this behavior is around this line.
If you wanted as.numeric to behave the same way, you could create a new function like
As.Numeric <- function(x) as.numeric(gsub(" ", "", x, fixed = TRUE))
in order to get
As.Numeric("4 57")
# [1] 457
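If tabs can sneak in as well, a slightly broader pattern works; this is my extension of the answer above, not part of the original:
# Sketch: strip all whitespace (spaces and tabs) before converting
As.Numeric2 <- function(x) as.numeric(gsub("[[:space:]]+", "", x))
As.Numeric2("4\t5 7")
# [1] 457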


Turn txt file into dataframe

I have a txt file with this data in it:
1 message («random_choice»)[5];
2 reply («принято»)[2][3];
3 regulate («random_choice»)[5];
4 Early reg («for instance»)[2][3][4];
4xx: Success (загрузка):
6 OK («fine»)[2][3];
I want to turn it into a dataframe consisting of three columns: ID, Message, Comment.
I also want to remove the unnecessary numbers in square brackets at the end.
Also, some values in the ID column contain strings (usually xx); in these cases the column must be left empty.
So, desired result must look like this:
ID Message Comment
1 message random_choice
2 reply принято
3 regulate random_choice
4 Early reg for instance
Success загрузка
6 OK fine
How could I do that? Even when I try to read this txt file I get a strange error:
df <- read.table("data_received.txt", header = TRUE)
error i get:
Error in read.table("data_received.txt", header = TRUE) :
more columns than column names
You can use strcapture for this.
Fake data, you'll likely do txt <- readLines("data_received.txt"). (Since my locale on windows is not being friendly to those strings, I'll replace with straight ascii, assuming it'll work just fine on your system.)
txt <- readLines(textConnection("1 message («random_choice»)[5];
2 reply («asdf»)[2][3];
3 regulate («random_choice»)[5];
4 Early reg («for instance»)[2][3][4];
4xx: Success (something):
6 OK («fine»)[2][3];"))
The breakout:
out <- strcapture("^(\\S+)\\s+([^(]+)\\s+\\((.*)\\).*$", txt,
proto = data.frame(ID=0L, Message="", Comment=""))
# Warning in fun(mat[, i]) : NAs introduced by coercion
out
# ID Message Comment
# 1 1 message «random_choice»
# 2 2 reply «asdf»
# 3 3 regulate «random_choice»
# 4 4 Early reg «for instance»
# 5 NA Success something
# 6 6 OK «fine»
The proto= argument indicates what type of columns are generated. Since I set ID=0L, strcapture assumes the column will be integer, so anything that does not convert to integer becomes NA (which satisfies your fifth-row requirement).
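If you would rather keep the raw ID text (e.g. "4xx:") instead of NA, a character proto is one possible variation (a sketch of mine, not from the original answer):
out2 <- strcapture("^(\\S+)\\s+([^(]+)\\s+\\((.*)\\).*$", txt,
                   proto = data.frame(ID="", Message="", Comment="", stringsAsFactors=FALSE))
# ID stays character, so row 5 keeps "4xx:" rather than becoming NA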
Explanation on the regex:
in general:
* means zero-or-more of the previous character (or character class)
+ means one-or-more
? (not used, but useful nonetheless) means zero or one
^ and $ mean the beginning and end of the string, respectively (a ^ within [..] is different)
(...) is a capture group: anything within the non-escaped parens is stored, anything not is discarded
[...] is a character group, any of the characters is a match; if this is instead [^..], then it is inverted: anything except what is listed
[[:...:]] (e.g. [[:space:]]) is a POSIX character class, used inside a character group
^(\\S+), start with (^) one or more (+) non-space characters (\\S);
\\s+ one or more space characters (\\s) (discarded);
([^(]+) one or more character that is not a left-paren;
\\((.*)\\)$ a literal left-paren (\\() and then zero or more of anything (.*), all the way to a literal right-paren (\\)) and the end of the string ($).
It should be noted that \\s and \\S are non-POSIX regex characters, where it is generally suggested to use [^[:space:]] for \\S (no space chars) and [[:space:]] for \\s. Those are equivalent but I went with code-golf initially. With this replacement, it looks like
out <- strcapture("^([^[:space:]]+)[[:space:]]+([^(]+)[[:space:]]+\\((.*)\\).*$", txt,
proto = data.frame(ID=0L, Message="", Comment=""))
We can use {unglue}. Here we see you have two patterns: one contains "«" and an ID, the other doesn't. {unglue} will use the first pattern that matches. Any {foo} or {} expression matches the regex ".*?", and a data.frame is built from the names put between brackets.
txt <- c(
"1 message («random_choice»)[5];", "2 reply («asdf»)[2][3];",
"3 regulate («random_choice»)[5];", "4 Early reg («for instance»)[2][3][4];",
"4xx: Success (something):", "6 OK («fine»)[2][3];")
library(unglue)
patterns <-
c("{id} {Message} («{Comment}»){}",
"{} {Message} ({Comment}){}")
unglue_data(txt, patterns)
#> id Message Comment
#> 1 1 message random_choice
#> 2 2 reply asdf
#> 3 3 regulate random_choice
#> 4 4 Early reg for instance
#> 5 <NA> Success something
#> 6 6 OK fine

How to read a file with more than one tab as separator and where the space is part of column value

I have to read a CSV file with a tab ("\t") separator, and it can occur multiple times. The read.table function has the special white space separator (sep="") that considers multiple occurrences of any whitespace (tab or space). The problem is that I have the space character as part of the column value, so I cannot use the white space separator. When I use "\t" it only considers one occurrence.
Here is a toy example of my problem:
text1 <- "
a b c
11 12 13
21 22 23
"
ds <- read.csv(sep = "", text = text1)
Before the element [1,3], i.e. "13", there are two tabs as separator. Then I get:
a b c
1 11 12 13
2 21 22 23
This is the expected result.
Let's say we add a space in the third column values between the first and second number, so now it would be: "1 3" and "2 3". Now we cannot use a white space delimiter, because the space is not a delimiter in this case; it is part of the column value. Now when I use "\t" I get this unexpected result:
text3 <- "
a b c
11 12 1 3
21 22 2 3
"
ds <- read.csv(sep = "\t", text = text3)
The string representation of the input text is:
"a\tb\tc\n11\t12\t\t1 3\n21\t22\t2 3\n"
And now the result is:
a b c
11 12 1 3
21 22 23
It seems to be simple, but I cannot find a way to do it using the read.table interface, because the input argument sep does not accept a regular expression as delimiter.
I think I found a workaround for this: 1) replace all runs of tabs with a single tab first, then 2) read the file/text. For example:
read.csv(text = gsub("[\t]+", "\t", readLines(textConnection(text3)), perl = TRUE), sep = "\t")
and also using a file instead:
temp <- tempfile()
writeLines(text3, temp)
read.csv(text = gsub("[\t]+", "\t", readLines(temp), perl = TRUE), sep = "\t")
The text input argument will be:
> text
[1] "a\tb\tc" "11\t12\t1 3" "21\t22\t2 3" ""
and the result of read.csv will be:
a b c
1 11 12 1 3
2 21 22 2 3
This is similar to @Badger's suggestion, just in one step.
Okay I think I've got something for you:
write.table(gsub("\\r", "", gsub("\t", "", readChar("C:/_Localdata/tab_sep.txt", file.info("C:/_Localdata/tab_sep.txt")$size))),
            "C:/_Localdata/test.txt", sep = " ", quote = FALSE, col.names = TRUE, row.names = FALSE)
## In the event there is a possibility that it is 1 or 2 tabs in series, you can use gsub("\t|\t\t", in place of gsub("\t", just add a | and more \t's if needed!
read.table("C:/_Localdata/test.txt",sep=" ",skip=1,header=T)
Okay, what just happened? First we read the file in as one massive character string using readChar(); we need to tell R how big the file is, using file.info(). From that string we get rid of the tabs using gsub and the "\t" pattern. That leaves a character string containing both \r and \n: \r is a carriage return and \n a line feed, Windows line endings contain both, and R reports both, so we get rid of the \r. Then we write the table out (ideally back to where it came from). Now you can read it in with an easy separating value of a single space, and skip the first line. The first line will be an X, an artifact of writing out a gsub result. Also declare a header and you should be good to go!
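For reference, here is a self-contained sketch of the same idea that runs anywhere: temp files instead of the C:/_Localdata paths, and collapsing runs of tabs instead of deleting every tab, since deleting them outright would merge the columns in this question's data:
infile <- tempfile(fileext = ".txt")
writeLines("a\tb\tc\n11\t12\t\t1 3\n21\t22\t2 3", infile)
txt <- readChar(infile, file.info(infile)$size)  # whole file as one string
txt <- gsub("\t+", "\t", txt)                    # collapse runs of tabs
outfile <- tempfile(fileext = ".txt")
writeLines(txt, outfile)
read.table(outfile, sep = "\t", header = TRUE)
#    a  b   c
# 1 11 12 1 3
# 2 21 22 2 3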
Let's say you have 500 files.
Place all your files in a folder, and set the pattern to the file type they are, or just allow R to view them all by removing the pattern call.
for (filename in list.files("C:/_Localdata/", pattern = ".txt", full.names = TRUE)) {
  write.table(gsub("\\r", "", gsub("\t", "", readChar(filename, file.info(filename)$size))),
              filename, sep = " ", quote = FALSE, col.names = TRUE, row.names = FALSE)
}
Now your files are ready to be read in however you would like.

How to count punctuation characters in a string

For digits I have done so:
digits <- c("0","1","2","3","4","5","6","7","8","9")
You can use the [:punct:] character class to detect punctuation. It matches
[!"\#$%&'()*+,\-./:;<=>?@\[\\\]^_`{|}~]
Either in gregexpr:
x = c("we are friends!, Good Friends!!")
gregexpr("[[:punct:]]", x)
[[1]]
[1] 15 16 30 31
attr(,"match.length")
[1] 1 1 1 1
attr(,"useBytes")
[1] TRUE
or via stringi
# Gives 4
stringi::stri_count_regex(x, "[:punct:]")
Notice the , is counted as punctuation.
The question seems to be about getting individual counts of particular punctuation marks. @Joba provides a neat answer in the comments:
## Create a vector of punctuation marks you are interested in
punct = strsplit('[]?!"\'#$%&(){}+*/:;,._`|~[<=>@^-]\\', '')[[1]]
Then count how often they appear:
counts = stringi::stri_count_fixed(x, punct)
Decorate the vector
setNames(counts, punct)
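Putting those steps together (a sketch using the x from above; only the nonzero entries are shown):
x <- c("we are friends!, Good Friends!!")
punct <- strsplit('[]?!"\'#$%&(){}+*/:;,._`|~[<=>@^-]\\', '')[[1]]
counts <- setNames(stringi::stri_count_fixed(x, punct), punct)
counts[counts > 0]
# ! ,
# 3 1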
You can use regular expressions.
stringi::stri_count_regex("amdfa, ad,a, ad,. ", "[:punct:]")
https://en.wikipedia.org/wiki/Regular_expression
might help too.

How to find and replace double quotes in R data frame

I have a data frame that looks like this (sorry, I can't replicate the actual data frame with code as the double quotes don't show up. Vx are variables):
V1, V2, V3, V4
home, 15, "grand", terminal,
"give", 32, "cuz", good,
"miles", 5, "before", ten,
yes, 45, "sorry", fine
Question: how might I be able to fix the double quote issue for my entire data frame, which I've imported using the read.csv function, so that all the double quotes are removed?
What I'm looking for is the excel or word equivalent of FIND + REPLACE: Find the double quote, and replace with nothing.
Notes:
1) I've confirmed it's a data frame by running is.data.frame() function
2) The actual data frame has hundreds of columns, so going through each one and declaring the type of column it is isn't feasible
3) I tried using the following, and it didn't work: as.data.frame(sapply(my_data, function(x) gsub("\"", "", x)))
4) I confirmed that this isn't a simple print issue by testing using SQL on the data frame. It won't find columns in double quotes unless I use LIKE instead of =
Thanks in advance!
7/7/15 EDIT 01: as requested by @alexforrence, here is the dput output for a couple of columns:
billing_first_name billing_last_name billing_company
3 NA
4 Peldi Guilizzoni NA
5 NA
6 "James Andrew" Angus NA
7 NA
8 Nova Spivack NA
Here is a solution using dplyr and stringr. Note that purely numerical columns will be character columns afterwards. It's not clear to me from your description whether there are purely numerical columns. If there are then you'd probably want to treat them separately, or alternatively convert back into numbers afterwards.
require(dplyr)
require(stringr)
df <- data.frame(V1=c("home", "\"give\"", "\"miles\"", "yes"),
V2=c(15, 32, 5, 45),
V3=c("\"grand\"", "\"cuz\"", "\"before\"", "\"sorry\""),
V4=c("terminal", "good", "ten", "fine"))
df
## V1 V2 V3 V4
## 1 home 15 "grand" terminal
## 2 "give" 32 "cuz" good
## 3 "miles" 5 "before" ten
## 4 yes 45 "sorry" fine
df %>% mutate_each(funs(str_replace_all(., "\"", "")))
## V1 V2 V3 V4
## 1 home 15 grand terminal
## 2 give 32 cuz good
## 3 miles 5 before ten
## 4 yes 45 sorry fine
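A side note from me, not part of the original answer: mutate_each() and funs() have since been deprecated. With dplyr >= 1.0 the equivalent is across() (columns still come out as character, as above):
df %>% mutate(across(everything(), ~ str_replace_all(.x, "\"", "")))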
You can identify empty strings using nchar().
a <- ""
nchar(a)==0
[1] TRUE
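For completeness, a base R sketch of the same find-and-replace (my addition; it assumes every column can safely be treated as character):
df[] <- lapply(df, function(col) gsub('"', '', as.character(col)))
df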
In addition to the above I ran into a very strange problem. Using the tips I wrote this very short program:
setClass("char.with.deleted.quotes")
setAs("character", "char.with.deleted.quotes",
function(from) as.character(gsub('„',"xxx", as.character(from), fixed = TRUE)))
TMP = read.csv2("./test.csv", header=TRUE, sep=";", dec=",",
colClasses = c("character","char.with.deleted.quotes"))
temp <- gsub('„', "xxx", TMP$Name, fixed=TRUE)
print(temp)
with the Output:
> source('test.R')
[1] "This is some „Test" "And another „Test"
[1] " "
Number Name
1 X-23 This is some „Test
2 K-33.01 And another „Test
which reads the dummy csv:
Number;Name
X-23;This is some „Test
K-33.01;And another „Test
My goal is to get rid of this double quote before the word Test. However, this so far does not work, and it is because of this particular double quote character.
If instead I choose to replace a different part of the string, it does work, either with read.csv2 and the above class definition, or directly with gsub, saving the result into the temp variable.
Now what is really strange is the following. After running the program I copied the two lines "temp <- gsub" and "print(temp)" manually into the command line:
> source('test.R')
[1] "This is some „Test" "And another „Test"
[1] "This is some „Test" "And another „Test"
[1] " "
Number Name
1 X-23 This is some „Test
2 K-33.01 And another „Test
>
> temp <- gsub('„', "xxx", TMP$Name, fixed=TRUE)
> print(temp)
[1] "This is some xxxTest" "And another xxxTest"
This, for whatever reason, works, and it also works if I modify the data frame directly:
> TMP$Name <- gsub('„', "xxx", TMP$Name, fixed=TRUE)
> print(TMP)
Number Name
1 X-23 This is some xxxTest
2 K-33.01 And another xxxTest
But if I repeat this command in the program and run it again, it does not work. And I really have no idea why.
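A likely explanation (my guess, since the file's encoding isn't shown): if test.R is saved as UTF-8 but sourced with the native encoding, the „ inside the script is read as different bytes than the „ in the console input or the CSV, so the pattern passed to gsub never matches. Declaring the encoding when sourcing may fix it:
source('test.R', encoding = 'UTF-8')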

How to make format() yield numeric objects

Whenever I format a table as such:
example <- sample(LETTERS, replace = T)
format(table(example), scientific = T)
The numbers become characters. How can I tell format() my object is numeric without resorting to as.numeric()? I can't find any such parameters in the function's help page. It says that format() objects are usually numeric, so I guess I'm missing some basic command.
My real data looks like this:
> xtabs(...)
PRU/DF PSU/ILH PSU/JFA PSU/MCL PSU/SRM PSU/ULA
1.040771e+01 0.000000e+00 2.280347e-01 0.000000e+00 0.000000e+00 8.186240e+00
PSU/URA PSU/VGA PU/AC PU/AM PU/AP PU/BA
0.000000e+00 1.534169e+01 8.184747e+01 1.410106e+01 1.028717e+01 1.099289e+00
PU/GO PU/MA PU/MG PU/MT PU/PA PU/PI
0.000000e+00 4.369910e+01 5.350849e+00 0.000000e+00 4.706721e-01 0.000000e+00
I want to have the console print the numbers prettier so my co-workers don't have a heart attack. This is what I've come up with:
> format(xtabs(...), scientific = F, digits = 1)
PRU/DF PSU/ILH PSU/JFA PSU/MCL PSU/SRM PSU/ULA PSU/URA PSU/VGA
"10.4077" " 0.0000" " 0.2280" " 0.0000" " 0.0000" " 8.1862" " 0.0000" "15.3417"
PU/AC PU/AM PU/AP PU/BA PU/GO PU/MA PU/MG PU/MT
"81.8475" "14.1011" "10.2872" " 1.0993" " 0.0000" "43.6991" " 5.3508" " 0.0000"
PU/PA PU/PI PU/RO PU/RR PU/TO PRU/RJ PSU/CPS PSU/NRI
" 0.4707" " 0.0000" "40.6327" "10.3247" " 0.0000" "10.9644" " 0.0000" "55.4122"
I'd like to get rid of those quotes so the data looks better on the console.
The format function returns character vectors with the numbers now "formatted" per your specifications. If you convert the result back to numbers then any formatting is lost.
I think your problem may rather be the difference between creating the formatted output and printing it. When you create an object but don't do anything with it, the object is automatically printed using default arguments; one of those defaults is to put quotes around character strings. If you don't want the quotes then just call print yourself and tell it not to include them:
> example <- sample(LETTERS, replace = T)
> print(format(table(example), scientific = T), quote=FALSE)
example
B E F G H J K L Q S T U W X Z
1 1 1 2 1 2 3 1 1 1 3 1 1 5 2
If your main goal is to not use scientific notation then you should look at the zapsmall function which will turn extremely small values (often the culprit in switching to scientific notation) into zeros. Or do options(scipen=5) (or some other value than 5) which will reduce the likelihood of switching to scientific notation in subsequent printing.
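A quick sketch of both suggestions:
x <- c(1e-12, 10.4077, 0.228)
zapsmall(x)  # the tiny value becomes an exact 0, so printing stays out of scientific notation
# [1]  0.0000 10.4077  0.2280
options(scipen = 5)  # alternatively, penalize scientific notation globally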
format returns character vectors, so this is in general to be expected. Your problem, I think, results from the fact that the format representation of an integer vector c(1, 2, 3) is "1" "2" "3", whereas with scientific=TRUE the representation of a numeric (real) vector c(1, 2, 3) is "1e+00" "2e+00" "3e+00".
Consider the following:
format( 1:10 )
format( 1:10, scientific= TRUE )
format( as.numeric( 1:10 ), scientific= TRUE )
Therefore, try
format( as.numeric( table( example ) ), scientific= TRUE )
Unfortunately, you cannot omit the as.numeric, since table generates integer values, and you need real.
From your comments on @January's answer, it appears you're just looking for c(table(example)).
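For the record, c() drops the "table" class, leaving a plain named integer vector that prints without quotes:
example <- sample(LETTERS, replace = TRUE)
c(table(example))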
Here's a solution I've found:
View(format(xtabs(...), scientific = F, digits = 1))
The table will appear in another window instead of inside the console, like I originally wanted, but it solves my problem of quickly showing pretty data without resorting to long commands. More elegant solutions are welcome!
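One more option along the same lines (my addition, not from the original thread): noquote() wraps the formatted character output so it prints in the console without quotes:
noquote(format(table(example), scientific = FALSE, digits = 1))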
