I'm trying to read a CSV file that looks like this (let's call it test1.csv):
test_1;test_2;test_3;test_4
Test with Ö Ä;20;10,45;15,34
As you can see, the values are separated by ; and not by , because , is the decimal separator. I've added "Ö" and "Ä" because my data contains German letters, which requires encoding = "ISO-8859-1" in the locale() of read_delim(). This isn't essential to the problem; it just explains why I want to use read_delim().
Now I would read all this using read_delim():
read_delim("test1.csv", delim = ";", locale = locale(encoding = 'ISO-8859-1',
decimal_mark = ","))
Giving me this:
# A tibble: 1 x 4
  test_1          test_2 test_3 test_4
  <chr>            <dbl>  <dbl>  <dbl>
1 "Test with Ö Ä"     20   10.4   15.3
And indeed, I can get the 10.45 value out by using pull(test_3):
[1] 10.45
But if I simply add five 0s to the 10.45, making it 1000000.45, like so (let's call this test2.csv):
test_1;test_2;test_3;test_4
Test with Ö Ä;20;1000000,45;15,34
And then repeat everything, I completely lose the .45 behind the 1000000:
read_delim("test2.csv", delim = ";",
           locale = locale(encoding = "ISO-8859-1", decimal_mark = ",")) %>%
  pull(test_3)
Rows: 1 Columns: 4
── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ";"
chr (1): test_1
dbl (3): test_2, test_3, test_4
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
[1] 1000000
I must be able to retain this information, no? Or control this behaviour? Is this a bug?
This is a printing issue.
If you add %>% print(digits = 22) to the end of your workflow, you get:
[1] 1000000.449999999953434
This is not exactly 1000000.45, because what's shown is the closest approximation available in the standard double-precision floating-point system.
The default getOption("digits") value is 7; you can set this however you like with options(digits = <your_choice>). In this case anything between digits = 10 and digits = 17 will get you a printed result of "1000000.45"; digits = 18 starts to reveal the underlying approximation.
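As a quick check that nothing was actually lost in parsing, a minimal sketch (assuming the test2.csv from above and the readr/dplyr workflow already shown):

library(readr)
library(dplyr)

x <- read_delim("test2.csv", delim = ";",
                locale = locale(encoding = "ISO-8859-1", decimal_mark = ",")) %>%
  pull(test_3)

print(x, digits = 12)  # [1] 1000000.45
x == 1000000.45        # [1] TRUE: the parsed value equals the literal
options(digits = 12)   # or raise the default print precision for the session
x                      # [1] 1000000.45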
I've got the following .txt structure:
test <- "A n/a:
4001
Exam date:
2020-01-01 15:38
Pos (deg):
18.19
18.37"
I'd like to read this into a list, where each list element is given the name of the row ending with a colon, and the values are given by the following rows (see the expected output below).
Challenges
The number of rows (the length of each list element) can differ. There can be special characters (e.g., "A n/a"), and there is the date-time value, which contains a pesky colon.
My problem
My current solution (see below) is unsafe, because I cannot be sure that I have a full list of all expected elements: the file might contain unexpected elements which I would then not capture, or worse, which would mess up the entire data.
What I tried
I tried reading the txt to JSON with jsonlite::fromJSON, because the structure somehow resembled it, but this gave an error about an unexpected character.
I tried to read it into a single string and split it, but this leaves me, again, with all values in a single list element:
readr::read_file(test)
strsplit(test, split = ":\n")
My current approach is to read this in with read.csv2 and generate a lookup on the (expected) row names, create a vector for splitting and using the first element of the resulting list for naming.
myfile <- read.csv2(text = test, header = FALSE)
lu <- paste(c("A n", "date", "Pos"), collapse = "|")
ls_file <- split(myfile$V1, cumsum(grepl(lu, myfile$V1, ignore.case = TRUE)))
names(ls_file) <- unlist(lapply(ls_file, function(x) x[1]))
ls_file <- lapply(ls_file, function(x) x[-1])
## expected output is a named list
## The spaces and backticks below do not really bother me,
## but I would get rid of them in a next step.
ls_file
#> $`A n/a:`
#> [1] " 4001"
#>
#> $`Exam date:`
#> [1] " 2020-01-01 15:38"
#>
#> $`Pos (deg):`
#> [1] "18.19" "18.37"
Assuming the name of each element ends with :, we can:
res <- readLines(textConnection(test))
res <- split(res, cumsum(endsWith(res, ':')))
res <- setNames(lapply(res, `[`, -1), sapply(res, `[`, 1))
# > res
# $`A n/a:`
# [1] " 4001"
#
# $`Exam date:`
# [1] " 2020-01-01 15:38"
#
# $`Pos (deg):`
# [1] "18.19" "18.37"
I have a txt file with this data in it:
1 message («random_choice»)[5];
2 reply («принято»)[2][3];
3 regulate («random_choice»)[5];
4 Early reg («for instance»)[2][3][4];
4xx: Success (загрузка):
6 OK («fine»)[2][3];
I want to turn it into a data frame consisting of three columns: ID, Message, Comment.
I also want to remove the unnecessary numbers in square brackets at the end.
Also, some values in the ID column contain strings (usually xx); in these cases the column must just be empty.
So, the desired result must look like this:
ID Message   Comment
1  message   random_choice
2  reply     принято
3  regulate  random_choice
4  Early reg for instance
   Success   загрузка
6  OK        fine
How could I do that? Even when I try to read this txt file I get a strange error:
df <- read.table("data_received.txt", header = TRUE)
The error I get:
Error in read.table("data_received.txt", header = TRUE) :
more columns than column names
You can use strcapture for this.
Fake data; you'll likely do txt <- readLines("data_received.txt"). (Since my locale on Windows is not friendly to those strings, I'll replace them with straight ASCII, assuming it'll work just fine on your system.)
txt <- readLines(textConnection("1 message («random_choice»)[5];
2 reply («asdf»)[2][3];
3 regulate («random_choice»)[5];
4 Early reg («for instance»)[2][3][4];
4xx: Success (something):
6 OK («fine»)[2][3];"))
The breakout:
out <- strcapture("^(\\S+)\\s+([^(]+)\\s+\\((.*)\\).*$", txt,
proto = data.frame(ID=0L, Message="", Comment=""))
# Warning in fun(mat[, i]) : NAs introduced by coercion
out
# ID Message Comment
# 1 1 message «random_choice»
# 2 2 reply «asdf»
# 3 3 regulate «random_choice»
# 4 4 Early reg «for instance»
# 5 NA Success something
# 6 6 OK «fine»
The proto= argument indicates what type of columns are generated. Since I set ID=0L, it assumes the column is integer, so anything that does not convert to an integer becomes NA (which satisfies your fifth-row requirement of an empty ID).
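As a side note (a variation of mine, not part of the original breakout): a character proto keeps the "4xx:" token as text instead of coercing it to NA, which shows how proto= drives the column types:

out_chr <- strcapture("^(\\S+)\\s+([^(]+)\\s+\\((.*)\\).*$", txt,
                      proto = data.frame(ID = "", Message = "", Comment = ""))
# out_chr$ID is now c("1", "2", "3", "4", "4xx:", "6")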
Explanation of the regex:
in general:
* means zero-or-more of the previous character (or character class)
+ means one-or-more
? (not used, but useful nonetheless) means zero or one
^ and $ mean the beginning and end of the string, respectively (a ^ within [..] is different)
(...) is a capture group: anything within the non-escaped parens is stored, anything not is discarded
[...] is a character group, any of the characters is a match; if this is instead [^..], then it is inverted: anything except what is listed
[[:...:]] is a POSIX character class (e.g., [[:space:]]), used inside a character group
^(\\S+), start with (^) one or more (+) non-space characters (\\S);
\\s+ one or more space characters (\\s) (discarded);
([^(]+) one or more characters that are not a left-paren;
\\((.*)\\).*$ a literal left-paren (\\() and then zero or more of anything (.*) captured up to the last literal right-paren (\\)), followed by whatever remains (.*) to the end of the string ($).
It should be noted that \\s and \\S are non-POSIX regex shorthands, where it is generally suggested to use [^[:space:]] for \\S (non-space) and [[:space:]] for \\s. Those are equivalent here, but I went with code-golf initially. With this replacement, it looks like:
out <- strcapture("^([^[:space:]]+)[[:space:]]+([^(]+)[[:space:]]+\\((.*)\\).*$", txt,
proto = data.frame(ID=0L, Message="", Comment=""))
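If you also want the Comment column without the guillemets, as in the desired output, one extra step (my addition, not part of the breakout above) is:

out$Comment <- gsub("[«»]", "", out$Comment)  # drop the « » wrappers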
We can use {unglue}. Here you have two patterns: one contains "«" and an ID, the other doesn't. {unglue} will use the first pattern that matches. Any {foo} or {} expression matches the regex ".*?", and a data.frame is built from the names placed between the braces.
txt <- c(
"1 message («random_choice»)[5];", "2 reply («asdf»)[2][3];",
"3 regulate («random_choice»)[5];", "4 Early reg («for instance»)[2][3][4];",
"4xx: Success (something):", "6 OK («fine»)[2][3];")
library(unglue)
patterns <-
c("{id} {Message} («{Comment}»){}",
"{} {Message} ({Comment}){}")
unglue_data(txt, patterns)
#> id Message Comment
#> 1 1 message random_choice
#> 2 2 reply asdf
#> 3 3 regulate random_choice
#> 4 4 Early reg for instance
#> 5 <NA> Success something
#> 6 6 OK fine
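If you'd rather have id as a number, unglue_data() also takes a convert argument (assuming a reasonably current {unglue} version), which runs utils::type.convert() on the resulting columns:

unglue_data(txt, patterns, convert = TRUE)  # id becomes integer, non-matches stay NA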
parse_number from readr fails if the character string contains a .
It works well with special characters.
library(readr)
#works
parse_number("%ç*%&23")
#does not work
parse_number("art. 23")
Warning: 1 parsing failure.
row col expected actual
1 -- a number .
[1] NA
attr(,"problems")
# A tibble: 1 x 4
row col expected actual
<int> <int> <chr> <chr>
1 1 NA a number .
Why is this happening?
Update:
The expected result would be 23.
There is a space after the dot, which is causing the failure. What is the expected number from this sequence (0.23 or 23)?
parse_number looks for decimal and grouping separators as defined by your locale; see the documentation: https://www.rdocumentation.org/packages/readr/versions/1.3.1/topics/parse_number
You can opt to change the locale using the following (grouping_mark is a dot with a space):
parse_number("art. 23", locale=locale(grouping_mark=". ", decimal_mark=","))
Output: 23
or remove the space in front:
parse_number(gsub(" ", "" , "art. 23"))
Output: 0.23
Edit: to handle dots both as abbreviations and as decimal points, use the following:
library(stringr)
> as.numeric(str_extract("art. 23", "\\d+\\.*\\d*"))
[1] 23
> as.numeric(str_extract("%ç*%&23", "\\d+\\.*\\d*"))
[1] 23
The above uses regular expressions to identify number patterns within strings.
\\d+ matches one or more digits
\\.* matches zero or more dots (the optional decimal point)
\\d* matches any remaining digits
Note: I am no expert on regex but there are plenty of other resources that will make you one
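For reuse, the same extraction can be wrapped in a tiny helper (extract_number is a hypothetical name, just for illustration):

library(stringr)

# First number-like token in each string; a dot followed by a space is
# treated as an abbreviation, not a decimal mark.
extract_number <- function(x) as.numeric(str_extract(x, "\\d+\\.*\\d*"))

extract_number(c("art. 23", "%ç*%&23"))
# [1] 23 23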
This is my sample dataset:
Name <- c("apple firm","苹果 firm","Ãpple firm")
Rank <- c(1,2,3)
data <- data.frame(Name,Rank)
I would like to delete rows whose Name contains non-English characters. For this sample, only "apple firm" should stay.
I tried to use the tm package, but it can only help me delete the non-English characters, not the whole entries.
I would check out this related Stack Overflow post for doing the same thing in JavaScript: Regular expression to match non-English characters?
To translate this into R, you could do (to match non-ASCII):
res <- data[which(!grepl("[^\x01-\x7F]+", data$Name)),]
res
#         Name Rank
# 1 apple firm    1
And the same match written with Unicode escapes, per that same SO post:
res <- data[which(!grepl("[^\u0001-\u007F]+", data$Name)),]
res
#         Name Rank
# 1 apple firm    1
Note: we had to take out the NUL character for this to work, so instead of starting at \u0000 or \x00 we start at \u0001 and \x01.
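For reference, applying the test to the sample Name vector gives:

grepl("[^\x01-\x7F]+", Name)
# [1] FALSE  TRUE  TRUE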
The stringi package has the convenience function stri_enc_isascii:
library(stringi)
stri_enc_isascii(data$Name)
# [1] TRUE FALSE FALSE
As the name suggests, the function "checks whether all bytes in a string are in the [ASCII] set {1, 2, ..., 127}" (from ?stri_enc_isascii).
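Applied to the question's data frame, that logical vector can subset the rows directly:

data[stri_enc_isascii(data$Name), ]
#         Name Rank
# 1 apple firm    1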
An alternative to regex would be to use iconv and then filter out the NA entries:
library(dplyr)
data <- data %>%
mutate(Name = iconv(Name, from = "latin1", to = "ASCII")) %>%
filter(!is.na(Name))
What happens in the mutate statement is that the strings are converted from latin1 to ASCII. Here's a list of the characters covered by latin1, aka ISO 8859-1. When a string contains a character with no ASCII equivalent, it cannot be converted and becomes NA.
I have a data frame that looks like this (sorry, I can't replicate the actual data frame with code, as the double quotes don't show up; Vx are variables):
V1, V2, V3, V4
home, 15, "grand", terminal
"give", 32, "cuz", good
"miles", 5, "before", ten
yes, 45, "sorry", fine
Question: how might I fix the double-quote issue for my entire data frame, which I've imported using the read.csv function, so that all the double quotes are removed?
What I'm looking for is the Excel or Word equivalent of FIND + REPLACE: find the double quote, and replace it with nothing.
Notes:
1) I've confirmed it's a data frame by running the is.data.frame() function
2) The actual data frame has hundreds of columns, so going through each one and declaring the type of column it is isn't feasible
3) I tried using the following, and it didn't work: as.data.frame(sapply(my_data, function(x) gsub("\"", "", x)))
4) I confirmed that this isn't a simple print issue by testing with SQL on the data frame. It won't find values in double quotes unless I use LIKE instead of =
Thanks in advance!
7/7/15 EDIT 01: as requested by @alexforrence, here is the dput output for a couple of columns:
billing_first_name billing_last_name billing_company
3 NA
4 Peldi Guilizzoni NA
5 NA
6 "James Andrew" Angus NA
7 NA
8 Nova Spivack NA
Here is a solution using dplyr and stringr. Note that purely numerical columns will be character columns afterwards. It's not clear to me from your description whether there are purely numerical columns; if there are, you'd probably want to treat them separately, or convert them back into numbers afterwards.
require(dplyr)
require(stringr)
df <- data.frame(V1 = c("home", "\"give\"", "\"miles\"", "yes"),
                 V2 = c(15, 32, 5, 45),
                 V3 = c("\"grand\"", "\"cuz\"", "\"before\"", "\"sorry\""),
                 V4 = c("terminal", "good", "ten", "fine"))
df
## V1 V2 V3 V4
## 1 home 15 "grand" terminal
## 2 "give" 32 "cuz" good
## 3 "miles" 5 "before" ten
## 4 yes 45 "sorry" fine
df %>% mutate_each(funs(str_replace_all(., "\"", "")))
## V1 V2 V3 V4
## 1 home 15 grand terminal
## 2 give 32 cuz good
## 3 miles 5 before ten
## 4 yes 45 sorry fine
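Note that mutate_each() and funs() have since been deprecated in dplyr; with dplyr >= 1.0 (an updated equivalent, not part of the original answer) the same transformation is:

df %>% mutate(across(everything(), ~ str_replace_all(.x, "\"", "")))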
You can identify the double quotes using nchar().
a <- ""
nchar(a)==0
[1] TRUE
In addition to the above, I ran into a very strange problem. Using the tips above, I wrote this very short program:
setClass("char.with.deleted.quotes")
setAs("character", "char.with.deleted.quotes",
function(from) as.character(gsub('„',"xxx", as.character(from), fixed = TRUE)))
TMP = read.csv2("./test.csv", header=TRUE, sep=";", dec=",",
colClasses = c("character","char.with.deleted.quotes"))
temp <- gsub('„', "xxx", TMP$Name, fixed=TRUE)
print(temp)
with the output:
> source('test.R')
[1] "This is some „Test" "And another „Test"
[1] " "
Number Name
1 X-23 This is some „Test
2 K-33.01 And another „Test
which reads this dummy csv:
Number;Name
X-23;This is some „Test
K-33.01;And another „Test
My goal is to get rid of this low double quote („) before the word Test. However, this does not work so far, and it fails precisely because of this double-quote character.
If I instead choose to replace a different part of the string, it does work, either with read.csv2 and the class definition above, or directly with gsub, saving the result into the temp variable.
Now what is really strange is the following: after running the program, I copied the two lines temp <- gsub(...) and print(temp) manually into the command line:
> source('test.R')
[1] "This is some „Test" "And another „Test"
[1] "This is some „Test" "And another „Test"
[1] " "
Number Name
1 X-23 This is some „Test
2 K-33.01 And another „Test
>
> temp <- gsub('„', "xxx", TMP$Name, fixed=TRUE)
> print(temp)
[1] "This is some xxxTest" "And another xxxTest"
This, for whatever reason, works, and it also works if I modify the data frame directly:
> TMP$Name <- gsub('„', "xxx", TMP$Name, fixed=TRUE)
> print(TMP)
Number Name
1 X-23 This is some xxxTest
2 K-33.01 And another xxxTest
But if I put this command back into the program and run it again, it does not work, and I really have no idea why.