R read_xlsx Adds Trailing Digit to Character

I am reading an Excel file into R using the read_xlsx function from the readxl package. Some of the columns could be "numerics" in Excel, but I convert everything to a character as I read things in. This solves a lot of downstream problems for me because really none of the data from Excel is actually numeric in practice. Things that look like numerics are really identification numbers of some sort.
Here is my issue. I am trying to read in the following data:
You can see that the first column is a numeric in Excel. When I read this in, I get:
library(readxl)
xl <- read_xlsx("C:/test/test.xlsx", col_types = c("text"))
xl
#> # A tibble: 1 x 3
#> some_id_number some_name some_other_name
#> <chr> <chr> <chr>
#> 1 310.16000000000003 name name_Descriptions
Where is that trailing 3 coming from? I have tried to adjust the digits option per this question without any luck.
Any thoughts?

Related

R tibble with comma separated fields - read/write_csv() incorrectly parses data as double

I hope the title makes sense. I will explain a bit here.
I am working with data that comes from a network performance monitoring tool running synthetic transactions (mimicking user activity by making timed and measurable transactions, allowing for performance analysis and problem detection). Several of the output fields capture different values, such as Header Read Times and TLS Times, for multiple transactions in a single test. These fields hold the data separated by commas. When the data is first retrieved from the API and converted from JSON to a tibble, these fields are correctly parsed as:
metrics.HeaderReadTimes
"120,186,191,184,186,182,190,186,192"
"232,310,282,289,354,292,292,293,306"
...
I have verified also that these fields are typed as character when they are imported from the API and stored in the tibble. I even checked this during debug just before write_csv() gets called.
However, when I write this data to CSV for storage and then read it back in later, the output of read_csv() has these fields as if they were re-typed as double:
metrics.HeaderReadTimes
"1.34202209222205e+26"
"4.17947405424481e+26"
...
I used mutate() with as.character() to retype these fields on read, but that doesn't fix the issue; it just gives me a double that has been coerced into a character.
I'm beginning to think that the best solution is to change the delimiter in those fields before I call write_csv(), but I'm unsure how to do this in an efficient manner. It's probably something stupidly obvious, and I'm going to keep researching, but I figured it wouldn't hurt to ask...

CSV files do not store any information about column types, which is why you'd want to specify the column type in readr (or alternatively save the data as .RData or .RDS).
read_csv("filename.csv",
col_types = cols(metrics.HeaderReadTimes = col_character()))
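The .RDS alternative mentioned above can be sketched as follows. This is a minimal illustration that assumes a small made-up data frame (metrics) in place of the real API result; saveRDS() and readRDS() are base R and keep column types intact across the round trip.

```r
# Hypothetical stand-in for the tibble built from the API response
metrics <- data.frame(
  metrics.HeaderReadTimes = c("120,186,191", "232,310,282"),
  stringsAsFactors = FALSE
)

# Write to a temporary .rds file and read it back
path <- tempfile(fileext = ".rds")
saveRDS(metrics, path)
restored <- readRDS(path)

# The comma-separated field is still a character column
str(restored)
```

Unlike CSV, the .rds file is only readable from R, so this fits best when the stored data is not shared with other tools.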

An alternative that is a little more agnostic to column names.
The issue is that the locale is inferring the commas as "grouping marks", i.e., thousands indicators. We can change that with readr::locale.
Failing:
readr::read_csv(I('a,b\n"1,2",3'))
# Rows: 1 Columns: 2
# -- Column specification -----------------------------------------------------------------------------------------------------------
# Delimiter: ","
# dbl (1): b
# i Use `spec()` to retrieve the full column specification for this data.
# i Specify the column types or set `show_col_types = FALSE` to quiet this message.
# # A tibble: 1 x 2
# a b
# <dbl> <dbl>
# 1 12 3
Working as intended:
readr::read_csv(I('a,b\n"1,2",3'), locale = readr::locale(grouping_mark = ""))
# Rows: 1 Columns: 2
# -- Column specification -----------------------------------------------------------------------------------------------------------
# Delimiter: ","
# chr (1): a
# dbl (1): b
# i Use `spec()` to retrieve the full column specification for this data.
# i Specify the column types or set `show_col_types = FALSE` to quiet this message.
# # A tibble: 1 x 2
# a b
# <chr> <dbl>
# 1 1,2 3

Readxl and openxlsx add extra characters to numbers from an excel file

I have some numbers in an excel file that I want to read into R as characters. When I import the file either using readxl or openxlsx, the imported data have two extra characters, which are not in the excel file. The excel sheet looks like this:
The example file is here
I have tried changing the format within the Excel file but this messes up the numbers. My current work-around is to concatenate the number with ' in a separate column in excel and then read that column into R. This works for some reason.
library(readxl)
boo <- read_excel("./boo.xlsx",
col_types = c("text"))
boo
Reading the Excel file gives the following (note the last two characters in the Example numbers column). The concatNum column shows the concatenated version.
# A tibble: 6 x 2
`Example numbers` concatNum
<chr> <chr>
1 985.12002779568002 '985.12002779568
2 985.12002826159505 '985.120028261595
3 985.12002780627301 '985.120027806273
4 985.12002780627301 '985.120027806273
5 985.12002780724401 '985.120027807244
6 985.12002780291402 '985.120027802914
Any reasons why this would be happening? Does anyone have a better way of fixing it than my current work-around?
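One possible R-side alternative to the Excel work-around, offered as a sketch rather than a definitive fix: it assumes the extra digits are the 17-significant-digit text form of the stored number, and that 15 significant digits are enough for these values. The strings below are copied from the output above; sprintf() is base R, and the names raw and trimmed are only for illustration.

```r
# Values as they arrive when the column is read as text
raw <- c("985.12002779568002", "985.12002826159505", "985.12002780627301")

# Convert to numeric, then print with at most 15 significant digits;
# the %g format drops trailing zeros
trimmed <- sprintf("%.15g", as.numeric(raw))
trimmed
# "985.12002779568" "985.120028261595" "985.120027806273"
```

The result matches the concatNum column without the leading apostrophe. Identifiers with more than 15 significant digits would lose information under this approach.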

How to use regexpr to identify patterns in ICD-10 data

I am working with ICD-10 data, and I want to create a new variable called complications based on the pattern "E1X.9X" using a regular expression, but I keep getting an error. Please help.
dm_2$icd9_9code<- (E10.49, E11.51, E13.52, E13.9, E10.9, E11.21, E16.0)
dm_2$DM.complications<- "present"
dm_2$DM.complications[regexpr("^E\\d{2}.9$",dm_2$icd9_code)]<- "None"
# Error in dm_2$DM.complications[regexpr("^E\\d{2}.9", dm_2$icd9_code)] <-
# "None" : only 0's may be mixed with negative subscripts
I want
icd9_9code complications
E10.49 present
E11.51 present
E13.52 present
E13.9 none
E10.9 none
E11.21 present

This problem has already been solved. The 'icd' R package, which my co-authors and I have been maintaining for five years, can do this. In particular, it uses standardized sets of comorbidities, including the diabetes with complications you seek, from AHRQ, the original Elixhauser, Charlson, etc.
E.g., for ICD-10 AHRQ, you can see the codes for diabetes with complications here. From icd 4.0, these include ICD-10 codes from the WHO, and all years of ICD-10-CM.
icd::icd10_map_ahrq$DMcx
To use them, first just take your patient data frame and try:
library(icd)
pts <- data.frame(visit_id = c("encounter-1", "encounter-2", "encounter-3",
"encounter-4", "encounter-5", "encounter-6"), icd10 = c("I70401",
"E16", "I70.449", "E13.52", "I70.6", "E11.51"))
comorbid_ahrq(pts)
# and for diabetes with complications only:
comorbid_ahrq(pts)[, "DMcx"]
Or, you can get a data frame instead of a matrix this way:
comorbid_ahrq(pts, return_df = TRUE)
# then you can do:
comorbid_ahrq(pts, return_df = TRUE)$DMcx
If you give an example of the source data and your goal, I can help more.

Seems like there are a few errors in your code, I'll note them in the code below:
You'll want to start with wrapping your ICD codes with quotes: "E13.9"
dm_2 <- data.frame(icd9_9code = c("E10.49", "E11.51", "E13.52", "E13.9", "E10.9", "E11.21", "E16.0"))
Next let's use grepl() to search for the particular ICD pattern. Make sure you're applying it to the proper column, your code above is attempting to use dm_2$icd9_code and not dm_2$icd9_9code:
dm_2$DM.complications <- "present"
dm_2$DM.complications[grepl("^E\\d{2}\\.9$", dm_2$icd9_9code)] <- "None"  # \\. matches a literal dot
Finally,
dm_2
#> icd9_9code DM.complications
#> 1 E10.49 present
#> 2 E11.51 present
#> 3 E13.52 present
#> 4 E13.9 None
#> 5 E10.9 None
#> 6 E11.21 present
#> 7 E16.0 present
A quick side note -- there is a wonderful ICD package you may find handy as well: https://cran.r-project.org/web/packages/icd/index.html
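For context on why the original attempt failed, here is a small sketch using two codes from the question (the names codes, pos, hit and status are only for illustration, and the dot is escaped so that it matches a literal period). regexpr() returns a match position per element, with -1 for no match, so using its result as an index mixes negative and positive subscripts; grepl() returns a logical vector that is safe for subsetting.

```r
codes <- c("E10.49", "E13.9")

# regexpr() gives positions: -1 where there is no match, 1 where the match starts
pos <- regexpr("^E\\d{2}\\.9$", codes)
as.integer(pos)
# -1  1

# grepl() gives TRUE/FALSE per element, which can index a vector directly
hit <- grepl("^E\\d{2}\\.9$", codes)
hit
# FALSE  TRUE

# Label each code based on the logical match
status <- ifelse(hit, "None", "present")
status
# "present" "None"
```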

R: Two Identically Structured Excel Files Return Different Data Types in Data Frames

I have two different Excel files, excel1 and excel2.
I am reading them in using separate but identical functions:
df1<- readxl::read_xlsx("excel1.xlsx", sheet= "Ad Awareness", skip= 7)
df2<- readxl::read_xlsx("excel2.xlsx", sheet= "Ad Awareness", skip= 7)
However, when I run head() on each, here is what df1 returns:
calDate Score
<dttm> <dbl>
1 2016-10-17 00:00:00 17.8
2 2016-10-18 00:00:00 17.2
3 2016-10-19 00:00:00 20.3
And here is what df2 returns:
calDate Score
<dbl> <lgl>
1 43025 NA
2 43026 NA
3 43027 NA
Is there any reason why the data types are being read in differently? There is nothing different about the files.

read_xlsx() will guess the variable types based on your data (see here for more information).
So what you are describing could be due to:
different amount of data in your different files (not enough data in one of them to get to a correct guess)
changes you might have made in Excel to the cell format (those changes are not always visually obvious in Excel)
Without seeing your data, it is hard to say more than this.
But you can control this with the col_types argument:
col_types: Either ‘NULL’ to guess all from the spreadsheet or a
character vector containing one entry per column from these
options: "skip", "guess", "logical", "numeric", "date",
"text" or "list". If exactly one ‘col_type’ is specified, it
will be recycled. The content of a cell in a skipped column
is never read and that column will not appear in the data
frame output. A list cell loads a column as a list of length
1 vectors, which are typed using the type guessing logic from
‘col_types = NULL’, but on a cell-by-cell basis.
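As a sketch of how that argument might be applied to the files in the question: the two-entry vector below assumes the sheet has exactly two columns (calDate and Score), which the question does not confirm; readxl expects either one entry per column or a single entry to recycle. The call is guarded so that it only runs if readxl is installed and the file exists.

```r
# Assumed layout: column 1 is a date, column 2 is numeric
ad_col_types <- c("date", "numeric")

if (requireNamespace("readxl", quietly = TRUE) && file.exists("excel2.xlsx")) {
  # Same call as in the question, but with the types stated explicitly
  df2 <- readxl::read_xlsx("excel2.xlsx", sheet = "Ad Awareness",
                           skip = 7, col_types = ad_col_types)
  str(df2)
}
```

If the Score cells in the second file are genuinely empty or stored as text, forcing "numeric" will still produce NA values (with warnings), which at least makes the underlying difference between the files visible.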

Creating a vector from a file in R

I am new to R and my question should be trivial. I need to create a word cloud from a txt file containing the words and their occurrence counts. For that purpose I am using the snippets package.
As can be seen at the bottom of the link, I first have to create a vector (is it right that words is a vector?) like below.
> words <- c(apple=10, pie=14, orange=5, fruit=4)
My problem is to do the same thing but create the vector from a file which would contain words and their occurrence number. I would be very happy if you could give me some hints.
Moreover, to understand the format of the file to be inserted I write the vector words to a file.
> write(words, file="words.txt")
However, the file words.txt contains only the values but not the names(apple, pie etc.).
$ cat words.txt
10 14 5 4
Thanks.

words is a named vector, the distinction is important in the context of the cloud() function if I read the help correctly.
Write the data out correctly to a file:
write.table(words, file = "words.txt")
Create your word occurrence file like the txt file created. When you read it back in to R, you need to do a little manipulation:
> newWords <- read.table("words.txt", header = TRUE)
> newWords
x
apple 10
pie 14
orange 5
fruit 4
> words <- newWords[,1]
> names(words) <- rownames(newWords)
> words
apple pie orange fruit
10 14 5 4
What we are doing here is reading the file into newWords, then subsetting it to take the one and only column (variable), which we store in words. The last step is to take the row names from the file we read in and apply them as the "names" on the words vector. We do this last step using the names() function.

Yes, 'vector' is the proper term.
EDIT:
A better method than write.table would be to use save() and load():
save(words, file="svwrd.rda")
load(file="svwrd.rda")
The save/load combo preserved all the structure rather than doing coercion. The write.table followed by names()<- is kind of a hassle as you can see in both Gavin's answer here and my answer on rhelp.
Initial answer:
I suggest using as.data.frame() to coerce to a data frame and then write.table() to write to a file.
write.table(as.data.frame(words), file="savew.txt")
saved <- read.table(file="savew.txt")
saved
words
apple 10
pie 14
orange 5
fruit 4
