R reading csv file with different separators - r

I want to read a csv file where the first line is written like this:
GeoFIPS,GeoName,Region,TableName,LineCode,IndustryClassification,Description,Unit,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019
The following lines are written like this (just one line of 500):
"01000","Alabama",5,SAGDP1,2,"...","Chain-type quantity indexes for real GDP","Quantity index",77.435,80.198,83.178,84.532,84.232,86.440,88.581,94.298,97.490,99.348,99.971,99.317,95.894,98.103,99.525,100.000,101.212,100.544,101.541,102.664,103.827,106.164,107.652
How can I read this file so that all the values are separated correctly?
With the command gdp_data <- read.csv("GDP.csv", sep = ",") only the headers get separated correctly; the text of all further lines is put into the first column.
Thank you very much for your answer.

Your problem appears to be that some of your columns are unnamed, rather than the "different separators". (I don't think there are any different separators.)
Assuming you know in advance how many columns there are, then something like this should work.
library(dplyr)

readr::read_csv(
  "<your file name>",
  # provide custom column names
  col_names = c("GeoFIPS", "GeoName", "Region", "TableName", paste0("X", 5:7))
) %>%
  # remove the first row, which contains the original "column names"
  filter(row_number() > 1)
# A tibble: 1 x 7
  GeoFIPS GeoName       Region TableName    X5 X6    X7
  <chr>   <chr>         <chr>  <chr>     <dbl> <chr> <chr>
1 00000   United States dummy  SAGDP1        1 ...   Real GDP (millions of chained 2012 dollars)
You could, of course, provide more meaningful names for the unnamed columns.
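Applied to the file from the question, a sketch (column names taken from the header row shown above, with skip = 1 standing in for the filter() step) might look like:

library(readr)

gdp_data <- read_csv(
  "GDP.csv",
  col_names = c("GeoFIPS", "GeoName", "Region", "TableName", "LineCode",
                "IndustryClassification", "Description", "Unit",
                as.character(1997:2019)),
  # col_names replaces the file's own header row, so skip it
  skip = 1
)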

Related

R tibble with comma separated fields - read/write_csv() incorrectly parses data as double

I hope the title makes sense. I will explain a bit here.
I am working with data that comes from a network performance monitoring tool running synthetic transactions (mimicking user activity by making timed and measurable transactions, allowing for performance analysis and problem detection). Several of the output fields capture different values like Header Read Times, TLS Times, etc. for multiple transactions in a single test. These fields have the data separated by commas. When the data is first retrieved from the API and converted from JSON to a tibble, these fields are correctly parsed as:
metrics.HeaderReadTimes
"120,186,191,184,186,182,190,186,192"
"232,310,282,289,354,292,292,293,306"
...
I have also verified that these fields are typed as character when they are imported from the API and stored in the tibble. I even checked this during debugging just before write_csv() gets called.
However, when I write this data to CSV for storage and then read it back in later, the output of read_csv() has these fields as if they were re-typed as double:
metrics.HeaderReadTimes
"1.34202209222205e+26"
"4.17947405424481e+26"
...
I used mutate() to retype these fields with as.character() on read, but that doesn't seem to fix the issue; it just gives me a double that has been coerced into a character.
I'm beginning to think that the best solution is to change the delimiter in those fields before I call write_csv(), but I'm unsure how to do this in an efficient manner. It's probably something stupidly obvious, and I'm going to keep researching, but I figured it wouldn't hurt to ask...
CSV files do not store any information about the column type, which is why you'd want to specify the column types in readr (or, alternatively, save the data as .RData or .RDS).
read_csv("filename.csv",
col_types = cols(metrics.HeaderReadTimes = col_character()))
An alternative that is a little more agnostic to column names.
The issue is that the locale is inferring the commas as "grouping marks", i.e., thousands indicators. We can change that with readr::locale.
Failing:
readr::read_csv(I('a,b\n"1,2",3'))
# Rows: 1 Columns: 2
# -- Column specification -----------------------------------------------------------------------------------------------------------
# Delimiter: ","
# dbl (1): b
# i Use `spec()` to retrieve the full column specification for this data.
# i Specify the column types or set `show_col_types = FALSE` to quiet this message.
# # A tibble: 1 x 2
#       a     b
#   <dbl> <dbl>
# 1    12     3
Working as intended:
readr::read_csv(I('a,b\n"1,2",3'), locale = readr::locale(grouping_mark = ""))
# Rows: 1 Columns: 2
# -- Column specification -----------------------------------------------------------------------------------------------------------
# Delimiter: ","
# chr (1): a
# dbl (1): b
# i Use `spec()` to retrieve the full column specification for this data.
# i Specify the column types or set `show_col_types = FALSE` to quiet this message.
# # A tibble: 1 x 2
#   a         b
#   <chr> <dbl>
# 1 1,2       3
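If you would rather take the delimiter-swapping route from the question, here is a minimal sketch (assuming df stands for the tibble about to be written; the semicolon is an arbitrary replacement character):

library(dplyr)
library(readr)

df %>%
  # swap the embedded commas so no locale can mistake them for grouping marks
  mutate(metrics.HeaderReadTimes = gsub(",", ";", metrics.HeaderReadTimes)) %>%
  write_csv("filename.csv")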

How do I create a data table in code in R

I have a data table as a CSV file that I use to create metrics for a dashboard. The data table includes Metric IDs and associates these with field names. This table--this definition of metrics--is largely static, and I'd like to include it within R code rather than, for example, importing a CSV file containing these headings.
The table looks something like this:
Metric_ID  Metric_Name            Numerator            Denominator
AB0001     Number_of_Customers    No_of_Customers
AB0002     Percent_New_Customers  No_of_New_Customers  No_of_Customers
This has about 40 rows of data, and I'd like to set this table up in code so that it is created at the time the R query is run. I'll then use it to associate metric IDs with measures I retrieve through SQL queries. Sometimes this table may change: for example, new metrics might be added or existing metrics modified. This would need some modification in the code to incorporate these metrics.
The closest way I could find was to create a data.table, along the lines described in the code below.
library(data.table)

dt <- data.table(x = c(1, 2, 3), y = c(2, 3, 4), z = c(3, 4, 5))
dt
   x y z
1: 1 2 3
2: 2 3 4
3: 3 4 5
This works for a table with a few rows or columns, but it will be unwieldy for tables with 40+ rows. For example, if I wanted to modify a metric 20 rows down, I'd have to go 20 rows down in each column and then test the table to ensure I switched the metric at the right place in each column, especially where some metrics have empty cells. For example, I may correct the metric ID in row 20, but accidentally put the definition (a separate column) in row 19.
Is there a more straightforward way of, in essence, creating a table in code?
(I appreciate the most straightforward way would be to keep a CSV file accessible and use read_csv to import it into R. However, this doesn't work so well if colleagues are running this query on their machine and have a different file path to the CSV -- it also raises the risk of them running the query with an out-of-date metrics table, as they may not have the latest version in their files).
Thanks in advance for any guidance you might have!
Tony
Here are two options (examples taken from respective help pages):
data.table::fread()
fread("A,B
1,2
3,4
")
#>        A     B
#>    <int> <int>
#> 1:     1     2
#> 2:     3     4
https://rdatatable.gitlab.io/data.table/reference/fread.html
tibble::tribble()
tribble(
  ~colA, ~colB,
  "a",   1,
  "b",   2,
  "c",   3
)
#> # A tibble: 3 × 2
#>   colA   colB
#>   <chr> <dbl>
#> 1 a         1
#> 2 b         2
#> 3 c         3
https://tibble.tidyverse.org/reference/tribble.html
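Applied to the metrics table from the question, a sketch built from the rows shown above could look like the following; because tribble() is laid out row-wise, fixing the metric in row 20 means editing exactly one line, which avoids the column-misalignment problem described above:

library(tibble)

metrics <- tribble(
  ~Metric_ID, ~Metric_Name,            ~Numerator,            ~Denominator,
  "AB0001",   "Number_of_Customers",   "No_of_Customers",     NA,
  "AB0002",   "Percent_New_Customers", "No_of_New_Customers", "No_of_Customers"
)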
Other options:
If you already have the data.frame from somewhere, you can also use dput() to get structure() code you can paste into the files you are distributing (see the sketch after this list).
Use the reprex package: https://reprex.tidyverse.org/
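A sketch of the dput() route, assuming metrics is the tribble built above (output reindented for readability):

# prints code that recreates `metrics` exactly; paste it into the script you share
dput(metrics)
#> structure(list(Metric_ID = c("AB0001", "AB0002"),
#>                Metric_Name = c("Number_of_Customers", "Percent_New_Customers"),
#>                Numerator = c("No_of_Customers", "No_of_New_Customers"),
#>                Denominator = c(NA, "No_of_Customers")),
#>           row.names = c(NA, -2L), class = c("tbl_df", "tbl", "data.frame"))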

R read_xlsx Adds Trailing Digit to Character

I am reading an Excel file into R using the read_xlsx function from the readxl package. Some of the columns could be "numerics" in Excel, but I convert everything to a character as I read things in. This solves a lot of downstream problems for me because really none of the data from Excel is actually numeric in practice. Things that look like numerics are really identification numbers of some sort.
Here is my issue. I am trying to read in the following data:
In Excel, the first column is a numeric. When I read this in, I get:
library(readxl)
xl <- read_xlsx("C:/test/test.xlsx", col_types = c("text"))
xl
#> # A tibble: 1 x 3
#>   some_id_number     some_name some_other_name
#>   <chr>              <chr>     <chr>
#> 1 310.16000000000003 name      name_Descriptions
Where is that trailing 3 coming from? I have tried to adjust the digits option per this question without any luck.
Any thoughts?
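For what it's worth, a likely culprit (a guess, not from the original thread) is floating point: 310.16 has no exact double-precision representation, and rendering the nearest double at full precision reproduces exactly the trailing digits above:

# 310.16 cannot be stored exactly as a double; printed at 17 significant
# digits, the nearest representable value shows the same trailing 3
sprintf("%.17g", 310.16)
#> [1] "310.16000000000003"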

Separating txt (conversation) into columns with speaker names as variables

I'm new to text mining in R. I have multiple txt files of conversations between the same speakers organized as follows:
speaker one [speakers' names are on their own line]
what speaker one says [paragraph of each speaker's speech after
line break from name]
[empty line]
speaker two
what speaker two says
[empty line]
speaker one
what speaker one replies
[empty line]
speaker three
what speaker three says
...
I want to break up the texts into one row per text with columns as the names of speakers. I want to have everything that speaker one says in each text combined in one cell on each row and the same for other speakers. Something like this:
text "speaker one" "speaker two" ...
text1 everything speaker one said everything speaker two said
text2 everything speaker one said everything speaker two said
...
Any help on how to get started would be appreciated.
Using some tidyverse packages you can get there. First read the text with readr::read_file, next split on the empty line, and use readr::read_delim to read the chunks into data.frames. As the data is now in a list, bind_rows will collapse all of it into one data.frame; bind_rows matches on the column names, so all the text of a speaker ends up in the correct column. Depending on which outcome you want, use either the first or the second solution.
I leave combining multiple text files up to you.
library(readr)
library(tidyr)
library(dplyr)
# read file into a character vector
text <- readr::read_file("conversation.txt")
# split the text on the empty line
split_text <- strsplit(text, split = "\r\n\r\n")
# read the data in again with read_delim. This will generate a list of data.frames
list_text <- lapply(unlist(split_text), function(x) readr::read_delim(x, col_names = TRUE, delim = "\t"))
# use bind_rows from dplyr to combine everything into 1 tibble. bind_rows matches on the column names.
list_text %>%
bind_rows
# A tibble: 5 x 3
  `speaker one`                                                      `speaker two`         `speaker three`
  <chr>                                                              <chr>                 <chr>
1 what speaker one says is in this paragraph.                        NA                    NA
2 It might be in multiple lines, but not seperated by an empty line. NA                    NA
3 NA                                                                 what speaker two says NA
4 what speaker one replies                                           NA                    NA
5 NA                                                                 NA                    what speaker three says.
Collapsing all the text into one line:
This needs a bit more work: first gather the data in a tidy long format, collapse the text, and then spread it wide again. Run the statements in chunks if you want to see what is happening in each step.
list_text %>%
  bind_rows %>%
  pivot_longer(everything(),
               names_to = "speakers",
               values_to = "text",
               values_drop_na = TRUE) %>%
  group_by(speakers) %>%
  summarise(text = paste0(text, collapse = " ")) %>%
  pivot_wider(names_from = speakers, values_from = text)
# A tibble: 1 x 3
`speaker one` `speaker three` `speaker two`
<chr> <chr> <chr>
1 what speaker one says is in this paragraph. It might be in multiple lines, but not seperated b~ what speaker three s~ what speaker two ~
Text used in the text file conversation.txt:
speaker one
what speaker one says is in this paragraph.
It might be in multiple lines, but not seperated by an empty line.

speaker two
what speaker two says

speaker one
what speaker one replies

speaker three
what speaker three says.
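Combining multiple text files, which the answer above leaves to the reader, might look like this sketch (assuming every conversation file in the working directory ends in .txt):

library(readr)
library(dplyr)
library(tidyr)

# wrap the pipeline above in a helper that returns one row per file
read_conversation <- function(path) {
  text <- readr::read_file(path)
  split_text <- strsplit(text, split = "\r\n\r\n")
  lapply(unlist(split_text),
         function(x) readr::read_delim(x, col_names = TRUE, delim = "\t")) %>%
    bind_rows() %>%
    pivot_longer(everything(), names_to = "speakers", values_to = "text",
                 values_drop_na = TRUE) %>%
    group_by(speakers) %>%
    summarise(text = paste0(text, collapse = " ")) %>%
    pivot_wider(names_from = speakers, values_from = text)
}

# one row per conversation; bind_rows matches columns by speaker name
files <- list.files(pattern = "\\.txt$")
bind_rows(lapply(files, read_conversation), .id = "text")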

R: Two Identically Structured Excel Files Return Different Data Types in Data Frames

I have two different Excel files, excel1 and excel2.
I am reading them in using separate but identical functions:
df1<- readxl::read_xlsx("excel1.xlsx", sheet= "Ad Awareness", skip= 7)
df2<- readxl::read_xlsx("excel2.xlsx", sheet= "Ad Awareness", skip= 7)
However, when I run head() on each, here is what df1 returns:
  calDate             Score
  <dttm>              <dbl>
1 2016-10-17 00:00:00  17.8
2 2016-10-18 00:00:00  17.2
3 2016-10-19 00:00:00  20.3
And here is what df2 returns:
  calDate Score
    <dbl> <lgl>
1   43025 NA
2   43026 NA
3   43027 NA
Any reason why the data types are being read in differently? There is nothing different about the files.
read_xlsx() will guess the variable types based on your data (see here for more information).
So what you are describing could be due to:
a different amount of data in your files (not enough data in one of them to arrive at a correct guess)
changes you might have made in Excel to the cell format (those changes are not always visually obvious in Excel)
Without seeing your data, it is hard to give you more answer than this.
But you can control this with the col_types argument:
col_types: Either ‘NULL’ to guess all from the spreadsheet or a
character vector containing one entry per column from these
options: "skip", "guess", "logical", "numeric", "date",
"text" or "list". If exactly one ‘col_type’ is specified, it
will be recycled. The content of a cell in a skipped column
is never read and that column will not appear in the data
frame output. A list cell loads a column as a list of length
1 vectors, which are typed using the type guessing logic from
‘col_types = NULL’, but on a cell-by-cell basis.
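For the two-column sheet in the question, pinning the types down explicitly might look like this sketch (assuming calDate and Score are the only columns being read):

library(readxl)

df2 <- read_xlsx("excel2.xlsx", sheet = "Ad Awareness", skip = 7,
                 col_types = c("date", "numeric"))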
