R readr package - written and read-in file doesn't match source

I apologize in advance for the somewhat limited reproducibility here. I am doing an analysis on a very large (for me) dataset from the CMS Open Payments database.
There are four files I downloaded from that website, read into R using readr, manipulated a bit to make them smaller (column removal), and then stuck together using rbind. I would like to write my pared-down file out to an external hard drive so I don't have to read in all the data and do the paring each time I want to work on it. (Obviously, it's all scripted, but it takes about 45 minutes, so I'd like to avoid it if possible.)
So I wrote out the data and read it back in, but now I am getting different results. Below is about as close as I can get to a good example. The data is named sa_all. There is a column in the table, src, for the source; it can only take on two values, gen or res, and it is added as part of the analysis rather than coming in with the data.
table(sa_all$src)

     gen    res
14837291 822559
So I save the sa_all dataframe into a CSV file.
write.csv(sa_all, 'D:\\Open_Payments\\data\\written_files\\sa_all.csv',
          row.names = FALSE)
Then I open it:
sa_all2 <- read_csv('D:\\Open_Payments\\data\\written_files\\sa_all.csv')
table(sa_all2$src)

g      gen    res
1 14837289 822559
I did receive the following parsing warnings.
Warning: 4 parsing failures.
    row        col           expected     actual
5454739 pmt_nature      embedded null
7849361        src delimiter or quote          2
7849361        src      embedded null
7849361         NA         28 columns 54 columns
Since I manually add the src column and it can only take on two values, I don't see how this could cause any parsing errors.
Has anyone had any similar problems using readr? Thank you.
Just to follow up on the comment:
write_csv(sa_all, 'D:\\Open_Payments\\data\\written_files\\sa_all.csv')
sa_all2a <- read_csv('D:\\Open_Payments\\data\\written_files\\sa_all.csv')
Warning: 83 parsing failures.
    row   col      expected     actual
1535657 drug2 embedded null
1535657    NA    28 columns 25 columns
1535748 drug1 embedded null
1535748  year    an integer         No
1535748    NA    28 columns 27 columns
Even more parsing failures, and it looks like some columns are getting shuffled entirely:
table(sa_all2a$src)

100000000278 Allergan Inc.      gen GlaxoSmithKline, LLC.
           1             1 14837267                     1
          No           res
           1        822559
There are columns for manufacturer names, and it looks like those values are leaking into the src column when I use write_csv.
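For what it's worth, if the goal is only to cache the pared-down data between sessions rather than to produce a portable CSV, R's native serialization round-trips the object exactly and sidesteps CSV parsing (and its embedded-null problems) entirely. A minimal sketch, reusing the path above with an .rds extension:

# cache the combined data frame in R's binary format; no parsing on re-read
saveRDS(sa_all, 'D:\\Open_Payments\\data\\written_files\\sa_all.rds')
# in a later session, restore the object exactly as written
sa_all <- readRDS('D:\\Open_Payments\\data\\written_files\\sa_all.rds')
table(sa_all$src)  # counts should match the original gen/res totals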

Related

Yet another "ValueError: Input contains NaN, infinity or a value too large for dtype('float64')". I have checked, but data seems to be ok

I'm trying to prepare a dataset to use as training data for a deep neural network. It consists of 13 .txt files, each between 500 MB and 2 GB in size. However, when trying to run my data_prepare.py file, I get the ValueError in this post's title.
Reading answers from previous posts, I have loaded my data into R and checked for both NaN and infinite numbers, but the commands I used tell me there appears to be nothing wrong with my data. I have done the following:
I load my data as one single dataframe using the magrittr, data.table and purrr packages (there are about 300 million rows, all with 7 variables):
library(magrittr)
library(data.table)
library(purrr)

txt_fread <-
  list.files(pattern = "*.txt") %>%
  map_df(~fread(.))
I have used sapply to check for finite and NaN values:
> any(sapply(txt_fread, is.finite))
[1] TRUE
> any(sapply(txt_fread, is.nan))
[1] FALSE
I have also tried loading each data frame into a Jupyter notebook and checking individually for those values using the following commands:
file1= pd.read_csv("File_name_xyz_intensity_rgb.txt", sep=" ", header=None)
np.any(np.isnan(file1))
False
np.all(np.isfinite(file1))
True
And when I use print(file1.info()), this is what I get as info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22525176 entries, 0 to 22525175
Data columns (total 7 columns):
 #   Column  Dtype
---  ------  -----
 0   0       float64
 1   1       float64
 2   2       float64
 3   3       int64
 4   4       int64
 5   5       int64
 6   6       int64
dtypes: float64(3), int64(4)
memory usage: 1.2 GB
None
I know the file containing the code (data_prepare.py) works, because it runs properly with a similar dataset. It must therefore be a problem with the new data mentioned here, but I don't know what I have missed or done wrong while checking for NaNs and infinites. I have also tried reading and checking the .txt files individually, but that hasn't helped much either.
Any help is really appreciated!
Btw: the R code with map_df came from a post by leerssej in How to import multiple .csv files at once?
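Worth noting for anyone comparing these checks: any(sapply(txt_fread, is.finite)) only tests whether at least one value is finite, which is almost always TRUE. For the ValueError above, the relevant question is whether all values are finite. A stricter sketch on the same object:

# TRUE only if every value in every column is finite (no NA, NaN, or Inf)
all(sapply(txt_fread, function(col) all(is.finite(col))))
# count of non-finite values per column, to locate offenders
sapply(txt_fread, function(col) sum(!is.finite(col)))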

Reading tab-delimited text file into R only expecting one column out of five

Here's my problem. I'm trying to read a tab-delimited text file into R, and I keep getting error messages; it only loads one column out of the five in the dataset.
Our professor is requiring us to use the read_csv() command for this, but I've tried using read_tsv() as well, and neither has worked. I've looked everywhere and just can't find anything about what could possibly be going wrong.
waste <- read_tsv("wasterunup.txt", col_names=TRUE, na=c("*"))
I can't seem to link the text file here, but it's a simple tab-delimited text file with 5 columns, column headers, and 22 rows (not counting the headers). * is used for N/A results.
I have no clue how to do this "properly" according to my professor by using read_csv.
waste <- read_tsv("wasterunup.txt", col_names=TRUE, na=c("*"))
Parsed with column specification:
cols(
  `PT1 PT2 PT3 PT4 PT5` = col_double()
)
Warning: 22 parsing failures.
row col  expected  actual    file
  1  --  1 columns 5 columns 'wasterunup.txt'
  2  --  1 columns 5 columns 'wasterunup.txt'
  3  --  1 columns 5 columns 'wasterunup.txt'
  4  --  1 columns 5 columns 'wasterunup.txt'
  5  --  1 columns 5 columns 'wasterunup.txt'
... ... ......... ......... ................
See problems(...) for more details.
To clarify my errors:
When I use read_csv(), all of the data is there, but all five datapoints are crammed into one cell of each row.
When I use read_tsv(), only one column of the data is there.
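The parsed column specification above, where all five header names collapse into a single backticked column name, suggests the separators in the file may actually be spaces rather than tabs. A sketch of how to check, and of a whitespace-delimited read with readr's read_table(), assuming that is the case:

readLines("wasterunup.txt", n = 3)  # inspect raw lines; real tabs print as \t
waste <- read_table("wasterunup.txt", col_names = TRUE, na = "*")  # splits on runs of whitespace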

RMySQL dbWriteTable error

I have the following R dataframe
Sl NO  Name  Marks
    1     A     15
    2     B     20
    3     C     25
I have a MySQL table (Score.table) as follows:

No  CandidateName  Score
 1             AA      1
 2             BB      2
 3             CC      3
I have written my dataframe to Score.table using this code
username <- 'username'
password <- 'userpass'
dbname   <- 'cdb'
hostname <- '***.***.***.***'

cdbconn <- dbConnect(MySQL(), user = username, password = password,
                     dbname = dbname, host = hostname)

Next, I write the dataframe to the table as follows:

score.table <- 'score.table'
dbWriteTable(cdbconn, score.table, dataframe, append = FALSE, overwrite = TRUE)
The code runs and I get TRUE as the output. However, when I check the SQL table, the new values haven't overwritten the existing values.
Could someone help me? The code runs without error; I have reinstalled the RMySQL package and rerun it, and the results are the same.
That updates are not happening indicates that the RMySQL package cannot successfully map any of the rows from your data frame to existing records in the table, which implies that your call to dbWriteTable has a problem. Two potential problems I see are that you did not assign values for field.types or row.names. Consider making the following call:
score.table <- 'score.table'
dbWriteTable(cdbconn, score.table, dataframe,
             field.types = list(`Sl NO` = "int", Name = "varchar(55)", Marks = "int"),
             row.names = FALSE)
If you omit field.types, the package will try to infer what the types are. I am not an expert with this package, so I don't know how robust this inference is, but most likely you would want to specify explicit types for complex update queries.
The bigger problem might actually be not specifying a value for row.names. It can default to TRUE, in which case the package will send an extra column during the write. This can cause problems: for example, if your target table has three columns and the data frame also has three columns, then with row names you end up trying to write four columns into three.
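As a quick sanity check after dbWriteTable returns TRUE, reading the table straight back shows what actually landed in MySQL (connection and table name as above):

dbReadTable(cdbconn, score.table)  # should now contain the data frame's rows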

Checking for number of items in a string in R

I have a very large CSV file (1.4 million rows). It is supposed to have 22 fields and 21 commas in each row. It was created by taking quarterly text files and compiling them into one large text file so that I could import it into SQL. In the past, one field has been missing from the file. I don't have the time to go row by row and check for this.
In R, is there a way to verify that each row has 22 fields or 21 commas? Below is a small sample data set. The possibly missing field is the 0 in the 10th slot.
32,01,01,01,01,01,000000,123,456,0,132,345,456,456,789,235,256,88,4,1,2,1
32,01,01,01,01,01,000001,123,456,0,132,345,456,456,789,235,256,88,5,1,2,1
You can use the base R function count.fields to do this:
count.fields(tmp, sep=",")
[1] 22 22
The input for this function is the name of a file or a connection; here I supplied a textConnection (see the data section below). For large files, you would probably want to feed the result into table:
table(count.fields(tmp, sep=","))
Note that this can also be used to count the number of rows in a file by wrapping it in length, similar to the output of wc -l on *nix systems.
data
tmp <- textConnection(
"32,01,01,01,01,01,000000,123,456,0,132,345,456,456,789,235,256,88,4,1,2,1
32,01,01,01,01,01,000001,123,456,0,132,345,456,456,789,235,256,88,5,1,2,1"
)
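For the actual 1.4-million-row file, the same idea flags offending lines directly; a sketch, with a hypothetical file name:

n_fields <- count.fields("compiled_quarters.csv", sep = ",")  # hypothetical name
table(n_fields)        # distribution of field counts across the file
which(n_fields != 22)  # line numbers that do not have exactly 22 fields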
Assuming df is your dataframe:
apply(df, 1, length)
This will give you the length of each row. Note, though, that once the file has been parsed into a rectangular data frame, every row reports the same length; to catch short rows before parsing, use count.fields on the raw file as above.

Refer to (part of) a data frame using a string in R

I have a large data set in which I have to search for specific codes depending on what I want. For example, chemotherapy is coded by ~40 codes that can appear in any of 40 columns (diag1, diag2, etc.).
I am in the process of writing a function that produces plots depending on what I want to show. I thought it would be good to specify what I want to plot in an input data frame. Thus, for example, if I only want to plot chemotherapy events for patients, I would have a data frame like this:
Dataframe name: Style

Name   SearchIn                               codes        PlotAs  PlotColour
Chemo  data[substr(names(data),1,4)=="diag"]  1,2,3,4,5,6  |       red
I already have a function that searches for codes in specific parts of the data frame and flags the events of interest. What I cannot do, and need your help with, is referring to a data frame (Style$SearchIn[1]) using code stored in a data frame as above.
> Style$SearchIn[1]
[1] data[substr(names(data),1,4)=="diag"]
Levels: data[substr(names(data),1,4)=="diag"]
I thought perhaps get() would work, but I can't get it to work:
> get(Style$SearchIn[1])
Error in get(Style$SearchIn[1]) : invalid first argument
or
> get(as.character(Style$SearchIn[1]))
Error in get(as.character(Style$SearchIn[1])) :
object 'data[substr(names(data),1,5)=="TDIAG"]' not found
Obviously, running data[substr(names(data),1,5)=="TDIAG"] works.
Example:
library(survival)
ex <- data.frame(SearchIn="lung[substr(names(lung),1,2) == 'ph']")
lung[substr(names(lung),1,2) == 'ph'] #works
get(ex$SearchIn[1]) # does not work
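For context on why get() fails here: get() looks up an object by its name, whereas the string stored in the data frame is an expression that must be parsed and then evaluated. As a sketch, the string version can be run with parse() plus eval(), though as the answers below argue, storing code in strings is usually best avoided:

eval(parse(text = as.character(ex$SearchIn[1])))  # works, but see the caveats below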
It is not a good idea to store R code in strings and then try to eval them when needed; there are nearly always better solutions for dynamic logic, such as lambdas.
I would recommend using a list to store the plot specification, rather than a data.frame. This would allow you to include a function as one of the list's components which could take the input data and return a subset of it for plotting.
For example:
library(survival)

plotFromSpec <- function(data, spec) {
  filteredData <- spec$filter(data)
  ## ... draw a plot from filteredData and other stuff in spec ...
}

spec <- list(
  Name = 'Chemo',
  filter = function(data) data[, substr(names(data), 1, 2) == 'ph'],
  Codes = c(1, 2, 3, 4, 5, 6),
  PlotAs = '|',
  PlotColour = 'red'
)

plotFromSpec(lung, spec)
If you want to store multiple specifications, you could create a list of lists.
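For instance, a sketch with a second, purely illustrative specification alongside the one above:

specs <- list(
  chemo = spec,  # the specification defined above
  karno = list(  # hypothetical second specification
    Name = 'Karno',
    filter = function(data) data[, substr(names(data), 1, 3) == 'pat', drop = FALSE],
    Codes = c(7, 8),
    PlotAs = '|',
    PlotColour = 'blue'
  )
)
plotFromSpec(lung, specs$chemo)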
Have you tried using quote()?
I'm not entirely sure what you want, but maybe you could store the things you're trying to get(), like
quote(data[substr(names(data),1,4)=="diag"])
and then use eval()
eval(quote(data[substr(names(data),1,4)=="diag"]), list(data=data))
For example,
dat <- data.frame("diag1"=1:10, "diag2"=1:10, "other"=1:10)
Style <- list(SearchIn=c(quote(data[substr(names(data),1,4)=="diag"]), quote("Other stuff")))

> head(eval(Style$SearchIn[[1]], list(data=dat)))
  diag1 diag2
1     1     1
2     2     2
3     3     3
4     4     4
5     5     5
6     6     6
