I have a long script that sorts things and finds unique items. Everything was functioning fine until I added more data to my database. Then things started breaking. I've gone back and cleaned up my database to fix some errors, but my code is now failing elsewhere, and I can't figure out why.
Error:
Error in `$<-.data.frame`(`*tmp*`, "type", value = character(0)) :
replacement has 0 rows, data has 144
happens at:
OPdf$"type" <- IPdf$"type"[OPdf$"name"==IPdf$"name"]
IPdf: (two columns of characters)
# type name
1 ball Test-7
2 square bob-allen
3 cat HHH_67
4 groot 765-6
OPdf: (one column of factors)
# name
1 bob-allen
2 765-6
3 HHH_67
4 Test-7
I have the same number of rows in each dataframe. I can call my original test set of data into the script and everything works fine. I have verified that there aren't any weird characters in my names column that would throw something off.
I'm at a loss.
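One possible explanation, sketched below on the four sample rows from the question: `==` compares the two name columns position by position, so when the rows are in a different order nothing matches, the subset is `character(0)`, and the assignment fails with the "replacement has 0 rows" error. A lookup with `match()` does not depend on row order. This is a minimal sketch that assumes each name in OPdf appears at most once in IPdf; it is not necessarily the cause in the full 144-row data.

```r
# Sample data copied from the question (IPdf: characters, OPdf: factor).
IPdf <- data.frame(
  type = c("ball", "square", "cat", "groot"),
  name = c("Test-7", "bob-allen", "HHH_67", "765-6"),
  stringsAsFactors = FALSE
)
OPdf <- data.frame(name = factor(c("bob-allen", "765-6", "HHH_67", "Test-7")))

# Position-by-position comparison: no row lines up, so every element is FALSE
# and the subset below has length 0.
positional <- IPdf$type[OPdf$name == IPdf$name]

# Lookup by value: match() returns, for each OPdf name, its row index in IPdf
# (factors are converted to character for the comparison).
OPdf$type <- IPdf$type[match(OPdf$name, IPdf$name)]
```

Names that are missing from IPdf come back as NA instead of raising an error, which also makes it easier to spot the rows that changed after the database cleanup.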
I'm using a "Split Data" module set to recommender split to split data for training and testing a matchbox recommender. The input data is a valid user-item-rating tuple (for example, 575978 - 157381 - 3) and I've left the parameters for the recommender split as default (0s for everything), besides changing it to a .75 and .25 split. However, when this module finishes, it returns the complete, unsplit dataset for dataset1 and a completely empty (but labelled) dataset for dataset2. This also happens when doing a stratified split using the "Split Rows" mode. Any idea what's going on?
Thanks.
Edit: Including a sample of my data.
UserID ItemID Rating
835793 165937 3
154738 11214 3
938459 748288 3
819375 789768 6
738571 98987 3
847509 153777 3
991757 124458 3
968685 288070 2
236349 8337 3
127299 545885 3
Figured it out. In my "Remove Duplicate Rows" module further up the chain, I was only removing duplicates by UserID instead of by UserID and ItemID. This still left quite a few rows, but I'm assuming it interfered with the stratification.
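The difference between the two deduplication keys can be illustrated outside the Studio designer. The following is only an R analogy on made-up rows (the `ratings` data frame is hypothetical, not the real dataset), and it does not describe how the "Remove Duplicate Rows" or "Split Data" modules work internally.

```r
# Hypothetical ratings: user 1 rated item 10 twice and item 11 once.
ratings <- data.frame(
  UserID = c(1, 1, 1, 2),
  ItemID = c(10, 11, 10, 10),
  Rating = c(3, 4, 3, 5)
)

# Deduplicating on UserID alone keeps only the first rating of each user.
by_user <- ratings[!duplicated(ratings$UserID), ]

# Deduplicating on the (UserID, ItemID) pair keeps one row per user-item pair.
by_pair <- ratings[!duplicated(ratings[c("UserID", "ItemID")]), ]

nrow(by_user)  # 2
nrow(by_pair)  # 3
```

With the UserID-only key every user ends up with exactly one rating, which may be why a per-user split had nothing to send to the second output.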
I apologize in advance for the lack of reproducibility here. I am doing an analysis on a very large (for me) dataset. It is from the CMS Open Payments database.
There are four files I downloaded from that website, read into R using readr, then manipulated a bit to make them smaller (column removal), and then stuck them all together using rbind. I would like to write my pared-down file out to an external hard drive so I don't have to read in all the data and pare it down each time I want to work on it. (Obviously, it's all scripted, but it takes about 45 minutes to run, so I'd like to avoid it if possible.)
So I wrote out the data and read it back in, but now I am getting different results. Below is about as close as I can get to a good example. The data is named sa_all. There is a column in the table for the source. It can only take on two values: gen or res. It is a column that is actually added as part of the analysis, not one that comes in the data.
table(sa_all$src)
gen res
14837291 822559
So I save the sa_all dataframe into a CSV file.
write.csv(sa_all, 'D:\\Open_Payments\\data\\written_files\\sa_all.csv',
row.names = FALSE)
Then I open it:
sa_all2 <- read_csv('D:\\Open_Payments\\data\\written_files\\sa_all.csv')
table(sa_all2$src)
g gen res
1 14837289 822559
I did receive the following parsing warnings.
Warning: 4 parsing failures.
    row        col           expected        actual
5454739 pmt_nature                    embedded null
7849361        src delimiter or quote             2
7849361        src                    embedded null
7849361         NA         28 columns    54 columns
Since I manually add the src column and it can only take on two values, I don't see how this could cause any parsing errors.
Has anyone had any similar problems using readr? Thank you.
Just to follow up on the comment:
write_csv(sa_all, 'D:\\Open_Payments\\data\\written_files\\sa_all.csv')
sa_all2a <- read_csv('D:\\Open_Payments\\data\\written_files\\sa_all.csv')
Warning: 83 parsing failures.
    row   col   expected        actual
1535657 drug2            embedded null
1535657    NA 28 columns    25 columns
1535748 drug1            embedded null
1535748  year an integer            No
1535748    NA 28 columns    27 columns
Even more parsing errors and it looks like some columns are getting shuffled entirely:
table(sa_all2a$src)
         100000000278         Allergan Inc.                   gen GlaxoSmithKline, LLC.
                    1                     1              14837267                     1
                   No                   res
                    1                822559
There are columns for manufacturer names and it looks like those are leaking into the src column when I use the write_csv function.
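If the goal is only to avoid the 45-minute rebuild, one option is to skip the text format altogether and store the data frame in R's binary format with `saveRDS()` and `readRDS()`, which are base R functions. The sketch below uses a small made-up data frame (`sa_small`, with hypothetical values) and a temporary file instead of the real path; it does not remove the embedded nulls from the source data, it just avoids re-parsing them as CSV.

```r
# Small stand-in for sa_all; the values are invented for illustration.
sa_small <- data.frame(
  src = c("gen", "res", "gen"),
  manufacturer = c("Allergan Inc.", "GlaxoSmithKline, LLC.", "Some \"Quoted\" Name"),
  stringsAsFactors = FALSE
)

# Write the object in R's serialized format and read it back.
rds_path <- tempfile(fileext = ".rds")
saveRDS(sa_small, rds_path)
sa_back <- readRDS(rds_path)

# No delimiters or quotes are parsed on the way back in, so column types
# and values are restored exactly as they were saved.
identical(sa_small, sa_back)
table(sa_back$src)
```

For the real data the path would be something like the `written_files` directory used above with an `.rds` extension.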
I have a large data set in which I have to search for specific codes depending on what I want. For example, chemotherapy is coded by ~40 codes that can appear in any of 40 columns (called diag1, diag2, etc.).
I am in the process of writing a function that produces plots depending on what I want to show. I thought it would be good to specify what I want to plot in an input data frame. Thus, for example, if I only want to plot chemotherapy events for patients, I would have a data frame like this:
Dataframe name: Style
Name   SearchIn                                codes        PlotAs  PlotColour
Chemo  data[substr(names(data),1,4)=="diag"]   1,2,3,4,5,6  |       red
I already have a function that searches for codes in specific parts of the data frame and flags the events of interest. What I cannot do, and need your help with, is referring to a subset of a data frame through the expression stored in Style$SearchIn[1], as above.
> Style$SearchIn[1]
[1] data[substr(names(data),1,4)=="diag"]
Levels: data[substr(names(data),1,4)=="diag"]
I thought perhaps get() would work, but I can't get it to:
> get(Style$SearchIn[1])
Error in get(Style$SearchIn[1]) : invalid first argument
or
> get(as.character(Style$SearchIn[1]))
Error in get(as.character(Style$SearchIn[1])) :
object 'data[substr(names(data),1,4)=="diag"]' not found
Obviously, running data[substr(names(data),1,4)=="diag"] directly works.
Example:
library(survival)
ex <- data.frame(SearchIn="lung[substr(names(lung),1,2) == 'ph']")
lung[substr(names(lung),1,2) == 'ph'] #works
get(ex$SearchIn[1]) # does not work
It is not a good idea to store R code in strings and then try to evaluate them when needed; there are nearly always better solutions for dynamic logic, such as anonymous functions (lambdas).
I would recommend using a list to store the plot specification, rather than a data.frame. This would allow you to include a function as one of the list's components which could take the input data and return a subset of it for plotting.
For example:
library(survival)

plotFromSpec <- function(data, spec) {
  filteredData <- spec$filter(data)
  ## ... draw a plot from filteredData and other stuff in spec ...
}

spec <- list(
  Name = 'Chemo',
  filter = function(data) data[, substr(names(data), 1, 2) == 'ph'],
  Codes = c(1, 2, 3, 4, 5, 6),
  PlotAs = '|',
  PlotColour = 'red'
)

plotFromSpec(lung, spec)
If you want to store multiple specifications, you could create a list of lists.
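A list of lists could look roughly like the sketch below. The small `dat` data frame, the second "Other" specification, and the `drop = FALSE` argument are assumptions added for illustration; only the "Chemo" fields come from the question.

```r
# Tiny stand-in for the real data set (hypothetical values).
dat <- data.frame(diag1 = c(1, 7), diag2 = c(3, 9), other = c("a", "b"))

# One inner list per plot specification; each carries its own filter function.
specs <- list(
  chemo = list(
    Name = "Chemo",
    filter = function(data) data[, substr(names(data), 1, 4) == "diag", drop = FALSE],
    Codes = 1:6,
    PlotAs = "|",
    PlotColour = "red"
  ),
  other = list(  # hypothetical second specification
    Name = "Other",
    filter = function(data) data[, names(data) == "other", drop = FALSE],
    Codes = "a",
    PlotAs = "o",
    PlotColour = "blue"
  )
)

# Apply every specification's filter to the same data.
selected <- lapply(specs, function(spec) spec$filter(dat))
names(selected$chemo)
```

Each element of `selected` is the subset for one specification, so a plotting function can loop over `specs` and `selected` together.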
Have you tried using quote()?
I'm not entirely sure what you want, but maybe you could store the things you're trying to get() as quoted expressions, like
quote(data[substr(names(data),1,4)=="diag"])
and then use eval()
eval(quote(data[substr(names(data),1,4)=="diag"]), list(data=data))
For example,
dat <- data.frame("diag1"=1:10, "diag2"=1:10, "other"=1:10)
Style <- list(SearchIn=c(quote(data[substr(names(data),1,4)=="diag"]), quote("Other stuff")))
> head(eval(Style$SearchIn[[1]], list(data=dat)))
diag1 diag2
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
I'm new to Octave and running into a formatting issue which I can't seem to fix. If I display a variable with multiple columns, I get something along the lines of:
Columns 1 through 6:
0.75883 0.93290 0.40064 0.43818 0.94958 0.16467
However what I would really like to have is:
0.75883
0.93290
0.40064
0.43818
0.94958
0.16467
I've read the format documentation but haven't been able to make the change. I'm running Octave 3.6.4 on Windows; however, I've used Octave 3.2.x on Windows and seen it produce the desired output by default.
To be specific, in case it matters, I am using the fir1 command from the signal package, and these are sample outputs that I might see.
It sounds like, as Dan suggested, you want to display the transpose of your vector, i.e. a column vector rather than a row vector:
>> A = rand(1,20)
A =
Columns 1 through 7:
0.681499 0.093300 0.490087 0.666367 0.212268 0.456260 0.532721
Columns 8 through 14:
0.850320 0.117698 0.567046 0.405096 0.333689 0.179495 0.942469
Columns 15 through 20:
0.431966 0.100049 0.650319 0.459100 0.613030 0.779297
>> A'
ans =
0.681499
0.093300
0.490087
0.666367
0.212268
0.456260
0.532721
0.850320
0.117698
0.567046
0.405096
0.333689
0.179495
0.942469
0.431966
0.100049
0.650319
0.459100
0.613030
0.779297