R matching error after changing input data set
I have a long script that sorts things and finds unique items. Everything was functioning fine until I added more data to my database; then things started breaking. I've gone back and cleaned up my database to fix some errors, but my code is now failing elsewhere, and I can't figure out why.
Error:
Error in `$<-.data.frame`(`*tmp*`, "type", value = character(0)) :
replacement has 0 rows, data has 144
happens at:
OPdf$"type" <- IPdf$"type"[OPdf$"name"==IPdf$"name"]
IPdf: (two columns of characters)
# type name
1 ball Test-7
2 square bob-allen
3 cat HHH_67
4 groot 765-6
OPdf: (one column of factors)
# name
1 bob-allen
2 765-6
3 HHH_67
4 Test-7
I have the same number of rows in each dataframe. I can call my original test set of data into the script and everything works fine. I have verified that there aren't any weird characters in my names column that would throw something off.
I'm at a loss.
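For what it's worth, one likely culprit (not confirmed here, so treat it as an assumption) is that == compares the two name columns position by position; if the rows of OPdf and IPdf are ordered differently, most positions compare FALSE, and when none of them line up the right-hand side is character(0), which is exactly what the error reports. A lookup that ignores row order, such as match(), avoids that; a minimal sketch using the column names from the question:

    # Sketch: align type by name rather than by row position.
    # match() returns, for each name in OPdf, the row number of IPdf with the same name.
    OPdf$type <- IPdf$type[match(OPdf$name, IPdf$name)]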
Related
Recommender Split Returning Empty Dataset
I'm using a "Split Data" module set to recommender split to split data for training and testing a Matchbox recommender. The input data is a valid user-item-rating tuple (for example, 575978 - 157381 - 3) and I've left the parameters for the recommender split at their defaults (0s for everything), besides changing it to a .75/.25 split. However, when this module finishes, it returns the complete, unsplit dataset for dataset1 and a completely empty (but labelled) dataset for dataset2. This also happens when doing a stratified split using the "Split Rows" mode. Any idea what's going on? Thanks.

Edit: Including a sample of my data.

    UserID  ItemID  Rating
    835793  165937  3
    154738  11214   3
    938459  748288  3
    819375  789768  6
    738571  98987   3
    847509  153777  3
    991757  124458  3
    968685  288070  2
    236349  8337    3
    127299  545885  3
Figured it out. In my "Remove Duplicate Rows" module up the chain a bit, I was only removing duplicates by UserID instead of by UserID and ItemID. This still left quite a few rows, but I'm assuming it messed with the stratification.
RMySQL dbWriteTable error
I have the following R dataframe:

    Sl NO  Name  Marks
    1      A     15
    2      B     20
    3      C     25

I have a MySQL table as follows (Score.table):

    No  CandidateName  Score
    1   AA             1
    2   BB             2
    3   CC             3

I have written my dataframe to Score.table using this code:

    username = 'username'
    password = 'userpass'
    dbname   = 'cdb'
    hostname = '***.***.***.***'
    cdbconn  = dbConnect(MySQL(), user = username, password = password,
                         dbname = dbname, host = hostname)

Next we write the dataframe to the table as follows:

    score.table <- 'score.table'
    dbWriteTable(cdbconn, score.table, dataframe, append = F, overwrite = T)

The code runs and I get TRUE as the output. However, when I check the SQL table, the new values haven't overwritten the existing values. I request someone to help me. The code works; I have reinstalled the RMySQL package and rerun it, and the results are the same.
That updates are not happening indicates that the RMySQL package cannot successfully map any of the rows from your data frame to already existing records in the table. So this would imply that your call to dbWriteTable has a problem. Two potential problems I see are that you did not assign values for field.types or row.names. Consider making the following call:

    score.table <- 'score.table'
    dbWriteTable(cdbconn, score.table, dataframe,
                 field.types = list(`Sl NO` = "int", Name = "varchar(55)", Marks = "int"),
                 row.names = FALSE)

If you omit field.types, the package will try to infer what the types are. I am not an expert with this package, so I don't know how robust this inference is, but most likely you would want to specify explicit types for complex update queries. The bigger problem might actually be not specifying a value for row.names. It can default to TRUE, in which case the package will send an extra column during the update. That can cause problems: for example, if your target table has three columns and the data frame also has three columns, then you are trying to update with four columns.
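Putting the two suggestions together, a sketch of the full call as I would write it (the connection details and the dataframe name come from the question; the column types are guesses on my part):

    library(RMySQL)

    cdbconn <- dbConnect(MySQL(), user = username, password = password,
                         dbname = dbname, host = hostname)

    # Overwrite the existing rows and keep R's row names out of the table.
    dbWriteTable(cdbconn, "score.table", dataframe,
                 field.types = list(`Sl NO` = "int",
                                    Name    = "varchar(55)",
                                    Marks   = "int"),
                 row.names = FALSE,
                 overwrite = TRUE)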
Using order in R dataframes fails after column names have been changed. How can I recover from this?
Setup dataframe:

    mta <- c("ldall","nold","ldall","nold","ldall","nold","ldall","nold")
    mtb <- c(491, 28581, 241, 5882, 365, 7398, 512, 10887)
    df1 <- data.frame(mta, mtb)

I can order my dataframe in the normal way. This works fine:

    df1[order(mtb),]

But if I change the names of the columns:

    names(df1) <- c("mta1","mtb1")
    df1[order(mtb1),]

this gives the error:

    Error in order(mtb1) : object 'mtb1' not found

If I use the old column name in the instruction it works, although the output shows the new column name:

    df1[order(mtb),]

If I change the name back to the original, the command appears to work normally. Can anyone explain? Is order using a hidden version of the column name?
This should work. Let me know if this helps.

    mta <- c("ldall","nold","ldall","nold","ldall","nold","ldall","nold")
    mtb <- c(491, 28581, 241, 5882, 365, 7398, 512, 10887)
    df1 <- data.frame(mta, mtb)

    # Change column names
    colnames(df1) <- c("mta1","mtb1")

    # Sort by column mtb1 of the data frame
    df1[order(df1$mtb1), ]

      mta1  mtb1
    3 ldall   241
    5 ldall   365
    1 ldall   491
    7 ldall   512
    4 nold   5882
    6 nold   7398
    8 nold  10887
    2 nold  28581
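As for the original question of why df1[order(mtb),] kept "working": order(mtb) never looks inside the data frame at all. It finds the free-standing vector mtb that was left in the workspace when the data frame was built, which is why renaming the column had no effect on it. After the rename, mtb1 exists only as a column of df1, so order(mtb1) has no object to find. Referring to the column explicitly removes the ambiguity; two equivalent forms:

    # Both of these look mtb1 up inside df1 rather than in the global workspace.
    df1[order(df1$mtb1), ]
    df1[with(df1, order(mtb1)), ]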
R readr package - written and read-in file doesn't match source
I apologize in advance for the somewhat limited reproducibility here. I am doing an analysis on a very large (for me) dataset. It is from the CMS Open Payments database. There are four files I downloaded from that website, read into R using readr, manipulated a bit to make them smaller (column removal), and then stuck together using rbind. I would like to write my pared-down file out to an external hard drive so I don't have to read in all the data each time I want to work on it and do the paring then. (Obviously it's all scripted, but it takes about 45 minutes, so I'd like to avoid it if possible.)

So I wrote out the data and read it back in, but now I am getting different results. Below is about as close as I can get to a good example. The data is named sa_all. There is a column in the table for the source. It can only take on two values: gen or res. It is a column that is actually added as part of the analysis, not one that comes in the data.

    table(sa_all$src)

         gen      res
    14837291   822559

So I save the sa_all dataframe into a CSV file:

    write.csv(sa_all, 'D:\\Open_Payments\\data\\written_files\\sa_all.csv', row.names = FALSE)

Then I open it:

    sa_all2 <- read_csv('D:\\Open_Payments\\data\\written_files\\sa_all.csv')
    table(sa_all2$src)

           g      gen      res
           1 14837289   822559

I did receive the following parsing warnings:

    Warning: 4 parsing failures.
        row        col           expected     actual
    5454739 pmt_nature      embedded null
    7849361        src delimiter or quote          2
    7849361        src      embedded null
    7849361         NA         28 columns 54 columns

Since I manually add the src column and it can only take on two values, I don't see how this could cause any parsing errors. Has anyone had any similar problems using readr? Thank you.

Just to follow up on the comment:

    write_csv(sa_all, 'D:\\Open_Payments\\data\\written_files\\sa_all.csv')
    sa_all2a <- read_csv('D:\\Open_Payments\\data\\written_files\\sa_all.csv')

    Warning: 83 parsing failures.
        row   col      expected     actual
    1535657 drug2 embedded null
    1535657    NA    28 columns 25 columns
    1535748 drug1 embedded null
    1535748  year    an integer         No
    1535748    NA    28 columns 27 columns

Even more parsing errors, and it looks like some columns are getting shuffled entirely:

    table(sa_all2a$src)

             100000000278           Allergan Inc.                     gen
                        1                       1                14837267
    GlaxoSmithKline, LLC.                      No                     res
                        1                       1                  822559

There are columns for manufacturer names, and it looks like those are leaking into the src column when I use the write_csv function.
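Not an answer to why the parsing fails, but if the only goal is to cache the pared-down object so it doesn't have to be rebuilt every session, one workaround is to skip CSV entirely and use R's binary serialization, which round-trips the data frame without any quoting or embedded-null issues on re-read (path reused from the question; the .rds extension is my choice):

    # Cache the pared-down data frame in R's native serialized format.
    saveRDS(sa_all, 'D:\\Open_Payments\\data\\written_files\\sa_all.rds')

    # In a later session, restore it exactly as it was written.
    sa_all <- readRDS('D:\\Open_Payments\\data\\written_files\\sa_all.rds')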
Technique for finding bad data in read.csv in R
I am reading in a file of data that looks like this:

    userId, fullName,email,password,activated,registrationDate,locale,notifyOnUpdates,lastSyncTime,plan_id,plan_period_months,plan_price,plan_exp_date,plan_is_trial,plan_is_trial_used,q_hear,q_occupation,pp_subid,pp_payments,pp_since,pp_cancelled,apikey
    "2","John Smith,"john.smith#gmail.com","a","1","2004-07-23 14:19:32","en_US","1","2011-04-07 07:29:17","3",\N,\N,\N,"0","1",\N,\N,\N,\N,\N,\N,"d7734dce-4ae2-102a-8951-0040ca38ff83"

but the actual file has around 20000 records. I use the following R code to read it in:

    user = read.csv("~/Desktop/dbdump/users.txt", na.strings = "\\N", quote="")

The reason I have quote="" is that without it the import stops prematurely: I end up with a total of 9569 observations. While I don't understand exactly why quote="" overcomes this problem, it seems to do so. Except that it introduces other problems that I then have to 'fix'. The first one I saw is that the dates end up as strings which include the quotes, and which don't want to convert to actual dates when I use as.Date() on them.

Now I could fix the strings and hack my way through, but it's better to know more about what I am doing. Can someone explain:

Why does quote="" fix the 'bad data'?

What is a best-practice technique to figure out what is causing read.csv to stop early? (If I just look at the input data at +/- the indicated row, I don't see anything amiss.)

Here are the lines 'near' the 'problem'. I don't see the damage, do you?

    "16888","user1","user1#gmail.com","TeilS12","1","2008-01-19 08:47:45","en_US","0","2008-02-23 16:51:53","1",\N,\N,\N,"0","0","article","student",\N,\N,\N,\N,"ad949a8e-17ed-102b-9237-0040ca390025"
    "16889","user2","user2#gmail.com","Gaspar","1","2008-01-19 10:34:11","en_US","1",\N,"1",\N,\N,\N,"0","0","email","journalist",\N,\N,\N,\N,"8b90f63a-17fc-102b-9237-0040ca390025"
    "16890","user3","user3#gmail.com","boomblaadje","1","2008-01-19 14:36:54","en_US","0",\N,"1",\N,\N,\N,"0","0","article","student",\N,\N,\N,\N,"73f31f4a-181e-102b-9237-0040ca390025"
    "16891","user4","user4#gmail.com","mytyty","1","2008-01-19 15:10:45","en_US","1","2008-01-19 15:16:45","1",\N,\N,\N,"0","0","google-ad","student",\N,\N,\N,\N,"2e48e308-1823-102b-9237-0040ca390025"
    "16892","user5","user5#gmail.com","08091969","1","2008-01-19 15:12:50","en_US","1",\N,"1",\N,\N,\N,"0","0","dont","dont",\N,\N,\N,\N,"79051bc8-1823-102b-9237-0040ca390025"

* Update *

It's more tricky than that. Even though the total number of rows imported is 9569, if I look at the last few rows they correspond to the last few rows of data. Therefore I surmise that something happened during the import to cause a lot of rows to be skipped. In fact 15914 - 9569 = 6345 records. When I have quote="" in there I get 15914. So my question can be modified: Is there a way to get read.csv to report about rows it decides not to import?

* UPDATE 2 *

@Dwin, I had to remove na.strings="\\N" because the count.fields function doesn't permit it. With that, I get this output, which looks interesting but I don't understand it:

        3     4    22    23    24
        1    83 15466   178     4

Your second command produces a lot of data (and stops when max.print is reached). But the first row is this:

    [1] 2 4 2 3 5 3 3 3 5 3 3 3 2 3 4 2 3 2 2 3 2 2 4 2 4 3 5 4 3 4 3 3 3 3 3 2 4

which I don't understand if the output is supposed to show how many fields there are in each record of input. Clearly the first lines all have more than 2, 4, 2, etc. fields... I feel like I am getting closer, but still confused!
The count.fields function can be very useful in identifying where to look for malformed data.

This gives a tabulation of fields per line, ignoring quoting - possibly a problem if there are embedded commas:

    table( count.fields("~/Desktop/dbdump/users.txt", quote="", sep=",") )

This gives a tabulation ignoring both quotes and "#" (octothorpe) as a comment character:

    table( count.fields("~/Desktop/dbdump/users.txt", quote="", comment.char="") )

After seeing what you report for the first tabulation, most of which was as desired, you can get a list of the line positions with non-22 values (using the comma and non-quote settings):

    which( count.fields("~/Desktop/dbdump/users.txt", quote="", sep=",") != 22 )

Sometimes the problem can be solved with fill=TRUE if the only difficulty is missing commas at the ends of lines.
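Once which() has given you those positions, a quick way to eyeball the offending raw lines (a sketch, using the same file path as in the question) is to pull them out with readLines():

    # Read the raw text and print only the lines whose field count is not 22.
    raw_lines <- readLines("~/Desktop/dbdump/users.txt")
    bad <- which( count.fields("~/Desktop/dbdump/users.txt", quote="", sep=",") != 22 )
    raw_lines[bad]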
One problem I have spotted (thanks to data.table) is the missing quote (") after John Smith. Could this be a problem for other lines you have as well? If I add the "missing" quote after John Smith, it reads fine.

I saved this data to data.txt:

    userId, fullName,email,password,activated,registrationDate,locale,notifyOnUpdates,lastSyncTime,plan_id,plan_period_months,plan_price,plan_exp_date,plan_is_trial,plan_is_trial_used,q_hear,q_occupation,pp_subid,pp_payments,pp_since,pp_cancelled,apikey
    "2","John Smith","john.smith#gmail.com","a","1","2004-07-23 14:19:32","en_US","1","2011-04-07 07:29:17","3",\N,\N,\N,"0","1",\N,\N,\N,\N,\N,\N,"d7734dce-4ae2-102a-8951-0040ca38ff83"
    "16888","user1","user1#gmail.com","TeilS12","1","2008-01-19 08:47:45","en_US","0","2008-02-23 16:51:53","1",\N,\N,\N,"0","0","article","student",\N,\N,\N,\N,"ad949a8e-17ed-102b-9237-0040ca390025"
    "16889","user2","user2#gmail.com","Gaspar","1","2008-01-19 10:34:11","en_US","1",\N,"1",\N,\N,\N,"0","0","email","journalist",\N,\N,\N,\N,"8b90f63a-17fc-102b-9237-0040ca390025"
    "16890","user3","user3#gmail.com","boomblaadje","1","2008-01-19 14:36:54","en_US","0",\N,"1",\N,\N,\N,"0","0","article","student",\N,\N,\N,\N,"73f31f4a-181e-102b-9237-0040ca390025"
    "16891","user4","user4#gmail.com","mytyty","1","2008-01-19 15:10:45","en_US","1","2008-01-19 15:16:45","1",\N,\N,\N,"0","0","google-ad","student",\N,\N,\N,\N,"2e48e308-1823-102b-9237-0040ca390025"
    "16892","user5","user5#gmail.com","08091969","1","2008-01-19 15:12:50","en_US","1",\N,"1",\N,\N,\N,"0","0","dont","dont",\N,\N,\N,\N,"79051bc8-1823-102b-9237-0040ca390025"

And this is the code. Both fread and read.csv work fine.

    require(data.table)

    dat1 <- fread("data.txt", header = T, na.strings = "\\N")
    dat1

    dat2 <- read.csv("data.txt", header = T, na.strings = "\\N")
    dat2