I have a long script that sorts things and finds unique items. Everything was working fine until I added more data to my database; then things started breaking. I've gone back and cleaned up the database to fix some errors, but my code is now failing elsewhere, and I can't figure out why.
Error:
Error in `$<-.data.frame`(`*tmp*`, "type", value = character(0)) :
replacement has 0 rows, data has 144
It happens at:
OPdf$"type" <- IPdf$"type"[OPdf$"name"==IPdf$"name"]
IPdf: (two columns of characters)
#   type     name
1   ball     Test-7
2   square   bob-allen
3   cat      HHH_67
4   groot    765-6
OPdf: (one column of factors)
#   name
1   bob-allen
2   765-6
3   HHH_67
4   Test-7
I have the same number of rows in each data frame. If I load my original test data set into the script, everything works fine. I have verified that there aren't any weird characters in my name column that would throw something off.
I'm at a loss.
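A likely cause: == compares the two name columns element-wise, so it only lines up when both data frames list the names in the same row order. When the orders differ (as in the sample above), the comparison can come back all FALSE, leaving a zero-length replacement, which is exactly the error reported. A sketch of an order-independent fix using match(), assuming each name in OPdf appears exactly once in IPdf:
# Look up each OPdf name in IPdf and pull the matching type;
# names with no match come back as NA instead of erroring
OPdf$type <- IPdf$type[match(OPdf$name, IPdf$name)]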
I used
milsa <- edit(data.frame())
to open the R Data Editor and type in my table's data.
My problem is: my table has 36 rows, but for some reason 39 rows appear in the editor (the 3 additional rows are all filled with NA).
When I try to use:
length(civil)
I'm getting 39 instead of 36. How can I solve this? I tried fix(milsa), but I can't delete the additional rows with it.
PS: civil is a column of milsa.
Subset with the index:
You can reassign the data.frame to itself with only the rows you want to keep.
milsa <- milsa[1:36,]
To delete specific rows:
milsa <- milsa[-c(row_num1, row_num2, row_num3), ]
To delete rows containing one or more NAs:
milsa <- na.omit(milsa)
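To confirm the cleanup worked (a quick check, assuming civil is a column of milsa as noted above):
nrow(milsa)          # rows in the data frame, should now be 36
length(milsa$civil)  # length of the civil column, also 36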
Set up the data frame:
mta<-c("ldall","nold","ldall","nold","ldall","nold","ldall","nold")
mtb<-c(491, 28581,241,5882,365,7398,512,10887)
df1<-data.frame(mta,mtb)
I can order my dataframe in the normal way. This works fine.
df1[order(mtb),]
But if I change the names of the columns
names(df1)<-c("mta1","mtb1")
df1[order(mtb1),]
This gives the error
Error in order(mtb1) : object 'mtb1' not found.
If I use the old column name in the instruction it works, although the output shows the new column name.
df1[order(mtb),]
If I change the name back to the original, the command appears to work normally. Can anyone explain? Is order using a hidden version of the column name?
order(mtb) keeps working because it never looks inside df1 at all: it finds the free-standing mtb vector still sitting in your workspace from when you built the data frame. Renaming df1's columns doesn't create a variable called mtb1, hence the "object 'mtb1' not found" error. Refer to the column explicitly instead:
mta<-c("ldall","nold","ldall","nold","ldall","nold","ldall","nold")
mtb<-c(491, 28581,241,5882,365,7398,512,10887)
df1<-data.frame(mta,mtb)
# Change column names
colnames(df1) <- c("mta1","mtb1")
# Sort column mtb1 from the data frame
df1[order(df1$mtb1), ]
   mta1  mtb1
3 ldall   241
5 ldall   365
1 ldall   491
7 ldall   512
4  nold  5882
6  nold  7398
8  nold 10887
2  nold 28581
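If you would rather not repeat the df1$ prefix, with() makes the columns of df1 visible as variables while the expression is evaluated (an alternative spelling, same result):
# order(mtb1) is evaluated inside df1, so the column name is found
df1[with(df1, order(mtb1)), ]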
I have a very large CSV file (1.4 million rows). It is supposed to have 22 fields, i.e. 21 commas, in each row. It was created by compiling quarterly text files into one large text file so that I could import it into SQL. In past files, one field was sometimes missing. I don't have time to go row by row and check for this.
In R, is there a way to verify that each row has 22 fields or 21 commas? Below is a small sample data set. The possibly missing field is the 0 in the 10th slot.
32,01,01,01,01,01,000000,123,456,0,132,345,456,456,789,235,256,88,4,1,2,1
32,01,01,01,01,01,000001,123,456,0,132,345,456,456,789,235,256,88,5,1,2,1
You can use the base R function count.fields to do this:
count.fields(tmp, sep=",")
[1] 22 22
The input for this function is the name of a file or a connection. Below, I supplied a textConnection. For large files, you would probably want to feed this into table:
table(count.fields(tmp, sep=","))
Note that this can also be used to count the number of rows in a file using length, similar to the output of wc -l in the *nix OSs.
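For example, on a file on disk (the file name here is a placeholder):
# one count per row, so length() gives the row count, akin to wc -l
length(count.fields("yourfile.csv", sep = ","))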
data
tmp <- textConnection(
"32,01,01,01,01,01,000000,123,456,0,132,345,456,456,789,235,256,88,4,1,2,1
32,01,01,01,01,01,000001,123,456,0,132,345,456,456,789,235,256,88,5,1,2,1"
)
Note that once the file has been parsed into a data frame df, every row necessarily has the same number of columns, so apply(df, 1, length) just repeats ncol(df) and won't find the bad rows. Check the raw lines instead (the file name here is a placeholder):
lines <- readLines("yourfile.csv")
# fields per line; assumes no quoted fields containing commas
table(lengths(strsplit(lines, ",")))
I apologize in advance for the lack of reproducibility here. I am doing an analysis on a very large (for me) dataset from the CMS Open Payments database.
There are four files I downloaded from that website, read into R using readr, and manipulated a bit to make them smaller (column removal), then stuck together using rbind. I would like to write my pared-down file out to an external hard drive so I don't have to read in all the data and redo the paring each time I want to work on it. (Obviously, it's all scripted, but it takes about 45 minutes, so I'd like to avoid it if possible.)
So I wrote out the data and read it back in, but now I am getting different results. Below is about as close as I can get to a good example. The data is named sa_all. There is a column in the table for the source (src). It can only take on two values, gen or res, and it is added as part of the analysis rather than coming in with the data.
table(sa_all$src)
gen res
14837291 822559
So I save the sa_all dataframe into a CSV file.
write.csv(sa_all, 'D:\\Open_Payments\\data\\written_files\\sa_all.csv',
row.names = FALSE)
Then I read it back in:
sa_all2 <- read_csv('D:\\Open_Payments\\data\\written_files\\sa_all.csv')
table(sa_all2$src)
g gen res
1 14837289 822559
I did receive the following parsing warnings.
Warning: 4 parsing failures.
row col expected actual
5454739 pmt_nature embedded null
7849361 src delimiter or quote 2
7849361 src embedded null
7849361 NA 28 columns 54 columns
Since I manually add the src column and it can only take on two values, I don't see how this could cause any parsing errors.
Has anyone had any similar problems using readr? Thank you.
Just to follow up on the comment:
write_csv(sa_all, 'D:\\Open_Payments\\data\\written_files\\sa_all.csv')
sa_all2a <- read_csv('D:\\Open_Payments\\data\\written_files\\sa_all.csv')
Warning: 83 parsing failures.
row col expected actual
1535657 drug2 embedded null
1535657 NA 28 columns 25 columns
1535748 drug1 embedded null
1535748 year an integer No
1535748 NA 28 columns 27 columns
Even more parsing errors, and it looks like some columns are getting shuffled entirely:
table(sa_all2a$src)
100000000278 Allergan Inc. gen GlaxoSmithKline, LLC.
1 1 14837267 1
No res
1 822559
There are columns for manufacturer names, and it looks like those are leaking into the src column when I use write_csv.
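If the CSV is only a local cache between sessions, one way to sidestep the round-trip problems entirely (a suggestion, not from the thread) is to skip CSV and store the object in R's binary format with base R's saveRDS/readRDS. Nothing gets re-parsed on the way back in, so the embedded nulls and stray delimiters the warnings point to can't shift columns:
# Save the pared-down data frame as a binary RDS file
# (same directory as in the question, with an .rds extension)
saveRDS(sa_all, 'D:\\Open_Payments\\data\\written_files\\sa_all.rds')
# Read it back with column types and values intact
sa_all2 <- readRDS('D:\\Open_Payments\\data\\written_files\\sa_all.rds')
table(sa_all2$src)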