When I count the rows in my data table by store code using the following:
DailyProduct[, .N, by = ANA_Code]
I get 14 store codes with a total of 120,237 rows, the same number of rows as in DailyProduct, which is great!
If I then get a unique list of the store codes:
unique(DailyProduct$ANA_Code)
I get 21 store codes which is more than I got above but is actually the correct number.
If I convert the store code from numeric to a factor, everything shows as expected: I get 21 stores and the row counts add up to 120,237.
This also causes a problem when I aggregate the data: the sales value is correct, but 7 of the store codes are missing.
Is there a fundamental difference in how data table treats numeric vs. factor when performing these operations?
I don't understand why this is happening, so I can't provide a reproducible example; apologies for that.
This can happen if 'ANA_Code' holds integers too large for R's standard integer type. fread reads such a column as integer64, and without the bit64 package loaded those values may be displayed or grouped incorrectly. One fix is to load the bit64 library before reading the file with fread:
library(bit64)
library(data.table)
DailyProduct <- fread("yourfile.csv")
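Since store codes are identifiers rather than quantities, another option is to read the column as text, which sidesteps the integer64 issue entirely. The following is only a sketch: the 13-digit codes and the Sales column are made up for illustration, and fread's text argument stands in for the real file.

```r
library(data.table)

# Hypothetical data: two distinct 13-digit store codes over three rows
csv <- "ANA_Code,Sales\n5012345678901,10\n5012345678902,20\n5012345678901,5\n"

# Force ANA_Code to be read as character instead of a large integer
DailyProduct <- fread(text = csv, colClasses = c(ANA_Code = "character"))

# Counting and listing the codes should now agree
counts <- DailyProduct[, .N, by = ANA_Code]
codes  <- unique(DailyProduct$ANA_Code)
```

With the real file, the same colClasses argument can be passed to fread("yourfile.csv", ...).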
I have a fairly large data set in csv format that I'd like to read into R. The data is annoyingly structured (my own fault) as follows:
,US912828LJ77,,US912810ED64,,US912828D804,...
17/08/2009,101.328125,15/08/1989,99.6171875,02/09/2014,99.7265625,...
The second line style is repeated a few thousand times. Each pair of columns represents a time series, and the series have differing lengths (so the data is not rectangular).
If I use something like
>rawdata <- read.csv("filename.csv")
I get a dataframe with all the blank entries padded with NA, and the odd columns forced to a factor datatype.
What I'd like to ultimately get to is either a set of timeseries objects (for each pair of columns) named after every even entry in the first row (the "US912828LJ77" fields) or a single dataframe with row labels as dates running from the minimum of (min of each odd column) to max of (max of each odd column).
I can't imagine I'm the only mook to put together a dataset in such an unhelpful structure but I can't see any suggestions out there for how to deal with this. Any help would be greatly appreciated!
First you need to parse every odd column as a date:
odd.cols = names(rawdata)[seq(1, dim(rawdata)[2] - 1, 2)]
for (dateCol in odd.cols) {
  rawdata[[dateCol]] = as.Date(rawdata[[dateCol]], "%d/%m/%Y")
}
Now I guess the problem is straightforward: you just need to find the min and max values per date column, create a vector running from the min date to the max date, join it with rawdata and handle missing values for your US* columns.
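One way the join could look is sketched below. The toy rawdata, the column names US_A and US_B and the prices are invented for illustration; the real data would come from read.csv. Each value column is aligned to a common daily date vector with match(), and dates without an observation are left as NA.

```r
# Toy stand-in for rawdata: two (date, price) column pairs of different lengths
rawdata <- data.frame(
  X    = c("17/08/2009", "18/08/2009", "19/08/2009"),
  US_A = c(101.3, 101.5, 101.2),
  X.1  = c("18/08/2009", "20/08/2009", NA),
  US_B = c(99.6, 99.8, NA),
  stringsAsFactors = FALSE
)

# Parse every odd column as a date
odd <- seq(1, ncol(rawdata) - 1, by = 2)
for (j in odd) rawdata[[j]] <- as.Date(rawdata[[j]], "%d/%m/%Y")

# Daily sequence from the earliest to the latest date across all series
all.dates <- do.call(c, lapply(odd, function(j) rawdata[[j]]))
result <- data.frame(
  date = seq(min(all.dates, na.rm = TRUE), max(all.dates, na.rm = TRUE), by = "day")
)

# Align each value column to the common date vector; unmatched dates become NA
for (j in odd) {
  series <- names(rawdata)[j + 1]
  result[[series]] <- rawdata[[j + 1]][match(result$date, rawdata[[j]])]
}
```

How the remaining NA values should be handled (left as gaps, carried forward, interpolated) depends on the analysis.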
I am working in R. What I want to do is make a table or a graph that shows, for each participant, their missing values. I have 4700+ participants, and for each question there are between 20 and 40 missing values. I would like to represent the missing values in such a way that I can see which people did not answer the questions, and possibly check whether there is a pattern in the missing values. I have done the following:
Count of complete cases in a data frame named 'mydata'
sum(complete.cases(mydata))
Count of incomplete cases
sum(!complete.cases(mydata$Variable1))
Which cases (row numbers) are incomplete?
which(!complete.cases(mydata$Variable1))
I then got a list of numbers that I am not quite sure how to interpret. At first I thought these were the patient numbers, but then I noticed that this is not the case.
I also tried making subsets with only the missing values, but then I literally only see how many missing values there are, not who they belong to.
Could somebody help me? Thanks!
Zas
If there is a column that identifies each row of the data.frame mydata, say a patient number column patient_no, then you can easily find the patient numbers of the people with missing values:
> mydata <- data.frame(patient_no = 1:5, variable1 = c(NA,NA,1,2,3))
> mydata[!complete.cases(mydata$variable1),'patient_no']
[1] 1 2
If you want to consider the pattern in which the users have missed a particular question, then this might be useful for you:
Assumption: Except Column 1, all other columns represent the columns related to questions.
> lapply(mydata[,-1],function(x){mydata[!complete.cases(x),'patient_no']})
Remember that R automatically numbers the observations in your data set. For example, if your data has 20 observations (20 rows), R numbers them from 1 to 20; these row numbers are not part of your original data. The result of which(!complete.cases(mydata$Variable1)) corresponds to those row numbers: it lists the rows of your data set in which Variable1 is missing.
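To illustrate the difference between row numbers and patient numbers, here is a small sketch. The patient numbers and the question columns q1 and q2 are invented for the example.

```r
# Hypothetical data: patient numbers deliberately differ from the row numbers
mydata <- data.frame(
  patient_no = c(101, 102, 103, 104),
  q1 = c(NA, 2, 3, NA),
  q2 = c(1, NA, 3, NA)
)

# Row positions where q1 is missing (these are row numbers, not patient numbers)
missing.rows <- which(!complete.cases(mydata$q1))

# Translate the row positions into patient numbers
missing.patients <- mydata$patient_no[missing.rows]

# Number of unanswered questions per participant, to look for patterns
n.missing <- rowSums(is.na(mydata[, -1]))
```

The matrix is.na(mydata[, -1]) itself can also be inspected or plotted to see which questions tend to be skipped together.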
I want to aggregate my dataset and count how often a particular kind of disease occurs on a given date. (I don't use duplicated() because I want all rows, not only the duplicated ones.)
My original data set looks like:
id dat kinds kind
AE00302 2011-11-20 valv 1
AE00302 2011-10-31 vask 2
(of course my data.frame is much larger)
I try this:
xagg<-aggregate(kind~id+dat+kinds,subx,length)
names(xagg)<-c("id","dat","kinds","kindn")
and get:
id dat kinds kindn
AE00302 2011-10-31 valv 1
AE00302 2011-11-20 vask 1
I wonder why R gets the 'dat' and 'kinds' columns wrong.
Has anybody an idea?
I still don't know why.
But I found out that aggregate goes wrong because of columns I don't use for aggregating.
Therefore, these steps solve the problem for me:
# 1st step: reduce the data.frame to only the needed columns
# 2nd Step: aggregate the reduced data.frame
# 3rd Step: merge aggregated data to reduced dataset
# 4th step: remove duplicated rows from reduced dataset (if they occur)
# 5th step: merge reduced dataset without duplicated data to original dataset
Maybe the problem occurs if there are duplicated rows in the aggregated data.frame.
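The five steps above could be sketched as follows. This is one possible reading of them, not the poster's actual code: the toy subx, its extra column 'other' and the merge keys are assumptions made for illustration.

```r
# Toy stand-in for subx, with one extra column not used for aggregating
subx <- data.frame(
  id    = c("AE00302", "AE00302", "AE00302"),
  dat   = as.Date(c("2011-11-20", "2011-10-31", "2011-11-20")),
  kinds = c("valv", "vask", "valv"),
  kind  = c(1, 2, 1),
  other = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

# 1st step: reduce the data.frame to only the needed columns
red <- subx[, c("id", "dat", "kinds", "kind")]

# 2nd step: aggregate the reduced data.frame
xagg <- aggregate(kind ~ id + dat + kinds, red, length)
names(xagg)[4] <- "kindn"

# 3rd step: merge aggregated data to reduced dataset
red <- merge(red, xagg, by = c("id", "dat", "kinds"))

# 4th step: remove duplicated rows from reduced dataset (if they occur)
red <- unique(red)

# 5th step: merge reduced dataset back to the original dataset
result <- merge(subx, red, by = c("id", "dat", "kinds", "kind"))
```

Here result keeps every original row and adds kindn, the number of rows sharing the same id, dat and kinds.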
Thanks for all your help, questions and attempts to solve my problem!
elchvonoslo
I am a new R user and an inexperienced coder, and I have a data handling problem. Hopefully someone can help:
I have a data.frame with 3 columns (firm, year, class) and about 50.000 rows. I want to generate and store for every firm a (class x year) matrix with class counts as the elements in the matrix. Every matrix would be automatically named something like firm.name and stored so that I can use them afterwards for computations. Ideally, I'd be able to change the simple class counts into a function of values in columns 4 and 5 (backward and forward citations)
I am looking at 40 firms, 30 years, and about 1500 classes (so many firm-year-class counts are zero).
I realise I can get most of what I need (for counts) by simply using table(class,year,firm) as these columns have the same length. However, I don't know how to either store or access the matrices this function generates...
Any help would be greatly appreciated!
Simon
So, your question is how to deal with a table object?
Example:
#note the assignment operator
mytable <- with(ChickWeight, table(cut(weight, c(0,100,200,Inf)), Diet, Chick))
#access the data for the first chick
mytable[,,1]
#turn the table object into a data.frame
as.data.frame(mytable)
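Applied to the firm, year and class columns from the question, the per-firm matrices could be kept in a named list, roughly as sketched below. The toy data and the name firm.mats are made up for the example.

```r
# Hypothetical toy data with the three columns from the question
df <- data.frame(
  firm  = c("A", "A", "B", "B", "B"),
  year  = c(2000, 2001, 2000, 2000, 2001),
  class = c("x", "y", "x", "x", "y")
)

# Three-way table of counts: class x year x firm
tab <- with(df, table(class, year, firm))

# Split the third dimension into one (class x year) matrix per firm
firms <- dimnames(tab)[[3]]
firm.mats <- lapply(firms, function(f) tab[, , f])
names(firm.mats) <- firms

# Access a single firm's matrix, or a single count, by name
firm.mats[["B"]]
firm.mats[["B"]]["x", "2000"]
```

A list avoids creating one separately named object per firm and can be processed with lapply afterwards.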
I have a dataset with 7 million records.
I need to filter the data to only show about 9000 of these.
The first field dmg is effectively the primary key and takes the format 1-Apr-123456. There are about 12 occurrences of each dmg value.
Another column is O_Y and takes the value of 0 or 1. It is most often 0, but 1 on about 900 occasions.
I would like to return all the rows with the same dmg value, where at least one of those records has an O_Y value of 1.
I recommend using data.table for doing this (fread in data.table will be quite handy in reading in the large data set too as you say you have enough RAM).
I am not sure that the following is the best way to do this in data.table but, at least, it should get you started. Hopefully, someone else will come along and list the most idiomatic data.table way for this. But this is what I can think of right now:
Assume your data.table is called DT and has two columns, dmg and O_Y. Use O_Y as the key for DT and subset DT for O_Y == 1 (DT[.(1)] in data.table syntax). Now find the corresponding dmg values. The unique set of these dmg values is your keys.with.ones. All this is succinctly done as follows:
setkey(DT, O_Y)
keys.with.ones <- unique(DT[.(1), dmg])
Next, we need to extract rows corresponding to these values of dmg. For this we need to change the key for DT to dmg and extract the rows corresponding to the keys above:
setkey(DT, dmg)
DT.filtered <- DT[.(keys.with.ones)]
And we are done. :)
Please refer to ?data.table to figure out a better method if possible and let us know.
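As one possible alternative that avoids changing the key twice, the same filter can be written with %in%. This is a sketch on invented toy data, not necessarily the most idiomatic or fastest approach for 7 million rows.

```r
library(data.table)

# Hypothetical toy data: three dmg groups, two of which contain an O_Y of 1
DT <- data.table(
  dmg = c("1-Apr-1", "1-Apr-1", "1-Apr-2", "1-Apr-2", "1-Apr-3"),
  O_Y = c(0, 1, 0, 0, 1)
)

# Keep every row whose dmg appears in at least one row with O_Y == 1
DT.filtered <- DT[dmg %in% DT[O_Y == 1, unique(dmg)]]
```

A grouped form, DT[, if (any(O_Y == 1)) .SD, by = dmg], expresses the same condition per group.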