First, I can't understand the aggregate function and cbind; I need an explanation in really simple words. Second, I have this data:
permno number mean std
1 10107 120 0.0117174000 0.06802718
2 11850 120 0.0024398083 0.04594591
3 12060 120 0.0005072167 0.08544500
4 12490 120 0.0063569167 0.05325215
5 14593 120 0.0200060583 0.08865493
6 19561 120 0.0154743500 0.07771348
7 25785 120 0.0184815583 0.16510082
8 27983 120 0.0025951333 0.09538822
9 55976 120 0.0092889000 0.04812975
10 59328 120 0.0098526167 0.07135423
I need to process it with
data_processed2 <- aggregate(cbind(return)~permno, Data_summary, median)
I can't understand this command. Please explain it to me very simply. Thank you!
cbind takes two or more tables (data frames), puts them side by side, and makes them into one big table. So for example, if you have one table with columns A, B and C, and another with columns D and E, after you cbind them you'll have one table with five columns: A, B, C, D and E. For the rows, cbind assumes all tables are in the same order.
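For instance, a minimal sketch with made-up data:
df_abc <- data.frame(A = 1:3, B = 4:6, C = 7:9)
df_de  <- data.frame(D = c("x", "y", "z"), E = 10:12)
cbind(df_abc, df_de)  # one data frame with columns A, B, C, D, E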
As noted by Rui, in your example cbind doesn't do anything, because return is not a table, and even if it were, it's only a single column.
aggregate takes a table, divides it up by some variable, and then calculates a statistic on a variable within each group. For example, if I have data for sales by month and day of month, I can aggregate by month and calculate the average sales per day for each of the months.
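A minimal sketch of that sales example (made-up numbers, not from your data):
sales <- data.frame(month = c("Jan", "Jan", "Feb", "Feb"),
                    daily_sales = c(100, 120, 80, 90))
aggregate(daily_sales ~ month, sales, mean)
#   month daily_sales
# 1   Feb          85
# 2   Jan         110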
The command you provided uses the following syntax:
aggregate(VARIABLES~GROUPING, DATA, FUNCTION)
Variables (here cbind(return), which doesn't really make sense) is the list of all the variables for which your statistic will be calculated.
Grouping (permno) is the variable by which you will break the data into groups (in the sample data you provided each row has a unique value for this variable, so that doesn't really make sense either).
Data is the dataframe you're using.
Function is median.
So this call will break Data_summary into groups that have the same permno, and calculate the median for each of the columns.
With the data you provided, you'd basically get the same table back, since you're grouping the data into groups of one row each. Actually, since your variable list (return) doesn't match any column in the data you showed, as far as I can tell you'll get nothing back.
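If the goal was a median per permno, a hedged guess at the intended call, using columns that actually exist in your table:
data_processed2 <- aggregate(cbind(mean, std) ~ permno, Data_summary, median)
# with one row per permno this just hands back the same values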
I have two datasets. I am using R for a forest plot.
Data one:
coef lower upper
Males vs Female 0.04088551 0.03483956 0.04693145
85 vs 65 years -0.05515741 -0.06508088 -0.04523394
Charlsons Medium vs Low -0.03833060 -0.04727946 -0.02938173
Charlsons High vs Low -0.09247572 -0.12020001 -0.06475144
Data two:
..1 mean lower upper
1 A 1.4194 0.8560 2.3536
2 B 0.6574 0.2333 1.8523
3 C 0.7751 0.4012 1.4973
4 D 1.0831 0.6587 1.7811
5 E 1.3362 0.6559 2.7221
1. I need data two to look like data one (in structure, not in values). Data two is a data frame; what kind of object do you think data one is?
2. Data one shows no row numbers, but data two does. I need the row numbers to be gone.
3. I need data two as a matrix, but when I convert it, the row labels end up being treated as a column when I make the forest plot.
Can you please suggest anything?
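A hedged sketch of one possibility (the ..1 label column is renamed to label here purely for illustration): convert only the numeric columns to a matrix and move the labels into its row names, which mirrors the structure of data one.
df2 <- read.table(text = "
label   mean   lower  upper
A       1.4194 0.8560 2.3536
B       0.6574 0.2333 1.8523
C       0.7751 0.4012 1.4973
D       1.0831 0.6587 1.7811
E       1.3362 0.6559 2.7221", header = TRUE)
m <- as.matrix(df2[, c("mean", "lower", "upper")])  # numeric columns only
rownames(m) <- df2$label  # labels live in the row names, not in a column
m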
It seems that when we have duplicated data, most of the time we want to remove the duplicates. Let's say we do not want to exclude them, but instead want to assign them a new value.
Take the following data as an example:
b <- c(1:100,1:99,1:104,1:105,1:105)
So we see that the values 1-99 are repeated 5 times, the number 100 is repeated 4 times, the numbers 101-104 are repeated 3 times, and 105 is repeated twice.
How can one search through b (ideally in sequential order), find a repeated/duplicated number, and then assign it a new value?
Try this if you're interested in assigning one (universal) new value:
b <- c(1:100,1:99,1:104,1:105,1:105)
b[duplicated(b)] <- 888  # new value
The duplicated command helps you spot the positions of all values that are duplicates in b.
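If instead you want every duplicate to get its own distinct new value, a hedged variation on the same idea:
b <- c(1:100, 1:99, 1:104, 1:105, 1:105)
dup <- duplicated(b)
b[dup] <- max(b) + seq_len(sum(dup))  # 106, 107, ... one new value per duplicate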
I have a data.table, DT, with columns A, B and C. I want only one row per unique A, and I want to choose that row based on the value of C (keep the largest C).
Based on this (incredibly helpful) SO page, Use data.table to get first of subgroup based on a variable, I tried something like this:
library(data.table)
test <- data.table(A=c(1:5,1:3),B=c(1:8),C=c(11:18))
setkey(test,A,C)
test[,.SD[.N],by="A"]
In my test case, this gives me an answer that seems right:
# A B C
# 1: 1 6 16
# 2: 2 7 17
# 3: 3 8 18
# 4: 4 4 14
# 5: 5 5 15
And, as expected, the number of rows matches the number of unique entries for "A" in my DT:
length(unique(test$A))
# 5
However, when I apply this to my actual dataset, I am missing approximately 20% of my initially ~2 million rows.
I cannot seem to put together a test dataset that will recreate this type of a loss. There are no null values in the actual dataset. What else could be a factor in a dataset that would cause a discrepancy between the number of results from something like test[,.SD[.N],by="A"] and length(unique(test$A))?
Thanks to @eddi's debugging coaching, here's the answer, at least for my dataset: differential handling of numbers in scientific notation.
In particular: in my actual dataset, columns A and B held very long numbers that, upon import from SQL to R, had been read in scientific notation. It turns out the test[,.SD[.N],by="A"] and length(unique(test$A)) commands were handling this differently. length(unique(test$A)) preserved the difference between two values that differed only in a small digit, one not visible in the collapsed scientific-notation format printed as output, but test[,.SD[.N],by="A"] was, in essence, rounding the values and thus collapsing some of them together.
(I feel foolish that I didn't catch this myself before posting, but I much appreciate the help. I hope this spares someone else the same confusion!)
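For anyone who wants to reproduce the effect, a hedged illustration; the exact behaviour depends on data.table's numeric key rounding (see ?setNumericRounding), which older versions of data.table defaulted to 2 bytes:
library(data.table)
x <- c(1234567890123456, 1234567890123457)  # differ only in the last digit
print(x)                       # both display as 1.234568e+15
length(unique(x))              # 2 -- unique() keeps the doubles distinct
setNumericRounding(2)          # round off the last 2 bytes when grouping
dt <- data.table(A = x, C = 1:2)
nrow(dt[, .SD[.N], by = "A"])  # 1 -- grouping collapses the two values
setNumericRounding(0)          # restore the current default (no rounding)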
I have a function that returns 50 data values, in a one-column matrix, for each of 100 different data frames. However, due to circumstances, the function sometimes returns a "NaN" for one or more of the 50 values in a data frame. This perturbs the data, as a data frame that has one or more NaN is now considered to have only 49 or 48 usable values.
df1 df2
112.4563 112.4563
110.1210 110.1210
109.2143 109.2143
NaN 108.1806 <- now uneven and can not perform iterations
107.3700 107.3700
How can I tell my computer/subsequent commands, when iterating through these 100 fifty-row data frames, to "ignore" the NaN values in a way that lets each of the 100 still count as having 50 values and stay consistently iterable? Or is it even possible to have a varying iteration range, e.g. for(i in 1:n) with n between 47 and 50, so that the computer forgives the variance in row numbers?
This also matters with respect to graphs.
As someone else has noted, it can also depend on what you want to do with the NaN values. However, to answer the question about an iterative range, you can do something like the following. I'll use the data frame mtcars as an example.
df <- mtcars
length(df$mpg)        # 32, the number of values in the mpg column
length(rownames(df))  # 32, the number of rows
length(colnames(df))  # 11, the number of columns
If you need to iterate over the total number of rows in your data frame, you can use length(rownames(df)). If you need to iterate over the number of columns instead, you can use length(colnames(df)).
In a for loop, you would do the following:
for (i in 1:length(rownames(df))) {
  # iterative code
}
This will iterate over the total number of rows in a given data frame.
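As for forgiving the NaN values themselves, a hedged sketch (df1 mirrors the one-column example from the question):
df1 <- data.frame(x = c(112.4563, 110.1210, 109.2143, NaN, 107.3700))
for (i in 1:length(rownames(df1))) {
  if (is.nan(df1$x[i])) next  # skip the NaN and keep iterating
  # iterative code using df1$x[i]
}
clean <- df1[!is.nan(df1$x), , drop = FALSE]  # or drop the NaN rows up front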
I've been trying to figure this out for a while, but haven't been able to do so. I found a lot of similar questions which didn't help at all.
I have around 43000 records in data frame in R. The date column is in the format "2011-11-15 02:00:01", and the other column is the count. The structure of the data frame:
str(results)
'data.frame': 43070 obs. of 2 variables:
$ dates: Factor w/ 43070 levels "2011-11-15 02:00:01",..: 1 2 3 4 5 6 7 8 9 10 ...
$ count: num 1 2 1 1 1 1 2 3 1 2 ...
How can I get the total count per min?
And I also want to convert the results data frame to JSON. I used the rjson package, but it converted the entire data frame into a single JSON element. When I inserted it into MongoDB, there was only one _id for all 43000 records. What did I do wrong?
You can use the xts package to get the counts/minute quite easily.
install.packages("xts")
require("xts")
results_xts <- xts(results$count, order.by = as.POSIXlt(results$dates))
This converts your data frame to an xts object. There are a bunch of functions (apply.daily, apply.yearly, etc.) in xts that apply a function over different time frames, but there isn't one for by-minute. Fortunately the code for those functions is super simple, so just run:
ep <- endpoints(results_xts, "minutes")
period.apply(results_xts, ep, FUN = sum)
Sorry, I don't know the answer to your other question.
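If it helps, a hedged pointer (untested against your data): the jsonlite package can serialize a data frame as one JSON object per row, which is the shape MongoDB expects, rather than the single column-wise element rjson produced.
library(jsonlite)
toJSON(head(results), dataframe = "rows")    # one JSON object per row
stream_out(results, file("results.ndjson"))  # newline-delimited JSON for mongoimport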
Asterisk here, untested, but here's my solution for getting the counts per minute; maybe someone will chime in on the JSON part, as I'm not familiar with that.
Here's my example time series and count:
now <- Sys.time()
tseq <- seq(now, length.out = 130, by = "sec")
count <- rep(1, 130)
We find the index of where our minutes switch via the following:
mins<-c(0,diff(floor(cumsum(c(0,diff(tseq)))/60)))
indxs<-which(mins%in%1)
Let me break that down (as there are many things nested in there).
First we diff over the time sequence, then add a 0 on the front because we lose an observation with diff
Second, take the cumulative sum of the diff-ed vector, giving us the elapsed seconds at each spot (this could probably also be done by a simple format call over the vector of times).
Third, divide that vector, now the seconds in each spot, by 60 so we get a value in each spot corresponding to the minutes.
Fourth, floor it so we get integers
Fifth, diff that vector, so we get 0's except for 1's where the minute switches.
Sixth, add a 0 to the front of that vector, since we again lose an observation with the diff.
Then get the indices of the 1's with the which call.
Then we find the starts and ends of our minutes:
startpoints<-indxs
endpoints<-c(indxs[2:length(indxs)], length(mins))
Then we simply sum over the corresponding subsets:
mapply(function(start, end) sum(count[start:end]), start=startpoints, end=endpoints)
#[1] 61 10
We get 61 for the first point because we include the 0th and 60th second for the first subset
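For comparison, a hedged base-R cross-check: truncate each timestamp to its clock minute and sum with tapply. Note this groups by clock minute rather than by elapsed minute from the first observation, so the split points can differ from the diff-based approach above.
minute <- format(tseq, "%Y-%m-%d %H:%M")
tapply(count, minute, sum)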