How to assign different index numbers to specified groups of observations in R?

What I want to do is assign a value of 1 to the first 1/3 of the observations in my data, then a value of 2 to the second 1/3, and finally a value of 3 to the third 1/3.
Taking into account that my data consists of 30 observations, I wrote the following code:
c1 <- c(rep(1,10),rep(2,10),rep(3,10))
which I combined with my data using cbind:
gala2 <- cbind(data,c1)
Then, for the first 10 observations (my first 1/3), the value of c1 is 1, for the next ten observations (second 1/3) the value of c1 is 2 and for the last ten observations (my third 1/3) the value of c1 is 3.
This works perfectly fine, but I wanted to ask if there is a way to do this in a more "abstract" way. That is, can I tell R to assign the value of 1 to the first 1/3 of the data, the value of 2 to the second 1/3, and the value of 3 to the third 1/3?
Best regards,

Yes there is, take a look at cut(). Since you want the groups to follow position rather than value, cut the row indices. To illustrate with your example:
cut(seq_len(nrow(yourData)), 3, labels = FALSE)
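For example, with the 30-row case (a minimal sketch, assuming your data frame is called data, as in your question):
gala2 <- cbind(data, c1 = cut(seq_len(nrow(data)), 3, labels = FALSE))
table(gala2$c1)
#  1  2  3
# 10 10 10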

You can use
sort(rep_len(seq(3), length(c1)))
where c1 is your vector.
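For 30 observations this reproduces your hand-built c1:
sort(rep_len(seq(3), 30))
# ten 1s, then ten 2s, then ten 3s
If the length is not divisible by 3, the group sizes simply differ by one.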

Related

Create lists that contain the row numbers for which column i contains the maximum value of that row

In a dataframe of 4 columns, I'm looking for an elegant way to get 3 lists containing the names from column 1, according to whether the maximum of that name's row is in column 2, 3, or 4.
the first column contains parameter names,
column 2 a shapiro test outcome on the raw data of parameter x
column 3, shapiro test outcome of log10 transformed data for parameter x
column 4, shapiro test outcome of a custom transformation given by the user for parameter x
if this is the data:
Parameter xval xlog10val xcustomval
1 FWS.Range 0.62233371 0.9741614 0.9619065
2 FL.Red.Range 0.48195980 0.9855781 0.9643206
3 FL.Orange.Range 0.43338087 0.9727243 0.8239867
4 FL.Yellow.Range 0.53554943 0.9022795 0.9223407
5 FL.Red.Gradient 0.35194524 0.9905047 0.5718224
6 SWS.Range 0.46932823 0.9487955 0.9825318
7 SWS.Length 0.02927791 0.4565962 0.7309313
8 FWS.Fill.factor 0.93764311 0.8039806 0.0000000
9 FL.Red.Total 0.22437754 0.9655873 0.9923307
QUESTION: how to get a list that tells me all parameter names where xlog10val is the highest of the three columns (xval, xlog10val, xcustomval)
Detailed explanation (feel free to skip):
List 1, the rows where xval is the highest value, should look like this: 'FWS.Fill.factor', since that is the only row where xval has the highest score.
List 2 is the list of all rows where xlog10val is the maximum value, and thus should contain the names of the parameters where xlog10val is the maximum of that row:
'FWS.Range', 'FL.Red.Range', 'FL.Orange.Range',
'FL.Red.Gradient', 'FWS.Fill.factor'
And list 3 is the rest of the names.
I tried something like
df$Parameter[which(df$xval == max(df[, 2:4]))]
but this gives integer(0) results.
EDIT
to clarify:
Let's start by looking at column 2 (xval).
PER row, I need to test whether xval is the maximum of the 3 columns: xval, xlog10val, xcustomval.
If this is the case, add the parameter in THAT row to the xval_is_the_max_of_3_columns list.
Then we do the same PER row for xlog10val. If xlog10val in row i is the max of columns 2:4, add the name of that ROW to the xlog10val_is_the_max_of_3_columns list.
To make the DF:
df <- data.frame(Parameter = c('FWS.Range', 'FL.Red.Range', 'FL.Orange.Range', 'FL.Yellow.Range', 'FL.Red.Gradient','SWS.Range','SWS.Length','FWS.Fill.factor','FL.Red.Total'),
xval = c(0.622333705577588,0.481959800402278,0.433380866119736,0.535549430820635,0.351945244290616,0.469328232931424,0.0292779051823701,0.93764311477813,0.224377540663707),
xlog10val = c( 0.974161367853916,0.985578135386898,0.97272429360688,0.902279501804112,0.990504657326703,0.94879549470406,0.45659620937997,0.803980592920426,0.965587334461157),
xcustomval = c(0.961906534164457,0.964320569400919,0.823986745004031,0.922340716468745,0.571822393107348,0.982531798077881,0.73093132928955,0,0.992330722386105))
We can use max.col to get the index of the maximum value for each row, and with that we subset 'Parameter':
i1 <- max.col(df[-1], 'first')
split(df$Parameter, i1)
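For the example data this gives:
i1
# [1] 2 2 2 3 2 3 3 1 3
split(df$Parameter, i1)
# $`1`: FWS.Fill.factor
# $`2`: FWS.Range, FL.Red.Range, FL.Orange.Range, FL.Red.Gradient
# $`3`: FL.Yellow.Range, SWS.Range, SWS.Length, FL.Red.Total
where 1, 2, 3 correspond to the columns xval, xlog10val, and xcustomval respectively.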
EDIT: Based on the discussion with @Mark
I'm not sure exactly how you're selecting the parameters for lists two and three; however, you can try something like this as well:
df$Parameter <- as.character(df$Parameter)
par.xval.max <- df[which.max(df$xval), "Parameter"]
par.col3.gt.max <- df[df$xlog10val > max(df$xval), "Parameter"]
par.rem <- df$Parameter[! df$Parameter %in% c(par.xval.max, par.col3.gt.max)]
In this case, par.col3.gt.max holds the parameters whose xlog10val is greater than max(df$xval), and the remaining parameters are taken by negative selection using %in%.
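For the example data this yields:
par.xval.max
# [1] "FWS.Fill.factor"
par.col3.gt.max
# [1] "FWS.Range" "FL.Red.Range" "FL.Orange.Range" "FL.Red.Gradient" "SWS.Range" "FL.Red.Total"
par.rem
# [1] "FL.Yellow.Range" "SWS.Length"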

Explanation of the aggregate and cbind functions

First, I can't understand the aggregate and cbind functions; I need an explanation in really simple words. Second, I have this data:
permno number mean std
1 10107 120 0.0117174000 0.06802718
2 11850 120 0.0024398083 0.04594591
3 12060 120 0.0005072167 0.08544500
4 12490 120 0.0063569167 0.05325215
5 14593 120 0.0200060583 0.08865493
6 19561 120 0.0154743500 0.07771348
7 25785 120 0.0184815583 0.16510082
8 27983 120 0.0025951333 0.09538822
9 55976 120 0.0092889000 0.04812975
10 59328 120 0.0098526167 0.07135423
I need to process this by
data_processed2 <- aggregate(cbind(return)~permno, Data_summary, median)
I can't understand this command; please explain it to me in very simple terms. Thank you!
cbind takes two or more tables (dataframes), puts them side by side, and then makes them into one big table. So for example, if you have one table with columns A, B and C, and another with columns D and E, after you cbind them you'll have one table with five columns: A, B, C, D and E. For the rows, cbind assumes all tables are in the same order.
As noted by Rui, in your example cbind doesn't do anything, because return is not a table, and even if it was, it's only one thing.
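To make that A, B, C, D, E example concrete (a minimal sketch with made-up data):
x <- data.frame(A = 1:3, B = 4:6, C = 7:9)
y <- data.frame(D = c("a", "b", "c"), E = c(TRUE, FALSE, TRUE))
cbind(x, y)
#   A B C D     E
# 1 1 4 7 a  TRUE
# 2 2 5 8 b FALSE
# 3 3 6 9 c  TRUE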
aggregate takes a table, divides it up by some variable, and then calculates a statistic on a variable within each group. For example, if I have data for sales by month and day of month, I can aggregate by month and calculate the average sales per day for each month.
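In code, that sales example might look like this (a minimal sketch with made-up data):
sales <- data.frame(month = rep(c("Jan", "Feb"), each = 3),
                    amount = c(10, 20, 30, 5, 15, 25))
aggregate(amount ~ month, sales, mean)
#   month amount
# 1   Feb     15
# 2   Jan     20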
The command you provided uses the following syntax:
aggregate(VARIABLES~GROUPING, DATA, FUNCTION)
Variables (here cbind(return), which doesn't really make sense) is the list of all the variables for which your statistic will be calculated.
Grouping (here permno) is the variable by which you will break the data into groups (in the sample data you provided each row has a unique value for this variable, so that doesn't really make sense either).
Data is the dataframe you're using.
Function is median.
So this call will break Data_summary into groups that have the same permno, and calculate the median for each of the columns.
With the data you provided, you'd basically get the same table back, since you're grouping the data into groups of one row each. Actually, since your variable list is effectively empty (return isn't a column in the data), as far as I can tell, you'll get nothing back.
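If you actually wanted a per-permno median of the posted columns, a working call might look like this (a hypothetical correction, since I don't know your real column names):
data_processed2 <- aggregate(cbind(mean, std) ~ permno, Data_summary, median)
# one row per permno, holding the median of mean and std within each group
With one row per permno, as in your sample, this just returns the original values.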

Complex data calculation for consecutive zeros at row level in R (lag vs lead)

I have a complex calculation that needs to be done. It is basically at the row level, and I am not sure how to tackle it.
If you can help me with the approach or any functions, that would be really great.
I will break my problem into two sub-problems for simplicity.
Below is what my data looks like:
Group,Date,Month,Sales,lag7,lag6,lag5,lag4,lag3,lag2,lag1,lag0(reference),lead1,lead2,lead3,lead4,lead5,lead6,lead7
Group1,42005,1,2503,1,1,0,0,0,0,0,0,0,0,0,0,1,0,1
Group1,42036,2,3734,1,1,1,1,1,0,0,0,0,1,1,0,0,0,0
Group1,42064,3,6631,1,0,0,1,0,0,0,0,0,0,1,1,1,1,0
Group1,42095,4,8606,0,1,0,1,1,0,1,0,1,1,1,0,0,0,0
Group1,42125,5,1889,0,1,1,0,1,0,0,0,0,0,0,0,1,1,0
Group1,42156,6,4819,0,1,0,0,0,1,0,0,1,0,1,1,1,1,0
Group1,42186,7,5120,0,0,1,1,1,1,1,0,0,1,1,0,1,1,0
I have data for each Group at the monthly level.
I would like to capture the two things below.
1. The count of consecutive zeros for each row to-and-fro from lag0(reference)
The relevant cases are the zeros that run consecutively from lag0(reference) in each direction until the first 1 is reached. I want to capture the count of those zeros at the row level, along with the corresponding Sales value.
Below is the output I am looking for in part 1.
Output:
Month,Sales,Count
1,2503,9
2,3734,3
3,6631,5
4,8606,0
5,1889,6
6,4819,1
7,5120,1
2. Identify the consecutive rows (rows 1, 2 and 3, and similarly rows 5 and 6) where an overlap of any lag or lead happens for any 0 within the lag0(reference) range, and capture their Sales and Month values.
For example, for rows 1, 2 and 3, the overlap happens at least at lags 3, 2, 1 and
leads 1, 2; this needs to be captured and tagged as Case 1 (or 1). Similarly, for rows 5 and 6 at least lag1 overlaps, hence this needs to be captured and tagged as Case 2 (or 2), along with the Sales and Month values.
Now, row 7 does not overlap with the previous or next consecutive row, hence it will not be captured.
Below is the result I am looking for in part 2:
Month,Sales,Case
1,2503,1
2,3734,1
3,6631,1
5,1889,2
6,4819,2
I want to run this for multiple groups, hence I will either incorporate dplyr or a loop to get the result. Currently, I am simply looking for the approach.
I am not sure how to solve this problem; it is the first time I am trying to capture things at the row level in R. I am not looking for a complete solution, simply a first step to attack this problem. I would appreciate any leads.
An option using rle for the first part of the calculation:
df$count <- apply(df[, -c(1:4)], 1, function(x) {
  # x holds lag7..lag1 (positions 1:7), lag0 (position 8), lead1..lead7 (9:15)
  first  <- rle(x[1:7])   # runs among the lags
  second <- rle(x[9:15])  # runs among the leads
  count <- 0
  # zeros running backwards from lag0: the final run of the lags, if it is zeros
  if (first$values[length(first$values)] == 0) {
    count <- first$lengths[length(first$values)]
  }
  # zeros running forwards from lag0: the initial run of the leads, if it is zeros
  if (second$values[1] == 0) {
    count <- count + second$lengths[1]
  }
  count
})
df[, c("Month", "Sales", "count")]
# Month Sales count
# 1 1 2503 9
# 2 2 3734 3
# 3 3 6631 5
# 4 4 8606 0
# 5 5 1889 6
# 6 6 4819 1
# 7 7 5120 1
Data:
df <- read.table(text =
"Group,Date,Month,Sales,lag7,lag6,lag5,lag4,lag3,lag2,lag1,lag0(reference),lead1,lead2,lead3,lead4,lead5,lead6,lead7
Group1,42005,1,2503,1,1,0,0,0,0,0,0,0,0,0,0,1,0,1
Group1,42036,2,3734,1,1,1,1,1,0,0,0,0,1,1,0,0,0,0
Group1,42064,3,6631,1,0,0,1,0,0,0,0,0,0,1,1,1,1,0
Group1,42095,4,8606,0,1,0,1,1,0,1,0,1,1,1,0,0,0,0
Group1,42125,5,1889,0,1,1,0,1,0,0,0,0,0,0,0,1,1,0
Group1,42156,6,4819,0,1,0,0,0,1,0,0,1,0,1,1,1,1,0
Group1,42186,7,5120,0,0,1,1,1,1,1,0,0,1,1,0,1,1,0",
header = TRUE, stringsAsFactors = FALSE, sep = ",")

Match each row in a table to a row in another table based on the difference between row timestamps

I have two unevenly-spaced time series that each measure separate attributes of the same system. The two series' data points are not sampled at the same times, and the series are not the same length. I would like to match each row from series A to the row of B that is closest to it in time. What I have in mind is to add a column to A that contains indexes to the closest row in B. Both series have a time column measured in Unix time (e.g. 1459719755).
for example, given two datasets
a time
2 1459719755
4 1459719772
3 1459719773
b time
45 1459719756
2 1459719763
13 1459719766
22 1459719774
The first dataset should be updated to
a time index
2 1459719755 1
4 1459719772 4
3 1459719773 4
since B[1,]$time is closest to A[1,]$time, and B[4,]$time is closest to both A[2,]$time and A[3,]$time.
Is there any convenient way to do this?
Try something like this:
1 + ecdf(bdat$time)(adat$time) * nrow(bdat)
[1] 1 4 4
Why should this work? The ecdf function returns another function whose values range from 0 to 1. It returns the "position" in the "probability range" [0, 1] of a new value within the distribution of values defined by the first argument to ecdf. The expression is really just rescaling that function's result to the range [1, nrow(bdat)]. (I think it's flipping elegant.)
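Breaking the expression down with the example data above:
ecdf(bdat$time)(adat$time)
# [1] 0.00 0.75 0.75
which rescales to 1 + c(0, 0.75, 0.75) * 4 = c(1, 4, 4).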
Another approach would be to use approxfun on the sorted values of bdat$time, which would then get you interpolated values. These might need to be rounded; using them as indices directly would instead truncate them to integers.
apf <- approxfun(x = sort(bdat$time), y = seq_along(bdat$time), rule = 2)
apf(adat$time)
#[1] 1.000 3.750 3.875
round( apf( adat$time))
#[1] 1 4 4
In both cases you are predicting a sorted value from its "order statistic". In the second case you should check that ties are handled in the manner you desire.
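If you want the literal nearest match rather than an order-based approximation, a brute-force sketch (fine for small series; assumes the data frames are named adat and bdat, as in the answers above) is:
adat$index <- sapply(adat$time, function(t) which.min(abs(bdat$time - t)))
adat$index
# [1] 1 4 4
This compares every pair of timestamps, so it costs O(nrow(adat) * nrow(bdat)), but it matches on true time distance rather than rank.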

Reordering a paired variable

I was wondering about the following thing:
I have a 16x2 matrix with numerical values in the first column and, in the second column, values that are also numerical but are actually position numbers, so they need to be treated as a factor.
I want to order the values of the first column from low to high, but I need the numbers in the second column to stay with their original partner value from the first column.
So let's say you've got:
4 1
6 2
2 3
And now I want to sort the first column from low to high.
Then I want to get
2 3
4 1
6 2
Does anybody know how I can do this?
R doesn't seem to provide a variable type for paired data...
You can do:
dat[order(dat[, 1]), ]
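With the example above (a minimal sketch; dat stands in for your matrix):
dat <- matrix(c(4, 6, 2, 1, 2, 3), ncol = 2)
dat[order(dat[, 1]), ]
#      [,1] [,2]
# [1,]    2    3
# [2,]    4    1
# [3,]    6    2
order(dat[, 1]) returns the permutation of row indices that sorts the first column; indexing the matrix with it reorders entire rows, so each value keeps its original partner.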
