How to use the setorder function in R

I have a mock up data set:
d1 = structure(
list(
chan1 = c(1.49955768204777, 1.57924608282282,
1.62079872172079, 1.49955768204777,
1.50897108417039, 1.47897959168283),
chan2 = c(3.71459266186863, 3.71459266186863,
3.66763591782946, 3.67359273988532,
3.66408366995924, 3.68083665073346),
chan3 = c(8.32529316285155, 6.30229174858652,
6.97551768293611, 6.52653674461786,
6.52653674461786, 6.07823977152575),
chan4 = c(11.023719681933, 11.023719681933,
11.023719681933, 11.4613297390623,
11.4613297390623, 11.5813471428122),
chan5 = c(7.32862391337389, 7.38103675023449,
7.81796038841145, 7.4216715642288,
7.51924428352424, 7.35498863975821),
rowname = c(2042051, 1454646, 289170,
3307469, 3890829, 1741489),
total_conv = c(359.161333500186, 359.161312264452,
359.16130836516, 359.161294408793,
359.161289598969, 359.161209958641),
sum = c(31.8917871020749, 30.0008869254455,
31.1056323928309, 30.5826884698421,
30.6801655213341, 30.1743917965125)
),
.Names = c("chan1", "chan2", "chan3", "chan4", "chan5",
"rowname", "total_conv", "sum"),
class = "data.frame",
row.names = c(NA, -6L)
)
Now I need to sort this data set by the total_conv and sum variables.
Here total_conv should be sorted in descending order and sum in ascending order.
When I use the following function, I am unable to sort my data set in the required format.
d1<-setorder(as.data.table(d1),-total_conv,sum)
How can I overcome this issue?

You can also try order instead of setorder:
setDT(d1)[order(-total_conv, sum)]
It will first sort by descending total_conv and then by ascending sum.
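In the question's data the total_conv values are all distinct and already in descending order, so sorting leaves the rows where they are, which may be why the call looks like it did nothing. The sketch below uses a small made-up table (the values are hypothetical, chosen so that total_conv has ties) to show both approaches from the data.table package side by side.

```r
library(data.table)

# Toy data (hypothetical) with a tie in total_conv so the second key matters
dt <- data.table(total_conv = c(1, 3, 3, 2),
                 sum        = c(10, 30, 20, 40))

# order() inside [ ] returns a sorted copy; dt itself is left unchanged
sorted_copy <- dt[order(-total_conv, sum)]

# setorder() reorders dt in place (by reference); no assignment is needed
setorder(dt, -total_conv, sum)

print(dt)
```

Both results should have total_conv descending, with the two tied rows ordered by ascending sum.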

Related

Rolling Sample standard deviation in R

I wanted to get the standard deviation of the 3 previous row of the data, the present row and the 3 rows after.
This is my attempt:
mutate(ming_STDDEV_SAMP = zoo::rollapply(ming_f, list(c(-3:3)), sd, fill = 0)) %>%
Result
ming_f       ming_STDDEV_SAMP
4.235279667  0.222740262
4.265353     0.463348209
4.350810667  0.442607461
3.864739333  0.375839159
3.935632333  0.213821765
3.802632333  0.243294783
3.718387667  0.051625808
4.288542333  0.242010836
4.134689     0.198929941
3.799883667  0.112733475
This is what I expected:
ming_f       ming_STDDEV_SAMP
4.235279667  0.225532646
4.265353     0.212776157
4.350810667  0.23658801
3.864739333  0.253399417
3.935632333  0.26144862
3.802632333  0.246259684
3.718387667  0.20514358
4.288542333  0.208578409
4.134689     0.208615874
3.799883667  0.233948429
It doesn't match your output exactly, but perhaps this is what you need:
zoo::rollapply(quux$ming_f, 7, FUN=sd, partial=TRUE)
(It also works replacing 7 with list(-3:3).)
This expression isn't really different from your sample code, but the output is correct. Perhaps your original frame has a group_by still applied?
Data
quux <- structure(list(ming_f = c(4.235279667, 4.265353, 4.350810667, 3.864739333, 3.935632333, 3.802632333, 3.718387667, 4.288542333, 4.134689, 3.799883667), ming_STDDEV_SAMP = c(0.225532646, 0.212776157, 0.23658801, 0.253399417, 0.26144862, 0.246259684, 0.20514358, 0.208578409, 0.208615874, 0.233948429)), class = "data.frame", row.names = c(NA, -10L))
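To make it clearer what partial=TRUE computes, here is a minimal base R sketch of the same idea: a centred window of up to 7 values that is shortened at the ends of the series instead of being padded. It is meant to mirror the rollapply call above, not to replace it; the names x, n and roll_sd are only illustrative.

```r
# Values of ming_f from the question
x <- c(4.235279667, 4.265353, 4.350810667, 3.864739333, 3.935632333,
       3.802632333, 3.718387667, 4.288542333, 4.134689, 3.799883667)
n <- length(x)

# Centred window: 3 values before, the current one, 3 values after,
# truncated where the window would run past either end of the series
roll_sd <- vapply(seq_len(n), function(i) {
  window <- x[max(1, i - 3):min(n, i + 3)]
  sd(window)
}, numeric(1))

print(round(roll_sd, 6))
```

The first value is the standard deviation of only the first four observations, which is one reason the edges can differ from a result computed with padding or a different alignment.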

Labelling Variables

I have a series of variables that fall under one related question: let's say there are 20 such variables in my dataframe, each one corresponding to an option on a multiple-choice question. They are named popn1, popn2, ..., popn20.
I want to label each variable by its option, for example: (popn1 = Everyone; popn2 = Children)
I'm using the labelVector package.
Is there a way I can do it without writing out each variable name? For example, is there a paste function I can use, such as
df2 <- Set_label(df1,
(paste0(popn, 1:20) = "Everyone", "Children", .... "Youth"?)
This can be done in base R quite easily. Here's some sample data (using 5 columns instead of 20, to make it easier to view):
popn1 popn2 popn3 popn4 popn5
1 -0.4085141 3.240716 2.730837 6.428722 8.015210
2 3.1378943 2.512700 2.021546 3.333371 5.654401
3 2.4073278 1.475619 2.449742 2.817447 6.295569
It looks like you already have your new column names in a character vector:
your_column_names <- c("Everyone", "Youth", "Someone", "Something", "Somewhere")
Then you just use the setNames function to apply them as the column names of your data:
data <- setNames(data, your_column_names)
Everyone Youth Someone Something Somewhere
1 -0.4085141 3.240716 2.730837 6.428722 8.015210
2 3.1378943 2.512700 2.021546 3.333371 5.654401
3 2.4073278 1.475619 2.449742 2.817447 6.295569
Sample Data:
data <- structure(list(popn1 = c(-0.408514139489243, 3.13789432899688,
2.40732780606037), popn2 = c(3.24071608151551, 2.51269963339946,
1.47561933493116), popn3 = c(2.73083728435832, 2.02154567048998,
2.44974180329751), popn4 = c(6.42872215439841, 3.3333709733048,
2.81744655980154), popn5 = c(8.0152099281755, 5.65440141443164,
6.29556905855252)), class = "data.frame", row.names = c(NA, -3L
))
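If the data frame also holds columns other than the popn ones, the old names can be generated with paste0 and matched by name, so nothing has to be typed out. This is only a sketch under assumptions: the data frame df, its id column and the three option labels are made up for illustration.

```r
# Hypothetical data: one unrelated column plus three popn columns
df <- data.frame(id = 1:3, popn1 = c(0, 1, 1), popn2 = c(1, 0, 1),
                 popn3 = c(0, 0, 1))

# Option labels, in the same order as popn1, popn2, popn3
option_labels <- c("Everyone", "Children", "Youth")

# Build the existing names instead of writing each one by hand
old_names <- paste0("popn", seq_along(option_labels))

# Replace only the matching columns; other columns keep their names
names(df)[match(old_names, names(df))] <- option_labels

print(names(df))
```

Matching by name rather than by position keeps the renaming correct even if the popn columns are not next to each other.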

get rows from dataframe matching element of a list

Here is one dataframe/tibble and one character vector (this vector is one column of a tibble):
df1 <- structure(list(Twitter_name = c("CHESHIREKlD", "JellyComons",
"kirmiziburunlu", "erkekdeyimleri", "herosFrance", "IkishanShah"
), Declared_followers = c(60500L, 43100L, 31617L, 27852L, 26312L,
16021L), Real_followers = c(60241, 43054, 31073, 27853, 25736,
15856), Twitter_Id = c("783866366", "1424086592", "2367932244",
"3352977681", "2580703352", "521094407")), .Names = c("Twitter_name",
"Declared_followers", "Real_followers", "Twitter_Id"), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
myId <- c("867211097882804224", "868806957133688832", "549124465","822580282452754432",
"109344546", "482666188", "61716107", "3642392237", "595318933",
"833365943044628480", "1045015087", "859830740669800448", "860562940059045888",
"2854457294", "871784135983067136", "866922354554814464", "4839343547",
"849451474572759040", "872084673526214656", "794841530053853184")
N.B.: df1 has been shortened here; the real one has 128 observations.
I am looking to test all row elements of df1$Twitter_Id and see if they are in myId. I can run this:
> match(myId[1], df1$Twitter_Id)
but:
it stops at the first occurrence
I need to apply the match() function to all elements of myId.
I can't find a clean and simple way to do this using lapply() or other functions from dplyr or the tidyverse packages.
Thank you for help.
EDIT I need to be more explicit with the whole real case.
myTw <- structure(list(id_str = c("893445199661330433", "893116842558050304",
"892739336466305024", "892401780105019393", "892401594272296963",
"892365572486430720", "891964139756818432")), .Names = "id_str", row.names = c(NA,
-7L), class = c("tbl_df", "tbl", "data.frame"))
These are tweet IDs. What I am looking for is which Twitter users have retweeted them. To do this, I use the retweeters() function from the twitteR package.
library(twitteR)
MyRtw <- retweeters(myTw[1])
MyRtw <- c("889135428028084224", "867211097882804224", "868806957133688832",
"549124465", "822580282452754432", "109344546", "482666188",
"61716107", "3642392237", "595318933", "833365943044628480",
"1045015087", "859830740669800448", "860562940059045888", "2854457294",
"871784135983067136", "866922354554814464", "4839343547", "849451474572759040",
"872084673526214656")
This is a list of Twitter user IDs.
Now, finally, I want to see which users from df1$Twitter_Id have retweeted myTw[1].
You can use the '%in%' operator.
Edit: Probably this is what you want. Here I used the data posted in your original post (before editing).
# For each ID in df1, count how many times it appears in myId
matchVector <- NULL
for (id in df1$Twitter_Id) {
  matchCounter <- sum(myId %in% id)
  matchVector <- c(matchVector, matchCounter)
}
df1$numberOfMatches <- matchVector
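Because %in% is vectorised, the same check can also be written without a loop. The sketch below uses a tiny made-up data set (the IDs are placeholders picked so that one of them matches) and assumes the IDs in myId are unique, in which case a TRUE/FALSE flag carries the same information as the count above.

```r
# Hypothetical stand-ins for df1 and myId
df1 <- data.frame(Twitter_Id = c("783866366", "109344546", "521094407"),
                  stringsAsFactors = FALSE)
myId <- c("109344546", "482666188", "61716107")

# TRUE for every row of df1 whose Twitter_Id occurs anywhere in myId
df1$retweeted <- df1$Twitter_Id %in% myId

# Keep only the matching rows
matched <- df1[df1$retweeted, ]

print(matched)
```

Unlike match(), which returns the position of the first match, %in% tests every element of the left-hand vector in one call.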

Time series data in R, problems with dates

Date T1V T2V T3V T1MV T2MV T3MV
1997-12-31 2.631202 2.201695 -0.660092 -0.77492483 0.282662305 4.66506798
1998-01-30 2.193793 3.763458 5.565432 3.50711734 2.874381814 5.14118430
1998-02-27 5.173496 8.727646 6.333820 2.59892279 8.363146480 9.27289259
This is the table I am working with in R; the real one is much bigger. The data is monthly and runs up to 2014. The different columns are just the returns on different portfolios. I always get errors when I try to use it as time series data. I installed the PerformanceAnalytics package. For example, the SharpeRatio function gives me:
> SharpeRatio(T1V)
Error in checkData(R) :
The data cannot be converted into a time series. If you are trying to passin names from a data object with one column, you should use the form 'data[rows, columns, drop = FALSE]'. Rownames should have standard date formats, such as '1985-03-15'.
When you look at the Date column in the table, you can see that the dates are in exactly this format.
I have tried a hundred things. It also doesn't let me plot the charts with lines, only with points.
Any help is much appreciated.
> dput(FactorR[1:5,])
structure(list(Date = structure(1:5, .Label = c("1997-12-31",
"1998-01-30", "1998-02-27", "1998-03-31", "1998-04-30", "1998-05-29",
"1998-06-30", "1998-07-31", "1998-08-31", "1998-09-30", "1998-10-30",
"1998-11-30", "1998-12-31", "1999-01-29", "1999-02-26", "1999-03-31",
"1999-04-30", "1999-05-31", "1999-06-30", "1999-07-30", "1999-08-31",
"1999-09-30", "1999-10-29", "1999-11-30", "1999-12-31", "2000-01-31",
"2000-02-29", "2000-03-31", "2000-04-28", "2000-05-31", "2000-06-30",
"2000-07-31", "2000-08-31", "2000-09-29", "2000-10-31", "2000-11-30",
"2000-12-29", "2001-01-31", "2001-02-28", "2001-03-30", "2001-04-30",
.
.
.
, class = "factor"),
T1V = c(2.631202, 2.193793, 5.173496, 8.033864, 1.369065),
T2V = c(2.201695, 3.763458, 8.727646, 11.375482, 3.097196
), T3V = c(-0.660092, 5.565432, 6.33382, 20.608638, 4.022475
), T1MV = c(-0.774924835, 3.507117337, 2.598922792, 16.26945887,
4.544096701), T2MV = c(0.282662305, 2.874381814, 8.36314648,
12.7091841, 1.078742371), T3MV = c(4.665067984, 5.141184302,
9.27289259, 10.62133318, 2.791853987), T1BTM = c(0.617378168,
3.498582776, 3.332624722, 8.802164975, 1.366229683), T2BTM = c(1.101407825,
5.578394125, 8.910685728, 20.05317039, 1.258609942), T3BTM = c(2.454019461,
2.445706552, 7.991651412, 10.79096755, 5.464002646), T1MOM = c(2.99986853,
4.982808153, 8.657010689, 10.60637296, 4.44333707), T2MOM = c(0.011102554,
3.184165606, 7.55229158, 11.9341773, 0.328377299), T3MOM = c(1.161834369,
3.355709694, 4.025659592, 17.12665788, 3.55822744), Rm = c(1.390935,
3.840895, 6.744987, 13.262647, 2.753486), SMB = c(-5.439992819,
-1.634066965, -6.673969798, 5.648125694, 1.752242715), HML = c(-1.836641293,
1.052876225, -4.65902669, -1.988802574, -4.097772963), MOM = c(1.838034161,
1.62709846, 4.631351096, -6.520284921, 0.885109629)), .Names = c("Date",
"T1V", "T2V", "T3V", "T1MV", "T2MV", "T3MV", "T1BTM", "T2BTM",
"T3BTM", "T1MOM", "T2MOM", "T3MOM", "Rm", "SMB", "HML", "MOM"
), row.names = c(NA, 5L), class = "data.frame")
Two things are wrong:
Your Date column doesn't contain dates but factors.
SharpeRatio doesn't know how to convert your data.frame to a time series object.
By doing the conversion manually, we can specify which column to use as time index and on-the-fly convert it to Date:
library(PerformanceAnalytics)
FactorR_xts <- xts(x = FactorR[, -1], # use all columns except for first column (date) as data
order.by = as.Date(FactorR$Date) # Convert Date column from factor to Date and use as time index
)
SharpeRatio(FactorR_xts)
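The first of the two problems can be checked in base R alone. The sketch below uses a three-row stand-in for FactorR (only the Date and T1V columns, copied from the question) to show that the column is a factor and how to turn it into real dates; the variable name dates is just illustrative.

```r
# Small stand-in for FactorR (first three rows of the question's data)
FactorR <- data.frame(
  Date = factor(c("1997-12-31", "1998-01-30", "1998-02-27")),
  T1V  = c(2.631202, 2.193793, 5.173496)
)

class(FactorR$Date)        # "factor"
as.integer(FactorR$Date)   # internal level codes, not dates

# Go through character to get real Date values
dates <- as.Date(as.character(FactorR$Date), format = "%Y-%m-%d")
class(dates)               # "Date"
```

With a Date vector on the x axis, a call such as plot(dates, FactorR$T1V, type = "l") should draw a line chart rather than the default display used for a factor.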

Issues with formatting header in R prior to using plot() function

I have a data set that I've successfully read into R. It's a simple data.frame with ONE ROW of data (I'm not sure how many columns, but it's in the hundreds). It was read with column headers, but no row labels. So the data set looks something like this:
df=structure(list(X500000 = 0.0958904109589041, X1500000 = 0.10958904109589, X2500000 = 0.10958904109589, X3500000 = 0.164383561643836, X4500000 = 0.136986301369863, X5500000 = 0.205479452054795, X6500000 = 0.136986301369863, X7500000 = 0.0273972602739726, X8500000 = 0.0821917808219178, X9500000 = 0.178082191780822), .Names = c("X500000", "X1500000", "X2500000", "X3500000", "X4500000", "X5500000", "X6500000", "X7500000", "X8500000", "X9500000"), class = "data.frame", row.names = 79L)
Except that it is MUCH LARGER (I don't know if it matters, but it has around 300 columns going across). I'm trying to plot it so that the X##### labels are on the x axis, and the value of each data point is plotted on the y axis (say like a scatter plot on excel or even a line graph). Doing just plot(df) gives me an extremely bizarre graph that makes no sense to me (a bunch of boxes each with a dot right in the centre and no labels?).
I have a feeling it might work if I were to transform the data frame into a vector by removing the headings and then adding x-axis labels individually afterwards and doing a plot() on the vector, but if there is a way of avoiding that it would be great....
As explained in '?plot', 'x' and 'y' must be two numeric vectors of the same length:
df=structure(list(X500000 = 0.0958904109589041, X1500000 = 0.10958904109589, X2500000 = 0.10958904109589, X3500000 = 0.164383561643836, X4500000 = 0.136986301369863, X5500000 = 0.205479452054795, X6500000 = 0.136986301369863, X7500000 = 0.0273972602739726, X8500000 = 0.0821917808219178, X9500000 = 0.178082191780822), .Names = c("X500000", "X1500000", "X2500000", "X3500000", "X4500000", "X5500000", "X6500000", "X7500000", "X8500000", "X9500000"), class = "data.frame", row.names = 79L)
plot(x=as.numeric(substr(names(df),2,nchar(names(df)))), as.numeric(df), xlab="This is xlab", ylab="This is y")
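The call above does two conversions at once, so it may help to see them as separate steps. This is a sketch on a shortened, rounded three-column version of the data; sub() is used here in place of substr() as one possible way to drop the leading "X", and the names x and y are illustrative.

```r
# Shortened one-row data frame with the same kind of column names
df <- data.frame(X500000 = 0.096, X1500000 = 0.110, X2500000 = 0.164)

# x values: strip the leading "X" from the column names, then convert
x <- as.numeric(sub("^X", "", names(df)))

# y values: flatten the single row into a plain numeric vector
y <- unlist(df[1, ], use.names = FALSE)

print(data.frame(x = x, y = y))
```

Once x and y are plain numeric vectors of equal length, plot(x, y, type = "b") should draw the points joined by lines.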
