data_processed <- sqldf(" select a.permno, a.number, a.mean, b.ret as med, a.std
from data_processed as a
left join data_processed2 as b
on a.permno=b.permno")
The code above is not working. I am getting the error below:
Error in result_create(conn@ptr, statement) : no such column: b.ret
Here is my data:
data_processed:
permno number mean std
1 10107 120 0.0117174000 0.06802718
2 11850 120 0.0024398083 0.04594591
3 12060 120 0.0005072167 0.08544500
4 12490 120 0.0063569167 0.05325215
5 14593 120 0.0200060583 0.08865493
6 19561 120 0.0154743500 0.07771348
7 25785 120 0.0184815583 0.16510082
8 27983 120 0.0025951333 0.09538822
9 55976 120 0.0092889000 0.04812975
10 59328 120 0.0098526167 0.07135423
data_processed2:
permno return
1 10107 0.0191920
2 11850 0.0015495
3 12060 -0.0040130
4 12490 0.0078245
5 14593 0.0231735
6 19561 0.0202610
7 25785 -0.0018760
8 27983 0.0027375
9 55976 0.0089435
10 59328 0.0166490
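For what it's worth, the column in data_processed2 is named return, not ret, which is what the "no such column: b.ret" error is pointing at. A sketch of the corrected query, quoting return with backticks since it can collide with an SQL keyword (alternatively, rename the column in R first):
library(sqldf)
# reference the actual column name; backticks guard against keyword clashes
data_processed <- sqldf("
  select a.permno, a.number, a.mean, b.`return` as med, a.std
  from data_processed as a
  left join data_processed2 as b
  on a.permno = b.permno")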
I have the following code in R:
v <- c("featureA", "featureB")
newdata <- unique(data[v])
print(unique(data[v]))
print(predict(model, newdata, type = 'response', allow.new.levels = TRUE))
And I got the following result
featureA featureB
1 bucket_in_10_to_30 bucket_in_90_to_100
2 bucket_in_10_to_30 bucket_in_50_to_90
3 bucket_in_0_to_10 bucket_in_50_to_90
4 bucket_in_0_to_10 bucket_in_90_to_100
7 bucket_in_10_to_30 bucket_in_10_to_50
10 bucket_in_30_to_100 bucket_in_90_to_100
19 bucket_in_0_to_10 bucket_in_0_to_10
33 bucket_in_0_to_10 bucket_in_10_to_50
36 bucket_in_30_to_100 bucket_in_10_to_50
38 bucket_in_10_to_30 bucket_in_0_to_10
52 bucket_in_30_to_100 bucket_in_0_to_10
150 bucket_in_30_to_100 bucket_in_50_to_90
1 2 3 4 7 10 19 33 36 38 52 150
0.001920662 0.005480186 0.000961198 0.000335883 0.006311521 0.004005570 0.000620979 0.001107773 0.013100210 0.003546136 0.007382468 0.011384935
And I'm wondering if it's possible in R to reshape this and directly get a 3 x 4 table along these lines, with the featureA levels as rows and the featureB levels as columns:
featureA \ featureB   bucket_in_0_to_10   bucket_in_10_to_50   bucket_in_50_to_90   bucket_in_90_to_100
bucket_in_0_to_10     ...                 ...                  ...                  ...
bucket_in_10_to_30    ...                 ...                  ...                  ...
bucket_in_30_to_100   ...                 ...                  ...                  ...
Thanks for the help!
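One way to get that shape, for what it's worth: attach the predictions to newdata and cross-tabulate them with xtabs. A minimal sketch, assuming the model and newdata from the snippet above:
newdata$pred <- predict(model, newdata, type = 'response', allow.new.levels = TRUE)
# each featureA/featureB combination occurs once, so the "sums" are just the raw predictions
xtabs(pred ~ featureA + featureB, data = newdata)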
I would appreciate any help I can get. The assignment says we have to read in the attached file and use it as an edge list to create a directed graph with weighted edges.
Then there are about 20 other things I have to do from there. This is part of the .txt file to import:
Columns are inbound locations, outbound locations, and travel time in minutes.
Inbound Outbound Minutes
ACY ATL 102
ACY FLL 136
ACY MCO 122
ACY MYR 90
ACY RSW 137
ACY TPA 129
ATL ACY 102
ATL BOS 132
ATL BWI 106
ATL CLE 104
.... and so on, there are probably 50+ locations in total, with around 400 lines
I tried using
read.graph(file.choose(), format="edgelist")
and when I select the .txt file I get the error:
"Error in read.graph.edgelist(file, ...) :
At foreign.c:101 : parsing edgelist file failed, Parse error"
-----EDIT-----
I just used the following code:
inbound <- c(data[, 1])
outbound <- c(data[, 2])
testing <- data.frame(inbound, outbound)
gd <- graph_from_data_frame(testing, directed=TRUE,vertices=NULL)
Which gave this output:
edges from d654854 (vertex names):
[1] 1 ->2 1 ->18 1 ->30 1 ->35 1 ->46 1 ->58 2 ->1 2 ->7 2 ->9 2 ->11 2 ->15 2 ->16 2 ->18 2 ->21 2 ->23 2 ->24 2 ->30
[18] 2 ->33 2 ->34 2 ->36 2 ->37 2 ->41 2 ->58 3 ->18 4 ->18 5 ->18 5 ->30 5 ->35 5 ->46 5 ->58 6 ->18 7 ->2 7 ->9 7 ->11
[35] 7 ->15 7 ->16 7 ->18 7 ->23 7 ->30 7 ->33 7 ->34 7 ->35 7 ->37 7 ->58 8 ->18 9 ->2 9 ->7 9 ->13 9 ->15 9 ->16 9 ->18
[52] 9 ->21 9 ->23 9 ->24 9 ->30 9 ->33 9 ->34 9 ->35 9 ->36 9 ->37 9 ->48 9 ->51 9 ->58 10->18 10->23 10->30 10->35 11->2
[69] 11->7 11->15 11->18 11->23 11->24 11->30 11->34 11->35 11->51 12->18 13->9 13->15 13->18 13->21 13->37 14->15 14->16
[86] 14->18 14->21 14->23 14->24 14->26 14->30 14->33 14->37 15->2 15->7 15->9 15->11 15->13 15->14 15->16 15->18 15->23
[103] 15->24 15->26 15->30 15->33 15->34 15->35 15->36 15->37 15->41 15->42 15->43 15->48 15->52 15->58 16->2 16->7 16->9
[120] 16->14 16->15 16->18 16->21 16->23 16->24 16->26 16->29 16->30 16->33 16->34 16->35 16->36 16->41 16->46 16->54 16->58
+ ... omitted several edges
Is that what I am supposed to get? Or am I still way off?
Using is.igraph(gd) returns true, and using V(gd) and E(gd) both return information.
So I guess my question is: how do I properly import the "table" so that the pairs of inbound/outbound flight names are used as edges (I think)? I have to make a directed graph with weighted edges to finalize the setup.
Any information on where I should start? I looked through the igraph documentation but I can't find anything about importing from a table and using pairs of characters as edges.
You can import the data as a data.frame and coerce it to a graph. Once you have the graph, you can assign weights.
library(igraph)

# read the edge list (for your actual file, read.table("file.txt", header = TRUE)
# will treat the "Inbound Outbound Minutes" line as the column names)
xy <- read.table(text = "
ACY ATL 102
ACY FLL 136
ACY MCO 122
ACY MYR 90
ACY RSW 137
ACY TPA 129
ATL ACY 102
ATL BOS 132
ATL BWI 106
ATL CLE 104", header = FALSE, sep = " ")
colnames(xy) <- c("node1", "node2", "weight")

# build a directed graph (the default) from the node pairs, then attach the weights
g <- graph_from_data_frame(xy[, c("node1", "node2")])
E(g)$weight <- xy$weight

plot(g, edge.width = E(g)$weight/50, edge.arrow.size = 0.1)
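A note that may save a step: graph_from_data_frame() treats any columns beyond the first two as edge attributes, so passing the whole three-column data frame attaches the weights automatically.
# equivalent one-step version: the "weight" column becomes E(g)$weight
g <- graph_from_data_frame(xy)
E(g)$weight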
I am trying to do the following. I have a dataset Test:
Item_ID Test_No Category Sharpness Weight Viscocity
132 1 3 14.93199362 94.37250417 579.4236727
676 1 4 44.58750591 70.03232054 1829.170727
699 2 5 89.02760079 54.30587287 1169.226863
850 3 6 30.74535903 83.84377678 707.2280513
951 4 237 67.79568019 51.10388484 917.6609965
1031 5 56 74.06697003 63.31274502 1981.17804
1175 4 354 98.9656142 97.7523884 100.7357981
1483 5 726 9.958040999 51.29537311 1222.910211
1529 7 800 64.11430235 65.69780939 573.8266137
1698 9 125 67.83105185 96.53847341 486.9620194
1748 9 1005 49.43602318 52.9139591 1881.740184
2005 9 28 26.89821508 82.12663209 1709.556135
2111 2 76 83.03593144 85.23622731 276.5088502
I want to split this data by Test_No and then compute the number of unique Category values per Test_No, as well as the median Category value. I chose to use split and sapply in the following way, but I am getting an error about a missing parenthesis. Is there anything wrong in my approach? Please find my code below:
function(CatRange){
c(Cat_Count = length(unique(CatRange$Category)), Median_Cat = median(unique(CatRange$Category), na.rm = TRUE) )
}
CatStat <- do.call(rbind,sapply(split(Test, Test$Test_No), function(ModRange)))
Adding to my question:
I would like to display the data with the following information:
Test_No, Category, Median_Cat and Cat_Count
We can try with dplyr
library(dplyr)
Test %>%
group_by(Test_No) %>%
summarise(Cat_Count = n_distinct(Category),
Median_Cat = median(Category,na.rm = TRUE),
Category = toString(Category))
# Test_No Cat_Count Median_Cat Category
# <int> <int> <dbl> <chr>
#1 1 2 3.5 3, 4
#2 2 2 40.5 5, 76
#3 3 1 6.0 6
#4 4 2 295.5 237, 354
#5 5 2 391.0 56, 726
#6 7 1 800.0 800
#7 9 3 125.0 125, 1005, 28
Or, if you prefer base R, we can also try with aggregate
aggregate(Category ~ Test_No, Test, function(x) c(Cat_Count = length(unique(x)),
    Median_Cat = median(x, na.rm = TRUE), Category = toString(x)))
As far as the function you have written is concerned, I think there are some syntax issues in it.
new_func <- function(CatRange){
c(Cat_Count = length(unique(CatRange$Category)),
Median_Cat = median(unique(CatRange$Category), na.rm = TRUE),
Category = toString(CatRange$Category))
}
data.frame(t(sapply(split(Test, Test$Test_No), new_func)))
# Cat_Count Median_Cat Category
#1 2 3.5 3, 4
#2 2 40.5 5, 76
#3 1 6 6
#4 2 295.5 237, 354
#5 2 391 56, 726
#7 1 800 800
#9 3 125 125, 1005, 28
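One detail, since the appended question also asks for Test_No: in the sapply output above it survives only as row names. A small addition to promote it to a column (res is just a scratch name here):
res <- data.frame(t(sapply(split(Test, Test$Test_No), new_func)))
res$Test_No <- rownames(res)  # the split() names are the Test_No values
res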
I'm looking for a way to produce descriptive statistics by group number in R. There is another answer on here I found, which uses dplyr, but I'm having too many problems with it and would like to see what alternatives others might recommend.
I'm looking to obtain descriptive statistics on revenue grouped by group_id. Let's say I have a data frame called company:
group_id company revenue
1 Company A 200
1 Company B 150
1 Company C 300
2 Company D 600
2 Company E 800
2 Company F 1000
3 Company G 50
3 Company H 80
3 Company H 60
and I'd like to produce a new data frame called new_company:
group_id company revenue average min max SD
1 Company A 200 217 150 300 62
1 Company B 150 217 150 300 62
1 Company C 300 217 150 300 62
2 Company D 600 800 600 1000 163
2 Company E 800 800 600 1000 163
2 Company F 1000 800 600 1000 163
3 Company G 50 63 50 80 12
3 Company H 80 63 50 80 12
3 Company H 60 63 50 80 12
Again, I'm looking for alternatives to dplyr. Thank you
Using the sample data frame
dd<-read.csv(text="group_id,company,revenue
1,Company A,200
1,Company B,150
1,Company C,300
2,Company D,600
2,Company E,800
2,Company F,1000
3,Company G,50
3,Company H,80
3,Company H,60", header=T)
You could do something fancy like use ave() to create all the values per row for your different functions and then just combine that with the original data.frame.
ext <- with(dd, Map(function(x) ave(revenue, group_id, FUN=x),
list(avg=mean, min=min, max=max, SD=sd)))
cbind(dd, ext)
# group_id company revenue avg min max SD
# 1 1 Company A 200 216.66667 150 300 76.37626
# 2 1 Company B 150 216.66667 150 300 76.37626
# 3 1 Company C 300 216.66667 150 300 76.37626
# 4 2 Company D 600 800.00000 600 1000 200.00000
# 5 2 Company E 800 800.00000 600 1000 200.00000
# 6 2 Company F 1000 800.00000 600 1000 200.00000
# 7 3 Company G 50 63.33333 50 80 15.27525
# 8 3 Company H 80 63.33333 50 80 15.27525
# 9 3 Company H 60 63.33333 50 80 15.27525
but really a simple dplyr command would be easier.
library(dplyr)

dd %>% group_by(group_id) %>%
  mutate(
    avg = mean(revenue),
    min = min(revenue),
    max = max(revenue),
    SD = sd(revenue))
Another function I like to use is describeBy from the "psych" package.
library(psych)
description <- describeBy(df$variable_to_be_described, df$group_variable)
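Applied to the sample company data above, that would look something like:
# descriptive statistics for revenue within each group_id
describeBy(dd$revenue, dd$group_id)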
Here are three columns:
indx vehID LocalY
1 2 35.381
2 2 39.381
3 2 43.381
4 2 47.38
5 2 51.381
6 2 55.381
7 2 59.381
8 2 63.379
9 2 67.383
10 2 71.398
11 2 75.401
12 2 79.349
13 2 83.233
14 2 87.043
15 2 90.829
16 2 94.683
17 2 98.611
18 2 102.56
19 2 106.385
20 2 110.079
21 2 113.628
22 2 117.118
23 2 120.6
24 2 124.096
25 2 127.597
26 2 131.099
27 2 134.595
28 2 138.081
29 2 141.578
30 2 145.131
31 2 148.784
32 2 152.559
33 2 156.449
34 2 160.379
35 2 164.277
36 2 168.15
37 2 172.044
38 2 176
39 2 179.959
40 2 183.862
41 2 187.716
42 2 191.561
43 2 195.455
44 2 199.414
45 2 203.417
46 2 207.43
47 2 211.431
48 2 215.428
49 2 219.427
50 2 223.462
51 2 227.422
52 2 231.231
53 2 235.001
54 2 238.909
55 2 242.958
56 2 247.137
57 2 251.247
58 2 255.292
59 2 259.31
60 2 263.372
61 2 267.54
62 2 271.842
63 2 276.256
64 2 280.724
65 2 285.172
I want to create a new column called 'Smoothed Y' by applying the following formula, where D = 15, delta = 5, i = indx, x_alpha(t_k) is LocalY, and x_alpha(t_i) is the smoothed value:
x_alpha(t_i) = (1/Z) * sum_{k = i-D}^{i+D} x_alpha(t_k) * exp(-|i - k| / delta)
Z = sum_{k = i-D}^{i+D} exp(-|i - k| / delta)
I have tried the following code to first calculate Z (kernel below means the exp function):
t <- 0.5
dt <- 0.1
delta <- t/dt
d <- 3*delta
indx <- a$indx
for (i in indx) {
initial <- i-d
end <- i+d
k <- c(initial:end)
for (n in k) {
kernel <- exp(-abs(i-n)/delta)
z <- sum(kernel)
}
}
a$z <- z
print (a)
NOTE: 'a' is the imported data frame containing the three columns above.
Although the computed kernel values are fine, the code doesn't sum them up in the variable z. How can I do the summation over the range i-d to i+d for every indx value i?
You can use the convolve function. One thing you need to decide is what to do for indices closer to either end of the array than the width of the convolution kernel. One option is to simply use the partial kernel, rescaled so the weights still sum to 1.
smooth <- function(x, D, delta) {
  # exponential kernel over the window -D..D
  z <- exp(-abs(-D:D) / delta)
  # normalize by convolving a vector of ones with the same kernel,
  # so partial kernels at the edges still get weights summing to 1
  r <- convolve(x, z, type = "open") / convolve(rep(1, length(x)), z, type = "open")
  # drop the D extra values that "open" convolution adds at each end
  r <- head(tail(r, -D), -D)
  r
}
With your array as y, the result is this:
> yy<-smooth(y,15,5)
> yy
[1] 50.70804 52.10837 54.04788 56.33651 58.87682 61.61121 64.50214
[8] 67.52265 70.65186 73.87197 77.16683 80.52193 83.92574 87.36969
[15] 90.84850 94.35809 98.15750 101.93317 105.67833 109.38989 113.06889
[22] 116.72139 120.35510 123.97707 127.59293 131.20786 134.82720 138.45720
[29] 142.10507 145.77820 149.48224 153.21934 156.98794 160.78322 164.60057
[36] 168.43699 172.29076 176.15989 180.04104 183.93127 187.83046 191.74004
[43] 195.66223 199.59781 203.54565 207.50342 211.46888 215.44064 219.41764
[50] 223.39908 227.05822 230.66813 234.22890 237.74176 241.20236 244.60039
[57] 247.91917 251.14346 254.25876 257.24891 260.09121 262.74910 265.16057
[64] 267.21598 268.70276
Of course, the problem with this is that the kernel ends up non-centered at the edges. This is a well-known issue, and there are ways to deal with it, but they complicate the code. Plotting the data will show you the effects of this non-centering:
plot(y)
lines(yy)
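If you would rather stay close to your original double loop, the underlying problem there is that z is overwritten on every pass instead of accumulating one weighted sum per row. A sketch that computes the normalized kernel sum for every index, clipping the window at the ends of the array (assuming the data frame a from the question):
D <- 15
delta <- 5
a$SmoothedY <- sapply(a$indx, function(i) {
  k <- max(1, i - D):min(nrow(a), i + D)  # window, clipped at the array ends
  w <- exp(-abs(i - k) / delta)           # kernel weights
  sum(w * a$LocalY[k]) / sum(w)           # weighted average; sum(w) is Z
})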