I wish to plot some count data (likely as a bubble plot). I have several different experiments, and for each experiment I have three replicates. The output from the table() command is given below.
> with(myData.df, table(ChargeGroup,Expt,Repx))
, , Repx = 1
Expt
ChargeGroup Ctrl CV2 Gas n15 n30 n45 n60 p15 p30 v0
<+10 540 512 567 204 642 648 71 2 2 6
+10:+15 219 258 262 156 283 16 0 1 0 7
+15:+20 119 118 14 200 14 0 0 7 0 51
+20:+25 57 38 0 84 1 0 0 31 7 87
+25: 30 16 0 17 0 0 0 24 19 18
, , Repx = 2
Expt
ChargeGroup Ctrl CV2 Gas n15 n30 n45 n60 p15 p30 v0
<+10 529 522 582 201 642 626 77 1 2 5
+10:+15 232 249 264 150 273 14 0 1 0 5
+15:+20 116 113 18 204 13 0 0 12 0 41
+20:+25 53 46 0 82 0 0 0 36 6 94
+25: 28 12 0 26 0 0 0 33 21 28
, , Repx = 3
Expt
ChargeGroup Ctrl CV2 Gas n15 n30 n45 n60 p15 p30 v0
<+10 536 525 591 224 671 641 63 1 2 6
+10:+15 236 238 257 170 276 16 0 2 1 10
+15:+20 113 108 15 212 12 0 0 10 0 47
+20:+25 57 40 0 77 0 0 0 34 3 107
+25: 32 11 0 25 0 0 0 26 15 26
Can anyone help me process this output further so that I can go straight to plotting, in either base graphics or ggplot?
Thanks
There are a couple of methods. With base R, loop over the third dimension with apply() and plot each slice with barplot():
par(mfrow = c(3, 1))
apply(with(myData.df, table(ChargeGroup,Expt,Repx)), 3, barplot)
Testing with mtcars:
par(mfrow = c(3, 1))
apply(with(mtcars, table(cyl, vs, gear)), 3, barplot)
Or convert to a single data.frame with as.data.frame() and use ggplot, or get the data.frame/tibble output directly with count().
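Calling as.data.frame() on a table gives one row per combination of the factors, with the counts in a Freq column:
# long format straight from the 3-way table; counts land in the Freq column
tab_df <- as.data.frame(with(myData.df, table(ChargeGroup, Expt, Repx)))
head(tab_df)
The count() route below produces the same long shape directly, with the count column named n: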
library(dplyr)
library(ggplot2)
myData.df %>%
count(ChargeGroup,Expt,Repx) %>%
ggplot(aes(x=ChargeGroup, y = n, fill = Expt)) +
geom_col() +
facet_wrap(~ Repx)
Testing the ggplot route with mtcars:
mtcars %>%
count(cyl = factor(cyl), vs = factor(vs), gear = factor(gear)) %>%
ggplot(aes(x = cyl, y = n, fill = vs)) +
geom_col() +
facet_wrap(~ gear)
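Since a bubble plot was the original goal, the same counted data also maps onto geom_point() with the count on the size aesthetic; a minimal sketch, again using mtcars as a stand-in:
# bubble plot: the two factors give the position, the count gives the bubble area
mtcars %>%
  count(cyl = factor(cyl), vs = factor(vs), gear = factor(gear)) %>%
  ggplot(aes(x = cyl, y = vs, size = n)) +
  geom_point(alpha = 0.7) +
  scale_size_area(max_size = 12) +
  facet_wrap(~ gear)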
I have a complex, directed graph with two-way movements between vertices (see below for a dummy example). I am trying to generate an output that gives me the sum of the edge weights directed at a specific set of target vertices (in the example below, vertex "22", colored purple in the figure) and at those target vertices' neighbors. I want to determine this for the k1 (colored blue) and k2 (colored green) neighbors of the target vertex.
In other words, I am trying to determine, for each vertex, the sum of all "out" edge values that are directed towards the target vertices & subsequently the sum of all edge values directed towards k1 neighbours of the target vertex.
The network I have is huge (905,352 edges & 141,861 vertices), so I was hoping to solve the problem with igraph functions as I assume that is the fastest approach, but perhaps I am wrong.
library(igraph)
# create sample data for reproducible example
from <- c(1,2,3,3,4,4,4,4,5,6,6,7,8,8,9,9,10,10,11,11,12,12,13,13,13,13,13,13,13,14,15,15)
to <- c(13,4,7,11,2,6,11,22,4,4,14,13,13,22,13,22,13,22,3,22,5,22,1,7,8,9,10,22,15,6,13,22)
set.seed(22)
weight <- sample(2:200, length(to))
#create dataframe & convert to igraph
graph_df <- data.frame(from,to,weight)
graph <- graph_from_data_frame(graph_df)
#distance to target vertex "22"
dist <- distances(graph,v="22",mode="in",weights=NA)
library(ggraph)  # needed for the plot below
ggraph(graph, layout = "graphopt") +
geom_edge_link(arrow = arrow(length = unit(3, 'mm')),
end_cap = circle(3, 'mm'),
aes(width = weight), alpha = 0.8) +
scale_edge_width(range = c(0.1, 2)) +
geom_node_point(aes(color=factor(-dist),size = factor(-dist))) +
labs(edge_width = "size movement") +
theme_graph()
The desired output would be:
vertex 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 22
k1 0 0 0 129 0 0 0 63 66 115 111 162 86 0 92 0
k2 138 89 45 102 68 177 17 187 32 94 0 0 482 0 118 0
total 138 89 120 416 68 294 17 250 98 209 161 184 658 152 210 0
where
k1 = sum of edge weights per vertex on edges from k1 neighbors to target
k2 = sum of edge weights per vertex on edges from k2 neighbors to the k1 neighbors of the target
total = sum of all outgoing edge weights per vertex (i.e. the weighted out strength)
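For reference, the total row on its own is what igraph calls the weighted out-strength, available directly (strength() picks up the weight edge attribute by default):
# weighted out-strength per vertex
strength(graph, mode = "out")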
I have tried using the distances() function with weights, which gives the correct sum for k1 neighbours, but not for k2 or beyond.
distances(graph,v="22",mode="in")
#result of distances
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 22
224 218 156 129 197 306 103 63 66 115 111 162 86 458 92 0
I have had some luck with dplyr on the edge-list data frame via the code below, but my assumption is that there are faster ways to approach this issue:
#dataframe of k1 neighbors & summed weight
k1<- graph_df %>%
mutate(k1 = ifelse(to=="22",weight,NA)) %>%
group_by(from) %>%
summarise(total=sum(weight,na.rm=TRUE),
k1=sum(k1,na.rm=TRUE))
#data frame of k2 neighbors & summed weight
k2 <- graph_df %>%
mutate(k2=ifelse(to %in% k1$from[k1$k1>0],weight,NA)) %>%
group_by(from) %>%
summarise(k2 =sum(k2,na.rm=TRUE))
#join
out <- left_join(k1,k2,by="from") %>% rename(vertex=from)
# A tibble: 15 × 4
vertex total k1 k2
<dbl> <int> <int> <int>
1 1 138 0 138
2 2 89 0 89
3 3 120 0 45
4 4 416 129 102
5 5 68 0 68
6 6 294 0 177
7 7 17 0 17
8 8 250 63 187
9 9 98 66 32
10 10 209 115 94
11 11 161 111 0
12 12 184 162 0
13 13 658 86 482
14 14 152 0 0
15 15 210 92 118
Perhaps you can try this
graph_df %>%
group_by(from) %>%
summarise(total = sum(weight)) %>%
full_join(
graph_df %>%
filter(to %in% 22) %>%
group_by(from) %>%
summarise(K1 = sum(weight)) %>%
full_join(
graph_df %>%
filter(to %in% neighbors(graph, "22", mode = "in")) %>%
group_by(from) %>%
summarise(K2 = sum(weight))
)
) %>%
arrange(from) %>%
replace(is.na(.), 0) %>%
rename(vertex = from)
which gives
vertex total K1 K2
<dbl> <int> <int> <int>
1 1 138 0 138
2 2 89 0 89
3 3 120 0 45
4 4 416 129 102
5 5 68 0 68
6 6 294 0 177
7 7 17 0 17
8 8 250 63 187
9 9 98 66 32
10 10 209 115 94
11 11 161 111 0
12 12 184 162 0
13 13 658 86 482
14 14 152 0 0
15 15 210 92 118
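If k3 and beyond are ever needed, one way to generalize the same idea is to bucket every edge by the unweighted hop distance of its head vertex to the target. This is a sketch, assuming the hop count is what defines each ring of neighbours:
library(igraph)
library(dplyr)
library(tidyr)
# unweighted hop distance from every vertex to "22", following edge direction
d <- distances(graph, v = "22", mode = "in", weights = NA)[1, ]
graph_df %>%
  mutate(hop = d[as.character(to)] + 1) %>%   # an edge into a hop-(h-1) vertex is a k_h edge
  filter(is.finite(hop)) %>%
  group_by(from, hop) %>%
  summarise(w = sum(weight), .groups = "drop") %>%
  pivot_wider(names_from = hop, values_from = w,
              names_prefix = "k", values_fill = 0) %>%
  rename(vertex = from)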
I have a time series with the following data:
Provice Date Confirmed.cases virus fever Wuhan_Pneumonia temp wuhan sars
20 Anhui 02/09 779 30 0 0 10 25 0
21 Anhui 02/10 830 0 0 21 12 28 0
22 Anhui 02/11 860 43 0 21 12 0 0
23 Anhui 02/12 889 0 0 0 14 0 0
47 Chongqing 01/21 0 0 0 48 10 61 50
48 Chongqing 01/22 1 67 0 31 11 23 46
49 Chongqing 01/23 5 38 0 36 11 71 54
50 Chongqing 01/24 18 84 0 41 9 43 0
51 Chongqing 01/25 48 59 100 84 8 100 61
52 Chongqing 01/26 66 84 0 35 7 33 100
and would like to plot overlapping time series using ggplot. However, the plot comes out distorted. Can somebody please shed some light on what I'm doing wrong? Here is the code I am using:
ggplot(dta3, aes(x = Date, y = Confirmed.cases, color = Provice, group = 1)) +
geom_line() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
I had suspicions, and Ronak Shah voiced the same.
Does this fix it?
ggplot(dta3, aes(x = Date, y = Confirmed.cases, color = Provice, group = Provice)) +
geom_line() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
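The culprit is the group aesthetic: group = 1 declares that every row belongs to one single series, so geom_line() connects all the points in order and jumps between provinces, while group = Provice draws one line per province. A toy illustration with made-up numbers:
# with group = 1 both series get strung onto a single zig-zag line;
# grouping by the series column draws one clean line per province
library(ggplot2)
toy <- data.frame(
  Date     = rep(c("02/09", "02/10", "02/11"), 2),
  cases    = c(1, 3, 2, 10, 30, 20),
  Province = rep(c("A", "B"), each = 3)
)
ggplot(toy, aes(Date, cases, color = Province, group = Province)) +
  geom_line()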
I am new to R and I have a data frame very close to the one below. I would love to find a general way to tell me how many times (plus 1) the number "0" appears for each country (Intro4) and id.
Intro4 number id
221 TAN 0 19
222 TAN 0 73
223 TAN 0 73
224 TOG 0 37
225 TOG 0 58
226 UGA 0 96
227 UGA 0 112
228 UGA 0 96
229 ZAM 0 40
230 ZAM 0 99
231 ZAM 0 139
I can do it by hand, but it is a big data frame and would take forever. count() gives me the frequency but doesn't split it between the different countries. I have found a way to do it, but I would have to select and filter for each individual country (Intro4) and add 1 to the result. I was wondering if there was any quicker way to do it. The code I have tried was this one:
library(dplyr)
library(stringr)  # for str_detect()
projects  <- finalr %>% select(Intro4, number, id)
projects1 <- projects %>% filter(str_detect(number, "0"))
projects2 <- projects1 %>% arrange(Intro4)
projects3 <- sum(projects2$Intro4 == "TAN", na.rm = TRUE)
projects4 <- sum(projects2$Intro4 == "UGA", na.rm = TRUE)
I would be extremely grateful for any help, thank you :)
You can also do it as follows:
library(dplyr)
dat <- read.table(header = T, text =
"Intro4 number id
TAN 0 19
TAN 0 73
TAN 0 73
TOG 0 37
TOG 0 58
UGA 0 96
UGA 0 112
UGA 0 96
ZAM 0 40
ZAM 0 99
ZAM 0 139", stringsAsFactors = F)
dat %>% group_by(Intro4, id, number) %>% tally()
Which produces:
Intro4 id number n
<chr> <int> <int> <int>
1 TAN 19 0 1
2 TAN 73 0 2
3 TOG 37 0 1
4 TOG 58 0 1
5 UGA 96 0 2
6 UGA 112 0 1
7 ZAM 40 0 1
8 ZAM 99 0 1
9 ZAM 139 0 1
Assuming number can be anything (0, 1, 2, etc.), one can count the occurrences of 0 with sum(number == 0). A solution using dplyr:
library(dplyr)
df %>% group_by(Intro4, id) %>%
summarise(count = sum(number==0))
# # A tibble: 9 x 3
# # Groups: Intro4 [?]
# Intro4 id count
# <chr> <int> <int>
# 1 TAN 19 1
# 2 TAN 73 2
# 3 TOG 37 1
# 4 TOG 58 1
# 5 UGA 96 2
# 6 UGA 112 1
# 7 ZAM 40 1
# 8 ZAM 99 1
# 9 ZAM 139 1
Data:
df <- read.table(text="
Intro4 number id
221 TAN 0 19
222 TAN 0 73
223 TAN 0 73
224 TOG 0 37
225 TOG 0 58
226 UGA 0 96
227 UGA 0 112
228 UGA 0 96
229 ZAM 0 40
230 ZAM 0 99
231 ZAM 0 139",
header = TRUE, stringsAsFactors = FALSE)
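For completeness, the same per-group count of zeros can be written in base R, with no packages, using the df defined above:
# base R: zeros per Intro4/id combination
aggregate(number ~ Intro4 + id, data = df, FUN = function(x) sum(x == 0))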
I have a data set with closing and opening dates of public schools in California, available here or as the dput() at the bottom of the question. The data also list what type of school it is and where it is. I am trying to create a running-total column that takes into account school closings as well as school type.
Here is the solution I've come up with, which basically entails me encoding a lot of different 1's and 0's based on the conditions using ifelse:
# open charter schools
pubschls$open_chart <- ifelse(pubschls$Charter=="Y" & is.na(pubschls$ClosedDate)==TRUE, 1, 0)
# open public schools
pubschls$open_pub <- ifelse(pubschls$Charter=="N" & is.na(pubschls$ClosedDate)==TRUE, 1, 0)
# closed charters
pubschls$closed_chart <- ifelse(pubschls$Charter=="Y" & is.na(pubschls$ClosedDate)==FALSE, 1, 0)
# closed public schools
pubschls$closed_pub <- ifelse(pubschls$Charter=="N" & is.na(pubschls$ClosedDate)==FALSE, 1, 0)
lausd <- filter(pubschls, NCESDist=="0622710")
Then I count the number open during each year and subtract the columns from each other to get totals.
# count number open during each year
la_schools_count <- aggregate(lausd[c('open_chart','closed_chart','open_pub','closed_pub')],
by=list(year(lausd$OpenDate)), sum)
# find net charters by subtracting closed from open
la_schools_count$net_chart <- la_schools_count$open_chart - la_schools_count$closed_chart
# find net public schools by subtracting closed from open
la_schools_count$net_pub <- la_schools_count$open_pub - la_schools_count$closed_pub
# add running totals
la_schools_count$cum_chart <- cumsum(la_schools_count$net_chart)
la_schools_count$cum_pub <- cumsum(la_schools_count$net_pub)
# total totals
la_schools_count$total <- la_schools_count$cum_chart + la_schools_count$cum_pub
My output looks like this:
la_schools_count <- select(la_schools_count, "year", "cum_chart", "cum_pub", "pen_rate", "total")
year cum_chart cum_pub pen_rate total
1 1952 1 0 100.00000 1
2 1956 1 1 50.00000 2
3 1969 1 2 33.33333 3
4 1980 55 469 10.49618 524
5 1989 55 470 10.47619 525
6 1990 55 470 10.47619 525
7 1991 55 473 10.41667 528
8 1992 55 476 10.35782 531
9 1993 55 477 10.33835 532
10 1994 56 478 10.48689 534
11 1995 57 478 10.65421 535
12 1996 57 479 10.63433 536
13 1997 58 481 10.76067 539
14 1998 59 480 10.94620 539
15 1999 61 480 11.27542 541
16 2000 61 481 11.25461 542
17 2001 62 482 11.39706 544
18 2002 64 484 11.67883 548
19 2003 73 485 13.08244 558
20 2004 83 496 14.33506 579
21 2005 90 524 14.65798 614
22 2006 96 532 15.28662 628
23 2007 90 534 14.42308 624
24 2008 97 539 15.25157 636
25 2009 108 546 16.51376 654
26 2010 124 566 17.97101 690
27 2011 140 580 19.44444 720
28 2012 144 605 19.22563 749
29 2013 162 609 21.01167 771
30 2014 179 611 22.65823 790
31 2015 195 611 24.19355 806
32 2016 203 614 24.84700 817
33 2017 211 619 25.42169 830
I'm just wondering if this could be done in a better way, like an apply statement over all rows based on the conditions?
dput:
structure(list(CDSCode = c("19647330100289", "19647330100297",
"19647330100669", "19647330100677", "19647330100743", "19647330100750"
), OpenDate = structure(c(12324, 12297, 12240, 12299, 12634,
12310), class = "Date"), ClosedDate = structure(c(NA, 15176,
NA, NA, NA, NA), class = "Date"), Charter = c("Y", "Y", "Y",
"Y", "Y", "Y")), .Names = c("CDSCode", "OpenDate", "ClosedDate",
"Charter"), row.names = c(NA, -6L), class = c("tbl_df", "tbl",
"data.frame"))
I followed your code and understood what you were doing, except for pen_rate. It seems that pen_rate is calculated by dividing cum_chart by total. I downloaded the original data set and did the following. I called the data set foo. Instead of encoding the four groups (i.e., open_chart, closed_chart, open_pub, and closed_pub) with separate ifelse() calls, I combined Charter and ClosedDate: I checked whether ClosedDate is NA and converted the logical output to numbers (1 = open, 0 = closed). I guess this asks you to do less typing. Since the dates are stored as character, I extracted the year using substr(); if you have a date object, you need to do something else. Once you have year, you group the data by it and calculate how many schools of each type exist using count(). This part is the equivalent of your aggregate() code. Then I converted the output to a wide format with spread() and did the rest of the calculation as you demonstrated in your code. The final output looks different from what you have in your question, but my outcome was identical to the one I obtained by running your code. I hope this will help you.
library(dplyr)
library(tidyr)
library(readxl)
# Get the necessary data
foo <- read_xls("pubschls.xls") %>%
select(NCESDist, CDSCode, OpenDate, ClosedDate, Charter) %>%
filter(NCESDist == "0622710" & (!Charter %in% NA))
mutate(foo, group = paste(Charter, as.numeric(is.na(ClosedDate)), sep = "_"),
year = substr(OpenDate, start = nchar(OpenDate) - 3, stop = nchar(OpenDate))) %>%
count(year, group) %>%
spread(key = group, value = n, fill = 0) %>%
mutate(net_chart = Y_1 - Y_0,
net_pub = N_1 - N_0,
cum_chart = cumsum(net_chart),
cum_pub = cumsum(net_pub),
total = cum_chart + cum_pub,
pen_rate = cum_chart / total)
# A part of the outcome
# year N_0 N_1 Y_0 Y_1 net_chart net_pub cum_chart cum_pub total pen_rate
#1 1866 0 1 0 0 0 1 0 1 1 0.00000000
#2 1873 0 1 0 0 0 1 0 2 2 0.00000000
#3 1878 0 1 0 0 0 1 0 3 3 0.00000000
#4 1881 0 1 0 0 0 1 0 4 4 0.00000000
#5 1882 0 2 0 0 0 2 0 6 6 0.00000000
#110 2007 0 2 15 9 -6 2 87 393 480 0.18125000
#111 2008 2 8 9 15 6 6 93 399 492 0.18902439
#112 2009 1 9 4 15 11 8 104 407 511 0.20352250
#113 2010 5 26 5 21 16 21 120 428 548 0.21897810
#114 2011 2 16 2 18 16 14 136 442 578 0.23529412
#115 2012 2 27 3 7 4 25 140 467 607 0.23064250
#116 2013 1 5 1 19 18 4 158 471 629 0.25119237
#117 2014 1 3 1 18 17 2 175 473 648 0.27006173
#118 2015 0 0 2 18 16 0 191 473 664 0.28765060
#119 2016 0 3 0 8 8 3 199 476 675 0.29481481
#120 2017 0 5 0 9 9 5 208 481 689 0.30188679
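A follow-up on the date caveat above: if OpenDate is a Date object (as in the dput sample) rather than a character string, the substr() step can be swapped for format(), e.g.:
# extract the year from a Date column instead of slicing a character string
mutate(foo, year = format(OpenDate, "%Y"))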
Using the data.frame below (Source: http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/en_Tanagra_PLSR_Software_Comparison.pdf)
Data
df <- read.table(text = c("
diesel twodoors sportsstyle wheelbase length width height curbweight enginesize horsepower horse_per_weight conscity price symboling
0 1 0 97 172 66 56 2209 109 85 0.0385 8.7 7975 2
0 0 0 100 177 66 54 2337 109 102 0.0436 9.8 13950 2
0 0 0 116 203 72 57 3740 234 155 0.0414 14.7 34184 -1
0 1 1 103 184 68 52 3016 171 161 0.0534 12.4 15998 3
0 0 0 101 177 65 54 2765 164 121 0.0438 11.2 21105 0
0 1 0 90 169 65 52 2756 194 207 0.0751 13.8 34028 3
1 0 0 105 175 66 54 2700 134 72 0.0267 7.6 18344 0
0 0 0 108 187 68 57 3020 120 97 0.0321 12.4 11900 0
0 0 1 94 157 64 51 1967 90 68 0.0346 7.6 6229 1
0 1 0 95 169 64 53 2265 98 112 0.0494 9.0 9298 1
1 0 0 96 166 64 53 2275 110 56 0.0246 6.9 7898 0
0 1 0 100 177 66 53 2507 136 110 0.0439 12.4 15250 2
0 1 1 94 157 64 51 1876 90 68 0.0362 6.4 5572 1
0 0 0 95 170 64 54 2024 97 69 0.0341 7.6 7349 1
0 1 1 95 171 66 52 2823 152 154 0.0546 12.4 16500 1
0 0 0 103 175 65 60 2535 122 88 0.0347 9.8 8921 -1
0 0 0 113 200 70 53 4066 258 176 0.0433 15.7 32250 0
0 0 0 95 165 64 55 1938 97 69 0.0356 7.6 6849 1
1 0 0 97 172 66 56 2319 97 68 0.0293 6.4 9495 2
0 0 0 97 172 66 56 2275 109 85 0.0374 8.7 8495 2"), header = T)
and this
Code
library(plsdepot)
df.plsdepot = plsreg1(df[, 1:11], df[, 14, drop = FALSE], comps = 3)
data<-df.plsdepot$cor.xyt
data<-as.data.frame(data)
I got this data.frame of the correlation between variables and components
data
# t1 t2 t3
#diesel -0.23513860 -0.38154681 0.439221649
#twodoors 0.71849247 0.45622386 0.055982798
#sportsstyle 0.51909329 -0.02381952 -0.672617464
#wheelbase -0.86843937 0.34114664 -0.254589548
#length -0.75311884 0.62404991 -0.085596033
#width -0.67444970 0.62282146 -0.158675019
#height -0.67228557 -0.14675385 0.317166599
#curbweight -0.59305898 0.73532560 -0.241983833
#enginesize -0.39475651 0.82353941 -0.252270394
#horsepower 0.04843256 0.96637015 -0.148407288
#horse_per_weight 0.50515322 0.81502376 -0.006045151
#symboling 0.64900253 0.23673633 0.346902434
and I managed to plot them as below
library(plsdepot)
df.plsdepot = plsreg1(df[, 1:11], df[, 14, drop = FALSE], comps = 3)
plot(df.plsdepot, comps = c(1, 2))
However, I had to use the pls package instead of plsdepot, and I still need to get the correlations between variables and components and plot them.
Using pls, I managed to plot the correlation between variables and components as below
library(pls)
Y <- as.matrix(df[,14])
X <- as.matrix(df[,1:11])
df.pls <- mvr(Y ~ X, ncomp = 3, method = "oscorespls", scale = T)
plot(df.pls, "correlation")
However, I couldn't find a way to get these values (correlation between variables and components) and convert them to data.frame using pls package.
Any help on how to get these correlation values using the pls package would be highly appreciated.
Thanks to Bjørn-Helge Mevik (the maintainer of the pls package) for his answer below.
==========================================================================
If you look at the corrplot code:
> corrplot
function (object, comps = 1:2, labels, radii = c(sqrt(1/2), 1),
identify = FALSE, type = "p", xlab, ylab, ...) {
nComps <- length(comps)
if (nComps < 2)
stop("At least two components must be selected.")
if (is.matrix(object)) {
cl <- object[, comps, drop = FALSE]
varlab <- colnames(cl)
}
else {
S <- scores(object)[, comps, drop = FALSE]
if (is.null(S))
stop("`", deparse(substitute(object)), "' has no scores.")
cl <- cor(model.matrix(object), S)
varlab <- compnames(object, comps, explvar = TRUE)
}
you will see that it basically does
S <- scores(object)[, comps, drop = FALSE]
cl <- cor(model.matrix(object), S)
to calculate the correlation loadings. Using df.pls in place of object should give you a matrix of correlation loadings.
comps <- 1:2
S <- scores(df.pls)[, comps, drop = FALSE]
cl <- cor(model.matrix(df.pls), S)
df.cor <- as.data.frame(cl)
df.cor
# Comp 1 Comp 2
#diesel -0.23513860 -0.38154681
#twodoors 0.71849247 0.45622386
#sportsstyle 0.51909329 -0.02381952
#wheelbase -0.86843937 0.34114664
#length -0.75311884 0.62404991
#width -0.67444970 0.62282146
#height -0.67228557 -0.14675385
#curbweight -0.59305898 0.73532560
#enginesize -0.39475651 0.82353941
#horsepower 0.04843256 0.96637015
#horse_per_weight 0.50515322 0.81502376
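To double-check the extracted values against the corrplot() output, the correlation circle can be redrawn by hand from the cl matrix above; a sketch in base graphics:
# redraw the correlation loadings inside a unit circle
plot(cl[, 1], cl[, 2], asp = 1, xlim = c(-1, 1), ylim = c(-1, 1),
     xlab = colnames(cl)[1], ylab = colnames(cl)[2], type = "n")
abline(h = 0, v = 0, lty = 2)
symbols(0, 0, circles = 1, inches = FALSE, add = TRUE)   # unit circle
text(cl[, 1], cl[, 2], labels = rownames(cl), cex = 0.8)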