I am attempting to join two data sets, one with crash data and the other with population by area. The crash data has 7 regions. Let's say it looks like this and is called "crashes":
region       fatal  severe  minor
Canterbury     222     833   1022
West Coast      23     109    321
Southland       56     112    192
Nelson          63     156    345
Tasman          33      88    111
Otago          121     489    701
Marlborough     31      91    109
The population by area looks like this and is called "population":
TA                Population
Christchurch          211022
Selwyn                 23000
Ashburton              56011
Timaru                 63891
Queenstown-Lakes       45111
Central Otago          12113
Clutha                  3111
For reference, I would like to join the population to the crash data, but first I must combine some of the TAs into the bigger regions. The Canterbury region = Christchurch + Selwyn + Ashburton + Timaru.
The Otago region = Queenstown-Lakes + Central Otago + Clutha. I would like to sum the populations of these TAs and then join the result to the crashes data according to the region each TA is in. Sorry for the messy example; I am unsure how to make a good example for this sort of data.
I think a way to do this is to build a translation vector from small region to big region and use it to add a region column to population. You can then group on that column to get the population per region, and then join the result to the crashes table.
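library(dplyr)

# Lookup vector: names are the TAs, values are the region each TA belongs to.
# The region names must match the spelling used in crashes$region.
small_region <- c('Christchurch', 'Selwyn', 'Ashburton', 'Timaru',
                  'Queenstown-Lakes', 'Central Otago', 'Clutha')
small_to_big <- structure(c(rep("Canterbury", 4), rep("Otago", 3)),
                          names = small_region)
population %>%
  # as.character() guards against TA being a factor (factors index by integer code)
  mutate(region = small_to_big[as.character(TA)]) %>%
  group_by(region) %>%
  summarise(Population = sum(Population)) %>%
  right_join(crashes, by = "region")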
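To check the idea without dplyr, here is a minimal base-R sketch of the same lookup, group and join steps. It rebuilds both tables from the numbers in the question; the names ta_to_region, region_pop and combined are made up for this illustration.

```r
# Sample data copied from the question
crashes <- data.frame(
  region = c("Canterbury", "West Coast", "Southland", "Nelson",
             "Tasman", "Otago", "Marlborough"),
  fatal  = c(222, 23, 56, 63, 33, 121, 31),
  severe = c(833, 109, 112, 156, 88, 489, 91),
  minor  = c(1022, 321, 192, 345, 111, 701, 109),
  stringsAsFactors = FALSE
)
population <- data.frame(
  TA = c("Christchurch", "Selwyn", "Ashburton", "Timaru",
         "Queenstown-Lakes", "Central Otago", "Clutha"),
  Population = c(211022, 23000, 56011, 63891, 45111, 12113, 3111),
  stringsAsFactors = FALSE
)

# Hypothetical lookup: TA name -> region name
ta_to_region <- c(
  "Christchurch" = "Canterbury", "Selwyn" = "Canterbury",
  "Ashburton" = "Canterbury", "Timaru" = "Canterbury",
  "Queenstown-Lakes" = "Otago", "Central Otago" = "Otago", "Clutha" = "Otago"
)

# Tag each TA with its region, sum per region, then left-join onto crashes
population$region <- unname(ta_to_region[as.character(population$TA)])
region_pop <- aggregate(Population ~ region, data = population, FUN = sum)
combined <- merge(crashes, region_pop, by = "region", all.x = TRUE)
combined
```

Regions with no TA in the population table (West Coast, Nelson and so on) keep their crash counts and get NA for Population, which is the same behaviour as the right_join above.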
I am trying to build a data frame so I can generate a plot from a specific set of data, but I am having trouble getting the data into a table correctly.
So, here is what I have available from a data query:
> head(c, n=10)
               EVTYPE FATALITIES INJURIES
834           TORNADO       5633    91346
856         TSTM WIND        504     6957
170             FLOOD        470     6789
130    EXCESSIVE HEAT       1903     6525
464         LIGHTNING        816     5230
275              HEAT        937     2100
427         ICE STORM         89     1975
153       FLASH FLOOD        978     1777
760 THUNDERSTORM WIND        133     1488
244              HAIL         15     1361
I then tried to generate a set of variables to build a finished data.frame like this:
a <- c(c[1,1], c[1,2], c[1,3])
b <- c(c[6,1], c[4,2] + c[6,2], c[4,3] + c[6,3])
d <- c(c[2,1], c[2,2], c[2,3])
e <- c(c[3,1], c[3,2], c[3,3])
f <- c(c[5,1], c[5,2], c[5,3])
g <- c(c[7,1], c[7,2], c[7,3])
h <- c(c[8,1], c[8,2], c[8,3])
i <- c(c[9,1], c[9,2], c[9,3])
j <- c(c[10,1], c[10,2], c[10,3])
k <- c(c[11,1], c[11,2], c[11,3])
df <- data.frame(a,b,d,e,f,g,h,i,j)
names(df) <- c("Event", "Fatalities","Injuries")
But that is failing miserably. What I am getting is a long string of all the data variables, repeated 10 times. Nice trick, but that is not what I am looking for.
I would like to get a finished data.frame with ten (10) rows of the data, like it was originally, but with my combined data in place. Is that possible?
I am using R version 3.5.3, and the tidyverse library is not available for install on that version.
Any ideas as to how I can generate that data.frame?
If a barplot is what you're after, here's a piece of code to get you that:
First, you need to get the data in the right format (that's probably what you tried to do in df), by column-binding the two numerical variables using cbind and transposing the resulting matrix using t (i.e., turning rows into columns and vice versa):
plotdata <- t(cbind(c$FATALITIES, c$INJURIES))
Then set the layout to your plot, with a wide margin for the x-axis to accommodate your long factor names:
par(mfrow=c(1,1), mar = c(8,3,3,3))
Now you're ready to plot the data; you grab the labels from c$EVTYPE, reduce the label size in cex.names and rotate them with las to avoid overplotting:
barplot(plotdata, beside=T, names = c$EVTYPE, col=c("red","blue"), cex.names = 0.7, las = 3)
(You can add main = to define the heading of your plot.)
That's the barplot you should obtain:
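For the data frame the question actually asks for, here is a minimal base-R sketch (no tidyverse needed). The attempt in the question fails because data.frame(a, b, ...) treats each vector as a column, and c() on a label plus two numbers coerces everything to one type. The sketch instead relabels EXCESSIVE HEAT as HEAT and sums by event type. The name events is a stand-in for the queried object c, rebuilt here from the head() output, and folding the two heat rows into one is my reading of what b was meant to do.

```r
# Stand-in for the query result `c`, rebuilt from the head() output in the question
events <- data.frame(
  EVTYPE = c("TORNADO", "TSTM WIND", "FLOOD", "EXCESSIVE HEAT", "LIGHTNING",
             "HEAT", "ICE STORM", "FLASH FLOOD", "THUNDERSTORM WIND", "HAIL"),
  FATALITIES = c(5633, 504, 470, 1903, 816, 937, 89, 978, 133, 15),
  INJURIES = c(91346, 6957, 6789, 6525, 5230, 2100, 1975, 1777, 1488, 1361),
  stringsAsFactors = FALSE
)

# Fold EXCESSIVE HEAT into HEAT, then sum both numeric columns per event type
events$EVTYPE[events$EVTYPE == "EXCESSIVE HEAT"] <- "HEAT"
df <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = events, FUN = sum)

# Restore the original ordering (most injuries first) and rename the columns
df <- df[order(-df$INJURIES), ]
names(df) <- c("Event", "Fatalities", "Injuries")
rownames(df) <- NULL
df
```

With only the ten rows shown, merging two of them leaves nine; running the same steps on the full query result and taking head(df, 10) afterwards would give ten rows.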
I am trying to combine two CSV files into a graph with graph_from_data_frame using multiple cores.
I have already developed the code (see below); I just need to adapt it to use more than one core.
Examples of the two CSV files are posted below. Because of the volume of data in the real files, multiple cores are needed.
id
123
321
231
423
353
534
345
646
346
from to weight
123 456 2
123 435 3
432 654 2
342 543 4
234 323 3
432 543 4
234 543 1
234 654 1
edges <- read.csv("/Users/holly/edgeR.csv", header=T, as.is=T)
nodes <- read.csv("/Users/holly/nodeR.csv", header=T, as.is=T)
#libraries
library(igraph)
library(tictoc)
library(network)
library(data.table)
#Edges data set includes from and to addresses for block 200k from Neo4j
#edges <- read.csv("/Users/jonathanbailey/edges.csv", header=T, as.is=T)
#Node data set contains all the address node ids for block 200k
#nodes <- read.csv("/Users/jonathanbailey/nodes.csv", header=T, as.is=T)
#Show titles of data set
head(nodes)
head(edges)
#Remove the weights column
#edges$weights <- NULL
#Removes duplicate values in the nodes data set
nodes <- nodes[!duplicated(nodes$id),]
# coerces the edge data into a character matrix for igraph
el=as.matrix(edges)
el[,1]=as.character(el[,1])
el[,2]=as.character(el[,2])
#Creates a graph in R with edges and nodes
clustergraph1 <- graph_from_data_frame(el, directed = FALSE, vertices = nodes)
#Applies the Louvain algorithm to the above graph
Community200k <- cluster_louvain(clustergraph1)
Is there a way to make the two csv files merge into the graph data frame using parallel cores?
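One possible direction, offered as a sketch rather than a full answer: the step that most obviously benefits from several cores is reading the files, and data.table::fread (data.table is already loaded above) reads CSVs with multiple threads. The example below writes two tiny made-up files to temporary paths so it is self-contained; the file contents and the names edge_file, node_file and g are placeholders, not the real data. The graph construction call itself is unchanged from the question.

```r
library(data.table)
library(igraph)

# Tiny placeholder files so the sketch runs on its own (not the real data)
edge_file <- tempfile(fileext = ".csv")
node_file <- tempfile(fileext = ".csv")
writeLines(c("from,to,weight", "123,321,2", "123,231,3", "231,423,2"), edge_file)
writeLines(c("id", "123", "321", "231", "423", "123"), node_file)

# fread parses with several threads; nThread defaults to getDTthreads()
edges <- fread(edge_file, colClasses = list(character = c("from", "to")),
               nThread = getDTthreads(), data.table = FALSE)
nodes <- fread(node_file, colClasses = list(character = "id"),
               nThread = getDTthreads(), data.table = FALSE)

# Same de-duplication and graph construction as in the question
nodes <- nodes[!duplicated(nodes$id), , drop = FALSE]
g <- graph_from_data_frame(edges, directed = FALSE, vertices = nodes)
g
```

Reading the ids as character up front also removes the need for the as.matrix and as.character conversion, and the weight column is carried along as an edge attribute.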
The big-picture explanation is that I am trying to do a sliding window analysis on environmental data in R. I have PAR (photosynthetically active radiation) data for a select number of sequential dates (pre-determined based on other biological factors) for two years (2014 and 2015), with one value of PAR per day. The first few lines of the data frame are shown below (the data frame is named "rollingpar").
par14 par15
1356.3242 1306.7725
NaN 1232.5637
1349.3519 505.4832
NaN 1350.4282
1344.9306 1344.6508
NaN 1277.9051
989.5620 NaN
I would like to create a loop (or any other way possible) to subset the data frame (both columns!) into two-week windows (14 rows) from start to finish, sliding from one window to the next by a week (7 rows). So the first window would include rows 1 to 14, the second window would include rows 8 to 21, and so forth. After subsetting, the data needs to be reshaped (currently using the melt function in the reshape2 package) so that the PAR values are in one column and the variable (par14 or par15) is in the other column. Then I need to get rid of the NaN data and finally perform a Wilcoxon rank-sum test on each window, comparing PAR by year (par14 or par15). Below is the code I wrote as a proof of concept; for the first subsetted window it gives me exactly what I want.
library(reshape2)
par.sub=rollingpar[1:14, ]
par.sub=melt(par.sub)
par.sub=na.omit(par.sub)
par.sub$variable=as.factor(par.sub$variable)
wilcox.test(value~variable, par.sub)
#when melt flips a data frame the columns become value and variable...
#for this case value holds the PAR data and variable holds the year
#information
When I tried to write a for loop to iterate the process through the whole data frame (total rows = 139) I got errors every which way I ran it. Additionally, this loop doesn't even take into account the sliding by one week aspect. I figured if I could just figure out how to get windows and run analysis via a loop first then I could try to parse through the sliding part. Basically I realize that what I explained I wanted and what I wrote this for loop to do are slightly different. The code below is sliding row by row or on a one day basis. I would greatly appreciate if the solution encompassed the sliding by a week aspect. I am fairly new to R and do not have extensive experience with for loops so I feel like there is probably an easy fix to make this work.
wilcoxvalues=data.frame(p.values=numeric(0))
Upar=rollingpar$par14
for (i in 1:length(Upar)){
par.sub=rollingpar[[i]:[i]+13, ]
par.sub=melt(par.sub)
par.sub=na.omit(par.sub)
par.sub$variable=as.factor(par.sub$variable)
save.sub=wilcox.test(value~variable, par.sub)
for (j in 1:length(save.sub)){
wilcoxvalues$p.value[j]=save.sub$p.value
}
}
If anyone has a much better way to do this through a different package or function that I am unaware of, I would love to be enlightened. I did try rollapply but ran into problems finding a way to apply it to an entire data frame and not just one column. I have searched the many other questions regarding subsetting, for loops, and rolling analysis, but can't quite seem to find exactly what I need. Any help would be appreciated by a frustrated grad student :) and if I did not provide enough information please let me know.
Consider an lapply over the start row of each window: every 7th row, stopping 13 rows before the end of the data frame so that every window holds a full 14 rows (a window running past the last row would contain only NA rows and make the test fail). Each call returns a one-row data frame with the Wilcoxon test p-value and a Week indicator. Then row bind the list items into a final, single data frame:
library(reshape2)
# START ROW OF EACH TWO-WEEK WINDOW, SLIDING BY ONE WEEK
slidingWindow <- seq(1, nrow(rollingpar) - 13, by = 7)
slidingWindow
# WITH THE 139 ROWS MENTIONED IN THE QUESTION:
# [1]   1   8  15  22  29  36  43  50  57  64  71  78  85  92  99 106 113 120
# LIST OF WILCOX P VALUES DFs FOR EACH SLIDING WINDOW (TWO-WEEK PERIODS)
wilcoxvalues <- lapply(slidingWindow, function(i) {
  par.sub <- rollingpar[i:(i+13), ]
  par.sub <- melt(par.sub)
  par.sub <- na.omit(par.sub)
  par.sub$variable <- as.factor(par.sub$variable)
  data.frame(week = paste0("Week: ", i%/%7+1, "-", i%/%7+2),
             p.values = wilcox.test(value~variable, par.sub)$p.value)
})
# SINGLE DF OF ALL P-VALUES
wilcoxdf <- do.call(rbind, wilcoxvalues)
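Since the question asked for a for loop, here is a minimal sketch of the same sliding window written that way, in base R only. The rollingpar built here is random stand-in data with 139 rows and a few NaN values, not the real PAR measurements, and stack() is swapped in for melt() so that reshape2 is not required. The names starts, rows, start_row and end_row are made up for this illustration.

```r
set.seed(1)
# Stand-in data: 139 days, two years, a few missing values (not the real PAR data)
rollingpar <- data.frame(par14 = rnorm(139, mean = 1300, sd = 100),
                         par15 = rnorm(139, mean = 1250, sd = 100))
rollingpar$par14[c(2, 4, 6)] <- NaN
rollingpar$par15[7] <- NaN

# Start row of each 14-row window, sliding by 7 rows; full windows only
starts <- seq(1, nrow(rollingpar) - 13, by = 7)

# Preallocate one result row per window
wilcoxvalues <- data.frame(start_row = starts,
                           end_row = starts + 13,
                           p.value = NA_real_)

for (k in seq_along(starts)) {
  rows <- starts[k]:(starts[k] + 13)
  # stack() gives two columns: values (PAR) and ind (par14 or par15)
  par.sub <- na.omit(stack(rollingpar[rows, ]))
  wilcoxvalues$p.value[k] <- wilcox.test(values ~ ind, data = par.sub)$p.value
}
wilcoxvalues
```

Preallocating the result and filling row k avoids the inner loop over the test object in the original attempt, and indexing with starts[k]:(starts[k] + 13) sidesteps the operator-precedence trap in i:i+13.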
I am trying to draw a boxplot in R:
I have a dataset with 70 attributes:
The format is
patient number  medical_speciality  number_of_procedures
111             Ortho               21
232             Emergency           16
878             Pediatrics          20
981             OBGYN               31
232             Care of Elderly     15
211             Ortho               32
238             Care of Elderly     11
219             Care of Elderly     6
189             Emergency           67
323             Emergency           23
189             Pediatrics          1
289             Ortho               34
I have been trying to get a subset to only include emergency, pediatrics in a boxplot (there are 10000+ datapoints in reality)
I thought that I could just do this:
newdata<-subset(olddata[ms$medical_specialty=='emergency'|olddata$medical_specialty=='pediatrics',])
plot(newdata)
Since if I do a summary of newdata, all it has is the pediatrics and emergency results. But when it comes to plotting it still includes the ortho, OBGYN, care of elderly in the x axis with no boxplot.
I presume that there is a way to do this in ggplot by doing
ggplot(newdata, aes(x=medical_speciality, y=num_of_procedures, fill=cond)) + geom_boxplot()
but this gives me the error:
Don't know how to automatically pick scale for object of type data.frame.
Defaulting to continuous
Error: Aesthetics must either be length one, or the same length as the dataProblems:cond
Can someone help me out?
I believe your problem comes from the fact that the column medical_speciality is a factor.
So, even though the subset keeps only the rows you want, the factor still carries all of its original levels (including "Ortho", "OBGYN", etc.), and the plot draws an empty slot for each of them.
You can get rid of the unused levels with the function droplevels:
# column name and level spelling follow the sample table; adjust them to match your data
newdata <- subset(olddata, medical_speciality %in% c("Emergency", "Pediatrics"))
newdata <- droplevels(newdata) ## THIS IS THE NEW ADDITION
plot(newdata)
Does this help?
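A small self-contained sketch of the effect, using the twelve sample rows from the question (the real data has many more rows and columns, so treat the column names here as assumptions):

```r
# Sample rows from the question; medical_speciality is stored as a factor
olddata <- data.frame(
  patient_number = c(111, 232, 878, 981, 232, 211, 238, 219, 189, 323, 189, 289),
  medical_speciality = factor(c("Ortho", "Emergency", "Pediatrics", "OBGYN",
                                "Care of Elderly", "Ortho", "Care of Elderly",
                                "Care of Elderly", "Emergency", "Emergency",
                                "Pediatrics", "Ortho")),
  number_of_procedures = c(21, 16, 20, 31, 15, 32, 11, 6, 67, 23, 1, 34)
)

newdata <- subset(olddata, medical_speciality %in% c("Emergency", "Pediatrics"))
levels_before <- nlevels(newdata$medical_speciality)  # still counts the unused levels

newdata <- droplevels(newdata)
levels_after <- nlevels(newdata$medical_speciality)   # only levels present in the rows

# Boxplot now shows just the two remaining specialities on the x axis
boxplot(number_of_procedures ~ medical_speciality, data = newdata)
```

The same newdata can be passed to ggplot; the fill=cond error in the question is a separate issue, since cond is not a column of the data.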