Okay, so I'm really stuck. I have a data set which looks like this:
Species Latitude Longitude Oiling Condition BirdCount Date_ Oil_Cond Date week.number
1 Northern Gannet 30.32860 -89.19810 Not Visibly Oiled Live 1 2010-07-21 1 2010-07-21 30
2 Laughing Gull 30.23172 -88.32127 Not Visibly Oiled Live 1 2010-05-05 1 2010-05-05 19
3 Northern Gannet 30.26677 -87.59248 Visibly Oiled Live 1 2010-05-05 2 2010-05-05 19
4 American White Pelican 29.29649 -89.66432 Not Visibly Oiled Live 1 2010-05-05 1 2010-05-05 19
5 Brown Pelican 29.88244 -88.87624 Visibly Oiled Live 1 2010-05-08 2 2010-05-08 19
6 Brown Pelican 29.00290 -89.36961 Not Visibly Oiled Live 1 2010-05-14 1 2010-05-14 20
7 Northern Gannet 30.33390 -85.56565 Unknown Live 1 2010-05-17 6 2010-05-17 21
8 Common Loon 30.28177 -87.51028 Not Visibly Oiled Live 1 2010-05-17 1 2010-05-17 21
9 Brown Pelican 30.41410 -88.24542 Visibly Oiled Live 1 2010-05-18 2 2010-05-18 21
10 Northern Gannet 30.24063 -88.12451 Not Visibly Oiled Live 1 2010-05-18 1 2010-05-18 21
And I'm trying to get a faceted histogram plotting the variable Oil_Cond for the 5 most frequent species of birds (there are over 100 unique bird species).
At first I wanted to produce a facet with all the species and used the following code:
qplot(Oil_Cond, data = birds, facets = Species ~., geom = "histogram")
But of course, that overloaded and wouldn't work because there would have been over 100 facets. So then I decided that I really only care about the top 5 species anyway, and I worked out what they are and how often they appear (Laughing Gull: 3036, Brown Pelican: 789, Northern Gannet: 546, Royal Tern: 321, Black Skimmer: 258). However, I am at a loss as to how to plot only those five species.
Any help would be much appreciated.
Thank you :)
Amy
The easiest thing to do here may be to simply plot a subset of your data. The only potential thing to be careful of is if the species variable is stored as a factor, rather than as strings. First create a subset:
birdsSub <- subset(birds, Species %in% c('Laughing Gull','Brown Pelican',
'Northern Gannet','Royal Tern','Black Skimmer'))
birdsSub <- droplevels(birdsSub)
and then you should be able to pass this data frame to qplot as you have before. The reason for the droplevels is that if that variable is stored as a factor, all the species that no longer appear will 'come along for the ride' as unused factor levels, and you'll just end up with all 100 panels, all but five of them being empty. (Calling droplevels on the whole data frame also works when Species is stored as strings, in which case there is simply nothing to drop.)
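If you would rather not type the species names by hand, the top species can be computed from the data. The following is a minimal sketch on a made-up toy data frame (birds_toy, k and top_species are names invented for this illustration, not part of the original data):

```r
# Toy data standing in for the real `birds` data frame (assumed for illustration)
birds_toy <- data.frame(
  Species = factor(c("Laughing Gull", "Laughing Gull", "Laughing Gull",
                     "Brown Pelican", "Brown Pelican", "Common Loon")),
  Oil_Cond = c(1, 2, 1, 2, 1, 6)
)

# Count rows per species, sort in decreasing order, keep the top k names
k <- 2
top_species <- names(sort(table(birds_toy$Species), decreasing = TRUE))[seq_len(k)]

# Subset to those species, then drop the unused factor levels
birds_top <- droplevels(subset(birds_toy, Species %in% top_species))

nlevels(birds_toy$Species)  # 3 levels before subsetting
nlevels(birds_top$Species)  # 2 levels after droplevels
```

With the real data you would set k to 5 and pass the resulting data frame to qplot as before.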
You could tackle this using the excellent plyr package...
# If you don't already have plyr installed, uncomment the next line:
# install.packages('plyr')
require(plyr)
# First, find out how many of each species you have...
ns=ddply(birds,.(Species),summarise,n=length(Species))
# This will produce a table listing the number of each species you have
# (in the column 'n'). Type 'ns' to see the table.
# We can then rank the species occurrence, to see how important the different
# species are
ns$r = rank(-ns$n) # negative because 'rank' starts with the lowest number.
# have a look at the top 5 species:
subset(ns,r<=5)
# There are a couple of ways to proceed from here. Either we could get the
# top 5 species names from this 'ns' table:
# names=as.character(subset(ns,r<=5)$Species)
# and use joran's method, or we could merge the ns table and the original
# dataset (so that each species has an 'n' and 'r' attribute) and subset the
# data by species number or rank. I prefer the latter, as it allows you to
# flexibly change the species number threshold. i.e.:
birds=merge(birds,ns,by='Species')
# We've now added 'n' and 'r' columns to the birds data, so we can select
# our subset based on either of these columns:
birds.by.r=subset(birds,r<=5) # selects only the top 5 bird species
birds.by.n=subset(birds,n>100) # selects all species with over 100 occurrences
# Then just plot away!
qplot(Oil_Cond,data=birds.by.r,facets=Species~.,geom='histogram')
# or
qplot(Oil_Cond,data=birds.by.n,facets=Species~.,geom='histogram')
I'm very new to data wrangling. And now I have this problem at hand:
So basically I have used tables of biochemical measurements (all numerical) of patients to perform cluster analysis, and by doing so I sorted them into 5 clusters.
Then I also have their clinical data/features, now I want to ask if any of these clinical features (a mix of numerical and categorical features) are significantly different from one cluster to another. So how can I go about this? What test shall I perform? Is there a good library I should be looking at?
To give you an idea about the "clinical data":
ClusterAssigned PatientID age sex stage FISH IGHV IgG ...
1 S134567 50 m 4 11q mutated scig
1 S234667 80 m 2 13q mutated 6.5
1 S135677 55 f 4 11q na scig
1 S356576 94 f 2 13q,t12 unmutated 5
1 S187978 59 m 4 11q mutated scig
4 S278967 80 f 2 17q unmutated 6.5
4 S123467 75 f 4 na unmutated 9.1
4 S234577 62 m 2 t12 mutated 9
.....
So you see the cluster assigned is based on my cluster analysis. FISH, IGHV, IgG are categorical, and you can see there are sometimes na values and sometimes one person can have multiple entries, such as "13q,t12".
As a simplification, I could perhaps just take the cluster 1 and cluster 4 patients, omit all the na ones, and ask if there is a difference in their age, sex, FISH, IGHV... Still, what method can I use to perform such a test in one go?
You can convert the categorical variables into dummy variables first and then perform a normal cluster analysis.
Things get more complicated if you have ordered categorical fields.
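One way to build such dummy variables in base R is model.matrix. This is only a sketch on an invented four-row data frame (clin and its values are assumptions for illustration, not the asker's data); note that under R's default na.action, rows containing NA would be dropped.

```r
# Invented example data with two categorical columns and one numeric column
clin <- data.frame(
  sex  = factor(c("m", "m", "f", "f")),
  IGHV = factor(c("mutated", "unmutated", "mutated", "unmutated")),
  age  = c(50, 80, 55, 94)
)

# model.matrix expands each factor into 0/1 indicator columns.
# "- 1" removes the intercept, so the first factor keeps all of its levels;
# later factors still drop their first level as the reference category.
dummies <- model.matrix(~ sex + IGHV + age - 1, data = clin)
colnames(dummies)
```

The resulting numeric matrix can then be passed to whatever clustering routine you are already using.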
I have one data frame/list that gives an ID and a number
1. 25
2. 36
3. 10
4. 18
5. 12
This first list is effectively a list of objects with the number of objects contained in each, e.g. bricks in a wall, so a list of walls with the number of bricks in each.
I have a second list that contains a full list of the objects being referred to in the list above and a second attribute for each.
1. 3
2. 4
3. 2
4. 8
5. 5
etc.
In the weak example I'm stringing together, this would be a list of the weight of each brick in all walls.
So my first list gives me the ranges I would like to average in the second list; as an end result I would like a list of walls with the average brick weight per wall.
ie average the attributes of 1-25, 26-62 ... 89-101
My idea so far was to create a data frame with two columns
1. 1 25
2. 26 62
3. n
4. n
5. 89 101
and then attempt to create a third column that uses the first two as x and y in a mean(table2$column1[x:y]) type formula, but I can't get anything to work.
The end result could probably look something like this
1. 3.2
2. 6.5
3. 3
4. 7.9
5. 8.5
Is there a way to do it like this, or does anyone have a more elegant solution?
You could do something like this... set the low and high limits of your ranges and then use mapply to work out the mean over the appropriate rows of df2.
df1 <- data.frame(id=c(1,2,3,4,5),no=c(25,36,10,18,12))
df2 <- data.frame(obj=1:100,att=sample(1:10,100,replace=TRUE))
df1$low <- cumsum(c(1,df1$no[-nrow(df1)]))
df1$high <- pmin(cumsum(df1$no),nrow(df2))
df1$meanatt <- mapply(function(l,h) mean(df2$att[l:h]), df1$low, df1$high)
df1
id no low high meanatt
1 1 25 1 25 4.760000
2 2 36 26 61 5.527778
3 3 10 62 71 5.800000
4 4 18 72 89 5.111111
5 5 12 90 100 4.454545
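An alternative to computing explicit low/high limits is to build one grouping vector and let tapply do the averaging. The sketch below assumes a fixed seed (set.seed(1), chosen only so the example is reproducible) and checks that both approaches agree; grp, means_by_group and means_by_range are names made up for this illustration.

```r
set.seed(1)  # assumed seed, only for reproducibility of the random attribute
df1 <- data.frame(id = 1:5, no = c(25, 36, 10, 18, 12))
df2 <- data.frame(obj = 1:100, att = sample(1:10, 100, replace = TRUE))

# Repeat each id as many times as its count, then trim to the rows of df2
# (the counts sum to 101, but df2 only has 100 rows)
grp <- rep(df1$id, times = df1$no)[seq_len(nrow(df2))]
means_by_group <- tapply(df2$att, grp, mean)

# The low/high + mapply approach from above, for comparison
low  <- cumsum(c(1, df1$no[-nrow(df1)]))
high <- pmin(cumsum(df1$no), nrow(df2))
means_by_range <- mapply(function(l, h) mean(df2$att[l:h]), low, high)

all.equal(as.numeric(means_by_group), means_by_range)
```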
This is a question of a noob in 'R' world. I tried searching and there were quite a few solutions that came close (e.g aggregate, by, etc), but I lacked the understanding to apply it to my problem. Would really appreciate if someone can guide me in a more detailed way.
Hypothetical Dataset
Name Wheels Color Mileage seat_capacity
1 2 Red 70 2
2 3 Black 60 7
3 4 Blue 12 5
4 4 White 15 6
5 3 Yellow 45 6
6 2 Green 70 2
7 3 Silver 45 6
8 6 Silver 5 4
9 14 Red 12 2
10 2 Black 70 7
11 4 Blue 70 5
12 3 White 60 6
13 4 Yellow 12 6
14 4 Green 15 2
I have initially created subsets of data based on color using split.
color <- split(df,df$Color)
For each of the subsets created I would be doing more operations e.g
finding the vehicles with highest mileage among the vehicles with lowest number of wheels in each subset.....etc
I have written all the rules pertaining to the later half as well. I am struggling to find a way where I can run all the operations on each of the subset in the variable color.
Any help would be appreciated.
The following worked for me and I would sincerely want to thank @Imo and @aosmith for guiding me.
Assume, I would want to first group the df based on colour and then group further by wheels and then within each such subgroup(wheels) pick top 2 vehicles based on Mileage. Used the dplyr library to achieve the same.
library(dplyr)
my_list <- df %>% group_by(Color, Wheels) %>% top_n(2, Mileage)
HTH
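For the original split-based approach, running the same rule on every subset can be done with lapply. This is a sketch in base R on a small invented data frame; best_in_group is a hypothetical helper implementing one of the rules mentioned in the question (highest mileage among the vehicles with the fewest wheels).

```r
# Small invented data frame using the column names from the question
df <- data.frame(
  Name    = 1:6,
  Wheels  = c(2, 3, 4, 2, 4, 3),
  Color   = c("Red", "Black", "Blue", "Red", "Blue", "Black"),
  Mileage = c(70, 60, 12, 65, 70, 45)
)

# Hypothetical rule: among the vehicles with the fewest wheels,
# keep the one(s) with the highest mileage
best_in_group <- function(d) {
  fewest <- d[d$Wheels == min(d$Wheels), ]
  fewest[fewest$Mileage == max(fewest$Mileage), ]
}

# split() gives one data frame per colour; lapply() applies the rule to each,
# and do.call(rbind, ...) stacks the per-colour results back together
by_color <- split(df, df$Color)
result <- do.call(rbind, lapply(by_color, best_in_group))
result
```

Any further rules can be added inside the helper, so each subset is processed by the same function.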
I have asked this question before but haven't found an answer yet. I am trying to create a grouped bar chart in SAS which shows the percentage of patients that received a test by category and, within each bar, shows the location where the tests were received (location). My dataset looks like this:
Category Test Test_location
High Risk 1 Site 1
Intermediate Risk 1 Site 2
Low Risk 0 .
Intermediate Risk 0 .
High Risk 1 Site 3
Where each patient is listed with the risk classification they have been assigned to (variable 'Category'), an indicator variable that shows whether or not they received a test (variable 'test' where '1'=received test and '0'=did not receive test) and, if they received a test, where that test took place (variable 'test_location').
I want to create a bar graph with the categories on the x axis and the y axis showing the percentage of patients who got a test (test=1), and then each bar shaded to show the composition of patients who got a test in each category by location (i.e. how many tests occurred in Site 1, 2 and 3).
I have the below code, but it is not giving me the percentages that I want. It gives me a pct_col output of test*category, and I want pct_row. In other words, I want the y axis to measure the percentage of patients with testing out of the total number of patients in each category, not out of all patients who received testing in any category, which is what it is giving me.
Example of what I want: In the dummy dataset below, for high risk patients, for example, I want a bar that shows 75% (12 patients with tests out of the total 16 high risk patients) received tests, and then have the bar shaded to show 41.67% of those tests were at Site 1, 33.33% at Site 2 and 25% at Site 3. And so on for the intermediate and low risk categories. If there is a way to label the subsections with the exact percentages, that would be great too.
Dummy data set:
data test;
infile datalines missover;
input ID Category $ Test Test_location $;
datalines;
1 High 1 Site_1
2 High 1 Site_1
3 High 1 Site_1
4 High 1 Site_1
5 High 1 Site_1
6 High 1 Site_2
7 High 1 Site_2
8 High 1 Site_2
9 High 1 Site_2
10 High 1 Site_3
11 High 1 Site_3
12 High 1 Site_3
13 High 0
14 High 0
15 High 0
16 High 0
17 Intermediate 1 Site_1
18 Intermediate 1 Site_1
19 Intermediate 1 Site_2
20 Intermediate 0
21 Intermediate 0
22 Intermediate 0
23 Intermediate 0
24 Intermediate 0
25 Intermediate 0
26 Low 1 Site_1
27 Low 1 Site_1
28 Low 1 Site_1
29 Low 1 Site_2
30 Low 1 Site_2
31 Low 1 Site_2
32 Low 1 Site_3
33 Low 0
34 Low 0
35 Low 0
36 Low 0
37 Low 0
38 Low 0
;
Thank you!
EDIT:
Here is a sample graph of what I am looking to output in SAS (using the dummy data above):
Using this code:
proc sgplot data=test pctlevel=graph;
vbar category / response=test stat=percent
group=test_location groupdisplay=stack datalabel;
keylegend /title="Testing Location" position=bottom;
quit;
I get this output:
So what I have is not giving me the correct denominators for my percentages. I also couldn't figure out a way to label the individual subsections of the graph like I have in my sample figure.
Thank you!
You can get exactly what you want by using a bit of data step and some formatting. It would be a little bit different from your working code. As others have pointed out, there are many useful examples at Robert Allison's site.
I'd go with the simple solution below, which is almost exactly what you asked for, and very close to your working code. The main difference is that the missing values are their own category.
The key lines are:
Use pctlevel=group
Use missing
Here is the code:
proc sgplot data = test
pctlevel = group
;
vbar category /
stat = percent
group = test_location
grouporder = data
missing
seglabel
;
keylegend /
title = "Testing Location"
position = bottom
;
run;
I get:
What I am trying to do is merge my dataframe by rows. For instance, let's say my data.frame is called data and it looks like this: I have 4 columns. Subject contains 5s and 6s, Phase contains Post-Lure and Pre-Lure, Type contains Visual and Auditory, and Memory contains a list of scores. Ex:
Subject Phase Type Memory
1 5 Post-Lure Visual 0.80000000
2 5 Post-Lure Auditory 0.70666667
3 5 Pre-Lure Visual 0.40000000
4 5 Pre-Lure Auditory 0.61333333
5 6 Post-Lure Visual 0.80000000
6 6 Post-Lure Auditory 0.54666667
As you can see from the data above, the subject is repeated (subject 5 is the same person but the phase and/or type are now different). Thus, I am looking for code that will put all of the data for each subject on the same row. Hence, the memory scores, and the different types and phases each subject was exposed to, will just become additional columns on the same row. I feel aggregate may do the trick, but is it possible to use it without applying a function to each of the numbers? Any help would be greatly appreciated. Thank you.
As mentioned in the comment, you need to add an "indicator" variable of some sort (for example, how many "times" there are for each subject).
That can be done with ave and seq_along:
mydf$time <- with(mydf, ave(Subject, Subject, FUN=seq_along))
Next, you can use reshape() to go from "long" to "wide".
reshape(mydf, direction = "wide",
idvar="Subject", timevar="time")
# Subject Phase.1 Type.1 Memory.1 Phase.2 Type.2 Memory.2
# 1 5 Post-Lure Visual 0.8 Post-Lure Auditory 0.7066667
# 5 6 Post-Lure Visual 0.8 Post-Lure Auditory 0.5466667
# Phase.3 Type.3 Memory.3 Phase.4 Type.4 Memory.4
# 1 Pre-Lure Visual 0.4 Pre-Lure Auditory 0.6133333
# 5 <NA> <NA> NA <NA> <NA> NA
If you wanted to use the "reshape2" or "tidyr" packages, you would first have to get the data into a "long" form using melt or gather, but note that in the process, your variable types would be converted since a single column would be containing several data types.
Do you just want to reshape your data? The question isn't clear. Let's call your dataframe df. Then
library(reshape2)
dcast(df, Subject ~ Phase + Type, value.var = "Memory")
will produce
Subject Post-Lure_Auditory Post-Lure_Visual Pre-Lure_Auditory Pre-Lure_Visual
1 5 0.7066667 0.8 0.6133333 0.4
2 6 0.5466667 0.8 NA NA
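If you would rather stay in base R, a similar one-column-per-condition layout can be sketched with reshape() by first pasting Phase and Type into a single key. The column name cond below is made up for this illustration, and the data frame is rebuilt from the rows shown in the question.

```r
# Rebuild the example rows from the question
df <- data.frame(
  Subject = c(5, 5, 5, 5, 6, 6),
  Phase   = c("Post-Lure", "Post-Lure", "Pre-Lure", "Pre-Lure",
              "Post-Lure", "Post-Lure"),
  Type    = c("Visual", "Auditory", "Visual", "Auditory",
              "Visual", "Auditory"),
  Memory  = c(0.80, 0.70666667, 0.40, 0.61333333, 0.80, 0.54666667)
)

# Combine Phase and Type into one condition label (hypothetical column `cond`)
df$cond <- paste(df$Phase, df$Type, sep = "_")

# One row per Subject, one Memory column per condition;
# conditions a subject never saw are filled with NA
wide <- reshape(df[, c("Subject", "cond", "Memory")],
                direction = "wide", idvar = "Subject", timevar = "cond")
names(wide)
```

The columns come out in order of first appearance rather than alphabetically, and each is prefixed with "Memory.".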