Using a column that contains a frequency/weight/count in R [closed]

This is an easy question to ask, but a hard one to search for. Frequency is used all over the place. I tried a synonym (weight), but since mtcars is so widely used, I get a lot of false positives as well. Same thing for counts.
I'm looking at datasets::HairEyeColor, partly reproduced here:
Hair Eye Sex Freq
1 Black Brown Male 32
2 Brown Brown Male 53
3 Red Brown Male 10
4 Blond Brown Male 3
5 Black Blue Male 11
6 Brown Blue Male 50
7 Red Blue Male 10
8 Blond Blue Male 30
9 Black Hazel Male 10
10 Brown Hazel Male 25
.
.
.
I came across this when trying to show someone how to make a mosaic plot of any two of Hair, Eye, and Sex. On first read, I didn't see a way to say "this column means the row represents 32 of the set members", but I didn't read too carefully.
I suppose I could reshape the data using melt() and reshape() every time I receive data with a frequency column, but that seems kind of drastic.
In other languages I know, I could add a parameter to the fitting function to let it know “there’s not just one row with this set of levels, there are n of them.” So if I wanted to see a distribution, I might say
DISTR(Y=Hair, FREQ=freq)
...which would generate a histogram or density plot with n values per row
Alternately,
lm(hair ~ eye + sex, data = ‘HairEyeColor’, freq = ‘freq’)
would fit a linear model with 32 replications of the first row rather than 1.
I’m asking about a way to use the 32 in the first row (for example) to tell the modeling or graphing function that there are 32 cases with this combination of levels, 53 with the combination in the second row, etc.
Surely this kind of data shows up a lot. I see it all the time, but there’s usually a way to say that this number specifies the frequency that this row represents in the actual data. Rather than a data table with 32 rows of Black, Brown, Male, there’s one row with frequency 32.
(No plyr please.)

No, there is not a standard way to use this type of data across all of R.
Many of the basic modeling functions, e.g., lm, glm, nls, loess, and more from the stats package accept a weights argument that will meet your needs. prop.test accepts data in either format. But many other modeling functions do not, e.g., knn, princomp, and many others not in base R.
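As a minimal sketch of the weights approach, the example below fits lm once with the counts passed as weights and once on data expanded to one row per case. The data frame agg and its columns x, y, and n are made up for illustration; they are not from HairEyeColor.

```r
# Illustrative data: one row per group, with a count column `n` (invented values).
agg <- data.frame(
  x = c(1, 2, 3, 4),
  y = c(2.1, 3.9, 6.2, 7.8),
  n = c(32, 53, 10, 3)
)

# Option 1: pass the counts through the `weights` argument.
fit_weighted <- lm(y ~ x, data = agg, weights = n)

# Option 2: expand to one row per case by repeating each row n times.
expanded <- agg[rep(seq_len(nrow(agg)), times = agg$n), c("x", "y")]
fit_expanded <- lm(y ~ x, data = expanded)

# The coefficient estimates agree between the two fits.
all.equal(coef(fit_weighted), coef(fit_expanded))
```

The coefficients match, but the standard errors generally do not: lm treats weights as precision weights, so the weighted fit has residual degrees of freedom based on 4 rows rather than on 98 cases. If the inference matters, expanding the rows may be the safer route.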
barplot accepts input in either format. mosaicplot expects input as an aggregated contingency table. Other types of plots would require more custom handling, because there are a lot of different things you could do with frequency.
Of course, anything not in base R is up to whoever writes it.
ggplot2 (which is not base R) generally handles this really well, e.g., geom_bar will stack up values by default, or in the case of scatterplots you could map size or color or alpha to visually convey the intensity.
randomForest and xgboost do not accept weights
I will say that I very rarely find this to be a problem. I'd encourage you to ask specific questions about methods where it is causing you issues. I think mosaicplot is a bad example as it expects a contingency table, so the problem would be the opposite: using it with disaggregated data would require first aggregating it up to a frequency table.
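For the mosaic plot in the question, one possible route is to turn the frequency column back into a contingency table with xtabs and pass that to mosaicplot. This is a sketch, assuming the data frame has the Hair, Eye, Sex, and Freq columns shown in the question; the names hec and hair_eye are just for illustration.

```r
# HairEyeColor ships as a 3-way table; as.data.frame() gives the
# Hair / Eye / Sex / Freq layout shown in the question.
hec <- as.data.frame(datasets::HairEyeColor)

# Sum Freq over Sex to get a Hair x Eye contingency table.
hair_eye <- xtabs(Freq ~ Hair + Eye, data = hec)

# mosaicplot accepts the aggregated table directly.
mosaicplot(hair_eye, main = "Hair vs. eye colour", color = TRUE)
```

Swapping the right-hand side of the formula (for example Freq ~ Hair + Sex) gives the other two-way tables.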

Related

Why is R adding empty factors to my data?

I have a simple data set in R: 2 conditions, stored in a variable called "COND", and within those conditions adults chose one of 2 pictures, which we call house or car. This variable is called "SAW".
I have 69 people, and 69 rows of data.
For some reason, R is adding an empty factor level to both. How do I get rid of it?
When I call table() to see how many are in each, this is the output:
table(MazeData$SAW)
      car house
    2   9    59
table(MazeData$COND)
      Apples No_Apples
    2     35        33
Where the heck are these 2 mystery rows coming from? It won't let me make my simple box plots and bar plots or run t.test because of this error. Can someone help? Thanks!

Scatterplot for comparing species abundance

I have a homework question that states the following:
The file “channel_islands_counts_edit.csv” contains survey data on temperate rocky reef fishes from the Channel Islands, collected at many sites over many years. The data has columns for Year, Date, Site, count, and SpeciesName (broken into adults and juveniles). The version of the data that I’ve given you looks at 16 sites over 27 years, with count data for 27 categories of fish. Imagine we’re interested in whether the abundance of different species are correlated across sites (to get a sense for whether species have similar habitat preferences and/or interact with each other), and whether the across-site correlations are consistent over time. To visualize this, make some code that does the following:
For each year, draw a scatterplot that compares the abundance of Hypsypops rubicundus (adults) and the abundance of Paralabrax clathratus (adults) across sites. Feel free to transform the data for plotting purposes, if you think that helps you see any patterns.
I imported my data set, and ran the following code which is giving me 27 plots, with Site as x and Count as y, but there is no data shown in the plots.
head(channel_islands)
sapply(channel_islands, class)
levels(channel_islands$SpeciesName)
par(mfrow= c(6,5)) # set the plotting area into a 6 row*5 column array
for (i in 1:27) {
HR11<-subset(channel_islands,SpeciesName=="Hypsypops rubicundus,adult"[i] & Site==11)
PC15<-subset(channel_islands,SpeciesName=="Paralabrax clathratus,adult"[i] & Site==15)
with(HR11,plot(count~Site,type='b',pch=19,ylim=c(0,10),xlim=c(0,16),col='green',main=i))
with(PC15,plot(count~Site,type='b',pch=19,ylim=c(0,10),xlim=c(0,16),col='blue',main=i))
}
If anyone could help me figure out how to compare species abundance across sites, over 27 years, I would really appreciate it.
The code "Hypsypops rubicundus,adult"[i] doesn't really make sense. Technically, it should work for when i == 1 but beyond that it would just return NA. I'm assuming SpeciesName == NA will never be true so you will get an empty subset.
Consider looking into using ggplot2 with facet_grid to quickly make multiple plots without the loop. The R Graphics Cookbook has good examples on using facets.
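The indexing problem can be seen directly, and a base R loop over years avoids it. This is a sketch on invented data: the data frame toy, its Year column values, and the counts are assumptions standing in for the real file, and it assumes one row per site, year, and species (aggregate first if there are several survey dates).

```r
# Indexing a length-one character vector past position 1 yields NA.
species <- "Hypsypops rubicundus,adult"
species[1]  # the string itself
species[2]  # NA

# Toy stand-in for the survey data (values are invented for illustration).
set.seed(1)
toy <- expand.grid(
  Year = 2001:2003,
  Site = 1:4,
  SpeciesName = c("Hypsypops rubicundus,adult", "Paralabrax clathratus,adult"),
  stringsAsFactors = FALSE
)
toy$count <- rpois(nrow(toy), lambda = 5)

# Loop over the years, not over positions in the species name.
years <- sort(unique(toy$Year))
par(mfrow = c(1, length(years)))
for (yr in years) {
  hr <- subset(toy, Year == yr & SpeciesName == "Hypsypops rubicundus,adult")
  pc <- subset(toy, Year == yr & SpeciesName == "Paralabrax clathratus,adult")
  # Pair the two species by Site so each point is one site in one year.
  both <- merge(hr, pc, by = "Site", suffixes = c("_hr", "_pc"))
  plot(both$count_hr, both$count_pc, pch = 19, main = yr,
       xlab = "H. rubicundus (adults)", ylab = "P. clathratus (adults)")
}
```

Each panel then compares the two species across sites for one year, which is what the assignment asks for, rather than plotting count against Site.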

How to use image() function to plot the data in R

I have a clinical dataset and I would like to plot it using image() function to see if I can spot out the different groups within my data.
The structure of this data is a List of 2: 56 samples and 5000 gene expressions.
When I use image(lung), all I see is a plot of orange color, and I do not see any pattern or group standing out to me.
Basically, there are four types of clinical conditions in the dataset: Colon cancer (13 samples), smallcell (6 samples), etc.
I wanted to see whether, for instance, "smallcell" with 6 samples has its own pattern compared to the rest of the groups/conditions within this dataset.
load(url("https://github.com/hughng92/dataset/raw/master/lung.RData"))
rownames(lung)
image(lung)
This is all I see:
I am wondering whether it would look different if I combined four separate plots, one for each of the 4 conditions in the data set.
Any tip would be great!
I'd suggest looking at the image output after rearranging the like types together. I think I now see some group differences in those gene expression profiles. Specifically, the "Normal" category has generally fewer red bands, although there are a couple where "Normal" is red and the others are not. I think it is interesting, and not particularly surprising, that there appears to be less variability within the Normal columns (in the image) than there is within each of the tumor types. I have a friend who's a molecular biologist who characterizes tumors as "genetic train wrecks":
table( rownames( lung[order(rownames(lung)), ]))
Carcinoid Colon Normal SmallCell
20 13 17 6
------------------
image( lung[order(rownames(lung)), ])
This would give a better indication of the boundaries of the type grouping:
image( lung[order(rownames(lung)), ], xaxt="n")
axis(1, at=(cumsum( table( rownames( lung[order(rownames(lung)), ])))-1)/56 ,
labels=names(table( rownames( lung[order(rownames(lung)), ]))),las=2)

R Question: How can I create a histogram with 2 variables against each other?

Okay, let me be as clear as I can in my problem. I'm new to R, so your patience is appreciated.
I want to create a histogram using two different vectors. The first vector contains a list of models (products). These models are listed as either integers, strings, or NA. I'm not exactly sure how R is storing them (I assume they're kept as strings), or if that is a relevant issue. I also have a vector containing a list of incidents pertaining to that model. So for example, one row in the dataframe might be:
Model Incidents
XXX1991 7
How can I create a histogram where the number of incidents for each model is shown? So the histogram will look like
| =
| =
Frequency of | =
Incidents | = =
| = = =
| = = = = =
- - - - - -
Each different Model
Just to give a general idea.
I also need to be able to map everything out with standard deviation lines, so that it's easy to see which models are the least reliable. But that's not the main question here. I just don't want to do anything that will make me unable to use standard deviation in the future.
So far, all I really understand is how to make a histogram with the frequency marked, but for some reason, the x-axis is marked with numbers, not the models' names.
I don't really care if I have to download new packages to make this work, but I suspect that this already exists in basic R or ggplot2 and I'm just too dumb to figure it out.
Feel free to ask clarifying questions. Thanks.
EDIT: I forgot to mention, there are multiple rows of incidents listed under each model. So to add to my example earlier:
Model Incidents
XXX1991 7
XXX1991 1
XXX1991 19
3
5
XXX1002 9
XXX1002 4
etc . . .
I want to add up all the incidents for a model under one label.
I am assuming that you did not mean to leave the model blank in your example, so I filled in some values.
You can add up the number of incidents by model using aggregate then make the relevant plot using barplot.
## Example Data
data = read.table(text="Model Incidents
XXX1991 7
XXX1991 1
XXX1991 19
XXX1992 3
XXX1992 5
XXX1002 9
XXX1002 4",
header=TRUE)
TAB = aggregate(data$Incidents, list(data$Model), sum)
TAB
Group.1 x
1 XXX1002 13
2 XXX1991 27
3 XXX1992 8
barplot(TAB$x, names.arg=TAB$Group.1 )

R: iterating through unique values of a vector in for loop

I'm new to R and I am having some trouble iterating through the unique element of a vector. I have a dataframe "School" with 700 different teachers. Each teacher has around 40 students.
I want to be able to loop through each teacher, create a graph of the mean score of his/her students over time, save the graphs in a folder, and automatically email that folder to that teacher.
I'm just getting started and am having trouble setting up the for-loop. In Stata, I know how to loop through each unique element in a list, but am having trouble doing that in R. Any help would be appreciated.
School$Teacher School$Student School$ScoreNovember School$ScoreDec School$TeacherEmail
A 1 35 45 A#school.org
A 2 43 65 A#school.org
B 1 66 54 B#school.org
A 3 97 99 A#school.org
C 1 23 45 C#school.org
Your question seems a bit vague and it looks like you want us to write your whole project. Could you share what you have done so far and where exactly you are struggling?
see ?subset
School=data.frame(Teacher=c("A","B"), ScoreNovember=10:11, ScoreDec=13:14)
for (teacher in unique(School$Teacher)) {
  teacher_df = subset(School, Teacher == teacher)
  MeanScoreNovember = mean(teacher_df$ScoreNovember)
  MeanScoreDec = mean(teacher_df$ScoreDec)
  # do your plot
  # send your email
}
I think you have 3 questions, which will need separate questions, how do I:
Create graphs
Automatically email output
Compute a subset mean based on group
For the 3rd one, I like using the plyr package; other people will recommend the data.table or dplyr packages. You can also use aggregate from base. To get a teacher's mean:
library(plyr)
ddply(School,.(Teacher),summarise,Nov_m=mean(ScoreNovember))
If you want it per student per teacher, etc., just add the extra column to the grouping columns inside .(), like:
library(plyr)
ddply(School,.(Teacher,Student),summarise,Nov_m=mean(ScoreNovember))
You could do that for each score column (and then chart it). If your data were long rather than wide, you could also add the date ('November', 'Dec') as a group in the brackets. Otherwise, with the wide data:
library(plyr)
ddply(School,.(Teacher,Student),summarise,Nov_m=mean(ScoreNovember),Dec_m=mean(ScoreDec))
See if that helps with the 3rd, but look at splitting your questions up too.
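For the base R route mentioned above, a minimal sketch with aggregate could look like this. The School data frame here is the sample rows from the question typed in by hand, and teacher_means is just an illustrative name.

```r
# The sample rows from the question, typed in by hand.
School <- data.frame(
  Teacher       = c("A", "A", "B", "A", "C"),
  Student       = c(1, 2, 1, 3, 1),
  ScoreNovember = c(35, 43, 66, 97, 23),
  ScoreDec      = c(45, 65, 54, 99, 45)
)

# Mean of each score column per teacher, using only base R.
teacher_means <- aggregate(cbind(ScoreNovember, ScoreDec) ~ Teacher,
                           data = School, FUN = mean)
teacher_means
```

This returns one row per teacher with both monthly means, so no extra package is needed for the grouping step.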
