R: Split data in ggplot based on another factor

I am a beginner with R, so I don't have much experience. I ran into a problem when trying to split my scatterplot into groups based on infection status. My dataset consists of log-transformed antibody levels (logapfhap2 in this example). Infection status (any_Pf_inf) is coded as Yes or No and indicates whether someone was infected during the follow-up period. I am plotting time points (x) against antibody levels (y). For time points 1 and 14, I would like to make two groups based on infection status.
This is the main part of the code I use to plot the data without splitting in groups:
ggplot() +
geom_jitter(data=data2, aes(x='1', y=logapfhap2, colour='PfHAP2A')) +
geom_jitter(data=data2,aes(x='14', y=logbpfhap2, colour='PfHAP2B')) +
geom_jitter(data=TRC, aes(x='C', y=PfHAP2, colour='PfHAP2C'))
which results in this graph:
Then I tried to split it (I only show the first time point here) which returns an error.
ggplot() +
geom_jitter(data=data2[data2$any_Pf_inf=='Yes'],
aes(x='1inf', y=logapfhap2[data2$any_Pf_inf=='Yes'],
colour='PfHAP2A')) +
geom_jitter(data=data2[data2$any_Pf_inf=='No'],
aes(x='1un', y=logapfhap2[data2$any_Pf_inf=='No'],
colour='PfHAP2B'))
I wanted to create this graph but I get this error:
Error: Length of logical index vector must be 1 or 55, got: 482
Hope this is clear! Could anyone help me with this problem? Thanks!
EDIT
Not sure if this makes it clearer, but this is what my data looks like:

I just tried some other things and I have solved it now!
ggplot()+
geom_jitter(data=data2[data2$any_Pf_inf=='Yes',],
aes(x='1inf', y=logapfhap2,
colour='PfHAP2A')) +
geom_jitter(data=data2[data2$any_Pf_inf=='No',],
aes(x='1un', y=logbpfhap2,
colour='PfHAP2B'))
Apparently you have to add a comma after the condition, as in data2[data2$any_Pf_inf=='Yes',], so that it selects rows instead of columns.
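The comma matters because a data frame indexed with a single argument is treated like a list of columns. Here is a minimal sketch on made-up data (the `toy` data frame and its values are hypothetical; only the column names come from the question):

```r
# Hypothetical toy data; only the column names mirror the question.
toy <- data.frame(
  any_Pf_inf = c("Yes", "No", "Yes", "No"),
  logapfhap2 = c(1.2, 0.8, 1.5, 0.6)
)

keep <- toy$any_Pf_inf == "Yes"   # logical vector, one entry per row

# With the comma: `keep` is the row index and all columns are returned.
infected <- toy[keep, ]

# Without the comma: `keep` is used as a column index. It has 4 entries
# but there are only 2 columns, so the call fails.
no_comma <- tryCatch(toy[keep], error = function(e) e)

print(infected)
print(inherits(no_comma, "error"))
```

The exact error message depends on whether the object is a plain data frame or a tibble, which is why the sketch only checks that an error occurred.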

Related

R - Making a ggplot while using survey package

I am stuck with a real problem.
My dataset comes from a survey, and to make it usable for statistics about the whole French population, I must apply weights to it.
For this purpose, I used the survey package, but its syntax is not really easy to use.
Is there a way to use ggplot with weights?
To explain it a bit better, here is my dataset:
head(df)
Id Weight Var1
1 30 0
2 12.4 0
3 68.2 1
So my individual 1 accounts for 30 people in the French population.
I create a df_weighted dataset using the survey package.
How can I use ggplot now? df_weighted is a list!
I did something like this to try to get around the list problem, but it did not work at all...
df_weighted_ggplot$var1 <- svytable(~var1, df_weighted)
df_weighted_ggplot$var_fill <- svytable(~var_fill, df_weighted)
ggplot(df_weighted_ggplot, aes(fill = var_fill , x =var1)) + geom_bar(position = "fill")
I received this predictable error:
Error: `data` must be a data frame, or other object coercible by `fortify()`, not a list
Do you know any other package which should help me? But I read many forums and it seems to be the most helpful...
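One possible route, sketched here as an assumption rather than an answer from the thread: compute the weighted counts as a table, convert that table to a plain data frame, and hand the data frame to ggplot. The sketch uses base R's `xtabs()` on the three rows shown above instead of a survey design object; with the survey package, `as.data.frame(svytable(...))` should give a data frame of the same shape. The name `n_weighted` is made up for illustration.

```r
# The three rows shown in the question.
df <- data.frame(
  Id     = c(1, 2, 3),
  Weight = c(30, 12.4, 68.2),
  Var1   = c(0, 0, 1)
)

# Weighted counts: sum of Weight within each level of Var1.
weighted_tab <- xtabs(Weight ~ Var1, data = df)

# A table is not a data frame, but it converts to one cleanly.
plot_df <- as.data.frame(weighted_tab, responseName = "n_weighted")
print(plot_df)

# plot_df is an ordinary data frame, so ggplot accepts it, e.g.:
# ggplot(plot_df, aes(x = Var1, y = n_weighted)) + geom_col()
```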

Simple barplot displaying voting of a county

I'm fairly new to R and making plots, so sorry about that. I have a dataset of the voting for counties and I want to make a barplot showing how many mandates each county voted for.
What I've done so far is to extract one row, which includes the name of the county and the number of mandates it voted for the different parties (which are headers).
Fylker AP FRP H KrF SP
Ostlandet 3 2 2 0 1
Sorry for the bad display of code, whenever I paste the code, it looks really weird, despite indenting.
The data is called "Ostlandet" and is only 1 row. So as I tried to explain above, I want to make some sort of barplot out of this. The idea is to have the different parties on the x-axis and number of votes on y. I've tried this so far
ggplot(Ostfold, aes(x = Ostfold[1,])) +
geom_histogram(binwidth = 20)
Which just gave me tons of errors.
I've also tried using barplot, but I just can't seem to figure this out.
Sorry, this is probably super easy, but I'm just getting into coding.
You have a few issues. First, there's no need for extracting rows. Second, the data are in "wide" format (mandates in columns) instead of "long" format (a column named "mandate" with values). Third, you want to plot counts, so geom_col() is better than geom_histogram().
The gather() function from the tidyr package will get your data from wide into long:
library(tidyr)
library(ggplot2)
Ostfold %>%
gather(Mandate, Votes, -Fylker)
That should generate something like this:
Fylker Mandate Votes
1 Ostlandet AP 3
2 Ostlandet FRP 2
3 Ostlandet H 2
4 Ostlandet KrF 0
5 Ostlandet SP 1
You can pass that to ggplot:
Ostfold %>%
gather(Mandate, Votes, -Fylker) %>%
ggplot(aes(Mandate, Votes)) + geom_col()
Result for your one row:
For a dataset with multiple counties, you might want to add + facet_wrap(~Fylker) to facet the plot by county, depending on how many there are.
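In current versions of tidyr, `gather()` is superseded by `pivot_longer()`. This is a sketch of the same reshaping with the newer function, assuming the one-row data frame from the question (rebuilt here by hand):

```r
library(tidyr)
library(ggplot2)

# The single row from the question, rebuilt by hand.
Ostfold <- data.frame(
  Fylker = "Ostlandet",
  AP = 3, FRP = 2, H = 2, KrF = 0, SP = 1
)

# Wide to long: every column except Fylker becomes a (Mandate, Votes) pair.
long <- pivot_longer(
  Ostfold,
  cols      = -Fylker,
  names_to  = "Mandate",
  values_to = "Votes"
)

# Same plot as above, built from the long data.
p <- ggplot(long, aes(Mandate, Votes)) + geom_col()
print(long)
```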

Issue with Boxplot formula or variable definition

I have a csv file with 4 columns labeled AGE, DIASTOLIC, BMI and EVER.PREGNANT, and 700 rows. The last column contains only yes or no. I wish to plot BMI against EVER.PREGNANT in order to compare the BMI of those with yes in the fourth column to those with no. What code should I write to get the required boxplot?
I have tried the following code:
Sheet=read.csv(/Downloads/1739230_1284354330_PIMA.csv - 1739230_1284354330_PIMA.csv.csv, sep=",")
boxplot(BMI~EVER.PREGNANT,data=sheet, main="BMI vs PREG",xlab="BMI",ylab="PREGNANT")
The error that I get is
Error in eval(expr,envr,enclos): object 'Sheet' not found
Similarly, what modifications can be done to plot AGE vs DIASTOLIC, where both columns are numbers? Will I get the 700 odd values nicely?
I'm answering here because it tells me not to extend the discussion :-).
I think you haven't loaded your data set correctly. The file path has to be a quoted string. You can also add header = T when loading to tell the program that your first row holds the names of the variables (for read.csv this is already the default).
Sheet=read.csv("/Downloads/1739230_1284354330_PIMA.csv", sep=",", header = T)
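Once the data is loaded, both plots from the question can be sketched as follows. The data frame below is a hypothetical stand-in for the CSV (the values are made up; only the column names come from the question). Note that R is case-sensitive, so `Sheet` and `sheet` are different objects.

```r
# Hypothetical stand-in for the CSV; the values are made up.
Sheet <- data.frame(
  AGE           = c(25, 31, 47, 52, 38, 29),
  DIASTOLIC     = c(70, 76, 84, 90, 80, 72),
  BMI           = c(22.1, 27.4, 30.2, 33.5, 26.0, 24.3),
  EVER.PREGNANT = c("no", "yes", "yes", "yes", "no", "no")
)

# Grouping variable on the x-axis, BMI on the y-axis.
# The object is `Sheet`, so `data = sheet` would fail.
bp <- boxplot(BMI ~ EVER.PREGNANT, data = Sheet,
              main = "BMI by pregnancy status",
              xlab = "EVER.PREGNANT", ylab = "BMI")

# Two numeric columns: a scatterplot shows every row as one point.
plot(DIASTOLIC ~ AGE, data = Sheet,
     main = "Diastolic pressure vs age")
```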

Questions associated with "Error: Aesthetics must be either length 1 or the same as the data"

I understand the subject "Error: Aesthetics must be either length 1 or the same as the data" has been done a lot (plenty of reading available online), however, I still have some unresolved questions
I am working with a dataset regarding all calls made to the Seattle Police Department in 2015. After I am done cleaning the data into an acceptable format I wind up with a dataset that is 62,092 rows and 13 columns (dataset name is SPD_2015). I would add a portion of the dataset to this question but I'm not entirely sure how to do it in a clean and legible format.
I used package lubridate to extract the times associated with my data set. I then created a bar graph that showed what time the crimes occur
ggplot(SPD_2015, aes(hour(date.reported.time))) +
geom_bar(width = 0.7)
and that works perfectly.
Since Car Prowls were the most frequently reported crime, I wanted to graph what time these car prowls occurred. And this is when I came across the error "Error: Aesthetics must be either length 1 or the same as the data".
I read that ggplot2 does not like it when you subset within the ggplot code, so I subsetted my data by creating a separate data frame.
car.prowl <- filter(SPD_2015, summarized.offense.description == "CAR PROWL")
So here is my question: why is it that when I look at the dimensions of my newly created dataset "car.prowl", I see 11,539 rows and 13 columns, but when I examine the length of the hours in the occured.time column (the time that the crime occurred), I get a length of 62,092, which is the length of the original dataset?
In my mind I am picturing that the following code would work:
ggplot(car.prowl, aes(hour(occured.time))) +
geom_bar()
The length of the car.prowl$occured.time is correct:
> length(car.prowl$occured.time)
[1] 11539
but when I apply the hour function I get the length of the original dataset:
> length(hour(car.prowl$occured.time))
[1] 62092
when it should be 11,539.
Thank you. Please let me know what I can do to make my question more clear.
It could be a caching issue as Jeremy said above. I'm not sure this would work, but you could try the below, chaining things together.
SPD_2015%>%
filter(summarized.offense.description == "CAR PROWL")%>%
ggplot(aes(hour(occured.time)))+
geom_bar()
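To check in a fresh session whether the subset and the derived hours agree in length, a small self-contained test can help. This sketch uses hypothetical toy data and base R's `as.POSIXlt()` in place of `lubridate::hour()`, so it needs no packages:

```r
# Hypothetical toy data; the column names follow the question.
SPD_2015 <- data.frame(
  summarized.offense.description = c("CAR PROWL", "ASSAULT", "CAR PROWL"),
  occured.time = as.POSIXct(
    c("2015-01-01 08:15:00", "2015-01-01 23:40:00", "2015-01-02 08:05:00"),
    tz = "UTC"
  )
)

# Keep only the car prowl rows (note the comma: rows, all columns).
car.prowl <- SPD_2015[SPD_2015$summarized.offense.description == "CAR PROWL", ]

# Base-R way to get the hour of each POSIXct value.
hours <- as.POSIXlt(car.prowl$occured.time)$hour

# The derived vector should have one entry per row of the subset.
stopifnot(length(hours) == nrow(car.prowl))
print(hours)
```

If the lengths match here but not in the original session, that points toward stale objects in the workspace rather than a problem with the subsetting itself.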

R storing different columns in different vectors to compute conditional probabilities

I am completely new to R. I tried reading the reference and a couple of good introductions, but I am still quite confused.
I am hoping to do the following:
I have produced a .txt file that looks like the following:
area,energy
1.41155882174e-05,1.0914586287e-11
1.46893363946e-05,5.25011714434e-11
1.39244046855e-05,1.57904991488e-10
1.64155121046e-05,9.0815757601e-12
1.85202830392e-05,8.3207522281e-11
1.5256036289e-05,4.24756620609e-10
1.82107587343e-05,0.0
I have the following command to read the file in R:
tbl <- read.csv("foo.txt",header=TRUE)
producing:
> tbl
area energy
1 1.411559e-05 1.091459e-11
2 1.468934e-05 5.250117e-11
3 1.392440e-05 1.579050e-10
4 1.641551e-05 9.081576e-12
5 1.852028e-05 8.320752e-11
6 1.525604e-05 4.247566e-10
7 1.821076e-05 0.000000e+00
Now I want to store each column in two different vectors, respectively area and energy.
I tried:
area <- c(tbl$first)
energy <- c(tbl$second)
but it does not seem to work.
I need two different vectors (which must include only the numerical data of each column) in order to compute:
> prob(energy, given = area), i.e. the conditional probability P(energy|area).
And then plot it. Can you help me please?
As @Ananda Mahto alluded to, the problem is in the way you are referring to columns.
To 'get' a column of a data frame in R, you have several options:
DataFrameName$ColumnName
DataFrameName[,ColumnNumber]
DataFrameName[["ColumnName"]]
So to get area, you would do:
tbl$area #or
tbl[,1] #or
tbl[["area"]]
With the first option generally being preferred (from what I've seen).
Incidentally, for your 'end goal', you don't need to do any of this:
with(tbl, prob(energy, given = area))
does the trick.
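The three ways of extracting a column can be compared directly on the data from the question. This sketch reads the file contents from an inline string instead of foo.txt so that it is self-contained:

```r
# The file contents from the question, read from an inline string.
tbl <- read.csv(text = "area,energy
1.41155882174e-05,1.0914586287e-11
1.46893363946e-05,5.25011714434e-11
1.39244046855e-05,1.57904991488e-10
1.64155121046e-05,9.0815757601e-12
1.85202830392e-05,8.3207522281e-11
1.5256036289e-05,4.24756620609e-10
1.82107587343e-05,0.0
", header = TRUE)

# Columns are addressed by their header names, not by "first"/"second".
area   <- tbl$area
energy <- tbl$energy

# All three forms return the same numeric vector.
same <- identical(area, tbl[, 1]) && identical(area, tbl[["area"]])

# A name that does not exist silently returns NULL, which is why
# tbl$first "does not seem to work".
missing_col <- tbl$first

print(same)
print(is.null(missing_col))
```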
