R - Making a ggplot while using survey package - r

I am stuck with a real problem.
My dataset comes from a survey and to make it usable to find statistics about the whole French population, I must weight it with weights.
For this purpose, I used the survey package, but the syntax is not really easy to use with R.
Is there a way to use ggplot while having weights?
To explain it a bit better, here is my dataset:
head(df)
Id Weight Var1
1 30 0
2 12.4 0
3 68.2 1
So my individual 1 accounts for 30 people in the French population.
I create a df_weighted dataset using the survey package.
How can I use ggplot now? df_weighted is a list!
I did something like this to try to escape the list problem but I did not work at all...
df_weighted_ggplot$var1 <- svytable(~var1, df_weighted)
df_weighted_ggplot$var_fill <- svytable(~var_fill, df_weighted)
ggplot(df_weighted_ggplot, aes(fill = var_fill , x =var1)) + geom_bar(position = "fill")
I received this predictable error:
Erreur : `data` must be a data frame, or other object coercible by `fortify()`, not a list
Do you know any other package which should help me? But I read many forums and it seems to be the most helpful...

Related

Memory management in R ComplexUpset Package

I'm trying to plot an stacked barplot inside an upset-plot using the ComplexUpset package. The plot I'd like to get looks something like this (where mpaa would be component in my example):
I have a dataframe of size 57244 by 21, where one column is ID and the other is type of recording, and other 19 columns are components from 1 to 19:
ID component1 component2 ... component19 type
1 1 0 1 a
2 0 0 1 b
3 1 1 0 b
Ones and zeros indicate affiliation with a certain component. As shown in the example in the docs, I first convert these ones and zeros to logical, and then try to plot the basic upset plot. Here's the code:
df <- df %>% mutate(across(where(is.numeric), as.logical))
components <- colnames(df)[2:20]
upset(df, components, name='protein', width_ratio = 0.1)
But unfortunately after thinking for a while when processing the last line it spits out an error message like this:
Error: cannot allocate vector of size 176.2 Mb
Though I know I'm using the 32Gb RAM architecture, I'm sure I couldn't have flooded the memory so much that 167 Mb can't be allocated, so my guess is I am managing memory in R somehow wrong. Could you please explein what's faulty in my code, if possible.
I also know that UpsetR package plots the same data, but as far as i know it provides no way for the stacked barplotting.
Somehow, it works if you:
Tweak the min_size parameter so that the plot is not overloaded and makes a better impression
Making the first argument of ComplexUpset a sample with some data also helps, even if your sample is the whole dataset.

Questions associated with "Error: Aesthetics must be either length 1 or the same as the data"

I understand the subject "Error: Aesthetics must be either length 1 or the same as the data" has been done a lot (plenty of reading available online), however, I still have some unresolved questions
I am working with a dataset regarding all calls made to the Seattle Police Department in 2015. After I am done cleaning the data into an acceptable format I wind up with a dataset that is 62,092 rows and 13 columns (dataset name is SPD_2015). I would add a portion of the dataset to this question but I'm not entirely sure how to do it in a clean and legible format.
I used package lubridate to extract the times associated with my data set. I then created a bar graph that showed what time the crimes occur
ggplot(SPD_2015, aes(hour(date.reported.time))) +
geom_bar(width = 0.7)
and that works perfectly.
Since Car Prowls were the most frequently reported crime, I wanted to graph what time these car prowls occurred. And this is when I come across the error ""Error: Aesthetics must be either length 1 or the same as the data".
I read that ggplot2 does not like it when you subset within the ggplot code, so I subsetted my data by creating a separate data frame.
car.prowl <- filter(SPD_2015, summarized.offense.description == "CAR PROWL")
So here is my question. Why is it that when I look at the dimensions of my newly created dataset "car.prowl" I see that it has a dimension of 11,539 rows and 13 columns. But when I examine the length of the hours in the occurred.time column (the time that the crime occurred) I get a length of 62,092 which is the length of the original dataset?
In my mind I am picturing that the following code would work:
ggplot(car.prowl, aes(hour(occured.time))) +
geom_bar()
The length of the car.prowl$occured.time is correct:
> length(car.prowl$occured.time)
[1] 11539
but when I apply the hour function I get the length of the original dataset:
> length(hour(car.prowl$occured.time))
[1] 62092
when it should be 11,539.
Thank you. Please let me know what I can do to make my question more clear.
It could be a caching issue as Jeremy said above. I'm not sure this would work, but you could try the below, chaining things together.
SPD_2015%>%
filter(summarized.offense.description == "CAR PROWL")%>%
ggplot(aes(hour(occured.time)))+
geom_bar()

R: Split data in ggplot based on other factor

I am a beginner with R so I don't have much experience. I ran into a problem when trying to split my scatterplot in groups based on infection status. My dataset consists of log transformed antibody levels logapfhap2 in this example. Infection status any Pf inf is coded as Yes or No and gives information on if someone has been infected during the follow-up period. I am plotting timepoints (x) against antibody levels (y). For time point 1 and 14 I would like to make 2 groups based on infection status.
This is the main part of the code I use to plot the data without splitting in groups:
ggplot() +
geom_jitter(data=data2, aes(x='1', y=logapfhap2, colour='PfHAP2A')) +
geom_jitter(data=data2,aes(x='14', y=logbpfhap2, colour='PfHAP2B')) +
geom_jitter(data=TRC, aes(x='C', y=PfHAP2, colour='PfHAP2C'))
which results in this graph:
Then I tried to split it (I only show the first time point here) which returns an error.
ggplot() +
geom_jitter(data=data2[data2$any_Pf_inf=='Yes'],
aes(x='1inf', y=logapfhap2[data2$any_Pf_inf=='Yes'],
colour='PfHAP2A')) +
geom_jitter(data=data2[data2$any_Pf_inf=='No'],
aes(x='1un', y=logapfhap2[data2$any_Pf_inf=='No'],
colour='PfHAP2B'))
I wanted to create this graph but I get this error:
Error: Length of logical index vector must be 1 or 55, got: 482
Hope this is clear! Could anyone help me with this problem? Thanks!
EDIT
Not sure if this makes it clearer, but this is what my data looks like:
I just tried some other things and I have solved it now!
ggplot()+
geom_jitter(data=data2[data2$any_Pf_inf=='Yes',],
aes(x='1inf', y=logapfhap2,
colour='PfHAP2A')) +
geom_jitter(data=data2[data2$any_Pf_inf=='No',],
aes(x='1un', y=logbpfhap2,
colour='PfHAP2B'))
Apparently you have to add a comma after [data2$any_Pf_inf=='Yes',] to extract rows instead of columns.

Circular-linear regression with covariates in R

I have data showing when an animal came to a survey station. example csv file here The first few lines of data look like this:
Site_ID DateTime HourOfDay MinTemp LunarPhase Habitat
F1 6/12/2013 14:01:00 14 -1 0 river
F1 6/12/2013 14:23:00 14 -1 0 river
F2 6/13/2013 1:21:00 1 3 1 upland
F2 6/14/2013 1:33:00 1 4 2 upland
F3 6/14/2013 1:48:00 1 4 2 river
F3 6/15/2013 11:08:00 11 0 0 river
I would like to perform a circular-linear regression in R to determine peak activity times. The dependent variable could be DateTime or HourOfDay, whichever is easier. I would like to incorporate the covariates Site_ID (random effect), plus MinTemp, LunarPhase, and Habitat into a mixed-effects model.
I have tried using the lm.circular function of program circular, and have the following code:
data<-read.csv("StackOverflowExampleData.csv")
data$DateTime<-as.POSIXct(as.character(data$DateTime), format = "%m/%d/%Y %H:%M:%S")
data$LunarPhase<-as.factor(data$LunarPhase)
str(data)
library(circular)
y<-data$DateTime
y<-circular(y, units ="hours",template = "clock24",rotation = "clock")
x<-data[,c(1,4,5,6)]
lm.circular(y=y, x=x, init=c(1,1,1,1), type='c-l', verbose=TRUE)
I keep getting the error:
Error in Ops.POSIXt(x, 12) : '/' not defined for "POSIXt" objects
Apparently this is a known bug, but I was confused by this threat about it and could not determine an appropriate work-around. Suggestions?
Also, my ultimate goal with this data was to run a circular-linear version of a glm, and then test several models against one another using AIC or some other information theoretics method. The model I'm seeking would be a circular-linear version of something like this:
glmer(HourOfDay~MinTemp+LunarPhase+Habitat+(1|Site_ID),family=binomial,data=data)
Perhaps this is an inappropriate application of the circular package. If so, I'm open to other suggestions of models and/or graphics that would investigate peak activity using the data and covariates.
Note: I did search for related discussions and found this somewhat relevant thread, but it was never answered, did not request a solution in R, and was of a different scope.
The specific problem is caused by conversion.circular. There, a POSIXlt object is divided by 12. This is an operation that has a non-defined outcome:
> as.POSIXlt('2005-07-16') / 2
Error in Ops.POSIXt(as.POSIXlt("2005-07-16"), 2) :
'/' not defined for "POSIXt" objects
So, it seems that you cannot use data of this class as input for the circular package. I could not find any mention of POSIXlt data in the examples. Maybe you need to specify the timestamps simply as a number, not as a POSIXlt object.

R storing different columns in different vectors to compute conditional probabilities

I am completely new to R. I tried reading the reference and a couple of good introductions, but I am still quite confused.
I am hoping to do the following:
I have produced a .txt file that looks like the following:
area,energy
1.41155882174e-05,1.0914586287e-11
1.46893363946e-05,5.25011714434e-11
1.39244046855e-05,1.57904991488e-10
1.64155121046e-05,9.0815757601e-12
1.85202830392e-05,8.3207522281e-11
1.5256036289e-05,4.24756620609e-10
1.82107587343e-05,0.0
I have the following command to read the file in R:
tbl <- read.csv("foo.txt",header=TRUE).
producing:
> tbl
area energy
1 1.411559e-05 1.091459e-11
2 1.468934e-05 5.250117e-11
3 1.392440e-05 1.579050e-10
4 1.641551e-05 9.081576e-12
5 1.852028e-05 8.320752e-11
6 1.525604e-05 4.247566e-10
7 1.821076e-05 0.000000e+00
Now I want to store each column in two different vectors, respectively area and energy.
I tried:
area <- c(tbl$first)
energy <- c(tbl$second)
but it does not seem to work.
I need to different vectors (which must include only the numerical data of each column) in order to do so:
> prob(energy, given = area), i.e. the conditional probability P(energy|area).
And then plot it. Can you help me please?
As #Ananda Mahto alluded to, the problem is in the way you are referring to columns.
To 'get' a column of a data frame in R, you have several options:
DataFrameName$ColumnName
DataFrameName[,ColumnNumber]
DataFrameName[["ColumnName"]]
So to get area, you would do:
tbl$area #or
tbl[,1] #or
tbl[["area"]]
With the first option generally being preferred (from what I've seen).
Incidentally, for your 'end goal', you don't need to do any of this:
with(tbl, prob(energy, given = area))
does the trick.

Resources