I have a data frame like the following:
year month increment
113 6 464
113 7 132
113 8 165
113 9 43
113 10 658
113 11 54
113 12 463
114 1 231
114 2 21
Although the rows are ordered as shown, when I plot increment ~ factor(month) the x axis starts at month 1 instead of month 6, the first month in the data frame:
qplot(month,data=monthly,fill=treatment,weight=increment,position="dodge")
What should I do to make the x axis respect the month order I need?
Something like this, perhaps:
qplot(interaction(year, month, lex.order=TRUE), data=monthly, fill=treatment,weight=increment,position="dodge")
Removing the fill=treatment argument (as I do not have the data) results in this:
qplot(interaction(year, month, lex.order=TRUE), data=monthly, weight=increment,position="dodge")
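If the x axis should follow the row order of the data frame exactly, another option is to build the factor levels from the rows themselves. This is only a minimal sketch: it assumes the sample data from the question is stored in monthly (without the treatment column, which the question does not show), and the column name period is made up for illustration.

```r
# Sample data from the question (treatment column omitted).
monthly <- data.frame(
  year      = c(rep(113, 7), rep(114, 2)),
  month     = c(6:12, 1:2),
  increment = c(464, 132, 165, 43, 658, 54, 463, 231, 21)
)

# Build a year-month label; unique() keeps first-appearance order,
# so the factor levels follow the rows of the data frame.
label <- paste(monthly$year, monthly$month, sep = "-")
monthly$period <- factor(label, levels = unique(label))

print(levels(monthly$period))

# Draw the bars only if ggplot2 is installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(monthly, aes(x = period, y = increment)) +
    geom_col()
  print(p)
}
```

Because the level order is fixed explicitly, the axis no longer depends on how the month numbers happen to sort.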
Data given are a sample of cholesterol levels taken from 24 hospital employees who were on a standard American diet and who agreed to adopt a vegetarian diet for 1 month. Serum-cholesterol measurements were made before adopting the diet and 1 month after.
Subject Before After Difference
1 1 195 146 49
2 2 145 155 -10
3 3 205 178 27
4 4 159 146 13
5 5 244 208 36
6 6 166 147 19
7 7 250 202 48
8 8 236 215 21
9 9 192 184 8
10 10 224 208 16
11 11 238 206 32
12 12 197 169 28
13 13 169 182 -13
14 14 158 127 31
15 15 151 149 2
16 16 197 178 19
17 17 180 161 19
18 18 222 187 35
19 19 168 176 -8
20 20 168 145 23
21 21 167 154 13
22 22 161 153 8
23 23 178 137 41
24 24 137 125 12
Now here is the question I am trying to answer. Some investigators believe that the effects of diet on cholesterol are more evident in people with high rather than low cholesterol levels. If you split the data according to whether baseline cholesterol is above or below the median, can you comment descriptively on this issue?
I am thinking of creating a boxplot based on two categories, using dplyr for the data manipulation. I will create a new column based on whether Before is below the median of Before, giving a character vector with "high" for high baseline cholesterol and "low" for low baseline cholesterol. Then I will draw a boxplot of Difference grouped by the new column. Here is my code; the original data set is called df2.
df2 %>%
mutate(new_col = if_else(Before < median(Before), "low", "high")) %>%
group_by(new_col) %>%
ggplot(aes(x= new_col, y=Difference)) +
geom_boxplot()
And the following is the boxplot I get:
So, based on this, I conclude that investigators are right and effects of diet on cholesterol are more evident in people with high rather than low cholesterol levels. I want to know if this can be done more effectively.
This is more of a statistical-analysis question than a programming question, so it belongs on stats.stackexchange rather than Stack Overflow.
Anyway, categorizing a variable depending on the median is not the recommended way of visualizing associations, as you are suppressing a lot of information. You can read about this in this very good article by Peter Flom.
It is better to keep all the points and apply some spline or smoothing algorithm.
For instance, you could consider something like this:
ggplot(df2, aes(x= Before, y=Difference)) +
geom_point() +
geom_smooth()
Here, the relationship is clearly visible, and you keep all the information in the data.
If you really have to generate subgroups, you could also try something like this:
df2 %>%
mutate(new_col = if_else(Before < median(Before), "low", "high")) %>%
ggplot(aes(x= Before, y=Difference, group=new_col, color=new_col)) +
geom_point() +
geom_smooth(span=3) #try some other values here
However, using the median is still not a very good idea, especially with so few data points. You might want to assess the functional form of the relationship, but that would need a specific question on stats.stackexchange.com.
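If a purely descriptive comparison of the two halves is still wanted, the group summaries can be computed alongside a correlation on the unsplit data. The sketch below uses base R only and rebuilds the question's data; the column name group is an assumption made for illustration, and this is one possible summary rather than a recommended analysis.

```r
# Data from the question; Difference is Before minus After.
Before <- c(195, 145, 205, 159, 244, 166, 250, 236, 192, 224, 238, 197,
            169, 158, 151, 197, 180, 222, 168, 168, 167, 161, 178, 137)
After  <- c(146, 155, 178, 146, 208, 147, 202, 215, 184, 208, 206, 169,
            182, 127, 149, 178, 161, 187, 176, 145, 154, 153, 137, 125)
df2 <- data.frame(Subject = 1:24, Before, After,
                  Difference = Before - After)

# Median split as in the question (a value equal to the median goes to "high").
df2$group <- ifelse(df2$Before < median(df2$Before), "low", "high")

# Group size, mean and standard deviation of the change in each half.
group_n    <- table(df2$group)
group_mean <- tapply(df2$Difference, df2$group, mean)
group_sd   <- tapply(df2$Difference, df2$group, sd)
print(round(cbind(n = group_n, mean = group_mean, sd = group_sd), 1))

# Association between baseline and change on the full, unsplit data.
print(cor(df2$Before, df2$Difference))
```

With these 24 subjects, the mean drop is about 28 in the "high" half and about 11 in the "low" half, which is what the boxplot shows visually.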
Not really an answer, but a different approach to visualising the data:
library( data.table )
library( ggplot2 )
DT.melt <- melt( DT, id.vars = "Subject", measure.vars = c( "Before", "After" ) )
ggplot() +
geom_line( data = DT.melt,
aes( x = variable, y = value, group = Subject ) ) +
geom_line( data = DT.melt[, .(mean = mean(value)), by = variable ],
aes( x = variable, y = mean, group = 1 ), color = "red", size = 2 ) +
labs( x = "", y = "" )
sample data used
DT <- fread(" Subject Before After Difference
1 195 146 49
2 145 155 -10
3 205 178 27
4 159 146 13
5 244 208 36
6 166 147 19
7 250 202 48
8 236 215 21
9 192 184 8
10 224 208 16
11 238 206 32
12 197 169 28
13 169 182 -13
14 158 127 31
15 151 149 2
16 197 178 19
17 180 161 19
18 222 187 35
19 168 176 -8
20 168 145 23
21 167 154 13
22 161 153 8
23 178 137 41
24 137 125 12")
I have an excel sheet that I imported into RStudio which contains data for every subject of a certain population. Each subject has their own set of data with corresponding dates, but I only want to look at the data and perform statistical analyses on the dates past a unique date for each subject.
I'm assuming I can use the split function to create smaller dataframes, with each corresponding to that of each subject, and then use some function to analyze the data in a loop to run on all of the smaller dataframes I created from the split.
Some of these subjects will have over 1000 data points. My two main questions are:
1) Is there a function I can use to analyze the data for each subject past a specific unique date to each subject?
2) Is the strategy I proposed above a viable one?
Unfortunately, I have very little experience in data analysis and no extensive background in computer science. Thanks for any help.
Edit: There was a request about the type of data I was talking about. If I had data similar to this, where P1 and P2 each have their own data set that I want to analyze after the TxDate, could I still use the above strategy?
>data
1 Date BMI Glucose Cholesterol TxDate
2 P1 3/3/15
3 12/1/14 24 145 99
4 3/18/15 26 123 101
5 4/21/15 28 111 85
6 6/2/15 25 133 90
7
8
9 P2 4/6/16
10 1/3/16 33 145 200
11 3/30/16 31 162 178
12 5/13/16 34 190 134
13 6/12/16 34 183 168
14 7/9/16 35 200 189
15 9/10/16 31 175 190
16 11/23/17 27 121 120
17
18
Here are some suggestions to get you started:
1) Tidy your data. To do this you could look into ways to modify your input data so it looks more like this:
ID Date BMI Glucose Cholesterol TxDate
3 P1 12/1/14 24 145 99 3/3/15
4 P1 3/18/15 26 123 101 3/3/15
5 P1 4/21/15 28 111 85 3/3/15
6 P1 6/2/15 25 133 90 3/3/15
10 P2 1/3/16 33 145 200 4/6/16
11 P2 3/30/16 31 162 178 4/6/16
12 P2 5/13/16 34 190 134 4/6/16
13 P2 6/12/16 34 183 168 4/6/16
14 P2 7/9/16 35 200 189 4/6/16
15 P2 9/10/16 31 175 190 4/6/16
16 P2 11/23/17 27 121 120 4/6/16
Notice that the ID and TxDate columns are filled in with the appropriate values and that several empty rows were dropped. The row containing ID, Date, etc. is a header, not a data row. Don't be too surprised if the tidying step takes longer than the analysis.
Now, for the purpose of this example, let's use this as your data:
library(lubridate)  # provides mdy()
df <- data.frame(
ID = c(rep("P1",4), rep("P2", 7)),
Date = as.Date(mdy(c("12/1/14", "3/18/15", "4/21/15" , "6/2/15", "1/3/16", "3/30/16", "5/13/16", "6/12/16", "7/9/16", "9/10/16", "11/23/17"))),
BMI = c(24,26,28,25,33,31,34,34,35,31,27),
Glucose = c(145,123,111,133,145,162,190,183,200,175,121),
Cholesterol = c(99,101,85,90,200,178,134,168,189,190,120),
TxDate = as.Date(mdy(c("3/3/15", "3/3/15","3/3/15","3/3/15","4/6/16", "4/6/16","4/6/16","4/6/16","4/6/16","4/6/16","4/6/16"))),
stringsAsFactors = FALSE)
2) Check whether your Date and TxDate columns are stored as Date objects. If your data.frame is named 'df', then is.Date(df$Date) and is.Date(df$TxDate) from the lubridate package (or inherits(df$Date, "Date") in base R) will tell you, as will str(df).
If not, read about ways to convert them to date objects, perhaps with the as.Date() function combined with mdy() from the lubridate package.
3) Once you have the dates represented as date objects you could subset the data frame with a simple logical statement, like this
# subset dataframe
df1 <- df[df$Date > df$TxDate, ]
Now df1 should look like this:
ID Date BMI Glucose Cholesterol TxDate
2 P1 2015-03-18 26 123 101 2015-03-03
3 P1 2015-04-21 28 111 85 2015-03-03
4 P1 2015-06-02 25 133 90 2015-03-03
7 P2 2016-05-13 34 190 134 2016-04-06
8 P2 2016-06-12 34 183 168 2016-04-06
9 P2 2016-07-09 35 200 189 2016-04-06
10 P2 2016-09-10 31 175 190 2016-04-06
11 P2 2017-11-23 27 121 120 2016-04-06
What's left is the data you seem to need for your analysis.
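From here, the split-and-loop strategy from the question is workable, although a grouped summary is often shorter. The following sketch uses base R only (as.Date() with an explicit format instead of lubridate) and takes the per-subject mean purely as a stand-in for whatever analysis is actually needed; the names tx, post_means and by_subject are made up for illustration.

```r
# Tidy example data, built with base R date parsing.
df <- data.frame(
  ID = c(rep("P1", 4), rep("P2", 7)),
  Date = as.Date(c("12/1/14", "3/18/15", "4/21/15", "6/2/15",
                   "1/3/16", "3/30/16", "5/13/16", "6/12/16",
                   "7/9/16", "9/10/16", "11/23/17"),
                 format = "%m/%d/%y"),
  BMI = c(24, 26, 28, 25, 33, 31, 34, 34, 35, 31, 27),
  Glucose = c(145, 123, 111, 133, 145, 162, 190, 183, 200, 175, 121),
  Cholesterol = c(99, 101, 85, 90, 200, 178, 134, 168, 189, 190, 120),
  stringsAsFactors = FALSE
)

# One treatment date per subject, looked up by ID.
tx <- c(P1 = "3/3/15", P2 = "4/6/16")
df$TxDate <- as.Date(unname(tx[df$ID]), format = "%m/%d/%y")

# Keep only the rows after each subject's own treatment date.
df1 <- df[df$Date > df$TxDate, ]

# Option 1: one summary row per subject.
post_means <- aggregate(cbind(BMI, Glucose, Cholesterol) ~ ID,
                        data = df1, FUN = mean)
print(post_means)

# Option 2: split into one data frame per subject and loop over them.
by_subject <- split(df1, df1$ID)
n_obs <- sapply(by_subject, nrow)
print(n_obs)
```

Either form scales to subjects with over 1000 rows; split() is mainly useful when the per-subject analysis returns something more complex than a single number.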
I need to distribute some days across the year.
I have 213 activities and 247 days. I need to plan these activities so that they cover as much of the period as possible.
I subtract the activities from the total days, which gives 34 in this case, and then divide the total days by that result: 247/34 = 7.26...
This number tells me that roughly every seventh day is a day without activity.
To code this part I am doing the following,
where day is the loop variable of a for loop over a data frame of dates, and integer is the integer part of 7.26, in this case 7:
if(day%%integer==0) {
aditional <- 0
} else {
aditional <- 1
}
#
if(day%%7==0) {
aditional <- 0
} else {
aditional <- 1
}
The result will be:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
The days without activity (7 and 14) were shown in bold.
This works, but it is not as precise as I want.
I know I need to use the decimal part of 7.26 (the .26), but I don't know how to do it.
Can you help me, please?
Thanks, and sorry for my English.
Make these 34 days the non-activity days:
round((247/34) * seq(34))
giving:
[1] 7 15 22 29 36 44 51 58 65 73 80 87 94 102 109 116 124 131 138
[20] 145 153 160 167 174 182 189 196 203 211 218 225 232 240 247
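To use these positions in the planning loop, they can be turned into a per-day flag. This is a small sketch under the question's numbers; the variable names total_days, n_off, off_days and has_activity are chosen here for illustration.

```r
total_days <- 247
n_off      <- total_days - 213   # 34 days without activity

# Evenly spaced non-activity days, using the fractional step 247/34.
off_days <- round((total_days / n_off) * seq_len(n_off))

# TRUE for days that get an activity, FALSE for the days left free.
has_activity <- !(seq_len(total_days) %in% off_days)

print(sum(has_activity))    # number of days with an activity
print(head(which(!has_activity)))
```

Multiplying before rounding is what lets the .26 accumulate, so the gap between free days alternates between 7 and 8 instead of always being 7.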
I have a data frame (named 'mdf') which includes two columns. The basic information is below:
> head(mdf); tail(mdf)
Country Rank
1 ABW 161
2 AFG 105
3 AGO 60
4 ALB 125
5 ARE 32
6 ARG 26
Country Rank
184 WSM 181
185 YEM 90
186 ZAF 28
187 ZAR 112
188 ZMB 104
189 ZWE 134
> str(mdf)
'data.frame': 189 obs. of 2 variables:
$ Country: Factor w/ 229 levels "","ABW","ADO",..: 2 4 5 6 7 8 9 11 12 13 ...
$ Rank : Factor w/ 195 levels "",".. Not available. ",..: 72 10 149 32 118 111 41 84 26 112 ...
My purpose is to rearrange it by ordering 'Rank' variable, but the result is:
> mdf[order(mdf$Rank),]
Country Rank
178 USA 1
78 IND 10
153 SLV 100
170 TTO 101
43 CYP 102
54 EST 103
188 ZMB 104
2 AFG 105
175 UGA 106
130 NPL 107
73 HND 108
60 GAB 109
31 CAN 11
67 GNQ 110
As you see, it is not what I need (i.e. increasing order). How can I do it? Thanks!
To get the answer you are looking for, use:
mdf[order(as.numeric(as.character(mdf$Rank))),]
The reason your original code doesn't work is that your Rank variable is a factor, so it will be sorted by the levels of the factor. For example, if you had a data frame such that:
DF
# x
# 1 2
# 2 22
# 3 11
# 4 1
and order the data
DF[order(DF$x),]
and you look at the levels:
levels(DF$x)
# [1] "1" "2" "11" "22"
We can reorder the levels, for example:
DF$x <- factor(DF$x, levels = c("2", "22", "11", "1"))
Now,
levels(DF$x)
# [1] "2" "22" "11" "1"
So ordering with the new factor levels we get different results:
DF[order(DF$x),]
To answer your question of why as.numeric doesn't work, it's because each factor level has an associated integer, which you get with as.numeric. If you want the number that is the factor label, you must first convert to a character and then convert to numeric, thus requiring as.numeric(as.character(x))
For example, with the original factor levels, calling as.numeric(DF$x) gives the integer code of each value's level, not the actual label:
# [1] 2 4 3 1
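The difference between the two conversions can be seen on a small made-up factor. This is a sketch with hypothetical values, not the mdf data; note that non-numeric labels such as ".. Not available. " would become NA (with a warning) under as.numeric(as.character(...)).

```r
# A factor whose labels are numbers stored as text (hypothetical values).
rank <- factor(c("1", "10", "100", "2", "11"))

# Levels are sorted as strings, not as numbers.
print(levels(rank))

# Internal integer codes of the levels: not the numbers we want.
codes <- as.numeric(rank)
print(codes)

# Labels converted to text first, then to numbers.
values <- as.numeric(as.character(rank))
print(values)

# Ordering by the factor follows the string-sorted levels;
# ordering by the converted values is numeric.
print(rank[order(rank)])
print(rank[order(values)])
```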
One way to avoid this in the future if you are loading your data frame from a .csv file is to use read.csv(..., stringsAsFactors=FALSE), or also I like the fread function in data.table which uses safer default types.
I have a data table which looks like this-
pos gtt1 gtt2 ftp1 ftp2
8 100 123 49 101
9 85 93 99 110
10 111 102 53 113
11 88 110 59 125
12 120 118 61 133
13 90 136 64 145
14 130 140 104 158
15 78 147 74 167
16 123 161 81 173
17 160 173 88 180
18 117 180 94 191
19 89 188 104 199
20 175 197 107 213
I want to make a line graph with pos (position) on the x-axis using ggplot. I am trying to show the gtt1 and gtt2 lines in one colour and ftp1 and ftp2 in another colour, because they are separate groups (gtt and ftp) of samples. I have successfully created the graph, but all four lines are in different colours. I would like to keep only gtt and ftp in the legend (not all four). Bonus: how can I make these lines a little smoother?
Here is what I did so far:
library(reshape2);library(ggplot2)
data <- read.table("myfile.txt",header=TRUE,sep="\t")
data.melt <- melt(data,id="pos")
ggplot(data.melt,aes(x=pos, y=value,colour=variable))+geom_line()
Thanks in advance
The easiest way is to re-shape your data in a slightly different way (here dd is the data frame you read in):
dd1 = melt(dd[,1:3], id=c("pos"))
dd1$type = "gtt"
dd2 = melt(dd[,c(1, 4:5)], id=c("pos"))
dd2$type = "ftp"
dd.melt = rbind(dd1, dd2)
Now we have a column specifying the variable "type":
R> head(dd.melt, 2)
pos variable value type
1 8 gtt1 100 gtt
2 9 gtt1 85 gtt
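With more than a handful of columns, the type label can also be derived from the column names rather than by melting each group separately. The sketch below uses base R only and assumes the column names are a group prefix followed by digits; dd.long and value_cols are names chosen here for illustration.

```r
# Data from the question.
dd <- data.frame(
  pos  = 8:20,
  gtt1 = c(100, 85, 111, 88, 120, 90, 130, 78, 123, 160, 117, 89, 175),
  gtt2 = c(123, 93, 102, 110, 118, 136, 140, 147, 161, 173, 180, 188, 197),
  ftp1 = c(49, 99, 53, 59, 61, 64, 104, 74, 81, 88, 94, 104, 107),
  ftp2 = c(101, 110, 113, 125, 133, 145, 158, 167, 173, 180, 191, 199, 213)
)

# Long format: one row per (pos, variable) pair.
value_cols <- setdiff(names(dd), "pos")
dd.long <- data.frame(
  pos      = rep(dd$pos, times = length(value_cols)),
  variable = rep(value_cols, each = nrow(dd)),
  value    = unlist(dd[value_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Strip the trailing digits to get the group: "gtt1" -> "gtt".
dd.long$type <- sub("[0-9]+$", "", dd.long$variable)

print(table(dd.long$type))
print(head(dd.long, 2))
```

The resulting data frame has the same four columns as dd.melt and can be passed to the same ggplot call.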
Once the data is in this format, the ggplot command is straightforward:
g = ggplot(dd.melt, aes(x=pos, y=value)) +
geom_line(aes(colour=type, group=variable)) +
scale_colour_manual(values=c(gtt="blue", ftp="red"))
g
You can add smoothed lines using stat_smooth:
##span controls the smoothing
g + stat_smooth(aes(colour=type, group=variable), se=FALSE, span=0.5)