Keep only the first sequence of a column - r

I have a dataset where level_no may repeat (see dput below). I only want to keep the "first round" of data, so to speak. For example, ID = 1 has level_no 0, 1, 2, 3, 1; I want to keep only the first round (0, 1, 2, 3).
So far, I'm using distinct to remove the subsequent rounds, but I'm not sure if this is the correct approach.
puzzleData_mandatory %>%
  arrange(ID, total_played_time) %>%
  select(ID, level_no, total_played_time) %>%
  distinct(ID, level_no, .keep_all = TRUE)
structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L,
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L), level_no = c(0L,
1L, 2L, 3L, 1L, 0L, 1L, 2L, 3L, 0L, 1L, 2L, 3L, 1L, 2L, 2L, 2L,
1L, 1L, 1L, 2L, 3L, 1L, 2L, 3L, 3L, 1L, 2L, 3L, 1L, 2L, 2L, 3L,
1L, 2L, 3L, 1L, 2L, NA), total_played_time = c(285.54, 542.94,
856.8, 1129.1, 1226.98, 282.28, 457.42, 947.78, 1073.8, 161.66,
293.38, 548.26, 682.66, 818.18, 976.86, 1008.76, 1019.34, 59.06,
93.14, 223.1, 485.24, 644.2, 2002.74, 2249.74, 2417.84, 2481.99,
2614.9, 2818.64, 2913.61, 3039.14, 3057.44, 3217.52, 3359.48,
3480.78, 3638.04, 3764.88, 3883.16, 4025.9, NA)), row.names = c(NA,
-39L), class = c("tbl_df", "tbl", "data.frame"))
Is there a better way to do this?

If you want all of the rows where the level_no is minimum for an ID, then you can do this
puzzleData_mandatory %>% group_by(ID) %>% slice_min(level_no)
If you want only one row per ID (the row with the minimum total_played_time within the minimum level_no), you can do this:
puzzleData_mandatory %>%
  arrange(ID, level_no, total_played_time) %>%
  group_by(ID) %>%
  filter(row_number() == 1)
If you want the first increasing sequence of level_no for each ID, you can do this:
puzzleData_mandatory %>%
  group_by(ID) %>%
  mutate(change = sign(level_no - lag(level_no)),
         change = if_else(is.na(change), 1, change),
         run_id = data.table::rleid(change)) %>%
  ungroup() %>%
  filter(run_id == 1) %>%
  select(1:3)
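If you'd rather avoid the data.table dependency, here is a pure-dplyr sketch of the same idea. One behavioural difference to note: it treats only a strict drop in level_no as the start of a new round, so repeated levels (e.g. ID 4's 1, 1, 1 or ID 5's 3, 3) stay inside the first round, whereas the rleid version breaks the run at the first repeat.
puzzleData_mandatory %>%
  group_by(ID) %>%
  # a round ends at the first strict decrease in level_no;
  # coalesce() treats the first row (where lag() is NA) as "no drop"
  filter(cumsum(coalesce(level_no < lag(level_no), FALSE)) == 0) %>%
  ungroup()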

Related

Calculating number of observations per group in R

I would like to calculate column D based on the date in column A. Column D should be a running count of observations within each group defined by column B.
Edit: fake data below
data <- structure(list(date = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 9L,
10L, 11L, 12L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L), .Label = c("1/1/2015",
"1/2/2015", "1/3/2015", "1/4/2015", "1/5/2015", "1/6/2015", "5/10/2015",
"5/11/2015", "5/6/2015", "5/7/2015", "5/8/2015", "5/9/2015"), class = "factor"),
Country = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("A", "B",
"C"), class = "factor"), Value = c(215630672L, 1650864L,
124017368L, 128073224L, 97393448L, 128832128L, 14533968L,
46202296L, 214383720L, 243346080L, 85127128L, 115676688L,
79694024L, 109398680L, 235562856L, 235473648L, 158246712L,
185424928L), Number.of.Observations.So.Far = c(1L, 2L, 3L,
4L, 5L, 6L, 1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L, 4L, 5L, 6L
)), class = "data.frame", row.names = c(NA, -18L))
What function in R will create a column D like so?
We can group by 'Country' and create a sequence column with row_number():
library(dplyr)
df1 %>%
  group_by(Country) %>%
  mutate(NumberOfObs = row_number())
Or with base R
df1$NumberOfObs <- with(df1, ave(seq_along(Country), Country, FUN = seq_along))
Or with table (note this assumes the rows for each Country are contiguous, as they are in the example data)
df1$NumberOfObs <- sequence(table(df1$Country))
Or in data.table
library(data.table)
setDT(df1)[, NumberOfObs := rowid(Country)][]
data
df1 <- data  # the structure() object from the question

Normalization of data within ggplot

I have my data as
melted.df <- structure(list(organisms = structure(c(1L, 1L, 1L, 2L, 3L, 3L,
3L, 3L, 4L, 4L, 4L, 1L, 1L, 1L, 2L, 3L, 3L, 3L, 3L, 4L, 4L, 4L,
1L, 1L, 1L, 2L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 1L, 1L, 1L, 2L, 3L,
3L, 3L, 3L, 4L, 4L, 4L), .Label = c("Botrytis cinerea", "Fusarium graminearum",
"Human", "Mus musculus"), class = "factor"), types = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L), .Label = c("AllMismatches",
"mismatchType2", "MismatchesType1", "totalDNA"), class = "factor"),
mutations = c(30501L, 12256L, 58357L, 366531L, 3475L, 186907L,
253453L, 222L, 24906L, 2775L, 247990L, 12324L, 4395L, 25324L,
77862L, 1862L, 112217L, 163117L, 100L, 17549L, 1057L, 20331L,
18177L, 7861L, 33033L, 288669L, 1613L, 74690L, 90336L, 122L,
7357L, 1718L, 227659L, 635951L, 229493L, 868052L, 2418724L,
65833L, 1081903L, 1339758L, 4318L, 59387L, 15199L, 2134229L
)), row.names = c(NA, -44L), class = "data.frame")
The value totalDNA in the types column indicates the total DNA in the data, whereas the other values are mutation counts. I would like to normalize this data based on the totalDNA values and plot it. The way I am plotting right now doesn't give an accurate picture of the data, as totalDNA inflates the whole y-axis and the other three types (mismatchType2, MismatchesType1, and AllMismatches) are not properly visible relative to totalDNA. What would be a better way to plot this? Should I first calculate percentages, or perhaps use log scaling? Thanks for helping me out.
ggplot(melted.df, aes(x = types, y = mutations, color = types)) +
  geom_point() +
  facet_grid(. ~ organisms) +
  xlab("Types") +
  ylab("Mismatches") +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank())
Try a log scale?
ggplot(melted.df, aes(x = types, y = mutations, color = types)) +
  geom_point() +
  facet_grid(. ~ organisms) +
  xlab("Types") +
  ylab("Mismatches") +
  scale_y_log10() +  # add log scale
  theme(axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank())
How would you normalise on total DNA? Would you use the (geometric) mean?
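To answer the percentage question directly, here is a hedged sketch. Because each organism has several unlabelled rows per type, a row-by-row pairing with totalDNA isn't recoverable from this data, so the sketch normalizes each mismatch count against the summed totalDNA per organism and drops the totalDNA rows before plotting.
library(dplyr)
library(ggplot2)
normalized <- melted.df %>%
  group_by(organisms) %>%
  # express each count as a percentage of the organism's summed totalDNA
  mutate(pct = 100 * mutations / sum(mutations[types == "totalDNA"])) %>%
  ungroup() %>%
  filter(types != "totalDNA")
ggplot(normalized, aes(x = types, y = pct, color = types)) +
  geom_point() +
  facet_grid(. ~ organisms) +
  ylab("Mismatches (% of total DNA)") +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank())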

How to count the number of rows of a specific column that have a specific character

I have data for which I want to count the rows of a specific column that contain a specific character. The data looks like the following:
df<-structure(list(Gene.refGene = structure(c(1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 4L, 4L, 4L, 4L, 4L), .Label = c("A1BG", "A1BG-AS1", "A1CF",
"A1CF;PRKG1"), class = "factor"), Chr = structure(c(2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("chr10", "chr19"
), class = "factor"), Start = c(58858232L, 58858615L, 58858676L,
58859052L, 58859055L, 58859066L, 58859510L, 58863162L, 58864479L,
58864150L, 58864867L, 58864879L, 58865857L, 52566433L, 52569637L,
52571047L, 52573510L, 52576068L, 52580561L, 52603659L, 52619845L,
52625849L, 52642500L, 52650951L, 52675605L, 52703952L, 52723140L,
52723638L), End = c(58858232L, 58858615L, 58858676L, 58859052L,
58859055L, 58859066L, 58859510L, 58863166L, 58864479L, 58864150L,
58864867L, 58864879L, 58865857L, 52566433L, 52569637L, 52571047L,
52573510L, 52576068L, 52580561L, 52603659L, 52619845L, 52625849L,
52642500L, 52650958L, 52675605L, 52703952L, 52723140L, 52723638L
), Ref = structure(c(3L, 5L, 2L, 2L, 3L, 2L, 5L, 7L, 6L, 6L,
2L, 1L, 5L, 6L, 5L, 3L, 2L, 5L, 6L, 3L, 3L, 6L, 3L, 4L, 3L, 6L,
6L, 3L), .Label = c("-", "A", "C", "CTCTCTCT", "G", "T", "TTTTT"
), class = "factor"), Alt_df1 = structure(c(1L, 1L, 4L, 4L, 1L,
4L, 5L, 1L, 3L, 3L, 4L, 4L, 3L, 1L, 2L, 5L, 1L, 2L, 1L, 5L, 5L,
2L, 5L, 1L, 4L, 3L, 4L, 2L), .Label = c("-", "A", "C", "G", "T"
), class = "factor")), class = "data.frame", row.names = c(NA,
-28L))
I want to know how many rows of the column named "Alt_df1" are missing, "-", or NA.
Here is an answer using which and utilising base R's LETTERS data:
length(which(!df$Alt_df1 %in% LETTERS))
#[1] 8
Or using just which:
length(which(df$Alt_df1 == "-"))
#[1] 8
One way would be to create a logical vector using %in% and then sum over them to count the number of occurrences.
sum(df$Alt_df1 %in% c("-", NA))
#[1] 8
Or we can also subset and count the number of rows.
nrow(subset(df, Alt_df1 %in% c("-", NA)))
This can also be done in dplyr:
library(dplyr)
df %>% filter(Alt_df1 %in% c("-", NA)) %>% nrow
Another option using grepl
with(df, sum(grepl("-", Alt_df1)) + sum(is.na(Alt_df1)))
and I am sure there are multiple other ways.

Interpret estimated marginal means (emmeans, aka lsmeans): negative response values

I am working on a model with lmer from which I would like to get estimated marginal means with the emmeans library. This is my dataframe:
df <- structure(list(treatment = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L), .Label = c("CCF", "UN"), class = "factor"), level = structure(c(2L,
3L, 4L, 2L, 3L, 4L, 2L, 3L, 4L, 2L, 3L, 4L, 2L, 3L, 4L, 2L, 3L,
4L, 2L, 3L, 4L, 2L, 3L, 4L, 2L, 3L, 4L, 2L, 3L, 4L, 2L, 3L, 4L,
2L, 3L, 4L, 2L, 3L, 4L, 2L, 3L, 4L, 2L, 3L, 4L, 2L, 3L, 4L, 2L,
3L, 4L, 2L, 3L, 4L, 2L, 3L, 4L, 2L, 3L, 4L, 2L, 3L, 4L, 2L, 3L,
4L, 2L, 3L, 4L, 2L, 3L, 4L, 2L, 3L, 4L, 2L, 3L, 4L, 2L, 3L, 4L,
2L, 3L, 4L, 2L, 3L, 4L, 2L, 3L, 4L, 2L, 3L, 4L, 2L, 3L, 4L, 2L,
3L, 4L, 2L, 3L, 4L, 2L, 3L, 4L), .Label = c("A", "F", "H", "L"
), class = "factor"), random = structure(c(3L, 3L, 3L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 3L, 3L, 3L, 4L,
4L, 4L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L,
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L,
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L,
5L, 5L, 5L, 5L), .Label = c("1.6", "2", "3.2", "5", NA), class = "factor"),
continuous = c(72.7951770264767, 149.373765810534, 1.64153094886205,
54.6697408615215, 25.5801835808851, 1.45794117443253, 25.3660934894788,
91.2321704897132, 2.75353217433675, 44.1995276851725, 33.1854545470435,
5.36536076058866, 29.6807620242672, 80.6077496067764, 0.833434180091457,
13.6789475327185, 77.4930412025109, 3.65998714174906, 25.2848344605563,
136.632099849828, 2.56715261161435, 28.6733878840584, 66.800616194317,
1.37475468782539, 23.007491380183, 84.980285774607, 1.13569710795522,
33.8610875632139, 56.1234827517798, 1.32327007970416, 60.0843812879313,
43.4487832450889, 1.14942423621912, 53.6673704529947, 146.746167255051,
3.91593723271292, 27.0321687961004, 89.5925729244878, 1.47707078226047,
44.0523211310831, 115.087908243373, 1.94039630728038, 86.4074806697431,
43.3266206881612, 2.81456503996437, 66.868588961071, 229.797526052566,
1.07971524769264, 30.3390107111747, 116.680801084036, 1.67711446647817,
69.0961010697534, 78.5454363192614, 1.92137892126384, 53.5708546850303,
37.7175476710608, 1.96087397451467, 25.5166981770257, 37.3755071788757,
2.21602000526086, 10.3266195584378, 38.1458490762217, 2.7508022340832,
44.5864920143771, 8.45382647692274, 2.63204944520792, 87.5376946978593,
27.2354119098268, 3.38134648323956, 26.8815471706502, 14.5539972194568,
2.0556994322415, 27.4619977737491, 32.8546665896602, 2.66809379088059,
42.3815445857533, 21.3359802201685, 2.19167325121191, 53.3189825439001,
13.5708790223439, 2.22274607227071, 88.297423835906, 8.50554349658773,
3.5764241495006, 29.284865737912, 21.1213079519954, 2.3070166819956,
10.7659615128225, 33.4813413290485, 2.49896565066211, 59.0935696616465,
13.2863515051715, 4.36424795471221, 72.1627847396763, 9.09326343200557,
2.13701784901259, 27.5824079679471, 8.84486812842272, 1.98293342019671,
17.5321126287485, 19.1806349705231, 5.03952187899644, 58.3473975730234,
9.17287686145614, 2.99575072457674)), class = "data.frame", row.names = c(NA,
105L))
This is my model:
library(lme4)
model <- lmer(continuous ~ treatment + level + (1|random), data = df, REML = TRUE)
The data as it is does not meet the model assumptions, but I am still wondering why I get a negative estimated marginal mean (response) for treatment "UN", level "L" (see the lettering table), when there are no negative values in df$continuous.
library(multcompView)
library(emmeans)
lsm.mixed_C <- emmeans(model, pairwise ~ treatment * level, type = "response")
lettering <- CLD(lsm.mixed_C, alpha = 0.05, Letters = letters,
                 adjust = "tukey")
The short answer is because you badly need to include the interaction in your model. Compare:
model2 <- lmer(continuous ~ treatment * level + (1|random),
               data = df, REML = TRUE)
emmip(model2, treatment ~ level)
with:
emmip(model, treatment ~ level)
In model2, both EMMs at level L are close to zero. If you remove the interaction from the model, you force those two profiles to be parallel, while maintaining a sizeable positive difference between treatments CCF and UN, forcing the estimate for UN to go negative. In actual fact, though, all six estimates for treatment x level combinations are seriously distorted.
I can't repeat it enough. emmeans() summarizes a model. If you give it a bad model, you get dumb results. Thanks for the great illustration of this point.
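For completeness, a sketch of pulling the corrected EMMs from model2 (the exact estimates depend on the fit, but none of them are forced negative by parallel profiles):
emmeans(model2, pairwise ~ treatment * level, type = "response")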

Automatically adjusting ylim with stat_summary

ggplot2 adjusts the ylim automatically for the data points. Is there any way to adjust the ylim for the stat_summary values too?
df <- structure(list(Varieties = structure(c(2L, 3L, 4L, 1L, 2L, 3L,
4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L,
4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L,
4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L,
4L, 1L, 2L, 3L, 4L, 1L), .Label = c("F9917", "Hegari", "JS263",
"JS2002"), class = "factor"), Priming = structure(c(2L, 2L, 2L,
2L, 5L, 5L, 5L, 5L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 4L, 4L, 4L,
4L, 2L, 2L, 2L, 2L, 5L, 5L, 5L, 5L, 3L, 3L, 3L, 3L, 1L, 1L, 1L,
1L, 4L, 4L, 4L, 4L, 2L, 2L, 2L, 2L, 5L, 5L, 5L, 5L, 3L, 3L, 3L,
3L, 1L, 1L, 1L, 1L, 4L, 4L, 4L, 4L), .Label = c("CaCl2", "Dry",
"Hydropriming", "KNO3", "OnFarmpriming"), class = "factor"),
PH = c(225.8, 224.26, 228.9, 215.82, 230.3, 227.7, 232.8,
221.1, 260.2, 230.8, 236.75, 230.5, 250.56, 230.74, 240.64,
226.7, 268.4, 233.4, 243.33, 232.7, 252.04, 233.1, 237.14,
220.6, 265.55, 234.93, 240.04, 218.21, 300.55, 245, 243.5,
234.65, 253.3, 233.5, 238.62, 225.93, 255.74, 233.64, 238.1,
230.93, 246, 240.33, 246.08, 221.7, 250.54, 242.87, 251,
225.32, 251.47, 245.4, 266.74, 227.73, 290.62, 246.68, 256.4,
225.83, 282.67, 240.58, 258.35, 235.87)), .Names = c("Varieties",
"Priming", "PH"), class = "data.frame", row.names = c(NA, 60L
))
p1 <- ggplot(data = df, aes(x = Varieties, y = PH, group = Priming,
                            shape = Priming, colour = Priming)) +
  stat_summary(fun.y = mean, geom = "point", size = 2) +
  theme_bw()
p1 <- p1 + stat_summary(fun.y = mean, geom = "line")
print(p1)
Note the extra space in the ylim beyond the range of the stat_summary values. Thanks in advance for your help and time.
Here is one approach, using plyr to prep the data before plotting:
library(plyr)
df <- ddply(df, .(Varieties, Priming), transform, meanPH = mean(PH))
ggplot(df, aes(Varieties, meanPH)) +
  geom_point() +
  geom_line(aes(group = Priming, color = Priming))
The current "official" answer for 0.8.9 is, I believe, that you can't, at least not automatically, and not without preprocessing the data as Ramnath indicates. Most people asking this question, or some variant of it, are pointed towards setting the limits manually using coord_cartesian.
The reason stat_summary behaves this way is that it sort of assumes that you aren't going to just plot the summaries, but at least some of the underlying data as well, so it sets up the plotting area using the underlying data frame.
However, I found this thread on the ggplot2 list that suggests this behavior might change in the upcoming 0.9.0 release. (The thread is a little vague, but I read it as implying that in the next version, if the only layer you add is from stat_summary, then the plot limits will be calculated based on the summaries, not the original data. I could be wrong though.)
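In the meantime, a minimal sketch of the manual workaround: compute the summary values yourself and pass their range to coord_cartesian so the y-axis hugs the means rather than the raw data.
# assumes p1 from the question; the means are recomputed here
# only to derive sensible axis limits
means <- with(df, tapply(PH, interaction(Varieties, Priming), mean))
p1 + coord_cartesian(ylim = range(means))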
