Modify values in a column subject to a condition in R

I know that this is a very silly question but I cannot work out how to do it.
I want to subtract 300 from the values on each row on the trial1 column if they are larger than 299.
I tried:
sums[sums$trial1 > 299, ][,"trial1"] -= 300
but it didn't work. So far the only way I have managed to get it to work is by splitting the data.frame using subset() and then modifying it with:
sums$trial1 = sums$trial1 - 300
and then using rbind(). I am pretty sure that using subset and rbind is overkill, but I haven't been able to find a direct way yet...
I used dput to get a sample of my data.frame.
structure(list(part_no = c(10L, 10L, 10L, 10L, 10L, 10L), trial1 = c(294L,
296L, 298L, 300L, 302L, 304L), trial2 = c(295L, 297L, 299L, 301L,
303L, 305L), id1 = c(1.5, 1.5, 1.5, 2, 2, 2), id2 = c(1.5, 1.5,
1.5, 2, 2, 2), dist1 = c(141L, 141L, 115L, 126L, 177L, 141L),
width1 = c(77L, 77L, 63L, 42L, 59L, 47L), dist2 = c(143L,
135L, 146L, 255L, 327L, 369L), width2 = c(78L, 74L, 80L,
85L, 109L, 123L), ttime1 = c(1752L, 1681L, 1664L, 1798L,
1664L, 1697L), ttime2 = c(2563L, 1849L, 2067L, 1933L, 2118L,
2245L), no_clicks1 = c(8L, 8L, 8L, 8L, 8L, 8L), no_clicks2 = c(8L,
8L, 8L, 8L, 8L, 8L), no_ontarget1 = c(7L, 8L, 8L, 8L, 8L,
8L), no_ontarget2 = c(8L, 8L, 8L, 4L, 7L, 8L), e1 = c(1L,
0L, 0L, 0L, 0L, 0L), e2 = c(0L, 0L, 0L, 4L, 1L, 0L), rating = c(252,
252, 252, 252, 252, 252), prat = c(0.8, 0.8, 0.8, 0.8, 0.8,
0.8), ptim = c(-46.2899543378995, -9.9940511600238, -24.21875,
-7.5083426028921, -27.2836538461538, -32.2922804949912),
ptdiff = c(-47.0899543378995, -10.7940511600238, -25.01875,
-8.3083426028921, -28.0836538461538, -33.0922804949912),
pdist = c(-1.41843971631206, 4.25531914893617, -26.9565217391304,
-102.380952380952, -84.7457627118644, -161.702127659574),
pddiff = c(-2.21843971631206, 3.45531914893617, -27.7565217391304,
-103.180952380952, -85.5457627118644, -162.502127659574),
perr = c(100, NaN, NaN, -Inf, -Inf, NaN), pediff = c(99.2,
NaN, NaN, -Inf, -Inf, NaN)), .Names = c("part_no", "trial1",
"trial2", "id1", "id2", "dist1", "width1", "dist2", "width2",
"ttime1", "ttime2", "no_clicks1", "no_clicks2", "no_ontarget1",
"no_ontarget2", "e1", "e2", "rating", "prat", "ptim", "ptdiff",
"pdist", "pddiff", "perr", "pediff"), row.names = 148:153, class = "data.frame")
Thanks!

Your first attempt was close, but there is no -= operator in R, so you need to supply the subset on the right-hand side as well.
sums[sums$trial1 > 299, "trial1"] <- sums[sums$trial1 > 299, "trial1"] - 300
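If you prefer to evaluate the condition only once, you can store the logical index first, or do the whole update in one expression with ifelse(); two equivalent variants (a sketch, using the same sums data frame):
# compute the logical index once and reuse it
over <- sums$trial1 > 299
sums$trial1[over] <- sums$trial1[over] - 300
# or as a single vectorised expression
sums$trial1 <- ifelse(sums$trial1 > 299, sums$trial1 - 300, sums$trial1)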

Why not just use the modulo operator, x %% 300?
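That works here because modulo leaves values below 300 unchanged and subtracts 300 from values between 300 and 599; it only diverges from a single subtraction if a value reaches 600. A one-line sketch:
# equivalent to subtracting 300 from values above 299, assuming no value reaches 600
sums$trial1 <- sums$trial1 %% 300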

Related

How to insert new rows for missing data with intervals that could vary by a few minutes in R

I would like to insert rows when there are missing data within a 5-minute-interval glucose sensor dataset. I have managed to do this using the tsibble package, but there can be time drift in the data, e.g. the sensor records a value at 4 minutes instead of 5. This causes the inserted time stamps to become unsynchronised throughout the remainder of the data frame.
Is there a way to complete this for a time interval that should be 5 minutes, but could be between 4 and 6 minutes? The dataset also includes multiple different IDs.
The ultimate aim is then to fill in the missing data gaps based upon a set criteria (i.e. max fill <= 3 rows) using the existing data.
Reprex pasted below.
library(tsibble, warn.conflicts = FALSE)
#> Warning: package 'tsibble' was built under R version 4.1.1
Data <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L),
gl = c(125L, 133L, 132L, 130L, 133L, 135L, 166L, 161L, 67L, 66L, 67L, 69L, 67L),
time = structure(list(sec = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
min = c(42L, 47L, 51L, 56L, 6L, 11L, 11L, 16L, 2L, 17L, 22L, 27L, 32L),
hour = c(9L, 9L, 9L, 9L, 10L, 10L, 11L, 11L, 0L, 0L, 0L, 0L, 0L),
mday = c(3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L),
mon = c(3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L),
year = c(121L, 121L, 121L, 121L, 121L, 121L, 121L, 121L, 121L, 121L, 121L, 121L,121L),
wday = c(6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 0L, 0L, 0L, 0L,0L),
yday = c(92L, 92L, 92L, 92L, 92L, 92L, 92L, 92L, 93L, 93L,93L, 93L, 93L),
isdst = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,0L, 0L, 0L, 0L)),
class = c("POSIXlt", "POSIXt"), tzone = "GMT"),
dif = structure(c(NA, 5, 4, 5, 10, 5, 60, 5, NA, 15, 5, 5, 5),
units = "mins", class = "difftime")),
class = c("grouped_df", "tbl_df", "tbl", "data.frame"),
row.names = c(NA, -13L), groups = structure(list(id = 1:2, .rows = structure(list(1:8, 9:13),
ptype = integer(0), class = c("vctrs_list_of", "vctrs_vctr", "list"))),
class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -2L), .drop = TRUE))
x <- new_interval(minute = 5)
tsdata <- build_tsibble(Data, key = id, index = time, interval = x)
tsdata <- fill_gaps(tsdata, .full = FALSE)
This is probably not a final answer to what you are looking for, but it might get you started.
library(data.table)
library(zoo)
# data.table does not support POSIXlt columns, so convert time to POSIXct
# and turn the grouped tibble into a data.table first
Data$time <- as.POSIXct(Data$time)
DT <- data.table::as.data.table(Data)
# Split to a list by id
L <- split(DT, by = "id")
# Interpolate gl based on time
ans <- lapply(L, function(x) {
  # build a time series by minute spanning the observed range
  temp <- data.table::data.table(
    id = unique(x$id),
    time = seq(min(x$time), max(x$time), by = 60))
  # join in the measured data
  temp[x, gl_measured := i.gl, on = .(time)]
  # interpolate the gl values
  temp[, gl_approx := zoo::na.approx(gl_measured)]
})
# Bind the list together again
final <- data.table::rbindlist(ans)
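If you only want to fill short gaps (the "max fill <= 3 rows" criterion), zoo::na.approx() has a maxgap argument; a hedged variant of the interpolation step above:
# only interpolate runs of at most 3 consecutive missing values; longer gaps stay NA
temp[, gl_approx := zoo::na.approx(gl_measured, maxgap = 3, na.rm = FALSE)]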

How to fix the x and y axes in combination with geom_smooth()?

I am trying to make square-shaped plots with the same x and y tick marks (i.e. aspect ratio = 1).
Originally I wanted to use facet_wrap with ggplot, but from reading a number of questions here on Stack Overflow I realized this is not possible. So now I want to plot them one by one and use grid.arrange to organize the plots at the end.
BUT it is still not working for me. I can get the axes to be correct, but now the confidence interval from geom_smooth() is no longer plotted correctly.
dat <- structure(list(analyte = structure(c(2L, 8L, 9L, 5L, 6L, 4L,
1L, 7L, 10L, 3L, 9L, 10L, 7L, 7L, 10L, 10L, 10L, 10L, 6L, 6L,
10L, 6L, 4L, 6L, 7L, 4L, 2L, 10L, 10L, 4L, 2L, 6L, 6L, 8L, 10L,
1L, 1L, 3L, 8L, 2L, 1L, 10L, 7L, 6L, 3L, 3L, 7L, 7L, 6L, 6L,
9L, 5L, 9L, 7L, 6L, 7L, 8L, 7L, 5L, 7L, 5L), .Label = c("Alanine",
"Glutamic acid", "Glutamine", "Glycine", "Histidine", "Isoleucine",
"Leucine", "Phenylalanine", "Tyrosine", "Valine"), class = "factor"),
x = c(23.8, 51.5, 68.8, 83.5, 165.8, 178.6, 201.1, 387.4,
417.7, 550.1, 101.4, 103.1, 115.5, 119.9, 131.4, 156.9, 157.2,
169.9, 170.1, 174.6, 204.3, 21.8, 218.7, 22.2, 220.3, 226,
24.3, 259.3, 263.1, 301, 38.7, 39.8, 41.5, 42.4, 428.9, 431.7,
437.2, 440.1, 46.7, 47, 462.6, 470.1, 474.5, 51.3, 512.3,
516.4, 527.2, 547.3, 57.3, 58.5, 60.6, 63.9, 65.9, 69.9,
71.8, 771.9, 81.2, 82.4, 82.6, 823.5, 83.8), y = c(100L,
50L, 50L, 80L, 160L, 210L, 240L, 390L, 340L, 620L, 70L, 90L,
70L, 90L, 130L, 130L, 160L, 130L, 160L, 150L, 180L, 30L,
140L, 30L, 230L, 210L, 60L, 230L, 270L, 250L, 60L, 30L, 50L,
50L, 390L, 480L, 460L, 410L, 50L, 290L, 410L, 420L, 440L,
50L, 530L, 730L, 530L, 400L, 50L, 40L, 40L, 100L, 50L, 70L,
70L, 750L, 50L, 70L, 110L, 800L, 160L)), class = "data.frame", row.names = c(NA,
-61L))
and the plot:
my.formula <- y ~ x
p1 <- ggplot(dat[which(dat$analyte== 'Alanine'),], aes(x = x, y = y))+ geom_point()+
scale_x_continuous(limits=c(min(dat[which(dat$analyte== 'Alanine'),]$x, dat[which(dat$analyte== 'Alanine'),]$y), max(dat[which(dat$analyte== 'Alanine'),]$x,dat[which(dat$analyte== 'Alanine'),]$y))) +
scale_y_continuous(limits=c(min(dat[which(dat$analyte== 'Alanine'),]$x, dat[which(dat$analyte== 'Alanine'),]$y), max(dat[which(dat$analyte== 'Alanine'),]$x,dat[which(dat$analyte== 'Alanine'),]$y))) +
geom_smooth(method='lm') + stat_poly_eq(formula = my.formula, aes(label = paste(..eq.label.., ..rr.label.., sep = "~~~")),
parse = T, size=3)
p1
UPDATE:
So I tried to combine the suggested code with some of my own settings and I am getting closer. But it is driving me crazy: why are the confidence intervals not plotted at all in some of the plots, and plotted incorrectly in one plot (Alanine) (see the last picture)?
The updated code:
dat_split <- split(dat, dat$analyte)
plots <-
lapply(dat_split, function(df)
ggplot(df, aes(x = x, y = y)) +
geom_point() +
scale_x_continuous(expand= c(0,0), limits=c(min(as.numeric(min(df$x)-as.numeric(1/8*min(df$x))), as.numeric(min(df$y)-as.numeric(1/8*min(df$y)))), max(as.numeric(max(df$x)+as.numeric(1/8*max(df$x))), as.numeric(max(df$y)+as.numeric(1/8*max(df$y)))))) +
scale_y_continuous(expand= c(0,0), limits=c(min(as.numeric(min(df$x)-as.numeric(1/8*min(df$x))), as.numeric(min(df$y)-as.numeric(1/8*min(df$y)))), max(as.numeric(max(df$x)+as.numeric(1/8*max(df$x))), as.numeric(max(df$y)+as.numeric(1/8*max(df$y)))))) +
theme(aspect.ratio = 1) +
geom_smooth(method = 'lm', inherit.aes = T, se=T) +
ggtitle(df$analyte[1]) +
ggpmisc::stat_poly_eq(formula = my.formula,
aes(label = paste(..eq.label.., ..rr.label.., sep = "~~~")),
parse = TRUE, size=3))
gridExtra::grid.arrange(grobs = plots)
This seems to do roughly what you're looking for. For some of the analyte factors, the x and y ranges are considerably different, so I'm not sure you really want to show them all with identical axes.
dat_split <- split(dat, dat$analyte)
plots <-
lapply(dat_split, function(df)
ggplot(df, aes(x = x, y = y)) +
geom_point() +
coord_equal() +
geom_smooth(method = 'lm', inherit.aes = T) +
ggtitle(df$analyte[1]) +
ggpmisc::stat_poly_eq(formula = my.formula,
aes(label = paste(..eq.label.., ..rr.label.., sep = "~~~")),
parse = T, size=3))
gridExtra::grid.arrange(grobs = plots)
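The likely reason the ribbon disappears in the updated code is that limits set via scale_x_continuous()/scale_y_continuous() censor any values outside them (including the edges of the confidence band) rather than just zooming. If you do want identical square limits per panel, a hedged variant of the lapply() above sets the limits in the coordinate system instead, so no data is dropped before the model is fitted:
plots <- lapply(dat_split, function(df) {
  # coord limits zoom the panel without dropping data, so geom_smooth()
  # fits its model and draws its ribbon from all of the points
  rng <- range(c(df$x, df$y))
  ggplot(df, aes(x = x, y = y)) +
    geom_point() +
    geom_smooth(method = 'lm') +
    coord_equal(xlim = rng, ylim = rng) +
    ggtitle(df$analyte[1])
})
gridExtra::grid.arrange(grobs = plots)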

Scatter plot with small pie charts with R

I have the data below, called test1.melted, and the code I use to plot it with the scatterpie package. Because of an inherent limitation of scatterpie (it expects Cartesian coordinates, i.e. equal horizontal and vertical distances), the plot does not come out properly formatted. Is there a better way to plot this data without using scatterpie?
Data:
test1.melted<-structure(list(Wet_lab_dilution_A = structure(c(1L, 2L, 3L, 4L,
5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L, 5L, 6L, 7L,
8L, 9L, 10L, 11L, 12L), .Label = c("A", "B", "C", "D", "E", "F",
"G", "H", "I", "J", "K", "L"), class = "factor"), TypeA = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("I", "II"), class = "factor"),
NA12878 = c(100L, 50L, 25L, 20L, 10L, 0L, 100L, 50L, 25L,
20L, 10L, 0L, 100L, 50L, 25L, 20L, 10L, 0L, 100L, 50L, 25L,
20L, 10L, 0L), NA12877 = c(0L, 50L, 75L, 80L, 90L, 100L,
0L, 50L, 75L, 80L, 90L, 100L, 0L, 50L, 75L, 80L, 90L, 100L,
0L, 50L, 75L, 80L, 90L, 100L), IBD = c(1.02, 0.619, 0.294,
0.244, 0.134, 0.003, 0.003, 0.697, 0.964, 0.978, 1, 1, 1.02,
0.619, 0.294, 0.244, 0.134, 0.003, 0.003, 0.697, 0.964, 0.978,
1, 1), variableA = c("tEst", "tEst", "tEst", "tEst", "tEst",
"tEst", "tEst", "tEst", "tEst", "tEst", "tEst", "tEst", "pair",
"pair", "pair", "pair", "pair", "pair", "pair", "pair", "pair",
"pair", "pair", "pair"), valueA = c(0.1, 59.8, 84.6, 89.2,
97.4, 100, 99.6, 56.4, 29.9, 24, 12.1, 0.1, 0.1, 51.08, 75.28,
80.09, 90.16, 100, 100, 48.09, 23.97, 18.81, 9.24, 0.08)), row.names = c(NA,
-24L), .Names = c("Wet_lab_dilution_A", "TypeA", "NA12878", "NA12877",
"IBD", "variableA", "valueA"), class = "data.frame")
code:
p<- ggplot() + geom_scatterpie(aes(x=valueA, y=IBD, group=TypeA), data=test1.melted,
cols=c("NA12878", "NA12877")) + coord_equal()+
facet_grid(TypeA~variableA)
p
Do you have to use a pie chart? (And you might; there's nothing wrong with them.)
Because something like this could illustrate literally every variable in the dataset:
library(ggplot2)
test1.melted$NA12877 <- as.factor(test1.melted$NA12877)
test1.melted$NA12878 <- as.factor(test1.melted$NA12878)
p <- ggplot(data = test1.melted, aes(x=valueA, y=IBD, group=TypeA))
p <- p + geom_point(aes(colour=NA12877, fill = NA12878), stroke=3, size = 3, shape = 21)
p <- p + geom_text(aes(label = Wet_lab_dilution_A), size = 2)
p + facet_grid(TypeA ~ variableA) + theme_minimal()
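If you do want to keep the pies, one possible workaround (a sketch, not tested, and using the original test1.melted before the factor conversion above) is to put both axes on a comparable scale so coord_equal() no longer squashes them, and then relabel the y axis:
library(ggplot2)
library(scatterpie)
# rescale IBD from 0-1 to 0-100 so x and y have similar ranges, then relabel y
test1.melted$IBD100 <- test1.melted$IBD * 100
ggplot() +
  geom_scatterpie(aes(x = valueA, y = IBD100, group = TypeA),
                  data = test1.melted, cols = c("NA12878", "NA12877")) +
  coord_equal() +
  scale_y_continuous(labels = function(b) b / 100) +
  facet_grid(TypeA ~ variableA)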

Adding breaks to count (y axis) of a histogram according to the count min-max range in R?

I have a ggplot histogram plot.
On the x axis I have a factor variable (1,2,3,4,..)
On the y axis I have count.
I want my y axis to be from minimum count to maximum count, by 1.
I am playing with scale_y_discrete, but I can't work out how to take min(count) and max(count) and generate breaks by 1.
Please advise.
df <- structure(list(user_id = c(1L, 1L, 3L, 3L, 4L, 4L, 4L, 6L, 8L,
8L, 8L, 8L, 8L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L), obs_id = c(1L,
30L, 133L, 134L, 144L, 160L, 162L, 226L, 272L, 273L, 274L, 275L,
276L, 299L, 307L, 322L, 323L, 324L, 325L, 326L, 327L, 328L),
n = c(6L, 6L, 10L, 6L, 11L, 11L, 12L, 6L, 3L, 2L, 5L, 2L,
3L, 5L, 12L, 11L, 25L, 7L, 5L, 2L, 5L, 17L)), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -22L), vars = "user_id", drop = TRUE, .Names = c("user_id",
"obs_id", "n"), indices = list(0:1, 2:3, 4:6, 7L, 8:12, 13:21), group_sizes = c(2L,
2L, 3L, 1L, 5L, 9L), biggest_group_size = 9L, labels = structure(list(
user_id = c(1L, 3L, 4L, 6L, 8L, 9L)), class = "data.frame", row.names = c(NA,
-6L), vars = "user_id", drop = TRUE, .Names = "user_id"))
You can make a function for breaks that takes the limits of axis as the argument.
From the documentation of scale_continuous, breaks can take:
A function that takes the limits as input and returns breaks as output
Here is an example, where I go from 0 to the maximum y axis limit by 1. (I use 0 instead of the minimum count because histograms start at 0.)
The x in the function is the limits of the axis in the plot as calculated by ggplot() or as set by the user.
byone = function(x) {
  seq(0, max(x), by = 1)
}
You can pass this function to breaks in scale_y_continuous(). The limits are pulled directly from the plot and passed as the first argument of the function.
ggplot(df, aes(user_id)) +
geom_histogram() +
scale_y_continuous(breaks = byone)
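If you would rather not write the helper yourself, the scales package ships an equivalent break-generating function; an alternative sketch (assuming scales >= 1.1.0):
# breaks_width(1) generates breaks spaced 1 apart across the axis limits
ggplot(df, aes(user_id)) +
  geom_histogram() +
  scale_y_continuous(breaks = scales::breaks_width(1))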

Iteration over all variables in a dataframe

I have found a useful mean imputation technique here.
More specifically:
variable[is.na(variable)] <- rowMeans(cbind(variable[which(is.na(variable))-1],
variable[which(is.na(variable))+1]))
This takes the values before and after the missing one and imputes their mean.
However, since I have a large data frame with many variables, I was wondering: is there a way to apply this function to every variable (column) in the df?
dput:
dput(head(politbar_timeseries,10))
structure(list(Month = structure(c(8401, 8432, 8460, 8491, 8521,
8552, 8582, 8613, 8644, 8674), class = "Date"), Intention_CDU = c(246L,
223L, 222L, 232L, 261L, 240L, 241L, NA, 234L, 211L), Intention_SPD = c(304L,
323L, 276L, 274L, 238L, 290L, 291L, NA, 284L, 296L), Intention_FDP = c(47L,
44L, 46L, 36L, 35L, 50L, 31L, NA, 33L, 40L), Intention_Green = c(112L,
90L, 108L, 97L, 92L, 93L, 80L, NA, 131L, 97L), Intention_PDS = c(1L,
2L, 1L, 4L, 2L, 4L, 6L, NA, 3L, 1L), Intention_Right = c(40L,
45L, 51L, 44L, 48L, 26L, 30L, NA, 33L, 39L), CDU_CSU_Scale = c(5.53364976051333,
5.41668954145634, 5.41361737597252, 5.53237142973321, 5.90556125077522,
5.65325991093138, 5.66581907651607, NA, 5.7568395653053, 5.56722081960557
), SPD_Scale = c(6.68501038883942, 7.0740019675866, 6.31415136355633,
6.52447895467401, 6.29176231355408, 6.52870415235848, 6.73302006301497,
NA, 7.12547563426403, 7.17833309669175), FDP_Scale = c(5.34570000100596,
5.73343004031828, 5.52174547729524, 5.39618098094715, 5.81980921102384,
5.64326882828348, 5.70136552543044, NA, 5.3836387964029, 5.73726720856055
), Grüne_Scale = c(5.73191750379599, 6.03715643205545, 6.19893648691653,
5.96106479727169, 5.78436018957346, 5.54482751153172, 5.6213169156508,
NA, 6.42776109093573, 6.33016932291559), Republikaner_Scale = c(2.33415238404679,
2.40200426439232, 2.50591428720572, 2.45599753445912, 2.61170073660812,
2.26120872300811, 2.24409536048212, NA, 2.29699201198203, 2.25876734042663
), PDS_Scale = c(NaN, NaN, NaN, NaN, NaN, NaN, NaN, NA, NaN,
NaN)), .Names = c("Month", "Intention_CDU", "Intention_SPD",
"Intention_FDP", "Intention_Green", "Intention_PDS", "Intention_Right",
"CDU_CSU_Scale", "SPD_Scale", "FDP_Scale", "Grüne_Scale", "Republikaner_Scale",
"PDS_Scale"), row.names = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 249L,
8L, 9L), class = "data.frame")
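One way to apply the technique above across columns is to wrap it in a function and lapply() over the numeric columns. A minimal sketch, assuming every NA has a non-missing value directly before and after it (note that this does not hold for PDS_Scale above, which is entirely NaN):
# wrap the neighbour-mean imputation and apply it to every numeric column
impute_neighbours <- function(x) {
  idx <- which(is.na(x))
  x[idx] <- rowMeans(cbind(x[idx - 1], x[idx + 1]))
  x
}
num_cols <- sapply(politbar_timeseries, is.numeric)
politbar_timeseries[num_cols] <- lapply(politbar_timeseries[num_cols], impute_neighbours)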
