I have the following code:
Ni <- 133 # number of individuals
MXmeas <- 10 # number of measurements
# simulate number of observations for each individual
Nmeas <- round(runif(Ni, 1, MXmeas))
# simulate observation moments (under the assumption that everybody has at least one observation)
obs <- unlist(sapply(Nmeas, function(x) c(1, sort(sample(2:MXmeas, x-1, replace = FALSE)))))
# set up dataframe (id, observations)
dat <- data.frame(ID = rep(1:Ni, times = Nmeas), observations = obs)
This results in the following output:
ID observations
1 1
1 3
1 4
1 5
1 6
1 8
However, I also want a variable 'times' that numbers the measurements within each individual (1 for the first, 2 for the second, and so on). Since every ID has a different number of rows, I am not sure how to implement this. Does anybody know how to do that? I want it to look like this:
ID observations times
1 1 1
1 3 2
1 4 3
1 5 4
1 6 5
1 8 6
Using dplyr you could group by ID and use the row number for times:
library(dplyr)
dat |>
group_by(ID) |>
mutate(times = row_number()) |>
ungroup()
With base R we could build the sequence from the run lengths of the ID variable:
dat$times <- sequence(rle(dat$ID)$lengths)
Output:
# A tibble: 734 × 3
ID observations times
<int> <dbl> <int>
1 1 1 1
2 1 3 2
3 1 9 3
4 2 1 1
5 2 5 2
6 2 6 3
7 2 8 4
8 3 1 1
9 3 2 2
10 3 5 3
# … with 724 more rows
Data (using a seed):
set.seed(1)
Ni <- 133 # number of individuals
MXmeas <- 10 # number of measurements
# simulate number of observations for each individual
Nmeas <- round(runif(Ni, 1, MXmeas))
# simulate observation moments (under the assumption that everybody has at least one observation)
obs <- unlist(sapply(Nmeas, function(x) c(1, sort(sample(2:MXmeas, x-1, replace = FALSE)))))
# set up dataframe (id, observations)
dat <- data.frame(ID = rep(1:Ni, times = Nmeas), observations = obs)
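The rle() approach above assumes that all rows of an ID sit next to each other, which holds for the simulated data. As a minimal sketch for the case where that is not guaranteed, ave() can number the rows within each ID while keeping the original row order. The toy data frame below is a made-up example, not the data from the question.

```r
# hypothetical toy data: the rows of each ID are deliberately not adjacent
toy <- data.frame(ID = c(1, 2, 1, 2, 2),
                  observations = c(1, 1, 3, 4, 7))

# ave() runs seq_along() separately within each ID and puts the
# results back in the original row positions
toy$times <- ave(seq_along(toy$ID), toy$ID, FUN = seq_along)

toy
```

Here ID 1 is numbered 1, 2 and ID 2 is numbered 1, 2, 3, regardless of how the rows are interleaved.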
Related
I have a dataset with a number of cases. Every case has two observations. The first observation for case number 1 has value 3 and the second observation has value 7. The two observations for case number 2 have missing values. I need to write code that fills the empty cells with the values from case number 1, so that the first row for case 2 gets the same value as case 1 for obs = 1 and the second row gets the same value as case 1 for obs = 2. Of course, this is a very short version of a much bigger dataset, so I need something flexible enough to accommodate a couple of hundred cases, where the values to use as fillers change for every subject.
Here is a toy data set:
# toy dataset
df <- data.frame(
case = c(1, 1, 2, 2),
obs = c(1, 2, NA, NA),
value = c(3, 7, NA, NA)
)
# case obs value
# 1 1 1 3
# 2 1 2 7
# 3 2 NA NA
# 4 2 NA NA
#Desired output:
case obs value
1 1 1 3
2 1 2 7
3 2 1 3
4 2 2 7
We may use fill() from tidyr, grouping on the row sequence within each case (rowid() from data.table):
library(dplyr)
library(data.table)
library(tidyr)
df %>%
group_by(grp = rowid(case)) %>%
fill(obs, value) %>%
ungroup %>%
select(-grp)
Output:
# A tibble: 4 × 3
case obs value
<dbl> <dbl> <dbl>
1 1 1 3
2 1 2 7
3 2 1 3
4 2 2 7
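To make the grouping idea more concrete, here is a rough base R sketch of the same logic: compute each row's position within its case, then carry the last non-missing value forward within each position. The helper fill_down() is a hypothetical name introduced for illustration, and this is not how tidyr implements fill().

```r
df <- data.frame(
  case  = c(1, 1, 2, 2),
  obs   = c(1, 2, NA, NA),
  value = c(3, 7, NA, NA)
)

# position of each row within its case: 1, 2, 1, 2
pos <- ave(seq_along(df$case), df$case, FUN = seq_along)

# hypothetical helper: carry the last non-NA value forward
fill_down <- function(v) {
  idx <- cummax(seq_along(v) * !is.na(v))  # index of the latest non-NA so far
  idx[idx == 0] <- NA                      # nothing to carry forward yet
  v[idx]
}

# fill within each position group, so row 1 of case 2 copies row 1 of case 1
df$obs   <- ave(df$obs,   pos, FUN = fill_down)
df$value <- ave(df$value, pos, FUN = fill_down)

df
```

Because the fill runs downward within each position, a later complete case would become the source for the incomplete cases that follow it.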
I'm trying to randomize the order in which 40 participants receive 6 drinks (each on a different day). I want to ensure that every participant gets each drink once, and that every drink has roughly the same number of occurrences across participants on each day.
I create the data, with participants in columns and days in rows.
library(ggplot2)
set.seed(123)
random_order <- as.data.frame(replicate(40, sample(1:6, 6,
replace = FALSE)))
random_order$trial <- c(1:6)
random_order
Then I check the number of occurrences of each drink within each row / trial, which shows that the frequency of different drinks within trials is not uniform:
tidyr::pivot_longer(random_order, cols = c(1:40),
names_to = "participant", values_to = "drink_order") |>
dplyr::group_by(trial, drink_order) |>
dplyr::summarise(count = dplyr::n())
# # A tibble: 36 × 3
# # Groups: trial [6]
# trial drink_order count
# <int> <int> <int>
# 1 1 1 9
# 2 1 2 8
# 3 1 3 8
# 4 1 4 4
# 5 1 5 5
# 6 1 6 6
# 7 2 1 7
# 8 2 2 4
# 9 2 3 10
# 10 2 4 7
# # … with 26 more rows
and look at it with a density plot:
tidyr::pivot_longer(random_order, cols = c(1:40),
names_to = "participant", values_to = "drink_order") |>
dplyr::group_by(trial, drink_order) |>
dplyr::summarise(count = dplyr::n()) |>
ggplot(aes(count)) +
geom_density()
Basically, I want a very narrow distribution of counts. How can I make sure the count column above has a small range when creating the data?
Thanks!
You’re looking for a variation on a Latin square, which is a set of ordered elements such that each element occurs exactly once per column and once per row. You can generate random Latin squares using agricolae::design.lsd(). In your case, instead of once per row, you want each element to occur the same number of times per row, which you can do by binding together multiple Latin squares.
library(agricolae)
set.seed(123)
# to get 40 columns, first get 7 Latin squares
# (7 squares x 6 columns per square = 42 columns)
orders <- replicate(
7,
design.lsd(1:6)$sketch,
simplify = FALSE
)
# then column-bind and subset to 40 columns
random_order <- data.frame(do.call(cbind, orders))[, 1:40]
random_order$trial <- c(1:6)
Using the code from your question, we can see that all trials include 6 or 7 of each drink:
# A tibble: 36 × 3
# Groups: trial [6]
trial drink_order count
<int> <chr> <int>
1 1 1 7
2 1 2 7
3 1 3 7
4 1 4 6
5 1 5 6
6 1 6 7
7 2 1 7
8 2 2 6
9 2 3 6
10 2 4 7
# … with 26 more rows
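For intuition about why binding Latin squares balances the counts, here is a small base R sketch that builds squares by hand. It uses a cyclic square with shuffled rows and columns, which is an assumption made for illustration and not how agricolae::design.lsd() generates its designs.

```r
set.seed(1)
n <- 6

# cyclic Latin square: entry (i, j) is ((i + j - 2) mod n) + 1
base_sq <- outer(1:n, 1:n, function(i, j) (i + j - 2) %% n + 1)

# permuting rows and columns keeps the Latin property
shuffled_square <- function() base_sq[sample(n), sample(n)]

# bind 7 squares (42 columns) and keep the first 40 participants
orders <- do.call(cbind, replicate(7, shuffled_square(), simplify = FALSE))[, 1:40]

# every participant (column) still gets each drink exactly once
cols_ok <- all(apply(orders, 2, function(cl) setequal(cl, 1:n)))

# counts of each drink per trial (row): 6 from the complete squares,
# plus at most 1 from the truncated seventh square
counts <- apply(orders, 1, function(r) table(factor(r, levels = 1:n)))
range(counts)
```

Sampling from a cyclic square covers fewer possible designs than a general random Latin square, so this is only meant to show the counting argument.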
I am trying to rearrange a dataset with a few thousand observations (to eventually use the drm function in the drc package), and I am tired of doing it in Excel. Within a dataframe I am looking to add "start" and "end" times (up to Inf) based on the intervals found in a vector within the df. This means I would have to add an observation (row) in which the last "end" time is Inf. For that last row (the one with Inf) I ALSO need to subtract the total of "value" from an arbitrary number (in my example below this would be 50). All of this should be grouped by two variables ("Name" and "Rep" in my example). I am hoping there is a solution using group_by, but honestly I'll be overjoyed at any solution!
I have a data set that looks like this;
# data
names<-c(rep("Luke",30), rep("Han", 30), rep("Leia", 30), rep("OB1", 30))
reps<-c(rep("A", 10), rep("B", 10), rep("C", 10))
time<-rep(seq(1:10), 4)
value<-rep(sample(0:5,10,replace=T), 4)
df<-data.frame(names, reps, time, value)
but need it to look like this;
Example of the data structure I need.
I'm at a loss. Please help!
If I have understood you correctly, we can do
library(dplyr)
df1 <- df %>%
group_by(names, reps) %>%
mutate(start = lag(time, default = 0),
end = time)
bind_rows(df1, df1 %>%
group_by(names, reps) %>%
summarise(start = last(time),
end = Inf,
value = 50 - sum(value))) %>% # 50 is the arbitrary total from the question
select(-time) %>%
arrange(names, reps)
# names reps value start end
# <fct> <fct> <dbl> <dbl> <dbl>
# 1 Han A 2 0 1
# 2 Han A 2 1 2
# 3 Han A 1 2 3
# 4 Han A 1 3 4
# 5 Han A 3 4 5
# 6 Han A 2 5 6
# 7 Han A 0 6 7
# 8 Han A 2 7 8
# 9 Han A 2 8 9
#10 Han A 5 9 10
#11 Han A 30 10 Inf
#.....
We can also do this in data.table. After grouping by 'names' and 'reps', shift 'time' (filling with 0) to create 'start', append Inf to 'time' to create 'end', and append 50 minus the sum of 'value' to 'value':
library(data.table)
setDT(df)[, {stL <- last(time)
enL <- Inf
vL <- 50- sum(value)
.(start = c(shift(time, fill = 0), stL),
end = c(time, enL),
value = c(value, vL))}, .(names, reps)]
# names reps start end value
# 1: Luke A 0 1 0
# 2: Luke A 1 2 3
# 3: Luke A 2 3 3
# 4: Luke A 3 4 4
# 5: Luke A 4 5 0
# ---
#128: OB1 C 6 7 3
#129: OB1 C 7 8 0
#130: OB1 C 8 9 2
#131: OB1 C 9 10 5
#132: OB1 C 10 Inf 27
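To see the per-group pattern in isolation, here is a minimal base R sketch for a single group. The times and values are made up, and the total of 50 is the arbitrary number from the question.

```r
# made-up measurements for one names/reps group
time  <- 1:3
value <- c(2, 0, 4)
total <- 50  # arbitrary number from the question

one_group <- data.frame(
  start = c(0, time),                    # previous time, starting at 0
  end   = c(time, Inf),                  # current time, then an open-ended interval
  value = c(value, total - sum(value))   # remainder goes in the final row
)

one_group
```

The final row runs from the last observed time to Inf and holds 50 - 6 = 44, which is what each group's extra row contains in the grouped versions.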
I have a dataframe like this;
df <- data.frame(concentration=c(0,0,0,0,2,2,2,2,4,4,6,6,6),
result=c(0,0,0,0,0,0,1,0,1,0,1,1,1))
I want to count the total number of results for each concentration level.
I want to count the number of positive samples for each concentration level.
And I want to create a new dataframe with concentration level, total results, and number positives.
conc pos_c total_c
0 0 4
2 1 4
4 1 2
6 3 3
This is what I've come up with so far using plyr;
c <- count(df, "concentration")
r <- count(df, "concentration","result")
names(c)[which(names(c) == "freq")] <- "total_c"
names(r)[which(names(r) == "freq")] <- "pos_c"
cbind(c,r)
concentration total_c concentration pos_c
1 0 4 0 0
2 2 4 2 1
3 4 2 4 1
4 6 3 6 3
The concentration column is repeated. I think there is probably a much better/easier way to do this that I'm missing, maybe with another library. I'm relatively new to R and not sure how to do this. Thanks.
We need a grouped sum. Using tidyverse, we group by 'concentration' (group_by), then summarise to get the two columns: 1) the sum of the logical expression (result > 0), 2) the number of rows (n()).
library(dplyr)
df %>%
group_by(conc = concentration) %>%
summarise(pos_c = sum(result > 0), # in the example just sum(result)
total_c = n())
# A tibble: 4 x 3
# conc pos_c total_c
# <dbl> <int> <int>
#1 0 0 4
#2 2 1 4
#3 4 1 2
#4 6 3 3
Or using base R with table and addmargins
addmargins(table(df), 2)[,-1]
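If a plain data frame with the column names from the question is preferred over a table object, one possible base R sketch uses tapply(). The object name out is just an illustrative choice.

```r
df <- data.frame(concentration = c(0,0,0,0,2,2,2,2,4,4,6,6,6),
                 result        = c(0,0,0,0,0,0,1,0,1,0,1,1,1))

# number of positive results and number of rows per concentration level
pos_c   <- tapply(df$result > 0, df$concentration, sum)
total_c <- tapply(df$result, df$concentration, length)

out <- data.frame(conc    = as.numeric(names(total_c)),
                  pos_c   = as.vector(pos_c),
                  total_c = as.vector(total_c))
out
```

Both tapply() calls group by the same variable, so their results line up level by level.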
I am having trouble working out how to recover individual values from a running mean in an R dataframe.
I have an R dataframe:
x ID Mean
1 1 1
1 2 5
2 1 3
2 2 6
Where the mean is the mean for the x measurements for the specific ID in the dataframe.
To find the individual values at each x value rather than the mean, I was thinking that I needed to apply a recursive function on the dataframe and group by the ID. How could I do this in a dataframe while grouping by one of the values when any apply function wouldn't have access to the previous entry in the dataframe?
When completed and appended to the dataframe, I am hoping it to look like this:
x ID Mean IndivValues
1 1 1 1
1 2 5 5
2 1 3 5
2 2 6 7
It's much easier to go from running totals to individual observations: multiply each running mean by x to get the running total, then subtract the previous total within each ID:
Example data.frame:
df <- read.table(text='
x ID Mean
1 1 1
1 2 5
2 1 3
2 2 6
', header=T)
Solution:
library(dplyr); library(magrittr)
df %>%
group_by(ID) %>% # column names are case-sensitive: ID and Mean, not id and mean
mutate(
total = Mean * x,
ind_value = total - lag(total, default=0) )
## A tibble: 4 x 5
## Groups: ID [2]
# x ID Mean total ind_value
# <int> <int> <int> <int> <int>
#1 1 1 1 1 1
#2 1 2 5 5 5
#3 2 1 3 6 5
#4 2 2 6 12 7
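As a quick sanity check of the idea, here is a base R round trip on made-up values for a single ID: compute the running mean, then recover the original values from it. It assumes x counts measurements 1, 2, 3, ... without gaps.

```r
vals <- c(1, 5, 3, 10)        # made-up individual values
n    <- seq_along(vals)       # measurement index (x in the question)

run_mean <- cumsum(vals) / n  # running mean after each measurement

# running total = running mean * n; consecutive differences give the values back
recovered <- diff(c(0, run_mean * n))

all.equal(recovered, vals)
```

The same identity, value_n = n * mean_n - (n - 1) * mean_(n-1), is what the grouped lag() computes for each ID.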