I have a data frame d and I'd like to add a VALUE_GROUP column that looks at the value field and returns the upper limit of the bucket the value falls into
Value Value_group
0<=value<5 5
5<=value<10 10
10<=value<15 15
15<=value<20 20
You can see the Value_group is the max possible value in the bucket i.e. for value between 0 and 5 Value_group = 5
d =data.frame(group = rep("A",20),value = seq(1,20,1))
d
d$Value_Group = ??
Value_group can be added using multiple ifelse() statements but is there a better way?
The result would be:
group value Value_Group
1 A 1 5
2 A 2 5
3 A 3 5
4 A 4 5
5 A 5 5
6 A 6 10
7 A 7 10
8 A 8 10
9 A 9 10
10 A 10 10
11 A 11 15
12 A 12 15
13 A 13 15
14 A 14 15
15 A 15 15
16 A 16 20
17 A 17 20
18 A 18 20
19 A 19 20
20 A 20 20
Thank you.
Related
I want to use conditional statement to consecutive values in the sliding manner.
For example, I have dataset like this;
data <- data.frame(ID = rep.int(c("A","B"), times = c(24, 12)),
+ time = c(1:24,1:12),
+ visit = as.integer(runif(36, min = 0, max = 20)))
and I got table below;
> data
ID time visit
1 A 1 7
2 A 2 0
3 A 3 6
4 A 4 6
5 A 5 3
6 A 6 8
7 A 7 4
8 A 8 10
9 A 9 18
10 A 10 6
11 A 11 1
12 A 12 13
13 A 13 7
14 A 14 1
15 A 15 6
16 A 16 1
17 A 17 11
18 A 18 8
19 A 19 16
20 A 20 14
21 A 21 15
22 A 22 19
23 A 23 5
24 A 24 13
25 B 1 6
26 B 2 6
27 B 3 16
28 B 4 4
29 B 5 19
30 B 6 5
31 B 7 17
32 B 8 6
33 B 9 10
34 B 10 1
35 B 11 13
36 B 12 15
I want to flag each ID by continuous values of "visit".
If the number of "visit" continued less than 10 for 6 times consecutively, I'd attach "empty", and "busy" otherwise.
In the data above, "A" is continuously below 10 from rows 1 to 6, then "empty". On the other hand, "B" doesn't have 6 consecutive one digit, then "busy".
I want to apply the condition to next segment of 6 values if the condition weren't fulfilled in the previous segment.
I'd like achieve this using R. Any advice will be appreciated.
Say I have a vector named all_combinations with numbers from 1 to 20.
I need to extract 2 vectors (coding_1 and coding_2) of length equal to number_of_peptide_clusters, which happens to be 20 as well in my current case.
The 2 new vectors should be randomly sampled from all_combinations, so that are not overlapping at each index position.
I do the following:
set.seed(3)
all_combinations=1:20
number_of_peptide_clusters=20
coding_1 <- sample(all_combinations, number_of_peptide_clusters, replace = FALSE)
coding_1
[1] 5 12 7 4 10 8 11 15 17 16 18 13 9 20 2 14 19 1 3 6
coding_2 <- sample(all_combinations, number_of_peptide_clusters, replace = FALSE)
coding_2
[1] 5 9 19 16 18 12 8 6 15 3 13 14 7 2 11 20 10 4 17 1
This is the example that gives me trouble, cause only one number is overlapping at the same index (5 at position 1).
What I would do in these cases is spot the overlapping numbers and resample them out of the list of all overlapping numbers...
Imagine coding_1 and coding_2 were:
coding_1
[1] 5 9 7 4 10 8 11 15 17 16 18 13 12 20 2 14 19 1 3 6
coding_2
[1] 5 9 19 16 18 12 8 6 15 3 13 14 7 2 11 20 10 4 17 1
In this case I would have 5 and 9 overlapping in the same position, so I would resample them in coding_2 out of the full list of overlapping ones [resample index 1 from c(5,9) so that isn't equal to 5, and index 2 so it isn't equal to 9]. So coding_2 would be:
coding_2
[1] 9 5 19 16 18 12 8 6 15 3 13 14 7 2 11 20 10 4 17 1
However, in the particular case above, I cannot use such approach... So what would be the best way to obtain 2 samples of length 20 from a vector of length 20 as well, so that the samples aren't overlapping at the same index positions?
It would be great that I could obtain the second sample coding_2 already knowing coding_1... Otherwise obtaining the 2 at the same time would also be acceptable if it makes things easier. Thanks!
I think the best solution is simply to use a rejection strategy:
set.seed(3)
all_combinations <- 1:20
number_of_peptide_clusters <- 20
count <- 0
repeat {
count <- count + 1
message("Try number ", count)
coding_1 <- sample(all_combinations, number_of_peptide_clusters, replace = FALSE)
coding_2 <- sample(all_combinations, number_of_peptide_clusters, replace = FALSE)
if (!any(coding_1 == coding_2))
break
}
#> Try number 1
#> Try number 2
#> Try number 3
#> Try number 4
#> Try number 5
#> Try number 6
#> Try number 7
#> Try number 8
#> Try number 9
coding_1
#> [1] 18 16 17 12 13 8 6 15 3 5 20 9 11 4 19 2 14 7 1 10
coding_2
#> [1] 5 20 14 2 11 6 7 10 19 8 4 1 15 9 13 17 18 16 12 3
Created on 2020-11-04 by the reprex package (v0.3.0)
The following randomly splits a data frame into halves.
df <- read.csv("https://raw.githubusercontent.com/HirokiYamamoto2531/data/master/data.csv")
head(df, 3)
# dv iv subject item
#1 562 -0.5 1 7
#2 790 0.5 1 21
#3 NA -0.5 1 19
r <- seq_len(nrow(df))
first <- sample(r, 240)
second <- r[!r %in% first]
df_1 <- df[first, ]
df_2 <- df[second, ]
However, in this way, each data frame (df_1 and df_2) is not balanced on subject and item: e.g.,
table(df_1$subject)
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
# 7 8 3 5 5 3 8 1 5 7 7 6 7 7 9 8 8 9 6 7 8 5 4 4 5 2 7 6 9
# 30 31 32 33 34 35 36 37 38 39 40
# 7 5 7 7 7 3 5 7 5 3 8
table(df_1$item)
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
# 12 11 12 12 9 11 11 8 11 12 10 8 14 7 14 10 8 7 9 9 7 11 9 8
# There are 40 subjects and 24 items, and each subject is assigned to 12 items and each item to 20 subjects.
I would like to know how to split the data frame into halves that are balanced on subject and item (i.e., exactly 6 data points from each subject and 10 data points from each item).
You can use the createDataPartition function from the caret package to create a balanced partition of one variable.
The code below creates a balanced partition of the dataset according to the variable subject:
df <- read.csv("https://raw.githubusercontent.com/HirokiYamamoto2531/data/master/data.csv")
partition <- caret::createDataPartition(df$subject, p = 0.5, list = FALSE)
first.half <- df[partition, ]
second.half <- df[-partition, ]
table(first.half$subject)
table(second.half$subject)
I'm not sure whether it's possible to balance two variables at once. You can try balancing for one variable and checking if you're happy with the partition of the second variable.
I have a dataset consisting of two variables, Contents and Time like so:
Time Contents
2017M01 123
2017M02 456
2017M03 789
. .
. .
. .
2018M12 789
Now I want to create a numeric vector that aggregates Contents for six months, that is I want to sum 2017M01 to 2017M06 to one number, 2017M07 to 2017M12 to another number and so on.
I'm able to do this by indexing but I want to be able to write: "From 2017M01 to 2017M06 sum contents corresponding to that sequence" in my code.
I would really appreciate some help!
You can create a grouping variable based on the number of rows and number of elements to group. For your case, you want to group every 6 rows so your data frame should be divisible with 6. Using iris to demonstrate (It has 150 rows, so 150 / 6 = 25)
rep(seq(nrow(iris)%/%6), each = 6)
#[1] 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4 5 5 5 5 5 5 6 6 6 6 6 6 7 7 7 7 7 7 8 8 8 8 8 8 9 9 9 9 9 9 10 10 10 10
#[59] 10 10 11 11 11 11 11 11 12 12 12 12 12 12 13 13 13 13 13 13 14 14 14 14 14 14 15 15 15 15 15 15 16 16 16 16 16 16 17 17 17 17 17 17 18 18 18 18 18 18 19 19 19 19 19 19 20 20
#[117] 20 20 20 20 21 21 21 21 21 21 22 22 22 22 22 22 23 23 23 23 23 23 24 24 24 24 24 24 25 25 25 25 25 25
There are plenty of ways to handle how you want to call it. Here is a custom function that allows you to do that (i.e. create the grouping variable),
f1 <- function(x, df) {
v1 <- as.numeric(gsub('[0-9]{4}M(.*):[0-9]{4}M(.*)$', '\\1', x))
v2 <- as.numeric(gsub('[0-9]{4}M(.*):[0-9]{4}M(.*)$', '\\2', x))
i1 <- (v2 - v1) + 1
return(rep(seq(nrow(df)%/%i1), each = i1))
}
f1("2017M01:2017M06", iris)
#[1] 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4 5 5 5 5 5 5 6 6 6 6 6 6 7 7 7 7 7 7 8 8 8 8 8 8 9 9 9 9 9 9 10 10 10 10
#[59] 10 10 11 11 11 11 11 11 12 12 12 12 12 12 13 13 13 13 13 13 14 14 14 14 14 14 15 15 15 15 15 15 16 16 16 16 16 16 17 17 17 17 17 17 18 18 18 18 18 18 19 19 19 19 19 19 20 20
#[117] 20 20 20 20 21 21 21 21 21 21 22 22 22 22 22 22 23 23 23 23 23 23 24 24 24 24 24 24 25 25 25 25 25 25
EDIT: We can easily make the function compatible with 'non-0-remainder' divisions by concatenating the final result with a repetition of the max+1 value of the final result of remainder times, i.e.
f1 <- function(x, df) {
v1 <- as.numeric(gsub('[0-9]{4}M(.*):[0-9]{4}M(.*)$', '\\1', x))
v2 <- as.numeric(gsub('[0-9]{4}M(.*):[0-9]{4}M(.*)$', '\\2', x))
i1 <- (v2 - v1) + 1
final_v <- rep(seq(nrow(df) %/% i1), each = i1)
if (nrow(df) %% i1 == 0) {
return(final_v)
} else {
remainder = nrow(df) %% i1
final_v1 <- c(final_v, rep((max(final_v) + 1), remainder))
return(final_v1)
}
}
So for a data frame with 20 rows, doing groups of 6, the above function will yield the result:
f1("2017M01:2017M06", df)
#[1] 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 4 4
I have a dataframe with 20 classrooms [1 to 20] indexes and 20 different number of students in each class, how to obtain all sub-samples of size n = 8 and store them because i want to use them later for calculations. I used combn() but that takes only one vector, can i use it with a dataframe and how? (sorry but i'm new in R),
dataframe below:
classrooms students
1 1 29
2 2 30
3 3 35
4 4 28
5 5 32
6 6 20
7 7 25
8 8 22
9 9 32
10 10 26
11 11 27
12 12 34
13 13 27
14 14 28
15 15 33
16 16 21
17 17 36
18 18 24
19 19 19
20 20 32
It is as simple as passing a function to combn. simplify = FALSE means that a list will be returned.
Assuming you want all possible combinations of 8 classrooms from the dataset classrooms
combinations <- combn(nrow(classrooms), 8, function(x,data) data[x,],
simplify = FALSE, data =classrooms )
head(combinations, n = 2)
[[1]]
classrooms students
1 1 29
2 2 30
3 3 35
4 4 28
5 5 32
6 6 20
7 7 25
8 8 22
[[2]]
classrooms students
1 1 29
2 2 30
3 3 35
4 4 28
5 5 32
6 6 20
7 7 25
9 9 32