I have power data (Power) collected every second (Sample). My data.frame is therefore structured as follows:
Test <- data.frame(Sample = c(1:20),
                   Power = c(0,0,0,0,0,50,67,100,92,0,0,0,36,89,36,0,0,0,89,90))
The number of power entries depends on a human performing an effort on a bike and resting sporadically, so efforts do not appear at regular intervals. As there are no markers to indicate when an effort starts and stops, I want to add this detail. An effort is characterised by power > 0, and the start/stop of each effort can be identified from consecutive non-zero values grouped together.
I now wish to include a new column (Marker) that looks for power data grouped together and separated by zeroes. For example, my anticipated output would be:
Test$Marker <- c("Rest","Rest","Rest","Rest","Rest","Effort 1","Effort 1","Effort 1","Effort 1",
                 "Rest","Rest","Rest","Effort 2","Effort 2","Effort 2","Rest","Rest","Rest",
                 "Effort 3","Effort 3")
Unfortunately my raw data is > 3000 rows long, so doing this manually would be tedious! How do I go about doing this in R, please?
An option with base R:
r <- rle(Test$Power > 0)                  # runs of effort (TRUE) and rest (FALSE)
indx1 = rep(r$values, r$lengths)          # TRUE on effort rows, FALSE on rest rows
indx2 = rep(cumsum(r$values), r$lengths)  # running effort number, repeated over each run
Test$Effort[indx1] = paste0("Effort", indx2[indx1])
Test$Effort[!indx1] = "Rest"
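To see why this works: rle() run-length encodes the logical vector Power > 0, and cumsum() of the values numbers the TRUE (effort) runs. On the example data:
rle(Test$Power > 0)
#> Run Length Encoding
#>   lengths: int [1:6] 5 4 3 3 3 2
#>   values : logi [1:6] FALSE TRUE FALSE TRUE FALSE TRUE
cumsum(rle(Test$Power > 0)$values)
#> [1] 0 1 1 2 2 3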
Output:
Sample Power Effort
1 1 0 Rest
2 2 0 Rest
3 3 0 Rest
4 4 0 Rest
5 5 0 Rest
6 6 50 Effort1
7 7 67 Effort1
8 8 100 Effort1
9 9 92 Effort1
10 10 0 Rest
11 11 0 Rest
12 12 0 Rest
13 13 36 Effort2
14 14 89 Effort2
15 15 36 Effort2
16 16 0 Rest
17 17 0 Rest
18 18 0 Rest
19 19 89 Effort3
20 20 90 Effort3
About 0.0038 seconds for 3,000 rows ;) Hope this helps!
An alternative base R version using cumsum:
mrk <- Test$Power == 0  # TRUE on rest rows
# cumsum(mrk) is constant within each effort block; factor() renumbers the blocks 1, 2, 3, ...
Test$New[!mrk] <- paste("effort", as.numeric(factor(cumsum(mrk)[!mrk])))
Test$New[mrk] <- "rest"
# Sample Power Marker New
#1 1 0 Rest rest
#2 2 0 Rest rest
#3 3 0 Rest rest
#4 4 0 Rest rest
#5 5 0 Rest rest
#6 6 50 Effort 1 effort 1
#7 7 67 Effort 1 effort 1
#8 8 100 Effort 1 effort 1
#9 9 92 Effort 1 effort 1
#10 10 0 Rest rest
#11 11 0 Rest rest
#12 12 0 Rest rest
#13 13 36 Effort 2 effort 2
#14 14 89 Effort 2 effort 2
#15 15 36 Effort 2 effort 2
#16 16 0 Rest rest
#17 17 0 Rest rest
#18 18 0 Rest rest
#19 19 89 Effort 3 effort 3
#20 20 90 Effort 3 effort 3
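To see the intermediate steps on the example data: cumsum(mrk) is constant within each effort block (it equals the number of rest seconds seen so far), and factor() renumbers those blocks consecutively:
cumsum(mrk)[!mrk]
#> [1] 5 5 5 5 8 8 8 11 11
as.numeric(factor(cumsum(mrk)[!mrk]))
#> [1] 1 1 1 1 2 2 2 3 3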
An option with dplyr from the tidyverse:
library(dplyr)

Test <- data.frame(Sample = c(1:20),
                   Power = c(0,0,0,0,0,50,67,100,92,0,0,0,36,89,36,0,0,0,89,90))

Test_df <- Test %>%
  mutate(
    Marker = case_when(
      Power > 0  ~ "Effort",
      Power == 0 ~ "Rest"),
    # run id: increments every time Marker changes from the previous row
    rleid = cumsum(Marker != lag(Marker, 1, default = "NA")),
    # efforts occupy the even-numbered runs when the series starts at rest,
    # so integer division by 2 yields the effort number
    Marker = case_when(
      Marker == "Effort" ~ paste0(Marker, rleid %/% 2),
      TRUE ~ "Rest"),
    rleid = NULL
  )
Test_df
#> Sample Power Marker
#> 1 1 0 Rest
#> 2 2 0 Rest
#> 3 3 0 Rest
#> 4 4 0 Rest
#> 5 5 0 Rest
#> 6 6 50 Effort1
#> 7 7 67 Effort1
#> 8 8 100 Effort1
#> 9 9 92 Effort1
#> 10 10 0 Rest
#> 11 11 0 Rest
#> 12 12 0 Rest
#> 13 13 36 Effort2
#> 14 14 89 Effort2
#> 15 15 36 Effort2
#> 16 16 0 Rest
#> 17 17 0 Rest
#> 18 18 0 Rest
#> 19 19 89 Effort3
#> 20 20 90 Effort3
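For what it's worth, recent dplyr (>= 1.1.0) ships consecutive_id(), which replaces the manual lag/cumsum run id; a sketch of the same idea, under that version assumption:
Test %>%
  mutate(
    grp = consecutive_id(Power > 0),  # run id: 1, 2, 3, ... alternating rest/effort
    Marker = if_else(Power > 0, paste0("Effort", grp %/% 2), "Rest"),
    grp = NULL
  )
Like the rleid %/% 2 trick above, this numbering assumes the series starts at rest.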
Another option, a data.table one-liner:
library(data.table)

Test <- data.frame(Sample = c(1:20),
                   Power = c(0,0,0,0,0,50,67,100,92,0,0,0,36,89,36,0,0,0,89,90))
setDT(Test)

# rleidv() numbers the alternating rest/effort runs; efforts are the even runs
Test[, Marker := ifelse(Power > 0, paste0("Effort", rleidv(Power > 0) %/% 2), "Rest")]
Test
#> Sample Power Marker
#> 1: 1 0 Rest
#> 2: 2 0 Rest
#> 3: 3 0 Rest
#> 4: 4 0 Rest
#> 5: 5 0 Rest
#> 6: 6 50 Effort1
#> 7: 7 67 Effort1
#> 8: 8 100 Effort1
#> 9: 9 92 Effort1
#> 10: 10 0 Rest
#> 11: 11 0 Rest
#> 12: 12 0 Rest
#> 13: 13 36 Effort2
#> 14: 14 89 Effort2
#> 15: 15 36 Effort2
#> 16: 16 0 Rest
#> 17: 17 0 Rest
#> 18: 18 0 Rest
#> 19: 19 89 Effort3
#> 20: 20 90 Effort3
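Note that the %/% 2 trick assumes the series starts at rest, so that efforts are the even-numbered runs. A sketch that instead numbers efforts by counting effort starts, using data.table's shift() to look at the previous second:
Test[, Marker := fifelse(Power > 0,
                         paste0("Effort", cumsum(Power > 0 & shift(Power, fill = 0) == 0)),
                         "Rest")]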
I have data from a simulation with counts of interactions among 10 individuals, over 80 runs. I would like to make separate matrices for each run, and then use a function to calculate the ranking of individuals from the matrix for each run.
Is it possible to write for loops for:
- making matrices for each run
- running a function through all matrices
I am new to R, so I don't really know how to write these iterative loops. I made separate matrices and ran the function separately for each matrix, but this is very time consuming and prone to error.
This is what the data looks like:
head(A)
[run number] distribution who-won1 who-won2 won-battle
1 3 4 patches 7 5 17
2 3 4 patches 9 4 31
3 3 4 patches 0 1 11
4 3 4 patches 2 1 7
5 3 4 patches 2 9 4
6 3 4 patches 5 7 36
7 3 4 patches 9 6 10
8 3 4 patches 2 7 3
9 3 4 patches 1 0 19
10 3 4 patches 3 7 7
Then I used this to make the matrices: an actor-receiver matrix with the counts of fights won for each actor-receiver pair.
Alist <- vector("list", 40)
for (run in 1:40) {
  newmatrix <- matrix(nrow = 10, ncol = 10)
  for (x in 1:90) { # 90 rows per run
    Actor = A$Actor[A$Group == run][x] + 1  # +1 because the ids start at 0
    Receiver = A$Receiver[A$Group == run][x] + 1
    Won = A$`won-battle`[A$Group == run][x]
    newmatrix[Actor, Receiver] = as.numeric(Won)
  }
  newmatrix[is.na(newmatrix)] <- 0 # pairs that never fought get 0 wins
  Alist[[run]] <- newmatrix        # store this run's matrix in the list
}
and it gives a matrix like this:
...1 V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0 11 19 23 11 9 1 12 34 3
2 2 19 0 25 24 13 12 5 12 35 13
3 3 14 7 0 14 6 3 1 3 38 4
4 4 16 8 10 0 1 5 2 7 19 8
5 5 30 19 35 35 0 17 9 16 67 18
6 6 31 50 52 38 21 0 21 36 83 26
7 7 69 42 46 38 35 43 0 62 66 59
8 8 38 23 48 44 19 17 7 0 66 21
9 9 26 14 31 24 4 2 5 6 0 12
10 10 41 35 43 48 31 33 10 34 64 0
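Both steps can be done without hand-written loops. A minimal sketch, assuming (as in the loop above) that A has columns Group, Actor, Receiver and `won-battle`, and using my_ranking_function as a hypothetical stand-in for the ranking function mentioned in the question:
# Build one 10x10 matrix per run via matrix indexing instead of an inner loop
Alist <- lapply(sort(unique(A$Group)), function(run) {
  sub <- A[A$Group == run, ]
  m <- matrix(0, nrow = 10, ncol = 10)
  m[cbind(sub$Actor + 1, sub$Receiver + 1)] <- as.numeric(sub$`won-battle`)
  m
})
# Apply the ranking function to every run's matrix in one call
rankings <- lapply(Alist, my_ranking_function)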
I have a dataframe df consisting of 2 columns: x and y coordinates. Each row refers to a point.
I feed it into the dbscan function to obtain the clusters of the points in df.
library("fpc")
db = fpc::dbscan(df, eps = 0.08, MinPts = 4)
plot(db, df, main = "DBSCAN", frame = FALSE)
By using print(db), I can see the result returned by dbscan.
> print(db)
dbscan Pts=13131 MinPts=4 eps=0.08
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
border 401 38 55 5 2 3 0 0 0 8 0 6 1 3 1 3 3 2 1 2 4 3
seed 0 2634 8186 35 24 561 99 7 22 26 5 75 17 9 9 54 1 2 74 21 3 15
total 401 2672 8241 40 26 564 99 7 22 34 5 81 18 12 10 57 4 4 75 23 7 18
22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
border 4 1 2 6 2 1 3 7 2 1 2 3 11 1 3 1 3 2 5 5 1 4 3
seed 14 9 4 48 2 4 38 111 5 11 5 14 111 6 1 5 1 8 3 15 10 15 6
total 18 10 6 54 4 5 41 118 7 12 7 17 122 7 4 6 4 10 8 20 11 19 9
45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68
border 2 4 2 1 3 2 1 1 3 1 0 2 2 3 0 3 3 3 3 0 0 2 3 1
seed 15 2 9 11 4 8 12 4 6 8 7 7 3 3 4 3 3 4 2 9 4 2 1 4
total 17 6 11 12 7 10 13 5 9 9 7 9 5 6 4 6 6 7 5 9 4 4 4 5
69 70 71
border 3 3 3
seed 1 1 1
total 4 4 4
From the above summary, I can see that cluster 2 consists of 8186 seed points (core points), cluster 1 consists of 2634 seed points, and cluster 5 consists of 561 seed points.
I define the largest cluster as the one that contains the largest number of seed points. So, in this case, the largest cluster is cluster 2, and the 1st, 2nd and 3rd largest clusters are 2, 1 and 5.
Is there any direct way to return the rows (points) in the largest cluster, or in the k-th largest cluster in general?
I can do it in an indirect way:
1. Obtain the assigned cluster number of each point via db$cluster.
2. Create a new dataframe df2 with db$cluster as an additional column besides the original x and y columns.
3. Aggregate df2 according to the cluster numbers in the third column and find the number of points in each cluster.
4. Find the k-largest groups, which are 2, 1 and 5 again.
5. Select the rows in df2 whose third-column value equals 2 to return the points in the largest cluster.
But the above approach re-computes many results that are already known from the summary of print(db).
The dbscan function doesn't appear to retain the data.
library(fpc)
set.seed(665544)
n <- 600
df <- data.frame(x=runif(10, 0, 10)+rnorm(n, sd=0.2), y=runif(10, 0, 10)+rnorm(n,sd=0.2))
(dbs <- dbscan(df, 0.2))
#dbscan Pts=600 MinPts=5 eps=0.2
# 0 1 2 3 4 5 6 7 8 9 10 11
#border 28 4 4 8 5 3 3 4 3 4 6 4
#seed 0 50 53 51 52 51 54 54 54 53 51 1
#total 28 54 57 59 57 54 57 58 57 57 57 5
attributes(dbs)
#$names
#[1] "cluster" "eps" "MinPts" "isseed"
#$class
#[1] "dbscan"
Your indirect steps are not that indirect (only two lines needed), and these commands won't recalculate the clusters. So just run those commands, or put them in a function and then call the function in one command.
cluster_k <- function(dbs, data, k) {
  # cluster sizes sorted largest-first; the k-th name is the k-th largest cluster id
  kth <- names(rev(sort(table(dbs$cluster)))[k])
  data[dbs$cluster == kth, ]
}
cluster_k(dbs=dbs, data=df, k=1)
## x y
## 3 6.580695 8.715245
## 13 6.704379 8.528486
## 23 6.809558 8.160721
## 33 6.375842 8.756433
## 43 6.603195 8.640206
## 53 6.728533 8.425067
## a data frame with 59 rows
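Two caveats on the ranking: table(dbs$cluster) ranks clusters by total points rather than by seed points as defined in the question (here the two orderings happen to agree: 2, 1, 5 either way), and dbscan labels noise points as cluster 0, so the noise "cluster" competes with the real ones. A variant that drops noise first, as a sketch:
cluster_k_noiseless <- function(dbs, data, k) {
  sizes <- table(dbs$cluster[dbs$cluster != 0])  # cluster sizes, noise excluded
  kth <- names(rev(sort(sizes))[k])
  data[dbs$cluster == kth, ]
}
cluster_k_noiseless(dbs = dbs, data = df, k = 1)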
I have a rather large dataset named e. One continuous covariate, e$rad.dose, should range from 0 to 60; however, I observed that four rows contain text rather than numbers.
Question: how can I apply dplyr to remove these four rows?
I know several base R solutions, but I am trying to improve my dplyr.
> table(e$rad.dose)
0 12 12,5 14 15 16 21,6
156 3291 4 1 1 6 2 1
22 24 25 26 27,5 28 33,3 35
1 14 7 1 1 7 1 1
36 45 48 49,4 5,4 50 50,4 52
1 2 1 1 17 12 9 9
52,2 53,2 53,24 54 54,4 54,6 55 55,5
1 1 1 94 1 1 1 1
55,8 56 56,7 57 57,6 58 59,4 60
1 14 1 1 2 3 26 41
60,9 64 68 gammaknife GK
1 1 1 2 2
I have tried
filter(simpson %in% 1:3, age>=18, rad.dose!= c("gammaknife","GK"))
But two rows remain
> table(e$rad.dose)
0 12 12,5 14 15 16 21,6
32 2276 0 0 0 0 0 0
22 24 25 26 27,5 28 33,3 35
0 7 0 0 0 0 0 1
36 45 48 49,4 5,4 50 50,4 52
0 1 0 0 5 3 1 9
52,2 53,2 53,24 54 54,4 54,6 55 55,5
0 0 1 21 0 0 0 0
55,8 56 56,7 57 57,6 58 59,4 60
0 4 0 0 0 1 7 19
60,9 64 68 gammaknife GK
0 1 0 1 1
I also tried str_detect but that did not solve it - or, at least, I have applied it wrongly:
filter(simpson %in% 1:3, age>=18, str_detect(rad.dose, c("gammaknife","GK")==FALSE))
Thank you in advance.
Keep numbers as numeric:
e %>%
  mutate(
    rad.dose.clean = as.numeric(sub(",", ".", rad.dose, fixed = TRUE))) %>% # comma -> dot; text becomes NA
  filter(simpson %in% 1:3, age >= 18, !is.na(rad.dose.clean))
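For what it's worth, the filter in the question failed because != against a length-2 vector recycles: rad.dose != c("gammaknife","GK") tests odd-positioned rows against "gammaknife" and even-positioned rows against "GK", so two of the four text rows slip through. If you only want to drop those labels, the set-based test avoids the recycling:
e %>%
  filter(simpson %in% 1:3, age >= 18, !rad.dose %in% c("gammaknife", "GK"))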
Let's grab some reproducible data first:
e <- data.frame(rad.dose = c(rnorm(10), LETTERS))
e$rad.dose <- gsub(".", ",", e$rad.dose, fixed = TRUE)
e
#> rad.dose
#> 1 -0,713359526629519
#> 2 1,17199694030257
#> 3 -0,255927297100446
#> 4 0,904135913625966
#> 5 0,44009842543169
#> 6 2,05720978460597
#> 7 0,0285732465139118
#> 8 -0,030136802990553
#> 9 -0,291809792426389
#> 10 0,30677039655244
#> 11 A
#> 12 B
#> 13 C
#> 14 D
#> 15 E
#> 16 F
#> 17 G
#> 18 H
#> 19 I
#> 20 J
#> 21 K
#> 22 L
#> 23 M
#> 24 N
#> 25 O
#> 26 P
#> 27 Q
#> 28 R
#> 29 S
#> 30 T
#> 31 U
#> 32 V
#> 33 W
#> 34 X
#> 35 Y
#> 36 Z
As far as I can see, you have three problems with the rad.dose column: it contains text, the numbers use commas as decimal separators, and the whole column is therefore stored as character (otherwise it could not hold text at all). I would correct this first before continuing.
library(dplyr)
library(stringr)
e %>%
  filter(str_detect(rad.dose, "[0-9,]+")) %>% # keep values that contain digits or commas (drops the pure-text rows)
  mutate(rad.dose = as.numeric(str_replace(rad.dose, ",", "."))) # replace comma with dot and convert to numeric
#> rad.dose
#> 1 -0.71335953
#> 2 1.17199694
#> 3 -0.25592730
#> 4 0.90413591
#> 5 0.44009843
#> 6 2.05720978
#> 7 0.02857325
#> 8 -0.03013680
#> 9 -0.29180979
#> 10 0.30677040
Now you can use filter(simpson %in% 1:3, age>=18) on this data.
Excluding rows where rad.dose contains anything other than digits, commas and minus signs works, but is not perfect:
dplyr::filter(e, !grepl('[^0-9,-]', rad.dose))
The above line still does not filter out a value like '---,--'.
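Anchoring the pattern so that the whole value must be digits with at most one comma also rejects strings like '---,--'; a sketch (the doses here are non-negative, so no minus sign is allowed for):
dplyr::filter(e, grepl('^[0-9]+(,[0-9]+)?$', rad.dose))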
Another option: replace the commas with decimal points, and see whether the result is numeric:
e %>%
  mutate(rad.dose_numeric = as.numeric(gsub(',', '.', rad.dose))) %>% # extra column; text becomes NA
  dplyr::filter(!is.na(rad.dose_numeric)) %>% # filter out NAs
  select(-rad.dose_numeric) # remove the helper column
I have the following code for a Netflix experiment to reduce the price of Netflix and see if people watch more or less TV. Each time someone uses Netflix, it records what they watched and how long they watched it for.
library(tidyverse)

sample_size <- 10000
set.seed(853)

viewing_data <-
  tibble(unique_person_id = sample(x = c(1:100),
                                   size = sample_size,
                                   replace = TRUE),
         tv_show = sample(x = c("Broadchurch", "Duty-Shame", "Drive to Survive", "Shetland", "The Crown"),
                          size = sample_size,
                          replace = TRUE))
I then want to write some code that would randomly assign people into one of two groups: treatment and control. However, the dataset is at the row level, with 10,000 observations. I want to change it to the person level in R, so that I can assign each person to be either treated or not. A person should not be both treated and not treated; however, the same person appears in many tv_show rows. Does anyone know how to reshape the dataset in this case?
library(dplyr)

treatment <- viewing_data %>%
  distinct(unique_person_id) %>%
  mutate(treated = sample(c("yes", "no"), size = 100, replace = TRUE))

viewing_data %>%
  left_join(treatment, by = "unique_person_id")
You can change the way of sampling if you need to...
You can do the below; this groups your observations by person id and assigns a single "treated"/"control" label per group:
library(dplyr)

viewing_data %>%
  group_by(unique_person_id) %>%
  mutate(group = sample(c("treated", "control"), 1))
# A tibble: 10,000 x 3
# Groups: unique_person_id [100]
unique_person_id tv_show group
<int> <chr> <chr>
1 9 Drive to Survive control
2 64 Shetland treated
3 90 The Crown treated
4 93 Drive to Survive treated
5 17 Duty-Shame treated
6 29 The Crown control
7 84 Broadchurch control
8 83 The Crown treated
9 3 The Crown control
10 33 Broadchurch control
# … with 9,990 more rows
We can check our results: all of the ids have only one treated/control group:
newdata <- viewing_data %>%
  group_by(unique_person_id) %>%
  mutate(group = sample(c("treated", "control"), 1))

tapply(newdata$group, newdata$unique_person_id, n_distinct)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
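For reference, the same check can stay inside dplyr; this should return zero rows:
newdata %>%
  group_by(unique_person_id) %>%
  summarise(n_groups = n_distinct(group)) %>%
  filter(n_groups > 1)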
In case you wanted random and equal allocation of persons into the two groups (complete random allocation), you can use the following code.
library(dplyr)

Persons <- viewing_data %>%
  distinct(unique_person_id) %>%
  mutate(group = sample(100), # in case the ids are not truly random
         group = ifelse(group %% 2 == 0, 0, 1)) # works if only two groups
Persons
# A tibble: 100 x 2
unique_person_id group
<int> <dbl>
1 1 0
2 2 0
3 3 1
4 4 0
5 5 1
6 6 1
7 7 1
8 8 0
9 9 1
10 10 0
# ... with 90 more rows
And to check that we've got 50 in each group:
Persons %>% count(group)
# A tibble: 2 x 2
group n
<dbl> <int>
1 0 50
2 1 50
You could also use the randomizr package, which has many more features apart from complete random allocation.
library(randomizr)

Persons <- viewing_data %>%
  distinct(unique_person_id) %>%
  mutate(group = complete_ra(N = 100, m = 50))

Persons %>% count(group) # Check
To link this back to the viewing_data, use inner_join.
viewing_data %>% inner_join(Persons, by="unique_person_id")
# A tibble: 10,000 x 3
unique_person_id tv_show group
<int> <chr> <int>
1 10 Shetland 1
2 95 Broadchurch 0
3 7 Duty-Shame 1
4 68 Drive to Survive 0
5 17 Drive to Survive 1
6 70 Shetland 0
7 78 Drive to Survive 0
8 21 Broadchurch 1
9 80 The Crown 0
10 70 Shetland 0
# ... with 9,990 more rows
I have the following dataframe named "dataset"
> dataset
V1 V2 V3 V4 V5 V6 V7
1 A 29 27 0 14 21 163
2 W 70 40 93 63 44 1837
3 E 11 1 11 49 17 315
4 S 20 59 36 23 14 621
5 C 12 7 48 24 25 706
6 B 14 8 78 27 17 375
7 G 12 7 8 4 4 257
8 T 0 0 0 0 0 0
9 N 32 6 9 14 17 264
10 R 28 46 49 55 38 608
11 O 12 2 8 12 11 450
I have two helper functions as below
get_A <- function(p){
  return(data.frame(Scorecard = p,
                    Results = dataset[nrow(dataset), (p + 1)]))
} # Pulls the value from the last row, column p + 1

get_P <- function(p){
  return(data.frame(Scorecard = p,
                    Results = dataset[p, ncol(dataset)]))
} # Pulls the value from the last column, row p
I have the following dataframe on which I need to run the above helper functions. There will be NAs because I'm reading this "data_sub" dataframe from an Excel file, which can have unequal numbers of rows in the two columns.
> data_sub
Key_P Key_A
1 2 1
2 3 3
3 4 5
4 NA NA
When I call the helper functions, I get some strange results as shown below:
> get_P(data_sub[complete.cases(data_sub$Key_P),]$Key_P)
Scorecard Results
1 2 1837
2 3 315
3 4 621
> get_A(data_sub[complete.cases(data_sub$Key_A),]$Key_A)
Scorecard Results.V2 Results.V4 Results.V6
1 1 12 8 11
2 3 12 8 11
3 5 12 8 11
Warning message:
In data.frame(Scorecard = p, Results = dataset[nrow(dataset), (p + :
row names were found from a short variable and have been discarded
The call to the get_P() helper function is working the way I want. I'm getting the "Results" for each non-NA value in data_sub$Key_P as a dataframe.
But the call to the get_A() helper function is giving strange results and also a warning. I was expecting it to give a dataframe similar to the one returned by get_P(). Why is this happening, and how can I make get_A() give the correct dataframe? Basically, the output should be
Scorecard Results
1 1 12
2 3 8
3 5 11
I found this link related to the warning but it's unhelpful in solving my issue.
The following works:
get_P <- function(df, data_sub) {
  data_sub <- data_sub[complete.cases(data_sub), ]
  data.frame(
    Scorecard = data_sub$Key_P,
    Results = df[data_sub$Key_P, ncol(df)])
}
get_P(df, data_sub)
# Scorecard Results
#1 2 1837
#2 3 315
#3 4 621
get_A <- function(df, data_sub) {
  data_sub <- data_sub[complete.cases(data_sub), ]
  data.frame(
    Scorecard = data_sub$Key_A,
    Results = as.numeric(df[nrow(df), data_sub$Key_A + 1]))
}
get_A(df, data_sub)
# Scorecard Results
#1 1 12
#2 3 8
#3 5 11
To avoid the warning, we need to strip the row names with as.numeric in get_A. The root cause: df[nrow(df), p + 1] selects one row and several columns, which returns a one-row data.frame (hence the three Results columns and the row-names warning), whereas df[p, ncol(df)] selects several rows of a single column and returns a plain vector. as.numeric flattens the one-row data.frame into a vector.
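A quick way to see the difference in the console, using the sample data below:
str(df[nrow(df), c(2, 4, 6)]) # one row, several columns -> still a data.frame
#> 'data.frame':    1 obs. of  3 variables:
#>  $ V2: int 12
#>  $ V4: int 8
#>  $ V6: int 11
str(df[c(2, 3, 4), ncol(df)]) # several rows, one column -> plain vector
#>  int [1:3] 1837 315 621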
Another tip: it's better coding practice to make get_P and get_A functions of both df and data_sub, to avoid relying on global variables.
Sample data
df <- read.table(text =
" V1 V2 V3 V4 V5 V6 V7
1 A 29 27 0 14 21 163
2 W 70 40 93 63 44 1837
3 E 11 1 11 49 17 315
4 S 20 59 36 23 14 621
5 C 12 7 48 24 25 706
6 B 14 8 78 27 17 375
7 G 12 7 8 4 4 257
8 T 0 0 0 0 0 0
9 N 32 6 9 14 17 264
10 R 28 46 49 55 38 608
11 O 12 2 8 12 11 450", header = T, row.names = 1)
data_sub <- read.table(text =
" Key_P Key_A
1 2 1
2 3 3
3 4 5
4 NA NA", header = T, row.names = 1)