I have data that looks like the table below, and I would like to remove the 2 rows after the max index of certain types (3 and 4). For example, there are two 4s in my table, but I only need to remove the 2 rows after the second 4. Same for 3: I only need to remove the 2 rows after the last 3.
-----------------
| grade | type |
-----------------
|  93   |  2   |
|  90   |  2   |
|  54   |  2   |
|  36   |  4   |
|  31   |  4   |
|  94   |  1   |
|  57   |  1   |
|  16   |  3   |
|  11   |  3   |
|  12   |  3   |
|  99   |  1   |
|  99   |  1   |
|   9   |  3   |
|  10   |  3   |
|  97   |  1   |
|  96   |  1   |
-----------------
The desired output would be:
-----------------
| grade | type |
-----------------
|  93   |  2   |
|  90   |  2   |
|  54   |  2   |
|  36   |  4   |
|  31   |  4   |
|  16   |  3   |
|  11   |  3   |
|  12   |  3   |
|   9   |  3   |
|  10   |  3   |
-----------------
Here is the code that creates my example data:
data <- data.frame(grade = c(93,90,54,36,31,94,57,16,11,12,99,99,9,10,97,96), type = c(2,2,2,4,4,1,1,3,3,3,1,1,3,3,1,1))
Could anyone give me some hints on how to approach this in R? Thanks a bunch in advance for your help and your time!
# `:` binds tighter than `+`, so max(...) + 1:2 gives the two row indices
# immediately after the last occurrence of each type
data[-c(max(which(data$type == 3)) + 1:2,
        max(which(data$type == 4)) + 1:2), ]
# grade type
# 1 93 2
# 2 90 2
# 3 54 2
# 4 36 4
# 5 31 4
# 8 16 3
# 9 11 3
# 10 12 3
Using some indexing:
data[-(nrow(data) - match(c(3,4), rev(data$type)) + 1 + rep(1:2, each=2)),]
# grade type
#1 93 2
#2 90 2
#3 54 2
#4 36 4
#5 31 4
#8 16 3
#9 11 3
#10 12 3
Or more generically:
vals <- c(3,4)
data[-(nrow(data) - match(vals, rev(data$type)) + 1 + rep(1:2, each=length(vals))),]
The logic is to match the first instance of each value in the reversed column (i.e. the last instance in the original order), flip that position back to the original row index, add 1 and 2 to those indices, and drop the resulting rows.
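For example, with the sample data the one-liner expands to these intermediate steps (variable names here are purely for illustration):
rev_pos  <- match(c(3, 4), rev(data$type))  # positions in the reversed column: 3, 12
last_pos <- nrow(data) - rev_pos + 1        # last original row of each value: 14, 5
drop     <- last_pos + rep(1:2, each = 2)   # recycling gives rows 15, 6, 16, 7
data[-drop, ]                               # same result as the one-liner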
Similar to Ric, but I find it a bit easier to read (way more verbose, though):
library(dplyr)
idx <- data %>%
  mutate(id = row_number()) %>%
  filter(type %in% 3:4) %>%
  group_by(type) %>%
  filter(id == max(id)) %>%
  pull(id)
data[-c(idx + 1, idx + 2),]
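With the sample data, idx comes out as c(5, 14) (the last rows with type 4 and type 3, respectively), so rows 6, 7, 15, and 16 are dropped, matching the desired output.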
Related
I am trying to run a Friedman's test; my data is repeated measures but nonparametric.
The data comes from a CSV organized as below; I used RStudio's Import Dataset function, so it is loaded as a table in RStudio:
score | treatment | day
  10  |     1     |  1
  20  |     1     |  1
  40  |     1     |  1
   7  |     2     |  1
 100  |     2     |  1
  58  |     2     |  1
  98  |     3     |  1
  89  |     3     |  1
  40  |     3     |  1
  70  |     4     |  1
  10  |     4     |  1
  28  |     4     |  1
  86  |     5     |  1
 200  |     5     |  1
  40  |     5     |  1
  77  |     1     |  2
 100  |     1     |  2
  90  |     1     |  2
  33  |     2     |  2
  15  |     2     |  2
  25  |     2     |  2
  23  |     3     |  2
  54  |     3     |  2
  67  |     3     |  2
   1  |     4     |  2
   2  |     4     |  2
 400  |     4     |  2
  16  |     5     |  2
  10  |     5     |  2
  90  |     5     |  2
library(readr)
sample_data$treatment <- as.factor(sample_data$treatment) #setting treatment as categorical independent variable
sample_data$day <- as.factor(sample_data$day) #setting day as categorical independent variable
summary(sample_data)
#attach(sample_data)  # not sure if this is needed; the guide at https://www.sheffield.ac.uk/polopoly_fs/1.714578!/file/stcp-marquier-FriedmanR.pdf uses attach() so that R can reference the variables directly
friedman3 <- friedman.test(y = sample_data$score,
                           groups = sample_data$treatment,
                           blocks = sample_data$day)
summary(friedman3)
I am interested in day and score using Friedman's test. This is the error I get:
Error in friedman.test.default(y = sample_data$score, groups = sample_data$treatment,
  blocks = sample_data$day) : not an unreplicated complete block design
Not sure what is wrong.
Prior to writing the Friedman part of the code, I only specified day and treatment as categorical using as.factor.
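For what it's worth, friedman.test() requires an unreplicated complete block design, i.e. exactly one observation per groups/blocks combination, and here each treatment/day cell holds three replicate scores. A minimal sketch of one common workaround, assuming it is acceptable for your analysis to collapse the replicates (e.g. with the median):
# collapse the three replicate scores per treatment/day cell to their median
agg <- aggregate(score ~ treatment + day, data = sample_data, FUN = median)
# one observation per cell now, so the block design is unreplicated and complete
friedman.test(score ~ treatment | day, data = agg)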
Within each "race", "bib" group, ordered by "split", I want to subtract each "place" value from the previous split's value so that a "diff" column appears like so.
Desired Output:
race | bib | split | place | diff
----------------------------------
10 | 514 | 1 | 5 | 0
10 | 514 | 2 | 3 | 2
10 | 514 | 3 | 2 | 1
10 | 17 | 1 | 8 | 0
10 | 17 | 2 | 12 | -4
10 | 17 | 3 | 15 | -3
I'm new to using the coalesce statement, and the closest I have come to the desired output is the following:
select a.race, a.bib, a.split, a.place,
       coalesce(a.place -
                (select b.place from ranking b where b.split < a.split),
                a.place) as diff
from ranking a
group by race, bib, split
which produces:
race | bib | split | place | diff
----------------------------------
10 | 514 | 1 | 5 | 5
10 | 514 | 2 | 3 | 2
10 | 514 | 3 | 2 | 1
10 | 17 | 1 | 8 | 8
10 | 17 | 2 | 12 | 11
10 | 17 | 3 | 15 | 14
Thanks for looking!
To compute the difference, you have to look up the value in the row that has the same race and bib values, and the next-smaller split value:
SELECT race, bib, split, place,
coalesce((SELECT r2.place
FROM ranking AS r2
WHERE r2.race = ranking.race
AND r2.bib = ranking.bib
AND r2.split < ranking.split
ORDER BY r2.split DESC
LIMIT 1
) - place,
0) AS diff
FROM ranking;
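If your database supports window functions (an assumption about your setup; e.g. SQLite 3.25+, PostgreSQL, or MySQL 8), LAG() expresses the same previous-split lookup more concisely:
SELECT race, bib, split, place,
       coalesce(LAG(place) OVER (PARTITION BY race, bib ORDER BY split) - place,
                0) AS diff
FROM ranking;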
My data is like this:
Time | State | Event
01 | 0 |
02 | 0 |
03 | 0 |
04 | 2 | A_start
05 | 2 |
06 | 2 |
07 | 2 |
08 | 2 |
09 | 1 | A_end
10 | 1 |
11 | 1 |
12 | 1 |
13 | 1 |
14 | 2 | B_start
15 | 2 |
16 | 2 |
17 | 2 |
18 | 2 |
19 | 0 | B_end
20 | 0 |
21 | 0 |
22 | 0 |
23 | 0 |
24 | 2 | A_start
25 | 2 |
26 | 2 |
27 | 2 |
28 | 2 |
29 | 2 |
30 | 2 |
31 | 1 | A_end
32 | 1 |
33 | 1 |
34 | 1 |
35 | 1 |
36 | 1 |
37 | 2 | B_start
38 | 2 |
39 | 2 |
40 | 2 |
The cycle can repeat with any number of 0s, 1s and 2s in between. Sometimes, 0s, 1s or 2s can be missing entirely. I want to get the difference in the Time column between every A_start and the A_end immediately after it. Similarly, I want the difference in Time between every B_start and the B_end that immediately follows.
For this, I thought it would help if I made a "group" for each cycle, as follows:
Time | State | Event | Group
01 | 0 | |
02 | 0 | |
03 | 0 | |
04 | 2 | A_start | 1
05 | 2 | |
06 | 2 | |
07 | 2 | |
08 | 2 | |
09 | 1 | A_end | 1
10 | 1 | |
11 | 1 | |
12 | 1 | |
13 | 1 | |
14 | 2 | B_start | 1
15 | 2 | |
16 | 2 | |
17 | 2 | |
18 | 2 | |
19 | 0 | B_end | 1
20 | 0 | |
21 | 0 | |
22 | 0 | |
23 | 0 | |
24 | 2 | A_start | 2
25 | 2 | |
26 | 2 | |
27 | 2 | |
28 | 2 | |
29 | 2 | |
30 | 2 | |
31 | 1 | A_end | 2
32 | 1 | |
33 | 1 | |
34 | 1 | |
35 | 1 | |
36 | 1 | |
37 | 2 | B_start | 2
38 | 2 | |
39 | 2 | |
40 | 2 | |
However, because there are sometimes missing values in the State column, this isn't working out too well.
The correct cycle sequence is 0 -> 2 -> 1 -> 2 -> 0. Sometimes, a cycle may miss a 2 and be like this: 0 -> 1 -> 2 -> 0. Various combinations of the cycle 0 -> 2 -> 1 -> 2 -> 0 are possible (44 in total). How should I go about this?
Here is a base solution:
# identify the times at which there is a change in State
timeWithChanges <- which(abs(diff(dat$State)) > 0) + 1
# pivot those times into an m x 2 (start, end) matrix
startEnd <- matrix(dat$Time[timeWithChanges], ncol = 2, byrow = TRUE)
# calculate the time differences and label them A, B alternately
data.frame(AB = rep(c("A", "B"), nrow(startEnd) / 2),
           TimeDiff = startEnd[, 2] - startEnd[, 1])
Please let me know if this works generally enough for you.
data:
dat <- read.table(text="Time | State
01 | 0
02 | 0
03 | 0
04 | 2
05 | 2
06 | 2
07 | 2
08 | 2
09 | 1
10 | 1
11 | 1
12 | 1
13 | 1
14 | 2
15 | 2
16 | 2
17 | 2
18 | 2
19 | 0
20 | 0
21 | 0
22 | 0
23 | 0
24 | 2
25 | 2
26 | 2
27 | 2
28 | 2
29 | 2
30 | 2
31 | 1
32 | 1
33 | 1
34 | 1
35 | 1
36 | 1
37 | 2
38 | 2
39 | 2
40 | 2
41 | 0", sep="|", header=TRUE)
I have a dataset containing over 6,000 observations, each record having a score ranging from 0-100. Below is a sample:
+-----+-------+
| uID | score |
+-----+-------+
| 1 | 77 |
| 2 | 61 |
| 3 | 74 |
| 4 | 47 |
| 5 | 65 |
| 6 | 51 |
| 7 | 25 |
| 8 | 64 |
| 9 | 69 |
| 10 | 52 |
+-----+-------+
I want to bin them into deciles based on their rank order within the score column, with cutoffs at every 10th percentile, as seen below:
+-----+-------+-----------+----------+
| uID | score | position% | scoreBin |
+-----+-------+-----------+----------+
| 7 | 25 | 0.1 | 1 |
| 4 | 47 | 0.2 | 2 |
| 6 | 51 | 0.3 | 3 |
| 10 | 52 | 0.4 | 4 |
| 2 | 61 | 0.5 | 5 |
| 8 | 64 | 0.6 | 6 |
| 5 | 65 | 0.7 | 7 |
| 9 | 69 | 0.8 | 8 |
| 3 | 74 | 0.9 | 9 |
| 1 | 77 | 1 | 10 |
+-----+-------+-----------+----------+
So far I've tried cut, cut2, tapply, etc. I think I'm on the right logical path, but I have no idea how to apply them to my situation. Any help is greatly appreciated.
I would use ntile() in dplyr.
library(dplyr)
score <- c(77, 61, 74, 47, 65, 51, 25, 64, 69, 52)
ntile(score, 10)
## [1] 10 5 9 2 7 3 1 6 8 4
scoreBin <- ntile(score, 10)
In base R we can use a combination of .bincode() and quantile():
df$new <- .bincode(df$score,
                   breaks = quantile(df$score, seq(0, 1, by = 0.1)),
                   include.lowest = TRUE)
# uID score new
#1 1 77 10
#2 2 61 5
#3 3 74 9
#4 4 47 2
#5 5 65 7
#6 6 51 3
#7 7 25 1
#8 8 64 6
#9 9 69 8
#10 10 52 4
Here is a method that uses quantile together with cut to get the bins:
# include.lowest = TRUE belongs to cut() so the minimum score lands in the first bin
df$scoreBin <- as.integer(cut(df$score,
                              breaks = quantile(df$score, seq(0, 1, .1)),
                              include.lowest = TRUE))
as.integer coerces the output of cut (which is a factor) into the underlying integer.
One way to get the position percent is to use rank:
df$position <- rank(df$score) / nrow(df)
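Putting the pieces together, a short dplyr sketch that reproduces the full desired table (assuming, as in the answers above, that the data frame is named df):
library(dplyr)
df %>%
  mutate(position = rank(score) / n(),   # position% as rank over row count
         scoreBin = ntile(score, 10)) %>%  # decile bin
  arrange(score)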
I have a data frame like this:
  ID  | familysize | age | gender
------+------------+-----+-------
 1001 |     4      |  26 |   1
 1001 |     4      |  38 |   2
 1001 |     4      |  30 |   2
 1001 |     4      |   7 |   1
 1002 |     3      |  25 |   2
 1002 |     3      |  39 |   1
 1002 |     3      |  10 |   2
 1003 |     5      |  60 |   1
 1003 |     5      |  50 |   2
 1003 |     5      |  26 |   2
 1003 |     5      |  23 |   1
 1003 |     5      |  20 |   1
 1004 | ....
I want to order this data frame by the age of the people within each ID, so I use this command:
library(plyr)
b2 <- ddply(b, "ID", function(x) head(x[order(x$age, decreasing = TRUE), ], ))
but when I use this command I lose some observations. What should I do to order this data frame?
b2 <- b[order(b$ID, -b$age), ]
should do the trick. (Your ddply call likely lost rows because head() defaults to n = 6, so any ID with more than six members was truncated.)
The arrange function in plyr does a great job here: order by ID, then by age in descending order.
arrange(b, ID, desc(age))
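For completeness, a dplyr equivalent is just as short (assuming dplyr is an option for you):
library(dplyr)
b2 <- b %>% arrange(ID, desc(age))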