Counting in R and preserving the order of occurrence

Suppose I have generated a vector using the following statement:
x1 <- rep(4:1, sample(1:100,4))
Now, when I try to count the number of occurrences using either of the following commands:
count(x1)  # count() from the plyr package
x freq
1 1 40
2 2 57
3 3 3
4 4 46
or
as.data.frame(table(x1))
x1 Freq
1 1 40
2 2 57
3 3 3
4 4 46
In both cases, the order of occurrence is not preserved. I want to preserve the order of occurrence, i.e. the output should look like this:
x1 Freq
1 4 46
2 3 3
3 2 57
4 1 40
What is the cleanest way to do this? Also, is there a way to coerce a particular order?

You are looking for the rle function:
rle(x1)
## Run Length Encoding
## lengths: int [1:4] 12 2 23 52
## values : int [1:4] 4 3 2 1
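If you want the result in the question's two-column shape, the rle() result converts directly (a minimal sketch; it works because x1 consists of contiguous runs of each value):
r <- rle(x1)
data.frame(x1 = r$values, Freq = r$lengths)
##   x1 Freq
## 1  4   12
## 2  3    2
## 3  2   23
## 4  1   52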

You can order the table like this:
set.seed(42)
x1 <- rep(4:1, sample(1:100,4))
table(x1)[order(unique(x1))]
# x1
# 4 3 2 1
# 92 93 29 81

One way is to convert your variable to a factor and specify the desired order with the levels argument. From ?table: "table uses the cross-classifying factors to build a contingency table of the counts at each combination of factor levels" and "It is best to supply factors rather than rely on coercion." By converting to a factor yourself, you are in charge of the coercion and of the order set by levels.
x1 <- rep(factor(4:1, levels = 4:1), sample(1:100,4))
table(x1)
# x1
# 4 3 2 1
# 90 72 11 16
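If x1 already exists as a plain vector, you can also impose first-occurrence order after the fact by re-factoring before tabulating (a small sketch in the same spirit):
table(factor(x1, levels = unique(x1)))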

Related

Is there a way in R to specify which column in my data is groups and which is blocks in order to do a Friedman's test? I am comparing results to SPSS

I will give a sample below of how my data is organized, but every time I run Friedman's test using friedman.test(y = , groups = , blocks = ), it gives me an error that my data is not from an unreplicated complete block design, despite the fact that it is.
score treatment day
   10         1   1
   20         1   1
   40         1   1
    7         2   1
  100         2   1
   58         2   1
   98         3   1
   89         3   1
   40         3   1
   70         4   1
   10         4   1
   28         4   1
   86         5   1
  200         5   1
   40         5   1
   77         1   2
  100         1   2
   90         1   2
   33         2   2
   15         2   2
   25         2   2
   23         3   2
   54         3   2
   67         3   2
    1         4   2
    2         4   2
  400         4   2
   16         5   2
   10         5   2
   90         5   2
library(readr)
sample_data$treatment <- as.factor(sample_data$treatment) #setting treatment as categorical independent variable
sample_data$day <- as.factor(sample_data$day) #setting day as categorical independent variable
summary(sample_data)
friedman3 <- friedman.test(y = sample_data$score, groups = sample_data$treatment, blocks = sample_data$day)
summary(friedman3)
The code above gives me the error I described earlier.
However, when I convert the CSV data to a matrix, friedman.test works, but the answer seems wrong, as SPSS gives a different result for the degrees of freedom.
sample_data$treatment <- as.factor(sample_data$treatment) #converting to categorical independent variable
sample_data$day <- as.factor(sample_data$day) #converting to categorical independent variable
data = as.matrix(sample_data)
friedman.test(data)
friedman2 <- friedman.test(y = data$score, groups = data$treatment, blocks = data$day)
summary(friedman2)
Any idea what I am doing incorrectly?
I am aware that the Friedman test gives me a chi-square value, but I am also wondering how I can get the test statistic instead of the chi-square value.
I am using RStudio and I am new to R. I want to know how to specify treatment as the groups and day as the blocks.
friedman.test expects an unreplicated complete block design, i.e. exactly one observation per treatment/block combination. We could summarise the data by taking the mean of the 'score' and then use that summarised data in friedman.test:
sample_data1 <- aggregate(score ~ ., sample_data, FUN = mean)
friedman.test(sample_data1$score, groups = sample_data1$treatment,
              blocks = sample_data1$day)
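As for extracting the reported values: friedman.test returns an htest object, so summary() is not needed; print it directly or pull out its components (a small sketch using the summarised data from above). Note that the chi-squared value is the test statistic that friedman.test reports.
ft <- friedman.test(sample_data1$score, groups = sample_data1$treatment,
                    blocks = sample_data1$day)
ft$statistic  # the Friedman chi-squared test statistic
ft$parameter  # degrees of freedom
ft$p.value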

Transposing data and sequence mining the most common patterns in rows

I have a data frame that looks like this:
SFOpID Number MAGroupID
1 0032A00002cgs3XQAQ 1 99
2 0032A00002cgs3XQAQ 1 79
3 003F000001vyUGKIA2 2 8
4 0032A00002btWE6QAM 3 97
5 0032A00002btWE6QAM 3 86
6 0032A00002btWE6QAM 3 35
I need to transpose it so that it looks like this:
SFOpID Number MAGroupID
1 0032A00002cgs3XQAQ 1 99 79
3 003F000001vyUGKIA2 2 8
Then I need to generate counts for the five most common sequences, for example: 12 people (SFOpID) have the 97 86 35 sequence, but only 4 people have the 99 79 sequence. I think this may be possible with the arulesSequences package, doing something like the following:
x <- read_baskets(con = system.file("misc", "zaki.txt",
                                    package = "arulesSequences"),
                  info = c("sequenceID", "eventID", "SIZE"))
as(x, "data.frame")
The goal is to have output that looks like this:
items sequenceID eventID SIZE
1 {C,D} 1 10 2
2 {A,B,C} 1 15 3
3 {A,B,F} 1 20 3
4 {A,C,D,F} 1 25 4
5 {A,B,F} 2 15 3
Just, for items, it would be a sequence like {99, 79} or {97, 86, 35}
You can use group_by and nest to collect the values into one list column, which can then be converted to text. Here is an example:
library(dplyr)
library(tidyr)  # nest() lives in tidyr

code <- read.csv("code.csv", stringsAsFactors = FALSE)
output <- code[, 2:4] %>%
  group_by(SFOpID, Number) %>%  # group per person so the MAGroupID values are collected
  nest()
output$data <- as.character(output$data)
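To then count the most common sequences, one option (a hedged sketch, assuming the same column names as above) is to collapse each SFOpID's values into a single string and tally:
seqs <- code %>%
  group_by(SFOpID) %>%
  summarise(sequence = paste(MAGroupID, collapse = " "))
seqs %>%
  count(sequence, sort = TRUE) %>%
  head(5)  # the five most common sequences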

Subset specific row and last row from data frame

I have a data frame which contains data relating to the scores of different events. There can be a number of scoring events for one game. What I would like to do is subset the occasions when the score goes above 5 or below -5. I would also like to get the last row for each ID. So for each ID, I would have one or more rows, depending on whether the score goes above 5 or below -5. My actual data set contains many other columns of information, but if I learn how to do this then I'll be able to apply it to anything else that I may want to do.
Here is a data set
ID Score Time
1 0 0
1 3 5
1 -2 9
1 -4 17
1 -7 31
1 -1 43
2 0 0
2 -3 15
2 0 19
2 4 25
2 6 29
2 9 33
2 3 37
3 0 0
3 5 3
3 2 11
So for this data set, I would hopefully get this output:
ID Score Time
1 -7 31
1 -1 43
2 6 29
2 9 33
2 3 37
3 2 11
So at the very least, for each ID there will be one line printed with the last score for that ID, regardless of whether the score goes above 5 or below -5 during the event (this occurs for ID 3).
My attempt can subset when the value goes above 5 or below -5; I just don't know how to write code to get the last line for each ID:
Data[Data$Score > 5 | Data$Score < -5, ]
Let me know if you need any more information.
You can use rle to grab the last row for each ID. Check out ?rle for more information about this useful function.
Data2 <- Data[cumsum(rle(Data$ID)$lengths), ]
Data2
# ID Score Time
#6 1 -1 43
#13 2 3 37
#16 3 2 11
To combine the two conditions, use rbind.
Data2 <- rbind(Data[Data$Score > 5 | Data$Score < -5, ], Data[cumsum(rle(Data$ID)$lengths), ])
To get rid of rows that satisfy both conditions, you can use duplicated and rownames.
Data2 <- Data2[!duplicated(rownames(Data2)), ]
You can also sort if desired, of course.
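For instance, to restore the original ID/Time ordering after the rbind (a small sketch using the columns from the example data):
Data2 <- Data2[order(Data2$ID, Data2$Time), ]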
Here's a go at it in data.table, where df is your original data frame.
library(data.table)
setDT(df)
df[df[, c(.I[!between(Score, -5, 5)], .I[.N]), by = ID]$V1]
# ID Score Time
# 1: 1 -7 31
# 2: 1 -1 43
# 3: 2 6 29
# 4: 2 9 33
# 5: 2 3 37
# 6: 3 2 11
We are grouping by ID. The between function finds the values between -5 and 5, and we negate that to get our desired values outside that range. We then use a .I subset to get the indices per group for those. Then .I[.N] gives us the row number of the last entry, per group. We use the V1 column of that result as our row subset for the entire table. You can take unique values if unique rows are desired.
Note: .I[c(which(!between(Score, -5, 5)), .N)] could also be used in the j entry of the first operation. Not sure if it's more or less efficient.
Addition: Another method, one that uses only logical values and will never produce duplicate rows in the output, is
df[df[, .I == .I[.N] | !between(Score, -5, 5), by = ID]$V1]
# ID Score Time
# 1: 1 -7 31
# 2: 1 -1 43
# 3: 2 6 29
# 4: 2 9 33
# 5: 2 3 37
# 6: 3 2 11
Here is another base R solution.
df[as.logical(ave(df$Score, df$ID,
FUN=function(i) abs(i) > 5 | seq_along(i) == length(i))), ]
ID Score Time
5 1 -7 31
6 1 -1 43
11 2 6 29
12 2 9 33
13 2 3 37
16 3 2 11
abs(i) > 5 | seq_along(i) == length(i) constructs a logical vector that returns TRUE for each element that fits your criteria. ave applies this function to each ID. The resulting logical vector is used to select the rows of the data.frame.
Here's a tidyverse solution. Not as concise as some of the above, but easier to follow.
library(tidyverse)
lastrows <- Data %>% group_by(ID) %>% top_n(1, Time)
scorerows <- Data %>% group_by(ID) %>% filter(!between(Score, -5, 5))
bind_rows(scorerows, lastrows) %>% arrange(ID, Time) %>% unique()
# A tibble: 6 x 3
# Groups: ID [3]
# ID Score Time
# <int> <int> <int>
# 1 1 -7 31
# 2 1 -1 43
# 3 2 6 29
# 4 2 9 33
# 5 2 3 37
# 6 3 2 11

Create sequence of repeated values, in sequence?

I need a sequence of repeated numbers, i.e. 1 1 ... 1 2 2 ... 2 3 3 ... 3 etc. The way I implemented this was:
nyear <- 20
names <- c(rep(1, nyear), rep(2, nyear), rep(3, nyear), rep(4, nyear),
           rep(5, nyear), rep(6, nyear), rep(7, nyear), rep(8, nyear))
which works, but is clumsy, and obviously doesn't scale well.
How do I repeat the N integers M times each in sequence?
I tried nesting seq() and rep() but that didn't quite do what I wanted.
I can obviously write a for-loop to do this, but there should be an intrinsic way to do this!
You missed the each= argument to rep():
R> n <- 3
R> rep(1:5, each=n)
[1] 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5
R>
so your example can be done with a simple
R> rep(1:8, each=20)
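As an aside (not needed here, but it mirrors the pattern in the original question), rep() also accepts a vector for times= when the counts differ per value:
R> rep(1:3, times = c(2, 1, 3))
[1] 1 1 2 3 3 3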
Another base R option could be gl():
gl(5, 3)
Where the output is a factor:
[1] 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5
Levels: 1 2 3 4 5
If integers are needed, you can convert it:
as.numeric(gl(5, 3))
[1] 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5
For your example, Dirk's answer is perfect. If you instead had a data frame and wanted to add that sort of sequence as a column, you could also use group from groupdata2 (disclaimer: my package) to greedily divide the datapoints into groups.
# Attach groupdata2
library(groupdata2)
# Create a random data frame
df <- data.frame("x" = rnorm(27))
# Create groups with 5 members each (except last group)
group(df, n = 5, method = "greedy")
x .groups
<dbl> <fct>
1 0.891 1
2 -1.13 1
3 -0.500 1
4 -1.12 1
5 -0.0187 1
6 0.420 2
7 -0.449 2
8 0.365 2
9 0.526 2
10 0.466 2
# … with 17 more rows
There's a whole range of methods for creating this kind of grouping factor, e.g. by number of groups, by a list of group sizes, or by starting a new group whenever the value in some column differs from the value in the previous row (e.g. if a column is c("x","x","y","z","z"), the grouping factor would be c(1,1,2,3,3)); a sketch of that last idea follows below.
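For illustration, here is a base R equivalent of that row-change grouping (a minimal sketch, not the groupdata2 implementation):
x <- c("x", "x", "y", "z", "z")
cumsum(x != c(x[1], head(x, -1))) + 1  # new group whenever the value changes
# [1] 1 1 2 3 3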

Match dataframe rows according to two variables (Indexing)

I am essentially trying to get disorganized data into long form for linear modeling.
I have 2 data.frames "rec" and "book"
Each row in "book" needs to be pasted onto the end of several of the rows of "rec" according to two variables in the row: "MRN" and "COURSE" which match.
I have tried the following and variations thereon to no avail:
i <- 1
newlist <- list()
colnames(newlist) <- colnames(book)
for (i in 1:dim(rec)[1]) {
  mrn <- as.numeric(as.vector(rec$MRN[i]))
  course <- as.character(rec$COURSE[i])
  get.vector <- as.vector((as.numeric(as.vector(book$MRN)) == mrn) &
                            (as.character(book$COURSE) == course))
  newlist[i] <- book[get.vector, ]
  i <- i + 1
}
I would welcome any suggestions on:
1) getting this to work
2) making it more elegant (or perhaps just less clumsy)
If I have been unclear in any way, I beg your pardon.
I do understand I haven't combined any data above; I think if I can generate a long-format data.frame, I can combine them all on my own.
Sounds like you need to merge the two data frames. Try this:
merge(rec, book, by = c('MRN', 'COURSE'))
and do read the help for merge (by doing ?merge at the R console) for more options on how to merge these.
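A small self-contained sketch with made-up MRN/COURSE values (hypothetical data, just to show the shape of the result):
rec  <- data.frame(MRN = c(1, 1, 2), COURSE = c("A", "B", "A"))
book <- data.frame(MRN = c(1, 2), COURSE = c("A", "A"), note = c("x", "y"))
merge(rec, book, by = c("MRN", "COURSE"))
#   MRN COURSE note
# 1   1      A    x
# 2   2      A    y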
I've created a simple example that may help you. In my case I wanted to paste the 'value' column from df1 into each row of df2, according to variables x1 and x2:
df1 <- read.table(textConnection("
x1 x2 value
1 2 12
1 3 56
2 1 35
2 2 68
"),header=T)
df2 <- read.table(textConnection("
test x1 x2
1 1 2
2 1 3
3 2 1
4 2 2
5 1 2
6 1 3
7 2 1
"),header=T)
library(sqldf)
sqldf("select df2.*, df1.value from df2 join df1 using(x1,x2)")
test x1 x2 value
1 1 1 2 12
2 2 1 3 56
3 3 2 1 35
4 4 2 2 68
5 5 1 2 12
6 6 1 3 56
7 7 2 1 35
