I have this data.frame:
structure(list(X0 = c(9, 13, 13, 13, 35, 36, 37, 38, 39, 40,
40, 42, 43, 44), X0.1 = c(10, 40, 45, 46, 36, 37, 38, 40, 46,
45, 46, 43, 44, 46)), .Names = c("A", "B"), row.names = c(NA,
14L), class = "data.frame")
A B
1 9 10
2 13 40
3 13 45
4 13 46
5 35 36
6 36 37
7 37 38
8 38 40
9 39 46
10 40 45
11 40 46
12 42 43
13 43 44
14 44 46
I want to create sets like this: rows 2, 3 and 4 have 13, so they will be grouped into one set (13,40,45,46).
If any further row has even one member in common with this set, both members of that row will be included in the set.
Since row 8 has 40 in common with the above set, its other member is included as well: (13,40,45,46,38).
Now row 7 has one member (38) in common with this set, so its other member (37) is also included. The set becomes (13,40,45,46,38,37).
If neither of the two members of a row is common to any existing set, they form their own set. For example, row 1 has 9 and 10, neither of which appears in any other row, so they form one set (9,10).
At the end I want to print out all the sets.
Can I accomplish this in R? Thanks for your help.
Is this what you want?
f <- function(s, v) {
  # rows of s sharing at least one value with the current set v
  m <- which(s$A %in% v | s$B %in% v)
  if (length(m) == 0) v
  else Recall(s[-m, ], sort(unique(c(v, unlist(s[m, ])))))
}
done <- c()
for (n in unique(unlist(d))) {  # d is your data.frame
  if (n %in% done) next
  r <- f(d, n)
  done <- c(done, r)
  cat("(", r, ") ")
}
It outputs:
( 9 10 ) ( 13 35 36 37 38 39 40 42 43 44 45 46 )
Updated
done <- c()
ret <- list()
for (n in unique(unlist(d))) {
  if (n %in% done) next
  r <- f(d, n)
  done <- c(done, r)
  cat("(", r, ") ")
  ret <- c(ret, list(r))
}
then,
> ret
[[1]]
[1] 9 10
[[2]]
[1] 13 35 36 37 38 39 40 42 43 44 45 46
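For what it's worth, this is essentially a connected-components problem, so a graph-based sketch gives the same sets (assuming d is the data frame from the question and the igraph package is available):
library(igraph)

# Each row is an edge between its two values; the sets are the connected components
g <- graph_from_data_frame(d, directed = FALSE)
cmp <- components(g)
split(as.numeric(names(cmp$membership)), cmp$membership)
# two components: (9, 10) and (13, 35, 36, 37, 38, 39, 40, 42, 43, 44, 45, 46)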
I am not able to find a way to generate a new column based on if conditions for groups of events in a column.
The column called "BF" represents the (i-3)th value of the flow column and is the same BF for every row of an "event" group. For example, in row 5 the value of "BF" is 39, which is the 3rd previous value of the flow column (the flow of row 2), repeated for all the "2"s in the event column.
The problem is that BF[i] can't be bigger than flow[i]. If BF[i] is bigger than flow[i], then BF should step back to flow[i-4], flow[i-5], flow[i-6]... until BF[i] is equal to or smaller than flow[i]. For example, in row 10 the value of "BF" is bigger than the value of "flow", so the value of BF_1 (the column I want to create) in row 10 is 37, the first earlier flow value that is not larger, in this case flow[i-6].
As an example, we have the following dataframe:
flow<- c(40, 39, 38, 37, 50, 49, 46, 44, 43, 45, 40, 30, 80, 75, 50, 55, 53, 51, 49, 100)
event<- c(1,1,1,1,2,2,2,2,2,3,3,3,4,4,4,5,5,5,5,6)
BF<- c(NA, NA, NA, NA, 39, 39, 39, 39, 39, 46, 46, 46, 45, 45, 45, 80, 80, 80, 80, 53)
a<- data.frame(flow, event, BF)
This is the desired output I'm looking for. I want to create the BF_1 column.
flow event BF BF_1
1 40 1 NA NA
2 39 1 NA NA
3 38 1 NA NA
4 37 1 NA NA
5 50 2 39 39
6 49 2 39 39
7 46 2 39 39
8 44 2 39 39
9 43 2 39 39
10 45 3 46 37
11 40 3 46 37
12 30 3 46 37
13 80 4 45 45
14 75 4 45 45
15 50 4 45 45
16 55 5 80 30
17 53 5 80 30
18 51 5 80 30
19 49 5 80 30
20 100 6 53 53
Is there a possible way to generate the column BF_1? Please let me know any thoughts. I am working with for loops and if conditions, but I am not able to hold the BF value for the entire group in the event column.
The coding is a bit inefficient (could have used dplyr etc.), but it will do the work and matches the BF_1 column given:
flow <- c(40, 39, 38, 37, 50, 49, 46, 44, 43, 45, 40, 30, 80, 75, 50, 55, 53, 51, 49, 100)
event <- c(1,1,1,1,2,2,2,2,2,3,3,3,4,4,4,5,5,5,5,6)
BF <- c(NA, NA, NA, NA, 39, 39, 39, 39, 39, 46, 46, 46, 45, 45, 45, 80, 80, 80, 80, 53)
a <- data.frame(flow, event, BF)
a$BF_1 <- NA  # default to NA first
for (i in 1:length(unique(a$event))) {
  if (is.na(a[a$event == i, "BF"][1])) next
  if (a[a$event == i, "BF"][1] < a[a$event == i, "flow"][1]) a[a$event == i, "BF_1"] <- a[a$event == i, "BF"][1]
  if (a[a$event == i, "BF"][1] > a[a$event == i, "flow"][1]) {
    head <- min(which(a$event == i)) - 6
    if (head < 1) head <- 1  # make sure it doesn't run past row 1
    # fill with the min of the subset of the flow column over that window
    a[a$event == i, "BF_1"] <- min(a[head:min(which(a$event == i)), "flow"])
  }
}
a
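A more direct sketch of the same back-stepping idea, in case it helps (assuming the data frame a built above): for the first row of each event, start at lag 3 and keep stepping back until the lagged flow is no larger than the current flow.
a$BF_1 <- NA
for (e in unique(a$event)) {
  i <- min(which(a$event == e))   # first row of this event group
  k <- 3                          # start from the 3rd previous flow value
  while (i - k >= 1 && a$flow[i - k] > a$flow[i]) k <- k + 1
  if (i - k >= 1) a$BF_1[a$event == e] <- a$flow[i - k]  # otherwise it stays NA
}
a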
One tidyverse possibility could be:
a %>%
left_join(crossing(a, a) %>%
filter(event > event1) %>%
group_by(event) %>%
filter(flow == first(flow)) %>%
slice(1:(n() - 3)) %>%
slice(which.max(cumsum(flow > flow1))) %>%
ungroup() %>%
transmute(event,
flow_flag = flow1), by = c("event" = "event")) %>%
mutate(BF_1 = ifelse(lag(flow, 3) > flow, flow_flag, lag(flow, 3))) %>%
group_by(event) %>%
mutate(BF_1 = first(BF_1)) %>%
select(-flow_flag)
flow event BF BF_1
<dbl> <dbl> <dbl> <dbl>
1 40 1 NA NA
2 39 1 NA NA
3 38 1 NA NA
4 37 1 NA NA
5 50 2 39 39
6 49 2 39 39
7 46 2 39 39
8 44 2 39 39
9 43 2 39 39
10 45 3 46 37
11 40 3 46 37
12 30 3 46 37
13 80 4 45 45
14 75 4 45 45
15 50 4 45 45
16 55 5 80 30
17 53 5 80 30
18 51 5 80 30
19 49 5 80 30
20 100 6 53 53
It may be overcomplicated, but what it does is: first, it creates all combinations of values (as the desired value can theoretically be anywhere in the data). Second, it identifies the first case per group fulfilling the condition (ignoring the 3rd previous value). Finally, it combines this with the original df: if the 3rd previous value per group fulfils the condition it is returned, otherwise the first value that is smaller than the actual value is returned.
I want to extract specific rows of my dataframe, following a sequence of rownumbers.
The sequence should be:
7, 14, 21, 31, 38, 45, 55, 62, 69.....until 8760.
So it always starts at row 7 and then goes +7, +7, +10, and this pattern should be repeated until the end.
I know rep and seq, but I don't know how to deal with that +10 after the +7.
Any ideas?
Try
x <- rep(c(7, 10), c(2, 1))
out <- cumsum(c(7, rep(x, ceiling(8760 / sum(x)))))
Result
head(out, 10)
# [1] 7 14 21 31 38 45 55 62 69 79
tail(out)
# [1] 8726 8733 8743 8750 8757 8767
If you want out to end at 8760 you might do
c(out[out < 8760], 8760)
We can use rep
x1 <- rep(c(7, 10), c(2, 1))
out <- cumsum(c(7, rep(x1, 8760 %/% sum(x1))))
out1 <- out[out < 8760]
head(out1, 10)
#[1] 7 14 21 31 38 45 55 62 69 79
tail(out1, 10)
#[1] 8685 8695 8702 8709 8719 8726 8733 8743 8750 8757
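Another way to look at the pattern, for what it's worth: each block of (7, 14, 21) is shifted by 24 relative to the previous block, so outer() builds the same sequence (a sketch, trimmed to values up to 8760):
out2 <- c(outer(c(7, 14, 21), seq(0, 8760, by = 24), "+"))  # 7 14 21, 31 38 45, ...
out2 <- out2[out2 <= 8760]
head(out2, 10)
# [1]  7 14 21 31 38 45 55 62 69 79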
I am trying to create an algorithm that is essentially a function of this data frame.
This is the code I was using, but it doesn't seem to work.
I need image_id to be the independent variable so that when I input 7 into the function, I get back 10 and 15. If I were to input 8, I would get back 11 and 13.
num = function(image_id, category_id, data = categories) {x->y}
This is the data frame that I am using.
   category_id image_id cat_to_img_last_update
1           15        7                   NULL
2           11        8                   NULL
3           13        8                   NULL
4           10        7                   NULL
5           35        9                   NULL
6           78        9                   NULL
7          112       10                   NULL
8           61       10                   NULL
9           86       11                   NULL
10         101       11                   NULL
11          61       12                   NULL
12          86       12                   NULL
You probably don't need a function for this, but if you really want, here is what it would look like:
# Read in data
categories <-
data.frame(category_id = c(15,11,13,10,35,78,112,61,86,101,61,86),
image_id = c(7,8,8,7,9,9,10,10,11,11,12,12),
stringsAsFactors = FALSE)
num <- function(image_id, data = categories) {
data$category_id[data$image_id == image_id]
}
num(7) # 15 10
num(8) # 11 13
df <- data.frame(
category_id = c(15, 11, 13, 10, 35, 78, 112, 61, 86, 101, 61, 86),
image_id = c(7, 8, 8, 7, 9, 9, 10, 10, 11, 11, 12, 12)
)
myfun <- function(num) { sort(df[df$image_id == num, "category_id"]) }
myfun(7)
myfun(8)
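If you need the lookup for every image_id at once, split() builds the whole list in one call (a sketch using the df defined just above):
lookup <- split(df$category_id, df$image_id)
lookup[["7"]]  # 15 10
lookup[["8"]]  # 11 13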
I have this data frame :
> df
Z freq proba
1 17 1 0.0033289263
2 18 4 0.0055569026
3 19 2 0.0087878028
4 20 3 0.0132023556
5 21 16 0.0188900561
6 22 12 0.0257995234
7 23 30 0.0337042731
8 24 41 0.0421963455
9 25 56 0.0507149437
10 26 65 0.0586089198
11 27 65 0.0652230449
12 28 93 0.0699913154
13 29 82 0.0725182432
14 30 94 0.0726318551
15 31 72 0.0703990113
16 32 74 0.0661024717
17 33 58 0.0601873020
18 34 66 0.0531896431
19 35 38 0.0456625487
20 36 45 0.0381117389
21 37 27 0.0309498221
22 38 17 0.0244723502
23 39 15 0.0188543771
24 40 13 0.0141629367
25 41 4 0.0103793600
26 42 1 0.0074254435
27 43 2 0.0051886582
28 45 1 0.0023658767
29 46 1 0.0015453804
30 49 2 0.0003792308
# Here is my data:
> dput(df)
structure(list(Z = c(17, 18, 19, 20, 21, 22, 23, 24, 25, 26,
27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42,
43, 45, 46, 49), freq = c(1, 4, 2, 3, 16, 12, 30, 41, 56, 65,
65, 93, 82, 94, 72, 74, 58, 66, 38, 45, 27, 17, 15, 13, 4, 1,
2, 1, 1, 2), proba = c(0.0033289262662263, 0.00555690264007235,
0.00878780282243439, 0.0132023555702843, 0.0188900560866825,
0.0257995234198431, 0.0337042730520012, 0.0421963455163949, 0.0507149437492447,
0.0586089198012906, 0.0652230449359029, 0.0699913153996099, 0.0725182432348992,
0.0726318551493006, 0.0703990113442269, 0.0661024716831246, 0.0601873020200862,
0.0531896430528685, 0.045662548708844, 0.0381117389181843, 0.030949822142559,
0.0244723501557229, 0.01885437705459, 0.0141629366839816, 0.0103793599644779,
0.00742544354411115, 0.00518865818999788, 0.00236587669133322,
0.00154538036835848, 0.000379230768851682)), .Names = c("Z",
"freq", "proba"), row.names = c(NA, -30L), class = "data.frame")
I want to merge each line whose "freq" value is < 5 with the following line, and keep merging as long as the next line is also < 5.
I don't know if I'm clear enough, so this is the output I expect:
> df2
labels effectifs pi
1 17;20 10 0.03087599
2 21 16 0.01889006
3 22 12 0.02579952
4 23 30 0.03370427
5 24 41 0.04219635
6 25 56 0.05071494
7 26 65 0.05860892
8 27 65 0.06522304
9 28 93 0.06999132
10 29 82 0.07251824
11 30 94 0.07263186
12 31 72 0.07039901
13 32 74 0.06610247
14 33 58 0.06018730
15 34 66 0.05318964
16 35 38 0.04566255
17 36 45 0.03811174
18 37 27 0.03094982
19 38 17 0.02447235
20 39 15 0.01885438
21 40 13 0.01416294
22 41;49 11 0.02728395
I did it with nested while loops, but I find this solution very painful and unoptimized.
i <- 1
freqs <- c()
labels <- c()
pi <- c()
while (i < nrow(df)) {
  if (df$freq[i] >= 5) {
    freqs <- c(freqs, df$freq[i])
    labels <- c(labels, df$Z[i])
    pi <- c(pi, df$proba[i])
    i <- i + 1
  } else {
    count <- df$freq[i]
    countPi <- df$proba[i]
    k <- i
    j <- i
    while (df$freq[i] < 5 & i < nrow(df)) {
      if (df$freq[i + 1] < 5) {
        count <- count + df$freq[i + 1]
        countPi <- countPi + df$proba[i + 1]
        j <- i + 1
      }
      i <- i + 1
    }
    labels <- c(labels, paste0(df$Z[k], ";", df$Z[j]))
    freqs <- c(freqs, count)
    pi <- c(pi, countPi)
  }
}
df2 <- data.frame(labels, freqs, pi)
I'm sure there is something far better, maybe with dplyr. If you have a better solution, thanks!
We could use the "devel" version of "data.table", as new functions are introduced there (rleid). Here, we convert the "data.frame" to "data.table" (setDT(df)) and create a grouping variable ("gr") with rleid, based on the logical index (freq < 5). The 'Z' column is of 'numeric/integer' class, so we create a character column ("Z1") from "Z". Grouped by 'gr', if "freq" is less than 5 for all the elements of that group, we summarise the rows to a single row by taking the first observation of the columns (.SD[1L]), remove the unwanted column (as .SD includes "Z1", which would result in duplicate columns), and append the "Z1" we get from pasting the min and max value of "Z" for that group. Otherwise, we leave the group unchanged (else .SD). Finally, we remove the columns we don't need by assigning them to "NULL".
library(data.table) #data.table_1.9.5
res <- setDT(df)[, gr:=rleid(freq<5)][, Z1:= as.character(Z)][,
if(all(freq<5)) c(.SD[1L][,-4, with=FALSE],
list(Z1=toString(c(min(Z), max(Z)))))
else .SD, gr][,1:2 :=NULL][]
head(res,3)
# freq proba Z1
#1: 1 0.003328926 17, 20
#2: 16 0.018890056 21
#3: 12 0.025799523 22
Since this is a dplyr question, here is a dplyr solution. First I used a grouping function to define the groups (similar to the rleid function in data.table). Then the summarising step is fairly simple.
# grouping function
grouping <- function(condition){
# calculate runs for grouping
run <- rle((!condition) * 1:length(condition))
# revalue
run$values <- seq_along(run$values)
# invert to get grouping
inverse.rle(run)
}
# load dplyr
require(dplyr)
df %>%
mutate(group = grouping(freq<5)) %>% # add groups
group_by(group) %>% # group data
summarize(freq = sum(freq), # sum freq
proba = sum(proba), # sum proba
Z = toString(unique(range(Z)))) %>% # collapse Z into a range label
mutate(group=NULL) # remove groups
## Source: local data table [22 x 3]
##
## freq proba Z
## 1 10 0.03087599 17, 20
## 2 16 0.01889006 21
## 3 12 0.02579952 22
## 4 30 0.03370427 23
## 5 41 0.04219635 24
## 6 56 0.05071494 25
## 7 65 0.05860892 26
## 8 65 0.06522304 27
## 9 93 0.06999132 28
## 10 82 0.07251824 29
## .. ... ... ...
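If you prefer not to define a helper function, a cumsum() over a lagged condition gives an equivalent run id (a sketch with dplyr, using the same df):
library(dplyr)

df %>%
  mutate(group = cumsum(!(freq < 5 & lag(freq < 5, default = FALSE)))) %>%
  group_by(group) %>%
  summarize(Z = toString(unique(range(Z))),  # "17, 20" for merged runs, "21" for single rows
            freq = sum(freq),
            proba = sum(proba)) %>%
  select(-group)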
I'm new to R. I need to generate a simple frequency table (as in books) with cumulative frequency and relative frequency.
So I want to generate from some simple data like
> x
[1] 17 17 17 17 17 17 17 17 16 16 16 16 16 18 18 18 10 12 17 17 17 17 17 17 17 17 16 16 16 16 16 18 18 18 10
[36] 12 15 19 20 22 20 19 19 19
a table like:
frequency cumulative relative
(9.99,11.7] 2 2 0.04545455
(11.7,13.4] 2 4 0.04545455
(13.4,15.1] 1 5 0.02272727
(15.1,16.9] 10 15 0.22727273
(16.9,18.6] 22 37 0.50000000
(18.6,20.3] 6 43 0.13636364
(20.3,22] 1 44 0.02272727
I know it should be simple, but I don't know how.
I got some results using this code:
factorx <- factor(cut(x, breaks=nclass.Sturges(x)))
as.matrix(table(factorx))
You're close! There are a few functions that will make this easy for you, namely cumsum() and prop.table(). Here's how I'd probably put this together. I make some random data, but the point is the same:
#Fake data
x <- sample(10:20, 44, TRUE)
#Your code
factorx <- factor(cut(x, breaks=nclass.Sturges(x)))
#Tabulate and turn into data.frame
xout <- as.data.frame(table(factorx))
#Add cumFreq and proportions
xout <- transform(xout, cumFreq = cumsum(Freq), relative = prop.table(Freq))
#-----
factorx Freq cumFreq relative
1 (9.99,11.4] 11 11 0.25000000
2 (11.4,12.9] 3 14 0.06818182
3 (12.9,14.3] 11 25 0.25000000
4 (14.3,15.7] 2 27 0.04545455
5 (15.7,17.1] 6 33 0.13636364
6 (17.1,18.6] 3 36 0.06818182
7 (18.6,20] 8 44 0.18181818
The base functions table, cumsum and prop.table should get you there:
cbind( Freq=table(x), Cumul=cumsum(table(x)), relative=prop.table(table(x)))
Freq Cumul relative
10 2 2 0.04545455
12 2 4 0.04545455
15 1 5 0.02272727
16 10 15 0.22727273
17 16 31 0.36363636
18 6 37 0.13636364
19 4 41 0.09090909
20 2 43 0.04545455
22 1 44 0.02272727
With cbind and naming of the columns to your liking, this should be pretty easy for you in the future. The result here is a matrix, since that is what cbind returns. If this were being done on something big, it would be more efficient to do this:
tbl <- table(x)
cbind( Freq=tbl, Cumul=cumsum(tbl), relative=prop.table(tbl))
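The same idea applied to the binned factor from the question, if you want the interval labels rather than the raw values (assuming factorx as defined earlier):
tbl <- table(factorx)
cbind(frequency = tbl, cumulative = cumsum(tbl), relative = prop.table(tbl))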
If you are looking for something pre-packaged, consider the freq() function from the descr package.
library(descr)
x = c(sample(10:20, 44, TRUE))
freq(x, plot = FALSE)
Or to get cumulative percents, use the ordered() function
freq(ordered(x), plot = FALSE)
To add a "cumulative frequencies" column:
tab = as.data.frame(freq(ordered(x), plot = FALSE))
CumFreq = cumsum(tab[-dim(tab)[1],]$Frequency)
tab$CumFreq = c(CumFreq, NA)
tab
If your data has missing values, a valid percent column is added to the table.
x = c(sample(10:20, 44, TRUE), NA, NA)
freq(ordered(x), plot = FALSE)
Yet another possibility:
library(SciencesPo)
x = c(sample(10:20, 50, TRUE))
freq(x)
My suggestion is to check out the agricolae package:
library(agricolae)
weight <- c(68, 53, 69.5, 55, 71, 63, 76.5, 65.5, 69, 75, 76, 57, 70.5,
            71.5, 56, 81.5, 69, 59, 67.5, 61, 68, 59.5, 56.5, 73,
            61, 72.5, 71.5, 59.5, 74.5, 63)
h1<- graph.freq(weight,col="yellow",frequency=1,las=2,xlab="h1")
print(summary(h1),row.names=FALSE)