Custom sorting of a dataframe in R - r

I have a binomial dataset that looks like this:
df <- data.frame(replicate(4,sample(1:200,1000,rep=TRUE)))
addme <- data.frame(replicate(1,sample(0:1,1000,rep=TRUE)))
df <- cbind(df,addme)
df <-df[order(df$replicate.1..sample.0.1..1000..rep...TRUE..),]
The data is currently sorted so that all rows belonging to the 0 group come first, followed by the rows belonging to the 1 group. Is there a way I can sort the data in a 0-1-0-1-0... fashion? I mean a row that belongs to the 0 group, then a row belonging to the 1 group, then the 0 group again, and so on...
All I can think about is complex functions. I hope there's a simple way around it.
Thank you,

Here's an attempt, which will add any extra 1's at the end:
First make some example data:
set.seed(2)
df <- data.frame(replicate(4,sample(1:200,10,rep=TRUE)),
addme=sample(0:1,10,rep=TRUE))
Then order:
with(df, df[unique(as.vector(rbind(which(addme==0),which(addme==1)))),])
# X1 X2 X3 X4 addme
#2 141 48 78 33 0
#1 37 111 133 3 1
#3 115 153 168 163 0
#5 189 82 70 103 1
#4 34 37 31 174 0
#6 189 171 98 126 1
#8 167 46 72 57 0
#7 26 196 30 169 1
#9 94 89 193 134 1
#10 110 15 27 31 1
#Warning message:
#In rbind(which(addme == 0), which(addme == 1)) :
# number of columns of result is not a multiple of vector length (arg 1)

Here's another way using dplyr, which would make it suitable for within-group ordering. It's also probably pretty quick. If the numbers of 0's and 1's are unbalanced, the surplus rows are left at the end.
library(dplyr)
df %>%
arrange(addme) %>%
mutate(n0 = sum(addme == 0),
orderme = seq_along(addme) - (n0 * addme) + (0.5 * addme)) %>%
arrange(orderme) %>%
select(-n0, -orderme)
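For comparison, the same interleaving can be sketched in base R by ranking the rows within each group and ordering on that rank. This is only an illustration and not taken from the answers above: the small data frame is made-up example data and rank_in_group is a name introduced here.

```r
# Hypothetical example data: three 0's and four 1's
df <- data.frame(id = 1:7, addme = c(0, 0, 1, 1, 1, 0, 1))

# Position of each row within its own group (1st zero, 2nd zero, 1st one, ...)
rank_in_group <- ave(seq_along(df$addme), df$addme, FUN = seq_along)

# Order by within-group position first, then by group,
# so rows alternate 0, 1, 0, 1, ... and surplus rows end up last
interleaved <- df[order(rank_in_group, df$addme), ]
interleaved$addme
```

As with the answers above, any surplus rows of the larger group are left at the end.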

Related

Assigning unique ID to records based on certain difference between values in consecutive rows using loop in r

This is my df (data.frame)
Time <- c("16:04:56", "16:04:59", "16:05:02", "16:05:04", "16:05:11", "16:05:13", "16:07:59", "16:08:09", "16:09:03", "16:09:51", "16:11:10")
Distance <- c(45,38,156,157,37,159,79,79,78,160,78)
df <- data.frame(Time, Distance); df
Time Distance
16:04:56 45
16:04:59 38
16:05:02 156
16:05:04 157
16:05:11 37
16:05:13 159
16:07:59 79
16:08:09 79
16:09:03 78
16:09:51 160
16:11:10 78
I need to assign an ID to each record based on two conditions:
If the absolute difference between two consecutive rows of the Time column is more than 1 minute, and
If the absolute difference between two consecutive rows of the Distance column is more than 10.
A new ID should be assigned only when both conditions are satisfied.
Results should be like this
Time Distance ID
16:04:56 45 1
16:04:59 38 1
16:05:02 156 1
16:05:04 157 1
16:05:11 37 1
16:05:13 159 1
16:07:59 79 2
16:08:09 79 2
16:09:03 78 2
16:09:51 160 2
16:11:10 78 3
Thanks to all who contribute any thoughts.
Change Time column to POSIXct format. Take difference between consecutive rows for Time and Distance column and increment the count using cumsum.
library(dplyr)
df %>%
mutate(Time1 = as.POSIXct(Time, format = '%T'),
ID = cumsum(
abs(difftime(Time1, lag(Time1, default = first(Time1)), units = 'mins')) > 1 &
abs(Distance - lag(Distance, default = first(Distance))) > 10) + 1) %>%
select(-Time1)
# Time Distance ID
#1 16:04:56 45 1
#2 16:04:59 38 1
#3 16:05:02 156 1
#4 16:05:04 157 1
#5 16:05:11 37 1
#6 16:05:13 159 1
#7 16:07:59 79 2
#8 16:08:09 79 2
#9 16:09:03 78 2
#10 16:09:51 160 2
#11 16:11:10 78 3
data
df <-data.frame(Time,Distance)
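The same logic can also be sketched in base R with diff and cumsum. This is a rough equivalent rather than the answerer's code; tm, gap_min and gap_dist are names introduced here, and the UTC time zone is an assumption made only so that daylight saving changes cannot affect the differences.

```r
Time <- c("16:04:56", "16:04:59", "16:05:02", "16:05:04", "16:05:11", "16:05:13",
          "16:07:59", "16:08:09", "16:09:03", "16:09:51", "16:11:10")
Distance <- c(45, 38, 156, 157, 37, 159, 79, 79, 78, 160, 78)
df <- data.frame(Time, Distance)

# Parse the clock times; only differences matter, so the date part is irrelevant
tm <- as.POSIXct(df$Time, format = "%H:%M:%S", tz = "UTC")

# Gap to the previous row (0 for the first row): minutes and distance units
gap_min  <- c(0, abs(diff(as.numeric(tm)))) / 60
gap_dist <- c(0, abs(diff(df$Distance)))

# Start a new ID whenever both gaps exceed their thresholds
df$ID <- cumsum(gap_min > 1 & gap_dist > 10) + 1
df
```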

R group data into equal groups with a metric variable

I'm struggling to get a well-performing script for this problem: I have a table with score, x and y. I want to sort the table by score and then build groups based on the x value. Each group should have an equal sum (not count) of x. x is a metric number in the dataset and represents the historic turnover of a customer.
score x y
0.436024136 3 435
0.282303336 46 56
0.532358015 24 34
0.644236597 0 2
0.99623626 0 4
0.557673456 56 46
0.08898779 0 7
0.702941303 453 2
0.415717835 23 1
0.017497461 234 3
0.426239166 23 59
0.638896238 234 86
0.629610596 26 68
0.073107526 0 35
0.85741877 0 977
0.468612039 0 324
0.740704267 23 56
0.720147257 0 68
0.965212467 23 0
A good way to do so is to add a group variable to the data frame (called df here) with cumsum! Now you can easily sum the groups with e.g. subset.
df$group <- cumsum(as.numeric(df$x)) %/% ceiling(sum(df$x) / 3) + 1
Remarks:
cumsum(as.numeric()) also works reliably on big data frames, where cumsum on integers can overflow
%/% is integer division, so you get a whole number back
the '+1' just lets your groups start at 1 instead of 0
Thank you @Ronak Shah!
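To make the cumsum idea concrete, here is a small sketch using the score and x columns from the question (y is left out because it plays no role). The names scores, n_groups, target and group_sums are introduced here, and sorting by ascending score is an assumption, since the question does not state the direction.

```r
scores <- data.frame(
  score = c(0.436024136, 0.282303336, 0.532358015, 0.644236597, 0.99623626,
            0.557673456, 0.08898779, 0.702941303, 0.415717835, 0.017497461,
            0.426239166, 0.638896238, 0.629610596, 0.073107526, 0.85741877,
            0.468612039, 0.740704267, 0.720147257, 0.965212467),
  x = c(3, 46, 24, 0, 0, 56, 0, 453, 23, 234, 23, 234, 26, 0, 0, 0, 23, 0, 23)
)

# Sort by score first (ascending is an assumption)
scores <- scores[order(scores$score), ]

# Target sum per group, then bucket the running total of x
n_groups <- 3
target <- ceiling(sum(scores$x) / n_groups)
scores$group <- cumsum(as.numeric(scores$x)) %/% target + 1

# Sum of x per group
group_sums <- tapply(scores$x, scores$group, sum)
group_sums
```

The group sums come out only roughly equal (353, 316 and 499 here), because a single large x value cannot be split across two groups.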

If() statement in R

I am not very experienced in if statements and loops in R.
Probably you can help me to solve my problem.
My task is to add +1 to df$fz if sum(df$fz) < 450, but at the same time I have to add +1 only to the max values in df$fz, and only for as long as sum(df$fz) is lower than 450.
Here is my df
ID_PP <- c(3,6, 22, 30, 1234456)
z <- c(12325, 21698, 21725, 8378, 18979)
fz <- c(134, 67, 70, 88, 88)
df <- data.frame(ID_PP,z,fz)
After mutating the new column df$new_value, it should look like 134 68 71 88 89
At this moment I have this code, but it adds +1 to all values.
if (sum(df$fz ) < 450) {
mutate(df, new_value=fz+1)
}
I know that I can pick top_n(3, z) and add +1 only to this top, but it is not what I want, because in that case I have to pick a top manually after checking sum(df$fz)
From what I understood from @Oksana's question and comments, we can probably do it this way:
library(tidyverse)
# data
vru <- data.frame(
id = c(3, 6, 22, 30, 1234456),
z = c(12325, 21698, 21725, 8378, 18979),
fz = c(134, 67, 70, 88, 88)
)
# solution
vru %>% #
top_n(450 - sum(fz), z) %>% # subset by top z, if sum(fz) == 450 -> NULL
mutate(fz = fz + 1) %>% # increase fz by 1 for the subset
bind_rows( #
anti_join(vru, ., by = "id"), # take rows from vru which are not in subset
. # take subset with transformed fz
) %>% # bind those subsets
arrange(id) # sort rows by id
# output
id z fz
1 3 12325 134
2 6 21698 68
3 22 21725 71
4 30 8378 88
5 1234456 18979 89
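The same idea can be sketched in base R without any joins: work out how many +1's are missing, then add them to the rows with the largest z. This is a hedged sketch under the same reading of the question as the answer above; k and top are names introduced here, and it assumes the shortfall is never larger than the number of rows (otherwise a single pass of +1's is not enough).

```r
df <- data.frame(
  ID_PP = c(3, 6, 22, 30, 1234456),
  z     = c(12325, 21698, 21725, 8378, 18979),
  fz    = c(134, 67, 70, 88, 88)
)

# How many +1's are needed to reach 450 (never negative, never more than nrow)
k <- min(max(0, 450 - sum(df$fz)), nrow(df))

# Row indices of the k largest z values
top <- order(df$z, decreasing = TRUE)[seq_len(k)]

# Copy fz and bump only the selected rows
df$new_value <- df$fz
df$new_value[top] <- df$new_value[top] + 1
df
```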
The clarifications in the comments helped. Let me know if this works for you. Of course, you can drop the cumsum_fz and leftover columns.
# Making variables to use in the calculation
df <- df %>%
arrange(fz) %>%
mutate(cumsum_fz = cumsum(fz),
leftover = 450 - cumsum_fz)
# Find the minimum, non-negative value to use for select values that need +1
min_pos <- min(df$leftover[df$leftover > 0])
# Creating a vector that adds 1 using the min_pos value and keeps
# the other values the same
df$new_value <- c((head(sort(df$fz), min_pos) + 1), tail(sort(df$fz), length(df$fz) - min_pos))
# Checking the sum of the new value
> sum(df$new_value)
[1] 450
>
> df
ID_PP z fz cumsum_fz leftover new_value
1 6 21698 67 67 383 68
2 22 21725 70 137 313 71
3 30 8378 88 225 225 89
4 1234456 18979 88 313 137 88
5 3 12325 134 447 3 134
EDIT:
Because utubun already posted a great tidyverse solution, I am going to translate my first one completely to base (it was a bit sloppy to mix the two anyway). Same logic as above, and using the data OP provided.
> # Using base
> df <- df[order(fz),]
>
> leftover <- 450 - cumsum(fz)
> min_pos <- min(leftover[leftover > 0])
> df$new_value <- c((head(sort(df$fz), min_pos) + 1), tail(sort(df$fz), length(df$fz) - min_pos))
>
> sum(df$new_value)
[1] 450
> df
ID_PP z fz new_value
2 6 21698 67 68
3 22 21725 70 71
4 30 8378 88 89
5 1234456 18979 88 88
1 3 12325 134 134

In R; I would like to do something in R rather than excel because excel can't handle the calculation. In excel the calculation is: =A2+SUM($B$2:B2)

I want column C, phys_pos, to be the value in column A, position, plus the cumulative sum of column B, length. In Excel the calculation is: =A2+SUM($B$2:B2), but Excel can't handle this much data. Thanks all.
The data I would like:
position length phys_pos
12 45 57
97 0 142
135 0 180
498 0 543
512 0 557
16 67 128
76 0 188
89 0 201
101 0 213
152 0 264
3 103 218
19 0 234
76 0 291
88 0 303
Look into dplyr https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html
install.packages("dplyr")
library(dplyr)
df <- df %>% mutate(phys_pos=cumsum(length)+position)
I am assuming your data.frame is named df
Or with base R
df$phys_pos <- cumsum(df$length) + df$position
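As a quick check, here is the base R one-liner applied to the position and length columns from the question. The expected vector is simply the phys_pos column the asker listed; the name expected is introduced here.

```r
df <- data.frame(
  position = c(12, 97, 135, 498, 512, 16, 76, 89, 101, 152, 3, 19, 76, 88),
  length   = c(45, 0, 0, 0, 0, 67, 0, 0, 0, 0, 103, 0, 0, 0)
)

# Running total of length plus the row's own position,
# i.e. the R counterpart of =A2+SUM($B$2:B2)
df$phys_pos <- cumsum(df$length) + df$position

# phys_pos column from the question, for comparison
expected <- c(57, 142, 180, 543, 557, 128, 188, 201, 213, 264, 218, 234, 291, 303)
all(df$phys_pos == expected)
```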
Assuming your data is stored in a dataframe called "dat":
acc <- 0
for(i in 1:nrow(dat)){
acc <- acc + dat[i,"length"]
dat[i,"phys_pos"] <- dat[i,"position"]+acc
}
This is simple stuff. If you would do some tutorials you could learn to do it on your own pretty fast.

Alternative to for loop R

I have written a function that compares the similarity of IP addresses and lets the user select the level of detail in octets. For example, for the addresses 255.255.255.0 and 255.255.255.1, a user could specify that they only want to compare the first octet, the first and second, the first three, etc.
the function is below:
did.change.ip=function(vec, detail){
counter=2
result.vec=FALSE
r.list=strsplit(vec, '.', fixed=TRUE)
for(i in vec){
if(counter>length(vec)){
break
}
first=as.numeric(r.list[[counter-1]][1:detail])
second=as.numeric(r.list[[counter]][1:detail])
if(sum(first==second)==detail){
result.vec=append(result.vec,FALSE)
}
else{
result.vec=append(result.vec,TRUE)
}
counter=counter+1
}
return(result.vec)
}
It's really slow once the data starts getting larger. For a dataset of 500,000 rows, the system.time() results are:
user system elapsed
208.36 0.59 209.84
Are there any R power users who have insight on how to write this more efficiently? I know lapply() is the preferred method for looping over vectors/dataframes, but I'm stumped as to how to access the previous element in a vector for this purpose. I've tried to sketch something out quickly, but it returns a syntax error:
test=function(vec, detail){
rlist=strsplit(vec, '.', fixed=TRUE)
r.value=vapply(rlist, function(x,detail) ifelse(x[1:detail]==x[1:detail] TRUE, FALSE))
}
I've created some sample data for testing purposes below:
stack.data=structure(list(V1 = c("247.116.209.66", "195.121.47.105", "182.136.49.12",
"237.123.100.50", "120.30.174.18", "29.85.72.70", "18.186.76.177",
"33.248.142.26", "109.97.92.50", "217.138.155.145", "20.203.156.2",
"71.1.51.190", "31.225.208.60", "55.25.129.73", "211.204.249.244",
"198.137.15.53", "234.106.102.196", "244.3.87.9", "205.242.10.22",
"243.61.212.19", "32.165.79.86", "190.207.159.147", "157.153.136.100",
"36.151.152.15", "2.254.210.246", "3.42.1.208", "30.11.229.18",
"72.187.36.103", "98.114.189.34", "67.93.180.224")), .Names = "V1", class = "data.frame", row.names = c(NA,
-30L))
Here's another solution just using base R.
did.change.ip <- function(vec, detail=4){
ipv <- scan(text=paste(vec, collapse="\n"),
what=c(replicate(detail, integer()), replicate(4-detail,NULL)),
sep=".", quiet=TRUE)
c(FALSE, rowSums(vapply(ipv[!sapply(ipv, is.null)],
diff, integer(length(vec)-1))!=0)>0)
}
Here we use scan() to break up the IP addresses into numbers. Then we look down each octet for differences using diff. It seems this is faster than the original proposal, but slightly slower than @josilber's stringr solution (using microbenchmark with 3,000 IP addresses):
Unit: milliseconds
expr min lq median uq max neval
orig 35.251886 35.716921 36.019354 36.700550 90.159992 100
scan 2.062189 2.116391 2.170110 2.236658 3.563771 100
strngr 2.027232 2.075018 2.136114 2.200096 3.535227 100
The simplest way I can think of to do this is to build a transformed vector that only includes the parts of the IP you want. Then it's a one-liner to check if each element is equal to the one before it:
library(stringr)
did.change.josilber <- function(vec, detail) {
s <- str_extract(vec, paste0("^(\\d+\\.){", detail, "}"))
return(s != c(s[1], s[1:(length(s)-1)]))
}
This seems reasonably efficient for 500,000 rows:
set.seed(144)
big.vec <- sample(stack.data[,1], 500000, replace=T)
system.time(did.change.josilber(big.vec, 3))
# user system elapsed
# 0.527 0.030 0.554
The biggest issue with your code is that you call append each iteration, which requires reallocation of your vector 500,000 times. You can read more about this in the second circle of the R inferno.
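To illustrate the point about append, here is a sketch of the same "compare with the previous element" idea in base R with no growing vector: the prefixes are built once and then compared against a shifted copy. did_change_ip and the sample addresses are hypothetical and only for illustration; this has not been benchmarked against the solutions above.

```r
did_change_ip <- function(vec, detail) {
  # Keep only the first `detail` octets of every address
  prefix <- vapply(
    strsplit(vec, ".", fixed = TRUE),
    function(p) paste(p[seq_len(detail)], collapse = "."),
    character(1)
  )
  # The first element has no predecessor; the rest are compared
  # with the element just before them in one vectorised step
  c(FALSE, prefix[-1] != prefix[-length(prefix)])
}

# Hypothetical sample addresses
ips <- c("10.0.0.1", "10.0.0.2", "10.0.1.2", "11.0.1.2")
did_change_ip(ips, 3)
```

Dropping the first and the last element gives two aligned vectors, so no explicit index bookkeeping or lapply over positions is needed.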
Not sure if all you want is counts, but this is potentially a solution:
library(dplyr)
library(tidyr)
# split ip addresses into "octets"
octets <- stack.data %>%
separate(V1,c("first","second","third","fourth"))
# how many shared both their first and second octets?
octets %>%
group_by(first,second) %>%
summarize(n = n())
first second n
1 109 97 1
2 120 30 1
3 157 153 1
4 18 186 1
5 182 136 1
6 190 207 1
7 195 121 1
8 198 137 1
9 2 254 1
10 20 203 1
11 205 242 1
12 211 204 1
13 217 138 1
14 234 106 1
15 237 123 1
16 243 61 1
17 244 3 1
18 247 116 1
19 29 85 1
20 3 42 1
21 30 11 1
22 31 225 1
23 32 165 1
24 33 248 1
25 36 151 1
26 55 25 1
27 67 93 1
28 71 1 1
29 72 187 1
30 98 114 1
