I have a table of 3 columns:
Start of range
End of range
Number assigned to all values within the range.
I want to create a table with the first column holding the values 1 to x (x being the largest "end" value across all ranges) and the second column holding the assigned number for each value. Any unassigned values need to be set to 0.
E.g. original table:
start  end  value
    1    4     -1
    6    8      4
So the final table would be:
Number  Value
     1     -1
     2     -1
     3     -1
     4     -1
     5      0
     6      4
     7      4
     8      4
But I have no idea where to start - any suggestions?
Thanks.
Does this do the trick? Starting from your data example:
library(dplyr)

a <- data.frame(start = c(1, 6), end = c(4, 8), value = c(-1, 4))

# expand each range into a data frame of numbers and their value
c <- apply(a, 1, function(i) {
  b <- i[1]:i[2]
  as.data.frame(cbind(b, rep(i[3], length(b))))
})
c <- bind_rows(c, .id = "column_label")[, -1]

# numbers between the smallest and largest that never got a value -> 0
d <- (c[1, 1]:c[nrow(c), 1])[!c[1, 1]:c[nrow(c), 1] %in% c$b]
d <- cbind(d, rep(0, length(d)))
colnames(d) <- colnames(c)

# combine and sort
res <- rbind(c, d)[order(rbind(c, d)[, 1]), ]
rownames(res) <- 1:nrow(res)
colnames(res) <- c('Number', 'Value')
res
output:
> res
Number Value
1 1 -1
2 2 -1
3 3 -1
4 4 -1
5 5 0
6 6 4
7 7 4
8 8 4
The obligatory data.table solution ;). A general solution can be obtained using foverlaps():
library(data.table)

data <- data.frame(start = c(1, 6), end = c(4, 8), value = c(-1, 4))
number <- data.frame(start = 1:8, end = 1:8)

setDT(data)
setDT(number)
setkey(data, start, end)

# overlap join: each single-number "range" picks up the value of the range it falls in
df <- foverlaps(number, data)[, c("i.start", "value"), with = FALSE]
df[is.na(df$value), ]$value <- 0
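As an aside (my addition, not part of the original answer), the last line can also be written as an idiomatic in-place data.table update, which avoids the intermediate copy:

df[is.na(value), value := 0]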
Here is a tidyverse solution:
library(dplyr)
library(tidyr)

df <- data.frame(start = c(1, 6), end = c(4, 8), value = c(-1, 4))  # the question's data, added so the pipe is runnable

df %>%
  group_by(start) %>%
  mutate(index = list(start:end)) %>%
  unnest(cols = c(index)) %>%
  ungroup() %>%
  complete(index = 1:max(index), fill = list(value = 0)) %>%
  select(Number = index, Value = value)
Number Value
<int> <dbl>
1 1 -1
2 2 -1
3 3 -1
4 4 -1
5 5 0
6 6 4
7 7 4
8 8 4
If you are looking for a generic solution, you can try this function:

expand_integers <- function(start, end, value) {
  n <- end - start + 1L                       # length of each range
  rng <- range(c(start, end))                 # overall smallest and largest number
  pos <- sequence(n, start - rng[[1L]] + 1L)  # positions covered by the ranges (needs R >= 4.0)
  val <- rep.int(value, n)                    # each value repeated across its range
  data.frame(
    number = seq.int(rng[[1L]], rng[[2L]]),
    value = `[<-`(integer(rng[[2L]] - rng[[1L]] + 1L), pos, value = val)
  )
}
It works for any start and end values and is very efficient. Here is a simple test:
df <- data.frame(start = c(4L, 10L), end = c(7L, 19L), value = c(-1L, 4L))
df
expand_integers(df$start, df$end, df$value)
Output
> df
start end value
1 4 7 -1
2 10 19 4
> expand_integers(df$start, df$end, df$value)
number value
1 4 -1
2 5 -1
3 6 -1
4 7 -1
5 8 0
6 9 0
7 10 4
8 11 4
9 12 4
10 13 4
11 14 4
12 15 4
13 16 4
14 17 4
15 18 4
16 19 4
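Applied to the data from the original question (my addition), the function reproduces the desired 1-to-8 table, including the 0 for the unassigned number 5:

df <- data.frame(start = c(1L, 6L), end = c(4L, 8L), value = c(-1L, 4L))
expand_integers(df$start, df$end, df$value)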
In my data I have repeating entries in a column. What I'm trying to do is: if an entry n is repeated more than 2 times within a column, then I want to replace that entry with n - (number_of_times_it_has_repeated - 2). For example, if my data looks like this:
df <- data.frame(
A = c(1,2,2,4,5,7,7,7,7,2,8,8),
B = c(2,3,4,5,6,7,8,9,10,11,12,13)
)
> df
A B
1 2
2 3
2 4
4 5
5 6
7 7
7 8
7 9
7 10
2 11
8 12
8 13
we can see that in df$A, 7 is repeated 4 times. If an entry is repeated more than 2 times, then I want to replace it. So in my example, the 1st and 2nd instances of the number 7 would remain unchanged. The 3rd instance of the number 7 would be replaced by 7 - (3-2), and the 4th instance by 7 - (4-2).
We can also see that in df$A the number 2 is repeated 3 times. Using the same method, the 3rd instance of the number 2 would be replaced with 2 - (3-2).
As there are no repeating values in df$B, that column would remain unchanged.
For clarity, my expected result would be:
dfNew <- data.frame(
A = c(1,2,2,4,5,7,7,6,5,1,8,8),
B = c(2,3,4,5,6,7,8,9,10,11,12,13)
)
> dfNew
A B
1 2
2 3
2 4
4 5
5 6
7 7
7 8
6 9
5 10
1 11
8 12
8 13
Here's how you can do it for one column -
library(dplyr)
df %>%
  group_by(A) %>%
  transmute(A = A - c(rep(0, 2), row_number())[row_number()]) %>%
  ungroup
# A
# <dbl>
# 1 1
# 2 2
# 3 2
# 4 4
# 5 5
# 6 7
# 7 7
# 8 6
# 9 5
#10 1
#11 8
#12 8
To do it for all the columns you can use map_dfc -
purrr::map_dfc(names(df), ~ {
  df %>%
    group_by(.data[[.x]]) %>%
    transmute(!!.x := .data[[.x]] - c(rep(0, 2), row_number())[row_number()]) %>%
    ungroup
})
# A B
# <dbl> <dbl>
# 1 1 2
# 2 2 3
# 3 2 4
# 4 4 5
# 5 5 6
# 6 7 7
# 7 7 8
# 8 6 9
# 9 5 10
#10 1 11
#11 8 12
#12 8 13
The logic here is that for each number we subtract 0 from the first 2 values and after that we subtract 1, 2, and so on.
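To see the indexing trick in isolation (a standalone illustration, not from the original answer), here is the subtraction vector for a group of 5 equal values:

n <- 5
c(rep(0, 2), seq_len(n))[seq_len(n)]
# [1] 0 0 1 2 3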
Here is my approach (you can skip the ordering steps if you don't need them). If after these changes some values are still repeated more than twice, the code could be wrapped in a function and applied repeatedly.

library(dplyr)

my_df <- data.frame(A = c(1,2,2,4,5,7,7,7,7,2,8,8),
                    B = c(2,3,4,5,6,7,8,9,10,11,12,13),
                    stringsAsFactors = FALSE)
my_df <- my_df[order(my_df$A, my_df$B), ]
my_df$Id <- seq.int(from = 1, to = nrow(my_df), by = 1)

# rows belonging to groups with more than 2 entries, counted and adjusted
my_temp <- my_df %>%
  group_by(A) %>%
  filter(n() > 2) %>%
  mutate(Count = seq.int(from = 1, to = n(), by = 1)) %>%
  filter(Count > 2) %>%
  mutate(A = A - (Count - 2))

# swap the adjusted rows back into the original data
my_var <- which(my_df$Id %in% my_temp$Id)
if (length(my_var)) {
  my_df <- my_df[-my_var, ]
  my_df <- rbind(my_df, my_temp[, c("A", "B", "Id")])
}
my_df <- my_df[order(my_df$A, my_df$B), ]
A base R option using ave + pmax + seq_along:
list2DF(
lapply(
df,
function(x) {
x - ave(x, x, FUN = function(v) pmax(seq_along(v) - 2, 0))
}
)
)
gives
A B
1 1 2
2 2 3
3 2 4
4 4 5
5 5 6
6 7 7
7 7 8
8 6 9
9 5 10
10 1 11
11 8 12
12 8 13
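The same mechanics are easy to check on a single group (my illustration): seq_along numbers the occurrences, subtracting 2 and clamping at 0 with pmax gives the penalty, and ave applies this within each set of equal values:

x <- c(7, 7, 7, 7)
ave(x, x, FUN = function(v) pmax(seq_along(v) - 2, 0))
# [1] 0 0 1 2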
I want to change the format of a dataset in a certain way. Say I have a list of data indicating when and how many times participants attended counselling sessions. They could attend a maximum of three sessions, any time within a twelve-week period. Say their data is recorded like so
set.seed(01234)
df1 <- data.frame(id = rep(LETTERS[1:4], each = 3),
session = rep(paste0("session", 1:3), length.out = 12),
week1 = c(sort(sample(1:12, 3, replace = F)),
sort(sample(1:12, 3, replace = F)),
sort(sample(1:12, 3, replace = F)),
sort(sample(1:12, 3, replace = F))))
df1$week1[c(3,8,9,12)] <- NA # insert some NAs representing sessions that weren't attended
And the dataset looks like this
# id session week1
# 1 A session1 2
# 2 A session2 7
# 3 A session3 NA
# 4 B session1 7
# 5 B session2 8
# 6 B session3 10
# 7 C session1 1
# 8 C session2 NA
# 9 C session3 NA
# 10 D session1 6
# 11 D session2 7
# 12 D session3 NA
But I want a long dataset where each person has a row for each of the twelve weeks they could have attended, like so
df2 <- data.frame(id = rep(LETTERS[1:4], each = 12),
week2 = rep(1:12, times = 4))
So participant A's data looks like this
df2[1:12,]
# id week2
# 1 A 1
# 2 A 2
# 3 A 3
# 4 A 4
# 5 A 5
# 6 A 6
# 7 A 7
# 8 A 8
# 9 A 9
# 10 A 10
# 11 A 11
# 12 A 12
I would like to merge the two somehow so that the numbers in the week1 column of df1 are matched to their appropriate row in df2, ideally something like this (example is participant A only)
data.frame(id = rep("A", 12),
week = 1:12,
attended = c(0,1,0,0,0,0,1,0,0,0,0,0))
# id week attended
# 1 A 1 0
# 2 A 2 1
# 3 A 3 0
# 4 A 4 0
# 5 A 5 0
# 6 A 6 0
# 7 A 7 1
# 8 A 8 0
# 9 A 9 0
# 10 A 10 0
# 11 A 11 0
# 12 A 12 0
One approach utilizing a merge:
# merge the 2 data frames
names(df2)[2] <- "week"
names(df1)[3] <- "week"
df <- merge(df2, df1, by = c("id", "week"), all.x = TRUE)

# replace 'session' with 1s and 0s
df$session <- as.integer(!is.na(df$session))
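If you also want the column named attended and the rows ordered as in the expected output, two more lines (my addition) finish the job:

names(df)[names(df) == "session"] <- "attended"
df <- df[order(df$id, df$week), ]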
A base R alternative that avoids the merge entirely:

do.call(rbind, lapply(split(df2, df2$id), function(x) {
  x$attended <- as.integer(x$week2 %in% df1$week1[df1$id == x$id[1]])
  x
}))
You could expand the original data.frame using tidyr::complete so that you don't need a merge; just define week1 as a factor with the correct number of levels:
library(dplyr)
library(tidyr)
df1 %>%
  group_by(id) %>%
  mutate(week1 = factor(week1, levels = 1:12),
         session = !is.na(session)) %>%
  complete(week1, fill = list(session = 0))
# A tibble: 52 x 3
# Groups: id [4]
id week1 session
<fct> <fct> <dbl>
1 A 1 0
2 A 2 1
3 A 3 0
4 A 4 0
5 A 5 0
6 A 6 0
7 A 7 1
8 A 8 0
9 A 9 0
10 A 10 0
# ... with 42 more rows
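Note that week1 comes back as a factor here; if you need it numeric, one extra step at the end of the pipe converts it safely (my addition):

df1 %>%
  group_by(id) %>%
  mutate(week1 = factor(week1, levels = 1:12),
         session = !is.na(session)) %>%
  complete(week1, fill = list(session = 0)) %>%
  mutate(week1 = as.integer(as.character(week1)))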
I want to compute the minimum distance between the current row and every row before it within each group. My data frame has several groups, and each group has multiple dates with longitude and latitude. I use a Haversine function to compute distance, and I need to apply this function as described above. The data frame looks like the following:
grp date long lat rowid
1 1 1995-07-01 11 12 1
2 1 1995-07-05 3 0 2
3 1 1995-07-09 13 4 3
4 1 1995-07-13 4 25 4
5 2 1995-03-07 12 6 1
6 2 1995-03-10 3 27 2
7 2 1995-03-13 34 8 3
8 2 1995-03-16 25 9 4
My current attempt uses purrrlyr::by_row, but the method is too slow. In practice, each group has thousands of dates and geographic positions. Here is part of my current attempt:
calc_min_distance <- function(df, grp.name, row) {
  df %>%
    filter(group_name == grp.name) %>%
    filter(row_number() <= row) %>%
    mutate(
      last.lat = last(lat),
      last.long = last(long),
      rowid = 1:n()
    ) %>%
    group_by(rowid) %>%
    purrrlyr::by_row(
      ~haversinedistance.fnct(.$last.long, .$last.lat, .$long, .$lat),
      .collate = 'rows',
      .to = 'min.distance'
    ) %>%
    filter(row_number() < n()) %>%
    summarise(min = min(min.distance)) %>%
    .$min
}

df_dist <-
  df %>%
  group_by(grp_name) %>%
  mutate(rowid = 1:n()) %>%
  group_by(grp_name, rowid) %>%
  purrrlyr::by_row(
    ~calc_min_distance(df, .$grp_name, .$rowid),
    .collate = 'rows',
    .to = 'min.distance'
  ) %>%
  ungroup %>%
  select(-rowid)
Suppose, for illustration, that distance is defined as (lat + long) of the reference row minus (lat + long) of each row before it, and that min.distance is the minimum over those earlier rows. My expected output for grp 1 is the following:
grp date long lat rowid min.distance
1 1 1995-07-01 11 12 1 0
2 1 1995-07-05 3 0 2 -20
3 1 1995-07-09 13 4 3 -6
4 1 1995-07-13 4 25 4 6
How can I quickly compute the minimum distance between the current rowid and all rowids before it?
Here's how I would go about it. You need to calculate all the within-group pair-wise distances anyway, so we'll use geosphere::distm, which is designed to do just that. I'd suggest stepping through my function line by line and looking at what it does; I think it will make sense.
library(geosphere)
find_min_dist_above = function(long, lat, fun = distHaversine) {
  # all pairwise distances within the group
  d = distm(x = cbind(long, lat), fun = fun)
  # blank out the diagonal and everything below it, so each column
  # only keeps distances to the rows above it
  d[lower.tri(d, diag = TRUE)] = NA
  d[1, 1] = 0  # the first row has no rows above it
  return(apply(d, MAR = 2, min, na.rm = TRUE))
}
df %>%
  group_by(grp) %>%
  mutate(min.distance = find_min_dist_above(long, lat))
# # A tibble: 8 x 6
# # Groups: grp [2]
# grp date long lat rowid min.distance
# <int> <fct> <int> <int> <int> <dbl>
# 1 1 1995-07-01 11 12 1 0
# 2 1 1995-07-05 3 0 2 1601842.
# 3 1 1995-07-09 13 4 3 917395.
# 4 1 1995-07-13 4 25 4 1623922.
# 5 2 1995-03-07 12 6 1 0
# 6 2 1995-03-10 3 27 2 2524759.
# 7 2 1995-03-13 34 8 3 2440596.
# 8 2 1995-03-16 25 9 4 997069.
Using this data:
df = read.table(text = ' grp date long lat rowid
1 1 1995-07-01 11 12 1
2 1 1995-07-05 3 0 2
3 1 1995-07-09 13 4 3
4 1 1995-07-13 4 25 4
5 2 1995-03-07 12 6 1
6 2 1995-03-10 3 27 2
7 2 1995-03-13 34 8 3
8 2 1995-03-16 25 9 4', h = TRUE)
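One caveat (my addition, not part of the answer above): distm materialises the full n-by-n distance matrix per group, which can get heavy when groups have many thousands of rows. A lower-memory sketch that visits one row at a time, assuming the same inputs:

library(geosphere)

find_min_dist_above_lowmem <- function(long, lat, fun = distHaversine) {
  out <- numeric(length(long))  # first row has no rows above it, so it stays 0
  for (i in seq_along(long)[-1]) {
    # distances from row i to all earlier rows (fun is vectorised over the matrix)
    out[i] <- min(fun(cbind(long[1:(i - 1)], lat[1:(i - 1)]), c(long[i], lat[i])))
  }
  out
}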
I need help with programming in R. I have a data.frame B with one column

x <- c("300","300","300","400","400","400","500","500","500", ...etc.)  # 2 million rows

and I need to create a second column holding a rank that restarts at each new value. The new column should look like

y <- c(1,2,3,1,2,3,1,2,3, ...etc.)

I used a for loop:

B$y[1] <- 1
for (i in 2:length(B$x)) {
  B$y[i] <- ifelse(B$x[i] == B$x[i-1], B$y[i-1] + 1, 1)
}

The process ran for 4 hours, so I need any help speeding it up. Thanks for your answers.
Here is a solution with base R:

B <- data.frame(x = rep(c(300, 400, 400), sample(c(5:10), 3)))
B
B$y <- ave(B$x, B$x, FUN = seq_along)  # running index within each distinct value of x

Note that ave() groups by value rather than by runs, which matches the loop's output as long as x is sorted, so that equal values never reappear after a gap.
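For 2 million rows, a data.table version of the same idea should also be very fast (my addition; seq_len(.N) is the running index within each group):

library(data.table)
setDT(B)[, y := seq_len(.N), by = x]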
Here's an approach with dplyr that takes about 0.2 seconds on 2 million rows.
First I make sample data:
n = 2E6 # number of rows in test
library(dplyr)
sample_data <- data.frame(
  x = round(runif(n = n, min = 1, max = 100000), digits = 0)
) %>%
  arrange(x) # Optional, added to make output clearer so that each x is adjacent to the others that match.
Then I group by x and make y show which occurrence of x each row is within that group.
sample_data_with_rank <- sample_data %>%
  group_by(x) %>%
  mutate(y = row_number()) %>%
  ungroup()
head(sample_data_with_rank, 20)
# A tibble: 20 x 2
x y
<dbl> <int>
1 1 1
2 1 2
3 1 3
4 1 4
5 1 5
6 1 6
7 1 7
8 1 8
9 1 9
10 1 10
11 1 11
12 1 12
13 1 13
14 1 14
15 1 15
16 2 1
17 2 2
18 2 3
19 2 4
20 2 5
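If you want to verify the timing claim on your own machine (my addition), wrap the grouped mutate in system.time:

system.time({
  sample_data %>%
    group_by(x) %>%
    mutate(y = row_number()) %>%
    ungroup()
})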
Suppose I have this data frame:
times vals
1 1 2
2 3 4
3 7 6
set up with
foo <- data.frame(times=c(1,3,7), vals=c(2,4,6))
and I want this one:
times vals
1 1 2
2 2 2
3 3 4
4 4 4
5 5 4
6 6 4
7 7 6
That is, I want to fill in all the times from 1 to 7, and fill in the vals from the latest time that is not greater than the given time.
I have some code to do it using dplyr, but it is ugly. Suggestions for better?
library(dplyr)

foo <- merge(foo, data.frame(times = 1:max(foo$times)), all.y = TRUE)
foo2 <- merge(foo, foo, by = c(), suffixes = c('', '.1'))
foo2 <- foo2 %>%
  filter(is.na(vals) & !is.na(vals.1) & times.1 <= times) %>%
  group_by(times) %>%
  arrange(-times.1) %>%
  mutate(rn = row_number()) %>%
  filter(rn == 1) %>%
  mutate(vals = vals.1,
         rn = NULL,
         vals.1 = NULL,
         times.1 = NULL)
foo <- merge(foo, foo2, by = c('times'), all.x = TRUE, suffixes = c('', '.2'))
foo <- mutate(foo,
              vals = ifelse(is.na(vals), vals.2, vals),
              vals.2 = NULL)
This is a standard rolling join problem:
library(data.table)
setDT(foo)[.(1:7), on = 'times', roll = T]
# times vals
#1: 1 2
#2: 2 2
#3: 3 4
#4: 4 4
#5: 5 4
#6: 6 4
#7: 7 6
The above is for the devel version (1.9.7+), which is smarter about column matching during joins. For 1.9.6 you still need to specify the column name for the inner table:
setDT(foo)[.(times = 1:7), on = 'times', roll = T]
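To avoid hard-coding 1:7, the probe table can also be built from the data itself (my addition; assumes setDT(foo) has already been run as above):

foo[.(times = min(foo$times):max(foo$times)), on = 'times', roll = TRUE]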
With approx (method = "constant" with f = 0 carries the last observation forward):

data.frame(times = 1:7,
           vals = unlist(approx(foo, xout = 1:7, method = "constant", f = 0)[2],
                         use.names = FALSE))
times vals
1 1 2
2 2 2
3 3 4
4 4 4
5 5 4
6 6 4
7 7 6
A dplyr and tidyr option (tibble() replaces the now-deprecated data_frame()):

library(dplyr)
library(tidyr)

foo %>%
  right_join(tibble(times = min(foo$times):max(foo$times))) %>%
  fill(vals)
# Joining by: "times"
# times vals
# 1 1 2
# 2 2 2
# 3 3 4
# 4 4 4
# 5 5 4
# 6 6 4
# 7 7 6
Here is a somewhat longer and more verbose base R solution:
# calculate how many times each vals entry must be repeated
reps <- c(diff(foo$times), 1)
# get result
fooDoneIt <- data.frame(times = min(foo$times):max(foo$times),
                        vals = rep(foo$vals, reps))
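Printing the result confirms it matches the desired table:

fooDoneIt
#   times vals
# 1     1    2
# 2     2    2
# 3     3    4
# 4     4    4
# 5     5    4
# 6     6    4
# 7     7    6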