Let's say I have a data frame. I would like to create new columns by subtracting pairs of existing columns. The column names follow a pattern. In the code below, the first operand of each subtraction always has the prefix base_g00 and the second always has the prefix allow_m00. The first operand's id runs from 27 to 43 and the second's from 20 to 36, so the second id is always the first id minus 7. Can I use an apply function or a loop inside mutate to make this code simpler? Thanks in advance for any suggestions!
pred_error<-y07_13%>%mutate(annual_util_1=base_g0027-allow_m0020,
annual_util_2=base_g0028-allow_m0021,
annual_util_3=base_g0029-allow_m0022,
annual_util_4=base_g0030-allow_m0023,
annual_util_5=base_g0031-allow_m0024,
annual_util_6=base_g0032-allow_m0025,
annual_util_7=base_g0033-allow_m0026,
annual_util_8=base_g0034-allow_m0027,
annual_util_9=base_g0035-allow_m0028,
annual_util_10=base_g0036-allow_m0029,
annual_util_11=base_g0037-allow_m0030,
annual_util_12=base_g0038-allow_m0031,
annual_util_13=base_g0039-allow_m0032,
annual_util_14=base_g0040-allow_m0033,
annual_util_15=base_g0041-allow_m0034,
annual_util_16=base_g0042-allow_m0035,
annual_util_17=base_g0043-allow_m0036)
I think a more idiomatic tidyverse approach would be to reshape your data so those column groups are encoded as a variable instead of as separate columns which have the same semantic meaning.
For instance,
library(dplyr); library(tidyr); library(stringr)
y07_13 <- tibble(allow_m0021 = 1:5,
allow_m0022 = 2:6,
allow_m0023 = 11:15,
base_g0028 = 5,
base_g0029 = 3:7,
base_g0030 = 100)
y07_13 %>%
mutate(row = row_number()) %>%
pivot_longer(-row) %>%
mutate(type = str_extract(name, "allow_m|base_g"),
num = str_remove(name, type) %>% as.numeric(),
group = num - if_else(type == "allow_m", 20, 27)) %>%
select(row, type, group, value) %>%
pivot_wider(names_from = type, values_from = value) %>%
mutate(annual_util = base_g - allow_m)
Result
# A tibble: 15 x 5
row group allow_m base_g annual_util
<int> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 5 4
2 1 2 2 3 1
3 1 3 11 100 89
4 2 1 2 5 3
5 2 2 3 4 1
6 2 3 12 100 88
7 3 1 3 5 2
8 3 2 4 5 1
9 3 3 13 100 87
10 4 1 4 5 1
11 4 2 5 6 1
12 4 3 14 100 86
13 5 1 5 5 0
14 5 2 6 7 1
15 5 3 15 100 85
Here is a vectorised base R approach:
base_cols <- paste0("base_g00", 27:43)
allow_cols <- paste0("allow_m00", 20:36)
new_cols <- paste0("annual_util_", 1:17)
y07_13[new_cols] <- y07_13[base_cols] - y07_13[allow_cols]
y07_13
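The "second id is the first id minus 7" rule from the question can also be written into the code, so that only one range of ids has to be maintained. The following is a minimal sketch, assuming the toy columns from the tidyverse answer above (ids 28 to 30 rather than the real 27 to 43) and zero-padded four-digit ids; the variable base_ids is a name introduced here for illustration.

```r
# Toy data with three column pairs (values taken from the example above)
y07_13 <- data.frame(allow_m0021 = 1:5, allow_m0022 = 2:6, allow_m0023 = 11:15,
                     base_g0028 = 5, base_g0029 = 3:7, base_g0030 = 100)

base_ids   <- 28:30                              # would be 27:43 for the real data
base_cols  <- sprintf("base_g%04d", base_ids)    # "base_g0028", ...
allow_cols <- sprintf("allow_m%04d", base_ids - 7L)  # matching id is always 7 lower
new_cols   <- paste0("annual_util_", seq_along(base_ids))

# Subtract the two blocks of columns element-wise and store them under the new names
y07_13[new_cols] <- y07_13[base_cols] - y07_13[allow_cols]
y07_13
```

Deriving allow_cols from base_ids keeps the two name vectors aligned by construction, which is the property the subtraction relies on.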
I have this type of data, with frequency data and position data grouped by rowid:
df
rowid word f position
1 2 i 700 1
2 2 'm 600 2
3 2 fine 1 3
4 3 how 400 1
5 3 's 500 2
6 3 the 700 3
7 3 weather 20 4
8 4 it 390 1
9 4 's 500 2
10 4 really 177 3
11 4 very 200 4
12 4 cold 35 5
13 5 i 700 1
14 5 love 199 2
15 5 you 400 3
The task I'm facing seems simple: in those rowids where there are more than 3 positions, I need to replace the frequencies of all middle positions with their average. The following approach works but seems overly convoluted, so I'm almost certain there is a more straightforward dplyr way to get the desired output:
df %>%
group_by(rowid) %>%
# filter for 'middle' positions:
filter(position != first(position) & position != last(position)) %>%
# summarise:
summarize(across(position),
# create average frequency:
f_middle_position = mean(f, na.rm = TRUE),
# concatenate words:
word = str_c(word, collapse = " ")
) %>%
filter(!duplicated(f_middle_position)) %>%
# join with df:
left_join(df, ., by = c("rowid", "position")) %>%
# remove rows other than #1,#2, and last:
group_by(rowid) %>%
# create row count:
mutate(rn = row_number()) %>%
# filter first, second, and last row per group:
filter(rn %in% c(1, 2, last(rn))) %>%
# transfer frequencies for middle positions:
mutate(f = ifelse(is.na(f_middle_position), f, f_middle_position)) %>%
# make more changes:
mutate(
# change position labels:
position = ifelse(position == first(position), 1,
ifelse(position == last(position), 2, 1.5)),
# update word:
word = ifelse(is.na(word.y), word.x, word.y)
) %>%
# remove obsolete variables:
select(-c(f_middle_position, word.y, word.x,rn))
# A tibble: 12 × 4
# Groups: rowid [4]
rowid f position word
<dbl> <dbl> <dbl> <chr>
1 2 700 1 i
2 2 600 1.5 'm
3 2 1 2 fine
4 3 400 1 how
5 3 600 1.5 's the
6 3 20 2 weather
7 4 390 1 it
8 4 292. 1.5 's really very
9 4 35 2 cold
10 5 700 1 i
11 5 199 1.5 love
12 5 400 2 you
How can this result be obtained more concisely in dplyr and, preferably, without the left_join, which causes problems with my actual data?
Data:
df <- data.frame(
rowid = c(2,2,2,3,3,3,3,4,4,4,4,4,5,5,5),
word = c("i","'m","fine",
"how","'s","the","weather",
"it","'s","really", "very","cold",
"i","love","you"),
f = c(700,600,1,
400,500,700,20,
390,500,177,200,35,
700,199,400),
position = c(1,2,3,
1,2,3,4,
1,2,3,4,5,
1,2,3)
)
You can create a group variable pos that marks the first row with 1, the middle with 1.5, and the last with 2. Then group the data by rowid and pos and apply mean() and paste() on f and word respectively.
library(dplyr)
df %>%
group_by(rowid) %>%
mutate(pos = case_when(position == 1 ~ 1, position == n() ~ 2, TRUE ~ 1.5)) %>%
group_by(rowid, pos) %>%
summarise(f = mean(f), word = paste(word, collapse = ' '), .groups = 'drop')
# # A tibble: 12 × 4
# rowid pos f word
# <dbl> <dbl> <dbl> <chr>
# 1 2 1 700 i
# 2 2 1.5 600 'm
# 3 2 2 1 fine
# 4 3 1 400 how
# 5 3 1.5 600 's the
# 6 3 2 20 weather
# 7 4 1 390 it
# 8 4 1.5 292. 's really very
# 9 4 2 35 cold
# 10 5 1 700 i
# 11 5 1.5 199 love
# 12 5 2 400 you
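The answer above finds the last row with position == n(), which assumes that position runs 1, 2, ..., n within every rowid. If that might not hold in the real data, one option is to mark the rows by their order within the group instead. This is only a sketch on a reduced copy of the example data (two rowids), using row_number() in place of position:

```r
library(dplyr)

# Reduced copy of the example data: rowids 3 and 5 only
df <- data.frame(
  rowid    = c(3, 3, 3, 3, 5, 5, 5),
  word     = c("how", "'s", "the", "weather", "i", "love", "you"),
  f        = c(400, 500, 700, 20, 700, 199, 400),
  position = c(1, 2, 3, 4, 1, 2, 3)
)

res <- df %>%
  group_by(rowid) %>%
  # 1 = first row of the group, 2 = last row, 1.5 = everything in between
  mutate(pos = case_when(row_number() == 1   ~ 1,
                         row_number() == n() ~ 2,
                         TRUE                ~ 1.5)) %>%
  group_by(rowid, pos) %>%
  summarise(f = mean(f), word = paste(word, collapse = " "), .groups = "drop")
res
```

This relies on the rows already being in position order within each rowid; if they are not, an arrange(rowid, position) step before group_by would be needed.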
I am struggling with what is probably an easy question. I have a data frame with 1 column and n rows (n is a multiple of 3). I would like to add a second column of integers like 1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,... How can I achieve this with dplyr as a general solution for any number of rows (always a multiple of 3)?
I tried this:
df <- tibble(Col1 = c(1:12)) %>%
mutate(Col2 = rep(1:4, each=3))
This works, but I would like a solution for any n rows with each = 3. Many thanks!
You can specify the each and length.out arguments of rep.
library(dplyr)
tibble(Col1 = c(1:12)) %>%
mutate(Col2 = rep(row_number(), each=3, length.out = n()))
# Col1 Col2
# <int> <int>
# 1 1 1
# 2 2 1
# 3 3 1
# 4 4 2
# 5 5 2
# 6 6 2
# 7 7 3
# 8 8 3
# 9 9 3
#10 10 4
#11 11 4
#12 12 4
We can use gl
library(dplyr)
df %>%
mutate(col2 = as.integer(gl(n(), 3, n())))
Integer division by 3 (%/% 3) over the sequence 0, 1, ..., n - 1 gives 0, 0, 0, 1, 1, 1, ...; adding 1 then yields the desired sequence, so this will also do:
df %>% mutate(col2 = 1+ (row_number()-1) %/% 3)
# A tibble: 12 x 2
Col1 col2
<int> <dbl>
1 1 1
2 2 1
3 3 1
4 4 2
5 5 2
6 6 2
7 7 3
8 8 3
9 9 3
10 10 4
11 11 4
12 12 4
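The three answers above should produce the same grouping index. A quick base R check, written outside of mutate with seq_len(n) standing in for row_number() (an assumption made here so the snippet needs no packages), could look like this:

```r
n <- 12  # number of rows; any multiple of 3 works the same way

by_rep <- rep(seq_len(n), each = 3, length.out = n)  # rep() with each and length.out
by_gl  <- as.integer(gl(n, 3, n))                    # factor levels converted to integers
by_div <- 1 + (seq_len(n) - 1) %/% 3                 # integer division on a 0-based index

by_rep
# all three vectors hold the same values
all(by_rep == by_gl, by_rep == by_div)
```

Note that the first two give an integer vector while the integer-division version gives a double, which matches the <int> and <dbl> column types shown in the outputs above.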
I want to compute the minimum distance between the current row and every row before it within each group. My data frame has several groups, and each group has multiple dates with longitude and latitude. I use a Haversine function to compute distance, and I need to apply this function as described above. The data frame looks like the following:
grp date long lat rowid
1 1 1995-07-01 11 12 1
2 1 1995-07-05 3 0 2
3 1 1995-07-09 13 4 3
4 1 1995-07-13 4 25 4
5 2 1995-03-07 12 6 1
6 2 1995-03-10 3 27 2
7 2 1995-03-13 34 8 3
8 2 1995-03-16 25 9 4
My current attempt uses purrrlyr::by_row, but the method is too slow. In practice, each group has thousands of dates and geographic positions. Here is part of my current attempt:
calc_min_distance <- function(df, grp.name, row){
df %>%
filter(
group_name==grp.name
) %>%
filter(
row_number() <= row
) %>%
mutate(
last.lat = last(lat),
last.long = last(long),
rowid = 1:n()
) %>%
group_by(rowid) %>%
purrrlyr::by_row(
~haversinedistance.fnct(.$last.long, .$last.lat, .$long, .$lat),
.collate='rows',
.to = 'min.distance'
) %>%
filter(
row_number() < n()
) %>%
summarise(
min = min(min.distance)
) %>%
.$min
}
df_dist <-
df %>%
group_by(grp_name) %>%
mutate(rowid = 1:n()) %>%
group_by(grp_name, rowid) %>%
purrrlyr::by_row(
~calc_min_distance(df, .$grp_name,.$rowid),
.collate='rows',
.to = 'min.distance'
) %>%
ungroup %>%
select(-rowid)
Suppose that distance is defined as (lat + long) for reference row - (lat + long) for each pairwise row less than the reference row. My expected output for grp 1 is the following:
grp date long lat rowid min.distance
1 1 1995-07-01 11 12 1 0
2 1 1995-07-05 3 0 2 -20
3 1 1995-07-09 13 4 3 -6
4 1 1995-07-13 4 25 4 6
How can I quickly compute the minimum distance between the current rowid and all rowids before it?
Here's how I would go about it. You need to calculate all the within-group pairwise distances anyway, so we'll use geosphere::distm, which is designed to do just that. I'd suggest stepping through my function line by line and looking at what each step does; I think it will make sense.
library(geosphere)
find_min_dist_above = function(long, lat, fun = distHaversine) {
d = distm(x = cbind(long, lat), fun = fun)
d[lower.tri(d, diag = TRUE)] = NA
d[1, 1] = 0
return(apply(d, MARGIN = 2, min, na.rm = TRUE))
}
df %>% group_by(grp) %>%
mutate(min.distance = find_min_dist_above(long, lat))
# # A tibble: 8 x 6
# # Groups: grp [2]
# grp date long lat rowid min.distance
# <int> <fct> <int> <int> <int> <dbl>
# 1 1 1995-07-01 11 12 1 0
# 2 1 1995-07-05 3 0 2 1601842.
# 3 1 1995-07-09 13 4 3 917395.
# 4 1 1995-07-13 4 25 4 1623922.
# 5 2 1995-03-07 12 6 1 0
# 6 2 1995-03-10 3 27 2 2524759.
# 7 2 1995-03-13 34 8 3 2440596.
# 8 2 1995-03-16 25 9 4 997069.
Using this data:
df = read.table(text = ' grp date long lat rowid
1 1 1995-07-01 11 12 1
2 1 1995-07-05 3 0 2
3 1 1995-07-09 13 4 3
4 1 1995-07-13 4 25 4
5 2 1995-03-07 12 6 1
6 2 1995-03-10 3 27 2
7 2 1995-03-13 34 8 3
8 2 1995-03-16 25 9 4', h = TRUE)
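To see what the masking step in find_min_dist_above does without installing geosphere, the same idea can be tried with the toy distance from the question, (lat + long) of the reference row minus (lat + long) of each earlier row. This is a sketch for grp 1 only; outer() replaces distm here, which is a substitution made for illustration and not part of the answer above.

```r
# grp 1 from the example data
long <- c(11, 3, 13, 4)
lat  <- c(12, 0, 4, 25)
s <- long + lat

# d[i, j] = s[j] - s[i]: toy "distance" from reference row j to row i
d <- outer(s, s, function(a, b) b - a)

# Keep only entries with i < j, i.e. rows that come before the reference row
d[lower.tri(d, diag = TRUE)] <- NA
d[1, 1] <- 0  # the first row has no earlier rows; define its result as 0

# Column-wise minimum = minimum over all earlier rows
min_dist <- apply(d, MARGIN = 2, min, na.rm = TRUE)
min_dist
# 0 -20 -6 6, the expected output given in the question
```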
I need help with R programming. I have a data.frame B with one column:
x<- c("300","300","300","400","400","400","500","500","500"....etc.) # 2 million rows
and I need to create another column containing a rank. The new column should look like this:
y<- c(1,2,3,1,2,3,1,2,3,......etc. )
I used a for loop:
B$y[1]=1
for (i in 2:length(B$x))
{
B$y[i]<-ifelse(B$x[i]==B$x[i-1], B$y[i-1]+1, 1)
}
The process ran for 4 hours.
So I need help speeding this up, or an alternative approach.
Thanks for your answer.
Here is a solution with base R:
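B <- data.frame(x = rep(c(300, 400, 500), sample(c(5:10), 3)))
B
# Pass an integer index as the first argument so that y is an integer even when x is character
B$y <- ave(seq_along(B$x), B$x, FUN=seq_along)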
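Grouping by value, as ave does, matches the loop in the question only when equal values of x are adjacent. The loop itself restarts the counter whenever x differs from the previous row, so if the same value can reappear later in the column, a run-based counter may be closer to the original behaviour. A minimal base R sketch using rle and sequence, on a small made-up vector:

```r
# Made-up example in which "300" appears in two separate runs
x <- c("300", "300", "300", "400", "400", "500", "300")

# rle() gives the length of each run of equal adjacent values;
# sequence() expands each length k into 1, 2, ..., k
y <- sequence(rle(x)$lengths)
y
# 1 2 3 1 2 1 1
```

Both functions are vectorised, so this avoids the row-by-row loop entirely.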
Here's an approach with dplyr that takes about 0.2 seconds on 2 million rows.
First I make sample data:
n = 2E6 # number of rows in test
library(dplyr)
sample_data <- data.frame(
x = round(runif(n = n, min = 1, max = 100000), digits = 0)
) %>%
arrange(x) # Optional, added to make output clearer so that each x is adjacent to the others that match.
Then I group by x and make y show which # occurrence of x it is within that group.
sample_data_with_rank <- sample_data %>%
group_by(x) %>%
mutate(y = row_number()) %>%
ungroup()
head(sample_data_with_rank, 20)
# A tibble: 20 x 2
x y
<dbl> <int>
1 1 1
2 1 2
3 1 3
4 1 4
5 1 5
6 1 6
7 1 7
8 1 8
9 1 9
10 1 10
11 1 11
12 1 12
13 1 13
14 1 14
15 1 15
16 2 1
17 2 2
18 2 3
19 2 4
20 2 5
I am working with gait-cycle data. I have 8 events marked for each id and gait trial. The values "LFCH" and "RFCH" occur twice in each trial, as they mark the beginning and the end of the gait cycles of the left and right leg.
Sample Data Frame:
df <- data.frame(ID = rep(1:5, each = 16),
Gait_nr = rep(1:2, each = 8, times=5),
Frame = rep(c(1,5,7,9,10,15,22,25), times = 10),
Marks = rep(c("LFCH", "LHL", "RFCH", "LTO", "RHL", "LFCH", "RTO", "RFCH"), times =10))
head(df,8)
ID Gait_nr Frame Marks
1 1 1 1 LFCH
2 1 1 5 LHL
3 1 1 7 RFCH
4 1 1 9 LTO
5 1 1 10 RHL
6 1 1 15 LFCH
7 1 1 22 RTO
8 1 1 25 RFCH
I would like to create something like
Total_gait_left = Frame[The last time Marks == "LFCH"] - Frame[The first time Marks == "LFCH"]
My current code solves the problem, but depends on the position of the Frame values rather than actual values in Marks. Any individual not following the normal gait pattern will have wrong values produced by the code.
library(tidyverse)
l <- df %>% group_by(ID, Gait_nr) %>% filter(grepl("L.+", Marks)) %>%
summarize(Total_gait = Frame[4] - Frame[1],
Side = "left")
r <- df %>% group_by(ID, Gait_nr) %>% filter(grepl("R.+", Marks)) %>%
summarize(Total_gait = Frame[4] - Frame[1],
Side = "right")
val <- union(l,r, by=c("ID", "Gait_nr", "Side")) %>% arrange(ID, Gait_nr, Side)
Can you help me make my code more stable by helping me change e.g. Frame[4] to something like Frame[Marks=="LFCH" the last time ]?
If both LFCH and RFCH occur exactly twice, you can subset Frame by Marks and then use diff in summarise:
df %>%
group_by(ID, Gait_nr) %>%
summarise(
left = diff(Frame[Marks == 'LFCH']),
right = diff(Frame[Marks == 'RFCH'])
)
# A tibble: 10 x 4
# Groups: ID [?]
# ID Gait_nr left right
# <int> <int> <dbl> <dbl>
# 1 1 1 14 18
# 2 1 2 14 18
# 3 2 1 14 18
# 4 2 2 14 18
# 5 3 1 14 18
# 6 3 2 14 18
# 7 4 1 14 18
# 8 4 2 14 18
# 9 5 1 14 18
#10 5 2 14 18
We can use first and last from the dplyr package.
library(dplyr)
df2 <- df %>%
filter(Marks %in% "LFCH") %>%
group_by(ID, Gait_nr) %>%
summarise(Total_gait = last(Frame) - first(Frame)) %>%
ungroup()
df2
# # A tibble: 10 x 3
# ID Gait_nr Total_gait
# <int> <int> <dbl>
# 1 1 1 14
# 2 1 2 14
# 3 2 1 14
# 4 2 2 14
# 5 3 1 14
# 6 3 2 14
# 7 4 1 14
# 8 4 2 14
# 9 5 1 14
# 10 5 2 14
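The first/last idea can be extended to both legs at once, giving a long table with a Side column like the val object in the question. This is a sketch on a reduced copy of the sample data (2 IDs instead of 5); the mapping of "LFCH" to "left" and "RFCH" to "right" follows the question's description.

```r
library(dplyr)

# Reduced copy of the sample data: 2 IDs, 2 gait trials each
df <- data.frame(
  ID      = rep(1:2, each = 16),
  Gait_nr = rep(1:2, each = 8, times = 2),
  Frame   = rep(c(1, 5, 7, 9, 10, 15, 22, 25), times = 4),
  Marks   = rep(c("LFCH", "LHL", "RFCH", "LTO", "RHL", "LFCH", "RTO", "RFCH"), times = 4)
)

val <- df %>%
  filter(Marks %in% c("LFCH", "RFCH")) %>%
  # Derive the side from the mark, then group by it together with ID and trial
  group_by(ID, Gait_nr, Side = if_else(Marks == "LFCH", "left", "right")) %>%
  # Last occurrence minus first occurrence, regardless of how many there are
  summarise(Total_gait = last(Frame) - first(Frame), .groups = "drop")
val
```

Unlike diff, last(Frame) - first(Frame) still returns a single value per group if a mark occurs more than twice, and returns 0 if it occurs only once.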