Despite using R and dplyr on a regular basis, I could not work out how to calculate the sum of the absolute differences between adjacent columns:
sum_diff=ABS(A-B)+ABS(B-C)+ABS(C-D)...
A B C D sum_diff
1 2 3 4 3
2 1 3 4 4
1 2 1 1 2
4 1 2 1 5
I know I could iterate using a for loop over all columns, but given the size of my data frame, I prefer a more elegant and fast solution.
Any help?
Thank you
We can drop the last column from one copy of the data and the first column from another copy, subtract the two, and use rowSums on the absolute values in base R. This could be very efficient compared to a package solution
df1$sum_diff <- rowSums(abs(df1[-ncol(df1)] - df1[-1]))
-output
> df1
A B C D sum_diff
1 1 2 3 4 3
2 2 1 3 4 4
3 1 2 1 1 2
4 4 1 2 1 5
Or another option is rowDiffs from matrixStats
library(matrixStats)
rowSums(abs(rowDiffs(as.matrix(df1))))
[1] 3 4 2 5
data
df1 <- structure(list(A = c(1L, 2L, 1L, 4L), B = c(2L, 1L, 2L, 1L),
C = c(3L, 3L, 1L, 2L), D = c(4L, 4L, 1L, 1L)), row.names = c(NA,
-4L), class = "data.frame")
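To see why subtracting the two shifted copies works, the base R one-liner can be spelled out step by step. This is only a sketch on the example data; the names left, right and adjacent are made up for illustration and are not part of the original answer.

```r
# Example data from the question (columns A to D).
df1 <- data.frame(A = c(1L, 2L, 1L, 4L), B = c(2L, 1L, 2L, 1L),
                  C = c(3L, 3L, 1L, 2L), D = c(4L, 4L, 1L, 1L))

left  <- df1[-ncol(df1)]  # columns A, B, C
right <- df1[-1]          # columns B, C, D

# Data frames of the same shape are subtracted position by position,
# so this holds |A-B|, |B-C| and |C-D| for every row.
adjacent <- abs(left - right)

# Summing across each row gives the requested sum_diff.
sum_diff <- rowSums(adjacent)
print(unname(sum_diff))
```

Because the subtraction is positional, the approach relies on the columns already being in the order in which they should be compared.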
Data from akrun (many thanks)!
This is a bit complicated. The idea is to generate a list of the column pairs to compare. I tried combn, but it returns all possible combinations rather than only the adjacent ones, so I created the list by hand.
With these combinations we can then use purrr's map_dfc and do some data wrangling afterwards:
library(tidyverse)
combinations <-list(c("A", "B"), c("B", "C"), c("C","D"))
purrr::map_dfc(combinations, ~{df <- tibble(a=data[[.[[1]]]]-data[[.[[2]]]])
names(df) <- paste0(.[[1]],"_v_",.[[2]])
df}) %>%
transmute(sum_diff = rowSums(abs(.))) %>%
bind_cols(data)
sum_diff A B C D
<dbl> <int> <int> <int> <int>
1 3 1 2 3 4
2 4 2 1 3 4
3 2 1 2 1 1
4 5 4 1 2 1
data:
data <- structure(list(A = c(1L, 2L, 1L, 4L), B = c(2L, 1L, 2L, 1L),
C = c(3L, 3L, 1L, 2L), D = c(4L, 4L, 1L, 1L)), row.names = c(NA,
-4L), class = "data.frame")
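The hand-written list of adjacent pairs could probably also be built programmatically. A minimal sketch, assuming the columns are already in comparison order; the variable cols is introduced here only for illustration:

```r
# Column names in the order in which neighbours should be compared.
cols <- c("A", "B", "C", "D")

# Pair every column with its right-hand neighbour:
# head(cols, -1) drops the last name, tail(cols, -1) drops the first.
combinations <- unname(Map(c, head(cols, -1), tail(cols, -1)))

str(combinations)
```

This yields the same three pairs as the manual list, and it scales to any number of columns, unlike combn(cols, 2), which would also return non-adjacent pairs such as A and C.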
Here is a dplyr version of @akrun's elegant approach, which subtracts the shifted variant of the data frame from the data frame itself:
df %>%
mutate(sum_diff = rowSums(abs(identity(.) %>% select(1:last_col(1))
- identity(.) %>% select(2:last_col()))))
And here is the rowwise variant, which basically follows the same idea, but this time every row is treated as a vector from which its shifted self is subtracted.
df %>%
rowwise() %>%
mutate(sum_diff = map2_int(c_across(1:last_col(1)),
c_across(2:last_col()),
~ abs(.x - .y)) %>% sum())
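The "row minus its shifted self" idea is what base R's diff() computes for a vector, so the rowwise logic can be cross-checked without dplyr. This is a sketch for comparison only, not the answer's method:

```r
# Example data from the question (columns A to D).
df1 <- data.frame(A = c(1L, 2L, 1L, 4L), B = c(2L, 1L, 2L, 1L),
                  C = c(3L, 3L, 1L, 2L), D = c(4L, 4L, 1L, 1L))

# For each row r, diff(r) returns r[2]-r[1], r[3]-r[2], r[4]-r[3].
# The sign is the opposite of A-B, but abs() makes that irrelevant.
sum_diff <- apply(as.matrix(df1), 1, function(r) sum(abs(diff(r))))

print(unname(sum_diff))
```

Like rowwise(), apply() works through the data one row at a time, so on a large data frame the vectorised rowSums() version is likely to be faster.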
I have the below dataframe from which I intend to create a calculated field at each Code level or row level.
Code count_pol const_q
A028 12 3
B09 7 4
M017 5 2
S83 4 1
S1960 6 4
S179 2 2
S193 3 3
In the above dataset, I want to create a calculated field y for which the following conditions apply:
If, for a code, count_pol is one of 1, 2, 3, then y = count_pol/const_q; otherwise y = const_q/4.
Thus the expected output is:
Code count_pol const_q y
A028 12 3 0.75
B09 7 4 1
M017 5 2 0.5
S83 4 1 0.25
S1960 6 4 1
S179 2 2 1
S193 3 3 1
I have tried the below code:
a_df <- mutate(a_df,
y = if_else(count_pol %in% c(1:3), as.integer(const_q)/count_pol,const_q/4))
but that does not give the desired output.
Can someone please help me rectify this?
We can use if_else to check for values in 1:3
library(dplyr)
df %>% mutate(y = if_else(count_pol %in% 1:3, count_pol/const_q, const_q/4))
# Code count_pol const_q y
#1 A028 12 3 0.75
#2 B09 7 4 1.00
#3 M017 5 2 0.50
#4 S83 4 1 0.25
#5 S1960 6 4 1.00
#6 S179 2 2 1.00
#7 S193 3 3 1.00
and in base R that would be
transform(df, y = ifelse(count_pol %in% 1:3, count_pol/const_q, const_q/4))
data
df <- structure(list(Code = structure(c(1L, 2L, 3L, 7L, 6L, 4L, 5L),
.Label = c("A028", "B09", "M017", "S179", "S193", "S1960", "S83"),
class = "factor"), count_pol = c(12L, 7L, 5L, 4L, 6L, 2L, 3L), const_q = c(3L,
4L, 2L, 1L, 4L, 2L, 3L)), class = "data.frame", row.names = c(NA, -7L))
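The same rule can also be written with plain logical indexing, which makes the two branches explicit. A minimal sketch on the example columns; the helper name small is an assumption made for illustration:

```r
# The two numeric columns from the example data.
count_pol <- c(12L, 7L, 5L, 4L, 6L, 2L, 3L)
const_q   <- c(3L, 4L, 2L, 1L, 4L, 2L, 3L)

# Start with the "else" branch for every row ...
y <- const_q / 4

# ... then overwrite the rows where count_pol is 1, 2 or 3.
small <- count_pol %in% 1:3
y[small] <- count_pol[small] / const_q[small]

print(y)
```

In this particular data set only the last two rows fall into the first branch, and in both of them count_pol equals const_q, so the ratio is 1 whichever way round it is computed.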
With case_when() ...
df %>%
  group_by(Code) %>%  # the column is named Code, with a capital C
  mutate(
    y = case_when(
      count_pol %in% c(1, 2, 3) ~ count_pol/const_q,
      TRUE ~ const_q/4
    )
  )
I have a data frame with 163 observations and 65 columns with some animal data. The 163 observations are from 56 animals, and each was supposed to have triplicated records, but some information was lost so for the majority of animals, I have triplicates ("A", "B", "C") and for some I have only duplicates (which vary among "A" and "B", "A" and "C" and "B" and "C").
Columns 13:65 contain some information I would like to sum, retaining only the one replicate per animal with the highest rowSums value. So my data frame would be something like this:
ID Trip Acet Cell Fibe Mega Tera
1 4 A 2 4 9 8 3
2 4 B 9 3 7 5 5
3 4 C 1 2 4 8 6
4 12 A 4 6 7 2 3
5 12 B 6 8 1 1 2
6 12 C 5 5 7 3 3
I am not sure if what I need is to write my own function, or a loop, or what the best alternative actually is - sorry I am still learning and unfortunately for me, I don't think like a programmer so that makes things even more challenging...
So what I want is to keep rows 2 and 6 (which have the highest rowSums among the triplicates of each animal), but for the whole data frame. What I want as a result is
ID Trip Acet Cell Fibe Mega Tera
1 4 B 9 3 7 5 5
2 12 C 5 5 7 3 3
REALLY sorry if the question is poorly elaborated or if it doesn't make sense, this is my first time asking a question here and I have only recently started learning R.
We can create the row sums separately and use ave to find the row with the maximum row sum in each group. Then use the resulting logical vector to subset the rows of the dataset
nm1 <- startsWith(names(df1), "V")
The OP updated the column names. In that case, either use an index
nm1 <- 3:7
Or select the columns with setdiff
nm1 <- setdiff(names(df1), c("ID", "Trip"))
v1 <- rowSums(df1[nm1], na.rm = TRUE)
i1 <- with(df1, v1 == ave(v1, ID, FUN = max))
df1[i1,]
# ID Trip V1 V2 V3 V4 V5
#2 4 B 9 3 7 5 5
#6 12 C 5 5 7 3 3
data
df1 <- structure(list(ID = c(4L, 4L, 4L, 12L, 12L, 12L), Trip = structure(c(1L,
2L, 3L, 1L, 2L, 3L), .Label = c("A", "B", "C"), class = "factor"),
V1 = c(2L, 9L, 1L, 4L, 6L, 5L), V2 = c(4L, 3L, 2L, 6L, 8L,
5L), V3 = c(9L, 7L, 4L, 7L, 1L, 7L), V4 = c(8L, 5L, 8L, 2L,
1L, 3L), V5 = c(3L, 5L, 6L, 3L, 2L, 3L)),
class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))
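The key step is that ave() returns the group maximum repeated for every row of its group, so it can be compared element by element with the row sums. A small sketch of the intermediate values on the example data; the name group_max is made up here for illustration:

```r
# Example data with the V1..V5 column names used in this answer.
df1 <- data.frame(ID = c(4L, 4L, 4L, 12L, 12L, 12L),
                  Trip = c("A", "B", "C", "A", "B", "C"),
                  V1 = c(2L, 9L, 1L, 4L, 6L, 5L), V2 = c(4L, 3L, 2L, 6L, 8L, 5L),
                  V3 = c(9L, 7L, 4L, 7L, 1L, 7L), V4 = c(8L, 5L, 8L, 2L, 1L, 3L),
                  V5 = c(3L, 5L, 6L, 3L, 2L, 3L))

# One total per row.
v1 <- rowSums(df1[startsWith(names(df1), "V")])

# ave() applies max() within each ID and returns a vector as long as v1,
# holding the maximum of the group that each row belongs to.
group_max <- ave(v1, df1$ID, FUN = max)

# TRUE only for the rows whose total equals their group's maximum.
keep <- v1 == group_max
print(data.frame(v1, group_max, keep))
```

Here the totals are 26, 29, 21 for ID 4 and 22, 18, 23 for ID 12, so rows 2 and 6 are kept.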
Here is one way.
library(tidyverse)
dat2 <- dat %>%
mutate(Sum = rowSums(select(dat, starts_with("V")))) %>%
group_by(ID) %>%
filter(Sum == max(Sum)) %>%
select(-Sum) %>%
ungroup()
dat2
# # A tibble: 2 x 7
# ID Trip V1 V2 V3 V4 V5
# <int> <fct> <int> <int> <int> <int> <int>
# 1 4 B 9 3 7 5 5
# 2 12 C 5 5 7 3 3
Here is another one. This method makes sure only one row is kept per ID, even if several rows have a row sum equal to the maximum.
dat3 <- dat %>%
mutate(Sum = rowSums(select(dat, starts_with("V")))) %>%
arrange(ID, desc(Sum)) %>%
group_by(ID) %>%
slice(1) %>%
select(-Sum) %>%
ungroup()
dat3
# # A tibble: 2 x 7
# ID Trip V1 V2 V3 V4 V5
# <int> <fct> <int> <int> <int> <int> <int>
# 1 4 B 9 3 7 5 5
# 2 12 C 5 5 7 3 3
DATA
dat <- read.table(text = " ID Trip V1 V2 V3 V4 V5
1 4 A 2 4 9 8 3
2 4 B 9 3 7 5 5
3 4 C 1 2 4 8 6
4 12 A 4 6 7 2 3
5 12 B 6 8 1 1 2
6 12 C 5 5 7 3 3 ",
header = TRUE)
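The difference between the two methods only shows up when there is a tie, which the example data does not contain. The sketch below uses a small made-up data frame (not the OP's data) and base R equivalents of the two pipelines to illustrate it:

```r
# Hypothetical data: for ID 1, replicates A and C tie with a total of 7.
dat <- data.frame(ID = c(1L, 1L, 1L, 2L, 2L),
                  Trip = c("A", "B", "C", "A", "B"),
                  V1 = c(5L, 3L, 5L, 1L, 4L),
                  V2 = c(2L, 1L, 2L, 6L, 0L))
total <- rowSums(dat[c("V1", "V2")])

# Equivalent of filter(Sum == max(Sum)): every tied row is kept.
keep_all <- dat[total == ave(total, dat$ID, FUN = max), ]

# Equivalent of arrange(ID, desc(Sum)) followed by slice(1):
# sort by ID and descending total, then keep the first row of each ID.
sorted <- dat[order(dat$ID, -total), ]
first_only <- sorted[!duplicated(sorted$ID), ]

print(keep_all)
print(first_only)
```

With the tie, the first method returns three rows and the second returns two, so the choice depends on whether tied replicates should all be retained.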
I have this dataset:
ID Set Type Count
1 1 1 A NA
2 2 1 R NA
3 3 1 R NA
4 4 1 U NA
5 5 1 U NA
6 6 1 U NA
7 7 2 A NA
8 8 3 R NA
9 9 3 R NA
As dputs:
mystart <- structure(list(ID = 1:9, Set = c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
3L, 3L), Type = structure(c(1L, 2L, 2L, 3L, 3L, 3L, 1L, 2L, 2L
), .Label = c("A", "R", "U"), class = "factor"), Count = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA)), .Names = c("ID", "Set", "Type",
"Count"), class = "data.frame", row.names = c(NA, -9L))
By using dplyr package how can I obtain this:
ID Set Type Count
1 1 1 A A1
2 2 1 R A1R1
3 3 1 R A1R2
4 4 1 U A1R2U1
5 5 1 U A1R2U2
6 6 1 U A1R2U3
7 7 2 A A1
8 8 3 R R1
9 9 3 R R2
Again dputs:
myend <- structure(list(ID = 1:9, Set = c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
3L, 3L), Type = structure(c(1L, 2L, 2L, 3L, 3L, 3L, 1L, 2L, 2L
), .Label = c("A", "R", "U"), class = "factor"), Count = structure(c(1L,
2L, 3L, 4L, 5L, 6L, 1L, 7L, 8L), .Label = c("A1", "A1R1", "A1R2",
"A1R2U1", "A1R2U2", "A1R2U3", "R1", "R2"), class = "factor")), .Names = c("ID",
"Set", "Type", "Count"), class = "data.frame", row.names = c(NA,
-9L))
In short, I want to count the observations of the column "type" within column "set" and print this count(text) cumulatively.
Examining similar posts, I got close with this:
myend <- structure(list(ID = 1:9, Set = c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
3L, 3L), Type = structure(c(1L, 2L, 2L, 3L, 3L, 3L, 1L, 2L, 2L
), .Label = c("A", "R", "U"), class = "factor"), Count = c(1L,
1L, 2L, 1L, 2L, 3L, 1L, 1L, 2L)), .Names = c("ID", "Set", "Type",
"Count"), class = "data.frame", row.names = c(NA, -9L))
With the code:
library(dplyr)
myend <- read.table("mydata.txt", header=TRUE, fill=TRUE)
myend %>%
group_by(Set, Type) %>%
mutate(Count = seq(n())) %>%
ungroup()
Thank you very much for your help,
Base R version :
aggregateGroup <- function(x){
vecs <- Reduce(f=function(a,b){ a[b] <- sum(a[b],1L,na.rm=TRUE); a },
init=integer(0),
as.character(x),
accumulate = TRUE)
# vecs is a list with something like this :
# [[1]]
# integer(0)
# [[2]]
# A
# 1
# [[3]]
# A R
# 1 1
# ...
# so we simply turn those vectors into characters using vapply and paste
# (excluding the first)
vapply(vecs,function(y) paste0(names(y),y,collapse=''),FUN.VALUE='')[-1]
}
split(mystart$Count,mystart$Set) <- lapply(split(mystart$Type,mystart$Set), aggregateGroup)
> mystart
ID Set Type Count
1 1 1 A A1
2 2 1 R A1R1
3 3 1 R A1R2
4 4 1 U A1R2U1
5 5 1 U A1R2U2
6 6 1 U A1R2U3
7 7 2 A A1
8 8 3 R R1
9 9 3 R R2
A dplyr version:
mystart %>%
group_by(Set) %>%
mutate(Count = paste0('A', cumsum(Type == 'A'),
'R', cumsum(Type == 'R'),
'U', cumsum(Type == 'U'))) %>%
ungroup()
Which yields
# A tibble: 9 x 4
ID Set Type Count
<int> <int> <chr> <chr>
1 1 1 A A1R0U0
2 2 1 R A1R1U0
3 3 1 R A1R2U0
4 4 1 U A1R2U1
5 5 1 U A1R2U2
6 6 1 U A1R2U3
7 7 2 A A1R0U0
8 8 3 R A0R1U0
9 9 3 R A0R2U0
If you want to omit the variables with count zero, you'd need to wrap a function around it like so
mygroup <- function(lst) {
name <- names(lst)
vectors <- lapply(seq_along(lst), function(i) {
x <- lst[[i]]
char <- name[i]
x <- ifelse(x == 0, "", paste0(char, x))
return(x)
})
return(do.call("paste0", vectors))
}
mystart %>%
group_by(Set) %>%
mutate(Count = mygroup(list(A = cumsum(Type == 'A'),
R = cumsum(Type == 'R'),
U = cumsum(Type == 'U')))) %>%
ungroup()
This yields
# A tibble: 9 x 4
ID Set Type Count
<int> <int> <chr> <chr>
1 1 1 A A1
2 2 1 R A1R1
3 3 1 R A1R2
4 4 1 U A1R2U1
5 5 1 U A1R2U2
6 6 1 U A1R2U3
7 7 2 A A1
8 8 3 R R1
9 9 3 R R2
One-line solution with data.table
First you need to do
require(data.table)
mystart <- as.data.table(mystart)
then just use one line
mystart[, .(Type,
            count = paste0('A', cumsum(Type == 'A'),
                           'R', cumsum(Type == 'R'),
                           'U', cumsum(Type == 'U'))),
        by = c('Set')]
First you want to cumsum each type and paste the results together by 'Set'.
cumsum(Type == 'A') equals the running count, since Type == 'A' counts as 1 when true and 0 otherwise.
You also wanted them pasted into one column, so paste0() is good to use.
You still wanted the Type column, so I included Type in the line.
The output:
Set Type count
1: 1 A A1R0U0
2: 1 R A1R1U0
3: 1 R A1R2U0
4: 1 U A1R2U1
5: 1 U A1R2U2
6: 1 U A1R2U3
7: 2 A A1R0U0
8: 3 R A0R1U0
9: 3 R A0R2U0
Hope this helps.
btw, if you want counts of 0 ignored, you gotta design some if-else clause yourself.
Basically you want this: if cumsum(something) == 0, NULL, else paste0('something', cumsum(something)), then you paste0() them together.
It's gonna get nasty, I'm not writing it. you get the idea
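For what it's worth, the zero-skipping logic described above does not have to get too nasty. Below is one possible sketch in base R: the helper running_label is a made-up name, it loops over whatever Type values occur instead of hard-coding 'A', 'R' and 'U', and it assumes the label order should follow the first appearance of each type within a Set.

```r
# Example data from the question (Type kept as plain character).
mystart <- data.frame(ID = 1:9,
                      Set = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 3L, 3L),
                      Type = c("A", "R", "R", "U", "U", "U", "A", "R", "R"))

# Hypothetical helper: running count per type, dropping counts of zero.
running_label <- function(type) {
  type <- as.character(type)
  parts <- lapply(unique(type), function(lev) {
    n <- cumsum(type == lev)            # running count of this type
    ifelse(n == 0, "", paste0(lev, n))  # empty string until it first occurs
  })
  do.call(paste0, parts)                # glue the pieces together row by row
}

# ave() applies the helper separately within each Set.
mystart$Count <- ave(as.character(mystart$Type), mystart$Set, FUN = running_label)
print(mystart)
```

On the example data this gives A1, A1R1, A1R2, A1R2U1, A1R2U2, A1R2U3 for Set 1, A1 for Set 2 and R1, R2 for Set 3.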
Here's a base solution.
We can paste the raw letters to the seq_along of each letter group to get the last 2 characters, then paste the result to the last element of the previous result, using Reduce.
On top of this we use ave to compute by group.
fun <- function(x,y) paste0(x[length(x)],y,seq_along(y))
mystart$Count <- ave(as.character(mystart$Type),mystart$Set,
FUN = function(x) unlist(Reduce(fun,split(x,x),init=NULL,accumulate = TRUE)))
# ID Set Type Count
# 1 1 1 A A1
# 2 2 1 R A1R1
# 3 3 1 R A1R2
# 4 4 1 U A1R2U1
# 5 5 1 U A1R2U2
# 6 6 1 U A1R2U3
# 7 7 2 A A1
# 8 8 3 R R1
# 9 9 3 R R2
Details
split(x,x) splits letters as shown here for first Set:
with(subset(mystart,Set==1),split(Type,Type))
# $A
# [1] "A"
#
# $R
# [1] "R" "R"
#
# $U
# [1] "U" "U" "U"
Then fun does this type of operations, helped by Reduce :
fun(NULL,"A") # [1] "A1"
fun("A1",c("R","R")) # [1] "A1R1" "A1R2"
fun(c("A1R1","A1R2"),c("U","U","U")) # [1] "A1R2U1" "A1R2U2" "A1R2U3"
Bonus solution
This other base solution, using rle and avoiding split gives the same output for given example (and whenever Type values are grouped in Sets), but not with mystart2 <- rbind(mystart,mystart) for instance.
fun2 <- function(x){
rle_ <- rle(x)
suffix <- paste0(x,sequence(rle_$length))
prefix <- unlist(mapply(rep,
lag(unlist(
Reduce(paste0,paste0(rle_$values,rle_$lengths),accumulate=TRUE)
),rle_$lengths[1]),
each=rle_$lengths))
prefix[is.na(prefix)] <- ""
paste0(prefix,suffix)
}
mystart$Count2 <-ave(as.character(mystart$Type), mystart$Set,FUN=fun2)
Many elegant solutions have already been provided for this problem. Still, I was looking for something in the dplyr way (without cumsum on fixed types). The function is generic enough to handle additional values of Type.
A solution with the help of a custom function, which has to be defined before the pipeline is run:
library(dplyr)
cumCat <- function(x, y){
  retVal <- character(length(x))
  prevVal = ""
  lastGrpVal = ""
  for ( i in seq_along(x)){
    # y[i] == 1 marks the first row of a new Type within the Set
    if(y[i]==1){
      lastGrpVal = prevVal
    }
    retVal[i] = paste0(lastGrpVal,x[i])
    prevVal = retVal[i]
  }
  retVal
}
mystart %>% group_by(Set, Type) %>%
  mutate(type_count = row_number()) %>%
  mutate(TypeMod = paste0(Type,type_count)) %>%
  group_by(Set) %>%
  mutate(Count = cumCat(TypeMod, type_count)) %>%
  select(-type_count, -TypeMod)
# # Groups: Set [3]
# ID Set Type Count
# <int> <int> <fctr> <chr>
# 1 1 1 A A1
# 2 2 1 R A1R1
# 3 3 1 R A1R2
# 4 4 1 U A1R2U1
# 5 5 1 U A1R2U2
# 6 6 1 U A1R2U3
# 7 7 2 A A1
# 8 8 3 R R1
# 9 9 3 R R2
I am working with a gigantic person-period file and I thought that
a good way to deal with a large dataset is by using sampling and re-sampling technique.
My person-period file looks like this
id code time
1 1 a 1
2 1 a 2
3 1 a 3
4 2 b 1
5 2 c 2
6 2 b 3
7 3 c 1
8 3 c 2
9 3 c 3
10 4 c 1
11 4 a 2
12 4 c 3
13 5 a 1
14 5 c 2
15 5 a 3
I have actually two distinct issues.
The first issue is that I am having trouble in simply sampling a person-period file.
For example, I would like to sample 2 id-sequences such as :
id code time
1 a 1
1 a 2
1 a 3
2 b 1
2 c 2
2 b 3
The following line of code is working for sampling a person-period file
dt[which(dt$id %in% sample(dt$id, 2)), ]
However, I would like to use a dplyr solution because I am interested in resampling and in particular I would like to use replicate.
I am interested in doing something like replicate(100, sample_n(dt, 2), simplify = FALSE)
I am struggling with the dplyr solution because I am not sure what should be the grouping variable.
library(dplyr)
dt %>% group_by(id) %>% sample_n(1)
gives me an incorrect result because it does not keep the full sequence of each id.
Any clue how I could both sample and re-sample a person-period file?
data
dt = structure(list(id = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L,
3L, 4L, 4L, 4L, 5L, 5L, 5L), .Label = c("1", "2", "3", "4", "5"
), class = "factor"), code = structure(c(1L, 1L, 1L, 2L, 3L,
2L, 3L, 3L, 3L, 3L, 1L, 3L, 1L, 3L, 1L), .Label = c("a", "b",
"c"), class = "factor"), time = structure(c(1L, 2L, 3L, 1L, 2L,
3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), .Label = c("1", "2",
"3"), class = "factor")), .Names = c("id", "code", "time"), row.names = c(NA,
-15L), class = "data.frame")
I think the idiomatic way would probably look like this (df here is the data frame called dt in the question):
set.seed(1)
samp = df %>% select(id) %>% distinct %>% sample_n(2)
left_join(samp, df)
id code time
1 2 b 1
2 2 c 2
3 2 b 3
4 5 a 1
5 5 c 2
6 5 a 3
This extends straightforwardly to more grouping variables and fancier sampling rules.
If you need to do this many times...
nrep = 100
ng = 2
samps = df %>% select(id) %>% distinct %>%
slice(rep(1:n(), nrep)) %>% mutate(r = rep(1:nrep, each = n()/nrep)) %>%
group_by(r) %>% sample_n(ng)
repdat = left_join(samps, df)
# then do stuff with it:
repdat %>% group_by(r) %>% do_stuff
I imagine you are doing some simulations and may want to do the subsetting many times. You probably also want to try this data.table method and utilize the fast binary search feature on the key column:
library(data.table)
setDT(dt)
setkey(dt, id)
replicate(2, dt[list(sample(unique(id), 2))], simplify = FALSE)
#[[1]]
# id code time
#1: 3 c 1
#2: 3 c 2
#3: 3 c 3
#4: 5 a 1
#5: 5 c 2
#6: 5 a 3
#[[2]]
# id code time
#1: 3 c 1
#2: 3 c 2
#3: 3 c 3
#4: 4 c 1
#5: 4 a 2
#6: 4 c 3
We can use filter with sample
dt %>%
filter(id %in% sample(unique(id),2, replace = FALSE))
NOTE: The OP asked for a dplyr method, and this solution uses dplyr.
If we need to replicate this, one option is to use map from purrr
library(purrr)
dt %>%
distinct(id) %>%
replicate(2, .) %>%
map(~sample(., 2, replace=FALSE)) %>%
map(~filter(dt, id %in% .))
#$id
# id code time
#1 1 a 1
#2 1 a 2
#3 1 a 3
#4 4 c 1
#5 4 a 2
#6 4 c 3
#$id
# id code time
#1 4 c 1
#2 4 a 2
#3 4 c 3
#4 5 a 1
#5 5 c 2
#6 5 a 3
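The same sample-then-filter idea can also be wrapped in replicate(), which is what the question originally asked about. A minimal base R sketch: sample_persons is a hypothetical helper name, and id is stored as an integer here rather than as the factor used in the question's data.

```r
# Person-period data from the question (id as integer for simplicity).
dt <- data.frame(id = rep(1:5, each = 3),
                 code = c("a", "a", "a", "b", "c", "b", "c", "c", "c",
                          "c", "a", "c", "a", "c", "a"),
                 time = rep(1:3, times = 5))

# Hypothetical helper: draw n_id distinct persons, keep all of their rows.
sample_persons <- function(data, n_id) {
  picked <- sample(unique(data$id), n_id)  # sample persons, not rows
  data[data$id %in% picked, ]
}

set.seed(1)
resamples <- replicate(100, sample_persons(dt, 2), simplify = FALSE)

# Every resample contains two complete sequences of three rows each.
table(vapply(resamples, nrow, integer(1)))
```

Sampling from unique(data$id) matters: sample(dt$id, 2), as in the question, draws from the row-level ids and can therefore pick the same person twice, which would return only one sequence.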