I searched but haven't seen where this has been handled. I have a data frame of pairwise absolute differences between sites from a project, and the data look like this:
x y value
1 2 1 5
2 3 1 4
3 4 1 6
4 5 1 3
5 3 2 5
6 4 2 7
7 5 2 3
8 4 3 2
9 5 3 5
10 5 4 7
where x and y are the paired sites and value is the difference. I would like to get the mean for each site, displayed separately. E.g., the mean over all pairs involving site 5 (5|3, 5|4, 5|1, 5|2) is 4.5, so my results would look like this:
site avg
1 4.5
2 5
3 4
4 5.5
5 4.5
Who's got a solution?
Here is another option with tidyverse
library(tidyverse)
df %>%
select(x, y) %>%
unlist %>%
unique %>%
sort %>%
tibble(site = .) %>%
mutate(avg = map_dbl(site, ~
df %>%
filter_at(vars(x, y), any_vars(. == .x)) %>%
summarise(value = mean(value)) %>%
pull(value)))
# A tibble: 5 x 2
# site avg
# <int> <dbl>
#1 1 4.5
#2 2 5
#3 3 4
#4 4 5.5
#5 5 4.5
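The same per-site means can also be sketched in base R without any packages. The idea, assuming each row should count once for site x and once for site y, is to stack both site columns into one long table and average by site. The names long and res are only illustrative, and the data frame is rebuilt here so the block runs on its own.

```r
# Pairwise data from the question, rebuilt so the sketch is self-contained
df <- data.frame(x = c(2, 3, 4, 5, 3, 4, 5, 4, 5, 5),
                 y = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 4),
                 value = c(5, 4, 6, 3, 5, 7, 3, 2, 5, 7))

# Each pair contributes its value to both of its sites
long <- data.frame(site = c(df$x, df$y), value = rep(df$value, 2))

# Average per site; aggregate() returns the sites in sorted order
res <- aggregate(value ~ site, data = long, FUN = mean)
names(res) <- c("site", "avg")
res
```

For this data the averages agree with the expected table in the question (4.5, 5, 4, 5.5, 4.5 for sites 1 to 5).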
data
df <- structure(list(x = c(2L, 3L, 4L, 5L, 3L, 4L, 5L, 4L, 5L, 5L),
y = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 4L), value = c(5L,
4L, 6L, 3L, 5L, 7L, 3L, 2L, 5L, 7L)), .Names = c("x", "y",
"value"), class = "data.frame",
row.names = c("1", "2", "3",
"4", "5", "6", "7", "8", "9", "10"))
A solution using dplyr and mapply.
library(dplyr)
data.frame(site = unique(c(df$x, df$y))) %>%
mutate(mean = mapply(function(v)mean(df$value[df$x==v | df$y==v]), .$site)) %>%
arrange(site)
# site mean
# 1 1 4.5
# 2 2 5.0
# 3 3 4.0
# 4 4 5.5
# 5 5 4.5
Data:
df <- read.table(text =
" x y value
1 2 1 5
2 3 1 4
3 4 1 6
4 5 1 3
5 3 2 5
6 4 2 7
7 5 2 3
8 4 3 2
9 5 3 5
10 5 4 7",
header = TRUE, stringsAsFactors = FALSE)
If we name your original data example as df:
df$site_pair <- paste(df$x, df$y, sep = "-")
all_sites <- unique(c(df$x, df$y))
site_get_mean <- function(site_name) {
yes <- grepl(paste0("\\b", site_name, "\\b"), df$site_pair)  # whole-number match, so site 1 would not match a pair like "10-2"
mean(df$value[yes])
}
df.new <- data.frame(site = all_sites,
avg = sapply(all_sites, site_get_mean))
Result: (edited to order by site name)
> df.new[order(df.new$site), ]
site avg
5 1 4.5
1 2 5.0
2 3 4.0
3 4 5.5
4 5 4.5
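One thing to watch when matching site names inside pasted pair labels: a bare pattern such as "1" is a substring match, so it would also hit labels containing 10, 11 or 12. The small sketch below uses hypothetical pair labels (not from the question's data) to compare a bare pattern with one anchored by word boundaries.

```r
# Hypothetical pair labels with multi-digit site ids
pairs <- c("10-2", "3-1", "12-11")

# Bare pattern: matches any label containing the character "1"
loose <- grepl("1", pairs)

# Word boundaries: matches only where 1 stands alone as a site id
strict <- grepl("\\b1\\b", pairs)

data.frame(pairs, loose, strict)
```

With single-digit site ids, as in the question, both patterns select the same rows, so the difference only matters once ids grow past 9.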
Related
I would like to cut rows from my data frame by group (column "Group") based on the number assigned in the column "Count".
Data looks like this
Group Count Result Result2
<chr> <dbl> <dbl> <dbl>
1 Ane 3 5 NA
2 Ane 3 6 5
3 Ane 3 4 5
4 Ane 3 8 5
5 Ane 3 7 8
6 John 2 9 NA
7 John 2 2 NA
8 John 2 4 2
9 John 2 3 2
Expected results
Group Count Result Result2
<chr> <dbl> <dbl> <dbl>
1 Ane 3 5 NA
2 Ane 3 6 5
3 Ane 3 4 5
6 John 2 9 NA
7 John 2 2 NA
Thanks!
We may use slice on the first value of 'Count' after grouping by 'Group'
library(dplyr)
df1 %>%
group_by(Group) %>%
slice(seq_len(first(Count))) %>%
ungroup
-output
# A tibble: 5 × 4
Group Count Result Result2
<chr> <int> <int> <int>
1 Ane 3 5 NA
2 Ane 3 6 5
3 Ane 3 4 5
4 John 2 9 NA
5 John 2 2 NA
Or use filter with row_number() to create a logical vector
df1 %>%
group_by(Group) %>%
filter(row_number() <= Count) %>%
ungroup
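For comparison, a base R sketch of the same per-group cut, assuming the rows are already in the order you want within each group. ave() with seq_along numbers the rows inside each group, and that position is compared with Count. The names pos and out are illustrative, and only the Result column is rebuilt here to keep the example short.

```r
# Reduced copy of the example data (Result2 omitted for brevity)
df1 <- data.frame(Group = c(rep("Ane", 5), rep("John", 4)),
                  Count = c(rep(3L, 5), rep(2L, 4)),
                  Result = c(5L, 6L, 4L, 8L, 7L, 9L, 2L, 4L, 3L),
                  stringsAsFactors = FALSE)

# Position of each row within its group: 1..5 for Ane, 1..4 for John
pos <- ave(seq_along(df1$Group), df1$Group, FUN = seq_along)

# Keep the first 'Count' rows of every group
out <- df1[pos <= df1$Count, ]
out
```

This keeps the original row names (1, 2, 3, 6, 7), which matches the expected result shown in the question.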
data
df1 <- structure(list(Group = c("Ane", "Ane", "Ane", "Ane", "Ane", "John",
"John", "John", "John"), Count = c(3L, 3L, 3L, 3L, 3L, 2L, 2L,
2L, 2L), Result = c(5L, 6L, 4L, 8L, 7L, 9L, 2L, 4L, 3L), Result2 = c(NA,
5L, 5L, 5L, 8L, NA, NA, 2L, 2L)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9"))
I am converting old base R code into tidyverse and could use some help. I want to reverse code some vars in df1 conditional on the variable being tagged as positive==1 in a lookup table df2. Here's my base R solution:
library(tidyverse)
set.seed(1)
df1 <- data.frame(item1 = sample(1:4, 10, replace = TRUE),
item2 = sample(1:4, 10, replace = TRUE),
item3 = sample(1:4, 10, replace = TRUE))
df1
# item1 item2 item3
# 1 2 1 4
# 2 2 1 1
# 3 3 3 3
# 4 4 2 1
# 5 1 4 2
# 6 4 2 2
# 7 4 3 1
# 8 3 4 2
# 9 3 2 4
# 10 1 4 2
df2 <- data.frame(name = c("item1", "item2"),
positive = c(1, 0))
# name positive
# 1 item1 1
# 2 item2 0
vars <- c("item1", "item2")
# reverse code if positive==1
# 4=1, 3=2, 2=3, 1=4
for (i in vars) {
if (df2$positive[df2$name==i]==1) {
df1[i] <- 4 - df1[, i] + 1 # should reverse code item1
}
}
df1
# item1 item2 item3
# 1 3 1 4
# 2 3 1 1
# 3 2 3 3
# 4 1 2 1
# 5 4 4 2
# 6 1 2 2
# 7 1 3 1
# 8 2 4 2
# 9 2 2 4
# 10 4 4 2
We can use mutate_at, where we specify the vars by subsetting the 'name' column based on the binary values of 'positive' converted to logical, and reverse-code each selected column as 4 - x + 1
library(dplyr)
dfn <- df1 %>%
mutate_at(vars(intersect(names(.),
as.character(df2$name)[as.logical(df2$positive)])), ~ 4 - . + 1)
dfn
# item1 item2 item3
#1 3 1 4
#2 3 1 1
#3 2 3 3
#4 1 2 1
#5 4 4 2
#6 1 2 2
#7 1 3 1
#8 2 4 2
#9 2 2 4
#10 4 4 2
Or with base R
vars1 <- with(df2, as.character(name[as.logical(positive)]))
df1[vars1] <- lapply(df1[vars1], function(x) 4 - x + 1)
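The expression 4 - x + 1 is the usual reverse-coding rule (lowest + highest) - x for a 1 to 4 scale. A small sketch that makes the scale bounds explicit is below; reverse_code, lo and hi are hypothetical names introduced only for illustration, and the data is the first five rows of the question's df1.

```r
# Reverse-code a value on a scale running from lo to hi
reverse_code <- function(x, lo = 1, hi = 4) (lo + hi) - x

df1 <- data.frame(item1 = c(2L, 2L, 3L, 4L, 1L),
                  item2 = c(1L, 1L, 3L, 2L, 4L))
df2 <- data.frame(name = c("item1", "item2"),
                  positive = c(1, 0),
                  stringsAsFactors = FALSE)

# Columns flagged positive == 1 in the lookup table
to_flip <- df2$name[df2$positive == 1]
df1[to_flip] <- lapply(df1[to_flip], reverse_code)
df1
```

Keeping the bounds as arguments means the same helper could be reused for, say, a 1 to 7 scale by calling reverse_code(x, 1, 7).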
data
df1 <- structure(list(item1 = c(2L, 2L, 3L, 4L, 1L, 4L, 4L, 3L, 3L,
1L), item2 = c(1L, 1L, 3L, 2L, 4L, 2L, 3L, 4L, 2L, 4L), item3 = c(4L,
1L, 3L, 1L, 2L, 2L, 1L, 2L, 4L, 2L)), class = "data.frame",
row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10"))
I have a data frame with 163 observations and 65 columns with some animal data. The 163 observations are from 56 animals, and each was supposed to have triplicated records, but some information was lost so for the majority of animals, I have triplicates ("A", "B", "C") and for some I have only duplicates (which vary among "A" and "B", "A" and "C" and "B" and "C").
Columns 13:65 contain some information I would like to sum, and I want to retain only the one replicate with the highest rowSums value per animal. So my data frame would be something like this:
ID Trip Acet Cell Fibe Mega Tera
1 4 A 2 4 9 8 3
2 4 B 9 3 7 5 5
3 4 C 1 2 4 8 6
4 12 A 4 6 7 2 3
5 12 B 6 8 1 1 2
6 12 C 5 5 7 3 3
I am not sure if what I need is to write my own function, or a loop, or what the best alternative actually is - sorry I am still learning and unfortunately for me, I don't think like a programmer so that makes things even more challenging...
So what I want is to keep only rows 2 and 6 (which have the highest rowSums among the triplicates of each animal), but for the whole data frame. What I want as a result is
ID Trip Acet Cell Fibe Mega Tera
1 4 B 9 3 7 5 5
2 12 C 5 5 7 3 3
REALLY sorry if the question is poorly elaborated or if it doesn't make sense, this is my first time asking a question here and I have only recently started learning R.
We can compute the row sums separately and use ave to find, per ID, the rows whose sum equals the group maximum. Then use the resulting logical vector to subset the rows of the dataset
nm1 <- startsWith(names(df1), "V")
The OP has since updated the column names. In that case, use either an index
nm1 <- 3:7
Or select the columns with setdiff
nm1 <- setdiff(names(df1), c("ID", "Trip"))
v1 <- rowSums(df1[nm1], na.rm = TRUE)
i1 <- with(df1, v1 == ave(v1, ID, FUN = max))
df1[i1,]
# ID Trip V1 V2 V3 V4 V5
#2 4 B 9 3 7 5 5
#6 12 C 5 5 7 3 3
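To make the intermediate steps visible, here is a sketch that lays out the row sums next to the per-ID maximum returned by ave. The data frame is rebuilt from the example so the block runs on its own, and grp_max and steps are illustrative names.

```r
df1 <- data.frame(ID = c(4, 4, 4, 12, 12, 12),
                  Trip = c("A", "B", "C", "A", "B", "C"),
                  V1 = c(2, 9, 1, 4, 6, 5), V2 = c(4, 3, 2, 6, 8, 5),
                  V3 = c(9, 7, 4, 7, 1, 7), V4 = c(8, 5, 8, 2, 1, 3),
                  V5 = c(3, 5, 6, 3, 2, 3))

v1 <- rowSums(df1[3:7])                  # 26 29 21 22 18 23
grp_max <- ave(v1, df1$ID, FUN = max)    # 29 29 29 23 23 23

# A row is kept when its own sum equals its group's maximum
steps <- cbind(df1[c("ID", "Trip")], sum = v1, group_max = grp_max,
               keep = v1 == grp_max)
steps
```

Note that the comparison v1 == grp_max keeps every row that ties for the maximum, so an ID with two equal top sums would return two rows.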
data
df1 <- structure(list(ID = c(4L, 4L, 4L, 12L, 12L, 12L), Trip = structure(c(1L,
2L, 3L, 1L, 2L, 3L), .Label = c("A", "B", "C"), class = "factor"),
V1 = c(2L, 9L, 1L, 4L, 6L, 5L), V2 = c(4L, 3L, 2L, 6L, 8L,
5L), V3 = c(9L, 7L, 4L, 7L, 1L, 7L), V4 = c(8L, 5L, 8L, 2L,
1L, 3L), V5 = c(3L, 5L, 6L, 3L, 2L, 3L)),
class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))
Here is one way.
library(tidyverse)
dat2 <- dat %>%
mutate(Sum = rowSums(select(dat, starts_with("V")))) %>%
group_by(ID) %>%
filter(Sum == max(Sum)) %>%
select(-Sum) %>%
ungroup()
dat2
# # A tibble: 2 x 7
# ID Trip V1 V2 V3 V4 V5
# <int> <fct> <int> <int> <int> <int> <int>
# 1 4 B 9 3 7 5 5
# 2 12 C 5 5 7 3 3
Here is another one. This method makes sure only one row is preserved even if multiple rows have a row sum equal to the maximum.
dat3 <- dat %>%
mutate(Sum = rowSums(select(dat, starts_with("V")))) %>%
arrange(ID, desc(Sum)) %>%
group_by(ID) %>%
slice(1) %>%
select(-Sum) %>%
ungroup()
dat3
# # A tibble: 2 x 7
# ID Trip V1 V2 V3 V4 V5
# <int> <fct> <int> <int> <int> <int> <int>
# 1 4 B 9 3 7 5 5
# 2 12 C 5 5 7 3 3
DATA
dat <- read.table(text = " ID Trip V1 V2 V3 V4 V5
1 4 A 2 4 9 8 3
2 4 B 9 3 7 5 5
3 4 C 1 2 4 8 6
4 12 A 4 6 7 2 3
5 12 B 6 8 1 1 2
6 12 C 5 5 7 3 3 ",
header = TRUE)
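A base R sketch of the one-row-per-ID variant, assuming it is acceptable to keep the first row when several tie: which.max returns the position of the first maximum, so it picks exactly one row per group. The names sums, keep and out are illustrative.

```r
dat <- data.frame(ID = c(4, 4, 4, 12, 12, 12),
                  Trip = c("A", "B", "C", "A", "B", "C"),
                  V1 = c(2, 9, 1, 4, 6, 5), V2 = c(4, 3, 2, 6, 8, 5),
                  V3 = c(9, 7, 4, 7, 1, 7), V4 = c(8, 5, 8, 2, 1, 3),
                  V5 = c(3, 5, 6, 3, 2, 3))

# Row sums over the V columns only
sums <- rowSums(dat[startsWith(names(dat), "V")])

# For each ID, the index of the (first) row with the largest sum
keep <- tapply(seq_len(nrow(dat)), dat$ID,
               function(i) i[which.max(sums[i])])

out <- dat[as.vector(keep), ]
out
```

tapply orders its result by ID, so the output rows come back sorted by ID rather than in their original order.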
I am working with a gigantic person-period file, and I thought that
a good way to deal with a large dataset would be to use sampling and re-sampling techniques.
My person-period file looks like this
id code time
1 1 a 1
2 1 a 2
3 1 a 3
4 2 b 1
5 2 c 2
6 2 b 3
7 3 c 1
8 3 c 2
9 3 c 3
10 4 c 1
11 4 a 2
12 4 c 3
13 5 a 1
14 5 c 2
15 5 a 3
I have actually two distinct issues.
The first issue is that I am having trouble in simply sampling a person-period file.
For example, I would like to sample 2 id-sequences such as :
id code time
1 a 1
1 a 2
1 a 3
2 b 1
2 c 2
2 b 3
The following line of code is working for sampling a person-period file
dt[which(dt$id %in% sample(dt$id, 2)), ]
However, I would like to use a dplyr solution because I am interested in resampling and in particular I would like to use replicate.
I am interested in doing something like replicate(100, sample_n(dt, 2), simplify = FALSE)
I am struggling with the dplyr solution because I am not sure what should be the grouping variable.
library(dplyr)
dt %>% group_by(id) %>% sample_n(1)
gives me an incorrect result because it does not keep the full sequence of each id.
Any clue how I could both sample and re-sample a person-period file?
data
dt = structure(list(id = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L,
3L, 4L, 4L, 4L, 5L, 5L, 5L), .Label = c("1", "2", "3", "4", "5"
), class = "factor"), code = structure(c(1L, 1L, 1L, 2L, 3L,
2L, 3L, 3L, 3L, 3L, 1L, 3L, 1L, 3L, 1L), .Label = c("a", "b",
"c"), class = "factor"), time = structure(c(1L, 2L, 3L, 1L, 2L,
3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), .Label = c("1", "2",
"3"), class = "factor")), .Names = c("id", "code", "time"), row.names = c(NA,
-15L), class = "data.frame")
I think the idiomatic way would probably look like
set.seed(1)
samp = dt %>% select(id) %>% distinct %>% sample_n(2)  # dt is the data frame from the question
left_join(samp, dt)
id code time
1 2 b 1
2 2 c 2
3 2 b 3
4 5 a 1
5 5 c 2
6 5 a 3
This extends straightforwardly to more grouping variables and fancier sampling rules.
If you need to do this many times...
nrep = 100
ng = 2
samps = dt %>% select(id) %>% distinct %>%
slice(rep(1:n(), nrep)) %>% mutate(r = rep(1:nrep, each = n()/nrep)) %>%
group_by(r) %>% sample_n(ng)
repdat = left_join(samps, dt)
# then do stuff with it:
repdat %>% group_by(r) %>% do_stuff
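Since the question mentions replicate, a base R sketch of repeated sampling is shown below. It assumes the ids should be drawn from the distinct values, because sampling from the id column itself, which repeats each id once per period, can return the same id twice. The data frame is rebuilt with plain vectors so the block is self-contained, and samples is an illustrative name.

```r
dt <- data.frame(id = rep(1:5, each = 3),
                 code = c("a", "a", "a", "b", "c", "b", "c", "c", "c",
                          "c", "a", "c", "a", "c", "a"),
                 time = rep(1:3, times = 5))

set.seed(1)
# 100 samples, each holding the full sequences of 2 distinct ids
samples <- replicate(100,
                     dt[dt$id %in% sample(unique(dt$id), 2), ],
                     simplify = FALSE)

length(samples)      # one data frame per replicate
nrow(samples[[1]])   # 2 ids x 3 periods
```

Each element of the list is an ordinary data frame, so it can be passed to lapply or sapply for whatever statistic is being resampled.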
I imagine you are doing some simulations and may want to do the subsetting many times. You probably also want to try this data.table method and utilize the fast binary search feature on the key column:
library(data.table)
setDT(dt)
setkey(dt, id)
replicate(2, dt[list(sample(unique(id), 2))], simplify = FALSE)
#[[1]]
# id code time
#1: 3 c 1
#2: 3 c 2
#3: 3 c 3
#4: 5 a 1
#5: 5 c 2
#6: 5 a 3
#[[2]]
# id code time
#1: 3 c 1
#2: 3 c 2
#3: 3 c 3
#4: 4 c 1
#5: 4 a 2
#6: 4 c 3
We can use filter with sample
dt %>%
filter(id %in% sample(unique(id),2, replace = FALSE))
NOTE: The OP asked for a dplyr method, and this solution does use dplyr.
If we need to replicate this, one option would be to use map from purrr
library(purrr)
dt %>%
distinct(id) %>%
replicate(2, .) %>%
map(~sample(., 2, replace=FALSE)) %>%
map(~filter(dt, id %in% .))
#$id
# id code time
#1 1 a 1
#2 1 a 2
#3 1 a 3
#4 4 c 1
#5 4 a 2
#6 4 c 3
#$id
# id code time
#1 4 c 1
#2 4 a 2
#3 4 c 3
#4 5 a 1
#5 5 c 2
#6 5 a 3
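If the resampling is meant to be with replacement, as in a bootstrap, note that filtering with %in% cannot return the same id twice. A join-style approach can: the sketch below, in base R, merges a table of drawn ids back onto the data so a repeated id contributes its whole sequence once per draw. Sampling with replacement is an assumption here, not something the question states, and draws and boot are illustrative names.

```r
dt <- data.frame(id = rep(1:5, each = 3),
                 code = c("a", "a", "a", "b", "c", "b", "c", "c", "c",
                          "c", "a", "c", "a", "c", "a"),
                 time = rep(1:3, times = 5))

set.seed(1)
# Draw 3 ids with replacement; 'draw' labels each draw separately
draws <- data.frame(draw = 1:3,
                    id = sample(unique(dt$id), 3, replace = TRUE))

# Every draw brings along all periods of its id, duplicates included
boot <- merge(draws, dt, by = "id")
boot[order(boot$draw, boot$time), ]
```

Grouping later analysis by draw rather than by id keeps repeated ids apart.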
I've tried searching for an answer for this, but most data.frame/matrix transpositions aren't as complicated as what I am trying to accomplish. Basically, I have a data.frame which looks like
F M A
2008_b 1 5 6
2008_r 3 3 6
2008_a 4 1 5
2009_b 1 1 2
2009_r 5 4 9
2009_a 2 2 4
I'm trying to transpose it and rename the column and row names as such:
F_b M_b A_b F_r M_r A_r F_a M_a A_a
2008 1 5 6 3 3 6 4 1 5
2009 1 1 2 5 4 9 2 2 4
Essentially, every three rows are being collapsed into a single row. I assume this can be done with some clever plyr or reshape2 commands, but I'm at a total loss as to how to accomplish it.
You could try
library(dplyr)
library(tidyr)
lvl <- c(outer(colnames(df), unique(gsub(".*_", "", rownames(df))),
FUN=paste, sep="_"))
res <- cbind(Var1=row.names(df), df) %>%
gather(Var2, value, -Var1) %>%
separate(Var1, c('Var11', 'Var12')) %>%
unite(VarN, Var2, Var12) %>%
mutate(VarN=factor(VarN, levels=lvl)) %>%
spread(VarN, value)
row.names(res) <- res[,1]
res1 <- res[,-1]
res1
# F_b M_b A_b F_r M_r A_r F_a M_a A_a
#2008 1 5 6 3 3 6 4 1 5
#2009 1 1 2 5 4 9 2 2 4
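A package-free sketch of the same reshape is below. It assumes every row name has the form year_suffix and that each suffix covers the same years in the same order. The row names are split, one block is built per suffix, and the blocks are bound side by side. The names parts, blocks and wide are illustrative.

```r
df <- data.frame(F = c(1, 3, 4, 1, 5, 2),
                 M = c(5, 3, 1, 1, 4, 2),
                 A = c(6, 6, 5, 2, 9, 4),
                 row.names = c("2008_b", "2008_r", "2008_a",
                               "2009_b", "2009_r", "2009_a"))

# Split "2008_b" into year and suffix
parts  <- do.call(rbind, strsplit(rownames(df), "_"))
year   <- parts[, 1]
suffix <- parts[, 2]

# One block per suffix, with columns renamed to F_b, M_b, A_b, ...
blocks <- lapply(unique(suffix), function(s) {
  block <- df[suffix == s, , drop = FALSE]
  rownames(block) <- year[suffix == s]
  names(block) <- paste(names(block), s, sep = "_")
  block
})

wide <- do.call(cbind, blocks)
wide
```

Because the blocks are bound by position, a year missing for one suffix would misalign the rows, so this only suits complete, regularly ordered data like the example.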
data
df <- structure(list(F = c(1L, 3L, 4L, 1L, 5L, 2L), M = c(5L, 3L, 1L,
1L, 4L, 2L), A = c(6L, 6L, 5L, 2L, 9L, 4L)), .Names = c("F",
"M", "A"), class = "data.frame", row.names = c("2008_b", "2008_r",
"2008_a", "2009_b", "2009_r", "2009_a"))