I have a large data set with X and Y points. I want to animate it, so I want to remove the points that are fixed locations. In other words, I would like to remove rows where the same X and Y are repeated more than n times.
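For reference, here is a minimal dummy data set that the code in this question and its answers can run against (the Lat/Long values are hypothetical, not from the original data):
set.seed(42)
# One fixed location repeated 15 times, plus 20 moving points
data <- data.frame(
  Lat  = c(rep(10.5, 15), runif(20, 10, 11)),
  Long = c(rep(20.3, 15), runif(20, 20, 21))
)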
So far I have this; is there a more elegant solution? Thanks!
library(dplyr)

uniques <- unique(data[c("Lat", "Long")])
uniques$values <- row.names(uniques)
uniques2 <- inner_join(data, uniques, by = c("Lat", "Long"))
reps <- data.frame(unclass(rle(uniques2$values)))
delete <- as.character(reps$values[reps$lengths > 10])
data2 <- uniques2[!uniques2$values %in% delete, ]
The tidyverse way would be:
data2 <- data %>%
  group_by(Long, Lat) %>%
  filter(n() <= 10) %>%
  ungroup()
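A small caveat if you want the cut-off as a variable: a variable named n is easy to confuse with dplyr's n(), and it would be masked by any column called n, so a distinct name is safer. A minimal sketch:
max_reps <- 10  # cut-off threshold, deliberately not named `n`
data2 <- data %>%
  group_by(Long, Lat) %>%
  filter(n() <= max_reps) %>%
  ungroup()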
Assuming you want to keep the first of the rows where x and y start repeating, you could try the following approach with the dplyr package:
library(dplyr)

# Example data: two runs of five identical points, with unique points in between
df <- data.frame(
  x = c(rep(1, 5), 2:6, rep(7, 5)),
  y = c(rep(9, 5), 2:6, rep(8, 5))
)

# Cut-off value
n <- 3

# Flag rows that repeat the previous (x, y), count the length of each run of
# repeats, and drop the repeats belonging to runs longer than n
new_df <- df %>%
  mutate(same_as_prev = coalesce(x == lag(x) & y == lag(y), FALSE)) %>%
  group_by(x, y, same_as_prev) %>%
  mutate(consec_count = n()) %>%
  ungroup() %>%
  filter(!same_as_prev | consec_count <= n) %>%
  select(-same_as_prev, -consec_count)
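With the example data and n = 3, this keeps rows 1 and 6 through 11: the first row of each run of five, plus all the non-repeating rows, while the trailing repeats of both runs are dropped.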
Using data.table, I would try this one-line solution:
library(data.table)
data <- as.data.table(data)[, count := .N, by = .(Lat, Long)][count <= n][, count := NULL]
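An alternative data.table idiom, if you would rather not create and then delete a helper column, is to return each group's subset only when the group is small enough:
# Keep only (Lat, Long) groups appearing at most n times, no helper column needed
data2 <- as.data.table(data)[, if (.N <= n) .SD, by = .(Lat, Long)]
Note that this moves Lat and Long to the front of the result.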
Best!
I have a dataframe with two columns (one contains a timestamp and the other a specific value).
The gaps between the timestamps are not equispaced but are approximately the same length (1000 ms +/- 15%).
Every now and then a timestamp + value is missing, resulting in a timestamp difference of approximately twice the previous difference.
Is there a way to find those gaps and just add an NA row, to later fill it with imputed values?
Thanks!
How about this:
library(tidyverse)

# Dummy data: 20 timestamps roughly 1000 ms apart, with rows 3, 7 and 9 removed
df <- tibble(id = as.character(1:20),
             t = seq(1000, 20000, by = 1000) + runif(20, -15, 15),
             x = rnorm(20)) %>%
  filter(!(id %in% c(3, 7, 9)))
df

# Flag gaps roughly twice the normal spacing and insert an NA row at the midpoint.
# (Since each timestamp jitters by +/- 15 ms, normal gaps can range over 1000 +/- 30 ms,
# so the threshold sits well above that but below a doubled gap.)
df <- df %>% mutate(delta_t = t - lag(t),
                    mean_t = (t + lag(t)) / 2) %>%
  filter(delta_t > 1500) %>%
  mutate(t = mean_t, x = NA) %>%
  select(id, t, x) %>%
  bind_rows(df) %>%
  arrange(t)
df
The id column is there purely so you can easily see that I've taken out three values when constructing the dummy data.
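If you later want to fill the inserted NAs, one possible follow-up is linear interpolation with zoo::na.approx (a sketch, assuming linear imputation of x against the timestamps is acceptable):
library(zoo)
# Interpolate the NA values of x linearly, using the timestamps as positions
df <- df %>% mutate(x = na.approx(x, x = t, na.rm = FALSE))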
My question is about performing a calculation between each pair of groups in a data.frame; I'd like it to be more vectorized.
I have a data.frame that consists of the following columns: Location, Sample, Var1, and Var2. I'd like to find the closest match for each Sample for each pair of Locations, for both Var1 and Var2.
I can accomplish this for one pair of locations as such:
library(dplyr)
library(tidyr)

df0 <- data.frame(Location = rep(c("A", "B", "C"), each = 30),
                  Sample = rep(1:30, times = 3),
                  Var1 = sample(1:25, 90, replace = TRUE),
                  Var2 = sample(1:25, 90, replace = TRUE))
df00 <- data.frame(Location = rep(c("A", "B", "C"), each = 30),
                   Sample = rep(31:60, times = 3),
                   Var1 = sample(1:100, 90, replace = TRUE),
                   Var2 = sample(1:100, 90, replace = TRUE))
df000 <- rbind(df0, df00)
df <- sample_n(df000, 100)  # data
dfl <- df %>% gather(VAR, value, 3:4)
df1 <- dfl %>% filter(Location == "A")
df2 <- dfl %>% filter(Location == "B")
df3 <- merge(df1, df2, by = c("VAR"), all.x = TRUE, allow.cartesian=TRUE)
df3 <- df3 %>% mutate(DIFF = abs(value.x-value.y))
result <- df3 %>% group_by(VAR, Sample.x) %>% top_n(-1, DIFF)
I tried other possibilities, such as dplyr::spread, but could not avoid the "Error: Duplicate identifiers for rows" error or columns half-filled with NA.
Is there a cleaner and more automated way to do this for each possible pair of groups? I'd like to avoid the manual subset-and-merge routine for each pair.
One option would be to create the pairwise combinations of 'Location' with combn and then do the other steps as in the OP's code:
library(tidyverse)

df %>%
  # get the unique elements of Location
  distinct(Location) %>%
  # pull the column as a vector
  pull %>%
  # it is a factor, so convert it to character
  as.character %>%
  # get the pairwise combinations in a list
  combn(m = 2, simplify = FALSE) %>%
  # loop through the list with map and do the full_join
  # with the long-format data dfl
  map(~ full_join(dfl %>%
                    filter(Location == first(.x)),
                  dfl %>%
                    filter(Location == last(.x)), by = "VAR") %>%
        # create a column of absolute differences
        mutate(DIFF = abs(value.x - value.y)) %>%
        # grouped by VAR and Sample.x
        group_by(VAR, Sample.x) %>%
        # apply top_n with DIFF as the weight
        top_n(-1, DIFF))
Also, as the OP mentioned picking the pair up automatically instead of doing two filter calls (the expected output is not entirely clear, though):
df %>%
  distinct(Location) %>%
  pull %>%
  as.character %>%
  combn(m = 2, simplify = FALSE) %>%
  map(~ dfl %>%
        # change here, i.e. filter both Locations at once
        filter(Location %in% .x) %>%
        # spread it to wide format
        spread(Location, value, fill = 0) %>%
        # create the DIFF column by taking the difference
        mutate(DIFF = abs(!! rlang::sym(first(.x)) -
                            !! rlang::sym(last(.x)))) %>%
        group_by(VAR, Sample) %>%
        top_n(-1, DIFF))
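If a single data frame is preferable to a list, either pipeline's output can be named and bound together; a sketch, assuming the list of results is stored in res_list and the combn() output in loc_pairs (both hypothetical names):
# res_list: list of per-pair results; loc_pairs: list from combn(..., simplify = FALSE)
res_list %>%
  set_names(map_chr(loc_pairs, paste, collapse = "_")) %>%
  bind_rows(.id = "pair")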
I'm Brazilian, sorry about my English!
I would like to know if there is a function implemented in some R package that keeps the first n rows and groups the remaining ones into an "Other" row, summarising the column.
Here is an example of what I want:
library(tidyverse)
library(plotly)
library(scales)

data("lakers")

x <- bind_rows(
  lakers %>% count(player) %>% arrange(-n) %>% head(10),
  lakers %>% count(player) %>% arrange(-n) %>% slice(11:n()) %>%
    summarise(player = "Others", n = sum(n))) %>%
  filter(player != "") %>%
  mutate(player = factor(player, levels = rev(.$player)))

ggplot(x, aes(x = player, y = n)) +
  geom_col(fill = "DodgerBlue1", col = "DodgerBlue3") +
  coord_flip() +
  geom_text(aes(y = n, label = comma(n)), hjust = -.2) +
  scale_y_continuous(limits = c(0, max(x$n * 1.1))) +
  theme_minimal()
I need to create a ggplot like that, but I have a big query using dplyr and I don't want to repeat the query every time.
I would like some function like:
head.other(x, rows = 20, fun = sum, name = "Others")
Here is a function that I think will give you what you need:
library(tibble)
library(dplyr)

df <- data.frame(col1 = rnorm(10), col2 = rnorm(10))  # your data frame
n <- 6  # top n rows to keep

myfun <- function(df, n) {
  # separate the rows to keep from those to aggregate
  preserve.df <- df[1:n, ]
  summarise.df <- df[(n + 1):nrow(df), ]
  # create the new df in the required format
  new.df <- bind_rows(preserve.df, summarise_all(summarise.df, sum))
  # add a column to identify the rows and return
  rownames(new.df) <- c(paste0("r", 1:n), "Other")
  rownames_to_column(new.df)
}

myfun(df, 6)
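As an aside, for the lump-the-rest use case in the question, forcats ships fct_lump(), which collapses the least frequent factor levels into one; a minimal sketch with the lakers data:
library(forcats)
library(dplyr)
# Collapse everyone outside the 10 most frequent players into "Others", then count
lakers %>%
  filter(player != "") %>%
  mutate(player = fct_lump(player, n = 10, other_level = "Others")) %>%
  count(player, sort = TRUE)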
I have the following data frame:
library(tidyverse)
set.seed(1234)
df <- data.frame(
x = seq(1, 100, 1),
y = rnorm(100)
)
I apply a smoothing spline using different numbers of knots:
nknots <- seq(4, 15, 1)
output <- map(nknots, ~ smooth.spline(x = df$x, y = df$y, nknots = .x))
What I need to do now is to apply the same function using 2-point and 3-point averages:
df_2 <- df %>%
group_by(., x = round(.$x/2)*2) %>%
summarise_all(funs(mean))
df_3 <- df %>%
group_by(., x = round(.$x/3)*3) %>%
summarise_all(funs(mean))
In summary, I need to apply the function I used in output with the following data frames:
df
df_2
df_3
Of course, this is a minimal example, so I am looking for an efficient way of doing it, preferably with the purrr package.
Using lapply() and the zoo library to calculate the moving average in a simpler and more elegant manner:
library(zoo)
library(purrr)

lapply(1:3, function(roll) {
  dftemp <- as.data.frame(rollmean(df, roll))
  map(nknots, ~ smooth.spline(x = dftemp$x, y = dftemp$y, nknots = .x))
})
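One caveat: rollmean() computes overlapping moving averages, whereas the question's df_2 and df_3 are non-overlapping 2- and 3-point block means. If the non-overlapping behaviour is what's wanted, a possible sketch is rollapply() with by = roll:
library(zoo)
library(purrr)

# Step the window forward by its own width to get non-overlapping block means
lapply(1:3, function(roll) {
  dftemp <- as.data.frame(rollapply(as.matrix(df), width = roll, by = roll, FUN = mean))
  map(nknots, ~ smooth.spline(x = dftemp$x, y = dftemp$y, nknots = .x))
})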
Here's one possible solution:
library(tidyverse)
set.seed(1234)
df <- data.frame(x = seq(1, 100, 1),
y = rnorm(100))
# function to get v-point averages
GetAverages = function(v) {
  df %>%
    group_by(., x = round(.$x / v) * v) %>%
    summarise_all(funs(mean))
}

# specify number of knots
nknots <- seq(4, 15, 1)

dt_res = tibble(v = 1:3) %>%                    # specify v-point averages
  mutate(d = map(v, GetAverages)) %>%           # get the data for each v-point average
  crossing(., data.frame(nknots = nknots)) %>%  # combine each dataset with each knot value
  mutate(res = map2(d, nknots, ~ smooth.spline(x = .x$x, y = .x$y, nknots = .y)))  # apply smooth.spline
You can use dt_res$res[dt_res$v == 1] to see all the results for your original dataset, dt_res$res[dt_res$v == 2] to see the results for your 2-point averages, etc.
This is a sample of my dataset.
library(tidyr)
library(dplyr)
resource <- c("good","good","bad","bad","good","good","bad","bad","good","good","bad","bad","good","good","bad","bad")
fertilizer <- c("none", "nitrogen","none","nitrogen","none", "nitrogen","none","nitrogen","none", "nitrogen","none","nitrogen","none", "nitrogen","none","nitrogen")
t1 <- sample(1:20, 16)
t2 <- sample(1:20, 16)
t3 <- sample(1:20, 16)
t4 <- sample(1:20, 16)
t5 <- sample(1:20, 16)
t6 <- sample(10:100, 16)
t7 <- sample(10:100, 16)
t8 <- sample(10:100, 16)
t9 <- sample(10:100, 16)
t10 <- sample(10:100, 16)
replicates <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16)
data <- data.frame(resource, fertilizer,replicates, t1,t2,t3,t4,t5,t6,t7,t8,t9,t10)
data$resource <- as.factor(data$resource)
data$fertilizer <- as.factor(data$fertilizer)
Here t1, t2, ..., t10 are time points. I need to average adjacent time points (non-overlapping), e.g. (t1, t2), (t3, t4), ..., and the new column headings need to be the average of the two times, so that the columns read t1.5, t3.5, etc.
Thus in the end I need to have only five columns, reading t1.5, t3.5, t5.5, t7.5, t9.5.
Is there any way this can be achieved using a dplyr function, or any other function in R?
Edited for OP's modified request:
If you put everything in a tidy format, you can take advantage of the lag/lead functions to average adjacent rows.
library(stringr)
library(forcats)

data %>%
  gather(key = time, value = value, -replicates, -resource, -fertilizer) %>%
  mutate(index = as.integer(str_extract(time, "[0-9]+"))) %>%
  arrange(replicates, index) %>%
  group_by(resource, fertilizer, replicates) %>%
  mutate(mid_value = (value + lead(value)) / 2,
         mid_index = (index + lead(index)) / 2,
         mid_time = str_c("t", mid_index)) %>%
  ungroup() %>%
  filter(!is.na(mid_value), index %% 2 == 1) %>%
  select(replicates, resource, fertilizer, matches("mid")) %>%
  rename(value = mid_value, time = mid_time, index = mid_index) %>%
  arrange(index) %>%
  mutate(time = as_factor(time)) %>%
  select(-index) %>%
  spread(key = time, value = value) %>%
  arrange(replicates)
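The result has one row per replicate, with the identifier columns followed by the five averaged columns t1.5, t3.5, t5.5, t7.5 and t9.5.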
Solution using only base R: you need to somehow find the columns you want to calculate the average for. You can do this by searching the column names for the t + "some number" pattern. After that, create a sequence of sequences corresponding to the column numbers of data you want to calculate the mean for.
relevant_cols <- grep("t[0-9]{1,2}", names(data))
start <- min(relevant_cols)
end <- max(relevant_cols)
cols <- split(start:end, rep(1:5, each = 2))
If you look at cols, you'll see that it is a list of five elements, each resembling a combination of columns you want to average. This smells like a use case for sapply():
newdf <- sapply(cols, function(x) rowMeans(data[x]))
colnames(newdf) <- paste0("t", seq(1, diff(range(relevant_cols)), 2) + 0.5)
Edit: I seem to have misunderstood what you want to keep and what not. You can just cbind() (parts of) the old data to newdf:
cbind(data, newdf)
cbind(data[, -relevant_cols], newdf)  # This is what you want, I think..
Here ya go:
transmute(data,
          t1.5 = (t1 + t2) / 2,
          t3.5 = (t3 + t4) / 2,
          t5.5 = (t5 + t6) / 2,
          t7.5 = (t7 + t8) / 2,
          t9.5 = (t9 + t10) / 2)