I have data structured like this:
set.seed(2)
require(tidyverse)
data <- data.frame("TIME" = c(sample(seq(1:20), 20, replace = F), seq(21:30)),
"ID" = c(rep("A", 10), rep("B", 10), rep("C", 10)),
"LOC" = c(sample(c("X", "Y"), 20, replace = T), c("X", rep("Y", 9))))
and I'm trying to use dplyr to create a variable indicating whether a given ID has a permanent change in LOC over time (PERMANENT = 1) or not (PERMANENT = 0). I can't use first() or last() because they ignore the in-between values. If, for instance, an ID goes from X to Y and back to X again, as A and B do, the indicator should be 0 for every row of A and B in the data. C, however, starts at X and stays at Y in all other rows, so its indicator should be 1.
I tried to use indexing in a mutate function, but something isn't working.
data %>%
arrange(ID, TIME) %>%
group_by(ID)%>%
mutate(LOC = as.character(LOC),
PERMANENT = ifelse(last(LOC) != "X" & any(LOC[2:length(ID) -1]) != "X"), 1, 0)
Like I said, the output should indicate that C moved permanently, while A and B bounced around in the data set.
The expected output is what you get if you run the following code:
data$PERMANENT<-ifelse(data$ID%in%c("A","B"),0,1)
Here's a go at it with dplyr. I am assuming an ID is permanent if it hasn't changed since its second observed time period.
set.seed(2)
library(dplyr)
data <- data.frame(
  "TIME" = c(sample(seq(1:20), 20, replace = F), seq(21:30)),
  "ID" = c(rep("A", 10), rep("B", 10), rep("C", 10)),
  "LOC" = c(sample(c("X", "Y"), 20, replace = T), c("X", rep("Y", 9)))
)
data %>% arrange(ID, TIME) %>%
group_by(ID) %>%
mutate(timeObs = row_number(), SecondLoc = LOC[timeObs == 2], Change = LOC != SecondLoc) %>%
filter(timeObs > 1) %>%
summarize(Permanent = sum(Change) == 0 ) %>%
right_join(data, by = 'ID')
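Another way to look at it, as a rough sketch rather than a drop-in answer: count how many times LOC changes within each ID once the rows are ordered by TIME, and call the move permanent when there is exactly one change. The toy data frame and the "exactly one change" rule are my own assumptions for illustration; adjust the rule if an ID that never moves should be treated differently.

```r
library(dplyr)

# Hypothetical toy data: A bounces between X and Y, C moves once and stays.
toy <- data.frame(
  ID   = rep(c("A", "C"), each = 4),
  TIME = rep(1:4, times = 2),
  LOC  = c("X", "Y", "X", "Y",
           "X", "Y", "Y", "Y")
)

result <- toy %>%
  arrange(ID, TIME) %>%
  group_by(ID) %>%
  # Compare each LOC with the previous one; the first row compares to NA,
  # which na.rm = TRUE drops from the count.
  mutate(n_changes = sum(LOC != dplyr::lag(LOC), na.rm = TRUE),
         PERMANENT = as.integer(n_changes == 1)) %>%
  ungroup()

print(result)
```

Because the count is computed inside mutate(), every row of an ID gets the same PERMANENT value, so no join back to the original data is needed.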
I want to use dplyr to calculate a field using its previous value. A for loop would do the job, but I want to calculate by different groups of st. I understand mutate (with lag or ave) can't use an uninitialized field.
df <- data.frame(st = rep(c('a','b', 'c'), each = 5),
z = rep(c(10,14,12), each = 5),
day = rep(1:5, 3),
GAI = rep(0:4, 3),
surfT = sample(1:15))
df %>%
group_by(st) %>%
mutate(soilT = lag(soilT, order_by = day) + (surfT - lag(soilT,
order_by = day))*0.24*exp(-z*0.017)*exp(-0.15*GAI))
or
df %>%
group_by(st) %>%
mutate(soilT = ave(soilT, c(st), FUN=function(x) c(0, soilT + (surfT - soilT)))
*0.24*exp(-z*0.017)*exp(-0.15*GAI))
How can this simple for loop be calculated in dplyr, by group of st?
df$soilT <- 0
for (i in 2:nrow(df)) {
  # the trailing * keeps the expression going onto the next line
  df$soilT[i] <- df$soilT[i-1] + (df$surfT[i] - df$soilT[i-1]) *
    0.24 * exp(-df$z[i] * 0.017) * exp(-0.15 * df$GAI[i])
}
We can use accumulate from purrr to get the output of previous row as an input to current row.
library(dplyr)
result <- df %>%
group_by(st) %>%
mutate(soilT = purrr::accumulate(2:n(),
~.x + (surfT[.y] - .x)*0.24*exp(-z[.y]*0.017)*exp(-0.15*GAI[.y]),
.init = 0))
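To see what accumulate is doing here, it can help to check it against an explicit loop on a single small vector. This is only a sketch: the surfT values and the constant k are made up, and k stands in for the longer 0.24*exp(...)*exp(...) factor.

```r
library(purrr)

surfT <- c(5, 8, 6, 9)  # hypothetical surface temperatures for one group
k <- 0.5                # hypothetical stand-in for the damping factor

# .x is the previous result (starting from .init), .y is the current index.
soil_acc <- accumulate(2:length(surfT),
                       ~ .x + (surfT[.y] - .x) * k,
                       .init = 0)

# The same recurrence written as a plain for loop.
soil_loop <- numeric(length(surfT))
for (i in 2:length(surfT)) {
  soil_loop[i] <- soil_loop[i - 1] + (surfT[i] - soil_loop[i - 1]) * k
}

print(soil_acc)
print(soil_loop)
```

Both vectors come out as 0, 4, 5, 7. Inside group_by(st), the same call runs once per group, so each group starts again from .init.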
I have a set of audio clips from different sites and files that I have manually extracted and a set that have been extracted automatically. I want to identify any clips in the automatically extracted set at the same site and file that overlap at all with those in the manually extracted set.
For example:
library(dplyr)
library(tidyr)  # drop_na()
library(purrr)  # set_names()
set.seed(123)
group <- as.factor(round(runif(100, 1, 10),0))
file <- sample(LETTERS, 100, TRUE)
start <- round(runif(100, 0, 100), 2)
end <- round(start + (runif(100, 0, 1)), 2)
auto <- bind_cols(group, file, start, end) %>%
set_names(., nm = c("group", "file", "start", "end"))
man <- bind_rows(sample_n(auto, 10) %>%
mutate(start = round(start - runif(10, 0, 1), 2),
end = round(end - runif(10,0, 0.5), 2)),
sample_n(auto, 10) %>%
mutate(start = round(start + runif(10, 0, 1), 2),
end = round(start + runif(10,0, 0.5), 2)))
I have been trying to use dplyr::between like this:
res <- auto %>%
left_join(., man, by = c("group", "file")) %>%
drop_na() %>%
mutate(pos = (between(start.x, start.y, end.y) | between(end.x, start.y, end.y)))
But it is just returning FALSE. The code doesn't throw an error, so I am guessing there is an issue in the way I have written it.
If you split out the or in the last mutate to check if either between function is the issue, like this:
res <- auto %>%
left_join(., man, by = c("group", "file")) %>%
drop_na() %>%
mutate(x = between(start.x, start.y, end.y),
y = between(end.x, start.y, end.y),
pos = x | y == "TRUE")
you can see that both return FALSE.
It seems like it should be a relatively simple thing, but I am struggling to figure it out.
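For what it's worth, one common way to test whether two intervals overlap is with plain comparisons instead of between(): two clips overlap when each one starts no later than the other ends. Here is a minimal sketch on hand-made data; the toy tables and the .auto/.man suffixes are assumptions for illustration, not the data from the question.

```r
library(dplyr)

# Hypothetical automatically extracted clips.
auto_toy <- tibble(group = c("1", "1", "2"),
                   file  = c("A", "A", "B"),
                   start = c(0, 10, 5),
                   end   = c(2, 12, 6))

# Hypothetical manually extracted clips.
man_toy <- tibble(group = c("1", "2"),
                  file  = c("A", "B"),
                  start = c(1.5, 7),
                  end   = c(3, 8))

res_toy <- auto_toy %>%
  # Pair each automatic clip with the manual clips from the same group and file.
  inner_join(man_toy, by = c("group", "file"), suffix = c(".auto", ".man")) %>%
  # Overlap: each interval starts before (or when) the other one ends.
  mutate(overlap = start.auto <= end.man & end.auto >= start.man)

print(res_toy)
```

The comparison operators work element by element, so each row is checked against its own pair of bounds.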
Here is a minimal example.
df <- data.frame("Treatment" = c(rep("A", 2), rep("B", 2)), "Price" = 1:4, "Cost" = 2:5)
I want to summarize the data by treatments for all the variables I have, and put them together, so I define a function to do this for each variable first, and then rbind them later on.
SummarizeFn <- function(x,y,z) {
x %>% group_by(Treatment) %>%
summarize(n = n(), Mean = mean(y), SD = sd(y)) %>%
cbind("Var" = rep(y, 3)) # add a column to show which variable those statistics belong to.
}
SumPrice <- SummarizeFn(df, df$Price, "Price")
However, R tells me that object "Price" is not found. How can I solve this problem?
Also, how can I pass a character string so that the Var column shows that the mean and SD are those of Price?
Price isn't a variable on its own; it only exists as a column inside df, so you need to pass df$Price, as in SummarizeFn(df, df$Price, "Price").
SummarizeFn <- function(x, y, z) {
  df1 <- x %>%
    group_by(Treatment) %>%
    summarize(n = n(), Mean = mean(y), SD = sd(y))
  df1 <- df1 %>% mutate(Var = z)
  return(df1)
}
SumPrice <- SummarizeFn(df, df$Price,"Price")
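One thing to watch: passing df$Price hands the function the whole column, so mean(y) and sd(y) are computed over all rows rather than within each Treatment. A possible variant, sketched under the assumption that passing the column name as a string is acceptable, uses the .data pronoun so the statistics are evaluated per group. The names summarize_var, df_toy, and all_stats are made up for this example.

```r
library(dplyr)

# Same shape as the data in the question.
df_toy <- data.frame(Treatment = c(rep("A", 2), rep("B", 2)),
                     Price = 1:4,
                     Cost = 2:5)

# Hypothetical helper: `var` is a column name given as a character string.
summarize_var <- function(data, var) {
  data %>%
    group_by(Treatment) %>%
    summarize(n = n(),
              Mean = mean(.data[[var]]),  # looked up within each group
              SD = sd(.data[[var]]),
              .groups = "drop") %>%
    mutate(Var = var)                     # label rows with the variable name
}

# Summarize several variables and stack the results.
all_stats <- bind_rows(lapply(c("Price", "Cost"), summarize_var, data = df_toy))
print(all_stats)
```

With this version the group means for Price are 1.5 and 3.5, and the string doubles as the label, so there is no need for a separate third argument.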
I have a data frame of 96074 obs. of 31 variables.
The first two variables are id and the date; then I have 9 columns of measurements (three different KPIs with three different time properties), followed by various technical and geographical variables.
df <- data.frame(
id = rep(1:3, 3),
time = rep(as.Date('2009-01-01') + 0:2, each = 3),
sum_d_1day_old = rnorm(9, 2, 1),
sum_i_1day_old = rnorm(9, 2, 1),
per_i_d_1day_old = rnorm(9, 0, 1),
sum_d_5days_old = rnorm(9, 0, 1),
sum_i_5days_old = rnorm(9, 0, 1),
per_i_d_5days_old = rnorm(9, 0, 1),
sum_d_15days_old = rnorm(9, 0, 1),
sum_i_15days_old = rnorm(9, 0, 1),
per_i_d_15days_old = rnorm(9, 0, 1)
)
I want to transform from wide to long, in order to do graphs with ggplot using facets for example.
If I had a df with just one variable with its three-time scans I would have no problem in using gather:
plotdf <- df %>%
gather(sum_d, value,
c(sum_d_1day_old, sum_d_5days_old, sum_d_15days_old),
factor_key = TRUE)
But having three different variables trips me up.
I would like to have this output:
plotdf <- data.frame(
id = rep(1:3, 3),
time = rep(as.Date('2009-01-01') + 0:2, each = 3),
sum_d = rep(c("sum_d_1day_old", "sum_d_5days_old", "sum_d_15days_old"), 3),
values_sum_d = rnorm(9, 2, 1),
sum_i = rep(c("sum_i_1day_old", "sum_i_5days_old", "sum_i_15days_old"), 3),
values_sum_i = rnorm(9, 2, 1),
per_i_d = rep(c("per_i_d_1day_old", "per_i_d_5days_old", "per_i_d_15days_old"), 3),
values_per_i_d = rnorm(9, 2, 1)
)
with id, sum_d, sum_i and per_i_d of class factor, time of class Date, and the values of class numeric (I should add that I don't have negative measures in these variables).
What I've tried to do:
plotdf <- gather(df, key, value, sum_d_1day_old:per_i_d_15days_old, factor_key = TRUE)
gathering all of the variables in a single column
plotdf$KPI <- paste(sapply(strsplit(as.character(plotdf$key), "_"), "[[", 1),
sapply(strsplit(as.character(plotdf$key), "_"), "[[", 2), sep = "_")
creating a new column with the name of the KPI, without the time specification
plotdf %>% unite(value2, key, value) %>%
#creating a new variable with the full name of the KPI attaching the value at the end
mutate(i = row_number()) %>% spread(KPI, value2) %>% select(-i)
#spreading
But spread creates rows with NAs.
To replace them, I first used
group_by(id, date) %>%
fill(c(sum_d, sum_i, per_i_d), .direction = "down") %>%
fill(c(sum_d, sum_i, per_i_d), .direction = "up") %>%
But the problem is that there are already some measurements with NAs in the original df in the variable per_i_d (44 in total), so I lose that information.
I thought that I could replace the NAs in the original df with a dummy value and then replace the NAs back, but then I thought that there could be a more efficient solution for all of my problem.
After I replaced the NAs, my idea was to use slice(1) to select only the first row of each couple id/date, then do some manipulation with separate/unite to have the output I desired.
I actually did that, but then I remembered I had those aforementioned NAs in the original df.
library(tidyverse)  # gather(), spread(), str_extract()
df %>%
gather(key,value,-id,-time) %>%
mutate(type = str_extract(key,'[a-z]+_[a-z]'),
age = str_extract(key, '[0-9]+[a-z]+_[a-z]+')) %>%
select(-key) %>%
spread(type,value)
gives
id time age per_i sum_d sum_i
1 1 2009-01-01 15days_old 0.8132301 0.8888928 0.077532040
2 1 2009-01-01 1day_old -2.0993199 2.8817133 3.047894196
3 1 2009-01-01 5days_old -0.4626151 -1.0002926 0.327102000
4 1 2009-01-02 15days_old 0.4089618 -1.6868523 0.866412133
5 1 2009-01-02 1day_old 0.8181313 3.7118065 3.701018419
...
EDIT:
adding non-value columns to the dataframe:
df %>%
gather(key,value,-id,-time) %>%
mutate(type = str_extract(key,'[a-z]+_[a-z]'),
age = str_extract(key, '[0-9]+[a-z]+_[a-z]+'),
info = paste(age,type,sep = "_")) %>%
select(-key) %>%
gather(key,value,-id,-time,-age,-type) %>%
unite(dummy,type,key) %>%
spread(dummy,value)
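If a recent tidyr is available (assuming version 1.0.0 or later), pivot_longer can split the KPI name from the age in one step using the special ".value" name, which avoids the spread step and keeps any NAs that were in the original measurements. This is a sketch on a cut-down, made-up data frame, not the real 31-column one.

```r
library(tidyr)

# Hypothetical cut-down version of the wide data, with one genuine NA.
wide <- data.frame(
  id = 1:2,
  time = as.Date("2009-01-01"),
  sum_d_1day_old = c(1, 2),
  sum_d_5days_old = c(3, 4),
  per_i_d_1day_old = c(0.1, NA),
  per_i_d_5days_old = c(0.3, 0.4)
)

long <- pivot_longer(
  wide,
  cols = -c(id, time),
  # ".value" turns the first regex group into output column names;
  # the second group becomes the `age` column.
  names_to = c(".value", "age"),
  names_pattern = "(sum_d|sum_i|per_i_d)_(.+)"
)

print(long)
```

Each id/time/age combination ends up on one row with one column per KPI, which is a convenient shape for ggplot facets.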
I am using R and the package mongolite to get data from a MongoDB. This results in data consisting of lots of nested lists that cannot be simplified into a data frame by using unlist, rbindlist, nor bind_rows from dplyr (at least I did not manage to do it).
After a lot of trial and error I found a way to do it using the package reshape2 with the function melt and using dplyr and tidyr to get it into the form I want it to be. However, the melting takes a lot of time (up to 15 mins per list, and I have 6 of them).
Do you have any ideas how I can make this faster? (Of course another possible solution would be to query the MongoDB correctly such that it does not result in lists but in something more like my target data frame).
The following code creates a dummy dataset with similar attributes, the target form of the dataset and my solution to get there.
Dummy Data:
dummy_data <- list(
list(actions = list(list(action_type = "link_clicks", value = 30),
list(action_type = "post_likes", value = 3)),
date = '2015-08-04'),
list(actions = list(list(action_type = "link_clicks", value = 10),
list(action_type = "post_likes", value = 2),
list(action_type = "page_engagement", value = 5)),
date = '2015-08-02')
)
Target Form:
final_data = data.frame(c(30, 10), c(3, 2), c(NA, 5), c('2015-08-04', '2015-08-02'))
names(final_data) = c('actions: link_clicks', 'actions: post_likes', 'actions: page_engagement', 'date')
final_data
Temporary solution:
Solution <- reshape2::melt(dummy_data)
Solution <- Solution %>%
select(L1, L2, L3, L4, value) %>%
mutate(L4 = ifelse(is.na(L4), L2, L4)) %>%
spread(key = L4, value = value) %>%
mutate(L2 = ifelse(!is.na(action_type), paste0(L2, ": ", action_type), L2),
value = ifelse(!is.na(value), value, date)) %>%
select(L1, L2, value) %>%
spread(key = L2, value = value) %>%
select(-L1)
If you have any advice on the mongolite query here is the simplest kind of queries I use:
M_DB <- mongolite::mongo(collection = "name", url = "url")
M_DB_List <- M_DB$iterate()$batch(size = 100000)
Thanks a lot
Edit:
Here is a more complex data structure that is closer to my actual problem:
dummy_data_complex <- list(
list(actions = list(list(action_type = "link_clicks", value = 30),
list(action_type = "post_likes", value = 3)),
date = '2015-08-04',
currency = 'EUR'),
list(actions = list(list(action_type = "link_clicks", value = 10),
list(action_type = "post_likes", value = 2),
list(action_type = "page_engagement", value = 5)),
date = '2015-08-02',
demographics = list(gender = "female",
list(actions = list(action_type = "link_clicks", value = 10)))
))
Here is an option with tidyverse
library(tidyverse)
dummy_data %>%
map_df(~ .x %>%
as_tibble(.) %>%
mutate(actions = map(actions, as_tibble)) %>%
unnest) %>%
group_by(date, action_type) %>%
mutate(n = row_number()) %>%
spread(action_type, value) %>%
select(-n)
# A tibble: 2 x 4
# Groups: date [2]
# date link_clicks page_engagement post_likes
#* <chr> <dbl> <dbl> <dbl>
#1 2015-08-02 10.0 5.00 2.00
#2 2015-08-04 30.0 NA 3.00
I was able to find a fast solution for my problem.
It was solved by using another query as SymbolixAU suggested.
Instead of using iterate() I used find() which then resulted in a dataframe with nested lists inside it.
From this point on I was easily able to get to my target using tidyr::unnest()
Thanks for your help.
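For anyone landing here, a rough sketch of that last step. I can't reproduce the database here, so the nested tibble below is built by hand and its shape (one list-column of data frames called actions) is an assumption about what find() returns; pivot_wider assumes tidyr 1.0.0 or later.

```r
library(dplyr)
library(tidyr)

# Hand-made stand-in for the data frame returned by the query (assumed shape).
nested <- tibble(
  date = c("2015-08-04", "2015-08-02"),
  actions = list(
    tibble(action_type = c("link_clicks", "post_likes"),
           value = c(30, 3)),
    tibble(action_type = c("link_clicks", "post_likes", "page_engagement"),
           value = c(10, 2, 5))
  )
)

flat <- nested %>%
  unnest(cols = actions) %>%             # one row per date and action
  pivot_wider(names_from = action_type,  # one column per action type
              values_from = value)

print(flat)
```

Action types missing for a given date come out as NA, which matches the target form in the question.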