I have seen some similar questions, but none of them matches exactly what I want to do, which is why I am asking.
I have a dataframe (dummy_data) which contains indices of some observations (obs) regarding given subjects (ID). The dataframe contains only the meaningful data (in other words: the rows where the desired conditions are met). The last column in this example data contains the total number of observations (total_obs).
ID <- c(rep("item_001",5),rep("item_452",8),rep("item_0001",7),rep("item_31",9),rep("item_007",5))
obs <- c(1,2,3,5,6,3,4,5,7,8,9,12,16,1,2,4,5,6,7,8,2,4,6,7,8,10,13,14,15,3,4,6,7,11)
total_obs <- c(rep(6,5),rep(16,8),rep(9,7),rep(18,9),rep(11,5))
dummy_data <- data.frame(ID, obs, total_obs)
I would like to create a new column (interval) with 3 possible values - "start", "center", "end" - based on the following condition:
it should split the total number of observations (total_obs) into 3 groups of positions (from the 1st to the last, i.e. the value stored in the total_obs column) and assign the interval value according to the position stored in the obs column.
Here is the expected output:
ID <- c(rep("item_001",5),rep("item_452",8),rep("item_0001",7),rep("item_31",9),rep("item_007",5))
segment <- c(1,2,3,5,6, 3,4,5,7,8,9,12,16, 1,2,4,5,6,7,8, 2,4,6,7,8,10,13,14,15, 3,4,6,7,11)
total_segments <- c(rep(6,5),rep(16,8),rep(9,7),rep(18,9),rep(11,5))
interval <- c("start","start","center","end","end","start","start","start","center","center","center","end","end","start","start","center","center","center","end","end","start","start","start","center","center","center","end","end","end", "start","start","center","center","end")
wanted_data <- data.frame(ID, segment, total_segments, interval)
I would like to use dplyr::ntile() together with dplyr::mutate() and dplyr::case_when(), but I could not make my code work properly. Any solutions?
You just need dplyr::mutate() and dplyr::case_when().
The following should give you something to work off of.
library(dplyr)

dummy_data %>%
  mutate(interval = case_when(obs < (total_obs/3) ~ "start",
                              obs < 2*(total_obs/3) ~ "center",
                              TRUE ~ "end"))
# TRUE ~ "end" is the 'else' case, used when everything else is FALSE
This gives slightly different results from your expected output.
I think more careful deliberation is needed about where the endpoints of each interval fall, but if you know what you are doing, a combination of <=, %/%, and ceiling() should give you the result you desire.
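For instance, here is a sketch of an exact-thirds variant using ceiling(); it assumes you want the first ceiling(total_obs/3) positions labelled "start", the next third "center", and the rest "end":

dummy_data %>%
  mutate(interval = case_when(obs <= ceiling(total_obs/3) ~ "start",
                              obs <= ceiling(2*total_obs/3) ~ "center",
                              TRUE ~ "end"))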
First, since dummy_data$obs is identical to wanted_data$segment, and dummy_data$total_obs is identical to wanted_data$total_segments, you just need to rename these columns.
For the interval column, here is one approach to creating it:
group the data by the total_segments column (which here amounts to grouping per study),
create a column, say tile, and fill it with ntile(segment, 3) results,
create the interval column and use case_when to fill it with category labels derived from tile: "start" when tile == 1, "center" when tile == 2, and "end" when tile == 3,
drop the tile column.
library(dplyr)

wanted_data <- dummy_data %>%
  rename(segment = obs, total_segments = total_obs) %>%
  group_by(total_segments) %>%
  mutate(tile = ntile(segment, 3)) %>%
  mutate(interval = case_when(tile == 1 ~ "start",
                              tile == 2 ~ "center",
                              tile == 3 ~ "end")) %>%
  select(-tile)
wanted_data
# A tibble: 34 × 4
# Groups: total_segments [5]
ID segment total_segments interval
<chr> <dbl> <dbl> <chr>
1 item_001 1 6 start
2 item_001 2 6 start
3 item_001 3 6 center
4 item_001 5 6 center
5 item_001 6 6 end
6 item_452 3 16 start
7 item_452 4 16 start
8 item_452 5 16 start
9 item_452 7 16 center
10 item_452 8 16 center
# … with 24 more rows
It's slightly different from the wanted_data$interval you showed, but based on your comment, the division into categories should be exactly as dplyr::ntile() does it.
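For reference, this is how ntile() splits positions into three groups, which explains the small differences from your hand-made labels:

ntile(1:6, 3)
# [1] 1 1 2 2 3 3
ntile(1:16, 3)
# [1] 1 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3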
The data I am working with is the number of people in a group. The columns I'm concerned with are the date (column 1) and the number of people in a group (column 3; there is a separate row for each group on a given day). I am looking for an output spreadsheet with a column for the date, one for the number of one-person groups on that day, and one for the total number of people who are in groups larger than one on that day.
For example if this was my dataset:
Date People
10/18 1
10/18 3
10/18 1
10/18 8
10/20 1
10/20 4
10/20 2
My desired output would be:
Date p=1 p>1
10/18 2 11
10/20 1 6
My data frame is "DF" and a csv with the different dates is "times". I tried to use a for loop but the output was just zeros.
Here is what I tried:
ntimes = length(times$UniTimes)
for(i in 1:ntimes)
{
s<- sum(DF[which (DF[,3] > 1 & DF[,1]==i),3])
t<- sum(DF[which (DF[,3] < 2 & DF[,1]== i),3])
}
ndf<-data.frame(times,s,t)
write.csv(ndf,'groups_c.csv')
Thank you for your time and help!
You can use aggregate:
aggregate(People ~ Date, DF, function(x) c("p=1" = sum(x[x==1]),
                                           "p>1" = sum(x[x>1])))
# Date People.p=1 People.p>1
#1 10/18 2 11
#2 10/20 1 6
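Note that aggregate() returns the two sums as a single matrix column named People, which is why the printed names are People.p=1 and People.p>1. If you need ordinary flat columns (e.g., before write.csv), one common idiom is do.call(data.frame, ...):

res <- aggregate(People ~ Date, DF, function(x) c("p=1" = sum(x[x==1]),
                                                  "p>1" = sum(x[x>1])))
res <- do.call(data.frame, c(res, check.names = FALSE))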
This should work, but without data to reproduce it's difficult to say:
library(dplyr)
DF %>%
group_by(Date) %>%
summarise(peq1 = sum(People == 1),
pgeq1 = sum(People[People > 1]))
An option with data.table
library(data.table)
setDT(DF)[, .(peq1 = sum(People == 1), pgeq1 = sum(People[People >1])), .(Date)]
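As for why the original loop produced only zeros: DF[,1] == i compares the date column against the loop index i (1, 2, ...) rather than against an actual date, so no rows ever match, and s and t are single values that get overwritten on every iteration instead of accumulating per date. Here is a fixed sketch of the loop, assuming times$UniTimes holds the dates as in the question:

ntimes <- length(times$UniTimes)
s <- numeric(ntimes)  # people in groups larger than one, per date
t <- numeric(ntimes)  # one-person groups, per date
for (i in 1:ntimes) {
  d <- times$UniTimes[i]
  s[i] <- sum(DF[DF[, 3] > 1 & DF[, 1] == d, 3])
  t[i] <- sum(DF[DF[, 3] < 2 & DF[, 1] == d, 3])
}
ndf <- data.frame(times, t, s)
write.csv(ndf, 'groups_c.csv')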
This should be very simple, but I can't figure out how to do it properly.
Given the following example dataframe:
telar <- data.frame(name=c("uno","dos","tres","cuatro","cinco"), id=c(1,2,3,1,2), test=c(10,11,12,13,14))
telar
name id test
1 uno 1 10
2 dos 2 11
3 tres 3 12
4 cuatro 1 13
5 cinco 2 14
I am trying to select all the rows whose value of test is below the average of all the values in the dataframe telar that have the same id value.
I have already properly grouped the values by id and computed their average like this, but I do not know how to perform the comparison.
> summarise(group_by(telar, id), test=mean(test))
# A tibble: 3 x 2
id test
<dbl> <dbl>
1 1 11.5
2 2 12.5
3 3 12
Thank you!
You can simply create your groups and keep the values that are less than the mean, i.e.
library(dplyr)
telar %>%
  group_by(id) %>%
  filter(test < mean(test)) %>%
  ungroup()

Note that the grouping must be by id only; adding name to the grouping would put every row in its own single-row group, so nothing would pass the filter.
There is undoubtedly a way to do this without using data.table, but I offer it as a solution
library(data.table)
setDT(telar)
telar[, avg := mean(test), by = id][test < avg]
Note: if you're doing further analysis on a data.frame afterwards, I recommend converting back to a data.frame using setDF(telar).
Using base R, this can be done with ave (again grouping by id only):
telar[with(telar, test < ave(test, id)), ]
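For the example data, all three approaches should return the same two rows, given the per-id means of 11.5, 12.5, and 12 computed in the question:

telar[with(telar, test < ave(test, id)), ]
#   name id test
# 1  uno  1   10
# 2  dos  2   11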
I am trying to create a simple function that sums some variables in a nested data set.
Here is a much simpler example:
df <- data.frame(ID=c(1,1,1,1,2,3,3,4,4,4,5,6,7,7,7,7,7,7,7,7),
var=c("A","B","C","D","B","A","D","A","C","D","D","D","A","D","A","A","A","B","B","B"),
N=c(50,50,50,50,298,156,156,85,85,85,278,301,98,98,98,98,98,98,98,98))
Think of this as a dataframe containing the results of 7 different studies. Each study has investigated one or more variables (A, B, C, D). The columns mean the following:
ID = The ID of the respective study.
var = The respective variable measured in each study. Some studies have measured only one variable (e.g., ID=2, which only contains B), some several.
N = The sample size of each study. That is, each ID has one sample size.
I would like to create a function that summarizes three things:
k = how many studies measured each variable (e.g., "A")
m = how often each variable was measured in total (regardless of whether some studies measured a variable more than once) - a simple frequency.
N = the sample size per variable - but counted only once per study. That is, no duplicates per study ID are allowed.
My current version (I am a real noob, so please forgive the form) results in exactly what I want:
model km N
1 A 4 (7) 389
2 B 3 (5) 446
3 C 2 (2) 135
4 D 6 (6) 968
For instance, variable A was measured 7 times, but only by 4 studies (study #7 measured it several times). The (non-redundant) sample size was N=389, not counting the repeated measurements of study #7 more than once.
(Note: The parentheses in the table are helpful as I intend to copy the results into a document)
Here is the current version of the code. The problems begin with the part containing the pipes
kmn <- function(data, x, ID, N) {
m <-table(data[[x]])
k <-apply(table(data[[x]],data[[ID]]), 1, function(x) length(x[x>0]) )
model <- levels(data[[x]])
km <- cbind(k,m)
colnames(km)<-c("k","m")
km <- paste0(k," (",m,")")
smpsize <- data %>%
group_by(data[[x]]) %>%
summarise(N = sum(N[!duplicated(ID)])) %>%
select(N)
cbind(model,km,smpsize)
}
kmn(data=df, x="var", ID = "ID", N="N")
The above code works, but only if the df dataframe really contains a variable named N (it breaks with a different variable name). I guess the data %>% part prompts R to look inside the dataframe rather than treating the N in sum(N... as a reference to the function argument.
I can imagine this looks horrible to someone with more experience :)
Thank you for any ideas
Holger
First, remove duplicates using the unique function and sum N by var.
Secondly, take df and group by var; n() gives the count and n_distinct(ID) the number of unique IDs. Then join with the stats_N dataframe:
library(dplyr)
stats_N <- df %>%
select(ID,var,N) %>%
unique() %>%
group_by(var) %>%
summarise(N=sum(N))
df %>%
group_by(var) %>%
summarise(n=n(),km=n_distinct(ID)) %>%
left_join(stats_N)
# A tibble: 4 x 4
# var n km N
# <fct> <int> <int> <dbl>
#1 A 7 4 389
#2 B 5 3 446
#3 C 2 2 135
#4 D 6 6 968
In addition to @fmarm's answer, it can also be done without a join: group by 'var', get the number of distinct IDs (n_distinct), the number of rows (n()), and the sum of the non-duplicated 'N's per study.
library(dplyr)
df %>%
  group_by(model = var) %>%
  summarise(km = sprintf("%d (%d)", n_distinct(ID), n()),
            N = sum(N[!duplicated(ID)]))  # count each study's N only once
# A tibble: 4 x 3
# model km N
# <fct> <chr> <dbl>
#1 A 4 (7) 389
#2 B 3 (5) 446
#3 C 2 (2) 135
#4 D 6 (6) 968
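If you still want this wrapped in a function that accepts the column names as arguments (the original sticking point), here is a sketch using dplyr's .data pronoun to look the columns up by string, so nothing is hardcoded:

library(dplyr)

# A sketch, not a definitive implementation: x, ID, and N are strings
# naming the columns, looked up via .data[[ ]].
kmn <- function(data, x, ID, N) {
  data %>%
    group_by(model = .data[[x]]) %>%
    summarise(km = sprintf("%d (%d)", n_distinct(.data[[ID]]), n()),
              N = sum(.data[[N]][!duplicated(.data[[ID]])]))
}

kmn(data = df, x = "var", ID = "ID", N = "N")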
I'm trying to transfer some work previously done in Excel into R. All I need to do is translate two basic COUNTIF formulae into readable R script. In Excel, I would use three tables and calculate across them using 'point-and-click' methods, but now I'm lost in how to address it in R.
My original dataframes are large, so for this question I've posted sample dataframes:
OperatorData <- data.frame(
Operator = c("A","B","C"),
Locations = c(850, 575, 2175)
)
AreaData <- data.frame(
Area = c("Torbay","Torquay","Tooting","Torrington","Taunton","Torpley"),
SumLocations = c(1000,500,500,250,600,750)
)
OperatorAreaData <- data.frame(
Operator = c("A","A","A","B","B","B","C","C","C","C","C"),
Area = c("Torbay","Tooting","Taunton",
"Torbay","Taunton","Torrington",
"Tooting","Torpley","Torquay","Torbay","Torrington"),
Locations = c(250,400,200,
100,400,75,
100,750,500,650,175)
)
What I'm trying to do is add two new columns to the OperatorData dataframe: one with the count of areas the operator operates in, and another with the count of areas in which that operator both operates and owns more than 50% of the locations.
So the new resulting dataframe would look like this
Operator Locations AreaCount Own_GE_50percent
A 850 3 1
B 575 3 1
C 2175 5 4
So far, I've managed to calculate the first column using the table function and then appending:
OpAreaCount <- data.frame(table(OperatorAreaData$Operator))
names(OpAreaCount)[2] <- "AreaCount"
OperatorData$"AreaCount" <- cbind(OpAreaCount$AreaCount)
This is fairly straightforward, but I'm stuck on how to calculate the second column, with its 50% condition.
library(dplyr)
OperatorAreaData %>%
inner_join(AreaData, by="Area") %>%
group_by(Operator) %>%
summarise(AreaCount = n_distinct(Area),
Own_GE_50percent = sum(Locations > (SumLocations/2)))
# # A tibble: 3 x 3
# Operator AreaCount Own_GE_50percent
# <fct> <int> <int>
# 1 A 3 1
# 2 B 3 1
# 3 C 5 4
You can use AreaCount = n() if you're sure you have unique Area values for each Operator.
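If you prefer base R, here is a sketch of the same join-then-count logic using merge, table, and tapply (same column names as above):

m <- merge(OperatorAreaData, AreaData, by = "Area")
op <- as.character(OperatorData$Operator)
# number of areas each operator appears in
OperatorData$AreaCount <- as.vector(table(m$Operator)[op])
# number of areas where the operator owns more than half the locations
OperatorData$Own_GE_50percent <- as.vector(
  tapply(m$Locations > m$SumLocations / 2, m$Operator, sum)[op])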
I am trying to dynamically populate a variable, which requires me to reference rows.
Given are 3 columns: time, group, and val.
I want to populate the val entries of rows 3, 4, 7, and 8, which are initially NA.
Here is my toy data:
df <- expand.grid(time = rep(c(1,2,3,4)), group = rep(c("A", "B")))
df$val <- c(50,40,NA,NA)
df
> df
time group val
1 1 A 50
2 2 A 40
3 3 A NA
4 4 A NA
5 1 B 50
6 2 B 40
7 3 B NA
8 4 B NA
I have two grouping variables (time and group) and, as example, I need to populate row 3 above by this set of rules:
1. Order by group and time (in ascending order).
2. For time = 3, the value of val is the arithmetic average of the two previous rows, i.e. the average of the time-2 and time-1 values, so it will be 1/2 * (40+50) = 45.
3. For time = 4, the value of val is the arithmetic average of the two previous rows, i.e. the average of the time-3 and time-2 values, so it will be 1/2 * (45+40) = 42.5.
And so on, going down to the last row of each group as defined by time and group variables.
I want to avoid using loops and referencing row index to achieve this, and prefer to stay within dplyr, as the rest of my scripts are in the dplyr ecosystem. Is there an efficient way to achieve this?
This isn't the cleanest solution, but it gets the job done:
library(dplyr)

df2 <- df %>%
  arrange(group, time) %>%
  mutate(val = if_else(is.na(val), (lag(val, n = 1) + lag(val, n = 2)) / 2, val)) %>%
  # run the same step twice so the time-4 rows can use the now-filled time-3 values
  mutate(val = if_else(is.na(val), (lag(val, n = 1) + lag(val, n = 2)) / 2, val))
Again, it's not pretty, but it seems to work. Hope that helps give you something to start from.
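A slightly more general alternative: the doubled mutate() above works only because each group has exactly two trailing NAs and the rows are arranged so the lags never cross a group boundary. Wrapping the two-lag recursion in a small helper and applying it per group handles any number of trailing NAs; it does hide a loop inside the helper, which the question asked to avoid, so treat it as a sketch:

library(dplyr)

# Hypothetical helper: fill each NA with the mean of the two entries above it.
fill_two_lag <- function(v) {
  for (i in seq_along(v)) {
    if (is.na(v[i]) && i > 2) v[i] <- (v[i - 1] + v[i - 2]) / 2
  }
  v
}

df2 <- df %>%
  arrange(group, time) %>%
  group_by(group) %>%
  mutate(val = fill_two_lag(val)) %>%
  ungroup()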