dataframe manipulation using ddply - r

I have a data frame named output.
I want to find the mode (the most frequent value) of code for each distinct patientID and, for each distinct zipcode, the count of unique patientIDs that have that code.
I tried this:
ddply(output,~zipcode,summarize,max=mode(code))
This code gives one result per distinct zipcode, but I want the mode of code for each distinct patientID within each distinct zipcode.
output=data.frame(code=c("E78.5","N08","E78.5","I65.29","Z68.29","D64.9"),patientID=c("34423","34423","34423","34423","34424","34425"),zipcode=c(00718,00718,00718,00718,00718,00719),city=c("NAGUABO","NAGUABO","NAGUABO","NAGUABO","NAGUABO","NAGUABO"))
My desired output:
zipcode most_rep_code patient_count
1 718 E78.5 1
2 719 D64.9 1

If I understand correctly, you need the code with the highest frequency by patientID and zipcode, so dplyr might be of use. I think you just need patientID, code and zipcode as grouping variables, and then summarise to get the row count of each group. The code with the highest count is the mode, and the new column gives that count.
# Your reprex data
output=data.frame(code=c("E78.5","N08","E78.5","I65.29","Z68.29","D64.9"),patientID=c("34423","34423","34423","34423","34424","34425"),zipcode=c(00718,00718,00718,00718,00718,00719),city=c("NAGUABO","NAGUABO","NAGUABO","NAGUABO","NAGUABO","NAGUABO"))
library(dplyr)
output %>%
dplyr::group_by(patientID, code, zipcode) %>%
dplyr::summarise(mode_freq = n())
# A tibble: 5 x 4
# Groups: patientID, code [5]
patientID code zipcode mode_freq
<fct> <fct> <dbl> <int>
1 34423 E78.5 718 2
2 34423 I65.29 718 1
3 34423 N08 718 1
4 34424 Z68.29 718 1
5 34425 D64.9 719 1
I've included dplyr:: because I'm assuming you have plyr loaded and so function names will conflict.
Update:
To get to your suggested output: the mode is, by definition, the value with the highest frequency, so keep the rows with the highest count within each zipcode:
output %>%
group_by(patientID, code, zipcode) %>%
summarise(mode_freq = n()) %>%
ungroup() %>%
group_by(zipcode) %>%
filter(mode_freq == max(mode_freq))
# A tibble: 2 x 4
# Groups: zipcode [2]
patientID code zipcode mode_freq
<fct> <fct> <dbl> <int>
1 34423 E78.5 718 2
2 34425 D64.9 719 1
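Note that base R's mode() reports the storage mode of an object (for example "character"), not the most frequent value, which is why the ddply attempt in the question does not work as intended. Below is a minimal sketch of one way to get the three-column output from the question. The helper name stat_mode is hypothetical (not part of any package), ties are broken by the order in which table() lists the values, and patient_count is assumed to mean the number of distinct patients in the zipcode who have the most frequent code.

```r
library(dplyr)

# Hypothetical helper: most frequent value of a vector.
# On ties, the value that table() lists first wins.
stat_mode <- function(x) {
  counts <- table(x)
  names(counts)[which.max(counts)]
}

# Reprex data from the question (zipcode literals 00718/00719 are the numbers 718/719)
output <- data.frame(
  code = c("E78.5", "N08", "E78.5", "I65.29", "Z68.29", "D64.9"),
  patientID = c("34423", "34423", "34423", "34423", "34424", "34425"),
  zipcode = c(718, 718, 718, 718, 718, 719),
  stringsAsFactors = FALSE
)

result <- output %>%
  group_by(zipcode) %>%
  summarise(
    # most frequent code within the zipcode
    most_rep_code = stat_mode(code),
    # assumption: count distinct patients who have that code
    patient_count = n_distinct(patientID[code == most_rep_code])
  )

print(result)
```

If plyr is loaded after dplyr, prefix the verbs with dplyr:: as in the answer above.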

Related

Using dplyr, group data by id and date, find first and last location

EDIT: Sorry, it seems my example data was too simple/nice. The full data set is much larger. I cannot recover the order of events by ordering by date or anything else. And the on and off are ids, not event numbers, so do not have an order either. I've updated the example to better reflect this.
Here is some example data:
ids <- c(1, 1, 1, 2, 2, 2)
date <- c(1,1,1, 3,3,3)
off <- c(234,234,93, 675,876,876) # these are ids
on <- c(93,111,234, 876,675,675) # these are ids
df <- data.frame(ids, date, on, off)
This represents journeys, ie
individual 1 goes from 234 -> 93 -> 234 -> 111
individual 2 goes from 876 -> 675 -> 876 -> 675
The date information is not detailed enough to order the records on its own, so I cannot just take the first and last rows.
Grouping the data by id and date, I want to identify the first off location and the last on location, and aggregate these into one record.
I would expect an outcome in this instance of
ids <- c(1, 2)
date <- c(1,3)
off <- c(234, 876) # first off location of each journey
on <- c(111, 675) # last on location of each journey
I have tried many things but none have worked correctly.
It looks like your logic is that for each id, you just want the minimum value for off and the maximum value for on, so this should do it.
library(dplyr, warn.conflicts = FALSE)
ids <- c(1, 1, 1, 2, 2, 2)
date <- c(1,1,1, 3,3,3)
off <- c(111,234,111, 675,876,675)
on <- c(234,111,876, 876,675,876)
df <- data.frame(ids = ids, date = date, on = on, off = off)
df %>%
group_by(ids) %>%
filter(on == max(on), off == min(off)) %>%
distinct()
#> # A tibble: 2 × 4
#> # Groups: ids [2]
#> ids date on off
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 876 111
#> 2 2 3 876 675
Created on 2023-02-02 by the reprex package (v2.0.1)
You may also use group_by and slice_head:
data.frame(ids, date, on, off) %>% arrange(ids,date, on, off) %>% group_by(ids) %>%
slice_head(n=1)
Created on 2023-02-02 with reprex v2.0.2
# A tibble: 2 × 4
# Groups: ids [2]
ids date on off
<dbl> <dbl> <dbl> <dbl>
1 1 1 111 234
2 2 3 675 876
The first and last rows of each group determine the beginning and the end, so select them and summarise the data. For example:
library(dplyr)
library(tidyr)
df %>%
group_by(ids, date) %>%
mutate(start = case_when(row_number() == 1 ~ off),
end = case_when(row_number() == n() ~ on)) %>%
select(-on, -off) %>%
filter(!(is.na(start) & is.na(end))) %>%
fill(start, .direction="down") %>%
fill(end, .direction="up") %>%
distinct()
This ends up with:
# A tibble: 2 × 4
# Groups: ids, date [2]
ids date start end
<dbl> <dbl> <dbl> <dbl>
1 1 1 111 876
2 2 3 675 876
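The approaches above rely on row order or on the numeric size of the location ids, which the edit to the question rules out. As a rough sketch under a stated assumption (each group is a single open chain of legs whose start and end differ), the start is the location that occurs once more in off than in on, and the end is the location that occurs once more in on than in off. The helper name journey_ends is hypothetical, and the data are the edited example from the question.

```r
# Example data from the edited question (rows are NOT in travel order)
df <- data.frame(
  ids  = c(1, 1, 1, 2, 2, 2),
  date = c(1, 1, 1, 3, 3, 3),
  off  = c(234, 234, 93, 675, 876, 876),
  on   = c(93, 111, 234, 876, 675, 675)
)

# Hypothetical helper for one ids/date group.
# Assumes one open chain: exactly one location has a net count of +1 (start)
# and exactly one has a net count of -1 (end).
journey_ends <- function(g) {
  locs <- sort(unique(c(g$off, g$on)))
  net <- tabulate(match(g$off, locs), length(locs)) -
    tabulate(match(g$on, locs), length(locs))
  data.frame(
    ids = g$ids[1],
    date = g$date[1],
    off = locs[net == 1],  # first off location
    on = locs[net == -1]   # last on location
  )
}

groups <- split(df, list(df$ids, df$date), drop = TRUE)
result <- do.call(rbind, lapply(groups, journey_ends))
rownames(result) <- NULL

print(result)
```

A journey that ends where it started has no location with a net count of +1 or -1, so this sketch would fail on it and would need extra information to pick the endpoints.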

Filtering multiple time series values in R

I have a problem with a time series which I don't know how to solve.
I have a tibble with 4 different variables. My real dataset has over 10,000 documents.
document date author label
1 2018-04-05 Mr.X 1
2 2018-02-05 Mr.Y 0
3 2018-04-17 Mr.Z 1
So now my problem is that, as a first step, I want to count the articles that occur in each month of each year in my time series. I know that I can filter for a specific month in a year like this:
tibble%>%
filter(date > "2018-02-01" & date < "2018-02-28")
The result would be a tibble with 1 observation, but my problem is that I have 360 different time periods in my data. Can I write a function to solve this, or do I need to do 360 separate calculations?
The best solution for me would be a table with 360 columns, where each column holds the number of articles counted in that month. Is this possible?
Thank you so much in advance.
If you want each month's rows as a separate list element, you can do something like this:
suppressMessages(library(dplyr))
df %>% mutate(date = as.Date(date)) %>%
group_split(substr(date, 1, 7), .keep = FALSE)
<list_of<
tbl_df<
document: integer
date : date
author : character
label : integer
>
>[2]>
[[1]]
# A tibble: 1 x 4
document date author label
<int> <date> <chr> <int>
1 2 2018-02-05 Mr.Y 0
[[2]]
# A tibble: 2 x 4
document date author label
<int> <date> <chr> <int>
1 1 2018-04-05 Mr.X 1
2 3 2018-04-17 Mr.Z 1
You can further use list2env() to save each element of this list as a separate object; note that list2env() needs a named list, so name the elements first.
To count the number of rows for each month-year combination, in tidyverse you can do :
library(dplyr)
library(tidyr)
df %>%
mutate(date = as.Date(date),
year_mon = format(date, '%Y-%m')) %>%
select(year_mon) %>%
pivot_wider(names_from = year_mon, values_from = year_mon,
values_fn = length, values_fill = 0)
# `2018-04` `2018-02`
# <int> <int>
#1 2 1
and in base R :
df$date <- as.Date(df$date)
table(format(df$date, '%Y-%m'))
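With 360 periods, a long table (one row per month) may be easier to work with than 360 columns. A minimal sketch using dplyr's count(), assuming the three example rows from the question; the column name n_documents is just an illustrative choice:

```r
library(dplyr)

# Example rows from the question
df <- data.frame(
  document = 1:3,
  date = c("2018-04-05", "2018-02-05", "2018-04-17"),
  author = c("Mr.X", "Mr.Y", "Mr.Z"),
  label = c(1L, 0L, 1L),
  stringsAsFactors = FALSE
)

monthly <- df %>%
  mutate(year_mon = format(as.Date(date), "%Y-%m")) %>%  # e.g. "2018-04"
  count(year_mon, name = "n_documents")                  # one row per month

print(monthly)
```

Months with no documents do not appear in this table; they would have to be added separately if zero counts are needed.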

How to get percentages of observations above a certain value for individuals and groups?

I am new to R and looking for some help with my thesis!
The data I have are participant ID, the group they belong to (control or patient), and coordinates in the column “gaze”, where values >0 are right and <0 are left.
The goal is to calculate the percentage of coordinates at the right and left side of space for each participant and the two groups.
Sample data:
df <- data.frame(personID=rep(1,6),gaze=c(-0.104,-0.105,0.00550,0.00407,0.00119,0.0411),group=rep('control',6))
df
# personID gaze group
#1 1 -0.10400 control
#2 1 -0.10500 control
#3 1 0.00550 control
#4 1 0.00407 control
#5 1 0.00119 control
#6 1 0.04110 control
You can use the dplyr package to get your answer
library(dplyr)
# create a new boolean column with TRUE where gaze >=0
df <- df %>% mutate(positive_gaze=(gaze>=0))
# group by personID and calculate mean of the new column
df %>% group_by(personID) %>% summarise(pct_positive = 100*mean(positive_gaze))
# A tibble: 1 x 2
# personID pct_positive
# <dbl> <dbl>
#1 1 66.7
# similarly you could group by group
df %>% group_by(group) %>% summarise(pct_positive = 100*mean(positive_gaze))
# A tibble: 1 x 2
# group pct_positive
# <fct> <dbl>
#1 control 66.7
# or group by both group and personID
df %>% group_by(group,personID) %>% summarise(pct_positive = 100*mean(positive_gaze))
# A tibble: 1 x 3
# Groups: group [1]
# group personID pct_positive
# <fct> <dbl> <dbl>
#1 control 1 66.7
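Since the question asks for both sides, here is a small sketch that reports the right and left percentages together. It assumes the question's definition (gaze > 0 is right, gaze < 0 is left), so a gaze of exactly 0 counts toward neither side; the column names pct_right and pct_left are illustrative.

```r
library(dplyr)

# Sample data from the question
df <- data.frame(
  personID = rep(1, 6),
  gaze = c(-0.104, -0.105, 0.00550, 0.00407, 0.00119, 0.0411),
  group = rep("control", 6),
  stringsAsFactors = FALSE
)

side_pct <- df %>%
  group_by(group, personID) %>%
  summarise(
    pct_right = 100 * mean(gaze > 0),  # share of samples right of centre
    pct_left  = 100 * mean(gaze < 0),  # share of samples left of centre
    .groups = "drop"
  )

print(side_pct)
```

Dropping personID from group_by() gives the same two percentages per group instead of per participant.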

Ranking Values of data frame excluding same dates

I have a data frame with Dates and Values:
library(dplyr)
library(lubridate)
df<-tibble(DateTime=ymd(c("2018-01-01","2018-01-01","2018-01-02","2018-01-02","2018-01-03","2018-01-03")),
Value=c(5,10,12,3,9,11),Rank=rep(0,6))
I would like to rank the values of the last two rows, each compared with the other four Value rows (the ones from previous dates).
I have managed to do this:
dfReference<-df%>%filter(DateTime!=max(DateTime))
dfTarget<-df%>%filter(DateTime==max(DateTime))
for (i in 1:nrow(dfTarget)){
tempDf<-rbind(dfReference,dfTarget[i,])%>%
mutate(Rank=rank(Value,ties.method = "first"))
dfTarget$Rank[i]=filter(tempDf,DateTime==max(df$DateTime))$Rank
}
Desired output:
> dfTarget
# A tibble: 2 x 3
DateTime Value Rank
<date> <dbl> <dbl>
1 2018-01-03 9 3
2 2018-01-03 11 4
But I am looking for a more elegant way.
Thanks
This is basically the same idea as your for loop, but instead of a loop it uses map_int, and instead of creating a new data frame using rbind it creates a new vector with c().
library(tidyverse)
is.max <- with(df, DateTime == max(DateTime))
df[is.max,] %>%
mutate(Rank = map_int(Value, ~
c(df$Value[!is.max], .x) %>%
rank(ties.method = 'first') %>%
tail(1)))
# # A tibble: 2 x 3
# DateTime Value Rank
# <date> <dbl> <int>
# 1 2018-01-03 9 3
# 2 2018-01-03 11 4
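Because each target value is appended last and ties.method = "first" puts earlier equal values ahead of it, its rank equals the number of reference values less than or equal to it, plus one. A base R sketch of that shortcut, using the data from the question (is_max and ref are illustrative names):

```r
# Data from the question, built with base R dates
df <- data.frame(
  DateTime = as.Date(c("2018-01-01", "2018-01-01", "2018-01-02",
                       "2018-01-02", "2018-01-03", "2018-01-03")),
  Value = c(5, 10, 12, 3, 9, 11)
)

is_max <- df$DateTime == max(df$DateTime)
ref <- df$Value[!is_max]   # values from the earlier dates
dfTarget <- df[is_max, ]   # rows to be ranked

# Rank of each target value among the reference values plus itself
dfTarget$Rank <- vapply(
  dfTarget$Value,
  function(v) sum(ref <= v) + 1L,
  integer(1)
)

print(dfTarget)
```

This avoids calling rank() once per target row, which may matter when the reference set is large.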

Duplicate identifier error in tidyr

I am using tidyr from R and am running into an issue when using the spread() command with duplicate identifiers.
Here is a mock example that illustrates the problem:
X = data.frame(name=c("Eric","Bob","Mark","Bob","Bob","Mark","Eric","Bob","Mark"),
metric=c("height","height","height","weight","weight","weight","grade","grade","grade"),
values=c(6,5,4,120,118,180,"A","B","C"),
stringsAsFactors=FALSE)
tidyr::spread(X,metric,values)
So when I run this command I get the following error:
Error: Duplicate identifiers for rows (4, 5)
This makes sense as an error, because Bob is recorded twice for weight. It's actually not a mistake in the data, because Bob did have his weight recorded twice. What I would like to be able to do is run the command and have it give me back the following:
name height weight grade
Eric 6 NA A
Bob 5 120 B
Bob 5 118 B
Mark 4 180 C
Is spread not the command I should be using to accomplish this? And if there isn't an easy solution is there a simple way to remove the record with lowest weight for duplicates when running the spread() command?
After making unique identifiers, which can be done by making a new variable representing the index within each group, you can use fill to fill the second "Bob" row with a duplicate value for "height" and "grade".
You can remove the index variable at the end via select.
library(dplyr)
library(tidyr)
X %>%
group_by(name, metric) %>%
mutate(row = row_number() ) %>%
spread(metric, values) %>%
fill(grade, height) %>%
select(-row)
# A tibble: 4 x 4
# Groups: name [3]
name grade height weight
<chr> <chr> <chr> <chr>
1 Bob B 5 120
2 Bob B 5 118
3 Eric A 6 <NA>
4 Mark C 4 180
To filter to the maximum value of each name/metric group (note that values is a character column here, so max() compares strings rather than numbers):
X %>%
group_by(name, metric) %>%
filter(values == max(values)) %>%
spread(metric, values)
# A tibble: 3 x 4
# Groups: name [3]
name grade height weight
* <chr> <chr> <chr> <chr>
1 Bob B 5 120
2 Eric A 6 <NA>
3 Mark C 4 180
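spread() has been superseded by pivot_wider() in newer versions of tidyr, so the first approach can also be sketched as below. This is the same indexing idea, not a different method: the row column makes each name/metric pair unique, and fill() is run within each name so values are never copied from one person to another. The "downup" fill direction is assumed to be available (tidyr 1.0 or later).

```r
library(dplyr)
library(tidyr)

# Mock data from the question (values is character because of the grades)
X <- data.frame(
  name = c("Eric", "Bob", "Mark", "Bob", "Bob", "Mark", "Eric", "Bob", "Mark"),
  metric = c("height", "height", "height", "weight", "weight", "weight",
             "grade", "grade", "grade"),
  values = c(6, 5, 4, 120, 118, 180, "A", "B", "C"),
  stringsAsFactors = FALSE
)

wide <- X %>%
  group_by(name, metric) %>%
  mutate(row = row_number()) %>%   # index repeated measurements
  ungroup() %>%
  pivot_wider(names_from = metric, values_from = values) %>%
  group_by(name) %>%
  fill(height, grade, .direction = "downup") %>%  # copy within each name only
  ungroup() %>%
  select(-row)

print(wide)
```

Eric's weight stays NA because no weight was recorded for him.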

Resources