In R, how to find the first minimum value of a dataframe

How do I find the first minimum value in one column of a dataframe and output a new dataframe with just that row?
For example, for a dataframe named "hospital", for each node, I want to find the minimum time at which "H" is >=1.
node time H
1 1 0
2 1 0
3 1 0
1 2 0
2 2 0
3 2 2
1 3 0
2 3 1
3 3 2
1 4 1
2 4 4
3 4 0
The result I want to be able to output is:
node time H
1 4 1
2 3 1
3 2 2

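One way is to filter the dataframe and then, within each node, keep the row with the smallest time:
library(dplyr)
# df is the dataframe called "hospital" in the question
df %>%
  filter(H >= 1) %>%        # keep only rows where H is at least 1
  group_by(node) %>%
  slice_min(time, n = 1)    # earliest remaining time for each node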
node time H
<int> <int> <int>
1 1 4 1
2 2 3 1
3 3 2 2
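The same filter-then-take-first idea can also be sketched in base R, without dplyr. This is only an illustration: the data frame below is rebuilt by hand from the question's table, and the names hits and first_hit are made up for the example.

```r
# Rebuild the example data from the question (named "hospital" there).
hospital <- data.frame(
  node = rep(1:3, times = 4),
  time = rep(1:4, each = 3),
  H    = c(0, 0, 0,  0, 0, 2,  0, 1, 2,  1, 4, 0)
)

# Keep only rows where H is at least 1, then sort by node and time.
hits <- hospital[hospital$H >= 1, ]
hits <- hits[order(hits$node, hits$time), ]

# duplicated() flags every row after the first one within each node,
# so negating it keeps exactly one (the earliest) row per node.
first_hit <- hits[!duplicated(hits$node), ]
rownames(first_hit) <- NULL
print(first_hit)
```

Note that a node for which H never reaches 1 would be dropped entirely by the filter, in this sketch as well as in the dplyr version.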

Related

How do you duplicate rows n times by group and change one specific column value in R?

I am trying to create duplicate rows by group. The number of duplicate rows I want to create varies by group and I want to fix the value of one column Attended = 0.
A minimal working example of the data set DF I am working with is:
ID Demo Attended t
1 3 1 1
1 3 1 3
1 3 0 4
1 3 1 5
2 5 1 2
2 5 1 4
3 7 0 1
For the example above, suppose I want every person (ID) to have 5 rows, with Demo the same across all rows for each individual. Thus, I have to create 1 row for ID = 1, 3 for ID = 2 and 4 for ID = 3 (I would like to calculate these dynamically for each subgroup). For the new rows I generate, I want Attended = 0 and t to take on the value of a missing index, so that the final output is:
ID Demo Attended t
1 3 1 1
1 3 1 3
1 3 0 4
1 3 1 5
1 3 0 2
2 5 1 2
2 5 1 4
2 5 0 1
2 5 0 3
2 5 0 5
3 7 0 1
3 7 0 2
3 7 0 3
3 7 0 4
3 7 0 5
I have been able to create duplicate rows by group, but haven't been able to figure out how to create a different number of duplicates for each participant and correctly fill in the index column t.
Here is what I have working:
DF %>%
  group_by(ID) %>%
  rbind(., mutate(., t = row_number()))
I have been trying to create the right number of duplicates using slice() and trying to get the t value to be exactly what I want but to no avail.
Any help would be appreciated!
One tidyverse possibility could be:
library(dplyr)
library(tidyr)
df %>%
  complete(t, nesting(ID, Demo), fill = list(Attended = 0)) %>%  # add the missing t values for each ID/Demo pair
  arrange(ID)
t ID Demo Attended
<int> <int> <int> <dbl>
1 1 1 3 1
2 2 1 3 0
3 3 1 3 1
4 4 1 3 0
5 5 1 3 1
6 1 2 5 0
7 2 2 5 1
8 3 2 5 0
9 4 2 5 1
10 5 2 5 0
11 1 3 7 0
12 2 3 7 0
13 3 3 7 0
14 4 3 7 0
15 5 3 7 0
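For comparison, here is a rough base R sketch of the same "complete the missing combinations" idea. It is not the answer's method: it builds the full ID-by-t grid explicitly with merge(), and n_rows, people, grid and full are names invented for the illustration. Unlike complete(t, ...), which only uses t values that occur somewhere in the data, the grid here always runs from 1 to n_rows.

```r
# Example data from the question.
DF <- data.frame(
  ID       = c(1, 1, 1, 1, 2, 2, 3),
  Demo     = c(3, 3, 3, 3, 5, 5, 7),
  Attended = c(1, 1, 0, 1, 1, 1, 0),
  t        = c(1, 3, 4, 5, 2, 4, 1)
)

n_rows <- 5  # assumed target number of rows per ID

# One row per (ID, Demo) pair, crossed with every index 1..n_rows.
people <- unique(DF[, c("ID", "Demo")])
grid   <- merge(people, data.frame(t = seq_len(n_rows)), by = NULL)  # by = NULL gives a cross join

# Left-join the observed rows onto the grid; combinations not in DF get NA.
full <- merge(grid, DF, by = c("ID", "Demo", "t"), all.x = TRUE)
full$Attended[is.na(full$Attended)] <- 0  # generated rows count as not attended

# Restore a readable row and column order.
full <- full[order(full$ID, full$t), c("ID", "Demo", "Attended", "t")]
rownames(full) <- NULL
print(full)
```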

Building sum of dynamic number of rows in dplyr

My df looks something like the first three columns of the following:
ID VAL LENGTH SUM
1 1 1 1
1 1 1 1
1 1 2 2
1 1 2 2
2 0 1 0
2 3 1 0
2 4 2 3
I want to add a fourth column, SUM, defined as the sum of the first LENGTH values of VAL within each group.
How do I do that?
You could do:
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(SUM = sapply(LENGTH, function(x) sum(VAL[1:x])))  # sum the first x values of VAL in the group
Output:
# A tibble: 7 x 4
# Groups: ID [2]
ID VAL LENGTH SUM
<int> <int> <int> <dbl>
1 1 1 1 1
2 1 1 1 1
3 1 1 2 2
4 1 1 2 2
5 2 0 1 0
6 2 3 1 0
7 2 4 2 3
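Since the sum of the first k values is just the k-th element of a running total, the same result can probably be obtained with cumsum() instead of recomputing a sum for every row. The base R sketch below assumes that LENGTH is always between 1 and the group size; pieces and result are illustrative names.

```r
# First three columns of the example from the question.
df <- data.frame(
  ID     = c(1, 1, 1, 1, 2, 2, 2),
  VAL    = c(1, 1, 1, 1, 0, 3, 4),
  LENGTH = c(1, 1, 2, 2, 1, 1, 2)
)

# Within each ID, cumsum(VAL)[k] is the sum of the first k values,
# so indexing the running total by LENGTH gives the requested sums.
pieces <- lapply(split(df, df$ID), function(g) {
  g$SUM <- cumsum(g$VAL)[g$LENGTH]
  g
})

# Stack the per-ID pieces back together (rows end up ordered by ID).
result <- do.call(rbind, pieces)
rownames(result) <- NULL
print(result)
```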

Add a column that divides another column into n chunks, R

There's no easy way to describe my question, which is probably why I was not able to find an answer through search.
I have a data frame with 3 columns: Subject number, Correctness and Block. There are 2 participants, and each was exposed to 2 blocks of 3 stimuli.
subj corr block
1 1 1 1
2 1 0 1
3 1 1 1
4 1 1 2
5 1 1 2
6 1 1 2
7 2 0 1
8 2 1 1
9 2 1 1
10 2 0 2
11 2 1 2
12 2 1 2
What I want to do is create another column that, for each subj number, divides the rows belonging to that subj into 3 even chunks (the original df has 2 chunks). In general, I want to know how to divide the stimuli each subj is exposed to into N chunks and put the chunk number into another column.
subj corr block newblock
1 1 1 1 1
2 1 0 1 1
3 1 1 1 2
4 1 1 2 2
5 1 1 2 3
6 1 1 2 3
7 2 0 1 1
8 2 1 1 1
9 2 1 1 2
10 2 0 2 2
11 2 1 2 3
12 2 1 2 3
Something like this:
library(dplyr)
n_chunks = 3
df %>%
  group_by(subj) %>%
  mutate(newblock = rep(1:n_chunks, each = ceiling(n() / n_chunks))[1:n()])
How much of this is necessary depends on your use case. If you can guarantee that n_chunks evenly divides the number of observations for each subject, you can simplify to:
df %>%
  group_by(subj) %>%
  mutate(newblock = rep(1:n_chunks, each = n() / n_chunks))
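A possible base R variant uses ave() to number the rows within each subject. This sketch deliberately swaps in a slightly different rule from the answer above: the hypothetical helper chunk_id() maps position i of n to ceiling(i * k / n), which spreads leftover rows across chunks rather than leaving the last chunk short. For the example data (6 rows per subject, 3 chunks) both rules agree.

```r
# Example data from the question.
df <- data.frame(
  subj  = rep(1:2, each = 6),
  corr  = c(1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1),
  block = rep(rep(1:2, each = 3), times = 2)
)
n_chunks <- 3

# Hypothetical helper: map positions 1..n onto chunk labels 1..k.
chunk_id <- function(n, k) ceiling(seq_len(n) * k / n)

# ave() applies the function separately to the rows of each subject
# and puts the results back in the original row order.
df$newblock <- ave(seq_along(df$subj), df$subj,
                   FUN = function(i) chunk_id(length(i), n_chunks))
print(df)
```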

Get columns in frame based on values in second frame

I have 2 dataframes. One has an ID column with a lot of ordered IDs.
The other one contains just specific rows of the first dataframe. Those are my markers.
I need to get the sum of the values in a specific column based on the id values of the second dataframe.
The first dataframe may be:
id goals cards group
1 2 2 1
2 3 2 1
3 4 2 1
4 5 1 1
5 1 2 1
1 2 2 2
2 3 2 2
3 4 2 2
4 5 1 3
5 1 2 3
The second one:
id goals cards group
2 3 2 1
5 1 2 1
2 3 2 2
3 4 2 2
5 1 2 3
What I need to get:
id goals cards group points
1 2 2 1 2-(2+2)
2 3 2 1 0 (because it is in the second list)
3 4 2 1 4-(2+1+2)
4 5 1 1 5-(1+2)
5 1 2 1 0 (because it is in the second list)
1 2 2 2 2-(2+2)
2 3 2 2 0
3 4 2 2 0
4 5 1 3 5-(1+2)
5 1 2 3 0
Something like this?
df1<- df1%>%
rowwise() %>%
mutate(points=
goals
-(sum( df1$cards[df1$id <= df2$id & df1$id>df1$id])))
df1 = read.table(text = "
id goals cards
1 2 2
2 3 2
3 4 2
4 5 1
5 1 2
", header=T)
df2 = read.table(text = "
id goals cards
2 3 2
5 1 2
", header=T)
library(dplyr)
# function that takes an id and returns the sum of cards based on df2
GetSumOfCards = function(x) {
  # for a given id of df1, find the smallest id in df2 that is greater than or equal to it
  ids = min(df2$id[df2$id >= x])
  ifelse(x %in% df2$id,                                # if the given id exists in df2
         0,                                            # the sum of cards is zero
         sum(df1$cards[df1$id >= x & df1$id <= ids]))  # otherwise sum the cards in df1 from this id up to the id found above
}
# vectorise the function so it can be applied to a whole column
GetSumOfCards = Vectorize(GetSumOfCards)
df1 %>%
  mutate(sum_cards = GetSumOfCards(id),  # get sum of cards for each id using the function
         points = goals - sum_cards)     # get the points
# id goals cards sum_cards points
# 1 1 2 2 4 -2
# 2 2 3 2 0 3
# 3 3 4 2 5 -1
# 4 4 5 1 3 2
# 5 5 1 2 0 1
For your updated question, applying a function like this to every row would make the process very slow. This solution instead groups the data so that the cards can be summed cumulatively over chunks of rows:
df1 = read.table(text = "
id goals cards group
1 2 2 1
2 3 2 1
3 4 2 1
4 5 1 1
5 1 2 1
1 2 2 2
2 3 2 2
3 4 2 2
4 5 1 3
5 1 2 3
", header=T)
df2 = read.table(text = "
id goals cards group
2 3 2 1
5 1 2 1
2 3 2 2
3 4 2 2
5 1 2 3
", header=T)
library(dplyr)
df1 %>%
  arrange(group, desc(id)) %>%                    # order by group and id descending (this will help with counting the cards)
  left_join(df2 %>%                               # join specific columns of df2 and add a flag to know that this row exists in df2
              select(id, group) %>%
              mutate(flag = 1), by=c("id","group")) %>%
  mutate(flag = ifelse(is.na(flag), 0, flag),     # replace NA with 0
         flag2 = cumsum(flag)) %>%                # this flag will create the groups we need to count cards
  group_by(group, flag2) %>%                      # for each new group (we need both as the card counting will change when we have a row from df2, or if group changes)
  mutate(sum_cards = ifelse(flag == 1, 0, cumsum(cards))) %>%  # get cumulative sum of cards unless flag = 1, where we need 0 cards
  ungroup() %>%                                   # forget the grouping
  arrange(group, id) %>%                          # back to original order
  mutate(points = goals - sum_cards) %>%          # calculate points
  select(-flag, -flag2)                           # remove flags
# # A tibble: 10 x 6
# id goals cards group sum_cards points
# <int> <int> <int> <int> <dbl> <dbl>
# 1 1 2 2 1 4 -2
# 2 2 3 2 1 0 3
# 3 3 4 2 1 5 -1
# 4 4 5 1 1 3 2
# 5 5 1 2 1 0 1
# 6 1 2 2 2 4 -2
# 7 2 3 2 2 0 3
# 8 3 4 2 2 0 4
# 9 4 5 1 3 3 2
# 10 5 1 2 3 0 1
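To make the chunked cumulative sum easier to follow, here is a plain base R sketch that walks each group from its last row to its first and restarts the running card total at every marker row. It is an explicit-loop illustration of the same logic, not a faster replacement; marker and points_for_group() are names invented for this example.

```r
# Example data from the updated question.
df1 <- data.frame(
  id    = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
  goals = c(2, 3, 4, 5, 1, 2, 3, 4, 5, 1),
  cards = c(2, 2, 2, 1, 2, 2, 2, 2, 1, 2),
  group = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3)
)
df2 <- data.frame(id = c(2, 5, 2, 3, 5), group = c(1, 1, 2, 2, 3))

# Flag the rows of df1 that also appear in df2 (matched on group and id).
marker <- paste(df1$group, df1$id) %in% paste(df2$group, df2$id)

# Hypothetical helper: points for one group whose rows are sorted by id.
points_for_group <- function(goals, cards, marker) {
  sum_cards <- numeric(length(cards))
  running <- 0
  # Walk from the last row to the first, restarting the total at each marker.
  for (i in rev(seq_along(cards))) {
    if (marker[i]) {
      running <- cards[i]   # the marker's own cards count for the rows before it
      sum_cards[i] <- 0     # but the marker row itself gets 0 cards
    } else {
      running <- running + cards[i]
      sum_cards[i] <- running
    }
  }
  goals - sum_cards
}

# Apply the helper to each group, with the group's rows taken in id order.
df1$points <- NA_real_
for (g in unique(df1$group)) {
  rows <- which(df1$group == g)
  rows <- rows[order(df1$id[rows])]
  df1$points[rows] <- points_for_group(df1$goals[rows], df1$cards[rows], marker[rows])
}
print(df1)
```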

Create New Column With Consecutive Count Of First Series Based on ID Column

I work in the healthcare industry and I'm using machine learning algorithms to develop a model to predict when patients will not show up for their appointments. I'm trying to create a new feature that will be the sum of each patient's most recent consecutive no-shows. I've looked around a lot on stackoverflow and other resources, but cannot find exactly what I'm looking for. As an example, if a patient has no-showed her past two most recent appointments, then every row of the new feature's column with her ID will be filled in with 2's. If she no-showed three times, but showed up for her most recent appointment, then the new column will be filled in with 0's.
I tried using plyr's ddply with cumsum, but it did not give me the results I'm looking for. I used:
ddply(a, .(ID), transform, ConsecutiveNoshows = cumsum(Noshow))
Here is an example data set ('1' signifies a no-show):
ID Noshow
1 1
1 1
1 0
1 0
1 1
2 0
2 1
2 1
3 1
3 0
3 1
3 1
3 1
This is my desired outcome:
ID Noshow ConsecutiveNoshows
1 1 2
1 1 2
1 0 2
1 0 2
1 1 2
2 0 0
2 1 0
2 1 0
3 1 1
3 0 1
3 1 1
3 1 1
3 1 1
I'll be very grateful for any help. Thank you.
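The idea is to count, for each ID, the number of Noshow rows before the first 0 appears. This assumes the rows for each ID are ordered with the most recent appointment first, as in the example.
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(ConsecutiveNoshows = sum(!(cumsum(Noshow == 0) >= 1)))  # TRUE only for rows before the first 0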
Which gives:
#Source: local data frame [13 x 3]
#Groups: ID [3]
#
# ID Noshow ConsecutiveNoshows
# <int> <int> <int>
#1 1 1 2
#2 1 1 2
#3 1 0 2
#4 1 0 2
#5 1 1 2
#6 2 0 0
#7 2 1 0
#8 2 1 0
#9 3 1 1
#10 3 0 1
#11 3 1 1
#12 3 1 1
#13 3 1 1
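The same count can be sketched in base R with ave(), which may be handy if dplyr is not available. As in the answer, this assumes each ID's rows are ordered with the most recent appointment first; leading_noshows() is a hypothetical helper name.

```r
# Example data from the question ('1' signifies a no-show).
a <- data.frame(
  ID     = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3),
  Noshow = c(1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1)
)

# Hypothetical helper: length of the run of 1s at the start of x.
# cumsum(x == 0) stays at 0 until the first 0 is seen.
leading_noshows <- function(x) sum(cumsum(x == 0) == 0)

# Compute the count per ID and repeat it on every row of that ID.
a$ConsecutiveNoshows <- ave(a$Noshow, a$ID,
                            FUN = function(x) rep(leading_noshows(x), length(x)))
print(a)
```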
