Equivalent to first./last. SAS processing in R

I did find a thread on this (R equivalent of .first or .last sas operator) but it did not fully answer my question.
I come from a SAS background and a common operation is, for example, when you have your patient ID with several different values, and you want to keep only the row with the minimum/maximum value for another variable for each ID. For example, I might have data with dates of a certain medical problem for each ID, and I want a dataset with just the first/last problem date for each patient.
Here's a simple example that gets me what I want, but I'd like to know if there's a better way to do it. I sort by ID and then by count, and I want to keep just the row with the largest count for each ID.
testdata<-data.frame(id=c(1,1,1,2,3,3,4,3,4,4,4),
count=c(5,9,2,6,16,12,0,11,8,8,7))
library(dplyr)
testdata2<-arrange(testdata,id,count)
testdata3<-cbind(testdata2,!duplicated(testdata2$id,fromLast=TRUE))
testdata4<-subset(testdata3,testdata3[,3]=='TRUE')[,-3]
> testdata4
id count
3 1 9
4 2 6
7 3 16
11 4 8
Is there a more compact way to do this?
Thank you.

do.call(rbind.data.frame,
c(by(testdata, testdata$id, function(d) d[c(1L,nrow(d)),]), stringsAsFactors=FALSE))
# id count
# 1.1 1 5
# 1.3 1 2
# 2.4 2 6
# 2.4.1 2 6
# 3.5 3 16
# 3.8 3 11
# 4.7 4 0
# 4.11 4 7
Breaking it down:
d[c(1L,nrow(d)),] returns the first and last row from the dataframe. (I'm assuming the frame has already been ordered appropriately.)
by(testdata, testdata$id, function breaks the larger frame into smaller frames by $id, and passes each smaller frame to the anonymous function. This returns a by-list of each return value.
do.call(rbind.data.frame, grabs the list and row-binds them back together into a single frame. Since the default is to use factors, I added stringsAsFactors=FALSE.
If you want to use dplyr, you can do:
library(dplyr)
group_by(testdata, id) %>%
slice(c(1,n())) %>%
ungroup()
# # A tibble: 8 × 2
# id count
# <dbl> <dbl>
# 1 1 5
# 2 1 2
# 3 2 6
# 4 2 6
# 5 3 16
# 6 3 11
# 7 4 0
# 8 4 7
where n() is a special function within dplyr pipes that returns the number of rows in that (optionally-grouped) frame.
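Since the question asks for the row with the largest count per id, here is a minimal sketch of that exact case. The flag columns first_id and last_id are hypothetical names chosen to mirror SAS first.id / last.id; slice_max() assumes dplyr 1.0.0 or later.

```r
library(dplyr)

testdata <- data.frame(id    = c(1,1,1,2,3,3,4,3,4,4,4),
                       count = c(5,9,2,6,16,12,0,11,8,8,7))

# Sort within id, then flag the first and last row of each group,
# similar to first.id / last.id after a PROC SORT in SAS
flagged <- testdata %>%
  arrange(id, count) %>%
  group_by(id) %>%
  mutate(first_id = row_number() == 1,
         last_id  = row_number() == n()) %>%
  ungroup()

# Keep the row with the largest count per id (the "last." row after sorting)
max_rows <- flagged %>%
  filter(last_id) %>%
  select(id, count)

# Shortcut without flags; with_ties = FALSE keeps exactly one row per id
max_rows2 <- testdata %>%
  group_by(id) %>%
  slice_max(count, n = 1, with_ties = FALSE) %>%
  ungroup()
```

Both results contain one row per id; the flag version is handy when the first and last rows are needed in the same pass.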

Related

Custom Data Set/Frame From List

Sample
A=data.frame("id"=c(1:10))
B=data.frame("id"=c(7:16))
C=data.frame("id"=c(-10:-1))
mylist=list(A=A,B=B,C=C)
What I want is a single data.frame that combines these three data.frames:
WANT = data.frame("id"=c(1:10,7:16,-10:-1),
dataID=c(rep("A",10),rep("B",10),rep("C",10)))
Suppose I have a list which contains a bunch of data frames (this is how I am given the data). I want to put them into one really big data frame like "WANT" that uses the names of the data sets in the list for dataID. I am able to do this by hand for just a few, for example A, B, C, but I have about a hundred and am wondering how to pull the data frames out of the list and make a tall file like the "WANT" example.
You can add the dataID to each data frame and then bind them together.
EDIT: after some clarification, here is a new approach:
listNAMES = letters[1:3]
library(tidyverse)
tibble(mydata = list(A, B, C),
dataID = listNAMES) %>%
unnest(cols = mydata) # name the list-column explicitly (tidyr >= 1.0.0)
# A tibble: 30 x 2
names id
<chr> <int>
1 1 1
2 1 2
3 1 3
4 1 4
5 1 5
6 1 6
7 1 7
8 1 8
9 1 9
10 1 10
# ... with 20 more rows
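If the list elements are named, a shorter route may be dplyr::bind_rows() with its .id argument, which copies each element's name into a new column. This is only a sketch; it assumes the list was built with list() and carries names (here A, B, C), which the question's data may or may not have.

```r
library(dplyr)

A <- data.frame(id = 1:10)
B <- data.frame(id = 7:16)
C <- data.frame(id = -10:-1)

# A named list of data frames (the names are an assumption about the input)
mylist <- list(A = A, B = B, C = C)

# Stack all frames; the list names end up in the dataID column
tall <- bind_rows(mylist, .id = "dataID")
```

For an unnamed list, names could be assigned first with names(mylist) <- ... before binding.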

Apply function that return data.frame/tibble on vector/data.frame column and bind results

I have a function that fetches some data from a database. It takes a single parameter and returns a data.frame. I would like to take an input vector of these parameters and pipe it to map or a similar function that takes each element and returns the db results. The results can differ in number of rows, but the columns are always the same. How do I go about this without looping and row-binding (for i in ..)?
I tried the following route:
library(dplyr)
library(purrr)
myfuncSingleRow<-function(nbr){
data.frame(a=nbr,b=nbr^2,c=nbr^3)}
myfuncMultipleRow<-function(nbr){
data.frame(a=rep(nbr,3),b=rep(nbr^2,3),c=rep(nbr^3,3))}
a<-data.frame(count=c(1,2,3))
myfuncSingleRow(2)
myfuncMultipleRow(2)
a %>% select(count) %>% map_dfr(.f=myfuncSingleRow) #output as expected
a %>% select(count) %>% map_dfr(.f=myfuncMultipleRow) #output not as expected
This does not work as intended either. With myfuncMultipleRow, I was expecting the first 3 rows to be equal, the next 3 to be equal, and the same for the final 3. Example using myfuncMultipleRow:
Getting
a b c
1 1 1 1
2 2 4 8
3 3 9 27
4 1 1 1
5 2 4 8
6 3 9 27
7 1 1 1
8 2 4 8
9 3 9 27
Wanting:
a b c
1 1 1 1
2 1 1 1
3 1 1 1
4 2 4 8
5 2 4 8
6 2 4 8
7 3 9 27
8 3 9 27
9 3 9 27
As usual, I am probably not using the functions correctly, but I am a bit stuck here and do not want to resort to the old loop and rbind, which would probably be a performance bottleneck. Any takers?
EDIT: As pointed out, the "each" argument of "rep" does solve this particular case, but it does not solve the main issue. If map iterated and called the function once per element, then the "each" and "times" arguments of "rep" would yield the same result. The function passed to map is not vectorized; it assumes a single parameter of length 1.
The solution needs to do the equivalent of:
res<-data.frame()
for(i in a$count) res<-rbind(res,myfuncMultipleRow(i)) # loop over the values; for(i in a) would loop over columns
So, after upgrading to purrr 0.3.0 (I was on an older version), map_depth pointed me in the right direction:
a %>% select(count)%>% map_depth(.depth=2,.f=myfuncMultipleRow) %>% map_dfr(.f=bind_rows)
Dropping map_depth() and bind_rows(), and nesting map_dfr() instead:
a %>% select(count)%>% map_dfr(~map_dfr(.,myfuncMultipleRow))
a %>% select(count)%>% map_dfr(.f=function(x) map_dfr(x,.f=myfuncMultipleRow))
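The nesting is needed because a data frame is a list of columns, so mapping over a %>% select(count) calls the function once with the whole column rather than once per value. A minimal sketch that maps over the vector itself, reusing the question's myfuncMultipleRow and combining the pieces with dplyr::bind_rows():

```r
library(dplyr)
library(purrr)

myfuncMultipleRow <- function(nbr) {
  data.frame(a = rep(nbr, 3), b = rep(nbr^2, 3), c = rep(nbr^3, 3))
}

a <- data.frame(count = c(1, 2, 3))

# Iterate over the values of the column, not over the columns of the frame
res <- a$count %>%
  map(myfuncMultipleRow) %>%
  bind_rows()
```

Each call now receives a single number, so the three-row blocks come back grouped per input value.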

what is this function doing? replication [closed]

rep_sample_n <- function(tbl, size, replace = FALSE, reps = 1)
{
  rep_tbl = replicate(reps, tbl[sample(1:nrow(tbl), size, replace = replace), ],
                      simplify = FALSE) %>%
    bind_rows() %>%
    mutate(replicate = rep(1:reps, each = size)) %>%
    select(replicate, everything()) %>%
    group_by(replicate)
  return(rep_tbl)
}
Hey, can anyone help me there? What is this function doing? Is the first line setting the variables of the function? And then what is this "replicate" doing? Thanks!
This function replicates your data. Let's say we have a dataset of 10 observations. To come up with additional datasets like your current one, you can replicate it by randomly sampling from it.
You can check out the Wikipedia page on statistical replication if you're more curious.
Let's take a simple dataframe:
df <- data.frame(x = 1:10, y = 1:10)
df
x y
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
9 9 9
10 10 10
If we want to take a random sample of this, we can use the function rep_sample_n, which takes two required arguments, tbl and size, and two optional arguments, replace = FALSE and reps = 1.
Here is an example of us just taking 4 randomly selected rows from our data.
rep_sample_n(df, 4)
# A tibble: 4 x 3
# Groups: replicate [1]
replicate x y
<int> <int> <int>
1 1 1 1
2 1 3 3
3 1 4 4
4 1 10 10
Now if we want to randomly sample 15 observations from a 10-observation dataset, it will throw an error. With replace = FALSE, each time a row is chosen it is removed from the pool for the next draw, so at most 10 rows can be drawn. In the example above, once the 1st observation had been chosen, only observations 2 through 10 were left for the next draw, and so on for the 3rd, 4th and 10th. If we set replace = TRUE, each draw chooses an observation from the full dataset.
Notice how in the following example the 5th observation was chosen twice. That wouldn't happen with replace = FALSE.
rep_sample_n(df, 4, replace = TRUE)
# A tibble: 4 x 3
# Groups: replicate [1]
replicate x y
<int> <int> <int>
1 1 5 5
2 1 3 3
3 1 2 2
4 1 5 5
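The same behaviour can be seen with sample() on its own, which is what rep_sample_n calls internally. A small sketch; the seed value is arbitrary and only there to make the run repeatable:

```r
set.seed(1)  # arbitrary seed, for repeatability only

# Without replacement, asking for more draws than there are rows fails;
# tryCatch() captures the error message instead of stopping the script
res_err <- tryCatch(sample(1:10, 15),
                    error = function(e) conditionMessage(e))

# With replacement, 15 draws from 10 values work and must contain repeats
with_repl <- sample(1:10, 15, replace = TRUE)

# Without replacement, a full-size draw is just a permutation of the rows
perm <- sample(1:10, 10)
```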
Lastly and most importantly, we have the reps argument, which is really the basis for this function. It allows you to randomly sample your dataset multiple times and then combine all those samples together.
Below, we have sampled our original dataset of 10 observations by selecting 4 of them, and replicated that 5 times. This gives 5 different samples of 4 observations each, combined into one 20-observation dataframe. Each of the 5 samples is tagged with a replicate number, so the replicate column shows which 4 observations go with which replicated dataframe.
rep_sample_n(df, 4, reps = 5)
# A tibble: 20 x 3
# Groups: replicate [5]
replicate x y
<int> <int> <int>
1 1 8 8
2 1 4 4
3 1 3 3
4 1 1 1
5 2 4 4
6 2 5 5
7 2 8 8
8 2 3 3
9 3 6 6
10 3 1 1
11 3 3 3
12 3 2 2
13 4 5 5
14 4 7 7
15 4 10 10
16 4 3 3
17 5 7 7
18 5 10 10
19 5 3 3
20 5 9 9
I hope this provided some clarity.
This function takes a data frame as input (and several input preferences). It takes a random sample of size rows from the table, with or without replacement as set by the replace input. It repeats that random sampling reps times.
Then, it binds all the samples together into a single data frame, adding a new column called "replicate" indicating which repetition of the sampling produced each row.
Finally, it "groups" the resulting table, preparing it for future group-wise operations with dplyr.
For general questions about specific functions, like "What is this "replicate" doing?", you should look at the function's help page: type ?replicate or help("replicate") to get there. It includes a description of the function and examples of how to use it. If you read the description, run the examples, and are still confused, feel free to come back with a specific question and example illustrating what you are confused by.
Similarly, for "Is the first line setting the variables of the function?", the arguments to function() are the inputs to the function. If you have basic questions about R like "How do functions work", have a look at An Introduction to R, or one of the other sources in the R Tag Wiki.

Combining two columns using shared values in first column

I am trying to adjust the formatting of a data set. My current set looks like this, in two columns. The first column is a "cluster" and the second column "name" contains values within each cluster:
Cluster Name
A 1
A 2
A 3
B 4
B 5
C 2
C 6
C 7
I'd like a single column in which all the values from column 2 are listed under the associated cluster from column 1:
Cluster A
1
2
3
Cluster B
4
5
Cluster C
2
6
7
I've been trying in R and Excel with no luck for the last few hours. Any ideas?
Using a trick with tidyr::nest :
library(dplyr)
library(tidyr)
df %>% mutate(Cluster = paste0("Cluster_",Cluster)) %>% nest(Name) %>% t %>% unlist %>% as.data.frame
# .
# 1 Cluster_A
# 2 1
# 3 2
# 4 3
# 5 Cluster_B
# 6 4
# 7 5
# 8 Cluster_C
# 9 2
# 10 6
# 11 7
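For comparison, a base R sketch of the same reshaping that avoids the list-column trick: split the names by cluster, then put a header in front of each group. The data frame below is rebuilt from the question's table, and the column name value is a hypothetical choice.

```r
df <- data.frame(Cluster = c("A", "A", "A", "B", "B", "C", "C", "C"),
                 Name    = c(1, 2, 3, 4, 5, 2, 6, 7))

# One vector of names per cluster, in cluster order
groups <- split(df$Name, df$Cluster)

# Prepend a "Cluster X" header to each group; c() coerces the numbers to text
stacked <- unlist(lapply(names(groups), function(g) {
  c(paste("Cluster", g), groups[[g]])
}))

out <- data.frame(value = stacked)
```

The result is a character column, since headers and values share it.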

Take the subsets of a data.frame with the same feature and select a single row from each subset

Suppose I have a matrix in R as follows:
ID Value
1 10
2 5
2 8
3 15
4 7
4 9
...
What I need is a random sample where every element is represented once and only once.
That means that ID 1 will be chosen, one of the two rows with ID 2, ID 3 will be chosen, one of the two rows with ID 4, etc...
There can be more than two duplicates.
I'm trying to figure out the most R-esque way to do this without subsetting and sampling the subsets?
Thanks!
tapply across the rownames and grab a sample of 1 in each ID group:
dat[tapply(rownames(dat),dat$ID,FUN=sample,1),]
# ID Value
#1 1 10
#3 2 8
#4 3 15
#6 4 9
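If your data is truly a matrix and not a data.frame, you can work around this too. A matrix has no $ accessor and, without row names, cannot be indexed by character, so select the ID column by name and convert the sampled indices back to numbers:
dat[as.numeric(tapply(as.character(seq_len(nrow(dat))),dat[,"ID"],FUN=sample,1)),]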
Don't be tempted to remove the as.character, as sample will give unintended results when there is only one value passed to it. E.g.
replicate(10, sample(4,1) )
#[1] 1 1 4 2 1 2 2 2 3 4
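To make the pitfall concrete: when sample() receives a single number n of at least 1, it samples from 1:n, whereas a single character value is returned as-is. A short sketch (the seed is arbitrary):

```r
set.seed(42)  # arbitrary seed, for repeatability only

# A lone numeric 4 is treated as "sample from 1:4"
draws_numeric <- replicate(20, sample(4, 1))

# A lone character "4" is treated as a one-element vector, so it is always picked
draws_char <- replicate(20, sample("4", 1))
```

This is why the row indices are passed as character: a group with a single row then always returns that row.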
You can do that with dplyr like so:
library(dplyr)
df %>% group_by(ID) %>% sample_n(1)
Another idea is to reorder the rows randomly and then remove duplicates in that order.
df <- read.table(text="ID Value
1 10
2 5
2 8
3 15
4 7
4 9", header=TRUE)
df2 <- df[sample(nrow(df)), ]
df2[!duplicated(df2$ID), ]
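In dplyr 1.0.0 and later, sample_n() is superseded by slice_sample(), so the grouped version could also be sketched as follows, using the same example data:

```r
library(dplyr)

df <- data.frame(ID    = c(1, 2, 2, 3, 4, 4),
                 Value = c(10, 5, 8, 15, 7, 9))

# Draw one random row within each ID group
picked <- df %>%
  group_by(ID) %>%
  slice_sample(n = 1) %>%
  ungroup()
```

Every ID appears exactly once in the result; which Value is kept for IDs 2 and 4 varies from run to run.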
