How to take the latest entry from a data.frame and store it in new dataframe - r

I have a data.frame that is full of data, and where the data for parameters repeat itself, but I want to use the latest information that is stored.
Thankfully I have an index in the files that tells me which duplicate is he current row in the data.frame.
Example for my problem is the following:
A B C D
1 1 2 3 1
2 1 2 2 2
3 3 4 2 2
4 3 4 1 3
5 2 3 2 1
6 2 1 1 1
A small explanation ... A and B columns can be considered key, and the C column represents value for that key ... the column D represents the index of the measurement .. but it does not have to start from 1 ... it can start from 3,6, ... any integer. This is happening because the data is incomplete
So at the end the output should be like:
A B C D
2 1 2 2 2
4 3 4 1 3
5 2 3 2 1
6 2 1 1 1
Can you please help me program a make an R program, or point me to the right direction, that is going to save all the keys with the their latest index ...
I have tried using for loops but it didn't work ....
Sincerely thanks
If you have any question feel free to ask

Using duplicated and subsetting in base R, you can do
dat[!duplicated(dat[,1:2], fromLast=TRUE),]
A B C D
2 1 2 2 2
4 3 4 1 3
5 2 3 2 1
6 2 1 1 1
duplicated returns a logical vector indicating whether a row (here the first two columns) has been duplicated. The fromLast argument initiates this process from the bottom of the data.frame.

You can use dplyr verbs to group your data group_by, then sort arrange. The do verb allows you to operate at the group-level. tail grabs the last row of each group...
library(dplyr)
df1 <- df %>%
group_by(A,B) %>%
arrange(D) %>%
do(tail(.,1)) %>%
ungroup()
Thanks to Frank's suggestion, you could also use slice
df1 <- df %>%
group_by(A,B) %>%
arrange(D) %>%
slice(n()) %>%
ungroup()

Related

How to rearrange columns of a data frame based on values in a row

This is an R programming question. I would like to rearrange the order of columns in a data frame based on the values in one of the rows. Here is an example data frame:
df <- data.frame(A=c(1,2,3,4),B=c(3,2,4,1),C=c(2,1,4,3),
D=c(4,2,3,1),E=c(4,3,2,1))
Suppose I want to rearrange the columns in df based on the values in row 4, ascending from 1 to 4, with ties having the same rank. So the desired data frame could be:
df <- data.frame(B=c(3,2,4,1),D=c(4,2,3,1),E=c(4,3,2,1),
C=c(2,1,4,3),A=c(1,2,3,4))
although I am indifferent about the order of first three columns, all of which have the value 1 in column 4.
I could do this with a for loop, but I am looking for a simpler approach. Thank you.
We can use select - subset the row (4), unlist, order the values and pass it on select
library(dplyr)
df %>%
select(order(unlist(.[4, ])))
-output
B D E C A
1 3 4 4 2 1
2 2 2 3 1 2
3 4 3 2 4 3
4 1 1 1 3 4
Or may use
df %>%
select({.} %>%
slice_tail(n = 1) %>%
flatten_dbl %>%
order)
B D E C A
1 3 4 4 2 1
2 2 2 3 1 2
3 4 3 2 4 3
4 1 1 1 3 4
or in base R
df[order(unlist(tail(df, 1))),]

Apply a vector of filters based on a string (or vector of strings) in dplyr

R and the tidyverse have some extremely powerful but equally mysterious methods for turning strings into actionable expressions. I feel like one needs to be an expert to really understand how to use them.
NOTE: this question differs from this one in that I specifically ask about a vector (that is multiple) filter conditions. I demonstrate a solution for single filters that fails when I try multiple ways of extending it to multiple filters.
I want to do something along the lines of:
df = data.frame(A=1:10, B=1:10)
df %>% filter(A<3, B<5)
But where the filters are contained in either a string such as "A<3, B<5" or a character vector such as c("A<3", "B<5").
I can do
df %>% filter(eval(str2expression("A<3")))
# A B
# 1 1 1
# 2 2 2
But this does not work:
df %>% filter(eval(str2expression("A<3, B<5")))
Error in str2expression("A<3, B<5") : <text>:1:4: unexpected ','
1: A<3,
^
These don't work either:
> df %>% filter(!!c(str2expression("A<3"), str2expression("B<5")))
Error: Argument 2 filter condition does not evaluate to a logical vector
> df %>% filter(!!!c(str2expression("A<3"), str2expression("B<5")))
Error: Can't splice an object of type `expression` because it is not a vector
Run `rlang::last_error()` to see where the error occurred.
Evaluating a vector of expressions from str2expression for some reason only applies the last expression:
> df %>% filter(eval(c(str2expression("A<3"), str2expression("B<5"))))
# A B
# 1 1 1
# 2 2 2
# 3 3 3
# 4 4 4
Using a vector of evaluated expressions fails altogether:
> df %>% filter(!!!c(eval(str2expression("A<3")), eval(str2expression("B<5"))))
Error in eval(str2expression("A<3")) : object 'A' not found
I can do:
> df %>% filter(!!!c(expr(A<3), expr(B<5)))
# A B
# 1 1 1
# 2 2 2
and this tells me that expr(A<3) is NOT the same thing as str2expression("A<3")
But that isn't starting from strings.
What to do?
You could use parse_exprs from rlang
library(dplyr)
expr <- c("A<3", "B<5")
filter(df, !!!rlang::parse_exprs(expr))
# A B
#1 1 1
#2 2 2
Or you could combine the two expressions and then use it in eval
filter(df, eval(parse(text = paste0(expr, collapse = "&"))))
# A B
#1 1 1
#2 2 2
Learning from #Ronak Shah's answer, apparently, in dplyr I can use multiple conditions with a single & in filter instead of a comma. I don't understand this at all---it is not the same thing as an and logical:
> df %>% filter(A<3 & B<5)
A B
1 1 1
2 2 2
> df %>% filter(A<3 && B<5)
A B
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
9 9 9
10 10 10
Nevertheless, the following does work:
> df %>% filter(eval(str2expression("A<3 & B<5")))
A B
1 1 1
2 2 2
> df %>% filter(eval(str2expression("A<6 & B<5")))
A B
1 1 1
2 2 2
3 3 3
4 4 4

Double left join in dplyr to recover values

I've checked this issue but couldn't find a matching entry.
Say you have 2 DFs:
df1:mode df2:sex
1 1
2 2
3
And a DF3 where most of the combinations are not present, e.g.
mode | sex | cases
1 1 9
1 1 2
2 2 7
3 1 2
1 2 5
and you want to summarise it with dplyr obtaining all combinations (with not existent ones=0):
mode | sex | cases
1 1 11
1 2 5
2 1 0
2 2 7
3 1 2
3 2 0
If you do a single left_join (left_join(df1,df3) you recover the modes not in df3, but 'Sex' appears as 'NA', and the same if you do left_join(df2,df3).
So how can you do both left join to recover all absent combinations, with cases=0? dplyr preferred, but sqldf an option.
Thanks in advance, p.
The development version of tidyr, tidyr_0.2.0.9000, has a new function called complete that I saw the other day that seems like it was made for just this sort of situation.
The help page says:
This is a wrapper around expand(), left_join() and replace_na that's
useful for completing missing combinations of data. It turns
implicitly missing values into explicitly missing values.
To add the missing combinations of df3 and fill with 0 values instead, you would do:
library(tidyr)
library(dplyr)
df3 %>% complete(mode, sex, fill = list(cases = 0))
mode sex cases
1 1 1 9
2 1 1 2
3 1 2 5
4 2 1 0
5 2 2 7
6 3 1 2
7 3 2 0
You would still need to group_by and summarise to get the final output you want.
df3 %>% complete(mode, sex, fill = list(cases = 0)) %>%
group_by(mode, sex) %>%
summarise(cases = sum(cases))
Source: local data frame [6 x 3]
Groups: mode
mode sex cases
1 1 1 11
2 1 2 5
3 2 1 0
4 2 2 7
5 3 1 2
6 3 2 0
First here's you data in a more friendly, reproducible format
df1 <- data.frame(mode=1:3)
df2 <- data.frame(sex=1:2)
df3 <- data.frame(mode=c(1,1,2,3,1), sex=c(1,1,2,1,2), cases=c(9,2,7,2,5))
I don't see an option for a full outer join in dplyr, so I'm going to use base R here to merge df1 and df2 to get all mode/sex combinations. Then i left join that to the data and replace NA values with zero.
mm <- merge(df1,df2) %>% left_join(df3)
mm$cases[is.na(mm$cases)] <- 0
mm %>% group_by(mode,sex) %>% summarize(cases=sum(cases))
which gives
mode sex cases
1 1 1 11
2 1 2 5
3 2 1 0
4 2 2 7
5 3 1 2
6 3 2 0

Take the subsets of a data.frame with the same feature and select a single row from each subset

Suppose I have a matrix in R as follows:
ID Value
1 10
2 5
2 8
3 15
4 7
4 9
...
What I need is a random sample where every element is represented once and only once.
That means that ID 1 will be chosen, one of the two rows with ID 2, ID 3 will be chosen, one of the two rows with ID 4, etc...
There can be more than two duplicates.
I'm trying to figure out the most R-esque way to do this without subsetting and sampling the subsets?
Thanks!
tapply across the rownames and grab a sample of 1 in each ID group:
dat[tapply(rownames(dat),dat$ID,FUN=sample,1),]
# ID Value
#1 1 10
#3 2 8
#4 3 15
#6 4 9
If your data is truly a matrix and not a data.frame, you can work around this too, with:
dat[tapply(as.character(seq(nrow(dat))),dat$ID,FUN=sample,1),]
Don't be tempted to remove the as.character, as sample will give unintended results when there is only one value passed to it. E.g.
replicate(10, sample(4,1) )
#[1] 1 1 4 2 1 2 2 2 3 4
You can do that with dplyr like so:
library(dplyr)
df %>% group_by(ID) %>% sample_n(1)
The idea is reorder the rows randomly and then remove duplicates in that order.
df <- read.table(text="ID Value
1 10
2 5
2 8
3 15
4 7
4 9", header=TRUE)
df2 <- df[sample(nrow(df)), ]
df2[!duplicated(df2$ID), ]

Using R: Make a new column that counts the number of times 'n' conditions from 'n' other columns occur

I have columns 1 and 2 (ID and value). Next I would like a count column that lists the # of times that the same value occurs per id. If it occurs more than once, it will obviously repeat the value. There are other variables in this data set, but the new count variable needs to be conditional only on 2 of them. I have scoured this blog, but I can't find a way to make the new variable conditional on more than one variable.
ID Value Count
1 a 2
1 a 2
1 b 1
2 a 2
2 a 2
3 a 1
3 b 3
3 b 3
3 b 3
Thank you in advance!
You can use ave:
df <- within(df, Count <- ave(ID, list(ID, Value), FUN=length))
You can use ddply from plyr package:
library(plyr)
df1<-ddply(df,.(ID,Value), transform, count1=length(ID))
>df1
ID Value Count count1
1 1 a 2 2
2 1 a 2 2
3 1 b 1 1
4 2 a 2 2
5 2 a 2 2
6 3 a 1 1
7 3 b 3 3
8 3 b 3 3
9 3 b 3 3
> identical(df1$Count,df1$count1)
[1] TRUE
Update: As suggested by #Arun, you can replace transform with mutate if you are working with large data.frame
Of course, data.table also has a solution!
data[, Count := .N, by = list(ID, Value)
The built-in constant, ".N", is a length 1 vector reporting the number of observations in each group.
The downside to this approach would be joining this result with your initial data.frame (assuming you wish to retain the original dimensions).

Resources