I have a dataset which contains duplicates of the ident variable.
I need to select only one observation per ident, and it needs to be the newest one, i.e. the resulting data should contain, for each ident, the observation where 'year' is highest in the initial data set.
I believe a general case would look like this:
ident value year
A     1     19X1
A     2     19X2
B     4     19X2
B     2     19X1
B     1     19X3
C     1     19X4
C     2     19X1
Only, I have several hundred thousand observations.
Order of the resulting data set is not important to me.
Using library dplyr you can do something like this:
library(dplyr)
df %>% group_by(ident) %>% arrange(desc(year)) %>% slice(1)
Output will be as follows:
Source: local data frame [3 x 3]
Groups: ident [3]

  ident value  year
  (chr) (int) (chr)
1     A     2  19X2
2     B     1  19X3
3     C     1  19X4
This assumes year is in a format where sorting in descending order makes it go from latest to oldest.
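If you are on dplyr 1.0.0 or newer (an assumption about your setup), slice_max() expresses the same idea more directly:

library(dplyr)

# keep, per ident, the row with the largest year;
# with_ties = FALSE guarantees exactly one row even when years tie
df %>%
  group_by(ident) %>%
  slice_max(year, n = 1, with_ties = FALSE) %>%
  ungroup()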
Try

# split df into one chunk per ident, keep the row with the largest
# year in each chunk, then stack the chunks back together
df <- do.call(rbind, lapply(split(df, df$ident),
                            function(x) x[which.max(x$year), ]))
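One caveat: which.max() only handles numeric (or logical) vectors, so with a character year like 19X1 it will throw an error. A sketch of a variant that works for character years too, by sorting instead:

df <- do.call(rbind, lapply(split(df, df$ident), function(x) {
  # order() accepts character as well as numeric vectors
  x[order(x$year, decreasing = TRUE)[1], ]
}))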
I want a way to fill in values for 'checks' which have no start and end times. At first I thought to use bilinear interpolation for this task, but then I realized that's overly complicated and I really just need something much simpler.
My data looks something like this:
df <- data.frame("ID" = c(A,A,A,A,A,B,B,B,B,B),
"Check"= c(1:5),
"Start_time" = c("start_a1","start_a2","start_a3","start_a4","start_a5","startb1","startb2","startb3",NA,"startb5"),
"end_time" = c("end_a1","end_a2","end_a3","end_a4","end_a5","end_b1","end_b2",NA,NA,"endb5")
)
So what I am ideally looking for: any check which has a missing start time and end time should pick up the data from the next check's start time, not the previous one's.
I am trying the following code block, but it gives me an issue:

df$end_time[df$Check == 3 & is.na(df$end_time)] <- df$start_time[df$Check == 5]
# this gives a length issue
Any advice would be helpful here; my dataset contains approx. 5k rows, and each ID has a number of checks with start and end times.
The tidyr package has a function fill() which does exactly this.
library(tidyr)
library(dplyr)  # for group_by() and the pipe

df %>%
  group_by(ID) %>%
  fill(c(Start_time, end_time), .direction = 'up')
# A tibble: 10 × 4
# Groups: ID [2]
ID Check Start_time end_time
<chr> <int> <chr> <chr>
1 A 1 start_a1 end_a1
2 A 2 start_a2 end_a2
3 A 3 start_a3 end_a3
4 A 4 start_a4 end_a4
5 A 5 start_a5 end_a5
6 B 1 startb1 end_b1
7 B 2 startb2 end_b2
8 B 3 startb3 endb5
9 B 4 startb5 endb5
10 B 5 startb5 endb5
The .direction="up" parameter means it takes the next non-missing value to fill in blanks. To use the previous value you would use .direction="down". And using .direction="updown" would use the next value unless there are no more non-missing values in that group, then it would take the previous non-missing value. (Useful in cases where the missing value is the last row of the group.)
I am trying to find all matching values in a specific column, in a list of data.frames. However, I keep getting a returned value of character(0).
I have tried the following:
Simple subsetting (very time consuming), e.g. dat[[i]][[i]]
lapply() with Reduce() and intersect() (as seen here)
LocA <- data.frame(obs.date = c("2018-01-10","2018-01-14","2018-01-20"),
                   obs.count = c(2,0,1))
LocB <- data.frame(obs.date = c("2018-01-09","2018-01-14","2018-01-20"),
                   obs.count = c(0,3,5))
LocC <- data.frame(obs.date = c("2018-01-12","2018-01-14","2018-01-19"),
                   obs.count = c(2,0,1))
LocD <- data.frame(obs.date = c("2018-01-11","2018-01-16","2018-01-21"),
                   obs.count = c(2,0,1))
dfList<-list(LocA,LocB,LocC,LocD)
##List of all dates
lapply(dfList,'[[',1)
[1]"2018-01-10" "2018-01-14" "2018-01-20" "2018-01-09"...
Attempts (failure)
> Reduce(intersect, lapply(dfList, '[[', 1))
character(0)
I expect the output to identify the data.frames that share a common date.
*Extra smiles if someone knows how to identify the shared dates and combine them into a single data frame where Col1 = data frame name, Col2 = obs.date, Col3 = obs.count.
You can first merge all the data frames so you only have one:
a <- Reduce(function(x, y) merge(x, y, all = TRUE), dfList)
Or you can simply stack them:
a <- rbind(LocA, LocB, LocC, LocD)
Afterwards, you can extract all the duplicates:
b <- a[duplicated(a$obs.date), ]
Or, if you instead want one row per date (keeping the first occurrence of each):
c <- a[!duplicated(a$obs.date), ]
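Note that duplicated() flags only the second and later occurrences of each date. If you want every row whose date appears more than once, including the first occurrence, a common idiom is:

dup_dates <- a$obs.date[duplicated(a$obs.date)]
b <- a[a$obs.date %in% dup_dates, ]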
If by "intersect" you mean doing an "inner join" or "merging" with a specific column as key, then -- you want to use dplyr::inner_join or merge.
First, between two data.frames:
library(dplyr)
inner_join(LocA, LocB, by='obs.date')
# 2 rows
inner_join(LocC, LocD, by='obs.date')
# zero rows
So chaining inner joins across all four data frames gets you nowhere: the overall intersection is empty.
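You can verify that by folding inner_join over the whole list; a quick check, assuming the dfList from the question:

Reduce(function(x, y) inner_join(x, y, by = 'obs.date'), dfList)
# zero rows: no date appears in all four data frames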
Stack, then count
We'll combine the data first, then count the occurrences. Notice the use of the .id argument to track which data frame each row came from.
library(dplyr)
bind_rows(dfList, .id = 'id') %>%
add_count(obs.date) %>%
filter(n > 1)
# A tibble: 5 x 4
id obs.date obs.count n
<chr> <chr> <dbl> <int>
1 1 2018-01-14 0 3
2 1 2018-01-20 1 2
3 2 2018-01-14 3 3
4 2 2018-01-20 5 2
5 3 2018-01-14 0 3
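For the bonus (data frame name, obs.date, obs.count in one result), name the list first and .id will pick up those names; the Loc* names here are my assumption of what you'd want:

named_list <- list(LocA = LocA, LocB = LocB, LocC = LocC, LocD = LocD)

bind_rows(named_list, .id = 'source') %>%
  add_count(obs.date) %>%
  filter(n > 1) %>%
  select(source, obs.date, obs.count)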
I am a beginner trying to work with R, but constantly hitting walls.
I have a giant dataset (thousands of entries) that looks like this: there is a column for Latitude, Longitude and PlotCode.
I have more than one plot per Latitude/Longitude pair. I would like to create a new column with some sort of ID shared by all plots that have the same Latitude and Longitude.
Any suggestions? Thank you.
Welcome to SO! It's better to include data, desired output, and your attempts in your question. That said, you can find a solution with the dplyr package.
After installing it, you could do this:
library(dplyr)

# some data like yours
data_latlon <- data.frame(Lat = c(1,1,1,2,2,2,3,3,3),
                          Long = c(45,45,45,12,12,12,23,23,23),
                          PlotCode = c('a','a','a','b','b','b','c','c','c'))

data_latlon %>%             # the pipe operator for dplyr chains
  group_by(Lat, Long) %>%   # group by unique Lat/Long combinations
  summarise(PlotCodeGrouped = paste(PlotCode, collapse = ''))
# summarise() adds a new column that collapses all plot codes per group;
# the collapse argument sets the separator (here, nothing)
# A tibble: 3 x 3
# Groups: Lat [?]
Lat Long PlotCodeGrouped
<dbl> <dbl> <chr>
1 1 45 aaa
2 2 12 bbb
3 3 23 ccc
EDIT
The code is even simpler if all you want is the data grouped by Lat and Long:

data_latlon %>%                    # the pipe operator for dplyr chains
  group_by(Lat, Long, add = TRUE)  # group by unique Lat/Long; add = TRUE
                                   # preserves any pre-existing grouping
# Groups: Lat, Long [3]
Lat Long PlotCode
<dbl> <dbl> <fct>
1 1. 45. a
2 1. 45. a
3 1. 45. a
4 2. 12. b
5 2. 12. b
6 2. 12. b
7 3. 23. c
8 3. 23. c
9 3. 23. c
I think I found the solution: what I needed is something called a cluster ID.

dataframe <- transform(dataframe,
                       Cluster_ID = as.numeric(interaction(Lat, Long, drop = TRUE)))
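For reference, a dplyr equivalent of the same idea, assuming dplyr 1.0.0 or newer (cur_group_id() was introduced there):

library(dplyr)

data_latlon %>%
  group_by(Lat, Long) %>%                  # one group per Lat/Long pair
  mutate(Cluster_ID = cur_group_id()) %>%  # numeric ID of the current group
  ungroup()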
By grouping, do you mean sorting/arranging the rows by PlotCode?
If so, you can use the sort() function, or the arrange() function from the tidyverse/dplyr package.
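For example, with the data_latlon data from above:

library(dplyr)
data_latlon %>% arrange(PlotCode)  # sort rows by PlotCode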
There are some similar questions (like here or here), but none with quite the answer I am looking for.
The question:
How to use select() only on columns of a certain type?
The select helper functions used in select_if() or select_at() may only reference the column name or index. In this particular case I want to select columns of a certain type (numeric) and then select a subset of them based on their column sum while not losing the columns of other types (character).
What I would like to do:
tibbly = tibble(x = c(1,2,3,4),
y = c("a", "b","c","d"),
z = c(9,8,7,6))
# A tibble: 4 x 3
x y z
<dbl> <chr> <dbl>
1 1 a 9
2 2 b 8
3 3 c 7
4 4 d 6
tibbly %>%
select_at(is.numeric, colSums(.) > 12)
Error: `.vars` must be a character/numeric vector or a `vars()` object, not primitive
This doesn't work because select_at() doesn't recognize is.numeric as a proper function to select columns.
If I do something like:
tibbly %>%
select_if(is.numeric) %>%
select_if(colSums(.) > 12)
I manage to select only the columns with a sum > 12, but I also lose the character columns. I would like to avoid having to reattach the lost columns afterwards.
Is there a better way to select columns in a dplyr fashion, based on some properties other than their names / index?
Thank you!
Perhaps an option could be to create your own custom function, and use that as the predicate in the select_if function. Something like this:
check_cond <- function(x) is.character(x) | (is.numeric(x) && sum(x) > 12)
tibbly %>%
select_if(check_cond)
# A tibble: 4 x 2
  y         z
  <chr> <dbl>
1 a 9
2 b 8
3 c 7
4 d 6
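On dplyr 1.0.0 or newer (an assumption about your setup), the same predicate can also be written inline with where(), keeping everything in a single select():

tibbly %>%
  select(where(is.character) | where(~ is.numeric(.x) && sum(.x) > 12))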
I need to pull records from a first data set (called df1 here) based on a combination of specific dates, ID#s, event start time, and event end time that match with a second data set (df2). Everything works fine when there is just 1 date, ID, and event start and end time, but some of the matching records between the data sets contain multiple IDs, dates, or times, and I can't get the records from df1 to subset properly in those cases. I ultimately want to put this in a FOR loop or independent function since I have a rather large dataset. Here's what I've got so far:
I started just by matching the dates between the two data sets as follows:
match_dates <- as.character(intersect(df1$Date, df2$Date))
Then I selected the records in df2 based on the first matching date, also keeping the other columns so I have the other ID and time information I need:
records <- df2[which(df2$Date == match_dates[1]), ]
The date, ID, start, and end time from records are then:
[1] "01-04-2009" "599091" "12:00" "17:21"
Finally I subset df1 for before and after the event based on the date, ID, and times in records and combined them into a new data frame called final to get at the data contained in df1 that I ultimately need.
before <- subset(df1, NUM==records$ID & Date==records$Date & Time<records$Start)
after <- subset(df1, NUM==records$ID & Date==records$Date & Time>records$End)
final <- rbind(before, after)
Here's the real problem - some of the matching dates have more than 1 corresponding row in df2, and return multiple IDs or times. Here is what an example of multiple records looks like:
records <- df2[which(df2$Date == match_dates[25]), ]
> records$ID
[1] 507646 680845 680845
> records$Date
[1] "04-02-2009" "04-02-2009" "04-02-2009"
> records$Start
[1] "09:43" "05:37" "11:59"
> records$End
[1] "05:19" "11:29" "16:47"
When I try to subset df1 based on this I get warnings:
before <- subset(df1, NUM==records$ID & Date==records$Date & Time<records$Start)
Warning messages:
1: In NUM == records$ID :
longer object length is not a multiple of shorter object length
2: In Date == records$Date :
longer object length is not a multiple of shorter object length
3: In Time < records$Start :
longer object length is not a multiple of shorter object length
Trying to do this manually for each ID-date-time combination would be way too tedious. I have 9 years' worth of data, all with multiple matching dates between the data sets in any given year, so ideally I would like to set this up as a FOR loop, or a function with a FOR loop in it, but I can't get past this. Thanks in advance for any tips!
If you're asking what I think you are, the filter() function from the dplyr package combined with the %in% operator (which is built on match()) does what you're looking for.
> library(dplyr)
> df1 <- data.frame(A = c(rep(1,4), rep(2,4), rep(3,4)), B = rep(1:4, 3))
> df1
A B
1 1 1
2 1 2
3 1 3
4 1 4
5 2 1
6 2 2
7 2 3
8 2 4
9 3 1
10 3 2
11 3 3
12 3 4
> df2 <- data.frame(A = c(1,2), B = c(3,4))
> df2
A B
1 1 3
2 2 4
> filter(df1, A %in% df2$A, B %in% df2$B)
A B
1 1 3
2 1 4
3 2 3
4 2 4
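One caveat: the two %in% conditions are checked independently, which is why rows like (1, 4) and (2, 3) appear above even though those exact pairs are not in df2. If you need the A-B combination itself to match, a semi join keeps only the rows of df1 with an exact counterpart in df2; a minimal sketch with dplyr:

semi_join(df1, df2, by = c("A", "B"))
#   A B
# 1 1 3
# 2 2 4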