Select only rows that have the same ID in R

ID Julian Month Year Location Distance
2 40749 July 2011 8300 39625
2 41425 May 2013 Hatchery 31325
3 40749 July 2011 6950 38625
3 41057 May 2012 Hatchery 31325
6 40735 July 2011 8300 39650
12 40743 July 2011 11025 42350
Above is the head() of the data frame I'm working with. It contains over 7,000 rows and 3,000 unique ID values. I want to delete all rows whose ID value occurs only once. Is this possible? Maybe the solution is to keep only the rows where the ID is repeated?

If d is your data frame, I'd use duplicated to find the rows with recurring IDs. Combining fromLast = FALSE and fromLast = TRUE catches every duplicated row, including the first occurrence of each repeated ID.
d[(duplicated(d$ID, fromLast = FALSE) | duplicated(d$ID, fromLast = TRUE)),]
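With the example data above, this keeps the four rows for IDs 2 and 3 (each ID appears twice) and drops the single rows for IDs 6 and 12.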
This double-duplicated method has a variety of uses:
Finding ALL duplicate rows, including "elements with smaller subscripts"
How to get a subset of a dataframe which only has elements which appear in the set more than once in R
How to identify "similar" rows in R?

Here is how I would do it:
new.dataframe <- NULL
ids <- unique(dataframe$ID)
for (id in ids) {
  temp <- dataframe[dataframe$ID == id, ]
  if (nrow(temp) > 1) {
    new.dataframe <- rbind(new.dataframe, temp)
  }
}
This will remove all the IDs that have only one row.
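If the loop proves slow on 7,000 rows, a vectorized alternative is to count the IDs first; this is a sketch using base R's table(), equivalent in effect to the duplicated() approach above:
counts <- table(dataframe$ID)                        # rows per ID
keep <- names(counts)[counts > 1]                    # IDs appearing more than once
new.dataframe <- dataframe[dataframe$ID %in% keep, ]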


How to extract data from the dataset with a certain condition and how to combine data from two columns into one in R

My dataset example for one person was posted as a screenshot (not reproduced here).
I have made this table:
deathYear diagYear fcediags pid
     2013       NA      I21   1
     2011       NA      I63   2
     2033       NA      I21   4
     2029       NA      I25   5
     2020       NA      I21  18
     2012       NA      I63  19
The problem is with the diagYear data above: the results are all NA.
And also:
The table T2 should only show the rows for persons that have at least one of these Diags in the document data: "I20", "I21", "I22", "I25" or "I63" (regardless of whether document$fces$alive is TRUE or FALSE). But, only for the persons meeting this condition, it should also show the year of death (extracted from the date as in the deathYear code above), even if the person died from some other diagnosis.
I also need to make one column Year instead of these two (deathYear and diagYear), containing the year extracted from document$FCEs$date (please see the picture), according to these conditions:
1. If document$fces$alive is TRUE, the Year column should contain the year only if there is at least one Diag1 in the person's document set that is "I20", "I21", "I22", "I25" or "I63".
2. If document$fces$alive is FALSE (but only for the persons from condition 1), the Year column should contain the deathYear from the code above, regardless of the Diag1 value for the case of death (in this case Diag1 does not have to be "I20", "I21", "I22", "I25" or "I63").
I have tried this code:
getDiags <- function(x) {
  document <- fromJSON(x)
  fcediags <- document$FCEs$Diag1
  fcedage <- document$FCEs$pAge
  fcealive <- document$FCEs$alive
  deathYear <- 2030
  if (length(strsplit(document$FCEs[document$FCEs$alive == FALSE, ]$date, "/")) > 0)
    deathYear <- as.numeric(strsplit(document$FCEs[document$FCEs$alive == FALSE, ]$date, "/")[[1]][1])
  diagYear <- 0
  v1 <- c("I20", "I21", "I22", "I25", "I63")
  for (i in 1:length(document$FCEs$Diag1)) {
    if (document$FCEs$Diag1[i] %in% v1) {
      diagYear <- as.numeric(strsplit(document$FCEs[document$FCEs$Diag1[i], ]$date, "/")[[1]][1])
    }
  } # this block of code doesn't work, it shows NA in the table
  return(data.frame(fcedage, fcediags, fcealive, sex, ldl, pid = document$ID, deathYear, diagYear))
}
for (i in 1:length(fces$fcediags)) {
  T2 <- subset(fces, fcediags %in% c("I20", "I21", "I22", "I25", "I63"),
               select = c(deathYear, diagYear, fcediags, pid))
}
# I've obviously made this table wrong: it shows rows only for the "I20", "I21", ..., "I63" Diag1s, but for the persons with these Diag1s it should also show the year of death (document$fces$alive == FALSE) regardless of the Diag1 value for the case of death.
(pid is the person's ID.) These attempts are not good enough: the diagYear column shouldn't be NA, and the two year columns should be merged into one.
Can someone please help me? Thank you in advance!
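For what it's worth, the NA values most likely come from the row indexing inside the loop: document$FCEs[document$FCEs$Diag1[i], ] selects rows by the diagnosis string instead of by the position i. A minimal sketch of a corrected loop plus the merged Year column, assuming document$FCEs is a data frame with Diag1, alive and date columns as in the question:
for (i in seq_along(document$FCEs$Diag1)) {
  if (document$FCEs$Diag1[i] %in% v1) {
    # index the date by loop position, not by the diagnosis string
    diagYear <- as.numeric(strsplit(document$FCEs$date[i], "/")[[1]][1])
  }
}
# one Year column: diagYear while alive (condition 1), deathYear otherwise (condition 2)
Year <- ifelse(fcealive, diagYear, deathYear)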

Efficient way to match and sum variables of two data frames based on two criteria [duplicate]

This question already has answers here:
How to sum a variable by group
I have a data frame df1 of import data for 397 different industries over 17 years and several different exporting countries/regions.
> head(df1)
year importer exporter imports sic87dd
2300 1991 USA CAN 9.404848e+05 2011
2301 1991 USA CAN 2.259720e+04 2015
2302 1991 USA CAN 5.459608e+02 2021
2303 1991 USA CAN 1.173237e+04 2022
2304 1991 USA CAN 2.483033e+04 2023
2305 1991 USA CAN 5.353975e+00 2024
However, I want the sum of all imports for a given industry and a given year, regardless of where they came from. (The importer is always the US, sic87dd is a code that uniquely identifies the 397 industries)
So far I have tried the following code, which works correctly but is terribly inefficient and takes ages to run.
sic87dd <- unique(df1$sic87dd)
year <- unique(df1$year)
df2 <- data.frame("sic87dd" = rep(sic87dd, each = 17),
                  "year" = rep(year, 397),
                  imports = rep(0, 6749))
i <- 1
j <- 1
while (i <= nrow(df2)) {
  while (j <= nrow(df1)) {
    if (df1$sic87dd[j] == df2$sic87dd[i] & df1$year[j] == df2$year[i]) {
      df2$imports[i] <- df2$imports[i] + df1$imports[j]
    }
    j <- j + 1
  }
  i <- i + 1
  j <- 1
}
Is there a more efficient way to do this? I have seen some questions here that were somewhat similar and suggested the use of the data.table package, but I can't figure out how to make it work in my case.
Any help is appreciated.
There is a simple solution using dplyr. Since you want one sum per industry and per year, group by both sic87dd and year, then summarise (no factor conversion is needed; group_by handles the numeric industry codes directly):
library(dplyr)
df1 %>%
  group_by(sic87dd, year) %>%
  summarise(total_imports = sum(imports))
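Since the question mentions data.table, here is an equivalent sketch (assuming df1 is as shown above):
library(data.table)
setDT(df1)  # convert to data.table by reference
df2 <- df1[, .(imports = sum(imports)), by = .(sic87dd, year)]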

Aggregate data in dataframe by first transforming values in column

I have a data set with import and export numbers from countries which looks basically like this:
Country_from Country_to Count Value
UK USA 5 10
France Belgium 4 7
USA UK 1 6
Belgium France 8 9
Now, I want to aggregate this data and to combine the import and export numbers by summation. So, I want my resulting dataframe to be:
Country_from Country_to Count Value
UK USA 6 16
France Belgium 12 16
I made a script which concatenates the to and from country names and then sorts the characters alphabetically to check whether, for example, "UK USA" and "USA UK" are the same, and then aggregates the values.
This sorting part of my code looks like the following:
# concatenate to and from country names
country_from <- data.frame(lapply(data_lines$Country_from, as.character), stringsAsFactors = FALSE)
country_to <- data.frame(lapply(data_lines$Country_to, as.character), stringsAsFactors = FALSE)
concat_names <- as.matrix(paste(country_from, country_to, " "))
# order characters alphabetically
strSort <- function(x)
  sapply(lapply(strsplit(x, NULL), sort), paste, collapse = "")
sorted <- strSort(concat_names)
This approach works in this specific case, but it could theoretically be the case that two different countries have the same alphabetically sorted characters.
If there is a Country_from-Country_to combination without the same reverse, then it should save the values as they are given (so do nothing).
Does anyone have an idea how to do this without using the alphabetically sorted characters?
One way using dplyr would be to create a rowwise grouping variable by sorting and pasting Country_from and Country_to, and then take the sums within each group.
library(dplyr)
df %>%
  rowwise() %>%
  mutate(country = paste(sort(c(Country_from, Country_to)), collapse = "-")) %>%
  ungroup() %>%
  group_by(country) %>%
  summarise(across(Count:Value, sum))
# country Count Value
# <chr> <int> <int>
#1 Belgium-France 12 16
#2 UK-USA 6 16
Here, instead of sorting the characters we are sorting the words.
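If rowwise() is slow on a large table, a vectorized sketch of the same idea, assuming the country columns are character (not factor): pmin()/pmax() pick the alphabetically first and last country of each pair without any rowwise grouping.
library(dplyr)
df %>%
  group_by(country = paste(pmin(Country_from, Country_to),
                           pmax(Country_from, Country_to), sep = "-")) %>%
  summarise(Count = sum(Count), Value = sum(Value))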

Range overlap/intersect by group and between years

I have a list of marked individuals (column Mark) which have been captured in various years (column Year) within a range of the river (LocStart and LocEnd). Location on the river is in meters.
I would like to know if a marked individual has used an overlapping range between years, i.e. whether the individual has gone to the same segment of the river from year to year.
Here is an example of the original data set:
ID Mark Year LocStart LocEnd
 1 1081 1992   21,729 22,229
 2 1081 1992   21,203 21,703
 3 1081 2005   21,508 22,008
 4 1126 1994   19,222 19,522
 5 1126 1994   18,811 19,311
 6 1283 2005   21,754 22,254
 7 1283 2007   22,025 22,525
Here is what I would like the final answer to look like:
Mark Year1 Year2 IDs
1081  1992  2005 1, 3
1081  1992  2005 2, 3
1283  2005  2007 6, 7
In this case, individual 1126 would not be in the final output, as its only two ranges were from the same year. I realize it would be easy to remove all the records where Year1 = Year2.
I would like to do this in R and have looked into the IRanges package, but I have not been able to handle the grouping (group = Mark) or to extract the Year1 and Year2 information.
Using foverlaps() function from data.table package:
require(data.table)
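## example data, reconstructed from the question's table
## (locations assumed numeric, with the thousands commas removed)
dt <- data.frame(ID = 1:7,
                 Mark = c(1081, 1081, 1081, 1126, 1126, 1283, 1283),
                 Year = c(1992, 1992, 2005, 1994, 1994, 2005, 2007),
                 LocStart = c(21729, 21203, 21508, 19222, 18811, 21754, 22025),
                 LocEnd = c(22229, 21703, 22008, 19522, 19311, 22254, 22525))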
setkey(setDT(dt), Mark, LocStart, LocEnd) ## (1)
olaps = foverlaps(dt, dt, type="any", which=TRUE) ## (2)
olaps = olaps[dt$Year[xid] != dt$Year[yid]] ## (3)
olaps[, `:=`(Mark = dt$Mark[xid],
             Year1 = dt$Year[xid],
             Year2 = dt$Year[yid],
             xid = dt$ID[xid],
             yid = dt$ID[yid])] ## (4)
olaps = olaps[xid < yid] ## (5)
# xid yid Mark Year1 Year2
# 1: 2 3 1081 1992 2005
# 2: 1 3 1081 1992 2005
# 3: 6 7 1283 2005 2007
(1) We first convert the data.frame to a data.table by reference using setDT, then key it on the columns Mark, LocStart and LocEnd, which allows us to perform overlapping range joins.
(2) We calculate self overlaps (dt with itself) with any type of overlap, returning the matching indices with which = TRUE.
(3) Remove all index pairs where the Years corresponding to xid and yid are identical.
(4) Add all the other columns and replace xid and yid with the corresponding ID values, by reference.
(5) Remove all pairs where xid >= yid. If row 1 overlaps with row 3, then row 3 also overlaps with row 1; we only need one of them. foverlaps() doesn't yet have a way to remove these by default.
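To match the asker's desired output exactly (Mark, Year1, Year2 and the two IDs pasted together), a small follow-up sketch on the olaps table from step (5):
olaps[, .(Mark, Year1, Year2, IDs = paste(xid, yid, sep = ", "))]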

Merge two dataframes with repeated columns

I have several .csv files, each one corresponding to a monthly list of customers and some information about them. Each file consists of the same information about customers such as:
names(data.jan)
ID AGE CITY GENDER
names(data.feb)
ID AGE CITY GENDER
To simplify, I will consider only two months, January and February, but my real set of csv files goes from January to November.
Considering a customer X, I have three possible scenarios:
1. Customer X is listed in the January database, but he left and is not listed in February.
2. Customer X is listed in both the January and February databases.
3. Customer X entered the database in February, so he is not listed in January.
I am stuck on the following problem: I need to create a single database with all customers and their respective information from both dataframes. However, for a customer listed in both dataframes, I want to pick his information from his first entry, that is, January.
When I use merge, I have four options, according to http://www.dummies.com/how-to/content/how-to-use-the-merge-function-with-data-sets-in-r.html
data <- merge(data.jan,data.feb, by="ID", all=TRUE)
Regardless of whether I choose all, all.x, or all.y, I get the same undesired output, called data:
data[1,]
ID AGE.x CITY.x GENDER.x AGE.y CITY.y GENDER.y
123 25 NY M 25 NY M
I think that what would work here is to merge both databases with a particular type of join (shown in a picture in the original post), then merge the resulting dataframe with data.jan with a full outer join. But I don't know how to code this in R.
Thanks,
Bernardo
d1 <- data.frame(x=1:9,y=1:9,z=1:9)
d2 <- data.frame(x=1:10,y=11:20,z=21:30) # example data
d3 <- merge(d1,d2, by="x", all=TRUE) #merge
# keep the original columns from janary (i.e. y.x, z.x)
# but replace the NAs in those columns with the data from february (i.e. y.y,z.y )
d3[is.na(d3[,2]) ,][,2:3] <- d3[is.na(d3[,2]) ,][, 4:5]
#> d3[, 1:3]
# x y.x z.x
#1 1 1 1
#2 2 2 2
#3 3 3 3
#4 4 4 4
#5 5 5 5
#6 6 6 6
#7 7 7 7
#8 8 8 8
#9 9 9 9
#10 10 20 30
This may be tiresome for more than 2 months though; perhaps you should consider @flodel's comments. Also note there are pitfalls when your original January data has NAs (and you still want the first month's data, NA or not, retained), although you never mentioned them in your question.
Try:
data <- merge(data.jan,data.frame(ID=data.feb$ID), by="ID")
I haven't tested it since no data was provided, but if you join just the ID column from February, it should filter out anything that isn't in both frames.
@user1317221_G's solution is excellent. If your tables are large (lots of customers), data.table might be faster:
library(data.table)
# some sample data
jan <- data.table(id=1:10, age=round(runif(10,25,55)), city=c("NY","LA","BOS","CHI","DC"), gender=rep(c("M","F"),each=5))
new <- data.table(id=11:16, age=round(runif(6,25,55)), city=c("NY","LA","BOS","CHI","DC","SF"), gender=c("M","F"))
feb <- rbind(jan[6:10,],new)
new <- data.table(id=17:22, age=round(runif(6,25,55)), city=c("NY","LA","BOS","CHI","DC","SF"), gender=c("M","F"))
mar <- rbind(jan[1:5,],new)
setkey(jan,id)
setkey(feb,id)
join <- data.table(merge(jan, feb, by="id", all=T))
join[is.na(age.x) , names(join)[2:4]:= join[is.na(age.x),5:7,with=F]]
Edit: This adds processing for multiple months.
f <- function(x, y) {
  setkey(x, id)
  setkey(y, id)
  join <- data.table(merge(x, y, by = "id", all = TRUE))
  join[is.na(age.x), names(join)[2:4] := join[is.na(age.x), 5:7, with = FALSE]]
  join[, names(join)[5:7] := NULL] # get rid of extra columns
  setnames(join, 2:4, c("age", "city", "gender")) # rename columns that remain
  return(join)
}
Reduce("f",list(jan,feb,mar))
Reduce(...) applies the function f(...) to the elements of the list in turn, so first to jan and feb, and then to the result and mar, etc.
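For the stated goal of keeping each customer's information from his first appearance, a shorter base R sketch, assuming all monthly frames share identical columns: stack the months in chronological order and keep only the first row per ID.
months <- list(data.jan, data.feb)          # ... extend through November, in order
stacked <- do.call(rbind, months)
data <- stacked[!duplicated(stacked$ID), ]  # first occurrence wins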
