Deduplicate dataframe based on criteria in R? - r

I've got this dataframe:
Name Country Gender Age
1 John GB M 25
2 Mark US M 35
3 Jane 0 0 0
4 Jane US F 30
5 Jane US F 0
6 Kate GB F 18
As you can see, the value "Jane" appears 3 times. I want to deduplicate the list based on the variable "Name", but because the other columns matter to me, I want to keep the rows that carry the most information. For example, if I deduplicated the above file in Excel, it would keep the first "Jane" row and delete all the others. But the first "Jane" row (row 3) is missing information in the other columns.
In other words, I want to deduplicate the list by "Name" but add a criterion: keep the rows that have a value other than "0" in the column "Age". The result I want is this:
Name Country Gender Age
1 John GB M 25
2 Mark US M 35
3 Jane US F 30
4 Kate GB F 18
I have tried this
file3 <- file1[!duplicated(file1$Name),]
But, like Excel, it keeps the "Jane" row that has no usable information in the other columns.
How do I sort the rows by column "Age" in descending (Z-A) order, so that any row with "0" ends up at the bottom and is removed when I deduplicate the list?
Cheers
David

Try this trick
ind <- with(DF,
Country !=0 &
Gender %in% c('F', 'M') &
Age !=0)
DF[ind, ]
Name Country Gender Age
1 John GB M 25
2 Mark US M 35
4 Jane US F 30
6 Kate GB F 18
So far it works well and produces your desired output
EDIT
library(doBy)
orderBy(~ -Age+Name, DF) # Sort by decreasing Age, then by increasing Name
Name Country Gender Age
2 Mark US M 35
4 Jane US F 30
1 John GB M 25
6 Kate GB F 18
3 Jane 0 0 0
5 Jane US F 0
Or simply using Base functions:
DF[order(DF$Age, DF$Name, decreasing = TRUE), ]
Name Country Gender Age
2 Mark US M 35
4 Jane US F 30
1 John GB M 25
6 Kate GB F 18
3 Jane 0 0 0
5 Jane US F 0
Now you can select the rows that meet your conditions by indexing. I really think the first approach is better than these last two.
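To connect the sorting step back to the original question, here is a minimal sketch of "sort by Age descending, then deduplicate by Name" in base R. It assumes the data frame from the question is rebuilt by hand as DF; the names sorted and dedup are only illustrative.

```r
# Rebuild the example data from the question (assumed layout).
DF <- data.frame(
  Name    = c("John", "Mark", "Jane", "Jane", "Jane", "Kate"),
  Country = c("GB", "US", "0", "US", "US", "GB"),
  Gender  = c("M", "M", "0", "F", "F", "F"),
  Age     = c(25, 35, 0, 30, 0, 18),
  stringsAsFactors = FALSE
)

# Sort so that rows with Age 0 sink to the bottom.
sorted <- DF[order(DF$Age, decreasing = TRUE), ]

# duplicated() keeps the first occurrence, which is now the highest Age per Name.
dedup <- sorted[!duplicated(sorted$Name), ]

# Optional: restore the original row order using the preserved row names.
dedup <- dedup[order(as.integer(rownames(dedup))), ]
dedup
```

Unlike a plain filter on Age, this keeps one row for every Name, even if all of that person's rows have Age 0.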

If all duplicated rows have the value zero in column Age, it will work with subset:
# the data
file1 <- read.table(text="Name Country Gender Age
1 John GB M 25
2 Mark US M 35
3 Jane 0 0 0
4 Jane US F 30
5 Jane US F 0
6 Kate GB F 18", header = TRUE, stringsAsFactors = FALSE)
# create a subset of the data
subset(file1, Age > 0)
# Name Country Gender Age
# 1 John GB M 25
# 2 Mark US M 35
# 4 Jane US F 30
# 6 Kate GB F 18
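If the goal is to keep the row with "the most information" rather than just a non-zero Age, one possible variant is to count the non-placeholder fields per row and keep the best row per Name. This is only a sketch: it assumes "0" is the only placeholder, and info_cols, filled, best and result are made-up names.

```r
file1 <- data.frame(
  Name    = c("John", "Mark", "Jane", "Jane", "Jane", "Kate"),
  Country = c("GB", "US", "0", "US", "US", "GB"),
  Gender  = c("M", "M", "0", "F", "F", "F"),
  Age     = c(25, 35, 0, 30, 0, 18),
  stringsAsFactors = FALSE
)

# Columns that carry information besides the key (assumption).
info_cols <- c("Country", "Gender", "Age")

# Number of fields per row that are not the "0" placeholder.
filled <- rowSums(file1[info_cols] != "0")

# Best (highest) count within each Name, repeated on every row of that Name.
best <- ave(filled, file1$Name, FUN = max)

# Keep rows that reach the best count, then the first such row per Name.
candidates <- file1[filled == best, ]
result <- candidates[!duplicated(candidates$Name), ]
result
```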

Related

How can I alter the values of certain rows in a column, based on a condition from another column in a dataframe, using the ifelse function?

So I have this first dataframe (fish18) which consists of data on fish specimens, and a column "grade" that is to be filled with conditions in an ifelse function.
species BIN collectors country grade species_frequency
1 Poecilothrissa congica BOLD:AAF7519 mljs et al, Democratic Republic of the Congo NA 2
2 Acanthurus triostegus BOLD:AAA9362 Vinothkumar S, Kaleshkumar K and Rajaram R. India NA 54
3 Pseudogramma polyacantha BOLD:AAC5137 Allan D. Connell South Africa NA 15
4 Pomadasys commersonnii BOLD:AAD1338 Allan D. Connell South Africa NA 12
5 Secutor insidiator BOLD:AAB2487 Allan D. Connell South Africa NA 18
6 Sebastes macdonaldi BOLD:AAJ7419 Merit McCrea United States NA 3
BIN_per_species collector_per_species countries_per_species species_per_bin
1 2 1 1 1
2 1 21 15 1
3 3 6 6 1
4 1 2 1 1
5 4 5 4 2
6 1 1 1 1
And after filling the grade column I have something like this (fish19)
species BIN collectors country grade species_frequency
1 Poecilothrissa congica BOLD:AAF7519 mljs et al, Democratic Republic of the Congo D 2
2 Acanthurus triostegus BOLD:AAA9362 Vinothkumar S, Kaleshkumar K and Rajaram R. India A 54
3 Pseudogramma polyacantha BOLD:AAC5137 Allan D. Connell South Africa C 15
4 Pomadasys commersonnii BOLD:AAD1338 Allan D. Connell South Africa A 12
5 Secutor insidiator BOLD:AAB2487 Allan D. Connell South Africa E 18
6 Sebastes macdonaldi BOLD:AAJ7419 Merit McCrea United States B 3
BIN_per_species collector_per_species countries_per_species species_per_bin
1 2 1 1 1
2 1 21 15 1
3 3 6 6 1
4 1 2 1 1
5 4 5 4 2
6 1 1 1 1
Both dataframes have many specimens belonging to the same species of fish, and the grades are supposed to be assigned per species, so every specimen of a species should get the same grade. The problem I'm having is that some rows belonging to the same species get different grades, especially grades "C" and "E". What I want to incorporate into my ifelse function is: wherever two or more specimens of the same species are assigned "C" in one row and "E" in another, change the "C" to "E". If one species has grade "E", every other row with that species name should also have grade "E".
So far I've tried the %in% function and just using "=="
Trying with %in%
assign_grades=function(fish18){
fish19<-fish18 %>%
mutate(grade = ifelse(species_frequency<3,"D",ifelse(BIN_per_species==1 & (collector_per_species>1 | countries_per_species>1),"A",ifelse(BIN_per_species==1 & collector_per_species==1 | countries_per_species==1,"B",ifelse(BIN_per_species>1 & species_per_bin==1,"C",ifelse(species_per_bin>1,"E",ifelse(fish19$species[fish19$grade=="E"]%in%fish19$species[fish19$grade=="C"]==TRUE,"E",NA))) ))))
assign('fish19',fish19,envir=.GlobalEnv)
}
assign_grades(fish18)
Trying with "=="
assign_grades=function(fish18){
fish19<-fish18 %>%
mutate(grade = ifelse(species_frequency<3,"D",ifelse(BIN_per_species==1 & (collector_per_species>1 | countries_per_species>1),"A",ifelse(BIN_per_species==1 & collector_per_species==1 | countries_per_species==1,"B",ifelse(BIN_per_species>1 & species_per_bin==1,"C",ifelse(species_per_bin>1,"E",ifelse(fish19$species[fish19$grade=="E"]==fish19$species[fish19$grade=="C"],"E",NA))) ))))
assign('fish19',fish19,envir=.GlobalEnv)
}
assign_grades(fish18)
Neither option worked. The desired output is that if one occurrence of a species name has the grade "E" assigned to it, so do all other occurrences of that species name.
I'm sorry if this was confusing, but I tried to be as clear as I could. Thank you in advance for any responses.
Kind of a long winded answer, but:
library(dplyr)
dat = data.frame('species'=c('a','b','c','a','a','b'),'grade'=c('E','E','C','C','C','D'))
dat %>% left_join(dat %>%
group_by(species) %>%
summarize(sum_e = sum(grade=='E')),by='species')
Then you could do an ifelse for sum_e>0
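The final step can be sketched in base R as follows. This is a minimal sketch on the same toy data; it swaps the dplyr join for ave, and has_e is a hypothetical helper name. Only "C" rows are overwritten, as the question asks.

```r
dat <- data.frame(
  species = c("a", "b", "c", "a", "a", "b"),
  grade   = c("E", "E", "C", "C", "C", "D"),
  stringsAsFactors = FALSE
)

# TRUE on every row whose species has at least one "E" grade.
has_e <- ave(dat$grade == "E", dat$species, FUN = any)

# Promote "C" to "E" only where the species already has an "E" somewhere.
dat$grade <- ifelse(has_e & dat$grade == "C", "E", dat$grade)
dat
```

Species "c" keeps its "C" because it has no "E" row, and the "D" row of species "b" is left alone.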

Add row with group sum in new column at the end of group category

I have been searching for this information since yesterday, but so far I could not find a nice solution to my problem.
I have the following dataframe:
CODE CONCEPT P. NR. NAME DEPTO. PRICE
1 Lunch 11 John SALES 160
1 Lunch 11 John SALES 120
1 Lunch 11 John SALES 10
1 Lunch 13 Frank IT 200
2 Internet 13 Frank IT 120
and I want to add a column with the sum of rows by group, for instance, the total amount of concept: Lunch, code: 1 by name in order to get an output like this:
CODE CONCEPT P. NR. NAME DEPTO. PRICE TOTAL
1 Lunch 11 John SALES 160 NA
1 Lunch 11 John SALES 120 NA
1 Lunch 11 John SALES 10 290
1 Lunch 13 Frank IT 200 200
2 Internet 13 Frank IT 120 120
So far, I tried with:
aggregate(PRICE~NAME+CODE, data = df, FUN = sum)
But this retrieves just the total of the concepts like this:
NAME CODE TOTAL
John 1 290
Frank 1 200
Frank 2 120
And not the table with the rest of the data as I would like to have it.
I also tried adding an extra column with NA but somehow I cannot paste the total in a specific row position.
Any suggestions? I would like to have something I can do in BaseR.
Thanks!!
In base R you can use ave to add a new column. We insert the group sum only in the last row of each group (the column "P. NR." is assumed to be named PNR here).
df$TOTAL <- with(df, ave(PRICE, CODE, CONCEPT, PNR, NAME, FUN = function(x)
ifelse(seq_along(x) == length(x), sum(x), NA)))
df
# CODE CONCEPT PNR NAME DEPTO. PRICE TOTAL
#1 1 Lunch 11 John SALES 160 NA
#2 1 Lunch 11 John SALES 120 NA
#3 1 Lunch 11 John SALES 10 290
#4 1 Lunch 13 Frank IT 200 200
#5 2 Internet 13 Frank IT 120 120
Similar logic using dplyr
library(dplyr)
df %>%
group_by(CODE, CONCEPT, PNR, NAME) %>%
mutate(TOTAL = ifelse(row_number() == n(), sum(PRICE) ,NA))
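For a self-contained check, here is a sketch that rebuilds the example data and computes the same TOTAL column without an anonymous function. The column names PNR and DEPTO are assumed renamings of "P. NR." and "DEPTO.", and grouping by CODE and NAME only is a simplification.

```r
df <- data.frame(
  CODE    = c(1, 1, 1, 1, 2),
  CONCEPT = c("Lunch", "Lunch", "Lunch", "Lunch", "Internet"),
  PNR     = c(11, 11, 11, 13, 13),
  NAME    = c("John", "John", "John", "Frank", "Frank"),
  DEPTO   = c("SALES", "SALES", "SALES", "IT", "IT"),
  PRICE   = c(160, 120, 10, 200, 120),
  stringsAsFactors = FALSE
)

# Group sum repeated on every row of the group.
group_total <- ave(df$PRICE, df$CODE, df$NAME, FUN = sum)

# TRUE only on the last row of each CODE/NAME combination.
is_last <- !duplicated(df[c("CODE", "NAME")], fromLast = TRUE)

# Show the total on the last row of the group, NA elsewhere.
df$TOTAL <- ifelse(is_last, group_total, NA)
df
```

Because is_last is computed with duplicated(fromLast = TRUE), this version does not require the rows of a group to be adjacent.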
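For a base R option, you may try merging the original data frame with the aggregate. Note that this puts the group total on every row of the group, in place of the individual PRICE values: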
df2 <- aggregate(PRICE~NAME+CODE, data = df, FUN = sum)
out <- merge(df[ , !(names(df) %in% c("PRICE"))], df2, by=c("NAME", "CODE"))
out[with(out, order(CODE, NAME)), ]
NAME CODE CONCEPT PNR DEPT PRICE
1 Frank 1 Lunch 13 IT 200
3 John 1 Lunch 11 SALES 290
4 John 1 Lunch 11 SALES 290
5 John 1 Lunch 11 SALES 290
2 Frank 2 Internet 13 IT 120

How to convert a n x 3 data frame into a square (ordered) matrix?

I need to reshape a table or (data frame) to be able to use an R package (NetworkRiskMetrics). Suppose I have a data frame of lenders, borrowers and loan values:
lender borrower loan_USD
John Mark 100
Mark Paul 45
Joe Paul 30
Dan Mark 120
How do I convert this data frame into:
John Mark Joe Dan Paul
John
Mark
Joe
Dan
Paul
(placing zeros in empty cells)?
Thanks.
Use reshape function
d <- data.frame(lander=c('a','b','c', 'a'), borower=c('m','p','m','p'), loan=c(10,20,15,12))
d
lander borower loan
1 a m 10
2 b p 20
3 c m 15
4 a p 12
reshape(data=d, direction='long', varying=list('lander','borower'), idvar='loan', timevar='loan')
loan lander borower
10.1 1 a m
20.1 1 b p
15.1 1 c m
12.1 1 a p
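For the square matrix asked for in the question, one possible approach is to build a zero matrix over all names and fill it by name. This is a sketch that does not use reshape; the names loans, people and adj are illustrative, and if the same lender/borrower pair occurred more than once, the last value would win rather than being summed.

```r
loans <- data.frame(
  lender   = c("John", "Mark", "Joe", "Dan"),
  borrower = c("Mark", "Paul", "Paul", "Mark"),
  loan_USD = c(100, 45, 30, 120),
  stringsAsFactors = FALSE
)

# Every person who appears as lender or borrower, in order of first appearance.
people <- unique(c(loans$lender, loans$borrower))

# Square matrix of zeros with the same names on rows (lenders) and columns (borrowers).
adj <- matrix(0, nrow = length(people), ncol = length(people),
              dimnames = list(people, people))

# A two-column character matrix indexes cells by row name and column name.
adj[cbind(loans$lender, loans$borrower)] <- loans$loan_USD
adj
```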

Ordering alphabetically after ordering once numerically [duplicate]

Suppose I have a data frame with 3 columns (name, y, sex) where name is character, y is a numeric value and sex is a factor.
sex<-c("M","M","F","M","F","M","M","M","F")
x<-c("MARK","TOM","SUSAN","LARRY","EMMA","LEONARD","TIM","MATT","VIOLET")
name<-as.character(x)
y<-rnorm(9,8,1)
score<-data.frame(x,y,sex)
score
x y sex
1 MARK 6.767086 M
2 TOM 7.613928 M
3 SUSAN 7.447405 F
4 LARRY 8.040069 M
5 EMMA 8.306875 F
6 LEONARD 8.697268 M
7 TIM 10.385221 M
8 MATT 7.497702 M
9 VIOLET 10.177969 F
If I wanted to order it by y I would use:
score[order(score$y),]
x y sex
1 MARK 6.767086 M
3 SUSAN 7.447405 F
8 MATT 7.497702 M
2 TOM 7.613928 M
4 LARRY 8.040069 M
5 EMMA 8.306875 F
6 LEONARD 8.697268 M
9 VIOLET 10.177969 F
7 TIM 10.385221 M
So far, so good... The names keep the correct score, BUT how could I reorder it so that the M and F levels are not mixed? I need to order and at the same time keep factor levels separated.
Finally, I would like to go a step further and involve the character column. The example doesn't show it, but what if there were tied y values and I had to order again within the factor (e.g. TIM and TOM both got 8.4 and I have to put them in alphabetical order)?
I was thinking about the by function, but it creates a list and doesn't really help. I think there must be some function like it to apply on data frames and get data frames as return.
TO MAKE CLEAR THE POINT:
sep<-split(score,score$sex)
sep$M<-sep$M[order(sep$M[,2]),]
sep$M
x y sex
1 MARK 6.767086 M
8 MATT 7.497702 M
2 TOM 7.613928 M
4 LARRY 8.040069 M
6 LEONARD 8.697268 M
7 TIM 10.385221 M
sep$F<-sep$F[order(sep$F[,2]),]
sep$F
x y sex
3 SUSAN 7.447405 F
5 EMMA 8.306875 F
9 VIOLET 10.177969 F
merged<-rbind(sep$M,sep$F)
merged
x y sex
1 MARK 6.767086 M
8 MATT 7.497702 M
2 TOM 7.613928 M
4 LARRY 8.040069 M
6 LEONARD 8.697268 M
7 TIM 10.385221 M
3 SUSAN 7.447405 F
5 EMMA 8.306875 F
9 VIOLET 10.177969 F
I know how to do that if I have 2 or 3 factors. But what if I had serious levels of factors, say 20, should I write a for loop?
order takes multiple arguments, and it does just what you want:
with(score, score[order(sex, y, x),])
## x y sex
## 3 SUSAN 6.636370 F
## 5 EMMA 6.873445 F
## 9 VIOLET 8.539329 F
## 6 LEONARD 6.082038 M
## 2 TOM 7.812380 M
## 8 MATT 8.248374 M
## 4 LARRY 8.424665 M
## 7 TIM 8.754023 M
## 1 MARK 8.956372 M
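Because y comes from rnorm without a seed, the printed values differ between posts. The tie-breaking behaviour is easier to see with fixed data, so here is a small sketch with made-up scores in which TIM and TOM are tied at 8.4; the values and the names by_asc and by_desc are assumptions for illustration.

```r
score <- data.frame(
  x   = c("TOM", "TIM", "SUSAN", "EMMA", "MARK"),
  y   = c(8.4, 8.4, 7.4, 8.3, 6.8),
  sex = factor(c("M", "M", "F", "F", "M")),
  stringsAsFactors = FALSE
)

# Keys are applied left to right: sex first, then y, then x to break ties in y.
by_asc <- score[order(score$sex, score$y, score$x), ]

# Negating a numeric key sorts that key in decreasing order within each sex.
by_desc <- score[order(score$sex, -score$y, score$x), ]

by_asc
by_desc
```

The same call works unchanged for a factor with 20 levels, so no loop over levels is needed.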
Here is a summary of all methods mentioned in other answers/comments (to serve future searchers). I've added a data.table way of sorting.
# Base R
do.call(rbind, by(score, score$sex, function(x) x[order(x$y),]))
with(score, score[order(sex, y, x),])
score[order(score$sex,score$x),]
# Using plyr
library(plyr)
arrange(score, sex,y)
ddply(score, c('sex', 'y'))
# Using `data.table`
library("data.table")
score_dt <- setDT(score)
# setting a key sorts the data.table
setkey(score_dt,sex,x)
print(score_dt)
Here is another question that deals with the same topic.
"I think there must be some function like it to apply on data frames and get data frames as return"
Yes there is:
library(plyr)
ddply(score, c('y', 'sex'))
It sounds to me like you're trying to order by score within the males and females and return a combined data frame of sorted males and sorted females.
You are right that by(score, score$sex, function(x) x[order(x$y),]) returns a list of sorted data frames, one for male and one for female. You can use do.call with the rbind function to combine these data frames into a single final data frame:
do.call(rbind, by(score, score$sex, function(x) x[order(x$y),]))
# x y sex
# F.5 EMMA 7.526866 F
# F.9 VIOLET 8.182407 F
# F.3 SUSAN 9.677511 F
# M.4 LARRY 6.929395 M
# M.8 MATT 7.970015 M
# M.7 TIM 8.297137 M
# M.6 LEONARD 8.845588 M
# M.2 TOM 9.035948 M
# M.1 MARK 10.082314 M
I believe the asker also wanted to know how to sort by factor order when there are many levels, say 20:
"I know how to do that if I have 2 or 3 factors. But what if I had serious levels of factors, say 20, should I write a for loop?"
I have an example with 9 ordered levels and various counts.
stage_name count
<ord> <int>
1 Closed Lost 957
2 Closed Won 1413
3 Evaluation 1773
4 Meeting Scheduled 4104
5 Nurture 1222
6 Opportunity Disqualified 805
7 Order Submitted 1673
8 Qualifying 5138
9 Quoted 4976
In this case you can see that it is displayed using alphabetical order of stage_name, but stage_name is actually an ordered factor that has a very different order.
This code sets the factor levels in a very different order:
# Make categoricals ----
check_stage$stage_name = ordered(check_stage$stage_name, levels=c(
'Opportunity Disqualified',
'Qualifying',
'Evaluation',
'Meeting Scheduled',
'Quoted',
'Order Submitted',
'Closed Won',
'Closed Lost',
'Nurture'))
Now we can just use the factor as the sort key. arrange() is a dplyr function; ordered() and factor() are base R, so forcats is optional here. I have both libraries installed:
check_stage <- check_stage %>%
arrange(factor(stage_name))
This now gives the output in the factor order as desired:
check_stage
# A tibble: 9 × 2
stage_name count
<ord> <int>
1 Opportunity Disqualified 805
2 Qualifying 5138
3 Evaluation 1773
4 Meeting Scheduled 4104
5 Quoted 4976
6 Order Submitted 1673
7 Closed Won 1413
8 Closed Lost 957
9 Nurture 1222
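The same result can be had without dplyr, since order() sorts a factor by its level order rather than alphabetically. Below is a minimal base R sketch on a four-row subset of the table above; stage_levels and sorted_stage are illustrative names.

```r
check_stage <- data.frame(
  stage_name = c("Closed Lost", "Closed Won", "Evaluation", "Qualifying"),
  count      = c(957L, 1413L, 1773L, 5138L),
  stringsAsFactors = FALSE
)

# Desired business order of the stages (subset of the levels shown above).
stage_levels <- c("Qualifying", "Evaluation", "Closed Won", "Closed Lost")

check_stage$stage_name <- factor(check_stage$stage_name,
                                 levels = stage_levels, ordered = TRUE)

# order() uses the underlying level positions, not the label text.
sorted_stage <- check_stage[order(check_stage$stage_name), ]
sorted_stage
```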

