Difference between the two highest numbers in a column in R

I have a data frame like this:
NUM_TURNO CODIGO_MUNICIPIO SIGLA_PARTIDO SHARE
1 1 81825 PPB 38.713318
2 1 81825 PMDB 61.286682
3 1 09717 PMDB 48.025900
4 1 09717 PL 1.279217
5 1 09717 PFL 50.694883
6 1 61921 PMDB 51.793868
This is a data.frame of elections in Brazil. Grouping by NUM_TURNO and CODIGO_MUNICIPIO, I want to compare the SHARE of the FIRST and SECOND most voted candidates in each city and round (1 or 2) and create a new column.
What am I having trouble doing? I don't know how to calculate the difference using only the two biggest SHAREs of votes.
For the first case, for example, I want to create something that gives me the difference between 61.286682 and 38.713318 = 22.573364 and so on.
Something like this:
df %>%
  group_by(NUM_TURNO, CODIGO_MUNICIPIO) %>%
  mutate(Diff = HIGHEST SHARE - 2nd HIGHEST SHARE)
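For reference, a minimal reproducible version of the data frame above (a sketch; column types are assumed, and CODIGO_MUNICIPIO is kept numeric to match the answers' output below, which drops the leading zero of 09717):
df <- data.frame(
  NUM_TURNO        = c(1, 1, 1, 1, 1, 1),
  CODIGO_MUNICIPIO = c(81825, 81825, 9717, 9717, 9717, 61921),
  SIGLA_PARTIDO    = c("PPB", "PMDB", "PMDB", "PL", "PFL", "PMDB"),
  SHARE            = c(38.713318, 61.286682, 48.025900, 1.279217, 50.694883, 51.793868),
  stringsAsFactors = FALSE
)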

You can also use top_n from dplyr with grouping and summarizing. Keep in mind that in the data you provided, you will get an error in summarize if you use diff with a single value, hence the use of ifelse.
df %>%
  group_by(NUM_TURNO, CODIGO_MUNICIPIO) %>%
  top_n(2, SHARE) %>%
  summarize(Diff = ifelse(n() == 1, NA, diff(SHARE)))
# A tibble: 3 x 3
# Groups: NUM_TURNO [?]
NUM_TURNO CODIGO_MUNICIPIO Diff
<dbl> <dbl> <dbl>
1 1 9717 2.67
2 1 61921 NA
3 1 81825 22.6

You could arrange your data frame by SHARE and then slice the first two values. Then you could use summarise to get the diff between the values for every group:
library(dplyr)
df %>%
  group_by(NUM_TURNO, CODIGO_MUNICIPIO) %>%
  arrange(desc(SHARE)) %>%
  slice(1:2) %>%
  summarise(Diff = -diff(SHARE))
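Both answers collapse each group to a single Diff value. If you would rather keep every row and just add Diff as a new column, as in the question's pseudocode, a mutate-based sketch (my own variant, not part of either answer; groups with a single candidate get NA):
df %>%
  group_by(NUM_TURNO, CODIGO_MUNICIPIO) %>%
  mutate(Diff = max(SHARE) - sort(SHARE, decreasing = TRUE)[2]) %>%
  ungroup()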


How to filter a data frame with conditions on many columns

Update!!!
After trying the code in the comments, the result shows me all brand_ids for which the review_score is 5 (the highest score), while there are also thousands of brand_ids with over 10 reviews, so I'm still confused about it.
The thing is, I have a data frame with many columns, and I need to find the brand using different conditions on different columns.
Here is the data frame:
Brand id   Brand name   review score
1          A            1.0
2          B            2.0
2          B            3.0
3          C            1.0
3          C            1.5
3          C            2.0
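For reference, the sample data can be constructed like this (a sketch; the column names Brand_id, Brand_name and review_score are assumed, matching the code below, and the answer refers to the data frame as df while the question's own attempts call it item):
df <- data.frame(
  Brand_id     = c(1L, 2L, 2L, 3L, 3L, 3L),
  Brand_name   = c("A", "B", "B", "C", "C", "C"),
  review_score = c(1.0, 2.0, 3.0, 1.0, 1.5, 2.0),
  stringsAsFactors = FALSE
)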
And what I need is: what's the Brand id for the item with the highest review score? And I also need to filter for items with more than 10 reviews.
I tried code like this:
item %>%
  group_by(Brand_id, review_score) %>%
  summarise(idnumber = n()) %>%
  filter(idnumber > 10) %>%
  arrange(desc(review_score))
And I tried this, which also failed:
item %>%
  group_by(Brand_id) %>%
  mutate(n = n(), 'max' = max(review_overall, na.rm = TRUE)) %>%
  filter(n >= 10) %>%
  arrange(desc('max'))
Then I got many items with the same review_score...
But it seems there should be only one answer to this question.
So could you please help me!
Thank you!
With this code you get the max review_score for each Brand, and with add_count you get the number of reviews for each Brand. (There is no column review_overall in your sample data, so maybe you could clarify that part.)
library(dplyr)
df %>%
  group_by(Brand_name) %>%
  add_count() %>%
  filter(review_score == max(review_score))
In case you want to filter n too, use this code:
df %>%
  group_by(Brand_name) %>%
  add_count() %>%
  filter(review_score == max(review_score), n > 10)
Output (of the first pipeline; on the sample data above, the n > 10 filter would leave nothing):
Brand_id Brand_name review_score n
<int> <chr> <dbl> <int>
1 1 A 1 1
2 2 B 3 2
3 3 C 2 3
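If the goal is a single row, i.e. the one brand with the highest score among brands that have more than 10 reviews, one possible sketch (using slice_max, which requires dplyr >= 1.0; note that on the tiny sample above the n > 10 filter leaves nothing):
df %>%
  add_count(Brand_name) %>%        # n = number of reviews per brand
  filter(n > 10) %>%               # keep only brands with more than 10 reviews
  slice_max(review_score, n = 1)   # row(s) with the single highest score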

Calculate mean and sd of a variable (Salary) depending on another variable (JobSatisfaction)

I have two columns in the data set, and I know I have to use the functions ddply and summarise, but I do not know how to start.
Hopefully this will get you started:
data %>%
  group_by(Satisfaction) %>%
  summarise(Mean = mean(Salary),
            SD = sd(Salary))
# A tibble: 7 x 3
Satisfaction Mean SD
<int> <dbl> <dbl>
1 1 12481. 1437.
2 2 31965. 5235.
3 3 45844. 7631.
4 4 69052. 9257.
5 5 79555. 12975.
6 6 100557. 13739.
7 7 111414. 19139.
First, you should use the group_by verb to group the data by the variable you are interested in. Then, as you alluded to, you can use the summarise verb to perform a function on the data for each group. You can compute several summaries at once by separating the new columns you want to make with commas.
Recall that the %>% pipe operator directs the output of one function to the next as the first argument.
Example data:
set.seed(3)
data <- data.frame(Salary = sapply(rep(1:7, each = 10), function(x) floor(runif(1, x * 10000, x * 20000))),
                   Satisfaction = rep(1:7, each = 10))
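Since the question mentions ddply, here is a roughly equivalent plyr sketch (written with :: to avoid the usual plyr/dplyr masking issues; this assumes the plyr package is installed):
plyr::ddply(data, "Satisfaction", plyr::summarise,
            Mean = mean(Salary),
            SD   = sd(Salary))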

R - Removing the same name in two columns of a data frame

I am working with a data frame that has two columns, name and spouse. I am trying to calculate the interracial marriage frequency, but I need to remove repeated records.
When I have the name of a creature I need to keep that record in the data frame but remove the record where that creature's name is the spouse name. I have the following data sample:
name spouse
15 Finarfin Eärwen
6 Tar-Vanimeldë Herucalmo
17 Faramir Éowyn
8 Tar-Meneldur Almarian
14 Finduilas of Dol Amroth Denethor II
12 Finwë Míriel Serindë then ,Indis
9 Tar-Ancalimë Hallacar
7 Tar-Míriel Ar-Pharazôn
5 Tarannon Falastur Berúthiel
21 Rufus Burrows Asphodel Brandybuck
2 Angrod Eldalótë
4 Ar-Gimilzôr Inzilbêth
19 Lobelia Sackville-Baggins Otho Sackville-Baggins
25 Mrs. Proudfoot Odo Proudfoot
22 Rudigar Bolger Belba Baggins
24 Odo Proudfoot Mrs. Proudfoot
3 Ar-Pharazôn Tar-Míriel
13 Fingolfin Anairë
18 Silmariën Elatan
23 Rowan Greenhand Belba Baggins
20 Rían Huor
1 Adanel Belemir
16 Fastolph Bolger Pansy Baggins
10 Morwen Steelsheen Thengel
11 Tar-Aldarion Erendis
25 Belemir Adanel
For example, I ran the code and in line 1 it caught name Adanel and got Belemir as its spouse, so I need to keep line 1, but remove line 25, because with that I will avoid duplicated data.
I have tried the following code:
interracialMarriage <- data %>% filter(spouse != name) %>% select(name, spouse)
How can I remove these mirrored name/spouse records from the data frame?
P.S.: I would need it to be case-insensitive (Belemir == belemir) so that I don't have problems in the future.
Thanks!
You could set up another vector with the row-wise alphabetically sorted names, and deduplicate using that...
sorted <- sapply(1:nrow(data),
                 function(i) paste(sort(c(trimws(tolower(data$name[i])),
                                          trimws(tolower(data$spouse[i])))),
                                   collapse = " "))
irM <- data[!duplicated(sorted), ]
The trimws strips off any leading or trailing spaces before sorting and pasting, and tolower converts everything to lower case.
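The same canonical-key idea can also be vectorised without sapply, for example with pmin/pmax and dplyr's distinct (a sketch, not part of the original answer):
library(dplyr)
data %>%
  mutate(a = trimws(tolower(name)),
         b = trimws(tolower(spouse)),
         key = paste(pmin(a, b), pmax(a, b))) %>%  # same key for (name, spouse) and (spouse, name)
  distinct(key, .keep_all = TRUE) %>%              # keep the first occurrence of each pair
  select(-a, -b, -key)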
My attempt with tidyverse:
library(tidyverse)
dat %>%
  mutate(id = 1:n()) %>%           # add id to label the pairs
  gather('key', 'name', -id) %>%   # transform: key (name | spouse), name, id
  group_by(name) %>%               # group by unique name to find duplicates
  top_n(-1, wt = id) %>%           # if name > 1, take row with the lower id
  spread(key, name) %>%            # spread data to original format
  select(-id)                      # remove id's
# # A tibble: 3 x 2
# name spouse
# <chr> <chr>
# 1 Adanel Belemir
# 2 Fastolph Bolger Pansy Baggins
# 3 Morwen Steelsheen Thengel
Data:
dat <- data.frame(
name = c("Adanel", "Fastolph Bolger", "Morwen Steelsheen", "Belemir"),
spouse = c("Belemir", "Pansy Baggins", "Thengel", "Adanel" ),
stringsAsFactors = F
)

R Tidy solution to select from group_by output based on a column's data availability

I have the following R dplyr data frame in df_pub (Science/Nature publication data).
Please note that the same PMID (i.e., paper) appears in multiple rows, one row per contributing author (author info is not shown here).
I need to select the publications (PMIDs) that have no email attached and store the last observation of each in a data frame.
In other words, I want to remove all PMIDs that have an email in any observation, collect the publications (PMIDs) that do not have an attached email, and then find the last author or last observation (usually she/he/xe is the group leader or PI; we'll contact them manually and request that they update their email).
So for the example above, the expected output will not contain PMID 22522932 because it has an email attached. For other PMIDs only the last row of each such PMID will be stored.
I started with this but then got lost:
df_pub %>%
  group_by(pmid) %>%
  filter(is.na(email))  # This does not do what I expect
If I understand correctly, this will do what you want:
df_pub %>%
  group_by(pmid) %>%
  filter(!any(!is.na(email)),   # keep only pmids where every email is NA
         row_number() == n())   # then keep the last row of each group
I think this is what you wanted. It checks which pmids have no email attached and then shows only the last row.
df_pub %>%
  group_by(pmid) %>%
  filter(sum(is.na(email)) == n()) %>%  # choose pmids where the number of NAs equals the number of rows
  filter(row_number() == n())           # choose the last row for each pmid
Try this. Might not be the most concise code, but I think it solves your question.
# Sample dataframe
pmid email No
1 1 <NA> 1
2 1 <NA> 2
3 1 <NA> 3
4 2 a#b.com 4
5 2 <NA> 5
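# (A sketch of how that sample dataframe might be constructed; stringsAsFactors = TRUE
#  is assumed so that email prints as a factor, as in the result below)
df <- data.frame(pmid  = c(1, 1, 1, 2, 2),
                 email = c(NA, NA, NA, "a#b.com", NA),
                 No    = 1:5,
                 stringsAsFactors = TRUE)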
# Logic
val <- df$pmid[!is.na(df$email)] %>% unique()
df[!df$pmid %in% val, ] %>%
  group_by(pmid) %>%
  slice(n()) %>%
  ungroup()
# Result
# A tibble: 1 x 3
pmid email No
<dbl> <fct> <int>
1 1 NA 3

Is it possible to create a row containing totals for some columns and averages for others?

I have a dataframe dealing with time series data in which some columns represent amounts and some columns represent percentages. I want a row that summarizes each column, but obviously it is not particularly useful to sum the columns containing percentages.
Here is an example dataframe:
date<-c("2019-04-27", "2019-04-28", "2019-05-01")
name<-c("sam", "sam", "sam")
amt1<-c(3,6,2)
amt2<-c(4,2,7)
percent1<-c(0.25, 0.7, 0.42)
amt3<-c(13,7,4)
percent2<-c(0.54, 0.48, 0.77)
df<-data.frame(date,name, amt1, amt2, percent1, amt3, percent2)
df$date<-as.Date(df$date)
What I would like is a row that contains:
- the sums for columns amt1, amt2, amt3
- the means for columns percent1, percent2.
Anyone have any ideas of how to accomplish this?
One option would be to select the numeric columns (select_if), then, using mutate_if, take the mean of the columns where all values are less than 1, and in the next step take the sum of the columns where any value is greater than 1. (Disclaimer: the OP said there are no column name patterns or indices, and this is one of the possible heuristics suggested by the OP.)
library(tidyverse)
df %>%
  select_if(is.numeric) %>%
  mutate_if(~ all(.x < 1), mean) %>%
  mutate_if(~ any(.x > 1), sum) %>%
  slice(1) %>%
  bind_rows(df, .) %>%
  mutate(name = replace(as.character(name), n(), "Other"))
# date name amt1 amt2 percent1 amt3 percent2
#1 2019-04-27 sam 3 4 0.2500000 13 0.5400000
#2 2019-04-28 sam 6 2 0.7000000 7 0.4800000
#3 2019-05-01 sam 2 7 0.4200000 4 0.7700000
#4 <NA> Other 11 13 0.4566667 24 0.5966667
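If, unlike in the OP's real data, the summary columns can be identified by name, a more explicit sketch with across() (available from dplyr 1.0) avoids the all(.x < 1)/any(.x > 1) heuristic:
library(dplyr)
totals <- df %>%
  summarise(across(starts_with("amt"), sum),        # sums for the amount columns
            across(starts_with("percent"), mean))   # means for the percentage columns
bind_rows(df, totals) %>%
  mutate(name = replace(as.character(name), n(), "Other"))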
