Select duplicate rows based on another column value in R

How do I select all rows belonging to an ID when at least one row for that ID has a specific value in another column, e.g. Activity C?
This is the original dataframe:
ID Activity Cost
1 A 8
1 B 2
1 C 5
2 A 4
3 A 2
3 C 7
3 C 1
4 A 3
4 B 8
This is the target dataframe, i.e. only ID 1 and ID 3 are selected because they contain Activity C:
ID Activity Cost
1 A 8
1 B 2
1 C 5
3 A 2
3 C 7
3 C 1
Conversely, how do I get a dataframe of the IDs that do not have Activity C?
Flip-side dataframe:
ID Activity Cost
2 A 4
4 A 3
4 B 8
Cheers
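One way to approach this in base R is to first collect the IDs that have at least one Activity C row and then filter on membership in that set. This is only a sketch: the data frame name df and the helper names ids_with_c, with_c and without_c are made up for illustration.

```r
# Rebuild the example data from the question.
df <- data.frame(
  ID = c(1, 1, 1, 2, 3, 3, 3, 4, 4),
  Activity = c("A", "B", "C", "A", "A", "C", "C", "A", "B"),
  Cost = c(8, 2, 5, 4, 2, 7, 1, 3, 8)
)

# IDs that have at least one row with Activity C.
ids_with_c <- unique(df$ID[df$Activity == "C"])

# All rows of those IDs (target dataframe).
with_c <- df[df$ID %in% ids_with_c, ]

# All rows of the remaining IDs (flip-side dataframe).
without_c <- df[!df$ID %in% ids_with_c, ]
```

The same filter answers both halves of the question, since the flip side is just the negation of the membership test.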

Related

Count Number of Consecutive Occurrence of values in Table in sql server

I have a table like this:
ID Name
1 A
2 A
3 A
4 B
5 B
6 C
7 C
8 B
9 A
10 C
I want an incrementing frequency count like this:
ID Name Frequency
1 A 1
2 A 2
3 A 3
4 B 1
5 B 2
6 C 1
7 C 2
8 B 3
9 A 4
10 C 3
I need an MS SQL query to calculate the Frequency column.
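Judging from the expected output, Frequency is a running count of each Name in ID order rather than a count of consecutive runs. In SQL Server that kind of counter is usually written with ROW_NUMBER() OVER (PARTITION BY Name ORDER BY ID). The sketch below is not the SQL query that was asked for; it only illustrates the same logic in R, the language used on the rest of this page, and the name df is made up.

```r
# Example table from the question.
df <- data.frame(
  ID = 1:10,
  Name = c("A", "A", "A", "B", "B", "C", "C", "B", "A", "C")
)

# Make sure rows are in ID order, then number the rows within each Name.
df <- df[order(df$ID), ]
df$Frequency <- ave(df$ID, df$Name, FUN = seq_along)
```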

Split data.frame function by a field

I have a function that returns a data frame, and its output, which is displayed in an R Shiny app, is too long. I want to split it by the field fac, so that I get one table for fac = A and so on for each unique value of fac. How could I do that? Thank you.
prod()
x y fac
1 1 1 C
2 1 2 B
3 1 3 B
4 1 4 B
5 1 5 A
6 1 6 B
7 1 7 B
8 1 8 C
9 1 9 C
10 1 10 C
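A possible starting point is base R's split(), which returns a named list with one data frame per value of the grouping column. In this sketch prod_df stands in for the value returned by prod() in the app; that name, and by_fac, are made up for illustration.

```r
# Stand-in for the output of prod() shown above.
prod_df <- data.frame(
  x = rep(1, 10),
  y = 1:10,
  fac = c("C", "B", "B", "B", "A", "B", "B", "C", "C", "C")
)

# One data frame per unique value of fac, stored in a named list.
by_fac <- split(prod_df, prod_df$fac)

# Individual tables are then available by name.
by_fac$A
by_fac[["B"]]
```

Each element of by_fac could then be passed to its own table output in the app.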

Find minimal value for a multiple same keys in table [duplicate]

I have a table which contains multiple rows of different data for a key made up of multiple columns.
The table looks like this:
A B C
1 1 1 2
2 1 1 3
3 2 1 4
4 1 2 4
5 2 2 3
6 2 3 1
7 2 3 2
8 2 3 2
I have also found out how to remove all duplicate rows using the unique command on multiple columns, so data duplication is not a problem.
For every key (columns A and B in the example), I would like to find only the minimum value in the third column (column C).
In the end the table should look like this:
A B C
1 1 1 2
3 2 1 4
4 1 2 4
5 2 2 3
6 2 3 1
Thanks for any help, it is really appreciated.
If anything is unclear, feel free to ask.
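# Read the example table; the first column holds the row names.
con <- textConnection(" A B C
1 1 1 2
2 1 1 3
3 2 1 4
4 1 2 4
5 2 2 3
6 2 3 1
7 2 3 2
8 2 3 2")
df <- read.table(con, header = TRUE)
close(con)
# Sort by the key columns and then by C, and keep the sorted result,
# so that the first row of each (A, B) key carries the minimum C.
df <- df[with(df, order(A, B, C)), ]
# Keep only the first row of each (A, B) key.
df[!duplicated(df[1:2]), ]
#   A B C
# 1 1 1 2
# 4 1 2 4
# 3 2 1 4
# 5 2 2 3
# 6 2 3 1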
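If only the minimum per key is needed and the original row names do not matter, aggregate() from base R may be a shorter alternative. This is a sketch on the same example data; the names df and min_by_key are illustrative.

```r
# Same example data as above, built directly.
df <- data.frame(
  A = c(1, 1, 2, 1, 2, 2, 2, 2),
  B = c(1, 1, 1, 2, 2, 3, 3, 3),
  C = c(2, 3, 4, 4, 3, 1, 2, 2)
)

# Minimum of C for every combination of A and B.
min_by_key <- aggregate(C ~ A + B, data = df, FUN = min)
```

Unlike the sort-and-deduplicate approach, this returns only the key columns and the aggregated column, so any other columns of the original rows are not carried along.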

Removing duplicates in R

I have a large dataset (>37 m individuals) and I am using R. I am very much a beginner. Currently, I'm trying (and trying, and trying) to calculate the average household size per province in the country that I am analyzing. I have managed to create a separate data frame with the variables required to give an individual number to each person, and thus a household number, under the variable called HH (for households). Now I want R to remove the duplicates from this specific column in the new data frame that I created, i.e. the HH column.
I have tried numerous times using the duplicated() and unique() functions but it does not work. I've also tried to isolate this "HH" column in a separate sheet, but these functions still do not remove the duplicates. I've also tried converting it into a vector and then applying duplicated() and unique() (as you can see below).
When I use a smaller sample in excel it works perfectly well (asking excel to remove the duplicates).
This is how I created my dataset based on my initial dataset (i.e. PHCKCON):
HHvars<-c("eano", "county", "tif")
HHKE<-PHCKCON[HHvars]
as.numeric(HHKE$county)
HHKE$county<-as.numeric(HHKE$county)
Then I created a 4th column for my households:
HHKE$HH<-(paste(HHKE$eano, HHKE$county, HHKE$tif))
Here is an example of my dataset:
The values in the first three columns are numeric whilst the last are classified as characters
Here is a small sample of the data (I invented these but same idea):
Enumeration.area County Household.members
1 a 4
1 a 4
1 a 6
1 a 6
1 a 8
1 a 8
1 a 8
2 a 4
2 a 4
2 a 6
1 b 6
1 b 6
1 b 8
1 b 8
1 b 12
1 b 12
1 b 12
1 b 12
And here is what I did to create my 4th column called HH:
mydata$HH<-paste(mydata$Enumeration.area, mydata$County, mydata$Household.members)
It then gives a fourth column.
HH
1 a 4
1 a 4
1 a 6
1 a 6
1 a 8
1 a 8
1 a 8
1 a 8
2 a 4
2 a 4
2 a 6
2 a 8
1 b 6
1 b 6
1 b 8
1 b 8
1 b 12
1 b 12
1 b 12
1 b 12
Then I created a separate dataset for my HH column (in order to remove the duplicates):
attach(mydata)
HHvars<-c("HH")
EX2<-mydata[HHvars]
I then tried to remove the duplicates from the HH column of EX2:
EX2[!duplicated(EX2$HH),]
But it is not working, and neither is the unique() function.
I hope that it is clearer! And still grateful for any help.
Cheers,
Madeleine
If what you're asking for is simply the mean and median for each county of each enumeration.area, you can do this rather quickly using dplyr. I made up some data below to somewhat match yours.
library(dplyr)
HH <- data.frame(
  Enumeration.area = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  County = c('a', 'a', 'b', 'a', 'a', 'a', 'b', 'a', 'b'),
  Household.members = c(4, 6, 5, 8, 10, 9, 3, 4, 3)
)
HH %>%
  group_by(Enumeration.area, County) %>%
  summarise(mean = mean(Household.members), median = median(Household.members))
Which results in:
Enumeration.area County mean median
(dbl) (fctr) (dbl) (dbl)
1 1 a 5 5
2 1 b 5 5
3 2 a 9 9
4 3 a 4 4
5 3 b 3 3
Then each row of the resulting data set is a unique combination of Enumeration.area and County, and for each of those combinations you'll have your mean and median household numbers.
edit:
Since what you want is a concatenated identifier for each observation, this is how you could create it:
df <- HH %>% group_by(Enumeration.area, County) %>%
  mutate(id = paste(Enumeration.area, County, Household.members))
This will create a character string that is the combination of Enumeration.area, County, and Household.members. Then distinct(id, .keep_all = TRUE) will remove any duplicates while keeping the other columns, as shown below (in current dplyr versions, distinct(id) on its own returns only the id and grouping columns):
df
Enumeration.area County Household.members id
(dbl) (fctr) (dbl) (chr)
1 1 a 4 1 a 4
2 1 a 6 1 a 6
3 1 b 5 1 b 5
4 2 a 8 2 a 8
5 2 a 10 2 a 10
6 2 a 9 2 a 9
7 3 b 3 3 b 3
8 3 a 4 3 a 4
9 3 b 3 3 b 3
df %>% distinct(id, .keep_all = TRUE)
Enumeration.area County Household.members id
(dbl) (fctr) (dbl) (chr)
1 1 a 4 1 a 4
2 1 a 6 1 a 6
3 1 b 5 1 b 5
4 2 a 8 2 a 8
5 2 a 10 2 a 10
6 2 a 9 2 a 9
7 3 b 3 3 b 3
8 3 a 4 3 a 4
As you can see, the duplicate row "3 b 3" has now just been reduced to one unique observation.
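For the original goal, the average household size per county, the same idea can be sketched in base R. One thing worth checking in the question's attempt is that the result of EX2[!duplicated(EX2$HH),] is never assigned to anything, so EX2 itself stays unchanged; whether that is the actual cause is only a guess, since the post does not show the output. The sketch below uses a shortened version of the invented sample and assumes that each distinct HH string identifies one household; households and avg_size are made-up names.

```r
# Shortened version of the invented sample from the question.
mydata <- data.frame(
  Enumeration.area = c(1, 1, 1, 1, 2, 2, 1, 1),
  County = c("a", "a", "a", "a", "a", "a", "b", "b"),
  Household.members = c(4, 4, 6, 6, 4, 4, 6, 6)
)

# Household identifier, built as in the question.
mydata$HH <- paste(mydata$Enumeration.area, mydata$County,
                   mydata$Household.members)

# Keep one row per household; the result must be assigned to be kept.
households <- mydata[!duplicated(mydata$HH), ]

# Average household size per county, one row per county.
avg_size <- aggregate(Household.members ~ County, data = households,
                      FUN = mean)
```

Filtering the full data frame rather than a one-column copy keeps County and Household.members available for the aggregation step.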

Drop out observations by conditioning on other data in R

I have two data sets as shown below. I want to remove the observations in the first data set that match the second data set on ID, V1 and user.
How should I do that?
ID V1 user V2 V3 V4 ...
1 1 A 10
1 2 B 15
1 3 C 13
2 1 A 11
2 1 B 13
3 1 C 15
3 2 B 20
4 1 D 11
4 2 A 15
4 3 B 11
4 3 C 12
ID V1 user
1 3 C
2 1 B
3 2 B
4 3 C
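A left join on its own keeps every row of the first data set, so flag the rows of the second data set before merging:
dataframe_2$matched <- TRUE
merged_df <- merge(dataframe_1, dataframe_2, by = c("ID", "V1", "user"), all.x = TRUE)
Rows of the first data set that have no match get NA in the matched column. Keeping only those rows, e.g. with merged_df[is.na(merged_df$matched), ], excludes the observations that matched.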
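A self-contained way to try this on the example data is sketched below. Instead of merging, it builds a combined key from ID, V1 and user and drops the rows of the first data set whose key appears in the second. The names data1, data2, key1, key2 and remaining are made up for illustration, and only the V2 column is included because the other columns are elided in the question.

```r
# First data set (only V2 is shown in the question).
data1 <- data.frame(
  ID = c(1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 4),
  V1 = c(1, 2, 3, 1, 1, 1, 2, 1, 2, 3, 3),
  user = c("A", "B", "C", "A", "B", "C", "B", "D", "A", "B", "C"),
  V2 = c(10, 15, 13, 11, 13, 15, 20, 11, 15, 11, 12)
)

# Second data set: the observations to drop.
data2 <- data.frame(
  ID = c(1, 2, 3, 4),
  V1 = c(3, 1, 2, 3),
  user = c("C", "B", "B", "C")
)

# Combined keys over the three matching columns.
key1 <- paste(data1$ID, data1$V1, data1$user)
key2 <- paste(data2$ID, data2$V1, data2$user)

# Keep the rows of data1 whose key does not occur in data2.
remaining <- data1[!key1 %in% key2, ]
```

If dplyr is already in use, anti_join(data1, data2, by = c("ID", "V1", "user")) expresses the same operation in one call.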
