Removing duplicates in R

I have a large dataset (>37 million individuals) and I am using R. I am very much a beginner. Currently, I'm trying (and trying, and trying) to calculate the average household size per province in the country that I am analyzing. I have managed to create a separate data frame with the required variables to give each person an individual number, and from that a household number under a variable called HH (for HouseHolds). Now I want R to remove the duplicates from this specific column in the new data frame that I created, i.e. the HH column.
I have tried numerous times using the duplicated() and unique() functions but it does not work. I've also tried to isolate this HH column in a separate data frame, but these functions still do not remove the duplicates. I've also tried converting it into a vector and then applying duplicated() and unique() (as you can see below).
When I use a smaller sample in Excel it works perfectly well (asking Excel to remove the duplicates).
This is how I created my dataset based on my initial dataset (i.e. PHCKCON):
HHvars<-c("eano", "county", "tif")
HHKE<-PHCKCON[HHvars]
as.numeric(HHKE$county)
HHKE$county<-as.numeric(HHKE$county)
Then I created a 4th column for my households:
HHKE$HH<-(paste(HHKE$eano, HHKE$county, HHKE$tif))
Here is an example of my dataset:
The values in the first three columns are numeric whilst the last is classified as character.
Here is a small sample of the data (I invented these but same idea):
Enumeration.area County Household.members
1 a 4
1 a 4
1 a 6
1 a 6
1 a 8
1 a 8
1 a 8
2 a 4
2 a 4
2 a 6
1 b 6
1 b 6
1 b 8
1 b 8
1 b 12
1 b 12
1 b 12
1 b 12
And here is what I did to create my 4th column called HH:
mydata$HH<-paste(mydata$Enumeration.area, mydata$County, mydata$Household.members)
It then gives a fourth column.
HH
1 a 4
1 a 4
1 a 6
1 a 6
1 a 8
1 a 8
1 a 8
1 a 8
2 a 4
2 a 4
2 a 6
2 a 8
1 b 6
1 b 6
1 b 8
1 b 8
1 b 12
1 b 12
1 b 12
1 b 12
Then I created a separate dataset for my HH column (in order to de-duplicate it):
attach(mydata)
HHvars<-c("HH")
EX2<-mydata[HHvars]
I then tried to de-duplicate the HH column of EX2:
EX2[!duplicated(EX2$HH),]
But it is not working, and the unique() function does not work either.
I hope that it is clearer! And still grateful for any help.
Cheers,
Madeleine

If what you're asking for is simply the mean and median for each County within each Enumeration.area, you can do this rather quickly using dplyr. I made up some data below to roughly match yours.
library(dplyr)
HH <- data.frame(
  Enumeration.area = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  County = c('a', 'a', 'b', 'a', 'a', 'a', 'b', 'a', 'b'),
  Household.members = c(4, 6, 5, 8, 10, 9, 3, 4, 3)
)
HH %>%
  group_by(Enumeration.area, County) %>%
  summarise(mean = mean(Household.members), median = median(Household.members))
Which results in:
Enumeration.area County mean median
(dbl) (fctr) (dbl) (dbl)
1 1 a 5 5
2 1 b 5 5
3 2 a 9 9
4 3 a 4 4
5 3 b 3 3
Then each row of the resulting data set is a unique combination of Enumeration.area and County, and for each of those combinations you'll have your mean and median household numbers.
edit:
Since your desired output involves creating a concatenated identifier for each observation, this is how you could do that:
df <- HH %>%
  group_by(Enumeration.area, County) %>%
  mutate(id = paste(Enumeration.area, County, Household.members))
This will create a character string that is the combination of Enumeration.area, County, and Household.members. Then using distinct(id) will remove any duplicates, as shown below:
df
Enumeration.area County Household.members id
(dbl) (fctr) (dbl) (chr)
1 1 a 4 1 a 4
2 1 a 6 1 a 6
3 1 b 5 1 b 5
4 2 a 8 2 a 8
5 2 a 10 2 a 10
6 2 a 9 2 a 9
7 3 b 3 3 b 3
8 3 a 4 3 a 4
9 3 b 3 3 b 3
df %>% distinct(id)
Enumeration.area County Household.members id
(dbl) (fctr) (dbl) (chr)
1 1 a 4 1 a 4
2 1 a 6 1 a 6
3 1 b 5 1 b 5
4 2 a 8 2 a 8
5 2 a 10 2 a 10
6 2 a 9 2 a 9
7 3 b 3 3 b 3
8 3 a 4 3 a 4
As you can see, the duplicated row "3 b 3" has been reduced to a single unique observation.
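Tying this back to the original goal of average household size per county: one possible sketch (assuming, as the concatenated identifier above does, that Enumeration.area + County + Household.members uniquely keys a household) is to drop duplicates first and then average. Note the caveat that two different households of the same size in the same area would collapse into one row under this identifier.

```r
library(dplyr)

# Invented data, same shape as the question's sample
mydata <- data.frame(
  Enumeration.area  = c(1, 1, 1, 2, 2, 1, 1, 1),
  County            = c("a", "a", "a", "a", "a", "b", "b", "b"),
  Household.members = c(4, 4, 6, 4, 6, 6, 8, 12)
)

avg_hh <- mydata %>%
  distinct(Enumeration.area, County, Household.members) %>%  # one row per household
  group_by(County) %>%
  summarise(avg_household_size = mean(Household.members))
avg_hh
```

For county "a" the distinct households have sizes 4, 6, 4, 6, so the average household size comes out to 5.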

How do I create an index variable based on three variables in R? [duplicate]

I'm trying to create an index variable based on an individual identifier, a test name, and the date the test was taken in R. My data has repeated students taking the same test over and over with different scores. I'd like to be able to easily identify what number try each observation is for that specific test. My data looks something like this and I'd like to create a variable like the ID variable shown. It should start over at 1 and count, in order of date, the number of observations with the same student and test name.
student <- c(1,1,1,1,1,1,2,2,2,3,3,3,3,3)
test <-c("math","math","reading","math","reading","reading","reading","math","reading","math","math","math","reading","reading")
date <- c(1,2,3,3,4,5,2,3,5,1,2,3,4,5)
data <- data.frame(student,test,date)
print(data)
student test date
1 1 math 1
2 1 math 2
3 1 reading 3
4 1 math 3
5 1 reading 4
6 1 reading 5
7 2 reading 2
8 2 math 3
9 2 reading 5
10 3 math 1
11 3 math 2
12 3 math 3
13 3 reading 4
14 3 reading 5
I want to add a variable that indicates the attempt number for a test taken by the same student so it looks something like this:
student test date id
1 1 math 1 1
2 1 math 2 2
3 1 reading 3 1
4 1 math 3 3
5 1 reading 4 2
6 1 reading 5 3
7 2 reading 2 1
8 2 math 3 1
9 2 reading 5 2
10 3 math 1 1
11 3 math 2 2
12 3 math 3 3
13 3 reading 4 1
14 3 reading 5 2
I figured out how to create an ID variable based on only one other variable (for example, the student number), but I don't know how to do it for multiple variables. I also tried cumsum, but that keeps counting across new values and doesn't restart at 1 for each new group.
tests <- transform(tests, ID = as.numeric(factor(EMPLID)))
tests$id <-cumsum(!duplicated(tests[1:3]))
library(dplyr)
data %>%
  group_by(student, test) %>%
  arrange(date, .by_group = TRUE) %>%  ## make sure things are sorted by date
  mutate(id = row_number()) %>%
  ungroup()
# # A tibble: 14 × 4
# student test date id
# <dbl> <chr> <dbl> <int>
# 1 1 math 1 1
# 2 1 math 2 2
# 3 1 math 3 3
# 4 1 reading 3 1
# 5 1 reading 4 2
# 6 1 reading 5 3
# 7 2 math 3 1
# 8 2 reading 2 1
# 9 2 reading 5 2
# 10 3 math 1 1
# 11 3 math 2 2
# 12 3 math 3 3
# 13 3 reading 4 1
# 14 3 reading 5 2
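For reference, the same grouped attempt counter can also be built in base R with ave(). A minimal sketch on a small invented subset, assuming you sort by date within each student/test group first:

```r
# Base-R sketch: within-group sequence numbers via ave()
student <- c(1, 1, 1, 2, 2)
test    <- c("math", "math", "reading", "math", "math")
date    <- c(2, 1, 3, 1, 2)
d <- data.frame(student, test, date)

d <- d[order(d$student, d$test, d$date), ]   # sort so the counter follows date order
d$id <- ave(seq_len(nrow(d)), d$student, d$test, FUN = seq_along)
d
```

ave() applies seq_along within each student/test group, so each student's attempts at a given test are numbered 1, 2, 3, ... in date order.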

How to create a column/index based on either of two conditions being met (to enable clustering of matched pairs within same dataframe)?

I have a large dataset of matched pairs (id1 and id2) and would like to create an index variable to enable me to merge these pairs into rows.
As such, the first row would be index 1 and from then on the index will increase by 1, unless either id1 or id2 match any of the values in previous rows. Where this is the case, the previously attributed index should be applied.
I have looked for weeks and most solutions seem to fall short of what I need.
Here's some data to replicate what I have:
id1 <- c(1,2,2,4,6,7,9,11)
id2 <- c(2,3,4,5,7,8,10,2)
df <- cbind(id1,id2)
df <- as.data.frame(df)
df
id1 id2
1 1 2
2 2 3
3 2 4
4 4 5
5 6 7
6 7 8
7 9 10
8 11 2
And here's what I hope to achieve:
#wanted result
index <- c(1,1,1,1,2,2,3,1)
df_indexed <- cbind(df,index)
df_indexed
id1 id2 index
1 1 2 1
2 2 3 1
3 2 4 1
4 4 5 1
5 6 7 2
6 7 8 2
7 9 10 3
8 11 2 1
It may be easier to do this in igraph:
library(igraph)
g <- graph_from_data_frame(df)
df$index <- components(g)$membership[as.character(df$id1)]
df$index
#[1] 1 1 1 1 2 2 3 1

R: Return values in a columns when the value in another column becomes negative for the first time

For each ID, I want to return the value in the 'distance' column where the value becomes negative for the first time. If the value does not become negative at all, return the value 99 (or some other random number) for that ID. A sample data frame is given below.
df <- data.frame(ID=c(rep(1, 4),rep(2,4),rep(3,4),rep(4,4),rep(5,4)),distance=rep(1:4,5), value=c(1,4,3,-1,2,1,-4,1,3,2,-1,1,-4,3,2,1,2,3,4,5))
> df
ID distance value
1 1 1 1
2 1 2 4
3 1 3 3
4 1 4 -1
5 2 1 2
6 2 2 1
7 2 3 -4
8 2 4 1
9 3 1 3
10 3 2 2
11 3 3 -1
12 3 4 1
13 4 1 -4
14 4 2 3
15 4 3 2
16 4 4 1
17 5 1 2
18 5 2 3
19 5 3 4
20 5 4 5
The desired output is as follows
> df2
ID first_negative_distance
1 1 4
2 2 3
3 3 3
4 4 1
5 5 99
I tried but couldn't figure out how to do it with dplyr. Any help would be much appreciated. The actual data I'm working on has thousands of IDs with 30 different distance levels for each. Bear in mind that for any ID there could be multiple instances of negative values; I just need the first one.
Edit:
Tried the solution proposed by AntonoisK.
> df%>%group_by(ID)%>%summarise(first_neg_dist=first(distance[value<0]))
first_neg_dist
1 4
This is the result I am getting. Does not match what Antonois got. Not sure why.
library(dplyr)
df %>%
  group_by(ID) %>%
  summarise(first_neg_dist = first(distance[value < 0]))
# # A tibble: 5 x 2
# ID first_neg_dist
# <dbl> <int>
# 1 1 4
# 2 2 3
# 3 3 3
# 4 4 1
# 5 5 NA
If you really prefer 99 instead of NA you can use
summarise(first_neg_dist = coalesce(first(distance[value < 0]), 99L))
instead.
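If dplyr isn't available, a base-R sketch of the same idea, with the 99 fallback built in, could use split() (this assumes rows are ordered by distance within each ID, as in the sample data; the small data frame below is invented for illustration):

```r
df <- data.frame(ID       = c(1, 1, 2, 2, 3),
                 distance = c(1, 2, 1, 2, 1),
                 value    = c(1, -1, 2, 3, -4))

# For each ID, take the first distance at which value is negative, else 99
res <- do.call(rbind, lapply(split(df, df$ID), function(g) {
  neg <- g$distance[g$value < 0]
  data.frame(ID = g$ID[1],
             first_neg_dist = if (length(neg)) neg[1] else 99)
}))
res
```

Here ID 1 first goes negative at distance 2, ID 3 at distance 1, and ID 2 never does, so it gets 99.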

R reshaping data - aggregating data on one part of a table to append to another

I have some survey data that I'd like to reshape so I can interactively slice and dice it using filters. However, I'm stuck on how to reshape the data in traditional ways, and I couldn't figure out the appropriate use of the reshape package. Please help!
The data is as follows: each respondent is in a row, along with the responses to each question. Additional columns hold demographic information on each respondent.
ID Q1 Q2 Q3 … Q30 Demo1 Demo2 Demo3 Average Score
1 1 2 2 … 2 1 1 1 2.5
2 2 3 1 … 5 1 2 1 2.7
3 4 1 5 … 4 2 3 2 1.6
4 1 5 4 … 3 2 1 2 2.5
5 3 4 4 … 1 1 2 2 1.4
The goal is to reshape the data so that each unique question/demographic combination is one row, with the average of the scores (and the number of responses) for that combination as values.
Question Demo1 Demo2 Demo3 Average NumResp
1 1 1 1 3.4 2
1 1 1 2 2.3 5
1 1 1 3 3.1 1
… … … … … ...
30 4 5 3 1.3 9
As part 2 of the question, there are also calculations that map the 1-5 responses to "positive", "neutral" or "negative". It would be great to add this as a column showing the % of all respondents in that specific demographic falling into each of the three categories, with the three values adding up to 100%.
Q Sentiment Demo1 Demo2 Demo3 Average
1 Positive 1 1 1 3.4
1 Neutral 1 1 1 2.3
1 Negative 1 1 1 3.1
… … … … …
30 Negative 4 5 3 1.3
Any help is greatly appreciated! Would prefer to do this in R, though Python will work too.
With melt we can specify the id variables (grouping) or the measure variables (to collapse to "long"). The argument variable.name lets us name the new variable created by collapsing the wide columns, and value.name lets us name the value column. All of this and more is covered in the documentation at ?melt.data.frame.
To create the Sentiment variable we use cut to break the value range of scores into thirds. There is an argument called labels that allows us to choose the names of the new values.
library(reshape2)
m <- melt(df, id.vars=c("Demo1", "Demo2", "Demo3"), variable.name="Question", value.name="Average")
m$Question <- gsub("Q", "", m$Question)
a <- aggregate(Average~., m, mean)
a$Sentiment <- cut(a$Average, seq(1, 5, length.out=4), labels=c("Negative", "Neutral", "Positive"), include.lowest=TRUE)
# Demo1 Demo2 Demo3 Question Average Sentiment
# 1 1 1 1 1 1 Negative
# 2 1 2 1 1 2 Negative
# 3 2 1 2 1 1 Negative
# 4 1 2 2 1 3 Neutral
# 5 2 3 2 1 4 Positive
# 6 1 1 1 2 2 Negative
# 7 1 2 1 2 3 Neutral
# 8 2 1 2 2 5 Positive
# 9 1 2 2 2 4 Positive
# 10 2 3 2 2 1 Negative
Note below that I deleted the "ID" and "Average.Score" columns as they will be recalculated in the process.
Data
df <- read.table(text="
ID Q1 Q2 Q3 Q30 Demo1 Demo2 Demo3 Average.Score
1 1 2 2 2 1 1 1 2.5
2 2 3 1 5 1 2 1 2.7
3 4 1 5 4 2 3 2 1.6
4 1 5 4 3 2 1 2 2.5
5 3 4 4 1 1 2 2 1.4", header=T)
df <- df[,!names(df) %in% c("ID", "Average.Score")]
Under the assumption that you have a data set like this (make it a data.table):
ID Q1 Q2 ... Demo1 Demo2 Demo3
1: 1 7 8 2 7 3
2: 2 3 7 6 10 1
3: 3 6 1 5 5 8
4: 4 5 9 10 1 7
5: 5 10 4 8 4 6
and a dictionary of answer scores:
value Question Score
1: 7 1 17
2: 3 1 6
3: 6 1 19
Let's transform the data to have Question, value, ID, and Demo columns:
d2 <- melt(dt, id.vars=c('ID', 'Demo1', 'Demo2', 'Demo3'),
           measure.vars=grep('^Q[0-9]+$', colnames(dt), value=TRUE))
d2[, c('Question', 'variable') := list(substring(variable, 2), NULL)]
R> d2
ID Demo1 Demo2 Demo3 value Question
1: 1 2 7 3 7 1
2: 2 6 10 1 3 1
3: 3 5 5 8 6 1
Now let's add scores:
d3 <- merge(d2, vals_enc, by=c('Question', 'value'))
And finally, get the average score and the number of respondents per Question and demographics:
d3[, list(Avg=mean(Score), Number=.N), .(Question,Demo1,Demo2,Demo3)]
Question Demo1 Demo2 Demo3 Avg Number
1: 1 6 10 1 6 1
2: 1 10 1 7 18 1
3: 1 5 5 8 19 1
Note: each ID has a single demographic status, so the number of respondents for each combination of demographics and Question should be the same.
As for part 2 of the question: do you already have such calculations, or are you looking for them?

R For Loop with Certain conditions

I have a dataframe (surveillance) with many variables (villages, houses, weeks). I want to eventually do a time-series analysis.
Currently, each village has between 1 and 183 weeks, and each week has several houses associated with it. I need each village to have a single data point per week, so I need to sum up a third variable.
Example:
Village Week House Affect
A 3 7 12
B 6 3 0
C 6 2 2
A 3 9 1
A 5 8 0
A 5 2 8
C 7 19 0
C 7 2 1
I tried this and failed. How do I ask R to sum only the observations with the same village and week values?
for (i in seq(along=surveillance)) {
if (surveillance$village== surveillance$village& surveillance$week== surveillance$week)
{surveillance$sumaffect <- sum(surveillance$affected)}
}
Thanks
No need for a loop. Use ddply or similar:
library(plyr)
Village = c("A","B","C","A","A","A","C","C")
Week = c(3,6,6,3,5,5,7,7)
Affect = c(12,0,2,1,0,8,0,1)
df = data.frame(Village,Week,Affect)
View(df)
result = ddply(df,.(Village,Week),summarise, val = sum(Affect))
View(result)
DF:
Village Week Affect
1 A 3 12
2 B 6 0
3 C 6 2
4 A 3 1
5 A 5 0
6 A 5 8
7 C 7 0
8 C 7 1
Result:
Village Week val
1 A 3 13
2 A 5 8
3 B 6 0
4 C 6 2
5 C 7 1
The function aggregate will do what you need.
dfs <- ' Village Week House Affect
1 A 3 7 12
2 B 6 3 0
3 C 6 2 2
4 A 3 9 1
5 A 5 8 0
6 A 5 2 8
7 C 7 19 0
8 C 7 2 1
'
df <- read.table(text=dfs)
Then the aggregation:
> aggregate(Affect ~ Village + Week , data=df, sum)
Village Week Affect
1 A 3 13
2 A 5 8
3 B 6 0
4 C 6 2
5 C 7 1
This is an example of a split-apply-combine strategy; if you find yourself doing this often, investigate dplyr (or plyr, its ancestor) or data.table as tools for doing this sort of analysis quickly.
EDIT: updated to use sum instead of mean
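As a sketch of that suggestion, here is one possible dplyr phrasing of the same aggregation (not the answerer's code, just an equivalent for comparison):

```r
library(dplyr)

df <- data.frame(Village = c("A","B","C","A","A","A","C","C"),
                 Week    = c(3,6,6,3,5,5,7,7),
                 Affect  = c(12,0,2,1,0,8,0,1))

# One row per Village/Week combination, with Affect summed within each group
out <- df %>%
  group_by(Village, Week) %>%
  summarise(Affect = sum(Affect), .groups = "drop")
out
```

This yields the same five-row result as the aggregate() call above, e.g. village A in week 3 sums to 13.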
