Replacing NA with observed values? [duplicate] - r

This question already has answers here:
Filling missing value in group
(3 answers)
Replace NA with previous or next value, by group, using dplyr
(5 answers)
Closed 2 years ago.
I have a dataset that contains multiple observations per person. In some cases an individual will have their ethnicity recorded in some rows but missing in others. In R, how can I replace the NAs with the ethnicity stated in that person's other rows without having to change them manually?
Example:
PersonID Ethnicity
1 A
1 A
1 NA
1 NA
1 A
2 NA
2 B
2 NA
3 NA
3 NA
3 A
3 NA
Need:
PersonID Ethnicity
1 A
1 A
1 A
1 A
1 A
2 B
2 B
2 B
3 A
3 A
3 A
3 A

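You could use fill from tidyr, grouping by PersonID first so that values are only filled within each person:
library(dplyr)
library(tidyr)
df %>%
  group_by(PersonID) %>%
  fill(Ethnicity, .direction = "downup")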
# A tibble: 12 x 2
# Groups: PersonID [3]
PersonID Ethnicity
<int> <fct>
1 1 A
2 1 A
3 1 A
4 1 A
5 1 A
6 2 B
7 2 B
8 2 B
9 3 A
10 3 A
11 3 A
12 3 A
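For readers who prefer base R, the same idea can be sketched with ave(). This is only a minimal sketch: it assumes each person has at most one distinct observed ethnicity, so the first non-missing value in the group can stand in for every NA. The helper name fill_group is hypothetical, and the data frame is rebuilt here from the question's example.

```r
# Example data rebuilt from the question
df <- data.frame(
  PersonID = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
  Ethnicity = c("A", "A", NA, NA, "A", NA, "B", NA, NA, NA, "A", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: replace NAs with the first observed value in the group.
# A group with no observed value at all is returned unchanged.
fill_group <- function(x) {
  observed <- x[!is.na(x)]
  if (length(observed) > 0) {
    x[is.na(x)] <- observed[1]
  }
  x
}

# ave() applies the helper within each PersonID and keeps the row order
df$Ethnicity <- ave(df$Ethnicity, df$PersonID, FUN = fill_group)
```

Unlike fill() with .direction = "downup", which takes the nearest observed value, this sketch always takes the first one, so the two only agree when a person's recorded ethnicity is consistent.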

Related

Remove NA rows based on multiple columns' names in R [duplicate]

This question already has answers here:
Omit rows containing specific column of NA
(10 answers)
Closed 2 years ago.
Given a small dataset as follows:
A B C
1 2 NA
NA 2 3
1 NA 3
1 2 3
How could I remove the rows in which column B or column C contains an NA?
The expected result looks like this:
A B C
NA 2 3
1 2 3
Another option in Base R is
df[complete.cases(df[c("B","C")]),]
A B C
2 NA 2 3
4 1 2 3
With base R:
df[!is.na(df$B) & !is.na(df$C),]
Using dplyr:
df %>%
filter(!is.na(B), !is.na(C))
returns
# A tibble: 2 x 3
A B C
<dbl> <dbl> <dbl>
1 NA 2 3
2 1 2 3
or
df %>%
drop_na(B, C)
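When the columns to check are held in a character vector, the same filter can be sketched in base R by counting NAs per row. This is a minimal sketch under the assumption that the data frame matches the question's example; the names cols, keep and result are illustrative.

```r
# Example data rebuilt from the question
df <- data.frame(
  A = c(1, NA, 1, 1),
  B = c(2, 2, NA, 2),
  C = c(NA, 3, 3, 3)
)

# Columns that must not contain NA (illustrative variable name)
cols <- c("B", "C")

# is.na() on a data frame gives a logical matrix; rowSums() counts NAs per row
keep <- rowSums(is.na(df[cols])) == 0

# Keep only rows with no NA in the selected columns (NAs in A are allowed)
result <- df[keep, ]
```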

Generate data frame with parameters [duplicate]

This question already has answers here:
Fill missing dates by group
(3 answers)
Fastest way to add rows for missing time steps?
(4 answers)
Closed 3 years ago.
I have a data frame of ids with a number column:
df <- read.table(text="
id nr
1 1
2 1
1 2
3 1
1 3
", header=TRUE)
I'd like to create a new data frame from it in which every id is paired with every unique nr found in df. As you may notice, id 3 has only nr 1, with no 2 or 3 (the same is true of id 2). So the result should be:
result <- read.table(text="
id nr
1 1
1 2
1 3
2 1
2 2
2 3
3 1
3 2
3 3
", header=TRUE)
You can use expand.grid as:
library(dplyr)
result <- expand.grid(id = unique(df$id), nr = unique(df$nr)) %>%
arrange(id)
result
id nr
1 1 1
2 1 2
3 1 3
4 2 1
5 2 2
6 2 3
7 3 1
8 3 2
9 3 3
We can do:
tidyr::expand(df,id,nr)
# A tibble: 9 x 2
id nr
<int> <int>
1 1 1
2 1 2
3 1 3
4 2 1
5 2 2
6 2 3
7 3 1
8 3 2
9 3 3
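If loading dplyr or tidyr is not desirable, the expand.grid approach can be sketched in base R alone by sorting with order() instead of arrange(). This is a minimal sketch using the question's data; the intermediate name full is illustrative.

```r
# Example data from the question
df <- read.table(text = "
id nr
1 1
2 1
1 2
3 1
1 3
", header = TRUE)

# Every combination of the unique ids and the unique nr values
full <- expand.grid(id = unique(df$id), nr = unique(df$nr))

# Sort by id, then nr, and reset the row names
result <- full[order(full$id, full$nr), ]
rownames(result) <- NULL
```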

Rearranging columns with NAs [duplicate]

This question already has answers here:
How to move cells with a value row-wise to the left in a dataframe [duplicate]
(5 answers)
Closed 4 years ago.
Sorry guys,
this is probably a silly question but I do not manage to find a quick solution to solve this issue.
I have a dataframe of this form indicating the number of components of households and gender of each member
Familyid Gender_1 Gender_2 Gender_3 Gender_4 Ncomponent
1 1 NA NA NA 1
2 NA 1 NA NA 1
3 1 2 NA NA 2
4 1 NA 2 NA 2
5 NA 1 2 NA 2
6 2 NA NA 1 2
I would like to collect this info just in two columns in the following way.
Familyid Gender_member1 Gender_member2 Ncomponent
1 1 NA 1
2 1 NA 1
3 1 2 2
4 1 2 2
5 1 2 2
6 2 1 2
In other words, I want to create one column indicating the gender of member 1, regardless of which column he/she is located in within my original dataframe, and another indicating the gender of the second family member, whenever the latter exists.
Can anyone help me out with this?
Marco
I just removed NAs for Gender_x columns.
xy <- read.table(text = "Familyid Gender_1 Gender_2 Gender_3 Gender_4 Ncomponent
1 1 NA NA NA 1
2 NA 1 NA NA 1
3 1 2 NA NA 2
4 1 NA 2 NA 2
5 NA 1 2 NA 2
6 2 NA NA 1 2",
header = TRUE)
xy
fetch.gender <- grepl("^Gender_\\d{1}$", names(xy))
# Keep the non-NA values of each row in order, padded with NA to two members
# (without padding, rbind would recycle the single value of one-member families)
out <- t(apply(xy[, fetch.gender], MARGIN = 1, FUN = function(x) x[!is.na(x)][1:2]))
colnames(out) <- c("Gender_member1", "Gender_member2")
data.frame(Familyid = xy$Familyid, out, Ncomponent = xy$Ncomponent)
Familyid Gender_member1 Gender_member2 Ncomponent
1 1 1 NA 1
2 2 1 NA 1
3 3 1 2 2
4 4 1 2 2
5 5 1 2 2
6 6 2 1 2

Combining rows by index in R [duplicate]

This question already has answers here:
Combining pivoted rows in R by common value
(4 answers)
Closed 4 years ago.
EDIT: I am aware there is a similar question that has been answered, but it does not work for me on the dataset I have provided below. The above dataframe is the result of me using the spread function. I am still not sure how to consolidate it.
EDIT2: I realized that the group_by function, which I had previously used on the data, is what was preventing the spread function from working in the way I wanted it to work originally. After using ungroup, I was able to go straight from the original dataset (not pictured below) to the 2nd dataframe pictured below.
I have a dataframe that looks like the following. I am trying to make it so that there is only 1 row for each id number.
id init_cont family 1 2 3
1 I C 1 NA NA
1 I C NA 4 NA
1 I C NA NA 3
2 I D 2 NA NA
2 I D NA 1 NA
2 I D NA NA 4
3 K C 3 NA NA
3 K C NA 4 NA
3 K C NA NA 1
I would like the resulting dataframe to look like this.
id init_cont family 1 2 3
1 I C 1 4 3
2 I D 2 1 4
3 K C 3 4 1
We can group_by 'id', 'init_cont' and 'family', and then use summarise_all to drop the NA elements in columns 1:3, which leaves one row per group
library(dplyr)
df1 %>%
group_by_at(names(.)[1:3]) %>%
summarise_all(na.omit)
#Or
#summarise_all(funs(.[!is.na(.)]))
# A tibble: 3 x 6
# Groups: id, init_cont [?]
# id init_cont family `1` `2` `3`
# <int> <chr> <chr> <int> <int> <int>
#1 1 I C 1 4 3
#2 2 I D 2 1 4
#3 3 K C 3 4 1
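The same consolidation can also be sketched in base R with aggregate(). This is a minimal sketch that assumes each value column holds exactly one non-NA entry per group, as in the question's example; the helper name first_non_na is hypothetical and the data frame is rebuilt here by hand.

```r
# Example data rebuilt from the question; check.names = FALSE keeps "1", "2", "3"
df1 <- data.frame(
  id = rep(1:3, each = 3),
  init_cont = rep(c("I", "I", "K"), each = 3),
  family = rep(c("C", "D", "C"), each = 3),
  `1` = c(1, NA, NA, 2, NA, NA, 3, NA, NA),
  `2` = c(NA, 4, NA, NA, 1, NA, NA, 4, NA),
  `3` = c(NA, NA, 3, NA, NA, 4, NA, NA, 1),
  check.names = FALSE
)

# Hypothetical helper: first non-missing value of a vector (NA if there is none)
first_non_na <- function(x) x[!is.na(x)][1]

# Collapse each group of identifying columns to one row
result <- aggregate(
  df1[c("1", "2", "3")],
  by = df1[c("id", "init_cont", "family")],
  FUN = first_non_na
)

# aggregate() does not return the groups in id order, so sort explicitly
result <- result[order(result$id), ]
```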

Removing duplicates in R

I have a large dataset (>37 m individuals) and I am using R. I am very much a beginner. Currently, I'm trying (and trying, and trying) to calculate the average household size per province in the country that I am analyzing. I have managed to create a separate data frame with the variables required to give an individual number to each person, and thus a household number under the variable called HH (for households). Now I want R to remove the duplicates from this specific column in the new data frame that I created, i.e. the HH column.
I have tried numerous times using the duplicated() and unique() functions but it does not work. I've also tried to isolate this "HH" column in a separate sheet, but these functions still do not remove the duplicates. I've also tried converting it into a vector and then applying duplicated() and unique() (as you can see beneath).
When I use a smaller sample in excel it works perfectly well (asking excel to remove the duplicates).
This is how I created my dataset based on my initial dataset (i.e. PHCKCON):
HHvars<-c("eano", "county", "tif")
HHKE<-PHCKCON[HHvars]
as.numeric(HHKE$county)
HHKE$county<-as.numeric(HHKE$county)
Then I created a 4th column for my households:
HHKE$HH<-(paste(HHKE$eano, HHKE$county, HHKE$tif))
Here is an example of my dataset:
The values in the first three columns are numeric, whilst the last column is classified as character.
Here is a small sample of the data (I invented these but same idea):
Enumeration.area County Household.members
1 a 4
1 a 4
1 a 6
1 a 6
1 a 8
1 a 8
1 a 8
2 a 4
2 a 4
2 a 6
1 b 6
1 b 6
1 b 8
1 b 8
1 b 12
1 b 12
1 b 12
1 b 12
And here is what I did to create my 4th column called HH:
mydata$HH<-paste(mydata$Enumeration.area, mydata$County, mydata$Household.members)
It then gives a fourth column.
HH
1 a 4
1 a 4
1 a 6
1 a 6
1 a 8
1 a 8
1 a 8
1 a 8
2 a 4
2 a 4
2 a 6
2 a 8
1 b 6
1 b 6
1 b 8
1 b 8
1 b 12
1 b 12
1 b 12
1 b 12
Then I created a separate dataset for my HH column (in order to remove the duplicates):
attach(mydata)
HHvars<-c("HH")
EX2<-mydata[HHvars]
I then tried to remove the duplicates from the HH column of EX2:
EX2[!duplicated(EX2$HH),]
But it is not working, and neither does the unique() function.
I hope that it is clearer! And still grateful for any help.
Cheers,
Madeleine
If what you're asking for is simply the mean and median for each county of each enumeration.area, you can do this rather quickly using dplyr. I made up some data below to somewhat match yours.
library(dplyr)
HH <- data.frame(
Enumeration.area=c(1,1,1,2,2,2,3,3,3),
County=c('a','a','b','a','a','a','b','a','b'),
Household.members=c(4,6,5,8,10,9,3,4,3)
)
HH %>% group_by(Enumeration.area,County) %>% summarise(mean=mean(Household.members),median=median(Household.members))
Which results in:
Enumeration.area County mean median
(dbl) (fctr) (dbl) (dbl)
1 1 a 5 5
2 1 b 5 5
3 2 a 9 9
4 3 a 4 4
5 3 b 3 3
Then each row of the resulting data set is a unique combination of Enumeration.area and County, and for each of those combinations you'll have your mean and median household numbers.
edit:
Since what you want is a concatenated identifier for each observation, this is how you could create one:
df <- HH %>% group_by(Enumeration.area,County) %>%
mutate(id=paste(Enumeration.area,County,Household.members))
This will create a character string that is the combination of Enumeration.area, County, and Household.members. Calling distinct() on id then removes any duplicates; in current dplyr versions, .keep_all = TRUE is needed to keep the other columns, as shown below:
df
Enumeration.area County Household.members id
(dbl) (fctr) (dbl) (chr)
1 1 a 4 1 a 4
2 1 a 6 1 a 6
3 1 b 5 1 b 5
4 2 a 8 2 a 8
5 2 a 10 2 a 10
6 2 a 9 2 a 9
7 3 b 3 3 b 3
8 3 a 4 3 a 4
9 3 b 3 3 b 3
df %>% distinct(id, .keep_all = TRUE)
Enumeration.area County Household.members id
(dbl) (fctr) (dbl) (chr)
1 1 a 4 1 a 4
2 1 a 6 1 a 6
3 1 b 5 1 b 5
4 2 a 8 2 a 8
5 2 a 10 2 a 10
6 2 a 9 2 a 9
7 3 b 3 3 b 3
8 3 a 4 3 a 4
As you can see, the duplicate row "3 b 3" has now just been reduced to one unique observation.
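One thing worth noting about the attempt in the question is that the result of EX2[!duplicated(EX2$HH),] is never assigned to a variable, so the de-duplicated data is printed but not stored. A minimal base R sketch of the whole task, assuming a small invented sample in the shape of the question's data (the names households and avg_size are illustrative), might look like this:

```r
# Invented sample in the shape of the question's data
mydata <- data.frame(
  Enumeration.area = c(1, 1, 1, 1, 2, 2, 1, 1, 1),
  County = c("a", "a", "a", "a", "a", "a", "b", "b", "b"),
  Household.members = c(4, 4, 6, 6, 4, 4, 6, 6, 8)
)

# Household identifier, built the same way as in the question
mydata$HH <- paste(mydata$Enumeration.area, mydata$County, mydata$Household.members)

# Keep the first row of each household and store the result
households <- mydata[!duplicated(mydata$HH), ]

# Average household size per county, one row per household
avg_size <- aggregate(Household.members ~ County, data = households, FUN = mean)
```

Subsetting the full data frame rather than a one-column copy also avoids the result collapsing to a plain vector, which is what happens when a single-column data frame is indexed by rows without drop = FALSE.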
