Filling out missing information by grouping in R [duplicate]

This question already has answers here:
Replace NA with previous or next value, by group, using dplyr
(5 answers)
Closed 5 months ago.
I have a sample dataset below:
df <- data.frame(id = c(1,1,2,2,3,3),
gender = c("male",NA,"female","female",NA, "female"))
> df
id gender
1 1 male
2 1 <NA>
3 2 female
4 2 female
5 3 <NA>
6 3 female
Within each id group, some gender values are missing. What I would like to do is fill those missing cells using the values that do exist for the same id.
So the desired output would be:
> df
id gender
1 1 male
2 1 male
3 2 female
4 2 female
5 3 female
6 3 female
Any thoughts?
Thanks!

You can use dplyr::group_by and tidyr::fill, e.g. (the native pipe |> requires R 4.1 or later):
df |>
dplyr::group_by(id) |>
tidyr::fill(gender, .direction = "updown")
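For anyone who wants to check this end to end, here is a minimal, self-contained sketch of the same approach using the question's data. The name `filled` and the trailing `ungroup()` are my additions; it also uses `%>%` instead of `|>` so it runs on older R versions.

```r
library(dplyr)
library(tidyr)

df <- data.frame(id = c(1, 1, 2, 2, 3, 3),
                 gender = c("male", NA, "female", "female", NA, "female"))

filled <- df %>%
  group_by(id) %>%
  # "updown" fills upwards first, then downwards, so an NA at the start
  # or at the end of a group is filled from the other rows of that id
  fill(gender, .direction = "updown") %>%
  ungroup()  # drop the grouping so later steps are not done per id

filled$gender
```

Note that this only makes sense if gender is constant within an id; if an id had two different non-missing values, each NA would simply take the nearest one.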

Related

How to make the values of one column the main column names using spread()

I need to run a chi-square test so I need the levels of one column (gender) to be the column names for the output of different variables. Here's some data:
test <- data.frame(gender = as.character(sample(c('male','female'),10, replace = T)),
test1 = sample(c(1:10)),
test2 = sample(1:5,10 , replace = T))
> test
gender test1 test2
1 female 2 2
2 male 9 1
3 male 4 4
4 female 8 1
5 female 5 4
6 female 3 3
7 female 7 3
8 female 1 1
9 male 10 2
10 male 6 2
I've used the following line of code with tidyr::spread() but it's giving me an error:
test %>% spread(gender,test1)
Error: Each row of output must be identified by a unique combination of keys.
I've followed all the examples that tidyr provides using gather() and spread() but nothing is working. If you have any tips please let me know. Here's my desired outcome:
> goal
male female
1 10 3
2 1 4
3 5 10
4 3 9
5 6 7
We can create a sequence column grouped by gender to make a unique identifier and then use `spread`:
library(dplyr)
library(tidyr)
test %>%
select(-test2) %>%
group_by(gender) %>%
mutate(rn = row_number()) %>%
spread(gender, test1)
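Since spread() is superseded in current tidyr, the same idea can probably be written with pivot_wider(). This is only a sketch: the data are copied from the printed example in the question (test2 is left out), and `wide` and `rn` are just illustrative names.

```r
library(dplyr)
library(tidyr)

# Data copied from the printed example in the question
test <- data.frame(
  gender = c("female", "male", "male", "female", "female",
             "female", "female", "female", "male", "male"),
  test1 = c(2, 9, 4, 8, 5, 3, 7, 1, 10, 6)
)

wide <- test %>%
  group_by(gender) %>%
  mutate(rn = row_number()) %>%  # per-gender counter makes each row unique
  ungroup() %>%
  pivot_wider(names_from = gender, values_from = test1) %>%
  arrange(rn) %>%
  select(-rn)                    # drop the helper column again

wide
```

Because there are six female rows and only four male rows here, the male column ends with two NA values.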

Restructuring a dataframe [duplicate]

This question already has answers here:
How can I transpose my dataframe so that the rows and columns switch in r?
(2 answers)
Closed 3 years ago.
I have data that looks like this:
cat1 <- c("A","A","B","B")
gender <- c("male","female","male","female")
mean <- c(1,2,3,4)
sd <-c(5,6,7,8)
data <- data.frame("cat1"=cat1,"gender"=gender, "mean"=mean, "sd"=sd)
> data
cat1 gender mean sd
1 A male 1 5
2 A female 2 6
3 B male 3 7
4 B female 4 8
I would like to change the format of the table to this below.
> data
cat1 score male female
1 A mean 1 2
2 A sd 5 6
3 B mean 3 4
4 B sd 7 8
Basically, I want to swap the roles of the score (mean/sd) and gender variables.
Any suggestions?
One option using gather and spread
library(dplyr)
library(tidyr)
data %>%
gather(score, value, -cat1, -gender) %>%
spread(gender, value)
# cat1 score female male
#1 A mean 2 1
#2 A sd 6 5
#3 B mean 4 3
#4 B sd 8 7
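gather() and spread() are superseded in newer tidyr releases, so here is a sketch of the same long-then-wide reshape with pivot_longer() and pivot_wider(). The data frame is rebuilt from the question; `reshaped` is just an illustrative name.

```r
library(dplyr)
library(tidyr)

data <- data.frame(cat1 = c("A", "A", "B", "B"),
                   gender = c("male", "female", "male", "female"),
                   mean = c(1, 2, 3, 4),
                   sd = c(5, 6, 7, 8))

reshaped <- data %>%
  # long format: one row per cat1 / gender / score
  pivot_longer(cols = c(mean, sd), names_to = "score", values_to = "value") %>%
  # wide format: one column per gender
  pivot_wider(names_from = gender, values_from = value)

reshaped
```

Here the new columns appear in the order the gender values first occur (male, then female), which happens to match the layout asked for in the question.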
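We can also use melt and dcast from the data.table package (converting to a data.table first, since data.table's melt and dcast are meant for data.tables):
library(data.table)
dcast(melt(as.data.table(data), id.vars = c("cat1","gender"), variable.name = "score"), cat1 + score ~ gender, value.var = "value")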
#> cat1 score female male
#> 1 A mean 2 1
#> 2 A sd 6 5
#> 3 B mean 4 3
#> 4 B sd 8 7
Generally, any solution that converts the data to long format and then reshapes it back to wide, swapping the variable and value columns, works here.
It can be done with recast
library(reshape2)
recast(data, id.var = 1:2, cat1 + variable ~ gender)
# cat1 variable female male
#1 A mean 2 1
#2 A sd 6 5
#3 B mean 4 3
#4 B sd 8 7

Filling a column based on multiple conditions in R

Using dummy data for this, I have two data frames:
One is a list of locations and their rankings on a measurement separated by gender (df1)
Locations Male Female
1 A 1 2
2 B 2 1
3 C 1 2
The other is a list of people
Name Gender Location
1 Joe Male A
2 Alex Female B
3 Chris Female A
4 Sam Male C
I want to add a column to the second data frame (df2$Value) that gives the corresponding number for each person from the first data frame based on their gender and location. (So in this case the results would be 1,1,2,1).
I've tried playing around with merge and with some conditional statements, but to no avail.
Convert df1 into a stacked (long) format with matching column names:
library(dplyr)
library(tidyr)
df1 <- df1 %>% gather(Gender, Value, Male:Female)
names(df1)[names(df1)=="Locations"] <- "Location"
Then use left_join to match the Value by Gender and Location:
df2 %>% left_join(df1, by = c("Gender", "Location"))
# Name Gender Location Value
# 1 Joe Male A 1
# 2 Alex Female B 1
# 3 Chris Female A 2
# 4 Sam Male C 1
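If you would rather not reshape df1 at all, a base R alternative (a different technique from the join above) is to treat the rankings as a matrix and look up one cell per person. A rough sketch, with both data frames rebuilt from the question and `rankings`, `rows` and `cols` as made-up helper names:

```r
df1 <- data.frame(Locations = c("A", "B", "C"),
                  Male = c(1, 2, 1),
                  Female = c(2, 1, 2))
df2 <- data.frame(Name = c("Joe", "Alex", "Chris", "Sam"),
                  Gender = c("Male", "Female", "Female", "Male"),
                  Location = c("A", "B", "A", "C"))

# numeric matrix of rankings: one row per location, one column per gender
rankings <- as.matrix(df1[c("Male", "Female")])

rows <- match(df2$Location, df1$Locations)     # row index for each person
cols <- match(df2$Gender, colnames(rankings))  # column index for each person

# indexing with a two-column matrix picks one (row, column) cell per person
df2$Value <- rankings[cbind(rows, cols)]

df2
```

This assumes every Gender value in df2 is exactly the name of a column in df1; anything that does not match gives NA.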

R: how to remove duplicate rows by column [duplicate]

This question already has answers here:
Remove duplicated rows using dplyr
(6 answers)
Closed 5 years ago.
df <- data.frame(id = c(1, 1, 1, 2, 2),
gender = c("Female", "Female", "Male", "Female", "Male"),
variant = c("a", "b", "c", "d", "e"))
> df
id gender variant
1 1 Female a
2 1 Female b
3 1 Male c
4 2 Female d
5 2 Male e
I want to remove duplicate rows in my data.frame according to the gender column in my data set. I know there has been a similar question asked (here) but the difference here is that I would like to remove duplicate rows within each subset of the data set, where each subset is defined by an unique id.
My desired result is this:
id gender variant
1 1 Female a
3 1 Male c
4 2 Female d
5 2 Male e
I've tried the following and it works, but I'm wondering if there's a cleaner, more efficient way of doing this?
out = list()
for(i in 1:2){
df2 <- subset(df, id == i)
out[[i]] <- df2[!duplicated(df2$gender), ]
}
do.call(rbind.data.frame, out)
# keep only the first row for each (id, gender) pair
df[!duplicated(df[ , c("id","gender")]),]
# id gender variant
# 1 1 Female a
# 3 1 Male c
# 4 2 Female d
# 5 2 Male e
Another way of doing this using subset as below:
subset(df, !duplicated(subset(df, select=c(id, gender))))
# id gender variant
# 1 1 Female a
# 3 1 Male c
# 4 2 Female d
# 5 2 Male e
Here's a dplyr based solution in case you are interested (edited to include Gregor's suggestions)
library(dplyr)
group_by(df, id, gender) %>% slice(1)
#> # A tibble: 4 x 3
#> # Groups: id, gender [4]
#> id gender variant
#> <dbl> <fctr> <fctr>
#> 1 1 Female a
#> 2 1 Male c
#> 3 2 Female d
#> 4 2 Male e
It might also be worth using the arrange function first, depending on which values of variant should be kept.
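To illustrate the point about ordering, here is a small sketch using dplyr::distinct() with .keep_all = TRUE, which keeps the first row of each id/gender combination. Sorting beforehand decides which row counts as "first". The names `first_rows` and `last_variant` are only for illustration.

```r
library(dplyr)

df <- data.frame(id = c(1, 1, 1, 2, 2),
                 gender = c("Female", "Female", "Male", "Female", "Male"),
                 variant = c("a", "b", "c", "d", "e"))

# keep the first row of each id/gender pair, in the original row order
first_rows <- df %>%
  distinct(id, gender, .keep_all = TRUE)

# sort so that the alphabetically last variant comes first within each pair,
# then keep the first row of each pair
last_variant <- df %>%
  arrange(id, gender, desc(variant)) %>%
  distinct(id, gender, .keep_all = TRUE)

first_rows$variant    # "a" "c" "d" "e"
last_variant$variant  # "b" "c" "d" "e"
```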

Create missing observations in panel data

I am working on panel data with a unique case identifier and a column for the time points of the observations (long format). There are both time-constant variables and time-varying observations:
id time tc1 obs1
1 101 1 male 4
2 101 2 male 5
3 101 3 male 3
4 102 1 female 6
5 102 3 female 2
6 103 1 male 2
For my model I now need data with complete records per id for each time point. In other words, if an observation is missing I still need to put in a row with id, time, time-constant variables, and NA for the observed variables (as would be the line (102, 2, "female", NA) in the above example). So my question is:
How can I find out if a row with unique combination of id and time already exists in my dataset?
If not, how can I add this row, carry over time-constant variables and fill the observations with NA?
Would be great if someone could shed some light on this.
Thanks a lot in advance!
EDIT
Thank you everyone for your replies. Here is what I finally did, which is a mix of several suggested approaches. The thing is that I have several time-varying variables (obs1-obsn) per row, and I could not get dcast to accommodate that: value.name does not take more than one argument.
# create all possible permutations of id and year
iddat = expand.grid(id = unique(dataset$id), time = (c(1996,1999,2002,2005,2008,2011)))
iddat <- iddat[order(iddat$id, iddat$time), ]
# add permutations to existing data, combinations so far missing are NA
dataset_new <- merge(dataset, iddat, all.x=TRUE, all.y=TRUE, by=c("id", "time"))
# drop time-constant variables from data
dataset_new[c("tc1", "tc2", "tc3")] <- list(NULL)
# merge back time-constant variables from original data
# (keep "id" so the merge has a key; unique() leaves one row per id,
# which avoids duplicated rows after the merge)
temp <- unique(dataset[c("id", "tc1", "tc2", "tc3")])
dataset_new <- merge(dataset_new, temp, by = "id")
# sort
dataset_new <- dataset_new[order(dataset_new$id, dataset_new$time), ]
rm(temp)
rm(iddat)
All the best and thanks again, Matt
You could create an empty dataset and then merge in the records in which you have matches.
# Create dataset. For your actual data, you would replace c(1:3) with
# c(1:max(yourdata$id)) and adjust the number of time periods to match your data.
id <- rep(c(1:3), each = 3)
time <- rep(c(1:3), 3)
df <- data.frame(id,time)
test <- df[c(1,3,5,7,9),]
test$tc1 <- c("male", "male", "female", "male", "male")
test$obs1 <-c(4,5,3,6,2)
merge(df, test, by.x = c("id","time"), by.y = c("id","time"), all.x = TRUE)
The result:
id time tc1 obs1
1 1 1 male 4
2 1 2 <NA> NA
3 1 3 male 5
4 2 1 <NA> NA
5 2 2 female 3
6 2 3 <NA> NA
7 3 1 male 6
8 3 2 <NA> NA
9 3 3 male 2
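In the result above the time-constant column tc1 is still NA on the newly created rows. One possible way to carry it over, assuming tc1 really is constant within each id, is a grouped tidyr::fill() after the merge. A sketch with the same toy data (`grid`, `observed`, `merged` and `completed` are illustrative names):

```r
library(dplyr)
library(tidyr)

# all id/time combinations
grid <- expand.grid(id = 1:3, time = 1:3)

# the rows that were actually observed (same values as the toy data above)
observed <- data.frame(id = c(1, 1, 2, 3, 3),
                       time = c(1, 3, 2, 1, 3),
                       tc1 = c("male", "male", "female", "male", "male"),
                       obs1 = c(4, 5, 3, 6, 2))

# left join: unobserved combinations get NA in tc1 and obs1
merged <- merge(grid, observed, by = c("id", "time"), all.x = TRUE)

completed <- merged %>%
  group_by(id) %>%
  fill(tc1, .direction = "downup") %>%  # copy tc1 within each id; obs1 stays NA
  ungroup()

completed
```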
There are probably more elegant ways, but here's one option. I'm assuming that you need all combinations of id and time but not tc1 (i.e. tc1 is tied to id).
# your data
df <- read.table(text = " id time tc1 obs1
1 101 1 male 4
2 101 2 male 5
3 101 3 male 3
4 102 1 female 6
5 102 3 female 2
6 103 1 male 2", header = TRUE)
First cast your data to wide format to introduce NAs, then convert back to long.
library('reshape2')
df_wide <- dcast(
df,
id + tc1 ~ time,
value.var = "obs1",
fill = NA
)
df_long <- melt(
df_wide,
id.vars = c("id","tc1"),
variable.name = "time",
value.name = "obs1"
)
# sort by id and then time
df_long[order(df_long$id, df_long$time), ]
id tc1 time obs1
1 101 male 1 4
4 101 male 2 5
7 101 male 3 3
2 102 female 1 6
5 102 female 2 NA
8 102 female 3 2
3 103 male 1 2
6 103 male 2 NA
9 103 male 3 NA
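Another option that avoids the round trip through wide format is tidyr::complete() together with nesting(), which expands over the id/tc1 pairs that actually occur, crossed with every time point. This is a sketch on the question's sample data, assuming tc1 is constant per id; `panel`, `full_panel` and `missing_rows` are illustrative names.

```r
library(tidyr)

panel <- data.frame(id = c(101, 101, 101, 102, 102, 103),
                    time = c(1, 2, 3, 1, 3, 1),
                    tc1 = c("male", "male", "male", "female", "female", "male"),
                    obs1 = c(4, 5, 3, 6, 2, 2))

# nesting(id, tc1) keeps only id/tc1 pairs present in the data, then every
# such pair is crossed with all observed time values; new rows get NA in obs1
full_panel <- complete(panel, nesting(id, tc1), time)

# rows that were added, i.e. id/time combinations missing from the original
# (valid here because obs1 has no NA in the original data)
missing_rows <- full_panel[is.na(full_panel$obs1), ]

full_panel
```

This also works with several time-varying columns, since every column not named in complete() is simply set to NA on the new rows.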
