R: how to remove duplicate rows by column [duplicate]

df <- data.frame(id = c(1, 1, 1, 2, 2),
gender = c("Female", "Female", "Male", "Female", "Male"),
variant = c("a", "b", "c", "d", "e"))
> df
id gender variant
1 1 Female a
2 1 Female b
3 1 Male c
4 2 Female d
5 2 Male e
I want to remove duplicate rows in my data.frame according to the gender column. I know a similar question has been asked before, but the difference here is that I would like to remove duplicate rows within each subset of the data set, where each subset is defined by a unique id.
My desired result is this:
id gender variant
1 1 Female a
3 1 Male c
4 2 Female d
5 2 Male e
I've tried the following and it works, but I'm wondering if there's a cleaner, more efficient way of doing this?
out = list()
for(i in 1:2){
df2 <- subset(df, id == i)
out[[i]] <- df2[!duplicated(df2$gender), ]
}
do.call(rbind.data.frame, out)

df[!duplicated(df[ , c("id","gender")]),]
# id gender variant
# 1 1 Female a
# 3 1 Male c
# 4 2 Female d
# 5 2 Male e
Another way of doing this, using subset:
subset(df, !duplicated(subset(df, select=c(id, gender))))
# id gender variant
# 1 1 Female a
# 3 1 Male c
# 4 2 Female d
# 5 2 Male e

Here's a dplyr based solution in case you are interested (edited to include Gregor's suggestions)
library(dplyr)
group_by(df, id, gender) %>% slice(1)
#> # A tibble: 4 x 3
#> # Groups: id, gender [4]
#> id gender variant
#> <dbl> <fctr> <fctr>
#> 1 1 Female a
#> 2 1 Male c
#> 3 2 Female d
#> 4 2 Male e
Depending on which values of variant should be removed, it may also be worth calling arrange first, since slice(1) keeps the first row of each group.
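The effect of sorting before de-duplicating can be sketched in base R, with order() standing in for arrange(). This is only an illustrative sketch: keeping the alphabetically last variant per (id, gender) pair is an assumption made for the example, not something the question asks for.

```r
df <- data.frame(id = c(1, 1, 1, 2, 2),
                 gender = c("Female", "Female", "Male", "Female", "Male"),
                 variant = c("a", "b", "c", "d", "e"),
                 stringsAsFactors = FALSE)

# Sort so that the row to keep comes first within each (id, gender) pair.
# Assumption for illustration: keep the alphabetically last variant,
# so variant is sorted in decreasing order via -xtfrm().
sorted <- df[order(df$id, df$gender, -xtfrm(df$variant)), ]

# duplicated() flags every row after the first one in each pair.
kept <- sorted[!duplicated(sorted[, c("id", "gender")]), ]
kept
```

Without the sorting step, the first row in the original order is the one that survives.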

Related

Filling out missing information by grouping in R [duplicate]

I have a sample dataset below:
df <- data.frame(id = c(1,1,2,2,3,3),
gender = c("male",NA,"female","female",NA, "female"))
> df
id gender
1 1 male
2 1 <NA>
3 2 female
4 2 female
5 3 <NA>
6 3 female
Within each id, some gender values are missing. What I would like to do is to fill those missing cells based on the existing information for the same id.
So the desired output would be:
> df
id gender
1 1 male
2 1 male
3 2 female
4 2 female
5 3 female
6 3 female
Any thoughts?
Thanks!
You can use dplyr::group_by and tidyr::fill, e.g. (the native pipe |> requires R 4.1 or later):
df |>
dplyr::group_by(id) |>
tidyr::fill(gender, .direction = "updown")
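If adding packages is not an option, roughly the same result can be sketched in base R with ave(). The helper name fill_group is hypothetical, and the sketch assumes that each id has at most one distinct known gender, so copying the first non-missing value is safe.

```r
df <- data.frame(id = c(1, 1, 2, 2, 3, 3),
                 gender = c("male", NA, "female", "female", NA, "female"),
                 stringsAsFactors = FALSE)

# Hypothetical helper: replace the NAs in one group with that group's
# first known value (assumes the known values within a group agree).
fill_group <- function(g) {
  known <- g[!is.na(g)]
  if (length(known) == 0) return(g)  # nothing to copy from; leave the NAs
  replace(g, is.na(g), known[1])
}

# ave() applies the helper separately to the gender values of each id.
df$gender <- ave(df$gender, df$id, FUN = fill_group)
df
```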

Replace missing values in a cell, with a value from the cell above (n-1) using a LOOP

I have a data file with thousands of rows, that has gaps which I wish to fill with a value.
I need to replace the empty cells with the values from those above it.
To give you an idea of what my data looks like, here is a sample:
Variable <- c("AGE","","","","SEX","","SEGMENT","","","","")
Value <- c(1, 2, 3, 4, 1, 2, 1, 2, 3, 4, 5)
Description <- c("18-24","25-34","35-44","45+","Female","Male","A","B","C","D","E")
df <- data.frame(Variable, Value, Description)
> df
Variable Value Description
1 AGE 1 18-24
2 2 25-34
3 3 35-44
4 4 45+
5 SEX 1 Female
6 2 Male
7 SEGMENT 1 A
8 2 B
9 3 C
10 4 D
11 5 E
As you can see above, the first column has gaps. I need these empty cells to be replaced with the relevant value above, so the new variable will look like this in the data frame:
> df
Variable Value Description Variable_NEW
1 AGE 1 18-24 AGE
2 2 25-34 AGE
3 3 35-44 AGE
4 4 45+ AGE
5 SEX 1 Female SEX
6 2 Male SEX
7 SEGMENT 1 A SEGMENT
8 2 B SEGMENT
9 3 C SEGMENT
10 4 D SEGMENT
11 5 E SEGMENT
Thinking out loud: I'm assuming that to achieve this I will need to create a new variable with a loop and then use logic like this:
IF Variable[n]="" THEN Variable_New[n] = Variable[n-1],
ELSE Variable_New[n] = Variable[n]
I'm familiar with loops but don't know how to write this kind of thing in R, where it needs a lag/n-1 kind of function. There are probably many ways to accomplish this, but a loop would be preferable. Any help will be greatly appreciated. Thanks
Here is a loop approach:
#Data
Variable <- c("AGE","","","","SEX","","SEGMENT","","","","")
Value <- c(1, 2, 3, 4, 1, 2, 1, 2, 3, 4, 5)
Description <- c("18-24","25-34","35-44","45+","Female","Male","A","B","C","D","E")
df <- data.frame(Variable, Value, Description, stringsAsFactors = FALSE)
#Create new column
df$NewVar <- df$Variable
#Loop
for(i in 2:dim(df)[1])
{
df$NewVar[i] <- ifelse(df$NewVar[i]=="",df$NewVar[i-1],df$NewVar[i])
}
Output:
Variable Value Description NewVar
1 AGE 1 18-24 AGE
2 2 25-34 AGE
3 3 35-44 AGE
4 4 45+ AGE
5 SEX 1 Female SEX
6 2 Male SEX
7 SEGMENT 1 A SEGMENT
8 2 B SEGMENT
9 3 C SEGMENT
10 4 D SEGMENT
11 5 E SEGMENT
You don't need to write loops; there are built-in functions which can help you with this task.
You can replace blank values with NA and use tidyr::fill:
library(dplyr)
df %>%
mutate(Variable_NEW = replace(Variable, Variable == "", NA)) %>%
tidyr::fill(Variable_NEW)
# Variable Value Description Variable_NEW
#1 AGE 1 18-24 AGE
#2 2 25-34 AGE
#3 3 35-44 AGE
#4 4 45+ AGE
#5 SEX 1 Female SEX
#6 2 Male SEX
#7 SEGMENT 1 A SEGMENT
#8 2 B SEGMENT
#9 3 C SEGMENT
#10 4 D SEGMENT
#11 5 E SEGMENT
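A loop-free alternative without extra packages can be sketched with cumsum(). This is an illustrative base R variant, not part of the answer above, and it assumes the first element of Variable is non-empty (otherwise the index would start at 0 and the result would come out too short).

```r
Variable <- c("AGE", "", "", "", "SEX", "", "SEGMENT", "", "", "", "")

# TRUE where a label is present, FALSE where the cell is blank.
has_label <- Variable != ""

# cumsum() counts how many labels have been seen so far, which is exactly
# the position of the most recent label among the non-blank values.
Variable_NEW <- Variable[has_label][cumsum(has_label)]
Variable_NEW
```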
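This replaces every "" with NA in the character columns and then fills the column named Variable downwards:
library(dplyr)
# restrict na_if to character columns; recent dplyr versions reject comparing numeric columns with ""
df %>%
mutate(across(where(is.character), ~ na_if(.x, ""))) %>%
tidyr::fill(Variable, .direction = "down")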
Using data.table and a for loop:
library(data.table)
DT <- as.data.table(df)
DT[, Variable_new := Variable[1]]
for (i in 2:nrow(DT)) {
DT[i, Variable_new := fifelse(DT[i, Variable] == '', DT[i-1, Variable_new], DT[i, Variable])]
}
> DT
Variable Value Description Variable_new
1: AGE 1 18-24 AGE
2: 2 25-34 AGE
3: 3 35-44 AGE
4: 4 45+ AGE
5: SEX 1 Female SEX
6: 2 Male SEX
7: SEGMENT 1 A SEGMENT
8: 2 B SEGMENT
9: 3 C SEGMENT
10: 4 D SEGMENT
11: 5 E SEGMENT
You can write your own function with a loop or use the na.locf function from the zoo package to fill in missing values. Example:
fillin <- function(x) {
for (i in 2:length(x)) {
if (x[i] %in% c(NA, "")) {
x[i] <- x[i - 1]
}
}
x
}
Variable <- c("AGE","","","","SEX","","SEGMENT","","","","")
Value <- c(1, 2, 3, 4, 1, 2, 1, 2, 3, 4, 5)
Description <- c("18-24","25-34","35-44","45+","Female","Male","A","B","C","D","E")
df <- data.frame(Variable, Value, Description)
df$Variable_fillin <- fillin(df$Variable)
library(zoo)
df$Variable[df$Variable == ""] <- NA
df$Variable_nalocf <- na.locf(df$Variable)
df
#> Variable Value Description Variable_fillin Variable_nalocf
#> 1 AGE 1 18-24 AGE AGE
#> 2 <NA> 2 25-34 AGE AGE
#> 3 <NA> 3 35-44 AGE AGE
#> 4 <NA> 4 45+ AGE AGE
#> 5 SEX 1 Female SEX SEX
#> 6 <NA> 2 Male SEX SEX
#> 7 SEGMENT 1 A SEGMENT SEGMENT
#> 8 <NA> 2 B SEGMENT SEGMENT
#> 9 <NA> 3 C SEGMENT SEGMENT
#> 10 <NA> 4 D SEGMENT SEGMENT
#> 11 <NA> 5 E SEGMENT SEGMENT
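One edge case of a hand-written loop is worth a sketch: 2:length(x) counts down (2, 1) when x has a single element, which silently lengthens the vector. The variant below, with the hypothetical name fillin_safe, iterates over seq_along(x)[-1] instead; it assumes, like the original, that the first element is a usable value.

```r
fillin_safe <- function(x) {
  # seq_along(x)[-1] is empty for vectors of length 0 or 1,
  # whereas 2:length(x) would yield c(2, 1) for a single element.
  for (i in seq_along(x)[-1]) {
    if (is.na(x[i]) || x[i] == "") {
      x[i] <- x[i - 1]  # carry the previous (already filled) value forward
    }
  }
  x
}

single <- fillin_safe("AGE")
mixed <- fillin_safe(c("A", "", NA, "B", ""))
```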

R extracting the frequencies

I am trying to get the frequencies but my ids are repeating. Here is some sample data:
id <- c(1,1,2,2,3,3)
gender <- c("m","m","f","f","m","m")
score <- c(10,5,10,5,10,5)
data <- data.frame("id"=id,"gender"=gender, "score"=score)
> data
id gender score
1 1 m 10
2 1 m 5
3 2 f 10
4 2 f 5
5 3 m 10
6 3 m 5
I would like to get the frequencies of the gender categories but I have repeating ids. When I run this code below:
gender<-as.data.frame(table(data$gender))
> gender
Var1 Freq
1 f 2
2 m 4
The frequency should be female = 1, male = 2. It should look like this:
> gender
Var1 Freq
1 f 1
2 m 2
How can I get this considering the id information?
You can use data.table::uniqueN to count the number of unique ids per gender group
library(data.table)
setDT(data)
data[, .(Freq = uniqueN(id)), gender]
# gender Freq
# 1: m 2
# 2: f 1
The idea from @IceCreamToucan with dplyr:
library(dplyr)
data %>%
group_by(gender) %>%
summarise(freq = n_distinct(id))
gender freq
<fct> <int>
1 f 1
2 m 2
In base R
rowSums(table(data$gender,data$id)!=0)
f m
1 2
Being late to the party, I was quite surprised about the sophisticated answers which use grouping or rowSums().
In base R, I would
remove the duplicate id rows from the data.frame by subsetting with !duplicated(id),
apply table() on the gender column.
So, the code is
table(data[!duplicated(data$id), "gender"])
f m
1 2
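A closely related base R sketch reduces the data to unique (id, gender) pairs before tabulating. The data below are made up for illustration (ids repeat an uneven number of times, unlike in the question) to show that the count does not depend on every id appearing exactly twice.

```r
# Illustrative data, not from the question: id 1 appears three times,
# ids 2 and 3 appear once each.
uneven <- data.frame(id = c(1, 1, 1, 2, 3),
                     gender = c("m", "m", "m", "f", "m"),
                     score = c(10, 5, 7, 10, 5),
                     stringsAsFactors = FALSE)

# Keep one row per (id, gender) pair, then count the pairs per gender.
pairs <- unique(uneven[, c("id", "gender")])
freq <- table(pairs$gender)
freq
```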

subtract first or second value from each row [duplicate]

I'm manipulating my data using dplyr, and after grouping my data, I would like to subtract all values by the first or second value in my group (i.e., subtracting a baseline). Is it possible to perform this in a single pipe step?
MWE:
test <- tibble(one=c("c","d","e","c","d","e"), two=c("a","a","a","b","b","b"), three=1:6)
test %>% group_by(`two`) %>% mutate(new=three-three[.$`one`=="d"])
My desired output is:
# A tibble: 6 x 4
# Groups: two [2]
one two three new
<chr> <chr> <int> <int>
1 c a 1 -1
2 d a 2 0
3 e a 3 1
4 c b 4 -1
5 d b 5 0
6 e b 6 1
However I am getting this as the output:
# A tibble: 6 x 4
# Groups: two [2]
one two three new
<chr> <chr> <int> <int>
1 c a 1 -1
2 d a 2 NA
3 e a 3 1
4 c b 4 -1
5 d b 5 NA
6 e b 6 1
We can use first from dplyr:
test %>%
group_by(two) %>%
mutate(new=three- first(three))
# A tibble: 6 x 4
# Groups: two [2]
# one two three new
# <chr> <chr> <int> <int>
#1 c a 1 0
#2 d a 2 1
#3 e a 3 2
#4 c b 4 0
#5 d b 5 1
#6 e b 6 2
If we are subsetting the 'three' values based on the string "c" in 'one', then we don't need .$: inside a grouped mutate, .$one refers to the whole 'one' column of the original data rather than the values within the current group.
test %>%
group_by(`two`) %>%
mutate(new=three-three[one=="c"])
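For the baseline the question actually asks for (the row where one is "d"), the same per-group lookup can be sketched in base R with match(). This is an illustrative variant and assumes each group in two contains exactly one "d" row.

```r
test <- data.frame(one = c("c", "d", "e", "c", "d", "e"),
                   two = c("a", "a", "a", "b", "b", "b"),
                   three = 1:6,
                   stringsAsFactors = FALSE)

# One baseline row per group (assumption: exactly one "d" per group).
baseline_rows <- test[test$one == "d", ]

# For every row, look up the baseline of its own group and subtract it.
test$new <- test$three - baseline_rows$three[match(test$two, baseline_rows$two)]
test
```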
library(tidyverse)
tibble(
one = c("c", "d", "e", "c", "d", "e"),
two = c("a", "a", "a", "b", "b", "b"),
three = 1:6
) -> test_df
test_df %>%
group_by(two) %>%
mutate(new = three - three[1])
## # A tibble: 6 x 4
## # Groups: two [2]
## one two three new
## <chr> <chr> <int> <int>
## 1 c a 1 0
## 2 d a 2 1
## 3 e a 3 2
## 4 c b 4 0
## 5 d b 5 1
## 6 e b 6 2

Create missing observations in panel data

I am working on panel data with a unique case identifier and a column for the time points of the observations (long format). There are both time-constant variables and time-varying observations:
id time tc1 obs1
1 101 1 male 4
2 101 2 male 5
3 101 3 male 3
4 102 1 female 6
5 102 3 female 2
6 103 1 male 2
For my model I now need data with complete records per id for each time point. In other words, if an observation is missing I still need to put in a row with id, time, time-constant variables, and NA for the observed variables (as would be the line (102, 2, "female", NA) in the above example). So my question is:
How can I find out if a row with unique combination of id and time already exists in my dataset?
If not, how can I add this row, carry over time-constant variables and fill the observations with NA?
Would be great if someone could shed some light on this.
Thanks a lot in advance!
EDIT
Thank you everyone for your replies. Here is what I finally did, which is a mix of several suggested approaches. The thing is that I have several time-varying variables (obs1-obsn) per row and I could not get dcast to accommodate that: value.var does not take more than one argument.
# create all possible permutations of id and year
iddat = expand.grid(id = unique(dataset$id), time = (c(1996,1999,2002,2005,2008,2011)))
iddat <- iddat[order(iddat$id, iddat$time), ]
# add permutations to existing data, combinations so far missing are NA
dataset_new <- merge(dataset, iddat, all.x=TRUE, all.y=TRUE, by=c("id", "time"))
# drop time-constant variables from data
dataset_new[c("tc1", "tc2", "tc3")] <- list(NULL)
# merge back time-constant variables from original data
# (id must be kept as the merge key; reducing to one row per id first
# avoids the duplicated rows that a many-to-many merge would create)
temp <- unique(dataset[c("id", "tc1", "tc2", "tc3")])
dataset_new <- merge(dataset_new, temp, by = "id")
# sort
dataset_new <- dataset_new[order(dataset_new$id, dataset_new$time), ]
rm(temp)
rm(iddat)
All the best and thanks again, Matt
You could create an empty dataset and then merge in the records in which you have matches.
# Create dataset. For your actual data, you would replace c(1:3) with
# c(1:max(yourdata$id)) and adjust the number of time periods to match your data.
id <- rep(c(1:3), each = 3)
time <- rep(c(1:3), 3)
df <- data.frame(id,time)
test <- df[c(1,3,5,7,9),]
test$tc1 <- c("male", "male", "female", "male", "male")
test$obs1 <-c(4,5,3,6,2)
merge(df, test, by.x = c("id","time"), by.y = c("id","time"), all.x = TRUE)
The result:
id time tc1 obs1
1 1 1 male 4
2 1 2 <NA> NA
3 1 3 male 5
4 2 1 <NA> NA
5 2 2 female 3
6 2 3 <NA> NA
7 3 1 male 6
8 3 2 <NA> NA
9 3 3 male 2
There are probably more elegant ways, but here's one option. I'm assuming that you need all combinations of id and time but not tc1 (i.e. tc1 is tied to id).
# your data
df <- read.table(text = " id time tc1 obs1
1 101 1 male 4
2 101 2 male 5
3 101 3 male 3
4 102 1 female 6
5 102 3 female 2
6 103 1 male 2", header = TRUE)
First cast your data to wide format to introduce NAs, then convert back to long.
library('reshape2')
df_wide <- dcast(
df,
id + tc1 ~ time,
value.var = "obs1",
fill = NA
)
df_long <- melt(
df_wide,
id.vars = c("id","tc1"),
variable.name = "time",
value.name = "obs1"
)
# sort by id and then time
df_long[order(df_long$id, df_long$time), ]
id tc1 time obs1
1 101 male 1 4
4 101 male 2 5
7 101 male 3 3
2 102 female 1 6
5 102 female 2 NA
8 102 female 3 2
3 103 male 1 2
6 103 male 2 NA
9 103 male 3 NA
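When there are several time-varying columns, a grid-and-merge approach avoids the reshape step altogether. The following base R sketch uses the sample data from the question; the object names grid, constants and full are illustrative, and it assumes tc1 is constant within each id.

```r
df <- data.frame(id = c(101, 101, 101, 102, 102, 103),
                 time = c(1, 2, 3, 1, 3, 1),
                 tc1 = c("male", "male", "male", "female", "female", "male"),
                 obs1 = c(4, 5, 3, 6, 2, 2),
                 stringsAsFactors = FALSE)

# Every combination of id and time that should exist.
grid <- expand.grid(id = unique(df$id), time = sort(unique(df$time)))

# One row of time-constant values per id (assumes tc1 never varies within id).
constants <- unique(df[c("id", "tc1")])

# Attach the constants to the full grid, then left-join the observations;
# combinations missing from df get NA in obs1.
full <- merge(grid, constants, by = "id")
full <- merge(full, df[c("id", "time", "obs1")], by = c("id", "time"), all.x = TRUE)
full <- full[order(full$id, full$time), ]
full
```

Further time-varying columns can be added to the second merge without other changes.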
