Using dummy data for this, I have two data frames:
One is a list of locations and their rankings on a measurement separated by gender (df1)
Locations Male Female
1 A 1 2
2 B 2 1
3 C 1 2
The other is a list of people (df2):
Name Gender Location
1 Joe Male A
2 Alex Female B
3 Chris Female A
4 Sam Male C
I want to add a column to the second data frame (df2$Value) that gives the corresponding number for each person from the first data frame based on their gender and location. (So in this case the results would be 1,1,2,1).
I've tried playing around with merge and with some conditional statements, but to no avail.
Convert df1 into a stacked format with matching names:
library(dplyr)
library(tidyr)
df1 <- df1 %>% gather(Gender, Value, Male:Female)
names(df1)[names(df1) == "Locations"] <- "Location"
Then use left_join to match the Value by Gender and Location:
df2 %>% left_join(df1, by = c("Gender", "Location"))
# Name Gender Location Value
# 1 Joe Male A 1
# 2 Alex Female B 1
# 3 Chris Female A 2
# 4 Sam Male C 1
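If you would rather not reshape df1, a base R lookup also works. This is only a sketch, assuming the data frames look exactly like the ones in the question; rank_mat, row_idx and col_idx are names I made up for illustration. It indexes a matrix of rankings with one (row, column) pair per person:

```r
# Rebuild the example data from the question
df1 <- data.frame(Locations = c("A", "B", "C"),
                  Male = c(1, 2, 1),
                  Female = c(2, 1, 2),
                  stringsAsFactors = FALSE)
df2 <- data.frame(Name = c("Joe", "Alex", "Chris", "Sam"),
                  Gender = c("Male", "Female", "Female", "Male"),
                  Location = c("A", "B", "A", "C"),
                  stringsAsFactors = FALSE)

# Rankings as a matrix: one row per location, one column per gender
rank_mat <- as.matrix(df1[, c("Male", "Female")])

# For each person, find the matching row (location) and column (gender)
row_idx <- match(df2$Location, df1$Locations)
col_idx <- match(df2$Gender, colnames(rank_mat))

# Indexing with a two-column matrix picks one cell per person
df2$Value <- rank_mat[cbind(row_idx, col_idx)]
df2
```

A person whose location or gender is not found in df1 gets NA, because match() returns NA for that index.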
I have a sample dataset below:
df <- data.frame(id = c(1,1,2,2,3,3),
gender = c("male",NA,"female","female",NA, "female"))
> df
id gender
1 1 male
2 1 <NA>
3 2 female
4 2 female
5 3 <NA>
6 3 female
Within each id, some gender values are missing. I would like to fill those missing cells using the value that is present in other rows of the same id.
So the desired output would be:
> df
id gender
1 1 male
2 1 male
3 2 female
4 2 female
5 3 female
6 3 female
Any thoughts?
Thanks!
You can use dplyr::group_by and tidyr::fill e.g.:
df |>
dplyr::group_by(id) |>
tidyr::fill(gender, .direction = "updown")
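For comparison, here is a base R sketch of the same idea using ave(). It assumes each id has at most one distinct gender, so it simply copies the first known value in the group into the gaps (tidyr::fill instead copies the neighbouring value). fill_group is a helper name I made up:

```r
df <- data.frame(id = c(1, 1, 2, 2, 3, 3),
                 gender = c("male", NA, "female", "female", NA, "female"),
                 stringsAsFactors = FALSE)

# Replace NAs in a group with the first non-missing value of that group;
# groups with no known value are left as NA
fill_group <- function(x) {
  known <- x[!is.na(x)]
  if (length(known) > 0) {
    x[is.na(x)] <- known[1]
  }
  x
}

# ave() applies fill_group separately within each id
df$gender <- ave(df$gender, df$id, FUN = fill_group)
df
```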
My data looks like this:
period id category
1 1234 1
1 2345 2
1 4567345 1
2 1234 3
2 2345 3
2 4567345 1
3 123467 2
3 234567 2
3 45673 1
I need to create a new column "category_pre" containing category values from previous period for each ID. If an ID is not found in previous period, the script should return "NA". The new column should be added to existing dataframe.
What would be the best way to do it?
Thanks!
We can use the lag() function from dplyr here, applied within each id and ordered by period:
library(dplyr)
df <- df %>%
group_by(id) %>%
mutate(category_pre = lag(category, order_by = period))
df
period id category category_pre
<dbl> <dbl> <dbl> <dbl>
1 1 1234 1 NA
2 1 2345 2 NA
3 1 4567345 1 NA
4 2 1234 3 1
5 2 2345 3 2
6 2 4567345 1 1
7 3 123467 2 NA
8 3 234567 2 NA
9 3 45673 1 NA
Step one, make a new data frame with period + 1:
df2 = data.frame(period=df$period+1, id=df$id, category=df$category)
Step two, merge the two data frames on period and id. Setting all.x = TRUE ensures that rows with no previous period are kept (with NA):
merge(df, df2, by = c('period', 'id'), all.x = TRUE)
Step three, rename the resulting columns (category.x and category.y by default) to category and category_pre.
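Putting the three steps together, a self-contained sketch might look like the following. It uses the data from the question; prev and result are names I chose, and I name the shifted column category_pre up front so that no renaming is needed after the merge:

```r
df <- data.frame(
  period = rep(1:3, each = 3),
  id = c(1234, 2345, 4567345, 1234, 2345, 4567345, 123467, 234567, 45673),
  category = c(1, 2, 1, 3, 3, 1, 2, 2, 1)
)

# Step one: shift every record forward by one period and
# call its category "category_pre" right away
prev <- data.frame(period = df$period + 1,
                   id = df$id,
                   category_pre = df$category)

# Step two: left join on period and id; rows without a match
# in the previous period get NA in category_pre
result <- merge(df, prev, by = c("period", "id"), all.x = TRUE)

# Restore a predictable row order
result <- result[order(result$period, result$id), ]
result
```

One difference from the lag() approach: lag() takes the previous row of the same id even if a period was skipped, whereas this merge only matches the immediately preceding period and returns NA otherwise.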
I need to run a chi-square test so I need the levels of one column (gender) to be the column names for the output of different variables. Here's some data:
test <- data.frame(gender = as.character(sample(c('male','female'),10, replace = T)),
test1 = sample(c(1:10)),
test2 = sample(1:5,10 , replace = T))
> test
gender test1 test2
1 female 2 2
2 male 9 1
3 male 4 4
4 female 8 1
5 female 5 4
6 female 3 3
7 female 7 3
8 female 1 1
9 male 10 2
10 male 6 2
I've used the following line of code with tidyr::spread() but it's giving me an error:
test %>% spread(gender,test1)
Error: Each row of output must be identified by a unique combination of keys.
I've followed all the examples that tidyr provides using gather() and spread() but nothing is working. If you have any tips please let me know. Here's my desired outcome:
> goal
male female
1 10 3
2 1 4
3 5 10
4 3 9
5 6 7
We can create a sequence column grouped by gender to make a unique identifier for each row and then use spread:
library(dplyr)
library(tidyr)
test %>%
select(-test2) %>%
group_by(gender) %>%
mutate(rn = row_number()) %>%
spread(gender, test1)
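A base R sketch of the same reshaping, in case it helps to see what is going on: split test1 by gender, then pad the shorter vector with NA so both columns have the same length. The data below are the ten rows printed in the question; by_gender, n and wide are names I made up:

```r
test <- data.frame(
  gender = c("female", "male", "male", "female", "female",
             "female", "female", "female", "male", "male"),
  test1 = c(2, 9, 4, 8, 5, 3, 7, 1, 10, 6),
  stringsAsFactors = FALSE
)

# One vector of test1 values per gender, in original row order
by_gender <- split(test$test1, test$gender)

# Pad every vector with NA up to the length of the longest one
n <- max(lengths(by_gender))
wide <- as.data.frame(lapply(by_gender, function(v) {
  length(v) <- n
  v
}))
wide
```

Since the groups are unequal in size here (6 female, 4 male), the male column ends with two NA values.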
I have an oddly formatted data set in which
Row 1 are gender and income data from subject 1 collected at visit 1
Row 2 are diabetes and hypertension history data from subject 1 collected at visit 2
Row 3 are gender and income data from subject 2 collected at visit 1
Row 4 are diabetes and hypertension history data from subject 2 collected at visit 2
and so on
I use R and want to combine all data from each subject so that there are 2 rows in the new data, row 1 has gender income diabetes and hypertension data for subject 1, and row 2 has data for subject 2. Could I get some help please?
Split and then combine.
(dat <- read.table("000.txt", header = FALSE, as.is = TRUE))
# V1 V2 V3
# 1 Bob Female 22445
# 2 Bob diabeteY HyperN
# 3 Lucy Male 12345
# 4 Lucy diabeteN HyperY
dat01 <- dat[seq(1, nrow(dat), by=2),]
names(dat01) <- c("name", "gender","income")
dat01
# name gender income
# 1 Bob Female 22445
# 3 Lucy Male 12345
dat02 <- dat[seq(2, nrow(dat), by=2),]
names(dat02) <- c("name", "diabet", "hyper")
dat02
# name diabet hyper
# 2 Bob diabeteY HyperN
# 4 Lucy diabeteN HyperY
(dat.final <- merge(dat01, dat02, by="name"))
# name gender income diabet hyper
# 1 Bob Female 22445 diabeteY HyperN
# 2 Lucy Male 12345 diabeteN HyperY
I am working on panel data with a unique case identifier and a column for the time points of the observations (long format). There are both time-constant variables and time-varying observations:
id time tc1 obs1
1 101 1 male 4
2 101 2 male 5
3 101 3 male 3
4 102 1 female 6
5 102 3 female 2
6 103 1 male 2
For my model I now need data with complete records per id for each time point. In other words, if an observation is missing I still need to put in a row with id, time, time-constant variables, and NA for the observed variables (as would be the line (102, 2, "female", NA) in the above example). So my question is:
How can I find out if a row with unique combination of id and time already exists in my dataset?
If not, how can I add this row, carry over time-constant variables and fill the observations with NA?
Would be great if someone could shed some light on this.
Thanks a lot in advance!
EDIT
Thank you everyone for your replies. Here is what I finally did, which is a mix of several suggested approaches. The thing is that I have several time-varying variables (obs1-obsn) per row and I could not get dcast to accommodate that: value.var does not take more than one argument.
# create all possible combinations of id and year
iddat <- expand.grid(id = unique(dataset$id), time = c(1996, 1999, 2002, 2005, 2008, 2011))
iddat <- iddat[order(iddat$id, iddat$time), ]
# add combinations to existing data; combinations that were missing so far get NA
dataset_new <- merge(dataset, iddat, all = TRUE, by = c("id", "time"))
# drop time-constant variables from data
dataset_new[c("tc1", "tc2", "tc3")] <- list(NULL)
# merge back time-constant variables from original data (one row per id)
temp <- unique(dataset[c("id", "tc1", "tc2", "tc3")])
dataset_new <- merge(dataset_new, temp, by = "id")
# sort
dataset_new <- dataset_new[order(dataset_new$id, dataset_new$time), ]
dataset_new <- unique(dataset_new) # safeguard: duplicates appear if temp holds several rows per id
rm(temp)
rm(iddat)
All the best and thanks again, Matt
You could create an empty dataset and then merge in the records in which you have matches.
# Create dataset. For your actual data, you would replace c(1:3) with
# c(1:max(yourdata$id)) and adjust the number of time periods to match your data.
id <- rep(c(1:3), each = 3)
time <- rep(c(1:3), 3)
df <- data.frame(id,time)
test <- df[c(1,3,5,7,9),]
test$tc1 <- c("male", "male", "female", "male", "male")
test$obs1 <-c(4,5,3,6,2)
merge(df, test, by.x = c("id","time"), by.y = c("id","time"), all.x = TRUE)
The result:
id time tc1 obs1
1 1 1 male 4
2 1 2 <NA> NA
3 1 3 male 5
4 2 1 <NA> NA
5 2 2 female 3
6 2 3 <NA> NA
7 3 1 male 6
8 3 2 <NA> NA
9 3 3 male 2
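Note that in the result above tc1 is NA on the added rows. If the time-constant variables should be carried over, one option is to look them up per id after the merge. The following is only a sketch using the data from the question, assuming tc1 really is constant within each id; grid, missing_rows, full and constants are names I made up. It also shows one way to check which id/time combinations do not exist yet:

```r
df <- data.frame(
  id = c(101, 101, 101, 102, 102, 103),
  time = c(1, 2, 3, 1, 3, 1),
  tc1 = c("male", "male", "male", "female", "female", "male"),
  obs1 = c(4, 5, 3, 6, 2, 2),
  stringsAsFactors = FALSE
)

# All combinations of id and time that should exist
grid <- expand.grid(id = unique(df$id), time = sort(unique(df$time)))

# Which combinations are not in the data yet?
missing_rows <- grid[!paste(grid$id, grid$time) %in% paste(df$id, df$time), ]

# Left join the observed values onto the full grid (NA where missing)
full <- merge(grid, df[c("id", "time", "obs1")],
              by = c("id", "time"), all.x = TRUE)

# Carry over the time-constant variable with a per-id lookup table
constants <- unique(df[c("id", "tc1")])
full$tc1 <- constants$tc1[match(full$id, constants$id)]

# Sort and put columns back in the original order
full <- full[order(full$id, full$time), c("id", "time", "tc1", "obs1")]
full
```

With several time-varying columns (obs1 to obsn), the same merge works if you pass all of them along with id and time instead of just obs1.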
There are probably more elegant ways, but here's one option. I'm assuming that you need all combinations of id and time but not tc1 (i.e. tc1 is tied to id).
# your data
df <- read.table(text = " id time tc1 obs1
1 101 1 male 4
2 101 2 male 5
3 101 3 male 3
4 102 1 female 6
5 102 3 female 2
6 103 1 male 2", header = TRUE)
First cast your data to wide format to introduce NAs, then convert back to long.
library('reshape2')
df_wide <- dcast(
df,
id + tc1 ~ time,
value.var = "obs1",
fill = NA
)
df_long <- melt(
df_wide,
id.vars = c("id","tc1"),
variable.name = "time",
value.name = "obs1"
)
# sort by id and then time
df_long[order(df_long$id, df_long$time), ]
id tc1 time obs1
1 101 male 1 4
4 101 male 2 5
7 101 male 3 3
2 102 female 1 6
5 102 female 2 NA
8 102 female 3 2
3 103 male 1 2
6 103 male 2 NA
9 103 male 3 NA