Duplicate of: Forward and backward fill data frame in R
Suppose that we have the following data frame:
ID <- c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 5, 5, 5, 6, 6, 6)
age <- c(25, 25, 25, 22, 22, 56, 56, 56, 80, 33, 33, 90, 90, 90, 5, 5, 5)
gender <- c("m", "m", NA, "f", "f", "m", NA, "m", "m", "m", NA, NA, NA, "m", NA, NA, NA)
company <- c("c1", "c2", "c2", "c3", "c3", "c1", "c1", "c1", "c1", "c5", "c5", "c3", "c4", "c5", "c3", "c1", "c1")
income <- c(1000, 1000, 1000, 500, 1700, 200, 200, 250, 500, 700, 700, 300, 350, 300, 500, 1700, 200)
df <- data.frame(ID, age, gender, company, income)
In this data we have 6 unique IDs, and the gender variable sometimes includes NA values.
I want to replace the NAs with the correct gender category for each ID. If an ID has all NAs for gender, leave it as is.
The expected outcome would be:
Here's a way in base R using ave -
df$gender <- with(df, ave(gender, ID, FUN = function(x) na.omit(x)[1]))
ID age gender company income
1 1 25 m c1 1000
2 1 25 m c2 1000
3 1 25 m c2 1000
4 2 22 f c3 500
5 2 22 f c3 1700
6 3 56 m c1 200
7 3 56 m c1 200
8 3 56 m c1 250
9 3 80 m c1 500
10 4 33 m c5 700
11 4 33 m c5 700
12 5 90 m c3 300
13 5 90 m c4 350
14 5 90 m c5 300
15 6 5 <NA> c3 500
16 6 5 <NA> c1 1700
17 6 5 <NA> c1 200
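The FUN = function(x) na.omit(x)[1] used above returns the first non-NA value in each group, and plain NA when the group is all-NA; a quick standalone check of that helper:

```r
first_non_na <- function(x) na.omit(x)[1]

first_non_na(c(NA, "m", "m"))  # "m"  -- first non-missing value wins
first_non_na(c(NA, NA, NA))    # NA   -- all-NA groups stay NA
```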
Some ways with dplyr and tidyr -
library(dplyr)
library(tidyr)

df %>%
  group_by(ID) %>%
  mutate(gender = na.omit(gender)[1])

df %>%
  group_by(ID) %>%
  fill(gender, .direction = "up") %>%
  fill(gender, .direction = "down")
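On tidyr >= 1.0 the two fill() calls can be collapsed into one, since .direction also accepts "downup" and "updown" (a variant assuming a recent tidyr; df abbreviated here to the relevant columns):

```r
library(dplyr)
library(tidyr)

df <- data.frame(ID     = c(1, 1, 1, 5, 5, 5, 6, 6, 6),
                 gender = c("m", "m", NA, NA, NA, "m", NA, NA, NA))

df %>%
  group_by(ID) %>%
  fill(gender, .direction = "downup") %>%  # fill down, then up, per group
  ungroup()
```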
Using the tidyverse you can do this:
library(tidyverse)
# for each ID, get the gender
df_gender_ref <- df %>% filter(!is.na(gender)) %>% select(ID, gender) %>% unique()
# add the gender column back onto the original data frame
df %>% select(-gender) %>% left_join(df_gender_ref, by = "ID")
How can I merge 5 vectors together into one dynamic/flexible matrix?
V1 <- c(13, 31, 54)
name1 <- c("a", "b2", "c")
V2 <- c(17, 27, 34, 52)
name2 <- c("a", "b1", "b2", "c")
V3 <- c(19, 25, 33, 47, 58, 44)
name3 <- c("a", "b1", "b2", "b3", "c", "d")
V4 <- c(13, 29, 35, 56)
name4 <- c("a", "b1", "b2", "c")
V5 <- c(21, 35, 67, 82, 96)
name5 <- c("d", "c", "b3", "b1", "b2")
And create a matrix like this:
We can load the objects into the global environment with mget, based on the pattern of the object names, i.e. those that start with (^) 'V' followed by one or more digits (\\d+) at the end ($) of the string. Then pad each list element with NA up to the maximum of the list's lengths.
lst1 <- mget(ls(pattern = '^V\\d+$'))
t(sapply(lst1, `length<-`, max(lengths(lst1))))
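The `length<-` replacement function is what does the padding: extending a vector's length fills the new positions with NA. For example:

```r
x <- c(13, 31, 54)
length(x) <- 5   # new slots are filled with NA
x
#> [1] 13 31 54 NA NA
```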
If we need the names as well
lst2 <- mget(ls(pattern = '^name\\d+$'))
out <- xtabs(unlist(lst1) ~ rep(seq_along(lst1), lengths(lst1)) +
unlist(lst2))
names(dimnames(out)) <- NULL
Or another option is map2 from purrr
library(purrr)
map2_dfr(lst1, lst2, setNames)
Output
# A tibble: 5 x 6
# a b2 c b1 b3 d
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 13 31 54 NA NA NA
#2 17 34 52 27 NA NA
#3 19 33 58 25 47 44
#4 13 35 56 29 NA NA
#5 NA 96 35 82 67 21
Does anyone know if it is possible to use a variable in one dataframe (in my case the "deploy" dataframe) to create a variable in another dataframe?
For example, I have two dataframes:
df1:
deploy <- data.frame(ID = c("20180101_HH1_1_1", "20180101_HH1_1_2", "20180101_HH1_1_3"),
Site_Depth = c(42, 93, 40), Num_Depth_Bins_Required = c(5, 100, 4),
Percent_Column_in_each_bin = c(20, 10, 25))
df2:
sp.c <- data.frame(species = c("RR", "GS", "GT", "BR", "RS", "BA", "GS", "RS", "SH", "RR"),
ct = c(25, 66, 1, 12, 30, 6, 1, 22, 500, 6),
percent_dist_from_surf = c(11, 15, 33, 68, 71, 100, 2, 65, 5, 42))
I want to create new columns in df2 that assign each species and count to a bin, based on the Percent_Column_in_each_bin for each ID. For example, for 20180101_HH1_1_3 there would be 4 bins, each covering 25% of the column: all species within 0-25% of the column (in df2) would be in bin 1, species within 25-50% would be in bin 2, and so on. What I'm imagining looks like this:
i.want.this <- data.frame(species = c("RR", "GS", "GT", "BR", "RS", "BA", "GS", "RS", "SH", "RR"),
                          ct = c(25, 66, 1, 12, 30, 6, 1, 22, 500, 6),
                          percent_dist_from_surf = c(11, 15, 33, 68, 71, 100, 2, 65, 5, 42),
                          '20180101_HH1_1_1_Bin' = c(1, 1, 2, 4, 4, 5, 1, 4, 1, 3),
                          '20180101_HH1_1_2_Bin' = c(2, 2, 4, 7, 8, 10, 1, 7, 1, 5),
                          '20180101_HH1_1_3_Bin' = c(1, 1, 2, 3, 3, 4, 1, 3, 1, 2),
                          check.names = FALSE)
I am pretty new to R and I'm not sure how to make this happen. I need to do this for over 100 IDs (all with different depths, number of depth bins, and percent of the column in each bin) so I was hoping that I don't need to do them all by hand. I have tried mutate in dplyr but I can't get it to pull from two different dataframes. I have also tried ifelse statements, but I would need to run the ifelse statement for each ID individually.
I don't know if what I am trying to do is possible but I appreciate the feedback. Thank you in advance!
Edit: my end goal is to find the max count (max ct) for each species within each bin for each ID. What I've been doing to find this (using the bins generated with suggestions from @Ben) is using dplyr to slice out the max like this:
`20180101_HH1_1_1` <- sp.c %>%
  group_by(`20180101_HH1_1_1`, species) %>%
  arrange(desc(ct)) %>%
  slice(1) %>%
  group_by(`20180101_HH1_1_1`) %>%
  mutate(Count_Total_Per_Bin = sum(ct)) %>%
  group_by(species, add = TRUE) %>%
  mutate(species_percent_of_total_in_bin = 100 * ct / Count_Total_Per_Bin) %>%
  mutate(ID = "20180101_HH1_1_1") %>%
  ungroup()
but I have to do this for over 100 IDs. My desired output would be something like:
end.goal <- data.frame(ID = c(rep("20180101_HH1_1_1", 8)),
species = c("RR", "GS", "SH", "GT", "RR", "BR", "RS", "BA"),
bin = c(1, 1, 1, 2, 3, 4, 4, 5),
Max_count_of_each_species_in_each_bin = c(11, 66, 500, 1, 6, 12, 30, 6),
percent_dist_from_surf = c(11, 15, 5, 33, 42, 68, 71, 100),
percent_each_species_max_in_each_bin = c((11/577)*100, (66/577)*100, (500/577)*100, 100, 100, (12/42)*100, (30/42)*100, 100))
I was thinking that by answering the original question I could get to this but I see now that there's still a lot you have to do to get this for each ID.
Here is another approach, which does not require a loop.
Using sapply, you can apply cut to sp.c$percent_dist_from_surf once for each bin width in your deploy dataframe to determine the bins.
res <- sapply(deploy$Percent_Column_in_each_bin, function(x) {
cut(sp.c$percent_dist_from_surf, seq(0, 100, by = x), include.lowest = TRUE, labels = 1:(100/x))
})
colnames(res) <- deploy$ID
cbind(sp.c, res)
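A small caveat on the labels: labels = 1:(100/x) assumes each bin width divides 100 evenly. Passing labels = FALSE instead makes cut return integer bin codes directly, avoiding the labels vector altogether (same breaks assumed; deploy and sp.c abbreviated to the relevant columns):

```r
deploy <- data.frame(ID = c("20180101_HH1_1_1", "20180101_HH1_1_2", "20180101_HH1_1_3"),
                     Percent_Column_in_each_bin = c(20, 10, 25))
sp.c <- data.frame(percent_dist_from_surf = c(11, 15, 33, 68, 71, 100, 2, 65, 5, 42))

res <- sapply(deploy$Percent_Column_in_each_bin, function(x) {
  cut(sp.c$percent_dist_from_surf, seq(0, 100, by = x),
      include.lowest = TRUE, labels = FALSE)   # integer bin codes
})
colnames(res) <- deploy$ID
res[, "20180101_HH1_1_3"]  # 1 1 2 3 3 4 1 3 1 2
```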
Or using purrr:
library(purrr)
cbind(sp.c, imap(setNames(deploy$Percent_Column_in_each_bin, deploy$ID),
~ cut(sp.c$percent_dist_from_surf, seq(0, 100, by = .x), include.lowest = TRUE, labels = 1:(100/.x))
))
Output
species ct percent_dist_from_surf 20180101_HH1_1_1 20180101_HH1_1_2 20180101_HH1_1_3
1 RR 25 11 1 2 1
2 GS 66 15 1 2 1
3 GT 1 33 2 4 2
4 BR 12 68 4 7 3
5 RS 30 71 4 8 3
6 BA 6 100 5 10 4
7 GS 1 2 1 1 1
8 RS 22 65 4 7 3
9 SH 500 5 1 1 1
10 RR 6 42 3 5 2
Edit:
To determine the maximum ct value for each species, site, and bin, put the result above into a dataframe called res and do the following: first reshape to long form with pivot_longer, then group_by species, site, and bin and take the maximum ct for each combination.
library(tidyverse)
res %>%
pivot_longer(cols = starts_with("2018"), names_to = "site", values_to = "bin") %>%
group_by(species, site, bin) %>%
summarise(max_ct = max(ct)) %>%
arrange(site, bin)
Output
# A tibble: 26 x 4
# Groups: species, site [21]
species site bin max_ct
<fct> <chr> <fct> <dbl>
1 GS 20180101_HH1_1_1 1 66
2 RR 20180101_HH1_1_1 1 25
3 SH 20180101_HH1_1_1 1 500
4 GT 20180101_HH1_1_1 2 1
5 RR 20180101_HH1_1_1 3 6
6 BR 20180101_HH1_1_1 4 12
7 RS 20180101_HH1_1_1 4 30
8 BA 20180101_HH1_1_1 5 6
9 GS 20180101_HH1_1_2 1 1
10 SH 20180101_HH1_1_2 1 500
11 GS 20180101_HH1_1_2 2 66
12 RR 20180101_HH1_1_2 2 25
13 GT 20180101_HH1_1_2 4 1
14 RR 20180101_HH1_1_2 5 6
15 BR 20180101_HH1_1_2 7 12
16 RS 20180101_HH1_1_2 7 22
17 RS 20180101_HH1_1_2 8 30
18 BA 20180101_HH1_1_2 10 6
19 GS 20180101_HH1_1_3 1 66
20 RR 20180101_HH1_1_3 1 25
21 SH 20180101_HH1_1_3 1 500
22 GT 20180101_HH1_1_3 2 1
23 RR 20180101_HH1_1_3 2 6
24 BR 20180101_HH1_1_3 3 12
25 RS 20180101_HH1_1_3 3 30
26 BA 20180101_HH1_1_3 4 6
It is helpful to distinguish between the contents of your two dataframes.
df2 appears to contain measurements from some sites
df1 appears to contain parameters by which you want to process/summarise the measurements in df2
Given these different purposes of the two dataframes, your best approach is probably to loop over all the rows of df1 each time adding a column to df2. Something like the following:
library(dplyr)

max_dist = max(df2$percent_dist_from_surf)
for(ii in 1:nrow(df1)){
  # extract parameters
  this_ID = df1[[ii, "ID"]]
  this_depth = df1[[ii, "Site_Depth"]]
  this_bins = df1[[ii, "Num_Depth_Bins_Required"]]
  this_percent = df1[[ii, "Percent_Column_in_each_bin"]]
  # add column to df2
  df2 = df2 %>%
    mutate(!!sym(this_ID) := insert_your_calculation_here)
}
The !!sym(this_ID) := part of the code is to allow dynamic naming of your output columns.
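A minimal, self-contained illustration of that !!sym(name) := value pattern (the data and column name here are made up):

```r
library(dplyr)

this_ID <- "20180101_HH1_1_1"   # column name held in a variable
d <- data.frame(x = 1:3)
d <- d %>% mutate(!!sym(this_ID) := x * 2)

names(d)                  # "x" "20180101_HH1_1_1"
d[["20180101_HH1_1_1"]]   # 2 4 6
```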
And as best I can determine, the formula you want for insert_your_calculation_here is ceiling(percent_dist_from_surf / max_dist * this_bins) (R's function is ceiling, not ceil).
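Spot-checking that formula against the expected output in the question, for ID 20180101_HH1_1_3 (4 bins; max_dist is 100 in the example data):

```r
max_dist <- 100   # max(sp.c$percent_dist_from_surf) in the example data
this_bins <- 4    # Num_Depth_Bins_Required for 20180101_HH1_1_3
percent_dist_from_surf <- c(11, 15, 33, 68, 71, 100, 2, 65, 5, 42)

ceiling(percent_dist_from_surf / max_dist * this_bins)
#> [1] 1 1 2 3 3 4 1 3 1 2
```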
Suppose that we have the following data frame:
ID <- c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 5, 5, 5)
age <- c(25, 25, 25, 22, 22, 56, 56, 56, 80, 33, 33, 90, 90, 90)
gender <- c("m", "m", "m", "f", "f", "m", "m", "m", "m", "m", "m", "f", "f", "m")
company <- c("c1", "c2", "c2", "c3", "c3", "c1", "c1", "c1", "c1", "c5", "c5", "c3", "c4", "c5")
income <- c(1000, 1000, 1000, 500, 1700, 200, 200, 250, 500, 700, 700, 300, 350, 300)
df <- data.frame(ID, age, gender, company, income)
I need to find the rows that have different values within an ID for age, gender, or income. I don't care whether company is the same or different.
So after processing, here is the output:
Bonus:
Can we create another data frame that lists, for each ID, which variables differ? For example:
An option would be to group by 'ID', check whether the number of distinct elements in 'age', 'gender', 'income' is equal to 1 and then negate (!)
library(dplyr)
out <- df %>%
group_by(ID) %>%
filter(!(n_distinct(age) == 1 &
n_distinct(gender) == 1 &
n_distinct(income) == 1))
out
# A tibble: 9 x 5
# Groups: ID [3]
# ID age gender company income
# <dbl> <dbl> <fct> <fct> <dbl>
#1 2 22 f c3 500
#2 2 22 f c3 1700
#3 3 56 m c1 200
#4 3 56 m c1 200
#5 3 56 m c1 250
#6 3 80 m c1 500
#7 5 90 f c3 300
#8 5 90 f c4 350
#9 5 90 m c5 300
If there are many variables, another option is filter_at
df %>%
group_by(ID) %>%
filter_at(vars(age, gender, income), any_vars(!(n_distinct(.) == 1)))
From the above, we can get the second output with
library(tidyr)
out %>%
select(-company) %>%
gather(key, val, - ID) %>%
group_by(key, add = TRUE) %>%
filter(n_distinct(val) > 1) %>%
group_by(ID) %>%
summarise(Different = toString(unique(key)))
# A tibble: 3 x 2
# ID Different
# <dbl> <chr>
#1 2 income
#2 3 age, income
#3 5 gender, income
In base R, we can split the c("age", "gender", "income") columns by ID, find the IDs that have more than one unique row, and subset them.
df[df$ID %in% unique(df$ID)[sapply(split(df[c("age", "gender", "income")], df$ID),
function(x) nrow(unique(x)) > 1)], ]
# ID age gender company income
#4 2 22 f c3 500
#5 2 22 f c3 1700
#6 3 56 m c1 200
#7 3 56 m c1 200
#8 3 56 m c1 250
#9 3 80 m c1 500
#12 5 90 f c3 300
#13 5 90 f c4 350
#14 5 90 m c5 300
I am working with a data set of patients' health state over time.
I would like to compute the data frame of transitions
from the current health state to the next health state.
Here is an example where the health state is measured
only by AFP level and weight.
The health state measurements might look like the following:
x <- data.frame(id = c(1, 1, 1, 2, 2, 2),
day = c(1, 2, 3, 1, 2, 3),
event = c('status', 'status', 'death', 'status', 'status', 'status'),
afp = c(10, 50, NA, 20, 30, 40),
weight = c(100, 105, NA, 200, 200, 200))
The desired output looks like the following:
y <- data.frame(id = c(1, 1, 2, 2),
current_afp = c(10, 50, 20, 30),
current_weight = c(100, 105, 200, 200),
next_event = c('status', 'death', 'status', 'status'),
next_afp = c(50, NA, 30, 40),
next_weight = c(105, NA, 200, 200))
One inefficient way to obtain the output is:
take the cross product of the measurements data frame with itself
keep only rows with matching ids, and day.x + 1 = day.y
rename the columns
Is there a more efficient way to obtain the output?
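For reference, the cross-product approach described above might be sketched like this (keeping only pairs of consecutive days per id):

```r
x <- data.frame(id = c(1, 1, 1, 2, 2, 2),
                day = c(1, 2, 3, 1, 2, 3),
                event = c('status', 'status', 'death', 'status', 'status', 'status'),
                afp = c(10, 50, NA, 20, 30, 40),
                weight = c(100, 105, NA, 200, 200, 200))

# cross product within each id, then keep only consecutive-day pairs
xy <- merge(x, x, by = "id", suffixes = c(".current", ".next"))
xy <- xy[xy$day.current + 1 == xy$day.next, ]
nrow(xy)  # 4 transitions, matching the desired output
```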
Note: The real measurements data frame can have more than 10 columns,
so it is not very efficient from a lines of code perspective
to explicitly write
current_afp = x$afp[1:(n-1)],
next_afp = x$afp[2:n]
...
and so on.
You could try:
library(dplyr)
x %>%
mutate_each(funs(lead(.)), -id, -day) %>%
full_join(x, ., by = c("id", "day")) %>%
select(-event.x) %>%
setNames(c(names(.)[1:2],
paste0("current_", sub("\\..*","", names(.)[3:4])),
paste0("next_", sub("\\..*","", names(.)[5:7])))) %>%
group_by(id) %>%
filter(day != last(day))
Which gives:
# id day current_afp current_weight next_event next_afp next_weight
#1 1 1 10 100 status 50 105
#2 1 2 50 105 death NA NA
#3 2 1 20 200 status 30 200
#4 2 2 30 200 status 40 200
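mutate_each() has since been superseded; on dplyr >= 1.0 the same idea can be written with across() and lead(), grouping by id so values never leak across patients (a sketch assuming a recent dplyr):

```r
library(dplyr)

x <- data.frame(id = c(1, 1, 1, 2, 2, 2),
                day = c(1, 2, 3, 1, 2, 3),
                event = c('status', 'status', 'death', 'status', 'status', 'status'),
                afp = c(10, 50, NA, 20, 30, 40),
                weight = c(100, 105, NA, 200, 200, 200))

y <- x %>%
  group_by(id) %>%
  mutate(across(c(event, afp, weight), ~ lead(.x), .names = "next_{.col}")) %>%
  filter(day != max(day)) %>%   # each id's last row has no "next" state
  ungroup() %>%
  select(id, current_afp = afp, current_weight = weight,
         next_event, next_afp, next_weight)
```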
Using base R with a split-apply-combine approach
res <- lapply(split(x[-2], x$id), function(y) {
xx <- cbind(y[1:(nrow(y)-1), ], y[2:nrow(y), -1])
colnames(xx) <- c("id", paste("current", colnames(y)[-1], sep="_"),
paste("next", colnames(y)[-1], sep="_"))
xx[, which(colnames(xx) != "current_event")]
})
do.call(rbind, res)
id current_afp current_weight next_event next_afp next_weight
1 1 10 100 status 50 105
2 1 50 105 death NA NA
3 2 20 200 status 30 200
4 2 30 200 status 40 200
Or, an example where not all days are in sequence
x <- data.frame(id = c(1, 1, 1, 2, 2, 2),
day = c(1, 2, 3, 1, 2, 4),
event = c('status', 'status', 'death', 'status', 'status', 'status'),
afp = c(10, 50, NA, 20, 30, 40),
weight = c(100, 105, NA, 200, 200, 200))
x
id day event afp weight
1 1 1 status 10 100
2 1 2 status 50 105
3 1 3 death NA NA
4 2 1 status 20 200
5 2 2 status 30 200
6 2 4 status 40 200
Some of the transitions are NA, which could be removed if desired.
res <- lapply(split(x, x$id), function(y) {
y <- merge(data.frame(id=unique(y$id), day = 1:max(y$day)), y,
by = c("id", "day"), all.x=TRUE)[, -2]
xx <- cbind(y[1:(nrow(y)-1), ], y[2:nrow(y), -1])
colnames(xx) <- c("id", paste("current", colnames(y)[-1], sep="_"),
paste("next", colnames(y)[-1], sep="_"))
xx[, which(colnames(xx) != "current_event")]
})
do.call(rbind, res)
id current_afp current_weight next_event next_afp next_weight
1.1 1 10 100 status 50 105
1.2 1 50 105 death NA NA
2.1 2 20 200 status 30 200
2.2 2 30 200 <NA> NA NA
2.3 2 NA NA status 40 200
I have a data frame, and I'd like to create a new column that gives the sum of a numeric variable grouped by factors. So something like this:
BEFORE:
data1 <- data.frame(month = c(1, 1, 2, 2, 3, 3),
sex = c("m", "f", "m", "f", "m", "f"),
value = c(10, 20, 30, 40, 50, 60))
AFTER:
data2 <- data.frame(month = c(1, 1, 2, 2, 3, 3),
sex = c("m", "f", "m", "f", "m", "f"),
value = c(10, 20, 30, 40, 50, 60),
sum = c(30, 30, 70, 70, 110, 110))
In Stata you can do this with the egen command quite easily. I've tried the aggregate function, and the ddply function but they create entirely new data frames, and I just want to add a column to the existing one.
You are looking for ave
> data2 <- transform(data1, sum=ave(value, month, FUN=sum))
month sex value sum
1 1 m 10 30
2 1 f 20 30
3 2 m 30 70
4 2 f 40 70
5 3 m 50 110
6 3 f 60 110
data1$sum <- ave(data1$value, data1$month, FUN=sum) is useful if you don't want to use transform
Also data.table is helpful
library(data.table)
DT <- data.table(data1)
DT[, sum:=sum(value), by=month]
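Note that := updates the data.table by reference, so no assignment back with <- is needed; a quick standalone check:

```r
library(data.table)

DT <- data.table(month = c(1, 1, 2, 2),
                 value = c(10, 20, 30, 40))
DT[, sum := sum(value), by = month]   # DT itself is modified in place
DT$sum   # 30 30 70 70
```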
UPDATE
We can also use a tidyverse approach which is simple, yet elegant:
> library(tidyverse)
> data1 %>%
group_by(month) %>%
mutate(sum=sum(value))
# A tibble: 6 x 4
# Groups: month [3]
month sex value sum
<dbl> <fct> <dbl> <dbl>
1 1 m 10 30
2 1 f 20 30
3 2 m 30 70
4 2 f 40 70
5 3 m 50 110
6 3 f 60 110