I have a dataframe (df) in which each row represents the start (Start) and the end (End) of a specific habitat (Habitat) within a transect (Transect) and site (Site) in meters. It is important to note that the length of the transects varies within and among sites. As an example:
df <- data.frame(Site = c("A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B"),
Transect = c(1,1,1,1,2,2,2,2,2,1,1,1,1,2,2,2,2),
Habitat = c("X","Y","X","Z","Z","Y","X","Z","X","X","Z","X","Y","Z","X","Y","Z"),
Start=c(0,2.8,3.4,5,0,1.5,5,8,12,0,2,5,7.5,0,4,8,12),
End=c(2.8,3.4,5,10,1.5,5,8,12,15,2,5,7.5,20,4,8,12,15))
df
Site Transect Habitat Start End
1 A 1 X 0.0 2.8 # Habitat `X` is between the meters 0 and 2.8
2 A 1 Y 2.8 3.4 # Habitat `Y` is between the meters 2.8 and 3.4
3 A 1 X 3.4 5.0 # Habitat `X` is between the meters 3.4 and 5.0
4 A 1 Z 5.0 10.0 # Habitat `Z` is between the meters 5 and 10.0
5 A 2 Z 0.0 1.5
6 A 2 Y 1.5 5.0
7 A 2 X 5.0 8.0
8 A 2 Z 8.0 12.0
9 A 2 X 12.0 15.0
10 B 1 X 0.0 2.0
11 B 1 Z 2.0 5.0
12 B 1 X 5.0 7.5
13 B 1 Y 7.5 20.0
14 B 2 Z 0.0 4.0
15 B 2 X 4.0 8.0
16 B 2 Y 8.0 12.0
17 B 2 Z 12.0 15.0
In this example, habitat X occurs twice in transect 1 of site A. We can also see that the total lengths of transects 1 and 2 in site A are 10 and 15 m, respectively. In site B, the total lengths of transects 1 and 2 are 20 and 15 m, respectively.
What I want is to calculate, per Site and Transect, the percentage of the transect length (in meters) that each Habitat represents. For example, in transect 1 of site A, habitat X covers 4.4 meters of a total length of 10 meters. In transect 2 of site A, habitat X covers 6 meters of a total length of 15 meters.
To this aim, the first thing I do is to calculate the length (Length) in meters of each habitat record (=row)
df$Length <- df$End - df$Start
Then I want to calculate, by site and transect, the percentage of the total transect length that each habitat covers. I tried this:
df2 <- as.data.frame(df %>% group_by(Site, Transect, Habitat) %>% summarise(Percentage = (sum(Length)/max(End))*100))
I want to replace max(End) with an expression that represents the total length of the transect. Right now max(End) is the last meter (End) at which a specific habitat was present. How can I get the maximum value of End within a specific Site and Transect, rather than within a specific Habitat?
How can I do it? My desired output would be this:
Site Transect Habitat Percentage
1 A 1 X 44.0
2 A 1 Y 6.0
3 A 1 Z 50.0
4 A 2 X 40.0
5 A 2 Y 23.3
6 A 2 Z 36.7
7 B 1 X 22.5
8 B 1 Y 62.5
9 B 1 Z 15.0
10 B 2 X 26.7
11 B 2 Y 26.7
12 B 2 Z 46.7
Does anyone know how to do it?
Thanks in advance!
With dplyr, when you have different levels of hierarchy that need managing, you may need multiple group_by() statements. In the code below, I use group_by(Site, Transect, Habitat) to calculate the total length of each habitat in the Site and Transect and then group_by(Site, Transect) to calculate the percentage.
library(dplyr)
df <- data.frame(Site = c("A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B"),
Transect = c(1,1,1,1,2,2,2,2,2,1,1,1,1,2,2,2,2),
Habitat = c("X","Y","X","Z","Z","Y","X","Z","X","X","Z","X","Y","Z","X","Y","Z"),
Start=c(0,2.8,3.4,5,0,1.5,5,8,12,0,2,5,7.5,0,4,8,12),
End=c(2.8,3.4,5,10,1.5,5,8,12,15,2,5,7.5,20,4,8,12,15))
df %>%
  mutate(length = End - Start) %>%
  group_by(Site, Transect, Habitat) %>%
  summarise(tot_length = sum(length)) %>%
  group_by(Site, Transect) %>%
  mutate(percentage = 100 * tot_length / sum(tot_length))
#> `summarise()` has grouped output by 'Site', 'Transect'. You can override using
#> the `.groups` argument.
#> # A tibble: 12 × 5
#> # Groups: Site, Transect [4]
#> Site Transect Habitat tot_length percentage
#> <chr> <dbl> <chr> <dbl> <dbl>
#> 1 A 1 X 4.4 44
#> 2 A 1 Y 0.6 6
#> 3 A 1 Z 5 50
#> 4 A 2 X 6 40
#> 5 A 2 Y 3.5 23.3
#> 6 A 2 Z 5.5 36.7
#> 7 B 1 X 4.5 22.5
#> 8 B 1 Y 12.5 62.5
#> 9 B 1 Z 3 15
#> 10 B 2 X 4 26.7
#> 11 B 2 Y 4 26.7
#> 12 B 2 Z 7 46.7
Created on 2023-02-16 by the reprex package (v2.0.1)
In your code from above, when you are calculating the percentage, your data are still grouped by Habitat, so the percentage you are calculating is within the Habitat rather than across habitats within Site and Transect pairs.
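For readers who would rather avoid dplyr, the same two-level idea can be sketched in base R. This is only an illustration under stated assumptions: it uses a reduced copy of the example data (site A only), and the names res and transect_total are introduced here and are not part of the answer above.

```r
# Reduced copy of the question's data: the two transects of site A
df <- data.frame(
  Site     = rep("A", 9),
  Transect = c(1, 1, 1, 1, 2, 2, 2, 2, 2),
  Habitat  = c("X", "Y", "X", "Z", "Z", "Y", "X", "Z", "X"),
  Start    = c(0, 2.8, 3.4, 5, 0, 1.5, 5, 8, 12),
  End      = c(2.8, 3.4, 5, 10, 1.5, 5, 8, 12, 15)
)
df$Length <- df$End - df$Start

# Level 1: total meters per habitat within each Site/Transect
res <- aggregate(Length ~ Site + Transect + Habitat, data = df, FUN = sum)

# Level 2: transect total, repeated on every row of the same Site/Transect
transect_total <- ave(res$Length, res$Site, res$Transect, FUN = sum)
res$Percentage <- 100 * res$Length / transect_total

# Sort for readability
res <- res[order(res$Site, res$Transect, res$Habitat), ]
res
```

The design mirrors the dplyr version: aggregate() plays the role of the first group_by() plus summarise(), and ave() plays the role of the second group_by() plus mutate(), because it returns one value per row rather than one per group.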
I have a wide table with more than 22 columns. This table is the result of fuzzymatch and that's why it's in wide format. The column names are shown below (in order) (I will try to create a sample data frame for better demonstration):
[1] "shift_date.x" "shift" "ageyrs" "site" "level"
[6] "crowded_shift" "time" "dd" "AE" "ageyrs_start"
[11] "ageyrs_end" "time_start" "time_end" "shift_date.y" "shift_n"
[16] "ageyrs_n" "site_n" "level_n" "crowded_shift_n" "los_n"
[21] "dd_n" "AE_n"
What I want to do is to break this data frame at column 14, take columns 14 to the end ("shift_date.y" to "AE_n"), and add them as new rows at the bottom of the first section of the table (change it to long format). The problem is that the first section has 13 columns but the second part has 9, and I am not sure how to combine them (that is probably why subsetting and rbind don't work).
As an example, imagine we have the following data frame:
shift <- c (2,1,0)
ageyrs <- c(12.2,13,14)
site <- c(0,1,3)
level <- c (1,5,6)
ageyrs_s <- c (2,4,5)
ageyrs_n <- c (4,6,8)
shift2 <- c (2,1,0)
ageyrs2 <- c(12.2,13,14)
site2 <- c(0,1,3)
level2 <- c (1,5,6)
a <- data.frame(shift, ageyrs, site, level, ageyrs_s, ageyrs_n, shift2, ageyrs2, site2, level2)
shift ageyrs site level ageyrs_s ageyrs_n shift2 ageyrs2 site2 level2
1 2 12.2 0 1 2 4 2 12.2 0 1
2 1 13.0 1 5 4 6 1 13.0 1 5
3 0 14.0 3 6 5 8 0 14.0 3 6
Now I want to break this data frame at the "shift2" column and create a data frame like the one shown below:
shift ageyrs site level ageyrs_s ageyrs_n
1 2 12.2 0 1 2 4
2 1 13.0 1 5 4 6
3 0 14.0 3 6 5 8
4 2 12.2 0 1 NA NA
5 1 13.0 1 5 NA NA
6 0 14.0 3 6 NA NA
Any suggestions on how to resolve this?
We can use split.default from base R to split the data into list of data.frames and then convert to a single data.frame after unlisting the list elements
nm1 <- sub("\\d+$", "", names(a))
lst1 <- lapply(split.default(a, nm1),
unlist, use.names = FALSE)
out <- data.frame(lapply(lst1, `length<-`, max(lengths(lst1))))[unique(nm1)]
-output
out
# shift ageyrs site level ageyrs_s ageyrs_n
#1 2 12.2 0 1 2 4
#2 1 13.0 1 5 4 6
#3 0 14.0 3 6 5 8
#4 2 12.2 0 1 NA NA
#5 1 13.0 1 5 NA NA
#6 0 14.0 3 6 NA NA
Or using tidyverse
library(dplyr)
library(tidyr)
library(stringr)
a %>%
rename_at(vars(shift:level), ~ str_c(., '1')) %>%
pivot_longer(cols = -c(ageyrs_s, ageyrs_n), names_to = c(".value", 'grp'),
names_sep = "(?<=[a-z])(?=[0-9])")
Try this. You can use bind_rows() and setNames() to define common names so that the values can be joined properly:
library(dplyr)
#Code
newa <- a %>% select(shift:ageyrs_n) %>%
bind_rows(a %>% select(shift2:level2) %>% setNames(gsub('2','',names(.))))
Output:
shift ageyrs site level ageyrs_s ageyrs_n
1 2 12.2 0 1 2 4
2 1 13.0 1 5 4 6
3 0 14.0 3 6 5 8
4 2 12.2 0 1 NA NA
5 1 13.0 1 5 NA NA
6 0 14.0 3 6 NA NA
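The question mentions that plain subsetting plus rbind() fails because the two halves have different numbers of columns. A minimal base R sketch of one way around that is to add the missing columns as NA before binding. The names top, bottom and missing_cols are made up for this illustration, and the data frame is rebuilt with the ageyrs_n column so the block runs on its own.

```r
# Example data from the question, with ageyrs_n as the sixth column
a <- data.frame(
  shift = c(2, 1, 0), ageyrs = c(12.2, 13, 14), site = c(0, 1, 3),
  level = c(1, 5, 6), ageyrs_s = c(2, 4, 5), ageyrs_n = c(4, 6, 8),
  shift2 = c(2, 1, 0), ageyrs2 = c(12.2, 13, 14), site2 = c(0, 1, 3),
  level2 = c(1, 5, 6)
)

top    <- a[1:6]    # first section of the wide table
bottom <- a[7:10]   # second section, to be appended as rows

# Give the second section the same names as the first
names(bottom) <- sub("2$", "", names(bottom))

# Add the columns the second section lacks, filled with NA
missing_cols <- setdiff(names(top), names(bottom))
bottom[missing_cols] <- NA

# Same columns in the same order, so rbind() now works
out <- rbind(top, bottom[names(top)])
out
```

On the real table the suffix pattern would presumably be "_n$" rather than "2$", and columns such as shift_date.y would need their own renaming rule; that part is not covered by this sketch.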
I’m in the process of cleaning some data for a survival analysis and I am trying to make it so that missing data gets imputed based on the surrounding values within a given subject. I'd like to use the mean of the closest previous and closest subsequent values for the participant. If there is no subsequent value present, then I'd like to use the previous value carried forward until a subsequent value is present.
I’ve been trying to break the problem apart into smaller, more manageable operations and objects. However, the solutions I keep coming to force me to use conditional logic based on the rows immediately above and below a missing value and, quite frankly, I’m at a bit of a loss as to how to do this. I would love a little guidance if you know of a good technique I can use or experiment with, or of any good search terms I can use when looking up a solution.
The details are below:
#Fake dataset creation
id <- c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3,4,4,4,4,4,4,4)
time <-c(0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6)
ss <- c(2,2,4,3,NA,0,0,1,4,0,NA,0,0,0,4,2,1,3,3,2,NA,3,4,3,NA,NA,0,0)
mydat <- data.frame(id, time, ss)
*Bold and underlined characters represent changes from the dataset above
The goal here is to find a way to get the NA values for ID #1 (variable ss) to look like this: 2,2,4,3,1.5,0,0
ID# 2 (variable ss) to look like this: 1,4,0,0,0,0,0
ID #3 (variable ss) to look like this: 4,2,1,3,3,2,NA (no change because the row with NA will be deleted eventually)
ID #4 (variable ss) to look like this: 3,4,3,3,1.5,0,0 (this one requires multiple changes and I expect it is the most challenging to tackle).
If processing speed is not the issue (I guess "ID #4" makes it hard to vectorize imputations), then maybe try:
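f <- function(x) {
  idx <- which(is.na(x))                # positions of missing values
  for (id in idx) {
    sel <- x[id + c(-1, 1)]             # immediate neighbours (previous, next)
    if (id < length(x))                 # a trailing NA is left as NA
      sel <- sel[!is.na(sel)]           # otherwise drop missing neighbours
    x[id] <- mean(sel)                  # filled values feed later iterations
  }
  return(x)
}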
cbind(mydat, ss_imp=ave(mydat$ss, mydat$id, FUN=f))
# id time ss ss_imp
# 11 1 0 2 2.0
# 12 1 1 2 2.0
# 13 1 2 4 4.0
# 14 1 3 3 3.0
# 15 1 4 NA 1.5
# 16 1 5 0 0.0
# 17 1 6 0 0.0
# 21 2 0 1 1.0
# 22 2 1 4 4.0
# 23 2 2 0 0.0
# 24 2 3 NA 0.0
# 25 2 4 0 0.0
# 26 2 5 0 0.0
# 27 2 6 0 0.0
# 31 3 0 4 4.0
# 32 3 1 2 2.0
# 33 3 2 1 1.0
# 34 3 3 3 3.0
# 35 3 4 3 3.0
# 36 3 5 2 2.0
# 37 3 6 NA NA
# 41 4 0 3 3.0
# 42 4 1 4 4.0
# 43 4 2 3 3.0
# 44 4 3 NA 3.0
# 45 4 4 NA 1.5
# 46 4 5 0 0.0
# 47 4 6 0 0.0
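To see how the loop handles the harder cases, it may help to call the function on single vectors. The sketch below repeats the function from the answer (with the loop variable renamed to i, which is the only change) and applies it to the ss values of ID 4 and ID 3 from the question, plus one made-up vector with a leading NA.

```r
# Same logic as the answer's f(); only the loop variable is renamed
f <- function(x) {
  idx <- which(is.na(x))
  for (i in idx) {
    sel <- x[i + c(-1, 1)]        # previous and next element
    if (i < length(x))
      sel <- sel[!is.na(sel)]     # ignore a missing neighbour
    x[i] <- mean(sel)
  }
  x
}

# ID 4: the first NA only has a previous value (3), so it becomes 3;
# the second NA then averages that filled 3 with the following 0
id4 <- f(c(3, 4, 3, NA, NA, 0, 0))

# ID 3: a trailing NA has no subsequent value and stays NA
id3 <- f(c(4, 2, 1, 3, 3, 2, NA))

# Made-up case: a leading NA has no previous value and takes the next one
lead <- f(c(NA, 2, 4))

id4
id3
lead
```

Because the vector is filled from left to right, each imputed value can be used by the next missing position. That is why ID 4 ends up as 3 and 1.5 rather than two identical values.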
I have two data sets; one is a subset of the other, but the subset has an additional column and fewer observations.
Basically, I have a unique ID assigned to each participant, and an HHID, the ID of the house from which they were recruited (e.g. 15 participants recruited from 11 houses).
> Healthdata <- data.frame(ID = gl(15, 1), HHID = c(1,2,2,3,4,5,5,5,6,6,7,8,9,10,11))
> Healthdata
Now, I have a subset of the data with only one participant per household, chosen as the one who spent the most hours watching television. In this subset, I have computed a socioeconomic score (SSE) for each house.
> set.seed(1)
> Healthdata.1<- data.frame(ID=sample(1:15,11, replace=F), HHID=gl(11,1), SSE = sample(-6.5:3.5, 11, replace=TRUE))
> Healthdata.1
Now, I want to assign the SSE from the subset (Healthdata.1) to the unique participants of the bigger data set (Healthdata) such that participants from the same house get the same score.
I can't merge this simply, because the data sets have different number of observations, 15 in the bigger one but only 11 in the subset.
Is there any way to do this in R? I am very new to it and I am stuck with this.
I want the required output as something like below, ie ID (participants) from same HHID (house) should have same SSE score. The following output is just meant for an example of what I need, the above seed will not give the same output.
ID HHID SSE
1 1 -6.5
2 2 -5.5
3 2 -5.5
4 3 3.3
5 4 3.0
6 5 2.58
7 5 2.58
8 5 2.58
9 6 -3.05
10 6 -3.05
11 7 -1.2
12 8 2.5
13 9 1.89
14 10 1.88
15 11 -3.02
Thanks.
You can use merge. By default it merges on the columns the two data frames have in common (here both ID and HHID).
merge(Healthdata,Healthdata.1,all.x=TRUE)
ID HHID SSE
1 1 1 NA
2 2 2 NA
3 3 2 NA
4 4 3 NA
5 5 4 NA
6 6 5 NA
7 7 5 NA
8 8 5 NA
9 9 6 0.7
10 10 6 NA
11 11 7 NA
12 12 8 NA
13 13 9 NA
14 14 10 NA
15 15 11 NA
Or you can choose by which column you merge :
merge(Healthdata,Healthdata.1,all.x=TRUE,by='ID')
You need to merge by HHID, not ID. Note this is somewhat confusing because the ids from the supergroup are from a different set than from the subgroup. I.e. ID.x == 4 != ID.y == 4 (in fact, in this case they are in different households). Because of that I left both ID columns here to avoid ambiguity, but you can easily subset the result to show only the ID.x one:
> merge(Healthdata, Healthdata.1, by='HHID')
HHID ID.x ID.y SSE
1 1 1 4 -5.5
2 2 2 6 0.5
3 2 3 6 0.5
4 3 4 8 -2.5
5 4 5 11 1.5
6 5 6 3 -1.5
7 5 7 3 -1.5
8 5 8 3 -1.5
9 6 9 9 0.5
10 6 10 9 0.5
11 7 11 10 3.5
12 8 12 14 -2.5
13 9 13 5 1.5
14 10 14 1 3.5
15 11 15 2 -4.5
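If only the score is needed, a lookup with match() avoids the ID.x / ID.y duplication altogether. The following is a minimal sketch rather than a replacement for merge(). The ID and SSE values in Healthdata.1 are copied from the merge output above purely for illustration, since the values produced by set.seed(1) may differ between R versions.

```r
Healthdata <- data.frame(
  ID   = gl(15, 1),
  HHID = c(1, 2, 2, 3, 4, 5, 5, 5, 6, 6, 7, 8, 9, 10, 11)
)

# One row per household; values taken from the merge output shown above
Healthdata.1 <- data.frame(
  ID   = c(4, 6, 8, 11, 3, 9, 10, 14, 5, 1, 2),
  HHID = gl(11, 1),
  SSE  = c(-5.5, 0.5, -2.5, 1.5, -1.5, 0.5, 3.5, -2.5, 1.5, 3.5, -4.5)
)

# For every participant, find the row of their household in the subset
# and copy that household's SSE; row order of Healthdata is preserved
pos <- match(Healthdata$HHID, Healthdata.1$HHID)
Healthdata$SSE <- Healthdata.1$SSE[pos]
Healthdata
```

A household that is absent from the subset would simply get NA, which matches what merge() with all.x = TRUE would do.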
library(plyr)
join(Healthdata, Healthdata.1)
# Inner Join
join(Healthdata, Healthdata.1, type = "inner", by = "ID")
# Left Join
# I believe this is what you are after
join(Healthdata, Healthdata.1, type = "left", by = "ID")
I have a data frame that looks like this:
site date var dil
1 A 7.4 2
2 A 6.5 2
1 A 7.3 3
2 A 7.3 3
1 B 7.1 1
2 B 7.7 2
1 B 7.7 3
2 B 7.4 3
I need to add a column called wt to this data frame that contains the weighting factor needed to calculate the weighted mean. This weighting factor has to be derived for each combination of site and date.
The approach I'm using is to first build a function that calculates the weighting factor:
> weight <- function(dil){
dil/sum(dil)
}
then apply the function for each combination of site and date
> df$wt <- ddply(df,.(date,site),.fun=weight)
but I get this error message:
Error in FUN(X[[1L]], ...) :
only defined on a data frame with all numeric variables
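You are almost there. ddply() passes each group's entire data frame to your function, so sum() ends up being applied to a data frame that still contains the non-numeric date column, hence the error. Modify your code to use the transform function, which allows you to add columns to the data.frame inside ddply:
library(plyr)
weight <- function(x) x/sum(x)
ddply(df, .(date,site), transform, weight=weight(dil))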
site date var dil weight
1 1 A 7.4 2 0.40
2 1 A 7.3 3 0.60
3 2 A 6.5 2 0.40
4 2 A 7.3 3 0.60
5 1 B 7.1 1 0.25
6 1 B 7.7 3 0.75
7 2 B 7.7 2 0.40
8 2 B 7.4 3 0.60
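The same per-group weights can also be sketched in base R with ave(), which keeps the original row order. The block below rebuilds the question's data frame by hand and goes one step further by computing the weighted mean of var per site and date. The column names wv and wmean are introduced only for this example.

```r
df <- data.frame(
  site = c(1, 2, 1, 2, 1, 2, 1, 2),
  date = rep(c("A", "B"), each = 4),
  var  = c(7.4, 6.5, 7.3, 7.3, 7.1, 7.7, 7.7, 7.4),
  dil  = c(2, 2, 3, 3, 1, 2, 3, 3)
)

# Weight of each row within its date/site combination
df$wt <- ave(df$dil, df$date, df$site, FUN = function(x) x / sum(x))

# Weighted mean per group: sum of value times weight
df$wv <- df$var * df$wt
wm <- aggregate(wv ~ site + date, data = df, FUN = sum)
names(wm)[3] <- "wmean"

df
wm
```

Since the weights sum to 1 within each group, summing var * wt should give the same result as calling weighted.mean(var, dil) on that group.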