Replace/Add column values based on another column - r

I am creating three new columns and trying to fill them based on the values in two other columns. The new columns are "0-5", "5-15" and "15-30". First, I wanted to know whether or not the cells are within these ranges according to the columns upper and lower, so I created rules to fill them in with 'y' (for yes). Now, if there is a y present, I would like to replace that y with the corresponding number in the value column. I am stuck on this part. I am also wondering if there is an easier way to fill in the "0-5", "5-15", and "15-30" columns directly with the number in "value" based on the upper/lower columns without having to put in "y" first.
x y upper lower 0-5 5-15 15-30 value
378828.1 1682697.2 2 12 y y NA 4.04
378828.1 1682697.2 12 37 NA y y 1.00
381625.6 1684852.5 0 63 y y y 1.96
388660.2 1704566.9 5 18 NA y y 2.65

We can accomplish this using the dplyr package as well as the match function:
library(dplyr)
dat %>%
rowwise() %>%
mutate(`0-5` = ifelse(any(match(lower:upper, 0:4)), value, NA),
`5-15` = ifelse(any(match(lower:upper, 5:14)), value, NA),
`15-30` = ifelse(any(match(lower:upper, 15:29)), value, NA))
x y upper lower value `0-5` `5-15` `15-30`
<dbl> <dbl> <int> <int> <dbl> <dbl> <dbl> <dbl>
1 378828.1 1682697 2 12 4.04 4.04 4.04 NA
2 378828.1 1682697 12 37 1.00 NA 1.00 1.00
3 381625.6 1684853 0 63 1.96 1.96 1.96 1.96
4 388660.2 1704567 5 18 2.65 NA 2.65 2.65
data
dat <- structure(list(x = c(378828.1, 378828.1, 381625.6, 388660.2),
y = c(1682697.2, 1682697.2, 1684852.5, 1704566.9), upper = c(2L,
12L, 0L, 5L), lower = c(12L, 37L, 63L, 18L), value = c(4.04,
1, 1.96, 2.65)), .Names = c("x", "y", "upper", "lower", "value"
), class = "data.frame", row.names = c(NA, -4L))

Related

Making a table that contains Mean and SD of a Dataset

I am using this dataset: http://www.openintro.org/stat/data/cdc.R
to create a table from a subset that only contains the means and standard deviations of male participants. The table should look like this:
Mean Standard Deviation
Age: 44.27 16.715
Height: 70.25 3.009219
Weight: 189.3 36.55036
Desired Weight: 178.6 26.25121
I created a subset for males and females with this code:
mdata <- subset(cdc, cdc$gender == ("m"))
fdata <- subset(cdc, cdc$gender == ("f"))
How should I create a table that only contains means and SDs of age, height, weight, and desired weight using these subsets?
The data frame you provided sucked up all the memory on my laptop, and it's not needed to provide that much data to solve your problem. Here's a dplyr/tidyr solution to create a summary table grouped by categories, using the starwars dataset available with dplyr:
library(dplyr)
library(tidyr)
starwars |>
group_by(sex) |>
summarise(across(
where(is.numeric),
.fns = list(Mean = mean, SD = sd), na.rm = TRUE,
.names = "{col}__{fn}"
)) |>
pivot_longer(-sex, names_to = c("var", ".value"), names_sep = "__")
# A tibble: 15 × 4
sex var Mean SD
<chr> <chr> <dbl> <dbl>
1 female height 169. 15.3
2 female mass 54.7 8.59
3 female birth_year 47.2 15.0
4 hermaphroditic height 175 NA
5 hermaphroditic mass 1358 NA
6 hermaphroditic birth_year 600 NA
7 male height 179. 36.0
8 male mass 81.0 28.2
9 male birth_year 85.5 157.
10 none height 131. 49.1
11 none mass 69.8 51.0
12 none birth_year 53.3 51.6
13 NA height 181. 2.89
14 NA mass 48 NA
15 NA birth_year 62 NA
Just make a data frame of colMeans and column sd. Note, that you may also select columns.
fdata <- subset(cdc, gender == "f", select=c("age", "height", "weight", "wtdesire"))
data.frame(mean=colMeans(fdata), sd=apply(fdata, 2, sd))
# mean sd
# age 45.79772 17.584420
# height 64.36775 2.787304
# weight 151.66619 34.297519
# wtdesire 133.51500 18.963014
You can also use by to do it simultaneously for both groups, it's basically a combination of split and lapply. (To avoid apply when calculating column SDs, you could also use sd=matrixStats::colSds(as.matrix(fdata)) which is considerably faster.)
res <- by(cdc[c("age", "height", "weight", "wtdesire")], cdc$gender, \(x) {
data.frame(mean=colMeans(x), sd=matrixStats::colSds(as.matrix(x)))
})
res
# cdc$gender: m
# mean sd
# age 44.27307 16.719940
# height 70.25165 3.009219
# weight 189.32271 36.550355
# wtdesire 178.61657 26.251215
# ------------------------------------------------------------------------------------------
# cdc$gender: f
# mean sd
# age 45.79772 17.584420
# height 64.36775 2.787304
# weight 151.66619 34.297519
# wtdesire 133.51500 18.963014
To extract only one of the data frames in the list-like object use e.g. res$m.
Usually we use aggregate for this, which you also might consider:
aggregate(cbind(age, height, weight, wtdesire) ~ gender, cdc, \(x) c(mean=mean(x), sd=sd(x))) |>
do.call(what=data.frame)
# gender age.mean age.sd height.mean height.sd weight.mean weight.sd wtdesire.mean wtdesire.sd
# 1 m 44.27307 16.71994 70.251646 3.009219 189.32271 36.55036 178.61657 26.25121
# 2 f 45.79772 17.58442 64.367750 2.787304 151.66619 34.29752 133.51500 18.96301
The pipe |> call(what=data.frame) is just needed to get rid of matrix columns, which is useful in case you aim to further process the data.
Note: R >= 4.1 used.
Data:
source('https://www.openintro.org/stat/data/cdc.R')
or
cdc <- structure(list(genhlth = structure(c(3L, 3L, 1L, 5L, 3L, 3L), levels = c("excellent",
"very good", "good", "fair", "poor"), class = "factor"), exerany = c(0,
1, 0, 0, 1, 1), hlthplan = c(1, 1, 1, 1, 1, 1), smoke100 = c(1,
0, 0, 0, 0, 1), height = c(69, 66, 73, 65, 67, 69), weight = c(224L,
215L, 200L, 216L, 165L, 170L), wtdesire = c(224L, 140L, 185L,
150L, 165L, 165L), age = c(73L, 23L, 35L, 57L, 81L, 83L), gender = structure(c(1L,
2L, 1L, 2L, 2L, 1L), levels = c("m", "f"), class = "factor")), row.names = c("19995",
"19996", "19997", "19998", "19999", "20000"), class = "data.frame")

How can I pick an element from a matrix depending on a set of conditions?

I have a dataframe containing n rows and m columns. Each row is an individual and each column is information on that individual.
df
id age income
1 18 12
2 24 24
3 36 12
4 18 24
. . .
. . .
. . .
I also have a matrix rXcshowing age buckets in each row and income buckets in each column and each element of the matrix is the % of people for each income-age bucket.
matrix age\income
12 24 36 .....
18 0.15 0.12 0.11 ....
24 0.12 0.6 0.2 ...
36 0.02 0.16 0.16 ...
. ..................
. ..................
For each individual in the dataframe, I need to find the right element of the matrix given the age and income bucket of the individual.
The desired output should look like this
df2
id age income y
1 18 12 0.15
2 24 24 0.6
3 36 12 0.02
4 18 24 0.12
. . .
. . .
. . .
I tried with a series of IFs inside a loop (like in the example):
for (i in 1:length(df$x)) {
workingset <- df[i,]
if(workingset$age==18){
temp<-marix[1,]
workingset$y <- ifelse(workingset$income<12, temp[1], ifelse(workingset$income<24,temp[2],ifelse,temp[3])
}else if(workingset$age==24){
temp<-marix[2,]
workingset$y <- ifelse(workingset$income<12, temp[1], ifelse(workingset$income<24,temp[2],ifelse,temp[3])
}else if{
...
}
if(i==1){
df2 <- workingset
}else{
df2<- rbind(df2, workingset)
}
}
This code works, but it takes too long. Is there a way do this job efficiently?
Assuming your data looks exactly like shown you could use dplyr and tidyr.
First convert your matrix (I name it my_mat) into a data.frame
my_mat %>%
as.data.frame() %>%
mutate(age=rownames(.)) %>%
pivot_longer(cols=-age, names_to="income", values_to="y") %>%
mutate(across(where(is.character), as.numeric))
returns
# A tibble: 9 x 3
age income y
<dbl> <dbl> <dbl>
1 18 12 0.15
2 18 24 0.12
3 18 36 0.11
4 24 12 0.12
5 24 24 0.6
6 24 36 0.2
7 36 12 0.02
8 36 24 0.16
9 36 36 0.16
This can be left joined with your data.frame df, so in one go:
my_mat %>%
as.data.frame() %>%
mutate(age=rownames(.)) %>%
pivot_longer(cols=-age, names_to="income", values_to="y") %>%
mutate(across(where(is.character), as.numeric)) %>%
left_join(df, ., by=c("age", "income"))
gives you
# A tibble: 4 x 4
id age income y
<dbl> <dbl> <dbl> <dbl>
1 1 18 12 0.15
2 2 24 24 0.6
3 3 36 12 0.02
4 4 18 24 0.12
Data
my_mat <- structure(c(0.15, 0.12, 0.02, 0.12, 0.6, 0.16, 0.11, 0.2, 0.16
), .Dim = c(3L, 3L), .Dimnames = list(c("18", "24", "36"), c("12",
"24", "36")))
df <- structure(list(id = c(1, 2, 3, 4), age = c(18, 24, 36, 18), income = c(12,
24, 12, 24)), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -4L), spec = structure(list(cols = list(
id = structure(list(), class = c("collector_double", "collector"
)), age = structure(list(), class = c("collector_double",
"collector")), income = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1), class = "col_spec"))

Error: Aesthetics must be either length 1 or the same as the data (2190): x

structure(list(Generation = c(1, 2, 3, 4), Type = structure(c(1L,
1L, 1L, 1L), .Label = c("Mean_1", "Mean_16", "Mean_4", "Mean_64",
"Mean_F"), class = "factor"), Heterozygosity = c(0.983622406756008,
0.984560911429398, 0.984607962819721, 0.983946353837307)), row.names = c(NA,
-4L), class = c("tbl_df", "tbl", "data.frame"))
So I have a data frame with the above structure, which looks like this:
# A tibble: 6 x 3
Generation Type Heterozygosity
<dbl> <fct> <dbl>
1 1 Mean_1 0.984
2 2 Mean_1 0.985
3 3 Mean_1 0.985
4 4 Mean_1 0.984
5 5 Mean_1 0.983
6 6 Mean_1 0.983
and I'm trying to plot a line graph with the following code:
ggplot(LongRed, aes(x = Generations, y = Heterozygosity)) +
geom_line(aes(color = Type))
and it gives me the error:
Error: Aesthetics must be either length 1 or the same as the data (2190): x
What should I do?
It is working with 'Generation' instead of 'Generations'
library(ggplot2)
ggplot(LongRed, aes(x = Generation, y = Heterozygosity)) +
geom_line(aes(color = Type))

Divide by last row in mutate in Tidyverse

So this is a relatively simple problem, I have a dataset as below
df <- structure(list(term = c("(Intercept)", "overall_quality", "overall_costs",
"wwpf"), estimate = c(0.388607224137536, 0.456477162621961, 0.485612564501229,
NA), std.error = c(0.499812263278414, 0.0987819420575201, 0.108042289289401,
NA), statistic = c(0.777506381273137, 4.62105879995918, 4.49465267438447,
NA), p.value = c(0.440597919486169, 0.0000279867005591494, 0.0000426773877613654,
NA), average = c(NA, 8.09615384615385, 7.86538461538461, 7.90384615384615
), Elasticity = c(NA, 3.69570933584318, 3.81952959386543, NA)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -4L))
I am trying to use below
df %>% mutate(Elasticity= average*estimate/average[nrow(df)])
Expected output: https://ibb.co/42ptLXx
basically, divide by last row value & since I am trying to incorporate this in function, I need the method to be dynamic & not hard coded value.
Please help !
We can use n() to return the index of last row for subsetting the value of that column
library(dplyr)
df %>%
mutate(Elasticity= average*estimate/average[n()])
If we need a function (using rlang_0.4.0), we can make use {{..}} for evaluation
f1 <- function(dat, col1, col2) {
dat %>%
mutate(Elasticity = {{col1}} * {{col2}}/{{col1}}[n()])
}
f1(df, average, estimate)
# A tibble: 4 x 7
# term estimate std.error statistic p.value average Elasticity
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 (Intercept) 0.389 0.500 0.778 0.441 NA NA
#2 overall_quality 0.456 0.0988 4.62 0.0000280 8.10 0.468
#3 overall_costs 0.486 0.108 4.49 0.0000427 7.87 0.483
#4 wwpf NA NA NA NA 7.90 NA

Plotting mean and st-dev from a dataset with multiple y values for an x value

My data is organized as such:
Distance r^2
0 1
0 0.9
0 0
0 0.8
0 1
1 0.5
1 0.45
1 0.56
1 1
2 0
2 0.9
3 0
3 0.1
3 0.2
3 0.3
...
300 1
300 0.8
I want to plot r^2 decay with distance, meaning I want to plot a mean value + st-dev for every unique distance value. So I should have 1 point at x=0, 1 point at x=1... but I have multiple x=0 values.
What is the best way to achieve this, given how the data is organized? I would like to do it in R if possible.
Thank you,
Adrian
Edit:
I have tried:
> dd <-structure(list(Distance = dist18, r.2 = a18[,13]), Names = c("Distance", "r^2"), class = "data.frame", row.names = c(NA, -15L))
> ggplot(dd, aes(x=Distance, y=r.2)) + stat_summary(fun.data="mean_sdl")
Error in data.frame(x = c(42L, 209L, 105L, 168L, 63L, 212L, 148L, 175L, : arguments imply differing number of rows: 126877, 15
> head(dist18)
[1] 42 209 105 168 63 212
> head(dd)
Distance r.2
1 42 0.89
2 209 0.92
3 105 0.91
4 168 0.81
5 63 0.88
6 212 0.88
Is this because my data is not sorted?
You can also plot your SD as an area around the mean similar to CI plotting (assuming temp is your data set)
library(data.table)
library(ggplot2)
temp <- setDT(temp)[, list(Mean = mean(r.2), SD = sd(r.2)), by = Distance]
ggplot(temp) + geom_point(aes(Distance, Mean)) + geom_ribbon(aes(x = Distance, y = Mean, ymin = (Mean - SD), ymax = (Mean + SD)), fill = "skyblue", alpha = 0.4)
Using dplyr it will be something like this:
df = data.frame(distance = rep(1:300, each = 10), r2 = runif(3000))
library(dplyr)
df_group = group_by(df, distance)
summarise(df_group, mn = mean(r2), s = sd(r2))Source: local data frame [300 x 3]
distance mn s
1 300 0.4977758 0.3565554
2 299 0.4295891 0.3281598
3 297 0.5346428 0.3424429
4 296 0.4623368 0.3163320
5 291 0.3224376 0.2103655
6 290 0.3916658 0.2115264
7 288 0.6147680 0.2953960
8 287 0.3405524 0.2032616
9 286 0.5690844 0.2458538
10 283 0.2901744 0.2835524
.. ... ... ...
Where df is the data.frame with your data, and distance and r2 the two column names.
this should work.
# Create a data frame like yours
df=data.frame(sample(50,size=300,replace=TRUE),runif(300))
colnames(df)=c('Distance','r^2')
#initialize empty data frame with columns x, mean and stdev
results=data.frame(x=numeric(0),mean=numeric(0),stdev=numeric(0))
count=1
for (i in 0:max(df$Distance)){
results[count,'x']=i
temp_mean=mean(df[which(df$Distance==i),'r^2'])
results[count,'mean']=temp_mean
temp_sd=sd(df[which(df$Distance==i),'r^2'])
results[count,'stdev']=temp_sd
count=count+1
}
# Plot your results
plot(results$x,results$mean,xlab='distance',ylab='r^2')
epsilon=0.02 #to add the little horizontal bar to the error bars
for (i in 1:nrow(results)){
up = results$mean[i] + results$stdev[i]
low = results$mean[i] - results$stdev[i]
segments(results$x[i],low , results$x[i], up)
segments(results$x[i]-epsilon, up , results$x[i]+epsilon, up)
segments(results$x[i]-epsilon, low , results$x[i]+epsilon, low)
}
Here's the result http://imgur.com/ED7PwD8
If you want to plot mean and +/- 1 sd for each point, the ggplot function makes this easy. With the test data
dd<-structure(list(Distance = c(0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L,
2L, 2L, 3L, 3L, 3L, 3L), r.2 = c(1, 0.9, 0, 0.8, 1, 0.5, 0.45,
0.56, 1, 0, 0.9, 0, 0.1, 0.2, 0.3)), .Names = c("Distance", "r.2"
), class = "data.frame", row.names = c(NA, -15L))
you can just run
library(Hmisc)
ggplot(dd, aes(x=Distance, y=r.2)) +
stat_summary(fun.data="mean_sdl", mult=1)
which produces
I tried with your real data and got
real <- read.table("http://pelinfamily.ca/bio/GDR-18_conc.ld", header=F)
dd <- data.frame(Distance=real[,2]-real[,1], r.2=real[,13])
ggplot(dd, aes(x=Distance, y=r.2)) +
stat_summary(fun.data="mean_sdl", mult=1, geom="ribbon", alpha=.4) +
stat_summary(fun.data="mean_sdl", mult=1, geom="line")

Resources