Adding columns and newly calculated values in a data frame (R)

I have already searched the forum for hours (really) and am starting to get the faint feeling that I am slowly going crazy, especially as this looks to me like a really easily solvable problem.
What do I want to do?
Basically, I want to simulate clinical data. Specifically, for each patient (column 1: id) I want an arbitrary score (column 3: score) that depends on the assigned treatment group (column 2: group).
set.seed(123)
# Number of subjects in study
n_patients = 1000
# Score: Mean and SDs
mean_verum = 70
sd_verum = 20
mean_placebo = 40
sd_placebo = 20
# Allocating to Treatment groups:
data = data.frame(id = as.character(1:n_patients))
data$group[1:(n_patients/2)] <- "placebo"
data$group[(n_patients/2+1):n_patients] <- "verum"
# Attach Score for each treatment group
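# Draw n_patients values per branch; with n = 100 the draws would be recycled across the 1000 rows
data$score <- ifelse(data$group == "verum", rnorm(n=n_patients, mean=mean_verum, sd=sd_verum), rnorm(n=n_patients, mean=mean_placebo, sd=sd_placebo))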
So far, so easy. Now I wish to 1) calculate the probability of an event happening (logit function) depending on the score, and then 2) actually assign an event depending on that probability (rbinom).
I want to do this for n different probabilities/events. This is the code I've used so far:
Calculate probabilities:
a = -1
b = 0.01
p1 = 1-exp(a+b*data$score)/(1+exp(a+b*data$score))
data$p_AE1 <- p1
a = -0.5
b = 0.01
p1 = 1-exp(a+b*data$score)/(1+exp(a+b*data$score))
data$p_AE2 <- p1
…
Assign Events:
# The probability columns were created above as p_AE1, p_AE2, ...
data$Abbruch_AE1 <- rbinom(n_patients, 1, data$p_AE1)
data$Abbruch_AE2 <- rbinom(n_patients, 1, data$p_AE2)
…
Obviously, this is really inefficient, as I would like to easily scale this up or down depending on how many probabilities/events I want to simulate.
The problem is that I simply do not see how I can simultaneously a) generate a new, single column in the data frame for each set of values, b) apply the function that assigns the probabilities/events, and c) do this for a number n of different formulas, each with its own a and b.
I am sure the solution to this problem is a simple one. What I didn't manage was to do all these things at once, which is where I would like this to be eventually. I have played around with for loops, all to no avail.
Any help would be greatly appreciated!
This is what my data frame looks like:
structure(list(id = structure(1:3, .Label = c("1", "2", "3"), class = "factor"),
group = c("placebo", "placebo", "placebo"), score = c(25.791868726014,
45.1376741831306, 35.0661624307525), p_AE1 = c(0.677450814266315,
0.633816117436442, 0.656861351663365), p_AE2 = c(0.560226492151216,
0.512153420188678, 0.537265362130761), p_AE3 = c(0.435875409622676,
0.389033483248856, 0.413221988111604), p_AE4 = c(0.319098312196655,
0.278608032377073, 0.299294085148527), p_AE5 = c(0.221332386680766,
0.189789774534235, 0.205762225373345), p_AE6 = c(0.147051201194953,
0.124403316086538, 0.135795233451071), p_AE7 = c(0.0946686004658072,
0.0793379289917946, 0.0870131973838217), p_AE8 = c(0.0596409872667201,
0.0496714832182721, 0.0546471270895262), AbbruchAE1 = c(1L,
1L, 1L), AbbruchAE2 = c(1L, 1L, 0L), AbbruchAE3 = c(0L, 0L,
0L), AbbruchAE4 = c(0L, 1L, 0L), AbbruchAE5 = c(1L, 0L, 0L
), AbbruchAE6 = c(1L, 0L, 0L), AbbruchAE7 = c(0L, 0L, 0L),
AbbruchAE8 = c(0L, 0L, 0L)), .Names = c("id", "group", "score", "p_AE1", "p_AE2", "p_AE3", "p_AE4", "p_AE5", "p_AE6", "p_AE7", "p_AE8", "AbbruchAE1", "AbbruchAE2", "AbbruchAE3", "AbbruchAE4", "AbbruchAE5", "AbbruchAE6", "AbbruchAE7", "AbbruchAE8"), row.names = c(NA, 3L), class = "data.frame")
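One possible way to do a), b) and c) in a single pass is to keep the a/b pairs in a small parameter table and loop over its rows, building the column names with paste0(). This is only a sketch: the table name params and the third a/b pair are made up for illustration, and plogis() is used as a shorthand for exp(x)/(1+exp(x)).

```r
set.seed(123)
n_patients <- 1000

# Rebuild the simulated data from the question
data <- data.frame(id = as.character(1:n_patients),
                   group = rep(c("placebo", "verum"), each = n_patients / 2))
data$score <- ifelse(data$group == "verum",
                     rnorm(n_patients, mean = 70, sd = 20),
                     rnorm(n_patients, mean = 40, sd = 20))

# Hypothetical parameter table: one row per event (third row is invented)
params <- data.frame(a = c(-1, -0.5, 0), b = c(0.01, 0.01, 0.01))

for (i in seq_len(nrow(params))) {
  lin <- params$a[i] + params$b[i] * data$score
  p <- 1 - plogis(lin)  # same as 1 - exp(lin) / (1 + exp(lin))
  data[[paste0("p_AE", i)]] <- p                               # probability column
  data[[paste0("Abbruch_AE", i)]] <- rbinom(n_patients, 1, p)  # event column
}

head(data)
```

Scaling up or down then only means adding or removing rows in params.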

Related

Regression over time specific year as weight

I am doing a regression with panel data of EU countries over time with observations from 2007-16. I want to use the observation for 2007 for each specific country as the weight. Is there a simple way to do this?
This is essentially the regression I run, but I don't think the weighting is working as I intend it to.
lm(log(POP25) ~ log(EMPLOY25), weights = POP25, data = data)
structure(list(...1 = 1:6, TIME = 2007:2012, NUTS_ID = c("AT",
"AT", "AT", "AT", "AT", "AT"), NUMBER = c(1L, 1L, 1L, 1L, 1L,
1L), POP15 = c(5529.1, 5549.3, 5558.5, 5572.1, 5601.1, 5620.8
), POP20 = c(5047.1, 5063.2, 5072.6, 5090, 5127.1, 5151.9), POP25 = c(4544,
4560.7, 4571.3, 4587.8, 4621.5, 4639), EMPLOY15 = c(3863.6, 3928.7,
3909.3, 3943.9, 3982.3, 4013.4), EMPLOY20 = c(3676.2, 3737, 3723.8,
3761.9, 3802.3, 3835), EMPLOY25 = c(3333.5, 3390.4, 3384.7, 3424.6,
3454.4, 3486.4)), row.names = c(NA, 6L), class = "data.frame")
You are right - this is not doing what you expect it to. The reason is that you are supplying POP25 as the weight but you haven't yet made it explicit that you only want the POP25 value from 2007.
A weights vector needs to be the same length as the dependent and independent variables. The easiest way to do this is by creating a weights column in the table, where the value is the POP25 value for each NUTS_ID in the year 2007:
library(dplyr)
data <- data |>
group_by(NUTS_ID) |>
mutate(weights = POP25[TIME==2007])
You can then supply this as the weights vector:
lm(log(POP25) ~ log(EMPLOY25), weights = weights, data = data)
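For anyone without dplyr, the same idea can be sketched in base R with match(). The toy panel below uses partly made-up numbers (the "BE" rows and the column name w2007 are invented for illustration), so treat it as a sketch of the lookup rather than a result.

```r
# Toy panel: the AT values come from the question, the BE rows are made up
panel <- data.frame(
  NUTS_ID  = rep(c("AT", "BE"), each = 3),
  TIME     = rep(2007:2009, times = 2),
  POP25    = c(4544, 4560.7, 4571.3, 6000, 6050, 6100),
  EMPLOY25 = c(3333.5, 3390.4, 3384.7, 4100, 4150, 4120)
)

# One row per country holding its 2007 value
base <- panel[panel$TIME == 2007, c("NUTS_ID", "POP25")]

# Look up each row's country in that table, so every year gets the 2007 weight
panel$w2007 <- base$POP25[match(panel$NUTS_ID, base$NUTS_ID)]

fit <- lm(log(POP25) ~ log(EMPLOY25), weights = w2007, data = panel)
coef(fit)
```

The weight column has one value per row, which is what lm() needs, and it is constant within each country.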

Creating a frequency histogram using ggplot2

Hi, I am relatively new to R. I am struggling with what seems like it should be a relatively simple task: I am trying to make a frequency histogram using ggplot2 from a subset of data from a larger data frame.
Here is an example of the data structure, as in the attached picture:
https://i.stack.imgur.com/HIwQv.png
The data is from a survey where 0 means not selected and 1 means selected. These are numeric in the original dataset. I want a histogram of the frequency with which each variable was selected, with the column variables on the x-axis and frequency counts on the y-axis. I have various subsets like this within a data frame, and I would like each subset to have its own graph.
I first subset the columns of interest:
new_dataset <- subset(df, select = c(WAB_R, WAB_B, BDAE, PNT))
When I checked the class, it was data.frame and no longer numeric.
I tried to use as.numeric to convert it back to numeric, but with no luck.
I could use some guidance on how to structure the data to then obtain a histogram.
Thanks, Carla
Maybe try this approach using tidyverse functions. You have to reshape to long format after selecting the desired variables. Here is the code, using ggplot2 for the final plot:
library(tidyverse)
#Code 1
df %>% select(c(WAB_R, WAB_B, BDAE, PNT)) %>%
pivot_longer(everything()) %>%
ggplot(aes(x=value))+
geom_histogram(stat = 'count',aes(fill=name),
position = position_dodge2(0.9,preserve = 'single'))+
labs(fill='Variable')
Or this:
#Code 2
df %>% select(c(WAB_R, WAB_B, BDAE, PNT)) %>%
pivot_longer(everything()) %>%
ggplot(aes(x=factor(value)))+
geom_histogram(stat = 'count',aes(fill=name),
position = position_dodge2(0.9,preserve = 'single'))+
labs(fill='Variable')+xlab('value')
Some data used:
#Data
df <- structure(list(ID = 1:4, WAB_R = c(0L, 1L, 0L, 1L), WAB_B = c(0L,
1L, 0L, 0L), BDAE = c(0L, 0L, 0L, 1L), PNT = c(0L, 0L, 0L, 0L
)), class = "data.frame", row.names = c(NA, -4L))
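If the goal is simply one bar per variable showing how often it was selected, a base-R sketch may also be enough: because every column is coded 0/1, the number of selections is the column sum. The object name counts is just for illustration.

```r
# Same sample data as above, written out directly
df <- data.frame(ID = 1:4,
                 WAB_R = c(0L, 1L, 0L, 1L),
                 WAB_B = c(0L, 1L, 0L, 0L),
                 BDAE  = c(0L, 0L, 0L, 1L),
                 PNT   = c(0L, 0L, 0L, 0L))

# With 0/1 coding, the column sum is the number of times each item was selected
counts <- colSums(df[, c("WAB_R", "WAB_B", "BDAE", "PNT")])

# Variables on the x-axis, selection counts on the y-axis
barplot(counts, xlab = "Variable", ylab = "Times selected")
```

The same two lines can be repeated for each subset of columns to get one plot per subset.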

How to use ifelse and paste functions

I am learning the use of the ifelse function from Zuur et al (2009) A Beginner's Guide to R. In one exercise, there is a data frame called Owls which contains data about 27 nests and two nights of observations.
structure(list(Nest = structure(c(1L, 1L, 1L, 1L), .Label = "AutavauxTV", class = "factor"),
FoodTreatment = structure(c(1L, 2L, 1L, 1L), .Label = c("Deprived",
"Satiated"), class = "factor"), SexParent = structure(c(1L,
1L, 1L, 1L), .Label = "Male", class = "factor"), ArrivalTime = c(22.25,
22.38, 22.53, 22.56), SiblingNegotiation = c(4L, 0L, 2L,
2L), BroodSize = c(5L, 5L, 5L, 5L), NegPerChick = c(0.8,
0, 0.4, 0.4)), .Names = c("Nest", "FoodTreatment", "SexParent",
"ArrivalTime", "SiblingNegotiation", "BroodSize", "NegPerChick"
), row.names = c(NA, 4L), class = "data.frame")
The two nights differed in their feeding regime (satiated or deprived), which is indicated in the FoodTreatment variable. The task is to use the ifelse and paste functions to make a new categorical variable that defines observations from a single night at a particular nest.
In the solutions the following code is suggested:
Owls <- read.table(file = "Owls.txt", header = TRUE, dec = ".")
ifelse(Owls$FoodTreatment == "Satiated", Owls$NestNight <- paste(Owls$Nest, "1",sep = "_"), Owls$NestNight <- paste(Owls$Nest, "2",sep = "_"))
and apparently it creates a new variable whose values end in either "_1" or "_2".
However, when I call the original data frame, all "_1" endings in the NestNight variable have disappeared and turned into "_2".
Why does this happen? Did the authors miss something from the code or it's me who is not getting it?
Many thanks
EDIT: Sorry, I wanted to give a reproducible example by copying my data using dput but it did not work. If you can let me know how I can correct it so that it appears properly, I'd be grateful too!
Solution
If you do the assignment outside the ifelse structure, it works:
Owls$NestNight <- ifelse(Owls$FoodTreatment == "Satiated",
paste(Owls$Nest, "1",sep = ""),
paste(Owls$Nest, "2",sep = ""))
Explanation
What happens in your case is the same as if you executed the following two lines:
Owls$NestNight <- paste(Owls$Nest, "1",sep = "")
Owls$NestNight <- paste(Owls$Nest, "2",sep = "")
You first assign paste(Owls$Nest, "1",sep = "") to Owls$NestNight and then you reassign paste(Owls$Nest, "2",sep = "") to it. The ifelse itself is not affected by this and still returns the correctly mixed vector, but you don't assign its result to any variable.
Maybe it is clearer if you test this simple code:
c(a <- 1:5, a <- 6:10) #c is your ifelse, a is your Owls$NestNight
a #[1] 6 7 8 9 10
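The same effect can be reproduced on a toy vector that mimics the Owls columns. The names food, nest, inside and res are made up for this sketch; the point is only that both branches are evaluated in full, so the variable assigned inside ends up with whatever the last branch wrote, while the return value of ifelse is the properly mixed one.

```r
# Toy stand-ins for Owls$FoodTreatment and Owls$Nest
food <- c("Deprived", "Satiated", "Deprived")
nest <- c("AutavauxTV", "AutavauxTV", "AutavauxTV")

# Assignment inside ifelse: the "yes" branch runs first, then the "no" branch
# overwrites 'inside' completely
res <- ifelse(food == "Satiated",
              inside <- paste(nest, "1", sep = "_"),
              inside <- paste(nest, "2", sep = "_"))

inside  # every element ends in "_2"
res     # mixed endings, picked element by element
```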

Inverse probability weights in r

I'm trying to apply inverse probability weights to a regression, but lm() only uses analytic weights. This is part of a replication I'm working on where the original author uses pweight in Stata, and I'm trying to replicate it in R. The analytic weights give lower standard errors, which causes problems with the significance of some of my variables.
I've tried looking at the survey package, but am not sure how to prepare a survey object for use with svyglm(). Is this the approach I want, or is there an easier way to apply inverse probability weights?
dput:
data <- structure(list(lexptot = c(9.1595012302023, 9.86330744180814,
8.92372556833205, 8.58202430280175, 10.1133857229336), progvillm = c(1L,
1L, 1L, 1L, 0L), sexhead = c(1L, 1L, 0L, 1L, 1L), agehead = c(79L,
43L, 52L, 48L, 35L), weight = c(1.04273509979248, 1.01139605045319,
1.01139605045319, 1.01139605045319, 0.76305216550827)), .Names = c("lexptot",
"progvillm", "sexhead", "agehead", "weight"), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -5L))
Linear Model (using analytic weights)
prog.lm <- lm(lexptot ~ progvillm + sexhead + agehead, data = data, weights = weight)
summary(prog.lm)
Alright, so I figured it out and thought I would update the post in case others were trying to figure it out. It's actually pretty straightforward.
library(survey)
# Give every row its own id so that each observation is its own sampling unit
data$X <- 1:nrow(data)
des1 <- svydesign(id = ~X, weights = ~weight, data = data)
prog.lm <- svyglm(lexptot ~ progvillm + sexhead + agehead, design=des1)
summary(prog.lm)
Standard errors are now correct.

Color data points based on sample classification

I have created a pairwise scatterplot showing the relationship between genes (columns in the data frame) across multiple samples (rows in the data frame). The samples belong to two distinct groups, "A" and "B". Since one dot in the plot represents one sample, I need to color the data points (dots) according to group with two different colors, say group A in "green" and group B in "red". Is it possible to do that?
Any kind of help will be appreciated.
plot(DF[1:6], pch = 21) #command used for plotting, DF is data frame
Sample Data Frame Example:
CBX3 PSPH ATP2C1 SNX10 MMD ATP13A3
B 10.589844 6.842970 8.084550 8.475023 9.202490 10.403811
A 10.174385 5.517944 7.736994 9.094834 9.253766 10.133408
B 10.202084 5.669137 7.392141 7.522270 7.830969 9.123178
B 10.893231 6.630709 7.601690 7.894177 8.979142 9.791841
B 10.071038 5.091222 7.032585 8.305581 7.903737 8.994821
A 10.005002 4.708631 7.927246 7.292527 8.257853 10.054630
B 10.028055 5.080944 6.421961 7.616856 8.287496 9.642294
A 10.144115 6.626483 7.686203 7.970934 7.919615 9.475175
A 10.675386 6.874047 7.900560 7.605519 8.585158 8.858613
A 9.855063 5.164399 6.847923 8.072608 8.221344 9.077744
A 10.994228 6.545318 8.606128 8.426329 8.787876 9.857079
A 10.501266 6.677360 7.787168 8.444976 8.928174 9.542558
GGally has a good function for this as well.
library(GGally)
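# dd is the sample data frame from the lattice answer below; recent GGally versions take the grouping through mapping = aes(...)
ggpairs(dd, mapping = aes(color = CLASS), columns = 2:ncol(dd))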
It might not be that easy to do with base graphics. You could easily do this with lattice. With this sample data.frame
dd<-structure(list(CLASS = structure(c(2L, 1L, 2L, 2L, 2L, 1L, 2L,
1L, 1L, 1L, 1L, 1L), .Label = c("A", "B"), class = "factor"),
CBX3 = c(10.589844, 10.174385, 10.202084, 10.893231, 10.071038,
10.005002, 10.028055, 10.144115, 10.675386, 9.855063, 10.994228,
10.501266), PSPH = c(6.84297, 5.517944, 5.669137, 6.630709,
5.091222, 4.708631, 5.080944, 6.626483, 6.874047, 5.164399,
6.545318, 6.67736), ATP2C1 = c(8.08455, 7.736994, 7.392141,
7.60169, 7.032585, 7.927246, 6.421961, 7.686203, 7.90056,
6.847923, 8.606128, 7.787168), SNX10 = c(8.475023, 9.094834,
7.52227, 7.894177, 8.305581, 7.292527, 7.616856, 7.970934,
7.605519, 8.072608, 8.426329, 8.444976), MMD = c(9.20249,
9.253766, 7.830969, 8.979142, 7.903737, 8.257853, 8.287496,
7.919615, 8.585158, 8.221344, 8.787876, 8.928174), ATP13A3 = c(10.403811,
10.133408, 9.123178, 9.791841, 8.994821, 10.05463, 9.642294,
9.475175, 8.858613, 9.077744, 9.857079, 9.542558)), .Names = c("CLASS",
"CBX3", "PSPH", "ATP2C1", "SNX10", "MMD", "ATP13A3"), class = "data.frame", row.names = c(NA, -12L))
you can do
library(lattice)
splom(~dd[,-1], groups=dd$CLASS)
to get a scatterplot matrix with the points colored by group.
You can add color to the points by specifying the argument col to plot:
# read.table splits on any whitespace, so this works whether the columns are separated by spaces or tabs
DF <- read.table(header = TRUE, text =
"category CBX3 PSPH ATP2C1 SNX10 MMD ATP13A3
B 10.589844 6.842970 8.084550 8.475023 9.202490 10.403811
A 10.174385 5.517944 7.736994 9.094834 9.253766 10.133408
B 10.202084 5.669137 7.392141 7.522270 7.830969 9.123178
B 10.893231 6.630709 7.601690 7.894177 8.979142 9.791841
B 10.071038 5.091222 7.032585 8.305581 7.903737 8.994821
A 10.005002 4.708631 7.927246 7.292527 8.257853 10.054630
B 10.028055 5.080944 6.421961 7.616856 8.287496 9.642294
A 10.144115 6.626483 7.686203 7.970934 7.919615 9.475175
A 10.675386 6.874047 7.900560 7.605519 8.585158 8.858613
A 9.855063 5.164399 6.847923 8.072608 8.221344 9.077744
A 10.994228 6.545318 8.606128 8.426329 8.787876 9.857079
A 10.501266 6.677360 7.787168 8.444976 8.928174 9.542558")
# Group A in green and group B in red, as asked in the question
plot(DF[2:7],col = ifelse(DF$category == 'A','green','red'))
A list of valid color values can be obtained by calling colors(). Vectors with a gradient of colors can be created via rainbow(), and just for fun, I use this little function for choosing pretty colors when making a figure.
(Edited per suggestions from @MrFlick)
#' @param n The number of colors to be selected
# Note: alpha and term are accepted but not used in this version
colorchoose <- function (n = 1, alpha, term = FALSE)
{
    cols <- colors()
    mod <- ceiling(sqrt(length(cols)))
    # Empty square canvas; each color gets one grid cell
    plot(xlab = "", ylab = "", main = "click for color name",
         c(0, mod), c(0, mod), type = "n", axes = FALSE)
    s <- seq_along(cols)
    dev.hold()
    points(s %% mod, s %/% mod, col = cols, pch = 15, cex = 2.4)
    dev.flush()
    # Wait for n mouse clicks and map each click back to a color index
    p <- locator(n)
    return(cols[round(p$y) * mod + round(p$x)])
}
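With more than two groups, nested ifelse calls get awkward, and a named lookup vector may read better. The sketch below uses a tiny made-up data set (toy, grp and palette_by_group are all hypothetical names), so it shows only the indexing idea, not the poster's data.

```r
# Hypothetical group labels and made-up measurements
grp <- c("B", "A", "B", "A")
toy <- data.frame(x = c(1, 2, 3, 4), y = c(2, 1, 4, 3), z = c(4, 3, 2, 1))

# One color per group; indexing by the labels returns one color per row
palette_by_group <- c(A = "green", B = "red")
point_cols <- unname(palette_by_group[grp])

# plot() on a data frame with several columns draws the pairwise scatterplots
plot(toy, col = point_cols, pch = 19)
```

Adding a third group then only needs one more entry in palette_by_group.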
