Creating new column in one dataframe based on column from another dataframe - r

I have a dataframe as follows:
dput(head(modellingdata, n = 5))
structure(list(X = 1:5, heading = c(2, 0.5, 2, 1.5, 2), StartFrame = c(27L,
28L, 24L, 31L, 35L), StartYaw = c(0.0719580421911421, 0.0595571128205128,
0.0645337707459207, 0.0717132524475524, 0.066818187062937), FirstSteeringTime = c(0.433389999999999,
0.449999999999989, 0.383199999999988, 0.499899999999997, 0.566800000000001
), pNum = c(1L, 1L, 1L, 1L, 1L), EarlyResponses = c(FALSE, FALSE,
FALSE, FALSE, FALSE), PeakFrame = c(33L, 34L, 32L, 38L, 46L),
PeakYaw = c(0.201025641025641, 0.140734297249417, 0.187890472913753,
0.154032698135198, 0.23129368951049), PeakSteeringTime = c(0.533459999999998,
0.550099999999986, 0.516700000000014, 0.616600000000005,
0.750100000000003), heading_radians = c(0.0349065850398866,
0.00872664625997165, 0.0349065850398866, 0.0261799387799149,
0.0349065850398866), error_rate = c(2.86537083478438, 11.459301348013,
2.86537083478438, 3.82015500141104, 2.86537083478438), error_growth = c(0.34899496702501,
0.0872653549837393, 0.34899496702501, 0.261769483078731,
0.34899496702501)), row.names = c(NA, 5L), class = "data.frame")
Each row of my df is a trial. Overall, I have 3037 rows (trials). pNum denotes the participant number - I have 19 participants overall.
I also have a vector of intercepts, one for each participant:
dput(head(heading_intercept, n = 19))
c(0.432448612242496, 0.446371667203615, 0.420854119185846, 0.366763485495426,
0.355619586392715, 0.381658477093055, 0.512552445721875, 0.317210665852951,
0.358345666677048, 0.421441965798511, 0.477135103908373, 0.325512003640487,
0.5542144068862, 0.454182438162137, 0.333993738757344, 0.424179318544432,
0.272486598058728, 0.37014581658542, 0.397112817663261)
What I want to do is create a new column "intercept" in my modellingdata dataframe. If pNum is 1, I want to select the first intercept in heading_intercept and input that value for every row where pNum is 1. When pNum is 2, I want to input the second intercept value into every row where pNum is 2. And so on...
I have tried this:
for (i in c(1:19)){
if (modellingdata$pNum == i){
modellingdata$intercept <- c(heading_intercept[i])
}
}
However, this just inputs the first heading_intercept value for every row and every pNum. Does anybody have any ideas? Any help is appreciated!

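Since heading_intercept is ordered by participant number, you can index it directly with pNum:
modellingdata$intercept <- heading_intercept[modellingdata$pNum]
Or, with minimal modification of your current loop:
modellingdata$intercept <- NA_real_  # numeric placeholder; rows without a match stay NA
for (i in c(1:19)){
  rows <- modellingdata$pNum == i    # logical vector marking this participant's trials
  if (any(rows)) {
    modellingdata$intercept[rows] <- heading_intercept[i]
  }
}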
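To see why direct indexing works, here is a minimal sketch on made-up data (the three intercepts, the five trials and the ids vector are hypothetical, not the asker's values). Indexing a vector with a vector of positions returns one element per position, so a single assignment fills the whole column. The original loop fails because if() expects a single TRUE/FALSE: with a vector condition, older R versions used only the first element (with a warning), and R 4.2.0 and later raise an error.

```r
# Hypothetical lookup vector: one intercept per participant, in order 1, 2, 3
heading_intercept <- c(0.43, 0.45, 0.42)

# Hypothetical trials: participant 1 twice, participant 2 once, participant 3 twice
modellingdata <- data.frame(pNum = c(1L, 1L, 2L, 3L, 3L))

# Each pNum is used as a position in heading_intercept
modellingdata$intercept <- heading_intercept[modellingdata$pNum]

# If the participant numbers were not 1..n, match() would translate them
# into positions first (ids is an assumed vector of participant labels)
ids <- c(1L, 2L, 3L)
modellingdata$intercept2 <- heading_intercept[match(modellingdata$pNum, ids)]

print(modellingdata)
```

The match() variant is only needed when pNum cannot be used as a position directly, for example when participant numbers have gaps.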


Grouped bar chart in R for multiple filter and select

Following is my dataset:
Result  course1  course2  course3
pass    15       17       18
pass    12       14       19
Fail    9        13       3
Fail    3        2        0
pass    14       11       20
Fail    5        0        7
I want to plot a grouped bar graph. I am able to plot the following graphs, but I want both results in the same graph.
library(dplyr)         # for %>%, filter, summarize
library(RColorBrewer)  # for brewer.pal
par(mfrow=c(1,1))
options(scipen=999)
coul <- brewer.pal(3, "Set2")
# Bar graph for passed courses
result_pass <- data %>% filter(Result=='pass') %>% summarize(c1_tot=sum(course1),
    c2_tot = sum(course2), c3_tot = sum(course3) )
col_sum <- colSums(result_pass[,1:3])
barplot(colSums(result_pass[,1:3]), xlab = "Courses", ylab = "Total Marks", col = coul, ylim=range(pretty(c(0, col_sum))), main = "Passed courses ")
# Bar graph for Failed courses
result_fail <- data %>% filter(Result=='Fail') %>% summarize(c1_tot=sum(course1),
    c2_tot = sum(course2), c3_tot = sum(course3) )
col_sum <- colSums(result_fail[,1:3])
barplot(colSums(result_fail[,1:3]), xlab = "Courses", ylab = "Total Marks", col = coul, ylim=range(pretty(c(0, col_sum))), main = "Failed courses ")
Any suggestions on how I can merge the two plots above into one grouped bar graph for passed and failed courses?
It's probably easier than you think. Just pass the data directly to aggregate and use the formula . ~ Result, where . means all other columns. Removing the first column with [-1] and coercing with as.matrix (because barplot eats matrices) yields exactly the format we need for barplot.
This is the basic code:
barplot(as.matrix(aggregate(. ~ Result, data, sum)[-1]), beside=TRUE)
And here with some visual enhancements:
barplot(as.matrix(aggregate(. ~ Result, data, sum)[-1]), beside=TRUE, ylim=c(0, 70),
col=hcl.colors(2, palette='viridis'), legend.text=sort(unique(data$Result)),
names.arg=names(data)[-1], main='Here could be your title',
args.legend=list(x='topleft', cex=.9))
box()
Data:
data <- structure(list(Result = c("pass", "pass", "Fail", "Fail", "pass",
"Fail"), course1 = c(15L, 12L, 9L, 3L, 14L, 5L), course2 = c(17L,
14L, 13L, 2L, 11L, 0L), course3 = c(18L, 19L, 3L, 0L, 20L, 7L
)), class = "data.frame", row.names = c(NA, -6L))
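To make the one-liner easier to follow, the same steps can be run one at a time. This is only a sketch of the approach above on the same six rows; the intermediate names agg and m, and the step of labelling the matrix rows, are additions for illustration.

```r
data <- data.frame(
  Result  = c("pass", "pass", "Fail", "Fail", "pass", "Fail"),
  course1 = c(15L, 12L, 9L, 3L, 14L, 5L),
  course2 = c(17L, 14L, 13L, 2L, 11L, 0L),
  course3 = c(18L, 19L, 3L, 0L, 20L, 7L)
)

# Step 1: sum every course column within each Result group
agg <- aggregate(. ~ Result, data, sum)

# Step 2: drop the Result column and convert the totals to a matrix
m <- as.matrix(agg[-1])

# Step 3 (illustrative addition): label the rows so each group stays identifiable
rownames(m) <- agg$Result
print(m)

# With beside = TRUE, each matrix column becomes a cluster of bars,
# one bar per matrix row (here: Fail and pass)
barplot(m, beside = TRUE, legend.text = rownames(m))
```

Labelling the rows lets the legend be taken from the matrix itself instead of being rebuilt from the raw data.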

Plot classified by categories with column-names (R)

I've got a dataframe with the following structure:
D1A1 D1A2 D1A3 D1B1 D1B2 D1B3 D2A1 D2A2 D2A3 D2B1 D2B2 D2B3
10 12 15 40 39 27 11 13 14 33 31 32
The actual dataframe is larger (40 observations/columns). I would like to create any kind of plot that shows all the numerical information together, with the data grouped by their column classification (D1A, D1B, D2A, D2B) as follows:
D1A1+D1A2+D1A3 || D1B1+D1B2+D1B3 || D2A1+D2A2+D2A3 || D2B1+D2B2+D2B3
As I feel extremely lost, any suggestion would be appreciated.
We can split the dataset by the substring of the column names (the name without its trailing digits), loop over the resulting list to get the rowSums, and pass the result to barplot:
out <- sapply(split.default(df1, sub("\\d+$", "", names(df1))),
rowSums, na.rm = TRUE)
barplot(out)
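The intermediate steps of that one-liner can be inspected separately. The sketch below uses the one-row df1 from the data section; the names grp and pieces are introduced only for illustration.

```r
df1 <- data.frame(D1A1 = 10L, D1A2 = 12L, D1A3 = 15L,
                  D1B1 = 40L, D1B2 = 39L, D1B3 = 27L,
                  D2A1 = 11L, D2A2 = 13L, D2A3 = 14L,
                  D2B1 = 33L, D2B2 = 31L, D2B3 = 32L)

# Strip the trailing digits from each column name to get its group label
grp <- sub("\\d+$", "", names(df1))
print(unique(grp))

# split.default treats the data frame as a list of columns,
# so this yields one small data frame per group label
pieces <- split.default(df1, grp)

# Row sums within each group; with a single row this is one total per group
out <- sapply(pieces, rowSums, na.rm = TRUE)
print(out)
```

Using split.default rather than split matters here: split on a data frame would divide the rows, whereas split.default divides the columns.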
If there are more rows, we can use the tidyverse: reshape into 'long' format with pivot_longer, making use of the pattern in the column names, i.e. capturing the substring of each column name without the digits at the end. This creates 4 columns. Then we use summarise with across to get the sum of each column, reshape to long format once more, and draw a bar plot with geom_col.
library(dplyr)
library(tidyr)
library(ggplot2)
df2 %>%
  pivot_longer(cols = everything(), names_to = ".value",
               names_pattern = "(.*)\\d+$") %>%
  # a lambda is used because passing extra arguments through across() is deprecated
  summarise(across(everything(), ~ sum(.x, na.rm = TRUE))) %>%
  pivot_longer(cols = everything()) %>%
  ggplot(aes(x = name, y = value, fill = name)) +
  geom_col()
If we are interested in the spread of the data, a boxplot can help. Here, we don't summarise, and we use geom_boxplot instead of geom_col:
df2 %>%
pivot_longer(cols = everything(), names_to = ".value",
names_pattern = "(.*)\\d+$") %>%
pivot_longer(cols = everything()) %>%
ggplot(aes(x = name, y = value, fill = name)) +
geom_boxplot()
data
df1 <- structure(list(D1A1 = 10L, D1A2 = 12L, D1A3 = 15L, D1B1 = 40L,
D1B2 = 39L, D1B3 = 27L, D2A1 = 11L, D2A2 = 13L, D2A3 = 14L,
D2B1 = 33L, D2B2 = 31L, D2B3 = 32L), class = "data.frame", row.names = c(NA,
-1L))
df2 <- structure(list(D1A1 = c(10L, 15L), D1A2 = c(12L, 23L), D1A3 = 15:14,
D1B1 = c(40L, 23L), D1B2 = c(39L, 14L), D1B3 = c(27L, 22L
), D2A1 = 11:10, D2A2 = c(13L, 15L), D2A3 = c(14L, 17L),
D2B1 = c(33L, 35L), D2B2 = c(31L, 35L), D2B3 = c(32L, 32L
)), class = "data.frame", row.names = c(NA, -2L))

Conditionally count rows based on month and value of a column

I have a dataset containing Airbnb listings. I want to count the number of listings for each host_id per month, depending on whether they are Entire home or Shared home. Therefore, I assume I need two additional columns holding the count for each row (tot_EH and tot_SH).
I've posted an image below to show what the dataset looks like and the desired output (I deleted some columns that are not relevant). For now I used just one host_id, but in reality there are many different ones.
Marked the new columns in red and entered the desired output. Can't figure out how to proceed. Would really appreciate some help!
Got help from a colleague and this worked:
library(dplyr)
df <- df %>%
  group_by(host_id, last_scraped) %>% # group data by host and month
  mutate(
    # for each host/month combination: count the number of unique listing IDs
    count_listings_in_data = length(unique(id)),
    # same count, restricted to listings whose room type is "Shared home"
    count_shared_homes = length(unique(id[which(room_type_NV == "Shared home")])),
    # same count, restricted to listings whose room type is "Entire home"
    count_entire_homes = length(unique(id[which(room_type_NV == "Entire home")])))
One data.table approach, assuming your data are in a data.frame named df:
library(data.table)
setDT(df)
# the labels must match the factor levels exactly ("Entire home", "Shared home")
df[room_type_NV == "Entire home" , tot_EH := .N, by=.(date, host_id)]
df[room_type_NV == "Shared home" , tot_SH := .N, by=.(date, host_id)]
Base R Solution:
df$grouping_var <- paste(df$host_id, as.Date(df$date, "%m-%Y"), sep = "_")
count_df <- data.frame(
  do.call("rbind", lapply(split(df, df$grouping_var), function(x) {
    tmp <- data.frame(t(tapply(x$room_type_NV, x$room_type_NV, length)))
    cbind(x, data.frame(tmp[rep(seq_len(nrow(tmp)), nrow(x)), ], row.names = NULL))
  })),
  row.names = NULL
)
Data:
structure(list(id = c(2, 1, 3, 1, 2, 3, 1, 2, 1, 2), date = structure(c(16983,
16983, 16983, 17014, 17014, 17014, 17045, 17045, 17106, 17106
), class = "Date"), host_id = c(27280608, 27280608, 27280608,
27280608, 27280608, 27280608, 27280608, 27280608, 27280608, 27280608
), room_type_NV = structure(c(2L, 1L, 1L, 1L, 2L, 1L, 1L, 2L,
1L, 2L), .Label = c("Entire home", "Shared home"), class = "factor"),
grouping_var = c("27280608_2016-07-01", "27280608_2016-07-01",
"27280608_2016-07-01", "27280608_2016-08-01", "27280608_2016-08-01",
"27280608_2016-08-01", "27280608_2016-09-01", "27280608_2016-09-01",
"27280608_2016-11-01", "27280608_2016-11-01")), row.names = c(NA,
-10L), class = "data.frame")
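As a further base R sketch (not taken from the answers above), ave can write both counts onto every row of a host/month group, which seems to be what the desired output asks for. The data frame below re-creates the sample data; the month helper and the choice to count rows rather than unique ids are assumptions made for illustration.

```r
df <- data.frame(
  id = c(2, 1, 3, 1, 2, 3, 1, 2, 1, 2),
  date = as.Date(c("2016-07-01", "2016-07-01", "2016-07-01",
                   "2016-08-01", "2016-08-01", "2016-08-01",
                   "2016-09-01", "2016-09-01",
                   "2016-11-01", "2016-11-01")),
  host_id = 27280608,
  room_type_NV = c("Shared home", "Entire home", "Entire home",
                   "Entire home", "Shared home", "Entire home",
                   "Entire home", "Shared home",
                   "Entire home", "Shared home")
)

# Month key, so that different scrape days within one month fall into one group
month <- format(df$date, "%Y-%m")

# ave() applies sum within each host/month group and repeats the
# group total on every row of that group
df$tot_EH <- ave(as.integer(df$room_type_NV == "Entire home"),
                 df$host_id, month, FUN = sum)
df$tot_SH <- ave(as.integer(df$room_type_NV == "Shared home"),
                 df$host_id, month, FUN = sum)

print(df)
```

Unlike the filtered data.table assignment, this fills both columns on all rows, so an Entire home row also carries the Shared home count for its group.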

r: Adding columns and newly calculated values in data frame

I have already searched the forum for hours (really) and am starting to get the faint feeling that I am slowly going crazy, especially as this seems to be an easily solvable problem.
What do I want to do?
Basically, I want to simulate clinical data. Specifically, for each patient (column 1: id) I want an arbitrary score (column 3: score) that depends on the assigned treatment group (column 2: group).
set.seed(123)
# Number of subjects in study
n_patients = 1000
# Score: Mean and SDs
mean_verum = 70
sd_verum = 20
mean_placebo = 40
sd_placebo = 20
# Allocating to Treatment groups:
data = data.frame(id = as.character(1:n_patients))
data$group[1:(n_patients/2)] <- "placebo"
data$group[(n_patients/2+1):n_patients] <- "verum"
# Attach Score for each treatment group
data$score <- ifelse(data$group == "verum", rnorm(n=n_patients, mean=mean_verum, sd=sd_verum), rnorm(n=n_patients, mean=mean_placebo, sd=sd_placebo))
So far, so easy. Now I wish to 1) calculate the probability of an event happening (logit function) depending on the score. Then 2) I want to actually assign an event, depending on that probability (rbinom).
I want to do this for n different probabilities/events. This is the code I've used so far:
Calculate probabilities:
a = -1
b = 0.01
p1 = 1-exp(a+b*data$score)/(1+exp(a+b*data$score))
data$p_AE1 <- p1
a = -0.5
b = 0.01
p1 = 1-exp(a+b*data$score)/(1+exp(a+b*data$score))
data$p_AE2 <- p1
…
Assign Events:
data$Abbruch_AE1 <- rbinom(n_patients, 1, data$p_AE1)
data$Abbruch_AE2 <- rbinom(n_patients, 1, data$p_AE2)
…
Obviously, this is really inefficient, as I would like to easily scale this up or down, depending on how many probabilities/events I want to simulate.
The problem is that I simply do not see how I can simultaneously a) generate a new, single column in the dataframe for each set of values, b) apply the function to assign the probabilities/events, and c) do this for a number n of different formulas, each with its own a and b.
I am sure the solution to this problem is a simple one. What I didn't manage was to do all these things at once, which is where I would like to end up eventually. I have played around with for loops, all to no avail.
Any help would be greatly appreciated!
This is what my dataframe looks like:
structure(list(id = structure(1:3, .Label = c("1", "2", "3"), class = "factor"),
group = c("placebo", "placebo", "placebo"), score = c(25.791868726014,
45.1376741831306, 35.0661624307525), p_AE1 = c(0.677450814266315,
0.633816117436442, 0.656861351663365), p_AE2 = c(0.560226492151216,
0.512153420188678, 0.537265362130761), p_AE3 = c(0.435875409622676,
0.389033483248856, 0.413221988111604), p_AE4 = c(0.319098312196655,
0.278608032377073, 0.299294085148527), p_AE5 = c(0.221332386680766,
0.189789774534235, 0.205762225373345), p_AE6 = c(0.147051201194953,
0.124403316086538, 0.135795233451071), p_AE7 = c(0.0946686004658072,
0.0793379289917946, 0.0870131973838217), p_AE8 = c(0.0596409872667201,
0.0496714832182721, 0.0546471270895262), AbbruchAE1 = c(1L,
1L, 1L), AbbruchAE2 = c(1L, 1L, 0L), AbbruchAE3 = c(0L, 0L,
0L), AbbruchAE4 = c(0L, 1L, 0L), AbbruchAE5 = c(1L, 0L, 0L
), AbbruchAE6 = c(1L, 0L, 0L), AbbruchAE7 = c(0L, 0L, 0L),
AbbruchAE8 = c(0L, 0L, 0L)), .Names = c("id", "group", "score", "p_AE1", "p_AE2", "p_AE3", "p_AE4", "p_AE5", "p_AE6", "p_AE7", "p_AE8", "AbbruchAE1", "AbbruchAE2", "AbbruchAE3", "AbbruchAE4", "AbbruchAE5", "AbbruchAE6", "AbbruchAE7", "AbbruchAE8"), row.names = c(NA, 3L), class = "data.frame")
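One possible way to scale this (a sketch under stated assumptions, not a definitive implementation) is to keep the a and b values in a small parameter table and loop over its rows, building the column names with paste0. The params table, the reduced n_patients and the simplified score column below are assumptions made for illustration. plogis is base R's logistic function, so 1 - plogis(x) equals the 1 - exp(x)/(1 + exp(x)) used in the question.

```r
set.seed(123)

# Simplified stand-in for the simulated data (assumed, smaller than the original)
n_patients <- 10
data <- data.frame(id = as.character(seq_len(n_patients)),
                   score = rnorm(n_patients, mean = 40, sd = 20))

# Hypothetical parameter table: one row per event, add rows to scale up
params <- data.frame(a = c(-1, -0.5), b = c(0.01, 0.01))

for (k in seq_len(nrow(params))) {
  lin <- params$a[k] + params$b[k] * data$score   # linear predictor for event k
  p <- 1 - plogis(lin)                            # event probability, as in the question
  data[[paste0("p_AE", k)]] <- p                  # new probability column
  data[[paste0("Abbruch_AE", k)]] <- rbinom(n_patients, 1, p)  # 0/1 event draw
}

print(names(data))
```

Using data[[name]] with a constructed name is what allows the number of columns to follow the number of rows in params.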

How to use ifelse and paste functions

I am learning to use the ifelse function from Zuur et al. (2009), A Beginner's Guide to R. In one exercise, there is a data frame called Owls which contains data about 27 nests and two nights of observations.
structure(list(Nest = structure(c(1L, 1L, 1L, 1L), .Label = "AutavauxTV", class = "factor"),
FoodTreatment = structure(c(1L, 2L, 1L, 1L), .Label = c("Deprived",
"Satiated"), class = "factor"), SexParent = structure(c(1L,
1L, 1L, 1L), .Label = "Male", class = "factor"), ArrivalTime = c(22.25,
22.38, 22.53, 22.56), SiblingNegotiation = c(4L, 0L, 2L,
2L), BroodSize = c(5L, 5L, 5L, 5L), NegPerChick = c(0.8,
0, 0.4, 0.4)), .Names = c("Nest", "FoodTreatment", "SexParent",
"ArrivalTime", "SiblingNegotiation", "BroodSize", "NegPerChick"
), row.names = c(NA, 4L), class = "data.frame")
The two nights differed in their feeding regime (satiated or deprived), which is indicated in the FoodTreatment variable. The task is to use the ifelse and paste functions to make a new categorical variable that defines observations from a single night at a particular nest.
In the solutions the following code is suggested:
Owls <- read.table(file = "Owls.txt", header = TRUE, dec = ".")
ifelse(Owls$FoodTreatment == "Satiated", Owls$NestNight <- paste(Owls$Nest, "1",sep = "_"), Owls$NestNight <- paste(Owls$Nest, "2",sep = "_"))
and apparently it creates a new variable whose values have varying endings ("_1" or "_2").
However, when I call the original dataframe, all "_1" endings in the NestNight variable have disappeared and turned into "_2".
Why does this happen? Did the authors miss something in the code, or is it me who is not getting it?
Many thanks
EDIT: Sorry, I wanted to give a reproducible example by copying my data using dput but it did not work. If you can let me know how I can correct it so that it appears properly, I'd be grateful too!
Solution
If you do the assignment outside the ifelse structure, it works:
Owls$NestNight <- ifelse(Owls$FoodTreatment == "Satiated",
paste(Owls$Nest, "1",sep = ""),
paste(Owls$Nest, "2",sep = ""))
Explanation
What happens in your case is the same as if you executed the following two lines:
Owls$NestNight <- paste(Owls$Nest, "1",sep = "")
Owls$NestNight <- paste(Owls$Nest, "2",sep = "")
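You first assign paste(Owls$Nest, "1",sep = "") to Owls$NestNight and then you reassign paste(Owls$Nest, "2",sep = "") to it, because ifelse evaluates both its yes and its no argument in full whenever the test contains both TRUE and FALSE values. The ifelse itself still returns the correct mixed vector, but you don't assign its result to any variable.
Maybe it is clearer if you test this simple code: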
c(a <- 1:5, a <- 6:10) #c is your ifelse, a is your Owls$NestNight
a #[1] 6 7 8 9 10
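The same effect can be reproduced on a tiny data frame. The owls data below is made up for illustration (three rows, two nests); it is a sketch of the behaviour, not the book's data.

```r
# Hypothetical miniature version of the Owls data
owls <- data.frame(Nest = c("A", "A", "B"),
                   FoodTreatment = c("Deprived", "Satiated", "Deprived"))

# Assignment inside ifelse: both branches are evaluated in full,
# so the second assignment overwrites the first
res <- ifelse(owls$FoodTreatment == "Satiated",
              owls$NestNight <- paste(owls$Nest, "1", sep = "_"),
              owls$NestNight <- paste(owls$Nest, "2", sep = "_"))

print(res)             # the value returned by ifelse is the correct mix
print(owls$NestNight)  # the column only holds the result of the last assignment

# Assignment outside ifelse: the returned vector is stored in the column
owls$NestNight2 <- ifelse(owls$FoodTreatment == "Satiated",
                          paste(owls$Nest, "1", sep = "_"),
                          paste(owls$Nest, "2", sep = "_"))
print(owls$NestNight2)
```

Comparing res with owls$NestNight shows that ifelse worked as intended all along and only the placement of the assignment was wrong.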
