Boxplots from frequency columns in ggplot2 - r

I have a dataframe such as the example below, which describes the number of students achieving specific scores (25-100) in each class (a, b, c):
df
# score class_a class_b class_c
# 1 25 0 10 5
# 2 50 5 3 7
# 3 75 2 2 2
# 4 100 0 6 4
I would like to create a box plot with class on the x axis and score on the y axis, in order to show the range of scores for each class.
However, I am not sure how to do this with summarized data such as this. I have tried:
library(reshape2)
df1 <- melt(df, id.vars='score')
But I am not sure this is the right direction.
Data
df <- data.frame(score=c(25, 50, 75, 100), class_a=c(0, 5, 2, 0),
class_b=c(10, 3, 2, 6), class_c=c(5, 7, 2, 4))

You may repeat the scores according to the frequencies in each class and boxplot the list.
Map(rep.int, df[1], df[-1]) |> boxplot()
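A variant of the same idea that keeps the class names as group labels is sketched below. It assumes the df from the question; the names scores and long are introduced here for illustration only, and the ggplot2 step runs only if that package is installed.

```r
df <- data.frame(score = c(25, 50, 75, 100), class_a = c(0, 5, 2, 0),
                 class_b = c(10, 3, 2, 6), class_c = c(5, 7, 2, 4))

# Repeat each score by its frequency, giving one vector per class column
scores <- lapply(df[-1], function(n) rep(df$score, times = n))

# Long format: one row per student, with the class name in `ind`
long <- stack(scores)

# Base R box plot, one box per class
boxplot(values ~ ind, data = long, xlab = "class", ylab = "score")

# Equivalent ggplot2 version, guarded in case ggplot2 is not installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(long, ggplot2::aes(x = ind, y = values)) +
    ggplot2::geom_boxplot() +
    ggplot2::labs(x = "class", y = "score")
  print(p)
}
```

Because long has one row per student, it can be passed to any plotting or summary function that expects raw observations rather than counts.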

Related

Select only columns with column names that match values from rows in other df

I have two dataframes: one is a very large, wide dataset with hundreds of parameters, and the other has 3 columns, one identifying the parameters in the larger dataframe that have specification limits and two holding the lower and upper limits. What I want to do is reduce the wide dataframe to just the columns that are listed in the limits dataframe. I feel like this is incredibly basic, but I cannot get it to work.
See below for an example and output that I would like.
df
df <- data.frame("par.1" = c(1, 1, 2, 3, 5), "par.2" = c(10, 11, 12, 11, 15),"par.3" = c(8, 8, 12, 8, 9),"par.4" = c(8, 8, 12, 8, 9))
limits
limits <- data.frame("parameter" = c("par.2", "par.4"), "lsl" = c(8,5), "usl" = c(16,15))
Here is the output I am looking for
df.reduced
par.2 par.4
1 10 8
2 11 8
3 12 12
4 11 8
5 15 9
Just subset df column names by values %in% the parameter column of limits
df[names(df) %in% limits$parameter]
par.2 par.4
1 10 8
2 11 8
3 12 12
4 11 8
5 15 9
Alternatively, use match:
df[match(limits$parameter, names(df))]
An option with intersect
df[intersect(names(df), as.character(limits$parameter))]
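The three options differ in column order and in how they treat a parameter that is missing from df. The sketch below illustrates this with a hypothetical limits table that lists the parameters in reverse order and includes a made-up name, par.9, that df does not contain.

```r
df <- data.frame(par.1 = c(1, 1, 2, 3, 5), par.2 = c(10, 11, 12, 11, 15),
                 par.3 = c(8, 8, 12, 8, 9), par.4 = c(8, 8, 12, 8, 9))
# Hypothetical limits: reversed order plus a parameter ("par.9") that df lacks
limits <- data.frame(parameter = c("par.4", "par.2", "par.9"),
                     lsl = c(5, 8, 0), usl = c(15, 16, 1))

# %in% keeps the column order of df and ignores unknown parameters
by_in <- df[names(df) %in% limits$parameter]

# intersect() follows the order of its first argument
by_intersect <- df[intersect(limits$parameter, names(df))]

# match() follows the order of limits; drop NA positions before subsetting
idx <- match(limits$parameter, names(df))
by_match <- df[idx[!is.na(idx)]]

names(by_in)
names(by_intersect)
names(by_match)
```

Without the NA filter, the match version fails on the missing parameter, which can be useful when a missing column should be treated as an error.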

How to input nominal values into one column based on values in another column through If_else statements

I am trying to input nominal variables based on a column dedicated to age. Basically, if someone is between the ages of 1 to 5, indicated in the age column, then I want the age group column to have the value of 1, since they are in age group 1. I'm trying to do this in multiple columns since ages increase by one each year. I've tried doing this through a for loop that uses an if else function, but it does not work.
my_vector_1<-c(1,3,5,7,9,11,2,4,6,8,10,12,3,5,7,9,11,13)
my_matrix_1<-matrix(data=my_vector_1, nrow=6, ncol=3)
colnames(my_matrix_1)<-c(paste0("Age", 2000:2002))
rownames(my_matrix_1)<-c(paste0("Participant", 1:6))
my_data_1<-data.frame(my_matrix_1)
my_data_1<-cbind("AgeGroup2000"=NA, "AgeGroup2001"=NA, "AgeGroup2002"=NA, my_data_1)
my_data_1
#I'm basically trying to make the below code into a for loop
my_data_1$AgeGroup2000[my_data_1$Age2000 %in% 1:5]<-1
my_data_1$AgeGroup2000[my_data_1$Age2000 %in% 6:10]<-2
my_data_1$AgeGroup2000[my_data_1$Age2000 %in% 11:15]<-3
my_data_1$AgeGroup2001[my_data_1$Age2001 %in% 1:5]<-1
my_data_1$AgeGroup2001[my_data_1$Age2001 %in% 6:10]<-2
my_data_1$AgeGroup2001[my_data_1$Age2001 %in% 11:15]<-3
my_data_1$AgeGroup2002[my_data_1$Age2002 %in% 1:5]<-1
my_data_1$AgeGroup2002[my_data_1$Age2002 %in% 6:10]<-2
my_data_1$AgeGroup2002[my_data_1$Age2002 %in% 11:15]<-3
Maybe it is better to use findInterval or cut here. We can use lapply to apply it for multiple columns
my_data_1[paste0("AgeGroup_", 2000:2002)] <- lapply(my_data_1, findInterval, c(1, 6, 11))
# Age2000 Age2001 Age2002 AgeGroup_2000 AgeGroup_2001 AgeGroup_2002
#Participant1 1 2 3 1 1 1
#Participant2 3 4 5 1 1 1
#Participant3 5 6 7 1 2 2
#Participant4 7 8 9 2 2 2
#Participant5 9 10 11 2 2 3
#Participant6 11 12 13 3 3 3
Or mutate_all from dplyr (superseded by across() in dplyr 1.0.0 and later, but still available)
library(dplyr)
my_data_1 %>% mutate_all(list(Group = ~findInterval(., c(1, 6, 11))))
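Since cut is mentioned above as an alternative, here is a minimal base R sketch of that route. It assumes the same ages as the question's data; the names ages, breaks, groups and result are illustrative. Note one difference from findInterval with c(1, 6, 11): with these breaks, ages above 15 become NA instead of falling into group 3.

```r
ages <- data.frame(Age2000 = c(1, 3, 5, 7, 9, 11),
                   Age2001 = c(2, 4, 6, 8, 10, 12),
                   Age2002 = c(3, 5, 7, 9, 11, 13))

# Intervals (0,5], (5,10], (10,15] correspond to age groups 1, 2, 3
breaks <- c(0, 5, 10, 15)
groups <- lapply(ages, function(x) as.integer(cut(x, breaks = breaks)))

# Name the new columns AgeGroup2000, AgeGroup2001, AgeGroup2002
names(groups) <- sub("Age", "AgeGroup", names(ages))

result <- cbind(ages, as.data.frame(groups))
result
```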
data
my_vector_1<-c(1,3,5,7,9,11,2,4,6,8,10,12,3,5,7,9,11,13)
my_matrix_1<-matrix(data=my_vector_1, nrow=6, ncol=3)
colnames(my_matrix_1)<-c(paste0("Age", 2000:2002))
rownames(my_matrix_1)<-c(paste0("Participant", 1:6))
my_data_1<-data.frame(my_matrix_1)

How to pass a multivariate vector valued function (with variable length output) to aggregate

I have a data frame in R that I want to aggregate. The summary function that I want to apply to each subset is a custom function that takes several variables (columns) as input, and returns a vector or list of variable length. As an output, I would like to have a data frame with a column of the grouping variable, and a single other column containing the output vector (of varying length).
To give a mock example, suppose I have the following dataframe:
df <- data.frame( particle = c(rep("X",5),rep("Y",3),rep("Z",4)),
time = c(1:5,1:3,1:4), state = c(c("A","A","B","C","A"),c("A","B","B"),
c("B","C","A","A")), energy = round(runif(12,0,10)))
> df
particle time state energy
1 X 1 A 9
2 X 2 A 8
3 X 3 B 7
4 X 4 C 5
5 X 5 A 0
6 Y 1 A 1
7 Y 2 B 7
8 Y 3 B 7
9 Z 1 B 3
10 Z 2 C 9
11 Z 3 A 5
12 Z 4 A 6
I would like to obtain for each particle a list of the energy they had every time they changed state. The output I'm looking for is something like this:
>
particle energy
1 X c(9,7,5,0)
2 Y c(1,7)
3 Z c(3,9,5)
To do so, I would define a function like the following:
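myfun <- function(state, energy){
  # Start from the first observation, then record energy at each change of state
  tempstate <- state[1]
  energyvec <- energy[1]
  # seq_along(state)[-1] is empty for a single-row group, unlike 2:length(state)
  for(i in seq_along(state)[-1]){
    if(state[i] != tempstate){
      energyvec <- c(energyvec, energy[i])
      tempstate <- state[i]
    }
  }
  return(energyvec)
}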
I would then try to pass it to aggregate somehow.
The two data structures I tried for this are data.frame and data.table.
In data.frame, using a custom function that returns a vector seems to give the correct output format I am looking for, that is where the output column is really a list, and each row contains a list with the output of the function. However, I can't seem to pass several columns to the function when aggregating this way.
With a data.table, the aggregation is easier to do when considering a function of several variables. However, I can't seem to obtain the output I'm looking for. Indeed,
dt <- data.table(df)
dt[,myfun(state, energy), by= particle]
only returns the first element of energyvec (instead of a vector), and
dt <- data.table(df)
dt[,as.list(myfun(state, energy)), by= particle]
doesn't work as the outputs don't all have the same length.
Is there an alternative way to go to accomplish this?
Thank you very much in advance for all your help!
Here's a tidyverse approach:
library(tidyverse)
df <- data.frame( particle = c(rep("X",5),rep("Y",3),rep("Z",4)),
time = c(1:5,1:3,1:4), state = c(c("A","A","B","C","A"),c("A","B","B"),
c("B","C","A","A")), energy = round(runif(12,0,10)))
# Hard-code energy to make this reproducible
df$energy <- c(9, 8, 7, 5, 0, 1, 7, 7, 3, 9, 5, 6)
df %>%
group_by(particle) %>%
mutate(
changed_state = coalesce(state != lag(state, 1), TRUE)
) %>%
filter(changed_state) %>%
summarise(
string = toString(energy)
)
#> # A tibble: 3 x 2
#> particle string
#> <fct> <chr>
#> 1 X 9, 7, 5, 0
#> 2 Y 1, 7
#> 3 Z 3, 9, 5
I'd run each line of the pipe individually to see what it does. Basically, create a changed_state variable by checking whether the current state differs from the previous state, lag(state, 1); coalesce() turns the NA produced for the first row of each group into TRUE, so the first observation is always kept. Since we only care about rows where the state changed, we filter where this is TRUE (a more verbose line would be filter(changed_state == TRUE)). The toString function collapses the rows of energy as desired, and we are already "grouped" by particle.
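The same change-detection idea can be sketched in base R, returning a list column of numeric vectors instead of a string. This is only an illustration: energy_at_changes, pieces and res are hypothetical names, and the energy values are hard-coded as above.

```r
df <- data.frame(particle = c(rep("X", 5), rep("Y", 3), rep("Z", 4)),
                 time = c(1:5, 1:3, 1:4),
                 state = c("A", "A", "B", "C", "A", "A", "B", "B",
                           "B", "C", "A", "A"),
                 energy = c(9, 8, 7, 5, 0, 1, 7, 7, 3, 9, 5, 6),
                 stringsAsFactors = FALSE)

# Hypothetical helper: keep energy at the first row and wherever state changes
energy_at_changes <- function(state, energy) {
  changed <- c(TRUE, state[-1] != state[-length(state)])
  energy[changed]
}

# One data frame per particle
pieces <- split(df, df$particle)

res <- data.frame(particle = names(pieces))
# A list column can hold vectors of different lengths, one per particle
res$energy <- unname(lapply(pieces, function(d) {
  energy_at_changes(d$state, d$energy)
}))
str(res)
```

Keeping the values in a list column means they stay numeric and can be used in later computations, whereas toString produces text for display.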
data.table approach
sample data
#stolen from JasonAizkalns's answer
df <- data.frame( particle = c(rep("X",5),rep("Y",3),rep("Z",4)),
time = c(1:5,1:3,1:4), state = c(c("A","A","B","C","A"),c("A","B","B"),
c("B","C","A","A")), energy = round(runif(12,0,10)))
df$energy <- c(9, 8, 7, 5, 0, 1, 7, 7, 3, 9, 5, 6)
code
library( data.table )
#create data.table
dt <- as.data.table(df)
#use `uniqlist` to get rownumbers where the value of `state` changes,
# then get these rows into a subset
result <- dt[ data.table:::uniqlist(dt[, c("particle", "state")]), ]
#split the resulting `energy`-column by the contents of the `particle`-column
l <- split( result$energy, result$particle)
# $X
# [1] 9 7 5 0
#
# $Y
# [1] 1 7
#
# $Z
# [1] 3 9 5
#create final output
data.table( particle = names(l), energy = l )
# particle energy
# 1: X 9,7,5,0
# 2: Y 1,7
# 3: Z 3,9,5
Another possible data.table approach
library(data.table)
setDT(DF)[, .(energy=.(.SD[, first(energy), by=.(rleid(state))]$V1)), by=.(particle)]
output:
particle energy
1: X 9,4,6,9
2: Y 2,9
3: Z 7,6,1
data:
set.seed(0L)
DF <- data.frame( particle = c(rep("X",5),rep("Y",3),rep("Z",4)),
time = c(1:5,1:3,1:4), state = c(c("A","A","B","C","A"),c("A","B","B"),
c("B","C","A","A")), energy = round(runif(12,0,10)))
DF
# particle time state energy
# 1 X 1 A 9
# 2 X 2 A 3
# 3 X 3 B 4
# 4 X 4 C 6
# 5 X 5 A 9
# 6 Y 1 A 2
# 7 Y 2 B 9
# 8 Y 3 B 9
# 9 Z 1 B 7
# 10 Z 2 C 6
# 11 Z 3 A 1
# 12 Z 4 A 2

Rolling standard deviation for multiple firms, with different time periods

I have a dataset with monthly stock returns for approximately 100 firms. The firms cover different time periods because they were listed on, and removed from, the stock exchange at different times.
I have ordered my dataset by Company, Year, Month, and I want the rolling standard deviation to account for this, so that it starts for a firm after 24 months and ends at the last observation for that firm.
This means that the command has to be able to tell the difference between firms, so that the window doesn't carry over to the next firm.
Year, Month, Company, Return
1990, 1, Company 1, -0,005
1990, 2, Company 1 , 0,003
etc...
1990, 1, Company 2, ...
1990, 2, Company 2, ...
etc...
2017, 6, Company 50, ...
I have been trying with this code, but it just keeps going when the next row contains a new firm, i.e. it just does a rolling standard deviation for the whole dataset.
rolling_sd <- (rollapply(Dataset$RETURN, width=24,
FUN = sd, fill=NA, align = "right"))
Also it does not align with the right date. If I have no align command, the first row of standard deviation should be 24 rows down, with the "right" it moves 12 down, but still not properly aligned.
How can I make it to take Company name into account?
If you omit the align="right" argument, the sd values would be centered, as discussed in the question, but since the code shown does use right alignment, the sd values would start in row 24. I suspect you are confusing runs made with and without the align= argument.
Using the data shown in the Note at the end and changing 24 to 3 in order to demonstrate it with this smaller dataset we use ave to apply the rolling sd to each company separately. The r at the end of rollapplyr is a shorter way of specifying align="right". With right alignment the sd shown in the ith row is the sd of the width rows ending in row i, i.e. rows i-width+1 to i inclusive.
library(zoo)
roll <- function(x) rollapplyr(x, width = 3, FUN = sd, fill = NA)
transform(Dataset, sd = ave(RETURN, Company, FUN = roll))
giving:
Year Month Company RETURN sd
1 1 1 A -0.042484496 NA
2 1 2 A 0.057661027 NA
3 1 3 A -0.018204616 0.05224021
4 1 4 A 0.076603481 0.05017135
5 2 1 A 0.088093457 0.05833792
6 2 2 A -0.090888700 0.10018338
7 2 3 A 0.005621098 0.08958278
8 2 4 A 0.078483809 0.08496093
9 1 1 B -0.042484496 NA
10 1 2 B 0.057661027 NA
11 1 3 B -0.018204616 0.05224021
12 1 4 B 0.076603481 0.05017135
13 2 1 B 0.088093457 0.05833792
14 2 2 B -0.090888700 0.10018338
15 2 3 B 0.005621098 0.08958278
16 2 4 B 0.078483809 0.08496093
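To make the right-aligned window explicit, here is a base R sketch of the same per-company calculation without zoo. The helper roll_sd is a hypothetical name written for this illustration, not a zoo function; the data is the same as in the Note.

```r
set.seed(123)
tmp <- data.frame(Year = rep(1:2, each = 4), Month = 1:4, Company = "A",
                  RETURN = runif(8, -.1, .1))
Dataset <- rbind(tmp, transform(tmp, Company = "B"))

# Hypothetical helper: sd of the `width` values ending at each position,
# NA until a full window is available
roll_sd <- function(x, width) {
  out <- rep(NA_real_, length(x))
  if (length(x) >= width) {
    for (i in width:length(x)) out[i] <- sd(x[(i - width + 1):i])
  }
  out
}

# ave() applies the helper within each company, so windows never span firms
Dataset$sd <- ave(Dataset$RETURN, Dataset$Company,
                  FUN = function(x) roll_sd(x, width = 3))
Dataset
```

The first width - 1 rows of every company are NA, which is the behaviour the question asks for with width = 24.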
Note
Some data in reproducible form
set.seed(123)
tmp <- data.frame(Year = c(1, 1, 1, 1, 2, 2, 2, 2), Month = 1:4, Company = "A",
RETURN = runif(8, -.1, .1))
Dataset <- rbind(tmp, transform(tmp, Company = "B"))

Reverse Scoring Items

I have a survey of about 80 items. Most of the items are positively valenced (higher scores indicate a better outcome), but about 20 of them are negatively valenced, and I need to find a way to reverse score those in R. I am completely lost on how to do so. I am definitely an R beginner, and this is probably a dumb question, but could someone point me in a direction code-wise?
Here's an example with some fake data that you can adapt to your data:
# Fake data: Three questions answered on a 1 to 5 scale
set.seed(1)
dat = data.frame(Q1=sample(1:5,10,replace=TRUE),
Q2=sample(1:5,10,replace=TRUE),
Q3=sample(1:5,10,replace=TRUE))
dat
Q1 Q2 Q3
1 2 2 5
2 2 1 2
3 3 4 4
4 5 2 1
5 2 4 2
6 5 3 2
7 5 4 1
8 4 5 2
9 4 2 5
10 1 4 2
# Say you want to reverse questions Q1 and Q3
cols = c("Q1", "Q3")
dat[ ,cols] = 6 - dat[ ,cols]
dat
Q1 Q2 Q3
1 4 2 1
2 4 1 4
3 3 4 2
4 1 2 5
5 4 4 4
6 1 3 4
7 1 4 5
8 2 5 4
9 2 2 1
10 5 4 4
If you have a lot of columns, you can use tidyverse functions to select multiple columns to recode in a single operation.
library(tidyverse)
# Reverse code columns Q1 and Q3
dat %>% mutate(across(matches("^Q[13]$"), ~ 6 - .))
# Reverse code all columns that start with Q followed by one or two digits
dat %>% mutate(across(matches("^Q[0-9]{1,2}"), ~ 6 - .))
# Reverse code columns Q11 through Q20
dat %>% mutate(across(Q11:Q20, ~ 6 - .))
If different columns could have different maximum values, you can (adapting @HellowWorld's suggestion) customize the reverse-coding to the maximum value of each column:
# Reverse code columns Q11 through Q20
dat %>% mutate(across(Q11:Q20, ~ max(.) + 1 - .))
Here is an alternative approach using the psych package. If you are working with survey data, this package has lots of good functions. Building on @eipi10's data:
# Fake data: Three questions answered on a 1 to 5 scale
set.seed(1)
original_data = data.frame(Q1=sample(1:5,10,replace=TRUE),
Q2=sample(1:5,10,replace=TRUE),
Q3=sample(1:5,10,replace=TRUE))
original_data
# Say you want to reverse questions Q1 and Q3. Set those keys to -1 and Q2 to 1.
# install.packages("psych") # Uncomment this if you haven't installed the psych package
library(psych)
keys <- c(-1,1,-1)
# Use the handy function from the psych package
# mini is the minimum value and maxi is the maximum value
# mini and maxi can also be vectors if you have different scales
new_data <- reverse.code(keys,original_data,mini=1,maxi=5)
new_data
The pro to this approach is that you can recode your entire survey in one function. The con to this is you need a library. The stock R approach is more elegant as well.
FYI, this is my first post on stack overflow. Long time listener, first time caller. So please give me feedback on my response.
Just converting @eipi10's answer using tidyverse:
library(dplyr)
# Create same fake data: Three questions answered on a 1 to 5 scale
set.seed(1)
dat <- data.frame(Q1 = sample(1:5,10, replace=TRUE),
                  Q2 = sample(1:5,10, replace=TRUE),
                  Q3 = sample(1:5,10, replace=TRUE))
# Reverse scores in the desired columns (Q2 and Q3), stored in new columns
dat <- dat %>%
  mutate(Q2Reversed = 6 - Q2,
         Q3Reversed = 6 - Q3)
Another example is to use recode in library(car).
#Example data
data = data.frame(Q1=sample(1:5,10, replace=TRUE))
# Say you want to reverse questions Q1
library(car)
data$Q1reversed <- recode(data$Q1, "1=5; 2=4; 3=3; 4=2; 5=1")
data
The psych package has the intuitive reverse.code() function that can be helpful. Using the dataset started by @eipi10 and the same goal of reversing q1 and q3:
set.seed(1)
dat <- data.frame(q1 =sample(1:5,10,replace=TRUE),
q2=sample(1:5,10,replace=TRUE),
q3 =sample(1:5,10,replace=TRUE))
You can use the reverse.code() function. The first argument is the keys. This is a vector of 1 and -1. -1 means that you want to reverse that item. These go in the same order as your data.
The second argument, called items, is simply the name of your dataset. That is, where are these items located?
Last, the mini and maxi arguments are the smallest and largest values that a participant could possibly score. You can also leave these arguments to NULL and the function will use the lowest and highest values in your data.
library(psych)
keys <- c(-1, 1, -1)
dat1 <- reverse.code(keys = keys, items = dat, mini = 1, maxi = 5)
dat1
Alternatively, your keys can also contain the specific names of the variables that you want to reverse score. This is helpful if you have many variables to reverse score and yields the same answer:
library(psych)
keys <- c("q1", "q3")
dat2 <- reverse.code(keys = keys, items = dat, mini = 1, maxi = 5)
dat2
Note that, after reverse scoring, reverse.code() slightly modifies the variable name to have a - behind it (i.e., q1 becomes q1- after being reverse scored).
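For readers who prefer to see the arithmetic, a reversed score on a scale running from mini to maxi is mini + maxi - x. The base R sketch below applies that formula to selected columns; reverse_items is a hypothetical helper written for this illustration, not part of psych, and the small datasets are made up.

```r
# Hypothetical helper: reverse selected columns on a scale from mini to maxi
reverse_items <- function(data, cols, mini, maxi) {
  data[cols] <- lapply(data[cols], function(x) mini + maxi - x)
  data
}

# Made-up answers on a 1 to 5 scale
dat <- data.frame(q1 = c(1, 2, 5), q2 = c(3, 3, 4), q3 = c(5, 4, 1))
rev_dat <- reverse_items(dat, c("q1", "q3"), mini = 1, maxi = 5)
rev_dat

# The same formula handles a 0 to 9 scale, where max + 1 - x would give 1 to 10
zero_based <- reverse_items(data.frame(b = c(0, 9, 2)), "b", mini = 0, maxi = 9)
zero_based
```

Passing the theoretical scale limits, not the observed minimum and maximum, keeps the result correct when no respondent used the extreme answers.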
The solutions above assume wide data (one score per column). This reverse scores specific rows in long data (one score per row).
library(dplyr)
library(magrittr)
max_score <- 5
df <- data.frame(score=sample(1:max_score, 20, replace=TRUE))
df <- mutate(df, question = rownames(df))
df
# %<>% pipes the selected rows through mutate() and assigns the result back
df[c(4,13,17),] %<>% mutate(score = max_score + 1 - score)
df
Here is another attempt that will generalize to any number of columns. Let's use some made up data to illustrate the function.
# create a df
{
A = c(3, 3, 3, 3, 3, 3, 3, 3, 3, 3)
B = c(9, 2, 3, 2, 4, 0, 2, 7, 2, 8)
C = c(2, 4, 1, 0, 2, 1, 3, 0, 7, 8)
df1 = data.frame(A, B, C)
print(df1)
}
A B C
1 3 9 2
2 3 2 4
3 3 3 1
4 3 2 0
5 3 4 2
6 3 0 1
7 3 2 3
8 3 7 0
9 3 2 7
10 3 8 8
The columns to reverse code
# variables to reverse code
vtcode = c("A", "B")
The function to reverse-code the selected columns
reverseCode <- function(data, rev){
# get maximum value per desired col: lapply(data[rev], max)
# subtract values in cols to reverse-code from max value plus 1
data[, rev] = mapply("-", lapply(data[rev], max), data[, rev]) + 1
return(data)
}
reverseCode(df1, vtcode)
A B C
1 1 1 2
2 1 8 4
3 1 7 1
4 1 8 0
5 1 6 2
6 1 10 1
7 1 8 3
8 1 3 0
9 1 8 7
10 1 2 8
This code was inspired by a response from @catastrophic-failure relating to subtracting the max of a column from all entries in that column in R.
