I have a survey of about 80 items. Most of the items are valenced positively (higher scores indicate a better outcome), but about 20 of them are negatively valenced, and I need a way to reverse score those items in R. I am completely lost on how to do so. I am definitely an R beginner, and this is probably a dumb question, but could someone point me in the right direction code-wise?
Here's an example with some fake data that you can adapt to your data:
# Fake data: Three questions answered on a 1 to 5 scale
set.seed(1)
dat = data.frame(Q1 = sample(1:5, 10, replace=TRUE),
                 Q2 = sample(1:5, 10, replace=TRUE),
                 Q3 = sample(1:5, 10, replace=TRUE))
dat
Q1 Q2 Q3
1 2 2 5
2 2 1 2
3 3 4 4
4 5 2 1
5 2 4 2
6 5 3 2
7 5 4 1
8 4 5 2
9 4 2 5
10 1 4 2
# Say you want to reverse questions Q1 and Q3
cols = c("Q1", "Q3")
dat[ ,cols] = 6 - dat[ ,cols]
dat
Q1 Q2 Q3
1 4 2 1
2 4 1 4
3 3 4 2
4 1 2 5
5 4 4 4
6 1 3 4
7 1 4 5
8 2 5 4
9 2 2 1
10 5 4 4
If you have a lot of columns, you can use tidyverse functions to select multiple columns to recode in a single operation.
library(tidyverse)
# Reverse code columns Q1 and Q3
dat %>% mutate(across(matches("^Q[13]"), ~ 6 - .))
# Reverse code all columns that start with Q followed by one or two digits
dat %>% mutate(across(matches("^Q[0-9]{1,2}"), ~ 6 - .))
# Reverse code columns Q11 through Q20
dat %>% mutate(across(Q11:Q20, ~ 6 - .))
If different columns could have different maximum values, you can (adapting @HellowWorld's suggestion) customize the reverse-coding to the maximum value of each column:
# Reverse code columns Q11 through Q20
dat %>% mutate(across(Q11:Q20, ~ max(.) + 1 - .))
Here is an alternative approach using the psych package. If you are working with survey data, this package has lots of good functions. Building on @eipi10's data:
# Fake data: Three questions answered on a 1 to 5 scale
set.seed(1)
original_data = data.frame(Q1 = sample(1:5, 10, replace=TRUE),
                           Q2 = sample(1:5, 10, replace=TRUE),
                           Q3 = sample(1:5, 10, replace=TRUE))
original_data
# Say you want to reverse questions Q1 and Q3. Set those keys to -1 and Q2 to 1.
# install.packages("psych") # Uncomment this if you haven't installed the psych package
library(psych)
keys <- c(-1,1,-1)
# Use the handy function from the psych package
# mini is the minimum value and maxi is the maximum value
# mini and maxi can also be vectors if you have different scales
new_data <- reverse.code(keys,original_data,mini=1,maxi=5)
new_data
The pro of this approach is that you can recode your entire survey in one function call. The con is that you need an extra package; the stock R approach is arguably more elegant as well.
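For comparison, the "stock R" version of recoding the whole survey with a keys vector might look like this (just a sketch, assuming every item runs from 1 to 5):
# Base R sketch: reverse only the columns whose key is -1
keys <- c(-1, 1, -1)
recoded <- original_data
recoded[keys == -1] <- 6 - recoded[keys == -1]
recoded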
FYI, this is my first post on stack overflow. Long time listener, first time caller. So please give me feedback on my response.
Just converting @eipi10's answer to the tidyverse:
# Create same fake data: Three questions answered on a 1 to 5 scale
set.seed(1)
dat <- data.frame(Q1 = sample(1:5, 10, replace=TRUE),
                  Q2 = sample(1:5, 10, replace=TRUE),
                  Q3 = sample(1:5, 10, replace=TRUE))
# Reverse scores in the desired columns (Q2 and Q3)
dat <- dat %>%
  mutate(Q2Reversed = 6 - Q2,
         Q3Reversed = 6 - Q3)
Another option is to use recode() from the car package.
#Example data
data = data.frame(Q1=sample(1:5,10, replace=TRUE))
# Say you want to reverse question Q1
library(car)
data$Q1reversed <- recode(data$Q1, "1=5; 2=4; 3=3; 4=2; 5=1")
data
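If several columns need the same recoding, the same recode() call can be applied across them with lapply. This is only a sketch with a hypothetical data frame called survey, whose items Q1 and Q3 are assumed to be on a 1-to-5 scale:
library(car)
# Hypothetical survey data with two negatively keyed items
survey <- data.frame(Q1 = sample(1:5, 10, replace = TRUE),
                     Q3 = sample(1:5, 10, replace = TRUE))
survey[c("Q1", "Q3")] <- lapply(survey[c("Q1", "Q3")],
                                recode, "1=5; 2=4; 3=3; 4=2; 5=1")
survey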
The psych package has the intuitive reverse.code() function that can be helpful. Using the dataset started by @eipi10 and the same goal of reversing q1 and q3:
set.seed(1)
dat <- data.frame(q1 = sample(1:5, 10, replace=TRUE),
                  q2 = sample(1:5, 10, replace=TRUE),
                  q3 = sample(1:5, 10, replace=TRUE))
You can use the reverse.code() function. The first argument is keys: a vector of 1s and -1s, where -1 means you want to reverse that item. The keys go in the same order as the columns of your data.
The second argument, called items, is simply the name of your dataset; that is, where these items are located.
Last, the mini and maxi arguments are the smallest and largest values that a participant could possibly score. You can also leave these arguments as NULL and the function will use the lowest and highest values observed in your data.
library(psych)
keys <- c(-1, 1, -1)
dat1 <- reverse.code(keys = keys, items = dat, mini = 1, maxi = 5)
dat1
Alternatively, your keys can also contain the specific names of the variables that you want to reverse score. This is helpful if you have many variables to reverse score and yields the same answer:
library(psych)
keys <- c("q1", "q3")
dat2 <- reverse.code(keys = keys, items = dat, mini = 1, maxi = 5)
dat2
Note that, after reverse scoring, reverse.code() slightly modifies the variable name by appending a - to it (i.e., q1 becomes q1- after being reverse scored).
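A small follow-up sketch: reverse.code() returns a matrix, so if you want to keep working with a data frame under the original item names, you can coerce it and strip the trailing "-":
dat2 <- as.data.frame(dat2)
names(dat2) <- sub("-$", "", names(dat2))  # q1- becomes q1 again
dat2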
The solutions above assume wide data (one score per column). This reverse scores specific rows in long data (one score per row).
library(dplyr)     # mutate() comes from dplyr
library(magrittr)  # for the compound assignment pipe %<>%
max_score <- 5     # named to avoid masking base::max
df <- data.frame(score = sample(1:max_score, 20, replace = TRUE))
df <- mutate(df, question = rownames(df))
df
df[c(4, 13, 17), ] %<>% mutate(score = max_score + 1 - score)
df
Here is another attempt that will generalize to any number of columns. Let's use some made up data to illustrate the function.
# create a df
{
A = c(3, 3, 3, 3, 3, 3, 3, 3, 3, 3)
B = c(9, 2, 3, 2, 4, 0, 2, 7, 2, 8)
C = c(2, 4, 1, 0, 2, 1, 3, 0, 7, 8)
df1 = data.frame(A, B, C)
print(df1)
}
A B C
1 3 9 2
2 3 2 4
3 3 3 1
4 3 2 0
5 3 4 2
6 3 0 1
7 3 2 3
8 3 7 0
9 3 2 7
10 3 8 8
The columns to reverse code
# variables to reverse code
vtcode = c("A", "B")
The function to reverse-code the selected columns
reverseCode <- function(data, rev){
  # get maximum value per desired col: lapply(data[rev], max)
  # subtract values in cols to reverse-code from max value plus 1
  data[, rev] = mapply("-", lapply(data[rev], max), data[, rev]) + 1
  return(data)
}
reverseCode(df1, vtcode)
A B C
1 1 1 2
2 1 8 4
3 1 7 1
4 1 8 0
5 1 6 2
6 1 10 1
7 1 8 3
8 1 3 0
9 1 8 7
10 1 2 8
This code was inspired by an answer from @catastrophic-failure about subtracting the max of a column from all entries in that column in R.
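An equivalent, arguably simpler formulation of the same idea (a sketch under the same assumption that each selected column is reversed against its own maximum):
reverseCode2 <- function(data, rev){
  data[rev] <- lapply(data[rev], function(x) max(x) + 1 - x)
  data
}
reverseCode2(df1, vtcode)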
Related
I have two different variables in my data. The first one is vein size (it contains NAs).
The second variable is procedure site, with values c(1,2,3,4).
I want to impute a different vein size value depending on the procedure site. I tried if/else, but it wasn't successful. For example: if the procedure site is 1 or 2, impute 3; if the procedure site is 3, impute 4; if the procedure site is 4, impute 5. I am new to this field. Any help is much appreciated!
vein.size <- c(3,3,3,NA,NA,NA)
procedure.site <- c(1,2,2,3,4,4)
df <- cbind(vein.size, procedure.site)
My expected output is:
vein.size <- c(3,3,3,4,5,5)
thank you
You can use a chain of ifelse statements (sketched after the data below) or try case_when from dplyr:
library(dplyr)
df <- df %>%
  mutate(output = case_when(is.na(vein.size) & procedure.site %in% 1:2 ~ 3,
                            is.na(vein.size) & procedure.site == 3 ~ 4,
                            is.na(vein.size) & procedure.site == 4 ~ 5,
                            TRUE ~ vein.size))
# vein.size procedure.site output
#1 3 1 3
#2 3 2 3
#3 3 2 3
#4 NA 3 4
#5 NA 4 5
#6 NA 4 5
data
vein.size<-c(3,3,3,NA,NA,NA)
procedure.site<-c(1,2,2,3,4,4)
df<-data.frame(vein.size,procedure.site)
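For completeness, the chain of ifelse mentioned at the start of this answer would look roughly like this (a sketch on the same data frame):
df$output <- ifelse(!is.na(df$vein.size), df$vein.size,
              ifelse(df$procedure.site %in% 1:2, 3,
               ifelse(df$procedure.site == 3, 4,
                ifelse(df$procedure.site == 4, 5, NA))))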
You could use a lookup table and then merge:
# your data
vein.size <- c(3,3,3,NA,NA,NA)
procedure.site <- c(1,2,2,3,4,4)
your_df <- data.frame(vein.size = vein.size,
procedure.site = procedure.site)
# the lookup table
lookup_df <- data.frame(
procedure.site = c(1, 2, 3, 4),
imputation = c(3, 3, 4, 5)
)
# result
merge(your_df, lookup_df, by='procedure.site')
Which gives:
procedure.site vein.size imputation
1 1 3 3
2 2 3 3
3 2 3 3
4 3 NA 4
5 4 NA 5
6 4 NA 5
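If you then want a single filled-in vein.size column rather than a separate imputation column, one more step does it (a sketch building on the merge result above):
res <- merge(your_df, lookup_df, by = 'procedure.site')
res$vein.size <- ifelse(is.na(res$vein.size), res$imputation, res$vein.size)
res$imputation <- NULL
res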
I am trying to create nominal (age group) variables based on columns containing ages. Basically, if someone is between the ages of 1 and 5, as indicated in the age column, then I want the age group column to have the value 1, since they are in age group 1. I'm trying to do this across multiple columns, since ages increase by one each year. I've tried doing this with a for loop that uses an if/else, but it does not work.
my_vector_1<-c(1,3,5,7,9,11,2,4,6,8,10,12,3,5,7,9,11,13)
my_matrix_1<-matrix(data=my_vector_1, nrow=6, ncol=3)
colnames(my_matrix_1)<-c(paste0("Age", 2000:2002))
rownames(my_matrix_1)<-c(paste0("Participant", 1:6))
my_data_1<-data.frame(my_matrix_1)
my_data_1<-cbind("AgeGroup2000"=NA, "AgeGroup2001"=NA, "AgeGroup2002"=NA, my_data_1)
my_data_1
#I'm basically trying to make the below code into a for loop
my_data_1$AgeGroup2000[my_data_1$Age2000 %in% 1:5]<-1
my_data_1$AgeGroup2000[my_data_1$Age2000 %in% 6:10]<-2
my_data_1$AgeGroup2000[my_data_1$Age2000 %in% 11:15]<-3
my_data_1$AgeGroup2001[my_data_1$Age2001 %in% 1:5]<-1
my_data_1$AgeGroup2001[my_data_1$Age2001 %in% 6:10]<-2
my_data_1$AgeGroup2001[my_data_1$Age2001 %in% 11:15]<-3
my_data_1$AgeGroup2002[my_data_1$Age2002 %in% 1:5]<-1
my_data_1$AgeGroup2002[my_data_1$Age2002 %in% 6:10]<-2
my_data_1$AgeGroup2002[my_data_1$Age2002 %in% 11:15]<-3
Maybe it is better to use findInterval or cut here (a cut version is sketched after the output below). We can use lapply to apply it across multiple columns:
my_data_1[paste0("AgeGroup_", 2000:2002)] <- lapply(my_data_1, findInterval, c(1, 6, 11))
# Age2000 Age2001 Age2002 AgeGroup_2000 AgeGroup_2001 AgeGroup_2002
#Participant1 1 2 3 1 1 1
#Participant2 3 4 5 1 1 1
#Participant3 5 6 7 1 2 2
#Participant4 7 8 9 2 2 2
#Participant5 9 10 11 2 3 3
#Participant6 11 12 13 3 3 3
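The cut alternative mentioned above is similar; labels = FALSE makes cut return the integer group codes directly (a sketch, applied to the three Age columns):
my_data_1[paste0("AgeGroup_", 2000:2002)] <-
  lapply(my_data_1[paste0("Age", 2000:2002)], cut,
         breaks = c(0, 5, 10, 15), labels = FALSE)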
Or mutate_all from dplyr
library(dplyr)
my_data_1 %>% mutate_all(list(Group = ~findInterval(., c(1, 6, 11))))
data
my_vector_1<-c(1,3,5,7,9,11,2,4,6,8,10,12,3,5,7,9,11,13)
my_matrix_1<-matrix(data=my_vector_1, nrow=6, ncol=3)
colnames(my_matrix_1)<-c(paste0("Age", 2000:2002))
rownames(my_matrix_1)<-c(paste0("Participant", 1:6))
my_data_1<-data.frame(my_matrix_1)
I have a data frame in R that I want to aggregate. The summary function that I want to apply to each subset is a custom function that takes several variables (columns) as input, and returns a vector or list of variable length. As an output, I would like to have a data frame with a column of the grouping variable, and a single other column containing the output vector (of varying length).
To give a mock example, suppose I have the following dataframe:
df <- data.frame(particle = c(rep("X",5), rep("Y",3), rep("Z",4)),
                 time = c(1:5, 1:3, 1:4),
                 state = c(c("A","A","B","C","A"), c("A","B","B"), c("B","C","A","A")),
                 energy = round(runif(12,0,10)))
> df
particle time state energy
1 X 1 A 9
2 X 2 A 8
3 X 3 B 7
4 X 4 C 5
5 X 5 A 0
6 Y 1 A 1
7 Y 2 B 7
8 Y 3 B 7
9 Z 1 B 3
10 Z 2 C 9
11 Z 3 A 5
12 Z 4 A 6
I would like to obtain for each particle a list of the energy they had every time they changed state. The output I'm looking for is something like this:
>
particle energy
1 X c(9,7,5,0)
2 Y c(1,7)
3 Z c(3,9,5)
To do so, I would define a function like the following:
myfun <- function(state, energy){
  tempstate <- state[1]
  energyvec <- energy[1]
  for(i in 2:length(state)){
    if(state[i] != tempstate){
      energyvec <- c(energyvec, energy[i])
      tempstate <- state[i]
    }
  }
  return(energyvec)
}
And try to pass it to aggregate somehow
The two data structures I tried for this are data.frame and data.table.
In data.frame, using a custom function that returns a vector seems to give the correct output format I am looking for, that is where the output column is really a list, and each row contains a list with the output of the function. However, I can't seem to pass several columns to the function when aggregating this way.
With a data.table, the aggregation is easier to do when considering a function of several variables. However, I can't seem to obtain the output I'm looking for. Indeed,
dt <- data.table(df)
dt[, myfun(state, energy), by = particle]
only returns the first element of energyvec (instead of a vector), and
dt <- data.table(df)
dt[, as.list(myfun(state, energy)), by = particle]
doesn't work as the outputs don't all have the same length.
Is there an alternative way to go to accomplish this?
Thank you very much in advance for all your help!
Here's a tidyverse approach:
library(tidyverse)
df <- data.frame(particle = c(rep("X",5), rep("Y",3), rep("Z",4)),
                 time = c(1:5, 1:3, 1:4),
                 state = c(c("A","A","B","C","A"), c("A","B","B"), c("B","C","A","A")),
                 energy = round(runif(12,0,10)))
# Hard-code energy to make this reproducible
df$energy <- c(9, 8, 7, 5, 0, 1, 7, 7, 3, 9, 5, 6)
df %>%
  group_by(particle) %>%
  mutate(
    changed_state = coalesce(state != lag(state, 1), TRUE)
  ) %>%
  filter(changed_state) %>%
  summarise(
    string = toString(energy)
  )
#> # A tibble: 3 x 2
#> particle string
#> <fct> <chr>
#> 1 X 9, 7, 5, 0
#> 2 Y 1, 7
#> 3 Z 3, 9, 5
I'd run each line of the pipe individually. Basically, create a changed_state variable by checking whether the current state differs from the previous state, lag(state, 1). Since we only care about rows where the state changed, we filter where this is TRUE (a more verbose version would be filter(changed_state == TRUE)). The toString function collapses the rows of energy as desired, and we are already "grouped" by particle.
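If you prefer a true list-column (one vector per particle, as in the desired output) rather than a collapsed string, summarise with list() instead of toString(); a sketch on the same df:
df %>%
  group_by(particle) %>%
  mutate(changed_state = coalesce(state != lag(state, 1), TRUE)) %>%
  filter(changed_state) %>%
  summarise(energy = list(energy))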
data.table approach
sample data
#stolen from JasonAizkalns's answer
df <- data.frame(particle = c(rep("X",5), rep("Y",3), rep("Z",4)),
                 time = c(1:5, 1:3, 1:4),
                 state = c(c("A","A","B","C","A"), c("A","B","B"), c("B","C","A","A")),
                 energy = round(runif(12,0,10)))
df$energy <- c(9, 8, 7, 5, 0, 1, 7, 7, 3, 9, 5, 6)
code
library( data.table )
#create data.table
dt <- as.data.table(df)
#use `uniqlist` to get rownumbers where the value of `state` changes,
# then get these rows into a subset
result <- dt[ data.table:::uniqlist(dt[, c("particle", "state")]), ]
#split the resulting `energy`-column by the contents of the `particle`-column
l <- split( result$energy, result$particle)
# $X
# [1] 9 7 5 0
#
# $Y
# [1] 1 7
#
# $Z
# [1] 3 9 5
#create the final output
data.table( particle = names(l), energy = l )
# particle energy
# 1: X 9,7,5,0
# 2: Y 1,7
# 3: Z 3,9,5
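If you'd rather not rely on the internal data.table:::uniqlist() function, a sketch of the same subsetting using only exported functions (duplicated() plus rleid()):
result <- dt[!duplicated(dt[, .(particle, rleid(state))]), ]
l <- split(result$energy, result$particle)
data.table(particle = names(l), energy = l)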
Another possible data.table approach
library(data.table)
setDT(DF)[, .(energy=.(.SD[, first(energy), by=.(rleid(state))]$V1)), by=.(particle)]
output:
particle energy
1: X 9,4,6,9
2: Y 2,9
3: Z 7,6,1
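Broken into steps, the one-liner does roughly the following (a sketch on the same DF, defined in the data section below):
dt <- as.data.table(DF)
# first energy value within each run of identical states, per particle
runs <- dt[, .(first_energy = first(energy)), by = .(particle, run = rleid(state))]
# wrap each particle's values into a single list-column entry
runs[, .(energy = .(first_energy)), by = particle]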
data:
set.seed(0L)
DF <- data.frame(particle = c(rep("X",5), rep("Y",3), rep("Z",4)),
                 time = c(1:5, 1:3, 1:4),
                 state = c(c("A","A","B","C","A"), c("A","B","B"), c("B","C","A","A")),
                 energy = round(runif(12,0,10)))
DF
# particle time state energy
# 1 X 1 A 9
# 2 X 2 A 3
# 3 X 3 B 4
# 4 X 4 C 6
# 5 X 5 A 9
# 6 Y 1 A 2
# 7 Y 2 B 9
# 8 Y 3 B 9
# 9 Z 1 B 7
# 10 Z 2 C 6
# 11 Z 3 A 1
# 12 Z 4 A 2
I have a rather small dataset of 3 columns (id, date and distance) in which some dates may be duplicated (otherwise unique) because there is a second distance value associated with that date.
For those duplicated dates, how do I average the distances then replace the original distance with the averages?
Let's use this dataset as the model:
z <- data.frame(id=c(1,1,2,2,3,4),var=c(2,4,1,3,5,2))
# id var
# 1 2
# 1 4
# 2 1
# 2 3
# 3 5
# 4 2
The mean for id 1 is 3 and for id 2 is 2, which would then replace each of the original var values.
I've checked multiple questions to address this and have found related discussions. As a result, here is what I have so far:
# Check if any dates have two estimates (duplicate Epochs)
length(unique(Rdataset$Epoch)) == nrow(Rdataset)
# if 'TRUE' then each day has a unique data point (no duplicate Epochs)
# if 'FALSE' then duplicate Epochs exist, and the distances must be
# averaged for each duplicate Epoch
Rdataset$Distance <- ave(Rdataset$Distance, Rdataset$Epoch, FUN=mean)
Rdataset <- unique(Rdataset)
Then, with the distances for duplicate dates averaged and replaced, I wish to perform other functions on the entire dataset.
Here's a solution that doesn't bother to check whether the ids are duplicated: you don't actually need to, since for non-duplicated ids the mean of the single var value is just that value:
duplicated_ids = unique(z$id[duplicated(z$id)])
library(plyr)
z_deduped = ddply(
  z,
  .(id),
  function(df_section) {
    res_df = data.frame(id = df_section$id[1], var = mean(df_section$var))
  }
)
Output:
> z_deduped
id var
1 1 3
2 2 2
3 3 5
4 4 2
Unless I misunderstand:
library(plyr)
ddply(z, .(id), summarise, var2 = mean(var))
# id var2
# 1 1 3
# 2 2 2
# 3 3 5
# 4 4 2
Here is another answer in data.table style:
library(data.table)
z <- data.table(id = c(1, 1, 2, 2, 3, 4), var = c(2, 4, 1, 3, 5, 2))
z[, mean(var), by = id]
id V1
1: 1 3
2: 2 2
3: 3 5
4: 4 2
There is no need to treat unique values differently from duplicated values, as the mean of a single value is the value itself.
zt <- aggregate(var ~ id, data = z, mean)
zt
id var
1 1 3
2 2 2
3 3 5
4 4 2
This is a followup question to this question, initially inspired by this question, but not quite the same.
This is my situation. First I pull some data from a database,
df <- data.frame(id = c(1:6),
                 profession = c(1, 5, 4, NA, 0, 5))
df
# id profession
# 1 1
# 2 5
# 3 4
# 4 NA
# 5 0
# 6 5
Second, I pull a key-table with human readable information about the profession codes,
profession.codes <- data.frame(profession.code = c(1,2,3,4,5),
                               profession.label = c('Optometrists', 'Accountants',
                                                    'Veterinarians', 'Financial analysts',
                                                    'Nurses'))
profession.codes
# profession.code profession.label
# 1 Optometrists
# 2 Accountants
# 3 Veterinarians
# 4 Financial analysts
# 5 Nurses
Now, I would like to overwrite the profession variable in my df with the labels from profession.codes, preferably using join from the plyr package, but I'm open to any smart solution. Though I do like that join preserves the order of x.
I currently do it like this,
# install.packages('plyr', dependencies = TRUE)
library(plyr)
profession.codes$profession <- profession.codes$profession.code
df <- join(df, profession.codes, by="profession")
# levels(df$profession.label)
df$profession.label <- factor(df$profession.label,
                              levels = c(levels(df$profession.label),
                                         setdiff(df$profession, df$profession.code)))
# levels(df$profession.label)
df$profession.label[df$profession==0 ] <- 0
df$profession.code <- NULL
df$profession <- NULL
names(df) <- c("id", "profession")
df
# id profession
# 1 Optometrists
# 2 Nurses
# 3 Financial analysts
# 4 <NA>
# 5 0
# 6 Nurses
This is how I overwrite profession without losing the NA and the 0.
The problem is that the 0 could be a 17 or any number and I would like to account for that in some way. Furthermore, I would also like to shorten my code, if possible.
Any help would be greatly appreciated.
Thanks,
Eric
This is one approach in base R:
df <- data.frame(id = c(1:6),
                 profession = c(1, 5, 4, NA, 0, 5))
pc <- data.frame(profession.code = c(1,2,3,4,5),
                 profession.label = c('Optometrists', 'Accountants', 'Veterinarians',
                                      'Financial analysts', 'Nurses'))
df$new <- as.character(pc[match(df$profession, pc$profession.code), 'profession.label'])
df[is.na(df$new), 'new'] <- df[is.na(df$new), 'profession']
df$new <- as.factor(df$new)
df
Which yields:
id profession new
1 1 1 Optometrists
2 2 5 Nurses
3 3 4 Financial analysts
4 4 NA <NA>
5 5 0 0
6 6 5 Nurses
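If, as in the question, you want to end up with just id and a single profession column, one more step (a sketch building on the result above) drops the old code and renames the new column:
df$profession <- df$new
df$new <- NULL
df  # id plus the labelled profession column, with the NA and the 0 preserved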