Use mutate_at for variables that meet two criteria (dplyr, R)

I'm trying to reverse score (recode) some items in a dataframe. All reverse-scored items end in "r", and each scale has a unique prefix ("hc", "out", and "hm"). I would normally just select all variables that end with "r", but the issue is that some scales are on a 5-point scale ("hc" and "out") and others are on a 7-point scale ("hm").
Here is a sample of the much, much larger dataset:
library(tidyverse)
data <- tibble(name = c("Mike", "Ray", "Hassan"),
hc_1 = c(1, 2, 3),
hc_2r = c(5, 5, 4),
out_1r = c(5, 4, 2),
out_2 = c(2, 4, 5),
out_3r = c(2, 2, 1),
hm_1 = c(6, 7, 7),
hm_2r = c(7, 1, 7))
Let's say that I want to do this one scale at a time, so I start with hm, which is on a seven-point scale.
I want to try something like this with an & statement, but I get an error:
library(tidyverse)
library(car)
data %>%
mutate_at(vars(ends_with("r") & starts_with("hm")), ~(recode(., "1=7; 2=6; 3=5; 4=4; 5=3; 6=2; 7=1")))
Error: ends_with("r") & starts_with("hm") must evaluate to column positions or names, not a logical vector
What's a clean way to make it perform the reverse scoring on these few variables at a time? Once again, the dataset is too big to practically select individual variables one at a time.
Thanks!

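It would be easier to use matches() here, since a single regular expression can require both the "hm" prefix and the "r" suffix. The recode string below uses the car syntax, so call car::recode explicitly to avoid picking up dplyr::recode:
library(tidyverse)
data %>%
  mutate_at(vars(matches("^hm.*r$")),
            ~ car::recode(., "1=7; 2=6; 3=5; 4=4; 5=3; 6=2; 7=1"))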

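Because reverse scoring on a 1-to-k scale is just (k + 1) minus the score, the recode string can be avoided altogether. The following is a minimal sketch rather than the answerer's method; it assumes dplyr 1.0.0 or later (for across()) and a tidyselect version that allows combining selection helpers with &, and it assumes the scales run from 1 to 7 and 1 to 5 as described in the question.

```r
library(dplyr)

data <- tibble(
  name   = c("Mike", "Ray", "Hassan"),
  hc_1   = c(1, 2, 3),
  hc_2r  = c(5, 5, 4),
  out_1r = c(5, 4, 2),
  out_2  = c(2, 4, 5),
  out_3r = c(2, 2, 1),
  hm_1   = c(6, 7, 7),
  hm_2r  = c(7, 1, 7)
)

reversed <- data %>%
  # 7-point items: 8 - x maps 1 <-> 7, 2 <-> 6, and so on
  mutate(across(starts_with("hm") & ends_with("r"), ~ 8 - .x)) %>%
  # 5-point items: 6 - x maps 1 <-> 5, 2 <-> 4, and leaves 3 unchanged
  mutate(across(matches("^(hc|out).*r$"), ~ 6 - .x))

reversed
```

Items without the "r" suffix, such as hm_1 and out_2, are left untouched because neither selection matches them.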
Related

crosstab variables of tibble and make the output readable

I have to cross tabulate variables of tibble. I used table() for it, but the output is not easily readable.
Is there a way to format the output to make it more easily readable?
Thanks
library(tidyverse)
# random vectors of integers from 0 to 5
a <- sample(c(0, 1, 2, 3, 4, 5), replace = TRUE, size = 100)
b <- sample(c(0, 1, 2, 3, 4, 5), replace = TRUE, size = 100)
tbl <- tibble(a, b)
cross_tab <- table(tbl$a, tbl$b)
cross_tab
I use expss for these kinds of tables:
library(expss)
cro(tbl$a,tbl$b) %>% htmlTable()
You can precede the command above with the expss command apply_labels to format the variable and value names. See the documentation for details.
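If installing an extra package is not an option, base R can already make the table somewhat easier to read. This is only a sketch of that alternative, not part of the expss answer: naming the arguments to table() labels the two dimensions, and addmargins() appends row and column totals. The fixed seed is an assumption added so the example is reproducible.

```r
set.seed(1)  # assumption: fixed seed, only for reproducibility

a <- sample(0:5, size = 100, replace = TRUE)
b <- sample(0:5, size = 100, replace = TRUE)

# Named arguments become the dimension labels of the printed table
cross_tab <- table(a = a, b = b)

# Add a "Sum" row and a "Sum" column with the totals
with_totals <- addmargins(cross_tab)
with_totals

# Row percentages, rounded to one decimal place
round(100 * prop.table(cross_tab, margin = 1), 1)
```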

Calculate and add values to a data-frame

My dataset looks like this:
"userid","progress"
1, incomplete
2, complete
3, not attempted
4, incomplete
5, not attempted
6, complete
7, complete
8, complete
9, complete
10, incomplete
I want to make a pie chart showing the percentage of users whose status is complete, incomplete, or not attempted, that is, the number of users with each status divided by the total number of users.
This code is not working.
var1 = nrow(data1)/sum(data1$progress=="complete")
var2 = nrow(data1)/sum(data1$progress=="incomplete")
df <- data.frame(
val = c (var1, var2)
)
hchart(df, "pie")%>%hc_add_series_labels_values(values = df)
If you are trying to make a pie chart, most methods will do much of the work for you, so there is no need to calculate the percentages explicitly. The output of table() passed to pie() gives exactly what you want:
# Load your data (strip.white = TRUE drops the space after each comma)
ds <- read.csv(header = TRUE, strip.white = TRUE, text =
"userid,progress
1, incomplete
2, complete
3, not attempted
4, incomplete
5, not attempted
6, complete
7, complete
8, complete
9, complete
10, incomplete")
# Tabularize
tab <- table(ds$progress)
pie(tab) # Make piechart
As you see below, table counts the number of appearances for each level and returns a table object that behaves like a named integer vector. The nice thing here is that pie() computes the angles/areas from the relative frequencies and uses the names to label the chart.
print(tab)
#
#      complete    incomplete not attempted
#             5             3             2
If you insist on computing the percentages yourself, you can just use tab/sum(tab).
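As a small sketch of that last point, prop.table(tab) is equivalent to tab/sum(tab), and the resulting percentages can be pasted into the slice labels. The label format is an illustrative choice, not something the question asked for.

```r
progress <- c("incomplete", "complete", "not attempted", "incomplete",
              "not attempted", "complete", "complete", "complete",
              "complete", "incomplete")

tab <- table(progress)

# Same as 100 * tab / sum(tab)
pct <- round(100 * prop.table(tab), 1)

# Build labels such as "complete (50%)"
slice_labels <- paste0(names(pct), " (", pct, "%)")

pie(tab, labels = slice_labels)
```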
Edit: I see that you try to use the highcharter package. Why not use hcpie in that case? That function takes a factor as input:
library("highcharter")
hcpie(ds$progress)
Like this:
userid <- c(1,2,3,4,5,6,7,8,9,10)
progress <- c("incomplete","complete", "not attempted", "incomplete", "not attempted", "complete","complete","complete", "complete","incomplete")
df <- data.frame("userid"=userid, "progress"=progress)
df$progress <- as.factor(df$progress)
var1 = nrow(df[which(df$progress=="complete"), ])/nrow(df)
var2 = nrow(df[which(df$progress=="incomplete"), ])/nrow(df)
var3 = nrow(df[which(df$progress=="not attempted"), ])/nrow(df)
data <- c(var1, var2, var3)
pie(data, labels=c("complete","incomplete", "not attempted"))

R subset df based on multiple columns from another data frame

I am trying to find a more succinct way to filter a data frame using rows from another data frame (I am currently using a loop).
For example, suppose you have the following data frame df1 consisting of quantities of apples, pears, lemons and oranges. There is also a 5th column which we will call happiness.
library(gtools)  # permutations()
library(dplyr)   # %>%, filter(), bind_rows()
df1 <- data.frame(permutations(n = 4, r = 4, v = 1:4)) %>% cbind(sample(1:24))
colnames(df1) <- c("Apples", "Pears", "Lemons", "Oranges", "Happiness")
However you wish to filter this dataframe to leave only certain combinations of fruit which exist in a second data frame (not with the same column order):
df2 = data.frame(Apples = c(1, 3, 2, 4), Pears = c(4, 1, 1, 3), Lemons = c(2, 2, 3, 1), Oranges = c(3, 4, 4, 2))
Currently I am using a loop to apply each row of df2 as a filter condition one-by-one and then binding the result e.g:
df.ss = list()
for (i in seq_len(nrow(df2))) {
  # compare against row i of df2 only; without [i] the whole column is recycled
  df.ss[[i]] = filter(df1,
                      Apples == df2$Apples[i] &
                      Pears == df2$Pears[i] &
                      Lemons == df2$Lemons[i] &
                      Oranges == df2$Oranges[i])
}
df.ss %>% bind_rows()
Is there a more elegant way of going about this ?
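I think you are looking for an inner join. With no by argument, inner_join() matches on all columns the two data frames share by name, so the column order in df2 does not matter: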
dplyr::inner_join(df1, df2)
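A slightly more explicit variant is sketched below. It names the key columns in by, and also shows semi_join(), a filtering join that keeps rows of the first data frame with a match in the second without adding columns or duplicating rows. The small df1 here is made-up illustrative data (so the example runs without gtools), and df2 has its columns deliberately reordered.

```r
library(dplyr)

# Illustrative data, not the asker's full permutation table
df1 <- data.frame(
  Apples    = c(1, 1, 3, 2, 4, 4),
  Pears     = c(4, 2, 1, 1, 3, 1),
  Lemons    = c(2, 3, 2, 3, 1, 2),
  Oranges   = c(3, 4, 4, 4, 2, 3),
  Happiness = c(10, 20, 30, 40, 50, 60)
)

# Same combinations as in the question, with a different column order
df2 <- data.frame(
  Oranges = c(3, 4, 4, 2),
  Lemons  = c(2, 2, 3, 1),
  Pears   = c(4, 1, 1, 3),
  Apples  = c(1, 3, 2, 4)
)

keys <- c("Apples", "Pears", "Lemons", "Oranges")

joined <- inner_join(df1, df2, by = keys)  # matching rows of df1
kept   <- semi_join(df1, df2, by = keys)   # same rows, pure filter

kept
```

If df2 could contain duplicate combinations, semi_join() is the safer choice, since inner_join() would repeat the matching df1 rows.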

Finding outliers further than certain standard deviations from mean for a data frame in r

I understand that to find rows in a data frame that meet certain criteria (ie. filtering data) I would use a code similar to:
s[(s$age < 20 | s$age > 40)]
But how would I go about finding the outlier rows whose 'age' values are more than 1 standard deviation above or below the mean?
s <- data.frame(
sample = c("s_1", "s_2", "s_3", "s_4", "s_5", "s_6", "s_7", "s_8"),
flavor = c("original", "chicken", "original", "original", "cheese", "chicken", "cheese", "original"),
age = c(23, 25, 11, 5, 6, 44, 50, 2),
scale = c( 4, 3, 2, 5, 4, 3, 1, 5))
If you want to select the outliers based on the mean and standard deviation of the full data, it's straightforward:
s[s$age < mean(s$age) - sd(s$age) | s$age > mean(s$age) + sd(s$age), ]
This uses the base function sd. Also since you stated you want to select rows of a data.frame, I added a , to the indexing so it will return all columns.
If you want a continuous, filtering-like approach, you can use the apply-family functionality as mentioned by @Sotos.
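The same selection can be written in terms of z-scores, which makes the threshold easy to change. This is a sketch of an equivalent formulation; the names k, z, outliers and inliers are illustrative and not from the original answer.

```r
s <- data.frame(
  sample = c("s_1", "s_2", "s_3", "s_4", "s_5", "s_6", "s_7", "s_8"),
  flavor = c("original", "chicken", "original", "original",
             "cheese", "chicken", "cheese", "original"),
  age    = c(23, 25, 11, 5, 6, 44, 50, 2),
  scale  = c(4, 3, 2, 5, 4, 3, 1, 5)
)

k <- 1  # number of standard deviations that counts as an outlier

# Distance of each age from the mean, in units of standard deviations
z <- (s$age - mean(s$age)) / sd(s$age)

outliers <- s[abs(z) > k, ]   # rows further than k sd from the mean
inliers  <- s[abs(z) <= k, ]  # the remaining rows

outliers
```

With this data the mean age is 20.75 and the standard deviation is about 18.25, so the rows selected are s_6, s_7 and s_8.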

Combining frequency tables in R

I have a vector containing the frequencies of molecules within their respective molecular class for all molecules measured. I also have a vector that contains the per class frequency of significant molecules identified by variable selection. How can I merge these 2 vectors into a data frame and fill in empty frequencies with 0's (in R)?
Here is a workable example:
full = rep(letters[1:4], 4:7)
fullTable = table(full)
sub = rep(letters[1:2], c(2, 4))
subTable = table(sub)
I would like the table to look like:
print(data.frame(Letter=letters[1:4], fullFreq=c(4, 5, 6, 7), subFreq=c(2, 4, 0, 0)))
Try this (I suppose you meant subTable=table(sub) in your last line):
res<-merge(as.data.frame(fullTable),as.data.frame(subTable),by.x=1,by.y=1,all=TRUE)
colnames(res)<-c("Letter","fullFreq","subFreq")
res[is.na(res)]<-0
With the dplyr package:
library(dplyr)
full=rep(letters[1:4], 4:7)
sub=rep(letters[1:2], c(2,4))
df <- data.frame(Letter=unique(c(full, sub)))
df <- df %>%
left_join(as.data.frame(table(full)), by=c("Letter"="full")) %>%
left_join(as.data.frame(table(sub)), by=c("Letter"="sub"))
df[is.na(df)] <- 0
df
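Another option, sketched here as an alternative to the merge and join approaches above, is to tabulate both vectors against a shared set of factor levels. table() then reports a count of 0 for any level that does not occur, so no NA replacement is needed. The name lev is illustrative.

```r
full <- rep(letters[1:4], 4:7)
sub  <- rep(letters[1:2], c(2, 4))

# Every class that appears in either vector
lev <- sort(unique(c(full, sub)))

res <- data.frame(
  Letter   = lev,
  # factor() with explicit levels makes table() count missing levels as 0
  fullFreq = as.vector(table(factor(full, levels = lev))),
  subFreq  = as.vector(table(factor(sub, levels = lev)))
)

res
```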
