How to create a vector of variables using summarise? - r

Here is a beginner's problem:
I like using summarise(), but sometimes I find it difficult storing the results.
For example, I know I can store 1 value in the following manner:
stdv <- Data %>%
filter(x == 1) %>%
summarise(stdv = sd(y))
But I get in trouble if I try to do so for more than 1 variable.
I think it has something to do with creating a vector of variables at the beginning, but this doesn't work:
c(dog, cat) <- Data %>%
filter(x == 1) %>%
summarise(dog = sd(y),
cat = mean(y))
Can someone help? Thank ya

You can store it in a vector like this:
save_vector <- df %>%
summarise(dog = sd(id),
cat = var(id)) %>%
unlist()
save_vector
# dog cat
#1.636392 2.677778
Data
df <- structure(list(id = c("1", "4", "3", "4", "6", "3", "5", "6",
"2", "3")), row.names = c(NA, -10L), class = c("tbl_df", "tbl",
"data.frame"))

We can use base R methods
with(df, c(dog = sd(id), cat = var(id)))
# dog cat
#1.636392 2.677778
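Building on the base R approach, here is a minimal sketch of the full round trip from the question: filter on x, compute both statistics, then pull each value out by name. The Data values are made up for illustration, and subset_x1, dog_sd and cat_mean are hypothetical names.

```r
# Made-up data standing in for the asker's Data (columns x and y are assumed)
Data <- data.frame(x = c(1, 1, 1, 2, 2), y = c(2, 4, 6, 8, 10))

# Keep only the rows where x == 1
subset_x1 <- Data[Data$x == 1, ]

# Compute both statistics at once as a named numeric vector
stats <- c(dog = sd(subset_x1$y), cat = mean(subset_x1$y))

# Pull out individual values by name when separate variables are needed
dog_sd   <- stats[["dog"]]
cat_mean <- stats[["cat"]]
```

Keeping the results in one named vector avoids the invalid c(dog, cat) <- ... assignment, since R has no built-in multiple assignment.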
data
df <- structure(list(id = c("1", "4", "3", "4", "6", "3", "5", "6",
"2", "3")), row.names = c(NA, -10L), class = c("tbl_df", "tbl",
"data.frame"))

Related

Loop for creating multiple new 3 level variables from another 5 level variable

I'm looking for a way to generate multiple 3-level variables from an older 5-level variable, while keeping the old variables.
This is how it is now:
structure(list(Quesiton1 = c("I", "5", "4", "4"), Question2 = c("I",
"5", "4", "4"), Question3 = c("I", "3", "2", "4")), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -4L))
I would like this:
structure(list(Quesiton1 = c("I", "5", "4", "4"), Question2 = c("I",
"5", "4", "4"), Question3 = c("I", "3", "2", "4"), Question1_3l = c("NA",
"3", "3", "3"), Question2_3l = c("NA", "3", "3", "3"), Question3_3l = c("NA",
"2", "1", "3")), row.names = c(NA, -4L), class = c("tbl_df",
"tbl", "data.frame"))
I have this code to recode the 5-level variable
df2 %>%
mutate_at(vars(Question1, Question2, Question3), recode,'1'=1, '2'=1, '3'=3, '4'=5, '5'=5, 'l' = NA)
But what I want to do is to keep the old variable and generate the 3 level variable into something like Question1_3l, Question2_3l, Question3_3l.
It shouldn't be too difficult. In Stata it looks something like this:
foreach i of varlist ovsat-not_type_number {
local lbl : variable label `i'
recode `i' (1/2=1)(3=2)(4/5=3), gen(`i'_3l)
}
Thank you.
Not the most elegant, not the fastest (but still pretty fast), not the most idiomatic, but this does what you want (I think) and should be easy to read and customize.
dt <- structure(list(Quesiton1 = c("I", "5", "4", "4"),
Question2 = c("I", "5", "4", "4"),
Question3 = c("I", "3", "2", "4")),
class = c("tbl_df", "tbl", "data.frame"),
row.names = c(NA, -4L))
library(data.table)
#transform your data into a data.table
setDT(dt)
#define the names of the columns that you want to recode
vartoconv <- names(dt)
#define the names of the recoded columns
newnames <- paste0(vartoconv, "_3l")
#define an index along the vector of the names of the columns to recode
for(varname_loopid in seq_along(vartoconv)){
#identify the name of the column to recode for each iteration
varname_loop <- vartoconv[varname_loopid]
#identify the name of the recoded column for each iteration
newname_loop <- newnames[varname_loopid]
#create the recoded variable by using conditionals on the variable to recode
dt[get(varname_loop) %in% c(1, 2), (newname_loop) := 1]
dt[get(varname_loop) == 3, (newname_loop) := 2]
dt[get(varname_loop) %in% c(4, 5), (newname_loop) := 3]
}
Try:
library(tidyverse)
library(stringr)
df2 <- replicate(6, sample(as.character(1:5), 50, replace = TRUE), simplify = "matrix") %>%
as_tibble(.name_repair = ~str_c("Question", 1:6))
df2 %>%
mutate_at(vars(Question1:Question3),
~case_when(.x %in% c('1', '2') ~ 1L, # 1L means integer 1
.x %in% c('3') ~ 3L,
.x %in% c('4', '5') ~ 5L,
TRUE ~ as.integer(NA)))
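The mutate_at() call above overwrites the original columns. If the old columns should be kept, as the question asks, one possible sketch uses across() with its .names argument (available in dplyr 1.0.0 and later) to write the recoded values into new columns with a _3l suffix. The example data and the 1/2/3 target codes below are assumptions taken from the question's Stata snippet, not from this answer.

```r
library(dplyr)

# Small example data shaped like the question's (values are illustrative)
df2 <- data.frame(
  Question1 = c("I", "5", "4", "4"),
  Question2 = c("I", "5", "4", "4"),
  Question3 = c("I", "3", "2", "4"),
  stringsAsFactors = FALSE
)

recoded <- df2 %>%
  mutate(across(
    Question1:Question3,
    ~ case_when(
      .x %in% c("1", "2") ~ 1L,          # levels 1-2 collapse to 1
      .x == "3"           ~ 2L,          # level 3 becomes 2
      .x %in% c("4", "5") ~ 3L,          # levels 4-5 collapse to 3
      TRUE                ~ NA_integer_  # anything else (e.g. "I") is NA
    ),
    .names = "{.col}_3l"                 # keep originals, add *_3l columns
  ))
```

The original Question1 to Question3 columns stay untouched, and recoded gains Question1_3l, Question2_3l and Question3_3l.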

Row wise parallel Processing in R?

I am working on large data sets, for which I have written code that performs a row-by-row operation on a data frame sequentially. The process is slow.
I am trying to perform the operation using parallel processing to make it faster.
Here is the code:
library(geometry)
# Data set - a
data_a = structure(c(10.4515034409741, 15.6780890052356, 12.5581992918563,
9.19067944250871, 14.4459166666667, 11.414, 17.65325, 12.468,
11.273, 15.5945), .Dim = c(5L, 2L), .Dimnames = list(c("1", "2",
"3", "4", "5"), c("a", "b")))
# Data set - b
data_b = structure(c(10.4515034409741, 15.6780890052356, 12.5581992918563,
9.19067944250871, 14.4459166666667, 11.3318076923077, 13.132273830156,
6.16003995082975, 11.59114820435, 10.9573192090395, 11.414, 17.65325,
12.468, 11.273, 15.5945, 11.5245, 12.0249, 6.3186, 13.744, 11.0921), .Dim = c(10L,
2L), .Dimnames = list(c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10"), c("a",
"b")))
conv_hull_1 <- convhulln( data_a, options = "FA") # Draw Convex Hull
test = c()
for (i in 1:nrow(data_b)){
df = c()
con_hull_all <- inhulln(conv_hull_1, matrix(data_b[i,], ncol = 2))
df$flag <- ifelse(con_hull_all[1] == TRUE , 0 , ifelse(con_hull_all[1] == FALSE , 1, 2))
test <- as.data.frame(rbind(test, df))
print(i)
}
test
Is there any way to parallelize the row-wise computation?
As you can observe, for small datasets the computation time is really low, but as soon as I increase the data size, the computation time increases drastically.
Can you provide a solution with code?
Thanks in advance.
You could take advantage of the second argument of the inhulln function. It accepts a matrix of points, so more than one row of points can be tested in a single call, and no loop is needed.
I've tried the code below on a 320,000 row matrix that I made from the original data and it's quick.
library(geometry)
library(dplyr)
# Data set - a
data_a = structure(
c(
10.4515034409741,
15.6780890052356,
12.5581992918563,
9.19067944250871,
14.4459166666667,
11.414,
17.65325,
12.468,
11.273,
15.5945
),
.Dim = c(5L, 2L),
.Dimnames = list(c("1", "2",
"3", "4", "5"), c("a", "b"))
)
# Data set - b
data_b = structure(
c(
10.4515034409741,
15.6780890052356,
12.5581992918563,
9.19067944250871,
14.4459166666667,
11.3318076923077,
13.132273830156,
6.16003995082975,
11.59114820435,
10.9573192090395,
11.414,
17.65325,
12.468,
11.273,
15.5945,
11.5245,
12.0249,
6.3186,
13.744,
11.0921
),
.Dim = c(10L,
2L),
.Dimnames = list(c(
"1", "2", "3", "4", "5", "6", "7", "8", "9", "10"
), c("a",
"b"))
)
conv_hull_1 <- convhulln( data_a, options = "FA") # Compute the convex hull
#Make a big data_b
for (i in 1:15) {
data_b = rbind(data_b, data_b)
}
In_Or_Out <- inhulln(conv_hull_1, data_b)
result <- data.frame(data_b) %>% bind_cols(InOrOut=In_Or_Out)
I use dplyr::bind_cols to bind the in or out result to a data frame version of the original data so you might need some changes for your specific environment.
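The loop in the question stores a 0/1 flag (0 for inside the hull, 1 for outside) rather than TRUE/FALSE. A minimal sketch of that conversion, using a made-up logical vector as a stand-in for the inhulln() result so that it runs without the geometry package:

```r
# Hypothetical stand-in for the logical vector returned by inhulln()
In_Or_Out <- c(TRUE, FALSE, TRUE, TRUE, FALSE)

# Inside the hull -> 0, outside -> 1, matching the flag in the question
flag <- as.integer(!In_Or_Out)

# One row per tested point, with its flag
result <- data.frame(point = seq_along(In_Or_Out), flag = flag)
```

Because the conversion is vectorised, it costs almost nothing even for hundreds of thousands of rows.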

Reorder Stacked Bar Chart

newbie R coder here. I have a stacked bar chart in base R that I'd like to reorder numerically by question type (Question 1 Pre, Question 1 Post, Question 2 Pre, Question 2 Post, etc.)
It's probably a fairly simple fix but I can't seem to get the reorder function to work. The other questions on reordering don't quite get to my solution. Maybe reorder isn't the right way to go about it?
Attached my graph and base code. Thank you so much! I appreciate your kind help.
if(!require(psych)){install.packages("psych")}
if(!require(likert)){install.packages("likert")}
library(readxl)
setwd("MSSE 507 Capstone Data Analysis/")
read_xls("ProcessDataMSSE.xls")
Data = read_xls("ProcessDataMSSE.xls")
str(Data) # tbl_df, tbl, and data.frame classes
### Change Likert scores to factor and specify levels; factors because numeric values are ordinal
Data <- Data[, c(3:26)] # Get rid of the other columns! (Drop multiple columns)
Data$`1Pre` <- factor(Data$`1Pre`,
levels = c("1", "2", "3", "4"),
ordered = TRUE)
Data$`1Post` = factor(Data$`1Post`,
levels = c("1", "2", "3", "4"),
ordered = TRUE)
Data$`2Pre` <- factor(Data$`2Pre`,
levels = c("1", "2", "3", "4"),
ordered = TRUE)
Data$`2Post` = factor(Data$`2Post`,
levels = c("1", "2", "3", "4"),
ordered = TRUE)
Data$`3Pre` <- factor(Data$`3Pre`,
levels = c("1", "2", "3", "4"),
ordered = TRUE)
Data$`3Post` = factor(Data$`3Post`,
levels = c("1", "2", "3", "4"),
ordered = TRUE)
Data$`4Pre` <- factor(Data$`4Pre`,
levels = c("1", "2", "3", "4"),
ordered = TRUE)
Data$`4Post` = factor(Data$`4Post`,
levels = c("1", "2", "3", "4"),
ordered = TRUE)
Data$`5Pre` <- factor(Data$`5Pre`,
levels = c("1", "2", "3", "4"),
ordered = TRUE)
Data$`5Post` = factor(Data$`5Post`,
levels = c("1", "2", "3", "4"),
ordered = TRUE)
Data$`6Pre` <- factor(Data$`6Pre`,
levels = c("1", "2", "3", "4"),
ordered = TRUE)
Data$`6Post` = factor(Data$`6Post`,
levels = c("1", "2", "3", "4"),
ordered = TRUE)
Data$`7Pre` <- factor(Data$`7Pre`,
levels = c("1", "2", "3", "4"),
ordered = TRUE)
Data$`7Post` = factor(Data$`7Post`,
levels = c("1", "2", "3", "4"),
ordered = TRUE)
Data$`8Pre` <- factor(Data$`8Pre`,
levels = c("1", "2", "3", "4"),
ordered = TRUE)
Data$`8Post` = factor(Data$`8Post`,
levels = c("1", "2", "3", "4"),
ordered = TRUE)
Data$`9Pre` <- factor(Data$`9Pre`,
levels = c("1", "2", "3", "4"),
ordered = TRUE)
Data$`9Post` = factor(Data$`9Post`,
levels = c("1", "2", "3", "4"),
ordered = TRUE)
Data$`10Pre` <- factor(Data$`10Pre`,
levels = c("1", "2", "3", "4"),
ordered = TRUE)
Data$`10Post` = factor(Data$`10Post`,
levels = c("1", "2", "3", "4"),
ordered = TRUE)
Data$`11Pre` <- factor(Data$`11Pre`,
levels = c("1", "2", "3", "4"),
ordered = TRUE)
Data$`11Post` = factor(Data$`11Post`,
levels = c("1", "2", "3", "4"),
ordered = TRUE)
Data$`12Pre` <- factor(Data$`12Pre`,
levels = c("1", "2", "3", "4"),
ordered = TRUE)
Data$`12Post` = factor(Data$`12Post`,
levels = c("1", "2", "3", "4"),
ordered = TRUE)
Data <- factor(Data,levels=Data[3:26])
Data
### Double check the data frame
library(psych) # Loads psych package
headTail(Data) # Displays last few and first few data
str(Data) # Shows structure of an object (observations and variables, etc.) - in this case, ordinal factors with 4 levels (1 through 4)
summary(Data) # Summary of the number of times you see a data point
Data$`1Pre` # This allows us to check how many data points are really there
str(Data)
### Remove unnecessary objects, removing the data frame in this case (we've converted that data frame into a table with the read.table function above)
library(likert)
Data <- as.data.frame(Data) # Makes the tibble a data frame
likert(Data) # This will give the percentage responses for each level and group
Result = likert(Data)
summary(Result) # This will give the mean and SD
plot(Result,
main = "Pre and Post Treatment Percentage Responses",
ylab="Questions",
type="bar")
I largely agree with @DzimitryM's solution. It is unclear to me, however, whether it really works. In my solution, I need to use the items variable of the data.frame, not the data.frame as such. A comment at the bottom of the code below highlights this.
Anyway, this is the reason I made a working example with executable code.
I am aware that it could be improved; my focus was on executability.
library(likert)
### mimic some of your data, with 'accepted' naming
Data <- data.frame(
C01Pre = as.character(c( rep(1, 10), rep(2, 60), rep(3, 25), rep(4, 5) )),
C01Post = as.character(c( rep(1, 25), rep(2, 52), rep(3, 21), rep(4, 2) )),
C02Pre = as.character(c( rep(1, 25), rep(2, 68), rep(3, 5), rep(4, 2) )),
C02Post = as.character(c( rep(1, 30), rep(2, 53), rep(3, 13), rep(4, 4) )),
C03Pre = as.character(c( rep(1, 20), rep(2, 52), rep(3, 25), rep(4, 3) )),
C03Post = as.character(c( rep(1, 20), rep(2, 39), rep(3, 35), rep(4, 6) ))
)
### coerce to ordered factor
Data$C01Pre <- factor(Data$C01Pre, levels = c("1", "2", "3", "4"), ordered = TRUE)
Data$C01Post <- factor(Data$C01Post, levels = c("1", "2", "3", "4"), ordered = TRUE)
Data$C02Pre <- factor(Data$C02Pre, levels = c("1", "2", "3", "4"), ordered = TRUE)
Data$C02Post <- factor(Data$C02Post, levels = c("1", "2", "3", "4"), ordered = TRUE)
Data$C03Pre <- factor(Data$C03Pre, levels = c("1", "2", "3", "4"), ordered = TRUE)
Data$C03Post <- factor(Data$C03Post, levels = c("1", "2", "3", "4"), ordered = TRUE)
Result = likert(Data)
### show the "natural" order when processed by likert()
summary(Result)
# Item low neutral high mean sd
# 6 C03Post 59 0 41 2.27 0.8510837
# 1 C01Pre 70 0 30 2.25 0.7017295
# 5 C03Pre 72 0 28 2.11 0.7506899
# 2 C01Post 77 0 23 2.00 0.7385489
# 4 C02Post 83 0 17 1.91 0.7666667
# 3 C02Pre 93 0 7 1.84 0.5983141
plot(Result,
group.order = names(Result$items)) ## this is the key!
## difference with other answer is:
## names of the "items" variable of the df
## not the data.frame itself
This results in the following graph:
The group.order option can be added to plot() to get a plot that is ordered by the column names of the initial dataset:
plot(Result,
group.order = names(Data),
type="bar")
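Both the question and the longer answer convert each column to an ordered factor with one statement per column. A shorter sketch, assuming every column shares the same four levels, applies factor() to all columns at once with lapply(). The three columns below are made-up sample data, and likert_levels is a hypothetical name.

```r
# Made-up sample with the question's column naming (check.names = FALSE
# keeps names such as `1Pre` that start with a digit)
Data <- data.frame(
  `1Pre`  = c("1", "2", "4"),
  `1Post` = c("2", "3", "4"),
  `2Pre`  = c("1", "1", "3"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

likert_levels <- c("1", "2", "3", "4")

# Data[] keeps the data.frame structure and column order while every
# column is replaced by its ordered-factor version
Data[] <- lapply(Data, factor, levels = likert_levels, ordered = TRUE)
```

The column order is preserved, so names(Data) can still be used for group.order afterwards.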

3D euclidean distance to identify unknown samples

I have this dataframe called mydf where I have three principal covariates (PCA.1,PCA.2, PCA.3). I want to get the 3d distance matrix and get the shortest euclidean distance between all the compared Samples. In another dataframe called myref, I have some known identity of Samples and some unknown samples. By calculating the shortest euclidean distance from mydf, I want to assign the known Identity to the unknown samples. Can someone please help me get this done.
mydf
mydf <- structure(list(Sample = c("1", "2", "4", "5", "6", "7", "8",
"9", "10", "12"), PCA.1 = c(0.00338, -0.020373, -0.019842, -0.019161,
-0.019594, -0.019728, -0.020356, 0.043339, -0.017559, -0.020657
), PCA.2 = c(0.00047, -0.010116, -0.011532, -0.011582, -0.013245,
-0.011751, -0.010299, -0.005801, -0.01, -0.011334), PCA.3 = c(-0.008787,
0.001412, 0.003751, 0.00371, 0.004242, 0.003738, 0.000592, -0.037229,
0.004307, 0.00339)), .Names = c("Sample", "PCA.1", "PCA.2", "PCA.3"
), row.names = c(NA, 10L), class = "data.frame")
myref
myref<- structure(list(Sample = c("1", "2", "4", "5", "6", "7", "8",
"9", "10", "12"), Identity = c("apple", "unknown", "ball", "unknown",
"unknown", "car", "unknown", "cat", "unknown", "dog")), .Names = c("Sample",
"Identity"), row.names = c(NA, 10L), class = "data.frame")
uIX = which(myref$Identity == "unknown")
dMat = as.matrix(dist(mydf[, -1])) # Calculate the Euclidean distance matrix
nn = apply(dMat, 1, order)[2, ] # For each row of dMat, order the distances from smallest to largest.
# Select the nearest neighbor (it is row 2, because row 1 is the sample itself at distance 0)
myref$Identity[uIX] = myref$Identity[nn[uIX]]
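Note that the above code may leave some identities as unknown, because the nearest neighbor of an unknown sample can itself be unknown. If instead you want to match to the nearest neighbor with a known identity, add the following line right after dMat is computed, and take row 1 instead of row 2 of the ordering (the distance from an unknown sample to itself is then Inf, so it no longer sorts first):
dMat[uIX, uIX] = Inf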
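Another way to express the known-only matching is to compare each unknown sample only against the known ones, by subsetting the distance matrix to unknown rows and known columns. This is a sketch on made-up toy data, not the data from the question; known, unknown, sub_dist and nearest_known are hypothetical names.

```r
# Toy data: samples a/b sit close together, as do c/d (values are made up)
mydf <- data.frame(
  Sample = c("a", "b", "c", "d"),
  PCA.1  = c(0, 0.1, 5, 5.2),
  PCA.2  = c(0, 0, 0, 0.1),
  PCA.3  = c(0, 0.1, 0, 0)
)
myref <- data.frame(
  Sample   = c("a", "b", "c", "d"),
  Identity = c("apple", "unknown", "ball", "unknown"),
  stringsAsFactors = FALSE
)

known   <- which(myref$Identity != "unknown")
unknown <- which(myref$Identity == "unknown")

# Full Euclidean distance matrix over the three PCA columns
dMat <- as.matrix(dist(mydf[, c("PCA.1", "PCA.2", "PCA.3")]))

# Rows: unknown samples, columns: known samples only
sub_dist <- dMat[unknown, known, drop = FALSE]

# For each unknown sample, the row index of its closest known sample
nearest_known <- known[apply(sub_dist, 1, which.min)]

myref$Identity[unknown] <- myref$Identity[nearest_known]
```

With this toy data, b is assigned apple and d is assigned ball.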

time series barplot in R

I have time series data like this:
x <- structure(list(date = structure(c(1264572000, 1266202800, 1277362800,
1277456400, 1277859600, 1278032400, 1260370800, 1260892800, 1262624400,
1262707200), class = c("POSIXt", "POSIXct"), tzone = ""), data = c(-0.00183760994446658,
0.00089738603087497, 0.000423513598318936, 0, -0.00216496690393131,
-0.00434836817931339, -0.0224199153445617, 0.000583823085470003,
0.000353088613905206, 0.000470295331234771)), .Names = c("date",
"data"), row.names = c("1", "2", "3",
"4", "5", "6", "7", "8", "9", "10"
), class = "data.frame")
and I would like to make a barplot of this dataset in which each bar stands for one date (if there is no data for a timespan, there should be a gap).
Can anyone help me?
Using ggplot: (Note that you have to provide stat="identity" to geom_bar to prevent it from summarising the data and creating a histogram).
library(ggplot2)
ggplot(x, aes(x=date, y=data)) + geom_bar(stat="identity")
And if you are inclined to use base graphics:
plot(x$date, x$data, type="h")
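If barplot() itself is preferred, note that it spaces bars evenly and so does not show gaps on its own. One possible sketch is to merge the data onto a complete daily sequence, so that missing days become NA and are drawn as empty slots. The dates and values below are made up, and the sketch assumes the timestamps have already been reduced to Date (for example with as.Date()); all_days and filled are hypothetical names.

```r
# Made-up daily data with one missing day (2010-01-28)
x <- data.frame(
  date = as.Date(c("2010-01-27", "2010-01-29", "2010-01-30")),
  data = c(-0.0018, 0.0009, 0.0004)
)

# Every calendar day between the first and last observation
all_days <- data.frame(date = seq(min(x$date), max(x$date), by = "day"))

# Left join: days without an observation get NA in the data column
filled <- merge(all_days, x, by = "date", all.x = TRUE)

# NA heights are left blank, which produces the gap
barplot(filled$data, names.arg = format(filled$date, "%m-%d"))
```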
