Subsetting a data frame in R to include observations that satisfy a condition

I would like to randomly subset a data frame, with the condition that if one observation with alpha = 1 is included in a subset, then every observation with alpha = 1 must be included in that subset. Simplified, the data look like this:
df
alpha beta gamma
1 5 2
1 6 3
1 5 3
2 3 2
2 5 9
2 2 6
3 3 4
3 4 7
3 3 8
4 3 4
4 8 3
4 4 9
5 9 8
5 5 5
5 3 5
What command should I use to get subsets like the following?
df1
alpha beta gamma
1 5 2
1 6 3
1 5 3
3 3 4
3 4 7
3 3 8
5 9 8
5 5 5
5 3 5
df2
alpha beta gamma
2 3 2
2 5 9
2 2 6
4 3 4
4 8 3
4 4 9
5 9 8
5 5 5
5 3 5
df3
alpha beta gamma
1 5 2
1 6 3
1 5 3
2 3 2
2 5 9
2 2 6
5 9 8
5 5 5
5 3 5
Specifically, the first observation in df, (1, 5, 2), happened to fall into subsets df1 and df3. If so, the 2nd and 3rd observations in df, (1, 6, 3) and (1, 5, 3), must also be included in df1 and df3.
I hope that my question is clear. Please help.

Try this
str <- "alpha,beta,gamma
1,5,2
1,6,3
1,5,3
2,3,2
2,5,9
2,2,6
3,3,4
3,4,7
3,3,8
4,3,4
4,8,3
4,4,9
5,9,8
5,5,5
5,3,5"
df <- read.csv(textConnection(str))
df[df$alpha %in% sample(unique(df$alpha), 3), ]  # keep every row of 3 randomly chosen alpha groups
Output
alpha beta gamma
4 2 3 2
5 2 5 9
6 2 2 6
10 4 3 4
11 4 8 3
12 4 4 9
13 5 9 8
14 5 5 5
15 5 3 5
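If you need several such subsets (df1, df2, df3, ...), one option is to wrap the same idea in a small helper. This is only a sketch, and the function name sample_groups is my own, not from the original answer:
sample_groups <- function(data, n_groups = 3) {
  keep <- sample(unique(data$alpha), n_groups)   # pick which alpha groups to keep
  data[data$alpha %in% keep, ]                   # keep every row of those groups
}

set.seed(42)  # only for reproducible draws
subsets <- replicate(3, sample_groups(df), simplify = FALSE)  # a list of df1-style subsets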

Related

Frequency Table of Categorical Variables as a Data Frame in R

I would like to create a frequency table of all categorical variables as a data frame in R: the frequency and percentage of each survey response, grouped by condition, together with the total frequency, returned as a single data frame.
An example of the desired frequency count for just ONE variable ("q1") is shown in the attached image; I want a similar frequency count for most of the variables in my data.
I have data such as this (the actual data has many more categorical variables):
library(readr)
data_in <- read_table2("treatment_cur q13_3 q14_1 q14_2 q14_3 q14_4 q14_5 q14_6 q14_7 q14_8 q14_9 q14_10 q14_11 q14_12 q14_13 q14_14 q14_15
Control 3 2 3 6 5 6 6 6 4 5 5 5 4 6 6 5
Control 2 4 5 6 5 6 5 5 6 4 5 5 6 5 4 6
Treatment 3 1 2 6 4 6 5 4 6 4 6 1 5 6 4 6
Control 3 2 3 6 4 6 6 6 6 6 6 6 6 5 5 6
Control NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
Control 4 6 5 6 5 6 5 6 6 5 1 1 6 5 5 6
Control 3 3 2 2 3 3 6 6 4 6 5 5 3 6 6 2
Treatment 2 3 2 3 1 3 1 1 1 3 3 3 3 3 3 1
Control 3 5 5 6 3 6 3 3 3 2 2 1 4 2 3 4
Control 2 1 1 1 1 1 4 4 1 1 1 1 1 4 4 2
Control 4 3 4 6 6 6 6 6 6 6 6 6 6 6 6 6
Control 4 2 6 6 4 6 5 6 6 5 6 5 6 6 6 6
Control 2 2 3 3 2 3 5 6 5 3 3 3 3 5 3 2
Control 3 2 4 3 4 5 4 4 5 3 3 5 4 5 5 4
Treatment 2 2 2 2 2 3 1 1 2 2 3 2 3 3 2 3
Control 4 3 3 3 5 6 6 6 6 6 6 6 6 6 6 6
Treatment 2 1 3 3 2 1 3 4 2 2 3 3 2 3 3 3
Treatment 4 2 6 4 4 2 3 5 4 5 1 1 5 4 4 5
Control 3 3 3 4 4 4 4 5 3 2 5 4 5 5 4 4
Control 4 6 6 6 6 6 6 6 6 6 6 6 5 6 6 5
Control 2 2 3 6 2 5 1 2 4 4 1 1 6 4 4 6
Treatment 4 3 3 6 6 6 6 6 6 6 6 6 6 6 6 6
Treatment 4 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
Treatment 1 1 2 4 4 4 1 1 1 1 1 1 6 1 1 6
Treatment 3 2 3 3 2 6 6 6 6 3 3 2 4 5 5 6
Control 2 1 1 1 1 1 1 2 1 1 1 1 1 2 2 1
Control 1 3 3 3 1 1 5 5 2 4 5 5 4 1 2 5
Treatment 3 4 4 5 5 4 4 4 3 5 3 4 4 6 6 5
Control NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
Control 2 2 4 6 2 4 2 2 3 5 4 4 4 3 3 5
Treatment 1 1 2 1 1 1 1 1 6 1 1 1 6 2 3 6
Treatment 2 6 1 4 4 1 1 2 2 2 1 2 1 2 2 2
Treatment 3 3 4 4 4 6 6 5 4 6 3 5 5 6 6 4
Treatment 2 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3
Control 4 3 4 6 4 6 4 5 6 3 4 4 6 6 4 6
Control 4 4 3 6 2 5 2 2 4 3 1 6 5 5 5 5
Control NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
Treatment 2 3 3 6 5 6 1 2 6 5 4 4 5 5 5 6
Control 4 6 6 6 6 6 5 5 5 5 5 6 5 5 5 5
Treatment 2 1 1 3 1 3 4 4 4 4 1 4 3 4 4 4
Treatment 2 1 3 3 3 3 4 6 5 4 5 5 4 6 6 5
Control 4 6 6 6 6 6 5 5 5 6 6 5 5 5 6 6
Control NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
Control 4 2 2 4 2 4 6 6 6 6 4 6 5 6 6 5
Control 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Treatment 3 4 2 5 5 5 6 5 5 5 5 5 5 6 6 6
Control NA 2 4 4 4 4 4 3 4 6 4 5 4 6 4 4
Control 2 2 2 3 1 3 4 1 1 1 2 1 3 3 3 3
Treatment 2 2 2 3 2 2 3 3 2 2 2 2 2 2 2 2
Control 3 3 3 6 6 6 6 6 6 6 5 6 6 6 6 6
Treatment 2 1 2 2 2 1 2 2 1 1 2 1 2 2 1 3
Treatment 4 5 5 6 6 5 5 6 5 5 4 5 5 4 4 5
Control 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
Treatment 3 3 4 4 4 6 3 2 5 3 2 2 5 6 5 6
Control 4 4 3 3 6 3 6 6 3 2 4 4 4 4 4 4
Treatment 4 1 3 4 4 4 5 6 6 6 6 6 6 6 6 6
Control 4 4 5 6 5 5 4 6 6 6 6 5 6 6 6 6
Treatment 3 3 4 6 6 6 6 6 5 6 6 5 4 6 6 4
Control 4 4 6 6 4 6 6 6 6 4 4 3 5 6 6 6
Control 4 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
Treatment 4 5 5 6 6 6 6 6 5 5 6 6 5 5 6 6
Treatment 4 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
Control 2 1 2 1 1 1 1 3 1 4 4 1 1 1 1 1
Treatment 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Treatment 4 6 5 5 5 5 5 6 5 4 5 4 4 5 5 4
Treatment 4 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
Control 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
Treatment 4 5 6 6 6 5 6 6 6 5 6 6 6 6 6 6
Control 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
Treatment 3 3 2 5 4 4 5 6 6 4 5 5 4 5 4 6
Treatment 4 5 4 4 4 5 5 6 4 5 4 3 6 6 6 6
Control 1 2 3 2 1 4 1 1 3 1 3 3 3 3 4 4
Control 3 6 6 6 6 6 5 1 5 6 5 6 6 6 6 6
Control 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Control 4 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
")
My current solution is too complicated. To get the frequencies for the variables q13_3:q14_9, I can do something like this:
library(tables)
varList <- 2:11
data_in[varList] <- lapply(data_in[varList], factor, exclude = NULL)
lapply(varList, function(x, df, byVar) {
  tabular((Factor(df[[x]], paste(colnames(df)[x])) + 1) ~
            ((Factor(df[[byVar]], paste(byVar))) * ((n = 1) + Percent("col"))),
          data = df)
}, data_in, "treatment_cur")
Below is a snippet of what my current output looks like. The problem is that the output is a list of tables, which cannot be exported to a single Excel sheet; I have to manually copy everything from the console into an Excel file.
treatment_cur
Control Treatment
q14_8 n Percent n Percent
1 6 13.953 4 12.50
2 4 9.302 4 12.50
3 5 11.628 2 6.25
4 6 13.953 4 12.50
5 5 11.628 7 21.88
6 13 30.233 11 34.38
NA 4 9.302 0 0.00
All 43 100.000 32 100.00
[[10]]
treatment_cur
Control Treatment
q14_9 n Percent n Percent
1 6 13.953 4 12.50
2 6 13.953 4 12.50
3 4 9.302 4 12.50
4 6 13.953 5 15.62
5 5 11.628 8 25.00
6 12 27.907 7 21.88
NA 4 9.302 0 0.00
All 43 100.000 32 100.00
This works all right, but I want to:
1. Find the total frequency of each variable value (Treatment + Control combined) as an additional column, as seen in the attached image;
2. Avoid the function I am using to produce this output. I want to export the result to an Excel file, but since the output is a list of lists it cannot be exported directly, and copying and pasting the values from the console into Excel is cumbersome. I would like an easier way of finding these frequencies; surely R has a better way of doing this.
Any help is MUCH appreciated!!
One way to do this is to explore the gtsummary package.
Using your code above, you can produce a table with counts and percentages quite easily:
library(gtsummary)
library(readr)
library(flextable)
tbl_summary(data_in, by = "treatment_cur") %>%
  add_overall() %>%
  as_flex_table() %>%
  flextable::save_as_docx(., path = "G:/test.docx")
If you just run:
tbl_summary(data_in, by = "treatment_cur") %>%
  add_overall()
you will see the table it generates for you. The extra code after that makes the table exportable to a docx file; from there you can copy it into Excel. This generates the counts you requested, and you can decide whether it is a simpler implementation.
Another alternative is to write directly to a csv file:
tbl_summary(data_in, by = "treatment_cur") %>%
  add_overall() %>%
  as_tibble() %>%
  readr::write_csv(., path = "G:/test.csv")
Or, if you really need everything in separate columns, you can separate the counts and percentages into two tables, merge them, and then write to csv:
# keep counts only
ncount <- tbl_summary(data_in, by = "treatment_cur",
                      statistic = all_categorical() ~ "{n}") %>%
  add_overall()

# keep percentages only
pctdata <- tbl_summary(data_in, by = "treatment_cur",
                       statistic = all_categorical() ~ "{p}%") %>%
  add_overall()

# combine and output
tbl_merge(list(ncount, pctdata)) %>%
  as_tibble() %>%
  readr::write_csv(., "G:/test2.csv")
Edit:
Another way to approach this is with the janitor package. You can adorn counts and percentages pretty easily and merge the data sets together; after that it is easy to export to csv/Excel. One downside is that you have to loop through your variables to get a table for each and then combine them (a sketch of that loop follows the code), but the code below is a good start:
library(janitor)

datatry <- data_in %>%
  janitor::tabyl(q13_3, treatment_cur) %>%
  adorn_totals("col") %>%
  adorn_totals("row")

datatry2 <- data_in %>%
  janitor::tabyl(q13_3, treatment_cur) %>%
  janitor::adorn_percentages(denominator = 'col') %>%
  adorn_totals("row") %>%
  adorn_totals("col") %>%
  mutate(Total = ifelse(is.na(q13_3), Total, ifelse(q13_3 == 'Total', 1, Total)))

datatry3 <- inner_join(datatry, datatry2, by = 'q13_3') %>%
  mutate(variable = 'q13_3')
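To cover all of the questions rather than just q13_3, the same recipe can be wrapped in a loop. This is only a sketch: the 'level' column name, the variable selection, and the output file name are my own choices, not from the original answer.
library(dplyr)
library(purrr)

vars <- names(data_in)[-1]                 # assume every column after treatment_cur is a question
all_counts <- map_dfr(vars, function(v) {
  data_in %>%
    transmute(level = .data[[v]], treatment_cur) %>%  # fixed column name so tabyl() can be reused
    janitor::tabyl(level, treatment_cur) %>%
    adorn_totals(c("row", "col")) %>%
    mutate(variable = v)
})
# write.csv(all_counts, "frequencies.csv")  # hypothetical file name; export once at the end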
Assuming that you constructed data_in as above:
library(dplyr)
library(purrr)

# reformat
tt <- data_in$treatment_cur
data_in$treatment_cur <- NULL

data_in %>% map(function(a) {
  ret <- data.frame(Treatment.n = rep(0, 6), Control.n = rep(0, 6))
  b <- table(a[tt == "Treatment"])
  ret[names(b), "Treatment.n"] <- b
  b <- table(a[tt == "Control"])
  ret[names(b), "Control.n"] <- b
  ret$Treatment.percent <- ret$Treatment.n / sum(ret$Treatment.n)
  ret$Control.percent <- ret$Control.n / sum(ret$Control.n)
  ret
}) %>% do.call(what = cbind)
This assumes the answers are in 1..6 and that NAs are ignored.

Create a new variable based on existing variable

My current dataset looks like this:
Order V1
1 7
2 5
3 8
4 5
5 8
6 3
7 4
8 2
1 8
2 6
3 3
4 4
5 5
6 7
7 3
8 6
I want to create a new variable, "V2", based on "Order" and "V1". Within every block of 8 items in "Order", "V2" should be 0 whenever "Order" equals 1; otherwise "V2" takes the value of "V1" from the previous row.
This is the dataset that I want
Order V1 V2
1 7 0
2 5 7
3 8 5
4 5 8
5 8 5
6 3 8
7 4 3
8 2 4
1 8 0
2 6 8
3 3 6
4 4 3
5 5 4
6 7 5
7 3 7
8 6 3
Since my actual dataset is very large, I'm trying to use a for loop with an if statement to generate "V2", but my code keeps failing. I'd appreciate any help, and I'm open to other approaches. Thank you!
(Up front: I am assuming that the order of Order is perfectly controlled.)
You simply need ifelse and dplyr's lag:
df <- read.table(text="Order V1
1 7
2 5
3 8
4 5
5 8
6 3
7 4
8 2
1 8
2 6
3 3
4 4
5 5
6 7
7 3
8 6 ", header=T)
library(dplyr)  # for lag(); base stats::lag would not shift the vector here
df$V2 <- ifelse(df$Order == 1, 0, lag(df$V1))
df
# Order V1 V2
# 1 1 7 0
# 2 2 5 7
# 3 3 8 5
# 4 4 5 8
# 5 5 8 5
# 6 6 3 8
# 7 7 4 3
# 8 8 2 4
# 9 1 8 0
# 10 2 6 8
# 11 3 3 6
# 12 4 4 3
# 13 5 5 4
# 14 6 7 5
# 15 7 3 7
# 16 8 6 3
A base R alternative:
with(df, {V2 <- c(0, head(V1, -1)); V2[Order == 1] <- 0; df$V2 <- V2; df})
Order V1 V2
1 1 7 0
2 2 5 7
3 3 8 5
4 4 5 8
5 5 8 5
6 6 3 8
7 7 4 3
8 8 2 4
9 1 8 0
10 2 6 8
11 3 3 6
12 4 4 3
13 5 5 4
14 6 7 5
15 7 3 7
16 8 6 3
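If your real data cannot rely on Order always running 1 through 8, a grouped variant (a sketch using dplyr; the block helper column is my own) restarts the lag whenever Order returns to 1:
library(dplyr)

df %>%
  group_by(block = cumsum(Order == 1)) %>%        # new block each time Order restarts at 1
  mutate(V2 = dplyr::lag(V1, default = 0)) %>%    # previous V1 within the block, 0 on the first row
  ungroup() %>%
  select(-block)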

T test and permutation testing

I have a data frame which looks like this. There are 2 separate groups and 5 different variables.
df <- read.table(text="Group var1 var2 var3 var4 var5
1 3 5 7 3 7
1 3 7 5 9 6
1 5 2 6 7 6
1 9 5 7 0 8
1 2 4 5 7 8
1 2 3 1 6 4
2 4 2 7 6 5
2 0 8 3 7 5
2 1 2 3 5 9
2 1 5 3 8 0
2 2 6 9 0 7
2 3 6 7 8 8
2 10 6 3 8 0", header = TRUE)
I'm calculating the significance of each variable for distinguishing between the 2 groups using a t-test (as below). However, since this is quite a small dataset, I'd like to use permutation testing to calculate the p-values. What is the best method for doing this in R?
t(sapply(df[-1], function(x)
  unlist(t.test(x ~ df$Group)[c("p.value")])))
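One straightforward option (a sketch, not from the original post) is to permute the group labels, recompute the t statistic for each shuffle, and take the proportion of permuted statistics at least as extreme as the observed one:
set.seed(1)
n_perm <- 5000

# observed t statistic per variable
obs <- sapply(df[-1], function(x) unname(t.test(x ~ df$Group)$statistic))

# two-sided permutation p-values
perm_p <- sapply(names(df)[-1], function(v) {
  perm_stats <- replicate(n_perm, {
    g <- sample(df$Group)                      # shuffle group labels
    unname(t.test(df[[v]] ~ g)$statistic)
  })
  mean(abs(perm_stats) >= abs(obs[v]))
})
perm_p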

R, Using reshape to pull pre post data

I have a simple data frame as follows
x = data.frame(id = seq(1,10),val = seq(1,10))
x
id val
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
I want to add 4 more columns: the first two hold the values from the previous two rows, and the next two hold the values from the next two rows. For the first two rows and the last two rows, the missing positions should be NA.
How do I accomplish this using cast in the reshape package?
The final output would look like
1 1 NA NA 2 3
2 2 NA 1 3 4
3 3 1 2 4 5
4 4 2 3 5 6
... and so on...
Thanks much in advance
After you gave the example, I changed the solution:
mat <- cbind(x,
             c(NA, NA, head(x$id, -2)),    # two rows back
             c(NA, head(x$val, -1)),       # one row back
             c(tail(x$id, -1), NA),        # one row ahead
             c(tail(x$val, -2), NA, NA))   # two rows ahead
colnames(mat) <- c('id', 'val', 'idp', 'valp', 'idn', 'valn')
mat
id val idp valp idn valn
1 1 1 NA NA 2 3
2 2 2 NA 1 3 4
3 3 3 1 2 4 5
4 4 4 2 3 5 6
5 5 5 3 4 6 7
6 6 6 4 5 7 8
7 7 7 5 6 8 9
8 8 8 6 7 9 10
9 9 9 7 8 10 NA
10 10 10 8 9 NA NA
Here is a solution with sapply. First, choose the relative shifts for the new columns:
lags <- c(-2, -1, 1, 2)
Create the new columns:
newcols <- sapply(lags, function(l) {
  tmp <- seq.int(nrow(x)) + l
  x[replace(tmp, tmp < 1 | tmp > nrow(x), NA), "val"]
})
Bind together:
cbind(x, newcols)
The result:
id val 1 2 3 4
1 1 1 NA NA 2 3
2 2 2 NA 1 3 4
3 3 3 1 2 4 5
4 4 4 2 3 5 6
5 5 5 3 4 6 7
6 6 6 4 5 7 8
7 7 7 5 6 8 9
8 8 8 6 7 9 10
9 9 9 7 8 10 NA
10 10 10 8 9 NA NA
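For completeness, a compact alternative (a sketch; the new column names are my own) uses dplyr's lag() and lead(), which both take an n argument:
library(dplyr)

x %>%
  mutate(val_lag2  = lag(val, 2),    # value two rows back
         val_lag1  = lag(val, 1),    # value one row back
         val_lead1 = lead(val, 1),   # value one row ahead
         val_lead2 = lead(val, 2))   # value two rows ahead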

Is there an expand.grid like function in R, returning permutations?

To be more specific, here is an example:
> expand.grid(5, 5, c(1:4,6),c(1:4,6))
Var1 Var2 Var3 Var4
1 5 5 1 1
2 5 5 2 1
3 5 5 3 1
4 5 5 4 1
5 5 5 6 1
6 5 5 1 2
7 5 5 2 2
8 5 5 3 2
9 5 5 4 2
10 5 5 6 2
11 5 5 1 3
12 5 5 2 3
13 5 5 3 3
14 5 5 4 3
15 5 5 6 3
16 5 5 1 4
17 5 5 2 4
18 5 5 3 4
19 5 5 4 4
20 5 5 6 4
21 5 5 1 6
22 5 5 2 6
23 5 5 3 6
24 5 5 4 6
25 5 5 6 6
This data frame was created from all combinations of the supplied vectors. I would like to create a similar data frame from all permutations of the supplied vectors. Notice that each row must contain exactly two fives, but not necessarily in the first two positions.
Thank you.
The code below works (it relies on permutations from gtools):
library(gtools)

comb <- t(as.matrix(expand.grid(5, 5, c(1:4, 6), c(1:4, 6))))
perms <- t(permutations(4, 4))
ans <- apply(comb, 2, function(x) x[perms])
ans <- unique(matrix(as.vector(ans), ncol = 4, byrow = TRUE))
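A per-row variant of the same idea (a sketch, still relying on gtools) passes each combination to permutations() with set = FALSE, so duplicated values such as the two fives are kept, and then drops repeated orderings:
comb2 <- as.matrix(expand.grid(5, 5, c(1:4, 6), c(1:4, 6)))
perm_list <- lapply(seq_len(nrow(comb2)), function(i)
  permutations(4, 4, v = comb2[i, ], set = FALSE))   # all 24 orderings of that row
ans2 <- unique(do.call(rbind, perm_list))            # drop duplicate orderings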
Try ?allPerms in the vegan package.
