How to split data into random equal groups based on continuous variable? - r

I want to make 2 random groups (of 6 participants each) for an experiment based on my variable x (continuous between -3.5 and 3.5).
The groups should be formed in a way that after formation a t-test to compare groups will be non-significant (e.g. group 1 has mean x of 2.05 and group 2 of 2.15).
I would therefore like to add a column to the dataset that labels each participant as either group1 or group2, while keeping all other columns.
So far I've played around with the dplyr package but haven't found a solution.
Here is a reproducible sample:
ID <- c("1","2","3","4","5","6","7","8","9","10","11","12","13","14")
# x and age are stored as numbers (not strings) so that means and t-tests work
x <- c(0.65, 1.25, 1.55, 1.80, 1.95, 2.05, 2.25, 2.30, 2.45, 2.60, 2.85, 2.90, 3.00, 3.05)
age <- c(36, 26, 87, 27, 24, 50, 27, 36, 46, 44, 33, 38, 47, 41)
gender <- c("M","M","F","M","F","F","M","F","M","F","F","M","F","F")
df <- data.frame(ID, x, age, gender)
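One possible approach, sketched under assumptions: shuffle the group labels at random and redraw until a t-test on x is non-significant. The helper name assign_balanced_groups, the 0.05 threshold, the retry limit and the seed are all made up for illustration. The sample has 14 rows, so this sketch produces 7 participants per group rather than 6.

```r
# Sample data from the question, with x and age as numeric columns
df <- data.frame(
  ID = as.character(1:14),
  x = c(0.65, 1.25, 1.55, 1.80, 1.95, 2.05, 2.25, 2.30,
        2.45, 2.60, 2.85, 2.90, 3.00, 3.05),
  age = c(36, 26, 87, 27, 24, 50, 27, 36, 46, 44, 33, 38, 47, 41),
  gender = c("M","M","F","M","F","F","M","F","M","F","F","M","F","F")
)

# Hypothetical helper: redraw a random equal split until the
# two-sample t-test on x has a p-value above alpha
assign_balanced_groups <- function(x, alpha = 0.05, max_tries = 1000) {
  labels <- rep(c("group1", "group2"), length.out = length(x))
  for (i in seq_len(max_tries)) {
    candidate <- sample(labels)  # random permutation of the labels
    p <- t.test(x[candidate == "group1"], x[candidate == "group2"])$p.value
    if (p > alpha) return(candidate)
  }
  stop("No non-significant split found within max_tries")
}

set.seed(1)  # assumed seed, only to make the example repeatable
df$group <- assign_balanced_groups(df$x)

table(df$group)  # group sizes
p_final <- with(df, t.test(x[group == "group1"], x[group == "group2"])$p.value)
p_final
```

All original columns are kept and the new group column sits next to them. Note that a non-significant t-test only means no difference was detected; it does not show that the groups are equivalent.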

Related

How do I assign grouped values (per-subject) from one df to another df that's grouped by trial (e.g. repeated rows for each subject)

I am using R.
I have two dfs, A and B.
A is grouped by trial, so contains numerous observations for each subject (e.g. reaction times per trial).
B is grouped by subject, so contains just one observation per subject (e.g. self-reported individual difference measures).
I want to transfer the B values so they repeat per participant across trials in A. There are numerous variables I wish to transfer from B to A, so I'm looking for an elegant solution.
What you want is to use dplyr::left_join to do this elegantly.
library(dplyr)
C <- A %>%
  left_join(B, by = "subject_id")
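To see what a left join does here without installing anything, the sketch below uses base R's merge with all.x = TRUE, which is a swapped-in equivalent rather than the dplyr call above. The data frames and the column names trial, rt, anxiety and extraversion are invented for illustration; only subject_id comes from the answer.

```r
# Hypothetical trial-level data: several rows per subject
A <- data.frame(
  subject_id = c(1, 1, 1, 2, 2, 3),
  trial      = c(1, 2, 3, 1, 2, 1),
  rt         = c(512, 480, 495, 610, 590, 450)
)

# Hypothetical subject-level data: one row per subject
B <- data.frame(
  subject_id   = c(1, 2, 3),
  anxiety      = c(12, 30, 21),
  extraversion = c(3.5, 2.0, 4.1)
)

# all.x = TRUE keeps every row of A; every column of B is repeated
# across all trials of the matching subject
C <- merge(A, B, by = "subject_id", all.x = TRUE)
C
```

Unlike left_join, merge sorts the result by the key column, so the row order of A is not guaranteed to survive.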

R: Subset factor levels that co-occur with two levels from another factor

I have a data frame consisting of multiple columns. I want to subset the data frame to only include rows where levels from one factor co-occur with more than one level in another factor. With the simplified data example below, I would be left with just the first two rows, i.e. GeneA, GeneA and TissueA TissueB.
A <- c("GeneA","GeneA","GeneB","GeneB","GeneC","GeneC")
B <- c("TissueA","TissueB","TissueA","TissueA","TissueA","TissueA")
df <- data.frame(Gene = A, Tissue = B)
Thanks in advance.
Here is one idea. You define groups with Gene. In each group, you want to check if there is more than one unique value.
library(dplyr)
# Keep only genes that occur with at least two distinct tissues
group_by(df, Gene) %>%
  filter(n_distinct(Tissue) >= 2)
  Gene  Tissue
  <fct> <fct>
1 GeneA TissueA
2 GeneA TissueB

Combine 2 variables from a 8 variables set, with the difference from each row

Hi, and thanks for all the help. I have a dataset with 8 variables and 5 observations. Two of the variables, High.price and Low.price, hold the high and low price for each of five days (the five observations). I want to take High.price and Low.price into a new dataset, calculate the difference between them for each of the five days, and plot that difference (the "spread") with geom_line, with the spread as "y" and the date of each observation as "x".
If I understand correctly, it's a simple case of subsetting. If dataset1 is the first dataset with 8 columns and five rows, you can simply subset using:
dataset2 <- dataset1[c(1,2),] where 1 and 2 are the lines to keep. Since the date is not in the dataset, you can build the graph using a date vector as x and the data from dataset2 as y.
I made an example:
df <- data.frame(a = c(2, 4, 6, 8, 9),
                 b = c(1, 5, 7, 9, 10),
                 c = c(6, 8, 5, 7, 7),
                 d = c(1, 2, 3, 4, 5),
                 e = c(4, 5, 6, 2, 1),
                 f = c(2, 5, 4, 7, 1),
                 g = c(1, 1, 2, 1, 2),
                 h = c(5, 6, 5, 5, 5))
Vmin <- unlist(lapply(df, min))
Vmax <- unlist(lapply(df, max))
spread <- Vmax - Vmin
plot(spread, type = "o", pch = 20, xaxt = "n")
axis(1, 1:8, colnames(df)) # or your date
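If the data really has one row per day with High.price and Low.price columns, as the question describes, the spread can also be computed row by row. The sketch below assumes a date column and uses invented prices; only the names High.price and Low.price come from the question.

```r
# Hypothetical five-day data; the values and the date column are made up
prices <- data.frame(
  date       = as.Date("2024-01-01") + 0:4,
  High.price = c(105, 108, 107, 110, 112),
  Low.price  = c(100, 104, 101, 106, 109)
)

# Keep only the columns of interest in a new dataset
dataset2 <- prices[, c("date", "High.price", "Low.price")]

# One spread value per day (per row)
dataset2$spread <- dataset2$High.price - dataset2$Low.price

# Spread on the y axis, date on the x axis
plot(dataset2$date, dataset2$spread, type = "o", pch = 20,
     xlab = "date", ylab = "spread")
```

With ggplot2 loaded, the same plot could be drawn with ggplot(dataset2, aes(date, spread)) + geom_line().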

Random population sample divided into age groups

I want to create a random population data column of around 4000 rows and then randomly distribute each row of this population data column into 4 age group columns (like 0-24, 25-64, 64-85 and 85+).
Sorry for the silly reply earlier. I think this is what you are looking for:
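# Random population total for each of the 4000 rows
Population <- as.integer(runif(4000, 10000, 1000000))
# Random shares for the four age groups, scaled so each row sums to 1
df <- matrix(runif(16000, 0, 1), ncol = 4)
df <- sweep(df, 1, rowSums(df), FUN = "/")
df <- as.data.frame(df)
df <- cbind(Population, df)
colnames(df) <- c('Population', '0-24', '25-64', '64-85', '85 above')
# Turn shares into head counts; rounding can shift a row total slightly
df1 <- cbind(Population, round(df$Population * df[, 2:5], 0))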
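Because the counts above are rounded shares, a row's four age groups may not add up exactly to its Population. If exact totals matter, one alternative is to draw each row from a multinomial distribution, sketched below. This is a different technique from the answer's rounding, and the names props, counts and df2 and the seed are assumptions for illustration.

```r
set.seed(123)  # assumed seed, only for repeatability
n_rows <- 4000

# Random population total per row
Population <- as.integer(runif(n_rows, 10000, 1000000))

# Random (unnormalised) weights for the four age groups per row;
# rmultinom rescales them to probabilities internally
props <- matrix(runif(n_rows * 4), ncol = 4)

# For each row, split its population across the four groups
counts <- t(vapply(
  seq_len(n_rows),
  function(i) rmultinom(1, size = Population[i], prob = props[i, ])[, 1],
  integer(4)
))
colnames(counts) <- c("0-24", "25-64", "64-85", "85 above")

df2 <- data.frame(Population, counts, check.names = FALSE)

# Every row's age groups add up exactly to its population
all(rowSums(counts) == Population)
```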

using aggregate to generate report based on multiple categories in r

I have a .dbf containing roughly 2.8 million records that contain residential parcel data with a year built category field, a county code field, and a windzone field (for building code restrictions). There are 3 year built categories and 5 wind zones. I need to get the number of parcels for each year built category in each windzone for each county. Basically I have a county (CNTY_ID = 11) with three year built categories (BUILT_CAT = "1" , "2" , "3") each that are also assigned to one of five windspeed categories (WINDSPEED = "100", "110", "120", etc.). I think I need to use the aggregate() function but haven't had any luck. Optimally the generated table would look something like:
CNTY_ID = 11
             BUILT_CAT
             1    2    3
WINDSPEED
100          x    x    x
120          x    x    x
.
.
.
150          x    x    x

CNTY_ID = 12
             BUILT_CAT
             1    2    3
WINDSPEED
100          x    x    x
120          x    x    x
.
.
.
150          x    x    x
Is this kind of task possible to accomplish?
Actually, you're better off using table: it's less hassle and faster. You get an array back, which is easily converted to a data frame.
Some test data:
n <- 10000
df <- data.frame(
  windspeed = sample(c(110, 120, 130), n, TRUE),
  built_cat = sample(c(1, 2, 3), n, TRUE),
  cnty_id = sample(1:20, n, TRUE)
)
Constructing the table and converting to a data frame:
tbl <- with(df, table(windspeed, built_cat, cnty_id))
as.data.frame(tbl)
Note that I use with here so I have the variable names automatically as the dimnames of my table. That helps with the conversion.
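To get a layout close to the one in the question (one windspeed by built_cat block per county), the three-way table can be sliced per county or flattened with ftable. This is a sketch on the same kind of simulated data; the seed and the name flat are assumptions.

```r
set.seed(1)  # assumed seed, only for repeatability
n <- 10000
df <- data.frame(
  windspeed = sample(c(110, 120, 130), n, TRUE),
  built_cat = sample(c(1, 2, 3), n, TRUE),
  cnty_id   = sample(1:20, n, TRUE)
)

tbl <- with(df, table(windspeed, built_cat, cnty_id))

# Counts for a single county: rows are windspeed, columns are built_cat
tbl[, , "11"]

# All counties in one flat table, county and windspeed as row labels
flat <- ftable(tbl, row.vars = c("cnty_id", "windspeed"),
               col.vars = "built_cat")
flat
```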
What you essentially need is a way to group your data.
I think dplyr is the way to go. You can use aggregate too.
Using dplyr
library(dplyr)
library(datasets)
temp <- airquality %>%
  group_by(Month, Day) %>%
  summarise(TOT = sum(Ozone))
View(temp)
This gives you the data in a normalized format: it is grouped first by Month and then by Day of the month, and the chosen variable (Ozone in this case) is summed within each group. To count the values instead, use length in place of sum.
Using aggregate
temp2 <- aggregate(Ozone ~ Month + Day, data = airquality, sum)
View(temp2)
The key difference between the two approaches is the treatment of NA.
With dplyr, sum(Ozone) returns NA for any group that contains a missing value unless you pass na.rm = TRUE. The formula interface of aggregate removes rows with NA before summing (its default na.action is na.omit), so a group whose values are all missing is dropped from the result.
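The NA behaviour can be checked in base R alone. In the sketch below, the formula interface of aggregate drops the missing rows, while the data-frame interface keeps every Month and Day group and returns NA for its sum, which is comparable to what sum(Ozone) gives inside a grouped summary. The names by_formula, by_list and by_narm are made up for illustration.

```r
# Formula interface: rows with NA in Ozone are removed before grouping
by_formula <- aggregate(Ozone ~ Month + Day, data = airquality, FUN = sum)

# Data-frame interface: every group is kept, NA values reach sum()
by_list <- aggregate(airquality["Ozone"],
                     by = airquality[c("Month", "Day")],
                     FUN = sum)

nrow(by_formula)            # only the days with an Ozone reading
nrow(by_list)               # every day in the dataset
sum(is.na(by_list$Ozone))   # groups whose sum is NA

# Passing na.rm = TRUE keeps all groups; an all-NA group then sums to 0
by_narm <- aggregate(airquality["Ozone"],
                     by = airquality[c("Month", "Day")],
                     FUN = sum, na.rm = TRUE)
```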
Other ways to do this kind of grouping include data.table, ddply from plyr, and tapply.
