I want to create a random population column of around 4000 rows and then randomly split each row's population across 4 age-group columns (e.g. 0-24, 25-64, 65-84 and 85+).
Sorry for the silly reply earlier. Is this what you are looking for:
# 4000 random population values between 10,000 and 1,000,000
Population <- as.integer(runif(4000, 10000, 1000000))
# 4000 x 4 matrix of random proportions, normalised so every row sums to 1
df <- matrix(runif(16000, 0, 1), ncol = 4)
df <- sweep(df, 1, rowSums(df), FUN = "/")
df <- as.data.frame(df)
df <- cbind(Population, df)
colnames(df) <- c('Population', '0-24', '25-64', '65-84', '85+')
# Multiply each row's proportions by its population and round to whole counts
df1 <- cbind(Population, round(df$Population * df[, 2:5], 0))
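One caveat worth noting (an addition, not in the original answer): after rounding, the four age columns may no longer sum exactly to Population. A minimal sketch that absorbs the rounding remainder into the last column, assuming the df1 built above:
# Count rows where the rounded age groups no longer add up (illustrative check)
sum(rowSums(df1[, 2:5]) != df1$Population)
# Push each row's rounding remainder into the last age group
df1$`85+` <- df1$`85+` + (df1$Population - rowSums(df1[, 2:5]))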
I have a CSV where the first column is my subject ID number,
there's a total of 311 subjects, and on average 1000 values per subject.
The subject ID numbers are quite random (although integers only) ranging from 72 to 2265988.
What I would like to do is neatly rename them to numbers 1-311.
What would be the quickest way to do this, preferably in R, or else in Excel?
OK, here is a solution. Just replace the example data with your own:
# Example data
df <- data.frame(ID  = c(1, 3, 5, 3, 6, 7),
                 var = c(3, 6, 8, 5, 7, 8))
# Lookup table mapping each old ID to a new sequential ID
# (seq_along scales to any number of IDs, e.g. your 311 subjects)
temp <- data.frame(old_id = sort(unique(df$ID)),
                   new_id = seq_along(unique(df$ID)))
# Replace every old ID with its new value
replace_ID <- temp$new_id[match(df$ID, temp$old_id)]
df$ID <- replace_ID
df
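The same idea also fits in one line (a sketch, equivalent to the lookup table above):
# Rank each ID within the sorted unique IDs: the smallest ID becomes 1, and so on
df$ID <- match(df$ID, sort(unique(df$ID)))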
I have followed this example, Remove last N rows in data frame with the arbitrary number of rows, but it deletes only the last 50 rows of the whole data frame rather than the last 50 rows of every study site within it. I have a really big data set with multiple study sites; within each study site there are multiple depths, and for each depth a concentration of nutrients.
I want to just delete the last 50 rows of depth for each station.
E.g.
station 1 has 250 depths
station 2 has 1000 depths
station 3 has 150 depths
but keep all the other data consistent.
This just seems to remove the last 50 rows from the whole dataframe rather than the last 50 from every station:
df <- df[-seq(nrow(df), nrow(df) - 50), ]
What should I do to make this take the grouping variable (study site) into account?
A potential base R solution would be:
# Example data: three stations with 250, 1000 and 150 depth measurements
d <- data.frame(station = rep(paste("station", 1:3), c(250, 1000, 150)),
                depth   = rnorm(250 + 1000 + 150, 100, 10))
# Within-station row counter: 1, 2, ..., n for each station
d$grp_counter <- do.call("c", lapply(tapply(d$depth, d$station, length), seq_len))
# Number of rows per station, repeated for every row of that station
d$grp_length <- rep(tapply(d$depth, d$station, length),
                    tapply(d$depth, d$station, length))
# Keep all but the last 50 rows of each station
d <- d[d$grp_counter <= (d$grp_length - 50), ]
d
# Then drop the auxiliary vars: subset(d, select = -c(grp_counter, grp_length))
We can use the slice() function from the dplyr package:
library(dplyr)
df2 <- df %>% group_by(Col1) %>% slice(1:(n() - 4))
It first groups by the category column and then, provided the rows within each group are in the proper order, removes the last n rows (here 4) for each category.
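Applied to the station data from the question (a sketch; assumes each station has more than 50 rows and that the depth rows are already ordered the way you want):
library(dplyr)
# Drop the last 50 depth rows of every station
df2 <- df %>%
  group_by(station) %>%
  slice(1:(n() - 50)) %>%
  ungroup()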
I'm an R novice, so I can't make a sample data frame for you, for which I apologize. I'm doing bacterial community analysis and have a table with one column per species (each column name is a species identifier) and one row per sample; the cells hold the abundance of each species in each sample. My goal is to identify the most abundant species (column) for each sample (row). A data frame listing each sample (row) with the column identifier of its most abundant species would be most useful!
Attempts so far (using the phyloseq package, though this can be done without it):
beagle <- names(sort(taxa_sums(top.pdy.phyl),T, function(x) which(max==x)))
beagle2 <- names(taxa_sums(top.pdy.phyl),T, function(x) colnames(which.max(top.pdy.phyl))))
Any help would be appreciated! Thank you!
How about:
names(top.pdy.phyl)[apply(top.pdy.phyl, 1, which.max)]
and, for the abundance value itself:
apply(top.pdy.phyl, 1, max)
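Putting the two together (a sketch; assumes top.pdy.phyl is a plain data frame with samples as rows and species as columns):
# One row per sample: the most abundant species and its abundance
result <- data.frame(sample      = rownames(top.pdy.phyl),
                     top_species = names(top.pdy.phyl)[apply(top.pdy.phyl, 1, which.max)],
                     abundance   = apply(top.pdy.phyl, 1, max))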
I want to make 2 random groups (of 6 participants each) for an experiment based on my variable x (continuous between -3.5 and 3.5).
The groups should be formed such that a t-test comparing them afterwards is non-significant (e.g. group 1 has a mean x of 2.05 and group 2 of 2.15).
I would therefore also like an additional column in the dataset that says, for each participant, either group1 or group2, while keeping all the other columns.
So far I've played around with the dplyr package but haven't found a solution.
Here is a reproducible sample:
ID <- c("1","2","3","4","5","6","7","8","9","10","11","12","13","14")
# x and age as numbers (not strings), so they can actually be used in a t-test
x <- c(0.65, 1.25, 1.55, 1.80, 1.95, 2.05, 2.25, 2.30, 2.45, 2.6, 2.85, 2.9, 3.00, 3.05)
age <- c(36, 26, 87, 27, 24, 50, 27, 36, 46, 44, 33, 38, 47, 41)
gender <- c("M","M","F","M","F","F","M","F","M","F","F","M","F","F")
df <- data.frame(ID, x, age, gender)
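No answer appears in this thread, but one straightforward approach (a sketch, not from the original thread) is rejection sampling: randomly assign 12 of the 14 participants to two groups of 6 and repeat until the t-test p-value is comfortably non-significant:
set.seed(1)  # for reproducibility; the seed is an arbitrary choice
repeat {
  df$group <- NA
  # Pick 12 of the 14 participants and split them 6/6
  df$group[sample(nrow(df), 12)] <- rep(c("group1", "group2"), each = 6)
  # Rows with group == NA are dropped by t.test's default na.action
  if (t.test(x ~ group, data = df)$p.value > 0.5) break
}
With only 14 participants this converges almost immediately; the 0.5 cutoff is arbitrary and can be tightened or relaxed.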
I have a dataframe that lists studentnumber <- c(1, 2, 3, ..., n) and schoolnumber <- c(1, 1, 2, 3, 4, 4), so pupil 1 is in school 1, pupil 2 is in school 1, pupil 3 is in school 2, and so on.
I have socioeconomic status (SES) for each pupil, and I want to calculate a new column where each value is the pupil's actual SES minus the mean SES of their school. The function for this is apparently:
mydata$meansocialeconomicstatus <- with(mydata, tapply(ses, schoolnumber, mean))
But I receive an error because the new column does not repeat each mean for every pupil of that school: tapply gives each school's mean only once, so the number of values does not match the number of rows in the dataframe.
My question is: what could I add to make the mean SES repeat in the new column according to the school number?
You can use the dplyr package.
library(dplyr)
# Calculate the mean socialeconomicstatus per schoolnumber.
mydata2 <- mydata %>%
  group_by(schoolnumber) %>%
  summarise(meansocialeconomicstatus = mean(ses))
# Join the mean socialeconomicstatus back to the original dataset based on schoolnumber.
left_join(mydata, mydata2, by = "schoolnumber")
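The join attaches each school's mean to every pupil; to get the centered SES the question actually asks for, one more step (a sketch continuing from the code above):
mydata <- left_join(mydata, mydata2, by = "schoolnumber") %>%
  mutate(ses_centered = ses - meansocialeconomicstatus)
In base R, ave() does the repeat-per-group step that tapply() does not:
mydata$meansocialeconomicstatus <- ave(mydata$ses, mydata$schoolnumber)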