Finding the range and frequency of numbers in R programming - r

I have numbers from 1 to 6000 and I want to separate them into ranges as listed below.
1-10 as "Range1"
11-20 as "Range2"
21-30 as "Range3"
.
.
.
5991-6000 as "Range600".
I want to build the ranges with an equal interval of 10, and then calculate the frequencies to see which range occurs the most.
How can we solve this in R programming?

You should use the cut function; table can then determine the counts in each category, and sort puts the most prevalent first.
x <- 1:6000
# breaks at 0, 10, ..., 6000 give 600 bins: (0,10], (10,20], ..., (5990,6000]
x2 <- cut(x, breaks=seq(0,6000,by=10), labels=paste0('Range', 1:600))
sort(table(x2), decreasing = TRUE)
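With 1:6000 every bin holds exactly ten values, so the counts only become informative on real data. A minimal sketch of the same cut/table/sort idea, assuming a hypothetical vector `obs` of values between 1 and 6000 (the name and the values are made up for illustration):

```r
# Hypothetical observations between 1 and 6000 (made up for illustration)
obs <- c(3, 7, 10, 11, 15, 5991, 6000, 12, 18, 20, 14)

# Breaks at 0, 10, ..., 6000 give 600 right-closed bins: (0,10], (10,20], ...
breaks <- seq(0, 6000, by = 10)
bins <- cut(obs, breaks = breaks, labels = paste0("Range", 1:600))

# Count per bin, most frequent first
counts <- sort(table(bins), decreasing = TRUE)

# Name of the bin that occurs most often
most_frequent <- names(counts)[1]
most_frequent
```

Starting the breaks at 0 rather than 1 keeps the value 1 inside the first bin, because cut uses right-closed intervals by default.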

There is a maths trick for your question. If you want categories of length 10, round(x/10) will create a category in which 0-5 become 0, 6 to 14 become 1, 15 to 24 become 2, etc. If you want to create the categories 1-10, 11-20, etc., you can use round((x+4.1)/10).
(R rounds halves to the nearest even number, so round(0.5)=0 but round(1.5)=2; that's why I have to use 4.1 rather than 4.)
This is not the most elegant code, but it may be the easiest to understand. Here is an example:
# Randomly draw 50 numbers between 1 and 60
x = sample(1:60, 50)
# Put them in a data.frame and add a column count containing the value 1 for each row
df <- data.frame(x, count=1)
df
# create a new column with the category
df$cat <- round((df$x+4.1)/10)
# If you want it as text:
df$cat2 <- paste("Range",round((df$x+4.1)/10), sep="")
str(df)
# Calculate the number of values in each category
freq <- aggregate(count~cat2, data=df, FUN=sum)
# Get the maximum number of values in the most frequent category(ies)
max(freq$count)
# Get the category(ies) name(s)
freq[freq$count == max(freq$count), "cat2"]
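As an alternative to the rounding offset, `ceiling(x / 10)` gives the same categories for positive whole numbers. This is a swapped-in technique, not part of the answer above; the sketch below only checks that the two agree on a few boundary values:

```r
# Boundary values of the first few ranges, plus the top of the toy example
x <- c(1, 10, 11, 20, 21, 60)

# Category via ceiling: 1-10 -> 1, 11-20 -> 2, ...
cat_ceiling <- ceiling(x / 10)

# Category via the rounding trick from the answer
cat_round <- round((x + 4.1) / 10)

# Both approaches agree on these values
identical(cat_ceiling, cat_round)

# Text labels, as in the answer
paste0("Range", cat_ceiling)
```

The ceiling version assumes the input is never 0 or negative; the rest of the aggregate step works unchanged with either column.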

Related

R: locate element previous in vector within for loop and report in new column

I've looked through many older posts, but nothing really gives the answer I need. In short: I have a data frame that contains observation data and the time of observation in days.
My goal is to add a column for weeks. I have already subsetted the data so that I only have the time vector at intervals of 7 (t == 7, 14, 21, etc). I just need to make a for loop that creates a new vector of "weeks" that I can then cbind to my data. I'd prefer it to be a character string so I can use it more easily in ggplot geom_histogram, but that isn't as necessary as just creating the new vector successfully.
The tricky part of the data is that there is not an equal number of observations per time point: t = 28 has maybe 5x as many observations as t = 7, etc.
I want to create code that looks at each value of t and checks whether it is greater than the previous element in the t vector. If it isn't, then populate the week vector with the same value as before, and if it is, then increase it by 1.
I know this is bad from a computer science/R perspective in a lot of ways, but any help would be useful:
#fake data (in reality this is a huge data set with many observations at intervals of 1 for t)
L = rnorm(50, mean=10, sd=2)
t = c((rep.int(7,3)), (rep.int(14,6)), rep.int(21,8), rep.int(28,12), (rep.int(31, 5)), (rep.int(36,16)))
fake = cbind(L,t)
#create df that has only the observations that are at weekly time points
dayofweek = seq(7,120,7)
df = subset(fake, t %in% dayofweek)
#create empty week vector
week = c()
#for loop with if-else statement nested to populate the week vector
for (i in 1:length(dayofweek)){
if (t = t[t-1]){
week = i
} else if (t > t[t-1]{
week = i+1
}
}
Thanks!!
I'm not sure I can follow what you want to do. If you want to determine which week the data fall within, why not:
set.seed(1)
L = rnorm(50, mean=10, sd=2)
...
fake <- data.frame(L=L, t=t)
fake$week <- floor(fake$t/7) + 1 # remove the "+ 1" so that t==7 becomes week==1
head(fake)
# L t week
# 1 8.747092 7 2
# 2 10.367287 7 2
# 3 8.328743 7 2
# 4 13.190562 14 3
# 5 10.659016 14 3
# 6 8.359063 14 3
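If the goal is literally what the question describes, a counter that goes up by one each time t changes, the loop can be replaced by comparing each element with the previous one. A minimal sketch, assuming t is already sorted in increasing order; `week` and `week_label` are illustrative names:

```r
# Weekly time points with unequal numbers of observations (toy data)
t <- c(rep(7, 3), rep(14, 6), rep(21, 8), rep(28, 12))

# TRUE at the first element and wherever t differs from the previous element;
# cumsum turns those change points into a running week counter
week <- cumsum(c(TRUE, diff(t) != 0))

# Character labels, e.g. for grouping in a plot
week_label <- paste("week", week)
table(week_label)
```

Unlike floor(t/7), this numbers the distinct time points consecutively, so a skipped week would not leave a gap in the counter.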

How do I aggregate results to get the most common number in each row of a dataset in R Studio

I'm doing some consensus clustering, and it returns a set called "consensus_imouted" of 3000 rows with ten repetitions each, holding the cluster number (ranging from 1-6). I want to return just one column with the most common cluster number for each row. For example, the first row is 3 3 3 3 3 3 3 3 6 3, so I would want it to be 3, etc. Any help?
You can use the apply function as follows:
sampledata <- matrix(sample(1:6,30000,replace = TRUE), ncol = 10, nrow = 3000)
sampledata <- data.frame(sampledata)
sampledata$mostCounts <- apply(sampledata,1, function(row0) {
as.numeric(names(which.max(table(row0))))
})
To get the most frequent value, count the values in the row via table. Then choose the value with the highest count using which.max. In a table, the values corresponding to the counts are the names of the table, hence use names to extract the original value. Since you know it is a number, cast the character to a numeric using as.numeric.
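The steps can be traced on the single row quoted in the question. A small sketch; the variable names are illustrative:

```r
# The example row from the question
row0 <- c(3, 3, 3, 3, 3, 3, 3, 3, 6, 3)

# Counts per distinct value; the names of the table are the values themselves
counts <- table(row0)

# Position of the largest count, carrying the value as its name
top <- which.max(counts)

# Convert the name back to a number
most_common <- as.numeric(names(top))
most_common
```

Note that which.max returns the first maximum, so when two cluster numbers are tied the smaller one is reported.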

R code to iteratively and randomly delete entire rows from a data frame based on a column value, and saving as a new data frame each time

Please forgive me if this question has been asked before!
So I have a dataframe (df) of individuals sampled from various populations with each individual given a population name and a corresponding number assigned to that population as follows:
Individual Population Popnum
ALM16-014 AimeesMdw 1
ALM16-024 AimeesMdw 1
ALM16-026 AimeesMdw 1
ALM16-003 AMKRanch 2
ALM16-022 AMKRanch 2
ALM16-075 BearPawLake 3
ALM16-076 BearPawLake 3
ALM16-089 BearPawLake 3
There are a total of 12 named populations (they do not all have the same number of individuals) with Popnum 1-12 in this file. What I need to do is randomly delete one or more populations (preferably using the 'Popnum' column) from the dataframe, repeat this 100 times, and save each result as a separate dataframe (i.e. df1, df2, df3, etc). The end result is 100 dfs, each one having one population removed randomly. The next step is to repeat this 100 times removing two random populations, then 3 random populations, and so on.
Any help would be greatly appreciated!!
You can write a function that takes the dataframe and n, the number of Popnum values to remove, as input.
remove_n_Popnum <- function(data, n) {
subset(data, !Popnum %in% sample(unique(Popnum), n))
}
To remove one Popnum you can do:
remove_n_Popnum(df, 1)
# Individual Population Popnum
#1 ALM16-014 AimeesMdw 1
#2 ALM16-024 AimeesMdw 1
#3 ALM16-026 AimeesMdw 1
#4 ALM16-003 AMKRanch 2
#5 ALM16-022 AMKRanch 2
To do this 100 times you can use replicate
list_data <- replicate(100, remove_n_Popnum(df, 1), simplify = FALSE)
To pass different n in remove_n_Popnum function you can use lapply
nested_list_data <- lapply(seq_along(unique(df$Popnum)[-1]),
function(x) replicate(100, remove_n_Popnum(df, x), simplify = FALSE))
where seq_along generates a sequence whose length is 1 less than the number of unique values, so at least one population always remains.
seq_along(unique(df$Popnum)[-1])
#[1] 1 2
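The question asked for results called df1, df2, and so on. Keeping them as named elements of the list is usually easier to work with than 100 separate objects. A sketch on made-up data (the `Individual` values are invented, and the function is repeated so the example runs on its own):

```r
set.seed(42)

# Toy data with three populations (made up for illustration)
df <- data.frame(
  Individual = paste0("ind", 1:8),
  Popnum = c(1, 1, 1, 2, 2, 3, 3, 3)
)

# Same function as above: drop n randomly chosen populations
remove_n_Popnum <- function(data, n) {
  subset(data, !Popnum %in% sample(unique(Popnum), n))
}

# 100 data frames, each with one random population removed
list_data <- replicate(100, remove_n_Popnum(df, 1), simplify = FALSE)

# Name the elements df1, df2, ..., df100
names(list_data) <- paste0("df", seq_along(list_data))

# Number of populations left in the first three results
sapply(list_data[1:3], function(d) length(unique(d$Popnum)))
```

An individual result is then available as `list_data$df1` or `list_data[["df1"]]`.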

Is there an R function to select n individuals based on group?

I have a data set of 12.5 million records and I need to randomly select about 2.5 million. However, these individuals are in 55284 groups and I want to keep groups intact.
So basically I want to remove groups until I've got 2.5 million records left OR select groups until I have about 2.5 million individuals.
If this is my data:
data <- data.frame(
id = c(1, 2, 3, 4, 5),
group = c(1, 1, 2, 2, 3)
)
I wouldn't want to remove id1 and keep id2; I'd like to either keep them both or discard both, because they are in the same group (1).
So ideally, this function randomly selects a group, counts these individuals and puts them in a data set, then does the same thing again, and keeps counting the individuals until it has about 2.5 million (it is okay to say: if n exceeds 2.5 million, stop putting groups into the new data set).
I haven't been able to find a function and I am not yet skilled enough to put something together myself, unfortunately.
Hope someone can help me out!
Thanks
Too long for a comment hence answering. Do you need something like this ?
#Order data by group so rows with same groups are together
data1 <- data[order(data$group), ]
#Get all the groups in first 2.5M entries
selected_group <- unique(data1$group[1:2500000])
#Subset those groups so you have all groups intact
final_data <- data1[data1$group %in% selected_group, ]
For a random approach, we can use a while loop:
#Get all the groups in the data
all_groups <- unique(data$group)
#Variable to hold row indices
rows_to_sample <- integer()
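#While the number of rows to subset is at most 2.5M and groups remain
while (length(rows_to_sample) <= 2500000 && length(all_groups) > 0) {
#Select one random group (indexing avoids sample() treating a single number n as 1:n)
select_group <- all_groups[sample.int(length(all_groups), 1)]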
#Get rows indices of that group
rows_to_sample <- c(rows_to_sample, which(data$group == select_group))
#Remove that group from the all_groups
all_groups <- setdiff(all_groups, select_group)
}
data[rows_to_sample, ]
Here is a possibility. I demonstrate it using toy data and a threshold of 33 (instead of 2.5 million). First I create the toy group vector:
threshold <- 33
set.seed(111)
mygroups <- rep(1:10, rpois(10, 10))
In this toy example group 1 has 10 individuals, group 2 has 8 individuals and so on.
Now I put the groups in random order and use cumsum to determine when the threshold is exceeded:
x <- cumsum(table(mygroups)[sample(1:10)])
randomgroups <- as.integer(names(x[x <= threshold]))
randomgroups
[1] 1 7 5
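To go from the selected groups back to rows, the group column can be filtered with %in%. A minimal end-to-end sketch on a made-up data frame, following the same shuffle-then-cumsum idea; the names `sizes`, `keep_groups` and `final_data` are illustrative:

```r
set.seed(111)

# Toy data: 100 individuals spread over up to 10 groups (made up for illustration)
data <- data.frame(id = 1:100, group = sample(1:10, 100, replace = TRUE))
threshold <- 33

# Group sizes, put in random order
sizes <- table(data$group)
shuffled <- sizes[sample(length(sizes))]

# Running total of individuals; keep groups while the total stays within the threshold
running <- cumsum(shuffled)
keep_groups <- names(shuffled)[running <= threshold]

# Subset whole groups only
final_data <- data[as.character(data$group) %in% keep_groups, ]
nrow(final_data)
```

Because groups are taken in shuffled order and the selection stops at the first group that would exceed the threshold, the result can fall somewhat short of the target size.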

Subset data frame based on first letters of column name

I have a large dataframe with multiple columns representing different variables that were measured for different individuals. The names of the columns always start with a number (e.g. 1:18). I would like to subset the df and create separate dfs for each individual. Here is an example:
x <- as.data.frame(matrix(nrow=10,ncol=18))
colnames(x) <- paste(1:18, 'col', sep="")
The column names of my real df are a combination of the individual ID, the variable name, and the number of the measure (I took 3 measures of each variable). So for instance I have the measure b (body) for individual 1, and in the df I would have 3 columns named: 1b1, 1b2, 1b3. In the end I have 10 different regions (body, head, tail, tail base, dorsum, flank, venter, throat, forearm, leg). So for each individual I have 30 columns (10 regions x 3 measures per region). So I have multiple variables starting with the different numbers and I would like to subset them based on their unique numbers. I tried using grep:
partialName <- 1
df2<- x[,grep(partialName, colnames(x))]
colnames(x)
[1] "1col" "2col" "3col" "4col" "5col" "6col" "7col" "8col" "9col" "10col"
"11col" "12col" "13col" "14col" "15col" "16col" "17col" "18col"
My problem here, as you can see, is that it doesn't separate the individuals, because 1 and 10 are both in the subset. In other words, this selects everybody that starts with 1.
Ultimately what I would like to do is to loop over all my individuals (1:18), creating new dfs for each individual.
I think keeping the data in one data.frame is the best option here. Either that, or put it into a list of data.frames. This makes it much easier to extract summary statistics per individual.
First create some example data:
df = as.data.frame(matrix(runif(50 * 100), 100, 50), stringsAsFactors = FALSE)
names_variables = c('spam', 'ham', 'shrub')
individuals = 1:100
column_names = paste(sample(individuals, 50),
sample(names_variables, 50, TRUE),
sep = '')
colnames(df) = column_names
What I would do first is use melt to cast the data from wide format to long format. This essentially stacks all the columns in one big vector, and adds an extra column telling which column it came from:
library(reshape2)
df_melt = melt(df)
head(df_melt)
variable value
1 85ham 0.83619111
2 85ham 0.08503596
3 85ham 0.54599402
4 85ham 0.42579376
5 85ham 0.68702319
6 85ham 0.88642715
Then we need to separate the ID number from the variable. The assumption here is that the numeric part of the variable is the individual ID, and the text is the variable name:
library(dplyr)
df_melt = mutate(df_melt, individual_ID = gsub('[A-Za-z]', '', variable),
var_name = gsub('[0-9]', '', variable))
essentially removing the part of the string not needed. Now we can do nice things like:
mean_per_indivdual_per_var = summarise(group_by(df_melt, individual_ID, var_name),
mean(value))
head(mean_per_indivdual_per_var)
individual_ID var_name mean(value)
1 63 spam 0.4840511
2 46 ham 0.4979884
3 20 shrub 0.5094550
4 90 ham 0.5550148
5 30 shrub 0.4233039
6 21 ham 0.4764298
It seems that your colnames are the standard ones of a data.frame, so to get just the column 1 you can do this:
df2 <- df[,1] #Where 1 can be changed to the number of column you wish.
There is no need to subset by a partial name.
Although it is not recommended, you could create a loop to do so:
for (i in seq_len(ncol(x))){
assign(paste0("df", i), x[,i]) #I use paste0 to get a different name for each column
}
Although the @paulhiemstra solution avoids the loop.
So with the new information you can do as you wanted with grep, but anchor the pattern so that the ID must sit at the start of the name and be followed by a letter:
df2<- x[,grep("^1[A-Za-z]", colnames(x))]
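For the loop over all individuals that the question asks for, one option is to build a regular expression per ID that is anchored at the start of the name and requires a letter after the number, and to collect the results in a list. A sketch under the assumption that every column name has the form number-then-letter; the list name `per_individual` and the element names are illustrative:

```r
# Example data from the question
x <- as.data.frame(matrix(nrow = 10, ncol = 18))
colnames(x) <- paste0(1:18, "col")

ids <- 1:18

# One data frame per individual: "^" anchors the ID at the start of the name,
# and "[A-Za-z]" requires a letter right after it, so 1 does not match 10-18
per_individual <- lapply(ids, function(id) {
  pattern <- paste0("^", id, "[A-Za-z]")
  x[, grep(pattern, colnames(x)), drop = FALSE]
})
names(per_individual) <- paste0("ind", ids)

# Columns selected for individual 1
colnames(per_individual[["ind1"]])
```

Using drop = FALSE keeps each result a data frame even when only one column matches.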

Resources