In R, how to group by value sign? - r

I have a data.frame with columns order, x and sign. I want to group consecutive rows that share the same sign while keeping the rows in their original order. The sign column is 0 when x is positive and 1 when x is negative or zero. The output I want looks like this:
Group1: order = 0
Group2: order = 1 and 2
Group3: order = 3,4,5,6,7
Group4: order = 8,9,10,11
Group5: order = 12,13,14
Group6: order = 15
After that, I would like to calculate the mean of the x values for each group (Group1, Group2, ...).
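One possible approach, sketched with invented x values because the original table is not reproduced here: start a new group whenever sign differs from the previous row, then average x within each group. The data frame below is purely illustrative; only the sign pattern is taken from the groups listed in the question.

```r
# Hypothetical data: the sign pattern matches the groups in the question,
# but the x values are made up for illustration.
df <- data.frame(
  order = 0:15,
  x = c(2, -1, -3, 4, 5, 1, 2, 6, -2, 0, -4, -1, 3, 7, 2, -5)
)
df$sign <- as.integer(df$x <= 0)   # 0 = positive, 1 = negative or zero

# A new group starts wherever sign changes from one row to the next.
df$group <- cumsum(c(1, diff(df$sign) != 0))

# Mean of x within each run of equal sign.
group_means <- tapply(df$x, df$group, mean)
group_means
```

The same group ids could also be derived from rle(df$sign); the cumsum form is used here only because it is a one-liner.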

Related

R querying second dataframe by the range to which a numerical value in column belongs and returning the corresponding value

I have a dataframe column with numbers ranging from 0 - 50 (column A). I have another dataframe with two columns, one column shows a numerical range (column B) and the other shows a corresponding value (column C). In the first dataframe, I would like to add a column D that is the result of finding the range (in column B) to which the value in column A belongs and returning the corresponding value (column C).
First dataframe:
A
1
50

Second dataframe:
B      C
0-10   Low
...    ...
41-50  High

Desired result:
A   D
1   Low
50  High
If your categories are adjacent (no gaps between the ranges), findInterval might do the trick. Replace the second dataframe with a named lookup vector of lower bounds, as appropriate. Example:
values = c(12, 2, 42)
## define categories by lower bound:
categories = c(low=0, middle=10, high=40)
## findInterval returns, for each value, the index of the category it falls into
names(categories)[findInterval(values, categories)]
## [1] "middle" "low"    "high"
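As a rough sketch of how this could be applied to the two dataframes from the question: the lookup table below stores the lower bound of each range. The names lookup, lower and df1, the extra value 25, and the middle range ("Medium", starting at 11) are assumptions made for illustration, since the question elides the rows between 0-10 and 41-50.

```r
# Hypothetical lookup table: lower bound of each range and its label.
# The middle row is an assumption; the question does not show it.
lookup <- data.frame(
  lower = c(0, 11, 41),
  C = c("Low", "Medium", "High"),
  stringsAsFactors = FALSE
)

df1 <- data.frame(A = c(1, 25, 50))

# For each A, find the last lower bound that is <= A and take its label.
df1$D <- lookup$C[findInterval(df1$A, lookup$lower)]
df1
```

This only works when lookup$lower is sorted in increasing order and the ranges leave no gaps.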

Identifying, grouping unique entries in data frame (R)

I have a dataframe with two columns. One is an ID column (string), the second consists of strings several hundred characters long (DNA sequences). I want to identify the unique DNA sequences and group the unique groups together.
Using:
data$duplicates<-duplicated(data$seq, fromLast = TRUE)
I have successfully identified whether a specific row is a duplicate or not. This is not sufficient: I want to know how many copies of each sequence there are (2, 3, etc.) and which IDs they correspond to (it is important that each ID always stays with its corresponding sequence).
Maybe something like:
for data$duplicates = TRUE... "add number in data$grouping
corresponding to the set of duplicates."
I don't know how to write the code for the last part.
I appreciate any and all help, thank you.
Edit: As an example:
df <- data.frame(ID = c("seq1","seq2","seq3","seq4","seq5"), seq = c("AAGTCA","AGTCA","AGCCTCA","AGTCA","AGTCAGG"))
I would like the output to be a new column (e.g.: df$grouping) where a numeric value is given to each unique group, so in this case:
("1","2","3","2","4")
If df$seq is a factor (the default for strings in data.frame() before R 4.0.0), we can just use the level number. This is what you get when a factor is coerced to an integer.
df$grouping = as.integer(df$seq)
df
#     ID     seq grouping
# 1 seq1  AAGTCA        1
# 2 seq2   AGTCA        3
# 3 seq3 AGCCTCA        2
# 4 seq4   AGTCA        3
# 5 seq5 AGTCAGG        4
If, in your real data, the seq column is not of class factor, you can still use df$grouping = as.integer(factor(df$seq)). By default the order of the groups will be alphabetical. You can modify this by giving the levels argument to factor in the order you want. For example, df$grouping = as.integer(factor(df$seq, levels = unique(df$seq))) will put the levels (and thus the grouping integers) in the order in which they first occur.
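To make the difference between the two orderings concrete, here is a small sketch using the five example sequences from the question. The variable names seqs, alphabetical and first_seen are made up for this illustration.

```r
seqs <- c("AAGTCA", "AGTCA", "AGCCTCA", "AGTCA", "AGTCAGG")

# Default: levels are sorted alphabetically.
alphabetical <- as.integer(factor(seqs))

# Levels in order of first appearance, as asked for in the question.
first_seen <- as.integer(factor(seqs, levels = unique(seqs)))

alphabetical  # 1 3 2 3 4
first_seen    # 1 2 3 2 4

# match() against the unique values gives the same first-appearance ids.
identical(first_seen, match(seqs, unique(seqs)))
```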
If you want to see the number of rows in each group, use table, e.g.
table(df$seq)
#  AAGTCA AGCCTCA   AGTCA AGTCAGG
#       1       1       2       1
table(df$grouping)
# 1 2 3 4
# 1 1 2 1
sort(table(df$seq), decreasing = TRUE)
#   AGTCA  AAGTCA AGCCTCA AGTCAGG
#       2       1       1       1
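The question also asks which IDs belong to each set of duplicates. A minimal sketch of one way to get that, assuming the example data frame from the question (the names n_copies and ids_by_seq are invented here):

```r
df <- data.frame(
  ID  = c("seq1", "seq2", "seq3", "seq4", "seq5"),
  seq = c("AAGTCA", "AGTCA", "AGCCTCA", "AGTCA", "AGTCAGG"),
  stringsAsFactors = FALSE
)

# For every row, how many rows share its sequence.
df$n_copies <- as.integer(table(df$seq)[df$seq])

# IDs collected per distinct sequence; each ID stays with its sequence.
ids_by_seq <- split(df$ID, df$seq)

df
ids_by_seq[["AGTCA"]]
```

Because split returns a named list keyed by sequence, the IDs for any sequence can be looked up directly by name.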

Reg to find range and frequency of number IN R programming

I have numbers from 1 to 6000 and I want to separate them into ranges as listed below.
1-10 as "Range1"
10-20 as "Range2"
20-30 as "Range3"
.
.
.
5990-6000 as "Range600".
I want ranges with an equal interval of 10, and finally I want to calculate the frequencies to find which range occurs most often.
How can we solve this in R?
You should use the cut function; table can then determine the counts in each category, and sort puts the most prevalent first.
x <- 1:6000
# breaks at 0, 10, ..., 6000 give 600 intervals: (0,10], (10,20], ..., (5990,6000]
x2 <- cut(x, breaks = seq(0, 6000, by = 10), labels = paste0("Range", 1:600))
sort(table(x2), decreasing = TRUE)
There is a maths trick for your question. If you want categories of length 10, round(x/10) will create a category in which 0 to 5 becomes 0, 6 to 14 becomes 1, 15 to 24 becomes 2, etc. If you want to create categories 1-10, 11-20, etc., you can use round((x+4.1)/10).
(R rounds halves to the even digit, following IEC 60559, so round(0.5) = 0 but round(1.5) = 2. That's why I add 4.1, which never lands exactly on a half, instead of 4.)
Not the most elegant code, but maybe the easiest to understand. Here is an example:
# Randomly draw 50 numbers between 1 and 60
x = sample(1:60, 50)
# Put them in a data.frame and add a column count containing the value 1 for each row
df <- data.frame(x, count=1)
df
# Create a new column with the category
df$cat <- round((df$x+4.1)/10)
# If you want it as text:
df$cat2 <- paste("Range",round((df$x+4.1)/10), sep="")
str(df)
# Calculate the number of values in each category
freq <- aggregate(count~cat2, data=df, FUN=sum)
# Get the number of values in the most frequent category(ies)
max(freq$count)
# Get the category(ies) name(s)
freq[freq$count == max(freq$count), "cat2"]
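A possible alternative that sidesteps the rounding question altogether is ceiling(x / 10), which maps 1-10 to 1, 11-20 to 2, and so on. This is a sketch of a different technique from the answer above, not a rewrite of it; the seed and the variable names are arbitrary.

```r
set.seed(42)                      # arbitrary seed, only for reproducibility
x <- sample(1:60, 50)

bin <- ceiling(x / 10)            # 1-10 -> 1, 11-20 -> 2, ..., 51-60 -> 6
range_label <- paste0("Range", bin)

# Count the values per range and pick the most frequent one(s).
freq <- table(range_label)
most_frequent <- names(freq)[freq == max(freq)]
most_frequent
```

Ties are kept on purpose, so most_frequent may contain more than one range.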

R index variable shrunk to number of unique groups

I have a data frame, dat, with 214 rows of data. Each row contains the variables Species and Mode, where Mode is either red or green. I have sorted the data by Species. I would like to create a numeric index variable where index = 0 if Mode is red and index = 1 otherwise.
Further, the index should only be as long as the number of unique species (N = 72). For example, if there are 5 rows of speciesA, a red species, and 7 rows of speciesB, a green species, then entry 1 = 0, entry 2 = 1, and so on. Here is the code I have tried so far:
index <- for (q in 1:unique(species)) {
ifelse(mode[q]=='red',0,1)
}
index <- as.numeric(factor(my_dataframe$mode))
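A factor, under the hood, is stored as integer codes, so the conversion from factor to numeric index is one to one. Note that the codes start at 1 and follow the alphabetical order of the levels, so here green = 1 and red = 2.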
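If the goal is a 0/1 index with one entry per species, as the question describes, something along these lines might work. It assumes every species has a single Mode; the small data frame is invented for illustration, and only the column names Species and Mode come from the question.

```r
# Hypothetical stand-in for the 214-row data frame in the question.
dat <- data.frame(
  Species = c("speciesA", "speciesA", "speciesB", "speciesB", "speciesB"),
  Mode    = c("red", "red", "green", "green", "green"),
  stringsAsFactors = FALSE
)

# One row per species (assumes Mode is constant within a species).
per_species <- unique(dat[, c("Species", "Mode")])

# 0 for red, 1 for anything else.
index <- ifelse(per_species$Mode == "red", 0, 1)
index
```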

Problems with using subset in r

I need to subset my data frame, but I do not know what condition to use.
df2<-subset(df, condition )
A part of the dataframe, `df`:
state value
a 1
b 2
c 3
a 1
b 4
c 5
I count the sum of the value column for each state using: table(df$state)
I need to create a data frame that shows just the rows whose state has a sum of the value column bigger than a given value x.
If x is 3, the new data frame should contain just the rows whose "state" column equals b or c.
What should I replace "condition" with? How can I use table(df$state) in the condition?
It is not clear what you are trying to do.
table(df$state) counts the occurrences of each state in your data, not the sum of the variable "value" for each "state". You should instead use something like this:
vv <- tapply(dat$value,dat$state,sum)
vv
# a b c
# 2 6 8
Now you can use the result within subset to keep the rows whose state has a sum of the value column bigger than a given value x. For example, with x = 3:
subset(dat,state %in% names(vv)[vv>3])
or without using `subset` (more efficient):
dat[dat$state %in% names(vv)[vv>3],]
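For completeness, here is a self-contained sketch of the same filter using ave, which returns the per-state sum aligned with the rows so it can be used directly as a row condition. The data frame reproduces the sample from the question; state_sum and df2 are names chosen for this example.

```r
df <- data.frame(
  state = c("a", "b", "c", "a", "b", "c"),
  value = c(1, 2, 3, 1, 4, 5),
  stringsAsFactors = FALSE
)
x <- 3

# Per-row copy of the sum of value within that row's state.
state_sum <- ave(df$value, df$state, FUN = sum)

# Keep rows whose state total exceeds x.
df2 <- df[state_sum > x, ]
df2
```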

Resources