Changing data based on conditions in R


I posted a similar question before and got a quick answer, but I am an R beginner and haven't been able to adapt it to what I need.
Basically, I want to take the code below (if Date_Index is between two numbers and df is < X, then set df to Y) and make it apply only to entries that meet certain criteria, i.e.:
HAVE: df[df$Date_Index >= 50 & df$Date_Index <= 52 & df < .0000001]=1
ADD: if df$Date_Index <= 49 AND df = 0.00, ignore the above statement; else execute it.
In other words, I need the equivalent of an if/then/else clause. If Date_Index <= 49 and df = 0, leave the data alone; else if Date_Index >= 50 and Date_Index <= 52 and df < .001, replace the data (in Date_Index rows 50-52) with 1.
This (simple) data set should illustrate it well enough:
xx <- matrix(0,52,5)
xx[,1]=1
xx[,3]=1
xx[,5]=1
xx[50:52,]=0
xx[,1]=1:52
xx[50,3]=1
So what I'd like is column 2 and column 4 to stay all 0's but for the bottom of column 3 and 5 to continue to be all 1's.

I suppose you're looking for this:
xx[xx[,1] >= 50 & xx[,1] <= 52, c(FALSE, !colSums(!xx[xx[,1] <= 49, -1]))] <- 1
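The one-liner above is dense, so here is a minimal sketch that unpacks it into named steps. The intermediate names (early, late, zeros_early, fill_col) are made up for illustration. It assumes, as the one-liner does, that a column should be filled only when it contains no zeros at all in rows 1-49.

```r
# Rebuild the example matrix from the question
xx <- matrix(0, 52, 5)
xx[, c(1, 3, 5)] <- 1
xx[50:52, ] <- 0
xx[, 1] <- 1:52          # column 1 is the Date_Index
xx[50, 3] <- 1

early <- xx[, 1] <= 49                  # rows used to decide which columns qualify
late  <- xx[, 1] >= 50 & xx[, 1] <= 52  # rows that may be overwritten

# Count zeros per data column (columns 2-5) in the early rows
zeros_early <- colSums(xx[early, -1] == 0)

# Fill a column only if it has no zeros early on; never touch column 1
fill_col <- c(FALSE, zeros_early == 0)

xx[late, fill_col] <- 1
```

After this runs, columns 2 and 4 are still all zeros, while columns 3 and 5 are 1 in every row, including rows 50-52.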

Related

Can you use rowSums to select for rows that commonly have value '1', but exclude rows if all columns have value '1'?

I have a data frame containing 30 bacteria (columns), and the genes they have (rows). Values in the data frame are 0 if the bacteria lacks the gene, while values are 1 when bacteria have the gene.
I want to see which genes are common but still vary between the 30 bacteria and the genes that are rarely found.
cleansymbols3 is my data frame and I want to create a new data frame, commonsymbols that contains only genes that are found in at least 20, but not all, bacteria.
The line below selects genes that are common in 20 or more bacteria, but how do you set a range 20:29?
commonsymbols <- cleansymbols3[rowSums(cleansymbols3) >= 20, ]
I tried this line
commonsymbols <- cleansymbols3[ which( rowSums(cleansymbols3) >= 20 | rowSums(cleansymbols3) <= 29 ), ]
but it selects all genes, probably because it first looks for rowSums > 20 and then rowSums < 29, which together are all rows.

Use & instead of |:
commonsymbols <-
cleansymbols3[rowSums(cleansymbols3) >= 20 & rowSums(cleansymbols3) <= 29, ]
| is OR, while & is AND. So the code you wrote looks for rows where the row sum is either at least 20 or at most 29. Every number satisfies at least one of those conditions!
Using & keeps only rows in which the row sum is both at least 20 and at most 29. That is, it lies between those two values.
PS: I removed which().
You don't need which() in this context, because a logical vector can index directly: for a vector x of length 5, x[c(2,4,5)] is equivalent to x[c(FALSE, TRUE, FALSE, TRUE, TRUE)].
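To see the difference between | and & on something small, here is a sketch with a made-up presence/absence matrix of 4 genes and 5 bacteria (hypothetical data, not the asker's cleansymbols3). The thresholds are scaled down to "at least 3 but not all 5".

```r
# Hypothetical toy data: rows are genes, columns are bacteria
genes <- matrix(c(1, 1, 1, 1, 1,
                  1, 1, 1, 0, 0,
                  1, 0, 0, 0, 0,
                  1, 1, 1, 1, 0),
                nrow = 4, byrow = TRUE,
                dimnames = list(paste0("gene", 1:4), paste0("bac", 1:5)))

n <- rowSums(genes)          # number of bacteria carrying each gene

either <- n >= 3 | n <= 4    # OR: true for every row, so nothing is filtered out
both   <- n >= 3 & n <= 4    # AND: true only for counts of 3 or 4

# drop = FALSE keeps the result a matrix even if only one row survives
common <- genes[both, , drop = FALSE]
```

Here common keeps gene2 and gene4, dropping the gene present in all five bacteria and the gene present in only one.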

Alternative answer using dplyr
You could also do this using the dplyr package:
library(dplyr)
commonsymbols <-
rowwise(cleansymbols3) |>
mutate(total = sum(c_across(everything()))) |>
ungroup() |>
# total >= 20 keeps genes found in exactly 20 bacteria as well
filter(total >= 20, total < 30) |>
select(-total)

How can I multiply pairs of rows of a column conditional on the difference between the rows of another column without using nested loops?

Let's say I have a dataframe.
x_coord y_coord u
1 12 16 100
2 17 16 105
3 22 12 95
4 27 12 98
I want to calculate the product of pairs of rows under u under multiple conditions based on the other columns which I've done with nested loops:
prod_pairs <- matrix(nrow = 4, ncol = 1)
for (i in 1:4) {
  for (j in 1:4) {
    # same y_coord, and row j lies 5 units to the right of row i
    if (i != j && data$y_coord[i] == data$y_coord[j] && data$x_coord[i] - data$x_coord[j] == -5) {
      prod_pairs[i] <- data$u[i] * data$u[j]
      break
    }
  }
}
My actual dataset is much larger, and I am repeating this multiple times with other columns in place of u and other values in the third condition of the if statement (it's -5 here, so I will repeat with +5, -10, +10, etc.).
The nested loops are quite slow and I've been trying to vectorize this but to no avail. Is there a way I can speed it up?
Also, I want to try to create a function so I can input other columns and values in the 3rd condition of the if statement. I was trying to combine vectorization with a function that can do this but could not make it work.
How would I go about doing this?
Thanks.

One approach is to join the table to itself, requiring the join to be equal on y_coord and offset by the chosen amount (-5, 5, 10, etc.) on x_coord. This function does that, and also lets you pass a different u column (default "u") and a different offset, xdiff (default -5):
library(data.table)
get_paired_products <- function(df, ucol = "u", xdiff = -5) {
  # x2 is the x_coord each row's partner must have (note: := adds x2 to df by reference)
  result = df[df[, x2 := x_coord - xdiff], on = .(y_coord, x_coord = x2), nomatch = 0]
  # columns prefixed with i. come from the inner (lookup) copy of df
  result[, prod := get(ucol) * get(paste0("i.", ucol))][, .(row_a = row, row_b = i.row, prod)]
}
If input is:
df = data.table(
x_coord = c(12,17,22,27),
y_coord = c(16,16,12,12),
u=c(100,105, 95, 98),
row = c(1,2,3,4)
)
Then output is:
> get_paired_products(df)
row_a row_b prod
1: 2 1 10500
2: 4 3 9310
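If you would rather stay in base R, the same pairing can be sketched with match() instead of a join. This is an alternative under stated assumptions, not the data.table method above: the function name paired_products is made up, the coordinates are assumed to be whole numbers (the lookup key is built with paste(), which is fragile for floating-point values), and xdiff is assumed to be non-zero so a row never pairs with itself.

```r
# Hypothetical helper: for each row i, find the first row j with the same
# y_coord and x_coord[i] - x_coord[j] == xdiff, then multiply the two values.
paired_products <- function(df, ucol = "u", xdiff = -5) {
  key     <- paste(df$x_coord, df$y_coord)          # key of every row
  partner <- paste(df$x_coord - xdiff, df$y_coord)  # key the partner row must have
  j <- match(partner, key)                          # NA where no partner exists
  df[[ucol]] * df[[ucol]][j]
}

data <- data.frame(x_coord = c(12, 17, 22, 27),
                   y_coord = c(16, 16, 12, 12),
                   u       = c(100, 105, 95, 98))

res <- paired_products(data)
```

The result is aligned with the input rows, like prod_pairs in the question's loop: rows without a partner get NA. Like the loop's break, match() returns only the first matching partner.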

Is there a way to determine how many rows in a dataset have the same categorical variable for multiple conditions (columns)?

For example, I have the dataset below, where 1 = yes and 0 = no, and I need to figure out how many calls were made by landline and lasted under 10 minutes.
Image of example dataset

You can also explicitly specify the values you're looking for in each column when you compute the sum. (This helps if you need to count rows with values other than 1 in a column.)
sum(df$landline == 1 & df$`under 10 minutes` == 1)

We can use sum
sum(df1[, "under 10 minutes"])
If two columns are needed
colSums(df1[, c("landline", "under 10 minutes")])
If we are checking both columns, use rowSums
sum(rowSums(df1[, c("landline", "under 10 minutes")], na.rm = TRUE) == 2)

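Because the values are only 0 and 1, grep can find the rows where landline = 1 (this assumes landline is column 1 and "under 10 minutes" is column 4). We then keep only those rows and sum the under-10-minutes column.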
sum( df[ grep(1,df[,1]) ,4] )

R will conveniently treat 1 and 0 as if they mean TRUE and FALSE, so we can apply logical Boolean operations like AND (&) and OR (|) on them.
df <- data.frame(x = c(1, 0, 1, 0),
y = c(0, 0, 1, 1))
> sum(df$x & df$y)
[1] 1
> sum(df$x | df$y)
[1] 3
For future questions, you should look up how to use functions like dput or other ways to give an example data set instead of using an image.
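Since the original data was only posted as an image, here is a small made-up data frame (hypothetical values, with column names taken from the answers above) showing that the explicit comparison, the rowSums check, and the logical-AND shortcut all give the same count.

```r
# Hypothetical call log: 1 = yes, 0 = no
calls <- data.frame(landline           = c(1, 0, 1, 1, 0),
                    `under 10 minutes` = c(1, 1, 0, 1, 0),
                    check.names = FALSE)   # keep the spaces in the column name

# Explicit comparison on each column
n_explicit <- sum(calls$landline == 1 & calls$`under 10 minutes` == 1)

# Both columns are 1 exactly when the row sum is 2
n_rowsums <- sum(rowSums(calls[, c("landline", "under 10 minutes")]) == 2)

# 0/1 values coerced to FALSE/TRUE
n_logical <- sum(calls$landline & calls$`under 10 minutes`)
```

All three variables hold 2 here, because only the first and fourth calls are landline calls under 10 minutes.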

Subset based on a variable value interpreted as a column

I'm trying to subset a data frame based on a variable I'm passing into a function. My goal is to build a column name inside the function from the values I pass in, and then filter on that newly formed column name.
Here's a reproducible example:
var_as_col_name <- function(df, col_var, filter_var) {
subset(df, col_var == filter_var)
}
# this should return what subset(df, cty == 18) would return
var_as_col_name(mpg,"cty", 18)
# this should return what subset(df, cyl == 4) would return
var_as_col_name(mpg,"cyl", 4)
Also, apart from the filters on mpg$cty and mpg$cyl above, I might have another filter that is hardcoded, which I don't want to change, i.e. my requirement should hold for more than one filter. Is there a better approach without using subset (since it is meant for interactive use)?
I am doing this because I have some columns in my dataset like t_1, t_2, t_3...t_24 and I need to filter on either of them and another flag column, so I'm doing:
df_1 <- subset(my_df,flag == 0 & t_1 > 0 & t_1 < 1) when I want data after filtering on t_1
df_2 <- subset(my_df,flag == 1 & t_2 > 0 & t_2 < 1) when I want data after filtering on t_2
...
Instead of this I was thinking of writing a function that takes:
n from 1 to 24, filters on that t_n
takes 1 or 0 for the flag.
and then returns the subsetted dataframe that I want.
Let me know if you need clarification on the question and thanks for your help...
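One possible approach, sketched here with base R only, is to look the column up by name with [[ ]] and build a logical index, which avoids subset() entirely. The helper filter_t and the toy data frame are made up for illustration, and the built-in mtcars data set stands in for mpg so the example runs without extra packages.

```r
# Filter rows where the column named by col_var equals filter_var
var_as_col_name <- function(df, col_var, filter_var) {
  keep <- df[[col_var]] == filter_var
  df[which(keep), , drop = FALSE]   # which() drops rows where the comparison is NA
}

# Hypothetical helper for the t_1 ... t_24 case: n picks the column,
# flag_value is the required value of the flag column
filter_t <- function(df, n, flag_value) {
  t_col <- df[[paste0("t_", n)]]
  keep  <- df$flag == flag_value & t_col > 0 & t_col < 1
  df[which(keep), , drop = FALSE]
}

four_cyl <- var_as_col_name(mtcars, "cyl", 4)

# Made-up data with the column layout described in the question
toy <- data.frame(flag = c(0, 0, 1, 1),
                  t_1  = c(0.5, 2.0, 0.3, 0.9),
                  t_2  = c(0.1, 0.2, 1.5, 0.4))
df_1 <- filter_t(toy, 1, 0)   # flag == 0 and 0 < t_1 < 1
df_2 <- filter_t(toy, 2, 1)   # flag == 1 and 0 < t_2 < 1
```

Any additional hardcoded filter can be added as another term in the keep expression, since it is an ordinary logical vector.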

Delete specific rows out of a dataset

I have a dataset with 40 columns and 100,000 rows. Because the number of rows is too big, I want to delete some of them. I want to delete rows 10,000-20,000, 30,000-40,000, and 60,000-70,000, so that I end up with a dataset of 40 columns and 70,000 rows. The first column is an ID (called ItemID) that starts at 1 and ends at 100,000 for the last row. Can someone please help me?
I tried this to delete the rows from 10000 to 20000, but it's not working (let's say the data set is called "Data"):
Data <- Data[Data$ItemID>10000 && Data$ItemID<20000]

There are several ways of doing this. Does something like this suit your needs?
dat <- data.frame(ItemID=1:100, x=rnorm(100))
# via row numbers
ind <- c(10:20,30:40,60:70)
dat1 <- dat[-ind,]
# via logical vector
ind <- with(dat, { (ItemID >= 10 & ItemID <= 20) |
(ItemID >= 30 & ItemID <= 40) |
(ItemID >= 60 & ItemID <= 70) })
dat2 <- dat[!ind,]
To take it to the scale of your data set, just scale ind according to the size of your data set (multiplication might do).

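I think you should be able to do
data <- data[-(10000:20000),]
and then remove the other rows in a similar manner. Keep in mind that row positions shift after each removal, so either select by ItemID for the later ranges or drop all three ranges in one step, e.g. data[-c(10000:20000, 30000:40000, 60000:70000),].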
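At the full scale described in the question, a minimal sketch might look like the following. The data frame is a stand-in with only two columns, and treating the range endpoints as inclusive is an assumption. The attempt in the question fails for two reasons: && is not vectorized (it is meant for single TRUE/FALSE values), and the comma after the row index is missing, so R would select columns instead of rows.

```r
# Stand-in for the real 40-column data set
Data <- data.frame(ItemID = 1:100000, x = seq_len(100000))

# Element-wise & and | build one logical value per row
drop <- (Data$ItemID >= 10000 & Data$ItemID <= 20000) |
        (Data$ItemID >= 30000 & Data$ItemID <= 40000) |
        (Data$ItemID >= 60000 & Data$ItemID <= 70000)

# Note the trailing comma: rows are selected, all columns are kept
Data <- Data[!drop, ]
```

Because the selection is made on ItemID rather than on row position, all three ranges are removed in one step and nothing shifts in between. With inclusive endpoints this removes 30,003 rows, leaving 69,997.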
