How to manipulate data by row in a data frame - r

I'm getting a bit confused. I've got data like this in a data frame
  index  times
1     1  56.60
2     1 150.75
3     1 204.41
4     2  44.71
5     2  98.03
6     2 112.20
and I know that the times indexed 1 are biased, whereas the times indexed otherwise are not. I need to create a copy of that data frame with the bias removed from the samples indexed 1. I've been trying several combinations of apply, by, and the like. The closest I got was with
by(lct, lct$index, function(x) { if(x$index == 1) x$times = x$times-50 else x$times = x$times } )
which returned an object of class by, which is unusable for me. I need to write the data back to a csv file in the same format (index, times) of the original file. Ideas?

Something like this should work:
df$times[df$index == 1] <- df$times[df$index == 1] - 50
The trick here is to take the subset of df$times that fits your filter, and realize that R can also assign to a subset.
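Since the original question also asks about writing the result back to a csv file, here is a minimal end-to-end sketch using that subset-assignment idea; the file names are placeholders, not from the original post:
# read the original data (hypothetical file name)
lct <- read.csv("times.csv")
# remove the bias of 50 from the rows indexed 1, in place
lct$times[lct$index == 1] <- lct$times[lct$index == 1] - 50
# write the corrected data back out in the same (index, times) format
write.csv(lct, "times_corrected.csv", row.names = FALSE)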
Alternatively, you can use ifelse:
df$times = ifelse(df$index == 1, df$times - 50, df$times)
and use it in dplyr:
library(dplyr)
df = data.frame(index = sample(1:5, 100, replace = TRUE),
                value = runif(100)) %>% arrange(index)
df %>% mutate(value = ifelse(index == 1, value - 50, value))
# index value
#1 1 -49.95827
#2 1 -49.98104
#3 1 -49.44015
#4 1 -49.37316
#5 1 -49.76286
#6 1 -49.22133
#etc

How about,
index <- c(1, 1, 1, 2, 2, 2)
times <- c(56.60, 150.75, 204.41, 44.71, 98.03, 112.20)
df <- data.frame(index, times)
df$times <- ifelse(df$index == 1, df$times - 50, df$times)
> df
#index times
#1 1 6.60
#2 1 100.75
#3 1 154.41
#4 2 44.71
#5 2 98.03
#6 2 112.20

Related

In dplyr::mutate, refer to a value conditionally, based on the value of another column

Apologies for the unclear title; although it isn't very descriptive, I couldn't think of a better way to frame this problem.
Here is a sample dataset I am working with
test = data.frame(
  Value = c(1:5, 5:1),
  Index = c(1:5, 1:5),
  GroupNum = c(rep.int(1, 5), rep.int(2, 5))
)
I want to create a new column (called "Value_Standardized") whose values are calculated by grouping the data by GroupNum and then dividing each Value observation by the Value observation of the group when the Index is 1.
Here's what I've come up with so far.
test2 = test %>%
  group_by(GroupNum) %>%
  mutate(Value_Standardized = Value / special_function(Value))
The special_function here stands for some way of getting the Value where Index == 1.
That is also precisely the problem - I cannot figure out a way to get the denominator to be the value when index == 1 in that group. Unfortunately, the value when the index is 1 is not necessarily the max or the min of the group.
Thanks in advance.
Edit: Emphasis added for clarity.
There is a super simple tidyverse way of doing this with cur_data(): it pulls the tibble for the current subset (group) of data and acts on it.
test2 <- test %>%
  group_by(GroupNum) %>%
  mutate(output = Value / cur_data()$Value[1])
cur_data() grabs the tibble for the current group; you then extract the Value column as you normally would with $Value, and because the denominator is always the first row of the group, you select it with [1].
Nice and neat. There is a whole family of cur_*() functions you can use; see the dplyr documentation for the rest.
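As a side note, the same denominator can be obtained without cur_data() by indexing the Value column directly inside the grouped mutate; a minimal sketch, assuming each group has exactly one Index == 1 row:
library(dplyr)
test2 <- test %>%
  group_by(GroupNum) %>%
  # look up the Index == 1 row explicitly rather than relying on row order
  mutate(Value_Standardized = Value / Value[Index == 1]) %>%
  ungroup()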
Not sure if this is what you meant, nor if it's the best way to do this but...
Instead of using a group_by I used a nested pipe, filtering and then left_joining the table to itself.
test = data.frame(
  Value = c(1:5, 5:1),
  Index = c(1:5, 1:5),
  GroupNum = c(rep.int(1, 5), rep.int(2, 5))
)
test %>%
  left_join(test %>%
              filter(Index == 1) %>%
              select(Value, GroupNum),
            by = "GroupNum",
            suffix = c('', '_Index_1')) %>%
  mutate(Value = Value / Value_Index_1)
output:
Value Index GroupNum Value_Index_1
1 1.0 1 1 1
2 2.0 2 1 1
3 3.0 3 1 1
4 4.0 4 1 1
5 5.0 5 1 1
6 1.0 1 2 5
7 0.8 2 2 5
8 0.6 3 2 5
9 0.4 4 2 5
10 0.2 5 2 5
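If you don't want the helper column in the final result, a select() step at the end of the pipe drops it; this last line is my addition, not part of the original answer:
test %>%
  left_join(test %>% filter(Index == 1) %>% select(Value, GroupNum),
            by = "GroupNum", suffix = c('', '_Index_1')) %>%
  mutate(Value = Value / Value_Index_1) %>%
  select(-Value_Index_1) # drop the Index-1 helper column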
A quick base R solution:
test = data.frame(
Value = c(1:5, 5:1),
Index = c(1:5, 1:5),
GroupNum = c(rep.int(1, 5), rep.int(2, 5)),
Value_Standardized = NA
)
groups <- levels(factor(test$GroupNum))
for (currentGroup in groups) {
  inGroup <- test$GroupNum == currentGroup
  test$Value_Standardized[inGroup] <-
    test$Value[inGroup] / test$Value[inGroup & test$Index == 1]
}
This only works under the assumption that each group will have only one observation with a "1" index though. It's easy to run into trouble...
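For completeness, the same lookup can be vectorized in base R with match(), avoiding the loop entirely; a sketch under the same one-Index-1-row-per-group assumption:
# per-row denominator: the Value of the Index == 1 row of the matching group
idx1 <- test$Index == 1
test$Value_Standardized <- test$Value /
  test$Value[idx1][match(test$GroupNum, test$GroupNum[idx1])]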

Looping over multiple columns to generate a new variable based on a condition

I am trying to generate a new column (variable) based on the value inside multiple columns.
I have over 60 columns in the dataset and I wanted to subset the columns that I want to loop through.
The columns I am using in my condition are all characters, and when a certain pattern is matched, the new variable should be given a value of 1.
I am using case_when because I need to run multiple conditions on each column to return a value.
CODE:
df <- read.csv("sample.csv")
# Generate new variable
df$new_var <- 0
# For loop through columns 16 to 45
for (i in colnames(df[16:45])) {
  df <- df %>%
    mutate(new_var =
             case_when(
               grepl("I8501", df[[i]]) ~ 1
             ))
}
This does not work; when I table the results, only one value gets matched.
My other attempt was using:
for (i in colnames(df[16:45])) {
  df <- df %>%
    mutate(new_var =
             case_when(
               df[[i]] == "I8501" ~ 1
             ))
}
Are there any other ways in R to run through multiple columns with multiple conditions and change the value of the variable accordingly?
If I'm understanding what you want, I think you just need to specify another case in your case_when() for keeping the existing values when things don't match "I8501". This is how I would do that:
df$new_var <- 0
for (index in 16:45) {
  df <- df %>%
    mutate(
      new_var = case_when(
        grepl("I8501", df[[index]]) ~ 1,
        TRUE ~ df$new_var
      )
    )
}
I think a better way to do this though would be to use the ever useful apply():
has_match = apply(df[, 16:45], 1, function(x) sum(grepl("I8501", x)) > 0)
df$new_var = ifelse(has_match, 1, 0)
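If the selected columns are all character, a fully vectorized variant (my sketch, not part of the original answer) applies grepl() column by column and combines the results with rowSums():
# TRUE wherever any of columns 16:45 contains "I8501"
has_match <- rowSums(sapply(df[16:45], grepl, pattern = "I8501")) > 0
df$new_var <- as.numeric(has_match)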
Kindly check if this works for your file.
Sample df:
df <- data.frame(C1=c('A','B','C','D'),C2=c(1,7,3,4),C3=c(5,6,7,8))
> df
C1 C2 C3
1 A 1 5
2 B 7 6
3 C 3 7
4 D 4 8
library(dplyr)
library(stringr) # str_detect() comes from stringr
df %>%
  rowwise() %>%
  mutate(new_var = as.numeric(any(str_detect(c_across(2:last_col()), "7")))) # change 2:last_col() to select your column range, e.g. 2:5
Output for finding "7" in any of the columns:
C1 C2 C3 new_var
<chr> <dbl> <dbl> <dbl>
1 A 1 5 0
2 B 7 6 1
3 C 3 7 1
4 D 4 8 0
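A non-rowwise variant is possible with dplyr's if_any() (available in dplyr >= 1.0.4); this sketch is my own addition and uses grepl(), which coerces its input to character, so it also works on numeric columns:
df %>%
  mutate(new_var = as.numeric(if_any(2:last_col(), ~ grepl("7", .x))))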

Using a for loop across columns with similar names

I am trying to use the tidyverse (purrr) package to run a for loop across my dataset. I want to check whether some number of conditions are true across certain columns along the dataset. Note, I am trying to become more familiar with tidyverse and its functions rather than rely on Base R.
Here is the code that I want to write a for loop for.
nrow(subset(data, flwr_clstr1>1 & bud_clstr1==0))
nrow(subset(data, flwr_clstr2>1 & bud_clstr2==0))
nrow(subset(data, flwr_clstr3>1 & bud_clstr3==0))
I have columns of data (in this case, it would be flwr_clstr) that are similar, but differ by the last digit. Also, if there is another way to use tidyverse to check these 'conditions', that would be great too.
Here is my attempt at the for loop.
check1 <- vector("double", ncol(data_phen))
for (i in seq_along(data_phen)) {
check[[i]] <- nrow(subset(data, flwr_clstr[[i]]>1 & bud_clstr[[i]]==0))
}
It would be easier to help if you could provide a reproducible example, however I created a sample of what your data might look like based on my understanding.
We can use map2_int from purrr, since we are trying to count the number of rows in each pair of columns:
library(dplyr)
library(purrr)
map2_int(data %>% select(starts_with("flwr_clstr")),
         data %>% select(starts_with("bud_clstr")),
         ~ sum(.x > 1 & .y == 0)) %>%
  unname()
#[1] 2 3 1
However, base R isn't that bad either. This can be solved using mapply
col1 <- grep("^flwr_clstr", names(data))
col2 <- grep("^bud_clstr", names(data))
mapply(function(x, y) sum(x > 1 & y == 0), data[col1], data[col2])
data
Assuming you have equal number of columns for both "flwr_clstr.." and "bud_clstr.."
data <- data.frame(flwr_clstr1 = c(2, 1, 2, 1, 0), flwr_clstr2 = c(2, 2, 2, 1, 0),
flwr_clstr3 = c(1, 1, 2, 1, 1), bud_clstr1 = 0, bud_clstr2 = 0,bud_clstr3 = 0)
which looks like
data
# flwr_clstr1 flwr_clstr2 flwr_clstr3 bud_clstr1 bud_clstr2 bud_clstr3
#1 2 2 1 0 0 0
#2 1 2 1 0 0 0
#3 2 2 2 0 0 0
#4 1 1 1 0 0 0
#5 0 0 1 0 0 0

Vectorization/data.table - increase efficiency of for loop for 12kk records DF

I need to assign the group ID for 20k groups across what amounts to 12M rows in total.
To solve this problem I wrote a for loop, but it is clearly inefficient, and I am sure this task can easily be vectorized. However, I am struggling to understand how to write this instruction in a vectorized fashion.
The problem is the following:
I have an auxiliary_table with 3 features: ID, start_row, end_row.
start_row is the row index of the first element in my_DF belonging to ID x;
end_row is the row index of the last element in my_DF belonging to ID x.
The vectorized instruction should do the following:
Considering the auxiliary_table like the following:
auxiliary_table <- data.frame(ID = c(1,2,3,4), start_row = c(1,4,8,13), end_row = c(3,7,12,14))
Considering a DF like the following:
my_df <- data.frame(Var_a = c(1,2,3,1,2,3,4,6,4,3,1,2,1,1))
We need to associate the ID based on the start_row and end_row index information contained in the auxiliary_table.
The solution_df is:
solution_df <- data.frame(my_df, ID = c(1,1,1,2,2,2,2,3,3,3,3,3,4,4))
I asked for a vectorization of the for loop but I am open for example to data.table solutions.
I hope I was clear and presented my question correctly.
The auxiliary_table is, in effect, run-length encoded. Therefore, I suggest trying the inverse.rle() function with an appropriately transformed auxiliary_table:
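To see why this fits: inverse.rle() expands a list of lengths and values back into a full vector, which is exactly the ID column we need. A quick illustration with the run lengths implied by the auxiliary_table (end_row - start_row + 1):
inverse.rle(list(lengths = c(3, 4, 5, 2), values = c(1, 2, 3, 4)))
# [1] 1 1 1 2 2 2 2 3 3 3 3 3 4 4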
1. dplyr
library(dplyr)
my_df %>%
mutate(ID = auxiliary_table %>%
transmute(lengths = end_row - start_row + 1L, values = ID) %>%
inverse.rle())
Var_a ID
1 1 1
2 2 1
3 3 1
4 1 2
5 2 2
6 3 2
7 4 2
8 6 3
9 4 3
10 3 3
11 1 3
12 2 3
13 1 4
14 1 4
2. data.table
This adds the ID column without copying my_df.
library(data.table)
setDT(my_df)[, ID := inverse.rle(setDT(auxiliary_table)[
, .(lengths = end_row - start_row + 1L, values = ID)])][]
Depending on the size of auxiliary_table the code below might be somewhat more efficient because it transforms auxiliary_table in place:
setDT(my_df)[, ID := inverse.rle(setDT(auxiliary_table)[
, lengths := end_row - start_row + 1L][
, c("end_row", "start_row") := NULL][
, setnames(.SD, "ID", "values")])][]
I have designed a user-defined function and applied it to the auxiliary_table. See if this helps:
auxiliary_table <- data.frame(ID = c(1,2,3,4), start_row = c(1,4,8,13), end_row = c(3,7,12,14))
my_df <- data.frame(Var_a = c(1,2,3,1,2,3,4,6,4,3,1,2,1,1))
solution_df <- data.frame(my_df, ID=c(1,1,1,2,2,2,2,3,3,3,3,3,4,4))
aux_to_df <- function(aux_row) {
  # 1, 2, 3 can be replaced by column names
  value <- aux_row[1]
  start_row <- aux_row[2]
  end_row <- aux_row[3]
  my_df[start_row:end_row, "ID"] <<- value # <<- assigns to the global my_df outside the function's scope
}
apply(auxiliary_table, 1, aux_to_df)
my_df
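Another vectorized base R option, my own addition rather than part of the answers above: since start_row is sorted and the intervals cover the rows contiguously, findInterval() maps every row number straight to its group:
# row i belongs to the interval whose start_row is the largest one <= i
my_df$ID <- auxiliary_table$ID[findInterval(seq_len(nrow(my_df)), auxiliary_table$start_row)]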

How to remove rows that do not have more than 3 values in r?

this is my first time asking a question and hopefully I can get your help!
I need to remove rows that have values for only one or two genes using R
basically I need to get rid of 50S, ABCC8, and ACAT1 because these have an n < 3.
My desired output is the same table with those genes removed.
thank you very much!
If this is in a data.frame, you can use the dplyr package to do some manipulation. We can group the data by Genes and count how many instances there are. Then we simply set the filter criteria to remove the records.
require(dplyr)
df <- data.frame(
Genes=c('50S' ,'abcb1' ,'abcb1' ,'abcb1' ,'ABCC8' ,'ABL' ,'ABL' ,'ABL' ,'ABL' ,'ACAT1' ,'ACAT1' ),
Values=c(-0.627323448, -0.226358414, 0.347305901 ,0.371632631 ,0.099485307 ,0.078512979 ,-0.426643782, -1.060270668, -2.059157991, 0.608899174 ,-0.048795611)
)
# group, filter, and join back to subset the data
df %>%
  group_by(Genes) %>%
  summarize(count = n()) %>%
  filter(count >= 3) %>%
  inner_join(df) %>%
  select(Genes, Values)
As per #Lamia's comments, it is possible to simplify it to just:
df %>% group_by(Genes) %>% filter(n()>=3)
# generating data
x <- c(NA, NA, NA, NA, 2, 3) # has n < 3!
y <- c(1, 2, 3, 4, 5, 6)
z <- c(1, 2, 3, NA, 5, 6)
df <- data.frame(x, y, z)
colsToKeep <- c() # empty vector to be filled with column numbers
for (i in 1:ncol(df)) { # for every column
  if (sum(!is.na(df[, i])) >= 3) { # if that column has at least 3 valid (non-NA) values...
    colsToKeep <- c(colsToKeep, i) # ...then save that column number into this vector
  }
}
df[, colsToKeep] # then use that vector to select the columns you want
Note that R treats FALSE as 0 and TRUE as 1, so that is how the sum() function works here.
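A quick illustration of that coercion, counting the non-NA values in a vector:
sum(!is.na(c(1, NA, 3, NA))) # each non-NA contributes TRUE = 1, so this returns 2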
Another possible solution by using table:
gene <- c("A", "A", "A", "B", "B", "C", "C", "C", "C", "D")
value <- seq(1, 10, 1)
df <- data.frame(gene, value)
df
   gene value
1     A     1
2     A     2
3     A     3
4     B     4
5     B     5
6     C     6
7     C     7
8     C     8
9     C     9
10    D    10
su <- data.frame(table(df$gene))
df_keep <- df[which(df$gene %in% su[which(su$Freq > 2), 1]), ]
df_keep
gene value
1 A 1
2 A 2
3 A 3
6 C 6
7 C 7
8 C 8
9 C 9
