Number of occurrences in a dataframe - r

I have the following data frame, and I want to count the occurrences in each row (identified by the first column) and append the count as another column, say "freq", to the data frame:
df:
gene a b c
abc 1 NA 1
bca NA 1 1
cba 1 2 1
My df is bigger, so this is only a small example; the solution needs to scale.
The desired data frame is:
gene a b c freq
abc 1 NA 1 2
bca NA 1 1 2
cba 1 2 1 3
The code I have tried is:
g <- df %>% mutate(numtwos = rowSums(. > 0))
or
df$freq <- apply(df , 1, function(x) length(which(x>0)))
But it is not working: even if a row should have (for example) 150 occurrences, I obtain only 2 for every row.
Any help or other point of view is welcome!
Thanks

We can first convert the "Na" strings to real NA values and then take the row sums:
library(dplyr)
df %>%
  mutate_at(vars(a:c), ~ as.numeric(na_if(., "Na"))) %>%
  mutate(freq = rowSums(select(., a:c), na.rm = TRUE))
# gene a b c freq
#1 abc 1 NA 1 2
#2 bca NA 1 1 2
#3 cba 1 1 1 3
Here, the values are all 1s, so it is the same as counting the non-NA values:
df %>%
  mutate_at(vars(a:c), ~ as.numeric(na_if(., "Na"))) %>%
  mutate(freq = rowSums(!is.na(select(., a:c))))
data
df <- structure(list(gene = c("abc", "bca", "cba"), a = c("1", "Na",
"1"), b = c("Na", "1", "1"), c = c(1L, 1L, 1L)),
class = "data.frame", row.names = c(NA,
-3L))

I haven't used R for a while, so I won't paste in the code, but you can create a new df by grouping the initial one by gene and then merge/join it back to your initial df in another line of code.
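For reference, a minimal dplyr sketch of that group-and-join idea (assuming "occurrences" here means the number of rows per gene; the counts helper name is just illustrative):
library(dplyr)
# count the rows per gene, then join the counts back onto the original data
counts <- df %>%
  group_by(gene) %>%
  summarise(freq = n())
df <- left_join(df, counts, by = "gene")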

Related

How to create a variable to a dataset conditioning on missing values and another dataframe at the same time?

I have these two data frames (imagine them being very big):
df = data.frame(subjects = 1:10,
var1 = c('a',NA,'b',NA,'c',NA,'d','e','f','g'))
g = data.frame(subjects = c(1,3,5,7,8,9,10),
score = c(1,2,1,3,2,4,1) )
and I want to put the variable score from the g data frame into the df data frame, with the condition that if var1 is NA, then the score in df will also be NA. How can we do that with a simple function? Thanks.
Second scenario:
df = data.frame(subjects = 1:10,
var1 = c('a','e','b','c','c','b','d','e','f','g'))
g = data.frame(subjects = c(1,3,5,7,8,9,10),
score = c(1,2,1,3,2,4,1) )
Now I want the score for each subject that was not calculated to become NA, as follows:
df = data.frame(subjects = 1:10,
var1 = c('a','e','b','c','c','b','d','e','f','g'),
score = c(1,NA,2,NA,1,NA,3,2,4,1))
We could do a join by 'subjects', which returns 'score' as NA where there is no corresponding 'subject' in 'g'. If we need 'score' to also be NA when 'var1' is NA, do a replace in the next step with an NA check on 'var1':
library(dplyr)
df <- left_join(df, g, by= "subjects") %>%
mutate(score = replace(score, is.na(var1), NA))
Output:
df
subjects var1 score
1 1 a 1
2 2 e NA
3 3 b 2
4 4 c NA
5 5 c 1
6 6 b NA
7 7 d 3
8 8 e 2
9 9 f 4
10 10 g 1

How to create new variables using all possible subtractions combinations of the original ones in R?

So I have this big data set with 32 variables, and I need to work with relative values of these variables using all possible pairwise subtractions among them, e.g. var1-var2 ... var1-var32; var3-var4 ... var3-var32, and so on. I'm new to R, so I would like to do this without doing the whole process manually. I'm out of ideas other than doing it all by hand. Any help appreciated! Thanks!
Ex:
df_original
id Var1 Var2 Var3
x  1    3    2
y  2    5    7
df_wanted
id Var1 Var2 Var3 Var1-Var2 Var1-Var3 Var2-Var3
x  1    3    2    -2        -1        1
y  2    5    7    -3        -5        -2
You can do this with combn, which creates combinations of the columns taken 2 at a time. Within combn you can apply a function to every combination, in which we subtract the two columns of the data frame and add the result as new columns:
cols <- grep('Var', names(df), value = TRUE)
new_df <- cbind(df, do.call(cbind, combn(cols, 2, function(x) {
  setNames(data.frame(df[x[1]] - df[x[2]]), paste0(x, collapse = '-'))
}, simplify = FALSE)))
new_df
# id Var1 Var2 Var3 Var1-Var2 Var1-Var3 Var2-Var3
#1 x 1 3 2 -2 -1 1
#2 y 2 5 7 -3 -5 -2
data
df <- structure(list(id = c("x", "y"), Var1 = 1:2, Var2 = c(3L, 5L),
Var3 = c(2L, 7L)), class = "data.frame", row.names = c(NA, -2L))

Dynamically select all columns but among ones that start with a certain word exclude all but keep one

I have many data frames that come in such a format:
df1 <- structure(list(ID = 1:2, Name = 1:2, Gender = 1:2, Group = 1:2,
FORMULA_RULE = 1:2, FORMULA_TRANSFORM = 1:2, FORMULA_UNITE = 1:2,
FORMULA_CALCULATE = 1:2, FORMULA_JOIN = 1:2), class = "data.frame", row.names = c(NA,
-2L))
df2 <- structure(list(ID = 1:2, Name = 1:2, Gender = 1:2, FORMULA_RULE = 1:2,
FORMULA_META = c(NA, NA), FORMULA_DATA = 1:2, FORMULA_JOIN = 1:2,
FORMULA_TRANSFORM = 1:2, Group = 1:2), class = "data.frame", row.names = c(NA,
-2L))
View:
df1
ID Name Gender Group FORMULA_RULE FORMULA_TRANSFORM FORMULA_UNITE FORMULA_CALCULATE FORMULA_JOIN
1 1 1 1 1 1 1 1 1 1
2 2 2 2 2 2 2 2 2 2
df2
ID Name Gender FORMULA_RULE FORMULA_META FORMULA_DATA FORMULA_JOIN FORMULA_TRANSFORM Group
1 1 1 1 1 NA 1 1 1 1
2 2 2 2 2 NA 2 2 2 2
I want to write code that works on all such data frames so that all columns are kept, but among the columns that start with FORMULA_, only FORMULA_TRANSFORM is selected. Please note that the columns that do NOT start with FORMULA_ are not always the same; that is to say, I cannot simply write code that always selects ID, Name, Gender, Group, and FORMULA_TRANSFORM, because some data frames contain many other columns that do not start with FORMULA_ which I also want to keep.
My attempt to solve this problem is this ugly code which works as expected:
library(tidyverse)
for(i in 1:length(ls(pattern = "df"))){
get(paste0("df", i)) %>%
select(-starts_with("FORMULA"),
(names(get(paste0("df", i))) %>% grep(pattern = "FORMULA", value = T))[!names(get(paste0("df", i))) %>% grep(pattern = "FORMULA", value = T) %in% "FORMULA_TRANSFORM"])
%>% print
}
Is there a more straightforward way to do this?
With dplyr we can use select, and it's pretty straightforward using starts_with and contains:
library(dplyr)
df1 %>%
select(-starts_with("FORMULA_"), contains("FORMULA_TRANSFORM"))
# ID Name Gender Group FORMULA_TRANSFORM
#1 1 1 1 1 1
#2 2 2 2 2 2
Let's try with a data frame that has no "FORMULA_TRANSFORM" column:
df3 <- df1
df3$FORMULA_TRANSFORM <- NULL
df3 %>%
select(-starts_with("FORMULA_"), contains("FORMULA_TRANSFORM"))
# ID Name Gender Group
#1 1 1 1 1
#2 2 2 2 2
With the minus sign we remove the columns that start with "FORMULA_" and then select the one named "FORMULA_TRANSFORM". Instead of contains we can also use one_of() or matches() and it would still work.
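For example, a sketch of the matches() variant mentioned above, anchoring the regex so only the exact column name is matched:
df1 %>%
  select(-starts_with("FORMULA_"), matches("^FORMULA_TRANSFORM$"))
# same result as the contains() version above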
Using base R, we can use grep with invert and value set to TRUE:
df1[c(grep("^FORMULA_", names(df1), invert = TRUE, value = TRUE),
"FORMULA_TRANSFORM")]
# ID Name Gender Group FORMULA_TRANSFORM
#1 1 1 1 1 1
#2 2 2 2 2 2
This creates a vector of the column names that don't start with "FORMULA_", and we add "FORMULA_TRANSFORM" to it manually.
The above method assumes that you always have a "FORMULA_TRANSFORM" column in your data frame and will fail if there isn't one. A safer option would be:
get_selected_cols <- function(df1) {
  cbind(df1[grep("^FORMULA_", names(df1), invert = TRUE)],
        df1[names(df1) == "FORMULA_TRANSFORM"])
}
get_selected_cols(df1)
# ID Name Gender Group FORMULA_TRANSFORM
#1 1 1 1 1 1
#2 2 2 2 2 2
get_selected_cols(df3)
# ID Name Gender Group
#1 1 1 1 1
#2 2 2 2 2

Append dataFrame columns to other columns with different names and order?

I am struggling with reordering a dataFrame in R.
My dataFrame has data coming from two different sensors. So in the beginning every column has a name with the syntax "sensor number.sample number". The rowname is a coordinate of each sample.
Sadly the columns are not ordered with an ascending sample number.
How can I make an automatic ordering where after number 1 comes 2 and not 10?
With correctly ordered columns, I would like to cut all columns of the second sensor and append them under the rows of the first sensor. This is also tricky, as the number of columns per sensor varies in reality.
To distinguish between the two sensors, I would add a suffix "a" or "b" to the new rownames.
Here my problem is that I know "rbind", but it requires identical column names, which I cannot provide here. I would also need to select the columns manually, as I have no clue how to automatically select all columns of the second sensor.
My idea for the moment is to make subsets for each sensor, rename the columns and then use rbind with both subsets. Is this a good idea?
The rownames I then could modify with paste().
I now present simplified frames as the original is quite big. So the numbers (c(1:3)) are just exemplary.
This is how my dataFrame looks at the beginning:
myDf = data.frame(a.10= c(1:3),a.11= c(1:3),a.12= c(1:3),a.13= c(1:3),a.2= c(1:3),a.3= c(1:3),a.4= c(1:3),a.5= c(1:3),a.6= c(1:3),a.7= c(1:3),a.8= c(1:3),a.9= c(1:3),
b.1= c(1:3),b.10= c(1:3),b.11= c(1:3),b.2= c(1:3),b.3= c(1:3),b.4= c(1:3),b.5= c(1:3),b.6= c(1:3),b.7= c(1:3),b.8= c(1:3),b.9= c(1:3))
My goal is to transform the dataFrame so that it looks like this:
desiredDf =data.frame(n9=rep(c(1:3),2), n10=rep(c(1:3),2), n11=rep(c(1:3),2), n12=c(c(1:3),NA, NA, NA), n13=c(c(1:3), NA, NA, NA))
rownames(desiredDf)<-(c("1a","2a","3a","1b","2b","3b"))
Thank you very much!
Here is an option.
library(tidyverse)
myDF2 <- myDf %>%
  gather(measure, result, a.10:b.9) %>%
  separate(measure, into = c("letter", "number"), sep = "\\.") %>%
  group_by(letter, number) %>%
  mutate(n = row_number()) %>%
  unite(col, n, letter, sep = "") %>%
  ungroup() %>%
  arrange(as.numeric(number)) %>%
  mutate(number = paste0("n", number)) %>%
  mutate(number = factor(number, levels = unique(number))) %>%
  spread(number, result) %>%
  arrange(col)
row.names(myDF2) <- myDF2$col
myDF2$col <- NULL
Convert the row names to a column, reshape into long form, and separate the key (i.e. the original column names) into columns group and no, converting the latter to numeric. Sort, reshape back to wide form, sort again, combine the rowname and group, and preface each column name with n.
library(dplyr)
library(tibble)
library(tidyr)
myDf %>%
  rownames_to_column %>%
  gather(key, value, -rowname) %>%
  separate(key, c("group", "no"), convert = TRUE) %>%
  arrange(group, no) %>%
  spread(no, value) %>%
  arrange(group, rowname) %>%
  unite(rowname, rowname, group, sep = "") %>%
  column_to_rownames %>%
  rename_all(~ paste0("n", .))
giving:
n1 n2 n3 n4 n5 n6 n7 n8 n9 n10 n11 n12 n13
1a NA 1 1 1 1 1 1 1 1 1 1 1 1
2a NA 2 2 2 2 2 2 2 2 2 2 2 2
3a NA 3 3 3 3 3 3 3 3 3 3 3 3
1b 1 1 1 1 1 1 1 1 1 1 1 NA NA
2b 2 2 2 2 2 2 2 2 2 2 2 NA NA
3b 3 3 3 3 3 3 3 3 3 3 3 NA NA
Note
Above we used the following as myDf, the input:
myDf <-
structure(list(a.10 = 1:3, a.11 = 1:3, a.12 = 1:3, a.13 = 1:3,
a.2 = 1:3, a.3 = 1:3, a.4 = 1:3, a.5 = 1:3, a.6 = 1:3, a.7 = 1:3,
a.8 = 1:3, a.9 = 1:3, b.1 = 1:3, b.10 = 1:3, b.11 = 1:3,
b.2 = 1:3, b.3 = 1:3, b.4 = 1:3, b.5 = 1:3, b.6 = 1:3, b.7 = 1:3,
b.8 = 1:3, b.9 = 1:3), class = "data.frame", row.names = c(NA,
-3L))

How to delete all rows that contain a certain value regardless of which column it is in

I need to delete all rows that contain a value of 2 or -2 in any column except column one.
Example dataframe:
df
a b c d
zzz 2 2 -1
yyy 1 1 1
xxx 1 -1 -2
Desired output:
df
a b c d
yyy 1 1 1
I have tried
df <- df[!grepl(-2 | 2, df),]
df <- subset(df, !df[-1] == 2 |!df[-1] == -2)
My actual dataset has over 300 rows and 70 variables.
I believe I need to use some sort of apply function, but I am not sure.
Any help is appreciated; please let me know if you need more info.
We can create a logical index by comparing the absolute values of the dataset with 2 and taking the row-wise sums. For rows with no such values the sum is 0; negating with ! turns those 0s into TRUE and everything else into FALSE, and we subset based on that logical index:
df[!rowSums(abs(df[-1])==2),]
# a b c d
#2 yyy 1 1 1
Another option is to compare within each column using lapply, collapse the results into a single logical vector with |, and use that to subset the rows:
df[!Reduce(`|`,lapply(abs(df[-1]), `==`, 2)),]
# a b c d
#2 yyy 1 1 1
We could also do this with the tidyverse:
library(tidyverse)
df %>%
  select(-1) %>%            # remove the first column
  map(~ abs(.) == 2) %>%    # do the column-wise comparison
  reduce(`|`) %>%           # reduce it to a logical vector
  `!` %>%                   # negate to convert TRUE/FALSE to FALSE/TRUE
  df[., ]                   # subset the rows of the original dataset
# a b c d
# 2 yyy 1 1 1
data
df <- structure(list(a = c("zzz", "yyy", "xxx"), b = c(2L, 1L, 1L),
c = c(2L, 1L, -1L), d = c(-1L, 1L, -2L)), .Names = c("a",
"b", "c", "d"), class = "data.frame", row.names = c(NA, -3L))
Option with dplyr:
library(dplyr)
a <- c("zzz","yyy","xxx")
b <- c(2,1,1)
c <- c(2,1,-1)
d <- c(-1,1,-2)
df <- data.frame(a,b,c,d)
filter(df,((abs(b) != 2) & (abs(c) != 2) & (abs(d) != 2)))
a b c d
1 yyy 1 1 1
