finding top 1 percentile in datatable in r


I have a data table like
sample1 sample2 sample3
fruit1 10 20 30
fruit2 1 5 6
fruit3 3 7 8
etc.
I want to find the top 1 percentile of fruits in each sample in R (according to the number in each sample). Is there a simple way to do this?

You can lapply over your data and, for each column, subset the row names of df with a logical vector that is TRUE where the value in that column is in the top 1 percent (i.e. above the 99th percentile, 100 - 1).
Create example data
set.seed(2019)
df <- as.data.frame(matrix(sample(1e4, replace = TRUE), 1e3, 10))
names(df) <- paste0('sample', seq_along(df))
rownames(df) <- paste0('fruit', seq_len(nrow(df)))
Step described above:
lapply(df, function(x) rownames(df)[x > quantile(x, (100 - 1)/100)])
# $`sample1`
# [1] "fruit57" "fruit76" "fruit149" "fruit471" "fruit520" "fruit682" "fruit805"
# [8] "fruit949" "fruit966" "fruit975"
#
# $sample2
# [1] "fruit49" "fruit109" "fruit232" "fruit274" "fruit312" "fruit795" "fruit883"
# [8] "fruit884" "fruit955" "fruit958"
#
# $sample3
# [1] "fruit37" "fruit189" "fruit231" "fruit256" "fruit473" "fruit654" "fruit729"
# [8] "fruit742" "fruit820" "fruit979"
#
# ...
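A minimal sketch of the same idea wrapped in a function, assuming a numeric data frame with row names. The names `top_percent`, `pct` and `toy` are hypothetical and only used for illustration; the toy data is made up so the result is easy to check by hand.

```r
# Return, for each column, the row names whose value lies strictly
# above the (100 - pct)th percentile of that column.
top_percent <- function(df, pct = 1) {
  lapply(df, function(x) {
    cutoff <- quantile(x, 1 - pct / 100, na.rm = TRUE)
    keep <- !is.na(x) & x > cutoff
    rownames(df)[keep]
  })
}

# Toy data: 200 fruits, two samples with opposite orderings
toy <- data.frame(sample1 = 1:200, sample2 = 200:1,
                  row.names = paste0("fruit", 1:200))
res <- top_percent(toy)
print(res)
```

Because the comparison is strict (`>`), ties at the cutoff are excluded, so a column with many equal values may return fewer than 1% of the rows.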

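Assuming your data frame is called "fruit":
# sort rows by sample1, largest first (the trailing comma indexes rows)
fruit <- fruit[order(fruit$sample1, decreasing = TRUE), ]
# keep the first 1% of rows; 1:n/100 would be parsed as (1:n)/100
top.1.percent <- fruit[seq_len(ceiling(nrow(fruit) / 100)), ]
This should do the trick for sample1.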
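The sort-based approach can be extended to every column at once. This is only a sketch under the assumption that all columns are numeric; `top_by_rank` and `pct` are hypothetical names, and the toy values are invented.

```r
# For each column, return the row names of the ceiling(pct% of rows)
# largest values, found by sorting rather than by a quantile cutoff.
top_by_rank <- function(df, pct = 1) {
  n_keep <- ceiling(nrow(df) * pct / 100)
  lapply(df, function(x) {
    rownames(df)[order(x, decreasing = TRUE)][seq_len(n_keep)]
  })
}

toy <- data.frame(sample1 = c(10, 1, 3),
                  sample2 = c(5, 20, 7),
                  row.names = paste0("fruit", 1:3))
res <- top_by_rank(toy)
print(res)
```

Unlike a quantile cutoff, this always returns the same number of rows per column (at least one), which may or may not be what you want when there are ties.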

Related

Remove dataframes from list that matches a column in a dataframe in R

I have a list of dataframes. I want to remove the dataframes that don't match entries in a column of a separate dataframe. Sample code is below.
my.list <- list(1.1,1.2,1.3,1.4,1.5)
df <- data.frame(ID = c(1.1,1.3,1.5))
I want to remove dataframes from my.list based on whatever IDs I have in df. So in this case output should look like
my.list
$`1.1`
...
$`1.3`
...
$`1.5`

The example input is not very clear. I assume you meant a list of dataframes with names 1.1, 1.2, etc.; see this example:
# list of dataframes example, here we just have 1 to 5,
# in your case this would be 5 dataframes.
my.list <- as.list(1:5)
names(my.list) <- as.character(c(1.1,1.2,1.3,1.4,1.5))
my.list
# $`1.1`
# [1] 1
#
# $`1.2`
# [1] 2
#
# $`1.3`
# [1] 3
#
# $`1.4`
# [1] 4
#
# $`1.5`
# [1] 5
df <- data.frame(ID = c(1.1,1.3,1.5))
my.list[ as.character(df$ID) ]
# $`1.1`
# [1] 1
#
# $`1.3`
# [1] 3
#
# $`1.5`
# [1] 5
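Indexing a list by a name that does not exist returns a NULL element, so if df may contain IDs that are missing from the list, a logical match on the names can be safer. A small sketch, assuming the list is named as above; the one-column data frames and the extra ID 1.7 are made up for illustration.

```r
# A named list of (tiny, made-up) data frames
my.list <- setNames(
  lapply(1:5, function(i) data.frame(value = i)),
  c("1.1", "1.2", "1.3", "1.4", "1.5")
)

# IDs to keep; 1.7 has no matching list element
df <- data.frame(ID = c(1.1, 1.3, 1.5, 1.7))
ids <- as.character(df$ID)

# Keep only list elements whose name appears among the IDs
kept <- my.list[names(my.list) %in% ids]
print(names(kept))
```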

R split array into Data frame

VERY new to R and struggling with knowing exactly what to ask, have found a similar question here
How to split a character vector into data frame?
but this has fixed length, and I've been unable to adjust for my problem
I've got some data in an array in R
TEST <- c("Value01:100|Value02:200|Value03:300|","Value04:1|Value05:2|",
"StillAValueButNamesAreNotConsistent:12345.6789|",
"AlsoNotAllLinesAreTheSameLength:1|")
The data is stored in pairs, and I'm looking to split out into a dataframe as such:
Variable Value
Value01 100
Value02 200
Value03 300
Value04 1
Value05 2
StillAValueButNamesAreNotConsistent 12345.6789
AlsoNotAllLinesAreTheSameLength 1
The Variable name is a string and the value will always be a number
Any help would be great!
Thanks

One can use a tidyr-based solution. Convert the vector TEST to a data.frame and remove the trailing | from each row, since it carries no information.
Then use tidyr::separate_rows to expand the rows on | and tidyr::separate to split each pair into two columns.
library(dplyr)
library(tidyr)
data.frame(TEST) %>%
mutate(TEST = gsub("\\|$","",TEST)) %>%
separate_rows(TEST, sep = "[|]") %>%
separate(TEST, c("Variable", "Value"), ":")
# Variable Value
# 1 Value01 100
# 2 Value02 200
# 3 Value03 300
# 4 Value04 1
# 5 Value05 2
# 6 StillAValueButNamesAreNotConsistent 12345.6789
# 7 AlsoNotAllLinesAreTheSameLength 1

We can do it in base R with one line. Just change the | characters to line breaks, then use : as the sep value in read.table(). You can set the column names there too.
read.table(text = gsub("\\|", "\n", TEST), sep = ":",
col.names = c("Variable", "Value"))
# Variable Value
# 1 Value01 100.00
# 2 Value02 200.00
# 3 Value03 300.00
# 4 Value04 1.00
# 5 Value05 2.00
# 6 StillAValueButNamesAreNotConsistent 12345.68
# 7 AlsoNotAllLinesAreTheSameLength 1.00
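The printed 12345.68 above is only display rounding; the Value column is parsed as numeric and keeps the full value. A short sketch to check this, using the TEST vector from the question (the variable name `res` is just for illustration):

```r
TEST <- c("Value01:100|Value02:200|Value03:300|", "Value04:1|Value05:2|",
          "StillAValueButNamesAreNotConsistent:12345.6789|",
          "AlsoNotAllLinesAreTheSameLength:1|")

# Turn every "|" into a line break, then read the "name:value" lines
res <- read.table(text = gsub("|", "\n", TEST, fixed = TRUE), sep = ":",
                  col.names = c("Variable", "Value"),
                  stringsAsFactors = FALSE)

str(res)                          # Value is numeric
print(res$Value[6], digits = 10)  # shows more digits than the default print
```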

With Base R:
(I've broken out each step to hopefully make the code clear)
# your data
myvec <- c("Value01:100|Value02:200|Value03:300|","Value04:1|Value05:2|",
"StillAValueButNamesAreNotConsistent:12345.6789|",
"AlsoNotAllLinesAreTheSameLength:1|")
# convert into one long string
all_text_str <- paste0(myvec, collapse="")
# split the string by "|"
all_text_vec <- unlist(strsplit(all_text_str, split="\\|"))
# split each "|"-group by ":"
data_as_list <- strsplit(all_text_vec, split=":")
# collect into a dataframe (rbind alone returns a character matrix)
df <- as.data.frame(do.call(rbind, data_as_list), stringsAsFactors = FALSE)
# clean up the dataframe by adding names and converting value to numeric
names(df) <- c("variable", "value")
df$value <- as.numeric(df$value)

With the help of the strsplit and unlist functions. Each command is shown with its output below.
Input
TEST
# [1] "Value01:100|Value02:200|Value03:300|"
# [2] "Value04:1|Value05:2|"
# [3] "StillAValueButNamesAreNotConsistent:12345.6789|"
# [4] "AlsoNotAllLinesAreTheSameLength:1|"
Splitting by | and then by :
my_list <- strsplit(unlist(strsplit(TEST, "|", fixed = TRUE)), ":", fixed = TRUE)
my_list
# [[1]]
# [1] "Value01" "100"
# [[2]]
# [1] "Value02" "200"
# [[3]]
# [1] "Value03" "300"
# [[4]]
# [1] "Value04" "1"
# [[5]]
# [1] "Value05" "2"
# [[6]]
# [1] "StillAValueButNamesAreNotConsistent" "12345.6789"
# [[7]]
# [1] "AlsoNotAllLinesAreTheSameLength" "1"
Converting above list to data.frame
df <- data.frame(matrix(unlist(my_list), ncol = 2, byrow=TRUE))
df
# X1 X2
# 1 Value01 100
# 2 Value02 200
# 3 Value03 300
# 4 Value04 1
# 5 Value05 2
# 6 StillAValueButNamesAreNotConsistent 12345.6789
# 7 AlsoNotAllLinesAreTheSameLength 1
Colnames to dataframe
names(df) <- c("Variable", "Value")
df
# Variable Value
# 1 Value01 100
# 2 Value02 200
# 3 Value03 300
# 4 Value04 1
# 5 Value05 2
# 6 StillAValueButNamesAreNotConsistent 12345.6789
# 7 AlsoNotAllLinesAreTheSameLength 1
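The steps above can be collected into one helper that also converts Value to numeric, as the question asks. This is a sketch, not the answerer's code; `parse_pairs` is a hypothetical name.

```r
# Split "name:value|name:value|" strings into a two-column data frame
parse_pairs <- function(x) {
  pairs <- unlist(strsplit(x, "|", fixed = TRUE))
  pairs <- pairs[nzchar(pairs)]  # drop any empty pieces
  parts <- do.call(rbind, strsplit(pairs, ":", fixed = TRUE))
  data.frame(Variable = parts[, 1],
             Value = as.numeric(parts[, 2]),
             stringsAsFactors = FALSE)
}

TEST <- c("Value01:100|Value02:200|Value03:300|", "Value04:1|Value05:2|",
          "StillAValueButNamesAreNotConsistent:12345.6789|",
          "AlsoNotAllLinesAreTheSameLength:1|")
out <- parse_pairs(TEST)
str(out)
```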

Table scraped from a web page is read as a single character vector: how to convert into a dataframe?

I have scraped a large table from a web page using the rvest package, but it is reading it as a single vector:
foo<-c("A","B","C","Dog","1","2","3","Cat","4","5","6","Goat","7","8","9")
that I need to deal with as a dataframe that looks like this:
bar<-as.data.frame(cbind(Animal=c("Dog","Cat","Goat"),A=c(1,4,7),B=c(2,5,8),C=c(3,6,9)))
This might be a simple dilemma but I'd appreciate the help.

You can create a matrix from your vector and turn it into a data frame:
foo<-c("A","B","C","Dog","1","2","3","Cat","4","5","6","Goat","7","8","9")
foo <- c("Animal" , foo)
m <- matrix(foo , ncol = 4 , byrow = TRUE)
df <- as.data.frame(m[-1,] , stringsAsFactors = FALSE)
colnames(df) <- m[1,]
# I assume you want numerics for your A,B,C columns:
df[,2:4]<-apply(df[,2:4],2,as.numeric)
lapply(df,class)
# $Animal
# [1] "character"
#
# $A
# [1] "numeric"
#
# $B
# [1] "numeric"
#
# $C
# [1] "numeric"
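A variation on the matrix approach that keeps the header apart from the body instead of prepending "Animal" to foo. It is a sketch that assumes the first three elements are the column headers and every record has four fields; `n_header` is a name introduced here.

```r
foo <- c("A", "B", "C", "Dog", "1", "2", "3",
         "Cat", "4", "5", "6", "Goat", "7", "8", "9")

n_header <- 3  # assumption: the first three elements are column names
header <- c("Animal", foo[seq_len(n_header)])

# Everything after the header, filled row by row into 4 columns
body <- matrix(foo[-seq_len(n_header)], ncol = 4, byrow = TRUE)
df <- as.data.frame(body, stringsAsFactors = FALSE)
names(df) <- header

# Let R pick a type for each column; as.is = TRUE keeps text as character
df[] <- lapply(df, type.convert, as.is = TRUE)
str(df)
```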

Just split it into required number of rows and rbind it. I added "Animal" at the start of foo to make the elements equal in each row when splitting
foo = c("Animal", foo)
df = data.frame(do.call(rbind, split(foo, ceiling(seq_along(foo)/4))),
stringsAsFactors = FALSE)
colnames(df) = df[1,]
df = df[-1,]
df
# Animal A B C
#2 Dog 1 2 3
#3 Cat 4 5 6
#4 Goat 7 8 9

If you want the proper column types, you can try this. Split into a list, name the list, then convert the column types before coercing to a data frame. This uses the original foo, without "Animal" prepended.
l <- setNames(split(tail(foo, -3), rep(1:4, 3)), c("Animal", foo[1:3]))
as.data.frame(lapply(l, type.convert, as.is = TRUE)) ## add stringsAsFactors=FALSE if desired
# Animal A B C
# 1 Dog 1 2 3
# 2 Cat 4 5 6
# 3 Goat 7 8 9

Here is a convenient tool for working with lists:
seqList <-
function(character,by= 1,res=list()){
### sequence characters by
if (length(character)==0){
res
} else{
seqList(character[-c(1:by)],by=by,res=c(res,list(character[1:by])))
}
}
Once you convert your characters into lists, they are easier to manipulate. For instance, you can do:
options(stringsAsFactors=FALSE)
foo <-c("A","B","C","Dog","1","2","3","Cat","4","5","6","Goat","7","8","9")
foo <- c("Animal",foo)
df <- data.frame(t(do.call("rbind",
lapply(1:4,function(x) do.call("cbind",lapply(seqList(foo,4),"[[",x))))))
colnames(df) <- df[1,]
df <- df[-1,]
## > df
## Animal A B C
## 2 Dog 1 2 3
## 3 Cat 4 5 6
## 4 Goat 7 8 9
Note:
I haven't tested the efficiency of the function. It might not be very efficient for a large number of characters.
Matrices might be a better tool for this job.
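If the recursion is a concern for long vectors, the same chunking can be sketched without it by using split. `chunk` is a hypothetical name, and this is an alternative to seqList rather than the answerer's method.

```r
# Cut a vector into consecutive pieces of length `by` (non-recursive)
chunk <- function(x, by = 1) {
  unname(split(x, ceiling(seq_along(x) / by)))
}

foo <- c("Animal", "A", "B", "C", "Dog", "1", "2", "3",
         "Cat", "4", "5", "6", "Goat", "7", "8", "9")
pieces <- chunk(foo, 4)

# Stack the pieces as rows; the first row holds the column names
m <- do.call(rbind, pieces)
df <- as.data.frame(m[-1, , drop = FALSE], stringsAsFactors = FALSE)
names(df) <- m[1, ]
print(df)
```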

R grouping by name and perform stats (t-test)

I have two data.frames:
word1=c("a","a","a","a","b","b","b")
word2=c("a","a","a","a","c","c","c")
values1 = c(1,2,3,4,5,6,7)
values2 = c(3,3,0,1,2,3,4)
df1 = data.frame(word1,values1)
df2 = data.frame(word2,values2)
df1:
word1 values1
1 a 1
2 a 2
3 a 3
4 a 4
5 b 5
6 b 6
7 b 7
df2:
word2 values2
1 a 3
2 a 3
3 a 0
4 a 1
5 c 2
6 c 3
7 c 4
I would like to split these dataframes by word*, and perform two sample t.tests in R.
For example, the word "a" is in both data.frames. What's the t.test between the data.frames for the word "a"? And do this for all the words that are in both data.frames.
The result is a data.frame(result):
word tvalues
1 a 0.4778035
Thanks

Find the words common to both dataframes, then loop over these words, subsetting both dataframes and performing the t.test on the subsets.
E.g.:
df1 <- data.frame(word=sample(letters[1:5], 30, replace=TRUE),
x=rnorm(30))
df2 <- data.frame(word=sample(letters[1:5], 30, replace=TRUE),
x=rnorm(30))
common_words <- sort(intersect(df1$word, df2$word))
setNames(lapply(common_words, function(w) {
t.test(subset(df1, word==w, x), subset(df2, word==w, x))
}), common_words)
This returns a list, where each element is the output of the t.test for one of the common words. setNames just names the list elements so you can see which words they correspond to.
Note I've created new example data here since your example data only have one word in common (a) and so don't really resemble your true problem.
If you just want a matrix of statistics, you can do something like:
t(sapply(common_words, function(w) {
test <- t.test(subset(df1, word==w, x), subset(df2, word==w, x))
c(test$statistic, test$parameter, p=test$p.value,
`2.5%`=test$conf.int[1], `97.5%`=test$conf.int[2])
}))
## t df p 2.5% 97.5%
## a 0.9141839 8.912307 0.38468553 -0.4808054 1.1313220
## b -0.2182582 7.589109 0.83298193 -1.1536056 0.9558315
## c -0.2927253 8.947689 0.77640684 -1.5340097 1.1827691
## d -2.7244728 12.389709 0.01800568 -2.5016301 -0.2826952
## e -0.3683153 7.872407 0.72234501 -1.9404345 1.4072499
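Applied to the data from the question, the same loop can be sketched with plain vectors and collected into a data frame with one row per shared word. The column names `t` and `p` in the result are chosen here for illustration.

```r
word1 <- c("a", "a", "a", "a", "b", "b", "b")
word2 <- c("a", "a", "a", "a", "c", "c", "c")
values1 <- c(1, 2, 3, 4, 5, 6, 7)
values2 <- c(3, 3, 0, 1, 2, 3, 4)
df1 <- data.frame(word1, values1, stringsAsFactors = FALSE)
df2 <- data.frame(word2, values2, stringsAsFactors = FALSE)

# Words present in both data frames
common <- intersect(df1$word1, df2$word2)

# One Welch two-sample t-test per shared word
result <- do.call(rbind, lapply(common, function(w) {
  tt <- t.test(df1$values1[df1$word1 == w],
               df2$values2[df2$word2 == w])
  data.frame(word = w, t = unname(tt$statistic), p = tt$p.value)
}))
print(result)
```

With this data only "a" is shared, and the Welch t statistic for it works out to roughly 0.758.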

How to filter data.frame which has more than n number of entries in a column

How can I remove rows from the dataset that have more than n genes?
data1 <-
Re_leve logp chr start  end    CNA  Genes
1       1.5  1   739400 756200 gain Trp1,Eggier
1       8.3  1   127730 128210 gain Zranb3,R3hdm1,.....

You may try
library(stringr)
n <- 1
# genes per row = number of commas + 1; keep rows with at most n genes
df1[str_count(df1$Genes, ",") + 1 <= n, ]

Try this:
#dummy data
data1 <- data.frame(x=1:3,
Gene=c("asdf,asdf,ee,d","asdf","dfd,sdf"),
stringsAsFactors = FALSE)
#threshold on the number of genes
n <- 1
#subset: this keeps rows with more than n genes; use <= n to drop them instead
data1[sapply(data1$Gene,function(i)length(unlist(strsplit(i,",")))) > n, ]
# x Gene
# 1 1 asdf,asdf,ee,d
# 3 3 dfd,sdf
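A base R sketch that counts the genes once and then filters in the direction the question asks for (dropping rows with more than n genes). The dummy data and the threshold n = 2 are made up for illustration.

```r
data1 <- data.frame(x = 1:3,
                    Genes = c("asdf,asdf,ee,d", "asdf", "dfd,sdf"),
                    stringsAsFactors = FALSE)
n <- 2

# Number of comma-separated entries in each row
n_genes <- lengths(strsplit(data1$Genes, ",", fixed = TRUE))

# Drop rows with more than n genes, i.e. keep rows with at most n
kept <- data1[n_genes <= n, ]
print(kept)
```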
