Subset all columns in a data frame less than a certain value in R

I have a dataframe that contains 7 p-value variables.
I can't post it because it is private data but it looks like this:
> df
o          m         l           c        a        aa        ep
1.11E-09   4.43E-05  0.000001602 4.02E-88 1.10E-43 7.31E-05  0.00022168
8.57E-07   0.0005479 0.0001402   2.84E-44 4.97E-17 0.0008272 0.000443361
0.00001112 0.0005479 0.0007368   1.40E-39 3.17E-16 0.0008272 0.000665041
7.31E-05   0.0006228 0.0007368   4.59E-33 2.57E-13 0.0008272 0.000886721
8.17E-05   0.002307  0.0008453   4.58E-18 5.14E-12 0.0008336 0.001108402
Each column has values from 0-1.
I would like to subset the entire data frame by extracting all the values in each column less than 0.009 and making a new data frame. If I were to extract on this condition, the columns would have very different lengths. E.g. c has 290 values less than 0.009, and o has 300, aa has 500 etc.
I've tried:
subset(df,c<0.009 & a<0.009 & l<0.009 & m<0.009& aa<0.009 & o<0.009)
When I do this I just end up with a very small number of rows, with every column the same length, which isn't what I want. I want all the values in each column that meet the subset criterion.
I then want to take this data frame and bin it into p-value range groups by using something like the summary(cut()) function, but I am not sure how to do it.
So essentially I would like to have a final data frame that includes the number of values in each p-value bin for each variable:
o# m# l# c# a# aa# ep#
0.00-0.000001 545 58 85 78 85 45 785
0.00001-000.1 54 77 57 57 74 56 58
0.001-0.002 54 7 5 5 98 7 5 865

An attempt:
sapply(df,function(x) table(cut(x[x<0.009],c(0,0.000001,0.001,0.002,Inf))) )
#               o m l c a aa ep
#(0,1e-06]      2 0 0 5 5  0  0
#(1e-06,0.001]  3 4 5 0 0  5  4
#(0.001,0.002]  0 0 0 0 0  0  1
#(0.002,Inf]    0 1 0 0 0  0  0
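A fuller sketch of the same idea, assuming the five sample rows from the question stand in for the real data. Filtering each column separately gives vectors of different lengths, so they are kept in a list rather than a data frame, and the bin counts are then collected into one table. The names threshold, breaks, below and counts are introduced here for illustration only, and the top break is set to the 0.009 cutoff instead of Inf.

```r
# Sample rows copied from the question (stand-in for the private data)
df <- data.frame(
  o  = c(1.11e-09, 8.57e-07, 0.00001112, 7.31e-05, 8.17e-05),
  m  = c(4.43e-05, 0.0005479, 0.0005479, 0.0006228, 0.002307),
  l  = c(0.000001602, 0.0001402, 0.0007368, 0.0007368, 0.0008453),
  c  = c(4.02e-88, 2.84e-44, 1.40e-39, 4.59e-33, 4.58e-18),
  a  = c(1.10e-43, 4.97e-17, 3.17e-16, 2.57e-13, 5.14e-12),
  aa = c(7.31e-05, 0.0008272, 0.0008272, 0.0008272, 0.0008336),
  ep = c(0.00022168, 0.000443361, 0.000665041, 0.000886721, 0.001108402)
)

threshold <- 0.009                               # cutoff from the question
breaks <- c(0, 1e-06, 1e-03, 2e-03, threshold)   # illustrative bin edges

# Filter each column on its own; a list tolerates unequal lengths
below <- lapply(df, function(x) x[x < threshold])

# Count the values per bin for each column, then collect into a data frame
counts <- sapply(below, function(x) table(cut(x, breaks = breaks)))
counts <- as.data.frame(counts)
print(counts)
```

By default cut() builds intervals that are open on the left, so a p-value of exactly 0 would fall outside the first bin; passing include.lowest = TRUE should cover that case.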

Related

Mapping dataframe column values to a n by n matrix

I'm trying to map column values of a data.frame object (consisting of large number of bilateral trade data among 161 countries) to a 161 x 161 adjacency matrix (also of data.frame class) such that each cell represents the dyadic trade flows between any two countries.
The data looks like this
# load the data from dropbox folder
library(foreign)
example_data <- read.csv("https://www.dropbox.com/s/hf0ga22tdjlvdvr/example_data.csv?dl=1")
head(example_data, n = 10)
rid pid TradeValue
1 2 3 500
2 2 7 2328
3 2 8 2233465
4 2 9 81470
5 2 12 572893
6 2 17 488374
7 2 19 3314932
8 2 23 20323
9 2 25 10
10 2 29 9026220
length(unique(example_data$rid))
[1] 139
length(unique(example_data$pid))
[1] 161
where rid is the reporter id and pid is the (trade) partner id; a given country has the same id in both columns. Each id in the rid column is matched with multiple rows in the pid column, one TradeValue per partner.
However, there are some problems with this data. First, because countries (usually developing countries) that did not report trade statistics have no data to be extracted, their id(s) are absent in the rid column (such as country 1). On the other hand, those country id(s) may enter into pid column through other countries' reporting (in which case, the reporters tend to be developed countries). Hence, the rid column only contains some of the country id (only 139 out of 161), while the pid column has all 161 country id.
What I'm attempting to do is map this example_data data frame to a 161 x 161 adjacency matrix, using rid for rows and pid for columns, where each cell represents the TradeValue between two country ids. To this end, there are a couple of things I need to tackle:
1. Fill in the country ids that are missing from the rid column of example_data and, temporarily, set all cell values in their rows to 0.
2. After the previous step, impute those "0" cells using the bilateral trade statistics reported by other countries; if the corresponding statistics are still unavailable, leave those "0" cells as they are.
For example, for a 5-country dataframe of the following form
rid pid TradeValue
2 1 50
2 3 45
2 4 7
2 5 18
3 1 24
3 2 45
3 4 88
3 5 12
5 1 27
5 2 18
5 3 12
5 4 92
The desired output should look like this
      pid_1 pid_2 pid_3 pid_4 pid_5
rid_1     0    50    24     0    27
rid_2    50     0    45     7    18
rid_3    24    45     0    88    12
rid_4     0     7    88     0    92
rid_5    27    18    12    92     0
but off the top of my head I could not figure out how to do this. I would really appreciate any help.
df1$rid = factor(df1$rid, levels = 1:5, labels = paste("rid",1:5,sep ="_"))
df1$pid = factor(df1$pid, levels = 1:5, labels = paste("pid",1:5,sep ="_"))
data.table::dcast(df1, rid ~ pid, fill = 0, drop = FALSE, value.var = "TradeValue")
#    rid pid_1 pid_2 pid_3 pid_4 pid_5
#1 rid_1     0     0     0     0     0
#2 rid_2    50     0    45     7    18
#3 rid_3    24    45     0    88    12
#4 rid_4     0     0     0     0     0
#5 rid_5    27    18    12    92     0
The tricks:
Use factor variables to tell R which values are possible and in what order.
In data.table's dcast, use fill = 0 (fill in zero where there is no data) and drop = FALSE (keep entries for factor levels that are not observed).
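The dcast result leaves the rows for rid_1 and rid_4 at zero, so the second step of the question (imputing from what other countries reported) still remains. A minimal base R sketch of both steps follows, assuming that a value reported in one direction may be copied to the unreported direction, as the desired output in the question suggests. The names ids, adj, mirrored and unreported are introduced here for illustration.

```r
# The 5-country example from the question
df1 <- data.frame(
  rid = c(2, 2, 2, 2, 3, 3, 3, 3, 5, 5, 5, 5),
  pid = c(1, 3, 4, 5, 1, 2, 4, 5, 1, 2, 3, 4),
  TradeValue = c(50, 45, 7, 18, 24, 45, 88, 12, 27, 18, 12, 92)
)

ids <- 1:5  # full set of country ids (1:161 for the real data)

# Step 1: start from an all-zero matrix so missing reporters get zero rows
adj <- matrix(0, nrow = length(ids), ncol = length(ids),
              dimnames = list(paste0("rid_", ids), paste0("pid_", ids)))

# Fill the reported cells; a two-column index matrix addresses (row, col) pairs
adj[cbind(match(df1$rid, ids), match(df1$pid, ids))] <- df1$TradeValue

# Step 2: where a cell is still 0, take the value reported in the other direction
mirrored <- t(adj)
unreported <- adj == 0
adj[unreported] <- mirrored[unreported]

print(adj)
```

Cells stay at 0 when neither country reported the flow, which matches the "leave them as they are" rule in the question.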

Pyspark column population with computation

I am stuck on this issue. Below is my dataframe:
a b c
0 0 126
30 0 0
Now I need to repopulate column c using the formula c = (previous c) - a + b, so the resulting dataframe should look like the one below, where 96 is computed as 126 - 30 + 0:
a b c
0 0 126
30 0 96
Please help me get past this hurdle.
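You can use the lag function to get the previous row's value of c, as below (this is the Scala API; the $"..." syntax assumes import spark.implicits._ is in scope):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lag, monotonically_increasing_id}
df.withColumn("id", monotonically_increasing_id())
  .withColumn("c", lag($"c", 1, 126).over(Window.orderBy("id")) - $"a" + $"b")
  .drop("id").show(false)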
Hope this helps!
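The arithmetic behind the recurrence can be checked outside Spark. The following is a plain R sketch (not Spark code), assuming the rule c[i] = c[i-1] - a[i] + b[i] with the first row keeping its starting value. With more than two rows the recurrence becomes a running sum of b - a, which a single lag over the original c column would not reproduce. The names c_first, net and c_new are hypothetical.

```r
# Values from the question
a <- c(0, 30)
b <- c(0, 0)
c_first <- 126        # starting value of c in the first row

# Net change contributed by each row; the first row keeps its starting value
net <- b - a
net[1] <- 0

# Unrolling c[i] = c[i-1] - a[i] + b[i] gives a cumulative sum
c_new <- c_first + cumsum(net)
print(c_new)
```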

Not-equal to character in R

What is the command for printing the strings that are not equal to a specific character? From the data below I would like to print the number of rows where the t5 column does not start with d-. (In this example that is all the rows)
I tried
dim(df[df$t5 !="d-",])
df:
   name             freq mir            start end mism add  t5   t3  s5       s3       DB    ambiguity
6  seq_10002_x17    17   hsa-miR-10a-5p 23    44  5GT  0    d-T  0   TATATACC TGTGTAAG miRNA 1
19 seq_100091_x3    3    hsa-miR-142-3p 54    74  0    u-CA d-TG 0   AGGGTGTA TGGATGAG miRNA 1
20 seq_100092_x1    1    hsa-miR-142-3p 54    74  0    u-CT d-TG 0   AGGGTGTA TGGATGAG miRNA 1
23 seq_100108_x5    5    hsa-miR-10a-5p 23    44  4NC  0    d-T  0   TATATACC TGTGTAAG miRNA 1
26 seq_100113_x1219 1219 hsa-miR-577    15    36  0    0    u-G  0   AGAGTAGA CCTGATGA miRNA 1
28 seq_100121_x1    1    hsa-miR-192-5p 25    45  1CT  u-CT d-C  d-A GGCTCTGA AGCCAGTG miRNA 1
df1 <- df[!grepl("^d-",df[,8]),]
nrow(df1)
print(df1)
There is one row in your data that has a t5 entry that does not start with "d-". To find this row, you could try:
df[!grepl("^(d-)",df$t5),]
# name freq mir start end mism add t5 t3 s5 s3 DB ambiguity
#26 seq_100113_x1219 1219 hsa-miR-577 15 36 0 0 u-G 0 AGAGTAGA CCTGATGA miRNA 1
If you only want to know the row number, you can get it with rownames()
> rownames(df[!grepl("^(d-)",df$t5),])
#[1] "26"
or with which(),
> which(!grepl("^(d-)",df$t5))
#[1] 5
depending on whether you want the row number counting from the top of your data frame or the row number according to the value on the left.
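As a sketch of why the original attempt returns every row, and of a regex-free alternative: df$t5 != "d-" compares each whole string with "d-", so "d-T" counts as different, whereas a prefix test only looks at the start of the string. The example below rebuilds just the name and t5 columns from the question and uses base R's startsWith() (available from R 3.3.0); the variable no_d is introduced here for illustration.

```r
# Only the columns needed for the illustration, with the original row names
df <- data.frame(
  name = c("seq_10002_x17", "seq_100091_x3", "seq_100092_x1",
           "seq_100108_x5", "seq_100113_x1219", "seq_100121_x1"),
  t5 = c("d-T", "d-TG", "d-TG", "d-T", "u-G", "d-C"),
  row.names = c("6", "19", "20", "23", "26", "28"),
  stringsAsFactors = FALSE
)

# Whole-string comparison: no entry is exactly "d-", so every row passes
nrow(df[df$t5 != "d-", ])

# Prefix test: TRUE where t5 does not begin with "d-"
no_d <- !startsWith(df$t5, "d-")
sum(no_d)      # number of matching rows
df[no_d, ]     # the matching rows themselves
```

startsWith() expects character input, so a factor column would need as.character() first.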

Subset a data frame based on values of another column in data frame

Is it possible to take one column of numeric values, like dup$Number, and subset the columns in DG whose names match dup$Number, returning them as a new data frame?
dup
Number Letter
59 Q
91 Q
19 Q
17 Q
DG
chr pos id ref alt refc altc qual cov line_21 line_26 line_28 line_31 line_32 line_38 line_40 line_41 line_42 line_45 line_48 line_49 line_57 line_59 line_69 line_73 line_75 line_83
1 2R 7006506 2R_7006506_SNP C A 169 26 999 29 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 2 -
Try
indx <- grep('line', names(DG))
DG[indx[as.numeric(sub('.*_', '', names(DG)[indx])) %in% dup$Number]]
# line_59
#1 0
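An alternative sketch that builds the wanted column names directly and keeps only those present in DG. It assumes the columns always follow the line_<number> pattern shown in the question; DG is cut down to a few columns here for brevity, and wanted, keep and result are illustrative names.

```r
dup <- data.frame(Number = c(59, 91, 19, 17), Letter = "Q")

# Reduced stand-in for DG with a few of the original columns
DG <- data.frame(chr = "2R", pos = 7006506,
                 line_57 = 0, line_59 = 0, line_69 = 0)

# Column names implied by dup$Number, restricted to those that exist in DG
wanted <- paste0("line_", dup$Number)
keep <- intersect(wanted, names(DG))

# Single-bracket indexing with names returns a data frame
result <- DG[keep]
print(result)
```

Numbers in dup that have no matching column (91, 19 and 17 here) are dropped silently rather than raising an error.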

Assigning logical value to values higher than given threshold for each case across each year

I have a data frame resembling the extract below:
set.seed(1)
smpl_df <- data.frame(year = c(1500:2011), case = LETTERS[1:4])
smpl_df$var_one <- sample(100, size = nrow(smpl_df), replace = TRUE)
I'm interested in adding one more column to this data frame. The column should take the value 1 if the values in var_one were higher than a given threshold for all of the consecutive years represented in the data set. For example, in its present format the table looks like this:
head(smpl_df)
year case var_one
1 1500 A 27
2 1501 B 38
3 1502 C 58
4 1503 D 91
5 1504 A 21
6 1505 B 90
I would like to add a column to the data table (the values in the new column are not right; they are only there by way of example):
year case var_one var_one_higher_than_80_for_all_yrs_for_this_case
1 1500 A 27 0
2 1501 B 38 0
3 1502 C 58 0
4 1503 D 91 1
5 1504 A 21 0
6 1505 B 90 1
Edit
To add to the post, following the useful points raised in the comments: the long table that I'm currently working with could be obtained from the wide table below. In the example below, I added a column NewColumn that takes the value Yes if, for a given case, the value was higher than 2 for all the years, and No if the value was lower than or equal to 2. I want to achieve the same effect on my long table (smpl_df).
Edit 2
Following the useful comments concerning the desired final output, my intention is to generate a column that would correspond to the last column in the table below.
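An ifelse() structure may be helpful. Replace the condition with your own test, for example a row-wise comparison against the threshold:
smpl_df$var_one_higher <- ifelse(smpl_df$var_one > 80, 1, 0)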
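A row-wise ifelse() flags individual years, while the question asks for a flag that is 1 only when every year of a case exceeds the threshold. One way to sketch that is with ave(), which applies a function within each group and recycles the result back onto the group's rows. The toy data, the threshold of 80 and the column name all_above are assumptions made for illustration.

```r
# Small deterministic stand-in for smpl_df
toy <- data.frame(
  year    = rep(2000:2002, times = 2),
  case    = rep(c("A", "B"), each = 3),
  var_one = c(85, 90, 95, 85, 40, 95)
)

threshold <- 80  # assumed threshold

# For each case, test whether all of its years exceed the threshold;
# ave() spreads the single 0/1 answer over every row of that case
toy$all_above <- ave(toy$var_one, toy$case,
                     FUN = function(v) as.integer(all(v > threshold)))
print(toy)
```

Here case A gets 1 on all of its rows because 85, 90 and 95 all exceed 80, while case B gets 0 because of the 40 in 2001.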
