Input Values on Rows Set by a Quantile Threshold - r

I have been working on this project and I am stuck on the following:
I have 7 columns in which over 30% of the rows are NAs.
All my columns are numeric, by the way.
For these high-missing-value columns I want to create 4 new columns based on each column's quantiles:
1st column: 1 in rows that contain data; 0 otherwise.
2nd column: 1 in rows at or below the first quantile; 0 otherwise.
3rd column: 1 in rows that fall within the second quantile range; 0 otherwise.
4th column: 1 in rows above the third quantile; 0 otherwise.
I got the first column, but the rest, which depend on the quantile threshold values, have been a challenge.
Here is what I have so far...
My next 3 columns are based on just 3 quantiles: 33.33333%, 66.66667% and 100%.
quantile(High_NAS_set1$EFX, probs = c(33/99, 66/99, 99/99), na.rm = TRUE)
# 1st column: assign 1 to rows that contain data; 0 otherwise
New.EFX <- High_NAS_set1$EFX   # create a new column
New.EFX[!is.na(New.EFX)] <- 1
New.EFX[is.na(New.EFX)] <- 0
#2nd Column:assign 1 in rows below the first quantile; 0 otherwise
New.EFX2_<-High_NAS_set1$EFX #creating a new column
quant<-quantile(New2.EFX_Emp,probs=33/99,na.rm=TRUE)
which(New2.EFX_Emp_Total<=quant)<-1 # assign 1 for rows which indexes are below quant
which(New2.EFX_Emp_Total!=quant)<-0
The last 2 lines are giving me an error:
Error in which(New2.EFX_Emp_Total <= quant) <- 1 :
could not find function "which<-"
Any help will be really appreciated.
Thanks,
Jean
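One way past the which()<- error is to build each indicator with a logical comparison wrapped in as.integer(); here is a minimal sketch, assuming the column of interest is High_NAS_set1$EFX as above and using the 1/3 and 2/3 quantiles as cut points (the new column names are only placeholders):
x <- High_NAS_set1$EFX
q <- quantile(x, probs = c(1/3, 2/3), na.rm = TRUE)   # cut points from the non-missing values

High_NAS_set1$EFX_has_data <- as.integer(!is.na(x))                     # 1 where a value is present
High_NAS_set1$EFX_low  <- as.integer(!is.na(x) & x <= q[1])             # 1 at or below the first cut point (NAs become 0)
High_NAS_set1$EFX_mid  <- as.integer(!is.na(x) & x > q[1] & x <= q[2])  # 1 between the two cut points
High_NAS_set1$EFX_high <- as.integer(!is.na(x) & x > q[2])              # 1 above the second cut point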

Related

Problem with data frame transformation using dplyr package

Problem
Let's consider two data frames: one containing only 1s and 0s, and a second one with data:
set.seed(20)
df<-data.frame(sample(0:1,5,T),sample(0:1,5,T),sample(0:1,5,T))
#zero_one data frame
  sample.0.1..5..T. sample.0.1..5..T..1 sample.0.1..5..T..2
1                 0                   1                   0
2                 1                   0                   0
3                 1                   1                   1
4                 0                   0                   0
5                 1                   0                   1
df1<-data.frame(append(rnorm(4),10),append(runif(4),-5),append(rexp(4),20))
#with data
  append.rnorm.4...10. append.runif.4....5. append.rexp.4...20.
1           0.08609139            0.2374272           0.3341095
2          -0.63778176            0.2297862           0.7537732
3           0.22642990            0.9447793           1.3011998
4          -0.05418293            0.8448115           1.2097271
5          10.00000000           -5.0000000          20.0000000
Now what I want to do is to replace the values in the second data frame where the first data frame has a 0 with the mean (computed per column) of the values where the first data frame has a 1.
Example
In the first column I want to replace 0.08609139 and -0.05418293 (the values for which the first column of the first data frame is 0) with mean(c(-0.63778176, 0.22642990, 10.00000000)) (the mean of the values for which the first column of the first data frame is 1).
I want to do it using the mutate_all() function from the dplyr package.
My work so far
df1<-df1 %>% mutate_all(
function(x) ifelse(df[x]==0, mean(x[df==1],na.rm=T,x)))
I know that the condition df[x] is meaningless, but I have no idea what I should put there. Could you please help me with that?
You could follow @deschen's suggestion and multiply the two data frames together.
Here is another approach to consider, using mapply. For each column, identify the positions (indices) in df where the value is zero.
Then, substitute the corresponding df1 column of those positions with the mean of other values in the column. y[-idx] should be all values in the df1 column that exclude those positions.
Note that my set.seed is different - when I used yours of 20 I got different values, and a column with all zeroes. Please let me know if you are able to reproduce.
set.seed(12)
df  <- data.frame(sample(0:1, 5, T), sample(0:1, 5, T), sample(0:1, 5, T))
df1 <- data.frame(append(rnorm(4), 10), append(runif(4), -5), append(rexp(4), 20))

my_fun <- function(x, y) {
  idx <- which(x == 0)      # positions where the indicator column is zero
  y[idx] <- mean(y[-idx])   # replace them with the mean of the remaining values
  return(y)
}

mapply(my_fun, df, df1)
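mapply() returns a matrix here (it simplifies the per-column results); if a data frame is needed again, wrapping the call is enough, under the same setup as above:
res <- mapply(my_fun, df, df1)    # a 5 x 3 numeric matrix, one column per original column
df1_filled <- as.data.frame(res)  # back to a data frame if that is what you need
df1_filled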

How to fill one column with desired values and the other columns with 0 in R

I have a data frame (c0) that contains some columns and one row
>c0
Sample_Name Chr_No Frequence
0 0 0
I have one variable (allchr) that contains 22 chromosome names. I want to add the allchr names to c0$Chr_No and set the other columns to 0. Is there a way to do this?
c1 <- data.frame(Chr_No = allchr,
                 Sample_Name = rep(0, length(allchr)),
                 Frequence = rep(0, length(allchr)),
                 stringsAsFactors = FALSE)
If you want to keep the first row, use rbind(c0, c1).
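For example, with a hypothetical allchr vector (the question does not show the actual chromosome names, so these are placeholders) and c0 as above:
allchr <- paste0("chr", 1:22)           # hypothetical chromosome names, for illustration only
c1 <- data.frame(Chr_No      = allchr,
                 Sample_Name = 0,       # a single 0 is recycled to length(allchr)
                 Frequence   = 0,
                 stringsAsFactors = FALSE)
rbind(c0, c1)                           # columns are matched by name, so c0's different column order is fine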

The number of two specific elements between two columns in R

I have the following matrix:
x=c(0,0,0,1,1,1,2,2,2,0,1,2,0,1,2,0,1,2)
M=matrix(x,9,2)
The matrix M is:
> M
      [,1] [,2]
 [1,]    0    0
 [2,]    0    1
 [3,]    0    2
 [4,]    1    0
 [5,]    1    1
 [6,]    1    2
 [7,]    2    0
 [8,]    2    1
 [9,]    2    2
How do I check that the number of occurrences of (0,0), (0,1), (0,2), ... (that is, the first row, the second row, the third row, and so on) across all rows is equal to 1?
If we need to get the frequency of each row combination, use table():
tbl <- table(paste(M[,1], M[,2], sep="_"))
This can be converted to a 3-column data.frame by splitting the names of 'tbl' into two columns and cbinding the values of 'tbl':
cbind(read.table(text=names(tbl), sep="_", header = FALSE), value = as.vector(tbl))
If you want to check whether every row appears a single time, you can use
duplicated(data.frame(M))
If any of the resulting values is TRUE, then you know some rows appear more than once (and you know where they are).
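Putting it together on the example matrix, a small check that every combination occurs exactly once:
x <- c(0,0,0,1,1,1,2,2,2,0,1,2,0,1,2,0,1,2)
M <- matrix(x, 9, 2)
tbl <- table(paste(M[,1], M[,2], sep = "_"))
all(tbl == 1)                   # TRUE: every pair occurs exactly once
any(duplicated(data.frame(M)))  # FALSE: the equivalent check with duplicated()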

Sum values in Rows

Here is my question.
I have a data frame with 30 rows (corresponding to 30 questions in a questionnaire) with values from 1 to 5 as answers.
I would like to sum all values equal to 1 that appear across the 30 rows.
I tried with the aggregate command, but it doesn't work.
The question could use more clarity (code would help), but I will give you a theoretical version of what I believe you are asking for.
If you have a data frame df such that:
questions ob1 ob2 ob3
q1          5   3   1
q2          2   1   1
q3          4   1   5
and you want to add up all the values where something is equal to an answer of 1, you have a number of options, but the most obvious is simply to subset with a logical. For example, you could do:
sumob1 <- sum(df$ob1[df$ob1 == 1])
The logical test inside the [] keeps only the cells of ob1 that are equal to 1, and sum() adds them up.
Which basically says: make sumob1 equal to the sum of the values in column ob1 for all rows in which df$ob1 has a value of 1.
You can do that for each column.
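If you want that for every observation column at once, here is a short sketch, assuming the first column of df holds the question labels as in the example above:
sapply(df[-1], function(x) sum(x[x == 1]))  # per-column sum of the cells equal to 1
colSums(df[-1] == 1)                        # equivalent here, since summing 1s is the same as counting them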

How to bin data based on values in one column, and sum occurrences from another column in R?

I have a dataframe df and want to bin rows using data from column A, and then for each bin, count the number of times that a value is present in another column B. Here is an example using only 2 columns (although my real example has many columns):
A B
5.4
4.6 36_8365
2.4
3.6
0.6
8.9 83_7433
4
7.6
4.7 54_3874
1.5 54_8364
I want to look in column A, find all values less than 1, at least 1 but less than 2, and so on, and for each bin I want to count the number of times that a value appears in column B. For the table above, this would give the following results:
Class Number
<1 0
1<=A<2 1
2<=A<3 0
3<=A<4 0
4<=A<5 2
5<=A<6 0
6<=A<7 0
7<=A<8 0
8<=A<9 1
9<=A<10 0
The following is close, but it will sum the values when instead I just want to count them:
with(df, sum(df[A >= 1 & A < 2, "B"]))
I'm not sure what to replace "sum" with to get just counts, instead of a sum. I know I can identify which rows in column B have a value by using
thing <- B==''
or make a table using
thing_table <- table(B=='')
However, I'm not sure how to search through column A, test if the value is between 2 other values, and then count the items in B that meet those criteria. Can anyone point me in the right direction?
Thanks!
First:
newdf<-na.omit(df)
This will shrink the df down to only rows with data in them. Make sure the empty cells are showing up as NAs before attempting.
Second:
Replace sum with length
with(newdf, length(newdf[A >= 1 & A < 2, "B"]))
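If you want the counts for all bins at once rather than one range at a time, here is a sketch using cut(); it assumes the blank cells in B are empty strings "" (swap the test for is.na() if they show up as NAs):
bins  <- cut(df$A, breaks = 0:10, right = FALSE)  # bins [0,1), [1,2), ..., [9,10)
has_b <- !is.na(df$B) & df$B != ""                # rows where B actually has a value
table(bins[has_b])                                # counts per bin; empty bins show as 0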
