Problems with an if statement - r

I have a complex matrix with several rows per individual. I am writing a script that summarizes different variables per individual. To do that, I first create a list holding the new summarized variables. To compute some of these variables I need if clauses like the following:
this_iids_roh <- dat[class,]
my_list<-c("Froh"=(sum(this_iids_roh$KB)/2881033),
"chr1"= if (this_iids_roh$CHR==1) {(sum(this_iids_roh$KB)/247249.719)*100},
"chr2"= if (this_iids_roh$CHR==2) {(sum(this_iids_roh$KB)/242193.529)*100},
"chr3"= if (this_iids_roh$CHR==3) {(sum(this_iids_roh$KB)/198295.559)*100})
return(my_list)
However, when I run this script (this is just a small part of it) I only get the "Froh" and "chr1" variables. I have tried several things, but I cannot get any of the variables after "chr1".
I hope you can help me!

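Instead of wrapping each element in an if statement, you can use the condition directly to subset the data. if tests a single TRUE/FALSE value, not a whole column (older versions of R silently used only the first element of a longer condition), and an if without else whose condition is FALSE returns NULL, which c() drops. That is why the elements after "chr1" disappear. For example, with this sample data: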
this_iids_roh <- NULL
this_iids_roh$CHR = rep(c(1,2,3),10)
this_iids_roh$KB = runif(30)*100000
this_iids_roh = as.data.frame(this_iids_roh)
The way to do this is
my_list<-c("Froh"=(sum(this_iids_roh$KB)/2881033),
"chr1"= {(sum(this_iids_roh$KB[this_iids_roh$CHR==1])/247249.719)*100},
"chr2"= {(sum(this_iids_roh$KB[this_iids_roh$CHR==2])/242193.529)*100},
"chr3"= {(sum(this_iids_roh$KB[this_iids_roh$CHR==3])/198295.559)*100})
> my_list
Froh chr1 chr2 chr3
0.60958 203.99334 251.06703 324.65984
Hope this solves the problem. Note that the conditions are written inside the square brackets above.
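Alternatively, without the braces and quotes (dividing by 2472.49719 is the same as dividing by 247249.719 and multiplying by 100):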
my_list<-c(Froh= sum(this_iids_roh$KB)/2881033,
chr1= sum(this_iids_roh$KB[this_iids_roh$CHR==1])/2472.49719,
chr2= sum(this_iids_roh$KB[this_iids_roh$CHR==2])/2421.93529,
chr3= sum(this_iids_roh$KB[this_iids_roh$CHR==3])/1982.95559)
my_list
This also works with with():
my_list <- with(this_iids_roh, c(Froh= sum(KB)/2881033,
chr1= sum(KB[CHR==1])/2472.49719,
chr2= sum(KB[CHR==2])/2421.93529,
chr3= sum(KB[CHR==3])/1982.95559))
my_list
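If there are many chromosomes, the same subsetting idea can be applied in a loop instead of typing one line per chromosome. The following is only a sketch: the toy data and the name chr_len_kb are assumptions made for illustration, and the chromosome lengths are the three values from the question.

```r
# Chromosome lengths in kb, taken from the question (chr1 to chr3 only).
chr_len_kb <- c(chr1 = 247249.719, chr2 = 242193.529, chr3 = 198295.559)

# Hypothetical toy data standing in for one individual's rows.
this_iids_roh <- data.frame(CHR = c(1, 1, 2, 3),
                            KB  = c(1000, 2000, 500, 250))

# For chromosome k, sum the KB values of its rows and express the total
# as a percentage of that chromosome's length.
pct_by_chr <- sapply(seq_along(chr_len_kb), function(k) {
  sum(this_iids_roh$KB[this_iids_roh$CHR == k]) / chr_len_kb[[k]] * 100
})
names(pct_by_chr) <- names(chr_len_kb)

my_list <- c(Froh = sum(this_iids_roh$KB) / 2881033, pct_by_chr)
my_list
```

A chromosome with no rows contributes sum(numeric(0)), which is 0, so its element stays in the result instead of being dropped.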

Related

Merge multiple lists in a list with loop function

I have 4 datasets that all contain the same variable, called "siteid_public". The ultimate goal is to see how many unique "siteid_public" values there are across these four datasets. I will combine them and then use length(unique()) to get the number.
I use a very stupid way to reach this goal; the code looks like this:
site1<-dflist[[1]]%>%
select(siteid_public)
site2<-dflist[[2]]%>%
select(siteid_public)
site3<-dflist[[3]]%>%
select(siteid_public)
site4<-dflist[[4]]%>%
select(siteid_public)
site<-c(site1$siteid_public, site2$siteid_public,site3$siteid_public,site4$siteid_public)
length(unique(site))
But now I want to improve its efficiency.
So, first, I use this code to create a list called "sitelist" which contains 4 elements coming from the four datasets. (dflist[[i]] in the code is where I store these 4 datasets.) After I run the code below, each element contains the same single variable, siteid_public. The code is here:
sitelist<-list()
for (i in 1:4){
sitelist[[i]]<-dflist[[i]]%>%
select(siteid_public)
}
Now I want to combine all 4 elements of sitelist into one, and then use unique to see how many unique siteid_public values there are in the combined result. Could someone help me continue this code to achieve that goal? Thanks a lot!
You can use lapply to iterate on a list of frames, either on the whole list or just as easily a subset (including one or zero).
Your site1 through site4 can be created as a list with
sites <- lapply(dflist[1:4], function(z) select(z, siteid_public))
and you can do your unique-counting with
length(unique(unlist(sites)))
This works as well with
sites <- lapply(dflist, ...) # all of it
sites <- lapply(dflist[3], ...) # singleton, note not the `[[` index operator
indices <- ... # integer or logical of indices to include
sites <- lapply(dflist[indices], ...)
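As a rough, self-contained sketch of the same idea in base R only (no dplyr), assuming toy data frames that stand in for the real datasets; the values and the extra column named other are made up for illustration:

```r
# Hypothetical toy data frames standing in for the four datasets.
dflist <- list(
  data.frame(siteid_public = c("A1", "B2"), other = 1:2),
  data.frame(siteid_public = c("B2", "C3"), other = 3:4),
  data.frame(siteid_public = "C3",          other = 5),
  data.frame(siteid_public = c("A1", "D4"), other = 6:7)
)

# Pull the siteid_public column out of every data frame and flatten
# the pieces into one character vector.
site_ids <- unlist(lapply(dflist, function(z) as.character(z$siteid_public)))

# Number of distinct site ids across all datasets.
n_unique_sites <- length(unique(site_ids))
n_unique_sites
```

Extracting the column as a vector (rather than a one-column data frame) keeps unlist() from having to flatten nested structures.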

Dynamically change part of variable name in R

I am trying to automatise some post-hoc analysis, but I will try to explain myself with a metaphor that I believe will illustrate what I am trying to do.
Suppose I have two lists of strings: the first holds names and the second holds adjectives:
list1 <- c("apt", "farm", "basement", "lodge")
list2 <- c("tiny", "noisy")
Let's suppose also I have a data frame with a bunch of data that I have named something like this as they are the results of some previous linear analysis.
> head(df)
qt[apt_tiny,Intercept] qt[apt_noisy,Intercept] qt[farm_tiny,Intercept]
1 4.196321 -0.4477012 -1.0822793
2 3.231220 -0.4237787 -1.1433449
3 2.304687 -0.3149331 -0.9245896
4 2.768691 -0.1537728 -0.9925387
5 3.771648 -0.1109647 -0.9298861
6 3.370368 -0.2579591 -1.0849262
and so on...
Now, what I am trying to do is make some automatic operations where the strings in the previous lists dynamically change as they go in a for loop. I have made a list with all the distinct combinations and called it distinct. Now I am trying to do something like this:
for (i in 1:nrow(distinct)){
var1[[i]] <- list1[[i]]
var2[[i]] <- list2[[i]]
#this being the insertable name part for the rest of the variables and parts of variable,
#i'll put it inside %var[[i]]% for the sake of the explanation.
%var1[[i]]%_%var2[[i]]%_INT <- df$`qt[%var1[[i]]%_%var2[[i]]%,Intercept]`+ df$`qt[%var1[[i]]%,Intercept]`
}
The difficult part for me is that %var1[[i]]% appears both inside a variable name and inside the name of a column of a data frame.
Any help would be much appreciated.
You cannot use $ to extract column values with a character variable, so df$`qt[%var1[[i]]%_%var2[[i]]%,Intercept]` will not work.
Create the name of the column using sprintf and use [[ to extract it. For example, to construct "qt[apt_tiny,Intercept]" as a column name you can do:
i <- 1
sprintf('qt[%s_%s,Intercept]', list1[i], list2[i])
#[1] "qt[apt_tiny,Intercept]"
Now use [[ to subset that column from df
df[[sprintf('qt[%s_%s,Intercept]', list1[i], list2[i])]]
You can do the same for other columns.
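Putting the pieces together, one possible shape for the whole loop is sketched below. It stores each result in a named list instead of creating separately named variables. The toy data frame, the name combos, and the assumption that columns such as "qt[apt,Intercept]" exist are all illustrative and not taken from the real data.

```r
list1 <- c("apt", "farm")
list2 <- c("tiny", "noisy")

# Hypothetical toy data frame with the column naming scheme from the question.
df <- as.data.frame(matrix(1:12, nrow = 2))
names(df) <- c("qt[apt,Intercept]", "qt[farm,Intercept]",
               "qt[apt_tiny,Intercept]", "qt[apt_noisy,Intercept]",
               "qt[farm_tiny,Intercept]", "qt[farm_noisy,Intercept]")

# All name/adjective combinations, one per row.
combos <- expand.grid(var1 = list1, var2 = list2, stringsAsFactors = FALSE)

results <- list()
for (i in seq_len(nrow(combos))) {
  v1 <- combos$var1[i]
  v2 <- combos$var2[i]
  # Build both column names as strings, then extract them with [[ .
  inter_col <- sprintf("qt[%s_%s,Intercept]", v1, v2)
  main_col  <- sprintf("qt[%s,Intercept]", v1)
  results[[sprintf("%s_%s_INT", v1, v2)]] <- df[[inter_col]] + df[[main_col]]
}
names(results)
```

Each result can then be reached by name, for example results[["apt_tiny_INT"]], which avoids building variable names dynamically.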

Loop over several dataframes in R

I have several data frames that I would like to be used in the same code, one after the other. In the code lines that I have written, I am using the variable "my_data" (which is basically a dataframe). Thus, I thought the easiest solution would be to assign each of my other dataframes to "my_data", one after the other, so that all the code that follows can be executed for each data frame in a loop without changing the code I already have.
The structure I have looks as follows:
#Datasets:
my_data
age_data
gender_data
income_data
## Code that uses "my_data" follows here ##
How can I create a loop that first assigns "age_data" to "my_data" and executes the code where "my_data" is used as a variable, then, after it reaches the end, restarts, assigns "gender_data" to "my_data" and does the same, until this has been done for all data frames?
Help is much appreciated!
I am attempting to answer based upon information provided:
datanames <- c("age_data","gender_data","income_data")
for (dname in datanames){
# look up the data frame whose name is stored in dname
my_data <- get(dname)
# here you can write rest of the code
# remove my_data before the next iteration
rm(my_data)
}
Maybe you can try get() within a for loop:
for (i in c("age_data", "gender_data", "income_data")) {
my_data <- get(i)
# code that uses my_data goes here
}
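A variation on the loops above, offered only as a sketch: wrap the existing code in a function and apply it to a named list of data frames, so the results are collected per dataset. The function name analyse, its body, and the toy data are hypothetical placeholders for "the code that uses my_data".

```r
# Hypothetical toy data frames standing in for the real ones.
age_data    <- data.frame(value = c(21, 35, 48))
gender_data <- data.frame(value = c(1, 2))
income_data <- data.frame(value = c(100, 200, 300, 400))

# Hypothetical stand-in for the code that uses my_data.
analyse <- function(my_data) {
  c(rows = nrow(my_data), mean_value = mean(my_data$value))
}

# Named list, so each result keeps the name of its data frame.
datasets <- list(age_data = age_data,
                 gender_data = gender_data,
                 income_data = income_data)

results <- lapply(datasets, analyse)
results$income_data
```

Because my_data only exists inside the function, nothing has to be removed between iterations.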

apply fisher test in a large dataset that join all contingency tables

I have a dataset like this:
contingency_table<-tibble::tibble(
x1_not_happy = c(1,4),
x1_happy = c(19,31),
x2_not_happy = c(1,4),
x2_happy= c(19,28),
x3_not_happy=c(14,21),
X3_happy=c(0,9),
x4_not_happy=c(3,13),
X4_happy=c(17,22)
)
In fact, there are many other variables, which come from a poll applied in two different years.
Then I apply a Fisher test to each 2x2 contingency matrix, using this code:
matrix1_prueba <- contingency_table[1:2,1:2]
matrix2_prueba<- contingency_table[1:2,3:4]
fisher1<-fisher.test(matrix1_prueba,alternative="two.sided",conf.level=0.9)
fisher2<-fisher.test(matrix2_prueba,alternative="two.sided",conf.level=0.9)
I would like to run this task with shorter code, by means of a function or a loop. The output must be a vector with the p-values for each question.
Thanks,
Frederick
So this was a bit of fun to do. The main thing that you need to recognize is that you want combinations of your data. There are a number of functions in R that can do that for you. The main workhorse is combn().
So in the language of the problem, we want all combinations of the columns of your tibble taken 2 at a time.
From there, you just need a looping structure to run your tests and extract the p-values from the resulting objects.
list_tables <- lapply(combn(contingency_table,2,simplify=FALSE), fisher.test)
unlist(lapply(list_tables, `[`, 'p.value'))
This should produce your answer.
EDIT
Given the updated requirement to test only adjacent data.frame columns, the following modifications should work.
full_list <- combn(contingency_table,2,simplify=FALSE)
full_list <- full_list[sapply(
full_list, function(x) all(startsWith(names(x), substr(names(x)[1], 1,2))))]
full_list <- lapply(full_list, fisher.test)
unlist(lapply(full_list, `[`, 'p.value'))
This is approximately the same code as before, but now we keep only the pairs of columns that share the same question prefix; the sapply call does that filtering. This only works if the prefixes are exactly the same (X3 != x3), so with the sample data above, where x3_not_happy and X3_happy differ in case, those pairs are filtered out unless the names are made consistent first. I think this is a better solution than working with column indexes, because it does not rely on a question's columns always being next to one another. The final output should be what you need for the problem.
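A more direct way to get one p-value per question is to group the columns by prefix and run one test per group. This is a sketch under stated assumptions: it uses a plain data.frame instead of a tibble, it assumes every question has exactly two columns named prefix_something, and it lowercases the prefix (my assumption) so that X3_happy is grouped with x3_not_happy.

```r
contingency_table <- data.frame(
  x1_not_happy = c(1, 4),   x1_happy = c(19, 31),
  x2_not_happy = c(1, 4),   x2_happy = c(19, 28),
  x3_not_happy = c(14, 21), X3_happy = c(0, 9),
  x4_not_happy = c(3, 13),  X4_happy = c(17, 22)
)

# Question prefix of every column: everything before the first underscore,
# lowercased so that "X3" and "x3" fall into the same group.
prefix <- tolower(sub("_.*$", "", names(contingency_table)))

# List of column names per question, e.g. x1 -> c("x1_not_happy", "x1_happy").
groups <- split(names(contingency_table), prefix)

# One Fisher test per 2x2 table, keeping only the p-value.
p_values <- vapply(groups, function(cols) {
  m <- as.matrix(contingency_table[, cols])
  fisher.test(m, alternative = "two.sided", conf.level = 0.9)$p.value
}, numeric(1))
p_values
```

The result is a numeric vector named by question prefix, which matches the requested output of one p-value per question.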

How to append a random or arbitrary column to data frame [R]

Hear me out. Consider an arbitrary case where the new column's elements do not require any information from other columns (which frustrates base $ and mutate assignment), and not every element in the new column is the same. Here is what I've tried:
df$rand<-rep(sample(1:100,1),nrow(df))
unique(df$rand)
[1] 58
and rest assured, nrow(df)>1. I think the correct solution might have to do with an apply function?
Your code repeats one single random number nrow(df) times. Try instead:
df$rand<-sample(1:100, nrow(df))
This draws nrow(df) values from 1:100 without replacement. It gives an error if nrow(df)>100, because you would run out of numbers in 1:100 to sample. To make sure you don't get this error, you can instead sample with replacement:
df$rand<-sample(1:100, nrow(df), replace = TRUE)
If, however, you don't want any random numbers to repeat but would also like to prevent the error, you can do something like this:
df$rand<-sample(1:nrow(df), nrow(df))
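The three variants can be compared on a toy data frame with more than 100 rows. This is only a sketch; the data frame, the column names rand_repl and rand_perm, and the seed are assumptions made for illustration.

```r
set.seed(1)  # only so the sketch is reproducible
df <- data.frame(id = 1:150)

# With replacement: works for any number of rows, values may repeat.
df$rand_repl <- sample(1:100, nrow(df), replace = TRUE)

# Random permutation of 1..nrow(df): no repeats and no error.
df$rand_perm <- sample(nrow(df))

# Without replacement from 1:100 fails here because 150 > 100;
# tryCatch turns the error into a marker value instead of stopping.
too_few <- tryCatch(sample(1:100, nrow(df)), error = function(e) "error")
too_few
```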
If I understand this correctly, I think this is pretty easily doable in dplyr or data.table.
For example, a dplyr solution on iris:
iris %>% mutate(rand = sample(n()))
