Need an explanation for a particular R code snippet - r

This is the code I need explained:
for (i in id) {
data <- read.csv(files[i] )
c <- complete.cases(data)
naRm <- data[c, ]
completeCases <- rbind(completeCases, c(i, nrow(naRm)))
As I understand it, the variable c here stores multiple logical values. The line after it seems foreign to me. How does data[c, ] work?
FYI, I am an R newbie.

complete.cases finds all rows that are "complete", that is, rows with no missing values. Here is the man page. Thus the completeCases object will tell you the number of "complete" rows in each file you have just read. You don't really need to store the value of i in the rbind call, though: it is just the row number, so it is redundant. A vector would do just fine for this application.
It also looks like you are missing a closing brace, or this isn't a complete chunk of code.
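To address the data[c, ] part of the question directly, here is a minimal sketch of logical row indexing. The toy data frame and the name keep are made up for illustration (keep is used instead of c to avoid shadowing the c() function); they are not from the original code.

```r
# Toy data with some missing values (hypothetical, for illustration only)
data <- data.frame(x = c(1, NA, 3, 4),
                   y = c("a", "b", NA, "d"),
                   stringsAsFactors = FALSE)

# One logical value per row: TRUE when the row has no NA in any column
keep <- complete.cases(data)

# Indexing rows with a logical vector keeps the rows where it is TRUE;
# the empty slot after the comma means "all columns"
naRm <- data[keep, ]

# The number of complete rows can be read from either object
n_complete <- nrow(naRm)
print(keep)
print(n_complete)
```

Since sum(keep) counts the TRUE values, the intermediate naRm is only needed if the filtered rows themselves are used later.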

Related

Why after I use "subset", the filtered data is less than it should be?

I want to have "Blancas" and "Sultana" under the "Variete" column.
Figure 1 is the original data,
figure 2 is the expected result,
figure 3 is result I obtained with the code below:
df <- read_excel("R_NLE_FTSW.xlsx")
options(scipen=200)
BLANCAS<-subset(df, Variete==c("Blancas","Sultana"))
view(BLANCAS)
It's obvious that some of the BLANCAS data is missing.
P.S. If I try it on a sub-sheet, the final result is sometimes 5 times larger!
path = "R_NLE_FTSW.xlsx"
df <- map_dfr(excel_sheets(path),
~ read_xlsx(path, sheet = 4))
I don't understand why sometimes it's more and sometimes less than the expected result. Can anyone help me? Thank you so much!
First of all, while you mention that you need both "Blancas" and "Sultana", your expected result shows only Blancas! So get that straight first.
For data like this coming from Excel:
Always clean the data after it's imported. Check the unique values to find any extra spaces etc.
Trim the character data, and ensure that date fields are correct and numbers are numeric (not character).
Now, to subset the data, use df %>% filter(Variete %in% c('Blancas','Sultana'))
-> you can modify the c() vector to include the items of interest.
-> if you wish to clean on the go:
df %>% filter(trimws(Variete) %in% c('Blancas','Sultana'))
As for your sub-sheet problem: we don't even know what data is there. If it's similar, apply the same logic.
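The cleaning steps above can be sketched in base R, without dplyr. The toy data below is invented (including the trailing space in one value and the id column); the real spreadsheet may or may not contain such whitespace.

```r
# Hypothetical stand-in for the imported sheet; note the trailing space in "Sultana "
df <- data.frame(id = 1:6,
                 Variete = c("Sultana", "Blancas", "Blancas",
                             "Sultana ", "Other", "Sultana"),
                 stringsAsFactors = FALSE)

# Step 1: inspect the distinct values; stray whitespace shows up as extra entries
print(unique(df$Variete))

wanted <- c("Blancas", "Sultana")

# Step 2: %in% tests every row against the whole set of wanted values
raw_match <- df[df$Variete %in% wanted, ]

# Step 3: trimming first also catches values with leading or trailing spaces
clean_match <- df[trimws(df$Variete) %in% wanted, ]

print(nrow(raw_match))
print(nrow(clean_match))
```

In this toy data the untrimmed filter silently drops the "Sultana " row, which is the kind of discrepancy that checking unique() first is meant to reveal.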

Convert R list to Pythonic list and output as a txt file

I'm trying to convert these lists into something like Python lists. I've used this code:
library(GenomicRanges)
library(data.table)
library(Repitools)
pcs_by_tile<-lapply(as.list(1:length(tiled_chr)) , function(x){
obj<-tileSplit[[as.character(x)]]
if(is.null(obj)){
return(0)
} else {
runs<-filtered_identical_seqs.gr[obj]
df <- annoGR2DF(runs)
score = split(df[,c("start","end")], 1:nrow(df[,c("start","end")]))
#print(score)
return(score)
}
})
dt_text <- unlist(lapply(tiled_chr$score, paste, collapse=","))
writeLines(tiled_chr, paste0("x.txt"))
The following line of code iterates through each row of the DataFrame (only 2 columns) and splits them into the list. However, its output is different from what I desired.
score = split(df[,c("start","end")], 1:nrow(df[,c("start","end")]))
But I wanted the following kinda output:
[20350, 20355], [20357, 20359], [20361, 20362], ........
If I understand your question correctly, using as.tuple from the package 'sets' might help. Here's what the code might look like
library(sets)
score = split(df[,c("start","end")], 1:nrow(df[,c("start","end")]))
....
df_text = unlist(lapply(score, as.tuple),recursive = F)
This will return a list of tuples (and zeroes) that looks more like what you are after. You can filter out the zeroes by checking the type of each element in the resulting list and removing the elements that match. For example, you could do something like this:
df_text_trimmed <- df_text[!sapply(df_text, is.double)]
to get rid of all your zeroes.
Edit: Now that I think about it, you probably don't even need to convert your data frames to tuples if you don't want to. You just need to make sure to include the 'recursive = F' option when you unlist things, to get a list of 0s and data frames containing the numbers you want.
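If the goal is only the text output shown in the question, a different and simpler route than tuples is to format each row as a string. This is a sketch under the assumption that df has integer start and end columns; the values are taken from the desired output in the question, and the temporary output file is a placeholder for "x.txt".

```r
# Toy stand-in for the result of annoGR2DF(runs); only the two columns matter here
df <- data.frame(start = c(20350L, 20357L, 20361L),
                 end   = c(20355L, 20359L, 20362L))

# sprintf is vectorised, so this builds one "[start, end]" string per row
pairs <- sprintf("[%d, %d]", df$start, df$end)

# Join the per-row strings into a single Python-like line
line <- paste(pairs, collapse = ", ")

# Write to a temporary file (replace with the real path as needed)
out_file <- tempfile(fileext = ".txt")
writeLines(line, out_file)
print(line)
```

Tiles that returned 0 instead of a data frame would need to be skipped before formatting, for example with the filtering step described above.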

R row selection providing partial results

I'm having an issue, which I have found a solution for, but would like to understand what was going on in the original coding.
So I started with a table pulled from an SQL database and wanted information for 1 client, who is covered by 2 client numbers.
Originally I was running this to select those account numbers.
match <- c("C524",'5568')
gtc <- gtc[gtc$AccountNumber == match,]
However this was only returning about half of the desired results, and the results returned vary at different times (this was running as a weekly report), and depending on the PC running it.
Now, I've set up a loop which works fine and extracts all the results, but would really like to know what was going on with the original query.
match <- c("C524",'5568')
for (each in match) {
gtcLoop<- gtc[gtc$AccountNumber == each,]
result<-rbind(result,gtcLoop)
}
Also, long time lurker, first time poster so let me know if I've done anything wrong in this question.
You need to replace == by %in%:
gtc <- data.frame(AccountNumber = sample(c(match, "something"), 10, replace = TRUE))
gtc[gtc$AccountNumber %in% match,]
Just to tag onto Qaswed's answer (+1), you need to understand what is happening when you compute vector comparisons like ==. See:
?`==`
and
?`%in%`
then try something like 1 == c(1,2) and 1 %in% c(1,2).
The reason you are getting about half the results is that == compares element by element and recycles the shorter vector along the longer one, so each row is tested against only one of the two values, as in:
df <- data.frame(id=c(1:5), acct_cd = letters[1:5])
df[df$acct_cd == c("a","c"),] # this is wrong, for demo only
df[df$acct_cd %in% c("a","c"),] # this is correct
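To make the recycling visible, here is a small sketch with invented account values (the names acct and wanted are illustrative; wanted is used instead of match to avoid shadowing the base function). It also hints at why the results changed between runs, assuming the row order of the SQL result was not stable.

```r
# Hypothetical account column and the two numbers being looked for
acct   <- c("C524", "5568", "5568", "C524", "other", "5568")
wanted <- c("C524", "5568")

# == silently stretches `wanted` to the length of `acct` ...
recycled <- rep_len(wanted, length(acct))
eq_mask  <- acct == wanted

# ... so it is the same as comparing position by position with the stretched vector
same_as_recycled <- identical(eq_mask, acct == recycled)

# %in% instead asks, for each row, whether the value is anywhere in `wanted`
in_mask <- acct %in% wanted

# The == result depends on row order; the %in% result does not
n_eq          <- sum(eq_mask)
n_eq_reversed <- sum(rev(acct) == wanted)
n_in          <- sum(in_mask)
n_in_reversed <- sum(rev(acct) %in% wanted)
print(c(n_eq, n_eq_reversed, n_in, n_in_reversed))
```

When the longer length is not a multiple of the shorter one, R also emits a warning about object lengths, which is a useful signal that == is being used where %in% was intended.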

What's wrong with order function in data frame?

DF <- data.frame(CpGId, tframe$t, tframe$p, q)
dimnames(DF)[[2]] <- c("CpGId", "t_value", "p_value", "q_value")
DFhyper <- DF[with(DF, q_value < 0.05 & t_value> 0), ]
DFhyper <- data.frame(DFhyper, row.names = NULL)
DFhyper <- DFhyper [order(p_value), ]
Everything works up to the fourth line of code, so why does R then give an error stating that the object p_value was not found?
R executes the bracketed expression first, without paying any attention to how it is going to be used. When you type
DFhyper[order(p_value),]
R will look for p_value in the current scope (probably the global scope), however, as this is bound into the dataframe, it will not be able to find it. You need to do something to tell it where this is located.
Either
DFhyper[order(DFhyper$p_value),]
or
DFhyper[with(DFhyper,order(p_value)),]
(or, nearly equivalently, with(DFhyper, DFhyper[order(p_value), ])) will work. The first command tells R explicitly that you are referencing the column in the data frame, and the second tells R to look for the variable in the data frame first, before the enclosing scope.
Finally, you can just bind the dataframe into the scope as well, executing
attach(DFhyper)
DFhyper[order(p_value),]
The attach command adds the dataframe columns to the current scope. It can be useful for when you have many operations on the dataframe columns, but don't want to keep referencing it. You can then detach it with detach(DFhyper) when you are done.
It needs to be
DFhyper <- DFhyper[order(DFhyper$p_value), ]
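A small self-contained check of the two recommended forms, using invented values (the CpG identifiers and p-values below are placeholders, not data from the question):

```r
# Toy data frame with the same column names as in the question
DFhyper <- data.frame(CpGId   = c("cg01", "cg02", "cg03"),
                      p_value = c(0.03, 0.001, 0.02),
                      stringsAsFactors = FALSE)

# Form 1: reference the column explicitly through the data frame
sorted_dollar <- DFhyper[order(DFhyper$p_value), ]

# Form 2: evaluate the whole expression with the columns visible by name
sorted_with <- with(DFhyper, DFhyper[order(p_value), ])

# Both give the rows in ascending order of p_value
print(sorted_dollar)
print(identical(sorted_dollar, sorted_with))
```

Both forms avoid attach(), so no column names are left on the search path after sorting.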

Subset function in R wont work with vector selection

I have this weird problem where I have something like this in my code:
#(2,1,6,3)
states.vector <- unique(data$state)
I am iterating through the vector to subset data for each value in the "state" column. At some point through my iteration, the following line of code gives me an empty data frame:
#When state == 1
data.state <- subset(data,state==states.vector[state])
If state is == 1, it means that states.vector[state] == 2. But when I do the following, it works just fine:
subset(data,state==2)
What is weird is that I used this process multiple times, and it worked fine for the exact same task, with the same format for "data", but with some different values inside.
What am I doing wrong?
I think jlhoward has already explained what the problem is.
Why don't you use something like the following lines of code to loop through your states?
states.vector <- unique(data$state)
for (selected_state in states.vector) {
data.state <- subset(data,state==selected_state)
#...
}
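One likely cause, assuming the loop variable in the question really was named state: inside subset(), the name state refers to the column data$state, not to the loop variable, so states.vector[state] indexes with the whole column. The sketch below reproduces that with made-up data and shows why a differently named loop variable avoids it.

```r
# Toy data; the state values follow the comment in the question, the rest is invented
data <- data.frame(state = c(2, 1, 6, 3, 2), value = 1:5)
states.vector <- unique(data$state)

# A variable that shares its name with a column of `data`
state <- 1

# Inside subset(), `state` is the column, so states.vector[state] is
# states.vector[c(2, 1, 6, 3, 2)], not states.vector[1]
clash <- subset(data, state == states.vector[state])

# With a name that is not a column, the lookup falls back to the calling environment
selected_state <- states.vector[1]
fixed <- subset(data, state == selected_state)

print(nrow(clash))
print(nrow(fixed))
```

Whether the clashing version returns rows depends on the values in the column, which would be consistent with the same code working on other data sets.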
