How to use loop to make a stack in R? - r

I'm new to R.
I am currently using R to analyze some sequencing data, and I'm stuck.
Here is the problem.
I'd like to select the first word of each value in $V3 of the pat1_01_exonic data (115 rows) and write the results to a file. (I used the strsplit function for this.)
So far I have tried code 1 below, and it works for the first row.
The problem is that I can't repeat this by hand 115 times,
so it seems a loop is necessary.
I'm not confident writing loops myself, and as I expected, my attempt (code 2) didn't work.
To stack the results, I thought about using append, rbind, or stack.
Can anyone give me some advice on how to fix this?
Many thanks in advance.
#code1
pat1_01_exonic$V3 <-as.character(pat1_01_exonic$V3)
pat1 <- data.frame(head(strsplit(pat1_01_exonic$V3, ":")[[1]],1))
#code2
for (i in 1: nrow(pat1_01_exonic)) {
pat1_output <- vector()
sub[i] <- data.frame(head(strsplit(pat1_01_exonic$V3, ":")[[i]],1))
pat1_0utput <- append(sub[i])
i <- i+1
}

In many cases you can avoid a for loop in R. If I have understood you correctly, you can use sub here to get the first string before ":"
pat1_01_exonic$new_col <- sub(":.*", "", pat1_01_exonic$V3)
pat1_01_exonic
#             V3 new_col
# 1  abc:def:avd     abc
# 2     afd:adef     afd
# 3 emg:rvf:temp     emg
data
pat1_01_exonic <- data.frame(V3 = c("abc:def:avd", "afd:adef", "emg:rvf:temp"),
stringsAsFactors = FALSE)

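The code below is an example that creates a new variable "V3_First_Word" containing the first word of the original string. Note that word() splits on spaces by default, so the ":" separator has to be given explicitly.
library(dplyr)
library(stringr)
Want <- pat1_01_exonic %>%
mutate(V3_First_Word = word(V3, 1, 1, sep = fixed(":"))) # new variable holding the first ":"-separated word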

In base R, we can use read.table
pat1_01_exonic$new_col <- read.table(text = pat1_01_exonic$V3, sep=":",
header = FALSE, fill = TRUE, stringsAsFactors = FALSE)[,1]
pat1_01_exonic$new_col
#[1] "abc" "afd" "emg"
Or strsplit and select the first element
sapply(strsplit(pat1_01_exonic$V3, ":"), `[`, 1)
data
pat1_01_exonic <- data.frame(V3 = c("abc:def:avd", "afd:adef", "emg:rvf:temp"),
stringsAsFactors = FALSE)
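
If an explicit loop is still wanted, the original attempt can be repaired along these lines. This is only a minimal sketch, assuming the same three-row example data as above; the output file is a temporary file chosen purely for illustration. The result vector is created once before the loop, and the loop counter is left to for rather than incremented by hand.

```r
# Example data (same as in the answers above)
pat1_01_exonic <- data.frame(V3 = c("abc:def:avd", "afd:adef", "emg:rvf:temp"),
                             stringsAsFactors = FALSE)

n <- nrow(pat1_01_exonic)
pat1_output <- character(n)            # pre-allocate once, outside the loop

for (i in seq_len(n)) {
  # split row i on ":" and keep the first piece
  parts <- strsplit(pat1_01_exonic$V3[i], ":", fixed = TRUE)[[1]]
  pat1_output[i] <- parts[1]
}

# Write one word per line; the file name is a placeholder
out_file <- tempfile(fileext = ".txt")
writeLines(pat1_output, out_file)
```

Pre-allocating pat1_output avoids the repeated copying that append or rbind inside a loop would cause, although for 115 rows the difference is negligible.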

Related

Using a loop to create a modified data frame

I have been struggling to find a way to create a new data frame using a loop, where the main goal is to keep only the values that are >= 0.5.
I'm using RStudio; however, Python is an option too.
Here is what my data frame (CSV file) looks like, along with some lines of the (incomplete) script:
df <- read.table(choose.files(), header = T, sep = ",", comment.char = "")
Site,Partition,alpha,beta,omega,alpha=beta,LRT,p-value,Total branch length
1,1,"0.000","0.000","NaN","0.000","0.000","1.000","0.000"
2,1,"0.060","0.046","0.774","0.048","0.049","0.825","0.000"
Then I use the subset function to keep only the two columns that interest me:
sdf <- subset(df, select = c("ï..Site", "alpha.beta"))
ï..Site alpha.beta
1 1 0.000
2 2 0.048
...
Then I thought of using a loop to create a new CSV file: when the second column has a value >= 0.5, print that value; when it does not, print a 0 instead.
I have tried different approaches, but none of them worked for me. Here are the last lines I tried:
for (i in names(sdf1)) {
f_sdf1 <- sdf1[sdf1[, i] >= 0.5]
write.csv(f_sdf1, paste0(i, ".csv"))
}
So in this post I'm looking for some ideas on how to write this script. Maybe it's simple, but I need to ask how.
You could use subset to filter your data as in
# first get some example data
expl <- data.frame(site = 1:10, alpha.beta = runif(10))
print(expl)
# now do the filtering
expl.filtered = subset(expl, alpha.beta >= .5)
print(expl.filtered)
# Now write.table or write.csv...
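
The question also asks for a 0 wherever the value is below the threshold, which subset alone does not do because it drops those rows. A possible sketch using ifelse follows; the column name Site, the sample values, and the temporary output file are assumptions made up for illustration.

```r
# Toy data standing in for the two selected columns
sdf <- data.frame(Site = 1:4,
                  alpha.beta = c(0.000, 0.048, 0.730, 0.500))

# Keep values >= 0.5, replace everything else with 0 (vectorised, no loop)
sdf$filtered <- ifelse(sdf$alpha.beta >= 0.5, sdf$alpha.beta, 0)

# Write the result; the path is a placeholder
out_file <- tempfile(fileext = ".csv")
write.csv(sdf, out_file, row.names = FALSE)
```

This keeps every site in the output, so the row count of the new file matches the input.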

Parsing colnames text string as expression in R

I am trying to create a large number of data frames in a for loop using the "assign" function in R. I want to use the colnames function to set the column names in the data frame. The code I am trying to emulate is the following:
county_tmax_min_df <- data.frame(array(NA,c(length(days),67)))
colnames(county_tmax_min_df) <- c('Date',sd_counties$NAME)
county_tmax_min_df$Date <- days
The code I have so far in the loop looks like this:
file_vars = c('file1','file2')
days <- seq(as.Date("1979-01-01"), as.Date("1979-01-02"), "days")
f = 1
for (f in 1:2){
assign(paste0('county_',file_vars[f]),data.frame(array(NA,c(length(days),67))))
}
I need to be able to set the column names similar to how I did in the above statement. How do I do this? I think it needs to be something like this, but I am unsure what goes in the text portion. The end result I need is just a bunch of data frames. Any help would be wonderful. Thank you.
expression(parse(text = ))
You can set the names within assign, like this:
file_vars = c('file1', 'file2')
days <- seq.Date(from = as.Date("1979-01-01"), to = as.Date("1979-01-02"), by = "days")
for (f in seq_along(file_vars)) {
assign(x = paste0('county_', file_vars[f]),
value = {
df <- data.frame(array(NA, c(length(days), 67)))
colnames(df) <- paste0("fancy_column_",
sample(LETTERS, size = ncol(df), replace = TRUE))
df
})
}
Inside the {} you can use colnames(df) or setNames to assign column names in any manner desired. Your first piece of code refers to an sd_counties object that is not available here, but the general idea should work for you.
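
A common alternative to assign is to keep the data frames in a named list, which makes the column names easy to set and the frames easy to loop over later. The sketch below assumes a made-up vector county_names in place of sd_counties$NAME (which is not available here); make_df is a hypothetical helper.

```r
file_vars <- c("file1", "file2")
days <- seq(as.Date("1979-01-01"), as.Date("1979-01-02"), by = "day")

# Placeholder for sd_counties$NAME: 66 hypothetical county names
county_names <- paste0("county_", 1:66)

# Hypothetical helper: one empty data frame with a Date column plus one column per county
make_df <- function() {
  df <- data.frame(array(NA, c(length(days), length(county_names) + 1)))
  colnames(df) <- c("Date", county_names)
  df$Date <- days
  df
}

# One data frame per file, stored in a named list instead of separate variables
county_dfs <- setNames(lapply(file_vars, function(f) make_df()),
                       paste0("county_", file_vars))
```

An individual frame is then available as county_dfs[["county_file1"]] or county_dfs$county_file1.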

Getting all the children nodes of XML file to data.frame or data.table

As an example, I have the following XML code
tt = '<Nummeraanduiding>
<identificatie>0010200000114849</identificatie>
<aanduidingRecordInactief>N</aanduidingRecordInactief>
<aanduidingRecordCorrectie>0</aanduidingRecordCorrectie>
<huisnummer>13</huisnummer>
<officieel>N</officieel>
<postcode>9904PC</postcode>
<tijdvakgeldigheid>
<begindatumTijdvakGeldigheid>2010051100000000</begindatumTijdvakGeldigheid>
</tijdvakgeldigheid>
<inOnderzoek>N</inOnderzoek>
<typeAdresseerbaarObject>Verblijfsobject</typeAdresseerbaarObject>
<bron>
<documentdatum>20100511</documentdatum>
<documentnummer>2010/NR002F</documentnummer>
</bron>
<nummeraanduidingStatus>Naamgeving uitgegeven</nummeraanduidingStatus>
<gerelateerdeOpenbareRuimte>
<identificatie>0010300000000444</identificatie>
</gerelateerdeOpenbareRuimte>
</Nummeraanduiding> '
The goal is to convert this node(Nummeraanduiding) to a data.table (or data.frame is also fine). One challenge is that I have a lot of these Nummeraanduiding nodes (millions of them).
The following code is able to process the data:
library(XML)
# This parses the doc...
doc = xmlParse(tt)
# Solution (1) - this is the most obvious solution..
XML::xmlToDataFrame(doc)
# Solution (2) - apparently converting to a list is also possible..
unlist(xmlToList(doc))
# Solution (3) - My own solution
data.frame(as.list(unlist(xmlToList(doc))))
Not all solutions produce the desired result. In the end only Solution (3) satisfies my needs:
- It is in a data.frame/data.table format
- It contains all the child-child-nodes and has distinct names for each column
- It does not 'merge' the information of child-child-nodes
However, running this piece of code on all my data is quite slow. It took more than 8 hours for a file containing 2,290,000 'Nummeraanduiding' nodes.
Do you know of any way to speed up this process? Can my method be improved? Am I perhaps missing a useful function?
Given that each field is already on a separate line just grep them out, read what is left using read.table and convert from long to wide using tapply to produce the resulting matrix (which can be converted to a data frame or data.table if desired). Note that in read.table we bypass quote, comment and class processing. Finally, test it out to see if it is faster. No packages are used.
nms <- c("identificatie", "aanduidingRecordInactief", "aanduidingRecordCorrectie",
"huisnummer", "officieel", "postcode", "tijdvakgeldigheid.begindatumTijdvakGeldigheid",
"inOnderzoek", "typeAdresseerbaarObject", "bron.documentdatum",
"bron.documentnummer", "nummeraanduidingStatus",
"gerelateerdeOpenbareRuimte.identificatie")
rx <- paste(nms, collapse = "|")
g <- chartr("<", ">", grep(rx, readLines(textConnection(tt)), value = TRUE))
long <- read.table(text = g, sep = ">", quote = "", comment.char = "",
colClasses = "character")[2:3]
names(long) <- c("field", "value")
long$field <- factor(long$field, levels = nms) # maintain order of columns
long$recno <- cumsum(long$field == "identificatie")
with(long, tapply(value, list(recno, field), c))
If all records have exactly the same set of fields, such as those listed in nms, then the last line could be replaced with this (which is likely faster):
matrix(long$value, ncol = length(nms), byrow = TRUE, dimnames = list(NULL, nms))
Another alternative to the tapply line would be to use reshape from base R or to use dcast from the reshape2 package.
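
For the reshape alternative, a minimal sketch of the long-to-wide step could look like this. The toy long table below uses only two fields, and the second record's values are invented for illustration; the column names recno, field and value follow the code above.

```r
# Toy long-format table: one row per (record, field) pair
long <- data.frame(recno = c(1, 1, 2, 2),
                   field = c("huisnummer", "postcode", "huisnummer", "postcode"),
                   value = c("13", "9904PC", "7", "1234AB"),
                   stringsAsFactors = FALSE)

# One row per record, one column per field
wide <- reshape(long, idvar = "recno", timevar = "field", direction = "wide")
wide
```

reshape prefixes the new columns with the name of the value column, so the result has columns recno, value.huisnummer and value.postcode.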

How to correct R syntax for summing two fields?

In a dbf file I create the new field xyz, then attempt to sum the existing item1 and item2 fields, store the sum in xyz, and write a new dbf file, but it does not work. Everything works without the for loop. I hope someone can help. Thank you.
library(foreign)
setwd("C:/temp")
dbfdata <- read.dbf("sldu_500ka.dbf", as.is = TRUE)
dbfdata$xyz <- 1:nrow(dbfdata)
for(i in 1:nrow(dbfdata)) {
row <- dbfdata[i,]
dbfdata$xyz <- dbfdata$item1 + dbfdata$item2
}
write.dbf(dbfdata, "sldu_500k1.dbf")
I'm not sure whether I understand you correctly, but
library(foreign)
setwd("C:/temp")
dbfdata <- read.dbf("sldu_500ka.dbf", as.is = TRUE)
dbfdata$xyz <- dbfdata$item1 + dbfdata$item2
write.dbf(dbfdata, "sldu_500k1.dbf")
should do the job. Instead of looping over all rows, you can add the entire columns at once.
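
To see that the vectorised sum gives the same result as a row-by-row loop, here is a small sketch on an in-memory data frame. The values are made up and no dbf file is read, so the foreign package is not needed for this illustration.

```r
# Toy stand-in for the dbf contents
dbfdata <- data.frame(item1 = c(1, 2, 3), item2 = c(10, 20, 30))

# Vectorised: adds the two columns element by element
dbfdata$xyz <- dbfdata$item1 + dbfdata$item2

# Equivalent explicit loop, shown only for comparison
xyz_loop <- numeric(nrow(dbfdata))
for (i in seq_len(nrow(dbfdata))) {
  xyz_loop[i] <- dbfdata$item1[i] + dbfdata$item2[i]
}

identical(dbfdata$xyz, xyz_loop)
```

The loop in the question did not fail because of syntax: each iteration simply recomputed the whole column, so the loop was redundant.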

R, Getting the top in every category from a data frame?

I have the following data frame
id,category,value
A,21,0.89
B,21,0.73
C,21,0.61
D,12,0.95
E,12,0.58
F,12,0.44
G,23,0.33
Note that the rows are already sorted by value within each category. What I would like to do is take the top id from each category and build a string, followed by the second id from each category, and so on. So for the above example it would look like
A,D,G,B,E,C,F
Is there a way to do it easily in R? Or am I better off relying on a Perl script to do it?
Thanks much in advance
This appears to work, but I'm certain we could simplify it somewhat, particularly if you are able to relax your ordering requirements:
library(plyr)
d <- read.table(text = "id,category,value
A,21,0.89
B,21,0.73
C,21,0.61
D,12,0.95
E,12,0.58
F,12,0.44
G,23,0.33",sep = ',',header = TRUE)
d <- ddply(d,.(category),transform,r = seq_along(category))
d <- arrange(d,id)
paste(d$id[order(d$r)],collapse = ",")
# [1] "A,D,G,B,E,C,F"
This version is probably more robust to ordering, and avoids plyr:
d$r <- sequence(rle(d$category)$lengths) # rank within each run of category; also works when all runs have equal length
d$s <- 1:nrow(d)
with(d,paste(id[order(r,s)],collapse = ","))
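
Another base R option, sketched here under the assumption that the rows are already sorted by value within each category, is to compute the within-category rank with ave and then order by that rank, breaking ties by the original row position.

```r
d <- read.table(text = "id,category,value
A,21,0.89
B,21,0.73
C,21,0.61
D,12,0.95
E,12,0.58
F,12,0.44
G,23,0.33", sep = ",", header = TRUE)

# Position of each row within its category (1 = top, 2 = second, ...)
d$r <- ave(seq_along(d$category), d$category, FUN = seq_along)

# Order by rank first, then by original row position
result <- paste(d$id[order(d$r, seq_along(d$r))], collapse = ",")
result
# [1] "A,D,G,B,E,C,F"
```

Unlike the rle approach, ave does not require the rows of a category to be adjacent, only that they appear in the desired order.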
