Why is df.to_string printing out weird labels? - jupyter-notebook

If I run the code like so:
print(df['Col1'].to_string(index=False))
I get:
1
2
3
Now if I use the code like so (without print):
s = df['Col1'].to_string(index=False)
s
I get:
'1\n2\n3'
Where are the backslashes and the 'n' characters coming from? What is the appropriate way of listing a single column, with the ultimate goal of assigning it to an array?

If you want to convert a column to an array or a list, use this code:
col_list = df['Col1'].values
or
col_list = list(df['Col1'])
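The \n sequence is an escape sequence supported by many languages; it stands for a newline character in a string. The string returned by to_string() contains real newline characters between the values. When you evaluate s without print, Jupyter displays the string's repr, which shows each newline as \n. The print function writes the string itself, so each newline appears as a line break.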
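The difference between the two displays can be reproduced without pandas. The following is a minimal sketch in plain Python, assuming that s holds the same text that to_string(index=False) returned in the question; the string literal is a stand-in, not output taken from a real DataFrame.

```python
# Plain-Python stand-in for the string that to_string(index=False) produced above
s = "1\n2\n3"

# repr() is what a notebook cell shows for a bare expression: escapes stay visible
print(repr(s))

# print() writes the characters themselves, so each newline becomes a line break
print(s)

# The backslash and the 'n' are not separate characters: the string has 5 characters
print(len(s))

# Splitting on line breaks recovers the individual values as a list of strings
values = s.splitlines()
print(values)
```

Note that splitting the formatted string gives strings, not numbers, which is one reason to take the values from the column directly as shown above.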

Related

remove specific data within the string in R

I'm new to R. I have this data frame, and I'm trying to delete all the information from this column except the gene symbols, which always come second within the string.
best regards!
I tried the gsub function, but it deleted only the specific element. I'm wondering if I can use it to keep only the gene symbol (which always comes second in the string) and delete everything else.
If your data is consistently in the format shown in the image (where the gene ID is always the third "word" of the string, because the "//" separator counts as a word), then the word() function from the stringr package can extract the data you want.
library(stringr)
dat = data.frame(gene_assignment = rep(c('idnumbers // geneID // Other stuff'),10))
dat$geneID = word(dat$gene_assignment, 3)
Note that this makes the following assumptions:
Your data is always in the format where there are some id numbers, followed by " // ", followed by the gene ID, followed by a space, and then anything else
Neither the ID numbers in the front nor the gene ID ever contain a space in them
These assumptions are necessary because word() uses spaces to determine when each word starts and ends.
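An alternative that does not rely on spaces is to split on the " // " separator itself and take the second field. The sketch below shows the idea in Python rather than R; the sample rows and the helper name second_field are hypothetical and only follow the "idnumbers // geneID // Other stuff" layout assumed above.

```python
# Hypothetical rows in the "idnumbers // geneID // Other stuff" layout assumed above
rows = [
    "idnumbers // geneID // Other stuff",
    "NM_000001 // ABC1 // some gene description",
]

def second_field(text, sep=" // "):
    """Return the second separator-delimited field, or None if there is none."""
    parts = text.split(sep)
    return parts[1].strip() if len(parts) > 1 else None

gene_ids = [second_field(r) for r in rows]
print(gene_ids)
```

Splitting on the full separator means the ID numbers and the trailing description may contain spaces, as long as they never contain " // ".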

Use substr with start and stop words, instead of integers

I want to extract information from downloaded HTML code. The HTML code is given as a string. The required information is stored between specific HTML tags. For example, if I want every headline in the string, I have to search for "H1>" and "/H1>" and take the text between them.
So far, I used substr(), but I had to calculate the position of "H1>" and "/H1>" first.
htmlcode = " some html code <H1>headline</H1> some other code <H1>headline2</H1> "
startposition = c(21,55) # calculated with gregexpr
stopposition = c(28, 63) # calculated with gregexpr
substr(htmlcode, startposition[1], stopposition[1])
substr(htmlcode, startposition[2], stopposition[2])
The output is correct, but calculating every single start and stop position is a lot of work. Instead, I am looking for a function similar to substr() that accepts start and stop words instead of positions. For example, like this:
function(htmlcode, startword = "H1>", stopword = "/H1>")
I'd agree that using a package built for html processing is probably the best way to handle the example you give. However, one potential way to sub-string a string based on character values would be to do the following.
Step 1: Define a simple function to return the position of a character in a string. In this example I am only using fixed character strings.
strpos_fixed <- function(string, char) {
  # gregexpr() returns a list; its first element holds all match start positions
  a <- gregexpr(char, string, fixed = TRUE)
  b <- a[[1]][1:length(a[[1]])]
  return(b)
}
Step 2: Define your new sub-string function using the strpos_fixed() function you just defined
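char_substr <- function(string, start, stop) {
  # first character after each start word
  x <- strpos_fixed(string, start) + nchar(start)
  # last character before each stop word
  y <- strpos_fixed(string, stop) - 1
  z <- cbind(x, y)
  # one substring per (start, stop) pair
  apply(z, 1, function(x) { substr(string, x[1], x[2]) })
}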
Step 3: Test
htmlcode = " some html code <H1>headline</H1> some other code <H1>headline2</H1> "
htmlcode2 = " some html code <H1>baa dee ya</H1> some other code <H1>say do you remember?</H1>"
htmlcode3<- "<x>baa dee ya</x> skdjalhgfjafha <x>dancing in september</x>"
char_substr(htmlcode,"<H1>","</H1>")
char_substr(htmlcode2,"<H1>","</H1>")
char_substr(htmlcode3,"<x>","</x>")
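For comparison, the same start-word/stop-word idea can be sketched in Python with str.find. This is not a translation of the R functions above: the helper name between is hypothetical, and it pairs each start word with the next stop word that follows it instead of computing all positions up front.

```python
def between(text, start, stop):
    """Collect every substring that sits between a start word and the next stop word."""
    results = []
    pos = 0
    while True:
        begin = text.find(start, pos)
        if begin == -1:
            break
        begin += len(start)           # first character after the start word
        end = text.find(stop, begin)  # next stop word after that position
        if end == -1:
            break
        results.append(text[begin:end])
        pos = end + len(stop)         # continue searching after this stop word
    return results

htmlcode = " some html code <H1>headline</H1> some other code <H1>headline2</H1> "
htmlcode3 = "<x>baa dee ya</x> skdjalhgfjafha <x>dancing in september</x>"
print(between(htmlcode, "<H1>", "</H1>"))
print(between(htmlcode3, "<x>", "</x>"))
```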
You have two options here. First, use a package that has been developed explicitly for the parsing of HTML structures, e.g., rvest. There are a number of tutorials online.
Second, for edge cases where you may need to extract from strings that are not necessarily well-formatted HTML you should use regular expressions. One of the simpler implementations for this comes from stringr::str_match:
library(stringr)
# 1. the parentheses define regex groups
# 2. ".*?" means any character, non-greedy
# 3. so together we are matching the expression <H1>some text or characters of any length</H1>
# str_match() returns only the first match; str_match_all() returns every match
str_match(htmlcode, "(<H1>)(.*?)(</H1>)")
This yields a matrix whose columns are, in order, the full match followed by each regex group we specified. To get the text between the <H1> tags, pull the second group (the 3rd column).
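The grouping can also be tried out in Python's re module, which accepts the same pattern. The sketch below is only an illustration of the regex, not of stringr's return format: search() plays the role of a first-match lookup, and finditer() collects every match.

```python
import re

htmlcode = " some html code <H1>headline</H1> some other code <H1>headline2</H1> "

# Same pattern as above: three groups, with a non-greedy ".*?" in the middle
pattern = re.compile(r"(<H1>)(.*?)(</H1>)")

# search() stops at the first match; group(0) is the full match, group(2) the inner text
first = pattern.search(htmlcode)
print(first.group(0), "|", first.group(2))

# finditer() walks every match, so all headlines are collected
headlines = [m.group(2) for m in pattern.finditer(htmlcode)]
print(headlines)
```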

Create a function in R to extract character from string by using position? The positions of characters are figured out based on pattern condition

I want to create a function that extracts characters from strings using substring, but I have trouble finding the end_position at which to cut.
I have a string that is stored as part of a log file, like this:
string = ("{\"country\":\"UNITED STATES\",\"country`_`code\":\"US\"}")
My idea is to identify the position of each description in the log and cut out the characters that follow it:
start_position = as.numeric(str_locate(string,'\"country\":\"')[,2])
end_position = ??????
country = substring(string, start_position, end_position)
The sign that marks the end of the text I want to cut is the "," symbol at the end. For example: \"country\":\"UNITED STATES\",
Could you tell me a way to get the position of "," given a specific pattern in front of it? I intend to create a function later that extracts characters based on the recognized pattern. In this example, the patterns are "country" and "country code".
Instead of using substring, have a look at strsplit, which splits a string according to a pattern.
string = ("{\"country\":\"UNITED STATES\",\"country`_`code\":\"US\"})")
strsplit(string,",")[[1]][1]
[1] "{\"country\":\"UNITED STATES\""
You can replace the pattern with any regex you like.
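The position-based approach from the question can also be completed by searching for the closing quote of the value instead of the comma, which then works for the last field too. Below is a minimal Python sketch of that idea. The helper name value_after is hypothetical, the second key is written as country_code as an assumption, and values are assumed not to contain escaped quotes.

```python
# Stand-in for the log line; the key name "country_code" is an assumption
log = '{"country":"UNITED STATES","country_code":"US"}'

def value_after(log, key):
    """Return the quoted value that follows "key": in log, or None if absent."""
    marker = '"' + key + '":"'
    start = log.find(marker)
    if start == -1:
        return None
    start += len(marker)        # first character of the value
    end = log.find('"', start)  # the closing quote marks the end position
    if end == -1:
        return None
    return log[start:end]

print(value_after(log, "country"))
print(value_after(log, "country_code"))
```

If the log lines are valid JSON, a JSON parser is likely to be more robust than searching for positions by hand.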

Split CSV type file in R using strsplit

I am trying to split a string that will eventually be read from a CSV file using readLines(). (I know read.csv() works better, but the CSV file can have a different number of columns in each row. For example, the 1st row has 2 columns, the 2nd row 4 columns, the 3rd row 2 columns.)
Say, the string I am going to parse looks like this:
2011-05-04, "weqr, wrqw", "qweqrw", 12
Eventually, I want it to be split into four parts, meaning I am splitting on commas but only when the comma is outside the quotation marks.
A quick Google search gives me a Java solution that takes advantage of the regular expression ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)"
But doing something like a<-strsplit(x,",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)") will generate an error: invalid regular expression.
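The pattern relies on a lookahead, so it only works in regex engines that support that feature. As a hedged illustration, here is the same expression applied to the sample line with Python's re module, which does support lookaheads; this shows what the pattern is meant to do and is not a fix for the R call.

```python
import re

line = '2011-05-04, "weqr, wrqw", "qweqrw", 12'

# A comma counts as a separator only if an even number of quotes follows it
# up to the end of the line, i.e. the comma is outside any quoted field
separator = re.compile(r',(?=(?:[^"]*"[^"]*")*[^"]*$)')

# Split on the qualifying commas and trim the surrounding spaces
fields = [part.strip() for part in separator.split(line)]
print(fields)
```

The quoted fields keep their quotation marks; they can be removed afterwards if only the content is needed.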

R: Add paste() elements to file

I'm using base::paste in a for loop:
for (k in 1:length(summary$pro)) {
  if (k == 1) {
    mp <- summary$pro[k]
  } else {
    mp <- paste(mp, summary$pro[k], sep = ",")
  }
}
mp comes out as one big string, where the elements are separated by commas.
For example mp is "1,2,3,4,5,6"
Then, I want to put mp in a file, where each of its elements is added to a separate column in the same row. My code for this is:
write.table(mp, file = recompdatafile, sep = ",")
However, mp just appears in the CSV as one big string as opposed to being divided up. How can I achieve my desired format?
FYI
I've also tried converting mp to a list, and strsplit()-ing it, neither of which have worked.
Once I've added summary$pro to the file, how can I also add summary$me (which has the same format), in one row with multiple columns?
Thanks,
n.i.
If you want to write something to a file, write.table() isn't the only way. If you want to avoid headers and quotes and such, you can use the more direct cat. For example
cat(summary$pro, sep=",", file="filename.txt")
will write out the vector of values from summary$pro separated by commas more directly. You don't need to build a string first. (And building a string one element at a time as you did above is bad practice anyway. Most functions in R can operate on an entire vector at a time, including paste, for example paste(summary$pro, collapse = ",").)
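The same principle, handing the whole sequence to one call instead of growing a string in a loop, can be sketched in Python. The lists pro and me below are hypothetical stand-ins for summary$pro and summary$me, and an in-memory buffer is used in place of a real file.

```python
import csv
import io

# Hypothetical stand-ins for summary$pro and summary$me
pro = [1, 2, 3, 4, 5, 6]
me = [7, 8, 9, 10, 11, 12]

# Joining the whole sequence at once replaces the element-by-element loop
mp = ",".join(str(v) for v in pro)
print(mp)

# csv.writer puts each element in its own column; one writerow() call per row
buffer = io.StringIO()  # in-memory stand-in for a real file
writer = csv.writer(buffer)
writer.writerow(pro)
writer.writerow(me)
print(buffer.getvalue())
```

Here each vector becomes its own row; passing a list of values to writerow() is what keeps the elements in separate columns.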
