Strange unexpected tokens inside the string - r

I have two seemingly identical strings in two data frames. For example, both
df_cont$winner[20]
df_assist$winner[609]
return "ivarovskaya"
But the comparison
identical(df_cont$winner[20], df_assist$winner[609])
returns FALSE.
So, dplyr joins don't work on them and when I count characters in those strings, I get different numbers.
Then I found out that copying those strings from View() panel into Rscript results in this:
Output of problem variables looks like this:
> df_cont$winner[20]
[1] "iva­rov­ska­ya"
> df_assist$winner[609]
[1] "ivarovskaya"
> nchar(df_cont$winner[20])
[1] 14
> nchar(df_assist$winner[609])
[1] 11
dput() function also results in identical strings:
> dput(df_cont$winner[20])
"iva­rov­ska­ya"
> dput(df_cont$winner[20])
"iva­rov­ska­ya"
How can I get rid of those strange red dots?
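The character counts above (14 versus 11) would be consistent with three invisible characters hiding in the first string. A minimal sketch, assuming they are soft hyphens (U+00AD); the variables with_shy and plain are hypothetical stand-ins for the two data frame values, and the real culprit should be confirmed by inspecting the code points first:

```r
# Hypothetical stand-ins for df_cont$winner[20] and df_assist$winner[609]
with_shy <- "iva\u00ADrov\u00ADska\u00ADya"  # assumed: three soft hyphens (U+00AD)
plain    <- "ivarovskaya"

nchar(with_shy)  # 14, although it may print like the 11-character string

# Inspect the code points to reveal the invisible characters (173 is U+00AD)
utf8ToInt(with_shy)

# Remove every soft hyphen, then compare again
cleaned <- gsub("\u00AD", "", with_shy, fixed = TRUE)
cleaned == plain
```

If utf8ToInt() shows a different code point in your data, substitute it in the gsub() pattern before rerunning the join.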

Related

How to use sprintf() function to format columns in R

I am a newbie and learning R. Please accept my apology for stupid questions.
I am stuck on an error and cannot work out exactly what it means:
"Error in [.data.frame: undefined columns selected"
Can anyone help me with this, please?
When I use the below:
sprintf("%s", "Return (%)")
it works fine and renders the table correctly.
But if I use sprintf("%.1f", "Return (%)"), it fails with
"Error in sprintf: invalid format '%.1f'; use format %s for character objects"
Next, I tried converting the column to numeric with
sprintf("%.1f", as.numeric("Return (%)")).
Then I get the below error again:
"Error in [.data.frame: undefined columns selected"
R uses the extract operator to access information within an object. Since a data frame is also an R object, we use the extract operator to access columns in a data frame.
The sprintf("%.1f", "Return (%)") call fails because "Return (%)" is a literal character string, not the numeric column, so the %.1f format is invalid for it.
The sprintf("%.1f", as.numeric("Return (%)")) call does not help for a related reason: as.numeric("Return (%)") converts the literal string, not the column, so it never references the values in the data frame.
To extract columns from a data frame, multiple forms of the extract operator can be used:
1. dataframe[rows,columns]
2. dataframe[,"columnName"]
3. dataframe[["columnName"]]
4. dataframe$columnName
Since columns can be referenced by name or by position (from left to right), one way to solve the problem described in the question is to use the column number with the extract operator to reference the percentage return column, as follows.
textFile <- "symbol,Return (%)
AAPL,23.1
GOOG,30.5
IBM,11.8
MSFT,14"
stockReturns <- read.csv(text=textFile)
# version that works: Return (%) is second column in data frame
sprintf("%.1f",stockReturns[,2])
...and the output:
> sprintf("%.1f",stockReturns[,2])
[1] "23.1" "30.5" "11.8" "14.0"
>
Why did the error occur?
When we print the data frame, we see how R parsed the column name when reading the header via read.csv().
stockReturns
> stockReturns
  symbol Return....
1   AAPL       23.1
2   GOOG       30.5
3    IBM       11.8
4   MSFT       14.0
>
Interesting! R replaced the space and the three special characters with dots, so the column is now named Return.... (four dots). Now let's try sprintf() again with the column name as R interpreted it, using the [[ form of the extract operator.
> sprintf("%.1f", stockReturns[["Return...."]])
[1] "23.1" "30.5" "11.8" "14.0"
>
Now that we know the problem is caused by special characters in the column name, we can use the colnames() function to rename the Return (%) column to something that is easier to reference in R code. This allows one to use multiple forms of the extract operator to access the column.
colnames(stockReturns) <- c("symbol","return_pct")
sprintf("%.1f",stockReturns[["return_pct"]])
sprintf("%.1f",stockReturns$return_pct)
sprintf("%.1f",stockReturns[,"return_pct"])
...and the output:
> colnames(stockReturns) <- c("symbol","return_pct")
> sprintf("%.1f",stockReturns[["return_pct"]])
[1] "23.1" "30.5" "11.8" "14.0"
> sprintf("%.1f",stockReturns$return_pct)
[1] "23.1" "30.5" "11.8" "14.0"
> sprintf("%.1f",stockReturns[,"return_pct"])
[1] "23.1" "30.5" "11.8" "14.0"
>
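As an alternative to renaming, read.csv() accepts check.names = FALSE, which keeps the header exactly as written in the file. A minimal sketch reusing the sample data from this answer; whether keeping a name with spaces and symbols is convenient for the rest of your code is a judgment call:

```r
textFile <- "symbol,Return (%)
AAPL,23.1
GOOG,30.5
IBM,11.8
MSFT,14"

# check.names = FALSE stops R from converting the header to a syntactic name
stockReturns <- read.csv(text = textFile, check.names = FALSE)
names(stockReturns)

# The original name now works with the [[ form of the extract operator
formatted <- sprintf("%.1f", stockReturns[["Return (%)"]])
formatted
```

With a non-syntactic name like this, the $ form needs backticks: stockReturns$`Return (%)`.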
Note: this answer references content from my blog article, Forms of the Extract Operator.

Inner join is not working between two overlapping data frames in R

I have two large tables, both of which have a matching column that looks like this:
> head(introns2$Name)
[1] "chr1:12058:12178" "chr1:12228:12612" "chr1:12698:12974" "chr1:12722:13220"
[5] "chr1:13053:13220" "chr1:13375:13452"
> head(sqtl2$cluster_pos)
[1] "chr1:259025:261550" "chr1:804222:807217" "chr1:804222:807217"
[4] "chr1:804222:807217" "chr1:804222:807217" "chr1:804222:807217"
Whenever I run the following command:
combined <- inner_join(sqtl2, introns2, by=c("cluster_pos"="Name"))
I get a combined table with 0 rows. So far, I have made sure that both columns are of identical type by setting introns2$Name to char type like so: introns2$Name <- sapply(introns2$Name, as.character), and I have tried using a non-dplyr-based way of doing this same thing: combined <- merge(x=sqtl2,y=introns3,by.x="cluster_pos", by.y="Name")
I am assuming that there are overlapping hits between these two tables, since they come from the same source and are each enormous in size:
> nrow(introns2)
[1] 357746
> nrow(sqtl2)
[1] 1537363
Is there anything that I am overlooking? Again, I just want to join the two tables together per row on the basis of matches found in these columns.
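One way to test the assumption that the keys overlap is to count exact matches before joining. A minimal sketch with hypothetical toy values standing in for the two tables; the trailing space in the toy introns2 key is an invented example of the kind of invisible difference that yields a zero-row join:

```r
# Toy stand-ins for the two tables (hypothetical values)
introns2 <- data.frame(Name = c("chr1:12058:12178", "chr1:804222:807217 "),
                       stringsAsFactors = FALSE)
sqtl2 <- data.frame(cluster_pos = c("chr1:259025:261550", "chr1:804222:807217"),
                    stringsAsFactors = FALSE)

# How many keys in sqtl2 have an exact match in introns2?
exactMatches <- sum(sqtl2$cluster_pos %in% introns2$Name)
exactMatches

# Count again after trimming leading and trailing whitespace from both keys
trimmedMatches <- sum(trimws(sqtl2$cluster_pos) %in% trimws(introns2$Name))
trimmedMatches
```

If the exact count is zero on the real data, comparing nchar() of keys that look alike can show whether hidden characters are involved.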

R is changing my variable value by itself

I have a dataframe that has an id field with values as these two:
587739706883375310
587739706883375408
The problem is that, when I ask R to show these two numbers, the output that I get is the following:
587739706883375360
587739706883375360
These are not the real values of my ID field. How do I solve that?
For your information: I have executed options(scipen = 999) so that R does not convert my numbers to scientific notation.
The problem also happens in the R console: if I enter these example numbers, I get the same output as shown above.
EDIT: someone asked
dput(yourdata$id)
I did that and the result was:
c(587739706883375360, 587739706883375360, 587739706883375488, 587739706883506560, 587739706883637632, 587739706883637632, 587739706883703040)
To compare, the original data in the csv file is:
587739706883375310,587739706883375408,587739706883375450,587739706883506509,587739706883637600,587739706883637629,587739706883703070
I also did the following test with one of these numbers:
> 587739706883375408
[1] 587739706883375360
> as.double(587739706883375408)
[1] 587739706883375360
> class(as.double(587739706883375408))
[1] "numeric"
> is.double(as.double(587739706883375408))
[1] TRUE
You can use the bit64 package to represent such large numbers:
library(bit64)
as.integer64("587739706883375408")
# integer64
# [1] 587739706883375408
as.integer64("587739706883375408") + 1
# integer64
# [1] 587739706883375409
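The values change because R stores these numbers as doubles, which cannot represent every integer above 2^53. If the IDs are only labels and no arithmetic is needed, another option is to read them as character strings. A minimal sketch using base R only; csvText is a hypothetical stand-in for the CSV file:

```r
# Doubles cannot tell these two neighbouring integers apart
2^53 == 2^53 + 1  # TRUE

# Hypothetical stand-in for the CSV file from the question
csvText <- "id
587739706883375310
587739706883375408"

# colClasses = "character" keeps every digit of the id column
ids <- read.csv(text = csvText, colClasses = "character")
ids$id
ids$id[1] == ids$id[2]  # FALSE: the two IDs stay distinct
```

Character IDs still work as join keys; use bit64 as shown above when you need to do arithmetic on them.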

Need to count character

I have a dataframe LoopVariable and the following couple of lines of code:
print(unique(LoopVariable[,"Job..R"]))
[1] "14047/2" "18331/3"
My output is two character strings, and that is fine. My question is: how can I count the values in this output for use in further calculations? In other words, I have two character values and I need their count as an integer. In my example here the integer value would be 2.
Use the length() function for this. You can find more about the function by typing ?length into your console.
This is likely what you should expect:
length(unique(LoopVariable[,"Job..R"]))
[1] 2
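To use the count in later calculations, assign it to a variable. A minimal sketch with a toy LoopVariable data frame; the values and the name nJobs are hypothetical:

```r
# Toy stand-in for LoopVariable (hypothetical values, one duplicate)
LoopVariable <- data.frame(Job..R = c("14047/2", "18331/3", "14047/2"),
                           stringsAsFactors = FALSE)

# Number of distinct values, stored for later use
nJobs <- length(unique(LoopVariable[, "Job..R"]))
nJobs         # 2
class(nJobs)  # "integer"

# The count can now be used like any other number
nJobs * 10    # 20
```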

Compute Column in R

What is the difference between the two statements below? They render different outcomes, and since I am coming to R from SPSS, I am a little confused.
ds$share.all <- ds[132]/ ds[3]
mean(ds$share.all, na.rm=T)
and
ds$share.all2 <- ds$col1/ ds$Ncol2
mean(ds$share.all2, na.rm=T)
They render the same mean, but for the first, the output is printed as
col1
0.02669424
and the second only prints the .02xxxxx.
Any help will be much appreciated.
Selecting a column of a data frame with single brackets (your first example) produces a data frame containing just that column, whereas the $ operator (your second example) returns a plain vector. When R prints an object that has names, it prints the names too (the col1 in your first example). The data frame you get with ds[132] has a names attribute, but the vector you get with ds$col1 does not. To get the equivalent of ds$col1 by position, use double instead of single brackets: ds[[132]]. For example:
> x<-data.frame(1:10)
> names(x)<-"var"
> class(x$var)
[1] "integer"
> class(x[1])
[1] "data.frame"
> identical(x[1],x$var)
[1] FALSE
> identical(x[[1]],x$var)
[1] TRUE
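The same distinction carries over to the division in the question. A minimal sketch with a small hypothetical ds; the column names col1 and Ncol2 are taken from the question, the values are invented:

```r
# Hypothetical stand-in for ds
ds <- data.frame(col1 = c(1, 2, 3), Ncol2 = c(10, 20, 30))

byBracket <- ds[1] / ds[2]        # one-column data frame
byDollar  <- ds$col1 / ds$Ncol2   # plain numeric vector

class(byBracket)  # "data.frame"
names(byBracket)  # "col1", the label that shows up when printing
class(byDollar)   # "numeric"

# The numbers themselves are the same
identical(byBracket[[1]], byDollar)  # TRUE
```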
