R plot that shows what percentage of users wrote what percentage of postings

Hi everyone, I have a question about plotting in R.
I need a line plot that shows what percentage of users wrote what percentage of postings.
An example would be: 25% of users wrote 80% of the postings.
Dput output:
data
I read the data into R from a CSV file and attached it, with headers.
Now when I try to plot it with:
plot(UserPc, PostingsPc, ylab = "Users", xlab = "Postings", type = "l")
the plot is just a black square. Help!

UserPc and PostingsPc contain "," and "%", so read.csv interprets them as strings (which it reads as factors) rather than numbers. You'll see this if you run str(MyData). If you want to plot them, you need to convert them into numbers, which, looking at your data, requires replacing "," with "." and removing the "%". gsub is a useful function for this, and it's convenient to make the whole operation its own function. Something like this:
MyData <- read.csv(file = "data.csv", header = TRUE, sep = ";", stringsAsFactors = FALSE)

# Function that removes all "%" from a string, converts "," to ".", and returns a numeric.
# Divide by 100 because it's a percentage.
convert <- function(stringpct) {
  as.numeric(gsub("%", "", gsub(",", ".", stringpct))) / 100
}

MyData$UserPc <- convert(MyData$UserPc)
MyData$PostingsPc <- convert(MyData$PostingsPc)
attach(MyData)
plot(UserPc, PostingsPc, ylab = "Users", xlab = "Postings", type = "l")
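As a quick sanity check (the sample value "80,5%" is just an assumption about how your percentages look), you can test the helper and re-inspect the column types:
convert("80,5%")
# [1] 0.805
str(MyData)  # UserPc and PostingsPc should now be numeric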

Related

Outputting an R dataframe to a .txt file - Align positive and negative values

I am trying to output a dataframe in R to a .txt file. I want the .txt file to ultimately mirror the dataframe output, with columns and rows all aligned. I found this post on SO which mostly gave me the desired output with the following (now modified) code:
gene_names_only <- select(deseq2_hits_table_df, Gene, L2F)
colnames(gene_names_only) <- c()
capture.output(
  print.data.frame(gene_names_only, row.names = F, col.names = F, print.gap = 0, quote = F, right = F),
  file = "all_samples_comparison_gene_list.txt"
)
The resultant output, however, does not align negative and positive values.
I ultimately want positive and negative values to be properly aligned with one another: for -0.00012 and 4.00046, the '-' of the first number should sit in the same column as the '4' of the second. How could I accomplish this?
Two other questions:
The output file has a blank line at the beginning of the output. How can I change this?
The output file also seems to put far more spaces between the left column and the right column than I would want. Is there any way I can change this?
Maybe try a finer-grained treatment of the printing using sprintf, with a different format string for positive and negative numbers, e.g.:
> df = data.frame(x=c('PICALM','Luc','SEC22B'),y=c(-2.261085123,-2.235376098,2.227728912))
> sprintf('%15-s%.6f',df$x[1],df$y[1])
[1] "PICALM -2.261085"
> sprintf('%15-s%.6f',df$x[2],df$y[2])
[1] "Luc -2.235376"
> sprintf('%15-s%.7f',df$x[3],df$y[3])
[1] "SEC22B 2.2277289"
EDIT:
I don't think that write.table or similar functions accept custom format strings, so one option could be to create a data frame of formatted strings and then use write.table or writeLines to write to a file, e.g.
dfstr = data.frame(x = sprintf('%15-s', df$x),
                   y = sprintf(paste0('%.', 7 - 1*(df$y < 0), 'f'), df$y))
(The format string for y here is essentially what I previously proposed.) Next, write dfstr directly:
write.table(x = dfstr, file = 'filename.txt',
            quote = F, row.names = F, col.names = F)
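To tie this back to the columns in the question, here is a minimal sketch using writeLines instead of write.table; the column names Gene and L2F come from the question, while the 20-character name width is an assumption:
# Pad the gene names to a fixed width, give positive values one extra decimal so the
# sign column lines up, and write without a header line or extra separator.
out <- paste0(
  sprintf("%-20s", as.character(gene_names_only$Gene)),
  sprintf(paste0("%.", 7 - 1 * (gene_names_only$L2F < 0), "f"), gene_names_only$L2F)
)
writeLines(out, "all_samples_comparison_gene_list.txt")
This also avoids the leading blank line and lets you control the gap between the two columns directly.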

How to use fread or read_delim in R on a file with no line breaks

I have several .txt files which need to be imported to R as data frames for some data analysis. One of these files has no EOL in any form, so I'm left wondering how I would go about importing it.
\"A\";\"B\";\"C\";\"D\";\"D\";\"E\";\"F\";\"G\";\"H\";\"I\";\"J\";\"K\";\"L\";\"M\";\"N\";\"O\";\"P\";\"Q\";\"R\";\"S\";\"T\";\"U\";\"V\"\"1\";4;\"55-555-5555-555\";1234-56-78;\"111\";1510;5;1234-12-17;12345.1234512345;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA\"2\";6;\"22-222-2222-222\";5678-56-78;\"222\";2051;0;NA;0;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA
This is what the first ~500 characters of that .txt file look like. The EOL would need to be placed like this:
\"A\";\"B\";\"C\";\"D\";\"D\";\"E\";\"F\";\"G\";\"H\";\"I\";\"J\";\"K\";\"L\";\"M\";\"N\";\"O\";\"P\";\"Q\";\"R\";\"S\";\"T\";\"U\";\"V\"
\"1\";4;\"55-555-5555-555\";1234-56-78;\"111\";1510;5;1234-12-17;12345.1234512345;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA
\"2\";6;\"22-222-2222-222\";5678-56-78;\"222\";2051;0;NA;0;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA
Normally I would just gsub a "\n" into the places where I need it, but there is no recurring string at the positions where I would place a \n, so I don't think gsub would work in this instance.
Seeing how the missing values are clearly indicated with NA, is there a function similar to read_delim that has a "col_number = x" argument, so that the first x values are the headers, the next x values are the values of the first row, and so on?
If it changes anything, these .txt files are rather big (>300 MB).
Big thank you to Julian_Hn. Works like a charm.
I would probably just read this in as a vector and then reshape it as a matrix with the number of columns you know are in the dataset. This essentially does what you want:
str <- "\"A\";\"B\";\"C\";\"D\";\"D\";\"E\";\"F\";\"G\";\"H\";\"I\";\"J\";\"K\";\"L\";\"M\";\"N\";\"O\";\"P\";\"Q\";\"R\";\"S\";\"T\";\"U\";\"V\";\"1\";4;\"55-555-5555-555\";1234-56-78;\"111\";1510;5;1234-12-17;12345.1234512345;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;\"2\";6;\"22-222-2222-222\";5678-56-78;\"222\";2051;0;NA;0;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA"
vec <- strsplit(str, ";")[[1]]
# EDIT: added byrow = TRUE to stay in the right format. Thanks Yuriy
table <- matrix(vec, ncol = 23, nrow = 3, byrow = TRUE)
df <- as.data.frame(table)
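For the real (>300 MB) file, here is a minimal sketch of the same idea; it assumes the file is called data.txt, that there really is a ";" between every field (including between the header block and the first data value, as in the string above), and that there are 23 columns:
raw <- readChar("data.txt", file.size("data.txt"))   # read the whole file as one string
vec <- strsplit(raw, ";")[[1]]
n_col <- 23
headers <- gsub('"', "", vec[seq_len(n_col)])        # the first 23 fields are the column names
body <- matrix(vec[-seq_len(n_col)], ncol = n_col, byrow = TRUE)
df <- as.data.frame(body, stringsAsFactors = FALSE)
names(df) <- headers
df[] <- lapply(df, function(col) type.convert(gsub('"', "", col), as.is = TRUE))  # strip quotes, guess types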

Getting subscripts from Excel into R

I just started learning R, but I already have my first problem. I want to display my data in a graph. My data is in an Excel sheet converted to a .csv file. But I have some chemical formulas like Fe2O3 in my data, and in the .csv all subscripts are gone. That doesn't look very nice. Is there any way to get the subscripts from the original Excel file into R?
I would really appreciate your help :)
Edit: My data contains 6 chemical formulas displayed on the x-axis, which all contain subscripts (e.g. Fe2O3, ZnCl2, CO2, ...), and numeric values displayed on the y-axis. The graph is a bar chart. I am not sure if there is a way to either change the numbers to subscripts in R or keep them prior to the import.
Currently the formulas appear on the x-axis as plain text, but I would like to have the numbers as subscripts.
I don't know of a way to bring the formatting from Excel into a CSV and then into R, unless you can make those subscripts using Unicode (see: UTF8 symbols for subscript letters).
Given that your list of chemicals is short, it's not much work to tweak the chemical names to help ggplot interpret them with subscripts. You'll want brackets around the numbers, plus tildes afterwards if there are more elements to include. Then we also tell scale_x_discrete to "parse" the labels and convert those symbols to formatting.
library(ggplot2)
library(tibble)

set.seed(42)
chem_df <- tibble(
  Chemicals =
    c("AgNO3", "Al2SiO5", "CO2", "Fe2O3", "FeSO4", "ZnCl2"),
  Chemicals_parsed =
    c("AgNO[3]", "Al[2]~SiO[5]", "CO[2]", "Fe[2]~O[3]", "FeSO[4]", "ZnCl[2]"),
  Mean = rnorm(6, 50, 30))

ggplot(chem_df, aes(x = Chemicals_parsed, Mean)) + geom_col() +
  scale_x_discrete(name = "Chemicals",
                   labels = parse(text = chem_df$Chemicals_parsed))
To add to the excellent answer of @JonSpring, you can write a function that converts strings like "Al2SiO5" to strings like "Al[2]~SiO[5]", so you don't have to do all the conversions manually:
library(stringr)

chem.form <- function(s) {
  s <- str_replace_all(s, "([0-9]+)", "[\\1]~")
  if (endsWith(s, "~")) s <- substr(s, 1, nchar(s) - 1)
  s
}
Chemicals <- c("AgNO3", "Al2SiO5", "CO2", "Fe2O3", "FeSO4", "ZnCl2")
Chemicals_parsed <- as.vector(sapply(Chemicals, chem.form))
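A quick check of what the helper produces (both results follow directly from the regex above):
chem.form("Al2SiO5")
# [1] "Al[2]~SiO[5]"
chem.form("AgNO3")
# [1] "AgNO[3]"
The resulting Chemicals_parsed vector can then be passed to labels = parse(text = ...) in scale_x_discrete, exactly as in the previous answer.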

Replace a string with the date value from the row above

I have a dataset with about 200 \N values, and I'd like to replace each \N with the date/day value from the row above. For example, for row 641 I want to change the date to 10-Nov-14 and the day to Mon.
If this can be done in R, does the format of the date matter? Currently, these dates are stored as factors.
Easy in Excel. Replace all \N with nothing, select the two relevant columns, then HOME > Editing, Find & Select, Go To Special..., Blanks. Type =, press the up arrow (↑), and press Ctrl+Enter.
Then copy the range again and paste over the top with HOME > Clipboard, Paste, Paste Special..., Values, OK.
If that's an Excel file that you are importing into R, then you need to understand how R handles the backslash character (which is what is showing in the screenshot), since it is used to "escape" characters. See ?Quotes. Once the data is in R, these will probably all be factor columns.
If the dataframe is named 'dat' then this should work to really make true missing values:
is.na(dat) <- dat == "\\N"  # need to escape the escape character
Then use na.locf from package zoo:
library(zoo) # lots of useful methods in zoo.
dat$date <- na.locf(dat$date)
dat$day_of_week <- na.locf(dat$day_of_week)
These methods should work with any class of column; note the values will not be R Date-classed variables until you make the conversion.
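If you then want real Date values, something like the following should work; the column name and the "10-Nov-14" format are taken from the question, and %b assumes an English locale:
dat$date <- as.Date(as.character(dat$date), format = "%d-%b-%y")  # "10-Nov-14" -> "2014-11-10"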
It can be solved easily in R with the following code:
ListNa <- grep("\\N", data$date, fixed = TRUE)  # fixed = TRUE so the literal "\N" is matched
ListPrevRow <- ListNa - 1
data[ListNa, c("date", "day")] <- data[ListPrevRow, c("date", "day")]
Where:
"data" is the data table
"date" and "day" is the columns to be replaced.
This can be done rapidly in Excel with a quick use of Find and FindNext. Assuming you want to replace all of the \N on the ActiveSheet, this code will terminate once it has replaced them all. Offset(-1) gets the value one row up.
Sub ReplaceWithValueAbove()
    Dim rng_search As Range
    Set rng_search = ActiveSheet.UsedRange.Find("\N")
    While Not rng_search Is Nothing
        'set to row above
        rng_search = rng_search.Offset(-1)
        'find the next one
        Set rng_search = ActiveSheet.UsedRange.FindNext()
    Wend
End Sub
Before and After pictures

How to make write.table retain dimnames?

I need to save a number of tables in a single CSV file and am having difficulty seeing how to retain dimension names. I searched SO and the closest I found was:
How to get dimnames in xtable.table output?
The problem he has with xtable is the problem I've got with write.table: dimnames exist in the table (and in prop.table and ftable as well, if I use those) but get dropped by write.table. I'm using write.table rather than write.csv because I need append=TRUE.
The dataset is from a survey and the aim is to create the complete set of crosstabs, with labelled axes. In this case, actual row/column labels are not important, only dimension labels. I'm new to R, so hope I haven't missed something obvious.
d <- read.csv('dataset.csv')  # dataset with column headings, no row labels
cat('BEGIN\n', file = 'xtabs.csv')
for (i in 1:ncol(d)) {
  for (j in 1:ncol(d)) {
    cat(paste('\ni=', i, ' j=', j, '\n'), file = 'xtabs.csv', append = T)
    t <- table(d[, i], d[, j], dnn = c(names(d[i]), names(d[j])))
    pt <- prop.table(t, 1)
    write.table(pt, 'xtabs.csv', sep = ',', dec = '.', row.names = F, col.names = F, append = T)
    print(pt)  # shows dimnames in the console as expected
  }
}
Try this:
tbl <- with(warpbreaks, table(wool, tension))
pt <- prop.table(tbl)
write.ftable(ftable(pt), file = "~/Desktop/table.csv", sep = ",",
             quote = FALSE)
I'm possibly abusing ftables here, which are intended for multi-dimensional tabular data (i.e. more than two variables). But it's the only thing I've found that will write the table to a text file with (seemingly) the formatting you want.
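To fold this into the appending loop from the question, one option is to open a single connection and pass it as file= (write.ftable writes through cat, so a connection should be accepted); this is an untested sketch using the object names from the question:
con <- file("xtabs.csv", open = "wt")
cat("BEGIN\n", file = con)
for (i in 1:ncol(d)) {
  for (j in 1:ncol(d)) {
    cat(sprintf("\ni= %d  j= %d\n", i, j), file = con)
    tab <- table(d[, i], d[, j], dnn = c(names(d)[i], names(d)[j]))
    write.ftable(ftable(prop.table(tab, 1)), file = con, sep = ",", quote = FALSE)
  }
}
close(con)
Because everything goes through the one open connection, each write simply continues where the previous one stopped, so no append argument is needed.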