Create Scatterplot in R from CSV

I have the following CSV document (3 columns, 1 header line, 10 data lines):
Artikelname,Anzahl Zeichen,Anzahl Fehler
Sport/Ski alpin,5459,42
People,2302,20
Nationale Politik,4012,43
Reportage ueber die Lebensmittelindustrie,11202,101
Wirtschaft,3192,22
Interview,2989,21
Sport/Tennis,1509,14
Filmkritik,2498,65
Regionalpolitik,3987,32
Mali-Reportage,10782,91
I now wish to plot the second column on the x-axis and the third column on the y-axis of a scatterplot in R.
getwd()
file <- "[filepath]"
data <- read.csv(file, skip = 1) # skip header line
data # print the data into the console
plot(data[2],data[3])
I believe it is because my data is not of the correct type yet, but I don't have any idea how to fix this.

First of all, remove skip = 1. With skip = 1 your code skips the real header line, so read.csv uses the first line of data as the column names instead of as data. After reading the data correctly, you want
plot(data[,2],data[,3])
or perhaps more simply
plot(data[,2:3])
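For completeness, here is a minimal end-to-end sketch; the file name artikel.csv and the axis labels are assumptions for illustration:
data <- read.csv("artikel.csv") # header = TRUE is the default, so the first line becomes the column names
plot(data[, 2], data[, 3],
     xlab = "Anzahl Zeichen", # number of characters
     ylab = "Anzahl Fehler")  # number of errors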

Related

How to get output to show more than the first 10 rows with sink or crossing function

I am attempting to get a Cartesian combination of 3 data frames in RStudio and then have that output written to a file with sink. I used the crossing function to get the combinations and the sink function to have the output added to a file. The code works and provides the Cartesian product of the 3 data sets (verified in the Environment tab), but the output only shows the first 10 lines. I have tried options(max.print=999999), to no avail. This is what it says after the first 10 lines: # … with 61,992,438 more rows. I am trying to have it show all rows so that all rows are written to the file. I've looked around Stack Overflow and haven't found any answers to this. My code (yes, it's very messy as of now) is attached below:
library(tidyr) # Loads in the crossing and expand functions
Q14fb=read.csv("q14fb.csv") # Reads in the data sets
Q13e=read.csv("Q13e.csv")
Q14e=read.csv("Q14e.csv")
sink(file = "Q3combooutput.txt", append = TRUE, split=TRUE) # Creates a file
x=data.frame(x=c(Q13e)) # Codes in the data
y=data.frame(y=c(Q14e))
z=data.frame(z=c(Q14fb))
result=crossing (x,y,z) # Provides all possible combinations of the data sets
result # Provides the output
sink(append=TRUE) # Adds the output to the file
The problem
sink() only diverts what is printed to the R console to the file. Since result was created by tidyr::crossing(), it is a tibble, and a tibble prints only its first 10 rows by default.
Some solutions
You can print all rows of a tibble by wrapping it in print(result, n = Inf). However, you are probably better off using one of the standard ways of exporting a file from R, such as write.csv() to export a CSV file.
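A sketch of both options, assuming result is the tibble produced by crossing() above; note that printing tens of millions of rows as text will be slow and produce a very large file, so the write.csv() route is usually preferable:
sink("Q3combooutput.txt", split = TRUE)
print(result, n = Inf) # print every row, not just the first 10
sink() # end the diversion
write.csv(result, "Q3combooutput.csv", row.names = FALSE) # direct export, no sink needed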

How to import a CSV with a last empty column into R?

I wrote an R script to perform some scientometric analyses of Journal Citation Reports (JCR) data, which I have been using and updating over the past years.
Today, Clarivate introduced some changes in its database, and the exported CSV file now contains one last empty column, which breaks my script. Because of this last empty column, read.csv automatically assumes that the first column contains the row names.
As before, there is also one useless first row, which my script removes automatically with skip = 1.
One simple solution to this "empty column situation" would be to manually remove this last column in Excel, and then proceed with my script as usual.
However, is there a way to add this removal to my script using base R?
The beginning of my script is:
jcreco = read.csv("data/jcr ecology 2020.csv",
                  na = "n/a", skip = 1, header = T)
The original CSV file downloaded from JCR is available in my Dropbox.
Could you please help me? Thank you!
The real problem is that the empty column doesn't have a header. If they had only put the extra comma at the end of the header line as well, this probably wouldn't be as messy. But you can also do a bit of column shuffling with fill=TRUE. For example
dd <- read.table("~/../Downloads/jcr ecology 2020.csv", sep=",",
                 skip=2, fill=TRUE, header=TRUE, row.names=NULL)
names(dd)[-ncol(dd)] <- names(dd)[-1]
dd <- dd[,-ncol(dd)]
This reads in the data but puts the row names in the data.frame and fills the last column with NA. Then you shift all the column names one position to the left and drop the last column.
Here is a way.
Read the data as text lines;
Discard the first line;
Remove the end comma with sub;
Create a text connection;
And read in the data from the connection.
The variable fl holds the file name; on my disk I had to set the working directory first.
fl <- "jcr_ecology_2020.csv"
txt <- readLines(fl)
txt <- txt[-1]
txt <- sub(",$", "", txt)
con <- textConnection(txt)
df1 <- read.csv(con)
close(con)
head(df1)
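Since read.csv() also accepts a text= argument directly (it opens the text connection internally), the same steps can be condensed into a two-line variant of the above:
txt <- readLines("jcr_ecology_2020.csv")
df1 <- read.csv(text = sub(",$", "", txt[-1])) # drop line 1, strip the trailing comma, parse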

Processing data from text file in R

So, I have been trying to read a text file (each line is a chat log) into R, to turn it into a data frame and further tidy the data.
I am using readLines so I can have each log as a single line. Because readLines reads each log as one long character string, I then convert the result to a data frame of strings (I need to parse the logs).
I loaded the text file as a data frame as shown below.
rawchat <- readLines("disc-W-App-avec-loy.txt")
rawchat <- as.data.frame(rawchat, stringsAsFactors=FALSE)
names(rawchat) <- "chat"
I am currently trying to identify any of the 42,000 rows that starts with the number 1. I can't seem to apply the startsWith() function or the dplyr starts_with() correctly, or even grepl with regular expressions.
Could it be the format of the observations of the data frame (chr)?
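For reference, on a character column like this either of the following typically works; the sketch assumes the rawchat data frame built above:
starts_with_1 <- grepl("^1", rawchat$chat) # regex: ^ anchors the match to the start of each string
starts_with_1 <- startsWith(rawchat$chat, "1") # the same test without a regular expression
rawchat[starts_with_1, , drop = FALSE] # keep only the matching rows
Note that dplyr's starts_with() would not apply here: it is a tidyselect helper for picking columns by name, not for matching row values.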

Difficulties with understanding read.csv code

I'm improving my R skills by rebuilding some of the amazing stuff they do on R-bloggers. Right now I'm trying to reproduce this:
http://wiekvoet.blogspot.nl/2015/06/deaths-in-netherlands-by-cause-and-age.html. The relevant dataset for this exercise can be found here:
http://statline.cbs.nl/Statweb/publication/?VW=D&DM=SLNL&PA=7052_95&D1=0-1%2c7%2c30-31%2c34%2c38%2c42%2c49%2c56%2c62-63%2c66%2c69-71%2c75%2c79%2c92&D2=0&D3=0&D4=0%2c10%2c20%2c30%2c40%2c50%2c60%2c63-64&HD=150710-0924&HDR=G1%2cG2%2cG3&STB=T
If I'm diving into the code (to be found at the bottom of the first link) and am running into this piece of code:
r1 <- read.csv(sep=';', header=FALSE,
               col.names=c('Causes','Causes2','Age','year','aantal','count'),
               na.strings='-', text=txtlines[3:length(txtlines)]) %>%
  select(., -aantal, -Causes2)
Could anybody help me separate the steps that are taken here?
Here is an explanation of what each line in the call to read.csv() is doing in your example. Note that the assignment of the last parameter, text, is complicated and depends on the script from the link you gave above. At a high level, the script first reads in all lines from the file "Overledenen__doodsoo_170615161506.csv" which contain the string "Centraal", using only the third to final lines from that filtered set. There is an additional step applied to these lines as well.
r1 <- read.csv(
  # columns separated by semi-colon
  sep=';',
  # first row is data (i.e. is NOT a header)
  header=FALSE,
  # names of the six columns
  col.names=c('Causes','Causes2','Age','year','aantal','count'),
  # treat hyphen as NA
  na.strings='-',
  # read from the third line to the final line of the original input
  # Overledenen__doodsoo_170615161506.csv, after some
  # filtering has been applied
  text=txtlines[3:length(txtlines)]) %>% select(., -aantal, -Causes2)
read.csv reads the CSV file, separating columns at the separator ";",
so an input like a;b;c is split into: first column = a, second = b, third = c.
header=FALSE specifies that the original file has no header row.
col.names assigns the listed names to your columns in R.
na.strings='-' treats the string '-' as NA.
text=txtlines[3:length(txtlines)] reads the lines from position 3 to the end.
%>% select(.,-aantal,-Causes2) drops the columns aantal and Causes2 from the data frame.
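To make the mechanics concrete, here is a tiny self-contained demo of the same arguments on made-up lines (the values are invented purely for illustration):
library(dplyr)
txtlines <- c("junk line 1", "junk line 2",
              "Accidents;X1;80 jaar;2010;-;12",
              "Cancer;X2;60 jaar;2010;-;34")
r1 <- read.csv(sep=';', header=FALSE,
               col.names=c('Causes','Causes2','Age','year','aantal','count'),
               na.strings='-', text=txtlines[3:length(txtlines)]) %>%
  select(., -aantal, -Causes2)
r1 # aantal was read as NA (from '-') and then dropped together with Causes2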

How to convert rows

I have uploaded a data set called "Obtained Dataset". It usually has 16 rows of numeric and character variables; some other files of a similar nature have fewer than 16. Each variable is the header of the data, which starts from the 17th row onwards in this specific file.
Obtained dataset & Required Dataset
In the data that follows, the 1st column is the x-axis, the 2nd column is the y-axis and the 3rd column is depth (these are standard for all the files in the database); the 4th column is GR 1 LIN, the 5th column is CAL 1 LIN, and so forth, as given in the first 16 rows of the data.
Now I want R code which can convert it into the format shown in the required data set. Also, if a different data set has, say, fewer than 16 lines of names (say GR 1 LIN and RHOB 1 LIN are missing), I want it to still create a column with NA entries for all rows.
Currently I export the file to Excel, manually clean the data, rename the columns correspondingly, save it as CSV and then read.csv("filename"), etc., but it is simply not feasible to do this for 400 files.
Any advice how to proceed will be of great help.
I have noticed that you have probably posted this question before, in a different format. This is a public forum, and people are happy to help. However, it's your job to make life simpler for others, and you are asked to put in some effort. Here is some advice on that.
Having said that, here is some code I have written to help you out.
Step0: Creating your first data set:
sink("test.txt") # This will `sink` all the output to the file "test.txt"
# Let's start with some dummy data
cat("1\n")
cat("DOO\n")
cat(c(sample(letters,10),"\n"))
cat(c(sample(letters,10),"\n"))
cat(c(sample(letters,10),"\n"))
cat(c(sample(letters,10),"\n"))
# Now a 10 x 16 dummy data matrix:
cat(paste(apply(matrix(sample(160),10),1,paste,collapse = "\t"),collapse = "\n"))
cat("\n")
sink() # This will stop `sink`ing.
I have created some dummy data in the first 6 lines, followed by a 10 x 16 data matrix.
Note: In principle you should have provided something like this, or a copy of your dataset. This would help other people help you.
Step1: Now we need to read the file, and we want to skip the first 6 rows with undesired info:
(temp <- read.table(file="test.txt", sep ="\t", skip = 6))
Step2: Data clean up:
We need a vector with names of the 16 columns in our data:
namesVec <- letters[1:16]
Now we assign these names to our data.frame:
names(temp) <- namesVec
temp
Looks good!
Step3: Save the data:
write.table(temp,file="test-clean.txt",row.names = FALSE,sep = "\t",quote = FALSE)
Check if the solution is working. If it is, then move to the next step; otherwise make the necessary changes.
Step4: Automating:
First we need to create a list of all the 400 files.
The easiest way (also easiest to explain) is to copy the 400 files into a directory and then set that as the working directory (using setwd).
Now first we'll create a vector with all file names:
fileNameList <- dir()
Once this is done, we'll need a function to repeat steps 1 through 3:
convertFiles <- function(fileName) {
  temp <- read.table(file = fileName, sep = "\t", skip = 6)
  names(temp) <- namesVec
  # write each input to its own "clean-"-prefixed output file
  write.table(temp, file = paste("clean", fileName, sep = "-"),
              row.names = FALSE, sep = "\t", quote = FALSE)
}
Now we simply need to apply this function on all the files we have:
sapply(fileNameList,convertFiles)
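If the directory contains anything besides the 400 data files, it may be safer to restrict the listing by pattern; the .txt extension here is an assumption:
fileNameList <- list.files(pattern = "\\.txt$") # pick up only .txt files; adjust to your actual extension
sapply(fileNameList, convertFiles)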
Hope this helps!
