csv to frequency polygon using R or python


I have a result.csv file that contains information in the following format:
date,tweets
2015-06-15,tweet
2015-06-15,tweet
2015-06-12,tweet
2015-06-11,tweet
2015-06-11,tweet
2015-06-11,tweet
2015-06-08,tweet
2015-06-08,tweet
I want to plot a frequency polygon with the number of entries for each date on the y-axis and the dates on the x-axis.
I have tried the following code:
pf<-read.csv("result.csv")
library(ggplot2)
qplot(datetime, data =pf, geom = "freqpoly")
but it shows the following error:
geom_path: Each group consist of only one observation. Do you need to adjust the group aesthetic?
Can anyone tell me how to solve this problem? I am totally new to R, so any kind of guidance would be a great help.

Your issue is that you are trying to treat datetime as continuous, but it was imported as a factor (discrete/categorical). Let's convert it to a Date object and then things should work:
pf$datetime = as.Date(pf$datetime)
qplot(datetime, data =pf, geom = "freqpoly")

Based on your code, I assume that the result.csv has a header: datetime, atweet. By default, read.csv takes the first line of the CSV file as column names. That means you will be able to access the two columns with pf$datetime and pf$atweet.
If you look at the documentation of read.csv, you will find that stringsAsFactors = default.stringsAsFactors(), which is TRUE in R versions before 4.0.0. That is, the strings from CSV files are converted to factors.
Now, even if you change the value of stringsAsFactors, you still get the same error. That is because ggplot does not know how to order the dates, as it does not recognize the strings as dates.
To transform the strings into actual date-time values, you can use strptime.
Here is the working example:
pf<-read.csv("result.csv", stringsAsFactors=FALSE)
library(ggplot2)
qplot(strptime(pf$datetime, "%Y-%m-%d"), data=pf, geom='freqpoly')
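
Both answers rely on the date column being a real date type rather than a factor or string. As a minimal sketch of that idea, the per-date counts can also be computed explicitly with base R's table() and then drawn as a line. The inline sample data and the column names datetime and atweet are assumptions taken from the answer above, and the ggplot2 part is optional and only runs if the package is installed.

```r
# Sample data typed inline so the sketch is self-contained.
# The header names (datetime, atweet) are an assumption, as in the answer above.
csv_text <- "datetime,atweet
2015-06-15,tweet
2015-06-15,tweet
2015-06-12,tweet
2015-06-11,tweet
2015-06-11,tweet
2015-06-11,tweet
2015-06-08,tweet
2015-06-08,tweet"

pf <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Convert the strings to Date so they are ordered and spaced as dates.
pf$datetime <- as.Date(pf$datetime)

# Count the number of rows per date.
counts <- as.data.frame(table(datetime = pf$datetime), stringsAsFactors = FALSE)
counts$datetime <- as.Date(counts$datetime)
print(counts)

# Optional: draw the counts as a line if ggplot2 is available.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(counts, aes(x = datetime, y = Freq)) +
    geom_line() +
    geom_point()
  # print(p) draws the plot
}
```

Note that this draws only the dates that occur in the file; days with no tweets are not shown as zero counts.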

Related

A cell in a CSV is (wrongly) read as a character vector of length 2 in R

I have a data frame like this one, which I read in from a .csv (or .xlsx, I've tried both), and one of the variables in the data frame is a vector of dates.
Generate the data with this:
Name <- rep("Date", 15)
num <- 1:15
Name <- paste(Name, num, sep = "_")
data1 <- data.frame(
  Name,
  Due.Date = seq(as.Date("2020/09/24", origin = "1900-01-01"),
                 as.Date("2020/10/08", origin = "1900-01-01"), "days")
)
When I reference one of the cells specifically, like this: str(project_dates$Due.Date[241]) it reads the date as normal.
However, the exact position of the important dates varies from project to project, so I wrote a command that identifies where the important dates are in the sheet, like this: str(project_dates[str_detect(project_dates$Name, "Date_17"), "Due.Date"])
This code worked on a few projects, but on the current project it now returns a character vector of length 2. One of the values is the date, and the other value is NA. And to make matters worse, the location of the date and the NA is not fixed across dates--the date is the first value in some cells and the second in others (otherwise I would just reference, e.g., the first item in the vector).
What is going on here, but more importantly, how do I fix this?!
Clarification on the second command:
When I was originally reading from an Excel file, the command was project_dates[str_detect(project_dates$Name, "Date_17"), "Due.Date"]$Due.Date because it was returning a 1x1 tibble, and I needed the value in the tibble.
When I switched to reading in data as a csv, I had to remove the $Due.Date because the command was now reading the value as an atomic vector, so the $ operator was no longer valid.
Help me, Oh Blessed 1's (with) Knowledge! You're my only hope!
Edited to include an image of the data like the one that generates the error

I feel sheepish.
I was able to remove the NAs with:
data1 <- data1[!is.na(data1$Due.Date), ]
I assumed that command would listwise delete the rows with any missing values, so if the cell contained the 2-length vector, then I would lose the whole row of data. Instead, it removed the NA from the cell, leaving only the date.
Thank you to everyone who commented and offered help!
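
One plausible explanation, offered here only as an assumption since the real sheet is not shown, is that the pattern matched two rows rather than one cell holding two values: a data frame cell cannot contain a length-2 vector, but a row filter that matches twice returns two values. The sketch below uses made-up rows and base R's grepl in place of str_detect to show the effect, and why dropping the NA rows first leaves a single date.

```r
# Hypothetical data: the name "Date_17" appears twice, once with a missing date.
project_dates <- data.frame(
  Name = c("Date_1", "Date_17", "Date_17"),
  Due.Date = c("2020-09-24", "2020-10-08", NA),
  stringsAsFactors = FALSE
)

# The filter matches two rows, so the result has length 2 (one date, one NA).
hits <- project_dates[grepl("Date_17", project_dates$Name, fixed = TRUE), "Due.Date"]
print(hits)

# Dropping the rows whose date is NA removes the second match entirely.
clean <- project_dates[!is.na(project_dates$Due.Date), ]
hits_clean <- clean[grepl("Date_17", clean$Name, fixed = TRUE), "Due.Date"]
print(hits_clean)
```

If this is what happened, the NA filter did delete whole rows, just rows that carried no date. Checking sum(grepl(...)) before extracting would show how many rows a pattern matches.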

How to use corrplot with is.corr=FALSE

I previously made a beautiful, functional correlation plot with corrplot (my plot). Now I have to present the underlying data in the same look. So my goal is to have triangular similarity matrices in the same colours as my correlation plot. Imagine it like the conditional formatting in Excel.
My Data: my Data from excel
Link to CSV Data file
It is loaded as a CSV, and R reads the file without problems.
My code: corrplot(Phylogeny, is.corr=FALSE, method="number", cl.lim=c(0,1))
The error it throws: Error in if (any(corr < cl.lim[1]) || any(corr > cl.lim[2])) { : Missing value, where TRUE/FALSE is required
I made sure all columns are numeric.
I made sure to fill the missing bits with NAs (because that was a problem somewhere before).
I made sure all my values are between 0 and 1, which is what I want the limits to be (at one point it told me that my values were not within the limits while I was trying things out).
The error does not change when I change the limits.
The error does not change when I take out is.corr=FALSE (default is TRUE).
I played around with corrplot.mixed and it is still not working.
I have been referencing information from Corrplot Intro.
I have looked into the condformat function, but I am not really sure whether it can fill each cell with one colour according to the overall gradient, like the one I used for my correlation plot.
What am I missing here that stops it from giving me my table back with pretty colours?

I had the same error, but I was able to fix it by converting my data.frame to a matrix. I ended up with corrplot(as.matrix(df), is.corr = FALSE).

If I am understanding correctly, your posted data are already a correlation matrix - although not a fully symmetrical one of the sort that would be produced with the call cor on raw data.
In that case, the problem is just that you have variable names (Species) as a column in your data. Change this column to row names, drop the variable names, and call corrplot as user9536160 suggests:
# packages used below: readr for read_csv, dplyr for %>% and select
library(readr)
library(dplyr)
library(corrplot)
# read in your data
phyl <- as.data.frame(read_csv("Phylogeny.csv"))
# name rows and drop variable names in the df itself
row.names(phyl) <- phyl$Species
phyl <- phyl %>%
  select(-Species)
# call corrplot
corrplot(as.matrix(phyl), is.corr = FALSE)
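
The effect of the name column can be checked without corrplot. The following is a small base R sketch with made-up values (the species names A, B, C and the numbers are hypothetical): converting the whole data frame gives a character matrix, while moving the names into the row names first gives a numeric one, which is what the numeric comparisons in the error message need.

```r
# Hypothetical triangular similarity table with a name column.
phyl <- data.frame(
  Species = c("A", "B", "C"),
  A = c(1, 0.8, 0.3),
  B = c(NA, 1, 0.5),
  C = c(NA, NA, 1),
  stringsAsFactors = FALSE
)

# With the name column included, every cell is coerced to character.
with_names <- as.matrix(phyl)
print(is.numeric(with_names))

# Drop the name column and keep the names as row names instead.
m <- as.matrix(phyl[, -1])
rownames(m) <- phyl$Species
print(is.numeric(m))
print(m)
```

The remaining NA cells are still missing values, so whether corrplot accepts them may depend on the package version and options; that part is not verified here.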

Getting subscripts from Excel into R

I just started learning R, but I already have my first problem. I want to display my data in a graph. My data is in an Excel sheet converted to a .csv file. But I have some chemical formulas like Fe2O3 in my data, and in the .csv all subscripts are gone. That doesn't look very nice. Is there any way to get the subscripts from the original Excel file into R?
I would really appreciate your help :)
Edit: My data contains 6 chemical formulas displayed on the x-axis, which all contain subscripts (e.g. Fe2O3, ZnCl2, CO2, ...), and numeric values displayed on the y-axis. The graph is a bar chart. I am not sure if there is a way to either change the numbers to subscripts in R or keep them prior to the import.
The graph looks like this. But I would like to have the numbers as subscripts:

I don't know that there's a way to bring the formatting from excel into a CSV and then R, unless you can make those subscripts using unicode. UTF8 symbols for subscript letters
Given that your list of chemicals is short, it's not much work to tweak the chemical names to help ggplot interpret them with subscripts. You'll want brackets around the numbers, plus tildes afterwards if there are more elements to include. Then we also tell scale_x_discrete to "parse" the labels and convert those symbols to formatting.
library(ggplot2)
library(tibble)
set.seed(42)
chem_df <- tibble(
  Chemicals =
    c("AgNO3", "Al2SiO5", "CO2", "Fe2O3", "FeSO4", "ZnCl2"),
  Chemicals_parsed =
    c("AgNO[3]", "Al[2]~SiO[5]", "CO[2]", "Fe[2]~O[3]", "FeSO[4]", "ZnCl[2]"),
  Mean = rnorm(6, 50, 30))
ggplot(chem_df, aes(x=Chemicals_parsed, Mean)) + geom_col() +
  scale_x_discrete(name = "Chemicals",
                   labels=parse(text=chem_df$Chemicals_parsed))

To add to the excellent answer of @JonSpring, you can write a function which will convert strings like "Al2SiO5" to strings like "Al[2]~SiO[5]", so you don't have to manually make all the conversions:
library(stringr)
chem.form <- function(s){
  s <- str_replace_all(s,"([0-9]+)","[\\1]~")
  if(endsWith(s,"~")) s <- substr(s,1,nchar(s) - 1)
  s
}
Chemicals <- c("AgNO3", "Al2SiO5", "CO2", "Fe2O3", "FeSO4", "ZnCl2")
Chemicals_parsed <- as.vector(sapply(Chemicals,chem.form))
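
The same conversion can be sketched in base R without stringr, working on the whole vector at once. The function name to_plotmath is hypothetical; it wraps each run of digits in brackets followed by a tilde and then removes a tilde left at the end of the string. It assumes formulas made only of element symbols and digits, as in the question.

```r
# Hypothetical helper: turn "Fe2O3" into the plotmath string "Fe[2]~O[3]".
to_plotmath <- function(x) {
  # Wrap every run of digits in [ ] and add a tilde separator after it.
  out <- gsub("([0-9]+)", "[\\1]~", x)
  # A formula ending in a digit now ends in "~"; strip that trailing tilde.
  sub("~$", "", out)
}

Chemicals <- c("AgNO3", "Al2SiO5", "CO2", "Fe2O3", "FeSO4", "ZnCl2")
Chemicals_parsed <- to_plotmath(Chemicals)
print(Chemicals_parsed)

# Each result should be a valid R expression for use as a parsed axis label.
labels <- parse(text = Chemicals_parsed)
```

Because gsub and sub are vectorised, no sapply is needed, and the result is a plain character vector.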

R: Issues with line graph of germination through time

I'm still in the process of learning R using Swirl and RStudio, and a goal I've set for myself is to recreate this graph. I have a small dataset that I will link below (it's saved as a plain text CSV file that I import into R with headings enabled).
If I try to plot that dataset without changing anything, I get this, which is obviously not the goal.
At first I thought the problem would be in the class of my imported dataset, defined as kt. After class(kt) turned out to be data.frame, I figured that wasn't the problem. Should I be trying to rewrite the table into something that R can plot directly, or should I be trying to extract each species individually, plot them separately and then combine the different plots into one graph? Perhaps there is something wrong with my dates; I know that R handles dates in a specific way. Maybe these solutions are not even needed and I'm just doing something stupidly simple wrong, but I can't find it myself.
Your help is much appreciated.
Dataset:
Species,week 0,week 1,week 2,week 3,week 4,week 5,week 6,week 7,week 8,week 9,week 10,week 11,week 12,week 13,week 14,week 15,week 16,week 17,week 18
Caesalpinia coriaria,0.0%,24.0%,28.0%,28.0%,32.0%,37.0%,40.0%,46.0%,52.0%,56.0%,63.0%,64.0%,68.0%,71.0%,72.0%,,,,
Coccoloba swartzii,0.0%,0.0%,1.0%,10.0%,19.0%,31.0%,33.0%,39.0%,43.0%,48.0%,52.0%,52.0%,52.0%,52.0%,52.0%,52.0%,52.0%,55.0%,
Cordia dentata,0.0%,5.0%,18.0%,21.0%,24.0%,26.0%,27.0%,30.0%,32.0%,32.0%,32.0%,32.0%,32.0%,32.0%,33.0%,33.0%,33.0%,34.0%,35.0%
Guaiacum officinale,0.0%,0.0%,0.0%,0.0%,4.0%,5.0%,5.0%,5.0%,7.0%,8.0%,8.0%,8.0%,8.0%,8.0%,8.0%,8.0%,8.0%,,
Randia aculeata,0.0%,0.0%,0.0%,4.0%,13.0%,14.0%,18.0%,19.0%,21.0%,21.0%,21.0%,21.0%,21.0%,22.0%,22.0%,22.0%,22.0%,,
Schoepfia schreberi,0.0%,0.0%,0.0%,0.0%,0.0%,0.0%,1.0%,4.0%,8.0%,11.0%,13.0%,21.0%,21.0%,24.0%,24.0%,25.0%,27.0%,,
Prosopis juliflora,0.0%,7.5%,31.3%,34.2%,,,,,,,,,,,,,,,

Something like this??
# df is the data frame read from the CSV (called kt in the question)
# get rid of "%" signs
df <- data.frame(sapply(df,function(x)gsub("%","",x,fixed=TRUE)))
# convert cols 2:20 to numeric
df[,2:20] <- sapply(df[,2:20],function(x)as.numeric(as.character(x)))
library(reshape2)
library(ggplot2)
gg <- melt(df,id="Species")
ggplot(gg,aes(x=variable,y=value,color=Species,group=Species)) +
  geom_line()+
  theme_bw()+
  theme(legend.position="bottom", legend.title=element_blank())
There are lots of problems here.
First, if your dataset really has those % signs, then R interprets the data as character and imports it as factors. So first we have to get rid of the % (using gsub(...)), and then we have to convert what's left to numeric. With factors, you have to convert to character first, then numeric, so: as.numeric(as.character(...)). All of this could have been avoided if you exported the data without the % signs!!!
Plotting multiple curves with different colors is something the ggplot package was designed for (among many other things), so we use that. ggplot prefers data in "long" format - all the data in one column, with a second column distinguishing different datasets. Your data is in "wide" format - data in different columns. So we convert to long using melt(...) from the reshape2 package. The result, gg has three columns: Species, variable and value. value contains the actual data and variable contains the week number.
So now we create a ggplot object, setting the x-axis to the variable column, the y-axis to the value column, with color mapped to Species, and we tell ggplot to plot lines (using geom_line(...)).
The rest is to position the legend at the bottom, and turn off some of the ggplot default formatting.
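
The two preparation steps described above (stripping the % signs and going from wide to long) can also be sketched in base R on a tiny made-up subset. The three-week table below is hypothetical and uses the column names read.csv would produce from headers like "week 0" (week.0, week.1, ...). Storing the week as a number rather than a label is an extra choice made here so the x-axis is ordered numerically.

```r
# Hypothetical subset of the wide table, as read.csv would import it.
wide <- data.frame(
  Species = c("Cordia dentata", "Prosopis juliflora"),
  week.0 = c("0.0%", "0.0%"),
  week.1 = c("5.0%", "7.5%"),
  week.2 = c("18.0%", ""),
  stringsAsFactors = FALSE
)

week_cols <- setdiff(names(wide), "Species")

# Strip the percent sign and convert to numeric; empty cells become NA.
wide[week_cols] <- lapply(
  wide[week_cols],
  function(x) as.numeric(sub("%", "", x, fixed = TRUE))
)

# Long format: one row per species and week.
long <- data.frame(
  Species = rep(wide$Species, times = length(week_cols)),
  week = rep(as.integer(sub("week.", "", week_cols, fixed = TRUE)),
             each = nrow(wide)),
  value = unlist(wide[week_cols], use.names = FALSE)
)
print(long)
```

A data frame shaped like long can be passed to ggplot with aes(x = week, y = value, color = Species) in the same way as the melted gg above.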

Reading CSV file in R and formatting dates and time while reading and avoiding missing values marked as?

I am trying to read a CSV file in R. How can I read and format dates and times while reading, and handle missing values that are marked as "?"? The data I load after reading should be clean.
I tried something like
data <- read.csv("Data.txt")
It worked, but the dates and times were as is.
Also, how can I extract a subset of the data for a specific date range?
For this I tried something like
subdata <- subset(data,
Date== 01/02/2007 & Date==02/02/2007,
select = Date:Sub_metering_3)
I get error Error in eval(expr, envir, enclos) : object 'Date' not found
Date is the first column.

The functions read.csv() and read.table() are not set up to do detailed fancy conversion of things like dates that can have many formats. When these functions don't automatically do what's wanted, I find it best to read the data in as text and then convert variables after the fact.
data <- read.csv("Data.txt",colClasses="character",na.strings="?")
data$FixedDate <- as.Date(data$Date,format="%Y/%m/%d")
or whatever your date format is. The variable FixedDate will then be of type Date and you can use equality and other conditions to subset.
Also, in your example code you are putting 01/02/2007 as bare code, which results in dividing 1 by 2 and then by 2007 yielding 0.0002491281, rather than inserting a meaningful date. Consider as.Date("2007-01-02") instead.
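
Putting these points together, here is a minimal sketch of reading, converting, and subsetting by a date range. The inline rows, the Time column, and the day/month/year format are assumptions made for illustration; only the Date and Sub_metering_3 names and the "?" marker come from the question. The question's condition Date == a & Date == b can never be true for a single row, so the sketch uses a range comparison instead.

```r
# Hypothetical sample rows; "?" marks a missing value as in the question.
csv_text <- "Date,Time,Sub_metering_3
31/01/2007,23:59:00,17
01/02/2007,00:00:00,?
02/02/2007,12:30:00,18
03/02/2007,00:00:00,0"

# Read everything as text and turn "?" into NA.
data <- read.csv(text = csv_text, colClasses = "character", na.strings = "?")

# Convert after the fact; the format string assumes day/month/year.
data$FixedDate <- as.Date(data$Date, format = "%d/%m/%Y")
data$Sub_metering_3 <- as.numeric(data$Sub_metering_3)

# Keep the rows from 1 to 2 February 2007, inclusive.
in_range <- data$FixedDate >= as.Date("2007-02-01") &
  data$FixedDate <= as.Date("2007-02-02")
subdata <- data[in_range, ]
print(subdata)
```

With a real file, text = csv_text would be replaced by the file name, and the format string adjusted to whatever the file actually uses.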
