I'm currently facing quite a lot of challenges doing text analysis with R.
I have a table with the columns Date, Text and Likes.
I want to count how often a certain word occurs in the texts of a column (counted at most once per text) and how often it does not.
I want to plot the results as in this picture,
but with dots in different colors for "occurrence" and "no occurrence" of the searched word, aggregated monthly on the y-axis and with likes on the x-axis.
It would be great if you could help me with this challenge.
Update: the sample data is available here: https://drive.google.com/file/d/1IWqDoRFBTL8er8VmvisHDeB5uM3BGgJe/view?usp=sharing
It looks like there are several moving parts here so let me outline the tasks I think you are looking for assistance with:
Determine if a word appears in text, row by row.
Plot this information.
Display the information by category, i.e. word found or not found.
Provide some sort of smoothed fit over the data.
You can accomplish the first task with your choice of pattern-matching function. grepl, for example, takes the pattern as its first argument and returns TRUE or FALSE for each element of the text vector. You may want to look into other parameters, such as ignore.case for case sensitivity, to ensure they match your needs. Assuming you use ggplot, store this result in another column. Then you can pass the data to ggplot and map that column to the col aesthetic to have it separate out the categories for you.
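As a small illustration of the case-sensitivity point, here is a sketch on made-up strings (the vector texts is hypothetical, not the asker's data). grepl returns one TRUE or FALSE per element, so a text that mentions the word several times still counts only once.

```r
# Hypothetical example texts, not the asker's data
texts <- c("Orange juice and orange peel", "an apple a day", "ORANGE")

# Case-sensitive match: only a lowercase "orange" counts
found_strict <- grepl("orange", texts)

# Case-insensitive match: "Orange" and "ORANGE" count as well
found_loose <- grepl("orange", texts, ignore.case = TRUE)

# How many texts contain the word, and how many do not
table(found_loose)
```

Which of the two variants is appropriate depends on how the word is written in the real texts.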
It doesn't appear that your data is readily available from your question. In the future, it generally helps if you can share some sample data. I have made my own sample which should be similar to what you describe. See the example code below.
library(tidyverse)  # attaches ggplot2 and the %>% pipe

set.seed(5)
# 60 days of made-up data: 2021-01-01 through 2021-03-01
data <- data.frame(Date = seq.Date(from = as.Date("2021-01-01"),
                                   to = as.Date("2021-03-01"),
                                   by = "day"),
                   # draw 60 fruits at random instead of recycling 3 values
                   fruit = sample(c("banana", "orange", "apple"), 60, replace = TRUE),
                   likes = runif(60, 100, 1000))
# flag each row by whether the search word appears
data$good_fruit <- ifelse(grepl("orange", data$fruit), "orange", "not orange")
data %>%
  ggplot() +
  geom_point(aes(Date, likes, col = good_fruit)) +
  geom_smooth(aes(Date, likes))
Since I threw together literally random data, there is not much of a pattern here, but I think this illustrates the general idea of what you wanted to show. If you want a more specific kind of aggregation, I would recommend performing that manipulation before passing the data to ggplot, but for a rough fit this should work.
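If monthly aggregation is what you are after, one possible way to do it before plotting is sketched below. It assumes the same made-up data as above and takes the mean of likes per month and category; the column name month and the choice of mean are my assumptions, so swap in sum or a count if that fits better.

```r
set.seed(5)
# Made-up data in the same shape as the sample above (names are assumptions)
data <- data.frame(
  Date  = seq.Date(as.Date("2021-01-01"), as.Date("2021-03-01"), by = "day"),
  fruit = sample(c("banana", "orange", "apple"), 60, replace = TRUE),
  likes = runif(60, 100, 1000)
)
data$good_fruit <- ifelse(grepl("orange", data$fruit), "orange", "not orange")

# Label each row with the first day of its month
data$month <- as.Date(format(data$Date, "%Y-%m-01"))

# Mean likes per month and category
monthly <- aggregate(likes ~ month + good_fruit, data = data, FUN = mean)

# Build a plot with one dot per month and category, if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(monthly, aes(month, likes, col = good_fruit)) +
    geom_point()
  # print(p) draws the plot
}
```

Aggregating first keeps the plotting call simple and makes it easy to inspect the numbers in monthly before drawing anything.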
I am looking at some data downloaded from ICPSR and I am specifically using their R data file (.rda). Beneath the column name of each data file, there are some descriptions of the variables (a.k.a labels). An example is attached as well.
I tried various ways to get the labels, including base::label, Hmisc::label, labelled::var_label, sjlabelled::get_label, etc., but none worked.
Does anyone have ideas on how to extract the labels from this data file?
Thanks very much in advance!
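This could work using purrr:
# load library
library(purrr)
# extract each column's "label" attribute as a character vector;
# columns without a label yield NA instead of raising an error
labels <- map_chr(yourdata, function(x) {
  lab <- attr(x, "label")
  if (is.null(lab)) NA_character_ else as.character(lab)
})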
This worked for me (I am working with ICPSR 35206):
attributes(yourdata)$variable.labels -> labels
Make sure that your attribute referring to the labels is actually called "variable.labels".
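To check what the attribute is called in your file, you can list the attributes attached to the data frame. The sketch below uses a small hypothetical stand-in for an ICPSR data frame (the column names and label texts are invented), so the real file may look different.

```r
# Hypothetical stand-in for an ICPSR data frame; real files may differ
yourdata <- data.frame(V1 = 1:3, V2 = c("a", "b", "c"))
attr(yourdata, "variable.labels") <- c(V1 = "Respondent ID", V2 = "Region code")

# List the attribute names attached to the data frame itself
names(attributes(yourdata))

# Pull the labels if the attribute exists, otherwise get NULL
labels <- attr(yourdata, "variable.labels", exact = TRUE)
labels[["V2"]]
```

If "variable.labels" is not in the list, the labels may be stored per column instead, in which case the column-wise approach is the one to try.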
Hi everyone, I have a question about plotting in R.
I need a line plot that shows what percentage of users wrote what percentage of postings.
An example would be: 25% of users wrote 80% of postings.
Dput output:
data
I read the data into R from a CSV file and attached it with the headers.
Now when I try to plot it with:
plot(UserPc,PostingsPc,ylab = "Users", xlab= "Postings",type="l")
the plot is just a black square. Help!
UserPc and PostingsPc contain "," and "%", so read.csv interprets them as strings (which it reads as factors) rather than numbers. You'll see this if you run str(myData). If you want to plot them, you need to convert them into numbers, which, looking at your data, requires replacing "," with "." and removing the "%". gsub is a useful function for this, and it's convenient to make the whole operation its own function. Something like this:
MyData <- read.csv(file = "data.csv", header = TRUE, sep = ";", stringsAsFactors = FALSE)
# convert a percentage string (e.g. "25,5%") to a number:
# replace "," with ".", drop the "%", and divide by 100 because it's a percentage
convert <- function(stringpct) {
  as.numeric(gsub("%", "", gsub(",", ".", stringpct))) / 100
}
MyData$UserPc <- convert(MyData$UserPc)
MyData$PostingsPc <- convert(MyData$PostingsPc)
# refer to the columns explicitly instead of using attach();
# the axis labels now match the variables on each axis
plot(MyData$UserPc, MyData$PostingsPc, xlab = "Users", ylab = "Postings", type = "l")
I have a result.csv file which contains information in the following format:
date,tweets
2015-06-15,tweet
2015-06-15,tweet
2015-06-12,tweet
2015-06-11,tweet
2015-06-11,tweet
2015-06-11,tweet
2015-06-08,tweet
2015-06-08,tweet
I want to plot a frequency polygon with the number of entries for each date on the y-axis and the dates on the x-axis.
I have tried the following code:
pf<-read.csv("result.csv")
library(ggplot2)
qplot(datetime, data =pf, geom = "freqpoly")
but it shows the following error :
geom_path: Each group consist of only one observation. Do you need to adjust the group aesthetic?
Can anyone tell me how to solve this problem? I am totally new to R, so any kind of guidance will be of great help to me.
Your issue is that you are trying to treat datetime as continuous, but it has been imported as a factor (discrete/categorical). Let's convert it to a Date object and then things should work:
pf$datetime = as.Date(pf$datetime)
qplot(datetime, data =pf, geom = "freqpoly")
Based on your code, I assume that the result.csv has a header: datetime, atweet. By default, read.csv takes the first line of the CSV file as column names. That means you will be able to access the two columns with pf$datetime and pf$atweet.
If you look at the documentation of read.csv, you will find that stringsAsFactors = default.stringsAsFactors(), which is TRUE in R versions before 4.0.0. That is, the strings from CSV files are converted to factors.
Now, even if you set stringsAsFactors = FALSE, you still get the same error. That is because ggplot does not recognize the strings as dates, so it cannot treat them as a continuous, ordered variable.
To convert the strings into actual date-time values, you can use strptime.
Here is the working example:
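pf <- read.csv("result.csv", stringsAsFactors = FALSE)
library(ggplot2)
# strptime parses the strings into POSIXlt; as.POSIXct converts that to the
# date-time class that ggplot2's datetime scale expects
qplot(as.POSIXct(strptime(pf$datetime, "%Y-%m-%d")), data = pf, geom = "freqpoly")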
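If you would rather have exactly one point per date than rely on the binning that freqpoly does, you can count the rows per date yourself and plot the counts as a line. The sketch below reads the sample rows from the question inline and uses the column name date from the file as shown; the names counts and n are mine.

```r
# Inline copy of the sample rows shown in the question
pf <- read.csv(text = "date,tweets
2015-06-15,tweet
2015-06-15,tweet
2015-06-12,tweet
2015-06-11,tweet
2015-06-11,tweet
2015-06-11,tweet
2015-06-08,tweet
2015-06-08,tweet", stringsAsFactors = FALSE)

# Parse the strings as dates so they are treated as continuous
pf$date <- as.Date(pf$date)

# Count the number of rows for each distinct date
counts <- as.data.frame(table(pf$date), stringsAsFactors = FALSE)
names(counts) <- c("date", "n")
counts$date <- as.Date(counts$date)

# Build a line plot of counts over time, if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(counts, aes(date, n)) +
    geom_line() +
    geom_point()
  # print(p) draws the plot
}
```

Dates with no tweets simply do not appear in counts, so the line connects neighboring observed dates directly.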
I'm still in the process of learning R using Swirl and RStudio, and a goal I've set for myself is to recreate this graph. I have a small dataset that I will link below (it's saved as a plain text CSV file that I import into R with headings enabled).
If I try to plot that dataset without changing anything, I get this, which is obviously not the goal.
At first I thought the problem would be in the class of my imported dataset, defined as kt. After class(kt) turned out to be data.frame, I figured that wasn't the problem. Should I be trying to rewrite the table into something that R can plot directly, or should I extract each species individually, plot them separately and then combine the different plots into one graph? Perhaps there is something wrong with my dates; I know that R handles dates in a specific way. Maybe these solutions are not even needed and I'm just doing something stupidly simple wrong, but I can't find it myself.
Your help is much appreciated.
Dataset:
Species,week 0,week 1,week 2,week 3,week 4,week 5,week 6,week 7,week 8,week 9,week 10,week 11,week 12,week 13,week 14,week 15,week 16,week 17,week 18
Caesalpinia coriaria,0.0%,24.0%,28.0%,28.0%,32.0%,37.0%,40.0%,46.0%,52.0%,56.0%,63.0%,64.0%,68.0%,71.0%,72.0%,,,,
Coccoloba swartzii,0.0%,0.0%,1.0%,10.0%,19.0%,31.0%,33.0%,39.0%,43.0%,48.0%,52.0%,52.0%,52.0%,52.0%,52.0%,52.0%,52.0%,55.0%,
Cordia dentata,0.0%,5.0%,18.0%,21.0%,24.0%,26.0%,27.0%,30.0%,32.0%,32.0%,32.0%,32.0%,32.0%,32.0%,33.0%,33.0%,33.0%,34.0%,35.0%
Guaiacum officinale,0.0%,0.0%,0.0%,0.0%,4.0%,5.0%,5.0%,5.0%,7.0%,8.0%,8.0%,8.0%,8.0%,8.0%,8.0%,8.0%,8.0%,,
Randia aculeata,0.0%,0.0%,0.0%,4.0%,13.0%,14.0%,18.0%,19.0%,21.0%,21.0%,21.0%,21.0%,21.0%,22.0%,22.0%,22.0%,22.0%,,
Schoepfia schreberi,0.0%,0.0%,0.0%,0.0%,0.0%,0.0%,1.0%,4.0%,8.0%,11.0%,13.0%,21.0%,21.0%,24.0%,24.0%,25.0%,27.0%,,
Prosopis juliflora,0.0%,7.5%,31.3%,34.2%,,,,,,,,,,,,,,,
Something like this?
# df is the imported data frame (kt in the question)
# get rid of "%" signs
df <- data.frame(sapply(df, function(x) gsub("%", "", x, fixed = TRUE)))
# convert cols 2:20 to numeric
df[, 2:20] <- sapply(df[, 2:20], function(x) as.numeric(as.character(x)))
library(reshape2)
library(ggplot2)
# reshape from wide to long: one row per species and week
gg <- melt(df, id = "Species")
ggplot(gg, aes(x = variable, y = value, color = Species, group = Species)) +
  geom_line() +
  theme_bw() +
  theme(legend.position = "bottom", legend.title = element_blank())
There are lots of problems here.
First, if your dataset really has those % signs, then R interprets the data as character and imports it as factors. So first we have to get rid of the % (using gsub(...)), and then we have to convert what's left to numeric. With factors, you have to convert to character first, then to numeric, so: as.numeric(as.character(...)). All of this could have been avoided if you had exported the data without the % signs!
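To see why the detour through character matters, here is a tiny sketch with made-up values (the names pct, codes and values are only for illustration). Calling as.numeric directly on a factor returns the internal level codes, not the numbers the labels spell out.

```r
# Made-up percentages stored as a factor
pct <- factor(c("24.0", "0.0", "52.0"))

# Wrong: returns the internal level codes (levels sort as "0.0", "24.0", "52.0")
codes <- as.numeric(pct)

# Right: convert the labels to character first, then to numeric
values <- as.numeric(as.character(pct))

codes   # 2 1 3
values  # 24 0 52
```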
Plotting multiple curves with different colors is something the ggplot package was designed for (among many other things), so we use that. ggplot prefers data in "long" format: all the data in one column, with a second column distinguishing the different datasets. Your data is in "wide" format, with the data spread across different columns. So we convert to long using melt(...) from the reshape2 package. The result, gg, has three columns: Species, variable and value. value contains the actual data and variable contains the week number.
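For a feel of what the wide-to-long step does, here is a base R sketch on a three-week excerpt of two species from the question. It is not what melt does internally, just an equivalent reshaping by hand, and it additionally turns the column names into a numeric week so the x-axis can be continuous; the names wide, long and week are mine.

```r
# Small excerpt of the question's data, read inline
wide <- read.csv(text = "Species,week 0,week 1,week 2
Cordia dentata,0.0%,5.0%,18.0%
Randia aculeata,0.0%,0.0%,0.0%", stringsAsFactors = FALSE)

# read.csv turns the header "week 0" into the syntactic name "week.0"
# Strip the "%" sign and convert every week column to numeric
wide[-1] <- lapply(wide[-1], function(x) as.numeric(sub("%", "", x, fixed = TRUE)))

# Week numbers recovered from the column names
weeks <- as.numeric(sub("week.", "", names(wide)[-1], fixed = TRUE))

# Stack the week columns: one row per species and week
long <- data.frame(
  Species = rep(wide$Species, times = length(weeks)),
  week    = rep(weeks, each = nrow(wide)),
  value   = unlist(wide[-1], use.names = FALSE)
)
long
```

With week stored as a number, aes(x = week, y = value, color = Species) gives an evenly spaced numeric axis rather than one tick per factor level.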
So now we create a ggplot object, setting the x-axis to the variable column, the y-axis to the value column, with color mapped to Species, and we tell ggplot to plot lines (using geom_line(...)).
The rest is to position the legend at the bottom, and turn off some of the ggplot default formatting.