Creating a data frame from looping through text - r

Thanks in advance! I have been trying this for a few days and I am stuck. I am trying to loop through a text file (imported as a list) and build a data frame from it. A new row should start whenever a list item contains a day of the week, and that item should populate the first column (V1). The rest of the comments should go in the second column (V2), which may mean concatenating several strings together. I am trying to use a conditional with grepl(), but I am lost on the logic after I set up the initial data frame.
Here is an example of the text I am bringing into R (it is Facebook data from a text file). The [] markers are the list indices. It is a lengthy file (50K+ lines), but I have the date column set up.
[1]
Thursday, August 25, 2016 at 3:57pm EDT
[2]
Football time!! We need to make plans!!!! I texted my guy, though haven't been in touch sense last year. So we'll see on my end!!! What do you have cooking???
[3]Sunday, August 14, 2016 at 9:17am EDT
[4]Michael shared Jason post.
[5]This bird is a lot smarter than the majority of political posts I have read recently here
[6]Sunday, August 14, 2016 at 8:44am EDT
[7]Michael and Kurt are now friends.
The end result would be a data frame where each day-of-the-week line starts a new row, and the rest of the list items are concatenated into the second column. So the end data frame would be
Row 1 ([1] in V1 and [2] in V2)
Row 2 ([3] in V1 and [4],[5] in V2)
Row 3 ([6] in V1 and [7] in V2)
Here is the start of my code, and I can get V1 to populate correctly, but not the second column of the data frame.
### Read in the text file
temp <- readLines("C:/Program Files/R/Text Mining/testa.txt")
### Remove empty lines from the text file
temp <- temp[temp != ""]
### Create the temp char file as a list file
tmp <- as.list(temp)
### A days vector for searching through the list of days.
days <- c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")
df <- {}
### Loop through the list
for (n in 1:length(tmp)) {
  ### Search to see if there is a day in the list item
  for (i in 1:length(days)) {
    if (grepl(days[i], tmp[n]) == 1) {
      ### Bind the row to the df if there is a day in the list item
      df <- rbind(df, tmp[n])
    }
  }
  ### I know this is wrong, I am trying to create a vector to concatenate and add to the data frame, but I am struggling here.
  d <- c(d, tmp[n])
}

Here's an option using the tidyverse:
library(tidyverse)
text <- "[1] Thursday, August 25, 2016 at 3:57pm EDT
[2] Football time!! We need to make plans!!!! I texted my guy, though haven't been in touch sense last year. So we'll see on my end!!! What do you have cooking???
[3]Sunday, August 14, 2016 at 9:17am EDT
[4]Michael shared Jason post.
[5]This bird is a lot smarter than the majority of political posts I have read recently here
[6]Sunday, August 14, 2016 at 8:44am EDT
[7]Michael and Kurt are now friends."
# day names to match (spelled out; weekdays() needs Date input, not 1:7)
days_pattern <- paste(c("Sunday", "Monday", "Tuesday", "Wednesday",
                        "Thursday", "Friday", "Saturday"), collapse = '|')

df <- data_frame(lines = read_lines(text)) %>%   # read data, set up data.frame
  filter(lines != '') %>%                        # filter out empty lines
  # set grouping by cumulative number of rows with weekdays in them
  group_by(grp = cumsum(grepl(days_pattern, lines))) %>%
  # collapse each group to two columns
  summarise(V1 = lines[1], V2 = list(lines[-1]))
df
## # A tibble: 3 × 3
##     grp V1                                          V2
##   <int> <chr>                                       <list>
## 1     1 [1] Thursday, August 25, 2016 at 3:57pm EDT <chr [1]>
## 2     2 [3]Sunday, August 14, 2016 at 9:17am EDT    <chr [2]>
## 3     3 [6]Sunday, August 14, 2016 at 8:44am EDT    <chr [1]>
This approach uses a list column for V2, which is probably the best option in terms of preserving your data, but use paste or toString if you need a plain character column.
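For instance, a minimal sketch (assuming the df built above, with the tidyverse still loaded) that flattens the list column into plain text:
# collapse each list element of V2 into a single comma-separated string
df_flat <- df %>%
  mutate(V2 = sapply(V2, toString))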
Roughly equivalent base R:
df <- data.frame(V2 = readLines(textConnection(text)), stringsAsFactors = FALSE)
df <- df[df$V2 != '', , drop = FALSE]
days_pattern <- paste(c("Sunday", "Monday", "Tuesday", "Wednesday",
                        "Thursday", "Friday", "Saturday"), collapse = '|')
df$grp <- cumsum(grepl(days_pattern, df$V2))
df$V1 <- ave(df$V2, df$grp, FUN = function(x){x[1]})
df <- aggregate(V2 ~ grp + V1, df, FUN = function(x){x[-1]})
df
## grp V1
## 1 1 [1] Thursday, August 25, 2016 at 3:57pm EDT
## 2 2 [3]Sunday, August 14, 2016 at 9:17am EDT
## 3 3 [6]Sunday, August 14, 2016 at 8:44am EDT
## V2
## 1 [2] Football time!! We need to make plans!!!! I texted my guy, though haven't been in touch sense last year. So we'll see on my end!!! What do you have cooking???
## 2 [4]Michael shared Jason post., [5]This bird is a lot smarter than the majority of political posts I have read recently here
## 3 [7]Michael and Kurt are now friends.

Related

How to plot multiple, separate graphs in R

I have a dataset of over 300K rows spanning more than 20 years. I'm trying to create a Load Duration Curve for every year, i.e. the number of MW used for every hour of the year (8760 hours per year, or 8784 for a leap year). Currently I make a new dataframe by filtering by year, reorder it in descending order of MW used (descending for the curve), and then create another column matching the row order so that I can use that column as a placeholder for the x-axis. This seems pretty inefficient and could be difficult to update if needed (see the playground code below for what I've been doing). I also don't want to use facet_wrap() because the resulting graphs are too small for what is needed.
Dummy_file:
Where hrxhr is the running total of hours in a given year.
YEAR  MONTH  DAY  HOUR OF DAY  MW    Month_num  Date        Date1  hrxhr
2023  Dec    31   22           2416  12         2023-12-31  365    8758
2023  Dec    31   23           2412  12         2023-12-31  365    8759
2023  Dec    31   24           2400  12         2023-12-31  365    8760
2024  Jan    01   1            2271  12         2024-01-01  1      1
2023  Jan    01   2            2264  12         2024-01-01  1      2
### ------------ Load in source ------------ ###
library(readr)        # read_csv
library(plotly)       # plot_ly and %>%
library(htmlwidgets)  # saveWidget
dummy_file <- 'Dummydata.csv'
forecast_df <- read_csv(dummy_file)
### ---- Order df by MW (load) and YEAR ---- ###
ordered_df <- forecast_df[order(forecast_df$MW, decreasing = TRUE), ]
ordered_df <- ordered_df[order(ordered_df$YEAR, decreasing = FALSE), ]
### -------------- Playground -------------- ###
## Create a dataframe for the forecast for calendar year 2023
cy23_df <- ordered_df[ordered_df$YEAR == 2023, ]
## Add placeholder column for graphing purposes (add order number)
cy23_df$placeholder <- row.names(cy23_df)
## Check df structure and change columns as needed
str(cy23_df)
# Change placeholder column from character to numeric for graphing purposes
cy23_df$placeholder <- as.numeric(cy23_df$placeholder)
# Check if changed correctly
class(cy23_df$placeholder) #YES
## Load duration curve - Interactive
LF_cy23_LDC <- plot_ly(cy23_df,
                       x = ~placeholder,
                       y = ~MW,
                       type = 'scatter',
                       mode = 'lines',
                       hoverinfo = 'text',
                       text = paste("Megawatts: ", cy23_df$MW,
                                    "Date: ", cy23_df$MONTH, cy23_df$DAY,
                                    "Hour: ", cy23_df$hrxhr)) %>%
  layout(title = 'CY2023 Load Forecast - LDC')
# "Hour: ", orderby_MW$yrhour))
saveWidget(LF_cy23_LDC, "cy23_LDC.html")
Current output for CY2023 (screenshot omitted): the y-axis is megawatts used (MW) and the x-axis is the placeholder column. I then just repeat the playground code for the rest of the years, changing 2023 to 2024, then 2025, etc.
Sorry if this is a long post, too much information, or not enough. I'm fairly new to R and this community. Many thanks for your help!
Simply generalize your playground process into a user-defined function, then iterate through the years with lapply.
# USER DEFINED METHOD TO RUN A SINGLE YEAR
build_year_plot <- function(year) {
  ### -------------- Playground -------------- ###
  ## Create a dataframe for the forecast for calendar year
  cy_df <- ordered_df[ordered_df$YEAR == year, ]
  ## Add placeholder column for graphing purposes (add order number)
  cy_df$placeholder <- row.names(cy_df)
  ## Check df structure and change columns as needed
  str(cy_df)
  # Change placeholder column from character to numeric for graphing purposes
  cy_df$placeholder <- as.numeric(cy_df$placeholder)
  # Check if changed correctly
  class(cy_df$placeholder) #YES
  ## Load duration curve - Interactive
  LF_cy_LDC <- plot_ly(
    cy_df, x = ~placeholder, y = ~MW, type = 'scatter',
    mode = 'lines', hoverinfo = 'text',
    text = paste(
      "Megawatts: ", cy_df$MW,
      "Date: ", cy_df$MONTH, cy_df$DAY,
      "Hour: ", cy_df$hrxhr
    )
  ) %>% layout( # MAGRITTR PIPE; BASE R 4.1.0+ |> ALSO WORKS
    title = paste0('CY', year, ' Load Forecast - LDC')
  )
  saveWidget(LF_cy_LDC, paste0("cy", year - 2000, "_LDC.html"))
  return(LF_cy_LDC)
}
# CALLER TO RUN THROUGH SEVERAL YEARS
LF_cy_plots <- lapply(2023:2025, build_year_plot)
Consider even by (an object-oriented wrapper to tapply, roughly equivalent to split + lapply) to avoid the year indexing altogether. Notice the input parameter changes below and the variables used in the title and filename:
# USER DEFINED METHOD TO RUN A SINGLE DATA FRAME
build_year_plot <- function(cy_df) {
  ### -------------- Playground -------------- ###
  ## Add placeholder column for graphing purposes (add order number)
  cy_df$placeholder <- row.names(cy_df)
  ...SAME AS ABOVE...
  ) %>% layout(
    title = paste0('CY', cy_df$YEAR[1], ' Load Forecast - LDC')
  )
  saveWidget(LF_cy_LDC, paste0("cy", cy_df$YEAR[1] - 2000, "_LDC.html"))
  return(LF_cy_LDC)
}
# CALLER TO RUN THROUGH SEVERAL YEARS
LF_cy_plots <- by(ordered_df, ordered_df$YEAR, build_year_plot)
The counterparts in the tidyverse would use purrr::map:
# METHOD RECEIVES YEAR (lapply counterpart)
LF_cy_plots <- purrr::map(2023:2025, build_year_plot)
# METHOD RECEIVES DATA FRAME (by counterpart)
LF_cy_plots <- ordered_df %>%
  split(.$YEAR) %>%
  purrr::map(build_year_plot)
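One small usage note, as a hedged sketch assuming the year-based build_year_plot above: lapply() returns an unnamed list, while by() and the split()-based purrr::map() name the results by year, so you may want to add names yourself:
# name the lapply() results by year so individual widgets are easy to pull out
LF_cy_plots <- setNames(lapply(2023:2025, build_year_plot), 2023:2025)
LF_cy_plots[["2024"]]   # the CY2024 plotly object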

If else statement with a value that is part of a continuous character in R

My dataframe (df) contains a list of values labelled in the format 'Month' + 'Name of Site' + 'Camera No.'; i.e., if my value is 'DECBUTCAM27', then DEC is December, BUT is the name of the site, and CAM27 is camera no. 27.
I have 100 such values with 19 different site names.
I want to write if/else code such that only the site names are recognised and a corresponding number is added.
My initial idea was to add the corresponding number for all 100 values, but since nested ifelse() does not work beyond 50 levels, I couldn't use that option.
This is what I had written for the option that I had tried:
df <- df2 %>% mutate(Site_ID =
ifelse (CT_Name == 'DECBUTCAM27', "1",
ifelse (CT_Name == 'DECBUTCAM28', "1",
ifelse (CT_Name == 'DECI2NCAM01', "2",
ifelse (CT_Name == 'DECI2NCAM07', "2",
ifelse (CT_Name == 'DECI5CAM39', "3",
ifelse (CT_Name == 'DECI5CAM40', "3","NoVal")))))))
I am looking for code such that only the sites, i.e. 'BUT', 'I2N' and 'I5', are recognised and a corresponding number is added.
Any help would be greatly appreciated.
Extract the site name using regex and use match + unique to assign a unique number.
df2$site_name <- sub('...(.*)CAM.*', '\\1', df2$CT_Name)
df2$Site_ID <- match(df2$site_name, unique(df2$site_name))
For example:
CT_Name <- c('DECBUTCAM27', 'DECBUTCAM28', 'DECI2NCAM07', 'DECI2NCAM01',
'DECI5CAM39', 'DECI5CAM40')
site_name <- sub('...(.*)CAM.*', '\\1', CT_Name)
site_name
#[1] "BUT" "BUT" "I2N" "I2N" "I5" "I5"
Site_ID <- match(site_name, unique(site_name))
Site_ID
#[1] 1 1 2 2 3 3
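If you would rather keep this inside the dplyr pipeline from the question, here is a minimal sketch of the same idea (assuming your data frame is df2 with a CT_Name column, as in the question):
df <- df2 %>%
  mutate(site_name = sub('...(.*)CAM.*', '\\1', CT_Name),
         Site_ID   = match(site_name, unique(site_name)))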
Here is a tidyverse solution:
You haven't provided a reproducible example, but let's use the CT_Names that you have supplied to create a test dataframe:
library(tidyverse)

data <- tribble(
  ~ CT_Name,
  "DECBUTCAM27",
  "DECBUTCAM28",
  "DECI2NCAM01",
  "DECI2NCAM07",
  "DECI5CAM39",
  "DECI5CAM40"
)
Let's assume that the string format is 3 letters for the month, 2 or more letters or numbers for the site, and CAM + 1 or more digits for the camera number (adjust these as needed). We can use a regular expression in tidyr's extract() function to split the string into its components:
data_new <- data %>%
  extract(CT_Name, regex = "(\\w{3})(\\w{2,})(CAM\\d+)", into = c("Month", "Site", "Camera"))
(add remove = FALSE if you want to keep the original CT_Name variable)
This yields:
# A tibble: 6 x 3
Month Site Camera
<chr> <chr> <chr>
1 DEC BUT CAM27
2 DEC BUT CAM28
3 DEC I2N CAM01
4 DEC I2N CAM07
5 DEC I5 CAM39
6 DEC I5 CAM40
We can then group by site and assign a group ID as your Site_ID:
data_new <- data %>%
  extract(CT_Name, regex = "(\\w{3})(\\w{2,})(CAM\\d+)", into = c("Month", "Site", "Camera")) %>%
  group_by(Site) %>%
  mutate(Site_ID = cur_group_id())
This produces:
# A tibble: 6 x 4
# Groups: Site [3]
Month Site Camera Site_ID
<chr> <chr> <chr> <int>
1 DEC BUT CAM27 1
2 DEC BUT CAM28 1
3 DEC I2N CAM01 2
4 DEC I2N CAM07 2
5 DEC I5 CAM39 3
6 DEC I5 CAM40 3
Here is a quick example using regex to find the site code and an apply function to return a vector of codes.
df <- data.frame(code = c('DECBUTCAM27','JANBUTCAM27','DECDUCCAM45'))
df$loc <- apply(df, 1, function(x) gsub("CAM.*$","",gsub("^.{3}",'',x[1])))
unique(df$loc) # all the locations in the data
df$n <- as.numeric(as.factor(df$loc)) # get a number for each location
Note that I use x[1] here because the codes are in the first column of my data.frame; that index may differ for you.
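As a side note, here is a minimal vectorized sketch of the same idea (no apply() needed), assuming the codes live in a column named code as in the example above:
# strip the 3-letter month prefix and the CAM suffix, all rows at once
df$loc <- gsub("CAM.*$", "", gsub("^.{3}", "", df$code))
# get a number for each location, as before
df$n <- as.numeric(as.factor(df$loc))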
---EDIT--- This was a previous answer that also works, but it requires more work on your side. However, it allows you to choose the numeric code (or text) assigned to each location, for example if they need to follow a particular order.
It requires you to list the code for every site, which I find heavy in terms of code, but it works. The switch part is roughly the same as an ifelse.
The regex consists in excluding the first 3 characters and everything after the 'CAM' sequence at the end.
df <- data.frame(code = c('DECBUTCAM27','JANBUTCAM27','DECDUCCAM45'))
df$n <- apply(df, 1, function(x) switch(gsub("CAM.*$", "", gsub("^.{3}", "", x[1])),
                                        BUT = 1,
                                        DUC = 2)
)

Quanteda changing rel freq of a term over time

I have a corpus of news articles with date and time of publication as 'docvars'.
readtext object consisting of 6 documents and 8 docvars.
# Description: df[,10] [6 × 10]
doc_id text year month day hour minute second title source
* <chr> <chr> <int> <int> <int> <int> <int> <int> <chr> <chr>
1 2014_01_01_10_51_00… "\"新华网伦敦1… 2014 1 1 10 51 0 docid报告称若不减… RMWenv
2 2014_01_01_11_06_00… "\"新华网北京1… 2014 1 1 11 6 0 docid盘点2013… RMWenv
3 2014_01_02_08_08_00… "\"原标题:报告… 2014 1 2 8 8 0 docid报告称若不减… RMWenv
4 2014_01_03_08_42_00… "\"地球可能毁灭… 2014 1 3 8 42 0 docid地球可能毁灭… RMWenv
5 2014_01_03_08_44_00… "\"北美鼠兔看起… 2014 1 3 8 44 0 docid北美鼠兔为应… RMWenv
6 2014_01_06_10_30_00… "\"欣克力C点核… 2014 1 6 10 30 0 docid英国欲建50… RMWenv
I would like to measure the changing relative frequency with which a particular term - e.g. 'development' - occurs in these articles (either as a proportion of the total terms in the article, or as a proportion of the total terms in all the articles published on a particular day or in a particular month). I know that I can count the number of times the term occurs in all the articles in a month, using:
dfm(corp, select = "term", groups = "month")
and that I can get the relative frequency of the word to the total words in the document using:
dfm_weight(dfm, scheme = "prop")
But how do I combine these together to get the frequency of a specific term relative to the total number of words on a particular day or in a particular month?
What I would like to be able to do is measure the change in the amount of times a term is used over time, but accounting for the fact that the total number of words used is also changing. Thanks for any help!
@DaveArmstrong gives a good answer here and I upvoted it, but I can add a bit of efficiency using some of the newest quanteda syntax, which is a bit simpler.
The key here is preserving the date format created by zoo::yearmon(), since the dfm grouping coerces it to a character. So we pack it into a docvar, which is preserved by the grouping, and then retrieve it in the ggplot() call.
load(file("https://www.dropbox.com/s/kl2cnd63s32wsxs/music.rda?raw=1"))
library("quanteda")
## Package version: 2.1.1
## create corpus and dfm
corp <- corpus(m, text_field = "body_text")
corp$date <- m$first_publication_date %>%
zoo::as.yearmon()
D <- dfm(corp, remove = stopwords("english")) %>%
dfm_group(groups = "date") %>%
dfm_weight(scheme = "prop")
library("ggplot2")
convert(D[, "wonderfully"], to = "data.frame") %>%
ggplot(aes(x = D$date, y = wonderfully, group = 1)) +
geom_line() +
labs(x = "Date", y = "Wonderfully/Total # Words")
I suspect someone will come up with a better solution within quanteda, but in the event they don't, you could always extract the word from the dfm, put it in a dataset along with the date, and then make the graph. In the code below, I'm using some music reviews I scraped from the Guardian's website. I've commented out the functions that read in the data from an .rda file on Dropbox. You're welcome to use it if you like - it's clean, but I don't want to inadvertently have someone download a file from the web they're not aware of.
# f <- file("https://www.dropbox.com/s/kl2cnd63s32wsxs/music.rda?raw=1")
# load(f)
## create corpus and dfm
corp <- corpus(as.character(m$body_text))
docvars(corp, "date") <- m$first_publication_date
D <- dfm(corp, remove=stopwords("english"))
## take word frequencies "wonderfully" in the dfm
## along with the date
tmp <- tibble(
word = as.matrix(D)[,"wonderfully"],
date = docvars(corp)$date,
## calculate the total number of words in each document
total = rowSums(D)
)
tmp <- tmp %>%
  ## turn date into year-month
  mutate(yearmon = zoo::as.yearmon(date)) %>%
  ## group by year-month
  group_by(yearmon) %>%
  ## calculate the sum of the instances of "wonderfully"
  ## divided by the sum of the total words across all
  ## documents in the month
  summarise(prop = sum(word) / sum(total))
## make a plot
ggplot(tmp, aes(x = yearmon, y = prop)) +
  geom_line() +
  labs(x = "Date", y = "Wonderfully/Total # Words")

how to retrieve the values of the columns two positions away in R

I am dealing with a super messy data set from Nexis (where I have a bunch of articles, titles, date, author etc):
V1 V2 V3 V4 V5 V6 V7 V8
1 1. UNIONS UNIMPRESSED BY GEORGE OSBORNE'S SPENDING ANNOUNCEMENTS
2 PA Newswire: Scotland, November 25, 2015 Wednesday 1:54
3 Newswire: Scotland, 1567 words, Alan Jones, Press Association
4 Correspondent
5 2. Standard Life to back HSBC over HQ
6 The Herald (Glasgow), November 24, 2015 Tuesday, Pg.
V9 V10 V11
1
2 PM BST, PA
3 Industrial
4
5 move
6 23, 620 words,
I want to develop a count of how many articles appear per month in each year (1995-2015). Although the head of the data shows the month appearing in a particular column, this is not always the case. Nevertheless, I have noticed that the year always appears two columns to the right of the month (in the same row). So I want to develop code that finds how many articles are from November 1995, February 1995, ... October 2015. Anyone up to the challenge?
Kind regards
PS: the attached screenshot (omitted here) shows the data more clearly.
As you provided no working example, I created one and hope that my code will also work on your data.
# build example
d <- data.frame(a=c(month.name[1:3],month.name[1]),b=c(letters[1:4]),c=c(70,99,15,14))
d <- apply(d, 2, as.character)
d
Now the code loops over all columns, searching every row for one of the 12 month names. For each matching row it extracts the month and the year (two columns to the right), pastes them together, and saves the result.
# Loop
result <- NULL
for (i in 1:ncol(d)) {
  # get row ids including one of the 12 months
  row <- grep(paste(month.name, collapse = "|"), d[, i])
  # month per year
  if (length(row) > 0) {
    col <- i # Column
    mpy <- paste(d[row, i], d[row, i + 2], sep = "_")
    tmp <- data.frame(col, row, mpy, row.names = NULL)
    result <- rbind(result, tmp)
  }
}
table(result$mpy)
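If a tidier summary is wanted afterwards, here is a small hedged follow-up (assuming the result object built above) that turns those counts into a data frame:
counts <- as.data.frame(table(result$mpy), stringsAsFactors = FALSE)
names(counts) <- c("month_year", "articles")
counts[order(counts$articles, decreasing = TRUE), ]   # most-covered months first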

Merge two dataframes with repeated columns

I have several .csv files, each one corresponding to a monthly list of customers and some information about them. Each file consists of the same information about customers such as:
names(data.jan)
ID AGE CITY GENDER
names(data.feb)
ID AGE CITY GENDER
To simplify, I will consider only two months, January and February, but my real set of csv files goes from January to November.
Considering a "customer X", I have three possible scenarios:
1- Customer X is listed in the January database, but he left and is not listed in February
2- Customer X is listed in both the January and February databases
3- Customer X entered the database in February, so he is not listed in January
I am stuck on the following problem: I need to create a single database with all customers across the two dataframes and their respective information. However, for a customer listed in both dataframes, I want to take his information from his first entry, that is, January.
When I use merge, I have four options, according to http://www.dummies.com/how-to/content/how-to-use-the-merge-function-with-data-sets-in-r.html
data <- merge(data.jan, data.feb, by = "ID", all = TRUE)
Regardless of whether I choose all, all.x or all.y, I get the same undesired output called data:
data[1,]
ID AGE.x CITY.x GENDER.x AGE.y CITY.y GENDER.y
123 25 NY M 25 NY M
I think that what would work here is to first merge both databases with a join of the type shown in the diagram I had attached (image omitted), and then merge the resulting dataframe with data.jan using a full outer join. But I don't know how to code this in R.
Thanks,
Bernardo
d1 <- data.frame(x=1:9,y=1:9,z=1:9)
d2 <- data.frame(x=1:10,y=11:20,z=21:30) # example data
d3 <- merge(d1,d2, by="x", all=TRUE) #merge
# keep the original columns from janary (i.e. y.x, z.x)
# but replace the NAs in those columns with the data from february (i.e. y.y,z.y )
d3[is.na(d3[,2]) ,][,2:3] <- d3[is.na(d3[,2]) ,][, 4:5]
#> d3[, 1:3]
# x y.x z.x
#1 1 1 1
#2 2 2 2
#3 3 3 3
#4 4 4 4
#5 5 5 5
#6 6 6 6
#7 7 7 7
#8 8 8 8
#9 9 9 9
#10 10 20 30
This may be tiresome for more than 2 months though, so perhaps you should consider @flodel's comments. Also note there are pitfalls when your original Jan data has NAs (and you still want the first month's data, NA or not, retained), although you never mentioned them in your question.
Try:
data <- merge(data.jan,data.frame(ID=data.feb$ID), by="ID")
although I haven't tested it since there is no data; but if you just join on the ID column from Feb, it should keep only the IDs present in both frames, with the information taken from Jan.
@user1317221_G's solution is excellent. If your tables are large (lots of customers), data.table might be faster:
library(data.table)
# some sample data
jan <- data.table(id=1:10, age=round(runif(10,25,55)), city=c("NY","LA","BOS","CHI","DC"), gender=rep(c("M","F"),each=5))
new <- data.table(id=11:16, age=round(runif(6,25,55)), city=c("NY","LA","BOS","CHI","DC","SF"), gender=c("M","F"))
feb <- rbind(jan[6:10,],new)
new <- data.table(id=17:22, age=round(runif(6,25,55)), city=c("NY","LA","BOS","CHI","DC","SF"), gender=c("M","F"))
mar <- rbind(jan[1:5,],new)
setkey(jan, id)
setkey(feb, id)
join <- data.table(merge(jan, feb, by = "id", all = TRUE))
join[is.na(age.x), names(join)[2:4] := join[is.na(age.x), 5:7, with = FALSE]]
Edit: This adds processing for multiple months.
f <- function(x, y) {
  setkey(x, id)
  setkey(y, id)
  join <- data.table(merge(x, y, by = "id", all = TRUE))
  join[is.na(age.x), names(join)[2:4] := join[is.na(age.x), 5:7, with = FALSE]]
  join[, names(join)[5:7] := NULL] # get rid of extra columns
  setnames(join, 2:4, c("age", "city", "gender")) # rename columns that remain
  return(join)
}
Reduce("f",list(jan,feb,mar))
Reduce(...) applies the function f(...) to the elements of the list in turn, so first to jan and feb, and then to the result and mar, etc.
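A tiny illustration of that folding behaviour with toy numbers (nothing to do with the customer data):
Reduce(function(a, b) a + b, list(1, 2, 3))   # computes (1 + 2) + 3 = 6
# so Reduce("f", list(jan, feb, mar)) evaluates f(f(jan, feb), mar)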
