How to include a new column when using base R? - r

I have a CSV file named 'Campaignname.csv':
AdvertiserName,CampaignName
Wells Fargo,Gary IN MetroChicago IL Metro
EMC,Los Angeles CA MetroBoston MA Metro
Apple,Cupertino CA Metro
Desired Output in R
AdvertiserName,City,State
Wells Fargo,Gary,IN
Wells Fargo,Chicago,IL
EMC,Los Angeles,CA
EMC,Boston,MA
Apple,Cupertino,CA
The code for the solution was given in a previous Stack Overflow answer:
## read the csv file - modify next line as needed
xx <- read.csv("Campaignname.csv",header=TRUE)
s <- strsplit(xx$CampaignName, " Metro")
names(s) <- xx$Market
ss <- stack(s)
DF <- with(ss, data.frame(Market = ind,
City = sub(" ..$", "", values),
State = sub(".* ", "", values)))
write.csv(DF, file = "myfile.csv", row.names = FALSE, quote = FALSE)
But now another column like 'Identity' is included where the input is
Market,CampaignName,Identity
Wells Fargo,Gary IN MetroChicago IL Metro,56
EMC,Los Angeles CA MetroBoston MA Metro,78
Apple,Cupertino CA Metro,68
And the desired result is
Market,City,State,Identity
Wells Fargo,Gary,IN,56
Wells Fargo,Chicago,IL,56
EMC,Los Angeles,CA,78
EMC,Boston,MA,78
Apple,Cupertino,CA,68
The number of columns is not limited to just 3; it may keep increasing.
How can I do this in R? I am new to R. Any help is appreciated.

I'm not sure that I fully understand your question, and you didn't provide a reproducible example (so I can't run your code and try to get to the end point you want). But I'll still try to help.
Generally speaking, in R you can add a new column to a data.frame simply by assigning to it.
df = data.frame(advertiser = c("co1", "co2", "co3"),
campaign = c("camp1", "camp2", "camp3"))
df
advertiser campaign
1 co1 camp1
2 co2 camp2
3 co3 camp3
At this point, if I wanted to add an identity column I would simply create it with the $ operator like this:
df$identity = c(1, 2, 3)
df
advertiser campaign identity
1 co1 camp1 1
2 co2 camp2 2
3 co3 camp3 3
Note that there are other ways to accomplish this - see the transform (?transform) and cbind (?cbind) functions.
The caveat when adding a column to a data.frame is that the vector you add must have the same number of elements as there are rows in the data.frame (a single value is recycled to every row). You can see the number of rows in the data.frame by typing nrow(df).
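For the question as asked (splitting CampaignName while carrying along Identity and any further columns), one possible base R sketch is below. It assumes the three-column input shown in the question, read here from inline text rather than from the file; the names idx and extra are illustrative and do not come from the earlier answer.

```r
# Inline copy of the example input (assumption: the real data comes from read.csv on the file)
xx <- read.csv(text = "Market,CampaignName,Identity
Wells Fargo,Gary IN MetroChicago IL Metro,56
EMC,Los Angeles CA MetroBoston MA Metro,78
Apple,Cupertino CA Metro,68", stringsAsFactors = FALSE)

# Split every campaign string into its "City ST" pieces
s <- strsplit(xx$CampaignName, " Metro", fixed = TRUE)

# Row index repeated once per piece, so each piece keeps its source row
idx <- rep(seq_len(nrow(xx)), lengths(s))
values <- unlist(s)

# Every column other than Market and CampaignName is carried along unchanged
extra <- xx[idx, setdiff(names(xx), c("Market", "CampaignName")), drop = FALSE]

DF <- data.frame(Market = xx$Market[idx],
                 City = sub(" ..$", "", values),
                 State = sub(".* ", "", values),
                 extra,
                 stringsAsFactors = FALSE)
rownames(DF) <- NULL
DF
```

Because the extra columns are selected by name with setdiff, adding more columns to the input should not require changing the code.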

Related

extracting a row below a conditional statement (using a string search)

I was wondering if somebody could help me out with a line of code I've been having trouble with. I am trying to extract the row below every row that contains the string "=".
sagarin2 #my data
##output sample
#
# 2 Baylor =
# 93.44
# 91.43
# 3 Kansas =
# 90.86
# 00.00
I can extract the rows containing "=" by using the following code:
filter(sagarin2, grepl("=", ranking, fixed = TRUE))
##output sample
#
#2 Baylor =
#3 Kansas =
However, I am unsure how to call the row below Baylor and Kansas (93.44 and 90.86 respectively).
If anybody can help out, please let me know. I'm nearing the end of an undergraduate project and I could really use some help!
I tried using the which() function and I also tried creating my own function, with no luck. There are a couple of other posts that relate to this topic, but I've been trying for a while and can't get any of them to work.
vec <- c("2 Baylor =",
"93.44",
"91.43",
"3 Kansas =",
"90.86",
"00.00")
output <- vec[dplyr::lag(grepl("=", vec))]
output[!is.na(output)]
# [1] "93.44" "90.86"

RStudio/ R - Make new df of rows where values in a column exact match a string (faster speed needed)

Sorry, I have looked for solutions but couldn't find what I needed. I am quite new to R and have only used MATLAB before (hence I am still trying to work out how not to use loops).
I have a df with academic papers in it (one row per paper).
Main df
Fields Date Title
Biology; Neuroscience 2016 How do we know when XXX
Music; Engineering; Art 2011 Can we get the XXX
Biotechnology; Biology & Chemistry 2007 When will we find XXX
History; Biology 2006 Where does the XXXX
In one column ('Fields') there is a list of subject names, with multiple fields separated by a semicolon. I want to find all rows (papers) that have an exact match to a specific field name (e.g., 'Biology'), then make a new df with all those rows (papers). Importantly, however, I do not want to get fields that only partially match (e.g., 'Biology & Chemistry').
New df - just for those rows
Fields Date Title
Biology; Neuroscience 2016 How do we know when XXX
History; Biology 2006 Where does the XXXX
i.e., does not also select Biotechnology; Biology & Chemistry 2007 When will we find XXX which has the word 'Biology' in it
My first thought was to get each field name in its own column using splitstring, then loop through each column using which to find the exact matches for the name. Because there are up to 200 columns (field names) this takes ages! It's taking up to an hour to find and pull all the rows. I would obviously like something faster.
I know that in R you can avoid loops by using the apply family, etc., but I can't think how to use that here.
This is what it looks like when I split the author names into separate columns
Field1 Field2 Date Title
Biology Neuroscience 2016 How do we know when XXX
This is my code so far (note: there is a white space in front of the names once I split them up)
# Get list of columns to cycle through (they all start with 'sA')
names <- data[,grep("^sA", colnames(data))]
collist <- colnames(names)
names[collist] <- sapply(names[collist],as.character)
collist <- collist[-1]
Loop to get new df from matching rows
for (l in 1:length(namesUniq$Names)) {
namecurr <- namesUniq$Names[l]
namecurrSP <- paste0(" ", namecurr)
# Get data for that field
dfall <- data[which(data$sA1 == namecurr), ]
for (d in 1:length(collist)) {
dcol <- collist[d]
dfall <- rbind(dfall, data[which(data[, dcol] == namecurrSP), ])
rm(dcol)
}
rm(d)
}
Something that runs quickly would be really useful. Thank you for any help!
grepl does not work - it pulls other partial match strings (like 'Biology & Chemistry' when I want 'Biology' only)
dfall <- subset(data, grepl(namecurr, Field, fixed = TRUE))
For some reason, which does not work when I do it this way (rows works, rows2 does not - it selects rows outside the bounds of my df)
dfall <- rbind(data[rows, ], data[rows2, ])
Without a dput of your example data, here is an example that can be used.
data
test <- c("Biology; Neuroscience","Music; Engineering; Art","Biotechnology; Biology & Chemistry","History; Biology")
code:
test[sapply(strsplit(test,"; "), function(x) any(x=="Biology"))]
output:
[1] "Biology; Neuroscience" "History; Biology"
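To get a new data frame rather than a character vector, the same test can be used as a row index. The sketch below assumes a data frame shaped like the one in the question; has_field is a hypothetical helper name, and vapply is used instead of sapply only so that the result is guaranteed to be logical.

```r
Main_df <- data.frame(
  Fields = c("Biology; Neuroscience", "Music; Engineering; Art",
             "Biotechnology; Biology & Chemistry", "History; Biology"),
  Date = c(2016, 2011, 2007, 2006),
  stringsAsFactors = FALSE
)

# TRUE for each row whose split field list contains `target` exactly
has_field <- function(fields, target) {
  vapply(strsplit(fields, ";", fixed = TRUE),
         function(x) target %in% trimws(x),
         logical(1))
}

Biology_df <- Main_df[has_field(Main_df$Fields, "Biology"), ]
Biology_df
```

trimws removes the leading white space that the question mentions appears after splitting.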
Not sure how many different subsets you'll be pulling from your main data frame, but I thought I would take @Daniel-O's solution a little further and demonstrate a tidyverse solution.
You can think of it as making Biology_df by starting with Main_df and keeping only the rows where, after we str_split the Fields column on a semicolon and space ("; "), any piece of the split exactly matches Biology.
library(dplyr)
library(stringr)
library(purrr)
Main_df
#> Fields Date Title
#> 1 Biology; Neuroscience 2016 How do we know when XXX
#> 2 Music; Engineering; Art 2011 Can we get the XXX
#> 3 Biotechnology; Biology & Chemistry 2007 Where does the XXXX
#> 4 History; Biology 2006 Where does the XXXX
Biology_df <-
Main_df %>%
filter(str_split(Fields, "; ") %>%
map_lgl( ~ any(.x == "Biology")
)
)
Biology_df
#> Fields Date Title
#> 1 Biology; Neuroscience 2016 How do we know when XXX
#> 2 History; Biology 2006 Where does the XXXX
Based upon the little snippet of data you show
Fields <- c("Biology; Neuroscience","Music; Engineering; Art","Biotechnology; Biology & Chemistry","History; Biology")
Date <- c("2016", "2011", "2007", "2006")
Title <- c("How do we know when XXX", "Can we get the XXX", "Where does the XXXX", "Where does the XXXX")
Main_df <- data.frame(Fields, Date, Title)
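Since speed was the concern, a single vectorised grepl with delimiters in the pattern may be worth timing against the split-based versions; I have not benchmarked it. This is only a sketch: it assumes fields are separated by semicolons and that target contains no regular-expression metacharacters (otherwise it would need escaping first). The names target, pattern and keep are illustrative.

```r
Fields <- c("Biology; Neuroscience", "Music; Engineering; Art",
            "Biotechnology; Biology & Chemistry", "History; Biology")

target <- "Biology"

# Match the target only when it spans a whole field: preceded by the start of
# the string or a semicolon, and followed by a semicolon or the end of the string
pattern <- paste0("(^|;\\s*)", target, "(\\s*;|$)")

keep <- grepl(pattern, Fields, perl = TRUE)
Fields[keep]
```

With a data frame, keep can be used directly as a row index, for example Main_df[keep, ].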

Cleaning Data from PDF file

I am trying to scrape data from a PDF downloaded from the link below and store it as a data.table for analysis.
https://www.ftse.com/products/downloads/FTSE_100_Constituent_history.pdf.
Here's what I have so far:
require(pdftools)
require(data.table)
require(stringr)
url <- "https://www.ftse.com/products/downloads/FTSE_100_Constituent_history.pdf"
dfl <- pdf_text(url)
dfl <- dfl[2:(length(dfl)-1)]
dfl <- str_split(dfl, pattern = "(\n)")
This code nearly works. However, where text in the notes column wraps onto a new line (a \n in the extracted text), the note ends up split across two elements. For example, for 19-Jan-84 the notes column should read:
Corporate Event - Acquisition of Eagle Star by BAT Industries
But with my code, "BAT Industries" spills over onto a new line, whereas I would like it to be in the same string as the line above.
Once the code has run, I would like to have the same table as in the PDF, with all the text going into the correct columns.
Thanks.
We may use the following manipulations.
dfl <- pdf_text(url)
dfl <- dfl[2:(length(dfl) - 1)]
# Getting rid of the last line in every page
dfl <- gsub("\nFTSE Russell \\| FTSE 100 – Historic Additions and Deletions, November 2018[ ]+?\\d{1,2} of 12\n", "", dfl)
# Splitting not just by \n, but by \n that goes right before a date (positive lookahead)
dfl <- str_split(dfl, pattern = "(\n)(?=\\d{2}-\\w{3}-\\d{2})")
# For each page...
dfl <- lapply(dfl, function(df) {
# Split vectors into 4 columns (sometimes we may have 5 due to the issue that
# you mentioned, so str_split_fixed becomes useful) by possibly \n and
# at least two spaces.
df <- str_split_fixed(df, "(\n)*[ ]{2,}", 4)
# Replace any remaining (in the last columns) cases of possibly \n and
# at least two spaces.
df <- gsub("(\n)*[ ]{2,}", " ", df)
colnames(df) <- c("Date", "Added", "Deleted", "Notes")
df[df == ""] <- NA
data.frame(df[-1, ])
})
head(dfl[[1]])
# Date Added Deleted Notes
# 1 19-Jan-84 Charterhouse J Rothschild Eagle Star Corporate Event - Acquisition of Eagle Star by BAT Industries
# 2 02-Apr-84 Lonrho Magnet & Southerns <NA>
# 3 02-Jul-84 Reuters Edinburgh Investment Trust <NA>
# 4 02-Jul-84 Woolworths Barratt Development <NA>
# 5 19-Jul-84 Enterprise Oil Bowater Corporation Corporate Event - Sub division of company into Bowater Inds and Bowater Inc
# 6 01-Oct-84 Willis Faber Wimpey (George) & Co <NA>
I guess ultimately you are going to want a single data frame rather than a list of them. For that you may use do.call(rbind, dfl).
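The lookahead split used above can be tried on a small string without downloading the PDF. The sketch below uses a made-up two-record page and base R's strsplit with perl = TRUE in place of stringr::str_split, purely to keep it self-contained; the spacing in the sample text is an assumption about how pdf_text lays out wrapped notes.

```r
# Made-up page text: the first record's note wraps onto a second line
page <- paste0(
  "19-Jan-84  Charterhouse J Rothschild  Eagle Star  ",
  "Corporate Event - Acquisition of Eagle Star by\n",
  "                                                  BAT Industries\n",
  "02-Apr-84  Lonrho  Magnet & Southerns"
)

# Split only at newlines that are immediately followed by a dd-Mon-yy date
records <- strsplit(page, "\n(?=\\d{2}-\\w{3}-\\d{2})", perl = TRUE)[[1]]

# Fold the wrapped continuation line back into its record
records <- gsub("\n[ ]+", " ", records)
records
```

The newline before "BAT Industries" is not followed by a date, so it does not start a new record and the note stays in one string.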

Transfer text file to table in R with some conditions on it

I have one text file like this
DOB
Name
Address
13-03-2003
ABC
xyz.
12-08-2004
dfs
1 infinite loop.
The text runs to approximately 300 lines, and sometimes the address spills onto a second line. I want to convert this text to CSV format, with data like this:
DOB, Name, Address
13-03-2003,ABC,xyz.
or at least into one data frame. I have tried many things. read.table("file.txt", sep = "\n") puts everything in one column. I also tried reading the headers first with
header <- read.table("file.txt",sep= "\n")
and then the data with data <- read.table("file.txt", skip = 3, sep = "\n") and combining the two, but that does not work as required: my header vector has 3 entries while the data vector has approximately 300. Any help would be much appreciated :)
You could try
entries <- unlist(strsplit(text, "\\n")) #separate entries by line breaks
entries <- entries[nchar(entries) > 0] #remove empty lines
as.data.frame(matrix(entries, ncol=3, byrow=TRUE)) #assemble dataframe
# V1 V2 V3
#1 DOB Name Address
#2 13-03-2003 ABC xyz.
#3 12-08-2004 dfs 1 infinite loop.
data
text <-'DOB
Name
Address
13-03-2003
ABC
xyz.
12-08-2004
dfs
1 infinite loop.'
df <- read.table(text = text)
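If the first three lines should become column names rather than a data row, the matrix approach above can be extended slightly. This is a sketch that assumes every record occupies exactly three lines; with a real file the lines would presumably come from readLines("file.txt") instead of an inline string. The names m and out are illustrative.

```r
text <- "DOB\nName\nAddress\n13-03-2003\nABC\nxyz.\n12-08-2004\ndfs\n1 infinite loop."

# One element per non-empty line
entries <- strsplit(text, "\n", fixed = TRUE)[[1]]
entries <- entries[nzchar(entries)]

# Three lines per record, filled row by row
m <- matrix(entries, ncol = 3, byrow = TRUE)

# First row holds the headers; the remaining rows are the data
out <- as.data.frame(m[-1, , drop = FALSE], stringsAsFactors = FALSE)
names(out) <- m[1, ]
out

# write.csv(out, "myfile.csv", row.names = FALSE)  # uncomment to write the CSV
```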
Two assumptions were made. First, there will be no blank names or dates of birth; by "blank" I mean a line that is absent altogether, not one holding "NA", "", or some other marker that the value was missing. Second, names and DOBs occupy exactly one line each.
s1 <- gsub("^\n|\n$", "", strsplit(x, "\n\n+")[[1]])
stars <- gsub("\n", ", ", sub("\n", "*", sub("\n", "*", s1)))
mat <- t(as.data.frame(strsplit(stars, "\\*")))
dimnames(mat) <- c(NULL, NULL)
write.csv(mat,"filename.csv")
We start by splitting the text by the blank lines and eliminating any leading or trailing newline tokens. Then we replace the first and second "\n" symbols with stars. Next we split on those new star markers that we created to always have 3 elements for each row. We create a matrix with the values and transpose it for display. Then write the data to csv.
When opened with Notepad on a test file, I get:
"","V1","V2","V3"
"1","DOB","Name","Address"
"2","13-03-2003","ABC","xyz."
"3","12-08-2004","dfs","1 infinite loop"
"4","01-01-2000","Bob Smith","1234 Main St, Suite 400"
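Row names can be dropped with row.names = FALSE (see ?write.csv). write.csv ignores attempts to change col.names, so to drop the column names as well, use write.table with sep = "," and col.names = FALSE.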
Data
x <- "DOB
Name
Address

13-03-2003
ABC
xyz.

12-08-2004
dfs
1 infinite loop

01-01-2000
Bob Smith
1234 Main St
Suite 400
"

rbind calendar data from Web - list error

I am trying to import calendar dates in R.
I found a website with dates that I imported with XML.
library('XML')
u="http://www.timeanddate.com/calendar/custom.html?year=2015&country=5&typ=0&display=2&cols=0&hol=0&cdt=1&holm=1&df=1"
tables = readHTMLTable(u)
Get rid of some unnecessary elements
tables = tables[-1]
tables = tables[-1]
tables = tables[-13]
Generate list names
names(tables) <- paste('month', 1:12, sep = '')
with a solution proposed here
mtables = mapply(cbind, tables, 'Month'= 1:12, SIMPLIFY=F)
Here when I want to rbind my list:
do.call('rbind', mtables)
I get an error:
Error in match.names(clabs, names(xi)) :
names do not match previous names
Could you help me solve this error?
rbind takes two or more data frames (or matrices), and their column names must match.
Here is a code snippet using rbind.
Hope this helps.
Cheers,
Oliver
vehicles1 <- unique(grep("Vehicles", SCC$EI.Sector, ignore.case = TRUE, value = TRUE))
vehicles <- SCC[SCC$EI.Sector %in% vehicles1, ]["SCC"]
# Select observations relating to Baltimore MD
vehiclesBaltimore <- NEI[NEI$SCC %in% vehicles$SCC & NEI$fips == "24510",]
# Select observations relating to Los Angeles County CA
vehiclesLosAngelesCounty <- NEI[NEI$SCC %in% vehicles$SCC & NEI$fips == "06037",]
# Merge observations of Baltimore and Los Angeles County
vehiclesCompare <- rbind(vehiclesBaltimore, vehiclesLosAngelesCounty)
The issue was actually in the header. Using
`tables = readHTMLTable(u, header = F)`
instead of
`tables = readHTMLTable(u, header = T)`
gives every element of the list the same column names.
Thanks
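The error can be reproduced without the web page. The sketch below uses two made-up data frames whose column names differ, which is roughly what happens when each month's table takes its header from its own first row; the names a, b and combined are illustrative only.

```r
# Two toy tables with the same shape but different column names
a <- data.frame(Mon = 1:2, Tue = 3:4)
b <- data.frame(Lun = 5:6, Mar = 7:8)

# rbind on data frames matches columns by name, so this fails
msg <- tryCatch(rbind(a, b), error = function(e) conditionMessage(e))
msg

# Giving every element the same column names lets rbind succeed
tables <- lapply(list(a, b), setNames, c("V1", "V2"))
combined <- do.call(rbind, tables)
combined
```

Reading the tables with header = F has the same effect as the setNames step here, as far as I can tell: every table gets the default V1, V2, ... names.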
