I was wondering if somebody could help me out with a line of code I've been having trouble with. I am trying to extract the row below every row that contains the string "=".
sagarin2 #my data
##output sample
#
# 2 Baylor =
# 93.44
# 91.43
# 3 Kansas =
# 90.86
# 00.00
I can extract the rows containing "=" by using the following code:
filter(sagarin2, grepl("=", ranking, fixed = TRUE))
##output sample
#
#2 Baylor =
#3 Kansas =
However, I am unsure how to call the row below Baylor and Kansas (93.44 and 90.86 respectively).
If anybody can help out, please let me know. I'm nearing the end of an undergraduate project and I could really use some help!
I tried using the which() function and I also tried creating my own function, with no luck. There are a couple of other posts that relate to this topic, but I've been trying for a while and can't get any of them to work.
vec <- c("2 Baylor =",
"93.44",
"91.43",
"3 Kansas =",
"90.86",
"00.00")
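# lag() shifts the match flags down by one, so TRUE marks the row after each "="
output <- vec[dplyr::lag(grepl("=", vec, fixed = TRUE))]
# the first flag becomes NA after the shift, so drop the resulting NA element
output[!is.na(output)]
# [1] "93.44" "90.86"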
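For the row-below question above, a base R alternative can be sketched as follows: find the positions of the rows containing "=" with which(), add 1, and index with the result. The data frame here is a made-up stand-in for sagarin2 with a single ranking column (an assumption based on the sample output), and the bounds check only matters if the last row happens to contain "=".

```r
# Hypothetical stand-in for sagarin2, assuming one character column named ranking
sagarin2 <- data.frame(
  ranking = c("2 Baylor =", "93.44", "91.43", "3 Kansas =", "90.86", "00.00"),
  stringsAsFactors = FALSE
)

# Positions of the rows that contain a literal "="
hits <- which(grepl("=", sagarin2$ranking, fixed = TRUE))

# Shift each position down by one and drop any that run past the last row
below <- hits + 1
below <- below[below <= nrow(sagarin2)]

# Keep the data frame shape even though there is only one column
rows_below <- sagarin2[below, , drop = FALSE]
rows_below$ranking
```

This avoids the NA that lag() introduces at the first position, at the cost of one extra line for the bounds check.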
I'm quite a novice at using R but I'm trying to self-teach and learn as I go along. I'm trying to create a loop to download and save multiple met data files individually as csv files using the worldmet package.
I have two variables, the met site code and the years of interest. I have included code to create a list of the years in question:
Startyear <- "2018"
Endyear <- "2020"
Yearlist <- seq(as.numeric(Startyear), as.numeric(Endyear))
and I have a .csv file with all the site codes listed which are required, and have read this into R. See below a simplified version of the dataframe, however in total there are 204 rows. This dataframe is called 'siteinfo'.
code station ctry
037760-99999 GATWICK UK
037690-99999 CHARLWOOD UK
038760-99999 SHOREHAM UK
038820-99999 HERSTMONCEUX WEST END UK
037810-99999 BIGGIN HILL UK
An example of the code to import one years worth of metdata for one site is as follows
importNOAA(code="037760-99999",year=2019,hourly=TRUE,precip=FALSE,PWC=FALSE,parallel=FALSE,quiet=FALSE)
I understand that I likely need a nested loop to change both variables, but I am unsure if I am going about this correctly. I also understand that I need to have quotation marks around the code value for it to be read correctly, however I was wondering if there's a quick way to include this as part of the code rather than editing all 204 values in the csv?
Would I also need a separate loop following downloading the files, or can this be included into one piece of code?
The current code I have, and I am sure there is a lot wrong with this so I appreciate any guidance, is as follows
for(i in 1:siteinfo$code) {
for(j in 1:Yearlist){
importNOAA(code=i,year=j,hourly = TRUE, precip= FALSE, PWC= FALSE, parallel = TRUE, quiet = FALSE)
}}
This currently isn't working, so if you could help me piece this together, and if possible provide any explanation of where I have gone wrong or how I can improve my coding, I would be extremely grateful!
You can avoid loops altogether (better for large data sets and files) with some functions in dplyr and purrr. I get an error for invalid parameters when I try to run your importNOAA code, so I am using a simpler call to that function.
met_data <- siteinfo %>%
full_join(data.frame(year = Yearlist), by = character(0)) %>%
group_by(code, year) %>%
mutate(dat = list(data.frame(code, year))) %>%
mutate(met = purrr::map(dat, function(df) {
importNOAA(code = df$code, year = df$year, hourly=TRUE, quiet=FALSE)
}) ) %>%
select(-dat)
This code returns a tbl_df whose last column is a list of data.frames, each containing the data for one year-code combination. You can use met_data %>% summarize(met) to expand the data into one big data.frame to save to a csv, or, if you want to write them all to individual csvs, use lapply:
lapply(seq_len(nrow(met_data)), function(x) {
  # [[x]] extracts the data.frame itself rather than a one-element list
  write.csv(met_data$met[[x]],
            file = paste0(met_data$station[x], "_", met_data$year[x], ".csv"))})
You can't write a for loop like for(i in 1:siteinfo$code){}, because siteinfo$code is a whole vector and 1:x only uses its first element.
Here is a short example:
for(i in 1:mtcars$mpg){
print(i)
}
output:
numerical expression has 32 elements: only the first used
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
[1] 11
[1] 12
[1] 13
[1] 14
[1] 15
[1] 16
[1] 17
[1] 18
[1] 19
[1] 20
[1] 21
So loop over the indices instead, like this:
for (i in seq_along(siteinfo$code)) {
  for (j in seq_along(Yearlist)) {
    importNOAA(code = siteinfo$code[i], year = Yearlist[j], hourly = TRUE, precip = FALSE, PWC = FALSE, parallel = TRUE, quiet = FALSE)
  }
}
Maybe that works.
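To tie the download and the saving together in one pass, here is a minimal sketch. It assumes the site codes come in from the csv as text (wrap them in as.character() if they were read as factors), so no extra quotation marks are needed. fetch_met() is a hypothetical stand-in for importNOAA() so the example runs without a network connection, and out_dir and the file-name pattern are also just assumptions.

```r
# Two rows of the siteinfo table from the question, enough to show the pattern
siteinfo <- data.frame(
  code = c("037760-99999", "037690-99999"),
  station = c("GATWICK", "CHARLWOOD"),
  stringsAsFactors = FALSE
)
Yearlist <- 2018:2020

# Hypothetical stand-in for worldmet::importNOAA(); swap in the real call
fetch_met <- function(code, year) {
  data.frame(code = code, year = year, ws = NA_real_)
}

# Assumed output location; change to wherever the csv files should go
out_dir <- tempdir()

# One row per site/year combination replaces the nested loop
combos <- expand.grid(code = siteinfo$code, year = Yearlist,
                      stringsAsFactors = FALSE)

written <- character(0)
for (k in seq_len(nrow(combos))) {
  met <- fetch_met(combos$code[k], combos$year[k])
  path <- file.path(out_dir,
                    paste0(combos$code[k], "_", combos$year[k], ".csv"))
  write.csv(met, path, row.names = FALSE)
  written <- c(written, path)
}
```

seq_len(nrow(combos)) is used instead of 1:nrow(combos) so the loop body is skipped cleanly when there are no combinations.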
Sorry I have looked for solutions but couldn't find what was needed. I am quite new to R and have used only matlab before (hence am still trying to work out how not to use loops).
I have a df with academic papers in it (one row per paper).
Main df
Fields Date Title
Biology; Neuroscience 2016 How do we know when XXX
Music; Engineering; Art 2011 Can we get the XXX
Biotechnology; Biology & Chemistry 2007 When will we find XXX
History; Biology 2006 Where does the XXXX
In one column ('Fields') there is a list of subject names, with multiple fields separated by a semicolon. I want to find all rows (papers) that have an exact match to a specific field name (e.g., 'Biology'), then make a new df with all those rows (papers). Importantly, however, I do not want to get fields that only partially match (e.g., 'Biology & Chemistry').
New df - just for those rows
Fields Date Title
Biology; Neuroscience 2016 How do we know when XXX
History; Biology 2006 Where does the XXXX
i.e., does not also select Biotechnology; Biology & Chemistry 2007 When will we find XXX which has the word 'Biology' in it
My first thought was to get each field name in its own column using splitstring, then loop through each column using which to find the exact matches for the name. Because there are up to 200 columns (field names) this takes ages! It's taking up to an hour to find and pull all the rows. I would obviously like something faster.
I know in R you can avoid loops by using the apply family etc., but I can't think how to use that here.
This is what it looks like when I split the author names into separate columns
Field1 Field2 Date Title
Biology Neuroscience 2016 How do we know when XXX
This is my code so far (note: there is a white space in front of the names once I split them up)
# Get list of columns to cycle through (they all start with 'sA')
names <- data[,grep("^sA", colnames(data))]
collist <- colnames(names)
names[collist] <- sapply(names[collist],as.character)
collist <- collist[-1]
# Loop to get new df from matching rows
for (l in 1:length(namesUniq$Names)) {
namecurr <- namesUniq$Names[l]
namecurrSP <- paste0(" ", namecurr)
# Get data for that field
dfall <- data[which(data$sA1 == namecurr), ]
for (d in 1:length(collist)) {
dcol <- collist[d]
dfall <- rbind(dfall, data[which(data[, dcol] == namecurrSP), ])
rm(dcol)
}
rm(d)
}
Something that runs quickly would be really useful. Thank you for any help!
grepl does not work - it pulls other partial match strings (like 'Biology & Chemistry' when I want 'Biology' only)
dfall <- subset(data, grepl(namecurr, Field, fixed = TRUE))
For some reason, which does not work when I do it this way (rows works, rows2 does not - it selects rows outside the bounds of my df)
dfall <- rbind(data[rows, ], data[rows2, ])
Without a dput of your example data, here is an example that can be used.
data
test <- c("Biology; Neuroscience","Music; Engineering; Art","Biotechnology; Biology & Chemistry","History; Biology")
code:
test[sapply(strsplit(test,"; "), function(x) any(x=="Biology"))]
output:
[1] "Biology; Neuroscience" "History; Biology"
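The same idea applied to a whole data frame, as a rough sketch: split on the semicolon only and trim whitespace from each piece, so the match still works if the spacing around the separator is inconsistent. has_field() is a helper name made up for this illustration, and Main_df is rebuilt from the sample in the question.

```r
Main_df <- data.frame(
  Fields = c("Biology; Neuroscience", "Music; Engineering; Art",
             "Biotechnology; Biology & Chemistry", "History; Biology"),
  Date = c(2016, 2011, 2007, 2006),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE for each row whose field list contains an
# exact match for target after splitting and trimming
has_field <- function(fields, target) {
  parts <- strsplit(fields, ";", fixed = TRUE)
  vapply(parts, function(p) target %in% trimws(p), logical(1))
}

Biology_df <- Main_df[has_field(Main_df$Fields, "Biology"), , drop = FALSE]
Biology_df
```

Because the comparison is on whole pieces rather than substrings, 'Biology & Chemistry' and 'Biotechnology' are not picked up.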
Not sure how many different subsets you'll be pulling from your main dataframe, but I thought I would take @Daniel-O's solution a little further and demonstrate a tidyverse version.
You can think of it as making Biology_df by starting with Main_df and keeping every row where, after we str_split the Fields column on semicolon and space ("; "), any piece of the split exactly matches "Biology".
library(dplyr)
library(stringr)
library(purrr)
Main_df
#> Fields Date Title
#> 1 Biology; Neuroscience 2016 How do we know when XXX
#> 2 Music; Engineering; Art 2011 Can we get the XXX
#> 3 Biotechnology; Biology & Chemistry 2007 Where does the XXXX
#> 4 History; Biology 2006 Where does the XXXX
Biology_df <-
Main_df %>%
filter(str_split(Fields, "; ") %>%
map_lgl( ~ any(.x == "Biology")
)
)
Biology_df
#> Fields Date Title
#> 1 Biology; Neuroscience 2016 How do we know when XXX
#> 2 History; Biology 2006 Where does the XXXX
Based upon the little snippet of data you show
Fields <- c("Biology; Neuroscience","Music; Engineering; Art","Biotechnology; Biology & Chemistry","History; Biology")
Date <- c("2016", "2011", "2007", "2006")
Title <- c("How do we know when XXX", "Can we get the XXX", "Where does the XXXX", "Where does the XXXX")
Main_df <- data.frame(Fields, Date, Title)
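If a subset is needed for every field name (the question loops over a list of names), one option is to split the Fields column once and reuse the pieces, rather than re-splitting for each name. The sketch below is in base R under that assumption; by_field is a name invented here, and the data is the same small sample.

```r
Main_df <- data.frame(
  Fields = c("Biology; Neuroscience", "Music; Engineering; Art",
             "Biotechnology; Biology & Chemistry", "History; Biology"),
  Date = c(2016, 2011, 2007, 2006),
  stringsAsFactors = FALSE
)

# Split every row once; each list element holds that row's trimmed field names
parts <- lapply(strsplit(Main_df$Fields, ";", fixed = TRUE), trimws)

# Every distinct field name that occurs anywhere
all_fields <- sort(unique(unlist(parts)))

# Named list with one data frame per field, exact matches only
by_field <- lapply(setNames(all_fields, all_fields), function(f) {
  keep <- vapply(parts, function(p) f %in% p, logical(1))
  Main_df[keep, , drop = FALSE]
})

by_field[["Biology"]]
```

The split is the expensive step, so doing it once should help most when there are many field names to look up.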
I found some similar questions such as this one (about comparing attributes in XML files), this one (about a case where the compared values are numeric) and this one (about getting a number of columns that differ between two files) but nothing about this particular problem.
I have two CSV text files in which many, but not all, rows are equal. The files have the same number of columns, with the same data types, but they do not have the same number of rows. Each file has around 120K rows, and each has some rows that are not in the other.
Simplified versions of these files would look as shown below.
File 1:
PROFILE.ID,CITY,STATE,USERID
2265,Miami,Florida,EL4950
4350,Nashville,Tennessee,GW7420
5486,Durango,Colorado,BH9012
R719,Flagstaff,Arizona,YT7460
Z551,Flagstaff,Arizona,ML1451
File 2:
PROFILE.ID,CITY,STATE,USERID
1173,Nashville,Tennessee,GW7420
2265,Miami,Florida,EL4950
R540,Flagstaff,Arizona,YT7460
T216,Durango,Colorado,BH9012
In the actual files many of the USERID values in the first file can also be found in the second file (some may not be present however). Also while the USERID values are unchanged for all users, their PROFILE.ID may have changed.
The problem is that I would have to locate the rows where the PROFILE.ID has changed.
I am thinking that I would have to use the following sequence of steps to analyze it in R:
1. Load both files into RStudio as data frames
2. Loop through the USERID column of the first file (which has more rows)
3. Search the second file for each USERID found in the first file
4. Return the corresponding PROFILE.ID from the second file
5. Compare the returned value with what is in the first file
6. Output the rows where the PROFILE.ID values differ
I was thinking of writing something like the code shown below but am not sure if there are better ways to accomplish this.
library(tidyverse)
con1 <- file("file1.csv", open = "r")
con2 <- file("file2.csv", open = "r")
file1 <- read.csv(con1, fill = F, colClasses = "character")
file2 <- read.csv(con2, fill = F, colClasses = "character")
for (i in seq(nrow(file1))) {
profIDFile1 <- file1$PROFILE.ID[i]
userIDFile1 <- file1$USERID[i]
profIDRowFile2 <- filter(file2, USERID == userIDFile1)
profIDFile2 <- profIDRowFile2$PROFILE.ID
if (profIDFile1 != profIDFile2) {
output <- profIDRowFile2
}
}
write.csv(output, file='result.csv', row.names=FALSE, quote=FALSE)
close(con1)
close(con2)
Question:
Is there a package in R that can do this kind of comparison or what would be a good way to accomplish this in R script?
I think you can do this with a simple join:
library(dplyr)
full_join(file1, file2, by = "USERID") %>%
filter(PROFILE.ID.x != PROFILE.ID.y)
# PROFILE.ID.x CITY.x STATE.x USERID PROFILE.ID.y CITY.y STATE.y
# 1 4350 Nashville Tennessee GW7420 1173 Nashville Tennessee
# 2 5486 Durango Colorado BH9012 T216 Durango Colorado
# 3 R719 Flagstaff Arizona YT7460 R540 Flagstaff Arizona
This shows that those three USERID rows have differing PROFILE.ID fields. (The .x are from file1, .y from file2.)
That test does not deal very well with IDs that are missing in one, so you might add logic such as:
full_join(file1, file2, by = "USERID") %>%
filter(is.na(PROFILE.ID.x) | is.na(PROFILE.ID.y) |
PROFILE.ID.x != PROFILE.ID.y)
# PROFILE.ID.x CITY.x STATE.x USERID PROFILE.ID.y CITY.y STATE.y
# 1 4350 Nashville Tennessee GW7420 1173 Nashville Tennessee
# 2 5486 Durango Colorado BH9012 T216 Durango Colorado
# 3 R719 Flagstaff Arizona YT7460 R540 Flagstaff Arizona
# 4 Z551 Flagstaff Arizona ML1451 <NA> <NA> <NA>
The fourth row indicates an ID missing in file2. This here is likely an artifact of a small sample dataset (which is good on SO :-), I'm not certain if this is interesting or meaningful to you.
We can do this with base R
subset(merge(file1, file2, by = 'USERID'), PROFILE.ID.x != PROFILE.ID.y)
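A self-contained version of the base R approach, built from the sample rows in the question so it can be run as-is. This is only a sketch: the suffixes and the names changed and only_in_file1 are choices made for illustration, and the unmatched USERIDs are reported separately rather than mixed in with the changed ones.

```r
file1 <- data.frame(
  PROFILE.ID = c("2265", "4350", "5486", "R719", "Z551"),
  CITY = c("Miami", "Nashville", "Durango", "Flagstaff", "Flagstaff"),
  STATE = c("Florida", "Tennessee", "Colorado", "Arizona", "Arizona"),
  USERID = c("EL4950", "GW7420", "BH9012", "YT7460", "ML1451"),
  stringsAsFactors = FALSE
)
file2 <- data.frame(
  PROFILE.ID = c("1173", "2265", "R540", "T216"),
  CITY = c("Nashville", "Miami", "Flagstaff", "Durango"),
  STATE = c("Tennessee", "Florida", "Arizona", "Colorado"),
  USERID = c("GW7420", "EL4950", "YT7460", "BH9012"),
  stringsAsFactors = FALSE
)

# Inner join on USERID, keeping only the two columns being compared
both <- merge(file1[, c("USERID", "PROFILE.ID")],
              file2[, c("USERID", "PROFILE.ID")],
              by = "USERID", suffixes = c(".file1", ".file2"))

# Users present in both files whose PROFILE.ID differs
changed <- both[both$PROFILE.ID.file1 != both$PROFILE.ID.file2, ]

# Users that appear in file1 but have no match in file2
only_in_file1 <- file1[!file1$USERID %in% file2$USERID, ]

changed
only_in_file1
```

Reading both files with colClasses = "character", as the question already does, keeps IDs such as 2265 and R719 comparable as strings.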
I want to ingest all files in the working directory and scan all rows for line breaks or carriage returns. Instead of eliminating them, I'd like to divert them into a new output file for manual review. Here's what I have so far:
library(plyr)
library(dplyr)
library(readxl)
filenames <- list.files(pattern = "Sara Lee.*\\.xlsx$", ignore.case = TRUE)
read_excel_filename <- function(filename){
ret <- read_excel(filename, col_names = TRUE, skip = 5, trim_ws = FALSE)
ret
}
import.list <- ldply(filenames, read_excel_filename)
returnornewline <- import.list[((import.list$"CUSTOMER SEGMENT")=="[\r\n]"|(import.list$"SECTOR NAME")=="[\r\n]"|
(import.list$"LOCATION NAME")=="[\r\n]"|(import.list$"LOCATION ID")=="[\r\n]"|
(import.list$"ADDRESS")=="[\r\n]"|(import.list$"CITY")=="[\r\n]"|
(import.list$"STATE")=="[\r\n]"|(import.list$"ZIP CODE")=="[\r\n]"|
(import.list$"DISTRIBUTOR NAME")=="[\r\n]"|(import.list$"REDISTRIBUTOR NAME")=="[\r\n]"|
(import.list$"TRANS DATE")=="[\r\n]"|(import.list$"DIST. INVOICE")=="[\r\n]"|
(import.list$"ITEM MIN")=="[\r\n]"|(import.list$"ITEM LABEL")=="[\r\n]"|
(import.list$"ITEM DESC")=="[\r\n]"|(import.list$"PACK SIZE")=="[\r\n]"|
(import.list$"REBATEABLE UOM")=="[\r\n]"|(import.list$"QUANTITY")=="[\r\n]"|
(import.list$"SALES VOLUME")=="[\r\n]"|(import.list$"X__1")=="[\r\n]"|
(import.list$"X__2")=="[\r\n]"|(import.list$"X__3")=="[\r\n]"|
(import.list$"VA PER")=="[\r\n]"|(import.list$"VA PER CODE")=="[\r\n]"|
(import.list$"TOTAL REBATE")=="[\r\n]"|(import.list$"TOTAL ADMIN FEE")=="[\r\n]"|
(import.list$"TOTAL INVOICED")=="[\r\n]"|(import.list$"STD VA PER")=="[\r\n]"|
(import.list$"STD VA PER CODE")=="[\r\n]"|(import.list$"EXC TYPE CODE")=="[\r\n]"|
(import.list$"EXC EXC VA PER")=="[\r\n]"|(import.list$"EXC VA PER CODE")=="[\r\n]"), ]
now <- Sys.time()
carriage_return_file_name <- paste(format(now,"%Y%m%d"),"ROWS with Carriage Returns or New Lines.csv",sep="_")
write.csv(returnornewline, carriage_return_file_name, row.names = FALSE)
Here's some sample data:
Customer Segment Address
BuyFood 123 Main St.\r
BigKetchup 679 Smith Dr.\r
DownUnderMeat 410 Crocodile Way
BuyFood 123 Main St.
I thought the trim_ws = FALSE condition would work, but it hasn't.
Apologies for the column spam, I've yet to figure out an easier way to scan all the columns without listing them. Any help on that issue is appreciated as well.
EDIT: Added some sample data. I don't know how to show a carriage return in the address other than the regex of it. It doesn't look like that in the real sample data, that's just for our use here. Please let me know if that's not clear. The desired output would take the first 2 rows of data where there's a carriage return and output it to the csv file listed at the end of the code block.
EDIT 2: I used the code provided in the suggestion in place of the original long list of columns as follows. However, this doesn't give me a new variable that contains a dataframe of rows with new lines or carriage returns. When I look at my global environment in R Studio I see another variable under Data called "returnornewline" but it shows as a large list, unlike the import.list variable which shows a dataframe. This shouldn't be the case because I've only added a carriage return in the first row of the first spreadsheet of the data, so that list should not be so large.:
returnornewline <- lapply(import.list, function(x) lapply(x, function(s) grep("\r", s)))
# returnornewline <- import.list[((import.list$"CUSTOMER SEGMENT")=="[\r\n]"|(import.list$"SECTOR NAME")=="[\r\n]"|
# (import.list$"LOCATION NAME")=="[\r\n]"|(import.list$"LOCATION ID")=="[\r\n]"|
# (import.list$"ADDRESS")=="[\r\n]"|(import.list$"CITY")=="[\r\n]"|
# (import.list$"STATE")=="[\r\n]"|(import.list$"ZIP CODE")=="[\r\n]"|
# (import.list$"DISTRIBUTOR NAME")=="[\r\n]"|(import.list$"REDISTRIBUTOR NAME")=="[\r\n]"|
# (import.list$"TRANS DATE")=="[\r\n]"|(import.list$"DIST. INVOICE")=="[\r\n]"|
# (import.list$"ITEM MIN")=="[\r\n]"|(import.list$"ITEM LABEL")=="[\r\n]"|
# (import.list$"ITEM DESC")=="[\r\n]"|(import.list$"PACK SIZE")=="[\r\n]"|
# (import.list$"REBATEABLE UOM")=="[\r\n]"|(import.list$"QUANTITY")=="[\r\n]"|
# (import.list$"SALES VOLUME")=="[\r\n]"|(import.list$"X__1")=="[\r\n]"|
# (import.list$"X__2")=="[\r\n]"|(import.list$"X__3")=="[\r\n]"|
# (import.list$"VA PER")=="[\r\n]"|(import.list$"VA PER CODE")=="[\r\n]"|
# (import.list$"TOTAL REBATE")=="[\r\n]"|(import.list$"TOTAL ADMIN FEE")=="[\r\n]"|
# (import.list$"TOTAL INVOICED")=="[\r\n]"|(import.list$"STD VA PER")=="[\r\n]"|
# (import.list$"STD VA PER CODE")=="[\r\n]"|(import.list$"EXC TYPE CODE")=="[\r\n]"|
# (import.list$"EXC EXC VA PER")=="[\r\n]"|(import.list$"EXC VA PER CODE")=="[\r\n]"), ]
EDIT 3: I need to be able to take all rows in the newly created data frame "import.list" and scan them for any instances of carriage returns or new lines within all the rows. The example above is rudimentary, but the concept stands. In the example, I'd expect for the script to read the first two rows and say "hey, these rows have carriage returns, add this to the variable assigned to this line of code and at the end of the script output this data to a csv." The remaining two rows in the sample data above have no need to be output because they have no carriage returns in their data.
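One way to scan every column without listing them, sketched on the sample data: run grepl() with the pattern "[\r\n]" over each column and combine the per-column results with a logical OR, so a row is flagged if any of its cells contains a carriage return or a new line. The == comparisons in the original only match cells that are literally the string "[\r\n]", which is why a pattern match is used here. The two-column data frame is a made-up stand-in for import.list.

```r
# Hypothetical stand-in for import.list, with carriage returns in two addresses
import.list <- data.frame(
  `CUSTOMER SEGMENT` = c("BuyFood", "BigKetchup", "DownUnderMeat", "BuyFood"),
  ADDRESS = c("123 Main St.\r", "679 Smith Dr.\r",
              "410 Crocodile Way", "123 Main St."),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# One logical vector per column: does the cell contain \r or \n?
per_column <- lapply(import.list, function(col) {
  grepl("[\r\n]", as.character(col))
})

# A row is flagged when any of its columns matched
hits <- Reduce(`|`, per_column)

# Data frame of flagged rows, ready for write.csv()
returnornewline <- import.list[hits, , drop = FALSE]
nrow(returnornewline)
```

The result is a data frame rather than a nested list, so it can be passed straight to the write.csv() call at the end of the original script.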
I have a csv file named 'Campaignname.csv':
AdvertiserName,CampaignName
Wells Fargo,Gary IN MetroChicago IL Metro
EMC,Los Angeles CA MetroBoston MA Metro
Apple,Cupertino CA Metro
Desired Output in R
AdvertiserName,City,State
Wells Fargo,Gary,IN
Wells Fargo,Chicago,IL
EMC,Los Angeles,CA
EMC,Boston,MA
Apple,Cupertino,CA
The code to the solution was given in a previous stackoverflow answer as:
## read the csv file - modify next line as needed
xx <- read.csv("Campaignname.csv",header=TRUE)
s <- strsplit(xx$CampaignName, " Metro")
names(s) <- xx$Market
ss <- stack(s)
DF <- with(ss, data.frame(Market = ind,
City = sub(" ..$", "", values),
State = sub(".* ", "", values)))
write.csv(DF, file = "myfile.csv", row.names = FALSE, quote = FALSE)
But now another column like 'Identity' is included where the input is
Market,CampaignName,Identity
Wells Fargo,Gary IN MetroChicago IL Metro,56
EMC,Los Angeles CA MetroBoston MA Metro,78
Apple,Cupertino CA Metro,68
And the desired result is
Market,City,State,Identity
Wells Fargo,Gary,IN,56
Wells Fargo,Chicago,IL,56
EMC,Los Angeles,CA,78
EMC,Boston,MA,78
Apple,Cupertino,CA,68
The number of columns may not be limited to just 3 columns, it may keep on increasing.
How can I do this in R? I'm new to R. Any help is appreciated.
I'm not sure that I fully understand your question, and you didn't provide a reproducible example (so I can't run your code and try to get to the end point you want). But I'll still try to help.
Generally speaking, in R you can add a new column to a data.frame simply by assigning to it.
df = data.frame(advertiser = c("co1", "co2", "co3"),
campaign = c("camp1", "camp2", "camp3"))
df
advertiser campaign
1 co1 camp1
2 co2 camp2
3 co3 camp3
At this point, if I wanted to add an identity column I would simply create it with the $ operator like this:
df$identity = c(1, 2, 3)
df
advertiser campaign identity
1 co1 camp1 1
2 co2 camp2 2
3 co3 camp3 3
Note that there are other ways to accomplish this - see the transform (?transform) and cbind (?cbind) functions.
The caveat when adding a column to a data.frame is that I believe you must add a vector that has the same number of elements as there are rows in the data.frame. You can see the number of rows in the data.frame by typing nrow(df).
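Returning to the question of carrying extra columns such as Identity through the split, here is one possible sketch. It repeats each input row once per campaign piece with rep() and then attaches every column other than Market and CampaignName, so additional columns should come along without being named. The data frame is typed in from the sample in the question, and idx, values and other are names chosen for this illustration.

```r
xx <- data.frame(
  Market = c("Wells Fargo", "EMC", "Apple"),
  CampaignName = c("Gary IN MetroChicago IL Metro",
                   "Los Angeles CA MetroBoston MA Metro",
                   "Cupertino CA Metro"),
  Identity = c(56, 78, 68),
  stringsAsFactors = FALSE
)

# Split each campaign string into "City ST" pieces
s <- strsplit(xx$CampaignName, " Metro", fixed = TRUE)

# Row index repeated once per piece, so other columns can be expanded to match
idx <- rep(seq_len(nrow(xx)), lengths(s))
values <- unlist(s)

# Every column that should simply be carried along
other <- setdiff(names(xx), c("Market", "CampaignName"))

DF <- data.frame(
  Market = xx$Market[idx],
  City = sub(" ..$", "", values),   # drop the trailing two-letter state
  State = sub(".* ", "", values),   # keep only the text after the last space
  xx[idx, other, drop = FALSE],
  stringsAsFactors = FALSE
)
rownames(DF) <- NULL
DF
```

The result can then be written out with write.csv(DF, file = "myfile.csv", row.names = FALSE, quote = FALSE), as in the earlier solution.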