Substring (variable length) values in entire column of dataframe - r

I have searched for this tirelessly with no luck. I come from a Java background and am new to R. (On a side note, I am loving R, but disliking string operations in it as well as the documentation; maybe that's just a Java bias.)
Anyhow, I have a data frame with a single column composed of longitude and latitude numbers separated by colons, e.g. ROAD:_:-87.4968190989999:38.7414455360001
I would like to create 2 new data frames, one holding the lat numbers and one holding the long numbers.
I have successfully written a piece of code that uses for loops (but I know this is inefficient, and that there has to be another way).
Here is a snippet of the inefficient code:
length <- length(fromLatLong)
for (i in 1:length){
  fromLat[i] <- strsplit(fromLatLong[i] ,":")[[1]][4]
}
for (i in 1:length){
  fromLong[i] <- strsplit(fromLatLong[i] ,":")[[1]][3]
}
for (i in 1:length){
  toLat[i] <- strsplit(toLatLong[i] ,":")[[1]][4]
}
for (i in 1:length){
  toLong[i] <- strsplit(toLatLong[i] ,":")[[1]][3]
}
Here is how I tried to optimize it using mutate, but I only get the first value copied over to all rows as such:
fromLat = mutate(fromLatLong, FROM_NODE_ID = (strsplit(as.character(fromLatLong$FROM_NODE_ID),":")[[1]][4]))
fromLong = mutate(fromLatLong, FROM_NODE_ID = (strsplit(fromLatLong$FROM_NODE_ID,":")[[1]][3]))
toLat = mutate(toLatLong, TO_NODE_ID = (strsplit(toLatLong$TO_NODE_ID,":")[[1]][4]))
toLong = mutate(toLatLong, TO_NODE_ID = (strsplit(toLatLong$TO_NODE_ID,":")[[1]][3]))
And here is the result:
      FROM_NODE_ID
1 38.7414455360001
2 38.7414455360001
3 38.7414455360001
4 38.7414455360001
5 38.7414455360001
6 38.7414455360001
7 38.7414455360001
8 38.7414455360001
9 38.7414455360001
I would appreciate your help on this. Thanks.

You can use the map_chr function of the purrr package. For instance:
fromLat = mutate(fromLatLong, FROM_NODE_ID = map_chr(FROM_NODE_ID, ~ strsplit(as.character(.x),":")[[1]][4]))

The following expression will produce a data frame with each of the colon-delimited components as a separate column. You can then break this up into separate data frames or do whatever else you want with it.
as.data.frame(t(matrix(unlist(strsplit(fromLatLong$coords, ":", fixed=TRUE), recursive=FALSE), nrow=4)),stringsAsFactors=FALSE)
(Assuming the column name of your values in the data frame is coords.)
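For context, the mutate attempt in the question repeats one value because [[1]] selects only the first row's split result, and that single string is then recycled down the whole column. Below is a minimal base R sketch of a vectorized alternative. It assumes the column is named FROM_NODE_ID as in the question; the two sample rows are made up for illustration.

```r
# Hypothetical sample data in the format shown in the question
fromLatLong <- data.frame(
  FROM_NODE_ID = c("ROAD:_:-87.4968190989999:38.7414455360001",
                   "ROAD:_:-87.5000000000000:38.7500000000000"),
  stringsAsFactors = FALSE
)

# strsplit() is already vectorized: it returns one character vector per row
parts <- strsplit(fromLatLong$FROM_NODE_ID, ":", fixed = TRUE)

# Pick the 3rd (longitude) and 4th (latitude) piece from every row
fromLong <- vapply(parts, function(p) p[3], character(1))
fromLat  <- vapply(parts, function(p) p[4], character(1))

# Optionally convert the strings to numbers
fromLatNum <- as.numeric(fromLat)
```

vapply() is used here instead of sapply() so that the result is guaranteed to be a character vector with one element per row.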

Related

Finding out if column has only identical values in R for if...else

I'm new to R and have the following problem. I have a big df that looks something like this...
PPC=precipitation
ID  Area  ppc_1  ppc_2  ppc_3  ....  other_values
1   PB    0.989  0.787  0.676  ..
2   DR    0.978  0.989  .....  ..
3   ...   ...    ...    ...
I need to test for normal distribution for every PPC station per area with the Shapiro-Wilk test. To get the necessary data from the weather stations into one column, I used the following for loop.
My problem is that some weather stations seem to have identical values, which causes an error in the Shapiro-Wilk test. I tried to circumvent it with an if...else statement, but the if clause does not work.
#create empty vector to put results in
pvalue_ppc <- as.character()
for (i in 1:length(df$ID)) {
  ####filter for ID and PPC
  temp_df <- df %>% select(starts_with("ppc"), ID) %>% filter(ID == i)
  #melt weather stations over ID
  temp_df_ID_ppc <- melt(temp_df,
                         id = "ID",
                         value.name = "all_ppc")
  #i want to test if all values in df$all_ppc are the same = TRUE
  if (isTRUE(all.equal(temp_df_ID_ppc$all_ppc, countEQ = TRUE))) {
    #and if all values are equal, put an "NA" into the vector
    output <- NA
    pvalue_ppc[i] <- rbind(output)
  }
  else {
    output <- shapiro.test(ttemp_df_ID_ppc$all_ppc)
    #bind results to vector
    pvalue_ppc[i] <- rbind(output$p.value)
  }
}
#bind vector to orig. df
df <- cbind(df, "Pvalue_PPC" = pvalue_ppc)
The error I get is
$ operator is invalid for atomic vectors
or it returns TRUE even if not all values are equal.
I also tried using identical(), but it only compares one vector to another.
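Both all.equal() and identical() compare two objects with each other, so neither tests whether the values inside a single vector are all the same. One possible way to do that check is sketched below. The helper names all_same and safe_shapiro_p are hypothetical, and the small tolerance is an assumption meant to also catch values that differ only by floating-point noise.

```r
# TRUE when the non-missing values of x span (almost) no range.
# The tolerance is an assumption, not something taken from the question.
all_same <- function(x, tol = 1e-10) {
  x <- x[!is.na(x)]
  length(x) == 0 || diff(range(x)) < tol
}

# Hypothetical helper: Shapiro-Wilk p-value, or NA when the test cannot run
safe_shapiro_p <- function(x) {
  x <- x[!is.na(x)]
  # shapiro.test() needs at least 3 values and rejects identical values
  if (length(x) < 3 || all_same(x)) {
    return(NA_real_)
  }
  shapiro.test(x)$p.value
}

p_constant <- safe_shapiro_p(c(0.5, 0.5, 0.5, 0.5))         # NA, no error
p_varied   <- safe_shapiro_p(c(0.989, 0.787, 0.676, 0.52))  # a numeric p-value
```

Inside the loop from the question, the if...else could then be replaced by a single assignment such as pvalue_ppc[i] <- safe_shapiro_p(temp_df_ID_ppc$all_ppc), with pvalue_ppc created as a numeric vector.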

R: Replace all Values that are not equal to a set of values

All,
I've been trying to solve a problem on a large data set for some time and could use some of your wisdom.
I have a DF (1.3M obs) with a column called customer along with 30 other columns. Let's say it contains multiple instances of customers Customer1 through Customer3000. I know that I have issues with 30 of those customers. I need to find all the customers that are NOT the customers I have issues with and replace the value in the 'customer' column with the text 'Supported Customer'. That seems like it should be a simple thing... if it weren't for the number of obs, I would have loaded it up in Excel, filtered all the bad customers out, and copy/pasted the text 'Supported Customer' over what remained.
I've tried replace and str_replace_all using grepl and paste/paste0, but to no avail. My current code looks like this:
#All the customers that have issues
out <- c("Customer123", "Customer124", "Customer125", "Customer126", "Customer127",
"Customer128", ..... , "Customer140")
#Look for everything that is NOT in the list above and replace with "Enabled"
orderData$customer <- str_replace_all(orderData$customer,
    paste0("[^", paste(out, collapse = "|"), "]"),
    "Enabled Customers")
That code gets me this error:
Error in stri_replace_all_regex(string, pattern, fix_replacement(replacement), :
In a character range [x-y], x is greater than y. (U_REGEX_INVALID_RANGE)
I've tried the inverse of this approach and pulled a list of all obs that don't match the list of out customers. Something like this:
in <- orderData %>% filter(!customer %in% out) %>% select(customer) %>%
distinct(customer)
This gets me a much larger list of customers that ARE enabled (~3,100). Using the str_replace_all and paste approach seems to have issues though. With this many patterns, paste no longer collapses using the "|" operator. Instead I get a string that looks like:
"c(\"Customer1\", \"Customer2345\", \"Customer54\", ......)
When passed into str_replace_all, this does not match any patterns.
Anyways, there's got to be an easier way to do this. Thanks for any/all help.
Here is a data.table approach.
First, some example data since you didn't provide any.
customer <- sample(paste0("Customer",1:300),5000,replace = TRUE)
orderData <- data.frame(customer = sample(paste0("Customer",1:300),5000,replace = TRUE),stringsAsFactors = FALSE)
orderData <- cbind(orderData, matrix(runif(n = 5000*30, min = 0, max = 100), ncol = 30))
out <- c("Customer123", "Customer124", "Customer125", "Customer126", "Customer127", "Customer128","Customer140")
library(data.table)
setDT(orderData)
result <- orderData[!(customer %in% out),customer := gsub("Customer","Supported Customer ",customer)]
result
customer 1 2 3 4 5 6 7 8 9
1: Supported Customer 134 65.35091 8.57117 79.594166 84.88867 97.225276 84.563997 17.15166 41.87160 3.717705
2: Supported Customer 225 72.95757 32.80893 27.318046 72.97045 28.698518 60.709381 92.51114 79.90031 7.311200
3: Supported Customer 222 39.55269 89.51003 1.626846 80.66629 9.983814 87.122153 85.80335 91.36377 14.667535
4: Supported Customer 184 24.44624 20.64762 9.555844 74.39480 49.189537 73.126275 94.05833 36.34749 3.091072
5: Supported Customer 194 42.34858 16.08034 34.182737 75.81006 35.167769 23.780069 36.08756 26.46816 31.994756
---
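The data.table answer above keeps the customer number in the new label. If the goal is the literal text 'Supported Customer' for every customer outside the problem list, as the question describes, no regular expression is needed: a bracket expression like [^...] negates a set of single characters, not a list of alternatives, so %in% is the more natural tool. A minimal base R sketch with made-up example rows:

```r
# Hypothetical example data; column and vector names follow the question
orderData <- data.frame(
  customer = c("Customer1", "Customer123", "Customer54", "Customer140"),
  stringsAsFactors = FALSE
)
out <- c("Customer123", "Customer124", "Customer140")

# Logical index of rows whose customer is NOT in the problem list
supported <- !(orderData$customer %in% out)

# Overwrite only those rows; customers in `out` keep their names
orderData$customer[supported] <- "Supported Customer"
```

This assumes customer is a character column. If it is a factor, converting it with as.character() first avoids assigning a value that is not an existing level.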

r - getting all NA in ordered factor column

Instead of showing more2 or less2 in the column, it only shows up as NA. Why aren't the character names appearing instead?
careermore2 <- vector(mode="character",length=length(mlb$careeryrs))
"less2" <- careermore2[mlb$careeryrs<=2]
"more2" <- careermore2[mlb$careeryrs>=2]
No.seasons <- factor(careermore2,levels=c("more2","less2"),exclude=NA,ordered=TRUE)
mlb2 <- cbind(mlb,No.seasons)
str(mlb2$No.seasons)
head(mlb2$No.seasons)
mlb2[mlb2$No.seasons=="more2",]
Looking at careermore2 I would say you've got these the wrong way round:
"less2" <- careermore2[mlb$careeryrs<=2]
"more2" <- careermore2[mlb$careeryrs>=2]
That creates two objects. You really meant:
careermore2[mlb$careeryrs<=2] = "less2"
careermore2[mlb$careeryrs>=2] = "more2"
i.e. set the corresponding values in careermore2. And you probably want <2 or >2 in one of the two conditions rather than = in both, so that a value of exactly 2 is not matched by both assignments...
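As a compact variant of the corrected assignments, ifelse() can build the labels in one step. This is only a sketch: the mlb data frame below is a made-up stand-in, and it assumes that exactly 2 career years should count as "less2".

```r
# Hypothetical stand-in for the mlb data frame from the question
mlb <- data.frame(careeryrs = c(1, 2, 3, 5))

# One label per row; assumption: exactly 2 years counts as "less2"
careermore2 <- ifelse(mlb$careeryrs <= 2, "less2", "more2")

# Same level order as in the question
No.seasons <- factor(careermore2, levels = c("more2", "less2"), ordered = TRUE)
mlb2 <- cbind(mlb, No.seasons)

# Rows with more than 2 career years
more2_rows <- mlb2[mlb2$No.seasons == "more2", ]
```

Because every element of careermore2 is now one of the declared levels, the factor no longer contains NA.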

Function to iterate over list, merging results into one data frame

I've completed the first couple of R courses on DataCamp, and in order to build up my skills I've decided to use R to prep for fantasy football this season, so I have begun playing around with the nflscrapR package.
With the nflscrapR package, one can pull Game Information using the season_games() function which simply returns a data frame with the gameID, game date, the home and away team abbreviations.
Example:
games.2012 = season_games(2012)
head(games.2012)
GameID date home away season
1 2012090500 2012-09-05 NYG DAL 2012
2 2012090900 2012-09-09 CHI IND 2012
3 2012090908 2012-09-09 KC ATL 2012
4 2012090907 2012-09-09 CLE PHI 2012
5 2012090906 2012-09-09 NO WAS 2012
6 2012090905 2012-09-09 DET STL 2012
Initially I copied and pasted the original function call and changed the last digit manually for each season, then rbinded all the seasons into one data frame, games.
games.2012 <- season_games(2012)
games.2013 <- season_games(2013)
games.2014 <- season_games(2014)
games.2015 <- season_games(2015)
games = rbind(games.2012,games.2013,games.2014,games.2015)
I'd like to write a function to simplify this process.
My failed attempt:
gameID <- function(years) {
for (i in years) {
games[i] = season_games(years[i])
}
}
With years = list(2012, 2013) for testing purposes, produced the following:
Error in strsplit(headers, "\r\n") : non-character argument
Called from: strsplit(headers, "\r\n")
Thanks in advance!
While @Gregor has an apparent solution, he didn't run it because this wasn't a minimal example. I googled, found, and tried to use this code, and it doesn't work, at least not in a reasonable amount of time.
On the other hand, I took this code from Vivek Patil's blog.
library(XML)
weeklystats = as.data.frame(matrix(ncol = 14)) # Initializing our empty dataframe
names(weeklystats) = c("Week", "Day", "Date", "Blank",
"Win.Team", "At", "Lose.Team",
"Points.Win", "Points.Lose",
"YardsGained.Win", "Turnovers.Win",
"YardsGained.Lose", "Turnovers.Lose",
"Year") # Naming columns
URLpart1 = "http://www.pro-football-reference.com/years/"
URLpart3 = "/games.htm"
#### Our workhorse function ####
getData = function(URLpart1, URLpart3) {
  for (i in 2012:2015) {
    URL = paste(URLpart1, as.character(i), URLpart3, sep = "")
    tablefromURL = readHTMLTable(URL)
    table = tablefromURL[[1]]
    names(table) = c("Week", "Day", "Date", "Blank", "Win.Team", "At", "Lose.Team",
                     "Points.Win", "Points.Lose", "YardsGained.Win", "Turnovers.Win",
                     "YardsGained.Lose", "Turnovers.Lose")
    table$Year = i # Inserting a value for the year
    weeklystats = rbind(table, weeklystats) # Appending happening here
  }
  return(weeklystats)
}
I posted this because it works, you might learn something about web scraping you didn't know, and it runs in 11 seconds.
system.time(weeklystats <- getData(URLpart1, URLpart3))
user system elapsed
0.870 0.014 10.926
You should probably take a look at some popular answers for working with lists, specifically How do I make a list of data frames? and What's the difference between [ and [[?.
There's no reason to put your years in a list. They're just integers, so just use a normal vector.
years = 2012:2015
Then we can get your function to work (we'll need to initialize an empty list before the for loop):
gameID <- function(years) {
  games = list()
  # loop over positions (1, 2, ...) so that years[i] and games[[i]] line up
  for (i in seq_along(years)) {
    games[[i]] = season_games(years[i])
  }
  return(games)
}
Read my link above for why we're using [[ with the list and [ with the vector. And we could run it like this:
game_list = gameID(2012:2015)
But this is such a simple function that it's easier to use lapply. Your function is just a wrapper around a for loop that returns a list, and that's precisely what lapply is too. But where your function has season_games hard-coded in, lapply can work with any function.
game_list = lapply(2012:2015, season_games)
# should be the same result as above
In either case, we have the list of data frames and want to combine it into one big data frame. The base R way is rbind with do.call, but dplyr and data.table have more efficient versions.
# pick your favorite
games = do.call(rbind, args = game_list) # base
games = dplyr::bind_rows(game_list)
games = data.table::rbindlist(game_list)
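The lapply plus do.call(rbind, ...) pattern can be tried without network access by swapping in a stand-in for season_games(). In the sketch below, fake_season_games is a hypothetical placeholder that just returns two made-up rows per season; it is not part of nflscrapR.

```r
# Hypothetical stand-in for season_games(): two made-up rows per season
fake_season_games <- function(year) {
  data.frame(
    GameID = paste0(year, c("090500", "090900")),
    season = year,
    stringsAsFactors = FALSE
  )
}

years <- 2012:2015

# One data frame per season, collected in a list
game_list <- lapply(years, fake_season_games)

# Stack the list into a single data frame (base R)
games <- do.call(rbind, game_list)
```

Replacing fake_season_games with season_games should give the real result, assuming the package function accepts a single year as in the question.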

Ordering Merged data frames

As a fairly new R programmer I seem to have run into a strange problem, probably due to my inexperience with R.
After reading and merging successive files into a single data frame, I find that order does not sort the data as expected.
I have multiple references in each file but each file refers to measurement data obtained at a different time.
Here's the code
library(reshape)
# Enter file name to Read & Save data
FileName=readline("Enter File name:\n")
# Find first occurrence of file
for ( round1 in 1 : 6) {
ReadFile=paste(round1,"C_",FileName,"_Stats.csv", sep="")
if (file.exists(ReadFile))
break
}
x = data.frame(read.csv(ReadFile, header=TRUE),rnd=round1)
for ( round2 in (round1+1) : 6) {
#
ReadFile=paste(round2,"C_",FileName,"_Stats.csv", sep="")
if (file.exists(ReadFile)) {
y = data.frame(read.csv(ReadFile, header=TRUE),rnd = round2)
if (round2 == (round1 +1))
z=data.frame(merge(x,y,all=TRUE))
z=data.frame(merge(y,z,all=TRUE))
}
}
ordered = order(z$lab_id)
results = z[ordered,]
res = data.frame( lab=results[,"lab_id"],bw=results[,"ZBW"],wi=results[,"ZWI"],pf_zbw=0,pf_zwi=0,r = results[,"rnd"])
#
# Establish no of samples recorded
nsmpls = length(res[,c("lab")])
# Evaluate Z_scores for Between Lab Results
for ( i in 1 : nsmpls) {
if (res[i,"bw"] > 3 | res[i,"bw"] < -3)
res[i,"pf_zbw"]=1
}
# Evaluate Z_scores for Within Lab Results
for ( i in 1 : nsmpls) {
if (res[i,"wi"] > 3 | res[i,"wi"] < -3)
res[i,"pf_zwi"]=1
}
dd = melt(res, id=c("lab","r"), "pf_zbw")
b = cast(dd, lab ~ r)
If anyone can see why the ordering only works for about 55 of 70 records and can steer me in the right direction, I would be obliged.
Thanks very much
Check whether z$lab_id is a factor (with is.factor(z$lab_id)).
If it is, try
z$lab_id <- as.character(z$lab_id)
if it is supposed to be a character vector; or
z$lab_id <- as.numeric(as.character(z$lab_id))
if it is supposed to be a numeric vector.
Then order it again.
P.S. I had previously put these in the comments.
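To illustrate why a factor column can appear to sort incorrectly, here is a small sketch with made-up lab IDs. It assumes the IDs are meant to be numbers.

```r
# Hypothetical lab IDs stored as a factor
lab_id <- factor(c("10", "9", "100", "2"))

# Ordering the factor follows its level order, which is based on the
# text of the labels rather than their numeric value
by_factor <- as.character(lab_id[order(lab_id)])

# Convert via character first; as.numeric() on the factor itself
# would return the internal level codes instead of the values
lab_num <- as.numeric(as.character(lab_id))
by_number <- lab_num[order(lab_num)]
```

Here by_number comes out in true numeric order (2, 9, 10, 100), whereas by_factor follows the order of the text labels.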

Resources