Import raw data into R

Can anyone help me import this data into R from a text or .dat file? It is space-delimited, but multi-word city names such as NEW YORK should not be treated as two separate fields.
1 NEW YORK 7,262,700
2 LOS ANGELES 3,259,340
3 CHICAGO 3,009,530
4 HOUSTON 1,728,910
5 PHILADELPHIA 1,642,900
6 DETROIT 1,086,220
7 SAN DIEGO 1,015,190
8 DALLAS 1,003,520
9 SAN ANTONIO 914,350
10 PHOENIX 894,070

For your particular data, where the spaces inside a name always fall between two capital letters, consider using a regular expression to hyphenate those names first:
gsub("([A-Z]) ([A-Z]+)", "\\1-\\2", "1 NEW YORK 7,262,700")
# [1] "1 NEW-YORK 7,262,700"
gsub("([A-Z]) ([A-Z]+)", "\\1-\\2", "3 CHICAGO 3,009,530")
# [1] "3 CHICAGO 3,009,530"
You can then interpret spaces as field separators.
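A minimal round trip, assuming the rows above are saved in a file called cities.txt (the file name is just for illustration), might look like this:
x <- readLines("cities.txt")
# Hyphenate the spaces inside city names, split on whitespace, then undo the hyphens
x <- gsub("([A-Z]) ([A-Z])", "\\1-\\2", x)
dat <- read.table(text = x, header = FALSE,
                  col.names = c("rank", "city", "population"),
                  stringsAsFactors = FALSE)
dat$city <- gsub("-", " ", dat$city)
dat$population <- as.numeric(gsub(",", "", dat$population))
Note that the last gsub would also mangle genuinely hyphenated names, so this sketch only works for data shaped like the sample above.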

A variation on a theme... but first, some sample data:
cat("1 NEW YORK 7,262,700",
"2 LOS ANGELES 3,259,340",
"3 CHICAGO 3,009,530",
"4 HOUSTON 1,728,910",
"5 PHILADELPHIA 1,642,900",
"6 DETROIT 1,086,220",
"7 SAN DIEGO 1,015,190",
"8 DALLAS 1,003,520",
"9 SAN ANTONIO 914,350",
"10 PHOENIX 894,070", sep = "\n", file = "test.txt")
Step 1: Read the data in with readLines
x <- readLines("test.txt")
Step 2: Figure out a regular expression you can use to insert delimiters. Here, the pattern (reading from the end of each line) is a run of digits and commas, preceded by a space, preceded by one or more words in ALL CAPS. We can capture those groups and insert tab delimiters (\t) around the city name; the extra backslashes are there to escape them properly inside the R string.
gsub("([A-Z ]+)(\\s?[0-9,]+$)", "\\\t\\1\\\t\\2", x)
# [1] "1\t NEW YORK \t7,262,700" "2\t LOS ANGELES \t3,259,340"
# [3] "3\t CHICAGO \t3,009,530" "4\t HOUSTON \t1,728,910"
# [5] "5\t PHILADELPHIA \t1,642,900" "6\t DETROIT \t1,086,220"
# [7] "7\t SAN DIEGO \t1,015,190" "8\t DALLAS \t1,003,520"
# [9] "9\t SAN ANTONIO \t914,350" "10\t PHOENIX \t894,070"
Step 3: Since we know our gsub is working, and we know that read.delim has a "text" argument that can be used instead of a "file" argument, we can use read.delim directly on the result of gsub:
out <- read.delim(text = gsub("([A-Z ]+)(\\s?[0-9,]+$)", "\\\t\\1\\\t\\2", x),
                  header = FALSE, strip.white = TRUE)
out
# V1 V2 V3
# 1 1 NEW YORK 7,262,700
# 2 2 LOS ANGELES 3,259,340
# 3 3 CHICAGO 3,009,530
# 4 4 HOUSTON 1,728,910
# 5 5 PHILADELPHIA 1,642,900
# 6 6 DETROIT 1,086,220
# 7 7 SAN DIEGO 1,015,190
# 8 8 DALLAS 1,003,520
# 9 9 SAN ANTONIO 914,350
# 10 10 PHOENIX 894,070
One possible last step would be to convert the third column to numeric:
out$V3 <- as.numeric(gsub(",", "", out$V3))
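As an aside, base R (>= 3.4) also has strcapture(), which can do the splitting and type conversion in one step. A sketch against the same x from readLines, assuming every line follows the rank/CITY/population pattern:
# One regex with three capture groups; the proto data frame supplies
# the column names and types for the captured pieces
out2 <- strcapture("^([0-9]+) ([A-Z ]+) ([0-9,]+)$", x,
                   proto = data.frame(rank = integer(), city = character(),
                                      pop = character(), stringsAsFactors = FALSE))
out2$pop <- as.numeric(gsub(",", "", out2$pop))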

Expanding on @Hugh's answer, I would try the following, although it's not particularly efficient.
lines <- scan("cities.txt", sep="\n", what="character")
lines <- unlist(lapply(lines, function(x) {
  gsub(pattern = "([a-zA-Z]) ([a-zA-Z]+)", replacement = "\\1-\\2", x)
}))
citiesDF <- data.frame(num = rep(0, length(lines)),
                       city = rep("", length(lines)),
                       population = rep(0, length(lines)),
                       stringsAsFactors = FALSE)
for (i in 1:length(lines)) {
  splitted <- strsplit(lines[i], " +")
  citiesDF[i, "num"] <- as.numeric(splitted[[1]][1])
  citiesDF[i, "city"] <- gsub("-", " ", splitted[[1]][2])
  citiesDF[i, "population"] <- as.numeric(gsub(",", "", splitted[[1]][3]))
}
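Since each hyphenated line has exactly three whitespace-separated fields, the loop can also be replaced with a single strsplit() plus rbind. A vectorised sketch using the same lines object:
# Split every line at once, bind into a 3-column matrix, then fix the types
parts <- do.call(rbind, strsplit(lines, " +"))
citiesDF <- data.frame(num = as.numeric(parts[, 1]),
                       city = gsub("-", " ", parts[, 2]),
                       population = as.numeric(gsub(",", "", parts[, 3])),
                       stringsAsFactors = FALSE)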


Drop rows after criteria

I have some data that I'm trying to clean up, and I noticed that I have 150 files containing rows that are subsets of previous rows. Is there a way to drop everything after certain criteria occur? I'm not sure how I'd write out sample data for this via code, so I've listed an example of the data as text below. I'd like to drop all rows at and below "Section 2".
Name,Age,Address
Section 1,,
Abby,10,1 Baker St
Alice,12,3 Main St
Becky,13,156 F St
Ben,14,2 18th St
Cameron,15,4 Journey Road
Danny,16,123 North Ave
Eric,17,325 Hill Blvd
,,
Section 2,,
Abby,10,1 Baker St
Alice,12,3 Main St
Becky,13,156 F St
Ben,14,2 18th St
,,
Section 3,,
Becky,13,156 F St
Ben,14,2 18th St
Cameron,15,4 Journey Road
Danny,16,123 North Ave
,,
Section 5,,
Alice,12,3 Main St
Becky,13,156 F St
Ben,14,2 18th St
Cameron,15,4 Journey Road
Danny,16,123 North Ave
Eric,17,325 Hill Blvd
Expected output
Name,Age,Address
Section 1,,
Abby,10,1 Baker St
Alice,12,3 Main St
Becky,13,156 F St
Ben,14,2 18th St
Cameron,15,4 Journey Road
Danny,16,123 North Ave
Eric,17,325 Hill Blvd
Assuming your text file is called temp.txt, you can use readLines to read it in, find the line containing 'Section 2', and read only the lines above it (subtracting 2 also drops the blank ,, line that precedes it).
tmp <- readLines('temp.txt')
inds <- grep('Section 2', tmp) - 2
data <- read.csv(text = paste0(tmp[1:inds], collapse = '\n'))
data
# Name Age Address
#1 Section 1 NA
#2 Abby 10 1 Baker St
#3 Alice 12 3 Main St
#4 Becky 13 156 F St
#5 Ben 14 2 18th St
#6 Cameron 15 4 Journey Road
#7 Danny 16 123 North Ave
#8 Eric 17 325 Hill Blvd
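Since you mention 150 such files, the same recipe wraps directly in lapply. A sketch assuming the files all end in .txt, sit in the working directory, and each contains a 'Section 2' line:
files <- list.files(pattern = "\\.txt$")
all_data <- lapply(files, function(f) {
  tmp <- readLines(f)
  cut <- grep('Section 2', tmp) - 2  # - 2 also drops the blank ',,' line
  read.csv(text = paste0(tmp[1:cut], collapse = '\n'))
})
names(all_data) <- files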
Here, I "read" in your data by calling strsplit and using the newline as the separator. If you were doing this from file, you could use readLines
I use grep to find the line number that contains "Section 2", use that to subset raw_data. I paste0(..., collapse="") the lines that do not start with "Section" and use read.table using sep="," with header=TRUE to parse as if I read just that section with read.csv.
raw_data <- strsplit(split = "\\n", "Name,Age,Address
Section 1,,
Abby,10,1 Baker St
Alice,12,3 Main St
Becky,13,156 F St
Ben,14,2 18th St
Cameron,15,4 Journey Road
Danny,16,123 North Ave
Eric,17,325 Hill Blvd
,,
Section 2,,
Abby,10,1 Baker St
Alice,12,3 Main St
Becky,13,156 F St
Ben,14,2 18th St
,,
Section 3,,
Becky,13,156 F St
Ben,14,2 18th St
Cameron,15,4 Journey Road
Danny,16,123 North Ave
,,
Section 5,,
Alice,12,3 Main St
Becky,13,156 F St
Ben,14,2 18th St
Cameron,15,4 Journey Road
Danny,16,123 North Ave
Eric,17,325 Hill Blvd")
section2_idx <- grep('Section 2', raw_data[[1]])
raw_data_clean <- trimws(raw_data[[1]][1:(section2_idx-2)])
allsect_idx <- grep('^Section', raw_data_clean)
if (length(allsect_idx) > 0)
  raw_data_clean <- raw_data_clean[-allsect_idx]
read.table(text = paste0(raw_data_clean, collapse = "\n"), sep = ",", header = TRUE)
#> Name Age Address
#> 1 Abby 10 1 Baker St
#> 2 Alice 12 3 Main St
#> 3 Becky 13 156 F St
#> 4 Ben 14 2 18th St
#> 5 Cameron 15 4 Journey Road
#> 6 Danny 16 123 North Ave
#> 7 Eric 17 325 Hill Blvd
Created on 2020-12-06 by the reprex package (v0.3.0)
Here is a made-up example that avoids having to type in your starting data.
mixed_data is 500 elements long, and each element is a string containing two commas; the string doesn't need to be broken apart if it looks like your example.
Create an empty vector to hold just one of each value, then loop through the whole mixed list and add the unique entries to that vector. This example resulted in 444 unique items in one_of_each out of the original 500 in mixed_data.
set.seed(101)
a <- sample(LETTERS,500, replace = TRUE)
b <- sample(letters,500, replace = TRUE)
d <- sample(c(1:3),500, replace = TRUE)
mixed_data <- paste0(a,",",b,",",d)
head(mixed_data)
one_of_each <- c() # starts empty
for (i in 1:length(mixed_data)) {
  if (mixed_data[i] %in% one_of_each == FALSE) {
    one_of_each <- c(one_of_each, mixed_data[i]) # if not found, then add
  }
}
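For what it's worth, this loop grows the vector one element at a time; the same result comes from a single vectorised call:
# Equivalent one-liner: keep the first occurrence of each distinct string
one_of_each <- unique(mixed_data)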

Need to ID states from mixed names/IDs in location data

I need to identify states from mixed location data: search for the 50 state abbreviations and the 50 full state names, and return the state abbreviation.
N <- 1:10
Loc <- c("Los Angeles, CA", "Manhattan, NY", "Florida, USA", "Chicago, IL", "Houston, TX",
         "Texas, USA", "Corona, CA", "Georgia, USA", "WV NY NJ", "qwerty uy PO DOPL JKF")
df <- data.frame(N, Loc)
# Objective: create a variable "state" such that
# state contains the abbreviated names of the states from Loc:
# for "Los Angeles, CA", state = CA
# for "Florida, USA", state = FL
# for "WV NY NJ", state = NA
# for "qwerty NJuy PO DOPL JKF", state = NA (in spite of containing the string NJ, it is not wrapped in spaces)
# The end result should be Newdf:
State <- c("CA", "NY", "FL", "IL", "TX", "TX", "CA", "GA", NA, NA)
Newdf <- data.frame(N, Loc, State)
Newdf
N Loc State
1 1 Los Angeles, CA CA
2 2 Manhattan, NY NY
3 3 Florida, USA FL
4 4 Chicago, IL IL
5 5 Houston, TX TX
6 6 Texas, USA TX
7 7 Corona, CA CA
8 8 Georgia, USA GA
9 9 WV NY NJ <NA>
10 10 qwerty uy PO DOPL JKF <NA>
Is there a package for this, or can a loop be written? Even if the approach could be demonstrated with just a few states, that would be sufficient; I will post the full solution when I get to it. By the way, this is for a Twitter dataset downloaded using the rtweet package, and the variable is place_full_name.
R has built-in constants, state.abb and state.name, which can be used here.
vars <- stringr::str_extract(df$Loc, paste0('\\b', c(state.abb, state.name),
                                            '\\b', collapse = '|'))
#[1] "CA" "NY" "Florida" "IL" "TX" "Texas" "CA" "Georgia" "WV" NA
If you want everything as abbreviations, we can go further and do :
inds <- vars %in% state.name
vars[inds] <- state.abb[match(vars[inds], state.name)]
vars
#[1] "CA" "NY" "FL" "IL" "TX" "TX" "CA" "GA" "WV" NA
However, in row 9 you expect NA, but this returns "WV" because WV is a valid state abbreviation. In such cases you need to prepare rules strict enough that they extract only genuine state references and nothing else.
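For example, one stricter rule is to accept a two-letter abbreviation only when it follows ", " at the end of the string, and a full state name only when ", USA" follows it. A sketch with stringr (whose ICU engine supports lookarounds); this reproduces the State column you listed, including the NA in row 9:
library(stringr)
# Abbreviation must sit at the end of the string, right after ", "
abb <- str_extract(df$Loc, "(?<=, )[A-Z]{2}$")
# Full name must be followed by ", USA" at the end of the string
full <- str_extract(df$Loc, paste0("\\b(", paste(state.name, collapse = "|"),
                                   ")(?=, USA$)"))
state <- ifelse(!is.na(abb) & abb %in% state.abb, abb,
                state.abb[match(full, state.name)])
state
# [1] "CA" "NY" "FL" "IL" "TX" "TX" "CA" "GA" NA  NA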
Utilising the built-in R constants, state.abb and state.name, we can try to extract these from the Loc with regular expressions.
state.abbs <- sub('.+, ([A-Z]{2})', '\\1', df$Loc)
state.names <- sub('^(.+),.+', '\\1', df$Loc)
Now, if a state abbreviation is not one of the built-in ones, we use match to find the position of our extracted state.names value in the built-in state.name vector, and use that to index state.abb; otherwise we keep the abbreviation we already have. Rows that match neither return NA.
df$state.abb <- ifelse(!state.abbs %in% state.abb,
                       state.abb[match(state.names, state.name)], state.abbs)
df
N Loc state.abb
1 1 Los Angeles, CA CA
2 2 Manhattan, NY NY
3 3 Florida, USA FL
4 4 Chicago, IL IL
5 5 Houston, TX TX
6 6 Texas, USA TX
7 7 Corona, CA CA
8 8 Georgia, USA GA
9 9 WV NY NJ <NA>
10 10 qwerty uy PO DOPL JKF <NA>

Regular Expressions to Unmerge row entries

I have an example data set given by
df <- data.frame(
  country = c("GermanyBerlin", "England (UK)London", "SpainMadrid", "United States of AmericaWashington DC", "HaitiPort-au-Prince", "country66city"),
  capital = c("#Berlin", "NA", "#Madrid", "NA", "NA", "NA"),
  url = c("/country/germany/01", "/country/england-uk/02", "/country/spain/03", "country/united-states-of-america/04", "country/haiti/05", "country/country6/06"),
  stringsAsFactors = FALSE
)
country capital url
1 GermanyBerlin #Berlin /country/germany/01
2 England (UK)London NA /country/england-uk/02
3 SpainMadrid #Madrid /country/spain/03
4 United States of AmericaWashington DC NA country/united-states-of-america/04
5 HaitiPort-au-Prince NA country/haiti/05
6 country66city NA country/country6/06
The aim is to tidy this so that the columns are as one would expect from their names:
the first should contain only the country name.
the second should contain the capital (without a # sign).
the third should remain unchanged.
So my desired output is:
country capital url
1 Germany Berlin /country/germany/01
2 England (UK) London /country/england-uk/02
3 Spain Madrid /country/spain/03
4 United States of America Washington DC country/united-states-of-america/04
5 Haiti Port-au-Prince country/haiti/05
6 country6 6city country/country6/06
In the cases where there are non-NA entries in the capital column, I have a piece of code that achieves this (see bottom of post).
Therefore I am looking for a solution that recognises that the pattern of the url column can be used to split the capital out of the country column.
This needs to account for the fact that
the URL text is all lower case, whilst the country name as it appears in the country column has mixed cases.
the text in the URL replaces spaces with hyphens.
the url removes special characters (such as the brackets around UK).
I would be interested to see how this aim can be achieved, presumably using regular expressions (though open to any options).
Partial solution when capital column is non-NA
Where there are non-NA entries in the capital column the following code achieves my aim:
library(dplyr)
library(stringr)
df %>% mutate(capital = str_replace(capital, "#", ""),
              country = str_replace(country, capital, ""))
country capital url
1 Germany Berlin /country/germany/01
2 England (UK)London NA /country/england-uk/02
3 Spain Madrid /country/spain/03
4 United States of AmericaWashington DC NA country/united-states-of-america/04
You can do:
transform(df, capital = sub(".*[A-Z]\\S+([A-Z])", "\\1", country))
country capital url
1 GermanyBerlin Berlin /country/germany/01
2 England (UK)London London /country/england-uk/02
3 SpainMadrid Madrid /country/spain/03
4 United States of AmericaWashington DC Washington DC country/united-states-of-america/04
You could start with something like this and keep on refining until you get the (100%) correct results and then see if you can skip/merge any steps.
library(magrittr)
df$country2 <- df$url %>%
  gsub("-", " ", .) %>%
  gsub(".+try/(.+)/.+", "\\1", .) %>%
  gsub("(\\b[a-z])", "\\U\\1", ., perl = TRUE)
df$capital <- df$country %>%
  gsub("[()]", " ", .) %>%
  gsub(" +", " ", .) %>%
  gsub(paste(df$country2, collapse = "|"), "", ., ignore.case = TRUE)
df$country <- df$country2
df$country2 <- NULL
df
df
country capital url
1 Germany Berlin /country/germany/01
2 England Uk London /country/england-uk/02
3 Spain Madrid /country/spain/03
4 United States Of America Washington DC country/united-states-of-america/04
5 Haiti Port-au-Prince country/haiti/05
6 Country6 6city country/country6/06

How to extract the dollar value from a row in a data frame and paste to its respective row

Character vector x contains tweets about flights from a source city to a destination city, along with the fare. It looks like this:
x <- c('RT #airfarewatchdog: Los Angeles Los Angeles LAX to Cabo #SJD for $234',
'RT #TheFlightDeal: Airfare Deal: [AA] New York - Mexico City, Mexico. $270',
'SOME JUNK HERE',
'RT #airfarewatchdog: Los Angeles Los Angeles LAX to New York')
I'm basically trying to extract the source city, the destination city, and the fare from each row, and store them in another variable.
My code looks like this:
toMatch <- (data$City_Airport)
a <- sapply(1:length(x), function(i) {
  res <- c(i, paste(ex_dollar(x)), unlist(stringr::str_extract_all(x[i], paste(toMatch, collapse = "|"))))
  if (length(res) > 1) {
    res
  } else NULL
})
a <- plyr::ldply(a, rbind)
a[] <- lapply(a, as.character)
a[is.na(a)] <- ""
names(a)[1] <- "row"
My output looks like below:
row 2 3 4 5 6 7 8 9
1 1 $234 $270 NA NA Los Angeles Los Angeles LAX SJD
2 2 $234 $270 NA NA New York Mexico City
3 3 $234 $270 NA NA SOM JUN HER
4 4 $234 $270 NA NA Los Angeles Los Angeles LAX New York
What is happening here is that the fares are extracted from all the rows and then pasted onto every row. I'm assuming the problem is the paste(ex_dollar(x)) call inside the loop; I tried putting that function everywhere else, but it just wouldn't work.
I want my output to look something like below:
row 2 3 4 5 6
1 1 $234 Los Angeles Los Angeles LAX SJD
2 2 $270 New York Mexico City
3 3 NA SOM JUN HER
4 4 NA Los Angeles Los Angeles LAX New York
Assuming you already have a function ex_dollar() that extracts the dollar value from a string (your code calls ex_dollar(), although you don't provide its code), simply apply ex_dollar() on a line-by-line basis inside the loop rather than to the whole vector: use ex_dollar(x[i]) rather than ex_dollar(x).
a <- sapply(1:length(x), function(i) {
  res <- c(i, paste(ex_dollar(x[i])), unlist(stringr::str_extract_all(x[i], paste(toMatch, collapse = "|"))))
  if (length(res) > 1) {
    res
  } else NULL
})
One way to extract the costs is by using regular expressions.
Using your data:
x <- data.frame(text = c("RT #airfarewatchdog: Los Angeles Los Angeles LAX to Cabo #SJD for $234",
                         "RT #TheFlightDeal: Airfare Deal: [AA] New York - Mexico City, Mexico. $270",
                         "SOME JUNK HERE",
                         "RT #airfarewatchdog: Los Angeles Los Angeles LAX to New York"),
                stringsAsFactors = FALSE) # keep the text as character, not factor
The method is:
x$value <- regmatches(x$text, gregexpr("\\$\\d+", x$text))
This regular expression will match a $ followed by digits. If you have decimals then use "\\$[0-9.]+"
Result:
text value
1 RT #airfarewatchdog: Los Angeles Los Angeles LAX to Cabo #SJD for $234 $234
2 RT #TheFlightDeal: Airfare Deal: [AA] New York - Mexico City, Mexico. $270 $270
3 SOME JUNK HERE
4 RT #airfarewatchdog: Los Angeles Los Angeles LAX to New York
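If you then want the fare as a number rather than a string, strip the $ and convert; rows with no match become NA:
# value is a list column: take the first match per row, if any
x$fare <- sapply(x$value, function(v)
  if (length(v)) as.numeric(sub("\\$", "", v[1])) else NA_real_)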
Here is one method for a data.frame named df:
# extract dollars columns as a matrix
myMat <- as.matrix(df[, 2:5])
# pull off diagonal (the data you want)
myDollars <- diag(myMat)
# construct new data.frame
dfNew <- cbind(df[, -(2:5)], myDollars)
Setting the column names and printing returns the data frame:
# set names of columns and print result
setNames(dfNew, c("row", 2:5, "myDollars"))
row 2 3 4 5 myDollars
1 1 Los_Angeles Los_Angeles LAX SJD $234
2 2 New_York Mexico_City <NA> <NA> $270
3 3 SOM JUN HER <NA> <NA>
4 4 Los_Angeles Los_Angeles LAX New_York <NA>
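The diag() trick works because, in the question's intermediate output, the fare belonging to row i happens to sit in the i-th dollar column. A tiny illustration of that behaviour (diag() on a non-square matrix returns m[1,1], m[2,2], ...):
m <- matrix(c("$234", "$270", NA, NA,
              "$234", "$270", NA, NA), nrow = 2, byrow = TRUE)
diag(m)
# [1] "$234" "$270"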

Transform data.frame of lists to data.frame

I would like to transform this data.frame
Loc Time(h)
Paris, Luxembourg 10,15
Paris, Lyon, Berlin 9,12,11
to this
Loc Time(h)
Paris 10
Luxembourg 15
Paris 9
Lyon 12
Berlin 11
You could use Ananda Mahto's cSplit function, provided you have data.table installed.
If dat is your data,
devtools::source_gist(11380733)
cSplit(dat, c("Loc", "Time"), direction = "long")
# Loc Time
# 1: Paris 10
# 2: Luxembourg 15
# 3: Paris 9
# 4: Lyon 12
# 5: Berlin 11
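If you prefer the tidyverse, tidyr::separate_rows() does the same split-and-lengthen in one call. A sketch assuming both columns of dat are character:
library(tidyr)
# Split both columns on commas (with optional space) and lengthen the frame;
# convert = TRUE turns the times into numbers
separate_rows(dat, Loc, `Time(h)`, sep = ",\\s*", convert = TRUE)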
Assuming each entry in your data frame is a character string of the form you show above, you could do the following:
# notice the space in ", " for the first split
newLoc <- sapply(df$Loc, function(entry) { unlist(strsplit(entry, ", ", fixed = TRUE)) })
# and the lack thereof in the second
newTime <- sapply(df$`Time(h)`, function(entry) { unlist(strsplit(entry, ",", fixed = TRUE)) })
I think we also need to flatten the results
dim(newLoc) <- NULL
dim(newTime) <- NULL
Then combine back into a df
data.frame(cbind(Loc=newLoc, `Time(h)`=newTime))
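The flatten-and-rebuild can also be written in one expression; a sketch assuming both columns of df are character:
data.frame(Loc = unlist(strsplit(df$Loc, ", ")),
           `Time(h)` = as.numeric(unlist(strsplit(df$`Time(h)`, ","))),
           check.names = FALSE, stringsAsFactors = FALSE)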
