How to learn about possible variables in R [duplicate] - r

In the nycflights13 package, how do I see all of the carriers in the carrier variable? I can pull up the list of variables from air_time to year, but I want to see the list of all the carriers. Please let me know!

The carriers are listed in the airlines data frame.
nycflights13::airlines
# # A tibble: 16 x 2
# carrier name
# <chr> <chr>
# 1 9E Endeavor Air Inc.
# 2 AA American Airlines Inc.
# 3 AS Alaska Airlines Inc.
# 4 B6 JetBlue Airways
# 5 DL Delta Air Lines Inc.
# 6 EV ExpressJet Airlines Inc.
# 7 F9 Frontier Airlines Inc.
# 8 FL AirTran Airways Corporation
# 9 HA Hawaiian Airlines Inc.
# 10 MQ Envoy Air
# 11 OO SkyWest Airlines Inc.
# 12 UA United Air Lines Inc.
# 13 US US Airways Inc.
# 14 VX Virgin America
# 15 WN Southwest Airlines Co.
# 16 YV Mesa Airlines Inc.
Or am I not understanding the question?

I presume you are looking at the 'flights' table in that package, since it contains a carrier variable.
For a list of the unique values in a variable, you could use the base unique() function:
unique(nycflights13::flights$carrier)
[1] "UA" "AA" "B6" "DL" "EV" "MQ" "US" "WN" "VX" "FL" "AS" "9E" "F9" "HA" "YV" "OO"
or table to count the number of appearances:
table(nycflights13::flights$carrier)
9E AA AS B6 DL EV F9 FL HA MQ OO UA US VX WN YV
18460 32729 714 54635 48110 54173 685 3260 342 26397 32 58665 20536 5162 12275 601
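If it helps, the two calls can be combined into a sorted overview. This is only a small sketch on a made-up vector standing in for flights$carrier, so the values and counts are illustrative and not taken from the package:

```r
# Toy stand-in for nycflights13::flights$carrier (hypothetical values)
carrier <- c("UA", "AA", "B6", "UA", "DL", "AA", "UA")

# Distinct carrier codes in alphabetical order
distinct_carriers <- sort(unique(carrier))

# Number of rows per carrier, most frequent first
carrier_counts <- sort(table(carrier), decreasing = TRUE)

print(distinct_carriers)
print(carrier_counts)
```

With the real data, replacing the toy vector by nycflights13::flights$carrier should give the same kind of output as the calls above.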

Related

Splitting rows that contain "|" using separate() does not split them

My data looks like this:
> company
name category_list
11 1-4 All Entertainment|Games|Software
12 1.618 Technology Networking|Real Estate|Web Hosting
13 1-800-DENTIST Health and Wellness
14 1-800-DOCTORS Health and Wellness
15 1-800-PublicRelations, Inc. Internet Marketing|Media|Public Relations
I have to split the category_list column based on its values: when the values are pipe-separated, the row should be split into one row per value.
I tried this with the separate() function, but the new column is not populated with any values:
c1 <- company %>% separate(category_list,into=c("primary_Sector"), sep="|")
Actual output:
name primary_Sector
11 1-4 All
12 1.618 Technology
13 1-800-DENTIST
14 1-800-DOCTORS
15 1-800-PublicRelations, Inc.
Expected output
name category_list
11 1-4 All Entertainment
12 1-4 All Games
13 1-4 All Software
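Can someone tell me what is wrong?
The sep argument is interpreted as a regular expression, and an unescaped "|" is the alternation operator rather than a literal pipe, so it has to be escaped as "\\|". In addition, tidyr::separate() splits a column into several columns, whereas tidyr::separate_rows() splits it into several rows, which is what you want here: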
library(tidyr)
read.table(
text="name;category_list
1-4 All;Entertainment|Games|Software
1.618 Technology;Networking|Real Estate|Web Hosting
1-800-DENTIST;Health and Wellness
1-800-DOCTORS;Health and Wellness
1-800-PublicRelations, Inc.;Internet Marketing|Media|Public Relations",
sep=";", header = TRUE, stringsAsFactors = FALSE
) %>%
separate_rows(category_list, sep = "\\|")
## name category_list
## 1 1-4 All Entertainment
## 2 1-4 All Games
## 3 1-4 All Software
## 4 1.618 Technology Networking
## 5 1.618 Technology Real Estate
## 6 1.618 Technology Web Hosting
## 7 1-800-DENTIST Health and Wellness
## 8 1-800-DOCTORS Health and Wellness
## 9 1-800-PublicRelations, Inc. Internet Marketing
## 10 1-800-PublicRelations, Inc. Media
## 11 1-800-PublicRelations, Inc. Public Relations
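The pipe pitfall can also be seen without tidyr. The following base R sketch uses a two-row toy copy of the data (the object names parts and long are made up for illustration) and splits on a literal "|" with fixed = TRUE, then repeats each name once per category:

```r
# Two rows from the question, rebuilt by hand for a self-contained example
company <- data.frame(
  name = c("1-4 All", "1-800-DENTIST"),
  category_list = c("Entertainment|Games|Software", "Health and Wellness"),
  stringsAsFactors = FALSE
)

# fixed = TRUE treats "|" as a literal character, not as regex alternation
parts <- strsplit(company$category_list, "|", fixed = TRUE)

# Escaping the pipe in a regular expression is an equivalent alternative
parts_regex <- strsplit(company$category_list, "\\|")

# One output row per category, with the name repeated as often as needed
long <- data.frame(
  name = rep(company$name, lengths(parts)),
  category_list = unlist(parts),
  stringsAsFactors = FALSE
)
print(long)
```

This is only meant to show the mechanics; separate_rows() stays the more convenient choice when tidyr is already loaded.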

Combine cells having similar values in a row

I have a data frame like the one below.
New_ment1_1 New_ment1_2 New_ment1_3 New_ment1_4
1 application android ios NA
2 donald trump agreement climate united states
3 donald trump agreement paris united states
4 donald trump agreement united states NA
5 donald trump climate emission united states
6 donald trump entertainer host president
7 hen chicken mustard wimp
8 husband pamela private lives NA
9 pan chicken hen wimp
10 sex associate pamela partner
11 united kingdom chicken hen wimp
12 united states agreement paris NA
I want the result to be a data frame in which similar rows are combined.
For example, row 1 should stay as it is, since no other row is similar to it.
Rows 2, 3, 4, 5 and 12 should be combined into a single row like
united states donald trump paris climate agreement emission
And rows 7, 9 and 11 should be combined as
united kingdom chicken hen wimp mustard
The elements can be in any order.
Assume the data frame DF shown reproducibly in the Note at the end.
Convert that to a character matrix m. Let us say that two rows are similar if they have more than one element in common, and define is_similar to take two row indexes and return TRUE or FALSE accordingly. Then apply that to every pair of rows using outer. Interpret the result as the adjacency matrix of a graph and calculate its connected components, splitting DF into a list L, each of whose elements holds the unique non-NA values from the rows of DF that make up one connected component. Finally, rework L into a character matrix.
library(igraph)
m <- as.matrix(DF)
n <- nrow(m)
is_similar <- function(i, j) length(intersect(na.omit(m[i, ]), na.omit(m[j, ]))) > 1
smat <- outer(1:n, 1:n, Vectorize(is_similar))
adj <- graph.adjacency(smat)
cl <- components(adj)$membership
str(split(1:n, cl))
## List of 6
## $ 1: int 1
## $ 2: int [1:5] 2 3 4 5 12
## $ 3: int 6
## $ 4: int [1:3] 7 9 11
## $ 5: int 8
## $ 6: int 10
spl <- split(DF, cl)
L <- lapply(spl, function(x) na.omit(unique(unlist(x))))
t(do.call("cbind", lapply(L, ts)))
giving:
[,1] [,2] [,3] [,4] [,5] [,6]
1 "application" "android" "ios" NA NA NA
2 "donald_trump" "united_states" "agreement" "climate" "paris" "emission"
3 "donald_trump" "entertainer" "host" "president" NA NA
4 "hen" "pan" "united_kingdom" "chicken" "mustard" "wimp"
5 "husband" "pamela" "private_lives" NA NA NA
6 "sex" "associate" "pamela" "partner" NA NA
Note: The input in reproducible form is:
Lines <- "
New_ment1_1 New_ment1_2 New_ment1_3 New_ment1_4
1 application android ios NA
2 donald_trump agreement climate united_states
3 donald_trump agreement paris united_states
4 donald_trump agreement united_states NA
5 donald_trump climate emission united_states
6 donald_trump entertainer host president
7 hen chicken mustard wimp
8 husband pamela private_lives NA
9 pan chicken hen wimp
10 sex associate pamela partner
11 united_kingdom chicken hen wimp
12 united_states agreement paris NA"
DF <- read.table(text = Lines, header = TRUE, as.is = TRUE)
Update: Fixed similarity definition.
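For readers without igraph, the connected-components step can be approximated in base R by repeatedly propagating the smallest label among similar rows. The helper name components_base and the toy matrix below are made up for illustration; this is a sketch under the assumption of a symmetric similarity matrix without NA values, not a replacement for igraph::components():

```r
# Toy character matrix: rows 1, 2 and 4 are chained by shared pairs, row 3 is alone
m <- rbind(
  c("a", "b", "c"),
  c("b", "c", "d"),
  c("x", "y", "z"),
  c("c", "d", "e")
)
n <- nrow(m)

# Two rows are similar if they share more than one element
is_similar <- function(i, j) length(intersect(m[i, ], m[j, ])) > 1
adj <- outer(1:n, 1:n, Vectorize(is_similar))

# Hypothetical helper: each row takes the smallest label among itself and its
# neighbours until nothing changes, then labels are renumbered 1, 2, ...
components_base <- function(adj) {
  n <- nrow(adj)
  label <- seq_len(n)
  repeat {
    new_label <- vapply(
      seq_len(n),
      function(i) min(label[adj[i, ] | seq_len(n) == i]),
      integer(1)
    )
    if (identical(new_label, label)) break
    label <- new_label
  }
  match(label, unique(label))
}

cl <- components_base(adj)
print(split(seq_len(n), cl))
```

Rows 1 and 4 share only one element directly, but they end up in the same component because both are similar to row 2, which mirrors how rows 2, 3, 4, 5 and 12 are grouped above.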

R: Fuzzy logic name match

I have been working on a large data set of customer names. Each name has to be checked against a master file of correct names (300 KB), and if it matches, the master name should be appended to the customer file as a new column value. The approach from my previous question worked only for small data sets.
Both the customer and master files have been cleaned using tm. I have tried different logic, but it only works on a small set of data and is not effective on huge files. Pattern matching doesn't help here, in my opinion, because no name comes in an exact pattern.
Customer file:
1 chang chun petrochemical
2 chang chun plastics
3 church dwight
4 citrix systems asia pacific
5 cnh industrial services srl
6 conoco phillips
7 conocophillips
8 dfk laurence varnay
9 dtz worldwide
10 electro motive maintenance operati
11 enterasys networks
12 esso resources
13 expedia
14 expedia
15 exponential interactive aust
16 exxonmobil asia pacific pte
17 exxonmobil chemical asia pac div
18 exxonmobil png
19 formula world championship
20 fortitech asia pacific sdn bhd
Master file:
1 chang chun group
2 church dwight
3 citrix systems asia pacific
4 cnh industrial nv
5 conoco phillips
6 dfk laurence varnay
7 dtz group zealand
8 caterpillar
9 enterasys networks
10 exxon mobil group
11 expedia group
12 exponential interactive aust
13 formula world championship
14 fortitech asia pacific sdn bhd
15 frhi hotels resorts
16 gardner denver industries
17 glencore xstrata international plc
18 grace
19 incomm nz
20 information resources
21 kbr holdings llc
22 kennametal
23 komatsu
24 leonhard hofstetter pelzdesign
25 communications corporation
26 manhattan associates
27 mattel
28 mmg finance
29 nokia oyj group
30 nortek
I have tried this simple loop:
for (i in 1:100){
result$x[i] = agrep(result$ICIS_Cust_Names[i], result1$Master_Names, value = TRUE, max = list(del = 0.2, ins = 0.3, sub = 0.4))
#result$Y[i] = agrep(result$ICIS_Cust_Names[i], result1$Master_Names, value = FALSE, max = list(del = 0.2, ins = 0.3, sub = 0.4))
}
Result:
1 chang chun petrochemical <NA> NA
2 chang chun plastics <NA> NA
3 church dwight church dwight 2
4 citrix systems asia pacific citrix systems asia pacific 3
5 cnh industrial services srl <NA> NA
6 conoco phillips church dwight 2
7 conocophillips <NA> NA
8 dfk laurence varnay <NA> NA
9 dtz worldwide church dwight 2
10 electro motive maintenance operati <NA> NA
11 enterasys networks <NA> NA
12 esso resources church dwight 2
13 expedia <NA> NA
14 expedia <NA> NA
15 exponential interactive aust church dwight 2
16 exxonmobil asia pacific pte <NA> NA
17 exxonmobil chemical asia pac div <NA> NA
18 exxonmobil png church dwight 2
19 formula world championship <NA> NA
20 fortitech asia pacific sdn bhd
I also tried lapply, but to no avail. As you can see, my master file is large, and sometimes I get an error saying that the row lengths don't match.
mm<-dt[lapply(result, function(x) levenshteinDist(x ,lapply(result1, function(x) x)))]
#using looping stat. for checking each cus name with all the master names
for(i in seq(nrow(result)) )
{
if((levenshteindist(result[i],lapply(result1, function(x) String(x))))==0)
sprintf("%s", x)
}
Which method would be best for this? I referred to a few similar questions on Stack Overflow, but they were not much help.
My attempt might be naive, but it misbehaves when applied to huge data sets. Could anybody familiar with R correct the levenshteinDist code above?
code:
#check with each value of master file and if matches more than .90 then return master value.
for(i in seq(1:nrow(gr1))
{
for(j in seq(1:nrow(gr2))
{
gr1$jar[i,j]<-jarowinkler(gr1$ICIS_Cust_Names[i],gr2$Master_Names[j])
if(gr1$jar[i,j]>.90)
gr1$res[i] = gr2$Master_Names[j]
}
}
#Please let know if there is any minute error with this code
If anybody has worked with such data in R, please help!
I achieved a partial result with this code:
df$result<-data.frame(df$Cust_Names, df$Master_Names[max.col(-adist(df$Cust_Names,df$Master_Names))])
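One possible way to build on the adist() idea is to pick the nearest master name per customer and keep it only when the edit distance is small relative to the name length. The sketch below uses a few names from the lists above; the 0.5 cut-off and the object names are assumptions chosen for illustration and would need tuning on the real files:

```r
# A few names taken from the customer and master lists in the question
cust <- c("conocophillips", "expedia", "chang chun plastics")
master <- c("conoco phillips", "expedia group", "chang chun group", "caterpillar")

# Edit-distance matrix: one row per customer name, one column per master name
d <- adist(cust, master)

# Index of the closest master name for every customer name
best <- apply(d, 1, which.min)

# Distance to the best candidate, scaled by the longer of the two strings
best_dist <- d[cbind(seq_along(cust), best)]
rel_dist <- best_dist / pmax(nchar(cust), nchar(master[best]))

# Assumed threshold: accept a match only if at most half the characters differ
matched <- ifelse(rel_dist <= 0.5, master[best], NA_character_)

result <- data.frame(Cust_Names = cust, Master_Names = matched,
                     stringsAsFactors = FALSE)
print(result)
```

Because adist() is vectorised over both inputs, this avoids the nested loops, and every customer name gets exactly one value (a master name or NA), which sidesteps the length mismatch that agrep() can cause when it returns zero or several matches.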

Programmatically look up a ticker symbol in R

I have a field of data containing company names, such as
company <- c("Microsoft", "Apple", "Cloudera", "Ford")
> company
Company
1 Microsoft
2 Apple
3 Cloudera
4 Ford
and so on.
The package tm.plugin.webmining allows you to query data from Yahoo! Finance based on ticker symbols:
require(tm.plugin.webmining)
results <- WebCorpus(YahooFinanceSource("MSFT"))
I'm missing the in-between step. How can I query ticker symbols programmatically based on company names?
I couldn't manage to do this with the tm.plugin.webmining package, but I came up with a rough solution: pulling and parsing data from this web file: ftp://ftp.nasdaqtrader.com/SymbolDirectory/nasdaqlisted.txt. I say rough because for some reason my calls with httr::content(httr::GET(...)) didn't work every time. I think it has to do with the type of web address (ftp://), but I don't do much web scraping, so I can't really explain it. It seemed to work better on my Linux machine than on my Mac, but that could be irrelevant. Thanks to @thelatemail's comment, this seems to work much more smoothly:
library(quantmod) ## optional
symbolData <- read.csv(
"ftp://ftp.nasdaqtrader.com/SymbolDirectory/nasdaqlisted.txt",
sep="|")
##
> head(symbolData,10)
Symbol Security.Name Market.Category Test.Issue Financial.Status Round.Lot.Size
1 AAIT iShares MSCI All Country Asia Information Technology Index Fund G N N 100
2 AAL American Airlines Group, Inc. - Common Stock Q N N 100
3 AAME Atlantic American Corporation - Common Stock G N N 100
4 AAOI Applied Optoelectronics, Inc. - Common Stock G N N 100
5 AAON AAON, Inc. - Common Stock Q N N 100
6 AAPL Apple Inc. - Common Stock Q N N 100
7 AAVL Avalanche Biotechnologies, Inc. - Common Stock G N N 100
8 AAWW Atlas Air Worldwide Holdings - Common Stock Q N N 100
9 AAXJ iShares MSCI All Country Asia ex Japan Index Fund G N N 100
10 ABAC Aoxin Tianli Group, Inc. - Common Shares S N N 100
Edit:
As per @GSee's suggestion, a (presumably) more robust way to obtain the source data is with the stockSymbols() function in the package TTR:
> symbolData2 <- stockSymbols(exchange="NASDAQ")
Fetching NASDAQ symbols...
> ##
> head(symbolData2)
Symbol Name LastSale MarketCap IPOyear Sector
1 AAIT iShares MSCI All Country Asia Information Technology Index Fun 34.556 6911200 NA <NA>
2 AAL American Airlines Group, Inc. 40.500 29164164453 NA Transportation
3 AAME Atlantic American Corporation 4.020 83238028 NA Finance
4 AAOI Applied Optoelectronics, Inc. 20.510 303653114 2013 Technology
5 AAON AAON, Inc. 18.420 1013324613 NA Capital Goods
6 AAPL Apple Inc. 103.300 618546661100 1980 Technology
Industry Exchange
1 <NA> NASDAQ
2 Air Freight/Delivery Services NASDAQ
3 Life Insurance NASDAQ
4 Semiconductors NASDAQ
5 Industrial Machinery/Components NASDAQ
6 Computer Manufacturing NASDAQ
I don't know if you just wanted to get ticker symbols from names, but if you are also looking for actual share price information you could do something like this:
namedStock <- function(name="Microsoft",
start=Sys.Date()-365,
end=Sys.Date()-1){
ticker <- symbolData[agrep(name,symbolData[,2]),1]
getSymbols(
Symbols=ticker,
src="yahoo",
env=.GlobalEnv,
from=start,to=end)
}
##
## an xts object named MSFT will be added to
## the global environment, no need to assign
## to an object
namedStock()
##
> str(MSFT)
An ‘xts’ object on 2013-09-03/2014-08-29 containing:
Data: num [1:251, 1:6] 31.8 31.4 31.1 31.3 31.2 ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:6] "MSFT.Open" "MSFT.High" "MSFT.Low" "MSFT.Close" ...
Indexed by objects of class: [Date] TZ: UTC
xts Attributes:
List of 2
$ src : chr "yahoo"
$ updated: POSIXct[1:1], format: "2014-09-02 21:51:22.792"
> chartSeries(MSFT)
So like I said, this isn't the cleanest solution but hopefully it helps you out. Also note that my data source was pulling companies traded on NASDAQ (which is most major companies), but you could easily combine this with other sources.
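If only the name-to-ticker step is needed, it can be sketched without any network access. The small table below is a hand-made stand-in for symbolData (the Microsoft row in particular is typed in for illustration and not copied from the output above), and lookup_ticker is a hypothetical helper that does a case-insensitive substring match and returns NA when nothing is found:

```r
# Hand-made stand-in for the downloaded symbol table (illustrative rows only)
symbol_table <- data.frame(
  Symbol = c("AAPL", "MSFT", "AAL"),
  Security.Name = c(
    "Apple Inc. - Common Stock",
    "Microsoft Corporation - Common Stock",
    "American Airlines Group, Inc. - Common Stock"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: first symbol whose security name contains the company name
lookup_ticker <- function(name, symbols) {
  hits <- grepl(tolower(name), tolower(symbols$Security.Name), fixed = TRUE)
  if (any(hits)) symbols$Symbol[which(hits)[1]] else NA_character_
}

company <- c("Microsoft", "Apple", "Cloudera")
tickers <- vapply(company, lookup_ticker, character(1),
                  symbols = symbol_table, USE.NAMES = FALSE)
print(data.frame(company, ticker = tickers))
```

Exact substring matching is stricter than agrep(), so it avoids some false positives but will miss misspelled names; which trade-off is better depends on how clean the company names are.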

create random subsets in R without duplicates

My task is to divide a dataset of 32 rows into 8 groups without any duplicated entries.
I am trying to do this with a loop, creating a new dataset after each cycle.
The data:
year pos country elo fifa cont hcountry hcont
1 2010 FRA 1851 1044 Europe RSA Africa
2 2010 MEX 1872 895 South America RSA Africa
3 2010 URU 1819 899 South America RSA Africa
4 2010 RSA 1569 392 Africa RSA Africa
5 2010 GRE 1726 964 Europe RSA Africa
6 2010 KOR 1766 632 Asia RSA Africa
8 2010 ARG 1899 1076 South America RSA Africa
9 2010 USA 1749 957 North America RSA Africa
10 2010 SVN 1648 860 Europe RSA Africa
11 2010 ALG 1531 821 Africa RSA Africa
...
My solution so far:
for (i in 1:8){
assign(paste("group", i, sep = ""), droplevels(subset(wc2010[sample(nrow(wc2010), 4),])))
wc2010 <- subset(wc2010, !(country %in% group[i]$country))
}
The problem is, of course, that I don't know how to use the loop variable.
Help would be deeply appreciated!
Thanks,
Bob
Here is one way to create a random partition:
random.groups <- function(n.items = 32L, n.groups = 8L)
1L + (sample.int(n.items) %% n.groups)
So then you just have to do:
wc2010$group <- random.groups(nrow(wc2010), n.groups = 8L)
Then you might also be interested in doing
groups <- split(wc2010, wc2010$group)
Edit: this was not asked by the OP, but I realize that soccer draws for big tournaments usually involve hats: before the draw, teams are grouped by region and/or ranking. Then groups are formed by randomly picking one team from each hat, so that two teams from the same hat cannot end up in the same group.
Here is a modification to my function so it can also take hats as an input:
random.groups <- function(n.items = 32L, n.groups = 8L,
hats = rep(1L, n.items)) {
splitted.items <- split(seq.int(n.items), hats)
shuffled <- lapply(splitted.items, sample)
1L + (order(unlist(shuffled)) %% n.groups)
}
Here is an example, where say, the first 8 teams are in hat #1, the next 8 teams are in hat #2, etc.:
# set.seed(123)
random.groups(32, 8, c(rep(1, 8), rep(2, 8), rep(3, 8), rep(4, 8)))
# [1] 7 8 2 6 5 3 1 4 8 7 5 3 2 4 1 6 3 2 7 6 5 8 1 4 7 6 5 4 3 2 1 8
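As a quick sanity check, the result can be cross-tabulated to confirm that every group has four teams and exactly one team from each hat. The function is repeated so the snippet runs on its own; the seed and the object names group_sizes and hat_by_group are arbitrary choices for this sketch:

```r
# Same function as above, repeated so the example is self-contained
random.groups <- function(n.items = 32L, n.groups = 8L,
                          hats = rep(1L, n.items)) {
  splitted.items <- split(seq.int(n.items), hats)
  shuffled <- lapply(splitted.items, sample)
  1L + (order(unlist(shuffled)) %% n.groups)
}

set.seed(42)  # arbitrary seed, only to make the run repeatable
hats <- rep(1:4, each = 8)
groups <- random.groups(32L, 8L, hats)

# Each of the 8 groups should contain 4 teams
group_sizes <- table(groups)

# Each hat should contribute exactly one team to each group
hat_by_group <- table(hats, groups)

print(group_sizes)
print(hat_by_group)
```

Note that this check relies on each hat holding exactly n.groups teams stored in consecutive rows, as in the example above; with other hat sizes or orderings the one-per-hat property is not guaranteed.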
