Loading tables in R with space separation - r

How do I load a space-separated table that has spaces inside the fields?
Simple case data:
Grade Area School Goals
4 Rural Elm Popular
4 Rural Elm Sports
4 Rural Elm Grades
4 Rural Elm Popular
3 Rural Brentwood Elementary Sports
3 Suburban Ridge Popular
Notice how one of the rows has a space inside the school name ("Brentwood Elementary" as opposed to "Elm").
The following call fails with: "line x did not have y elements"
dat = read.table("dat.txt",header=TRUE)
Edit:
The data points are all factors and contain a fixed set of values
Edit: full data available through http://lib.stat.cmu.edu/DASL/Datafiles/PopularKids.html
Thanks to @AnandaMahto

Actually, if you can use the data source Ananda found, it's pretty easy since the <pre> area is tab delimited:
library(rvest)
# html() was later renamed read_html() in rvest
pg <- read_html("http://lib.stat.cmu.edu/DASL/Datafiles/PopularKids.html")
dat <- pg %>% html_nodes("pre") %>% html_text()
dat <- read.table(text = dat, sep = "\t", header = TRUE, stringsAsFactors = FALSE)
dat[245:249,]
## Gender Grade Age Race Urban.Rural School Goals Grades Sports Looks Money
## 245 girl 4 9 White Rural Sand Grades 1 3 2 4
## 246 girl 4 9 White Rural Sand Sports 3 2 1 4
## 247 girl 4 9 White Rural Sand Sports 3 2 1 4
## 248 girl 4 9 White Rural Sand Grades 2 1 3 4
## 249 girl 6 12 White Rural Brown Middle Popular 4 2 1 3
To actually answer your question (this is a bit like Ananda's answer), you'll need to know where the problem column is and work around it. This approach uses gsubfn and a pre-defined set of values for that column: the multi-word school names are fused into single tokens before parsing and split back apart afterwards:
library(gsubfn)
# awful.txt is here https://gist.github.com/hrbrmstr/13cee15c91fdadb10fbc
lines <- readLines("awful.txt")
schools <- c("Brentwood Elementary", "Brentwood Middle", "Brown Middle", "Westdale Middle")
expr <- paste("(", paste(schools, collapse="|"), ")", sep="")
lines <- gsubfn(expr, function(x) { gsub(" ", "_", x) }, lines)
dat <- read.table(text = paste(lines, collapse = "\n"),
                  header = TRUE, stringsAsFactors = FALSE)
dat$School <- gsub("_", " ", dat$School)
dat[c(1,34,94,198,255,324,377,433),]
## Gender Grade Age Race Urban.Rural School Goals Grades Sports Looks Money
## 1 boy 5 11 White Rural Elm Sports 1 2 4 3
## 34 boy 4 10 White Suburban Brentwood Elementary Grades 2 1 3 4
## 94 girl 6 11 White Suburban Brentwood Middle Grades 3 4 1 2
## 198 boy 5 10 White Rural Ridge Sports 4 2 1 3
## 255 girl 6 12 Other Rural Brown Middle Grades 3 2 1 4
## 324 boy 4 9 Other Urban Main Grades 4 1 3 2
## 377 boy 4 9 White Urban Portage Popular 4 1 2 3
## 433 girl 6 11 White Urban Westdale Middle Popular 4 2 1 3
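If you'd rather avoid the gsubfn dependency, a plain loop over the known multi-word names does the same job. A minimal sketch, assuming the same schools and lines objects as above:
# Base R alternative: fuse each known multi-word name into one token
for (s in schools) {
  lines <- gsub(s, gsub(" ", "_", s, fixed = TRUE), lines, fixed = TRUE)
}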

Unfortunately, the answer to this question is pretty much "It depends on how much you know about the data set."
For instance, the description of the dataset specifies the possible values for each variable. Here, we know that there are only a few schools with multi-word names, and that these follow a predictable pattern, ending in "Elementary" or "Middle".
As such, you could read the data in using readLines and figure out the least obtrusive way to insert a delimiter before re-reading the data with read.table.
Here's an example:
Sample data:
cat("Grade Area School Goals Value",
"4 Rural Elm Popular 1",
"4 Rural Elm Sports 2",
"4 Rural Elm Grades 1",
"4 Rural Elm Popular 3",
"3 Rural Brentwood Elementary Sports 4",
"3 Rural Brentwood Middle Grades 3",
"3 Suburban Ridge Popular 3", sep = "\n", file = "test.txt")
Read it in as a character vector:
x <- readLines("test.txt")
Use gsub to force the multi-word school names to become a single word (joined by an underscore). Then, use read.table to get your data.frame.
read.table(text = gsub(" (Elementary|Middle)", "_\\1", x), header = TRUE)
# Grade Area School Goals Value
# 1 4 Rural Elm Popular 1
# 2 4 Rural Elm Sports 2
# 3 4 Rural Elm Grades 1
# 4 4 Rural Elm Popular 3
# 5 3 Rural Brentwood_Elementary Sports 4
# 6 3 Rural Brentwood_Middle Grades 3
# 7 3 Suburban Ridge Popular 3
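If you want the real school names back after parsing, you can reverse the substitution, just as the other answer does:
dat <- read.table(text = gsub(" (Elementary|Middle)", "_\\1", x), header = TRUE)
dat$School <- gsub("_", " ", dat$School)  # restore the spaces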

Related

picking a value from one column filters another column

I am in the midst of a big data project where I want to filter one column based on the value of another.
For example, I want to look at House 1 and, as a result, compare it only with other Urban houses, not with all houses.
table <- data.frame(
  house = paste("House", 1:15),
  category = c("Urban", "Rural", "Suburban")
)
table
# house category
# 1 House 1 Urban
# 2 House 2 Rural
# 3 House 3 Suburban
# 4 House 4 Urban
# 5 House 5 Rural
# 6 House 6 Suburban
# 7 House 7 Urban
# 8 House 8 Rural
# 9 House 9 Suburban
# 10 House 10 Urban
# 11 House 11 Rural
# 12 House 12 Suburban
# 13 House 13 Urban
# 14 House 14 Rural
# 15 House 15 Suburban
I tried to give this a go, but it is not working for me...
table %>%
  filter(house == house1) %>%
  filter(category == table$house)
I want the output to look like this...
# house category
# 1 House 1 Urban
# 2 House 4 Urban
# 3 House 7 Urban
# 4 House 10 Urban
# 5 House 13 Urban
Any suggestions are truly appreciated.
Maybe you can try the base R code below
subset(df,category == category[house=="House1"])
or dplyr option
df %>%
  filter(category == category[house == "House1"])
which gives
house category
1 House1 Urban
3 House3 Urban
5 House5 Urban
12 House12 Urban
13 House13 Urban
14 House14 Urban
dummy data
df <- structure(list(house = c("House1", "House2", "House3", "House4",
"House5", "House6", "House7", "House8", "House9", "House10",
"House11", "House12", "House13", "House14", "House15"), category = c("Urban",
"Suburban", "Urban", "Rural", "Urban", "Suburban", "Suburban",
"Rural", "Rural", "Suburban", "Suburban", "Urban", "Urban", "Urban",
"Rural")), class = "data.frame", row.names = c(NA, -15L))
Using dplyr you might do something like this
table %>%
  filter(category == filter(table, house == "House 1") %>% pull(category))
Basically just a sub-query to find the category of House 1.
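A semi join expresses the same sub-query idea in one step; a minimal sketch with dplyr's semi_join(), assuming the same table as above:
library(dplyr)
# keep rows of `table` whose category matches the filtered subset
table %>%
  semi_join(filter(table, house == "House 1"), by = "category")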
Try
table %>%
  filter(category == "Urban")
Remember that you need quotes (" ") around string values in your filter statements.
You can also use match, which returns the index of the first match, and get the corresponding category from it.
subset(table, category == category[match('House 1', house)])
house category
1 House 1 Urban
4 House 4 Urban
7 House 7 Urban
10 House 10 Urban
13 House 13 Urban
Same code with filter if you want to use dplyr:
dplyr::filter(table, category == category[match('House 1', house)])

Create Dataframe w/All Combinations of 2 Categorical Columns then Sum 3rd Column by Each Combination

I have a large, messy dataset but want to accomplish something straightforward. Essentially I want to fill a tibble based on every combination of two columns and sum a third column.
As a hypothetical example, say each observation has the company_name (Wendys, BK, McDonalds), the food_option (burgers, fries, frosty), and the total_spending (in $). I would like to make a 9x3 tibble with the company, food, and total as a sum of every observation. Here's my code so far:
df_table <- df %>%
  group_by(company_name, food_option) %>%
  summarize(total = sum(total_spending))
company_name food_option total
<chr> <chr> <dbl>
1 Wendys Burgers 757
2 Wendys Fries 140
3 Wendys Frosty 98
4 McDonalds Burgers 1044
5 McDonalds Fries 148
6 BK Burgers 669
7 BK Fries 38
The problem is that McDonalds has zero observations with "Frosty" as the food_option. Consequently, I get a partial table. I'd like to fill it in with rows that show:
8 McDonalds Frosty 0
9 BK Frosty 0
I know I can add the rows manually, but the actual dataset has over a hundred combinations so it will be tedious and complicated. Also, I'm constantly modifying the upstream data and I want the code to automatically fill correctly.
Thank you SO MUCH to anyone who can help. This forum has really been a godsend, really appreciate all of you.
Try:
library(dplyr)

df %>%
  mutate(food_option = factor(food_option, levels = unique(food_option))) %>%
  group_by(company_name, food_option, .drop = FALSE) %>%
  summarise(total = sum(total_spending))
Newer versions of dplyr have a .drop argument to group_by: if you've got a factor with pre-defined levels, the empty levels will not be dropped (and you'll get the zeros).
You can use tidyr::expand_grid():
tidyr::expand_grid(company_name = c("Wendys", "McDonalds", "BK"),
                   food_option = c("Burgers", "Fries", "Frosty"))
to create all possible variations
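One way to use that grid, as a sketch (assuming df_table is the summarised tibble from the question), is to left-join the totals onto it and fill the gaps with zero:
library(dplyr)
library(tidyr)

all_combos <- expand_grid(company_name = c("Wendys", "McDonalds", "BK"),
                          food_option = c("Burgers", "Fries", "Frosty"))

all_combos %>%
  left_join(df_table, by = c("company_name", "food_option")) %>%
  mutate(total = replace_na(total, 0))  # missing combinations get 0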
library(tidyverse)
# example data
df = read.table(text = "
company_name food_option total
1 Wendys Burgers 757
2 Wendys Fries 140
3 Wendys Frosty 98
4 McDonalds Burgers 1044
5 McDonalds Fries 148
6 BK Burgers 669
7 BK Fries 38
", header=T)
df %>% complete(company_name, food_option, fill=list(total = 0))
# # A tibble: 9 x 3
# company_name food_option total
# <fct> <fct> <dbl>
# 1 BK Burgers 669
# 2 BK Fries 38
# 3 BK Frosty 0
# 4 McDonalds Burgers 1044
# 5 McDonalds Fries 148
# 6 McDonalds Frosty 0
# 7 Wendys Burgers 757
# 8 Wendys Fries 140
# 9 Wendys Frosty 98
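complete() can also be chained straight onto the summarise step, so the zeros are filled in the same pipeline. A sketch, assuming the original df of raw observations from the question:
df %>%
  group_by(company_name, food_option) %>%
  summarise(total = sum(total_spending), .groups = "drop") %>%  # .groups needs dplyr >= 1.0
  complete(company_name, food_option, fill = list(total = 0))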

Weighting a String Distance Metric based on regular expressions

Is it possible to weight a string distance metric such as the Damerau-Levenshtein distance where the weight changes based on the character type?
I am looking to create a fuzzy match of addresses and need to weight numbers and letters differently so that an address like:
"5 James Street" and "5 Jmaes Street" are considered identical and
"5 James Street" and "6 James Street" are considered different.
I considered splitting the addresses into numbers and letters prior to applying the string distance; however, this will miss flats at "5a" and "5b". The ordering is also not consistent across the data set, so one entry may be "James Street 5".
I am currently using R with the stringdist package, but I'm not restricted to these.
Thanks!
Here's an idea. It involves a bit of manual processing but it might be a good starting point. First, we compute the approximate string distances between the addresses using adist() (or stringdist() with the method best suited to your data), without worrying about street numbers yet.
m <- adist(v)
rownames(m) <- v
> m
# [,1] [,2] [,3] [,4] [,5] [,6] [,7]
#5 James Street 0 2 3 1 4 17 17
#5 Jmaes Street 2 0 4 3 6 17 17
#5#Jam#es Str$eet 3 4 0 4 6 17 17
#6 James Street 1 3 4 0 4 17 17
#James Street 5 4 6 6 4 0 16 17
#10a Cold Winter Road 17 17 17 17 16 0 1
#10b Cold Winter Road 17 17 17 17 17 1 0
In this case, we can clearly identify the two clusters, but we could also use hclust() to visualize a dendrogram.
cl <- hclust(as.dist(m))
plot(cl)
rect.hclust(cl, 2)
Then, we tag each street with its corresponding cluster of similarities, iterate through them, and check for matching street numbers.
library(dplyr)
res <- data.frame(cluster = cutree(cl, 2)) %>%
  tibble::rownames_to_column("address") %>%
  mutate(
    # Extract all components of the address
    lst = stringi::stri_extract_all_words(address),
    # Identify the component containing the street number and return it
    num = purrr::map_chr(lst, .f = ~ grep("\\d+", .x, value = TRUE))) %>%
  # For each cluster, tag matching street numbers
  # (group_indices_() is deprecated in newer dplyr; cur_group_id() is its successor)
  mutate(group = group_indices_(., .dots = c("cluster", "num")))
Which gives:
# address cluster lst num group
#1 5 James Street 1 5, James, Street 5 1
#2 5 Jmaes Street 1 5, Jmaes, Street 5 1
#3 5#Jam#es Str$eet 1 5, Jam, es, Str, eet 5 1
#4 6 James Street 1 6, James, Street 6 2
#5 James Street 5 1 James, Street, 5 5 1
#6 10a Cold Winter Road 2 10a, Cold, Winter, Road 10a 3
#7 10b Cold Winter Road 2 10b, Cold, Winter, Road 10b 4
You could then pull() the unique addresses based on group using distinct():
> distinct(res, group, .keep_all = TRUE) %>% pull(address)
#[1] "5 James Street" "6 James Street" "10a Cold Winter Road"
# "10b Cold Winter Road"
Data
v <- c("5 James Street", "5 Jmaes Street", "5#Jam#es Str$eet", "6 James Street",
"James Street 5", "10a Cold Winter Road", "10b Cold Winter Road")

Arrange dataframe for pairwise correlations

I am working with data in the following form:
Country Player Goals
"USA" "Tim" 0
"USA" "Tim" 0
"USA" "Dempsey" 3
"USA" "Dempsey" 5
"Brasil" "Neymar" 6
"Brasil" "Neymar" 2
"Brasil" "Hulk" 5
"Brasil" "Luiz" 2
"England" "Rooney" 4
"England" "Stewart" 2
Each row represents the number of goals that a player scored per game, and also contains that player's country. I would like to have the data in the form such that I can run pairwise correlations to see whether being from the same country has some association with the number of goals that a player scores. The data would look like this:
Player_1 Player_2
0 8 # Tim Dempsey
8 5 # Neymar Hulk
8 2 # Neymar Luiz
5 2 # Hulk Luiz
4 2 # Rooney Stewart
(You can ignore the comments, they are there simply to clarify what each row contains).
How would I do this?
table(df$player)
gets me the number of goals per player, but then how do I generate these pairwise combinations?
This is a pretty classic self-join problem. I'm gonna start by summarizing your data to get the total goals for each player. I like dplyr for this, but aggregate or data.table work just fine too.
library(dplyr)
df <- df %>% group_by(Player, Country) %>% dplyr::summarize(Goals = sum(Goals))
> df
Source: local data frame [7 x 3]
Groups: Player
Player Country Goals
1 Dempsey USA 8
2 Hulk Brasil 5
3 Luiz Brasil 2
4 Neymar Brasil 8
5 Rooney England 4
6 Stewart England 2
7 Tim USA 0
Then, using good old merge, we join it to itself based on country. So we don't get each pair twice (Dempsey/Tim and Tim/Dempsey, not to mention Dempsey/Dempsey), we'll subset it so that Player.x is alphabetically before Player.y. Since I already loaded dplyr I'll use filter, but subset would do the same thing.
df2 <- merge(df, df, by.x = "Country", by.y = "Country")
df2 <- filter(df2, as.character(Player.x) < as.character(Player.y))
> df2
Country Player.x Goals.x Player.y Goals.y
2 Brasil Hulk 5 Luiz 2
3 Brasil Hulk 5 Neymar 8
6 Brasil Luiz 2 Neymar 8
11 England Rooney 4 Stewart 2
15 USA Dempsey 8 Tim 0
The self-join could be done in dplyr if we made a little copy of the data and renamed the Player and Goals columns so they wouldn't be joined on. Since merge is pretty smart about the renaming, it's easier in this case.
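For reference, newer dplyr versions handle that renaming via the suffix argument of the join functions. A minimal sketch, assuming the summarised df from above:
library(dplyr)
df2 <- inner_join(df, df, by = "Country", suffix = c(".x", ".y")) %>%
  filter(as.character(Player.x) < as.character(Player.y))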
There is probably a smarter way to get from the aggregated data to the pairs, but assuming your data is not too big (national soccer data), you can always do something like:
A <- aggregate(df$Goals ~ df$Player + df$Country, data = df, sum)
players_in_c <- table(A[, 2])
dat <- NULL
for (i in levels(df$Country)) {
  count <- players_in_c[i]      # number of players in country i
  pair <- combn(count, m = 2)   # all index pairs among those players
  B <- A[A[, 2] == i, ]
  dat <- rbind(dat, cbind(B[pair[1, ], ], B[pair[2, ], ]))
}
dat
> dat
df$Player df$Country df$Goals df$Player df$Country df$Goals
1 Hulk Brasil 5 Luiz Brasil 2
1.1 Hulk Brasil 5 Neymar Brasil 8
2 Luiz Brasil 2 Neymar Brasil 8
4 Rooney England 4 Stewart England 2
6 Dempsey USA 8 Tim USA 0

Make repeating character vector values

Hey there everyone, I'm just getting started with R, so I decided to make up some data with the eventual goal of superimposing it on a map.
Before I can get there, I'm trying to label my data by Province so I can sort by it.
Drugs <- c("Azin", "Prolof")
Provinces <- c("Ontario", "British Columbia", "Quebec")
Gender <- c("Female", "Male")
raw <- c(10,16,8,20,7,12,13,11,9,7,14,7)
yomom <- matrix(raw, nrow = 6, ncol = 2)
colnames(yomom) <- Drugs
bro <- data.frame(Gender, yomom)
idunno <- data.frame(Provinces, bro)
The first problem I've encountered is that the Provinces vector just repeats in order, and I'm not sure how to make it look the way I want in R: I'm basically trying to get each province to cover two consecutive rows.
Something like this?
idunno <- data.frame(Provinces=rep(Provinces,each=2), bro)
idunno
# Provinces Gender Azin Prolof
# 1 Ontario Female 10 13
# 2 Ontario Male 16 11
# 3 British Columbia Female 8 9
# 4 British Columbia Male 20 7
# 5 Quebec Female 7 14
# 6 Quebec Male 12 7
Read the documentation on rep(...)
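For reference, the times/each distinction in rep():
rep(c("a", "b"), times = 2)  # "a" "b" "a" "b"
rep(c("a", "b"), each = 2)   # "a" "a" "b" "b"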
