How can I create a term matrix that sums numeric values associated with each document?

I'm a bit new to R and tm so struggling with this exercise!
I have one description column with messy unstructured data containing words about the name, city and country of a customer. And another column with the amount of sold items.
Description                   Sold Items
Mrs White London UK           10
Mr Wolf London UK             20
Tania Maier Berlin Germany    10
Thomas Germany                30
Nick Forest Leeds UK          20
Silvio Verdi Italy Torino     10
Tom Cardiff UK                10
Mary House London             5
Using the tm package and DocumentTermMatrix(), I'm able to break each row down into terms and get the frequency of each word (i.e. the number of customers whose description contains that word).
            UK   London   Germany   …   Mary
Frequency   4    3        2         …   1
However, I would also like to sum the total amount of sold items.
The desired output should be:
                    UK   London   Germany   …   Mary
Frequency           4    3        2         …   1
Sum of Sold Items   60   35       40        …   5
How can I get to this result?

Assuming you can get to the stage where you have the frequency table:
            UK   London   Germany   …   Mary
Frequency   4    3        2         …   1
and you can extract the words from it, you can use an apply-family function together with grep. Here I create a vector that stands in for the dictionary you would extract from your frequency table:
S_data<-read.csv("data.csv",stringsAsFactors = F)
Words<-c("UK","London","Germany","Mary")
Then use it with sapply and lapply as follows. This could be done more efficiently, but you will get the idea:
string_rows<-sapply(Words, function(x) grep(x,S_data$Description))
string_sum<-unlist(lapply(string_rows, function(x) sum(S_data$Items[x])))
> string_sum
UK London Germany Mary
60 35 40 5
Then just bind this onto your frequency table, for example with rbind().
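As an alternative to grepping for each word, the frequency row and the sum row can be built in one pass by splitting every description into words. This is only a sketch in base R, not the tm-based approach from the question: the data frame is typed in from the question's table, and the column names Description and Items are assumptions.

```r
# Toy data typed in from the question; the column names are assumptions
S_data <- data.frame(
  Description = c("Mrs White London UK", "Mr Wolf London UK",
                  "Tania Maier Berlin Germany", "Thomas Germany",
                  "Nick Forest Leeds UK", "Silvio Verdi Italy Torino",
                  "Tom Cardiff UK", "Mary House London"),
  Items = c(10, 20, 10, 30, 20, 10, 10, 5),
  stringsAsFactors = FALSE
)

# Split each description on whitespace; unique() makes a word count once per row
words_per_row <- lapply(strsplit(S_data$Description, "\\s+"), unique)

# Long format: one row per (word, customer) pair, carrying that customer's items
long <- data.frame(
  Term  = unlist(words_per_row),
  Items = rep(S_data$Items, lengths(words_per_row)),
  stringsAsFactors = FALSE
)

# Per-term number of customers and per-term sum of sold items
freq <- tapply(long$Items, long$Term, length)
sold <- tapply(long$Items, long$Term, sum)

result <- rbind(Frequency = as.vector(freq), Sum_of_Sold_Items = as.vector(sold))
colnames(result) <- names(freq)

result[, c("UK", "London", "Germany", "Mary")]
```

Unlike grep, which also matches substrings (a pattern such as "UK" would match inside a longer word), this version only counts whole whitespace-separated words.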

Related

Subtracting subset from larger dataset in R

Hi all: I have two variables. The first is entitled WITHOUT_VERANDAS. It is a list of cities, aggregated by average rental prices of homes WITHOUT verandas (there are about 200 rows):
City Price
1 Appleton 5000
2 Ames 9000
3 Lodi 1020
4 Milwaukee 2010
5 Barstow 2000
6 Chicago 2320
7 Champaign 2000
The second variable is entitled WITH_VERANDAS. It's a list of cities, aggregated by average rental prices of homes WITH verandas (there are about 10 rows, this is a subset of the previous dataset, since not every city has rental properties with verandas):
City Price
1 Milwaukee 3000
2 Chicago 2050
3 Lodi 5000
For each city on the WITH_VERANDAS list, I want to subtract that city's WITHOUT_VERANDAS city value from the first list. I want to see which cities have the highest or lowest differential. Essentially, the result should only include the WITH_VERANDAS data.
I've tried this:
difference <- WITH_VERANDAS$Price-WITHOUT_VERANDAS$Price
View(difference)
However, this returns as many rows as the WITHOUT_VERANDAS dataset. I also get a warning:
longer object length is not a multiple of shorter object length
And the result simply subtracts WITHOUT_VERANDAS's row 1 from WITH_VERANDAS's row 1, and so on down the rows, as seen in the results (for example, row 1 of the output is Milwaukee - Appleton, row 2 is Chicago - Ames, and so forth):
1. -2000
2. -6950
If I could only filter WITHOUT_VERANDAS to include only the cities included in WITH_VERANDAS, I think it would work. Thanks!
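One way to line the two tables up by city before subtracting is an inner join with merge(). This is a sketch under assumptions: both objects are data frames with columns City and Price, the values are typed in from the excerpts above, and the suffixes and the names both and ranked are chosen here for illustration.

```r
# Toy data typed in from the question (the real tables are longer)
WITHOUT_VERANDAS <- data.frame(
  City  = c("Appleton", "Ames", "Lodi", "Milwaukee", "Barstow", "Chicago", "Champaign"),
  Price = c(5000, 9000, 1020, 2010, 2000, 2320, 2000),
  stringsAsFactors = FALSE
)
WITH_VERANDAS <- data.frame(
  City  = c("Milwaukee", "Chicago", "Lodi"),
  Price = c(3000, 2050, 5000),
  stringsAsFactors = FALSE
)

# merge() defaults to an inner join, so only cities present in both tables remain
both <- merge(WITH_VERANDAS, WITHOUT_VERANDAS, by = "City",
              suffixes = c(".with", ".without"))

# Price with verandas minus price without, matched city by city
both$Difference <- both$Price.with - both$Price.without

# Largest differential first
ranked <- both[order(both$Difference, decreasing = TRUE), ]
ranked
```

Because the join matches on City rather than on row position, the vectors being subtracted always have the same length and the recycling warning goes away.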
R2evans, thank you! This worked great. Now I have:
City Price.x Price.y
1 Appleton NA 5000
2 Ames NA 9000
3 Lodi 5000 1020
4 Milwaukee 3000 2010
How would I go about filtering this list to take out any row where Price.x is NA, i.e. all rows that did not match? Thanks again!
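A minimal sketch of that filter, assuming the merged result is a data frame (called merged here, a hypothetical name) with the columns shown above: keep the rows where is.na() is FALSE for Price.x.

```r
# The merged rows shown above, typed in by hand; `merged` is a hypothetical name
merged <- data.frame(
  City    = c("Appleton", "Ames", "Lodi", "Milwaukee"),
  Price.x = c(NA, NA, 5000, 3000),
  Price.y = c(5000, 9000, 1020, 2010),
  stringsAsFactors = FALSE
)

# Keep only rows where Price.x is present, i.e. cities that matched
matched <- merged[!is.na(merged$Price.x), ]

# The differential can then be computed row by row
matched$Difference <- matched$Price.x - matched$Price.y
matched
```

If several columns may contain NA, complete.cases(merged) can replace the is.na() test to drop every row with a missing value in any column.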

Construct a vector of names from data frame using R

I have a big data frame that contains data about the outcomes of sports matches. I want to try and extract specific data from the data frame depending on certain criteria. Here's a quick example of what I mean...
Imagine I have a data frame df, which displays data about specific football matches of a tournament on each row, like so:
Winner_Teams Win_Capt_Nm Win_Country Loser_teams Lose_Capt_Nm Lose_Country
1 Man utd John England Barcalona Carlos Spain
2 Liverpool Steve England Juventus Mario Italy
3 Man utd John Scotland R Madrid Juan Spain
4 Paris SG Teirey France Chelsea Mark England
So, for example, in row [1] Man utd won against Barcalona, Man utd's captain's name was John and he is from England. Barcalona's (the losers of the match) captain's name was Carlos and he is from Spain.
I want to construct a vector with the names of all English players in the tournament, where the output should look something like this:
[1] "John" "Mark" "Steve"
Here's what I've tried so far...
My first step was to create a data frame that discards all the matches that don't have English captains
> England_player <- data.frame(filter(df, Win_Country == "England" | Lose_Country == "England"))
> England_player
Winner_Teams Win_Capt_Nm Win_Country Loser_teams Lose_Capt_Nm Lose_Country
1 Man utd John England Barcalona Carlos Spain
2 Liverpool Steve England Juventus Mario Italy
3 Paris SG Teirey France Chelsea Mark England
Then I used select() on England_player to isolate just the names:
> England_player_names <- select(England_player, Win_Capt_Nm, Lose_Capt_Nm)
> England_player_names
Win_Capt_Nm Lose_Capt_Nm
1 John Carlos
2 Steve Mario
3 Teirey Mark
And then I get stuck! As you can see, the output displays the English winner's name and the name of his opponent... which is not what I want!
It's easy to just read the names off this data frame.. but the data frame I'm working with is large, so just reading the values is no good!
Any suggestions as to how I'd do this?
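Pick the winning captains whose country is England and the losing captains whose country is England, then combine the two sets with union(), which also drops duplicates:
english.players <- union(df$Win_Capt_Nm[df$Win_Country == 'England'], df$Lose_Capt_Nm[df$Lose_Country == 'England'])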
[1] "John" "Steve" "Mark"
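The same result can also be reached by first stacking winners and losers into one long table of captains and then filtering once. The sketch below assumes the four example rows from the question, typed in as character columns; the helper names captains and english_players are made up for illustration.

```r
# Example matches typed in from the question
df <- data.frame(
  Winner_Teams = c("Man utd", "Liverpool", "Man utd", "Paris SG"),
  Win_Capt_Nm  = c("John", "Steve", "John", "Teirey"),
  Win_Country  = c("England", "England", "Scotland", "France"),
  Loser_teams  = c("Barcalona", "Juventus", "R Madrid", "Chelsea"),
  Lose_Capt_Nm = c("Carlos", "Mario", "Juan", "Mark"),
  Lose_Country = c("Spain", "Italy", "Spain", "England"),
  stringsAsFactors = FALSE
)

# Stack winning and losing captains into one table: one row per captain per match
captains <- data.frame(
  Name    = c(df$Win_Capt_Nm, df$Lose_Capt_Nm),
  Country = c(df$Win_Country, df$Lose_Country),
  stringsAsFactors = FALSE
)

# Unique, alphabetically sorted names of captains listed as English
english_players <- sort(unique(captains$Name[captains$Country == "England"]))
english_players
```

The long layout is handy if more per-player questions follow, since any later filter or count only has to look at two columns.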

Leaflet R color map based on multiple variables?

From what I have seen, colored maps in leaflet usually depict only one variable (GDP, crime stats, temperature, etc.).
Is there a way to make maps in leaflet for R that display the highest-valued variable in a data frame? For example, a map showing which alcoholic beverage is the most popular in each country.
[image: example map of the most popular alcoholic beverage by country; source: dailymail.co.uk]
Say that I had a data frame that looked like this and I wanted to do a similar map to the alcoholic beverage one...
Country Beer Wine Spirits Coffee Tea
Sweden 7 7 5 10 6
USA 9 6 6 7 5
Russia 5 3 9 5 8
Is there a way in leaflet R to pick out the alcoholic beverages, assign them a color and then display them on the map to show which type of alcoholic beverage is the most popular in the three different countries?
Step 0, make a test data frame:
> set.seed(1234)
> drinks = data.frame(Country=c("Sweden","USA","Russia"),
Beer=sample(10,3), Wine=sample(10,3), Spirits=sample(10,3),
Coffee=sample(10,3), Tea=sample(10,3))
Note I have country as a column - yours might have countries in the row names which means the following code needs changing. Anyway. We get:
> drinks
Country Beer Wine Spirits Coffee Tea
1 Sweden 2 7 1 6 3
2 USA 6 8 3 7 9
3 Russia 5 6 6 5 10
Now we combine apply to work along rows, which.max to get the highest element, and various subset operations to drop the country column and get the drink name from the column names:
> drinks$Favourite = names(drinks)[-1][apply(drinks[,-1],1,which.max)]
> drinks
Country Beer Wine Spirits Coffee Tea Favourite
1 Sweden 2 7 1 6 3 Wine
2 USA 6 8 3 7 9 Tea
3 Russia 5 6 6 5 10 Tea
If there's a tie, which.max returns the first maximum, which here means the leftmost of the tied columns. If you want something else then you'll have to rewrite.
Now feed your new data frame to leaflet and map the Favourite column.
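Since the question asks about the alcoholic beverages only, the which.max step can be restricted to those columns, and a colour can be attached to each favourite before mapping. This sketch uses the question's example values and base R only; the colour names and the alcohol_cols and drink_colours vectors are arbitrary assumptions, and the mapping step in leaflet itself is not shown.

```r
# Example values typed in from the question
drinks <- data.frame(
  Country = c("Sweden", "USA", "Russia"),
  Beer    = c(7, 9, 5),
  Wine    = c(7, 6, 3),
  Spirits = c(5, 6, 9),
  Coffee  = c(10, 7, 5),
  Tea     = c(6, 5, 8),
  stringsAsFactors = FALSE
)

# Only these columns take part in the comparison
alcohol_cols <- c("Beer", "Wine", "Spirits")

# For each row, the position of the largest value among the alcoholic drinks;
# ties go to the first (leftmost) column
drinks$Favourite <- alcohol_cols[apply(drinks[, alcohol_cols], 1, which.max)]

# Look up one colour per drink through a named vector (colours chosen arbitrarily)
drink_colours <- c(Beer = "gold", Wine = "darkred", Spirits = "steelblue")
drinks$Colour <- unname(drink_colours[drinks$Favourite])

drinks[, c("Country", "Favourite", "Colour")]
```

The Colour column can then be handed to whatever fill-colour argument the map layer takes once the country shapes have been joined to this table.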

How can I count the number of instances a value occurs within a subgroup in R?

I have a data frame that I'm working with in R, and am trying to check how many times a value occurs within its larger, associated group. Specifically, I'm trying to count the number of cities that are listed for each particular country.
My data look something like this:
City Country
=========================
New York US
San Francisco US
Los Angeles US
Paris France
Nantes France
Berlin Germany
It seems that table() is the way to go, but I can't quite figure it out — how can I find out how many cities are listed for each country? That is to say, how can I find out how many fields in one column are associated with a particular value in another column?
EDIT:
I'm hoping for something along the lines of
3 US
2 France
1 Germany
I guess you can try table.
table(df$Country)
# France Germany US
# 2 1 3
Or using data.table
library(data.table)
setDT(df)[, .N, by=Country]
# Country N
#1: US 3
#2: France 2
#3: Germany 1
Or
library(plyr)
count(df$Country)
# x freq
#1 France 2
#2 Germany 1
#3 US 3
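To get the counts in the order asked for in the edit (largest first, count then country), the table can be sorted and turned into a data frame. A sketch in base R, with the example rows typed in from the question and the column name Cities chosen here for illustration:

```r
# Example rows typed in from the question
df <- data.frame(
  City    = c("New York", "San Francisco", "Los Angeles", "Paris", "Nantes", "Berlin"),
  Country = c("US", "US", "US", "France", "France", "Germany"),
  stringsAsFactors = FALSE
)

# Count rows per country, largest count first
counts <- sort(table(df$Country), decreasing = TRUE)

# Two-column layout: number of cities, then the country name
result <- data.frame(
  Cities  = as.vector(counts),
  Country = names(counts),
  stringsAsFactors = FALSE
)
result
```

Note that this counts rows, so a city listed twice for the same country would be counted twice; applying unique() to the data frame first avoids that.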

Arrange dataframe for pairwise correlations

I am working with data in the following form:
Country Player Goals
"USA" "Tim" 0
"USA" "Tim" 0
"USA" "Dempsey" 3
"USA" "Dempsey" 5
"Brasil" "Neymar" 6
"Brasil" "Neymar" 2
"Brasil" "Hulk" 5
"Brasil" "Luiz" 2
"England" "Rooney" 4
"England" "Stewart" 2
Each row represents the number of goals that a player scored per game, and also contains that player's country. I would like to have the data in the form such that I can run pairwise correlations to see whether being from the same country has some association with the number of goals that a player scores. The data would look like this:
Player_1 Player_2
0 8 # Tim Dempsey
8 5 # Neymar Hulk
8 2 # Neymar Luiz
5 2 # Hulk Luiz
4 2 # Rooney Stewart
(You can ignore the comments, they are there simply to clarify what each row contains).
How would I do this?
table(df$Player)
gets me the number of goals per player, but then how do I generate these pairwise combinations?
This is a pretty classic self-join problem. I'm gonna start by summarizing your data to get the total goals for each player. I like dplyr for this, but aggregate or data.table work just fine too.
library(dplyr)
df <- df %>% group_by(Player, Country) %>% dplyr::summarize(Goals = sum(Goals))
> df
Source: local data frame [7 x 3]
Groups: Player
Player Country Goals
1 Dempsey USA 8
2 Hulk Brasil 5
3 Luiz Brasil 2
4 Neymar Brasil 8
5 Rooney England 4
6 Stewart England 2
7 Tim USA 0
Then, using good old merge, we join it to itself based on country, and then so we don't get each row twice (Dempsey, Tim and Tim, Dempsey---not to mention Dempsey, Dempsey), we'll subset it so that Player.x is alphabetically before Player.y. Since I already loaded dplyr I'll use filter, but subset would do the same thing.
df2 <- merge(df, df, by.x = "Country", by.y = "Country")
df2 <- filter(df2, as.character(Player.x) < as.character(Player.y))
> df2
Country Player.x Goals.x Player.y Goals.y
2 Brasil Hulk 5 Luiz 2
3 Brasil Hulk 5 Neymar 8
6 Brasil Luiz 2 Neymar 8
11 England Rooney 4 Stewart 2
15 USA Dempsey 8 Tim 0
The self-join could be done in dplyr if we made a little copy of the data and renamed the Player and Goals columns so they wouldn't be joined on. Since merge is pretty smart about the renaming, it's easier in this case.
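For readers without dplyr, the summarise, self-join and filter steps can be sketched in base R, followed by the correlation the question is ultimately after. The data are typed in from the question; the names totals, both, pairs and r are made up for this sketch.

```r
# Per-game rows typed in from the question
df <- data.frame(
  Country = c("USA", "USA", "USA", "USA", "Brasil", "Brasil",
              "Brasil", "Brasil", "England", "England"),
  Player  = c("Tim", "Tim", "Dempsey", "Dempsey", "Neymar", "Neymar",
              "Hulk", "Luiz", "Rooney", "Stewart"),
  Goals   = c(0, 0, 3, 5, 6, 2, 5, 2, 4, 2),
  stringsAsFactors = FALSE
)

# Total goals per player (Country is kept so it can serve as the join key)
totals <- aggregate(Goals ~ Player + Country, data = df, FUN = sum)

# Self-join on Country: every player paired with every compatriot, including himself
both <- merge(totals, totals, by = "Country")

# Keep each unordered pair once and drop the self-pairs
pairs <- both[both$Player.x < both$Player.y, ]
pairs

# Correlation between the goal totals of the paired compatriots
r <- cor(pairs$Goals.x, pairs$Goals.y)
r
```

Which player ends up in the x column depends only on alphabetical order, so the sign and size of the correlation from five pairs should be read with caution.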
There is probably a smarter way to get from the aggregated data to the pairs, but assuming your data is not too big (national soccer data), you can always do something like:
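# Total goals per player; the columns end up named df$Player, df$Country, df$Goals
A<-aggregate(df$Goals~df$Player+df$Country,data=df,sum)
# Number of players per country
players_in_c<-table(A[,2])
dat<-NULL
# Loop over the country names (works whether Country is a factor or a character column)
for(i in names(players_in_c)) {
count<-players_in_c[[i]]
# All index pairs among this country's players (needs at least 2 players)
pair<-combn(count,m=2)
B<-A[A[,2]==i,]
dat<-rbind(dat, cbind(B[pair[1,],],B[pair[2,],]) )
}
dat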
> dat
df$Player df$Country df$Goals df$Player df$Country df$Goals
1 Hulk Brasil 5 Luiz Brasil 2
1.1 Hulk Brasil 5 Neymar Brasil 8
2 Luiz Brasil 2 Neymar Brasil 8
4 Rooney England 4 Stewart England 2
6 Dempsey USA 8 Tim USA 0
