Subtracting subset from larger dataset in R

Hi all: I have two variables. The first is entitled WITHOUT_VERANDAS. It is a list of cities, aggregated by average rental prices of homes WITHOUT verandas (there are about 200 rows):
       City Price
1  Appleton  5000
2      Ames  9000
3      Lodi  1020
4 Milwaukee  2010
5   Barstow  2000
6   Chicago  2320
7 Champaign  2000
The second variable is entitled WITH_VERANDAS. It's a list of cities, aggregated by average rental prices of homes WITH verandas (there are about 10 rows, this is a subset of the previous dataset, since not every city has rental properties with verandas):
       City Price
1 Milwaukee  3000
2   Chicago  2050
3      Lodi  5000
For each city on the WITH_VERANDAS list, I want to subtract that city's WITHOUT_VERANDAS price from its WITH_VERANDAS price, so I can see which cities have the highest or lowest differential. Essentially, the result should only include the cities in WITH_VERANDAS.
I've tried this:
difference <- WITH_VERANDAS$Price - WITHOUT_VERANDAS$Price
View(difference)
However, this returns as many rows as the WITHOUT_VERANDAS dataset, along with a warning:
longer object length is not a multiple of shorter object length
The result simply subtracts WITHOUT_VERANDAS row 1 from WITH_VERANDAS row 1, row 2 from row 2, and so on, recycling the shorter vector without regard for city names. For example, row 1 of the output is the value of Milwaukee - Appleton, row 2 is Chicago - Ames, and so forth:
1. -2000
2. -6950
If I could filter WITHOUT_VERANDAS down to just the cities included in WITH_VERANDAS, I think it would work. Thanks!
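The answer that solved this is not preserved on this page, but judging from the Price.x/Price.y columns in the follow-up below, it was a merge() on City. A minimal sketch of that approach (the name merged, the all = TRUE choice, and the difference column are my additions, not the original answer):
merged <- merge(WITH_VERANDAS, WITHOUT_VERANDAS, by = "City", all = TRUE)
# Price.x comes from WITH_VERANDAS, Price.y from WITHOUT_VERANDAS;
# cities with no veranda listings get NA in Price.x
merged$difference <- merged$Price.x - merged$Price.y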

R2evans, thank you! This worked great. Now I have:
       City Price.x Price.y
1  Appleton      NA    5000
2      Ames      NA    9000
3      Lodi    5000    1020
4 Milwaukee    3000    2010
How would I go about filtering this list to take out any row where Price.x is NA, i.e., all rows that did not match? Thanks again!
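A standard way to drop the unmatched rows (a sketch, reusing the merged name assumed above):
merged[!is.na(merged$Price.x), ]  # keep only cities present in WITH_VERANDAS
# or, since Price.x is the only column with NAs here:
na.omit(merged)                   # drops every row containing an NA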

Related

Subsetting rows of a dataframe when respondent number is duplicated in column

I have a huge dataset which is partly pooled cross section and partly panel data:
  Year Country Respnr Power Nr
1 2000  France      1  1213  1
2 2001  France      2  1234  2
3 2000      UK      3  1726  3
4 2001      UK      3  6433  4
I would like to filter the panel data from the combined data and tried the following:
> anyDuplicated(df$Respnr)
[1] 45047  # out of 340,000 rows
dfpanel <- subset(df, duplicated(df$Respnr) == TRUE)
The new data frame is, however, reduced to zero observations. The following led to the expected number of observations:
dfpanel <- subset(df, Nr < 3)
Any idea what could be the issue?
Although I have not figured out why the previous approach did not work, the following provides a working solution. I have simply split the previous approach into two steps. It adds a column panel, which in my case is actually a welcome addition:
df$panel <- duplicated(df$Respnr)
dfpanel <- subset(df, panel == TRUE)
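One caveat worth adding (my note, not from the original thread): duplicated() marks only the second and later occurrences of each value, so the subset above drops each respondent's first row. To keep every row of every respondent who appears more than once, flag duplicates from both directions:
df$panel <- duplicated(df$Respnr) | duplicated(df$Respnr, fromLast = TRUE)
dfpanel <- subset(df, panel)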

How can I create a term matrix that sums numeric values associated with each document?

I'm a bit new to R and tm, so I'm struggling with this exercise!
I have one description column with messy unstructured data containing words about the name, city and country of a customer, and another column with the number of items sold.
Description                Sold Items
Mrs White London UK                10
Mr Wolf London UK                  20
Tania Maier Berlin Germany         10
Thomas Germany                     30
Nick Forest Leeds UK               20
Silvio Verdi Italy Torino          10
Tom Cardiff UK                     10
Mary House London                   5
Using the tm package and DocumentTermMatrix, I'm able to break down each row into terms and get the frequency of each word (i.e. the number of customers with that word).
          UK London Germany … Mary
Frequency  4      3       2 …    1
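For reference, a minimal sketch of how such a frequency table is typically produced with tm (my reconstruction; the asker's exact code is not shown, and S_data is the data frame name used in the answer below):
library(tm)
corpus <- VCorpus(VectorSource(S_data$Description))
# tolower = FALSE keeps "UK" intact; wordLengths = c(2, Inf) keeps
# two-letter terms, which DocumentTermMatrix drops by default
dtm <- DocumentTermMatrix(corpus, control = list(tolower = FALSE, wordLengths = c(2, Inf)))
freq <- colSums(as.matrix(dtm))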
However, I would also like to sum the total number of items sold per term.
The desired output should be:
                  UK London Germany … Mary
Frequency          4      3       2 …    1
Sum of Sold Items 60     35      40 …    5
How can I get to this result?
Assuming you can get to the stage where you have the Frequency table:
          UK London Germany … Mary
Frequency  4      3       2 …    1
and can extract the words, you can use an apply function with grep(). Here I create a vector representing the dictionary you would extract from your frequency table:
S_data <- read.csv("data.csv", stringsAsFactors = FALSE)
Words <- c("UK", "London", "Germany", "Mary")
Then use these in an apply as follows. This could be done more efficiently, but you will get the idea:
# for each word, find the rows whose Description mentions it
string_rows <- sapply(Words, function(x) grep(x, S_data$Description))
# then sum the sold items over those rows for each word
string_sum <- unlist(lapply(string_rows, function(x) sum(S_data$Items[x])))
> string_sum
     UK  London Germany    Mary
     60      35      40       5
Just bind this onto your frequency table.
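A sketch of that final binding step (assuming the frequency row is a named vector freq in the same column order; the names here are mine, not the answerer's):
rbind(Frequency = freq, "Sum of Sold Items" = string_sum)
Note that rbind matches by position, not by name, so freq and string_sum must list the words in the same order.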

Create list of elements which match a value

I have a table of values with the name, zipcode and opening date of recreational pot shops in WA state.
                       name   zip    opening
1             The Stash Box 98002 2014-11-21
3                 Greenside 98198 2015-01-01
4                Bud Nation 98106 2015-06-29
5 West Seattle Cannabis Co. 98168 2015-02-28
6               Nimbin Farm 98168 2015-04-25
...
I'm analyzing this data to see if there are any correlations between drug usage and location and opening of recreational stores. For one of the visualizations I'm doing, I am organizing the data by number of shops per zipcode using the group_by() and summarize() functions in dplyr.
    zip count
  (int) (int)
1 98002     1
2 98106     1
3 98168     2
4 98198     1
...
This data is then plotted onto a leaflet map, showing the relative number of shops in a zipcode by using the radius of the circles to represent the count.
I would like to reorganize the name variable into a third column so that the names can pop up in my visualization when hovering over each circle. Ideally, the data would look something like this:
    zip count name
  (int) (int) (character)
1 98002     1 The Stash Box
2 98106     1 Bud Nation
3 98168     2 Nimbin Farm, West Seattle Cannabis Co.
4 98198     1 Greenside
...
Where all shops in the same zipcode appear together in the third column. I've tried various for loops and if statements, but I'm sure there is a better way to do this; my R skills are just not up there yet. Any help would be appreciated.
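No answer is preserved for this question on this page, but a dplyr summarize() with paste() is the usual approach. A minimal sketch (the data frame name shops is my assumption):
library(dplyr)
shops %>%
  group_by(zip) %>%
  summarize(count = n(),
            name = paste(sort(name), collapse = ", "))  # sorted, comma-separated shop names per zip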

How can I count the number of instances a value occurs within a subgroup in R?

I have a data frame that I'm working with in R, and am trying to check how many times a value occurs within its larger, associated group. Specifically, I'm trying to count the number of cities that are listed for each particular country.
My data look something like this:
City          Country
=====================
New York      US
San Francisco US
Los Angeles   US
Paris         France
Nantes        France
Berlin        Germany
It seems that table() is the way to go, but I can't quite figure it out — how can I find out how many cities are listed for each country? That is to say, how can I find out how many fields in one column are associated with a particular value in another column?
EDIT:
I'm hoping for something along the lines of
3 US
2 France
1 Germany
I guess you can try table():
table(df$Country)
#  France Germany      US
#       2       1       3
Or using data.table
library(data.table)
setDT(df)[, .N, by = Country]
#    Country N
# 1:      US 3
# 2:  France 2
# 3: Germany 1
Or
library(plyr)
count(df$Country)
#         x freq
# 1  France    2
# 2 Germany    1
# 3      US    3
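For completeness, the same count with dplyr (my addition; this package is not used in the original answer):
library(dplyr)
count(df, Country)  # equivalent to df %>% group_by(Country) %>% summarise(n = n())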

How do I reorder a factor

I want to reorder a factor based on the values in a subset of its rows. For example, I want to reorder the "country" factor based on the value column of the 2014 entries below; UK would be ranked first and USA second.
dat <- data.frame(
  country = c("USA", "USA", "UK", "UK"),
  year = c(2014, 2013, 2014, 2013),
  value = c(2, NA, 1, NA)
)
  country year value
1     USA 2014     2
2     USA 2013    NA
3      UK 2014     1
4      UK 2013    NA
I don't quite understand how factors are reordered. It seems the reorder command replaces an entire column in a data.frame, but I would think that I should only need to specify a new order for the factor levels. "levels" seems to do the opposite, giving labels to the ordering.
Maybe this:
factor(dat$country, levels = with(dat[dat$year == 2014, ], country[order(value)]))
# [1] USA USA UK  UK
# Levels: UK USA
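To use the reordering, assign the result back to the column (a short usage sketch of the answer above):
dat$country <- factor(dat$country, levels = with(dat[dat$year == 2014, ], country[order(value)]))
levels(dat$country)
# [1] "UK"  "USA"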
country <- factor(c("USA", "USA", "UK", "UK"), levels = c("UK", "USA"))
sort(country)
# [1] UK  UK  USA USA
# Levels: UK USA
