Averaging rows based upon a known, irregular relationship in R

I have data on energy companies whose jurisdictions overlap in places. I want to compute an average of sales for the places where these companies overlap. The same companies always overlap in pairs, so how can I use this information to calculate the averages just for those pairs? There are about 20 pairs of companies.
data <- data.frame(Company = c("Energy USA", "Good Energy",
                               "Hydropower 4 U",
                               "Coal Town",
                               "Energy USA/Good Energy",
                               "Good Energy/Coal Town"),
                   Sales = c(100, 2500, 550, 6000, "?", "?"))
Company Sales
1 Energy USA 100
2 Good Energy 2500
3 Hydropower 4 U 550
4 Coal Town 6000
5 Energy USA/Good Energy ? (Answer: 1300)
6 Good Energy/Coal Town ? (Answer: 4250)

We use grepl to get a logical index of 'Company' elements that have more than one entry, i.e. separated by '/'. Then we split those elements by the delimiter (the output will be a list), loop through the list with sapply, match the elements against the 'Company' column to get their positions, and use those to get the corresponding 'Sales' elements. As the 'Sales' column is a factor, we need to convert it to numeric to get the mean. When we convert factor to numeric class, all non-numeric elements, i.e. ?, are converted to NA. Replace those NA elements with the mean values.
# logical index of rows whose Company contains a '/'
i1 <- grepl('/', data$Company)
# split each combined name, look up the individual Sales, and average them
v1 <- sapply(strsplit(as.character(data$Company[i1]), '/'),
             function(x) mean(as.numeric(as.character(
                 data$Sales[match(x, data$Company)]))))
# converting to numeric turns the "?" placeholders into NA
data$Sales <- as.numeric(as.character(data$Sales))
# fill the NAs with the pair means
data$Sales[is.na(data$Sales)] <- v1
data
# Company Sales
#1 Energy USA 100
#2 Good Energy 2500
#3 Hydropower 4 U 550
#4 Coal Town 6000
#5 Energy USA/Good Energy 1300
#6 Good Energy/Coal Town 4250
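For completeness, a sketch of the same computation with dplyr/tidyr (assumes those packages are installed; separate_rows does the splitting that strsplit did above):
library(dplyr)
library(tidyr)
# individual companies with numeric Sales
singles <- data %>%
  filter(!grepl("/", Company)) %>%
  mutate(Company = as.character(Company),
         Sales = as.numeric(as.character(Sales)))
# split each pair, look up the individual Sales, and average per pair
data %>%
  filter(grepl("/", Company)) %>%
  transmute(pair = as.character(Company), Company = as.character(Company)) %>%
  separate_rows(Company, sep = "/") %>%
  left_join(singles, by = "Company") %>%
  group_by(pair) %>%
  summarise(Sales = mean(Sales))
# Energy USA/Good Energy -> 1300; Good Energy/Coal Town -> 4250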

Without knowing what your original data looks like, it is hard to give a working answer. However, assuming your data has Company and Sales columns with multiple rows for each company, you can do something like this:
mean(data$Sales[data$Company %in% c('Energy USA', 'Good Energy')])
mean(data$Sales[data$Company %in% c('Good Energy', 'Coal Town')])
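For instance, with hypothetical long-format data (repeated rows per company), the %in% subset averages over all matching rows:
data <- data.frame(Company = c("Energy USA", "Energy USA", "Good Energy", "Coal Town"),
                   Sales = c(90, 110, 2500, 6000))
mean(data$Sales[data$Company %in% c("Energy USA", "Good Energy")])
# [1] 900   i.e. (90 + 110 + 2500) / 3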

You could create a new column "jurisdiction" in "data" if your dataset is rather small:
MeansByJurisdiction <- tapply(data$Sales, data$jurisdiction, mean)
Then you can convert the vector to a data frame:
MeansByJurisdiction <- data.frame(MeansByJurisdiction)
The row names of the MeansByJurisdiction data frame will be populated with the jurisdictions, and you can extract them with a simple line of code:
MeansByJurisdiction$jurisdictions <- row.names(MeansByJurisdiction)
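For instance, a minimal sketch assuming a long-format data frame where each row already carries a (hypothetical) jurisdiction label:
# hypothetical data: one row per company per jurisdiction
data <- data.frame(jurisdiction = c("A", "A", "B", "B"),
                   Sales = c(100, 2500, 2500, 6000))
MeansByJurisdiction <- tapply(data$Sales, data$jurisdiction, mean)
MeansByJurisdiction <- data.frame(MeansByJurisdiction)
MeansByJurisdiction$jurisdictions <- row.names(MeansByJurisdiction)
MeansByJurisdiction
#   MeansByJurisdiction jurisdictions
# A                1300             A
# B                4250             B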

Related

Subsetting to Output only certain CHAR names in a single column

I want to keep all rows where the "conm" column contains certain bank names. As you can tell from the code, I have tried using subset to do this, but to no avail.
CMPSTPRFT12 <- subset(CMPSPRFT11, conm = MORGUARD CORP | conm = LEHMAN BROTHERS HOLDINGS INC)
I expect the output in RStudio to show only the rows where the column containing the names of banks includes certain banks, not all banks. I want SunTrust, Lehman Brothers, Morgan Stanley, Goldman Sachs, PennyMac, Bank of America, and Fannie Mae.
Please see other posts on how to phrase your questions more helpfully for others: How to make a great R reproducible example.
You can use dplyr and filter:
library(dplyr)
df <- data.frame(bank = letters[1:10],
                 value = 10:19)
df %>% filter(bank == 'a' | bank == 'b')
bank value
1 a 10
2 b 11
banks <- c('d','g','j')
df %>% filter(bank %in% banks)
bank value
1 d 13
2 g 16
3 j 19
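Applied to the question's data frame, the same pattern would look like this (the exact spellings in conm are assumptions; match them to the values actually present in your data):
library(dplyr)
banks <- c("MORGUARD CORP", "LEHMAN BROTHERS HOLDINGS INC")  # extend with the other banks
CMPSTPRFT12 <- CMPSPRFT11 %>% filter(conm %in% banks)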

Approach to cut a dataset to make a new factor variable

Currently, I am trying to cut the dataset into three parts: developed, developing, and under-developed. The cut criterion is quantiles: developed would be those above the 75% quantile, developing would be between the 50% and 75% quantiles, and under-developed would be below the 50% quantile. However, the quantiles differ by year.
data = data.frame("country" = c("U.S.A", "U.S.A", "Jamaica", "Jamaica", "Congo", "Congo"),
                  "year" = c(2000, 2001, 2000, 2001, 2000, 2001),
                  "gdp_per_capita" = c(30000, 40000, 100, 200, 50, 60))
quantiles = do.call("data.frame",
                    tapply(data$gdp_per_capita, data$year, quantile))
What I did was to calculate the quantiles by year and I got a data frame with just that information. Now, I am trying to use this information to apply above criteria for each year.
Example
2000 = (50% = 3000, 75% = 15999)
2001 = (50% = 5000, 75% = 18000)
The cut points change by year.
Possible results
year country gdp_per_capita status
2000 U.S. 1800000 "developed"
2000 France 200000 "developed"
....more than 500+ obs.
2000 Kenya 300 "under-developed"
2000 Malaysia 1500 "developing"
2001 Malaysia 3000 "developing"
2001 Kenya 500 "under-developed"
2001 Spain 30000 "developed"
2000 India 300 "under-developed"
2001 India 5100 "developing"
What would be the most efficient way to resolve this? I tried using ifelse one case at a time, but that seemed like too much work, and I felt there was no reason to use a computer if I was going to iterate through them one by one.
Instead of data.frame, consider rbind in do.call to create the quantile percents as columns, then merge to the original dataset by year. Finally, calculate status with nested ifelse conditional logic.
### QUANTILES
quantiles_matrix <- do.call("rbind", tapply(data$gdp_per_capita, data$year, quantile))
quantiles_df <- transform(data.frame(quantiles_matrix),
                          year = row.names(quantiles_matrix))

### MERGE
mdf <- merge(data, quantiles_df, by = "year")

### STATUS COLUMN ASSIGNMENT
final_df <- transform(mdf,
                      status = ifelse(gdp_per_capita > X75., "developed",
                               ifelse(gdp_per_capita >= X50. & gdp_per_capita <= X75., "developing",
                               ifelse(gdp_per_capita < X50., "under-developed", NA))))
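For comparison, a sketch of the same logic using cut() within each year (assumes the dplyr package; note that the boundary handling at the exact quantile values may differ slightly from the nested ifelse):
library(dplyr)
data %>%
  group_by(year) %>%
  mutate(status = cut(gdp_per_capita,
                      breaks = c(-Inf, quantile(gdp_per_capita, c(0.5, 0.75)), Inf),
                      labels = c("under-developed", "developing", "developed"),
                      right = FALSE)) %>%
  ungroup()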

Filter factor variable based on counts

I have a dataframe containing house price data, with price and lots of variables. One of these variables is a "sub-area" for the property, and I am trying to incorporate this into various regressions. However, it is a factor variable with almost 3000 levels.
For example:
table(df$sub_area)
 La Jolla  Carlsbad Escondido
        2         5         1
...etc
I want to filter out those places that have only 1 count, since they don't offer much predictive power but add lots of computation time. However, I want to replace the sub_area entry for that property with blank or NA, since I still want to use the rest of the information for that property, such as bedrooms, bathrooms, etc.
For reference, an individual property entry might look like:
ID Beds Baths City Sub_area sqm... etc
1 4 2 San Diego La Jolla 100....
Then I can do
lm(price ~ beds + baths + city + sub_area, data = df)
under the new, smaller sub_area variable with fewer levels.
I want to do this because most of the predictive price power is contained in sub_area for the locations I'm working on.
One way (the cutoff of 10 is arbitrary; use > 1 to drop only the single-count areas):
areas <- names(which(table(df$Sub_area) > 10))
df$Sub_area[!df$Sub_area %in% areas] <- NA
Create a new dataframe with the number of occurrences for each subarea and keep the subareas that occur at least twice.
Then add NAs to the original dataframe if the subarea does not appear in the filtered sub_area_count.
library(dplyr)
sub_area_count <- df %>%
  count(sub_area) %>%
  filter(n > 1)
boo <- !df$sub_area %in% sub_area_count$sub_area
df$sub_area[boo] <- NA
You didn't give a reproducible example, but I think this will work for identifying the places where count == 1:
count_1 <- as.data.frame(table(df$sub_area))
count_1 <- count_1$Var1[which(count_1$Freq==1)]
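For example, with a small made-up df (the sub_area values here are hypothetical):
df <- data.frame(sub_area = c("La Jolla", "La Jolla", "Carlsbad", "Escondido"))
count_1 <- as.data.frame(table(df$sub_area))
count_1 <- count_1$Var1[which(count_1$Freq == 1)]
df$sub_area[df$sub_area %in% count_1] <- NA
df
#   sub_area
# 1 La Jolla
# 2 La Jolla
# 3 Carlsbad
# 4     <NA>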

Reshape specific rows into columns in R

My sample data frame would look like the following:
1 Number Type Code Reason
2 0123 06 09 010
3 Date Amount Damage Act
4 08/31/16 10,000 Y N
5 State City Zip Phone
6 WI GB 1234 Y
I want to make rows 1, 3, and 5 column names and have the data below each fall into each column, respectively. I was looking into the reshape function, but I only saw examples where an entire column of values needed to be individual columns. So I wasn't sure what to do in this scenario--apologies if it's obvious.
Here is the desired output:
1 Number Type Code Reason Date Amount Damage Act State City Zip Phone
2 0123 06 09 010 08/31/16 10,000 Y N WI GB 1234 Y
Thanks
As some people have commented, you could build a data frame out of the rows of your starting data frame, but I think it's a little easier to work on the lines of text.
If your starting file looks something like this
Number , Type , Code ,Reason
0123 , 06 , 09 , 010
Date , Amount , Damage , Act
08/31/16 , 10000 , Y , N
State , City , Zip , Phone
WI , GB , 1234, Y
we can read it in with each line as an element of a character vector:
lines <- readLines("start.csv")
make all the odd lines into a single line:
oddind <- seq(from = 1, to = length(lines), by = 2)
namelines <- paste(lines[oddind], collapse = ",")
make all the even lines into a single line:
datlines <- paste(lines[oddind+1], collapse=",")
make those lines into a new CSV to read:
writeLines(text= c(namelines, datlines), con= "nice.csv")
print(read.csv("nice.csv"))
This gives
Number Type Code Reason Date Amount Damage Act State
1 123 6 9 10 08/31/16 10000 Y N WI
City Zip Phone
1 GB 1234 Y
So, it's all in one row of the data frame and all the variable names show up correctly in the data frame.
The benefits of this strategy are:
It will work for starting CSV files where the number of variables isn't a multiple of 4.
It will work for starting CSV files with any number of rows.
There is no chance of weird dynamic casting happening with unlist() or as.character().
You can create a data frame roughly like the one shown (although it will necessarily have column names). The columns are probably factors if you used one of the standard read.* functions without stringsAsFactors = FALSE, hence the need to convert with as.character.
dat <- read.table(text = "1 Number Type Code Reason
2 0123 06 09 010
3 Date Amount Damage Act
4 08/31/16 10,000 Y N
5 State City Zip Phone
6 WI GB 1234 Y")
Then you can set the odd-numbered rows as the names of the vector of values taken from the even-numbered rows:
setNames(unlist(lapply(dat[!c(TRUE, FALSE), ], as.character)),
         unlist(lapply(dat[c(TRUE, FALSE), ], as.character)))
1 3 5 Number Date State Type
"2" "4" "6" "0123" "08/31/16" "WI" "06"
Amount City Code Damage Zip Reason Act
"10,000" "GB" "09" "Y" "1234" "010" "N"
Phone
"Y"
The !c(TRUE,FALSE) and its logical complement in the next extract operation get recycled along all the rows. Obviously there would be better ways of doing this if you posted a version of a text file rather than saying the starting point was a dataframe. You would need to remove what were probably row names. If you want a "clean" solution, post either dput(.) of your dataframe or the raw text file.
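If the goal is the one-row data frame from the question, the same vector can be wrapped (a sketch; the -1 drops the first column, which held the old row numbers):
vals <- setNames(unlist(lapply(dat[!c(TRUE, FALSE), -1], as.character)),
                 unlist(lapply(dat[c(TRUE, FALSE), -1], as.character)))
as.data.frame(t(vals), stringsAsFactors = FALSE)
# one row, with columns Number, Date, State, Type, Amount, City, ...
# (reorder the columns afterwards if the original order matters)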

Very simple bar graph

First I have to apologize for my ignorance, as I'm sure this is a very simple question, but I am very new to R. I have a data frame that looks like this:
countrydes Insured
USA 4500
CANADA 4500
USA 7500
CANADA 7600
All I want to do is aggregate the sum of the insured value by country and produce a bar graph, e.g.:
countrydes Insured
USA 12000
Canada 12100
Many thanks.
This will do the trick:
# Define your data
dfr <- data.frame(
  countrydes = rep(c("USA", "CANADA"), 2),
  Insured = c(4500, 4500, 7500, 7600))
# Sum by country
totals <- with(dfr, by(Insured, countrydes, sum))
# Plot the answer
barplot(totals)
(As Etiennebr mentioned, you could use aggregate instead of by, in which case you need to coerce countrydes to be a list.)
You could simply sum each one separately. Let's call your data frame df:
USA <- sum(df[df$countrydes=="USA",]$Insured)
CANADA <- sum(df[df$countrydes=="CANADA",]$Insured)
But with aggregate() you can handle all the countries in one line.
aggsumcount <- aggregate(x = df$Insured, by = list(df$countrydes), FUN = sum)
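To go from the aggregate() result straight to a bar graph (a sketch; aggregate names the grouping column Group.1 and the sum column x by default):
names(aggsumcount) <- c("countrydes", "Insured")
barplot(aggsumcount$Insured, names.arg = aggsumcount$countrydes)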
