What I want to do is make a list, then make a list from part of the elements of that list. I can do it in 2 steps using subset and then dlply, but I'm wondering if there's a faster way with any of the XXply methods.
So I have a dataframe:
data <- data.frame(
  biz = sample(c("telco", "shipping", "tech"), 50, replace = TRUE),
  region = sample(c("mideast", "americas", "asia"), 50, replace = TRUE),
  date = rep(seq(as.Date("2010-02-01"), length = 10, by = "1 day"), 5),
  revenue = sample(500:1000, 50, replace = TRUE),
  orders = sample(0:2, 50, replace = TRUE)
)
Ultimately, what I'm looking for here is: for each region, a list of data frames (the identity values) organized by business.
The messy approach is to take a subset for each region then simply turn that into a list:
mideast <- subset(data, region == "mideast")
americas <- subset(data, region == "americas")
asia <- subset(data, region == "asia")
mideast.list <- dlply(mideast, .(biz), identity)
americas.list <- dlply(americas, .(biz), identity)
asia.list <- dlply(asia, .(biz), identity)
Easy enough but it gets unwieldy with bigger datasets.
If I use dlply on the original data, it gives me the values I'm looking for, but again, I want to have actual list objects for each region. So:
list2 <- dlply(data, .(region, biz), identity)
But then how do I access just the regions from list2 and create separate list objects out of them?
I'm not 100% clear I understand what you're trying to do, but maybe this is it?
lst <- lapply(
  split(data, data$region),
  function(df) lapply(split(df, df$biz), identity)
)
lst[["americas"]][["shipping"]]
# biz region date revenue orders
# 3 shipping americas 2010-02-03 621 2
# 23 shipping americas 2010-02-03 799 2
# 33 shipping americas 2010-02-03 920 0
# 34 shipping americas 2010-02-04 705 2
This matches the structure of americas.list, so I think this is what you're trying to do. Also, note that you can skip the inner lapply if identity is really the function you want to apply (split alone does what you need).
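In other words, the inner lapply (and identity) can be dropped entirely, giving the same nested structure:

lst <- lapply(split(data, data$region), function(df) split(df, df$biz))
lst[["americas"]][["shipping"]] # same result as above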
I have two data frames: sold1 and sold2.
They both hold data on sold house prices but from different sources. As it turns out, the data, even for the same property, may not match.
I'd like to compare each element in sold1 to sold2 and exclude those rows in sold1 that are already present (duplicated) in sold2.
The test for a 'duplicate' in this instance would be a combination of:
sold1.price is within 10% of sold2.price
sold2.address is contained in sold1.address
sold2.dateOfSale is no earlier than sold1.dateOfSale
Two questions:
Why does the 'price' field change class when accessed in a function?
Is there a more elegant way to do this without using apply twice (very slow on two large dataframes)?
sold1 <- data.frame(
  price = c(100000, 150000, 200000, 250000, 300000, 400000),
  address = c("Widmore Road, Bromley", "River Quaggy Apartments", "Meadowcroft Way, Orwell, SG8", "Freelands Road, Bromley", "Nascot Street, London, W12", "Priory Terrace, South Hampstead, NW6"),
  dateOfSale = as.Date(c("2020-01-01", "2020-02-03", "2020-03-05", "2020-04-06", "2020-05-08", "2020-06-12"))
)
sold2 <- data.frame(
  price = c(100000, 150000, 210000, 251000, 300000),
  address = c("Random Road, Bromley", "Random2 Road Apartments", "Meadowcroft Way", "Freelands Road", "Random street London, W12"),
  dateOfSale = as.Date(c("2020-01-01", "2020-04-03", "2020-03-25", "2020-04-26", "2019-05-08"))
)
FLR_Check_Match <- function(s2, s1row) {
  checkRes = TRUE
  # Price within tolerance (+/- 10%)
  checkRes = checkRes && as.numeric(s1row["price"]) >= as.numeric(s2["price"]) * 0.9 && as.numeric(s1row["price"]) <= as.numeric(s2["price"]) * 1.1
  # Address close match
  checkRes = checkRes && grepl(s2["address"], s1row["address"], ignore.case = TRUE)
  # Date of sale in sold2 (s2) is no earlier than the date of sale in sold1 (s1)
  checkRes = checkRes && as.Date(s2["dateOfSale"]) >= as.Date(s1row["dateOfSale"])
  return(checkRes)
}
FCheck_Sold_Dups <- function(s1, s2) {
  print(class(s1["price"])) # prints "character"
  # For each row in s2, check whether it matches this row of s1
  excV <- apply(s2, 1, FLR_Check_Match, s1)
  result <- any(excV)
  return(result)
}
sold1$exclude <- apply(sold1,1,FCheck_Sold_Dups,sold2)
1. Why does the 'price' field change class when accessed in a function?
apply coerces the data frame to a matrix.
From ?apply:
If X is not an array but an object of a class with a non-null dim value (such as a data frame), apply attempts to coerce it to an array via as.matrix
This can also be verified in the source code of apply:
apply
function (X, MARGIN, FUN, ...)
{
    FUN <- match.fun(FUN)
    dl <- length(dim(X))
    if (!dl)
        stop("dim(X) must have a positive length")
    if (is.object(X))
        X <- if (dl == 2L)
            as.matrix(X)  # <- here
        else as.array(X)
    .....
    .....
Since a matrix can hold data of only one type, everything is changed to character. Hence, you have to coerce the data back to numeric inside the function.
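A quick way to see the coercion with the sold1 data from the question:

# apply() converts sold1 to a character matrix first, so every field
# the function receives is character, including price
apply(sold1, 1, function(row) class(row["price"]))
# [1] "character" "character" "character" "character" "character" "character"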
2. Is there a more elegant way to do this without using apply twice (very slow on two large dataframes)
We cannot really avoid comparing every row of sold1 to every row of sold2 (short of fuzzy matching/joining), so if you have very large data most answers will be slow. Here is one way that is perhaps more elegant, faster and shorter. I added a prefix to the column names of the sold1 and sold2 data frames to differentiate between them.
library(dplyr)
library(stringr) # for str_detect() and regex()

names(sold1) <- paste0('sold1_', names(sold1))
names(sold2) <- paste0('sold2_', names(sold2))

tidyr::crossing(sold1, sold2) %>%
  group_by(sold1_address) %>%
  # Check if address matches
  summarise(address_check = any(str_detect(sold1_address, regex(sold2_address, ignore_case = TRUE))),
            # Check if price is in range
            price_check = any(data.table::between(sold1_price, sold2_price * 0.9, sold2_price * 1.1)),
            # Check if date is in range
            date_check = any(sold2_dateOfSale >= sold1_dateOfSale),
            # All three conditions must be satisfied
            exclude = address_check & price_check & date_check) %>%
  select(sold1_address, exclude) %>%
  left_join(sold1, by = 'sold1_address')
# sold1_address exclude sold1_price sold1_dateOfSale
# <chr> <lgl> <dbl> <date>
#1 Freelands Road, Bromley TRUE 250000 2020-04-06
#2 Meadowcroft Way, Orwell, SG8 TRUE 200000 2020-03-05
#3 Nascot Street, London, W12 FALSE 300000 2020-05-08
#4 Priory Terrace, South Hampstead, NW6 FALSE 400000 2020-06-12
#5 River Quaggy Apartments FALSE 150000 2020-02-03
#6 Widmore Road, Bromley FALSE 100000 2020-01-01
Note that I have created three separate columns (address_check, price_check and date_check) for clarity and to keep each condition visible; they are not strictly needed, and the conditions can be combined into one if required, as sketched below.
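Here is a minimal sketch of the combined version, reusing the prefixed columns from above. Putting all three conditions inside a single any() requires them to hold for the same sold2 row, which matches the logic of the original apply approach:

tidyr::crossing(sold1, sold2) %>%
  group_by(sold1_address) %>%
  summarise(exclude = any(str_detect(sold1_address, regex(sold2_address, ignore_case = TRUE)) &
                            data.table::between(sold1_price, sold2_price * 0.9, sold2_price * 1.1) &
                            sold2_dateOfSale >= sold1_dateOfSale)) %>%
  left_join(sold1, by = 'sold1_address')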
I have two columns in two data frames, where the longer one includes all elements of the other. Now I want to delete the elements (and their corresponding rows) in the longer column that do not overlap with the other. I identified the "difference" using:
diff <- setdiff(gdp$country, tfpg$country)
and I tried to use two for loops to get this done:
for (i in 1:28) {
  for (j in 1:123) {
    if (diff[i] == gdp$country[j]) {
      gdp <- gdp[-c(j), ]
    }
  }
}
where 28 is the number of rows I want to delete (the length of diff) and 123 is the length of the longer column. This does not work; the error message is:
Error in if (diff[i] == gdp$country[j]) { :
missing value where TRUE/FALSE needed
So how do I fix this? Or is there a better way to do this?
Thank you very much.
I have a data frame called "gdp" here:
country wto y1990 y1991 y1992
Austria 1995 251540 260197 265644
Belgium 1995 322113 328017 333038
Cyprus 1995 14436 14537 15898
Denmark 1995 177089 179392 182936
Finland 1995 149584 140737 136058
France 1995 1804032 1822778 1851937
There are 123 rows.
I would like to delete rows with country names specified in another vector:
diff # c("Austria", "China", ..., "Yemen")
There is a better way! What you're describing is essentially an inner join (a left join would also keep the non-matching rows). In R the way to achieve it is the merge command:
## S3 method for class 'data.frame'
merge(x, y, by = intersect(names(x), names(y)),
      by.x = by, by.y = by, all = FALSE, all.x = all, all.y = all,
      sort = TRUE, suffixes = c(".x", ".y"),
      incomparables = NULL, ...)
In your case:
merge(gdp, tfpg, by = intersect('country', 'country'))
E.g.
x = data.frame(foo = c(1,2,3,4,5), bar=c("A","B","C","D","E"))
y = data.frame(baz = c(6,7,8,9), bar=c("A","C","E","F"))
z = merge(x,y,by=intersect('bar','bar'))
gives
bar foo baz
1 A 1 6
2 C 3 7
3 E 5 8
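If you only want to drop the non-matching rows of gdp without pulling in tfpg's columns, a logical subset with %in% is a direct alternative (a sketch using the objects from the question):

# keep only the countries that also appear in tfpg
gdp <- gdp[gdp$country %in% tfpg$country, ]
# or, equivalently, using the diff vector already computed:
gdp <- gdp[!gdp$country %in% diff, ]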
I am using trade data (FAO) which I would like to turn into matrices (per Item and Year). Therefore I've done a split:
# import is the original df
import_YI <- split(import, list(import$Item, import$Year))
import_YI_lap <- lapply(seq_along(import_YI), function(x) as.data.frame(import_YI[[x]])[, 1:11])
and the data looks like this (you can find a test dataset at the end):
[[1]]
RC PC Item Year Value
Argentina Chile Almonds 1996 1108
Algeria Spain Almonds 1996 1
....
[[2]]
....
[[3]]
....
[[n]]
I used the cast function (from the reshape package) to create a matrix for almonds in 2012:
# import_almonds2012 is a test subset of the import df (import values for almonds in 2012)
RCPC <- cast(RC ~ PC, data = import_almonds2012, value = "Value")
Now my question: how can I make matrices for all Items/Years (~100 Items and 17 years!) from the import_YI_lap list? My problem is that I don't know (1) how to operate on the levels/objects in this list ([[1]], [[2]], ...), or whether there is a better way to split the data or to save the split data frames into objects. And (2) how do I create all the needed matrices without copying thousands of lines of code? Loops? If yes, how?
Here is a test dataset:
import <- data.frame(RC = c("DE", "IT", "USA"),
                     PC = c("BRA", "ARG"),
                     Item = c("Almonds", "Apples"),
                     Year = c(1996, 1997, 1998),
                     Value = c(1, 5, 3, 2, 8, 3))
import_YI <- split(import, list(import$Item, import$Year))
import_YI_lap <- lapply(seq_along(import_YI), function(x) as.data.frame(import_YI[[x]])[, 1:5])
import_YI_lap
It's difficult to test without data, but you can try this:
do.call(rbind, import_YI_lap)
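If the goal is one matrix per Item/Year combination, you can also loop cast over the split list itself instead of binding it back together. A sketch, assuming the reshape package's cast as used in the question; keeping the names that split assigns avoids the [[1]], [[2]] indexing problem:

library(reshape)

# split() names each piece "Item.Year", e.g. "Almonds.1996";
# drop = TRUE skips empty Item/Year combinations
import_YI <- split(import, list(import$Item, import$Year), drop = TRUE)
matrices <- lapply(import_YI, function(d) cast(RC ~ PC, data = d, value = "Value"))
matrices[["Almonds.1996"]]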
I have problems merging two data frames of different lengths.
To make it as easy as possible the datasets:
Dataset A - Persons
http://pastebin.com/HbaeqACi
Dataset B - Waterfeatures:
http://pastebin.com/UdDvNtHs
Dataset C - City:
http://pastebin.com/nATnkMRk
I have some R code, which is not directly relevant to my problem, but I will paste it in full so you have exactly the same situation:
library(fossil)
library(plyr) # for ldply() below
#load data
persons = read.csv("person.csv", header = TRUE, stringsAsFactors=FALSE)
water = read.csv("water.csv", header =TRUE, stringsAsFactors=FALSE)
city = read.csv("city.csv", header =TRUE)
#### calculate distance
# Generate unique coordinates dataframe
UniqueCoordinates <- data.frame(unique(persons[,4:5]))
UniqueCoordinates$Id <- formatC((1:nrow(UniqueCoordinates)), width=3,flag=0)
# Function that finds the closest water feature for each Id's coordinates and calculates/saves the distance
NearestW <- function(id) {
  tmp <- UniqueCoordinates[UniqueCoordinates$Id == id, 1:2]
  WaterFeatures <- rbind(tmp, water[, 2:3])
  disnw <- earth.dist(WaterFeatures, dist = TRUE)[1:(nrow(WaterFeatures) - 1)]
  disnw <- min(disnw)
  disnw <- data.frame(disnw, WaterFeature = tmp)
  return(disnw)
}
# Apply the distance calculation to each Id, then merge
CoordinatesWaterFeature <- ldply(UniqueCoordinates$Id, NearestW)
persons <- merge(persons, CoordinatesWaterFeature, by.x=c(4,5), by.y=c(2,3))
Now I want to copy the calculated distance over to the city dataset. I've tried using merge (both datasets have the city attribute), and persons contains only cities that appear in the city dataset.
city_all_parameters = city
city_all_parameters = merge(city_all_parameters, persons[,c("city", "disnw")], all=TRUE)
Unfortunately this is not the outcome I want. I get 164 rows, but I only want these 5 rows plus the variable disnw and its corresponding value.
I've tried it out with rbind as well, but there I get the error:
"Error in rbind(deparse.level, ...) : numbers of columns of arguments do not match"
Any tip, how to solve this problem?
Your code works as you intended, but I wanted to show you a more elegant way to do it in base. I have commented the code:
library(fossil)
# If you want to use pastebin, you can make it easy to load in for us like this:
# But I recommend using dput(persons) and pasting the results in.
persons = read.csv("http://pastebin.com/raw.php?i=HbaeqACi", header = TRUE, stringsAsFactors=FALSE)
water = read.csv("http://pastebin.com/raw.php?i=UdDvNtHs", header =TRUE, stringsAsFactors=FALSE)
city = read.csv("http://pastebin.com/raw.php?i=nATnkMRk", header =TRUE)
# Use column names instead of column indices to clarify your code
UniqueCoordinates <- data.frame(unique(persons[,c('POINT_X','POINT_Y')]))
# I didn't understand why you wanted to format the Id,
# but you don't need the Id in this code
# UniqueCoordinates$Id <- formatC((1:nrow(UniqueCoordinates)), width=3,flag=0)
# Instead of calculating the pairwise distances between all
# the water points every time, use deg.dist with mapply:
UniqueCoordinates$disnw <- mapply(function(x, y) min(deg.dist(long1 = x, lat1 = y,
                                                              long2 = water$POINT_X,
                                                              lat2 = water$POINT_Y)),
                                  UniqueCoordinates$POINT_X,
                                  UniqueCoordinates$POINT_Y)
persons <- merge(UniqueCoordinates,persons)
# I think this is what you wanted...
unique(persons[c('city','disnw')])
# city disnw
# 1 City E 6.4865635
# 20 City A 1.6604204
# 69 City B 0.9893909
# 113 City D 0.6001968
# 148 City C 0.5308953
# If you want to merge to the city
merge(persons,city,by='city')
I have been working on a file to calculate hospital infection rates. I want to standardise the infection rates to yearly procedure counts. The data are located here because they are too big for dput. SSI flags a surgical infection (1 = infected, 0 = not infected), Procedure is the type of procedure, and Year has been derived using lubridate.
library(plyr)
fname <- "https://raw.github.com/johnmarquess/some.data/master/hospG.csv"
download.file(fname, destfile='hospG.csv', method='wget')
hospG <- read.csv('hospG.csv')
Inf_table <- ddply(hospG, "Year", summarise,
Infections = sum(SSI == 1),
Procedures = length(Procedure),
PropInf = round(Infections/Procedures * 100 ,2)
)
This gives me the number of infections, procedures, and proportion infected per year for this hospital.
What I would like is an additional column with the standardised proportion infected. The long way to do this outside of Inf_table is:
s1 <- sum(Inf_table$Infections)
s2 <- sum(Inf_table$Procedures)
Expected_prop_inf <- Inf_table$Procedures * s1/s2
Is there a way to get ddply to do this? I tried making a function with the calculation to produce Expected_prop_inf, but I did not get very far.
Thanks for any help offered.
It's more difficult with ddply because you are dividing by a number computed outside the grouping. Better to do it with base R:
# base
> with(Inf_table, Procedures*(sum(Infections)/sum(Procedures)))
[1] 17.39184 17.09623 23.00847 20.84065 24.83141 24.83141
rather than with ddply which is not so natural:
# NB note .(Year) is unique for every row, you might also use rownames
> s1 <- sum(Inf_table$Infections)
> s2 <- sum(Inf_table$Procedures)
> ddply(Inf_table, .(Year), summarise, Procedures*(s1/s2))
Year ..1
1 2001 17.39184
2 2002 17.09623
3 2003 23.00847
4 2004 20.84065
5 2005 24.83141
6 2006 24.83141
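Either way, to store the result as a new column, assign the base-R expression back to Inf_table:

Inf_table$Expected_prop_inf <- with(Inf_table, Procedures * (sum(Infections) / sum(Procedures)))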
Here is a solution that aggregates using data.table.
I'm not sure if it's possible to do it in one step, but the two steps can at least be chained (see the sketch at the end).
require("data.table")
fname <- "https://raw.github.com/johnmarquess/some_data/master/hospG.csv"
hospG <- read.csv(fname)
Inf_table <- DT[, {
  Infections = sum(SSI == 1)
  Procedures = length(Procedure)
  PropInf = round(Infections / Procedures * 100, 2)
  list(Infections = Infections,
       Procedures = Procedures,
       PropInf = PropInf)
}, by = Year]
Inf_table[,Expected_prop_inf := list(Procedures * sum(Infections)/sum(Procedures))]
tables()
The added bonus of this approach is that you are not creating another data.table in the second step; a new column is added to the existing data.table by reference. This would be relevant in case your datasets are bigger.
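For what it's worth, here is a sketch of chaining the two steps into one expression (.N is data.table's per-group row count; the sum()s in the second bracket run over the whole aggregated table, and the trailing [] just prints the result):

Inf_table <- DT[, .(Infections = sum(SSI == 1),
                    Procedures = .N,
                    PropInf = round(sum(SSI == 1) / .N * 100, 2)),
                by = Year][, Expected_prop_inf := Procedures * sum(Infections) / sum(Procedures)][]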