I work with the dataframe df
Name = c("Albert", "Caeser", "Albert", "Frank")
Earnings = c(1000,2000,1000,5000)
df = data.frame(Name, Earnings)
Name Earnings
Albert 1000
Caeser 2000
Albert 1000
Frank 5000
If I use the tapply function
result <- tapply(df$Earnings, df$Name, sum)
I get this result:
Albert 2000
Caeser 2000
Frank 5000
Are there any circumstances under which the table "result" would not be ordered alphabetically if I use the tapply function as described above?
When I tried to find an answer, I changed the order of the rows:
Name Earnings
Frank 5000
Caeser 2000
Albert 1000
Albert 1000
but still get the same result.
I use several functions that do further calculations on the output of tapply, and I have to be absolutely sure that this output is always delivered in the same order.
Normally the output is ordered alphabetically, but you can construct examples where it is not, for example when the grouping variable is a factor whose levels are not in alphabetical order.
df <- data.frame(Name = factor(c('Ben', 'Al'), levels = c('Ben', 'Al')),
                 Earnings = c(1, 4))
tapply(df$Earnings, df$Name, sum)
## Ben Al
## 1 4
In that case you can either use as.character or (probably safer) order the result afterwards.
tapply(df$Earnings, as.character(df$Name), sum)
## Al Ben
## 4 1
result <- tapply(df$Earnings, df$Name, sum)
result[order(names(result))]
## Al Ben
## 4 1
Another possible problem can be leading spaces:
df <- data.frame(Name = c(' Ben', 'Al'),
                 Earnings = c(1, 4))
tapply(df$Earnings, df$Name, sum)
## Ben Al
## 1 4
In that case, just remove all leading spaces to get results ordered.
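For example, with trimws() (available in base R from 3.2.0), which strips the leading and trailing whitespace before grouping:
tapply(df$Earnings, trimws(df$Name), sum)
## Al Ben
##  4   1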
You can order tapply output just as you order any named array in R, using the sort() function.
> result
Albert Caeser Frank
2000 2000 5000
> sort(result,decreasing=TRUE)
Frank Albert Caeser
5000 2000 2000
Depending on what you want to order by, you can either sort the values as shown above (leaving decreasing at its default of FALSE, i.e. sort(result), gives the values in increasing order), or sort by the names.
This will deliver the results by name, in reverse alphabetical order:
result[sort(names(result),decreasing=TRUE)]
Frank Caeser Albert
5000 2000 2000
What else would you like to sort and order by?
I have a dataframe price1 in R that has four columns:
Name Week Price Rebate
Car 1 1 20000 500
Car 1 2 20000 400
Car 1 5 20000 400
---- -- ---- ---
Car 1 54 20400 450
There are ten Car names in all in price1, so the above is just to give an idea of the structure. Each car name should have 54 observations corresponding to 54 weeks. But there are some weeks for which no observation exists (e.g., Weeks 3 and 4 in the above case). For these missing weeks, I need to plug in information from another dataframe price2:
Name AveragePrice AverageRebate
Car 1 20000 500
Car 2 20000 400
Car 3 20000 400
---- ---- ---
Car 10 20400 450
So, I need to identify the missing week for each Car name in price1, capture the row corresponding to that Car name in price2, and insert the row in price1. I just can't wrap my head around a possible approach, so unfortunately I do not have a code snippet to share. Most of my search in SO is leading me to answers regarding handling missing values, which is not what I am looking for. Can someone help me out?
I am also indicating the desired output below:
Name Week Price Rebate
Car 1 1 20000 500
Car 1 2 20000 400
Car 1 3 20200 410
Car 1 4 20300 420
Car 1 5 20000 400
---- -- ---- ---
Car 1 54 20400 450
---- -- ---- ---
Car 10 54 21400 600
Note that the output now has Car 1 info for Weeks 3 and 4, which I should fetch from price2. The final output should contain 54 observations for each of the 10 car names, so 540 rows in total.
Try this, good luck:
library(data.table)
carNames <- paste('Car', 1:10)
# full grid of 10 cars x 54 weeks
df <- data.table(Name = rep(carNames, each = 54), Week = rep(1:54, times = 10))
# left-join the observed prices, leaving NAs for the missing weeks
df <- merge(df, price1, by = c('Name', 'Week'), all.x = TRUE)
# bring in the per-car averages and use them to fill the NAs
df <- merge(df, price2, by = 'Name', all.x = TRUE)
df[, `:=`(Price  = ifelse(is.na(Price),  AveragePrice,  Price),
          Rebate = ifelse(is.na(Rebate), AverageRebate, Rebate))]
df[, 1:4]
So if I understand your problem correctly, you basically have two dataframes and you want to make sure that the dataframe "price1" has the correct car names in its 'Name' column?
Here's what I would do, but it probably isn't the optimal way:
# create a loop with length = number of rows in your frame
for (i in 1:nrow(price1)) {
  # check if the Price value is NA
  if (is.na(price1$Price[i])) {
    # if it is NA, replace it with the corresponding row of price2 (matched by Name)
    j <- match(price1$Name[i], price2$Name)
    price1$Price[i]  <- price2$AveragePrice[j]
    price1$Rebate[i] <- price2$AverageRebate[j]
  }
}
Hope this helps (:
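A vectorized sketch of the same replacement, assuming price1 has already been expanded to one row per Name/Week combination with NAs where data is missing:
# look up the matching row of price2 for every row of price1
idx <- match(price1$Name, price2$Name)
price1$Price  <- ifelse(is.na(price1$Price),  price2$AveragePrice[idx],  price1$Price)
price1$Rebate <- ifelse(is.na(price1$Rebate), price2$AverageRebate[idx], price1$Rebate)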
If I understand your question correctly, you only want to see what is in the 2nd table and not in the first. You will just want to use an anti_join. Note that the order you feed the tables into the anti_join matters.
library(tidyverse)
complete_table <-
  price2 %>%
  anti_join(price1)
To expand your first table to cover all 54 weeks, use complete(), or you can even fudge it and right_join a table that you purposely build with all 54 weeks in it. Then anything that doesn't join to this second table gets an NA in that column.
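A sketch of the complete() route, assuming price1 and price2 have the column names shown in the question (tidyr's complete() builds the missing Name/Week combinations, and coalesce() then fills the resulting NAs from price2):
library(tidyverse)
price1 %>%
  complete(Name, Week = 1:54) %>%            # one row per Name/Week pair
  left_join(price2, by = "Name") %>%         # bring in the per-car averages
  mutate(Price  = coalesce(Price,  AveragePrice),
         Rebate = coalesce(Rebate, AverageRebate)) %>%
  select(Name, Week, Price, Rebate)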
I have two data frames that contain strings that are slightly different; let's say:
Name Size
Company.1 Inc. 234
Company.2 LLC 164
Company.3 INC 231
On the other hand, my second data frame is:
Name State
Company.1 INC MA
Company.2 NY
Company.3 inc. CA
Do you know a tool that could, for example, match the first 6 characters and merge the result into a new table (or at least show me the options if there are multiple matches)?
I tried grep and sapply but it is not working, because I need to compare all name values of the first data frame to all the name values of the second one.
Thanks for your help!
It seems like all you need here is match in order to match the first 9 letters in both files, something like the following (I'm assuming here that df1 is your first data set and df2 the second):
indx <- match(substr(df1$Name, 1, 9), substr(df2$Name, 1, 9))
df1["State"] <- df2$State[indx]
df1
# Name Size State
# 1 Company.1 Inc. 234 MA
# 2 Company.2 LLC 164 NY
# 3 Company.3 INC 231 CA
Or, using a fast join with the data.table package:
library(data.table)
setkey(setDT(df1)[, Name := substr(Name, 1, 9)], Name)
setDT(df2)[, Name := substr(Name, 1, 9)]
df1[df2]
# Name Size State
# 1: Company.1 234 MA
# 2: Company.2 164 NY
# 3: Company.3 231 CA
I have two different dataframes in R that I am trying to merge together. One is just a set of names and the other is a set of names with corresponding information about each person.
So say I want to take this first dataframe:
Name
1. Blow, Joe
2. Smith, John
3. Jones, Tom
etc....
and merge it to this one:
DonorName CandidateName DonationAmount CandidateParty
1 blow joe Bush, George W 3,000 Republican
2 guy some Obama, Barack 5,000 Democrat
3 smith john Reid, Harry 4,000 Democrat
such that I'd have a new list that includes only people on my first list with the information from the second. Were the two "Name" values formatted in the same way, I could just use merge(), but would there be a way to somehow use agrep() or pmatch() to do this?
Also, the 2nd dataframe I'm working with has about 25 million rows in it and 6 columns, so would making a for loop be the fastest way to go about this?
Reproducible versions of the example data:
first <- data.frame(Name=c("Blow, Joe","Smith, John","Jones, Tom"),
stringsAsFactors=FALSE)
second <- read.csv(text="
DonorName|CandidateName|DonationAmount|CandidateParty
blow joe|Bush, George W|3,000|Republican
guy some|Obama, Barack|5,000|Democrat
smith john|Reid, Harry|4,000|Democrat",header=TRUE,sep="|",
stringsAsFactors=FALSE)
solution:
first$DonorName <- gsub(", "," ",tolower(first$Name),fixed=TRUE)
require(dplyr)
result <- inner_join(first,second,by="DonorName")
will give you what you need if the data is as you've provided it.
result
Name DonorName CandidateName DonationAmount CandidateParty
1 Blow, Joe blow joe Bush, George W 3,000 Republican
2 Smith, John smith john Reid, Harry 4,000 Democrat
"fast way to go about this"
The dplyr method as above:
f_dplyr <- function(left,right){
left$DonorName <- gsub(", "," ",tolower(left$Name),fixed=TRUE)
inner_join(left,right,by="DonorName")
}
data.table method, setting key on first.
f_dt <- function(left,right){
left[,DonorName := gsub(", "," ",tolower(Name),fixed=TRUE)]
setkey(left,DonorName)
left[right,nomatch=0L]
}
data.table method, setting both keys.
f_dt2 <- function(left,right){
left[,DonorName := gsub(", "," ",tolower(Name),fixed=TRUE)]
setkey(left,DonorName)
setkey(right,DonorName)
left[right,nomatch=0L]
}
base method relying on sapply:
f_base <- function(){
second[second$DonorName %in%
sapply(tolower(first[[1]]), gsub, pattern = ",", replacement = "", fixed = TRUE), ]
}
Let's make the second df a bit more realistic (a few million rows) for a fairer comparison:
second <- cbind(second[rep(1:3,1000000),],data.frame(varn= 1:1000000))
left <- as.data.table(first)
right <- as.data.table(second)
library(microbenchmark)
microbenchmark(
f_base(),
f_dplyr(first,second),
f_dt(left,right),
f_dt2(left,right),
times=20)
And we get:
Unit: milliseconds
expr min lq median uq max neval
f_base() 2880.6152 3031.0345 3097.3776 3185.7903 3904.4649 20
f_dplyr(first, second) 292.8271 362.7379 454.6864 533.9147 774.1897 20
f_dt(left, right) 489.6288 531.4152 605.4148 788.9724 1340.0016 20
f_dt2(left, right) 472.3126 515.4398 552.8019 659.7249 901.8133 20
On my machine, with this somewhat contrived example, we gain about 2.5 seconds over the base method. sapply simplifies its output and doesn't scale very well in my experience; this gap likely gets bigger when you increase the number of unique groups in first and second.
Please feel free to edit if you come up with more efficient use. I don't pretend to know, but I always try to learn something.
Without dplyr:
second[second$DonorName %in%
sapply(tolower(first[[1]]), gsub, pattern = ",", replacement = "", fixed = TRUE), ]
Result:
# DonorName CandidateName DonationAmount CandidateParty
# 1 blow joe Bush, George W 3,000 Republican
# 3 smith john Reid, Harry 4,000 Democrat
I have a data set that includes a whole bunch of data about students, including their current school, zipcode of former residence, and a score:
students <- read.table(text = "zip school score
43050 'Hunter' 202.72974236
48227 'NYU' 338.49571519
48227 'NYU' 223.48658339
32566 'CCNY' 310.40666224
78596 'Columbia' 821.59318662
78045 'Columbia' 853.09842034
60651 'Lang' 277.48624384
32566 'Lang' 315.49753763
32566 'Lang' 80.296556533
94941 'LIU' 373.53839238
",header = TRUE,sep = "")
I want a heap of summary data about it, per school. How many students from each school are in the data set, how many unique zipcodes per school, average and cumulative score. I know I can get this by using tapply to create a bunch of tmp frames:
tmp.mean <- data.frame(tapply(students$score, students$school, mean))
tmp.sum <- data.frame(tapply(students$score, students$school, sum))
tmp.unique.zip <- data.frame(tapply(students$zip, students$school, function(x) length(unique(x))))
tmp.count <- data.frame(tapply(students$zip, students$school, function(x) length(x)))
Giving them better column names:
colnames(tmp.unique.zip) <- c("Unique zips")
colnames(tmp.count) <- c("Count")
colnames(tmp.mean) <- c("Mean Score")
colnames(tmp.sum) <- c("Total Score")
And using cbind to tie them all back together again:
school.stats <- cbind(tmp.mean, tmp.sum, tmp.unique.zip, tmp.count)
I think the cleaner way to do this is:
library(plyr)
school.stats <- ddply(students, .(school), summarise,
record.count=length(score),
unique.r.zips=length(unique(zip)),
mean.dist=mean(score),
total.dist=sum(score)
)
The resulting data looks about the same (actually, the ddply approach is cleaner and includes the schools as a column instead of as row names). Two questions: is there a better way to find out how many records are associated with each school? And am I using ddply efficiently here? I'm new to it.
If performance is an issue, you can also use data.table
require(data.table)
tab_s<-data.table(students)
setkey(tab_s,school)
tab_s[,list(total=sum(score),
avg=mean(score),
unique.zips=length(unique(zip)),
records=length(score)),
by="school"]
school total avg unique.zips records
1: Hunter 202.7297 202.7297 1 1
2: NYU 561.9823 280.9911 1 2
3: CCNY 310.4067 310.4067 1 1
4: Columbia 1674.6916 837.3458 2 2
5: Lang 673.2803 224.4268 2 3
6: LIU 373.5384 373.5384 1 1
Comments seem to be in general agreement: this looks good.
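For comparison, the same summary can be written with dplyr in much the same shape as the ddply call (a sketch; n() gives the per-school record count and n_distinct() the unique zips):
library(dplyr)
students %>%
  group_by(school) %>%
  summarise(record.count = n(),
            unique.r.zips = n_distinct(zip),
            mean.dist = mean(score),
            total.dist = sum(score))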
I have several .csv files, each one corresponding to a monthly list of customers and some information about them. Each file consists of the same information about customers such as:
names(data.jan)
ID AGE CITY GENDER
names(data.feb)
ID AGE CITY GENDER
To simplify, I will consider only two months, January and February, but my real set of csv files goes from January to November. Considering a customer X, there are three possible scenarios:
1 - Customer X is listed in the January database, but he left and is not listed in February
2 - Customer X is listed in both the January and February databases
3 - Customer X entered the database in February, so he is not listed in January
I am stuck on the following problem: I need to create a single database with all customers, and their respective information, from both dataframes. However, for a customer that is listed in both dataframes, I want to take his information from his first entry, that is, January.
When I use merge, I have four options, according to http://www.dummies.com/how-to/content/how-to-use-the-merge-function-with-data-sets-in-r.html
data <- merge(data.jan,data.feb, by="ID", all=TRUE)
Regardless of whether I choose all, all.x or all.y, I get the same undesired output in data:
data[1,]
ID AGE.x CITY.x GENDER.x AGE.y CITY.y GENDER.y
123 25 NY M 25 NY M
I think that what would work here is to merge both databases with a particular type of join, then merge the resulting dataframe with data.jan using a full outer join. But I don't know how to code this in R.
Thanks,
Bernardo
d1 <- data.frame(x=1:9,y=1:9,z=1:9)
d2 <- data.frame(x=1:10,y=11:20,z=21:30) # example data
d3 <- merge(d1,d2, by="x", all=TRUE) #merge
# keep the original columns from January (i.e. y.x, z.x)
# but replace the NAs in those columns with the data from February (i.e. y.y, z.y)
d3[is.na(d3[,2]) ,][,2:3] <- d3[is.na(d3[,2]) ,][, 4:5]
#> d3[, 1:3]
# x y.x z.x
#1 1 1 1
#2 2 2 2
#3 3 3 3
#4 4 4 4
#5 5 5 5
#6 6 6 6
#7 7 7 7
#8 8 8 8
#9 9 9 9
#10 10 20 30
This may be tiresome for more than 2 months, though; perhaps you should consider @flodel's comments. Also note there are pitfalls when your original Jan data has NAs (and you still want the first month's data, NA or not, retained), although you never mentioned them in your question.
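To illustrate that caveat with the same example data: if January itself contains a genuine NA that you would like to keep, the replacement line also overwrites it (and the rest of that row's January columns) with February's values:
d1$y[5] <- NA                                   # a real NA in the January data
d3 <- merge(d1, d2, by = "x", all = TRUE)
d3[is.na(d3[,2]) ,][,2:3] <- d3[is.na(d3[,2]) ,][, 4:5]
d3[5, 1:3]
#  x y.x z.x
#5 5  15  25   # January's NA (and its valid z value) replaced by February's data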
Try:
data <- merge(data.jan,data.frame(ID=data.feb$ID), by="ID")
I haven't tested it since there is no data to run against, but if you just join on the ID column from Feb, it should filter out anything that isn't in both frames.
@user1317221_G's solution is excellent. If your tables are large (lots of customers), data.table might be faster:
library(data.table)
# some sample data
jan <- data.table(id=1:10, age=round(runif(10,25,55)), city=c("NY","LA","BOS","CHI","DC"), gender=rep(c("M","F"),each=5))
new <- data.table(id=11:16, age=round(runif(6,25,55)), city=c("NY","LA","BOS","CHI","DC","SF"), gender=c("M","F"))
feb <- rbind(jan[6:10,],new)
new <- data.table(id=17:22, age=round(runif(6,25,55)), city=c("NY","LA","BOS","CHI","DC","SF"), gender=c("M","F"))
mar <- rbind(jan[1:5,],new)
setkey(jan,id)
setkey(feb,id)
join <- data.table(merge(jan, feb, by="id", all=T))
join[is.na(age.x) , names(join)[2:4]:= join[is.na(age.x),5:7,with=F]]
Edit: This adds processing for multiple months.
f <- function(x,y) {
setkey(x,id)
setkey(y,id)
join <- data.table(merge(x,y,by="id",all=T))
join[is.na(age.x) , names(join)[2:4]:= join[is.na(age.x),5:7,with=F]]
join[,names(join)[5:7]:=NULL] # get rid of extra columns
setnames(join,2:4,c("age","city","gender")) # rename columns that remain
return(join)
}
Reduce("f",list(jan,feb,mar))
Reduce(...) applies the function f(...) to the elements of the list in turn, so first to jan and feb, and then to the result and mar, etc.
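A tiny illustration of how Reduce folds a binary function over a list (toy strings, just to show the mechanics):
Reduce(function(a, b) paste(a, b, sep = " + "), list("jan", "feb", "mar"))
## [1] "jan + feb + mar"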