Related
Background:
I'm working with a fairly large (>10,000 rows) dataset of individual cars, and I need to do some analysis on it. I need to keep this dataset d intact, but I'm only going to be analyzing cars made by Japanese companies (e.g. Nissan, Honda, etc.). d contains information like VIN_prefix (the first two characters of a VIN, which indicate the "World Manufacturer Identifier"), model year, and make, but no explicit indicator of whether the car is made by a Japanese firm. Here's d:
d <- data.frame(
  make = c("GMC","Dodge","NA","Subaru","Nissan","Chrysler"),
  model_yr = c("1999","2004","1989","1999","2006","2012"),
  VIN_prefix = c("1G","1D","JH","JF","NA","2C"),
  stringsAsFactors = FALSE)
Here, rows 3, 4, and 5 correspond to Japanese cars: the NA in row 3 is actually an Acura whose make is missing. See the other dataset below for why this is.
d also lacks some attributes (columns) about cars that I need for my analysis, e.g. the current CEO of Japanese car firms.
Enter another dataset, a, a dataset about Japanese car firms which contains those extra attributes as well as columns that could be used to identify whether a given car (row) in d is made by a Japanese firm. One of those is VIN_prefix; the other is jp_makes, a list of Japanese auto firms. Here's a:
a <- data.frame(
  VIN_prefix = c("JH","JF","1N"),
  jp_makes = c("Acura","Subaru","Nissan"),
  current_ceo = c("Toshihiro Mibe","Tomomi Nakamura","Makoto Ushida"),
  stringsAsFactors = FALSE)
Here, we can see that the "Acura" make, missing in the car from row 3 in d, could be identified by its VIN_prefix "JH", which in row 3 of d is not NA.
Goal:
Left join a onto d so that each of the 3 Japanese cars in d gets the relevant corresponding attributes from a - mainly, current_ceo. (Non-Japanese cars in d would have NA for columns joined from a; this is fine.)
Problem:
As you can tell, the two relevant variables in d that could be used as keys in a join - make and VIN_prefix - have missing data in d. The "matching rules" we could use are imperfect: I could match on d$make == a$jp_makes or on d$VIN_prefix == a$VIN_prefix, but they'd each be wrong due to the missing data in d.
What to do?
What I've tried:
I can try left joining on either one of these potential keys, but not all 3 of the Japanese cars in d wind up with their correct information from a:
library(dplyr)
try1 <- left_join(d, a, by = c("make" = "jp_makes"))
try2 <- left_join(d, a, by = c("VIN_prefix" = "VIN_prefix"))
I can successfully generate a logical 'indicator' variable in d that tells me whether a car is Japanese or not:
entries_make <- a$jp_makes
entries_vin_prefix <- a$VIN_prefix
d <- d %>%
  mutate(is_jp = VIN_prefix %in% entries_vin_prefix | make %in% entries_make)
But that only gets me halfway: I still need those other columns from a to sit next to those Japanese cars in d. It's unfeasible to manually fill all the missing data in some other way; the real datasets these toy examples correspond to are too big for that and I don't have the manpower or time.
Ideally, I'd like a dataset that looks something like this:
ideal <- data.frame(
  make = c("GMC","Dodge","NA","Subaru","Nissan","Chrysler"),
  model_yr = c("1999","2004","1989","1999","2006","2012"),
  VIN_prefix = c("1G","1D","JH","JF","NA","2C"),
  current_ceo = c(NA, NA, "Toshihiro Mibe", "Tomomi Nakamura", "Makoto Ushida", NA),
  stringsAsFactors = FALSE)
What do you all think? I've looked at other posts (e.g. here) but their solutions don't really apply. Any help is much appreciated!
Left join on an OR of the two conditions.
library(sqldf)
sqldf("select d.*, a.current_ceo
from d
left join a on d.VIN_prefix = a.VIN_prefix or d.make = a.jp_makes")
giving:
make model_yr VIN_prefix current_ceo
1 GMC 1999 1G <NA>
2 Dodge 2004 1D <NA>
3 NA 1989 JH Toshihiro Mibe
4 Subaru 1999 JF Tomomi Nakamura
5 Nissan 2006 NA Makoto Ushida
6 Chrysler 2012 2C <NA>
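If you'd rather stay in dplyr, a rough equivalent is to run both single-key left joins and coalesce the results. This is only a sketch, and it assumes each car matches at most one row of a on either key:
library(dplyr)
by_vin  <- left_join(d, a, by = "VIN_prefix")
by_make <- left_join(d, a, by = c("make" = "jp_makes"))
d %>% mutate(current_ceo = coalesce(by_vin$current_ceo, by_make$current_ceo))
This relies on left_join preserving row order when matches are unique; with duplicate matches, the sqldf OR-join above is the safer route.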
Use a two-pass method. First fill in the missing make (or VIN) values; I'll illustrate by filling in make values. Do notice that "NA" is not the same as NA: the first is a character value while the latter is a true R missing value, so I'd first convert those to true missing values. In natural language, I am replacing the missing make values in d with the jp_makes values taken from a on the basis of matching VIN_prefix values:
is.na(d$make) <- d$make == "NA"
d$make[is.na(d$make)] <- a$jp_makes[
  match(d$VIN_prefix[is.na(d$make)], a$VIN_prefix)]
Now you have the make values filled in on the basis of the table lookup in a. It should be trivial to do the merge you wanted by using by.x='make', by.y='jp_makes':
merge(d, a, by.x='make', by.y='jp_makes', all.x=TRUE)
make model_yr VIN_prefix.x VIN_prefix.y current_ceo
1 Acura 1989 JH JH Toshihiro Mibe
2 Chrysler 2012 2C <NA> <NA>
3 Dodge 2004 1D <NA> <NA>
4 GMC 1999 1G <NA> <NA>
5 Nissan 2006 NA 1N Makoto Ushida
6 Subaru 1999 JF JF Tomomi Nakamura
You can then use the values in VIN_prefix.y to replace the "NA" values in VIN_prefix.x.
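A short sketch of that last cleanup step, assuming the merge result above is stored in a variable m (my name, not from the question):
m <- merge(d, a, by.x = "make", by.y = "jp_makes", all.x = TRUE)
is.na(m$VIN_prefix.x) <- m$VIN_prefix.x == "NA"   # "NA" string -> real NA
m$VIN_prefix.x[is.na(m$VIN_prefix.x)] <- m$VIN_prefix.y[is.na(m$VIN_prefix.x)]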
I have a small question regarding binary operations on a dataframe. Here I have a dataframe, and I want to create a new column PerWeek which is the result of dividing Gross by Weeks, and I am wondering how I can do it, since the Gross elements are not numeric.
library(rvest)

boxoffice = function(){
  url = "https://www.imdb.com/chart/boxoffice"
  read_table = read_html("https://www.imdb.com/chart/boxoffice")
  movie_table = html_table(html_nodes(read_table, "table")[[1]])
  Name = movie_table[2]
  Gross = movie_table[4]
  Weeks = movie_table[5]
  BoxOffice =
  for (i in 1:10){
    PerWeek = movie_table[4][i] %/% movie_table[5][i]
  }
  df = data.frame(Name,BoxOffice,PerWeek)
  return(df)
}
If the Gross value is always in millions, you can extract the number from it, multiply by 1e6 to get the actual amount, and then divide by Weeks.
library(rvest)
library(dplyr)
url = "https://www.imdb.com/chart/boxoffice"
read_table = read_html("https://www.imdb.com/chart/boxoffice")
movie_table = html_table(html_nodes(read_table, "table")[[1]])
movie_table <- movie_table[-c(1, ncol(movie_table))]
movie_table %>% mutate(per_week_calc = readr::parse_number(Gross) * 1e6/Weeks)
# Title Weekend Gross Weeks per_week_calc
#1 Onward $10.5M $60.3M 2 30150000
#2 I Still Believe $9.5M $9.5M 1 9500000
#3 Bloodshot $9.3M $10.5M 1 10500000
#4 The Invisible Man $6.0M $64.4M 3 21466667
#5 The Hunt $5.3M $5.8M 1 5800000
#6 Sonic the Hedgehog $2.6M $145.8M 5 29160000
#7 The Way Back $2.4M $13.4M 2 6700000
#8 The Call of the Wild $2.2M $62.1M 4 15525000
#9 Emma. $1.4M $10.0M 4 2500000
#10 Bad Boys for Life $1.1M $204.3M 9 22700000
If you have data in billions or thousands, you can refer to:
Changing Million/Billion abbreviations into actual numbers? ie. 5.12M -> 5,120,000 and Convert from K to thousand (1000) in R
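For a self-contained version, here's a rough sketch of one way to handle mixed suffixes; the helper name to_number and the input format ("$5.1M", "$900K", "$1.2B") are my assumptions:
to_number <- function(x) {
  # map the suffix letter to a multiplier; unsuffixed values get 1
  mult <- c(K = 1e3, M = 1e6, B = 1e9)[gsub("[^KMB]", "", x)]
  mult[is.na(mult)] <- 1
  unname(readr::parse_number(x) * mult)
}
to_number(c("$5.1M", "$900K", "$1.2B"))
# [1] 5.1e+06 9.0e+05 1.2e+09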
Let's say I have:
Person Movie Rating
Sally Titanic 4
Bill Titanic 4
Rob Titanic 4
Sue Cars 8
Alex Cars **9**
Bob Cars 8
As you can see, there is a contradiction for Alex. All rows for the same movie should have the same rating, but there was a data entry error for Alex. How can I use R to solve this? I've been thinking about it for a while, but I can't figure it out. Do I have to just do it manually in Excel or something? Is there a command in R that will return all the cases where there are data contradictions between two columns?
Perhaps I could have R do a boolean check of whether all the ratings for a movie match that movie's first rating? For any that return "no", I can go look at them manually. How would I write this function?
Thanks
Here's a data.table solution
Define a function that returns the most common value (the mode)
Myfunc <- function(x) {
  temp <- table(x)
  names(temp)[which.max(temp)]
}
library(data.table)
# reproducible version of the example data
df <- data.frame(Person = c("Sally", "Bill", "Rob", "Sue", "Alex", "Bob"),
                 Movie = rep(c("Titanic", "Cars"), each = 3),
                 Rating = c(4, 4, 4, 8, 9, 8))
Create a column with the correct rating (by reference)
setDT(df)[, CorrectRating := Myfunc(Rating), Movie][]
# Person Movie Rating CorrectRating
# 1: Sally Titanic 4 4
# 2: Bill Titanic 4 4
# 3: Rob Titanic 4 4
# 4: Sue Cars 8 8
# 5: Alex Cars 9 8
# 6: Bob Cars 8 8
Or if you want to remove the "bad" ratings
df[Rating == CorrectRating][]
# Person Movie Rating CorrectRating
# 1: Sally Titanic 4 4
# 2: Bill Titanic 4 4
# 3: Rob Titanic 4 4
# 4: Sue Cars 8 8
# 5: Bob Cars 8 8
It looks like, within each group defined by "Movie", you're looking for any instances of Rating that are not the same as the most common value.
You can solve this using dplyr (which is good at "group by one column, then perform an operation within each group"), along with the "Mode" function defined in this answer that finds the most common item in a vector:
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
library(dplyr)
dat %>% group_by(Movie) %>% filter(Rating != Mode(Rating))
This finds all the cases where a row does not agree with the rest of the group. If you instead want to remove them, you can do:
newdat <- dat %>% group_by(Movie) %>% filter(Rating == Mode(Rating))
If you want to fix them, do
newdat <- dat %>% group_by(Movie) %>% mutate(Rating = Mode(Rating))
You can test the above with a reproducible version of your data:
dat <- data.frame(Person = c("Sally", "Bill", "Rob", "Sue", "Alex", "Bob"),
                  Movie = rep(c("Titanic", "Cars"), each = 3),
                  Rating = c(4, 4, 4, 8, 9, 8))
If the goal is to see whether all the values within a group are the same (or if there are some differences), this can be a simple application of tapply (or aggregate, etc.) used with a function like var, or by computing the range. If all the values are the same, the variance and the range will be 0; any other value (outside of rounding error) means there must be a value that is different. The which function can help identify the group/individual.
# variance within each movie: 0 means all ratings agree
tapply(dat$Rating, dat$Movie, FUN = var)
which(.Last.value > 0.00001)

# same idea using the range instead of the variance
tapply(dat$Rating, dat$Movie, FUN = function(x) diff(range(x)))
which(.Last.value != 0)

# or flag the individual rows that differ from their group mean
which(abs(dat$Rating - ave(dat$Rating, dat$Movie)) > 0)
which.max(abs(dat$Rating - ave(dat$Rating, dat$Movie)))
dat[.Last.value, ]
I would add a variable for the mode so I can see if there is anything weird going on with the data, like missing data, text, or many different answers instead of the rare anomaly, etc. I used x as your dataset.
# one of many functions to find mode, could use any other
modefunc <- function(x){
  names(table(x))[table(x) == max(table(x))]
}
# add variable for mode split by Movie
x$mode <- ave(x = x$Rating,x$Movie,FUN = modefunc)
# do whatever you want with the records that are different
x[x$Rating != x$mode, ]
If you want another function for the mode, there are plenty of alternative implementations to try.
I have an Excel file that has multiple sheets. Each sheet looks like this, with some excess data at the bottom:
   A       B     C     D ...
1  time    USA   USA   USA
2          MD    CA    PX
3          pork  peas  nuts
4  jan-11  4     2     2
5  feb-11  4     9     3
6  mar-11  8     8     3
.
.
workbook1 | workbook2 ...
The file is 11 mb, but when I try to use
library(XLConnect)
sheet <- readWorksheetFromFile("excelfile.xlsx", sheet = 1)
I get
Error: OutOfMemoryError (Java): Java heap space
Each worksheet's data takes up a different number of rows and columns; I want to write something that produces this for each sheet.
I am trying to convert each column into
country state product unit time
USA MD pork 3 jan-11
USA MD pork 3 feb-11
USA MD pork 3 mar-11
...
..
.
Is there any way to do this in R?
If your spreadsheet is full of formulas, you might need to convert those to values to get them to be read in easily. Otherwise, I would suggest using a tool like this one (among others out there) to convert all the sheets in a workbook to CSV files and work from there.
If you've gotten that far, here's something that can be tried for the "reshaping" part of your question. Here, we'll assume that "A" actually represents a CSV file, the contents of which are the six lines shown as sample data in your question:
## Create some sample data
A <- tempfile()
writeLines(sep="\n", con = A,
text = c("time, USA, USA, USA",
", MD, CA, PX",
", pork, peas, nuts",
"jan-11, 4, 2, 2",
"feb-11, 4, 9, 3",
"mar-11, 8, 8, 3"))
The first thing I would do is read in the headers and the data separately. To read the headers separately, use nrows to specify the number of rows that contain the header information. To read the data separately, specify skip to skip the header rows.
B <- read.csv(A, header = FALSE, skip = 3, strip.white = TRUE)
Bnames <- read.csv(A, header = FALSE, nrows = 3, strip.white = TRUE)
Use apply to paste the header rows together to form the names for the resulting data.frame:
names(B) <- apply(Bnames, 2, function(x) paste(x[x != ""], collapse = "_"))
B
# time USA_MD_pork USA_CA_peas USA_PX_nuts
# 1 jan-11 4 2 2
# 2 feb-11 4 9 3
# 3 mar-11 8 8 3
Now comes the part of converting the data from a "wide" to a "long" format. There are many ways to do this, some using base R too, but the most direct is to use melt and colsplit from the "reshape2" package:
library(reshape2)
BL <- melt(B, id.vars="time")
cbind(BL[c("time", "value")],
colsplit(BL$variable, "_",
c("country", "state", "product")))
# time value country state product
# 1 jan-11 4 USA MD pork
# 2 feb-11 4 USA MD pork
# 3 mar-11 8 USA MD pork
# 4 jan-11 2 USA CA peas
# 5 feb-11 9 USA CA peas
# 6 mar-11 8 USA CA peas
# 7 jan-11 2 USA PX nuts
# 8 feb-11 3 USA PX nuts
# 9 mar-11 3 USA PX nuts
Unfortunately, XLConnect is unlikely to work in your application. I can confirm that on a system with 8GB RAM, running Win 7 64bit and 64bit R 3.0.2, XLConnect fails with a 22MB .xlsx file, with the same error that you are getting. As #Ista pointed out, and as explained here, after restarting R and before doing anything else:
options(java.parameters = "-Xmx4096m")
library(XLConnect)
wb <- loadWorkbook("myWorkBook.xlsx")
sheet <- readWorksheet(wb,"Data")
avoids the error. However, the import still takes more than an hour(!!).
In contrast, as #Gaffi pointed out, once the sheet "Data" is saved to a csv file (~7MB), it can be imported as follows:
library(data.table)
system.time(sheet <- fread("Data.csv"))
user system elapsed
0.84 0.00 0.86
in less than 1 second. In my test case sheet has 6 columns and ~376,000 rows.
Sorry about this "second answer", but you really had two questions... #Ananda's solution for reshaping your data is extremely elegant. This is just another way to think about it.
If you transpose the input matrix you get a new matrix, where the first column is country, the second column is city, the third column is "type" (for lack of a better term), and the actual data is in the other columns (so, there is one additional column for every "time").
So a different approach is to transpose first and then melt the new matrix. This avoids creating all the concatenated column names and splitting them back later. The problem is that melt.data.frame is exceptionally inefficient with a very large number of columns (which you would have here), so doing it this way would be 10X slower than #Ananda's approach.
A solution is to use melt.array (just call melt(...) with an array rather than a data frame). As shown below, this approach is ~20X faster for larger datasets (yours was 11MB).
library(reshape) # for melt(...)
library(microbenchmark) # for microbenchmark(...)
# this is just to model your situation with more realistic size
# create a large data frame (250 columns of country, city, type; 1000 rows of time)
df <- rep(c("USA","UK","FR","CHN","GER"),each=50) # time + 250 columns
df <- rbind(df,rep(c(c("NY","SF","CHI","BOS","LA")),each=10))
df <- rbind(df,rep(c("pork","peas","nuts","fruit","other")))
df <- rbind(df,matrix(sample(1:1000,250*1000,replace=T),ncol=250))
df <- cbind(c("time","","",
as.character(as.Date(1:1000,origin="2010-01-01"))),df)
df <- data.frame(df) # big warning here about duplicated row names; not important
# #Ananda'a approach:
transform.orig <- function(df){
  B <- df[-(1:3), ]
  Bnames <- df[1:3, ]
  names(B) <- apply(Bnames, 2, function(x) paste(x[x != ""], collapse = "_"))
  BL <- melt(B, id.vars = "time")
  final <- cbind(BL[c("time", "value")],
                 colsplit(BL$variable, "_",
                          c("country", "state", "product")))
  return(final)
}
# transpose approach:
transform.new <- function(df) {
  zz <- t(df)
  times <- t(zz[1, 4:ncol(zz)])
  colnames(zz) <- c("country", "city", "type", times)
  data <- melt(zz[-1, -(1:3)], varnames = c("id", "time"))
  final <- cbind(country = rep(zz[-1, 1], each = ncol(zz) - 3),
                 city    = rep(zz[-1, 2], each = ncol(zz) - 3),
                 type    = rep(zz[-1, 3], each = ncol(zz) - 3),
                 data[, -1])
  return(final)
}
# benchmark
microbenchmark(transform.orig(df),transform.new(df), times=5, unit="s")
Unit: seconds
expr min lq median uq max neval
transform.orig(df) 9.2511679 9.6986330 9.889457 10.1518191 10.3354328 5
transform.new(df) 0.4383197 0.4724145 0.474212 0.5815531 0.6886383 5
For reading the data from Excel, try the openxlsx package. It uses C++ instead of Java and handles larger Excel files better.
To reshape your data, look at the tidyr package. The gather function could help you out, as in the sketch below.
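A minimal sketch of that route, reusing the combined-name data.frame B built in the earlier answer; the file name is a placeholder, and gather() is the older spelling of what is now pivot_longer():
library(openxlsx)
library(tidyr)
# B <- read.xlsx("excelfile.xlsx", sheet = 1)   # plus the header pasting shown earlier
long <- gather(B, key = "variable", value = "value", -time)
separate(long, variable, into = c("country", "state", "product"), sep = "_")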
I have several .csv files, each one corresponding to a monthly list of customers and some information about them. Each file consists of the same information about customers such as:
names(data.jan)
ID AGE CITY GENDER
names(data.feb)
ID AGE CITY GENDER
To simplify, I will consider only two months, January and February, but my real set of csv files goes from January to November:
Considering a "customer X",I have three possible scenarios:
1- Customer X is listed in the january database, but he left and now is not listed in february
2- Customer X is listed in both january and february databases
3- Customer X entered the database in february, so he is not listed in january
I am stuck on the following problem: I need to create a single database with all customers and their respective information that are listed in both dataframes. However, considering a customer that is listed in both dataframes, I want to pick his information from his first entry, that is, january.
When I use merge, I have four options, according to http://www.dummies.com/how-to/content/how-to-use-the-merge-function-with-data-sets-in-r.html
data <- merge(data.jan,data.feb, by="ID", all=TRUE)
Regardless of which all, all.x or all.y I choose, I get the same undesired output called data:
data[1,]
ID AGE.x CITY.x GENDER.x AGE.y CITY.y GENDER.y
123 25 NY M 25 NY M
I think that what would work here is to first merge both databases with some type of join, then merge the resulting dataframe with data.jan with a full outer join. But I don't know how to code this in R.
Thanks,
Bernardo
d1 <- data.frame(x=1:9,y=1:9,z=1:9)
d2 <- data.frame(x=1:10,y=11:20,z=21:30) # example data
d3 <- merge(d1,d2, by="x", all=TRUE) #merge
# keep the original columns from janary (i.e. y.x, z.x)
# but replace the NAs in those columns with the data from february (i.e. y.y,z.y )
d3[is.na(d3[, 2]), 2:3] <- d3[is.na(d3[, 2]), 4:5]
#> d3[, 1:3]
# x y.x z.x
#1 1 1 1
#2 2 2 2
#3 3 3 3
#4 4 4 4
#5 5 5 5
#6 6 6 6
#7 7 7 7
#8 8 8 8
#9 9 9 9
#10 10 20 30
This may be tiresome for more than 2 months though; perhaps you should consider #flodel's comments. Also note there are pitfalls when your original Jan data has NAs (and you still want the first month's data, NA or not, retained), although you never mentioned them in your question.
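A sketch of one way around that pitfall: decide which rows to fill by membership in January's IDs rather than by is.na, so genuine NAs from January survive (column positions as in the example above):
feb_only <- !(d3$x %in% d1$x)   # rows that exist only in the February data
d3[feb_only, 2:3] <- d3[feb_only, 4:5]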
Try:
data <- merge(data.jan,data.frame(ID=data.feb$ID), by="ID")
I haven't tested it since there's no sample data, but if you join just the ID column from February, it should filter out anything that isn't in both frames.
#user1317221_G's solution is excellent. If your tables are large (lots of customers), data tables might be faster:
library(data.table)
# some sample data
jan <- data.table(id=1:10, age=round(runif(10,25,55)), city=c("NY","LA","BOS","CHI","DC"), gender=rep(c("M","F"),each=5))
new <- data.table(id=11:16, age=round(runif(6,25,55)), city=c("NY","LA","BOS","CHI","DC","SF"), gender=c("M","F"))
feb <- rbind(jan[6:10,],new)
new <- data.table(id=17:22, age=round(runif(6,25,55)), city=c("NY","LA","BOS","CHI","DC","SF"), gender=c("M","F"))
mar <- rbind(jan[1:5,],new)
setkey(jan,id)
setkey(feb,id)
join <- data.table(merge(jan, feb, by="id", all=T))
join[is.na(age.x) , names(join)[2:4]:= join[is.na(age.x),5:7,with=F]]
Edit: This adds processing for multiple months.
f <- function(x, y) {
  setkey(x, id)
  setkey(y, id)
  join <- data.table(merge(x, y, by = "id", all = T))
  join[is.na(age.x), names(join)[2:4] := join[is.na(age.x), 5:7, with = F]]
  join[, names(join)[5:7] := NULL]                 # get rid of extra columns
  setnames(join, 2:4, c("age", "city", "gender"))  # rename columns that remain
  return(join)
}
Reduce("f",list(jan,feb,mar))
Reduce(...) applies the function f(...) to the elements of the list in turn, so first to jan and feb, and then to the result and mar, etc.
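A tiny illustration of how Reduce folds a list, in case the mechanics are unfamiliar:
Reduce(`+`, list(1, 2, 3))   # evaluated as (1 + 2) + 3
# [1] 6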