I have the following dataset, which includes 2 variables:
dt4<-structure(list(a1 = c(4L, 4L, 3L, 4L, 4L), a2 = c(1L,
3L, 4L, 5L, 4L)), .Names = c("a1", "a2"
), row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"
))
I have the following function that adds labels and levels to an existing dataset:
Add_Labels_Level_To_Dataset <- function(df, df_name,levels_list,labels_list) {
df[] <- lapply( df, ordered)
for (i in 1:length(colnames(df))) {
arg0<-paste0(df_name,"[i]", "<-ordered(", df_name, "$'", colnames(df)[i], "', levels=c(", levels_list[[i]], "), labels = c(", labels_list[[i]],"))" )
eval(parse(text=arg0))
}
df
}
which is run with this R command:
Add_Labels_Level_To_Dataset(dt4, "dt4", level_list, label_list)
The lists supplied in the R command are the following; they represent the ordered levels and labels of each variable in the dataset, respectively:
label_list=list("'S','SA','SB','SC,'SD'", "'S','SA','SB','SC,'SD'")
level_list=list("5,4,3,2,1", "5,4,3,2,1")
Why is my function not working properly?
I don't know what is wrong with it!
When I run the R commands outside the function, they tie the levels/labels to the given dataset. However, when I run my R function, this does not happen!
df_name="dt4"
df=dt4
levels_list=level_list
labels_list=label_list
i=3
df[] <- lapply( df, ordered)
arg0<-paste0(df_name,"[i]", "<-ordered(", df_name, "$'", colnames(df)[i], "', levels=c(", levels_list[[i]], "), labels = c(", labels_list[[i]],"))" )
eval(parse(text=arg0))
Can you help?
This is an XY problem. I agree with @MrFlick that parse should be avoided.
In the original post, the main issue is that the function should be returning dt4 rather than df, since the eval(parse(...)) calls modify the global dt4, not the local df. There are also some missing ' (single quotes) in the definition of label_list.
We could use mapply and avoid the quoted strings entirely:
label_list=list(c('S','SA','SB','SC','SD'), c('S','SA','SB','SC','SD'))
level_list=list(c(5,4,3,2,1), c(5,4,3,2,1))
as.data.frame(mapply(function(x, levels, labels) {ordered(x, levels, labels)}, dt4, level_list, label_list, SIMPLIFY = FALSE))
# a1 a2
#1 SA SD
#2 SA SB
#3 SB SA
#4 SA S
#5 SA SA
Using eval/parse should be avoided. There are typically much easier ways to do what you want in R. For example, we can just write
Add_Labels_Level_To_Dataset <- function(df, levels_list, labels_list) {
  df[] <- Map(function(data, levels, labels) {
    ordered(data, levels = strsplit(levels, ",")[[1]], labels = strsplit(labels, ",")[[1]])
  }, df, levels_list, labels_list)
  df
}
And we can call it like
dt4 <- Add_Labels_Level_To_Dataset(dt4, level_list, label_list)
Note that it returns a new data.frame, which you can reassign to dt4 or some other variable. Functions in R should never modify objects outside their own scope, which is one of the other reasons you were running into problems with your function.
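To make the scoping point concrete, here is a minimal sketch (with a hypothetical function f) showing that modifying a data frame inside a function only changes the local copy:
f <- function(df) {
  df$a1 <- 0   # modifies the local copy only
  df
}
tmp <- f(dt4)
# dt4 is unchanged here; only the returned value, tmp, carries the change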
Related
I have the following dataset:
dataset=structure(list(var1 = c(28.5627505742013, 22.8311421908438, 95.2216156944633,
43.9405107684433, 97.11211245507, 48.4108281508088, 77.1804554760456,
27.1229329891503, 69.5863061584532, 87.2112890332937), var2 = c(32.9009465128183,
54.1136392951012, 69.3181485682726, 70.2100433968008, 44.0986660309136,
62.8759404085577, 79.4413498230278, 97.4315509572625, 62.2505457513034,
76.0133410431445), var3 = c(89.6971945464611, 67.174579706043,
37.0924087055027, 87.7977314218879, 29.3221596442163, 37.5143952667713,
62.6237869635224, 71.3644423149526, 95.3462834469974, 27.4587387405336
), var4 = c(41.5336912125349, 98.2095112837851, 80.7970978319645,
91.1278881691396, 66.4086666144431, 69.2618868127465, 67.7560870349407,
71.4932355284691, 21.345994155854, 31.1811877787113), var5 = c(33.9312525652349,
88.1815139763057, 98.4453701227903, 25.0217059068382, 41.1195872165263,
37.0983888953924, 66.0217586159706, 23.8814191706479, 40.9594196081161,
79.7632974945009), var6 = c(39.813664201647, 80.6405956856906,
30.0273275375366, 34.6203793399036, 96.5195455029607, 44.5830867439508,
78.7370151281357, 42.010761089623, 23.0079878121614, 58.0372223630548
), kmeans = structure(c(2L, 1L, 3L, 1L, 3L, 1L, 1L, 1L, 2L, 3L
), .Label = c("1", "2", "3"), class = "factor")), .Names = c("var1",
"var2", "var3", "var4", "var5", "var6", "kmeans"), row.names = c(NA,
-10L), class = c("tbl_df", "tbl", "data.frame"))
And the following function:
myfun<-function(x){
c(sum(x),mean(x),sd(x))
}
With dplyr::summarise_if and the built-in functions, the result is OK:
library(tidyverse)
my1<-dataset%>%
summarise_if(.,is.numeric,.funs=funs(sum,mean,sd))
But with myfun it doesn't work:
my2<-dataset%>%
summarise_if(.,is.numeric,.funs=funs(myfun))
Error in summarise_impl(.data, dots) :
Column var1 must be length 1 (a summary value), not 3
What's the problem?
You can try this approach. Your approach will not yield the correct result because summarise cannot wrap the multiple values returned by your custom function into a single cell. To circumvent the problem, I used enframe together with list in the custom function:
library(tidyverse)
myfun<-function(x){
return(list(enframe(c('sum' = sum(x),'mean' = mean(x),'sd' = sd(x)))))
}
For example, with the mtcars data:
my2<-mtcars%>%
summarise_at(c('mpg','drat'), function(x) myfun(x)) %>%
unnest() %>%
select(-name1) %>%
set_names(nm = c('name', 'mpg', 'drat'))
it will yield:
name mpg drat
1 sum 642.900000 115.0900000
2 mean 20.090625 3.5965625
3 sd 6.026948 0.5346787
Alternatively, you can try solving it with purrr.
For example:
f <- function(x,...){
list('mean' = mean(x, ...),'sum' = sum(x, ...))
}
mtcars %>%
select(mpg, drat) %>%
map_dfr(~ f(.x, na.rm=T), .id ="Name") %>%
data.frame()
When you apply this function
dataset %>% summarise_if(is.numeric, .funs = funs(sum, mean, sd))
you are applying three different functions (sum, mean and sd), each applied individually to every numeric column. Each of these functions returns a single value per column.
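For illustration, each function/column pair becomes its own output column; a quick way to see this (a sketch; the var_fun column naming is dplyr's convention when several functions are supplied):
names(dataset %>% summarise_if(is.numeric, .funs = funs(sum, mean, sd)))
# e.g. "var1_sum" ... "var6_sum" "var1_mean" ... "var6_sd"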
Regarding your function, I think what you were trying to do was
myfun<-function(x){
c(sum(x),mean(x),sd(x))
}
Now, when this function is applied to one column, it returns three values: one function returning three values instead of one.
myfun(dataset$var1)
#[1] 597.17994 59.71799 29.03549
As @NelsonGon mentioned in the comments, you are trying to store three values in a single column. You could return them as a list, as @Pkumar showed, or some variation of do would also help you achieve that. If you break the logic into three separate functions, it works the same way as shown earlier.
myfun1 <- function(x) sum(x)
myfun2 <- function(x) mean(x)
myfun3 <- function(x) sd(x)
dataset %>% summarise_if(is.numeric,.funs=funs(myfun1,myfun2,myfun3))
It's not the most elegant way, but if your external function is just a collection of other functions, maybe you can simply use a list of functions:
myfun_ls <- list(sum,mean,sd)
my2<-dataset%>%
summarise_if(.,is.numeric,.funs=myfun_ls)
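If you name the list elements, the generated column names become more readable (a sketch; this assumes a dplyr version, 0.8 or later, whose .funs accepts a named list):
myfun_ls <- list(sum = sum, mean = mean, sd = sd)
my2 <- dataset %>%
  summarise_if(is.numeric, .funs = myfun_ls)
# columns come out as var1_sum, var1_mean, var1_sd, ...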
It's a fairly simple task, but I'm trying to wrap my head around how to match values using a dataframe of keys and values. I've tried merge, but since the number of rows differs, I'm not sure it's appropriate.
Is there a for loop I can write that will loop through each key in my input dataframe and change Product's value if it's one of the ones in the lookup table?
Essentially, my data looks like this:
input_key <- c(9061,8680,1546,5376,9550,9909,3853,3732,9209)
input_product <- c("Water", "Bread", NA, "Chips", "Chicken", NA, "Chocolate", "Donuts", "Juice")
input <- as.data.frame(cbind(input_key, input_product))
I'd like to replace the NAs with the Product values in the corresponding lookup table:
lookup_key <- c(1245,1546, 7764, 9909)
lookup_product <- c("Ice Cream","Soda", "Bacon","Cheese")
lookup_data <- as.data.frame(cbind(lookup_key, lookup_product))
Finally, I'm hoping to get the final dataframe looking like this:
output_key <- c(9061,8680,1546,5376,9550,9909,3853,3732,9209)
output_product <- c("Water", "Bread", "Soda", "Chips", "Chicken", "Cheese", "Chocolate", "Donuts", "Juice")
output_data <- as.data.frame(cbind(output_key, output_product))
OPTION 1: Using base R functions
Vectorial solution:
input$input_product[input$input_key %in% lookup_data$lookup_key == TRUE] <-
lookup_product[lookup_data$lookup_key %in% input$input_key == TRUE]
Note: The ==TRUE is redundant, added just for better understanding.
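An equivalent (and arguably safer) vectorized form uses match, which aligns positions explicitly instead of relying on both %in% masks coming out in the same order; a small sketch:
idx <- match(input$input_key, lookup_data$lookup_key)
hit <- !is.na(idx)   # rows of input that have a lookup entry
input$input_product[hit] <- lookup_data$lookup_product[idx[hit]]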
Using lapply function:
idx <- input$input_key %in% lookup_data$lookup_key
lapply((1:nrow(input)),
function(i) {
if (idx[i] == TRUE) {
jdx <- lookup_data$lookup_key %in% input$input_key[i]
input$input_product[i] <<- lookup_data$lookup_product[jdx == TRUE]
}
}
)
Note: Attention to the global assignment operator (<<-).
Using for loop:
idx <- input$input_key %in% lookup_data$lookup_key
for (i in (1:nrow(input))) {
if (idx[i] == TRUE) {
jdx <- lookup_data$lookup_key %in% input$input_key[i]
input$input_product[i] <- lookup_data$lookup_product[jdx == TRUE]
}
}
Note: Here we just need a simple assignment.
In the above cases you need to create the data frames with the argument stringsAsFactors = FALSE, for example:
input <- as.data.frame(cbind(input_key, input_product), stringsAsFactors = FALSE)
lookup_data <- as.data.frame(cbind(lookup_key, lookup_product), stringsAsFactors = FALSE)
Then you get the output:
> input
input_key input_product
1 9061 Water
2 8680 Bread
3 1546 Soda
4 5376 Chips
5 9550 Chicken
6 9909 Cheese
7 3853 Chocolate
8 3732 Donuts
9 9209 Juice
>
OPTION 2: Using the data.table package
I found this elegant solution using an inner join:
require(data.table)
setkey(input,input_key)
setkey(lookup_data,lookup_key)
> setDT(input)[setDT(lookup_data), input_product := i.lookup_product, nomatch=0][]
input_key input_product
1: 1546 Soda
2: 3732 Donuts
3: 3853 Chocolate
4: 5376 Chips
5: 8680 Bread
6: 9061 Water
7: 9209 Juice
8: 9550 Chicken
9: 9909 Cheese
>
data.table is actually very powerful for data set manipulation. Let's explain the syntax behind it:
setDT: Converts a data frame into a data.table by reference (no copy occurs). Because the original data sets are not of class data.table, this is the way to convert them on the fly. Notice that it is no longer necessary to set stringsAsFactors, because for data.table its default value is FALSE.
input[lookup_data, nomatch=0]: This is how the data.table package expresses an inner join (see this link). It returns the intersection of both tables. nomatch=0 means that rows of i (in our case lookup_data) with no match are dropped.
This would be the output:
> setDT(input)[setDT(lookup_data), nomatch=0][]
input_key input_product lookup_product
1: 1546 NA Soda
2: 9909 NA Cheese
>
input_product := i.lookup_product: assigns to the input_product column the value of lookup_product from the joined table (the i. prefix refers to columns of lookup_data).
[]: Prints the result (to verify the solution).
For more information about data.table, I recommend reading the package documentation; it comes with many examples. It is also a good idea to run the following command in R (after loading the data.table package):
example(data.table)
It runs more than 50 examples (the same ones from the package documentation) with their corresponding results, covering the different uses of the package.
PERFORMANCE
Let's compare all the alternatives in terms of performance. To do so, we need to enlarge the input data set:
rep.num <- 1000
input_key <- rep(c(9061,8680,1546,5376,9550,9909,3853,3732,9209),rep.num)
input_product <- rep(c("Water", "Bread", NA, "Chips", "Chicken", NA, "Chocolate",
"Donuts", "Juice"),rep.num)
input <- as.data.frame(cbind(input_key, input_product), stringsAsFactors=F)
Wrap each alternative in its own function. I have included the dplyr solution proposed by @count:
vectSol <- function(input, lookup_data) {
input$input_product[input$input_key %in% lookup_data$lookup_key == TRUE] <-
lookup_product[lookup_data$lookup_key %in% input$input_key == TRUE]
return(input)
}
lapplySol <- function(input, lookup_data) {
idx <- input$input_key %in% lookup_data$lookup_key
lapply((1:nrow(input)),
function(i) {
if (idx[i] == TRUE) {
jdx <- lookup_data$lookup_key %in% input$input_key[i]
input$input_product[i] <<- lookup_data$lookup_product[jdx == TRUE]
}
}
)
return(input)
}
forSol <- function(input, lookup_data) {
idx <- input$input_key %in% lookup_data$lookup_key
for (i in (1:nrow(input))) {
if (idx[i] == TRUE) {
jdx <- lookup_data$lookup_key %in% input$input_key[i]
input$input_product[i] <- lookup_data$lookup_product[jdx == TRUE]
}
}
return(input)
}
dataTableSol <- function (input, lookup_data) {
setkey(input,input_key)
setkey(lookup_data,lookup_key)
input[lookup_data, input_product := i.lookup_product, nomatch=0]
return(input)
}
dplyrSol <- function(input, lookup_data) {
  out <- rbind(input[!is.na(input$input_product),],
               inner_join(lookup_data, input, by = c("lookup_key" = "input_key")) %>%
                 select(lookup_key, lookup_product) %>%
                 rename(input_product = lookup_product, input_key = lookup_key))
  return(out)
}
Now test each solution (double check).
Make copies of the input data sets, because data.table operates by reference; we need to create the copies from scratch.
input.copy <- setDT(as.data.frame(cbind(input_key, input_product), stringsAsFactors=F))
lookup_data.copy<- setDT(as.data.frame(cbind(lookup_key, lookup_product),
stringsAsFactors=F))
input1.out <- vectSol(input, lookup_data)
input2.out <- lapplySol(input, lookup_data)
input3.out <- forSol(input, lookup_data)
input4.out <- forSol(input, lookup_data)
input5.out <- dataTableSol(copy(input.copy), lookup_data.copy)
We use the compare package because all.equal fails when comparing a data frame with a data.table object (the attributes differ), so we need a comparison that checks only the values.
library(compare)
OK <- all(
all.equal(input1.out, input2.out) && all.equal(input1.out, input3.out)
&& all.equal(input1.out, input4.out)
&& compare(input1.out[order(input1.out$input_key),],
input5.out, ignoreAttrs=T)$result
)
try(if(!OK) stop("Result are not the same for all methods"))
Now let's use the microbenchmark package to compare the time performance of all the solutions:
library(microbenchmark)
op <- microbenchmark(
VECT = {vectSol(input, lookup_data)},
FOR = {forSol(input, lookup_data)},
LAPPLY = {lapplySol(input, lookup_data)},
DPLYR = {dplyrSol(input, lookup_data)},
DATATABLE = {dataTableSol(input.copy, lookup_data.copy)},
times=100L)
print(op)
Here is the result:
Unit: milliseconds
expr min lq mean median uq max neval cld
VECT 1.005890 1.078983 1.384964 1.108162 1.282269 6.562040 100 a
FOR 416.268583 438.545475 476.551526 449.679426 476.032938 740.027018 100 b
LAPPLY 428.456092 454.664204 492.918478 464.204607 501.168572 751.786224 100 b
DPLYR 13.371847 14.919726 16.482236 16.105815 17.086174 23.537866 100 a
DATATABLE 1.699995 2.059205 2.427629 2.279371 2.489406 8.542219 100 a
Additionally, we can graph the results via:
library(ggplot2) #nice log plot of the output
qplot(y=time, data=op, colour=expr) + scale_y_log10()
The performance ranking, from best to worst, is: vectorial, data.table, dplyr, for loop, lapply.
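The gap makes sense: the vectorial and data.table solutions perform a single vectorized lookup, while the for-loop and lapply versions re-scan the lookup table once per row at the R level, which is why they come out roughly two orders of magnitude slower in this benchmark.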
Pretty tired so this is clumsy, but it should work for the data provided (your output sample is probably wrong, though):
require(dplyr)
rbind(input[!is.na(input$input_product),],
inner_join(lookup_data,input,by=c("lookup_key"="input_key")) %>%
select(lookup_key,lookup_product) %>%
rename(input_product = lookup_product, input_key = lookup_key))
This is easily done using the data.table package as follows:
# load sample data
input_data <- structure(list(
input_key =
structure(c(6L, 5L, 1L, 4L, 8L, 9L,
3L, 2L, 7L),
.Label = c("1546", "3732", "3853", "5376", "8680",
"9061", "9209", "9550", "9909"), class = "factor"),
input_product = structure(c(7L, 1L, NA, 3L, 2L, NA, 4L, 5L, 6L),
.Label = c("Bread", "Chicken", "Chips", "Chocolate",
"Donuts", "Juice", "Water"), class = "factor")),
.Names = c("input_key",
"input_product"),
row.names = c(NA, -9L), class = "data.frame")
lookup_data <- structure(list(
lookup_key = structure(1:4,
.Label = c("1245", "1546", "7764", "9909"), class = "factor"),
lookup_product = structure(c(3L,
4L, 1L, 2L), .Label = c("Bacon", "Cheese", "Ice Cream", "Soda"
), class = "factor")), .Names = c("lookup_key", "lookup_product"
), row.names = c(NA, -4L), class = "data.frame")
# convert to data.table and add keys for merging
library(data.table)
input <- data.table(input_data, key = 'input_key')
lookup <- data.table(lookup_data, key = 'lookup_key')
# merge the data (can use merge method as well)
DT <- lookup[input]
# where the input_product is NA, replace with lookup
DT[is.na(input_product), input_product := lookup_product]
print(DT)
# you can now get rid of lookup_product column, if you like
DT[, lookup_product:= NULL]
print(DT)
The final output of the above is:
> print(DT)
lookup_key input_product
1: 1546 Soda
2: 3732 Donuts
3: 3853 Chocolate
4: 5376 Chips
5: 8680 Bread
6: 9061 Water
7: 9209 Juice
8: 9550 Chicken
9: 9909 Cheese
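If you want the key column to keep its original name after the join, a one-line sketch using data.table's setnames:
setnames(DT, 'lookup_key', 'input_key')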
I have a dataframe that contains more than 2 million records. I am only sharing a few records for data security reasons; I hope you can understand.
data <- data[order(data$email_address_hash),]
skip_row <- c()
data$hash_time <- rep('NA',NROW(data)) #adding new column to our data
rownames(data) <- as.character(1:NROW(data))
dput(droplevels(data))
structure(list(email_address_hash = structure(c(2L, 1L, 1L, 2L
), .Label = c("0004eca7b8bed22aaf4b320ad602505fe9fa9d26", "35c0ef2c2a804b44564fd4278a01ed25afd887f8"
), class = "factor"), open_time = structure(c(2L, 1L, 3L, 4L), .Label = c(" 04:39:24",
" 09:57:20", " 10:39:43", " 19:00:09"), class = "factor")), .Names = c("email_address_hash",
"open_time"), row.names = c(41107L, 47808L, 3973L, 8307L), class = "data.frame")
str(data)
'data.frame': 4 obs. of 2 variables:
$ email_address_hash: Factor w/ 36231 levels "00012aec4ca3fa6f2f96cf97fc2a3440eacad30e",..: 7632 2 2 7632
$ open_time : Factor w/ 34495 levels " 00:00:03"," 00:00:07",..: 15918 5096 16971 24707
skip_row <- c()
data$hash_time <- rep('NA',NROW(data)) #adding new column to our data
rownames(data) <- as.character(1:NROW(data))
for(i in 1:NROW(data)){
#Skipping the email_address_hash that was already used for grouping
if(i %in% skip_row) next
hash_row_no <- c()
#trimming data so that we don't need to look into whole dataframe
trimmed_data <- data[i:NROW(data),]
# Whenever we search for email_address_hash the previous one was ignored or removed from the check
#extracting rownames so that we can used that as rownumber inside the skip_row
hash_row_no <- rownames(trimmed_data[trimmed_data$email_address_hash==trimmed_data$email_address_hash[1],])
#note :- we know the difference b/w rownames and rownumber
#converting rownames into numeric so that we can use them as rowno
hash_row_no <- as.numeric(hash_row_no)
first_no <- hash_row_no[1]
last_no <- hash_row_no[NROW(hash_row_no)]
skip_row <- append(skip_row,hash_row_no)
data$hash_time[first_no] <- paste(data$open_time[first_no:last_no], collapse = "")
}
Please note that I also tried the approaches below to speed up the process, but they seem to be ineffective:
hash_row_no <- rownames(trimmed_data[trimmed_data$email_address_hash==trimmed_data$email_address_hash[1],])
converted the dataframe to a data.table
setDT(data)
Performing either of these operations gives similar times:
system.time(rownames(trimmed_data[trimmed_data$email_address_hash==trimmed_data$email_address_hash[1],]))
system.time(rownames(trimmed_data)[trimmed_data[["email_address_hash"]] == trimmed_data$email_address_hash[1]])
Can you guys help me speed up my code? My data contains more than 2 million records and this takes more than 30 minutes, sometimes even longer.
Apparently you want to do this:
library(data.table)
setDT(data)
data[, .(open_times = paste(open_time, collapse = "")), by = email_address_hash]
# email_address_hash open_times
#1: 35c0ef2c2a804b44564fd4278a01ed25afd887f8 09:57:20 19:00:09
#2: 0004eca7b8bed22aaf4b320ad602505fe9fa9d26 04:39:24 10:39:43
Or possibly this:
data[email_address_hash == "0004eca7b8bed22aaf4b320ad602505fe9fa9d26",
paste(open_time, collapse = "")]
#[1] " 04:39:24 10:39:43"
I have a problem using the mutate function from dplyr to add a new column to a data frame. I want the new column to be of character type and to consist of a "concat" of the sorted values from the other columns (which are of character type, too). For example, for the following data frame:
> library(datasets)
> states.df <- data.frame(name = as.character(state.name),
+ region = as.character(state.region),
+ division = as.character(state.division))
>
> head(states.df, 3)
name region division
1 Alabama South East South Central
2 Alaska West Pacific
3 Arizona West Mountain
I would like to get a new column with the following first element:
"Alamaba_East South Central_South"
I tried this:
mutate(states.df,
concated_column = paste0(sort(name, region, division), collapse="_"))
But I received an error:
Error in sort(1:50, c(2L, 4L, 4L, 2L, 4L, 4L, 1L, 2L, 2L, 2L, 4L, 4L, :
'decreasing' must be a length-1 logical vector.
Did you intend to set 'partial'?
Thank you for any help in advance!
You need to use sep =, not collapse =, and why use sort? Also, I used paste and not paste0.
library(dplyr)
states.df <- data.frame(name = as.character(state.name),
region = as.character(state.region),
division = as.character(state.division))
res = mutate(states.df,
concated_column = paste(name, region, division, sep = '_'))
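Given the head shown above, the first element of the new column comes out as:
head(res$concated_column, 1)
# [1] "Alabama_South_East South Central"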
As far as the sorting goes, you do not use sort correctly. Maybe you want:
as.data.frame(lapply(states.df, sort))
This sorts each column, and creates a new data.frame with those columns.
Adding on to Paul's answer. If you want to sort the rows, you could try order. Here is an example:
res1 <- mutate(states.df,
concated_column = apply(states.df[order(name, region, division), ], 1,
function(x) paste0(x, collapse = "_")))
Here order sorts the data.frame states.df by name, breaking ties by region and then division.
I'm having an issue using apply functions (which I assume is the right way to do the following) across multiple data frames.
Some example data (3 different data frames, but the problem I'm working on has upwards of 50):
biz <- data.frame(
country = c("england","canada","australia","usa"),
businesses = sample(1000:2500,4))
pop <- data.frame(
country = c("england","canada","australia","usa"),
population = sample(10000:20000,4))
restaurants <- data.frame(
country = c("england","canada","australia","usa"),
restaurants = sample(500:1000,4))
Here's what I ultimately want to do:
1) Sort each data frame from largest to smallest, according to the variable it contains:
dataframe <- dataframe[order(dataframe$VARIABLE, decreasing = TRUE), ]
2) then create a vector variable that gives me the rank for each
dataframe$rank <- 1:nrow(dataframe)
3) Then create another data frame that has one column for the countries and the rank for each of the variables of interest as the other columns. Something like this (rankings aren't real here):
country.rankings <- structure(list(country = structure(c(5L, 1L, 6L, 2L, 3L, 4L), .Label = c("brazil",
"canada", "england", "france", "ghana", "usa"), class = "factor"),
restaurants = 1:6, businesses = c(4L, 5L, 6L, 3L, 2L, 1L),
population = c(4L, 6L, 3L, 2L, 5L, 1L)), .Names = c("country",
"restaurants", "businesses", "population"), class = "data.frame", row.names = c(NA,
-6L))
So I'm guessing there's a way to put each of these data frames together into a list, something like:
lib <- c(biz, pop, restaurants)
And then do an lapply across that to 1) sort, 2) create the rank variable, and 3) create the matrix or data frame of rankings for each variable (# of businesses, population size, # of restaurants) for each country. The problem I'm running into is that writing the lapply function to sort each data frame runs into issues when I try to order by the variable:
sort <- lapply(lib,
function(x){
x <- x[order(x[,2]),]
})
returns the error message:
Error in `[.default`(x, , 2) : incorrect number of dimensions
because I'm trying to apply column headings to a list. But how else would I tackle this problem when the variable names are different for every data frame (keeping in mind that the country names are consistent)?
(I would also love to know how to do this using plyr.)
Ideally I'd recommend data.table for this.
However, here is a quick solution using data.frame.
Try this:
Step 1: Create a list of all the data.frames:
varList <- list(biz,pop,restaurants)
Step 2: Combine all of them into one data.frame:
temp <- varList[[1]]
for(i in 2:length(varList)) temp <- merge(temp,varList[[i]],by = "country")
Step 3: Get the ranks:
cbind(temp,apply(temp[,-1],2,rank))
You can remove the undesired columns if you want:
cbind(temp[,1:2],apply(temp[,-1],2,rank))[,-2]
Hope this helps!
totaldatasets <- c('biz','pop','restaurants')
totaldatasetslist <- vector(mode = "list",length = length(totaldatasets))
for ( i in seq(length(totaldatasets)))
{
totaldatasetslist[[i]] <- get(totaldatasets[i])
}
totaldatasetslist2 <- lapply(
  totaldatasetslist,
  function(x)
  {
    # use the function argument x, not the loop index i left over from above
    temp <- data.frame(
      country = x[, 1],
      countryrank = rank(x[, 2])
    )
    colnames(temp) <- c('country', colnames(x)[2])
    return(temp)
  }
)
Reduce(
merge,
totaldatasetslist2
)
Output -
country businesses population restaurants
1 australia 3 3 3
2 canada 2 2 2
3 england 1 1 1
4 usa 4 4 4
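Reduce(merge, totaldatasetslist2) successively merges the per-dataset rank tables on their shared country column, which is why the final table ends up with one row per country and one rank column per dataset.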