I have been playing with the WHO package, which contains a great amount of data. A nice feature is that the get_data function lets you pull several tables into a list of data.frames (using lapply):
### Socio-Economic indicators
# health expenditure, GDP per capita, Literacy Rate,
# Fertility Rate, Pop under 1 USD, Population
socio_econ <- c("WHS7_143", "WHS9_93", "WHS9_85", "WHS9_95", "WHS9_90", "WHS9_86")
SECON <- lapply(socio_econ, function(t) get_data(t))
The ultimate goal is to bind the data.frames, possibly using the bind_rows function from dplyr. One problem is that the response variable, called 'value', sits in a different column position in each data.frame (hence it is not possible to subset the same column index in every data frame in the list). A similar problem arises with the classes of the columns, for example 'year'. Basically, each modification needs to find the relevant columns by name and assign new values.
My solution has been to use a for loop, but I think there must be a cleaner way using lapply-type functions. Here's the loop that changes the names and the class of year:
for (i in seq_along(socio_econ)) {
  names(SECON[[i]])[which(names(SECON[[i]]) == 'value')] <- socio_econ[i]
  SECON[[i]]$year <- as.character(SECON[[i]]$year)
}
You can use mutate_at in an lapply call to change the class of the "year" and "value" columns to numeric. Since the data.frames in the list have different numbers of columns, I would suggest a full_join using Reduce.
library(dplyr)
SECON <- lapply(SECON, function(df) mutate_at(df, .cols = c("year", "value"), as.numeric))
output <- Reduce(full_join, SECON)
This gives me an output object of dimension 14169x8; 14169 corresponds to the total number of rows across all list elements.
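For the renaming step in the original question, a Map()-based sketch avoids the explicit for loop (assuming every element of SECON really has a column named value):
# Rename each data.frame's 'value' column to its WHO indicator code
SECON <- Map(function(df, code) {
  names(df)[names(df) == "value"] <- code
  df
}, SECON, socio_econ)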
You could nest a couple of functions like:
f.bind <- function(x){
  f.get <- function(x){
    x %>%
      dplyr::select(region, year, value)
  }
  x <- lapply(x, f.get)
  do.call(rbind, x)
}
The inner function just wraps a small dplyr select, and the outer function applies it to every list element and binds all of the results.
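Usage would then be combined <- f.bind(SECON). On newer dplyr, bind_rows() can replace the do.call(rbind, ...) step; a sketch, assuming the same three columns exist in every table:
library(dplyr)
combined <- bind_rows(lapply(SECON, function(x) select(x, region, year, value)))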
First of all, I am using the ukpolice library in R and extracted data to a new data frame called crimes. Now I am running into a new problem: I am trying to extract certain data to a new, empty data frame called df.shoplifting. If the category of the crime is equal to "shoplifting", its id, month and street name need to be added to the new data frame. I need to use a loop and an if statement together.
EDIT:
Currently I have this working, but it lacks the IF statement:
for (i in crimes$category) {
  shoplifting <- subset(crimes, category == "shoplifting", select = c(id, month, street_name))
  names(shoplifting) <- c("ID", "Month", "Street_Name")
}
What I am trying to do:
for (i in crimes$category) {
  if (crimes$category == "shoplifting") {
    data1 <- subset(crimes, category == i, select = c(id, month, street_name))
  }
}
It does run and creates the new data frame data1, but the data it extracts is wrong and does not only include items with the shoplifting category.
I'll guess, and update if needed based on your question edits.
rbind is for combining data.frame and matrix objects, not for extending vectors. If you want to extend a vector (one that is not part of a frame or a column/row of a matrix), you can merely extend it with c(somevec, newvals) ... but I think that this is not what you want here.
You are iterating through each value of crimes$category, but if one category matches, then you are appending all data within crimes. I suspect you mean to subset crimes when adding. We'll address this in the next bullet.
One cannot extend a single column of a multi-column frame in the absence of the others. A data.frame has the restriction that all columns must always have the same length, and extending one column violates that. (And doing all columns immediately-sequentially does not satisfy that restriction either.)
One way to work around this is to rbind a just-created data.frame:
# i = "shoplifting"
newframe <- subset(crimes, category == i, select = c(id, month, street_name))
names(newframe) <- c("ID", "Month", "Street_Name") # match df.shoplifting names
df.shoplifting <- rbind(df.shoplifting, newframe)
I don't have the data, but if crimes$category ever has repeats, you will re-add all of the same-category rows to df.shoplifting. This might be a problem with my assumptions, but is likely not what you really need.
If you really just need to do it once for a category, then do this without the need for a for loop:
df.shoplifting <- subset(crimes, category == "shoplifting", select = c(id, month, street_name))
# optional
names(df.shoplifting) <- c("ID", "Month", "Street_Name")
Iteratively adding rows to a frame is a bad idea: while it works okay for smaller datasets, as your data scales, the performance worsens. Why? Because each time you add rows to a data.frame, the entire frame is copied into a new object. It's generally better to form a list of frames and then concatenate them all later (c.f., https://stackoverflow.com/a/24376207/3358227).
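A sketch of that list-then-combine pattern (a hypothetical loop; out collects one frame per category):
out <- list()
for (i in unique(crimes$category)) {
  out[[i]] <- subset(crimes, category == i, select = c(id, month, street_name))
}
result <- do.call(rbind, out)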
On this note, if you need one frame per category, you can get that simply with:
df_split <- split(df, df$category)
and then operate on each category as its own frame by working on a specific element within the df_split named list (e.g., df_split[["shoplifting"]]).
And lastly, depending on the analysis you're doing, it might still make sense to keep it all together. Both the dplyr and data.table dialects of R make doing calculations on data within groups very intuitive and efficient.
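For instance, a minimal dplyr sketch that counts crimes per category without ever splitting the frame:
library(dplyr)
crimes %>%
  group_by(category) %>%
  summarise(n = n())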
Try:
df.shoplifting <- crimes[which(crimes$category == 'shoplifting'),]
Using a for loop in this instance will work, but when working in R you want to stick to vectorized operations if you can.
This operation subsets the crimes dataframe and selects rows where the category column is equal to shoplifting. It is not necessary to convert the category column into a factor - you can match the string with the == operator.
Note the comma at the end of the which(...) call, inside the square brackets. The which function returns the indices (row numbers) that meet the criteria. Leaving the spot after the comma empty tells R that you want all of the columns. If you wanted to select only a few columns you could do:
df.shoplifting <- crimes[which(crimes$category == 'shoplifting'), c("id", "month", "street_name")]
OR you could call the columns by their number (I don't have your data so I don't know the numbers, but if the columns are id, month and street_name in that order, you could use 1, 2, 3):
df.shoplifting <- crimes[which(crimes$category == 'shoplifting'),c(1,2,3)]
I have got a large dataframe containing medical data (my.medical.data).
A number of columns contain dates (e.g. hospital admission date), the names of each of these columns end in "_date".
I would like to apply the lubridate::dmy() function to the columns that contain dates and overwrite my original dataframe with the output of this function.
It would be great to have a general solution that can be applied using any function, not just my dmy() example.
Essentially, I want to apply the following to all of my date columns:
my.medical.data$admission_date <- lubridate::dmy(my.medical.data$admission_date)
my.medical.data$operation_date <- lubridate::dmy(my.medical.data$operation_date)
etc.
I've tried this:
date.columns <- select(ICB, ends_with("_date"))
date.names <- names(date.columns)
date.columns <- transmute_at(my.medical.data, date.names, lubridate::dmy)
date.columns now contains my date columns in the "Date" format rather than the original factors. Next I want to replace the date columns in my.medical.data with the new, correctly formatted columns.
my.medical.data.new <- full_join(x = my.medical.data, y = date.columns)
Now I get:
Error: cannot join a Date object with an object that is not a Date object
I'm a bit of an R novice, but I suspect that there is an easier way to do this (e.g. process the original dataframe directly), or maybe a correct way to join / merge the two dataframes.
As usual it's difficult to answer without an example dataset, but this should do the trick:
library(dplyr)
my.medical.data <- my.medical.data %>%
  mutate_at(vars(ends_with('_date')), lubridate::dmy)
This will mutate in place each variable whose name ends with '_date', applying the function. It can also apply multiple functions. See ?mutate_at (which is also the help page for mutate_if).
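On dplyr >= 1.0, where mutate_at is superseded, the same idea is written with across(); a sketch:
library(dplyr)
my.medical.data <- my.medical.data %>%
  mutate(across(ends_with('_date'), lubridate::dmy))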
There are several ways to do that.
If you work with voluminous data, I think data.table is the best approach (it will bring you flexibility, speed and memory efficiency).
data.table
You can use := (the update-by-reference operator) together with lapply to apply lubridate::dmy to all the columns given in .SDcols:
library(data.table)
setDT(my.medical.data)
cols_to_change <- colnames(my.medical.data)[endsWith(colnames(my.medical.data), "_date")]
my.medical.data[, c(cols_to_change) := lapply(.SD, lubridate::dmy), .SDcols = cols_to_change]
base R
A standard lapply can also help. You could try something like this (I did not test it):
my.medical.data[, cols_to_change] <- lapply(cols_to_change, function(d) lubridate::dmy(my.medical.data[, d]))
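A quick self-contained check of the base R variant (toy data; the column names are hypothetical):
my.medical.data <- data.frame(
  admission_date = c("01-02-2019", "15-03-2019"),
  operation_date = c("05-02-2019", "20-03-2019"),
  age = c(64, 71)
)
cols_to_change <- grep("_date$", colnames(my.medical.data), value = TRUE)
my.medical.data[, cols_to_change] <- lapply(cols_to_change,
  function(d) lubridate::dmy(my.medical.data[, d]))
str(my.medical.data)  # both *_date columns are now Date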
I have n data frames, each corresponding to data from a city.
There are 3 variables per data frame and currently they are all factor variables.
I want to transform all of them into numeric variables.
I have started by creating a vector with the names of all the data frames in order to use in a for loop.
cities <- as.vector(objects())
for (i in cities){
  i <- as.data.frame(lapply(i, function(x) as.numeric(levels(x))[x]))
}
Although the code runs and I get no error, I don't see any changes to my data frames: all three variables remain factor variables.
The strangest thing is that when doing them one by one (as below) it works:
df <- as.data.frame(lapply(df, function(x) as.numeric(levels(x))[x]))
What you're essentially trying to do is modify the type of the field if it is a factor (to a numeric type). One approach using purrr would be:
library(purrr)
map(cities, ~ modify_if(.x, is.factor, function(x) as.numeric(as.character(x))))
Note that modify() in itself is like lapply() but it doesn't change the underlying data structure of the objects you are modifying (in this case, dataframes). modify_if() simply takes a predicate as an additional argument.
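A minimal illustration on a single toy data.frame; the as.character() step matters because as.numeric() on a factor would return the underlying integer codes rather than the values in the levels:
library(purrr)
df <- data.frame(a = factor(c("1.5", "2.5")), b = factor(c("3", "4")))
modify_if(df, is.factor, function(x) as.numeric(as.character(x)))
# both columns are now numeric: 1.5 2.5 and 3 4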
For anyone who's interested in my question, I worked out the answer:
for (i in cities){
  assign(i, as.data.frame(lapply(get(i), function(x) as.numeric(levels(x))[x])))
}
I have a data set that has 655 rows and 21 columns. I'm currently looping through each column and need to find the top ten of each, but when I use the head() function it doesn't keep the labels (they are names of bacteria; each column is a sample). Is there a way to create a sorted subset of the data that keeps the row name with it?
right now I am doing
topten <- head(sort(genuscounts[, c(1, i)], decreasing = TRUE), n = 10)
but I am getting an error message since column 1 is the list of names.
Thanks!
Because sort() applies to vectors, it's not going to work with your subset genuscounts[,c(1,i)], because the subset has multiple columns. In base R, you'll want to use order():
thisColumn <- genuscounts[, c(1, i)]
topten <- head(thisColumn[order(thisColumn[, 2], decreasing = TRUE), ], 10)
You could also use arrange_() from the dplyr package, which provides a more user-friendly interface:
library(dplyr)
head(arrange_(genuscounts[, c(1, i)], paste0("desc(", names(genuscounts)[i], ")")), 10)
You'd need to use arrange_() instead of arrange() because your column name will be a string and not an object.
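On current dplyr, where arrange_() is deprecated, the .data pronoun handles a column name held in a string; a sketch:
library(dplyr)
col <- names(genuscounts)[i]
topten <- head(arrange(genuscounts[, c(1, i)], desc(.data[[col]])), 10)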
Hope this helps!!
I want to split a large dataframe into a list of dataframes according to the values in two columns. I then want to apply a common data transformation on all dataframes (lag transformation) in the resulting list. I'm aware of the split command but can only get it to work on one column of data at a time.
You need to put all the factors you want to split by in a list, eg:
split(mtcars,list(mtcars$cyl,mtcars$gear))
Then you can use lapply on this to do what else you want to do.
If you want to avoid having zero row dataframes in the results, there is a drop parameter whose default is the opposite of the drop parameter in the "[" function.
split(mtcars,list(mtcars$cyl,mtcars$gear), drop=TRUE)
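lapply can then apply the lag transformation to each piece; a sketch with mtcars, where mpg_lag is a made-up column holding the previous row's mpg within each group:
pieces <- split(mtcars, list(mtcars$cyl, mtcars$gear), drop = TRUE)
pieces <- lapply(pieces, function(d) {
  d$mpg_lag <- c(NA, head(d$mpg, -1))  # shift mpg down one row within the group
  d
})
result <- do.call(rbind, pieces)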
How about this one:
library(plyr)
ddply(df, .(category1, category2), summarize, value1 = lag(value1), value2=lag(value2))
This seems like an excellent job for the plyr package and its ddply() function. If there are still open questions, please provide some sample data. Splitting should work on several columns as well:
df <- data.frame(value = rnorm(100), class1 = factor(rep(c('a','b'), each = 50)), class2 = factor(rep(c('1','2'), 50)))
g <- list(df$class1, df$class2)
split(df$value, g)
You can also do the following:
split(x = df, f = ~ var1 + var2...)
This way, you can also achieve the same split dataframe by many variables without using a list in the f parameter.
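For reference, the formula method for split() requires R >= 4.1.0. A concrete sketch:
split(mtcars, ~ cyl + gear)  # same result as split(mtcars, list(mtcars$cyl, mtcars$gear))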