I'm trying to create an R Script to summarize measures in a data frame. I'd like it to react dynamically to changes in the structure of the data frame. For example, I have the following block.
library(plyr) #loading plyr just to access baseball data frame
MyData <- baseball[,cbind("id","h")]
AggHits <- aggregate(x=MyData$h, by=list(MyData[,"id"]), FUN=sum)
This block creates a data frame (AggHits) with the total hits (h) for each player (id). Yay.
Suppose I want to bring in the team. How do I change the by argument so that AggHits has the total hits for each combination of "id" and "team"? I tried the following and the second line throws an error: arguments must have same length
MyData <- baseball[,cbind("id","team","h")]
AggHits <- aggregate(x=MyData$h, by=list(MyData[,cbind("id","team")]), FUN=sum)
More generally, I'd like to write the second line so that it automatically aggregates h by all variables except h. I can generate the list of variables to group by pretty easily using setdiff.
# set the list of variables to summarize by as everything except hits
SumOver <- setdiff(colnames(MyData),"h")
# total up all the hits - again this line throws an error
AggHits <- aggregate(x=MyData$h, by=list(MyData[,cbind(SumOver)]), FUN=sum)
The business purpose I'm using this for involves a csv file which has a single measure ($) and currently has about a half dozen dimensions (product, customer, state code, dates, etc.). I'd like to be able to add dimensions to the csv file without having to edit the script each time.
I should mention that I've been able to accomplish this using ddply, but I know that using ddply to summarize a single measure is wasteful with regard to run time; aggregate is much faster.
Thanks in advance!
ANSWER (specific to example in question)
Block should be
MyData <- baseball[,cbind("id","team","h")]
SumOver <- setdiff(colnames(MyData),"h")
AggHits <- aggregate(x=MyData$h, by=MyData[SumOver], FUN=sum)
The following aggregates by every non-integer column (ID, Team, League) and, more generally, shows a strategy for aggregating over an arbitrary set of columns (by=MyData[cols.to.group.on]):
MyData <- plyr::baseball
cols <- names(MyData)[sapply(MyData, class) != "integer"]
aggregate(MyData$h, by=MyData[cols], sum)
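As a minimal sketch of the same idea applied to the csv use case from the question: the data frame `sales` and its column names are made up for illustration, and only base R is used. The grouping columns are derived from whatever columns are present, so adding a dimension needs no change to the script.

```r
# Toy data standing in for the real csv (all names here are hypothetical)
sales <- data.frame(
  product = c("A", "A", "B", "B", "A"),
  state   = c("NY", "NY", "NY", "CA", "CA"),
  amount  = c(10, 5, 7, 3, 1),
  stringsAsFactors = FALSE
)

measure    <- "amount"
group_cols <- setdiff(names(sales), measure)  # every column except the measure

# by= expects a list; subsetting a data frame with single brackets already gives one
totals <- aggregate(x = sales[[measure]], by = sales[group_cols], FUN = sum)

# aggregate() names the summarized column "x"; restore the original name
names(totals)[names(totals) == "x"] <- measure
totals
```

The error in the question comes from wrapping the subset in list(), which produces a list holding one whole data frame instead of a list of columns.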
Here is a solution using aggregate from base R
data(baseball, package = "plyr")
MyData <- baseball[,c("id","h", "team")]
AggHits <- aggregate(h ~ ., data = MyData, sum)
Related
I have been playing with the WHO package that contains a great amount of data. A good thing is that the get_data function allows to pull several tables into a list of data.frames (using lapply)
### Socio-Economic indicators
# health expenditure, GDP per capita, Literacy Rate,
# Fertility Rate, Pop under 1 USD, Population,
socio_econ <- c("WHS7_143", "WHS9_93", "WHS9_85", "WHS9_95", 'WHS9_90', 'WHS9_86')
SECON <- lapply(socio_econ, function(t) get_data(t))
The ultimate goal is to bind the data.frames, possibly using the bind_rows function from dplyr. One problem is that the response variable, called 'value', sits in a different position in each data.frame (hence it is not possible to subset the same column positions in every data frame in the list). A similar problem arises with the class of some columns, for example 'year'. Basically, each modification would need to conditionally find the particular columns by name and assign new values.
My solution has been to use a for loop, but I think there must be a cleaner way using lapply-type functions. Here is the loop that changes the names and the class of year.
for (i in 1:length(socio_econ)){
names(SECON[[i]])[which(names(SECON[[i]])=='value')] <- socio_econ[i]
SECON[[i]]$year <- as.character(SECON[[i]]$year)
}
You can use mutate_at in a lapply call to change the class of the "year" and "value" columns to numeric. Since the data.frames in the list have a different number of columns, I would suggest a full_join using Reduce.
library(dplyr)
# .vars names the columns to convert, .funs is the function applied to each of them
SECON <- lapply(SECON, function(df) mutate_at(df, .vars = c("year","value"), .funs = as.numeric))
output <- Reduce(full_join, SECON)
This gives me an output object of dimension 14169x8. 14169 corresponds to the total number of lines in all list elements.
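For comparison, here is a rough base R sketch of the same rename, convert and join pipeline. It swaps merge(all = TRUE) in for dplyr's full_join, and the two small tables and the indicator codes are invented stand-ins for what get_data returns, so treat the column layout as an assumption.

```r
# Two made-up tables mimicking the list built with lapply(socio_econ, get_data);
# the codes and columns are hypothetical
codes  <- c("IND_A", "IND_B")
tables <- list(
  data.frame(region = c("X", "Y"), year = c(2000L, 2000L), value = c(1.5, 2.5),
             stringsAsFactors = FALSE),
  data.frame(value = c(10, 20), year = c("2000", "2001"), region = c("X", "X"),
             stringsAsFactors = FALSE)
)

# Rename 'value' to the indicator code and give 'year' one common class
tidy_one <- function(df, code) {
  names(df)[names(df) == "value"] <- code
  df$year <- as.numeric(as.character(df$year))
  df
}
tables <- Map(tidy_one, tables, codes)

# Full outer join on the columns the tables share (region and year)
combined <- Reduce(function(a, b) merge(a, b, all = TRUE), tables)
combined
```

Map walks over the tables and their codes in parallel, which replaces the index-based for loop from the question.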
You could nest a couple of functions like:
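library(dplyr)  # provides %>% and select
f.bind <- function(x){
  # inner helper: keep only the columns shared by every data frame
  f.get <- function(df){
    df %>%
      dplyr::select(region, year, value)
  }
  # apply the helper to each element of the list x, then stack the results
  x <- lapply(x, f.get)
  do.call(rbind, x)
}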
The inner function is just wrapping a small dplyr select function and the outer function is applying the inner and binding all of the results.
So, I have been trying to read a text file (each line is a chat log) into R, turn it into a data frame and tidy the data further.
I am using readLines so that each log is read as a single line. Because readLines reads each log as a single long character value, I then convert them to strings (I need to parse the log), as per below
rawchat <- readLines("disc-W-App-avec-loy.txt")
rawchat <- c(lapply(rawchat, toString))
My problem comes when I want to turn this list into data frame:
rawchat <- as.data.frame(rawchat)
It turns the list into a data frame of 1 observation of 42,000 variables. The intention was to turn it into 42,000 observations of one variable.
Any help please?
By the way, I am pretty new in tidying raw data in R.
So, I encountered another block:
I loaded a text file as data frame as per below.
rawchat <- readLines("disc-W-App-avec-loy.txt")
rawchat <- as.data.frame(rawchat, stringsAsFactors=FALSE)
names(rawchat) <- "chat"
I am currently trying to identify any row (42000) that starts with the number 16. I can't seem to apply correctly the startsWith() function or the dplyr starts_with(), even grepl with regular expressions.
Could it be the format of the observations of the data frame (chr)?
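The problem is your rawchat <- c(lapply(rawchat, toString)). lapply returns a list, and as.data.frame turns each list element into its own column, hence 1 observation of 42,000 variables.
Just use
rawchat <- readLines("disc-W-App-avec-loy.txt")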
rawchat <- as.data.frame(rawchat, stringsAsFactors=FALSE)
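A small sketch of the difference, and of the follow-up about rows starting with 16. The three sample lines are invented and stand in for the output of readLines. Note that dplyr's starts_with() is a column-selection helper, so it does not filter rows; base startsWith() or grepl() on the chat column does.

```r
# Stand-in for readLines(...): a character vector, one element per log line
# (sample lines are made up)
rawchat <- c("16/03 hello", "17/03 hi there", "16/04 bye")

# A list becomes one column per element
as_list <- as.data.frame(lapply(rawchat, toString), stringsAsFactors = FALSE)
dim(as_list)   # 1 row, one column per line

# A character vector becomes one row per element
chat <- data.frame(chat = rawchat, stringsAsFactors = FALSE)
dim(chat)      # one row per line, 1 column

# Rows whose text starts with "16", two equivalent ways
starts16 <- chat[startsWith(chat$chat, "16"), , drop = FALSE]
same     <- chat[grepl("^16", chat$chat), , drop = FALSE]
```

The chr format of the column is not the obstacle; both functions expect a character vector, so they have to be given chat$chat rather than the whole data frame.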
I cannot for the life of me figure out where the simple error is in my for loop to perform the same analyses over multiple data frames and output each iteration's new data frame utilizing the variable used along with extra string to identify the new data frame.
Here is my code:
john and jane are 2 data frames among many I am hoping to loop over and compare to bcm to find duplicate results in rows.
x <- list(john,jane)
for (i in x) {
test <- rbind(bcm,i)
test$dups <- duplicated(test$Full.Name,fromLast=T)
test$dups2 <- duplicated(test$Full.Name)
test <- test[which(test$dups==T | test$dups2==T),]
newname <- paste("dupl",i,sep=".")
assign(newname, test)
}
Thus far, I can either get the naming to work correctly without including the x data or the loop to complete correctly without naming the new data frames correctly.
Intended Result: I am hoping to create new data frames dupl.john and dupl.jane to show which rows are duplicated in comparison to bcm.
I understand that lapply() might be better to use and am very open to that form of solution. I could not figure out how to use it to solve my problem, so I turned to the more familiar for loop.
EDIT:
Sorry if I'm not being more clear. I have about 13 data frames in total that I want to run the same analysis over to find the duplicate rows in $Full.Name. I could do the first 4 lines of my loop and then dupl.john <- test 13 times (for each data frame), but I am purposely trying to write a for loop or lapply() to gain more knowledge in R and because I'm sure it is more efficient.
If I understand your intended result correctly, match_df from plyr could be an option.
library(plyr)
dupl.john <- match_df(john, bcm)
dupl.jane <- match_df(jane, bcm)
dupl.john and dupl.jane will both be data frames, containing the rows of john and jane respectively that also appear in bcm. Is this what you are trying to achieve?
EDITED after the first comment
library(plyr)
l <- list(john, jane)
res <- lapply(l, function(x) {match_df(x, bcm, on = "Full.Name")} )
# double brackets extract the data frame itself rather than a one-element list
dupl.john <- res[[1]]
dupl.jane <- res[[2]]
Now, res will have a list of the data frames with the matches, based on the column "Full.Name".
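The naming trouble in the original loop comes from `i` being the data frame itself rather than its name, so paste("dupl", i, sep=".") pastes data instead of a label. One possible workaround, sketched here in base R with invented example data, is to loop over a named list and build the result names from names(). This uses %in% in place of match_df and keeps only the rows of each data frame that match bcm, which is an assumption about what you want.

```r
# Invented example data; only the Full.Name column matters here
bcm  <- data.frame(Full.Name = c("Ann Lee", "Bob Ray", "Cy Fox"), stringsAsFactors = FALSE)
john <- data.frame(Full.Name = c("Bob Ray", "Dee Poe"), stringsAsFactors = FALSE)
jane <- data.frame(Full.Name = c("Cy Fox", "Ann Lee", "Eve Orr"), stringsAsFactors = FALSE)

# A named list keeps each data frame's name available
frames <- list(john = john, jane = jane)

# Keep the rows whose Full.Name also occurs in bcm
dupl <- lapply(frames, function(df) df[df$Full.Name %in% bcm$Full.Name, , drop = FALSE])
names(dupl) <- paste("dupl", names(frames), sep = ".")

dupl$dupl.john
```

Keeping the results in one named list avoids assign(); with 13 data frames you only extend `frames`.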
I would like to create a data.table in tidy form containing the columns articleID, period and demand (with articleID and period as key). The demand is subject to a random function with input data from another data.frame (params). It is created at runtime for differing numbers of periods.
It is easy to do this in "non-tidy" form:
#example data
params <- data.frame(shape=runif(10), rate=runif(10)*2)
rownames(params) <- letters[1:10]
periods <- 10
# create non-tidy data with one column for each period
df <- replicate(nrow(params),
rgamma(periods,shape=params[,"shape"], rate=params[,"rate"]))
rownames(df) <- rownames(params)
Is there a "tidy" way to do this creation? I would need to replicate the rgamma(), but I am not sure how to make it use the parameters of the corresponding article. I tried starting with a Cross Join from data.table:
dt <- CJ(articleID=rownames(params), per=1:periods, demand=0)
but I don't know how to pass the rgamma to the dt[,demand] directly and correctly at creation, nor how to change the values now without using some ugly for loop. I also considered using gather() from the tidyr package, but as far as I can see, I would need to use a for loop there as well.
It does not really matter to me whether I use data.frame or data.table for my current use case. Solutions for any (or both!) would be highly appreciated.
This'll do (note that it assumes that params is sorted by row names, if not you can convert it to a data.table and merge the two):
CJ(articleID=rownames(params), per=1:periods)[,
demand := rgamma(.N, shape=params[,"shape"], rate=params[,"rate"]), by = per]
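Since a plain data.frame is also acceptable, here is a base R sketch of the same idea. It builds the grid with expand.grid and looks up each row's parameters by row name, so it does not rely on params being sorted. set.seed is only there to make the sketch reproducible, and the name `tidy` is made up.

```r
set.seed(1)  # only so the sketch is reproducible
params <- data.frame(shape = runif(10), rate = runif(10) * 2)
rownames(params) <- letters[1:10]
periods <- 10

# One row per (articleID, period) pair; keep articleID as character so it can
# be used to index params by row name
tidy <- expand.grid(per = seq_len(periods), articleID = rownames(params),
                    stringsAsFactors = FALSE)

# rgamma is vectorised over shape and rate, so each draw uses its own article's parameters
tidy$demand <- rgamma(nrow(tidy),
                      shape = params[tidy$articleID, "shape"],
                      rate  = params[tidy$articleID, "rate"])

tidy <- tidy[, c("articleID", "per", "demand")]
head(tidy)
```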
this is probably very easy but I could use some help... I've built an agent-based model that runs 300 times for an arbitrary number of steps, depending on what happens inside the model. the resulting data is structured like this:
I want to isolate the runs that end in 150 steps or less and analyze them in their entirety (not just the final step). what's the best way to do this in R? Thanks!
First, use aggregate to get the number of steps per run:
n_steps <- aggregate(AT6$run, by=list(run=AT6$run), FUN=length)
Now compute a filter variable:
n_steps <- within(n_steps, filter <- x<=150)
Merge with the original data and keep only filtered runs:
AT6_f <- merge(AT6, n_steps)
AT6_f <- AT6_f[AT6_f$filter,]
And finally split the data.frame by runs:
result <- split(AT6_f, AT6_f$run)
Note: the result is a list, each element being a data.frame containing a single run. If you have a function f that analyses each run, you should pass the result above to it with this:
lapply(result, f)
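A more compact variant of the same filter can be sketched with ave(), which returns the per-run row count aligned with the original rows, so no merge is needed. The toy AT6 below is invented, with columns run and step assumed, and the cutoff is lowered from 150 to 3 to keep the example short.

```r
# Invented stand-in for AT6: run 1 has 3 steps, run 2 has 5, run 3 has 2
AT6 <- data.frame(run  = rep(1:3, times = c(3, 5, 2)),
                  step = c(1:3, 1:5, 1:2))
max_steps <- 3   # the question uses 150

# For every row, the number of rows belonging to that row's run
steps_in_run <- ave(seq_along(AT6$run), AT6$run, FUN = length)

# Keep whole runs that finished within the cutoff, then split them by run
short_runs <- AT6[steps_in_run <= max_steps, ]
result     <- split(short_runs, short_runs$run)
names(result)
```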