I have multiple data files that I'm interested in cleaning up and then obtaining means from, to run a repeated measures ANOVA.
Here's example data; in the real data there are 4500 rows and another column called Actresponse, which sometimes contains a 9 that I trim around: https://docs.google.com/file/d/0B20HmmYd0lsFVGhTQ0EzRFFmYXc/edit?pli=1
I have just discovered plyr and how awesome it is for manipulating data, but the way I'm using it right now looks rather stupid to me. I have 4 different things I'm interested in that I want to read into a data frame. I've read them into 4 separate data frames to start; I'm wondering if there is a way I can combine this and read all the means into one data frame (a row for each Reqresponse of each file) with fewer lines of code. Basically, can I achieve what I've done here without rewriting a lot of the same code 4 times?
PMScoreframe <- lapply(list.files(pattern='^[2-3].txt'), function(ff){
  data <- read.table(ff, header=T, quote="\"")
  data <- data[-c(seq(from = 1, to = 4001, by=500), seq(from = 2, to = 4002, by=500)), ]
  ddply(data[data$Reqresponse==9,], .(Condition,Reqresponse), summarise, Score=mean(Score))
})
PMRTframe <- lapply(list.files(pattern='^[2-3].txt'), function(ff){
  data <- read.table(ff, header=T, quote="\"")
  data <- data[data$RT>200,]
  data <- ddply(data, .(Condition), function(x) x[!abs(scale(x$RT)) > 3,])
  ddply(data[data$Reqresponse==9,], .(Condition,Reqresponse,Score), summarise, RT=mean(RT))
})
OtherScoreframe <- lapply(list.files(pattern='^[2-3].txt'), function(ff){
  data <- read.table(ff, header=T, quote="\"")
  data <- data[-c(seq(from = 1, to = 4001, by=500), seq(from = 2, to = 4002, by=500)), ]
  select <- rep(TRUE, nrow(data))
  index <- which(data$Reqresponse==9 | data$Actresponse==9 | data$controlrepeatedcue==1)
  select[unique(c(index, index+1, index+2))] <- FALSE
  data <- data[select,]
  ddply(data[data$Reqresponse=="a" | data$Reqresponse=="b",], .(Condition,Reqresponse), summarise, Score=mean(Score))
})
OtherRTframe <- lapply(list.files(pattern='^[2-3].txt'), function(ff){
  data <- read.table(ff, header=T, quote="\"")
  data <- data[-c(seq(from = 1, to = 4001, by=500), seq(from = 2, to = 4002, by=500)), ]
  select <- rep(TRUE, nrow(data))
  index <- which(data$Reqresponse==9 | data$Actresponse==9 | data$controlrepeatedcue==1)
  select[unique(c(index, index+1, index+2))] <- FALSE
  data <- data[select,]
  data <- data[data$RT>200,]
  data <- ddply(data, .(Condition), function(x) x[!abs(scale(x$RT)) > 3,])
  ddply(data[data$Reqresponse=="a" | data$Reqresponse=="b",], .(Condition,Reqresponse,Score), summarise, RT=mean(RT))
})
I think this deals with what you're trying to do. Basically, I think you need to read all the data in once, then deal with that data frame. There are several questions dealing with how to read it all in; here is how I would do it so that I maintain a record of which file each row in the data frame comes from, which can also be used for grouping:
library(plyr)  # for mdply()/ddply()

filenames <- list.files(".", pattern = "^[2-3].txt")
import <- mdply(filenames, read.table, header = T, quote = "\"")
# mdply keeps the input as a column X1, identifying which file each row came from
import$file <- filenames[import$X1]
Now import is a big data frame with all your files in it (I'm assuming your pattern matching etc. for reading in the files is correct). You can then do summaries based on whatever criteria you like.
I'm not sure what you're trying to achieve in line 3 of your code above, but for the ddply below that, you just need to do:
ddply(import[import$Reqresponse==9,],.(Condition,Reqresponse,file),summarise,Score=mean(Score))
There's so much going on in the rest of your code that it's hard to make out exactly what you want.
I think the important thing, to make this efficient and easier to follow, is to read your data in once and then work on that dataset - making subsets if necessary, doing summary stats, or whatever else you need.
As an example of how you can work with this, here's an attempt at your problem of removing trials (rows?) that have Reqresponse == 9, along with the following two. There are probably ways of doing this more efficiently, but this is loosely based on how you were doing it, to show you briefly how to work with the larger data frame. Now modified to remove the first two trials of each file:
import.clean <- ddply(import, .(file), function(x) {
  index <- which(x$Reqresponse == 9)
  if (length(index) > 0) {
    index <- unique(c(index, index + 1, index + 2, 1, 2))
  } else {
    index <- c(1, 2)
  }
  x <- x[-index, ]
  return(x)
})
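To give a flavour of how the rest of your summaries can then come from that one combined data frame, here is a rough, untested sketch of the "Other" Score and RT means per file; it assumes the column names Condition, Reqresponse, Score and RT from your files, plus the file column created above:
score.means <- ddply(import.clean[import.clean$Reqresponse %in% c("a", "b"), ],
                     .(file, Condition, Reqresponse),
                     summarise, Score = mean(Score))

rt.clean <- import.clean[import.clean$RT > 200, ]
rt.clean <- ddply(rt.clean, .(file, Condition), function(x) x[!abs(scale(x$RT)) > 3, ])
rt.means <- ddply(rt.clean[rt.clean$Reqresponse %in% c("a", "b"), ],
                  .(file, Condition, Reqresponse, Score),
                  summarise, RT = mean(RT))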
Is there a way to simplify this code using a loop?
VariableList <- c(v0,v1,v2, ... etc)
National_DF <- df[,VariableList]
AL_DF <- AL[,VariableList]
AR_DF <- AR[,VariableList]
AZ_DF <- AZ[,VariableList]
... etc
I want the end result to have each as a data frame, since it will be used later in the model. Each state, such as 'AL', 'AR', 'AZ', etc., is a data frame. The v{#} represents an out-of-place variable from the RAW data frame. This is meant to restructure the fields, while eliminating some fields, in preparation for model use.
Continuing the answer from your previous question, we can arrange the data in the same lapply call before creating the data frames.
VariableList <- c('v0', 'v1', 'v2')
data <- unlist(lapply(mget(ls(pattern = '_DF$')), function(df) {
  index <- sample(1:nrow(df), 0.7 * nrow(df))
  df <- df[, VariableList]
  list(train = df[index, ], test = df[-index, ])
}), recursive = FALSE)
Then get the data into the global environment:
list2env(data, .GlobalEnv)
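If this runs as intended, each *_DF object found by the pattern should now have a matching train and test data frame in the global environment, which you can use directly later on, for example (assuming AL_DF existed beforehand):
head(AL_DF.train)
head(AL_DF.test)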
I am trying to practice programming functions in R. I have made a function that allows me to determine the best pokemon for each attribute (e.g. attack, speed, defense, etc.) per given type of pokemon (e.g. water, psychic, etc.). So far, I am only able to do this for one pokemon generation (reflected in one Excel file). I want to do the same for all 6 generations (stored in 6 Excel files). I have been working on the code for some time... Maybe somebody here can give me some input? Here is my current code in R for the said function:
bestpoke <- function(Type1, attri){
  data <- read.csv("gen01.csv", colClasses = "character", header = TRUE)
  dx <- as.data.frame(cbind(data[, 2],   # Name
                            data[, 3],   # Type1
                            data[, 6],   # HP
                            data[, 7],   # Attack
                            data[, 8],   # Defense
                            data[, 9],   # SpecialAtk
                            data[, 10],  # SpecialDef
                            data[, 11]), # Speed
                      stringsAsFactors = FALSE)
  colnames(dx) <- c("Name", "Type1", "HP", "Attack", "Defense", "SpecialAtk", "SpecialDef", "Speed")
  ## Check that name and attributes are valid
  if(!Type1 %in% dx[, "Type1"]){
    stop('invalid Type')
  } else if(!attri %in% c("HP", "Attack", "Defense", "SpecialAtk", "SpecialDef", "Speed")){
    stop('invalid attribute')
  } else {
    da <- which(dx[, "Type1"] == Type1)
    db <- dx[da, ] # extracting data for the called state
    dc <- as.numeric(db[, eval(attri)])
    max_val <- max(dc, na.rm = TRUE)
    result <- db[, "Name"][which(dc == max_val)]
    output <- result[order(result)]
  }
  return(output)
}
First of all, I believe that you can simplify your code a little, especially the first part, where you define the objects data and dx.
data <- read.csv("gen01.csv", stringsAsFactors = F)
You can use F or T as shorthand for FALSE or TRUE. The read.csv() function has a stringsAsFactors parameter, so with it set to FALSE the character columns will be read as characters and the numeric columns will remain numeric. The header parameter is TRUE by default. Also, the output of read.csv() is already a data.frame, so you don't have to use as.data.frame() at that point. For more information run "?read.csv" in your R console. Then, you can use the following to subset the columns that you want:
dx <- data[, c(2, 3, 6:11)]
colnames(dx) <- c("Name", "Type1", "HP", "Attack", "Defense", "SpecialAtk","SpecialDef", "Speed")
#This last line is just fine if you want to change the names of the columns.
#You can use the function names() instead, but I think that they work the same way.
Now, referring to the other generations, I think that you can practice with some if/else conditionals. First, install the readxl package (or the xlsx package) so you don't necessarily have to export your Excel files to CSV format. You can try adding a new parameter "gen" or "generation" that requires a numeric input from 1 to 6, for example, and do something like this:
bestpoke <- function(Type1, attri, gen = 1){ # First generation will be the default
  require(readxl)
  if(gen == 1) {
    data <- read.csv("gen01.csv", stringsAsFactors = F)
  } else if(gen == 2) {
    # read_xls() has no stringsAsFactors argument; it returns a tibble,
    # so convert it to a plain data frame
    data <- as.data.frame(read_xls("gen02.xls"))
  } else if(gen == 3) { } # And so on...
  dx <- data[, c(2, 3, 6:11)]
  colnames(dx) <- c("Name", "Type1", "HP", "Attack", "Defense", "SpecialAtk","SpecialDef", "Speed")
  ## From this point your function is the same
This is assuming that all the files have the same structure (number and order and type of columns).
You may want to try some other things, but since you are learning, it's better that you search for ways to accomplish them yourself. This could be:
Reading and merging two or more generations (if the question is which is the best pokemon across two or more generations). In this case, the conditionals have to change (see the sketch below).
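A rough, untested sketch of that merge step, assuming gen01.csv and gen02.xls have the same column layout as above:
library(readxl)
cols <- c("Name", "Type1", "HP", "Attack", "Defense", "SpecialAtk", "SpecialDef", "Speed")
gen1 <- setNames(read.csv("gen01.csv", stringsAsFactors = F)[, c(2, 3, 6:11)], cols)
gen2 <- setNames(as.data.frame(read_xls("gen02.xls"))[, c(2, 3, 6:11)], cols)
dx <- rbind(gen1, gen2) # then continue with the validity checks as in the function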
Good luck!
I currently have the following problem. I work with Web of Science scientific publication and citation data, which has the following structure: a variable "SR" is a string with the name of a publication, and "CR" is a variable with a string containing all references cited in the article, separated by ";".
My task now is to create an edge list linking each publication to its corresponding citations, where every publication-citation combination is in its own row. I currently do it with the following code:
# Some minimal data for example
pub <- c("pub1", "pub2", "pub3")
cit <- c("cit1;cit2;cit3;cit4","cit1;cit4;cit5","cit5;cit1")
M <- cbind(pub,cit)
colnames(M) <- c("SR","CR")
# Create an edgelist
cit_el <- data.frame()
for (i in seq(1, nrow(M), 1)) {
  cit <- data.frame(strsplit(as.character(M[i, "CR"]), ";", fixed = TRUE), stringsAsFactors = FALSE)
  colnames(cit)[1] <- c("SR")
  cit$SR_source <- M[i, "SR"]
  cit <- unique(cit)
  cit_el <- rbind(cit_el, cit)
}
However, for large datasets of 10k+ publications (which tend to have 50+ citations each), the script runs for 15+ minutes. I know that loops are usually an inefficient way of coding in R, yet I haven't found an alternative that produces what I want.
Does anyone know a trick to make this faster?
This is my attempt. I haven't compared the speeds of different approaches yet.
First comes the artificial data: 10k pubs, 100k possible citations, and at most 80 citations per pub.
library(data.table)
library(stringr)
pubCount = 10000
citCount = 100000
maxCitPerPub = 80
pubList <- paste0("pub", seq(pubCount))
citList <- paste0("cit", seq(citCount))
cit <- sapply(sample(seq(maxCitPerPub), pubCount, replace = TRUE),
              function(x) str_c(sample(citList, x), collapse = ";"))
data <- data.table(pub = pubList,
                   cit = cit)
For processing, I use stringr::str_split_fixed to split the citations into columns and use data.table::melt to collapse the columns.
temp <- data.table(pub = pubList, str_split_fixed(data$cit, ";", maxCitPerPub))
result <- melt(temp, id.vars = "pub")[, variable:= NULL][value!='']
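If you want the citation column to be called cit rather than value, you can rename it afterwards:
setnames(result, "value", "cit")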
Not sure if this is any quicker, but if I'm understanding correctly this should give the desired result:
rbindlist(lapply(1:nrow(M), function(i){
  data.frame(SR_source = M[i, 'SR'], SR = strsplit(M[i, 'CR'], ';')[[1]])
}))
I want to create variable names on the fly inside a list and assign them values in R, but I am unable to get the desired result. Here is the logic of my code:
Upon the function call dat_in <- readf(1,2), an input file is read based on a product and a site. After reading, a particular column (the 13th, here) is assigned to a variable aot500. I want this variable to be returned from the function for each combination of product and site. For example, I need variables named aot500.AF, aot500.CM, aot500.RB in the list statement to be returned from this function. I am having trouble with the return statement. There is no error, but there is nothing in dat_in. I expect it to have dat_in$aot500.AF etc. Please tell me what is wrong in the return statement. Furthermore, I want to read the files for all combinations in a single call to the function, say using a for loop, and I wonder how the return statement would handle a list of more variables.
prod <- c('inv','tot')
site <- c('AF','CM','RB')
readf <- function(pp, kk) {
  fname.dsa <- paste("../data/site_data_", prod[pp], "/daily_", site[kk], ".dat", sep="")
  inp.aod <- read.csv(fname.dsa, skip=4, sep=",", stringsAsFactors=F, na.strings="N/A")
  aot500 <- inp.aod[,13]
  return(list(assign(paste("aot500", siteabbr[kk], sep="."), aot500)))
}
There is almost never a need to use assign(); we can solve the problem in two steps: read the files into a list, then give it names.
(Not tested as we don't have your files)
prod <- c('inv', 'tot')
site <- c('AF', 'CM', 'RB')
# get combo of site and prod
prod_site <- expand.grid(prod, site)
colnames(prod_site) <- c("prod", "site")
# Step 1: read the files into a list
res <- lapply(1:nrow(prod_site), function(i){
fname.dsa <- paste0("../data/site_data_",
prod_site[i, "prod"],
"/daily_",
prod_site[i, "site"],
".dat")
inp.aod <- read.csv(fname.dsa,
skip = 4,
stringsAsFactors = FALSE,
na.strings = "N/A")
inp.aod[, 13]
})
# Step 2: assign names to a list
names(res) <- paste("aot500", prod_site$prod, prod_site$site, sep = ".")
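The individual vectors can then be pulled out of res by name, or, if you really want separate objects in your workspace (as in your original attempt), pushed into the global environment:
res[["aot500.inv.AF"]]               # one product/site combination
list2env(res, envir = globalenv())   # optional: creates aot500.inv.AF etc. as objects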
I propose two answers, one based on dplyr and one based on base R.
You'll probably have to adapt the filename in the readAOT_500 function to your particular case.
Base R answer
#' Function that reads AOT_500 from the given product and site file
#' @param prodsite character vector containing 2 elements:
#'   name of a product and name of a site
readAOT_500 <- function(prodsite,
selectedcolumn = c("AOT_500"),
path = tempdir()){
cat(path, prodsite)
filename <- paste0(path, prodsite[1],
prodsite[2], ".csv")
dtf <- read.csv(filename, stringsAsFactors = FALSE)
dtf <- dtf[selectedcolumn]
dtf$prod <- prodsite[1]
dtf$site <- prodsite[2]
return(dtf)
}
# Load one file for example
readAOT_500(c("inv", "AF"))
listofsites <- list(c("inv","AF"),
c("tot","AF"),
c("inv", "CM"),
c( "tot", "CM"),
c("inv", "RB"),
c("tot", "RB"))
# Load all files in a list of data frames
prodsitedata <- lapply(listofsites, readAOT_500)
# Combine all data frames together
prodsitedata <- Reduce(rbind,prodsitedata)
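Reduce() works here, but do.call() passes all the data frames to rbind() in a single call, which is usually a bit faster when there are many of them:
prodsitedata <- do.call(rbind, prodsitedata)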
dplyr answer
I use Hadley Wickham's packages to clean data.
library(dplyr)
library(tidyr)
daily_CM <- read.csv("~/downloads/daily_CM.dat",skip=4,sep=",",stringsAsFactors=F,na.strings="N/A")
# Generate all combinations of product and site.
prodsite <- expand.grid(prod = c('inv','tot'),
site = c('AF','CM','RB')) %>%
# Group variables to use do() later on
group_by(prod, site)
Create 6 fake files by sampling from the data you provided. You can skip this section when you have real data. I used various sample lengths so that the number of observations differs for each site.
prodsite$samplelength <- sample(1:495,nrow(prodsite))
prodsite %>%
do(stuff = write.csv(sample_n(daily_CM,.$samplelength),
paste0(tempdir(),.$prod,.$site,".csv")))
Read many files using dplyr::do()
prodsitedata <- prodsite %>%
do(read.csv(paste0(tempdir(),.$prod,.$site,".csv"),
stringsAsFactors = FALSE))
# Select only the columns you are interested in
prodsitedata2 <- prodsitedata %>%
select(prod, site, AOT_500)
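If you still want separate aot500 vectors per product/site combination, as in your original question, you can split the combined data back out, for example:
aot500.list <- split(prodsitedata2$AOT_500,
                     paste("aot500", prodsitedata2$prod, prodsitedata2$site, sep = "."))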
I am a long-time Stata user but am trying to familiarize myself with the syntax and logic of R. I am wondering if you could help me write more efficient code, as shown below (the "Not-so-efficient Codes").
The goal is to (A) read several files (each of which represents the data of a year), (B) create the same variables for each file, and (C) combine the files into a single one for statistical analysis. I have finished revising part A, but am struggling with the rest, particularly part B. Could you give me some ideas as to how to proceed, e.g. use unlist on data.l first, or lapply over each element of data.l? I appreciate your comments - thanks.
More Efficient Codes: Part A
# Create an empty list
data.l = list()
# Create a list of file names
fileList = list.files(path = "C:/My Data", pattern = ".dat")
# Read the ".dat" files into a single list
data.l = sapply(fileList, readLines)
The Not-so-efficient Codes: Part A, B and C
setwd("C:/My Data")
# Part A: Read the data. Each "dat" file is text file and each line in the file has 300 characters.
dx2004 <- readLines("2004.INJVERBT.dat")
dx2005 <- readLines("2005.INJVERBT.dat")
dx2006 <- readLines("2006.INJVERBT.dat")
# Part B-1: Create variables for each year of data
dt2004 <- data.frame(hhx = substr(dx2004,7,12), fmx = substr(dx2004,13,14),
                     iphow = substr(dx2004,19,318), stringsAsFactors = FALSE)
dt2005 <- data.frame(hhx = substr(dx2005,7,12), fmx = substr(dx2005,13,14),
                     iphow = substr(dx2005,19,318), stringsAsFactors = FALSE)
dt2006 <- data.frame(hhx = substr(dx2006,7,12), fmx = substr(dx2006,13,14),
                     iphow = substr(dx2006,19,318), stringsAsFactors = FALSE)
# Part B-2: Create the "iid" variable for each year of data
dt2004$iid<-paste0("2004",dt2004$hhx, dt2004$fmx, dt2004$fpx, dt2004$ipepno)
dt2005$iid<-paste0("2005",dt2005$hhx, dt2005$fmx, dt2005$fpx, dt2005$ipepno)
dt2006$iid<-paste0("2006",dt2006$hhx, dt2006$fmx, dt2006$fpx, dt2006$ipepno)
# Part C: Combine the three years of data into a single one
data = rbind(dt2004,dt2005, dt2006)
You are almost there. It's a combination of lapply and do.call/rbind to work with lapply's list output.
Consider this example:
test1 = "Thisistextinputnumber1"
test2 = "Thisistextinputnumber2"
test3 = "Thisistextinputnumber3"
data.l = list(test1, test2, test3)
makeDF <- function(inputText){
  DF <- data.frame(hhx = substr(inputText, 7, 12),
                   fmx = substr(inputText, 13, 14),
                   iphow = substr(inputText, 19, 318),
                   stringsAsFactors = FALSE)
  DF <- within(DF, iid <- paste(hhx, fmx, iphow))
  return(DF)
}
do.call(rbind, (lapply(data.l, makeDF)))
Here test1, test2, test3 represent your dx200X, and data.l should be the list format you get from the efficient version of Part A.
In makeDF you create your desired data.frame. The do.call(rbind, ...) idiom is fairly standard when working with lapply return values.
You might also want to consider checking out the data.table package, whose rbindlist() function replaces any do.call/rbind construction (and is much faster), alongside other great utilities for large data sets.
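As a minimal sketch of that replacement, assuming data.l and makeDF from the example above:
library(data.table)
result <- rbindlist(lapply(data.l, makeDF))  # same rows as do.call(rbind, ...), returned as a data.table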