Extracting outputs from lapply to a dataframe - r

I have some R code which performs some data extraction operation on all files in the current directory, using the following code:
files <- list.files(".", pattern="*.tts")
results <- lapply(files, data_for_time, "17/06/2006 12:00:00")
The output from lapply is the following (extracted using dput()) - basically a list full of vectors:
list(c("amer", "14.5"), c("appl", "14.2"), c("brec", "13.1"),
c("camb", "13.5"), c("camo", "30.1"), c("cari", "13.8"),
c("chio", "21.1"), c("dung", "9.4"), c("east", "11.8"), c("exmo",
"12.1"), c("farb", "14.7"), c("hard", "15.6"), c("herm",
"24.3"), c("hero", "13.3"), c("hert", "11.8"), c("hung",
"26"), c("lizr", "14"), c("maid", "30.4"), c("mart", "8.8"
), c("newb", "14.7"), c("newl", "14.3"), c("oxfr", "13.9"
), c("padt", "10.3"), c("pbil", "13.6"), c("pmtg", "11.1"
), c("pmth", "11.7"), c("pool", "14.6"), c("prae", "11.9"
), c("ral2", "12.2"), c("sano", "15.3"), c("scil", "36.2"
), c("sham", "12.9"), c("stra", "30.9"), c("stro", "14.7"
), c("taut", "13.7"), c("tedd", "22.3"), c("wari", "12.7"
), c("weiw", "13.6"), c("weyb", "8.4"))
However, I would like to then deal with this output as a dataframe with two columns: one for the alphabetic code ("amer", "appl" etc) and one for the number (14.5, 14.2 etc).
Unfortunately, as.data.frame doesn't seem to work with this input of nested vectors inside a list. How should I go about converting this? Do I need to change the way that my function data_for_time returns its values? At the moment it just returns c(name, value). Or is there a nice way to convert from this sort of output to a dataframe?

Try this if results were your list:
> as.data.frame(do.call(rbind, results))
V1 V2
1 amer 14.5
2 appl 14.2
3 brec 13.1
4 camb 13.5
...

One option might be to use the ldply function from the plyr package, which will stitch things back into a data frame for you.
A trivial example of its use:
ldply(1:10,.fun = function(x){c(runif(1),"a")})
V1 V2
1 0.406373084755614 a
2 0.456838687881827 a
3 0.681300171650946 a
4 0.294320539338514 a
5 0.811559669673443 a
6 0.340881009353325 a
7 0.134072444401681 a
8 0.00850683846510947 a
9 0.326008745934814 a
10 0.90791508089751 a
But note that if you're mixing variable types with c(), you probably will want to alter your function to return simply data.frame(name= name,value = value) instead of c(name,value). Otherwise everything will be coerced to character (as it is in my example above).
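The advice above about returning a data frame instead of c(name, value) can be sketched in base R, without plyr. This is only an illustration: `data_for_time_df` is a hypothetical stand-in for the question's data_for_time function, and the three stations and values are copied from the question's sample output.

```r
# Hypothetical stand-in for data_for_time(): it returns a one-row data frame,
# so the name stays character and the value stays numeric.
data_for_time_df <- function(name, value) {
  data.frame(name = name, value = value, stringsAsFactors = FALSE)
}

stations <- c("amer", "appl", "brec")
values   <- c(14.5, 14.2, 13.1)

# Map() calls the function once per (name, value) pair and returns a list
rows <- Map(data_for_time_df, stations, values)

# Stack the one-row data frames into a single data frame
result <- do.call(rbind, rows)
rownames(result) <- NULL

str(result)
```

Because each piece is already a data frame, no coercion to character happens and no as.numeric() call is needed afterwards.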

inp <- list(c("amer", "14.5"), c("appl", "14.2"), .... # did not see need to copy all
data.frame( first= sapply( inp, "[", 1),
second =as.numeric( sapply( inp, "[", 2) ) )
first second
1 amer 14.5
2 appl 14.2
3 brec 13.1
4 camb 13.5
5 camo 30.1
6 cari 13.8
snipped output

Because and forNelton took the response I was in the process of giving and Joran took the only other reasonable response I could think of and since I'm supposed to be writing a paper here's a ridiculous answer:
#I named your list LIST
LIST2 <- LIST[[1]]
lapply(2:length(LIST), function(i) {LIST2 <<- rbind(LIST2, LIST[[i]])})
data.frame(LIST2)

Related

R: read_csv reads numeric entries as logical - parsing col_logical instead of col_double

I am new to R.
I wrote code for an assignment which reads several csv files, binds them into a data frame and then, according to the id, calculates the mean of either nitrate or sulfate.
Data sample:
Date sulfate nitrate ID
<date> <dbl> <dbl> <dbl>
1 2003-10-06 7.21 0.651 1
2 2003-10-12 5.99 0.428 1
3 2003-10-18 4.68 1.04 1
4 2003-10-24 3.47 0.363 1
5 2003-10-30 2.42 0.507 1
6 2003-11-11 1.43 0.474 1
...
To read the files and create a data.frame, I wrote this function:
pollutantmean <- function (pollutant, id = 1:332) {
#creating a data frame from several files
file_m <- list.files(path = "specdata", pattern = "*.csv", full.names = TRUE)
read_file_m <- lapply(file_m, read_csv)
df_1 <- bind_rows(read_file_m)
# delete NAs
df_clean <- df_1[complete.cases(df_1),]
#select rows according to id
df_asid_clean <- filter(df_clean, ID %in% id)
#count the mean of the column
mean_result <- mean(df_asid_clean[, pollutant])
mean_result
}
However, when the read_csv function is applied, certain entries in nitrate column are read as col_logical, although the whole class of the column remains numeric and the entries are numeric. It seems that the code "expects" to receive logical value, although the real value is not.
Throughout the reading I get this message:
<...>
Parsed with column specification:
cols(
Date = col_date(format = ""),
sulfate = col_double(),
nitrate = col_logical(),
ID = col_double()
)
Warning: 41 parsing failures.
row col expected actual file
2055 nitrate 1/0/T/F/TRUE/FALSE 0.383 'specdata/288.csv'
2067 nitrate 1/0/T/F/TRUE/FALSE 0.355 'specdata/288.csv'
2073 nitrate 1/0/T/F/TRUE/FALSE 0.469 'specdata/288.csv'
2085 nitrate 1/0/T/F/TRUE/FALSE 0.144 'specdata/288.csv'
2091 nitrate 1/0/T/F/TRUE/FALSE 0.0984 'specdata/288.csv'
.... ....... .................. ...... ..................
See problems(...) for more details.
I tried to change the column class after binding the rows by writing
df_1[,nitrate] <- as.numeric(as.character(df_1[, nitrate]))
but NAs are still introduced in the step which calculates the mean.
What is wrong here, and how could I solve it?
Would appreciate your help!
UPDATE: I tried to insert read_csv(col_types = list...), but I get an error saying the "files" argument is not defined. As I understand it, R evaluates the read_csv call first and lapply afterwards, and because no "file" is given at that point, it raises an error.
The problem with readr::read_csv() failure in parsing the column types can be overcome by passing a col_types= argument in lapply(). We do this as follows:
pollutantmean <- function (directory,pollutant,id=1:332){
require(readr)
require(dplyr)
file_m <- list.files(path = directory, pattern = "*.csv", full.names = TRUE)[id]
read_file_m <- lapply(file_m, read_csv,col_types=list(col_date(),col_double(),
col_double(),col_integer()))
# rest of code goes here. Since I am a Community Mentor in the
# JHU Data Science Specialization, I am not allowed to post
# a complete solution to the programming assignment
}
Note that I use the [ form of the extract operator to subset the list of file names with the id vector that is an argument to the function, which avoids reading a lot of data that isn't necessary. This eliminates the need for the filter() statement in the code posted in the question.
With some additional programming statements to complete the assignment, the code in my answer produces the correct results for the three examples posted with the assignment, as listed below.
> pollutantmean("specdata","sulfate",1:10)
[1] 4.064128
> pollutantmean("specdata", "nitrate", 70:72)
[1] 1.706047
> pollutantmean("specdata", "nitrate", 23)
[1] 1.280833
Alternately we could implement lapply() with an anonymous function that also uses read_csv() as follows:
read_file_m <- lapply(file_m, function(x) {read_csv(x,col_types=list(col_date(),col_double(),
col_double(),col_integer()))})
NOTE: while it is completely understandable that students who have been exposed to the tidyverse would like to use it for the programming assignment, the fact that dplyr isn't introduced until the next course in the sequence (and readr isn't covered at all) makes it much more difficult to use for assignments in R Programming, especially the first assignment, where dplyr non-standard evaluation gives people fits. An example of this situation is yet another Stackoverflow question on pollutantmean().
With the read_csv update you don't need lapply, you can simply pass along the file path directly to read_csv as you already have defined.
Regarding the column types, these can then be set manually in the col_types argument:
col_types=cols(Date=col_date(),sulfate=...)
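As a rough sketch of how a column specification can be built once and forwarded through lapply(), the following uses two small made-up files in a temporary directory as a stand-in for specdata. The directory name, file names, and values are assumptions for illustration only; the column names follow the question's data sample.

```r
library(readr)

# Two small made-up files in a temporary directory stand in for specdata/
demo_dir <- file.path(tempdir(), "specdata_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("Date,sulfate,nitrate,ID",
             "2003-10-06,7.21,NA,1",
             "2003-10-12,5.99,0.428,1"),
           file.path(demo_dir, "001.csv"))
writeLines(c("Date,sulfate,nitrate,ID",
             "2003-10-18,4.68,1.04,2"),
           file.path(demo_dir, "002.csv"))

# One explicit specification, so readr never has to guess the types
spec <- cols(
  Date    = col_date(),
  sulfate = col_double(),
  nitrate = col_double(),
  ID      = col_integer()
)

files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# lapply() forwards the extra col_types argument to every read_csv() call
tables <- lapply(files, read_csv, col_types = spec)
combined <- do.call(rbind, tables)

mean(combined$nitrate, na.rm = TRUE)
```

Extra named arguments given to lapply() after the function are passed on to that function for each element, so read_csv() receives the file path as its first argument and col_types alongside it.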

how to map a dataframe and vectors into function parameters with *pply functions

My problem concerns a specific operation with the *pply functions (apply, mapply, etc.; I'm not sure which one fits).
The dataframe is dlt:
dlt.1 dlt.2 dlt.3 dlt.4 dlt.5
1 3.244198 6.482869 9.711874 12.92918 16.13489
6 3.196401 6.391871 9.585553 12.77681 15.96547
19 3.182911 6.365424 9.547196 12.72799 15.90795
24 3.164079 6.328089 9.491971 12.65577 15.81984
and the vector is freq:
1 2 3 4 5
Now I intend to map dlt and freq to a function foo:
foo <- function( dlti, freqi){ dlti * freqi }
where I hope the ith column of dlt corresponds to the ith element of freq.
I tried apply and mapply, but both failed. Would anyone please show me what is correct way?
It is not clear from your question what you actually want because you don't show the desired result. Without that, there is ambiguity in your question.
library(tibble)
dlt <- tribble(
~dlt_1, ~dlt_2, ~dlt_3, ~dlt_4, ~dlt_5 ,
3.244198, 6.482869, 9.711874, 12.92918, 16.13489,
3.196401, 6.391871, 9.585553, 12.77681, 15.96547,
3.182911, 6.365424, 9.547196, 12.72799, 15.90795,
3.164079, 6.328089, 9.491971, 12.65577, 15.81984
)
freqi <- c(1,2,3,4,5)
foo <- function(dlti,freqi){dlti * freqi}
purrr::map2(dlt,freqi,foo)
$dlt_1
[1] 3.244198 3.196401 3.182911 3.164079
$dlt_2
[1] 12.96574 12.78374 12.73085 12.65618
$dlt_3
[1] 29.13562 28.75666 28.64159 28.47591
$dlt_4
[1] 51.71672 51.10724 50.91196 50.62308
$dlt_5
[1] 80.67445 79.82735 79.53975 79.09920
In base R, we can do this by replicating 'freq' to match the columns of 'dlt' and then multiplying elementwise with *
dlt*freq[col(dlt)]
# dlt.1 dlt.2 dlt.3 dlt.4 dlt.5
#1 3.244198 12.96574 29.13562 51.71672 80.67445
#6 3.196401 12.78374 28.75666 51.10724 79.82735
#19 3.182911 12.73085 28.64159 50.91196 79.53975
#24 3.164079 12.65618 28.47591 50.62308 79.09920
Or using Map in base R
dlt[] <- Map(`*`, dlt, freq)
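Since the question mentions that mapply did not work, here is a minimal base R sketch of the column-by-element pairing, using the first two rows and three columns of the question's data. sweep() is not used in the answers above; it is shown here only as an alternative that gives the same numbers.

```r
dlt <- data.frame(
  dlt.1 = c(3.244198, 3.196401),
  dlt.2 = c(6.482869, 6.391871),
  dlt.3 = c(9.711874, 9.585553)
)
freq <- c(1, 2, 3)
foo <- function(dlti, freqi) dlti * freqi

# A data frame is a list of columns, so mapply() pairs column i of dlt
# with element i of freq and simplifies the result to a numeric matrix
by_mapply <- mapply(foo, dlt, freq)

# sweep() with MARGIN = 2 multiplies each column by its matching freq value
by_sweep <- sweep(as.matrix(dlt), 2, freq, `*`)

all.equal(unname(by_mapply), unname(by_sweep))
```

Note that mapply() returns a matrix rather than a data frame; wrapping the result in as.data.frame() restores the original shape if that is needed.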

Change column in dataframe based on regex in R

I have a large dataframe with a column displaying different profiles:
PROFILE NTHREADS TIME
profAsuffix 1 3.12
profAanother 2 1.9
profAyetanother 3
...
profBsuffix 1 4.1
profBanother 1 3.9
...
I want to rename all profA* pattern combining them in one name (profA) and do the same with profB*. Until now, I do it as:
data$PROFILE <- as.factor(data$PROFILE)
levels(data$PROFILE)[levels(data$PROFILE)=="profAsuffix"] <- "profA"
levels(data$PROFILE)[levels(data$PROFILE)=="profAanother"] <- "profA"
levels(data$PROFILE)[levels(data$PROFILE)=="profAyetanother"] <- "profA"
And so on. But this time I have too many different suffixes, so I wonder if I can use grepl or a similar approach to do the same thing.
We can use sub
data$PROFILE <- sub("^([a-z]+[A-B]).*", "\\1", data$PROFILE)
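To see what the pattern does, here is a small sketch on a made-up vector of profile names in the style of the question. It assumes every name consists of lowercase letters followed by a single capital A or B; other prefixes would need a wider character class.

```r
profiles <- c("profAsuffix", "profAanother", "profBsuffix", "profByetanother")

# Capture the lowercase prefix plus the capital letter, drop everything after
collapsed <- sub("^([a-z]+[A-B]).*", "\\1", profiles)

collapsed
table(collapsed)
```

The parentheses form a capture group, and "\\1" in the replacement keeps only that group, so each suffix is discarded in one call instead of one levels() assignment per name.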

Function to iterate over list, merging results into one data frame

I've completed the first couple of R courses on DataCamp, and in order to build up my skills I've decided to use R to prep for fantasy football this season, so I have begun playing around with the nflscrapR package.
With the nflscrapR package, one can pull Game Information using the season_games() function which simply returns a data frame with the gameID, game date, the home and away team abbreviations.
Example:
games.2012 = season_games(2012)
head(games.2012)
GameID date home away season
1 2012090500 2012-09-05 NYG DAL 2012
2 2012090900 2012-09-09 CHI IND 2012
3 2012090908 2012-09-09 KC ATL 2012
4 2012090907 2012-09-09 CLE PHI 2012
5 2012090906 2012-09-09 NO WAS 2012
6 2012090905 2012-09-09 DET STL 2012
Initially I copied and pasted the original call, changed the last digit manually for each season, and then rbinded all the seasons into one data frame, games.
games.2012 <- season_games(2012)
games.2013 <- season_games(2013)
games.2014 <- season_games(2014)
games.2015 <- season_games(2015)
games = rbind(games.2012,games.2013,games.2014,games.2015)
I'd like to write a function to simplify this process.
My failed attempt:
gameID <- function(years) {
for (i in years) {
games[i] = season_games(years[i])
}
}
With years = list(2012, 2013) for testing purposes, produced the following:
Error in strsplit(headers, "\r\n") : non-character argument Called
from: strsplit(headers, "\r\n")
Thanks in advance!
While @Gregor has an apparent solution, he didn't run it because this wasn't a minimal example. I googled, found, and tried to use this code, and it doesn't work, at least in a non-trivial amount of time.
On the other hand, I took this code from Vivek Patil's blog.
library(XML)
weeklystats = as.data.frame(matrix(ncol = 14)) # Initializing our empty dataframe
names(weeklystats) = c("Week", "Day", "Date", "Blank",
"Win.Team", "At", "Lose.Team",
"Points.Win", "Points.Lose",
"YardsGained.Win", "Turnovers.Win",
"YardsGained.Lose", "Turnovers.Lose",
"Year") # Naming columns
URLpart1 = "http://www.pro-football-reference.com/years/"
URLpart3 = "/games.htm"
#### Our workhorse function ####
getData = function(URLpart1, URLpart3) {
for (i in 2012:2015) {
URL = paste(URLpart1, as.character(i), URLpart3, sep = "")
tablefromURL = readHTMLTable(URL)
table = tablefromURL[[1]]
names(table) = c("Week", "Day", "Date", "Blank", "Win.Team", "At", "Lose.Team",
"Points.Win", "Points.Lose", "YardsGained.Win", "Turnovers.Win",
"YardsGained.Lose", "Turnovers.Lose")
table$Year = i # Inserting a value for the year
weeklystats = rbind(table, weeklystats) # Appending happening here
}
return(weeklystats)
}
I posted this because it works, you might learn something about web scraping you didn't know, and it runs in 11 seconds.
system.time(weeklystats <- getData(URLpart1, URLpart3))
user system elapsed
0.870 0.014 10.926
You should probably take a look at some popular answers for working with lists, specifically How do I make a list of data frames? and What's the difference between [ and [[?.
There's no reason to put your years in a list. They're just integers, so just do a normal vector.
years = 2012:2015
Then we can get your function to work (we'll need to initialize an empty list before the for loop):
gameID <- function(years) {
games = list()
for (i in seq_along(years)) {
games[[i]] = season_games(years[i])
}
return(games)
}
Read my link above for why we're using [[ with the list and [ with the vector. And we could run it like this:
game_list = gameID(2012:2015)
But this is such a simple function that it's easier to use lapply. Your function is just a wrapper around a for loop that returns a list, and that's precisely what lapply is too. But where your function has season_games hard-coded in, lapply can work with any function.
game_list = lapply(2012:2015, season_games)
# should be the same result as above
In either case, we have the list of data frames and want to combine it into one big data frame. The base R way is rbind with do.call, but dplyr and data.table have more efficient versions.
# pick your favorite
games = do.call(rbind, args = game_list) # base
games = dplyr::bind_rows(game_list)
games = data.table::rbindlist(game_list)
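The lapply-then-combine pattern can be tried without network access using a stand-in function. In this sketch `fake_season_games` is hypothetical and returns made-up rows; it only mimics the shape of what season_games() is described as returning in the question.

```r
# Hypothetical stand-in for nflscrapR::season_games(); returns made-up rows
fake_season_games <- function(year) {
  data.frame(
    GameID = paste0(year, c("090500", "090900")),
    home   = c("NYG", "CHI"),
    season = year,
    stringsAsFactors = FALSE
  )
}

years <- 2012:2015

# One data frame per season, collected in a list
game_list <- lapply(years, fake_season_games)

# Stack the list into a single data frame (base R)
games <- do.call(rbind, game_list)

nrow(games)
table(games$season)
```

Swapping fake_season_games for the real function should leave the rest unchanged, provided every season returns the same columns.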

Subset by function's variable using $variable

I am having trouble subsetting a list using a variable of my function.
rankhospital <- function(state,outcome,num = "best") {
#code here
e3<-dataframe(...,state.name,...)
if (num=="worst"){ return(worst(state,outcome))
}else if((num%in%b=="TRUE" & outcome=="heart attack")=="TRUE"){
sep<-split(e3,e3$state.name)
hosp.estado<-sep$state
hospital<-hosp.estado[num,1]
return(as.character(hospital))
I split my data frame by state (which is a variable of my function)
But hosp.estado<-sep$state doesn't work. I have also tried as.data.frame.
The function (rankhospital("NY"....) returns me a character(0).
When I feed the sep$state with sep$"NY" directly in code it works perfectly so I guess the problem is I can't use a function's variable to do this. Am I right? What could I use instead?
Thank you!!
If state is a variable in your function, you can refer to a column with the name given by state using: sep[state] or sep[[state]]. The first produces a data frame with one column named based on the value of state. The second produces an unnamed vector.
df=data.frame(NY=rnorm(10),CA=rnorm(10), IL=rnorm(10))
state="NY"
df[state]
# NY
# 1 -0.79533912
# 2 -0.05487747
# 3 0.25014132
# 4 0.61824329
# 5 -0.17262350
# 6 -2.22390027
# 7 -1.26361438
# 8 0.35872890
# 9 -0.01104548
# 10 -0.94064916
df[[state]]
# [1] -0.79533912 -0.05487747 0.25014132 0.61824329 -0.17262350 -2.22390027 -1.26361438 0.35872890 -0.01104548 -0.94064916
class(df[state])
# [1] "data.frame"
class(df[[state]])
# [1] "numeric"
It seems like you are trying to get the top hospital in a state. You don't want to split here (see the result of sep to see what I mean). Instead, use:
as.character(e3[e3$state.name==state, 1][num])
This hopefully does what you want.
You need sep[[state]] instead of sep$state to get the data frame out of your sep list, which matches the state parameter of your function. Like this:
e3 <- read.csv("https://raw.github.com/Hindol/data-analysis-coursera/master/HW3/hospital-data.csv")
state <- "WY"
num <- 1:5
sep<-split(e3,e3$State)
hosp.estado<-sep[[state]]
hospital<-hosp.estado[num,1]
as.character(hospital)
# [1] "530002" "530006" "530008" "530010" "530011"
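The difference between the two forms can be reproduced on a tiny made-up data frame. The hospital names below are invented for illustration; only the column name state.name follows the question.

```r
e3 <- data.frame(
  hospital   = c("H1", "H2", "H3"),
  state.name = c("NY", "NY", "WY"),
  stringsAsFactors = FALSE
)
sep <- split(e3, e3$state.name)
state <- "NY"

# $ takes the name literally: it looks for an element called "state"
by_dollar <- sep$state

# [[ evaluates its argument: it looks for the element named by the value "NY"
by_brackets <- sep[[state]]

is.null(by_dollar)
by_brackets
```

This is why sep$"NY" works when typed directly but sep$state does not: the text after $ is never evaluated as a variable.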
