Reading multiple Excel files into R using the map function

I know there are similar questions to this, but I haven't come across any that use the map function from the purrr package. I am having a difficult time reading in some Excel files (.xlsx) using purrr::map(). I would like each one to be its own data frame. I tried the approach in this similar question: How can I read multiple (excel) files into R?.
However, I keep getting this error:
Error: path does not exist: "tab3_DOfinal_HUClevel_assessment.xlsx"
I know for sure I have the right path. Not sure why I am getting this error. I have about 9 excel spreadsheets that I want to read in.
Code I tried:
# load necessary package
library(purrr)
file.list <- list.files(path="2016_Data_Tables",pattern='*.xlsx')
file.list <- setNames(file.list, file.list)
# store all .xlsx files as individual data frames inside of one list
df <- map(file.list, read_xlsx)
The file name pattern goes as follows:
tab3_DOfinal_HUClevel_assessment.xlsx
The only thing that changes is the DOfinal part.
Some sample data:
structure(list(ID = 1, WMA = 15, Number = "02040302020030-01",
HUC14 = "HUC02040302020030", Name = "Absecon Creek (AC Reserviors) (gage to SB)",
Region = "Atlantic Coast", NumofStations = "2", ListofStations = "01410455, R32",
ListofAssessment = "2, 2", HUCTier = "2", swqs = "PL, SE1",
TotalNumSamples5yrs = "NA", flgusgsprelim = "NA, 0", auassess = 2,
auassesstrout = -999, finalauassess = 2, finalauassesstrout = -999,
Changefrom2014 = "No Change-2", Changetroutfrom2014 = "No Change",
listHUC14assess5 = "NA", listHUC14assess3 = "NA", listHUC14assess2 = "01410455, R32",
His2014 = "Attaining", His2014trout = "-999", Notes = NA_character_,
OldStations2014 = "01410455", OldStationsAssess2014 = "2",
Error = NA_character_), .Names = c("ID", "WMA", "Number",
"HUC14", "Name", "Region", "NumofStations", "ListofStations",
"ListofAssessment", "HUCTier", "swqs", "TotalNumSamples5yrs",
"flgusgsprelim", "auassess", "auassesstrout", "finalauassess",
"finalauassesstrout", "Changefrom2014", "Changetroutfrom2014",
"listHUC14assess5", "listHUC14assess3", "listHUC14assess2", "His2014",
"His2014trout", "Notes", "OldStations2014", "OldStationsAssess2014",
"Error"), row.names = c(NA, -1L), class = c("tbl_df", "tbl",
"data.frame"))
structure(list(WMA = 15, Number = "02040302020030-01", HUC14 = "HUC02040302020030",
Name = "Absecon Creek (AC Reserviors) (gage to SB)", Region = "Atlantic Coast",
NumofStations = "1", ListofStations = "01410455", ListofAssessment = "2",
MaxStaAssessment = "2", MinStaAssessment = "2", TotalNumSamples5yrs = "NA",
auassess = "2", ChangeFrom2014 = "No Change-2", liststaassess2 = "01410455",
liststaassess3 = "NA", liststaassess5 = "NA", Assessment2014 = "Attaining",
Comments = NA_character_), .Names = c("WMA", "Number", "HUC14",
"Name", "Region", "NumofStations", "ListofStations", "ListofAssessment",
"MaxStaAssessment", "MinStaAssessment", "TotalNumSamples5yrs",
"auassess", "ChangeFrom2014", "liststaassess2", "liststaassess3",
"liststaassess5", "Assessment2014", "Comments"), row.names = c(NA,
-1L), class = c("tbl_df", "tbl", "data.frame"))
structure(list(WMA = 15, Number = "02040302020030-01", HUC14 = "HUC02040302020030",
Name = "Absecon Creek (AC Reserviors) (gage to SB)", Region = "Atlantic Coast",
NumofStations = "1", ListofStations = "R32", ListofAssessment = "3",
MaxStaAssessment = "3", MinStaAssessment = "3", TotalNumSamples5yrs = "9",
auassess = "3", ChangeFrom2014 = "No Change-3", liststaassess2 = "NA",
liststaassess3 = "R32", liststaassess5 = "NA", Assessment2014 = "N/A",
Comments = NA_character_), .Names = c("WMA", "Number", "HUC14",
"Name", "Region", "NumofStations", "ListofStations", "ListofAssessment",
"MaxStaAssessment", "MinStaAssessment", "TotalNumSamples5yrs",
"auassess", "ChangeFrom2014", "liststaassess2", "liststaassess3",
"liststaassess5", "Assessment2014", "Comments"), row.names = c(NA,
-1L), class = c("tbl_df", "tbl", "data.frame"))

Aurèle makes a really good point regarding your file paths.
I would like each one to be it's own data frame
If this is the goal, then a combination of purrr::iwalk and assign could easily get you there. The process goes as follows:
Get a list of all of the .xlsx files located in 2016_Data_Tables/.
Then use purrr::set_names to name each element in this list with its filename sans the .xlsx extension.
Then use purrr::iwalk to apply the assign function to each element in the list. Specifically, use read_xlsx to read each .xlsx file from disk into a data frame, and then assign that data frame as a named object in R's global environment:
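library(magrittr)  # provides %>%
# full.names = TRUE keeps the directory in each path so read_xlsx can find the files
list.files('2016_Data_Tables', pattern = '\\.xlsx$', full.names = TRUE) %>%
  purrr::set_names(stringr::str_remove(basename(.), '\\.xlsx$')) %>%
  purrr::iwalk(function(x, i) assign(i, readxl::read_xlsx(x), envir = .GlobalEnv))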
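On the file-path point: list.files() returns bare file names unless full.names = TRUE, which is the likely cause of the "path does not exist" error in the question. A minimal base R sketch of the difference, assuming a temporary directory as a stand-in for 2016_Data_Tables (the directory and the empty placeholder file are created here purely for illustration):

```r
# Hypothetical stand-in for the 2016_Data_Tables folder
data_dir <- file.path(tempdir(), "2016_Data_Tables")
dir.create(data_dir, showWarnings = FALSE)
file.create(file.path(data_dir, "tab3_DOfinal_HUClevel_assessment.xlsx"))

# Without full.names, only the file name comes back (no directory)
bare_names <- list.files(path = data_dir, pattern = "\\.xlsx$")

# With full.names = TRUE, each entry is a path that can be opened directly
full_paths <- list.files(path = data_dir, pattern = "\\.xlsx$", full.names = TRUE)

print(bare_names)
print(file.exists(full_paths))

# With real spreadsheets, and assuming purrr and readxl are installed:
# dfs <- purrr::map(purrr::set_names(full_paths, basename(full_paths)), readxl::read_xlsx)
```

Keeping the result as one named list (dfs above) is often easier to work with than creating nine separate objects, but either approach needs the full paths.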


How to highlight specific cells in a dataframe in R markdown HTML

I have a dataframe as shown below (the data is listed at the end of the question).
I want to highlight a few cells in the final R Markdown report for a presentation. My current code only displays the table:
cluster_summary%>% kbl(caption = '<b>Clustering results</b>') %>%
kable_classic(full_width = F, html_font = "Cambria")
How can I highlight those cells?
DATA
structure(list(cluster = structure(1:7, .Label = c("1", "2",
"3", "4", "5", "6", "7"), class = "factor"), n = c(512L, 1048L,
662L, 1968L, 576L, 1738L, 1188L), ave_price_per_sqft_adjusted = c(5.16299733157459,
3.32371811588978, 3.96858531607868, 3.32922072520205, 3.42896017156734,
4.16418851265888, 4.08627345683475), ave_age = c(12.0393129995492,
12.6062546474121, 9.32033699503113, 25.5092197801581, 19.1151284494788,
12.2180810585854, 12.0248580167839), ave_DOM = c(47.706537201211,
42.0442099665614, 49.9960193152193, 34.2190863941281, 44.5416652882415,
37.1891219996921, 33.3872422432855), ave_activity_rate = c(1.20118970114087,
1.14598100690658, 1.47458159497434, 1.58286371628597, 1.31320615630511,
1.32586511589676, 2.90376115653893), topic_1 = c(0.0873152283441761,
0.0402887288191615, 0.0671677410154403, 0.0658325530416239, 0.0486383977595131,
0.678477957074527, 0.124182893709105), topic_2 = c(0.0432613598954236,
0.0696506982126008, 0.0443719103703934, 0.714018587278257, 0.106997881943579,
0.0858546713546651, 0.123859196751554), topic_3 = c(0.734165987470995,
0.0151590853651532, 0.0274370600921245, 0.0267196491438714, 0.0186524676995082,
0.0422361263557554, 0.0476136227502999), topic_4 = c(0.0268470362758521,
0.0222984614059603, 0.035088529448869, 0.0682401425738628, 0.733361959255753,
0.0345517467883103, 0.0701685629335576), topic_5 = c(0.0236832387869678,
0.0195300786802868, 0.681931511958987, 0.01084326403663, 0.00780696913319592,
0.0271831270677069, 0.0256968988305932), topic_6 = c(0.00241582961309524,
0.00512777524684262, 0.043572436212494, 0.00284832693741011,
0.00466231684981685, 0.00447461706422522, 0.00578628373290925
), topic_7 = c(0.0293156710834479, 0.0165055511133993, 0.0243384949312766,
0.0479052429538088, 0.0240980295134035, 0.035084908174513, 0.531063470492252
), topic_8 = c(0.0519347465414063, 0.808840100571256, 0.0730651082702796,
0.0592810817199474, 0.0538401481417729, 0.0805723035106479, 0.0664648058614109
)), class = "data.frame", row.names = c(NA, -7L))
You could use formattable; see these examples.
The color_tile formatter, combined with area(), lets you change the color of a specific row and column.
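library(formattable)
# A color_tile from yellow to yellow gives a flat yellow background
highlight <- color_tile("yellow", "yellow")
# cluster_summary is the data frame from the question
formattable(cluster_summary, list(
  area(col = 3, row = 1) ~ highlight,
  area(col = 4, row = 4) ~ highlight
))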
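If you would rather stay with kableExtra, as in the question, a possible sketch is to pass a per-row background vector to column_spec(). The toy data frame and the helper row_background are assumptions introduced for illustration; the kableExtra calls only run if the package is installed.

```r
# Toy stand-in for cluster_summary (values rounded from the question's data)
cluster_summary <- data.frame(
  cluster = factor(1:3),
  n = c(512L, 1048L, 662L),
  ave_age = c(12.04, 12.61, 9.32)
)

# Hypothetical helper: one background color per row, highlighting the chosen rows
row_background <- function(n_rows, rows, color = "yellow", default = "white") {
  ifelse(seq_len(n_rows) %in% rows, color, default)
}

bg_age <- row_background(nrow(cluster_summary), rows = 3)

# Only build the table when kableExtra is available
if (requireNamespace("kableExtra", quietly = TRUE)) {
  tbl <- kableExtra::kbl(cluster_summary, format = "html",
                         caption = "<b>Clustering results</b>")
  tbl <- kableExtra::kable_classic(tbl, full_width = FALSE, html_font = "Cambria")
  # Highlight row 3 of column 3 (ave_age)
  tbl <- kableExtra::column_spec(tbl, 3, background = bg_age)
}
```

Repeating the column_spec() step with a different column index and row set covers additional cells.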

How to select one value of a data.frame within a list column with R?

I have a data.frame that contains a list-type column. Each element of the list contains a 1x3 data.frame. I only want one value from each element. This will flatten my data.frame so I can write out a CSV.
How do I select one item from the nested data.frame (see the 2nd column)?
Here's the nested col. I'd provide the data but cannot flatten to write_csv.
result of dput:
structure(list(id = c("1386707", "1386700", "1386462", "1386340",
"1386246", "1386300"), fields.created = c("2020-05-07T02:09:27.000-0700",
"2020-05-07T01:20:11.000-0700", "2020-05-06T21:38:14.000-0700",
"2020-05-06T07:19:44.000-0700", "2020-05-06T06:11:43.000-0700",
"2020-05-06T02:26:44.000-0700"), fields.customfield_10303 = c(NA,
NA, 3, 3, NA, NA), fields.customfield_28100 = list(NULL, structure(list(
self = ".../rest/api/2/customFieldOption/76412",
value = "New Feature", id = "76412"), .Names = c("self",
"value", "id"), class = "data.frame", row.names = 1L), structure(list(
self = ".../rest/api/2/customFieldOption/76414",
value = "Technical Debt", id = "76414"), .Names = c("self",
"value", "id"), class = "data.frame", row.names = 1L), NULL,
structure(list(self = ".../rest/api/2/customFieldOption/76411",
value = "Maintenance", id = "76411"), .Names = c("self",
"value", "id"), class = "data.frame", row.names = 1L), structure(list(
self = ".../rest/api/2/customFieldOption/76412",
value = "New Feature", id = "76412"), .Names = c("self",
"value", "id"), class = "data.frame", row.names = 1L))), row.names = c(NA,
6L), class = "data.frame", .Names = c("id", "fields.created",
"fields.customfield_10303", "fields.customfield_28100"))
I found a way to do this.
First, instead of changing the data, I added a column with mutate. Then, directly selected the same column from all nested lists. Then, I converted the list column into a vector. Finally, I cleaned it up by removing the other columns.
It seems to work. I don't know yet how it will handle multiple rows within the nested df.
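library(dplyr)
# sample_dat and nested_col stand in for the data frame and its list column
dat <- sample_dat %>%
  mutate(cats = sapply(nested_col, `[[`, 2)) %>%   # second column of each nested df
  mutate(categories = sapply(cats, toString)) %>%  # collapse each element to one string
  select(-nested_col, -cats)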
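On the open point about nested data frames with several rows, here is a minimal base R sketch using a cut-down version of the question's data. The third element, with two rows, is a hypothetical case added for illustration, and extract_value is a helper name introduced here.

```r
dat <- data.frame(id = c("1386707", "1386700", "1386462"),
                  stringsAsFactors = FALSE)

# List column: a NULL, a one-row data frame, and a (hypothetical) two-row data frame
dat$fields.customfield_28100 <- list(
  NULL,
  data.frame(self = ".../rest/api/2/customFieldOption/76412",
             value = "New Feature", id = "76412",
             stringsAsFactors = FALSE),
  data.frame(self = c(".../rest/api/2/customFieldOption/76414",
                      ".../rest/api/2/customFieldOption/76411"),
             value = c("Technical Debt", "Maintenance"),
             id = c("76414", "76411"),
             stringsAsFactors = FALSE)
)

# Return NA for missing entries, otherwise all values joined into one string
extract_value <- function(x) {
  if (is.null(x)) NA_character_ else toString(x[["value"]])
}

dat$categories <- vapply(dat$fields.customfield_28100, extract_value, character(1))
dat$fields.customfield_28100 <- NULL  # drop the list column so the frame is flat

print(dat)
```

Because vapply() is told to expect exactly one string per element, it fails loudly if an element cannot be reduced that way, and the flat result can then be passed to write.csv() or write_csv().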
Related
How to directly select the same column from all nested lists within a list?
r-convert list column into character vector where lists are characters
library(dplyr)
library(tidyr)
df <- tibble(Group=c("A","A","B","C","D","D"),
Batman=1:6,
Superman=c("red","blue","orange","red","blue","red"))
nested <- df %>%
nest(data=-Group)
unnested <- nested %>%
unnest(data)
Nesting and unnesting data with tidyr
library(purrr)
nested %>%
mutate(data=map(data,~select(.x,2))) %>%
unnest(data)
Selecting with purrr. Using lapply as you've done is fine too; the difference is only aesthetic ;)

Lexis function not found in R

I am using this code from the R help guide in the Epi package:
# A small bogus cohort
xcoh <- structure( list( id = c("A", "B", "C"),
birth = c("14/07/1952", "01/04/1954",
"10/06/1987"),
entry = c("04/08/1965", "08/09/1972",
"23/12/1991"),
exit = c("27/06/1997", "23/05/1995",
"24/07/1998"),
fail = c(1, 0, 1) ),
.Names = c("id", "birth", "entry", "exit",
"fail"),
row.names = c("1", "2", "3"),
class = "data.frame" )
# Define a Lexis object with timescales calendar time and
age
Lcoh <- Lexis( entry = list( per=entry ),
exit = list( per=exit,
age=exit-birth ),
exit.status = fail,
data = xcoh )
But I get this error:
Error in Lexis(entry = list(per = entry), exit = list(per = exit, age = exit - :
could not find function "Lexis"
Any thoughts?
The Epi package first needs to be installed:
install.packages("Epi")
Then the package needs to be loaded:
library(Epi)
Your code, modified accordingly:
install.packages("Epi")
library(Epi)
xcoh <- structure( list( id = c("A", "B", "C"),
birth = c("14/07/1952", "01/04/1954",
"10/06/1987"),
entry = c("04/08/1965", "08/09/1972",
"23/12/1991"),
exit = c("27/06/1997", "23/05/1995",
"24/07/1998"),
fail = c(1, 0, 1) ),
.Names = c("id", "birth", "entry", "exit",
"fail"),
row.names = c("1", "2", "3"),
class = "data.frame" )
# Define a Lexis object with timescales calendar time and
Lcoh <- Lexis( entry = list( per=entry ),
exit = list( per=exit,
age=exit-birth ),
exit.status = fail,
data = xcoh )
Note: I have removed the stray line that says age, assuming it is not relevant to the question posted here.
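A hedged sketch of the same fix that checks for the package before using it, with only base R for the check. It also converts the day/month/year strings to decimal years first, because subtracting character strings (exit - birth) fails in R. The helper to_year and the columns bt, en and ex are names introduced here, and the conversion (days since 1970 divided by 365.25) is an assumption about a suitable numeric timescale.

```r
# TRUE if Epi is installed; install.packages("Epi") is a one-time step otherwise
epi_ready <- requireNamespace("Epi", quietly = TRUE)

xcoh <- data.frame(
  id    = c("A", "B", "C"),
  birth = c("14/07/1952", "01/04/1954", "10/06/1987"),
  entry = c("04/08/1965", "08/09/1972", "23/12/1991"),
  exit  = c("27/06/1997", "23/05/1995", "24/07/1998"),
  fail  = c(1, 0, 1),
  stringsAsFactors = FALSE
)

# Hypothetical helper: day/month/year string to an approximate decimal year
to_year <- function(x) {
  1970 + as.numeric(as.Date(x, format = "%d/%m/%Y")) / 365.25
}

xcoh$bt <- to_year(xcoh$birth)
xcoh$en <- to_year(xcoh$entry)
xcoh$ex <- to_year(xcoh$exit)

# Build the Lexis object only when the package is available
if (epi_ready) {
  Lcoh <- Epi::Lexis(entry = list(per = en),
                     exit = list(per = ex, age = ex - bt),
                     exit.status = fail,
                     data = xcoh)
}
```

Calling Epi::Lexis() with the namespace prefix works without library(Epi); after library(Epi), plain Lexis() is found as well.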

How to get list value and add new column from list value in data frame using R Algorithm

I'm primarily a .NET developer, and one of my project pages needs data from a MongoDB collection. I need to compare each column value, so I used R. I have retrieved the data from MongoDB, but some of the columns hold a list of variables, so I'm not able to compare each column. Can you help me separate the values of the list columns and add new columns named after the list variables?
If it's possible to solve this with an algorithm, that would be preferable.
My sample data frame (data set):
ID | UserDetails | CompanyDetails
1 | list(UserID = 247891,Useraltr="Admin",UsercumEmpdetaisl=list(list(FirstName="Jack",LastName="De"))) | list(ComyAddress="4/8 9 Block UD",ComyReg="344/88 7 Cross UK")
2 | list(UserID=c(247891,256134),Useraltr=c("Admin","SuperAdmin"),UsercumEmpdetaisl=list(list(FirstName=c("peter","jhon","Vector"),LastName =c("Anderson","VJ","PK")))) | list(ComyAddress =c("1BLOCK","2BLOCK"),ComyReg=c("1MainRoad","3street"),LandMark =c("Near post Office","check post"))
Result data frame
ID | UserID | Useraltr | FirstName | LastName | ComyAddress | ComyReg | LandMark
1 | 247891 | Admin | Jack | De | 4/8 9 Block UD | 344/88 7 Cross UK | Empty(NULL)
2 | 247891,256134 | Admin,SuperAdmin | peter,jhon,Vector | Anderson,VJ,PK | 1BLOCK,2BLOCK | 1MainRoad,3street | Near post Office,check post
dput output for the first 2 rows of the data frame:
structure(list(ID = c("1", "2"), UserDetails = list(structure(list(
UserID = 247891, Useraltr = 'Admin', UsercumEmpdetaisl = list(structure(list(
FirstName = "Jack", LastName ="De" ), .Names = c("FirstName", "LastName"
), class = "data.frame", row.names = 1L))), .Names = c("UserID",
"Useraltr", "UsercumEmpdetaisl"), class = "data.frame", row.names = 1L),
structure(list(UserID = c(247891,256134), Useraltr = c('Admin','SuperAdmin'), UsercumEmpdetaisl = list(
structure(list(FirstName = c("peter", "jhon", "Vector"), LastName = c("Anderson",
"VJ","PK")), .Names = c("FirstName", "LastName"), class = "data.frame", row.names = 1L))), .Names = c("UserID",
"Useraltr", "UsercumEmpdetaisl"), class = "data.frame", row.names = 1L)),
CompanyDetails = list(structure(list(ComyAddress = "4/8 9 Block UD"
, ComyReg = "344/88 7 Cross UK"), .Names = c("ComyAddress", "ComyReg"
), class = "data.frame", row.names = 1:2), structure(list(
ComyAddress = c("1BLOCK","2BLOCK"), ComyReg = c("1MainRoad","3 street"
),LandMark=c("Near post Office","check post")), .Names = c("ComyAddress", "ComyReg","LandMark"), class = "data.frame", row.names = 1:2))), .Names = c("ID",
"UserDetails", "CompanyDetails"), row.names = 1:2, class = "data.frame")

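One possible way to approach the flattening described in the question above, sketched in base R. It assumes each list column can be treated as a list of named fields, collapses multi-valued fields into comma-separated strings, and fills missing fields with NA. The helper flatten_col is a name introduced here, and the nested FirstName/LastName level is left out to keep the sketch short.

```r
# Simplified stand-ins for the UserDetails and CompanyDetails columns
users <- list(
  list(UserID = 247891, Useraltr = "Admin"),
  list(UserID = c(247891, 256134), Useraltr = c("Admin", "SuperAdmin"))
)
company <- list(
  list(ComyAddress = "4/8 9 Block UD", ComyReg = "344/88 7 Cross UK"),
  list(ComyAddress = c("1BLOCK", "2BLOCK"),
       ComyReg = c("1MainRoad", "3street"),
       LandMark = c("Near post Office", "check post"))
)

# Hypothetical helper: one output column per field name found in any row
flatten_col <- function(col) {
  fields <- unique(unlist(lapply(col, names)))
  out <- lapply(fields, function(f) {
    vapply(col, function(x) {
      if (is.null(x[[f]])) NA_character_ else paste(x[[f]], collapse = ",")
    }, character(1))
  })
  names(out) <- fields
  as.data.frame(out, stringsAsFactors = FALSE)
}

result <- data.frame(ID = c("1", "2"),
                     flatten_col(users),
                     flatten_col(company),
                     stringsAsFactors = FALSE)
print(result)
```

Each new column takes the name of the list field it came from, so the columns can then be compared like any other character column.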
Arithmetic on summarized dataframe from dplyr in R

I have a large dataset, and I use dplyr's summarize() to generate some means.
Occasionally, I would like to perform arithmetic on that output.
For example, I would like to get the mean of means from the output below, say "m.biomass".
I've tried this mean(data.sum[,7]) and this mean(as.list(data.sum[,7])). Is there a quick and easy way to achieve this?
data.sum <-structure(list(scenario = c("future", "future", "future", "future"
), state = c("fl", "ga", "ok", "va"), m.soc = c(4090.31654013689,
3654.45350562628, 2564.33199749487, 4193.83388887064), m.npp = c(1032.244475,
821.319385, 753.401315, 636.885535), sd.soc = c(56.0344229400332,
97.8553643582118, 68.2248389927858, 79.0739969429246), sd.npp = c(34.9421782033153,
27.6443555578531, 26.0728757486901, 24.0375040705595), m.biomass = c(5322.76631158111,
3936.79457763176, 3591.0902359206, 2888.25308402464), sd.m.biomass = c(3026.59250918009,
2799.40317348016, 2515.10516340438, 2273.45510178843), max.biomass = c(9592.9303,
8105.109, 7272.4896, 6439.2259), time = c("1980-1999", "1980-1999",
"1980-1999", "1980-1999")), .Names = c("scenario", "state", "m.soc",
"m.npp", "sd.soc", "sd.npp", "m.biomass", "sd.m.biomass", "max.biomass",
"time"), class = c("grouped_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -4), vars = list(quote(scenario)), labels = structure(list(
scenario = "future"), class = "data.frame", row.names = c(NA,
-1), vars = list(quote(scenario)), drop = TRUE, .Names = "scenario"), indices = list(0:3))
We can use [[ to extract the column as a vector, because mean works on a vector or a matrix, not on a data.frame. To do this for a single column:
mean(data.sum[[7]])
#[1] 3934.726
If data.sum had only the data.frame class, data.sum[,7] would extract the column as a vector, but the tbl_df class prevents it from collapsing to a vector.
For multiple columns, dplyr also has specialised functions:
# summarise_each() and funs() are deprecated; across() is the current equivalent
data.sum %>%
  summarise(across(m.soc:m.biomass, mean))
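The difference between extracting a column and keeping a one-column data frame can be sketched in base R. This uses a cut-down copy of the question's data (three of the original columns), and the object names mean_biomass, kept_df and col_means are introduced here for illustration.

```r
data.sum <- data.frame(
  state = c("fl", "ga", "ok", "va"),
  m.soc = c(4090.31654013689, 3654.45350562628, 2564.33199749487, 4193.83388887064),
  m.biomass = c(5322.76631158111, 3936.79457763176, 3591.0902359206, 2888.25308402464)
)

# [[ always returns the column itself, here a numeric vector
mean_biomass <- mean(data.sum[["m.biomass"]])

# Single-bracket list-style indexing keeps a data frame;
# mean() on a data frame returns NA with a warning
kept_df <- data.sum["m.biomass"]

# colMeans() averages several numeric columns at once
col_means <- colMeans(data.sum[c("m.soc", "m.biomass")])

print(mean_biomass)
print(col_means)
```

Selecting by name rather than by position (7 in the original) keeps the code working if columns are added or reordered.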
