I am having issues trying to find an efficient way to read in multiple unstructured .xlsx files into R. This requires a bit of explaining, so anyone who is trying to assist can understand exactly what I am trying to do.
It was suggested that I make this reproducible, so I have used dput to replicate my dataset. The structure can be recreated with the code below:
x <- structure(list(...1 = c("Company Name", "Contact",
"Name", "Phone #", "Scope of Work", NA,
"Trees", "36\" Box Southern Live Oak (1.5\" Caliper)", "36\" Box Thornless Chilean Mesquite (1.5\" Caliper)",
NA, "DG",
"Desert Gold", "Pink Coral"
), ...2 = c("To:", "Date:", "Job Name:", "Plan Date:", "Install All Trees, Shrubs, Irrigation and Landscape Material to Meet all Landscape Plans and Specs",
NA, NA, NA, NA, NA, NA, NA, NA), ...3 = c("Contractor",
"DATE ID", "Job ID", "DATE ID", NA, NA, NA, NA, NA, NA, NA,
NA, NA), ...4 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA), ...5 = c(NA, NA, NA, NA, NA, NA, "Quantity", "20", "38",
NA, "Quantity", "26", "32"), ...6 = c(NA, NA, NA, NA, NA, NA, NA,
10, 10, NA, NA, 10, 10), ...7 = c(NA, NA, NA, NA, NA, NA, NA,
200, 380, NA, NA, 260, 320)), row.names = c(NA, -13L), class = c("tbl_df",
"tbl", "data.frame"))
If you run the code above, the tibble will look like this:
...1 ...2 ...3 ...4 ...5 ...6 ...7
<chr> <chr> <chr> <lgl> <chr> <dbl> <dbl>
1 "Company Name" To: Cont~ NA NA NA NA
2 "Contact" Date: DATE~ NA NA NA NA
3 "Name" Job ~ Job ~ NA NA NA NA
4 "Phone #" Plan~ DATE~ NA NA NA NA
5 "Scope of Work" Inst~ NA NA NA NA NA
6 NA NA NA NA NA NA NA
7 "Trees" NA NA NA Quan~ NA NA
8 "36\" Box Southern Live Oak (1.~ NA NA NA 20 10 200
9 "36\" Box Thornless Chilean Mes~ NA NA NA 38 10 380
10 NA NA NA NA NA NA NA
11 "DG" NA NA NA Quan~ NA NA
12 "Desert Gold" NA NA NA 26 10 260
13 "Pink Coral" NA NA NA 32 10 320
TLDR: These files are landscape bid forms sent to contractors. If you notice, the subset x[1:5,1:3] contains information about the job, such as the job name, the date, the contractor's name, the landscape company's name, etc. Every single one of the .xlsx files has the exact same format for that subset. I would like to keep the job name, but for the purposes of this question I will not make it the focus.
Below the x[1:5,1:3] subset, starting at x[7,1], there is a header named Trees, which is bolded in the .xlsx files. The next header is DG, which is also bolded. The values sit right under the headers, so the first value for Trees is "36\" Box Southern Live Oak (1.5\" Caliper)" and the first value for DG is "Desert Gold". These values are not bolded.
It is important to stress that there are about 10-15 different headers across the hundreds of files, and the number of values under each header can range from 1 to 100+ rows. The headers and their values are always in x[,1].
I am trying to figure out how to partition the sections (DG, Trees, ...) and read them into R as their own dataframes. I think the ideal way to do this is to read the files into a list and then separate the sections into their own dataframes inside a nested list.
Lastly, if you notice, x[,5] contains headers named Quantity, which are also bolded, with integers (not bolded) under each one. x[,6] is the price of each of those quantities, and x[,7] is the product of those two columns. I am trying to preserve these numbers as well.
In the end, I am trying to have multiple tables or dataframes in R that look like so:
df1
Trees Quantity Price Totals
1 36"... 20 10 200
2 36"... 38 10 380
df2
DG Quantity Price Totals
1 Desert Gold 26 10 260
2 Pink Coral 32 10 320
I am looking for a way to do that efficiently over hundreds of .xlsx files.
So far, I have created a list that holds each of the Excel files. There are 248 .xlsx files in a folder on my local PC, and I read each of them into the list like so:
library(readxl)

files <- list.files(".", pattern = "\\.xlsx$")  # 248 files
excel_list <- vector(mode = "list", length = length(files))
for (i in seq_along(files)) {
  excel_list[[i]] <- read_excel(files[i], col_names = FALSE)
}
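The same can be written more compactly with lapply; as a small sketch under the same assumptions, this also names each list element after its source file, which helps trace sections back to their workbook later:
# Equivalent to the loop above; setNames() keeps the file names as
# element names of the list.
excel_list <- setNames(lapply(files, read_excel, col_names = FALSE), files)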
To achieve your desired result, you first have to identify the rows containing the section headers, which, according to your example data, can be done by finding rows containing "Quantity" in the fifth column. After that, a few additional data wrangling steps convert the data into a tidy format. Finally, we can split the data by section to achieve your desired result:
library(janitor)
library(dplyr, warn.conflicts = FALSE)
library(tidyr)

tidy_data <- function(x) {
  x %>%
    remove_empty() %>%
    # Header rows contain "Quantity" in the fifth column; the section
    # name sits in the first column of those rows.
    mutate(is_header_row = grepl("^Quan", `...5`),
           section = ifelse(is_header_row, `...1`, NA_character_)) %>%
    # Carry each section name down to the value rows beneath it, then
    # drop the rows above the first section and the header rows.
    fill(section) %>%
    filter(!is.na(section), !is_header_row) %>%
    select(-is_header_row) %>%
    remove_empty() %>%
    rename(Item = 1, Quantity = 2, Price = 3, Totals = 4)
}
xx <- tidy_data(x)
xx
#> # A tibble: 4 × 5
#> Item Quantity Price Totals section
#> <chr> <chr> <dbl> <dbl> <chr>
#> 1 "36\" Box Southern Live Oak (1.5\" Caliper)" 20 10 200 Trees
#> 2 "36\" Box Thornless Chilean Mesquite (1.5\" Ca… 38 10 380 Trees
#> 3 "Desert Gold" 26 10 260 DG
#> 4 "Pink Coral" 32 10 320 DG
xx %>%
  split(., .$section) %>%
  purrr::imap(function(x, y) { x %>% select(-section) %>% rename("{y}" := 1) })
#> $DG
#> # A tibble: 2 × 4
#> DG Quantity Price Totals
#> <chr> <chr> <dbl> <dbl>
#> 1 Desert Gold 26 10 260
#> 2 Pink Coral 32 10 320
#>
#> $Trees
#> # A tibble: 2 × 4
#> Trees Quantity Price Totals
#> <chr> <chr> <dbl> <dbl>
#> 1 "36\" Box Southern Live Oak (1.5\" Caliper)" 20 10 200
#> 2 "36\" Box Thornless Chilean Mesquite (1.5\" Caliper)" 38 10 380
I have a data frame df2:
date X Days
2020-01-06 525 NA
2020-01-07 799 NA
2020-01-08 782 NA
2020-01-09 542 NA
2020-01-10 638 5
2020-01-11 1000 5
2020-01-12 1400 3
2020-01-13 3500 1
I want to count how many days it takes for the sum of X to surpass a value, counting backwards from each day. In this case, the value is 3000.
For example, on 1/13 it took 1 day, because X is 3500 and that alone surpasses 3000. On 1/12 it took 3 days, because 1400 + 1000 + 638 = 3038.
I wish to get the column Days.
dput(df2)
structure(list(date = structure(c(1578268800, 1578355200, 1578441600,
1578528000, 1578614400, 1578700800, 1578787200, 1578873600), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), X = c(525, 799, 782, 542, 638, 1000,
1400, 3500), Days = c(NA, NA, NA, NA, 5, 5, 3, 1)), class = "data.frame", row.names = c(NA,
-8L))
I think a rolling function works well here. Unlike most rolling operations, where the window is fixed and smaller than the length of the data, we will intentionally make this one full-width:
zoo::rollapplyr(
  df2$X, nrow(df2),
  FUN = function(z) which(cumsum(rev(z)) > 3000)[1],
  partial = TRUE)
# [1] NA NA NA NA 5 5 3 1
(I'm ignoring date, assuming that the rows are consecutive-days.)
A base R alternative with cumulative sums (note the trailing [-1], which drops the extra element introduced by the dummy 0 at the head of cs):
cs <- c(0, cumsum(rev(df2$X)))
out <- sapply(cs, function(x) which(cs - x > 3e3)[1])
rev(out - seq_along(cs))[-1]
#> [1] NA NA NA NA  5  5  3  1
Created on 2022-01-06 by the reprex package (v2.0.1)
I am working on a dataset for a welfare wage subsidy program, where wages per worker are structured as follows:
df <- structure(list(wage_1990 = c(13451.67, 45000, 10301.67, NA, NA,
8726.67, 11952.5, NA, NA, 7140, NA, NA, 10301.67, 7303.33, NA,
NA, 9881.67, 5483.33, 12868.33, 9321.67), wage_1991 = c(13451.67,
45000, 10301.67, NA, NA, 8750, 11952.5, NA, NA, 7140, NA, NA,
10301.67, 7303.33, NA, NA, 9881.67, 5483.33, 12868.33, 9321.67
), wage_1992 = c(13451.67, 49500, 10301.67, NA, NA, 8750, 11952.5,
NA, NA, 7140, NA, NA, 10301.67, 7303.33, NA, NA, 9881.67, NA,
12868.33, 9321.67), wage_1993 = c(NA, NA, 10301.67, NA, NA, 8750,
11958.33, NA, NA, 7140, NA, NA, 10301.67, 7303.33, NA, NA, 9881.67,
NA, NA, 9321.67), wage_1994 = c(NA, NA, 10301.67, NA, NA, 8948.33,
11958.33, NA, NA, 7140, NA, NA, 10301.67, 7303.33, NA, NA, 9881.67,
NA, NA, 9321.67), wage_1995 = c(NA, NA, 10301.67, NA, NA, 8948.33,
11958.33, NA, NA, 7140, NA, NA, 10301.67, 7303.33, NA, NA, 9881.67,
NA, NA, 9321.67), wage_1996 = c(NA, NA, 10301.67, NA, NA, 8948.33,
11958.33, NA, NA, 7291.67, NA, NA, 10301.67, 7303.33, NA, NA,
9881.67, NA, NA, 9321.67)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -20L))
I have tried one proposed solution, which is running this code after the one above:
average_growth_rate <- apply(df, 1, function(x) {
  x1 <- x[!is.na(x)]
  mean(x1[-1] / x1[-length(x1)] - 1)
})
out <- data.frame(rowid = seq_len(nrow(df)), average_growth_rate)
out[!is.na(out$average_growth_rate),]
But I keep getting this error:
Error in dim(X) <- c(n, length(X)/n) : dims [product 60000] do not match the length of object [65051]
I want to do the following: create a variable showing the annual growth rate of the wage for each worker, or lack thereof.
The practical issue that I am facing is that each observation is in one row and while the first worker joined the program in 1990, others might have joined in say 1993 or 1992. Therefore, is there a way to apply the growth rate for each worker depending on the specific years they worked, rather than applying a general growth formula for all observations?
My expected output would be a new column with one value per row, e.g.:
average wage growth rate
1- 15%
2- 9%
3- 12%
After running the following code to see descriptive statistics of my variable of interest:
skim(df$average_growth_rate)
I get the following result:
"Variable contains Inf or -Inf value(s) that were converted to NA.── Data Summary ────────────────────────
Values
Name gosi_beneficiary_growth$a...
Number of rows 3671
Number of columns 1
_______________________
Column type frequency:
numeric 1
________________________
Group variables None
── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────
skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
1 data 1348 0.633 Inf Inf -1 -0.450 0 0.0568
"
I am not sure why my mean and standard deviation values are Inf.
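(Presumably the Inf values come from year-over-year ratios in which the previous wage is 0, since x/0 is Inf in R. A minimal, hedged sketch for masking them before summarising, assuming average_growth_rate has been added to df as in the answers below:)
library(skimr)

# Count the infinite growth rates, then mask them so that skim()
# reports finite summary statistics.
sum(is.infinite(df$average_growth_rate))
df$average_growth_rate[is.infinite(df$average_growth_rate)] <- NA
skim(df$average_growth_rate)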
Here is one approach:
library(tidyverse)
growth <- df %>%
  rowid_to_column() %>%
  gather(key, value, -rowid) %>%
  drop_na() %>%
  arrange(rowid, key) %>%
  group_by(rowid) %>%
  mutate(yoy = value / lag(value) - 1) %>%
  summarise(average_growth_rate = mean(yoy, na.rm = TRUE))
# A tibble: 12 x 2
rowid average_growth_rate
<int> <dbl>
1 1 0
2 2 0.05
3 3 0
4 6 0.00422
5 7 0.0000813
6 10 0.00354
7 13 0
8 14 0
9 17 0
10 18 0
11 19 0
12 20 0
And just to highlight that all these 0s are expected, here is the dataframe:
> head(df)
# A tibble: 6 x 7
wage_1990 wage_1991 wage_1992 wage_1993 wage_1994 wage_1995 wage_1996
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 13452. 13452. 13452. NA NA NA NA
2 45000 45000 49500 NA NA NA NA
3 10302. 10302. 10302. 10302. 10302. 10302. 10302.
4 NA NA NA NA NA NA NA
5 NA NA NA NA NA NA NA
6 8727. 8750 8750 8750 8948. 8948. 8948.
where you see that, e.g., for the first row there was neither growth nor decline. In the second row, there was a slight increase between the second and third year, but no change between the first and second. For the third row, again absolutely no change. Etc.
Finally, to add these results to the initial dataframe, you would do e.g.:
df %>%
  rowid_to_column() %>%
  left_join(growth)
And just to answer the performance question, here a benchmark (where I changed akrun's data.frame call to a tibble call to make sure there is no difference coming from this). All functions below correspond to creating the growth rates, not merging back to the original dataframe.
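The wrapped functions are not shown in the post; here is a hedged reconstruction from the code in this thread (the names cj, akrun, and akrun2 are assumed to map to the tidyverse, apply, and tapply/stack approaches, respectively):
# Assumes library(tidyverse) is loaded and df is the tibble from above.
cj <- function() {
  df %>%
    rowid_to_column() %>%
    gather(key, value, -rowid) %>%
    drop_na() %>%
    arrange(rowid, key) %>%
    group_by(rowid) %>%
    mutate(yoy = value / lag(value) - 1) %>%
    summarise(average_growth_rate = mean(yoy, na.rm = TRUE))
}
akrun <- function() {
  apply(df, 1, function(x) {
    x1 <- x[!is.na(x)]
    mean(x1[-1] / x1[-length(x1)] - 1)
  })
}
akrun2 <- function() {
  na.omit(stack(tapply(as.matrix(df), row(df), FUN = function(x)
    mean(tail(na.omit(x), -1) / head(na.omit(x), -1) - 1))))[2:1]
}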
library(microbenchmark)
microbenchmark(cj(), akrun(), akrun2())
Unit: microseconds
expr min lq mean median uq max neval cld
cj() 5577.301 5820.501 6122.076 5988.551 6244.301 10646.9 100 c
akrun() 998.301 1097.252 1559.144 1160.450 1212.552 28704.5 100 a
akrun2() 2033.801 2157.101 2653.018 2258.052 2340.702 34143.0 100 b
base R is the clear winner in terms of performance.
We can use base R with apply: loop over the rows with MARGIN = 1, remove the NA elements ('x1'), and take the mean of the ratios of each element to the previous one, minus 1.
average_growth_rate <- apply(df, 1, function(x) {
  x1 <- x[!is.na(x)]
  mean(x1[-1] / x1[-length(x1)] - 1)
})
out <- data.frame(rowid = seq_len(nrow(df)), average_growth_rate)
out[!is.na(out$average_growth_rate),]
# rowid average_growth_rate
#1 1 0.00000000000
#2 2 0.05000000000
#3 3 0.00000000000
#6 6 0.00422328325
#7 7 0.00008129401
#10 10 0.00354038282
#13 13 0.00000000000
#14 14 0.00000000000
#17 17 0.00000000000
#18 18 0.00000000000
#19 19 0.00000000000
#20 20 0.00000000000
Or using tapply/stack (note the order: later values divided by earlier ones, to match the apply version above):
na.omit(stack(tapply(as.matrix(df), row(df), FUN = function(x)
  mean(tail(na.omit(x), -1) / head(na.omit(x), -1) - 1))))[2:1]
I have a dataset containing variables and a quantity of goods sold: for some days, however, there are no values.
I created a dataset with 0 for all the sales values and NA for everything else. How can I add those lines to the initial dataset?
At the moment, I have this:
sales
day month year employees holiday sales
1 1 2018 14 0 1058
2 1 2018 25 1 2174
4 1 2018 11 0 987
sales.NA
day month year employees holiday sales
1 1 2018 NA NA 0
2 1 2018 NA NA 0
3 1 2018 NA NA 0
4 1 2018 NA NA 0
I would like to create a new dataset that inserts the days with no observations, with 0 for sales and NA for all other variables, like this:
new.data
day month year employees holiday sales
1 1 2018 14 0 1058
2 1 2018 25 1 2174
3 1 2018 NA NA 0
4 1 2018 11 0 987
I tried something like this:
merge(sales.NA,sales, all.y=T, by = c("day","month","year"))
But it does not work.
Using dplyr, you could use a "right_join". For example:
sales <- data.frame(day = c(1,2,4),
month = c(1,1,1),
year = c(2018, 2018, 2018),
employees = c(14, 25, 11),
holiday = c(0,1,0),
sales = c(1058, 2174, 987)
)
sales.NA <- data.frame(day = c(1,2,3,4),
month = c(1,1,1,1),
year = c(2018,2018,2018, 2018)
)
library(dplyr)

right_join(sales, sales.NA)
This leaves you with
day month year employees holiday sales
1 1 1 2018 14 0 1058
2 2 1 2018 25 1 2174
3 3 1 2018 NA NA NA
4 4 1 2018 11 0 987
This leaves NA in sales where you want 0, but that could be fixed by including the sales data in sales.NA, or you could use "tidyr":
library(tidyr)

right_join(sales, sales.NA) %>% mutate(sales = replace_na(sales, 0))
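For the first option, a hedged sketch: give sales.NA a zero sales column and coalesce the two columns after the join:
# Hypothetical variant: sales.NA carries sales = 0, so the missing days
# get 0 via coalesce() instead of a separate replace_na() step.
sales.NA$sales <- 0
right_join(sales, sales.NA, by = c("day", "month", "year")) %>%
  mutate(sales = coalesce(sales.x, sales.y)) %>%
  select(day, month, year, employees, holiday, sales)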
Here is another data.table solution:
jvars = c("day","month","year")
merge(sales.NA[, ..jvars], sales, by = jvars, all.x = TRUE)[is.na(sales), sales := 0L][]
day month year employees holiday sales
1: 1 1 2018 14 0 1058
2: 2 1 2018 25 1 2174
3: 3 1 2018 NA NA 0
4: 4 1 2018 11 0 987
Or with some neater syntax:
sales[sales.NA[, ..jvars], on = jvars][is.na(sales), sales := 0][]
Reproducible data:
sales <- structure(list(day = c(1L, 2L, 4L), month = c(1L, 1L, 1L), year = c(2018L,
2018L, 2018L), employees = c(14L, 25L, 11L), holiday = c(0L,
1L, 0L), sales = c(1058L, 2174L, 987L)), row.names = c(NA, -3L
), class = c("data.table", "data.frame"))
sales.NA <- structure(list(day = 1:4, month = c(1L, 1L, 1L, 1L), year = c(2018L,
2018L, 2018L, 2018L), employees = c(NA, NA, NA, NA), holiday = c(NA,
NA, NA, NA), sales = c(0L, 0L, 0L, 0L)), row.names = c(NA, -4L
), class = c("data.table", "data.frame"))
That's an answer using the data.table package, since I am more familiar with its syntax; note, though, that the ..jvars and := constructs are data.table-specific, so plain data.frames would need base merge() plus an explicit replacement instead. I also would switch to a proper date format, which will make life easier for you down the line.
Done that way, you would not even need the sales.NA table, since the missing days automatically show up as NA rows after the first join.
library(data.table)
dt.dates <- data.table(Date = seq.Date(from = as.Date("2018-01-01"), to = as.Date("2018-12-31"),by = "day" ))
dt.sales <- data.table(day = c(1,2,4)
, month = c(1,1,1)
, year = c(2018,2018,2018)
, employees = c(14, 25, 11)
, holiday = c(0,1,0)
, sales = c(1058, 2174, 987)
)
dt.sales[, Date := as.Date(paste(year,month,day, sep = "-")) ]
merge( x = dt.dates
, y = dt.sales
, by.x = "Date"
, by.y = "Date"
, all.x = TRUE
)
         Date day month year employees holiday sales
1: 2018-01-01 1 1 2018 14 0 1058
2: 2018-01-02 2 1 2018 25 1 2174
3: 2018-01-03 NA NA NA NA NA NA
4: 2018-01-04 4 1 2018 11 0 987
...
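To also get sales = 0 on the missing days, as in the desired new.data, a hedged follow-up using the same := idiom as the previous answer:
# Assign the merge result, then set the missing sales to 0; the other
# columns stay NA, matching the desired new.data.
res <- merge(x = dt.dates, y = dt.sales, by = "Date", all.x = TRUE)
res[is.na(sales), sales := 0][]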
Not being familiar with R, I have the following problem: I want to add the probeposition values from the dataframe mlpa to the dataframe patients, linking them through the values present in both datasets (i.e. probe and patprobe). As far as I can tell, this problem is not covered by the usual data management tutorials.
#mlpa:
probe <- c(12,15,18,19)
probeposition <- c(100,1200,500,900)
mlpa = data.frame(probe = probe, probeposition = probeposition)
#patients:
patid <- c('AT', 'GA', 'TT', 'AG', 'GG', 'TA')
patprobe <- c(12, 12, NA, NA, 18, 19)
patients = data.frame(patid = patid, patprobe = patprobe)
#And that's what I finally want:
patprobeposition = c(100, 100, NA, NA, 500, 900)
patients$patprobeposition = patprobeposition
Update
Following Andrie's response, I realized I should mention that there are several "probes" in the patients dataset, so the data actually looks more like this (in fact, there would not only be probe1 and probe2, but probe1 through probe4):
mlpa <- data.frame(probe = c(12,15,18,19),
probeposition = c(100,1200,500,900) )
patients <- data.frame(patid = c('AT', 'GA', 'TT', 'AG', 'GG', 'TA'),
probe1 = c(12, 12, NA, NA, 18, 19),
probe2 = c(15, 15, NA, NA, 19, 19) )
And what I want is this:
patients <- data.frame(patid = c('AT', 'GA', 'TT', 'AG', 'GG', 'TA'),
probe1 = c(12, 12, NA, NA, 18, 19),
probe2 = c(15, 15, NA, NA, 19, 19),
position1 = c(100, 100, NA, NA, 500, 900),
position2 = c(1200, 1200, NA, NA, 900, 900))
You can do this very easily using merge, which takes two data frames and joins them on common columns or row names.
The easiest way to get merge to work is to make sure you have matching column names where those columns refer to the same information. To be specific, I have renamed your column patprobe to probe:
mlpa <- data.frame(
probe = c(12,15,18,19),
probeposition = c(100,1200,500,900)
)
patients <- data.frame(
patid = c('AT', 'GA', 'TT', 'AG', 'GG', 'TA'),
probe = c(12, 12, NA, NA, 18, 19)
)
Now you can call merge. However, note that by default merge returns only the matching rows (in database terminology, an inner join). What you want is to include all of the rows in patients (a left outer join). You do this by specifying all.x=TRUE:
merge(patients, mlpa, all.x=TRUE, sort=FALSE)
probe patid probeposition
1 12 AT 100
2 12 GA 100
3 18 GG 500
4 19 TA 900
5 NA TT NA
6 NA AG NA
Install the reshape2 package and try the following:
require(reshape2)
m.patients = melt(patients)
m.patients = merge(m.patients, mlpa,
by.x = "value",
by.y = "probe",
all = TRUE)
reshape(m.patients, direction="wide",
timevar="variable", idvar="patid")
This should give you output like the following, which can be cleaned up to match your desired output.
patid value.probe1 probeposition.probe1 value.probe2 probeposition.probe2
1 AT 12 100 15 1200
2 GA 12 100 15 1200
5 GG 18 500 19 900
7 TA 19 900 19 900
9 TT NA NA NA NA
10 AG NA NA NA NA
Update
Of course, you can also do it all with the reshape2 package as below:
m.patients = melt(patients, id.vars="patid", variable.name="time")
m.patients = melt(merge(m.patients, mlpa, by.x = "value",
by.y = "probe", all = TRUE))
dcast(m.patients, patid ~ variable + time )
Which results in:
patid value_probe1 value_probe2 probeposition_probe1 probeposition_probe2
1 AG NA NA NA NA
2 AT 12 15 100 1200
3 GA 12 15 100 1200
4 GG 18 19 500 900
5 TA 19 19 900 900
Update 2: Using Base R Reshape
You can also avoid using the reshape2 package entirely.
patients.l = reshape(patients, direction="long", idvar="patid",
varying=c("probe1", "probe2"), sep="")
reshape(merge(patients.l, mlpa, all = TRUE), direction="wide",
idvar="patid", timevar="time")
This gets you closest to your desired output:
patid probe.1 probeposition.1 probe.2 probeposition.2
1 AT 12 100 15 1200
2 GA 12 100 15 1200
5 GG 18 500 19 900
7 TA 19 900 19 900
9 TT NA NA NA NA
10 AG NA NA NA NA
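Since the update mentions that the real data runs from probe1 through probe4, the base R version generalizes by listing all four columns in varying. A hedged sketch, assuming probe3 and probe4 exist in the same format:
# probe3 and probe4 are hypothetical here, per the question's update;
# everything else works exactly as in the two-probe example.
patients.l <- reshape(patients, direction = "long", idvar = "patid",
                      varying = paste0("probe", 1:4), sep = "")
reshape(merge(patients.l, mlpa, all = TRUE), direction = "wide",
        idvar = "patid", timevar = "time")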