Reading in xlsx files that are unstructured in R

I am having issues trying to find an efficient way to read in multiple unstructured .xlsx files into R. This requires a bit of explaining, so anyone who is trying to assist can understand exactly what I am trying to do.
As suggested, I am using dput to make my dataset reproducible. The structure can be replicated with the code below:
x <- structure(list(...1 = c("Company Name", "Contact",
"Name", "Phone #", "Scope of Work", NA,
"Trees", "36\" Box Southern Live Oak (1.5\" Caliper)", "36\" Box Thornless Chilean Mesquite (1.5\" Caliper)",
NA, "DG",
"Desert Gold", "Pink Coral"
), ...2 = c("To:", "Date:", "Job Name:", "Plan Date:", "Install All Trees, Shrubs, Irrigation and Landscape Material to Meet all Landscape Plans and Specs",
NA, NA, NA, NA, NA, NA, NA, NA), ...3 = c("Contractor",
"DATE ID", "Job ID", "DATE ID", NA, NA, NA, NA, NA, NA, NA,
NA, NA), ...4 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA), ...5 = c(NA, NA, NA, NA, NA, NA, "Quantity", "20", "38",
NA, "Quantity", "26", "32"), ...6 = c(NA, NA, NA, NA, NA, NA, NA,
10, 10, NA, NA, 10, 10), ...7 = c(NA, NA, NA, NA, NA, NA, NA,
200, 380, NA, NA, 260, 320)), row.names = c(NA, -13L), class = c("tbl_df",
"tbl", "data.frame"))
The tibble will look like this, if you use the code above:
...1 ...2 ...3 ...4 ...5 ...6 ...7
<chr> <chr> <chr> <lgl> <chr> <dbl> <dbl>
1 "Company Name" To: Cont~ NA NA NA NA
2 "Contact" Date: DATE~ NA NA NA NA
3 "Name" Job ~ Job ~ NA NA NA NA
4 "Phone #" Plan~ DATE~ NA NA NA NA
5 "Scope of Work" Inst~ NA NA NA NA NA
6 NA NA NA NA NA NA NA
7 "Trees" NA NA NA Quan~ NA NA
8 "36\" Box Southern Live Oak (1.~ NA NA NA 20 10 200
9 "36\" Box Thornless Chilean Mes~ NA NA NA 38 10 380
10 NA NA NA NA NA NA NA
11 "DG" NA NA NA Quan~ NA NA
12 "Desert Gold" NA NA NA 26 10 260
13 "Pink Coral" NA NA NA 32 10 320
TLDR: These files are landscape bid forms sent to contractors. The subset x[1:5,1:3] contains information about the job, such as the job name, the date, the contractor's name, the landscape company's name, etc. Every one of the .xlsx files has exactly the same format for that subset. I would like to keep the job name but, for the purpose of this question, I will not make it the focus.
Under the x[1:5,1:3] subset, starting on x[7,1], there is a header named Trees, which is bolded on the .xlsx files. The next header is DG, which is also bolded. The values are right under the headers, so the first value for Trees is "36\" Box Southern Live Oak (1.5\" Caliper)" and the first value for DG is "Desert Gold". These values are not bolded.
It is important to stress that there are about 10-15 different headers throughout hundreds of files and the amount of values for each header can range from 1 to 100+ rows. These headers and values are always in x[,1].
I am trying to figure out how to partition the sections (DG, Trees, ...) and read them into R as their own dataframes. I think the best way to do this is to read the files into a list and then separate the sections into their own dataframes in a nested list.
Lastly, in x[,5] there are headers named Quantity, which are also bolded, and under each Quantity header there are integers that are not bolded. x[,6] is the price for each of those quantities and x[,7] is those two columns multiplied together. I am trying to preserve these numbers as well.
In the end, I am trying to have multiple tables or dataframes in R that look like so:
df1
Trees Quantity Price Totals
1 36"... 20 10 200
2 36"... 38 10 380
df2
DG Quantity Price Totals
1 Desert Gold 26 10 260
2 Pink Coral 32 10 320
I am trying to create some way to efficiently do that over hundreds of .xlsx datasets.
So far, I have created a list that has each of the excel files in it. There are 248 files in a folder that I have on my local PC. I read in each of the files into a list like so:
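library(readxl)
# List the workbooks once, so the loop and the preallocated list agree on length
files <- list.files(".", pattern = "\\.xlsx$")
excel_list <- vector(mode = "list", length = length(files))
for (i in seq_along(files)) {
  excel_list[[i]] <- read_excel(files[i], col_names = FALSE)
}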

To achieve your desired result, you first have to identify the rows containing the section headers, which, according to your example data, can be done by finding the rows containing "Quantity" in the fifth column. After doing so, we need some additional data wrangling steps to convert your data into a tidy format. Finally, we can split the data by section to achieve your desired result:
library(janitor)
library(dplyr, warn = FALSE)
library(tidyr)
tidy_data <- function(x) {
x %>%
remove_empty() %>%
mutate(is_header_row = grepl("^Quan", `...5`),
section = ifelse(is_header_row, `...1`, NA_character_)) %>%
fill(section) %>%
filter(!is.na(section), !is_header_row) %>%
select(-is_header_row) %>%
remove_empty() %>%
rename(Item = 1, Quantity = 2, Price = 3, Totals = 4)
}
xx <- tidy_data(x)
xx
#> # A tibble: 4 × 5
#> Item Quantity Price Totals section
#> <chr> <chr> <dbl> <dbl> <chr>
#> 1 "36\" Box Southern Live Oak (1.5\" Caliper)" 20 10 200 Trees
#> 2 "36\" Box Thornless Chilean Mesquite (1.5\" Ca… 38 10 380 Trees
#> 3 "Desert Gold" 26 10 260 DG
#> 4 "Pink Coral" 32 10 320 DG
xx %>%
split(., .$section) %>%
purrr::imap(function(x, y) { x %>% select(-section) %>% rename("{y}" := 1) })
#> $DG
#> # A tibble: 2 × 4
#> DG Quantity Price Totals
#> <chr> <chr> <dbl> <dbl>
#> 1 Desert Gold 26 10 260
#> 2 Pink Coral 32 10 320
#>
#> $Trees
#> # A tibble: 2 × 4
#> Trees Quantity Price Totals
#> <chr> <chr> <dbl> <dbl>
#> 1 "36\" Box Southern Live Oak (1.5\" Caliper)" 20 10 200
#> 2 "36\" Box Thornless Chilean Mesquite (1.5\" Caliper)" 38 10 380
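To carry the same idea across hundreds of files, one option is to wrap the section-splitting step in a function and apply it to every element of the list of sheets. The following is only a minimal base R sketch under stated assumptions: `sheet` is a made-up stand-in with four columns (item, quantity, price, totals) rather than the seven columns of the real files, `split_sections()` is a hypothetical helper name, and section names are assumed to be unique within a sheet.

```r
# Toy stand-in for one sheet (hypothetical data, reduced to four columns).
sheet <- data.frame(
  c1 = c("Company Name", NA, "Trees", "Oak", "Mesquite", NA, "DG", "Desert Gold"),
  c5 = c(NA, NA, "Quantity", "20", "38", NA, "Quantity", "26"),
  c6 = c(NA, NA, NA, 10, 10, NA, NA, 10),
  c7 = c(NA, NA, NA, 200, 380, NA, NA, 260),
  stringsAsFactors = FALSE
)

# Split one sheet into a named list with one data frame per section.
split_sections <- function(sheet) {
  # A section starts on every row whose quantity column reads "Quantity".
  is_header <- !is.na(sheet[[2]]) & sheet[[2]] == "Quantity"
  # cumsum() labels each row with the index of the latest header (0 = preamble).
  section_id <- cumsum(is_header)
  keep <- section_id > 0 & !is_header & !is.na(sheet[[1]])
  body <- data.frame(
    Item     = sheet[[1]][keep],
    Quantity = as.numeric(sheet[[2]][keep]),
    Price    = sheet[[3]][keep],
    Totals   = sheet[[4]][keep],
    stringsAsFactors = FALSE
  )
  section_names <- sheet[[1]][is_header]
  split(body, factor(section_names[section_id[keep]], levels = section_names))
}

# One list element per file, each holding one data frame per section.
sheet_list <- list(job_a = sheet, job_b = sheet)
nested <- lapply(sheet_list, split_sections)
nested$job_a$Trees
```

With the real data, the column positions inside `split_sections()` would have to be changed to 1, 5, 6 and 7, and `sheet_list` would be the list of sheets read from disk.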

Related

Can pivot_longer or any R function lead to a tidy dataframe with columns of different types?

Given a survey dataframe in this kind of shape :
ID  Age_1  Age_2  Age_3  Lang_1_E  Lang_1_F  Lang_2_E  Lang_2_F  Lang_3_E  Lang_3_F
1   20     25     30     English   NA        English   NA        English   French
2   21     25     47     English   French    English   NA        English   French
3   17     42     43     NA        French    NA        French    NA        French
where each row represents an interview, and the respondent answers different questions about all his/her family members.
I have to reshape the dataframe so that each row represents a person, like this:
ID  person  Age  E        F
1   1       20   English  NA
1   2       25   English  NA
1   3       30   English  French
2   1       21   English  French
2   2       25   English  NA
2   3       47   English  French
3   1       17   NA       French
3   2       42   NA       French
3   3       43   NA       French
Here is the code to create the example dataframe:
df <- tribble(
~ID, ~Age_1, ~Age_2, ~Age_3, ~Lang_1_1, ~Lang_1_2, ~Lang_2_1, ~Lang_2_2, ~Lang_3_1, ~Lang_3_2,
1, 20, 25, 30, "English", NA, "English", NA, "English", "French",
2, 21, 25, 47, "English", "French", "English", NA, "English", "French",
3, 17, 42, 43, NA, "French", NA, "French", NA, "French"
)
I would be grateful if anyone knows an easy way to achieve this.
I tried to gather and then spread again, but the mix of numeric and character columns complicates things.
Doing it for each question separately and then binding the columns would take forever given the gigantic number of questions in this survey.
pivot_longer() will work with a slight modification to the variable names. Right now, you've got the age variables as Age_<person> and the language variables as Lang_<person>_<language>. Normally, you would use a regular expression to find the variable name and long-form obs id (i.e., person in this case). For example, you might normally do (.*)_(\\d) - which would find everything before the last digit and the underscore as the variable name and the last digit as the person identifier. In your case, though, the person identifier is in the middle of the string. The setNames() line in my code is swapping the digit after the first underscore and digit after the second underscore so that the regular expression will work in the appropriate way.
library(tidyr)
library(dplyr)
df <- tribble(
~ID, ~Age_1, ~Age_2, ~Age_3, ~Lang_1_1, ~Lang_1_2, ~Lang_2_1, ~Lang_2_2, ~Lang_3_1, ~Lang_3_2,
1, 20, 25, 30, "English", NA, "English", NA, "English", "French",
2, 21, 25, 47, "English", "French", "English", NA, "English", "French",
3, 17, 42, 43, NA, "French", NA, "French", NA, "French"
)
df %>%
setNames(gsub("(.*)_(\\d)_(\\d)", "\\1_\\3_\\2", names(df))) %>%
pivot_longer(-ID, names_pattern="(.*)_(\\d)", names_to=c(".value", "person"))
#> # A tibble: 9 × 5
#> ID person Age Lang_1 Lang_2
#> <dbl> <chr> <dbl> <chr> <chr>
#> 1 1 1 20 English <NA>
#> 2 1 2 25 English <NA>
#> 3 1 3 30 English French
#> 4 2 1 21 English French
#> 5 2 2 25 English <NA>
#> 6 2 3 47 English French
#> 7 3 1 17 <NA> French
#> 8 3 2 42 <NA> French
#> 9 3 3 43 <NA> French
Created on 2023-02-16 by the reprex package (v2.0.1)
Similar to Dave's answer but using rename_with and the names_sep argument of pivot_longer:
library(dplyr)
library(tidyr)
dat |>
rename_with(~gsub("Lang_(\\d)_(.*)$", "\\2_\\1", .x), starts_with("Lang")) |>
pivot_longer(-ID, names_to = c(".value", "person"), names_sep = "_")
#> # A tibble: 9 × 5
#> ID person Age E F
#> <int> <chr> <int> <chr> <chr>
#> 1 1 1 20 English <NA>
#> 2 1 2 25 English <NA>
#> 3 1 3 30 English French
#> 4 2 1 21 English French
#> 5 2 2 25 English <NA>
#> 6 2 3 47 English French
#> 7 3 1 17 <NA> French
#> 8 3 2 42 <NA> French
#> 9 3 3 43 <NA> French
DATA
dat <- data.frame(
ID = c(1L, 2L, 3L),
Age_1 = c(20L, 21L, 17L),
Age_2 = c(25L, 25L, 42L),
Age_3 = c(30L, 47L, 43L),
Lang_1_E = c("English", "English", NA),
Lang_1_F = c(NA, "French", "French"),
Lang_2_E = c("English", "English", NA),
Lang_2_F = c(NA, NA, "French"),
Lang_3_E = c("English", "English", NA),
Lang_3_F = c("French", "French", "French")
)
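Both answers depend on rewriting the column names so that the person number ends up last. As a small illustration, the two `gsub()` patterns used above can be run on a few names by themselves; the short name vectors below are picked for the example and are not the full set of columns.

```r
# Names as produced by the tribble() call (person number, then language number).
tribble_names <- c("ID", "Age_1", "Lang_1_2", "Lang_3_1")
# Swap the two trailing digits so the person number comes last.
swapped <- gsub("(.*)_(\\d)_(\\d)", "\\1_\\3_\\2", tribble_names)
swapped

# Names as in the DATA block (person number, then language letter).
dat_names <- c("ID", "Age_1", "Lang_1_E", "Lang_3_F")
# Drop the "Lang" prefix and move the person number behind the language letter.
renamed <- gsub("Lang_(\\d)_(.*)$", "\\2_\\1", dat_names)
renamed
```

Names that do not match a pattern, such as `ID` and `Age_1`, pass through unchanged, which is why the renamed set can then be split on a single trailing `_<person>` part.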

identify words within a phrase and code as 0 or 1

I am working with utterances, statements spoken by children. From each utterance, if one or more words in the statement match a predefined list of multiple 'core' words (probably 300 words), then I want to input '1' into 'Core' (and if none, then input '0' into 'Core').
Likewise, if there are one or more words in the statement that match a different predefined list of 'fringe' words (probably 300 fringe words, which are different from the core words), then I want to input '1' into 'Fringe' (and if none, then input '0' into 'Fringe').
Basically, right now I have only the utterances, and from those I need to identify whether any word matches a core word and whether any word matches a fringe word. Here is a snippet of my data.
Core Fringe Utterance
1 NA NA small
2 NA NA small
3 NA NA where's his bed
4 NA NA there's his bed
5 NA NA there's his bed
6 NA NA is that a pillow
Thanks in advance. I've searched the archives but have had a hard time finding a solution that corresponds to my situation.
The dput() code is:
structure(list(Utterance = c("small", "small", "where's his bed", "there's his bed", "there's his bed", "is that a pillow", "what is that on his head", "hey he has his arm stuck here", "there there's it", "now you're gonna go night_night", "and that's the thing you can turn on", "yeah where's the music+box"), Core = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Fringe = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)), class = "data.frame", row.names = c(NA, -12L))
A tidyverse option could be:
library(dplyr)
library(stringr)
coreWords <- c('small', 'bed')
fringeWords <- c('head', 'night')
df %>%
mutate(Core = + str_detect(Utterance, str_c(coreWords, collapse = '|')),
Fringe = + str_detect(Utterance, str_c(fringeWords, collapse = '|')))
# Utterance Core Fringe
# 1 small 1 0
# 2 small 1 0
# 3 where's his bed 1 0
# 4 there's his bed 1 0
# 5 there's his bed 1 0
# 6 is that a pillow 0 0
# 7 what is that on his head 0 1
# 8 hey he has his arm stuck here 0 0
# 9 there there's it 0 0
# 10 now you're gonna go night_night 0 1
# 11 and that's the thing you can turn on 0 0
# 12 yeah where's the music+box 0 0
Here is a quick way to maybe solve your question (though I'm sure there are more elegant solutions)...
df <- structure(list(Utterance = c("small", "small", "where's his bed", "there's his bed", "there's his bed", "is that a pillow", "what is that on his head", "hey he has his arm stuck here", "there there's it", "now you're gonna go night_night", "and that's the thing you can turn on", "yeah where's the music+box"), Core = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Fringe = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)), class = "data.frame", row.names = c(NA, -12L))
## Define object with all the core terms:
CorePatterns <- c("his", "music", "turn")
## Define value of `df$Core` as `1` if `df$Utterance`
## contains one of the patterns in `CorePatterns`,
## otherwise, define it as `0`:
df$Core <- ifelse(grepl(paste(CorePatterns, collapse = "|"),
df$Utterance),
1, 0)
df
Utterance Core Fringe
> 1 small 0 NA
> 2 small 0 NA
> 3 where's his bed 1 NA
> 4 there's his bed 1 NA
> 5 there's his bed 1 NA
> 6 is that a pillow 0 NA
> 7 what is that on his head 1 NA
> 8 hey he has his arm stuck here 1 NA
> 9 there there's it 0 NA
> 10 now you're gonna go night_night 0 NA
> 11 and that's the thing you can turn on 1 NA
> 12 yeah where's the music+box 1 NA
You can do the same with the Fringe data.
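One caveat with both answers is that the pattern is matched anywhere in the string, so a core word like "his" also counts as found inside "this". If whole-word matching is wanted, one possible adjustment is to wrap the alternation in `\\b` word boundaries. The sketch below uses made-up example vectors (`utterances`, `core_words`), not the full data.

```r
utterances <- c("small", "this is it", "where's his bed",
                "now you're gonna go night_night")
core_words <- c("his", "bed")

# Substring match: "this" counts as a hit for "his".
loose <- as.integer(grepl(paste(core_words, collapse = "|"), utterances))

# Whole-word match: the alternation is wrapped in \\b word boundaries.
pattern <- paste0("\\b(", paste(core_words, collapse = "|"), ")\\b")
strict <- as.integer(grepl(pattern, utterances, perl = TRUE))

data.frame(utterances, loose, strict)
```

Note that `_` counts as a word character in these regular expressions, so a word list containing "night" would not match "night_night" under the strict pattern; transcripts that join words with `_` or `+` may need those characters replaced by spaces first.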

Add different time periods in R (Create new total age column)

I have time periods as specified below. I have the age in terms of different units of time (Year, Month, Weeks, Days) but would like to add the different time units to give me one total age. My issue is that the functions I am finding in R when trying to convert the time units take the year as a specific year and the month as a specific month rather than a number of years or a number of months.
Could you show me how to add say, 68 years to 5 months to 3 days and 2 hours and so on. In other words to create a column with the total age in years which I can then easily convert to the total age in months and so on?
> head(KimAge)
# A tibble: 6 x 7
ID Years Months Weeks Days Hours
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 68 5 NA NA NA
2 2 70 2 NA NA NA
3 3 NA NA NA NA NA
4 4 23 NA NA NA NA
5 5 NA NA NA 3 NA
6 6 NA NA NA NA NA
I am trying to write something like the pseudo-code below:
KimAge$TotalAge = as.Year(Years) + as.month(Months) + as.week(Weeks) + as.days(Days) + as.hour(Hours)
To create the total age column you can use the lubridate library:
library(dplyr)
library(lubridate)
KimAge <- tibble(years = c(68, 70, NA, 23, NA, NA),
months = c(5, 2, NA, NA, NA, NA),
weeks = c(NA, NA, NA, NA, NA, NA),
days = c(NA, NA, NA, NA, 3, NA),
hours = c(NA, NA, NA, NA, NA, NA))
# convert NA to zero
KimAge[is.na(KimAge)] <- 0
# time to duration
KimAge$TotalAge <- duration(year = KimAge$years,
month = KimAge$months,
week = KimAge$weeks,
day = KimAge$days,
hour = KimAge$hours)
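If a plain numeric column of total years is enough, the same sum can also be written out by hand. This is only a sketch under explicit assumptions: a year is taken as 365.25 days and a month as one twelfth of a year, the `age` data frame is a made-up stand-in for the columns in the question, and rows with no information at all are kept as NA rather than turned into an age of zero.

```r
# Hypothetical stand-in for the KimAge columns shown in the question.
age <- data.frame(
  Years  = c(68, 70, NA, 23, NA),
  Months = c(5, 2, NA, NA, NA),
  Weeks  = rep(NA_real_, 5),
  Days   = c(NA, NA, NA, NA, 3),
  Hours  = rep(NA_real_, 5)
)

# Remember which rows carry no information before replacing NA with 0.
all_missing <- rowSums(!is.na(age)) == 0
parts <- age
parts[is.na(parts)] <- 0

days_per_year <- 365.25  # assumed convention for converting days to years
age$TotalYears <- parts$Years + parts$Months / 12 +
  (parts$Weeks * 7 + parts$Days + parts$Hours / 24) / days_per_year
age$TotalYears[all_missing] <- NA

# Other units follow from the total in years.
age$TotalMonths <- age$TotalYears * 12
age
```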
If you know the birthdate and death date:
KimAge <- tibble(birth = c("1974/03/21 12:40",
"2004/9/2 00:10",
"2014/12/12 00:00",
"2012/2/1 0:0"),
death = c("2020/03/11 16:40",
"2020/7/2 14:00",
"2021/1/4 23:01",
"2012/3/2 0:0"))
KimAge$birth <- parse_date_time(KimAge$birth, "ymd H:M")
KimAge$death <- parse_date_time(KimAge$death, "ymd H:M")
KimAge$TotalAge_d <- as.duration(KimAge$death - KimAge$birth)
KimAge$TotalAge_i <- as.interval(KimAge$birth , KimAge$death)
# interval version
KimAge$years = KimAge$TotalAge_i %/% years(1)
KimAge$months = KimAge$TotalAge_i %% years(1) %/%months(1)
KimAge$days = KimAge$TotalAge_i %% years(1) %% months(1) %/% days(1)
For more information on lubridate, see https://lubridate.tidyverse.org/. In particular, read about the differences between lubridate::period() and lubridate::interval().

Computing Growth Rates

I am working on a dataset for a welfare wage subsidy program, where wages per worker are structured as follows:
df <- structure(list(wage_1990 = c(13451.67, 45000, 10301.67, NA, NA,
8726.67, 11952.5, NA, NA, 7140, NA, NA, 10301.67, 7303.33, NA,
NA, 9881.67, 5483.33, 12868.33, 9321.67), wage_1991 = c(13451.67,
45000, 10301.67, NA, NA, 8750, 11952.5, NA, NA, 7140, NA, NA,
10301.67, 7303.33, NA, NA, 9881.67, 5483.33, 12868.33, 9321.67
), wage_1992 = c(13451.67, 49500, 10301.67, NA, NA, 8750, 11952.5,
NA, NA, 7140, NA, NA, 10301.67, 7303.33, NA, NA, 9881.67, NA,
12868.33, 9321.67), wage_1993 = c(NA, NA, 10301.67, NA, NA, 8750,
11958.33, NA, NA, 7140, NA, NA, 10301.67, 7303.33, NA, NA, 9881.67,
NA, NA, 9321.67), wage_1994 = c(NA, NA, 10301.67, NA, NA, 8948.33,
11958.33, NA, NA, 7140, NA, NA, 10301.67, 7303.33, NA, NA, 9881.67,
NA, NA, 9321.67), wage_1995 = c(NA, NA, 10301.67, NA, NA, 8948.33,
11958.33, NA, NA, 7140, NA, NA, 10301.67, 7303.33, NA, NA, 9881.67,
NA, NA, 9321.67), wage_1996 = c(NA, NA, 10301.67, NA, NA, 8948.33,
11958.33, NA, NA, 7291.67, NA, NA, 10301.67, 7303.33, NA, NA,
9881.67, NA, NA, 9321.67)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -20L))
I have tried one proposed solution, which is running this code after the one above:
average_growth_rate <- apply(df, 1, function(x) {
x1 <- x[!is.na(x)]
mean(x1[-1]/x1[-length(x1)]-1)})
out <- data.frame(rowid = seq_len(nrow(df)), average_growth_rate)
out[!is.na(out$average_growth_rate),]
But I keep getting this error:
Error in dim(X) <- c(n, length(X)/n) : dims [product 60000] do not match the length of object [65051]
I want to do the following: create a variable showing the annual growth rate of wages for each worker, or the lack thereof.
The practical issue that I am facing is that each observation is in one row and while the first worker joined the program in 1990, others might have joined in say 1993 or 1992. Therefore, is there a way to apply the growth rate for each worker depending on the specific years they worked, rather than applying a general growth formula for all observations?
My expected output for each row would be having a new column
average wage growth rate
1- 15%
2- 9%
3- 12%
After running the following code to see descriptive statistics of my variable of interest:
skim(df$average_growth_rate)
I get the following result:
"Variable contains Inf or -Inf value(s) that were converted to NA.── Data Summary ────────────────────────
Values
Name gosi_beneficiary_growth$a...
Number of rows 3671
Number of columns 1
_______________________
Column type frequency:
numeric 1
________________________
Group variables None
── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────
skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
1 data 1348 0.633 Inf Inf -1 -0.450 0 0.0568
"
I am not sure why my mean and standard deviation values are Inf.
Here is one approach:
library(tidyverse)
growth <- df %>%
rowid_to_column() %>%
gather(key, value, -rowid) %>%
drop_na() %>%
arrange(rowid, key) %>%
group_by(rowid) %>%
mutate(yoy = value / lag(value)-1) %>%
summarise(average_growth_rate = mean(yoy, na.rm=T))
# A tibble: 12 x 2
rowid average_growth_rate
<int> <dbl>
1 1 0
2 2 0.05
3 3 0
4 6 0.00422
5 7 0.0000813
6 10 0.00354
7 13 0
8 14 0
9 17 0
10 18 0
11 19 0
12 20 0
And just to highlight that all these 0s are expected, here is the dataframe:
> head(df)
# A tibble: 6 x 7
wage_1990 wage_1991 wage_1992 wage_1993 wage_1994 wage_1995 wage_1996
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 13452. 13452. 13452. NA NA NA NA
2 45000 45000 49500 NA NA NA NA
3 10302. 10302. 10302. 10302. 10302. 10302. 10302.
4 NA NA NA NA NA NA NA
5 NA NA NA NA NA NA NA
6 8727. 8750 8750 8750 8948. 8948. 8948.
where you see that e.g. for the first row, there was no growth nor any decline. The second row, there was a slight increase in between the second and the third year, but it was 0 for the first and second. For the third row, again absolutely no change. Etc...
Also, finally, to add these results to the initial dataframe, you would do e.g.
df %>%
rowid_to_column() %>%
left_join(growth)
And just to answer the performance question, here is a benchmark (where I changed akrun's data.frame call to a tibble call to make sure there is no difference coming from this). All functions below correspond to creating the growth rates, not merging back to the original dataframe.
library(microbenchmark)
microbenchmark(cj(), akrun(), akrun2())
Unit: microseconds
expr min lq mean median uq max neval cld
cj() 5577.301 5820.501 6122.076 5988.551 6244.301 10646.9 100 c
akrun() 998.301 1097.252 1559.144 1160.450 1212.552 28704.5 100 a
akrun2() 2033.801 2157.101 2653.018 2258.052 2340.702 34143.0 100 b
base R is the clear winner in terms of performance.
We can use base R with apply. Loop over the rows with MARGIN = 1, remove the NA elements ('x1'), and take the mean of the ratio of each element to the previous one, minus 1:
average_growth_rate <- apply(df, 1, function(x) {
x1 <- x[!is.na(x)]
mean(x1[-1]/x1[-length(x1)]-1)})
out <- data.frame(rowid = seq_len(nrow(df)), average_growth_rate)
out[!is.na(out$average_growth_rate),]
# rowid average_growth_rate
#1 1 0.00000000000
#2 2 0.05000000000
#3 3 0.00000000000
#6 6 0.00422328325
#7 7 0.00008129401
#10 10 0.00354038282
#13 13 0.00000000000
#14 14 0.00000000000
#17 17 0.00000000000
#18 18 0.00000000000
#19 19 0.00000000000
#20 20 0.00000000000
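Or using tapply/stack (current values, tail(), divided by previous values, head(), so that it matches the apply version):
na.omit(stack(tapply(as.matrix(df), row(df), FUN = function(x)
mean(tail(na.omit(x), -1)/head(na.omit(x), -1) -1))))[2:1]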
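On the Inf mean and standard deviation reported in the question: dividing by a previous wage of 0 gives Inf in R, and a single Inf is enough to make the mean Inf, so zero wages in the full dataset are one possible cause. The sketch below shows one way to guard against that, assuming that non-finite year-over-year ratios should simply be left out of the average; `wages` is a made-up matrix, not the data from the question.

```r
# Made-up wages: steady growth, a zero starting wage, and an all-NA row.
wages <- rbind(c(100, 110, 121),
               c(NA, 0, 50),
               c(NA, NA, NA))

growth <- apply(wages, 1, function(x) {
  x1 <- x[!is.na(x)]
  # A growth rate needs at least two observed years.
  if (length(x1) < 2) return(NA_real_)
  yoy <- x1[-1] / x1[-length(x1)] - 1
  # Leave out Inf/NaN ratios caused by a previous wage of 0.
  mean(yoy[is.finite(yoy)])
})
growth
```

The second row comes out as NaN because its only ratio is non-finite. Whether such rows should be NA, NaN, or handled in some other way depends on what a wage of 0 means in the program data.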

Fill in NA column values with the last value that was not NA (na.locf by column) [duplicate]

I am cleaning my data of which the dput looks as follows.
DF <- structure(list(toberevised = c("[Money amounts are in thousands of dollars]",
NA, NA, NA, "Item", NA, NA, NA, NA, "Number of returns", "Number of joint returns",
"Number with paid preparer's signature", "Number of exemptions",
"Adjusted gross income (AGI) [3]", "Salaries and wages in AGI: [4] Number",
"Salaries and wages in AGI: Amount", "Taxable interest: Number",
"Taxable interest: Amount", "Ordinary dividends: Number", "Ordinary dividends: Amount"
), ...2 = c("UNITED STATES [2]", NA, NA, NA, "All returns", NA,
NA, "1", NA, "135257620", "52607676", "80455243", "273738434",
"7364640131", "114060887", "5161583318", "59553985", "161324824",
"31158675", "164247298"), ...3 = c(NA, NA, NA, NA, "Under", "$50,000 [1]",
NA, "2", NA, "92150166", "20743943", "53622647", "159649737",
"1797097083", "75422766", "1541276272", "28527550", "39043002",
"13174923", "23867893"), ...4 = c(NA, NA, "Size of adjusted gross income",
NA, "50000", "under", "75000", "3", NA, "18221115", "11329459",
"11025624", "44189517", "1119634632", "16299827", "896339313",
"10891905", "16353293", "5255958", "12810282"), ...5 = c(NA,
NA, NA, NA, "75000", "under", "100000", "4", NA, "10499106",
"8296546", "6260725", "28555195", "905336768", "9520214", "721137490",
"7636612", "12852148", "4095938", "11524298"), ...6 = c(NA, NA,
NA, NA, "100000", "under", "200000", "5", NA, "10797979", "9193700",
"6678965", "30919226", "1429575727", "9782173", "1083175205",
"9092673", "23160862", "5824522", "25842394"), ...7 = c(NA, NA,
NA, NA, "200000", "or more", NA, "6", NA, "3589254", "3044028",
"2867282", "10424759", "2112995921", "3035907", "919655038",
"3405245", "69915518", "2807334", "90202431")), row.names = c(NA,
-20L), class = c("tbl_df", "tbl", "data.frame"))
In the first and third rows, I would like to use something like na.locf from zoo, but filling across the columns of each row rather than down each column, so that the result is equivalent to:
DF[1,3:7] <- "UNITED STATES [2]"
DF[3,5:7] <- "Size of adjusted gross income"
Apply na.locf row-wise:
DF[] <- t(apply(DF, 1, zoo::na.locf, na.rm = FALSE))
DF
# A tibble: 20 x 7
# toberevised ...2 ...3 ...4 ...5 ...6 ...7
# <chr> <chr> <chr> <chr> <chr> <chr> <chr>
# 1 [Money amounts are in th… UNITED ST… UNITED ST… UNITED STATES … UNITED STATES … UNITED STATES … UNITED STATES…
# 2 NA NA NA NA NA NA NA
# 3 NA NA NA Size of adjust… Size of adjust… Size of adjust… Size of adjus…
# 4 NA NA NA NA NA NA NA
# 5 Item All retur… Under 50000 75000 100000 200000
# 6 NA NA $50,000 [… under under under or more
# 7 NA NA NA 75000 100000 200000 200000
# 8 NA 1 2 3 4 5 6
# 9 NA NA NA NA NA NA NA
#10 Number of returns 135257620 92150166 18221115 10499106 10797979 3589254
#11 Number of joint returns 52607676 20743943 11329459 8296546 9193700 3044028
#12 Number with paid prepare… 80455243 53622647 11025624 6260725 6678965 2867282
#13 Number of exemptions 273738434 159649737 44189517 28555195 30919226 10424759
#14 Adjusted gross income (A… 7364640131 1797097083 1119634632 905336768 1429575727 2112995921
#15 Salaries and wages in AG… 114060887 75422766 16299827 9520214 9782173 3035907
#16 Salaries and wages in AG… 5161583318 1541276272 896339313 721137490 1083175205 919655038
#17 Taxable interest: Number 59553985 28527550 10891905 7636612 9092673 3405245
#18 Taxable interest: Amount 161324824 39043002 16353293 12852148 23160862 69915518
#19 Ordinary dividends: Num… 31158675 13174923 5255958 4095938 5824522 2807334
#20 Ordinary dividends: Amou… 164247298 23867893 12810282 11524298 25842394 90202431
As suggested by @G. Grothendieck, na.locf0 is a better candidate here:
DF[] <- t(apply(DF, 1, zoo::na.locf0))
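For readers who want to see what the row-wise fill does without zoo, here is a minimal base R sketch of the same idea. `fill_right()` is a hypothetical helper written for this illustration, and `m` is a small made-up matrix rather than the DF from the question.

```r
# Carry the last non-NA value forward along a vector; leading NAs stay NA.
fill_right <- function(v) {
  # Position of the most recent non-NA value so far (0 if none yet).
  idx <- cummax(seq_along(v) * !is.na(v))
  # Prepend an NA so that position 0 maps to NA.
  c(NA, v)[idx + 1]
}

m <- rbind(c("UNITED STATES", NA, NA),
           c(NA, "Size of AGI", NA))

# apply() returns one column per row, so transpose to restore the shape.
filled <- t(apply(m, 1, fill_right))
filled
```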
