Cleaning xlsx files - r

I am trying to wrangle messy large datasets from xlsx sheets. The table structures are such that the column headers are a combination of three rows.
I am using RStudio and trying to write a function that fills empty header cells with the value from the previous filled cell, and then concatenates the three header rows into one final column header joined by underscores: e.g. Employment, Number, Males on three different rows should become Employment_Number_Males.
Any suggestions?
Please see the sample xlsx table I am working with.

Taking this data.frame:
df <- data.frame(..1 = c("year", NA, NA),
..2 = c(NA, "males", "all"),
..3 = c(NA, NA, "half"),
..4 = c(NA, NA, "some"),
..5 = c(NA, "females", "all"),
..6 = c(NA, NA, "half"),
..7 = c(NA, NA, "some"))
Here is an attempt to convert empty cells to NA:
# convert empty cells to NA
empty_as_na <- function(x){
if("factor" %in% class(x)) x <- as.character(x) ## since ifelse wont work with factors
ifelse(as.character(x)!="", x, NA)}
# transform all columns (mutate_each() and funs() are deprecated; across() replaces them)
library(dplyr)
df <- df %>% mutate(across(everything(), empty_as_na))
# apply function
na.rows <- which( apply(df, 1, function(z) (all(is.na(z)) ) ) )
df[na.rows , ] <- df[na.rows-1, ]
The issue is filling each empty cell with the value of the cell beside it.
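A minimal sketch of the fill-and-concatenate idea described above, in base R. The header matrix `hdr` and the helper `fill_forward()` are hypothetical names made up for illustration (the Employment/Number/Males labels follow the example in the question, the rest is invented), and the real sheet layout may differ. Each header row is filled left to right, then the non-NA parts of each column are pasted together with underscores.

```r
# Hypothetical three-row header block, as it might come out of an xlsx sheet
# with merged cells (only the first cell of each merged range is filled).
hdr <- rbind(
  c(NA,     "Employment", NA,        NA,      NA),
  c(NA,     "Number",     NA,        "Rate",  NA),
  c("Year", "Males",      "Females", "Males", "Females")
)

# Carry the last non-NA value forward along a vector; leading NAs stay NA.
fill_forward <- function(x) {
  idx <- cumsum(!is.na(x))       # position of the most recent non-NA value
  c(NA, x[!is.na(x)])[idx + 1]   # idx == 0 maps to the NA placed in front
}

# Fill every header row from left to right.
filled <- t(apply(hdr, 1, fill_forward))

# Paste the non-NA parts of each column into one header.
new_names <- apply(filled, 2, function(col) paste(col[!is.na(col)], collapse = "_"))
new_names
# [1] "Year"                      "Employment_Number_Males"
# [3] "Employment_Number_Females" "Employment_Rate_Males"
# [5] "Employment_Rate_Females"
```

The resulting vector could then be assigned as the column names of the data once the three header rows have been dropped.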

Related

How can I left-join two datasets multiple times with each time using a different variable as a key?

I am combining two dataframes using left_join as I need to keep all the rows from X and only matched ones from Y. However, there is no one column which provides a perfect match. Instead there is 1 column in X which partially matches 2 columns in Y, and a second column in X which partially matches a third column in Y. Only by matching on all 3 can I adequately merge the data since they all match on different combinations of rows (with some overlap).
My strategy so far has been to do three separate left_join commands, each with a different match. Doing this enables me to match about 95% of the rows (the first match only gets about 70%). However, this leaves me with three versions of each column from Y in the merged data. I have tried using paste and a number of other ways to combine them, but either it hasn't worked or the new columns aren't helpful as they are concatenated (e.g. showing "12345 NA 12345" or "NA NA NA"). I need it to show only the first non-NA result, as all non-NA results will be identical. So for the two examples above I would want to return only "12345" and "NA".
So I think I either need to figure out how to do a left_join on multiple columns (This doesn't work, but something like:
left_join(X, Y, by = (c("Column1" = Column1) OR c("Column1" = "Column2") OR c( Column 2, Y = "Column 3")).
Or, less elegantly, I just need to figure out how to reformat the merged/ pasted column to keep only the first non-NA result.
In the example below, X has 2 columns and 5 rows and Y has 4 columns and 8 rows (5 of which match the rows in X). The only way to merge X and Y fully is to match X$Column1 with Y$Column1 OR Y$Column2, OR X$Column2 with Y$Column3. (In the real data there are around 50,000 rows in X and also lots of other columns in X and Y.) The desired output from the below should be the five rows from X (which also have matches in Y), along with the corresponding values in Y$Column4.
X <- data.frame(Column1 = c(10, 150, 3550, 9421, 22000),
                Column2 = c("Dog", "Cat", "Bird", "Rat", "Fox"))
Y <- data.frame(Column1 = c(NA, 453, NA, NA, 3550, 9421, NA, 4200),
                Column2 = c(22, NA, 10, 150, 3550, NA, NA, 4200),
                Column3 = c(NA, "Badger", "Dog", NA, NA, NA, "Fox", "Mouse"),
                Column4 = c(NA, 4500, 12345, 54, NA, 5555, 321, 65))
From the above I would want to return 5 rows (one for each of the rows in X), along with 3 columns (the 2 original columns of X plus the 4th column of Y). The other three columns in Y are only useful for matching. As in the above example, in my data there is no way of completing a full (or close to full) match without joining on all three matches. I have been searching for ways to do this for ages with no luck, but I'm quite new to R, so sorry if I'm being stupid.
My code at the moment:
merged_pvga <- left_join(merged_pvga, sherpa, by = c("issn1" = "issn_print"))
merged_pvga <- left_join(merged_pvga, sherpa, by = c("issn1" = "issn_electronic"))
merged_pvga <- left_join(merged_pvga, sherpa, by = (c("journal_title" = "title")))
merged_pvga$id_all <- paste(merged_pvga$id.x, merged_pvga$id.y, merged_pvga$id)
merged_pvga$subject_all <- paste(merged_pvga$subject.x, merged_pvga$subject.y, merged_pvga$subject)
etc. for other columns
Here is an example that left-joins Y with X twice, first on ColumnA and then on ColumnB.
Note:
Y and X share two variables, ColumnA and ColumnB, so after each left join you want to combine the columns that were not used as the join key (for example, after joining on ColumnA, combine the ColumnB columns of the two datasets).
Be sure you know what you want to do about potential overlap that may emerge from joining twice on two different variables. The example below prioritizes the first left join: rows of X already matched in the first join are excluded from the second.
library(dplyr)
X = tibble(id_x=1:5)
Y = tibble(id_y=1:8)
X$ColumnA = c(10, 150, 3550, 9421, 22000)
X$ColumnB = c('Dog', 'Cat', 'Bird', 'Rat', 'Fox')
Y$ColumnA = c(NA, 453, NA, NA, 3550, 9421, NA, 4200)
Y$Column2 = c(22, NA, 10, 150, 3550, NA, NA, 4200)
Y$ColumnB = c(NA, 'Badger', 'Dog', NA, NA, NA, 'Fox', 'Mouse')
Y$Column4 = c(NA, 4500, 12345, 54, NA, 5555, 321, 65)
replace_na_with_blank <- function(df, varnames) {
for (varname in varnames) {
df[is.na(df[[varname]]), varname] <- ""
}
return(df)
}
concat_columns <- function(df, v1, v2) {
idx_na <- df[[v1]]==""
df[[v1]][idx_na] <- paste(df[[v1]][idx_na], df[[v2]][idx_na], sep='')
df[[v2]] <- NULL
return(df)
}
concat_columns_num <- function(df, v1, v2) {
idx_na <- is.na(df[[v1]])
df[[v1]][idx_na] <- df[[v2]][idx_na]
df[[v2]] <- NULL
return(df)
}
merged_1 <- left_join(Y, X, by = c("ColumnA" = "ColumnA"), suffix=c("",".x"))
merged_1 = replace_na_with_blank(merged_1, c("ColumnB","ColumnB.x"))
merged_1 <- concat_columns(merged_1, "ColumnB", "ColumnB.x")
merged_1 # rows matched in the first join have a non-NA "id_x"
merged_2 <- left_join(merged_1, X %>% filter(!(id_x %in% merged_1$id_x)),
by = c("ColumnB" = "ColumnB"), suffix=c("",".x"))
merged_2 <- concat_columns_num(merged_2, "ColumnA", "ColumnA.x")
merged_2 # rows matched in the second join have a non-NA "id_x.x"
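The fallback idea from the question, keeping only the first non-NA result across the three joins, can be sketched with dplyr's coalesce(). This is only one possible approach, not the method used in the answer above. The lookup tables by1, by2, by3 and the helper columns v1, v2, v3 are names made up for illustration, and the sketch assumes each key value appears at most once in its lookup table (otherwise the joins would duplicate rows of X).

```r
library(dplyr)

X <- tibble(Column1 = c(10, 150, 3550, 9421, 22000),
            Column2 = c("Dog", "Cat", "Bird", "Rat", "Fox"))
Y <- tibble(Column1 = c(NA, 453, NA, NA, 3550, 9421, NA, 4200),
            Column2 = c(22, NA, 10, 150, 3550, NA, NA, 4200),
            Column3 = c(NA, "Badger", "Dog", NA, NA, NA, "Fox", "Mouse"),
            Column4 = c(NA, 4500, 12345, 54, NA, 5555, 321, 65))

# One small lookup table per key; rows with a missing key are dropped
# so that NA never matches NA.
by1 <- Y %>% filter(!is.na(Column1)) %>% select(Column1, v1 = Column4)
by2 <- Y %>% filter(!is.na(Column2)) %>% select(Column2, v2 = Column4)
by3 <- Y %>% filter(!is.na(Column3)) %>% select(Column3, v3 = Column4)

result <- X %>%
  left_join(by1, by = "Column1") %>%
  left_join(by2, by = c("Column1" = "Column2")) %>%
  left_join(by3, by = c("Column2" = "Column3")) %>%
  mutate(Column4 = coalesce(v1, v2, v3)) %>%  # first non-NA of the three
  select(Column1, Column2, Column4)

result$Column4
# [1] 12345    54    NA  5555   321
```

The third row stays NA because the only Y row matching 3550 has no Column4 value. The same pattern extends to further Y columns by carrying them through each lookup table and coalescing them pairwise.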

Tidy several variables with different keys at once in r

I have some trouble tidying my data. I have a table with 10 peptide sequences and I have recorded their abundance mean, standard deviation and coefficient of variation across three samples: Reference, ZAP02 and ZAP02_GA.
The initial table is 10x10:
example <- data.frame(
Sequence = c("YVVDTSK","EALDFFAR","VLGIDGGEGKEELFR","VLGIDGGEGK","DIPVPKPK","IGDYAGIK", "DWVQAVR","DNIEPILK","LLDGTVVSR","NQETSEEYQIK"),
Reference = c(1098144.12, 41276.04, 172023.14, 399734.69, 1242669.19, 1585792.75, 1676065.88, 2152511.00, 60473.17, 768250.31),
Reference_SD = c(48098.6407, 888.9603, 8572.5207, 2475.0947, 92398.6154, 287270.7919, 71968.6762, 73495.9717, 5610.4587, 52914.2146),
Reference_CV = c( 4.3799934, 2.1536957, 4.9833532, 0.6191844, 7.4354958, 18.1152797, 4.2939050, 3.4144296, 9.2776003, 6.8876268),
ZAP02_GA = c( NaN, NaN, 1788.838, NaN, 1298.561, NaN, NaN, 1926.935, NaN, NaN),
ZAP02_GA_SD = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
ZAP02_GA_CV = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
ZAP02 = c( NaN, NaN, 2286.836, NaN, 12303.839, NaN, 2535.902, 2806.022, NaN, NaN),
ZAP02_SD = c( NA, NA, 1393.2599, NA, NA, NA, NA, 218.3964, NA, NA),
ZAP02_CV = c(NA, NA, 60.925235, NA, NA, NA, NA, 7.783131, NA, NA))
I would like to tidy my data to have one column for the means, another for the SD values and another for the CV. At the end, I should have a table with 30 rows and 5 variables (Sequence, Sample, Abundance, Standard Deviation and CV).
I tried to use the gather() function for this purpose, but in the end I got a very long and confusing data frame.
example_tidy <- example %>%
gather(Reference, ZAP02_GA, ZAP02,
key = "Sample",
value = "Abundance") %>%
gather(Reference_SD, ZAP02_GA_SD, ZAP02_SD,
key = "Sample",
value = "Standard deviation") %>%
gather(Reference_CV, ZAP02_GA_CV, ZAP02_CV,
key = "Sample",
value = "CV (%)")
To get what I want, I had to gather the means, SDs and CVs in separate data frames and then column-bind them. But this turned out to be tedious and time consuming.
example_mean <- example %>%
gather(Reference, ZAP02_GA, ZAP02,
key = "Sample",
value = "Abundance")
example_sd <- example %>%
gather(Reference_SD, ZAP02_GA_SD, ZAP02_SD,
key = "Sample",
value = "Standard deviation")
example_cv <- example %>%
gather(Reference_CV, ZAP02_GA_CV, ZAP02_CV,
key = "Sample",
value = "CV (%)")
example_tidy2 <- cbind(select(example_mean, Sequence, Sample, Abundance),
"Standard deviation" = example_sd$`Standard deviation`,
"CV (%)" = example_cv$`CV (%)`)
Is there a simpler way to do this? Can you do a gather() with several keys?
Thank you in advance for your help.
This is a typical case that pivot_longer() can handle.
library(dplyr)
library(tidyr)
example %>%
rename_at(vars(-matches("Seq|SD|CV")), paste0, "_Abundance") %>%
pivot_longer(-Sequence, names_to = c("Sample", ".value"), names_pattern = "(.*)_(.*)")
# # A tibble: 30 x 5
# Sequence Sample Abundance SD CV
# <fct> <chr> <dbl> <dbl> <dbl>
# 1 YVVDTSK Reference 1098144. 48099. 4.38
# 2 YVVDTSK ZAP02_GA NaN NA NA
# 3 YVVDTSK ZAP02 NaN NA NA
# 4 EALDFFAR Reference 41276. 889. 2.15
# 5 EALDFFAR ZAP02_GA NaN NA NA
# 6 EALDFFAR ZAP02 NaN NA NA
# 7 VLGIDGGEGKEELFR Reference 172023. 8573. 4.98
# 8 VLGIDGGEGKEELFR ZAP02_GA 1789. NA NA
# 9 VLGIDGGEGKEELFR ZAP02 2287. 1393. 60.9
# 10 VLGIDGGEGK Reference 399735. 2475. 0.619
# … with 20 more rows
The term .value has a special meaning in pivot_longer(). See ?pivot_longer for more details and work through its Examples section.
Parentheses divide a string into multiple groups. The structure of names_pattern corresponds to the elements of names_to. Take ZAP02_SD for example. The first (.*) extracts ZAP02 and puts it into the Sample column. The second (.*) extracts SD and turns it into the name of a new column, which is what .value is for.
More information about pattern matching can be found by searching for "regular expression" or "regex". The Wikipedia article on regular expressions is a good resource for beginners. All the special symbols used in this answer, such as "|", "(", "." and "*", are explained there.
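To see how the two groups of "(.*)_(.*)" split a column name, the same pattern can be tried with base R's sub(). This is only an illustration of the regular expression, not of what pivot_longer() does internally; the vector `nms` and the result names are made up for the example. Because the first (.*) is greedy, the split happens at the last underscore, which is why ZAP02_GA_SD yields the sample ZAP02_GA.

```r
# Column names as they look after "_Abundance" has been appended to the means.
nms <- c("Reference_Abundance", "ZAP02_GA_SD", "ZAP02_CV")

# "\\1" keeps the first group (the sample), "\\2" the second (the measure).
sample_part  <- sub("(.*)_(.*)", "\\1", nms)
measure_part <- sub("(.*)_(.*)", "\\2", nms)

sample_part
# [1] "Reference" "ZAP02_GA"  "ZAP02"
measure_part
# [1] "Abundance" "SD"        "CV"
```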
You can achieve the desired outputs without splitting each variable in a new dataframe, but I think there still needs to be an intermediary step involved - although my solution might not be the most elegant.
If your sample/variable names were more consistent, I would have used separate after the first gather to split e.g. Reference_CV into a Sample (Reference) and Measure (CV) column, but because means were not named and you have sample names containing underscores, I used regular expressions to select them.
First step gathers all the values (regardless what type of values they are) into one value column.
step1 <- gather(example, key = "Sample", value = "value", 2:10)
Then I create a "measure" column that gets filled based on information pulled from the sample name, and tidy up the "Sample" column to remove that information. (Here is where someone could chip in with a more elegant and widely applicable solution, but that's all I could come up with based on your naming conventions.)
step1 <- step1 %>% mutate(
measure = case_when(
grepl("_CV", Sample) ~ "CV",
grepl("_SD", Sample) ~ "SD",
!grepl("_CV", Sample) & !grepl("_SD", Sample) ~ "Abundance"
),
Sample = case_when(
grepl("Reference", Sample) ~ "Reference",
grepl("ZAP02_GA", Sample) ~ "ZAP02_GA",
grepl("ZAP02", Sample) ~ "ZAP02"
)
)
And finally I spread the resulting data frame to put the measures back into their own columns: Abundance, CV and SD.
output <- spread(step1, key = measure, value = value)
dim(output)
# [1] 30  5
You can condense all of this in one long pipe, but I thought it would be easier to demonstrate the steps like this. Hope that helps!
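Condensed into a single pipe, the three steps might look roughly like the sketch below. The data frame `toy` is a small invented stand-in for `example`, and the Sample clean-up uses sub() with a suffix pattern instead of listing every sample name, which is an assumption about the naming scheme (measure suffixes are always _SD or _CV at the end of the name).

```r
library(dplyr)
library(tidyr)

# Invented miniature version of the table: 2 sequences, 2 samples.
toy <- data.frame(
  Sequence     = c("AAA", "BBB"),
  Reference    = c(10, 20),
  Reference_SD = c(1, 2),
  Reference_CV = c(10, 10),
  ZAP02_GA     = c(5, NA),
  ZAP02_GA_SD  = c(0.5, NA),
  ZAP02_GA_CV  = c(10, NA)
)

output <- toy %>%
  # 1. gather every value column into one long "value" column
  gather(key = "Sample", value = "value", -Sequence) %>%
  # 2. derive the measure from the suffix, then strip the suffix from Sample
  mutate(
    measure = case_when(
      grepl("_CV$", Sample) ~ "CV",
      grepl("_SD$", Sample) ~ "SD",
      TRUE                  ~ "Abundance"
    ),
    Sample = sub("_(SD|CV)$", "", Sample)
  ) %>%
  # 3. spread the measures back into their own columns
  spread(key = measure, value = value)

dim(output)
# [1] 4 5
```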

Extract the first x observations in each column while keeping the indexing by row in R

The following code is designed to extract the first x observations of each column; the columns are time series spanning different periods. (Equivalently, it erases everything except the first x values in each column.)
The first values can be numbers followed by NAs, as long as they are at the beginning of the time series.
It is crucial that each value stays linked to its own place in the index (the first column, 'Year').
# data example
df <- data.frame("Year" = 1791:1800,
"F1" = c(NA, NA, NA, 1.2,1.3, NA, NA, NA, NA, NA),
"F2" = c(NA, NA, 2.1, 2.2, 2.3, 2.4, 2.5, NA, NA, NA),
"F3" = c(NA, NA, NA, NA, NA, 0.1,0.2,0.3,0.4,0.5),
"F4" = c(NA, 3.1,3.2,3.3,3.4,3.5,3.6,3.7,3.8,3.9))
# Convert the dataframe to a list by column
long <- setNames(lapply(names(df)[-1], function(x) cbind(df[1], df[x])), names(df)[-1])
# and select only the first 3 elements after NAs in each column
mylist <- lapply(long, function(x){
head(na.omit(x), 3)
})
# or in a more concise writing ??
mylist2 <- lapply(df, function(x){
head(na.omit(cbind(df[[1]],x)), 3)
})
# Now 'mylist' (or 'mylist2') contains several vectors of different lengths,
# not very appropriate for dataframe, let's switch to long format dataframe
mydata <- do.call(rbind, lapply(mylist, function(x){
require(reshape2)
melt(x, id.vars="Year")
})
)
# and switch back to regular spreadsheet format
library(tidyverse)
mydataCOL <- spread(mydata, key = "variable", value = "value")
write.table(mydataCOL, "sheet1.txt")
This is complicated to apply to a list of data frames (multiple Excel files). Is there an easier way to achieve this, i.e. to perform these operations on each column of each data frame in the list? :)
I'm currently trying with 'nested' lapply() :
mylist <- lapply(d, function(x){
lapply(x, function(y){
head(na.omit(cbind(x[[1]],y)), 50)
})
})
but this is not the easiest way I guess... Thanks !
If you are using the tidyverse anyway, why not go all in with Hadley's stuff?
GetTop <- function(indf){
indf %>%
pivot_longer(-Year,names_to="F") %>%
na.omit() %>%
group_by(F) %>%
top_n(3,wt=-Year) %>%
pivot_wider(names_from="F")
}
Now we can call it for one data frame:
mytops <- GetTop(df)
If you have a list of these dataframes you can use lapply to do this to each one.
allmytop <- lapply(biglist,FUN=GetTop)
That will give you a list of dataframes. Seems like you also want to join them into one fat dataframe.
fatdf <- lapply(biglist,FUN=GetTop) %>% reduce(full_join,by="Year")
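For reference, here is a self-contained variant of the same idea that can be run as is. It is a sketch, not a replacement for the answer's code: the function name get_first_n() and the list `biglist` are made up, slice_min() is used in place of top_n(), and an explicit ungroup() is added before widening.

```r
library(dplyr)
library(tidyr)

df <- data.frame(Year = 1791:1800,
                 F1 = c(NA, NA, NA, 1.2, 1.3, NA, NA, NA, NA, NA),
                 F2 = c(NA, NA, 2.1, 2.2, 2.3, 2.4, 2.5, NA, NA, NA),
                 F3 = c(NA, NA, NA, NA, NA, 0.1, 0.2, 0.3, 0.4, 0.5),
                 F4 = c(NA, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9))

# Keep the n earliest non-NA observations of every series, indexed by Year.
get_first_n <- function(indf, n = 3) {
  indf %>%
    pivot_longer(-Year, names_to = "series") %>%
    filter(!is.na(value)) %>%
    group_by(series) %>%
    slice_min(Year, n = n) %>%   # the n smallest years within each series
    ungroup() %>%
    arrange(Year) %>%
    pivot_wider(names_from = "series", values_from = "value")
}

out <- get_first_n(df)

# The same function applied to a (hypothetical) list of data frames.
biglist <- list(sheet_a = df, sheet_b = df)
all_out <- lapply(biglist, get_first_n)
```

With the example data, `out` keeps the years 1792 to 1798, and each series has at most three non-NA values.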

R get row number of the first row that has a string variable in a data frame column

I am working with data frames that are dynamically generated.
df1 <- structure(list(`4` = c(NA, NA, "Location", NA), `5` = c(NA, NA,
"Size", "W")), row.names = c(NA, 4L), class = "data.frame")
The above looks like this:
4 5
1 <NA> <NA>
2 <NA> <NA>
3 Location Size
4 <NA> W
From each column in the data frame I want to get the first character variable. For example from the above table, I want to retrieve Location and Size and use them as my column header.
Since the tables are dynamically generated, I am not sure in which line the string variable would appear.
An option is to loop through the columns, get the first non-NA element with summarise_all
library(dplyr)
df1 %>%
summarise_all(~ .[!is.na(.)][1])
Or with sapply, use the same logic
sapply(df1, function(x) x[!is.na(x)][1])
Or with which on logical matrix (!is.na(df1)), subset the data, get the first element of each column by filtering out the duplicate column index
ind <- which(!is.na(df1), arr.ind = TRUE)
df1[ind][!duplicated(ind[,2])]
#[1] "Location" "Size"
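Since the title asks for the row number as well, the same logic can return the position instead of the value by swapping the subsetting for which(). A small base R sketch, with `first_row` and `first_value` as illustrative names:

```r
df1 <- data.frame(`4` = c(NA, NA, "Location", NA),
                  `5` = c(NA, NA, "Size", "W"),
                  check.names = FALSE, stringsAsFactors = FALSE)

# Row number of the first non-NA entry in each column.
first_row <- vapply(df1, function(x) which(!is.na(x))[1], integer(1))

# The corresponding values, used as the new column headers.
first_value <- vapply(df1, function(x) x[!is.na(x)][1], character(1))
names(df1) <- first_value

first_row
# 4 5
# 3 3
names(df1)
# [1] "Location" "Size"
```

A column that is entirely NA would give NA for both the row number and the value, so it may be worth checking for that before renaming.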

Replacing NA values using a rolling window

How can I replace a NA value by the average of the previous non-NA and next non-NA values?
For example, I want to replace the first NA value by -0.873, and the 4th and 5th by the average of -0.497 and 53.200.
Thanks!
t <- c(NA, -0.873, -0.497, NA, NA, 53.200, NA, NA, NA, 26.100)
=================== ADD ON ===================
Thank you all for answering the question! Sorry for the late response. This is only a part of a data frame (10000 * 91), and I only took the first 10 rows from the first column in order to simplify the question. I think David and MKR have the result that I expected.
Here's a possible vectorized approach using base R (some steps could be probably improved but I have no time to look into it right now)
x <- c(NA, -0.873, -0.497, NA, NA, 53.200, NA, NA, NA, 26.100)
# Store a boolean vector of NA locations for further use
na_vals <- is.na(x)
# Find the location of each NA relative to the non-NAs (index of the previous non-NA)
start_ind <- findInterval(which(na_vals), which(!na_vals))
# Create the right limit (index of the next non-NA)
end_ind <- start_ind + 1L
# Replace zero locations with NAs
start_ind[start_ind == 0L] <- NA_integer_
# Calculate the means and replace the NAs
x[na_vals] <- rowMeans(cbind(x[!na_vals][start_ind], x[!na_vals][end_ind]), na.rm = TRUE)
x
# [1] -0.8730 -0.8730 -0.4970 26.3515 26.3515 53.2000 39.6500 39.6500 39.6500 26.1000
This should work properly for NAs on both sides of the vector.
This function imputes values for NA in a vector based on the average of the non-NA values in a rolling window from the first element to the next element.
t <- c(NA, -0.873, -0.497, NA, NA, 53.200, NA, NA, NA, 26.100)
roll_impute <- function(x){
n <- length(x)
res <- x
for (i in seq_along(x)){
if (is.na(x[i])){
res[i] <- mean(rep_len(x, i+1), na.rm = TRUE )
}
}
if (is.na(x[n])) res[n] <- mean(x, na.rm = TRUE)
res
}
roll_impute(t)
# [1] -0.87300 -0.87300 -0.49700 -0.68500 17.27667 53.20000 17.27667 17.27667 19.48250
# [10] 26.10000
roll_impute() includes code that corrects the rolling window in the case that the final element is NA, so that the vector isn't recycled. This isn't the case in your example, but is needed in order to generalize the function. Any improvements on this function would be welcome :) It does use a for loop, but doesn't grow any vectors. No simple way to avoid the for loop and rely on the structure of the objects jumps to my mind right now.
One dplyr and tidyr based solution could be:
library(dplyr)
library(tidyr)
t <- c(NA, -0.873, -0.497, NA, NA, 53.200, NA, NA, NA, 26.100)
data.frame(t) %>%
mutate(last_nonNA = ifelse(!is.na(t), t, NA)) %>%
mutate(next_nonNA = ifelse(!is.na(t), t, NA)) %>%
fill(last_nonNA) %>%
fill(next_nonNA, .direction = "up") %>%
mutate(t = case_when(
!is.na(t) ~ t,
!is.na(last_nonNA) & !is.na(next_nonNA) ~ (last_nonNA + next_nonNA)/2,
is.na(last_nonNA) ~ next_nonNA,
is.na(next_nonNA) ~ last_nonNA
)
) %>%
select(t)
# t
# 1 -0.8730
# 2 -0.8730
# 3 -0.4970
# 4 26.3515
# 5 26.3515
# 6 53.2000
# 7 39.6500
# 8 39.6500
# 9 39.6500
# 10 26.1000
Note: It looks a bit complicated, but it does the trick. One can achieve the same thing with a for loop.
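A for-loop version of the same rule (average of the previous and next non-NA values, falling back to the single available neighbour at the edges) might look like the sketch below. The function name impute_neighbors() is made up for illustration, and on a 10000 x 91 data frame a loop like this will likely be slower than the vectorized approaches above.

```r
impute_neighbors <- function(x) {
  known <- which(!is.na(x))   # positions of the observed values
  out <- x
  for (i in which(is.na(x))) {
    prev_idx <- known[known < i]
    next_idx <- known[known > i]
    # Collect whichever neighbours exist (one at the edges, two in the middle).
    neighbors <- c(
      if (length(prev_idx)) x[max(prev_idx)],
      if (length(next_idx)) x[min(next_idx)]
    )
    if (length(neighbors)) out[i] <- mean(neighbors)
  }
  out
}

x <- c(NA, -0.873, -0.497, NA, NA, 53.200, NA, NA, NA, 26.100)
impute_neighbors(x)
# [1] -0.8730 -0.8730 -0.4970 26.3515 26.3515 53.2000 39.6500 39.6500 39.6500 26.1000
```

For a data frame, the function could be applied column by column, for example with df[] <- lapply(df, impute_neighbors), assuming all columns are numeric.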
