Remove rows until columns are identical over multiple data frames - r

I have 4 data frames named w, x, y, z, each with 3 columns and identical column names. I now want to remove rows until the column named Type is identical across all four data frames.
To achieve this I am using a while loop with the following code:
list_df <- list(z, w, x, y)
tmp <- lapply(list_df, `[[`, 'Type')
i <- as.integer(as.logical(all(sapply(tmp, function(x) all(x == tmp[[1]])))))
while (i == 0) {
  z <- z[(z$Type %in% x$Type), ]
  y <- y[(y$Type %in% x$Type), ]
  w <- w[(w$Type %in% x$Type), ]
  z <- z[(z$Type %in% w$Type), ]
  y <- y[(y$Type %in% w$Type), ]
  x <- x[(x$Type %in% w$Type), ]
  z <- z[(z$Type %in% y$Type), ]
  x <- x[(x$Type %in% y$Type), ]
  w <- w[(w$Type %in% y$Type), ]
  x <- x[(x$Type %in% z$Type), ]
  w <- w[(w$Type %in% z$Type), ]
  y <- y[(y$Type %in% z$Type), ]
  list_df <- list(z, w, x, y)
  tmp <- lapply(list_df, `[[`, 'Type')
  i <- as.integer(as.logical(all(sapply(tmp, function(x) all(x == tmp[[1]])))))
}
In this code, a list is created holding the Type column of every data frame. The value i then tests whether all of those columns are identical, producing 0 if false and 1 if true. The while loop deletes the rows not included in every data frame and only stops once i becomes 1.
This code works, but on bigger data it can take a long time to run. Does anybody have an idea how to simplify this?
For reproducible example:
w <- structure(list(Type = c("26809D", "28503C", "360254", "69298N",
"32708V", "680681", "329909", "696978", "32993F", "867609", "51206K",
"130747"), X1980 = c(NA, NA, NA, 271835, NA, NA, NA, NA, NA,
NA, NA, NA), X1981 = c(NA, NA, NA, 290314, NA, NA, NA, NA, NA,
NA, NA, NA)), row.names = c("2", "4", "7", "8", "10", "11", "13",
"16", "17", "21", "22", "23"), class = "data.frame")
x <- structure(list(Type = c("26809D", "28503C", "360254", "69298N",
"32708V", "680681", "329909"), X1980 = c(NA, NA, NA, 1026815,
NA, NA, NA), X1981 = c(NA, NA, NA, 826849, NA, NA, NA)), row.names = c("2",
"4", "7", "8", "10", "11", "13"), class = "data.frame")
y <- structure(list(Type = c("26809D", "28503C", "360254", "69298N",
"32708V"), X1980 = c(NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_), X1981 = c(NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_)), row.names = c("2", "4", "7", "8", "10"), class = "data.frame")
z <- structure(list(Type = c("26809D", "28503C", "360254", "69298N",
"32708V", "680681", "329909", "696978", "32993F", "867609", "51206K",
"130747", "50610H"), X1980 = c(NA, NA, NA, 0.264736101439889,
NA, NA, NA, NA, NA, NA, NA, NA, NA), X1981 = c(NA, NA, NA, 0.351108848169376,
NA, NA, NA, NA, NA, NA, NA, NA, NA)), row.names = c("2", "4",
"7", "8", "10", "11", "13", "16", "17", "21", "22", "23", "24"
), class = "data.frame")

We assume that the question is how to get the values of Type that are common to the 4 data frames, each of which has a Type column containing unique values.
Form a list L of the data frames, extract the Type column of each using lapply and [, and iterate merge over the result using Reduce:
L <- list(w, x, y, z)
L.Type <- lapply(L, "[", TRUE, "Type", drop = FALSE) # list of DFs w only Type col
Reduce(merge, L.Type)$Type
## [1] "26809D" "28503C" "32708V" "360254" "69298N"
or replace the last line with this, giving the same result except for order:
Reduce(intersect, L.Type)$Type
## [1] "26809D" "28503C" "360254" "69298N" "32708V"
Another approach, which is a bit tedious but does reduce the calculation to one line, is to manually nest intersect calls:
intersect(w$Type, intersect(x$Type, intersect(y$Type, z$Type)))
## [1] "26809D" "28503C" "360254" "69298N" "32708V"
Another example
The example data does not illustrate this well because the common values turn out to be simply the entire Type column of the smallest data frame, so let us create another example. BOD is a built-in data frame with 6 rows. We assign it to X and rename the columns so that the first one is named Type. Then for i = 1, 2, 3, 4 we remove the i-th row, giving 4 data frames with 5 rows each and 2 values of Type common to all 4. The result correctly shows that 5 and 7 are the only common Type values.
# set up input L, a list of 4 data frames
X <- BOD
names(X) <- c("Type", "X")
L <- lapply(1:4, function(i) X[-i, ])
L.Type <- lapply(L, "[", TRUE, "Type", drop = FALSE)
Reduce(merge, L.Type)$Type
## [1] 5 7

Related

Is there a good, general approach to convert semi-structured data to tibble/dataframe in R?

I am new to R programming and most of my experience thus far is with highly structured rectangular data from .csv or .xlsx files. But now I've been handed about 30 spreadsheets of budget data like the one shown as raw dput() output in the edit below (the screenshots from the original post are not reproduced here).
In order to work with them, I'd like to get them into a friendlier format with one row per class code and columns for location, budget type, and category. That's not exactly tidy, because Q1 to Q4 could/should be a single variable, but I can fix that later with pivot_longer.
Searching SO, the closest problem/solution I found was this: R importing semi-unstructured data CSV. But that example contains a series of structured tables that do not need the modification mine does; plus, it converts a text file to character vectors, whereas I have Excel workbooks with multiple worksheets (I only need one of the sheets).
Here's what I've tried so far:
library(tidyverse)
library(readxl)   # read_xlsx() comes from readxl, not the xlsx package
# Straight read in the worksheet as-is
df <- read_xlsx(path = "filename.xlsx", sheet = "worksheet", col_names = FALSE)
# Get the location name into its own column, then delete row 1 since it's not needed
df <- df %>%
  mutate(location = df[[1, 1]])
df <- df[-1, ]
# Add a column and initialize it to "empty"
df <- df %>%
  add_column(budget_type = "empty")
# Now loop through the data frame in column 1, search for the keyword(s) and place
# them in the last "budget_type" column
for (row in 1:nrow(df)) {
  print(df[[row, 1]])
  if (df[[row, 1]] %in% c("Baseline", "Scope Changes")) {
    budget_type <- df[[row, 1]]
  }
  if (!is.na(df[[row, 1]])) {
    if (str_detect(df[[row, 1]], "[0-9]{4}")) {
      df[[row, "budget_type"]] <- budget_type
    }
  }
}
# ...and from here I could write another loop going from bottom to top seeking
# the categories and placing them in another created column, and finally delete the rows
# that are empty, total rows, or unnecessary header rows.
My question is: Is there an obviously better way to do this in R in place of the loops and the general approach I'm describing to get my data in a tidy format?
Thank you in advance.
EDIT 6/7/2021:
It appears I cannot attach the Excel file, but if I'm following the "minimal reproducible example" guidelines correctly, here is the unprocessed data from dput() after reading in the data from Excel:
structure(list(...1 = c("Mehoopany", NA, "CLASS CODE", "Baseline",
NA, "0201", "0300", "0301", NA, NA, "5500", "8245", "8260", NA,
NA, "5710", "8224", "8235", NA, NA, NA, NA, "CLASS CODE", "Scope Changes",
NA, "0201", "0300", "0301", NA, NA, "5500", "8245", "8260", NA,
NA, "5710", "8224", "8235", NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA), ...2 = c(NA, NA, "Classification", NA, NA, "Cleaning, Spills, and Trash / Recycle Bin Pickup",
"Specialized Cleaning", "Window Cleaning", "Cleaning", NA, "Gen Office Exp",
"Wellness / Fitness Centers", "Meeting Rooms", "Employee Convenience",
NA, "Reception", "Photographic Services", "Mail Room Services",
"Mail Room & Reception", NA, "Total", NA, "Classification", NA,
NA, "Cleaning, Spills, and Trash / Recycle Bin Pickup", "Specialized Cleaning",
"Window Cleaning", "Cleaning", NA, "Gen Office Exp", "Wellness / Fitness Centers",
"Meeting Rooms", "Employee Convenience", NA, "Reception", "Photographic Services",
"Mail Room Services", "Mail Room & Reception", NA, "Total", NA,
NA, NA, NA, NA, NA, NA, NA, NA), ...3 = c(NA, NA, "FY2021 Phasing",
NA, "Q1", "1205", "0", "0", "1205", NA, "0", "0", "174", "174",
NA, "0", "0", "1453.625", "1453.625", NA, "2832.625", NA, "FY2021 Phasing",
NA, "Q1", "25", "0", "0", "25", NA, "0", "0", "37", "37", NA,
"0", "17", "0", "17", NA, "79", NA, NA, NA, NA, NA, NA, NA, NA,
NA), ...4 = c(NA, NA, NA, NA, "Q2", "1205", "0", "0", "1205",
NA, "0", "0", "174", "174", NA, "0", "0", "1453.625", "1453.625",
NA, "2832.625", NA, NA, NA, "Q2", "25", "0", "0", "25", NA, "0",
"0", "37", "37", NA, "0", "17", "0", "17", NA, "79", NA, NA,
NA, NA, NA, NA, NA, NA, NA), ...5 = c(NA, NA, NA, NA, "Q3", "1205",
"0", "0", "1205", NA, "0", "0", "174", "174", NA, "0", "0", "1453.625",
"1453.625", NA, "2832.625", NA, NA, NA, "Q3", "25", "0", "0",
"25", NA, "0", "0", "37", "37", NA, "0", "17", "0", "17", NA,
"79", NA, NA, NA, NA, NA, NA, NA, NA, NA), ...6 = c(NA, NA, NA,
NA, "Q4", "1205", "0", "0", "1205", NA, "0", "0", "174", "174",
NA, "0", "0", "1453.625", "1453.625", NA, "2832.625", NA, NA,
NA, "Q4", "25", "0", "0", "25", NA, "0", "0", "37", "37", NA,
"0", "17", "0", "17", NA, "79", NA, NA, NA, NA, NA, NA, NA, NA,
NA), ...7 = c(NA, NA, NA, NA, "Total", "4820", "0", "0", "4820",
NA, "0", "0", "696", "696", NA, "0", "0", "5814.5", "5814.5",
NA, "11330.5", NA, NA, NA, "Total", "100", "0", "0", "100", NA,
"0", "0", "148", "148", NA, "0", "68", "0", "68", NA, "316",
NA, NA, NA, NA, NA, NA, NA, NA, NA)), row.names = c(NA, -50L), class = c("tbl_df",
"tbl", "data.frame"))
Here is the script I used -- it works -- with explanatory comments:
library(tidyverse)
library(readxl)   # read_xlsx() comes from readxl, not the xlsx package
file <- "C:/Path/To/Book1.xlsx"
names <- c("class_code", "classification", "Q1", "Q2", "Q3", "Q4",
           "Total", "Location", "Budget_Type", "Category")
# Read in the file, setting range to restrict columns ingested as some
# scrap work exists in some files beyond column G; row 50 is well beyond
# expected data range.
dframe <- read_xlsx(path = file, range = "'Original Data'!A1:G50",
                    col_names = FALSE)
# Output above data so it can be included in Stack Overflow as a 'minimal
# reproducible example'.
dput(dframe)
# Move the location name ('Mehoopany') to its own column, then delete the row.
dframe <- dframe %>%
  mutate(location = dframe[[1, 1]])
dframe <- dframe[-1, ]
# Add/define two additional columns which will be used in loops below.
dframe <- dframe %>%
  add_column(budget_type = "empty", category = "empty")
# Loop 1: Move *DOWN* the data set, labeling every line with 'CLASS CODE' as
# being either "Baseline" or "Scope Changes" in 'budget_type' column.
for (row in 1:nrow(dframe)) {
  if (dframe[[row, 1]] %in% c("Baseline", "Scope Changes")) {
    budget_type <- dframe[[row, 1]]
  }
  if (!is.na(dframe[[row, 1]])) {
    if (str_detect(dframe[[row, 1]], "[0-9]{4}")) {
      dframe[[row, "budget_type"]] <- budget_type
    }
  }
}
# Loop 2: Move *UP* the data set, labeling every line with a 'CLASS CODE' with
# its respective roll-up category, and otherwise delete the line.
for (row in nrow(dframe):1) {
  if (is.na(dframe[[row, 2]]) ||
      dframe[[row, 2]] == "Total" ||
      dframe[[row, 2]] == "Classification") {
    # delete rows where the 2nd column is <blank>, 'Classification', or 'Total'
    # (the is.na() test comes first so the equality tests never see an NA).
    dframe <- dframe[-row, ]
  } else {
    if (!is.na(dframe[[row, 2]]) && is.na(dframe[[row, 1]])) {
      # if the row has no 'CLASS CODE' but has a value in the 2nd column,
      # assign that value to category, then delete the row entirely.
      category <- dframe[[row, 2]]
      dframe <- dframe[-row, ]
    } else if (str_detect(dframe[[row, 1]], "[:digit:]{4}")) {
      # if the row has a 'CLASS CODE', label the category column with the
      # stored value.
      dframe[[row, "category"]] <- category
    }
  }
}
# Assign the names from the character vector set at the beginning.
names(dframe) <- names
# Print out the resulting dataframe.
dframe
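As a possible answer to "is there an obviously better way": the two row-wise loops can be replaced by vectorized dplyr/tidyr steps. The following is only a sketch, untested against the real workbook, and assumes the layout matches the dput above (section markers "Baseline"/"Scope Changes" in column 1; category labels sit in column 2 below their class-code rows):
library(tidyverse)
library(readxl)
raw <- read_xlsx(path = file, range = "'Original Data'!A1:G50", col_names = FALSE)
tidy <- raw %>%
  mutate(location = first(.[[1]])) %>%   # cell A1 holds the location
  slice(-1) %>%
  mutate(
    # section markers start a new budget_type
    budget_type = if_else(.[[1]] %in% c("Baseline", "Scope Changes"),
                          .[[1]], NA_character_),
    # rows with no class code but a label in column 2 carry the category,
    # which applies to the class-code rows *above* them
    category = if_else(is.na(.[[1]]) & !is.na(.[[2]]) & .[[2]] != "Total",
                       .[[2]], NA_character_)
  ) %>%
  fill(budget_type, .direction = "down") %>%
  fill(category, .direction = "up") %>%
  filter(str_detect(.[[1]], "^[0-9]{4}$"))   # keep only the detail rows
names(tidy)[1:7] <- c("class_code", "classification",
                      "Q1", "Q2", "Q3", "Q4", "Total")
The forward fill copies each section marker down to its detail rows and the backward fill copies each category label up, mirroring the two loops above.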

R read excel until you reach a certain criteria

I have a messy Excel file that I need to read in as-is, but I want to read the file only until it hits the row that says "Projects as usual".
This value will always be in the first column, and no other string in that column will ever match it. I also don't want any of the information below it in the other columns, because it makes my numeric columns be read in as strings (see the example below with score).
For example, we can pretend this is the excel file:
library(tidyverse)
messy_excel <- tibble(id = c("1", "2", NA, NA, "Projects as usual", NA),
                      name = c("Joe", "Justin", NA, NA, NA, "Other info I don't want"),
                      score = c("50", "20", NA, NA, NA, "This shouldn't show"))
And this is what I want:
library(tidyverse)
beautiful_excel <- tibble(id = c("1", "2"),
                          name = c("Joe", "Justin"),
                          score = c(50, 20))
Thank you!
Edit:
Based on @G5W's suggestion, I filtered for where the value was, taking inspiration from this answer: How to find row number of a value in R code
Specifically, I assigned each row a row number, detected where the target string was, and removed that row and the ones below it. I then used the retype() function from the hablar package to fix the column types.
library(tidyverse)
library(hablar)
messy_excel <- tibble(id = c("1", "2", NA, NA, "Projects as usual", NA),
                      name = c("Joe", "Justin", NA, NA, NA, "Other info I don't want"),
                      score = c("50", "20", NA, NA, NA, "This shouldn't show"))
# Give each line a row number
messy_excel$row_num <- seq.int(nrow(messy_excel))
# Identify the row where the garbage starts
messy_row <- which(grepl("Projects as usual", messy_excel$id))
# Remove the garbage row and everything below it, drop the row_num column,
# correct the column types, and remove the rows that are all NA
clean_excel <- messy_excel %>%
  filter(row_num < messy_row) %>%
  dplyr::select(-row_num) %>%
  retype() %>%
  na.omit()
glimpse(clean_excel)
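An alternative that avoids retype() is to read the sheet twice: a first pass reads everything as text just to locate the sentinel row, and a second pass re-reads only the rows above it, so readxl can guess the numeric column types itself. A sketch (the file name messy.xlsx is made up; n_max counts data rows after the header):
library(readxl)
# Pass 1: everything as text, just to find the sentinel
peek <- read_xlsx("messy.xlsx", col_types = "text")
cutoff <- which(peek[[1]] == "Projects as usual")
# Pass 2: stop before the sentinel; readxl now sees score as numeric
clean <- read_xlsx("messy.xlsx", n_max = cutoff - 1)
# Drop the all-NA padding rows that sat between the data and the sentinel
clean <- clean[rowSums(!is.na(clean)) > 0, ]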

linear regression model with dplyr on specified columns by name

I have the following data frame, each row containing four dates ("y") and four measurements ("x"):
df = structure(list(x1 = c(69.772808673525, NA, 53.13125414839,
17.3033274666411,
NA, 38.6120670385487, 57.7229000792707, 40.7654208618078, 38.9010405201831,
65.7108936694177), y1 = c(0.765671296296296, NA, 1.37539351851852,
0.550277777777778, NA, 0.83037037037037, 0.0254398148148148,
0.380671296296296, 1.368125, 2.5250462962963), x2 = c(81.3285388496182,
NA, NA, 44.369872853302, NA, 61.0746827226573, 66.3965114460601,
41.4256874481852, 49.5461413070349, 47.0936997726146), y2 =
c(6.58287037037037,
NA, NA, 9.09377314814815, NA, 7.00127314814815, 6.46597222222222,
6.2462962962963, 6.76976851851852, 8.12449074074074), x3 = c(NA,
60.4976916064608, NA, 45.3575294731303, 45.159758146854, 71.8459173097114,
NA, 37.9485456227131, 44.6307631013742, 52.4523342186143), y3 = c(NA,
12.0026157407407, NA, 13.5601157407407, 16.1213657407407, 15.6431018518519,
NA, 15.8986805555556, 13.1395138888889, 17.9432638888889), x4 = c(NA,
NA, NA, 57.3383407228293, NA, 59.3921356160536, 67.4231673171527,
31.853845252547, NA, NA), y4 = c(NA, NA, NA, 18.258125, NA,
19.6074768518519,
20.9696527777778, 23.7176851851852, NA, NA)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -10L))
I would like to create an additional column containing the slope of all the y's versus all the x's, for each row (each row is a patient with these 4 measurements).
Here is what I have so far:
df <- df %>% mutate(Slope = lm(vars(starts_with("y") ~
vars(starts_with("x"), data = .)
I am getting an error:
invalid type (list) for variable 'vars(starts_with("y"))'...
What am I doing wrong, and how can I calculate the rowwise slope?
You are using a tidyverse syntax but your data is not tidy...
Maybe you should rearrange your data.frame and rethink the way you store your data.
Here is how to do it in a quick and dirty way (at least if I understood your explanations correctly):
df <- merge(reshape(df[, (1:4)*2 - 1], dir = "long", varying = list(1:4),
                    v.names = "x", idvar = "patient"),
            reshape(df[, (1:4)*2], dir = "long", varying = list(1:4),
                    v.names = "y", idvar = "patient"))
df$patient <- factor(df$patient)
Then you could loop over the patients, perform a linear regression and get the slopes as a vector:
sapply(levels(df$patient), function(pat) {
  coef(lm(y ~ x, df[df$patient == pat, ], na.action = "na.omit"))[2]
})
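If you prefer to keep the original wide format, here is a rowwise sketch that fits one lm per patient without reshaping. It assumes the wide df from before the merge/reshape step, with columns named x1..x4 and y1..y4 as in the dput above:
xs <- as.matrix(df[paste0("x", 1:4)])
ys <- as.matrix(df[paste0("y", 1:4)])
df$Slope <- sapply(seq_len(nrow(df)), function(i) {
  ok <- complete.cases(xs[i, ], ys[i, ])
  # at least two complete (x, y) pairs are needed to fit a line
  if (sum(ok) >= 2) coef(lm(ys[i, ok] ~ xs[i, ok]))[2] else NA_real_
})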

R: Problem with calculating mean accuracy with strsplit function

I am trying to calculate the accuracy of participants' response (columns EQ_R and MEM_R) based on the correct response (columns EQ_C and MEM_C).
dput(example)
structure(list(TRIAL = c("1", "2", "3", "4", "5", "6", "7", "8",
"9", "10", "11", "12", "13", "14", "15"), EQ_C = c("0101", "1010",
"1010", "00111", "01011", "01101", "100011", "010101", "001101",
"0110011", "1101001", "1100101", "11100001", "11001010", "11001010"
), EQ_R = c("0101", "0010", "1010", "00111", "01011", "01101",
"10101", "11010", "001101", "0100011", "1101001", "0100101",
"11110001", "11001010", "11001010"), MEM_C = c("ZLHK", "RZKX",
"DGWL", "BCJSP", "WRKTJ", "CHBXS", "HNDCWX", "SWVNDT", "WLDGPB",
"DSHRKBV", "HCXLZWB", "HDNBVZC", "BCRHKVDM", "RVTBWKFS", "NWHVZFLD"
), MEM_R = c("ZLHK", "RZKX", "DGWL", "BCJSP", "WRKLTJ", "CHBXS",
"HNDCWX", "SWVDTN", "WLDGPB", "DSHRKBV", "HCXLZWB", "HDNBVZC",
"BCRHKVDM", "RVTBWKFS", "NWHVZFLD"), EQ_SUM = c(NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), MEM_SUM = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)),
row.names = c(NA, 15L), class = "data.frame")
I added new columns for the "sum"/accuracy scores that need to be calculated for the binary data (EQ) and the letters (MEM); the data frame is called OSPAN in the code below.
OSPAN["EQ_SUM"] <- NA
OSPAN["MEM_SUM"] <- NA
I then tried to calculate the accuracy with strsplit, but I receive errors:
mean(strsplit(OSPAN$MEM_C, "") == strsplit(OSPAN$MEM_R, ""))
Error in strsplit(OSPAN$MEM_C, "") == strsplit(OSPAN$MEM_R, "") : comparison of these types is not implemented
In addition:
Warning messages:
1: In strsplit(OSPAN$MEM_R, "") : input string 342 is invalid UTF-8
2: In strsplit(OSPAN$MEM_R, "") : input string 580 is invalid UTF-8
My question is:
How do I match/calculate the accuracy or congruence between predictor (C) and actual (R) values into the sum columns?
For instance, in row #1, EQ_SUM would be 1 (or 100%), whereas it would be 0.75 or 75% in #2, as the participant chose the wrong answer (0 instead of 1). Thus, partial credit scores are given, and it is not a matter of absolute match/congruence.
Thank you in advance.
One possibility could be using the RecordLinkage library:
with(df, levenshteinSim(EQ_C, EQ_R))
[1] 1.0000000 0.7500000 1.0000000 1.0000000 1.0000000 1.0000000 0.6666667 0.6666667
[9] 1.0000000 0.8571429 1.0000000 0.8571429 0.8750000 1.0000000 1.0000000
It calculates the similarity between the two strings as 1 minus the Levenshtein distance divided by the length of the longer string; for row 2, "1010" vs "0010" is one substitution over four characters, giving 1 - 1/4 = 0.75.
I'm sure there is a more efficient way; however, you can compare the strings character by character and add the result to your data frame.
for (i in 1:nrow(OSPAN)) {
  eq_c  <- strsplit(OSPAN$EQ_C[i],  "", useBytes = TRUE)[[1]]
  eq_r  <- strsplit(OSPAN$EQ_R[i],  "", useBytes = TRUE)[[1]]
  mem_c <- strsplit(OSPAN$MEM_C[i], "", useBytes = TRUE)[[1]]
  mem_r <- strsplit(OSPAN$MEM_R[i], "", useBytes = TRUE)[[1]]
  OSPAN$EQ_SUM[i]  <- sum(eq_c == eq_r) / length(eq_c)
  OSPAN$MEM_SUM[i] <- sum(mem_c == mem_r) / length(mem_c)
}
On the other hand, there are cases where the two strings have different lengths; what do we do with them?
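One way to handle that, going beyond the answers above: pad the shorter split with NA so that extra or missing characters count as mismatches. char_acc below is a made-up helper, not part of any package:
char_acc <- function(correct, response) {
  a <- strsplit(correct,  "", useBytes = TRUE)[[1]]
  b <- strsplit(response, "", useBytes = TRUE)[[1]]
  n <- max(length(a), length(b))
  # indexing past the end yields NA, so length differences count as mismatches
  sum(a[seq_len(n)] == b[seq_len(n)], na.rm = TRUE) / n
}
OSPAN$EQ_SUM  <- mapply(char_acc, OSPAN$EQ_C,  OSPAN$EQ_R)
OSPAN$MEM_SUM <- mapply(char_acc, OSPAN$MEM_C, OSPAN$MEM_R)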

r: create data frame with all possible options and number of variable combinations

This question might be obvious or asked already, but I can't find a solution:
I want to create a data frame with all possible combinations (and number of variables) such that it looks like the following example:
dataframe <- data.frame(variable = 1:4,
                        a = c("gender", NA, NA, NA),
                        b = c("age", NA, NA, NA),
                        c = c("city", NA, NA, NA),
                        d = c("education", NA, NA, NA),
                        e = c("gender", "age", NA, NA),
                        f = c("gender", "city", NA, NA),
                        g = c("gender", "education", NA, NA),
                        h = c("age", "city", NA, NA),
                        i = c("age", "education", NA, NA),
                        j = c("city", "education", NA, NA),
                        k = c("gender", "age", "city", NA),
                        l = c("gender", "age", "education", NA),
                        m = c("gender", "city", "education", NA),
                        n = c("gender", "age", "city", "education"))
I have too many variables, so it's not worth writing it out, and I want to avoid errors. Thank you for helping!
Here is an option with combn. Get the vector of variable names, loop through the sequence of that vector, apply combn to the vector with m set to the current loop index, convert each result to a data.frame, and cbind all the list elements together. cbind.fill from the rowr package is suitable here because it pads list elements that have fewer rows than the tallest data.frame with NA.
library(rowr)
res <- do.call(cbind.fill, c(fill = NA, lapply(seq_along(v1), function(i) {
  m1 <- combn(v1, i)
  if (is.vector(m1)) as.data.frame.list(m1) else as.data.frame(m1)
})))
colnames(res) <- letters[seq_along(res)]
Or as #Moody_Mudskipper suggested,
res1 <- do.call(cbind.fill, c(fill = NA, lapply(seq_along(v1), function(i) combn(v1, i))))
colnames(res1) <- letters[seq_len(ncol(res1))]
data
v1 <- c('gender', 'age', 'city', 'education')
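For completeness, a base-R sketch that produces the same shape without the rowr dependency, padding each combination with NA up to the longest one. Note that the full set has 15 columns; the example in the question accidentally omits the age/city/education triple:
v1 <- c("gender", "age", "city", "education")
# all combinations of 1, 2, 3, and 4 variables, as one flat list of 15 vectors
combos <- unlist(lapply(seq_along(v1),
                        function(i) combn(v1, i, simplify = FALSE)),
                 recursive = FALSE)
n <- max(lengths(combos))
# pad each combination with NA to length n, then bind as columns
res <- data.frame(variable = seq_len(n),
                  sapply(combos, function(x) c(x, rep(NA, n - length(x)))))
names(res)[-1] <- letters[seq_along(combos)]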
