I am new to R programming and most of my experience thus far is with using highly structured rectangular data from a .csv or .xlsx. But now I've been handed about 30 spreadsheets of budget data that look like this:
To work with them, I'd like to get them into a friendlier format (not exactly tidy, because Q1 to Q4 could/should be a single variable, but I can fix that later with pivot_longer), like this:
Searching SO, the closest problem/solution I found was this: R importing semi-unstructured data CSV. However, that example contains a series of structured tables that do not require the modification mine does. It also reads a text file into character vectors, whereas I have Excel workbooks with multiple worksheets (I only need one of the sheets).
Here's what I've tried so far:
library(tidyverse)
library(readxl)  # read_xlsx() comes from readxl, not from the xlsx package
# Straight read in the worksheet as-is
df <- read_xlsx(path = "filename.xlsx", sheet = "worksheet", col_names = FALSE)
# Get the location name into its own column, then delete row 1 since it's not needed
df <- df %>%
  mutate(location = df[[1, 1]])
df <- df[-1, ]
# Add a column and initialize it to "empty"
df <- df %>%
  add_column(budget_type = "empty")
# Now loop through the dataframe in Column 1, search for the keyword(s) and place
# them in the last "budget_type" column
for (row in 1:nrow(df)) {
  print(df[[row, 1]])
  if (df[[row, 1]] %in% c("Baseline", "Scope Changes")) {
    budget_type <- df[[row, 1]]
  }
  if (!is.na(df[[row, 1]])) {
    if (str_detect(df[[row, 1]], "[0-9]{4}")) {
      df[[row, "budget_type"]] <- budget_type
    }
  }
}
# ...and from here I could write another loop going from bottom to top seeking
# the categories and placing them in another created column, and finally delete the rows
# that are empty, total rows, or unnecessary header rows.
My question is: Is there an obviously better way to do this in R in place of the loops and the general approach I'm describing to get my data in a tidy format?
Thank you in advance.
EDIT 6/7/2021:
It appears I cannot attach the Excel file, but if I'm following the "minimal reproducible example" guidelines correctly, here is the unprocessed data from dput() after reading in the data from Excel:
structure(list(...1 = c("Mehoopany", NA, "CLASS CODE", "Baseline",
NA, "0201", "0300", "0301", NA, NA, "5500", "8245", "8260", NA,
NA, "5710", "8224", "8235", NA, NA, NA, NA, "CLASS CODE", "Scope Changes",
NA, "0201", "0300", "0301", NA, NA, "5500", "8245", "8260", NA,
NA, "5710", "8224", "8235", NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA), ...2 = c(NA, NA, "Classification", NA, NA, "Cleaning, Spills, and Trash / Recycle Bin Pickup",
"Specialized Cleaning", "Window Cleaning", "Cleaning", NA, "Gen Office Exp",
"Wellness / Fitness Centers", "Meeting Rooms", "Employee Convenience",
NA, "Reception", "Photographic Services", "Mail Room Services",
"Mail Room & Reception", NA, "Total", NA, "Classification", NA,
NA, "Cleaning, Spills, and Trash / Recycle Bin Pickup", "Specialized Cleaning",
"Window Cleaning", "Cleaning", NA, "Gen Office Exp", "Wellness / Fitness Centers",
"Meeting Rooms", "Employee Convenience", NA, "Reception", "Photographic Services",
"Mail Room Services", "Mail Room & Reception", NA, "Total", NA,
NA, NA, NA, NA, NA, NA, NA, NA), ...3 = c(NA, NA, "FY2021 Phasing",
NA, "Q1", "1205", "0", "0", "1205", NA, "0", "0", "174", "174",
NA, "0", "0", "1453.625", "1453.625", NA, "2832.625", NA, "FY2021 Phasing",
NA, "Q1", "25", "0", "0", "25", NA, "0", "0", "37", "37", NA,
"0", "17", "0", "17", NA, "79", NA, NA, NA, NA, NA, NA, NA, NA,
NA), ...4 = c(NA, NA, NA, NA, "Q2", "1205", "0", "0", "1205",
NA, "0", "0", "174", "174", NA, "0", "0", "1453.625", "1453.625",
NA, "2832.625", NA, NA, NA, "Q2", "25", "0", "0", "25", NA, "0",
"0", "37", "37", NA, "0", "17", "0", "17", NA, "79", NA, NA,
NA, NA, NA, NA, NA, NA, NA), ...5 = c(NA, NA, NA, NA, "Q3", "1205",
"0", "0", "1205", NA, "0", "0", "174", "174", NA, "0", "0", "1453.625",
"1453.625", NA, "2832.625", NA, NA, NA, "Q3", "25", "0", "0",
"25", NA, "0", "0", "37", "37", NA, "0", "17", "0", "17", NA,
"79", NA, NA, NA, NA, NA, NA, NA, NA, NA), ...6 = c(NA, NA, NA,
NA, "Q4", "1205", "0", "0", "1205", NA, "0", "0", "174", "174",
NA, "0", "0", "1453.625", "1453.625", NA, "2832.625", NA, NA,
NA, "Q4", "25", "0", "0", "25", NA, "0", "0", "37", "37", NA,
"0", "17", "0", "17", NA, "79", NA, NA, NA, NA, NA, NA, NA, NA,
NA), ...7 = c(NA, NA, NA, NA, "Total", "4820", "0", "0", "4820",
NA, "0", "0", "696", "696", NA, "0", "0", "5814.5", "5814.5",
NA, "11330.5", NA, NA, NA, "Total", "100", "0", "0", "100", NA,
"0", "0", "148", "148", NA, "0", "68", "0", "68", NA, "316",
NA, NA, NA, NA, NA, NA, NA, NA, NA)), row.names = c(NA, -50L), class = c("tbl_df",
"tbl", "data.frame"))
Here is the script I used -- it works -- with explanatory comments:
library(tidyverse)
library(readxl)  # read_xlsx() comes from readxl, not from the xlsx package
file <- "C:/Path/To/Book1.xlsx"
names <- c("class_code", "classification", "Q1", "Q2", "Q3", "Q4",
"Total", "Location", "Budget_Type", "Category")
# Read in the file, setting range to restrict columns ingested as some
# scrap work exists in some files beyond column G; row 50 is well beyond
# expected data range.
dframe <- read_xlsx(path = file, range = "'Original Data'!A1:G50",
col_names = FALSE)
# Output above data so it can be included in Stack Overflow as a 'minimal
# reproducible example'.
dput(dframe)
# Move the location name ('Mehoopany') to its own column, then delete the row.
dframe <- dframe %>%
  mutate(location = dframe[[1, 1]])
dframe <- dframe[-1, ]
# Add/define two additional columns which will be used in loops below.
dframe <- dframe %>%
  add_column(budget_type = "empty", category = "empty")
# Loop 1: Move *DOWN* the data set, labeling every line with 'CLASS CODE' as
# being either "Baseline" or "Scope Changes" in 'budget_type' column.
for (row in 1:nrow(dframe)) {
  if (dframe[[row, 1]] %in% c("Baseline", "Scope Changes")) {
    budget_type <- dframe[[row, 1]]
  }
  if (!is.na(dframe[[row, 1]])) {
    if (str_detect(dframe[[row, 1]], "[0-9]{4}")) {
      dframe[[row, "budget_type"]] <- budget_type
    }
  }
}
# Loop 2: Move *UP* the data set, labeling every line with 'CLASS CODE' with
# its respective roll-up category, and otherwise delete the line.
for (row in nrow(dframe):1) {
  # The NA check comes first so the string comparisons never see NA.
  if (is.na(dframe[[row, 2]]) ||
      dframe[[row, 2]] == "Total" ||
      dframe[[row, 2]] == "Classification") {
    # Delete rows where the 2nd column is <blank>, 'Classification', or 'Total'.
    dframe <- dframe[-row, ]
  } else {
    if (!is.na(dframe[[row, 2]]) && is.na(dframe[[row, 1]])) {
      # If the row has no 'CLASS CODE' but has a value in the 2nd column,
      # store that value as the category, then delete the row entirely.
      category <- dframe[[row, 2]]
      dframe <- dframe[-row, ]
    } else if (str_detect(dframe[[row, 1]], "[:digit:]{4}")) {
      # If the row has a 'CLASS CODE', label the category column with the
      # stored value.
      dframe[[row, "category"]] <- category
    }
  }
}
# Assign the names from the character vector set at the beginning.
names(dframe) <- names
# Print out the resulting dataframe.
dframe
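As one possible alternative to the two loops, here is a minimal sketch of a fill-down / fill-up approach in base R. It runs on a made-up toy table (`raw`, with hypothetical columns `code` and `label`) standing in for the first two columns of the worksheet, and `fill_down()` / `fill_up()` are illustrative helpers, not functions from any package. With the tidyverse loaded, tidyr::fill() with `.direction = "down"` or `"up"` should be able to play the same role.

```r
# Toy stand-in for columns 1 and 2 of the worksheet (values are made up).
raw <- data.frame(
  code  = c("Baseline", NA, "0201", "0300", NA, "5500", NA,
            "Scope Changes", NA, "0201", NA),
  label = c(NA, NA, "Spills", "Windows", "Cleaning", "Office",
            "Employee Convenience", NA, NA, "Spills", "Cleaning"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: carry the last non-NA value downward.
fill_down <- function(x) {
  idx <- cumsum(!is.na(x))          # how many non-NA values seen so far
  c(NA, x[!is.na(x)])[idx + 1]      # position 1 is the "nothing seen yet" NA
}
# Hypothetical helper: carry the next non-NA value upward.
fill_up <- function(x) rev(fill_down(rev(x)))

# Rows holding a 4-digit class code are the ones to keep (grepl gives FALSE for NA).
is_code <- grepl("^[0-9]{4}$", raw$code)

# Budget type: the keyword rows, carried down to the code rows below them.
is_type <- raw$code %in% c("Baseline", "Scope Changes")
raw$budget_type <- fill_down(ifelse(is_type, raw$code, NA_character_))

# Category: roll-up rows (no code, but a label), carried up to the rows above.
is_rollup <- is.na(raw$code) & !is.na(raw$label) &
  !raw$label %in% c("Total", "Classification")
raw$category <- fill_up(ifelse(is_rollup, raw$label, NA_character_))

tidy <- raw[is_code, ]
tidy
```

The same steps would apply to the real data by swapping `raw$code` and `raw$label` for the first two columns of `dframe`; that mapping is an assumption based on the dput() output above.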
I have the following data frame, each row containing four dates ("y") and four measurements ("x"):
df = structure(list(x1 = c(69.772808673525, NA, 53.13125414839,
17.3033274666411,
NA, 38.6120670385487, 57.7229000792707, 40.7654208618078, 38.9010405201831,
65.7108936694177), y1 = c(0.765671296296296, NA, 1.37539351851852,
0.550277777777778, NA, 0.83037037037037, 0.0254398148148148,
0.380671296296296, 1.368125, 2.5250462962963), x2 = c(81.3285388496182,
NA, NA, 44.369872853302, NA, 61.0746827226573, 66.3965114460601,
41.4256874481852, 49.5461413070349, 47.0936997726146), y2 =
c(6.58287037037037,
NA, NA, 9.09377314814815, NA, 7.00127314814815, 6.46597222222222,
6.2462962962963, 6.76976851851852, 8.12449074074074), x3 = c(NA,
60.4976916064608, NA, 45.3575294731303, 45.159758146854, 71.8459173097114,
NA, 37.9485456227131, 44.6307631013742, 52.4523342186143), y3 = c(NA,
12.0026157407407, NA, 13.5601157407407, 16.1213657407407, 15.6431018518519,
NA, 15.8986805555556, 13.1395138888889, 17.9432638888889), x4 = c(NA,
NA, NA, 57.3383407228293, NA, 59.3921356160536, 67.4231673171527,
31.853845252547, NA, NA), y4 = c(NA, NA, NA, 18.258125, NA,
19.6074768518519,
20.9696527777778, 23.7176851851852, NA, NA)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -10L))
I would like to create an additional column containing the slope of all the y's versus all the x's, for each row (each row is a patient with these 4 measurements).
Here is what I have so far:
df <- df %>% mutate(Slope = lm(vars(starts_with("y") ~
vars(starts_with("x"), data = .)
I am getting an error:
invalid type (list) for variable 'vars(starts_with("y"))'...
What am I doing wrong, and how can I calculate the rowwise slope?
You are using a tidyverse syntax but your data is not tidy...
Maybe you should rearrange your data.frame and rethink the way you store your data.
Here is how to do it in a quick and dirty way (at least if I understood your explanations correctly):
df <- merge(
  reshape(df[, (1:4) * 2 - 1], direction = "long", varying = list(1:4), v.names = "x", idvar = "patient"),
  reshape(df[, (1:4) * 2], direction = "long", varying = list(1:4), v.names = "y", idvar = "patient")
)
df$patient <- factor(df$patient)
Then you could loop over the patients, perform a linear regression and get the slopes as a vector:
sapply(levels(df$patient), function(pat) {
  coef(lm(y ~ x, df[df$patient == pat, ], na.action = "na.omit"))[2]
})
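For comparison, here is a minimal sketch that skips the reshape and computes the least-squares slope of y on x directly for each row, using slope = cov(x, y) / var(x). The helper `row_slope()` and the toy data are made up for illustration. It assumes the x and y columns come in matching order (x1 with y1, and so on) and returns NA for rows with fewer than two complete pairs.

```r
# Hypothetical helper: least-squares slope of y on x for one patient.
row_slope <- function(x, y) {
  ok <- !is.na(x) & !is.na(y)            # keep complete (x, y) pairs only
  if (sum(ok) < 2) return(NA_real_)      # a slope needs at least two points
  cov(x[ok], y[ok]) / var(x[ok])
}

# Toy data in the same wide layout as the question (values are made up).
toy <- data.frame(x1 = c(1, NA, 0), y1 = c(2, NA, 1),
                  x2 = c(2, 5, 1),  y2 = c(4, 7, 1),
                  x3 = c(3, NA, 2), y3 = c(6, NA, 1))

# Split into x and y matrices; column order must pair x1 with y1, etc.
x_mat <- as.matrix(toy[, grep("^x", names(toy))])
y_mat <- as.matrix(toy[, grep("^y", names(toy))])

toy$Slope <- sapply(seq_len(nrow(toy)),
                    function(i) row_slope(x_mat[i, ], y_mat[i, ]))
toy
```

If all x values in a row are identical, var(x) is zero and the result is not a finite slope; the sketch does not handle that case specially.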
I am trying to calculate the accuracy of participants' response (columns EQ_R and MEM_R) based on the correct response (columns EQ_C and MEM_C).
dput(example)
structure(list(TRIAL = c("1", "2", "3", "4", "5", "6", "7", "8",
"9", "10", "11", "12", "13", "14", "15"), EQ_C = c("0101", "1010",
"1010", "00111", "01011", "01101", "100011", "010101", "001101",
"0110011", "1101001", "1100101", "11100001", "11001010", "11001010"
), EQ_R = c("0101", "0010", "1010", "00111", "01011", "01101",
"10101", "11010", "001101", "0100011", "1101001", "0100101",
"11110001", "11001010", "11001010"), MEM_C = c("ZLHK", "RZKX",
"DGWL", "BCJSP", "WRKTJ", "CHBXS", "HNDCWX", "SWVNDT", "WLDGPB",
"DSHRKBV", "HCXLZWB", "HDNBVZC", "BCRHKVDM", "RVTBWKFS", "NWHVZFLD"
), MEM_R = c("ZLHK", "RZKX", "DGWL", "BCJSP", "WRKLTJ", "CHBXS",
"HNDCWX", "SWVDTN", "WLDGPB", "DSHRKBV", "HCXLZWB", "HDNBVZC",
"BCRHKVDM", "RVTBWKFS", "NWHVZFLD"), EQ_SUM = c(NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), MEM_SUM = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)), row.names
= c(NA,
15L), class = "data.frame")
I added a new column for the "sum"/accuracy scores that need to be calculated for the binary data (EQ) and letters (MEM).
OSPAN["EQ_SUM"] <- NA
OSPAN["MEM_SUM"]<- NA
I then tried to calculate the accuracy with strsplit, but I receive error notifications.
mean(strsplit(OSPAN$MEM_C, "") == strsplit(OSPAN$MEM_R, ""))
Error in strsplit(OSPAN$MEM_C, "") == strsplit(OSPAN$MEM_R, "") : comparison of these types is not implemented
In addition:
Warning messages:
1: In strsplit(OSPAN$MEM_R, "") : input string 342 is invalid UTF-8
2: In strsplit(OSPAN$MEM_R, "") : input string 580 is invalid UTF-8
My question is:
How do I match/calculate the accuracy or congruence between predictor (C) and actual (R) values into the sum columns?
For instance, in row #1, EQ_SUM would be 1 (or 100%), whereas it would be 0.75 or 75% in #2, as the participant chose the wrong answer (0 instead of 1). Thus, partial credit scores are given, and it is not a matter of absolute match/congruence.
Thank you in advance.
One possibility could be using the RecordLinkage library:
library(RecordLinkage)
with(df, levenshteinSim(EQ_C, EQ_R))
[1] 1.0000000 0.7500000 1.0000000 1.0000000 1.0000000 1.0000000 0.6666667 0.6666667
[9] 1.0000000 0.8571429 1.0000000 0.8571429 0.8750000 1.0000000 1.0000000
It calculates the similarity between the two strings using the Levenshtein distance.
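If installing RecordLinkage is not an option, a similar score can be sketched with adist() from base R's utils package. This assumes the similarity is defined as 1 minus the edit distance divided by the length of the longer string; `lev_sim()` is a hypothetical helper name. Note that an edit-distance score is not the same thing as position-by-position agreement, since a single inserted character costs only one edit.

```r
# Hypothetical helper: Levenshtein-based similarity for paired strings.
lev_sim <- function(a, b) {
  # adist() returns a distance matrix; take the single cell for each pair.
  d <- mapply(function(x, y) adist(x, y)[1, 1], a, b, USE.NAMES = FALSE)
  1 - d / pmax(nchar(a), nchar(b))
}

correct  <- c("0101", "1010")
response <- c("0101", "0010")
sim <- lev_sim(correct, response)
sim
```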
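I'm sure there is a more efficient way; however, you can compare the strings character by character, row by row, and add the result to your data frame.
for (i in seq_len(nrow(OSPAN))) {
  # Split each pair of strings into single characters once per row.
  eq_c  <- strsplit(OSPAN$EQ_C[i],  "", useBytes = TRUE)[[1]]
  eq_r  <- strsplit(OSPAN$EQ_R[i],  "", useBytes = TRUE)[[1]]
  mem_c <- strsplit(OSPAN$MEM_C[i], "", useBytes = TRUE)[[1]]
  mem_r <- strsplit(OSPAN$MEM_R[i], "", useBytes = TRUE)[[1]]
  # Proportion of positions that match, relative to the correct string.
  OSPAN$EQ_SUM[i]  <- sum(eq_c == eq_r) / length(eq_c)
  OSPAN$MEM_SUM[i] <- sum(mem_c == mem_r) / length(mem_c)
}
Note, however, that in some rows the response and the correct string differ in length (for example "100011" vs. "10101"), where this comparison recycles the shorter vector and warns. What should we do with those cases?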
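One way to make the unequal-length case explicit is sketched below. It rests on a scoring rule that is an assumption, not something stated in the question: only the overlapping positions are compared, and the number of matches is divided by the length of the longer string, so missing or extra characters count as errors. `positional_accuracy()` is a hypothetical helper name.

```r
# Hypothetical helper: position-by-position agreement for paired strings.
positional_accuracy <- function(correct, response) {
  mapply(function(cor, res) {
    a <- strsplit(cor, "")[[1]]
    b <- strsplit(res, "")[[1]]
    n <- min(length(a), length(b))          # overlapping positions only
    matches <- sum(a[seq_len(n)] == b[seq_len(n)])
    matches / max(length(a), length(b))     # assumed rule: longer string is the denominator
  }, correct, response, USE.NAMES = FALSE)
}

acc <- positional_accuracy(c("0101", "1010", "100011"),
                           c("0101", "0010", "10101"))
acc
```

With a data frame like the one in the question, the result could be assigned with something like `OSPAN$EQ_SUM <- positional_accuracy(OSPAN$EQ_C, OSPAN$EQ_R)`.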
This question might be obvious or asked already, but I can't find a solution:
I want to create a data frame with all possible combinations (and number of variables) such that it looks like the following example:
dataframe <- data.frame(variable = 1:4,
a = c("gender", NA, NA, NA),
b = c("age", NA, NA, NA),
c = c("city", NA, NA, NA),
d = c("education", NA, NA, NA),
e = c("gender", "age", NA, NA),
f = c("gender", "city", NA, NA),
g = c("gender", "education", NA, NA),
h = c("age", "city", NA, NA),
i = c("age", "education", NA, NA),
j = c("city", "education", NA, NA),
k = c("gender", "age", "city", NA),
l = c("gender", "age", "education", NA),
m = c("gender", "city", "education", NA),
n = c("gender", "age", "city", "education"))
I have too many variables, so it's not worth writing it out, and I want to avoid errors. Thank you for helping!
Here is an option with combn. Take the vector of variable names, loop over the sequence along that vector, and call combn on the vector with m set to the current value from the loop. Convert each result to a data.frame and cbind all the list elements together. cbind.fill from rowr pads with NA any list element that has fewer rows than the tallest data.frame.
library(rowr)
res <- do.call(cbind.fill, c(fill = NA, lapply(seq_along(v1), function(i) {
m1 <- combn(v1, i)
if(is.vector(m1)) as.data.frame.list(m1) else as.data.frame(m1)})))
colnames(res) <- letters[seq_along(res)]
Or, as @Moody_Mudskipper suggested,
res1 <- do.call(cbind.fill, c(fill = NA, lapply(seq_along(v1), function(i) combn(v1, i))))
colnames(res1) <- letters[seq_len(ncol(res1))]
data
v1 <- c('gender', 'age', 'city', 'education')
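If rowr is not available, the same padding can be sketched in base R alone. This is a minimal sketch rather than the answer's method: each combination is padded with NA up to the length of v1 and the padded vectors are bound as columns. Naming the columns with `letters` only works for up to 26 combinations.

```r
v1 <- c("gender", "age", "city", "education")

# All combinations of size 1, 2, ..., length(v1), as one flat list of vectors.
combos <- unlist(
  lapply(seq_along(v1), function(k) combn(v1, k, simplify = FALSE)),
  recursive = FALSE
)

# Pad every combination with NA so all columns have length(v1) rows.
padded <- lapply(combos, function(x) {
  c(x, rep(NA_character_, length(v1) - length(x)))
})

res <- as.data.frame(do.call(cbind, padded), stringsAsFactors = FALSE)
colnames(res) <- letters[seq_along(res)]
res
```

For four variables this gives 4 + 6 + 4 + 1 = 15 columns, one per non-empty subset.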