R: Problem with calculating mean accuracy with strsplit function


I am trying to calculate the accuracy of participants' response (columns EQ_R and MEM_R) based on the correct response (columns EQ_C and MEM_C).
dput(example)
structure(list(TRIAL = c("1", "2", "3", "4", "5", "6", "7", "8",
"9", "10", "11", "12", "13", "14", "15"), EQ_C = c("0101", "1010",
"1010", "00111", "01011", "01101", "100011", "010101", "001101",
"0110011", "1101001", "1100101", "11100001", "11001010", "11001010"
), EQ_R = c("0101", "0010", "1010", "00111", "01011", "01101",
"10101", "11010", "001101", "0100011", "1101001", "0100101",
"11110001", "11001010", "11001010"), MEM_C = c("ZLHK", "RZKX",
"DGWL", "BCJSP", "WRKTJ", "CHBXS", "HNDCWX", "SWVNDT", "WLDGPB",
"DSHRKBV", "HCXLZWB", "HDNBVZC", "BCRHKVDM", "RVTBWKFS", "NWHVZFLD"
), MEM_R = c("ZLHK", "RZKX", "DGWL", "BCJSP", "WRKLTJ", "CHBXS",
"HNDCWX", "SWVDTN", "WLDGPB", "DSHRKBV", "HCXLZWB", "HDNBVZC",
"BCRHKVDM", "RVTBWKFS", "NWHVZFLD"), EQ_SUM = c(NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), MEM_SUM = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)), row.names = c(NA,
15L), class = "data.frame")
I added a new column for the "sum"/accuracy scores that need to be calculated for the binary data (EQ) and letters (MEM).
OSPAN["EQ_SUM"] <- NA
OSPAN["MEM_SUM"]<- NA
I then tried to calculate the accuracy with strsplit, but I receive error notifications.
mean(strsplit(OSPAN$MEM_C, "") == strsplit(OSPAN$MEM_R, ""))
Error in strsplit(OSPAN$MEM_C, "") == strsplit(OSPAN$MEM_R, "") : comparison of these types is not implemented
In addition:
Warning messages:
1: In strsplit(OSPAN$MEM_R, "") : input string 342 is invalid UTF-8
2: In strsplit(OSPAN$MEM_R, "") : input string 580 is invalid UTF-8
My question is:
How do I match/calculate the accuracy or congruence between predictor (C) and actual (R) values into the sum columns?
For instance, in row #1, EQ_SUM would be 1 (or 100%), whereas it would be 0.75 or 75% in #2, as the participant chose the wrong answer (0 instead of 1). Thus, partial credit scores are given, and it is not a matter of absolute match/congruence.
Thank you in advance.

One possibility could be using the RecordLinkage library:
library(RecordLinkage)
# df is the data frame from the question (called OSPAN there)
with(df, levenshteinSim(EQ_C, EQ_R))
[1] 1.0000000 0.7500000 1.0000000 1.0000000 1.0000000 1.0000000 0.6666667 0.6666667
[9] 1.0000000 0.8571429 1.0000000 0.8571429 0.8750000 1.0000000 1.0000000
It calculates the similarity between the two strings using the Levenshtein distance.
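If installing an extra package is not an option, a similar score can be sketched with adist() from the utils package that ships with R. The helper name lev_sim is made up for illustration, and the normalisation (edit distance divided by the length of the longer string) is an assumption chosen because it reproduces the values shown above for the rows tried here. Note that an edit-distance similarity is not the same as position-by-position accuracy, although the two agree whenever the strings have equal length and differ only by substitutions.

```r
# Hypothetical helper: Levenshtein-based similarity using only base R.
lev_sim <- function(a, b) {
  # adist() returns a matrix of pairwise distances; compare element by element.
  d <- mapply(function(x, y) drop(utils::adist(x, y)), a, b, USE.NAMES = FALSE)
  1 - d / pmax(nchar(a), nchar(b))
}

# Rows 1, 2 and 7 of the EQ columns from the question.
lev_sim(c("0101", "1010", "100011"), c("0101", "0010", "10101"))
```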

I'm sure there is a more efficient way, but you can compare the strings row by row and write the result into your data frame.
eq_c <- strsplit(OSPAN$EQ_C, "", useBytes = TRUE)  # split every string once, before the loop
eq_r <- strsplit(OSPAN$EQ_R, "", useBytes = TRUE)
mem_c <- strsplit(OSPAN$MEM_C, "", useBytes = TRUE)
mem_r <- strsplit(OSPAN$MEM_R, "", useBytes = TRUE)
for (i in seq_len(nrow(OSPAN))) {
  OSPAN$EQ_SUM[i] <- sum(eq_c[[i]] == eq_r[[i]]) / length(eq_c[[i]])
  OSPAN$MEM_SUM[i] <- sum(mem_c[[i]] == mem_r[[i]]) / length(mem_c[[i]])
}
Note, however, that in some rows the response and the correct answer differ in length. In those rows == recycles the shorter vector (with a warning when the lengths are not multiples of each other), so the score is not meaningful. How should those cases be scored?
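One possible way to handle rows of unequal length is to pad the shorter string with NA and count every unmatched position as wrong. The sketch below assumes that scoring rule, which the question does not specify; the helper name position_accuracy is invented for illustration.

```r
# Hypothetical helper: share of positions where the response matches the key.
# Positions that exist in only one of the two strings count as mismatches.
position_accuracy <- function(correct, response) {
  mapply(function(key, resp) {
    n <- max(length(key), length(resp))
    length(key) <- n    # extending a vector pads it with NA
    length(resp) <- n
    sum(key == resp, na.rm = TRUE) / n
  }, strsplit(correct, ""), strsplit(response, ""), USE.NAMES = FALSE)
}

# Rows 1, 2 and 7 of the EQ columns from the question.
position_accuracy(c("0101", "1010", "100011"), c("0101", "0010", "10101"))
```

With the question's data this could be used as OSPAN$EQ_SUM <- position_accuracy(OSPAN$EQ_C, OSPAN$EQ_R), which also avoids the explicit loop.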

Related

How do I create a line graph using multiple variables when the multiple variables are all in the same column?

structure(list(Sample.Id = c(NA, "2", "2", "2", "2", "2", "2",
"2", "2", "2", "2", "3", "3", "3", "3", "3", "3", "3", "3", "3"
), Sampling..Date = c(NA, "08-Sep-14", "14-Oct-14", "02-Nov-14",
"21-Nov-14", "03-Dec-14", "15-Dec-14", "11-Jan-15", "08-Feb-15",
"01-Mar-15", "06-Apr-15", "03-Sep-14", "08-Sep-14", "14-Oct-14",
"02-Nov-14", "21-Nov-14", "03-Dec-14", "15-Dec-14", "11-Jan-15",
"26-Jan-15"), Tot.P = c("µg/ml", "0.002", "0.017", "0.035",
"0.04", "0.059", "0.155", "0.021", "0.022", "0.025", "<0.009",
"0.021", "0.003", "0.036", "0.141", "0.041", "0.044", "0.01",
"0.023", "0.016"), DOC = c("µg/ml", NA, "12.3", "13.4", "12.5",
"9.9", "14.7", "8.8", "8.3", "0.026", "7.5", "13.4", NA, "14.6",
"16.6", "14.7", "12.6", "12.6", "10.6", "11.4"), Tot.N = c("µg/ml",
NA, "3.63", "4.12", "3.98", "4.08", "3.38", "3.63", "4.88", "8.3",
"2.74", "2.48", NA, "3.07", "3.38", "3.3", "3.43", "2.19", "2.77",
"4.25"), DOC.1 = c("µg/ml", "13.6", NA, NA, NA, NA, NA, NA,
NA, NA, NA, "14.44", "16.85", NA, NA, NA, NA, NA, NA, NA), Tot.P.1 = c("µg/ml",
"0.053", NA, NA, NA, NA, NA, NA, NA, NA, NA, "0.08", "0.071",
NA, NA, NA, NA, NA, NA, NA), Total.N = c("µg/ml", "3.363", NA,
NA, NA, NA, NA, NA, NA, NA, NA, "2.645", "2.637", NA, NA, NA,
NA, NA, NA, NA)), row.names = c(NA, 20L), class = "data.frame")
I have a set of water quality data from 2014-2022 covering different sites and time periods. Each site has a different monitoring period, and the samples were analysed on two different devices; there are only two periods of overlap in which samples were analysed on both machines. I am trying to plot a time series showing P, N and DOC for each site over time, and to shade the periods where one machine was used instead of the other. This is all a bit complicated and I am new to R, so I have been running in circles for a week. My problem is that I am unsure how to select the part of a column I need in order to create the variable I want.
I have tried to look it up on blogs but can't seem to combine the different pieces of advice into something that works. Any tips would be much appreciated. The data I'm referring to is shown above.

You will definitely need to clean up your data to fit this solution, but the basic approach is to pivot from wide to long form.
Then you need to make sure your dates are in the proper POSIXct format.
After that, it is just a matter of grouping by the relevant variables and plotting with geom_line().
I added facet_grid() to separate the panels by Sample.Id.
library(tidyverse)
#> Warning: package 'ggplot2' was built under R version 4.2.2
#> Warning: package 'tidyr' was built under R version 4.2.2
#> Warning: package 'purrr' was built under R version 4.2.2
#> Warning: package 'dplyr' was built under R version 4.2.2
#> Warning: package 'stringr' was built under R version 4.2.2
#> Warning: package 'forcats' was built under R version 4.2.2
df <- structure(list(Sample.Id = c("2", "2", "2", "2", "2", "2", "2",
"2", "2", "2", "3", "3", "3", "3", "3", "3", "3", "3", "3"),
Sampling..Date = c("08-Sep-14", "14-Oct-14", "02-Nov-14",
"21-Nov-14", "03-Dec-14", "15-Dec-14", "11-Jan-15", "08-Feb-15",
"01-Mar-15", "06-Apr-15", "03-Sep-14", "08-Sep-14", "14-Oct-14",
"02-Nov-14", "21-Nov-14", "03-Dec-14", "15-Dec-14", "11-Jan-15",
"26-Jan-15"), Tot.P = c("0.002", "0.017", "0.035", "0.04",
"0.059", "0.155", "0.021", "0.022", "0.025", "<0.009", "0.021",
"0.003", "0.036", "0.141", "0.041", "0.044", "0.01", "0.023",
"0.016"), DOC = c(NA, "12.3", "13.4", "12.5", "9.9", "14.7",
"8.8", "8.3", "0.026", "7.5", "13.4", NA, "14.6", "16.6",
"14.7", "12.6", "12.6", "10.6", "11.4"), Tot.N = c(NA, "3.63",
"4.12", "3.98", "4.08", "3.38", "3.63", "4.88", "8.3", "2.74",
"2.48", NA, "3.07", "3.38", "3.3", "3.43", "2.19", "2.77",
"4.25"), DOC.1 = c("13.6", NA, NA, NA, NA, NA, NA, NA, NA,
NA, "14.44", "16.85", NA, NA, NA, NA, NA, NA, NA)), row.names = 2:20, class = "data.frame")
df |>
  mutate(Tot.P = str_replace(Tot.P, "<", ""),
         across(Tot.P:DOC.1, as.numeric),
         Sampling..Date = as.POSIXct(Sampling..Date, format = "%d-%b-%y")) |>
  select(-c(DOC.1)) |>
  pivot_longer(cols = c(Tot.P, DOC, Tot.N)) |>
  ggplot(aes(x = Sampling..Date, y = value, group = name, col = name)) +
  geom_line() +
  facet_grid(~Sample.Id)
#> Warning: Removed 5 rows containing missing values (`geom_line()`).
Created on 2023-02-14 with reprex v2.0.2
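The question also asks about shading the periods in which a particular device was used, which the pipeline above does not cover. A minimal sketch of one way to do this with geom_rect() follows. The toy data frame and the start and end dates of the shaded period are invented for illustration; the real periods would have to come from the monitoring records.

```r
library(ggplot2)

# Toy data standing in for one analyte at one site (values are made up).
toy <- data.frame(
  date  = as.Date(c("2014-09-08", "2014-10-14", "2014-11-02",
                    "2014-11-21", "2014-12-03")),
  value = c(0.002, 0.017, 0.035, 0.040, 0.059)
)

# Hypothetical period during which the second device was in use.
device_b <- data.frame(start = as.Date("2014-10-01"),
                       end   = as.Date("2014-11-10"))

p <- ggplot(toy, aes(x = date, y = value)) +
  # Draw the shaded band first so the line is plotted on top of it.
  geom_rect(data = device_b, inherit.aes = FALSE,
            aes(xmin = start, xmax = end, ymin = -Inf, ymax = Inf),
            fill = "grey70", alpha = 0.4) +
  geom_line()
p
```

Using -Inf and Inf for ymin and ymax makes the band span the full panel height regardless of the y scale, which should also work inside facets.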

Is there a good, general approach to convert semi-structured data to tibble/dataframe in R?

I am new to R programming and most of my experience thus far is with using highly structured rectangular data from a .csv or .xlsx. But now I've been handed about 30 spreadsheets of budget data that look like this:
And in order to work with them, I'd like to get them into a more friendly format (not exactly tidy b/c of the Q1 to Q4 could/should be a single variable -- but I can fix that later with pivot_longer), like this:
Searching SO, the closest problem/solution I found was this: R importing semi-unstructured data CSV, but that example contains a series of structured tables that do not require the modification mine does, plus, it is a text file converting to character vectors, and I have Excel workbooks with multiple worksheets (I only need 1 of the sheets).
Here's what I've tried so far:
library(tidyverse)
library(readxl)  # read_xlsx() comes from readxl, not from the xlsx package
# Straight read in the worksheet as-is
df <- read_xlsx(path = "filename.xlsx", sheet = "worksheet", col_names = FALSE)
# Get the location name into its own column, then delete row 1 since it's not needed
df <- df %>%
  mutate(location = df[[1, 1]])
df <- df[-c(1), ]
# Add a column and initialize it to "empty"
df <- df %>%
  add_column(budget_type = "empty")
# Now loop through the dataframe in Column 1, search for the keyword(s) and place
# them in the last "budget_type" column
for (row in 1:nrow(df)) {
  print(df[[row, 1]])
  if (df[[row, 1]] %in% c("Baseline", "Scope Changes")) {
    budget_type <- df[[row, 1]]
  }
  if (!is.na(df[[row, 1]])) {
    if (str_detect(df[[row, 1]], "[0-9]{4}") == TRUE) {
      df[[row, "budget_type"]] <- budget_type
    }
  }
}
# ...and from here I could write another loop going from bottom to top seeking
# the categories and placing them in another created column, and finally delete the rows
# that are empty, total rows, or unnecessary header rows.
My question is: Is there an obviously better way to do this in R in place of the loops and the general approach I'm describing to get my data in a tidy format?
Thank you in advance.
EDIT 6/7/2021:
It appears I cannot attach the Excel file, but if I'm following the "minimal reproducible example" guidelines correctly, here is the unprocessed data from dput() after reading in the data from Excel:
structure(list(...1 = c("Mehoopany", NA, "CLASS CODE", "Baseline",
NA, "0201", "0300", "0301", NA, NA, "5500", "8245", "8260", NA,
NA, "5710", "8224", "8235", NA, NA, NA, NA, "CLASS CODE", "Scope Changes",
NA, "0201", "0300", "0301", NA, NA, "5500", "8245", "8260", NA,
NA, "5710", "8224", "8235", NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA), ...2 = c(NA, NA, "Classification", NA, NA, "Cleaning, Spills, and Trash / Recycle Bin Pickup",
"Specialized Cleaning", "Window Cleaning", "Cleaning", NA, "Gen Office Exp",
"Wellness / Fitness Centers", "Meeting Rooms", "Employee Convenience",
NA, "Reception", "Photographic Services", "Mail Room Services",
"Mail Room & Reception", NA, "Total", NA, "Classification", NA,
NA, "Cleaning, Spills, and Trash / Recycle Bin Pickup", "Specialized Cleaning",
"Window Cleaning", "Cleaning", NA, "Gen Office Exp", "Wellness / Fitness Centers",
"Meeting Rooms", "Employee Convenience", NA, "Reception", "Photographic Services",
"Mail Room Services", "Mail Room & Reception", NA, "Total", NA,
NA, NA, NA, NA, NA, NA, NA, NA), ...3 = c(NA, NA, "FY2021 Phasing",
NA, "Q1", "1205", "0", "0", "1205", NA, "0", "0", "174", "174",
NA, "0", "0", "1453.625", "1453.625", NA, "2832.625", NA, "FY2021 Phasing",
NA, "Q1", "25", "0", "0", "25", NA, "0", "0", "37", "37", NA,
"0", "17", "0", "17", NA, "79", NA, NA, NA, NA, NA, NA, NA, NA,
NA), ...4 = c(NA, NA, NA, NA, "Q2", "1205", "0", "0", "1205",
NA, "0", "0", "174", "174", NA, "0", "0", "1453.625", "1453.625",
NA, "2832.625", NA, NA, NA, "Q2", "25", "0", "0", "25", NA, "0",
"0", "37", "37", NA, "0", "17", "0", "17", NA, "79", NA, NA,
NA, NA, NA, NA, NA, NA, NA), ...5 = c(NA, NA, NA, NA, "Q3", "1205",
"0", "0", "1205", NA, "0", "0", "174", "174", NA, "0", "0", "1453.625",
"1453.625", NA, "2832.625", NA, NA, NA, "Q3", "25", "0", "0",
"25", NA, "0", "0", "37", "37", NA, "0", "17", "0", "17", NA,
"79", NA, NA, NA, NA, NA, NA, NA, NA, NA), ...6 = c(NA, NA, NA,
NA, "Q4", "1205", "0", "0", "1205", NA, "0", "0", "174", "174",
NA, "0", "0", "1453.625", "1453.625", NA, "2832.625", NA, NA,
NA, "Q4", "25", "0", "0", "25", NA, "0", "0", "37", "37", NA,
"0", "17", "0", "17", NA, "79", NA, NA, NA, NA, NA, NA, NA, NA,
NA), ...7 = c(NA, NA, NA, NA, "Total", "4820", "0", "0", "4820",
NA, "0", "0", "696", "696", NA, "0", "0", "5814.5", "5814.5",
NA, "11330.5", NA, NA, NA, "Total", "100", "0", "0", "100", NA,
"0", "0", "148", "148", NA, "0", "68", "0", "68", NA, "316",
NA, NA, NA, NA, NA, NA, NA, NA, NA)), row.names = c(NA, -50L), class = c("tbl_df",
"tbl", "data.frame"))

Here is the script I used -- it works -- with explanatory comments:
library(tidyverse)
library(readxl)  # read_xlsx() comes from readxl, not from the xlsx package
file <- "C:/Path/To/Book1.xlsx"
names <- c("class_code", "classification", "Q1", "Q2", "Q3", "Q4",
"Total", "Location", "Budget_Type", "Category")
# Read in the file, setting range to restrict columns ingested as some
# scrap work exists in some files beyond column G; row 50 is well beyond
# expected data range.
dframe <- read_xlsx(path = file, range = "'Original Data'!A1:G50",
col_names = FALSE)
# Output above data so it can be included in Stack Overflow as a 'minimal
# reproducible example'.
dput(dframe)
# Move the location name ('Mehoopany') to its own column, then delete the row.
dframe <- dframe %>%
  mutate(location = dframe[[1, 1]])
dframe <- dframe[-c(1), ]
# Add/define two additional columns which will be used in loops below.
dframe <- dframe %>%
  add_column(budget_type = "empty", category = "empty")
# Loop 1: Move *DOWN* the data set, labeling every line with 'CLASS CODE' as
# being either "Baseline" or "Scope Changes" in 'budget_type' column.
for (row in 1:nrow(dframe)) {
  if (dframe[[row, 1]] %in% c("Baseline", "Scope Changes")) {
    budget_type <- dframe[[row, 1]]
  }
  if (!is.na(dframe[[row, 1]])) {
    if (str_detect(dframe[[row, 1]], "[0-9]{4}")) {
      dframe[[row, "budget_type"]] <- budget_type
    }
  }
}
# Loop 2: Move *UP* the data set, labeling every line with 'CLASS CODE' with
# its respective roll-up category, and otherwise delete the line.
for (row in nrow(dframe):1) {
  if (dframe[[row, 2]] == "Total" ||
      is.na(dframe[[row, 2]]) ||
      dframe[[row, 2]] == "Classification") {
    # delete rows where the 2nd column is <blank>, 'Classification', or 'Total'.
    dframe <- dframe[-row, ]
  } else {
    if (!is.na(dframe[[row, 2]]) && is.na(dframe[[row, 1]])) {
      # if the row has no 'CLASS CODE' but has a value in the 2nd column, assign
      # that value to category, then delete the row entirely.
      category <- dframe[[row, 2]]
      dframe <- dframe[-row, ]
    } else if (str_detect(dframe[[row, 1]], "[:digit:]{4}")) {
      # if row has 'CLASS CODE', then label the category column with the
      # stored value.
      dframe[[row, "category"]] <- category
    }
  }
}
# Assign the names from the character vector set at the beginning.
names(dframe) <- names
# Print out the resulting dataframe.
dframe
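As a possible alternative to the two loops, the same "carry the label down" and "carry the category up" logic can be sketched with tidyr::fill(). The example below runs on a small made-up tibble rather than the real workbook: the column names code, label and Q1 and all values are assumptions for illustration, and only one quarter column is shown.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the sheet after the location row has been removed.
raw <- tibble(
  code  = c("Baseline", NA, "0201", "0300", NA, "5500", NA,
            "Scope Changes", NA, "0201", NA),
  label = c(NA, NA, "Cleaning, Spills", "Specialized Cleaning", "Cleaning",
            "Gen Office Exp", "Employee Convenience",
            NA, NA, "Cleaning, Spills", "Cleaning"),
  Q1    = c(NA, "Q1", "1205", "0", "1205", "0", "0", NA, "Q1", "25", "25")
)

tidy <- raw %>%
  mutate(
    # Section headers live in the code column.
    budget_type = if_else(code %in% c("Baseline", "Scope Changes"),
                          code, NA_character_),
    # Roll-up rows have a label but no class code.
    category = if_else(is.na(code) & !is.na(label), label, NA_character_)
  ) %>%
  fill(budget_type, .direction = "down") %>%  # header applies to rows below it
  fill(category, .direction = "up") %>%       # roll-up applies to rows above it
  filter(grepl("^[0-9]{4}$", code)) %>%       # keep only the class-code rows
  mutate(Q1 = as.numeric(Q1))

tidy
```

On the real sheet the other quarter columns would need the same as.numeric() conversion, and the result has not been checked against all 30 workbooks.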

Remove rows until columns are identical over multiple data frames

I have 4 data frames named w, x, y, z, each with 3 columns and identical column names. I want to remove rows until the column named Type is identical across all four data frames.
To achieve this I am using a while loop with the following code:
list_df <- list(z, w, x, y)
tmp <- lapply(list_df, `[[`, 'Type')
i <- as.integer(as.logical(all(sapply(tmp, function(x) all(x == tmp[[1]])))))
while (i == 0) {
  z <- z[(z$Type %in% x$Type),]
  y <- y[(y$Type %in% x$Type),]
  w <- w[(w$Type %in% x$Type),]
  z <- z[(z$Type %in% w$Type),]
  y <- y[(y$Type %in% w$Type),]
  x <- x[(x$Type %in% w$Type),]
  z <- z[(z$Type %in% y$Type),]
  x <- x[(x$Type %in% y$Type),]
  w <- w[(w$Type %in% y$Type),]
  x <- x[(x$Type %in% z$Type),]
  w <- w[(w$Type %in% z$Type),]
  y <- y[(y$Type %in% z$Type),]
  list_df <- list(z, w, x, y)
  tmp <- lapply(list_df, `[[`, 'Type')
  i <- as.integer(as.logical(all(sapply(tmp, function(x) all(x == tmp[[1]])))))
}
In this code, a list is created from the Type column of every data frame. The value i then tests whether they are all identical, giving 0 if false and 1 if true. The while loop removes rows whose Type is not present in every data frame and stops only once i becomes 1.
This code works, but on larger data it can take a long time to run. Does anybody have an idea how to simplify it?
For reproducible example:
w <- structure(list(Type = c("26809D", "28503C", "360254", "69298N",
"32708V", "680681", "329909", "696978", "32993F", "867609", "51206K",
"130747"), X1980 = c(NA, NA, NA, 271835, NA, NA, NA, NA, NA,
NA, NA, NA), X1981 = c(NA, NA, NA, 290314, NA, NA, NA, NA, NA,
NA, NA, NA)), row.names = c("2", "4", "7", "8", "10", "11", "13",
"16", "17", "21", "22", "23"), class = "data.frame")
x <- structure(list(Type = c("26809D", "28503C", "360254", "69298N",
"32708V", "680681", "329909"), X1980 = c(NA, NA, NA, 1026815,
NA, NA, NA), X1981 = c(NA, NA, NA, 826849, NA, NA, NA)), row.names = c("2",
"4", "7", "8", "10", "11", "13"), class = "data.frame")
y <- structure(list(Type = c("26809D", "28503C", "360254", "69298N",
"32708V"), X1980 = c(NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_), X1981 = c(NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_)), row.names = c("2", "4", "7", "8", "10"), class = "data.frame")
z <- structure(list(Type = c("26809D", "28503C", "360254", "69298N",
"32708V", "680681", "329909", "696978", "32993F", "867609", "51206K",
"130747", "50610H"), X1980 = c(NA, NA, NA, 0.264736101439889,
NA, NA, NA, NA, NA, NA, NA, NA, NA), X1981 = c(NA, NA, NA, 0.351108848169376,
NA, NA, NA, NA, NA, NA, NA, NA, NA)), row.names = c("2", "4",
"7", "8", "10", "11", "13", "16", "17", "21", "22", "23", "24"
), class = "data.frame")

We assume that the question is how to get the values of Type that are common to 4 data frames each of which has a Type column containing unique values.
Form a list L of the data frames, extract the Type column using lapply and [ and iterate merge over that using Reduce :
L <- list(w, x, y, z)
L.Type <- lapply(L, "[", TRUE, "Type", drop = FALSE) # list of DFs w only Type col
Reduce(merge, L.Type)$Type
## [1] "26809D" "28503C" "32708V" "360254" "69298N"
or replace last line with this giving the same result except for order:
Reduce(intersect, L.Type)$Type
## [1] "26809D" "28503C" "360254" "69298N" "32708V"
Another approach, which is a bit tedious but does reduce the calculation to one line, is to nest the calls to intersect manually:
intersect(w$Type, intersect(x$Type, intersect(y$Type, z$Type)))
## [1] "26809D" "28503C" "360254" "69298N" "32708V"
Another example
The example data is not very good for illustrating this, so let us create another example. BOD is a built-in data frame that has 6 rows. We assign it to X and rename the columns so that the first one has the name Type. Then for i equal to 1, 2, 3, 4 we remove the i-th row, giving 4 data frames with 5 rows each and 2 values of Type common to all 4. The result correctly shows that 5 and 7 are the only common Type values.
# set up input L, a list of 4 data frames
X <- BOD
names(X) <- c("Type", "X")
L <- lapply(1:4, function(i) X[-i, ])
L.Type <- lapply(L, "[", TRUE, "Type", drop = FALSE)
Reduce(merge, L.Type)$Type
## [1] 5 7
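Since the question asks for the rows to be removed from each data frame, not just for the common values, the last step might look like the sketch below. The four toy data frames are made up for illustration, and the code assumes, as above, that Type is unique within each data frame.

```r
# Toy data frames (made up) that share only some Type values.
w <- data.frame(Type = c("a", "b", "c", "d"), X1980 = 1:4)
x <- data.frame(Type = c("a", "b", "c"),      X1980 = 5:7)
y <- data.frame(Type = c("b", "c", "e"),      X1980 = 8:10)
z <- data.frame(Type = c("a", "b", "c", "f"), X1980 = 11:14)

dfs <- list(w = w, x = x, y = y, z = z)

# Values of Type that occur in every data frame.
common <- Reduce(intersect, lapply(dfs, `[[`, "Type"))

# Keep only the rows whose Type is in the common set, in one pass per data frame.
filtered <- lapply(dfs, function(d) d[d$Type %in% common, , drop = FALSE])

sapply(filtered, function(d) paste(d$Type, collapse = ","))
```

This computes the intersection once and filters each data frame once, so no while loop is needed.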

Concatenating the strings of selected rows for every column

My data is as follows:
DF <- structure(list(toberevised = c("[Money amounts are in thousands of dollars]",
NA, NA, NA, "Item", NA, NA, NA, NA, "Number of returns", "Number of joint returns",
"Number with paid preparer's signature", "Number of exemptions",
"Adjusted gross income (AGI) [3]", "Salaries and wages in AGI: [4] Number",
"Salaries and wages in AGI: Amount", "Taxable interest: Number",
"Taxable interest: Amount", "Ordinary dividends: Number", "Ordinary dividends: Amount"
), ...2 = c("UNITED STATES [2]", NA, NA, NA, "All returns", NA,
NA, "1", NA, "135257620", "52607676", "80455243", "273738434",
"7364640131", "114060887", "5161583318", "59553985", "161324824",
"31158675", "164247298"), ...3 = c(NA, NA, NA, NA, "Under", "$50,000 [1]",
NA, "2", NA, "92150166", "20743943", "53622647", "159649737",
"1797097083", "75422766", "1541276272", "28527550", "39043002",
"13174923", "23867893"), ...4 = c(NA, NA, "Size of adjusted gross income",
NA, "50000", "under", "75000", "3", NA, "18221115", "11329459",
"11025624", "44189517", "1119634632", "16299827", "896339313",
"10891905", "16353293", "5255958", "12810282"), ...5 = c(NA,
NA, NA, NA, "75000", "under", "100000", "4", NA, "10499106",
"8296546", "6260725", "28555195", "905336768", "9520214", "721137490",
"7636612", "12852148", "4095938", "11524298"), ...6 = c(NA, NA,
NA, NA, "100000", "under", "200000", "5", NA, "10797979", "9193700",
"6678965", "30919226", "1429575727", "9782173", "1083175205",
"9092673", "23160862", "5824522", "25842394"), ...7 = c(NA, NA,
NA, NA, "200000", "or more", NA, "6", NA, "3589254", "3044028",
"2867282", "10424759", "2112995921", "3035907", "919655038",
"3405245", "69915518", "2807334", "90202431")), row.names = c(NA,
-20L), class = c("tbl_df", "tbl", "data.frame"))
All I would like to do is concatenate for each column, rows 5, 6 and 7. I tried:
DF[,5:7] <- lapply(DF[,5:7], paste(DF[,5:7],collapse=" "))
But I get the error:
Error in get(as.character(FUN), mode = "function", envir = envir) :
variable names are limited to 10000 bytes
This happens even when I concatenate one row with another, empty row instead (which obviously should not amount to many more bytes)!

lapply(DF[5:7, ], paste, collapse=" ")
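Some background on why the one-liner above works where the attempt in the question failed: lapply() expects a function as its second argument. In the attempt, paste(DF[,5:7], collapse=" ") is evaluated first and its result, one long string, is passed instead; R then tries to look that string up as a function name, which is presumably what triggers the "variable names are limited to 10000 bytes" error. Also note that paste() turns NA into the text "NA". The sketch below, on a made-up three-row data frame standing in for DF[5:7, ], drops the NA cells before pasting; whether that is wanted is an assumption.

```r
# Toy stand-in for DF[5:7, ] (three header rows spread over several columns).
hdr <- data.frame(
  a = c("Item", NA, NA),
  b = c("Under", "$50,000 [1]", NA),
  c = c("50000", "under", "75000")
)

# Pass the function itself to the apply call; its arguments are evaluated per column.
headers <- vapply(hdr,
                  function(col) paste(na.omit(col), collapse = " "),
                  character(1))
headers
```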

R - Filling a matrix with data from specific data blocks from another matrix

I have my data in a matrix, but it still contains a lot of unnecessary information (a side effect of grabbing the data from mhtml files). I want to "filter" those things out and "collapse" the matrix (so that there are no empty cells between the data), so that after saving it to a spreadsheet I do not need to do extra clean-up on it (which would be quite handy when you need to do it for 400+ files).
However, the only way I know of doing this is to use gsub to delete the stuff I do not want before I generate the matrix.
Since I only need specific blocks of the matrix and I know where those blocks are (I can find the cell one row before each block I need by using which), I was wondering whether it is possible to copy out specific data blocks when I know where each one starts and ends (the blocks have a fixed size).
Hence, does somebody know a way to copy several specific areas of a matrix into a single different matrix when you know the cell where each data block begins and the blocks have a fixed size (in columns and rows)?
I have a feeling that I am overlooking something, because it sounds rather easy.
Edit: I forgot a data example (hope this works):
dput(var_table[1:20,1:6])
structure(c("coration:none", "", "Zeit", "kV", "-------------------------------------------------------",
"1", "2", "3", "4", "5", "6", "7", "8", "", "Phase", "Datum/Zeit",
"Stufe", "tan-delta-Mittelwert", "Standardabweichung", "Anzahl",
"color:000000\">Details:", NA, "Spannung", "mA", NA, "12:54:09",
"12:54:19", "12:54:30", "12:54:39", "12:54:49", "12:55:00", "12:55:10",
"12:55:20", NA, ".......................", "..................",
".......................", "........", "..........", "der", NA,
NA, "Strom", "E-3", NA, "5.8", "5.8", "5.8", "5.8", "5.8", "5.8",
"5.8", "5.8", NA, ":", ":", ":", ":", ":", "Messungen", NA, NA,
"tan", NA, NA, "3.07", "3.07", "3.07", "3.07", "3.07", "3.07",
"3.07", "3.07", NA, "L1", "29-09-2015", "1", "0.343", "0.001",
"........", NA, NA, "delta", NA, NA, "0.34", "0.34", "0.34",
"0.34", "0.34", "0.34", "0.34", "0.34", NA, NA, "12:55:20", NA,
"E-3", "E-3", ":", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, "8"), .Dim = c(20L, 6L))
Just need the data block from [6:13,1:5].
Second data snippet, same file:
structure(c("Phase", "Datum/Zeit", "Stufe", "tan-delta-Mittelwert",
"Standardabweichung", "Anzahl", "Last", "Prfcfobjekt", "Generator",
"", "", "Zeit", "kV", "-------------------------------------------------------",
"1", "2", "3", "4", "5", "6", "7", "8", "", "Phase", "Datum/Zeit",
"Stufe", "tan-delta-Mittelwert", "Standardabweichung", "Anzahl",
"Last", "Prfcfobjekt", "Generator", ".......................",
"..................", ".......................", "........",
"..........", "der", "........................", "VSE-Strom",
"VSE-Strom", NA, NA, "Spannung", "mA", NA, "12:56:40", "12:56:50",
"12:57:00", "12:57:10", "12:57:21", "12:57:31", "12:57:41", "12:57:51",
NA, ".......................", "..................", ".......................",
"........", "..........", "der", "........................",
"VSE-Strom", "VSE-Strom", ":", ":", ":", ":", ":", "Messungen",
":", "........", ".........", NA, NA, "Strom", "E-3", NA, "11.7",
"11.7", "11.7", "11.7", "11.7", "11.7", "11.7", "11.7", NA, ":",
":", ":", ":", ":", "Messungen", ":", "........", ".........",
"L1", "29-09-2015", "1", "0.343", "0.001", "........", "847.6",
":", ":", NA, NA, "tan", NA, NA, "6.18", "6.18", "6.18", "6.18",
"6.18", "6.18", "6.18", "6.19", NA, "L1", "29-09-2015", "2",
"0.355", "0.001", "........", "843.2", ":", ":", NA, "12:55:20",
NA, "E-3", "E-3", ":", "nF", "32", "2", NA, NA, "delta", NA,
NA, "0.35", "0.35", "0.35", "0.36", "0.36", "0.36", "0.36", "0.36",
NA, NA, "12:57:52", NA, "E-3", "E-3", ":", "nF", "66", "6", NA,
NA, NA, NA, NA, "8", NA, "b5A", "b5A", NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "8", NA,
"b5A", "b5A"), .Dim = c(32L, 6L))
Here I would need just the "Phase" (aka [15:4] and [38:4]). Does anybody have an idea?

Based on your example data, something like this should work.
code:
keepRows <- apply(df1,1,function(x){all(grepl("^(\\d|[:.])+$",x)|is.na(x))})
df2 <- df1[keepRows,]
keepCols <- apply(df2,2,function(x){!all(is.na(x))})
df2[,keepCols]
result:
# [,1] [,2] [,3] [,4] [,5]
#[1,] "1" "12:54:09" "5.8" "3.07" "0.34"
#[2,] "2" "12:54:19" "5.8" "3.07" "0.34"
#[3,] "3" "12:54:30" "5.8" "3.07" "0.34"
#[4,] "4" "12:54:39" "5.8" "3.07" "0.34"
#[5,] "5" "12:54:49" "5.8" "3.07" "0.34"
#[6,] "6" "12:55:00" "5.8" "3.07" "0.34"
#[7,] "7" "12:55:10" "5.8" "3.07" "0.34"
#[8,] "8" "12:55:20" "5.8" "3.07" "0.34"
Please note:
1) You have to set the types.
At the moment all cells are character strings (text), and you cannot do math with text. Your second column is a time format. You need to convert each column to its meaningful type, so look up how to do that.
2) Do you really want to use a matrix here? A matrix can hold only ONE type.
As mentioned in 1), the second column is a time format and the others are integer or numeric. Those are two or three distinct data types, and a matrix can hold only one.
So as a first step I would convert with as.data.frame().
3) In the above code I'm looking for rows that contain only NA, digits, ":" and ".". This might not be general enough for your real data.
4) Look up every function you see in the R console like this: ?all, ?apply, ?grepl, ... and read the help pages!
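The question's original idea, copying fixed-size blocks whose position is known relative to a marker cell found with which(), can also be sketched directly. The toy matrix, the "kV" marker, the two-row offset and the block size below are assumptions chosen for illustration and would need to be adapted to the real files.

```r
# Toy matrix: a marker cell ("kV") sits two rows above each 3 x 2 data block.
m <- matrix(NA_character_, nrow = 12, ncol = 3)
m[1, 1] <- "kV"
m[3:5, 1:2] <- c("1", "2", "3", "5.8", "5.8", "5.8")
m[7, 1] <- "kV"
m[9:11, 1:2] <- c("1", "2", "3", "11.7", "11.7", "11.7")

# Rows holding the marker; which() ignores the NA comparisons.
marker_rows <- which(m[, 1] == "kV")

offset <- 2   # rows between the marker and the first data row (assumed)
n_rows <- 3   # fixed block height (assumed)
n_cols <- 2   # fixed block width (assumed)

# Cut out one block per marker and stack the blocks into a single matrix.
blocks <- lapply(marker_rows, function(r) {
  m[seq(r + offset, length.out = n_rows), seq_len(n_cols), drop = FALSE]
})
combined <- do.call(rbind, blocks)
combined
```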
