Data munging in R: Subsetting and arranging vectors of uneven length
I am sorry I could not make a more specific title. I am trying to wean myself off of spreadsheets for the more difficult tasks, and this one is giving me particular trouble: I can do it in Excel, but I don't really know how to begin in R. It is somewhat hard to describe, and I imagine a mix of techniques is involved, so I hope this is of use to others.
I have data that comes in the following form from a spreadsheet:
Data:
1 GOEK, WOWP, PEOL, WJRN, KENC, QPOE, JFPG, PWKR, PWEOR, JFOKE, POQK, LSPF, PEKF,PFOW, VCNS, ALAO, LFOD
2 KFDL, LFOD, WOWP, PWEO, PWEOR, PRCP, ALPQ, JFOKE, ALLF, VCNS CNIR,
3 KJTJ, FKOF, VCNS, FLEP
4 FKKF, EPTR
5 QPOE, PEOL, WJRN, VCNS, PEKF, PFPW
And this data is associated with the following key:
Key:
Items A B C
ALAO NA 0.12246503 0.137902549
ALLF 0.016262491 0.557522799 0.622560763
ALPQ 0.409770566 0.770904525 NA
CNIR NA 0.38075281 0.698236443
EPTR 0.718354484 0.290028597 0.525661861
FKKF 0.801489091 0.878405308 0.645004844
FKOF 0.643251028 0.131643544 NA
FLEP 0.018262707 0.211220859 0.457302727
GOEK 0.902121539 NA NA
JFOKE 0.808410498 0.301443669 0.575188395
JFPG NA NA 0.343824191
KENC 0.882285296 0.372821865 0.593742731
KFDL 0.077569421 0.076497291 NA
KJTJ 0.249613609 0.227241864 NA
LFOD NA 0.000343115 0.329546051
LSPF 0.088451014 0.65148309 0.267490643
PEKF 0.645309773 NA 0.116601451
PEOL 0.626916187 0.093812247 0.152577881
PFOW 0.86690534 0.596673645 NA
PFPW NA 0.018869604 NA
POQK 0.683221579 NA 0.472456955
PRCP 0.486488748 0.860947689 0.097916066
PWEO 0.665854791 0.814111848 0.026085774
PWEOR 0.611034332 0.17254104 0.212386401
PWKR NA NA 0.357298987
QPOE 0.815885005 0.083834541 NA
VCNS 0.394817612 0.250760686 0.419539549
WJRN 0.403002388 0.705142265 0.768961818
WOWP 0.794250738 NA 0.967405211
Here is the general approach:
Each row shown in the data comes from one cell of a spreadsheet, so it would be read into R as a single string if imported directly. Split the string for each row into a form that can be stored as a vector in R.
Filter each row of data into three categories (A, B, or C) depending on the values its items have in the key. For example, the 5th row of data contains the items QPOE, PEOL, WJRN, VCNS, PEKF, PFPW. Looking at the key, we can split this into three subcategories based on whether the key has a value or an NA for each item in column A, B, or C:
A QPOE PEOL WJRN VCNS PEKF
B QPOE PEOL WJRN VCNS PFPW
C PEOL WJRN VCNS PEKF
Now that we have divided up row 5 of our data into its respective categories, we can make a separate table for this row that includes the associated value:
A 0.815885005 0.626916187 0.403002388 0.394817612 0.645309773
B 0.083834541 0.093812247 0.705142265 0.250760686 0.018869604
C 0.152577881 0.768961818 0.419539549 0.116601451
So we have a kind of hash table... sort of. Now I want to store these values in one table. It would essentially look something like this in its final form (shown for row 5 of the data only; a rough R sketch of building this follows the table):
Cat A Item A Value B Item B Value C Item C Value
5 QPOE 0.815885005 QPOE 0.083834541 PEOL 0.152577881
5 PEOL 0.626916187 PEOL 0.093812247 WJRN 0.768961818
5 WJRN 0.403002388 WJRN 0.705142265 VCNS 0.419539549
5 VCNS 0.394817612 VCNS 0.250760686 PEKF 0.116601451
5 PEKF 0.645309773 PFPW 0.018869604 NA NA
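For reference, here is a rough sketch of how this wide, NA-padded layout could be built for a single row in R. It assumes the key above has been read into a data.frame called key (a placeholder name), and the items come out in the key's order rather than the order in which they appear in the data row:

row5 <- c("QPOE", "PEOL", "WJRN", "VCNS", "PEKF", "PFPW")
sub <- key[key$Items %in% row5, ]               # key rows for the items in this data row
pieces <- lapply(c("A", "B", "C"), function(col) {
  keep <- !is.na(sub[[col]])                    # keep only items with a value in this column
  data.frame(Item = sub$Items[keep], Value = sub[[col]][keep])
})
n <- max(sapply(pieces, nrow))
pieces <- lapply(pieces, function(x) x[seq_len(n), ])   # pad shorter pieces with NA rows
wide <- cbind(Cat = 5, do.call(cbind, pieces))
names(wide) <- c("Cat", "A Item", "A Value", "B Item", "B Value", "C Item", "C Value")
rownames(wide) <- NULL

Repeating this for all 400 rows means padding every per-row table to a common width, which is exactly the discomfort described below.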
In reality, I have 400 rows of "Cat" in the data, not just 5.
Is this the best way to store the data for easy reference? Would a nested list be preferred, like the outline below (a rough R sketch of that structure follows the outline)?
Cat Row 1
A Items
Values
B Items
Values
C Items
Values
Cat Row 2...
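For illustration only, here is a rough sketch of what building such a nested list might look like, assuming the rows of data have already been split into a list of item vectors called d and the key has been read into a data.frame called key (both placeholder names):

nested <- lapply(d, function(items) {
  sub <- key[key$Items %in% items, ]            # key rows for this data row
  setNames(lapply(c("A", "B", "C"), function(col) {
    keep <- !is.na(sub[[col]])                  # drop items that are NA in this column
    list(Items = sub$Items[keep], Values = sub[[col]][keep])
  }), c("A", "B", "C"))
})
# e.g. nested[[5]]$A$Items and nested[[5]]$A$Values hold the "A" entries for row 5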
I am just hesitant to make data frames for this data because there is so much variability in the length of the rows in my original data when divided into A, B, and C. The shortest ones would have to be padded with NAs up to the length of the longest ones to fit in a data frame. Something about this just makes me uncomfortable.
I can always look up the functions used in an answer and figure it out, so an in-depth explanation is not necessary unless you are feeling particularly generous! Thank you for your time.
I think that this is what I'd do, although it returns the answer in a slightly different form than you've asked for - my approach is to avoid ragged arrays (ones with different column lengths).
Start with your data:
d <- c("GOEK, WOWP, PEOL, WJRN, KENC, QPOE, JFPG, PWKR, PWEOR, JFOKE, POQK, LSPF, PEKF,PFOW, VCNS, ALAO, LFOD",
"KFDL, LFOD, WOWP, PWEO, PWEOR, PRCP, ALPQ, JFOKE, ALLF, VCNS CNIR",
"KJTJ, FKOF, VCNS, FLEP", "FKKF, EPTR", "QPOE, PEOL, WJRN, VCNS, PEKF, PFPW" )
key <- structure(list(Items = c("ALAO", "ALLF", "ALPQ", "CNIR", "EPTR",
"FKKF", "FKOF", "FLEP", "GOEK", "JFOKE", "JFPG", "KENC", "KFDL",
"KJTJ", "LFOD", "LSPF", "PEKF", "PEOL", "PFOW", "PFPW", "POQK",
"PRCP", "PWEO", "PWEOR", "PWKR", "QPOE", "VCNS", "WJRN", "WOWP"
), A = c(NA, 0.016262491, 0.409770566, NA, 0.718354484, 0.801489091,
0.643251028, 0.018262707, 0.902121539, 0.808410498, NA, 0.882285296,
0.077569421, 0.249613609, NA, 0.088451014, 0.645309773, 0.626916187,
0.86690534, NA, 0.683221579, 0.486488748, 0.665854791, 0.611034332,
NA, 0.815885005, 0.394817612, 0.403002388, 0.794250738), B = c(0.12246503,
0.557522799, 0.770904525, 0.38075281, 0.290028597, 0.878405308,
0.131643544, 0.211220859, NA, 0.301443669, NA, 0.372821865, 0.076497291,
0.227241864, 0.000343115, 0.65148309, NA, 0.093812247, 0.596673645,
0.018869604, NA, 0.860947689, 0.814111848, 0.17254104, NA, 0.083834541,
0.250760686, 0.705142265, NA), C = c(0.137902549, 0.622560763,
NA, 0.698236443, 0.525661861, 0.645004844, NA, 0.457302727, NA,
0.575188395, 0.343824191, 0.593742731, NA, NA, 0.329546051, 0.267490643,
0.116601451, 0.152577881, NA, NA, 0.472456955, 0.097916066, 0.026085774,
0.212386401, 0.357298987, NA, 0.419539549, 0.768961818, 0.967405211
)), .Names = c("Items", "A", "B", "C"), class = "data.frame", row.names = c(NA, -29L))
#split it up as you suggest
d <- strsplit(d,",")
d <- lapply(d, gsub, pattern=" ", replacement="") #Remove the spaces left over after splitting
#Convert key to a long data.frame with no NAs
library(reshape2)
key <- melt(key)
names(key)[2] <- "letter" #You might have a better name for this
key <- key[complete.cases(key),]
#Extract subsets for each row of data
lapply(d, function(x)key[key$Items %in% x,])
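If a single long table is wanted rather than a list of subsets, one possible follow-up (a sketch, not part of the original answer) is to tag each subset with its row number and bind them together:

out <- lapply(seq_along(d), function(i) cbind(Cat = i, key[key$Items %in% d[[i]], ]))
long <- do.call(rbind, out)

Each row of long then records the data row (Cat), the item, which of A/B/C it belongs to (letter), and its value, with no NA padding required.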
Related
Continuous multiplication same column previous value
I have a problem. I have the following data frame:

      1      2
     NA    100
1.00499     NA
1.00813     NA
0.99203     NA

Two columns. In the second column, apart from the starting value, there are only NAs. I want to fill the first NA of the 2nd column by multiplying the 1st value from column 2 with the 2nd value from column 1 (100 * 1.00499). The 3rd value of column 2 should be the product of the 2nd newly created value in column 2 and the 3rd value in column 1, and so on, so that at the end the NAs are replaced by values.

These two sources have helped me understand how to refer to different rows, but in both cases a new column is created. I don't want that; I want to fill the already existing column 2.

Use a value from the previous row in an R data.table calculation
https://statisticsglobe.com/use-previous-row-of-data-table-in-r

Can anyone help me? Thanks so much in advance.

Sample code:

library(quantmod)
data.N225 <- getSymbols("^N225", from="1965-01-01", to="2022-03-30", auto.assign=FALSE, src='yahoo')
data.N225[c(1:3, nrow(data.N225)),]
data.N225 <- na.omit(data.N225)
N225 <- data.N225[,6]
N225$DiskreteRendite = Delt(N225$N225.Adjusted)
N225[c(1:3,nrow(N225)),]
options(digits=5)
N225.diskret <- N225[,3]
N225.diskret[c(1:3,nrow(N225.diskret)),]
N225$diskretplus1 <- N225$DiskreteRendite+1
N225[c(1:3,nrow(N225)),]
library(dplyr)
N225$normiert <- "Value"
N225$normiert[1,] <- 100
N225[c(1:3,nrow(N225)),]
N225.new <- N225[,4:5]
N225.new[c(1:3,nrow(N225.new)),]

Here is the code to create the data frame in RStudio:

a <- c(NA, 1.0050, 1.0081, 1.0095, 1.0016, 0.9947)
b <- c(100, NA, NA, NA, NA, NA)
c <- data.frame(ONE = a, TWO = b)
You could use cumprod for the cumulative product:

transform(df, TWO = cumprod(c(na.omit(TWO), na.omit(ONE))))

which yields

     ONE      TWO
1     NA 100.0000
2 1.0050 100.5000
3 1.0081 101.3140
4 1.0095 102.2765
5 1.0016 102.4402
6 0.9947 101.8972

Data:

> dput(df)
structure(list(ONE = c(NA, 1.005, 1.0081, 1.0095, 1.0016, 0.9947),
    TWO = c(100, NA, NA, NA, NA, NA)), class = "data.frame", row.names = c(NA, -6L))
What about (gasp) a for loop? (I'll use dat instead of c for your data frame, to avoid confusion with the function c().)

for (row in 2:nrow(dat)) {
  if (!is.na(dat$TWO[row-1])) {
    dat$TWO[row] <- dat$ONE[row] * dat$TWO[row-1]
  }
}

This means: for each row from the second to the end, if the TWO in the previous row is not a missing value, calculate the TWO in this row by multiplying ONE in the current row and TWO from the previous row.

Output:

#>      ONE      TWO
#> 1     NA 100.0000
#> 2 1.0050 100.5000
#> 3 1.0081 101.3140
#> 4 1.0095 102.2765
#> 5 1.0016 102.4402
#> 6 0.9947 101.8972

Created on 2022-04-28 by the reprex package (v2.0.1)

I'd love to read a dplyr solution!
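Not from the original answers, but one possible tidyverse-style sketch (assuming the data frame is named dat, as above) uses purrr::accumulate to carry the running product down the column:

library(dplyr)
library(purrr)

# start from TWO[1] (100) and successively multiply by ONE[2], ONE[3], ...
dat %>%
  mutate(TWO = accumulate(ONE[-1], `*`, .init = TWO[1]))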
(R) Filter rows based on string names if the only resulting match in another column is NA
The title may sound kinda weird, but I have found no better way of defining my issue. Here is an example data set:

test = data.frame(genus = c("Acicarpha", "Acicarpha", "Acicarpha", "Acicarpha", "Acisanthera",
                            "Acisanthera", "Acisanthera", "Acisanthera", "Acmella", "Acmella"),
                  sp1 = c("NA", "bonariensis", "bonariensis", "spathulata", NA,
                          "variabilis", "variabilis", "variabilis", NA, NA))

As you can see, there are a few possible species names (genus + sp1): Acicarpha NA, Acicarpha bonariensis, Acicarpha spathulata, Acisanthera variabilis, Acisanthera NA, and Acmella NA.

Here's the deal: I'm trying to select only the rows related to Acmella NA, since the only value appearing in the sp1 column for that genus is NA. Other genera also have NA, but they do not have only NA. How can I do this? I'm bashing my head.
Here's some code that does what I think you're asking for. It has four steps:

Group the rows by genus.
Make a new column called all_sp1_na that is TRUE if all of each genus's sp1 observations are NA, FALSE otherwise (i.e. FALSE if at least one sp1 observation is not NA for that genus).
Filter for rows where all_sp1_na is TRUE.
Remove the temporary column all_sp1_na.

library(tidyverse)
test %>%
  group_by(genus) %>%
  mutate(all_sp1_na = all(is.na(sp1))) %>%
  filter(all_sp1_na) %>%
  select(-all_sp1_na)

And it gives this result:

# A tibble: 2 x 2
# Groups:   genus [1]
  genus   sp1
  <chr>   <chr>
1 Acmella NA
2 Acmella NA

Let me know if you're looking for something else.
We may use subset from base R:

subset(test, !genus %in% genus[!is.na(sp1)])

     genus  sp1
9  Acmella <NA>
10 Acmella <NA>

Or with filter from dplyr:

library(dplyr)
test %>%
  filter(!genus %in% genus[!is.na(sp1)])
R: take first 13 characters from a string, concatenate with other strings
I have a data frame df with three columns: ref, driver and tour.

ref                                         driver   tour
02062018-1130SGA->BRT-Buttes Chaumont       Mark     NA
02162018-1230BRT-A2Pas Courbevoie Marceau   John     NA
02067018-1300SGA->BRT-Brune 2/2             Sam      NA
020718-0800-CHILLY-CHARENTON                Claire   678
020718-0800-CHILLY-BATIGNOLLES              NA       NA

I want to mutate a new column ID where, if tour is NA, it concatenates the first 13 characters of ref and driver. If tour is not NA, it just returns the value in tour. So the result should look like this:

ref                                         driver   tour   ID
02062018-1130SGA->BRT-Buttes Chaumont       Mark     NA     02062018-1130Mark
02162018-1230BRT-A2Pas Courbevoie Marceau   John     NA     02162018-1230John
02067018-1300SGA->BRT-Brune 2/2             Sam      NA     02067018-1300Sam
020718-0800-CHILLY-CHARENTON                Claire   678    678
020718-0800-CHILLY-BATIGNOLLES              NA       NA     020718-0800-C

Note that if the driver column is also NA, I don't want to take NA as a character; instead I just want to return the first 13 characters of ref.

My idea is to use the ifelse function, but I just don't know how to take the first 13 characters and concatenate them to driver:

df$ID <- ifelse(tour == NA, yes = #####some function######, no = tour)

Any help would be appreciated. Thank you!
You are looking for substr in combination with paste0:

df1$ID <- ifelse(is.na(df1$tour),
                 paste0(substr(df1$ref, 1, 13), ifelse(is.na(df1$driver), "", df1$driver)),
                 df1$tour)

"02062018-1130Mark" "02162018-1230John" "02067018-1300Sam" "678" "020718-0800-C"
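Not part of the original answer, but a dplyr variant of the same idea might look like the following sketch (assuming the same column names as in the question and that tour can be coerced to character):

library(dplyr)

df1 %>%
  mutate(ID = if_else(is.na(tour),
                      paste0(substr(ref, 1, 13), coalesce(driver, "")),
                      as.character(tour)))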
Efficient way to conditionally edit value labels
I'm working with survey data containing value labels. The haven package allows one to import data with value label attributes. Sometimes these value labels need to be edited in routine ways. The example I'm giving here is very simple, but I'm looking for a solution that can be applied to similar problems across large data.frames.

d <- dput(structure(list(var1 = structure(c(1, 2, NA, NA, 3, NA, 1, 1),
    labels = structure(c(1, 2, 3, 8, 9),
        .Names = c("Protection of environment should be given priority",
                   "Economic growth should be given priority",
                   "[DON'T READ] Both equally",
                   "[DON'T READ] Don't Know",
                   "[DON'T READ] Refused")),
    class = "labelled")),
    .Names = "var1", row.names = c(NA, -8L),
    class = c("tbl_df", "tbl", "data.frame")))

d$var1
<Labelled double>
[1]  1  2 NA NA  3 NA  1  1

Labels:
 value                                               label
     1 Protection of environment should be given priority
     2            Economic growth should be given priority
     3                           [DON'T READ] Both equally
     8                             [DON'T READ] Don't Know
     9                                [DON'T READ] Refused

If a value label begins with "[DON'T READ]", I want to remove "[DON'T READ]" from the beginning of the label and add "(VOL)" at the end. So, "[DON'T READ] Both equally" would now read "Both equally (VOL)."

Of course, it's straightforward to edit this individual variable with a function from haven's associated labelled package. But I want to apply this solution across all the variables in a data.frame.

library(labelled)
val_labels(d$var1) <- c("Protection of environment should be given priority" = 1,
                        "Economic growth should be given priority" = 2,
                        "Both equally (VOL)" = 3,
                        "Don't Know (VOL)" = 8,
                        "Refused (VOL)" = 9)

How can I achieve the result of the code directly above in a way that can be applied to every variable in a data.frame? The solution must work regardless of the specific values. (In this instance it is values 3, 8, and 9 that need alteration, but this is not necessarily the case.)
There are a few ways to do this. You could use lapply() or (if you want a one(ish)-liner) you could use any of the scoped variants of mutate().

1) Using lapply()

This method loops over all columns with gsub() to remove the part you do not want and adds " (VOL)" to the end of the string. Of course, you could use this with a subset as well!

d[] <- lapply(d, function(x) {
  labels <- attributes(x)$labels
  names(labels) <- gsub("\\[DON'T READ\\]\\s*(.*)", "\\1 (VOL)", names(labels))
  attributes(x)$labels <- labels
  x
})

d$var1
[1]  1  2 NA NA  3 NA  1  1
attr(,"labels")
Protection of environment should be given priority           Economic growth should be given priority
                                                  1                                                  2
                                 Both equally (VOL)                                   Don't Know (VOL)
                                                  3                                                  8
                                      Refused (VOL)
                                                  9
attr(,"class")
[1] "labelled"

2) Using mutate_all()

Using the same logic (with the same result) you could change the names of the labels in a tidier way:

d %>%
  mutate_all(~{names(attributes(.)$labels) <- gsub("\\[DON'T READ\\]\\s*(.*)", "\\1 (VOL)", names(attributes(.)$labels)); .}) %>%
  map(attributes) # just to check on the result
Find the count of distinct values across two columns in R
I have two columns, both of character data type. One column has plain strings and the other has strings wrapped in quotes. I want to compare both columns and find the number of distinct names across the data frame.

string   f.string.name
john     NA
bravo    NA
NA       "john"
NA       "hulk"

Here the count should be 2, as john is common. Somehow I am not able to remove the quotes from the second column. Not sure why. Thanks
The main problem I'm seeing is the NA values. First, let's get rid of the quotes you mention:

dat$f.string.name <- gsub('["]', '', dat$f.string.name)

Now, count the number of distinct values:

i1 <- complete.cases(dat$string)
i2 <- complete.cases(dat$f.string.name)
sum(dat$string[i1] %in% dat$f.string.name[i2]) + sum(dat$f.string.name[i2] %in% dat$string[i1])

DATA

dat <- structure(list(string = c("john", "bravo", NA, NA),
                      f.string.name = c(NA, NA, "\"john\"", "\"hulk\"")),
                 .Names = c("string", "f.string.name"),
                 class = "data.frame", row.names = c(NA, -4L))
library(stringr)
table(str_replace_all(unlist(df), '["]', ''))

# bravo  hulk  john
#     1     1     2