My data frame has a character column named Code whose entries look like b, b1, b110-b139, b110, b1100, b1101, ... (1602 entries).
I am trying to select all the entries that match the strings in a vector, as well as all the entries that start with one of those strings.
So let's say I have the vector
Selection=c("b114","d2")
then I want all codes like b114, b1140, b1141, b1142, ... as well as d2, d200, d2000, d2001, d2002, d2003, etc.
What does work in principle is to create a new data frame like this:
bTable <- TreeMapTable[substr(TreeMapTable$Code,1,4)=="b114"|substr(TreeMapTable$Code,1,2)=="d2",]
which gives me all the data I want, but requires me to manually type the condition for each entry; I just want to give the script a vector of strings.
I tried to do it like this:
SelectionL = nchar(Selection)
Beispieltable <- TreeMapTable[substr(TreeMapTable$Code, 1, SelectionL) == Selection, ]
but this somehow gives me only half of the required entries, and I confess I don't really know what it is doing. I know I could use a for loop, but from everything I have read so far, loops should be avoided and the problem should be solvable with vectors.
sample data
df <- data.frame(Code = c("b114", "b115", "b11456", "d2", "d12", "d200", "db114"),
                 stringsAsFactors = FALSE)
Selection=c("b114","d2")
answer
library( dplyr )
#create a regex pattern to filter on
pattern <- paste0( "^", Selection, collapse = "|" )
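# pattern is now "^b114|^d2"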
#filter out all rows where 'Code' does not start with one of the entries in 'Selection'
df %>% filter( grepl( pattern, Code, perl = TRUE ) )
# Code
# 1 b114
# 2 b11456
# 3 d2
# 4 d200
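As an aside, the substr() attempt returned only about half the rows because of recycling: with 1602 codes and 2 prefixes, row 1 is compared against "b114", row 2 against "d2", row 3 against "b114" again, and so on, so each row is only ever tested against one of the two prefixes. If you want to stay in base R and avoid regex metacharacters entirely, a minimal sketch using startsWith() (available since R 3.3.0):
# test every Code against each prefix, then OR the logical vectors together
keep <- Reduce(`|`, lapply(Selection, function(p) startsWith(df$Code, p)))
df[keep, ]
#     Code
# 1   b114
# 3 b11456
# 4     d2
# 6   d200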
One of the strings in my vector (df$location1) is the following:
Potomac, MD 20854\n(39.038266, -77.203413)
Rest of the data in the vector follow same pattern. I want to separate each component of the string into a separate data element and put it in new columns like: df$city, df$state, etc.
So far I have been able to isolate the lat. long. data into a separate column by doing the following:
df$lat.long <- gsub('.*\\n\\((.*)\\)', '\\1', df$location1)
I was able to make it work by looking at other codes online but I don't fully understand it. I understand the regex pattern but don't understand the "\\1" part. Since I don't understand it in full I have been unable to use it to subset other parts of this same string.
What's the best way to subset data like this?
Is using regex a good way to do this? What other ways should I be looking into?
I have looked into splitting the string after a comma, subsetting with regex, using the scan() function, and many other variations. Now I am all confused. Thanks!
We can also use the separate function from the tidyr package (part of the tidyverse package).
library(tidyverse)
# Create example data frame
dat <- data.frame(Data = "Potomac, MD 20854\n(39.038266, -77.203413)",
                  stringsAsFactors = FALSE)
dat
# Data
# 1 Potomac, MD 20854\n(39.038266, -77.203413)
# Separate the Data column
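# sep matches ", ", the literal "\n(", the closing ")", or whitespace (which separates State from Zip)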
dat2 <- dat %>%
  separate(Data, into = c("City", "State", "Zip", "Latitude", "Longitude"),
           sep = ", |\\n\\(|\\)|[[:space:]]")
dat2
# City State Zip Latitude Longitude
# 1 Potomac MD 20854 39.038266 -77.203413
You can try strsplit() or data.table::tstrsplit() (strsplit + transpose):
> x <- 'Potomac, MD 20854\n(39.038266, -77.203413)'
> data.table::tstrsplit(x, ', |\\n\\(|\\)')
[[1]]
[1] "Potomac"
[[2]]
[1] "MD 20854"
[[3]]
[1] "39.038266"
[[4]]
[1] "-77.203413"
More generally, you can do this:
library(data.table)
df[c('city', 'state', 'lat', 'long')] <- tstrsplit(df$location1, ', |\\n\\(|\\)')
The pattern ', |\\n\\(|\\)' tells tstrsplit to split by ", ", "\n(" or ")".
In case you want to separate state and zip, and city names may contain spaces, you can try a two-step approach:
# original split (keep city names with space intact)
df[c('city', 'state', 'lat', 'long')] <- tstrsplit(df$location1, ', |\\n\\(|\\)')
# split state and zip
df[c('state', 'zip')] <- tstrsplit(df$state, ' ')
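For instance, with a hypothetical two-word city (the address and coordinates are made up for illustration):
df <- data.frame(location1 = "Chevy Chase, MD 20815\n(38.9795, -77.0811)",
                 stringsAsFactors = FALSE)
df[c('city', 'state', 'lat', 'long')] <- tstrsplit(df$location1, ', |\\n\\(|\\)')
df[c('state', 'zip')] <- tstrsplit(df$state, ' ')
df[c('city', 'state', 'zip', 'lat', 'long')]
#          city state   zip     lat     long
# 1 Chevy Chase    MD 20815 38.9795 -77.0811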
Here is an option using base R
read.table(text = trimws(gsub(",+", " ", gsub("[, \n()]", ",", dat$Data))),
           header = FALSE, col.names = c("City", "State", "Zip", "Latitude", "Longitude"),
           stringsAsFactors = FALSE)
# City State Zip Latitude Longitude
#1 Potomac MD 20854 39.03827 -77.20341
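(read.table converts the coordinates to numeric and stores them at full precision; the shorter numbers in the printout are just R's default display of 7 significant digits.)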
So this process might be a little longer, but for me it makes things clearer. Instead of splitting on delimiters, below I identify each value with its own regex. I build a vector of regexes to extract each value and a vector of variable names, then loop over the two to extract the values and assemble the data frame.
library(stringi)
library(dplyr)
library(purrr)
# the example string to parse (the original code referenced an undefined `value`)
value <- "Potomac, MD 20854\n(39.038266, -77.203413)"
rgexVec <- c("[\\w\\s-]+(?=,)",   # city: word characters/spaces/hyphens before the first comma
             "[A-Z]{2}",          # state: two capital letters
             "\\d+(?=\\n)",       # zip: digits immediately before the newline
             "[\\d-\\.]+(?=,)",   # latitude: signed number followed by a comma
             "[\\d-\\.]+(?=\\))") # longitude: signed number before the closing parenthesis
varNames <- c("city", "state", "zip", "lat", "long")
map2_dfc(varNames, rgexVec, function(vn, rg) {
  extractedVal <- stri_extract_first_regex(value, rg) %>% as.list()
  names(extractedVal) <- vn
  extractedVal %>% as_tibble()
})
\\1 is a backreference. In the replacement string it stands for whatever the first parenthesised capture group in the pattern matched, not for the whole match (\\2 would refer to a second group, and so on). In your gsub() call, the group (.*) captures everything between "\n(" and ")", so the entire string is replaced by just that captured latitude/longitude part.
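A minimal illustration of how a capture group feeds the replacement:
# the parentheses capture the two numbers; "\\2-\\1" writes them back reversed
gsub("(\\d+)-(\\d+)", "\\2-\\1", "12-34")
# [1] "34-12"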
I have a huge dataset that I cannot upload here.
I have two types of columns, whose names start with T.H.L or T.H.L.varies..... Both types are numbered in the format So####, e.g., T.H.L.So1_P1_A2 up to T.H.L.So10000_P1_A2.
For each T.H.L column there is a column named T.H.L.varies.... with the same ending.
I want to order the columns by the number after So, with the T.H.L column first and then the corresponding T.H.L.varies.... column for each So number.
What I tried was to do
library(gtools)
mySorted <- df2[, mixedorder(colnames(df2))]
This is close: it sorts them correctly by number, but puts all the T.H.L columns first and then all the T.H.L.varies columns, instead of alternating them.
I have posted the column names to GitHub.
Okay, let's call the names of your data frame (the names you want to reorder) x:
x = names(df2)
# first remove the ones without numbers
# because we want to use the numbers for ordering
no_numbers = c("T.H.L", "T.H.L.varies....")
x = x[! x %in% no_numbers]
# now extract the numbers so we can order them
library(stringr)
x_num = as.numeric(str_extract(string = x, pattern = "(?<=So)[0-9]+"))
# calculate the order first by number, then alphabetically to break ties
ord = order(x_num, x)
# verify it is working
head(c(no_numbers, x[ord]), 10)
# [1] "T.H.L" "T.H.L.varies...." "T.H.L.So1_P1_A1"
# [4] "T.H.L.varies.....So1_P1_A1" "T.H.L.So2_P1_A2" "T.H.L.varies.....So2_P1_A2"
# [7] "T.H.L.So3_P1_A3" "T.H.L.varies.....So3_P1_A3" "T.H.L.So4_P1_A4"
# [10] "T.H.L.varies.....So4_P1_A4"
# finally, reorder your data frame columns
df2 = df2[, c(no_numbers, x[ord])]
And you should be done.
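As a quick sanity check on hypothetical column names following the pattern in the question (So2 vs So10 shows why plain alphabetical sorting is not enough):
x <- c("T.H.L.So2_P1_A2", "T.H.L.varies.....So10_P1_A10",
       "T.H.L.So10_P1_A10", "T.H.L.varies.....So2_P1_A2")
x_num <- as.numeric(str_extract(string = x, pattern = "(?<=So)[0-9]+"))
x[order(x_num, x)]
# [1] "T.H.L.So2_P1_A2"              "T.H.L.varies.....So2_P1_A2"
# [3] "T.H.L.So10_P1_A10"            "T.H.L.varies.....So10_P1_A10"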
Hi, I have a data frame df and wish to find out whether there are any palindromes in one name column.
I have test data with 12 records in it. I know 2 of the Name values are palindromes.
The code below returns a list of TRUE/FALSE values using lapply.
How do I return the names that are palindromes using the TRUE values, and how would I find out which is the most frequently occurring palindrome name?
is_palindrome = function(x){
  charsplit = strsplit(x, "")[[1]]  # split the string into single characters
  revchar = rev(charsplit)          # reverse the character vector
  all(charsplit == revchar)         # TRUE if it reads the same forwards and backwards
}
dfnamelc = tolower(as.character(df$Name))
listtest = as.list(dfnamelc)
lapply(listtest,is_palindrome)
example df
Linda,F,100
Mary,F,150
Patrick,M,200
Barbara,F,300
Susan,F,100
Norman,M,40
Deborah,F,500
Sandra,F,23
Conor,M,80
anna,F,40
Otto,M,30
anna,M,40
It will probably be more convenient to use sapply() to return the results as a vector, and incorporate the results back into the data frame.
df <- transform(df,
                is_pal = sapply(tolower(Name), is_palindrome))
df$Name[df$is_pal] ## which names are palindromes?
paltab <- table(df$Name[df$is_pal]) ## count palindromic names
names(paltab)[which.max(paltab)] ## "anna"
I'm not sure what your third column signifies, so I'm ignoring it.
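For a self-contained check, here is the example data rebuilt with just the Name column (the other columns are ignored, as noted) and the answer run end to end:
df <- data.frame(Name = c("Linda", "Mary", "Patrick", "Barbara", "Susan", "Norman",
                          "Deborah", "Sandra", "Conor", "anna", "Otto", "anna"),
                 stringsAsFactors = FALSE)
df <- transform(df, is_pal = sapply(tolower(Name), is_palindrome))
df$Name[df$is_pal]
# [1] "anna" "Otto" "anna"
paltab <- table(df$Name[df$is_pal])
names(paltab)[which.max(paltab)]
# [1] "anna"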
I have a unique dataset, a portion of which can be reproduced using:
data <- textConnection("SNP_Pres,Chr_N,BP_A1F,A1_Beta,A2_SE,ForSortSNP,SortOrder
rs122,13,100461219,C,T,rs122,6
1,16362,0.8701,-0.0048,0.0056,rs122,7
1,19509,0.546015137607046,-0.0033,0.0035,rs122,8
1,17218,0.1539,-0.004,0.013,rs122,9
rs142,13,61952115,G,T,rs142,6
1,16387,0.1295,0.0044,0.0057,rs142,7
1,17218,0.8454,0.006,0.013,rs142,9
rs160,13,100950452,C,T,rs160,6
1,16387,0.549,-0.0021,0.0035,rs160,7
1,19509,0.519102731537216,0.003,0.0027,rs160,8
rs298,13,66664221,C,G,rs298,6
1,19509,0.308290808358246,-0.0032,0.0033,rs298,8
1,17218,0.7227,0.022,0.01,rs298,9")
mydata <- read.csv(data, header = T, sep = ",", stringsAsFactors=FALSE)
It is formatted for use in a program that requires placeholders for missing data entries. In this case, a missing entry is indicated by a numeric skip in the SortOrder column. An entry is complete if the column runs 6 - 7 - 8 - 9, with a new entry beginning again at 6.
I need a way to read through the data file, and insert a row of zeros for each missing entry, so that the file looks like this:
data <- textConnection("SNP_Pres,Chr_N,BP_A1F,A1_Beta,A2_SE,ForSortSNP,SortOrder
rs122,13,100461219,C,T,rs122,6
1,16362,0.8701,-0.0048,0.0056,rs122,7
1,19509,0.546015137607046,-0.0033,0.0035,rs122,8
1,17218,0.1539,-0.004,0.013,rs122,9
rs142,13,61952115,G,T,rs142,6
1,16387,0.1295,0.0044,0.0057,rs142,7
0,0,0,0,0,rs142,8
1,17218,0.8454,0.006,0.013,rs142,9
rs160,13,100950452,C,T,rs160,6
1,16387,0.549,-0.0021,0.0035,rs160,7
1,19509,0.519102731537216,0.003,0.0027,rs160,8
0,0,0,0,0,rs160,9
rs298,13,66664221,C,G,rs298,6
0,0,0,0,0,rs298,7
1,19509,0.308290808358246,-0.0032,0.0033,rs298,8
1,17218,0.7227,0.022,0.01,rs298,9")
mydata <- read.csv(data, header = T, sep = ",", stringsAsFactors=FALSE)
Ultimately, the last two columns, ForSortSNP and SortOrder, will be deleted from the data file, but they are included for now for convenience's sake.
Any suggestions are greatly appreciated.
Here is a solution using the expand.grid and merge functions.
grid <- with(mydata, expand.grid(ForSortSNP=unique(ForSortSNP), SortOrder=unique(SortOrder)))
complete <- merge(mydata, grid, all=TRUE, sort=FALSE)
complete[is.na(complete)] <- 0 # replace NAs with 0's
complete <- complete[order(complete$ForSortSNP, complete$SortOrder), ] # re-sort
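Since ForSortSNP and SortOrder are eventually dropped, a final cleanup might look like this (merge() moves the join columns to the front, so removing them by name is safest):
final <- complete[, setdiff(names(complete), c("ForSortSNP", "SortOrder"))]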