How to change values in unnamed first column - r

How do I change the entries of the first column in the matrix returned by read_csv if it doesn't have a header?
My data currently looks like this:
PostFC C1Mean
WBGene00001816 2.475268e-01 415.694457
WBGene00001817 4.808575e+00 2451.018711
and I'd like to rename WBGene0000XXXX to XXXX.

If the first column is actually the row names, do the following:
rownames(data) <- gsub(pattern = "WBGene0000", replacement = "", x = rownames(data))
If the prefix isn't consistent, you may want to use substr from base R (or str_sub from the stringr package).
But if it is actually a column with no header, I do not know how to reference it without knowing the structure of the data.
Run the str function on the data set and see what it returns. Or do the following as a test:
colnames(data)[1] <- "test"
Can't exactly help until we know how you ended up with a "zero-length" variable name.
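As an aside, if the file really was read with readr::read_csv, the result is a tibble, and tibbles do not carry row names, so the gene IDs would sit in an ordinary first column (readr auto-names a headerless column, often something like ...1, depending on the version). A minimal sketch of that case, assuming a hypothetical file genes.csv with the IDs in the first column:
library(readr)
data <- read_csv("genes.csv")                    # hypothetical file name
data[[1]] <- gsub("^WBGene0000", "", data[[1]])  # strip the prefix in place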

If I understand your question correctly, the first "unnamed" column you describe holds the row names and is not actually a column in your data.frame.
# Example data
df = data.frame(PostFC = c(2.475268e-01, 4.808575e+00), C1Mean = c(415.694457, 2451.018711) )
rownames(df) = c("WBGene00001816", "WBGene00001817")
df
# PostFC C1Mean
# WBGene00001816 0.2475268 415.6945
# WBGene00001817 4.8085750 2451.0187
# change rownames
rownames(df) = c("rowname1", "rowname2")
df
# PostFC C1Mean
# rowname1 0.2475268 415.6945
# rowname2 4.8085750 2451.0187
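To get the rename the question actually asks for (stripping the WBGene0000 prefix rather than assigning names by hand), the same idea can be combined with sub; a short sketch using the example df as first constructed above:
rownames(df) <- sub("^WBGene0000", "", rownames(df))
df
#        PostFC    C1Mean
# 1816 0.2475268  415.6945
# 1817 4.8085750 2451.0187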

The entries addressed are actually row names. We can access them with rownames(.).
rownames(df1)
# [1] "WBGene00001816" "WBGene00001817" "WBGene00001818" "WBGene00001819"
# [5] "WBGene00001820" "WBGene00001821" "WBGene00001822"
R also implements the replacement function rownames<-, i.e. we can assign new row names with rownames(.) <- c(.).
Now in your case it looks like you want to keep just the last four digits. We may use substring here, telling it from which character to start extracting. The prefix "WBGene0000" is ten characters long, so we start at the 11th:
rownames(df1) <- substring(rownames(df1), 11)
df1
# PostFC C1Mean
# 1816 0.36250598 2.1073145
# 1817 0.51068402 0.4186838
# 1818 -0.96837330 -0.7239156
# 1819 0.02331745 -0.5902216
# 1820 -0.56927945 1.7540356
# 1821 -0.51252943 0.1343385
# 1822 0.47263180 1.4366233
Note that duplicate row names are not allowed, i.e. if this method produces duplicates the assignment will throw an error.
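One way to guard against that, as a quick sketch, is to check the truncated names for duplicates before assigning them:
new_names <- substring(rownames(df1), 11)
anyDuplicated(new_names)  # 0 means no duplicates, so the assignment is safe
# [1] 0
rownames(df1) <- new_names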
Data used
df1 <- structure(list(PostFC = c(0.362505982864934, 0.510684020059692,
-0.968373302351162, 0.0233174467410604, -0.56927945273647, -0.512529427359891,
0.472631804850333), C1Mean = c(2.10731450148575, 0.418683823183885,
-0.723915648073638, -0.590221641040516, 1.75403562218217, 0.134338480077884,
1.43662329542089)), class = "data.frame", row.names = c("1816",
"1817", "1818", "1819", "1820", "1821", "1822"))

Related

Vectorized use of the substring function for a row selection of a dataframe with different length

My dataframe has a column named Code of type character which goes like b,b1,b110-b139,b110,b1100,b1101,... (1602 entries)
I am trying to select all the entries that match the strings in a vector and all the ones that start with the same string.
So let's say I have the vector
Selection=c("b114","d2")
then I want all codes like b114, b1140, b1141, b1142, ... as well as d2, d200, d2000, d2001, d2002, d2003 etc...
What does work in principle is to create a new dataframe like this:
bTable <- TreeMapTable[substr(TreeMapTable$Code,1,4)=="b114"|substr(TreeMapTable$Code,1,2)=="d2",]
which gives me all the data I want, but requires me to manually type the condition for each entry, and I just want to give the script a vector with the strings.
I tried to do it like this:
SelectionL=nchar(Selection)
Beispieltable <- TreeMapTable[substr(TreeMapTable$Code,1,SelectionL)==Selection,]
but this somehow gives me only half of the required entries, and I confess I don't really know what it is doing. I know I could use a for loop, but from everything I have read so far, loops should be avoided and the problem should be solvable by use of vectors.
sample data
df <- data.frame( Code = c("b114", "b115", "b11456", "d2", "d12", "d200", "db114"),
stringsAsFactors = FALSE)
Selection=c("b114","d2")
answer
library( dplyr )
#create a regex pattern to filter on
pattern <- paste0( "^", Selection, collapse = "|" )
#filter out all rows where 'Code' does not start with the entries from 'Selection'
df %>% filter( grepl( pattern, Code, perl = TRUE ) )
# Code
# 1 b114
# 2 b11456
# 3 d2
# 4 d200
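If you'd rather avoid the dplyr dependency, a base R sketch of the same idea applies the pattern with grepl directly:
pattern <- paste0("^", Selection, collapse = "|")
df[grepl(pattern, df$Code), , drop = FALSE]
#     Code
# 1   b114
# 3 b11456
# 4     d2
# 6   d200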

Subset strings in R

One of the strings in my vector (df$location1) is the following:
Potomac, MD 20854\n(39.038266, -77.203413)
Rest of the data in the vector follow same pattern. I want to separate each component of the string into a separate data element and put it in new columns like: df$city, df$state, etc.
So far I have been able to isolate the lat. long. data into a separate column by doing the following:
df$lat.long <- gsub('.*\\n\\((.*)\\)', '\\1', df$location1)
I was able to make it work by looking at other code online, but I don't fully understand it. I understand the regex pattern but don't understand the "\\1" part. Since I don't understand it in full, I have been unable to use it to subset other parts of this same string.
What's the best way to subset data like this?
Is using regex a good way to do this? What other ways should I be looking into?
I have looked into splitting the string after a comma, subsetting using regex, using the scan() function, and too many other variations. Now I am all confused. Thx
We can also use the separate function from the tidyr package (part of the tidyverse package).
library(tidyverse)
# Create example data frame
dat <- data.frame(Data = "Potomac, MD 20854\n(39.038266, -77.203413)",
stringsAsFactors = FALSE)
dat
# Data
# 1 Potomac, MD 20854\n(39.038266, -77.203413)
# Separate the Data column
dat2 <- dat %>%
  separate(Data, into = c("City", "State", "Zip", "Latitude", "Longitude"),
           sep = ", |\\n\\(|\\)|[[:space:]]")
dat2
# City State Zip Latitude Longitude
# 1 Potomac MD 20854 39.038266 -77.203413
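As a side note, separate leaves all the new columns as character; its convert argument can coerce the numeric-looking ones, a small sketch:
dat2 <- dat %>%
  separate(Data, into = c("City", "State", "Zip", "Latitude", "Longitude"),
           sep = ", |\\n\\(|\\)|[[:space:]]", convert = TRUE)
str(dat2)  # Zip, Latitude and Longitude now come back as numeric types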
You can try strsplit or data.table::tstrsplit(strsplit + transpose):
> x <- 'Potomac, MD 20854\n(39.038266, -77.203413)'
> data.table::tstrsplit(x, ', |\\n\\(|\\)')
[[1]]
[1] "Potomac"
[[2]]
[1] "MD 20854"
[[3]]
[1] "39.038266"
[[4]]
[1] "-77.203413"
More generally, you can do this:
library(data.table)
df[c('city', 'state', 'lat', 'long')] <- tstrsplit(df$location1, ', |\\n\\(|\\)')
The pattern ', |\\n\\(|\\)' tells tstrsplit to split by ", ", "\n(" or ")".
In case you want to separate state and zip, and city names may contain spaces, you can try a two-step approach:
# original split (keep city names with space intact)
df[c('city', 'state', 'lat', 'long')] <- tstrsplit(df$location1, ', |\\n\\(|\\)')
# split state and zip
df[c('state', 'zip')] <- tstrsplit(df$state, ' ')
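Here is a self-contained run of that two-step split, using the question's single example row (assuming df holds it in a location1 column):
library(data.table)
df <- data.frame(location1 = 'Potomac, MD 20854\n(39.038266, -77.203413)',
                 stringsAsFactors = FALSE)
df[c('city', 'state', 'lat', 'long')] <- tstrsplit(df$location1, ', |\\n\\(|\\)')
df[c('state', 'zip')] <- tstrsplit(df$state, ' ')
df[-1]
#      city state       lat       long   zip
# 1 Potomac    MD 39.038266 -77.203413 20854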
Here is an option using base R
read.table(text = trimws(gsub(",+", " ", gsub("[, \n()]", ",", dat$Data))),
           header = FALSE, col.names = c("City", "State", "Zip", "Latitude", "Longitude"),
           stringsAsFactors = FALSE)
# City State Zip Latitude Longitude
#1 Potomac MD 20854 39.03827 -77.20341
So this process might be a little longer, but for me it makes things clear. As opposed to splitting on delimiters, below I identify each value with a specific regex. I make a vector of regexes to extract each value and a vector of variable names, then use a loop to extract the values and build the data frame from those vectors.
library(stringi)
library(dplyr)
library(purrr)
# the string to parse (the question's example)
value <- 'Potomac, MD 20854\n(39.038266, -77.203413)'
rgexVec <- c("[\\w\\s-]+(?=,)",
             "[A-Z]{2}",
             "\\d+(?=\\n)",
             "[\\d-\\.]+(?=,)",
             "[\\d-\\.]+(?=\\))")
varNames <- c("city",
              "state",
              "zip",
              "lat",
              "long")
map2_dfc(varNames, rgexVec, function(vn, rg) {
  extractedVal <- stri_extract_first_regex(value, rg) %>% as.list()
  names(extractedVal) <- vn
  extractedVal %>% as_tibble()
})
\\1 is a backreference in regex: in the replacement string it refers to whatever the first parenthesized capture group, here (.*), matched. So the gsub call above replaces the entire string with just the part captured inside the parentheses, i.e. the latitude/longitude pair.
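A tiny self-contained illustration, swapping two captured groups:
gsub("(\\d+)-(\\d+)", "\\2-\\1", "12-34")  # \\1 and \\2 refer to the captured groups
# [1] "34-12"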

Sort strings based on number in part of string

I have a huge dataset that I cannot upload here.
I have two types of columns; their names start with T.H.L or T.H.L.varies..... Both types are numbered in the format So####, e.g., T.H.L.So1_P1_A2 up to T.H.L.So10000_P1_A2.
For each T.H.L column there is a column named T.H.L.varies.... with the same ending.
I want to order the columns by the numbers after So, with first the T.H.L and then the corresponding T.H.L.varies.... version for each So number.
What I tried was to do
library(gtools)
mySorted<- df2[,mixedorder(colnames(df2))]
This is close: it sorts them correctly by number, but puts all the T.H.L columns first and then all the T.H.L.varies columns, instead of alternating them.
I have posted the column names to Github:
Okay, let's call the names of your data frame (the names you want to reorder) x:
x = names(df2)
# first remove the ones without numbers
# because we want to use the numbers for ordering
no_numbers = c("T.H.L", "T.H.L.varies....")
x = x[! x %in% no_numbers]
# now extract the numbers so we can order them
library(stringr)
x_num = as.numeric(str_extract(string = x, pattern = "(?<=So)[0-9]+"))
# calculate the order first by number, then alphabetically to break ties
ord = order(x_num, x)
# verify it is working
head(c(no_numbers, x[ord]), 10)
# [1] "T.H.L" "T.H.L.varies...." "T.H.L.So1_P1_A1"
# [4] "T.H.L.varies.....So1_P1_A1" "T.H.L.So2_P1_A2" "T.H.L.varies.....So2_P1_A2"
# [7] "T.H.L.So3_P1_A3" "T.H.L.varies.....So3_P1_A3" "T.H.L.So4_P1_A4"
# [10] "T.H.L.varies.....So4_P1_A4"
# finally, reorder your data frame columns
df2 = df2[, c(no_numbers, x[ord])]
And you should be done.
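As a sanity check (a sketch using the x_num computed above), each So number should appear exactly twice, once for T.H.L and once for T.H.L.varies....; a FALSE here means some column is missing its partner and the alternating pattern will drift:
all(table(x_num) == 2)
# [1] TRUE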

Creating a palindrome function in r

Hi, I have a data frame df and wish to find out whether there are any palindromes in one name column.
I have test data with 12 records in it. I know 2 of the Name records are palindromes.
The code below uses lapply to return a list of TRUE/FALSE values.
How do I return the names that go with the TRUE values, and how would I find out which is the most frequently occurring palindrome name?
is_palindrome = function(x){
  charsplit = strsplit(x, "")[[1]]
  revchar = rev(charsplit)
  all(charsplit==revchar)
}
dfnamelc = tolower(as.character(df$Name))
listtest = as.list(dfnamelc)
lapply(listtest,is_palindrome)
example df
Linda,F,100
Mary,F,150
Patrick,M,200
Barbara,F,300
Susan,F,100
Norman,M,40
Deborah,F,500
Sandra,F,23
Conor,M,80
anna,F,40
Otto,M,30
anna,M,40
It will probably be more convenient to use sapply() to return the results as a vector, and incorporate the results back into the data frame.
df <- transform(df,
                is_pal = sapply(tolower(Name), is_palindrome))
df$Name[df$is_pal] ## which names are palindromes?
paltab <- table(df$Name[df$is_pal]) ## count palindromic names
names(paltab)[which.max(paltab)] ## "anna"
I'm not sure what your third column signifies, so I'm ignoring it.
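For larger data, a compact variant of the same check (a sketch) takes the whole column at once and returns a logical vector:
is_pal_vec <- function(x) {
  x <- tolower(x)
  # reverse every name and compare it with the original
  rev_x <- vapply(strsplit(x, ""), function(ch) paste(rev(ch), collapse = ""), character(1))
  x == rev_x
}
df$is_pal <- is_pal_vec(df$Name)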

Conditional Insert of Rows

I have a unique dataset, a portion of which can be reproduced using:
data <- textConnection("SNP_Pres,Chr_N,BP_A1F,A1_Beta,A2_SE,ForSortSNP,SortOrder
rs122,13,100461219,C,T,rs122,6
1,16362,0.8701,-0.0048,0.0056,rs122,7
1,19509,0.546015137607046,-0.0033,0.0035,rs122,8
1,17218,0.1539,-0.004,0.013,rs122,9
rs142,13,61952115,G,T,rs142,6
1,16387,0.1295,0.0044,0.0057,rs142,7
1,17218,0.8454,0.006,0.013,rs142,9
rs160,13,100950452,C,T,rs160,6
1,16387,0.549,-0.0021,0.0035,rs160,7
1,19509,0.519102731537216,0.003,0.0027,rs160,8
rs298,13,66664221,C,G,rs298,6
1,19509,0.308290808358246,-0.0032,0.0033,rs298,8
1,17218,0.7227,0.022,0.01,rs298,9")
mydata <- read.csv(data, header = T, sep = ",", stringsAsFactors=FALSE)
It is formatted for use in a program that requires placeholder rows for missing data entries. In this case, a missing entry is indicated by a numeric skip in the SortOrder column. An entry is complete if the column runs 6 - 7 - 8 - 9, with a new entry beginning again at 6.
I need a way to read through the data file, and insert a row of zeros for each missing entry, so that the file looks like this:
data <- textConnection("SNP_Pres,Chr_N,BP_A1F,A1_Beta,A2_SE,ForSortSNP,SortOrder
rs122,13,100461219,C,T,rs122,6
1,16362,0.8701,-0.0048,0.0056,rs122,7
1,19509,0.546015137607046,-0.0033,0.0035,rs122,8
1,17218,0.1539,-0.004,0.013,rs122,9
rs142,13,61952115,G,T,rs142,6
1,16387,0.1295,0.0044,0.0057,rs142,7
0,0,0,0,0,rs142,8
1,17218,0.8454,0.006,0.013,rs142,9
rs160,13,100950452,C,T,rs160,6
1,16387,0.549,-0.0021,0.0035,rs160,7
1,19509,0.519102731537216,0.003,0.0027,rs160,8
0,0,0,0,0,rs160,9
rs298,13,66664221,C,G,rs298,6
0,0,0,0,0,rs298,7
1,19509,0.308290808358246,-0.0032,0.0033,rs298,8
1,17218,0.7227,0.022,0.01,rs298,9")
mydata <- read.csv(data, header = T, sep = ",", stringsAsFactors=FALSE)
Ultimately, the last two columns, ForSortSNP and SortOrder, will be deleted from the data file, but they are included now for convenience's sake.
Any suggestions are greatly appreciated.
Here is a solution using the expand.grid and merge functions.
grid <- with(mydata, expand.grid(ForSortSNP=unique(ForSortSNP), SortOrder=unique(SortOrder)))
complete <- merge(mydata, grid, all=TRUE, sort=FALSE)
complete[is.na(complete)] <- 0 # replace NAs with 0's
complete <- complete[order(complete$ForSortSNP, complete$SortOrder), ] # re-sort
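Since the question notes that ForSortSNP and SortOrder will ultimately be deleted, a final sketch of that cleanup once the placeholder rows are in place:
complete <- complete[, !(names(complete) %in% c("ForSortSNP", "SortOrder"))]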
