I have a tab-delimited file with 400 columns, and I want to append text to the column names. For example, if the column names are A and B, I want to change A to A.ovca and B to B.ctrls. Likewise, I want to add one of the two suffixes (ovca or ctrls) to each of the 400 columns: some column names get ovca and some get ctrls. All the column names are unique, and the file contains more than 1000 rows. A sample of the file is given below:
X Y Z A B C
2.34 .89 1.4 .92 9.40 .82
6.45 .04 2.55 .14 1.55 .04
1.09 .91 4.19 .16 3.19 .56
5.87 .70 3.47 .80 2.47 .90
And I want the file to look like:
X.ovca Y.ctrls Z.ctrls A.ovca B.ctrls C.ovca
2.34 .89 1.4 .92 9.40 .82
6.45 .04 2.55 .14 1.55 .04
1.09 .91 4.19 .16 3.19 .56
5.87 .70 3.47 .80 2.47 .90
Please do help me
Regards
Thileepan
If your data.frame is called dat, you can access (and write to) the column names with colnames(dat).
Therefore:
cn <- colnames(dat)
cn <- sub("([AXC])","\\1.ovca",cn)
cn <- sub("([YZB])","\\1.ctrls",cn)
colnames(dat) <- cn
> cn
[1] "X.ovca" "Y.ctrls" "Z.ctrls" "A.ovca" "B.ctrls" "C.ovca"
The \\1 is called a backreference within your regular expression. In the replacement, \\1 is replaced with whatever the parentheses in the pattern matched. Since the parentheses contain a bracket expression (a character class), the pattern matches any one of the letters inside. In this case, "A" becomes "A.ovca" and "X" becomes "X.ovca".
If your variable names are longer than one letter, this is easy enough to extend; just read up a bit on regular expressions.
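As a rough sketch of that extension, the same backreference idea works with longer names once the pattern is anchored. The column names below (tumor1, normal1, and so on) are made up for illustration and are not from the question:

```r
# Hypothetical multi-character column names (assumed for illustration)
cn <- c("tumor1", "tumor2", "normal1", "normal12")

# ^ and $ anchor the pattern so the whole name must match;
# the parentheses capture the name and \\1 re-inserts it before the suffix
cn <- sub("^(tumor[0-9]+)$", "\\1.ovca", cn)
cn <- sub("^(normal[0-9]+)$", "\\1.ctrls", cn)

print(cn)
```

Anchoring matters here: without ^ and $, a pattern for one group could also match part of a name that belongs to the other group.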
Here is a short solution using the stringr package.
library(stringr)
nam <- names(mydf)
# X, A and C get ".ovca"; every other column gets ".ctrls"
names(mydf) <- ifelse(nam %in% c('X', 'A', 'C'),
str_c(nam, '.ovca'), str_c(nam, '.ctrls'))
How about this? You basically find columns that you want to append "ovca" and "ctrls" using %in%, and append the appropriate tag.
> (mydf <- data.frame(X = runif(10), Y = runif(10), Z = runif(10), A = runif(10), B = runif(10), C = runif(10)))
X Y Z A B C
1 0.81030594 0.1624974 0.3977381 0.9619541 0.9866498 0.4424760
2 0.92498687 0.2069429 0.6065115 0.9969835 0.2407364 0.2455184
3 0.11033869 0.2878640 0.5662793 0.7936232 0.6066735 0.8210634
> names(mydf)[names(mydf) %in% c("X", "A", "C")] <- paste(names(mydf)[names(mydf) %in% c("X", "A", "C")], "ovca", sep = ".")
> names(mydf)[names(mydf) %in% c("Y", "Z", "B")] <- paste(names(mydf)[names(mydf) %in% c("Y", "Z", "B")], "ctrls", sep = ".")
> mydf
X.ovca Y.ctrls Z.ctrls A.ovca B.ctrls C.ovca
1 0.81030594 0.1624974 0.3977381 0.9619541 0.9866498 0.4424760
2 0.92498687 0.2069429 0.6065115 0.9969835 0.2407364 0.2455184
3 0.11033869 0.2878640 0.5662793 0.7936232 0.6066735 0.8210634
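With 400 columns, typing the names into two %in% vectors gets tedious. One possible alternative, sketched here under the assumption that you can build a lookup from column name to group (for example from a sample sheet), is a named character vector. The `group` vector below is hypothetical:

```r
# Small stand-in for the 400-column data frame
mydf <- data.frame(X = 1:2, Y = 3:4, Z = 5:6, A = 7:8, B = 9:10, C = 11:12)

# Hypothetical lookup: one suffix per original column name
group <- c(X = "ovca", Y = "ctrls", Z = "ctrls",
           A = "ovca", B = "ctrls", C = "ovca")

# Fail early if any column has no assigned group
stopifnot(all(names(mydf) %in% names(group)))

# Index the lookup by column name, so the order of `group` does not matter
names(mydf) <- paste(names(mydf), group[names(mydf)], sep = ".")

print(names(mydf))
```

Because the lookup is indexed by name, the columns can appear in any order in the file.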
I have raw, messy data for time series containing around 1400 observations. Here is a snippet of what it looks like:
[new Date('2021-08-24'),1.67,1.68,0.9,null],[new Date('2021-08-23'),1.65,1.68,0.9,null],[new Date('2021-08-22'),1.62,1.68,0.9,null] ... etc
I want to pull the date and its respective value to form a tsibble in R. So, from the above values, it would be like
Date
y-variable
2021-08-24
1.67
2021-08-23
1.65
2021-08-22
1.62
Notice that only the first value is to be paired with its respective date; I don't need the other values. Right now, the raw data has been copied and pasted into a Word document, and I am unsure how to wrangle it so that it can be imported into R.
How could I achieve this?
#replace the text connection with a file connection if desired; the file should then be a plain .txt file
input <- readLines(textConnection("[new Date('2021-08-24'),1.67,1.68,0.9,null],[new Date('2021-08-23'),1.65,1.68,0.9,null],[new Date('2021-08-22'),1.62,1.68,0.9,null]"))
#insert line breaks
input <- gsub("],[", "\n", input, fixed = TRUE)
#remove "new Date"
input <- gsub("new Date", "", input, fixed = TRUE)
#remove parentheses and brackets
input <- gsub("[\\(\\)\\[\\]]", "", input, perl = TRUE)
#import cleaned data
DF <- read.csv(text = input, header = FALSE, quote = "'")
DF$V1 <- as.Date(DF$V1)
print(DF)
# V1 V2 V3 V4 V5
#1 2021-08-24 1.67 1.68 0.9 null
#2 2021-08-23 1.65 1.68 0.9 null
#3 2021-08-22 1.62 1.68 0.9 null
How is this?
text <- "[new Date('2021-08-24'),1.67,1.68,0.9,null],[new Date('2021-08-23'),1.65,1.68,0.9,null],[new Date('2021-08-22'),1.62,1.68,0.9,null]"
df <- read.table(text = unlist(strsplit(gsub('new Date\\(|\\)', '', gsub('^.(.*).$', '\\1', text)), "].\\[")), sep = ",")
> df
V1 V2 V3 V4 V5
1 2021-08-24 1.67 1.68 0.9 null
2 2021-08-23 1.65 1.68 0.9 null
3 2021-08-22 1.62 1.68 0.9 null
Changing the column names and removing the last columns is trivial from this point.
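If only the date and the first value are needed, another option is to pull just those two pieces out with a regular expression instead of parsing every field. This is a minimal base R sketch; the names `raw`, `pairs` and `ts_df` are made up here, and the final tsibble step is left as a comment because it assumes the tsibble package is installed:

```r
raw <- "[new Date('2021-08-24'),1.67,1.68,0.9,null],[new Date('2021-08-23'),1.65,1.68,0.9,null],[new Date('2021-08-22'),1.62,1.68,0.9,null]"

# Match each "new Date('YYYY-MM-DD'),<first value>" fragment
pairs <- regmatches(raw, gregexpr("new Date\\('[0-9-]+'\\),[0-9.]+", raw))[[1]]

# Split each fragment into its date and its number
ts_df <- data.frame(
  Date = as.Date(sub("^new Date\\('([0-9-]+)'\\),.*$", "\\1", pairs)),
  y    = as.numeric(sub("^.*\\),", "", pairs))
)

print(ts_df)

# Optional, assuming the tsibble package is available:
# ts_tbl <- tsibble::as_tsibble(ts_df, index = Date)
```

Note that this pattern assumes the first value is always a plain non-negative number; a null or negative first value would need a wider pattern.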
I am fairly new to R and I feel like this should be pretty straightforward, but I keep getting errors. I have a data table called Master2 that has concentration data for several analytes, taken at different stations. My analytes are my column names. Several ND, or non-detect values exist in each column. I would like to change NDs in my TKN column to 0.05 and NDs in all other columns to 0.005.
I was able to easily change all ND values in the frame to 0.005 with this code:
Master2 <- Master2 %>%
mutate_if(is.character, ~ if_else(. == "ND", "0.005", .))
I have tried a variety of approaches (replace, mutate_at...) to change the NDs in the TKN column separately before running this line of code, with no success. Below is a mock-up of my data; any help is greatly appreciated!
Master2 <- data.frame(Station = c("C3A","C3A","C3A","MD10","MD10","MD10","C10A"),
Date = c("1/15/2009","1/16/2009","1/17/2009","1/18/2009","1/19/2009","1/20/2009","1/21/2009"),
DissAmmonia = c("0.3","0.25","0.18","ND","1.2","0.5","0.8"),
DissNitrateNitrite = c("0.6","ND","0.15","0.2","0.4","0.6","ND"),
TotPhos = c("0.1","0.3","ND","0.4","0.2","0.12","0.1"),
TKN = c("ND","0.2","0.13","0.5","ND","0.8","1.2"),
stringsAsFactors = FALSE)
You can do :
#Change 'ND' values in TKN to 0.05
Master2$TKN[Master2$TKN == 'ND'] <- 0.05
#Change 'ND' values in all other columns to 0.005
Master2[Master2 == 'ND'] <- 0.005
#Change the classes of data to respective types.
Master2 <- type.convert(Master2, as.is = TRUE)
Master2
# Station Date DissAmmonia DissNitrateNitrite TotPhos TKN
#1 C3A 1/15/2009 0.300 0.600 0.100 0.05
#2 C3A 1/16/2009 0.250 0.005 0.300 0.20
#3 C3A 1/17/2009 0.180 0.150 0.005 0.13
#4 MD10 1/18/2009 0.005 0.200 0.400 0.50
#5 MD10 1/19/2009 1.200 0.400 0.200 0.05
#6 MD10 1/20/2009 0.500 0.600 0.120 0.80
#7 C10A 1/21/2009 0.800 0.005 0.100 1.20
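If you want to stick with dplyr and still use mutate, try:
library(dplyr)
Master2 <- Master2 %>%
# replacements are strings because the columns are still character at this point
mutate(TKN = case_when(TKN == "ND" ~ "0.05", TRUE ~ TKN)) %>%
# .x is the column's values; cur_column() would only give its name
mutate(across(where(is.character) & !TKN, ~ case_when(.x == "ND" ~ "0.005",
TRUE ~ .x)))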
As you haven't provided a reproducible example, I can't test this code to verify I get your desired output
A combination of my code and @Ronak Shah's solution appears to work!
#Change 'ND' values in TKN to 0.05
Master2$TKN[Master2$TKN == 'ND'] <- 0.05
#Change 'ND' values in all other columns to 0.005
Master2 <- Master2 %>%
mutate_if(is.character, ~ if_else(. == "ND", "0.005", .))
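If more analytes end up with their own substitution value, one way to generalise this is a named vector of replacement values, one per analyte column. The following is only a base R sketch on a cut-down version of the mock data; `nd_value` is a hypothetical name, and the values are the ones from the question:

```r
# Cut-down version of the mock data, with everything stored as character
Master2 <- data.frame(
  Station     = c("C3A", "C3A", "MD10", "MD10"),
  DissAmmonia = c("0.3", "0.25", "ND", "1.2"),
  TotPhos     = c("0.1", "ND", "0.4", "0.2"),
  TKN         = c("ND", "0.2", "0.5", "ND"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup: the value that replaces "ND" in each analyte column
nd_value <- c(DissAmmonia = 0.005, TotPhos = 0.005, TKN = 0.05)

for (col in names(nd_value)) {
  x <- Master2[[col]]
  x[x == "ND"] <- nd_value[[col]]   # stored as text for now
  Master2[[col]] <- as.numeric(x)   # convert the whole column to numeric
}

print(Master2)
```

Columns that are not listed in `nd_value` (here Station) are left untouched.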
I have a very large data frame with SNPs in rows (~50,000) and IDs in columns (~500). An extract would look something like this:
R015 R016 R007
cg158 0.81 0.90 0.87
cg178 0.91 0.80 0.58
Now I want to save this as a txt file, which is normally no problem with
write.table(example, "example.txt", col.names=T, row.names=T, quote=F)
BUT I need to have a tab (\t) as the first entry of the header line, so in the txt file the data frame should look something like:
\t R015 R016 R007
cg158 0.81 0.90 0.87
cg178 0.91 0.80 0.58
(\t for the tab)
Can anyone help me how to do this?
Btw I also tried:
write.table(data.frame("\t"=rownames(example),example),"example.txt", row.names=FALSE)
It did not work, unfortunately...
Thanks!
This kind of works, just replace stdout() with the path to your output-file:
data <- data.frame(x = sample(1:100,3),
y = sample(1:100,3),
z = sample(1:100,3))
row.names(data) <- LETTERS[1:3]
lines <- c(paste(c(' ', names(data)), collapse = '\t'),
sapply(seq_len(nrow(data)),
function(i){
paste(c(row.names(data)[i], data[i,]),collapse = '\t')
}))
writeLines(lines, con = stdout())
#> x y z
#> A 35 97 27
#> B 12 69 24
#> C 25 9 34
Or with spaces as separators and the tab you wished for in the first column:
data <- data.frame(x = sample(1:100,3),
y = sample(1:100,3),
z = sample(1:100,3))
row.names(data) <- LETTERS[1:3]
lines <- c(paste(c('\t', names(data)), collapse = ' '),
sapply(seq_len(nrow(data)),
function(i){
paste(c(row.names(data)[i], data[i,]),collapse = ' ')
}))
writeLines(lines, con = stdout())
#> x y z
#> A 3 30 11
#> B 62 69 70
#> C 93 55 73
Using a data frame like the following, where I've changed one row name to illustrate how to deal with cases of unequal length:
df <- read.table(text = "R015 R016 R007
cg158 0.81 0.90 0.87
cg178kdfj 0.91 0.80 0.58")
You could do something like this:
df <- format(as.matrix(df))
df <- cbind("\\t" = rownames(df), df)
df <- rbind(colnames(df), df)
df[,1] <- stringr::str_pad(df[,1], max(nchar(df[,1])), "right")
write.table(df,
file = "example.txt",
sep = " ",
quote = F,
row.names = F,
col.names = F)
Output:
\t R015 R016 R007
cg158 0.81 0.90 0.87
cg178kdfj 0.91 0.80 0.58
I first converted the numeric values to character and formatted them to make sure they have the same number of digits, otherwise they won't line up. Then I turn the row names into a new variable named \\t, and then I turn the column names into a new row. I use stringr::str_pad() to account for row names of differing lengths. Finally, I write the data frame to TXT file without the row or column names.
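If what is actually needed is a tab-separated file whose header line begins with a real tab (that is, an empty first header field above the row names), write.table can produce that directly with col.names = NA. This is a small sketch on the two-row example; the temporary file path is only for demonstration:

```r
example <- data.frame(R015 = c(0.81, 0.91),
                      R016 = c(0.90, 0.80),
                      R007 = c(0.87, 0.58),
                      row.names = c("cg158", "cg178"))

out_file <- tempfile(fileext = ".txt")  # replace with "example.txt"

# col.names = NA (with row names kept) writes a blank first header field,
# so with sep = "\t" the header line starts with a tab character
write.table(example, out_file, sep = "\t", quote = FALSE, col.names = NA)

header_line <- readLines(out_file, n = 1)
print(header_line)
```

Unlike the padded layout above, this writes tabs between all fields, so the columns will not line up visually in a text editor, but the file reads back cleanly with read.delim(out_file, row.names = 1).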
I have data with the factors lang and alg, and I would like to compare a selected pair of lang values across all alg values:
> perf[perf$lang == "java", c("alg", "cpu")]
alg cpu
173 binarytrees 0.196
174 chameneosredux 0.404
175 fannkuchredux 0.648
> perf[perf$lang == "python3", c("alg", "cpu")]
alg cpu
246 binarytrees 0.972
248 fannkuchredux 13.752
249 fasta 1.152
For binarytrees I expect to get 0.196/0.972, for chameneosredux NA, for fannkuchredux 0.648/13.752, for fasta NA, and so on.
One way is to sort the rows on alg, but I don't understand how to inject rows with NA for the missing factor levels (all levels are available in unique(perf$alg)).
UPDATE: Despite the original question, I think what I want is to combine the columns of two data frames into a single data frame keyed on the same factor:
binarytrees 0.196 0.972
chameneosredux 0.404 NA
fasta NA 1.152
fannkuchredux 0.648 13.752
What you are looking for is essentially a FULL OUTER JOIN in SQL, which can be done with base::merge using all = TRUE.
Here is a comparable data set to demonstrate with:
Df <- data.frame(
Lang = rep(LETTERS[1:5], rep(3, 5)),
Alg = c(replicate(5, sample(letters[1:4], 3))),
Cpu = rnorm(15),
stringsAsFactors = FALSE
)
Note that I'm using stringsAsFactors = FALSE. I would suggest you convert your columns to character vectors as well; I don't see any need for using factors here.
This is the merge operation in a light wrapper function, just to make the presentation a little cleaner:
compare <- function(x, y, data) {
merge(x = data[data$Lang == x[1], 2:3],
y = data[data$Lang == y[1], 2:3],
by = "Alg", all = TRUE,
suffixes = c(paste0(".", x[1]),
paste0(".", y[1]))
)
}
And here it is in use:
compare("A", "D", Df)
# Alg Cpu.A Cpu.D
#1 a NA -0.06520117
#2 b 1.0587151 0.08379303
#3 c -2.0390119 NA
#4 d -0.8574474 1.27865596
compare("A", "C", Df)
# Alg Cpu.A Cpu.C
#1 b 1.0587151 -1.0230431
#2 c -2.0390119 -0.7691048
#3 d -0.8574474 -1.2421078
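Applied to the data in the question, the same full outer join also gives the ratios that were originally asked for. The sketch below rebuilds only the six rows shown in the question; `wide` and `ratio` are names made up here:

```r
perf <- data.frame(
  lang = c("java", "java", "java", "python3", "python3", "python3"),
  alg  = c("binarytrees", "chameneosredux", "fannkuchredux",
           "binarytrees", "fannkuchredux", "fasta"),
  cpu  = c(0.196, 0.404, 0.648, 0.972, 13.752, 1.152),
  stringsAsFactors = FALSE
)

# Full outer join on alg: algorithms missing on one side get NA
wide <- merge(perf[perf$lang == "java", c("alg", "cpu")],
              perf[perf$lang == "python3", c("alg", "cpu")],
              by = "alg", all = TRUE,
              suffixes = c(".java", ".python3"))

# NA propagates through the division, so unmatched algorithms stay NA
wide$ratio <- wide$cpu.java / wide$cpu.python3

print(wide)
```

Rows come back sorted by alg, which is merge's default behaviour.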
Regarding my comment, this can also be achieved using sqldf. Older versions of SQLite (before 3.39) do not support FULL OUTER JOIN, but this shouldn't be too much of an issue if you are comfortable with SQL, as there are probably a dozen or so ways to work around that:
library(sqldf)
sqldf(
"select x.Alg
,lhs.Cpu as 'Cpu.A'
,rhs.Cpu as 'Cpu.D'
from (
select distinct d.Alg
from Df d
) x
left join Df lhs on lhs.Alg = x.Alg and lhs.Lang = 'A'
left join Df rhs on rhs.Alg = x.Alg and rhs.Lang = 'D'
order by x.Alg"
)
# Alg Cpu.A Cpu.D
#1 a NA -0.06520117
#2 b 1.0587151 0.08379303
#3 c -2.0390119 NA
#4 d -0.8574474 1.27865596
I want to convert several columns in a data.frame from chr to numeric, and I would like to do it in a single line. Here is what I am trying to do:
items[,2:4] <- as.numeric(sub("\\$","",items[,2:4]))
But I get a warning saying:
Warning message:
NAs introduced by coercion
If I do it column by column though it works:
items[,2:2] <- as.numeric(sub("\\$","",items[,2:2]))
items[,3:3] <- as.numeric(sub("\\$","",items[,3:3]))
items[,4:4] <- as.numeric(sub("\\$","",items[,4:4]))
What am I missing here? Why can't I apply this command to multiple columns at once? Is this some odd R idiosyncrasy that I am not aware of?
Example Data:
Name, Cost1, Cost2, Cost3, Cost4
A, $10.00, $15.50, $13.20, $45.45
B, $45.23, $34.23, $34.24, $23.34
C, $23.43, $45.23, $65.23, $34.23
D, $76.34, $98.34, $90.34, $45.09
Your problem is that gsub (like sub) converts its x argument to character. If a list (a data.frame is in fact a list) is converted to character, something weird happens:
as.character(list(a=c("1", "1"), b="1"))
# "c(\"1\", \"1\")" "1"
# and "c(\"1\", \"1\")" cannot be converted to numeric
as.numeric("c(\"1\", \"1\")")
# NA
A one line solution would be to unlist the x argument:
items[, 2:5] <- as.numeric(gsub("\\$", "", unlist(items[, 2:5])))
Yes there is: apply is the command you are looking for:
items<-read.table(text="Name, Cost1, Cost2, Cost3, Cost4
A, $10.00, $15.50, $13.20, $45.45
B, $45.23, $34.23, $34.24, $23.34
C, $23.43, $45.23, $65.23, $34.23
D, $76.34, $98.34, $90.34, $45.09", header=TRUE,sep=",")
items[,2:4]<-apply(items[,2:4],2,function(x){as.numeric(gsub("\\$","",x))})
items
Name Cost1 Cost2 Cost3 Cost4
1 A 10.00 15.50 13.20 $45.45
2 B 45.23 34.23 34.24 $23.34
3 C 23.43 45.23 65.23 $34.23
4 D 76.34 98.34 90.34 $45.09
A more efficient approach would be:
items[-1] <- lapply(items[-1], function(x) as.numeric(gsub("$", "", x, fixed = TRUE)))
items
# Name Cost1 Cost2 Cost3 Cost4
# 1 A 10.00 15.50 13.20 45.45
# 2 B 45.23 34.23 34.24 23.34
# 3 C 23.43 45.23 65.23 34.23
# 4 D 76.34 98.34 90.34 45.09
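The same lapply idea can pick the columns by name rather than by position, which may be safer if the column order changes. This is a sketch only: the "^Cost" pattern assumes all price columns share that prefix, and trimws is added because values read from the sample data carry a leading space:

```r
# Stand-in for the example data, including the leading spaces
items <- data.frame(Name  = c("A", "B"),
                    Cost1 = c(" $10.00", " $45.23"),
                    Cost2 = c(" $15.50", " $34.23"),
                    stringsAsFactors = FALSE)

# Assumed naming convention: every price column starts with "Cost"
cost_cols <- grep("^Cost", names(items))

# Strip whitespace and the literal "$" (fixed = TRUE, so no escaping needed)
items[cost_cols] <- lapply(items[cost_cols], function(x) {
  as.numeric(sub("$", "", trimws(x), fixed = TRUE))
})

str(items)
```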
Some benchmarks of the answers so far
fun1 <- function() {
A[-1] <- lapply(A[-1], function(x) as.numeric(gsub("$", "", x, fixed=TRUE)))
A
}
fun2 <- function() {
A[, 2:ncol(A)] <- as.numeric(gsub("\\$", "", unlist(A[, 2:ncol(A)])))
A
}
fun3 <- function() {
A[, 2:ncol(A)] <- apply(A[,2:ncol(A)], 2, function(x) { as.numeric(gsub("\\$","",x)) })
A
}
Here's some sample data and processing times
set.seed(1)
A <- data.frame(Name = sample(LETTERS, 10000, TRUE),
matrix(paste0("$", sample(99, 10000*100, TRUE)),
ncol = 100))
system.time(fun1())
# user system elapsed
# 0.72 0.00 0.72
system.time(fun2())
# user system elapsed
# 5.84 0.00 5.85
system.time(fun3())
# user system elapsed
# 4.14 0.00 4.14