R: apply conversion to multiple columns of a data.frame

I want to convert several columns in a data.frame from character to numeric, and I would like to do it in a single line. Here is what I am trying:
items[,2:4] <- as.numeric(sub("\\$","",items[,2:4]))
But I get a warning:
Warning message:
NAs introduced by coercion
If I do it column by column though it works:
items[,2:2] <- as.numeric(sub("\\$","",items[,2:2]))
items[,3:3] <- as.numeric(sub("\\$","",items[,3:3]))
items[,4:4] <- as.numeric(sub("\\$","",items[,4:4]))
What am I missing here? Why can't I apply this command to multiple columns at once? Is this some odd R idiosyncrasy that I am not aware of?
Example Data:
Name, Cost1, Cost2, Cost3, Cost4
A, $10.00, $15.50, $13.20, $45.45
B, $45.23, $34.23, $34.24, $23.34
C, $23.43, $45.23, $65.23, $34.23
D, $76.34, $98.34, $90.34, $45.09

Your problem is that gsub converts its x argument to character. If a list (and a data.frame is in fact a list) is converted to character, something weird happens:
as.character(list(a=c("1", "1"), b="1"))
# "c(\"1\", \"1\")" "1"
# and "c(\"1\", \"1\")" can not convert into a numeric
as.numeric("c(\"1\", \"1\")")
# NA
A one line solution would be to unlist the x argument:
items[, 2:5] <- as.numeric(gsub("\\$", "", unlist(items[, 2:5])))
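To see why this works, a minimal sketch: unlist() flattens the selected columns into one vector in column order, and assigning the cleaned numeric vector back into the block refills it column by column.
df <- data.frame(a = c("$1", "$2"), b = c("$3", "$4"))
df[, 1:2] <- as.numeric(gsub("\\$", "", unlist(df[, 1:2])))
df
#   a b
# 1 1 3
# 2 2 4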

Yes, there is: apply is the command you are looking for:
items<-read.table(text="Name, Cost1, Cost2, Cost3, Cost4
A, $10.00, $15.50, $13.20, $45.45
B, $45.23, $34.23, $34.24, $23.34
C, $23.43, $45.23, $65.23, $34.23
D, $76.34, $98.34, $90.34, $45.09", header=TRUE,sep=",")
items[, 2:4] <- apply(items[, 2:4], 2, function(x) as.numeric(gsub("\\$", "", x)))
items
  Name Cost1 Cost2 Cost3  Cost4
1    A 10.00 15.50 13.20 $45.45
2    B 45.23 34.23 34.24 $23.34
3    C 23.43 45.23 65.23 $34.23
4    D 76.34 98.34 90.34 $45.09
Note that 2:4 covers only Cost1 to Cost3, which is why Cost4 above still shows dollar signs; use items[, 2:5] to convert all four cost columns.

A more efficient approach would be:
items[-1] <- lapply(items[-1], function(x) as.numeric(gsub("$", "", x, fixed = TRUE)))
items
# Name Cost1 Cost2 Cost3 Cost4
# 1 A 10.00 15.50 13.20 45.45
# 2 B 45.23 34.23 34.24 23.34
# 3 C 23.43 45.23 65.23 34.23
# 4 D 76.34 98.34 90.34 45.09
Some benchmarks of the answers so far (lapply operates on the data.frame's columns directly as a list, whereas apply first coerces the columns to a matrix, which is part of why fun1 is fastest):
fun1 <- function() {
  A[-1] <- lapply(A[-1], function(x) as.numeric(gsub("$", "", x, fixed=TRUE)))
  A
}
fun2 <- function() {
  A[, 2:ncol(A)] <- as.numeric(gsub("\\$", "", unlist(A[, 2:ncol(A)])))
  A
}
fun3 <- function() {
  A[, 2:ncol(A)] <- apply(A[, 2:ncol(A)], 2, function(x) as.numeric(gsub("\\$", "", x)))
  A
}
Here's some sample data and processing times
set.seed(1)
A <- data.frame(Name = sample(LETTERS, 10000, TRUE),
                matrix(paste0("$", sample(99, 10000*100, TRUE)),
                       ncol = 100))
system.time(fun1())
# user system elapsed
# 0.72 0.00 0.72
system.time(fun2())
# user system elapsed
# 5.84 0.00 5.85
system.time(fun3())
# user system elapsed
# 4.14 0.00 4.14

Make bootstrap function more efficient with lapply

I have a data frame with numeric columns and a character column with labels. See example:
library(tidyverse)
a <- c(0.036210845, 0.005546561, 0.004394322, 0.006635205, 2.269306824, 0.013542101, 0.006580308, 0.006854309, 0.009076331, 0.006577178, 0.099406840, 0.010962796, 0.011491922, 0.007454443, 0.004463684, 0.005836916, 0.011119906, 0.009543205, 0.003990476, 0.007793532, 0.020776231, 0.011713687, 0.010045341, 0.008411304, 0.032514994)
b <- c(0.030677829, 0.005210211, 0.004164294, 0.006279456, 1.095908581, 0.012029876, 0.006193405, 0.006486812, 0.008589699, 0.006167356, 0.068956516, 0.010140064, 0.010602171, 0.006898081, 0.004193735, 0.005447855, 0.009936211, 0.008743681, 0.003774822, 0.007375678, 0.019695336, 0.010827791, 0.009258572, 0.007960328, 0.026956408)
c <- c(0.025855453, 0.004882746, 0.003946182, 0.005929399, 0.466284591, 0.010704604, 0.005815709, 0.006125196, 0.008110854, 0.005769223, 0.046847336, 0.009356712, 0.009803620, 0.006366758, 0.003936953, 0.005072295, 0.008885989, 0.007989028, 0.003565631, 0.006964512, 0.018636187, 0.010009413, 0.008540876, 0.007516569, 0.022227924)
label <- c("fa05", "fa05", "fa05", "fa10", "fa10", "fa10", "fa20", "fa20", "faflat", "faflat", "sa05", "sa05", "sa10", "sa10", "sa10", "sa10", "sa10", "sa10", "sa20", "sa20", "sa20", "sa20", "saflat", "saflat", "saflat")
dataframe <- as.data.frame(cbind(a,b,c,label))
dataframe <- dataframe %>%
  transform(a = as.numeric(a)) %>%
  transform(b = as.numeric(b)) %>%
  transform(c = as.numeric(c))
I have written a function that takes a sample of rows for each label (number of rows in the sample = number of rows for that label) and gives the average of the samples as output. Example: in the source data (dataframe) there are 3 rows with the label "fa05". Let's call them fa05_1, fa05_2, fa05_3 (just for explanation). The function takes a sample of these three rows, each consisting of 3 columns (a, b and c). The number of fa05 rows in the sample equals the number of fa05 rows in the source data, so 3 in this case. The function samples with replacement, so the sample could e.g. be fa05_3, fa05_1, fa05_1. Then it takes the average of those three sampled rows for each of the three columns a, b and c and returns that as output. It looks like this:
samp <- function(df, col1, var){
  df %>%
    group_by(!!col1) %>%
    nest() %>%
    ungroup() %>%
    mutate(n = !!var) %>%
    mutate(samp = map2(data, n, sample_n, replace=T)) %>%
    select(-data) %>%
    unnest(samp) %>%
    group_by(!!col1) %>%
    dplyr::summarise(across("a":"c", mean))
}
list <- c(3,3,2,2,2,6,4,3) # the number of times each label occurs in the data
samp(dataframe, quo(label), quo(list))
label a b c
<chr> <dbl> <dbl> <dbl>
1 fa05 0.00439 0.00416 0.00395
2 fa10 0.00894 0.00820 0.00752
3 fa20 0.00672 0.00634 0.00597
4 faflat 0.00908 0.00859 0.00811
5 sa05 0.0552 0.0395 0.0281
6 sa10 0.00715 0.00657 0.00603
7 sa20 0.0101 0.00956 0.00903
8 saflat 0.0250 0.0211 0.0177
I would like to use this function on some data and repeat it 1000 times efficiently. At first it was not a function and I used rerun(), but that was very inefficient. I read that I could write it as a function and then use lapply, which should be more efficient, but it does not work when I do it like this:
lapply(dataframe, samp, col1=quo(Pattern), var=quo(list))
Error in UseMethod("group_by_") :
no applicable method for 'group_by_' applied to an object of class "c('double', 'numeric')"
How do I make this work with lapply? And how do I tell lapply to rerun the function 1000 times? I hope you can help.
You can just do this
replicate(1000, samp(dataframe, quo(label), quo(list)), simplify = FALSE)
However, this is really slow.
> system.time(replicate(1000, samp(dataframe, quo(label), quo(list)), simplify = FALSE))
user system elapsed
33.83 0.03 33.87
To make it faster, we need to rewrite your samp function. Here is a tidyverse approach
group_sample_size <- c("fa05" = 3, "fa10" = 3, "fa20" = 2, "faflat" = 2, "sa05" = 2, "sa10" = 6, "sa20" = 4, "saflat" = 3)
prep <- function(df, grp_var, sample_size) {
  df %>%
    mutate(size = sample_size[.data[[grp_var]]]) %>%
    group_by(across(!!grp_var))
}
rep_sample <- function(df, n) {
  replicate(
    n,
    df %>%
      slice(sample.int(n(), size[[1L]], replace = TRUE)) %>%
      summarise(across(a:c, mean), .groups = "drop"),
    simplify = FALSE
  )
}
dataframe %>%
  prep("label", group_sample_size) %>%
  rep_sample(1000)
Performance has improved significantly but is still suboptimal IMO. It takes about 5-6 seconds to finish the simulation.
> system.time(dataframe %>% prep("label", group_sample_size) %>% rep_sample(1000))
user system elapsed
5.80 0.01 5.81
For efficiency, I think the following data.table approach would be better.
library(data.table)
fsamp <- function(df, grp_var, size, nsim) {
  df <- as.data.table(df)
  group_info <- table(df[[grp_var]], dnn = list(grp_var))
  simu_pool <- df[, -grp_var, with = FALSE]
  simu_vars <- names(simu_pool)
  simu_pool <- split(simu_pool, df[[grp_var]])
  out <- data.table(
    simu = rep(seq_len(nsim), each = length(group_info)),
    group_info
  )
  out[
    , size := size[out[[grp_var]]]
  ][
    , (simu_vars) := lapply(simu_pool[[.BY[[grp_var]]]][sample.int(N, size, replace = TRUE)], mean),
    by = c("simu", grp_var)
  ][]
}
This one is about four times faster than the optimised tidyverse approach.
> system.time(fsamp(dataframe, "label", group_sample_size, 1000))
user system elapsed
1.47 0.04 1.50
All three approaches produce the same set of results
> set.seed(124)
> # rbindlist converts a list of tibbles into a single data.table
> dataframe %>% prep("label", group_sample_size) %>% rep_sample(1000) %>% rbindlist()
label a b c
1: fa05 0.015383909 0.013350778 0.011561460
2: fa10 0.763161377 0.371405971 0.160972865
3: fa20 0.006717308 0.006340109 0.005970452
4: faflat 0.009076331 0.008589699 0.008110854
5: sa05 0.055184818 0.039548290 0.028102024
---
7996: faflat 0.007826754 0.007378527 0.006940039
7997: sa05 0.099406840 0.068956516 0.046847336
7998: sa10 0.006648513 0.006118159 0.005626362
7999: sa20 0.020776231 0.019695336 0.018636187
8000: saflat 0.008411304 0.007960328 0.007516569
> set.seed(124)
> fsamp(dataframe, "label", group_sample_size, 1000)
simu label N size a b c
1: 1 fa05 3 3 0.015383909 0.013350778 0.011561460
2: 1 fa10 3 3 0.763161377 0.371405971 0.160972865
3: 1 fa20 2 2 0.006717308 0.006340109 0.005970452
4: 1 faflat 2 2 0.009076331 0.008589699 0.008110854
5: 1 sa05 2 2 0.055184818 0.039548290 0.028102024
---
7996: 1000 faflat 2 2 0.007826754 0.007378527 0.006940039
7997: 1000 sa05 2 2 0.099406840 0.068956516 0.046847336
7998: 1000 sa10 6 6 0.006648513 0.006118159 0.005626362
7999: 1000 sa20 4 4 0.020776231 0.019695336 0.018636187
8000: 1000 saflat 3 3 0.008411304 0.007960328 0.007516569
> set.seed(124)
> replicate(1000, samp(dataframe, quo(label), quo(list)), simplify = FALSE) %>% rbindlist()
label a b c
1: fa05 0.015383909 0.013350778 0.011561460
2: fa10 0.763161377 0.371405971 0.160972865
3: fa20 0.006717308 0.006340109 0.005970452
4: faflat 0.009076331 0.008589699 0.008110854
5: sa05 0.055184818 0.039548290 0.028102024
---
7996: faflat 0.007826754 0.007378527 0.006940039
7997: sa05 0.099406840 0.068956516 0.046847336
7998: sa10 0.006648513 0.006118159 0.005626362
7999: sa20 0.020776231 0.019695336 0.018636187
8000: saflat 0.008411304 0.007960328 0.007516569

Difficulty importing text file with multiple different delimiters in R

I'm having trouble figuring out how to import my data with multiple delimiters. The following is what my computer automatically saves into a text file. The issue is that some of the results are printed with differently spaced delimiters: some of the delimiters are colons (:) and others are runs of spaces of inconsistent length.
Each letter (B: to Z:) codes for some unique variable. For example:
B: Number of responses
C: Number of seconds, etc.
However, the information below "Z: 0.000", where the layout changes, is when the variables get subset. So,
A:
0: value1 value2 value3 value4
is referenced as:
A(0) = value1 (e.g. number of responses in the first trial)
A(1) = value2 (e.g. number of responses in the second trial)
A(2) = value3 (e.g. number of responses in the third trial)
A(3) = value4 (e.g. number of responses in the fourth trial)
Here, there are 4 "A" variables that each can carry unique values too.
Example of Text File:
Start Date: 05/20/18
End Date: 05/20/18
Subject: 1
Start Time: 16:23:11
End Time: 17:26:24
B: 7.000
C: 12000.000
D: 9500.000
E: 1.000
Q: 203.000
T: 1200.100
U: 218.000
W: 7.000
X: 347.000
Y: 0.000
Z: 0.000
A:
0: 1.000 0.000 0.000 0.000
F:
0: 11500.000 9500.000 13500.000 7500.000 15500.000
5: 5500.000 17500.000
I've tried a few methods, but they get stuck because of the multiple-delimiter issue. Let's assume "data" is the text file.
# This is the closest - some of the values are still not separated properly
temp <- read.delim2(file = "data", quote = ":", sep = "")
# This one separates the information mostly correctly for the top half only
temp <- read.delim2(file = "data", sep = ":")
I eventually want a dataframe with labels in one column (StartDate, A(0), B, etc.) and values in the other (05/20/2018, 1, 7).
library(dplyr)
library(splitstackshape)
#read file
txt <- readLines("test.txt")
#Fix 'A:' rows
A_idx <- grep("A:", txt)
txt[A_idx] <- paste0(txt[A_idx], gsub("0:\\s+", "", txt[A_idx+1]))
txt <- txt[-(A_idx+1)]
#Fix 'F:' rows
F_idx <- grep("F:", txt)
txt[F_idx] <- paste0(txt[F_idx], paste(gsub("0:\\s+", "", txt[F_idx+1]),
                                       gsub("5:\\s+", "", txt[F_idx+2])))
txt <- txt[-c(F_idx+1, F_idx+2)]
Now txt is in DCF format so it can be read using read.dcf
df <- data.frame(read.dcf(textConnection(txt)), stringsAsFactors = F) %>%
  cSplit("A", " ") %>%
  cSplit("F", " ")
Output is:
df
Start.Date End.Date Subject Start.Time End.Time B C D E Q T
1: 05/20/18 05/20/18 1 16:23:11 17:26:24 7.000 12000.000 9500.000 1.000 203.000 1200.100
U W X Y Z A_1 A_2 A_3 A_4 F_1 F_2 F_3 F_4 F_5 F_6 F_7
1: 218.000 7.000 347.000 0.000 0.000 1 0 0 0 11500 9500 13500 7500 15500 5500 17500
Sample data: test.txt contains
Start Date: 05/20/18
End Date: 05/20/18
Subject: 1
Start Time: 16:23:11
End Time: 17:26:24
B: 7.000
C: 12000.000
D: 9500.000
E: 1.000
Q: 203.000
T: 1200.100
U: 218.000
W: 7.000
X: 347.000
Y: 0.000
Z: 0.000
A:
0: 1.000 0.000 0.000 0.000
F:
0: 11500.000 9500.000 13500.000 7500.000 15500.000
5: 5500.000 17500.000
Start Date: 05/20/18
End Date: 05/20/18
... another block of data
Edit: If you want the indices of columns A and F to start from 0:
#read DCF data (i.e 'txt') using read.dcf
df <- data.frame(read.dcf(textConnection(txt)), stringsAsFactors = F)
#convert column A into wide format by splitting it into multiple columns
A_df <- data.frame(do.call(rbind, strsplit(as.character(df$A),'\\s+')), stringsAsFactors = F)
colnames(A_df) <- paste("A", sequence(ncol(A_df))-1, sep = "_")
#convert column F into wide format by splitting it into multiple columns
F_df <- data.frame(do.call(rbind, strsplit(as.character(df$F),'\\s+')), stringsAsFactors = F)
colnames(F_df) <- paste("F", sequence(ncol(F_df))-1, sep = "_")
#final data
final_df <- cbind(df[, !names(df) %in% c("A", "F")], A_df, F_df)
which gives
final_df
# Start.Date End.Date Subject Start.Time End.Time B C D E Q T U
#1 05/20/18 05/20/18 1 16:23:11 17:26:24 7.000 12000.000 9500.000 1.000 203.000 1200.100 218.000
# W X Y Z A_0 A_1 A_2 A_3 F_0 F_1 F_2 F_3 F_4
#1 7.000 347.000 0.000 0.000 1.000 0.000 0.000 0.000 11500.000 9500.000 13500.000 7500.000 15500.000
# F_5 F_6
#1 5500.000 17500.000
The good news is that your file does NOT have different delimiters. It is "Debian Control File" (DCF) format, in which leading whitespace marks continuation lines; see ?read.dcf. Unfortunately, I cannot figure out whether there is a way to parse .dcf including the semantics of continuation lines. But what the heck, once the data is in R, you can just clean it with library(tidyr):
x <- read.dcf("yoursourcefilename.txt")
y <- as.data.frame(x) # read.dcf reads the data in as a matrix
z <- y %>%
  separate("A", into = c("drop", "A0"), sep = "0:") %>%
  separate("A0", into = c("drop", paste0("A0_val_", 1:4)), sep = "\\s{2,}") %>%
  separate("F", into = c("drop", "F0"), sep = "0:") %>%
  separate("F0", into = c("F0", "F5"), sep = "5:") %>%
  separate("F0", into = c("drop", paste0("F0_val_", 1:5)), sep = "\\s{2,}") %>%
  separate("F5", into = c("drop", paste0("F5_val_", 1:2)), sep = "\\s{2,}") %>%
  select(-drop) %>% t() %>% as.data.frame()
z$V1 <- trimws(z$V1) # clean whatever whitespace is left
This will yield you a long dataframe:
dim(z)
[1] 27 1
Like so:
> z
V1
Start Date 05/20/18
End Date 05/20/18
Subject 1
Start Time 16:23:11
End Time 17:26:24
B 7.000
C 12000.000
D 9500.000
E 1.000
Q 203.000
T 1200.100
U 218.000
W 7.000
X 347.000
Y 0.000
Z 0.000
F5_val_1 5500.000
F5_val_2 17500.000
F0_val_1 11500.000
F0_val_2 9500.000
F0_val_3 13500.000
F0_val_4 7500.000
F0_val_5 15500.000
A0_val_1 1.000
A0_val_2 0.000
A0_val_3 0.000
A0_val_4 0.000
I am not sure this is the most efficient way to work with the data (it is not a tidy format), but it sounds like this is what you wanted?

Take column-wise differences across a data.table

How can I use data.table syntax to produce a data.table where each column contains the differences between the column of the original data.table and the next column?
Example: I have a data.table where each row is a group, and each column is surviving population after year 0, after year 1, 2, etc. Such as:
pop <- data.table(group_id = c(1, 2, 3),
                  N = c(4588L, 4589L, 4589L),
                  N_surv_1 = c(4213, 4243, 4264),
                  N_surv_2 = c(3703, 3766, 3820),
                  N_surv_3 = c(2953, 3054, 3159))
# group_id N N_surv_1 N_surv_2 N_surv_3
# 1 4588 4213 3703 2953
# 2 4589 4243 3766 3054
# 3 4589 4264 3820 3159
(Data types differ because N is a true integer count and N_surv_1, etc. are projections that could be fractional.)
What I have done: using the base diff and matrix transposition, we can:
diff <- data.table(t(diff(t(as.matrix(pop[,-1,with=FALSE])))))
setnames(diff, paste0("deaths_",1:ncol(diff)))
cbind(group_id = pop[,group_id],diff)
# produces desired output:
# group_id deaths_1 deaths_2 deaths_3
# 1 -375 -510 -750
# 2 -346 -477 -712
# 3 -325 -444 -661
I know that I can use base diff by group on a single column produced by melt.data.table, so this works but ain't pretty:
melt(pop,
     id.vars = "group_id"
)[order(group_id)][, setNames(as.list(diff(value)),
                              paste0("deaths_", 1:(ncol(pop)-2))),
                   keyby = group_id]
Is that the most data.table-riffic way to do this, or is there a way to do it as a multi-column operation in data.table?
Well, you could subtract the subsets:
ncols = grep("^N(_surv_[0-9]+)?", names(pop), value=TRUE)
pop[, Map(
  `-`,
  utils:::tail.default(.SD, -1),
  utils:::head.default(.SD, -1)
), .SDcols = ncols]
# N_surv_1 N_surv_2 N_surv_3
# 1: -375 -510 -750
# 2: -346 -477 -712
# 3: -325 -444 -661
You could assign these values to new columns with :=. I have no idea why tail and head are not made more easily available... As pointed out by @akrun, you could use with=FALSE instead, like pop[, .SD[, -1, with=FALSE] - .SD[, -ncol(.SD), with=FALSE], .SDcols=ncols].
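For instance, a minimal sketch of that := assignment by reference (the deaths_* column names are just illustrative):
pop[, paste0("deaths_", seq_len(length(ncols) - 1)) :=
      Map(`-`, utils:::tail.default(.SD, -1), utils:::head.default(.SD, -1)),
    .SDcols = ncols]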
Anyway, this is pretty convoluted compared to simply reshaping:
melt(pop, id="group_id")[, tail(value, -1) - head(value, -1), by=group_id]
# group_id V1
# 1: 1 -375
# 2: 1 -510
# 3: 1 -750
# 4: 2 -346
# 5: 2 -477
# 6: 2 -712
# 7: 3 -325
# 8: 3 -444
# 9: 3 -661
Without reshaping data and each row with a unique id, you can group by the id column and then calculate the difference with diff on each row, i.e. unlist(.SD):
pop[, setNames(as.list(diff(unlist(.SD))), paste0("deaths_", 1:(ncol(pop)-2))), group_id]
# group_id deaths_1 deaths_2 deaths_3
# 1: 1 -375 -510 -750
# 2: 2 -346 -477 -712
# 3: 3 -325 -444 -661
Essentially, something like this if you ignore setting up the column names:
pop[, as.list(diff(unlist(.SD))), group_id]
Here's another way to do it without reshaping or grouping which might make it faster. If it's small number of rows then it probably won't be a noticeable difference.
cols <- names(pop)[-1]
combs <- list()
for (i in 2:length(cols)) {
  combs[[length(combs)+1]] <- c(cols[i-1], cols[i])
}
newnames <- sapply(combs, function(x) gsub('N_surv', 'death', x[2]))
deathpop <- copy(pop)
deathpop[, (newnames) := lapply(combs, function(x) get(x[2]) - get(x[1]))]
deathpop[, (cols) := NULL]
I did some benchmarking
rows <- 10000000
pop <- data.table(group_id = 1:rows,
                  N = runif(rows, 3000, 4000),
                  N_surv_1 = runif(rows, 3000, 4000),
                  N_surv_2 = runif(rows, 3000, 4000),
                  N_surv_3 = runif(rows, 3000, 4000))
system.time({
  cols <- names(pop)[-1]
  combs <- list()
  for (i in 2:length(cols)) {
    combs[[length(combs)+1]] <- c(cols[i-1], cols[i])
  }
  newnames <- sapply(combs, function(x) gsub('N_surv', 'death', x[2]))
  deathpop <- copy(pop)
  deathpop[, (newnames) := lapply(combs, function(x) get(x[2]) - get(x[1]))]
  deathpop[, (cols) := NULL]
})
and it returned
user system elapsed
0.192 0.808 1.003
In contrast I did
system.time(pop[, as.list(diff(unlist(.SD))), group_id])
and it returned
user system elapsed
169.836 0.428 170.469
I also did
system.time({
  ncols = grep("^N(_surv_[0-9]+)?", names(pop), value=TRUE)
  pop[, Map(
    `-`,
    utils:::tail.default(.SD, -1),
    utils:::head.default(.SD, -1)
  ), .SDcols = ncols]
})
which returned
user system elapsed
0.044 0.044 0.089
Finally, doing
system.time(melt(pop, id="group_id")[, tail(value, -1) - head(value, -1), by=group_id])
returns
user system elapsed
223.360 1.736 225.315
Frank's Map solution is fastest. If you take the copying out of mine, then it gets a lot closer to Frank's time, but his still wins for this test case.
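For reference, a sketch of that no-copy variant (it modifies pop by reference, so only use it if the original N_surv_* columns are no longer needed):
cols <- names(pop)[-1]
combs <- list()
for (i in 2:length(cols)) {
  combs[[length(combs)+1]] <- c(cols[i-1], cols[i])
}
newnames <- sapply(combs, function(x) gsub('N_surv', 'death', x[2]))
pop[, (newnames) := lapply(combs, function(x) get(x[2]) - get(x[1]))]
pop[, (cols) := NULL]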

Assigning value to a huge data table element too slow

Given a data.table defined as follows:
n <- 34916339
ds2 <- data.table(X = rep(as.integer(NA), n), Y = rep(as.integer(NA), n),
                  LAT = rep(as.numeric(NA), n), LON = rep(as.numeric(NA), n),
                  FCT_DATE = rep(as.Date(NA), n), VAR = rep(as.character(NA), n),
                  TYPE = rep(as.character(NA), n), VALUE = rep(as.numeric(NA), n))
The following code is too slow.
# ds is a data.table with dim = 572399 x 66
colNames <- rep("any_string", 66) # only an example
for (i in 1:nrow(ds)) {
  for (j in 6:66) {
    colName <- colNames[j]
    colName.split <- strsplit(colName, "_") # split the elements by "_"
    k <- ((i-1) * length(colIndex)) + (j-5) # creates 61 lines each complete loop
    ds2[k, 6] <- colName.split[[1]][1]
    ds2[k, 7] <- colName.split[[1]][2] # so, it reads 61 cols from ds
    ds2[k, 8] <- ds[i, get(colName)]  # and creates 61 lines in ds2
  }
}
Does anyone know how I can improve this code? In particular, the assignments to columns 6, 7 and 8 are slow. I'm trying to convert the 66 columns of the data.table ds into an 8-column data.table.
Thanks in advance.
EDITED:
# Building an example of the data.table ds (the fastest way I know at the moment)
nds <- 572399 # number of rows, matching the 572399 x 66 dimensions above
ds <- data.table(1:nds, 1:nds, rep(3.3, nds), rep(4.4, nds), rep(as.Date("2014-08-16"), nds))
for (i in 1:61) {
  ds <- cbind(ds, rep(i + i/10, nds))
}
# setting the real names
names.ds <- c("X","Y","LAT","LON","FCT_DATE",
"UVES_01N","VVES_01N","PSNM_01N","PREC_01N","UVES_01P","VVES_01P","PSNM_01P","PREC_01P",
"UVES_02N","VVES_02N","PSNM_02N","PREC_02N","UVES_02P","VVES_02P","PSNM_02P","PREC_02P",
"UVES_03N","VVES_03N","PSNM_03N","PREC_03N","UVES_03P","VVES_03P","PSNM_03P","PREC_03P",
"UVES_04N","VVES_04N","PSNM_04N","PREC_04N","UVES_04P","VVES_04P","PSNM_04P","PREC_04P",
"UVES_05N","VVES_05N","PSNM_05N","PREC_05N","UVES_05P","VVES_05P","PSNM_05P","PREC_05P",
"UVES_06N","VVES_06N","PSNM_06N","PREC_06N","UVES_06P","VVES_06P","PSNM_06P","PREC_06P",
"UVES_07N","VVES_07N","PSNM_07N","PREC_07N","UVES_07P","VVES_07P","PSNM_07P","PREC_07P",
"UVES_AVN","VVES_AVN","PSNM_AVN","PREC_AVN","PREC_OBS")
setnames(ds, old=1:66, new=names.ds)
My goal is to convert it to a data.table like this:
X Y LAT LON FCT_DATE VAR TYPE VALUE
1: 312 54 -39.7401 -68.4375 2009-01-02 UVES 01N 0.63
2: 312 54 -39.7401 -68.4375 2009-01-02 VVES 01N -3.17
3: 312 54 -39.7401 -68.4375 2009-01-02 PSNM 01N 1019.52
...
34916339: 341 83 -39.7401 -68.4375 2009-01-02 PREC OBS 0.50
I think you're trying to reinvent the wheel. This works:
library(reshape2)
ds2 <- melt(ds, 1:5, variable.name = "VAR", value.name = "VALUE")
ds2[, VAR := as.character(VAR)]
ds2[, `:=`(TYPE = sub(".*_", "", VAR), VAR = sub("_.*", "", VAR))]
It was fairly slow on just 1,000,000 rows (on a MacBook Pro w/ OS 10.9, 2.8 GHz i7):
# user system elapsed
# 73.373 1.398 74.809
but at least it's parsimonious and readable. You also didn't say how slow "too slow" was, so I have no idea if this is an improvement. A strsplit-based solution took even longer (> 100 seconds), and stringr::str_match_all longer than that.
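For context, a hypothetical sketch of the kind of strsplit-based variant meant here (it splits the melted VAR column and indexes each piece, so the work scales with the number of rows rather than the number of column names):
ds_ss <- melt(ds, 1:5, variable.name = "VAR", value.name = "VALUE")
parts <- strsplit(as.character(ds_ss$VAR), "_", fixed = TRUE)
ds_ss[, `:=`(VAR = vapply(parts, `[`, "", 1L),
             TYPE = vapply(parts, `[`, "", 2L))]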
Here's a faster way. The other answer calls sub(...) twice for each row. There's no need to do that since these are just the column names, and there are only 66 of them. Using your code with nds <- 1e6 to create ds, the code below runs about 20X faster.
library(reshape2)
# code from other answer
system.time({
  ds2 <- melt(ds, 1:5, variable.name = "VAR", value.name = "VALUE")
  ds2[, VAR := as.character(VAR)]
  ds2[, `:=`(TYPE = sub(".*_", "", VAR), VAR = sub("_.*", "", VAR))]
})
# user system elapsed
# 239.43 1.05 240.78
# this code does not call sub(...) 2 million times
system.time({
  cn <- strsplit(colnames(ds)[6:66], "_")
  ds3 <- melt(ds, 1:5, variable.name = "VAR", value.name = "VALUE")
  ds3[, ":="(VAR = rep(sapply(cn, "[", 1), each = nrow(ds)),
             TYPE = rep(sapply(cn, "[", 2), each = nrow(ds)))]
})
# user system elapsed
# 13.87 8.96 22.83
identical(ds2,ds3)
# [1] TRUE

Appending or pasting text to column names in R

I have a tab-delimited file with 400 columns. Now I want to append text to the column names: i.e. if there are columns named A and B, I want to change A to A.ovca and B to B.ctrls. Likewise I want to add the texts (ovca and ctrls) to all 400 columns, some column names with ovca and some with ctrls. All the columns are unique and contain more than 1000 rows. A sample of the delimited file is given below:
X Y Z A B C
2.34 .89 1.4 .92 9.40 .82
6.45 .04 2.55 .14 1.55 .04
1.09 .91 4.19 .16 3.19 .56
5.87 .70 3.47 .80 2.47 .90
And I want the file to look like:
X.ovca Y.ctrls Z.ctrls A.ovca B.ctrls C.ovca
2.34 .89 1.4 .92 9.40 .82
6.45 .04 2.55 .14 1.55 .04
1.09 .91 4.19 .16 3.19 .56
5.87 .70 3.47 .80 2.47 .90
If your data.frame is called dat, you can access (and write to) the column names with colnames(dat).
Therefore:
cn <- colnames(dat)
cn <- sub("([AXC])","\\1.ovca",cn)
cn <- sub("([YZB])","\\1.ctrls",cn)
colnames(dat) <- cn
> cn
[1] "X.ovca" "Y.ctrls" "Z.ctrls" "A.ovca" "B.ctrls" "C.ovca"
The \\1 is a backreference within your regular expression: it is replaced with whatever the parenthesized group in the pattern matched. Since inside the parentheses you have a bracketed character class, it will match any of the letters inside. In this case, "A" becomes "A.ovca" and "X" becomes "X.ovca".
If your variable names are more than one letter, this is easy enough to extend; just read up a bit on regexes, or match on full names as in the sketch below.
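For example, a sketch of one way to handle multi-letter names, assuming you can list which columns are cases and which are controls (the gene* names are hypothetical):
ovca_cols <- c("geneA", "geneX") # hypothetical case columns
ctrl_cols <- c("geneB", "geneY") # hypothetical control columns
cn <- colnames(dat)
cn[cn %in% ovca_cols] <- paste0(cn[cn %in% ovca_cols], ".ovca")
cn[cn %in% ctrl_cols] <- paste0(cn[cn %in% ctrl_cols], ".ctrls")
colnames(dat) <- cn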
Here is a two-liner using the stringr package:
library(stringr)
nam <- names(mydf)
names(mydf) <- ifelse(nam %in% c('X', 'A', 'C'),
                      str_c(nam, '.ovca'), str_c(nam, '.ctrls'))
How about this? You basically find columns that you want to append "ovca" and "ctrls" using %in%, and append the appropriate tag.
> (mydf <- data.frame(X = runif(10), Y = runif(10), Z = runif(10), A = runif(10), B = runif(10), C = runif(10)))
X Y Z A B C
1 0.81030594 0.1624974 0.3977381 0.9619541 0.9866498 0.4424760
2 0.92498687 0.2069429 0.6065115 0.9969835 0.2407364 0.2455184
3 0.11033869 0.2878640 0.5662793 0.7936232 0.6066735 0.8210634
> names(mydf)[names(mydf) %in% c("X", "A", "C")] <- paste(names(mydf)[names(mydf) %in% c("X", "A", "C")], "ovca", sep = ".")
> names(mydf)[names(mydf) %in% c("Y", "Z", "B")] <- paste(names(mydf)[names(mydf) %in% c("Y", "Z", "B")], "ctrls", sep = ".")
> mydf
X.ovca Y.ctrls Z.ctrls A.ovca B.ctrls C.ovca
1 0.81030594 0.1624974 0.3977381 0.9619541 0.9866498 0.4424760
2 0.92498687 0.2069429 0.6065115 0.9969835 0.2407364 0.2455184
3 0.11033869 0.2878640 0.5662793 0.7936232 0.6066735 0.8210634
