I'm having trouble figuring out how to import data that uses multiple delimiters. Below is what my computer automatically saves into a text file. The issue is that the results are printed with inconsistent delimiters: some are colons (:) and others are runs of spaces of varying length.
Each letter (B: to Z:) codes for some unique variable. For example:
B: Number of responses
C: Number of seconds, etc.
However, below "Z: 0.000", where the layout changes, the variables are indexed arrays. So,
A:
0: value1 value2 value3 value4
is referenced as:
A(0) = value1 (e.g. number of responses in the first trial)
A(1) = value2 (e.g. number of responses in the second trial)
A(2) = value3 (e.g. number of responses in the third trial)
A(3) = value4 (e.g. number of responses in the fourth trial)
Here, there are 4 "A" variables, each of which can carry its own value.
Example of Text File:
Start Date: 05/20/18
End Date: 05/20/18
Subject: 1
Start Time: 16:23:11
End Time: 17:26:24
B: 7.000
C: 12000.000
D: 9500.000
E: 1.000
Q: 203.000
T: 1200.100
U: 218.000
W: 7.000
X: 347.000
Y: 0.000
Z: 0.000
A:
0: 1.000 0.000 0.000 0.000
F:
0: 11500.000 9500.000 13500.000 7500.000 15500.000
5: 5500.000 17500.000
I've tried a few methods, but they get stuck on the multiple-delimiter issue. Let's assume "data" is the text file.
# This is the closest - some of the values are still not separated properly
temp <- read.delim2(file = "data", quote = ":", sep = "",)
# This one separates the information mostly correctly, but only for the top half
temp <- read.delim2(file = "data", sep = ":")
I eventually want a dataframe with labels in one column (StartDate, A(0), B, etc.) and values in the other (05/20/2018, 1, 7).
library(dplyr)
library(splitstackshape)
#read file
txt <- readLines("test.txt")
#Fix 'A:' rows
A_idx <- grep("A:", txt)
txt[A_idx] <- paste0(txt[A_idx], gsub("0:\\s+", "", txt[A_idx+1]))
txt <- txt[-(A_idx+1)]
#Fix 'F:' rows
F_idx <- grep("F:", txt)
txt[F_idx] <- paste0(txt[F_idx], paste(gsub("0:\\s+", "", txt[F_idx+1]),
gsub("5:\\s+", "", txt[F_idx+2])))
txt <- txt[-c(F_idx+1, F_idx+2)]
Now txt is in DCF format, so it can be read using read.dcf:
df <- data.frame(read.dcf(textConnection(txt)), stringsAsFactors = FALSE) %>%
cSplit("A", " ") %>%
cSplit("F", " ")
Output is:
df
Start.Date End.Date Subject Start.Time End.Time B C D E Q T
1: 05/20/18 05/20/18 1 16:23:11 17:26:24 7.000 12000.000 9500.000 1.000 203.000 1200.100
U W X Y Z A_1 A_2 A_3 A_4 F_1 F_2 F_3 F_4 F_5 F_6 F_7
1: 218.000 7.000 347.000 0.000 0.000 1 0 0 0 11500 9500 13500 7500 15500 5500 17500
Sample data: test.txt contains
Start Date: 05/20/18
End Date: 05/20/18
Subject: 1
Start Time: 16:23:11
End Time: 17:26:24
B: 7.000
C: 12000.000
D: 9500.000
E: 1.000
Q: 203.000
T: 1200.100
U: 218.000
W: 7.000
X: 347.000
Y: 0.000
Z: 0.000
A:
0: 1.000 0.000 0.000 0.000
F:
0: 11500.000 9500.000 13500.000 7500.000 15500.000
5: 5500.000 17500.000
Start Date: 05/20/18
End Date: 05/20/18
... another block of data
Edit: If you want the indices of columns A and F to start from 0:
#read DCF data (i.e 'txt') using read.dcf
df <- data.frame(read.dcf(textConnection(txt)), stringsAsFactors = F)
#convert column A into wide format by splitting it into multiple columns
A_df <- data.frame(do.call(rbind, strsplit(as.character(df$A),'\\s+')), stringsAsFactors = F)
colnames(A_df) <- paste("A", sequence(ncol(A_df))-1, sep = "_")
#convert column F into wide format by splitting it into multiple columns
F_df <- data.frame(do.call(rbind, strsplit(as.character(df$F),'\\s+')), stringsAsFactors = F)
colnames(F_df) <- paste("F", sequence(ncol(F_df))-1, sep = "_")
#final data
final_df <- cbind(df[, !names(df) %in% c("A", "F")], A_df, F_df)
which gives
final_df
# Start.Date End.Date Subject Start.Time End.Time B C D E Q T U
#1 05/20/18 05/20/18 1 16:23:11 17:26:24 7.000 12000.000 9500.000 1.000 203.000 1200.100 218.000
# W X Y Z A_0 A_1 A_2 A_3 F_0 F_1 F_2 F_3 F_4
#1 7.000 347.000 0.000 0.000 1.000 0.000 0.000 0.000 11500.000 9500.000 13500.000 7500.000 15500.000
# F_5 F_6
#1 5500.000 17500.000
The good news is that your file does NOT have different delimiters. It is in "Debian Control File" (DCF) format, where leading whitespace marks continuation lines; see ?read.dcf. Unfortunately, I could not find a way to make read.dcf split the continuation lines into separate values. But once the data is in R, you can clean it up with tidyr and dplyr:
library(tidyr)
library(dplyr) # for %>% and select()
x <- read.dcf("yoursourcefilename.txt")
y <- as.data.frame(x) # read.dcf returns a character matrix
z <- y %>%
separate("A", into = c("drop", "A0"), sep = "0:") %>%
separate("A0", into = c("drop", paste0("A0_val_", 1:4)), sep = "\\s{2,}") %>%
separate("F", into = c("drop", "F0"), sep = "0:") %>%
separate("F0", into = c("F0", "F5"), sep = "5:") %>%
separate("F0", into = c("drop", paste0("F0_val_", 1:5)), sep = "\\s{2,}") %>%
separate("F5", into = c("drop", paste0("F5_val_", 1:2)), sep = "\\s{2,}") %>%
select(-drop) %>% t() %>% as.data.frame()
z$V1 <- trimws(z$V1) # clean whatever whitespace is left
This will yield you a long dataframe:
dim(z)
[1] 27 1
Like so:
> z
V1
Start Date 05/20/18
End Date 05/20/18
Subject 1
Start Time 16:23:11
End Time 17:26:24
B 7.000
C 12000.000
D 9500.000
E 1.000
Q 203.000
T 1200.100
U 218.000
W 7.000
X 347.000
Y 0.000
Z 0.000
F5_val_1 5500.000
F5_val_2 17500.000
F0_val_1 11500.000
F0_val_2 9500.000
F0_val_3 13500.000
F0_val_4 7500.000
F0_val_5 15500.000
A0_val_1 1.000
A0_val_2 0.000
A0_val_3 0.000
A0_val_4 0.000
I am not sure this is the most efficient format to work with (it is not tidy), but it sounds like this is what you wanted.
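If the array names are not known in advance, the same long label/value table can be built without hard-coding A and F. The following is only a minimal base-R sketch, assuming every array header is a line such as "A:" with nothing after the colon, and every continuation line starts with the index of its first value. parse_block() is a hypothetical helper name, not part of either answer above.

```r
# Hypothetical helper (not from the answers above): turn the raw lines of one
# block into a two-column data frame of labels and values.
parse_block <- function(lines) {
  lines <- trimws(lines)
  lines <- lines[nzchar(lines)]
  labels <- character(0)
  values <- character(0)
  current <- NA_character_                  # name of the array being read
  for (ln in lines) {
    key  <- sub(":.*$", "", ln)             # text before the first colon
    rest <- trimws(sub("^[^:]*:", "", ln))  # text after the first colon
    if (grepl("^[0-9]+$", key)) {
      # Continuation line such as "5: 5500.000 17500.000":
      # the key is the index of the first value on that line.
      vals <- strsplit(rest, "\\s+")[[1]]
      idx  <- as.integer(key) + seq_along(vals) - 1L
      labels <- c(labels, sprintf("%s(%d)", current, idx))
      values <- c(values, vals)
    } else if (!nzchar(rest)) {
      current <- key                        # array header such as "A:"
    } else {
      labels <- c(labels, key)              # ordinary "label: value" line
      values <- c(values, rest)
    }
  }
  data.frame(label = labels, value = values, stringsAsFactors = FALSE)
}

# Usage on a shortened copy of the sample block
txt <- c("Start Time: 16:23:11",
         "B: 7.000",
         "A:",
         "     0:        1.000        0.000",
         "F:",
         "     0:    11500.000     9500.000    13500.000     7500.000    15500.000",
         "     5:     5500.000    17500.000")
long_df <- parse_block(txt)
print(long_df)
```

Values are kept as character because dates and times share the column with numbers; convert the numeric rows afterwards if needed.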
I have a data frame with numeric columns and a character column with labels. See example:
library(tidyverse)
a <- c(0.036210845, 0.005546561, 0.004394322 ,0.006635205, 2.269306824 ,0.013542101, 0.006580308 ,0.006854309,0.009076331 ,0.006577178 ,0.099406840 ,0.010962796, 0.011491922,0.007454443 ,0.004463684,0.005836916,0.011119906 ,0.009543205, 0.003990476, 0.007793532 ,0.020776231, 0.011713687, 0.010045341, 0.008411304, 0.032514994)
b <- c(0.030677829, 0.005210211, 0.004164294, 0.006279456 ,1.095908581 ,0.012029876, 0.006193405 ,0.006486812, 0.008589699, 0.006167356, 0.068956516 ,0.010140064 ,0.010602171 ,0.006898081 ,0.004193735, 0.005447855 ,0.009936211, 0.008743681, 0.003774822, 0.007375678, 0.019695336, 0.010827791, 0.009258572, 0.007960328,0.026956408)
c <- c(0.025855453, 0.004882746 ,0.003946182, 0.005929399 ,0.466284591 ,0.010704604 ,0.005815709, 0.006125196, 0.008110854, 0.005769223, 0.046847336, 0.009356712, 0.009803620 ,0.006366758, 0.003936953 ,0.005072295, 0.008885989 ,0.007989028, 0.003565631, 0.006964512, 0.018636187, 0.010009413, 0.008540876, 0.007516569,0.022227924)
label <- c("fa05","fa05" ,"fa05", "fa10", "fa10", "fa10", "fa20","fa20", "faflat", "faflat", "sa05", "sa05", "sa10" , "sa10" , "sa10" , "sa10", "sa10", "sa10", "sa20", "sa20", "sa20" ,"sa20", "saflat", "saflat", "saflat")
dataframe <- as.data.frame(cbind(a,b,c,label))
dataframe <- dataframe %>%
transform(a = as.numeric(a)) %>%
transform(b = as.numeric(b)) %>%
transform(c = as.numeric(c))
I have written a function that, for each label, draws a sample of rows with replacement (sample size = number of rows with that label) and returns the column means of the sample. For example, the source data (dataframe) has 3 rows with the label "fa05"; call them fa05_1, fa05_2 and fa05_3. Each row has 3 columns (a, b and c). The function draws 3 of these rows with replacement, so the sample could be, for example, fa05_3, fa05_1, fa05_1. It then averages those three rows for each of the columns a, b and c. It looks like this:
samp <- function(df, col1, var){
df %>%
group_by(!!col1) %>%
nest() %>%
ungroup() %>%
mutate(n = !!var) %>%
mutate(samp = map2(data, n, sample_n, replace=T)) %>%
select(-data) %>%
unnest(samp) %>%
group_by(!!col1) %>%
dplyr::summarise(across("a":"c", mean))
}
list <- c(3,3,2,2,2,6,4,3) # The number of times each label occurs in the data
samp(dataframe, quo(label), quo(list))
label a b c
<chr> <dbl> <dbl> <dbl>
1 fa05 0.00439 0.00416 0.00395
2 fa10 0.00894 0.00820 0.00752
3 fa20 0.00672 0.00634 0.00597
4 faflat 0.00908 0.00859 0.00811
5 sa05 0.0552 0.0395 0.0281
6 sa10 0.00715 0.00657 0.00603
7 sa20 0.0101 0.00956 0.00903
8 saflat 0.0250 0.0211 0.0177
I would like to apply this function to some data and repeat it 1000 times efficiently. At first it was not a function and I used rerun(), but that was very inefficient. I read that I could write it as a function and then use lapply, which should be more efficient, but it does not work when I call it like this:
lapply(dataframe, samp, col1=quo(Pattern), var=quo(list))
Error in UseMethod("group_by_") :
no applicable method for 'group_by_' applied to an object of class "c('double', 'numeric')"
How do I make this work with lapply? And how do I tell lapply to rerun the function 1000 times? I hope you can help.
You can just do this
replicate(1000, samp(dataframe, quo(label), quo(list)), simplify = FALSE)
However, this is really slow.
> system.time(replicate(1000, samp(dataframe, quo(label), quo(list)), simplify = FALSE))
user system elapsed
33.83 0.03 33.87
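As for why the lapply call in the question fails: lapply over a data frame iterates over its columns, so each call receives a bare numeric vector rather than the whole data frame, which is what the group_by error is complaining about. To repeat a computation with lapply, iterate over a counter instead. A small sketch on toy data (toy and runs are made-up names, and the resampled column means stand in for samp):

```r
toy <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))

# lapply() hands each COLUMN of a data frame to the function, one at a time
col_classes <- lapply(toy, class)
print(names(col_classes))

# To repeat a computation, loop over a counter and ignore the counter itself
set.seed(1)
runs <- lapply(seq_len(5), function(i) {
  picked <- sample.int(nrow(toy), replace = TRUE)  # resample row numbers
  colMeans(toy[picked, ])
})
print(length(runs))
```

This is equivalent in spirit to replicate(n, expr, simplify = FALSE), so it is not expected to be any faster.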
To make it faster, we need to rewrite your samp function. Here is a tidyverse approach
group_sample_size <- c("fa05" = 3, "fa10" = 3, "fa20" = 2, "faflat" = 2, "sa05" = 2, "sa10" = 6, "sa20" = 4, "saflat" = 3)
prep <- function(df, grp_var, sample_size) {
df %>%
mutate(size = sample_size[.data[[grp_var]]]) %>%
group_by(across(!!grp_var))
}
rep_sample <- function(df, n) {
replicate(
n,
df %>%
slice(sample.int(n(), size[[1L]], replace = TRUE)) %>%
summarise(across(a:c, mean), .groups = "drop"),
simplify = FALSE
)
}
dataframe %>%
prep("label", group_sample_size) %>%
rep_sample(1000)
Performance has improved significantly but is still suboptimal IMO. It takes about 5-6 seconds to finish the simulation.
> system.time(dataframe %>% prep("label", group_sample_size) %>% rep_sample(1000))
user system elapsed
5.80 0.01 5.81
For efficiency, I think the following data.table approach would be better.
library(data.table)
fsamp <- function(df, grp_var, size, nsim) {
df <- as.data.table(df)
group_info <- table(df[[grp_var]], dnn = list(grp_var))
simu_pool <- df[, -grp_var, with = FALSE]
simu_vars <- names(simu_pool)
simu_pool <- split(simu_pool, df[[grp_var]])
out <- data.table(
simu = rep(seq_len(nsim), each = length(group_info)),
group_info
)
out[
, size := size[out[[grp_var]]]
][
, (simu_vars) := lapply(simu_pool[[.BY[[grp_var]]]][sample.int(N, size, replace = TRUE)], mean),
by = c("simu", grp_var)
][]
}
This one is about four times faster than the optimised tidyverse approach.
> system.time(fsamp(dataframe, "label", group_sample_size, 1000))
user system elapsed
1.47 0.04 1.50
All three approaches produce the same set of results
> set.seed(124)
> # rbindlist converts a list of tibbles into a single data.table
> dataframe %>% prep("label", group_sample_size) %>% rep_sample(1000) %>% rbindlist()
label a b c
1: fa05 0.015383909 0.013350778 0.011561460
2: fa10 0.763161377 0.371405971 0.160972865
3: fa20 0.006717308 0.006340109 0.005970452
4: faflat 0.009076331 0.008589699 0.008110854
5: sa05 0.055184818 0.039548290 0.028102024
---
7996: faflat 0.007826754 0.007378527 0.006940039
7997: sa05 0.099406840 0.068956516 0.046847336
7998: sa10 0.006648513 0.006118159 0.005626362
7999: sa20 0.020776231 0.019695336 0.018636187
8000: saflat 0.008411304 0.007960328 0.007516569
> set.seed(124)
> fsamp(dataframe, "label", group_sample_size, 1000)
simu label N size a b c
1: 1 fa05 3 3 0.015383909 0.013350778 0.011561460
2: 1 fa10 3 3 0.763161377 0.371405971 0.160972865
3: 1 fa20 2 2 0.006717308 0.006340109 0.005970452
4: 1 faflat 2 2 0.009076331 0.008589699 0.008110854
5: 1 sa05 2 2 0.055184818 0.039548290 0.028102024
---
7996: 1000 faflat 2 2 0.007826754 0.007378527 0.006940039
7997: 1000 sa05 2 2 0.099406840 0.068956516 0.046847336
7998: 1000 sa10 6 6 0.006648513 0.006118159 0.005626362
7999: 1000 sa20 4 4 0.020776231 0.019695336 0.018636187
8000: 1000 saflat 3 3 0.008411304 0.007960328 0.007516569
> set.seed(124)
> replicate(1000, samp(dataframe, quo(label), quo(list)), simplify = FALSE) %>% rbindlist()
label a b c
1: fa05 0.015383909 0.013350778 0.011561460
2: fa10 0.763161377 0.371405971 0.160972865
3: fa20 0.006717308 0.006340109 0.005970452
4: faflat 0.009076331 0.008589699 0.008110854
5: sa05 0.055184818 0.039548290 0.028102024
---
7996: faflat 0.007826754 0.007378527 0.006940039
7997: sa05 0.099406840 0.068956516 0.046847336
7998: sa10 0.006648513 0.006118159 0.005626362
7999: sa20 0.020776231 0.019695336 0.018636187
8000: saflat 0.008411304 0.007960328 0.007516569
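For comparison, the same resampling idea can be written in base R with no extra packages. This is only a sketch under the assumption stated in the question that the sample size for each group equals the group's row count; boot_group_means() is a hypothetical name, and it draws its random numbers in a different order from the functions above, so its results will not match them seed for seed.

```r
# Hypothetical base-R helper: resample rows within each group (with
# replacement, sample size = group size) and return the column means per run.
boot_group_means <- function(df, grp_var, value_vars, nsim) {
  groups <- split(seq_len(nrow(df)), df[[grp_var]])  # row numbers per group
  mat <- as.matrix(df[value_vars])                   # numeric columns only
  sims <- lapply(seq_len(nsim), function(s) {
    means <- t(vapply(groups, function(rows) {
      picked <- rows[sample.int(length(rows), length(rows), replace = TRUE)]
      colMeans(mat[picked, , drop = FALSE])
    }, numeric(length(value_vars))))
    data.frame(simu = s, group = names(groups), means, row.names = NULL)
  })
  do.call(rbind, sims)
}

# Usage on toy data where every value within a group is identical,
# so the resampled means are known in advance
toy <- data.frame(a = c(1, 1, 5, 5, 5),
                  b = c(2, 2, 7, 7, 7),
                  label = c("x", "x", "y", "y", "y"))
set.seed(1)
out <- boot_group_means(toy, "label", c("a", "b"), nsim = 10)
print(head(out))
```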
I have two matrices (A and B). I am trying to subset the rows of B that match rows of A within a tolerance. For example,
Matrix A contains (I have more than 200 compounds):
Name  Mass     RT    Area  ID
Asa   234.032  1.56  6755  Sd323
bda   164.041  4.48  5353  SD424
dsf   353.953  6.53  2535  SD422
fed   535.535  5.14  4542  SD424
Matrix B contains (the original matrix or CSV similarly contains 5000 compounds):
Name  mass     RT    Area  chemID  pubID  score
csa   234.031  1.56  4354  frsg    gss    90
bda   164.041  4.78  4346  gsdg    gsf    80
dwf   432.035  9.84  4245  grhr    hfg    99
fsf   535.042  7.01  5353  heth    gww    90
Now I want to subset the matching compounds from matrix B using the intervals Mass ± 0.001 and RT ± 0.5, so that the final matrix looks like:
Name  mass     RT    Area  chemID  pubID  score
csa   234.031  1.56  4354  frsg    gss    90
bda   164.041  4.78  4346  gsdg    gsf    80
I tried the following commands in R, but they didn't work well. Any help is really appreciated.
library(readxl) # provides read_excel()
# Read in the first table
fname = "A.csv"
df1 = read.csv(fname)
# Read in the second table
fname = "B.xlsx"
df2 = read_excel(fname, skip=4)
# Create an empty data frame
new_df = setNames(data.frame(matrix(ncol = ncol(df2), nrow = 0)), colnames(df2))
# Set the threshold for the mass and the retention time
m_ths = 1.e-3 # Mass threshold
rt_ths = 0.5 # Retention time threshold
# Loop over the indices of one of the data frames
for (i in 1:nrow(df1)) {
# Get the mass and retention time of the current row
m = df1$Mass[i]
rt = df1$RT[i]
# Get boolean vectors of rows within the second table that are within the
# given tolerance of the current mass (m) and retention time (rt)
m_cond = df2$Mass >= m-m_ths & df2$Mass <= m+m_ths
rt_cond = df2$RT >= rt-rt_ths & df2$RT <= rt + rt_ths
# Get the subset of rows in second table that meet the required conditions
tmp_df = subset(df2, m_cond & rt_cond)
if (nrow(tmp_df) > 0) {
# If the new table is not empty add it to the empty new_df data frame
tmp_df$mb_data_index = i
new_df = rbind(new_df, tmp_df)
}
}
write.csv(new_df, "commoncompounds.csv")
Code:
library('data.table')
# join two data tables and get only the matching rows by Name
df3 <- setDT(df2)[df1, on = 'Name', nomatch = 0]
# subset based on conditions of Mass and RT
df3 <- df3[ (round(abs(Mass - i.Mass), 3) <= 0.001) &
(round(abs(RT - i.RT), 1) <= 0.5), ]
# remove columns of df1
df3[, `:=` (i.Mass = NULL, i.RT = NULL, i.Area = NULL, ID = NULL)]
df3
# Name Mass RT Area chemID pubID score
# 1: Asa 234.031 1.56 4354 frsg gss 90
# 2: bda 164.041 4.78 4346 gsdg gsf 80
Data:
df1 <- read.table(text =
'Name Mass RT Area ID
Asa 234.032 1.56 6755 Sd323
bda 164.041 4.48 5353 SD424
dsf 353.953 6.53 2535 SD422
fed 535.535 5.14 4542 SD424', header = TRUE, stringsAsFactors = FALSE)
df2 <- read.table(text = 'Name Mass RT Area chemID pubID score
Asa 234.031 1.56 4354 frsg gss 90
bda 164.041 4.78 4346 gsdg gsf 80
dwf 432.035 9.84 4245 grhr hfg 99
fsf 535.042 7.01 5353 heth gww 90', header = TRUE, stringsAsFactors = FALSE)
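Note that the join above relies on the Name columns agreeing, whereas the expected output in the question pairs "csa" with "Asa" purely on mass and retention time. A possible alternative that ignores names is to compare every row of the first table with every row of the second. This is a sketch only: match_within(), df1_row and the small eps slack for floating-point rounding at the boundary are assumptions of mine, not part of the answer.

```r
# Hypothetical helper: return the rows of df2 whose Mass and RT both fall
# within the given tolerances of some row of df1.
match_within <- function(df1, df2, mass_tol = 0.001, rt_tol = 0.5) {
  eps <- 1e-9  # slack so that a difference of "exactly" 0.001 still counts
  mass_ok <- abs(outer(df1$Mass, df2$Mass, "-")) <= mass_tol + eps
  rt_ok   <- abs(outer(df1$RT,   df2$RT,   "-")) <= rt_tol + eps
  hits <- which(mass_ok & rt_ok, arr.ind = TRUE)  # columns "row" and "col"
  out <- df2[hits[, "col"], , drop = FALSE]
  out$df1_row <- hits[, "row"]                    # which df1 row matched
  rownames(out) <- NULL
  out
}

# Usage with the names as given in the question (csa rather than Asa)
tab_a <- data.frame(Name = c("Asa", "bda", "dsf", "fed"),
                    Mass = c(234.032, 164.041, 353.953, 535.535),
                    RT   = c(1.56, 4.48, 6.53, 5.14))
tab_b <- data.frame(Name = c("csa", "bda", "dwf", "fsf"),
                    Mass = c(234.031, 164.041, 432.035, 535.042),
                    RT   = c(1.56, 4.78, 9.84, 7.01),
                    score = c(90, 80, 99, 90))
matched <- match_within(tab_a, tab_b)
print(matched)
```

With about 200 and 5000 rows the intermediate matrices have around a million cells, which is small; for much larger tables a non-equi join would be a better fit.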
I have a very large data frame with SNPs in rows (~50,000) and IDs in columns (~500). An extract looks something like this:
R015 R016 R007
cg158 0.81 0.90 0.87
cg178 0.91 0.80 0.58
Now I want to save this as a txt file, which is normally no problem with
write.table(example, "example.txt", col.names = TRUE, row.names = TRUE, quote = FALSE)
BUT I need a tab (\t) as the first entry of the header row, so in the txt file the data frame should look something like:
\t R015 R016 R007
cg158 0.81 0.90 0.87
cg178 0.91 0.80 0.58
(\t for the tab)
Can anyone help me how to do this?
Btw I also tried:
write.table(data.frame("\t"=rownames(example),example),"example.txt", row.names=FALSE)
It did not work, unfortunately...
Thanks!
This kind of works, just replace stdout() with the path to your output-file:
data <- data.frame(x = sample(1:100,3),
y = sample(1:100,3),
z = sample(1:100,3))
row.names(data) <- LETTERS[1:3]
lines <- c(paste(c(' ', names(data)), collapse = '\t'),
sapply(seq_len(nrow(data)),
function(i){
paste(c(row.names(data)[i], data[i,]),collapse = '\t')
}))
writeLines(lines, con = stdout())
#> x y z
#> A 35 97 27
#> B 12 69 24
#> C 25 9 34
Or with spaces as separators and the tab you wished for in the first column:
data <- data.frame(x = sample(1:100,3),
y = sample(1:100,3),
z = sample(1:100,3))
row.names(data) <- LETTERS[1:3]
lines <- c(paste(c('\t', names(data)), collapse = ' '),
sapply(seq_len(nrow(data)),
function(i){
paste(c(row.names(data)[i], data[i,]),collapse = ' ')
}))
writeLines(lines, con = stdout())
#> x y z
#> A 3 30 11
#> B 62 69 70
#> C 93 55 73
Using a data frame like the following, where I've changed one row name to illustrate how to deal with cases of unequal length:
df <- read.table(text = "R015 R016 R007
cg158 0.81 0.90 0.87
cg178kdfj 0.91 0.80 0.58")
You could do something like this:
df <- format(as.matrix(df))
df <- cbind("\\t" = rownames(df), df)
df <- rbind(colnames(df), df)
df[,1] <- stringr::str_pad(df[,1], max(nchar(df[,1])), "right")
write.table(df,
file = "example.txt",
sep = " ",
quote = F,
row.names = F,
col.names = F)
Output:
\t R015 R016 R007
cg158 0.81 0.90 0.87
cg178kdfj 0.91 0.80 0.58
I first converted the numeric values to character and formatted them to make sure they have the same number of digits, otherwise they won't line up. Then I turn the row names into a new variable named \\t, and then I turn the column names into a new row. I use stringr::str_pad() to account for row names of differing lengths. Finally, I write the data frame to TXT file without the row or column names.
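If a tab-separated file is acceptable, write.table can also produce the leading tab on its own: with col.names = NA (and row names kept), it writes an empty header field above the row names, so the header line starts with the separator. A minimal sketch, assuming the goal is a tab-delimited file whose header begins with an empty cell; the data frame and the temporary file name are made up for illustration.

```r
example <- data.frame(R015 = c(0.81, 0.91),
                      R016 = c(0.90, 0.80),
                      R007 = c(0.87, 0.58),
                      row.names = c("cg158", "cg178"))

out_file <- tempfile(fileext = ".txt")  # stand-in for "example.txt"

# col.names = NA adds a blank column name above the row names
write.table(example, file = out_file, sep = "\t",
            quote = FALSE, col.names = NA)

written <- readLines(out_file)
print(written)
```

Note that numbers are written without padding (0.90 becomes 0.9), so this suits machine-readable output rather than visually aligned columns.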
I ran a replication 10000 times in which I took a random sample from a list of IDs and paired them with another list of IDs. I then added a column that gives the relatedness of each pair. Then I took the mean of the relatedness for each set of random samples, so I end up with 10000 values, each the mean relatedness of one set. However, what I want instead is the mean relatedness of the first row across all 10000 sets, and likewise for every other row.
An example of what I want:
Let's say I have 10000 sets of 3 random pairings.
Set 1
female_ID male_ID relatedness
0 12-34 23-65 0.034
1 44-62 56-24 0.56
2 76-11 34-22 0.044
Set 2
female_ID male_ID relatedness
0 98-54 53-12 0.022
1 22-43 13-99 0.065
2 09-22 65-22 0.12
etc...
I want the mean of the rows for relatedness of each set, so I want a list of 3 values: 0.028 (mean of 0.034 and 0.022), 0.3125 (mean of 0.56 and 0.065), 0.082 (mean of 0.044 and 0.12), except it would be the mean across 10000 sets, and not just 2.
Here's my code so far:
mean_rel <- replicate(10000, {
random_mal <- sample(list_of_males, 78, replace=TRUE)
random_pair <- cbind(list_of_females, random_mal)
random_pair <- data.frame(random_pair)
random_pair$pair <- with(random_pair, paste(list_of_females, random_mal, sep = " "))
typeA <- genome$rel[match(random_pair$pair, genome_year$pair1)]
typeB <- genome$rel[match(random_pair$pair, genome_year$pair2)]
random_pair$relatedness <- ifelse(is.na(typeA), typeB, typeA)
random_pair <- na.omit(random_pair)
mean_random_pair_relatedness <- mean(random_pair$relatedness)
mean_random_pair_relatedness
})
If you add simplify = FALSE to your replicate call, between the closing } and ), then mean_rel will be returned as a list.
mean_rel <- replicate(10000, {
random_mal <- sample(list_of_males, 78, replace=TRUE)
random_pair <- cbind(list_of_females, random_mal)
random_pair <- data.frame(random_pair)
random_pair$pair <- with(random_pair, paste(list_of_females, random_mal, sep = " "))
typeA <- genome$rel[match(random_pair$pair, genome_year$pair1)]
typeB <- genome$rel[match(random_pair$pair, genome_year$pair2)]
random_pair$relatedness <- ifelse(is.na(typeA), typeB, typeA)
random_pair <- na.omit(random_pair)
# return the whole data frame (not just its mean) so that rows can be compared across sets
random_pair
}, simplify = FALSE)
From there, you can use purrr to add two classification columns and then can use dplyr for the rest. Here is how I did it:
library(tidyverse)
mean_rel <- purrr::map2(.x = mean_rel, .y = seq_along(mean_rel),
function(x, y){
x %>%
mutate(set = paste0("set_", y)) %>%
# do this so the same row of each set can be
# compared
rownames_to_column(var = "row_number")
})
mean_rel_comb <- mean_rel %>%
do.call(rbind, .) %>%
as_tibble() %>%
mutate(relatedness = as.numeric(as.character(relatedness))) %>%
group_by(row_number) %>%
summarize(mean = mean(relatedness))
Using your two datasets combined as a list gave me this:
# A tibble: 3 x 2
row_number mean
<chr> <dbl>
1 1 0.0280
2 2 0.3125
3 3 0.0820
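A lighter alternative, if only the per-row means are needed: when each replication returns just the relatedness vector (with the same length and order every time), replicate simplifies the result to a matrix with one column per set, and rowMeans gives the answer directly. The sketch below uses one_set(), a made-up stand-in for one round of random pairing, and assumes rows are not dropped with na.omit inside the loop, because that would shift the row positions.

```r
# Worked example from the question: two sets of three pairings
worked <- rowMeans(cbind(c(0.034, 0.56, 0.044),
                         c(0.022, 0.065, 0.12)))
print(worked)  # per-row means across the two sets

# Hypothetical stand-in for one round of random pairing: one relatedness
# value per female, with NA where the pair has no known relatedness.
one_set <- function(n_females = 3) {
  rel <- runif(n_females)
  rel[sample.int(n_females, 1)] <- NA  # pretend one pair is unknown
  rel
}

set.seed(42)
rel_matrix <- replicate(1000, one_set())         # 3 rows x 1000 columns
row_means <- rowMeans(rel_matrix, na.rm = TRUE)  # one mean per row position
print(row_means)
```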
I want to convert several columns in a data.frame from chr to numeric, and I would like to do it in a single line. Here is what I am trying to do:
items[,2:4] <- as.numeric(sub("\\$","",items[,2:4]))
But I get a warning saying:
Warning message:
NAs introduced by coercion
If I do it column by column though it works:
items[,2:2] <- as.numeric(sub("\\$","",items[,2:2]))
items[,3:3] <- as.numeric(sub("\\$","",items[,3:3]))
items[,4:4] <- as.numeric(sub("\\$","",items[,4:4]))
What am I missing here? Why can't I specify this command for multiple columns? Is this some odd R idiosyncrasy that I am not aware of?
Example Data:
Name, Cost1, Cost2, Cost3, Cost4
A, $10.00, $15.50, $13.20, $45.45
B, $45.23, $34.23, $34.24, $23.34
C, $23.43, $45.23, $65.23, $34.23
D, $76.34, $98.34, $90.34, $45.09
Your problem is that sub (like gsub) converts its x argument to character. If a list (a data.frame is in fact a list) is converted to character, something weird happens:
as.character(list(a=c("1", "1"), b="1"))
# "c(\"1\", \"1\")" "1"
# and "c(\"1\", \"1\")" cannot be converted to a numeric
as.numeric("c(\"1\", \"1\")")
# NA
A one line solution would be to unlist the x argument:
items[, 2:5] <- as.numeric(gsub("\\$", "", unlist(items[, 2:5])))
Yes there is: apply is the command you are looking for:
items<-read.table(text="Name, Cost1, Cost2, Cost3, Cost4
A, $10.00, $15.50, $13.20, $45.45
B, $45.23, $34.23, $34.24, $23.34
C, $23.43, $45.23, $65.23, $34.23
D, $76.34, $98.34, $90.34, $45.09", header=TRUE,sep=",")
items[, 2:5] <- apply(items[, 2:5], 2, function(x) as.numeric(gsub("\\$", "", x)))
items
Name Cost1 Cost2 Cost3 Cost4
1 A 10.00 15.50 13.20 45.45
2 B 45.23 34.23 34.24 23.34
3 C 23.43 45.23 65.23 34.23
4 D 76.34 98.34 90.34 45.09
A more efficient approach would be:
items[-1] <- lapply(items[-1], function(x) as.numeric(gsub("$", "", x, fixed = TRUE)))
items
# Name Cost1 Cost2 Cost3 Cost4
# 1 A 10.00 15.50 13.20 45.45
# 2 B 45.23 34.23 34.24 23.34
# 3 C 23.43 45.23 65.23 34.23
# 4 D 76.34 98.34 90.34 45.09
Some benchmarks of the answers so far
fun1 <- function() {
A[-1] <- lapply(A[-1], function(x) as.numeric(gsub("$", "", x, fixed=TRUE)))
A
}
fun2 <- function() {
A[, 2:ncol(A)] <- as.numeric(gsub("\\$", "", unlist(A[, 2:ncol(A)])))
A
}
fun3 <- function() {
A[, 2:ncol(A)] <- apply(A[,2:ncol(A)], 2, function(x) { as.numeric(gsub("\\$","",x)) })
A
}
Here's some sample data and processing times
set.seed(1)
A <- data.frame(Name = sample(LETTERS, 10000, TRUE),
matrix(paste0("$", sample(99, 10000*100, TRUE)),
ncol = 100))
system.time(fun1())
# user system elapsed
# 0.72 0.00 0.72
system.time(fun2())
# user system elapsed
# 5.84 0.00 5.85
system.time(fun3())
# user system elapsed
# 4.14 0.00 4.14
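Putting the fastest variant together with the example data: selecting the cost columns by name rather than by position avoids the easy slip between 2:4 and 2:5. This is a small sketch; the "^Cost" pattern and the cost_cols name are assumptions based on the sample header.

```r
items <- read.csv(text = "Name, Cost1, Cost2, Cost3, Cost4
A, $10.00, $15.50, $13.20, $45.45
B, $45.23, $34.23, $34.24, $23.34
C, $23.43, $45.23, $65.23, $34.23
D, $76.34, $98.34, $90.34, $45.09",
                  strip.white = TRUE, stringsAsFactors = FALSE)

# Pick every column whose name starts with "Cost"
cost_cols <- grep("^Cost", names(items))

# lapply() works column by column, so sub() always receives a character vector
items[cost_cols] <- lapply(items[cost_cols], function(x) {
  as.numeric(sub("$", "", x, fixed = TRUE))
})

str(items)
```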