Importing chopped data in R

Here is a segment of my data. When I do read.csv(data, sep = " ") I get a data frame with columns and rows. However, this data is all of one type, so I just need one row, one column, or a vector.
Any help is appreciated.
0 0 0 10 10 10 10 10 10 10 10 20 20 20 20 20 20 20 20 20 30 30 30 30 30 30 30 30 30 40 40 40 40 40 40 50 50 50 50 50 50 60 60 60 60 60 60 60 60 60 60 60 60 70 70 70 70 70 70 70 70 70 80 80 80 80 80 80 80 80 80 90 90 90 90 90 90 90 90 90 100 100 100 100 100 100 100 100 100 110 110 110 110 110 110 110 110 110 110 110 110 120 120 120 120 120 130 130 130 130 130 130 140 140 140 140 140 140 140 140 150 150 150

What about:
scan(data, what = numeric(), sep = " ")
Note that what = "numeric" (a character string) would make scan read the values as character; passing numeric() tells it to read numbers.
When you use read.csv (or any read.* function), R assumes you are importing a table, so it creates a data frame with rows and columns holding the contents of the file. You can read the values straight into a vector with scan, or flatten the data frame afterwards:
Load the data:
df <- read.csv(data, sep = " ", header = FALSE)
Flatten it to a numeric vector:
as.numeric(unlist(df))

I copied and pasted your example into a text file called "test" and was able to import it using this code:
testdf <- read.csv('test', sep = " ", header = FALSE)
When I first tried, I just got a bunch of columns with no data.
For me, the key was the argument:
header = FALSE
Hope this helps!
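Putting the two answers together, a minimal sketch (it writes a sample of the data to a temporary file so it is self-contained; both routes produce the same numeric vector):

```r
# Write a small sample of the space-separated data to a temporary file
f <- tempfile()
writeLines("0 0 0 10 10 10 20 20 20", f)

# Option 1: scan() reads the file straight into a numeric vector
v1 <- scan(f, what = numeric(), sep = " ")

# Option 2: read the one-row data frame (header = FALSE!), then flatten it
df <- read.csv(f, sep = " ", header = FALSE)
v2 <- as.numeric(unlist(df))

identical(v1, v2)  # TRUE
```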

Related

Optimal way to reshape a dataframe in R to have observations in columns

Given e.g. the Orange data set, I would like to arrange the observations in a matrix in which the measurements (circumference) taken on each tree are arranged in rows (for a total of 5 rows).
One unsatisfactory way of obtaining this result is as follows:
mat <- matrix(Orange[, 3], nrow = 5, ncol = 7, byrow = TRUE,
              dimnames = list(unique(Orange$Tree), 1:7))
An alternative is the dcast() function from the data.table package.
It converts data from long to wide. In this case, I've created an ID to count the number of records per Tree.
In the reshaped data, Tree becomes our primary column and circumference is recorded in 7 separate columns (one for each age).
library(data.table)
Orange <- data.table(Orange)[, ID := seq_len(.N), by = Tree]
Orange2 <- dcast(
  data = Orange,
  formula = Tree ~ ID,
  value.var = "circumference")
Orange2
Tree 1 2 3 4 5 6 7
1: 3 30 51 75 108 115 139 140
2: 1 30 58 87 115 120 142 145
3: 5 30 49 81 125 142 174 177
4: 2 33 69 111 156 172 203 203
5: 4 32 62 112 167 179 209 214
EDIT (in response to additional comments/questions):
Technically the data is already ordered by Tree (as defined within the data), because Tree is a factor variable with preset levels. To order numerically instead, do two things: (1) order by as.character() and (2) re-level the variable.
Orange2[order(as.character(Tree)), ]
   Tree  1  2   3   4   5   6   7
1: 1 30 58 87 115 120 142 145
2: 2 33 69 111 156 172 203 203
3: 3 30 51 75 108 115 139 140
4: 4 32 62 112 167 179 209 214
5: 5 30 49 81 125 142 174 177
class(Orange$Tree)
[1] "ordered" "factor"
levels(Orange$Tree)
[1] "3" "1" "5" "2" "4"
Orange2[,Tree := factor(Tree, c("1","2","3","4","5"), ordered = FALSE)]
Orange2[order(Tree),]
Tree 1 2 3 4 5 6 7
1: 1 30 58 87 115 120 142 145
2: 2 33 69 111 156 172 203 203
3: 3 30 51 75 108 115 139 140
4: 4 32 62 112 167 179 209 214
5: 5 30 49 81 125 142 174 177
In base, you could simply do:
aggregate(circumference ~ Tree, Orange, I)
If you don't want to order it afterwards: aggregate(circumference ~ as.character(Tree), Orange, I) (that will strip the factor ordering).
Or, similar to @RyanF:
Orange$id <- sequence(rle(as.character(Orange$Tree))$lengths)
reshape(Orange[, -2],
        idvar = "Tree",
        timevar = "id",
        direction = "wide")
Output:
Tree circumference.1 circumference.2 circumference.3 circumference.4 circumference.5 circumference.6 circumference.7
1 1 30 58 87 115 120 142 145
8 2 33 69 111 156 172 203 203
15 3 30 51 75 108 115 139 140
22 4 32 62 112 167 179 209 214
29 5 30 49 81 125 142 174 177
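For completeness, here is a base-R sketch of the same reshape using split(), assuming (as in Orange) that every tree has the same number of measurements:

```r
# split() groups circumference by Tree (in factor-level order: 3, 1, 5, 2, 4);
# rbind() then stacks the equal-length groups into a 5 x 7 matrix
mat <- do.call(rbind, split(Orange$circumference, Orange$Tree))

mat["1", ]  # the seven measurements for tree 1

# To get numeric row order instead of factor-level order:
mat[order(as.numeric(rownames(mat))), ]
```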

How to find overlapping regions between two data frames based on conditions

I have two data frames, one called strain_1 and the other called strain_2. Each data frame has 4 columns (st_A, ed_A, st_B, ed_B : for "start" and "end" positions), but a different number of rows. st_A, ed_A and st_B, ed_B are the "start" and "end" positions of the block_A and block_B, respectively (see image 1 and the example below).
I am looking to identify the common overlapping blocks between strain_1 and strain_2.
Taking an example from image 1:
strain_1 <- data.frame(st_A = c(7, 25, 35, 48, 89), ed_A = c(9, 28, 38, 51, 91),
                       st_B = c(123, 97, 140, 73, 13), ed_B = c(127, 98, 145, 76, 16))
strain_2 <- data.frame(st_A = c(5, 20, 36, 49), ed_A = c(8, 25, 39, 50),
                       st_B = c(124, 95, 141, 105), ed_B = c(129, 100, 147, 110))
From this example, we see three overlapping regions (image 1):
The overlapping region is defined by the min value of st_A (or st_B) and the max value of ed_A (or ed_B) for block_A and block_B, respectively (see image 2: green box = common region).
The objective is to create a new data frame with these common regions (pairs of blocks):
## result_desired
result_desired <- data.frame(st_A=c(5,20,35), ed_A=c(9,28,39),
st_B=c(123,95,140), ed_B=c(129,100,147))
There are 16 possible combinations (see image 3), depending on the size of each block.
Is there a fast way to do this, knowing that my data has several thousand rows?
I tried some code based on @Gregor's comments, but I can't get the desired result:
require(dplyr)
## data
strain_1 <- data.frame(st_A = c(7, 25, 35, 48, 89), ed_A = c(9, 28, 38, 51, 91),
                       st_B = c(123, 97, 140, 73, 13), ed_B = c(127, 98, 145, 76, 16))
strain_2 <- data.frame(st_A = c(5, 20, 36, 49), ed_A = c(8, 25, 39, 50),
                       st_B = c(124, 95, 141, 105), ed_B = c(129, 100, 147, 110))
# merge data to get cross join
cj_data <-merge(strain_1,strain_2, by = NULL)
# Check block1 and block2
cj_filtered <- cj_data %>%
  mutate(c_block1 = case_when(st_A.x <= st_A.y & ed_A.x <= ed_A.y |
                              st_A.x >= st_A.y & ed_A.x >= ed_A.y |
                              st_A.x <= st_A.y & ed_A.x >= ed_A.y |
                              st_A.x >= st_A.y & ed_A.x <= ed_A.y ~ "overlap_OK",
                              TRUE ~ "NO"),
         c_block2 = case_when(st_B.x <= st_B.y & ed_B.x <= ed_B.y |
                              st_B.x >= st_B.y & ed_B.x >= ed_B.y |
                              st_B.x <= st_B.y & ed_B.x >= ed_B.y |
                              st_B.x >= st_B.y & ed_B.x <= ed_B.y ~ "overlap_OK",
                              TRUE ~ "NO"))
## cj_filtered:
st_A.x ed_A.x st_B.x ed_B.x st_A.y ed_A.y st_B.y ed_B.y c_block1 c_block2
7 9 123 127 5 8 124 129 overlap_OK overlap_OK
25 28 97 98 5 8 124 129 overlap_OK overlap_OK
35 38 140 145 5 8 124 129 overlap_OK overlap_OK
48 51 73 76 5 8 124 129 overlap_OK overlap_OK
89 91 13 16 5 8 124 129 overlap_OK overlap_OK
7 9 123 127 20 25 95 100 overlap_OK overlap_OK
25 28 97 98 20 25 95 100 overlap_OK overlap_OK
35 38 140 145 20 25 95 100 overlap_OK overlap_OK
48 51 73 76 20 25 95 100 overlap_OK overlap_OK
89 91 13 16 20 25 95 100 overlap_OK overlap_OK
7 9 123 127 36 39 141 147 overlap_OK overlap_OK
25 28 97 98 36 39 141 147 overlap_OK overlap_OK
35 38 140 145 36 39 141 147 overlap_OK overlap_OK
48 51 73 76 36 39 141 147 overlap_OK overlap_OK
89 91 13 16 36 39 141 147 overlap_OK overlap_OK
7 9 123 127 49 50 105 110 overlap_OK overlap_OK
25 28 97 98 49 50 105 110 overlap_OK overlap_OK
35 38 140 145 49 50 105 110 overlap_OK overlap_OK
48 51 73 76 49 50 105 110 overlap_OK overlap_OK
89 91 13 16 49 50 105 110 overlap_OK overlap_OK
Thanks for your help.
Here are 2 options using data.table:
1a) Using non-equi joins:
cols <- c(paste0("x.", names(strain_1)), paste0("i.", names(strain_2)))
DT <- rbindlist(list(
  strain_1[strain_2, on = .(st_A >= st_A, st_A <= ed_A), nomatch = 0L, mget(cols)],
  strain_1[strain_2, on = .(st_A <= st_A, ed_A >= st_A), nomatch = 0L, mget(cols)]
))
1b) Using foverlaps:
setkey(strain_1, st_A, ed_A)
setkey(strain_2, st_A, ed_A)
foverlaps(strain_1, strain_2, nomatch=0L)
Then a second step to get the desired output:
DT[between(x.st_B, i.st_B, i.ed_B) | between(i.st_B, x.st_B, x.ed_B),
   .(st_A = pmin(x.st_A, i.st_A),
     ed_A = pmax(x.ed_A, i.ed_A),
     st_B = pmin(x.st_B, i.st_B),
     ed_B = pmax(x.ed_B, i.ed_B))]
output:
st_A ed_A st_B ed_B
1: 5 9 123 129
2: 20 28 95 100
3: 35 39 140 147
data:
library(data.table)
strain_1 <- data.frame(st_A = c(7, 25, 35, 48, 89), ed_A = c(9, 28, 38, 51, 91),
                       st_B = c(123, 97, 140, 73, 13), ed_B = c(127, 98, 145, 76, 16))
strain_2 <- data.frame(st_A = c(5, 20, 36, 49), ed_A = c(8, 25, 39, 50),
                       st_B = c(124, 95, 141, 105), ed_B = c(129, 100, 147, 110))
result_desired <- data.frame(st_A = c(5, 20, 35), ed_A = c(9, 28, 39),
                             st_B = c(123, 95, 140), ed_B = c(129, 100, 147))
setDT(strain_1)
setDT(strain_2)
setDT(result_desired)
p.s.: There should be something in Bioconductor with IRanges as well.
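For readers without data.table, here is a base-R sketch of the same idea: cross join, keep pairs where both block_A and block_B intervals overlap (two intervals [s1, e1] and [s2, e2] overlap iff s1 <= e2 and s2 <= e1), then take the min start and max end:

```r
strain_1 <- data.frame(st_A = c(7, 25, 35, 48, 89), ed_A = c(9, 28, 38, 51, 91),
                       st_B = c(123, 97, 140, 73, 13), ed_B = c(127, 98, 145, 76, 16))
strain_2 <- data.frame(st_A = c(5, 20, 36, 49), ed_A = c(8, 25, 39, 50),
                       st_B = c(124, 95, 141, 105), ed_B = c(129, 100, 147, 110))

# Cross join: every strain_1 row paired with every strain_2 row
# (.x columns come from strain_1, .y columns from strain_2)
cj <- merge(strain_1, strain_2, by = NULL)

# Keep pairs where BOTH the A intervals and the B intervals overlap
ok <- cj$st_A.x <= cj$ed_A.y & cj$st_A.y <= cj$ed_A.x &
      cj$st_B.x <= cj$ed_B.y & cj$st_B.y <= cj$ed_B.x

# The common region spans from the earlier start to the later end
res <- with(cj[ok, ], data.frame(st_A = pmin(st_A.x, st_A.y),
                                 ed_A = pmax(ed_A.x, ed_A.y),
                                 st_B = pmin(st_B.x, st_B.y),
                                 ed_B = pmax(ed_B.x, ed_B.y)))
res
```

Note the cross join materializes nrow(strain_1) * nrow(strain_2) rows, so with several thousand rows per frame the data.table non-equi join or foverlaps() above will scale much better.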

group_by to manipulate several uniques

I create data by:
d <- data_frame(ID = rep(sample(500), each = 20))
I want to create a new column for every 5 consecutive unique IDs. For this example it seems easy, as each ID appears a fixed number of times, so simply:
d <- d %>% mutate(new_col = rep(sample(100), each = 100))
groups 5 consecutive unique IDs. However, in my real data each ID is not repeated a fixed 20 times. I didn't include that part as it needs other long functions.
My question is simply: once I have the IDs, I want to take every 5 consecutive unique IDs and create another column labelling each such group. I believe group_by might be helpful, but I am not sure how to use it.
You might need this:
d <- d %>% mutate(new_col = cumsum(ID - lag(ID, default = first(ID)) != 0) %/% 5)
Basically, ID - lag(ID, default = first(ID)) != 0 evaluates to TRUE whenever there is an ID change. Taking a cumsum of that vector gives a run-length ID (take a look at this answer for more info) of the ID column, such as 0 0 0 1 1 1 2 2 2. Since you want every five consecutive unique IDs to share a value in the new column, finish with an integer division by 5.
table(d$new_col)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100
25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100
50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74
100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100
75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100
This should also work if IDs have different lengths.
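A base-R sketch of the same grouping, without dplyr (match() against unique() plays the role of the run-length ID, and also works when IDs repeat a varying number of times):

```r
# Build example data: 500 unique IDs, each repeated 20 times
d <- data.frame(ID = rep(sample(500), each = 20))

# match(ID, unique(ID)) numbers each ID by order of first appearance
# (1 for the first unique ID, 2 for the second, ...);
# integer division by 5 then buckets every 5 consecutive unique IDs
d$new_col <- (match(d$ID, unique(d$ID)) - 1) %/% 5

length(unique(d$new_col))  # 100 groups of 100 rows each
```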

How do I multiply (square, divide, etc.) a single column in R

print(elasticband)
strech distance tension
1 67 148 5
2 98 120 10
3 34 173 15
4 50 60 20
5 45 263 25
6 42 141 30
7 89 166 35
So I have this data frame, and I want to be able to alter a single column (for example, square everything in the tension column) without affecting the others, the way elasticband**2 affects the whole data frame.
Any tips?
P.S. I'm not too good at this, so the simpler the fix the better.
> transform(elasticband, tension2=tension^2)
strech distance tension tension2
1 67 148 5 25
2 98 120 10 100
3 34 173 15 225
4 50 60 20 400
5 45 263 25 625
6 42 141 30 900
7 89 166 35 1225
Other alternatives are:
elasticband$tension2 <- elasticband[, "tension"]^2
Or
elasticband$tension2 <- elasticband$tension^2
If you only want a vector as output
elasticband[, "tension"]^2
Or
elasticband$tension^2
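If you prefer modifying the data frame in place, base R's within() behaves much like transform(); a sketch recreating the question's data (column names kept as in the question):

```r
# Recreate the example data frame from the question
elasticband <- data.frame(strech   = c(67, 98, 34, 50, 45, 42, 89),
                          distance = c(148, 120, 173, 60, 263, 141, 166),
                          tension  = seq(5, 35, by = 5))

# within() evaluates the assignment inside the data frame and
# returns a copy with the new column added
elasticband <- within(elasticband, tension2 <- tension^2)

elasticband$tension2  # 25 100 225 400 625 900 1225
```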

Create a for loop which prints every number x with x %% 3 == 0 between 1 and 200

Like the title says, I need a for loop which will print every number from 1 to 200 that is evenly divisible by 3.
Every other method posted so far generates the 1:200 vector then throws away two thirds of it. What a waste. In an attempt to be eco-conscious, this method does not waste any electrons:
seq(3,200,by=3)
You don't need a for loop; use the which function instead, as in:
which(1:200 %% 3 == 0)
[1] 3 6 9 12 15 18 21 24 27 30 33 36 39 42 45 48 51 54 57 60 63 66 69 72 75 78 81
[28] 84 87 90 93 96 99 102 105 108 111 114 117 120 123 126 129 132 135 138 141 144 147 150 153 156 159 162
[55] 165 168 171 174 177 180 183 186 189 192 195 198
Two other alternatives:
(1:200)[c(FALSE, FALSE, TRUE)]
(1:200)[1:200 %% 3 == 0]
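And since the title literally asks for a for loop, a minimal sketch that collects (and could just as well print) each multiple of 3:

```r
# Loop over 1:200 and keep every value evenly divisible by 3
multiples <- integer(0)
for (x in 1:200) {
  if (x %% 3 == 0) {
    multiples <- c(multiples, x)  # or print(x)
  }
}
multiples
```

Growing a vector inside a loop is fine at this size, but for anything large the vectorized seq(3, 200, by = 3) above is both shorter and faster.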
