My rownames consist of several strings separated by spaces. I would like to keep the rows that have 0 as the third string. I am not sure how this is done, since the strings are not defined as columns.
hsa-miR-143-3p TGAGAAGAAGCACTGTAGCTCTT 6AT u-TT 0 0 0 1
hsa-miR-10a-5p GACCCTGTAGATCCGAATTTGTA 1GT u-A 0 u-G 1 0
hsa-miR-10a-5p GACCCTGTAGATCCGAATTTGTG 1GT 0 0 0 54 24
hsa-miR-1296-5p TTAGGGCCCTGGCTCCATCT 0 0 0 u-CC 11 17
hsa-miR-887-3p GTGAACGGGCGCCATCCCGAGGCTT 0 0 0 d-CTT 1 8
hsa-miR-10a-5p ACCCGGTAGATCCGAATTTGTG 5GT 0 d-T 0 7 11
out:
hsa-miR-1296-5p TTAGGGCCCTGGCTCCATCT 0 0 0 u-CC 11 17
hsa-miR-887-3p GTGAACGGGCGCCATCCCGAGGCTT 0 0 0 d-CTT 1 8
Assuming df holds your data frame, you could try
idx <- which(read.table(text=rownames(df))$V3=="0")
df[idx, ]
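A self-contained sketch of this approach, using a toy data frame whose rownames are taken from the question (the single `count` column is an invented placeholder):

```r
# Toy data frame whose rownames are the space-separated strings
df <- data.frame(count = 1:4)
rownames(df) <- c("hsa-miR-143-3p TGAGAAGAAGCACTGTAGCTCTT 6AT u-TT 0 0 0 1",
                  "hsa-miR-10a-5p GACCCTGTAGATCCGAATTTGTG 1GT 0 0 0 54 24",
                  "hsa-miR-1296-5p TTAGGGCCCTGGCTCCATCT 0 0 0 u-CC 11 17",
                  "hsa-miR-887-3p GTGAACGGGCGCCATCCCGAGGCTT 0 0 0 d-CTT 1 8")

# read.table() splits each rowname into whitespace-separated columns,
# so V3 is the third token of every rowname
idx <- which(read.table(text = rownames(df))$V3 == "0")
df[idx, , drop = FALSE]  # keeps the miR-1296-5p and miR-887-3p rows
```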
I have a positive integer value vector C which is of length 192. I want to generate another vector based on this vector C. The new vector to be created is called B (same length as C). The algorithm for creation is:
Whenever a value above 0 is observed in C, add the same value 12 places back in the vector B. For example, if the first 15 entries of C are 0 and the 16th entry is 3, then I want to add the value 3, 12 positions back (which is 16-12=position 4) in vector B. The vector B would be generated in this way over all values of C.
Any help would be greatly appreciated! The vector C can be obtained by the R library "outbreaks" and the data file from that package is ebola_kikwit_1995$onset.
It doesn't seem to be very tricky. An index vector and a for loop will do what the question asks for.
library("outbreaks")
i <- which(ebola_kikwit_1995$onset > 0)
i <- i[i > 12]
ebola_kikwit_1995$B <- 0L
for(j in i){
    ebola_kikwit_1995$B[j] <- ebola_kikwit_1995$onset[j] + ebola_kikwit_1995$B[j - 12]
}
ebola_kikwit_1995$B
# [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# [28] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# [55] 0 0 0 0 0 1 1 1 0 0 0 0 3 0 0 0 1 2 0 2 0 2 0 1 0 0 0
# [82] 0 0 0 0 4 0 3 1 3 1 0 1 1 1 1 1 5 3 0 5 7 2 2 5 3 2 8
#[109] 5 9 5 4 10 10 13 14 20 10 9 16 7 14 13 10 18 13 17 21 31 13 21 21 15 17 16
#[136] 18 22 18 18 24 35 16 22 23 18 18 19 21 26 21 23 26 37 17 0 25 0 0 21 0 0 0
#[163] 25 28 0 18 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
#[190] 0 0 0
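For reference, the rule as literally stated in the question (write each positive C[j] twelve positions earlier in B) can also be sketched without a loop. This is a minimal toy example, not the outbreaks data:

```r
# Toy C: first 15 entries are 0, the 16th entry is 3 (the question's example)
C <- c(rep(0, 15), 3, rep(0, 4))
B <- integer(length(C))

# Positions with a positive value, skipping the first 12 (no slot 12 back)
idx <- which(C > 0 & seq_along(C) > 12)

# Write each positive value 12 positions earlier in B
B[idx - 12] <- B[idx - 12] + C[idx]
B[4]  # 3, i.e. position 16 - 12
```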
I'm using the grepl function to try to sort through data; each row number is a different survey respondent, and each number in the "ANI_type" string represents a different type of animal - I need to sort these depending on animal type. More specifically, I need to group some of the codes in the strings into categories. For example, codes 6, 7, 8, 9, 10, and 11 all need to be placed in the animals$pock object. How would I go about that using the grepl function?
> animals$dogs <- as.numeric(grepl("\\b1\\b", animals$ANI_type))
> animals
ANI_type dogs cats repamp
1 1,2,5,12,13,14,15,16,18,19,27 1 1 0
2 2 0 1 0
3 20,21,22,23,26 0 0 0
4 20,21,22,23 0 0 0
5 13 0 0 0
6 2 0 1 0
7 20,21,22 0 0 0
8 20,21,22,23 0 0 0
9 20,21,22 0 0 0
10 5,20,21,22,27 0 0 0
11 1,2,20,21,22 1 1 0
12 5,18,20,21,22,23,26 0 0 0
13 20,21 0 0 0
14 21 0 0 0
15 20,21 0 0 0
16 20,21,26 0 0 0
17 2 0 1 0
18 1,2 1 1 0
19 2 0 1 0
20 3,4 0 0 1
The expected output is the columns (dog, cat, repamp) above... these were easy to do as there is only one digit; what I'm having trouble with is splitting up multiples.
A tidyverse solution could use mutate() and if_else() from the dplyr library together with grepl(), for example:
animals <- animals %>%
    mutate(dogs = if_else(grepl("\\b1\\b|\\b22\\b", ANI_type), 1, 0),
           cats = if_else(grepl("\\b2\\b|\\b31\\b", ANI_type), 1, 0))
In this case, you separate all the different potential codes for each animal with the pipe symbol |, which acts as an OR operator in the regular expression.
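Applying the same idea to the pock category from the question (codes 6 through 11), a minimal sketch - the toy ANI_type values here are invented for illustration:

```r
animals <- data.frame(ANI_type = c("1,2,5,12", "6,20,21", "10,11", "2"),
                      stringsAsFactors = FALSE)

# \\b word boundaries keep "1" from matching inside "10", "11", "12", etc.
animals$pock <- as.numeric(grepl("\\b(6|7|8|9|10|11)\\b", animals$ANI_type))

animals$pock  # 0 1 1 0
```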
First of all, I have a matrix of features and a data.frame of features from two separate text sources. On each of those, I have performed different text mining methods. Now, I want to combine them but I know some of them have columns with identical names like the following:
> dtm.matrix[1:10,66:70]
cough nasal sputum yellow intermitt
1 1 0 0 0 0
2 1 0 0 0 0
3 0 0 0 0 0
4 0 0 0 0 0
5 0 0 0 0 0
6 1 0 0 0 0
7 0 0 0 0 0
8 0 0 0 0 0
9 0 0 0 0 0
10 0 0 0 0 0
> dim(dtm.matrix)
[1] 14300 6543
And the second set looks like this:
> data1.sub[1:10,c(1,37:40)]
Data number cough coughing up blood dehydration dental abscess
1 1 0 0 0 0
2 3 1 0 0 0
3 6 0 0 0 0
4 8 0 0 0 0
5 9 0 0 0 0
6 11 1 0 0 0
7 12 0 0 0 0
8 13 0 0 0 0
9 15 0 0 0 0
10 16 1 0 0 0
> dim(data1.sub)
[1] 14300 168
I got this code from this topic but I'm new to R and I still need some help with it:
data1.sub.merged <- dcast.data.table(merge(
## melt the first data.frame and set the key as ID and variable
setkey(melt(as.data.table(data1.sub), id.vars = "Data number"), "Data number", variable),
## melt the second data.frame
melt(as.data.table(dtm.matrix), id.vars = "Data number"),
## you'll have 2 value columns...
all = TRUE)[, value := ifelse(
## ... combine them into 1 with ifelse
(value.x == 0), value.y, value.x)],
## This is the reshaping formula
"Data number" ~ variable, value.var = "value")
When I run this code, it returns a 1x6667 matrix and doesn't merge the "cough" column (or any other shared column) from the two data sets. I'm confused. Could you help me understand how this works?
There are many ways to do that, e.g. using base R, data.table or dplyr. The choice depends on the volume of your data; if you work with very large matrices (which is usually the case with natural language processing and bag-of-words representations), you may need to try different approaches and profile them to find the quickest solution.
I did what you wanted via dplyr. This is a bit ugly, but it works. I just merge the two data frames, then use a for loop over the variables that exist in both data frames: sum them up (variable.x and variable.y) and then delete them. Note that I changed your column names a bit for reproducibility, but that shouldn't have any impact. Please let me know if this works for you.
df1 <- read.table(text =
' cough nasal sputum yellow intermitt
1 1 0 0 0 0
2 1 0 0 0 0
3 0 0 0 0 0
4 0 0 0 0 0
5 0 0 0 0 0
6 1 0 0 0 0
7 0 0 0 0 0
8 0 0 0 0 0
9 0 0 0 0 0
10 0 0 0 0 0')
df2 <- read.table(text =
' Data_number cough coughing_up_blood dehydration dental_abscess
1 1 0 0 0 0
2 3 1 0 0 0
3 6 0 0 0 0
4 8 0 0 0 0
5 9 0 0 0 0
6 11 1 0 0 0
7 12 0 0 0 0
8 13 0 0 0 0
9 15 0 0 0 0
10 16 1 0 0 0')
# Check what variables are common
common <- intersect(names(df1),names(df2))
# Set key IDs for data
df1$ID <- seq(1,nrow(df1))
df2$ID <- seq(1,nrow(df2))
# Merge dataframes
df <- merge(df1, df2,by = "ID")
# Sum and clean common variables left in merged dataframe
library(dplyr)
for (variable in common){
# Create a summed variable
df[[variable]] <- df %>% select(starts_with(paste0(variable,"."))) %>% rowSums()
# Delete columns with .x and .y suffixes
df <- df %>% select(-one_of(c(paste0(variable,".x"), paste0(variable,".y"))))
}
df
ID nasal sputum yellow intermitt Data_number coughing_up_blood dehydration dental_abscess cough
1 1 0 0 0 0 1 0 0 0 1
2 2 0 0 0 0 3 0 0 0 2
3 3 0 0 0 0 6 0 0 0 0
4 4 0 0 0 0 8 0 0 0 0
5 5 0 0 0 0 9 0 0 0 0
6 6 0 0 0 0 11 0 0 0 2
7 7 0 0 0 0 12 0 0 0 0
8 8 0 0 0 0 13 0 0 0 0
9 9 0 0 0 0 15 0 0 0 0
10 10 0 0 0 0 16 0 0 0 1
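Since the two data frames here have the same number of rows in the same order, an alternative sketch sums the shared columns directly and skips the merge-and-loop step entirely (df1/df2 below are shortened versions of the frames above):

```r
df1 <- data.frame(cough = c(1, 1, 0), nasal = c(0, 0, 0))
df2 <- data.frame(Data_number = c(1, 3, 6), cough = c(0, 1, 0))

common <- intersect(names(df1), names(df2))   # "cough"
only1  <- setdiff(names(df1), common)
only2  <- setdiff(names(df2), common)

# Rows align 1:1, so the shared columns can be added element-wise
combined <- cbind(df1[only1], df2[only2], df1[common] + df2[common])
combined$cough  # 1 2 0
```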
Given a dataset in the following form:
> Test
Pos Watson Crick Total
1 39023 0 0 0
2 39024 0 0 0
3 39025 0 0 0
4 39026 2 1 3
5 39027 0 0 0
6 39028 0 4 4
7 39029 0 0 0
8 39030 0 1 1
9 39031 0 0 0
10 39032 0 0 0
11 39033 0 0 0
12 39034 1 0 1
13 39035 0 0 0
14 39036 0 0 0
15 39037 3 0 3
16 39038 2 0 2
17 39039 0 0 0
18 39040 0 1 1
19 39041 0 0 0
20 39042 0 0 0
21 39043 0 0 0
22 39044 0 0 0
23 39045 0 0 0
I can compress these data to remove zero rows with the following code:
a=subset(Test, Total!=0)
> a
Pos Watson Crick Total
4 39026 2 1 3
6 39028 0 4 4
8 39030 0 1 1
12 39034 1 0 1
15 39037 3 0 3
16 39038 2 0 2
18 39040 0 1 1
How would I code the reverse transformation? i.e. To convert dataframe a back into the original form of Test.
More specifically: without any access to the original data, how would I re-expand the data (to include all sequential "Pos" rows) for an arbitrary range of Pos?
Here, the ID column is irrelevant: the ID numbers are just row numbers created by R. In a real example, the compressed dataset will have sequential ID numbers.
Here's another possibility, using base R. Unless you explicitly provide the initial and the final value of Pos, the first and the last index value in the restored dataframe will correspond to the values given in the "compressed" dataframe a:
restored <- data.frame(Pos=(a$Pos[1]:a$Pos[nrow(a)])) # change range if required
restored <- merge(restored,a, all=TRUE)
restored[is.na(restored)] <- 0
#> restored
# Pos Watson Crick Total
#1 39026 2 1 3
#2 39027 0 0 0
#3 39028 0 4 4
#4 39029 0 0 0
#5 39030 0 1 1
#6 39031 0 0 0
#7 39032 0 0 0
#8 39033 0 0 0
#9 39034 1 0 1
#10 39035 0 0 0
#11 39036 0 0 0
#12 39037 3 0 3
#13 39038 2 0 2
#14 39039 0 0 0
#15 39040 0 1 1
Possibly the last step can be combined with the merge function by using the na.action option correctly, but I didn't find out how.
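For reference, tidyr offers the same gap-filling in a single call via complete() and full_seq(); a sketch using a shortened version of a:

```r
library(tidyr)

a <- data.frame(Pos    = c(39026, 39028, 39030),
                Watson = c(2, 0, 0),
                Crick  = c(1, 4, 1),
                Total  = c(3, 4, 1))

# full_seq() expands Pos to every value between its min and max;
# fill supplies the value for columns in the newly created rows
restored <- complete(a, Pos = full_seq(Pos, period = 1),
                     fill = list(Watson = 0, Crick = 0, Total = 0))
nrow(restored)  # 5 rows, positions 39026 through 39030
```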
You need to know at least the Pos values you want to fill in. Then, it's a combination of join and mutate operations in dplyr.
Test <- read.table(text = "
Pos Watson Crick Total
1 39023 0 0 0
2 39024 0 0 0
3 39025 0 0 0
4 39026 2 1 3
5 39027 0 0 0
6 39028 0 4 4
7 39029 0 0 0
8 39030 0 1 1
9 39031 0 0 0
10 39032 0 0 0
11 39033 0 0 0
12 39034 1 0 1
13 39035 0 0 0
14 39036 0 0 0
15 39037 3 0 3
16 39038 2 0 2
17 39039 0 0 0
18 39040 0 1 1
19 39041 0 0 0
20 39042 0 0 0
21 39043 0 0 0
22 39044 0 0 0")
library(dplyr)
Nonzero <- Test %>% filter(Total > 0)
All_Pos <- Test %>% select(Pos)
Reconstruct <-
  All_Pos %>%
  left_join(Nonzero) %>%
  mutate(across(c(Watson, Crick, Total), ~ ifelse(is.na(.x), 0, .x)))
In my code, All_Pos contains all valid positions as a one-column data frame; the mutate(across(...)) call converts NA values to zeros. (Older dplyr versions used mutate_each() for this; it has since been replaced by across().) If you only know the largest position MaxPos, you can construct All_Pos with
All_Pos <- data.frame(Pos = seq_len(MaxPos))
For instance, if I have a text file (mytext.txt) with the following text:
Table1
13 3 20 0 0 0 0
3 10 0 0 0 6 0
20 0 5 0 0 0 0
0 0 0 7 20 0 0
0 0 0 20 19 0 0
0 0 0 0 0 8 0
0 0 0 0 0 0 13
Table2
0
2
10
-5
3
-10
-5
Can I retrieve both of them and get two tables?
So that if I print my data table1 I would get the first table and if I print my data table 2 I would get the second table.
I know that if mytext.txt only had one table I could do something like:
table1 <- read.table("mytext.txt")
1) Assuming the input file is tables.txt, read the lines into Lines and let names.ix be the indexes of the lines containing the table names -- these lines are identified as beginning with a character that is not a minus or digit. Then create a grouping variable grp which identifies which table each line belongs to, split the lines into those groups and read each group of lines. This uses no packages and can handle any number of tables in the file.
Lines <- readLines("tables.txt")
names.ix <- grep("^[^-0-9]", Lines)
grp <- Lines[names.ix][ cumsum(seq_along(Lines) %in% names.ix) ]
Read <- function(x) read.table(text = x)
L <- lapply(split(Lines[-names.ix], grp[-names.ix]), Read)
giving:
> L
$Table1
V1 V2 V3 V4 V5 V6 V7
1 13 3 20 0 0 0 0
2 3 10 0 0 0 6 0
3 20 0 5 0 0 0 0
4 0 0 0 7 20 0 0
5 0 0 0 20 19 0 0
6 0 0 0 0 0 8 0
7 0 0 0 0 0 0 13
$Table2
V1
1 0
2 2
3 10
4 -5
5 3
6 -10
7 -5
2) By the way, if you only need the first table then this does it:
library(data.table)
fread("tables.txt")
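The grouping logic from (1) can also be checked on an in-memory character vector instead of a file; here is a shortened sketch of the same two-table input:

```r
Lines <- c("Table1", "13 3 20", "3 10 0",
           "Table2", "0", "2")

# Name lines start with something other than a digit or a minus
names.ix <- grep("^[^-0-9]", Lines)

# One table name per line, repeated for each line of its group
grp <- Lines[names.ix][cumsum(seq_along(Lines) %in% names.ix)]

L <- lapply(split(Lines[-names.ix], grp[-names.ix]),
            function(x) read.table(text = x))
names(L)  # "Table1" "Table2"
```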
Not directly, but you can try read.mtable from my "SOfun" package, available only on GitHub.
The approach is similar to #G.Grothendieck's approach, but packaged into a function, so you can simply do:
read.mtable("tables.txt", chunkId = "Table", header = FALSE)
# $Table1
# V1 V2 V3 V4 V5 V6 V7
# 1 13 3 20 0 0 0 0
# 2 3 10 0 0 0 6 0
# 3 20 0 5 0 0 0 0
# 4 0 0 0 7 20 0 0
# 5 0 0 0 20 19 0 0
# 6 0 0 0 0 0 8 0
# 7 0 0 0 0 0 0 13
#
# $Table2
# V1
# 1 0
# 2 2
# 3 10
# 4 -5
# 5 3
# 6 -10
# 7 -5
The chunkId parameter can also be a regular expression, like `chunkId = "[A-Za-z]+"`.