Reformat wrapped data coerced into a dataframe? (R) - r

I have some data I need to extract from a .txt file in a very weird, wrapped format. It looks like this:
eid nint hisv large NA
1 1.00 1.00000e+00 0 1.0 NA
2 -152552.00 -6.90613e+04 -884198 -48775.7 1151.70
3 -5190.13 4.17751e-05 NA NA NA
4 2.00 1.00000e+00 0 1.0 NA
5 -172188.00 -8.16684e+04 -809131 -56956.1 -1364.07
6 -5480.54 4.01573e-05 NA NA NA
Luckily, I do not need all of this data. I just want to match each eid with the value written in scientific notation, like so:
eid sigma
1 1 4.17751e-005
2 2 4.01573e-005
3 3 3.72098e-005
This data goes on for hundreds of thousands of eids. For each group of three rows, I need to discard the last three values of the first row and all of the values in the second row, and keep the last (i.e. second) value of the third row. That value should then be placed next to the first value of the first row, and the process repeated for every group. The column names other than 'eid' are totally disposable, too. I've never had to deal with wrapped data before, so I don't know where to begin.
**edited to show df after read-in.
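One possible way to do the extraction described above, as a minimal sketch: it assumes every record occupies exactly three consecutive rows of the data frame as read in, that the eid ends up in the first data column of the first row, and that the scientific-notation value ends up in the second data column of the third row. The data frame `df` below is a hand-built stand-in for the real read-in, and the name `result` is hypothetical.

```r
# Hand-built stand-in for the data frame shown above (only the columns needed here)
df <- data.frame(
  eid  = c(1, -152552, -5190.13, 2, -172188, -5480.54),
  nint = c(1, -6.90613e+04, 4.17751e-05, 1, -8.16684e+04, 4.01573e-05)
)

# Assumption: each record spans exactly three rows, so records start at rows 1, 4, 7, ...
first <- seq(1, nrow(df), by = 3)

# eid comes from the first row of each record, sigma from the third row
result <- data.frame(
  eid   = df$eid[first],
  sigma = df$nint[first + 2]
)
print(result)
```

Because the indexing is vectorised, this should scale to hundreds of thousands of records without an explicit loop, provided the three-row pattern never breaks.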

Related

Is there a way to merge multiple k-mer count outputs from JELLYFISH tool?

I need some help merging multiple <files.jf>.
The default outputs of a jellyfish k-mer count are files containing each k-mer and its count. In my case the files are very large (up to 150-200 GB).
Suppose I have multiple jf files (file1.jf, file2.jf, file3.jf and so on) of different sizes and therefore different numbers of rows. A k-mer found in one sample may not exist in another, and in that case I would like to add an NA value.
e.g. jellyfish dump file1.jf
>2
AGGGTGGATTACACCCACA
>1
CGGGAAGCCATTGGGTAAA
>1
CCCAACCATTTTCTTAACC
>1
ACACCTGTTATGTTTACCA
>1
GTTAATTTTTTAAGTGGGA
e.g. jellyfish dump file2.jf
>1
ACTCCTCCCTTGGCAGTAG
>1
CACCAGGCTGAGAAAAGTG
>1
TCTTTACCTAAAAAACAAA
e.g. jellyfish dump file3.jf
>1
CCTTTATCCCTGAGACCAC
>1
AAATGCAAGAGAAACAAAG
>2
TCTTTACCTAAAAAACAAA
>1
ACCAAGCAGGTCCATGAGC
>1
GCATGGGGAGAAAGTGCCA
>1
AATTCTCTGGTGCCTGCTC
>1
TCGTTGGGCTGAGTCATCA
In every file, the count appears as >n on one row and the sequence on the row below it (I can't show them side by side in columns here).
I would like to obtain an output like this:
19-mer               file1  file2  file3
AGGGTGGATTACACCCACA  2      NA     NA
CGGGAAGCCATTGGGTAAA  1      NA     NA
CCCAACCATTTTCTTAACC  1      NA     NA
ACACCTGTTATGTTTACCA  1      NA     NA
GTTAATTTTTTAAGTGGGA  1      NA     NA
ACTCCTCCCTTGGCAGTAG  NA     1      NA
CACCAGGCTGAGAAAAGTG  NA     1      NA
TCTTTACCTAAAAAACAAA  NA     1      2
CCTTTATCCCTGAGACCAC  NA     NA     1
AAATGCAAGAGAAACAAAG  NA     NA     1
ACCAAGCAGGTCCATGAGC  NA     NA     1
GCATGGGGAGAAAGTGCCA  NA     NA     1
AATTCTCTGGTGCCTGCTC  NA     NA     1
TCGTTGGGCTGAGTCATCA  NA     NA     1
Could anyone tell me how to do it? Scripts in bash, R or solutions using other commands (e.g. the bash join command) are fine.
I also tried jellyfish merge, but it doesn't work because the files have different lengths.
IMPORTANT: The same sequence may be found in more than one file, in which case I would like the count from each sample/file to be printed in its own column.
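For small inputs, the requested table can be sketched in R with a full outer join. This is only an illustration under assumptions: the dump text is held in memory as character vectors (in practice it might come from `readLines()` on saved `jellyfish dump` output), the helper `parse_dump` is a hypothetical name, and files of 150-200 GB would not fit in memory this way.

```r
# Hypothetical helper: turn alternating ">count" / sequence lines into a two-column table
parse_dump <- function(lines, name) {
  is_header <- startsWith(lines, ">")
  out <- data.frame(
    kmer  = lines[!is_header],
    count = as.integer(sub("^>", "", lines[is_header])),
    stringsAsFactors = FALSE
  )
  names(out)[2] <- name  # label the count column with the file name
  out
}

# Small excerpts of the dumps shown in the question
dumps <- list(
  file1 = c(">2", "AGGGTGGATTACACCCACA", ">1", "CGGGAAGCCATTGGGTAAA"),
  file2 = c(">1", "TCTTTACCTAAAAAACAAA"),
  file3 = c(">2", "TCTTTACCTAAAAAACAAA", ">1", "CCTTTATCCCTGAGACCAC")
)

tables <- Map(parse_dump, dumps, names(dumps))

# Full outer join on the k-mer: k-mers absent from a file get NA in that file's column
merged <- Reduce(function(x, y) merge(x, y, by = "kmer", all = TRUE), tables)
print(merged)
```

A k-mer present in several files, such as TCTTTACCTAAAAAACAAA here, ends up on a single row with one count per file.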

R: how to merge two columns (column addition) while ignoring rows with same value

I have a data.frame like this
I want to add the values of Sample_Intensity_RTC and Sample_Intensity_nRTC and store the result in a new column. However, where Sample_Intensity_RTC and Sample_Intensity_nRTC have the same value, no addition should be done.
Please note that these columns are not rounded in the same way, so many numbers are the same but shown with a different number of decimal places (nsmall).
It seems you just want to combine these two columns, not add them in the sense of addition (+). Think of a zipper perhaps. Or two roads merging into one.
The two columns seem to have been created by two separate processes, the first looks to have more accuracy. However, after importing the data provided in the link, they have exactly the same values.
test <- read.csv("test.csv", row.names = 1)
options(digits=10)
head(test)
Sample_ID Sample_Intensity_RTC Sample_Intensity_nRTC
1 191017QMXP002 NA NA
2 191017QNXP008 41293681.00 41293681.00
3 191017CPXP009 111446376.86 111446376.86
4 191017HPXP010 92302936.62 92302936.62
5 191017USXP001 NA 76693308.46
6 191017USXP002 NA 76984658.00
In any case, to combine them, we can just use ifelse with the condition is.na for the first column.
test$new_col <- ifelse(is.na(test$Sample_Intensity_RTC),
test$Sample_Intensity_nRTC,
test$Sample_Intensity_RTC)
head(test)
Sample_ID Sample_Intensity_RTC Sample_Intensity_nRTC new_col
1 191017QMXP002 NA NA NA
2 191017QNXP008 41293681.00 41293681.00 41293681.00
3 191017CPXP009 111446376.86 111446376.86 111446376.86
4 191017HPXP010 92302936.62 92302936.62 92302936.62
5 191017USXP001 NA 76693308.46 76693308.46
6 191017USXP002 NA 76984658.00 76984658.00
sapply(test, function(x) sum(is.na(x)))
Sample_ID Sample_Intensity_RTC Sample_Intensity_nRTC new_col
0 126 143 108
You could also use the coalesce function from dplyr.
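A minimal sketch of the coalesce alternative, assuming the dplyr package is installed. The three-row data frame is a cut-down stand-in for test.csv, built from rows shown in the output above.

```r
library(dplyr)

# Cut-down stand-in for the data read from test.csv
test <- data.frame(
  Sample_ID             = c("191017QMXP002", "191017QNXP008", "191017USXP001"),
  Sample_Intensity_RTC  = c(NA, 41293681.00, NA),
  Sample_Intensity_nRTC = c(NA, 41293681.00, 76693308.46)
)

# coalesce() takes, element by element, the first value that is not NA
test$new_col <- coalesce(test$Sample_Intensity_RTC, test$Sample_Intensity_nRTC)
print(test)
```

Rows where both columns are NA stay NA, just as with the ifelse version.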

Impute NA values with previous value in R

We have 101 variables (companies) with their closing prices. We have a lot of NA values, because the stock market closes on Saturdays and Sundays and that gives NA values in our data. We need to impute those NA values with the previous value, if there is one, but we haven't succeeded. This is our data example
There are also companies that have no data in the first years because they were not yet on the stock market, so they have NA values for that period. And there are companies that go bankrupt and have NA values from then on. Both of these should become 0.
How should we do this, given that we have several conditions for filling our NAs?
Thanks in advance.
My understanding of the rules is:
columns that are all NA are to be left as all NA
leading NA values are left as NA
interior NA values are replaced with the most recent non-NA values
trailing NA values are replaced with 0
To try this out we use the built-in data frame BOD replacing the 1st, 3rd and last rows with NA and adding a column of NA values -- see Note at end.
We define a logical vector ok having one element per column which is TRUE for columns having at least one element that is not NA and FALSE for other columns. Then operating only on the columns for which ok is TRUE we fill in the trailing NA values with 0 using na.fill. Then we use na.locf to fill in the interior NA values.
library(zoo)
ok <- !apply(is.na(BOD), 2, all)
BOD[, ok] <- na.locf(na.fill(BOD[, ok], c(NA, NA, 0)), na.rm = FALSE)
giving:
Time demand X
1 NA NA NA <-- leading NA values are left intact
2 2 10.3 NA
3 2 10.3 NA <-- interior NA values are filled in with last non-NA value
4 4 16.0 NA
5 5 15.6 NA
6 0 0.0 NA <- trailing NA values are filled in with 0
Note
We used the following input above:
BOD[c(1, 3, 6), ] <- NA
BOD <- cbind(BOD, X = NA)
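The same four rules can also be sketched per column in base R, without zoo. This is an illustrative alternative rather than the method used above, and `fill_prices` is a hypothetical helper name.

```r
# Hypothetical helper applying the four rules to one column (a numeric vector)
fill_prices <- function(x) {
  ok <- which(!is.na(x))
  if (length(ok) == 0) return(x)       # rule 1: all-NA columns are left untouched
  first <- ok[1]
  last  <- ok[length(ok)]
  # For each position, the index of the most recent non-NA value (0 if none yet)
  idx <- cummax(ifelse(is.na(x), 0L, seq_along(x)))
  out <- x                             # rule 2: leading NAs stay NA
  inside <- seq(first, last)
  out[inside] <- x[idx[inside]]        # rule 3: interior NAs take the previous value
  if (last < length(x)) {
    out[(last + 1):length(x)] <- 0     # rule 4: trailing NAs become 0
  }
  out
}

# The Time column of the modified BOD data, and an all-NA column
time_filled <- fill_prices(c(NA, 2, NA, 4, 5, NA))
all_na      <- fill_prices(c(NA_real_, NA_real_))
print(time_filled)
```

To process a whole data frame `d`, something like `d[] <- lapply(d, fill_prices)` would apply it to every column, assuming all columns are numeric.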

Take difference of two columns in R resulting in a new third column

So far I have a data frame that looks like this:
Account Total Mastered Not_Mastered
1 1 NA NA
2 12 2 10
3 4 NA NA
4 51 50 1
The code I have is:
Table$not_mastered = (Table$total - Table$mastered)
My goal is to subtract the 'mastered' column from the 'total' column to result in a third column 'not_mastered' and if there is no value in the 'mastered' column then I want the new column to have the same value as the 'total' column. Shown below.
Account Total Mastered Not_Mastered
1 1 NA 1
2 12 2 10
3 4 NA 4
4 51 50 1
How can I skip over the NA values in the mastered column and rewrite the values from the total column?
We can use replace to change the NA values to 0 and then do the difference
with(df1, Total - replace(Mastered, is.na(Mastered), 0))
#[1] 1 10 4 1
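As a self-contained sketch of the replace approach, the data frame from the question can be rebuilt by hand (the name `df1` follows the answer above) and the result stored as the new column:

```r
# Data from the question, rebuilt by hand
df1 <- data.frame(
  Account  = 1:4,
  Total    = c(1, 12, 4, 51),
  Mastered = c(NA, 2, NA, 50)
)

# Treat a missing Mastered value as 0, so Not_Mastered falls back to Total
df1$Not_Mastered <- with(df1, Total - replace(Mastered, is.na(Mastered), 0))
print(df1)
```

Note that `replace` only changes the copy used in the subtraction, so the Mastered column itself keeps its NA values.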
Depending on what kind of software you are using, you should be able to catch those with a simple loop and an if statement (MATLAB-style syntax shown here; a missing value cannot be found with ==, so it is tested with isnan).
for index = 1:numel(Total)  % looks at each row, one at a time
    if isnan(Mastered(index))  % if the value in the Mastered column is missing
        NotMastered(index) = Total(index);
    else
        NotMastered(index) = Total(index) - Mastered(index);
    end
end

Replacing NAs with correlated values in rows

Hey all, I have a data frame with 5 samples (A, B, C, D, E). For each miRNA with a missing value, I want to first find the miRNA that is most highly correlated with it overall, and then fill in the missing value with a value derived from that miRNA. For example:
miRNA-1 values: 1 2 3 NA 5
miRNA-2 values: 2 4 6 8 10
==> replace the missing value with 4, derived from the second miRNA.
This is what I want to do for my data frame in R.
Any help would be really appreciated :)
                                          A        B         C         D        E
hsa-miR-199a-3p, hsa-miR-199b-3p         NA 13.13892  5.533703  25.67405       NA
hsa-miR-365a-3p, hsa-miR-365b-3p   15.70536 52.86558 18.467540 223.51424 31.93503
hsa-miR-3689a-5p, hsa-miR-3689b-5p       NA 21.41597  5.964772        NA 24.26073
hsa-miR-3689b-3p, hsa-miR-3689c     9.58696 44.56490 10.102051  13.26785       NA
hsa-miR-4520a-5p, hsa-miR-4520b-5p 18.06865 28.06991        NA        NA       NA
hsa-miR-516b-3p, hsa-miR-516a-3p         NA 10.77471  8.039662        NA       NA
Have you had a look at this answer (esp Akrun's short cut from zoo)? I appreciate it's not quite what you want, but might give some leads. It is for means of neighbours in a row, so would suggest 1 2 3 NA 5 would be 4 (average 3 and 5).
Replacing NA's in R numeric vectors with values calculated from neighbours
Trying to find a correlation between pairs with just 4 data points, as one is missing, is a challenge.
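The toy example from the question can be sketched as follows. This is one possible reading of "derived from", not an established method: it assumes the missing value is predicted by a simple linear regression on the most strongly correlated row. The third row, miRNA-3, is a made-up addition so that there is something to choose between, and the caveat above about very few data points still applies.

```r
# Toy matrix: the first two rows are from the question, the third is hypothetical
m <- rbind(
  "miRNA-1" = c(1, 2, 3, NA, 5),
  "miRNA-2" = c(2, 4, 6, 8, 10),
  "miRNA-3" = c(5, 1, 4, 2, 3)
)

target <- "miRNA-1"
others <- setdiff(rownames(m), target)

# Correlation of the target row with every other row, using only complete pairs
r <- sapply(others, function(o) {
  cor(m[target, ], m[o, ], use = "pairwise.complete.obs")
})
best <- others[which.max(abs(r))]

# Assumption: derive the missing value from a linear fit on the best-correlated row
# (lm drops the incomplete pair by default)
fit  <- lm(y ~ x, data = data.frame(y = m[target, ], x = m[best, ]))
miss <- is.na(m[target, ])
m[target, miss] <- predict(fit, newdata = data.frame(x = m[best, miss]))
print(m[target, ])
```

Here miRNA-2 is selected and the gap is filled with 4, which matches the expected result in the question.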
