cannot change rownames of a table - r

I created this table:
> head(table)
tissue1 tissue2 tissue3 tissue4 tissue5
Simple_repeat_80 58 77 48 69 115 131
tRNA_1 0 14 12 1 19 14
Simple_repeat_86 2 10 2 2 14 9
Simple_repeat_87 1 33 12 3 15 21
Simple_repeat_103 0 0 2 0 0 4
SINE/tRNA-Deu_20 0 0 1 0 0 10
and I put the command
row <- strsplit(rownames(table), "_[0-9]+") to eliminate the underscore and the number after each element's name. I would like to create a new table like this example:
> head(table)
tissue1 tissue2 tissue3 tissue4 tissue5
Simple_repeat 58 77 48 69 115 131
tRNA 0 14 12 1 19 14
Simple_repeat 2 10 2 2 14 9
Simple_repeat 1 33 12 3 15 21
Simple_repeat 0 0 2 0 0 4
SINE/tRNA-Deu 0 0 1 0 0 10
I've tried this command:
> row.names(table) = row
Error in `.rowNamesDF<-`(x, value = value) :
duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique values when setting 'row.names': ‘DNA?’, ‘DNA/hAT-Ac’, ‘DNA/hAT-Charlie’, ‘DNA/hAT-Tag1’, ‘DNA/hAT-Tip100’, ‘DNA/MULE-MuDR’, ‘DNA/PIF-Harbinger’, ‘DNA/PiggyBac’, ‘DNA/TcMar-Mariner’, ‘DNA/TcMar-Tc1’, ‘DNA/TcMar-Tc2’, ‘DNA/TcMar-Tigger’, ‘LINE/CR1’, ‘LINE/Dong-R4’, ‘LINE/I-Jockey’, ‘LINE/L1’, ‘LINE/L2’, ‘LINE/Penelope’, ‘LINE/RTE-BovB’, ‘Low_complexity’, ‘LTR/ERV1’, ‘LTR/ERVK’, ‘LTR/ERVL’, ‘LTR/Gypsy’, ‘LTR/Gypsy?’, ‘RC/Helitron’, ‘rRNA’, ‘Satellite/acro’, ‘Simple_repeat’, ‘SINE/5S-Deu-L2’, ‘SINE/MIR’, ‘SINE/tRNA’, ‘SINE/tRNA-Deu’, ‘SINE/tRNA-RTE’, ‘snRNA’, ‘srpRNA’, ‘tRNA’
How can I solve it?

Your issue is that you are trying to assign duplicate row.names, which is not allowed for a data frame: multiple rows would be named Simple_repeat. One solution is to make the names unique, for example with:
row.names(table) <- make.unique(unlist(row))  # strsplit() returns a list, so flatten it first (assumes one piece per name)
Another solution is to not use row names at all, but to create a separate column and use that for further processing instead, e.g.
table$rowLabel <- unlist(row)
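Both options can be tried on a small example. This is only a sketch: the data frame tab is a hypothetical three-row stand-in for the table in the question, and sub() is swapped in for strsplit() because it returns a plain character vector.

```r
# Hypothetical three-row stand-in for the table in the question
tab <- data.frame(
  tissue1 = c(58, 0, 2),
  tissue2 = c(77, 14, 10),
  row.names = c("Simple_repeat_80", "tRNA_1", "Simple_repeat_86")
)

# sub() strips the trailing "_<number>" and returns a character vector,
# so no unlist() is needed
labels <- sub("_[0-9]+$", "", rownames(tab))

# Option 1: make the row names unique (duplicates get a numeric suffix)
rownames(tab) <- make.unique(labels)

# Option 2: keep the plain, possibly repeated labels in an ordinary column
tab$rowLabel <- labels

print(tab)
```

With the labels in a column, repeated names such as Simple_repeat are no problem, and the column can be used for grouping later on.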

Related

How to add two specific columns from a colSums table in r?

I made a frequency table with two variables in a data frame using this:
table(df$Variable1, df$Variable2)
The output was this:
1 2 3 4 5 D R
1 5000 21 39 2 10 0 112
2 1028 11 18 4 8 1 54
3 1501 6 12 2 3 0 68
4 355 2 4 0 0 0 23
5 421 4 4 0 0 0 49
Then I wanted to find the sum of the first two columns so I did this:
colSums(table(df$Variable1, df$Variable2))
The output was this:
1 2 3 4 5 D R
8305 44 77 8 21 1 306
Is there a way to find the sum of columns 1 and 2 from the colSums output above? What would the code be? Thanks in advance.
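One possible approach, sketched under the assumption that the frequency table looks like the output above: colSums() returns a named vector, so the two totals can be picked by name (or by position) and summed. The matrix counts below is a hypothetical reconstruction of table(df$Variable1, df$Variable2) from the printed numbers.

```r
# Hypothetical reconstruction of the printed frequency table
counts <- matrix(
  c(5000, 21, 39, 2, 10, 0, 112,
    1028, 11, 18, 4,  8, 1,  54,
    1501,  6, 12, 2,  3, 0,  68,
     355,  2,  4, 0,  0, 0,  23,
     421,  4,  4, 0,  0, 0,  49),
  nrow = 5, byrow = TRUE,
  dimnames = list(as.character(1:5), c("1", "2", "3", "4", "5", "D", "R"))
)

col_totals <- colSums(counts)

# Select the totals for columns "1" and "2" by name and add them up;
# col_totals[1:2] would select the same two entries by position
first_two <- sum(col_totals[c("1", "2")])
first_two
```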

Why is my R code for filtering data producing different results with "fread()" and "ffdf()"?

I have a huge file with 7 million records and 160 variables. I came to know that fread() and read.csv.ffdf() are two ways to handle such big data. But when I try to use dplyr to filter these two data sets, I get different results. Below is a small subset of my data-
sample_data
AGE AGE_NEONATE AMONTH AWEEKEND
2 18 5 0
3 32 11 0
4 67 7 0
5 37 6 1
6 57 5 0
7 50 6 0
8 59 12 0
9 44 9 0
10 40 9 0
11 27 3 0
12 59 8 0
13 44 7 0
14 81 10 0
15 59 6 1
16 32 10 0
17 90 12 1
18 69 7 0
19 62 11 1
20 85 6 1
21 43 10 0
Code1
sample_data <- fread("/user/sample_data.csv", stringsAsFactors = T)
age_filter<-sample_data%>%filter(!(is.na(AGE)), between(as.numeric(AGE),65 , 95))
Result1-
AGE AGE_NEONATE AMONTH AWEEKEND
1 67 NA 7 0
2 81 NA 10 0
3 90 NA 12 1
4 69 NA 7 0
5 85 NA 6 1
Code2-
sample_data <- read.csv.ffdf(file="C:/Users/sample_data.csv", header=F ,fill=T)
header.true <- function(df) {
names(df) <- as.character(unlist(df[1,]))
df[-1,]
}
sample_data<-tbl_ffdf(sample_data)
sample_data<-header.true(sample_data)
age_filter<-sample_data%>%filter(!(is.na(AGE)), between(as.numeric(AGE),65 , 95))
Result2-
AGE AGE_NEONATE AMONTH AWEEKEND
1 81 10 0
2 90 12 1
3 85 6 1
I know that my 1st code is correct and gives me the correct results. What am I doing wrong in the 2nd code?
I haven't really tried running your code, but from what I can see, I suspect the following:
In your 2nd code version, you are reading the headers as part of the data. This leads to all the columns being imported as character rather than numeric.
In addition, most likely you have default.stringsAsFactors() returning TRUE, meaning that the imported character columns are treated as factors.
Now I guess that your between is being applied to factor levels between 65 and 95, rather than to the actual numbers. Since you probably don't have data for every year (age), 67 and 69 are likely mapped to factor levels below 65 (i.e. as.numeric(AGE) will return you the factor levels the numbers map to, and not the numbers as you see them when printing).
Try to use stringsAsFactors = FALSE or convert explicitly to character after reading.
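A small sketch of the effect described above, using a hypothetical factor of ages rather than your file: as.numeric() on a factor returns the internal level codes, while going through as.character() first recovers the printed numbers.

```r
# Hypothetical ages that were read in as a factor instead of as numbers
age <- factor(c("18", "67", "81", "90", "69", "85"))

codes  <- as.numeric(age)                # internal level codes, not the ages
values <- as.numeric(as.character(age))  # the numbers as they are printed

# Filtering on the codes compares the wrong quantity
codes >= 65 & codes <= 95

# Filtering on the converted values compares the real ages
values >= 65 & values <= 95
```

With only six levels no code reaches 65; in a file with many distinct ages some codes will, which would explain why part of the expected rows still come through.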

How to get a value from an upcoming row if a condition is met?

I searched on Google and SO but could not find an answer to my question.
I am trying to get a value from the first upcoming row in which a condition is met.
Example:
Pupil participation bonus
2 55 6
2 33 3
2 88 9
2 0 -100
2 44 4
2 66 7
2 0 -33
to
Pupil participation bonus bonusAtNoParti sumBonusTillParticipation=0
2 55 6 -94 6+3+9 = 18
2 33 3 -97 3+9 = 12
2 88 9 -91 9
2 0 -100 0 0
2 44 4 -29 4+7=11
2 66 7 -26 7
2 0 -33 0 0
So I need to do this:
Iterate through the data frame and check the following rows until participation equals 0, get the bonus from that row, add the bonus from the current row, and write the result to bonusAtNoParti.
My problem here is the "check the following rows until participation equals 0 and get the bonus from that row" part.
I know how to iterate through the whole list, but not how to start after the current row.
I need to apply this process to the whole list, where the participation values can be anything and appear in any order.
Does anyone have an idea how to implement this?
Edit: I also added another column ("sumBonusTillParticipation=0"; only the sum value is required), which is even harder to implement. R is such a hard language to learn =(
You can use which() to get the row numbers where participation is 0.
df <- read.table(text = 'Pupil participation bonus
2 55 6
2 33 3
2 88 9
2 0 -100
2 44 4
2 66 7
2 0 -33', header = T)
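# Row numbers where participation is 0; the leading 0 lets diff() return block lengths
# (this assumes the last row has participation 0)
index <- c(0, which(df$participation == 0))
diffs <- diff(index)
# Repeat each block's closing bonus across the rows of that block (index 0 selects nothing)
df$tp <- rep(df$bonus[index], times = diffs)
df$bonusAtNoParti <- df$bonus + df$tp
# Rows where participation is 0 get 0
df$bonusAtNoParti[index] <- 0
df$tp <- NULL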
Pupil participation bonus bonusAtNoParti
1 2 55 6 -94
2 2 33 3 -97
3 2 88 9 -91
4 2 0 -100 0
5 2 44 4 -29
6 2 66 7 -26
7 2 0 -33 0
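The second column from the edit (the running sum of bonuses up to the next row with participation 0) is not covered above. One possible sketch, assuming the same example data: label each block of rows that ends at a participation == 0 row, then take a reversed cumulative sum within each block. The column name sumBonusTillZero is made up for this example.

```r
df <- data.frame(
  Pupil = 2,
  participation = c(55, 33, 88, 0, 44, 66, 0),
  bonus = c(6, 3, 9, -100, 4, 7, -33)
)

# Block id: a new block starts on the row after each participation == 0 row
is_zero <- df$participation == 0
block <- cumsum(c(0, head(is_zero, -1)))

# Bonuses to be summed; rows with participation 0 contribute nothing
b <- ifelse(is_zero, 0, df$bonus)

# Within each block, sum from the current row down to the end of the block
df$sumBonusTillZero <- ave(b, block, FUN = function(x) rev(cumsum(rev(x))))
df
```

For the example this gives 18, 12, 9, 0, 11, 7, 0, which matches the sums written out in the question.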

transform values in data frame, generate new values as 100 minus current value

I'm currently working on a script which will eventually plot the accumulation of losses from cell divisions. Firstly I generate a matrix of values and then I add the number of times 0 occurs in each column - a 0 represents a loss.
However, I am now thinking that a nice plot would be a degradation curve. So, given the following example;
>losses_plot_data <- melt(full_losses_data, id=c("Divisions", "Accuracy"), value.name = "Losses", variable.name = "Size")
> full_losses_data
Divisions Accuracy 20 15 10 5 2
1 0 0 0 0 3 25
2 0 0 0 1 10 39
3 0 0 1 3 17 48
4 0 0 1 5 23 55
5 0 1 3 8 29 60
6 0 1 4 11 34 64
7 0 2 5 13 38 67
8 0 3 7 16 42 70
9 0 4 9 19 45 72
10 0 5 11 22 48 74
Is there a way I can easily turn this table into being 100 minus the numbers shown in the table? If I can plot that data instead of my current data, I would have a lovely curve of degradation from 100% down to however many cells have been lost.
Assuming you do not want to do that for the first column:
fld <- full_losses_data
fld[, 2:ncol(fld)] <- 100 - fld[, -1]
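If both Divisions and Accuracy are identifier columns (as the id argument of the melt() call suggests), a variant that excludes columns by name may be safer. This is a sketch only: the data frame below is a hypothetical reconstruction of the first three rows, and it assumes the first two printed values in each row are Divisions and Accuracy.

```r
# Hypothetical reconstruction of the first three rows of full_losses_data
fld <- data.frame(
  Divisions = 1:3,
  Accuracy = 0,
  `20` = c(0, 0, 0),
  `15` = c(0, 0, 1),
  `10` = c(0, 1, 3),
  `5`  = c(3, 10, 17),
  `2`  = c(25, 39, 48),
  check.names = FALSE  # keep the numeric column names as they are
)

# Transform every column except the identifier columns
id_cols <- c("Divisions", "Accuracy")
value_cols <- setdiff(names(fld), id_cols)
fld[value_cols] <- 100 - fld[value_cols]
fld
```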

summing a range of columns in data frame

I am having trouble summing selected columns within a data frame, a basic problem for which I've seen numerous similar, but not identical, questions and answers on Stack Overflow.
With this perhaps overly complex data frame:
site<-c(223,257,223,223,257,298,223,298,298,211)
moisture<-c(7,7,7,7,7,8,7,8,8,5)
shade<-c(83,18,83,83,18,76,83,76,76,51)
sampleID<-c(158,163,222,107,106,166,188,186,262,114)
bluestm<-c(3,4,6,3,0,0,1,1,1,0)
foxtail<-c(0,2,0,4,0,1,1,0,3,0)
crabgr<-c(0,0,2,0,33,0,2,1,2,0)
johnson<-c(0,0,0,7,0,8,1,0,1,0)
sedge1<-c(2,0,3,0,0,9,1,0,4,0)
sedge2<-c(0,0,1,0,1,0,0,1,1,1)
redoak<-c(9,1,0,5,0,4,0,0,5,0)
blkoak<-c(0,22,0,23,0,23,22,17,0,0)
my.data<-data.frame(site,moisture,shade,sampleID,bluestm,foxtail,crabgr,johnson,sedge1,sedge2,redoak,blkoak)
I want to sum the counts of each plant species (bluestem, foxtail, etc.; columns 5-12 in this example) within each site, by summing rows that have the same site number. I also want to keep the information about moisture and shade (these are consistent within a site, but may also be the same between sites), and I want a new column that counts the number of rows summed.
the result would look like this
site,moisture,shade,NumSamples,bluestm,foxtail,crabgr,johnson,sedge1,sedge2,redoak,blkoak
211,5,51,1,0,0,0,0,0,1,0,0
223,7,83,4,13,5,4,8,6,1,14,45
257,7,18,2,4,2,33,0,0,1,1,22
298,8,76,3,2,4,3,9,13,2,9,40
The problem I am having is that my real data sets (and I have several of them) have from 50 to 300 plant species, and I want to refer to a range of columns (in this case, [5:12]) instead of my.data$foxtail, my.data$sedge1, etc., which is going to be very difficult with 300 species.
I know I can start off by deleting the column I don't need (sampleID)
my.data$sampleID <- NULL
but then how do I get the sums? I've messed with the aggregate command and with ddply, and have seen lots of examples which call particular column names, but just haven't gotten anything to work. I recognize this is a variant of a commonly asked and simple type of question, but I've spent hours without resolving it on my own. So, apologies for my stupidity!
This works ok (with sampleID still in my.data, so that the species are columns 5:12):
x <- aggregate(my.data[,5:12], by=list(site=my.data$site, moisture=my.data$moisture, shade=my.data$shade), FUN=sum, na.rm=T)
library(dplyr)
my.data %>%
group_by(site) %>%
tally %>%
left_join(x)
site n moisture shade bluestm foxtail crabgr johnson sedge1 sedge2 redoak blkoak
1 211 1 5 51 0 0 0 0 0 1 0 0
2 223 4 7 83 13 5 4 8 6 1 14 45
3 257 2 7 18 4 2 33 0 0 1 1 22
4 298 3 8 76 2 4 3 9 13 2 9 40
Or to do it all in dplyr
my.data %>%
group_by(site) %>%
tally %>%
left_join(my.data) %>%
group_by(site,moisture,shade,n) %>%
summarise_each(funs(sum=sum)) %>%
select(-sampleID)
site moisture shade n bluestm foxtail crabgr johnson sedge1 sedge2 redoak blkoak
1 211 5 51 1 0 0 0 0 0 1 0 0
2 223 7 83 4 13 5 4 8 6 1 14 45
3 257 7 18 2 4 2 33 0 0 1 1 22
4 298 8 76 3 2 4 3 9 13 2 9 40
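Try the following using base R (this assumes sampleID has already been removed, so that the species counts sit in columns 4 to 11 of my.data):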
outdf<-data.frame(site=numeric(),moisture=numeric(),shade=numeric(),bluestm=numeric(),foxtail=numeric(),crabgr=numeric(),johnson=numeric(),sedge1=numeric(),sedge2=numeric(),redoak=numeric(),blkoak=numeric())
my.data$basic = with(my.data, paste(site, moisture, shade))
for(b in unique(my.data$basic)) {
outdf[nrow(outdf)+1,1:3] = unlist(strsplit(b,' '))
for(i in 4:11)
outdf[nrow(outdf),i]= sum(my.data[my.data$basic==b,i])
}
outdf
site moisture shade bluestm foxtail crabgr johnson sedge1 sedge2 redoak blkoak
1 223 7 83 13 5 4 8 6 1 14 45
2 257 7 18 4 2 33 0 0 1 1 22
3 298 8 76 2 4 3 9 13 2 9 40
4 211 5 51 0 0 0 0 0 1 0 0
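For completeness, here is a base R sketch that also produces the NumSamples column without a loop, by combining two aggregate() calls with merge(). It assumes sampleID is still present, so the species are columns 5:12; with more species only that range would need to change. The names group_cols, sums, counts and result are chosen for this example.

```r
site     <- c(223, 257, 223, 223, 257, 298, 223, 298, 298, 211)
moisture <- c(7, 7, 7, 7, 7, 8, 7, 8, 8, 5)
shade    <- c(83, 18, 83, 83, 18, 76, 83, 76, 76, 51)
sampleID <- c(158, 163, 222, 107, 106, 166, 188, 186, 262, 114)
bluestm  <- c(3, 4, 6, 3, 0, 0, 1, 1, 1, 0)
foxtail  <- c(0, 2, 0, 4, 0, 1, 1, 0, 3, 0)
crabgr   <- c(0, 0, 2, 0, 33, 0, 2, 1, 2, 0)
johnson  <- c(0, 0, 0, 7, 0, 8, 1, 0, 1, 0)
sedge1   <- c(2, 0, 3, 0, 0, 9, 1, 0, 4, 0)
sedge2   <- c(0, 0, 1, 0, 1, 0, 0, 1, 1, 1)
redoak   <- c(9, 1, 0, 5, 0, 4, 0, 0, 5, 0)
blkoak   <- c(0, 22, 0, 23, 0, 23, 22, 17, 0, 0)
my.data <- data.frame(site, moisture, shade, sampleID, bluestm, foxtail,
                      crabgr, johnson, sedge1, sedge2, redoak, blkoak)

group_cols <- c("site", "moisture", "shade")

# Sum the species columns, referred to by position rather than by name
sums <- aggregate(my.data[, 5:12], by = my.data[group_cols], FUN = sum)

# Count the rows that went into each group
counts <- aggregate(list(NumSamples = my.data$site),
                    by = my.data[group_cols], FUN = length)

# Join on the shared grouping columns
result <- merge(counts, sums)
result
```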
