Remove duplicate header lines in dataframe

My raw data contains numeric values, with the header lines repeated every 20 lines.
I wish to remove the repeated header lines with R. I know it's quite easy with the sed command, but I want the R script to handle all the steps of tidying the data.
> raw <- read.delim("./vmstat_archiveadm_s.txt")
> head(raw)
kthr memory page disk faults cpu
r b w swap free re mf pi po fr de sr s2 s3 vc -- in sy cs us sy id
0 0 0 100097600 97779056 285 426 53 0 0 0 367 86 6 0 0 1206 7711 2630 1 0 99
0 0 0 96908192 94414488 7 31 0 0 0 0 0 120 0 0 0 2782 5775 5042 2 0 97
0 0 0 96889840 94397152 0 1 0 0 0 0 0 122 0 0 0 2737 5591 4958 2 0 97
kthr memory page disk faults cpu
r b w swap free re mf pi po fr de sr s2 s3 vc -- in sy cs us sy id
0 0 0 100065744 97745448 282 422 52 0 0 0 363 89 6 0 0 1233 7690 2665 1 0 99
0 0 0 96725312 94222040 7 31 0 0 0 0 0 604 69 0 0 5269 5703 7910 2 1 97
0 0 0 96668624 94170784 0 0 0 0 0 0 0 155 53 0 0 3047 5505 5317 2 0 97
0 0 0 96595104 94086816 0 0 0 0 0 0 0 174 0 0 0 2879 5567 5068 2 0 97
1 0 0 96521376 94025504 0 0 0 0 0 0 0 121 0 0 0 2812 5471 5105 2 0 97
0 0 0 96503256 93994896 0 0 0 0 0 0 0 121 0 0 0 2731 5621 4981 2 0 97
(...)

Try this, where df is the data frame:
x = seq(6, 100, 21)
df = df[-x, ]
seq(6, 100, 21) generates the numbers from 6 to 100 in steps of 21, which in this case gives:
6 27 48 69 90
Negative indexing then removes those rows from the data frame:
df[-x, ]
EDIT:
To do this for the entire data frame, replace 100 with the number of rows, i.e.
seq(6, nrow(df), 21)
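The fixed row positions above only work when the header recurs at exactly those rows. As a hedged alternative, here is a minimal sketch that drops header lines by pattern instead. The sample lines are a shortened, hypothetical stand-in for the vmstat file in the question, and the regular expressions are assumptions based on the two header lines shown there.

```r
# Hypothetical, shortened stand-in for readLines("./vmstat_archiveadm_s.txt")
raw_lines <- c(
  " kthr      memory            page",
  " r b w   swap  free  re  mf",
  " 0 0 0 100097600 97779056 285 426",
  " 0 0 0 96908192 94414488 7 31",
  " kthr      memory            page",
  " r b w   swap  free  re  mf",
  " 1 0 0 96521376 94025504 0 0"
)

# A line is a header if it starts with "kthr" or with the "r b w" column names
is_header <- grepl("^\\s*(kthr|r\\s+b\\s+w)", raw_lines)

# Take the column names from the first "r b w ..." line
name_line <- raw_lines[grepl("^\\s*r\\s+b\\s+w", raw_lines)][1]
col_names <- strsplit(trimws(name_line), "\\s+")[[1]]

# Parse only the data lines
df <- read.table(text = raw_lines[!is_header], header = FALSE,
                 col.names = col_names)
str(df)
```

The real vmstat header repeats some names (sy appears twice) and contains "--", so with the full file it may be worth passing the names through make.names(col_names, unique = TRUE) first.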

Instead of processing the output in R, I will clean it when it is generated:
$ vmstat 1 | egrep -v '^ kthr|^ r'
0 0 0 154831904 153906536 215 471 0 0 0 0 526 33 32 0 0 1834 14171 5253 0 0 99
1 0 0 154805632 153354296 9 32 0 0 0 0 0 0 0 0 0 1463 610 739 0 0 100
1 0 0 154805632 153354696 0 4 0 0 0 0 0 0 0 0 0 1408 425 634 0 0 100
0 0 0 154805632 153354696 0 0 0 0 0 0 0 0 0 0 0 1341 381 658 0 0 100
0 0 0 154805632 153354696 0 0 0 0 0 0 0 0 0 0 0 1299 353 610 0 0 100
1 0 0 154805632 153354696 0 0 0 0 0 0 0 0 0 0 0 1319 375 638 0 0 100
0 0 0 154805632 153354640 0 0 0 0 0 0 0 0 0 0 0 1308 367 614 0 0 100
0 0 0 154805632 153354640 0 0 0 0 0 0 0 0 0 0 0 1336 395 650 0 0 100
1 0 0 154805632 153354640 0 0 0 0 0 0 0 44 44 0 0 1594 378 878 0 0 100
0 0 0 154805632 153354640 0 0 0 0 0 0 0 66 65 0 0 1763 382 1015 0 0 100
0 0 0 154805632 153354640 0 0 0 0 0 0 0 0 0 0 0 1312 411 645 0 0 100
0 0 0 154805632 153354640 0 0 0 0 0 0 0 0 0 0 0 1342 390 647 0 0 100

Related

Rstudio error: the table must the same classes in the same order

I keep getting this error, "the table must the same classes in the same order",
when implementing KNN and a confusion matrix to get the accuracy:
df_train <- df_trimmed_n[1:10,]
df_test <- df_trimmed_n[11:20,]
df_train_labels <- df_trimmed[1:10, 1]
df_test_labels <- df_trimmed[11:20, 1]
library(class)
library(caret)
df_knn<-knn(df_train,df_test,cl=df_train_labels,k=10)
confusionMatrix(table(df_knn,df_test_labels))
Error in confusionMatrix.table(table(df_knn, df_test_labels)) :
the table must the same classes in the same order
> print(df_knn)
[1] 28134 5138 4820 3846 1216 1885 1885 22021 5138 15294
Levels: 106 1216 1885 3846 4820 5138 15294 22021 22445 28134
> print(df_test_labels)
[1] 33262 6459 5067 7395 22720 1217 3739 84 16219 17819
> table(df_knn,df_test_labels)
       df_test_labels
df_knn  84 1217 3739 5067 6459 7395 16219 17819 22720 33262
  106    0    0    0    0    0    0     0     0     0     0
  1216   0    0    0    0    0    0     0     0     1     0
  1885   0    1    1    0    0    0     0     0     0     0
  3846   0    0    0    0    0    1     0     0     0     0
  4820   0    0    0    1    0    0     0     0     0     0
  5138   0    0    0    0    1    0     1     0     0     0
  15294  0    0    0    0    0    0     0     1     0     0
  22021  1    0    0    0    0    0     0     0     0     0
  22445  0    0    0    0    0    0     0     0     0     0
  28134  0    0    0    0    0    0     0     0     0     1
Even though both the knn output and the test dataset have the same number of rows (10), I'm not sure what is wrong with the classes and their order.

Generate an image with specific dimensions from a data frame in R

I have a data frame in R with the following dimensions [15750,93]. I want to construct an image using this data such that there are 3 row coordinates and 31 column coordinates in the image. Each column in the data frame corresponds to data from one coordinate position in the image. The columns in the data frame have been arranged based on their respective coordinates in the following manner [1,1], [2,1], [3,1], [1,2], [2,2], [3,2] ......... [1,31],[2,31],[3,31]
To generate the image, for each column I would like to have an average of all values, a sum of all values and the highest value in each column. This way there will be exactly one value corresponding to a coordinate. And, with the 3 variations, I should get three types of images - average, sum and highest value.
Can someone help me in generating an overall image using this data or can guide me using data with smaller dimensions?
Some demo data below:
Dimensions of the data frame are [11, 15]
0 0 0 0 0 46 0 0 0 0 0 0 0 78 0
0 734 0 0 0 0 932 0 0 56 0 0 0 0 0
0 0 0 115 0 0 0 0 0 0 64 0 0 0 0
0 67 0 0 0 45 0 0 0 0 0 546 0 12 0
0 0 0 0 65 5 56 0 54 0 0 0 0 0 0
667 0 430 0 0 0 0 456 0 0 787 0 0 467 0
0 0 0 0 54 0 0 0 0 0 0 456 90 0 0
778 45 0 0 0 0 24 913 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 26 0 0 0
234 0 0 620 0 0 0 0 0 106 0 0 901 0 0
0 0 0 0 0 0 45 0 34 0 0 0 0 0 0
I would like to have an image with the dimensions [3,5]. The columns in the above data frame have been arranged based on their respective coordinates in the following manner: [1,1], [2,1], [3,1], [1,2], [2,2], [3,2] ... and so on.
The image coordinate arrangement
[1,1] [1,2] [1,3] [1,4] [1,5]
[2,1] [2,2] [2,3] [2,4] [2,5]
[3,1] [3,2] [3,3] [3,4] [3,5]

The code below reads in your dataset and defines a function that finds the mean (or max or sum) of each column (yielding a series of numbers, one per column). The function then reshapes that series into your desired output dimensions and displays it as an image.
df <- read.table(header=FALSE,text="
0 0 0 0 0 46 0 0 0 0 0 0 0 78 0
0 734 0 0 0 0 932 0 0 56 0 0 0 0 0
0 0 0 115 0 0 0 0 0 0 64 0 0 0 0
0 67 0 0 0 45 0 0 0 0 0 546 0 12 0
0 0 0 0 65 5 56 0 54 0 0 0 0 0 0
667 0 430 0 0 0 0 456 0 0 787 0 0 467 0
0 0 0 0 54 0 0 0 0 0 0 456 90 0 0
778 45 0 0 0 0 24 913 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 26 0 0 0
234 0 0 620 0 0 0 0 0 106 0 0 901 0 0
0 0 0 0 0 0 45 0 34 0 0 0 0 0 0
")
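img <- function(data, op, tall, wide) {
  # Columns are ordered [1,1], [2,1], [3,1], [1,2], ..., i.e. column-major,
  # so fill a tall x wide matrix directly (matrix() fills by column).
  m <- matrix(sapply(data, op), nrow = tall, ncol = wide)
  # image() draws matrix rows along the x axis with column 1 at the bottom,
  # so flip the rows and transpose to show m the way it is printed.
  image(t(m[tall:1, , drop = FALSE]), col = gray((0:32) / 32))
}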
img(df, mean, 3, 5)
img(df, max, 3, 5)
img(df, sum, 3, 5)
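To check the coordinate mapping independently of the plot, the reshaping step can be run on its own. This is a sketch under the assumption stated in the question that the columns are ordered [1,1], [2,1], [3,1], [1,2], ... (column-major); to_grid and demo are hypothetical names and the demo values are made up.

```r
# Hypothetical demo: 2 observations for each of 3 x 5 = 15 coordinates
demo <- as.data.frame(matrix(1:30, nrow = 2))

# Summarise each column, then fill a tall x wide matrix column by column,
# which matches the [1,1], [2,1], [3,1], [1,2], ... ordering
to_grid <- function(data, op, tall, wide) {
  matrix(sapply(data, op), nrow = tall, ncol = wide)
}

grid_sum  <- to_grid(demo, sum, 3, 5)
grid_max  <- to_grid(demo, max, 3, 5)
grid_mean <- to_grid(demo, mean, 3, 5)
print(grid_sum)
```

Here grid_sum[2, 1] holds the sum of the second data column and grid_sum[1, 2] the sum of the fourth, as the question's layout requires. Note that image() draws the rows of its matrix along the x axis with column 1 at the bottom, so something like image(t(grid_sum[3:1, ])) is needed to show the grid in the same orientation as it is printed.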

Sum of longest string of non-zero values

I have a dataframe containing the daily rainfall values at 76 stations from 1964-2013. Each row is a different month for a particular station. Here is a snippet of the dataframe-
Station Year Month Days 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
USC00020750 1964 1 31 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 25 0 23 51 36 0 0 0 0 0 0 0 0
USC00020750 1964 2 29 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 48 0 0 0 3 0 0 0 0 0 0 Inf Inf
USC00020750 1964 3 31 0 46 51 0 0 36 41 46 0 0 0 0 43 0 0 0 0 0 0 0 0 53 99 140 36 0 0 0 0 0 0
USC00020750 1964 4 30 5 69 23 30 0 18 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 33 13 0 0 0 15 0 Inf
USC00020750 1964 5 31 0 0 0 0 0 0 43 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 51 8 0 0 0 0
USC00020750 1964 6 30 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 38 0 0 0 Inf
USC00020750 1964 7 31 0 0 0 0 0 0 0 0 0 0 0 0 41 0 13 13 0 0 0 0 8 51 0 71 0 10 0 0 20 165 25
USC00020750 1964 8 31 8 30 137 0 0 5 89 0 0 0 18 64 5 0 0 0 0 0 0 0 0 0 0 0 0 76 0 0 0 0 0
USC00020750 1964 9 30 0 0 0 0 0 119 0 0 0 0 0 0 0 41 25 0 0 0 0 0 25 0 0 0 0 0 0 0 0 0 Inf
USC00020750 1964 10 31 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
USC00020750 1964 11 30 0 5 0 0 0 0 0 0 0 0 91 0 0 0 36 94 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Inf
USC00020750 1964 12 31 0 107 20 0 0 0 0 0 0 0 0 0 0 0 0 0 0 79 152 0 0 0 0 0 0 0 0 0 0 0 0
...
Station Year Month Days 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
USW00093129 2013 10 31 0 0 0 0 0 0 0 0 43 15 0 0 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 41 3 8 0
USW00093129 2013 11 30 0 0 0 23 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 79 18 20 0 0 0 0 0 0 0 Inf
USW00093129 2013 12 31 0 0 175 33 0 0 3 0 0 0 0 0 0 0 0 0 0 0 5 15 0 0 0 0 0 0 0 0 0 0 0
I am trying to find the length of the longest stretch of non-zero rainfall values for each row and the total rainfall in that stretch. The easiest way to find the length of the longest stretch would be to convert the dataframe to 0s and 1s, use rle and apply max(y$lengths[y$values!=0]) along each row. But how do I find the sum of the values?
Thanks for helping out, in advance!

Not exactly a one-liner, but this works :
df <- read.table(header=TRUE,stringsAsFactors=FALSE,check.names=FALSE,text=
"Station Year Month Days 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
USC00020750 1964 1 31 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 25 0 23 51 36 0 0 0 0 0 0 0 0
USC00020750 1964 2 29 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 48 0 0 0 3 0 0 0 0 0 0 Inf Inf
USC00020750 1964 3 31 0 46 51 0 0 36 41 46 0 0 0 0 43 0 0 0 0 0 0 0 0 53 99 140 36 0 0 0 0 0 0
USC00020750 1964 4 30 5 69 23 30 0 18 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 33 13 0 0 0 15 0 Inf
USC00020750 1964 5 31 0 0 0 0 0 0 43 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 51 8 0 0 0 0
USC00020750 1964 6 30 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 38 0 0 0 Inf
USC00020750 1964 7 31 0 0 0 0 0 0 0 0 0 0 0 0 41 0 13 13 0 0 0 0 8 51 0 71 0 10 0 0 20 165 25
USC00020750 1964 8 31 8 30 137 0 0 5 89 0 0 0 18 64 5 0 0 0 0 0 0 0 0 0 0 0 0 76 0 0 0 0 0
USC00020750 1964 9 30 0 0 0 0 0 119 0 0 0 0 0 0 0 41 25 0 0 0 0 0 25 0 0 0 0 0 0 0 0 0 Inf
USC00020750 1964 10 31 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
USC00020750 1964 11 30 0 5 0 0 0 0 0 0 0 0 91 0 0 0 36 94 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Inf
USC00020750 1964 12 31 0 107 20 0 0 0 0 0 0 0 0 0 0 0 0 0 0 79 152 0 0 0 0 0 0 0 0 0 0 0 0")
res <- lapply(1:nrow(df), function(r){
  # keep only the days that exist in this month (drops the Inf padding)
  monthDays <- df[r,'Days']
  rain <- as.numeric(df[r,(1:monthDays) + 4])
  # run-length encode wet (TRUE) and dry (FALSE) days
  enc <- rle(rain > 0)
  if(all(!enc$values))
    return(c(0,0))
  # zero out the dry runs so that which.max picks the longest wet run
  len <- enc$lengths
  len[!enc$values] <- 0
  max.idx <- which.max(len)
  # translate the run index back into day positions
  lastIdx <- cumsum(enc$lengths)[max.idx]
  firstIdx <- lastIdx - enc$lengths[max.idx] + 1
  tot <- sum(rain[firstIdx:lastIdx])
  stretch <- lastIdx - firstIdx + 1
  return(c(stretch,tot))
})
columnsToAdd <- do.call(rbind,res)
colnames(columnsToAdd) <- c('StretchLen','StretchRain')
df2 <- cbind(df,columnsToAdd)
Result :
# Print the result without the daily columns for better readability
> df2[,-(5:35)]
Station Year Month Days StretchLen StretchRain
1 USC00020750 1964 1 31 3 110
2 USC00020750 1964 2 29 1 48
3 USC00020750 1964 3 31 4 328
4 USC00020750 1964 4 30 4 127
5 USC00020750 1964 5 31 2 59
6 USC00020750 1964 6 30 1 38
7 USC00020750 1964 7 31 3 210
8 USC00020750 1964 8 31 3 175
9 USC00020750 1964 9 30 2 66
10 USC00020750 1964 10 31 0 0
11 USC00020750 1964 11 30 2 130
12 USC00020750 1964 12 31 2 127
BTW, if you want to stick with apply, it would look like this:
columnsToAdd <-
  t(apply(df[,-(1:3)],MARGIN=1,function(r){
    monthDays <- r[1]
    # keep only the days that exist in this month; the Inf padding would
    # otherwise count as rain because Inf > 0 is TRUE
    rain <- as.numeric(r[-1])[seq_len(monthDays)]
    enc <- rle(rain > 0)
    if(all(!enc$values))
      return(c(0,0))
    len <- enc$lengths
    len[!enc$values] <- 0
    max.idx <- which.max(len)
    lastIdx <- cumsum(enc$lengths)[max.idx]
    firstIdx <- lastIdx - enc$lengths[max.idx] + 1
    tot <- sum(rain[firstIdx:lastIdx])
    stretch <- lastIdx - firstIdx + 1
    return(c(stretch,tot))
  }))
colnames(columnsToAdd) <- c('StretchLen','StretchRain')
df2 <- cbind(df,columnsToAdd)
I don't like using apply on data frames: it was designed for matrices, so it coerces all the columns to a single type before calling the function (hence, if you work on columns of different types, you need to be careful).
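The coercion caveat is easy to see on a tiny example. The sketch below uses a made-up two-column data frame (mixed is a hypothetical name) and only shows which type each row arrives as inside the function.

```r
# Hypothetical data frame with one character and one numeric column
mixed <- data.frame(id = c("a", "b"), n = c(1, 2), stringsAsFactors = FALSE)

# apply() converts the data frame to a matrix first, so every row is character
row_classes <- apply(mixed, 1, function(r) class(r))

# Restricting apply() to the numeric columns keeps the values numeric
numeric_classes <- apply(mixed[, "n", drop = FALSE], 1, function(r) class(r))

print(row_classes)
print(numeric_classes)
```

This is why the apply version above drops the Station, Year and Month columns before calling apply: with Station included, every value would reach the function as a string.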

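Here's another solution with dplyr/tidyr. It assumes data is the data frame from the question, read with check.names = FALSE so that the day columns are named 1 to 31.
library(dplyr)
library(tidyr)
data %>%
  # convert = TRUE turns the day labels into integers, so they sort as 1, 2, ..., 31
  # rather than as the strings "1", "10", "11", ...
  gather(day, rain, -Station, -Year, -Month, -Days, convert = TRUE) %>%
  arrange(Station, Year, Month, day) %>%
  group_by(Station, Year, Month) %>%
  mutate(previous_rain = lag(rain)) %>%
  # drop dry days and the Inf padding
  filter(!(rain %in% c(0, Inf))) %>%
  # a new storm starts whenever the previous day was dry or does not exist
  mutate(storm = cumsum(previous_rain %in% c(0, NA))) %>%
  group_by(Station, Year, Month, storm) %>%
  summarize(total_rain = sum(rain),
            number_of_days = n(),
            start_day = first(day),
            end_day = last(day)) %>%
  # keep the longest storm of each month
  arrange(desc(number_of_days)) %>%
  slice(1)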

Here's another take at it, where I've used the rle() function to find run lengths. It's protracted but primarily to make it clear what is happening - you could shorten it easily.
raindf <-
tmp <- read.table(textConnection(" Station Year Month Days 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
USC00020750 1964 1 31 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 25 0 23 51 36 0 0 0 0 0 0 0 0
USC00020750 1964 2 29 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 48 0 0 0 3 0 0 0 0 0 0 Inf Inf
USC00020750 1964 3 31 0 46 51 0 0 36 41 46 0 0 0 0 43 0 0 0 0 0 0 0 0 53 99 140 36 0 0 0 0 0 0
USC00020750 1964 4 30 5 69 23 30 0 18 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 33 13 0 0 0 15 0 Inf
USC00020750 1964 5 31 0 0 0 0 0 0 43 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 51 8 0 0 0 0
USC00020750 1964 6 30 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 38 0 0 0 Inf
USC00020750 1964 7 31 0 0 0 0 0 0 0 0 0 0 0 0 41 0 13 13 0 0 0 0 8 51 0 71 0 10 0 0 20 165 25
USC00020750 1964 8 31 8 30 137 0 0 5 89 0 0 0 18 64 5 0 0 0 0 0 0 0 0 0 0 0 0 76 0 0 0 0 0
USC00020750 1964 9 30 0 0 0 0 0 119 0 0 0 0 0 0 0 41 25 0 0 0 0 0 25 0 0 0 0 0 0 0 0 0 Inf
USC00020750 1964 10 31 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
USC00020750 1964 11 30 0 5 0 0 0 0 0 0 0 0 91 0 0 0 36 94 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Inf
USC00020750 1964 12 31 0 107 20 0 0 0 0 0 0 0 0 0 0 0 0 0 0 79 152 0 0 0 0 0 0 0 0 0 0 0 0"), header = TRUE)
rainfall <- unlist(as.data.frame(t(raindf[1:3, -c(1:4)])), use.names = FALSE)
rainfall <- rainfall[!is.infinite(rainfall)]
rainfall[rainfall > 0] <- 1
rainyruns <- rle(rainfall)
rainyrunsDf <- data.frame(lengths = rainyruns$lengths, values = rainyruns$values)
rainyrunsDf <- subset(rainyrunsDf, values != 0)
rainyrunsDf <- rainyrunsDf[order(rainyrunsDf$lengths, decreasing = TRUE), ]
rainyrunsDf[1,1]
## [1] 4
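The rle() idea used in these answers can be checked on a single short vector. The following is a minimal sketch, not taken from any of the answers: longest_wet_run is a hypothetical helper name, and the input is an excerpt of the March 1964 row from the question.

```r
# Return the length and total of the longest run of values > 0
longest_wet_run <- function(rain) {
  runs <- rle(rain > 0)
  if (!any(runs$values)) return(c(length = 0, total = 0))
  # ignore dry runs when looking for the longest one
  wet_lengths <- ifelse(runs$values, runs$lengths, 0L)
  best <- which.max(wet_lengths)
  # convert the run index back into positions in the original vector
  end <- cumsum(runs$lengths)[best]
  start <- end - runs$lengths[best] + 1
  c(length = runs$lengths[best], total = sum(rain[start:end]))
}

x <- c(0, 46, 51, 0, 0, 36, 41, 46, 0)
res <- longest_wet_run(x)
print(res)
```

On this excerpt the longest wet run is 36, 41, 46, so the helper reports a length of 3 and a total of 123. If several runs tie for the longest, which.max picks the first one.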

Making a CSV file into an RData file

Please bear with an R newbie here. I'm trying to follow along with a tutorial published on the wonderful flowingdata.com site by using my own data to replace the .Rdata file included in the tutorial. The Rdata file, "unisexCnts.RData", contains unisex names and the number of times used for different years:
head(unisexCnts)
1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951
Addison 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Alexis 0 0 0 0 0 0 0 0 0 0 0 12 0 0 0 0 0 0 0 0 0 0
Ali 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Alva 0 0 312 273 274 263 0 273 0 255 235 195 222 0 195 0 193 225 204 196 177 156
Amari 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Angel 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973
Addison 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Alexis 0 0 0 0 0 0 0 0 0 0 0 0 190 0 0 325 0 0 0 0 0 0
Ali 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 243 219 214
Alva 177 132 159 178 145 138 131 119 119 119 127 97 107 97 83 76 83 90 84 81 58 68
Amari 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Angel 0 0 0 0 0 0 0 0 0 1264 0 0 0 0 0 0 0 1579 2145 2488 0 0
1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995
Addison 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 595 664
Alexis 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Ali 0 0 0 0 0 0 0 0 0 0 0 0 561 565 556 643 747 722 0 742 0 0
Alva 54 57 53 54 59 40 62 0 48 0 28 0 34 0 0 0 0 0 0 0 0 26
Amari 0 0 0 0 0 0 11 0 0 0 0 0 16 0 22 0 32 0 0 0 0 0
Angel 2561 2690 2779 0 0 3004 3108 3113 3187 2924 3100 3341 3229 3101 3532 3889 4066 4520 0 0 0 0
1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012
Addison 778 889 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Alexis 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Ali 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Alva 0 0 0 19 0 14 0 0 0 0 0 24 0 0 0 0 0
Amari 0 0 0 0 0 0 1181 1397 1333 1299 1265 1550 1780 0 0 0 0
Angel 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
When I run it through the str() function I get the following:
str(unisexCnts)
num [1:121, 1:83] 0 0 0 0 0 0 16 0 0 0 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:121] "Addison" "Alexis" "Ali" "Alva" ...
..$ : chr [1:83] "1930" "1931" "1932" "1933" ...
My data is in a csv file called "boysnames.csv":
,2013,2012,2011,2010,2009,2008
Jack,764,831,840,935,1068,1151
James,746,773,796,746,711,737
Daniel,678,683,711,792,842,828
Conor,610,639,709,726,776,857
I am trying to overwrite the unisexCnts.RData with the contents of my boysnames.csv. So to restructure and get my csv ready to be saved, I did:
Step1.
unisexCnts<-data <- read.csv("boysnames.csv", stringsAsFactors=FALSE, header=TRUE, check.names = FALSE)
Step2.
unisexCnts<-as.matrix(unisexCnts)
Step3.
save(unisexCnts, file="unisexCnts.RData") ## save as an RData file, overwriting the original unisexCnts.RData in the dir
However I get the following after steps 1 & 2 which doesn't match the structure of the original, any ideas/pointers?
> str(unisexCnts)
chr [1:100, 1:7] "Jack" "James" "Daniel" "Conor" "Sean" "Adam" "Ryan" "Michael" "Harry" "Noah" ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:7] "" "2013" "2012" "2011" ...

When you load a .csv file, you can specify the column that should become the row names of the loaded data using the "row.names" argument.
I recreated your data quickly and loaded it using the following code:
read.csv('test.csv', stringsAsFactors = FALSE, header = TRUE, row.names = 1)
This saves you from having to do this work after loading the data. It also gives you the data structure you are looking for:
unisexCnts = read.csv('test.csv', stringsAsFactors = FALSE, header = TRUE, row.names = 1)
unisexCnts = as.matrix(unisexCnts)
str(unisexCnts)
int [1:4, 1:6] 764 746 678 610 831 773 683 639 840 796 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:4] "Jack" "James" "Dan" "Conor"
..$ : chr [1:6] "X2013" "X2012" "X2011" "X2010" ...
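Putting the whole round trip together, a minimal sketch might look like the one below. The CSV content is a shortened, hypothetical stand-in for boysnames.csv written to a temporary file, check.names = FALSE is added here to keep the year names as "2013" rather than "X2013", and the conversion to double is an assumption made only to mirror the num type of the original matrix.

```r
# Hypothetical stand-in for boysnames.csv, written to a temporary file
csv_path <- tempfile(fileext = ".csv")
writeLines(c(",2013,2012,2011",
             "Jack,764,831,840",
             "James,746,773,796"), csv_path)

# row.names = 1 moves the names out of the data;
# check.names = FALSE keeps "2013" instead of "X2013"
unisexCnts <- read.csv(csv_path, row.names = 1, check.names = FALSE)
unisexCnts <- as.matrix(unisexCnts)
storage.mode(unisexCnts) <- "double"   # the original matrix is num, not int

# save() needs the object to store as well as the file name
rdata_path <- tempfile(fileext = ".RData")
save(unisexCnts, file = rdata_path)

# Load into a separate environment to confirm the round trip
check_env <- new.env()
load(rdata_path, envir = check_env)
str(check_env$unisexCnts)
```

With the real files, csv_path would be "boysnames.csv" and rdata_path would be "unisexCnts.RData".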

However I get the following after steps 1 & 2 which doesn't match the
structure of the original, any ideas/pointers?
In the original unisexCnts, the names are specified as row names. That's why the first attribute is
..$ : chr [1:121] "Addison" "Alexis" "Ali" "Alva" ...
To replicate that in your example, you can set the names as row names with
rownames(unisexCnts) <- ListOrVectorOfNamesHere
This will make the output match.
The reason this line:
chr [1:100, 1:7] "Jack" "James" "Daniel" "Conor" "Sean" "Adam" "Ryan" "Michael" "Harry" "Noah" ...
doesn't match this line
num [1:121, 1:83] 0 0 0 0 0 0 16 0 0 0 ...
has the same cause: the names are included in the matrix itself. A matrix can only hold data of a single type, so by including character data (the names) you convert the whole matrix to character.
In summary: remove the name column from the matrix and use it as the row names, and the str() of your two objects will match.
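The effect of a name column on the matrix type can be reproduced in a few lines. This is a sketch with made-up values; boys, with_names and counts are hypothetical names.

```r
# Hypothetical data frame with the names stored as an ordinary column
boys <- data.frame(name = c("Jack", "James"),
                   `2013` = c(764, 746),
                   `2012` = c(831, 773),
                   check.names = FALSE, stringsAsFactors = FALSE)

# Converting everything at once forces the whole matrix to character
with_names <- as.matrix(boys)

# Dropping the name column first keeps the counts numeric;
# the names then go into the row names instead
counts <- as.matrix(boys[, -1])
rownames(counts) <- boys$name

str(with_names)
str(counts)
```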

Why can't I use aggregate with cbind on a range of columns in a data.frame?

20 Lines of the data I'm working on:
Zv9_NA110 6176 7276 5'to3'IntronExon 0 + 1100 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Zv9_NA110 10126 11226 5'to3'IntronExon 0 + 1100 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 7 9 9 15 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 13 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Zv9_NA110 11219 12319 5'to3'ExonIntron 0 + 1100 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Zv9_NA110 14887 15987 5'to3'IntronExon 0 + 1100 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 7 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
Zv9_NA110 18923 20023 5'to3'IntronExon 0 + 1100 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Zv9_NA110 21069 22169 5'to3'ExonIntron 0 + 1100 0 135 115 65 54 45 36 27 16 9 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Zv9_NA113 1615 2715 5'to3'IntronExon 0 - 1100 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Zv9_NA113 2335 3435 5'to3'ExonIntron 0 - 1100 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 7 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Zv9_NA113 5398 6498 5'to3'IntronExon 0 - 1100 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Zv9_NA113 7173 8273 5'to3'ExonIntron 0 - 1100 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Zv9_NA118 11674 12774 5'to3'IntronExon 0 + 1100 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Zv9_NA118 12711 13811 5'to3'ExonIntron 0 + 1100 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Zv9_NA123 38151 39251 5'to3'ExonIntron 0 - 1100 0 1061 958 844 796 695 600 464 346 265 210 150 133 94 81 72 46 18 4 0 0 0 0 0 0 0 0 0 7 9 9 9 11 21 35 43 58 91 108 180 268 406 547 712 833 882 960 1094 1172 1245 1331 1432 1510 1604 1711 1810 1830 1837 1823 1781 1690 1638 1560 1489 1257 854 731 631 589 551 497 439 404 369 301 231 168 123 76 58 50 42 28 20 11 9 9 24 27 27 27 27 27 25 18 18 18 18 18 18 18 18 18 18 18 18 18 14 5 0 0
Zv9_NA124 2578 3678 5'to3'ExonIntron 0 + 1100 0 423 407 401 377 357 345 324 304 249 185 111 54 30 12 0 0 0 0 0 0 0 0 0 0 0 0 0 1 9 9 9 9 14 18 25 27 27 27 27 27 27 27 27 27 27 27 26 18 18 18 18 18 18 16 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 8 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Zv9_NA129 4939 6039 5'to3'IntronExon 0 + 1100 226 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 9 9 9 9 9 9 9 9 9 9 9 9 14 34 45 60 97 128 175 293 395 524 621 764 894 1036 1164 1334 1469 1639 1801 1885 1983
Zv9_NA132 12589 13689 5'to3'ExonIntron 0 - 1100 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Zv9_NA132 13634 14734 5'to3'IntronExon 0 - 1100 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Zv9_NA132 14481 15581 5'to3'ExonIntron 0 - 1100 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5 9 9 9 9 9 9 9 9 9 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Zv9_NA132 19534 20634 5'to3'IntronExon 0 - 1100 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Zv9_NA132 28708 29808 5'to3'ExonIntron 0 - 1100 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5 9 15 18 24 27 42 46 73 112 142 157 162 162 162 162 162 162 162 162 159 153 153 153 153 153 150 144 132 112 76 52 30 25 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
I get this into R as follows:
> dat <- read.table("dat.dat",header=F)
I need to get the averages for columns 9 through 118, grouped by column 4.
This works:
> all_means <- aggregate(cbind(V9,V10,V11)~V4,data=dat,FUN=mean)
V4 V9 V10 V11
1 5'to3'ExonIntron 0 0.00 0
2 5'to3'IntronExon 0 0.75 1
But there's no way I'm typing this out to V118.
I've tried this:
> aggregate(cbind(9:118)~V4,data=blah,FUN=mean)
But I get this error:
Error in model.frame.default(formula = cbind(9:118) ~ V4, data = blah) :
variable lengths differ (found for 'V4')
Is there something dumb I'm missing?

You have a number of options.
Create a formula using . and pass a subset of the data:
aggregate( . ~ V4, data = dat[,c(4,9:118)], FUN = mean)
You could also create the vector of column names using paste0
nn <- paste0('V', 9:118)
and refer to the columns by name:
aggregate( . ~ V4, data = dat[,c('V4',nn)], FUN = mean)
There isn't much point in using cbind here, given that the formula approach works, but for example:
aggregate( do.call(cbind,lapply(nn, as.name)) ~ V4, data = dat, FUN = mean)
But this is messy, as it doesn't name the columns nicely (and it is hard to follow).
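The two formula variants can be compared on a small made-up table. In this sketch the data frame is hypothetical and only has columns V4 and V9 to V11, so the column positions differ from the 118-column file in the question.

```r
# Hypothetical miniature of the question's data: one grouping column, three value columns
dat <- data.frame(V4  = c("ExonIntron", "IntronExon", "ExonIntron", "IntronExon"),
                  V9  = c(0, 2, 4, 6),
                  V10 = c(1, 1, 3, 5),
                  V11 = c(10, 20, 30, 40))

# Select the value columns by position ...
by_position <- aggregate(. ~ V4, data = dat[, c(1, 2:4)], FUN = mean)

# ... or by generated names
nn <- paste0("V", 9:11)
by_name <- aggregate(. ~ V4, data = dat[, c("V4", nn)], FUN = mean)

print(by_name)
```

Both calls return one row per value of V4 with the mean of every remaining column, and the results are identical.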

If speed is an issue in general (it isn't needed for this operation) and you want to use the data.table package, this can be done as follows:
Safer solution
Following mnel's comment, I would use this:
library(data.table)
dat <- as.data.table(dat)
dat[,lapply(.SD,mean),by="V4",.SDcols=paste0("V", 9:118)]
Old solution
dat[,lapply(.SD,mean),by="V4",.SDcols=9:118]

You can use the data frame method of aggregate:
## S3 method for class 'data.frame'
aggregate(x, by, FUN, ..., simplify = TRUE)
With your data, assuming it is in the data frame DF (txt is assumed to hold the raw text of your data):
DF <- read.table(text = txt, header = FALSE, stringsAsFactors = FALSE)
result <- aggregate(DF[, 9:118], by = list(DF[, 4]), FUN = mean)
# Using pander to print result table nicely. It's not needed for aggregation :)
require(pander)
pandoc.table(result)
##
## ----------------------------------------------------
## Group.1 V9 V10 V11 V12 V13 V14
## ---------------- ----- ----- ----- ----- ----- -----
## 5'to3'ExonIntron 161.9 148 131 122.7 109.7 98.1
##
## 5'to3'IntronExon 0.0 0 0 0.0 0.0 0.0
## ----------------------------------------------------
##
## Table: Table continues below
##
##
## -----------------------------------------------
## V15 V16 V17 V18 V19 V20 V21 V22
## ----- ----- ----- ----- ----- ----- ----- -----
## 81.5 66.6 52.3 39.5 26.1 18.7 12.4 9.3
##
## 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
## -----------------------------------------------
##
## Table: Table continues below
##
##
## -----------------------------------------------
## V23 V24 V25 V26 V27 V28 V29 V30
## ----- ----- ----- ----- ----- ----- ----- -----
## 7.2 4.6 1.8 0.4 0 0 0 0.5
##
## 0.0 0.0 0.0 0.0 0 0 0 0.0
## -----------------------------------------------
##
## Table: Table continues below
##
##
## -----------------------------------------------
## V31 V32 V33 V34 V35 V36 V37 V38
## ----- ----- ----- ----- ----- ----- ----- -----
## 0.9 1.5 1.8 2.4 2.7 5 6.4 9.1
##
## 0.0 0.0 0.0 0.0 0.0 0 0.0 0.0
## -----------------------------------------------
##
## Table: Table continues below
##
##
## -----------------------------------------------
## V39 V40 V41 V42 V43 V44 V45 V46
## ----- ----- ----- ----- ----- ----- ----- -----
## 13 16.2 19.2 21.5 23 24.7 28 29.7
##
## 0 0.0 0.0 0.0 0 0.0 0 0.0
## -----------------------------------------------
##
## Table: Table continues below
##
##
## -----------------------------------------------
## V47 V48 V49 V50 V51 V52 V53 V54
## ----- ----- ----- ----- ----- ----- ----- -----
## 36.9 45.7 59.5 73.3 89.2 101.3 106.2 114
##
## 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0
## -----------------------------------------------
##
## Table: Table continues below
##
##
## -----------------------------------------------
## V55 V56 V57 V58 V59 V60 V61 V62
## ----- ----- ----- ----- ----- ----- ----- -----
## 127.3 134 140.7 148.1 156.2 160.4 167.4 175.7
##
## 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0
## -----------------------------------------------
##
## Table: Table continues below
##
##
## -----------------------------------------------
## V63 V64 V65 V66 V67 V68 V69 V70
## ----- ----- ----- ----- ----- ----- ----- -----
## 183.9 183.1 183.7 182.3 178.1 169 163.8 156.7
##
## 0.0 0.0 0.0 0.0 0.0 0 0.0 0.0
## -----------------------------------------------
##
## Table: Table continues below
##
##
## -----------------------------------------------
## V71 V72 V73 V74 V75 V76 V77 V78
## ----- ----- ----- ----- ----- ----- ----- -----
## 149.8 126.6 86.3 74.0 64.0 59.8 56.0 50.6
##
## 0.7 0.9 0.9 1.5 1.8 1.8 1.8 1.8
## -----------------------------------------------
##
## Table: Table continues below
##
##
## -----------------------------------------------
## V79 V80 V81 V82 V83 V84 V85 V86
## ----- ----- ----- ----- ----- ----- ----- -----
## 45.6 42.2 38.7 31.9 24.9 18.6 14.1 9.4
##
## 1.8 1.8 1.8 1.8 1.8 1.8 2.2 2.7
## -----------------------------------------------
##
## Table: Table continues below
##
##
## -----------------------------------------------
## V87 V88 V89 V90 V91 V92 V93 V94
## ----- ----- ----- ----- ----- ----- ----- -----
## 7.6 6.2 5.1 3.7 2.9 2.5 2.7 2.7
##
## 2.7 2.7 2.7 2.7 2.7 2.7 2.2 0.9
## -----------------------------------------------
##
## Table: Table continues below
##
##
## --------------------------------------------------
## V95 V96 V97 V98 V99 V100 V101 V102
## ----- ----- ----- ----- ----- ------ ------ ------
## 4.2 4.5 4.5 4.5 4.5 4.5 4.3 2.5
##
## 0.9 0.9 0.9 1.4 4.1 5.4 6.9 10.6
## --------------------------------------------------
##
## Table: Table continues below
##
##
## -------------------------------------------------------
## V103 V104 V105 V106 V107 V108 V109 V110
## ------ ------ ------ ------ ------ ------ ------ ------
## 1.8 1.8 1.8 1.8 1.8 1.8 1.8 1.8
##
## 13.7 18.4 30.2 40.4 53.3 63.0 77.3 90.3
## -------------------------------------------------------
##
## Table: Table continues below
##
##
## -------------------------------------------------------
## V111 V112 V113 V114 V115 V116 V117 V118
## ------ ------ ------ ------ ------ ------ ------ ------
## 1.8 1.8 1.8 1.8 1.4 0.5 0.0 0.0
##
## 104.5 117.3 134.3 147.8 164.8 181.0 189.4 199.2
## -------------------------------------------------------
##
