I am new to R2OpenBUGS, and its very enigmatic errors are quite frustrating.
I am trying to run a model that is quite simple; I have had success running similar models before.
Could my problems come from the fact that I have a 2-dimensional array (a matrix)?
I tried simplifying the model, without success.
Here are the errors:
model is syntactically correct
expected the collection operator c error pos 11
model compiled
expected a number or an NA error pos 1449
initial values generated, model initialized
model is updating
200 updates took 0 s
tau.0 is not a variable in the model
tau.1 is not a variable in the model
model is updating
****** Sorry something went wrong in procedure StdMonitor.Update in module DeviancePlugin ******
And here is the code I use:
rm(list = ls(all = TRUE))
cat("\014")
library(R2OpenBUGS)
rat.dat <- read.table("BigRatDat.txt", header = FALSE)
dose = data.matrix(rat.dat[1])
weight = data.matrix(rat.dat[3:13])
N <- length(dose)
cat("
model{
for(i in 1:50){
for(j in 1:11){
weight[i,j]~dnorm(mu[i,j],tau[i])
mu[i,j]<-b.0[i]+b.1[i]*j
}
b.0[i]~dnorm(mu.0[i],tau.0)
b.1[i]~dnorm(mu.1[i],tau.1)
mu.0[i] <-b.00+b.01*dose[i]
mu.1[i] <-b.00+b.01*dose[i]
tau[i]~dgamma(0.01,0.01)
dose[i]~dnorm(0,1)
}
b.00~dnorm(0,0.001)
b.01~dnorm(0,0.001)
b.10~dnorm(0,0.001)
b.11~dnorm(0,0.001)
tau.0~dgamma(0.01,0.01)
tau.1~dgamma(0.01,0.01)
}
",file="Rats2OpenBugs.txt")
data <- list("dose", "weight")
inits <- function(){
  b.0 <- rnorm(n = N, 0)
  b.1 <- rnorm(n = N, 0)
  b.00 <- rnorm(1, 0)
  b.01 <- rnorm(1, 0)
  b.10 <- rnorm(1, 0)
  b.11 <- rnorm(1, 0)
  tau <- rep(1, N)
  tau.0 <- 1
  tau.1 <- 1
  list(b.0 = b.0, b.1 = b.1, b.00 = b.00, b.01 = b.01, b.10 = b.10, b.11 = b.11,
       tau = tau, tau.0 = tau.0, tau.1 = tau.1)
}
params <- c("b.0", "b.1", "b.00", "b.01", "b.10", "b.11", "tau", "tau.0", "tau.1")
output.sim <- bugs(data, inits, params, model.file = "Rats2OpenBugs.txt",
                   n.chains = 1, n.iter = 5000, n.burnin = 200, n.thin = 1,
                   debug = TRUE)
The data file (BigRatDat.txt):
0 1 54 60 63 74 77 89 93 100 108 114 124
0 2 69 75 81 90 97 120 114 119 126 138 143
0 3 77 81 87 94 101 110 117 124 134 141 151
0 4 64 69 77 83 88 96 104 109 120 123 131
0 5 51 58 62 71 74 81 88 93 99 103 113
0 6 64 71 77 89 90 100 106 114 122 134 139
0 7 80 91 97 101 111 119 129 131 137 147 154
0 8 79 85 89 99 104 105 116 121 132 139 147
0 9 77 82 88 92 101 109 119 127 135 144 158
0 10 79 84 91 98 107 114 119 131 137 146 155
.5 1 62 71 75 79 87 91 100 105 111 121 124
.5 2 68 73 81 89 94 101 110 114 123 132 139
.5 3 94 102 109 110 128 133 147 151 153 171 184
.5 4 81 90 95 102 109 120 128 137 141 154 160
.5 5 64 69 72 76 84 89 97 103 108 114 124
.5 6 67 74 81 81 84 95 100 109 119 128 130
.5 7 73 80 86 89 97 101 110 116 117 135 141
.5 8 71 74 82 84 93 97 102 113 119 124 131
.5 9 69 74 79 89 94 100 107 113 124 134 139
.5 10 60 62 67 74 78 85 92 103 112 121 130
1 1 59 63 66 75 80 87 99 104 110 115 124
1 2 56 66 70 81 77 88 96 100 113 120 130
1 3 71 77 84 80 97 106 111 109 128 133 140
1 4 59 64 69 76 85 88 96 104 110 119 126
1 5 65 70 73 77 85 92 96 101 111 118 121
1 6 61 69 77 81 89 92 107 111 118 127 132
1 7 80 86 95 99 106 113 127 131 142 150 160
1 8 74 80 84 90 99 101 108 117 126 133 140
1 9 71 79 88 90 98 102 116 121 127 139 142
1 10 69 75 80 86 96 97 104 113 122 129 138
The problem was that I was trying to use a matrix with only one column as a vector. R has no problem with that, but it does not work when exporting the data to OpenBUGS: the program expects references to a matrix to have two indices (row and column).
I just had to replace:
dose = data.matrix(rat.dat[1])
with:
dose = unlist(as.vector(rat.dat[1]))
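The difference is easy to see (a quick sketch, assuming rat.dat is the data frame read above):
str(rat.dat[1])    # a one-column data.frame, so data.matrix() keeps it two-dimensional
str(rat.dat[[1]])  # a plain numeric vector, which is what OpenBUGS expects for dose
dose <- rat.dat[[1]]  # an equivalent, slightly shorter fix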
I was wondering if you could help me with this problem. I have a dataset of US counties for which I am trying to do a k-nearest-neighbour analysis for spatial weighting, following the method proposed here (section 4.5), but the results aren't making sense, or perhaps I'm not understanding them.
library(spdep)
library(tigris)
library(sf)
counties <- counties("Georgia", cb = TRUE)
coords <- st_centroid(st_geometry(counties), of_largest_polygon=TRUE)
col.knn <- knearneigh(coords)
gck4.nb <- knn2nb(knearneigh(coords, k=4, longlat=TRUE))
summary(gck4.nb, coords, longlat=TRUE, scale=0.5)
However, the output I'm getting with regard to the distances seems rather small, on the order of less than 1 km:
Neighbour list object:
Number of regions: 159
Number of nonzero links: 636
Percentage nonzero weights: 2.515723
Average number of links: 4
Non-symmetric neighbours list
Link number distribution:
4
159
159 least connected regions:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 with 4 links
159 most connected regions:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 with 4 links
Summary of link distances:
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.1355 0.2650 0.3085 0.3112 0.3482 0.6224
The decimal point is 1 digit(s) to the left of the |
1 | 44
1 | 7799999999999999
2 | 00000000000011111111112222222222222233333333333333333333333333444444
2 | 55555555555555555555555555556666666666666666666666666666666666667777+92
3 | 00000000000000000000000000000001111111111111111111111111111111111111+121
3 | 55555555555555555555555555555556666666666667777777777777777777777777+19
4 | 00000000000111111111112222222222223333333444
4 | 555667777999
5 | 0000014
5 | 7888
6 | 2
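One way to check whether those link distances are being reported in degrees rather than kilometres (a sketch, not a confirmed diagnosis) is to inspect the CRS and recompute the distances explicitly:
st_crs(counties)  # a geographic CRS means the centroid coordinates are in degrees
dists <- nbdists(gck4.nb, coords, longlat = TRUE)  # great-circle distances in km
summary(unlist(dists))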
I have a large(ish) data frame and I want to use dplyr's mutate() function (or a suitable alternative) to calculate the mean of selected columns.
For example, suppose I had a data frame as follows:
colnames(dall)
[1] "Code" "LA.Name" "LA_Name" "Jan.20" "Feb.20" "Mar.20" "Apr.20" "May.20" "Jun.20"
[10] "Jul.20" "Aug.20" "Sep.20" "Oct.20" "Nov.20" "Dec.20" "Jan.19" "Feb.19" "Mar.19"
[19] "Apr.19" "May.19" "Jun.19" "Jul.19" "Aug.19" "Sep.19" "Oct.19" "Nov.19" "Dec.19"
[28] "Jan.18" "Feb.18" "Mar.18" "Apr.18" "May.18" "Jun.18" "Jul.18" "Aug.18" "Sep.18"
[37] "Oct.18" "Nov.18" "Dec.18" "Jan.17" "Feb.17" "Mar.17" "Apr.17" "May.17" "Jun.17"
[46] "Jul.17" "Aug.17" "Sep.17" "Oct.17" "Nov.17" "Dec.17" "Jan.16" "Feb.16" "Mar.16"
[55] "Apr.16" "May.16" "Jun.16" "Jul.16" "Aug.16" "Sep.16" "Oct.16" "Nov.16" "Dec.16"
[64] "Jan.15" "Feb.15" "Mar.15" "Apr.15" "May.15" "Jun.15" "Jul.15" "Aug.15" "Sep.15"
[73] "Oct.15" "Nov.15" "Dec.15"
I'm trying to create a new column with the mean of January data from 2015 to 2019.
I have tried several methods; the latest is as follows:
mutate(dall, mJan15to19 = mean(Jan.15,Jan.16,Jan.17,Jan.18,Jan.19))
I get the following back:
Error in mean.default(Jan.15, Jan.16, Jan.17, Jan.18, Jan.19) :
'trim' must be numeric of length one
In addition: Warning message:
In if (na.rm) x <- x[!is.na(x)] :
the condition has length > 1 and only the first element will be used
The cells I'm trying to average contain numeric values.
Can you help?
UPDATE:
Tried:
head(dall) %>% mutate(new = rowMeans(select(., Jan.15:Jan.19)))
Returned the following:
Code LA.Name LA_Name Jan.20 Feb.20 Mar.20 Apr.20 May.20 Jun.20
1 E06000001 Hartlepool Hartlepool 108 76 89 NA NA NA
2 E06000002 Middlesbrough Middlesbrough 178 98 135 NA NA NA
3 E06000003 Redcar and Cleveland Redcar and Cleveland 150 148 126 NA NA NA
4 E06000004 Stockton-on-Tees Stockton-on-Tees 202 124 175 NA NA NA
5 E06000005 Darlington Darlington 137 90 116 NA NA NA
6 E06000006 Halton Halton 141 101 115 NA NA NA
Jul.20 Aug.20 Sep.20 Oct.20 Nov.20 Dec.20 Jan.19 Feb.19 Mar.19 Apr.19 May.19 Jun.19 Jul.19 Aug.19
1 NA NA NA NA NA NA 92 87 68 81 108 77 97 73
2 NA NA NA NA NA NA 144 116 126 113 123 100 113 118
3 NA NA NA NA NA NA 146 152 133 135 114 101 140 116
4 NA NA NA NA NA NA 192 166 160 133 157 126 136 149
5 NA NA NA NA NA NA 138 110 104 84 115 75 86 104
6 NA NA NA NA NA NA 114 95 83 92 97 88 98 83
Sep.19 Oct.19 Nov.19 Dec.19 Jan.18 Feb.18 Mar.18 Apr.18 May.18 Jun.18 Jul.18 Aug.18 Sep.18 Oct.18
1 69 87 85 99 126 89 97 97 77 65 64 61 76 71
2 117 127 119 121 204 117 112 132 129 106 96 115 103 111
3 108 139 134 145 225 152 135 114 122 116 113 108 113 154
4 136 177 159 173 256 171 189 142 146 149 142 144 128 179
5 77 95 96 119 127 125 98 98 104 76 77 84 79 109
6 91 106 102 121 170 106 114 93 102 93 83 111 91 93
Nov.18 Dec.18 Jan.17 Feb.17 Mar.17 Apr.17 May.17 Jun.17 Jul.17 Aug.17 Sep.17 Oct.17 Nov.17 Dec.17
1 94 97 116 83 101 76 85 86 52 80 85 88 98 94
2 108 121 151 137 131 111 112 114 127 112 113 120 150 151
3 113 129 171 126 158 104 120 134 122 119 107 145 126 134
4 152 174 177 166 176 129 157 148 141 148 168 143 142 186
5 84 100 103 110 105 88 101 89 73 92 87 96 102 86
6 115 96 117 95 115 94 99 105 93 110 110 86 89 84
Jan.16 Feb.16 Mar.16 Apr.16 May.16 Jun.16 Jul.16 Aug.16 Sep.16 Oct.16 Nov.16 Dec.16 Jan.15 Feb.15
1 79 97 90 92 82 87 75 74 74 79 68 93 138 99
2 116 143 138 131 139 95 107 107 102 121 125 142 166 144
3 129 132 147 141 137 137 115 108 115 127 135 124 179 144
4 159 176 171 191 146 169 160 128 161 143 159 161 263 169
5 105 113 85 92 87 92 74 78 91 85 88 86 149 78
6 113 98 108 117 90 99 92 107 101 93 123 111 162 105
Mar.15 Apr.15 May.15 Jun.15 Jul.15 Aug.15 Sep.15 Oct.15 Nov.15 Dec.15 new
1 109 69 82 85 71 65 74 82 81 112 85.89796
2 130 116 127 124 119 104 107 95 115 101 123.51020
3 129 142 136 125 114 108 120 117 108 140 131.61224
4 155 163 127 129 142 101 161 148 140 180 161.30612
5 105 102 78 90 112 91 83 109 97 96 96.34694
6 100 102 99 90 90 81 102 98 86 107 103.02041
I now have a new column, but the calculation is incorrect. I want an average of all of the 'Jan' columns except for 'Jan.20'.
Since you wanted a rowwise mean, this will work. (Your first attempt failed because mean() takes a single vector as its first argument, so the extra columns were being passed to its trim and na.rm arguments, hence the error. And select(., Jan.15:Jan.19) picks every column positioned between the two, including the February-December columns, which is why the calculated mean was wrong.)
dall$mJan15to19 = rowMeans(dall[,c("Jan.15","Jan.16","Jan.17","Jan.18","Jan.19")])
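If you would rather stay within dplyr, here is a sketch of an equivalent (assuming dplyr >= 1.0.0 for rowwise() with c_across()):
library(dplyr)
dall <- dall %>%
  rowwise() %>%
  mutate(mJan15to19 = mean(c_across(c(Jan.15, Jan.16, Jan.17, Jan.18, Jan.19)))) %>%
  ungroup()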
I am very new to R but have been asked by my professor to use it to analyze our data. Currently we are trying to conduct a changepoint analysis on a large set of data, which I know how to do. But first we want to place our data into time bins of 30 seconds. Our trials are 20 minutes long, so we should have a total of 40 bins. We have columns for time, Flow, and MAP, and we would like to take the values of Flow and MAP within each 30-second bin and average them. This will condense 1120-2000 data points into a much cleaner 40. We are having trouble binning the data and don't even know where to start; once the data are binned, we would like to generate a table of those new values (40 for MAP and 40 for Flow) so that we can use the changepoint package to find the changepoint in our set. We believe clip() could possibly be what we need.
Sorry if this is too confusing or too vague; we have no programming experience whatsoever.
Edit: I believe this is different from the bacteria question because I want a direct output into a table rather than interpolating from a graph and then into a table.
Here is a sample from our data:
RawMin Flow MAP
2.9982 51 77
3.0113 110 80
3.0240 84 77
3.0393 119 75
3.0551 93 75
3.0692 136 73
3.0839 81 73
3.0988 58 72
3.1138 125 71
3.1285 89 72
3.1432 160 73
3.1576 87 74
3.1714 128 74
3.1860 90 74
3.2015 63 76
3.2154 120 76
3.2293 65 76
3.2443 156 78
3.2585 66 78
3.2723 130 78
3.2876 89 77
3.3029 111 77
3.3171 90 75
3.3329 100 76
3.3482 127 76
3.3618 69 78
3.3751 155 78
3.3898 90 79
3.4041 127 80
3.4176 103 80
3.4325 87 79
3.4484 134 78
3.4637 57 77
3.4784 147 78
3.4937 75 78
3.5080 137 78
3.5203 123 78
3.5337 99 80
3.5476 170 80
3.5620 90 79
3.5756 164 78
3.5909 85 78
3.6061 164 77
3.6203 103 77
3.6348 140 79
3.6484 152 79
3.6611 79 80
3.6742 184 82
3.6872 128 81
3.7017 123 82
3.7152 176 81
3.7295 74 81
3.7436 153 80
3.7572 85 80
3.7708 115 79
3.7847 187 78
3.7980 105 78
3.8108 175 78
3.8252 124 79
3.8392 171 79
3.8528 127 78
3.8669 138 79
3.8811 198 79
3.8944 109 80
3.9080 171 80
3.9214 137 79
3.9341 109 81
3.9455 193 83
3.9575 108 85
3.9707 163 84
3.9853 136 82
4.0005 121 81
4.0164 164 79
4.0311 73 79
4.0450 171 78
4.0591 105 79
4.0716 117 79
4.0833 210 81
4.0940 103 85
4.1041 193 88
4.1152 163 84
4.1310 145 82
4.1486 126 79
4.1654 118 77
4.1811 130 75
4.1975 83 74
4.2127 176 73
4.2277 72 74
4.2424 177 74
4.2569 90 75
4.2705 148 76
4.2841 148 77
4.2986 123 77
4.3130 150 76
4.3280 71 77
4.3433 176 76
4.3583 90 76
4.3727 138 77
4.3874 136 79
4.4007 106 80
4.4133 167 83
4.4247 119 87
4.4360 123 88
4.4496 141 85
4.4673 117 84
4.4841 133 80
4.5005 83 79
4.5166 156 77
4.5324 97 77
4.5463 182 77
4.5605 110 79
4.5744 187 80
4.5882 121 81
4.6024 142 81
4.6171 178 81
4.6313 96 80
4.6452 180 80
4.6599 107 80
4.6741 151 79
4.6876 137 80
4.7009 132 82
4.7141 199 80
4.7279 91 81
4.7402 172 83
4.7531 172 80
4.7660 128 84
4.7785 197 83
4.7909 122 84
4.8046 129 84
4.8187 176 82
4.8328 102 81
4.8448 184 81
4.8556 145 83
4.8657 123 84
4.8768 138 86
4.8885 143 82
4.9040 135 81
4.9198 112 78
4.9362 134 77
4.9515 152 76
4.9651 83 76
4.9785 177 78
4.9912 114 79
5.0037 127 80
5.0167 200 81
5.0297 104 81
5.0429 175 81
5.0559 123 81
5.0685 106 81
5.0809 176 81
5.0937 113 82
5.1064 191 81
5.1181 178 79
5.1297 121 79
5.1404 176 80
5.1506 214 83
5.1606 132 85
5.1709 149 83
5.1829 175 80
5.1981 103 79
5.2128 169 76
5.2283 97 75
5.2431 149 74
5.2575 109 74
5.2709 97 74
5.2842 195 75
5.2975 104 75
5.3106 143 77
5.3231 185 76
5.3361 140 77
5.3487 132 78
5.3614 162 79
5.3750 98 78
5.3900 137 78
5.4047 108 76
5.4202 94 76
5.4341 186 75
5.4475 82 77
5.4608 157 80
5.4739 176 81
5.4867 90 83
5.4989 123 86
Assuming RawMin is time in minutes, you could do something like this...
df2 <- aggregate(df, #the data frame
by=list(cut(df$RawMin,seq(0,10,0.5))), #the bins (see below)
mean) #the aggregating function
df2
Group.1 RawMin Flow MAP
1 (2.5,3] 2.998200 51.0000 77.00000
2 (3,3.5] 3.251682 103.5588 76.20588
3 (3.5,4] 3.748994 135.9722 79.75000
4 (4,4.5] 4.240434 132.0857 79.25714
5 (4.5,5] 4.749781 140.1892 80.43243
6 (5,5.5] 5.246556 140.9231 78.89744
Binning is done with the cut function - here by 0.5 minute intervals between 0 and 10, which you might want to change. The bin names are the intervals - e.g. (2.5,3] means greater than 2.5, less than or equal to 3.
If you don't want RawMin included in the output, just use df[,-1] in the input to aggregate.
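From there, a sketch of feeding the binned means into the changepoint package (assuming it is installed and that its default single-changepoint method suits your data):
library(changepoint)
cpt.flow <- cpt.mean(df2$Flow)  # changepoint in the mean of binned Flow (default AMOC method)
cpt.map  <- cpt.mean(df2$MAP)
cpts(cpt.flow)  # bin index of the detected changepoint
plot(cpt.flow)  # binned series with fitted segment means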
Consider the following two code snippets.
A:
download.file("https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FGDP.csv", destfile = "./data/gdp.csv", method = "curl" )
gdp <- read.csv('./data/gdp.csv', header=F, skip=5, nrows=190) # Specify nrows, get correct answer
download.file("https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FEDSTATS_Country.csv", destfile = "./data/education.csv", method = "curl" )
education = read.csv('./data/education.csv')
mergedData <- merge(gdp, education, by.x='V1', by.y='CountryCode')
# No need to remove unranked countries because we specified nrows
# No need to convert V2 from factor to numeric
sortedMergedData = arrange(mergedData, -V2)
sortedMergedData[13,1] # Get KNA, correct answer
B:
download.file("https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FGDP.csv", destfile = "./data/gdp.csv", method = "curl" )
gdp <- read.csv('./data/gdp.csv', header=F, skip=5) # Don't specify nrows, get incorrect answer
download.file("https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FEDSTATS_Country.csv", destfile = "./data/education.csv", method = "curl" )
education = read.csv('./data/education.csv')
mergedData <- merge(gdp, education, by.x='V1', by.y='CountryCode')
mergedData = mergedData[which(mergedData$V2 != ""),] # Remove unranked countries
mergedData$V2 = as.numeric(mergedData$V2) # make V2 a numeric column
sortedMergedData = arrange(mergedData, -V2)
sortedMergedData[13,1] # Get SRB, incorrect answer
I would think the two code snippets would be identical, except that in A you never add the unranked countries to your dataframe and in B you add them but then remove them. Why is the sorting different for these two code snippets?
The file downloads are from Coursera's Getting and Cleaning Data class (Quiz 3, Question 3).
Edit: To avoid security concerns, I've pasted the raw .csv files below
gdp.csv - http://pastebin.com/raw.php?i=4aRZwBRd
education.csv - http://pastebin.com/raw.php?i=0pbhDCSX
Edit2: The problem is occurring in the as.numeric step. For case B, here is mergedData$V2 before and after mergedData$V2 = as.numeric(mergedData$V2) is applied:
> mergedData$V2
[1] 161 105 60 125 32 26 133 172 12 27 68 162 25 140 128 59 76 93
[19] 138 111 69 169 149 96 7 153 113 167 117 165 11 20 36 2 99 98
[37] 121 30 182 166 81 67 102 51 4 183 33 72 48 64 38 159 13 103
[55] 85 43 155 5 185 109 6 114 86 148 175 176 110 42 178 77 160 37
[73] 108 71 139 58 16 10 46 22 47 122 40 9 116 92 3 50 87 145
[91] 120 189 178 15 146 56 136 83 168 171 70 163 84 74 94 82 62 147
[109] 141 132 164 14 188 135 129 137 151 130 118 154 127 152 34 123 144 39
[127] 126 18 23 107 55 66 44 89 49 41 187 115 24 61 45 97 54 52
[145] 8 142 19 73 119 35 174 157 100 88 186 150 63 80 21 158 173 65
[163] 124 156 31 143 91 170 184 101 79 17 190 95 106 53 78 1 75 180
[181] 29 57 177 181 90 28 112 104 134
194 Levels: .. Not available. 1 10 100 101 102 103 104 105 106 107 ... Note: Rankings include only those economies with confirmed GDP estimates. Figures in italics are for 2011 or 2010.
> mergedData$V2 = as.numeric(mergedData$V2)
> mergedData$V2
[1] 72 10 149 32 118 111 41 84 26 112 157 73 110 49 35 147 166 185
[19] 46 17 158 80 58 188 159 63 19 78 23 76 15 105 122 104 191 190
[37] 28 116 94 77 172 156 7 139 126 95 119 162 135 153 124 69 37 8
[55] 176 130 65 137 97 14 148 20 177 57 87 88 16 129 90 167 71 123
[73] 13 161 47 146 70 4 133 107 134 29 127 181 22 184 115 138 178 54
[91] 27 101 90 59 55 144 44 174 79 83 160 74 175 164 186 173 151 56
[109] 50 40 75 48 100 43 36 45 61 38 24 64 34 62 120 30 53 125
[127] 33 91 108 12 143 155 131 180 136 128 99 21 109 150 132 189 142 140
[145] 170 51 102 163 25 121 86 67 5 179 98 60 152 171 106 68 85 154
[163] 31 66 117 52 183 82 96 6 169 81 103 187 11 141 168 3 165 92
[181] 114 145 89 93 182 113 18 9 42
Can anyone explain why the numbers change when I apply as.numeric()?
The real reason for the different results lies in the second case: the full dataset has some footer notes, which read.csv also read, causing most of the columns to be of class 'factor' because of the character elements in the footer. This could have been avoided either by
limiting the rows read with the nrows argument in read.csv (as snippet A does with nrows=190), so the footer is never read, or by
using stringsAsFactors=FALSE in the read.csv call along with skipping those lines.
The numbers changed because as.numeric applied to a factor returns the internal integer codes of its levels, and the levels are ordered as character strings (so '10' sorts before '2').
If you have already read the files without skipping the lines, convert the columns to their proper classes. If it is a 'numeric' column, convert it with as.numeric(as.character(df$column)) or as.numeric(levels(df$column))[df$column].
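A self-contained illustration of the pitfall (toy values, not from the GDP data):
x <- factor(c("10", "2", "1"))
levels(x)                    # "1" "10" "2" -- levels sort as strings, so "10" < "2"
as.numeric(x)                # 2 3 1       -- the internal level codes, not the values
as.numeric(as.character(x))  # 10 2 1      -- the actual numbers
as.numeric(levels(x))[x]     # 10 2 1      -- equivalent, parses each level only once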
I have a problem reading a .txt file into R.
The data is something like this:
68 89 103 1
37 8 103 9
78 93 8 12
3 50
I used readLines() in R and got one character string per line. But when I compare the result to the raw data, I find that, for example, the last "1" in the first line is not really 1: it should be connected to the second line, making the number 137 instead of 1 and 37. I think this data is split by " ", and readLines() severs the numbers that wrap across line ends. How can I read it correctly?
Also, the number 9 is not connected to 78, since there is a space at the beginning of line 3; but 12 is connected with 3 to form 123, since there is no space before the 3.
Thanks. I don't even know how to search for this problem on Google; I don't know how to express it.
182 63 68 152 130 134 145 152 98 152 182 88 95 105 130 137 167 152 81 71 84 126 134 152 116 130 91 63 68 84 95 152 105 152 63
102 152 63 77 112 140 77 119 152 161 167 105 112 145 161 182 152 81 95 84 91 102 108 130 134 91
1 2 1 4 3 6 1 1 5 2 1 5 2 3 4 5 5 1 2 6 1
63 102 119 161 161 172 179 88 91 95 105 112 119 119 137 145 167 172 91 98 108 112 134 137 161 161 179 71 174 95 105 134 134 1
37 140 145 150 150 68 68 130 137 77 95 112 137 161 174 81 84 126 134 161 161 174 68 77 98 102 102 102 112 88 88 91 98 112 134
134 137 137 140 140 152 152 77 179 112 71 71 74 77 112 116 116 140 140 167 77 95 126 150 88 126 130 130 134 63 74 84 84 88 9
1 95 108 134 137 179 81 88 105 116 123 140 145 152 161 161 179 88 95 112 119 126 126 150 157 179 68 68 84 102 105 119 123 123
137 161 179 182 140 152 182 182 81 63 88 134 84 134 182
7 11 9 2 9 4 6 7 6 1 13 2 1 10 4 5 11 11 9 12 1 3 1 3 3
Basically, what I am doing now is this. For example, the vector
ind <- c(7, 11, 9, 2, 9, 4, 6, 7, 6, 1, 13, 2, 1, 10, 4, 5, 11, 11, 9, 12, 1, 3, 1, 3, 3)
indicates that the block of numbers above should be split into groups whose lengths are given by the vector. I know I can split a vector x that way with
split(x, rep(seq_along(ind), ind))
However, the problem is that I can't read the block of numbers correctly in the first place.
Based on the conditions you described: after you read the file with readLines, if a line starts with a space, then the last number of the previous line should be joined with the first number of the current line.
Using your second example (I didn't understand the ind part, though):
lines1 <- readLines(n=10)  # paste the lines below at the console; n is the number of lines to read
182 63 68 152 130 134 145 152 98 152 182 88 95 105 130 137 167 152 81 71 84 126 134 152 116 130 91 63 68 84 95 152 105 152 63
102 152 63 77 112 140 77 119 152 161 167 105 112 145 161 182 152 81 95 84 91 102 108 130 134 91
1 2 1 4 3 6 1 1 5 2 1 5 2 3 4 5 5 1 2 6 1
63 102 119 161 161 172 179 88 91 95 105 112 119 119 137 145 167 172 91 98 108 112 134 137 161 161 179 71 174 95 105 134 134 1
37 140 145 150 150 68 68 130 137 77 95 112 137 161 174 81 84 126 134 161 161 174 68 77 98 102 102 102 112 88 88 91 98 112 134
134 137 137 140 140 152 152 77 179 112 71 71 74 77 112 116 116 140 140 167 77 95 126 150 88 126 130 130 134 63 74 84 84 88 9
1 95 108 134 137 179 81 88 105 116 123 140 145 152 161 161 179 88 95 112 119 126 126 150 157 179 68 68 84 102 105 119 123 123
137 161 179 182 140 152 182 182 81 63 88 134 84 134 182
lines2 <- lines1[lines1 != '']  # remove blank lines
indx <- grep("^ ", lines2)      # index of the lines that start with a space
indx1 <- indx - 1               # index of the line immediately before each of those
lines2[indx1] <- paste0(lines2[indx1], gsub("^\\s+", "", lines2[indx]))  # join each flagged line onto the previous one
lines3 <- lines2[-indx]         # drop the lines that were merged in
lines3
#[1] "182 63 68 152 130 134 145 152 98 152 182 88 95 105 130 137 167 152 81 71 84 126 134 152 116 130 91 63 68 84 95 152 105 152 63102 152 63 77 112 140 77 119 152 161 167 105 112 145 161 182 152 81 95 84 91 102 108 130 134 91"
#[2] "1 2 1 4 3 6 1 1 5 2 1 5 2 3 4 5 5 1 2 6 1"
#[3] "63 102 119 161 161 172 179 88 91 95 105 112 119 119 137 145 167 172 91 98 108 112 134 137 161 161 179 71 174 95 105 134 134 1"
#[4] "37 140 145 150 150 68 68 130 137 77 95 112 137 161 174 81 84 126 134 161 161 174 68 77 98 102 102 102 112 88 88 91 98 112 134134 137 137 140 140 152 152 77 179 112 71 71 74 77 112 116 116 140 140 167 77 95 126 150 88 126 130 130 134 63 74 84 84 88 9"
#[5] "1 95 108 134 137 179 81 88 105 116 123 140 145 152 161 161 179 88 95 112 119 126 126 150 157 179 68 68 84 102 105 119 123 123137 161 179 182 140 152 182 182 81 63 88 134 84 134 182"
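From here, each element of lines3 can be parsed into numbers and then split into groups of the lengths given by ind (a sketch; matching the right ind vector to the right line is left to you):
nums <- as.numeric(strsplit(lines3[1], "\\s+")[[1]])  # parse one merged line into numbers
split(nums, rep(seq_along(ind), ind))                 # groups of the lengths given in ind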