Reading unknown file type with strange entries into R - r

I am completely new at this and here, so please have mercy.
I want to open an ASCII data file in R.
After several different attempts, I have tried df=read.csv("C:MyDirectory" ,header=FALSE, sep="").
This has produced a table with several variables, but some rows clearly contain the wrong information, some cells are blank, some contain NA values.
Any ideas what has gone wrong? I got the file from an official Spanish research institute:
http://www.cis.es/cis/opencm/ES/2_bancodatos/estudios/listaTematico.jsp?tema=1&todos=si
Then go to BARÓMETRO DE OCTUBRE 2017; to the right is a small link entitled "fichero de datos", which lets you download the file after providing them with some info. The file giving the trouble is DA3191. If anyone could go to the trouble of helping me with this, it would be awesome. Thank you.

Part 1
This looks like a fixed-width format, so you need read.fwf instead of read.csv and friends. I made a screenshot of an almost random place in that file: my hypothesis is that the 99s and 98s etc. are missing-data codes, so the first 99 marked in yellow would belong to the same column as 4, 2, 0, etc., and the immediately following 99 (not marked) is in the same column as 0, 5, 7, etc.
Part 2
And then look at the file ES3191 -- this looks like SPSS code (pardon my French!) containing the rules about reading in the data file. You can probably figure out the width of each column and what's in there from that file:
DATA LIST FILE= 'DA3191'
/ESTU 1-4 CUES 5-9 CCAA 10-11 PROV 12-13 MUN 14-16 TAMUNI 17 CAPITAL 18 DISTR 19-20 SECCION 21-23
ENTREV 24-27 P0 28 P0A 29-31 P1 32 P2 33 P3 34 P4 35 P5 36 P6 37 P701 38-39 P702 40-41 P703 42-43
P801 44-45 P802 46-47 P803 48-49 P901 50-51 P902 52-53 P903 54-55 P904 56-57 P905 58-59 P906 60-61
P907 62-63 P1001 64 P1002 65 P1003 66 P1101 67 P1102 68 P1103 69 P1104 70 P1201 71 P1202 72
P1203 73 P1204 74 P1205 75 P1206 76 P1207 77 P1208 78 P1209 79 P13 80-81 P13A 82-83 P1401 84-85
P1402 86-87 P1403 88-89 P1404 90-91 P1405 92-93 P1406 94-95 P1407 96-97 P1408 98-99 P1409 100-101
P1410 102-103 P1411 104-105 P1412 106-107 P1413 108-109 P1414 110-111 P1415 112-113 P1416 114-115
I'm not an SPSS expert but I would guess that what it is trying to tell us is that
columns 1-4 contain the variable "ESTU"
columns 5-9 contain the variable "CUES"
etc
For read.fwf you have to calculate each variable's "width", i.e. 4 characters for ESTU (if my reading was right), 5 characters for CUES, etc.
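To make the width arithmetic concrete, here is a minimal sketch on two made-up records laid out like the first five fields of the DATA LIST (ESTU 1-4, CUES 5-9, CCAA 10-11, PROV 12-13, MUN 14-16). The record values are invented for illustration and are not taken from DA3191.

```r
# Two invented 16-character records: ESTU | CUES | CCAA | PROV | MUN
lines <- c("3191000011601059",
           "3191000021601059")

# width = end column - start column + 1
widths <- c(4, 5, 2, 2, 3)

# read.fwf accepts a connection, so no file on disk is needed here
demo <- read.fwf(textConnection(lines), widths = widths,
                 col.names = c("ESTU", "CUES", "CCAA", "PROV", "MUN"))
print(demo)
```

Leading zeros disappear because read.fwf converts the fields to numbers by default; passing colClasses = "character" would keep them as text.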
Part 3
Using the guesses above, I used the following code to read in your data, and it looks like it works:
# this is copy/pasted SPSS code from file "ES3191"
txt <- "ESTU 1-4 CUES 5-9 CCAA 10-11 PROV 12-13 MUN 14-16 TAMUNI 17 CAPITAL 18 DISTR 19-20 SECCION 21-23
ENTREV 24-27 P0 28 P0A 29-31 P1 32 P2 33 P3 34 P4 35 P5 36 P6 37 P701 38-39 P702 40-41 P703 42-43
P801 44-45 P802 46-47 P803 48-49 P901 50-51 P902 52-53 P903 54-55 P904 56-57 P905 58-59 P906 60-61
P907 62-63 P1001 64 P1002 65 P1003 66 P1101 67 P1102 68 P1103 69 P1104 70 P1201 71 P1202 72
P1203 73 P1204 74 P1205 75 P1206 76 P1207 77 P1208 78 P1209 79 P13 80-81 P13A 82-83 P1401 84-85
P1402 86-87 P1403 88-89 P1404 90-91 P1405 92-93 P1406 94-95 P1407 96-97 P1408 98-99 P1409 100-101
P1410 102-103 P1411 104-105 P1412 106-107 P1413 108-109 P1414 110-111 P1415 112-113 P1416 114-115
P1501 116-117 P1502 118-119 P1503 120-121 P1504 122-123 P1505 124-125 P1506 126-127 P1507 128-129
P1508 130-131 P1509 132-133 P1510 134-135 P1511 136-137 P1512 138-139 P1513 140-141 P1514 142-143
P1515 144-145 P1516 146-147 P16 148 P17 149 P1801 150-151 P1802 152-153 P1803 154-155 P1804 156-157
P1805 158-159 P1806 160-161 P1807 162-163 P1808 164-165 P1809 166-167 P1810 168-169 P1811 170-171
P1812 172-173 P1813 174-175 P19 176 P20 177 P21 178-179 P22 180-181 P23 182-183 P2401 184-185
P2402 186-187 P2403 188-189 P2404 190-191 P2405 192-193 P2406 194-195 P2407 196-197 P2408 198-199
P2409 200-201 P2410 202-203 P2411 204-205 P2412 206-207 P2413 208-209 P2414 210-211 P2415 212-213
P2416 214-215 P25 216 P26 217 P27 218 P27A 219-220 P28 221-222 P29 223 P30 224-225 P31 226 P31A 227-228
P32 229 P32A 230 P33 231 P34 232 P35 233 P35A 234 P36 235 P37 236 P37A 237 P37B 238 P38 239-241
P39 242 P39A 243 P40 244-246 P41 247-248 P42 249-250 P43 251 P43A 252 P43B 253 P44 254 P4501 255
P4502 256 P4503 257 P4504 258 P4601 259-261(A) P4602 262-264(A) P4603 265-267(A) P4604 268-270(A)
P4605 271-273(A) P4701 274-276(A) P4702 277-279(A) P4703 280-282(A) P4704 283-285(A) P4705 286-288(A)
P48 289 P49 290 P50 291 P51 292 I1 293-295 I2 296-298 I3 299-301 I4 302-304 I5 305-307 I6 308-310
I7 311-313 I8 314-316 I9 317-319 E101 320-321 E102 322-323 E103 324-325 E2 326 E3 327-329 E4 330
C1 331 C1A 332-333 C2 334 C2A 335 C2B 336-337 C3 338 C4 339-340 P21R 341-342 P22R 343-344 VOTOSIMG 345-346
P27AR 347-348 RECUERDO 349-350 ESTUDIOS 351 OCUMAR11 352-353 RAMA09 354 CONDICION11 355-356
ESTATUS 357 "
# making a 2-column matrix (name = left column, position = right column)
m <- matrix(scan(text=txt, what=""), ncol=2, byrow=TRUE)
m <- as.data.frame(m, stringsAsFactors=FALSE)
names(m) <- c("Var", "Pos")
pos <- sub("(A)", "", m$Pos, fixed = TRUE) # some entries carry '(A)', which in an SPSS DATA LIST marks a string (alphanumeric) variable; strip it to keep only the positions
pos <- strsplit(pos, "-")
starts <- as.numeric(sapply(pos, head, 1)) # first element of each range (start column)
ends <- as.numeric(sapply(pos, tail, 1)) # last element of each range (end column; same as start for one-column fields)
w <- ends - starts +1
MyData <- read.fwf("R/MD3191/DA3191", widths = w)
names(MyData) <- m$Var
head(MyData)
# ESTU CUES CCAA PROV MUN TAMUNI CAPITAL DISTR SECCION ENTREV P0 P0A P1 P2 P3 P4 P5 P6
# 1 3191 1 16 1 59 5 1 0 0 0 1 0 3 2 2 5 1 2
# 2 3191 2 16 1 59 5 1 0 0 0 1 0 4 2 3 5 2 3
# 3 3191 3 16 1 59 5 1 0 0 0 1 0 4 2 2 4 2 2
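If the hypothesis from Part 1 holds and 98/99 really are missing-data codes, they could be turned into NA after reading. The sketch below uses an invented two-column data frame. Both the list of codes and the choice of columns are assumptions that should be checked against the study's codebook, because 98 or 99 may be a legitimate value in other columns.

```r
# Invented stand-in for two of the two-character question columns
MyData <- data.frame(P701 = c(4, 99, 2),
                     P702 = c(0, 98, 5))

missing_codes <- c(98, 99)        # assumption: codes meaning "no answer"
cols <- c("P701", "P702")         # assumption: columns where those codes apply

# Replace the codes with NA only in the chosen columns
MyData[cols] <- lapply(MyData[cols], function(col) {
  replace(col, col %in% missing_codes, NA)
})
print(MyData)
```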

Related

Using a for loop with ggplot2 to plot multiple graphs within a data frame

I was wondering if somebody could help with a problem I am having in R with a for loop using ggplot2. I have carried out some clustering to find patterns in data that change over time. There are 38 clusters in total, each with its own graph. The clustering output puts all 38 graphs side by side, which is nice for visualisation.
But I want to zoom in on individual graphs for presentation and a clearer view of each pattern. This is easy manually; however, writing 38 versions of the same script with just a different cluster in each one is very tedious, so I would like a for loop that does it in one quick chunk of code. I have written this code (with some help online), but I am unable to get the output of the 38 individual graphs. The code itself works in that I can specify one cluster and get the plot for that cluster, but I want code that will create the plots for all 38 clusters.
The code I am using is as follows:
The data frame is called dfllgc, within which dfllgc$cluster contains information on the individual clusters. The for loop I am attempting is as follows but does not work. Any help would really really be appreciated!
for(cluster in dfllgc$cluster){
df<-subset(dataframAMIRllgc,cluster == 1:38)
df$Time_point<-factor(df.s$Time_point, levels = c("p3", "p15", "p30","p60"))
g<-ggplot(df, aes(x=Time_point, y=abundance, group=llgc, colour=llgc))+
geom_line(size=1.5)+
geom_point(size=4)+
ggtitle("Cluster 29: Patterns over time (5 genes) \n") +
xlab("\nAge") + ylab("Expression(CPM)\n")
print(g) }
Changing df<-subset(dataframAMIRllgc,cluster == 1:38) to == 1, or 15 etc, or any other cluster does indeed produce that one cluster, but not all 38 with 1:38.
Finally, with the title (ggtitle), is there a way to automate also the titles such that I can have a template, but that the cluster number as well as number of genes are automatically applied to the correct clusters?
Thank you so much! Any help would be much appreciated :)
example data
merge cluster Time_point llgc abundance
1 High[26-50%]p15 1 p15 High[26-50%] 166.5400335
38 High[26-50%]p3 1 p3 High[26-50%] 255.5007952
75 High[26-50%]p30 1 p30 High[26-50%] 122.1110473
112 High[26-50%]p60 1 p60 High[26-50%] 78.84340532
149 Low[0-10%]p15 1 p15 Low[0-10%] 86.40962037
186 Low[0-10%]p3 1 p3 Low[0-10%] 205.9750297
223 Low[0-10%]p30 1 p30 Low[0-10%] 60.23843127
260 Low[0-10%]p60 1 p60 Low[0-10%] 56.64259547
297 Medium[11-25%]p15 1 p15 Medium[11-25%] 165.2372227
334 Medium[11-25%]p3 1 p3 Medium[11-25%] 223.3891249
371 Medium[11-25%]p30 1 p30 Medium[11-25%] 155.1325448
408 Medium[11-25%]p60 1 p60 Medium[11-25%] 176.8285175
2 High[26-50%]p15 2 p15 High[26-50%] 85.21789981
39 High[26-50%]p3 2 p3 High[26-50%] 211.5359752
76 High[26-50%]p30 2 p30 High[26-50%] 35.7475454
113 High[26-50%]p60 2 p60 High[26-50%] 12.87995477
150 Low[0-10%]p15 2 p15 Low[0-10%] 77.20608808
187 Low[0-10%]p3 2 p3 Low[0-10%] 43.04550979
224 Low[0-10%]p30 2 p30 Low[0-10%] 34.88976766
261 Low[0-10%]p60 2 p60 Low[0-10%] 9.791146582
298 Medium[11-25%]p15 2 p15 Medium[11-25%] 46.21377697
335 Medium[11-25%]p3 2 p3 Medium[11-25%] 34.89603178
372 Medium[11-25%]p30 2 p30 Medium[11-25%] 14.18668175
409 Medium[11-25%]p60 2 p60 Medium[11-25%] 7.360330065
3 High[26-50%]p15 3 p15 High[26-50%] 47.75793997
40 High[26-50%]p3 3 p3 High[26-50%] 62.3529071
77 High[26-50%]p30 3 p30 High[26-50%] 17.8348889
114 High[26-50%]p60 3 p60 High[26-50%] 14.26366778
151 Low[0-10%]p15 3 p15 Low[0-10%] 138.1451371
188 Low[0-10%]p3 3 p3 Low[0-10%] 185.1184602
225 Low[0-10%]p30 3 p30 Low[0-10%] 63.52332626
262 Low[0-10%]p60 3 p60 Low[0-10%] 39.40566363
299 Medium[11-25%]p15 3 p15 Medium[11-25%] 26.32551336
336 Medium[11-25%]p3 3 p3 Medium[11-25%] 49.72067928
373 Medium[11-25%]p30 3 p30 Medium[11-25%] 8.288553629
410 Medium[11-25%]p60 3 p60 Medium[11-25%] 5.385031193
I'm not sure I 100% understand what you are trying to do but I think there is a problem with your subset and then you need to add a save function to the end. Hopefully this does what you want:
dfllgc$Time_point<-factor(dfllgc$Time_point, levels = c("p3", "p15", "p30","p60"))
for(cluster in unique(dfllgc$cluster)) {
g<-ggplot( dfllgc[ dfllgc$cluster == cluster, ],
aes(x=Time_point, y=abundance, group=llgc, colour=llgc)) +
geom_line(size=1.5) +
geom_point(size=4) +
ggtitle( paste0("Cluster ", cluster,": Patterns over time (5 genes)") ) +
xlab("Age") + ylab("Expression(CPM)")
ggsave(paste0("Cluster_", cluster,".png"), g)
}
Changes made:
Removed the subset line and moved the cluster filter into the ggplot call (it could just as easily be a separate step).
Moved the factor conversion outside the for loop so it is applied only once.
Set the title and file name to change with each cluster.
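The reason cluster == 1:38 did not work is that == compares position by position and recycles the shorter vector; it does not test membership. The base R sketch below shows this on made-up data, and also shows one way the title could pick up a per-cluster gene count. The n_genes lookup is hypothetical, since the example data has no gene column to count from.

```r
# Made-up cluster labels, two rows per cluster (illustrative only)
toy <- data.frame(cluster = c(1, 1, 2, 2, 3, 3),
                  abundance = c(10, 12, 20, 22, 30, 32))

# '==' recycles 1:3 along the column, so it tests
# 1==1, 1==2, 2==3, 2==1, 3==2, 3==3 rather than "is this row in clusters 1 to 3"
recycled <- toy$cluster == 1:3

# Hypothetical gene counts per cluster, keyed by cluster number
n_genes <- c("1" = 5L, "2" = 12L, "3" = 7L)

titles <- character(0)
for (k in unique(toy$cluster)) {
  one_cluster <- toy[toy$cluster == k, ]   # a single value on the right-hand side
  titles[as.character(k)] <- sprintf("Cluster %s: Patterns over time (%d genes)",
                                     k, n_genes[[as.character(k)]])
}
print(recycled)
print(titles)
```

Inside the real loop, the string built by sprintf would be passed to ggtitle in place of the hard-coded "5 genes".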

How to use arguments specified in a user-created R function?

This seems like a basic question, but I have not been able to word it well enough to search for the answer I need.
This is the sample:
id2 sbp1 dbp1 age1 sbp2 dbp2 sex bmi1 bmi2 smoke drink exercise
1 1 134.5 89.5 40 146 84 2 21.74685 22.19658 1 0 1
2 4 128.5 89.5 48 125 70 1 24.61942 22.29476 1 0 0
3 5 105.5 64.5 42 121 80 2 22.15103 26.90204 1 0 0
4 8 116.5 79.5 39 107 72 2 21.08032 27.64403 0 0 1
5 9 106.5 73.5 26 132 81 2 21.26762 29.16131 0 0 0
6 10 120.5 81.5 34 130 85 1 24.91663 26.89427 1 1 0
I have this code here for a function I am making:
linreg.ols<- function(indat, dv, p1, p2, p3){
data<- read.csv(file= indat, header=T)
data[1:5,]
y<- data$dv
x <- as.matrix(data.frame(x0=rep(1,nrow(data)), x1=data$p1, x2=data$p2,
x3=data$p3))
inv<- solve(t(x)%*%x)
xy<- t(x)%*%y
betah<- inv%*%xy
print("Value of beta hat")
betah
}
And when I run my code with this line:
linreg.ols("bp.csv",sbp1,smoke,drink,exercise)
I get the following error:
Error in data.frame(x0 = rep(1, nrow(data)), x1 = data$p1, x2 = data$p2, :
arguments imply differing number of rows: 75, 0
I have a feeling that it's because of how I am extracting the p1, p2, and p3 columns on the line where I create the x variable.
EDIT: changed to y<-data$dv
EDIT: added on part of the sample. Also, I tried:
x <- as.matrix(data.frame(1,data[,c("p1","p2","p3")]))
But that returned the error:
Error in `[.data.frame`(data, , c("p1", "p2", "p3")) : undefined columns selected
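Both errors have the same cause: data$dv and data[, "p1"] look for columns literally named dv and p1, not for the columns named by the arguments. A minimal sketch of one way around this is to pass the column names as strings and index with [[ ]] or [ , ]. The function name linreg_ols, the predictors argument, and the use of an in-memory data frame instead of bp.csv are all choices made for this illustration; the rows are the six sample rows shown above.

```r
# Sketch: column names arrive as character strings
linreg_ols <- function(data, dv, predictors) {
  y <- data[[dv]]                                   # column whose name is stored in dv
  x <- cbind(x0 = 1, as.matrix(data[, predictors, drop = FALSE]))
  solve(t(x) %*% x) %*% t(x) %*% y                  # OLS estimate (X'X)^-1 X'y
}

# The six sample rows from the question (only the columns used here)
bp <- data.frame(sbp1     = c(134.5, 128.5, 105.5, 116.5, 106.5, 120.5),
                 smoke    = c(1, 1, 1, 0, 0, 1),
                 drink    = c(0, 0, 0, 0, 0, 1),
                 exercise = c(1, 0, 0, 1, 0, 0))

betah <- linreg_ols(bp, "sbp1", c("smoke", "drink", "exercise"))
print(betah)
```

The call then becomes linreg_ols(data, "sbp1", c("smoke", "drink", "exercise")), with quoted names. Keeping unquoted names would need non-standard evaluation, which is more involved.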

Store values in a cell dataframe

I am trying to store data in multiple cells of a data frame, but my code only stores data in the last cell (the last element of dd). Please see my output below.
Can somebody please correct me? Cannot figure out what I am doing wrong.
Thanks in advance,
MyData <- read.csv(file="Pat_AR_035.csv", header=TRUE, sep=",")
dd <- unique(MyData$POLICY_NUM)
for (j in length(dd)) {
myDF <- data.frame(i=1:length(dd), m=I(vector('list', length(dd))))
myDF$m[[j]] <- data.frame(j,MyData[which(MyData$POLICY_NUM==dd[j] & MyData$ACRES), ],ncol(MyData),nrow(MyData))
}
[[60]]
NULL
[[61]]
NULL
[[62]]
NULL
[[63]]
j OBJECTID DIVISION POLICY_SYM POLICY_NUM YIELD_ID LINE_ID RH_CLU_ID ACRES PLANT_DATE ACRE_TYPE CLU_DETERM STATE COUNTY FARM_SERIA TRACT
1646 63 1646 8 MP 754033 3 20 39565604 8.56 5/3/2014 PL A 3 35 109 852
1647 63 1647 8 MP 754033 1 10 39565605 30.07 4/19/2014 PL A 3 35 109 852
1648 63 1648 8 MP 754033 1 10 39565606 56.59 4/19/2014 PL A 3 35 109 852
CLU_NUMBER FIELD_ACRE RMA_CLU_ID UPDATE_DAT Percent_Ar RHCLUID Field1 OBJECTID_1 DIVISION_1 STATE_1 COUNTY_1
1646 3 8.56 F68E591A-ECC2-470B-A012-201C3BB20D7F 9/21/2014 63.4990 39565604 1646 1646 8 3 35
1647 1 30.07 eb04cfc0-e78b-415f-b447-9595c81ef09e 9/21/2014 100.0000 39565605 1647 1647 8 3 35
1648 2 56.59 5922d604-e31c-4b9d-b846-9f38e2d18abe 9/21/2014 92.1442 39565606 1648 1648 8 3 35
POLICY_N_1 YIELD_ID_1 RH_CLU_ID_ short_dist coords_x1 coords_x2 optional SHAPE_Leng SHAPE_Area ncol.MyData. nrow.MyData.
1646 754033 3 39565604 5.110837 516747.8 -221751.4 TRUE 831.3702 34634.73 35 1757
1647 754033 1 39565605 5.606284 515932.1 -221702.0 TRUE 1469.4800 121611.46 35 1757
1648 754033 1 39565606 5.325399 516380.1 -221640.9 TRUE 1982.8757 228832.22 35 1757
for (j in length(dd))
This doesn’t iterate over dd — it iterates over a single number: the length of dd. Not much of an iteration. You probably meant to write the following or something similar:
for (j in seq_along(dd))
However, there are more issues with your code. For instance, the myDF variable is continuously overwritten inside your loop, which probably isn’t what you intended at all. Instead, you should probably create objects in an lapply statement and forego the loop.
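A minimal sketch of both points, on invented data with the same POLICY_NUM and ACRES column names: the first part shows what each loop header actually visits, and the second builds one data frame per policy with lapply. Reading the original condition `& MyData$ACRES` as "ACRES is non-zero" is an assumption.

```r
# Invented stand-in for the CSV
MyData <- data.frame(POLICY_NUM = c(754033, 754033, 754040, 754041),
                     ACRES      = c(8.56, 30.07, 0, 12.5))
dd <- unique(MyData$POLICY_NUM)

# What the two loop headers iterate over
visited <- integer(0)
for (j in length(dd)) visited <- c(visited, j)            # one pass only, j = 3
visited_all <- integer(0)
for (j in seq_along(dd)) visited_all <- c(visited_all, j) # j = 1, 2, 3

# One data frame per policy, without a loop or a pre-allocated container
per_policy <- lapply(dd, function(p) {
  MyData[MyData$POLICY_NUM == p & MyData$ACRES != 0, ]
})
names(per_policy) <- dd
print(per_policy)
```

Because lapply returns the whole list at once, nothing is overwritten between iterations.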

Selecting pairs of odd even values in R

I have a large dataset as follows:
head(humic)
SUERC.No GU.Number d13.C Age.(BP) error Batch.Number AMS.USED Year Type Sampletype
400 32691 535 -28 3382.981 34.74480 1 S3 2011 2 ha
401 32701 536 -28 3375.263 34.86087 1 S3 2011 2 ha
402 32711 537 -28 3308.103 34.83100 1 S3 2011 2 ha
403 32721 538 -28 3368.721 31.58641 1 S3 2011 2 ha
404 32731 539 -28 3368.604 34.72326 1 S3 2011 2 ha
405 32741 540 -28 3314.713 32.83147 1 S3 2011 2 ha
tail(humic)
SUERC.No GU.Number d13.C Age.(BP) error Batch.Number AMS.USED Year Type Sampletype
5445 70880 3962 -28.4 3390.458 29.12815 34 S4 2016 2 ha
5446 70890 3963 -28.5 3358.861 37.14896 34 S4 2016 2 ha
5447 70900 3964 -28.5 3363.626 26.71573 34 S4 2016 2 ha
5448 70910 3965 -28.5 3408.907 26.69665 34 S4 2016 2 ha
5449 70920 3966 -28.5 3348.463 29.01492 34 S4 2016 2 ha
5450 70930 3967 -28.4 3375.247 26.78261 34 S4 2016 2 ha
I am looking to create a variable that identifies odd-even pairs based on the variable GU.Number. These numbers identify duplicates of the same object, which have the same d13.C values.
For example,
535 - 536
537 - 538
3963-3964
3965-3966 are pairs.
Note, the column of GU.Number is not a sequence, some numbers are missing.
even.rows <- which(!(humic$GU.Number %% 2))
has.pair <- rep(0,nrow(humic))
for(i in even.rows){
has.pair[i] <- max((humic$GU.Number[i] + c(1,-1)) %in% humic$GU.Number)
}
# add as column of data
humic$has.pair <- has.pair
The has.pair column will be 1 if the GU.Number is even and there exists an odd GU.Number one less or one greater than the given GU.Number. Otherwise it will be 0. As a one-liner:
humic$has.pair <- sapply(1:nrow(humic),
function(x) with(humic,(!(GU.Number[x] %% 2))*max((GU.Number[x] + c(1,-1)) %in% GU.Number)))
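A vectorised alternative, offered as a sketch rather than a drop-in replacement: if a pair is always an odd number n together with n + 1 (as in the examples 535 - 536 and 3963-3964), both members share the same value of ceiling(GU.Number / 2). That pairing rule and the uniqueness of GU.Number are assumptions. Unlike the code above, this flags both members of a pair rather than only the even one. The GU.Number values below are made up to include some gaps.

```r
# Made-up GU numbers with some partners missing
humic <- data.frame(GU.Number = c(535, 536, 537, 538, 540, 3963, 3964, 3967))

# 535 and 536 both map to 268, 537 and 538 to 269, and so on
humic$pair.id <- ceiling(humic$GU.Number / 2)

# A pair is complete when exactly two rows share a pair.id
humic$has.pair <- ave(humic$pair.id, humic$pair.id, FUN = length) == 2
print(humic)
```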

Binning a dataframe with equal frequency of samples

I have binned my data using the cut function
breaks<-seq(0, 250, by=5)
data<-split(df2, cut(df2$val, breaks))
My split dataframe looks like
... ...
$`(15,20]`
val ks_Result c
15 60 237
18 70 247
... ...
$`(20,25]`
val ks_Result c
21 20 317
24 10 140
... ...
My bins looks like
> table(data)
data
(0,5] (5,10] (10,15] (15,20] (20,25] (25,30] (30,35]
0 0 0 7 128 2748 2307
(35,40] (40,45] (45,50] (50,55] (55,60] (60,65] (65,70]
1404 11472 1064 536 7389 1008 1714
(70,75] (75,80] (80,85] (85,90] (90,95] (95,100] (100,105]
2047 700 329 1107 399 376 323
(105,110] (110,115] (115,120] (120,125] (125,130] (130,135] (135,140]
314 79 1008 77 474 158 381
(140,145] (145,150] (150,155] (155,160] (160,165] (165,170] (170,175]
89 660 15 1090 109 824 247
(175,180] (180,185] (185,190] (190,195] (195,200] (200,205] (205,210]
1226 139 531 174 1041 107 257
(210,215] (215,220] (220,225] (225,230] (230,235] (235,240] (240,245]
72 671 98 212 70 95 25
(245,250]
494
When I take the mean of the bin counts, I get an average of ~900 samples per bin
> mean(table(data))
[1] 915.9
I want to tell R to make irregular bins in such a way that each bin contains about 900 samples (e.g. (0, 27] = 900, (27,28.5] = 900, and so on). I found something similar here, which deals with only one variable, not the whole dataframe.
I also tried Hmisc package, unfortunately the bins don't contain equal frequency!!
library(Hmisc)
data<-split(df2, cut2(df2$val, g=30, oneval=TRUE))
data<-split(df2, cut2(df2$val, m=1000, oneval=TRUE))
Assuming you want 50 equal-sized buckets (based on your seq statement), you can use something like:
df <- data.frame(var=runif(500, 0, 100)) # make data
cut.vec <- cut(
df$var,
breaks=quantile(df$var, 0:50/50), # breaks along 1/50 quantiles
include.lowest=TRUE
)
df.split <- split(df, cut.vec)
Hmisc::cut2 has this option built in as well.
This can also be done with the function provided here by Joris Meys:
EqualFreq2 <- function(x,n){
nx <- length(x)
nrepl <- floor(nx/n)
nplus <- sample(1:n,nx - nrepl*n)
nrep <- rep(nrepl,n)
nrep[nplus] <- nrepl+1
x[order(x)] <- rep(seq.int(n),nrep)
x
}
data<-split(df2, EqualFreq2(df2$val, 25))
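With many tied values, quantile-based breaks can repeat, and cut then stops with an error about non-unique breaks. A rank-based sketch in the same spirit as the function above avoids that by binning on the rank rather than on the value. The data below is simulated, and df2, val and n_bins are placeholder names. The trade-off is that rows with the same val can land in neighbouring bins, so the bins no longer correspond to clean value intervals.

```r
set.seed(1)
# Simulated data with heavy ties (integer values only)
df2 <- data.frame(val   = sample(0:250, 1000, replace = TRUE),
                  other = rnorm(1000))

n_bins <- 10
r <- rank(df2$val, ties.method = "first")    # ranks 1..n, ties broken by position
bin <- ceiling(r * n_bins / nrow(df2))       # bin sizes differ by at most one

pieces <- split(df2, bin)                    # splits the whole data frame
print(table(bin))
```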
