Importing messy data to r - r

Lake Elsinore 9.7 F W 60.2 131 1 1 0 2310.1
Lake Elsinore 10.4 F W 53.9 67 0 0 0 1815.9
Lake Elsinore 10.1 M W 54.3 96 1 1 1 1872.9
Lake Elsinore 9.6 M W 55.1 72 1 . 1 1980.4
So here I have ten variables V1-V10. How can I read it to R. You see the first variable is actually separated by space. So I can't read in "separating by space". Could someone have me to find a way that I could easily import those kind of data in.
Thank you so so much!

Here are two approaches:
1) It could be done with read.pattern in the gsubfn package. The matches to the parenthesized portions of the pattern are read in as separate fields:
library(gsubfn)
pattern <- "^(.*) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+)"
read.pattern("myfile.dat", pattern, na.strings = ".")
giving:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 Lake Elsinore 9.7 F W 60.2 131 1 1 0 2310.1
2 Lake Elsinore 10.4 F W 53.9 67 0 0 0 1815.9
3 Lake Elsinore 10.1 M W 54.3 96 1 1 1 1872.9
4 Lake Elsinore 9.6 M W 55.1 72 1 NA 1 1980.4
2) Read in the lines as they are, replace the first space on each line with some character (here we use underscore), re-read it now using read.table and then replace the underscore with space:
L <- readLines("myfile.dat")
L <- sub(" ", "_", L)
DF <- read.table(text = L, na.strings = ".")
DF[[1]] <- sub("_", " ", DF[[1]])
giving the same answer.

It's a little clunky, but I usually just read it in raw and parse the data from there. You could do something like:
# First, read in all columns space separated
df <- read.table(FILE, header = F, sep = " ")
# Create a new column (V12) that's a concatenation of V1, V2
within(df, V12 <- paste(V1, V2, sep=' '))
# And then drop the unwanted columns
df <- df[,2:11]
Remember, you have 11 columns reading it in raw, which is why I'm creating a 12th.

Related

How to remove just the set of numbers with / in between among other strings? [duplicate]

This question already has an answer here:
How can I extract numbers separated by a forward slash in R? [closed]
(1 answer)
Closed 3 years ago.
I need to extract the blood pressure values from a text note that is typically reported as one larger number, "/" over a smaller number, with the units mm HG (it's not a fraction, and only written as such). In the 4 examples below, I want to extract 114/46, 135/67, 109/50 and 188/98 only, without space before or after and place the top number in column called SBP, and the bottom number into a column called DBP.
Thank you in advance for your assistance.
bb <- c("PATIENT/TEST INFORMATION (m2): 1.61 m2\n BP (mm Hg): 114/46 HR 60 (bpm)", "PATIENT/TEST INFORMATION:\ 63\n Weight (lb): 100\nBSA (m2): 1.44 m2\nBP (mm Hg): 135/67 HR 75 (bpm)", "PATIENT/TEST INFORMATION:\nIndication: Coronary artery disease. Hypertension. Myocardial infarction.\nWeight (lb): 146\nBP (mm Hg): 109/50 HR (bpm)", "PATIENT/TEST INFORMATION:\nIndication: Aortic stenosis. Congestive heart failure. Shortness of breath.\nHeight: (in) 64\nWeight (lb): 165\nBSA (m2): 1.80 m2\nBP (mm Hg): 188/98 HR 140 (bpm) ")
BP <- head(bb,4)
dput(bb)
Base R solution:
setNames(data.frame(do.call("rbind", strsplit(trimws(gsub("[[:alpha:]]|[[:punct:]][^0-9]+", "",
gsub("HR.*", "", paste0("BP", lapply(strsplit(bb, "BP"), '[', 2)))), "both"), "/"))),
c("SBP", "DBP"))
We can use regmatches/regexpr from base R to extract the required values, and then with read.table, create a two column data.frame
read.table(text = regmatches(bb, regexpr('\\d+/\\d+', bb)),
sep="/", header = FALSE, stringsAsFactors = FALSE)
# V1 V2
#1 114 46
#2 135 67
#3 109 50
#4 188 98
Or using strcapture from base R
strcapture( "(\\d+)\\/(\\d+)", bb, data.frame(X1 = integer(), X2 = integer()))
# X1 X2
#1 114 46
#2 135 67
#3 109 50
#4 188 98
To create this as new columnss in the original data.frame, use either cbind to bind the output with the original dataset
cbind(data, read.table(text = ...))
Or
data[c("V1", "V2")] <- read.table(text = ...)
Or using extract from tidyr
library(dplyr)
library(tidyr)
tibble(bb) %>%
extract(bb, into = c("X1", "X2"), ".*\\b(\\d+)/(\\d+).*", convert = TRUE)
# A tibble: 4 x 2
# X1 X2
# <int> <int>
#1 114 46
#2 135 67
#3 109 50
#4 188 98
If we don't want to remove the original column, use remove = FALSE in extract
You could use str_match and select numbers which has / in between
as.data.frame(stringr::str_match(bb, "(\\d+)/(\\d+)")[, 2:3])
# X1 X2
#1 114 46
#2 135 67
#3 109 50
#4 188 98
In base R, we can extract the numbers that follow the pattern a/b, split them on '/' and form two columns.
as.data.frame(do.call(rbind, strsplit(sub(".*?(\\d+/\\d+).*", "\\1", bb), "/")))
You can give them the column names as per your choice using setNames or any other method.

how to perform log10 transform on specific columns by a condition in R

May dataset is like this:
d <- read.table('age.txt', header = F,sep=' ')
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 101 12 3.531704 16.0 40.8 1.449648 1.080353 20.85738 74.53056 0
2 102 15 -9.000000 24.0 36.4 -9.000000 -9.000000 -9.00000 -9.00000 0
3 103 13 3.306023 26.2 48.4 2.178820 1.349228 22.51904 72.82571 2.3
4 104 12 2.715226 18.2 42.6 2.343138 1.414314 23.13632 72.73414 4.5
and I need to perform log10 transform on column 6:10, but only for the values that are not equal to 0 or -9. Well, I tried this:
if(d[,6:10]!=-9 || 0){d[,6:10]=log10(d[,6:10])}
but it did not work. If anyone can help thanks.
log2 or log10?
m <- as.matrix(df[6:10])
df[6:10] <- ifelse(m > 0, log10(m), m)
One option would be to loop through the columns 6:10, get the index of elements that are not 0 or -9, apply the log10 on those and return the vector.
d[6:10] <- lapply(d[6:10], function(x) {
i1 <- !x %in% c(0, -9)
x[i1] <- log10(x[i1])
x} )
Or another option would be to create a logical matrix ('i1') subset the elements from the columns and update with the log10
i1 <- d[6:10]!=0 & d[6:10] != -9
d[6:10][i1] <- log10(d[6:10][i1])

Spliting a row into columns using a delimiter in R

My data loks like this:
ID:10:237,204,
ID:11:257,239,
ID:12:309,291,
ID:13:310,272,
ID:14:3202,3184,
ID:15:404,388,
I would like to first separate this into different columns then apply a function on each row to calculate the difference of comma separated values such as (237-204).
Without the use of external library packages.
Try this except if the data is in a file replace the readLines line with something like this: L <- readLines("myfile.csv") . After that replace the colons with commas using gsub and then read the resulting text and transform it:
# test data
Lines <- "ID:10:237,204,
ID:11:257,239,
ID:12:309,291,
ID:13:310,272,
ID:14:3202,3184,
ID:15:404,388,"
L <- readLines(textConnection(Lines))
DF <- read.table(text = gsub(":", ",", L), sep = ",")
transform(DF, diff = V3 - V4)
giving:
V1 V2 V3 V4 V5 diff
1 ID 10 237 204 NA 33
2 ID 11 257 239 NA 18
3 ID 12 309 291 NA 18
4 ID 13 310 272 NA 38
5 ID 14 3202 3184 NA 18
6 ID 15 404 388 NA 16

I am trying to figure out how to parse a webpage

I am working on a summer project. To grab course information from my school website.
I start off by going here: http://www.uah.edu/cgi-bin/schedule.pl?file=fall2015.html&segment=
to gather the course departments.
Then I grab info from pages like this one.
I have what I need filtered down to a list like:
[1] "91091 211 01 PRINC OF FINANCIAL ACCOUNTING 3.0 55 22 33 0 MW 12:45PM 02:05PM BAB 106 Rose-Green E"
[2] "91092 211 02 PRINC OF FINANCIAL ACCOUNTING 3.0 53 18 35 0 TR 09:35AM 10:55AM BAB 123 STAFF"
[3] "91093 211 03 PRINC OF FINANCIAL ACCOUNTING 3.0 48 29 19 0 TR 05:30PM 06:50PM BAB 220 Hoskins J"
[4] "91094 212 01 MANAGEMENT ACCOUNTING 3.0 55 33 22 0 MWF 11:30AM 12:25PM BAB 106 Hoskins J"
[5] "91095 212 02 MANAGEMENT ACCOUNTING 3.0 55 27 28 0 TR 02:20PM 03:40PM BAB 106 Bryson R"
However my issues are as follows:
www.uah.edu/cgi-bin/schedule.pl?file=fall2015.html&segment=CS
I need to add the department from each url. In the link I gave, the department was "CS". I need to have that included with each entry.
I need to turn this into a table, or some other object where I can reference the data like
Max Wait
CRN Course Title Credit Enrl Enrl Avail List Days Start End Bldg Room Instructor
------ ---------- ------------------------------ ------ ---- ---- -------- ---- ------- ------- ------- ----- ---------- --------------------
Basically how the data is displayed on the page.
So my end goal is to go through each of those links I grab, get all the course info(except the section type). Then put it into a giant data.frame that has all the courses like this.
Department CRN Course Title Credit MaxEnrl Enrl Avail WaitList Days Start End Bldg Room Instructor
ACC 91095 212 02 MANAGEMENT ACCOUNTING 3.0 55 27 28 0 TR 02:20PM 03:40PM BAB 106 Bryson R
So far I have this working
require(data.table)
require(gdata)
library(foreach)
uah <- readLines('http://www.uah.edu/cgi-bin/schedule.pl?file=fall2015.html&segment=')
uah <- substring(uah[grep('fall2015', uah)], 10)
uah <- sub("\\\"(.*)", "", uah)
uah <- paste("http://www.uah.edu" , uah , sep = "")
gatherClasses <- function(url){
dep <- readLines(url)
dep <- dep[grep('[[:digit:][:digit:][:digit:][:digit:]][[:digit:][:digit:][:digit:]] [[:digit:][:digit:]]', dep)]
dep <- substring(dep, 6)
dep <- foreach(i=dep) %do% i[grep('[[:digit:][:digit:][:digit:][:digit:]][[:digit:][:digit:][:digit:]] [[:digit:][:digit:]]', i)]
dep <- foreach(i=dep) %do% trim(i)
dep <- dep[2:length(dep)]
return(dep)
}
x <- gatherClasses(uah[1])
x <-unlist(x)
I am having trouble split the data in the right places. I am not sure what I should try next.
EDIT:(Working Now)
require(data.table)
require(gdata)
library(foreach)
uah <- readLines('http://www.uah.edu/cgi-bin/schedule.pl?file=sum2015b.html&segment=')
uah <- substring(uah[grep('sum2015b', uah)], 10)
uah <- sub("\\\"(.*)", "", uah)
uah <- paste("http://www.uah.edu" , uah , sep = "")
gatherClasses <- function(url){
L <- readLines(url)
Fields <- sub(" *$", " ", grep("---", L, value = TRUE))
widths <- diff(c(0, gregexpr(" ", Fields)[[1]]))
Data <- grep("\\d{5} \\d{3}", L, value = TRUE)
classes <- read.fwf(textConnection(Data), widths, as.is = TRUE, strip.white = TRUE)
classes$department <- unlist(strsplit(url, '='))[3]
return(classes)
}
allClasses = foreach(i=uah) %do% gatherClasses(i)
allClasses <- do.call("rbind", allClasses)
write.table(mydata, "c:/sum2015b.txt", sep="\t")
Read the lines into L, grab the "--- ---- etc." line into Fields and ensure that there is exactly one space at the end. Find the character positions of the spaces and difference them to get the field widths. Finally grep out the data portion and read it in using read.fwf which reads fixed width fields. For example, for Art History:
URL <- "http://www.uah.edu/cgi-bin/schedule.pl?file=fall2015.html&segment=ARH"
L <- readLines(URL)
Fields <- sub(" *$", " ", grep("---", L, value = TRUE))
widths <- diff(c(0, gregexpr(" ", Fields)[[1]]))
Data <- grep("\\d{5} \\d{3} \\d{2}", L, value = TRUE)
read.fwf(textConnection(Data), widths, as.is = TRUE, strip.white = TRUE)
giving:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15
1 90628 100 01 ARH SURV:ANCIENT-MEDIEVAL 3 35 27 8 0 TR 12:45PM 02:05PM WIL 168 Joyce L
2 90630 101 01 ARH SURV:RENAISSANCE-MODERN 3 35 14 21 0 MW 12:45PM 02:05PM WIL 168 Stewart D
3 90631 101 02 ARH SURV:RENAISSANCE-MODERN 3 35 8 27 0 MW 03:55PM 05:15PM WIL 168 Stewart D
4 92269 101 03 ARH SURV:RENAISSANCE-MODERN 3 35 5 30 0 TR 11:10AM 12:30PM WIL 168 Shapiro Guanlao M
5 90632 101 04 ARH SURV:RENAISSANCE-MODERN 3 35 13 22 0 TR 02:20PM 03:40PM WIL 168 Shapiro Guanlao M
6 90633 301 01 ANCIENT GREEK ART 3 18 3 15 0 MW 02:20PM 03:40PM WIL 168 Joyce L
7 92266 306 01 COLLAPSE OF CIVILIZATIONS 3 10 4 6 0 TR 12:45PM 02:05PM SST 205 Sever T
8 W 90634 309 01 CONTEMPORARY ART & ISSUES 3 18 10 8 0 TR 09:35AM 10:55AM WIL 168 Stewart D
9 90635 320 01 ST: MODERN ARCHITECTURE 3 12 0 12 0 TR 11:10AM 12:30PM WIL 172 Takacs T
10 90636 400 01 SENIOR THESIS 3 0 0 0 0 TBA TBA TBA TBA Joyce L
11 90637 400 02 SENIOR THESIS 3 0 0 0 0 TBA TBA TBA TBA Stewart D
I wrote and donated that schedule.pl script about 20 years ago because they simply published the flat mainframe files of all the courses on offer for each session. The script's job is to break up the whole set and present it in human-consumable chunks. (That, and back then a browser would choke on that much data.) I understand from one of the former UAH IT people that they tried to do away with it once, but got a great hew and cry from users, so they figured out how to keep it working.
It would be easier for you to ask the UAH IT folks if you can't just retrieve the underlying flat file. It used to be on a public-facing URL, but like I said, that was about 20 years ago, so I don't recall the specifics. The output you see when viewing courses is the same as the flat file, but the flat file contains every department, so you don't have to fetch each separately.

R - "find and replace" integers in a column with character labels

I have two data frames the first (DF1) is similar to this:
Ba Ram You Sheep
30 1 33.2 120.9
27 3 22.1 121.2
22 4 39.1 99.1
11 1 20.0 101.6
9 3 9.8 784.3
The second (DF2) contains titles for column "Ram":
V1 V2
1 RED
2 GRN
3 YLW
4 BLU
I need to replace the DF1$Ram with corresponding character strings of DF2$V2:
Ba Ram You Sheep
30 RED 33.2 120.9
27 YLW 22.1 121.2
22 BLU 39.1 99.1
11 RED 20.0 101.6
9 YLW 9.8 784.3
I can do this with a nested for loop, but it feels REALLY inefficient:
x <- c(1:nrows(DF1))
y <- c(1:4)
for (i in x) {
for (j in y) {
if (DF1$Ram[i] == x) {
DF1$Ram[i] <- DF2$V2[y]
}
}
}
Is there a way to do this more efficiently??!?! I know there is. I'm a noob.
Use merge
> result <- merge(df1, df2, by.x="Ram", by.y="V1")[,-1] # merging data.frames
> colnames(result)[4] <- "Ram" # setting name
The following is just for getting the output in the order you showed us
> result[order(result$Ba, decreasing = TRUE), c("Ba", "Ram", "You", "Sheep")]
Ba Ram You Sheep
1 30 RED 33.2 120.9
3 27 YLW 22.1 121.2
5 22 BLU 39.1 99.1
2 11 RED 20.0 101.6
4 9 YLW 9.8 784.3
Usually, when you encode some character strings with integers, you likely want factor. They offer some benefits you can read about in the fine manual.
df1 <- data.frame(V2 = c(3,3,2,3,1))
df2 <- data.frame(V1=1:4, V2=c('a','b','c','d'))
df1 <- within(df1, {
f <- factor(df1$V2, levels=df2$V1, labels=df2$V2)
aschar <- as.character(f)
asnum <- as.numeric(f)
})

Resources