I have a list of weather stations and their locations by latitude and longitude. There was a formatting issue: some of them have hours and minutes, while others have hours, minutes and seconds. I can find the pattern using a regex, but I'm having trouble extracting the individual pieces.
Here's the data:
> head(wthrStat1 )
Station lat lon
1940 K01R 31-08N 092-34W
1941 K01T 28-08N 094-24W
1942 K03Y 48-47N 096-57W
1943 K04V 38-05-50N 106-10-07W
1944 K05F 31-25-16N 097-47-49W
1945 K06D 48-53-04N 099-37-15W
I'd like something like this:
Station latHr latMin latSec latDir lonHr lonMin lonSec lonDir
1940 K01R 31 08 00 N 092 34 00 W
1941 K01T 28 08 00 N 094 24 00 W
1942 K03Y 48 47 00 N 096 57 00 W
1943 K04V 38 05 50 N 106 10 07 W
1944 K05F 31 25 16 N 097 47 49 W
1945 K06D 48 53 04 N 099 37 15 W
I can get matches to this regex:
data.format <- "\\d{1,3}-\\d{1,3}(?:-\\d{1,3})?[NSWE]{1}"
grep(data.format, wthrStat1$lat)
But I am unsure how to get the individual parts into columns. I've tried a few things like:
wthrStat1$latHr <- ifelse(grepl(data.format, wthrStat1$lat), gsub(????), NA)
but with no luck.
Here's a dput():
> dput(wthrStat1[1:10,] )
structure(list(Station = c("K01R", "K01T", "K03Y", "K04V", "K05F",
"K06D", "K07G", "K07S", "K08D", "K0B9"), lat = c("31-08N", "28-08N",
"48-47N", "38-05-50N", "31-25-16N", "48-53-04N", "42-34-28N",
"47-58-27N", "48-18-03N", "43-20N"), lon = c("092-34W", "094-24W",
"096-57W", "106-10-07W", "097-47-49W", "099-37-15W", "084-48-41W",
"117-25-42W", "102-24-23W", "070-24W")), .Names = c("Station",
"lat", "lon"), row.names = 1940:1949, class = "data.frame")
Any suggestions?
strapplyc in the gsubfn package will extract each group in the regular expression surrounded with parentheses:
library(gsubfn)
data.format <- "(\\d{1,3})-(\\d{1,3})-?(\\d{1,3})?([NSWE]{1})"
parts <- strapplyc(wthrStat1$lat, data.format, simplify = rbind)
parts[parts == ""] <- "00"
which gives:
> parts
[,1] [,2] [,3] [,4]
[1,] "31" "08" "00" "N"
[2,] "28" "08" "00" "N"
[3,] "48" "47" "00" "N"
[4,] "38" "05" "50" "N"
[5,] "31" "25" "16" "N"
[6,] "48" "53" "04" "N"
[7,] "42" "34" "28" "N"
[8,] "47" "58" "27" "N"
[9,] "48" "18" "03" "N"
[10,] "43" "20" "00" "N"
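For comparison, the same split can be done without extra packages. This is a minimal base-R sketch, not taken from the answers here: split_dms is a hypothetical helper name, and it assumes every value ends in a single direction letter and has either two or three dash-separated numeric parts.

```r
# Two rows of the question's data, enough to show both formats.
wthrStat1 <- data.frame(
  Station = c("K01R", "K04V"),
  lat = c("31-08N", "38-05-50N"),
  lon = c("092-34W", "106-10-07W"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split "DD-MM[-SS]X" strings into four columns.
split_dms <- function(x, prefix) {
  x <- as.character(x)
  n <- nchar(x)
  dir <- substr(x, n, n)                                 # trailing N/S/E/W
  parts <- strsplit(substr(x, 1, n - 1), "-", fixed = TRUE)
  parts <- lapply(parts, function(p) c(p, "00")[1:3])    # pad missing seconds
  out <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
  out$dir <- dir
  names(out) <- paste0(prefix, c("Hr", "Min", "Sec", "Dir"))
  out
}

result <- cbind(
  wthrStat1["Station"],
  split_dms(wthrStat1$lat, "lat"),
  split_dms(wthrStat1$lon, "lon")
)
print(result)
```

All parts stay character so that leading zeros such as "08" and "092" are preserved; convert with as.numeric only when you need to compute with them.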
This is extremely inefficient, and I hope someone else has a better solution:
dat <- read.table(text =' Station lat lon
1940 K01R 31-08N 092-34W
1941 K01T 28-08N 094-24W
1942 K03Y 48-47N 096-57W
1943 K04V 38-05-50N 106-10-07W
1944 K05F 31-25-16N 097-47-49W
1945 K06D 48-53-04N 099-37-15W', header = TRUE)
pattern <- '([0-9]+)[-]([0-9]+)([-|A-Z]+)([0-9]*)([A-Z]*)'
dat$latHr <- gsub(pattern,'\\1',dat$lat)
dat$latMin <- gsub(pattern,'\\2',dat$lat)
latSec <- gsub(pattern,'\\4',dat$lat)
latSec[nchar(latSec)==0] <- '00'
dat$latSec <- latSec
# the direction is always the last character, whether or not seconds are present
latDir <- sub('.*([NSWE])$', '\\1', dat$lat)
dat$latDir <- latDir
dat
Station lat lon latHr latMin latSec latDir
1940 K01R 31-08N 092-34W 31 08 00 N
1941 K01T 28-08N 094-24W 28 08 00 N
1942 K03Y 48-47N 096-57W 48 47 00 N
1943 K04V 38-05-50N 106-10-07W 38 05 50 N
1944 K05F 31-25-16N 097-47-49W 31 25 16 N
1945 K06D 48-53-04N 099-37-15W 48 53 04 N
Another answer, using stringr:
# example data
data <-
"Station lat lon
1940 K01R 31-08N 092-34W
1941 K01T 28-08N 094-24W
1942 K03Y 48-47N 096-57W
1943 K04V 38-05-50N 106-10-07W
1944 K05F 31-25-16N 097-47-49W
1945 K06D 48-53-04N 099-37-15W"
## read string into a data.frame
df <- read.table(text = data, header = TRUE, stringsAsFactors = FALSE)
pattern <- "(\\d{1,3})-(\\d{1,3})(?:-(\\d{1,3}))?([NSWE]{1})"
library(stringr)
str_match(df$lat, pattern)
This produces a character matrix with one column for the whole match and an additional column for each capture group. Rows without seconds have an empty third group, which shows up as "" or as NA depending on the stringr version.
[,1] [,2] [,3] [,4] [,5]
[1,] "31-08N" "31" "08" "" "N"
[2,] "28-08N" "28" "08" "" "N"
[3,] "48-47N" "48" "47" "" "N"
[4,] "38-05-50N" "38" "05" "50" "N"
[5,] "31-25-16N" "31" "25" "16" "N"
[6,] "48-53-04N" "48" "53" "04" "N"
R's string processing ability has progressed a lot in the past few years.
I am working with the following data:
District <- c("AR01", "AZ03", "AZ05", "AZ08", "CA01", "CA05", "CA11", "CA16", "CA18", "CA21")
I want to split the string after the second character and put them into two columns.
So that the data looks like this:
state district
AR 01
AZ 03
AZ 05
AZ 08
CA 01
CA 05
CA 11
CA 16
CA 18
CA 21
Is there a simple way to get this done? Thanks so much for your help.
You can use substr if you always want to split by the second character.
District <- c("AR01", "AZ03", "AZ05", "AZ08", "CA01", "CA05", "CA11", "CA16", "CA18", "CA21")
# characters 1-2: the state
state <- substr(District,1,2)
# characters 3-4: the district
district <- substr(District,3,4)
# put in a data frame if needed
st_dt <- data.frame(state = state, district = district, stringsAsFactors = FALSE)
You could use strcapture, which ships with R in the utils package (available since R 3.4.0):
strcapture("(\\w{2})(\\w{2})",District,
data.frame(state = character(),District = character()))
state District
1 AR 01
2 AZ 03
3 AZ 05
4 AZ 08
5 CA 01
6 CA 05
7 CA 11
8 CA 16
9 CA 18
10 CA 21
where \\w{2} means two word characters (letters, digits or underscore)
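As a small illustrative sketch (the stricter pattern and the integer prototype are my assumptions, not part of the answer), the prototype data frame also controls the column types. Declaring the second column as integer converts "01" to 1, so keep character() if the leading zeros matter.

```r
District <- c("AR01", "AZ03", "CA21")

# \\w{2} matches exactly two word characters;
# \\d{2} restricts the second group to two digits.
res <- strcapture(
  "^(\\w{2})(\\d{2})$", District,
  proto = data.frame(state = character(), district = integer())
)
print(res)
```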
The OP has written:
"I'm more familiar with strsplit(). But since there is nothing to split on, it's not applicable in this case."
Au contraire! There is something to split on, and it's called a lookbehind:
strsplit(District, "(?<=[A-Z]{2})", perl = TRUE)
The lookbehind works like "inserting an invisible break" after 2 capital letters and splits the strings there.
The result is a list of vectors
[[1]]
[1] "AR" "01"
[[2]]
[1] "AZ" "03"
[[3]]
[1] "AZ" "05"
[[4]]
[1] "AZ" "08"
[[5]]
[1] "CA" "01"
[[6]]
[1] "CA" "05"
[[7]]
[1] "CA" "11"
[[8]]
[1] "CA" "16"
[[9]]
[1] "CA" "18"
[[10]]
[1] "CA" "21"
which can be turned into a matrix, e.g., by
do.call(rbind, strsplit(District, "(?<=[A-Z]{2})", perl = TRUE))
[,1] [,2]
[1,] "AR" "01"
[2,] "AZ" "03"
[3,] "AZ" "05"
[4,] "AZ" "08"
[5,] "CA" "01"
[6,] "CA" "05"
[7,] "CA" "11"
[8,] "CA" "16"
[9,] "CA" "18"
[10,] "CA" "21"
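To end up with the two named columns the question asks for, the matrix can be wrapped in a data frame. A minimal sketch on a shortened vector; the name st_dt is only illustrative:

```r
District <- c("AR01", "AZ03", "CA21")

# Zero-width split point right after two capital letters.
parts <- do.call(rbind, strsplit(District, "(?<=[A-Z]{2})", perl = TRUE))

st_dt <- data.frame(
  state = parts[, 1],
  district = parts[, 2],
  stringsAsFactors = FALSE
)
print(st_dt)
```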
We can use str_match to capture first two characters and the remaining string in separate columns.
stringr::str_match(District, "(..)(.*)")[, -1]
# [,1] [,2]
# [1,] "AR" "01"
# [2,] "AZ" "03"
# [3,] "AZ" "05"
# [4,] "AZ" "08"
# [5,] "CA" "01"
# [6,] "CA" "05"
# [7,] "CA" "11"
# [8,] "CA" "16"
# [9,] "CA" "18"
#[10,] "CA" "21"
With the tidyverse this is very easy using the function separate from tidyr:
library(tidyverse)
District %>%
as_tibble() %>%
separate(value, c("state", "district"), sep = "(?<=[A-Z]{2})")
# A tibble: 10 × 2
state district
<chr> <chr>
1 AR 01
2 AZ 03
3 AZ 05
4 AZ 08
5 CA 01
6 CA 05
7 CA 11
8 CA 16
9 CA 18
10 CA 21
Treat it as a fixed-width file and import it:
# read fixed width file
read.fwf(textConnection(District), widths = c(2, 2), colClasses = "character")
# V1 V2
# 1 AR 01
# 2 AZ 03
# 3 AZ 05
# 4 AZ 08
# 5 CA 01
# 6 CA 05
# 7 CA 11
# 8 CA 16
# 9 CA 18
# 10 CA 21
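A slightly extended sketch of the same idea, assuming you also want meaningful column names: read.fwf passes extra arguments on to read.table, so col.names can be supplied in the same call.

```r
District <- c("AR01", "AZ03", "CA21")

con <- textConnection(District)
st_dt <- read.fwf(
  con,
  widths = c(2, 2),
  col.names = c("state", "district"),
  colClasses = "character"   # without this, "01" would be read as the number 1
)
close(con)
print(st_dt)
```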
I am working with the "fpp" library and its "elecequip" object (code below). Why does the exported data not show the month and year? What type of object is "elecequip", and how do I get the month and year for this data?
fit <- stl(elecequip, s.window=5)
plot(elecequip, col="gray", main="Electrical equipment manufacturing",
ylab="New orders index", xlab="")
lines(fit$time.series[,2],col="red",ylab="Trend")
write.table(elecequip, "ELECT DATA.txt", sep="\t")
Output looks like this for the first 2 years of data:
"x"
"1" 79.43
"2" 75.86
"3" 86.4
"4" 72.67
"5" 74.93
"6" 83.88
"7" 79.88
"8" 62.47
"9" 85.5
"10" 83.19
"11" 84.29
"12" 89.79
"13" 78.72
"14" 77.49
"15" 89.94
"16" 81.35
"17" 78.76
"18" 89.59
"19" 83.75
"20" 69.87
"21" 91.18
"22" 89.52
"23" 91.12
"24" 92.97
It's a time series (an object of class "ts"):
> class(elecequip)
[1] "ts"
> str(elecequip)
Time-Series [1:191] from 1996 to 2012: 79.4 75.9 86.4 72.7 74.9 ...
There may be better ways of saving the printed output as a flat table (which I think is what you're after), but sink seems to work.
sink(file = "test.txt")
print(elecequip)
sink()
Head of the test.txt file:
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1996 79.43 75.86 86.40 72.67 74.93 83.88 79.88 62.47 85.50 83.19 84.29 89.79
1997 78.72 77.49 89.94 81.35 78.76 89.59 83.75 69.87 91.18 89.52 91.12 92.97
1998 81.97 85.26 93.09 81.19 85.74 91.24 83.56 66.45 93.45 86.03 86.91 93.42
1999 81.68 81.68 91.35 79.55 87.08 96.71 98.10 79.22 103.68 101.00 99.52 111.94
2000 95.42 98.49 116.37 101.09 104.20 114.79 107.75 96.23 123.65 116.24 117.00 128.75
It's a bit of a hack, but (pulling out the necessary parts of stats:::print.ts):
library("fpp")
pp <- .preformat.ts(elecequip,TRUE)
storage.mode(pp) <- "numeric"
pp <- as.data.frame(pp)
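If you want it in long format (there are ways to do this other than in Hadleyverse 2, of course; add_rownames and gather have since been superseded, so their current equivalents are used here):
library("tidyr")
library("dplyr")
pp %>% tibble::rownames_to_column("year") %>% pivot_longer(-year, names_to = "month", values_to = "value")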
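A base-R alternative, sketched under the assumption that a long table with one row per observation is acceptable: time() and cycle() recover the year and the position within the year from any monthly "ts" object. AirPassengers (shipped with R) stands in for elecequip here so that the example runs without fpp.

```r
# AirPassengers is a monthly "ts" object, like elecequip.
x <- AirPassengers

long <- data.frame(
  year  = as.integer(floor(as.numeric(time(x)) + 1e-6)),  # offset guards against rounding
  month = month.abb[as.integer(cycle(x))],
  value = as.numeric(x),
  stringsAsFactors = FALSE
)
print(head(long, 3))

# The table can then be exported with its date columns, for example:
# write.table(long, "ELECT DATA.txt", sep = "\t", row.names = FALSE)
```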
I have a 5-level factor that looks like the following:
tmp
[1] NA
[2] 1,2,3,6,11,12,13,18,20,21,22,26,29,33,40,43,46
[3] NA
[4] NA
[5] 5,9,16,24,35,36,42
[6] 4,7,10,14,15,17,19,23,25,27,28,30,31,32,34,37,38,41,44,45,47,48,49,50
[7] 8,39
5 Levels: 1,2,3,6,11,12,13,18,20,21,22,26,29,33,40,43,46 ...
I want to access the items within each level except NA. So I use the levels() function, which gives me:
> levels(tmp)
[1] "1,2,3,6,11,12,13,18,20,21,22,26,29,33,40,43,46"
[2] "4,7,10,14,15,17,19,23,25,27,28,30,31,32,34,37,38,41,44,45,47,48,49,50"
[3] "5,9,16,24,35,36,42"
[4] "8,39"
[5] "NA"
Then I would like to access the elements in each level, and store them as numbers. However, for example,
>as.numeric(cat(levels(tmp)[3]))
5,9,16,24,35,36,42numeric(0)
Can you help me remove the commas between the numbers and the numeric(0) at the very end? I would like to have a numeric vector 5, 9, 16, 24, 35, 36, 42 so that I can use the values as indices to access a data frame. Thanks!
You need to use a combination of unlist, strsplit and unique.
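First, recreate your data (stringsAsFactors = TRUE is needed from R 4.0.0 onwards to get a factor rather than a character vector):
dat <- read.table(text="
NA
1,2,3,6,11,12,13,18,20,21,22,26,29,33,40,43,46
NA
NA
5,9,16,24,35,36,42
4,7,10,14,15,17,19,23,25,27,28,30,31,32,34,37,38,41,44,45,47,48,49,50
8,39", stringsAsFactors = TRUE)$V1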
Next, find all the unique levels, after using strsplit:
sort(unique(unlist(
sapply(levels(dat), function(x)unlist(strsplit(x, split=",")))
)))
[1] "1" "10" "11" "12" "13" "14" "15" "16" "17" "18" "19" "2" "20" "21" "22" "23" "24" "25" "26"
[20] "27" "28" "29" "3" "30" "31" "32" "33" "34" "35" "36" "37" "38" "39" "4" "40" "41" "42" "43"
[39] "44" "45" "46" "47" "48" "49" "5" "50" "6" "7" "8" "9"
Does this do what you want?
levels_split <- strsplit(levels(tmp), ",")
lapply(levels_split, as.numeric)
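To get a single level as an index vector, a minimal sketch (the shortened factor below is a stand-in for the one in the question):

```r
# Small stand-in for the factor in the question.
tmp <- factor(c(NA, "1,2,3", NA, "5,9,16,24,35,36,42", "8,39"))

# cat() only prints its argument and returns NULL, which is why
# as.numeric(cat(...)) printed the text followed by numeric(0).
# Split the level string on commas instead, then convert.
idx <- as.numeric(strsplit(levels(tmp)[2], ",", fixed = TRUE)[[1]])
print(idx)

# idx can now be used for row indexing, e.g. some_df[idx, ]
# (some_df is a hypothetical data frame).
```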
Using Andrie's dat
val <- scan(text=levels(dat),sep=",")
#Read 50 items
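# start a new group wherever the values drop, i.e. at each level boundary
# (this relies on the numbers within each level being in ascending order)
split(val, cumsum(c(TRUE, diff(val) < 0)))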
#$`1`
#[1] 1 2 3 6 11 12 13 18 20 21 22 26 29 33 40 43 46
#$`2`
#[1] 4 7 10 14 15 17 19 23 25 27 28 30 31 32 34 37 38 41 44 45 47 48 49 50
#$`3`
#[1] 5 9 16 24 35 36 42
#$`4`
#[1] 8 39
I've parsed through a file to extract certain values. A column contains a percentage with the symbol. Is there any way to remove that "%" character?
From this:
98.9% 23 43
92.2% 342 34
98.9% 53 53
82.2% 32 76
97.9% 83 45
92.9% 92 23
to:
98.9 23 43
92.2 342 34
98.9 53 53
82.2 32 76
97.9 83 45
92.9 92 23
You say in the title that you have a matrix - in which case everything in the matrix should be 'character' already. Use gsub to replace % with nothing.
> j <- matrix(c("1%", "2%", 3, 4), ncol = 2)
> j
[,1] [,2]
[1,] "1%" "3"
[2,] "2%" "4"
> gsub("%", "", j)
[,1] [,2]
[1,] "1" "3"
[2,] "2" "4"
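If you want it to be numeric, you could use apply along with as.numeric. Apply over columns (MARGIN = 2) so that the result keeps the layout of j; applying over rows would return the transpose.
> apply(gsub("%", "", j), 2, as.numeric)
[,1] [,2]
[1,] 1 3
[2,] 2 4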
Use gsub to substitute the % for an empty string, then convert to numeric:
x <- c("98.9%", "92.2%", "98.9%", "82.2%", "97.9%", "92.9%")
as.numeric(gsub("%", "", x))
[1] 98.9 92.2 98.9 82.2 97.9 92.9
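If the values live in a data frame rather than a matrix, only the percentage column needs converting. A minimal sketch, assuming the data was read with read.table and therefore has the default column names V1 to V3:

```r
df <- read.table(
  text = "98.9% 23 43
92.2% 342 34
82.2% 32 76",
  stringsAsFactors = FALSE
)

# Strip a trailing "%" from the first column only, then convert it.
df$V1 <- as.numeric(sub("%$", "", df$V1))
str(df)
```

The other columns are left untouched, so they keep their integer type.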
Hello, I am trying to reshape a data.frame in R so that the first row is repeated once for each value in the first element of a list, the second row is repeated once for each value in the second element, and so on.
The list is called wrk, dfx is the data frame I want to reshape, and listOut is what I want to end up with.
Thank you very much for your help.
> wrk
[[1]]
[1] "41" "42" "44" "45" "97" "99" "100" "101" "102"
[10] "103" "105" "123" "124" "126" "127" "130" "132" "135"
[19] "136" "137" "138" "139" "140" "141" "158" "159" "160"
[28] "161" "162" "163" "221" "223" "224" ""
[[2]]
[1] "41" "42" "44" "45" "98" "99" "100" "101" "102"
[10] "103" "105" "123" "124" "126" "127" "130" "132" "135"
[19] "136" "137" "138" "139" "140" "141" "158" "159" "160"
[28] "161" "162" "163" "221" "223" "224" ""
>dfx
projectScore highestRankingGroup
1 0.8852 1
2 0.8845 2
>listOut
projectScore highestRankingGroup wrk
1 0.8852 1 41
2 0.8852 1 42
3 0.8852 1 44
4 0.8852 1 45
5 0.8852 1 97
6 0.8852 1 99
7 0.8852 1 100
8 0.8852 1 101
...
35 0.8845 2 41
36 0.8845 2 42
37 0.8845 2 44
38 0.8845 2 45
39 0.8845 2 98
40 0.8845 2 99
41 0.8845 2 100
How about replicate rows of dfx and cbind with unlisted wrk:
listOut <- cbind(
dfx[rep(seq_along(wrk), sapply(wrk, length)), ],
wrk = unlist(wrk)
)
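A runnable sketch of this approach on shortened data. Dropping the empty strings at the end of each list element is my assumption about what is wanted; skip that step to keep them.

```r
wrk <- list(c("41", "42", "97", ""), c("41", "98", ""))
dfx <- data.frame(projectScore = c(0.8852, 0.8845),
                  highestRankingGroup = c(1, 2))

# Assumption: the trailing "" entries are not wanted in the output.
wrk_clean <- lapply(wrk, function(v) v[nzchar(v)])

# Repeat row i of dfx once per element of wrk_clean[[i]].
idx <- rep(seq_len(nrow(dfx)), lengths(wrk_clean))
listOut <- data.frame(
  dfx[idx, ],
  wrk = unlist(wrk_clean),
  row.names = NULL,
  stringsAsFactors = FALSE
)
print(listOut)
```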
How about:
If wrk contains simple vectors like in your example:
> szs<-sapply(wrk, length)
> fulldfr<-do.call(c, wrk)
> listOut<-cbind(dfx[rep(seq_along(szs), szs),], fulldfr)
If wrk contains dataframes:
> szs<-sapply(wrk, function(dfr){dim(dfr)[1]})
> fulldfr<-do.call(rbind, wrk)
> listOut<-cbind(dfx[rep(seq_along(szs), szs),], fulldfr)
How about:
expand.grid(dfx$projectScore, dfx$highestRankingGroup, wrk[[1]])
Edit:
Maybe you can elaborate a bit more, because this does seem to work:
a <- c("41","42","44","45","97","99","100","101","102","103","105", "123","124","126","127","130","132","135","136","137","138","139","140","141","158","159","160","161","162","163","221","223","224")
wrk <-list(a, a)
dfx <- data.frame(projectScore=c(0.8852, 0.8845), highestRankingGroup=c(1,2))
listOut <- expand.grid(dfx$projectScore, dfx$highestRankingGroup, wrk[[1]])
names(listOut) <- c("projectScore", "highestRankingGroup", "wrk")
listOut[order(-listOut$projectScore,listOut$highestRankingGroup, listOut$wrk),]