R read_excel reads numeric data incorrectly

I'm trying to download and parse the Data worksheet in the file ie_data.xls from Professor Robert Shiller's home page (http://www.econ.yale.edu/~shiller/data.htm). I download the file from http://www.econ.yale.edu/~shiller/data/ie_data.xls, and then run the following script:
library(tidyverse)
library(readxl)  # read_excel() comes from readxl, which library(tidyverse) does not attach
ie_data <- read_excel("ie_data.xls", sheet = "Data", col_names = TRUE,
                      col_types = "numeric", na = "", skip = 7) %>%
  select(Date, E) %>%
  drop_na()
A bunch of warnings are generated, but more bothersome is the output
> names(ie_data)
[1] "Date" "E"
> ie_data
# A tibble: 1,791 x 2
    Date     E
   <dbl> <dbl>
 1 1871.   0.4
 2 1871.   0.4
 3 1871.   0.4
 4 1871.   0.4
 5 1871.   0.4
 6 1871.   0.4
 7 1871.   0.4
 8 1871.   0.4
 9 1871.   0.4
10 1871.   0.4
# ... with 1,781 more rows
Warning message:
`...` is not empty.
We detected these problematic arguments:
* `needs_dots`
These dots only exist to allow future extensions and should be empty.
Did you misspecify an argument?
The contents of both columns should have two decimal places (1871.01 represents January 1871, 1871.02 represents February 1871, and so on; the second column is earnings per share, rounded to the nearest penny), but everything after the decimal point is gone from the first column at the head of the dataframe! Even more mysterious is its tail:
> tail(ie_data)
# A tibble: 6 x 2
   Date     E
  <dbl> <dbl>
1 2019.  135.
2 2019.  137.
3 2019.  139.
4 2020.  132.
5 2020.  124.
6 2020.  116.
Warning message:
`...` is not empty.
We detected these problematic arguments:
* `needs_dots`
These dots only exist to allow future extensions and should be empty.
Did you misspecify an argument?
Now both columns have lost their fractional part! What change do I need to make to my code in order to read these columns correctly?
Sincerely and with many thanks in advance
Thomas Philips

You can do the following to see more significant digits in your console when you print your data by typing ie_data. This doesn't affect your data, only the way it is shown when printed to your console.
options(pillar.sigfig = 10)
ie_data
Which will show:
# A tibble: 1,791 x 2
      Date     E
     <dbl> <dbl>
 1 1871.01   0.4
 2 1871.02   0.4
 3 1871.03   0.4
 4 1871.04   0.4
 5 1871.05   0.4
 6 1871.06   0.4
 7 1871.07   0.4
 8 1871.08   0.4
 9 1871.09   0.4
10 1871.1    0.4
# ... with 1,781 more rows
If you use the following:
options(pillar.sigfig = 1)
ie_data
You will get:
# A tibble: 1,791 x 2
    Date     E
   <dbl> <dbl>
 1 1871.   0.4
 2 1871.   0.4
 3 1871.   0.4
 4 1871.   0.4
 5 1871.   0.4
 6 1871.   0.4
 7 1871.   0.4
 8 1871.   0.4
 9 1871.   0.4
10 1871.   0.4
# ... with 1,781 more rows

Try it with col_types = "text".
I don't really know why numeric gets you trimmed numbers, but I got it working with text (provided you later convert back to numeric and round as needed).
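A minimal sketch of that workaround. The tibble below is a hypothetical stand-in for what read_excel("ie_data.xls", sheet = "Data", col_types = "text", skip = 7) would return; with the real file, you would pipe that call into the same mutate/drop_na steps:

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the sheet read with col_types = "text"
ie_text <- tibble(
  Date = c("1871.01", "1871.02", NA),
  E    = c("0.40", "0.40", NA)
)

ie_data <- ie_text %>%
  mutate(across(everything(), as.numeric)) %>%  # text -> numeric; decimals survive
  drop_na()
```

Note that with col_types = "numeric" nothing was actually lost either; the tibble printer merely rounds to pillar.sigfig significant digits when displaying.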

Related

How to estimate means from same column in large number of dataframes, based upon a grouping variable in R

I have a huge number of DFs in R (>50), which correspond to different filterings I've performed; here's an example of 7 of them:
Steps_Day1 <- filter(PD2, Gait_Day == 1)
Steps_Day2 <- filter(PD2, Gait_Day == 2)
Steps_Day3 <- filter(PD2, Gait_Day == 3)
Steps_Day4 <- filter(PD2, Gait_Day == 4)
Steps_Day5 <- filter(PD2, Gait_Day == 5)
Steps_Day6 <- filter(PD2, Gait_Day == 6)
Steps_Day7 <- filter(PD2, Gait_Day == 7)
Each of the dataframes contains 19 variables, but I'm only interested in speed (to calculate a mean) and SubjectID, as each subject has multiple observations of speed in the same DF.
An example of the data we're interested in, in dataframe - Steps_Day1:
Speed SubjectID
0.6 1
0.7 1
0.7 2
0.8 2
0.1 2
1.1 3
1.2 3
1.5 4
1.7 4
0.8 4
The data goes up to 61 pts., and each participant's number of observations is much larger than this.
Now what I want to do is create code that automatically cycles through each of the 50 dataframes (taking the 7 above as an example), calculates the mean speed for each participant, and saves it in a new dataframe, alongside the variables containing the mean for each participant in the other DFs.
An example of the per-participant means for Steps_Day1 (values not accurate):
Speed SubjectID
0.6 1
0.7 2
1.2 3
1.7 4
and so on... before I end up with a final DF whose columns contain the means for each participant from each of the other data frames, which may look something like:
Steps_Day1 Steps_Day2 Steps_Day3 Steps_Day4 SubjectID
0.6 0.8 0.5 0.4 1
0.7 0.9 0.6 0.6 2
1.2 1.1 0.4 0.7 3
1.7 1.3 0.3 0.8 4
I could do this through some horrible, messy long code - but looking to see if anyone has more intuitive ideas please!
:)
To add to the previous answer, I agree that it is much easier to do this without creating a new data frame for each day. Using some generated data, you can achieve your desired results as follows:
library(dplyr)
library(tidyr)

# Generate some data
df <- data.frame(
  day = rep(1:5, 1, 100),
  subject = rep(5:10, 1, 100),
  speed = runif(500)
)

df %>%
  group_by(day, subject) %>%
  summarise(avg_speed = mean(speed)) %>%
  pivot_wider(names_from = day,
              names_prefix = "Steps_Day",
              values_from = avg_speed)
# A tibble: 6 × 6
subject Steps_Day1 Steps_Day2 Steps_Day3 Steps_Day4 Steps_Day5
<int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 5 0.605 0.416 0.502 0.516 0.517
2 6 0.592 0.458 0.625 0.531 0.460
3 7 0.475 0.396 0.586 0.517 0.449
4 8 0.430 0.435 0.489 0.512 0.548
5 9 0.512 0.645 0.509 0.484 0.566
6 10 0.530 0.453 0.545 0.497 0.460
You don't include a MCVE of your dataset so I can't test out a solution, but it seems like a pretty simple problem using tidyverse solutions.
First, why do you split PD2 into separate dataframes? If you skip that, you can just use group_by and summarize to get the average for each group:
PD2 %>%
  group_by(Gait_Day, SubjectID) %>%
  summarize(Steps = mean(Speed))
This will give you a "long-form" data.frame with 3 variables: Gait_Day, SubjectID, and Steps, which holds the mean speed for that subject and day. If you want it in the format you show at the end, just pivot into "wide-form" using pivot_wider. You can see this question for further explanation: How to reshape data from long to wide format
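The pivot step might look like this (PD2 below is a tiny hypothetical stand-in; column names follow the question):

```r
library(dplyr)
library(tidyr)

# Tiny hypothetical stand-in for PD2
PD2 <- data.frame(
  Gait_Day  = rep(1:2, each = 4),
  SubjectID = rep(1:2, times = 4),
  Speed     = c(0.6, 0.7, 0.7, 0.8, 0.5, 0.9, 0.4, 0.6)
)

wide <- PD2 %>%
  group_by(Gait_Day, SubjectID) %>%
  summarize(Steps = mean(Speed), .groups = "drop") %>%
  pivot_wider(names_from = Gait_Day, names_prefix = "Steps_Day",
              values_from = Steps)
```

This yields one row per SubjectID with a Steps_Day1 and Steps_Day2 column, matching the wide layout asked for.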

Selecting rows with time in R

I have a data frame that looks like this:
Subject Time Freq1 Freq2 ...
A 6:20 0.6 0.1
A 6:30 0.1 0.5
A 6:40 0.6 0.1
A 6:50 0.6 0.1
A 7:00 0.3 0.4
A 7:10 0.1 0.5
A 7:20 0.1 0.5
B 6:00 ... ...
I need to delete the rows in the time range that is not from 7:00 to 7:30, so in this case all the 6:00, 6:10, 6:20, ... rows.
I have tried creating a data frame with just the times I want to keep, but it does not seem to recognize the times as numbers or as names, and I get the same error when trying to directly remove the ones I don't need. It is probably quite simple, but I haven't found any solution.
Any suggestions?
We can convert the time column to a Period object with the lubridate package and then filter the data frame on that column.
library(dplyr)
library(lubridate)
dat2 <- dat %>%
  mutate(HM = hm(Time)) %>%
  filter(HM < hm("7:00") | HM > hm("7:30")) %>%
  select(-HM)
dat2
# Subject Time Freq1 Freq2
# 1 A 6:20 0.6 0.1
# 2 A 6:30 0.1 0.5
# 3 A 6:40 0.6 0.1
# 4 A 6:50 0.6 0.1
# 5 B 6:00 NA NA
DATA
dat <- read.table(text = "Subject Time Freq1 Freq2
A '6:20' 0.6 0.1
A '6:30' 0.1 0.5
A '6:40' 0.6 0.1
A '6:50' 0.6 0.1
A '7:00' 0.3 0.4
A '7:10' 0.1 0.5
A '7:20' 0.1 0.5
B '6:00' NA NA",
header = TRUE)
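If, on the other hand, only the 7:00-7:30 window should be kept, the same approach works with the condition flipped (a sketch on a small version of the data; the Period comparisons from hm() compare chronologically):

```r
library(dplyr)
library(lubridate)

dat <- read.table(text = "Subject Time Freq1 Freq2
A '6:50' 0.6 0.1
A '7:00' 0.3 0.4
A '7:10' 0.1 0.5
A '7:20' 0.1 0.5
B '6:00' NA NA",
header = TRUE)

kept <- dat %>%
  mutate(HM = hm(Time)) %>%
  filter(HM >= hm("7:00") & HM <= hm("7:30")) %>%  # keep the window instead
  select(-HM)
```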

Save every iteration of for loop

I asked this question before and solved the problem with Saga's help.
I am working on a simulation study, and I have to reorganize my results and continue the analysis.
I have a data matrix that contains many results, like this:
> data
It S X Y F
1 1 0.5 0.8 2.39
1 2 0.3 0.2 1.56
2 1 1.56 2.13 1.48
3 1 2.08 1.05 2.14
3 2 1.56 2.04 2.45
.......
It shows the iteration.
S shows the second-level iteration working inside of It.
X shows the X coordinate obtained from a method.
Y shows the Y coordinate obtained from a method.
F shows the F statistic.
My problem is that I have to find the minimum F value for every iteration, so I have to store every iteration in a different matrix or data frame and find the minimum F value.
I have tried many things, but nothing worked. Any help or ideas will be appreciated.
EDIT: Updated table information
This was the solution:
library(dplyr)
data %>%
  group_by(It) %>%
  slice(which.min(F))
# A tibble: 3 x 5
# Groups: It [3]
     It     S     X     Y     F
1     1     2  0.30  0.20  1.56
2     2     1  1.56  2.13  1.48
3     3     1  2.08  1.05  2.14
However, I will continue with another for loop, and I want to select the X values that satisfy the above condition.
For example, when I use data$X[i], this code doesn't select the values of X found above (0.30, 1.56, 2.08); it selects the original values from "data" before grouping. How can I solve this problem?
I hope this is what you are expecting:
> library(dplyr)
> data %>%
+   group_by(It) %>%
+   slice(which.min(F))
# A tibble: 3 x 5
# Groups: It [3]
It S X Y F
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2 0.30 0.20 1.56
2 2 1 1.56 2.13 1.48
3 3 1 2.08 1.05 2.14
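To reuse those minima in a later loop, save the sliced result and index it rather than the original data. A sketch using a small version of the data above:

```r
library(dplyr)

data <- data.frame(
  It = c(1, 1, 2, 3, 3),
  S  = c(1, 2, 1, 1, 2),
  X  = c(0.50, 0.30, 1.56, 2.08, 1.56),
  Y  = c(0.80, 0.20, 2.13, 1.05, 2.04),
  F  = c(2.39, 1.56, 1.48, 2.14, 2.45)
)

best <- data %>%
  group_by(It) %>%
  slice(which.min(F)) %>%   # one row per It: the row with the minimum F
  ungroup()

# Index best$X (not data$X) in the follow-up loop
best_X <- best$X
```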

extract irregular numeric data from strings

I have data like below. I wish to extract the first and last year from each string here called my.string. Some strings only contain one year and some strings contain no years. No strings contain more than two years. I have provided the desired result in the object named desired.result below the example data set. I am using R.
When a string contains two years those years are contained within a portion of the string that looks like this ga49.51 or ea22.24
When a string contains only one year that year is contained in a portion of the string that looks like this: time11
I know a bit about regex, but this problem seems too irregular and complex for me to figure out. I am not even sure where to begin. Thank you for any advice.
EDIT
Perhaps delete the numbers before the first colon (:) and the remaining numbers are what I want.
my.data <- read.table(text = '
my.string cov1 cov2
42:Alpha:ga6.8 -0.1 2.2
43:Alpha:ga9.11 -2.5 0.6
44:Alpha:ga30.32 -1.3 0.5
45:Alpha:ga49.51 -2.5 0.6
50:Alpha:time1:ga.time -1.7 0.9
51:Alpha:time2:ga.time -1.5 0.8
52:Alpha:time3:ga.time -1.0 1.0
2:Beta:ea2.9 -1.7 0.6
3:Beta:ea17.19 -5.0 0.8
4:Beta:ea22.24 -6.4 1.0
8:Beta:as 0.2 0.6
9:Beta:sd 1.7 0.4
12:Beta:time1:ea.tim -2.6 1.8
13:Beta:time10:ea.ti -3.6 1.1
14:Beta:time11:ea.ti -3.1 0.7
', header = TRUE, stringsAsFactors = FALSE, na.strings = "NA")
desired.result <- read.table(text = '
my.string cov1 cov2 time1 time2
42:Alpha:ga6.8 -0.1 2.2 6 8
43:Alpha:ga9.11 -2.5 0.6 9 11
44:Alpha:ga30.32 -1.3 0.5 30 32
45:Alpha:ga49.51 -2.5 0.6 49 51
50:Alpha:time1:ga.time -1.7 0.9 1 NA
51:Alpha:time2:ga.time -1.5 0.8 2 NA
52:Alpha:time3:ga.time -1.0 1.0 3 NA
2:Beta:ea2.9 -1.7 0.6 2 9
3:Beta:ea17.19 -5.0 0.8 17 19
4:Beta:ea22.24 -6.4 1.0 22 24
8:Beta:as 0.2 0.6 NA NA
9:Beta:sd 1.7 0.4 NA NA
12:Beta:time1:ea.tim -2.6 1.8 1 NA
13:Beta:time10:ea.ti -3.6 1.1 10 NA
14:Beta:time11:ea.ti -3.1 0.7 11 NA
', header = TRUE, stringsAsFactors = FALSE, na.strings = "NA")
I suggest using the stringr library to extract the data you need, since it handles NA values better and also allows a constrained-width lookbehind:
> library(stringr)
> my.data$time1 <- str_extract(my.data$my.string, "(?<=time)\\d+|(?<=\\b[ge]a)\\d+")
> my.data$time2 <- str_extract(my.data$my.string, "(?<=\\b[ge]a\\d{1,100}\\.)\\d+")
> my.data
my.string cov1 cov2 time1 time2
1 42:Alpha:ga6.8 -0.1 2.2 6 8
2 43:Alpha:ga9.11 -2.5 0.6 9 11
3 44:Alpha:ga30.32 -1.3 0.5 30 32
4 45:Alpha:ga49.51 -2.5 0.6 49 51
5 50:Alpha:time1:ga.time -1.7 0.9 1 <NA>
6 51:Alpha:time2:ga.time -1.5 0.8 2 <NA>
7 52:Alpha:time3:ga.time -1.0 1.0 3 <NA>
8 2:Beta:ea2.9 -1.7 0.6 2 9
9 3:Beta:ea17.19 -5.0 0.8 17 19
10 4:Beta:ea22.24 -6.4 1.0 22 24
11 8:Beta:as 0.2 0.6 <NA> <NA>
12 9:Beta:sd 1.7 0.4 <NA> <NA>
13 12:Beta:time1:ea.tim -2.6 1.8 1 <NA>
14 13:Beta:time10:ea.ti -3.6 1.1 10 <NA>
15 14:Beta:time11:ea.ti -3.1 0.7 11 <NA>
The first regex matches:
(?<=time)\\d+ - 1+ digits that have time before them
| - or
(?<=\\b[ge]a)\\d+ - 1+ digits that have ga or ea as a whole word in front
The second regex matches:
(?<=\\b[ge]a\\d{1,100}\\.) - check if the current position is preceded with ga or ea as a whole word followed with 1 to 100 digits (I believe that should be enough for your scenario; 100-digit chunks are hardly expected here, and you may even decrease the value), and then a .
\\d+ - 1+ digits
Here's a regex that will extract either of the two types, and output them to different columns at the end of the lines:
Search: .*(?:time(\d+)|(?:[ge]a)(\d+)\.(\d+)).*
Replace: $0\t$1\t$2\t$3
Breakdown:
.*(?: ... ).* ensures that the whole line is matched, and uses a non-capturing group for the main alternation
time(\d+): this is the first half of the alternation, capturing any digits after a "time"
(?:[ge]a)(\d+)\.(\d+): the second half of the alternation matches "ga" or "ea" followed by two sets of digits, each in its own capture group
Replacement: $0 puts the whole line back. Each of the other capture groups are added, with tabs in-between.
See regex101 example

R explicit subscript function

I am looking for an explicit function to subscript elements in R, say subscript(x,i) to mean x[i].
The reason that I need this traces back to a piece of code using dplyr and the magrittr pipe operator (which is not a pipe), where I need to divide by the first element of each column. In pseudocode:
pipedDF <- rawdata %>% filter, merge, summarize, dcast %>%
  mutate_each(funs(. / subscript(., 1)), -index)
I think this would do the trick and keep that pipe syntax which people like.
Without dplyr it would look like this...
Example,
> df
index a b c
1 1 6.00 5.0 4
2 2 7.50 6.0 5
3 3 5.00 4.5 6
4 4 9.00 7.0 7
> data.frame(sapply(df, function(x) x / x[1]))
index a b c
1 1 1.00 1.0 1.00
2 2 1.25 1.2 1.25
3 3 0.83 0.9 1.50
4 4 1.50 1.4 1.75
You should be able to use `[`, as in
x <- 5:1
`[`(x, 2)
# [1] 4
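Applied to the question's setting, `[` slots straight into a pipe. A sketch with the example df, using across() in place of the older mutate_each/funs (which are superseded in current dplyr):

```r
library(dplyr)

df <- data.frame(index = 1:4,
                 a = c(6, 7.5, 5, 9),
                 b = c(5, 6, 4.5, 7),
                 c = 4:7)

scaled <- df %>%
  mutate(across(-index, ~ . / `[`(., 1)))  # divide each column by its first element
```

The first row of every scaled column becomes 1, matching the sapply output shown above.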
