R: convert numbers stored as character in a tibble to double without introducing NAs

I'd like to convert the relevant columns in the following tibble to numeric (double precision):
# A tibble: 6 x 6
Date Open High Low Close Shares
<chr> <chr> <chr> <chr> <chr> <chr>
1 16.04.2021 53,64 54,12 53,64 54,12 50
2 15.04.2021 53,19 53,19 53,19 53,19 -
3 14.04.2021 53,29 53,29 53,29 53,29 -
4 13.04.2021 52,86 52,86 52,86 52,86 -
5 12.04.2021 53,17 53,17 53,17 53,17 -
6 09.04.2021 53,18 53,18 53,18 53,18 -
However, if I apply as.numeric to the relevant columns, NAs are introduced.
What is the most efficient way to convert the entries in the relevant columns to double without generating the NAs?
Reproducible sample data:
df <- tribble(
~Date, ~Open, ~High, ~Low, ~Close, ~Shares,
"16.04.2021", "53,64", "54,12", "53,64", "54,12", 50,
"15.04.2021", "53,19", "53,19", "53,19", "53,19", NA,
"14.04.2021", "53,29", "53,29", "53,29", "53,29", NA,
"13.04.2021", "52,86", "52,86", "52,86", "52,86", NA,
"12.04.2021", "53,17", "53,17", "53,17", "53,17", NA,
"09.04.2021", "53,18", "53,18", "53,18", "53,18", NA
)

You can replace comma with a dot and convert to numeric. Use lapply to apply the function to multiple columns.
df[2:5] <- lapply(df[2:5], function(x) as.numeric(sub(',', '.', x)))
Using dplyr :
library(dplyr)
library(readr)
df %>%
mutate(across(Open:Close, ~parse_number(., locale = locale(decimal_mark = ","))))
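To see why the NAs appear and what the replacement fixes, here is a minimal base R sketch on a small character vector. The helper name to_double is hypothetical, and the "-" entry stands in for the placeholder shown in the Shares column of the printed tibble.

```r
# Strings with a decimal comma, plus a "-" placeholder for a missing value
x <- c("53,64", "54,12", "-")

# as.numeric() only understands "." as the decimal mark, so every entry is NA
suppressWarnings(as.numeric(x))

# Hypothetical helper: swap the decimal comma for a dot, then convert.
# fixed = TRUE treats "," as a literal string rather than a regex.
to_double <- function(x) {
  x <- sub(",", ".", x, fixed = TRUE)
  suppressWarnings(as.numeric(x))
}

to_double(x)
```

With this helper the two prices convert cleanly, while "-" still becomes NA, which is usually what you want for a genuinely missing value.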

The reason you can't convert them to numeric values is that they use "," as the decimal separator instead of ".". You can use the following code:
library(dplyr)
library(stringr)
df %>%
mutate(across(Open:Close, ~ str_replace(., ",", ".")),
across(Open:Close, as.numeric))
# A tibble: 6 x 6
Date Open High Low Close Shares
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 16.04.2021 53.6 54.1 53.6 54.1 50
2 15.04.2021 53.2 53.2 53.2 53.2 NA
3 14.04.2021 53.3 53.3 53.3 53.3 NA
4 13.04.2021 52.9 52.9 52.9 52.9 NA
5 12.04.2021 53.2 53.2 53.2 53.2 NA
6 09.04.2021 53.2 53.2 53.2 53.2 NA

First, remove any "." characters (thousands separators); the "." must be escaped because it is a regex metacharacter.
Second, replace each comma with a "." so that the values can be converted to numeric:
df %>%
mutate(across(2:5, ~as.numeric(gsub(",", ".", gsub("\\.", "", .)))))
Output:
Date Open High Low Close Shares
<chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 16.04.2021 53.6 54.1 53.6 54.1 50
2 15.04.2021 53.2 53.2 53.2 53.2 -
3 14.04.2021 53.3 53.3 53.3 53.3 -
4 13.04.2021 52.9 52.9 52.9 52.9 -
5 12.04.2021 53.2 53.2 53.2 53.2 -
6 09.04.2021 53.2 53.2 53.2 53.2 -
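The two-step idea above matters once values contain a thousands separator. A small base R sketch, assuming European-style input such as "1.234,56" (the sample values and the helper name parse_european are made up for illustration):

```r
# Hypothetical helper: drop "." grouping marks, then turn the decimal comma into "."
parse_european <- function(x) {
  no_grouping <- gsub(".", "", x, fixed = TRUE)   # "1.234,56" -> "1234,56"
  as.numeric(sub(",", ".", no_grouping, fixed = TRUE))
}

parse_european(c("1.234,56", "53,64"))
```

Apply this only to the price columns; running it over the Date column would strip the dots from the dates.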

Related

Create a temporary group in dplyr group_by

I would like to group all members of the same genus together for some summary statistics, but I would like to keep their full names in the original data frame. I know that I could change their names or create a new column in the original data frame, but I am looking for a more elegant solution. I would like to implement this in R with the dplyr package.
Example data here https://knb.ecoinformatics.org/knb/d1/mn/v2/object/urn%3Auuid%3Aeb176981-1909-4d6d-ac07-3406e4efc43f
I would like to group all clams of the genus Macoma as one group, "Macoma sp.", ideally creating this grouping within the following code, perhaps before the group_by(site_code, species_scientific):
summary <- data %>%
group_by(site_code, species_scientific) %>%
summarize(mean_size = mean(width_mm))
Note that there are multiple Macoma xxx species and multiple other species that I want to group as is.
We may modify species_scientific by replacing the elements that contain the substring 'Macoma' (detected with str_detect) with 'Macoma', then use that as the grouping column and get the mean:
library(dplyr)
library(stringr)
data %>%
mutate(species_scientific = replace(species_scientific,
str_detect(species_scientific, "Macoma"), "Macoma")) %>%
group_by(site_code, species_scientific) %>%
summarise(mean_size = mean(width_mm, na.rm = TRUE), .groups = 'drop')
-output
# A tibble: 97 × 3
site_code species_scientific mean_size
<chr> <chr> <dbl>
1 H_01_a Clinocardium nuttallii 33.9
2 H_01_a Macoma 41.0
3 H_01_a Protothaca staminea 37.3
4 H_01_a Saxidomus gigantea 56.0
5 H_01_a Tresus nuttallii 100.
6 H_02_a Clinocardium nuttallii 35.1
7 H_02_a Macoma 41.3
8 H_02_a Protothaca staminea 38.0
9 H_02_a Saxidomus gigantea 54.7
10 H_02_a Tresus nuttallii 50.5
# … with 87 more rows
If the intention is to keep only the first word in 'species_scientific'
data %>%
group_by(genus = str_remove(species_scientific, "\\s+.*"), site_code) %>%
summarise(mean_size = mean(width_mm, na.rm = TRUE), .groups = 'drop')
-output
# A tibble: 97 × 3
genus site_code mean_size
<chr> <chr> <dbl>
1 Clinocardium H_01_a 33.9
2 Clinocardium H_02_a 35.1
3 Clinocardium H_03_a 37.5
4 Clinocardium H_04_a 48.2
5 Clinocardium H_05_a 37.6
6 Clinocardium H_06_a 38.7
7 Clinocardium H_07_a 40.2
8 Clinocardium L_01_a 44.4
9 Clinocardium L_02_a 54.8
10 Clinocardium L_03_a 61.1
# … with 87 more rows
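If the goal is to leave the original data frame untouched, the grouping label can live in a separate vector. The following is a base R sketch on made-up toy data (the rows and the name clams are assumptions, not the linked dataset):

```r
# Toy stand-in for the real data; values are invented for illustration
clams <- data.frame(
  site_code = c("H_01_a", "H_01_a", "H_01_a", "H_02_a"),
  species_scientific = c("Macoma nasuta", "Macoma inquinata",
                         "Saxidomus gigantea", "Macoma nasuta"),
  width_mm = c(40, 42, 56, 38)
)

# Temporary grouping label: all Macoma species collapse to "Macoma sp.",
# every other species keeps its full name
group_label <- ifelse(grepl("^Macoma", clams$species_scientific),
                      "Macoma sp.", clams$species_scientific)

# Aggregate by site and the temporary label; clams itself is not modified
result <- aggregate(clams["width_mm"],
                    by = list(site_code = clams$site_code, species = group_label),
                    FUN = mean)
names(result)[3] <- "mean_size"
result
```

The same idea works in dplyr by computing the label inside group_by(), as in the second snippet above.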

Is there any function that gives the changes between columns?

I have a df that looks like this.
head(dfhigh)
rownames 2015Y 2016Y 2017Y 2018Y 2019Y 2020Y 2021Y
1 Australia 29583.7403 48397.383 45220.323 68461.941 39218.044 20140.351 29773.188
2 Austria* 1294.5092 -8400.973 14926.164 5511.625 2912.795 -14962.963 5855.014
3 Belgium* -24013.3111 68177.596 -3057.153 27119.084 -9208.553 13881.481 22955.298
4 Canada 43852.7732 36061.859 22764.156 37653.521 50141.784 23174.006 59693.992
5 Chile* 20507.8407 12249.294 6128.716 7735.778 12499.238 8385.907 15251.538
6 Czech Republic 465.2137 9814.496 9517.948 11010.423 10108.914 9410.576 5805.084
I want to calculate the changes between years, so instead of the values, the table has the percentage of change (obviously deleting 2015Y).
Try this, using (current - previous) / previous * 100:
lst <- list()
nm <- names(dfhigh)[-1]
for(i in 1:(length(nm) - 1)){
lst[[i]] <- (dfhigh[[nm[i+1]]] - dfhigh[[nm[i]]]) / dfhigh[[nm[i]]] * 100
}
ans <- do.call(cbind , lst)
colnames(ans) <- paste("ch_of" , nm[-1])
ans
You can change the formula to calculate the percentage however you want.
You could also use a tidyverse solution.
library(tidyverse)
dfhigh %>%
pivot_longer(!rownames) %>%
group_by(rownames) %>%
mutate(value = 100*value/lag(value)-100) %>%
ungroup() %>%
pivot_wider(names_from = name, values_from = value)
# # A tibble: 6 × 8
# rownames `2015Y` `2016Y` `2017Y` `2018Y` `2019Y` `2020Y` `2021Y`
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 Australia NA 63.6 -6.56 51.4 -42.7 -48.6 47.8
# 2 Austria* NA -749. -278. -63.1 -47.2 -614. -139.
# 3 Belgium* NA -384. -104. -987. -134. -251. 65.4
# 4 Canada NA -17.8 -36.9 65.4 33.2 -53.8 158.
# 5 Chile* NA -40.3 -50.0 26.2 61.6 -32.9 81.9
# 6 CzechRepublic NA 2010. -3.02 15.7 -8.19 -6.91 -38.3

Extract value from previous row based on a condition

I have a dataset that looks as follows:
data <- tribble(
~Date, ~Ticker, ~Close, ~Open,
"1989-09-11","COND",77.3292,77.3292,
"1989-09-12","COND",77.4435,77.4435,
"1989-09-13","COND",76.3118,76.3118,
"1989-09-14","COND",75.5309,75.6344,
"1989-09-15","COND",75.6598,75.4675)
# A tibble: 5 x 4
Date Ticker Close Open
<chr> <chr> <dbl> <dbl>
1 1989-09-11 COND 77.3 77.3
2 1989-09-12 COND 77.4 77.4
3 1989-09-13 COND 76.3 76.3
4 1989-09-14 COND 75.5 75.6
5 1989-09-15 COND 75.7 75.5
The issue is that, until a certain date, the closing price is identical to the opening price. What I'm trying to do is write a function that checks whether the opening and closing prices are the same and, if so, replaces the opening price with the closing price from the previous row. Applied to the above data, it would transform the data as follows:
# A tibble: 5 x 4
Date Ticker Close Open
<chr> <chr> <dbl> <dbl>
1 1989-09-11 COND 77.3 NA
2 1989-09-12 COND 77.4 77.3
3 1989-09-13 COND 76.3 77.4
4 1989-09-14 COND 75.5 75.6
5 1989-09-15 COND 75.7 75.5
I tried to do it with an if statement, but I'm running into problems as soon as I try to get the value from the previous row in the "Close" column to the current "Open" value.
In dplyr, it's a simple mutate with lag.
library(dplyr)
data %>%
mutate(Open = if_else(Open == Close, lag(Close), Open))
## A tibble: 5 x 4
# Date Ticker Close Open
# <chr> <chr> <dbl> <dbl>
#1 1989-09-11 COND 77.3 NA
#2 1989-09-12 COND 77.4 77.3
#3 1989-09-13 COND 76.3 77.4
#4 1989-09-14 COND 75.5 75.6
#5 1989-09-15 COND 75.7 75.5
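For readers who want to see what lag() does here, the following base R sketch builds the "previous row" vector by hand, using the values from the question (the helper name prev_close is an assumption):

```r
data <- data.frame(
  Date  = c("1989-09-11", "1989-09-12", "1989-09-13", "1989-09-14", "1989-09-15"),
  Close = c(77.3292, 77.4435, 76.3118, 75.5309, 75.6598),
  Open  = c(77.3292, 77.4435, 76.3118, 75.6344, 75.4675)
)

# Shift Close down by one row: the first row has no predecessor, so it gets NA
prev_close <- c(NA, head(data$Close, -1))

# Where Open equals Close, take the previous row's Close instead
data$Open <- ifelse(data$Open == data$Close, prev_close, data$Open)
data
```

The == comparison is exact; if the two prices were computed rather than copied, comparing with a small tolerance, for example abs(Open - Close) < 1e-8, may be safer.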

Calculating mean age by group in R

I have the following data: https://raw.githubusercontent.com/fivethirtyeight/data/master/congress-age/congress-terms.csv
I'm trying to determine how to calculate the mean age of members of Congress by year (termstart) for each party (Republican and Democrat).
I was hoping for some help on how to go about doing this. I am a beginner in R and I'm just playing around with the data.
Thanks!
Try this approach. Filter for the required parties and then summarise. After that, you can reshape to wide format to have both parties side by side for each date. Here is the code using tidyverse functions:
library(dplyr)
library(tidyr)
#Data
df <- read.csv('https://raw.githubusercontent.com/fivethirtyeight/data/master/congress-age/congress-terms.csv', stringsAsFactors = FALSE)
#Code
newdf <- df %>% filter(party %in% c('R', 'D')) %>%
group_by(termstart, party) %>% summarise(MeanAge = mean(age, na.rm = TRUE)) %>%
pivot_wider(names_from = party, values_from = MeanAge)
Output:
# A tibble: 34 x 3
# Groups: termstart [34]
termstart D R
<chr> <dbl> <dbl>
1 1947-01-03 52.0 53.0
2 1949-01-03 51.4 54.6
3 1951-01-03 52.3 54.3
4 1953-01-03 52.3 54.1
5 1955-01-05 52.3 54.7
6 1957-01-03 53.2 55.4
7 1959-01-07 52.4 54.7
8 1961-01-03 53.4 53.9
9 1963-01-09 53.3 52.6
10 1965-01-04 52.3 52.2
# ... with 24 more rows
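The filter, group, and reshape steps can also be done in one call with base R's tapply(), which returns a termstart-by-party table directly. A sketch on a made-up toy data frame (the rows below are invented and only mimic the termstart, party, and age columns):

```r
# Toy stand-in for the congress-terms data; values are invented
terms <- data.frame(
  termstart = c("1947-01-03", "1947-01-03", "1947-01-03",
                "1949-01-03", "1949-01-03", "1949-01-03"),
  party = c("D", "R", "I", "D", "D", "R"),
  age   = c(50, 54, 60, 48, 52, 56)
)

# Keep only the two major parties
major <- terms[terms$party %in% c("D", "R"), ]

# Mean age per term start (rows) and party (columns)
mean_age <- tapply(major$age, list(major$termstart, major$party), mean, na.rm = TRUE)
mean_age
```

The result is a matrix rather than a tibble, so individual cells can be read with mean_age["1947-01-03", "D"].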

Aggregate and adding new column

I have a dataset with district name, household latitude, and longitude. The dataset has 2000 household locations. I want to calculate the mean of latitude and longitude based on district name. Next, I want to add two new columns (i.e. Lat_mean, Long_mean) in which the mean Lat and Long will be stored for each household.
I was just able to aggregate the mean values for latitude and longitude. I don't know how to paste the summarized data as a new column for each ID (see code)
id <- c(1,2,3,4,5,6)
district <- c("A", "B", "C", "A", "A", "B")
lat <- c(28.6, 30.2, 35.9, 27.5, 27.9, 31.5)
long <- c(77.5, 85.2, 66.5, 75.0, 79.2, 88.8)
df <- data.frame(id, district, lat, long)
df_group <- df %>% group_by(district) %>% summarise_at(vars(lat:long), mean)
I am expecting the following. Lat_mean & Long_mean columns will be added to 'df' and each ID will have values based on district name. See the image below.
We can use mutate_at instead of summarise_at. Within the list, give the function a name, so that each new column is created with that name as a suffix:
library(dplyr)
df %>%
group_by(district) %>%
mutate_at(vars(lat, long), list(mean = mean))
# A tibble: 6 x 6
# Groups: district [3]
# id district lat long lat_mean long_mean
# <dbl> <fct> <dbl> <dbl> <dbl> <dbl>
#1 1 A 28.6 77.5 28 77.2
#2 2 B 30.2 85.2 30.8 87
#3 3 C 35.9 66.5 35.9 66.5
#4 4 A 27.5 75 28 77.2
#5 5 A 27.9 79.2 28 77.2
#6 6 B 31.5 88.8 30.8 87
Another option is base R's ave() inside mutate:
df %>%
mutate(lat_mean = ave(lat, district, FUN=mean),
lon_mean = ave(long, district, FUN=mean))
id district lat long lat_mean lon_mean
1 1 A 28.6 77.5 28.00 77.23333
2 2 B 30.2 85.2 30.85 87.00000
3 3 C 35.9 66.5 35.90 66.50000
4 4 A 27.5 75.0 28.00 77.23333
5 5 A 27.9 79.2 28.00 77.23333
6 6 B 31.5 88.8 30.85 87.00000
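The question also asks how to attach an already summarised table back to each household. One way, sketched here in base R, is to aggregate per district and then merge the result onto the original rows (the column names Lat_mean and Long_mean follow the question; district_means and joined are assumed names):

```r
df <- data.frame(
  id = c(1, 2, 3, 4, 5, 6),
  district = c("A", "B", "C", "A", "A", "B"),
  lat  = c(28.6, 30.2, 35.9, 27.5, 27.9, 31.5),
  long = c(77.5, 85.2, 66.5, 75.0, 79.2, 88.8)
)

# One row per district with the mean latitude and longitude
district_means <- aggregate(cbind(lat, long) ~ district, data = df, FUN = mean)
names(district_means) <- c("district", "Lat_mean", "Long_mean")

# Join the district means back onto every household, then restore the id order
joined <- merge(df, district_means, by = "district")
joined <- joined[order(joined$id), ]
joined
```

merge() sorts by the key column, which is why the rows are reordered by id afterwards.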
