Create a dataframe in R

I would like to create a dataframe with 117 columns and 90 rows, the first columns being: ID, date1, date2, Category, DR1, DRM01, DRM02, DRM03 ... up to DRM111. The ID column would have values ranging from 1 to 3. date1 would have a fixed value, "2022-01-05"; date2 would have values from 2021-12-20 up to whatever maximum the 90 rows reach. Category can be ABC or ERF, DR1 would have values varying from 200 to 250, and the DRM columns would have values varying from 0 to 300. Is it possible to create a dataframe like this?

I'm wondering whether this is an effort at simulation. The first few tasks seem obvious, but the last one, the call to replicate with simplify=FALSE, might be a bit less than trivial.
test <- data.frame( ID = rep(1:3, length=90),
date1 = as.Date( "2022-01-05"),
date2= seq( as.Date("2021-12-20"), length.out=90, by=1),
#Category = ???? so far not specified
DR1 = sample( 200:250, 90, replace=TRUE), # replace=TRUE is needed: 90 draws from only 51 values
setNames( replicate(111, { sample(0:300, 90)}, simplify=FALSE) ,
nm=paste("DRM",1:111) ) )
Snipped the last 105 rows of the output from str:
str(test)
'data.frame': 90 obs. of 115 variables:
$ ID : int 1 2 3 1 2 3 1 2 3 1 ...
$ date1 : Date, format: "2022-01-05" "2022-01-05" "2022-01-05" "2022-01-05" ...
$ date2 : Date, format: "2021-12-20" "2021-12-21" "2021-12-22" "2021-12-23" ...
$ DR1 : int 229 218 240 243 221 202 242 221 237 208 ...
$ DRM.1 : int 41 238 142 100 19 56 224 152 85 84 ...
$ DRM.2 : int 150 185 141 55 34 83 88 105 165 294 ...
$ DRM.3 : int 144 22 237 174 78 291 120 63 261 236 ...
$ DRM.4 : int 223 105 263 214 45 226 129 80 182 15 ...
$ DRM.5 : int 27 108 288 237 129 251 150 70 300 243 ...
# additional rows elided
The last item in that construction returns a list that has 111 "columns" with ascending numbered names. I admit to being puzzled about why there were periods in the DRM names but then realized that the data.frame function uses check.names to make sure they are legitimate, so the spaces from paste were converted to periods. If you don't like periods then use paste0.
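A possible variant that also fills in Category and produces the zero-padded names from the question (DRM01 ... DRM111). This is only a sketch: the seed, the object names n, drm and test2, and the choice of sprintf("DRM%02d", ...) for the padding are assumptions, not part of the original answer.

```r
set.seed(1)   # assumed seed, only so the draws are repeatable
n <- 90

# 111 list elements of 90 draws each, named DRM01 ... DRM111
drm <- setNames(replicate(111, sample(0:300, n), simplify = FALSE),
                sprintf("DRM%02d", 1:111))

test2 <- data.frame(
  ID       = rep(1:3, length.out = n),
  date1    = as.Date("2022-01-05"),
  date2    = seq(as.Date("2021-12-20"), length.out = n, by = 1),
  Category = sample(c("ABC", "ERF"), n, replace = TRUE),
  DR1      = sample(200:250, n, replace = TRUE),
  drm       # the named list is expanded into 111 columns
)

dim(test2)   # 90 rows, 116 columns (5 fixed columns + 111 DRM columns)
```

Because the names contain no spaces, check.names has nothing to repair and no periods appear.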

Related

I was trying to mutate a new numeric column into a dataframe, but R seems to treat it as character and I am not even able to access it by index.

library(dslabs)
data(heights)
library(dplyr)
mutate(heights, ht_cm = height * 2.54, stringsAsFactor = FALSE )
str(heights) # not showing ht_cm as a variable in the data frame
mean(heights$ht_cm) # giving error that argument is not numeric
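mutate returns a modified copy; it does not change heights in place. To keep the new column, assign the result back to heights (and drop stringsAsFactor = FALSE, which mutate would treat as one more new column):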
Code
heights <-
heights %>%
mutate(ht_cm = height * 2.54)
Output
str(heights)
'data.frame': 1050 obs. of 3 variables:
$ sex : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 1 1 1 1 2 ...
$ height: num 75 70 68 74 61 65 66 62 66 67 ...
$ ht_cm : num 190 178 173 188 155 ...
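The same assign-the-result idea can be sketched in base R, without dplyr. The small data frame below is a hypothetical stand-in for dslabs::heights, with a few values taken from the str() output above.

```r
# Hypothetical stand-in for dslabs::heights
heights <- data.frame(
  sex    = factor(c("Male", "Male", "Male", "Female")),
  height = c(75, 70, 68, 65)
)

heights$height * 2.54                    # computes the values but stores nothing
heights$ht_cm <- heights$height * 2.54   # assignment is what adds the column

str(heights)          # ht_cm now appears as a numeric column
mean(heights$ht_cm)   # works because ht_cm exists and is numeric
```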

Using dplyr to compute calculated fields depending on multiple columns without explicitly writing column names

Consider the following code.
set.seed(56)
library(dplyr)
df <- data.frame(
NUM_1 = sample.int(500, replace = TRUE),
DENOM_1 = sample.int(500, replace = TRUE),
NUM_2 = sample.int(500, replace = TRUE),
DENOM_2 = sample.int(500, replace = TRUE)
)
head(df)
NUM_1 DENOM_1 NUM_2 DENOM_2
1 417 379 154 173
2 160 437 239 154
3 243 315 106 361
4 291 169 393 340
5 170 450 429 421
6 422 131 75 64
Without having to manually specify each of the column names (the actual problem has about 40 of these I need to create), I would like to create columns FRAC_1 and FRAC_2 for which FRAC_X = NUM_X/DENOM_X.
So, this would be what I'm looking for with regard to output, but since I'm dealing with about 40 of these, I don't want to have to manually type out each column:
df_frac <- df %>%
mutate(FRAC_1 = NUM_1 / DENOM_1,
FRAC_2 = NUM_2 / DENOM_2)
head(df_frac)
NUM_1 DENOM_1 NUM_2 DENOM_2 FRAC_1 FRAC_2
1 417 379 154 173 1.1002639 0.8901734
2 160 437 239 154 0.3661327 1.5519481
3 243 315 106 361 0.7714286 0.2936288
4 291 169 393 340 1.7218935 1.1558824
5 170 450 429 421 0.3777778 1.0190024
6 422 131 75 64 3.2213740 1.1718750
I would strongly prefer a dplyr solution to this. I thought maybe I could use mutate() with across(), but it isn't clear to me how to tell across() to pair the NUM_x with the corresponding DENOM_x columns.
Here is one option in the tidyverse:
Loop across the columns whose names start with 'NUM' (starts_with).
Take the current column name from cur_column() and replace the substring 'NUM' with 'DENOM' using str_replace.
get the value of that DENOM column, divide the NUM column by it, and rename the result through .names to create the 'FRAC' columns.
library(dplyr)
library(stringr)
df <- df %>%
mutate(across(starts_with("NUM"), ~
./get(str_replace(cur_column(), 'NUM', 'DENOM')),
.names = "{str_replace(.col, 'NUM', 'FRAC')}"))
-output
head(df)
NUM_1 DENOM_1 NUM_2 DENOM_2 FRAC_1 FRAC_2
1 417 379 154 173 1.1002639 0.8901734
2 160 437 239 154 0.3661327 1.5519481
3 243 315 106 361 0.7714286 0.2936288
4 291 169 393 340 1.7218935 1.1558824
5 170 450 429 421 0.3777778 1.0190024
6 422 131 75 64 3.2213740 1.1718750
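For comparison, the same NUM/DENOM pairing can be sketched in base R. This is not the dplyr solution the question asks for, just an illustration of the pairing logic; the two-row data frame and the helper name ids are made up for the example.

```r
# Two illustrative rows with the same column layout as the question
df <- data.frame(
  NUM_1 = c(417, 160), DENOM_1 = c(379, 437),
  NUM_2 = c(154, 239), DENOM_2 = c(173, 154)
)

# Suffixes ("1", "2", ...) taken from the NUM_ columns that are present
ids <- sub("^NUM_", "", grep("^NUM_", names(df), value = TRUE))

# Dividing two data frames of equal shape works element by element,
# so NUM_1 pairs with DENOM_1, NUM_2 with DENOM_2, and so on
df[paste0("FRAC_", ids)] <- df[paste0("NUM_", ids)] / df[paste0("DENOM_", ids)]

df
```

The pairing relies on both sets of columns being selected in the same suffix order, which building the names from ids guarantees.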

How do I plot asset stock prices in R?

I'm trying to plot asset stock prices in R. I'm downloading the data in csv format from Yahoo Finance and then importing it to R so I can run some statistical tests on it and draw a few plots.
I'm currently trying to plot the closing price vs the date, and I'm not having a lot of success. R is just plotting it as a series of distinct points and won't join these points up with lines, despite me trying to use the argument type = "l".
price <- read.csv("~/Downloads/AAPL.csv")
plot(price$Date,price$Close,type="l")
I'm just grabbing the data from here: https://finance.yahoo.com/quote/AAPL/history?p=AAPL
I get an output like this every time, regardless of what kind of extra arguments I try.
For example, I tried to make it red, didn't change at all.
Thanks!
The problem is that price$Date is a factor (a categorical variable; in R 4.0 and later read.csv returns a character vector instead) and not a number. You can convert the date string to a POSIX timestamp with as.POSIXlt, and then compute a floating point representation from it, e.g. year + yday/366.
Try this
price$Date = as.Date(price$Date)
plot(price$Date,price$Close,type="l",col=4)
or better
library(quantmod)
fro = '2014-07-31'
Apple = getSymbols('AAPL',auto.assign = F,from=fro)
chartSeries(Apple,subset = "last 3 years")
You don't need to use a package unless you want to create candlestick charts.
df <- read.csv("AAPL.csv")
> str(df)
'data.frame': 254 obs. of 7 variables:
$ Date : Factor w/ 254 levels "2019-07-10","2019-07-11",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Open : num 202 203 202 204 205 ...
$ High : num 204 204 204 206 206 ...
$ Low : num 202 202 202 204 204 ...
$ Close : num 203 202 203 205 204 ...
$ Adj.Close: num 201 199 201 203 202 ...
$ Volume : int 17897100 20191800 17595200 16947400 16866800 14107500 18582200 20929300 22277900 18355200 ...
df$Date <- as.Date(df$Date) # Otherwise it is treated as a factor variable
> str(df)
'data.frame': 254 obs. of 7 variables:
$ Date : Date, format: "2019-07-10" "2019-07-11" "2019-07-12" "2019-07-15" ...
$ Open : num 202 203 202 204 205 ...
$ High : num 204 204 204 206 206 ...
$ Low : num 202 202 202 204 204 ...
$ Close : num 203 202 203 205 204 ...
$ Adj.Close: num 201 199 201 203 202 ...
$ Volume : int 17897100 20191800 17595200 16947400 16866800 14107500 18582200 20929300 22277900 18355200 ...
plot(y=df$Close, x=df$Date, col="red", type = "l") # look at ?plot for more details
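A self-contained sketch of the same fix that runs without the downloaded CSV. The four rows are a made-up stand-in for AAPL.csv, and stringsAsFactors = TRUE is set only to mimic the factor column shown above.

```r
# Made-up stand-in for the Yahoo Finance CSV
price <- data.frame(
  Date  = c("2019-07-10", "2019-07-11", "2019-07-12", "2019-07-15"),
  Close = c(203, 202, 203, 205),
  stringsAsFactors = TRUE   # mimic the factor column from older read.csv defaults
)

class(price$Date)   # "factor": plot() treats this as categories

# as.character() first, so the conversion works for factor or character input
price$Date <- as.Date(as.character(price$Date))
class(price$Date)   # "Date"

plot(price$Date, price$Close, type = "l", col = "red")   # now drawn as a line
```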

How can I fully extract all elements into a data frame?

I retrieve some data from an API and convert it to a flat structure.
library(httr)
library(jsonlite) # provides fromJSON
url <- "https://api.carbonintensity.org.uk/intensity/2019-11-25/2019-11-26"
raw_original <- GET(url)
raw <- rawToChar(raw_original$content)
raw <- fromJSON(raw)
api_extr <- do.call("rbind", lapply(raw, data.frame))
At first, all seems well (a 5-column data frame):
> head(api_extr)
from to intensity.forecast intensity.actual intensity.index
1 2019-11-24T23:30Z 2019-11-25T00:00Z 210 200 moderate
2 2019-11-25T00:00Z 2019-11-25T00:30Z 199 200 moderate
3 2019-11-25T00:30Z 2019-11-25T01:00Z 200 198 moderate
4 2019-11-25T01:00Z 2019-11-25T01:30Z 204 189 moderate
5 2019-11-25T01:30Z 2019-11-25T02:00Z 199 191 moderate
6 2019-11-25T02:00Z 2019-11-25T02:30Z 192 193 moderate
However, one of the columns (intensity) is in fact a data frame which contains three further columns.
> str(api_extr)
'data.frame': 49 obs. of 3 variables:
$ from : chr "2019-11-24T23:30Z" "2019-11-25T00:00Z" "2019-11-25T00:30Z" "2019-11-25T01:00Z" ...
$ to : chr "2019-11-25T00:00Z" "2019-11-25T00:30Z" "2019-11-25T01:00Z" "2019-11-25T01:30Z" ...
$ intensity:'data.frame': 49 obs. of 3 variables:
..$ forecast: int 210 199 200 204 199 192 191 194 197 192 ...
..$ actual : int 200 200 198 189 191 193 197 193 193 194 ...
..$ index : chr "moderate" "moderate" "moderate" "moderate" ...
I would expect the data frame to have five columns whereas instead it only has three.
At first glance this may seem insignificant, but the problems start when it comes to working with the data (e.g. plotting it).
How can I achieve five columns?
You can pass the URL directly to fromJSON and flatten the result in a single step.
library(jsonlite)
url <- "https://api.carbonintensity.org.uk/intensity/2019-11-25/2019-11-26"
df <- fromJSON(url, flatten = TRUE)[[1]]
str(df)
'data.frame': 49 obs. of 5 variables:
$ from : chr "2019-11-24T23:30Z" "2019-11-25T00:00Z" "2019-11-25T00:30Z" "2019-11-25T01:00Z" ...
$ to : chr "2019-11-25T00:00Z" "2019-11-25T00:30Z" "2019-11-25T01:00Z" "2019-11-25T01:30Z" ...
$ intensity.forecast: int 210 199 200 204 199 192 191 194 197 192 ...
$ intensity.actual : int 200 200 198 189 191 193 197 193 193 194 ...
$ intensity.index : chr "moderate" "moderate" "moderate" "moderate" ...
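If the nested data frame already exists and re-downloading is not an option, it can also be flattened after the fact. jsonlite provides flatten() for this; the sketch below shows a base R route with do.call(data.frame, ...), using a small hand-built data frame (the names nested and flat are made up) in place of the API result.

```r
# Hand-built stand-in for the API result: a data frame with a data frame column
nested <- data.frame(
  from = c("2019-11-24T23:30Z", "2019-11-25T00:00Z"),
  to   = c("2019-11-25T00:00Z", "2019-11-25T00:30Z")
)
nested$intensity <- data.frame(
  forecast = c(210L, 199L),
  actual   = c(200L, 200L),
  index    = c("moderate", "moderate")
)
ncol(nested)   # 3: intensity counts as a single column

# Passing the columns as separate arguments lets data.frame() expand
# the nested one into intensity.forecast, intensity.actual, intensity.index
flat <- do.call(data.frame, nested)
str(flat)
```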

read complicated dataset in R

My dataset looks something like the sample below. Each entry is a feature number, then a colon, then the value associated with that feature. I am not sure how to import this dataset into R. Does anyone have any ideas?
236:24 500:163 732:234 869:117 885:106 1249:103 1280:158 1889:119 2015:55 2718:126 3307:137 3578:25 3770:26 4139:128 4723:114 4957:82 5128:50 5420:124 5603:135 5897:34 5946:117 6069:154 6153:55 6347:87 6372:77 6666:109 6866:223 6984:39 7709:253 7950:87 8078:38 8945:141 9316:111 9948:103 9989:68 10276:43 10530:76 10532:55 10799:15 10802:20 10848:82 11347:16 11871:51 11883:105 12534:133 12601:13 12781:178 12798:116 12842:106 12916:7 12935:51 12968:154 13028:58 13330:105 13384:2 13568:47 13641:632 13829:18 13964:62 14385:93 14392:272 15280:140 15424:119 15492:52 15523:31 16311:23 16464:69 16478:94 16584:102 16586:107 16705:272 17138:108 17181:150 17526:280 17540:163 18007:114 18050:53 18180:2 18806:160 18943:73 19055:41 19255:88 19774:59 19889:72 19921:45
101:68 572:57 732:63 962:120 1304:61 1831:60 1889:58 1973:105 2518:161 2629:228 2990:158 3147:75 3578:11 3860:88 4011:18 4623:141 4684:411 4758:69 4820:120 6149:102 6234:134 6306:118 6866:147 6927:89 6988:51 7048:178 7193:31 7257:61 7709:229 8061:125 8202:188 8272:17 8759:165 9104:77 9325:135 9860:97 10055:684 10532:180 10735:64 10744:267 10820:120 10848:186 10923:128 10936:129 11203:160 11303:144 11668:87 11867:97 11871:207 12191:83 12238:193 12380:51 12968:164 13369:58 13929:39 14531:102 14800:130 14931:99 15314:91 15632:62 16165:7 16353:120 16584:137 17216:172 18372:31 18893:75 19133:93 19154:101 19165:133 19607:20 19784:141 19889:97 19921:60
Assuming your data is stored in input.txt,
input <- scan('input.txt', what = 'character')
# byrow = TRUE keeps each feature next to its value (matrix() fills by column by default)
data <- as.data.frame(matrix(as.numeric(unlist(strsplit(input, ':'))), ncol = 2, byrow = TRUE))
colnames(data) <- c('Feature', 'Value')
str(data)
# 'data.frame': 158 obs. of 2 variables:
# $ Feature: num 236 500 732 869 885 ...
# $ Value : num 24 163 234 117 106 103 158 119 55 126 ...
Alternatively, you can use read.table to parse the input rather than manually splitting the strings; this is slightly slower but more readable.
data <- read.table(text = input, sep = ':')
colnames(data) <- c('Feature', 'Value')
str(data)
# 'data.frame': 158 obs. of 2 variables:
# $ Feature: int 236 500 732 869 885 1249 1280 1889 2015 2718 ...
# $ Value : int 24 163 234 117 106 103 158 119 55 126 ...
Edit: adapted for your dataset. Reads your Feature/Value pairs into a data frame.
url <- 'https://archive.ics.uci.edu/ml/machine-learning-databases/dexter/DEXTER/dexter_test.data'
input <- scan(url, what = 'character')
data <- as.data.frame(matrix(as.numeric(unlist(strsplit(input, ':'))), ncol = 2, byrow = TRUE))
colnames(data) <- c('Feature','Value')
str(data)
# 'data.frame': 192449 obs. of 2 variables:
# $ Feature: num 236 500 732 869 885 ...
# $ Value : num 24 163 234 117 106 103 158 119 55 126 ...
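Note that scan() pools the tokens from every line, so the result no longer records which line (sample) a pair came from. If that matters, one possible sketch is to split line by line and keep a row index. The two lines below are shortened from the sample in the question, and the names lines, tokens, pairs and long are made up for the example; with a real file, readLines('input.txt') would supply lines.

```r
# Two shortened lines in the same "feature:value" format
lines <- c("236:24 500:163 732:234", "101:68 572:57")

tokens <- strsplit(lines, " ", fixed = TRUE)   # one character vector per input line

# Split every token at the colon and stack the pieces into a 2-column matrix
pairs <- do.call(rbind, strsplit(unlist(tokens), ":", fixed = TRUE))

long <- data.frame(
  Row     = rep(seq_along(tokens), lengths(tokens)),   # source line of each pair
  Feature = as.integer(pairs[, 1]),
  Value   = as.integer(pairs[, 2])
)
long
```

This long layout (Row, Feature, Value) is also the triplet form that sparse matrix constructors typically expect.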
