Trying to tidy data but get errors: can't convert .x[[i]] empty character vector to a function & must extract column with a single valid subscript - r

I am trying to tidy a dataset where I measured exposure at two different stations over time (in seconds). I have a data frame with column 1 as Second, column 2 as SiteA_Number (the number of particles at SiteA), column 3 as SiteA_Diam (the diameter of particles at SiteA), column 4 as SiteA_LDSA (the LDSA at SiteA), and the same three measurements for SiteB as three more columns (SiteB_Number, SiteB_Diam, SiteB_LDSA).
I would like to reshape the dataset to have a column for Second; columns for Number, Diameter, and LDSA; and a separate column for the station (SiteA or SiteB). That way, I can plot a graph of Number (y axis) over time (seconds) and fill by site.
The structure of each column is as follows:
'data.frame': 1800 obs. of 7 variables:
$ Second: num 1 2 3 4 5 6 7 8 9 10 ...
$ SiteA_Number : int 16673 19891 20370 17513 18185 18982 18362 17579 16605 15590 ...
$ SiteA_Diam : int 41 39 38 42 41 39 40 42 44 45 ...
$ SiteA_LDSA : num 36.1 40.4 40.7 38.6 38.8 ...
$ SiteB_Number: int 15554 16745 17719 16494 15811 15331 16053 16196 15733 15521 ...
$ SiteB_Diam : int 40 39 37 40 42 44 42 42 42 43 ...
$ SiteB_LDSA : num 33 33.8 34.3 34.2 35.2 ...
I tried using pivot_longer to create a station column and then corresponding columns for the number, diameter, and LDSA:
MergedLDSA %>%
  pivot_longer(-Second,
               names_to = c("Station", ".value"),
               names_sep = ("_"),
               names_transform = list(
                 Number = as.integer,
                 Diameter = as.integer,
                 LDSA = as.integer,
                 Station = as.character())
  )
But I get the error message:
Error in `map()`:
! Can't convert `.x[[i]]`, an empty character vector, to a function.
I then tried using the separate() function:
MergedLDSA %>%
  separate(c(SiteA_Number, SiteA_Diam, SiteA_LDSA, SiteB_Number, SiteB_Diam, SiteB_LDSA),
           into = c("Station", ".value"), sep = "_")
But I get the error message:
Error:
! Must extract column with a single valid subscript.
x Subscript `var` has size 6 but must be size 1.
I'm a beginner at coding and this is my first time trying to tidy real data. I don't understand the errors and can't figure out how to tidy my data the way I'd like.
Any help would be greatly appreciated! :)
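For reference, the first error usually means names_transform was handed something that is not a function: as.character() with parentheses is a call that evaluates to character(0), an empty character vector, whereas the bare name as.character is the function itself. A minimal sketch of a call that should produce the desired shape (assuming the column names from the str() output above; note they are Diam, not Diameter):

library(tidyr)

MergedLDSA %>%
  pivot_longer(
    -Second,
    names_to  = c("Station", ".value"),
    names_sep = "_"
  )

No transform arguments should be needed here, since the measurement columns already carry the right types; if conversion were required, values_transform (not names_transform) is what controls the types of the .value columns.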

Related

R: Iterate through a for loop to print multiple tables

In the house price prediction dataset, there are about 80 variables and 1459 obs.
To understand the data better, I have separated out the variables that are of 'char' type.
char_variables = sapply(property_train, is.character)
char_names = names(property_train[,char_variables])
char_names
There are 42 variables that are char datatype.
I want to find the number of observations in each variable.
The simple code for that would be:
table(property_train$Zoning_Class)
Commer FVR RHD RLD RMD
10 65 16 1150 218
But repeating the same for 42 variables would be a tedious task.
The for loop I've tried to print all the tables shows an error, and only one table is printed:
for (val in char_names){
  print(table(property_train[[val]]))
}
Abnorml AdjLand Alloca Family Normal Partial
101 4 12 20 1197 125
Is there a way to iterate char_names through the data frame to print all 42 tables?
str(property_train)
'data.frame': 1459 obs. of 81 variables:
$ Id : int 1 2 3 4 5 6 7 8 9 10 ...
$ Building_Class : int 60 20 60 70 60 50 20 60 50 190 ...
$ Zoning_Class : chr "RLD" "RLD" "RLD" "RLD" ...
$ Lot_Extent : int 65 80 68 60 84 85 75 NA 51 50 ...
$ Lot_Size : int 8450 9600 11250 9550 14260 14115 10084 10382..
$ Road_Type : chr "Paved" "Paved" "Paved" "Paved" ...
$ Lane_Type : chr NA NA NA NA ...
$ Property_Shape : chr "Reg" "Reg" "IR1" "IR1" ...
$ Land_Outline : chr "Lvl" "Lvl" "Lvl" "Lvl" ...
Actually, for me your code does not give an error (make sure to evaluate all lines in the for-loop together):
property_train <- data.frame(a = 1:10,
                             b = rep(c("A", "B"), 5),
                             c = LETTERS[1:10])

char_variables = sapply(property_train, is.character)
char_names = names(property_train[, char_variables])
char_names

table(property_train$b)

for (val in char_names){
  print(table(property_train[val]))
}
You can also get this result in a bit more user-friendly form using dplyr and tidyr, by pivoting all the character columns into long format and counting all the column-value combinations:
library(dplyr)
library(tidyr)
property_train %>%
  select(where(is.character)) %>%
  pivot_longer(cols = everything(), names_to = "column") %>%
  group_by(column, value) %>%
  summarise(freq = n())
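As a side note, dplyr's count() collapses the group_by() + summarise() pair into a single step:

property_train %>%
  select(where(is.character)) %>%
  pivot_longer(cols = everything(), names_to = "column") %>%
  count(column, value, name = "freq")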

Filter all columns in timeseries to keep only top 1/3

I have a timeseries with about 100 dates, 50 entities per date (so 5,000 rows), and 50 columns (all different variables). How can I filter each column in the data frame, per unique date, to keep the top 1/3 of values for each column on each date, and then get the average Return for that group on that date? Thank you.
My data is organized as follows but the numbers in each column are random and vary like they do in column "a" (this is a sample, the real data has many more columns and many more rows):
Date Identity Return a b c d e f... ...z
2/1/19 X 5 75 43 67 85 72 56 92
2/1/19 Y 4 27 43 67 85 72 56 92
2/1/19 Z 7 88 43 67 85 72 56 92
2/1/19 W 2 55 43 67 85 72 56 92
2/2/19 X 7 69 43 67 85 72 56 92
2/2/19 Y 8 23 43 67 85 72 56 92
2/3/19 X 2 34 43 67 85 72 56 92
2/3/19 Y 3 56 43 67 85 72 56 92
2/3/19 Z 4 62 43 67 85 72 56 92
2/3/19 W 4 43 43 67 85 72 56 92
2/3/19 U 4 26 43 67 85 72 56 92
2/4/19 X 6 67 43 67 85 72 56 92
2/4/19 Y 1 78 43 67 85 72 56 92
2/5/19 X 4 75 43 67 85 72 56 92
2/7/19 X 5 99 43 67 85 72 56 92
2/7/19 Y 4 72 43 67 85 72 56 92
2/7/19 Z 4 45 43 67 85 72 56 92
I am trying to filter data into quantiles. I have code that works for filtering into quantiles for one measure, but I want filtered results for many measures individually (i.e. I want a "high" group for a ton of columns).
The code that works for one measure is as follows. The columns are date, identity, and a, where a is the indicator I want to sort on:
High = df[!is.na(df$a),] %>%
  group_by(df.date) %>%
  filter(a > quantile(a, .666)) %>%
  summarise(high_return = sum(df.return) / length(df.identity))
Now I want to loop this for when I have many indicators to sort on individually (i.e. I do not want to sort them within one another; I want each sorted separately, with the results broken out by indicator).
I want the output of the loop to be a new data frame with the following format (where a_Return is the average return of the top 1/3 of the original a's on a given date):
Date    a_Return  b_Return  c_Return
2/1/19         6         7         3
2/3/19         4         2         5
2/4/19         2         4         6
I have tried the code below without it working:
Indicators <- c("a", "b", "c")
for(i in 1:length(Indicators)){
  High = df %>%
    group_by(df.date) %>%
    filter(High[[I]] > quantile(High[[i]], .666)) %>%
    summarise(g = sum(df.return) / length(df.identity))
}
With this attempt I get the error: "Error in filter_impl(.data, quo) : Result must have length 20, not 4719."
I also tried:
High %>%
  group_by(date) %>%
  filter_at(vars(Indicators[i]), any_vars(. > quantile(., .666))) %>%
  summarise(!!Indicators[I] := sum(Return) / n())
but with that code I get the error "Strings must match column names. Unknown Columns: NA"
I want High to turn up with a date column and then a column for each a, b, and c.
If you combine the filtering and calculations into a single function, then you can put that into summarize_at to apply it easily to each column. Since your example data isn't fully reproducible, I'll use the iris dataset. In your case, you'd replace Species with Date, and Petal.Width with Return:
library(dplyr)
top_iris <- iris %>%
  group_by(Species) %>%
  summarize_at(vars(one_of('Sepal.Length', 'Sepal.Width', 'Petal.Length')),
               funs(return = sum(Petal.Width[. > quantile(., .666)]) /
                      length(Petal.Width[. > quantile(., .666)])))
top_iris
# A tibble: 3 x 4
Species Sepal.Length_return Sepal.Width_return Petal.Length_return
<fct> <dbl> <dbl> <dbl>
1 setosa 0.257 0.262 0.308
2 versicolor 1.44 1.49 1.49
3 virginica 2.1 2.22 2.09
The problem with using filter is that each function in the pipe runs in order, so any criteria you give to filter_* will have to be applied to the whole data.frame before the result is piped into summarize_at. Instead, we just use a single summarize_at statement, and filter each column as the summarization function is applied to it.
To explain this in more detail, summarize_at takes 2 arguments:
The first argument is one or more of the variable selector functions described in ?select_helpers, enclosed in the vars function. Here we use one_of, which just takes a vector of column names, but we could also use matches to select using a regular expression, or starts_with to choose based on a prefix, for example.
The second argument is a list of one or more function calls to be run on each selected column, enclosed in the funs function. Here we have 1 function call, to which we've given the name return.
Like with any tidyverse function, this is evaluated in a local environment constructed from the data piped in. So bare variable names like Petal.Width function as data$Petal.Width. In *_at functions, the . represents the variable passed in, so when the Sepal.Length column is being summarized:
Petal.Width[. > quantile(., .666)]
means:
data$Petal.Width[data$Sepal.Length > quantile(data$Sepal.Length, .666)]
Finally, since the function in funs is named (that's the return =), then the resulting summary columns have the function's name (return) appended to the original column names.
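As an aside, funs() is deprecated in current dplyr (>= 1.0), where the *_at functions are superseded by across(). A sketch of the same summary in that style, with .x playing the role of . (other columns such as Petal.Width remain visible inside across()'s data mask):

iris %>%
  group_by(Species) %>%
  summarize(across(c(Sepal.Length, Sepal.Width, Petal.Length),
                   ~ sum(Petal.Width[.x > quantile(.x, .666)]) /
                     length(Petal.Width[.x > quantile(.x, .666)]),
                   .names = "{.col}_return"))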
If you want to remove missing data before running these calculations, you can use na.omit to strip out NA values.
To remove all rows containing NA, just pipe your data through na.omit before grouping:
iris2 <- iris
iris2[c(143:149), c(1:2)] <- NA
iris2 %>%
  na.omit() %>%
  group_by(Species) %>%
  summarize_at(vars(one_of('Sepal.Length', 'Sepal.Width', 'Petal.Length')),
               funs(return = sum(Petal.Width[. > quantile(., .666)]) /
                      length(Petal.Width[. > quantile(., .666)])))
Species Sepal.Length_return Sepal.Width_return Petal.Length_return
<fct> <dbl> <dbl> <dbl>
1 setosa 0.257 0.262 0.308
2 versicolor 1.44 1.49 1.49
3 virginica 2.09 2.19 2.07
To strip NA values from each column as it's being summarized, you need to move na.omit inside the summarize function:
iris2 %>%
  group_by(Species) %>%
  summarize_at(vars(one_of('Sepal.Length', 'Sepal.Width', 'Petal.Length')),
               funs(return = {
                 var <- na.omit(.)
                 sum(Petal.Width[var > quantile(var, .666)]) /
                   length(Petal.Width[var > quantile(var, .666)])
               }))
# A tibble: 3 x 4
Species Sepal.Length_return Sepal.Width_return Petal.Length_return
<fct> <dbl> <dbl> <dbl>
1 setosa 0.257 0.262 0.308
2 versicolor 1.44 1.49 1.49
3 virginica 2.11 2.2 2.09
Here we use curly braces to extend the function we run in summarize_at to multiple expressions. First, we strip out NA values, then we calculate the return values. Since this function is in summarize_at it gets applied to each variable based on the grouping established by group_by.

r: Reading libsvm files with library (e1071)

I have generated a libsvm file in scala using the org.apache.spark.mllib.util.MLUtils package.
The file format is as follows:
49.0 109:2.0 272:1.0 485:1.0 586:1.0 741:1.0 767:1.0
49.0 109:2.0 224:1.0 317:1.0 334:1.0 450:1.0 473:1.0 592:1.0 625:1.0 647:1.0
681:1.0 794:1.0
17.0 26:1.0 109:1.0 143:1.0 198:2.0 413:1.0 476:1.0 582:1.0 586:1.0 611:1.0
629:1.0 737:1.0
12.0 255:1.0 394:1.0
etc etc
I read the file into R using the e1071 package as follows:
m = read.matrix.csr(filename)
The structure of the resultant matrix.csr is as follows:
$ x:Formal class 'matrix.csr' [package "SparseM"] with 4 slots
.. ..@ ra : num [1:31033] 2 1 1 1 1 1 2 1 1 1 ...
.. ..@ ja : int [1:31033] 109 272 485 586 741 767 109 224 317 334 ...
.. ..@ ia : int [1:2996] 1 7 18 29 31 41 49 65 79 83 ...
.. ..@ dimension: int [1:2] 2995 796
$ y: Factor w/ 51 levels "0.0","1.0","10.0",..: 45 45 10 5 42 25 23 41 23 25 ...
When I convert to a dense matrix with as.matrix(m) it produces one column and two rows, each with an uninterpretable (by me) object in it.
When I simply try to save the matrix.csr back to file (without doing any intermediate processing), I get the following error:
Error in abs(x) : non-numeric argument to mathematical function
I am guessing that the libsvm format is incompatible but I'm really not sure.
Any help would be much appreciated.
OK, the short of it:
m = read.matrix.csr(filename)$x
because read.matrix.csr returns a list with two elements: the matrix and a vector.
In other words, the target/label/class is separated out from the features matrix.
NOTE for fellow R neophytes: in CRAN documentation, the "Value" subheading refers to the return value of the function:
Value

If the data file includes no y variable, read.matrix.csr returns an object of class matrix.csr, else a list with components:

x  object of class matrix.csr
y  vector of numeric values or factor levels, depending on fac
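In practice that means indexing into the list before doing any matrix work; a short sketch (file name hypothetical):

library(e1071)    # read.matrix.csr(), write.matrix.csr()
library(SparseM)  # as.matrix() method for class matrix.csr

m <- read.matrix.csr("data.libsvm")  # hypothetical file name
X <- as.matrix(m$x)   # dense feature matrix
y <- m$y              # labels, as a factor
dim(X)
table(y)

write.matrix.csr(m$x, file = "out.libsvm", y = m$y)  # save features + labels

The save error in the question ("non-numeric argument to mathematical function") likely comes from passing the whole list where a matrix.csr is expected; write.matrix.csr wants the x component, with the labels supplied via its y argument.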

Measuring distance between centroids R

I want to create a matrix of the distance (in metres) between the centroids of every country in the world. Country names or country IDs should be included in the matrix.
The matrix is based on a shapefile of the world downloaded here: http://gadm.org/version2
Here is some rough info on the shapefile I'm using (I'm using shapefile@data$UN as my ID):
> str(shapefile@data)
'data.frame': 174 obs. of 11 variables:
$ FIPS : Factor w/ 243 levels "AA","AC","AE",..: 5 6 7 8 10 12 13
$ ISO2 : Factor w/ 246 levels "AD","AE","AF",..: 61 17 6 7 9 11 14
$ ISO3 : Factor w/ 246 levels "ABW","AFG","AGO",..: 64 18 6 11 3 10
$ UN : int 12 31 8 51 24 32 36 48 50 84 ...
$ NAME : Factor w/ 246 levels "Afghanistan",..: 3 15 2 11 6 10 13
$ AREA : int 238174 8260 2740 2820 124670 273669 768230 71 13017
$ POP2005 : int 32854159 8352021 3153731 3017661 16095214 38747148
$ REGION : int 2 142 150 142 2 19 9 142 142 19 ...
$ SUBREGION: int 15 145 39 145 17 5 53 145 34 13 ...
$ LON : num 2.63 47.4 20.07 44.56 17.54 ...
$ LAT : num 28.2 40.4 41.1 40.5 -12.3 ...
I tried this:
library(rgdal)  # provides readOGR()
library(rgeos)  # provides gCentroid()

shapefile <- readOGR("./Map/Shapefiles/World/World Map", layer = "TM_WORLD_BORDERS-0.3") # read in world shapefile
row.names(shapefile) <- as.character(shapefile@data$UN)
centroids <- gCentroid(shapefile, byid = TRUE, id = as.character(shapefile@data$UN)) # create centroids
dist_matrix <- as.data.frame(geosphere::distm(centroids))
The result looks something like this:
V1 V2 V3 V4
1 0.0 4296620.6 2145659.7 4077948.2
2 4296620.6 0.0 2309537.4 219442.4
3 2145659.7 2309537.4 0.0 2094277.3
4 4077948.2 219442.4 2094277.3 0.0
1) Instead of the first column (1, 2, 3, 4) and row (V1, V2, V3, V4) I would like to have country IDs (shapefile@data$UN) or names (shapefile@data$NAME). How does that work?
2) I'm not sure of the value that is returned. Is it metres, kilometres, etc?
3) Is geosphere::distm preferable to geosphere::distGeo in this instance?
1.
This should work to add the column and row names to your matrix, just as you did when adding the row names to shapefile:
crnames <- as.character(shapefile@data$UN)
colnames(dist_matrix) <- crnames
rownames(dist_matrix) <- crnames
2.
The default distance function in distm is distHaversine, which takes a radius (of the earth) argument in metres. So I assume the output is in metres.
3.
Look at the documentation for distGeo and distHaversine and decide the level of accuracy you want in your results. To look at the docs in R itself just enter ?distGeo.
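Putting the three together, a sketch using the objects from the question: distm() accepts an alternative distance function through its fun argument, so the more accurate ellipsoidal distGeo() can be swapped in for the default great-circle distHaversine(); both return metres.

library(geosphere)

dist_matrix <- as.data.frame(distm(centroids, fun = distGeo))
ids <- as.character(shapefile@data$UN)
colnames(dist_matrix) <- ids
rownames(dist_matrix) <- ids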
edit: answer to q1 may be wrong since the matrix data may be aggregated, looking at alternatives

times series import from Excel and date manipulation in R

I have 2 columns of a time series in a .csv excel file, "date" and "widgets"
I import the file into R using:
widgets<-read.csv("C:things.csv")
str(things)
'data.frame': 280 obs. of 2 variables:
$ date: Factor w/ 280 levels "2012-09-12","2012-09-13",..: 1 2 3 4 5 6 7 8 9 10 ...
$ widgets : int 5 10 15 20 30 35 40 50 55 60 65 70 75 80 85 90 95 100 ...
How do I convert the factor things$date into either xts or Time Series format?
For instance, when I try:
hist(things)
Error in hist.default(things) : 'x' must be numeric
Try reading it in as a zoo object and then converting:
Lines <- "date,widgets
2012-09-12,5
2012-09-13,10
"
library(zoo)
# replace first argument with: file="C:things.csv"
z <- read.zoo(text = Lines, header = TRUE, sep = ",")
x <- as.xts(z)
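Alternatively, a sketch that converts the data frame already produced by read.csv (assuming the two columns shown in the question): as.Date() handles the factor directly, and the result can be wrapped in an xts series that plots as numeric.

library(zoo)
library(xts)

things <- read.csv("C:things.csv")                # path as in the question
things$date <- as.Date(things$date)               # factor -> Date
x <- xts(things$widgets, order.by = things$date)  # widgets indexed by date
plot(x)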
