I have a data frame with 383 variables. Because the variable names are long and self-explanatory, I would like to copy these names into the variable labels; then, in a second step (already done successfully), I rename the variables for easier coding. I tried the following, which fails with the error shown:
library(expss)
REGCON_CA_FIRM <- apply_labels(REGCON_CA_FIRM,names(REGCON_CA_FIRM)<-names(REGCON_CA_FIRM))
# Error in if (curr_name %in% data_names) { : argument is of length zero
A one-liner using mtcars:
do.call(apply_labels, c(list(data=mtcars),setNames(names(mtcars), names(mtcars)) %>% as.list()))
However, for your use case, you can write a small function like the one below. It takes a data frame and a vector of new names, moves the current column names into the labels, and replaces the original (too long) names with the new names:
replace_long_with_short <- function(d,short_names) {
setNames(
do.call(apply_labels, c(list(data=d),setNames(names(d), names(d)) %>% as.list())),
short_names
)
}
Pass your data frame to this function, along with the desired new names. The function returns the data frame with the original column names stored as labels and the new names as column names.
Example: Let's say you have a data frame that looks like this:
X.is.an.important.variable Y.is.also.important
1 -0.003643385 1.1052905
2 1.641458152 0.5303247
3 -1.058337452 0.5490569
and you want those descriptive column names to be the labels, and the new names to be x and y.
Then calling the above function like this:
df = replace_long_with_short(df,c("x", "y"))
will convert df to this:
x y
1 -0.003643385 1.1052905
2 1.641458152 0.5303247
3 -1.058337452 0.5490569
and the labels will be attached:
str(df)
'data.frame': 3 obs. of 2 variables:
$ x:Class 'labelled' num -0.00364 1.64146 -1.05834
.. .. LABEL: X.is.an.important.variable
$ y:Class 'labelled' num 1.105 0.53 0.549
.. .. LABEL: Y.is.also.important
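If you would rather not depend on expss, the same idea can be sketched in base R by storing each old name in a "label" attribute before renaming. This is only a sketch: the helper name label_and_rename is made up for illustration, and it assumes that a plain "label" attribute is all you need downstream.

```r
# Hypothetical helper (not part of expss): copy each column name into a
# "label" attribute, then replace the column names with the short names.
label_and_rename <- function(d, short_names) {
  stopifnot(length(short_names) == ncol(d))
  for (j in seq_along(d)) {
    attr(d[[j]], "label") <- names(d)[j]
  }
  names(d) <- short_names
  d
}

example <- data.frame(X.is.an.important.variable = c(1.5, 2.5),
                      Y.is.also.important = c(3, 4))
example <- label_and_rename(example, c("x", "y"))

names(example)            # the short names
attr(example$x, "label")  # the original long name
```

The length check guards against passing the wrong number of short names, which would otherwise silently produce NA column names.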
Within a large function, I would like to create a "summary table" of sorts. This summary table summarises information from multiple R objects that have been created within the function. The objects are:
Data table with the information on the limit
> str(limit)
Classes ‘data.table’ and 'data.frame': 1 obs. of 3 variables:
$ id : num 6292
$ type : chr "DAILY"
$ value: chr "350"
- attr(*, ".internal.selfref")=<externalptr>
A vector with the positions of the elements in mydata that are over the limit limit$value
n <- which(mydata$amount > as.double(limit$value))
str(n)
int [1:4960] 1 2 3 5 6 9 11 16 19 20 ...
I have now created an empty data.table problem with the columns that I want to use to summarise the elements that are over the limit in mydata:
problem <- data.table("LIMITid" = character(),
"LIMITtype" = character (),
"LIMITvalue" = character (),
"amount" = double(),
"customerID" = character())
Finally, I want to populate my problem data.table with the corresponding information. I tried:
if(length(n) > 0){
problem$LIMITid <- limit$id
problem$LIMITtype <- limit$type
problem$LIMITvalue <- limit$value
problem$amount <- mydata$amount
problem$customerID <- mydata$customerID
}
How can I populate the data.table? I was thinking of using a loop, but I am unsure how to loop over the positions in n, something like n %in% rownames(mydata)?
We can specify the columns of interest in .SDcols, subset .SD (Subset of Data.table) with the index provided by 'n', and then cbind the result with the 'limit' dataset:
library(data.table)
cbind(limit, mydata[, .SD[n], .SDcols = c("amount", "customerID")])
Wouldn't this work?
library(data.table)
data.table(limit, amount = mydata$amount[n], customerID = mydata$customerID[n])
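To check the idea on something small, here is a toy version of the same construction. The contents of mydata are invented for illustration (the question does not show them), and the columns are named explicitly so the result matches the layout of the empty problem table from the question.

```r
library(data.table)

# Toy inputs; the values in mydata are made up for this sketch.
limit  <- data.table(id = 6292, type = "DAILY", value = "350")
mydata <- data.table(amount = c(100, 400, 500),
                     customerID = c("a", "b", "c"))

# Positions of the rows in mydata that exceed the limit
n <- which(mydata$amount > as.double(limit$value))

# The length-1 limit fields are recycled against the selected rows
problem <- data.table(LIMITid    = limit$id,
                      LIMITtype  = limit$type,
                      LIMITvalue = limit$value,
                      amount     = mydata$amount[n],
                      customerID = mydata$customerID[n])
problem
```

No loop is needed: indexing with n picks out all offending rows at once.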
I have the following list:
library(rjson)
j <- fromJSON(file='https://esgf-data.dkrz.de/esg-search/search/?offset=0&limit=1000&type=Dataset&replica=false&latest=true&project=CORDEX&domain=EUR-11&experiment=rcp85&time_frequency=day&facets=rcm_name%2Cproject%2Cproduct%2Cdomain%2Cinstitute%2Cdriving_model%2Cexperiment%2Cexperiment_family%2Censemble%2Crcm_version%2Ctime_frequency%2Cvariable%2Cvariable_long_name%2Ccf_standard_name%2Cdata_node&format=application%2Fsolr%2Bjson')
I am interested in extracting data from this component: j$response$docs, which is a list of lists. The 'internal' lists are all supposed to have the same names.
I want to save the output to a data.frame() or tibble().
This below works and gives the desired output, for the few selected variables:
nmod <- length(j$response$docs)
for (i in 1:nmod) {
#select one list at a time
j1 <- j$response$docs[[i]]
tmp <- data.frame(variable=j1$variable,
variable_long_name=j1$variable_long_name,
rcm_name=j1$rcm_name,
driving_model=j1$driving_model,
cf_standard_name=j1$cf_standard_name
)
#join them
if (i==1) {
d <- tmp
} else {
d <- rbind(d, tmp)
}
}
However, I'd like to know if there is a more elegant and efficient way, maybe using tidyr, dplyr or purrr, which would also allow me to select all "columns" instead of just the few selected there.
You can do it with help from package purrr. I thought at_depth might work here, but instead I ended up using nested map_df.
library(purrr)
Your variables have different lengths, so the first thing to do is to make sure each variable has length 1. This can be done by collapsing each element of the inner list with paste. I used commas as a separator. Doing this via map_df returns a 1-row tibble.
Here's an example with the first inner list.
map_df(j$response$docs[[1]], paste, collapse = ",")
Now we can loop through the outer lists, making a 1-row tibble for each. We use map_df to bind each of these together. The output is an 832-row tibble, one row per list. I used the .id argument to add a grouping variable to the result, which may not be needed.
d1 = map_df(j$response$docs, ~map_df(.x, paste, collapse = ","), .id = "group")
d1
# A tibble: 832 × 45
group id version
<chr> <chr> <chr>
1 1 cordex.output.EUR-11.DMI.ICHEC-EC-EARTH.rcp85.r3i1p1.HIRHAM5.v1.day.clh.v20131119|cordexesg.dmi.dk 20131119
2 2 cordex.output.EUR-11.DMI.ICHEC-EC-EARTH.rcp85.r3i1p1.HIRHAM5.v1.day.clivi.v20131119|cordexesg.dmi.dk 20131119
3 3 cordex.output.EUR-11.DMI.ICHEC-EC-EARTH.rcp85.r3i1p1.HIRHAM5.v1.day.rsds.v20131119|cordexesg.dmi.dk 20131119
4 4 cordex.output.EUR-11.DMI.ICHEC-EC-EARTH.rcp85.r3i1p1.HIRHAM5.v1.day.rlds.v20131119|cordexesg.dmi.dk 20131119
5 5 cordex.output.EUR-11.DMI.ICHEC-EC-EARTH.rcp85.r3i1p1.HIRHAM5.v1.day.rsus.v20131119|cordexesg.dmi.dk 20131119
6 6 cordex.output.EUR-11.DMI.ICHEC-EC-EARTH.rcp85.r3i1p1.HIRHAM5.v1.day.rlus.v20131119|cordexesg.dmi.dk 20131119
7 7 cordex.output.EUR-11.DMI.ICHEC-EC-EARTH.rcp85.r3i1p1.HIRHAM5.v1.day.rsdt.v20131119|cordexesg.dmi.dk 20131119
8 8 cordex.output.EUR-11.DMI.ICHEC-EC-EARTH.rcp85.r3i1p1.HIRHAM5.v1.day.rsut.v20131119|cordexesg.dmi.dk 20131119
9 9 cordex.output.EUR-11.DMI.ICHEC-EC-EARTH.rcp85.r3i1p1.HIRHAM5.v1.day.rlut.v20131119|cordexesg.dmi.dk 20131119
10 10 cordex.output.EUR-11.DMI.ICHEC-EC-EARTH.rcp85.r3i1p1.HIRHAM5.v1.day.psl.v20131119|cordexesg.dmi.dk 20131119
# ... with 822 more rows, and 42 more variables:
If you want to get multiple rows for the variables that were greater than length 1, such as access and experiment_family, you can use tidyr::separate_rows to separate the data onto multiple rows.
tidyr::separate_rows(d1, experiment_family)
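The collapse-then-bind idea can also be sketched in base R on a small made-up list, which makes it easy to see what happens to elements longer than 1. The docs list below is invented for illustration and assumes, as in the question, that every inner list has the same names.

```r
# Toy stand-in for j$response$docs: a list of lists with identical names.
docs <- list(
  list(variable = "clh",   rcm_name = "HIRHAM5",
       experiment_family = c("RCP", "All")),
  list(variable = "clivi", rcm_name = "HIRHAM5",
       experiment_family = c("RCP", "All"))
)

# Collapse every element to a single string, giving one 1-row data frame
# per inner list, then stack the rows.
rows <- lapply(docs, function(doc) {
  as.data.frame(lapply(doc, paste, collapse = ","), stringsAsFactors = FALSE)
})
d <- do.call(rbind, rows)
d
```

If the inner lists did not share the same names, rbind would fail, so that case would need extra handling.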
Instead of rjson, go with jsonlite:
library(jsonlite)
j <- jsonlite::fromJSON('https://esgf-data.dkrz.de/esg-search/search/?offset=0&limit=1000&type=Dataset&replica=false&latest=true&project=CORDEX&domain=EUR-11&experiment=rcp85&time_frequency=day&facets=rcm_name%2Cproject%2Cproduct%2Cdomain%2Cinstitute%2Cdriving_model%2Cexperiment%2Cexperiment_family%2Censemble%2Crcm_version%2Ctime_frequency%2Cvariable%2Cvariable_long_name%2Ccf_standard_name%2Cdata_node&format=application%2Fsolr%2Bjson')
# The names you want to find in the nested returned data
look_for <- c('variable','variable_long_name' ,
'rcm_name','driving_model',
'cf_standard_name')
new_df <- as.data.frame(sapply(look_for, function(i){
unlist(j$response$docs[[i]])
}))
str(new_df)
'data.frame': 832 obs. of 5 variables:
$ variable : chr "clh" "clivi" "rsds" "rlds" ...
$ variable_long_name: chr "High Level Cloud Fraction" "Ice Water Path" "Surface Downwelling Shortwave Radiation" "Surface Downwelling Longwave Radiation" ...
$ rcm_name : chr "HIRHAM5" "HIRHAM5" "HIRHAM5" "HIRHAM5" ...
$ driving_model : chr "ICHEC-EC-EARTH" "ICHEC-EC-EARTH" "ICHEC-EC-EARTH" "ICHEC-EC-EARTH" ...
$ cf_standard_name : chr "cloud_area_fraction_in_atmosphere_layer" "atmosphere_cloud_ice_content" "surface_downwelling_shortwave_flux_in_air" "surface_downwelling_longwave_flux_in_air" ...
I'm writing a simple function that will create a new variable containing the sum of missing values of each column within a dataset. I am using the assign function to assign a variable name based on the input of the function.
report.NA <- function(v){
nam <- deparse(substitute(v))
newvar <-paste0(nam,"NAs")
as.data.frame(assign(newvar,colSums(is.na(v)),envir=parent.frame()))
message(paste("Sum of NAs in",nam,"dataset:",newvar),appendLF=FALSE)
}
For the sake of reproducibility:
set.seed(1)
df<-matrix(1,nrow=10,ncol=5)
dimnames(df)<-list(rownames(df),colnames(df,do.NULL=F))
df[sample(1:length(df), 10)] <- NA
Run the function on df, you get a new variable called dfNAs.
> dfNAs
col1 col2 col3 col4 col5
2 2 3 0 3
The issue I am running into is that I want my output variable to be a data.frame. I know the obvious way of doing this outside of the function is just to run as.data.frame(dfNAs), but I would like the function itself to produce the new variable from assign as a data frame. I just wanted to see if there is a solution to this issue.
Also, the overarching question is how to refer to the name created by assign from within the function, and whether that is even possible. It seems like a naive question, but I haven't been able to find an answer yet.
Not sure I understand what is desired but this reworking might point you in a favorable direction. Using as.list will convert a named vector to a multi-element named list which the ordinary data.frame function can accept to make multiple columns:
report.NA <- function(v){
nam <- deparse(substitute(v))
newvar <-paste0(nam,"NAs")
assign(newvar,data.frame(as.list(colSums(is.na(v)))),envir=parent.frame())
message(paste("Sum of NAs in",nam,"dataset:",newvar),appendLF=FALSE)
}
report.NA(df)
#Sum of NAs in df dataset: dfNAs
> dfNAs
col1 col2 col3 col4 col5
1 2 2 3 0 3
> str(dfNAs)
'data.frame': 1 obs. of 5 variables:
$ col1: num 2
$ col2: num 2
$ col3: num 3
$ col4: num 0
$ col5: num 3
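A variation you might consider is to return the one-row data frame and let the caller choose the name, which avoids assign altogether. This is just a sketch of that alternative, not what the question asked for; count_na is a made-up name.

```r
# Hypothetical alternative: return the NA counts as a one-row data frame
# instead of assigning a new variable in the calling environment.
count_na <- function(v) {
  data.frame(as.list(colSums(is.na(v))))
}

m <- matrix(1, nrow = 4, ncol = 3,
            dimnames = list(NULL, c("col1", "col2", "col3")))
m[c(1, 2, 5)] <- NA   # two NAs in col1, one in col2 (column-major indexing)

mNAs <- count_na(m)
str(mNAs)
```

The caller then writes mNAs <- count_na(m) explicitly, so it is always clear where the new variable comes from.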
I have a table source that I read into a data frame. I know that by default, strings from external sources are read into data frames as factors. I'd like to apply stringsAsFactors=FALSE in the data.frame call below, but it throws an error when I do this. Can I still use chaining and set stringsAsFactors=FALSE?
library(rvest)
pvbData <- read_html(pvbURL)
pvbDF <- pvbData %>%
html_nodes(xpath = '//*[@id="ajax_result_table"]') %>%
html_table() %>%
data.frame()
data.frame(,stringsAsFactors=FALSE) <- Throws an error
I know this is probably something very simple, but I'm having trouble finding a way to make this work. Thank you for your help.
With chaining, the call should logically be data.frame(stringsAsFactors=FALSE), but even that doesn't produce the required output.
The reason is a misunderstanding of how the stringsAsFactors option is used. It only takes effect when you build the data.frame column by column. Example:
a <- data.frame(x = c('a','b'),y=c(1,2),stringsAsFactors = TRUE)
str(a)
'data.frame': 2 obs. of 2 variables:
$ x: Factor w/ 2 levels "a","b": 1 2
$ y: num 1 2
a <- data.frame(x = c('a','b'),y=c(1,2),stringsAsFactors = FALSE)
str(a)
'data.frame': 2 obs. of 2 variables:
$ x: chr "a" "b"
$ y: num 1 2
If you pass an existing data.frame as input, the stringsAsFactors option has no effect.
Solution:
Store the chaining result to a variable like this:
library(rvest)
pvbData <- read_html(pvbURL)
pvbDF <- pvbData %>%
html_nodes(xpath = '//*[@id="ajax_result_table"]') %>%
html_table()
And then apply this command:
data.frame(as.list(pvbDF),stringsAsFactors=FALSE)
Update:
If a column is already a factor, this command won't convert it to a character vector. It is better to convert it with as.character first and then retry.
You may refer to Change stringsAsFactors settings for data.frame for more details.
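For the case where columns have already become factors, a base-R sketch of the as.character conversion could look like this. The small data frame is made up for illustration and stands in for the scraped table.

```r
# Made-up data frame whose string column has been turned into a factor
scraped <- data.frame(x = c("a", "b"), y = c(1, 2), stringsAsFactors = TRUE)

# Find the factor columns and convert only those to character
is_factor <- vapply(scraped, is.factor, logical(1))
scraped[is_factor] <- lapply(scraped[is_factor], as.character)

str(scraped)
```

Numeric columns are left alone because only the columns flagged by is.factor are touched.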
I am trying to create an empty data frame with two columns and an unknown number of rows. I would like to specify the names of the columns. I ran the following command
dat <- data.frame("id"=numeric(),"nobs"=numeric())
I can test the result by running
> str(dat)
'data.frame': 0 obs. of 2 variables:
$ id : num
$ nobs: num
But later on when I insert data into this data frame using rbind in the following command, the names of the columns are also changed
for (i in id) {
nobs = nrow(na.omit(read.csv(files_list[i])))
dat = rbind(dat, c(i,nobs))
}
After for loop this is the value of dat
dat
X3 X243
1 3 243
And str command shows the following
str(dat)
'data.frame': 1 obs. of 2 variables:
$ X3 : num 3
$ X243: num 243
Can anyone tell me why the column names in the data frame change?
EDIT:
My lazy solution to fix the problem is to run the following commands after the for loop that binds data to my data.frame
names(dat)[1] = "id"
names(dat)[2] = "nobs"
Interestingly, the rbind.data.frame function throws away all values passed that have zero rows. It basically happens in this line
allargs <- allargs[nr > 0L]
so passing in a data.frame with no rows is really like not passing in anything at all. This is another good example of why it's almost always a bad idea to try to build a data.frame row by row. It is better to build vectors and then combine them into a data.frame only when done.
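As a rough sketch of the "build vectors first" approach: the list of small data frames below is a made-up stand-in for the CSV files in the question, so that the example runs without any files on disk.

```r
# Hypothetical stand-in for read.csv(files_list[i]): a list of data frames
tables <- list(
  data.frame(a = c(1, NA, 3)),
  data.frame(a = c(1, 2, 3)),
  data.frame(a = c(NA, NA, 3))
)

id <- seq_along(tables)

# Count the complete rows in each table, collecting the counts in a vector
nobs <- vapply(tables, function(tbl) nrow(na.omit(tbl)), integer(1))

# Combine into a data frame once, at the end; the column names are kept
dat <- data.frame(id = id, nobs = nobs)
dat
```

With real files, the function passed to vapply would read each CSV before counting rows.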
dat = data.frame(col1=numeric(), col2=numeric())
...loop
dat[nrow(dat) + 1, ] = c(324, 234)
This keeps the column names
Try specifying your column names inside the rbind():
dat = rbind(dat, data.frame("id" = i, "nobs" = nobs))
I would change how you're appending the data to the data frame. Since rbind seems to remove the column names, just replace the indexed location.
dat <- data.frame("id"=numeric(),"nobs"=numeric())
for (i in id) {
dat[nrow(dat) + 1, ] <- c(i, nrow(na.omit(read.csv(files_list[i]))))
}
FYI, in R versions before 4.0.0, data frame creation converts all strings to factors by default. That is not an issue here, since all your columns are numeric. But if you had a character() column, you might want to set stringsAsFactors=FALSE so that you can append character values.