R extract labels from a rda data frame - r

I am looking at some data downloaded from ICPSR and I am specifically using their R data file (.rda). Beneath the column name of each data file, there are some descriptions of the variables (a.k.a labels). An example is attached as well.
I tried various ways to get the label including base::label, Hmisc::label, labelled::var_label, sjlabelled::get_label and etc. But none worked.
So I am asking any ideas on how to extract the labels from this data file?
Thanks very much in advance!

this could work using purrr
#load library
library(purrr)
#get col n
n <- ncol(yourdata)
#extract labels as vector
labels <- map_chr(1:n, function(x) attr(yourdata[[x]], "label") )

This worked for me (I am working with ICPSR 35206):
attributes(yourdata)$variable.labels -> labels
Make sure that your attribute referring to the labels is actually called "variable.labels".

Related

Getting subscripts from Excel into R

I just startet learning R but I already have my first problem. I want to disply my data in a graph. My data is in an Excel sheet converted to a .csv sheet. But I have some chemical formulars like Fe2O3 in my data and with the .csv all subscripst are gone. That doesn't look very nice. Is there any way to get the subscripts from the original Excel file into R?
I would really appreciate your help :)
Edit: My data contains 6 chemical formulars displayed on the x-axis, which all contain subscripts (i.e. Fe2O3, ZnCl2, CO2, ...) and nummeric values displayed on the y-axis. The graph is a bar chart. I am not sure if there is a way to either change the numbers to subscipts in R or keep them prior to the import.
The graph looks like this. But I would like to have the numbers as subscripts:
I don't know that there's a way to bring the formatting from excel into a CSV and then R, unless you can make those subscripts using unicode. UTF8 symbols for subscript letters
Given that your list of chemicals is short, it's not much work to tweak the chemical names to help ggplot interpret them with subscripts. You'll want brackets around the numbers, plus tildes afterwards if there are more elements to include. Then we also tell scale_x_discrete to "parse" the labels and convert those symbols to formatting.
set.seed(42)
chem_df <- tibble(
Chemicals =
c("AgNO3", "Al2SiO5", "CO2", "Fe2O3", "FeSO4", "ZnCl2"),
Chemicals_parsed =
c("AgNO[3]", "Al[2]~SiO[5]", "CO[2]", "Fe[2]~O[3]", "FeSO[4]", "ZnCl[2]"),
Mean = rnorm(6, 50, 30))
ggplot(chem_df, aes(x=Chemicals_parsed, Mean)) + geom_col() +
scale_x_discrete(name = "Chemicals",
labels=parse(text=chem_df$Chemicals_parsed))
To add to the excellent answer of #JonSpring, you can write a function which will convert strings like ""Al2SiO5" to strings like "Al[2]~SiO[5]", so you don't have to manually make all the conversions:
library(stringr)
chem.form <- function(s){
s <- str_replace_all(s,"([0-9]+)","[\\1]~")
if(endsWith(s,"~")) s <- substr(s,1,nchar(s) - 1)
s
}
Chemicals <- c("AgNO3", "Al2SiO5", "CO2", "Fe2O3", "FeSO4", "ZnCl2")
Chemicals_parsed <- as.vector(sapply(Chemicals,chem.form))

R extract() doesn´t accept my coordinates

First of all, I´m new to programing so this might be a simple question but i cant find the solution anywhere.
I´ve been using this code to extract values from a set of stacked rasters:
raster.files <- list.files()
raster.list <- list()
raster.files <-list.files(".",pattern ="asc")
for(i in 1: length(raster.files)){
raster.list[i] <- raster(raster.files[i])}
stacking <- stack(raster.list)
coord <- read.csv2("...")
extract.data <- extract(stacking,coord,method="simple")
I already used this code several times without any problem, until now. Every time I run the extract line I get this error:
Error in .doCellFromXY(object#ncols, object#nrows, object#extent#xmin, :
Not compatible with requested type: [type=character; target=double].
The coord file consists in a data.frame with 2 columns(X and Y respectively).
I´ve managed to found a way to bypass this error, its not technically a solution because I can´t understand why R was treating my data as text instead in first place.
Basically I separated the X and Y columns and treated them individually and then binded them again in a new data.frame:
coord_matrix_x<-as.numeric(as.matrix(coord[1]))
coord_matrix_y<-as.numeric(as.matrix(coord[2]))
coord2 <- cbind(coord_matrix_x, coord_matrix_y)
coord2<-as.data.frame(coord2)
coordinates(coord2)<-c("coord_matrix_x","coord_matrix_y")
It´s far form the most elegant way to do it, but it just works.

Retain SPSS value labels when working with data

I am analysing student level data from PISA 2015. The data is available in SPSS format here
I can load the data into R using the read_sav function in the haven package. I need to be able to edit the data in R and then save/export the data in SPSS format with the original value labels that are included in the SPSS download intact. The code I have used is:
library(haven)
student<-read_sav("CY6_MS_CMB_STU_QQQ.sav",user_na = T)
student2<-data.frame(student)
#some edits to data
write_sav(student2,"testdata1.sav")
When my colleague (who works in SPSS) tries to open the "testdata1.sav" the value labels are missing. I've read through the haven documentation and can't seem to find a solution for this. I have also tried read/write.spss in the foreign package but have issues loading in the dataset.
I am using R version 3.4.0 and the latest build of haven.
Does anyone know if there is a solution for this? I'd be very grateful of your help. Please let me know if you require any additional information to answer this.
library(foreign)
df <- read.spss("spss_file.sav", to.data.frame = TRUE)
This may not be exactly what you are looking for, because it uses the labels as the data. So if you have an SPSS file with 0 for "Male" and 1 for "Female," you will have a df with values that are all Males and Females. It gets you one step further, but perhaps isn't the whole solution. I'm working on the same problem and will let you know what else I find.
library ("sjlabelled")
student <- sjlabelled::read_spss("CY6_MS_CMB_STU_QQQ.sav")
student2 <-student
write_spss(student2,"testdata1.sav")
I did not try and hope it works. The sjlabelled package is good with non-ascii-characters as German Umlaute.
But keep in mind, that R saves the labels as attributes. These attributes are lost, when doing some data transformations (as subsetting data for example). When lost in R they won't show up in SPSS of course. The sjlabelled::copy_labels function is helpful in those cases:
student2 <- copy_labels(student2, student) #after data transformations and before export to spss
I think you need to recover the value labels in the dataframe after importing dataset into R. Then write the that dataframe into sav file.
#load library
libray(haven)
# load dataset
student<-read_sav("CY6_MS_CMB_STU_QQQ.sav",user_na = T)
#map to find class of each columns
map_dataset<-map(student, function(x)attr(x, "class"))
#Run for loop to identify all Factors with haven-labelled
factor_variable<-c()
for(i in 1:length(map_dataset)){
if(map_dataset[i]!="NULL"){
name<-names(map_dataset[i])
factor_variable<-c(factor_variable,name)
}
}
#convert all haven labelled variables into factor
student2<-student %>%
mutate_at(vars(factor_variable), as_factor)
#write dataset
write_sav(student2, "testdata1.sav")

R: Issues with line graph of germination trough time

I'm still in the process of learning R using Swirl and RStudio, and a goal I've set for myself is to recreate this graph. I have a small dataset that I will link below (it's saved as a plain text CSV file that I import into R with headings enabled).
If I try to plot that dataset without changing anything, I get this, which is obviously not the goal.
At first I thought the problem would be in the class of my imported dataset, defined as kt. After class(kt) turned out to be data.frame I figured that wasn't the problem. Should I be trying to rewrite the table to something that R can plot instantly, or should I be trying to extract each species individually, plot them separately and then combining the different plots into one graph? Perhaps there is something wrong with my dates, I know that R handles dates in a specific way. Maybe these solutions are not even needed and I'm just doing something stupidly simple wrong, but I can't find it myself.
Your help is much appreciated.
Dataset:
Species,week 0,week 1,week 2,week 3,week 4,week 5,week 6,week 7,week 8,week 9,week 10,week 11,week 12,week 13,week 14,week 15,week 16,week 17,week 18
Caesalpinia coriaria,0.0%,24.0%,28.0%,28.0%,32.0%,37.0%,40.0%,46.0%,52.0%,56.0%,63.0%,64.0%,68.0%,71.0%,72.0%,,,,
Coccoloba swartzii,0.0%,0.0%,1.0%,10.0%,19.0%,31.0%,33.0%,39.0%,43.0%,48.0%,52.0%,52.0%,52.0%,52.0%,52.0%,52.0%,52.0%,55.0%,
Cordia dentata,0.0%,5.0%,18.0%,21.0%,24.0%,26.0%,27.0%,30.0%,32.0%,32.0%,32.0%,32.0%,32.0%,32.0%,33.0%,33.0%,33.0%,34.0%,35.0%
Guaiacum officinale,0.0%,0.0%,0.0%,0.0%,4.0%,5.0%,5.0%,5.0%,7.0%,8.0%,8.0%,8.0%,8.0%,8.0%,8.0%,8.0%,8.0%,,
Randia aculeata,0.0%,0.0%,0.0%,4.0%,13.0%,14.0%,18.0%,19.0%,21.0%,21.0%,21.0%,21.0%,21.0%,22.0%,22.0%,22.0%,22.0%,,
Schoepfia schreberi,0.0%,0.0%,0.0%,0.0%,0.0%,0.0%,1.0%,4.0%,8.0%,11.0%,13.0%,21.0%,21.0%,24.0%,24.0%,25.0%,27.0%,,
Prosopis juliflora,0.0%,7.5%,31.3%,34.2%,,,,,,,,,,,,,,,
Something like this??
# get rid of "%" signs
df <- data.frame(sapply(df,function(x)gsub("%","",x,fixed=T)))
# convert cols 2:20 to numeric
df[,2:20] <- sapply(df[,2:20],function(x)as.numeric(as.character(x)))
library(reshape2)
library(ggplot2)
gg <- melt(df,id="Species")
ggplot(gg,aes(x=variable,y=value,color=Species,group=Species)) +
geom_line()+
theme_bw()+
theme(legend.position="bottom", legend.title=element_blank())
There are lots of problems here.
First, if your dataset really has those % signs, then R interprets the data as character and imports it as factors. So first we have to get rid of the % (using gsub(...), and then we have to convert what's left to numeric. With factors, you have to convert to character first, then numeric, so: as.numeric(as.character(...)). All of this could have been avoided if you exported the data without the % signs!!!
Plotting multiple curves with different colors is something the ggplot package was designed for (among many other things), so we use that. ggplot prefers data in "long" format - all the data in one column, with a second column distinguishing different datasets. Your data is in "wide" format - data in different columns. So we convert to long using melt(...) from the reshape2 package. The result, gg has three columns: Species, variable and value. value contains the actual data and variable contains the week number.
So now we create a ggplot object, setting the x-axis to the variable column, the y-axis to the value column, with color mapped to Species, and we tell ggplot to plot lines (using geom_line(...)).
The rest is to position the legend at the bottom, and turn off some of the ggplot default formatting.

Reading from CSV and Plotting Boxes in R

I am looking for the most convenient way of creating boxplots for different values and groups read from a CSV file in R.
First, I read my Sheet into memory:
Sheet <- read.csv("D:/mydata/Table.csv", sep = ";")
Which just works fine.
names(Sheet)
gives me correctly the Headlines of the different columns.
I can also access and filter different groups into separate lists, like
myData1 <- Sheet[Sheet$Group == 'Group1',]$MyValue
myData2 <- Sheet[Sheet$Group == 'Group2',]$MyValue
...
and draw a boxplot using
boxplot(myData1, myData2, ..., main = "Distribution")
where the ... stand for more lists I have filled using the selection method above.
However, I have seen that using some formular could do these steps of selection and boxplotting in one go. But when I use something like
boxplot(Sheet~Group, Sheet)
it won't work because I get the following error:
invalid type (list) for variable 'Sheet'
The data in the CSV looks like this:
No;Gender;Type;Volume;Survival
1;m;HCM;150;45
2;m;UCM;202;103
3;f;HCM;192;5
4;m;T4;204;101
...
So i have multiple possible groups and different values which I'd like to represent as a box plot for each group. For example, I could group by gender or group by type.
How can I easily draw multiple boxes from my CSV data without having to grab them all manually out of the data?
Thanks for your help.
Try it like this:
Sheet <- data.frame(Group = gl(2, 50, labels=c("Group1", "Group2")),
MyValue = runif(100))
boxplot(MyValue ~ Group, data=Sheet)
Using ggplot2:
ggplot(Sheet, aes(x = Group, y = MyValue)) +
geom_boxplot()
The advantage of using ggplot2 is that you have lots of possibilities for customizing the appearance of your boxplot.

Resources