Transform variables - r

How can I convert a categorical variable to a quantitative one so that I can make a boxplot?
My commands are:
wiser_perc<-read.csv("Perca_fluviatilis.csv",header=T, sep=";")
attach(wiser_perc)
summary(wiser_perc)
Country
Sweden :156
Germany: 73
France : 67
Norway : 19
Estonia: 8
(Other):7
Diversity
1,66E+00: 8
1,28E+00: 6
1,64E+00: 5
1,76E+00: 5
2,01E+00: 5
2,36E+00: 5
(Other):299
boxplot(Diversity~Country, data=wiser_perc,boxwex=0.7,cex.axis=0.8,ylab="Size diversity")
Error in boxplot.default(split(mf[[response]], mf[-response]), ...) :
adding class "factor" to an invalid object
So, I don't know how to change the variable "Diversity" to a quantitative variable.
Please, I'm stuck in that problem.

Don't use read.csv() here; use read.csv2() instead. The latter is designed to be "used in countries that use a comma as decimal point and a semicolon as field separator": it defaults to sep=";" and dec=",", so values such as 1,66E+00 are read as numbers rather than as a factor. That way you don't need to fix the mess caused by read.csv() afterwards.
Have a look at: http://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html
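As a minimal sketch of the difference, the snippet below reads a few made-up rows in the same semicolon-separated, decimal-comma layout as the question's file. The inline `csv_text` string and its values are illustrative assumptions, not the real data. It also shows one way to repair a column that has already been read in as a factor.

```r
# Illustrative rows only; the real file is Perca_fluviatilis.csv.
csv_text <- "Country;Diversity\nSweden;1,66E+00\nGermany;1,28E+00\nFrance;2,01E+00"

# read.csv() with sep=";" still expects "." as decimal mark, so "1,66E+00"
# stays text (and becomes a factor when stringsAsFactors = TRUE).
bad <- read.csv(text = csv_text, header = TRUE, sep = ";",
                stringsAsFactors = TRUE)
class(bad$Diversity)

# read.csv2() defaults to sep=";" and dec=",", so the column is numeric.
wiser_perc <- read.csv2(text = csv_text, header = TRUE)
class(wiser_perc$Diversity)

# Repairing an already-loaded factor: convert to character first,
# swap the decimal comma for a point, then convert to numeric.
fixed <- as.numeric(sub(",", ".", as.character(bad$Diversity), fixed = TRUE))
```

Once Diversity is numeric, the boxplot(Diversity~Country, data=wiser_perc, ...) call from the question should no longer hit the factor error.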

Related

Error: object not found in R. Headers not naming from .csv file

I am new to R and I keep getting inconsistent results with trying to display a column of data from a csv. I am able to import the csv into R without issue, but I can't call out the individual columns.
Here's my code:
setwd('mypath')
cdata <- read.csv(file="cendata.csv",header=TRUE, sep=",")
cdata
This prints out the following:
year pop
1 2010 2,775,332
2 2011 2,814,384
3 2012 2,853,375
4 2013 2,897,640
5 2014 2,936,879
6 2015 2,981,835
7 2016 3,041,868
8 2017 3,101,042
9 2018 3,153,550
10 2019 3,205,958
When I try to plot the following, the columns cannot be found.
plot(pop,year)
Error: object 'pop' not found
I even checked if the column names existed, and only data shows up.
ls()
[1] "data"
I can manually enter the data and label them "pop" and "year" but that kind of defeats the point of importing the csv.
Is there a way to label each header as an object?
year and pop are not independent objects. You need to refer to them as columns of the data frame you have imported. You may also need to remove the "," from the numbers and convert them to numeric before plotting. Try:
cdata$pop <- as.numeric(gsub(',', '', cdata$pop))
plot(cdata$year, cdata$pop)
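A self-contained sketch of the same fix, using two rows from the printed output above. The inline `csv_text` string is an assumption about what the file looks like (values with thousands separators must be quoted in a comma-separated file).

```r
# Assumed file layout: pop is quoted because it contains commas.
csv_text <- 'year,pop\n2010,"2,775,332"\n2011,"2,814,384"'
cdata <- read.csv(text = csv_text, header = TRUE)

# pop arrives as text; strip the thousands separators, then convert.
cdata$pop <- as.numeric(gsub(",", "", cdata$pop, fixed = TRUE))
str(cdata)

# Columns live inside the data frame, so address them through it:
# with(cdata, plot(year, pop))   # same as plot(cdata$year, cdata$pop)
```

with() evaluates the expression inside the data frame, which avoids both repeating cdata$ and relying on attach().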

Why is R adding empty factors to my data?

I have a simple data set in R -- 2 conditions called "COND", and within those conditions adults chose between one of 2 pictures, we call house or car. This variable is called "SAW"
I have 69 people, and 69 rows of data
For some reason, R is adding an empty factor level to both. How do I get rid of it?
When I type table to see how many are in each-- this is the output
table(MazeData$SAW)
car house
2 9 59
table(MazeData$COND)
Apples No_Apples
2 35 33
Where the heck are these 2 mystery rows coming from? It won't let me make my simple box plots and bar plots or run t.test because of this error. Can someone help? Thanks!!
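No answer is recorded for this question. One possible explanation, offered as an assumption rather than a diagnosis, is that the file contains rows with blank cells, which read.csv() keeps as an empty-string level when the columns become factors. The sketch below reproduces that situation with made-up rows and removes the blank level; `csv_text` and `clean` are hypothetical names.

```r
# Made-up data with two rows of blank cells, mimicking the suspected cause.
csv_text <- "COND,SAW\nApples,car\nNo_Apples,house\n,\nApples,house\n,"
MazeData <- read.csv(text = csv_text, stringsAsFactors = TRUE)

# The blank cells show up as an unnamed level in the counts.
table(MazeData$SAW)

# Keep only rows where both columns are filled in, then drop unused levels.
clean <- droplevels(MazeData[MazeData$SAW != "" & MazeData$COND != "", ])
table(clean$SAW)
```

If this is what is happening, View(MazeData) or which(MazeData$SAW == "") should show the offending rows.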

How to load CSV as factor in R

I have a file called metadata.csv that I want to load into R and convert to a factor.
I begin with:
metadata <- read.csv(file="metadata.csv", header=T, stringsAsFactors=T)
And this loads the CSV just fine. I've printed out metadata here:
> metadata
Filename Genre Date Gender
1 Austen_Emma.txt Social Early Female
2 Bronte_Eyre.txt Social Middle Female
3 Dickens_Expectations.txt Social Late Male
4 Eliot_Mill.txt Social Late Female
5 Lewis_Monk.txt Gothic Early Male
6 Radcliffe_Italian.txt Gothic Early Female
7 Shelley_Frankenstein.txt Gothic Middle Female
8 Stoker_Dracula.txt Gothic Late Male
9 Thackeray_Vanity.txt Social Middle Male
10 Trollope_Vicar.txt Social Middle Male
Now I want to convert it to a factor:
as.factor(metadata)
This gives me the following error:
Error in sort.list(y) : 'x' must be atomic for 'sort.list'
Have you called 'sort' on a list?
metadata is a dataframe, which is a special type of list made up of vectors of equal length. You can only use as.factor() on vectors. Therefore you must call as.factor() on each vector in the dataframe. This can be done using the lapply function:
metadata <- data.frame(lapply(metadata, factor))
This will convert each column to a factor (check this by class(metadata[, 1])). The overall structure of metadata will still be a dataframe.
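A small sketch of the per-column conversion, using two rows from the table above. The `metadata[] <-` form is a common variant, shown here as an alternative to wrapping the result in data.frame(); it keeps the existing data frame and replaces only its columns.

```r
# Two rows from the question, read as plain character columns.
metadata <- data.frame(
  Filename = c("Austen_Emma.txt", "Lewis_Monk.txt"),
  Genre    = c("Social", "Gothic"),
  stringsAsFactors = FALSE
)

# as.factor(metadata) fails because a data frame is a list, not a vector.
# Apply factor() to each column instead; the empty [] keeps the data frame.
metadata[] <- lapply(metadata, factor)

# Check the class of every column.
sapply(metadata, class)
```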
read.csv puts data into a data.frame
You cannot convert a data.frame into a factor. That's very basic R stuff.
It's like you're trying to change a bunch of .doc files into PDFs by converting your computer into a PDF. It just doesn't make sense.
The error is asking "Have you called sort on a list?" Yes, you have. as.factor calls sort, and your data.frame is a list.

Simple way to subset SpatialPolygonsDataFrame (i.e. delete polygons) by attribute in R

I would like to simply delete some polygons from a SpatialPolygonsDataFrame object based on corresponding attribute values in the @data data frame so that I can plot a simplified/subsetted shapefile. So far I haven't found a way to do this.
For example, let's say I want to delete all polygons from this world shapefile that have an area of less than 30000. How would I go about doing this?
Or, similarly, how can I delete Antarctica?
require(maptools)
getinfo.shape("TM_WORLD_BORDERS_SIMPL-0.3.shp")
# Shapefile type: Polygon, (5), # of Shapes: 246
world.map <- readShapeSpatial("TM_WORLD_BORDERS_SIMPL-0.3.shp")
class(world.map)
# [1] "SpatialPolygonsDataFrame"
# attr(,"package")
# [1] "sp"
head(world.map@data)
# FIPS ISO2 ISO3 UN NAME AREA POP2005 REGION SUBREGION LON LAT
# 0 AC AG ATG 28 Antigua and Barbuda 44 83039 19 29 -61.783 17.078
# 1 AG DZ DZA 12 Algeria 238174 32854159 2 15 2.632 28.163
# 2 AJ AZ AZE 31 Azerbaijan 8260 8352021 142 145 47.395 40.430
# 3 AL AL ALB 8 Albania 2740 3153731 150 39 20.068 41.143
# 4 AM AM ARM 51 Armenia 2820 3017661 142 145 44.563 40.534
# 5 AO AO AGO 24 Angola 124670 16095214 2 17 17.544 -12.296
If I do something like this, the plot does not reflect any changes.
world.map@data = world.map@data[world.map@data$AREA > 30000,]
plot(world.map)
same result if I do this:
world.map@data = world.map@data[world.map@data$NAME != "Antarctica",]
plot(world.map)
Any help is appreciated!
It looks like you're overwriting the data, but not removing the polygons. If you want to cut down the dataset including both data and polygons, try e.g.
world.map <- world.map[world.map$AREA > 30000,]
plot(world.map)
[[Edit 19 April, 2016]]
That solution used to work, but @Bonnie reports otherwise for a newer R version (though perhaps the data has changed too?):
world.map <- world.map[world.map@data$AREA > 30000, ]
Upvote @Bonnie's answer if that helped.
When I tried to do this in R 3.2.1, tim riffe's technique above did not work for me, although modifying it slightly fixed the problem. I found that I had to specifically reference the data slot as well before specifying the attribute to subset on, as below:
world.map <- world.map[world.map@data$AREA > 30000, ]
plot(world.map)
Adding this as an alternative answer in case others come across the same issue.
Note that subset() also does the job, and it avoids having to write the object's name in the condition.
world.map <- subset(world.map, AREA > 30000)
plot(world.map)
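The difference between overwriting the data slot and subsetting the whole object can be seen on a toy example. This sketch assumes the sp package is installed; the three unit squares, the attribute values, and the names `sq`, `toy.map`, `broken` and `big` are made up for illustration and are not taken from the world shapefile.

```r
library(sp)

# Helper (hypothetical): a unit square starting at x0, wrapped as a Polygons object.
sq <- function(x0, id) {
  xy <- cbind(c(x0, x0 + 1, x0 + 1, x0, x0), c(0, 0, 1, 1, 0))
  Polygons(list(Polygon(xy)), ID = id)
}

polys <- SpatialPolygons(list(sq(0, "a"), sq(2, "b"), sq(4, "c")))
attrs <- data.frame(NAME = c("A", "B", "C"),
                    AREA = c(10, 40000, 50000),
                    row.names = c("a", "b", "c"))
toy.map <- SpatialPolygonsDataFrame(polys, attrs)

# Overwriting only the data slot leaves every polygon in place.
broken <- toy.map
broken@data <- broken@data[broken@data$AREA > 30000, ]
c(polygons = length(broken@polygons), rows = nrow(broken@data))

# Subsetting the object itself filters polygons and attributes together.
big <- toy.map[toy.map$AREA > 30000, ]
c(polygons = length(big@polygons), rows = nrow(big@data))
```

Because plot() draws from the polygons slot, the first approach changes nothing on screen, which matches what the question describes.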
I used the above technique to make a map of just Australia:
australia.map <- world.map[world.map$NAME == "Australia",]
plot(australia.map)
The comma after "Australia" is important, as it turns out.
One flaw with this method is that it appears to retain all of the attribute columns and rows for all of the other countries, and just populates them with zeros. I found that if I wrote out a .shp file, then read it back in using readOGR (rgdal package), it automatically removes the null geographic data. Then I could write another shape file with only the data I want.
writeOGR(australia.map,".","australia",driver="ESRI Shapefile")
australia.map <- readOGR(".","australia")
writeOGR(australia.map,".","australia_small",driver="ESRI Shapefile")
On my system, at least, it's the "read" function that removes the null data, so I have to write the file after reading it back once (and if I try to re-use the filename, I get an error). I'm sure there's a simpler way, but this seems to work well enough for my purposes anyway.
As a second pointer: this does not work for shapefiles with "holes" in the shapes, because it is subsetting by index.

Sorting data in R

I have a dataset that I need to sort by participant (RECORDING_SESSION_LABEL) and by trial_number. However, when I sort the data using R none of the sort functions I have tried put the variables in the correct numeric order that I want. The participant variable comes out ok but the trial ID variable comes out in the wrong order for what I need.
using:
fix_rep[order(as.numeric(RECORDING_SESSION_LABEL), as.numeric(trial_number)),]
Participant ID comes out as:
118 118 118 etc. 211 211 211 etc. 306 306 306 etc.(which is fine)
trial_number comes out as:
1 1 10 10 11 11 12 12 13 13 14 14 15 15 16 16 17 17 18 18 19 19 2 2 20 20 .... (which is not what I want - it seems to be sorting lexically rather than numerically)
What I would like is for trial_number to be ordered like this within each participant number:
1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10 11 11 ....
I have checked that these variables are not factors and are numeric and also tried without the 'as.numeric', but with no joy. Looking around I saw suggestions that sort() and mixedsort() might do the trick in place of 'order', both come up with errors. I am slowly pulling my hair out over what I think should be a simple thing. Can anybody help shed some light on how to do this to get what I need?
Even though you claim it is not a factor, it does behave exactly as if it were a factor. Testing if something is a factor can be tricky since a factor is just an integer vector with a levels attribute and a class label. If it is a factor, your code needs to have a call to as.character() nested inside the as.numeric():
fix_rep[order(as.numeric(RECORDING_SESSION_LABEL), as.numeric(as.character(trial_number))),]
To be really sure if it's a factor, I recommend the str() function:
str(trial_number)
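To see why the nested as.character() matters, here is a small sketch with a made-up factor (the values are illustrative, not the asker's data). Calling as.numeric() directly on a factor returns the internal level codes, and the levels themselves are sorted as text, which reproduces the 1, 10, 11, 2 ordering from the question.

```r
# A factor whose labels look like numbers.
trial_number <- factor(c("10", "2", "1", "11", "2", "1"))
levels(trial_number)   # levels are sorted as text

# as.numeric() on a factor gives level codes, not the printed values.
codes  <- as.numeric(trial_number)
values <- as.numeric(as.character(trial_number))

# Ordering by codes reproduces the lexical order from the question.
by_codes  <- as.character(trial_number[order(codes)])

# Ordering by the real values gives the numeric order.
by_values <- values[order(values)]
```

str(trial_number) on this vector reports a factor with 4 levels, which is the quick check suggested above.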
I think it may be worthwhile for you to design your own function in this case. It wouldn't be too hard, basically you could just design a bubble-sort algorithm with a few alterations. These alterations could change each number to a string, and begin by sorting those with different numbers of digits into different bins (easily done by finding which numbers, which are now strings, have the greatest numbers of indices). Then, in a similar fashion, the numbers in these bins could be sorted by converting the least significant digit to a numeric type and checking to see which are the largest/smallest. If you're interested, I could come up with some code for this, however, it looks like the two above me have beat me to the punch with some of the built-in functions. I've never used those functions, so I'm not sure if they'll work as you intend, but there's no use in reinventing the wheel.
