Plotting a data frame in R - r

I have this data frame and I'd like to know if there's a way to plot it using the ggplot2 library (or anything that works). The first column has a bunch of zip codes and the second column contains weather data (temperature in this case) associated with the corresponding zip code. I want to create a graph (bar/plot) with the zip codes on the x axis and the temperature values on the y axis, but I don't know how to do it.
V1 V2
1 20904 82.9
2 20905 80.1
3 20906 84.6
4 20907 84.6
5 20908 88.0
6 20910 84.6
7 20911 84.6
8 20912 86.1
9 20913 86.1
10 20914 80.7
11 20915 84.6

You can also do a simple barplot:
ydf <- ZipGraph
# Zip codes are in column 1 and temperatures in column 2
barplot(ydf[,2],names.arg = ydf[,1],col=rainbow(nrow(ydf)),
xlab="zipcode",ylab="Temperature",cex.axis = .8,cex.names = .7)
Edit: Or, to draw the zip code labels perpendicular to the axis so that they all fit, you can do
ylims=c(0,max(ydf[,2])*1.2)
y1=barplot(ydf[,2],col=rainbow(nrow(ydf)),xaxt="n",ylim=ylims,
xlab="zipcode",ylab="Temperature",cex.axis = .8)
axis(1, at=y1,labels=ydf[,1],las=2,cex.axis= .6)
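Since the question mentions ggplot2, here is a minimal sketch of the same bar chart with that package. It assumes the data frame is called ZipGraph (the name used above), with the zip codes in V1 and the temperatures in V2; the axis labels and the label rotation are my own choices.

```r
library(ggplot2)

# Data copied from the question
ZipGraph <- data.frame(
  V1 = c(20904, 20905, 20906, 20907, 20908, 20910,
         20911, 20912, 20913, 20914, 20915),
  V2 = c(82.9, 80.1, 84.6, 84.6, 88.0, 84.6,
         84.6, 86.1, 86.1, 80.7, 84.6)
)

# factor() treats each zip code as a discrete category
# rather than a number on a continuous axis
p <- ggplot(ZipGraph, aes(x = factor(V1), y = V2)) +
  geom_col() +
  labs(x = "Zip code", y = "Temperature") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
```

Printing p at the console (or calling print(p) in a script) draws the plot.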

Combine multiple csv files (similar column) in one with selective row and column in r

I have around 300-500 CSV files, each with some character information at the beginning followed by two columns of numeric data. I want to make one data.frame with all the numeric values, in such a way that column X appears once, followed by multiple Y columns, without the character rows.
**File1** has two columns and more than a thousand rows; an example looks like
info info
info info
info info
X Y
1 50.3
2 56.2
3 96.5
4 56.4
5 65.2
info 0
**File2**
info info
info info
info info
X Y
1 46.3
2 65.2
3 21.6
4 98.2
5 25.3
info 0
Only the Y values change from file to file. I want to combine all the files into one, keeping only selected rows, and make a data frame. For example, I want the following data frame:
X Y1 Y2
1 46.3 50.3
2 65.2 56.2
3 21.6 96.5
4 98.2 56.4
5 25.3 65.2
I tried
files <- list.files(pattern="*.csv")
l <- list()
for (i in 1:length(files)){
l[[i]] <- read.csv(files[i], skip = 3)
}
data.frame(l)
This gives me
X1 Y1 X2 Y2
1 46.3 1 50.3
2 65.2 2 56.2
3 21.6 3 96.5
4 98.2 4 56.4
5 25.3 5 65.2
info 0 info 0
How can I skip the last row and keep column X only once, as the first column (since the X values do not change)?
Define a function Read that reads one file removing all lines that do not start with a digit. We use sep="," to specify that the fields in each file are comma separated. Then use Read with read.zoo to read and merge the files giving zoo object z. Finally either use it as a zoo object or convert it to a data frame as shown.
library(zoo)
Read <- function(f) {
  read.table(text = grep("^\\d", readLines(f), value = TRUE), sep = ",")
}
z <- read.zoo(Sys.glob("*.csv"), read = Read)
DF <- fortify.zoo(z)
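If you would rather avoid the zoo dependency, the same idea can be sketched in base R. This is only a sketch under some assumptions: every file has the same X values in the same order, the fields are comma separated, and the file names, the temporary directory and the helper read_numeric are made up for the illustration.

```r
# Write two small example files (values taken from the question)
demo_dir <- tempfile("csvdemo")
dir.create(demo_dir)
writeLines(c("info,info", "info,info", "info,info", "X,Y",
             "1,50.3", "2,56.2", "3,96.5", "info,0"),
           file.path(demo_dir, "file1.csv"))
writeLines(c("info,info", "info,info", "info,info", "X,Y",
             "1,46.3", "2,65.2", "3,21.6", "info,0"),
           file.path(demo_dir, "file2.csv"))

# Hypothetical helper: keep only the lines that start with a digit
read_numeric <- function(f) {
  lines <- grep("^[0-9]", readLines(f), value = TRUE)
  read.table(text = lines, sep = ",", col.names = c("X", "Y"))
}

files  <- sort(list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE))
tables <- lapply(files, read_numeric)

# One Y column per file; X is taken from the first file only
ys <- sapply(tables, function(tab) tab$Y)
colnames(ys) <- paste0("Y", seq_along(tables))
combined <- data.frame(X = tables[[1]]$X, ys)
```

Unlike the zoo merge, this does not align rows by X, so it is only safe when the X column really is identical across files.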

Applying a label depending on which condition is met using R

I would like to use a simple R function where the contents of a specified data frame column are read row by row, then depending on the value, a string is applied to that row in a new column.
So far, I've tried to use a combination of loops and generating individual columns which were combined later. However, I cannot seem to get the syntax right.
The input looks like this:
head(data,10)
# A tibble: 10 x 5
Patient T1Score T2Score T3Score T4Score
<dbl> <dbl> <dbl> <dbl> <dbl>
1 3 96.4 75 80.4 82.1
2 5 100 85.7 53.6 55.4
3 6 82.1 85.7 NA NA
4 7 82.1 85.7 60.7 28.6
5 8 100 76.8 64.3 57.7
6 10 46.4 57.1 NA 75
7 11 71.4 NA NA NA
8 12 98.2 92.9 85.7 82.1
9 13 78.6 89.3 37.5 42.9
10 14 89.3 100 64.3 87.5
and the function I have written looks like this:
minMax<-function(x){
#make an empty data frame for the output to go
output<-data.frame()
#making sure the rest of the commands only look at what I want them to look at in the input object
a<-x[2:5]
#here I'm gathering the columns necessary to perform the calculation
minValue<-apply(a,1,min,na.rm=T)
maxValue<-apply(a,1,max,na.rm=T)
tempdf<-as.data.frame((cbind(minValue,maxValue)))
Difference<-tempdf$maxValue-tempdf$minValue
referenceValue<-ave(Difference)
referenceValue<-referenceValue[1]
#quick aside to make the first two thirds of the output file
output<-as.data.frame((cbind(x[1],Difference)))
#Now I need to define the class based on the referenceValue, and here is where I run into trouble.
apply(output, 1, FUN =
for (i in Difference) {
ifelse(i>referenceValue,"HIGH","LOW")
}
)
output
}
I also tried...
if (i>referenceValue) {
apply(output,1,print("HIGH"))
}else(print("LOW")) {}
}
)
output
}
Regardless, both end up giving me the error message,
c("'for (i in Difference) {' is not a function, character or symbol", "' ifelse(i > referenceValue, \"HIGH\", \"LOW\")' is not a function, character or symbol", "'}' is not a function, character or symbol")
The expected output should look like:
Patient Difference Toxicity
3 21.430000 LOW
5 46.430000 HIGH
6 3.570000 LOW
7 57.140000 HIGH
8 42.310000 HIGH
10 28.570000 HIGH
11 0.000000 LOW
12 16.070000 LOW
13 51.790000 HIGH
14 35.710000 HIGH
Is there a better way for me to organize the last loop?
Since you seem to be using tibbles anyway, here's a much shorter version using dplyr and tidyr:
library(dplyr)
library(tidyr)
d %>%
  gather(key = tscore,value = score,T1Score:T4Score) %>%
  group_by(Patient) %>%
  summarise(Difference = max(score,na.rm = TRUE) - min(score,na.rm = TRUE)) %>%
  ungroup() %>%
  mutate(AvgDifference = mean(Difference),
         Toxicity = if_else(Difference > mean(Difference),"HIGH","LOW"))
# A tibble: 10 x 4
Patient Difference AvgDifference Toxicity
<int> <dbl> <dbl> <chr>
1 3 21.4 30.3 LOW
2 5 46.4 30.3 HIGH
3 6 3.6 30.3 LOW
4 7 57.1 30.3 HIGH
5 8 42.3 30.3 HIGH
6 10 28.6 30.3 LOW
7 11 0 30.3 LOW
8 12 16.1 30.3 LOW
9 13 51.8 30.3 HIGH
10 14 35.7 30.3 HIGH
Your expected output was probably based on a slightly different average difference, so this output differs slightly: patient 10 comes out as LOW here because its difference (28.6) is below the average (30.3).
And a much simpler base R version if you prefer:
d$min <- apply(d[,2:5],1,min,na.rm = TRUE)
d$max <- apply(d[,2:5],1,max,na.rm = TRUE)
d$diff <- d$max - d$min
d$avg_diff <- mean(d$diff)
d$toxicity <- with(d,ifelse(diff > avg_diff,"HIGH","LOW"))
A few notes on your existing code:
as.data.frame((cbind(minValue,maxValue))) is not an advisable way to create data frames. This is more awkward than simply doing data.frame(minValue = minValue,maxValue = maxValue) and risks unintended coercion from cbind.
ave is for computing summaries over groups; just use mean if you have a single vector.
The FUN argument in apply expects a function, not an arbitrary expression, which is what you're trying to pass at the end. The general syntax for an "anonymous" function in that context would be apply(...,FUN = function(arg) { do some stuff and return exactly the thing you want}).
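To make the last note concrete, here is a small sketch of apply with an anonymous function, using the first three patients from the question. The names scores and output are just for illustration.

```r
# First three rows of the question's data
scores <- data.frame(
  Patient = c(3, 5, 6),
  T1Score = c(96.4, 100, 82.1),
  T2Score = c(75, 85.7, 85.7),
  T3Score = c(80.4, 53.6, NA),
  T4Score = c(82.1, 55.4, NA)
)

# FUN receives one row at a time and returns a single number
Difference <- apply(scores[2:5], 1, function(row) {
  max(row, na.rm = TRUE) - min(row, na.rm = TRUE)
})

# ifelse is already vectorised, so no loop is needed for the label
Toxicity <- ifelse(Difference > mean(Difference), "HIGH", "LOW")

output <- data.frame(Patient = scores$Patient, Difference, Toxicity)
```

With only three patients the average difference is of course not the same as for the full data, so the labels here are only meant to show the mechanics.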

R algorithm for deleting rows with artefacts

I'm analyzing DEM data of rivers with R and need assistance with the data processing. The DEM data include many artifacts, where the river longitudinal profile goes slightly uphill, which is in fact nonsense. So I would like to have an algorithm to delete all rows from the data set where the Z value (elevation) is higher than the predecessor. To explain it better, just look at the following data rows:
*data.frame*
ID Z
1 105.2
2 105.4
3 105.3
4 105.1
5 105.1
6 105.2
7 104.9
I would like to delete rows 2, 3 and 6 from the list. I wrote the following code but it doesn't work:
i <- *data.frame*[1,2]
for (n in *data.frame*[,2]) {if(n-i>0) *data.frame*[i,2]=0 else i <- n}
I would really appreciate it if anybody could help.
Apparently, you want to recursively remove values until there are no increasing values. It would be easiest to simply use a while loop:
DF <- read.table(text = "ID Z
1 105.2
2 105.4
3 105.3
4 105.1
5 105.1
6 105.2
7 104.9", header = TRUE)
while(any(diff(DF$Z) > 0)) DF <- DF[c(TRUE, diff(DF$Z) <= 0),]
# ID Z
#1 1 105.2
#4 4 105.1
#5 5 105.1
#7 7 104.9
Test for yourself whether this is sufficiently efficient.
I would also like to comment on your whole idea of data cleaning here. I find it very dubious. How do you know that there is no error in the values that don't increase? You might remove perfectly valid values because there is an error in a (strongly) decreasing value.
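If the loop turns out to be too slow, a single-pass variant is possible. As far as I can tell, a row survives the loop exactly when its Z value is no higher than every value before it, which is the same as being equal to the running minimum. The sketch below assumes Z contains no NA values; the name cleaned is just for illustration.

```r
DF <- data.frame(ID = 1:7,
                 Z = c(105.2, 105.4, 105.3, 105.1, 105.1, 105.2, 104.9))

# cummin(Z) is the lowest elevation seen so far;
# keep a row only if it sets or ties that minimum
cleaned <- DF[DF$Z == cummin(DF$Z), ]
```

On the example data this keeps rows 1, 4, 5 and 7, the same rows as the while loop.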

Is there a way to multiply two columns in R like this, ex. 17.5 x 4 would give me a list of 4 rows 17.5,17.5,17.5,17.5?

I need to multiply two columns so that the result, columnC, is a list of columnA with columnB entries (sorry if that is confusing, I don't know how else to say it). So columnA (17.5) * columnB (4) gives columnC (17.5, 17.5, 17.5, 17.5).
Is this possible? I need to make a histogram in R but the data is entered in the A B format (i.e. there were 4 ind at 17.5, 2 ind at 16.8, 5 ind at 15.9, etc) but I cannot get the plotting to work this way so I thought if I changed it to just a list of values it would work. It is a very large data set and doing this manually is prohibitive. Is there a better way to do this? New to R so any help is greatly appreciated.
As far as I can tell the following code will do what you want it to:
dat <- data.frame( x = c(17.5,16.8,15.9),y=c(4,2,5))
newDat <- data.frame( x = rep(dat$x,dat$y), y = rep(1,sum(dat$y) ) )
if(!require("ggplot2")){ #INCLUDE PACKAGE ggplot2 AND INSTALL IT IF IT'S NOT ALREADY INSTALLED
install.packages("ggplot2",repos="http://ftp.heanet.ie/mirrors/cran.r-project.org/",dependencies = TRUE)
library("ggplot2")
}
ggplot(newDat, aes(x=x, y=y, fill=factor(x))) + geom_bar(stat="identity")
Depending on the size of your data this might not make sense and you might want to do something other than appending a column of 1 to your dataframe, but for this toy example it functions fine.
Suppose your data frame (called df) is as follows:
A B
1 17.5 5
2 16.8 8
One way to expand (i.e. replicate) is
df <- df[rep(rownames(df), df$B),]
# A B
#1 17.5 5
#1.1 17.5 5
#1.2 17.5 5
#1.3 17.5 5
#1.4 17.5 5
#2 16.8 8
#2.1 16.8 8
#2.2 16.8 8
#2.3 16.8 8
#2.4 16.8 8
#2.5 16.8 8
#2.6 16.8 8
#2.7 16.8 8
If you want to 'tidy' your rownames you can just do,
rownames(df) <- NULL
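Since the end goal was a histogram, the expanded values can go straight into hist() without building a new data frame. A minimal sketch, using the counts from the question (4 at 17.5, 2 at 16.8, 5 at 15.9); the names dat and values are just for illustration.

```r
dat <- data.frame(x = c(17.5, 16.8, 15.9), y = c(4, 2, 5))

# Repeat each value as many times as it was observed
values <- rep(dat$x, times = dat$y)

# plot = FALSE only computes the bins, which is handy for checking the counts
h <- hist(values, plot = FALSE)
```

Calling hist(values) without plot = FALSE draws the histogram.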

Sum or mean of certain element in data frame in R

I have a data frame that looks like this:
k v 2002 2006 2010
1 a x 79.1 80.2 83.2
2 a y 75.1 76.2 79.3
3 a z 74.7 75.8 79.0
4 b x 82.8 85.9 87.6
5 b y 81.1 83.5 85.1
6 b z 80.5 83.1 84.6
etc. What I need is the mean of the numeric values for every row, i.e. I want it to look like this:
k v tot
1 a x 80.833
2 a y 76.867
3 a z 76.500
4 b x 85.433
5 b y 83.233
6 b z 82.733
I don't want to keep the original values, just the means. I know about rowMeans but as far as I know I can't (and don't want to) use it since it is averaging the whole row, not just the three last columns. I tried to use
rowMeans(subset(df,select=3:5))
but then I only get the numerical values and lose the variables k and v. Does anyone know a convenient way to get the mean over just some of the elements in a row?
dplyr::mutate(df, tot= (`2002`+`2006`+`2010`)/3)
should work too.
This preserves the first two columns, as you intended, and appends a column named tot holding the mean of the three year columns. Note that mutate also keeps the original year columns.
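If you want only k, v and the mean, as in the desired output, rowMeans can still be used on just the year columns and then combined with the first two columns. A minimal base R sketch, assuming the year columns are columns 3 to 5; the name result is just for illustration.

```r
# Data from the question; check.names = FALSE keeps the numeric column names
df <- data.frame(
  k = c("a", "a", "a", "b", "b", "b"),
  v = c("x", "y", "z", "x", "y", "z"),
  `2002` = c(79.1, 75.1, 74.7, 82.8, 81.1, 80.5),
  `2006` = c(80.2, 76.2, 75.8, 85.9, 83.5, 83.1),
  `2010` = c(83.2, 79.3, 79.0, 87.6, 85.1, 84.6),
  check.names = FALSE
)

# Keep the two label columns and add the row mean of the year columns
result <- data.frame(df[1:2], tot = rowMeans(df[3:5]))
```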
