read.csv error due to no column names (R)

I'm trying to read a CSV file in R.
The issue is that my file has no column names except for the first column.
Using the read.csv() function gives me the error 'Error in read.table : more columns than column names'.
So I used the read_csv() function from the readr library.
However, this creates a data frame with just one column containing all the values.
(screenshot: https://i.stack.imgur.com/Och8A.png)
What should I do to fix this issue?

A first cut at reading the data would be to use skip = 1 (to skip the first line, which appears to be descriptive only) and header = FALSE:
quux <- read.csv("path/to/file.csv", skip = 1, header = FALSE)
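(The readr equivalent, since the question used read_csv: col_names = FALSE plays the role of header = FALSE.)
quux <- readr::read_csv("path/to/file.csv", skip = 1, col_names = FALSE)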
This format is a bit awkward, so we may want to reshape it:
# transpose everything but the first column, then use the first column
# (with its trailing ":" stripped) as the new column names
quux <- setNames(data.frame(t(quux[,-1])), sub(":$", "", quux[[1]]))
quux
# LON LAT MMM 1984-Nov-01 1974-Nov-05
# V2 151.0 -24.5 27.11 22.28 22.92
# V3 151.5 -24.0 27.46 22.47 22.83
# V4 152.0 -24.0 27.19 22.27 22.64
Many tools prefer to have the "month" column names as a single column, which is converting this data from "wide" format to "long". This is easily done with either tidyr::pivot_longer or reshape2::melt:
dat <- reshape2::melt(quux, c("LON", "LAT", "MMM"), variable.name = "date")
dat
# LON LAT MMM date value
# 1 151.0 -24.5 27.11 1984-Nov-01 22.28
# 2 151.5 -24.0 27.46 1984-Nov-01 22.47
# 3 152.0 -24.0 27.19 1984-Nov-01 22.27
# 4 151.0 -24.5 27.11 1974-Nov-05 22.92
# 5 151.5 -24.0 27.46 1974-Nov-05 22.83
# 6 152.0 -24.0 27.19 1974-Nov-05 22.64
dat <- tidyr::pivot_longer(quux, -c(LON, LAT, MMM), names_to = "date")
From here, it might be nice to make the date column a "proper" Date object so that "number-like" things can be done with it. For example, in its present form, sorting is incorrect since Apr will land before Jan; other number-like operations include finding ranges of dates (which can be done with strings, but not these strings) and adding/subtracting days (e.g., 7 days prior to a value).
dat$date <- as.Date(dat$date, format = "%Y-%b-%d")
dat
# LON LAT MMM date value
# 1 151.0 -24.5 27.11 1984-11-01 22.28
# 2 151.5 -24.0 27.46 1984-11-01 22.47
# 3 152.0 -24.0 27.19 1984-11-01 22.27
# 4 151.0 -24.5 27.11 1974-11-05 22.92
# 5 151.5 -24.0 27.46 1974-11-05 22.83
# 6 152.0 -24.0 27.19 1974-11-05 22.64
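With a proper Date column, the number-like operations mentioned above now work:
sort(unique(dat$date)) # chronological order, not alphabetical
range(dat$date)        # earliest and latest observation
dat$date - 7           # 7 days prior to each value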
Sample data:
quux <- read.csv(skip = 1, header = FALSE, text = '
LON:,151.0,151.5,152.0
LAT:,-24.5,-24.0,-24.0
MMM:,27.11,27.46,27.19
1984-Nov-01,22.28,22.47,22.27
1974-Nov-05,22.92,22.83,22.64
')


Why is the R netCDF4 package transposing my data?

I'm reading a .nc file in R with ncdf4 and RNetCDF. The NetCDF metadata says that there are 144 lons and 73 lats, which should lead to 144 columns and 73 rows, right?
However, the data I get in R seems to be transposed, with 144 rows and 73 columns.
Please could you tell me what is wrong?
thanks
library(ncdf4)
a <- tempfile()
download.file(url = "ftp://ftp.cdc.noaa.gov/Datasets/ncep.reanalysis2.derived/pressure/uwnd.mon.mean.nc", destfile = a)
nc <- nc_open(a)
uwnd <- ncvar_get(nc = nc, varid = "uwnd")
dim(uwnd)
## [1] 144 73 17 494
umed <- (uwnd[ , , 10, 421] + uwnd[ , , 10, 422] + uwnd[ , , 10, 423])/3
nrow(umed)
## [1] 144
ncol(umed)
## [1] 73
It looks like you are having two problems.
The first is expecting the data to keep the same structure in R that it has in the netCDF file. That is a problem in itself, because the multi-dimensional array structure of a netCDF file does not map directly onto a two-dimensional data frame; netCDF data needs some reshaping in R before it can be manipulated the way it is in Python (see: http://geog.uoregon.edu/bartlein/courses/geog490/week04-netCDF.html).
The second one is that you are using values instead of indices when subsetting the data.
umed <- (uwnd[ , , 10, 421] + uwnd[ , , 10, 422] + uwnd[ , , 10, 423])/3
The solution I see is to start by creating the indices of the dimensions you want to subset. In this example I am subsetting pressure level 10 millibars and everything between longitudes 230 and 300 and latitudes 25 and 40.
nc <- nc_open("uwnd.mon.mean.nc")
LonIdx <- which( nc$dim$lon$vals > 230 & nc$dim$lon$vals <300 )
## [1] 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113
## 114 115 116 117 118 119 120
LatIdx <- which( nc$dim$lat$vals >25 & nc$dim$lat$vals < 40)
## [1] 22 23 24 25 26
LevIdx <- which( nc$dim$level$vals==10)
## [1] 17
Then you need to apply the indices to each dimension except time, which I assume you don't want to subset. Subsetting lon and lat is important because R holds everything in memory, so keeping their whole range would consume a significant amount of RAM.
lat <- ncvar_get(nc,"lat")[LatIdx]
lon <- ncvar_get(nc,"lon")[LonIdx]
lev <- ncvar_get(nc,"level")[LevIdx]
time <- ncvar_get(nc,"time")
After that you can get the variable you were looking for (uwnd, Monthly U-wind on Pressure Levels) and finish reading the netCDF file with nc_close(nc).
uwnd <- ncvar_get(nc,"uwnd")[LonIdx,LatIdx,LevIdx,]
nc_close(nc)
At the end you can expand the grid over all four dimensions: longitude, latitude, pressure level and time.
uwndf <- data.frame(as.matrix(cbind(expand.grid(lon,lat,lev,time))),c(uwnd))
names(uwndf) <- c("lon","lat","level","time","U-wind")
Bind it to a dataframe with the U-wind variable and convert the netcdf time variable into an R time object.
uwndf$time_final <- convertDateNcdf2R(uwndf$time, units = "hours",
                                      origin = as.POSIXct("1800-01-01", tz = "UTC"),
                                      time.format = "%Y-%m-%d %Z %H:%M:%S")
At the end you will have the dataframe you are looking for between Jan 1979 and March 2020.
max(uwndf$time_final)
## [1] "2020-03-01 UTC"
min(uwndf$time_final)
## [1] "1979-01-01 UTC"
head(uwndf)
## lon lat level time U-wind time_final
## 1 232.5 37.5 10 1569072 3.289998 1979-01-01
## 2 235.0 37.5 10 1569072 5.209998 1979-01-01
## 3 237.5 37.5 10 1569072 7.409998 1979-01-01
## 4 240.0 37.5 10 1569072 9.749998 1979-01-01
## 5 242.5 37.5 10 1569072 12.009998 1979-01-01
## 6 245.0 37.5 10 1569072 14.089998 1979-01-01
I hope this is useful! Cheers!
Note: for converting the netCDF time variable into an R time object, make sure you have the ncdf.tools library installed (convertDateNcdf2R comes from it).
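If you would rather avoid the extra dependency, base R can do the same conversion, assuming the time units really are hours since 1800-01-01 (check nc$dim$time$units to confirm before relying on this):
uwndf$time_final <- as.POSIXct(uwndf$time * 3600, origin = "1800-01-01", tz = "UTC")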

convert some rows to LBS in R

One of my vectors has different kinds of data; I've been trying to convert it, but I really can't find a way.
In the column I have the weights: the ones with no indicator are in lbs, the others are in KG, and I need them all in lbs. But I can't find how to work with only a specific set of rows, e.g. to strip off the KG and multiply by 2.20 to convert to lbs.
Weight
200
150
220
100KG
80KG
95KG
Try this example:
# example data
df1 <- read.table(text = "Weight
1 194
2 200
3 250
4 50Kg
5 40Kg
6 39Kg", header = TRUE, stringsAsFactors = FALSE)
# using ifelse (gives warning)
ifelse(grepl("Kg", df1$Weight),
       as.numeric(gsub("Kg", "", df1$Weight)) * 2.2,
       as.numeric(df1$Weight))
# [1] 194.0 200.0 250.0 110.0 88.0 85.8
# Warning message:
# In ifelse(grepl("Kg", df1$Weight), as.numeric(gsub("Kg", "", df1$Weight)) * :
# NAs introduced by coercion
# not using ifelse :)
# the multiplier is 1 + 1.2 = 2.2 where "Kg" is present, and 1 otherwise
as.numeric(gsub("Kg", "", df1$Weight)) * (1 + grepl("Kg", df1$Weight) * 1.2)
# [1] 194.0 200.0 250.0 110.0 88.0 85.8
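Note that the question's data uses uppercase "KG" while the example uses "Kg"; a case-insensitive variant (a small sketch, using the weights from the question) handles both:
w <- c("200", "150", "220", "100KG", "80KG", "95KG")
as.numeric(gsub("kg", "", w, ignore.case = TRUE)) *
  (1 + grepl("kg", w, ignore.case = TRUE) * 1.2)
# [1] 200 150 220 220 176 209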

Subtract data from lat/lon coordinates

I have 2 files of data that looks like this:
Model Data
long lat count
96.25 18.75 4
78.75 21.25 3
86.75 23.25 7
91.25 33.75 10
Observation Data
long lat count
96.75 25.75 10
86.75 23.25 7
78.75 21.25 11
95.25 30.25 5
I'm trying to subtract the counts of the lat/long combinations (model data-observation data) that match such that the first combination of 78.75 & 21.25 would give a difference count of -8. Any lat/long points without a match to subtract with would just be subtracted by or from 0.
I've tried an if statement as such to match points for subtraction:
if (modeldata$long == obsdata$long & modeldata$lat == obsdata$lat) {
  obsdata$difference <- modeldata$count - obsdata$count
}
However, this just subtracts rows in order, not by matching points, unless matching points happen to fall within the same row.
I also get these warnings:
Warning messages:
1: In modeldata$long == obsdata$long :
longer object length is not a multiple of shorter object length
2: In modeldata$lat == obsdata$lat :
longer object length is not a multiple of shorter object length
3: In if (modeldata$long == obsdata$long & modeldata$lat == :
the condition has length > 1 and only the first element will be used
Any help would be greatly appreciated!
You can merge on coordinates, replace NA with 0, and subtract.
mdl <- read.table(text = "long lat count
96.25 18.75 4
78.75 21.25 3
86.75 23.25 7
91.25 33.75 10", header = TRUE)
obs <- read.table(text = "long lat count
96.75 25.75 10
86.75 23.25 7
78.75 21.25 11
95.25 30.25 5", header = TRUE)
xy <- merge(mdl, obs, by = c("long", "lat"), all.x = TRUE)
xy[is.na(xy)] <- 0
xy$diff <- xy$count.x - xy$count.y
xy
long lat count.x count.y diff
1 78.75 21.25 3 11 -8
2 86.75 23.25 7 7 0
3 91.25 33.75 10 0 10
4 96.25 18.75 4 0 4
You can do this using a data.table join & update
library(data.table)
## reading your supplied data
# dt_model <- fread(
# 'long lat count
# 96.25 18.75 4
# 78.75 21.25 3
# 86.75 23.25 7
# 91.25 33.75 10'
# )
#
#
# dt_obs <- fread(
# "long lat count
# 96.75 25.75 10
# 86.75 23.25 7
# 78.75 21.25 11
# 95.25 30.25 5"
# )
setDT(dt_model)
setDT(dt_obs)
## this join & update will update the `dt_model`.
dt_model[
dt_obs
, on = c("long", "lat")
, count := count - i.count
]
dt_model
# long lat count
# 1: 96.25 18.75 4
# 2: 78.75 21.25 -8
# 3: 86.75 23.25 0
# 4: 91.25 33.75 10
Noting the obvious caveat that joining on coordinates (floats/decimals) may not always give the right answer.
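One pragmatic guard against that caveat (my suggestion, not part of the original answer) is to round the coordinate keys to a fixed precision before joining:
dt_model[, `:=`(long = round(long, 2), lat = round(lat, 2))]
dt_obs[, `:=`(long = round(long, 2), lat = round(lat, 2))]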
Here is an option with dplyr
library(dplyr)
left_join(mdl, obs, by = c("long", "lat")) %>%
  transmute(long, lat, count = count.x - replace(count.y, is.na(count.y), 0))
# long lat count
#1 96.25 18.75 4
#2 78.75 21.25 -8
#3 86.75 23.25 0
#4 91.25 33.75 10
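All three approaches keep only the model points (a left join). Since the question also wants observation points with no model match (subtracted from 0), a full join covers that; a sketch with dplyr:
library(dplyr)
full_join(mdl, obs, by = c("long", "lat")) %>%
  mutate(count = coalesce(count.x, 0) - coalesce(count.y, 0)) %>%
  select(long, lat, count)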

How to find the highest value of a column in a data frame in R?

I have the following data frame which I called ozone:
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
7 23 299 8.6 65 5 7
8 19 99 13.8 59 5 8
9 8 19 20.1 61 5 9
I would like to extract the highest value from ozone, Solar.R, Wind...
Also, if possible, how would I sort Solar.R or any other column of this data frame in descending order?
I tried
max(ozone, na.rm=T)
which gives me the highest value in the dataset.
I have also tried
max(subset(ozone, Ozone))
but got the error "'subset' must be logical".
I can set an object to hold a subset of the data with the following commands
ozone <- subset(ozone, Ozone >0)
max(ozone,na.rm=T)
but it gives the same value of 334, which is the max value of the data frame, not the column.
Any help would be great, thanks.
Similar to colMeans, colSums, etc., you could write a column maximum function, colMax, and a column sort function, colSort.
colMax <- function(data) sapply(data, max, na.rm = TRUE)
colSort <- function(data, ...) sapply(data, sort, ...)
I use ... in the second function in hopes of sparking your intrigue.
Get your data:
dat <- read.table(h=T, text = "Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
7 23 299 8.6 65 5 7
8 19 99 13.8 59 5 8
9 8 19 20.1 61 5 9")
Use colMax function on sample data:
colMax(dat)
# Ozone Solar.R Wind Temp Month Day
# 41.0 313.0 20.1 74.0 5.0 9.0
To do the sorting on a single column,
sort(dat$Solar.R, decreasing = TRUE)
# [1] 313 299 190 149 118 99 19
and over all columns use our colSort function,
colSort(dat, decreasing = TRUE) ## compare with '...' above
To get the max of any column you want something like:
max(ozone$Ozone, na.rm = TRUE)
To get the max of all columns, you want:
apply(ozone, 2, function(x) max(x, na.rm = TRUE))
And to sort:
ozone[order(ozone$Solar.R),]
Or to sort the other direction:
ozone[rev(order(ozone$Solar.R)),]
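Equivalently, order() takes a decreasing argument directly (one difference: rev(order(...)) puts the NA rows first, while decreasing = TRUE with the default na.last = TRUE keeps them last):
ozone[order(ozone$Solar.R, decreasing = TRUE), ]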
Here's a dplyr solution:
library(dplyr)
# find max for each column
summarise_each(ozone, funs(max(., na.rm=TRUE)))
# sort by Solar.R, descending
arrange(ozone, desc(Solar.R))
UPDATE: summarise_each() has been deprecated in favour of a more featureful family of functions: mutate_all(), mutate_at(), mutate_if(), summarise_all(), summarise_at(), summarise_if()
Here is how you could do it:
# find max for each column
ozone %>%
  summarise_if(is.numeric, funs(max(., na.rm = TRUE))) %>%
  arrange(Ozone)
or
ozone %>%
  summarise_at(vars(1:6), funs(max(., na.rm = TRUE))) %>%
  arrange(Ozone)
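funs() has since been deprecated as well; in current dplyr (>= 1.0) the idiomatic form uses across():
ozone %>%
  summarise(across(where(is.numeric), ~ max(.x, na.rm = TRUE)))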
In response to finding the max value for each column, you could try using the apply() function:
> apply(ozone, MARGIN = 2, function(x) max(x, na.rm=TRUE))
Ozone Solar.R Wind Temp Month Day
41.0 313.0 20.1 74.0 5.0 9.0
Another way would be to use ?pmax
# after t(ozone), each original row becomes a column, so pmax takes the
# element-wise max across rows, yielding one value per original column
do.call('pmax', c(as.data.frame(t(ozone)), na.rm = TRUE))
# [1] 41.0 313.0 20.1 74.0 5.0 9.0
There is a package matrixStats that provides functions for column and row summaries (see the package vignette), but you have to convert your data.frame into a matrix.
Then you run: colMaxs(as.matrix(ozone), na.rm = TRUE) (without na.rm = TRUE, the columns containing NA would return NA).
max(may$Ozone, na.rm = TRUE)
Without $Ozone it would take the max over the whole data frame; this can be learned in the swirl library.
I'm studying this course on Coursera too ~
Assuming that your data is in a data.frame called maxinozone, you can do this:
max(maxinozone[, 1], na.rm = TRUE) # max of the first (Ozone) column
max(ozone$Ozone, na.rm = TRUE) should do the trick. Remember to include the na.rm = TRUE or else R will return NA.
Try this solution:
Oz <- subset(data, data$Month == 5, select = Ozone) # Ozone values in the month of May (Month == 5)
summary(Oz) # gives characteristics of the Ozone column, including max, min, ...

R matplot function

As I am a beginner in R, allow me to ask R users a little question.
I want to represent in a graphic (points, lines, curves) the weight values of two human groups, treated and not treated with a drug (1, 0), measured ten times (months).
drug NumberIndividu Mar Apr May June July August September November October December
1 9 25.92 24.6 31.85 38.50 53.70 53.05 65.65 71.45 69.10 67.20
1 10 28.10 26.6 32.00 38.35 53.60 53.25 65.35 65.95 67.80 65.95
1 11 29.10 28.8 30.80 38.10 52.25 47.30 62.20 68.05 66.20 67.55
1 13 27.16 25.0 27.15 34.85 47.30 43.85 54.65 62.25 60.85 58.05
0 5 25.89 25.2 26.50 27.45 37.05 38.95 43.30 50.60 48.20 50.10
0 6 28.19 27.6 28.05 28.60 36.15 37.20 40.40 47.80 45.25 44.85
0 7 28.06 27.2 27.45 28.85 39.20 41.80 51.40 57.10 54.55 55.30
0 8 22.39 21.2 30.10 30.90 42.95 46.30 48.15 54.85 53.35 49.90
I tried :
w <- read.csv(file = "/file-weight.csv", header = TRUE, sep = ",")
w <- data.frame(w)
rownames(w[1:8, ])
rownames(w) <- c(w[, 1])
cols <- character(nrow(w))
cols[rownames(w) %in% rownames(w[1:4, ])] <- "blue"
cols[rownames(w) %in% rownames(w[5:8, ])] <- "red"
pairs(w, col = cols)
My question is how to configure the matplot function to get a single graphic view (points, curves, or histogram + curves).
My main goal is to visualize the distributions of all individuals, colored by the first column (drug), for all dates in one image.
Thanks a lot for your suggestions
Is this what you had in mind?
The code is based on the answer to this question, just using your dataset (df) instead of iris. So in that response, replace:
x <- with(iris, data.table(id=1:nrow(iris), group=Species, Sepal.Length, Sepal.Width,Petal.Length, Petal.Width))
with:
x <- with(df, data.table(id = 1:nrow(df), group = drug, df[3:12]))
If all you want is the density plots/histograms, it's easier (see below). These are complementary, because they show that weight is increasing in both control and test groups, just faster in the test group. You wouldn't pick that up from the scatterplot matrix. Also, there's the suggestion that variability in weight is greater in the control group, and grows over time.
library(ggplot2)
library(reshape2) # for melt(...)
# convert df into a form suitable to use with ggplot
gg <- melt(df, id.vars = 1:2, variable.name = "Month", value.name = "Weight")
# density plots
ggplot(gg) +
  stat_density(aes(x = Weight, y = ..scaled.., color = factor(drug)),
               geom = "line", position = "dodge") +
  facet_grid(Month ~ .) +
  scale_color_discrete("Drug")
# histograms
ggplot(gg) +
  geom_histogram(aes(x = Weight, fill = factor(drug)), position = "dodge") +
  facet_grid(Month ~ .) +
  scale_fill_discrete("Drug")
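For completeness, since the question literally asked about matplot: a minimal sketch (assuming w holds the data frame above, with the ten monthly weight columns in positions 3:12) draws one line per individual, colored by drug status. matplot plots each column of its input, so transposing puts months on the x-axis:
months <- w[, 3:12]                  # the ten monthly weight columns
matplot(t(months), type = "b", lty = 1, pch = 16,
        col = ifelse(w$drug == 1, "blue", "red"),
        xlab = "month index", ylab = "Weight")
legend("topleft", legend = c("drug = 1", "drug = 0"),
       col = c("blue", "red"), lty = 1, pch = 16)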
