I have a large data frame (dataset_n) consisting of several columns, each for a different variable.
For now I am concentrating on two variables:
q32, i.e. net recalled wages
pgssyear, i.e. the year when a person was asked the question about the wages
I would like to create an additional column holding the CPI for a given year (pgssyear), so that I can calculate real wages later on (dividing q32 by the CPI column).
I tried the following:
- copying the pgssyear under a different name (creating a new vector with the name "year", but with the same contents) and replacing the first year: 1992 with 1 using "replace" hoping to be able to replace 1993 with e.g. 1.35, etc.:
attach(dataset_n)
dataset_n$year <- pgssyear
detach(dataset_n)
dataset_n$year
None of these options:
replace(dataset_n$year, dataset_n$year == 1992, 1)
replace(dataset_n$year, dataset_n$year == "1992", 1)
with(data.frame(dataset_n), replace(dataset_n$year, dataset_n$year == 1992, 1)))
worked for me. Each time I got the message: "object of type 'closure' is not subsettable"
dataset_n$year[dataset_n$year==1992] <- 1
did not work either and I got the message:
Warning message:
In [<-.factor(*tmp*, dataset_n$year == 1992, value = c(NA, NA, :
invalid factor level, NA generated
I suspect that when creating the new vector the numeric data got treated as factors.
I tried also:
as.numeric(gsub(1992, 1, dataset_n$year))
as.numeric(gsub(1993, 1.35, dataset_n$year))
This time the values got replaced, but I failed to achieve it "all at once", which is what I need.
I have also run out of further ideas, so any help would be appreciated.
The other threads I have seen and which might be related are:
Replace given value in vector
Error in <my code> : object of type 'closure' is not subsettable
To make this line work:
dataset_n$year[dataset_n$year==1992] <- 1
Convert your year vector to numeric from factor like this:
dataset_n$year <- as.numeric(as.character(dataset_n$year))
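To map every year to its CPI "all at once", a named lookup vector is one option. This is a minimal sketch, not the asker's actual data: the three rows of dataset_n are made up, and the only CPI values used are the two mentioned in the question (1 for 1992, 1.35 for 1993). Indexing by as.character() also sidesteps the factor problem, because the lookup goes by the year label rather than the internal factor code.

```r
# Hypothetical stand-in for the real dataset_n (values are made up)
dataset_n <- data.frame(
  q32      = c(1000, 1350, 2700),
  pgssyear = factor(c(1992, 1993, 1993))
)

# Lookup table: names are the years, values are the CPI for that year
cpi_by_year <- c("1992" = 1, "1993" = 1.35)

# Index the lookup by the year labels; works for factor or numeric years
dataset_n$cpi <- unname(cpi_by_year[as.character(dataset_n$pgssyear)])

# Real wages: nominal wages divided by the CPI column
dataset_n$real_wage <- dataset_n$q32 / dataset_n$cpi
```

Years missing from the lookup vector come back as NA, which makes gaps in the CPI table easy to spot.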
The textsample below is in one column. Using R, I hope to separate it into 5 columns with the following headings: "Name" , "Location", "Date", "Time", "Warning" . I have tried separate() and strsplit() and haven't succeeded yet. I hope someone here can help.
textsample <- "Name : York-APC-UPS\r\n
Location : York SCATS Zigzag Road\r\n
Contact : Mechanical services\r\n
\r\n
http://York-APC-UPS.domain25.minortracks.wa.gov.au\r\n
http://192.168.70.56\r\n
http://FE81::3C0:B8FF:FE6D:8065\r\n
Serial Number : 5A1149T24253\r\n
Device Serial Number : 5A1149T24253\r\n
Date : 12/06/2018\r\n
Time : 08:45:46\r\n
Code : 0x0125\r\n
\r\n
Warning : A high humidity threshold violation exists for integrated Environmental Monitor TH Sensor
(Port 1 Temp 1 at Port 1) reporting over 50%CD.\r\n"
Here's an approach that should at least get you started:
We can use extract from tidyr to extract the text of interest with regular expressions.
Then we can use mutate_all to apply the same str_replace to get rid of the labels.
library(dplyr)
library(tidyr)
library(stringr)
as.data.frame(textsample) %>%
extract(1, into=c("Name","Location","Date","Time","Warning"),
regex = "(Name : .+)[^$]*(Location : .+)[^$]*(Date : .+)[^$]*(Time : .+)[^$]*(Warning : .+)[^$]*") %>%
mutate_all(list(~str_replace(.,"^\\w+ : ","")))
# Name Location Date Time
#1 York-APC-UPS York SCATS Zigzag Road 12/06/2018 08:45:46
# Warning
#1 A high humidity threshold violation exists for integrated Environmental Monitor TH Sensor
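This relies on capturing groups with (), see help(tidyr::extract) for details. We use [^$]* between the groups to skip the intervening text: inside a character class $ is a literal character, so [^$]* matches any run of characters other than a dollar sign, including line breaks.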
Note the first argument to extract is 1, which indicates the first (and only) column of the data.frame I made from your example data. Change this as necessary.
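For comparison, the same "label : value" idea can be sketched in base R, pulling each field out with its own small pattern. This is only an illustration under assumptions: the sample string is a shortened version of the one in the question, extract_field is a hypothetical helper name, and each field is assumed to sit on a single line (so a warning that wraps onto a second line would be cut at the line break).

```r
# Shortened, hypothetical version of the text in the question
textsample <- paste0(
  "Name : York-APC-UPS\r\n",
  "Location : York SCATS Zigzag Road\r\n",
  "Device Serial Number : 5A1149T24253\r\n",
  "Date : 12/06/2018\r\n",
  "Time : 08:45:46\r\n",
  "Warning : A high humidity threshold violation exists\r\n"
)

fields <- c("Name", "Location", "Date", "Time", "Warning")

# Hypothetical helper: return the text after "<field> : " on the line
# that starts with that field, or NA if the field is absent
extract_field <- function(field, text) {
  pattern <- paste0("(?m)^\\s*", field, " : ([^\\r\\n]*)")
  m <- regmatches(text, regexec(pattern, text, perl = TRUE))[[1]]
  if (length(m) < 2) NA_character_ else trimws(m[2])
}

values <- vapply(fields, extract_field, character(1), text = textsample)

# One row, one column per field
parsed <- as.data.frame(as.list(values), stringsAsFactors = FALSE)
```

Anchoring each pattern at the start of a line keeps "Name" from matching inside other labels that merely contain a similar word.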
I have looked through all the posts I could find on "argument lengths differ" errors from dplyr::arrange() or order(), but have not found an explanation.
I'm trying to write a function best() that returns the hospital with the lowest value of a given outcome from a data frame of hospital outcomes (dfout). When I paste the code straight into R it runs without an error, returning the name of the hospital with the lowest mortality rate.
Only when I call it as a function does it say "Error in order(State, outcome, Hospital) : argument lengths differ"
The function (note that I used capitalized names for column names and lowercase names for function arguments):
best <- function(state, outcome){
colnames(dfout) <- c("Hospital", "State", "Heartattack", "Heartfailure", "Pneumonia")
##Return hospital name with lowest 30 day mortality rate
arranged <- arrange(dfout, State, outcome, Hospital) ## arrange hospitals by state, mortality rate in the specified outcome in best() and alphabetically for the ties.
arranged1 <- arranged[arranged$State == state,] ## take the part of the ordered list where state = the state specified in best()
arranged1$Hospital[1]
}
Now if I call best("TX", Heartattack) I get "Error in order(State, outcome, Hospital) : argument lengths differ",
but if I simply run the code with state and outcome replaced by "TX" and Heartattack, I get a hospital, like this:
##Return hospital name with lowest 30 day mortality rate
arranged <- arrange(dfout, State, Heartattack, Hospital) ## arrange hospitals by state, mortality rate in the specified outcome in best() and alphabetically for the ties.
arranged1 <- arranged[arranged$State == "TX",] ## take the part of the ordered list where state = the state specified in best()
arranged1$Hospital[1]
[1] "CYPRESS FAIRBANKS MEDICAL CENTER"
My question is really: why does the function fail when the same code, with the values substituted in, works at the command line?
You need to pass outcome as a string and evaluate it inside the function, so that R interprets it as a column name rather than as literal text:
arranged <- arrange(dfout, State, eval(parse(text=outcome)), Hospital)
Now
# > best("TX","Heartattack")
# [1] CYPRESS FAIRBANKS MEDICAL CENTER
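An alternative that avoids eval(parse()) is to look the column up by name with [[ ]]. The sketch below uses base R's order() instead of arrange(), and a small made-up dfout (the hospital names and rates are hypothetical), so it shows the idea rather than the original function.

```r
# Hypothetical stand-in for dfout (names and rates are made up)
dfout <- data.frame(
  Hospital    = c("B HOSPITAL", "A HOSPITAL", "C HOSPITAL", "D HOSPITAL"),
  State       = c("TX", "TX", "TX", "CA"),
  Heartattack = c(12.1, 12.1, 15.0, 9.5),
  stringsAsFactors = FALSE
)

best <- function(state, outcome, data = dfout) {
  # Keep only the requested state
  in_state <- data[data$State == state, ]
  # outcome is a string, so [[outcome]] selects the column of that name;
  # ties on the outcome are broken alphabetically by hospital name
  ranked <- in_state[order(in_state[[outcome]], in_state$Hospital), ]
  ranked$Hospital[1]
}

best("TX", "Heartattack")
```

Because the column is selected by its name as a string, the call has to quote it: best("TX", "Heartattack"), not best("TX", Heartattack).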
I have an error that I don't understand.
I have downloaded an Excel file with unemployment rates by country and by year.
Basically, column 1 is Country, column 2 is 1990, column 3 etc...
I am trying to plot a histogram of the unemployment rate in 2005.
I use this code:
qplot(x=2005,y=Country,data=data)
But I always have this error:
Error: unexpected numeric constant in
I have tried to:
- convert all the names in character
- add a "y" before the year
- put brackets
But I still have this error.
Error: unexpected numeric constant in "qplot(y=data$2005"
Error: unexpected numeric constant in "qplot(x=y 2005"
With brackets, I have this error
Error: unexpected '[' in "qplot(x=["
Any idea? Many thanks in advance!
Edit:
Dataset: https://docs.google.com/spreadsheets/d/1frieoKODnD9sX3VCZy5c3QAjBXMY-vN7k_I9gR-gcU8/pub?gid=0
I have downloaded it (xlsx format), and changed the name of the first column
library(ggplot2)
library(readxl)
file<-"indicator_t 15-24 unemploy.xlsx"
excel_sheets(file)
data<-read_excel(file)
I've tried to plot:
qplot(x=2005,y=Total 15-24 unemployment (%),data=data)
Error: unexpected numeric constant in "qplot(x=2005,y=Total 15"
I have changed the name of the first column, and added a "y" before the years.
names2<-paste("y",names(data[,2:length(data)]))
data2<-c("Country",names2)
colnames(data)<-data2
I still have an error:
qplot(x=y2005,y=Country,data=data)
Error in eval(expr, envir, enclos) : object 'y2005' not found
There are several problems in your code, and you could certainly benefit from reading some basic references on R, such as http://tryr.codeschool.com/
What you are trying to do may be accomplished by
qplot ( x = data$"2005" , ylab="Total 15-24 unemployment (%)")
Here, the first argument specifies which data should be plotted, and ylab is used to set the y-axis label. Notice that this label must be enclosed by "quotes".
Edit:
Note also that "2005" may or may not be the name of your column. Check your column names with colnames(data).
Regarding the comment below, if the name of the column is actually 2005, you need to quote it as well. If you don't, R will interpret 2005 as a numeric constant:
> x$2000
Error: unexpected numeric constant in "x$2000"
> x$"2000"
[1] 1 2 4 6
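The same quoting rule can be tried on a small example. The data frame below is made up (the country names and rates are hypothetical); it shows two ways of reaching a column whose name is a number, and why prefixing the years with paste() may not produce the name y2005: paste() inserts a space by default, whereas paste0() does not.

```r
# Hypothetical data with a column literally named "2005"
unemployment <- data.frame(
  Country = c("A", "B"),
  `2005`  = c(10.5, 20.1),
  check.names = FALSE   # keep "2005" as-is instead of converting it to "X2005"
)

# Two equivalent ways to access a non-syntactic column name
rates_backtick <- unemployment$`2005`
rates_bracket  <- unemployment[["2005"]]

# paste() separates its arguments with a space; paste0() does not
with_space    <- paste("y", "2005")
without_space <- paste0("y", "2005")
```

With names built by paste("y", ...), the column would be called "y 2005", so y2005 is not found.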
I am having trouble using the functcomp function (from the FD package) in R.
I have 2 datasets: one with species frequency, and the other listing the functional traits of my species. The frequency dataset has 264 species listed in the first row and 27 sites listed in the first column, all values in dataset are between 0-1. The functional trait dataset has the same 264 species (copied & pasted from the frequency dataset to make sure identical) listed in the first column, and 5 different functional traits listed in the 1st row (height, life history, life form, origin, palatability).
I am using the following code:
traits.df <- read.table("species_functional_traits_6_ August.txt", header = TRUE)
frequency.df <- read.table("Spring 2014 - combined table - 6 August.txt", header = TRUE)
x <- (as.matrix(traits.df))
a <- (as.matrix(frequency.df))
functcomp(x, a, CWM.type = c("dom", "all"), bin.num = height)
But keep getting the following error message:
Error in functcomp(x, a, CWM.type = c("dom", "all"), bin.num = height) :
Different number of species in 'x' and 'a'.
I have tried fiddling with a couple of things in the code and datasets, but cannot work out what I am doing wrong here. Any help would be greatly appreciated!
Here are links to the frequency and trait data (a subset of it, but I still get the same error message with this data) as tab-delimited text files:
frequency: https://www.dropbox.com/s/girs3nrq1ciyg1a/frequency%20-%20small.txt?dl=0
traits: https://www.dropbox.com/s/l888sallx7mu3f6/traits%20-%20small.txt?dl=0
Try specifying row.names = 1 when you read in your tables; this solved the problem for me.
- Anna
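A small sketch of what row.names = 1 changes, using made-up tab-delimited text in place of the real files (the species, sites and values are hypothetical). Without it, the first column of labels is read as an ordinary data column, so the frequency table ends up with one more column than there are species; with it, the labels become row names and the dimensions line up. functcomp itself is not called here.

```r
# Hypothetical tab-delimited contents standing in for the two files
traits_txt <- "species\theight\torigin\nsp_a\t0.5\tnative\nsp_b\t1.2\texotic\n"
freq_txt   <- "site\tsp_a\tsp_b\nsite1\t0.2\t0.8\nsite2\t0.6\t0.4\n"

# row.names = 1 turns the first column into row names instead of data
traits.df    <- read.table(text = traits_txt, header = TRUE, sep = "\t", row.names = 1)
frequency.df <- read.table(text = freq_txt, header = TRUE, sep = "\t", row.names = 1)

a <- as.matrix(frequency.df)

# Species now match: rows of the trait table vs. columns of the frequency matrix
same_count <- nrow(traits.df) == ncol(a)
same_names <- identical(rownames(traits.df), colnames(a))

# For contrast: without row.names = 1 the site column is counted as data
freq_no_rownames <- read.table(text = freq_txt, header = TRUE, sep = "\t")
extra_column <- ncol(freq_no_rownames) - nrow(traits.df)
```

Checking the species count and names like this before calling functcomp can help locate a mismatch.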
I have some data (from an R course assignment, but that doesn't matter) to which I want to apply the split-apply-combine strategy, but I'm having some problems. The data is in a DataFrame called outcome, and each row represents a hospital. Each column holds a piece of information about that hospital, such as name, location, rates, etc.
My objective is to find the hospital with the lowest "Mortality by Heart Attack Rate" in each state.
I was playing around with some strategies, and ran into a problem using the by function:
best_heart_rate(df) = sort(df, cols = :Mortality)[end,:]
best_hospitals = by(hospitals, :State, best_heart_rate)
The idea was to split the hospitals DataFrame by State, sort each of the SubDataFrames by mortality rate, take the lowest one, and combine the rows in a new DataFrame.
But when I used this strategy, I got:
ERROR: no method nrow(SubDataFrame{Array{Int64,1}})
in sort at /home/paulo/.julia/v0.3/DataFrames/src/dataframe/sort.jl:311
in sort at /home/paulo/.julia/v0.3/DataFrames/src/dataframe/sort.jl:296
in f at none:1
in based_on at /home/paulo/.julia/v0.3/DataFrames/src/groupeddataframe/grouping.jl:144
in by at /home/paulo/.julia/v0.3/DataFrames/src/groupeddataframe/grouping.jl:202
I suppose the nrow function is not implemented for SubDataFrames, so I got an error. So I used some uglier code:
best_heart_rate(df) = (df[sortperm(df[:,:Mortality] , rev=true), :])[1,:]
best_hospitals = by(hospitals, :State, best_heart_rate)
It seems to work. But now there is an NA problem: how can I remove the rows of the SubDataFrames that have NA in the Mortality column? Is there a better strategy to accomplish my objective?
I think this might work, if I've understood you correctly:
# Let me make up some data about hospitals in states
hospitals = DataFrame(State=sample(["CA", "MA", "PA"], 10), mortality=rand(10), hospital=split("abcdefghij", ""))
hospitals[3, :mortality] = NA
# You can use the indmax function to find the index of the maximum element (indmin gives the index of the minimum, i.e. the lowest mortality)
by(hospitals[complete_cases(hospitals), :], :State, df -> df[indmax(df[:mortality]), [:mortality, :hospital]])
State mortality hospital
1 CA 0.9469632421111882 j
2 MA 0.7137144590022733 f
3 PA 0.8811901895164764 e