I'm new to R, and I'm stuck.
I have a data frame, which I read into R from a CSV:
site depth co2 n2o ch4
1 ot aue 0 408.7412 2.0432975 11.393448
2 ot aue 0 325.8365 -0.4539224 -2.222266
3 ot aue 0 237.4456 0.6853362 -13.105958
which I then subset by "site" (there are five different sites):
OTA<-subset(data, site=="ot aue")
and I wanted to change the units of the greenhouse gases, and save those values in new columns, which I did here:
data$co2mol<-NA
data$co2mol=((data$co2/3600)/12)*1000
data$n2omol=NA
data$n2omol<-((data$n2o/3600)/14)*1000000
data$ch4mol=NA
data$ch4mol<-((data$ch4/3600)/12)*1000
I confirmed the changes with:
> names(data)
[1] "site" "depth" "co2" "n2o" "ch4" "co2mol" "n2omol" "ch4mol"
and
> str(data)
'data.frame': 71 obs. of 8 variables:
$ site : chr "ot aue" "ot aue" "ot aue" "ot aue" ...
$ depth : int 0 0 0 0 0 0 0 0 0 0 ...
$ co2 : num 409 326 237 294 557 ...
$ n2o : num 2.043 -0.454 0.685 2.084 5.911 ...
$ ch4 : num 11.39 -2.22 -13.11 3.66 -11.15 ...
$ co2mol: num 9.46 7.54 5.5 6.82 12.89 ...
$ n2omol: num 47.3 -10.5 15.9 48.2 136.8 ...
$ ch4mol: num 0.2637 -0.0514 -0.3034 0.0847 -0.2582 ...
Now, I need to plot the data, using the columns I created with the new units.
This code works with the original columns (co2, n2o, ch4), but not the new columns (co2mol, n2omol, ch4mol).
ota_co2mol<-boxplot2(OTA$co2mol~OTA$depth, data=OTA,
xlab="Depth in Hole (cm)",
ylab="mol CO2",
col = c("tan", "tan2", "tan3", "tan4", "burlywood4"),
main=expression("Ot Aue CO"[2]),
top=TRUE,
ylim=c(-593,3762))
When trying to create the boxplot, I get this error message:
Error in model.frame.default(formula = OTA$co2mol ~ OTA$depth, data =
OTA) : invalid type (NULL) for variable 'OTA$co2mol'
Obviously this is a data.frame/data.table/data format/data reading error, I just don't know how to fix it. Thanks in advance!
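A likely cause, judging only from the order of the statements above: OTA was created with subset() before the new columns were added to data, so OTA is a snapshot that never receives co2mol, and OTA$co2mol is NULL. The sketch below reproduces this on made-up toy values (not the real CSV) and uses base boxplot() as a stand-in for boxplot2(); both choices are assumptions for illustration.

```r
# Toy data standing in for the CSV (values are illustrative only)
data <- data.frame(
  site  = c("ot aue", "ot aue", "ot aue", "other", "other"),
  depth = c(0, 0, 10, 0, 10),
  co2   = c(408.7412, 325.8365, 237.4456, 300, 280)
)

# Subset taken BEFORE the new column exists: a snapshot of `data`
OTA_old <- subset(data, site == "ot aue")

# New column is added to `data` only; the snapshot is not updated
data$co2mol <- ((data$co2 / 3600) / 12) * 1000
is.null(OTA_old$co2mol)   # TRUE: the old subset lacks the column

# Re-create the subset AFTER adding the columns
OTA <- subset(data, site == "ot aue")

# With data = OTA, bare column names are enough in the formula
bp <- boxplot(co2mol ~ depth, data = OTA,
              xlab = "Depth in Hole (cm)", ylab = "mol CO2")
```

If this is indeed the cause, re-running the subset() line after the unit conversions should be enough; the ylim in the original call may also need adjusting to the new units.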
I am dealing with a dataset consisting of several key banking balance sheet and income statement figures (deleted some variables for this post):
'data.frame': 52028 obs. of 38 variables:
$ institutionid : int 4307883 4209717 4558501 4392480 4306242 4303334 114518 4183859 4307849 4256486 ...
$ fiscalyear : Factor w/ 8 levels "2010","2011",..: 1 1 1 1 1 1 1 1 1 1 ...
$ institutionname : chr "Kure Shinkin Bank" "Shinkin Central Bank" "Shibata Shinkin Bank" "Takasaki Shinkin Bank" ...
$ Tier 1 Ratio : num 9.8 20.68 13.93 6.84 19.43 ...
$ snlindustryid : int 28 2 28 2 2 1 1 1 2 1 ...
$ snlindustryname : chr "Other Banking" "Savings Bank/Thrift/Mutual" "Other Banking" "Savings Bank/Thrift/Mutual" ...
$ countryname : chr "Japan" "Japan" "Japan" "Japan" ...
$ Interest Income : num 141.3 3330.3 16.2 83.6 289.8 ...
$ Net Interest Income : num 122.8 756.4 14.1 74.4 250.4 ...
$ Operating Revenue : num 137.8 NA 13.8 80.1 NA ...
$ Provision for Loan Losses: num 27.546 NA 0.535 13.26 NA ...
$ Compensation and Benefits: num NA NA 6.07 36.8 NA ...
$ EBIT : num 27.04 2814.57 5.05 16.67 88.05 ...
$ Net Income befoire Taxes : num 8.57 224.58 2.98 7.42 48.62 ...
$ Provision for Taxes : num -7.861 -113.864 0.159 0.125 14.525 ...
$ Net Income : num 16.43 338.45 2.83 7.29 34.1 ...
$ net_margin : num 2.98 1.06 3.56 3.05 2.5 ...
I am trying to run a DiD regression using net_margins, a figure that is calculated as net income / total gross loans. When I first plot the net_margins they look like this:
Clearly, some of the values don't make economic sense. This is partly due to the fact that some banks in the dataset have unreasonable figures for, e.g., gross loans: dividing by something close to zero produces unreasonably large numbers.
My first intuition was to just get rid of the outliers by doing this:
library(dplyr) # provides %>% and filter()
Q <- quantile(dataset$net_margin, probs = c(0.25,0.75))
iqr <- IQR(dataset$net_margin) # lower-case name so the IQR() function is not shadowed
up <- Q[2]+1.5*iqr # Upper Range
low<- Q[1]-1.5*iqr # Lower Range
#Eliminating outliers
dataset_cleaned <- dataset %>%
filter(net_margin<up & net_margin > low)
If I plot the data now it looks like this:
By removing the outliers I basically created new medians and interquartile ranges, so my data is still heavily plagued by outliers.
In other posts that suggested using the IQR to remove outliers, however, that was not the case.
I am at a bit of a dead end with my own statistical (and R) knowledge. Is this the right practice for removing outliers from such a dataset? Thank you!
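The effect described above (new quartiles after filtering, hence new "outliers") can be reproduced on simulated data. The sketch below is only an illustration: the Cauchy sample is a hypothetical heavy-tailed stand-in for net_margin, and fences() is a helper name introduced here, not part of any package.

```r
set.seed(1)
# Hypothetical heavy-tailed stand-in for net_margin (not the real data)
net_margin <- rcauchy(10000)

# Tukey fences: 1.5 * IQR beyond the first and third quartiles
fences <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  c(low = q[1] - 1.5 * IQR(x), up = q[2] + 1.5 * IQR(x))
}

# First pass: keep only values strictly inside the fences
f1 <- fences(net_margin)
cleaned <- net_margin[net_margin > f1["low"] & net_margin < f1["up"]]

# Fences recomputed on the trimmed data are narrower than the first ones,
# so some of the surviving points are flagged again by a new boxplot
f2 <- fences(cleaned)
flagged_again <- sum(cleaned < f2["low"] | cleaned > f2["up"])
```

This only shows why a boxplot of the filtered data still draws points beyond the whiskers; it does not settle whether IQR filtering is appropriate for the regression.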
I'm doing a regression analysis with fixed effects using plm() from the plm package. I have selected the twoways effect to account for both time and individual effects. However, after running the code below I keep receiving this message:
Error in pdata.frame(data, index) :
variable id does not exist (individual index)
Here is the code:
pdata <- DATABASE[,c(2:4,13:21)]
pdata$id <- group_indices(pdata,ISO3.p,Productcode)
coutnin <- dcast.data.table(pdata,ISO3.p+Productcode~.,value.var = "id")
setcolorder(pdata,neworder=c("id","Year"))
pdata <- pdata.frame(pdata,index=c("id","Year"))
reg <- plm(pdata,diff(TV,1) ~ diff(RERcp,1)+diff(GDPR.p,1)-diff(GDPR.r,1), effect="twoways", model="within", index = c("id","Year"))
Please note that the structure of pdata shows multiple levels in the id variable, which is in numeric form. I initially tried to use a string variable, but I got the same outcome:
Classes ‘data.table’ and 'data.frame': 1211800 obs. of 13 variables:
$ id : int 4835 6050 13158 15247 17164 18401 19564 23553 24895 27541 ...
$ Year : int 1996 1996 1996 1996 1996 1996 1996 1996 1996 1996 ...
$ Productcode: chr "101" "101" "101" "101" ...
$ ISO3.p : Factor w/ 171 levels "ABW","AFG","AGO",..: 8 9 20 22 27 28 29 34 37 40 ...
$ e : num 0.245 -0.238 1.624 0.693 0.31 ...
$ RERcp : num -0.14073 -0.16277 1.01262 0.03908 -0.00243 ...
$ RERpp : num -0.1712 NA NA NA -0.0952 ...
$ RER_GVC : num -3.44 NaN NA NA NaN ...
$ GDPR.p : num 27.5 26.6 23.5 20.3 27.8 ...
$ GDPR.r : num 30.4 30.4 30.4 30.4 30.4 ...
$ GVCPos : num 0.141 0.141 0.141 0.141 0.141 ...
$ GVCPar : num 0.436 0.436 0.436 0.436 0.436 ...
$ TV : num 17.1 17.1 17.1 17.1 17.1 ...
- attr(*, ".internal.selfref")=<externalptr>
When I convert the data.table into a pdata.frame I do not receive any warning; the error appears only after I run the plm function. Running View(table(index(pdata), useNA = "ifany")) displays no value larger than 1, so I assume I have no duplicate observations in my data.
Try putting the data argument in second position in the plm call. If pdata has already been converted to a pdata.frame, leave out the index argument in the plm call, i.e., try this:
reg <- plm(diff(TV,1) ~ diff(RERcp,1)+diff(GDPR.p,1)-diff(GDPR.r,1), data = pdata, effect = "twoways", model = "within")
I have a data frame:
rawDataLogged
I have a function:
doForRow <- function(row) {
transpose <- t(row);
transpose <- transpose[like(row.names(transpose), "H.M")]
frame <- data.frame(transpose)
frame$BR <- c(1,1,2,2)
frame$TR <- c(1,2,1,2)
colnames(frame)[1] <- "Log2Ratio"
frame$Log2Ratio <- as.numeric(levels(frame$Log2Ratio))[frame$Log2Ratio]
summ <- summary(aov(Log2Ratio ~ BR + Error(TR), data=frame))
summ[[2]][[1]]["BR",]$'Pr(>F)'
}
If I execute my function with a row from my data frame, I get a result:
> doForRow(rawDataLogged[5,])
[1] 0.4973168
However if I try to use 'apply' to get the results for all my rows, it does not work:
tmp <- apply(rawDataLogged, 1, doForRow)
Error in $<-.data.frame(*tmp*, "BR", value = c(1, 1, 2, 2)) :
replacement has 4 rows, data has 0
When I place a breakpoint in my own function, I see that 'row' is empty, as in nothing seems to be getting passed into my function by apply.
Any ideas why this could be happening? I've spent hours trying to solve this myself, perhaps a loop would be easiest instead of an apply family function. I'm at a loss as to why my function is called without any row data.
I have placed an R data file containing the 'rawDataLogged' object at this url: Link which could be used for debugging. Example data created using dput: Link
Here is a dump from str to show the structure of my data frame:
'data.frame': 1262 obs. of 15 variables:
$ Protein.IDs : Factor w/ 1262 levels "sp|A0AVT1|UBA6_HUMAN;tr|H0Y8S8|H0Y8S8_HUMAN",..: 654 190 894 196 834 268 474 1221 366 973 ...
$ Majority.protein.IDs : Factor w/ 1262 levels "sp|A0AVT1|UBA6_HUMAN",..: 654 190 894 196 834 268 474 1221 366 973 ...
$ Ratio.M.L.normalized.X1.1: num -0.27 -0.707 0.244 -0.728 -2.025 ...
$ Ratio.H.L.normalized.X1.1: num 0.0036 0.0588 -0.0886 0.1561 -0.0843 ...
$ Ratio.H.M.normalized.X1.1: num 0.339 0.66 -0.211 0.477 1.926 ...
$ Ratio.M.L.normalized.X1.2: num -0.132 -0.661 0.283 -1.045 -1.223 ...
$ Ratio.H.L.normalized.X1.2: num -0.07779 0.10273 -0.00251 -0.09755 0.18929 ...
$ Ratio.H.M.normalized.X1.2: num 0.0793 0.7718 -0.2657 0.9651 1.3532 ...
$ Ratio.M.L.normalized.X3.1: num -3.55 -2.08 -1.99 -1.98 -1.85 ...
$ Ratio.H.L.normalized.X3.1: num 0.1336 0.0777 -0.1014 -0.3478 -0.0259 ...
$ Ratio.H.M.normalized.X3.1: num -0.187 2.259 1.852 1.511 1.928 ...
$ Ratio.M.L.normalized.X3.2: num 0.106 -2.118 -1.864 -2.364 -1.847 ...
$ Ratio.H.L.normalized.X3.2: num 0.0141 0.0746 -0.0315 -0.1772 -0.0936 ...
$ Ratio.H.M.normalized.X3.2: num -0.143 2.248 1.842 2.279 1.758 ...
$ id : int 1369 564 2170 577 1966 700 1050 1357 855 2482 ...
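One possible explanation, offered as a sketch rather than a confirmed diagnosis: apply() first converts the data frame to a matrix, and because of the factor columns that matrix is character. Each row then reaches the function as a named character vector, t() of such a vector has no row names, and a filter on row.names() selects nothing. The toy data frame, the mean_hm() helper and the use of grepl() in place of like() are assumptions made for illustration.

```r
# Toy stand-in for rawDataLogged (illustrative values only)
df <- data.frame(
  id    = factor(c("a", "b")),
  H.M.1 = c(0.339, 0.660),
  H.M.2 = c(0.079, 0.772),
  M.L.1 = c(-0.270, -0.707)
)

# What apply() hands to the function: a character vector, not a data frame
seen_by_apply <- apply(df, 1, function(row) class(row))

# t() of a plain vector has no row names, so name-based filtering finds nothing
no_row_names <- apply(df, 1, function(row) is.null(row.names(t(row))))

# Row-wise alternative: index the data frame so each row stays a data frame
mean_hm <- function(row) {
  hm_cols <- grepl("H.M", names(row), fixed = TRUE)  # select columns by name
  mean(unlist(row[, hm_cols]))
}
res <- vapply(seq_len(nrow(df)), function(i) mean_hm(df[i, ]), numeric(1))
```

Under these assumptions, replacing apply(rawDataLogged, 1, doForRow) with a loop or sapply() over seq_len(nrow(rawDataLogged)) that passes rawDataLogged[i, ] would give doForRow the same kind of input as the working call doForRow(rawDataLogged[5,]).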
Implemented:
I am importing a .xlsx file into R.
This file consists of three sheets.
I am binding all the sheets into a list.
Need to Implement
Now I want to combine this list of matrices into a single data.frame, with names(dataset) as the header.
I tried using as.data.frame with read.xlsx as described in the help, but it did not work.
I also tried as.data.frame(as.table(dataset)) explicitly, but it still produces a long list of data.frames, which is not what I want.
I want to have a structure like
header = names and the values below that, just like how the read.table imports the data.
This is the code I am using:
xlfile <- list.files(pattern = "*.xlsx")
wb <- loadWorkbook(xlfile)
sheet_ct <- wb$getNumberOfSheets()
b <- rbind(list(lapply(1:sheet_ct, function(x) {
res <- read.xlsx(xlfile, x, as.data.frame = TRUE, header = TRUE)
})))
b <- b [-c(1),] # Just want to remove the second header
I want to have the data arrangement something like below.
Ei Mi hours Nphy Cphy CHLphy Nhet Chet Ndet Cdet DON DOC DIN DIC AT dCCHO TEPC Ncocco Ccocco CHLcocco PICcocco par Temp Sal co2atm u10 dicfl co2ppm co2mol pH
1 1 1 1 0.1023488 0.6534707 0.1053458 0.04994161 0.3308593 0.04991916 0.3307085 0.05042275 49.76304 14.99330000 2050.132 2150.007 0.9642220 0.1339044 0.1040715 0.6500288 0.1087667 0.1000664 0.0000000 9.900000 31.31000 370 0.01 -2.963256000 565.1855 0.02562326 7.879427
2 1 1 2 0.1045240 0.6448216 0.1103250 0.04988347 0.3304699 0.04984045 0.3301691 0.05085697 49.52745 14.98729000 2050.264 2150.007 0.9308690 0.1652179 0.1076058 0.6386706 0.1164099 0.1001396 0.0000000 9.900000 31.31000 370 0.01 -2.971632000 565.7373 0.02564828 7.879042
3 1 1 3 0.1064772 0.6369597 0.1148174 0.04982555 0.3300819 0.04976363 0.3296314 0.05130091 49.29323 14.98221000 2050.396 2150.007 0.8997098 0.1941872 0.1104229 0.6291149 0.1225822 0.1007908 0.8695131 9.900000 31.31000 370 0.01 -2.980446000 566.3179 0.02567460 7.878636
4 1 1 4 0.1081702 0.6299084 0.1187672 0.04976784 0.3296952 0.04968840 0.3290949 0.05175249 49.06034 14.97810000 2050.524 2150.007 0.8705440 0.2210289 0.1125141 0.6213265 0.1273103 0.1018360 1.5513170 9.900000 31.31000 370 0.01 -2.989259000 566.8983 0.02570091 7.878231
5 1 1 5 0.1095905 0.6239005 0.1221460 0.04971029 0.3293089 0.04961446 0.3285598 0.05220978 48.82878 14.97485000 2050.641 2150.007 0.8431960 0.2459341 0.1140222 0.6152447 0.1308843 0.1034179 2.7777070 9.900000
Please don't suggest putting all the data on a single sheet or converting the .xlsx to .csv or plain text. I am trying really hard to get a proper data frame from a .xlsx file.
Following is the file
And this is the post following : Followup
This is what resulted:
str(full_data)
'data.frame': 0 obs. of 19 variables:
$ Experiment : Factor w/ 2 levels "#","1":
$ Mesocosm : Factor w/ 10 levels "#","1","2","3",..:
$ Exp.day : Factor w/ 24 levels "1","10","11",..:
$ Hour : Factor w/ 24 levels "108","12","132",..:
$ Temperature: Factor w/ 125 levels "10","10.01","10.02",..:
$ Salinity : num
$ pH : num
$ DIC : Factor w/ 205 levels "1582.2925","1588.6475",..:
$ TA : Factor w/ 117 levels "1813","1826",..:
$ DIN : Factor w/ 66 levels "0.2","0.3","0.4",..:
$ Chl.a : Factor w/ 156 levels "0.171","0.22",..:
$ PIC : Factor w/ 194 levels "-0.47","-0.96",..:
$ POC : Factor w/ 199 levels "-0.046","1.733",..:
$ PON : Factor w/ 151 levels "1.675","1.723",..:
$ POP : Factor w/ 110 levels "0.032","0.034",..:
$ DOC : Factor w/ 93 levels "100.1","100.4",..:
$ DON : Factor w/ 1 level "µmol/L":
$ DOP : Factor w/ 1 level "µmol/L":
$ TEP : Factor w/ 100 levels "10.4934","11.0053",..:
[Note: Above is the structure after reading from the .xlsx file. The factor levels make calculation and manipulation tedious and messy.]
This is what I want to achieve:
str(a)
'data.frame': 9936 obs. of 29 variables:
$ Ei : int 1 1 1 1 1 1 1 1 1 1 ...
$ Mi : int 1 1 1 1 1 1 1 1 1 1 ...
$ hours : int 1 2 3 4 5 6 7 8 9 10 ...
$ Cphy : num 0.653 0.645 0.637 0.63 0.624 ...
$ CHLphy : num 0.105 0.11 0.115 0.119 0.122 ...
$ Nhet : num 0.0499 0.0499 0.0498 0.0498 0.0497 ...
$ Chet : num 0.331 0.33 0.33 0.33 0.329 ...
$ Ndet : num 0.0499 0.0498 0.0498 0.0497 0.0496 ...
$ Cdet : num 0.331 0.33 0.33 0.329 0.329 ...
$ DON : num 0.0504 0.0509 0.0513 0.0518 0.0522 ...
$ DOC : num 49.8 49.5 49.3 49.1 48.8 ...
$ DIN : num 15 15 15 15 15 ...
$ DIC : num 2050 2050 2050 2051 2051 ...
$ AT : num 2150 2150 2150 2150 2150 ...
$ dCCHO : num 0.964 0.931 0.9 0.871 0.843 ...
$ TEPC : num 0.134 0.165 0.194 0.221 0.246 ...
$ Ncocco : num 0.104 0.108 0.11 0.113 0.114 ...
$ Ccocco : num 0.65 0.639 0.629 0.621 0.615 ...
$ CHLcocco: num 0.109 0.116 0.123 0.127 0.131 ...
$ PICcocco: num 0.1 0.1 0.101 0.102 0.103 ...
$ par : num 0 0 0.87 1.55 2.78 ...
$ Temp : num 9.9 9.9 9.9 9.9 9.9 9.9 9.9 9.9 9.9 9.9 ...
$ Sal : num 31.3 31.3 31.3 31.3 31.3 ...
$ co2atm : num 370 370 370 370 370 370 370 370 370 370 ...
$ u10 : num 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 ...
$ dicfl : num -2.96 -2.97 -2.98 -2.99 -3 ...
$ co2ppm : num 565 566 566 567 567 ...
$ co2mol : num 0.0256 0.0256 0.0257 0.0257 0.0257 ...
$ pH : num 7.88 7.88 7.88 7.88 7.88 ...
[Note: sorry for the extra columns, this is another dataset (simple text), which I am reading from read.table]
With NA's handled:
> unique(mydf_1$Exp.num)
[1] # 1
Levels: # 1
> unique(mydf_2$Exp.num)
[1] # 2
Levels: # 2
> unique(mydf_3$Exp.num)
[1] # 3
Levels: # 3
> unique(full_data$Exp.num)
[1] 2 3 4
Without handling NA's:
> unique(full_data$Exp.num)
[1] 1 NA 2 3
> unique(full_data$Mesocosm)
[1] 1 2 3 4 5 6 7 8 9 NA
I think this is what you need. I added a few comments on what I am doing:
xlfile <- list.files(pattern = "*.xlsx")
wb <- loadWorkbook(xlfile)
sheet_ct <- wb$getNumberOfSheets()
for( i in 1:sheet_ct) { #read the sheets into 3 separate dataframes (mydf_1, mydf_2, mydf3)
print(i)
variable_name <- sprintf('mydf_%s',i)
assign(variable_name, read.xlsx(xlfile, sheetIndex=i,startRow=1, endRow=209)) #using this you don't need to use my formula to eliminate NAs. but you need to specify the first and last rows.
}
colnames(mydf_1) <- names(mydf_2) #this part was unclear. I chose the second sheet's
# names as column names but you can choose whichever you want in the same way (second and third column had the same names).
#some of the sheets were loaded with a few blank rows (full of NAs) which I remove
#with the following function according to the first column which is always populated
#according to what I see
remove_na_rows <- function(x) {
sum(!is.na(x)) #number of non-NA entries, used below as the index of the last populated row
}
mydf_1 <- mydf_1[1:remove_na_rows(mydf_1$Exp.num),]
mydf_2 <- mydf_2[1:remove_na_rows(mydf_2$Exp.num),]
mydf_3 <- mydf_3[1:remove_na_rows(mydf_3$Exp.num),]
full_data <- rbind(mydf_1[-1,],mydf_2[-1,],mydf_3[-1,]) #making one dataframe here
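full_data[] <- lapply(full_data,function(x) as.numeric(as.character(x))) #convert fields to numeric; as.character() first so factors give their labels, not level codes, and [] keeps full_data a data frame
full_data$Ei <- as.integer(full_data[['Ei']]) #use this to convert any column to integer (substitute your own column names)
full_data$Mi <- as.integer(full_data[['Mi']])
full_data$hours <- as.integer(full_data[['hours']])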
#*********code to use for removing NA rows *****************
#so if you rbind not caring about the NA rows you can use the below to get rid of them
#I just tested it and it seems to be working
na_rows <- which(rowSums(!is.na(full_data)) == 0) #indices of rows where every field is NA
if (length(na_rows) > 0) { #guard: negative indexing with an empty vector would drop ALL rows
full_data <- full_data[-na_rows,]
}
I think now this is what you need
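A compact variant of the same steps (drop the units row of each sheet, stack, convert to numeric, drop empty rows) is sketched below. To keep it self-contained it uses a hand-made list of small data frames in place of the sheets returned by read.xlsx; the column names and values are illustrative assumptions, not the contents of the real workbook.

```r
# Hypothetical stand-ins for the sheets: row 1 holds units, columns are factors
sheets <- list(
  data.frame(Exp.num = c("#", "1", "1"), pH = c("-", "7.88", "7.90"),
             stringsAsFactors = TRUE),
  data.frame(Exp.num = c("#", "2", NA), pH = c("-", "7.85", NA),
             stringsAsFactors = TRUE)
)

# Drop the units row of each sheet, then stack the sheets into one data frame
stacked <- do.call(rbind, lapply(sheets, function(s) s[-1, ]))

# factor -> character -> numeric; as.numeric() alone would return level codes
stacked[] <- lapply(stacked, function(x) as.numeric(as.character(x)))

# Keep only rows that have at least one non-NA value
stacked <- stacked[rowSums(!is.na(stacked)) > 0, , drop = FALSE]
str(stacked)
```

With real data, the list could come from lapply() over the sheet indices calling read.xlsx, provided all sheets share the same column names.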
I am trying to read the table from the following URL:
url <- 'http://faculty.chicagobooth.edu/ruey.tsay/teaching/introTS/m-ge3dx-4011.txt'
da <- read.table(url, header = TRUE, fill=FALSE, strip.white=TRUE)
I can look at the data using head:
> head(da)
date ge vw ew sp
1 19400131 -0.061920 -0.024020 -0.019978 -0.035228
2 19400229 -0.009901 0.013664 0.029733 0.006639
3 19400330 0.049333 0.018939 0.026168 0.009893
4 19400430 -0.041667 0.001196 0.013115 -0.004898
5 19400531 -0.197324 -0.220314 -0.269754 -0.239541
6 19400629 0.061667 0.066664 0.066550 0.076591
This works fine for the first 4 columns, for example, I can look at the column ew
> head(da$ew)
[1] -0.019978 0.029733 0.026168 0.013115 -0.269754 0.066550
but when I try to access the last one, I get some extra output which is not in the txt file.
> head(da$sp)
[1] -0.035228 0.006639 0.009893 -0.004898 -0.239541 0.076591
859 Levels: -0.000060 -0.000143 -0.000180 -0.000320 -0.000659 -0.000815 ... 0.163047
How do I get rid of the extra output? Thanks!
This is the printed representation of a factor.
> str(da)
'data.frame': 861 obs. of 5 variables:
$ date: int 19400131 19400229 19400330 19400430 19400531 19400629 19400731 19400831 19400930 19401031 ...
$ ge : num -0.0619 -0.0099 0.0493 -0.0417 -0.1973 ...
$ vw : num -0.024 0.0137 0.0189 0.0012 -0.2203 ...
$ ew : num -0.02 0.0297 0.0262 0.0131 -0.2698 ...
$ sp : Factor w/ 859 levels "-0.000060","-0.000143",..: 226 411 445 42 353 828 613 585 441 684 ...
Row 58 has a dot instead of a number. That is enough for R to treat the whole column as a factor. Once you change the dot to NA or fix the error in the file, you will be able to read in the data fine.
Another option is to deal with the dot after the data has been read in, by coercing the column to numeric. The following statement will coerce . to NA (with a warning).
da$sp <- as.numeric(as.character(da$sp))
> str(da)
'data.frame': 861 obs. of 5 variables:
$ date: int 19400131 19400229 19400330 19400430 19400531 19400629 19400731 19400831 19400930 19401031 ...
$ ge : num -0.0619 -0.0099 0.0493 -0.0417 -0.1973 ...
$ vw : num -0.024 0.0137 0.0189 0.0012 -0.2203 ...
$ ew : num -0.02 0.0297 0.0262 0.0131 -0.2698 ...
$ sp : num -0.03523 0.00664 0.00989 -0.0049 -0.23954 ...
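A third option is to tell read.table() at import time that a lone dot means missing, via its na.strings argument. The sketch below uses a few inline lines in place of the URL so that it runs offline; the inline text is an abbreviated, made-up excerpt in the same layout, not the actual file.

```r
# Abbreviated stand-in for the remote file, with a dot as a missing value
txt <- "date ge sp
19400131 -0.061920 -0.035228
19400229 -0.009901 .
19400330 0.049333 0.009893"

# Treat both "NA" and "." as missing while reading
da_fixed <- read.table(text = txt, header = TRUE, na.strings = c("NA", "."))
str(da_fixed)
```

With the real file, the same na.strings argument can be added to the original read.table(url, ...) call, in which case sp should come back as numeric without any later conversion.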