Loop to rename fields in an R dataframe - r

I have a data frame in R that I have read in from a csv file. How can I append a string ("EA") to the end of all column names? I have figured out code that works for a single column, but for some reason my loop does not return renamed fields.
Here is the dataframe:
> str(mydataframe)
'data.frame': 8368 obs. of 4 variables:
$ gene: Factor w/ 8368 levels "A1BG","A1CF",..: 6949 4379 7111 4691 2331 4914 506 4985 7109 2072 ...
$ p : num 1.23e-09 1.05e-07 1.20e-07 2.53e-07 6.67e-07 ...
$ beta: num 2.86 2.52 2.51 1.72 2.34 ...
$ se : num 0.471 0.474 0.474 0.334 0.471 ...
Here is the code:
for(i in names(mydataframe)){
i_renamed <- paste(i, "EA", sep=".")
mydataframe$i_renamed <- mydataframe$i
mydataframe$i <- NULL
}
...but afterwards the object is still the same
> str(mydataframe)
'data.frame': 8368 obs. of 4 variables:
$ gene: Factor w/ 8368 levels "A1BG","A1CF",..: 6949 4379 7111 4691 2331 4914 506 4985 7109 2072 ...
$ p : num 1.23e-09 1.05e-07 1.20e-07 2.53e-07 6.67e-07 ...
$ beta: num 2.86 2.52 2.51 1.72 2.34 ...
$ se : num 0.471 0.474 0.474 0.334 0.471 ...
The desired result is a field "gene.EA" that is identical to the original "gene" field, etc for all columns
Thank you

You do not need a loop for this, because the names can be replaced in one vectorized assignment:
names(mydataframe) <- paste0(names(mydataframe), '.EA')
Or, more explicitly, with setNames:
mydataframe <- setNames(mydataframe, paste0(names(mydataframe), '.EA'))
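As for why the loop in the question leaves the object unchanged: `$` takes the name after it literally, so mydataframe$i looks for a column called "i" rather than the column named by the loop variable. The sketch below illustrates this on a small made-up data frame (toy and its values are assumptions, not the asker's data) and shows a loop that uses [[ ]] instead, next to the vectorized version.

```r
# Toy data frame standing in for mydataframe (values are made up)
toy <- data.frame(gene = c("A1BG", "A1CF"), p = c(1.2e-9, 1.1e-7))

# `$` does not evaluate i, so this looks for a column literally named "i"
i <- "gene"
literal_lookup <- toy$i      # NULL: there is no column called "i"
evaluated_lookup <- toy[[i]] # the "gene" column

# Loop version using [[ ]], which does evaluate the variable
looped <- toy
for (i in names(toy)) {
  looped[[paste(i, "EA", sep = ".")]] <- looped[[i]] # copy under the new name
  looped[[i]] <- NULL                                # drop the old column
}

# Vectorized version, as in the answer above
renamed <- setNames(toy, paste0(names(toy), ".EA"))
```

Both approaches end up with the same column names; the vectorized form is shorter and avoids copying each column.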

Related

Apply function not working in R on data frame

I have a data frame:
rawDataLogged
I have a function:
doForRow <- function(row) {
transpose <- t(row);
transpose <- transpose[like(row.names(transpose), "H.M")]
frame <- data.frame(transpose)
frame$BR <- c(1,1,2,2)
frame$TR <- c(1,2,1,2)
colnames(frame)[1] <- "Log2Ratio"
frame$Log2Ratio <- as.numeric(levels(frame$Log2Ratio))[frame$Log2Ratio]
summ <- summary(aov(Log2Ratio ~ BR + Error(TR), data=frame))
summ[[2]][[1]]["BR",]$'Pr(>F)'
}
If I execute my function with a row from my data frame, I get a result:
> doForRow(rawDataLogged[5,])
[1] 0.4973168
However if I try to use 'apply' to get the results for all my rows, it does not work:
tmp <- apply(rawDataLogged, 1, doForRow)
Error in `$<-.data.frame`(`*tmp*`, "BR", value = c(1, 1, 2, 2)) :
replacement has 4 rows, data has 0
When I place a breakpoint in my own function, I see that 'row' is empty, as in nothing seems to be getting passed into my function by apply.
Any ideas why this could be happening? I've spent hours trying to solve this myself, perhaps a loop would be easiest instead of an apply family function. I'm at a loss as to why my function is called without any row data.
I have placed an R data file containing the 'rawDataLogged' object at this url: Link which could be used for debugging. Example data created using dput: Link
Here is a dump from str to show the structure of my data frame:
'data.frame': 1262 obs. of 15 variables:
$ Protein.IDs : Factor w/ 1262 levels "sp|A0AVT1|UBA6_HUMAN;tr|H0Y8S8|H0Y8S8_HUMAN",..: 654 190 894 196 834 268 474 1221 366 973 ...
$ Majority.protein.IDs : Factor w/ 1262 levels "sp|A0AVT1|UBA6_HUMAN",..: 654 190 894 196 834 268 474 1221 366 973 ...
$ Ratio.M.L.normalized.X1.1: num -0.27 -0.707 0.244 -0.728 -2.025 ...
$ Ratio.H.L.normalized.X1.1: num 0.0036 0.0588 -0.0886 0.1561 -0.0843 ...
$ Ratio.H.M.normalized.X1.1: num 0.339 0.66 -0.211 0.477 1.926 ...
$ Ratio.M.L.normalized.X1.2: num -0.132 -0.661 0.283 -1.045 -1.223 ...
$ Ratio.H.L.normalized.X1.2: num -0.07779 0.10273 -0.00251 -0.09755 0.18929 ...
$ Ratio.H.M.normalized.X1.2: num 0.0793 0.7718 -0.2657 0.9651 1.3532 ...
$ Ratio.M.L.normalized.X3.1: num -3.55 -2.08 -1.99 -1.98 -1.85 ...
$ Ratio.H.L.normalized.X3.1: num 0.1336 0.0777 -0.1014 -0.3478 -0.0259 ...
$ Ratio.H.M.normalized.X3.1: num -0.187 2.259 1.852 1.511 1.928 ...
$ Ratio.M.L.normalized.X3.2: num 0.106 -2.118 -1.864 -2.364 -1.847 ...
$ Ratio.H.L.normalized.X3.2: num 0.0141 0.0746 -0.0315 -0.1772 -0.0936 ...
$ Ratio.H.M.normalized.X3.2: num -0.143 2.248 1.842 2.279 1.758 ...
$ id : int 1369 564 2170 577 1966 700 1050 1357 855 2482 ...
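One likely explanation, offered as a sketch rather than a confirmed diagnosis: apply() first converts the data frame with as.matrix(), and because of the factor columns every value becomes a character string. Each row then reaches the function as a named character vector, not a one-row data frame, so t(row) has no row names for like() to match against. The toy data frame below is made up to mirror the structure shown above.

```r
# Toy data frame with one factor column, loosely mirroring rawDataLogged (values made up)
toy <- data.frame(
  Protein.IDs = factor(c("P1", "P2")),
  Ratio.H.M.normalized.X1.1 = c(0.339, 0.660),
  Ratio.H.M.normalized.X1.2 = c(0.079, 0.772)
)

# What the function receives in each case
via_apply <- unname(apply(toy, 1, function(row) class(row)[1]))
via_index <- sapply(seq_len(nrow(toy)), function(i) class(toy[i, ])[1])

# t() of a plain vector has no row names; t() of a one-row data frame does
names_from_apply <- rownames(t(as.matrix(toy)[1, ]))
names_from_index <- rownames(t(toy[1, ]))

# One way to keep passing a one-row data frame is to iterate over row indices.
# doForRow is the question's function; the call is commented out because it
# needs the real data:
# tmp <- sapply(seq_len(nrow(rawDataLogged)), function(i) doForRow(rawDataLogged[i, ]))
```

Iterating over row indices hands the function the same kind of object as the working call doForRow(rawDataLogged[5,]), so it may then behave the same way for every row.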

Analysis of PCA

I'm using the rela package to check whether I can use PCA on my data.
paf.neur2 <- paf(neur2)
summary(paf.neur2)
# [1] "Your dataset is not a numeric object."
I want to see the KMO (the Kaiser-Meyer-Olkin measure of sampling adequacy). How can I do that?
Output of str(neur2)
'data.frame': 1457 obs. of 66 variables:
$ userid : int 200 387 458 649 931 991 1044 1075 1347 1360 ...
$ funct : num 3.73 3.79 3.54 3.04 3.81 ...
$ pronoun: num 2.26 2.55 2.49 1.98 2.71 ...
.
.
.
$ time : num 1.68 1.87 1.51 1.03 1.74 ...
$ work : num 0.7419 0.2311 -0.1985 -1.6094 -0.0619 ...
$ achieve: num 0.174 0.2469 0.1823 -0.478 -0.0513 ...
$ leisure: num 0.2852 0.0296 0.0583 -0.3567 -0.0408 ...
$ home : num -0.844 -0.58 -0.844 -2.207 -1.079 ...
.
Variables are all numeric.
According to ?paf, object is "a numeric dataset (usually a coerced matrix from a prior data frame)".
So you need to turn your data.frame neur2 into a matrix before calling paf: as.matrix(neur2).
Here is a reproduction of your problem using the Seatbelts dataset:
library(rela)
Belts <- Seatbelts[,1:7]
class(Belts)
# [1] "mts" "ts" "matrix"
Belts <- as.data.frame(Belts)
class(Belts)
# [1] "data.frame"
paf.belt <- paf(Belts)
# [1] "Your dataset is not a numeric object."
Belts <- as.matrix(Belts)
class(Belts)
# [1] "matrix"
paf.belt <- paf(Belts) # Works
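There are two options for computing the KMO measure. You can write it yourself:
library(corpcor) # provides cor2pcor()
kmo_DIY <- function(df){
# sum of squared correlations over the off-diagonal pairs
csq = cor(df)^2
csumsq = (sum(csq)-dim(csq)[1])/2
# same sum for the squared partial correlations
pcsq = cor2pcor(cor(df))^2
pcsumsq = (sum(pcsq)-dim(pcsq)[1])/2
kmo = csumsq/(csumsq+pcsumsq)
return(kmo)
}
or
use the function KMO() from the psych package.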
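If installing corpcor is not convenient, the same overall ratio can be sketched in base R by taking the partial correlations from the inverse of the correlation matrix. This is a minimal sketch, assuming the correlation matrix is invertible; kmo_base is a hypothetical name, and the Seatbelts columns are used only because they appear in the reproduction above.

```r
# Overall KMO from base R only; kmo_base is a hypothetical helper name
kmo_base <- function(x) {
  r <- cor(x)
  inv <- solve(r) # assumes the correlation matrix is invertible
  # partial correlations: -inv[i, j] / sqrt(inv[i, i] * inv[j, j])
  p <- -inv / sqrt(outer(diag(inv), diag(inv)))
  off <- row(r) != col(r) # off-diagonal entries only
  sum(r[off]^2) / (sum(r[off]^2) + sum(p[off]^2))
}

belts <- as.matrix(Seatbelts[, 1:7])
kmo_value <- kmo_base(belts)
```

Summing over all off-diagonal entries counts each pair twice in both the numerator and the denominator, so the ratio matches the halved sums used in kmo_DIY.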

Load data from GitHub Gist [duplicate]

I have some data that I would like to import into R.
One way is to download the data from the gist and then use
read.delim
read.table
but I am looking for a way to read it directly, without downloading the file first and then calling the functions above.
Here is an example dataset:
https://gist.github.com/anonymous/2c69ab500bfa94d0268a
You can use something like this:
url <- 'https://gist.githubusercontent.com/anonymous/2c69ab500bfa94d0268a/raw/example.txt'
library(RCurl)
library(bitops)
df <- getURL(url, ssl.verifypeer=FALSE)
df1 <- read.delim(textConnection(df),header=TRUE, row.names=1,stringsAsFactors=FALSE)
dim(df1)
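A possible alternative without RCurl: read.delim also accepts a URL string in place of a file path, so the raw gist URL could probably be passed to it directly, provided your R build can open https connections (an assumption about your setup, not something tested here). The sketch below uses a file:// URL pointing at a temporary file with made-up content so that it runs offline.

```r
# Made-up tab-separated content standing in for the gist's raw file
tmp <- tempfile(fileext = ".txt")
writeLines(c("probes\tM1\tM2",
             "200645_at\t0.0446\t0.0744",
             "200690_at\t-0.0165\t0.1121"), tmp)

# Build a file:// URL so the example needs no network access
path <- normalizePath(tmp, winslash = "/")
src <- paste0("file://", if (startsWith(path, "/")) "" else "/", path)

# read.delim is handed a URL string, not a local path; with network access
# the raw gist URL could be substituted for src in the same call
df1 <- read.delim(src, header = TRUE, row.names = 1, stringsAsFactors = FALSE)
dim(df1)
```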
I've been working on a package to do just this. It's called rio and is available on GitHub. Once it's installed, you can do this in one line:
# install and load rio
library("devtools")
install_github("leeper/rio")
library("rio")
# import
d <- import("https://gist.githubusercontent.com/anonymous/2c69ab500bfa94d0268a/raw/cdaedc27897d4e570f61317711c3f548430570eb/example.txt")
Result:
> str(d)
'data.frame': 1679 obs. of 19 variables:
$ probes: chr "200645_at" "200690_at" "200691_s_at" "200692_s_at" ...
$ M1 : num 0.0446 -0.0165 0.0554 0 0.0608 ...
$ M2 : num 0.0744 0.1121 -0.0689 -0.0505 0.0601 ...
$ M3 : num -0.034 -0.0959 -0.0852 -0.0508 0.0115 ...
$ M4 : num 0.0173 0 0.0702 -0.0159 0.0744 ...
$ M5 : num 0.228 -0.4595 0.0823 -0.3041 -0.0232 ...
$ M6 : num 0.007 -0.0282 0.0361 -0.0684 -0.1095 ...
$ M7 : num -0.025 -0.1617 -0.0306 -0.0644 -0.0416 ...
$ M8 : num 0.0644 -0.0482 -0.0076 -0.0175 -0.0499 ...
$ M9 : num -0.0253 -0.2611 -0.034 0.0503 -0.0515 ...
$ M10 : num -0.123 0.0223 -0.0198 0.0546 0.0303 ...
$ M11 : num -0.625 -0.613 -0.182 -0.214 -0.115 ...
$ M12 : num 0.021 0.1961 -0.0681 -0.0216 0.0824 ...
$ M13 : num 0.1095 0.2119 0.1219 0.044 -0.0036 ...
$ M14 : num 0.1527 0.0122 -0.1615 -0.0811 0.0575 ...
$ M15 : num 0.0261 -0.5495 -0.0729 0.0964 0.0427 ...
$ M16 : num -0.2107 0.1518 -0.0696 0.0211 0.1104 ...
$ M17 : num -0.0196 -0.2409 0.0042 -0.0325 -0.0216 ...
$ M18 : num -0.2316 0.161 0.1239 0.181 0.0278 ...
The import function assumes .txt data files are tab-separated. If you have another format, you can specify it in the format argument.

Build a proper dataframe from a matrix list after importing .xlsx file

Implemented:
I am importing a .xlsx file into R.
This file consists of three sheets.
I am binding all the sheets into a list.
Need to Implement
Now I want to combine this list of matrices into a single data.frame, with names(dataset) as the header.
I tried using as.data.frame with read.xlsx as described in the help, but it did not work.
I also tried as.data.frame(as.table(dataset)) explicitly, but it still produces a long list of data frames rather than what I want.
I want a structure with
header = names and the values below it, just as read.table imports data.
This is the code I am using:
xlfile <- list.files(pattern = "*.xlsx")
wb <- loadWorkbook(xlfile)
sheet_ct <- wb$getNumberOfSheets()
b <- rbind(list(lapply(1:sheet_ct, function(x) {
res <- read.xlsx(xlfile, x, as.data.frame = TRUE, header = TRUE)
})))
b <- b [-c(1),] # Just want to remove the second header
I want to have the data arrangement something like below.
Ei Mi hours Nphy Cphy CHLphy Nhet Chet Ndet Cdet DON DOC DIN DIC AT dCCHO TEPC Ncocco Ccocco CHLcocco PICcocco par Temp Sal co2atm u10 dicfl co2ppm co2mol pH
1 1 1 1 0.1023488 0.6534707 0.1053458 0.04994161 0.3308593 0.04991916 0.3307085 0.05042275 49.76304 14.99330000 2050.132 2150.007 0.9642220 0.1339044 0.1040715 0.6500288 0.1087667 0.1000664 0.0000000 9.900000 31.31000 370 0.01 -2.963256000 565.1855 0.02562326 7.879427
2 1 1 2 0.1045240 0.6448216 0.1103250 0.04988347 0.3304699 0.04984045 0.3301691 0.05085697 49.52745 14.98729000 2050.264 2150.007 0.9308690 0.1652179 0.1076058 0.6386706 0.1164099 0.1001396 0.0000000 9.900000 31.31000 370 0.01 -2.971632000 565.7373 0.02564828 7.879042
3 1 1 3 0.1064772 0.6369597 0.1148174 0.04982555 0.3300819 0.04976363 0.3296314 0.05130091 49.29323 14.98221000 2050.396 2150.007 0.8997098 0.1941872 0.1104229 0.6291149 0.1225822 0.1007908 0.8695131 9.900000 31.31000 370 0.01 -2.980446000 566.3179 0.02567460 7.878636
4 1 1 4 0.1081702 0.6299084 0.1187672 0.04976784 0.3296952 0.04968840 0.3290949 0.05175249 49.06034 14.97810000 2050.524 2150.007 0.8705440 0.2210289 0.1125141 0.6213265 0.1273103 0.1018360 1.5513170 9.900000 31.31000 370 0.01 -2.989259000 566.8983 0.02570091 7.878231
5 1 1 5 0.1095905 0.6239005 0.1221460 0.04971029 0.3293089 0.04961446 0.3285598 0.05220978 48.82878 14.97485000 2050.641 2150.007 0.8431960 0.2459341 0.1140222 0.6152447 0.1308843 0.1034179 2.7777070 9.900000
Please don't suggest putting all the data on a single sheet, or converting the .xlsx file to .csv or plain text. I am trying hard to get a proper data frame from the .xlsx file itself.
Following is the file
And this is the post following : Followup
This is what resulted:
str(full_data)
'data.frame': 0 obs. of 19 variables:
$ Experiment : Factor w/ 2 levels "#","1":
$ Mesocosm : Factor w/ 10 levels "#","1","2","3",..:
$ Exp.day : Factor w/ 24 levels "1","10","11",..:
$ Hour : Factor w/ 24 levels "108","12","132",..:
$ Temperature: Factor w/ 125 levels "10","10.01","10.02",..:
$ Salinity : num
$ pH : num
$ DIC : Factor w/ 205 levels "1582.2925","1588.6475",..:
$ TA : Factor w/ 117 levels "1813","1826",..:
$ DIN : Factor w/ 66 levels "0.2","0.3","0.4",..:
$ Chl.a : Factor w/ 156 levels "0.171","0.22",..:
$ PIC : Factor w/ 194 levels "-0.47","-0.96",..:
$ POC : Factor w/ 199 levels "-0.046","1.733",..:
$ PON : Factor w/ 151 levels "1.675","1.723",..:
$ POP : Factor w/ 110 levels "0.032","0.034",..:
$ DOC : Factor w/ 93 levels "100.1","100.4",..:
$ DON : Factor w/ 1 level "µmol/L":
$ DOP : Factor w/ 1 level "µmol/L":
$ TEP : Factor w/ 100 levels "10.4934","11.0053",..:
[Note: Above is the structure after reading from the .xlsx file. The factor levels make calculation and manipulation tedious and messy.]
This is what I want to achieve:
str(a)
'data.frame': 9936 obs. of 29 variables:
$ Ei : int 1 1 1 1 1 1 1 1 1 1 ...
$ Mi : int 1 1 1 1 1 1 1 1 1 1 ...
$ hours : int 1 2 3 4 5 6 7 8 9 10 ...
$ Cphy : num 0.653 0.645 0.637 0.63 0.624 ...
$ CHLphy : num 0.105 0.11 0.115 0.119 0.122 ...
$ Nhet : num 0.0499 0.0499 0.0498 0.0498 0.0497 ...
$ Chet : num 0.331 0.33 0.33 0.33 0.329 ...
$ Ndet : num 0.0499 0.0498 0.0498 0.0497 0.0496 ...
$ Cdet : num 0.331 0.33 0.33 0.329 0.329 ...
$ DON : num 0.0504 0.0509 0.0513 0.0518 0.0522 ...
$ DOC : num 49.8 49.5 49.3 49.1 48.8 ...
$ DIN : num 15 15 15 15 15 ...
$ DIC : num 2050 2050 2050 2051 2051 ...
$ AT : num 2150 2150 2150 2150 2150 ...
$ dCCHO : num 0.964 0.931 0.9 0.871 0.843 ...
$ TEPC : num 0.134 0.165 0.194 0.221 0.246 ...
$ Ncocco : num 0.104 0.108 0.11 0.113 0.114 ...
$ Ccocco : num 0.65 0.639 0.629 0.621 0.615 ...
$ CHLcocco: num 0.109 0.116 0.123 0.127 0.131 ...
$ PICcocco: num 0.1 0.1 0.101 0.102 0.103 ...
$ par : num 0 0 0.87 1.55 2.78 ...
$ Temp : num 9.9 9.9 9.9 9.9 9.9 9.9 9.9 9.9 9.9 9.9 ...
$ Sal : num 31.3 31.3 31.3 31.3 31.3 ...
$ co2atm : num 370 370 370 370 370 370 370 370 370 370 ...
$ u10 : num 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 ...
$ dicfl : num -2.96 -2.97 -2.98 -2.99 -3 ...
$ co2ppm : num 565 566 566 567 567 ...
$ co2mol : num 0.0256 0.0256 0.0257 0.0257 0.0257 ...
$ pH : num 7.88 7.88 7.88 7.88 7.88 ...
[Note: sorry for the extra columns, this is another dataset (simple text), which I am reading from read.table]
With NA's handled:
> unique(mydf_1$Exp.num)
[1] # 1
Levels: # 1
> unique(mydf_2$Exp.num)
[1] # 2
Levels: # 2
> unique(mydf_3$Exp.num)
[1] # 3
Levels: # 3
> unique(full_data$Exp.num)
[1] 2 3 4
Without handling NA's:
> unique(full_data$Exp.num)
[1] 1 NA 2 3
> unique(full_data$Mesocosm)
[1] 1 2 3 4 5 6 7 8 9 NA
I think this is what you need. I have added a few comments on what I am doing:
library(xlsx) # loadWorkbook() and read.xlsx() come from the xlsx package
xlfile <- list.files(pattern = "\\.xlsx$") # pattern is a regular expression; assumes a single .xlsx file in the directory
wb <- loadWorkbook(xlfile)
sheet_ct <- wb$getNumberOfSheets()
for( i in 1:sheet_ct) { #read the sheets into 3 separate dataframes (mydf_1, mydf_2, mydf_3)
print(i)
variable_name <- sprintf('mydf_%s',i)
assign(variable_name, read.xlsx(xlfile, sheetIndex=i,startRow=1, endRow=209)) #with startRow and endRow given you don't need my function below to eliminate NAs, but you do need to know the first and last rows
}
colnames(mydf_1) <- names(mydf_2) #this part was unclear. I chose the second sheet's
# names as column names, but you can choose whichever you want in the same way (second and third column had the same names).
#some of the sheets were loaded with a few blank rows (full of NAs) which I remove
#with the following function according to the first column which is always populated
#according to what I see
remove_na_rows <- function(x) {
x <- x[!is.na(x)]
length(x) #number of non-NA entries, returned explicitly
}
mydf_1 <- mydf_1[1:remove_na_rows(mydf_1$Exp.num),]
mydf_2 <- mydf_2[1:remove_na_rows(mydf_2$Exp.num),]
mydf_3 <- mydf_3[1:remove_na_rows(mydf_3$Exp.num),]
full_data <- rbind(mydf_1[-1,],mydf_2[-1,],mydf_3[-1,]) #making one dataframe here
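full_data[] <- lapply(full_data,function(x) as.numeric(as.character(x))) #convert fields to numeric; as.character first so factors give their values rather than their level codes, and [] keeps full_data a data frame
full_data$Ei <- as.integer(full_data[['Ei']]) #use this to convert any column to integer (substitute your own column names)
full_data$Mi <- as.integer(full_data[['Mi']])
full_data$hours <- as.integer(full_data[['hours']])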
#*********code to use for removing NA rows *****************
#so if you rbind not caring about the NA rows you can use the below to get rid of them
#I just tested it and it seems to be working
n_row <- NULL
for ( i in 1:nrow(full_data)) {
x <- full_data[i,]
if ( all(is.na(x)) ) {
n_row <- append(n_row,i)
}
}
if (length(n_row) > 0) full_data <- full_data[-n_row,] #guard: with an empty n_row the negative index would drop every row
I think now this is what you need
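The same steps can also be sketched without assign(), by keeping the sheets in a list and binding them at the end. The example below uses made-up stand-ins for the three sheets so that it runs without the workbook; make_sheet, its column values, and the "units" placeholder are all hypothetical, and in practice each list element would come from read.xlsx.

```r
# Made-up stand-in for one sheet: a units row first, then data, then a blank row
make_sheet <- function(exp_id) {
  data.frame(Exp.num = c("#", exp_id, exp_id, NA),
             DIC = c("units", "2050.1", "2050.3", NA),
             stringsAsFactors = TRUE)
}
sheets <- lapply(c("1", "2", "3"), make_sheet)

# Clean each sheet: drop the units row, then any row that is entirely NA
clean <- lapply(sheets, function(d) {
  d <- d[-1, , drop = FALSE]
  d[rowSums(!is.na(d)) > 0, , drop = FALSE]
})

# Bind the sheets and convert factor columns to numeric via as.character
full <- do.call(rbind, clean)
full[] <- lapply(full, function(x) as.numeric(as.character(x)))
str(full)
```

Going through as.character matters here: calling as.numeric directly on a factor returns the internal level codes, not the numbers that were in the sheet.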

read.table returns extra stuff on last column

I am trying to read the table from the following URL:
url <- 'http://faculty.chicagobooth.edu/ruey.tsay/teaching/introTS/m-ge3dx-4011.txt'
da <- read.table(url, header = TRUE, fill=FALSE, strip.white=TRUE)
I can look at the data using head:
> head(da)
date ge vw ew sp
1 19400131 -0.061920 -0.024020 -0.019978 -0.035228
2 19400229 -0.009901 0.013664 0.029733 0.006639
3 19400330 0.049333 0.018939 0.026168 0.009893
4 19400430 -0.041667 0.001196 0.013115 -0.004898
5 19400531 -0.197324 -0.220314 -0.269754 -0.239541
6 19400629 0.061667 0.066664 0.066550 0.076591
This works fine for the first 4 columns, for example, I can look at the column ew
> head(da$ew)
[1] -0.019978 0.029733 0.026168 0.013115 -0.269754 0.066550
but when I try to access the last one, I get some extra output which is not in the txt file.
> head(da$sp)
[1] -0.035228 0.006639 0.009893 -0.004898 -0.239541 0.076591
859 Levels: -0.000060 -0.000143 -0.000180 -0.000320 -0.000659 -0.000815 ... 0.163047
How do I get rid of the extra output? Thanks!
The extra line is how R prints a factor: the sp column was read in as a factor, not as numbers.
> str(da)
'data.frame': 861 obs. of 5 variables:
$ date: int 19400131 19400229 19400330 19400430 19400531 19400629 19400731 19400831 19400930 19401031 ...
$ ge : num -0.0619 -0.0099 0.0493 -0.0417 -0.1973 ...
$ vw : num -0.024 0.0137 0.0189 0.0012 -0.2203 ...
$ ew : num -0.02 0.0297 0.0262 0.0131 -0.2698 ...
$ sp : Factor w/ 859 levels "-0.000060","-0.000143",..: 226 411 445 42 353 828 613 585 441 684 ...
Row 58 has a dot instead of a number. A single non-numeric entry like this is enough for R to treat the whole variable as a factor. Once you change the dot to NA or fix the entry in the file, the data will read in fine.
Another option is to leave the file alone and coerce the column to numeric after the data has been read in. The following statement will turn the "." into NA (with a coercion warning).
da$sp <- as.numeric(as.character(da$sp))
> str(da)
'data.frame': 861 obs. of 5 variables:
$ date: int 19400131 19400229 19400330 19400430 19400531 19400629 19400731 19400831 19400930 19401031 ...
$ ge : num -0.0619 -0.0099 0.0493 -0.0417 -0.1973 ...
$ vw : num -0.024 0.0137 0.0189 0.0012 -0.2203 ...
$ ew : num -0.02 0.0297 0.0262 0.0131 -0.2698 ...
$ sp : num -0.03523 0.00664 0.00989 -0.0049 -0.23954 ...
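A further possibility, shown here as a sketch on a few made-up lines rather than the real file, is to tell read.table up front that "." means missing by using its na.strings argument. The column is then read as numeric from the start and no conversion is needed afterwards.

```r
# A few made-up lines in the same layout as the file, with "." as a missing value
txt <- "date sp
19400131 -0.035228
19400229 .
19400330 0.009893"

# Without na.strings the column is not numeric (a factor when stringsAsFactors = TRUE)
da_raw <- read.table(text = txt, header = TRUE, stringsAsFactors = TRUE)

# With na.strings the dot becomes NA and the column is read as numeric
da_fixed <- read.table(text = txt, header = TRUE, na.strings = c("NA", "."))

# The after-the-fact conversion from the answer gives the same values
converted <- suppressWarnings(as.numeric(as.character(da_raw$sp)))
```

With the real data the call would be the original read.table(url, header = TRUE, ...) plus na.strings = c("NA", ".").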

Resources