Load data from GitHub Gist [duplicate] - r

This question already has answers here:
Read a CSV from github into R
(10 answers)
Closed 8 years ago.
I have some data that I would like to import into R.
One way is to download the data from the gist and then use
read.delim
read.table
but I am looking for a way to read it directly, without downloading the file first and then using the functions above.
Here is an example data set:
https://gist.github.com/anonymous/2c69ab500bfa94d0268a

you can use something like this
url <- 'https://gist.githubusercontent.com/anonymous/2c69ab500bfa94d0268a/raw/example.txt'
library(RCurl)
library(bitops)
df <- getURL(url, ssl.verifypeer=FALSE)
df1 <- read.delim(textConnection(df),header=TRUE, row.names=1,stringsAsFactors=FALSE)
dim(df1)
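As a possible alternative, recent versions of base R can usually read from an https URL directly, so RCurl may not be needed. The sketch below wraps the same read.delim() call in a hypothetical helper, read_tab_table() (a name that is not in the original answer), and demonstrates it offline on a temporary file with made-up contents.

```r
# Hypothetical helper: `src` may be a local path or, in recent R versions,
# an http(s) URL such as the raw gist URL above.
read_tab_table <- function(src) {
  read.delim(src, header = TRUE, row.names = 1, stringsAsFactors = FALSE)
}

# Offline demonstration with a small temporary tab-separated file.
tmp <- tempfile(fileext = ".txt")
writeLines(c("probes\tM1\tM2",
             "200645_at\t0.0446\t0.0744",
             "200690_at\t-0.0165\t0.1121"), tmp)
demo <- read_tab_table(tmp)
dim(demo)

# With network access, the same call would take the raw gist URL:
# df1 <- read_tab_table(url)
```

The first column becomes the row names because of row.names = 1, exactly as in the RCurl version.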

I've been working on a package to do just this. It's called rio and is available on GitHub. Once it's installed, you can do this in one line:
# install and load rio
library("devtools")
install_github("leeper/rio")
library("rio")
# import
d <- import("https://gist.githubusercontent.com/anonymous/2c69ab500bfa94d0268a/raw/cdaedc27897d4e570f61317711c3f548430570eb/example.txt")
Result:
> str(d)
'data.frame': 1679 obs. of 19 variables:
$ probes: chr "200645_at" "200690_at" "200691_s_at" "200692_s_at" ...
$ M1 : num 0.0446 -0.0165 0.0554 0 0.0608 ...
$ M2 : num 0.0744 0.1121 -0.0689 -0.0505 0.0601 ...
$ M3 : num -0.034 -0.0959 -0.0852 -0.0508 0.0115 ...
$ M4 : num 0.0173 0 0.0702 -0.0159 0.0744 ...
$ M5 : num 0.228 -0.4595 0.0823 -0.3041 -0.0232 ...
$ M6 : num 0.007 -0.0282 0.0361 -0.0684 -0.1095 ...
$ M7 : num -0.025 -0.1617 -0.0306 -0.0644 -0.0416 ...
$ M8 : num 0.0644 -0.0482 -0.0076 -0.0175 -0.0499 ...
$ M9 : num -0.0253 -0.2611 -0.034 0.0503 -0.0515 ...
$ M10 : num -0.123 0.0223 -0.0198 0.0546 0.0303 ...
$ M11 : num -0.625 -0.613 -0.182 -0.214 -0.115 ...
$ M12 : num 0.021 0.1961 -0.0681 -0.0216 0.0824 ...
$ M13 : num 0.1095 0.2119 0.1219 0.044 -0.0036 ...
$ M14 : num 0.1527 0.0122 -0.1615 -0.0811 0.0575 ...
$ M15 : num 0.0261 -0.5495 -0.0729 0.0964 0.0427 ...
$ M16 : num -0.2107 0.1518 -0.0696 0.0211 0.1104 ...
$ M17 : num -0.0196 -0.2409 0.0042 -0.0325 -0.0216 ...
$ M18 : num -0.2316 0.161 0.1239 0.181 0.0278 ...
The import function assumes .txt data files are tab-separated. If you have another format, you can specify it in the format argument.

Quickly sum a big list of lists?

I have 10000 lists (results of a simulation), each containing 22500 lists (each one is a pixel in an image), and each of those contains a vector of length 55.
# Simple Example
m <- replicate(2, list(runif(55)))
m2 <- replicate(3, list(m))
str(m2,list.len = 3)
List of 3
$ :List of 4
..$ : num [1:55] 0.107 0.715 0.826 0.582 0.604 ...
..$ : num [1:55] 0.949 0.389 0.645 0.331 0.698 ...
..$ : num [1:55] 0.138 0.207 0.32 0.442 0.721 ...
.. [list output truncated]
$ :List of 4
..$ : num [1:55] 0.107 0.715 0.826 0.582 0.604 ...
..$ : num [1:55] 0.949 0.389 0.645 0.331 0.698 ...
..$ : num [1:55] 0.138 0.207 0.32 0.442 0.721 ...
.. [list output truncated]
$ :List of 4
..$ : num [1:55] 0.107 0.715 0.826 0.582 0.604 ...
..$ : num [1:55] 0.949 0.389 0.645 0.331 0.698 ...
..$ : num [1:55] 0.138 0.207 0.32 0.442 0.721 ...
.. [list output truncated]
# my function
m3 <- lapply(seq_along(m2[[1]]), FUN = function(j) Reduce('+', lapply(seq_along(m2), FUN = function(i) m2[[i]][[j]])))
#by hand
identical(m2[[1]][[1]] + m2[[2]][[1]] + m2[[3]][[1]], m3[[1]] )
I wrote a nested lapply with Reduce to sum the lists. On a small example like the one above it is fast, but on my real data it is really slow.
#slow code
m <- replicate(22500, list(runif(55)))
m2 <- replicate(10000, list(m))
str(m2,list.len = 3)
m3 <- lapply(seq_along(m2[[1]]), FUN = function(j) Reduce('+', lapply(seq_along(m2), FUN = function(i) m2[[i]][[j]])))
How can I speed this up, or should I change data structures?
Thanks.
This gives some improvement (>2x):
split(Reduce(`+`, lapply(m2, unlist)), rep(seq_along(m2[[1]]), lengths(m2[[1]])))
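To check that this one-liner returns the same sums as the nested lapply from the question, a small comparison can help. This is only a sketch on toy data (4 vectors per list, 3 lists); the names nested and flat are introduced here for illustration.

```r
set.seed(1)
m  <- replicate(4, list(runif(55)))   # 4 "pixels", each a vector of length 55
m2 <- replicate(3, list(m))           # 3 "simulations"

# Nested approach from the question.
nested <- lapply(seq_along(m2[[1]]), function(j)
  Reduce(`+`, lapply(seq_along(m2), function(i) m2[[i]][[j]])))

# Flattened approach: one long vector per simulation, summed once, then split
# back into one piece per pixel.
flat <- split(Reduce(`+`, lapply(m2, unlist)),
              rep(seq_along(m2[[1]]), lengths(m2[[1]])))

# split() names its pieces "1", "2", ...; drop the names before comparing.
all.equal(nested, unname(flat))
```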
Since your data is essentially rectangular, had you stored it in this shape:
library(data.table)
d = rbindlist(lapply(m2, function(x) transpose(as.data.table(x))), idcol = TRUE
)[, id.in := 1:.N, by = .id]
# .id V1 V2 V55 id.in
#1: 1 0.4605065 0.09744975 ... 0.8620728 1
#2: 1 0.6666742 0.10435471 ... 0.3991940 2
#3: 2 0.4605065 0.09744975 ... 0.8620728 1
#4: 2 0.6666742 0.10435471 ... 0.3991940 2
#5: 3 0.4605065 0.09744975 ... 0.8620728 1
#6: 3 0.6666742 0.10435471 ... 0.3991940 2
You could do the aggregation even faster by doing:
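# restrict .SD to the value columns so that .id is not summed as well
d[, lapply(.SD, sum), by = id.in, .SDcols = setdiff(names(d), c(".id", "id.in"))]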
But if the list is your starting point, the conversion would take up the majority of the time.
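Another rectangular layout, not taken from the answer above, is a plain 3-d array, assuming every vector has the same length and the whole thing fits in memory. The sketch below uses toy sizes, and the names arr, pixel_sums and as_list are illustrative.

```r
set.seed(1)
m  <- replicate(4, list(runif(55)))   # 4 "pixels", each a vector of length 55
m2 <- replicate(3, list(m))           # 3 "simulations"

n_sim <- length(m2)
n_pix <- length(m2[[1]])
n_val <- length(m2[[1]][[1]])

# unlist() walks simulation by simulation and pixel by pixel, so the values
# fill an array with dimensions (value, pixel, simulation).
arr <- array(unlist(m2), dim = c(n_val, n_pix, n_sim))

# Sum over the third dimension (simulations): a 55 x n_pix matrix.
pixel_sums <- rowSums(arr, dims = 2)

# Back to a list with one vector per pixel, if that shape is needed.
as_list <- lapply(seq_len(n_pix), function(j) pixel_sums[, j])
```

As with the data.table version, building the array from an existing list is likely to dominate the run time, so this mainly pays off if the simulation can write into the array from the start.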

Apply function not working in R on data frame

I have a data frame:
rawDataLogged
I have a function:
doForRow <- function(row) {
transpose <- t(row);
transpose <- transpose[like(row.names(transpose), "H.M")]
frame <- data.frame(transpose)
frame$BR <- c(1,1,2,2)
frame$TR <- c(1,2,1,2)
colnames(frame)[1] <- "Log2Ratio"
frame$Log2Ratio <- as.numeric(levels(frame$Log2Ratio))[frame$Log2Ratio]
summ <- summary(aov(Log2Ratio ~ BR + Error(TR), data=frame))
summ[[2]][[1]]["BR",]$'Pr(>F)'
}
If I execute my function with a row from my data frame, I get a result:
> doForRow(rawDataLogged[5,])
[1] 0.4973168
However if I try to use 'apply' to get the results for all my rows, it does not work:
tmp <- apply(rawDataLogged, 1, doForRow)
Error in `$<-.data.frame`(`*tmp*`, "BR", value = c(1, 1, 2, 2)) :
replacement has 4 rows, data has 0
When I place a breakpoint in my own function, I see that 'row' is empty, as in nothing seems to be getting passed into my function by apply.
Any ideas why this could be happening? I've spent hours trying to solve this myself, perhaps a loop would be easiest instead of an apply family function. I'm at a loss as to why my function is called without any row data.
I have placed an R data file containing the 'rawDataLogged' object at this url: Link which could be used for debugging. Example data created using dput: Link
Here is a dump from str to show the structure of my data frame:
'data.frame': 1262 obs. of 15 variables:
$ Protein.IDs : Factor w/ 1262 levels "sp|A0AVT1|UBA6_HUMAN;tr|H0Y8S8|H0Y8S8_HUMAN",..: 654 190 894 196 834 268 474 1221 366 973 ...
$ Majority.protein.IDs : Factor w/ 1262 levels "sp|A0AVT1|UBA6_HUMAN",..: 654 190 894 196 834 268 474 1221 366 973 ...
$ Ratio.M.L.normalized.X1.1: num -0.27 -0.707 0.244 -0.728 -2.025 ...
$ Ratio.H.L.normalized.X1.1: num 0.0036 0.0588 -0.0886 0.1561 -0.0843 ...
$ Ratio.H.M.normalized.X1.1: num 0.339 0.66 -0.211 0.477 1.926 ...
$ Ratio.M.L.normalized.X1.2: num -0.132 -0.661 0.283 -1.045 -1.223 ...
$ Ratio.H.L.normalized.X1.2: num -0.07779 0.10273 -0.00251 -0.09755 0.18929 ...
$ Ratio.H.M.normalized.X1.2: num 0.0793 0.7718 -0.2657 0.9651 1.3532 ...
$ Ratio.M.L.normalized.X3.1: num -3.55 -2.08 -1.99 -1.98 -1.85 ...
$ Ratio.H.L.normalized.X3.1: num 0.1336 0.0777 -0.1014 -0.3478 -0.0259 ...
$ Ratio.H.M.normalized.X3.1: num -0.187 2.259 1.852 1.511 1.928 ...
$ Ratio.M.L.normalized.X3.2: num 0.106 -2.118 -1.864 -2.364 -1.847 ...
$ Ratio.H.L.normalized.X3.2: num 0.0141 0.0746 -0.0315 -0.1772 -0.0936 ...
$ Ratio.H.M.normalized.X3.2: num -0.143 2.248 1.842 2.279 1.758 ...
$ id : int 1369 564 2170 577 1966 700 1050 1357 855 2482 ...
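One likely explanation for the empty result inside apply, offered as a sketch rather than a confirmed diagnosis: apply() converts the data frame to a matrix first, so each row reaches the function as a named vector rather than a one-row data frame. Transposing a named vector gives a matrix with no row names, so a filter on row.names() matches nothing. The toy data frame below is hypothetical and only mimics the column layout of rawDataLogged; grepl() stands in for like().

```r
# Toy data frame standing in for rawDataLogged (hypothetical values).
df <- data.frame(
  Protein.IDs = factor(c("p1", "p2")),
  Ratio.H.M.normalized.X1.1 = c(0.339, 0.660),
  Ratio.H.M.normalized.X1.2 = c(0.079, 0.772)
)

# apply() calls as.matrix() first; the factor column forces a character
# matrix, and each row is handed over as a named character vector.
seen <- apply(df, 1, function(row) class(row))

# A one-row data frame transposes to a matrix whose row names are the
# original column names.
names_from_df_row <- rownames(t(df[1, ]))

# A named vector transposes to a 1 x n matrix whose row names are NULL.
names_from_vector <- rownames(t(as.matrix(df)[1, ]))

# Iterating over row indices keeps each row as a one-row data frame.
hm_means <- sapply(seq_len(nrow(df)), function(i) {
  row <- df[i, , drop = FALSE]
  hm <- grepl("H.M", names(row), fixed = TRUE)
  mean(unlist(row[, hm]))
})
```

Under that assumption, calling doForRow through sapply(seq_len(nrow(rawDataLogged)), function(i) doForRow(rawDataLogged[i, ])) would pass the same kind of object as the manual call that works.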

Build a proper dataframe from a matrix list after importing .xlsx file

Implemented:
I am importing a .xlsx file into R.
This file consists of three sheets.
I am binding all the sheets into a list.
Still to implement:
Now I want to combine this list of matrices into a single data.frame, with names(dataset) as the header.
I tried using as.data.frame with read.xlsx as described in the help, but it did not work.
I also tried as.data.frame(as.table(dataset)) explicitly, but that still produces a long list of data frames rather than what I want.
I want a structure where
the header holds the names and the values sit below it, just as read.table imports data.
This is the code I am using:
xlfile <- list.files(pattern = "*.xlsx")
wb <- loadWorkbook(xlfile)
sheet_ct <- wb$getNumberOfSheets()
b <- rbind(list(lapply(1:sheet_ct, function(x) {
res <- read.xlsx(xlfile, x, as.data.frame = TRUE, header = TRUE)
})))
b <- b [-c(1),] # Just want to remove the second header
I want to have the data arrangement something like below.
Ei Mi hours Nphy Cphy CHLphy Nhet Chet Ndet Cdet DON DOC DIN DIC AT dCCHO TEPC Ncocco Ccocco CHLcocco PICcocco par Temp Sal co2atm u10 dicfl co2ppm co2mol pH
1 1 1 1 0.1023488 0.6534707 0.1053458 0.04994161 0.3308593 0.04991916 0.3307085 0.05042275 49.76304 14.99330000 2050.132 2150.007 0.9642220 0.1339044 0.1040715 0.6500288 0.1087667 0.1000664 0.0000000 9.900000 31.31000 370 0.01 -2.963256000 565.1855 0.02562326 7.879427
2 1 1 2 0.1045240 0.6448216 0.1103250 0.04988347 0.3304699 0.04984045 0.3301691 0.05085697 49.52745 14.98729000 2050.264 2150.007 0.9308690 0.1652179 0.1076058 0.6386706 0.1164099 0.1001396 0.0000000 9.900000 31.31000 370 0.01 -2.971632000 565.7373 0.02564828 7.879042
3 1 1 3 0.1064772 0.6369597 0.1148174 0.04982555 0.3300819 0.04976363 0.3296314 0.05130091 49.29323 14.98221000 2050.396 2150.007 0.8997098 0.1941872 0.1104229 0.6291149 0.1225822 0.1007908 0.8695131 9.900000 31.31000 370 0.01 -2.980446000 566.3179 0.02567460 7.878636
4 1 1 4 0.1081702 0.6299084 0.1187672 0.04976784 0.3296952 0.04968840 0.3290949 0.05175249 49.06034 14.97810000 2050.524 2150.007 0.8705440 0.2210289 0.1125141 0.6213265 0.1273103 0.1018360 1.5513170 9.900000 31.31000 370 0.01 -2.989259000 566.8983 0.02570091 7.878231
5 1 1 5 0.1095905 0.6239005 0.1221460 0.04971029 0.3293089 0.04961446 0.3285598 0.05220978 48.82878 14.97485000 2050.641 2150.007 0.8431960 0.2459341 0.1140222 0.6152447 0.1308843 0.1034179 2.7777070 9.900000
Please do not suggest putting all the data on a single sheet or converting the .xlsx file to .csv or plain text. I am trying hard to build a proper data frame directly from the .xlsx file.
Following is the file
And this is the post following : Followup
This is what resulted:
str(full_data)
'data.frame': 0 obs. of 19 variables:
$ Experiment : Factor w/ 2 levels "#","1":
$ Mesocosm : Factor w/ 10 levels "#","1","2","3",..:
$ Exp.day : Factor w/ 24 levels "1","10","11",..:
$ Hour : Factor w/ 24 levels "108","12","132",..:
$ Temperature: Factor w/ 125 levels "10","10.01","10.02",..:
$ Salinity : num
$ pH : num
$ DIC : Factor w/ 205 levels "1582.2925","1588.6475",..:
$ TA : Factor w/ 117 levels "1813","1826",..:
$ DIN : Factor w/ 66 levels "0.2","0.3","0.4",..:
$ Chl.a : Factor w/ 156 levels "0.171","0.22",..:
$ PIC : Factor w/ 194 levels "-0.47","-0.96",..:
$ POC : Factor w/ 199 levels "-0.046","1.733",..:
$ PON : Factor w/ 151 levels "1.675","1.723",..:
$ POP : Factor w/ 110 levels "0.032","0.034",..:
$ DOC : Factor w/ 93 levels "100.1","100.4",..:
$ DON : Factor w/ 1 level "µmol/L":
$ DOP : Factor w/ 1 level "µmol/L":
$ TEP : Factor w/ 100 levels "10.4934","11.0053",..:
[Note: The above is the structure after reading from the .xlsx file. The factor levels make calculation and manipulation tedious and messy.]
This is what I want to achieve:
str(a)
'data.frame': 9936 obs. of 29 variables:
$ Ei : int 1 1 1 1 1 1 1 1 1 1 ...
$ Mi : int 1 1 1 1 1 1 1 1 1 1 ...
$ hours : int 1 2 3 4 5 6 7 8 9 10 ...
$ Cphy : num 0.653 0.645 0.637 0.63 0.624 ...
$ CHLphy : num 0.105 0.11 0.115 0.119 0.122 ...
$ Nhet : num 0.0499 0.0499 0.0498 0.0498 0.0497 ...
$ Chet : num 0.331 0.33 0.33 0.33 0.329 ...
$ Ndet : num 0.0499 0.0498 0.0498 0.0497 0.0496 ...
$ Cdet : num 0.331 0.33 0.33 0.329 0.329 ...
$ DON : num 0.0504 0.0509 0.0513 0.0518 0.0522 ...
$ DOC : num 49.8 49.5 49.3 49.1 48.8 ...
$ DIN : num 15 15 15 15 15 ...
$ DIC : num 2050 2050 2050 2051 2051 ...
$ AT : num 2150 2150 2150 2150 2150 ...
$ dCCHO : num 0.964 0.931 0.9 0.871 0.843 ...
$ TEPC : num 0.134 0.165 0.194 0.221 0.246 ...
$ Ncocco : num 0.104 0.108 0.11 0.113 0.114 ...
$ Ccocco : num 0.65 0.639 0.629 0.621 0.615 ...
$ CHLcocco: num 0.109 0.116 0.123 0.127 0.131 ...
$ PICcocco: num 0.1 0.1 0.101 0.102 0.103 ...
$ par : num 0 0 0.87 1.55 2.78 ...
$ Temp : num 9.9 9.9 9.9 9.9 9.9 9.9 9.9 9.9 9.9 9.9 ...
$ Sal : num 31.3 31.3 31.3 31.3 31.3 ...
$ co2atm : num 370 370 370 370 370 370 370 370 370 370 ...
$ u10 : num 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 ...
$ dicfl : num -2.96 -2.97 -2.98 -2.99 -3 ...
$ co2ppm : num 565 566 566 567 567 ...
$ co2mol : num 0.0256 0.0256 0.0257 0.0257 0.0257 ...
$ pH : num 7.88 7.88 7.88 7.88 7.88 ...
[Note: sorry for the extra columns; this is another dataset (plain text), which I am reading with read.table.]
With NA's handled:
> unique(mydf_1$Exp.num)
[1] # 1
Levels: # 1
> unique(mydf_2$Exp.num)
[1] # 2
Levels: # 2
> unique(mydf_3$Exp.num)
[1] # 3
Levels: # 3
> unique(full_data$Exp.num)
[1] 2 3 4
Without handling NA's:
> unique(full_data$Exp.num)
[1] 1 NA 2 3
> unique(full_data$Mesocosm)
[1] 1 2 3 4 5 6 7 8 9 NA
I think this is what you need. I have added a few comments explaining each step:
library(xlsx) # provides loadWorkbook() and read.xlsx()
xlfile <- list.files(pattern = "\\.xlsx$") # assumes exactly one .xlsx file in the working directory
wb <- loadWorkbook(xlfile)
sheet_ct <- wb$getNumberOfSheets()
for (i in 1:sheet_ct) { # read the sheets into 3 separate data frames (mydf_1, mydf_2, mydf_3)
print(i)
variable_name <- sprintf('mydf_%s', i)
# with startRow and endRow given you do not need the NA-removal code further down, but you must know the first and last rows
assign(variable_name, read.xlsx(xlfile, sheetIndex = i, startRow = 1, endRow = 209))
}
colnames(mydf_1) <- names(mydf_2) # this part was unclear; I used the second sheet's
# names as column names, but you can choose whichever sheet you want in the same way (the second and third sheets had the same names).
# Some of the sheets were loaded with a few blank rows (full of NAs), which I remove
# with the following function, based on the first column, which is always populated
# as far as I can see.
remove_na_rows <- function(x) {
# number of non-NA entries in x; with the blank rows at the end this is the index of the last populated row
sum(!is.na(x))
}
mydf_1 <- mydf_1[1:remove_na_rows(mydf_1$Exp.num),]
mydf_2 <- mydf_2[1:remove_na_rows(mydf_2$Exp.num),]
mydf_3 <- mydf_3[1:remove_na_rows(mydf_3$Exp.num),]
full_data <- rbind(mydf_1[-1,], mydf_2[-1,], mydf_3[-1,]) # making one data frame here
# convert fields to numeric; go through as.character() because as.numeric() on a factor returns level codes
full_data[] <- lapply(full_data, function(x) as.numeric(as.character(x)))
# use this to convert any column to integer (replace Ei, Mi and hours with your own column names)
full_data$Ei <- as.integer(full_data[['Ei']])
full_data$Mi <- as.integer(full_data[['Mi']])
full_data$hours <- as.integer(full_data[['hours']])
#*********code to use for removing NA rows *****************
# so if you rbind without caring about the NA rows, you can use the code below to get rid of them
# I just tested it and it seems to work
n_row <- NULL
for (i in 1:nrow(full_data)) {
x <- full_data[i,]
if (all(is.na(x))) {
n_row <- append(n_row, i)
}
}
# only drop rows if any were found: indexing with -NULL would return zero rows
if (length(n_row) > 0) full_data <- full_data[-n_row,]
I think now this is what you need
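The factor-to-number conversion is the step most likely to go wrong here, so a small self-contained sketch may help. It uses hypothetical in-memory data frames in place of the three sheets (make_sheet and its values are invented for illustration) and assumes, as in the question, that the first row of each sheet holds units rather than data.

```r
# Hypothetical stand-ins for three sheets; row 1 of each holds units, not data.
make_sheet <- function(k) data.frame(
  Exp.num = factor(c("#", k, k)),
  pH      = factor(c("-", "7.88", "7.91"))
)
sheets <- lapply(1:3, make_sheet)

# Drop the units row from every sheet, then stack the sheets.
stacked <- do.call(rbind, lapply(sheets, function(s) s[-1, ]))

# as.numeric() on a factor returns the internal level codes, not the values.
wrong <- as.numeric(stacked$pH)

# Going through as.character() recovers the numbers that were read as text.
stacked[] <- lapply(stacked, function(col) as.numeric(as.character(col)))
str(stacked)
```

Assigning into stacked[] keeps the result a data frame, whereas assigning the lapply() result to the name itself would replace it with a plain list.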

Loop to rename fields in an R dataframe

I have a data frame in R that I have read in from a csv file. How can I append a string ("EA") to the end of all column names? I have figured out code that works for a single column, but for some reason my loop does not return renamed fields.
Here is the dataframe:
> str(mydataframe)
'data.frame': 8368 obs. of 4 variables:
$ gene: Factor w/ 8368 levels "A1BG","A1CF",..: 6949 4379 7111 4691 2331 4914 506 4985 7109 2072 ...
$ p : num 1.23e-09 1.05e-07 1.20e-07 2.53e-07 6.67e-07 ...
$ beta: num 2.86 2.52 2.51 1.72 2.34 ...
$ se : num 0.471 0.474 0.474 0.334 0.471 ...
Here is the code:
for(i in names(mydataframe)){
i_renamed <- paste(i, "EA", sep=".")
mydataframe$i_renamed <- mydataframe$i
mydataframe$i <- NULL
}
...but afterwards the object is still the same
> str(mydataframe)
'data.frame': 8368 obs. of 4 variables:
$ gene: Factor w/ 8368 levels "A1BG","A1CF",..: 6949 4379 7111 4691 2331 4914 506 4985 7109 2072 ...
$ p : num 1.23e-09 1.05e-07 1.20e-07 2.53e-07 6.67e-07 ...
$ beta: num 2.86 2.52 2.51 1.72 2.34 ...
$ se : num 0.471 0.474 0.474 0.334 0.471 ...
The desired result is a field "gene.EA" that is identical to the original "gene" field, etc for all columns
Thank you
You do not need a loop for this.
names(mydataframe) <- paste0(names(mydataframe), '.EA')
Or explicitly, you could do:
mydataframe <- setNames(mydataframe, paste0(names(mydataframe), '.EA'))
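As for why the loop in the question changes nothing: `$` takes the name after it literally, so mydataframe$i looks for a column called "i" instead of using the value of i. A minimal sketch on a made-up two-column data frame shows a loop that works with `[[`, next to the vectorised rename.

```r
df <- data.frame(gene = c("A1BG", "A1CF"), p = c(1.2e-9, 1.0e-7))

# `[[` evaluates its argument, so each pass really touches column i.
looped <- df
for (i in names(looped)) {
  looped[[paste(i, "EA", sep = ".")]] <- looped[[i]]  # copy under the new name
  looped[[i]] <- NULL                                 # drop the old column
}

# Vectorised rename, as in the answer above.
renamed <- setNames(df, paste0(names(df), ".EA"))

names(looped)
names(renamed)
```

The vectorised form is still preferable because it renames in place rather than copying every column.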

Adding principal components as variables to a data frame

I am working with a dataset of 10000 data points and 100 variables in R. Unfortunately, the variables I have do not describe the data well. I carried out a PCA using prcomp(), and the first 3 PCs seem to account for most of the variability in the data. As far as I understand, a principal component is a combination of different variables; it therefore has a certain value for each data point and can be considered a new variable. Would I be able to add these principal components as 3 new variables to my data? I would need them for further analysis.
A reproducible dataset:
set.seed(144)
x <- data.frame(matrix(rnorm(2^10*12), ncol=12))
y <- prcomp(formula = ~., data=x, center = TRUE, scale = TRUE, na.action = na.omit)
PC scores are stored in the element x of prcomp() result.
str(y)
List of 6
$ sdev : num [1:12] 1.08 1.06 1.05 1.04 1.03 ...
$ rotation: num [1:12, 1:12] -0.0175 -0.1312 0.3284 -0.4134 0.2341 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:12] "X1" "X2" "X3" "X4" ...
.. ..$ : chr [1:12] "PC1" "PC2" "PC3" "PC4" ...
$ center : Named num [1:12] 0.02741 -0.01692 -0.03228 -0.03303 0.00122 ...
..- attr(*, "names")= chr [1:12] "X1" "X2" "X3" "X4" ...
$ scale : Named num [1:12] 0.998 1.057 1.019 1.007 0.993 ...
..- attr(*, "names")= chr [1:12] "X1" "X2" "X3" "X4" ...
$ x : num [1:1024, 1:12] 1.023 -1.213 0.167 -0.118 -0.186 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:1024] "1" "2" "3" "4" ...
.. ..$ : chr [1:12] "PC1" "PC2" "PC3" "PC4" ...
$ call : language prcomp(formula = ~., data = x, na.action = na.omit, center = TRUE, scale = TRUE)
- attr(*, "class")= chr "prcomp"
You can get them with y$x and then choose the columns you need.
x.new<-cbind(x,y$x[,1:3])
str(x.new)
'data.frame': 1024 obs. of 15 variables:
$ X1 : num 1.14 2.38 0.684 1.785 0.313 ...
$ X2 : num -0.689 0.446 -0.72 -3.511 0.36 ...
$ X3 : num 0.722 0.816 0.295 -0.48 0.566 ...
$ X4 : num 1.629 0.738 0.85 1.057 0.116 ...
$ X5 : num -0.737 -0.827 0.65 -0.496 -1.045 ...
$ X6 : num 0.347 0.056 -0.606 1.077 0.257 ...
$ X7 : num -0.773 1.042 2.149 -0.599 0.516 ...
$ X8 : num 2.05511 0.4772 0.18614 0.02585 0.00619 ...
$ X9 : num -0.0462 1.3784 -0.2489 0.1625 0.6137 ...
$ X10: num -0.709 0.755 0.463 -0.594 -1.228 ...
$ X11: num -1.233 -0.376 -2.646 1.094 0.207 ...
$ X12: num -0.44 -2.049 0.315 0.157 2.245 ...
$ PC1: num 1.023 -1.213 0.167 -0.118 -0.186 ...
$ PC2: num 1.2408 0.6077 1.1885 3.0789 0.0797 ...
$ PC3: num -0.776 -1.41 0.977 -1.343 0.987 ...
Didzis Elferts's answer (the cbind() approach above) only works if your data, x, has no NAs. Here's how you can add the components if your data does have NAs.
library(tidyverse)
# y$x is a matrix, so convert it to a data frame before calling rownames_to_column()
components <- y$x %>% as.data.frame() %>% rownames_to_column("id")
x <- x %>% rownames_to_column("id") %>% left_join(components, by = "id")
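The same alignment can be sketched in base R, assuming prcomp() was called through the formula interface with na.action = na.omit, so that the row names of y$x identify the complete cases. The data below are made up, and scores and idx are illustrative names.

```r
set.seed(144)
x <- data.frame(matrix(rnorm(2^6 * 4), ncol = 4))
x[c(3, 10), 2] <- NA                     # introduce two incomplete rows

y <- prcomp(~ ., data = x, center = TRUE, scale. = TRUE, na.action = na.omit)

# Start from an all-NA score matrix with one row per original observation.
scores <- matrix(NA_real_, nrow = nrow(x), ncol = 3,
                 dimnames = list(NULL, colnames(y$x)[1:3]))

# Fill in the rows that prcomp() actually used, matched by row name.
idx <- match(rownames(y$x), rownames(x))
scores[idx, ] <- y$x[, 1:3]

x.new <- cbind(x, scores)
str(x.new)
```

Rows that were dropped by na.omit keep NA in the PC1 to PC3 columns, so x.new has the same number of rows as x.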
