Why does my data lose its dimensions when I use lapply? - r

I'm reading lots of files from a directory and doing some computation on each file.
Because I want to make my script parallel, I use lapply. But when I look at the dimensions of the data frame in each element of the list, the number of columns becomes 1.
Would someone help me fix it?
Here is my effort:
files <- list.files(path="path to file")
dfr <- lapply(files, function(x) read.table(x, header=TRUE, sep="\n"))
for (i in dfr) {
  # do some computation
  if (ncol(i) > 1) {
    y <- as.matrix(i[1])
    x <- as.matrix(i[2:ncol(i)])
  }
  # ...
}
Console output:
> i
[[1]]
ACAA2.hsa.miR.124.AGO2.hsa.miR.124.AGO134
1 7.6561 18.5924339201 23.4560035028
2 7.2355 26.2524888635 33.6513700944
3 7.365 23.6841865928 28.2168475593
4 8.4768 22.4003094419 28.0983702155
5 5.5838 20.4838449736 26.8616064228
6 7.3123 20.8488005184 26.9155966811
7 7.2345 21.5272944711 26.2954400309
8 7.05 23.3113502366 29.3856555269
> dim(i[1])
NULL
> dim(i[[1]])
[1] 67 1
> a<-i[[1]]
> dim(a)
[1] 67 1
> a
ACAA2.hsa.miR.124.AGO2.hsa.miR.124.AGO134
1 7.6561 18.5924339201 23.4560035028
2 7.2355 26.2524888635 33.6513700944
3 7.365 23.6841865928 28.2168475593
4 8.4768 22.4003094419 28.0983702155
5 5.5838 20.4838449736 26.8616064228
6 7.3123 20.8488005184 26.915596681
But I would expect:
>dim(a)
67 3
Because I lose the dimensions of the data, my *for loop* doesn't work.

Your problem is not the for loop or the lapply call but your read.table command: you use sep="\n" instead of the actual field separator.
?read.table shows you that the sep argument is the field separator. It seems your fields are separated by whitespace, so calling read.table without specifying the sep argument should work. With sep="\n", each whole line becomes a single field, which is why every data frame ends up with one column.
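A minimal sketch of the corrected read (the directory path here is a placeholder, and I'm assuming whitespace-separated files):

```r
# full.names = TRUE returns paths read.table can open directly,
# rather than bare file names relative to the working directory
files <- list.files(path = "path to file", full.names = TRUE)

# Default sep = "" splits on any whitespace, so every column survives
dfr <- lapply(files, function(x) read.table(x, header = TRUE))

# Each element should now keep its full column count
sapply(dfr, ncol)
```

Note the added full.names = TRUE: without it, list.files() returns bare file names, which read.table() can only open if your working directory is the data directory.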

Related

How can I convert a JSON file to a data frame in R?

I want to load the data from a JSON file into R to make a new data frame. However, the JSON file consists of other links with data, so I can't seem to find the actual data in the JSON file. I got the JSON file from this website: https://ckan.dataplatform.nl/dataset/467dc230-20e0-4c3a-8240-dccbfc20807a/resource/531cc276-b88e-49bb-a97f-443707936a12/download/p-route-autoparkeren.json
This is the code I used:
library(rjson)
JSONList1 <- fromJSON(file = "utrecht2.json")
print(JSONList1)
JSONList1_df <- as.data.frame(JSONList1)
When I use this code I get only 1 observation with 411 variables.
Any idea how to do this? I'm a beginner and I've never worked with JSON files.
Maybe try fromJSON from the package jsonlite:
library(jsonlite)
JSONList1 <- fromJSON("https://ckan.dataplatform.nl/dataset/467dc230-20e0-4c3a-8240-dccbfc20807a/resource/531cc276-b88e-49bb-a97f-443707936a12/download/p-route-autoparkeren.json")
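jsonlite simplifies a JSON array of objects directly into a data frame, which is why it usually works where rjson hands back deeply nested lists. A tiny self-contained illustration (the JSON string below is made up for the example):

```r
library(jsonlite)

# An array of objects becomes one data frame, one row per object
txt <- '[{"name":"P06 - Sluisstraat","limitedAccess":false},
         {"name":"3 - Burcht","limitedAccess":false}]'
df <- fromJSON(txt)
class(df)  # "data.frame"
nrow(df)   # 2
```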
There are several packages offering JSON importing abilities. If I use the one I am involved with, then the resulting data appears to contain a data.frame as the first list element.
d <- RcppSimdJson::fload("https://ckan.dataplatform.nl/dataset/467dc230-20e0-4c3a-8240-dccbfc20807a/resource/531cc276-b88e-49bb-a97f-443707936a12/download/p-route-autoparkeren.json")
> class(d)
[1] "list"
> class(d[[1]])
[1] "data.frame"
>
> head(d[[1]])
dynamicDataUrl
1 http://opendata.technolution.nl/opendata/parkingdata/v1/dynamic/8d85bbdb-8bbd-4a24-b35f-85f21186ec04
2 http://opendata.technolution.nl/opendata/parkingdata/v1/dynamic/21b0388a-56f7-4cba-8fd3-4a1c914f5fe2
3 http://opendata.technolution.nl/opendata/parkingdata/v1/dynamic/45434989-3252-4c85-8731-c856b02c390c
4 http://opendata.technolution.nl/opendata/parkingdata/v1/dynamic/9064b206-7e62-402d-ae62-f25a0e47571b
5 http://opendata.technolution.nl/opendata/parkingdata/v1/dynamic/5829fb06-ee4a-4762-946c-ed6209edf7d5
6 http://opendata.technolution.nl/opendata/parkingdata/v1/dynamic/e4da517a-ef32-426d-821c-96e29ac5ac80
staticDataUrl
1 http://opendata.technolution.nl/opendata/parkingdata/v1/static/8d85bbdb-8bbd-4a24-b35f-85f21186ec04
2 http://opendata.technolution.nl/opendata/parkingdata/v1/static/21b0388a-56f7-4cba-8fd3-4a1c914f5fe2
3 http://opendata.technolution.nl/opendata/parkingdata/v1/static/45434989-3252-4c85-8731-c856b02c390c
4 http://opendata.technolution.nl/opendata/parkingdata/v1/static/9064b206-7e62-402d-ae62-f25a0e47571b
5 http://opendata.technolution.nl/opendata/parkingdata/v1/static/5829fb06-ee4a-4762-946c-ed6209edf7d5
6 http://opendata.technolution.nl/opendata/parkingdata/v1/static/e4da517a-ef32-426d-821c-96e29ac5ac80
limitedAccess identifier name
1 FALSE 8d85bbdb-8bbd-4a24-b35f-85f21186ec04 P06 - Sluisstraat
2 FALSE 21b0388a-56f7-4cba-8fd3-4a1c914f5fe2 3 - Burcht
3 FALSE 45434989-3252-4c85-8731-c856b02c390c P01 - Stationsplein
4 FALSE 9064b206-7e62-402d-ae62-f25a0e47571b Jaarbeurs P3 - Jaarbeurs P3
5 FALSE 5829fb06-ee4a-4762-946c-ed6209edf7d5 P03 - Dek Stadspoort
6 FALSE e4da517a-ef32-426d-821c-96e29ac5ac80 PG-Pieter Vreedeplein
locationForDisplay
1 NA
2 WGS84, 52.4387428557465, 4.82805132865906
3 WGS84, 52.2573226613971, 6.16240739822388
4 WGS84, 52.0854991774024, 5.10619640350342
5 WGS84, 52.256324421386, 6.15569114685059
6 WGS84, 51.5582297848141, 5.08894979953766
>
I would expect this to be similar for the other ones.

How to convert a list with same type of field to a data.frame in R

I have a list, and the fields inside each list element have the same names (only the values differ). I need to convert it into a data.frame whose column names match the field names. Following is my list:
Data input (data input in json format.json)
library(rjson)
data <- fromJSON(file = "data input in json format.json")
head(data,3)
[[1]]
[[1]]$floors
[1] 5
[[1]]$elevation
[1] 15
[[1]]$bmi
[1] 23.7483
[[2]]
[[2]]$floors
[1] 4
[[2]]$elevation
[1] 12
[[2]]$bmi
[1] 23.764
[[3]]
[[3]]$floors
[1] 3
[[3]]$elevation
[1] 9
[[3]]$bmi
[1] 23.7797
And my expected data.frame is,
floors elevation bmi
5 15 23.7483
4 12 23.7640
3 9 23.7797
Can you help me figure this out?
Thanks in advance.
You can use jsonlite.
library(jsonlite)
Then use fromJSON() and specify the path to your file (or alternatively a URL or the raw text) in the argument txt:
fromJSON(txt = 'path/to/json/file.json')
The result is:
floors elevation bmi
1 5 15 23.7483
2 4 12 23.7640
3 3 9 23.7797
If you prefer rjson, you could first read it as previously:
data <- rjson::fromJSON(file = 'path/to/json/file.json')
Then use do.call() and rbind.data.frame() to convert the list to a dataframe:
do.call("rbind.data.frame", data)
Alternatively to do.call(), use data.table's rbindlist(), which is faster:
data.table::rbindlist(data)
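For reference, the do.call() route can be sketched without the JSON file, using a list rebuilt to match the one shown above:

```r
# Rebuild the list structure from the question
data <- list(
  list(floors = 5, elevation = 15, bmi = 23.7483),
  list(floors = 4, elevation = 12, bmi = 23.7640),
  list(floors = 3, elevation = 9,  bmi = 23.7797)
)

# Each inner list becomes one row; the field names become column names
df <- do.call("rbind.data.frame", data)
df
#   floors elevation     bmi
# 1      5        15 23.7483
# 2      4        12 23.7640
# 3      3         9 23.7797
```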

subset function returns all rows

I recently reverted to R version 3.1.3 for compatibility reasons and am now encountering an unexplained error with the subset function.
I want to extract all rows for the gene "Migut.A00003" from the data frame transcr_effects, using the gene name as listed in the data frame expr_mim_genes (this will later become a loop). The action always returns all rows instead of the specific rows I am looking for, no matter how I format the subset lookup:
> class(expr_mim_genes)
[1] "data.frame"
> sapply(expr_mim_genes, class)
gene longest.tr pair.length
"character" "logical" "numeric"
> head(expr_mim_genes)
gene longest.tr pair.length
1 Migut.A00003 NA 0
2 Migut.A00006 NA 0
3 Migut.A00007 NA 0
4 Migut.A00012 NA 0
5 Migut.A00014 NA 0
6 Migut.A00015 NA 0
> class(transcr_effects)
[1] "data.frame"
> sapply(transcr_effects, class)
pair gene
"character" "character"
> head(transcr_effects)
pair gene
1 pair1 Migut.N01020
2 pair10 Migut.A00351
3 pair1000 Migut.F00857
4 pair10007 Migut.D01637
5 pair10008 Migut.A00401
6 pair10009 Migut.G00442
. . .
7168 pair3430 Migut.A00003
. . .
The gene I am interested in:
> expr_mim_genes[1,"gene"]
[1] "Migut.A00003"
R sees these two terms as equivalent:
> expr_mim_genes[1,"gene"] == "Migut.A00003"
[1] TRUE
If I type in the name of the gene manually, the correct number of rows are returned:
> nrow(subset(transcr_effects, transcr_effects$gene=="Migut.A00003"))
[1] 1
> subset(transcr_effects, transcr_effects$gene=="Migut.A00003")
pair gene
7168 pair3430 Migut.A00003
However, this should return one row from the data.frame, but it returns all rows:
> nrow(subset(transcr_effects, transcr_effects$gene == (expr_mim_genes[1,"gene"])))
[1] 10122
I have a feeling this has something to do with text formatting, but I've tried everything and haven't been able to figure it out. I've seen this issue with quoted vs. unquoted entries, but it does not appear to be the issue here (see the equality above).
I didn't have this problem before switching to R v.3.1.3, so maybe it is a version convention I am unaware of?
EDIT:
This is driving me crazy, but at least I think I have found a patch. There was quite a bit of data and file processing to get to this point in the code, involving loading at least 4 files. I've tried taking snippets of each file to post a reproducible example here, but sometimes when I analyze the snippets the error recurs, and sometimes it does not (!!). After going through the process, though, I discovered that:
i = 1
gene = expr_mim_genes[i,"gene"]
> nrow(subset(transcr_effects, gene == gene))
[1] 10122
> nrow(subset(transcr_effects, gene == (expr_mim_genes[i,"gene"])))
[1] 1
I still can't explain this behavior of the code, but at least I know how to work around it.
Thanks all.
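For later readers, the surprising gene == gene result comes from subset()'s non-standard evaluation: inside subset(), a bare name is looked up in the data frame first, so when your variable shares its name with a column, both sides of gene == gene resolve to the column, and the comparison is TRUE for every row. A minimal reproduction (with made-up data):

```r
df <- data.frame(gene = c("Migut.A00003", "Migut.N01020", "Migut.F00857"),
                 stringsAsFactors = FALSE)
gene <- "Migut.A00003"

# Both sides resolve to the column df$gene, so every row matches
nrow(subset(df, gene == gene))    # 3

# A variable name that doesn't collide with a column behaves as expected
target <- "Migut.A00003"
nrow(subset(df, gene == target))  # 1
```

This is why renaming the loop variable (or indexing with df$gene == value directly, as in the EDIT's second line) works around the problem.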

read.table() changes column names [duplicate]

Whenever I read in a file using read.csv() with the option header=T, the headers change in weird (but predictable) ways. A header name which ought to read "P(A>B)" becomes "P.A.B.", for instance:
> # when header=F:
> myfile1 <- read.csv(fullpath,sep="\t",header=F,nrow=3)
> myfile1
V1 V2 V3
1 ID Name P(A>B)
2 AB001 Alice 0.997
3 AB002 Bob 0.497
>
> # When header=T:
> myfile2 <- read.csv(fullpath,sep="\t",header=T,nrow=3)
> myfile2
ID Name P.A.B.
1 AB001 Alice 0.997
2 AB002 Bob 0.497
3 AB003 Charles 0.732
I tried to fix it like this, but it didn't work:
> names(myfile2) <- myfile1[1,]
> myfile2
3 3 3
1 AB001 Alice 0.997
2 AB002 Bob 0.497
3 AB003 Charles 0.732
So then I tried to use sub() to write a function that would take any vector "arbitrary.lengths.here." and return a vector "arbitrary(lengths>here)", but I didn't really get anywhere, and I started to suspect that I was making this problem more complicated than it had to be.
How would you deal with this problem of headers? Was I on the right track with sub()?
Set check.names=FALSE in read.csv():
read.csv(fullpath, sep="\t", header=TRUE, nrow=3, check.names=FALSE)
From the help for ?read.csv:
check.names
logical. If TRUE then the names of the variables in the data frame are checked to ensure that they are syntactically valid variable names. If necessary they are adjusted (by make.names) so that they are, and also to ensure that there are no duplicates.
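The adjustment is done by make.names(), which replaces characters that are not legal in R names with dots; you can see the transformation directly:

```r
# make.names() is what read.csv() applies when check.names = TRUE
make.names("P(A>B)")
# [1] "P.A.B."
```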
Not really intended as an answer, but intended to be helpful to R newbies: those headers were read in as factors (and caused the third column to also be a factor). The screwy names() assignment probably used their integer storage mode. @Andrie has already given you the preferred solution, but if you wanted to just reassign the names (which would not undo the damage to the third column) you could use:
names(myfile1) <- scan(file=fullpath, what="character", nmax=1, sep="\t")
myfile1 <- myfile1[-1, ] # gets rid of the unneeded header line

Use the string of characters from a cell in a dataframe to create a vector

> titletool <- read.csv("TotalCSVData.csv", header=FALSE, sep=",")
> class(titletool)
[1] "data.frame"
> titletool[1,1]
[1] Experiment name : CONTROL DB AD_1
> t <- titletool[1,1]
> t
[1] Experiment name : CONTROL DB AD_1
> class(t)
[1] "character"
Now I want to create an object (vector) with the name "Experiment name : CONTROL DB AD_1", or even better, if possible, CONTROL DB AD_1.
Thank you
Use assign:
varname <- "Experiment name : CONTROL DB AD_1"
assign(varname, 3.14158)
get("Experiment name : CONTROL DB AD_1")
[1] 3.14158
And you can use a regular expression and sub or gsub to remove some text from a string:
cleanVarname <- sub("Experiment name : ", "", varname)
assign(cleanVarname, 42)
get("CONTROL DB AD_1")
[1] 42
But let me warn you this is an unusual thing to do.
Here be dragons.
If I understand correctly, you have a bunch of CSV files, each with multiple experiments in them, named in the pattern "Experiment ...". You now want to read each of these "experiments" into R in an efficient way.
Here's a not-so-pretty (but not-so-ugly either) function that might get you started in the right direction.
What the function basically does is read in the CSV, identify the line numbers where each new experiment starts, grabs the names of the experiments, then does a loop to fill in a list with the separate data frames. It doesn't really bother making "R-friendly" names though, and I've decided to leave the output in a list, because as Andrie pointed out, "R has great tools for working with lists."
read.funkyfile = function(funkyfile, expression, ...) {
  temp = readLines(funkyfile)
  temp.loc = grep(expression, temp)
  temp.loc = c(temp.loc, length(temp)+1)
  temp.nam = gsub("[[:punct:]]", "",
                  grep(expression, temp, value=TRUE))
  temp.out = vector("list")
  for (i in 1:length(temp.nam)) {
    temp.out[[i]] = read.csv(textConnection(
      temp[seq(from = temp.loc[i]+1,
               to = temp.loc[i+1]-1)]),
      ...)
    names(temp.out)[i] = temp.nam[i]
  }
  temp.out
}
Here is an example CSV file. Copy and paste it into a text editor and save it as "funkyfile1.csv" in the current working directory. (Or, read it in from Dropbox: http://dl.dropbox.com/u/2556524/testing/funkyfile1.csv)
"Experiment Name: Here Be",,
1,2,3
4,5,6
7,8,9
"Experiment Name: The Dragons",,
10,11,12
13,14,15
16,17,18
Here is a second CSV. Again, copy-paste and save it as "funkyfile2.csv" in your current working directory. (Or, read it in from Dropbox: http://dl.dropbox.com/u/2556524/testing/funkyfile2.csv)
"Promises: I vow to",,
"H1","H2","H3"
19,20,21
22,23,24
25,26,27
"Promises: Slay the dragon",,
"H1","H2","H3"
28,29,30
31,32,33
34,35,36
Notice that funkyfile1 has no column names, while funkyfile2 does. That's what the ... argument in the function is for: to specify header=TRUE or header=FALSE. Also the "expression" identifying each new set of data is "Promises" in funkyfile2.
Now, use the function:
read.funkyfile("funkyfile1.csv", "Experiment", header=FALSE)
# read.funkyfile("http://dl.dropbox.com/u/2556524/testing/funkyfile1.csv",
# "Experiment", header=FALSE) # Uncomment to load remotely
# $`Experiment Name Here Be`
# V1 V2 V3
# 1 1 2 3
# 2 4 5 6
# 3 7 8 9
#
# $`Experiment Name The Dragons`
# V1 V2 V3
# 1 10 11 12
# 2 13 14 15
# 3 16 17 18
read.funkyfile("funkyfile2.csv", "Promises", header=TRUE)
# read.funkyfile("http://dl.dropbox.com/u/2556524/testing/funkyfile2.csv",
# "Promises", header=TRUE) # Uncomment to load remotely
# $`Promises I vow to`
# H1 H2 H3
# 1 19 20 21
# 2 22 23 24
# 3 25 26 27
#
# $`Promises Slay the dragon`
# H1 H2 H3
# 1 28 29 30
# 2 31 32 33
# 3 34 35 36
Go get those dragons.
Update
If your data are all in the same format, you can use the lapply solution mentioned by Andrie along with this function. Just make a list of the CSVs that you want to load, as below. Note that the files all need to use the same "expression" and other arguments, the way the function is currently written.
temp = list("http://dl.dropbox.com/u/2556524/testing/funkyfile1.csv",
"http://dl.dropbox.com/u/2556524/testing/funkyfile3.csv")
lapply(temp, read.funkyfile, "Experiment", header=FALSE)