I have been facing an error while reading a CSV file. The first few lines of the file are given below:
"","1.CEL","2.CEL","3.CEL","4.CEL"
"1_s_at",NA,NA,NA,NA
"2_at",NA,NA,NA,NA
"3_at",NA,NA,NA,NA
"4_at",NA,NA,NA,NA
"5_g_at",NA,NA,NA,NA
"6_at",NA,NA,NA,NA
"7_at",NA,NA,NA,NA
Reading the CSV file:
test <- read.csv(file='/home/userxyz/test.csv')
head(test)
# X X1.CEL X2.CEL X3.CEL X4.CEL
#1 1_s_at NA NA NA NA
#2 2_at NA NA NA NA
#3 3_at NA NA NA NA
#4 4_at NA NA NA NA
#5 5_g_at NA NA NA NA
#6 6_at NA NA NA NA
Explicitly specifying the presence of the header:
test <- read.csv(file='/home/userxyz/test.file', header=T)
head(test)
# X X1.CEL X2.CEL X3.CEL X4.CEL
#1 1_s_at NA NA NA NA
#2 2_at NA NA NA NA
#3 3_at NA NA NA NA
#4 4_at NA NA NA NA
#5 5_g_at NA NA NA NA
#6 6_at NA NA NA NA
Explicitly specifying row.names did not work:
test <- read.csv(file='/home/userxyz/test.file', row.names=T)
#Error in read.table(file = file, header = header, sep = sep, quote = quote, :
# invalid 'row.names' specification
I have also tried the read.table and read.delim functions.
Is the error because of special characters in the row.names?
I think you are trying to read the first column in as row names. Try:
x <- '"","1.CEL","2.CEL","3.CEL","4.CEL"
"1_s_at",NA,NA,NA,NA
"2_at",NA,NA,NA,NA
"3_at",NA,NA,NA,NA
"4_at",NA,NA,NA,NA
"5_g_at",NA,NA,NA,NA
"6_at",NA,NA,NA,NA
"7_at",NA,NA,NA,NA'
read.csv(text = x, row.names = 1L)
# X1.CEL X2.CEL X3.CEL X4.CEL
#1_s_at NA NA NA NA
#2_at NA NA NA NA
#3_at NA NA NA NA
#4_at NA NA NA NA
#5_g_at NA NA NA NA
#6_at NA NA NA NA
#7_at NA NA NA NA
If you want to preserve the header exactly, use
read.csv(text = x, row.names = 1L, check.names = FALSE)
# 1.CEL 2.CEL 3.CEL 4.CEL
#1_s_at NA NA NA NA
#2_at NA NA NA NA
#3_at NA NA NA NA
#4_at NA NA NA NA
#5_g_at NA NA NA NA
#6_at NA NA NA NA
#7_at NA NA NA NA
Regarding row.names, see ?read.csv:
row.names: a vector of row names. This can be a vector giving the
actual row names, or a single number giving the column of the
table which contains the row names, or character string
giving the name of the table column containing the row names.
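The documentation excerpt above lists the accepted forms for row.names; a logical such as TRUE is none of them, which is consistent with the error in the question. As a minimal sketch of an equivalent two-step route (the names csv_text and dat are made up for illustration, and the data is a shortened excerpt of the file in the question), the first column can also be promoted to row names after reading:

```r
# Hypothetical three-row excerpt of the file from the question.
csv_text <- '"","1.CEL","2.CEL"
"1_s_at",NA,NA
"2_at",NA,NA
"3_at",NA,NA'

# Read with default settings, then move column 1 into the row names.
dat <- read.csv(text = csv_text)
rownames(dat) <- as.character(dat[[1]])
dat <- dat[-1]

rownames(dat)  # probe identifiers from the first column
names(dat)     # remaining column names (made syntactic by check.names)
```

Passing row.names = 1L directly, as in the answer, does the same in one call and is usually the simpler choice.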
I need to convert Persian dates to Gregorian using the ConvCalendar package. The character vector is as follows:
str(df$death_date)
chr [1:286] NA NA NA NA "1399/03/12" NA NA NA NA NA NA NA NA NA "1399/03/25" NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA ...
death <- OtherDate(day=substr(df$death_date,9,10),
month=substr(df$death_date,6,7),
year=substr(df$death_date,1,4),
calendar="persian")
Once I try to convert death to Gregorian using as.Date(death), the following error comes out:
Error in as.Date.OtherDate(death) : NAs in foreign function call (arg 4)
Could anyone please tell me what is wrong?
Package ConvCalendar was archived a long time ago, see CRAN:
Archived on 2018-05-24 as check problems were not corrected despite reminders.
If it is installed from source, the following will work.
library(ConvCalendar)
y <- x[!is.na(x)]
y <- as.POSIXlt(y, format = '%Y/%m/%d', origin = '1970-01-01')
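pers <- OtherDate(day=y$mday, month=y$mon+1, year=y$year+1900, calendar="persian") # POSIXlt years count from 1900
as.Date(pers)
# Without the + 1900 offset the year is passed as -501 and the result lands in the wrong century:
#[1] "0120-06-02" "0120-06-15"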
Data
x <- scan(what = character(), text = '
NA NA NA NA "1399/03/12" NA NA NA NA NA NA NA NA NA "1399/03/25" NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
')
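Because the answer passes fields of a POSIXlt object to OtherDate, it helps to check what those fields actually hold. The following base-R sketch (it does not need ConvCalendar; the variable names y and parts are only illustrative) shows the year and month offsets, and a string-splitting alternative that avoids date parsing altogether:

```r
# Parse one of the strings from the question as if it were a Gregorian date.
y <- as.POSIXlt("1399/03/12", format = "%Y/%m/%d", tz = "UTC")

y$year          # years since 1900, so negative here
y$year + 1900   # the year as written in the string
y$mon + 1       # months are stored as 0-11
y$mday          # day of the month

# Alternative: split on "/" and convert the pieces to integers.
parts <- as.integer(strsplit("1399/03/12", "/", fixed = TRUE)[[1]])
parts           # year, month, day
```

Splitting the string has the side benefit that the Persian date is never validated against Gregorian month lengths.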
Assume a data.frame:
df <- data.frame(name = c("a","b","c","d","e"),rank = c(1,1,4,3,2))
name rank
a 1
b 1
c 4
d 3
e 2
Based on the above data.frame, I want to create a new one that holds the count of transitions from one rank to another. So the output would be something like this:
name 1to1 1to2 1to3 1to4 2to1 2to2 2to3 2to4 3to1 3to2 3to3 3to4 4to1 4to2 4to3 4to4
1 b 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2 c NA NA NA 1 NA NA NA NA NA NA NA NA NA NA NA NA
3 d NA NA NA NA NA NA NA NA NA NA NA NA NA NA 1 NA
4 e NA NA NA NA NA NA NA NA NA 1 NA NA NA NA NA NA
One way to do this would be to run a for loop and then using ifs but I am pretty sure there should be a more efficient way of doing this.
For example, if item d has a rank of 3 and item c is ranked as 4 then the code should increase the count of the 4to3 column under d's row (as per example above). Please let me know if this is unclear and I appreciate all the help.
P.S. colnames are not that important.
You could use Map to create sequences for extracting the transitions and collapse them into the desired form using paste.
tmp <- sapply(Map(seq, 1:(nrow(df1)-1), 2:nrow(df1)), function(i) df1$rank[i])
v <- apply(tmp, 2, function(x) paste(x, collapse="to"))
Then create a grid with all permutations
to <- apply(expand.grid(1:4, 1:4), 1, function(x) paste(x, collapse="to"))
and compare them with the actual transitions to get the resulting binary structure; create a data frame out of it.
res <- data.frame(name=df1$name[-1], t(sapply(v, function(i) setNames(+(i == to), to))))
Afterwards, you may convert the zeroes to NA using
res[res == 0] <- NA
Result
res
# name X1to1 X2to1 X3to1 X4to1 X1to2 X2to2 X3to2 X4to2 X1to3 X2to3 X3to3 X4to3 X1to4 X2to4 X3to4 X4to4
# 1to1 b 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# 1to4 c NA NA NA NA NA NA NA NA NA NA NA NA 1 NA NA NA
# 4to3 d NA NA NA NA NA NA NA NA NA NA NA 1 NA NA NA NA
# 3to2 e NA NA NA NA NA NA 1 NA NA NA NA NA NA NA NA NA
Data
df1 <- structure(list(name = structure(1:5, .Label = c("a", "b", "c",
"d", "e"), class = "factor"), rank = c(1, 1, 4, 3, 2)), class = "data.frame", row.names = c(NA,
-5L))
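The transition labels can also be built without Map by pairing each rank with the one before it. This is a minimal sketch of that idea, not the answerer's method; the names from_rank, to_rank, all_levels and counts are made up for illustration, and table() yields 0 rather than NA for transitions that do not occur:

```r
df1 <- data.frame(name = c("a", "b", "c", "d", "e"), rank = c(1, 1, 4, 3, 2))

# Pair every rank with its predecessor.
from_rank <- head(df1$rank, -1)
to_rank   <- tail(df1$rank, -1)
transitions <- paste(from_rank, to_rank, sep = "to")
transitions

# All 16 possible labels, in the same order as expand.grid(1:4, 1:4).
all_levels <- as.vector(outer(1:4, 1:4, paste, sep = "to"))

# Cross-tabulate names against transitions; absent transitions count as 0.
counts <- table(name = df1$name[-1],
                transition = factor(transitions, levels = all_levels))
counts["d", "4to3"]
```

If NA is preferred over 0, the zeroes can be replaced afterwards in the same way as in the answer.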
I would like to subset my data frame by selecting columns through partial matching of their names, which works when I have a single name to match.
The data frame is:
ABBA01A ABBA01B ABBA02A ABBA02B ACRU01A ACRU01B ACRU02A ACRU02B
1908 NA NA NA NA NA NA NA NA
1909 NA NA NA NA NA NA NA NA
1910 NA NA NA NA NA NA NA NA
1911 NA NA NA NA NA NA NA NA
1912 NA NA NA NA NA NA NA NA
1913 NA NA NA NA NA NA NA NA
library(stringr)
df[str_detect(names(df), "ABBA" )]
works, and returns:
ABBA01A ABBA01B ABBA02A ABBA02B
1908 NA NA NA NA
So, I would like to create a dataframe for each of my species:
Speciesnames=unique ( substring (names(df),0, 4))
Speciesnames
[1] "ABBA" "ACRU" "ARCU" "PIAB" "PIGL"
I have tried to write a loop and use [i] as the species name, but the str_detect function does not recognise it.
I would also like to add additional calculations inside the loop:
for ( i in seq_along(Speciesnames)){
df=df[str_detect(names(df), pattern =[i])]
print(df)
#my function for the subsetted dataframe
}
thank you for your help!
Using your data, you could do the following:
1. Create a list to hold the data.frames to be created.
2. Filter the data.frame by species and store each result in the list.
3. Give each list element the name of its species.
4. Copy all the data.frames out of the list into the global environment.
library(dplyr) # provides %>%, select() and starts_with()
Speciesnames <- unique(substring(names(df), 1, 4))
data <- vector("list", length(Speciesnames))
for(i in seq_along(Speciesnames)) {
data[[i]] <- df %>% select(starts_with(Speciesnames[i]))
}
names(data) <- Speciesnames
list2env(data, envir = globalenv())
The end result after list2env is two data.frames called "ABBA" and "ACRU", which you can then access directly. If further manipulation is needed, you might leave everything in the list and do it there.
Another option is to use mapply with SIMPLIFY=FALSE to return a list of data frames, one per species. The base-R function startsWith can be used to select the columns whose names start with a species code.
# First find the species by taking the unique first 4 characters of the column names
species <- unique(gsub("([A-Z]{4}).*", "\\1",names(df)))
# Pass each species
listOfDFs <- mapply(function(x){
df[, startsWith(names(df), x), drop = FALSE] # Return only columns starting with the species code; drop = FALSE keeps a data frame even for a single match
}, species, SIMPLIFY=FALSE)
listOfDFs
# $ABBA
# ABBA01A ABBA01B ABBA02A ABBA02B
# 1908 NA NA NA NA
# 1909 NA NA NA NA
# 1910 NA NA NA NA
# 1911 NA NA NA NA
# 1912 NA NA NA NA
# 1913 NA NA NA NA
#
# $ACRU
# ACRU01A ACRU01B ACRU02A ACRU02B
# 1908 NA NA NA NA
# 1909 NA NA NA NA
# 1910 NA NA NA NA
# 1911 NA NA NA NA
# 1912 NA NA NA NA
# 1913 NA NA NA NA
Data:
df <- read.table(text =
"ABBA01A ABBA01B ABBA02A ABBA02B ACRU01A ACRU01B ACRU02A ACRU02B
1908 NA NA NA NA NA NA NA NA
1909 NA NA NA NA NA NA NA NA
1910 NA NA NA NA NA NA NA NA
1911 NA NA NA NA NA NA NA NA
1912 NA NA NA NA NA NA NA NA
1913 NA NA NA NA NA NA NA NA",
header = TRUE, stringsAsFactors = FALSE)
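The per-species split can also be done in a single base-R call. The sketch below is an alternative to the loops above, not a replacement for them; it uses a shortened copy of the data, and the names species, by_species and sub_df are only illustrative. split.default treats the data frame as a list of columns and groups them by the four-letter prefix:

```r
df <- read.table(text = "ABBA01A ABBA01B ACRU01A ACRU01B
1908 NA NA NA NA
1909 NA NA NA NA", header = TRUE)

# Group the columns by the first four characters of their names.
species <- substr(names(df), 1, 4)
by_species <- split.default(df, species)
names(by_species)

# Further per-species calculations can then run over the list.
for (sp in names(by_species)) {
  sub_df <- by_species[[sp]]
  print(dim(sub_df))  # placeholder for the real per-species function
}
```

Keeping the pieces in a named list, rather than copying them into the global environment, makes it easy to apply the same function to each species with lapply.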
I think that you should select all matching columns first, and then subselect your data.frame.
patterns <- c("ABB", "CDC")
res <- lapply(patterns, function(x) grep(x, colnames(df), value=TRUE))
df[, unique(unlist(res))]
The res object is a list of matched columns for each pattern.
The next step is to take the unique set of columns, unique(unlist(res)), and subset the data.frame with it.
If you are writing production code, this is probably not the best answer.
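When several patterns should select columns in one go, they can also be combined into a single regular expression. This is a small sketch under assumptions: the data frame and the pattern values are invented for illustration, and the patterns are assumed to be plain prefixes with no regex metacharacters:

```r
# Hypothetical data frame with three species prefixes.
df <- data.frame(ABBA01A = 1:2, ACRU01A = 3:4, PIAB01A = 5:6)
patterns <- c("ABBA", "PIAB")

# Build "^(ABBA|PIAB)" so that only names starting with a pattern match.
regex <- paste0("^(", paste(patterns, collapse = "|"), ")")
keep <- grep(regex, names(df), value = TRUE)

# drop = FALSE keeps a data frame even if only one column matches.
df[, keep, drop = FALSE]
```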
I'm trying to subset a raster based on cell numbers. I want to provide a vector of cell numbers and return a raster with the original cell values for those cells referenced in the cell numbers vector. I tried the rasterFromCells() function but this seems to interpolate between cell numbers and doesn't return values, but rather cell numbers. I've tried:
#original raster loaded with 400 sample values ranging from 1:24
foo <- raster(ncol=20, nrow=20)
foo[] <- sample(seq(1,24),400,replace = TRUE)
#vector of desired cell numbers
my.pts <- c(2,20,200)
#rasterFromCells attempt
bar<-rasterFromCells(foo, my.pts, values=TRUE)
How can I return a raster layer with foo's values for cell numbers 2, 20 and 200 and all other cells as NA?
If you want to create a new raster with the values at only the cell locations in my.pts replaced by the values at those cell locations in foo and all other cell values set to NA, you just have to:
create a raster (i.e., bar) the same size as foo.
fill it with NAs
Use bar[my.pts] <- foo[my.pts]
For example:
library(raster)
set.seed(123) ## for reproducible results
foo <- raster(ncols=20, nrows=20)
foo[] <- sample(seq(1,24),400,replace = TRUE)
#vector of desired cell numbers
my.pts <- c(2,20,200)
## create raster the same size as foo filled with NAs
bar <- raster(ncols=ncol(foo), nrows=nrow(foo))
bar[] <- NA
## replace the values with those in foo
bar[my.pts] <- foo[my.pts]
foo[my.pts]
##[1] 19 23 14
bar[]
## [1] NA 19 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 23 NA NA NA NA NA NA NA NA NA NA NA
## [32] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [63] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [94] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##[125] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##[156] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##[187] NA NA NA NA NA NA NA NA NA NA NA NA NA 14 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##[218] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##[249] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##[280] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##[311] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##[342] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##[373] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
Another approach to accomplish the same result is to copy foo to bar and then set all cell locations not in my.pts to NA:
bar <- foo
bar[setdiff(1:ncell(foo),my.pts)] <- NA
The advantage of rasterFromCells is that it returns a smaller raster, as it contains only the cropped version of what you want.
What remains is to copy the values of the initial raster (r) into the new one (r2). This is easy because, with values=TRUE, r2 holds the original cell numbers:
r <- raster(ncols=100, nrows=100)
r[] <- rnorm(ncell(r))
cells <- c(3:5, 210)
r2 <- rasterFromCells(r, cells, values=TRUE)
ini_cells <- getValues(r2)
Simply feed the values according to the index:
r2[] <- r[ini_cells]
This results in a raster of 24 cells instead of 10'000!
c(ncell(r), ncell(r2))
Let us compare the results:
data.frame(Orig=getValues(r)[cells], New=getValues(r2)[ini_cells %in% cells])
[,1] [,2]
[1,] -0.5081512 -0.5081512
[2,] -0.8799739 -0.8799739
[3,] 0.3722788 0.3722788
[4,] -0.7661364 -0.7661364
Note: you wanted to set all others to NA. You would do this with:
r2[!ini_cells %in% cells] <- NA
head(getValues(r2))
-0.5081512 -0.8799739 0.3722788 NA NA NA
I am trying to get my head around how to use data.tables. It is not going well.
I have a large data.table with a bunch of returns and AUM. I subsetted that data.table into two data.tables, one with returns, and one with AUM. I now want to subset the returns data.table, to get only the returns from funds with AUM less than the 50th percentile.
To give you an idea, this is my code:
fundDetails <- data.table(read.table("Fund_Deets.csv", sep = ",", fill = TRUE, quote="\"", header=TRUE))
fundNAV <- data.table(read.table("NAV_AUM.csv", sep = ",", fill = TRUE, quote="\"", header=TRUE))
allFundDetails <- fundDetails[Currency == 'USD']
allFundNAV <- fundNAV[Fund.ID %in% allFundDetails$Fund.ID]
allFundAUM <- allFundNAV[Type == 'AUM', -c(1,3), with = FALSE]
allFundAUM <- setnames(data.table(t(sapply(allFundAUM[,-1, with = FALSE],as.numeric))), as.character(allFundAUM$Fund.ID))
allFundReturns <- allFundNAV[Type == 'Return', -c(1,3), with = FALSE]
allFundReturns <- setnames(data.table(t(sapply(allFundReturns[,-1, with = FALSE],as.numeric)/100)), as.character(allFundReturns$Fund.ID))
smallFundReturns <- data.table(sapply(allFundReturns, function(x) rep(NA, length(x))))
This produces the following three tables (smallFundReturns is obviously just NAs):
> allFundAUM[,1:10, with = FALSE]
33992 33261 38102 33264 33275 5606 41695 40483 41526 45993
1: NA NA NA NA NA NA NA NA 1 27
2: NA NA NA NA NA NA 117 NA 1 27
3: NA NA NA NA NA NA 120 NA 1 27
4: NA NA NA NA NA NA 133 NA 1 27
5: NA NA NA NA NA NA 146 NA 1 29
---
260: NA NA NA NA NA NA NA NA NA NA
261: NA NA NA NA NA NA NA NA NA NA
262: NA NA NA NA NA NA NA NA NA NA
263: NA NA NA NA NA NA NA NA NA NA
264: NA NA NA NA NA NA NA NA NA NA
> allFundReturns[,1:10, with = FALSE]
33992 33261 38102 33264 33275 5606 41695 40483 41526 45993
1: NA NA NA NA NA NA NA NA 0.0188 -0.0116
2: NA NA NA NA NA NA -0.0315 NA -0.0120 0.0134
3: NA NA NA NA NA NA -0.0978 NA -0.0908 -0.0206
4: NA NA NA NA NA NA -0.0445 NA -0.0269 -0.0287
5: NA NA NA NA NA NA 0.0139 NA 0.0298 -0.0141
---
260: NA NA NA NA NA NA NA NA NA NA
261: NA NA NA NA NA NA NA NA NA NA
262: NA NA NA NA NA NA NA NA NA NA
263: NA NA NA NA NA NA NA NA NA NA
264: NA NA NA NA NA NA NA NA NA NA
> smallFundReturns[,1:10, with = FALSE]
33992 33261 38102 33264 33275 5606 41695 40483 41526 45993
1: NA NA NA NA NA NA NA NA NA NA
2: NA NA NA NA NA NA NA NA NA NA
3: NA NA NA NA NA NA NA NA NA NA
4: NA NA NA NA NA NA NA NA NA NA
5: NA NA NA NA NA NA NA NA NA NA
---
260: NA NA NA NA NA NA NA NA NA NA
261: NA NA NA NA NA NA NA NA NA NA
262: NA NA NA NA NA NA NA NA NA NA
263: NA NA NA NA NA NA NA NA NA NA
264: NA NA NA NA NA NA NA NA NA NA
I am trying to subset using this for loop (using a for loop in an attempt to debug):
for (i in 1:nrow(allFundReturns)){
theSubset <- as.vector(allFundReturns[i,] <= as.numeric(quantile(allFundAUM[i,], .5, na.rm = TRUE)))
theSubset[is.na(theSubset)] <- FALSE
theSubset <- colnames(allFundReturns)[theSubset]
smallFundReturns[i,theSubset, with = FALSE] = allFundReturns[i,theSubset, with = FALSE]
}
This produces an error:
Error in `[<-.data.table`(`*tmp*`, i, theSubset, with = FALSE, value = list( :
unused argument (with = FALSE)
I tried removing the 'with' part, but this spits out a bunch of warnings:
> warnings()
Warning messages:
1: In `[<-.data.table`(`*tmp*`, i, theSubset, value = c("41526", ... :
Supplied 3020 items to be assigned to 1 items of column '41526' (3019 unused)
2: In `[<-.data.table`(`*tmp*`, i, theSubset, value = c("41526", ... :
Supplied 3020 items to be assigned to 1 items of column '45993' (3019 unused)
3: In `[<-.data.table`(`*tmp*`, i, theSubset, value = c("41526", ... :
Supplied 3020 items to be assigned to 1 items of column '45994' (3019 unused)
4: In `[<-.data.table`(`*tmp*`, i, theSubset, value = c("41526", ... :
I am confused on how to do this. Any ideas on how I can subset the second data.table by the subset on the first?
EDIT:
I tried the suggestion below:
smallFundReturns[i,(theSubset):=allFundReturns[i,(theSubset), with = FALSE], with = FALSE]
And I got these warnings():
> warnings()
Warning messages:
1: In `[.data.table`(smallFundReturns, i, `:=`((theSubset), ... :
Coerced 'double' RHS to 'logical' to match the column's type; may have truncated precision. Either change the target column to 'double' first (by creating a new 'double' vector length 264 (nrows of entire table) and assign that; i.e. 'replace' column), or coerce RHS to 'logical' (e.g. 1L, NA_[real|integer]_, as.*, etc) to make your intent clear and for speed. Or, set the column type correctly up front when you create the table and stick to it, please.
2: In `[.data.table`(smallFundReturns, i, `:=`((theSubset), ... :
Coerced 'double' RHS to 'logical' to match the column's type; may have truncated precision. Either change the target column to 'double' first (by creating a new 'double' vector length 264 (nrows of entire table) and assign that; i.e. 'replace' column), or coerce RHS to 'logical' (e.g. 1L, NA_[real|integer]_, as.*, etc) to make your intent clear and for speed. Or, set the column type correctly up front when you create the table and stick to it, please.
3: In `[.data.table`(smallFundReturns, i, `:=`((theSubset), ... :
And the code produced this, with 'TRUE' everywhere I would expect a number:
> smallFundReturns[,1:10, with = FALSE]
33992 33261 38102 33264 33275 5606 41695 40483 41526 45993
1: NA NA NA NA NA NA NA NA TRUE TRUE
2: NA NA NA NA NA NA NA NA NA NA
3: NA NA NA NA NA NA NA NA NA NA
4: NA NA NA NA NA NA NA NA NA NA
5: NA NA NA NA NA NA NA NA NA NA
---
260: NA NA NA NA NA NA NA NA NA NA
261: NA NA NA NA NA NA NA NA NA NA
262: NA NA NA NA NA NA NA NA NA NA
263: NA NA NA NA NA NA NA NA NA NA
264: NA NA NA NA NA NA NA NA NA NA
EDIT 2:
I figured out the issue. Apparently, this line:
smallFundReturns <- data.table(sapply(allFundReturns, function(x) rep(NA, length(x))))
created the table as being logical. I changed it to this line:
smallFundReturns <- data.table(sapply(allFundReturns, function(x) as.numeric(rep(NA, length(x)))))
And everything worked after @HubertL's fix. Thanks!!
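The type issue described in this edit can be reproduced in base R without data.table. In this minimal sketch (the variable names are illustrative), a bare NA is logical, NA_real_ is its numeric counterpart, and coercing a non-zero return to logical gives TRUE, which matches the TRUE values seen in the table above:

```r
logical_na <- rep(NA, 3)        # plain NA is of type logical
numeric_na <- rep(NA_real_, 3)  # NA_real_ is the numeric missing value

class(logical_na)
class(numeric_na)

# Any non-zero number coerced to logical becomes TRUE.
as.logical(0.0188)

# Building the placeholder from NA_real_ gives a numeric matrix up front.
m <- sapply(list(a = 1:3, b = 4:6), function(x) rep(NA_real_, length(x)))
storage.mode(m)
```

Using rep(NA_real_, length(x)) is equivalent to the as.numeric(rep(NA, length(x))) form above and states the intended type directly.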
You have to write it like this:
smallFundReturns[i,(theSubset):=allFundReturns[i,(theSubset), with = FALSE], with = FALSE]
Suggestions for improvement:
Try reading the data with fread instead of read.table if possible. It is much faster, and the result is a data.table rather than a data.frame.
Avoid with = FALSE where you can. Inside :=, wrapping the left-hand side in parentheses, as in (theSubset), is already enough for data.table to treat it as a vector of column names, and recent data.table versions deprecate combining with = FALSE with :=.
Have fun