Change a column value based on nchar - r

I have the following issue: Consider a dataframe of equal length with the columns title and subtitle. title is rather clean data and subtitle is rather messy (wrong values, NAs, ...) However, when subtitle is filled in correctly, it contains a lot more information than my title variable.
I would like to replace the values in my title column where the nchar of a certain subtitle observation exceeds the nchar of the title observation.
at the moment my failed code looks like this:
#filter from real table
baseTable_sentiment <- filter(baseTable, theme_ecoFin == 1)
#with this code I try to do what I explained, while coping with the NA's in subtitle
baseTable_sentiment$title <- baseTable_sentiment$subtitle[nchar(baseTable_sentiment$subtitle , allowNA = TRUE , keepNA = TRUE) > nchar(baseTable_sentiment$title) , ]
an alternative approach to cope with the NAs
#filter from real table
baseTable_sentiment <- filter(baseTable, theme_ecoFin == 1)
#change NA to text value "na"
baseTable_sentiment$subtitle <- replace(baseTable_sentiment$subtitle,which(is.na(baseTable_sentiment$subtitle)),"na")
#same code as before
baseTable_sentiment$title <- baseTable_sentiment$subtitle[nchar(baseTable_sentiment$subtitle ) > nchar(baseTable_sentiment$title) , ]
Now when I run one of the two examples: I get the following error:
Error in
baseTable_sentiment$subtitle[(nchar(baseTable_sentiment$subtitle, :
incorrect number of dimensions
However: when i check all used dimensions
> > > length(baseTable_sentiment$subtitle) [1] 170206
> > > length(baseTable_sentiment$title) [1] 170206
> > > length(nchar(baseTable_sentiment$subtitle , allowNA = TRUE ) > nchar(baseTable_sentiment$title)) [1] 170206
How can I fix this, or do you guys have an alternative way to do this operation?
The following link contains a data example
Thank you in advance
Olivier

I found the error: the dimension of my title does not correspond anymore with the new dimension created by the condition after the assignment.
baseTable_sentiment$title[nchar(baseTable_sentiment$subtitle , allowNA = TRUE , keepNA = FALSE) > nchar(baseTable_sentiment$title) ]<- baseTable_sentiment$subtitle[nchar(baseTable_sentiment$subtitle , allowNA = TRUE , keepNA = FALSE) > nchar(baseTable_sentiment$title) ]
The dimensions are now matched and the code runs perfectly

Related

Problem with for loop when downloading species occurrence data

I want to download the occurrence data from gbif website and I use the following R script. When I run the script, I got an error with the following message "Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply differing number of rows: 1, 0)". It would be highly appreciated if anyone could help me with this.
My data: data
My R script:
flist<-read_excel("Mekong fish.xlsx",sheet="Sheet1")
##Loop
fname<-list()
Occ<-list()
datfish<-list()
name_list<-unique(flist$Updated_name)
# create for loop to produce ggplot2 graphs
for (i in seq_along(name_list)) {
# create plot for each Occurrence in df
Occ[[i]] <-occ_search(scientificName = name_list[i], limit=2)
fname[[i]]<-occ_search(scientificName = name_list[i],
fields = c("species", "country","decimalLatitude", "decimalLongitude"),
hasCoordinate=T, limit= Occ[[i]]$meta[4],return ="data")
datfish[[i]]<-as.data.frame(fname[[i]]$data)
}
I got a different error:
Expecting logical in D1424 / R1424C4: got 'in Lao'Expecting logical in D1426 / R1426C4: got 'in China'Expecting logical in D1467 / R1467C4: got 'only Cambodia'Expecting logical in D1469 / R1469C4: got 'only in VN'Expecting logical in D1473 / R1473C4: got 'only in China'Expecting logical in D1486 / R1486C4: got 'only in Malaysia'Expecting logical in D1488 / R1488C4: got 'only 1 point in VN'
I think the problem is caused in some fields in the 4th column. I don't have the right packages installed to run your code. But I got a different error (package missing) once i dropped the fourth column.
flist<-read_excel("~/Downloads/Mekong fish.xlsx",sheet="Sheet1")
flist <=subset(flist, select = -4)
...
EDIT:
This worked for me. read_excel assigned column 4 the type boolean. When I explicitly set it to text it worked.
library(readxl)
library(rgbif)
library(raster)
flist<-read_excel("~/Downloads/Mekong fish.xlsx",
sheet="Sheet1",
col_types = c("numeric", "text", "numeric", "text"))
flist
##Loop
fname<-list()
Occ<-list()
datfish<-list()
name_list<-unique(flist$Updated_name)
# create for loop to produce ggplot2 graphs
for (i in seq_along(name_list[1:2])) {
message(i)
# # create plot for each Occurrence in df
Occ[[i]] <-occ_search(scientificName = name_list[i], limit=2)
message(Occ[[i]])
fname[[i]]<-occ_search(scientificName = name_list[i],
fields = c("species", "country","decimalLatitude", "decimalLongitude"),
hasCoordinate=T, limit= Occ[[i]]$meta[4],return ="data")
message(fname[[i]])
datfish[[i]]<-as.data.frame(fname[[i]]$data)
message(datfish[[i]])
}
> 1
> list(offset = 0, limit = 2, endOfRecords = FALSE, count = >15)list(list(name = c("Animalia", "Chordata", "Actinopterygii",
> "Cypriniformes", "Cyprinidae", "Aaptosyax", "Aaptosyax grypus"), key = > > c("1", "44", "204", "1153", "7336", "2363805", "2363806"),
> etc...

how to interpolate data within groups in R using seqtime?

I am trying to use seqtime (https://github.com/hallucigenia-sparsa/seqtime) to analyze time-serie microbiome data, as follow:
meta = data.table::data.table(day=rep(c(15:27),each=3), condition =c("a","b","c"))
meta<- meta[order(meta$day, meta$condition),]
meta.ts<-as.data.frame(t(meta))
otu=matrix(1:390, ncol = 39)
oturar<-rarefyFilter(otu, min=0)
rarotu<-oturar$rar
time<-meta.ts[1,]
interp.otu<-interpolate(rarotu, time.vector = time,
method = "stineman", groups = meta$condition)
the interpolation returns the following error:
[1] "Processing group a"
[1] "Number of members 13"
intervals
0
12
[1] "Selected interval: 1"
[1] "Length of time series: 13"
[1] "Length of time series after interpolation: 1"
Error in stinepack::stinterp(time.vector, as.numeric(x[i, ]), xout = xout, :
The values of x must strictly increasing
I tried to change method to "hyman", but it returns the error below:
Error in interpolateSub(x = x, time.vector = time.vector, method = method) :
Time points must be provided in chronological order.
I am using R version 3.6.1 and I am a bit new to R.
Please can anyone tell me what I am doing wrong/ how to go around these errors?
Many thanks!
I used quite some time stumbling around trying to figure this out. It all comes down to the data structure of meta and the resulting time variable used as input for the time.vector parameter.
When meta.ts is being converted to a data frame, all strings are automatically converted to factors - this includes day.
To adjust, you can edit your code to the following:
library(seqtime)
meta <- data.table::data.table(day=rep(c(15:27),each=3), condition =c("a","b","c"))
meta <- meta[order(meta$day, meta$condition),]
meta.ts <- as.data.frame(t(meta), stringsAsFactors = FALSE) # Set stringsAsFactors = FALSE
otu <- matrix(1:390, ncol = 39)
oturar <- rarefyFilter(otu, min=0)
rarotu <- oturar$rar
time <- as.integer(meta.ts[1,]) # Now 'day' is character, so convert to integer
interp.otu <- interpolate(rarotu, time.vector = time,
method = "stineman", groups = meta$condition)
As a bonus, read this blogpost for information on the stringsAsFactors parameter. Strings automatically being converted to Factors is a common bewilderment.

Pass arguments to the R Phyloseq subset_taxa wrapper

I'll explain the end goal, and what I'm trying as a test first. (Because I'm likely going about it the wrong way.)
I am using the phyloseq package to visualize microbiome data. I want to "automate" it to an extent by having users choose levels of analysis and have my script generate the visualizations without someone hand typing in each combination.
The issue is passing variables into the subset function. I get these errors primarily (depending on what combinations of paste0, eval, parse, as.logical, expression, noquote....etc that i've tried):
Error in subset.data.frame(oldDF, ...) : 'subset' must be logical
Error in dimnames(x) <- dn :
length of 'dimnames' [1] not equal to array extent
A user would set the levels of analysis. So lets say for now there are two levels, and selecting the second level automatically means you want the first level as well. (I haven't worked on that part yet, but I wanted to explain it upfront.
#Set lineage level
lin_level <- 1
lin_list <- c("k__Kingdom", "p__Phylum","c__Class", "o__Order","f__Family")
lin_select <- lin_list[lin_level]
sub_lin <- lin_list[(lin_level +1)]
#Kingdom
king_list <- "k__Bacteria"
#set Phylum list
if (lin_select == "p__Phylum"){
phylum_list <- c("p__Firmicutes","p__Proteobacteria","p__Bacteroidetes","p__Actinobacteria","p__Tenericutes")
}
subgroup <- "All"
From here, the script would ultimately get to the graphing section. If lin_level is set to 1, it would look like this:
FIXED
gphic = subset_taxa(physeq1, Kingdom=="k__Bacteria")
title = paste0(subgroup," ", "Bacteria-only")
plot_bar(gpsfb, "Phylum", "Abundance", "Phylum",
title=title, facet_grid="Type~.")
AUTOMATED
gphic = subset_taxa(physeq1, (substring(lin_select,4)) == king_list)
title = paste0(subgroup," ", (substring(king_list,4)),"-only")
plot_bar(gpsfb, (substring(sub_lin,4)), "Abundance", (substring(sub_lin,4)),
title=title, facet_grid="Type~.")
But, trying to pass (substring(lin_select,4)) == king_list as an argument results in errors.
I've searched through the various threads on this issue, but haven't been able to get the different answers to work. Ultimately I need to run the graphing section once for Kingdom, and then again each time for each item in the Phylum list. But before i can get there, I need to be able to pass the arguments into the subset function.
Things I've tried:
test <- paste0(substring(lin_select,4),"==","\"","p__Bacteroidetes","\"")
noquote(test)
[1] Phylum=="p__Bacteroidetes"
gphic = subset_taxa(physeq1, noquote(test))
Error in subset.data.frame(oldDF, ...) : 'subset' must be logical
gphic = subset_taxa(physeq1, paste0(substring(lin_select,4),"==","\"","p__Bacteroidetes","\""))
Error in subset.data.frame(oldDF, ...) : 'subset' must be logical
gphic = subset_taxa(physeq1, as.logical(test))
Error in dimnames(x) <- dn :
length of 'dimnames' [1] not equal to array extent
as.logical(noquote(test))
[1] NA
gphic = subset_taxa(physeq1, as.logical(noquote(test)))
Error in dimnames(x) <- dn :
length of 'dimnames' [1] not equal to array extent
noquote(test)
[1] Phylum=="p__Bacteroidetes"
as.logical(noquote(test))
[1] NA
as.logical(as.character(noquote(test)))
[1] NA
test2 <- eval(parse(text= test))
Error in eval(parse(text = test)) : object 'Phylum' not found
test2 <- eval(test)
gphic = subset_taxa(physeq1, as.logical(test2))
Error in dimnames(x) <- dn :
length of 'dimnames' [1] not equal to array extent
as.logical(test2)
[1] NA
And a lot of other permutations trying to sub in different things, but you get the idea.
gphic = subset_taxa(physeq1, eval(as.name(level_tax)) == king_list)
Here , level_tax is the variable in a loop. Say level_tax = "Order", then we convert the string "Order" into variable name by as.name(level_tax) or as.symbol(level_tax). Then we use eval(), which takes an expression and evaluates in the specified environment

replacing a value in column X based on columns Y with R

i've gone through several answers and tried the following but each either yields an error or an un-wanted result:
here's the data:
Network Campaign
Moburst_Chartboost Test Campaign
Moburst_Chartboost Test Campaign
Moburst_Appnext unknown
Moburst_Appnext 1065
i'd like to replace "Test Campaign" with "1055" whenever "Network" == "Moburst_Chartboost". i realize this should be very simple but trying out these:
dataset = read.csv('C:/Users/User/Downloads/example.csv')
for( i in 1:nrow(dataset)){
if(dataset$Network == 'Moburst_Chartboost') dataset$Campaign <- '1055'
}
this yields an error: Warning messages:
1: In if (dataset$Network == "Moburst_Chartboost") dataset$Campaign <- "1055" :
the condition has length > 1 and only the first element will be used
2: In if (dataset$Network == "Moburst_Chartboost") dataset$Campaign <- "1055" :
the condition has length > 1 and only the first element will be used
etc.
then i tried:
within(dataset, {
dataset$Campaign <- ifelse(dataset$Network == 'Moburst_Chartboost', '1055', dataset$Campaign)
})
this turned ALL 4 values in row "Campaign" into "1055" over running what was there even when condition isn't met
also this:
dataset$Campaign[which(dataset$Network == 'Moburst_Chartboost')] <- 1055
yields this error, and replaced the values in the two first rows of "Campaign" with NA:
Warning message:
In `[<-.factor`(`*tmp*`, which(dataset$Network == "Moburst_Chartboost"), :
invalid factor level, NA generated
scratching my head here. new to R but this shouldn't be so hard :(
In your first attempt, you're trying to iterate over all the columns, when you only want to change the 2nd column.
In your second, you're trying to assign the value "1055" to all of the 2nd column.
The way to think about it is as an if else, where if the condition in col 1 is met, col 2 is changed, otherwise it remains the same.
dataset <- data.frame(Network = c("Moburst_Chartboost", "Moburst_Chartboost",
"Moburst_Appnext", "Moburst_Appnext"),
Campaign = c("Test Campaign", "Test Campaign",
"unknown", "1065"))
dataset$Campaign <- ifelse(dataset$Network == "Moburst_Chartboost",
"1055",
dataset$Campaign)
head(dataset)
Network Campaign
1 Moburst_Chartboost 1055
2 Moburst_Chartboost 1055
3 Moburst_Appnext unknown
4 Moburst_Appnext 1065
You may also try dataset$Campaign[dataset$Campaign=="Test Campaign"]<-1055 to avoid the use of loops and ifelse statements.
Where dataset
dataset <- data.frame(Network = c("Moburst_Chartboost", "Moburst_Chartboost",
"Moburst_Appnext", "Moburst_Appnext"),
Campaign = c("Test Campaign", "Test Campaign",
"unknown", 1065))
Try the following
dataset = read.csv('C:/Users/User/Downloads/example.csv', stringsAsFactors = F)
for( i in 1:nrow(dataset)){
if(dataset$Network[i] == 'Moburst_Chartboost') dataset$Campaign[i] <- '1055'
}
It seems your forgot the index variable. Without [i] you work on the whole vector of the data frame, resulting in the error/warning you mentioned.
Note that I added stringsAsFactors = F to the read.csv() function to make sure the strings are indeed interpreted as strings and not factors. Using factors this would result in an error like this
In `[<-.factor`(`*tmp*`, i, value = c(NA, 2L, 3L, 1L)) :
invalid factor level, NA generated
Alternatively you can do the following without using a for loop:
idx <- which(dataset$Network == 'Moburst_Chartboost')
dataset$Campaign[idx] <- '1055'
Here, idx is a vector containing the positions where Network has the value 'Moburst_Chartboost'
thank you for the help! not elegant, but since this lingered with me when going to sleep last night i decided to try to bludgeon this with some ugly code but it worked too - just as a workaround...separated to two data frames, replaced all values and then binded back...
# subsetting only chartboost
chartboost <- subset(dataset, dataset$Network=='Moburst_Chartboost')
# replace all values in Campaign
chartboost$Campaign <-sub("^.*", "1055",chartboost$Campaign)
#subsetting only "not chartboost"
notChartboost <-subset(dataset, dataset$Network!='Moburst_Chartboost')
# binding back to single dataframe
newSet <- rbind(chartboost, notChartboost)
Ugly as a duckling but worked :)

Error in as(x, class(k)) : no method or default for coercing “NULL” to “data.frame”

I am currently facing an error mentioned below which is related to NULL values being coerced to a data frame. The data set does contain nulls, however I have tried both is.na() and is.null() functions to replace the null values with something else. The data is stored on hdfs and is stored in a pig.hive format. I have also attached the code below. The code works fine if I remove v[,25] from the key.
Code:
AM = c("AN");
UK = c("PP");
sample.map <- function(k,v){
key <- data.frame(acc = v[!which(is.na(v[,1],1],
year = substr(v[!which(is.na(v[,1]),2],1,4),
month = substr(v[!which(is.na(v[,1]),2],5,6))
value <- data.frame(v[,3],count=1)
keyval(key,value)
}
sample.reduce <- function(key,v){
AT <- sum(v[which(v[,1] %in% AM=="TRUE"),2])
UnknownT <- sum(v[which(v[,1] %in% UK=="TRUE"),2])
Total <- AT + UnknownT
d <- data.frame(AT,UnknownT,Total)
keyval(key,d)
}
out <- mapreduce(input ="/user/hduser/input",
output = "/user/hduser/output",
input.format = make.input.format("pig.hive", sep = "\u0001")
output.format = make.output.format("csv", sep = ","),
map= sample.map)
reduce = sample.reduce)
Error:
Warning in asMethod(object) : NAs introduced by coercion
Warning in split.default(1:rmr.length(y), unique(ind), drop = TRUE) : data length is not a multiple of split variable
Warning in rmr.split(x, x, FALSE, keep.rownames = FALSE) : number of items to replace is not a multiple of replacement length Warning in split.default(1:rmr.length(y), unique(ind), drop = TRUE) :
data length is not a multiple of split variable
Warning in rmr.split(v, ind, lossy = lossy, keep.rownames = TRUE) : number of items to replace is not a multiple of replacement length
Error in as(x, class(k)) :
no method or default for coercing “NULL” to “data.frame”
Calls: <Anonymous> ... apply.reduce -> c.keyval -> reduce.keyval -> lapply -> FUN -> as No traceback available
UPDATE
I have added the sample data and edited the code above. Hope this helps!
Sample Data:
NULL,"2014-03-14","PP"
345689202,"2014-03-14","AN"
234539390,"2014-03-14","PP"
123125444,"2014-03-14","AN"
NULL,"2014-03-14","AN"
901828393,"2014-03-14","AN"
There are some issues with as which have been identified recently. I don't see why as can't handle this by default, but you can modify coerce which handles the conversion with an S4 method to call as.data.frame.
setMethod("coerce",c("NULL","data.frame"), function(from, to, strict=TRUE) as.data.frame(from))
[1] "coerce"
as(NULL,"data.frame")
data frame with 0 columns and 0 rows

Resources