I'm trying to subset a dataframe with character conditions from a vector!
This works:
temp <- USA[USA$RegionName == "Virginia",]
Now, for a loop, I created a character vector containing all states by name, so I could filter through them:
> states
[1] "virginia" "Alaska" "Alabama" (...)
But if I now try to supply the "RegionName" condition via that vector, it does not work anymore:
temp <- USA[USA$RegionName == states[1],]
What I tried so far:
paste(states[1])
as.factor(states[1])
as.character(states[1])
To recreate the relevant data frame:
string <- read.csv("https://raw.githubusercontent.com/OxCGRT/covid-policy-tracker/master/data/OxCGRT_latest.csv")
USA <- string[string$CountryCode=="USA",]
USA <- USA[USA$Jurisdiction=="STATE_TOTAL",]
states <- unique(USA$RegionName)
(In my vector Virginia is just on top for convenience!)
Based on the reproducible example, the first element of 'states' is empty
states[1]
[1] ""
We need to remove the blank elements
states <- states[nzchar(states)]
and then execute the code
dim(USA[USA$RegionName == states[1],])
[1] 569 51
I have not tried akrun's option, as the classic import function is very slow. When using a more modern approach, I noticed that the automatic type guessing makes RegionName a logical, and thus every value gets converted to NA. Therefore, here is my approach to your problem:
# readr::read_csv() is much faster; read all columns as character, because the automatic type guessing makes RegionName a logical, returning all NA
string <- readr::read_csv("https://raw.githubusercontent.com/OxCGRT/covid-policy-tracker/master/data/OxCGRT_latest.csv", col_types = readr::cols(.default = "c"))
USA <- string[string$CountryCode=="USA",]
USA <- USA[USA$Jurisdiction=="STATE_TOTAL",]
states <- unique(USA$RegionName)
temp <- USA[USA$RegionName == states[1],]
You will have to convert the columns according to your needs, or specify exactly which column should have which data type when importing.
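For example, here is a hedged sketch of both routes; the ConfirmedCases column name and the YYYYMMDD date format are assumptions based on the OxCGRT file layout, so adjust them to the columns you actually use:
# Option 1: convert selected columns after the all-character import
# (ConfirmedCases / Date names are assumptions -- check names(USA) first)
USA$ConfirmedCases <- as.numeric(USA$ConfirmedCases)
USA$Date <- as.Date(USA$Date, format = "%Y%m%d")

# Option 2: let readr guess most types, but force RegionName to character
string <- readr::read_csv(
  "https://raw.githubusercontent.com/OxCGRT/covid-policy-tracker/master/data/OxCGRT_latest.csv",
  col_types = readr::cols(RegionName = readr::col_character())
)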
I must admit this was a poorly asked question. I actually wanted to subset the data frame based on two conditions: a date condition and a state condition.
The date condition worked fine. The state condition did not work, either in combination or alone, so I assumed the error was there.
In fact, I had done a very odd transformation of the date from the original source. After I implemented a much more reliable transformation, the state condition also worked fine with the code shown in the question.
The error apparently lay with a badly implemented date transformation! Sorry for the rookie mistake; my next question will be more sophisticated.
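For completeness, a minimal sketch of the combined subset that ended up working; it assumes the OxCGRT Date column is stored as YYYYMMDD, and the cutoff date is just a placeholder:
# parse the YYYYMMDD integer column into Date, then combine both conditions
USA$Date <- as.Date(as.character(USA$Date), format = "%Y%m%d")
temp <- USA[USA$RegionName == states[1] & USA$Date >= as.Date("2020-03-01"), ]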
Related
I work with the package flowCore found in Bioconductor, which reads my data files in an S4 class format. You can type library(flowCore) and then data(GvHD) which loads an example dataset. When you type GvHD you can see that this dataset is made out of 35 experiments, which can be accessed individually by typing for example GvHD[[1]].
Now I am trying to delete two columns FSC-H and SSC-H from all the experiments, but I have been unsuccessful.
I have tried myDataSet<- within(GvHD, rm("FSC-H","SSC-H")) but it doesn't work. I would greatly appreciate any help.
rm isn't meant for removing columns. The normal procedure is to assign NULL to that column:
for (i in 1:35) {
  GvHD[[i]][, c("FSC-H", "SSC-H")] <- NULL
}
This is the same as you would do for a data frame.
I posted my question on the relevant GitHub page for flowCore and the answer was provided by Jacob Wagner.
GvHD[[1]] is a flowFrame, not a simple data frame, which is why the NULL assignment doesn't work. The underlying representation is a matrix, which likewise doesn't support dropping a column by assigning NULL.
If you want to drop columns, here are some ways you could do that. Note for all of these I'm subsetting columns for the whole flowSet rather than looping through each flowFrame. But you could perform these operations on each flowFrame as well.
As Greg mentioned, choose the columns you want to keep:
data(GvHD)
all_cols <- colnames(GvHD)
keep_cols <- all_cols[!(all_cols %in% c("FSC-H", "SSC-H"))]
GvHD[,keep_cols]
Or you could just filter in the subset:
GvHD[,!colnames(GvHD) %in% c("FSC-H", "SSC-H")]
You could also grab the numerical indices you want to drop and then use negative subsetting.
# drop_idx <- c(1,2)
drop_idx <- which(colnames(GvHD) %in% c("FSC-H", "SSC-H"))
GvHD[,-drop_idx]
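And for the per-flowFrame route mentioned above, a sketch using flowCore's fsApply(), which applies a function to every frame and reassembles a flowSet:
library(flowCore)
data(GvHD)
# keep every channel except the two we want to drop
keep_cols <- colnames(GvHD)[!colnames(GvHD) %in% c("FSC-H", "SSC-H")]
# subset each flowFrame by column name; fsApply returns a new flowSet
GvHD_trimmed <- fsApply(GvHD, function(ff) ff[, keep_cols])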
I have a lot of data to work on, and to make things more efficient I would like to come up with code that will allow me to assign a regional code to each article based on the country of origin of its author.
In other words, I have the following:
country$author_country
MEX
COL
TUN
GBR
USA
BRA
etc.
I have created a column 'author_region' filled with NAs. I want to assign a region code to every one of the author_country values.
Instead of doing it by hand, for instance something like if(country$author_country == MEX){country$author_region == 1},
I was hoping there is a way to create an object that would allow me to list all the countries from a region, and then assign a value to my author_region column based on whether or not author_country matches the content of this object.
I thought about doing it like this:
LatAm <- list('COL', 'MEX', 'BRA')
for (i in country$author_country) if (country$author_country == LatAm)
{country$author_region[i] <- 1}
I know this looks wrong and it obviously does not work, but I couldn't find a solution to this issue.
Could you help me please?
Thank you!!
A WORKAROUND:
There is a workaround:
country$author_region = unclass(as.factor(country$author_country))
This solution assumes you want a one-line workaround and don't care which country gets which code number. The operation above does the following:
Fills author_region with exactly the values of author_country.
Converts author_region into a factor.
Unclasses the factor, which turns it into the underlying integer codes (one code per distinct country, starting at 1, following the alphabetical order of the levels).
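A quick illustration with the countries listed in the question (the codes simply follow the alphabetical order of the factor levels):
author_country <- c("MEX", "COL", "TUN", "GBR", "USA", "BRA")
unclass(as.factor(author_country))
# levels sort to BRA, COL, GBR, MEX, TUN, USA, so the codes come out as
# 4 2 5 3 6 1 (MEX = 4, COL = 2, TUN = 5, GBR = 3, USA = 6, BRA = 1)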
IF A DATAFRAME THAT TELLS US THE CODE OF EACH COUNTRY IS AVAILABLE:
Let's say you have a data frame country_codes with a column author_country specifying the country and a column author_region specifying the code you intend to use; then you can use a join:
library(tidyverse)
country %>%
  left_join(country_codes, by = "author_country")
This is the better solution, since you can assign specific codes to specific countries as you wish.
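A hedged sketch of that approach with a made-up lookup table (the region codes 1/2/3 are placeholders); note that the NA-filled author_region column from the question should be dropped first, so the join can add its own:
library(dplyr)

# hypothetical lookup table; extend it with the remaining countries and regions
country_codes <- data.frame(
  author_country = c("COL", "MEX", "BRA", "TUN", "GBR", "USA"),
  author_region  = c(1, 1, 1, 2, 3, 3)
)

country <- country %>%
  select(-author_region) %>%             # drop the NA placeholder column
  left_join(country_codes, by = "author_country")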
I would like to subset my shapefile without specifying the name of the first column in the .dbf file.
To be more precise, I would like to select all the rows with value 1 in the first column of the .dbf, but I don't want to specify the name of this column.
For example, this script works because I specify the name of the column (as columnName):
library(rgdal) # readOGR
shapeIn <- readOGR(nomeFile)
shapeOut <- subset(shapeIn, columnName == 1)
Instead, this does not work:
shapeOut <- (shapeIn[,1] == 1)
and I get an error message:
comparison (1) is possible only for atomic and list types
shapeOut and shapeIn are ESRI vector files.
This is the header of my shapeIn
coordinates mask_1000_
1 (54000, 1218000) 0
2 (55000, 1218000) 0
3 (56000, 1218000) 0
Can you help me? Thank you
This
shapeOut <- (shapeIn[,1] == 1)
doesn't work because SpatialPolygonsDataFrames contain other information besides the data, so "common" data.frame subsetting doesn't work in the same way. To make it work, you must apply the "logical check" for subsetting to the @data slot. This should work (either using subset or "direct" indexing):
shapeOut <- subset(shapeIn, shapeIn@data[,1] == 1)
OR
shapeOut <- shapeIn[shapeIn@data[,1] == 1,]
(However, from recent experience, referencing data by column number is seldom a good idea... ;-) )
ciao Giacomo !!!
I have the following block of code. I am a complete beginner in R (a few days old), so I am not sure how much of the code I need to share to convey my problem. So here is all of what I have written:
mdata <- read.csv("outcome-of-care-measures.csv",colClasses = "character")
allstate <- unique(mdata$State)
allstate <- allstate[order(allstate)]
spldata <- split(mdata,mdata$State)
if (num=="best") num <- 1
ranklist <- data.frame("hospital" = character(),"state" = character())
for (i in seq_len(length(allstate))) {
  if (outcome=="heart attack"){
    pdata <- spldata[[i]]
    pdata[,11] <- as.numeric(pdata[,11])
    bestof <- pdata[!is.na(as.numeric(pdata[,11])),][]
    inorder <- order(bestof[,11],bestof[,2])
    if (num=="worst") num <- nrow(bestof)
    hospital <- bestof[inorder[num],2]
    state <- allstate[i]
    ranklist <- rbind(ranklist,c(hospital,state))
  }
}
allstate is a character vector of states.
outcome can have values similar to "heart attack"
num will be numeric or "best" or "worst"
I want to create a data frame ranklist which will have hospital names and the state names which follow a certain criterion.
However I keep getting the error
invalid factor level, NA generated
I know it has something to do with rbind but I cannot figure out what it is. I have tried googling this and troubleshooting with other similar questions on this site. I have checked that none of the vectors I am trying to bind are factors. I also tried forcing the coercion by wrapping hospital and state in as.character() during assignment, but that didn't work.
I would be grateful for any help.
Thanks in advance!
Since this is apparently from a Coursera assignment, I am not going to give you a solution, but I will hint at it: have a look at the help pages for read.csv and data.frame. Both have the argument stringsAsFactors. What is the default, TRUE or FALSE? Do you want to keep the default setting? Is colClasses = "character" in line 1 necessary? Use the str function to check what the classes of the columns in mdata and ranklist are. read.csv additionally has an na.strings argument. If you use it correctly, the "NAs introduced by coercion" warning will also disappear and line 16 won't be necessary.
Finally, don't grow a matrix or data frame inside a loop if you know the final size beforehand. Initialize it with the correct dimensions (here 52 x 2) and assign e.g. the i-th hospital to the i-th row and first column of the data frame. That way rbind is not necessary.
By the way, you did not get an error but a warning: R didn't interrupt the loop, it just let you know that some values had been coerced to NA. You can also simplify the seq_len(length(allstate)) call by using seq_along(allstate) instead.
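Not a solution to the assignment, just a hedged sketch of the pre-allocation pattern hinted at above, with placeholder data standing in for allstate and for the hospital lookup:
allstate <- c("AL", "AK", "AZ")                          # placeholder states
ranklist <- data.frame(hospital = character(length(allstate)),
                       state    = character(length(allstate)),
                       stringsAsFactors = FALSE)
for (i in seq_along(allstate)) {
  # fill row i directly instead of growing the data frame with rbind
  ranklist$hospital[i] <- paste("best hospital in", allstate[i])   # placeholder
  ranklist$state[i]    <- allstate[i]
}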
(edited to reflect help...I'm not doing great with formatting, but appreciate the feedback)
I'm a bit stuck on what I suspect is an easy enough problem. I have multiple different data sets loaded into R, all of which have different numbers of observations, but all of which have three variables named "A1", "A2", and "A3". I want to create a new variable in each of the three data frames that contains the value held in "A1" if A3 contains a value greater than zero, and the value held in "A2" if A3 contains a value less than zero. Seems simple enough, right?
My attempt at this code uses this faux-data:
set.seed(1)
A1=seq(1,100,length=100)
A2=seq(-100,-1,length=100)
A3=runif(100,-1,1)
df1=cbind(A1,A2,A3)
A3=runif(100,-1,1)
df2=cbind(A1,A2,A3)
I'm about a thousand percent sure that R has some functionality for creating the same named variable in multiple data frames, but I have tried doing this with lapply:
mylist=list(df1,df2)
lapply(mylist,function(x){
  x$newVar=x$A1
  x$newVar[x$A3>0]=x$A2[x$A3>0]
  return(x)
})
But the newVar is not available for me once I leave the lapply loop. For example, if I ask for the mean of the new variable:
mean(df1$newVar)
[1] NA
Warning message:
In mean.default(df1$newVar) :
argument is not numeric or logical: returning NA
Any help would be appreciated.
Thank you.
Well first of all, df1 and df2 are not data.frames but matrices (the dollar syntax doesn't work on matrices).
In fact, if you do:
set.seed(1)
A1=seq(1,100,length=100)
A2=seq(-100,-1,length=100)
A3=runif(100,-1,1)
df1=as.data.frame(cbind(A1,A2,A3))
A3=runif(100,-1,1)
df2=as.data.frame(cbind(A1,A2,A3))
mylist=list(df1,df2)
lapply(mylist,function(x){
  x$newVar=x$A1
  x$newVar[x$A3>0]=x$A2
})
the code almost works but gives some warnings. In fact, there's still an error in the last line of the function called by lapply. If you change it like this, it works as expected:
lapply(mylist,function(x){
  x$newVar=x$A1
  x$newVar[x$A3>0]=x$A2[x$A3>0] # you need to subset x$A2, otherwise it's too long
  return(x) # better to state explicitly what the return value is
})
EDIT (as per comment):
As almost always happens in R, functions do not mutate existing objects but return brand new objects.
So, in this case, df1 and df2 are still the same, but lapply returns a list with the two expected new data.frames, i.e.:
resultList <- lapply(mylist,function(x){
  x$newVar=x$A1
  x$newVar[x$A3>0]=x$A2[x$A3>0]
  return(x)
})
newDf1 <- resultList[[1]]
newDf2 <- resultList[[2]]
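A short follow-up: the new column only persists in whatever you assign the lapply result to, so either work with resultList/newDf1, or assign the result straight back over the original list, reusing the corrected function from above:
mylist <- lapply(mylist, function(x) {
  x$newVar <- x$A1
  x$newVar[x$A3 > 0] <- x$A2[x$A3 > 0]
  x
})
mean(mylist[[1]]$newVar)   # now returns a number instead of NA with a warning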