Assign value in R based on text content of list - r

I have a lot of data to work on, and to make things more efficient I would like to come up with a code that will allow me to assign a regional code to an article per the country of origin of its author.
In other words, I have the following:
country$author_country
MEX
COL
TUN
GBR
USA
BRA
etc.
I have created a column 'author_region' filled with NAs. I want to assign a region code to every one of the author_country values.
Instead of doing it by hand, for instance something like if(country$author_country == MEX){country$author_region == 1},
I was hoping there is a way to create an object that would allow me to list all the countries from a region, and then assign a value to my author_region column based on whether or not author_country matches the content of this object.
I thought about doing it like this:
LatAm <- list('COL', 'MEX', 'BRA')
for (i in country$author_country) if (country$author_country == LatAm)
{country$author_region[i] <- 1}
I know this looks wrong and it obviously does not work, but I couldn't find a solution to this issue.
Could you help me please?
Thank you!!

A WORKAROUND:
There is a workaround:
country$author_region = as.integer(as.factor(country$author_country))
This solution assumes you want a one-line workaround and don't care which country gets which code number. Note that it gives every country its own code rather than grouping countries into regions. Basically the operation above is doing:
Converting author_country into a factor.
Extracting the integer codes that encode the factor, one code per level.
Storing those codes in author_region. Factor codes start at 1 and follow the sorted order of the levels, so no offset needs to be added.
IF A DATAFRAME THAT TELLS US THE CODE OF EACH COUNTRY IS AVAILABLE:
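Let's say you have a data frame country_codes with a column author_country specifying the country and a column author_region specifying the code you intend to use. Then you can use a join:
library(tidyverse)
country <- country %>%
select(-author_region) %>% # drop the NA placeholder so the join does not create author_region.x / author_region.y
left_join(country_codes, by = "author_country")
This is the better solution, since you can assign specific codes to specific countries as you wish.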
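If no lookup table exists yet, the idea sketched in the question (one object per region, then match against it) can be written with %in%. This is a minimal sketch on toy data; the Europe vector and the code 2 are assumptions added for illustration, not part of the question:

```r
# Toy data standing in for the real `country` data frame (assumption)
country <- data.frame(
  author_country = c("MEX", "COL", "TUN", "GBR", "USA", "BRA"),
  stringsAsFactors = FALSE
)
country$author_region <- NA_integer_

# One character vector per region; membership and codes are illustrative
LatAm  <- c("COL", "MEX", "BRA")
Europe <- c("GBR")

# %in% is vectorised: it yields TRUE/FALSE for every row at once, so no loop is needed
country$author_region[country$author_country %in% LatAm]  <- 1L
country$author_region[country$author_country %in% Europe] <- 2L
```

Countries that belong to no listed region keep their NA, which makes gaps in the region lists easy to spot.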

Related

Why after I use "subset", the filtered data is less than it should be?

I want to have "Blancas" and "Sultana" under the "Variete" column.
Figure 1 is the original data,
figure 2 is the expected result,
figure 3 is result I obtained with the code below:
df <- read_excel("R_NLE_FTSW.xlsx")
options(scipen=200)
BLANCAS<-subset(df, Variete==c("Blancas","Sultana"))
view(BLANCAS)
It's obvious that some data of BLANCAS are missing.
P.S. And if try it in a sub-sheet, the final result sometimes will be 5 times more!
path = "R_NLE_FTSW.xlsx"
df <- map_dfr(excel_sheets(path),
~ read_xlsx(path, sheet = 4))
I don't understand why sometimes it's more and sometimes less than the expected result. Can anyone help me? Thank you so much!
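First of all, you say you need both "Blancas" and "Sultana", but your expected result shows only Blancas, so get that straight first.
The rows go missing because Variete == c("Blancas","Sultana") does not test membership: the two-element vector is recycled down the column, so odd rows are compared only with "Blancas" and even rows only with "Sultana".
For data like this coming from Excel:
Always clean the data after it is imported. Check the unique values to find out whether there are any extra spaces etc.
Trim the character data, and ensure date fields are correct and numbers are numeric (not character).
Now, to subset the data, use df %>% filter(Variete %in% c('Blancas','Sultana'))
-> you can modify the c() vector to include the items of interest.
-> if you wish to clean on the go:
df %>% filter(trimws(Variete) %in% c('Blancas','Sultana'))
As for your sub-sheet problem: we don't know what data is there. One thing is visible in the code, though: sheet = 4 is hard-coded, so map_dfr reads the same sheet once per sheet name and stacks the copies, which would explain a result several times too large. Using sheet = .x reads each sheet once. If the data is similar, apply the same logic as above.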
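The difference between == and %in% can be seen on a small example. This is a sketch with made-up data; the column x and the value "Autre" are assumptions for illustration only:

```r
# Toy data (assumption): six rows, five of which are Blancas or Sultana
df <- data.frame(
  Variete = c("Blancas", "Blancas", "Sultana", "Sultana", "Autre", "Blancas"),
  x = 1:6,
  stringsAsFactors = FALSE
)

# == recycles the two-element vector: row 1 vs "Blancas", row 2 vs "Sultana", row 3 vs "Blancas", ...
wrong <- subset(df, Variete == c("Blancas", "Sultana"))

# %in% tests every row against the whole set
right <- subset(df, Variete %in% c("Blancas", "Sultana"))

nrow(wrong)  # only the rows that happen to line up with the recycled vector
nrow(right)  # every Blancas or Sultana row
```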

Character vectors for subset conditions

I'm trying to subset a dataframe with character conditions from a vector!
This works:
temp <- USA[USA$RegionName == "Virginia",]
Now for a loop I created a column-vector containing all States by name, so I could filter through them:
> states
[1] "virginia" "Alaska" "Alabama" (...)
But if I now try to pass the "RegionName" condition via the column-vector, it does not work anymore:
temp <- USA[USA$RegionName == states[1],]
What I tried so far:
paste(states[1])
as.factor(states[1])
as.character(states[1])
For recreation of the relevant dataframe:
string <- read.csv("https://raw.githubusercontent.com/OxCGRT/covid-policy-tracker/master/data/OxCGRT_latest.csv")
USA <- string[string$CountryCode=="USA",]
USA <- USA[USA$Jurisdiction=="STATE_TOTAL",]
states <- unique(USA$RegionName)
(In my vector Virginia is just on top for convenience!)
Based on the reproducible example, the first element of 'states' is empty
states[1]
[1] ""
We need to remove the blank elements
states <- states[nzchar(states)]
and then execute the code
dim(USA[USA$RegionName == states[1],])
[1] 569 51
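As a small illustration of the nzchar() step, here is a sketch on a made-up vector (the values are assumptions, not the real contents of states):

```r
# Hypothetical vector with blank entries, standing in for unique(USA$RegionName)
states <- c("", "Virginia", "Alaska", "")

# nzchar() is TRUE for every string with at least one character
nzchar(states)

# Keep only the non-empty names
kept <- states[nzchar(states)]
kept
```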
I have not tried akrun's option, as the classic import function is very slow. When using a more modern approach I noticed that the automatic type guessing makes RegionName a logical, and thus every value gets converted to NA. Therefore here is my approach to your problem:
# much faster way to read the csv; all columns are set to character because type guessing makes RegionName a logical, returning all NA
string <- readr::read_csv("https://raw.githubusercontent.com/OxCGRT/covid-policy-tracker/master/data/OxCGRT_latest.csv", col_types = readr::cols(.default = "c"))
USA <- string[string$CountryCode=="USA",]
USA <- USA[USA$Jurisdiction=="STATE_TOTAL",]
states <- unique(USA$RegionName)
temp <- USA[USA$RegionName == states[1],]
You will have to convert the columns according to your needs, or specify at import time which column should have which data type.
I must admit that this is a poorly asked question. Actually I wanted to subset the dataframe based on two conditions, a date condition and a state condition.
The date condition worked fine. The state condition did not work in combination or alone, so I assumed that the error was here.
In fact, I did a very bizarre transformation of the date from the original source. After I implemented another, much more reliable transformation, the state condition also worked fine, with the code in the question asked.
The error apparently lay in a badly implemented date transformation! Sorry for the rookie mistake; my next question will be more sophisticated.

How do I run a for loop over all columns of a data frame and return the result as a separate data frame or matrix

I am trying to obtain the number of cases for each variable in a df. There are 275 cases in the df but most columns have some missing data. I am trying to run a for loop to obtain the information as follows:
idef_id<-readxl::read_xlsx("IDEF.xlsx")
casenums <- for (i in names(idef_id)) {
nas<- sum(is.na(i))
275-nas
}
however the output for casenums is
> summary(casenums)
Length Class Mode
0 NULL NULL
Any help would be much appreciated!
A for loop isn't a function - it doesn't return anything, so x <- for(... doesn't ever make sense. You can do that with, e.g., sapply, like this
casenums <- sapply(idef_id, function(x) sum(!is.na(x)))
Or you can do it in a for loop, but you need to assign to a particular value inside the loop:
casenums = rep(NA, ncol(idef_id))
names(casenums) = names(idef_id)
for(i in names(idef_id)) {
casenums[i] = sum(!is.na(idef_id[[i]]))
}
You also had a problem that i is taking on column names, so sum(is.na(i)) is asking if the value of the column name is missing. You need to use idef_id[[i]] to access the actual column, not just the column name, as I show above.
You seem to want the answer to be the number of non-NA values, so I switched to sum(!is.na(...)) to count that directly, rather than hard-coding the number of rows of the data frame and doing subtraction.
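To see the sapply version run end to end, here is a minimal sketch on a toy data frame (the columns a and b are assumptions standing in for the Excel data), including one way to turn the result into a separate data frame as the title asks:

```r
# Toy stand-in for idef_id (assumption): two columns with some missing values
idef_id <- data.frame(
  a = c(1, NA, 3),
  b = c(NA, NA, "x"),
  stringsAsFactors = FALSE
)

# Number of non-missing cases per column, as a named integer vector
casenums <- sapply(idef_id, function(x) sum(!is.na(x)))

# The same counts as a data frame with one row per variable
casenums_df <- data.frame(
  variable = names(casenums),
  n_cases  = unname(casenums),
  stringsAsFactors = FALSE
)
```

colSums(!is.na(idef_id)) gives the same counts without an explicit function.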
The immediate problem with your for loop is that your i is a column name, not the data within. On every pass through the for loop, i is a character vector of length 1, so sum(is.na(i)) is going to be 0. Due to how frames are structured, there is very little likelihood that a name is NA (though it is possible ... with manual subterfuge).
I suggest a literal fix for your code could be:
casenums <- for (i in names(idef_id)) {
nas<- sum(is.na(idef_id[[i]]))
275-nas
}
But this has the added problem that for loops don't return anything (as Gregor's answer also discusses). For the sake of walking through things, I'll keep that (for the first bullet), and then fix it (in the second):
Two things:
hard-coding 275 (assuming that's the number of rows in the frame) will be problematic if/when your data ever changes. Even if you're "confident" it never will ... I still recommend not hard-coding it. If it's based on the number of rows, then perhaps
OUT_OF <- 275 # should this be nrow(idef_id)?
casenums <- for (i in names(idef_id)) {
nas<- sum(is.na(idef_id[[i]]))
OUT_OF - nas
}
at least in a declarative sense, where the variable name (please choose something better) is clear as to how you determined 275 and how (if necessary) it should be fixed in the future.
(Or better, use Gregor's logic of sum(!is.na(...)) if you just need to count not-NA.)
doing something for each column of a frame is easily done using sapply or lapply, perhaps
OUT_OF <- 275 # should this be nrow(idef_id)?
OUT_OF - sapply(idef_id, function(one_column) sum(is.na(one_column)))
## or
sapply(idef_id, function(one_column) OUT_OF - sum(is.na(one_column)))

R - Assign the mean of a column sub-sector to each row of that sub-sector

I am trying to create a column which has the mean of a variable according to subsectors of my data set. In this case, the mean is the crime rate of each state calculated from county observations, and then assigning this number to each county relative to the state they are located in. Here is the function wrote.
Create the new column
Data.Final$state_mean <- 0
Then calculate and assign the mean.
for (j in range[1:3136])
{
state <- Data.Final[j, "state"]
Data.Final[j, "state_mean"] <- mean(Data.Final$violent_crime_2009-2014,
which(Data.Final[, "state"] == state))
}
Here is the following error
Error in range[1:3137] : object of type 'builtin' is not subsettable
Very much appreciated if you could, take a few minutes to help a beginner out.
You've got a few problems:
range[1:3136] isn't valid syntax. range(1:3136) is valid syntax, but the range() function just returns the minimum and maximum. You don't need anything more than 1:3136, just use
for (j in 1:3136) instead.
Because of the dash, violent_crime_2009-2014 isn't a standard column name. You'll need to use it in backticks, Data.Final$`violent_crime_2009-2014`, or in quotes with [: Data.Final[["violent_crime_2009-2014"]] or Data.Final[, "violent_crime_2009-2014"]
Also, your code is very inefficient - you re-calculate the mean on every single iteration. Try having a look at the
Mean by Group R-FAQ. There are many faster and easier methods to get grouped means.
Without using extra packages, you could do
Data.Final$state_mean = ave(x = Data.Final[["violent_crime_2009-2014"]],
Data.Final$state,
FUN = mean)
For friendlier syntax and greater efficiency, the data.table and dplyr packages are popular. You can see examples using them at the link above.
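A small self-contained run of the ave() approach, on made-up numbers (the states "A" and "B" and the crime values are assumptions for illustration):

```r
# Toy stand-in for Data.Final; check.names = FALSE keeps the dash in the column name
Data.Final <- data.frame(
  state = c("A", "A", "B", "B", "B"),
  `violent_crime_2009-2014` = c(1, 3, 2, 4, 6),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# ave() returns a vector as long as the input, holding each row's group mean
Data.Final$state_mean <- ave(
  x = Data.Final[["violent_crime_2009-2014"]],
  Data.Final$state,
  FUN = mean
)
```

Every county row now carries the mean of its own state, with no loop and no hard-coded row count.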
Here is one of many ways this can be done (I'm sure someone will post a tidyverse answer soon if not before I manage to post):
# Data for my example:
data(InsectSprays)
# Note I have a response column and a column I could subset on
str(InsectSprays)
# Take the averages with the by var:
mn <- with(InsectSprays,aggregate(x=list(mean=count),by=list(spray=spray),FUN=mean))
# Map the means back to your data using the by var as the key to map on:
InsectSprays <- merge(InsectSprays,mn,by="spray",all=TRUE)
Since you mentioned you're a beginner, I'll just mention that whenever you can, avoid looping in R. Vectorize your operations when you can. The nice thing about using aggregate, and merge, is that you don't have to worry about errors in your mapping because you get an index shift while looping and something weird happens.
Cheers!

R: add column to dataframe, named based on formula

More 'feels like it should be' simple stuff which seems to be eluding me today. Thanks in advance for assistance.
Within a loop, that's within a function, I'm trying to add a column, and name it based on a formula.
I can bind a column & its name is taken from the bound object: data<-cbind(data,bothdata)
I can bind a column & manually name the bound object: data<-cbind(data,newname=bothdata)
I can bind a column which is the product of an equation & manually name the bound object: data<-cbind(data,newname2=bothdata-1)
Or another way: data <- transform(data, newColumn = bothdata-1)
What I can't do is have the name be the product of a formula. My actual formula-derived example name is paste("E_wgt",rev(which(rev(Esteps) == q))-1,"%") & equation for column: baddata - q.
A simpler one: data<-cbind(data,paste("magic",100,"beans")=bothdata-1). This fails because cbind isn't expecting the = even though it's fine in previous examples. Same fail for transform.
My first thought was assign but while I've used this successfully for creating formula-named objects, I can't see how to get it to work for formula-named columns.
If I use an intermediary step to put the naming formula in an object container then use that, e.g.:
name <- paste("magic",100,"beans")
data<-cbind(data,name=bothdata-1)
the column name is "name" not "magic100beans". If I assign the equation result to an formula-named object:
assign(paste("magic",100,"beans"),bothdata-1)
Then try to cbind that via get:
data<-cbind(data,get(paste("magic",100,"beans")))
The column is called "get(paste("magic",100,"beans"))". Boo! Any thoughts anyone? It occurs to me that I can do cbind then separately colnames(data)[ncol(data)] <- paste("magic",100,"beans")) which I guess I'll settle for for now, but would still be interested to find if there was a direct way.
Thanks.
Chances are that cbind is overkill for your use case. In almost every instance, you can simply mutate the underlying data frame using data$newname2 <- data$bothdata - 1.
In the case where the name of the column is dynamic, you can just refer to it using the [[ operator -- data[["newcol"]] <- data$newname + 1. See ?'[' and ?'[.data.frame' for other tips and usages.
EDIT: Incorporated @Marek's suggestion for [["newcol"]] instead of [, "newcol"]
It may help you to know that data$col1 is the same as data[,"col1"], which is the same as data[,x] if x is "col1". This is how I usually access/set columns programmatically.
So this should work:
name <- paste0("magic",100,"beans") # paste0, because paste() would insert spaces: "magic 100 beans"
data[,name] <- bothdata-1
Note that you don't have to use the temporary variable name. This is equivalent to:
data$magic100beans <- bothdata-1
Itself equivalent, for a data.frame, to:
data<-cbind(data, magic100beans=bothdata-1)
Just so you know, you could also set the names afterwards:
old_names <- names(data)
name <- paste0("magic",100,"beans")
data <- cbind(data, bothdata-1)
data <- setNames(data, c(old_names, name))
# or
names(data) <- c(old_names, name)
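Since the question is about doing this inside a loop, here is a minimal sketch of building the column name on each iteration and assigning with [[. The data frame, the vector bothdata, and the loop values 50 and 100 are assumptions for illustration:

```r
# Toy inputs (assumptions): a data frame and a separate vector of the same length
data <- data.frame(id = 1:3)
bothdata <- c(10, 20, 30)

for (q in c(50, 100)) {
  # paste0() concatenates without separators, e.g. "magic50beans"
  name <- paste0("magic", q, "beans")
  # [[ accepts the name as a string, so it can be computed at run time
  data[[name]] <- bothdata - q
}

names(data)
```

This avoids both cbind and the separate renaming step, because the computed string is used directly as the column name.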
