How can I transform full state names to abbreviations? - r

I have a data frame A with a column called "states". The states are recorded by their full name, e.g. "California". There are multiple rows for each state.
I have a data frame B, which has the number of gun deaths for each state. The states are recorded by abbreviation, e.g. "CA".
What I would like is: I want each row in A to have the number of gun deaths for the corresponding state. I was planning to use dplyr::inner_join() for this.
But of course, the problem is that the state names are different in the different data frames.
What is the best way to make the names match?

If you have two vectors of the same length and want to construct a translation table, just set the input vector of state names as the names attribute of the output vector of state abbreviations, and then pass the names as inputs to the "[" function:
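names(state.abb) <- state.name
# both are built-in objects documented under the `state` topic of the default `datasets` package
?state # also see: ?Constants and ?data
# as.character() guards against a factor column, which would index by integer code instead of by name
dfrm$abbrev <- state.abb[as.character(dfrm$states)]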
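To show the lookup feeding into the join the question asks about, here is a minimal sketch. The data frames A and B below are made-up stand-ins, and the column names `abbrev` and `gun_deaths` are assumptions, not taken from the real data. Base `merge()` is used so the example has no dependencies; `dplyr::inner_join(A, B, by = "abbrev")` should give the same rows.

```r
# Hypothetical stand-ins for the data frames A and B described in the question
A <- data.frame(states = c("California", "Texas", "California"),
                stringsAsFactors = FALSE)
B <- data.frame(abbrev = c("CA", "TX"),
                gun_deaths = c(10, 20))  # made-up numbers, for illustration only

# Named lookup vector: names are full state names, values are abbreviations
lookup <- setNames(state.abb, state.name)

# Translate full names to abbreviations; unname() drops the leftover names
A$abbrev <- unname(lookup[as.character(A$states)])

# Inner join on the shared abbreviation column
joined <- merge(A, B, by = "abbrev")
joined
```

Any state name that is not in `state.name` (a misspelling, or "District of Columbia") comes back as NA, so it is worth checking `anyNA(A$abbrev)` before joining.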

Related

How do you create a wrapper for filter in R that takes two arguments?

I have to create a filter that:
Takes two arguments: a dataframe (df) and a country name (country_name).
Checks to see if there is a column called “country” or “Country”. Note - a few functions check column names. We used names in class.
If there is a country column, filter the data by the country specified with country_name and return just the filtered data.
If there is not a country column, return “No country data”.
I tried:
# sample data
df=data.frame(country_name)
Hoping to be able to create a data frame but instead I got an error symbol. What am I supposed to do?
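The error most likely comes from `data.frame(country_name)` being called before any object named `country_name` exists. One possible way to write the wrapper described above is sketched below in base R. The function name `filter_country` and the sample data are hypothetical, and this is only one reading of the requirements.

```r
# Hypothetical wrapper: filter df by country if it has a country column
filter_country <- function(df, country_name) {
  # Look for a column called "country" or "Country"
  col <- intersect(c("country", "Country"), names(df))
  if (length(col) == 0) {
    return("No country data")
  }
  # which() drops NA comparisons; drop = FALSE keeps the result a data frame
  df[which(df[[col[1]]] == country_name), , drop = FALSE]
}

# Made-up sample data
sample_df <- data.frame(Country = c("Chile", "Peru", "Chile"), value = 1:3)

filter_country(sample_df, "Chile")            # the two Chile rows
filter_country(data.frame(x = 1:3), "Chile")  # "No country data"
```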

How to dynamically create and name data frames in a for loop

I am trying to generate data frame subsets for each respondent in a data frame using a for loop.
I have a large data frame with columns titled "StandardCorrect", "NameProper", "StartTime", "EndTime", "AScore", and "StandardScore" and several thousand rows.
I want to make a subset data frame for each person's name so I can generate statistics for each respondent.
I tried using a for loop
for(name in 1:length(NamesList)){ name <- DigiNONA[DigiNONA$NameProper == NamesList[name], ] }
NamesList is just a list containing all the levels of NameProper (which is a factor variable).
All I want the loop to do is, on each iteration, generate a new data frame named after "NamesList[name]" that contains the subset of the main data frame where NameProper matches the name in the list for that iteration.
This seems like it should be simple; I just can't figure out how to get R to dynamically generate data frames with different names for each iteration.
Any advice would be appreciated, thank you.
The advice to use assign for this purpose is technically feasible, but the practice is widely discouraged by experienced users of R. Instead, create a single list with named elements, each of which contains the data for a single individual. That way you don't need to keep a separate data object with the names of the resulting objects for later access.
named_Dlist <- setNames( split( DigiNONA, DigiNONA$NameProper),
NamesList)
This would allow you to access individual dataframes within the named_Dlist object:
named_Dlist[[ NamesList[1] ]] # The dataframe with the first person in that NamesList vector.
It's probably better to use the term list only for true R lists and not for atomic character vectors.
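As a small illustration of the list-based approach, here is a sketch on made-up data. The toy `DigiNONA` below only has two of the real columns, and the values are invented. Note that `split()` already names each element after the corresponding factor level, so the list can be indexed by name directly.

```r
# Made-up stand-in for DigiNONA with only two of its columns
DigiNONA <- data.frame(
  NameProper = factor(c("Ann", "Bob", "Ann", "Cy")),
  AScore = c(3, 5, 4, 2)
)

# One data frame per respondent, named after the levels of NameProper
by_person <- split(DigiNONA, DigiNONA$NameProper)

names(by_person)     # the respondent names
by_person[["Ann"]]   # rows for Ann only

# Per-respondent statistics without creating separately named objects
mean_scores <- sapply(by_person, function(d) mean(d$AScore))
mean_scores
```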

Question related to table() function in r

I have a very big dataset in which the state column repeats each state for every latitude and longitude that it covers. I want to find the order in which the states appear (the data frame is too big to show all the names) so that I can add another column of values in the correct order corresponding to the state names. inner_join doesn't work; it says that it cannot assign a variable of size 122.3Gb. I wanted to use the table() function, but it gives alphabetically sorted values rather than the order in which the state names appear in the data frame. What can I do?
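One possible approach, sketched on a made-up vector: `unique()` returns values in order of first appearance, and `table()` follows the level order of a factor, so setting the levels explicitly keeps the counts in appearance order. The vector `state` and the values in `value_by_state` are assumptions for illustration. `match()` then adds a per-state value to every row without a join.

```r
# Made-up state column
state <- c("Texas", "Ohio", "Texas", "Alabama", "Ohio")

# States in order of first appearance
first_seen <- unique(state)

# Counts in appearance order rather than alphabetical order
counts <- table(factor(state, levels = first_seen))

# Hypothetical per-state values, given in the same order as first_seen
value_by_state <- c(1.5, 2.5, 3.5)

# Look up each row's value by position; no join needed
row_values <- value_by_state[match(state, first_seen)]
```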

How to find and replace values of a data frame in R?

I have two data frames. One contains the region names for different regional codes. The other has certain GTU data with the regional code recorded. I want to replace the regional codes in the second data frame with region names based on the first, using R. Please help!
If your data.frames are called df1 and df2, then you could try
df_final <- merge(df1,df2,by="regional code")
Then df_final will contain everything (also region names and regional code).
Later you can delete the regional code column if you want by using
df_final[, !(colnames(df_final) %in% c("regional code"))]
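Here is a small worked version of the same idea on made-up data. The column names "region name" and "gtu" and all values are assumptions; `check.names = FALSE` is only needed because the key column name contains a space.

```r
# Made-up lookup table: regional code -> region name
df1 <- data.frame(`regional code` = c(1, 2, 3),
                  `region name` = c("North", "South", "East"),
                  check.names = FALSE)

# Made-up data that refers to regions by code
df2 <- data.frame(`regional code` = c(2, 2, 3, 1),
                  gtu = c(10, 20, 30, 40),
                  check.names = FALSE)

# Join on the shared key column
df_final <- merge(df1, df2, by = "regional code")

# Drop the code column, keeping only the region name and the data
result <- df_final[, !(colnames(df_final) %in% c("regional code"))]
result
```

Keep in mind that `merge()` sorts the result by the key column by default, so the row order of df2 is not preserved.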

Look up data frame with values stored in another data frame

I have 15 data frames containing information about patient visits for a group of patients. Example below. They are named as FA.OFC1, FA.OFC2 etc.
ID sex date age.yrs important.var etc...
xx_111 F xx.xx.xxxx x.x x
I am generating a summary data frame (sev.scores) which contains information about the most severe episode a patient has across all recorded data. I have successfully used the which.max function to get the most severe episode but now need additional information about that particular episode.
I recreated the name of the data frame I will need to look up to get the additional information by pasting information after the max return:
max data frame
8 df2
Specifically, the names() function gave me the name of the column with the most severe episode (in the summary data frame sev.scores), which also tells me which data frame to look up:
sev.scores[52:53] <- as.data.frame(cbind(row.names(sev.scores[c(5,8,11,14,17,20,23,26,29,32,35,38,41,44,47,50)]),apply(sev.scores[c(5,8,11,14,17,20,23,26,29,32,35,38,41,44,47,50)],1,function(x) names(sev.scores[c(5,8,11,14,17,20,23,26,29,32,35,38,41,44,47,50)])[which(x==max(x))])))
However now I would like to figure out how to tell R to take the data frame name stored in the column and search that data frame for the entry in the 5th column.
So in the example above the information about the most severe episode is stored in data frame 2 (df2) and I need to take information from the 5th record (important.var) and return it to this summary data frame.
UPDATE
I have now stored these dfs in a list but am still having some trouble getting the information I would like.
I found the following example for getting the max value from a list
lapply(L1, function(x) x[which.max(abs(x))])
How can I adapt this for a factor which is present in all elements of the list?
e.g. something like:
lapply(my_dfs[[all elements]]["factor of interest"], function(x) x[which.max(abs(x))])
If I may suggest a fundamentally different approach: concatenate all your data.frames into one (rbind), and add a separate column that describes the nature of the original data.frame. For this, it’s necessary to know in which regard the original data.frames differed (e.g. by disease type; since I don’t know your data, let’s stick with this for my example).
Furthermore, you need to ensure that your data is in tidy data format. This is an easy requirement to satisfy, because your data should be in this format anyway!
Then, once you have all the data in a single data.frame, you can create a summary simply by selecting the most severe episode for each patient:
sev_scores = all_data %>%
    group_by(ID) %>%
    filter(row_number() == which.max(FactorOfInterest))
Note that this code uses the ‹dplyr› package. You can perform an equivalent analysis using different packages (e.g. ‹data.table›) or base R functions, but I strongly recommend dplyr: The resulting code is generally easier to understand.
Rather than your sev.scores table, which has columns referring to rows and data.frame names, the sev_scores I created above will contain the actual data for the most severe episode for each patient ID.
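To make this concrete, here is a minimal sketch on made-up data. The list `visits`, the column `severity`, and the `source` label are assumptions standing in for the real FA.OFC tables and the real severity variable. `bind_rows()` with `.id` stacks the tables and records which one each row came from.

```r
library(dplyr)

# Made-up visit tables standing in for FA.OFC1, FA.OFC2, ...
visits <- list(
  df1 = data.frame(ID = c("p1", "p2"), severity = c(2, 7)),
  df2 = data.frame(ID = c("p1", "p2"), severity = c(5, 1))
)

# Stack the tables; the "source" column holds the name of the original table
all_data <- bind_rows(visits, .id = "source")

# Keep the single most severe row per patient
sev_scores <- all_data %>%
  group_by(ID) %>%
  filter(row_number() == which.max(severity)) %>%
  ungroup()

sev_scores
```

Because each surviving row carries its `source` value and all of its other columns, there is no need to look anything up in the original tables afterwards.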
