Need to slice dataframe - r

I have been trying to group this data frame by county. However, when I used pivot_wider() for the 2 casualty types, it created 2 rows for every county. I have tried fixing this but am new to R, so any help would be appreciated.
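The data itself isn't shown, so this is only a guess at the cause: pivot_wider() treats every column not named in names_from or values_from as an identifier, so a leftover column that differs between rows (a record ID, say) keeps the rows apart. A minimal sketch, assuming hypothetical column names record_id, county, casualty_type and n:

```r
library(tidyr)
library(tibble)

# Hypothetical data: record_id is unique per row and blocks the merge
casualties <- tibble(
  record_id     = 1:4,
  county        = c("Kent", "Kent", "Essex", "Essex"),
  casualty_type = c("Slight", "Serious", "Slight", "Serious"),
  n             = c(10, 2, 7, 1)
)

# record_id is treated as an identifier, so each county still has 2 rows
bad <- pivot_wider(casualties, names_from = casualty_type, values_from = n)

# Naming the identifier explicitly gives one row per county
good <- pivot_wider(casualties, id_cols = county,
                    names_from = casualty_type, values_from = n)
```

If the real data has no such extra column, the duplicate rows have a different cause and this sketch won't apply.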

Related

create DF in R referring 2 different columns following one after another

I need to create a data frame with a new column 'Variable1' in R; the requirements are below. I have one column named 'Country' and another named 'City'. I need to check first whether data is available in the City column; if there is none, fall back to the Country column, based on the week. Example:
Country Count
A 5
B 6
C 7
City Count
A 3
B 5
When I create the new column, it should first check the count in the City column for each value. If no City count is available, it should take the count from Country instead. If that is not available either, it should fill forward the last value.
NEW Variable
A 3
B 5
C 7
Can someone please assist with it? Is there any way to do it?
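One possible approach, sketched with dplyr and tidyr: join the two tables, take the City count where it exists and the Country count otherwise with coalesce(), then fill any remaining gaps forward. The table layout and the names key, country_count and city_count are assumptions, since the question gives no reproducible data, and the per-week logic is left out.

```r
library(dplyr)
library(tidyr)

# Assumed layout: one table per source, sharing a key column
country <- tibble(key = c("A", "B", "C"), country_count = c(5, 6, 7))
city    <- tibble(key = c("A", "B"),      city_count    = c(3, 5))

result <- country %>%
  left_join(city, by = "key") %>%
  # prefer the City count, fall back to the Country count
  mutate(Variable1 = coalesce(city_count, country_count)) %>%
  # if both are missing, carry the last known value forward
  fill(Variable1, .direction = "down")

result$Variable1
```

For the example in the question this gives 3 for A, 5 for B and 7 for C.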

Adding rows to data frame with zero values

I have a dataset with multiple records, each of which is assigned a country, and I want to create a worldmap using rworldmap, coloured according to the frequency with which each country occurs in the dataset. Not all countries appear in the dataset - either because they have no corresponding records or because they are not eligible (e.g. middle/low income countries).
To build the map, I have created a dataframe (dfmap) based on a table of countries, where one column is the country code and the second column is the frequency with which it appears in the dataset.
In order to identify on the map countries which are eligible, but have no records, I have tried to use add_row to add these to my dataframe e.g. for Andorra:
add_row(dfmap, Var1="AND", Freq=0)
When I run add_row for each country, it appears to work (no error message and that new row appears in the table below the command) - but previously added rows where the Freq=0 do not appear.
When I then look at the dataframe using "dfmap" or "summary(dfmap)", none of the rows where Freq=0 appear, and when I build the map, they are coloured as for missing countries.
I'm not sure where I'm going wrong and would welcome any suggestions.
Many thanks
Using the method suggested in the comment above, one can use a join and then replace_na() to create a tibble with the complete set of countries, giving those without records a count of zero.
As there was no sample data in the question, I created the two data frames below based on what I thought the question implied.
library(dplyr)
library(tidyr)
dfrm_counts = tibble(Country = c('England', 'Germany'),
                     Count = c(1, 4))
dfrm_all = tibble(Country = c('England', 'Germany', 'France'))
dfrm_final = dfrm_counts %>%
  right_join(dfrm_all, by = "Country") %>%
  replace_na(list(Count = 0))
dfrm_final
# A tibble: 3 x 2
  Country Count
  <chr>   <dbl>
1 England     1
2 Germany     4
3 France      0
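As for why the zero rows seem to vanish in the question: add_row() returns a modified copy rather than changing dfmap in place, so the result has to be assigned back. A minimal sketch with made-up country codes and counts, since the real dfmap isn't shown:

```r
library(tibble)

# Made-up stand-in for the asker's dfmap
dfmap <- tibble(Var1 = c("FRA", "DEU"), Freq = c(3, 5))

# Without assignment the new row is only printed, not stored
add_row(dfmap, Var1 = "AND", Freq = 0)
rows_before <- nrow(dfmap)   # dfmap itself is unchanged

# Assign the result back to keep the row
dfmap <- add_row(dfmap, Var1 = "AND", Freq = 0)
rows_after <- nrow(dfmap)
```

This would explain why each call appears to work on its own while earlier additions are lost.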

R read csv with observations in columns

In every example I have seen so far for reading csv files in R, the variables are in columns and the observations (individuals) are in rows. In an introductory statistics course I am taking, there is an example table where the (many) variables are in rows and the (few) observations are in columns. Is there a way to read such a table so that you get a dataframe in the usual "orientation"?
Here is a solution that uses the tidyverse. First, we gather the data into narrow-format tidy data, excluding the first column from gather() so that it serves as the key for the gathered observations. Then we spread it back to wide format.
We'll demonstrate the technique with state-level summary data from the U.S. Census Bureau.
We created a table of population data for four states, where the states (observations) are in columns and the variables are listed in rows.
To make the example reproducible, we entered the data into Excel and saved it as a comma-separated values file, whose contents we assign to a character vector in R and read with read.csv().
textFile <- "Variable,Georgia,California,Alaska,Alabama
population2018Estimate,10519475,39557045,737438,4887871
population2010EstimatedBase,9688709,37254523,710249,4780138
pctChange2010to2018,8.6,6.2,3.8,2.3
population2010Census,8676653,37253956,710231,4779736"
# load tidyverse libraries
library(tidyr)
library(dplyr)
# read the text as if it were a csv file
data <- read.csv(text = textFile, header = TRUE)
# first gather to narrow format, then spread back to wide format
data %>%
  gather(., state, value, -Variable) %>%
  spread(Variable, value)
...and the results:
       state pctChange2010to2018 population2010Census
1    Alabama                 2.3              4779736
2     Alaska                 3.8               710231
3 California                 6.2             37253956
4    Georgia                 8.6              8676653
  population2010EstimatedBase population2018Estimate
1                     4780138                4887871
2                      710249                 737438
3                    37254523               39557045
4                     9688709               10519475
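gather() and spread() still work, but tidyr now documents them as superseded by pivot_longer() and pivot_wider(). The same reshaping with the newer functions could look roughly like this; the name wide is just for illustration:

```r
library(tidyr)
library(dplyr)

textFile <- "Variable,Georgia,California,Alaska,Alabama
population2018Estimate,10519475,39557045,737438,4887871
population2010EstimatedBase,9688709,37254523,710249,4780138
pctChange2010to2018,8.6,6.2,3.8,2.3
population2010Census,8676653,37253956,710231,4779736"

data <- read.csv(text = textFile, header = TRUE)

wide <- data %>%
  # one row per (Variable, state) pair
  pivot_longer(-Variable, names_to = "state", values_to = "value") %>%
  # one row per state, one column per variable
  pivot_wider(names_from = Variable, values_from = value)
```

Unlike spread(), pivot_wider() keeps rows and columns in order of first appearance instead of sorting them, so the states come out in the order of the original header.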

Match one column of data frame to another, pull in other columns, combine into large dataset

I've got a list of Store IDs and their Zipcodes in a 2 column numeric vector (in R). I'm using the "Zipcode" package (https://cran.rproject.org/web/packages/zipcode/zipcode.pdf) and have access to the longitude/latitude coordinates for these zipcodes. The zipcode package has a large data frame with every zip code, its city and state, and its longitude and latitude.
I'm looking to get the longitude and latitude coordinates for my Zipcodes, and add them as columns 3 and 4 (i.e. Store ID, Zip Code, Longitude, Latitude).
Any thoughts?
Thank you!
EDIT: I've tried the merge function (i.e.) total<-merged(CleanData,zipcode, by=zip) and I'm getting an error because they must have the same number of columns?
The column name passed as the by argument has to be enclosed within quotes. You don't need the by argument in merge in this example, if zipcode is the only common column in the two dataframes.
Example datasets:
# cleanData
d1 <- tibble::tribble(
  ~z,  ~id,
  131, 1,
  114, 2,
  155, 5
)
# zipcode
d2 <- tibble::tribble(
  ~z,  ~x, ~y,
  131, 2,  5,
  166, 2,  6,
  162, 6,  5,
  177, 7,  1,
  114, 2,  1,
  155, 5,  9
)
result <- merge(d1, d2)
gives
    z id x y
1 114  2 2 1
2 131  1 2 5
3 155  5 5 9
You can remove any unnecessary columns from the result dataframe with dplyr::select(). Suppose you don't need column y (which may be a state name, for example):
result <- dplyr::select(result, z, id, x)
Ended up using this: How to join (merge) data frames (inner, outer, left, right)?
Essentially, I used a left outer join because I wanted to keep all of the zipcodes in my store database. I believe the answer above would eliminate zipcodes not found in the second list of zipcodes.
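For reference, the difference between the two joins can be shown with base merge(): the default is an inner join, and all.x = TRUE turns it into a left outer join. The data below is made up, with one store (z = 999) that has no match in the lookup table:

```r
# Made-up store list; z = 999 has no match in the lookup table
d1 <- data.frame(z = c(131, 114, 155, 999), id = c(1, 2, 5, 7))
d2 <- data.frame(z = c(131, 166, 162, 177, 114, 155),
                 x = c(2, 2, 6, 7, 2, 5),
                 y = c(5, 6, 5, 1, 1, 9))

# Inner join (the default): the unmatched store is dropped
inner <- merge(d1, d2, by = "z")

# Left outer join: every row of d1 is kept, with NA where d2 has no match
left <- merge(d1, d2, by = "z", all.x = TRUE)
```

Here inner has three rows and left has four, with NA in x and y for z = 999.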

Create new numeric columns from 1 string column

I'm a beginner. I have a dataset taken from here which consists of people's profiles with different attributes, profession being one of them. There are 12 professions: admin., blue-collar, entrepreneur, housemaid, management, retired, self-employed, services, student, technician, unemployed, unknown.
I'd like to apply K-NN to that dataset, so I'd like to split the profession column into 12 new columns, assigning 1 to the column for that person's profession and 0 to the other 11.
I tried the foreach package and for loops, unsuccessfully. I can't get foreach to work, and I don't know what to do next from the following code:
jobs <- data[, 2]
jobs
for (job in jobs) {
  print(job)
  # No idea how to create the new columns here, based on if conditionals
}
How would be the best way to do this?
Thanks.
You can certainly solve the problem using a for loop, but may I suggest a solution that is more efficient in the long run: reshape2 package (https://cran.r-project.org/web/packages/reshape2/).
I have the data from bank-full.csv read into R as the object bank. Next, the reshape2 package needs to be installed and loaded:
install.packages("reshape2")
library(reshape2)
The data can then be shaped into a format where observations are on rows and jobs on columns. An accessory id column is first added to the data:
bank$id<-1:nrow(bank)
Then, taking the columns 2 and 18 (job and id) from the data frame bank and casting them into the aforementioned form can be done as:
tmp<-dcast(bank[,c(2, 18)], id~job, length)
That should give a new data frame tmp, where each job has its own column. Since every id is present in the data only once, the length function used in dcast to aggregate the data puts just zeros and ones in every column.
Last, these new columns can be added to the original data set:
bank<-cbind(bank[,-18], tmp[,-1])
Negative subscripts inside the square brackets drop the columns from the dataset, so this simultaneously lets you get rid of the id column.
Another, even more efficient way to do this is to use the function model.matrix:
bank2<-cbind(bank, model.matrix( ~ 0 + job, bank))
This should give you a data frame with each job as a new column. Note however that it changes the column names a bit (adds job to the beginning of the job columns).
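To see the model.matrix approach on something small, here is a sketch with a made-up three-row data frame standing in for bank (only the job column matters):

```r
# Made-up stand-in for the bank data
bank <- data.frame(age = c(30, 45, 52),
                   job = c("admin.", "student", "admin."))

# ~ 0 + job drops the intercept, so every job level gets its own 0/1 column
dummies <- model.matrix(~ 0 + job, data = bank)
bank2 <- cbind(bank, dummies)

colnames(dummies)
```

The new columns are named jobadmin. and jobstudent, which is the prefixing mentioned above.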
