Convert one column into multiple columns - r

I am a novice. I have a data set with one column and many rows. I want to convert this column into 5 columns. For example my data set looks like this:
Column
----
City
Nation
Area
Metro Area
Urban Area
Shanghai
China
24,000,000
1230040
4244234
New york
America
343423
23423434
343434
Etc
The output should look like this
City     | Nation  | Area       | Metro Area | Urban Area
-------- | ------- | ---------- | ---------- | ----------
Shanghai | China   | 24,000,000 | 1230040    | 4244234
New york | America | 343423     | 23423434   | 343434
The first 5 rows of the data set (City, Nation, Area, etc.) need to become the names of the 5 columns, and I want the rest of the data to be populated under these 5 columns. Please help.

Here is a one-liner (assuming your column is character; if it is a factor, convert it first with df$column <- as.character(df$column)):
setNames(data.frame(matrix(unlist(df[-c(1:5),]), ncol = 5, byrow = TRUE)), c(unlist(df[1:5,])))
# City Nation Area Metro_Area Urban_Area
#1 Shanghai China 24,000,000 1230040 4244234
#2 New_york America 343423 23423434 343434
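The one-liner can also be broken into steps. This is a minimal sketch, assuming the single column is a character vector whose first five values are the headers; the data frame `df`, its column name `column`, and the helper names `n_cols`, `headers`, `values`, and `wide` are illustrative and built from the question's sample data:

```r
# Sample data from the question: one column, first five values are the headers
df <- data.frame(
  column = c("City", "Nation", "Area", "Metro Area", "Urban Area",
             "Shanghai", "China", "24,000,000", "1230040", "4244234",
             "New york", "America", "343423", "23423434", "343434"),
  stringsAsFactors = FALSE
)

n_cols  <- 5
headers <- df$column[seq_len(n_cols)]    # first five values become column names
values  <- df$column[-seq_len(n_cols)]   # everything after them is data

# Fill a matrix row by row so each run of five values becomes one record
wide <- as.data.frame(matrix(values, ncol = n_cols, byrow = TRUE),
                      stringsAsFactors = FALSE)
names(wide) <- headers

# Strip thousands separators and convert the numeric columns
num_cols <- c("Area", "Metro Area", "Urban Area")
wide[num_cols] <- lapply(wide[num_cols],
                         function(x) as.numeric(gsub(",", "", x)))
```

This only works when the number of data values is an exact multiple of five; otherwise matrix() recycles values and warns.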

I'm going to go out on a limb and guess that the data you're after is from the URL: https://en.wikipedia.org/wiki/List_of_largest_cities.
If this is the case, I would suggest you actually try re-reading the data (not sure how you got the data into R in the first place) since that would probably make your life easier.
Here's one way to read the data in:
library(rvest)
URL <- "https://en.wikipedia.org/wiki/List_of_largest_cities"
XPATH <- '//*[@id="mw-content-text"]/table[2]'
cities <- URL %>%
read_html() %>%
html_nodes(xpath=XPATH) %>%
html_table(fill = TRUE)
Here's what the data currently looks like. It still needs to be cleaned up (notice that some of the column names came from merged "rowspan" cells and the like):
head(cities[[1]])
## City Nation Image Population Population Population
## 1 Image City proper Metropolitan area Urban area[7]
## 2 Shanghai China 24,256,800[8] 34,750,000[9] 23,416,000[a]
## 3 Karachi Pakistan 23,500,000[10] 25,400,000[11] 25,400,000
## 4 Beijing China 21,516,000[12] 24,900,000[13] 21,009,000
## 5 Dhaka Bangladesh 16,970,105[14] 15,669,000 18,305,671[15][not in citation given]
## 6 Delhi India 16,787,941[16] 24,998,000 21,753,486[17]
From there, the cleanup might be like:
cities <- cities[[1]][-1, ]
names(cities) <- c("City", "Nation", "Image", "Pop_City", "Pop_Metro", "Pop_Urban")
cities["Image"] <- NULL
head(cities)
cities[] <- lapply(cities, function(x) type.convert(gsub("\\[.*|,", "", x)))
head(cities)
# City Nation Pop_City Pop_Metro Pop_Urban
# 2 Shanghai China 24256800 34750000 23416000
# 3 Karachi Pakistan 23500000 25400000 25400000
# 4 Beijing China 21516000 24900000 21009000
# 5 Dhaka Bangladesh 16970105 15669000 18305671
# 6 Delhi India 16787941 24998000 21753486
# 7 Lagos Nigeria 16060303 13123000 21000000
str(cities)
# 'data.frame': 163 obs. of 5 variables:
# $ City : Factor w/ 162 levels "Abidjan","Addis Ababa",..: 133 74 12 41 40 84 66 148 53 102 ...
# $ Nation : Factor w/ 59 levels "Afghanistan",..: 13 41 13 7 25 40 54 31 13 25 ...
# $ Pop_City : num 24256800 23500000 21516000 16970105 16787941 ...
# $ Pop_Metro: int 34750000 25400000 24900000 15669000 24998000 13123000 13520000 37843000 44259000 17712000 ...
# $ Pop_Urban: num 23416000 25400000 21009000 18305671 21753486 ...
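To see what the gsub() pattern in the cleanup step does on its own, here is a small sketch using a few values copied from the table above. The vector names `raw_pop` and `clean_pop` are illustrative, and as.numeric() is used here in place of type.convert() to keep the example minimal:

```r
# Values as scraped: thousands separators plus trailing footnote markers
raw_pop <- c("24,256,800[8]", "15,669,000",
             "18,305,671[15][not in citation given]")

# "\\[.*" drops everything from the first "[" onward; "," drops the separators
clean_pop <- as.numeric(gsub("\\[.*|,", "", raw_pop))
```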

Related

Adding a new column in R from matching multiple columns in two dataframes?

I have two data frames like the following and am trying to add a new column. I want to add a new column to Data Frame 1 that comes from matching 3 columns with different names: (Name, Age, Country) in Data Frame 1 and (First, AgeNum, BornPlace) in Data Frame 2.
I have tried filtering and setting a new column, but I cannot get it to work for every row in df1.
Data frame 1
Name Age Country Unrelated Unrelated
1 Josh 15 USA ... ...
2 Kyle 18 USA ... ...
3 Pete 17 USA ... ...
4 Devin 19 USA ... ...
5 Josh 15 Canada ... ...
Data frame 2
First AgeNum BornPlace Unrelated Unrelated Weight
1 Max 25 USA ... ... 150
2 Morgan 28 USA ... ... 170
3 Josh 15 USA ... ... 140
4 Devin 19 USA ... ... 180
Expected Result(Dataframe1 with new column)
Name Age Country Unrelated Unrelated Weight
1 Josh 15 USA ... ... 140
2 Kyle 18 USA ... ... -
3 Pete 17 USA ... ... -
4 Devin 19 USA ... ... 180
5 Josh 15 Canada ... ... -
We could use left_join:
library(dplyr)
left_join(df1, df2, by=c("Name"="First","Age" = "AgeNum","Country" = "BornPlace"))
Name Age Country Weight
1 Josh 15 USA 140
2 Kyle 18 USA NA
3 Pete 17 USA NA
4 Devin 19 USA 180
5 Josh 15 Canada NA
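Alternatively, use the data.table package (assuming DT1 and DT2 are df1 and df2 converted with as.data.table()):
library(data.table)
# merge() dispatches to the data.table method; all.x = TRUE keeps every row of DT1
merge(
x = DT1, y = DT2,
by.x = c('Name','Age','Country'),
by.y = c('First','AgeNum','BornPlace'),
all.x = TRUE, all.y = FALSE)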
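The same left join can be sketched in base R. This is an illustration under assumptions: the data frames are rebuilt from the question without the "Unrelated" columns, and `row_id` is a hypothetical helper column added only to restore the original row order, since merge() sorts its result by the key columns:

```r
df1 <- data.frame(Name    = c("Josh", "Kyle", "Pete", "Devin", "Josh"),
                  Age     = c(15, 18, 17, 19, 15),
                  Country = c("USA", "USA", "USA", "USA", "Canada"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(First     = c("Max", "Morgan", "Josh", "Devin"),
                  AgeNum    = c(25, 28, 15, 19),
                  BornPlace = c("USA", "USA", "USA", "USA"),
                  Weight    = c(150, 170, 140, 180),
                  stringsAsFactors = FALSE)

df1$row_id <- seq_len(nrow(df1))   # remember the original order of df1

# all.x = TRUE keeps every row of df1; unmatched rows get NA for Weight
res <- merge(df1, df2,
             by.x = c("Name", "Age", "Country"),
             by.y = c("First", "AgeNum", "BornPlace"),
             all.x = TRUE)

res <- res[order(res$row_id), ]    # undo the sorting done by merge()
res$row_id <- NULL
rownames(res) <- NULL
```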

Mutate DF1 based on DF2 with a check

Newbie here with a dataframe/mutate question. I want to update a dataframe (df1) based on data in another dataframe (df2). For one-offs I've used mutate, so I figure this is the way to go. Additionally, I would like a check column added (TRUE/FALSE?) to indicate whether the field in df1 was updated.
For Example..
df1-
State
<chr>
1 N.Y.
2 FL
3 AL
4 MS
5 IL
6 WS
7 WA
8 N.J.
9 N.D.
10 S.D.
11 CALL
df2
State New_State
<chr> <chr>
1 N.Y. New York
2 FL Florida
3 AL Alabama
4 MS Mississippi
5 IL Illinois
6 WS Wisconsin
7 WA Washington
8 N.J. New Jersey
9 N.D. North Dakota
10 S.D. South Dakota
11 CAL California
I want the output to look like this
df3
New_State Test
<chr>
1 New York TRUE
2 Florida TRUE
3 Alabama TRUE
4 Mississippi TRUE
5 Illinois TRUE
6 Wisconsin TRUE
7 Washington TRUE
8 New Jersey TRUE
9 North Dakota TRUE
10 South Dakota TRUE
11 CALL FALSE
In essence, I want R to read the data in df1 and replace each value with the full state name from the matching row in df2. Lastly, if the value in df1 was updated, mark it as TRUE (N.Y. to New York), and as FALSE if it was not updated (CALL vs CAL).
Thanks in advance for any and all help.
This should give you the result you're looking for:
match_vec <- match(df1$State, table = df2$State)
match_vec holds, for each abbreviated state name in df1, the row position of the same abbreviation in df2. Where there's no match (such as CALL), you end up with a missing value (NA).
Then the following code using dplyr should produce the df3 you requested:
library(dplyr)
df3 <- df1 %>%
mutate(New_State = df2$New_State[match_vec]) %>%
mutate(Test = !is.na(match_vec)) %>%
mutate(New_State = ifelse(is.na(New_State),
State, New_State)) %>%
select(New_State, Test)
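To make the role of match() concrete, here is a base R sketch of the same lookup on a three-row subset of the question's data. The shortened data frames are an assumption made for brevity, and the result is built with data.frame() rather than the dplyr pipeline:

```r
df1 <- data.frame(State = c("N.Y.", "FL", "CALL"), stringsAsFactors = FALSE)
df2 <- data.frame(State     = c("N.Y.", "FL", "CAL"),
                  New_State = c("New York", "Florida", "California"),
                  stringsAsFactors = FALSE)

# Position of each df1 abbreviation in df2; NA where there is no exact match
match_vec <- match(df1$State, df2$State)

df3 <- data.frame(
  # Use the full name when a match exists, otherwise keep the original value
  New_State = ifelse(is.na(match_vec), df1$State, df2$New_State[match_vec]),
  # TRUE when the value was found in df2 and replaced
  Test      = !is.na(match_vec),
  stringsAsFactors = FALSE
)
```

Because match() compares whole strings, "CALL" does not match "CAL", so that row is left unchanged and flagged FALSE.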

How do I add another column to a dataframe in R that shows the difference between the columns of two other dataframes?

What I have:
I have two dataframes to work with. Those are:
> print(myDF_2003)
A_score country B_score
1 200 Germany 11
2 150 Italy 9
3 0 Sweden 0
and:
> print(myDF_2005)
A_score country B_score
1 -300 France 16
2 100 Germany 12
3 200 Italy 15
4 40 Spain 17
They are produced by the following code, which I do not want to change:
#_________2003______________
myDF_2003=data.frame(c(200,150,0),c("Germany", "Italy", "Sweden"), c(11,9,0))
colnames(myDF_2003)=c("A_score","country", "B_score")
myDF_2003$country=as.character(myDF_2003$country)
myDF_2003$country=factor(myDF_2003$country, levels=unique(myDF_2003$country))
myDF_2003$A_score=as.numeric(as.character(myDF_2003$A_score))
myDF_2003$B_score=as.numeric(as.character(myDF_2003$B_score))
#_________2005______________
myDF_2005=data.frame(c(-300,100,200,40),c("France","Germany", "Italy", "Spain"), c(16,12,15,17))
colnames(myDF_2005)=c("A_score","country", "B_score")
myDF_2005$country=as.character(myDF_2005$country)
myDF_2005$country=factor(myDF_2005$country, levels=unique(myDF_2005$country))
myDF_2005$A_score=as.numeric(as.character(myDF_2005$A_score))
myDF_2005$B_score=as.numeric(as.character(myDF_2005$B_score))
What I want:
I want to paste another column to myDF_2005 which has the difference of the B_Scores of countries that exist in both previous dataframes. In other words: I want to produce this output:
> print(myDF_2005_2003_Diff)
A_score country B_score B_score_Diff
1 -300 France 16
2 100 Germany 12 1
3 200 Italy 15 6
4 40 Spain 17
Question:
What is the most elegant code to do this?
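# join in a temporary dataframe, keeping every row of myDF_2005
temp <- merge(myDF_2005, myDF_2003, by = "country", all.x = TRUE)
# calculate the difference and assign a new column
# (positional assignment assumes temp has the same row order as myDF_2005)
myDF_2005$B_score_Diff <- temp$B_score.x - temp$B_score.y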
A solution using dplyr. The idea is to merge the two data frames and then calculate the difference.
library(dplyr)
myDF_2005_2 <- myDF_2005 %>%
left_join(myDF_2003 %>% select(-A_score), by = "country") %>%
mutate(B_score_Diff = B_score.x - B_score.y) %>%
select(-B_score.y) %>%
rename(B_score = B_score.x)
myDF_2005_2
# A_score country B_score B_score_Diff
# 1 -300 France 16 NA
# 2 100 Germany 12 1
# 3 200 Italy 15 6
# 4 40 Spain 17 NA
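A base R alternative that avoids the join entirely is to look up each 2005 country in the 2003 data with match(). This is a sketch under assumptions: the data frames are rebuilt here with character country columns for brevity (match() also compares factors by their labels), and `idx` is an illustrative name:

```r
myDF_2003 <- data.frame(A_score = c(200, 150, 0),
                        country = c("Germany", "Italy", "Sweden"),
                        B_score = c(11, 9, 0),
                        stringsAsFactors = FALSE)
myDF_2005 <- data.frame(A_score = c(-300, 100, 200, 40),
                        country = c("France", "Germany", "Italy", "Spain"),
                        B_score = c(16, 12, 15, 17),
                        stringsAsFactors = FALSE)

# Row of myDF_2003 holding the same country, or NA if the country is absent
idx <- match(myDF_2005$country, myDF_2003$country)

# Indexing with NA yields NA, so countries missing from 2003 get an NA difference
myDF_2005$B_score_Diff <- myDF_2005$B_score - myDF_2003$B_score[idx]
```

Since nothing is reordered, the new column lines up with the rows of myDF_2005 regardless of how the countries are sorted.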

R - Converting one table to multiple tables

I have a csv file called data.csv that contains 3 tables merged into one. I would like to separate them into 3 different data frames when I import the file into R.
So far this is what I have got after running this code:
df <- read.csv("data.csv")
View(df)
Student
Name Score
Maria 18
Bob 25
Paul 27
Region
Country Score
Italy 65
India 99
United 88
Sub region
City Score
Paris 77
New 55
Rio 78
How can I split them in such a way that I get this result:
First:
View(StudentDataFrame)
Name Score
Maria 18
Bob 25
Paul 27
Second:
View(regionDataFrame)
Country Score
Italy 65
India 99
United 88
Third:
View(SubRegionDataFrame)
City Score
Paris 77
New 55
Rio 78
One option would be to read the dataset with readLines, create a grouping variable ('grp') based on the location of 'Student', 'Region', 'Sub region' in the 'lines', split it, and read each piece with read.table:
i1 <- trimws(lines) %in% c("Student", "Region", "Sub region")
grp <- cumsum(i1)
lst <- lapply(split(lines[!i1], grp[!i1]), function(x)
read.table(text=x, stringsAsFactors=FALSE, header=TRUE))
lst
#$`1`
# Name Score
#1 Maria 18
#2 Bob 25
#3 Paul 27
#$`2`
# Country Score
#1 Italy 65
#2 India 99
#3 United 88
#$`3`
# City Score
#1 Paris 77
#2 New 55
#3 Rio 78
where 'lines' was read beforehand with:
lines <- readLines("data.csv")
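For a version that runs without the file, the lines can be supplied inline. This sketch assumes the fields are separated by whitespace, as in the printed output; for a real comma-separated file, sep = "," would need to be passed to read.table. The names `titles`, `is_title`, and `tables` are illustrative:

```r
# Stand-in for readLines("data.csv"), using the rows shown in the question
lines <- c("Student",    "Name Score",    "Maria 18", "Bob 25",   "Paul 27",
           "Region",     "Country Score", "Italy 65", "India 99", "United 88",
           "Sub region", "City Score",    "Paris 77", "New 55",   "Rio 78")

titles   <- c("Student", "Region", "Sub region")
is_title <- trimws(lines) %in% titles   # TRUE on the lines that start a table
grp      <- cumsum(is_title)            # 1, 2, 3: which table each line belongs to

# Drop the title lines, split the rest by table, and parse each chunk
tables <- lapply(split(lines[!is_title], grp[!is_title]), function(x)
  read.table(text = x, header = TRUE, stringsAsFactors = FALSE))

# Name the list elements after the title lines for easier access
names(tables) <- trimws(lines[is_title])
```

Each table is then available as, for example, tables$Student or tables[["Sub region"]].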

Make repeating character vector values

Hey there everyone, I'm just getting started with R, so I decided to make up some data with the eventual goal of superimposing it on a map.
Before I can get there, I'm trying to add a Province column to my data so I can sort by it.
Drugs <- c("Azin", "Prolof")
Provinces <- c("Ontario", "British Columbia", "Quebec")
Gender <- c("Female", "Male")
raw <- c(10,16,8,20,7,12,13,11,9,7,14,7)
yomom <- matrix(raw, nrow = 6, ncol = 2)
colnames(yomom) <- Drugs
bro <- data.frame(Gender, yomom)
idunno <- data.frame(Provinces, bro)
The first problem I've encountered is that the Provinces vector gets recycled (Ontario, British Columbia, Quebec, Ontario, ...). I'm not sure how to make each province fill two consecutive rows instead, one per gender.
Something like this?
idunno <- data.frame(Provinces=rep(Provinces,each=2), bro)
idunno
# Provinces Gender Azin Prolof
# 1 Ontario Female 10 13
# 2 Ontario Male 16 11
# 3 British Columbia Female 8 9
# 4 British Columbia Male 20 7
# 5 Quebec Female 7 14
# 6 Quebec Male 12 7
Read the documentation on rep(...)
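The difference between the two main arguments of rep() can be seen in a short sketch (the variable names `by_each` and `by_times` are illustrative):

```r
Provinces <- c("Ontario", "British Columbia", "Quebec")

# each = 2 repeats every element before moving on to the next one
by_each  <- rep(Provinces, each = 2)

# times = 2 repeats the whole vector, which is what recycling in data.frame() does
by_times <- rep(Provinces, times = 2)
```

Here by_each gives Ontario, Ontario, British Columbia, British Columbia, Quebec, Quebec, which lines up with the alternating Female/Male rows.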
