Read csv file with too many commas - r

Maybe it's easy, but I have a csv file with a lot of commas and R doesn't read it correctly: it puts all the data in the first column and doesn't present it as a table.
Do you know how I can get R to read the file correctly, as a regular csv file?
You can download the file here from the World Bank.
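Before any cleaning, it can help to peek at the first few raw lines of the file to see where the metadata rows and extra commas actually are (a quick diagnostic sketch; the file name is the one used in the answers below):
# inspect the raw text: how many lines precede the data and how many
# commas (i.e. fields) each line really contains
raw <- readLines("getdata_data_GDP.csv", n = 10)
raw
nchar(gsub("[^,]", "", raw))   # comma count per line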

Well, it takes some time to clean the file; fortunately, I have the time.
gdp2012 <- read.csv("getdata_data_GDP.csv", stringsAsFactors = FALSE)
cnames <- gdp2012[3, ]
names(cnames) <- NULL
cnames[1] <- "Abbrev"
cnames[5] <- "Millions.USD"
names(gdp2012) <- cnames
names(gdp2012)
[1] "Abbrev" "Ranking" "NA" "Economy" "Millions.USD"
[6] "" "NA" "NA" "NA" "NA"
gdp2012 <- gdp2012[, -grep("NA", names(gdp2012))]
gdp2012 <- gdp2012[, -ncol(gdp2012)]
gdp2012 <- gdp2012[-c(1:4, 237:nrow(gdp2012)), ]
dim(gdp2012)
[1] 232 4
str(gdp2012)
'data.frame': 232 obs. of 4 variables:
$ Abbrev : chr "USA" "CHN" "JPN" "DEU" ...
$ Ranking : chr "1" "2" "3" "4" ...
$ Economy : chr "United States" "China" "Japan" "Germany" ...
$ Millions.USD: chr " 16,244,600 " " 8,227,103 " " 5,959,718 " " 3,428,131 " ...
gdp2012[[4]] <- as.numeric(gsub(",", "", gdp2012[[4]]))
Warning message:
NAs introduced by coercion
gdp2012[[2]] <- as.numeric(gdp2012[[2]])
head(gdp2012)
Abbrev Ranking Economy Millions.USD
5 USA 1 United States 16244600
6 CHN 2 China 8227103
7 JPN 3 Japan 5959718
8 DEU 4 Germany 3428131
9 FRA 5 France 2612878
10 GBR 6 United Kingdom 2471784
If you want the row numbers to start at 1, just do
rownames(gdp2012) <- NULL
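Much of the manual slicing above can also be pushed into read.csv itself through its skip, nrows and na.strings arguments (a sketch under the same assumptions about the file layout as the code above, i.e. the real data occupies the 232 rows starting at line 6 of the file):
gdp2012 <- read.csv("getdata_data_GDP.csv",
                    skip = 5, header = FALSE, nrows = 232,
                    stringsAsFactors = FALSE, na.strings = c("", ".."))
# keep only the populated columns and give them readable names
gdp2012 <- gdp2012[, c(1, 2, 4, 5)]
names(gdp2012) <- c("Abbrev", "Ranking", "Economy", "Millions.USD")
# strip the thousands separators before converting to numeric
gdp2012$Millions.USD <- as.numeric(gsub(",", "", gdp2012$Millions.USD))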

Delete rows 1-3 and 238-241 in the csv itself, then use the rio library:
library(rio)
data <- import("filename")
There are lots of blank rows in data, so you can drop them with
data <- data[rowSums(is.na(data)) == 0, ]

Using data.table's fread function, which has many import arguments and is also faster:
library(data.table)
res <- fread("myFile.csv",
             sep = ",",                  # separator is comma
             skip = 5,                   # skip first 5 rows
             select = c(1, 2, 4, 5),     # select columns by index
             na.strings = c("", ".."),   # convert blanks to NA
             # set column names
             col.names = c("Country", "Ranking", "Economy", "USD_Mln"))
# remove blank rows
res <- res[ !is.na(Country), ]
# convert character numbers to numbers
res[ , USD_Mln := as.numeric(gsub(",", "", USD_Mln))]
head(res)
# Country Ranking Economy USD_Mln
# 1: USA 1 United States 16244600
# 2: CHN 2 China 8227103
# 3: JPN 3 Japan 5959718
# 4: DEU 4 Germany 3428131
# 5: FRA 5 France 2612878
# 6: GBR 6 United Kingdom 2471784

Related

Switching values to labels in a new column

I have a column of labelled values. Let's call it Country.
When I run:
attr(dat[["Country"]], "labels")
I get the following table:
USA Germany France UK Spain India Saudi Arabia
1 2 3 4 5 6 7
Now I have a new column of int values that are not labelled. Let's call it newCountry. I would like to change those int values to the labels of the original Country column. In other words, I would like to go, in an efficient way, from this...
3
2
2
1
5
4
to this...
France
Germany
Germany
USA
Spain
UK
The problem is that the data frame has a column, Country, with the attribute "labels" set. In turn, this attribute, which is just a vector, has the attribute "names" set. So the steps to get the "names" of the "labels" are:
Get the "labels" of column Country;
Get the "names" of the vector of labels;
Extract the names corresponding to a vector of indices, the vector i.
First read in the posted data.
nms <- scan(text = "USA Germany France UK Spain India 'Saudi Arabia'",
what = character())
i <- scan(text = "3 2 2 1 5 4")
Now create a data set example.
labs <- setNames(1:7, nms)
dat <- data.frame(Country = sample(letters, 7))
attr(dat[["Country"]], "labels") <- labs
And extract what the question asks for, following the steps above.
labsCountry <- attr(dat[["Country"]], "labels")
names(labsCountry)[i]
#[1] "France" "Germany" "Germany" "USA" "Spain" "UK"
Or a one-liner:
names(attr(dat[["Country"]], "labels"))[i]
#[1] "France" "Germany" "Germany" "USA" "Spain" "UK"
To see that this does not depend on the values of the labels, create a second example.
labs2 <- setNames(101:107, nms)
attr(dat[["Country"]], "labels") <- labs2
And though the "labels" are different, the same instructions work:
attr(dat[["Country"]], "labels")
# USA Germany France UK Spain India Saudi Arabia
# 101 102 103 104 105 106 107
labsCountry <- attr(dat[["Country"]], "labels")
names(labsCountry)[i]
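One caveat, added here as an assumption rather than part of the original question: if newCountry holds the label values themselves (e.g. 101-107 in the second example) instead of positions 1-7, plain positional indexing no longer lines up, and a match()-based lookup is the safer variant (a sketch with a hypothetical newCountry vector):
# hypothetical newCountry holding label values, not positions
newCountry <- c(103, 102, 102, 101, 105, 104)
labsCountry <- attr(dat[["Country"]], "labels")
# match() finds each value's position in the label vector, then we take its name
names(labsCountry)[match(newCountry, labsCountry)]
#[1] "France"  "Germany" "Germany" "USA"     "Spain"   "UK"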

Comparing answers and solutions in R (i.e., comparing two tables)

I have two tables: 1. an answers table and 2. a solution table.
The answer table is a list of name + answer pairs.
name=c("Jenns","Amy","Jake","Alison","Tommy","Jason","Alex","Vivian")
guess_answer=c("sdgf23894011","lp98ung67543","pwerugji22im","21loop98un89","9580ik8584sf","awe25f6ty788","k0o2jgpo146i","rgyhuj87630l")
answer=data.frame(cbind(name,guess_answer))
> answer
name guess_answer
1 Jenns sdgf23894011
2 Amy lp98ung67543
3 Jake pwerugji22im
4 Alison 21loop98un89
5 Tommy 9580ik8584sf
6 Jason awe25f6ty788
7 Alex k0o2jgpo146i
8 Vivian rgyhuj87630l
The solution table is a list of countries, each with a corresponding alphanumeric code (digits + letters).
corresponding_number=c("2341rg4524gr","9580ik7584sf","pp0or9rjg7n2","g0o2jgpo146i","lp98ung67543","pwerugji22im","lokibh678901")
country=c("US","UK","CN","AU","JP","KR", "NP")
counry_name=c("United State","United Kingdom","China","Australia","Japan","Korea","North Pole")
solution = cbind(country, corresponding_number,counry_name)
solution = data.frame(solution)
> solution
country corresponding_number counry_name
1 US 2341rg4524gr United State
2 UK 9580ik7584sf United Kingdom
3 CN pp0or9rjg7n2 China
4 AU g0o2jgpo146i Australia
5 JP lp98ung67543 Japan
6 KR pwerugji22im Korea
7 NP lokibh678901 North Pole
I would like to compare the answer table to the solution table: if the guess_answer is exactly the same as a corresponding_number, or differs from it by one character, it is considered correct. Then I want to create a table with the country, corresponding_number, and counry_name.
For example:
> newtable
name corresponding_number country_name
[1,] "xxx" "sdgf23894011" "xxx"
[2,] "JP" "lp98ung67543" "Japan"
[3,] "KR" "pwerugji22im" "Korea"
[4,] "xxx" "21loop98un89" "xxx"
[5,] "UK" "9580ik8584sf" "United Kingdom"
[6,] "xxx" "awe25f6ty788" "xxx"
[7,] "AU" "k0o2jgpo146i" "Australia"
[8,] "xxx" "rgyhuj87630l" "xxx"
The name needs to be either replaced by "xxx" if the answer is wrong, or by the country abbreviation if the answer is correct.
Whether the answer is correct or wrong is based on the guess_answer; the guess_answer is correct if it is a) exactly the same as one of the corresponding_number values, or b) one character different.
The guess_answer values will not change, but the column name will become "corresponding_number".
Include a third column showing the full country name; if the guess_answer is wrong, the corresponding full country name will be "xxx" as well.
edit: first condition.
Here one option is stringdist_left_join; after the join, use mutate to replace the NA elements with 'xxx'.
library(fuzzyjoin)
library(dplyr)
stringdist_left_join(answer, solution,
                     by = c("guess_answer" = "corresponding_number")) %>%
  mutate(corresponding_number = case_when(is.na(corresponding_number) ~ guess_answer,
                                          TRUE ~ corresponding_number),
         name = case_when(is.na(country) ~ 'xxx', TRUE ~ country),
         counry_name = replace(counry_name, is.na(counry_name), 'xxx')) %>%
  select(name, corresponding_number = guess_answer, counry_name)
# name corresponding_number counry_name
#1 xxx sdgf23894011 xxx
#2 JP lp98ung67543 Japan
#3 KR pwerugji22im Korea
#4 xxx 21loop98un89 xxx
#5 UK 9580ik8584sf United Kingdom
#6 xxx awe25f6ty788 xxx
#7 AU k0o2jgpo146i Australia
#8 xxx rgyhuj87630l xxx
data
answer <- data.frame(name,guess_answer, stringsAsFactors = FALSE)
solution <- data.frame(country, corresponding_number,
counry_name, stringsAsFactors = FALSE)
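One detail worth making explicit: stringdist_left_join matches within a default distance of max_dist = 2, while the question only allows a one-character difference. Passing max_dist = 1 encodes that rule directly (a sketch on the same data; coalesce is simply a shorter way to fill the NAs with 'xxx'):
library(fuzzyjoin)
library(dplyr)
stringdist_left_join(answer, solution,
                     by = c("guess_answer" = "corresponding_number"),
                     max_dist = 1) %>%                # at most one character different
  mutate(name = coalesce(country, "xxx"),
         counry_name = coalesce(counry_name, "xxx")) %>%
  select(name, corresponding_number = guess_answer, counry_name)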
In base R, we can use adist.
#Calculate distance between guess_answer and corresponding_number
mat <- adist(answer$guess_answer, solution$corresponding_number)
#assign default value to result column
answer$country_name <- 'xxx'
#select values with distance of less than or equal to 1
mat1 <- which(mat <= 1, arr.ind = TRUE)
#Order them by row
ord <- order(mat1[, 1])
#Assign values to the column
answer$country_name[mat1[ord, 1]] <- solution$counry_name[mat1[ord, 2]]
answer
# name guess_answer country_name
#1 Jenns sdgf23894011 xxx
#2 Amy lp98ung67543 Japan
#3 Jake pwerugji22im Korea
#4 Alison 21loop98un89 xxx
#5 Tommy 9580ik8584sf United Kingdom
#6 Jason awe25f6ty788 xxx
#7 Alex k0o2jgpo146i Australia
#8 Vivian rgyhuj87630l xxx
data
answer <- data.frame(name,guess_answer, stringsAsFactors = FALSE)
solution <- data.frame(country, corresponding_number,counry_name,
stringsAsFactors = FALSE)
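The adist approach fills in the full country name only; the question's desired output also replaces name with the country abbreviation. The matched indices can be reused for that (a small extension of the code above, reusing mat1 and ord):
# default to "xxx", then overwrite the rows that matched within distance 1
answer$name <- 'xxx'
answer$name[mat1[ord, 1]] <- solution$country[mat1[ord, 2]]
answer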

Regular Expressions to Unmerge row entries

I have an example data set given by
df <- data.frame(
country = c("GermanyBerlin", "England (UK)London", "SpainMadrid", "United States of AmericaWashington DC", "HaitiPort-au-Prince", "country66city"),
capital = c("#Berlin", "NA", "#Madrid", "NA", "NA", "NA"),
url = c("/country/germany/01", "/country/england-uk/02", "/country/spain/03", "country/united-states-of-america/04", "country/haiti/05", "country/country6/06"),
stringsAsFactors = FALSE
)
country capital url
1 GermanyBerlin #Berlin /country/germany/01
2 England (UK)London NA /country/england-uk/02
3 SpainMadrid #Madrid /country/spain/03
4 United States of AmericaWashington DC NA country/united-states-of-america/04
5 HaitiPort-au-Prince NA country/haiti/05
6 country66city NA country/country6/06
The aim is to tidy this so that the columns are as one would expect from their names:
the first should contain only the country name.
the second should contain the capital (without a # sign).
the third should remain unchanged.
So my desired output is:
country capital url
1 Germany Berlin /country/germany/01
2 England (UK) London /country/england-uk/02
3 Spain Madrid /country/spain/03
4 United States of America Washington DC country/united-states-of-america/04
5 Haiti Port-au-Prince country/haiti/05
6 country6 6city country/country6/06
In the cases where there are non-NA entries in the capital column, I have a piece of code that achieves this (see bottom of post).
Therefore I am looking for a solution that recognises that the pattern of the url column can be used to split the capital out of the country column.
This needs to account for the fact that
the URL text is all lower case, whilst the country name as it appears in the country column has mixed cases.
the text in the URL replaces spaces with hyphens.
the url removes special characters (such as the brackets around UK).
I would be interested to see how this aim can be achieved, presumably using regular expressions (though open to any options).
Partial solution when capital column is non-NA
Where there are non-NA entries in the capital column the following code achieves my aim:
library(dplyr)
library(stringr)
df %>% mutate(capital = str_replace(capital, "#", ""),
              country = str_replace(country, capital, ""))
country capital url
1 Germany Berlin /country/germany/01
2 England (UK)London NA /country/england-uk/02
3 Spain Madrid /country/spain/03
4 United States of AmericaWashington DC NA country/united-states-of-america/04
You can do
transform(df,capital=sub(".*[A-Z]\\S+([A-Z])","\\1",country))
country capital url
1 GermanyBerlin Berlin /country/germany/01
2 England (UK)London London /country/england-uk/02
3 SpainMadrid Madrid /country/spain/03
4 United States of AmericaWashington DC Washington DC country/united-states-of-america/04
You could start with something like this and keep on refining until you get the (100%) correct results and then see if you can skip/merge any steps.
library(magrittr)
df$country2 <- df$url %>%
gsub("-", " ", .) %>%
gsub(".+try/(.+)/.+", "\\1", .) %>%
gsub("(\\b[a-z])", "\\U\\1", ., perl = TRUE)
df$capital <- df$country %>%
gsub("[()]", " ", .) %>%
gsub(" +", " ", .) %>%
gsub(paste(df$country2, collapse = "|"), "", ., ignore.case = TRUE)
df$country <- df$country2
df$country2 <- NULL
df
country capital url
1 Germany Berlin /country/germany/01
2 England Uk London /country/england-uk/02
3 Spain Madrid /country/spain/03
4 United States Of America Washington DC country/united-states-of-america/04
5 Haiti Port-au-Prince country/haiti/05
6 Country6 6city country/country6/06

Extract|Grep|Substring character vector in R

Only the strings that start with ^passport need to be captured.
Example:
entry = c("passport AR4133553 expires 11 mar 2019","passport 472420180","passport 563220533 (korea, north)",
"passport iraq","passport m 788439","following data derived from an eritrean passport issued",
"passport and national")
Desired output: the data has to capture only the passport number and the country name.
passport      passport_country
"AR4133553"   NA
"472420180"   NA
"563220533"   "korea, north"
NA            "iraq"
"788439"      NA
NA            NA
NA            NA
#sample data
entry = c("passport AR4133553 expires 11 mar 2019",
"passport 472420180",
"passport 563220533 (korea, north)",
"passport iraq",
"passport m 788439",
"following data derived from an eritrean passport issued",
"passport and national")
#fetch passport number from sample data (i.e. the token immediately after 'passport' that contains digits)
passport_no <- gsub("^passport\\s((([a-zA-Z]*\\d)|(\\d[a-zA-Z]*))\\S*).*", "\\1", entry, perl=T)
ind <- grep("^passport\\s((([a-zA-Z]*\\d)|(\\d[a-zA-Z]*))\\S*).*", entry, value=F)
passport_no[-ind] <- NA
#fetch passport country from sample data
library(maptools)
data(wrld_simpl)
passport_country <- lapply(gsub("[()]", "", entry), function(x)
  as.character(wrld_simpl@data$NAME[sapply(wrld_simpl@data$NAME, grepl, x, ignore.case=T)]))
passport_country <- lapply(passport_country, function(x)
if(identical(x, character(0))) NA_character_ else x)
#note that 'Korea, North' is not selected in the above comparison as its official country name is 'Korea, Democratic People's Republic of'
#final data
df <- data.frame(cbind(passport_no, passport_country))
df
Output is:
passport_no passport_country
1 AR4133553 NA
2 472420180 NA
3 563220533 NA
4 NA Iraq
5 NA NA
6 NA Eritrea
7 NA NA
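As the comment in the code notes, the wrld_simpl lookup misses informal names such as 'korea, north'. A possible fallback, offered only as a sketch, is to take whatever sits inside parentheses in the entry and treat it as the country when the lookup fails:
# extract the text inside parentheses, NA when there are none
paren_country <- ifelse(grepl("\\(", entry),
                        sub(".*\\(([^)]*)\\).*", "\\1", entry),
                        NA_character_)
paren_country
# NA NA "korea, north" NA NA NA NA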

Find levels of a factors that appear more than once

I have this dataframe:
data <- data.frame(countries=c(rep('UK', 5),
rep('Netherlands 1a', 5),
rep('Netherlands', 5),
rep('USA', 5),
rep('spain', 5),
rep('Spain', 5),
rep('Spain 1a', 5),
rep('spain 1a', 5)),
var=rnorm(40))
countries var
1 UK 0.506232270
2 UK 0.976348808
3 UK -0.752151769
4 UK 1.137267199
5 UK -0.363406715
6 Netherlands 1a -0.800835463
7 Netherlands 1a 1.767724231
8 Netherlands 1a 0.810757929
9 Netherlands 1a -1.188975114
10 Netherlands 1a -0.763144245
11 Netherlands 0.428511920
12 Netherlands 0.835184425
13 Netherlands -0.198316780
14 Netherlands 1.108191193
15 Netherlands 0.946819500
16 USA 0.226786121
17 USA -0.466886468
18 USA -2.217910876
19 USA -0.003472937
20 USA -0.784264921
21 spain -1.418014562
22 spain 1.002412706
23 spain 0.472621627
24 spain -1.378960222
25 spain -0.197020702
26 Spain 1.197971896
27 Spain 1.227648883
28 Spain -0.253083684
29 Spain -0.076562960
30 Spain 0.338882352
31 Spain 1a 0.074459521
32 Spain 1a -1.136391220
33 Spain 1a -1.648418916
34 Spain 1a 0.277264011
35 Spain 1a -0.568411569
36 spain 1a 0.250151646
37 spain 1a -1.527885883
38 spain 1a -0.452190849
39 spain 1a 0.454168927
40 spain 1a 0.889401396
I want to be able to find levels of countries that appear in different forms more than once. Forms that levels of countries might appear in are:
lowercase, for example "spain"
titlecase, for example "Spain"
lowercase with a different word attached, for example "spain 1a"
titlecase with a different word attached, for example "Spain 1a"
So I need a function that returns a vector listing the levels of countries that appear more than once. In data, the vector that should be returned is:
"Netherlands 1a", "Netherlands", "spain", "Spain", "spain 1a", "Spain 1a"
Is it possible to make a function that would return this vector?
A quick solution that should meet all requirements (assuming that the country name is always the first element of your data$countries entries):
# Country substrings
country.substr <- sapply(strsplit(tolower(levels(data$countries)), " "), "[[", 1)
# Duplicated country substrings
country.substr.dupl <- duplicated(country.substr)
# Display all country levels that appear in different forms
do.call("c", lapply(unique(country.substr[country.substr.dupl]), function(i) {
levels(data$countries)[grep(i, tolower(levels(data$countries)))]
}))
[1] "Netherlands" "Netherlands 1a" "spain" "Spain" "spain 1a" "Spain 1a"
Update:
Assuming that the country name is not always to be found at the first position, you need to apply a different approach that I took from here. Note that I slightly modified your sample data to clarify what I'm doing:
data <- data.frame(countries=c(rep('United Kingdom', 5),
rep('united kingdom', 5),
rep('Netherlands', 5),
rep('Netherlands 1a', 5),
rep('1a Netherlands', 5),
rep('USA', 5),
rep('spain', 5),
rep('Spain', 5),
rep('Spain 1a', 5),
rep('spain 1a', 5)),
var=rnorm(50))
Now let's identify all country substrings that do NOT contain any numerics. The subsequent steps remain the same. Is that what you need?
# Remove mixed numeric/alphabetic parts from country names
country.substr <- lapply(strsplit(tolower(levels(data$countries)), " "), function(i) {
# Identify, paste and return alphabetic-only components
tmp <- grep("^[[:alpha:]]*$", i)
if (length(tmp) == 1)
return(i[tmp])
else
return(paste(i[tmp], collapse = " "))
})
# Identify duplicated country names
country.substr.dupl <- duplicated(country.substr)
# Display all country levels that appear in different forms
do.call("c", lapply(unique(country.substr[country.substr.dupl]), function(i) {
levels(data$countries)[grep(i, tolower(levels(data$countries)))]
}))
[1] "1a Netherlands" "Netherlands" "Netherlands 1a" "spain" "Spain" "spain 1a" "Spain 1a" "united kingdom" "United Kingdom"
Why not use grep? The ignore.case argument is just what you need here.
> uch <- unique(as.character(data$countries))
> found <- sapply(seq(uch), function(i){
if(!grepl("\\s|[0-9]", uch[i]))
grep(uch[i], uch, ignore.case = TRUE, value = TRUE)
})
> ff <- found[sapply(found, function(x) length(x) > 1)]
> unique(unlist(ff))
# [1] "Netherlands 1a" "Netherlands" "spain"
# [4] "Spain" "Spain 1a" "spain 1a"
Here's my logic: take the unique factor levels of the column as a character vector. Then compare it with itself, looking only at those levels that do not contain a space or a digit. grep will catch those, but the other way around is a bit tougher. Then we just find the unique matches. So here's a function and a test run,
find.matches <- function(column)
{
uch <- unique(as.character(column))
found <- sapply(seq(uch), function(i){
if(!grepl("\\s|[0-9]", uch[i]))
grep(uch[i], uch, ignore.case = TRUE, value = TRUE)
})
ff <- found[sapply(found, function(x) length(x) > 1)]
unique(unlist(ff))
}
> dat <- data.frame(x = c("a", "a1", "a 1b", "c", "d"),
y = c("fac", "tor", "fac 1a", "tor1a", "fac"))
> sapply(dat, find.matches)
# $x
# [1] "a" "a1" "a 1b"
#
# $y
# [1] "fac" "fac 1a" "tor" "tor1a"
