Loading data with missing values as numeric data - r

I am trying to impute missing values using the mi package in r and ran into a problem.
When I load the data into R, it reads the column with missing values as a factor variable. If I convert it to numeric with the command
dataset$Income <- as.numeric(dataset$Income)
it converts the column to ordinal codes (the smallest value becomes 1, the second smallest 2, etc.).
I want to convert this column to numeric values while retaining the original values of the variable. How can I do this?
EDIT:
Since people have asked, here is my code and an example of what the data looks like.
DATA:
96 GERMANY 6 1960 72480 73 50.24712 NA 0.83034767 0
97 GERMANY 6 1961 73123 85 48.68375 NA 0.79377610 0
98 GERMANY 6 1962 73739 98 48.01359 NA 0.70904115 0
99 GERMANY 6 1963 74340 132 46.93588 NA 0.68753213 0
100 GERMANY 6 1964 74954 146 47.89413 NA 0.67055298 0
101 GERMANY 6 1965 75638 160 47.51518 NA 0.64411484 0
102 GERMANY 6 1966 76206 172 48.46009 NA 0.58274711 0
103 GERMANY 6 1967 76368 183 48.18423 NA 0.57696055 0
104 GERMANY 6 1968 76584 194 48.87967 NA 0.64516949 0
105 GERMANY 6 1969 77143 210 49.36219 NA 0.55475352 0
106 GERMANY 6 1970 77783 227 49.52712 3,951.00 0.53083969 0
107 GERMANY 6 1971 78354 242 51.01421 4,282.00 0.51080717 0
108 GERMANY 6 1972 78717 254 51.02941 4,655.00 0.48773913 0
109 GERMANY 6 1973 78950 264 50.61033 5,110.00 0.48390087 0
110 GERMANY 6 1974 78966 270 48.82353 5,561.00 0.56562229 0
111 GERMANY 6 1975 78682 284 50.50279 6,092.00 0.56846030 0
112 GERMANY 6 1976 78298 301 49.22833 6,771.00 0.53536154 0
113 GERMANY 6 1977 78160 321 49.18999 7,479.00 0.55012371 0
Code:
Income <- dataset$Income
gives me a factor variable, as there are NA's in the data. If I try to turn it into numeric with
as.numeric(Income)
it throws away the original values and replaces them with their rank in the column. I would like to keep the original values while still recognizing missing values.

A problem every data manager from Germany knows: the column with the NAs contains numbers with commas. But R only understands the English style of decimal points, without digit grouping, so this column is treated as an ordinally scaled character variable.
Remove the commas and you'll get the numeric values.
By the way, even though we write decimal commas in Germany, numbers like 3,951.00 don't make sense in German notation, where the comma is the decimal separator. See these examples of international number syntax.
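A minimal sketch of that fix, using the question's Income column (the sample values are taken from the data above):

```r
# The Income column was read in as a factor of strings like "3,951.00".
# Calling as.numeric() directly on a factor returns its internal level
# codes, so convert to character first, then strip the grouping commas.
income_raw <- factor(c(NA, "3,951.00", "4,282.00"))

as.numeric(income_raw)   # level codes, not the original values

income <- as.numeric(gsub(",", "", as.character(income_raw)))
income                   # NA 3951 4282
```

NA values in the original column survive the conversion as numeric NA, which is exactly what mi needs for imputation.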

Related

Importing Data in R

I want to import data into R but I am getting a few errors. I downloaded my ".CSV" file to my computer, specified the file path with setwd("C:/Users/intellipaat/Desktop/BLOG/files"), and then ran read.data <- read.csv("file1.csv"), but the console returns an error like this:
read.data <- read.csv(file1.csv)
Error in read.table(file = file, header = header, sep = sep, quote = quote,  :
  object 'file1.csv' not found
What should I do about this? I also tried the internet link route, but again I encountered a problem.
I wrote this:
install.packages("XML")
install.packages("RCurl")
and then, to load the packages:
library("XML")
library("RCurl")
url <- "https://en.wikipedia.org/wiki/Ease_of_doing_business_index#Ranking"
tabs <- getURL(url)
and the console gave me this error:
Error in function (type, msg, asError = TRUE)  :
  error:1407742E:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert protocol version
I would be glad if you could help me with this...
The Ease of Doing Business rankings table on Wikipedia is an HTML table, not a comma-separated values file.
Loading the HTML table into an R data frame can be handled in a relatively straightforward manner with the rvest package. Instead of downloading the HTML file we can read it directly into R with read_html(), and then use html_table() to extract the tabular data into a data frame.
library(rvest)
wiki_url <- "https://en.wikipedia.org/wiki/Ease_of_doing_business_index#Ranking"
aPage <- read_html(wiki_url)
aTable <- html_table(aPage)[[2]] # second node is table of rankings
head(aTable)
...and the first few rows of output:
> head(aTable)
Classification Jurisdiction 2020 2019 2018 2017 2016 2015 2014 2013 2012
1 Very Easy New Zealand 1 1 1 1 2 2 3 3 3
2 Very Easy Singapore 2 2 2 2 1 1 1 1 1
3 Very Easy Hong Kong 3 4 5 4 5 3 2 2 2
4 Very Easy Denmark 4 3 3 3 3 4 5 5 5
5 Very Easy South Korea 5 5 4 5 4 5 7 8 8
6 Very Easy United States 6 8 6 8 7 7 4 4 4
2011 2010 2009 2008 2007 2006
1 3 2 2 2 2 1
2 1 1 1 1 1 2
3 2 3 4 4 5 7
4 6 6 5 5 7 8
5 16 19 23 30 23 27
6 5 4 3 3 3 3
>
Next, we confirm that the last countries were read correctly: Libya, Yemen, Venezuela, Eritrea, and Somalia.
> tail(aTable,n=5)
Classification Jurisdiction 2020 2019 2018 2017 2016 2015 2014 2013 2012
186 Below Average Libya 186 186 185 188 188 188 187 N/A N/A
187 Below Average Yemen 187 187 186 179 170 137 133 118 99
188 Below Average Venezuela 188 188 188 187 186 182 181 180 177
189 Below Average Eritrea 189 189 189 189 189 189 184 182 180
190 Below Average Somalia 190 190 190 190 N/A N/A N/A N/A N/A
2011 2010 2009 2008 2007 2006
186 N/A N/A N/A N/A N/A N/A
187 105 99 98 113 98 90
188 172 177 174 172 164 120
189 180 175 173 171 170 137
190 N/A N/A N/A N/A N/A N/A
Finally, we use tidyr and dplyr to convert the data to narrow format tidy data for subsequent analysis.
library(dplyr)
library(tidyr)
aTable %>%
  # convert years 2017 - 2020 to character because pivot_longer()
  # requires all columns to be of the same data type
  mutate_at(3:6, as.character) %>%
  pivot_longer(-c(Classification, Jurisdiction),
               names_to = "Year", values_to = "Rank") %>%
  # convert Rank and Year to numeric values (introducing NA values)
  mutate_at(c("Rank", "Year"), as.numeric) -> rankings
head(rankings)
...and the output:
> head(rankings)
# A tibble: 6 x 4
Classification Jurisdiction Year Rank
<chr> <chr> <dbl> <dbl>
1 Very Easy New Zealand 2020 1
2 Very Easy New Zealand 2019 1
3 Very Easy New Zealand 2018 1
4 Very Easy New Zealand 2017 1
5 Very Easy New Zealand 2016 2
6 Very Easy New Zealand 2015 2
>

Multiple lines on a line plot in R [duplicate]

This question already has answers here:
Plotting two variables as lines using ggplot2 on the same graph
(5 answers)
Closed 4 years ago.
I'm trying to create a line plot in R, showing lines for different places over time.
My data is in a table with Year in the first column, the places England, Scotland, Wales, NI as separate columns:
Year England Scotland Wales NI
1 2006/07 NA 411 188 111
2 2007/08 NA 415 193 112
3 2008/09 NA 424 194 114
4 2009/10 NA 429 194 115
5 2010/11 NA 428 199 116
6 2011/12 NA 428 200 116
7 2012/13 NA 425 199 117
8 2013/14 NA 427 202 117
9 2014/15 NA 431 200 121
10 2015/16 3556 432 199 126
11 2016/17 3436 431 200 129
12 2017/18 3467 NA NA NA
I'm using ggplot, and can get a lineplot for any of the places, but I'm having difficulty getting lines for all the places on the same plot.
It seems like this might work if I had the places in a column as well (instead of across the top), since I could then set y in the code below to that column rather than to a specific place. But that seems a bit convoluted, and as I have lots of data in the existing format, I'm hoping there's either a way to do this with the format I have or a quick way of transforming it.
ggplot(data = mysheets$sheet1, aes(x = Year, y = England, group = 1)) +
  geom_line() +
  geom_point()
From what I can tell, I'll need to reshape my data (into long form?) but I haven't found a way to do that where I don't have a column for places (i.e., I have a column for each place but the table doesn't have a way of saying these are all places and the same kind of thing).
I've also tried transposing my data, so the places are down the side and the years are along the top, but R still has its own headers for the columns - I guess another option might be if it was possible to have the years as headers and have that recognised by R?
As you said, you have to convert your data to long format to get the most out of ggplot2.
library(ggplot2)
library(dplyr)
mydata_raw <- read.table(
text = "
Year England Scotland Wales NI
1 2006/07 NA 411 188 111
2 2007/08 NA 415 193 112
3 2008/09 NA 424 194 114
4 2009/10 NA 429 194 115
5 2010/11 NA 428 199 116
6 2011/12 NA 428 200 116
7 2012/13 NA 425 199 117
8 2013/14 NA 427 202 117
9 2014/15 NA 431 200 121
10 2015/16 3556 432 199 126
11 2016/17 3436 431 200 129
12 2017/18 3467 NA NA NA"
)
# long format
mydata <- mydata_raw %>%
  tidyr::gather(country, value, England:NI) %>%
  dplyr::mutate(Year = as.numeric(substring(Year, 1, 4))) # convert to numeric date

ggplot(mydata, aes(x = Year, y = value, color = country)) +
  geom_line() +
  geom_point()
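A side note: gather() still works but has been superseded by pivot_longer() in current versions of tidyr, so an equivalent reshape can be sketched as follows (the miniature data frame stands in for the question's full table):

```r
library(tidyr)
library(dplyr)

# miniature stand-in for the question's wide table
mydata_raw <- data.frame(
  Year     = c("2015/16", "2016/17"),
  England  = c(3556, 3436),
  Scotland = c(432, 431),
  Wales    = c(199, 200),
  NI       = c(126, 129)
)

# one row per place and year, ready for the same ggplot() call as above
mydata <- mydata_raw %>%
  pivot_longer(England:NI, names_to = "country", values_to = "value") %>%
  mutate(Year = as.numeric(substring(Year, 1, 4)))
```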

Frequency categories getting randomly split with table function

I have a large 2 column data frame (a) with country codes (ALB, ALG, ...) and years. There are thousands of unordered rows so the countries rows repeat often and randomly:
> a
Country Year
1 ALB 1991
2 ALB 1993
3 ALB 1994
4 ALB 1994
5 ALB 1996
6 ALG 1996
7 ALG 1971
8 AUS 1942
9 BLG 1998
10 BLG 1923
11 PAR 1957
12 PAR 1994
...
I tried frequency <- data.frame(table(a[,1])) but it does something really weird. It gives me something like this:
Var1 Freq
1 AFG 1
2 ALB 3
3 ARG 1
4 AUS 1
5 AUT 3
6 AZE 2
7 BEL 3
8 BEN 2
9 BGD 3
10 BIH 4
...
129 ALB 33
130 ALG 73
131 AMS 7
132 ANC 1
133 AND 3
134 ANG 36
135 ANT 4
136 ARG 148
137 ARM 12
138 AUS 268
139 AUT 144
...
It'll go through most of the country codes, then go through them once more, giving me one or two entries per country. If I add the frequencies up, they give the correct total for each country... but I have no idea why they're getting split like this.
In addition, the countries are getting split at all sorts of random places. The first instance is usually a relatively small number (no more than 20, with one exception) while the second instance is usually, but not always, larger. Some countries (AFG) appear only in the first instance while others (ANC) appear only in the second...
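No answer is recorded here, but table() can only produce one row per distinct string, so duplicated categories in its output mean the codes are not byte-identical, e.g. stray leading/trailing whitespace or mixed encodings making "ALB" and "ALB " separate categories. A minimal sketch of that diagnosis and fix (the whitespace cause is an assumption about this particular data):

```r
# Two codes that print alike but differ by a trailing space
a <- data.frame(Country = c("ALB", "ALB ", "ALB", "ALG"),
                Year    = c(1991, 1993, 1994, 1996))

table(a$Country)                  # "ALB" and "ALB " counted separately

a$Country <- trimws(a$Country)    # strip leading/trailing whitespace
freq <- data.frame(table(a$Country))
freq                              # ALB 3, ALG 1
```

Inspecting unique(nchar(a$Country)) or dput(unique(a$Country)) is a quick way to confirm whether invisible characters are the culprit.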

Adding data frame below another data frame [duplicate]

This question already has answers here:
Combine two data frames by rows (rbind) when they have different sets of columns
(14 answers)
Closed 5 years ago.
I want to do the following:
I have a Actual Sales Dataframe
Dates Actual
24/04/2017 58
25/04/2017 59
26/04/2017 58
27/04/2017 154
28/04/2017 117
29/04/2017 127
30/04/2017 178
Another data frame of Predicted values
Dates Predicted
01/05/2017 68.54159
02/05/2017 90.7313
03/05/2017 82.76875
04/05/2017 117.48913
05/05/2017 110.3809
06/05/2017 156.53363
07/05/2017 198.14819
I want to add the Predicted data frame below the Actual data frame, in the following manner:
Dates Actual Predicted
24/04/2017 58
25/04/2017 59
26/04/2017 58
27/04/2017 154
28/04/2017 117
29/04/2017 127
30/04/2017 178
01/05/2017 68.54159
02/05/2017 90.7313
03/05/2017 82.76875
04/05/2017 117.48913
05/05/2017 110.3809
06/05/2017 156.53363
07/05/2017 198.14819
With:
library(dplyr)
bind_rows(d1, d2)
you get:
Dates Actual Predicted
1 24/04/2017 58 NA
2 25/04/2017 59 NA
3 26/04/2017 58 NA
4 27/04/2017 154 NA
5 28/04/2017 117 NA
6 29/04/2017 127 NA
7 30/04/2017 178 NA
8 01/05/2017 NA 68.54159
9 02/05/2017 NA 90.73130
10 03/05/2017 NA 82.76875
11 04/05/2017 NA 117.48913
12 05/05/2017 NA 110.38090
13 06/05/2017 NA 156.53363
14 07/05/2017 NA 198.14819
Or with:
library(data.table)
rbindlist(list(d1,d2), fill = TRUE)
Or with:
library(plyr)
rbind.fill(d1,d2)
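If avoiding extra packages matters, the same result can be sketched in base R by first padding each frame with the other's missing columns (d1 and d2 below are miniature stand-ins for the question's Actual and Predicted frames):

```r
# miniature stand-ins for the question's two frames
d1 <- data.frame(Dates = c("24/04/2017", "25/04/2017"),
                 Actual = c(58, 59))
d2 <- data.frame(Dates = c("01/05/2017", "02/05/2017"),
                 Predicted = c(68.54159, 90.73130))

# give each frame the columns it lacks, filled with NA, then rbind()
d1[setdiff(names(d2), names(d1))] <- NA
d2[setdiff(names(d1), names(d2))] <- NA
combined <- rbind(d1, d2)
```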

creating unique sequence for October 15 to April 30th following year- R

Basically, I'm looking at snowpack data. I want to assign a unique value to each date (column "snowday") over the period October 15 to May 15 of the following year (the winter season, of course), about 215 days, then add a "snowmonth" column that numbers the months of each season sequentially, as well as a "snowyear" column giving the year in which each seasonal record starts.
There are some missing dates, but instead of finding those dates and inserting NA rows, I've opted to skip that step and go the sequential route, which can then be plotted with respect to "snowmonth".
Basically, I just need the "snowday" sequence of about 1:215 (+1 for leap years) down a column, and the rest I can do myself. The data looks like this:
month day year depth date yearday snowday snowmonth
12 26 1955 27 1955-12-26 360 NA NA
12 27 1955 24 1955-12-27 361 NA NA
12 28 1955 24 1955-12-28 362 NA NA
12 29 1955 24 1955-12-29 363 NA NA
12 30 1955 26 1955-12-30 364 NA NA
12 31 1955 26 1955-12-31 365 NA NA
1 1 1956 25 1956-01-01 1 NA NA
1 2 1956 25 1956-01-02 2 NA NA
1 3 1956 26 1956-01-03 3 NA NA
man<-data.table()
man <-  read.delim('mansfieldstake.txt',header=TRUE, check.names=FALSE)
man[is.na(man)]<-0
man$date<-paste(man$yy, man$mm, man$dd,sep="-", collapse=NULL)
man$yearday<-NA #day of the year 1-365
colnames(man)<- c("month","day","year","depth", "date","yearday")
man$date<-as.Date(man$date)
man$yearday<-yday(man$date)
man$snowday<-NA
man$snowmonth<-NA
man[420:500,]
head(man)
output would look something like this:
month day year depth date yearday snowday snowmonth
12 26 1955 27 1955-12-26 360 73 3
12 27 1955 24 1955-12-27 361 74 3
12 28 1955 24 1955-12-28 362 75 3
12 29 1955 24 1955-12-29 363 76 3
12 30 1955 26 1955-12-30 364 77 3
12 31 1955 26 1955-12-31 365 78 3
1 1 1956 25 1956-01-01 1 79 4
1 2 1956 25 1956-01-02 2 80 4
1 3 1956 26 1956-01-03 3 81 4
I've thought about loops and all that, but they're inefficient, and leap years mess things up as well; this has become more challenging than I thought. Good first project, though!
I'm just looking for a simple sequence here, dropping all non-snow months. Thanks to anybody who's got input!
If I understand correctly that snowday should be the number of days since the beginning of the season, all you need to make this column using data.table is:
day_one <- as.Date("1955-10-01")
man[, snowday := as.integer(date - day_one)]  # days since the season start
If all you want is a sequence of unique values, then seq() is your best bet.
Then you can create the snowmonth using:
library(lubridate)
man[, snowmonth := floor(-time_length(interval(date, day_one), unit = "month"))]
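Since each season crosses a year boundary, a per-row season start avoids the fixed day_one and handles leap years automatically. The sketch below assumes the October 15 start date from the question's text; the values it produces match the question's expected output (snowday 73 for 1955-12-26, 79 for 1956-01-01):

```r
# miniature stand-in for the question's table
man <- data.frame(date = as.Date(c("1955-12-26", "1955-12-31", "1956-01-01")))

# season start year: January-September rows belong to the season that
# began the previous October
mon <- as.integer(format(man$date, "%m"))
yr  <- as.integer(format(man$date, "%Y"))
man$snowyear <- ifelse(mon >= 10, yr, yr - 1)

# days since that season's October 15 (October 15 itself = snowday 1)
season_start <- as.Date(paste0(man$snowyear, "-10-15"))
man$snowday  <- as.integer(man$date - season_start) + 1

# sequential month of the season (October = 1, ..., December = 3, January = 4)
man$snowmonth <- ((mon - 10) %% 12) + 1
```

Because snowday is computed as an actual date difference, leap days within a season are counted correctly without any special casing.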
