I have a large 2-column data frame (a) with country codes (ALB, ALG, ...) and years. There are thousands of unordered rows, so the country rows repeat often and in no particular order:
> a
Country Year
1 ALB 1991
2 ALB 1993
3 ALB 1994
4 ALB 1994
5 ALB 1996
6 ALG 1996
7 ALG 1971
8 AUS 1942
9 BLG 1998
10 BLG 1923
11 PAR 1957
12 PAR 1994
...
I tried frequency <- data.frame(table(a[,1])) but it does something really weird. It gives me something like this:
Var1 Freq
1 AFG 1
2 ALB 3
3 ARG 1
4 AUS 1
5 AUT 3
6 AZE 2
7 BEL 3
8 BEN 2
9 BGD 3
10 BIH 4
...
129 ALB 33
130 ALG 73
131 AMS 7
132 ANC 1
133 AND 3
134 ANG 36
135 ANT 4
136 ARG 148
137 ARM 12
138 AUS 268
139 AUT 144
...
It'll go through most of the country codes, then go through them once more, giving me one or two entries for each country. If I add the frequencies up, they give the correct total for their respective countries... but I have no idea why they're getting split like this.
In addition, the countries are getting split at all sorts of random places. The first instance is a relatively small number (no more than 20, with one exception), while the second instance is usually, but not always, a larger number. Some countries (AFG) only appear in the first instance, while others (ANC) only appear in the second...
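One likely cause worth checking first: the Country column may contain near-duplicate labels, e.g. codes with stray leading or trailing whitespace, which table() counts as separate levels. A small hypothetical illustration:

```r
# "ALB", "ALB " and " ALB" are distinct values, so table() splits one
# country's count across several entries.
Country <- c("ALB", "ALB ", "ALB", " ALB")
table(Country)              # three separate entries for the same code
Country <- trimws(Country)  # strip leading/trailing whitespace
table(Country)              # a single entry: ALB 4
```

Comparing length(unique(a$Country)) with length(unique(trimws(a$Country))) on the real data would confirm or rule this out.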
I am trying to replicate the same column values for the next 2 cells in the column using R.
I have a data frame of the following form:
Time World Cate Data
1994 Africa A 12
1994 B 17
1994 C 22
1994 Asia A 55
1994 B 10
1994 C 58
1995 Africa A 62
1995 B 87
1995 C 12
1995 Asia A 59
1995 B 12
1995 C 38
and I want to convert it to the following form:
Time World Cate Data
1994 Africa A 12
1994 Africa B 17
1994 Africa C 22
1994 Asia A 55
1994 Asia B 10
1994 Asia C 58
1995 Africa A 62
1995 Africa B 87
1995 Africa C 12
1995 Asia A 59
1995 Asia B 12
1995 Asia C 38
Use fill from the tidyr package:
If your dataframe is called dat, then
dat <- tidyr::fill(dat, World)
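One caveat: fill() only replaces NA values, so if the blank World cells were read in as empty strings they need to be converted to NA first. A minimal sketch (column names taken from the question; dplyr::na_if does the conversion):

```r
library(dplyr)
library(tidyr)

dat <- data.frame(Time  = rep(1994, 3),
                  World = c("Africa", "", ""),  # blanks read as empty strings
                  Cate  = c("A", "B", "C"),
                  Data  = c(12, 17, 22))
dat <- dat %>%
  mutate(World = na_if(World, "")) %>%  # "" -> NA
  fill(World)                           # carry "Africa" downward
dat$World                               # "Africa" "Africa" "Africa"
```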
Using the na.locf() function from the zoo package (this likewise assumes the blank World cells are NA):
library(zoo)
na.locf(df)
Time World Cate Data
1 1994 Africa A 12
2 1994 Africa B 17
3 1994 Africa C 22
4 1994 Asia A 55
5 1994 Asia B 10
6 1994 Asia C 58
7 1995 Africa A 62
8 1995 Africa B 87
9 1995 Africa C 12
10 1995 Asia A 59
11 1995 Asia B 12
12 1995 Asia C 38
Code
# each World label heads a block of three rows (Cate A, B, C),
# so take every third value and repeat it three times
dummy$World <- rep(dummy$World[seq(1, nrow(dummy), by = 3)], each = 3)
dummy
I want to import data into R, but I am getting a few errors. I downloaded my ".CSV" file to my computer, set the working directory with setwd("C:/Users/intellipaat/Desktop/BLOG/files"), and then ran read.data <- read.csv("file1.csv"), but the console returns this error:
"read.data<-read.csv(file1.csv)
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
'file1.csv' object not found
What should I do about this? I also tried the internet-link route, but again I encountered a problem.
I wrote this:
install.packages("XML")
install.packages("RCurl")
and, to load the packages, ran the following commands:
library("XML")
library("RCurl")
url <- "https://en.wikipedia.org/wiki/Ease_of_doing_business_index#Ranking"
tabs <- getURL(url)
and the console returned this error:
Error in function (type, msg, asError = TRUE) :
error:1407742E:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert protocol version
I would be glad if you could help me with this...
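There are two separate problems here. The first error most likely comes from missing quotes: read.csv(file1.csv) makes R look for an object named file1.csv instead of opening the file, so the path has to be a quoted string. A self-contained illustration (a temporary file stands in for file1.csv):

```r
# read.csv(file1.csv) fails because the bare name is evaluated as a variable;
# the file path must be passed as a character string.
tmp <- tempfile(fileext = ".csv")       # stand-in for "file1.csv"
write.csv(data.frame(x = 1:3), tmp, row.names = FALSE)
read.data <- read.csv(tmp)              # note the quotes around the path
nrow(read.data)                         # 3
```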
The Ease of Doing Business rankings table on Wikipedia is an HTML table, not a comma separated values file.
Loading the HTML table into an R data frame can be handled in a relatively straightforward manner with the rvest package. Instead of downloading the HTML file we can read it directly into R with read_html(), and then use html_table() to extract the tabular data into a data frame.
library(rvest)
wiki_url <- "https://en.wikipedia.org/wiki/Ease_of_doing_business_index#Ranking"
aPage <- read_html(wiki_url)
aTable <- html_table(aPage)[[2]] # second node is table of rankings
head(aTable)
...and the first few rows of output:
> head(aTable)
Classification Jurisdiction 2020 2019 2018 2017 2016 2015 2014 2013 2012
1 Very Easy New Zealand 1 1 1 1 2 2 3 3 3
2 Very Easy Singapore 2 2 2 2 1 1 1 1 1
3 Very Easy Hong Kong 3 4 5 4 5 3 2 2 2
4 Very Easy Denmark 4 3 3 3 3 4 5 5 5
5 Very Easy South Korea 5 5 4 5 4 5 7 8 8
6 Very Easy United States 6 8 6 8 7 7 4 4 4
2011 2010 2009 2008 2007 2006
1 3 2 2 2 2 1
2 1 1 1 1 1 2
3 2 3 4 4 5 7
4 6 6 5 5 7 8
5 16 19 23 30 23 27
6 5 4 3 3 3 3
>
Next, we confirm that the last countries were read correctly: Libya, Yemen, Venezuela, Eritrea, and Somalia.
> tail(aTable,n=5)
Classification Jurisdiction 2020 2019 2018 2017 2016 2015 2014 2013 2012
186 Below Average Libya 186 186 185 188 188 188 187 N/A N/A
187 Below Average Yemen 187 187 186 179 170 137 133 118 99
188 Below Average Venezuela 188 188 188 187 186 182 181 180 177
189 Below Average Eritrea 189 189 189 189 189 189 184 182 180
190 Below Average Somalia 190 190 190 190 N/A N/A N/A N/A N/A
2011 2010 2009 2008 2007 2006
186 N/A N/A N/A N/A N/A N/A
187 105 99 98 113 98 90
188 172 177 174 172 164 120
189 180 175 173 171 170 137
190 N/A N/A N/A N/A N/A N/A
Finally, we use tidyr and dplyr to convert the data to narrow format tidy data for subsequent analysis.
library(dplyr)
library(tidyr)
aTable %>%
# convert years 2017 - 2020 to character because pivot_longer()
# requires all columns to be of same data type
mutate_at(3:6,as.character) %>%
pivot_longer(-c(Classification,Jurisdiction),
names_to="Year",values_to="Rank") %>%
# convert Rank and Year to numeric values (introducing NA values)
mutate_at(c("Rank","Year"),as.numeric) -> rankings
head(rankings)
...and the output:
> head(rankings)
# A tibble: 6 x 4
Classification Jurisdiction Year Rank
<chr> <chr> <dbl> <dbl>
1 Very Easy New Zealand 2020 1
2 Very Easy New Zealand 2019 1
3 Very Easy New Zealand 2018 1
4 Very Easy New Zealand 2017 1
5 Very Easy New Zealand 2016 2
6 Very Easy New Zealand 2015 2
>
I might be overcomplicating things - would love to know if there is an easier way to solve this. I have a data frame (df) with 5654 observations - 1332 are foreign-born and 4322 Canada-born subjects.
The variable df$YR_IMM captures: "In what year did you come to live in Canada?"
See the following distribution of observations per immigration year table(df$YR_IMM) :
1920 1926 1928 1930 1939 1942 1944 1946 1947 1948 1949 1950 1951 1952 1953 1954
2 1 1 2 1 2 1 1 1 9 5 1 7 13 3 5
1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970
10 5 8 6 6 1 5 1 6 3 7 16 18 12 15 13
1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986
10 17 8 18 25 16 15 12 16 27 13 16 11 9 17 16
1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003
24 21 31 36 26 30 26 24 22 30 29 26 47 52 53 28 9
Naturally, these are only foreign-born individuals (mean = 1985) - however, 348 foreign-born subjects are missing a year. There are a total of 4670 NAs, which also include the Canada-born subjects.
How can I code these df$YR_IMM NAs in such a way that
348 (NA) --> 1985
4322 (NA) --> 100
Additionally, the birth status is given by df$Brthcoun, with 0 = "born in Canada" and 1 = "born outside of Canada".
Hope this makes sense - thank you!
EDIT: This was the solution ->
df$YR_IMM[is.na(df$YR_IMM) & df$Brthcoun == 0] <- 100
df$YR_IMM[is.na(df$YR_IMM) & df$Brthcoun == 1] <- 1985
Try the below code:
df$YR_IMM[is.na(df$YR_IMM) & df$Brthcoun == 0] <- 100
df$YR_IMM[is.na(df$YR_IMM) & df$Brthcoun == 1] <- 1985
I hope this helps!
Something like this should also work (the outer is.na() test keeps the recorded years unchanged):
df$YR_IMM <- ifelse(is.na(df$YR_IMM),
                    ifelse(df$Brthcoun == 0, 100, 1985),
                    df$YR_IMM)
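An equivalent formulation with dplyr's case_when() keeps the non-missing years explicitly via the fallback branch (a sketch assuming the column names from the question, on toy data):

```r
library(dplyr)

# toy data standing in for the real df
df <- data.frame(YR_IMM = c(NA, NA, 1990), Brthcoun = c(0, 1, 1))
df <- df %>%
  mutate(YR_IMM = case_when(
    is.na(YR_IMM) & Brthcoun == 0 ~ 100,   # Canada-born, no year applicable
    is.na(YR_IMM) & Brthcoun == 1 ~ 1985,  # foreign-born, year missing
    TRUE ~ YR_IMM                          # keep recorded years as-is
  ))
df$YR_IMM  # 100 1985 1990
```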
I have a table of the form:
headers c1.r1.s1 c1.r1.s2 c1.r2.s1 c1.r2.s2 c2.r1.s1
c1.r1.s1 34 76 86 21 45
c1.r1.s2 85 34 47 35 97
c1.r2.s1 12 25 64 47 23
c1.r2.s1 87 54 78 31 25
c2.r1.s1 34 67 49 10 72
where the headers of columns (and rows) represent a combination of country (1 and 2), region (1 and 2) and sector (1 and 2). Let's name the first column "headers" for convenience.
I would like to add two additional rows and columns with partial sums, defined by the headers.
For the first extra row and column, I would like to add the values defined by the same region in the same country (within a certain column and row):
headers c1.r1.s1 c1.r1.s2 c1.r2.s1 c1.r2.s2 c2.r1.s1 sum1r
c1.r1.s1 **34** **76** 86 21 45 **110**
c1.r1.s2 **85** **44** 47 35 97 **129**
c1.r2.s1 12 25 **64** **47** 23 **111**
c1.r2.s1 87 54 **78** **31** 25 **109**
c2.r1.s1 34 67 49 10 **72** **72**
sum1c **119** **120** **142** **78** **72**
For the second extra column and row, I want something similar but adding the values of the same country (as defined in the header):
headers c1.r1.s1 c1.r1.s2 c1.r2.s1 c1.r2.s2 c2.r1.s1 sum1r sum2r
c1.r1.s1 **34** **76** **86** **21** 45 110 **217**
c1.r1.s2 **85** **44** **47** **35** 97 129 **211**
c1.r2.s1 **12** **25** **64** **47** 23 111 **148**
c1.r2.s1 **87** **54** **78** **31** 25 109 **250**
c2.r1.s1 34 67 49 10 **72** 72 **72**
sum1c 119 120 142 78 72
sum2c **218** **199** **275** **144** **72**
My main problem is that I have many countries, regions, and sectors, and I can't get my head around how to code "sum the values of this column if the header of the row matches up to a given prefix".
I'm very sorry if this has already been addressed. I looked around and couldn't find a solution, but if someone can give me any hint I would be incredibly thankful.
EDIT
I found this, which looks pretty much like a solution to my problem, although I don't need a separate matrix with the results, and the sums are slightly different:
R partial sum of rows/columns of a matrix
I'm not that familiar with R (obviously), so I'm wondering if this can be modified to fit my problem.
I understand the structure of the data is not ideal, but I need to keep it as it is since it reflects inter-industry flows.
Generally speaking, your dataframe is inflated: the column and row names encode the same information, which doesn't help you. Try to make it more tidy, so that every column contains one type of information, such as country or region. (It wouldn't matter if two different countries had the same region code like "r1", because R can handle this easily.)
To demonstrate what I mean, I created this short example with continents and their countries:
df <- cbind(na.exclude(countrycode::codelist[, c(2, 18)]),
            rnorm(length(na.exclude(countrycode::codelist[, c(2, 18)]))),
            dnorm(length(na.exclude(countrycode::codelist[, c(2, 18)]))))
colnames(df) <- c("continent", "region", "value", "value2")
#
> head(df)
continent region value value2
1 Asia Afghanistan 0.4148095 0.05399097
2 Europe Åland Islands 0.3974413 0.05399097
3 Europe Albania 0.4148095 0.05399097
4 Africa Algeria 0.3974413 0.05399097
5 Oceania American Samoa 0.4148095 0.05399097
6 Europe Andorra 0.3974413 0.0539909
Following up, we use the dplyr package with its grouping functions to make your calculations:
library(dplyr)
df2 <- df %>%
  group_by(continent) %>%
  mutate(continent.val.sums = sum(value, value2))
> head(df2)
# A tibble: 6 x 5
# Groups: continent [4]
continent region value value2 continent.val.sums
<chr> <chr> <dbl> <dbl> <dbl>
1 Asia Afghanistan 0.415 0.0540 23.1
2 Europe Åland Islands 0.397 0.0540 23.4
3 Europe Albania 0.415 0.0540 23.4
4 Africa Algeria 0.397 0.0540 26.7
5 Oceania American Samoa 0.415 0.0540 11.9
6 Europe Andorra 0.397 0.0540 23.4
There are many ways and functions to do these types of calculations; group_by() with mutate() is only one of them.
df3 <- df2 %>%
  group_by(region) %>%
  mutate(region.val.sums = sum(value2))
> head(df3)
# A tibble: 6 x 6
# Groups: region [6]
continent region value value2 continent.val.sums region.val.sums
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Asia Afghanistan 0.415 0.0540 23.1 0.0540
2 Europe Åland Islands 0.397 0.0540 23.4 0.0540
3 Europe Albania 0.415 0.0540 23.4 0.0540
4 Africa Algeria 0.397 0.0540 26.7 0.0540
5 Oceania American Samoa 0.415 0.0540 11.9 0.0540
6 Europe Andorra 0.397 0.0540 23.4 0.0540
This sum of course makes no sense, because the sum of value2 per region equals value2 itself - there is only one row per distinct region. But it's only to demonstrate the principle: you can build other groups and subgroups, or use other functions such as mean() or summarise()/summarize(), etc.
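For completeness, the same grouping idea can also be applied directly to the matrix from the question: derive a country.region key from each header with substr() (this assumes the fixed-width c#.r#.s# labels shown) and let rowsum() do the partial sums. A sketch using the values of the first table:

```r
vals <- c(34, 76, 86, 21, 45,
          85, 34, 47, 35, 97,
          12, 25, 64, 47, 23,
          87, 54, 78, 31, 25,
          34, 67, 49, 10, 72)
hdr <- c("c1.r1.s1", "c1.r1.s2", "c1.r2.s1", "c1.r2.s2", "c2.r1.s1")
m <- matrix(vals, nrow = 5, byrow = TRUE, dimnames = list(hdr, hdr))

reg <- substr(hdr, 1, 5)      # country.region key: "c1.r1", "c1.r2", "c2.r1"
grp <- rowsum(t(m), reg)      # for each row of m: column sums per region group
sum1r <- grp[cbind(reg, hdr)] # each row's sum over its own region's columns
sum1r                         # 110 119 111 109 72
# the country-level sums (sum2r) follow by grouping on substr(hdr, 1, 2),
# and the sum1c/sum2c rows come symmetrically from rowsum(m, reg)
```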
I am trying to impute missing values using the mi package in r and ran into a problem.
When I load the data into R, it recognizes the column with missing values as a factor variable. If I convert it to a numeric variable with the command
dataset$Income <- as.numeric(dataset$Income)
it converts the column to ordinal values (the smallest value becoming 1, the second smallest 2, etc...).
I want to convert this column to numeric values, while retaining the original values of the variable. How can I do this?
EDIT:
Since people have asked, here is my code and an example of what the data looks like.
DATA:
96 GERMANY 6 1960 72480 73 50.24712 NA 0.83034767 0
97 GERMANY 6 1961 73123 85 48.68375 NA 0.79377610 0
98 GERMANY 6 1962 73739 98 48.01359 NA 0.70904115 0
99 GERMANY 6 1963 74340 132 46.93588 NA 0.68753213 0
100 GERMANY 6 1964 74954 146 47.89413 NA 0.67055298 0
101 GERMANY 6 1965 75638 160 47.51518 NA 0.64411484 0
102 GERMANY 6 1966 76206 172 48.46009 NA 0.58274711 0
103 GERMANY 6 1967 76368 183 48.18423 NA 0.57696055 0
104 GERMANY 6 1968 76584 194 48.87967 NA 0.64516949 0
105 GERMANY 6 1969 77143 210 49.36219 NA 0.55475352 0
106 GERMANY 6 1970 77783 227 49.52712 3,951.00 0.53083969 0
107 GERMANY 6 1971 78354 242 51.01421 4,282.00 0.51080717 0
108 GERMANY 6 1972 78717 254 51.02941 4,655.00 0.48773913 0
109 GERMANY 6 1973 78950 264 50.61033 5,110.00 0.48390087 0
110 GERMANY 6 1974 78966 270 48.82353 5,561.00 0.56562229 0
111 GERMANY 6 1975 78682 284 50.50279 6,092.00 0.56846030 0
112 GERMANY 6 1976 78298 301 49.22833 6,771.00 0.53536154 0
113 GERMANY 6 1977 78160 321 49.18999 7,479.00 0.55012371 0
Code:
Income <- dataset$Income
gives me a factor variable, as there are NA's in the data. If I try to turn it into numeric with
as.numeric(Income)
it throws away the original values and replaces them with the factor level codes (the values' ranks). I would like to keep the original values while still recognizing missing values.
A problem every data manager from Germany knows: the column with the NAs contains numbers with commas. But R only understands the English style of plain decimal points without digit grouping, so this column is treated as a character variable and read in as a factor.
Remove the commas and you'll get the numeric values.
By the way, even though we write decimal commas in Germany, a number like 3,951.00 uses the comma as a thousands separator; that is fine for display, but as.numeric() cannot parse it directly. See these examples of international number syntax.
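A minimal sketch of that conversion: go through as.character() first, because as.numeric() on a factor returns the internal level codes, then strip the separator commas with gsub():

```r
Income <- factor(c("3,951.00", "4,282.00", NA))
as.numeric(Income)           # 1 2 NA -- the level codes, not the values
Income_num <- as.numeric(gsub(",", "", as.character(Income)))
Income_num                   # 3951 4282 NA -- original values, NA preserved
```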