Why are multiple columns shown as one vector in R?

I am trying to read in data from a URL. However, when I run the following code:
x <- read.csv(url(myUrl), sep = '\t', header = FALSE)
print(x)
I get this
V1 V2
1 18.0 8 30.7 130.0 hello
2 32.0 6 23.5 121.5 bye
and I want this
V1 V2 V3 V4 V5
1 18.0 8.0 30.7 130.0 hello
2 32.0 6.0 23.5 121.5 bye
For some reason it is being read as 2 columns instead of 5.
Edit 1
Here is a snippet of the data file from the URL:
Edit 2
Here is the URL: https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data

Instead of \t, maybe use ' ', or don't specify the delimiter at all:
x <- read.table(url(myUrl), header = FALSE)
Based on the URL added in the OP's edit:
x <- read.table("https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data", header = FALSE)
str(x)
#'data.frame': 398 obs. of 9 variables:
# $ V1: num 18 15 18 16 17 15 14 14 14 15 ...
# $ V2: int 8 8 8 8 8 8 8 8 8 8 ...
# $ V3: num 307 350 318 304 302 429 454 440 455 390 ...
# $ V4: chr "130.0" "165.0" "150.0" "150.0" ...
# $ V5: num 3504 3693 3436 3433 3449 ...
# $ V6: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
# $ V7: int 70 70 70 70 70 70 70 70 70 70 ...
# $ V8: int 1 1 1 1 1 1 1 1 1 1 ...
# $ V9: chr "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...
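The fix works because read.table's default sep = "" splits on any run of whitespace, whereas sep = '\t' only splits on literal tab characters. A minimal sketch with made-up inline data:

```r
# Two lines separated by spaces, not tabs
txt <- "18.0   8   30.7  130.0  hello
32.0   6   23.5  121.5  bye"

# With sep = '\t', each line has no tab, so it becomes one big field
x_tab   <- read.csv(textConnection(txt), sep = "\t", header = FALSE)
# With the default sep = "", any run of whitespace is a delimiter
x_white <- read.table(textConnection(txt), header = FALSE)

ncol(x_tab)    # 1: the whole line ends up in a single column
ncol(x_white)  # 5: the columns split as intended
```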


How to split a CHR column, pivot, then combine tables?

So I have two tables:
LST data (24 months in total) (already pivoted_longer)
Buffer Date LST
<chr> <chr> <chr>
1 100 15/01/2010 6.091741043
2 100 16/02/2010 6.405879111
3 100 20/03/2010 8.925945159
4 100 24/04/2011 6.278147269
5 100 07/05/2010 6.133940129
6 100 08/06/2010 7.705591939
7 100 13/07/2011 4.066052173
8 100 11/08/2010 5.962087092
9 100 12/09/2010 5.761892842
10 100 17/10/2011 3.155769317
# ... with 1,550 more rows
Weather data (24 months in total)
Weather variable 15/01/2010 16/02/2010 20/03/2010 24/04/2011 07/05/2010
1 Temperature 12.0 15.0 16.0 23.00 21.50
2 Wind_speed 10.0 9.0 10.5 19.50 9.50
3 Wind_trend 1.0 1.0 1.0 0.00 1.00
4 Wind_direction 22.5 45.0 67.5 191.25 56.25
5 Humidity 40.0 44.5 22.0 24.50 7.00
6 Pressure 1024.0 1018.5 1025.0 1005.50 1015.50
7 Pressure_trend 1.0 1.0 1.0 1.00 1.00
If I pivot the weather data I get:
1 Temperature 15/01/2010 12
2 Temperature 16/02/2010 15
3 Temperature 20/03/2010 16
4 Temperature 24/04/2011 23
5 Temperature 07/05/2010 21.5
6 Temperature 08/06/2010 36.5
7 Temperature 13/07/2011 33
8 Temperature 11/08/2010 34.5
9 Temperature 12/09/2010 33
10 Temperature 17/10/2011 27
# ... with 158 more rows
(each weather variable listed in turn).
I need to combine the LST data and the pivoted weather data, using the date as the key (something like data_long <- merge(LST_data, weather_data, by = "Date"), I think), appending the weather columns to each row of the LST data.
But I'm stuck.
The solution I found to this was to pivot the weather data (longer):
weather_long <- weather %>% pivot_longer(cols = 2:21, names_to = "Date", values_to = "Value")
which gives a tibble in the format:
# A tibble: 180 x 3
`Weather variable` Date Value
<chr> <chr> <dbl>
1 Temperature 28/10/2016 17
2 Temperature 31/12/2016 22
3 Temperature 16/01/2017 25
4 Temperature 05/03/2017 19
(as described above in the question).
Because this process leaves the 'Date' variable as character:
tibble [180 x 3] (S3: tbl_df/tbl/data.frame)
$ Weather variable: chr [1:180] "Temperature" "Temperature" "Temperature" "Temperature" ...
$ Date : chr [1:180] "28/10/2016" "31/12/2016" "16/01/2017" "05/03/2017" ...
$ Value : num [1:180] 17 22 25 19 20 22 11 10 3 9 ...
I then corrected this:
weather_long$Date <- as.Date(weather_long$Date, format = "%d/%m/%Y")
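As a quick sanity check of that format string (day/month/year with slashes):

```r
# Parse a day/month/year string into a proper Date
d <- as.Date("28/10/2016", format = "%d/%m/%Y")
format(d, "%Y-%m-%d")  # "2016-10-28"
class(d)               # "Date"
```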
Next was to convert the weather data to the 'wide' format (in preparation for the next step):
weather_wide <- weather_long %>%
pivot_wider(names_from = "Weather variable", values_from = "Value")
Then join it to the LST data using the Date column as the key:
LST_Weather_dataset <- full_join(data_long, weather_wide, by = "Date")
This produced the desired result:
str(LST_Weather_dataset)
'data.frame': 380 obs. of 16 variables:
$ Buffer : int 100 200 300 400 500 600 700 800 900 1000 ...
$ Date : Date, format: "2016-10-28" "2016-10-28" "2016-10-28" "2016-10-28" ...
$ LST : num 0.918 0.951 0.791 0.748 0.687 ...
$ Month : num 10 10 10 10 10 10 10 10 10 10 ...
$ Year : num 2016 2016 2016 2016 2016 ...
$ JulianDay : num 302 302 302 302 302 302 302 302 302 302 ...
$ TimePeriod : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
$ Temperature : num 17 17 17 17 17 17 17 17 17 17 ...
$ Humidity : num 59 59 59 59 59 59 59 59 59 59 ...
$ Humidity_trend: num 1 1 1 1 1 1 1 1 1 1 ...
$ Wind_speed : num 19 19 19 19 19 19 19 19 19 19 ...
$ Wind_gust : num 0 0 0 0 0 0 0 0 0 0 ...
$ Wind_trend : num 2 2 2 2 2 2 2 2 2 2 ...
$ Wind_direction: num 338 338 338 338 338 ...
$ Pressure : num 1017 1017 1017 1017 1017 ...
$ Pressure_trend: num 2 2 2 2 2 2 2 2 2 2 ...
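For reference, the core join step can be sketched in base R with merge() on toy data (column names borrowed from the post, values made up):

```r
# Toy LST rows: several buffers observed on the same date
lst <- data.frame(Buffer = c(100, 200),
                  Date   = as.Date(c("2016-10-28", "2016-10-28")),
                  LST    = c(0.918, 0.951))
# Toy wide weather: one row per date, one column per variable
weather <- data.frame(Date        = as.Date("2016-10-28"),
                      Temperature = 17,
                      Wind_speed  = 19)
# Left join on the shared Date key: weather columns repeat per buffer
joined <- merge(lst, weather, by = "Date", all.x = TRUE)
ncol(joined)  # 5: Date, Buffer, LST, Temperature, Wind_speed
```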

write.csv() extremely unexpected behavior

Something inexplicable (to me) is going on. I have a dataframe:
> head(df)
id lon lat temp month year hr prec ws
1 1 27.75 -22.25 295.35 9 2018 0.00007675205 401.1297 12.88135
2 2 28.25 -22.25 295.95 9 2018 0.00008084041 426.3411 12.89902
3 3 28.75 -22.25 296.25 9 2018 0.00008487972 449.7063 12.63242
4 4 29.25 -22.25 296.45 9 2018 0.00009112679 484.3495 12.59484
5 5 29.75 -22.25 296.65 9 2018 0.00009995372 533.0175 12.28485
6 6 30.25 -22.25 296.95 9 2018 0.00010895965 583.8255 11.80009
it looks like this:
> nrow(df)
[1] 607
> ncol(df)
[1] 9
When I do write.csv(df, "/data/df.csv") it writes a humongous CSV with tens of columns and thousands of rows. Has anyone experienced this kind of behavior? I rebooted my machine, restarted R, updated everything, and still, maddeningly, this keeps happening.
Output of dput(df):
https://drive.google.com/file/d/1AkGK9Svwi9mSAcB0G3Ecx7aDC6ccnYRg/view?usp=sharing
str(x) will help you figure out what's going on.
x <- dget("fupedCSV.txt")
str(x)
## 'data.frame': 607 obs. of 9 variables:
## <a bunch of normal columns> ...
## $ rh :'data.frame': 607 obs. of 1 variable:
## ..$ hr: num 7.68e-05 8.08e-05 8.49e-05 9.11e-05 1.00e-04 ...
## $ prec :'data.frame': 607 obs. of 1 variable:
## ..$ prec: num 401 426 450 484 533 ...
## $ ws :'data.frame': 607 obs. of 1 variable:
## ..$ ws: num 12.9 12.9 12.6 12.6 12.3 ...
Note the last three columns, which are actually data frames nested inside the data frame.
## ORIGINAL: y <- as.data.frame(lapply(x, function(x) if (is.list(x)) x[[1]] else x ))
y <- do.call(data.frame,x) ## thanks #akrun!
str(y)
## 'data.frame': 607 obs. of 9 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ lon : num 27.8 28.2 28.8 29.2 29.8 ...
## $ lat : num -22.2 -22.2 -22.2 -22.2 -22.2 ...
## $ temp : num 295 296 296 296 297 ...
## $ month: int 9 9 9 9 9 9 9 9 9 9 ...
## $ year : int 2018 2018 2018 2018 2018 2018 2018 2018 2018 2018 ...
## $ rh : num 7.68e-05 8.08e-05 8.49e-05 9.11e-05 1.00e-04 ...
## $ prec : num 401 426 450 484 533 ...
## $ ws : num 12.9 12.9 12.6 12.6 12.3 ...
I haven't tested writing to a file, but I think this will clear up your problem.
The following does the trick:
x <- do.call(data.frame, x)
write.csv2(x, file = "xxxx.csv", row.names = FALSE)
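To see how a data frame ends up nested inside another, and how do.call(data.frame, x) flattens it, here is a small made-up reproduction:

```r
# Build the failure mode by hand: assign a one-column data.frame
# as a column of another data.frame
x <- data.frame(id = 1:3)
x$ws <- data.frame(ws = c(12.9, 12.9, 12.6))  # nested data.frame column
is.data.frame(x$ws)  # TRUE: this is what confuses write.csv

# Flatten: data.frame() unpacks data.frame arguments into plain columns
y <- do.call(data.frame, x)
sapply(y, is.data.frame)  # all FALSE after flattening
```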

Find mean from subset of one column based on ranking in the top 50 of another column

I have a data frame that has the following columns:
> str(wbr)
'data.frame': 214 obs. of 12 variables:
$ countrycode : Factor w/ 214 levels "ABW","ADO","AFG",..: 1 2 3 4 5 6 7 8 9 10 ...
$ countryname : Factor w/ 214 levels "Afghanistan",..: 10 5 1 6 2 202 8 9 4 7 ...
$ gdp_per_capita : num 19913 35628 415 2738 4091 ...
$ literacy_female : num 96.7 NA 17.6 59.1 95.7 ...
$ literacy_male : num 96.9 NA 45.4 82.5 98 ...
$ literacy_all : num 96.8 NA 31.7 70.6 96.8 ...
$ infant_mortality : num NA 2.2 70.2 101.6 13.3 ...
$ illiteracy_female: num 3.28 NA 82.39 40.85 4.31 ...
$ illiteracy_mele : num 3.06 NA 54.58 17.53 1.99 ...
$ illiteracy_male : num 3.06 NA 54.58 17.53 1.99 ...
$ illiteracy_all : num 3.18 NA 68.26 29.42 3.15 ...
I would like to find the mean of illiteracy_all from the top 50 countries with the highest GDP.
Before you answer, I should mention that the data frame has NA values, meaning that to find the mean I have to write:
mean(wbr$illiteracy_all, na.rm=TRUE)
For a reproducible example, let's take:
data.df <- data.frame(x=101:120, y=rep(c(1,2,3,NA), times=5))
So how could I average the y values for e.g. the top 5 values of x?
> data.df
x y
1 101 1
2 102 2
3 103 3
4 104 NA
5 105 1
6 106 2
7 107 3
8 108 NA
9 109 1
10 110 2
11 111 3
12 112 NA
13 113 1
14 114 2
15 115 3
16 116 NA
17 117 1
18 118 2
19 119 3
20 120 NA
Any of the following would work:
mean(data.df[rank(-data.df$x)<=5,"y"], na.rm=TRUE)
mean(data.df$y[rank(-data.df$x)<=5], na.rm=TRUE)
with(data.df, mean(y[rank(-x)<=5], na.rm=TRUE))
To unpack why this works, note first that rank gives ranks in a different order to what you might expect, 1 being the rank of the smallest number not the largest:
> rank(data.df$x)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
We can get round that by negating the input:
> rank(-data.df$x)
[1] 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
So now ranks 1 to 5 are the "top 5". If we want a vector of TRUE and FALSE to indicate the position of the top 5 we can use:
> rank(-data.df$x)<=5
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[14] FALSE FALSE TRUE TRUE TRUE TRUE TRUE
(In reality you might find you have some ties in your data set. This is only going to cause issues if the 50th position is tied. You might want to have a look at the ties.method argument for rank to see how you want to handle this.)
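A tiny illustration of how ties.method changes the count of rows that pass the cutoff (made-up values with a tie at the boundary):

```r
v <- c(10, 20, 20, 30)
rank(-v)                        # 4 2.5 2.5 1: tied 20s get the average rank
rank(-v, ties.method = "min")   # 4 2 2 1: both 20s rank 2
sum(rank(-v, ties.method = "min") <= 2)    # 3 values pass a "top 2" cutoff
sum(rank(-v, ties.method = "first") <= 2)  # 2 values pass
```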
So let's grab the values of y in those positions:
> data.df[rank(-data.df$x)<=5,"y"]
[1] NA 1 2 3 NA
Or you could use:
> data.df$y[rank(-data.df$x)<=5]
[1] NA 1 2 3 NA
So now we know what to input into mean:
> mean(data.df[rank(-data.df$x)<=5,"y"], na.rm=TRUE)
[1] 2
Or:
> mean(data.df$y[rank(-data.df$x)<=5], na.rm=TRUE)
[1] 2
Or if you don't like repeating the name of the data frame, use with:
> with(data.df, mean(y[rank(-x)<=5], na.rm=TRUE))
[1] 2
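An equivalent base-R route, if you prefer order() to rank(): sort the row indices by descending x, keep the first five, and average y over them.

```r
# Same toy data as above
data.df <- data.frame(x = 101:120, y = rep(c(1, 2, 3, NA), times = 5))
# Indices of the 5 largest x values
top5 <- order(data.df$x, decreasing = TRUE)[1:5]
mean(data.df$y[top5], na.rm = TRUE)  # 2
```

For the original question, the same pattern with [1:50] on gdp_per_capita and illiteracy_all applies.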

Carc data from rda file to numeric matrix

I am trying to do KDA (kernel discriminant analysis) on the carc data, but when I call X <- data.frame(scale(X)), R shows this error:
"Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric"
I tried as.numeric(as.matrix(carc)) and carc <- na.omit(carc), but neither helps.
install.packages("klaR")
install.packages("FSelector")
library(ks); library(MASS); library(klaR); library(FSelector)
attach("carc.rda")
data<-load("carc.rda")
data
carc<-na.omit(carc)
head(carc)
class(carc) # check for its class
class(as.matrix(carc)) # change class, and
as.numeric(as.matrix(carc))
XX<-carc
X<-XX[,1:12];X.class<-XX[,13];
X<-data.frame(scale(X));
fit.pc<-princomp(X,scores=TRUE);
plot(fit.pc,type="line")
X.new<-fit.pc$scores[,1:5]; X.new<-data.frame(X.new);
cfs(X.class~.,cbind(X.new,X.class))
X.new<-fit.pc$scores[,c(1,4)]; X.new<-data.frame(X.new);
fit.kda1<-Hkda(x=X.new,x.group=X.class,pilot="samse",
bw="plugin",pre="sphere")
kda.fit1 <- kda(x=X.new, x.group=X.class, Hs=fit.kda1)
Can you help to resolve this problem and make this analysis?
Added: the car data set (Chambers, Cleveland, Kleiner & Tukey, 1983)
> head(carc)
P M R78 R77 H R Tr W L T D G C
AMC_Concord 4099 22 3 2 2.5 27.5 11 2930 186 40 121 3.58 US
AMC_Pacer 4749 17 3 1 3.0 25.5 11 3350 173 40 258 2.53 US
AMC_Spirit 3799 22 . . 3.0 18.5 12 2640 168 35 121 3.08 US
Audi_5000 9690 17 5 2 3.0 27.0 15 2830 189 37 131 3.20 Europe
Audi_Fox 6295 23 3 3 2.5 28.0 11 2070 174 36 97 3.70 Europe
Here is a small dataset with similar characteristics to what you describe, in order to reproduce this error:
"Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric"
carc <- data.frame(type1=rep(c('1','2'), each=5),
type2=rep(c('5','6'), each=5),
x = rnorm(10,1,2)/10, y = rnorm(10))
This should be similar to your data.frame
str(carc)
# 'data.frame': 10 obs. of 3 variables:
# $ type1: Factor w/ 2 levels "1","2": 1 1 1 1 1 2 2 2 2 2
# $ type2: Factor w/ 2 levels "5","6": 1 1 1 1 1 2 2 2 2 2
# $ x : num -0.1177 0.3443 0.1351 0.0443 0.4702 ...
# $ y : num -0.355 0.149 -0.208 -1.202 -1.495 ...
scale(carc)
# Similar error
# Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
Using set()
require(data.table)
DT <- data.table(carc)
cols_fix <- c("type1", "type2")
for (col in cols_fix) set(DT, j=col, value = as.numeric(as.character(DT[[col]])))
str(DT)
# Classes ‘data.table’ and 'data.frame': 10 obs. of 4 variables:
# $ type1: num 1 1 1 1 1 2 2 2 2 2
# $ type2: num 5 5 5 5 5 6 6 6 6 6
# $ x : num 0.0465 0.1712 0.1582 0.1684 0.1183 ...
# $ y : num 0.155 -0.977 -0.291 -0.766 -1.02 ...
# - attr(*, ".internal.selfref")=<externalptr>
The first column(s) of your data set may be factors. Taking the data from corrgram:
library(corrgram)
carc <- auto
str(carc)
# 'data.frame': 74 obs. of 14 variables:
# $ Model : Factor w/ 74 levels "AMC Concord ",..: 1 2 3 4 5 6 7 8 9 10 ...
# $ Origin: Factor w/ 3 levels "A","E","J": 1 1 1 2 2 2 1 1 1 1 ...
# $ Price : int 4099 4749 3799 9690 6295 9735 4816 7827 5788 4453 ...
# $ MPG : int 22 17 22 17 23 25 20 15 18 26 ...
# $ Rep78 : num 3 3 NA 5 3 4 3 4 3 NA ...
# $ Rep77 : num 2 1 NA 2 3 4 3 4 4 NA ...
# $ Hroom : num 2.5 3 3 3 2.5 2.5 4.5 4 4 3 ...
# $ Rseat : num 27.5 25.5 18.5 27 28 26 29 31.5 30.5 24 ...
# $ Trunk : int 11 11 12 15 11 12 16 20 21 10 ...
# $ Weight: int 2930 3350 2640 2830 2070 2650 3250 4080 3670 2230 ...
# $ Length: int 186 173 168 189 174 177 196 222 218 170 ...
# $ Turn : int 40 40 35 37 36 34 40 43 43 34 ...
# $ Displa: int 121 258 121 131 97 121 196 350 231 304 ...
# $ Gratio: num 3.58 2.53 3.08 3.2 3.7 3.64 2.93 2.41 2.73 2.87 ...
So exclude them by trying this:
X<-XX[,3:14]
or this
X<-XX[,-(1:2)]
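If you'd rather not hard-code column positions, you can keep whichever columns are numeric before scaling. A sketch on made-up data shaped like the auto set:

```r
# Toy frame: two factor columns followed by numeric ones
carc <- data.frame(Model  = factor(c("a", "b", "c")),
                   Origin = factor(c("A", "E", "J")),
                   Price  = c(4099, 4749, 3799),
                   MPG    = c(22, 17, 22))
# Logical mask of numeric columns, then scale only those
num_cols <- sapply(carc, is.numeric)
X <- scale(carc[, num_cols])
colnames(X)  # "Price" "MPG"
```

This avoids the colMeans error regardless of where the factor columns sit.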

Help on subsetting a dataframe

I am using %in% for subsetting and I came across a strange result.
> my.data[my.data$V3 %in% seq(200,210,.01),]
V1 V2 V3 V4 V5 V6 V7
56 470 48.7 209.73 yes 26.3 54 470
That was correct. But when I widen the range, row 56 just disappears:
> my.data[my.data$V3 %in% seq(150,210,.01),]
V1 V2 V3 V4 V5 V6 V7
51 458 48.7 156.19 yes 28.2 58 458
67 511 30.5 150.54 yes 26.1 86 511
73 535 40.6 178.76 yes 29.5 73 535
Can you tell me what's wrong?
Is there a better way to subset the dataframe?
Here is its structure
> str(my.data)
'data.frame': 91 obs. of 7 variables:
$ V1: Factor w/ 91 levels "100","10004",..: 1 2 3 4 5 6 7 8 9 10 ...
$ V2: num 44.6 22.3 30.4 38.6 15.2 18.3 16.3 12.2 36.7 12.2 ...
$ V3: num 110.83 25.03 17.17 57.23 2.18 ...
$ V4: Factor w/ 2 levels "no","yes": 1 2 2 2 1 1 1 1 1 1 ...
$ V5: num 22.3 30.5 24.4 25.5 4.1 28.4 7.9 5.1 24 12.2 ...
$ V6: int 50 137 80 66 27 155 48 42 65 100 ...
$ V7: chr "" "10004" "10005" "10012" ...
Oops. You are trying to do exact matching on a computer that can't represent all numbers exactly.
> any(209.73 == seq(200,210,.01))
[1] TRUE
> any(209.73 == seq(150,210,.01))
[1] FALSE
> any(209.73 == zapsmall(seq(150,210,.01)))
[1] TRUE
The reason for the discrepancy is that in the second sequence, the stored value is not exactly 209.73. This is something you have to appreciate when doing computation on computers.
This is covered in many places on the interweb, but in relation to R, see point 7.31 in the R FAQ.
Anyway, that said, you are going about the problem incorrectly. You want to use proper numeric operators:
my.data[my.data$V3 >= 150 & my.data$V3 <= 210, ]
## or
subset(my.data, V3 >= 150 & V3 <= 210)
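To see concretely why the range comparison is the robust choice, compare exact equality against a tolerance-based check on the same sequence:

```r
# seq() accumulates tiny floating-point error, so == (and hence %in%)
# can miss a value that is "there" to within rounding.
s <- seq(150, 210, 0.01)
209.73 %in% s                  # may be FALSE: not bit-identical
any(abs(s - 209.73) < 1e-9)    # TRUE: present to within a tolerance
209.73 >= 150 & 209.73 <= 210  # TRUE: range tests sidestep the issue
```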
