Read in CSV in mixed English and French number format - r

I would like to read a CSV into R that is quoted and comma-separated (i.e. sep = "," not sep = ";" as read.csv2 defaults to) but that
uses the comma inside fields as the decimal separator
contains periods to separate each group of three digits from the right
An example of a problematic entry is "3.051,00" in the final line of the excerpt from the CSV shown.
I tried
dat <- read.csv2("path_to_csv.csv", sep = ",", stringsAsFactors = FALSE)
and a variant using read.csv (both are identical except for their defaults, as noted in Difference between read.csv() and read.csv2() in R). Both return improperly formatted data.frames (e.g. containing 3.051,00).
Can I read this comma-separated file in directly with read.table without having to perform text-preprocessing?
Excerpt of CSV
praf,pmek,plcg,PIP2,PIP3,p44/42,pakts473,PKA,PKC,P38,pjnk
"26,40","13,20","8,82","18,30","58,80","6,61","17,00","414,00","17,00","44,90","40,00"
"35,90","16,50","12,30","16,80","8,13","18,60","32,50","352,00","3,37","16,50","61,50"
"59,40","44,10","14,60","10,20","13,00","14,90","32,50","403,00","11,40","31,90","19,50"
"62,10","51,90","13,60","30,20","10,60","14,30","37,90","692,00","6,49","25,00","91,40"
"75,00","33,40","1,00","31,60","1,00","19,80","27,60","505,00","18,60","31,10","7,64"
"20,40","15,10","7,99","101,00","35,90","9,14","22,90","400,00","11,70","22,70","6,85"
"47,80","19,60","17,50","33,10","82,00","17,90","35,20","956,00","22,50","43,30","20,00"
"59,90","53,30","11,80","77,70","12,90","11,10","37,90","1.407,00","18,80","29,40","16,80"
"46,60","27,10","12,40","109,00","21,90","21,50","38,20","207,00","11,00","31,30","12,00"
"51,90","21,30","49,10","58,80","10,80","58,80","200,00","3.051,00","15,30","39,20","15,70"
Note: I am aware of the question European and American decimal format for thousands, which is not sufficient. This user preprocesses the file they want to read in whereas I would like a direct means of reading a CSV of the kind shown into R.

Most of it is resolved with dec=",":
# saved your data to 'file.csv'
out <- read.csv("file.csv", dec=",")
head(out)
# praf pmek plcg PIP2 PIP3 p44.42 pakts473 PKA PKC P38 pjnk
# 1 26.4 13.2 8.82 18.3 58.80 6.61 17.0 414,00 17.00 44.9 40.00
# 2 35.9 16.5 12.30 16.8 8.13 18.60 32.5 352,00 3.37 16.5 61.50
# 3 59.4 44.1 14.60 10.2 13.00 14.90 32.5 403,00 11.40 31.9 19.50
# 4 62.1 51.9 13.60 30.2 10.60 14.30 37.9 692,00 6.49 25.0 91.40
# 5 75.0 33.4 1.00 31.6 1.00 19.80 27.6 505,00 18.60 31.1 7.64
# 6 20.4 15.1 7.99 101.0 35.90 9.14 22.9 400,00 11.70 22.7 6.85
Only one column is still character, because its values also contain the thousands separator:
sapply(out, class)
# praf pmek plcg PIP2 PIP3 p44.42 pakts473 PKA PKC P38
# "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "character" "numeric" "numeric"
# pjnk
# "numeric"
This can be resolved post-read with:
ischr <- sapply(out, is.character)
out[ischr] <- lapply(out[ischr], function(z) as.numeric(gsub(" ", "", chartr(",.", ". ", z))))
out$PKA
# [1] 414 352 403 692 505 400 956 1407 207 3051
If you'd rather read it in without post-processing, you can pipe() it through sed, assuming you have sed available (see the note below):
out <- read.csv(pipe("sed -E 's/([0-9])[.]([0-9])/\\1\\2/g;s/([0-9]),([0-9])/\\1.\\2/g' < file.csv"))
Note:
sed is generally available on Linux and macOS systems; on Windows it is included with Rtools.
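If the conversion should happen inside the read call itself, without sed, one base R option is a custom colClasses entry. The following is a minimal sketch under stated assumptions, not part of the answer above: the class name eu_number is invented for this example, and the two-column text is a cut-down copy of the question's excerpt. read.csv hands each column to methods::as() when colClasses names a class it does not know, so a setAs() method can do the cleanup.

```r
library(methods)

# "eu_number" is a hypothetical class name chosen for this sketch.
setClass("eu_number")

# Describe how character fields such as "1.407,00" become numbers:
# drop the grouping periods, then swap the decimal comma for a period.
setAs("character", "eu_number", function(from) {
  as.numeric(chartr(",", ".", gsub(".", "", from, fixed = TRUE)))
})

# Two columns taken from the question's excerpt, inlined so the sketch is self-contained.
txt <- 'pakts473,PKA
"37,90","1.407,00"
"200,00","3.051,00"'

# A single colClasses value is recycled across all columns.
dat <- read.csv(text = txt, colClasses = "eu_number")
dat$PKA
```

For a real file, the same call would take the file path instead of text = txt. If only some columns use this format, colClasses can be a named vector so the remaining columns are still read with their usual types.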

Like r2evans's comment says, dec = "," takes care of the cases without thousands separators. Then use lapply/gsub to process the other cases, which are still of class "character".
txt <- '
praf,pmek,plcg,PIP2,PIP3,p44/42,pakts473,PKA,PKC,P38,pjnk
"26,40","13,20","8,82","18,30","58,80","6,61","17,00","414,00","17,00","44,90","40,00"
"35,90","16,50","12,30","16,80","8,13","18,60","32,50","352,00","3,37","16,50","61,50"
"59,40","44,10","14,60","10,20","13,00","14,90","32,50","403,00","11,40","31,90","19,50"
"62,10","51,90","13,60","30,20","10,60","14,30","37,90","692,00","6,49","25,00","91,40"
"75,00","33,40","1,00","31,60","1,00","19,80","27,60","505,00","18,60","31,10","7,64"
"20,40","15,10","7,99","101,00","35,90","9,14","22,90","400,00","11,70","22,70","6,85"
"47,80","19,60","17,50","33,10","82,00","17,90","35,20","956,00","22,50","43,30","20,00"
"59,90","53,30","11,80","77,70","12,90","11,10","37,90","1.407,00","18,80","29,40","16,80"
"46,60","27,10","12,40","109,00","21,90","21,50","38,20","207,00","11,00","31,30","12,00"
"51,90","21,30","49,10","58,80","10,80","58,80","200,00","3.051,00","15,30","39,20","15,70"
'
df1 <- read.csv(textConnection(txt), dec = ",")
i <- sapply(df1, is.character)
df1[i] <- lapply(df1[i], \(x) gsub("\\.", "", x))
df1[i] <- lapply(df1[i], \(x) as.numeric(sub(",", ".", x)))
df1
#> praf pmek plcg PIP2 PIP3 p44.42 pakts473 PKA PKC P38 pjnk
#> 1 26.4 13.2 8.82 18.3 58.80 6.61 17.0 414 17.00 44.9 40.00
#> 2 35.9 16.5 12.30 16.8 8.13 18.60 32.5 352 3.37 16.5 61.50
#> 3 59.4 44.1 14.60 10.2 13.00 14.90 32.5 403 11.40 31.9 19.50
#> 4 62.1 51.9 13.60 30.2 10.60 14.30 37.9 692 6.49 25.0 91.40
#> 5 75.0 33.4 1.00 31.6 1.00 19.80 27.6 505 18.60 31.1 7.64
#> 6 20.4 15.1 7.99 101.0 35.90 9.14 22.9 400 11.70 22.7 6.85
#> 7 47.8 19.6 17.50 33.1 82.00 17.90 35.2 956 22.50 43.3 20.00
#> 8 59.9 53.3 11.80 77.7 12.90 11.10 37.9 1407 18.80 29.4 16.80
#> 9 46.6 27.1 12.40 109.0 21.90 21.50 38.2 207 11.00 31.3 12.00
#> 10 51.9 21.3 49.10 58.8 10.80 58.80 200.0 3051 15.30 39.2 15.70
Created on 2022-02-07 by the reprex package (v2.0.1)

Related

Reshape horizontal to long format using pivot_longer

I am trying to reshape my data from wide to long format using the same code provided in an earlier answer; however, it doesn't work, even after several attempts to modify names_pattern = "(.*)_(pre|post.*)".
My data sample is
data1<-read.table(text="
Serial_ID pre_EDV pre_ESV pre_LVEF post_EDV post_ESV post_LVEF
1 76.2 32.9 56.8 86.3 36.6 57.6
2 65.4 35.9 45.1 60.1 26.1 56.7
3 64.4 35.1 45.5 72.5 41.1 43.3
4 50 13.9 72.1 46.4 18.4 60.4
5 89.6 32 64.3 70.9 19.3 72.8
6 62 20.6 66.7 55.9 17.8 68.2
7 91.2 37.7 58.6 61.9 23.8 61.6
8 62 24 61.3 69.3 34.9 49.6
9 104.1 22.7 78.8 38.6 11.5 70.1
10 90.6 31.2 65.6 48 16.1 66.4", sep="", header=T)
I want to reshape my data to:
1. Put identical column headings below each other, e.g. post_EDV below pre_EDV.
2. Create a new column "Pre vs. post".
3. Fix the column headings (remove "pre_" and "post_" to leave "EDV" only).
This is the used code:
library(dplyr);library(tidyr);library(stringr)
out <- data %>% pivot_longer(cols = -Serial_ID,
names_to = c(".value", "prevspost"),
names_pattern = "(.*)_(pre|post.*)",
names_sep="_") #%>% as.data.frame
Also I tried names_prefix = c("pre_","post_") instead of names_pattern = "(.*)_(pre|post.*)", but it doesn't work.
Any advice will be greatly appreciated.
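Edit: I recommend using @Dave2e's superior approach.
Your attempt didn't work because the pattern has to match the column names in order: the prefix (pre or post) comes first and the measurement second, so the capture groups in names_pattern and the entries of names_to must follow that order. You could try this: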
library(tidyr)
library(dplyr)
data1 %>% pivot_longer(cols = -Serial_ID,
names_to = c("prevspost",".value"),
names_pattern = "(pre|post)_(\\w+)") %>%
dplyr::arrange(desc(prevspost),Serial_ID)
# A tibble: 20 x 5
Serial_ID prevspost EDV ESV LVEF
<int> <chr> <dbl> <dbl> <dbl>
1 1 pre 76.2 32.9 56.8
2 2 pre 65.4 35.9 45.1
3 3 pre 64.4 35.1 45.5
4 4 pre 50 13.9 72.1
5 5 pre 89.6 32 64.3
6 6 pre 62 20.6 66.7
7 7 pre 91.2 37.7 58.6
8 8 pre 62 24 61.3
9 9 pre 104. 22.7 78.8
10 10 pre 90.6 31.2 65.6
11 1 post 86.3 36.6 57.6
12 2 post 60.1 26.1 56.7
13 3 post 72.5 41.1 43.3
14 4 post 46.4 18.4 60.4
15 5 post 70.9 19.3 72.8
16 6 post 55.9 17.8 68.2
17 7 post 61.9 23.8 61.6
18 8 post 69.3 34.9 49.6
19 9 post 38.6 11.5 70.1
20 10 post 48 16.1 66.4
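To see why the order of the capture groups matters, it can help to test both patterns against the column names with base R before handing them to pivot_longer. This is only an illustrative sketch: the variable names below are made up for the example, and grepl/sub are used here merely to inspect the regular expressions, not as a stand-in for how tidyr applies them internally.

```r
# Column names from the question (Serial_ID is excluded from the pivot).
cols <- c("pre_EDV", "pre_ESV", "pre_LVEF", "post_EDV", "post_ESV", "post_LVEF")

# Pattern from the question: expects "<something>_pre" or "<something>_post...".
pattern_question <- "(.*)_(pre|post.*)"
# Pattern from the answer: prefix first, measurement second.
pattern_answer <- "(pre|post)_(\\w+)"

# The question's pattern matches none of the names, because "pre"/"post"
# never appears after the underscore.
grepl(pattern_question, cols)

# The answer's pattern splits every name into the two pieces that
# names_to = c("prevspost", ".value") expects, in that order.
first_group  <- sub(pattern_answer, "\\1", cols)
second_group <- sub(pattern_answer, "\\2", cols)
first_group
second_group
```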
Your initial approach was very close; it just needed some simplification. Use only one of "names_sep" or "names_pattern", not both:
library(tidyr)
library(dplyr)
data1 %>% pivot_longer(cols = -Serial_ID,
names_to = c("Pre vs. post", '.value'),
names_sep="_")
# A tibble: 20 x 5
Serial_ID `Pre vs. post` EDV ESV LVEF
<int> <chr> <dbl> <dbl> <dbl>
1 1 pre 76.2 32.9 56.8
2 1 post 86.3 36.6 57.6
3 2 pre 65.4 35.9 45.1
4 2 post 60.1 26.1 56.7
5 3 pre 64.4 35.1 45.5
6 3 post 72.5 41.1 43.3
7 4 pre 50 13.9 72.1
8 4 post 46.4 18.4 60.4
9 5 pre 89.6 32 64.3
10 5 post 70.9 19.3 72.8
11 6 pre 62 20.6 66.7
12 6 post 55.9 17.8 68.2
13 7 pre 91.2 37.7 58.6
14 7 post 61.9 23.8 61.6
15 8 pre 62 24 61.3
16 8 post 69.3 34.9 49.6
17 9 pre 104. 22.7 78.8
18 9 post 38.6 11.5 70.1
19 10 pre 90.6 31.2 65.6
20 10 post 48 16.1 66.4
Try this:
library(dplyr);library(tidyr);library(stringr)
out <- data1 %>% pivot_longer(-Serial_ID,
names_to = c("measurement", "names"),
values_to = "values",
names_sep = "_")
out
# # A tibble: 60 x 4
# Serial_ID measurement names values
# <int> <chr> <chr> <dbl>
# 1 1 pre EDV 76.2
# 2 1 pre ESV 32.9
# 3 1 pre LVEF 56.8
# 4 1 post EDV 86.3
# 5 1 post ESV 36.6
# 6 1 post LVEF 57.6
# 7 2 pre EDV 65.4
# 8 2 pre ESV 35.9
# 9 2 pre LVEF 45.1
# 10 2 post EDV 60.1
# # ... with 50 more rows
Your code snippet passed the object "data" instead of "data1" into the pipe, which produced an error:
"Error: No tidyselect variables were registered".

Gathering multiple data columns currently in factor form

I have a dataset of train carloads. It currently has a number (weekly carload) listed for each company (the row) for each week (the columns) over the course of a couple years (100+ columns). I want to gather this into just two columns: a date and loads.
It currently looks like this:
3/29/2017 4/5/2017 4/12/2017 4/19/2017
32.7 31.6 32.3 32.5
20.5 21.8 22.0 22.3
24.1 24.1 23.6 23.4
24.9 24.7 24.8 26.5
I'm looking for:
Date Load
3/29/2017 32.7
3/29/2017 20.5
3/29/2017 24.1
3/29/2017 24.9
4/5/2017 31.6
I've been doing various versions of the following:
rail3 <- rail2 %>%
gather(`3/29/2017`:`1/24/2018`, key = "date", value = "loads")
When I do this it makes a dataset called rail3, but it didn't make the new columns I wanted. It only made the dataset 44 times longer than it was. And it gave me the following message:
Warning message:
attributes are not identical across measure variables;
they will be dropped
I'm assuming this is because the date columns are currently coded as factors. But I'm also not sure how to convert 100+ columns from factors to numeric. I've tried the following and various other methods:
rail2["3/29/2017":"1/24/2018"] <- lapply(rail2["3/29/2017":"1/24/2018"], as.numeric)
None of this has worked. Let me know if you have any advice. Thanks!
If you want to avoid warnings when gathering and want date and numeric output in final df you can do:
library(tidyr)
library(hablar)
# Data from above but with factors
rail2<-read.table(header=TRUE, text="3/29/2017 4/5/2017 4/12/2017 4/19/2017
32.7 31.6 32.3 32.5
20.5 21.8 22.0 22.3
24.1 24.1 23.6 23.4
24.9 24.7 24.8 26.5", check.names=FALSE) %>%
as_tibble() %>%
convert(fct(everything()))
# Code
rail2 %>%
convert(num(everything())) %>%
gather("date", "load") %>%
convert(dte(date, .args = list(format = "%m/%d/%Y")))
Gives:
# A tibble: 16 x 2
date load
<date> <dbl>
1 2017-03-29 32.7
2 2017-03-29 20.5
3 2017-03-29 24.1
4 2017-03-29 24.9
5 2017-04-05 31.6
Here is a possible solution:
rail2<-read.table(header=TRUE, text="3/29/2017 4/5/2017 4/12/2017 4/19/2017
32.7 31.6 32.3 32.5
20.5 21.8 22.0 22.3
24.1 24.1 23.6 23.4
24.9 24.7 24.8 26.5", check.names=FALSE)
library(tidyr)
# gather the data from columns and convert to long format.
rail3 <- rail2 %>% gather(key="date", value="load")
rail3
# date load
#1 3/29/2017 32.7
#2 3/29/2017 20.5
#3 3/29/2017 24.1
#4 3/29/2017 24.9
#5 4/5/2017 31.6
#6 4/5/2017 21.8
#7 ...
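If the columns really are factors, as the question suspects, they can be converted without extra packages before reshaping. The sketch below is a base R illustration under assumptions: colClasses = "factor" is used only to recreate the factor columns described in the question, and the object names are reused from the answers above. Note that as.numeric() on a factor returns the internal level codes, so the values go through as.character() first.

```r
# Recreate the question's situation: every weekly column is read as a factor.
rail2 <- read.table(header = TRUE, check.names = FALSE, colClasses = "factor",
                    text = "3/29/2017 4/5/2017 4/12/2017 4/19/2017
32.7 31.6 32.3 32.5
20.5 21.8 22.0 22.3
24.1 24.1 23.6 23.4
24.9 24.7 24.8 26.5")

# Convert all columns at once; rail2[] keeps the data.frame structure.
rail2[] <- lapply(rail2, function(x) as.numeric(as.character(x)))

# Stack the columns into two: the column name becomes the date.
rail3 <- data.frame(
  date = as.Date(rep(names(rail2), each = nrow(rail2)), format = "%m/%d/%Y"),
  load = unlist(rail2, use.names = FALSE)
)
head(rail3)
```

The same lapply line works for 100+ columns, since it does not refer to any column by name.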

Using R to read html but got a mistake

http://www.aqistudy.cn/historydata/daydata.php?city=%E8%8B%8F%E5%B7%9E&month=201504
This is the website from which I want to read data.
My code is as follows,
library(XML)
fileurl <- "http://www.aqistudy.cn/historydata/daydata.php?city=苏州&month=201404"
doc <- htmlTreeParse(fileurl, useInternalNodes = TRUE, encoding = "utf-8")
rootnode <- xmlRoot(doc)
pollution <- xpathSApply(rootnode, "/td", xmlValue)
But I got a lot of garbled text, and I don't know how to fix this problem.
I appreciate any help!
This can be simplified using library(rvest) to directly read the table
library(rvest)
url <- "http://www.aqistudy.cn/historydata/daydata.php?city=%E8%8B%8F%E5%B7%9E&month=201504"
doc <- read_html(url) %>%
html_table()
doc[[1]]
# 日期 AQI 范围 质量等级 PM2.5 PM10 SO2 CO NO2 O3 排名
# 1 2015-04-01 106 67~144 轻度污染 79.3 105.1 20.2 1.230 89.5 76 308
# 2 2015-04-02 74 31~140 良 48.1 79.7 18.8 1.066 51.5 129 231
# 3 2015-04-03 98 49~136 良 72.9 89.2 16.0 1.323 50.9 62 293
# 4 2015-04-04 92 56~158 良 67.6 78.2 14.3 1.506 57.4 93 262
# 5 2015-04-05 87 42~167 良 63.7 56.1 16.9 1.245 50.8 91 215
# 6 2015-04-06 46 36~56 优 29.1 30.8 10.0 0.817 37.5 98 136
# 7 2015-04-07 45 34~59 优 27.0 42.4 12.0 0.640 36.6 77 143

run function on consecutive vals with specific range in the vector with R

Suppose I have a vector tmp of size 100.
I want to know where the average of each 4 consecutive elements
is, for example, larger than 10.
I.e.
I want to know which of these: mean(tmp[c(1,2,3,4)]), mean(tmp[c(2,3,4,5)]), mean(tmp[c(3,4,5,6)]), and so on up to mean(tmp[c(97,98,99,100)])
are larger than 10.
How can I do it without a loop?
(A loop takes too long, since I have a table of 500000 rows by 60 columns.)
Moreover, I need not only the average but also the difference, the sum and so on.
I have tried splitting the indices like this:
tmp<-seq(1,100,1)
one<-seq(1,97,1)
two<-seq(2,98,1)
tree<-seq(3,99,1)
four<-seq(4,100,1)
aa<-(tmp[one]+tmp[two]+tmp[tree]+tmp[four])/4
which(aa>10)
It works, but it is not practical if you want, for example, an average over 12 elements.
Here is an example of what I do, to be clear:
b12<-seq(1,988,1)
b11<-seq(2,989,1)
b10<-seq(3, 990,1)
b9<-seq(4,991,1)
b8<-seq(5,992,1)
b7<-seq(6,993,1)
b6<-seq(7,994,1)
b5<-seq(8, 995,1)
b4<-seq(9,996,1)
b3<-seq(10,997,1)
b2<-seq(11,998,1)
b1<-seq(12,999,1)
now<-seq(13, 1000,1)
po<-rpois(1000,4)
nor<-rnorm(1000,5,0.2)
uni<-runif(1000,10,75)
chis<-rchisq(1000,3,0)
which((po[now]/nor[now])>1 & (nor[b12]/nor[now])>1 &
((po[now]/po[b4])>1 | (uni[now]-uni[b4])>=0) &
((chis[now]+chis[b1]+chis[b2]+chis[b3])/4)>2 &
(uni[now]/max(uni[b1],uni[b2],uni[b3],uni[b4],
uni[b5],uni[b6],uni[b7],uni[b8]))>0.5)+12
This code gives me the exact indices in the real table
that match all the conditions,
and I have 58 variables with 550000 rows.
Thank you.
The question is not very clear. Based on the wording, I guess, this should help:
n <- 100
res <- sapply(1:(n-3), function(i) mean(tmp[i:(i+3)]))
which(res >10)
Also,
m1 <- matrix(tmp[1:4+ rep(0:96,each=4)],ncol=4,byrow=T)
which(rowMeans(m1) >10)
Maybe you should look at the rollapply function from the "zoo" package. You would need to adjust the width argument according to your specific needs.
library(zoo)
tmp <- seq(1, 100, 1)
rollapply(tmp, width = 4, FUN = mean)
# [1] 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5 10.5 11.5 12.5 13.5 14.5 15.5
# [15] 16.5 17.5 18.5 19.5 20.5 21.5 22.5 23.5 24.5 25.5 26.5 27.5 28.5 29.5
# [29] 30.5 31.5 32.5 33.5 34.5 35.5 36.5 37.5 38.5 39.5 40.5 41.5 42.5 43.5
# [43] 44.5 45.5 46.5 47.5 48.5 49.5 50.5 51.5 52.5 53.5 54.5 55.5 56.5 57.5
# [57] 58.5 59.5 60.5 61.5 62.5 63.5 64.5 65.5 66.5 67.5 68.5 69.5 70.5 71.5
# [71] 72.5 73.5 74.5 75.5 76.5 77.5 78.5 79.5 80.5 81.5 82.5 83.5 84.5 85.5
# [85] 86.5 87.5 88.5 89.5 90.5 91.5 92.5 93.5 94.5 95.5 96.5 97.5 98.5
So, to get the details you want:
aa <- rollapply(tmp, width = 4, FUN = mean)
which(aa > 10)
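For very long vectors, a fully vectorized base R alternative is to derive the window sums from a cumulative sum, and to get lagged differences from diff(). This is a sketch under assumptions rather than a drop-in replacement: the names k, cs, roll_sum and roll_mean are invented for the example, and the cumulative-sum trick only covers sums and means (other window functions would still need something like rollapply).

```r
tmp <- seq(1, 100, 1)
k <- 4  # window width

# cs[i + 1] holds the sum of the first i elements, so the sum of
# tmp[i:(i + k - 1)] is cs[i + k] - cs[i].
cs <- cumsum(c(0, tmp))
roll_sum  <- cs[(k + 1):length(cs)] - cs[1:(length(cs) - k)]
roll_mean <- roll_sum / k

# Start positions of the windows whose mean exceeds 10.
which(roll_mean > 10)

# Difference between each element and the one k positions earlier.
lag_diff <- diff(tmp, lag = k)
```

Changing k to 12 gives the 12-element version without writing out twelve index vectors. With non-integer data the cumulative sum can accumulate small floating-point errors, so comparisons close to the threshold may deserve a tolerance.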

subset dataframe variables through part of names

Suppose I have a data frame that contains these series (among other columns),
where Ru and Uk are country codes.
Date CPI.Ru CPI.g.Ru CPI.s.Ru CPI.Uk CPI.g.Uk CPI.s.Uk
Q4-1990 61.4 66.4 67.5 72.2 68.2 32.4
Q1-1991 61.3 67.0 68.0 72.6 68.8 33.2
Q2-1991 61.4 67.5 68.1 73.2 69.5 35.1
Q3-1991 61.7 68.7 68.9 73.7 70.6 35.9
Q4-1991 62.3 68.4 69.3 74.3 71.9 38.2
Q1-1992 62.3 69.7 69.6 74.7 72.9 39.2
Q2-1992 62.1 70.3 70.0 75.3 73.7 40.6
Q3-1992 62.2 71.4 70.5 75.3 74.1 41.2
Q4-1992 62.5 71.1 70.9 75.7 74.3 44.0
I want to subset the data frame by country and then do something with these series.
For example, I want to divide the CPI index for each country by its first element.
How can I do it in a loop, or maybe with an apply function?
countries <- c("Ru","Uk")
for (i in countries)
{dataFrameName$CPI.{i} <- dfName$CPI.{i}/dfName$CPI.{i}[1]}
What should I write instead of {i}?
$ only accepts fixed column names. To select columns based on an expression, you can instead use double brackets:
countries <- c("Ru", "Uk")
for (i in countries){
x <- paste0("CPI.", i)
dfName[[x]] <- dfName[[x]]/dfName[[x]][1]
}
This is not a loop, but if your data is always of the same form for each country, so that each country has 3 columns, and you always want to operate on the first column per country, you could try this:
sub <- df[,seq(2,ncol(df), 3)] #create a subsetted data.frame containing the CPI index per country
apply(sub, 2, function(x) x/x[1]) #then use apply to operate on each column
# CPI.Ru CPI.Uk
# [1,] 1.0000000 1.000000
# [2,] 0.9983713 1.005540
# [3,] 1.0000000 1.013850
# [4,] 1.0048860 1.020776
# [5,] 1.0146580 1.029086
# [6,] 1.0146580 1.034626
# [7,] 1.0114007 1.042936
# [8,] 1.0130293 1.042936
# [9,] 1.0179153 1.048476
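If the columns are not guaranteed to sit at fixed positions, they can also be picked by name. The following sketch builds the names with paste0() and, as a second option, matches them with grep(); the three-row data frame is a shortened copy of the question's data, and the names cpi_cols and ru_cols are assumptions made for this example.

```r
# First three rows of the question's data, inlined to keep the sketch self-contained.
dfName <- data.frame(
  Date     = c("Q4-1990", "Q1-1991", "Q2-1991"),
  CPI.Ru   = c(61.4, 61.3, 61.4),
  CPI.g.Ru = c(66.4, 67.0, 67.5),
  CPI.s.Ru = c(67.5, 68.0, 68.1),
  CPI.Uk   = c(72.2, 72.6, 73.2),
  CPI.g.Uk = c(68.2, 68.8, 69.5),
  CPI.s.Uk = c(32.4, 33.2, 35.1)
)

countries <- c("Ru", "Uk")

# Build the column names, then rescale each selected column by its first value.
cpi_cols <- paste0("CPI.", countries)
dfName[cpi_cols] <- lapply(dfName[cpi_cols], function(x) x / x[1])

# All columns belonging to one country, selected by the name suffix.
ru_cols <- grep("\\.Ru$", names(dfName), value = TRUE)
ru_cols
```

Assigning back with dfName[cpi_cols] keeps the other columns untouched, unlike apply(), which returns a separate matrix.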
