I'm trying to read the file at this link:
COVID CSV
I'm using read.table, but it doesn't seem to work:
read.table(file = "https://data.brasil.io/dataset/covid19/caso.csv.gz")
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
line 1 did not have 3 elements
I'm trying to build code that pulls the data from this website with COVID info, so I don't have to download it every time I want to use it.
We could use fread
library(data.table)
fread("https://data.brasil.io/dataset/covid19/caso.csv.gz")
# date state city place_type confirmed deaths order_for_place is_last estimated_population_2019
# 1: 2020-07-17 AP state 33436 499 119 TRUE 845731
# 2: 2020-07-16 AP state 33004 493 118 FALSE 845731
# 3: 2020-07-15 AP state 32408 488 117 FALSE 845731
# 4: 2020-07-14 AP state 31885 483 116 FALSE 845731
# 5: 2020-07-13 AP state 31552 478 115 FALSE 845731
# ---
#372166: 2020-06-23 SP Óleo city 1 0 5 FALSE 2496
#372167: 2020-06-22 SP Óleo city 1 0 4 FALSE 2496
#372168: 2020-06-21 SP Óleo city 1 0 3 FALSE 2496
#372169: 2020-06-20 SP Óleo city 1 0 2 FALSE 2496
#372170: 2020-06-19 SP Óleo city 1 0 1 FALSE 2496
# city_ibge_code confirmed_per_100k_inhabitants death_rate
# 1: 16 3953.5030 0.0149
# 2: 16 3902.4229 0.0149
# 3: 16 3831.9513 0.0151
# 4: 16 3770.1113 0.0151
# 5: 16 3730.7371 0.0151
# ---
#372166: 3533809 40.0641 0.0000
#372167: 3533809 40.0641 0.0000
#372168: 3533809 40.0641 0.0000
#372169: 3533809 40.0641 0.0000
#372170: 3533809 40.0641 0.0000
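Note: depending on your data.table version, fread may need the R.utils package installed to handle a remote .gz file; if you see an error about compressed files, installing R.utils should fix it.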
Seems to work with readr::read_csv
readr::read_csv("https://data.brasil.io/dataset/covid19/caso.csv.gz")
# A tibble: 376,064 x 12
# date state city place_type confirmed deaths order_for_place is_last
# <date> <chr> <chr> <chr> <dbl> <dbl> <dbl> <lgl>
# 1 2020-07-18 AC NA state 17202 457 124 TRUE
# 2 2020-07-17 AC NA state 16965 452 123 FALSE
# 3 2020-07-16 AC NA state 16865 447 122 FALSE
# 4 2020-07-15 AC NA state 16672 446 121 FALSE
# 5 2020-07-14 AC NA state 16479 436 120 FALSE
# 6 2020-07-13 AC NA state 16260 430 119 FALSE
# 7 2020-07-12 AC NA state 16190 426 118 FALSE
# 8 2020-07-11 AC NA state 16080 419 117 FALSE
# 9 2020-07-10 AC NA state 15768 417 116 FALSE
#10 2020-07-09 AC NA state 15465 411 115 FALSE
# … with 376,054 more rows, and 4 more variables:
# estimated_population_2019 <dbl>, city_ibge_code <dbl>,
# confirmed_per_100k_inhabitants <dbl>, death_rate <dbl>
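For reference, the original call fails for two reasons: the stream is gzip-compressed (read.table ends up reading raw bytes), and read.table's default separator is whitespace rather than a comma, hence "line 1 did not have 3 elements". A base-R sketch that handles both, assuming the URL is still served without authentication:
con <- gzcon(url("https://data.brasil.io/dataset/covid19/caso.csv.gz"))
txt <- readLines(con)   # gzcon() decompresses the stream on the fly
close(con)
caso <- read.csv(textConnection(txt), stringsAsFactors = FALSE)
head(caso)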
I need to add a new column containing a specific function to a data frame.
Basically I need to calculate an indicator which is the sum of the past 5 observations of column "value1", multiplied by 100 and divided by column "value2" (the latter not as a sum, just the single observation) of my sample data below.
Something like this (it's not formal notation):
indicator = [sum(last 5 value1) / value2] * 100
The indicator must be calculated by country.
In case countries or dates are "mixed" in the data frame, the formula needs to be able to recognize and sum the correct values only, in the correct order.
If there is an NA in value1, the formula should also be able to skip that line in the computation. E.g. for 31/12, 1/01, 2/01, 3/01, 4/01 = NA, 05/01, the indicator for 06/01 will take into account the past 5 valid observations: 31/12, 1/01, 2/01, 3/01, 05/01.
Important -> only use base R
Example of the data frame (my actual data frame is more complex)
set.seed(1)
Country <- c(rep("USA", 10),rep("UK", 10), rep("China", 10))
Value1 <- sample(x = c(120, 340, 423), size = 30, replace = TRUE)
Value2 <- sample(x = c(1,3,5,6,9), size = 30, replace = TRUE)
date <- seq(as.POSIXct('2020/01/01'),
as.POSIXct('2020/01/30'),
by = "1 day")
df = data.frame(Country, Value1, Value2, date)
I thank you all very much in advance. This one has been very hard to crack :D
Since it has to be done group-wise but in base R, you could use the split-apply-bind method:
df2 <- do.call(rbind, lapply(split(df, df$Country), function(d) {
  d <- d[order(d$date),]                    # dates in order within each country
  d$computed <- 100 * d$Value1 / d$Value2   # per-row value entering the rolling sum
  d$Result <- NA
  # rolling sum of the last 5 non-NA values up to and including row i
  for(i in 5:nrow(d)) d$Result[i] <- sum(tail(na.omit(d$computed[seq(i)]), 5))
  d[!names(d) %in% "computed"]              # drop the helper column
}))
# restore the original row order from the "Country.row" rownames that rbind created
rn <- sapply(strsplit(rownames(df2), "\\."), function(x) as.numeric(x[2]))
`rownames<-`(df2[order(rn),], NULL)
#> Country Value1 Value2 date Result
#> 1 USA 423 9 2020-01-01 NA
#> 2 USA 120 3 2020-01-02 NA
#> 3 USA 120 3 2020-01-03 NA
#> 4 USA 423 5 2020-01-04 NA
#> 5 USA 120 1 2020-01-05 33160.00
#> 6 USA 120 1 2020-01-06 40460.00
#> 7 USA 120 3 2020-01-07 40460.00
#> 8 USA 340 1 2020-01-08 70460.00
#> 9 USA 423 6 2020-01-09 69050.00
#> 10 USA 340 9 2020-01-10 60827.78
#> 11 UK 340 5 2020-01-11 NA
#> 12 UK 423 6 2020-01-12 NA
#> 13 UK 423 3 2020-01-13 NA
#> 14 UK 340 1 2020-01-14 NA
#> 15 UK 120 3 2020-01-15 65950.00
#> 16 UK 120 9 2020-01-16 60483.33
#> 17 UK 423 1 2020-01-17 95733.33
#> 18 UK 423 9 2020-01-18 86333.33
#> 19 UK 340 1 2020-01-19 86333.33
#> 20 UK 340 3 2020-01-20 93666.67
#> 21 China 340 1 2020-01-21 NA
#> 22 China 340 9 2020-01-22 NA
#> 23 China 423 3 2020-01-23 NA
#> 24 China 120 1 2020-01-24 NA
#> 25 China 340 9 2020-01-25 67655.56
#> 26 China 340 5 2020-01-26 40455.56
#> 27 China 120 5 2020-01-27 39077.78
#> 28 China 340 9 2020-01-28 28755.56
#> 29 China 340 9 2020-01-29 20533.33
#> 30 China 423 5 2020-01-30 25215.56
Created on 2022-06-08 by the reprex package (v2.0.1)
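The key line is the rolling sum: for each row i it takes everything up to row i, drops the NAs, keeps the last 5 values, and sums them, which gives exactly the "skip NA lines" behaviour asked for. A tiny illustration with made-up numbers:
x <- c(10, NA, 20, 30, 40, 50, 60)
sum(tail(na.omit(x[seq(7)]), 5))   # 20 + 30 + 40 + 50 + 60
#> [1] 200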
Here's an option - not sure if the calculation is as you intend:
split_df <- split(df, df$Country)
split_df <- lapply(split_df, function(x) {
  x <- x[order(x$date),]
  x$index <- nrow(x):1    # 1 marks the most recent date
  x$indicator <- ifelse(x$index <= 5, sum(x$Value2[x$index <= 5]) * 100 / x$Value2, NA)
  x$index <- NULL
  return(x)
})
final_df <- do.call(rbind, split_df)
Country Value1 Value2 date indicator
China.21 China 120 3 2020-01-21 NA
China.22 China 423 5 2020-01-22 NA
China.23 China 340 6 2020-01-23 NA
China.24 China 120 3 2020-01-24 NA
China.25 China 340 9 2020-01-25 NA
China.26 China 423 6 2020-01-26 366.6667
China.27 China 120 3 2020-01-27 733.3333
China.28 China 340 3 2020-01-28 733.3333
China.29 China 120 5 2020-01-29 440.0000
China.30 China 340 5 2020-01-30 440.0000
UK.11 UK 423 1 2020-01-11 NA
UK.12 UK 340 6 2020-01-12 NA
UK.13 UK 423 1 2020-01-13 NA
UK.14 UK 423 5 2020-01-14 NA
UK.15 UK 340 6 2020-01-15 NA
UK.16 UK 340 1 2020-01-16 2400.0000
UK.17 UK 120 5 2020-01-17 480.0000
UK.18 UK 423 9 2020-01-18 266.6667
UK.19 UK 120 6 2020-01-19 400.0000
UK.20 UK 423 3 2020-01-20 800.0000
USA.1 USA 423 1 2020-01-01 NA
USA.2 USA 423 5 2020-01-02 NA
USA.3 USA 423 5 2020-01-03 NA
USA.4 USA 423 6 2020-01-04 NA
USA.5 USA 423 1 2020-01-05 NA
USA.6 USA 340 5 2020-01-06 600.0000
USA.7 USA 340 5 2020-01-07 600.0000
USA.8 USA 423 6 2020-01-08 500.0000
USA.9 USA 423 5 2020-01-09 600.0000
USA.10 USA 423 9 2020-01-10 333.3333
In base R you could do:
transform(df, Results = ave(Value1, Country, FUN = function(x)
  replace(x, !is.na(x), stats::filter(na.omit(x), rep(1, 5), sides = 1))) / Value2)
Country Value1 Value2 date Results
1 USA 120 1 2020-01-01 NA
2 USA 423 6 2020-01-02 NA
3 USA 120 1 2020-01-03 NA
4 USA 340 6 2020-01-04 NA
5 USA 120 5 2020-01-05 224.6000
6 USA 423 3 2020-01-06 475.3333
7 USA 423 3 2020-01-07 475.3333
8 USA 340 6 2020-01-08 274.3333
9 USA 340 6 2020-01-09 274.3333
10 USA 423 6 2020-01-10 324.8333
11 UK 423 3 2020-01-11 NA
12 UK 120 6 2020-01-12 NA
13 UK 120 1 2020-01-13 NA
14 UK 120 1 2020-01-14 NA
15 UK 340 6 2020-01-15 187.1667
16 UK 340 1 2020-01-16 1040.0000
17 UK 340 3 2020-01-17 420.0000
18 UK 340 5 2020-01-18 296.0000
19 UK 423 3 2020-01-19 594.3333
20 UK 120 3 2020-01-20 521.0000
21 China 423 9 2020-01-21 NA
22 China 120 3 2020-01-22 NA
23 China 120 1 2020-01-23 NA
24 China 120 5 2020-01-24 NA
25 China 120 5 2020-01-25 180.6000
26 China 340 6 2020-01-26 136.6667
27 China 120 5 2020-01-27 164.0000
28 China 120 1 2020-01-28 820.0000
29 China 340 6 2020-01-29 173.3333
30 China 340 9 2020-01-30 140.0000
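To see what that one-liner is doing: stats::filter() with sides = 1 returns, at each position, the sum of the current value and the previous 4, producing NA until 5 values exist; replace() then writes those sums back onto the non-NA positions, and the division by Value2 happens afterwards. For example (c() just strips the time-series attributes for printing):
c(stats::filter(1:8, rep(1, 5), sides = 1))
#> [1] NA NA NA NA 15 20 25 30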
This is just one of those things that I can't figure out how to word in order to search for a solution to my problem. I have some election data for the Democratic and Republican candidates. The data contains two rows per county, each row corresponding to one of the two candidates.
I need a data frame with one row per county, and I need to create a new column out of the second row for each county. I've tried to un-nest the data frame, but that doesn't work. I've seen something about using unnest and mutate together, but I can't figure that out. Transposing the data frame didn't help either. I've also tried to ungroup without success.
# Load Michigan 2020 by-county election data
# Data: https://mielections.us/election/results/DATA/2020GEN_MI_CENR_BY_COUNTY.xls
election <- read.csv("2020GEN_MI_CENR_BY_COUNTY.txt", sep = "\t", header = TRUE)
# Remove unnecessary columns
election <- within(election, rm('ElectionDate', 'OfficeCode.Text.', 'DistrictCode.Text.',
                                'StatusCode', 'CountyCode', 'OfficeDescription', 'PartyOrder',
                                'PartyName', 'CandidateID', 'CandidateFirstName',
                                'CandidateMiddleName', 'CandidateFormerName',
                                'WriteIn.W..Uncommitted.Z.', 'Recount...', 'Nominated.N..Elected.E.'))
# Remove offices other than POTUS
election <- election[-c(167:2186),]
# Keep only DEM and REP parties
library(dplyr)
election <- election %>%
  filter(PartyDescription == "Democratic" |
           PartyDescription == "Republican")
I'd like it to look like this:
A dplyr approach:
library(dplyr)
library(tidyr) # pivot_wider
election %>%
select(CountyName, PartyDescription, CandidateLastName, CandidateVotes) %>%
slice(-(167:2186)) %>%
filter(PartyDescription %in% c("Democratic", "Republican")) %>%
pivot_wider(id_cols = CountyName, names_from = CandidateLastName, values_from = CandidateVotes)
# # A tibble: 83 x 25
# CountyName Biden Trump Richer LaFave Cambensy Wagner Metsa Markkanen Lipton Strayhorn Carlone Frederick Bernstein Diggs Hubbard Meyers Mosallam Vassar `O'Keefe` Schuitmaker Dewaelsche Stancato Gates Land
# <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
# 1 ALCONA 2142 4848 NA NA NA NA NA NA 1812 1748 4186 4209 1818 1738 4332 4114 1696 1770 4273 4187 1682 1733 4163 4223
# 2 ALGER 2053 3014 NA NA 2321 2634 NA NA 1857 1773 2438 2470 1795 1767 2558 2414 1757 1769 2538 2444 1755 1757 2458 2481
# 3 ALLEGAN 24449 41392 NA NA NA NA NA NA 20831 19627 37681 38036 20043 19640 38805 37375 18820 19486 37877 39052 19081 19039 37322 38883
# 4 ALPENA 6000 10686 NA NA NA NA NA NA 5146 4882 8845 8995 5151 4873 9369 8744 4865 4935 9212 8948 4816 4923 9069 9154
# 5 ANTRIM 5960 9748 NA NA NA NA NA NA 5042 4798 8828 8886 4901 4797 9108 8737 4686 4810 9079 8867 4679 4781 8868 9080
# 6 ARENAC 2774 5928 NA NA NA NA NA NA 2374 2320 4626 4768 2396 2224 4833 4584 2215 2243 5025 4638 2185 2276 4713 4829
# 7 BARAGA 1478 2512 NA NA NA NA 1413 2517 1267 1212 2057 2078 1269 1233 2122 2003 1219 1243 2090 2056 1226 1228 2072 2074
# 8 BARRY 11797 23471 NA NA NA NA NA NA 9794 9280 20254 20570 9466 9215 20885 20265 9060 9324 21016 20901 8967 9121 20346 21064
# 9 BAY 26151 33125 NA NA NA NA NA NA 23209 22385 26021 26418 23497 22050 27283 25593 21757 22225 27422 25795 21808 21999 26167 26741
# 10 BENZIE 5480 6601 NA NA NA NA NA NA 4704 4482 5741 5822 4584 4479 6017 5681 4379 4449 5979 5756 4392 4353 5704 5870
# # ... with 73 more rows
@r2evans had the right idea, but slicing the data before filtering lost a lot of the voting data. I hadn't realized that before.
# Load Michigan 2020 by-county election data
# Data: https://mielections.us/election/results/DATA/2020GEN_MI_CENR_BY_COUNTY.xls
election <- read.csv("2020GEN_MI_CENR_BY_COUNTY.txt", sep = "\t", header = TRUE)
# That's an ugly dataset...let's make it better
election <- election[-c(1:5,7:9,11,13:15,17:19)]
election <- election %>%
filter(CandidateLastName %in% c("Biden", "Trump")) %>%
select(CountyName, PartyDescription, CandidateLastName, CandidateVotes) %>%
pivot_wider(id_cols = CountyName, names_from = CandidateLastName, values_from = CandidateVotes)
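As a quick sanity check after the pivot (same columns as above), each county should now appear exactly once:
stopifnot(anyDuplicated(election$CountyName) == 0)
nrow(election)
#> [1] 83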
I have a data.frame (df) with two columns (Date & Count) which looks something like shown below:
Date Count
1/1/2022 5
1/2/2022 13
1/3/2022 21
1/4/2022 29
1/5/2022 37
1/6/2022 45
1/7/2022 53
1/8/2022 61
1/9/2022 69
1/10/2022 77
1/11/2022 85
1/12/2022 93
1/13/2022 101
1/14/2022 109
1/15/2022 117
Since I have a single variable (Count), the idea is to identify whether the mean has changed every three days. I therefore want to apply a rolling t-test with a window of 3 days and save the resulting p-value next to the Count column, which I can plot later. Since the examples I have seen do these sorts of tests with two variables, I can't figure out how to do it with a single variable.
For example, I saw this relevant answer here:
ttestFun <- function(dat) {
myTtest = t.test(x = dat[, 1], y = dat[, 2])
return(myTtest$p.value)
}
rollapply(df_ts, 7, FUN = ttestFun, fill = NA, by.column = FALSE)
But again, this is with two columns. Any guidance please?
Irrespective of any discussion about the usefulness of the approach, given a fixed number of 3 measurements, you could just shift the counts by 3 and perform a t-test between the two columns, as in your example:
library(data.table)
set.seed(123)
dates <- seq(as.POSIXct("2022-01-01"), as.POSIXct("2022-02-01"), by = "1 day")
dt <- data.table(Date=dates, count = sample(1:200, length(dates), replace=TRUE), key="Date")
dt[, nxt := shift(count, 3, type = "lead")]       # counts from the next 3-day block
dt[, group := rep(1:ceiling(length(dates)/3), each = 3)[seq_along(dates)]]  # 3-day group ids
dt[, p := tryCatch(t.test(count, nxt)$p.value, error = function(e) NA), by = "group"][]
#> Date count nxt group p
#> 1: 2022-01-01 159 195 1 0.7750944
#> 2: 2022-01-02 179 170 1 0.7750944
#> 3: 2022-01-03 14 50 1 0.7750944
#> 4: 2022-01-04 195 118 2 0.2240362
#> 5: 2022-01-05 170 43 2 0.2240362
#> 6: 2022-01-06 50 14 2 0.2240362
#> 7: 2022-01-07 118 118 3 0.1763296
#> 8: 2022-01-08 43 153 3 0.1763296
#> 9: 2022-01-09 14 90 3 0.1763296
#> 10: 2022-01-10 118 91 4 0.8896343
#> 11: 2022-01-11 153 197 4 0.8896343
#> 12: 2022-01-12 90 91 4 0.8896343
#> 13: 2022-01-13 91 185 5 0.8065021
#> 14: 2022-01-14 197 92 5 0.8065021
#> 15: 2022-01-15 91 137 5 0.8065021
#> 16: 2022-01-16 185 99 6 0.1060465
#> 17: 2022-01-17 92 72 6 0.1060465
#> 18: 2022-01-18 137 26 6 0.1060465
#> 19: 2022-01-19 99 7 7 0.5283156
#> 20: 2022-01-20 72 170 7 0.5283156
#> 21: 2022-01-21 26 137 7 0.5283156
#> 22: 2022-01-22 7 164 8 0.9612965
#> 23: 2022-01-23 170 78 8 0.9612965
#> 24: 2022-01-24 137 81 8 0.9612965
#> 25: 2022-01-25 164 43 9 0.6111337
#> 26: 2022-01-26 78 103 9 0.6111337
#> 27: 2022-01-27 81 117 9 0.6111337
#> 28: 2022-01-28 43 76 10 0.6453494
#> 29: 2022-01-29 103 143 10 0.6453494
#> 30: 2022-01-30 117 NA 10 0.6453494
#> 31: 2022-01-31 76 NA 11 NA
#> 32: 2022-02-01 143 NA 11 NA
#> Date count nxt group p
Created on 2022-04-07 by the reprex package (v2.0.1)
You could further clean that up, e.g. by taking the first date per group:
dt[, .(Date=Date[1], count=round(mean(count), 2), p=p[1]), by="group"]
#> group Date count p
#> 1: 1 2022-01-01 117.33 0.7750944
#> 2: 2 2022-01-04 138.33 0.2240362
#> 3: 3 2022-01-07 58.33 0.1763296
#> 4: 4 2022-01-10 120.33 0.8896343
#> 5: 5 2022-01-13 126.33 0.8065021
#> 6: 6 2022-01-16 138.00 0.1060465
#> 7: 7 2022-01-19 65.67 0.5283156
#> 8: 8 2022-01-22 104.67 0.9612965
#> 9: 9 2022-01-25 107.67 0.6111337
#> 10: 10 2022-01-28 87.67 0.6453494
#> 11: 11 2022-01-31 109.50 NA
You can create a grp, and then simply apply a t.test to each consecutive pair of groups:
library(dplyr)
d <- d %>% mutate(grp = rep(1:(n()/3), each = 3))
d %>%
  left_join(
    tibble(
      grp = 2:max(d$grp),
      pval = sapply(2:max(d$grp), function(x) {
        t.test(d %>% filter(grp == x) %>% pull(Count),
               d %>% filter(grp == x - 1) %>% pull(Count))$p.value
      })
    )
  ) %>%
  group_by(grp) %>%
  slice_min(Date)
Output: (p-value is constant only because of the example data you provided)
Date Count grp pval
<date> <dbl> <int> <dbl>
1 2022-01-01 5 1 NA
2 2022-01-04 29 2 0.0213
3 2022-01-07 53 3 0.0213
4 2022-01-10 77 4 0.0213
5 2022-01-13 101 5 0.0213
Or a data.table approach:
setDT(d)[, `:=`(grp=rep(1:(nrow(d)/3), each=3),cy=shift(Count,3))] %>%
.[!is.na(cy), pval:=t.test(Count,cy)$p.value, by=grp] %>%
.[,.SD[1], by=grp, .SDcols=!c("cy")]
Output:
grp Date Count pval
<int> <Date> <num> <num>
1: 1 2022-01-01 5 NA
2: 2 2022-01-04 29 0.02131164
3: 3 2022-01-07 53 0.02131164
4: 4 2022-01-10 77 0.02131164
5: 5 2022-01-13 101 0.02131164
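If you would rather stay close to your original rollapply idea, a single-variable version is also possible: use a window of 6 observations and test the first 3 values against the last 3. A sketch, assuming your data frame df with the Count column and the zoo package:
library(zoo)
ttestFun <- function(x) {
  # first half of the 6-day window vs. the second half
  tryCatch(t.test(x[1:3], x[4:6])$p.value, error = function(e) NA)
}
df$pval <- rollapply(df$Count, width = 6, FUN = ttestFun,
                     fill = NA, align = "right")
This evaluates the test for every day (a genuinely rolling window); the answers above instead compare non-overlapping 3-day blocks.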
Let's assume we somehow ended up with a data frame object (T2 in the example below) and we want to subset our original data with that data frame. Is there a way to do it without using | in the subset call?
Here is a dataset I was playing with, but I failed:
education = read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/robustbase/education.csv", stringsAsFactors = FALSE)
colnames(education) = c("X", "State", "Region", "Urban.Population", "Per.Capita.Income", "Minor.Population", "Education.Expenditures")
head(education)
T1 = c(1,4,13,15,17,23,33,38)
T2 = education[T1,]$State
subset(education, State=="ME"| State=="MA" | State=="MI" | State=="MN" | State=="MO" | State=="MD" | State=="MS" | State=="MT")
subset(education, State==T2[3])
subset(education, State==T2)
PS: I created T2 as the states starting with M, but I don't want to use strings or anything. Just assume we somehow ended up with T2, whose entries are some states.
I'm not quite sure what would be an acceptable answer, but subset(education, State %in% T2) uses T2 as-is and does not use |. Does this solve your problem? It's almost the same approach as Jon Spring points out in the comments, but instead of spelling out a vector we can just use T2 with %in%. You say T2 is a data.frame object, but with the data you provided it turns out to be a character vector.
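To see why == is the wrong tool here: == compares element-wise and recycles the shorter vector, whereas %in% asks, for each element on the left, whether it appears anywhere in the vector on the right. A minimal illustration:
c("ME", "MA", "MI") %in% c("MA", "ME")   # membership test
#> [1]  TRUE  TRUE FALSE
c("ME", "MA", "MI") == c("MA", "ME")     # recycled element-wise comparison, plus a length warning
#> [1] FALSE FALSE FALSE
The full example with the education data: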
education = read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/robustbase/education.csv", stringsAsFactors = FALSE)
colnames(education) = c("X", "State", "Region", "Urban.Population", "Per.Capita.Income", "Minor.Population", "Education.Expenditures")
T1 = c(1,4,13,15,17,23,33,38)
T2 = education[T1,]$State
T2 # T2 is not a data.frame object (R 4.0)
#> [1] "ME" "MA" "MI" "MN" "MO" "MD" "MS" "MT"
subset(education, State %in% T2)
#> X State Region Urban.Population Per.Capita.Income Minor.Population
#> 1 1 ME 1 508 3944 325
#> 4 4 MA 1 846 5233 305
#> 13 13 MI 2 738 5439 337
#> 15 15 MN 2 664 4921 330
#> 17 17 MO 2 701 4672 309
#> 23 23 MD 3 766 5331 323
#> 33 33 MS 3 445 3448 358
#> 38 38 MT 4 534 4418 335
#> Education.Expenditures
#> 1 235
#> 4 261
#> 13 379
#> 15 378
#> 17 231
#> 23 330
#> 33 215
#> 38 302
But let's say T2 were an actual data.frame:
T2 = education[T1,]["State"]
T2 #check
#> State
#> 1 ME
#> 4 MA
#> 13 MI
#> 15 MN
#> 17 MO
#> 23 MD
#> 33 MS
#> 38 MT
Then we could coerce it into a vector by subsetting it with drop = TRUE.
subset(education, State %in% T2[, , drop = TRUE])
#> X State Region Urban.Population Per.Capita.Income Minor.Population
#> 1 1 ME 1 508 3944 325
#> 4 4 MA 1 846 5233 305
#> 13 13 MI 2 738 5439 337
#> 15 15 MN 2 664 4921 330
#> 17 17 MO 2 701 4672 309
#> 23 23 MD 3 766 5331 323
#> 33 33 MS 3 445 3448 358
#> 38 38 MT 4 534 4418 335
#> Education.Expenditures
#> 1 235
#> 4 261
#> 13 379
#> 15 378
#> 17 231
#> 23 330
#> 33 215
#> 38 302
Created on 2021-06-12 by the reprex package (v0.3.0)
I used this method to gather mean and sd results successfully before, here. Then I tried to use the same method to gather my gene-count DEG data with "logFC", "CI.L", "CI.R" and "adj.P.Val", but I failed because something went wrong with my result.
Just like this:
data_1<-data.frame(matrix(sample(1:1200,1200,replace = T),48,25))
names(data_1) <- c(paste0("Gene_", 1:25))
rownames(data_1)<-NULL
head(data_1)
A<-paste0(1:48,"_logFC")
data_logFC<-data.frame(A=A,data_1)
#
data_2<-data.frame(matrix(sample(1:1200,1200,replace = T),48,25))
names(data_2) <- c(paste0("Gene_", 1:25))
rownames(data_2)<-NULL
B_L<-paste0(1:48,"_CI.L")
data_CIL<-data.frame(A=B_L,data_2)
data_CIL[1:48,1:6]
#
data_3<-data.frame(matrix(sample(1:1200,1200,replace = T),48,25))
names(data_3) <- c(paste0("Gene_", 1:25))
rownames(data_3)<-NULL
C_R<-paste0(1:48,"_CI.R")
data_CIR<-data.frame(A=C_R,data_3)
data_CIR[1:48,1:6]
#
data_4<-data.frame(matrix(sample(1:1200,1200,replace = T),48,25))
names(data_4) <- c(paste0("Gene_", 1:25))
rownames(data_4)<-NULL
D<-paste0(1:48,"_adj.P.Val")
data_ajustP<-data.frame(A=D,data_4)
data_ajustP[1:48,1:6]
# combine data_logFC data_CIL data_CIR data_ajustP
library(dplyr)  # bind_rows
library(tidyr)  # pivot_longer, pivot_wider
data <- bind_rows(list(
logFC = data_logFC,
CIL = data_CIL,
CIR =data_CIR,
AJSTP=data_ajustP
), .id = "stat")
data[1:10,1:6]
data_DEG <- data %>%
  pivot_longer(-c(stat, A), names_to = "Gene", values_to = "value") %>%
  pivot_wider(names_from = "stat", values_from = "value")
head(data_DEG,100)
str(data_DEG$CIL)
> head(data_DEG,100)
# A tibble: 100 x 6
A Gene logFC CIL CIR AJSTP
<chr> <chr> <int> <int> <int> <int>
1 1_logFC Gene_1 504 NA NA NA
2 1_logFC Gene_2 100 NA NA NA
3 1_logFC Gene_3 689 NA NA NA
4 1_logFC Gene_4 779 NA NA NA
5 1_logFC Gene_5 397 NA NA NA
6 1_logFC Gene_6 1152 NA NA NA
7 1_logFC Gene_7 780 NA NA NA
8 1_logFC Gene_8 155 NA NA NA
9 1_logFC Gene_9 142 NA NA NA
10 1_logFC Gene_10 1150 NA NA NA
# … with 90 more rows
Why are there so many NAs?
Can somebody help me? Very thankful.
EDIT:
I had confused the sample grouping of my data, so I was reshaping without the right index: the A column ("1_logFC", "1_CI.L", ...) differs across the four stats, so pivot_wider never finds matching rows to combine and fills the gaps with NA.
Here is the right method:
data[1:10,1:6]
data <- separate(data, A, c("Name", "stat2"), "_")
data <- data[,-3]   # drop stat2; the stat column already carries this information
data_DEG <- data %>%
  pivot_longer(-c(stat, Name), names_to = "Gene", values_to = "value") %>%
  pivot_wider(names_from = "stat", values_from = "value")
head(data_DEG,10)
tail(data_DEG,10)
> head(data_DEG,10)
# A tibble: 10 x 6
Name Gene logFC CIL CIR AJSTP
<chr> <chr> <int> <int> <int> <int>
1 1 Gene_1 504 1116 774 278
2 1 Gene_2 100 936 448 887
3 1 Gene_3 689 189 718 933
4 1 Gene_4 779 943 690 19
5 1 Gene_5 397 976 40 135
6 1 Gene_6 1152 304 343 647
7 1 Gene_7 780 1076 796 1024
8 1 Gene_8 155 645 469 180
9 1 Gene_9 142 256 889 1047
10 1 Gene_10 1150 976 1194 670
> tail(data_DEG,10)
# A tibble: 10 x 6
Name Gene logFC CIL CIR AJSTP
<chr> <chr> <int> <int> <int> <int>
1 48 Gene_16 448 633 1080 1122
2 48 Gene_17 73 772 14 388
3 48 Gene_18 652 999 699 912
4 48 Gene_19 600 1163 512 241
5 48 Gene_20 428 1119 1142 348
6 48 Gene_21 66 553 240 82
7 48 Gene_22 753 1119 630 117
8 48 Gene_23 1017 305 1120 447
9 48 Gene_24 432 1175 447 670
10 48 Gene_25 482 394 371 696
It's a perfect result!!
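For reference, the same fix can be written as a single pipe so data isn't overwritten along the way (a sketch, same columns as above):
library(dplyr)
library(tidyr)
data_DEG <- data %>%
  separate(A, into = c("Name", "stat2"), sep = "_", extra = "merge") %>%
  select(-stat2) %>%   # the stat column already carries this information
  pivot_longer(-c(stat, Name), names_to = "Gene", values_to = "value") %>%
  pivot_wider(names_from = "stat", values_from = "value")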