QA for merging
Purpose
This page summarizes anomalies and missing values in the screening/demographic data and in the home visit data before the two are merged.
Screening/demographic data
screen_df <-
readr::read_csv(
file.path(
here::here(),
"data",
"csv",
"screening",
"agg",
"PLAY-screening-datab-latest.csv"
),
col_types = readr::cols(.default = 'c'),
show_col_types = FALSE
)
The following rows have incomplete or missing site_id values:
screen_df |>
# is.na() alone is enough: is.null() on an existing column is always FALSE
dplyr::filter(is.na(site_id)) |>
dplyr::select(site_id, vol_id, participant_ID, session_id, play_id, group_name) |>
dplyr::arrange(vol_id, site_id, participant_ID) |>
knitr::kable() |>
kableExtra::kable_classic()
site_id | vol_id | participant_ID | session_id | play_id | group_name |
---|---|---|---|---|---|
NA | 1103 | 001 | 44638 | NA | PLAY_Silver |
NA | 1481 | 006 | 64916 | NA | PLAY_Silver |
NA | 1576 | 003 | 64939 | NA | PLAY_Silver |
NA | 1656 | 001 | 70116 | NA | PLAY_Silver |
NA | 1657 | 003 | 72525 | NA | PLAY_Silver |
NA | 954 | 001 | 39302 | NA | PLAY_Silver |
Volume 1103 is OHIOS. Volume 1481 is CSUFL. Volume 954 is GEORG.
The missing values for site_id indicate that there is a bug in the cleaning code.
2023-10-20
On closer investigation, the screening data do not show an OHIOS session with participant_ID == ‘001’. There are three with ‘000’ and two with ‘002’.
Similarly, for CSUFL, there is no ‘003’ or ‘006’. We have home visit data for ‘006’.
Similarly, for GEORG, there is a ‘???’, but no ‘001’.
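If these rows turn out to be valid sessions, one possible stopgap is to backfill site_id from vol_id. The following is only a sketch on toy rows: `screen_toy`, `vol_site`, and `site_from_vol` are hypothetical names, and the lookup lists only the three volume-to-site pairs that appear on this page.

```r
# Toy rows shaped like screen_df (all columns are character).
screen_toy <- data.frame(
  site_id = c(NA, "OHIOS", NA, NA),
  vol_id = c("1103", "1103", "954", "1576"),
  participant_ID = c("001", "002", "001", "003"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup; only pairs stated on this page are included.
vol_site <- data.frame(
  vol_id = c("1103", "1481", "954"),
  site_from_vol = c("OHIOS", "CSUFL", "GEORG"),
  stringsAsFactors = FALSE
)

screen_filled <- screen_toy |>
  dplyr::left_join(vol_site, by = "vol_id") |>
  # Keep an existing site_id; otherwise fall back to the lookup value.
  dplyr::mutate(site_id = dplyr::coalesce(site_id, site_from_vol)) |>
  dplyr::select(-site_from_vol)
```

Volumes that are absent from the lookup (1576 in the toy data) keep a missing site_id, so they remain visible in the check above rather than being silently guessed.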
The following rows have incomplete or missing participant_ID values:
screen_df |>
# is.na() alone is enough: is.null() on an existing column is always FALSE
dplyr::filter(is.na(participant_ID)) |>
dplyr::select(site_id, vol_id, participant_ID, session_id, play_id, group_name) |>
dplyr::arrange(vol_id, site_id, participant_ID) |>
knitr::kable() |>
kableExtra::kable_classic()
site_id | vol_id | participant_ID | session_id | play_id | group_name |
---|---|---|---|---|---|
(No rows returned.)
The following rows have incomplete or missing play_id values:
screen_df |>
# is.na() alone is enough: is.null() on an existing column is always FALSE
dplyr::filter(is.na(play_id)) |>
dplyr::select(site_id, vol_id, participant_ID, session_id, play_id, group_name) |>
dplyr::arrange(vol_id, site_id, participant_ID) |>
knitr::kable() |>
kableExtra::kable_classic()
site_id | vol_id | participant_ID | session_id | play_id | group_name |
---|---|---|---|---|---|
UCSCR | 1066 | 001 | 56051 | NA | PLAY_Gold |
UCSCR | 1066 | 002 | 56073 | NA | PLAY_Gold |
UCSCR | 1066 | 003 | 56321 | NA | PLAY_Gold |
UCSCR | 1066 | 005 | 57998 | NA | PLAY_Gold |
UCSCR | 1066 | 009 | 58358 | NA | PLAY_Gold |
UCSCR | 1066 | 010 | 58466 | NA | PLAY_Gold |
UCSCR | 1066 | 011 | 58472 | NA | PLAY_Gold |
UCSCR | 1066 | 012 | 58477 | NA | PLAY_Gold |
UCSCR | 1066 | 014 | 58805 | NA | PLAY_Silver |
UCSCR | 1066 | 015 | 59804 | NA | PLAY_Gold |
UCSCR | 1066 | 016 | 59805 | NA | PLAY_Gold |
OHIOS | 1103 | 002 | 56674 | NA | PLAY_Gold |
OHIOS | 1103 | 002 | 56674 | NA | PLAY_Gold |
OHIOS | 1103 | 005 | 57182 | NA | PLAY_Gold |
OHIOS | 1103 | 006 | 58204 | NA | PLAY_Gold |
OHIOS | 1103 | 008 | 58230 | NA | PLAY_Gold |
OHIOS | 1103 | 009 | 57371 | NA | PLAY_Gold |
OHIOS | 1103 | 010 | 57212 | NA | PLAY_Gold |
OHIOS | 1103 | 011 | 57324 | NA | PLAY_Gold |
OHIOS | 1103 | 012 | 58231 | NA | PLAY_Gold |
OHIOS | 1103 | 014 | 58232 | NA | PLAY_Gold |
OHIOS | 1103 | 015 | 58641 | NA | PLAY_Gold |
OHIOS | 1103 | 016 | 58315 | NA | PLAY_Gold |
OHIOS | 1103 | 017 | 58642 | NA | PLAY_Gold |
OHIOS | 1103 | 018 | 58724 | NA | PLAY_Gold |
OHIOS | 1103 | 019 | 58725 | NA | PLAY_Gold |
OHIOS | 1103 | 021 | 58747 | NA | PLAY_Gold |
OHIOS | 1103 | 023 | 59001 | NA | PLAY_Gold |
OHIOS | 1103 | 024 | 59029 | NA | PLAY_Gold |
OHIOS | 1103 | 025 | 59109 | NA | PLAY_Gold |
OHIOS | 1103 | 026 | 59802 | NA | PLAY_Gold |
OHIOS | 1103 | 027 | 59806 | NA | PLAY_Gold |
OHIOS | 1103 | 028 | 59820 | NA | PLAY_Gold |
OHIOS | 1103 | 029 | 59858 | NA | PLAY_Gold |
OHIOS | 1103 | 030 | 59966 | NA | PLAY_Gold |
NA | 1103 | 001 | 44638 | NA | PLAY_Silver |
STANF | 1362 | 001 | 57209 | NA | PLAY_Silver |
STANF | 1362 | 002 | 58017 | NA | PLAY_Gold |
PURDU | 1363 | 001 | 56367 | NA | PLAY_Silver |
PURDU | 1363 | 003 | 58740 | NA | PLAY_Gold |
PURDU | 1363 | 004 | 58918 | NA | PLAY_Gold |
PURDU | 1363 | 005 | 59150 | NA | PLAY_Silver |
PURDU | 1363 | 006 | 60049 | NA | PLAY_Silver |
PURDU | 1363 | 006 | 60049 | NA | PLAY_Silver |
CHOPH | 1370 | 001 | 57863 | NA | PLAY_Silver |
CHOPH | 1370 | 002 | 60897 | NA | PLAY_Gold |
CSULB | 1376 | 001 | 56400 | NA | PLAY_Gold |
CSULB | 1376 | 002 | 56399 | NA | PLAY_Gold |
CSULB | 1376 | 003 | 57852 | NA | PLAY_Gold |
CSULB | 1376 | 004 | 58612 | NA | PLAY_Gold |
CSULB | 1376 | 005 | 59735 | NA | PLAY_Silver |
CSULB | 1376 | 007 | 57857 | NA | PLAY_Gold |
CSULB | 1376 | 010 | 60080 | NA | PLAY_Gold |
VBLTU | 1391 | 001 | 59779 | NA | PLAY_Silver |
VBLTU | 1391 | 003 | 60236 | NA | PLAY_Gold |
VBLTU | 1391 | 004 | 60243 | NA | PLAY_Gold |
VBLTU | 1391 | 005 | 60311 | NA | PLAY_Gold |
UHOUS | 1397 | 001 | 57374 | NA | PLAY_Silver |
UHOUS | 1397 | 002 | 57916 | NA | PLAY_Gold |
UHOUS | 1397 | 004 | 58465 | NA | PLAY_Gold |
UHOUS | 1397 | 005 | 59144 | NA | PLAY_Gold |
UHOUS | 1397 | 006 | 60333 | NA | PLAY_Gold |
UHOUS | 1397 | 007 | 61697 | NA | PLAY_Gold |
UHOUS | 1397 | 008 | 61748 | NA | PLAY_Gold |
INDNA | 1400 | 001 | 58458 | NA | PLAY_Silver |
INDNA | 1400 | 002 | 62176 | NA | PLAY_Gold |
UIOWA | 1422 | 001 | 57544 | NA | PLAY_Gold |
UIOWA | 1422 | 002 | 58798 | NA | PLAY_Gold |
UIOWA | 1422 | 003 | 59206 | NA | PLAY_Gold |
UIOWA | 1422 | 004 | 59892 | NA | PLAY_Gold |
UIOWA | 1422 | 005 | 60749 | NA | PLAY_Silver |
CSUFL | 1481 | 001 | 60393 | NA | PLAY_Silver |
NA | 1481 | 006 | 64916 | NA | PLAY_Silver |
NA | 1576 | 003 | 64939 | NA | PLAY_Silver |
NA | 1656 | 001 | 70116 | NA | PLAY_Silver |
NA | 1657 | 003 | 72525 | NA | PLAY_Silver |
NYUNI | 899 | 001 | 41534 | NA | PLAY_Gold |
NYUNI | 899 | 001 | 41534 | NA | PLAY_Gold |
NYUNI | 899 | 002 | 41800 | NA | PLAY_Silver |
NYUNI | 899 | 002 | 41800 | NA | PLAY_Silver |
NYUNI | 899 | 003 | 41455 | NA | PLAY_Silver |
NYUNI | 899 | 003 | 41455 | NA | PLAY_Silver |
NYUNI | 899 | 003 | 41455 | NA | PLAY_Silver |
NYUNI | 899 | 004 | 41535 | NA | PLAY_Gold |
NYUNI | 899 | 005 | 41608 | NA | PLAY_Gold |
NYUNI | 899 | 005 | 41608 | NA | PLAY_Gold |
NYUNI | 899 | 006 | 41808 | NA | PLAY_Gold |
NYUNI | 899 | 006 | 41808 | NA | PLAY_Gold |
NYUNI | 899 | 007 | 41894 | NA | PLAY_Gold |
NYUNI | 899 | 007 | 41894 | NA | PLAY_Gold |
NYUNI | 899 | 013 | 43207 | NA | PLAY_Gold |
NYUNI | 899 | 014 | 43530 | NA | PLAY_Gold |
NYUNI | 899 | 017 | 55842 | NA | PLAY_Gold |
NYUNI | 899 | 018 | 55863 | NA | PLAY_Silver |
NYUNI | 899 | 020 | 56064 | NA | PLAY_Silver |
NYUNI | 899 | 021 | 56065 | NA | PLAY_Gold |
NYUNI | 899 | 022 | 56103 | NA | PLAY_Gold |
NYUNI | 899 | 023 | 56104 | NA | PLAY_Gold |
NYUNI | 899 | 024 | 56311 | NA | PLAY_Silver |
NYUNI | 899 | 026 | 56417 | NA | PLAY_Silver |
NYUNI | 899 | 028 | 56526 | NA | PLAY_Gold |
NYUNI | 899 | 029 | 56571 | NA | PLAY_Gold |
NYUNI | 899 | 031 | 57373 | NA | PLAY_Silver |
NYUNI | 899 | 032 | 57410 | NA | PLAY_Gold |
NYUNI | 899 | 035 | 57894 | NA | PLAY_Gold |
NYUNI | 899 | 037 | 59428 | NA | PLAY_Gold |
NYUNI | 899 | 227 | 38196 | NA | PLAY_Gold |
NYUNI | 899 | 228 | 38197 | NA | PLAY_Gold |
NYUNI | 899 | 229 | 38215 | NA | PLAY_Gold |
NYUNI | 899 | 230 | 38236 | NA | PLAY_Gold |
NYUNI | 899 | 231 | 38485 | NA | PLAY_Gold |
GEORG | 954 | 002 | 40510 | NA | PLAY_Silver |
GEORG | 954 | 002 | 40510 | NA | PLAY_Silver |
GEORG | 954 | 003 | 41417 | NA | PLAY_Gold |
GEORG | 954 | 003 | 41417 | NA | PLAY_Gold |
GEORG | 954 | 004 | 41428 | NA | PLAY_Gold |
GEORG | 954 | 005 | 41873 | NA | PLAY_Gold |
GEORG | 954 | 005 | 41873 | NA | PLAY_Gold |
GEORG | 954 | 005 | 41873 | NA | PLAY_Gold |
GEORG | 954 | 008 | 42127 | NA | PLAY_Gold |
GEORG | 954 | 008 | 42127 | NA | PLAY_Gold |
GEORG | 954 | 009 | 42354 | NA | PLAY_Gold |
GEORG | 954 | 009 | 42354 | NA | PLAY_Gold |
GEORG | 954 | 010 | 42353 | NA | PLAY_Gold |
GEORG | 954 | 011 | 42694 | NA | PLAY_Gold |
GEORG | 954 | 012 | 57864 | NA | PLAY_Gold |
GEORG | 954 | 015 | 58803 | NA | PLAY_Gold |
GEORG | 954 | 016 | 59778 | NA | PLAY_Silver |
GEORG | 954 | 021 | 60059 | NA | PLAY_Gold |
NA | 954 | 001 | 39302 | NA | PLAY_Silver |
UCRIV | 966 | 001 | 43624 | NA | PLAY_Gold |
VCOMU | 982 | 002 | 42292 | NA | PLAY_Silver |
VCOMU | 982 | 003 | 42940 | NA | PLAY_Gold |
UMIAM | 996 | 003 | 65077 | NA | PLAY_Gold |
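There are duplicate entries, for example OHIOS volume 1103, session 56674, participant 002. Rows for PURDU, NYUNI, and GEORG are also repeated in the table above.
We should add duplicate checking to the cleaning code.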
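A duplicate check could look roughly like the sketch below. It assumes that site_id, vol_id, participant_ID, and session_id together should identify one row; `screen_toy`, `dup_keys`, `screen_dedup`, and `n_rows` are hypothetical names, and the toy rows are copied from the table above.

```r
# Toy rows taken from the play_id table; the first two are identical.
screen_toy <- data.frame(
  site_id = c("OHIOS", "OHIOS", "OHIOS", "PURDU"),
  vol_id = c("1103", "1103", "1103", "1363"),
  participant_ID = c("002", "002", "005", "001"),
  session_id = c("56674", "56674", "57182", "56367"),
  stringsAsFactors = FALSE
)

# Count rows per key combination and keep the keys that occur more than once.
dup_keys <- screen_toy |>
  dplyr::count(site_id, vol_id, participant_ID, session_id, name = "n_rows") |>
  dplyr::filter(n_rows > 1)

# Keep the first row for each key combination.
screen_dedup <- screen_toy |>
  dplyr::distinct(site_id, vol_id, participant_ID, session_id, .keep_all = TRUE)
```

Reporting `dup_keys` before dropping anything is the safer order, because rows that share a key may still differ in other columns and would then need manual review rather than `distinct()`.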
There is 1 variable with information about exclusion status:
ex_dups <- stringr::str_detect(names(screen_df), "exclusion")
names(screen_df)[ex_dups]
## [1] "exclusion_reason"
When more than one such variable appears, the extras probably result from a bug in the *_join operation in the cleaning process, and they should be merged.
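As an illustration of such a merge, the sketch below assumes that a join left two suffixed copies named `exclusion_reason.x` and `exclusion_reason.y`. Those column names and the placeholder values are hypothetical; the actual duplicated names are not shown on this page.

```r
# Toy data with two suffixed copies of the same variable (hypothetical names).
screen_toy <- data.frame(
  session_id = c("1", "2", "3"),
  exclusion_reason.x = c("reason_a", NA, NA),
  exclusion_reason.y = c(NA, "reason_b", NA),
  stringsAsFactors = FALSE
)

merged <- screen_toy |>
  # Take the first non-missing value across the two copies.
  dplyr::mutate(
    exclusion_reason = dplyr::coalesce(exclusion_reason.x, exclusion_reason.y)
  ) |>
  dplyr::select(-exclusion_reason.x, -exclusion_reason.y)
```

`dplyr::coalesce()` silently prefers the first copy when both are filled in, so rows where the two copies disagree are worth checking before merging.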
Home visit data
The home_visit_df data have some field names that are inconsistent with the screening/demographic data files.
We reconcile these differences first.
home_df <- home_visit_df |>
dplyr::rename("play_id" = "participant_id") |>
dplyr::rename("participant_ID" = "subject_number")
Since the home_visit_df data have not yet been merged with Databrary information, the vol_id and group_name variables are not available.
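Because vol_id is not yet available here, any comparison with the screening data has to rely on site_id and participant_ID. A minimal sketch of that comparison, using toy data and the hypothetical names `screen_toy`, `home_toy`, `home_only`, and `screen_only`:

```r
# Toy key columns for each source.
screen_toy <- data.frame(
  site_id = c("CSUFL", "CSUFL", "GEORG"),
  participant_ID = c("001", "002", "002"),
  stringsAsFactors = FALSE
)
home_toy <- data.frame(
  site_id = c("CSUFL", "CSUFL"),
  participant_ID = c("001", "006"),
  stringsAsFactors = FALSE
)

keys <- c("site_id", "participant_ID")

# Home visits with no matching screening row.
home_only <- dplyr::anti_join(home_toy, screen_toy, by = keys)

# Screening rows with no matching home visit.
screen_only <- dplyr::anti_join(screen_toy, home_toy, by = keys)
```

Listing both directions separates participants who were screened but not yet visited from home visits whose screening record is missing or mislabeled.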
The following rows have incomplete or missing site_id values:
home_df |>
# is.na() alone is enough: is.null() on an existing column is always FALSE
dplyr::filter(is.na(site_id)) |>
dplyr::select(site_id, participant_ID, play_id) |>
dplyr::arrange(site_id, participant_ID) |>
knitr::kable() |>
kableExtra::kable_classic()
site_id | participant_ID | play_id |
---|---|---|
(No rows returned.)
The following rows have incomplete or missing participant_ID values:
home_df |>
# is.na() alone is enough: is.null() on an existing column is always FALSE
dplyr::filter(is.na(participant_ID)) |>
dplyr::select(site_id, participant_ID, play_id) |>
dplyr::arrange(site_id, participant_ID) |>
knitr::kable() |>
kableExtra::kable_classic()
site_id | participant_ID | play_id |
---|---|---|
(No rows returned.)
The following rows have incomplete or missing play_id values:
home_df |>
# is.na() alone is enough: is.null() on an existing column is always FALSE
dplyr::filter(is.na(play_id)) |>
dplyr::select(site_id, participant_ID, play_id) |>
dplyr::arrange(site_id, participant_ID) |>
knitr::kable() |>
kableExtra::kable_classic()
site_id | participant_ID | play_id |
---|---|---|
(No rows returned.)
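The same missing-value check is repeated for each key column on this page, so it could be wrapped in a small helper. This is a sketch only: `rows_missing()`, `home_toy`, `key_cols`, and `n_missing` are hypothetical names, and the toy values are placeholders.

```r
# Toy data with some missing keys.
home_toy <- data.frame(
  site_id = c("NYUNI", NA, "GEORG"),
  participant_ID = c("001", "002", NA),
  play_id = c(NA, NA, "id_c"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: rows of `df` where the column named `col` is NA.
rows_missing <- function(df, col) {
  dplyr::filter(df, is.na(.data[[col]]))
}

# Number of rows with a missing value, per key column.
key_cols <- c("site_id", "participant_ID", "play_id")
n_missing <- vapply(
  key_cols,
  function(col) nrow(rows_missing(home_toy, col)),
  integer(1)
)
```

A per-column count such as `n_missing` gives a quick overview, and `rows_missing(home_toy, "play_id")` can then be passed to `knitr::kable()` only for the columns that actually have gaps.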